1: \documentclass[11pt,twocolumn]{article}
2: \usepackage{times}
3: \usepackage{mathfont}
4: \usepackage{url}
5: 
6: \def\square{\framebox{ } \smallskip \smallskip}
7: \long\def\omit#1{} 
8: \def\log{\mathop{{\rm log}}}
9: \def\Pr{\mathop{{\rm Pr}}}
10: 
11: % magic to make big-O notation use script font
12: \mathcode`O="724F
13: 
14: \setlength{\textwidth}{6.5in}
15: \setlength{\textheight}{9.2in}
16: \setlength{\topmargin}{-.5in}
17: \setlength{\oddsidemargin}{0in}
18: \setlength{\evensidemargin}{0in}
19: 
20: \pagenumbering{arabic}
21: \newtheorem{theorem}{Theorem}
22: \newtheorem{lemma}[theorem]{Lemma}
23: \newtheorem{corollary}[theorem]{Corollary}
24: \newtheorem{observation}[theorem]{Observation}
25: 
26: \begin{document}
27: \title{Fast Approximation of Centrality}
28: \author{David Eppstein\thanks{Dept. Inf. \& Comp. Sci., 
29: UC Irvine, CA 92697-3425, USA,
30: {\tt\{eppstein,josephw\}@ics.uci.edu}.}
31: \and Joseph Wang$^*$}
32: \date{ }
33: \maketitle
34: 
35: \begin{abstract}
36: Social studies researchers use graphs to model
37: group activities in social networks.
38: An important property in this context is
39: the {\em centrality} of a vertex: the inverse of the
40: average distance to each other vertex.  We describe a
41: randomized approximation algorithm for centrality in weighted
42: graphs.  For graphs exhibiting the small world phenomenon, our
43: method estimates the centrality of all vertices
44: with high probability within a $(1+\epsilon)$ factor in near-linear time.
45: \end{abstract}
46: 
47: \section{Introduction}
48: In social network analysis, the vertices of a graph represent 
49: agents in a group and the edges represent relationships, such
50: as communication or friendship.
51: The idea of applying graph theory to analyze the
52: connection between the structural {\em centrality} 
53: and group process was introduced by Bavelas \cite{Bavelas48}. 
54: Various measures of centrality \cite{Bonacich72,Freeman79,Friedkin91}
55: have been proposed for analyzing
56: communication activity, control, or independence within 
57: a social network.
58: 
59: We are particularly interested in {\em closeness centrality}
60: \cite{Bavelas50,Beauchamp65,Sabidussi66}, which is used to
61: measure the independence and efficiency of an agent 
62: \cite{Freeman79,Friedkin91}. Beauchamp~\cite{Beauchamp65} defined 
63: the closeness centrality of agent $a_j$ as
64: $${n - 1} \over {\sum_{i = 1}^{n} d(i, j)}$$
65: where $d(i, j)$ is the
66: distance between agents $i$ and~$j$.\footnote{This
67: should be distinguished from another common concept of graph centrality,
68: in which the most central vertices minimize the maximum
69: distance to another vertex.}
70: We
71: are interested in computing centrality  values for all agents.
72: To compute the centrality for each agent, 
73: it is sufficient to solve the all-pairs shortest-paths (APSP) 
74: problem. No faster exact method is known.
75: 
76: The APSP problem can be solved by various algorithms
77: in time $O(nm + n^2 \log n)$ \cite{FredmanTarjan87,Johnson77}, $O(n^3)$
78: \cite{Floyd62}, or more quickly using fast matrix multiplication
79: techniques \cite{AGM97,CoppWin90,Seidel95,Yuval76}.
80: \omit{Several researchers have developed more efficient algorithms for
81: special graph classes such as interval graphs 
82: \cite{ACL93,CLSS98,RMPR92} and chordal graphs \cite{BCD94,HSS97}.
83: The APSP problem
84: can be solved in average-case in time $O(n^2 \log n)$
85: for various classes of random graphs  
86: \cite{CFMP97,FriezeGrimmett85,MehlhornPriebe95,MoffatTakaoka85}.}
87: Because these results are slow
88: or (with fast matrix multiplication) complicated and impractical,
89: and because recent applications of social network theory to the internet
90: may involve graphs with millions of vertices, it is of interest to
91: consider faster approximations. Aingworth et al. \cite{ACIM99} proposed 
92: an algorithm with an additive error of $2$
93: for the unweighted APSP problem
94: that runs in time $O(n^{2.5}\sqrt{\log n})$.
95: However, this is still slow and does not provide a good approximation
96: when the distances are small.
97: 
98: In this paper, we consider a method for fast approximation of centrality.  
99: We apply a random sampling technique to approximate the 
100: inverse centrality of all vertices in a weighted graph to within an
101: additive error of $\epsilon \Delta$ with high probability
102: in time $O({\log n \over \epsilon^2} (n \log n + m))$, where
103: $\epsilon$ is any fixed constant and $\Delta$ is the diameter of the
104: graph.
105: 
106: It has been observed empirically that many social networks exhibit the
107: {\em small world phenomenon} \cite{Milgram67}: their diameter is bounded
108: by a constant, or, equivalently, the ratio between the minimum and
109: maximum distance is bounded.  For such networks, the inverse centrality
110: at any vertex is $\Omega(\Delta)$ and our method provides a near-linear
111: time $(1+\epsilon)$-approximation to the centrality of all vertices.
112: 
113:                                                       
114: \omit{\section{Preliminaries} 
115: We are given a graph $G(V, E)$ with $n$ vertices
116: and $m$ edges. The distance $d(u, v)$ between
117: two vertices $u$ and $v$ is the length of the shortest path
118: between them. The diameter $\Delta$ of a graph $G$
119: is defined as $max_{u, v \in V} d(u, v)$. For
120: simplicity, we define centrality $c_v$ for vertex $v$
121: as ${n - 1} \over {\sum_{u \in V} d(u, v)}$. If $G$ is not
122: connected, then $c_v = \infty$. Hence we will assume $G$ is connected.}
123: 
124: \omit{Given an optimization problem $P$. Let $value(OPT)$ denote
125: the optimal solution for a problem instance in $P$.
126: Let $value(A)$ denote the solution computed by
127: an approximation algorithm $A$.
128: $A$ is said to have constant additive approximation
129: error $c$ if $|value(A) -  value(OPT)| \le c$ for every 
130: problem instance in $P$. }
131: 
132: % too long for two column mode
133: %\section{Randomized Approximation Algorithm}
134: \section{The Algorithm}
135: 
136: We now describe a randomized 
137: approximation algorithm RAND for estimating centrality.
138: RAND randomly chooses $k$ sample vertices and computes
139: single-source shortest-paths (SSSP) from each sample vertex to all
140: other vertices. The estimated centrality of a vertex is
141: defined in terms of the average distance to the sample vertices.
142:  
143: 
144: \vfil\eject
145: 
146: \noindent {\bf Algorithm RAND:}
147: \begin{enumerate}
148: \item Let $k$ be the number of iterations needed to
149: obtain the desired error bound. 
150: \item In iteration $i$, pick 
151: vertex $v_i$ uniformly at random from $G$ and solve the SSSP problem
152: with
153: $v_i$ as the source.
154: \item Let
155: 	$$\hat{c}_u = 1/\sum_{i = 1}^{k}\frac{n\,d(v_i, u)}{k(n-1)}$$
156: be the centrality estimator for vertex $u$.
157: \end{enumerate}
158: 
159: \smallskip
160: It is not hard to see that, for any $k$ and $u$,
161: the expected value of $1/\hat{c}_u$ is equal to $1/c_u$.  
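As a sketch of why, assume (as in step 2) that each $v_i$ is drawn
uniformly from all $n$ vertices of $G$, so that
$E[d(v_i,u)] = \frac{1}{n}\sum_{j=1}^{n} d(j,u)$ in the notation of
the introduction. Linearity of expectation then gives
\begin{eqnarray*}
E\Bigl[\frac{1}{\hat{c}_u}\Bigr]
& = & \sum_{i=1}^{k} \frac{n}{k(n-1)}\, E[d(v_i,u)] \\
& = & \frac{n}{n-1} \cdot \frac{1}{n} \sum_{j=1}^{n} d(j,u) \\
& = & \frac{\sum_{j=1}^{n} d(j,u)}{n-1} \;=\; \frac{1}{c_u}.
\end{eqnarray*}
The factor $n/(n-1)$ compensates for the possibility that $u$ itself
is sampled, in which case it contributes $d(u,u)=0$.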
162: 
163: \omit{
164: PROOF NEEDS FIXING TO ACCOUNT FOR N/(N-1) FACTOR!
165: \begin{theorem}
166: $E[1/\hat{c}_u] = 1/c_u$.
167: \end{theorem}
168: 
169: {\bf Proof:} 
170: Each vertex has equal probability of $1/n$ to be picked at each
171: round. The expected value for $1 \over \hat{c}_u$ is
172: \begin{eqnarray*}
173: E[1 \over {\hat{c}_u}] & = & {n \over n - 1} 1/n^{k} \cdot {{kn^{k - 1}
174: \sum_{i = 1}^{n} d(i, u)} \over k} \\
175: & = & {n \over n - 1} {{\sum_{i = 1}^{n} d(i, u)} \over n}  \\
176: & = & {1 \over c_u}.\end{eqnarray*}         
177: \square
178: \bigskip
179: }
180: 
181: \omit{In 1963, Hoeffding \cite{Hoeffding63} gave the following theorem 
182: on probability bounds for sums of independent random variables.}
183: 
184: \begin{lemma}[Hoeffding 
185: \cite{Hoeffding63}]
186: If $x_1, x_2, \ldots, x_{k}$ are independent,
187: $a_i \le x_i \le b_i$,
188: and $\mu = E[\sum x_i/k]$ is the expected mean,
189: then for $\xi > 0$
190: $$\Pr\Bigl\{ |{\sum_{i = 1}^{k} x_i \over k} - \mu| \ge \xi \Bigr\}
191: \le 2 e^{-2{k}^2 {\xi}^2/\sum_{i = 1}^{k}(b_i - a_i)^2}.$$
192: \end{lemma}
193: \bigskip
194: 
195: 
196: We need to bound the probability that the error in estimating
197: the inverse centrality of any vertex $u$ exceeds $\xi$.
198: This is done by applying Hoeffding's bound with
199: $x_i = \frac{n\,d(v_i, u)}{n-1}$,
200: $\mu = \frac{1}{c_u}$,
201: $a_i=0$, and $b_i=\frac{n\Delta}{n-1}$. 
202: \omit{
203: % I just put the factor of two directly into the lemma
204: We know $E[1/\hat{c}_u] = 1/c_u$.
205: To take care of the case in
206: which $\hat{c}_u$ is smaller than $c_u$, we multiply
207: the above inequality by $2$.
208: }
209: Thus the probability that 
210: the difference between the estimated inverse centrality
211: $1/\hat{c}_u$ and the actual inverse centrality $1/c_u$ is more than $\xi$
212: is
213: \begin{eqnarray*}
214: \Pr\left\{ {\textstyle |\frac{1}{\hat{c}_u} - \frac{1}{c_u}|}
215: \ge \xi \right\}  
216: & \le &
217: 2 \cdot e^{-2{k}^2 {\xi}^2/\sum_{i = 1}^{k}(b_i - a_i)^2} \\
218: & = & 2 \cdot e^{-2{k}^2 {\xi}^2/(k (\frac{n\Delta}{n-1})^2)} \\
219: & = & 2 \cdot e^{-\Omega(k\xi^2/\Delta^2)}
220: \end{eqnarray*}
221: For $\xi = \epsilon\Delta$, using $\Theta(\frac{\log n}{\epsilon^2})$
222: samples bounds the probability of error greater than $\epsilon\Delta$ at any single vertex
223: by, e.g., $1/n^2$. By the union bound over the $n$ vertices, this gives at most $1/n$ probability of
224: having greater than $\epsilon\Delta$ error anywhere in the graph.
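To make the constant in $\Theta(\frac{\log n}{\epsilon^2})$ concrete,
here is one possible calculation; the specific constants are
illustrative only and have not been optimized.
For $n \ge 2$ we have $\frac{n-1}{n} \ge \frac{1}{2}$, so with
$\xi = \epsilon\Delta$ the bound above is at most
$$2 e^{-2k\epsilon^2 (n-1)^2/n^2} \le 2 e^{-k\epsilon^2/2},$$
which is at most $1/n^2$ whenever
$$k \ge \frac{2\ln(2n^2)}{\epsilon^2}
= \frac{4\ln n + 2\ln 2}{\epsilon^2}.$$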
225: 
226: \omit{
227: Fredman and Tarjan \cite{FredmanTarjan87} gave an
228: algorithm for solving the $SSSP$ problem in time $O(n \log n + m)$.
229: Thus }
230: The total running time of the algorithm is
231: $O(k \cdot m)$ for unweighted graphs and $O(k (n \log n + m))$
232: for weighted graphs.
233: Thus, for $k = \Theta(\frac{\log n}{\epsilon^2})$,
234: we have an $O({\log n \over \epsilon^2} (n \log n + m))$ algorithm
235: for approximating inverse centrality within an additive
236: error of $\epsilon \Delta$ with high probability.
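For small world graphs this additive guarantee can be turned into a
multiplicative one. As a sketch, suppose that
$1/c_u \ge \gamma\Delta$ for every vertex $u$, where the constant
$\gamma > 0$ is our own notation for the constant hidden in the
$\Omega(\Delta)$ bound of the introduction, and suppose that the
algorithm is run with error parameter $\epsilon'$, so that
$|1/\hat{c}_u - 1/c_u| \le \epsilon'\Delta$. Writing
$\delta = \epsilon'/\gamma$ and assuming $\delta < 1$, we get
$$(1-\delta)\frac{1}{c_u} \le \frac{1}{\hat{c}_u}
\le (1+\delta)\frac{1}{c_u},$$
and therefore
$$\frac{c_u}{1+\delta} \le \hat{c}_u \le \frac{c_u}{1-\delta}.$$
Choosing $\epsilon' = \gamma\epsilon/(1+\epsilon)$ gives
$\delta = \epsilon/(1+\epsilon)$, hence $1/(1-\delta) = 1+\epsilon$ and
$1+\delta \le 1+\epsilon$, so $\hat{c}_u$ is within a $(1+\epsilon)$
factor of $c_u$. The number of samples $k$ grows only by the constant
factor $((1+\epsilon)/\gamma)^2$.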
237: 
238: 
239: \omit{\section{Conclusion}
240: We gave an $O({\log n \over \epsilon^2} (n \log n + m))$ 
241: randomized algorithm with additive error of $\epsilon \Delta$
242: for weighted graphs. Many graph classes such as paths, cycles,
243: and balanced trees, have centrality proportional to
244: $\Delta$. More interestingly, Milgram \cite{Milgram67} showed that 
245: many social networks have bounded diameter and centrality.
246: When the centrality is proportional to $\Delta$,
247: we have an $(1 + \epsilon)$-approximation algorithm. }
248: 
249: \small
250: \paragraph{Acknowledgements.}
251: We thank Dave Goggin for bringing this problem to our attention,
252: and Lin Freeman for helpful comments on a draft of this paper.
253: 
254: \bibliographystyle{nomonths}
255: \let\oldbib\thebibliography
256: \def\thebibliography#1{\oldbib{#1}\itemsep 0pt}
257: \bibliography{bibdata}
258: \end{document}
259: