cond-mat0408187/pre.tex
1: \documentclass[twocolumn,pre,superscriptaddress]{revtex4}
2: 
3: % Required packages
4: \usepackage{dcolumn}
5: \usepackage{amsmath}
6: 
7: % Optional extra packages
8: \usepackage{graphicx}
9: 
10: % Hyphenation
11: \hyphenation{}
12: 
13: \begin{document}
14: 
15: % Macros
16: \renewcommand{\d}{d}
17: \newcommand{\Ord}{\mathrm{O}}
18: \newcommand{\eref}[1]{(\ref{#1})}
19: \newcommand{\etal}{{\it{}et~al.}}
20: \newcommand{\defn}{\textit}
21: \newcommand{\half}{\mbox{$\frac12$}}
22: 
23: % Style parameters
24: \newlength{\figurewidth}
25: \setlength{\figurewidth}{0.95\columnwidth}
26: \setlength{\parskip}{0pt}
27: \setlength{\tabcolsep}{6pt}
28: \setlength{\arraycolsep}{2pt}
29: 
30: 
31: \title{Finding community structure in very large networks}
32: \author{Aaron Clauset}
33: \affiliation{Department of Computer Science, University of New Mexico,
34: Albuquerque, NM 87131}
35: \author{M. E. J. Newman}
36: \affiliation{Department of Physics and Center for the Study of Complex
37: Systems,\\
38: University of Michigan, Ann Arbor, MI 48109}
39: \author{Cristopher Moore}
40: \affiliation{Department of Computer Science, University of New Mexico,
41: Albuquerque, NM 87131}
42: \affiliation{Department of Physics and Astronomy, University of New Mexico,
43: Albuquerque, NM 87131}
44: 
45: \begin{abstract}
46: The discovery and analysis of community structure in networks is a topic of
47: considerable recent interest within the physics community, but most methods
48: proposed so far are unsuitable for very large networks because of their
49: computational cost.  Here we present a hierarchical agglomeration algorithm
50: for detecting community structure which is faster than many competing
51: algorithms: its running time on a network with $n$ vertices and $m$ edges
52: is $\Ord(m d \log n)$ where $d$ is the depth of the dendrogram describing
53: the community structure.  Many real-world networks are sparse and
54: hierarchical, with $m \sim n$ and $d \sim \log n$, in which case our
55: algorithm runs in essentially linear time, $\Ord(n \log^2 n)$.  As an
56: example of the application of this algorithm we use it to analyze a network
57: of items for sale on the web site of a large online retailer, items in the
58: network being linked if they are frequently purchased by the same buyer.
59: The network has more than $400\,000$ vertices and 2 million edges.  We show
60: that our algorithm can extract meaningful communities from this network,
61: revealing large-scale patterns present in the purchasing habits of
62: customers.
63: \end{abstract}
64: \maketitle
65: 
66: 
67: % ---------- Introduction and Background ----------
68: \section{Introduction}
69: Many systems of current interest to the scientific community can usefully
70: be represented as networks~\cite{Strogatz01,AB02,DM02,Newman03d}.  Examples
71: include the Internet~\cite{FFF99} and the world-wide
72: web~\cite{AJB99,Kleinberg99b}, social networks~\cite{WF94}, citation
73: networks~\cite{Price65,Redner98}, food webs~\cite{DWM02a}, and biochemical
74: networks~\cite{Kauffman69,Ito01}.  Each of these networks consists of a set
75: of nodes or \defn{vertices} representing, for instance, computers or
76: routers on the Internet or people in a social network, connected together
77: by links or \defn{edges}, representing data connections between computers,
78: friendships between people, and so forth.
79: 
80: One network feature that has been emphasized in recent work is
81: \defn{community structure}, the gathering of vertices into groups such that
82: there is a higher density of edges within groups than between
83: them~\cite{note}.  The problem of detecting such communities within
84: networks has been well studied.  Early approaches such as the
85: Kernighan--Lin algorithm~\cite{KL70}, spectral
86: partitioning~\cite{Fiedler73,PSL90}, or hierarchical
87: clustering~\cite{Scott00} work well for specific types of problems
88: (particularly graph bisection or problems with well defined vertex
89: similarity measures), but perform poorly in more general
90: cases~\cite{Newman04b}.
91: 
92: To combat this problem a number of new algorithms have been proposed in
93: recent years.  Girvan and Newman~\cite{GN02,NG04} proposed a divisive
94: algorithm that uses edge betweenness as a metric to identify the boundaries
95: of communities.  This algorithm has been applied successfully to a variety
96: of networks, including networks of email messages, human and animal social
97: networks, networks of collaborations between scientists and musicians,
98: metabolic networks and gene
99: networks~\cite{GN02,GN04,Guimera03,HHJ03,HH03,TWH03,GD03,BPDA04,WH04b,Arenas04}.
100: However, as noted in~\cite{NG04}, the algorithm makes heavy demands on
101: computational resources, running in $\Ord(m^2 n)$ time on an arbitrary
102: network with $m$ edges and $n$ vertices, or $\Ord(n^3)$ time on a sparse
103: graph (one in which $m\sim n$, which covers most real-world networks of
104: interest).  This restricts the algorithm's use to networks of at most a few
105: thousand vertices with current hardware.
106: 
107: More recently a number of faster algorithms have been
108: proposed~\cite{Radicchi04,Newman04a,WH04a}.  In~\cite{Newman04a}, one of us
109: proposed an algorithm based on the greedy optimization of the quantity
110: known as \defn{modularity}~\cite{NG04}.  This method appears to work well
111: both in contrived test cases and in real-world situations, and is
112: substantially faster than the algorithm of Girvan and Newman.  A naive
113: implementation runs in time $\Ord((m+n)n)$, or $\Ord(n^{2})$ on a sparse
114: graph.
115: 
116: Here we propose a new algorithm that performs the same greedy optimization
117: as the algorithm of~\cite{Newman04a} and therefore gives identical results
118: for the communities found.  However, by exploiting some shortcuts in the
119: optimization problem and using more sophisticated data structures, it runs
120: far more quickly, in time $\Ord(m d \log n)$ where $d$ is the depth of the
121: ``dendrogram'' describing the network's community structure.  Many
122: real-world networks are sparse, so that $m \sim n$; and moreover, for
123: networks that have a hierarchical structure with communities at many
124: scales, $d \sim \log n$.  For such networks our algorithm has essentially
125: linear running time, $\Ord(n \log^2 n)$.
126: 
127: This is not merely a technical advance but has substantial practical
128: implications, bringing within reach the analysis of extremely large
129: networks.  Networks of ten million vertices or more should be possible in
130: reasonable run times.  As an example, we give results from the application
131: of the algorithm to a recommender network of books from the online
132: bookseller Amazon.com, which has more than $400\,000$ vertices and two
133: million edges.
134: 
135: 
136: % ---------- Description of the Algorithm and its Complexity ----------
137: \section{The algorithm}
138: \defn{Modularity}~\cite{NG04} is a property of a network and a specific
139: proposed division of that network into communities.  It measures whether the
140: division is a good one, in the sense that there are many edges within
141: communities and only a few between them.  Let $A_{vw}$ be an element of the
142: adjacency matrix of the network thus:
143: \begin{equation}
144: A_{vw} = \biggl\lbrace\begin{array}{ll}
145:            1 & \quad\mbox{if vertices $v$ and $w$ are connected,}\\
146:            0 & \quad\mbox{otherwise,}
147:          \end{array}
148: \end{equation}
149: and suppose the vertices are divided into communities such that vertex~$v$
150: belongs to community~$c_v$.  Then the fraction of edges that fall within
151: communities, i.e.,~that connect vertices that both lie in the same
152: community, is
153: \begin{equation}
154: {\sum_{vw} A_{vw} \delta(c_v,c_w)\over\sum_{vw} A_{vw}}
155:   = {1\over2m} \sum_{vw} A_{vw} \delta(c_v,c_w),
156: \end{equation}
157: where the $\delta$-function $\delta(i,j)$ is 1 if $i=j$ and 0 otherwise,
158: and $m=\half\sum_{vw} A_{vw}$ is the number of edges in the graph.  This
159: quantity will be large for good divisions of the network, in the sense of
160: having many within-community edges, but it is not, on its own, a good
161: measure of community structure since it takes its largest value of~1 in the
162: trivial case where all vertices belong to a single community.  However, if
163: we subtract from it the expected value of the same quantity in the case of
164: a randomized network, we do get a useful measure.
165: 
166: The \defn{degree}~$k_v$ of a vertex~$v$ is defined to be the number of
167: edges incident upon it:
168: \begin{equation}
169: k_v = \sum_w A_{vw}.
170: \end{equation}
171: The probability of an edge existing between vertices $v$ and $w$ if
172: connections are made at random but respecting vertex degrees is $k_v
173: k_w/2m$.  We define the modularity~$Q$ to be
174: \begin{equation}
175: Q = {1\over2m} \sum_{vw} \biggl[ A_{vw} - {k_v k_w\over2m} \biggr]
176:     \delta(c_v,c_w).
177: \label{modularity}
178: \end{equation}
179: If the fraction of within-community edges is no different from what we
180: would expect for the randomized network, then this quantity will be zero.
181: Nonzero values represent deviations from randomness, and in practice it is
182: found that a value above about 0.3 is a good indicator of significant
183: community structure in a network.
184: 
185: If high values of the modularity correspond to good divisions of a network
186: into communities, then one should be able to find such good divisions by
187: searching through the possible candidates for ones with high modularity.
188: While finding the global maximum modularity over all possible divisions
189: seems hard in general, reasonably good solutions can be found with
190: approximate optimization techniques.  The algorithm proposed
191: in~\cite{Newman04a} uses a greedy optimization in which, starting with each
192: vertex being the sole member of a community of one, we repeatedly join
193: together the two communities whose amalgamation produces the largest
194: increase in~$Q$.  For a network of $n$ vertices, after $n-1$ such joins we
195: are left with a single community and the algorithm stops.  The entire
196: process can be represented as a tree whose leaves are the vertices of the
197: original network and whose internal nodes correspond to the joins.  This
198: \defn{dendrogram} represents a hierarchical decomposition of the network
199: into communities at all levels.
200: 
201: The most straightforward implementation of this idea (and the only one
202: considered in~\cite{Newman04a}) involves storing the adjacency matrix of
203: the graph as an array of integers and repeatedly merging pairs of rows and
204: columns as the corresponding communities are merged.  For the case of the
205: sparse graphs that are of primary interest in the field, however, this
206: approach wastes a good deal of time and memory space on the storage and
207: merging of matrix elements with value~0, which is the vast majority of the
208: adjacency matrix.  The algorithm proposed in this paper achieves speed (and
209: memory efficiency) by eliminating these needless operations.
210: 
211: To simplify the description of our algorithm let us define the following
212: two quantities:
213: \begin{equation}
214: e_{ij} = {1\over2m} \sum_{vw} A_{vw} \delta(c_v,i) \delta(c_w,j),
215: \end{equation}
216: which is the fraction of edges that join vertices in community~$i$ to
217: vertices in community~$j$, and
218: \begin{equation}
219: a_i = {1\over2m} \sum_v k_v\delta(c_v,i),
220: \end{equation}
221: which is the fraction of ends of edges that are attached to vertices in
222: community~$i$.  Then, writing
223: $\delta(c_v,c_w)=\sum_i \delta(c_v,i) \delta(c_w,i)$, we have, from
224: Eq.~\eref{modularity},
225: \begin{eqnarray}
226: Q &=& {1\over2m} \sum_{vw} \biggl[ A_{vw} - {k_v k_w\over2m} \biggr]
227:       \sum_i\delta(c_v,i)\delta(c_w,i)\nonumber\\
228:   &=& \sum_i \biggl[ {1\over2m} \sum_{vw} A_{vw} \,\delta(c_v,i)\delta(c_w,i)
229:         \nonumber\\
230:   & & \qquad {} - {1\over2m}\sum_v k_v \,\delta(c_v,i)
231:            {1\over2m}\sum_w k_w\delta(c_w,i) \biggr]\nonumber\\
232:   &=& \sum_i (e_{ii} - a_i^2).
233: \end{eqnarray}
234: %When two communities $i$ and $j$ are joined, all edges that previously ran
235: %between the two now become within-community edges, and hence the modularity
236: %changes by an amount $\Delta Q_{ij} = 2(e_{ij} - a_i a_j)$.  At each step,
237: %the algorithm joins the pair of communities $i,j$ with the largest~$\Delta
238: %Q_{ij}$.
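
As a small illustration of this expression (a hypothetical example, not
taken from the data analyzed below), consider a network of six vertices
consisting of two triangles joined by a single edge, so that $m=7$, and
suppose that each triangle is taken to be a community.  Each community
contains three edges, each of which is counted twice in the sum over $v,w$,
and the degrees of its vertices are $2$, $2$ and~$3$, so that
\begin{eqnarray*}
e_{11} = e_{22} &=& {2\times3\over2\times7} = {3\over7},\\
a_1 = a_2 &=& {2+2+3\over2\times7} = {1\over2},\\
Q &=& 2\,\biggl[ {3\over7} - {1\over4} \biggr] = {5\over14} \simeq 0.36.
\end{eqnarray*}
Placing all six vertices in a single community instead gives
$e_{11}=a_1=1$ and hence $Q=0$.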
239: 
240: The operation of the algorithm involves finding the changes in $Q$ that
241: would result from the amalgamation of each pair of communities, choosing
242: the largest of them, and performing the corresponding amalgamation.  One
243: way to envisage (and implement) this process is to think of the network as a
244: multigraph, in which a whole community is represented by a vertex, bundles
245: of edges connect one vertex to another, and edges internal to communities
246: are represented by self-edges.  The adjacency matrix of this multigraph has
247: elements $A'_{ij} = 2m e_{ij}$, and the joining of two communities $i$ and
248: $j$ corresponds to replacing the $i$th and $j$th rows and columns by their
249: sum.  In the algorithm of~\cite{Newman04a} this operation is done
250: explicitly on the entire matrix, but if the adjacency matrix is sparse
251: (which we expect in the early stages of the process) the operation can be
252: carried out more efficiently using data structures for sparse matrices.
253: Unfortunately, calculating the change $\Delta Q_{ij}$ in $Q$ for each pair and
254: finding the pair $i,j$ with the largest $\Delta Q_{ij}$ then becomes time-consuming.
255: 
256: In our new algorithm, rather than maintaining the adjacency matrix and
257: calculating~$\Delta Q_{ij}$, we instead maintain and update a matrix of
258: values of $\Delta Q_{ij}$.  Since joining two communities with no edge
259: between them can never produce an increase in~$Q$, we need only store
260: $\Delta Q_{ij}$ for those pairs $i,j$ that are joined by one or more edges.
261: Since this matrix has the same support as the adjacency matrix, it will be
262: similarly sparse, so we can again represent it with efficient data
263: structures.  In addition, we make use of an efficient data structure to
264: keep track of the largest $\Delta Q_{ij}$.  These improvements result in a
265: considerable saving of both memory and time.
266: 
267: In total, we maintain three data structures:
268: \begin{enumerate}
269: \item A sparse matrix containing $\Delta Q_{ij}$ for each pair $i,j$ of
270: communities with at least one edge between them.  We store each row of the
271: matrix both as a balanced binary tree (so that elements can be found or
272: inserted in $O(\log n)$ time) and as a max-heap (so that the largest
273: element can be found in constant time).
274: \item A max-heap $H$ containing the largest element of each row of the
275: matrix $\Delta Q_{ij}$ along with the labels $i,j$ of the corresponding
276: pair of communities.
277: \item An ordinary vector array with elements~$a_i$.
278: \end{enumerate}
279: 
280: As described above we start off with each vertex being the sole member of a
281: community of one, in which case $e_{ij}=1/2m$ if $i$ and $j$ are connected
282: and zero otherwise, and $a_i=k_i/2m$.  Thus we initially set
283: \begin{equation}
284: \label{eq:qinit}
285: \Delta Q_{ij} = \biggl\lbrace\begin{array}{ll}
286:                   2\bigl[1/2m - k_i k_j/(2m)^2\bigr] &
287:                     \quad\mbox{if $i,j$ are connected,}\\
288:                   0 & \quad\mbox{otherwise,}
289:                 \end{array}
290: \end{equation}
291: and
292: \begin{equation}
293: \label{eq:ainit} 
294: a_i = \frac{k_i}{2m} 
295: \end{equation}
296: for each~$i$.  (This assumes the graph is unweighted; weighted graphs are a
297: simple generalization~\cite{Newman05a}.)
298: 
299: Our algorithm can now be defined as follows.
300: \begin{enumerate}
301: \item Calculate the initial values of $\Delta Q_{ij}$ and $a_i$ according
302: to~\eref{eq:qinit} and~\eref{eq:ainit}, and populate the max-heap with the
303: largest element of each row of the matrix $\Delta Q$.
304: \item Select the largest $\Delta Q_{ij}$ from $H$, join the corresponding
305: communities, update the matrix $\Delta Q$, the heap $H$ and $a_i$ (as
306: described below) and increment $Q$ by $\Delta Q_{ij}$.
307: \item Repeat step 2 until only one community remains.
308: \end{enumerate}
309: 
310: Our data structures allow us to carry out the updates in step 2 quickly.
311: First, note that we need only adjust a few of the elements of $\Delta Q$.
312: If we join communities $i$ and~$j$, labeling the combined community~$j$,
313: say, we need only update the $j$th row and column, and remove the $i$th row
314: and column altogether.  The update rules are as follows.
315: \begin{subequations}
316: If community $k$ is connected to both $i$ and $j$, then
317: \begin{equation}
318: \label{eq:both}
319: \Delta Q'_{jk} = \Delta Q_{ik} + \Delta Q_{jk}.
320: \end{equation}
321: If $k$ is connected to $i$ but not to~$j$, then
322: \begin{equation}
323: \label{eq:justi}
324: \Delta Q'_{jk} = \Delta Q_{ik} - 2 a_j a_k.
325: \end{equation}
326: If $k$ is connected to $j$ but not to $i$, then
327: \begin{equation}
328: \label{eq:justj}
329: \Delta Q'_{jk} = \Delta Q_{jk} - 2 a_i a_k.
330: \end{equation}
331: \end{subequations}
332: Note that these equations imply that $Q$ has a single peak over the course of the algorithm, since after the largest $\Delta Q$ becomes negative all the $\Delta Q$ can only decrease.
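
These rules can be checked directly against the expression
$Q=\sum_i (e_{ii}-a_i^2)$.  The following is a sketch of that check, under
the assumption that $\Delta Q_{ij}$ denotes the full change in $Q$ on
joining $i$ and~$j$.  The join replaces $e_{ii}+e_{jj}$ by
$e_{ii}+e_{jj}+e_{ij}+e_{ji}$ and $a_i^2+a_j^2$ by $(a_i+a_j)^2$, so that,
since $e_{ij}=e_{ji}$,
\[
\Delta Q_{ij} = 2(e_{ij} - a_i a_j).
\]
After the join the combined community has $e'_{jk} = e_{ik}+e_{jk}$ and
$a'_j = a_i+a_j$, and hence
\begin{eqnarray*}
\Delta Q'_{jk} &=& 2\bigl[ e_{ik} + e_{jk} - (a_i+a_j) a_k \bigr] \\
               &=& 2(e_{ik} - a_i a_k) + 2(e_{jk} - a_j a_k).
\end{eqnarray*}
If $k$ is connected to both $i$ and~$j$, the two terms are $\Delta Q_{ik}$
and $\Delta Q_{jk}$, giving Eq.~\eref{eq:both}.  If $k$ is not connected
to~$j$ then $e_{jk}=0$ and the second term reduces to $-2a_ja_k$, giving
Eq.~\eref{eq:justi}, and similarly for Eq.~\eref{eq:justj}.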
333: 
334: To analyze how long the algorithm takes using our data structures, let us
335: denote the degrees of $i$ and $j$ in the reduced graph---i.e.,~the numbers
336: of neighboring communities---as $|i|$ and $|j|$ respectively.  The first
337: operation in a step of the algorithm is to update the $j$th row.  To
338: implement Eq.~\eref{eq:both}, we insert the elements of the $i$th row into
339: the $j$th row, summing them wherever an element exists in both rows.
340: Since we store the rows as balanced binary trees, each of these $|i|$
341: insertions takes $O(\log |j|) \le O(\log n)$ time.  We then update the
342: other elements of the $j$th row, of which there are at most $|i|+|j|$,
343: according to Eqs.~\eref{eq:justi} and~\eref{eq:justj}.  In the $k$th row,
344: we update a single element, taking $O(\log |k|) \le O(\log n)$ time, and
345: there are at most $|i|+|j|$ values of $k$ for which we have to do this.
346: All of this thus takes $O((|i|+|j|) \log n)$ time.
347: 
348: We also have to update the max-heaps for each row and the overall max-heap
349: $H$.  Reforming the max-heap corresponding to the $j$th row can be done in
350: $O(|j|)$ time~\cite{CLRS01}.  Updating the max-heap for the $k$th row by
351: inserting, raising, or lowering $\Delta Q_{kj}$ takes $O(\log |k|) \le
352: O(\log n)$ time.  Since we have changed the maximum element on at most
353: $|i|+|j|$ rows, we need to do at most $|i|+|j|$ updates of $H$, each of
354: which takes $O(\log n)$ time, for a total of $O((|i|+|j|) \log n)$.
355: 
356: Finally, the update $a'_j = a_j + a_i$ (and $a_i = 0$) is trivial and can
357: be done in constant time.
358: 
359: Since each join takes $O((|i|+|j|) \log n)$ time, the total running time is
360: at most $O(\log n)$ times the sum over all nodes of the dendrogram of the
361: degrees of the corresponding communities.  Let us make the worst-case
362: assumption that the degree of a community is the sum of the degrees of all
363: the vertices in the original network comprising it.  In that case, each
364: vertex of the original network contributes its degree to all of the
365: communities it is a part of, along the path in the dendrogram from it to
366: the root.  If the dendrogram has depth~$d$, there are at most $d$ nodes in
367: this path, and since the total degree of all the vertices is~$2m$, we have
368: a running time of $O(m d \log n)$ as stated.
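
Written out, the argument above amounts to the following chain of bounds,
where the sum runs over the joins performed by the algorithm and $d_v\le d$
denotes the number of dendrogram nodes on the path from vertex~$v$ to the
root ($d_v$ is a symbol introduced here purely for illustration):
\begin{eqnarray*}
\sum_{\mathrm{joins}} (|i|+|j|) \log n
  &\le& \log n \sum_v k_v d_v \\
  &\le& d \log n \sum_v k_v = 2 m d \log n.
\end{eqnarray*}
For a sparse, hierarchical network with $m\sim n$ and $d\sim\log n$ this
gives the $O(n \log^2 n)$ behavior quoted in the introduction.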
369: 
370: We note that, if the dendrogram is unbalanced, some time savings can be
371: gained by inserting the sparser row into the less sparse one.  In addition,
372: we have found that in practical situations it is usually unnecessary to
373: maintain the separate max-heaps for each row.  These heaps are used to find
374: the largest element in a row quickly, but their maintenance takes a
375: moderate amount of effort and this effort is wasted if the largest element
376: in a row does not change when two rows are amalgamated, which turns out
377: often to be the case.  Thus we find that the following simpler
378: implementation works quite well in realistic situations: if the largest
379: element of the $k$th row was $\Delta Q_{ki}$ or $\Delta Q_{kj}$ and is now
380: reduced by Eq.~\eref{eq:justi} or~\eref{eq:justj}, we simply scan the $k$th
381: row to find the new largest element. Although the worst-case running time
382: of this approach has an additional factor of~$n$, the average-case running
383: time is often better than that of the more sophisticated algorithm. It should be noted 
384: that the hierarchies generated by these two versions of our algorithm will 
385: differ slightly as a result of the differences in how ties are broken for the 
386: maximum element in a row. However, we find that in practice these differences 
387: do not cause significant deviations in the modularity, the community 
388: size distribution, or the composition of the largest communities.
389: 
390: 
391: 
392: % modularity as a function of time
393: \begin{figure}[t]
394: \begin{center}
395: \includegraphics[scale=0.45]{fc_amazon0308_Q.eps}
396: \end{center}
397: \caption{The modularity $Q$ over the course of the algorithm (the $x$ axis
398: shows the number of joins). Its maximum value is $Q=0.745$, where the
399: partition consists of $1684$ communities.}
400: \label{fig:Q}
401: \end{figure}
402: % ------------------------------
403: 
404: \section{Amazon.com purchasing network}
405: The output of the algorithm described above is precisely the same as that
406: of the slower hierarchical algorithm of~\cite{Newman04a}.  The much
407: improved speed of our algorithm however makes possible studies of very
408: large networks for which previous methods were too slow to produce useful
409: results.  Here we give one example, the analysis of a co-purchasing or
410: ``recommender'' network from the online vendor Amazon.com.  Amazon sells a
411: variety of products, particularly books and music, and as part of their web
412: sales operation they list for each item~A the ten other items most
413: frequently purchased by buyers of~A.  This information can be represented
414: as a directed network in which vertices represent items and there is an edge
415: from item~A to another item~B if B was frequently purchased by buyers of~A.
416: In our study we have ignored the directed nature of the network (as is
417: common in community structure calculations), assuming any link between two
418: items, regardless of direction, to be an indication of their similarity.
419: The network we study consists of items listed on the Amazon web site in
420: August 2003.  We concentrate on the largest component of the network, which
421: has $409\,687$ items and $2\,464\,630$ edges.
422: 
423: \begin{figure}[t]
424: \begin{center}
425: \includegraphics[width=3in]{groups.eps}
426: \end{center}
427: \caption{A visualization of the community structure at maximum modularity.
428: Note that some major communities have a large number of ``satellite''
429: communities connected only to them (top, lower left, lower right).  Also,
430: some pairs of major communities have sets of smaller communities that act
431: as ``bridges'' between them (e.g., between the lower left and lower right,
432: near the center).}
433: \label{fig:visual}
434: \end{figure}
435: 
436: The dendrogram for this calculation is of course too big to draw, but
437: Fig.~\ref{fig:Q} illustrates the modularity over the course of the
438: algorithm as vertices are joined into larger and larger groups.  The
439: maximum value is $Q=0.745$, which is high as calculations of this type
440: go~\cite{NG04,Newman04a} and indicates strong community structure in the
441: network.  The maximum occurs when there are $1684$ communities with a mean
442: size of $243$ items each.  Fig.~\ref{fig:visual} gives a visualization of
443: the community structure, including the major communities, smaller
444: ``satellite'' communities connected to them, and ``bridge'' communities
445: that connect two major communities with each other.
446: 
447: % descriptions of communities
448: \begin{table*}
449: \begin{center}
450: \begin{tabular}{cr|p{14.5cm}}
451: Rank & Size & Description \\
452: \hline
453: 1 & 114538 & General interest: politics; art/literature; general fiction;
454: human nature; technical books; how things, people, computers, societies
455: work, etc. \\
456: 
457: 2 & 92276 & The arts: videos, books, DVDs about the creative and performing
458: arts \\
459: 
460: 3 & 78661 & Hobbies and interests I: self-help; self-education; popular
461: science fiction, popular fantasy; leisure; etc. \\
462: 
463: 4 & 54582 & Hobbies and interests II: adventure books; video games/comics;
464: some sports; some humor; some classic fiction; some western religious
465: material; etc. \\
466: 
467: 5 & 9872 & classical music and related items \\
468: 
469: 6 & 1904 & children's videos, movies, music and books \\
470: 
471: 7 & 1493 & church/religious music; African-descent cultural books;
472: homoerotic imagery \\
473: 
474: 8 & 1101 & pop horror; mystery/adventure fiction \\
475: 
476: 9 & 1083 & jazz; orchestral music; easy listening \\
477: 
478: 10 & 947 & engineering; practical fashion 
479: \end{tabular}
480: \end{center}\caption{The $10$ largest communities in the
481: Amazon.com network, which account for $87\%$ of the vertices in the
482: network.}
483: \label{table:labels}
484: \end{table*}
485: % -----------------------------------
486: 
487: Looking at the largest communities in the network, we find that they tend
488: to consist of items (books, music) in similar genres or on similar topics.
489: In Table~\ref{table:labels}, we give informal descriptions of the ten
490: largest communities, which account for about 87\% of the entire network. 
491: The remainder is generally divided into small, densely connected 
492: communities that represent highly specific co-purchasing habits, e.g.,~major 
493: works of science fiction ($162$ items), music by John Cougar Mellencamp 
494: ($17$ items), and books about (mostly female) spies in the American Civil 
495: War ($13$ items).  It is worth noting that because few real-world 
496: networks have community metadata associated with them to which we 
497: may compare the inferred communities, this type of manual check of
498: the veracity and coherence of the algorithm's output is often necessary.
499: 
500: % distribution of community sizes
501: \begin{figure}[t]
502: \begin{center}
503: \includegraphics[scale=0.45]{fc_amazon0308_sizedistribution.eps}
504: \end{center}
505: \caption{Cumulative distribution of the sizes of communities when the
506: network is partitioned at the maximum modularity found by the algorithm.
507: The distribution appears to follow a power-law form over two decades in the
508: central part of its range, although it deviates in the tail.  As a guide to
509: the eye, the straight line has slope $-1$, which corresponds to an exponent
510: of $\alpha=2$ for the raw probability distribution.}
511: \label{fig:distribution}
512: \end{figure}
513: % -----------------------------------
514: 
515: One interesting property recently noted in some
516: networks~\cite{Arenas04,Newman04a} is that when partitioned at the point of
517: maximum modularity, the distribution of community sizes~$s$ appears to have
518: a power-law form $P(s) \sim s^{-\alpha}$ for some constant~$\alpha$, at
519: least over some significant range.  The Amazon co-purchasing network also
520: seems to exhibit this property, as we show in Fig.~\ref{fig:distribution},
521: with an exponent $\alpha\simeq2$.  It is unclear why such a distribution
522: should arise, but we speculate that it could be a result either of the
523: sociology of the network (a power-law distribution in the number of people
524: interested in various topics) or of the dynamics of the community structure
525: algorithm.  We propose this as a direction for further research.
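
The relation between the exponent~$\alpha$ and the slope of the cumulative
distribution plotted in Fig.~\ref{fig:distribution} follows from a short
calculation, sketched here by treating the community size as a continuous
variable~$t$ (an approximation) and assuming $\alpha>1$.  The fraction of
communities of size $s$ or greater then scales as
\[
\int_s^\infty t^{-\alpha} \,\d t = {s^{-(\alpha-1)}\over\alpha-1}
  \sim s^{-(\alpha-1)},
\]
so that a slope of $-1$ on logarithmic scales corresponds to $\alpha=2$.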
526: 
527: 
528: \section{Conclusions}
529: We have described a new algorithm for inferring community structure from
530: network topology which works by greedily optimizing the modularity.  Our
531: algorithm runs in time $\Ord(m d \log n)$ for a network with $n$ vertices
532: and $m$ edges where $d$ is the depth of the dendrogram.  For networks that
533: are hierarchical, in the sense that there are communities at many scales
534: and the dendrogram is roughly balanced, we have $d \sim \log n$.  If the
535: network is also sparse, $m \sim n$, then the running time is essentially
536: linear, $\Ord(n \log^2 n)$.  This is considerably faster than most previous
537: general algorithms, and allows us to extend community structure analysis to
538: networks that had been considered too large to be tractable.  
539: We have demonstrated our algorithm with an application to a large network 
540: of co-purchasing data from the online retailer Amazon.com.  
541: Our algorithm discovers clear communities within this network
542: that correspond to specific topics or genres of books or music, indicating
543: that the co-purchasing tendencies of Amazon customers are strongly
544: correlated with subject matter.  Our algorithm should allow researchers to
545: analyze even larger networks with millions of vertices and tens of millions
546: of edges using current computing resources, and we look forward to seeing
547: such applications.
548: 
549: %\bigskip 
550: 
551: \begin{acknowledgements}
552: The authors are grateful to Amazon.com and Eric Promislow for providing the
553: purchasing network data.  This work was funded in part by the National
554: Science Foundation under grant PHY-0200909 (AC, CM) and by a grant from
555: the James S. McDonnell Foundation (MEJN).
556: \end{acknowledgements}
557: 
558: 
559: %\bibliographystyle{numeric}
560: %\bibliography{journals,references}
561: 
562: \begin{thebibliography}{10}
563: \expandafter\ifx\csname url\endcsname\relax
564:   \def\url#1{\texttt{#1}}\fi
565: \expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi
566: 
567: \bibitem{Strogatz01}
568: S.~H. Strogatz, Exploring complex networks. \textit{Nature} \textbf{410},
569:   268--276 (2001).
570: 
571: \bibitem{AB02}
572: R.~Albert and A.-L. Barab\'asi, Statistical mechanics of complex networks.
573:   \textit{Rev. Mod. Phys.} \textbf{74}, 47--97 (2002).
574: 
575: \bibitem{DM02}
576: S.~N. Dorogovtsev and J.~F.~F. Mendes, Evolution of networks. \textit{Advances
577:   in Physics} \textbf{51}, 1079--1187 (2002).
578: 
579: \bibitem{Newman03d}
580: M.~E.~J. Newman, The structure and function of complex networks. \textit{SIAM
581:   Review} \textbf{45}, 167--256 (2003).
582: 
583: \bibitem{FFF99}
584: M.~Faloutsos, P.~Faloutsos, and C.~Faloutsos, On power-law relationships of the
585:   internet topology. \textit{Computer Communications Review} \textbf{29},
586:   251--262 (1999).
587: 
588: \bibitem{AJB99}
589: R.~Albert, H.~Jeong, and A.-L. Barab\'asi, Diameter of the world-wide web.
590:   \textit{Nature} \textbf{401}, 130--131 (1999).
591: 
592: \bibitem{Kleinberg99b}
593: J.~M. Kleinberg, S.~R. Kumar, P.~Raghavan, S.~Rajagopalan, and A.~Tomkins, The
594:   {W}eb as a graph: Measurements, models and methods. In \textit{Proceedings of
595:   the International Conference on Combinatorics and Computing}, number 1627 in
596:   Lecture Notes in Computer Science, pp. 1--18, Springer, Berlin (1999).
597: 
598: \bibitem{WF94}
599: S.~Wasserman and K.~Faust, \textit{Social Network Analysis}. Cambridge
600:   University Press, Cambridge (1994).
601: 
602: \bibitem{Price65}
603: D.~J.~{\relax de S}. Price, Networks of scientific papers. \textit{Science}
604:   \textbf{149}, 510--515 (1965).
605: 
606: \bibitem{Redner98}
607: S.~Redner, How popular is your paper? {A}n empirical study of the citation
608:   distribution. \textit{Eur. Phys. J. B} \textbf{4}, 131--134 (1998).
609: 
610: \bibitem{DWM02a}
611: J.~A. Dunne, R.~J. Williams, and N.~D. Martinez, Food-web structure and network
612:   theory: The role of connectance and size. \textit{Proc. Natl. Acad. Sci. USA}
613:   \textbf{99}, 12917--12922 (2002).
614: 
615: \bibitem{Kauffman69}
616: S.~A. Kauffman, Metabolic stability and epigenesis in randomly connected nets.
617:   \textit{J. Theor. Biol.} \textbf{22}, 437--467 (1969).
618: 
619: \bibitem{Ito01}
620: T.~Ito, T.~Chiba, R.~Ozawa, M.~Yoshida, M.~Hattori, and Y.~Sakaki, A
621:   comprehensive two-hybrid analysis to explore the yeast protein interactome.
622:   \textit{Proc. Natl. Acad. Sci. USA} \textbf{98}, 4569--4574 (2001).
623: 
624: \bibitem{note}
625: Community structure is sometimes referred to as ``clustering'' in sociology
626: or computer science, but this term is commonly used to mean something else
627: in the physics literature~\cite{WS98}, so to prevent confusion we avoid it
628: here.  We note also that the problem of finding communities in a network is
629: somewhat ill-posed, since we have not defined precisely what a community is.
630: A number of definitions have been proposed~\cite{WF94,FLGC02,Radicchi04},
631: but none is standard.
632: 
633: \bibitem{KL70}
634: B.~W. Kernighan and S.~Lin, An efficient heuristic procedure for partitioning
635:   graphs. \textit{Bell System Technical Journal} \textbf{49}, 291--307 (1970).
636: 
637: \bibitem{Fiedler73}
638: M.~Fiedler, Algebraic connectivity of graphs. \textit{Czech. Math. J.}
639:   \textbf{23}, 298--305 (1973).
640: 
641: \bibitem{PSL90}
642: A.~Pothen, H.~Simon, and K.-P. Liou, Partitioning sparse matrices with
643:   eigenvectors of graphs. \textit{SIAM J. Matrix Anal. Appl.} \textbf{11},
644:   430--452 (1990).
645: 
646: \bibitem{Scott00}
647: J.~Scott, \textit{Social Network Analysis: A Handbook}. Sage, London, 2nd
648:   edition (2000).
649: 
650: \bibitem{Newman04b}
651: M.~E.~J. Newman, Detecting community structure in networks. \textit{Eur. Phys.
652:   J. B} \textbf{38}, 321--330 (2004).
653: 
654: \bibitem{GN02}
655: M.~Girvan and M.~E.~J. Newman, Community structure in social and biological
656:   networks. \textit{Proc. Natl. Acad. Sci. USA} \textbf{99}, 7821--7826 (2002).
657: 
658: \bibitem{NG04}
659: M.~E.~J. Newman and M.~Girvan, Finding and evaluating community structure in
660:   networks. \textit{Phys. Rev. E} \textbf{69}, 026113 (2004).
661: 
662: \bibitem{GN04}
663: M.~T. Gastner and M.~E.~J. Newman, Diffusion-based method for producing density
664:   equalizing maps. \textit{Proc. Natl. Acad. Sci. USA} \textbf{101}, 7499--7504
665:   (2004).
666: 
667: \bibitem{Guimera03}
668: R.~Guimer\`a, L.~Danon, A.~D{\'\i}az-Guilera, F.~Giralt, and A.~Arenas,
669:   Self-similar community structure in organisations. \textit{Phys. Rev. E}
670:   \textbf{68}, 065103 (2003).
671: 
672: \bibitem{HHJ03}
673: P.~Holme, M.~Huss, and H.~Jeong, Subnetwork hierarchies of biochemical
674:   pathways. \textit{Bioinformatics} \textbf{19}, 532--538 (2003).
675: 
676: \bibitem{HH03}
677: P.~Holme and M.~Huss, Discovery and analysis of biochemical subnetwork
678:   hierarchies. In R.~Gauges, U.~Kummer, J.~Pahle, and U.~Rost (eds.),
679:   \textit{Proceedings of the 3rd Workshop on Computation of Biochemical
680:   Pathways and Genetic Networks}, pp. 3--9, Logos, Berlin (2003).
681: 
682: \bibitem{TWH03}
683: J.~R. Tyler, D.~M. Wilkinson, and B.~A. Huberman, Email as spectroscopy:
684:   Automated discovery of community structure within organizations. In
685:   M.~Huysman, E.~Wenger, and V.~Wulf (eds.), \textit{Proceedings of the First
686:   International Conference on Communities and Technologies}, Kluwer, Dordrecht
687:   (2003).
688: 
689: \bibitem{GD03}
690: P.~Gleiser and L.~Danon, Community structure in jazz. \textit{Advances in
691:   Complex Systems} \textbf{6}, 565--573 (2003).
692: 
693: \bibitem{BPDA04}
694: M.~Bogu{\~n}\'a, R.~Pastor-Satorras, A.~D{\'\i}az-Guilera, and A.~Arenas,
695:   Emergence of clustering, correlations, and communities in a social network
696:   model. Preprint cond-mat/0309263 (2003).
697: 
698: \bibitem{WH04b}
699: D.~M. Wilkinson and B.~A. Huberman, A method for finding communities of related
700:   genes. \textit{Proc. Natl. Acad. Sci. USA} \textbf{101}, 5241--5248 (2004).
701: 
702: \bibitem{Arenas04}
703: A.~Arenas, L.~Danon, A.~D{\'\i}az-Guilera, P.~M. Gleiser, and R.~Guimer\`a,
704:   Community analysis in social networks. \textit{Eur. Phys. J. B} \textbf{38},
705:   373--380 (2004).
706: 
707: \bibitem{Radicchi04}
708: F.~Radicchi, C.~Castellano, F.~Cecconi, V.~Loreto, and D.~Parisi, Defining and
709:   identifying communities in networks. \textit{Proc. Natl. Acad. Sci. USA}
710:   \textbf{101}, 2658--2663 (2004).
711: 
712: \bibitem{Newman04a}
713: M.~E.~J. Newman, Fast algorithm for detecting community structure in networks.
714:   \textit{Phys. Rev. E} \textbf{69}, 066133 (2004).
715: 
716: \bibitem{WH04a}
717: F.~Wu and B.~A. Huberman, Finding communities in linear time: A physics
718:   approach. \textit{Eur. Phys. J. B} \textbf{38}, 331--338 (2004).
719: 
720: \bibitem{Newman05a}
721: M.~E.~J. Newman, Analysis of weighted networks. Preprint cond-mat/0407503
722:   (2004).
723: 
724: \bibitem{CLRS01}
725: T.~H. Cormen, C.~E. Leiserson, R.~L. Rivest, and C.~Stein, \textit{Introduction
726:   to Algorithms}. MIT Press, Cambridge, MA, 2nd edition (2001).
727: 
728: \bibitem{WS98}
729: D.~J. Watts and S.~H. Strogatz, Collective dynamics of `small-world' networks.
730:   \textit{Nature} \textbf{393}, 440--442 (1998).
731: 
732: \bibitem{FLGC02}
733: G.~W. Flake, S.~R. Lawrence, C.~L. Giles, and F.~M. Coetzee, Self-organization
734:   and identification of {W}eb communities. \textit{IEEE Computer} \textbf{35},
735:   66--71 (2002).
736: 
737: \end{thebibliography}
738: 
739: \end{document}
740: