1: \documentclass[twocolumn,pre,superscriptaddress]{revtex4}
2:
3: % Required packages
4: \usepackage{dcolumn}
5: \usepackage{amsmath}
6:
7: % Optional extra packages
8: \usepackage{graphicx}
9:
10: % Hyphenation
11: \hyphenation{}
12:
13: \begin{document}
14:
15: % Macros
16: \renewcommand{\d}{d}
17: \newcommand{\Ord}{\mathrm{O}}
18: \newcommand{\eref}[1]{(\ref{#1})}
19: \newcommand{\etal}{{\it{}et~al.}}
20: \newcommand{\defn}{\textit}
21: \newcommand{\half}{\mbox{$\frac12$}}
22:
23: % Style parameters
24: \newlength{\figurewidth}
25: \setlength{\figurewidth}{0.95\columnwidth}
26: \setlength{\parskip}{0pt}
27: \setlength{\tabcolsep}{6pt}
28: \setlength{\arraycolsep}{2pt}
29:
30:
31: \title{Finding community structure in very large networks}
32: \author{Aaron Clauset}
33: \affiliation{Department of Computer Science, University of New Mexico,
34: Albuquerque, NM 87131}
35: \author{M. E. J. Newman}
36: \affiliation{Department of Physics and Center for the Study of Complex
37: Systems,\\
38: University of Michigan, Ann Arbor, MI 48109}
39: \author{Cristopher Moore}
40: \affiliation{Department of Computer Science, University of New Mexico,
41: Albuquerque, NM 87131}
42: \affiliation{Department of Physics and Astronomy, University of New Mexico,
43: Albuquerque, NM 87131}
44:
45: \begin{abstract}
46: The discovery and analysis of community structure in networks is a topic of
47: considerable recent interest within the physics community, but most methods
48: proposed so far are unsuitable for very large networks because of their
49: computational cost. Here we present a hierarchical agglomeration algorithm
50: for detecting community structure which is faster than many competing
51: algorithms: its running time on a network with $n$ vertices and $m$ edges
52: is $\Ord(m d \log n)$ where $d$ is the depth of the dendrogram describing
53: the community structure. Many real-world networks are sparse and
54: hierarchical, with $m \sim n$ and $d \sim \log n$, in which case our
55: algorithm runs in essentially linear time, $\Ord(n \log^2 n)$. As an
56: example of the application of this algorithm we use it to analyze a network
57: of items for sale on the web site of a large online retailer, items in the
58: network being linked if they are frequently purchased by the same buyer.
59: The network has more than $400\,000$ vertices and 2 million edges. We show
60: that our algorithm can extract meaningful communities from this network,
61: revealing large-scale patterns present in the purchasing habits of
62: customers.
63: \end{abstract}
64: \maketitle
65:
66:
67: % ---------- Introduction and Background ----------
68: \section{Introduction}
69: Many systems of current interest to the scientific community can usefully
70: be represented as networks~\cite{Strogatz01,AB02,DM02,Newman03d}. Examples
71: include the Internet~\cite{FFF99} and the world-wide
72: web~\cite{AJB99,Kleinberg99b}, social networks~\cite{WF94}, citation
73: networks~\cite{Price65,Redner98}, food webs~\cite{DWM02a}, and biochemical
74: networks~\cite{Kauffman69,Ito01}. Each of these networks consists of a set
75: of nodes or \defn{vertices} representing, for instance, computers or
76: routers on the Internet or people in a social network, connected together
77: by links or \defn{edges}, representing data connections between computers,
78: friendships between people, and so forth.
79:
80: One network feature that has been emphasized in recent work is
81: \defn{community structure}, the gathering of vertices into groups such that
82: there is a higher density of edges within groups than between
83: them~\cite{note}. The problem of detecting such communities within
84: networks has been well studied. Early approaches such as the
85: Kernighan--Lin algorithm~\cite{KL70}, spectral
86: partitioning~\cite{Fiedler73,PSL90}, or hierarchical
87: clustering~\cite{Scott00} work well for specific types of problems
88: (particularly graph bisection or problems with well defined vertex
89: similarity measures), but perform poorly in more general
90: cases~\cite{Newman04b}.
91:
92: To combat this problem a number of new algorithms have been proposed in
93: recent years. Girvan and Newman~\cite{GN02,NG04} proposed a divisive
94: algorithm that uses edge betweenness as a metric to identify the boundaries
95: of communities. This algorithm has been applied successfully to a variety
96: of networks, including networks of email messages, human and animal social
97: networks, networks of collaborations between scientists and musicians,
98: metabolic networks and gene
99: networks~\cite{GN02,GN04,Guimera03,HHJ03,HH03,TWH03,GD03,BPDA04,WH04b,Arenas04}.
100: However, as noted in~\cite{NG04}, the algorithm makes heavy demands on
101: computational resources, running in $\Ord(m^2 n)$ time on an arbitrary
102: network with $m$ edges and $n$ vertices, or $\Ord(n^3)$ time on a sparse
103: graph (one in which $m\sim n$, which covers most real-world networks of
104: interest). This restricts the algorithm's use to networks of at most a few
105: thousand vertices with current hardware.
106:
107: More recently a number of faster algorithms have been
108: proposed~\cite{Radicchi04,Newman04a,WH04a}. In~\cite{Newman04a}, one of us
109: proposed an algorithm based on the greedy optimization of the quantity
110: known as \defn{modularity}~\cite{NG04}. This method appears to work well
111: both in contrived test cases and in real-world situations, and is
112: substantially faster than the algorithm of Girvan and Newman. A naive
113: implementation runs in time $\Ord((m+n)n)$, or $\Ord(n^{2})$ on a sparse
114: graph.
115:
116: Here we propose a new algorithm that performs the same greedy optimization
117: as the algorithm of~\cite{Newman04a} and therefore gives identical results
118: for the communities found. However, by exploiting some shortcuts in the
119: optimization problem and using more sophisticated data structures, it runs
120: far more quickly, in time $\Ord(m d \log n)$ where $d$ is the depth of the
121: ``dendrogram'' describing the network's community structure. Many
122: real-world networks are sparse, so that $m \sim n$; and moreover, for
123: networks that have a hierarchical structure with communities at many
124: scales, $d \sim \log n$. For such networks our algorithm has essentially
125: linear running time, $\Ord(n \log^2 n)$.
126:
127: This is not merely a technical advance but has substantial practical
128: implications, bringing within reach the analysis of extremely large
129: networks. Networks of ten million vertices or more should be possible in
130: reasonable run times. As an example, we give results from the application
131: of the algorithm to a recommender network of books from the online
132: bookseller Amazon.com, which has more than $400\,000$ vertices and two
133: million edges.
134:
135:
136: % ---------- Description of the Algorithm and its Complexity ----------
137: \section{The algorithm}
138: \defn{Modularity}~\cite{NG04} is a property of a network and a specific
139: proposed division of that network into communities. It measures whether the
140: division is a good one, in the sense that there are many edges within
141: communities and only a few between them. Let $A_{vw}$ be an element of the
142: adjacency matrix of the network thus:
143: \begin{equation}
144: A_{vw} = \biggl\lbrace\begin{array}{ll}
145: 1 & \quad\mbox{if vertices $v$ and $w$ are connected,}\\
146: 0 & \quad\mbox{otherwise.}
147: \end{array}
148: \end{equation}
149: and suppose the vertices are divided into communities such that vertex~$v$
150: belongs to community~$c_v$. Then the fraction of edges that fall within
151: communities, i.e.,~that connect vertices that both lie in the same
152: community, is
153: \begin{equation}
154: {\sum_{vw} A_{vw} \delta(c_v,c_w)\over\sum_{vw} A_{vw}}
155: = {1\over2m} \sum_{vw} A_{vw} \delta(c_v,c_w),
156: \end{equation}
157: where the $\delta$-function $\delta(i,j)$ is 1 if $i=j$ and 0 otherwise,
158: and $m=\half\sum_{vw} A_{vw}$ is the number of edges in the graph. This
159: quantity will be large for good divisions of the network, in the sense of
160: having many within-community edges, but it is not, on its own, a good
161: measure of community structure since it takes its largest value of~1 in the
162: trivial case where all vertices belong to a single community. However, if
163: we subtract from it the expected value of the same quantity in the case of
164: a randomized network, we do get a useful measure.
165:
166: The \defn{degree}~$k_v$ of a vertex~$v$ is defined to be the number of
167: edges incident upon it:
168: \begin{equation}
169: k_v = \sum_w A_{vw}.
170: \end{equation}
171: The probability of an edge existing between vertices $v$ and $w$ if
172: connections are made at random but respecting vertex degrees is $k_v
173: k_w/2m$. We define the modularity~$Q$ to be
174: \begin{equation}
175: Q = {1\over2m} \sum_{vw} \biggl[ A_{vw} - {k_v k_w\over2m} \biggr]
176: \delta(c_v,c_w).
177: \label{modularity}
178: \end{equation}
179: If the fraction of within-community edges is no different from what we
180: would expect for the randomized network, then this quantity will be zero.
181: Nonzero values represent deviations from randomness, and in practice it is
182: found that a value above about 0.3 is a good indicator of significant
183: community structure in a network.
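
As a small worked illustration (a toy example of our own, not drawn from the
data analyzed later), consider two triangles joined by a single edge, so that
$n=6$ and $m=7$, and take each triangle to be one community. Each community
contains three internal edges, contributing $\sum_{vw} A_{vw}
\delta(c_v,c_w) = 6$ apiece, and its vertices have degrees $2$, $2$ and~$3$,
so that $\sum_{vw} k_v k_w \delta(c_v,c_w) = 7^2$ per community.
Equation~\eref{modularity} then gives
\begin{equation}
Q = {1\over14} \biggl[ 2\times6 - {2\times7^2\over14} \biggr]
  = {5\over14} \simeq 0.357,
\end{equation}
whereas placing all six vertices in a single community gives
$Q = (1/14)[14 - 14^2/14] = 0$.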
184:
185: If high values of the modularity correspond to good divisions of a network
186: into communities, then one should be able to find such good divisions by
187: searching through the possible candidates for ones with high modularity.
188: While finding the global maximum modularity over all possible divisions
189: seems hard in general, reasonably good solutions can be found with
190: approximate optimization techniques. The algorithm proposed
191: in~\cite{Newman04a} uses a greedy optimization in which, starting with each
192: vertex being the sole member of a community of one, we repeatedly join
193: together the two communities whose amalgamation produces the largest
194: increase in~$Q$. For a network of $n$ vertices, after $n-1$ such joins we
195: are left with a single community and the algorithm stops. The entire
196: process can be represented as a tree whose leaves are the vertices of the
197: original network and whose internal nodes correspond to the joins. This
198: \defn{dendrogram} represents a hierarchical decomposition of the network
199: into communities at all levels.
200:
201: The most straightforward implementation of this idea (and the only one
202: considered in~\cite{Newman04a}) involves storing the adjacency matrix of
203: the graph as an array of integers and repeatedly merging pairs of rows and
204: columns as the corresponding communities are merged. For the case of the
205: sparse graphs that are of primary interest in the field, however, this
206: approach wastes a good deal of time and memory space on the storage and
207: merging of matrix elements with value~0, which is the vast majority of the
208: adjacency matrix. The algorithm proposed in this paper achieves speed (and
209: memory efficiency) by eliminating these needless operations.
210:
211: To simplify the description of our algorithm let us define the following
212: two quantities:
213: \begin{equation}
214: e_{ij} = {1\over2m} \sum_{vw} A_{vw} \delta(c_v,i) \delta(c_w,j),
215: \end{equation}
216: which is the fraction of edges that join vertices in community~$i$ to
217: vertices in community~$j$, and
218: \begin{equation}
219: a_i = {1\over2m} \sum_v k_v\delta(c_v,i),
220: \end{equation}
221: which is the fraction of ends of edges that are attached to vertices in
222: community~$i$. Then, writing
223: $\delta(c_v,c_w)=\sum_i \delta(c_v,i) \delta(c_w,i)$, we have, from
224: Eq.~\eref{modularity}
225: \begin{eqnarray}
226: Q &=& {1\over2m} \sum_{vw} \biggl[ A_{vw} - {k_v k_w\over2m} \biggr]
227: \sum_i\delta(c_v,i)\delta(c_w,i)\nonumber\\
228: &=& \sum_i \biggl[ {1\over2m} \sum_{vw} A_{vw} \,\delta(c_v,i)\delta(c_w,i)
229: \nonumber\\
230: & & \qquad {} - {1\over2m}\sum_v k_v \,\delta(c_v,i)
231: {1\over2m}\sum_w k_w\delta(c_w,i) \biggr]\nonumber\\
232: &=& \sum_i (e_{ii} - a_i^2).
233: \end{eqnarray}
234: %When two communities $i$ and $j$ are joined, all edges that previously ran
235: %between the two now become within-community edges, and hence the modularity
236: %changes by an amount $\Delta Q_{ij} = 2(e_{ij} - a_i a_j)$. At each step,
237: %the algorithm joins the pair of communities $i,j$ with the largest~$\Delta
238: %Q_{ij}$.
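
It may help to spell out the change in~$Q$ produced by a single join; the
following short derivation is a sketch that uses only the definitions above.
When communities $i$ and~$j$ are merged, the terms $e_{ii}-a_i^2$ and
$e_{jj}-a_j^2$ in the sum are replaced by a single term in which the
within-community fraction is $e_{ii}+e_{jj}+e_{ij}+e_{ji}$ and the fraction
of edge ends is $a_i+a_j$. Since $e_{ij}=e_{ji}$ for an undirected network,
\begin{eqnarray}
\Delta Q_{ij} &=& \bigl[ (e_{ii}+e_{jj}+2e_{ij}) - (a_i+a_j)^2 \bigr]
  \nonumber\\
& & {} - (e_{ii}-a_i^2) - (e_{jj}-a_j^2) \nonumber\\
&=& 2(e_{ij} - a_i a_j).
\end{eqnarray}
In particular, if no edges run between $i$ and~$j$ then $\Delta Q_{ij} =
-2a_i a_j \le 0$.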
239:
240: The operation of the algorithm involves finding the changes in $Q$ that
241: would result from the amalgamation of each pair of communities, choosing
242: the largest of them, and performing the corresponding amalgamation. One
243: way to envisage (and implement) this process is to think of the network as a
244: multigraph, in which a whole community is represented by a vertex, bundles
245: of edges connect one vertex to another, and edges internal to communities
246: are represented by self-edges. The adjacency matrix of this multigraph has
247: elements $A'_{ij} = 2m e_{ij}$, and the joining of two communities $i$ and
248: $j$ corresponds to replacing the $i$th and $j$th rows and columns by their
249: sum. In the algorithm of~\cite{Newman04a} this operation is done
250: explicitly on the entire matrix, but if the adjacency matrix is sparse
251: (which we expect in the early stages of the process) the operation can be
252: carried out more efficiently using data structures for sparse matrices.
253: Unfortunately, calculating $\Delta Q_{ij}$ and finding the pair $i,j$ with
254: the largest $\Delta Q_{ij}$ then becomes time-consuming.
255:
256: In our new algorithm, rather than maintaining the adjacency matrix and
257: calculating~$\Delta Q_{ij}$, we instead maintain and update a matrix of
258: values of $\Delta Q_{ij}$. Since joining two communities with no edge
259: between them can never produce an increase in~$Q$, we need only store
260: $\Delta Q_{ij}$ for those pairs $i,j$ that are joined by one or more edges.
261: Since this matrix has the same support as the adjacency matrix, it will be
262: similarly sparse, so we can again represent it with efficient data
263: structures. In addition, we make use of an efficient data structure to
264: keep track of the largest $\Delta Q_{ij}$. These improvements result in a
265: considerable saving of both memory and time.
266:
267: In total, we maintain three data structures:
268: \begin{enumerate}
269: \item A sparse matrix containing $\Delta Q_{ij}$ for each pair $i,j$ of
270: communities with at least one edge between them. We store each row of the
271: matrix both as a balanced binary tree (so that elements can be found or
272: inserted in $O(\log n)$ time) and as a max-heap (so that the largest
273: element can be found in constant time).
274: \item A max-heap $H$ containing the largest element of each row of the
275: matrix $\Delta Q_{ij}$ along with the labels $i,j$ of the corresponding
276: pair of communities.
277: \item An ordinary vector array with elements~$a_i$.
278: \end{enumerate}
279:
280: As described above we start off with each vertex being the sole member of a
281: community of one, in which case $e_{ij}=1/2m$ if $i$ and $j$ are connected
282: and zero otherwise, and $a_i=k_i/2m$. Thus we initially set
283: \begin{equation}
284: \label{eq:qinit}
285: \Delta Q_{ij} = \biggl\lbrace\begin{array}{ll}
286: 2\bigl[1/2m - k_i k_j/(2m)^2\bigr] &
287: \quad\mbox{if $i,j$ are connected,}\\
288: 0 & \quad\mbox{otherwise,}
289: \end{array}
290: \end{equation}
291: and
292: \begin{equation}
293: \label{eq:ainit}
294: a_i = \frac{k_i}{2m}
295: \end{equation}
296: for each~$i$. (This assumes the graph is unweighted; weighted graphs are a
297: simple generalization~\cite{Newman05a}.)
298:
299: Our algorithm can now be defined as follows.
300: \begin{enumerate}
301: \item Calculate the initial values of $\Delta Q_{ij}$ and $a_i$ according
302: to~\eref{eq:qinit} and~\eref{eq:ainit}, and populate the max-heap with the
303: largest element of each row of the matrix $\Delta Q$.
304: \item Select the largest $\Delta Q_{ij}$ from $H$, join the corresponding
305: communities, update the matrix $\Delta Q$, the heap $H$ and $a_i$ (as
306: described below) and increment $Q$ by $\Delta Q_{ij}$.
307: \item Repeat step 2 until only one community remains.
308: \end{enumerate}
309:
310: Our data structures allow us to carry out the updates in step 2 quickly.
311: First, note that we need only adjust a few of the elements of $\Delta Q$.
312: If we join communities $i$ and~$j$, labeling the combined community~$j$,
313: say, we need only update the $j$th row and column, and remove the $i$th row
314: and column altogether. The update rules are as follows.
315: \begin{subequations}
316: If community $k$ is connected to both $i$ and $j$, then
317: \begin{equation}
318: \label{eq:both}
319: \Delta Q'_{jk} = \Delta Q_{ik} + \Delta Q_{jk}
320: \end{equation}
321: If $k$ is connected to $i$ but not to~$j$, then
322: \begin{equation}
323: \label{eq:justi}
324: \Delta Q'_{jk} = \Delta Q_{ik} - 2 a_j a_k
325: \end{equation}
326: If $k$ is connected to $j$ but not to $i$, then
327: \begin{equation}
328: \label{eq:justj}
329: \Delta Q'_{jk} = \Delta Q_{jk} - 2 a_i a_k.
330: \end{equation}
331: \end{subequations}
332: Note that these equations imply that $Q$ has a single peak over the course of the algorithm, since after the largest $\Delta Q$ becomes negative all the $\Delta Q$ can only decrease.
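
These rules can be checked with a short calculation, given here as a sketch.
From $Q=\sum_i (e_{ii}-a_i^2)$, joining two communities $i$ and~$j$ changes
$Q$ by $\Delta Q_{ij} = 2(e_{ij}-a_i a_j)$. After $i$ has been absorbed into
$j$, the combined community has $e'_{jk} = e_{ik}+e_{jk}$ and $a'_j =
a_i+a_j$, so that
\begin{eqnarray}
\Delta Q'_{jk} &=& 2\bigl[ (e_{ik}+e_{jk}) - (a_i+a_j) a_k \bigr]
  \nonumber\\
&=& 2(e_{ik} - a_i a_k) + 2(e_{jk} - a_j a_k).
\end{eqnarray}
If $k$ is connected to both $i$ and~$j$, the two terms are the stored values
$\Delta Q_{ik}$ and $\Delta Q_{jk}$, which gives Eq.~\eref{eq:both}. If $k$
is not connected to~$j$ then $e_{jk}=0$ and the second term reduces to $-2
a_j a_k$, which gives Eq.~\eref{eq:justi}; exchanging the roles of $i$
and~$j$ gives Eq.~\eref{eq:justj}.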
333:
334: To analyze how long the algorithm takes using our data structures, let us
335: denote the degrees of $i$ and $j$ in the reduced graph---i.e.,~the numbers
336: of neighboring communities---as $|i|$ and $|j|$ respectively. The first
337: operation in a step of the algorithm is to update the $j$th row. To
338: implement Eq.~\eref{eq:both}, we insert the elements of the $i$th row into
339: the $j$th row, summing them wherever an element exists in the same column of both rows.
340: Since we store the rows as balanced binary trees, each of these $|i|$
341: insertions takes $O(\log |j|) \le O(\log n)$ time. We then update the
342: other elements of the $j$th row, of which there are at most $|i|+|j|$,
343: according to Eqs.~\eref{eq:justi} and~\eref{eq:justj}. In the $k$th row,
344: we update a single element, taking $O(\log |k|) \le O(\log n)$ time, and
345: there are at most $|i|+|j|$ values of $k$ for which we have to do this.
346: All of this thus takes $O((|i|+|j|) \log n)$ time.
347:
348: We also have to update the max-heaps for each row and the overall max-heap
349: $H$. Reforming the max-heap corresponding to the $j$th row can be done in
350: $O(|j|)$ time~\cite{CLRS01}. Updating the max-heap for the $k$th row by
351: inserting, raising, or lowering $\Delta Q_{kj}$ takes $O(\log |k|) \le
352: O(\log n)$ time. Since we have changed the maximum element on at most
353: $|i|+|j|$ rows, we need to do at most $|i|+|j|$ updates of $H$, each of
354: which takes $O(\log n)$ time, for a total of $O((|i|+|j|) \log n)$.
355:
356: Finally, the update $a'_j = a_j + a_i$ (and $a_i = 0$) is trivial and can
357: be done in constant time.
358:
359: Since each join takes $O((|i|+|j|) \log n)$ time, the total running time is
360: at most $O(\log n)$ times the sum over all nodes of the dendrogram of the
361: degrees of the corresponding communities. Let us make the worst-case
362: assumption that the degree of a community is the sum of the degrees of all
363: the vertices in the original network comprising it. In that case, each
364: vertex of the original network contributes its degree to all of the
365: communities it is a part of, along the path in the dendrogram from it to
366: the root. If the dendrogram has depth~$d$, there are at most $d$ nodes in
367: this path, and since the total degree of all the vertices is~$2m$, we have
368: a running time of $O(m d \log n)$ as stated.
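
Written out, and under the same worst-case assumption, the argument reads as
follows. Here $C$ is an unspecified constant, $t$ runs over the joins (the
internal nodes of the dendrogram), and $i_t$ and $j_t$ denote the pair of
communities merged at join~$t$; this notation is introduced only for the
present sketch. The total running time~$T$ satisfies
\begin{eqnarray}
T &\le& C \log n \sum_t \bigl( |i_t| + |j_t| \bigr) \nonumber\\
  &\le& C \log n \sum_v k_v\, d \;=\; 2 C\, m\, d \log n,
\end{eqnarray}
where the second inequality holds because each vertex~$v$ belongs to at most
$d$ merged communities and contributes at most $k_v$ to the degree of each.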
369:
370: We note that, if the dendrogram is unbalanced, some time savings can be
371: gained by inserting the sparser row into the less sparse one. In addition,
372: we have found that in practical situations it is usually unnecessary to
373: maintain the separate max-heaps for each row. These heaps are used to find
374: the largest element in a row quickly, but their maintenance takes a
375: moderate amount of effort and this effort is wasted if the largest element
376: in a row does not change when two rows are amalgamated, which turns out
377: often to be the case. Thus we find that the following simpler
378: implementation works quite well in realistic situations: if the largest
379: element of the $k$th row was $\Delta Q_{ki}$ or $\Delta Q_{kj}$ and is now
380: reduced by Eq.~\eref{eq:justi} or~\eref{eq:justj}, we simply scan the $k$th
381: row to find the new largest element. Although the worst-case running time
382: of this approach has an additional factor of~$n$, the average-case running
383: time is often better than that of the more sophisticated algorithm. It should be noted
384: that the hierarchies generated by these two versions of our algorithm will
385: differ slightly as a result of the differences in how ties are broken for the
386: maximum element in a row. However, we find that in practice these differences
387: do not cause significant deviations in the modularity, the community
388: size distribution, or the composition of the largest communities.
389:
390:
391:
392: % modularity as a function of time
393: \begin{figure}[t]
394: \begin{center}
395: \includegraphics[scale=0.45]{fc_amazon0308_Q.eps}
396: \end{center}
397: \caption{The modularity $Q$ over the course of the algorithm (the $x$ axis
398: shows the number of joins). Its maximum value is $Q=0.745$, where the
399: partition consists of $1684$ communities.}
400: \label{fig:Q}
401: \end{figure}
402: % ------------------------------
403:
404: \section{Amazon.com purchasing network}
405: The output of the algorithm described above is precisely the same as that
406: of the slower hierarchical algorithm of~\cite{Newman04a}. The much
407: improved speed of our algorithm however makes possible studies of very
408: large networks for which previous methods were too slow to produce useful
409: results. Here we give one example, the analysis of a co-purchasing or
410: ``recommender'' network from the online vendor Amazon.com. Amazon sells a
411: variety of products, particularly books and music, and as part of their web
412: sales operation they list for each item~A the ten other items most
413: frequently purchased by buyers of~A. This information can be represented
414: as a directed network in which vertices represent items and there is an edge
415: from item~A to another item~B if B was frequently purchased by buyers of~A.
416: In our study we have ignored the directed nature of the network (as is
417: common in community structure calculations), assuming any link between two
418: items, regardless of direction, to be an indication of their similarity.
419: The network we study consists of items listed on the Amazon web site in
420: August 2003. We concentrate on the largest component of the network, which
421: has $409\,687$ items and $2\,464\,630$ edges.
422:
423: \begin{figure}[t]
424: \begin{center}
425: \includegraphics[width=3in]{groups.eps}
426: \end{center}
427: \caption{A visualization of the community structure at maximum modularity.
428: Note that some major communities have a large number of ``satellite''
429: communities connected only to them (top, lower left, lower right). Also,
430: some pairs of major communities have sets of smaller communities that act
431: as ``bridges'' between them (e.g., between the lower left and lower right,
432: near the center).}
433: \label{fig:visual}
434: \end{figure}
435:
436: The dendrogram for this calculation is of course too big to draw, but
437: Fig.~\ref{fig:Q} illustrates the modularity over the course of the
438: algorithm as vertices are joined into larger and larger groups. The
439: maximum value is $Q=0.745$, which is high as calculations of this type
440: go~\cite{NG04,Newman04a} and indicates strong community structure in the
441: network. The maximum occurs when there are $1684$ communities with a mean
442: size of $243$ items each. Fig.~\ref{fig:visual} gives a visualization of
443: the community structure, including the major communities, smaller
444: ``satellite'' communities connected to them, and ``bridge'' communities
445: that connect two major communities with each other.
446:
447: % descriptions of communities
448: \begin{table*}
449: \begin{center}
450: \begin{tabular}{cr|p{14.5cm}}
451: Rank & Size & Description \\
452: \hline
453: 1 & 114538 & General interest: politics; art/literature; general fiction;
454: human nature; technical books; how things, people, computers, societies
455: work, etc. \\
456:
457: 2 & 92276 & The arts: videos, books, DVDs about the creative and performing
458: arts \\
459:
460: 3 & 78661 & Hobbies and interests I: self-help; self-education; popular
461: science fiction, popular fantasy; leisure; etc. \\
462:
463: 4 & 54582 & Hobbies and interests II: adventure books; video games/comics;
464: some sports; some humor; some classic fiction; some western religious
465: material; etc. \\
466:
467: 5 & 9872 & classical music and related items \\
468:
469: 6 & 1904 & children's videos, movies, music and books \\
470:
471: 7 & 1493 & church/religious music; African-descent cultural books;
472: homoerotic imagery \\
473:
474: 8 & 1101 & pop horror; mystery/adventure fiction \\
475:
476: 9 & 1083 & jazz; orchestral music; easy listening \\
477:
478: 10 & 947 & engineering; practical fashion
479: \end{tabular}
480: \end{center}\caption{The $10$ largest communities in the
481: Amazon.com network, which account for $87\%$ of the vertices in the
482: network.}
483: \label{table:labels}
484: \end{table*}
485: % -----------------------------------
486:
487: Looking at the largest communities in the network, we find that they tend
488: to consist of items (books, music) in similar genres or on similar topics.
489: In Table~\ref{table:labels}, we give informal descriptions of the ten
490: largest communities, which account for about 87\% of the entire network.
491: The remainder is generally divided into small, densely connected
492: communities that represent highly specific co-purchasing habits, e.g.,~major
493: works of science fiction ($162$ items), music by John Cougar Mellencamp
494: ($17$ items), and books about (mostly female) spies in the American Civil
495: War ($13$ items). It is worth noting that because few real-world
496: networks have community metadata associated with them to which we
497: may compare the inferred communities, this type of manual check of
498: the veracity and coherence of the algorithm's output is often necessary.
499:
500: % distribution of community sizes
501: \begin{figure}[t]
502: \begin{center}
503: \includegraphics[scale=0.45]{fc_amazon0308_sizedistribution.eps}
504: \end{center}
505: \caption{Cumulative distribution of the sizes of communities when the
506: network is partitioned at the maximum modularity found by the algorithm.
507: The distribution appears to follow a power law form over two decades in the
508: central part of its range, although it deviates in the tail. As a guide to
509: the eye, the straight line has slope $-1$, which corresponds to an exponent
510: of $\alpha=2$ for the raw probability distribution.}
511: \label{fig:distribution}
512: \end{figure}
513: % -----------------------------------
514:
515: One interesting property recently noted in some
516: networks~\cite{Arenas04,Newman04a} is that when partitioned at the point of
517: maximum modularity, the distribution of community sizes~$s$ appears to have
518: a power-law form $P(s) \sim s^{-\alpha}$ for some constant~$\alpha$, at
519: least over some significant range. The Amazon co-purchasing network also
520: seems to exhibit this property, as we show in Fig.~\ref{fig:distribution},
521: with an exponent $\alpha\simeq2$. It is unclear why such a distribution
522: should arise, but we speculate that it could be a result either of the
523: sociology of the network (a power-law distribution in the number of people
524: interested in various topics) or of the dynamics of the community structure
525: algorithm. We propose this as a direction for further research.
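
The relation between the slope of the cumulative distribution in
Fig.~\ref{fig:distribution} and the exponent~$\alpha$ follows from a one-line
integration, sketched here by treating $s$ as a continuous variable (an
approximation) and assuming $\alpha>1$. Writing $P_{\ge}(s)$ for the
fraction of communities of size $s$ or greater (our notation), we have
\begin{eqnarray}
P_{\ge}(s) &=& \int_s^\infty P(u)\, du
  \sim \int_s^\infty u^{-\alpha}\, du \nonumber\\
&=& {s^{-(\alpha-1)}\over\alpha-1},
\end{eqnarray}
so that a straight line of slope $-1$ on logarithmic axes corresponds to
$\alpha=2$.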
526:
527:
528: \section{Conclusions}
529: We have described a new algorithm for inferring community structure from
530: network topology which works by greedily optimizing the modularity. Our
531: algorithm runs in time $\Ord(m d \log n)$ for a network with $n$ vertices
532: and $m$ edges where $d$ is the depth of the dendrogram. For networks that
533: are hierarchical, in the sense that there are communities at many scales
534: and the dendrogram is roughly balanced, we have $d \sim \log n$. If the
535: network is also sparse, $m \sim n$, then the running time is essentially
536: linear, $\Ord(n \log^2 n)$. This is considerably faster than most previous
537: general algorithms, and allows us to extend community structure analysis to
538: networks that had been considered too large to be tractable.
539: We have demonstrated our algorithm with an application to a large network
540: of co-purchasing data from the online retailer Amazon.com.
541: Our algorithm discovers clear communities within this network
542: that correspond to specific topics or genres of books or music, indicating
543: that the co-purchasing tendencies of Amazon customers are strongly
544: correlated with subject matter. Our algorithm should allow researchers to
545: analyze even larger networks with millions of vertices and tens of millions
546: of edges using current computing resources, and we look forward to seeing
547: such applications.
548:
549: %\bigskip
550:
551: \begin{acknowledgements}
552: The authors are grateful to Amazon.com and Eric Promislow for providing the
553: purchasing network data. This work was funded in part by the National
554: Science Foundation under grant PHY-0200909 (AC, CM) and by a grant from
555: the James S. McDonnell Foundation (MEJN).
556: \end{acknowledgements}
557:
558:
559: %\bibliographystyle{numeric}
560: %\bibliography{journals,references}
561:
562: \begin{thebibliography}{10}
563: \expandafter\ifx\csname url\endcsname\relax
564: \def\url#1{\texttt{#1}}\fi
565: \expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi
566:
567: \bibitem{Strogatz01}
568: S.~H. Strogatz, Exploring complex networks. \textit{Nature} \textbf{410},
569: 268--276 (2001).
570:
571: \bibitem{AB02}
572: R.~Albert and A.-L. Barab\'asi, Statistical mechanics of complex networks.
573: \textit{Rev. Mod. Phys.} \textbf{74}, 47--97 (2002).
574:
575: \bibitem{DM02}
576: S.~N. Dorogovtsev and J.~F.~F. Mendes, Evolution of networks. \textit{Advances
577: in Physics} \textbf{51}, 1079--1187 (2002).
578:
579: \bibitem{Newman03d}
580: M.~E.~J. Newman, The structure and function of complex networks. \textit{SIAM
581: Review} \textbf{45}, 167--256 (2003).
582:
583: \bibitem{FFF99}
584: M.~Faloutsos, P.~Faloutsos, and C.~Faloutsos, On power-law relationships of the
585: internet topology. \textit{Computer Communications Review} \textbf{29},
586: 251--262 (1999).
587:
588: \bibitem{AJB99}
589: R.~Albert, H.~Jeong, and A.-L. Barab\'asi, Diameter of the world-wide web.
590: \textit{Nature} \textbf{401}, 130--131 (1999).
591:
592: \bibitem{Kleinberg99b}
593: J.~M. Kleinberg, S.~R. Kumar, P.~Raghavan, S.~Rajagopalan, and A.~Tomkins, The
594: {W}eb as a graph: Measurements, models and methods. In \textit{Proceedings of
595: the International Conference on Combinatorics and Computing}, number 1627 in
596: Lecture Notes in Computer Science, pp. 1--18, Springer, Berlin (1999).
597:
598: \bibitem{WF94}
599: S.~Wasserman and K.~Faust, \textit{Social Network Analysis}. Cambridge
600: University Press, Cambridge (1994).
601:
602: \bibitem{Price65}
603: D.~J.~{\relax de S}. Price, Networks of scientific papers. \textit{Science}
604: \textbf{149}, 510--515 (1965).
605:
606: \bibitem{Redner98}
607: S.~Redner, How popular is your paper? {A}n empirical study of the citation
608: distribution. \textit{Eur. Phys. J. B} \textbf{4}, 131--134 (1998).
609:
610: \bibitem{DWM02a}
611: J.~A. Dunne, R.~J. Williams, and N.~D. Martinez, Food-web structure and network
612: theory: The role of connectance and size. \textit{Proc. Natl. Acad. Sci. USA}
613: \textbf{99}, 12917--12922 (2002).
614:
615: \bibitem{Kauffman69}
616: S.~A. Kauffman, Metabolic stability and epigenesis in randomly connected nets.
617: \textit{J. Theor. Biol.} \textbf{22}, 437--467 (1969).
618:
619: \bibitem{Ito01}
620: T.~Ito, T.~Chiba, R.~Ozawa, M.~Yoshida, M.~Hattori, and Y.~Sakaki, A
621: comprehensive two-hybrid analysis to explore the yeast protein interactome.
622: \textit{Proc. Natl. Acad. Sci. USA} \textbf{98}, 4569--4574 (2001).
623:
624: \bibitem{note}
625: Community structure is sometimes referred to as ``clustering'' in sociology
626: or computer science, but this term is commonly used to mean something else
627: in the physics literature~\cite{WS98}, so to prevent confusion we avoid it
628: here. We note also that the problem of finding communities in a network is
629: somewhat ill-posed, since we have not defined precisely what a community is.
630: A number of definitions have been proposed~\cite{WF94,FLGC02,Radicchi04},
631: but none is standard.
632:
633: \bibitem{KL70}
634: B.~W. Kernighan and S.~Lin, An efficient heuristic procedure for partitioning
635: graphs. \textit{Bell System Technical Journal} \textbf{49}, 291--307 (1970).
636:
637: \bibitem{Fiedler73}
638: M.~Fiedler, Algebraic connectivity of graphs. \textit{Czech. Math. J.}
639: \textbf{23}, 298--305 (1973).
640:
641: \bibitem{PSL90}
642: A.~Pothen, H.~Simon, and K.-P. Liou, Partitioning sparse matrices with
643: eigenvectors of graphs. \textit{SIAM J. Matrix Anal. Appl.} \textbf{11},
644: 430--452 (1990).
645:
646: \bibitem{Scott00}
647: J.~Scott, \textit{Social Network Analysis: A Handbook}. Sage, London, 2nd
648: edition (2000).
649:
650: \bibitem{Newman04b}
651: M.~E.~J. Newman, Detecting community structure in networks. \textit{Eur. Phys.
652: J. B} \textbf{38}, 321--330 (2004).
653:
654: \bibitem{GN02}
655: M.~Girvan and M.~E.~J. Newman, Community structure in social and biological
656: networks. \textit{Proc. Natl. Acad. Sci. USA} \textbf{99}, 7821--7826 (2002).
657:
658: \bibitem{NG04}
659: M.~E.~J. Newman and M.~Girvan, Finding and evaluating community structure in
660: networks. \textit{Phys. Rev. E} \textbf{69}, 026113 (2004).
661:
662: \bibitem{GN04}
663: M.~T. Gastner and M.~E.~J. Newman, Diffusion-based method for producing
664: density-equalizing maps. \textit{Proc. Natl. Acad. Sci. USA} \textbf{101}, 7499--7504
665: (2004).
666:
667: \bibitem{Guimera03}
668: R.~Guimer\`a, L.~Danon, A.~D{\'\i}az-Guilera, F.~Giralt, and A.~Arenas,
669: Self-similar community structure in organisations. \textit{Phys. Rev. E}
670: \textbf{68}, 065103 (2003).
671:
672: \bibitem{HHJ03}
673: P.~Holme, M.~Huss, and H.~Jeong, Subnetwork hierarchies of biochemical
674: pathways. \textit{Bioinformatics} \textbf{19}, 532--538 (2003).
675:
676: \bibitem{HH03}
677: P.~Holme and M.~Huss, Discovery and analysis of biochemical subnetwork
678: hierarchies. In R.~Gauges, U.~Kummer, J.~Pahle, and U.~Rost (eds.),
679: \textit{Proceedings of the 3rd Workshop on Computation of Biochemical
680: Pathways and Genetic Networks}, pp. 3--9, Logos, Berlin (2003).
681:
682: \bibitem{TWH03}
683: J.~R. Tyler, D.~M. Wilkinson, and B.~A. Huberman, Email as spectroscopy:
684: Automated discovery of community structure within organizations. In
685: M.~Huysman, E.~Wenger, and V.~Wulf (eds.), \textit{Proceedings of the First
686: International Conference on Communities and Technologies}, Kluwer, Dordrecht
687: (2003).
688:
689: \bibitem{GD03}
690: P.~Gleiser and L.~Danon, Community structure in jazz. \textit{Advances in
691: Complex Systems} \textbf{6}, 565--573 (2003).
692:
693: \bibitem{BPDA04}
694: M.~Bogu{\~n}\'a, R.~Pastor-Satorras, A.~D{\'\i}az-Guilera, and A.~Arenas,
695: Emergence of clustering, correlations, and communities in a social network
696: model. Preprint cond-mat/0309263 (2003).
697:
698: \bibitem{WH04b}
699: D.~M. Wilkinson and B.~A. Huberman, A method for finding communities of related
700: genes. \textit{Proc. Natl. Acad. Sci. USA} \textbf{101}, 5241--5248 (2004).
701:
702: \bibitem{Arenas04}
703: A.~Arenas, L.~Danon, A.~D{\'\i}az-Guilera, P.~M. Gleiser, and R.~Guimer\`a,
704: Community analysis in social networks. \textit{Eur. Phys. J. B} \textbf{38},
705: 373--380 (2004).
706:
707: \bibitem{Radicchi04}
708: F.~Radicchi, C.~Castellano, F.~Cecconi, V.~Loreto, and D.~Parisi, Defining and
709: identifying communities in networks. \textit{Proc. Natl. Acad. Sci. USA}
710: \textbf{101}, 2658--2663 (2004).
711:
712: \bibitem{Newman04a}
713: M.~E.~J. Newman, Fast algorithm for detecting community structure in networks.
714: \textit{Phys. Rev. E} \textbf{69}, 066133 (2004).
715:
716: \bibitem{WH04a}
717: F.~Wu and B.~A. Huberman, Finding communities in linear time: A physics
718: approach. \textit{Eur. Phys. J. B} \textbf{38}, 331--338 (2004).
719:
720: \bibitem{Newman05a}
721: M.~E.~J. Newman, Analysis of weighted networks. Preprint cond-mat/0407503
722: (2004).
723:
724: \bibitem{CLRS01}
725: T.~H. Cormen, C.~E. Leiserson, R.~L. Rivest, and C.~Stein, \textit{Introduction
726: to Algorithms}. MIT Press, Cambridge, MA, 2nd edition (2001).
727:
728: \bibitem{WS98}
729: D.~J. Watts and S.~H. Strogatz, Collective dynamics of `small-world' networks.
730: \textit{Nature} \textbf{393}, 440--442 (1998).
731:
732: \bibitem{FLGC02}
733: G.~W. Flake, S.~R. Lawrence, C.~L. Giles, and F.~M. Coetzee, Self-organization
734: and identification of {W}eb communities. \textit{IEEE Computer} \textbf{35},
735: 66--71 (2002).
736:
737: \end{thebibliography}
738:
739: \end{document}
740: