
1: \documentclass[twocolumn,pre,superscriptaddress]{revtex4}

2:

3: % Required packages

4: \usepackage{dcolumn}

5: \usepackage{amsmath}

6:

7: % Optional extra packages

8: \usepackage{graphicx}

9:

10: % Hyphenation

11: \hyphenation{}

12:

13: \begin{document}

14:

15: % Macros

16: \renewcommand{\d}{d}

17: \newcommand{\Ord}{\mathrm{O}}

18: \newcommand{\eref}[1]{(\ref{#1})}

19: \newcommand{\etal}{{\it{}et~al.}}

20: \newcommand{\defn}{\textit}

21: \newcommand{\half}{\mbox{$\frac12$}}

22:

23: % Style parameters

24: \newlength{\figurewidth}

25: \setlength{\figurewidth}{0.95\columnwidth}

26: \setlength{\parskip}{0pt}

27: \setlength{\tabcolsep}{6pt}

28: \setlength{\arraycolsep}{2pt}

29:

30:

31: \title{Finding community structure in very large networks}

32: \author{Aaron Clauset}

33: \affiliation{Department of Computer Science, University of New Mexico,

34: Albuquerque, NM 87131}

35: \author{M. E. J. Newman}

36: \affiliation{Department of Physics and Center for the Study of Complex

37: Systems,\\

38: University of Michigan, Ann Arbor, MI 48109}

39: \author{Cristopher Moore}

40: \affiliation{Department of Computer Science, University of New Mexico,

41: Albuquerque, NM 87131}

42: \affiliation{Department of Physics and Astronomy, University of New Mexico,

43: Albuquerque, NM 87131}

44:

45: \begin{abstract}

46: The discovery and analysis of community structure in networks is a topic of

47: considerable recent interest within the physics community, but most methods

48: proposed so far are unsuitable for very large networks because of their

49: computational cost.  Here we present a hierarchical agglomeration algorithm

50: for detecting community structure which is faster than many competing

51: algorithms: its running time on a network with $n$ vertices and $m$ edges

52: is $\Ord(m d \log n)$ where $d$ is the depth of the dendrogram describing

53: the community structure.  Many real-world networks are sparse and

54: hierarchical, with $m \sim n$ and $d \sim \log n$, in which case our

55: algorithm runs in essentially linear time, $\Ord(n \log^2 n)$.  As an

56: example of the application of this algorithm we use it to analyze a network

57: of items for sale on the web-site of a large online retailer, items in the

58: network being linked if they are frequently purchased by the same buyer.

59: The network has more than $400\,000$ vertices and 2 million edges.  We show

60: that our algorithm can extract meaningful communities from this network,

61: revealing large-scale patterns present in the purchasing habits of

62: customers.

63: \end{abstract}

64: \maketitle

65:

66:

67: % ---------- Introduction and Background ----------

68: \section{Introduction}

69: Many systems of current interest to the scientific community can usefully

70: be represented as networks~\cite{Strogatz01,AB02,DM02,Newman03d}.  Examples

71: include the Internet~\cite{FFF99} and the world-wide

72: web~\cite{AJB99,Kleinberg99b}, social networks~\cite{WF94}, citation

73: networks~\cite{Price65,Redner98}, food webs~\cite{DWM02a}, and biochemical

74: networks~\cite{Kauffman69,Ito01}.  Each of these networks consists of a set

75: of nodes or \defn{vertices} representing, for instance, computers or

76: routers on the Internet or people in a social network, connected together

77: by links or \defn{edges}, representing data connections between computers,

78: friendships between people, and so forth.

79:

80: One network feature that has been emphasized in recent work is

81: \defn{community structure}, the gathering of vertices into groups such that

82: there is a higher density of edges within groups than between

83: them~\cite{note}.  The problem of detecting such communities within

84: networks has been well studied.  Early approaches such as the

85: Kernighan--Lin algorithm~\cite{KL70}, spectral

86: partitioning~\cite{Fiedler73,PSL90}, or hierarchical

87: clustering~\cite{Scott00} work well for specific types of problems

88: (particularly graph bisection or problems with well defined vertex

89: similarity measures), but perform poorly in more general

90: cases~\cite{Newman04b}.

91:

92: To combat this problem a number of new algorithms have been proposed in

93: recent years.  Girvan and Newman~\cite{GN02,NG04} proposed a divisive

94: algorithm that uses edge betweenness as a metric to identify the boundaries

95: of communities.  This algorithm has been applied successfully to a variety

96: of networks, including networks of email messages, human and animal social

97: networks, networks of collaborations between scientists and musicians,

98: metabolic networks and gene

99: networks~\cite{GN02,GN04,Guimera03,HHJ03,HH03,TWH03,GD03,BPDA04,WH04b,Arenas04}.

100: However, as noted in~\cite{NG04}, the algorithm makes heavy demands on

101: computational resources, running in $\Ord(m^2 n)$ time on an arbitrary

102: network with $m$ edges and $n$ vertices, or $\Ord(n^3)$ time on a sparse

103: graph (one in which $m\sim n$, which covers most real-world networks of

104: interest).  This restricts the algorithm's use to networks of at most a few

105: thousand vertices with current hardware.

106:

107: More recently a number of faster algorithms have been

108: proposed~\cite{Radicchi04,Newman04a,WH04a}.  In~\cite{Newman04a}, one of us

109: proposed an algorithm based on the greedy optimization of the quantity

110: known as \defn{modularity}~\cite{NG04}.  This method appears to work well

111: both in contrived test cases and in real-world situations, and is

112: substantially faster than the algorithm of Girvan and Newman.  A naive

113: implementation runs in time $\Ord((m+n)n)$, or $\Ord(n^{2})$ on a sparse

114: graph.

115:

116: Here we propose a new algorithm that performs the same greedy optimization

117: as the algorithm of~\cite{Newman04a} and therefore gives identical results

118: for the communities found.  However, by exploiting some shortcuts in the

119: optimization problem and using more sophisticated data structures, it runs

120: far more quickly, in time $\Ord(m d \log n)$ where $d$ is the depth of the

121: ``dendrogram'' describing the network's community structure.  Many

122: real-world networks are sparse, so that $m \sim n$; and moreover, for

123: networks that have a hierarchical structure with communities at many

124: scales, $d \sim \log n$.  For such networks our algorithm has essentially

125: linear running time, $\Ord(n \log^2 n)$.

126:

127: This is not merely a technical advance but has substantial practical

128: implications, bringing within reach the analysis of extremely large

129: networks.  Analyses of networks with ten million vertices or more should be feasible in

130: reasonable run times.  As an example, we give results from the application

131: of the algorithm to a recommender network of books from the online

132: bookseller Amazon.com, which has more than $400\,000$ vertices and two

133: million edges.

134:

135:

136: % ---------- Description of the Algorithm and its Complexity ----------

137: \section{The algorithm}

138: \defn{Modularity}~\cite{NG04} is a property of a network and a specific

139: proposed division of that network into communities.  It measures whether the

140: division is a good one, in the sense that there are many edges within

141: communities and only a few between them.  Let $A_{vw}$ be an element of the

142: adjacency matrix of the network thus:

143: \begin{equation}

144: A_{vw} = \biggl\lbrace\begin{array}{ll}

145:            1 & \quad\mbox{if vertices $v$ and $w$ are connected,}\\

146:            0 & \quad\mbox{otherwise.}

147:          \end{array}

148: \end{equation}

149: and suppose the vertices are divided into communities such that vertex~$v$

150: belongs to community~$c_v$.  Then the fraction of edges that fall within

151: communities, i.e.,~that connect vertices that both lie in the same

152: community, is

153: \begin{equation}

154: {\sum_{vw} A_{vw} \delta(c_v,c_w)\over\sum_{vw} A_{vw}}

155:   = {1\over2m} \sum_{vw} A_{vw} \delta(c_v,c_w),

156: \end{equation}

157: where the $\delta$-function $\delta(i,j)$ is 1 if $i=j$ and 0 otherwise,

158: and $m=\half\sum_{vw} A_{vw}$ is the number of edges in the graph.  This

159: quantity will be large for good divisions of the network, in the sense of

160: having many within-community edges, but it is not, on its own, a good

161: measure of community structure since it takes its largest value of~1 in the

162: trivial case where all vertices belong to a single community.  However, if

163: we subtract from it the expected value of the same quantity in the case of

164: a randomized network, we do get a useful measure.

165:

166: The \defn{degree}~$k_v$ of a vertex~$v$ is defined to be the number of

167: edges incident upon it:

168: \begin{equation}

169: k_v = \sum_w A_{vw}.

170: \end{equation}

171: The probability of an edge existing between vertices $v$ and $w$ if

172: connections are made at random but respecting vertex degrees is $k_v

173: k_w/2m$.  We define the modularity~$Q$ to be

174: \begin{equation}

175: Q = {1\over2m} \sum_{vw} \biggl[ A_{vw} - {k_v k_w\over2m} \biggr]

176:     \delta(c_v,c_w).

177: \label{modularity}

178: \end{equation}

179: If the fraction of within-community edges is no different from what we

180: would expect for the randomized network, then this quantity will be zero.

181: Nonzero values represent deviations from randomness, and in practice it is

182: found that a value above about 0.3 is a good indicator of significant

183: community structure in a network.
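
As a small illustration (a toy example of our own choosing, not drawn from
the data analyzed below), consider a network of six vertices consisting of
two triangles joined by a single edge, so that $m=7$, and take each triangle
to be a community.  Each of the six within-community edges is counted twice
in the sum over $v,w$, and the vertices of each community have degrees $2$,
$2$, and~$3$, so that the sum of $k_v k_w$ over pairs in the same community
is $7^2$ per community.  Equation~\eref{modularity} then gives
\begin{equation}
Q = {1\over14} \biggl[ 2\times6 - {7^2 + 7^2\over14} \biggr]
  = {5\over14} \simeq 0.357,
\end{equation}
whereas placing all six vertices in a single community gives
$Q = (14 - 14^2/14)/14 = 0$.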

184:

185: If high values of the modularity correspond to good divisions of a network

186: into communities, then one should be able to find such good divisions by

187: searching through the possible candidates for ones with high modularity.

188: While finding the global maximum modularity over all possible divisions

189: seems hard in general, reasonably good solutions can be found with

190: approximate optimization techniques.  The algorithm proposed

191: in~\cite{Newman04a} uses a greedy optimization in which, starting with each

192: vertex being the sole member of a community of one, we repeatedly join

193: together the two communities whose amalgamation produces the largest

194: increase in~$Q$.  For a network of $n$ vertices, after $n-1$ such joins we

195: are left with a single community and the algorithm stops.  The entire

196: process can be represented as a tree whose leaves are the vertices of the

197: original network and whose internal nodes correspond to the joins.  This

198: \defn{dendrogram} represents a hierarchical decomposition of the network

199: into communities at all levels.

200:

201: The most straightforward implementation of this idea (and the only one

202: considered in~\cite{Newman04a}) involves storing the adjacency matrix of

203: the graph as an array of integers and repeatedly merging pairs of rows and

204: columns as the corresponding communities are merged.  For the case of the

205: sparse graphs that are of primary interest in the field, however, this

206: approach wastes a good deal of time and memory space on the storage and

207: merging of matrix elements with value~0, which is the vast majority of the

208: adjacency matrix.  The algorithm proposed in this paper achieves speed (and

209: memory efficiency) by eliminating these needless operations.

210:

211: To simplify the description of our algorithm let us define the following

212: two quantities:

213: \begin{equation}

214: e_{ij} = {1\over2m} \sum_{vw} A_{vw} \delta(c_v,i) \delta(c_w,j),

215: \end{equation}

216: which is the fraction of edges that join vertices in community~$i$ to

217: vertices in community~$j$, and

218: \begin{equation}

219: a_i = {1\over2m} \sum_v k_v\delta(c_v,i),

220: \end{equation}

221: which is the fraction of ends of edges that are attached to vertices in

222: community~$i$.  Then, writing

223: $\delta(c_v,c_w)=\sum_i \delta(c_v,i) \delta(c_w,i)$, we have, from

224: Eq.~\eref{modularity}

225: \begin{eqnarray}

226: Q &=& {1\over2m} \sum_{vw} \biggl[ A_{vw} - {k_v k_w\over2m} \biggr]

227:       \sum_i\delta(c_v,i)\delta(c_w,i)\nonumber\\

228:   &=& \sum_i \biggl[ {1\over2m} \sum_{vw} A_{vw} \,\delta(c_v,i)\delta(c_w,i)

229:         \nonumber\\

230:   & & \qquad {} - {1\over2m}\sum_v k_v \,\delta(c_v,i)

231:            {1\over2m}\sum_w k_w\delta(c_w,i) \biggr]\nonumber\\

232:   &=& \sum_i (e_{ii} - a_i^2).

233: \end{eqnarray}
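
For illustration, the change in modularity on joining two communities can
be worked out directly from this expression; the following short derivation
is a sketch that uses only the definitions above.  When communities $i$ and
$j$ are joined, with the combined community labeled~$j$, the edges that ran
between them become internal, so that
$e'_{jj} = e_{ii} + e_{jj} + e_{ij} + e_{ji}$ and $a'_j = a_i + a_j$, while
all other terms in the sum are unchanged.  Since $e_{ij} = e_{ji}$ for an
undirected network,
\begin{eqnarray}
\Delta Q_{ij} &=& \bigl[ e'_{jj} - (a'_j)^2 \bigr]
                  - (e_{ii} - a_i^2) - (e_{jj} - a_j^2) \nonumber\\
              &=& 2(e_{ij} - a_i a_j).
\end{eqnarray}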

234: %When two communities $i$ and $j$ are joined, all edges that previously ran

235: %between the two now become within-community edges, and hence the modularity

236: %changes by an amount $\Delta Q_{ij} = 2(e_{ij} - a_i a_j)$.  At each step,

237: %the algorithm joins the pair of communities $i,j$ with the largest~$\Delta

238: %Q_{ij}$.

239:

240: The operation of the algorithm involves finding the changes in $Q$ that

241: would result from the amalgamation of each pair of communities, choosing

242: the largest of them, and performing the corresponding amalgamation.  One

243: way to envisage (and implement) this process is to think of the network as a

244: multigraph, in which a whole community is represented by a vertex, bundles

245: of edges connect one vertex to another, and edges internal to communities

246: are represented by self-edges.  The adjacency matrix of this multigraph has

247: elements $A'_{ij} = 2m e_{ij}$, and the joining of two communities $i$ and

248: $j$ corresponds to replacing the $i$th and $j$th rows and columns by their

249: sum.  In the algorithm of~\cite{Newman04a} this operation is done

250: explicitly on the entire matrix, but if the adjacency matrix is sparse

251: (which we expect in the early stages of the process) the operation can be

252: carried out more efficiently using data structures for sparse matrices.

253: Unfortunately, calculating $\Delta Q_{ij}$ and finding the pair $i,j$ with

254: the largest $\Delta Q_{ij}$ then becomes time-consuming.

255:

256: In our new algorithm, rather than maintaining the adjacency matrix and

257: calculating~$\Delta Q_{ij}$, we instead maintain and update a matrix of

258: values of $\Delta Q_{ij}$.  Since joining two communities with no edge

259: between them can never produce an increase in~$Q$, we need only store

260: $\Delta Q_{ij}$ for those pairs $i,j$ that are joined by one or more edges.

261: Since this matrix has the same support as the adjacency matrix, it will be

262: similarly sparse, so we can again represent it with efficient data

263: structures.  In addition, we make use of an efficient data structure to

264: keep track of the largest $\Delta Q_{ij}$.  These improvements result in a

265: considerable saving of both memory and time.

266:

267: In total, we maintain three data structures:

268: \begin{enumerate}

269: \item A sparse matrix containing $\Delta Q_{ij}$ for each pair $i,j$ of

270: communities with at least one edge between them.  We store each row of the

271: matrix both as a balanced binary tree (so that elements can be found or

272: inserted in $O(\log n)$ time) and as a max-heap (so that the largest

273: element can be found in constant time).

274: \item A max-heap $H$ containing the largest element of each row of the

275: matrix $\Delta Q_{ij}$ along with the labels $i,j$ of the corresponding

276: pair of communities.

277: \item An ordinary vector array with elements~$a_i$.

278: \end{enumerate}

279:

280: As described above we start off with each vertex being the sole member of a

281: community of one, in which case $e_{ij}=1/2m$ if $i$ and $j$ are connected

282: and zero otherwise, and $a_i=k_i/2m$.  Thus we initially set

283: \begin{equation}

284: \label{eq:qinit}

285: \Delta Q_{ij} = \biggl\lbrace\begin{array}{ll}

286:                   2\bigl[1/2m - k_i k_j/(2m)^2\bigr] &

287:                     \quad\mbox{if $i,j$ are connected,}\\

288:                   0 & \quad\mbox{otherwise,}

289:                 \end{array}

290: \end{equation}

291: and

292: \begin{equation}

293: \label{eq:ainit}

294: a_i = \frac{k_i}{2m}

295: \end{equation}

296: for each~$i$.  (This assumes the graph is unweighted; weighted graphs are a

297: simple generalization~\cite{Newman05a}.)

298:

299: Our algorithm can now be defined as follows.

300: \begin{enumerate}

301: \item Calculate the initial values of $\Delta Q_{ij}$ and $a_i$ according

302: to~\eref{eq:qinit} and~\eref{eq:ainit}, and populate the max-heap with the

303: largest element of each row of the matrix $\Delta Q$.

304: \item Select the largest $\Delta Q_{ij}$ from $H$, join the corresponding

305: communities, update the matrix $\Delta Q$, the heap $H$ and $a_i$ (as

306: described below) and increment $Q$ by $\Delta Q_{ij}$.

307: \item Repeat step 2 until only one community remains.

308: \end{enumerate}

309:

310: Our data structures allow us to carry out the updates in step 2 quickly.

311: First, note that we need only adjust a few of the elements of $\Delta Q$.

312: If we join communities $i$ and~$j$, labeling the combined community~$j$,

313: say, we need only update the $j$th row and column, and remove the $i$th row

314: and column altogether.  The update rules are as follows.

315: \begin{subequations}

316: If community $k$ is connected to both $i$ and $j$, then

317: \begin{equation}

318: \label{eq:both}

319: \Delta Q'_{jk} = \Delta Q_{ik} + \Delta Q_{jk}

320: \end{equation}

321: If $k$ is connected to $i$ but not to~$j$, then

322: \begin{equation}

323: \label{eq:justi}

324: \Delta Q'_{jk} = \Delta Q_{ik} - 2 a_j a_k

325: \end{equation}

326: If $k$ is connected to $j$ but not to $i$, then

327: \begin{equation}

328: \label{eq:justj}

329: \Delta Q'_{jk} = \Delta Q_{jk} - 2 a_i a_k.

330: \end{equation}

331: \end{subequations}

332: Note that these equations imply that $Q$ has a single peak over the course of the algorithm, since after the largest $\Delta Q$ becomes negative all the $\Delta Q$ can only decrease.
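
These rules can be checked against the expression
$\Delta Q_{ij} = 2(e_{ij} - a_i a_j)$ for the change in modularity on
joining communities $i$ and~$j$, which follows from
$Q=\sum_i (e_{ii}-a_i^2)$.  As a brief sketch: after $i$ and $j$ are joined
we have $e'_{jk} = e_{ik} + e_{jk}$ and $a'_j = a_i + a_j$, so
\begin{eqnarray}
\Delta Q'_{jk} &=& 2\bigl[ e_{ik} + e_{jk} - (a_i + a_j) a_k \bigr]
                   \nonumber\\
               &=& 2(e_{ik} - a_i a_k) + 2(e_{jk} - a_j a_k).
\end{eqnarray}
If $k$ is connected to both $i$ and~$j$, the two terms are the stored values
$\Delta Q_{ik}$ and $\Delta Q_{jk}$, which gives Eq.~\eref{eq:both}.  If $k$
is connected to $i$ only, then $e_{jk}=0$ and the second term reduces to
$-2a_j a_k$, which gives Eq.~\eref{eq:justi}; Eq.~\eref{eq:justj} follows in
the same way with $i$ and $j$ exchanged.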

333:

334: To analyze how long the algorithm takes using our data structures, let us

335: denote the degrees of $i$ and $j$ in the reduced graph---i.e.,~the numbers

336: of neighboring communities---as $|i|$ and $|j|$ respectively.  The first

337: operation in a step of the algorithm is to update the $j$th row.  To

338: implement Eq.~\eref{eq:both}, we insert the elements of the $i$th row into

339: the $j$th row, summing them wherever an element exists in both rows.

340: Since we store the rows as balanced binary trees, each of these $|i|$

341: insertions takes $O(\log |j|) \le O(\log n)$ time.  We then update the

342: other elements of the $j$th row, of which there are at most $|i|+|j|$,

343: according to Eqs.~\eref{eq:justi} and~\eref{eq:justj}.  In the $k$th row,

344: we update a single element, taking $O(\log |k|) \le O(\log n)$ time, and

345: there are at most $|i|+|j|$ values of $k$ for which we have to do this.

346: All of this thus takes $O((|i|+|j|) \log n)$ time.

347:

348: We also have to update the max-heaps for each row and the overall max-heap

349: $H$.  Reforming the max-heap corresponding to the $j$th row can be done in

350: $O(|j|)$ time~\cite{CLRS01}.  Updating the max-heap for the $k$th row by

351: inserting, raising, or lowering $\Delta Q_{kj}$ takes $O(\log |k|) \le

352: O(\log n)$ time.  Since we have changed the maximum element on at most

353: $|i|+|j|$ rows, we need to do at most $|i|+|j|$ updates of $H$, each of

354: which takes $O(\log n)$ time, for a total of $O((|i|+|j|) \log n)$.

355:

356: Finally, the update $a'_j = a_j + a_i$ (and $a_i = 0$) is trivial and can

357: be done in constant time.

358:

359: Since each join takes $O((|i|+|j|) \log n)$ time, the total running time is

360: at most $O(\log n)$ times the sum over all nodes of the dendrogram of the

361: degrees of the corresponding communities.  Let us make the worst-case

362: assumption that the degree of a community is the sum of the degrees of all

363: the vertices in the original network comprising it.  In that case, each

364: vertex of the original network contributes its degree to all of the

365: communities it is a part of, along the path in the dendrogram from it to

366: the root.  If the dendrogram has depth~$d$, there are at most $d$ nodes in

367: this path, and since the total degree of all the vertices is~$2m$, we have

368: a running time of $O(m d \log n)$ as stated.
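
Written out, the counting argument above reads as follows, where
$\kappa_c$ is a symbol we introduce here for the degree of community~$c$ in
the reduced graph and the sums over $c$ run over the nodes of the
dendrogram:
\begin{eqnarray}
\sum_c \kappa_c &\le& \sum_c \sum_{v\in c} k_v
   = \sum_v k_v \,\bigl|\{ c : v\in c \}\bigr| \nonumber\\
                &\le& d \sum_v k_v = 2md.
\end{eqnarray}
Multiplying by the $O(\log n)$ cost per element gives $O(m d \log n)$, and
for a sparse, hierarchical network with $m\sim n$ and $d\sim\log n$ this
becomes $O(n \log^2 n)$.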

369:

370: We note that, if the dendrogram is unbalanced, some time savings can be

371: gained by inserting the sparser row into the less sparse one.  In addition,

372: we have found that in practical situations it is usually unnecessary to

373: maintain the separate max-heaps for each row.  These heaps are used to find

374: the largest element in a row quickly, but their maintenance takes a

375: moderate amount of effort and this effort is wasted if the largest element

376: in a row does not change when two rows are amalgamated, which turns out

377: often to be the case.  Thus we find that the following simpler

378: implementation works quite well in realistic situations: if the largest

379: element of the $k$th row was $\Delta Q_{ki}$ or $\Delta Q_{kj}$ and is now

380: reduced by Eq.~\eref{eq:justi} or~\eref{eq:justj}, we simply scan the $k$th

381: row to find the new largest element. Although the worst-case running time

382: of this approach has an additional factor of~$n$, the average-case running

383: time is often better than that of the more sophisticated algorithm. It should be noted

384: that the hierarchies generated by these two versions of our algorithm will

385: differ slightly as a result of the differences in how ties are broken for the

386: maximum element in a row. However, we find that in practice these differences

387: do not cause significant deviations in the modularity, the community

388: size distribution, or the composition of the largest communities.

389:

390:

391:

392: % modularity as a function of time

393: \begin{figure}[t]

394: \begin{center}

395: \includegraphics[scale=0.45]{fc_amazon0308_Q.eps}

396: \end{center}

397: \caption{The modularity $Q$ over the course of the algorithm (the $x$ axis

398: shows the number of joins). Its maximum value is $Q=0.745$, where the

399: partition consists of $1684$ communities.}

400: \label{fig:Q}

401: \end{figure}

402: % ------------------------------

403:

404: \section{Amazon.com purchasing network}

405: The output of the algorithm described above is precisely the same as that

406: of the slower hierarchical algorithm of~\cite{Newman04a}.  The much

407: improved speed of our algorithm, however, makes possible studies of very

408: large networks for which previous methods were too slow to produce useful

409: results.  Here we give one example, the analysis of a co-purchasing or

410: ``recommender'' network from the online vendor Amazon.com.  Amazon sells a

411: variety of products, particularly books and music, and as part of their web

412: sales operation they list for each item~A the ten other items most

413: frequently purchased by buyers of~A.  This information can be represented

414: as a directed network in which vertices represent items and there is an edge

415: from item~A to another item~B if B was frequently purchased by buyers of~A.

416: In our study we have ignored the directed nature of the network (as is

417: common in community structure calculations), assuming any link between two

418: items, regardless of direction, to be an indication of their similarity.

419: The network we study consists of items listed on the Amazon web site in

420: August 2003.  We concentrate on the largest component of the network, which

421: has $409\,687$ items and $2\,464\,630$ edges.

422:

423: \begin{figure}[t]

424: \begin{center}

425: \includegraphics[width=3in]{groups.eps}

426: \end{center}

427: \caption{A visualization of the community structure at maximum modularity.

428: Note that some major communities have a large number of ``satellite''

429: communities connected only to them (top, lower left, lower right).  Also,

430: some pairs of major communities have sets of smaller communities that act

431: as ``bridges'' between them (e.g., between the lower left and lower right,

432: near the center).}

433: \label{fig:visual}

434: \end{figure}

435:

436: The dendrogram for this calculation is of course too big to draw, but

437: Fig.~\ref{fig:Q} illustrates the modularity over the course of the

438: algorithm as vertices are joined into larger and larger groups.  The

439: maximum value is $Q=0.745$, which is high as calculations of this type

440: go~\cite{NG04,Newman04a} and indicates strong community structure in the

441: network.  The maximum occurs when there are $1684$ communities with a mean

442: size of $243$ items each.  Fig.~\ref{fig:visual} gives a visualization of

443: the community structure, including the major communities, smaller

444: ``satellite'' communities connected to them, and ``bridge'' communities

445: that connect two major communities with each other.

446:

447: % descriptions of communities

448: \begin{table*}

449: \begin{center}

450: \begin{tabular}{cr|p{14.5cm}}

451: Rank & Size & Description \\

452: \hline

453: 1 & 114538 & General interest: politics; art/literature; general fiction;

454: human nature; technical books; how things, people, computers, societies

455: work, etc. \\

456:

457: 2 & 92276 & The arts: videos, books, DVDs about the creative and performing

458: arts \\

459:

460: 3 & 78661 & Hobbies and interests I: self-help; self-education; popular

461: science fiction, popular fantasy; leisure; etc. \\

462:

463: 4 & 54582 & Hobbies and interests II: adventure books; video games/comics;

464: some sports; some humor; some classic fiction; some western religious

465: material; etc. \\

466:

467: 5 & 9872 & classical music and related items \\

468:

469: 6 & 1904 & children's videos, movies, music and books \\

470:

471: 7 & 1493 & church/religious music; African-descent cultural books;

472: homoerotic imagery \\

473:

474: 8 & 1101 & pop horror; mystery/adventure fiction \\

475:

476: 9 & 1083 & jazz; orchestral music; easy listening \\

477:

478: 10 & 947 & engineering; practical fashion

479: \end{tabular}

480: \end{center}\caption{The $10$ largest communities in the

481: Amazon.com network, which account for $87\%$ of the vertices in the

482: network.}

483: \label{table:labels}

484: \end{table*}

485: % -----------------------------------

486:

487: Looking at the largest communities in the network, we find that they tend

488: to consist of items (books, music) in similar genres or on similar topics.

489: In Table~\ref{table:labels}, we give informal descriptions of the ten

490: largest communities, which account for about 87\% of the entire network.

491: The remainder is generally divided into small, densely connected

492: communities that represent highly specific co-purchasing habits, e.g.,~major

493: works of science fiction ($162$ items), music by John Cougar Mellencamp

494: ($17$ items), and books about (mostly female) spies in the American Civil

495: War ($13$ items).  It is worth noting that because few real-world

496: networks have community metadata associated with them to which we

497: may compare the inferred communities, this type of manual check of

498: the veracity and coherence of the algorithm's output is often necessary.

499:

500: % distribution of community sizes

501: \begin{figure}[t]

502: \begin{center}

503: \includegraphics[scale=0.45]{fc_amazon0308_sizedistribution.eps}

504: \end{center}

505: \caption{Cumulative distribution of the sizes of communities when the

506: network is partitioned at the maximum modularity found by the algorithm.

507: The distribution appears to follow a power-law form over two decades in the

508: central part of its range, although it deviates in the tail.  As a guide to

509: the eye, the straight line has slope $-1$, which corresponds to an exponent

510: of $\alpha=2$ for the raw probability distribution.}

511: \label{fig:distribution}

512: \end{figure}

513: % -----------------------------------

514:

515: One interesting property recently noted in some

516: networks~\cite{Arenas04,Newman04a} is that when partitioned at the point of

517: maximum modularity, the distribution of community sizes~$s$ appears to have

518: a power-law form $P(s) \sim s^{-\alpha}$ for some constant~$\alpha$, at

519: least over some significant range.  The Amazon co-purchasing network also

520: seems to exhibit this property, as we show in Fig.~\ref{fig:distribution},

521: with an exponent $\alpha\simeq2$.  It is unclear why such a distribution

522: should arise, but we speculate that it could be a result either of the

523: sociology of the network (a power-law distribution in the number of people

524: interested in various topics) or of the dynamics of the community structure

525: algorithm.  We propose this as a direction for further research.
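
The relation between the slope of the cumulative distribution in
Fig.~\ref{fig:distribution} and the exponent~$\alpha$ can be made explicit.
Treating the community size as a continuous variable (an approximation) and
assuming $\alpha>1$, the fraction of communities of size $s$ or greater is,
up to normalization,
\begin{equation}
\int_s^\infty t^{-\alpha} \,dt = {s^{-(\alpha-1)}\over\alpha-1},
\end{equation}
so that a cumulative distribution with slope $-1$ on logarithmic scales
corresponds to $\alpha=2$.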

526:

527:

528: \section{Conclusions}

529: We have described a new algorithm for inferring community structure from

530: network topology which works by greedily optimizing the modularity.  Our

531: algorithm runs in time $\Ord(m d \log n)$ for a network with $n$ vertices

532: and $m$ edges where $d$ is the depth of the dendrogram.  For networks that

533: are hierarchical, in the sense that there are communities at many scales

534: and the dendrogram is roughly balanced, we have $d \sim \log n$.  If the

535: network is also sparse, $m \sim n$, then the running time is essentially

536: linear, $\Ord(n \log^2 n)$.  This is considerably faster than most previous

537: general algorithms, and allows us to extend community structure analysis to

538: networks that had been considered too large to be tractable.

539: We have demonstrated our algorithm with an application to a large network

540: of co-purchasing data from the online retailer Amazon.com.

541: Our algorithm discovers clear communities within this network

542: that correspond to specific topics or genres of books or music, indicating

543: that the co-purchasing tendencies of Amazon customers are strongly

544: correlated with subject matter.  Our algorithm should allow researchers to

545: analyze even larger networks with millions of vertices and tens of millions

546: of edges using current computing resources, and we look forward to seeing

547: such applications.

548:

549: %\bigskip

550:

551: \begin{acknowledgements}

552: The authors are grateful to Amazon.com and Eric Promislow for providing the

553: purchasing network data.  This work was funded in part by the National

554: Science Foundation under grant PHY-0200909 (AC, CM) and by a grant from

555: the James S. McDonnell Foundation (MEJN).

556: \end{acknowledgements}

557:

558:

559: %\bibliographystyle{numeric}

560: %\bibliography{journals,references}

561:

562: \begin{thebibliography}{10}

563: \expandafter\ifx\csname url\endcsname\relax

564:   \def\url#1{\texttt{#1}}\fi

565: \expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi

566:

567: \bibitem{Strogatz01}

568: S.~H. Strogatz, Exploring complex networks. \textit{Nature} \textbf{410},

569:   268--276 (2001).

570:

571: \bibitem{AB02}

572: R.~Albert and A.-L. Barab\'asi, Statistical mechanics of complex networks.

573:   \textit{Rev. Mod. Phys.} \textbf{74}, 47--97 (2002).

574:

575: \bibitem{DM02}

576: S.~N. Dorogovtsev and J.~F.~F. Mendes, Evolution of networks. \textit{Advances

577:   in Physics} \textbf{51}, 1079--1187 (2002).

578:

579: \bibitem{Newman03d}

580: M.~E.~J. Newman, The structure and function of complex networks. \textit{SIAM

581:   Review} \textbf{45}, 167--256 (2003).

582:

583: \bibitem{FFF99}

584: M.~Faloutsos, P.~Faloutsos, and C.~Faloutsos, On power-law relationships of the

585:   internet topology. \textit{Computer Communication Review} \textbf{29},

586:   251--262 (1999).

587:

588: \bibitem{AJB99}

589: R.~Albert, H.~Jeong, and A.-L. Barab\'asi, Diameter of the world-wide web.

590:   \textit{Nature} \textbf{401}, 130--131 (1999).

591:

592: \bibitem{Kleinberg99b}

593: J.~M. Kleinberg, S.~R. Kumar, P.~Raghavan, S.~Rajagopalan, and A.~Tomkins, The

594:   {W}eb as a graph: Measurements, models and methods. In \textit{Proceedings of

595:   the International Conference on Combinatorics and Computing}, number 1627 in

596:   Lecture Notes in Computer Science, pp. 1--18, Springer, Berlin (1999).

597:

598: \bibitem{WF94}

599: S.~Wasserman and K.~Faust, \textit{Social Network Analysis}. Cambridge

600:   University Press, Cambridge (1994).

601:

602: \bibitem{Price65}

603: D.~J.~{\relax de S}. Price, Networks of scientific papers. \textit{Science}

604:   \textbf{149}, 510--515 (1965).

605:

606: \bibitem{Redner98}

607: S.~Redner, How popular is your paper? {A}n empirical study of the citation

608:   distribution. \textit{Eur. Phys. J. B} \textbf{4}, 131--134 (1998).

609:

610: \bibitem{DWM02a}

611: J.~A. Dunne, R.~J. Williams, and N.~D. Martinez, Food-web structure and network

612:   theory: The role of connectance and size. \textit{Proc. Natl. Acad. Sci. USA}

613:   \textbf{99}, 12917--12922 (2002).

614:

615: \bibitem{Kauffman69}

616: S.~A. Kauffman, Metabolic stability and epigenesis in randomly connected nets.

617:   \textit{J. Theor. Biol.} \textbf{22}, 437--467 (1969).

618:

619: \bibitem{Ito01}

620: T.~Ito, T.~Chiba, R.~Ozawa, M.~Yoshida, M.~Hattori, and Y.~Sakaki, A

621:   comprehensive two-hybrid analysis to explore the yeast protein interactome.

622:   \textit{Proc. Natl. Acad. Sci. USA} \textbf{98}, 4569--4574 (2001).

623:

624: \bibitem{note}

625: Community structure is sometimes referred to as ``clustering'' in sociology

626: or computer science, but this term is commonly used to mean something else

627: in the physics literature~\cite{WS98}, so to prevent confusion we avoid it

628: here.  We note also that the problem of finding communities in a network is

629: somewhat ill-posed, since we have not defined precisely what a community is.

630: A number of definitions have been proposed~\cite{WF94,FLGC02,Radicchi04},

631: but none is standard.

632:

633: \bibitem{KL70}

634: B.~W. Kernighan and S.~Lin, An efficient heuristic procedure for partitioning

635:   graphs. \textit{Bell System Technical Journal} \textbf{49}, 291--307 (1970).

636:

637: \bibitem{Fiedler73}

638: M.~Fiedler, Algebraic connectivity of graphs. \textit{Czech. Math. J.}

639:   \textbf{23}, 298--305 (1973).

640:

641: \bibitem{PSL90}

642: A.~Pothen, H.~Simon, and K.-P. Liou, Partitioning sparse matrices with

643:   eigenvectors of graphs. \textit{SIAM J. Matrix Anal. Appl.} \textbf{11},

644:   430--452 (1990).

645:

646: \bibitem{Scott00}

647: J.~Scott, \textit{Social Network Analysis: A Handbook}. Sage, London, 2nd

648:   edition (2000).

649:

650: \bibitem{Newman04b}

651: M.~E.~J. Newman, Detecting community structure in networks. \textit{Eur. Phys.

652:   J. B} \textbf{38}, 321--330 (2004).

653:

654: \bibitem{GN02}

655: M.~Girvan and M.~E.~J. Newman, Community structure in social and biological

656:   networks. \textit{Proc. Natl. Acad. Sci. USA} \textbf{99}, 7821--7826 (2002).

657:

658: \bibitem{NG04}

659: M.~E.~J. Newman and M.~Girvan, Finding and evaluating community structure in

660:   networks. \textit{Phys. Rev. E} \textbf{69}, 026113 (2004).

661:

662: \bibitem{GN04}

663: M.~T. Gastner and M.~E.~J. Newman, Diffusion-based method for producing density

664:   equalizing maps. \textit{Proc. Natl. Acad. Sci. USA} \textbf{101}, 7499--7504

665:   (2004).

666:

667: \bibitem{Guimera03}

668: R.~Guimer\`a, L.~Danon, A.~D{\'\i}az-Guilera, F.~Giralt, and A.~Arenas,

669:   Self-similar community structure in organisations. \textit{Phys. Rev. E}

670:   \textbf{68}, 065103 (2003).

671:

672: \bibitem{HHJ03}

673: P.~Holme, M.~Huss, and H.~Jeong, Subnetwork hierarchies of biochemical

674:   pathways. \textit{Bioinformatics} \textbf{19}, 532--538 (2003).

675:

676: \bibitem{HH03}

677: P.~Holme and M.~Huss, Discovery and analysis of biochemical subnetwork

678:   hierarchies. In R.~Gauges, U.~Kummer, J.~Pahle, and U.~Rost (eds.),

679:   \textit{Proceedings of the 3rd Workshop on Computation of Biochemical

680:   Pathways and Genetic Networks}, pp. 3--9, Logos, Berlin (2003).

681:

682: \bibitem{TWH03}

683: J.~R. Tyler, D.~M. Wilkinson, and B.~A. Huberman, Email as spectroscopy:

684:   Automated discovery of community structure within organizations. In

685:   M.~Huysman, E.~Wenger, and V.~Wulf (eds.), \textit{Proceedings of the First

686:   International Conference on Communities and Technologies}, Kluwer, Dordrecht

687:   (2003).

688:

689: \bibitem{GD03}

690: P.~Gleiser and L.~Danon, Community structure in jazz. \textit{Advances in

691:   Complex Systems} \textbf{6}, 565--573 (2003).

692:

693: \bibitem{BPDA04}

694: M.~Bogu{\~n}\'a, R.~Pastor-Satorras, A.~D{\'\i}az-Guilera, and A.~Arenas,

695:   Emergence of clustering, correlations, and communities in a social network

696:   model. Preprint cond-mat/0309263 (2003).

697:

698: \bibitem{WH04b}

699: D.~M. Wilkinson and B.~A. Huberman, A method for finding communities of related

700:   genes. \textit{Proc. Natl. Acad. Sci. USA} \textbf{101}, 5241--5248 (2004).

701:

702: \bibitem{Arenas04}

703: A.~Arenas, L.~Danon, A.~D{\'\i}az-Guilera, P.~M. Gleiser, and R.~Guimer\`a,

704:   Community analysis in social networks. \textit{Eur. Phys. J. B} \textbf{38},

705:   373--380 (2004).

706:

707: \bibitem{Radicchi04}

708: F.~Radicchi, C.~Castellano, F.~Cecconi, V.~Loreto, and D.~Parisi, Defining and

709:   identifying communities in networks. \textit{Proc. Natl. Acad. Sci. USA}

710:   \textbf{101}, 2658--2663 (2004).

711:

712: \bibitem{Newman04a}

713: M.~E.~J. Newman, Fast algorithm for detecting community structure in networks.

714:   \textit{Phys. Rev. E} \textbf{69}, 066133 (2004).

715:

716: \bibitem{WH04a}

717: F.~Wu and B.~A. Huberman, Finding communities in linear time: A physics

718:   approach. \textit{Eur. Phys. J. B} \textbf{38}, 331--338 (2004).

719:

720: \bibitem{Newman05a}

721: M.~E.~J. Newman, Analysis of weighted networks. Preprint cond-mat/0407503

722:   (2004).

723:

724: \bibitem{CLRS01}

725: T.~H. Cormen, C.~E. Leiserson, R.~L. Rivest, and C.~Stein, \textit{Introduction

726:   to Algorithms}. MIT Press, Cambridge, MA, 2nd edition (2001).

727:

728: \bibitem{WS98}

729: D.~J. Watts and S.~H. Strogatz, Collective dynamics of `small-world' networks.

730:   \textit{Nature} \textbf{393}, 440--442 (1998).

731:

732: \bibitem{FLGC02}

733: G.~W. Flake, S.~R. Lawrence, C.~L. Giles, and F.~M. Coetzee, Self-organization

734:   and identification of {W}eb communities. \textit{IEEE Computer} \textbf{35},

735:   66--71 (2002).

736:

737: \end{thebibliography}

738:

739: \end{document}

740: