
1: \documentclass[aps,pre,showpacs,twocolumn,superscriptaddress]{revtex4}

2: \usepackage{epsfig}

3: \newcommand{\be}{\begin{equation}}

4: \newcommand{\ee}{\end{equation}}

5: \newcommand{\bea}{\begin{eqnarray}}

6: \newcommand{\eea}{\end{eqnarray}}

7: \newcommand{\kim}[1]{{\huge{$\bullet$}}{\em #1}}

8: \newcommand{\maya}[1]{{\large{$\clubsuit$}}{\em #1}}

9: \newcommand{\peter}[1]{{\large{$\heartsuit$}}{\em #1}}

10: \newcommand{\ecoli}{{\it Escherichia coli }}

11: \newcommand{\colp}{{\it E.~coli}}

12: \newcommand{\coli}{{\it E.~coli }}

13:

14: \begin{document}

15:

16: \title{Graph animals, subgraph sampling and motif search in large networks}

17:

18: \author{Kim Baskerville} \affiliation{Perimeter Institute for

19:   Theoretical Physics, Waterloo, Canada N2L 2Y5}

20: \author{Peter Grassberger} \affiliation{Complexity Science Group,

21:   University of Calgary, Calgary, Canada} \affiliation{Institute for

22:   Biocomplexity and Informatics, University of Calgary, Calgary,

23:   Canada}

24: \author{Maya Paczuski} \affiliation{Complexity Science Group,

25:   University of Calgary, Calgary, Canada}

26:

27: \date{\today}

28:

29:

30: \begin{abstract}

31:   We generalize a sampling algorithm for lattice animals (connected

32:   clusters on a regular lattice) to a Monte Carlo algorithm for `graph

33:   animals', i.e. connected subgraphs in arbitrary networks. As with

34:   the algorithm in [N. Kashtan {\it et al.}, Bioinformatics {\bf 20},

35:   1746 (2004)], it provides a weighted sample, but the computation of

36:   the weights is much faster (linear in the size of subgraphs, instead

37:   of super-exponential). This allows subgraphs with up to ten or more

38:   nodes to be sampled with very high statistics, from arbitrarily

39:   large networks. Using this together with a heuristic algorithm for

40:   rapidly classifying isomorphic graphs, we present results for two

41:   protein interaction networks obtained using the TAP high throughput

42:   method: one of \ecoli with 230 nodes and 695

43:   links, and one for yeast ({\it Saccharomyces cerevisiae}) with

44:   roughly ten times more nodes and links. We find in both cases that

45:   most connected subgraphs are strong motifs ($Z$-scores $>10$) or

46:   anti-motifs ($Z$-scores $<-10$) when the null model is the ensemble of

47:   networks with fixed degree sequence. Strong differences appear

48:   between the two networks, with dominant motifs in \coli being

49:   (nearly) bipartite graphs and having many pairs of nodes which

50:   connect to the same neighbors, while dominant motifs in yeast tend

51:   towards completeness or contain large cliques. We also explore a

52:   number of methods that do not rely on measurements of $Z$-scores or

53:   comparisons with null models. For instance, we discuss the influence

54:   of specific complexes like the 26S proteasome in yeast, where a

55:   small number of complexes dominate the $k$-cores with large $k$ and

56:   have a decisive effect on the strongest motifs with 6 to 8 nodes. We

57:   also present Zipf plots of counts versus rank. They show broad

58:   distributions that are not power laws, in contrast to the case when

59:   disconnected subgraphs are included.

60: \end{abstract}

61:

62: \pacs{02.70.Uu, 05.10.Ln, 87.10.+e, 89.75.Fb, 89.75.Hc}

63:

64: \maketitle

65:

66: \section{Introduction}

67:

68: Recently, there has been an increased interest in complex networks,

69: partly triggered by the observation that naturally occurring networks

70: tend to have fat-tailed or even power law degree distributions

71: \cite{faloutsos,barabasi}. Thus real-world networks tend to be very

72: different from the completely random Erd\H{o}s-R\'enyi \cite{bollobas}

73: networks that have been much studied by mathematicians, and which give

74: Poissonian degree distributions.  In addition, most networks have

75: further significant properties that arise either from functional

76: constraints, from the way they have grown (fat tails, e.g., are

77: naturally explained by preferential attachment), or for other reasons.

78: As a consequence, a large number of statistical indicators have been

79: proposed to distinguish between networks with different functionality

80: (neural networks, protein transcription networks, social networks,

81: chip layouts, etc.) and between networks which were specially designed

82: or which have grown spontaneously (such as the world wide web),

83: under more or less strong evolutionary pressure. These observables

84: include various centrality measures \cite{newman_SIAM}, assortativity

85: (the tendency of nodes with similar degree to link preferentially)

86: \cite{newman_SIAM}, clustering \cite{watts-stro,newman_clust},

87: different notions of modularity \cite{barabasi,ravasz,girvan,ziv,rosvall},

88: properties of loop statistics \cite{stadler},

89: the small world property (i.e., slow increase of the effective

90: diameter of the network with the number of nodes) \cite{milgram},

91: bipartivity (the prevalence of even-sized closed walks over closed

92: walks with an odd number of steps) \cite{estrada}, and others.

93:

94: The frequencies of specific subgraphs form a particular class of

95: indicators. Subgraphs that occur more frequently than expected are

96: referred to as motifs, while those occurring less frequently are

97: anti-motifs \cite{milo,shen-orr,vasquez,kashtan}. Typically, motif

98: search requires a null model for deciding when a subgraph is over-

99: or under-abundant. The most popular null model so far has been the

100: ensemble of all random graphs with the same degree sequence. This

101: popularity is largely due to the fact that it can be simulated easily

102: by means of the so-called `rewiring algorithm' \cite{besag,maslov}.

103: As we shall see, however, in the present analysis its value is severely

104: limited, because it gives predictions

105: that are too far from those actually observed. Other null models that

106: retain more properties of the original network have been suggested

107: \cite{milo,mahadevan}, but have received much less attention. Analytic

108: approaches to null models are discussed in

109: Refs.~\cite{newman_park1,newman_park2,foster}.

110:

111: \subsection{Motifs and the Search for Structure}

112:

113: Up to now, motif search has been mainly restricted to small motifs,

114: typically with three or four nodes. Certain specific classes of

115: larger subgraphs have been examined in

116: Refs.~\cite{class1,kashtan2004b,vasquez}. With the exception of

117: Ref.~\cite{baskerville}, few systematic attempts have been made to

118: learn about significant structures at larger scale, by counting all

119: possible subgraphs (for a different approach to the discovery of

120: structure than discussed here see the work on inference of

121: hierarchy in Ref.~\cite{clauset}).

122:

123: One reason for this is that the number of non-isomorphic (i.e.

124: structurally different) subgraphs in any but the most trivial networks

125: increases extremely fast (super-exponentially) with their size. For

126: instance, the number of different undirected graphs with 11 nodes is

127: $\approx 10^9$~\cite{briggs}. Thus exhaustive studies of all possible

128: subgraphs with $>10$ nodes become virtually impossible with

129: present-day computers. But just because of this inflationary growth,

130: counts at intermediate sizes contain an enormous amount of potentially

131: useful information. Another obstacle is the notorious graph

132: isomorphism problem \cite{kobler,faulon}, which is in the NP class

133: (though probably not NP complete \cite{toran}). Existing state of the

134: art programs for determining whether any two graphs are isomorphic

135: \cite{nauty} remain too slow for our purpose.  Instead, we shall use

136: heuristics based on graph invariants similar to those put forward in

137: Ref.~\cite{baskerville}, where intermediate size motifs and

138: anti-motifs in the protein interaction network of \ecoli were

139: detected.

140:

141: The last problem when studying larger motifs, and the main one

142: addressed in the present work, is the difficulty of estimating how

143: often each possible subgraph appears in a large network, i.e. of

144: obtaining a `subgraph census'.  Most studies so far were based on

145: exact enumeration. In a network with $N$ nodes, there are ${N\choose

146: n}$ subgraphs of size $n$. With $N=500$ and $n=6$, say, this number

147: is $\approx 2\times 10^{13}$. In addition, most of the subgraphs

148: generated this way on a sparse network would be disconnected, while

149: connected subgraphs are of more intrinsic interest. Thus some

150: statistical sampling is needed. If one is willing to generate disconnected

151: as well as connected subgraphs, then uniform sampling is simple: Just

152: choose random $n$-tuples of nodes from the network \cite{baskerville}.

153: Uniform sampling of connected subgraphs is less trivial. To

154: our knowledge, the only work which addressed this systematically was

155: Kashtan {\it et al.} \cite{kashtan2004b} (for a less systematic

156: approach, see also \cite{spirin}). There, a biased sampling

157: algorithm was put forward.  While generating the subgraphs is fast,

158: the cost of computing the weight factor needed to correct for the bias

159: grows super-exponentially with $n$, making their algorithm inefficient for $n\ge 7$.

160:

161: \subsection{Graph Animals}

162:

163: In the present paper we exploit the fact that sampling connected

164: subgraphs of a finite graph resembles sampling connected clusters of

165: sites on a regular lattice.  The latter is called the {\it lattice

166:   animal} problem \cite{animals}, whence we propose to call the

167: subgraph counting problem that of {\it graph animals}. It is important

168: to recognize obvious differences between the two cases. In particular,

169: lattices are infinite and translationally invariant, while networks

170: are finite and heterogeneous (disordered). For lattice animals one

171: counts the number of configurations up to translations (i.e. per unit

172: cell of the lattice), while on a network the quantity of immediate

173: interest is the absolute number of occurrences of particular

174: subgraphs. Still, apart from these issues, the basic operations

175: involved in both cases coincide.

176:

177: Algorithms for enumerating lattice animals exactly exist and have been

178: pushed to high efficiency \cite{jensen}, but are far from trivial

179: \cite{redner}. Due to disorder, we should expect the situation to be

180: even worse for graph animals. Algorithms for stochastic sampling of

181: lattice animals are divided into two groups: Markov chain Monte Carlo

182: (MCMC) algorithms take a connected cluster and randomly deform it

183: while preserving connectivity \cite{stauffer,dickman,pivot}, while

184: `sequential' sampling algorithms grow the cluster from scratch

185: \cite{leath,hsu,gfn1998,care}. Even for regular lattices, MCMC

186: algorithms seem less efficient than growth algorithms \cite{hsu}. For

187: networks, this difference should be even more pronounced, since MCMC algorithms

188: would dwell in certain parts of the network, and averaging over the

189: different parts costs additional time. Thus we shall in the following

190: concentrate only on growth algorithms.

191:

192: All growth algorithms similar to those in

193: \cite{leath,hsu,gfn1998,care} produce unbiased samples of {\it

194:   percolation} clusters. As explained in Sec.~II, this means that they

195: sample clusters or subgraphs with non-uniform probability (for an

196: alternative algorithm, see \cite{Redner79}). Consequently, computing

197: graph animal statistics requires the computation of weights to be

198: assigned to the clusters, in order to correct for the bias. In

199: contrast to the algorithm in Ref.~\cite{kashtan2004b}, the correct

200: weights are easily and rapidly calculated in our graph animal

201: algorithm.  This is its main advantage.

202:

203: \subsection{Summary}

204:

205: In Sec.~\ref{alg} we present the graph animal algorithm in detail. The

206: method used to handle graph isomorphism is briefly reviewed in Sec.~III.

207: Extensive tests, mostly with two protein interaction networks, one for

208: \coli with 230 nodes and 695 links~\cite{ecoli}, and one for yeast

209: with 2559 nodes and 7031 links~\cite{yeast}, are presented in

210: Sec.~IV \cite{footnote}. Both networks were obtained using the TAP high

211: throughput method. In particular, our algorithm involves as a

212: free parameter a percolation probability $p$. For optimal performance,

213: in lattice animals $p$ should be near the critical value where cluster

214: growth percolates~\cite{hsu}. We show how the performance for graph

215: animals depends on $p$, on the subgraph size $n$, and on other

216: parameters. In Sec.~V we use our sampling method to study these two

217: networks systematically. We verify that large subgraphs with high link

218: density are overwhelmingly strong motifs, while nearly all large

219: subgraphs with low link density are anti-motifs

220: \cite{vasquez,baskerville} -- although our data show much more

221: structure than suggested by the scaling arguments of \cite{vasquez}.

222: We also find striking differences in the strongest motifs for the two

223: networks.  Dominant motifs for the \coli network are either bipartite

224: or close to it (with many nodes sharing the same neighbors) while

225: `tadpoles' with bodies consisting of (almost) complete graphs dominate

226: for yeast.  Our conclusions and discussions of open problems are given

227: in Sec.~VI.

228:

229: The present work only addresses undirected networks, but the graph

230: animal algorithm works without major changes also for directed

231: networks.  Due to the larger number of different directed subgraphs,

232: an exhaustive study of even moderately large subgraphs is much more

233: challenging~\cite{newpaper}.

234: % A first step in this direction will be given in

235:

236: \section{The Algorithm}

237: \label{alg}

238:

239: In this section we explain how our algorithm achieves uniform sampling

240: of connected subgraphs in undirected networks.  The graph animal

241: algorithm executes a generalization of the Leath algorithm for lattice

242: animals. The observation central to the work in

243: Refs.~\cite{leath,hsu,care} is that the animal and percolation

244: ensembles concern exactly the same clusters. The only difference

245: between the two ensembles is that clusters in the percolation ensemble

246: have different weights, while all clusters with the same number of nodes

247: (sites) have the same weight in the animal ensemble. We focus on site

248: percolation~\cite{stauffer-aharony}. Bond percolation could also be

249: used~\cite{hsu}, but this would be more complicated and is not

250: discussed here.

251:

252: \subsection{Leath growth for graph animals}

253:

254: For regular lattices and  undirected networks we use the following epidemic

255: model for growing connected clusters of sites~\cite{leath}: \\

256: (1) Choose a number $p \in [0,1]$ and a maximal cluster size $n_{\rm max}$.

257: Label all sites (nodes) as `unvisited'. \\

258: (2) Pick a random site (node)  $i_0$ as a {\it seed} for the cluster, so that

259: the cluster consists initially of only this site; mark it as `visited'.\\

260: (3) Do the following step recursively, until all boundary sites of the

261: cluster have been visited, or until the cluster consists of $n_{\rm max}$

262: sites, whichever comes first: (Note that a boundary site of a cluster $C$ is

263: a site which is  not in $C$, but which is connected to $C$ by

264: one or more edges). \\

265: (A) Choose one of the unvisited boundary sites of the present cluster, and

266: mark it as visited;

267: (B) With probability $p$ join it to the cluster.\\

268: Once a boundary site has been visited, it cannot later join the cluster; it

269: either joins the cluster when it is first visited (with probability $p$) or is

270: permanently forbidden to join (with probability $1-p$).

271:

272: The order in which the boundary (or `growth') sites are chosen

273: influences the efficiency of the algorithm, but this is irrelevant for

274: the present discussion. The growth algorithm can be seen as an

275: idealization of an epidemic process (`generalized' or SIR

276: epidemic~\cite{mollison, grass}) with three types of individuals

277: (Susceptible, Infected, Removed).  Starting with a single infected

278: individual with all others susceptible, the infected individual can

279: infect neighbours during a finite time span.  Everyone either gets

280: infected or doesn't at his/her first contact.  The latter are removed,

281: as are the infected ones after their recovery, and do not participate

282: in the further spread of the epidemic.

283:

284: Assume that for some fixed node $i_0$, a connected labeled subgraph $G^{\ell}$

285: exists, which contains $i_0$ and has $n<n_{\rm max}$ nodes and $b$ visited

286: boundary nodes. The chance that precisely this particular labeled subgraph

287: will be chosen using the algorithm is

288: \be

289:     P_{G^{\ell}}(p;i_0) \equiv P_{nb}(p;i_0)= p^{n-1}(1-p)^b \;.

290: \label{growth-prob}

291: \ee

292: Since an independent decision is made at each boundary site, this is

293: indeed the probability for $n-1$ sites to be selected to join the

294: cluster, while $b$ sites are rejected.
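
As a minimal illustration of Eq.~(\ref{growth-prob}), consider a hypothetical toy network (introduced here only as an example, not one of the networks studied) consisting of three nodes $u$, $v$, $w$ with links $u$-$v$ and $v$-$w$, seed $i_0=v$, and $n_{\rm max}>3$. Both $u$ and $w$ are boundary sites of the seed and each is tested exactly once, so the possible outcomes are
\[
\begin{array}{lccl}
\mbox{cluster} & n & b & P_{nb}(p;v) \\
\{v\}     & 1 & 2 & (1-p)^2 \\
\{u,v\}   & 2 & 1 & p(1-p) \\
\{v,w\}   & 2 & 1 & p(1-p) \\
\{u,v,w\} & 3 & 0 & p^2
\end{array}
\]
and the probabilities sum to $(1-p)^2+2p(1-p)+p^2=1$, as they should.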

295:

296: Denote by $c(G^{\ell})$ the indicator function for the existence of $G^{\ell}$,

297: i.e. $c(G^{\ell})=1$ if the subgraph exists in the network, and $c(G^{\ell})=0$

298: else. Furthermore, denote by $c(G^{\ell};i_0)$ the explicit indicator that

299: $G^{\ell}$ exists and contains the node $i_0$. Then the total number of

300: occurrences of the {\it unlabeled} subgraph $G$ is given by

301: \be

302:     c_G = n^{-1} \sum_{i=1}^N c_{G,i} = n^{-1} \sum_{i_0=1}^N \sum_{G^{\ell}\sim

303: G}c(G^{\ell};i_0),

304: \ee

305: where $c_{G,i}$ is the number of occurrences which contain node

306: $i$, and where the last sum runs over all labeled subgraphs

307: $G^{\ell}$ that are isomorphic to $G$. The factor $n^{-1}$ takes into

308: account that a subgraph with $n$ nodes is counted $n$ times.
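
To make the role of the factor $n^{-1}$ concrete, consider as a hypothetical example a star network with center node $0$ and three leaves $1,2,3$, and let $G$ be the path with $n=3$ nodes. The occurrences of $G$ are the node sets $\{0,1,2\}$, $\{0,1,3\}$ and $\{0,2,3\}$. The center belongs to all three of them and each leaf to two, so that
\[
   c_G = {1\over 3}\left(c_{G,0}+c_{G,1}+c_{G,2}+c_{G,3}\right)
       = {1\over 3}\,(3+2+2+2) = 3 \;.
\]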

309:

310: If we repeat the epidemic process $M$ times, always starting at the

311: same node $i_0$, then the expected number of times $G^\ell$ occurs is

312: \be

313:    \langle m(G^\ell;p,i_0)\rangle = M c(G^\ell;i_0) P_{G^\ell}(p;i_0)\;.

314: \ee

315: Hence, an estimator for $c_{G,i}$ based on the actual counts

316: $m(G^\ell;p,i_0)$ after $M$ trials is

317: \be

318:    {\hat c}_{G,i_0}(M) = M^{-1} \sum_{G^{\ell}\sim G} m(G^\ell;p,i_0)

319: [P_{G^\ell}(p;i_0)]^{-1} \;.

320: \ee

321: Here and in what follows carets always indicate estimators.
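
That this estimator is unbiased can be checked in one line, assuming $0<p<1$ so that all growth probabilities are non-zero. Taking the expectation and inserting the expected counts given above, the factors $M$ and $P_{G^\ell}(p;i_0)$ cancel:
\[
   \langle {\hat c}_{G,i_0}(M)\rangle
   = M^{-1} \sum_{G^{\ell}\sim G} M\, c(G^\ell;i_0)\, P_{G^\ell}(p;i_0)\,
     [P_{G^\ell}(p;i_0)]^{-1}
   = \sum_{G^{\ell}\sim G} c(G^\ell;i_0) = c_{G,i_0} \;.
\]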

322:

323: More generally, the starting nodes are chosen according to some

324: probability $Q_{i_0}$.  After $M\gg 1$ trials in total, site $i_0$ will

325: have been used as starting point on average $Q_{i_0}M$ times. This gives then the

326: estimator for the total number of occurrences of $G$

327: \bea

328:         \label{simple_estimate}

329:    {\hat c}_G(M) & = & n^{-1} \sum_{i_0=1}^N {\hat c}_{G,i_0}(Q_{i_0}M) \\

330:          & = &(nM)^{-1} \sum_{i=1}^N Q_i^{-1} \sum_{G^{\ell}\sim G} m(G^{\ell};p,i)

331: [P_{G^{\ell}}(p;i)]^{-1}.

332:         \nonumber

333: \eea

334: It is simplest to take a uniform probability $Q_{i_0} = 1/N$. But a

335: better alternative is to choose each node with a probability

336: proportional to its degree, as nodes with larger degrees have more

337: connected subgraphs attached to them. This is accomplished by choosing

338: a link with uniform probability $1/L$, where $L$ is the total number

339: of links in the network, and then choosing one of the two ends of this

340: link at random. This gives

341: \be

342:    Q_i = (2L)^{-1}k_i.                                 \label{link-prob}

343: \ee
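
As a quick consistency check, this choice is properly normalized because every link contributes to the degree of exactly two nodes:
\[
   \sum_{i=1}^N Q_i = (2L)^{-1}\sum_{i=1}^N k_i = (2L)^{-1}\, 2L = 1 \;.
\]
For illustration, take the size of the \coli network ($N=230$, $L=695$) and a hypothetical node with degree $k_i=10$ (a value assumed here, not read off the data). Such a node is chosen as seed with probability $Q_i = 10/1390 \approx 0.0072$, compared to $1/N \approx 0.0043$ for uniform node selection.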

344:

345: The algorithms of \cite{leath,care} are directly based on

346: Eq.~(5).  Their main drawback is that all

347: %Eq.~(\ref{simple_estimate}).  Their main drawback is that all

348: information from clusters which are still growing at size $n$ is not

349: used. Clusters whose growth had stopped at sizes $<n$ don't contribute

350: to ${\hat c}_G$ either, of course. Thus only those that stop growing

351: exactly at size $n$ are used in Eq.~(5). This

352: %exactly at size $n$ are used in Eq.~(\ref{simple_estimate}). This

353: requires, among other things, a careful choice of $p$: If $p$ is too

354: large, too many clusters survive past size $n$, while in the opposite

355: case too few reach this size at all. But even with the optimal choice

356: of $p$, most of the information is wasted.

357:

358: \subsection{Improved Leath method}

359:

360: The major improvement comes from the following observation \cite{hsu}:

361: Assume that a cluster has grown to size $n$, and among the $b$

362: boundary sites there are exactly $g$ which have not yet been tested

363: (`growth sites'). Thus growth has definitely stopped at $b-g$ already

364: visited boundary sites, while the growth on the remaining $g$ boundary

365: sites depends on future values of the random variable used to decide

366: whether they are going to be infected. With probability $(1-p)^g$ none

367: of them are susceptible, and the growth will stop at the present

368: cluster size $n$. Thus we can replace the counts $m(G^\ell;p,i_0)$ in

369: the estimator for $c_G$ by the counts of `unfinished' subgraphs,

370: provided we weigh each occurrence of a subgraph isomorphic to $G$ with

371: an additional weight factor $(1-p)^g$. Formally, this gives, with

372: uniform initial link selection (Eq.~(\ref{link-prob})),

373: \bea

374: {\hat c}_G & = &{2L\over nM} \sum_{i=1}^N k_i^{-1} \sum_{G^{\ell}\sim G}

375: p^{1-n}(1-p)^{g-b} \;\;\times  \nonumber \\

376: & & \times \;\; m_{\rm unfinished}(G^{\ell};p,i,g)\;.

377:     \label{estimate}

378: \eea

379: The quantity $m_{\rm unfinished}(G^{\ell};p,i,g)$ is the number of

380: epidemics (with parameter $p$) that start at node $i$, give a

381: labeled subgraph $G^{\ell}$ of infected nodes, and leave $g$

382: unvisited boundary nodes. The factor $p^{1-n}(1-p)^{g-b}$ has a simple

383: interpretation.  In analogy to Eq.~(\ref{growth-prob}), it is the inverse of the

384: probability to grow a cluster with $n-1$ nodes in addition to the

385: start node, $g$ growth nodes, and $b-g$ blocked boundary nodes,

386: \be

387:    P_{nbg}(p;i_0) = p^{n-1}(1-p)^{b-g} \;.

388: \label{growth-prob-2}

389: \ee

390: Eq.~(\ref{estimate}) is the number of generated clusters, reweighted

391: with their inverse probabilities to be sampled, given they exist.

392: It is the formula we use to estimate frequencies of occurrences of

393: connected subgraphs in the protein interaction networks as discussed

394: later in the text.
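
A small numerical sketch may help to fix the orders of magnitude in Eq.~(\ref{estimate}). The values $p=0.1$, $n=4$, $b=10$ and $g=3$ are assumed here purely for illustration. Eq.~(\ref{growth-prob-2}) then gives
\[
   P_{nbg} = (0.1)^{3}\,(0.9)^{7} \approx 4.78\times 10^{-4} \;,
   \qquad
   [P_{nbg}]^{-1} = p^{1-n}(1-p)^{g-b} \approx 2.09\times 10^{3} \;.
\]
The probability that none of the $g=3$ growth sites joins the cluster is $(0.9)^3 = 0.729$, and indeed
\[
   P_{nbg}\,(1-p)^{g} \approx 4.78\times 10^{-4}\times 0.729
   \approx 3.49\times 10^{-4} = (0.1)^{3}(0.9)^{10} = P_{nb} \;,
\]
in agreement with Eq.~(\ref{growth-prob}) for a cluster that has stopped growing.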

395:

396: \subsection{Resampling}

397:

398: In principle, Eq.~(\ref{estimate}) can be improved further.

399: Ref.~\cite{hsu} shows how to use the equivalents of

400: Eqs.(\ref{estimate},\ref{growth-prob-2}) for lattice animals as a

401: starting point for a re-sampling scheme. For completeness, re-sampling

402: for graph animals is briefly explained, even though it is not used in

403: this work.

404:

405: For each cluster that is still growing a {\it fitness function} is defined as

406: \be

407:    f_{nbg}(p) = p^{1-n}(1-p)^{-b} = [P_{nbg}(p;i_0)]^{-1}/ (1-p)^g.

408: \label{fitness}

409: \ee

410: Clusters with too small fitness are killed, while clusters with too

411: large fitness are cloned, with both the fitness and the weight being

412: split evenly among the clones. The first factor in the fitness is just

413: proportional to the weight, while the second factor takes into account

414: that clusters with larger $g$ have more possibilities to continue

415: their growth, and thus should be more `valuable'. The precise form of

416: Eq.(\ref{fitness}) is purely heuristic, but was found to be near

417: optimal in fairly extensive tests.
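
To see how the second factor acts, write the weight of a growing cluster as $W_{nbg} = [P_{nbg}(p;i_0)]^{-1}$, so that $f_{nbg} = W_{nbg}\,(1-p)^{-g}$. As an illustrative example with assumed values, compare two clusters with equal weight at $p=0.1$, one with $g=0$ and one with $g=3$ growth sites. Their fitness ratio is
\[
   {f_{g=3}\over f_{g=0}} = (1-p)^{-3} = {1\over 0.729} \approx 1.37 \;,
\]
so that the cluster with more growth sites is favored by about $37\%$ at equal weight.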

418:

419: This resampling scheme was found to be essential, if one wants to

420: sample clusters of sizes $n>100$. In \cite{hsu}, the emphasis was on

421: very large clusters (several thousand sites), and thus resampling was

422: a necessity. Here, in contrast, we concentrate on subgraphs with

423: $\approx 10$ nodes or less, and stick to the simpler scheme without

424: resampling.  With respect to graph animals, we point out that optimal

425: fitness thresholds for pruning and cloning depend in an irregular

426: network on the start node, $i_0$, and have to be learned for each

427: $i_0$ separately. Although a similar strategy achieves success for

428: dealing with self avoiding walks on random lattices~\cite{randomSAW},

429: this is much more time consuming than for regular lattices.

430:

431: \subsection{Implementation details}

432:

433: For fast data access, we used several redundant data structures. The

434: adjacency matrix was stored directly as an $N\times N$ matrix with

435: elements 0/1 and as a list of linked pairs $(i,j)$, i.e. as an array

436: of size $L\times 2$. The first is needed for fast checking of which

437: links are present in a subgraph, while the second is the format in

438: which the networks were downloaded from the web. Finally, for fast

439: neighbor searches, the links were also stored in the form of linked

440: lists. To test whether a site was visited during the growth of the

441: present (say $k$-th, $k=1\ldots M$) subgraph, an array {\sf s[i]} of

442: size $N$ and type {\sf unsigned int} was used, which was initialized as

443: {\sf s[i]=0, i = 0,...N-1}. Each time a site {\sl i} was visited, we

444: set {\sf s[i] = k}, and {\sf s[i] $<$ k} was used as indicator that

445: this site had not been visited during the growth of the present

446: cluster.

447:

448: In Leath-type cluster growth, there are two popular variants. Untested

449: sites in the boundary can be written either into a first-in first-out

450: queue, or into a stack (first-in last-out queue). It was found in

451: \cite{hsu} that these two possibilities, whose efficiency is roughly

452: the same when Eq.~(5) is used, give vastly

453: %the same when Eq.(\ref{simple_estimate}) is used, give vastly

454: different efficiency with Eq.(\ref{estimate}), in particular (but not

455: only) in combination with resampling. In that case, the first-in

456: first-out queue gives much better results, and we use this method to

457: get the numerical results shown later.

458:

459: \section{Subgraph Classification}

460:

461: After sampling a labeled subgraph $G^{\ell}$, one has to find its

462: isomorphism class $G$ (i.e., $G^{\ell}\sim G$), by testing which

463: of the representatives for isomorphism classes it can be mapped onto by

464: permuting the node labels.  State-of-the-art computer programs for

465: comparing two graphs, such as NAUTY~\cite{nauty}, proceed in two

466: steps. First, some invariants are calculated such as the number of

467: links, traces of various powers of the adjacency matrix, a sorted list

468: of node degrees, etc. In most cases, this shows that the two graphs

469: are not isomorphic (if any of these invariants disagree), but

470: obviously this does not resolve all cases. When ambiguities remain,

471: each graph is transformed into a standard form by a suitable

472: permutation, and the standard forms are compared. The standard form

473: is, of course, also a special invariant, so the distinction between

474: ``invariants'' and ``standard form'' might seem arbitrary. It becomes

475: relevant in practice, since the user of the package can specify which

476: invariants (s)he deems relevant, while the calculation of the

477: standard form is at the core of the algorithm and cannot be changed.

478:

479: It is mostly the second step in this scheme which is time limiting

480: and which renders it useless for our purposes -- although some

481: invariants suggested e.g. by NAUTY are also quite demanding in CPU

482: time. Thus we skip the second step and only use invariants that are fast to compute.

483: All these invariants, except for the number $n$ of nodes and the number

484: $\ell$ of links in the subgraph, are combined into a single index $I$, which is

485: intended to be a good discriminator between all non-isomorphic subgraphs

486: with the same $n$ and $\ell$. Whenever a new subgraph is found, the

487: triplet ($n, \ell , I$) is calculated and compared to triplets that

488: have already appeared.  If the triplet appeared previously, the

489: counter for this triplet is increased by 1; if not, a new counter is

490: initiated and set to 1.

491:

492: Since no known invariant (other than standard form) can discriminate

493: between any two graphs, any method not using it is necessarily

494: heuristic. Some of the invariants we used are those defined in

495: Ref.~\cite{baskerville}. In addition, we use invariants based on

496: powers of the adjacency matrix and of its complement. More precisely,

497: if $A_{ij}$ is the adjacency matrix of a subgraph, then we define its

498: complement by $B_{ij} = 1-A_{ij}$ for $i\neq j$ and $B_{ij} = 0 =

499: A_{ij}$ for $i=j$. Any trace of any product $A^{a_1} B^{b_1} A^{a_2}

500: \ldots$ is invariant, and can be computed quickly. The same is true

501: for the number of non-zero elements of any such product, and for the

502: sum of all its matrix elements. The index $I$ is then either a linear

503: combination or a product (taken modulo $2^{32}$) of these invariants.

504: The particular choices were {\it ad hoc} and there is no reason to

505: believe they are optimal; hence those details are not given here.
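
As a simple illustration of how such traces discriminate between graphs with the same $n$ and $\ell$ (the example graphs are chosen here for illustration, and we do not claim that the index $I$ contains exactly these terms), recall that ${\rm tr}\,A^2 = 2\ell$ and ${\rm tr}\,A^3 = 6t$, where $t$ is the number of triangles. For $n=4$ and $\ell=3$ the two connected graphs are the path and the star. Both have ${\rm tr}\,A^2=6$ and ${\rm tr}\,A^3=0$, but for triangle-free graphs without four-cycles ${\rm tr}\,A^4 = 2\sum_i k_i^2 - 2\ell$, which gives
\[
   {\rm tr}\,A^4_{\rm path} = 2\,(1+4+4+1) - 6 = 14 \;,
   \qquad
   {\rm tr}\,A^4_{\rm star} = 2\,(9+1+1+1) - 6 = 18 \;.
\]
Hence already the fourth power of $A$ separates these two graphs.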

506:

507: With the indices described in~\cite{baskerville}, all undirected

508: graphs of sizes $n\leq 8$ and all directed graphs with up to $5$ nodes

509: are correctly classified. In this work, a faster algorithm for

510: counting loops is used; hence loop counting is always included, in

511: contrast to the work of \cite{baskerville}. Index calculation based on

512: matrix products is even faster but less precise: only 11112 out of all

513: 11117 non-isomorphic connected graphs with $n=8$ were

514: distinguished, and for directed graphs with $n=5$ just 4 graphs out of

515: 9608~\cite{integer-seq} were missed. For larger subgraphs we were not

516: able to test the quality of the indices systematically, but we can

517: cite some results for $n=9$. Using indices based on matrix products,

518: we found 239846 different connected subgraphs with $n=9$ in the \coli

519: protein interaction network~\cite{ecoli} and its rewirings. Given the

520: fact that there are only 261080 different connected graphs with $n=9$

521: \cite{integer-seq}, that many of them might not appear in the \coli

522: network, and that our sampling was not exhaustive, our graph

523: classification method failed to distinguish at most 9\% of the

524: non-isomorphic graphs -- and probably many fewer.
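
The arithmetic behind these statements is elementary. For $n=9$ the fraction of connected graphs not found as distinct index values is
\[
   {261080 - 239846 \over 261080} = {21234 \over 261080} \approx 0.081 \;,
\]
which is the origin of the bound of 9\%. For comparison, the fractions of graphs missed by the matrix-product indices in the exhaustive tests quoted above are $5/11117 \approx 4.5\times 10^{-4}$ for undirected graphs with $n=8$ and $4/9608 \approx 4.2\times 10^{-4}$ for directed graphs with $n=5$.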

525:

526:

527: \section {Numerical Tests of the Sampling Algorithm}

528:

529: To test the graph animal algorithm, we first sampled both $n=4$ and

530: $n=5$ subgraphs of the \coli network, as well as $n=4$ subgraphs of

531: the yeast network. In these cases exact counts are possible, and we

532: verified that the results from sampling agreed with results from exact

533: enumeration within the estimated (very small) errors. To obtain these results

534: we used crude estimates for optimal $p$ values, namely $p=0.11$ for \coli

535: and $p=0.03$ for yeast. For larger subgraphs more precise estimates

536: for the optimal $p$ are required.

537:

538: \subsection{Optimal values for $p$}

539:

540: When $p$ is too small, only small clusters are regularly encountered.

541: If $p$ is too large, performance decreases because the weight factors

542: in Eq.~(\ref{estimate}) depend too strongly on the number of blocked

543: boundary sites, $b-g$. The latter varies from instance to instance,

544: and this can create huge fluctuations in the weights given to

545: individual subgraphs.

546:

547: The networks we are interested in are sparse ($L/N \approx {\rm const} \ll

548: N$) and approximately scale-free \cite{yeast}.  As a result, most

549: nodes have only a few links, but some `hubs' have very high degree. In

550: fact, the degrees of the strongest hubs may diverge in the limit

551: $N\to\infty$.  For such networks it is well-known that the threshold

552: for spreading of an infinite SIR epidemic is zero~\cite{pastor}. On

553: finite networks this means that one can create huge clusters even for

554: minute $p$, and this tendency increases as $N$ increases.  Thus, we

555: anticipate the optimal $p$ to be small, and to decrease noticeably in

556: going from the \coli ($N=230$) to the yeast network ($N=2559$). This

557: is, in fact, what we find.

558:

559: \begin{figure}

560:   \begin{center}

561:    \psfig{file=rms-errors.ps,width=6.4cm,angle=270}

562:    \caption{(color online) Root mean square relative errors of connected

563:      subgraph counts, Eq.~(\ref{sigma}), for the yeast ($n=5$ to $8$)

564:      and \coli ($n=7$) networks. In most cases, clear minima indicate

565:      roughly the optimum value for $p$, with caveats as explained in

566:      the text. Each data point is based on $4\times 10^9$ generated

567:      subgraphs. Smaller values of $\sigma_n(p)$ indicate that the

568:      census for subgraphs with $n$ nodes is on average more precise.}

569: \label{count-errors.fig}

570: \end{center}

571: \end{figure}

572:

573: As a first test, we compute the root mean square relative errors of

574: the subgraph counts, averaged over all subgraphs of fixed size $n$.

575: Let $\gamma_n$ be the number of different subgraphs of size $n$ found,

576: and let $\Delta c_G$ be the error of the count for subgraph $G$. These

577: errors were estimated by dividing the set of $M$ independent samples

578: into bins, and estimating the fluctuations from bin to bin. Then

579: \be

580:    \sigma_n(p) = \left[{1\over \gamma_n} \sum_{j=1}^{\gamma_n}

581:            (\Delta c_{G_j}/{\hat c_{G_j}})^2\right]^{1/2}.

582: \label{sigma}

583: \ee

584: Smaller values of $\sigma_n(p)$ indicate that the subgraph census is

585: on average more precise.
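
One common way to implement such a binning estimate, given here only as an illustrative sketch (the number of bins $B$ and the per-bin estimates ${\hat c}_G^{(\beta)}$ are our notation and are not specified in the text), is to take the standard error of the mean of the bin estimates,
\be
   (\Delta c_G)^2 = {1\over B(B-1)} \sum_{\beta=1}^{B}
       \left({\hat c}_G^{(\beta)} - {\hat c}_G\right)^2 , \qquad
   {\hat c}_G = {1\over B}\sum_{\beta=1}^{B} {\hat c}_G^{(\beta)} ,
\ee
which assumes bins of equal size. The estimator actually used may differ in detail.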

586: Fig.~\ref{count-errors.fig} shows results for the yeast network, with

587: various values of $p$ and $n$. Also shown are data for the \coli

588: network, for $n=7$. Each simulation used for this figure (i.e., each

589: data point) involved $M = 4\times 10^9$ generated clusters. Our first

590: observation is that the results for \coli are much more precise than

591: those for yeast.  This is mainly due to smaller hubs ($k_{\rm

592:   max}^{\rm E.\,coli} = 36$, while $k_{\rm max}^{\rm yeast} = 141$), so

593: that much larger $p$ values~\cite{footnote2} could be used. Also in

594: all other aspects, our algorithm worked much better for the \coli

595: network than for yeast.  Therefore we exhibit in the rest of this

596: section only results for yeast, implying that whenever a test was

597: positive for yeast, an analogous test had been made for \coli with at

598: least as good results.

599:

600: Even with the large sample sizes used in Fig.~\ref{count-errors.fig},

601: many $n=8$ subgraphs were found only once (in which case we set

602: $\Delta c_{G_j} /{\hat c}_{G_j}=1$), which explains the high values of

603: $\sigma_8(p)$. This is also why we do not show any data for $n>8$ in

604: Fig.~\ref{count-errors.fig}.  The relative error $\sigma_n(p)$ for

605: each $n<8$ shows a broad minimum as a function of $p$. The increase in

606: $\sigma_n(p)$ at small $p$ is because of the paucity of different

607: graphs being generated. This effect grows when $n$ increases, explaining why

608: the minimum shifts to the right with increasing $n$. The increase of

609: $\sigma_n(p)$ for large $p$, in contrast, comes from large

610: fluctuations of weights for individual sampled graphs.  When $p$ is

611: large, the factor $(1-p)^{g-b}$ entering the weights of Eq.~(\ref{estimate}) can also be

612: large, particularly in the presence of strong hubs.

613:

614: Unfortunately, if a subgraph is found only once, it is impossible to

615: decide whether or not the frequency estimate is reliable. Even for

616: strong outliers, when the frequency estimate is far too large, the

617: formal error estimate cannot be larger than $\Delta c_G = O({\hat

618:   c}_G)$. This underestimates the true statistical errors and is

619: partially responsible for the fact that the curve for $n=8$ in

620: Fig.~\ref{count-errors.fig} does not increase at large $p$

621: \cite{footnote3}.

622:

623: \begin{figure}

624:   \begin{center}

625:    \psfig{file=Fig2-weighthist.ps,width=6.3cm,angle=270}

626:    \caption{(color online) Histograms of $wP(\ln w) = w^2P(w)$ for

627:      connected $n=8$ subgraphs of the yeast network. Each curve

628:      corresponds to one run ($4\times 10^9$ generated subgraphs)

629:      with fixed value of $p$. Results are more reliable the

630:      further to the left the maximum of the curve lies and the

631:      faster its tail decreases at large $w$.}

632: \label{weight-hist.fig}

633: \end{center}

634: \end{figure}

635:

636: A more direct understanding of the decreasing performance at large

637: $p$ comes from histograms of the (logarithms of) weight factors. Such

638: histograms, for $n=8$ subgraphs in the yeast network, are shown in

639: Fig.~\ref{weight-hist.fig}. From the results in Section~\ref{alg},

640: \be

641:    w = {2L\over nMk}p^{1-n}(1-p)^{g-b}

642: \ee

643: is the weight for a subgraph with $n$ nodes, $b$ boundary nodes, and

644: $g$ growth nodes.

645: %$P(\ln w) = wP(w)$ is the probability distribution function of $\ln w$.

646: The algorithm produces reliable estimates if $P(w)$ decreases for

647: large $w$ faster than $1/w^2$, since averages

648: (which are weighted by $w$) are then dominated by subgraphs that are

649: well sampled. If, in contrast, $P(w)$ decreases more slowly, then the

650: tail of the distribution dominates, and the results cannot be taken at

651: face value \cite{grass-PERM}. We observe from

652: Fig.~\ref{weight-hist.fig} that the data for $n=8$ are indeed reliable

653: only for $p<0.07$. The curve for $p=0.09$ in

654: Fig.~\ref{weight-hist.fig} also bends over at very large values of

655: $w$, indicating that even for this $p$ our estimates should eventually be

656: reliable, when the sample sizes become sufficiently large.  But this

657: would require extremely large sample sizes.
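
The role of the $1/w^2$ threshold can be made explicit with a short calculation. Since the estimates are averages weighted by $w$, the relevant integral is
\be
   \int w\,P(w)\,dw = \int w^2 P(w)\, d\ln w ,
\ee
so that $w^2P(w) = wP(\ln w)$, the quantity plotted in Fig.~\ref{weight-hist.fig}, is the contribution per unit $\ln w$. As an illustration, assume a power-law tail $P(w)\sim w^{-\alpha}$ (the exponent $\alpha$ is introduced here only for this example). The integrand then behaves as $w^{2-\alpha}$, which decays at large $w$ only if $\alpha>2$; for $\alpha\le 2$ the rare largest weights dominate the average.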

658:

659: As a last test we checked whether the estimates $\hat{c}_G$ are

660: independent of $p$ as they should be. Fig.~\ref{estimate-p.fig} shows

661: the estimates obtained for $n=8$ subgraphs in the yeast network with

662: $p=0.025$ and $p=0.07$ against those obtained with $p=0.04$.  Clearly,

663: the data cluster along the diagonal -- showing that the estimates are

664: basically correct. They scatter more when the counts are lower (i.e.

665: in the lower left corner of the plot).  The asymmetries in that region

666: result from the fact that rarely occurring subgraphs are completely

667: missed for $p=0.04$ and even more so for $p=0.025$, cutting off

668: thereby the distributions at small ${\hat c}_G$. For larger counts,

669: the estimates for $p=0.025$ are more precise than those for $p=0.07$.

670: The latter show high weight ``glitches'' arising from the tail of

671: $P(w)$ discussed earlier in this section.

672:

673: \begin{figure}

674:   \begin{center}

675:    \psfig{file=Fig3-25_7-4-x.ps,width=7.6cm,angle=270}

676:    \caption{(color online) Scatter plots of ${\hat c}_G(p=0.025)$ and

677:      ${\hat c}_G(p=0.07)$ against ${\hat c}_G(p=0.04)$ for connected $n=8$

678:      subgraphs of the yeast network. The clustering of the data along

679:      the diagonal indicates the basic reliability of the estimates,

680:      independent of the precise choice of $p$. Sample sizes were $4\times

681:      10^{10}$ for $p=0.04$, $2.4\times 10^{10}$ for $p=0.025$, and

682:      $8\times 10^9$ for $p=0.07$. The latter two correspond to roughly

683:      the same CPU time.}

684: \label{estimate-p.fig}

685: \end{center}

686: \end{figure}

687:

688: For increasing $p$, the numbers $m_G$ of generated subgraphs of type

689: $G$ increase of course (as the epidemic survives longer), so that

690: average weights, defined as $\langle w_G\rangle = {\hat c}_G M / m_G$,

691: decrease. But this decrease is not uniform for all $G$. Rather, it is

692: strongest for fully connected subgraphs ($\ell = n(n-1)/2$), and is

693: weakest for trees. For the yeast network and $n=8$, e.g., $\langle

694: w_G\rangle$ averaged over all trees decreases by a factor $\sim 18$

695: when $p$ increases from 0.025 to 0.085, while $\langle w_G\rangle$ averaged

696: over all graphs with $\ell \ge 25$ decreases by a factor $\sim 1700$. Smaller

697: values of $\langle w_G\rangle$ are preferable, as they imply

698: smaller fluctuations.  Thus it would be most efficient to use larger

699: $p$ values for highly connected subgraphs, and smaller $p$ for

700: tree-like subgraphs.  Counting very highly connected subgraphs --

701: where every node has a degree in the subgraph $\geq k_0$, say -- is also made easier by

702: first reducing the network to its $k$-core with $k=k_0$, and then

703: sampling from the latter.
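
The meaning of these factors can be spelled out under the assumption, supported by Fig.~\ref{estimate-p.fig}, that ${\hat c}_G$ itself does not depend on $p$. From the definition of $\langle w_G\rangle$, two runs with parameters $p_1$ and $p_2$ and sample sizes $M_1$ and $M_2$ (labels introduced here for illustration) then satisfy, for an individual subgraph type $G$,
\be
   {\langle w_G\rangle_{p_1} \over \langle w_G\rangle_{p_2}} =
   {m_G(p_2)/M_2 \over m_G(p_1)/M_1} ,
\ee
so a decrease of $\langle w_G\rangle$ by a factor $\sim 18$ corresponds to subgraphs of type $G$ being generated $\sim 18$ times more often per attempt.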

704:

705:

706: \section{Results}

707:

708: \subsection{Characterization of the networks}

709:

710: As already stated, both networks as we use them are connected, i.e.\ each

711: consists of a single component \cite{footnote}.

712: The \coli network has 230 nodes and 695 links, while the yeast network

713: has 2559 nodes and 7031 links. Both networks show strong clustering,

714: as measured by the clustering coefficients \cite{watts-stro}

715: \be

716:    C_i = {2\over k_i(k_i-1)}\sum_{j<m} A_{jm},

717: \ee

718: where $k_i$ is the degree of node $i$ and the sum runs over all pairs

719: of nodes linked directly to $i$. In Fig.~\ref{clustering.fig} we show

720: averages of $C_i$ over all nodes with fixed degree $k$. We see that

721: $\langle C\rangle_k$ is quite large, but has a noticeably different

722: dependence on $k$ for the two networks.  While it decreases with $k$

723: for \colp, it attains a maximum at $k\approx 15$ for yeast.

724:

725: \begin{figure}

726:   \begin{center}

727:    \psfig{file=Fig_clusterings.ps,width=6.3cm,angle=270}

728:    \caption{(color online) Average clustering coefficients for nodes

729:      with fixed degree $k$ plotted versus the degree, for

730:      the giant component of the yeast and \coli protein interaction

731:      networks. While the clustering coefficient decreases with

732:      $k$ for \colp, it attains a maximum at $k \approx 15$ for yeast.}

733: \label{clustering.fig}

734: \end{center}

735: \end{figure}

736:

737: \begin{figure}

738:   \begin{center}

739:    \psfig{file=Fig_kcores.ps,width=6.3cm,angle=270}

740:    \caption{(color online) Sizes of the $k$-cores for the two networks,

741:      plotted against $k$. Notice that the $k$-cores for yeast contain a

742:      nearly fully connected cluster with 17 nodes. In addition to the

743:      core sizes for the original networks, the figure also shows average

744:      core sizes for rewired networks as discussed in section V C.}

745: \label{k-cores.fig}

746: \end{center}

747: \end{figure}

748:

749: The unweighted average clustering ${\bar C} = N^{-1}\sum_{i=1}^N C_i$

750: is 0.1947 for yeast, and 0.2235 for \colp.  Due to the different

751: behavior of $\langle C\rangle_k$, the ranking is reversed for the

752: weighted averages

753: \be

754:     \langle C\rangle = {\sum_{i=1}^N C_i k_i(k_i-1) \over

755:       \sum_{i=1}^N k_i(k_i-1)} = {3n_\Delta\over 3n_\Delta + n_\vee},

756: \ee

757: where $n_\Delta$ is the number of fully connected triangles on the

758: network and $n_\vee$ is the number of triads with two links

759: (see~\cite{newman_clust} for a somewhat different formula).

760: Numerically, this gives $\langle C\rangle = 0.1948$ for yeast and

761: 0.1552 for \colp. This can be understood as a consequence of the fact

762: that the relative frequency of fully connected triangles is higher in

763: yeast than in \colp: in yeast (\colp) there are 6969 (478) triangles

764: compared to 86291 (7805) triads with two links.
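
The second equality in the formula for $\langle C\rangle$ can be checked by a short counting argument, sketched here assuming a simple undirected network. The quantity $C_i k_i(k_i-1)/2$ is the number of triangles containing node $i$, and summing over $i$ counts each triangle three times, giving $3n_\Delta$. Likewise, $k_i(k_i-1)/2$ is the number of pairs of neighbors of $i$, and summing over $i$ counts each triangle three times and each two-link triad once, giving $3n_\Delta+n_\vee$. Inserting the counts quoted above,
\bea
   \langle C\rangle_{\rm yeast} &=& {3\times 6969 \over 3\times 6969 + 86291}
        = {20907\over 107198} \approx 0.195 , \nonumber \\
   \langle C\rangle_{\rm E.\,coli} &=& {3\times 478 \over 3\times 478 + 7805}
        = {1434\over 9239} \approx 0.155 .
\eea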

765:

766: Associated with this difference are distinctions between the $k$-cores

767: \cite{seidman}

768: of the two networks. Fig.~\ref{k-cores.fig} shows the sizes of the

769: $k$-cores against $k$. We see that the yeast network contains

770: non-empty cores with $k$ up to 15. Moreover, the core with $k=15$ has

771: exactly 17 nodes. It is a nearly fully connected subgraph with just

772: one missing link. All 17 proteins in this core are parts of the 26S

773: proteasome which consists of 20 or 21 proteins \cite{mips,sgd}. All

774: these proteins presumably interact very strongly with each other. When the

775: interactions between the proteins within the 26S proteasome are taken

776: out (the corresponding elements of the adjacency matrix are set to

777: zero), the $k$-core with highest $k$ has $k=12$ and consists of 15

778: nodes. All its nodes correspond to proteins in the mediator complex of

779: RNA polymerase II \cite{mips}, which contains 20 proteins altogether.

780: After eliminating all interactions between these, two 11-cores with

781: respectively 13 and 14 nodes remain, the first corresponding to the

782: 20S proteasome and the second corresponding to the RSC complex

783: \cite{mips}. Again these particular complexes have only a few more

784: proteins than those contained within their largest $k$-cores, so they

785: are very tightly bound together. All remaining complexes appear to be

786: more loosely bound, so that much of the strong larger scale clustering

787: in the yeast network (involving 7 - 10 nodes) can be traced to only a

788: few tightly bound complexes.  This has a big effect on the subgraph

789: counts, as we shall see.

790:

791: \subsection{Trends in Subgraph counts}

792:

793: Subgraph counts ${\hat c}_G$ for the \ecoli and yeast networks, plotted

794: against $n^2 +2\ell$, are shown in Figs.~\ref{ecoli-subgraphs.fig} and

795: \ref{yeast-subgraphs.fig}. For large $n$ we see a very wide range,

796: with counts varying between 1 and $>10^8$.  In general, counts

797: decrease with increasing number of links, i.e. trees are most

798: frequent. This is a direct consequence of the fact that the networks

799: are sparse. Even when $n$ and $\ell$ are fixed, the counts $c_G$ can

800: range over six orders of magnitude (e.g. for yeast with $n=8$ and

801: $\ell=17$).
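
That the variable $n^2+2\ell$ separates the data can be verified directly. The following bounds are our own elementary check and use only that a connected subgraph with $n$ nodes has between $n-1$ links (tree) and $n(n-1)/2$ links (clique):
\be
   n^2+2n-2 \;\le\; n^2+2\ell \;\le\; 2n^2-n .
\ee
For $n=6,7,8$ this gives the ranges $46$--$66$, $61$--$91$, and $78$--$120$. Ranges for consecutive $n$ can overlap, but $n^2+2\ell$ has the same parity as $n$, so values for $n$ and $n+1$ never coincide, and the ranges for $n$ and $n+2$ are disjoint for all $n\le 7$. Hence each pair $(n,\ell)$ with $n\le 8$ maps to a distinct value.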

802:

803: \begin{figure}

804:   \begin{center}

805:    \psfig{file=Fig-ecoli-counts.ps,width=6.3cm,angle=270}

806:    \caption{(color online) Counts for connected subgraphs with fixed topology

807:      and with $n\le 8$ in the \coli network, plotted against $n^2

808:      +2\ell$. The variable $n^2 +2\ell$ is used to spread out the

809:      data, so that the dependence on both $n$ and $\ell$ (number of

810:      links) can be seen independently, without data points

811:      overlapping. For most of the points, the error bars are smaller

812:      than the sizes of the symbols.}

813: \label{ecoli-subgraphs.fig}

814: \end{center}

815: \end{figure}

816:

817: \begin{figure}

818:   \begin{center}

819:    \psfig{file=Fig-yeast-counts.ps,width=6.3cm,angle=270}

820:    \caption{(color online) Counts for subgraphs with fixed topology

821:      and with $n\le 8$ in the yeast network, plotted against

822:      $n^2+2\ell$ as in Fig.~\ref{ecoli-subgraphs.fig}.}

824: \label{yeast-subgraphs.fig}

825: \end{center}

826: \end{figure}

827:

828: For the yeast network, there are clear systematic trends for the

829: counts at fixed $n$ and $\ell$. The most frequent subgraphs are those

830: with strong heterogeneity, i.e. with a large variation of the degrees

831: (within the subgraph) of nodes, while the most rare are those with

832: minimal variation. Fig.~\ref{yeast-var.fig} shows the counts ${\hat c}_G$ for $n=8$

833: and with four different values of $\ell$ plotted against the variance

834: of the degrees of the nodes within the subgraph,

835: \be

836:    \sigma^2 = {1\over n}\sum_{i=1}^n k_i^2 - \left[{1\over n}\sum_{i=1}^n k_i\right]^2.

837:                    \label{var}

838: \ee

839: For all four curves we see a trend, where the count increases with

840: $\sigma$, but hardly any trend like this is seen for the \coli network

841: (data not shown). The effect seen in the yeast data is probably

842: related to the very strongly connected core in that network (see the

843: last subsection). As we shall also see later in subsection D,

844: subgraphs with high counts in yeast often have a tadpole form with a

845: highly connected body (which is part of one of the densely connected

846: complexes discussed in the last subsection) and a short

847: tail attached to it. These cores may also be responsible for the main

848: difference between Figs.~\ref{ecoli-subgraphs.fig} and

849: \ref{yeast-subgraphs.fig}, namely the strong representation of very

850: highly connected (large $\ell$) subgraphs in the yeast network. Taking out all

851: interactions within the 26S and 20S proteasomes, within the mediator

852: complex and within the RSC complex reduces substantially the counts for

853: highly connected subgraphs. The count for the complete $n=7$ subgraph, e.g.,

854: is reduced in this way from $25164\pm 68$ to $682\pm 23$. The removal

855: of interactions within the 26S proteasome makes by far the biggest

856: contribution.
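
To illustrate the range of the degree variance in Eq.~(\ref{var}), consider two trees with $n=8$ and $\ell=7$ (examples chosen here for illustration, not taken from the data). Since $\sum_i k_i = 2\ell$, the mean degree is $2\ell/n = 1.75$ for both. A star has one node of degree 7 and seven nodes of degree 1, while a linear chain has six nodes of degree 2 and two nodes of degree 1:
\bea
   \sigma^2_{\rm star} &=& {49+7\over 8} - 1.75^2 = 3.9375 , \nonumber \\
   \sigma^2_{\rm chain} &=& {24+2\over 8} - 1.75^2 = 0.1875 .
\eea
These are the extreme cases among trees of this size: the star is the most heterogeneous and the chain the most homogeneous.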

857:

858: \begin{figure}

859:   \begin{center}

860:    \psfig{file=Fig.yeast-var-count.ps,width=6.3cm,angle=270}

861:    \caption{(color online) Counts for $n=8$ subgraphs of the yeast

862:      network with $\ell = 7, 10,13,$ and $18$, plotted against the

863:      variance of the node degrees within the subgraphs, as given by

864:      Eq.~(\ref{var}). Zero variance means that all nodes

865:      have exactly the same degree, whereas a higher variance indicates

866:      that the node degrees differ more widely. Typically, subgraphs

867:      with more variation in their node degrees (and thus with larger $\sigma^2$)

868:      have higher counts than those for which the degrees within the

869:      subgraph are more uniform.}

870: \label{yeast-var.fig}

871: \end{center}

872: \end{figure}

873:

874: \subsection{Zipf plots}

875:

876: In~\cite{baskerville} it was found that ``Zipf plots'' (subgraph counts

877: vs. rank) in the \coli network exhibit power law behavior, whose

878: origin is not yet understood. The essential difference between the

879: subgraph counts in~\cite{baskerville} and in the present paper is that

880: we sample only connected subgraphs, while {\it all} subgraphs with

881: given $n$ were ranked in~\cite{baskerville}.  Also, noting that

882: disconnected subgraphs are more likely to be sampled than connected

883: ones when picking nodes at random (due to the sparsity of the

884: networks), we can go to much higher ranks for the connected subgraphs.

885:

886: Zipf plots for connected subgraphs in the \coli network are shown in

887: Fig.~\ref{zipf.fig}. Each curve is based on $4\times 10^9$ to

888: $10^{10}$ generated subgraphs. Each is strongly curved,

889: suggesting that there are no power laws -- at least for subgraph sizes

890: where we obtain reasonable statistics for the census. The curves

891: show less curvature for larger $n$, but this is a gradual effect. It

892: seems that the scaling behavior found in \cite{baskerville} was mainly

893: due to the presence of disconnected graphs, although it is not

894: immediately obvious why those should give scale-free statistics

895: either. In addition, the right hand tails of the Zipf plots in

896: \cite{baskerville} were cut

897: off because of substantially lower statistics. In our case, apparently

898: sharp cutoffs in the counts are observed for ranks $\approx

899: 1.08\times 10^4$ for $n=8$, $\approx 2.1\times 10^5$ for $n=9$, and

900: $\approx 2.9\times 10^6$ for $n=10$. For $n\leq 9$ these are close to

901: the total number of different connected subgraphs~\cite{briggs},

902: suggesting that we have fairly complete statistics. For $n=10$ the

903: cutoff is more affected by lack of statistics, but it is still within

904: a factor of four of the upper limit.

905:

906: \begin{figure}

907:  \begin{center}

908:   \psfig{file=Fig-Zipf.ps,width=6.3cm,angle=270}

909:   \caption{(color online) ``Zipf'' plots showing the counts for individual

910:     connected subgraphs with fixed $n$, plotted against their rank.

911:     Data are for the \coli network.}

912: \label{zipf.fig}

913: \end{center}

914: \end{figure}

915:

916:

917: \subsection{Null model comparison and motifs}

918:

919: One of the most striking results of \cite{baskerville} was that most

920: large subgraphs were either strong motifs or strong anti-motifs.

921: However, this finding was based on rather limited statistics and on a

922: single protein interaction network.  One of the purposes of the

923: present study is to test this and other results of \cite{baskerville}

924: with much higher statistics and for a larger network, the protein

925: interaction network of yeast.

926:

927: \begin{figure}

928:   \begin{center}

929:    \psfig{file=Fig-ecoli-countratios.ps,width=6.3cm,angle=270}

930:    \caption{(color online) Ratios between the count estimates

931:      $\hat{c}_G$ for connected subgraphs in the \coli

932:      network, and the corresponding average counts $\langle\hat{c}^{(0)}_G\rangle$

933:      in rewired networks. The data are plotted against $n^2+2\ell$,

934:      again to spread the points out conveniently. Most error bars are

935:      smaller than the symbols.}

936: \label{nullratio-ecoli.fig}

937: \end{center}

938: \end{figure}

939:

940: \begin{figure}

941:   \begin{center}

942:    \psfig{file=Fig-yeast-countratios.ps,width=6.3cm,angle=270}

943:    \caption{(color online) Same as Fig.~\ref{nullratio-ecoli.fig}, but for

944:      the yeast network.  Notice that most data points for large $n$ and

945:      $\ell$ are missing. Indeed, for $n=7$ all (!) data points with $\ell >

946:      16$ are missing, because no such subgraphs were found in the

947:      rewired ensemble.}

948: \label{nullratio-yeast.fig}

949: \end{center}

950: \end{figure}

951:

952: To define a motif requires a null model. We take this to be the ensemble

953: of networks with the same degree sequence, obtained by the rewiring

954: method.  The average subgraph counts in the null ensemble are denoted

955: as $\langle c_G^{(0)}\rangle$.  In Figs.~\ref{nullratio-ecoli.fig} and

956: \ref{nullratio-yeast.fig} we plot the ratios $c_G / \langle c_G^{(0)}\rangle$

957: against the variable $n^2+2\ell$ for each connected subgraph that was sampled

958: both in the original graph and in at least one of the rewired graphs.

959: The error bars, which include both

960: statistical errors from sampling and the ensemble fluctuations of the

961: null model estimated from several hundred rewired networks, are for

962: most points smaller than the symbols. A subgraph is a motif

963: (anti-motif), if this ratio is significantly larger (smaller) than 1.

964: Notice that motifs do not in general occur particularly

965: frequently in the original network. Even without rigorous estimates

966: of significance, it is clear that most densely connected

967: subgraphs are motifs in the yeast network. The fact that trees or

968: subgraphs with few loops tend to be anti-motifs might not be so evident

969: from Fig.~\ref{nullratio-yeast.fig}, since the ratios for trees and

970: tree-like graphs are close to one. Thus we have to discuss

971: significance more formally.

972:

973: \subsubsection{$Z$-scores}

974:

975: Usually~\cite{baskerville}, the significance of a motif (or

976: anti-motif) is measured by its $Z$-score

977: \be

978:    Z = {c_G - \langle c_G^{(0)}\rangle \over \sigma_G^{(0)}}\;,

979:               \label{Z}

980: \ee

981: where $\sigma_G^{(0)}$ is the standard deviation of $c_G^{(0)}$ within the null

982: ensemble. A subgraph is a motif (anti-motif), if $Z \gg 1$ ($Z\ll -1$).

983:

984: The eight strongest motifs with $n=7$ in the \coli network according

985: to this definition are shown in Fig.~\ref{fig:ecoli_motif}, together

986: with their $Z$-values. To name the strongest motifs in the yeast

987: network is less straightforward, since many subgraphs did not show

988: up in any rewired network at all. Assuming for those subgraphs

989: $\sigma_G^{(0)} = \langle c_G^{(0)}\rangle = 0$ would give $Z=\infty$.

990: Rough lower bounds on $Z$ are obtained for them by assuming that $\langle

991: c_G^{(0)}\rangle < 1/R$ and $\sigma_G^{(0)} < 1/\sqrt{R}$, where $R$

992: is the number of rewired networks that were sampled, giving $Z\geq c_G\sqrt{R}$.

993: Some of the strongest motifs in the yeast network, together with their

994: estimated $Z$-scores, are shown in Fig.~\ref{fig:yeast_motif}. Note

995: that no $n=7$ graphs with $\ell>16$ were found in any of the realizations of

996: the null model, while they were all found in the real yeast network. Hence

997: these are all strong motifs. Those motifs in Fig.~\ref{fig:yeast_motif} for

998: which only lower bounds for the $Z$-score are given are the most frequent in

999: the real network, hence they have the highest lower bound. It was

1000: pointed out in \cite{spirin,ispolatov} that cliques (complete subgraphs)

1001: are in general very strong motifs. In yeast, the $n=7$ clique (with

1002: $\ell=21$) is indeed a very strong motif, but it does not have the largest

1003: lower bound on the $Z$-score.  In comparison, anti-motifs have rather

1004: modest $Z$-scores. The strongest anti-motif with $n=7$ has $Z=-32.9$

1005: ($Z=-24.7$) for \coli (yeast).
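
The lower bound on $Z$ quoted above follows in one line from Eq.~(\ref{Z}); we spell it out under the stated assumptions $\langle c_G^{(0)}\rangle < 1/R$ and $\sigma_G^{(0)} < 1/\sqrt{R}$:
\be
   Z = {c_G - \langle c_G^{(0)}\rangle \over \sigma_G^{(0)}}
     > \left(c_G - {1\over R}\right)\sqrt{R} \approx c_G\sqrt{R} ,
\ee
where the last step uses $c_G \gg 1/R$. The bound therefore grows linearly with the count in the real network, which is why the most frequent of these subgraphs receive the highest lower bounds.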

1006:

1007: \begin{figure}

1008:   \begin{center}

1009:    \psfig{file=maya_fig2.eps,width=8.5cm,angle=0}

1010:    \caption{The eight strongest motifs with $n=7$ in the \coli protein

1011:      interaction network. These tend to be almost bipartite graphs,

1012:      and many pairs of nodes are linked to the same set of neighbors.

1013:      Their $Z$-scores, in order from left to right, first then second

1014:      row, are: $2.9\times 10^4, 932, 885, 648, 595, 532, 516$ and

1015:      377. Their estimated frequencies in the original \coli network

1016:      are, in the same order: $20936\pm 8,

1017:      161521\pm 63, 8312\pm 5, 1331\pm 2, 838\pm 2, 5985 \pm 5, 5165\pm 4,$

1018:      and $ 519\pm 1$.}

1019: \label{fig:ecoli_motif}

1020: \end{center}

1021: \end{figure}

1022:

1023: \begin{figure}

1024:   \begin{center}

1025:    \psfig{file=m3a.eps,width=8.3cm,angle=0}

1026:    \caption{Eight very strong motifs with $n=7$ for the yeast protein

1027:      interaction network. These tend to be almost complete graphs with

1028:      a single dangling node.  Four of these graphs were not seen in

1029:      any realization of the null model, so only lower bounds on their

1030:      $Z$-scores can be given. From left to right, first then second

1031:      row, the estimated $Z$-scores are: $>3\times 10^7, 9\times 10^5,

1032:      >8\times 10^6, 5\times 10^5,>4\times 10^6,3\times 10^5,2.5\times 10^5$,

1033:      and $>1.5\times 10^6$. Estimated frequencies are, in the same order:

1034:      $6.68(1)\times 10^5, 9.27(5)\times 10^4, 1.76(1)\times 10^5, 4.84(1)\times 10^5,

1035:      7.78(2)\times 10^4, 3.13(6)\times 10^5, 1.38(1)\times 10^5$, and

1036:      $3.35(1)\times 10^4$.}

1037: \label{fig:yeast_motif}

1038: \end{center}

1039: \end{figure}

1040:

1041: With $Z$-values up to $10^7$ and more, as in

1042: Fig.~\ref{fig:yeast_motif}, the motivation for using $Z$-scores

1043: becomes suspect. On the one hand, the null model is clearly unable

1044: to describe the actual network, and has to be replaced by a more

1045: refined null model. This will be done in a future paper

1046: \cite{newpaper}. On the other hand, it suggests using instead a

1047: $Z$-score based on {\it logarithms} of counts,

1048: \be

1049: Z_{\rm log} = {\log c_G - \langle \log c_G^{(0)}\rangle \over

1050:   \sigma_{\log, G}^{(0)}}\;,

1051: \label{Z_log}

1052: \ee

1053: where $\sigma_{\log, G}^{(0)}$ is the standard deviation of $\log

1054: c_G^{(0)}$.  An advantage of Eq.~(\ref{Z_log}) would be that it

1055: suppresses $|Z|$ for motifs, but enhances $|Z|$ for anti-motifs.
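
This asymmetry can be seen in a simple approximation, offered here as a heuristic sketch rather than a result of the analysis. Assume that the null-model fluctuations are small, with relative width $\epsilon = \sigma_G^{(0)}/\langle c_G^{(0)}\rangle \ll 1$, and write $r = c_G/\langle c_G^{(0)}\rangle$ (both symbols are introduced only for this example). To leading order in $\epsilon$, and for natural logarithms, one has $\sigma_{\log,G}^{(0)} \approx \epsilon$ and $\langle \log c_G^{(0)}\rangle \approx \log \langle c_G^{(0)}\rangle$, so that
\be
   Z = {r-1\over \epsilon}, \qquad Z_{\rm log} \approx {\ln r \over \epsilon} .
\ee
Since $\ln r < r-1$ for all $r\neq 1$, one finds $Z_{\rm log} < Z$ for motifs ($r>1$) and $|Z_{\rm log}| > |Z|$ for anti-motifs ($r<1$).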

1056:

1057: In general, strong yeast motifs have a tadpole structure with a

1058: complete or almost complete body, and a tail consisting of a few nodes

1059: with low degree. This agrees nicely with our previous observation that

1060: frequently occurring subgraphs in the yeast network have strong

1061: heterogeneity in the degrees of their nodes.  In contrast, strong \coli

1062: motifs with not too many loops are all based on a 4-3 or 5-2 bipartite

1063: structure. When the number of loops increases, strictly bipartite

1064: structures are impossible, but the tendency towards these structures

1065: is still observed.

1066:

1067: Whether we use $Z$-scores or the ratio $c_G/\langle c_G^{(0)}\rangle$ to identify

1068: motifs makes very little difference. Using either criterion, the

1069: strengths of the strongest motifs skyrocket with subgraph size. This

1070: is most dramatically apparent for the yeast network. Indeed,

1071: correlations between $Z$-scores of individual graphs in the yeast and

1072: \coli networks (data not shown) are much weaker than correlations

1073: between count ratios. The latter are shown in Fig.~\ref{graph-r} for

1074: $n=7$ subgraphs.

1075:

1076: \subsubsection{Twinning versus Clustering}

1077:

1078: Another characteristic feature of strong motifs in the \coli network is

1079: the tendency for `twin' nodes. We call two nodes in a subgraph twins if

1080: they are connected to the same set of neighbours in the subgraph.

1081: Otherwise said, nodes $i$ and $k$ are twins, iff the $i$-th and $k$-th

1082: rows of the subgraph adjacency matrix are identical. Notice that twin

1083: nodes can be created most naturally by duplicating genes. We

1084: found that subgraphs with many pairs of twin nodes are in general also

1085: motifs in the yeast network, but they do not stand out spectacularly

1086: from the mass of other motifs. They could be the `genuine' motifs also

1087: for yeast, but only a better null model where all subgraphs actually

1088: occur with reasonable frequency would be able to prove or disprove this.
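
In formulas, and assuming that the subgraph is simple and undirected with adjacency matrix $A$ ($A_{ij}=A_{ji}$, $A_{ii}=0$), the twin condition for nodes $i$ and $k$ reads
\be
   A_{ij} = A_{kj} \quad \mbox{for all } j=1,\ldots,n .
\ee
Setting $j=k$ gives $A_{ik}=A_{kk}=0$, so under this definition twin nodes are never linked to each other. For example, the two end nodes of a three-node chain are twins, since both are linked only to the middle node.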

1089:

1090: In Fig.~\ref{graph-r} we also indicated the dependence on the number

1091: $n_{\rm twin}$ of pairs of twin nodes, by marking subgraphs with

1092: $n_{\rm twin}>3$ ($n_{\rm twin}>1$) by bullets (asterisks). We

1093: see that all strong motifs in \coli have multiple pairs of twin nodes.

1094: These subgraphs tend to be also motifs of comparable strength in yeast

1095: -- the bullets in Fig.~\ref{graph-r} tend to cluster on the diagonal

1096: $[c_G /\langle c_G^{(0)}\rangle]_{\rm E.\,coli} = [c_G /\langle c_G^{(0)}\rangle]_{\rm yeast}$. However, there

1097: are even stronger motifs in yeast that have no twin nodes. These

1098: graphs are typically much weaker motifs or not motifs at all in \colp.
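
The twin count can also be written out explicitly. As a sketch in our own
notation (the symbol $A$ is introduced here only for illustration), let $A$
be the $n\times n$ adjacency matrix of the subgraph, with $A_{ii}=0$. Then
\be
   n_{\rm twin} = \sum_{i<k} \; \prod_{j=1}^{n} \delta_{A_{ij},A_{kj}} ,
\ee
where $\delta$ is the Kronecker delta, so that each pair of nodes with
identical rows contributes one. Since $A_{kk}=0$, identical rows require
$A_{ik}=0$, i.e. twins are never linked to each other. For example, the
three leaves of a star with $n=4$ are pairwise twins, giving
$n_{\rm twin}=3$, while a 4-cycle has two twin pairs, $n_{\rm twin}=2$.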

1099:

1100: \begin{figure}

1101:   \begin{center}

1102: \epsfig{file=ecoli-yeast-ratios.ps, width=6.3cm, angle=270}

1103: \caption{(color online) Count ratios $c_G /\langle c_G^{(0)}\rangle$ for individual

1104:   subgraphs in the \coli network, plotted against the count ratio for

1105:   the same subgraph in the yeast network. To highlight the dependence

1106:   on the number of twin nodes in the subgraph, subgraphs with $n_{\rm

1107:     twin}>1 \; (n_{\rm twin}> 3)$ are marked by asterisks

1108:   (bullets). Whereas almost all ratios are much higher in the yeast

1109:   network, this is noticeably less true for subgraphs containing

1110:   more than three pairs of twin nodes. These tend to fall on the

1111:   diagonal indicated by the dashed line.}

1112: \label{graph-r}

1113: \end{center}

1114: \end{figure}

1115:

1116: As we have already indicated, many of the strong motifs in yeast seem to

1117: be related to a few densely connected complexes such as those discussed in

1118: subsection A. They are either part of their cores, or they have most of

1119: their nodes in the core, with one or two extra nodes forming the tail of

1120: what looks like a tadpole. This effect is even more pronounced for

1121: $n=8$ subgraphs. For instance, the three most frequent subgraphs with

1122: $n=8$ and $ \ell = 17$ all contained a 6-clique and two nodes connected

1123: to it either in chain or in parallel. None of them occurred even in a

1124: single rewired network.

1125:

1126: The situation is different for the \coli network. There, the three

1127: most frequent graphs with 8 nodes and 17 edges also have a tadpole

1128: structure, few twin nodes, and low bipartivity. But they are not very

1129: strong motifs since they also occur frequently in the rewired networks.

1130: The three strongest motifs with $n=8$ and $ \ell = 17$, in contrast,

1131: have many twin pairs and high bipartivity. They have slightly lower

1132: counts (by factors of 2--4), but occur much more rarely in the rewired

1133: networks.

1134:

1135: \begin{figure}

1136:   \begin{center}

1137:   \epsfig{file=yeast-ecoli-orig.rewired.ps, width=6.3cm, angle=270}

1138:   \caption{(color online) Counts $c_G$ and $\langle

1139:     c_G^{(0)}\rangle$ for individual subgraphs in the \ecoli network,

1140:     plotted against counts for the same subgraph in the yeast network.

1141:     It can be seen that the two rewired networks are much more similar

1142:     (display higher correlation) than the original networks.}

1143:   \label{graph-freq}

1144:   \end{center}

1145: \end{figure}

1146:

1147: \subsubsection{Effects of Rewiring on Differences between Networks}

1148:

1149: Finally, Fig.~\ref{graph-freq} shows counts for individual subgraphs

1150: in the \coli network against counts for the same subgraph in yeast.

1151: This is done for all four combinations of original and

1152: rewired networks. We see that the correlation is strongest when we

1153: compare rewired networks of \coli to rewired networks of yeast. This

1154: is not surprising. It means that a lack of correlations is mostly due

1155: to special features of one network which are not shared by the other.

1156: Rewiring eliminates most of these features. The other observation is

1157: that rewiring in general further reduces the counts for subgraphs

1158: which are already rare in the original networks. This is mainly due to

1159: the fact that such subgraphs are relatively densely connected, and

1160: appear in the original networks only because of the strong clustering.

1161: This effect is more pronounced for yeast than for \colp, because the

1162: yeast network is sparser and has more densely connected clusters/complexes.

1163:

1164: \section{Discussion}

1165:

1166: In this paper we have presented an algorithm for sampling connected

1167: subgraphs uniformly from large networks. This algorithm is a

1168: generalization of algorithms for sampling lattice animals, hence we

1169: refer to it as a ``graph animal algorithm'' and to the connected subgraphs

1170: as ``graph animals''. It allowed us to obtain high-statistics estimates of

1171: subgraph censuses for two protein interaction networks. Although the

1172: graph animal algorithm worked well in both cases, the analysis of the

1173: smaller network (\colp) was much easier than that of the bigger (yeast).

1174: This was not so much because of the sheer size of the latter (the yeast

1175: network has about ten times more nodes and links than the \coli network),

1176: but was mainly caused by the existence of stronger hubs. Indeed, the

1177: presence of hubs places a more stringent limitation on the method than

1178: the size of the network.

1179:

1180: One of the main results is that many subgraph frequency counts are

1181: hugely different from those in the most popular null model, which is

1182: the ensemble of networks with fixed degree sequence. Based on a

1183: comparison with this null model, most subgraphs with size $\geq 6$ in

1184: both networks would be very strong motifs or anti-motifs. This clearly

1185: shows that alternative null models are needed which take clustering and

1186: other effects into account.

1187:

1188: While this was not very surprising (hints of it had been found in

1189: previous analyses), a more surprising result is the fact that the

1190: dominant motifs in the two protein interaction networks show very

1191: different features. Most of these seem to be related to the densely

1192: connected cores of a small number of complexes in the yeast network,

1193: which have no parallels in the \coli network and which strongly affect

1194: the subgraph census. Further studies are needed to disentangle

1195: these effects from other -- possibly biologically more interesting --

1196: effects.

1197:

1198: Finally, a feature with likely biological significance is the dominance

1199: of subgraphs with many twin nodes. These are nodes which share the

1200: same list of linked neighbors within the subgraph. They correspond to

1201: proteins which interact with the same set of other proteins. The most

1202: natural explanation for them is gene duplication.  Connected to

1203: this is a preference for (approximately) bipartite subgraphs. These

1204: two features are very clearly seen in the \coli network, much less so

1205: in yeast. But it would be premature to conclude that gene duplication

1206: was evolutionarily more important in \coli than in yeast. It is more

1207: likely that its effect is just masked in the yeast network by other

1208: effects, most probably by the densely connected complexes and other

1209: clustering effects which do not show up to the same extent in \colp.

1210:

1211: \begin{figure}

1212:   \begin{center}

1213:   \epsfig{file=Fig-compare-ratios.ps, width=7.7cm, angle=270}

1214:   \caption{(color online) Count ratios $c_G/\langle c_G^{(0)}\rangle$

1215:     for individual subgraphs in the yeast networks of

1216:     Refs.~\cite{bu,batada}, plotted against the ratios for the same

1217:     subgraph in the network of Ref.~\cite{yeast}. If all three networks

1218:     were identical, all points should lie on the diagonal (indicated by

1219:     the straight dashed line), whereas in fact systematic deviations are

1220:     observed.}

1221:   \label{ratios-compare}

1222:   \end{center}

1223: \end{figure}

1224:

1225: Up to now, we know very little about the biological significance of our

1226: findings. One main avenue of further work could be to relate our results

1227: on subgraph abundances in more detail to properties of the network that

1228: are associated with biological function. Another important problem is the

1229: comparison between network reconstructions which supposedly describe

1230: the same or similar objects. There exist, e.g., a large number of

1231: published protein-protein interaction networks for yeast.

1232: Some were obtained by means of different experimental techniques, either

1233: with conventional or with high throughput methods, while others were

1234: obtained by comprehensive literature compilations. In a preliminary

1235: step, we compared three such networks: the network obtained by Krogan

1236: {\it et al.}~\cite{yeast} that was studied above, a somewhat older

1237: network downloaded from~\cite{pajek} and attributed to

1238: Bu {\it et al.}~\cite{bu}, and the `high confidence' (HC)

1239: network of Batada {\it et al.}~\cite{batada}. The latter is the most

1240: recent. It was obtained by extracting the most reliable interactions

1241: from a vast data base which includes the data of both Bu {\it et al.} and

1242: Krogan {\it et al.} In Fig.~\ref{ratios-compare} we plot the ratios

1243: between the actual counts and the average counts in rewired networks

1244: for Bu {\it et al.} and for the HC data set against the analogous

1245: ratios for the Krogan {\it et al.} network. If the three data sets

1246: indeed describe the same yeast network -- as they purport to do, within

1247: experimental uncertainties -- the points should all fall onto the

1248: diagonal. Instead, we see systematic deviations. Surprisingly, these

1249: deviations are much stronger between the Krogan {\it et al.} and the

1250: HC networks than between the Krogan {\it et al.} and the Bu {\it et al.}

1251: networks. Clarifying these and other systematic irregularities should

1252: give valuable insight into the strengths and weaknesses of the methods

1253: used in constructing the networks as well as their biological

1254: reliability, and should lead to improved methods for network

1255: reconstruction.

1256:

1257: In the present paper we have only dealt with undirected networks. The

1258: basic sampling algorithm works equally well for directed networks. The

1259: main obstacle in applying our methods to the latter is the huge number

1260: of directed subgraphs, even for relatively small sizes.

1261: Nevertheless, we will present an analysis of directed networks in

1262: forthcoming work, as well as applications to other undirected

1263: networks.

1264:

1265: Acknowledgements: We thank Gabriel Musso for valuable information on

1266: the yeast network.

1267:

1268: \begin{thebibliography}{99}

1269: \bibitem{faloutsos} C. Faloutsos, M. Faloutsos, and P. Faloutsos, ACM SIGCOMM

1270:    Computer Communication Review {\bf 29}, 251 (1999).

1271: \bibitem{barabasi} A.-L. Barab\'asi and R. Albert, Science {\bf 286}, 509 (1999).

1272: \bibitem{bollobas} B. Bollob\'as, {\it Random Graphs} (Academic Press,

1273:   London 1985).

1274: \bibitem{newman_SIAM} M.E.J. Newman, SIAM Review {\bf 45}, 167 (2003).

1275: \bibitem{watts-stro} D.J. Watts and S.H. Strogatz, Nature {\bf 393}, 440 (1998).

1276: \bibitem{newman_clust} M.E.J. Newman, Phys. Rev. E {\bf 64}, 016131 (2001).

1277: \bibitem{ravasz} E. Ravasz, L. Somera, D.A. Mongru, Z.N. Oltvai, and

1278:    A.-L. Barab{\'a}si, Science {\bf 297}, 1551 (2002).

1279: \bibitem{girvan} M.E.J. Newman and M. Girvan, Phys. Rev. E {\bf 69}, 026113 (2004).

1280: \bibitem{ziv} E. Ziv, M. Middendorf, and C. Wiggins, Phys. Rev. E {\bf 71}, 046117

1281:    (2005).

1282: \bibitem{rosvall} M. Rosvall and C.T. Bergstrom, Proc. Nat. Acad. Sci. U.S.A.

1283:    {\bf 104}, 7327 (2007).

1284: \bibitem{stadler} K. Klemm and P.F. Stadler, Phys. Rev. E {\bf 73},

1285:    025101(R) (2006).

1286: \bibitem{milgram} S. Milgram, Psychology Today {\bf 2}, 60 (1967).

1287: \bibitem{estrada} E. Estrada and J.A. Rodr\'iguez-Vel\'azquez, Phys. Rev. E {\bf

1288:     72}, 046105 (2005); E. Estrada, J. Proteome Res. {\bf 5}, 2177 (2006).

1289: \bibitem{milo} R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii,

1290:     and U. Alon, Science {\bf 298}, 824 (2002).

1291: \bibitem{shen-orr} S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, Nat. Genet.

1292:     {\bf 31}, 64 (2002).

1293: \bibitem{vasquez} A. V\'azquez, R. Dobrin, D. Sergi, J.-P. Eckmann, Z.N. Oltvai,

1294:     and A.-L. Barab\'asi, Proc. Nat. Acad. Sci. U.S.A. {\bf 101}, 17940 (2004).

1295:   \bibitem{kashtan} N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon,

1296:     Phys. Rev. E {\bf 70}, 031909 (2004).

1297: \bibitem{besag} J. Besag and P. Clifford, Biometrika {\bf 76}, 633 (1989).

1298: \bibitem{maslov} S. Maslov and K. Sneppen, Science {\bf 296}, 910 (2002).

1299: \bibitem{class1} M.~Middendorf, E.~Ziv and C.~H.~Wiggins,

1300:   Proc. Natl. Acad. Sci.~U.S.A. {\bf 102}, 3192 (2005).

1301: \bibitem{mahadevan} P. Mahadevan, D. Krioukov, K. Fall, and A. Vahdat,

1302:   ``A Basis for Systematic Analysis of Network Topologies'', preprint

1303:   arXiv/cs.NI/0605007v2 (2006).

1304: \bibitem{newman_park1} J. Park and M.~E.~J.~Newman, Phys. Rev. E {\bf

1305:     68}, 026112 (2003).

1306: \bibitem{newman_park2} J.~Park and M.~E.~J.~Newman, Phys. Rev. E {\bf

1307:     70}, 066117 (2004).

1308: \bibitem{foster} J.~Foster, D.~Foster, P.~Grassberger and M.~Paczuski, e-print

1309: cond-mat/0610446 (2006).

1310: \bibitem{clauset} A.~Clauset, C.~Moore, and M.~E.~J.~Newman, e-print

1311:   physics/0610051 (2006).

1312: \bibitem{briggs} K. Briggs, {\sf http://keithbriggs.info/cgt.html} (2006).

1313: \bibitem{kobler} J. K\"obler, U. Sch\"oning, and J. Tor\'an, {\it The Graph

1314:     Isomorphism Problem: Its Structural Complexity} (Birkh\"auser,

1315:   Boston 1993).

1316: \bibitem{faulon} J.-L. Faulon, J. Chem. Inf. Comput. Sci. {\bf 38}, 432 (1998).

1317: \bibitem{toran} J. Tor\'an, FOCS 180 (2000).

1318: \bibitem{nauty} For the ``nauty'' program of B. McKay, see

1319:    {\sf http://cs.anu.edu.au/people/bdm/nauty/}

1320: \bibitem{baskerville} K. Baskerville and M. Paczuski, Phys. Rev. E {\bf

1321:    74}, 051903 (2006).

1322: \bibitem{kashtan2004b} N. Kashtan, S. Itzkovitz, R. Milo, and U.

1323:    Alon, Bioinformatics {\bf 20}, 1746 (2004).

1324: \bibitem{spirin} V. Spirin and L.A. Mirny, Proc. Nat. Acad. Sci.

1325:    U.S.A. {\bf 100}, 12123 (2003).

1326: \bibitem{animals} R.C. Read, Canad. J. Math. {\bf 14}, 1 (1962).

1327: \bibitem{jensen} I. Jensen, J. Stat. Phys. {\bf 102}, 865 (2001).

1328: \bibitem{redner} S. Redner, J. Statist. Phys. {\bf 29}, 309 (1982).

1329: \bibitem{stauffer} D. Stauffer,  Phys. Rev. Lett. {\bf 41}, 1333 (1978).

1330: \bibitem{dickman} R. Dickman and W.C. Schieve, J. Physique {\bf 45}, 1727 (1984).

1331: \bibitem{pivot} E.J. Janse van Rensburg and N. Madras, J. Phys. A:

1332:    Math. Gen. {\bf 25}, 303 (1992).

1333: \bibitem{leath} P. Leath, Phys. Rev. B {\bf 14}, 5046 (1976).

1334: \bibitem{hsu} H.-P. Hsu, W. Nadler, and P. Grassberger, J. Phys. A:

1335:   Math. Gen. {\bf 38}, 775 (2005); e-print cond-mat/0408061 (2004).

1336: \bibitem{newpaper} K.~Baskerville {\it et al.}, in preparation.

1337: \bibitem{gfn1998} P. Grassberger, H. Frauenkron, and W. Nadler, {\it

1338:    PERM: A Monte Carlo Strategy for Simulating Polymers and other

1339:    Things}, in ``Monte Carlo Approach to Biopolymers and Protein

1340:    Folding'', eds. P. Grassberger {\it et al.}  (World Scientific,

1341:    Singapore 1998); arXiv:cond-mat/9806321 (1998).

1342: \bibitem{care} C.M. Care, Phys. Rev. E {\bf 56}, 1181 (1997); C.M.

1343:    Care and R. Ettelaie, Phys. Rev. E {\bf 62}, 1397 (2000).

1344: \bibitem{Redner79} S. Redner, J. Phys. A: Math. Gen. {\bf 12}, L239 (1979).

1345: \bibitem{ecoli} G. Butland {\it et al.}, Nature {\bf 433}, 531 (2005);

1346:    {\sf http://www.cosin.org}.

1347: \bibitem{yeast} N.J. Krogan {\it et al.}, Nature {\bf 440}, 637 (2006).

1348: \bibitem{footnote} The networks given in \cite{ecoli,yeast} are not

1349:    connected. In the present paper we used only their largest connected

1350:    components.

1351: \bibitem{stauffer-aharony} D. Stauffer and A. Aharony, {\it An

1352:    Introduction to Percolation Theory}, 2nd Ed. (Taylor and Francis,

1353:    London, 1994).

1354: \bibitem{randomSAW} P. Grassberger, J. Phys. A: Math. Gen. {\bf 26}, 1023 (1993).

1355: \bibitem{mollison} D. Mollison, J. R. Statist. Soc. B {\bf 39}, 283 (1977).

1356: \bibitem{grass} P. Grassberger, Mathematical Biosciences {\bf 63}, 157 (1983).

1357: \bibitem{integer-seq} {\it The On-Line Encyclopedia of Integer Sequences},

1358:    {\sf http://www.research.att.com/~njas/sequences} (AT\&T Labs, 2006).

1359: \bibitem{pastor} R. Pastor-Satorras and A. Vespignani, Phys. Rev.

1360:    Lett. {\bf 86}, 3200 (2001).

1361: \bibitem{footnote2} We might try to estimate the optimal $p$ by the

1362:    threshold for an infinite SIR epidemic on an infinite tree like

1363:    network with the same degree distribution, $p_c = \langle k\rangle

1364:    / \langle k^2\rangle$ \cite{pastor}. For the two networks

1365:    considered in this paper, this would give $p_c({\rm yeast}) =

1366:    0.062$, $p_c({\rm e.~coli}) = 0.070$, i.e.  a much smaller

1367:    difference in the optimal $p$ values. One reason why this is not

1368:    observed might be the very strong clustering, in particular in the

1369:    yeast data, which is neglected in this argument.

1370: \bibitem{footnote3} Another reason why no minimum appears in the

1371:    $n=8$ curve is that we kept the number of generated clusters fixed,

1372:    not CPU time.  Since larger $p$ values also imply larger clusters

1373:    on average, the CPU time per cluster increases sharply for larger

1374:    $p$.

1375: \bibitem{grass-PERM} P. Grassberger and W. Nadler, {\it ``Go with

1376:     the winners''-Simulations}, in ``Computational Statistical Physics:

1377:     From Billiards to Monte Carlo'', eds. K.H.  Hoffmann {\it et al.}

1378:     (Springer, Heidelberg 2000); arXiv:cond-mat/0010265 (2000).

1379: \bibitem{seidman} S.B. Seidman, Social Networks {\bf 5}, 269 (1983).

1380: \bibitem{mips} MIPS data base: \\

1381:     {\sf http://mips.gsf.de/genre/proj/yeast/Search/Catalogs\-/catalog.jsp}.

1382: \bibitem{sgd} SGD data base: \\

1383:     {\sf http://www.yeastgenome.org/cgi-bin/GO/go.pl?}.

1384: \bibitem{ispolatov} I. Ispolatov, P.L. Krapivsky, I. Mazo, and A. Yuryev,

1385:      New Journal of Physics {\bf 7}, 145 (2005).

1386: \bibitem{pajek} {\sf http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast\-/Yeast.htm}.

1387: \bibitem{bu} D. Bu {\it et al.}, Nucleic Acids Res. {\bf 31}, 2443 (2003).

1388: \bibitem{batada} N.N. Batada {\it et al.}, PLoS Biology {\bf 4}, 1720 (2006).

1389: \end{thebibliography}

1390:

1391: \end{document}

1392: