q-bio0702029/m9.tex
1: \documentclass[aps,pre,showpacs,twocolumn,superscriptaddress]{revtex4}
2: \usepackage{epsfig}
3: \newcommand{\be}{\begin{equation}}
4: \newcommand{\ee}{\end{equation}}
5: \newcommand{\bea}{\begin{eqnarray}}
6: \newcommand{\eea}{\end{eqnarray}}
7: \newcommand{\kim}[1]{{\huge{$\bullet$}}{\em #1}}
8: \newcommand{\maya}[1]{{\large{$\clubsuit$}}{\em #1}}
9: \newcommand{\peter}[1]{{\large{$\heartsuit$}}{\em #1}}
10: \newcommand{\ecoli}{{\it Escherichia coli }}
11: \newcommand{\colp}{{\it E.~coli}}
12: \newcommand{\coli}{{\it E.~coli }}
13: 
14: \begin{document}
15: 
16: \title{Graph animals, subgraph sampling and motif search in large networks}
17: 
18: \author{Kim Baskerville} \affiliation{Perimeter Institute for
19:   Theoretical Physics, Waterloo, Canada N2L 2Y5} 
20: \author{Peter Grassberger} \affiliation{Complexity Science Group,
21:   University of Calgary, Calgary, Canada} \affiliation{Institute for
22:   Biocomplexity and Informatics, University of Calgary, Calgary,
23:   Canada} 
24: \author{Maya Paczuski} \affiliation{Complexity Science Group,
25:   University of Calgary, Calgary, Canada}
26: 
27: \date{\today}
28: 
29: 
30: \begin{abstract}
31:   We generalize a sampling algorithm for lattice animals (connected
32:   clusters on a regular lattice) to a Monte Carlo algorithm for `graph
33:   animals', i.e. connected subgraphs in arbitrary networks. As with
34:   the algorithm in [N. Kashtan {\it et al.}, Bioinformatics {\bf 20},
35:   1746 (2004)], it provides a weighted sample, but the computation of
36:   the weights is much faster (linear in the size of subgraphs, instead
37:   of super-exponential). This allows subgraphs with up to ten or more
38:   nodes to be sampled with very high statistics, from arbitrarily
39:   large networks. Using this together with a heuristic algorithm for
40:   rapidly classifying isomorphic graphs, we present results for two
41:   protein interaction networks obtained using the TAP high throughput 
42:   method: one of \ecoli with 230 nodes and 695
43:   links, and one for yeast ({\it Saccharomyces cerevisiae}) with
44:   roughly ten times more nodes and links. We find in both cases that
45:   most connected subgraphs are strong motifs ($Z$-scores $>10$) or
46:   anti-motifs ($Z$-scores $<-10$) when the null model is the ensemble of
47:   networks with fixed degree sequence. Strong differences appear
48:   between the two networks, with dominant motifs in \coli being
49:   (nearly) bipartite graphs and having many pairs of nodes which
50:   connect to the same neighbors, while dominant motifs in yeast tend
51:   towards completeness or contain large cliques. We also explore a
52:   number of methods that do not rely on measurements of $Z$-scores or
53:   comparisons with null models. For instance, we discuss the influence
54:   of specific complexes like the 26S proteasome in yeast, where a
55:   small number of complexes dominate the $k$-cores with large $k$ and
56:   have a decisive effect on the strongest motifs with 6 to 8 nodes. We
57:   also present Zipf plots of counts versus rank. They show broad
58:   distributions that are not power laws, in contrast to the case when
59:   disconnected subgraphs are included.
60: \end{abstract}
61: 
62: \pacs{02.70.Uu, 05.10.Ln, 87.10.+e, 89.75.Fb, 89.75.Hc}
63: 
64: \maketitle
65: 
66: \section{Introduction}
67: 
68: Recently, there has been an increased interest in complex networks,
69: partly triggered by the observation that naturally occurring networks
70: tend to have fat-tailed or even power law degree distributions
71: \cite{faloutsos,barabasi}. Thus real-world networks tend to be very
72: different from the completely random Erd\H{o}s-R\'enyi \cite{bollobas}
73: networks that have been much studied by mathematicians, and which give
74: Poissonian degree distributions.  In addition, most networks have
75: further significant properties that arise either from functional
76: constraints, from the way they have grown (fat tails, e.g., are 
77: naturally explained by preferential attachment), or for other reasons.
78: As a consequence, a large number of statistical indicators have been
79: proposed to distinguish between networks with different functionality
80: (neural networks, protein transcription networks, social networks,
81: chip layouts, etc.) and between networks which were specially designed
82: or which have grown spontaneously (such as the world wide web),
83: under more or less strong evolutionary pressure. These observables
84: include various centrality measures \cite{newman_SIAM}, assortativity
85: (the tendency of nodes with similar degree to link preferentially)
86: \cite{newman_SIAM}, clustering \cite{watts-stro,newman_clust},
87: different notions of modularity \cite{barabasi,ravasz,girvan,ziv,rosvall}, 
88: properties of loop statistics \cite{stadler},
89: the small world property (i.e., slow increase of the effective
90: diameter of the network with the number of nodes) \cite{milgram},
91: bipartivity (the prevalence of even-sized closed walks over closed
92: walks with an odd number of steps) \cite{estrada}, and others.
93: 
94: The frequencies of specific subgraphs form a particular class of
95: indicators. Subgraphs that occur more frequently than expected are
96: referred to as motifs, while those occurring less frequently are
97: anti-motifs \cite{milo,shen-orr,vasquez,kashtan}. Typically, motif
98: search requires a null model for deciding when a subgraph is over- 
99: or under-abundant. The most popular null model so far has been the 
100: ensemble of all random graphs with the same degree sequence. This
101: popularity is largely due to the fact that it can be simulated easily
102: by means of the so-called `rewiring algorithm' \cite{besag,maslov}.
103: As we shall see, however, in the present analysis its value is severely 
104: limited, because it gives predictions
105: that are too far from those actually observed. Other null models that
106: retain more properties of the original network have been suggested
107: \cite{milo,mahadevan}, but have received much less attention. Analytic
108: approaches to null models are discussed in
109: Refs.~\cite{newman_park1,newman_park2,foster}.
110: 
111: \subsection{Motifs and the Search for Structure}
112: 
113: Up to now, motif search has been mainly restricted to small motifs,
114: typically with three or four nodes. Certain specific classes of 
115: larger subgraphs have been examined in
116: Refs.~\cite{class1,kashtan2004b,vasquez}. With the exception of
117: Ref.~\cite{baskerville}, few systematic attempts have been made to 
118: learn about significant structures at larger scale, by counting all 
119: possible subgraphs (for a different approach to the discovery of 
120: structure than discussed here see the work on inference of 
121: hierarchy in Ref.~\cite{clauset}).
122: 
123: One reason for this is that the number of non-isomorphic (i.e.
124: structurally different) subgraphs in any but the most trivial networks
125: increases extremely fast (super-exponentially) with their size. For
126: instance, the number of different undirected graphs with 11 nodes is
127: $\approx 10^9$~\cite{briggs}. Thus exhaustive studies of all possible
128: subgraphs with $>10$ nodes become virtually impossible with
129: present-day computers. But precisely because of this inflationary growth,
130: counts at intermediate sizes contain an enormous amount of potentially
131: useful information. Another obstacle is the notorious graph
132: isomorphism problem \cite{kobler,faulon}, which is in the NP class
133: (though probably not NP complete \cite{toran}). Existing state of the
134: art programs for determining whether any two graphs are isomorphic
135: \cite{nauty} remain too slow for our purpose.  Instead, we shall use
136: heuristics based on graph invariants similar to those put forward in
137: Ref.~\cite{baskerville}, where intermediate size motifs and
138: anti-motifs in the protein interaction network of \ecoli were
139: detected.
140: 
141: The last problem when studying larger motifs, and the main one
142: addressed in the present work, is the difficulty of estimating how
143: often each possible subgraph appears in a large network, i.e. of
144: obtaining a `subgraph census'.  Most studies so far were based on
145: exact enumeration. In a network with $N$ nodes, there are ${N\choose
146: n}$ subgraphs of size $n$. With $N=500$ and $n=6$, say, this number
147: is $\approx 2\times 10^{13}$. In addition, most of the subgraphs
148: generated this way on a sparse network would be disconnected, while
149: connected subgraphs are of more intrinsic interest. Thus some
150: statistical sampling is needed. If one is willing to generate disconnected 
151: as well as connected subgraphs, then uniform sampling is simple: Just 
152: choose random $n$-tuples of nodes from the network \cite{baskerville}. 
153: Uniform sampling of connected subgraphs is less trivial. To 
154: our knowledge, the only work which addressed this systematically was 
155: Kashtan {\it et al.} \cite{kashtan2004b} (for a less systematic 
156: approach, see also \cite{spirin}). There, a biased sampling 
157: algorithm was put forward.  While generating the subgraphs is fast,
158: computing the weight factor needed to correct for the bias is
159: $\exp[O(n)]$, making their algorithm inefficient for $n\ge 7$.
160: 
161: \subsection{Graph Animals}
162: 
163: In the present paper we exploit the fact that sampling connected
164: subgraphs of a finite graph resembles sampling connected clusters of
165: sites on a regular lattice.  The latter is called the {\it lattice
166:   animal} problem \cite{animals}, whence we propose to call the
167: subgraph counting problem that of {\it graph animals}. It is important
168: to recognize obvious differences between the two cases. In particular,
169: lattices are infinite and translationally invariant, while networks
170: are finite and heterogeneous (disordered). For lattice animals one
171: counts the number of configurations up to translations (i.e. per unit
172: cell of the lattice), while on a network the quantity of immediate
173: interest is the absolute number of occurrences of particular
174: subgraphs. Still, apart from these issues, the basic operations
175: involved in both cases coincide.
176: 
177: Algorithms for enumerating lattice animals exactly exist and have been
178: pushed to high efficiency \cite{jensen}, but are far from trivial
179: \cite{redner}. Due to disorder, we should expect the situation to be
180: even worse for graph animals. Algorithms for stochastic sampling of
181: lattice animals are divided into two groups: Markov chain Monte Carlo
182: (MCMC) algorithms take a connected cluster and randomly deform it
183: while preserving connectivity \cite{stauffer,dickman,pivot}, while
184: `sequential' sampling algorithms grow the cluster from scratch
185: \cite{leath,hsu,gfn1998,care}. Even for regular lattices, MCMC
186: algorithms seem less efficient than growth algorithms \cite{hsu}. For
187: networks, this difference should be even more pronounced, since MCMC algorithms
188: would dwell in certain parts of the network, and averaging over the
189: different parts costs additional time. Thus we shall in the following
190: concentrate only on growth algorithms.
191: 
192: All growth algorithms similar to those in
193: \cite{leath,hsu,gfn1998,care} produce unbiased samples of {\it
194:   percolation} clusters. As explained in Sec.~\ref{alg}, this means that they
195: sample clusters or subgraphs with non-uniform probability (for an
196: alternative algorithm, see \cite{Redner79}). Consequently, computing
197: graph animal statistics requires the computation of weights to be
198: assigned to the clusters, in order to correct for the bias. In
199: contrast to the algorithm in Ref.~\cite{kashtan2004b}, the correct
200: weights are easily and rapidly calculated in our graph animal
201: algorithm.  This is its main advantage.
202: 
203: \subsection{Summary}
204: 
205: In Sec.~\ref{alg} we present the graph animal algorithm in detail. The 
206: method used to handle graph isomorphism is briefly reviewed in Sec.~III.
207: Extensive tests, mostly with two protein interaction networks, one for
208: \coli with 230 nodes and 695 links~\cite{ecoli}, and one for yeast
209: with 2559 nodes and 7031 links~\cite{yeast}, are presented in
210: Sec.~IV \cite{footnote}. Both networks were obtained using the TAP high 
211: throughput method. In particular, our algorithm involves as a
212: free parameter a percolation probability $p$. For optimal performance,
213: in lattice animals $p$ should be near the critical value where cluster
214: growth percolates~\cite{hsu}. We show how the performance for graph
215: animals depends on $p$, on the subgraph size $n$, and on other
216: parameters. In Sec.~V we use our sampling method to study these two
217: networks systematically. We verify that large subgraphs with high link
218: density are overwhelmingly strong motifs, while nearly all large
219: subgraphs with low link density are anti-motifs
220: \cite{vasquez,baskerville} -- although our data show much more
221: structure than suggested by the scaling arguments of \cite{vasquez}.
222: We also find striking differences in the strongest motifs for the two
223: networks.  Dominant motifs for the \coli network are either bipartite
224: or close to it (with many nodes sharing the same neighbors) while
225: `tadpoles' with bodies consisting of (almost) complete graphs dominate
226: for yeast.  Our conclusions and discussions of open problems are given
227: in Sec.~VI.
228: 
229: The present work only addresses undirected networks, but the graph
230: animal algorithm works without major changes also for directed
231: networks.  Due to the larger number of different directed subgraphs,
232: an exhaustive study of even moderately large subgraphs is much more
233: challenging~\cite{newpaper}.
234: % A first step in this direction will be given in
235: 
236: \section{The Algorithm}
237: \label{alg}
238:  
239: In this section we explain how our algorithm achieves uniform sampling
240: of connected subgraphs in undirected networks.  The graph animal
241: algorithm executes a generalization of the Leath algorithm for lattice
242: animals. The observation central to the work in
243: Refs.~\cite{leath,hsu,care} is that the animal and percolation
244: ensembles concern exactly the same clusters. The only difference
245: between the two ensembles is that clusters in the percolation ensemble
246: have different weights, while all clusters with the same number of nodes
247: (sites) have the same weight in the animal ensemble. We focus on site
248: percolation~\cite{stauffer-aharony}. Bond percolation could also be
249: used~\cite{hsu}, but this would be more complicated and is not
250: discussed here.
251: 
252: \subsection{Leath growth for graph animals}
253: 
254: For regular lattices and  undirected networks we use the following epidemic 
255: model for growing connected clusters of sites~\cite{leath}: \\
256: (1) Choose a number $p \in [0,1]$ and a maximal cluster size $n_{\rm max}$. 
257: Label all sites (nodes) as `unvisited'. \\
258: (2) Pick a random site (node)  $i_0$ as a {\it seed} for the cluster, so that 
259: the cluster consists initially of only this site; mark it as `visited'.\\
260: (3) Do the following step recursively, until all boundary sites of the 
261: cluster have been visited, or until the cluster consists of $n_{\rm max}$ 
262: sites, whichever comes first: (Note that a boundary site of a cluster $C$ is
263: a site which is  not in $C$, but which is connected to $C$ by
264: one or more edges). \\
265: (A) Choose one of the unvisited boundary sites of the present cluster, and 
266: mark it as visited;\\
267: (B) With probability $p$ join it to the cluster.\\
268: Once a boundary site has been visited, it cannot later join the cluster; it 
269: either joins the cluster when it is first visited (with probability $p$) or is 
270: permanently forbidden to join (with probability $1-p$).
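
The steps (1)--(3) above can be sketched in a few lines of Python. This is
only an illustrative sketch under stated assumptions, not the program used
for the results reported here: the function name {\tt leath\_grow}, the
adjacency-list dictionary {\tt adj}, and the first-in first-out order in
which boundary sites are tested are all choices made for this example.

```python
import random
from collections import deque

def leath_grow(adj, seed, p, n_max, rng=random):
    """Grow one cluster by site percolation, following steps (1)-(3).

    adj : dict mapping each node to a list of its neighbours (assumed format)
    Returns (cluster, rejected): the nodes that joined, in order of joining,
    and the visited boundary nodes that were refused.
    """
    visited = {seed}              # step (2): the seed is visited and in the cluster
    cluster = [seed]
    rejected = []
    queue = deque(adj[seed])      # boundary sites waiting to be tested
    while queue and len(cluster) < n_max:
        site = queue.popleft()    # step (A): take an untested boundary site
        if site in visited:       # already tested earlier, skip it
            continue
        visited.add(site)
        if rng.random() < p:      # step (B): join with probability p
            cluster.append(site)
            queue.extend(adj[site])
        else:                     # refused once, refused for good
            rejected.append(site)
    return cluster, rejected
```

With $p=1$ every tested site joins, so the cluster always reaches $n_{\rm
max}$ nodes if the component of the seed is large enough; with $p=0$ the
cluster stays at the seed and all its neighbours end up rejected.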
271: 
272: The order in which the boundary (or `growth') sites are chosen
273: influences the efficiency of the algorithm, but this is irrelevant for
274: the present discussion. The growth algorithm can be seen as an
275: idealization of an epidemic process (`generalized' or SIR
276: epidemic~\cite{mollison, grass}) with three types of individuals
277: (Susceptible, Infected, Removed).  Starting with a single infected
278: individual with all others susceptible, the infected individual can
279: infect neighbours during a finite time span.  Everyone either gets
280: infected or doesn't at his/her first contact.  The latter are removed,
281: as are the infected ones after their recovery, and do not participate 
282: in the further spread of the epidemic.
283: 
284: Assume that for some fixed node $i_0$, a connected labeled subgraph $G^{\ell}$ 
285: exists, which contains $i_0$ and has $n<n_{\rm max}$ nodes and $b$ visited 
286: boundary nodes. The chance that precisely this particular labeled subgraph 
287: will be chosen using the algorithm is
288: \be 
289:     P_{G^{\ell}}(p;i_0) \equiv P_{nb}(p;i_0)= p^{n-1}(1-p)^b \;.  
290: \label{growth-prob} 
291: \ee
292: Since an independent decision is made at each boundary site, this is
293: indeed the probability for $n-1$ sites to be selected to join the
294: cluster, while $b$ sites are rejected.
295: 
296: Denote by $c(G^{\ell})$ the indicator function for the existence of $G^{\ell}$, 
297: i.e. $c(G^{\ell})=1$ if the subgraph exists in the network, and $c(G^{\ell})=0$ 
298: else. Furthermore, denote by $c(G^{\ell};i_0)$ the explicit indicator that 
299: $G^{\ell}$ exists and contains the node $i_0$. Then the total number of 
300: occurrences of the {\it unlabeled} subgraph $G$ is given by 
301: \be
302:     c_G = n^{-1} \sum_{i=1}^N c_{G,i} = n^{-1} \sum_{i_0=1}^N \sum_{G^{\ell}\sim 
303: G}c(G^{\ell};i_0),
304: \ee
305: where $c_{G,i}$ is the number of occurrences which contain node
306: $i$, and where the last sum runs over all labeled subgraphs
307: $G^{\ell}$ that are isomorphic to $G$. The factor $n^{-1}$ takes into
308: account that a subgraph with $n$ nodes is counted $n$ times.
309: 
310: If we repeat the epidemic process $M$ times, always starting at the
311: same node $i_0$, then the expected number of times $G^\ell$ occurs is
312: \be
313:    \langle m(G^\ell;p,i_0)\rangle = M c(G^\ell;i_0) P_{G^\ell}(p;i_0)\;.
314: \ee
315: Hence, an estimator for $c_{G,i_0}$ based on the actual counts
316: $m(G^\ell;p,i_0)$ after $M$ trials is
317: \be
318:    {\hat c}_{G,i_0}(M) = M^{-1} \sum_{G^{\ell}\sim G} m(G^\ell;p,i_0) 
319: [P_{G^\ell}(p;i_0)]^{-1} \;.
320: \ee
321: Here and in what follows carets always indicate estimators.
322: 
323: More generally, the starting nodes are chosen according to some
324: probability $Q_{i_0}$.  After $M\gg 1$ trials in total, site $i_0$ will
325: have been used as starting point on average $Q_{i_0}M$ times. This gives then the
326: estimator for the total number of occurrences of $G$
327: \bea
328:         \label{simple_estimate}
329:    {\hat c}_G(M) & = & n^{-1} \sum_{i_0=1}^N {\hat c}_{G,i_0}(Q_{i_0}M) \\
330:          & = &(nM)^{-1} \sum_{i=1}^N Q_i^{-1} \sum_{G^{\ell}\sim G} m(G^{\ell};p,i) 
331: [P_{G^{\ell}}(p;i)]^{-1}. 
332:         \nonumber
333: \eea
334: It is simplest to take a uniform probability $Q_{i_0} = 1/N$. But a
335: better alternative is to choose each node with a probability
336: proportional to its degree, as nodes with larger degrees have more
337: connected subgraphs attached to them. This is accomplished by choosing
338: a link with uniform probability $1/L$, where $L$ is the total number
339: of links in the network, and then choosing one of the two ends of this
340: link at random. This gives
341: \be
342:    Q_i = (2L)^{-1}k_i.                                 \label{link-prob}
343: \ee
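
The link-based choice of the start node can be illustrated as follows. The
sketch assumes the network is available as a list of linked pairs $(i,j)$;
the names {\tt pick\_seed} and {\tt seed\_probability} are hypothetical and
serve only this example.

```python
import random

def pick_seed(links, rng=random):
    """Return a start node i with probability k_i / (2L).

    links : list of (i, j) pairs, one entry per undirected link (assumed format)
    """
    i, j = links[rng.randrange(len(links))]   # uniform link, probability 1/L
    return i if rng.random() < 0.5 else j     # one of its two ends at random

def seed_probability(links, node):
    """Q_i = k_i / (2L), computed directly from the link list."""
    degree = sum((a == node) + (b == node) for a, b in links)
    return degree / (2 * len(links))
```

Each node $i$ is an end of $k_i$ links and is picked from any one of them
with probability $1/(2L)$, which reproduces Eq.~(\ref{link-prob}).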
344: 
345: The algorithms of \cite{leath,care} are directly based on
346: Eq.~(5).  Their main drawback is that all
347: %Eq.~(\ref{simple_estimate}).  Their main drawback is that all
348: information from clusters which are still growing at size $n$ is not
349: used. Clusters whose growth had stopped at sizes $<n$ don't contribute
350: to ${\hat c}_G$ either, of course. Thus only those that stop growing
351: exactly at size $n$ are used in Eq.~(5). This
352: %exactly at size $n$ are used in Eq.~(\ref{simple_estimate}). This
353: requires, among other things, a careful choice of $p$: If $p$ is too
354: large, too many clusters survive past size $n$, while in the opposite
355: case too few reach this size at all. But even with the optimal choice
356: of $p$, most of the information is wasted.
357: 
358: \subsection{Improved Leath method}
359: 
360: The major improvement comes from the following observation \cite{hsu}:
361: Assume that a cluster has grown to size $n$, and among the $b$
362: boundary sites there are exactly $g$ which have not yet been tested
363: (`growth sites'). Thus growth has definitely stopped at $b-g$ already
364: visited boundary sites, while the growth on the remaining $g$ boundary
365: sites depends on future values of the random variable used to decide
366: whether they are going to be infected. With probability $(1-p)^g$ none
367: of them are susceptible, and the growth will stop at the present
368: cluster size $n$. Thus we can replace the counts $m(G^\ell;p,i_0)$ in
369: the estimator for $c_G$ by the counts of `unfinished' subgraphs,
370: provided we weigh each occurrence of a subgraph isomorphic to $G$ with
371: an additional weight factor $(1-p)^g$. Formally, this gives, with
372: uniform initial link selection (Eq.~(\ref{link-prob})), 
373: \bea
374: {\hat c}_G & = &{2L\over nM} \sum_{i=1}^N k_i^{-1} \sum_{G^{\ell}\sim G} 
375: p^{1-n}(1-p)^{g-b} \;\;\times  \nonumber \\
376: & & \times \;\; m_{\rm unfinished}(G^{\ell};p,i,g)\;.
377:     \label{estimate}
378: \eea
379: The quantity $m_{\rm unfinished}(G^{\ell};p,i,g)$ is the number of
380: epidemics (with parameter $p$) that start at node $i$, give a
381: labeled subgraph $G^{\ell}$ of infected nodes, and leave $g$
382: unvisited boundary nodes. The factor $p^{1-n}(1-p)^{g-b}$ has a simple
383: interpretation.  In analogy to Eq.~(\ref{growth-prob}) it is the inverse of the
384: probability of growing a cluster with $n-1$ nodes in addition to the
385: start node, $g$ growth nodes, and $b-g$ blocked boundary nodes,
386: \be
387:    P_{nbg}(p;i_0) = p^{n-1}(1-p)^{b-g} \;.
388: \label{growth-prob-2}
389: \ee
390: Eq.~(\ref{estimate}) is the number of generated clusters, reweighted
391: with their inverse probabilities to be sampled, given they exist.  
392: It is the formula we use to estimate frequencies of occurrences of 
393: connected subgraphs in the protein interaction networks as discussed 
394: later in the text.
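
A minimal sketch of how Eq.~(\ref{estimate}) can be evaluated is given
below. It rests on several assumptions that are not part of the text: all
isomorphism classes are pooled (so the function estimates the total number
of connected $n$-node subgraphs; a per-class census would bin the same
weights by class), the network has no isolated nodes, growth sites are kept
in a first-in first-out queue, and a cluster is recorded at the moment it
reaches $n$ nodes. The name {\tt estimate\_count} and the format of {\tt adj}
are hypothetical.

```python
import random
from collections import deque

def estimate_count(adj, n, p, trials, rng):
    """Estimate the number of connected n-node subgraphs, all classes pooled.

    adj : dict node -> list of neighbours (assumed format, undirected network)
    """
    # Listing every node once per incident link gives Q_i = k_i / (2L).
    ends = [i for i in adj for _ in adj[i]]
    two_L = len(ends)
    total = 0.0
    for _ in range(trials):
        seed = ends[rng.randrange(two_L)]
        visited = {seed}
        size, blocked = 1, 0          # blocked = b - g, rejected boundary sites
        queue = deque(adj[seed])      # first-in first-out growth sites
        while queue and size < n:
            site = queue.popleft()
            if site in visited:
                continue
            visited.add(site)
            if rng.random() < p:
                size += 1
                queue.extend(adj[site])
            else:
                blocked += 1
        if size == n:                 # 'unfinished' cluster with n nodes
            q_seed = len(adj[seed]) / two_L
            # weight p^(1-n) (1-p)^(g-b), divided by the seed probability
            total += p ** (1 - n) * (1 - p) ** (-blocked) / q_seed
    return total / (n * trials)
```

On a five-node test network made of a triangle with a two-link tail there
are four connected three-node subgraphs, and the estimate should approach
this value as the number of trials grows.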
395: 
396: \subsection{Resampling}
397: 
398: In principle, Eq.~(\ref{estimate}) can be improved further.
399: Ref.~\cite{hsu} shows how to use the equivalents of
400: Eqs.(\ref{estimate},\ref{growth-prob-2}) for lattice animals as a
401: starting point for a re-sampling scheme. For completeness, re-sampling
402: for graph animals is briefly explained, even though it is not used in
403: this work.
404: 
405: For each cluster that is still growing a {\it fitness function} is defined as 
406: \be
407:    f_{nbg}(p) = p^{1-n}(1-p)^{-b} = [P_{nbg}(p;i_0)]^{-1}/ (1-p)^g.
408: \label{fitness}
409: \ee
410: Clusters with too small fitness are killed, while clusters with too
411: large fitness are cloned, with both the fitness and the weight being
412: split evenly among the clones. The first factor in the fitness is just
413: proportional to the weight, while the second factor takes into account
414: that clusters with larger $g$ have more possibilities to continue
415: their growth, and thus should be more `valuable'. The precise form of
416: Eq.(\ref{fitness}) is purely heuristic, but was found to be near
417: optimal in fairly extensive tests.
418: 
419: This resampling scheme was found to be essential, if one wants to
420: sample clusters of sizes $n>100$. In \cite{hsu}, the emphasis was on
421: very large clusters (several thousand sites), and thus resampling was
422: a necessity. Here, in contrast, we concentrate on subgraphs with
423: $\approx 10$ nodes or less, and stick to the simpler scheme without
424: resampling.  With respect to graph animals, we point out that optimal
425: fitness thresholds for pruning and cloning depend in an irregular
426: network on the start node, $i_0$, and have to be learned for each
427: $i_0$ separately. Although a similar strategy achieves success for
428: dealing with self avoiding walks on random lattices~\cite{randomSAW},
429: this is much more time consuming than for regular lattices.
430: 
431: \subsection{Implementation details}
432: 
433: For fast data access, we used several redundant data structures. The
434: adjacency matrix was stored directly as an $N\times N$ matrix with
435: elements 0/1 and as a list of linked pairs $(i,j)$, i.e. as an array
436: of size $L\times 2$. The first is needed for fast checking of which
437: links are present in a subgraph, while the second is the format in
438: which the networks were downloaded from the web. Finally, for fast
439: neighbor searches, the links were also stored in the form of linked
440: lists. To test whether a site was visited during the growth of the
441: present (say $k$-th, $k=1\ldots M$) subgraph, an array {\sf s[i]} of
442: size $N$ and type {\sf unsigned int} was used, which was initiated as
443: {\sf s[i]=0, i = 0,...N-1}. Each time a site {\sl i} was visited, we
444: set {\sf s[i] = k}, and {\sf s[i] $<$ k} was used as indicator that
445: this site had not been visited during the growth of the present
446: cluster.
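
The stamping trick for the array {\sf s[i]} can be sketched as follows. The
class name {\tt VisitStamps} and its methods are hypothetical; the point of
the sketch is only that starting a new cluster costs a single increment
instead of resetting all $N$ entries.

```python
class VisitStamps:
    """Per-node stamps that make 'unvisit all nodes' an O(1) operation."""

    def __init__(self, n_nodes):
        self.s = [0] * n_nodes   # s[i] = index k of the last cluster that visited i
        self.k = 0               # index of the cluster currently being grown

    def new_cluster(self):
        self.k += 1              # every stored stamp is now < k, i.e. 'unvisited'

    def visit(self, i):
        self.s[i] = self.k

    def is_visited(self, i):
        return self.s[i] == self.k
```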
447: 
448: In Leath-type cluster growth, there are two popular variants. Untested
449: sites in the boundary can be written either into a first-in first-out
450: queue, or into a stack (first-in last-out queue). It was found in
451: \cite{hsu} that these two possibilities, whose efficiency is roughly
452: the same when Eq.~(5) is used, give vastly
453: %the same when Eq.(\ref{simple_estimate}) is used, give vastly
454: different efficiency with Eq.(\ref{estimate}), in particular (but not
455: only) in combination with resampling. In that case, the first-in
456: first-out queue gives much better results, and we use this method to 
457: get the numerical results shown later.
458: 
459: \section{Subgraph Classification}
460: 
461: After sampling a labeled subgraph $G^{\ell}$, one has to find its 
462: isomorphism class $G$ (i.e., $G^{\ell}\sim G$), by testing which 
463: of the representatives for isomorphism classes it can be mapped onto by
464: permuting the node labels.  State-of-the-art computer programs for
465: comparing two graphs, such as NAUTY~\cite{nauty}, proceed in two
466: steps. First, some invariants are calculated such as the number of
467: links, traces of various powers of the adjacency matrix, a sorted list
468: of node degrees, etc. In most cases, this shows that the two graphs
469: are not isomorphic (if any of these invariants disagree), but
470: obviously this does not resolve all cases. When ambiguities remain,
471: each graph is transformed into a standard form by a suitable
472: permutation, and the standard forms are compared. The standard form
473: is, of course, also a special invariant, so the distinction between
474: ``invariants'' and ``standard form'' might seem arbitrary. It becomes
475: relevant in practice, since the user of the package can specify which
476: invariants (s)he deems relevant, while the calculation of the
477: standard form is at the core of the algorithm and cannot be changed.
478: 
479: It is mostly the second step in this scheme which is time limiting 
480: and which renders it useless for our purposes -- although some 
481: invariants suggested e.g. by NAUTY are also quite demanding in CPU 
482: time. Thus we skip the second step and only use invariants that are fast to compute.
483: All these invariants, except for the number $n$ of nodes and the number
484: $\ell$ of links in the subgraph, are combined into a single index $I$, which is
485: intended to be a good discriminator between all non-isomorphic subgraphs
486: with the same $n$ and $\ell$. Whenever a new subgraph is found, the
487: triplet ($n, \ell , I$) is calculated and compared to triplets that
488: have already appeared.  If the triplet appeared previously, the
489: counter for this triplet is increased by 1; if not, a new counter is
490: initiated and set to 1.
491: 
492: Since no known invariant (other than standard form) can discriminate
493: between any two graphs, any method not using it is necessarily
494: heuristic. Some of the invariants we used are those defined in
495: Ref.~\cite{baskerville}. In addition, we use invariants based on
496: powers of the adjacency matrix and of its complement. More precisely,
497: if $A_{ij}$ is the adjacency matrix of a subgraph, then we define its
498: complement by $B_{ij} = 1-A_{ij}$ for $i\neq j$ and $B_{ij} = 0 =
499: A_{ij}$ for $i=j$. Any trace of any product $A^{a_1} B^{b_1} A^{a_2}
500: \ldots$ is invariant, and can be computed quickly. The same is true
501: for the number of non-zero elements of any such product, and for the
502: sum of all its matrix elements. The index $I$ is then either a linear
503: combination or a product (taken modulo $2^{32}$) of these invariants.
504: The particular choices were {\it ad hoc} and there is no reason to
505: believe they are optimal; hence those details are not given here.
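
To make the construction concrete, the following sketch computes a triplet
($n, \ell, I$) from a few matrix-product invariants and uses it as the key
of a census. The particular invariants, the multipliers $(1, 31, 131,
1031)$, and the names {\tt triplet} and {\tt relabel} are hypothetical
choices made for this illustration; they are not the choices used for the
results in this paper.

```python
from collections import Counter

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(X):
    return sum(X[i][i] for i in range(len(X)))

def triplet(A):
    """Return (n, l, I) for a subgraph given by its 0/1 adjacency matrix A."""
    n = len(A)
    links = sum(map(sum, A)) // 2
    # complement: B_ij = 1 - A_ij off the diagonal, 0 on it
    B = [[0 if i == j else 1 - A[i][j] for j in range(n)] for i in range(n)]
    A2, B2 = matmul(A, A), matmul(B, B)
    invariants = [
        trace(matmul(A2, A)),          # tr A^3
        trace(matmul(A2, A2)),         # tr A^4
        trace(matmul(B2, B)),          # tr B^3
        sum(map(sum, matmul(A, B))),   # sum of all elements of A B
    ]
    index = 0
    for c, inv in zip((1, 31, 131, 1031), invariants):  # hypothetical multipliers
        index = (index + c * inv) % 2**32
    return n, links, index

def relabel(A, perm):
    """Adjacency matrix after renaming node i to perm[i]."""
    n = len(A)
    P = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            P[perm[i]][perm[j]] = A[i][j]
    return P

path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
census = Counter()
for g in (path, relabel(path, [2, 0, 3, 1]), star):
    census[triplet(g)] += 1   # known triplet: increment; new triplet: starts at 1
```

The two labelings of the four-node path fall into the same counter, while
the star, which has the same $n$ and $\ell$, is separated by the index
alone. As stressed above, such an index is a heuristic and collisions
between non-isomorphic graphs remain possible.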
506: 
507: With the indices described in~\cite{baskerville}, all undirected
508: graphs of sizes $n\leq 8$ and all directed graphs with up to $5$ nodes
509: are correctly classified. In this work, a faster algorithm for
510: counting loops is used; hence loop counting is always included, in
511: contrast to the work of \cite{baskerville}. Index calculation based on
512: matrix products is even faster but less precise: only 11112 out of all
513: 11117 non-isomorphic connected graphs with $n=8$ were 
514: distinguished, and for directed graphs with $n=5$ just 4 graphs out of
515: 9608~\cite{integer-seq} were missed. For larger subgraphs we were not
516: able to test the quality of the indices systematically, but we can
517: cite some results for $n=9$. Using indices based on matrix products, 
518: we found 239846 different connected subgraphs with $n=9$ in the \coli 
519: protein interaction network~\cite{ecoli} and its rewirings. Given the 
520: fact that there are only 261080 different connected graphs with $n=9$ 
521: \cite{integer-seq}, that many of them might not appear in the \coli 
522: network, and that our sampling was not exhaustive, our graph 
523: classification method failed to distinguish at most 9\% of the 
524: non-isomorphic graphs -- and probably many fewer.
525: 
526: 
527: \section {Numerical Tests of the Sampling Algorithm}
528: 
529: To test the graph animal algorithm, we first sampled both $n=4$ and
530: $n=5$ subgraphs of the \coli network, as well as $n=4$ subgraphs of
531: the yeast network. In these cases exact counts are possible, and we
532: verified that the results from sampling agreed with results from exact
533: enumeration within the estimated (very small) errors. To obtain these results 
534: we used crude estimates for optimal $p$ values, namely $p=0.11$ for \coli
535: and $p=0.03$ for yeast. For larger subgraphs more precise estimates
536: for the optimal $p$ are required.
537: 
538: \subsection{Optimal values for $p$}
539: 
540: When $p$ is too small, only small clusters are regularly encountered.
541: If $p$ is too large, performance decreases because the weight factors
542: in Eq.~(\ref{estimate}) depend too strongly on the number of blocked
543: boundary sites, $b-g$. The latter varies from instance to instance,
544: and this can create huge fluctuations in the weights given to
545: individual subgraphs.
546: 
547: The networks we are interested in are sparse ($L/N \approx {\rm const} \ll
548: N$) and approximately scale-free \cite{yeast}.  As a result, most
549: nodes have only a few links, but some `hubs' have very high degree. In
550: fact, the degrees of the strongest hubs may diverge in the limit
551: $N\to\infty$.  For such networks it is well-known that the threshold
552: for spreading of an infinite SIR epidemic is zero~\cite{pastor}. On
553: finite networks this means that one can create huge clusters even for
554: minute $p$, and this tendency increases as $N$ increases.  Thus, we
555: anticipate the optimal $p$ to be small, and to decrease noticeably in
556: going from the \coli ($N=230$) to the yeast network ($N=2559$). This
557: is, in fact, what we find.
558: 
559: \begin{figure}
560:   \begin{center}
561:    \psfig{file=rms-errors.ps,width=6.4cm,angle=270}
562:    \caption{(color online) Root mean square relative errors of connected
563:      subgraph counts, Eq.~(\ref{sigma}), for the yeast ($n=5$ to $8$)
564:      and \coli ($n=7$) networks. In most cases, clear minima indicate
565:      roughly the optimum value for $p$, with caveats as explained in
566:      the text. Each data point is based on $4\times 10^9$ generated
567:      subgraphs. Smaller values of $\sigma_n(p)$ indicate that the 
568:      census for subgraphs with $n$ nodes is on average more precise.}
569: \label{count-errors.fig}
570: \end{center}
571: \end{figure}
572: 
573: As a first test, we compute the root mean square relative errors of
574: the subgraph counts, averaged over all subgraphs of fixed size $n$.
575: Let $\gamma_n$ be the number of different subgraphs of size $n$ found,
576: and let $\Delta c_G$ be the error of the count for subgraph $G$. These
577: errors were estimated by dividing the set of $M$ independent samples
578: into bins, and estimating the fluctuations from bin to bin. Then
579: \be
580:    \sigma_n(p) = \left[{1\over \gamma_n} \sum_{j=1}^{\gamma_n}
581:            (\Delta c_{G_j}/{\hat c_{G_j}})^2\right]^{1/2}.
582: \label{sigma}
583: \ee
584: Smaller values of $\sigma_n(p)$ indicate that the subgraph census is
585: on average more precise.
586: Fig.~\ref{count-errors.fig} shows results for the yeast network, with
587: various values of $p$ and $n$. Also shown are data for the \coli
588: network, for $n=7$. Each simulation used for this figure (i.e., each
589: data point) involved $M = 4\times 10^9$ generated clusters. Our first
590: observation is that the results for \coli are much more precise than
591: those for yeast.  This is mainly due to smaller hubs ($k_{\rm
592:   max}^{\rm e. coli} = 36$, while $k_{\rm max}^{\rm yeast} = 141$), so
593: that much larger $p$ values~\cite{footnote2} could be used. Also in
594: all other aspects, our algorithm worked much better for the \coli
595: network than for yeast.  Therefore we show in the rest of this
596: section only results for yeast, with the understanding that whenever a test was
597: positive for yeast, an analogous test was also made for \coli with at
598: least as good results.
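
The binning estimate and Eq.~(\ref{sigma}) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the input layout (one list of per-bin count estimates for each subgraph) are hypothetical, and the error of a count is taken here as the standard error of the bin means, which is one common choice and not necessarily the exact procedure behind Fig.~\ref{count-errors.fig}.

```python
import math

def binned_relative_error(bin_estimates):
    """Relative error of one count, from the scatter of per-bin estimates.

    bin_estimates: estimates of the same count c_G, one per bin
    (assumed layout, at least two bins).
    """
    nbins = len(bin_estimates)
    mean = sum(bin_estimates) / nbins
    # unbiased variance of the per-bin estimates
    var = sum((x - mean) ** 2 for x in bin_estimates) / (nbins - 1)
    # standard error of the mean, divided by the mean
    return math.sqrt(var / nbins) / mean

def rms_relative_error(per_graph_bins):
    """sigma_n: root mean square of the relative errors over all subgraphs."""
    rel = [binned_relative_error(b) for b in per_graph_bins]
    return math.sqrt(sum(r * r for r in rel) / len(rel))
```

A subgraph whose bins all agree contributes zero, so a few poorly sampled subgraphs can dominate the average.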
599: 
600: Even with the large sample sizes used in Fig.~\ref{count-errors.fig},
601: many $n=8$ subgraphs were found only once (in which case we set
602: $\Delta c_{G_j} /{\hat c}_{G_j}=1$), which explains the high values of
603: $\sigma_8(p)$. This is also why we do not show any data for $n>8$ in
604: Fig.~\ref{count-errors.fig}.  The relative error $\sigma_n(p)$ for
605: each $n<8$ shows a broad minimum as a function of $p$. The increase in
606: $\sigma_n(p)$ at small $p$ is because of the paucity of different
607: graphs being generated. This effect grows when $n$ increases, explaining why
608: the minimum shifts to the right with increasing $n$. The increase of
609: $\sigma_n(p)$ for large $p$, in contrast, comes from large
610: fluctuations of weights for individual sampled graphs.  When $p$ is
611: large, the factor $(1-p)^{g-b}$ in Eq.~(\ref{estimate}) can also be
612: large, particularly in the presence of strong hubs.
613: 
614: Unfortunately, if a subgraph is found only once, it is impossible to
615: decide whether or not the frequency estimate is reliable. Even for
616: strong outliers, when the frequency estimate is far too large, the
617: formal error estimate cannot be larger than $\Delta c_G = O({\hat
618:   c}_G)$. This underestimates the true statistical errors and is
619: partially responsible for the fact that the curve for $n=8$ in
620: Fig.~\ref{count-errors.fig} does not increase at large $p$
621: \cite{footnote3}.
622: 
623: \begin{figure}
624:   \begin{center}
625:    \psfig{file=Fig2-weighthist.ps,width=6.3cm,angle=270}
626:    \caption{(color online) Histograms of $wP(\ln w) = w^2P(w)$ for
627:      connected $n=8$ subgraphs of the yeast network. Each curve
628:      corresponds to one run ($4\times 10^9$ generated subgraphs)
629:      with fixed value of $p$. Results are more reliable
630:      the further to the left the maximum of the curve lies and the
631:      faster its tail decreases at large $w$.}
632: \label{weight-hist.fig}
633: \end{center}
634: \end{figure}
635: 
636: A more direct understanding of the decreasing performance at large
637: $p$ comes from histograms of the (logarithms of) weight factors. Such 
638: histograms, for $n=8$ subgraphs in the yeast network, are shown in 
639: Fig.~\ref{weight-hist.fig}. From the results in Section~\ref{alg}
640: \be
641:    w = {2L\over nMk}p^{1-n}(1-p)^{g-b}
642: \ee 
643: is the weight for a subgraph with $n$ nodes, $b$ boundary nodes, and
644: $g$ growth nodes. 
645: %$P(\ln w) = wP(w)$ is the probability distribution function of $\ln w$. 
646: The algorithm produces reliable estimates if $P(w)$ decreases for 
647: large $w$ faster than $1/w^2$, since averages
648: (which are weighted by $w$) are then dominated by subgraphs that are
649: well sampled. If, in contrast, $P(w)$ decreases more slowly, then the
650: tail of the distribution dominates, and the results cannot be taken at
651: face value \cite{grass-PERM}. We observe from
652: Fig.~\ref{weight-hist.fig} that the data for $n=8$ is indeed reliable
653: for $p<0.07$ only. The curve for $p=0.09$ in
654: Fig.~\ref{weight-hist.fig} also bends over at very large values of
655: $w$, indicating that even for this $p$ our estimates should eventually
656: become reliable, when the sample sizes become sufficiently large.  But this
657: would require extremely large sample sizes.
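
The role of the borderline $1/w^2$ can be made explicit with a short sketch, assuming only that estimates are sample averages of $w$ and that $P(\ln w) = wP(w)$. Changing the integration variable from $w$ to $\ln w$ gives
\be
   \langle w\rangle = \int w\,P(w)\,dw = \int w^2 P(w)\,d\ln w 
                    = \int w\,P(\ln w)\,d\ln w .
\ee
The plotted quantity $wP(\ln w)$ is thus the contribution to the average per unit interval of $\ln w$. The average is dominated by the well sampled region around the maximum only if $w^2P(w)$ decreases at large $w$, i.e. if $P(w)$ falls off faster than $1/w^2$.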
658: 
659: As a last test we checked whether the estimates $\hat{c}_G$ are
660: independent of $p$ as they should be. Fig.~\ref{estimate-p.fig} shows
661: the estimates obtained for $n=8$ subgraphs in the yeast network with
662: $p=0.025$ and $p=0.07$ against those obtained with $p=0.04$.  Clearly,
663: the data cluster along the diagonal -- showing that the estimates are
664: basically correct. They scatter more when the counts are lower (i.e.
665: in the lower left corner of the plot).  The asymmetries in that region
666: result from the fact that rarely occurring subgraphs are completely
667: missed for $p=0.04$ and even more so for $p=0.025$, cutting off
668: thereby the distributions at small ${\hat c}_G$. For larger counts,
669: the estimates for $p=0.025$ are more precise than those for $p=0.07$.
670: The latter show high-weight ``glitches'' arising from the tail of
671: $P(w)$ discussed earlier in this section.
672: 
673: \begin{figure}
674:   \begin{center}
675:    \psfig{file=Fig3-25_7-4-x.ps,width=7.6cm,angle=270}
676:    \caption{(color online) Scatter plots of ${\hat c}_G(p=0.025)$ and
677:      ${\hat c}_G(p=0.07)$ against ${\hat c}_G(p=0.04)$ for connected $n=8$
678:      subgraphs of the yeast network. The clustering of the data along
679:      the diagonal indicates the basic reliability of the estimates,
680:      independent of the precise choice of $p$. Sample sizes were $4\times
681:      10^{10}$ for $p=0.04$, $2.4\times 10^{10}$ for $p=0.025$, and
682:      $8\times 10^9$ for $p=0.07$. The latter two correspond to roughly
683:      the same CPU time.}
684: \label{estimate-p.fig}
685: \end{center}
686: \end{figure}
687: 
688: For increasing $p$, the numbers $m_G$ of generated subgraphs of type
689: $G$ increase of course (as the epidemic survives longer), so that
690: average weights, defined as $\langle w_G\rangle = {\hat c}_G M / m_G$,
691: decrease. But this decrease is not uniform for all $G$. Rather, it is
692: strongest for fully connected subgraphs ($\ell = n(n-1)/2$), and is
693: weakest for trees. For the yeast network and $n=8$, e.g., $\langle
694: w_G\rangle$ averaged over all trees decreases by a factor $\sim 18$
695: when $p$ increases from 0.025 to 0.085, while $\langle w_G\rangle$ averaged 
696: over all graphs with $\ell \ge 25$ decreases by a factor $\sim 1700$. Smaller
697: values of $\langle w_G\rangle$ are preferable, as they imply
698: smaller fluctuations.  Thus it would be most efficient to use larger
699: $p$ values for highly connected subgraphs, and smaller $p$ for
700: tree-like subgraphs.  Counting very highly connected subgraphs --
701: where every node has a degree in the subgraph $\geq k_0$, say -- is also made easier by
702: first reducing the network to its $k$-core with $k=k_0$, and then
703: sampling from the latter.
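
The reduction to a $k$-core mentioned above can be sketched as follows. This is the standard pruning procedure (repeatedly delete nodes with fewer than $k$ remaining neighbours), written as a minimal illustration; the function name and the representation of the network as a dictionary of neighbour sets are assumptions made for this example.

```python
def k_core(adj, k):
    """Return the set of nodes in the k-core of an undirected graph.

    adj: dict mapping each node to the set of its neighbours (assumed layout).
    """
    alive = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    # nodes that already violate the degree condition
    stack = [v for v in alive if degree[v] < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already removed via an earlier stack entry
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < k:
                    stack.append(u)
    return alive
```

Sampling would then be restricted to the returned node set and the links among its members.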
704: 
705: 
706: \section{Results}
707: 
708: \subsection{Characterization of the networks}
709: 
710: As already stated, both networks as we use them are connected, i.e.\ each
711: consists of a single component~\cite{footnote}.
712: The \coli network has 230 nodes and 695 links, while the yeast network
713: has 2559 nodes and 7031 links. Both networks show strong clustering,
714: as measured by the clustering coefficients \cite{watts-stro}
715: \be
716:    C_i = {2\over k_i(k_i-1)}\sum_{j<m} A_{jm}
717: \ee
718: where $k_i$ is the degree of node $i$ and the sum runs over all pairs
719: of nodes linked directly to $i$. In Fig.~\ref{clustering.fig} we show
720: averages of $C_i$ over all nodes with fixed degree $k$. We see that
721: $\langle C\rangle_k$ is quite large, but has a noticeably different
722: dependence on $k$ for the two networks.  While it decreases with $k$
723: for \colp, it attains a maximum at $k\approx 15$ for yeast.
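
As a minimal sketch of how $C_i$ and the degree-resolved average $\langle C\rangle_k$ can be computed, the following Python fragment counts, for each node, the links among its neighbours. The function names and the dictionary-of-sets representation are hypothetical, and skipping nodes with $k_i<2$ (for which $C_i$ is undefined) is an assumption of this example rather than a statement about the convention used for the figures.

```python
from collections import defaultdict
from itertools import combinations

def local_clustering(adj):
    """C_i for every node with degree >= 2.

    adj: dict mapping each node to the set of its neighbours (assumed layout).
    """
    C = {}
    for i, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # C_i is undefined for k_i < 2
        # number of linked pairs among the neighbours of i
        links = sum(1 for j, m in combinations(nbrs, 2) if m in adj[j])
        C[i] = 2.0 * links / (k * (k - 1))
    return C

def clustering_by_degree(adj):
    """Average of C_i over all nodes with the same degree k."""
    groups = defaultdict(list)
    for i, c in local_clustering(adj).items():
        groups[len(adj[i])].append(c)
    return {k: sum(v) / len(v) for k, v in groups.items()}
```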
724: 
725: \begin{figure}
726:   \begin{center}
727:    \psfig{file=Fig_clusterings.ps,width=6.3cm,angle=270}
728:    \caption{(color online) Average clustering coefficients for nodes 
729:      with fixed degree $k$ plotted versus the degree, for
730:      the giant component of the yeast and \coli protein interaction
731:      networks. While the clustering coefficient decreases with
732:      $k$ for \colp, it attains a maximum at $k \approx 15$ for yeast.}
733: \label{clustering.fig}
734: \end{center}
735: \end{figure}
736: 
737: \begin{figure}
738:   \begin{center}
739:    \psfig{file=Fig_kcores.ps,width=6.3cm,angle=270}
740:    \caption{(color online) Sizes of the $k$-cores for the two networks,
741:      plotted against $k$. Notice that the $k$-cores for yeast contain a 
742:      nearly fully connected cluster with 17 nodes. In addition to the 
743:      core sizes for the original networks, the figure also shows average
744:      core sizes for rewired networks as discussed in section V C.}
745: \label{k-cores.fig}
746: \end{center}
747: \end{figure} 
748: 
749: The unweighted average clustering ${\bar C} = N^{-1}\sum_{i=1}^N C_i$
750: is 0.1947 for yeast, and 0.2235 for \colp.  Due to the different
751: behavior of $\langle C\rangle_k$, the ranking is reversed for the
752: weighted averages
753: \be
754:     \langle C\rangle = {\sum_{i=1}^N C_i k_i(k_i-1) \over 
755:       \sum_{i=1}^N k_i(k_i-1)} = {3n_\Delta\over 3n_\Delta + n_\vee}, 
756: \ee 
757: where $n_\Delta$ is the number of fully connected triangles on the
758: network and $n_\vee$ is the number of triads with two links
759: (see~\cite{newman_clust} for a somewhat different formula).
760: Numerically, this gives $\langle C\rangle = 0.1948$ for yeast and
761: 0.1552 for \colp. This can be understood as a consequence of the fact
762: that the relative frequency of fully connected triangles is higher in
763: yeast than in \colp: in yeast (\colp) there are 6969 (478) triangles
764: compared to 86291 (7805) triads with two links.
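
The second equality in the formula for $\langle C\rangle$ can be checked by counting connected triples; $t_i$ is a shorthand introduced only for this check. Let $t_i = \sum_{j<m} A_{jm}$ be the number of triangles through node $i$, so that $C_i k_i(k_i-1) = 2t_i$. Each triangle is counted once at each of its three corners. Moreover, $k_i(k_i-1)/2$ is the number of pairs of neighbours of $i$, and each such pair is either closed into a triangle or left open as a triad with two links. Hence
\bea
   \sum_{i=1}^N C_i k_i(k_i-1) &=& 2\sum_{i=1}^N t_i = 6 n_\Delta , \\
   \sum_{i=1}^N k_i(k_i-1) &=& 2\,(3n_\Delta + n_\vee) ,
\eea
and the ratio reduces to $3n_\Delta/(3n_\Delta + n_\vee)$. For \colp, for instance, the numbers quoted above give $3\times 478/(3\times 478 + 7805) = 1434/9239 \approx 0.1552$.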
765: 
766: Associated with this difference are distinctions between the $k$-cores
767: \cite{seidman}
768: of the two networks. Fig.~\ref{k-cores.fig} shows the sizes of the
769: $k$-cores against $k$. We see that the yeast network contains
770: non-empty cores with $k$ up to 15. Moreover, the core with $k=15$ has
771: exactly 17 nodes. It is a nearly fully connected subgraph with just
772: one missing link. All 17 proteins in this core are parts of the 26S
773: proteasome which consists of 20 or 21 proteins \cite{mips,sgd}. All
774: these proteins presumably interact very strongly with each other. When the
775: interactions between the proteins within the 26S proteasome are taken
776: out (the corresponding elements of the adjacency matrix are set to
777: zero), the $k$-core with highest $k$ has $k=12$ and consists of 15
778: nodes. All its nodes correspond to proteins in the mediator complex of
779: RNA polymerase II \cite{mips}, which contains 20 proteins altogether.
780: After eliminating all interactions between these, two 11-cores with
781: respectively 13 and 14 nodes remain, the first corresponding to the
782: 20S proteasome and the second corresponding to the RSC complex
783: \cite{mips}. Again these particular complexes have only a few more
784: proteins than those contained within their largest $k$-cores, so they
785: are very tightly bound together. All remaining complexes appear to be
786: more loosely bound, so that much of the strong larger scale clustering
787: in the yeast network (involving 7--10 nodes) can be traced to only a
788: few tightly bound complexes.  This has a big effect on the subgraph
789: counts, as we shall see.
790: 
791: \subsection{Trends in Subgraph counts}
792: 
793: Subgraph counts ${\hat c}_G$ for the \ecoli and yeast networks, plotted
794: against $n^2 +2\ell$, are shown in Figs.~\ref{ecoli-subgraphs.fig} and
795: \ref{yeast-subgraphs.fig}. For large $n$ we see a very wide range,
796: with counts varying between 1 and $>10^8$.  In general, counts
797: decrease with increasing number of links, i.e. trees are most
798: frequent. This is a direct consequence of the fact that the networks
799: are sparse. Even when $n$ and $\ell$ are fixed, the counts $c_G$ can
800: range over six orders of magnitude (e.g. for yeast with $n=8$ and
801: $\ell=17$).
802: 
803: \begin{figure}
804:   \begin{center}
805:    \psfig{file=Fig-ecoli-counts.ps,width=6.3cm,angle=270}
806:    \caption{(color online) Counts for connected subgraphs with fixed topology
807:      and with $n\le 8$ in the \coli network, plotted against $n^2
808:      +2\ell$. The variable $n^2 +2\ell$ is used to spread out the
809:      data, so that the dependence on both $n$ and $\ell$ (number of
810:      links) can be seen independently, without data points
811:      overlapping. For most of the points, the error bars are smaller
812:      than the sizes of the symbols.}
813: \label{ecoli-subgraphs.fig}
814: \end{center}
815: \end{figure}
816: 
817: \begin{figure}
818:   \begin{center}
819:    \psfig{file=Fig-yeast-counts.ps,width=6.3cm,angle=270}
820:    \caption{(color online) Counts for subgraphs with fixed topology
821:      and with $n\le 8$ in the yeast network, plotted against
822:      $n^2+2\ell$ as in Fig.~\ref{ecoli-subgraphs.fig}.}
824: \label{yeast-subgraphs.fig}
825: \end{center}
826: \end{figure} 
827: 
828: For the yeast network, there are clear systematic trends for the
829: counts at fixed $n$ and $\ell$. The most frequent subgraphs are those
830: with strong heterogeneity, i.e. with a large variation of the degrees
831: (within the subgraph) of nodes, while the most rare are those with
832: minimal variation. Fig.~\ref{yeast-var.fig} shows the counts ${\hat c}_G$ for $n=8$
833: and with four different values of $\ell$ plotted against the variance
834: of the degrees of the nodes within the subgraph,
835: \be 
836:    \sigma^2 = {1\over n}\sum_{i=1}^n k_i^2 - \left[{1\over n}\sum_{i=1}^n k_i\right]^2.  
837:                    \label{var}
838: \ee
839: For all four curves we see a trend, where the count increases with
840: $\sigma$, but hardly any trend like this is seen for the \coli network
841: (data not shown). The effect seen in the yeast data is probably
842: related to the very strongly connected core in that network (see the
843: last subsection). As we shall also see later in subsection D,
844: subgraphs with high counts in yeast often have a tadpole form with a
845: highly connected body (which is part of one of the densely connected 
846: complexes discussed in the last subsection) and a short
847: tail attached to it. These cores may also be responsible for the main
848: difference between Figs.~\ref{ecoli-subgraphs.fig} and
849: \ref{yeast-subgraphs.fig}, namely the strong representation of very
850: highly connected (large $\ell$) subgraphs in the yeast network. Taking out all 
851: interactions within the 26S and 20S proteasomes, within the mediator
852: complex and within the RSC complex reduces substantially the counts for 
853: highly connected subgraphs. The count for the complete $n=7$ subgraph, e.g.,
854: is reduced in this way from $25{,}164\pm 68$ to $682\pm 23$. The removal 
855: of interactions within the 26S proteasome makes by far the biggest 
856: contribution.
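
The degree variance of Eq.~(\ref{var}) is straightforward to evaluate for a given subgraph. The following sketch assumes, for illustration only, that the subgraph is given as a list of links between nodes labelled $0,\ldots,n-1$; the function name is hypothetical.

```python
def degree_variance(edges, n):
    """Variance of the node degrees inside a subgraph with n nodes.

    edges: list of (a, b) pairs with node labels 0..n-1 (assumed layout).
    """
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    mean = sum(deg) / n
    # <k^2> - <k>^2, with degrees counted within the subgraph only
    return sum(d * d for d in deg) / n - mean * mean
```

For $n=4$ and $\ell=3$, for example, a star gives $\sigma^2 = 0.75$ while a chain gives $\sigma^2 = 0.25$, so the star counts as the more heterogeneous subgraph.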
857: 
858: \begin{figure}
859:   \begin{center}
860:    \psfig{file=Fig.yeast-var-count.ps,width=6.3cm,angle=270}
861:    \caption{(color online) Counts for $n=8$ subgraphs of the yeast
862:      network with $\ell = 7, 10,13,$ and $18$, plotted against the
863:      variance of the node degrees within the subgraphs, as given by 
864:      Eq.~(\ref{var}). Zero variance means that all nodes
865:      have exactly the same degree, whereas a higher variance indicates
866:      that the node degrees differ more widely. Typically, subgraphs
867:      with more variation in their node degrees (and thus with larger $\sigma^2$) 
868:      have higher counts than those for which the degrees within the 
869:      subgraph are more uniform.}
870: \label{yeast-var.fig}
871: \end{center}
872: \end{figure}
873: 
874: \subsection{Zipf plots}
875: 
876: In~\cite{baskerville} it was found that ``Zipf plots'' (subgraph counts
877: vs. rank) in the \coli network exhibit power law behavior, whose
878: origin is not yet understood. The essential difference between the
879: subgraph counts in~\cite{baskerville} and in the present paper is that
880: we sample only connected subgraphs, while {\it all} subgraphs with
881: given $n$ were ranked in~\cite{baskerville}.  Also, noting that
882: disconnected subgraphs are more likely to be sampled than connected
883: ones when picking nodes at random (due to the sparsity of the
884: networks), we can go to much higher ranks for the connected subgraphs.
885: 
886: Zipf plots for connected subgraphs in the \coli network are shown in
887: Fig.~\ref{zipf.fig}. Each curve is based on $4\times 10^9$ to
888: $10^{10}$ generated subgraphs. Each is strongly curved,
889: suggesting that there are no power laws -- at least for subgraph sizes
890: where we obtain reasonable statistics for the census. The curves
891: show less curvature for larger $n$, but this is a gradual effect. It
892: seems that the scaling behavior found in \cite{baskerville} was mainly
893: due to the presence of disconnected graphs, although it is not
894: immediately obvious why those should give scale-free statistics
895: either. In addition, the right hand tails of the Zipf plots in 
896: \cite{baskerville} were cut
897: off because of substantially lower statistics. In our case, apparently
898: sharp cutoffs in the counts are observed for ranks $\approx
899: 1.08\times 10^4$ for $n=8$, $\approx 2.1\times 10^5$ for $n=9$, and
900: $\approx 2.9\times 10^6$ for $n=10$. For $n\leq 9$ these are close to
901: the total number of different connected subgraphs~\cite{briggs},
902: suggesting that we have fairly complete statistics. For $n=10$ the
903: cutoff is more affected by lack of statistics, but it is still within
904: a factor of four of the upper limit.
905: 
906: \begin{figure}
907:  \begin{center}
908:   \psfig{file=Fig-Zipf.ps,width=6.3cm,angle=270}
909:   \caption{(color online) ``Zipf'' plots showing the counts for individual
910:     connected subgraphs with fixed $n$, plotted against their rank.
911:     Data are for the \coli network.}
912: \label{zipf.fig}
913: \end{center}
914: \end{figure}  
915: 
916: 
917: \subsection{Null model comparison and motifs}
918: 
919: One of the most striking results of \cite{baskerville} was that most
920: large subgraphs were either strong motifs or strong anti-motifs.
921: However, this finding was based on rather limited statistics and on a
922: single protein interaction network.  One of the purposes of the
923: present study is to test this and other results of \cite{baskerville}
924: with much higher statistics and for a larger network, the protein
925: interaction network of yeast.
926: 
927: \begin{figure}
928:   \begin{center}
929:    \psfig{file=Fig-ecoli-countratios.ps,width=6.3cm,angle=270}
930:    \caption{(color online) Ratios between the count estimates
931:      $\hat{c}_G$ for connected subgraphs in the \coli
932:      network, and the corresponding average counts $\langle\hat{c}^{(0)}_G\rangle$
933:      in rewired networks. The data are plotted against $n^2+2\ell$,
934:      again to spread the points out conveniently. Most error bars are
935:      smaller than the symbols.}
936: \label{nullratio-ecoli.fig}
937: \end{center}
938: \end{figure}
939: 
940: \begin{figure}
941:   \begin{center}
942:    \psfig{file=Fig-yeast-countratios.ps,width=6.3cm,angle=270}
943:    \caption{(color online) Same as Fig.~\ref{nullratio-ecoli.fig}, but for 
944:      the yeast network.  Notice that most data points for large $n$ and
945:      $\ell$ are missing. Indeed, for $n=7$ all (!) data points with $\ell >
946:      16$ are missing, because no such subgraphs were found in the
947:      rewired ensemble.}
948: \label{nullratio-yeast.fig}
949: \end{center}
950: \end{figure}
951: 
952: To define a motif requires a null model. We take this to be the ensemble 
953: of networks with the same degree sequence, obtained by the rewiring 
954: method.  The average subgraph counts in the null ensemble are denoted 
955: as $\langle c_G^{(0)}\rangle$.  In Figs.~\ref{nullratio-ecoli.fig} and
956: \ref{nullratio-yeast.fig} we plot the ratios $c_G / \langle c_G^{(0)}\rangle$ 
957: against the variable $n^2+2\ell$ for each connected subgraph that was sampled 
958: both in the original graph and in at least one of the rewired graphs. 
959: The error bars, which include both
960: statistical errors from sampling and the ensemble fluctuations of the
961: null model estimated from several hundred rewired networks, are for
962: most points smaller than the symbols. A subgraph is a motif
963: (anti-motif), if this ratio is significantly larger (smaller) than 1.
964: Notice that motifs do not in general occur particularly
965: frequently in the original network. Even without rigorous estimates 
966: of significance, it is clear that most densely connected 
967: subgraphs are motifs in the yeast network. The fact that trees or 
968: subgraphs with few loops tend to be anti-motifs might not be so evident 
969: from Fig.~\ref{nullratio-yeast.fig}, since the ratios for trees and
970: tree-like graphs are close to one. Thus we have to discuss
971: significance more formally.
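
A null ensemble with fixed degree sequence is commonly generated by repeated link swaps. The sketch below shows this standard double-link swap as an illustration only; it is an assumption that it matches the rewiring method referred to above, and the function name, the rejection rule and the cap on the number of attempts are choices made for this example.

```python
import random

def rewire(edges, n_swaps, seed=0):
    """Degree-preserving randomisation of an undirected simple graph.

    edges: list of (a, b) pairs without self-loops or duplicates.
    Two links (a, b) and (c, d) are replaced by (a, d) and (c, b);
    swaps that would create a self-loop or a multiple link are rejected.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = 0
    attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose the orientation of the second link at random
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # would create a multiple link
        present.discard(frozenset((a, b)))
        present.discard(frozenset((c, d)))
        present.add(new1)
        present.add(new2)
        edge_list[i] = (a, d)
        edge_list[j] = (c, b)
        done += 1
    return edge_list
```

Each accepted swap leaves the degree of all four nodes unchanged, so the degree sequence of the original network is preserved exactly.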
972: 
973: \subsubsection{$Z$-scores}
974: 
975: Usually~\cite{baskerville}, the significance of a motif (or
976: anti-motif) is measured by its $Z$-score
977: \be 
978:    Z = {c_G - \langle c_G^{(0)}\rangle \over \sigma_G^{(0)}}\;,
979:               \label{Z}
980: \ee
981: where $\sigma_G^{(0)}$ is the standard deviation of $c_G^{(0)}$ within the null 
982: ensemble. A subgraph is a motif (anti-motif), if $Z \gg 1$ ($Z\ll -1$).
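
Eq.~(\ref{Z}) can be evaluated directly from the counts of a subgraph in the rewired networks. The following is a minimal sketch; the function name and the use of the sample standard deviation (rather than the population value) are assumptions of this example.

```python
import statistics

def z_score(count, null_counts):
    """Z-score of a subgraph count against its counts in rewired networks.

    null_counts: one count per rewired network (at least two values,
    not all equal, so that the standard deviation is non-zero).
    """
    mean0 = statistics.mean(null_counts)
    sigma0 = statistics.stdev(null_counts)  # sample standard deviation
    return (count - mean0) / sigma0
```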
983: 
984: The eight strongest motifs with $n=7$ in the \coli network according
985: to this definition are shown in Fig.~\ref{fig:ecoli_motif}, together
986: with their $Z$-values. To name the strongest motifs in the yeast
987: network is less straightforward, since many subgraphs did not show
988: up in any rewired network at all. Assuming for those subgraphs 
989: $\sigma_G^{(0)} = \langle c_G^{(0)}\rangle = 0$ would give $Z=\infty$. 
990: Rough lower bounds on $Z$ are obtained for them by assuming that $\langle
991: c_G^{(0)}\rangle < 1/R$ and $\sigma_G^{(0)} < 1/\sqrt{R}$, where $R$
992: is the number of rewired networks that were sampled, giving $Z\geq c_G\sqrt{R}$. 
993: Some of the strongest motifs in the yeast network, together with their
994: estimated $Z$-scores, are shown in Fig.~\ref{fig:yeast_motif}. Note
995: that no $n=7$ graphs with $\ell>16$ were found in any of the realizations of
996: the null model, while they were all found in the real yeast network. Hence
997: these are all strong motifs. Those motifs in Fig.~\ref{fig:yeast_motif} for 
998: which only lower bounds for the $Z$-score are given are the most frequent in 
999: the real network, hence they have the highest lower bound. It was
1000: pointed out in \cite{spirin,ispolatov} that cliques (complete subgraphs) 
1001: are in general very strong motifs. In yeast, the $n=7$ clique (with
1002: $\ell=21$) is indeed a very strong motif, but it does not have the largest
1003: lower bound on the $Z$-score.  In comparison, anti-motifs have rather
1004: modest $Z$-scores. The strongest anti-motif with $n=7$ has $Z=-32.9$
1005: ($Z=-24.7$) for \coli (yeast).
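
The rough lower bound used above for subgraphs absent from all rewired networks follows from inserting the two assumed inequalities into Eq.~(\ref{Z}); this is only a sketch of the arithmetic:
\be
   Z = {c_G - \langle c_G^{(0)}\rangle \over \sigma_G^{(0)}}
     > \left(c_G - {1\over R}\right)\sqrt{R}
     = c_G\sqrt{R}\,\left(1 - {1\over R\,c_G}\right) .
\ee
For $R\,c_G \gg 1$ the correction is negligible, and the bound is effectively $c_G\sqrt{R}$. It grows linearly with the count in the real network, which is why the most frequent of these subgraphs receive the largest lower bounds.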
1006: 
1007: \begin{figure}
1008:   \begin{center}
1009:    \psfig{file=maya_fig2.eps,width=8.5cm,angle=0}
1010:    \caption{The eight strongest motifs with $n=7$ in the \coli protein
1011:      interaction network. These tend to be almost bipartite graphs,
1012:      and many pairs of nodes are linked to the same set of neighbors.
1013:      Their $Z$-scores, in order from left to right, first then second
1014:      row, are: $2.9\times 10^4, 932, 885, 648, 595, 532, 516$ and
1015:      377. Their estimated frequencies in the original \coli network 
1016:      are, in the same order: $20936\pm 8,
1017:      161521\pm 63, 8312\pm 5, 1331\pm 2, 838\pm 2, 5985 \pm 5, 5165\pm 4,$
1018:      and $ 519\pm 1$.}
1019: \label{fig:ecoli_motif}
1020: \end{center}
1021: \end{figure}
1022: 
1023: \begin{figure}
1024:   \begin{center}
1025:    \psfig{file=m3a.eps,width=8.3cm,angle=0}
1026:    \caption{Eight very strong motifs with $n=7$ for the yeast protein
1027:      interaction network. These tend to be almost complete graphs with
1028:      a single dangling node.  Four of these graphs were not seen in
1029:      any realization of the null model, so only lower bounds on their 
1030:      $Z$-scores can be given. From left to right, first then second
1031:      row, the estimated $Z$-scores are: $>3\times 10^7, 9\times 10^5,
1032:      >8\times 10^6, 5\times 10^5,>4\times 10^6,3\times 10^5,2.5\times 10^5$,
1033:      and $>1.5\times 10^6$. Estimated frequencies are, in the same order:
1034:      $6.68(1)\times 10^5, 9.27(5)\times 10^4, 1.76(1)\times 10^5, 4.84(1)\times 10^5, 
1035:      7.78(2)\times 10^4, 3.13(6)\times 10^5, 1.38(1)\times 10^5$, and
1036:      $3.35(1)\times 10^4$.}
1037: \label{fig:yeast_motif}
1038: \end{center}
1039: \end{figure}
1040: 
1041: With $Z$-values up to $10^7$ and more, as in
1042: Fig.~\ref{fig:yeast_motif}, the motivation for using $Z$-scores
1043: becomes suspect. On the one hand, the null model is clearly unable
1044: to describe the actual network, and has to be replaced by a more
1045: refined null model. This will be done in a future paper
1046: \cite{newpaper}. On the other hand, it suggests using instead a
1047: $Z$-score based on {\it logarithms} of counts,
1048: \be 
1049: Z_{\rm log} = {\log c_G - \langle \log c_G^{(0)}\rangle \over
1050:   \sigma_{\log, G}^{(0)}}\;,
1051: \label{Z_log}
1052: \ee
1053: where $\sigma_{\log, G}^{(0)}$ is the standard deviation of $\log
1054: c_G^{(0)}$.  An advantage of Eq.~(\ref{Z_log}) would be that it
1055: suppresses $|Z|$ for motifs, but enhances $|Z|$ for anti-motifs.
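
A minimal sketch of Eq.~(\ref{Z_log}) in Python is given below. The function name is hypothetical, and the sketch assumes that all counts are strictly positive; subgraphs that are absent from some rewired networks would need a separate treatment, which is not specified here.

```python
import math
import statistics

def z_log(count, null_counts):
    """Z-score based on logarithms of counts.

    Assumes count > 0 and all null_counts > 0 (the logarithm is
    undefined otherwise) and at least two distinct null counts.
    """
    logs = [math.log(c) for c in null_counts]
    mean_log = statistics.mean(logs)
    sigma_log = statistics.stdev(logs)  # sample standard deviation
    return (math.log(count) - mean_log) / sigma_log
```

For null counts 1, 2, 4, a real count of 8 gives $Z_{\rm log}=2$ and a real count of $1/4$ would give $Z_{\rm log}=-3$, which illustrates how the logarithm treats over- and under-representation by the same factor symmetrically.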
1056: 
1057: In general, strong yeast motifs have a tadpole structure with a
1058: complete or almost complete body, and a tail consisting of a few nodes
1059: with low degree. This agrees nicely with our previous observation that
1060: frequently occurring subgraphs in the yeast network have strong
1061: heterogeneity in the degrees of their nodes.  In contrast, strong \coli 
1062: motifs with not too many loops are all based on a 4-3 or 5-2 bipartite
1063: structure. When the number of loops increases, strictly bipartite
1064: structures are impossible, but the tendency towards these structures
1065: is still observed. 
1066: 
1067: Whether we use $Z$-scores or the ratio $c_G/\langle c_G^{(0)}\rangle$ to identify
1068: motifs makes very little difference. Using either criterion, the
1069: strengths of the strongest motifs skyrocket with subgraph size. This
1070: is most dramatically apparent for the yeast network. Indeed,
1071: correlations between $Z$-scores of individual graphs in the yeast and
1072: \coli networks (data not shown) are much weaker than correlations
1073: between count ratios. The latter are shown in Fig.~\ref{graph-r} for
1074: $n=7$ subgraphs.
1075: 
1076: \subsubsection{Twinning versus Clustering}
1077: 
1078: Another characteristic feature of strong motifs in the \coli network is 
1079: the tendency for `twin' nodes. We call two nodes in a subgraph twins if
1080: they are connected to the same set of neighbours in the subgraph.
1081: Otherwise said, nodes $i$ and $k$ are twins, iff the $i$-th and $k$-th
1082: rows of the subgraph adjacency matrix are identical. Notice that twin
1083: nodes can be created most naturally by duplicating genes. We
1084: found that subgraphs with many pairs of twin nodes are in general also
1085: motifs in the yeast network, but they do not stand out spectacularly
1086: from the mass of other motifs. They could be the `genuine' motifs also
1087: for yeast, but only a better null model where all subgraphs actually 
1088: occur with reasonable frequency would be able to prove or disprove this.
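
The definition of twin nodes translates directly into a comparison of rows of the subgraph adjacency matrix. The following sketch counts twin pairs; the function name and the list-of-lists representation of the matrix are assumptions made for illustration.

```python
from itertools import combinations

def count_twin_pairs(A):
    """Number of pairs of nodes with identical rows in the adjacency matrix.

    A: symmetric 0/1 adjacency matrix of the subgraph as a list of rows,
    with zeros on the diagonal (assumed layout).
    """
    n = len(A)
    return sum(1 for i, k in combinations(range(n), 2)
               if list(A[i]) == list(A[k]))
```

Since the diagonal is zero, two nodes with identical rows cannot be linked to each other. A star with three leaves therefore has three twin pairs, while a triangle has none.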
1089: 
1090: In Fig.~\ref{graph-r} we also indicated the dependence on the number 
1091: $n_{\rm twin}$ of pairs of twin nodes, by marking subgraphs with 
1092: $n_{\rm twin}>3$ ($n_{\rm twin}> 1$) by bullets (asterisks). We 
1093: see that all strong motifs in \coli have multiple pairs of twin nodes.
1094: These subgraphs tend to be also motifs of comparable strength in yeast
1095: -- the bullets in Fig.~\ref{graph-r} tend to cluster on the diagonal
1096: $[c_G /\langle c_G^{(0)}\rangle]_{\rm E.~coli} = [c_G /\langle c_G^{(0)}\rangle]_{\rm yeast}$. However, there
1097: are even stronger motifs in yeast that have no twin nodes. These
1098: graphs are typically much weaker motifs or not motifs at all in \colp.
1099: 
1100: \begin{figure}
1101:   \begin{center}
1102: \epsfig{file=ecoli-yeast-ratios.ps, width=6.3cm, angle=270}
1103: \caption{(color online) Count ratios $c_G /\langle c_G^{(0)}\rangle$ for individual
1104:   subgraphs in the \coli network, plotted against the count ratio for
1105:   the same subgraph in the yeast network. To highlight the dependence
1106:   on the number of twin nodes in the subgraph, subgraphs with $n_{\rm
1107:     twin}>1 \; (n_{\rm twin}> 3)$ are marked by asterisks
1108:   (bullets). Whereas almost all ratios are much higher in the yeast
1109:   network, this is noticeably less true for subgraphs containing
1110:   more than three pairs of twin nodes. These tend to fall on the 
1111:   diagonal indicated by the dashed line.}
1112: \label{graph-r}
1113: \end{center}
1114: \end{figure}
1115: 
1116: As we have already indicated, many of the strong motifs in yeast seem to 
1117: be related to a few densely connected complexes such as those discussed in
1118: subsection A. They are either part of their cores, or they have most of
1119: their nodes in the core, with one or two extra nodes forming the tail of
1120: what looks like a tadpole. This effect is even more pronounced for 
1121: $n=8$ subgraphs. For instance, the three most frequent subgraphs with 
1122: $n=8$ and $ \ell = 17$ all contained a 6-clique and two nodes connected 
1123: to it either in a chain or in parallel. None of them occurred in even a 
1124: single rewired network.
1125: 
1126: The situation is different for the \coli network. There, the three 
1127: most frequent graphs with 8 nodes and 17 edges also have a tadpole 
1128: structure, few twin nodes, and low bipartivity. But they are not very 
1129: strong motifs since they also occur frequently in the rewired networks. 
1130: The three strongest motifs with $n=8$ and $ \ell = 17$, in contrast, 
1131: have many twin pairs and high bipartivity. They have slightly lower 
1132: counts (by factors of 2--4), but occur much more rarely in the rewired 
1133: networks. 
1134: 
1135: \begin{figure}
1136:   \begin{center}
1137:   \epsfig{file=yeast-ecoli-orig.rewired.ps, width=6.3cm, angle=270}
1138:   \caption{(color online) Counts $c_G$ (original networks) and $\langle
1139:     c_G^{(0)}\rangle$ (rewired networks) for individual subgraphs in the \ecoli network,
1140:     plotted against counts for the same subgraph in the yeast network.
1141:     It can be seen that the two rewired networks are much more similar
1142:     (display higher correlation) than the original networks.}
1143:   \label{graph-freq}
1144:   \end{center}
1145: \end{figure}
1146: 
1147: \subsubsection{Effects of Rewiring on Differences between Networks}
1148: 
1149: Finally, Fig.~\ref{graph-freq} shows counts for individual subgraphs
1150: in the \coli network plotted against counts for the same subgraph in yeast. 
1151: This is done for all four combinations of original and
1152: rewired networks. We see that the correlation is strongest when we
1153: compare rewired networks of \coli to rewired networks of yeast. This
1154: is not surprising. It means that a lack of correlations is mostly due
1155: to special features of one network which are not shared by the other.
1156: Rewiring eliminates most of these features. The other observation is
1157: that rewiring in general further reduces the counts for subgraphs
1158: which are already rare in the original networks. This is mainly due to
1159: the fact that such subgraphs are relatively densely connected, and
1160: appear in the original networks only because of the strong clustering.
1161: This effect is more pronounced for yeast than for \colp, because the yeast
1162: network is sparser and has more densely connected clusters/complexes.
1163: 
1164: \section{Discussion}
1165: 
1166: In this paper we have presented an algorithm for sampling connected
1167: subgraphs uniformly from large networks. This algorithm is a
1168: generalization of algorithms for sampling lattice animals, hence we
1169: refer to it as a ``graph animal algorithm'' and to the connected subgraphs 
1170: as ``graph animals''. It allowed us to obtain high-statistics estimates of
1171: subgraph censuses for two protein interaction networks. Although the
1172: graph animal algorithm worked well in both cases, the analysis of the
1173: smaller network (\colp) was much easier than that of the larger one (yeast). 
1174: This was not so much because of the sheer size of the latter (the yeast 
1175: network has about ten times more nodes and links than the \coli network), 
1176: but was mainly caused by the existence of stronger hubs. Indeed, the
1177: presence of hubs places a more stringent limitation on the method than
1178: the size of the network.
1179: 
1180: One of the main results is that many subgraph frequency counts are
1181: hugely different from those in the most popular null model, which is
1182: the ensemble of networks with fixed degree sequence. Based on a
1183: comparison with this null model, most subgraphs with size $\geq 6$ in
1184: both networks would be very strong motifs or anti-motifs. This clearly
1185: shows that alternative null models are needed which take clustering and
1186: other effects into account.
1187: 
1188: While this was not very surprising (hints of it had been found in
1189: previous analyses), a more surprising result is the fact that the
1190: dominant motifs in the two protein interaction networks show very
1191: different features. Most of these seem to be related to the densely
1192: connected cores of a small number of complexes in the yeast network,
1193: which have no parallels in the \coli network and which strongly affect
1194: the subgraph census. Further studies are needed to disentangle
1195: these effects from other -- possibly biologically more interesting -- 
1196: effects.
1197: 
1198: Finally, a feature with likely biological significance is the dominance 
1199: of subgraphs with many twin nodes. These are nodes which share the 
1200: same list of linked neighbors within the subgraph. They correspond to 
1201: proteins which interact with the same set of other proteins. The most 
1202: natural explanation for them is gene duplication.  Connected to 
1203: this is a preference for (approximately) bipartite subgraphs. These 
1204: two features are very clearly seen in the \coli network, much less so 
1205: in yeast. But it would be premature to conclude that gene duplication 
1206: was evolutionarily more important in \coli than in yeast. It is more 
1207: likely that its effect is just masked in the yeast network by other 
1208: effects, most probably by the densely connected complexes and other
1209: clustering effects which do not show up to the same extent in \colp.
1210: 
1211: \begin{figure}
1212:   \begin{center}
1213:   \epsfig{file=Fig-compare-ratios.ps, width=7.7cm, angle=270}
1214:   \caption{(color online) Count ratios $c_G/\langle c_G^{(0)}\rangle$
1215:     for individual subgraphs in the yeast networks of
1216:     Refs.~\cite{bu,batada}, plotted against the corresponding ratios for the same
1217:     subgraph in the network of Ref.~\cite{yeast}. If all three networks
1218:     were identical, all points would lie on the diagonal (indicated by
1219:     the straight dashed line), whereas in fact systematic deviations are 
1220:     observed.}
1221:   \label{ratios-compare}
1222:   \end{center}
1223: \end{figure}
1224: 
1225: Up to now, we know very little about the biological significance of our 
1226: findings. One main avenue of further work could be to relate our results 
1227: on subgraph abundances in more detail to properties of the network that
1228: are associated with biological function. Another important problem is the 
1229: comparison between network reconstructions which supposedly describe 
1230: the same or similar objects. There exist, e.g., a large number of 
1231: published protein-protein interaction networks for yeast.
1232: Some were obtained by means of different experimental techniques, either
1233: with conventional or with high throughput methods, while others were 
1234: obtained by comprehensive literature compilations. In a preliminary 
1235: step, we compared three such networks: the network obtained by Krogan
1236: {\it et al.}~\cite{yeast} that was studied above, a somewhat older 
1237: network downloaded from~\cite{pajek} and attributed to 
1238: Bu {\it et al.}~\cite{bu}, and the `high confidence' (HC) 
1239: network of Batada {\it et al.}~\cite{batada}. The latter is the most 
1240: recent. It was obtained by extracting the most reliable interactions
1241: from a vast database which includes the data of both Bu {\it et al.} and 
1242: Krogan {\it et al.} In Fig.~\ref{ratios-compare} we plot the ratios
1243: between the actual counts and the average counts in rewired networks
1244: for Bu {\it et al.} and for the HC data set against the analogous 
1245: ratios for the Krogan {\it et al.} network. If the three data sets
1246: indeed describe the same yeast network -- as they purport to do, within 
1247: experimental uncertainties -- the points should all fall onto the 
1248: diagonal. Instead, we see systematic deviations. Surprisingly, these 
1249: deviations are much stronger between the Krogan {\it et al.} and the 
1250: HC networks than between the Krogan {\it et al.} and the Bu {\it et al.} 
1251: networks. Clarifying these and other systematic irregularities should 
1252: give valuable insight into the strengths and weaknesses of the methods 
1253: used in constructing the networks as well as their biological 
1254: reliability, and should lead to improved methods for network 
1255: reconstruction.
1256: 
1257: In the present paper we have only dealt with undirected networks. The
1258: basic sampling algorithm works equally well for directed networks. The
1259: main obstacle in applying our methods to the latter is the huge number
1260: of directed subgraphs, even for relatively small sizes.
1261: Nevertheless, we will present an analysis of directed networks in
1262: forthcoming work, as well as applications to other undirected
1263: networks.
1264: 
1265: Acknowledgements: We thank Gabriel Musso for valuable information on 
1266: the yeast network.
1267: 
1268: \begin{thebibliography}{99}
1269: \bibitem{faloutsos} C. Faloutsos, M. Faloutsos, and P. Faloutsos, ACM SIGCOMM 
1270:    Computer Communication Review {\bf 29}, 251 (1999).
1271: \bibitem{barabasi} A.-L. Barab{\'a}si and R. Albert, Science {\bf 286}, 509 (1999).
1272: \bibitem{bollobas} B. Bollob{\'a}s, {\it Random Graphs} (Academic Press,
1273:   London 1985).
1274: \bibitem{newman_SIAM} M.E.J. Newman, SIAM Review {\bf 45}, 167 (2003).
1275: \bibitem{watts-stro} D.J. Watts and S.H. Strogatz, Nature {\bf 393}, 440 (1998).
1276: \bibitem{newman_clust} M.E.J. Newman, Phys. Rev. E {\bf 64}, 016131 (2001).
1277: \bibitem{ravasz} E. Ravasz, L. Somera, D.A. Mongru, Z.N. Oltvai, and 
1278:    A.-L. Barab{\'a}si, Science {\bf 297}, 1551 (2002).
1279: \bibitem{girvan} M.E.J. Newman and M. Girvan, Phys. Rev. E {\bf 69}, 026113 (2004).
1280: \bibitem{ziv} E. Ziv, M. Middendorf, and C. Wiggins, Phys. Rev. E {\bf 71}, 046117 
1281:    (2005).
1282: \bibitem{rosvall} M. Rosvall and C.T. Bergstrom, Proc. Nat. Acad. Sci. U.S.A. 
1283:    {\bf 104}, 7327 (2007).
1284: \bibitem{stadler} K. Klemm and P.F. Stadler, Phys. Rev. E {\bf 73},
1285:    025101(R) (2006).
1286: \bibitem{milgram} S. Milgram, Psychology Today {\bf 2}, 60 (1967).
1287: \bibitem{estrada} E. Estrada and J.A. Rodr\'iguez-Vel\'azquez, Phys. Rev. E {\bf 
1288:     72}, 046105 (2005); E. Estrada, J. Proteome Res. {\bf 5}, 2177 (2006).
1289: \bibitem{milo} R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, 
1290:     and U. Alon, Science {\bf 298}, 824 (2002).
1291: \bibitem{shen-orr} S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, Nat. Genet.
1292:     {\bf 31}, 64 (2002).
1293: \bibitem{vasquez} A. V\'azquez, R. Dobrin, D. Sergi, J.-P. Eckmann, Z.N. Oltvai,
1294:     and A.-L. Barab{\'a}si, Proc. Nat. Acad. Sci. U.S.A. {\bf 101}, 17940 (2004).
1295:   \bibitem{kashtan} N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon,
1296:     Phys. Rev. E {\bf 70}, 031909 (2004).
1297: \bibitem{besag} J. Besag and P. Clifford, Biometrika {\bf 76}, 633 (1989).
1298: \bibitem{maslov} S. Maslov and K. Sneppen, Science {\bf 296}, 910 (2002).
1299: \bibitem{class1} M.~Middendorf, E.~Ziv and C.~H.~Wiggins,
1300:   Proc. Natl. Acad. Sci.~U.S.A. {\bf 102}, 3192 (2005).
1301: \bibitem{mahadevan} P. Mahadevan, D. Krioukov, K. Fall, and A. Vahdat,
1302:   ``A Basis for Systematic Analysis of Network Topologies", preprint
1303:   arXiv/cs.NI/0605007v2 (2006).
1304: \bibitem{newman_park1} J. Park and M.~E.~J.~Newman, Phys. Rev. E {\bf
1305:     68}, 026112 (2003).
1306: \bibitem{newman_park2} J.~Park and M.~E.~J.~Newman, Phys. Rev. E {\bf
1307:     70}, 066117 (2004).
1308: \bibitem{foster} J.~Foster, D.~Foster, P.~Grassberger and M.~Paczuski, e-print
1309: cond-mat/0610446 (2006).
1310: \bibitem{clauset} A.~Clauset, C.~Moore, and M.~E.~J.~Newman, e-print
1311:   physics/0610051 (2006).
1312: \bibitem{briggs} K. Briggs, {\sf http://keithbriggs.info/cgt.html} (2006).
1313: \bibitem{kobler} J. K\"obler, U. Sch\"oning, and J. Tor\'an, {\it The Graph
1314:     Isomorphism Problem: Its Structural Complexity} (Birkh\"auser,
1315:   Boston 1993).
1316: \bibitem{faulon} J.-L. Faulon, J. Chem. Inf. Comput. Sci. {\bf 38}, 432 (1998).
1317: \bibitem{toran} J. Tor\'an, FOCS 180 (2000).
1318: \bibitem{nauty} For the ``nauty'' program of B. McKay, see 
1319:    {\sf http://cs.anu.edu.au/people/bdm/nauty/}
1320: \bibitem{baskerville} K. Baskerville and M. Paczuski, Phys. Rev. E
1321:    {\bf 74}, 051903 (2006).
1322: \bibitem{kashtan2004b} N. Kashtan, S. Itzkovitz, R. Milo, and U.
1323:    Alon, Bioinformatics {\bf 20}, 1746 (2004).
1324: \bibitem{spirin} V. Spirin and L.A. Mirny, Proc. Nat. Acad. Sci.
1325:    U.S.A. {\bf 100}, 12123 (2003).
1326: \bibitem{animals} R.C. Read, Canad. J. Math. {\bf 14}, 1 (1962).
1327: \bibitem{jensen} I. Jensen, J. Stat. Phys. {\bf 102}, 865 (2001).
1328: \bibitem{redner} S. Redner, J. Statist. Phys. {\bf 29}, 309 (1982).
1329: \bibitem{stauffer} D. Stauffer,  Phys. Rev. Lett. {\bf 41}, 1333 (1978).
1330: \bibitem{dickman} R. Dickman and W.C. Schieve, J. Physique {\bf 45}, 1727 (1984).
1331: \bibitem{pivot} E.J. Janse van Rensburg and N. Madras, J. Phys. A:
1332:    Math. Gen. {\bf 25}, 303 (1992).
1333: \bibitem{leath} P. Leath, Phys. Rev. B {\bf 14}, 5046 (1976).
1334: \bibitem{hsu} H.-P. Hsu, W. Nadler, and P. Grassberger, J. Phys. A:
1335:   Math. Gen. {\bf 38}, 775 (2005); e-print cond-mat/0408061 (2004).
1336: \bibitem{newpaper} K.~Baskerville {\it et al.}, in preparation.
1337: \bibitem{gfn1998} P. Grassberger, H. Frauenkron, and W. Nadler, {\it
1338:    PERM: A Monte Carlo Strategy for Simulating Polymers and other
1339:    Things}, in ``Monte Carlo Approach to Biopolymers and Protein
1340:    Folding", eds. P. Grassberger {\it et al.}  (World Scientific,
1341:    Singapore 1998); arXiv:cond-mat/9806321 (1998).
1342: \bibitem{care} C.M. Care, Phys. Rev. E {\bf 56}, 1181 (1997); C.M.
1343:    Care and R. Ettelaie, Phys. Rev. E {\bf 62}, 1397 (2000).
1344: \bibitem{Redner79} S. Redner, J. Phys. A: Math. Gen. {\bf 12}, L239 (1979).
1345: \bibitem{ecoli} G. Butland {\it et al.}, Nature {\bf 433}, 531 (2005);
1346:    {\sf http://www.cosin.org}.
1347: \bibitem{yeast} N.J. Krogan {\it et al.}, Nature {\bf 440}, 637 (2006).
1348: \bibitem{footnote} The networks given in \cite{ecoli,yeast} are not
1349:    connected. In the present paper we used only their largest connected
1350:    components.
1351: \bibitem{stauffer-aharony} D. Stauffer and A. Aharony, {\it An
1352:    Introduction to Percolation Theory}, 2nd Ed. (Taylor and Francis,
1353:    London, 1994).
1354: \bibitem{randomSAW} P. Grassberger, J. Phys. A: Math. Gen. {\bf 26}, 1023 (1993).
1355: \bibitem{mollison} D. Mollison, J. R. Statist. Soc. B {\bf 39}, 283 (1977).
1356: \bibitem{grass} P. Grassberger, Mathematical Biosciences {\bf 63}, 157 (1983).
1357: \bibitem{integer-seq} {\it The On-Line Encyclopedia of Integer Sequences}, 
1358:    {\sf http://www.research.att.com/\~{}njas/sequences} (AT\&T Labs, 2006).
1359: \bibitem{pastor} R. Pastor-Satorras and A. Vespignani, Phys. Rev.
1360:    Lett. {\bf 86}, 3200 (2001).
1361: \bibitem{footnote2} We might try to estimate the optimal $p$ by the
1362:    threshold for an infinite SIR epidemic on an infinite tree-like
1363:    network with the same degree distribution, $p_c = \langle k\rangle
1364:    / \langle k^2\rangle$ \cite{pastor}. For the two networks
1365:    considered in this paper, this would give $p_c({\rm yeast}) =
1366:    0.062$, $p_c({\rm E.~coli}) = 0.070$, i.e.  a much smaller
1367:    difference in the optimal $p$ values. One reason why this is not
1368:    observed might be the very strong clustering, in particular in the
1369:    yeast data, which is neglected in this argument.
1370: \bibitem{footnote3} Another reason why no minimum appears in the
1371:    $n=8$ curve is that we kept the number of generated clusters fixed,
1372:    not CPU time.  Since larger $p$ values also imply larger clusters
1373:    on average, the CPU time per cluster increases sharply for larger
1374:    $p$.
1375: \bibitem{grass-PERM} P. Grassberger and W. Nadler, {\it ``Go with
1376:     the winners"-Simulations}, in ``Computational Statistical Physics:
1377:     From Billiards to Monte Carlo", eds. K.H.  Hoffmann {\it et al.}
1378:     (Springer, Heidelberg 2000); arXiv:cond-mat/0010265 (2000).
1379: \bibitem{seidman} S.B. Seidman, Social Networks {\bf 5}, 269 (1983).
1380: \bibitem{mips} MIPS data base: \\
1381:     {\sf http://mips.gsf.de/genre/proj/yeast/Search/Catalogs\-/catalog.jsp}.
1382: \bibitem{sgd} SGD data base: \\ 
1383:     {\sf http://www.yeastgenome.org/cgi-bin/GO/go.pl?}.
1384: \bibitem{ispolatov} I. Ispolatov, P.L. Krapivsky, I. Mazo, and A. Yuryev,
1385:      New Journal of Physics {\bf 7}, 145 (2005).
1386: \bibitem{pajek} {\sf http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast\-/Yeast.htm}.
1387: \bibitem{bu} D. Bu {\it et al.}, Nucleic Acids Res. {\bf 31}, 2443 (2003).
1388: \bibitem{batada} N.N. Batada {\it et al.}, PLoS Biology {\bf 4}, 1720 (2006).
1389: \end{thebibliography}
1390: 
1391: \end{document}
1392: