q-bio0611020/xtr.tex
1: \documentclass[rmp,twocolumn,showpacs]{revtex4}
2: 
3: \usepackage{graphicx,amsmath,amssymb,txfonts,dcolumn}
4: 
5: \begin{document}
6: 
7: \title{Exploring the assortativity-clustering space of a network's
8:   degree sequence}
9: 
10: \author{Petter Holme}
11: \affiliation{Department of Computer Science, University of New Mexico,
12:   Albuquerque, NM 87131, U.S.A.}
13: 
14: \author{Jing Zhao}
15: \affiliation{School of Life Sciences \& Technology, Shanghai Jiao Tong
16:   University, Shanghai 200240, China}
17: \affiliation{Shanghai Center for Bioinformation and Technology,
18:   Shanghai 200235, China}
19: \affiliation{Department of Mathematics, Logistical Engineering
20:   University, Chongqing 400016, China}
21: 
22: \begin{abstract}
23:   Nowadays there is a multitude of measures designed to
24:   capture different aspects of network structure. To be able to say if
25:   the structure of a certain network is expected or not, one needs a
26:   reference model (null model). One frequently used null model is the
27:   ensemble of graphs with the same set of degrees as the original
28:   network. In this paper we argue that this ensemble can be more than just
29:   a null model---it also carries information about the original network and
30:   factors that affect its evolution. By mapping out this ensemble in the
31:   space of some low-level network structures---in our case those
32:   measured by the assortativity and clustering coefficients---one can
33:   for example study how close to the valid region of the parameter
34:   space the observed networks are. Such analysis suggests which
35:   quantities are actively optimized during the evolution of the
36:   network. We use four very different biological networks to exemplify
37:   our method. Among other things, we find that high clustering might
38:   be a force in the evolution of protein interaction networks. We also
39:   find that all four networks are conspicuously robust to both random
40:   errors and targeted attacks.
41: \end{abstract}
42: 
43: \pacs{89.75.Fb, 82.39.Rt, 89.75.Hc}
44: % 89.75.Fb -- Structures and organization in complex systems
45: % 82.39.Rt -- Reactions in complex biological systems
46: % 89.75.Hc -- Networks and genealogical trees
47: 
48: \maketitle
49: 
50: \section{Introduction}
51: 
52: Network structure~\cite{mejn:rev,doromen:book,ba:rev} is usually
53: defined as the way a network differs from what is expected. What
54: ``expected'' means depends on the fundamental constraints on the
55: network, and this can vary from system to system. For example, if
56: the network is made of units that must be connected to two, and only
57: two, others; then, it is not interesting whether or not a vertex lies
58: on a cycle (we already know that it will). The ensemble of all
59: networks fulfilling the fundamental constraints on the system is
60: usually called a \textit{null model} (or \textit{reference model}). When
61: we have pinned down the null model we can measure the network
62: structure by standard quantities. If the values of these quantities
63: differ significantly from the null-model average, then we call the network
64: structured. The baseline assumption of complex network theory is that
65: network structure carries information about the forces that have
66: formed the network. Ever since the studies of Barab\'{a}si and
67: coworkers~\cite{ba:model,ba:rev}, the degree distribution (or, if
68: referring to the set of degrees of one particular network,
69: \textit{degree sequence}) has been regarded as the most fundamental
70: network structure. For many networks, the degrees are related to outer
71: factors (not emerging from the network evolution). In such cases the
72: ensemble of all graphs with the same degree sequence as the original
73: network is a natural null model. Another interpretation is that the
74: network structures measured relative to this null model are of higher
75: order than the degree---i.e., what remains after the effects of the
76: more fundamental structure (the degree sequence) are filtered away. The
77: usual way to use a null model is to compare a network measure with the
78: ensemble average value of the null model. In this paper we will argue
79: that one can glean more information about the original network by studying
80: the null model ensemble in greater detail than just measuring
81: averages.
82: 
83: 
84: We consider networks that can be modeled as a graph $G=(V,E)$ where
85: $V$ is the set of $N$ vertices and $E$ is the set of $M$ undirected
86: edges. We denote the ensemble of graphs with the same degree sequence
87: as $G$ as $\mathcal{G}(G)$. Our basic approach to study
88: $\mathcal{G}(G)$ is to resolve its members in the space of higher
89: order network structures. The two such higher order network structures
90: we consider in this paper are: the correlation between the
91: degrees at either side of an edge (measured by the \textit{assortative
92:   mixing coefficient}, $r$~\cite{mejn:assmix}, or simply
93: \textit{assortativity}); and, the fraction of triangles in the network
94: (measured by the \textit{clustering
95:   coefficient}, $C$~\cite{bw:sw,mejn:rev}). By mapping out
96: $\mathcal{G}(G)$ in the space defined by $r$ and $C$ one can pose
97: questions such as: How large is the region in $r$-$C$ space where
98: members of $\mathcal{G}(G)$ actually exist? (This helps us answer how
99: constrained the network evolution is if the degrees are given.) Is the
100: real network close to $\mathcal{G}(G)$'s boundaries in $r$-$C$ space?
101: (Which would indicate whether or not $r$ or $C$ are actively
102: optimized.)
103: 
104: 
105: The basis for our exploration of an ensemble $\mathcal{G}(G)$ is to map
106: out its members in the space defined by some network-structural
107: measures, in our case the assortativity and clustering. We explore the
108: $r$-$C$ space by successively rewiring pairs of edges, $(i,j)$ and
109: $(i',j')$ to $(i,j')$ and $(i',j)$, in a way that takes the system in a desired
110: direction. Rewiring techniques for studying networks are half a century
111: old~\cite{gale:rew} (randomization for obtaining null models was
112: studied in Ref.~\cite{katz:cug}). In the physics literature these
113: techniques were first used in Refs.~\cite{maslov:pro,alon}.
114: 
115: 
116: \section{Network structural measures}
117: 
118: Before going into details of our algorithm, we will review the network
119: structural quantities that we use to describe our networks: both the
120: independent variables (the assortative and clustering
121: coefficients) that form the basis for our space of interest; and the
122: quantities we use for characterizing the regions of this space.
123: 
124: 
125: \subsection{Assortative mixing coefficient}
126: 
127: It is quite well accepted that the set of degrees, the degree
128: sequence, is the network quantity that contains the most information about both the evolution and function of the network. Degree can (in most
129: contexts) be identified as how influential the vertex is~\cite{wf} (in
130: some sense)---high-degree
131: vertices are assumed to be more influential in both the formation
132: of the network and the flow of dynamic systems on the network. In this
133: paper we assume the degree sequence is inherent to the system and look
134: at higher order structures arising from how the vertices are linked to
135: one another. The simplest such higher-order structure is the
136: correlation between the degrees of vertices at either side of an
137: edge. Is it the case that high-degree vertices are primarily connected
138: to other high degree vertices, or are they linked to low-degree
139: vertices? A simple way of measuring this tendency is by the
140: assortativity~\cite{mejn:rev} $r$. Basically
141: speaking, $r$ is the linear correlation coefficient of the degrees at
142: either side of an edge. One complication is that since the edges are undirected, $r$ has to be symmetric with respect to edge-reversal, but the correlation coefficient is not symmetric. The solution is to let one edge contribute
143: twice to the covariance, i.e.\ represent an undirected edge by two directed edges pointing in opposite directions. If one uses an edge list representation
144: internally (i.e., let the edges be stored in an array of ordered pairs
145: $(i_1,j_1),\cdots,(i_M,j_M)$) then~\cite{mejn:assmix}
146: \begin{equation}\label{eq:assmix}
147:   r=\frac{4\langle k_1\, k_2\rangle - \langle k_1 + k_2\rangle^2}
148:   {2\langle k_1^2+k_2^2\rangle - \langle k_1+ k_2\rangle^2}
149: \end{equation}
150: where, for an edge $(i,j)$, $k_1$ is the degree of the first argument
151: (i.e., the degree of $i$) and $k_2$ is the degree of the second
152: argument. The range of $r$ is $[-1,1]$ where negative values indicate
153: a preference for high-degree vertices to attach to low-degree
154: vertices, and positive values mean that vertices tend to be attached
155: to others with degrees of similar magnitudes.
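
As a consistency check on Eq.~(\ref{eq:assmix}), the following sketch assumes that every undirected edge is counted once in each direction, so that both end-point degrees share the mean $\langle k_1+k_2\rangle/2$ and the second moment $\langle k_1^2+k_2^2\rangle/2$. The Pearson correlation coefficient then reads
\begin{equation*}
  r=\frac{\langle k_1 k_2\rangle-\tfrac{1}{4}\langle k_1+k_2\rangle^2}
  {\tfrac{1}{2}\langle k_1^2+k_2^2\rangle-\tfrac{1}{4}\langle k_1+k_2\rangle^2},
\end{equation*}
which reduces to Eq.~(\ref{eq:assmix}) after multiplying numerator and denominator by four. As an illustrative example (not taken from the data sets of this paper), consider a star graph with one hub of degree $n\geq 2$ and $n$ leaves. Every edge joins degrees $n$ and $1$, so $\langle k_1 k_2\rangle=n$, $\langle k_1+k_2\rangle=n+1$, $\langle k_1^2+k_2^2\rangle=n^2+1$, and
\begin{equation*}
  r=\frac{4n-(n+1)^2}{2(n^2+1)-(n+1)^2}=\frac{-(n-1)^2}{(n-1)^2}=-1 ,
\end{equation*}
the most disassortative value possible.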
156: 
157: 
158: \subsection{Clustering coefficient}
159: 
160: Several simple random network models (such as the
161: Erd\H{o}s-R\'{e}nyi model~\cite{er:on} or the model for generating networks
162: of a given $r$-value in Ref.~\cite{mejn:assmix}) have rather few triangles (fully connected
163: subgraphs of three vertices). For some classes of real-world networks (notably
164: social networks~\cite{holl:72}) there is a strong tendency for triangles to form, which makes such models fail. The network measure of the density of
165: triangles is called the \textit{clustering coefficient}. We use the definition
166: of Ref.~\cite{bw:sw}:
167: \begin{equation}\label{eq:clust}
168:   C = 3 n_\mathrm{triangle}\:\big/\:n_\mathrm{triple},
169: \end{equation}
170: where $n_\mathrm{triangle}$ is the number of triangles and
171: $n_\mathrm{triple}$ is the number of connected triples (subgraphs
172: consisting of three vertices and two or three edges). The factor three
173: is included to normalize the quantity to the interval $[0,1]$.
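
A small worked example may help in reading Eq.~(\ref{eq:clust}). It assumes the common convention that connected triples are counted once per central vertex, so that a vertex of degree $k_v$ is the center of $k_v(k_v-1)/2$ triples and every triangle contains three of them:
\begin{equation*}
  n_\mathrm{triple}=\sum_{v\in V}\frac{k_v(k_v-1)}{2} .
\end{equation*}
For a triangle with one additional vertex attached to one of its corners, the degrees are $(3,2,2,1)$, which gives $n_\mathrm{triple}=3+1+1+0=5$ and $n_\mathrm{triangle}=1$, hence $C=3/5$. For an isolated triangle $n_\mathrm{triple}=3$ and $C=1$, which illustrates the role of the factor three.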
174: 
175: 
176: \subsection{Distance and component size}
177: 
178: Two quantities that are, perhaps more than any other, related to the functionality of dynamic processes on the network are the relative size of the largest
179: component (connected subgraph) $s$, and the average distance $\langle
180: d\rangle$. $s$ is simply defined as the number of vertices in the
181: largest component divided by $N$. The distance $d(i,j)$ between two
182: vertices $i$ and $j$ is defined as the number of edges in the shortest
183: path between these two vertices. $\langle d\rangle$ is $d(i,j)$
184: averaged over all vertex pairs ($i\neq j$) in the largest
185: component. In a network with large $s$ and small $\langle d\rangle$,
186: spreading processes will be fast and far-reaching. This is a good
187: property of information networks, but bad in the context of, for
188: example, disease spreading. Some authors have combined the distance
189: and component size aspects by considering the average reciprocal
190: distances~\cite{our:attack,latora:eff}. For most purposes, we believe,
191: valuable information gets lost in such a combination (a fragmented
192: network $G$ with short average distances can be something very
193: different from a connected graph of large distances and the same
194: average reciprocal distances as $G$).
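
For concreteness, the quantities above can be written as follows. This is a sketch; the symbols $V_1$ and $N_1$ for the vertex set of the largest component and its size are introduced here for illustration only:
\begin{equation*}
  s=\frac{N_1}{N},\qquad
  \langle d\rangle=\frac{1}{N_1(N_1-1)}\sum_{\substack{i,j\in V_1\\ i\neq j}} d(i,j) .
\end{equation*}
The average reciprocal distance is usually taken over all vertex pairs, with the assumed convention that $1/d(i,j)=0$ when $i$ and $j$ lie in different components:
\begin{equation*}
  \langle d^{-1}\rangle=\frac{1}{N(N-1)}\sum_{i\neq j}\frac{1}{d(i,j)} .
\end{equation*}
Fragmentation and long paths both lower $\langle d^{-1}\rangle$, which is why this single number cannot distinguish the two effects.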
195: 
196: 
197: \subsection{Robustness}
198: 
199: One line of complex network research is the study of the response of
200: the network to attacks, errors, failures and other events that effectively change
201: the structure. The error response problem is usually formulated as: how
202: does the functionality of the network change if a random fraction of
203: the vertices, or edges, is removed~\cite{mejn:rev}? The attack
204: problem is the same, except that the vertices are not selected
205: randomly but according to some strategy intended to decrease the
206: network's functionality as rapidly as possible~\cite{our:attack,alb:attack}.
207: A frequently used metric for functionality is the ratio of $s$
208: before and after the
209: event~\cite{our:attack,alb:attack,motter:cascade}. In the error and
210: attack robustness problems, this quantity is typically plotted as a
211: function of the number of removed vertices. The idea is that even if
212: one network $G$ is more robust than another network $G'$ to the
213: removal of, say, $1\%$ of the vertices, $G'$ can be less vulnerable
214: than $G$ if $10\%$ of the vertices are deleted. Since we aim at
215: mapping out the $r$-$C$ space of degree sequences, we would like to
216: capture the robustness with just one number. We will use what we call
217: the $f$-\textit{robustness} $R_f$ of a network, defined as the expected
218: fraction of vertices that need to be removed for the relative size of
219: the largest component to decrease to a fraction $f\in (0,1)$ of its
220: original value. The way of removal can either be random (the error
221: problem) or selective (the attack problem). For the rest of the paper
222: we will set $f=1/2$, and refer to the $1/2$-robustness just as
223: ``robustness'' $R$. Other $f$-values give slightly different results, but
224: our conclusions will hold for a range of intermediate $f$-values.
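
One way to write this definition compactly is sketched below. The notation $s(q)$ for the relative size of the largest component after a fraction $q$ of the vertices has been removed is an assumption made here for illustration:
\begin{equation*}
  R_f=\Big\langle \min\big\{\,q : s(q)\leq f\,s(0)\,\big\}\Big\rangle ,
\end{equation*}
where the average runs over realizations of the removal sequence. As a reading aid, the value $R_\mathrm{attack}=0.38$ listed for the neural network ($N=280$) in Table~\ref{tab:stat} corresponds to roughly $0.38\times 280\approx 106$ removed vertices before the largest component has shrunk to half of its original size.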
225: 
226:  \begin{figure}
227:   \centering\resizebox*{0.9\linewidth}{!}{\includegraphics{ill.eps}}
228:   \caption{Illustration of the analysis scheme applied to the
229:     \textit{C. elegans} neural network. (a) shows how the valid region
230:     is mapped out: 1. $r_\mathrm{min}$ is located. 2. $r_\mathrm{max}$
231:     is found and the interval $[r_\mathrm{min}, r_\mathrm{max}]$ is
232:     divided into $L$ segments. 3. $C_\mathrm{min}(n)$ is
233:     constructed. 4. $C_\mathrm{max}(n)$ is traced and the interval
234:     $[C_\mathrm{min}, C_\mathrm{max}]$ is segmented into $L$
235:     regions. (b) illustrates the sampling of the pixels. The next
236:     pixel to go to is chosen from a random permutation of the
237:     pixels. In this example $n$ and $n'$ are chosen to be far
238:     apart. The line shows the path taken by the algorithm. The circles
239:     indicate every thousandth step on the way from $n$ to $n'$. The
240:     blow-up illustrates the random walk within a pixel to sample the
241:     graphs of the pixel more randomly.
242: }
243:   \label{fig:ill}
244: \end{figure}
245: 
246: 
247: \section{The analysis scheme\label{sect:analysis}}
248: 
249: The fundamental idea of our method is simple: we update the network by
250: choosing pairs of edges randomly, say $(i,j)$ and $(i',j')$, and swap
251: one end of them (forming $(i,j')$ and $(i',j)$). This guarantees that
252: the degree sequence stays intact. We navigate in the $r$-$C$ space by
253: only accepting changes that move us in the desired direction.  If an
254: edge-swap would introduce a self-edge (i.e.\ if $i=j'$ or $i'=j$) or a
255: multiple edge (i.e.\ if $(i,j')$ or $(i',j)$ belongs to $E$ before the
256: swapping, or \textit{move}) it is not performed. There are many other
257: technicalities concerning the convergence to extremes, uniformity of
258: the sampling and more that we discuss in the Appendix.
259: 
260: The members of the ensemble $\mathcal{G}(G)$ do not, in general, cover
261: the whole range of $(r,C)$-values. Indeed, for any finite $G$,
262: $\mathcal{G}(G)$ defines a set of points, rather than a continuous
263: region, in the $r$-$C$ space. We will perform a more coarse-grained
264: analysis, breaking down the $r$-$C$ space into pixels and averaging quantities over the graphs of $\mathcal{G}(G)$ with $(r,C)$-values within each pixel. (Thus, a pixel constitutes a graph ensemble in itself; our aim is to sample its members uniformly at random.) For a computationally tractable
265: resolution, the pixels containing members of $\mathcal{G}(G)$
266: typically form contiguous regions. We will refer to the pixels that
267: contain a member of $\mathcal{G}(G)$ as \textit{valid pixels}, and all
268: pixels that are valid or between valid pixels as the \textit{valid
269:   region} of $\mathcal{G}(G)$.
270: 
271: To trace the valid region of $\mathcal{G}(G)$ we start by finding the
272: lowest and highest assortativity value, $r_\mathrm{min}$ and
273: $r_\mathrm{max}$ respectively. Briefly speaking (more details follow
274: below), to find $r_\mathrm{min}$ we rewire edge-pairs that lower $r$
275: (and vice versa for $r_\mathrm{max}$). After finding the extremal
276: $r$-values, we split the region between these into $L$ segments. Then
277: we go through the region and for each segment $n\in [1,L]$ we find the
278: minimal and maximal $C$-values, $C_\mathrm{min}(n)$ and
279: $C_\mathrm{max}(n)$. The region in $C$-space between the lowest
280: $C_\mathrm{min}=\min_{1\leq n\leq L}C_\mathrm{min}(n)$ and highest
281: $C_\mathrm{max}=\max_{1\leq n\leq L}C_\mathrm{max}(n)$ observed
282: clustering coefficient is segmented into $L$ regions. (Note that $C_\mathrm{min}$, without argument,
283: is the global clustering minimum, whereas $C_\mathrm{min}(n)$ is the
284: minimum conditioned on $r$ being in the $n$'th segment.) Thus we
285: (assuming our method works) obtain an $L\times L$ grid of the $r$-$C$
286: space that contains the valid region of $\mathcal{G}(G)$. The method
287: is illustrated in Fig.~\ref{fig:ill}.
288: 
289: Finding the $\mathcal{G}(G)$ elements of minimal and maximal
290: assortativity is a non-trivial optimization problem. There are
291: deterministic methods that, if they terminate, are guaranteed to give the
292: maximal (or minimal) assortativity~\cite{doyle:big,zj:spectrum}. To avoid
293: such technicalities and to simplify the program, we will use the same
294: kind of optimization algorithm to find $r_\mathrm{max}$ and
295: $r_\mathrm{min}$ as to find $C_\mathrm{min}(n)$ and
296: $C_\mathrm{max}(n)$. In the Appendix we will argue that this method
297: allows us to come as close to the optimal $r$-values as we need.
298: A method we find efficient is to repeat the
299: simple edge-pair swapping procedure (where only changes in the desired
300: direction are accepted) with different random seeds until no lower
301: state is found during a number $\nu_\mathrm{rep.}$ of
302: repetitions~\cite{walker_walstedt}. Each individual run is
303: terminated when no lower state is found for $\nu_\mathrm{same}$
304: swaps. In general, the larger the network is, the more densely
305: distributed are the points close to the border of the valid region. If
306: one is satisfied with finding a value a certain distance from the
307: extrema, then $\nu_\mathrm{rep.}$ and $\nu_\mathrm{same}$ do not need to be
308: increased for larger $N$. To find $C_\mathrm{min}(n)$ and
309: $C_\mathrm{max}(n)$ almost the same procedure is employed. First,
310: edge-pairs are swapped until the desired segment of $r$ is
311: found. Second, unless $r$ is outside the segment $n$ and the move
312: takes the system yet further from the segment, edge-pairs are swapped provided
313: the clustering would decrease (for $C_\mathrm{min}(n)$), or increase
314: (for $C_\mathrm{max}(n)$). When the valid region is traced out and we
315: sample networks of different pixels, we select the pixels
316: randomly. The idea is to sample the space of networks more randomly.
317: 
318: To summarize, the algorithm for finding the extremal assortativity
319: values, $r_\mathrm{min}$ and $r_\mathrm{max}$, is:
320: \begin{enumerate}
321: \item \label{step:choose} Choose two undirected edges $(i,j)$ and
322:   $(i',j')$ at random. If the program makes a difference between the
323:   arguments of the edge, the direction of the reading of the edge also
324:   has to be randomized (so $(i,j)$ is read as $(j,i)$ with probability
325:   $1/2$).
326: \item \label{step:check} Check if swapping these edges to $(i,j')$ and
327:   $(i',j)$ would introduce a self-edge or multiple edge in the
328:   network. If so, go to step~\ref{step:choose}.
329: \item \label{step:accept} Let $\Delta r$ be the change in $r$ if the
330:   move in step~\ref{step:choose} is executed. If $r$ is
331:   to be minimized and $\Delta r<0$, then accept the change (vice versa for maximization of $r$).
332: \item \label{step:conclude} If no move has been executed during the last
333:   $\nu_\mathrm{same}$ executions of step~\ref{step:accept}, then take
334:   the current $r$ as $\tilde{r}_\mathrm{min}$ (or $\tilde{r}_\mathrm{max}$).
335: \item \label{step:stop} Repeat from the beginning $\nu_\mathrm{rep.}$
335:   times and return the lowest observed $\tilde{r}_\mathrm{min}$ (or the highest observed $\tilde{r}_\mathrm{max}$) during these iterations.
337: \end{enumerate}
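
The acceptance test of step~\ref{step:accept} can be evaluated without recomputing $r$ from scratch. The following is a sketch of one possible shortcut, not necessarily the implementation used for this paper. A swap preserves all degrees, so $\langle k_1+k_2\rangle$ and $\langle k_1^2+k_2^2\rangle$ in Eq.~(\ref{eq:assmix}) do not change: summed over the edges they equal $\sum_v k_v^2$ and $\sum_v k_v^3$, respectively. Only $\langle k_1 k_2\rangle$ is affected, and replacing $(i,j)$, $(i',j')$ by $(i,j')$, $(i',j)$ gives
\begin{align*}
  \Delta r &= \frac{4\,(k_i k_{j'} + k_{i'} k_j - k_i k_j - k_{i'} k_{j'})}{M D}\\
  &= -\frac{4\,(k_i - k_{i'})(k_j - k_{j'})}{M D},\\
  D &= 2\langle k_1^2+k_2^2\rangle - \langle k_1+k_2\rangle^2 .
\end{align*}
Here $D$ is four times the variance of the degree found at the end of a randomly chosen edge, so $D>0$ unless all vertices with edges have the same degree. The sign of $\Delta r$ is therefore fixed by the four end-point degrees alone: $r$ decreases when $(k_i-k_{i'})(k_j-k_{j'})>0$ and increases when this product is negative.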
338: 
339: Given $r_\mathrm{min}$ and $r_\mathrm{max}$, and a division of the $r$
340: space into $L$ segments of width $(r_\mathrm{max}-r_\mathrm{min})/L$,
341: we trace the boundaries of the valid region as follows:
342: \begin{enumerate}\setcounter{enumi}{5}
343: \item \label{step:choose2} Go through the regions sequentially. Say
344:   the $n$'th region is the interval $[r_n,
345:   r_{n+1})$.
346: \item Perform step~\ref{step:choose} and \ref{step:check} of the
347:   assortativity optimization algorithm.
348: \item Let $\Delta r$ and $\Delta C$ be the changes in $r$ and $C$ if the move of the
349:   previous step is executed. If $r<r_n$ and $\Delta r>0$, or $r\geq
350:   r_{n+1}$ and $\Delta r<0$, or $r_n \leq r <
351:   r_{n+1}$ and $\Delta C < 0$ (for minimization) or $\Delta
352:   C > 0$ (for maximization), then perform the move of
353:   the previous step.
354: \item \label{step:conclude2} If, counting from the first time the
355:   system entered the desired $r$-segment, the minimal (maximal)
356:   $C$-value has been repeated $\nu_\mathrm{same}$ times, take this
357:   value as $\tilde{C}_\mathrm{min}(n)$ ($\tilde{C}_\mathrm{max}(n)$).
358: \item \label{step:stop2} Repeat from step~\ref{step:choose2}
359:   $\nu_\mathrm{rep.}$ times. Let the lowest
360:   $\tilde{C}_\mathrm{min}(n)$-value and largest
361:   $\tilde{C}_\mathrm{max}(n)$ during these iterations be
362:   $C_\mathrm{min}(n)$ and $C_\mathrm{max}(n)$.
363: \end{enumerate}
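
The change $\Delta C$ needed in the steps above can also be obtained locally. The following is a sketch under the assumption that connected triples are counted once per central vertex, so that $n_\mathrm{triple}=\sum_v k_v(k_v-1)/2$ depends only on the degree sequence and stays constant under swaps. Then
\begin{equation*}
  \Delta C=\frac{3\,\Delta n_\mathrm{triangle}}{n_\mathrm{triple}} .
\end{equation*}
A move that passes the check of step~\ref{step:check} involves four distinct vertices, so no triangle contains both removed edges or both added edges, and
\begin{equation*}
  \Delta n_\mathrm{triangle}=c'(i,j')+c'(i',j)-c(i,j)-c(i',j') ,
\end{equation*}
where $c(u,v)$ denotes the number of common neighbors of $u$ and $v$ before the swap and $c'(u,v)$ the number after it (both symbols are introduced here for illustration). Only the neighborhoods of the four end points need to be inspected.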
364: 
365: Then, when the valid region is mapped out, we split the $C$-range
366: (between $C_\mathrm{min}$ and $C_\mathrm{max}$) into $L$ segments of
367: equal width, thus forming an $L\times L$-grid enclosing the valid
368: region. This grid is sampled as follows:
369: \begin{enumerate}\setcounter{enumi}{10}
370: \item \label{step:perm} Construct a random permutation of the valid
371:   pixels.
372: \item \label{step:pick} Pick the next pixel
373:   $P_n=[r_n,r_{n+1})\times
374:   [C_m,C_{m+1})$ from the index-list of
375:   step~\ref{step:perm}. Denote the center $((r_n+r_{n+1})/2,(C_m+C_{m+1})/2)$ of the pixel
376:   by $(r_{n,0},C_{m,0})$. Let
377:   \begin{equation}\label{eq:dist}
378:     \delta (r,C)=\sqrt{\left(\frac{r - r_{n,0}}{r_\mathrm{max} -
379:           r_\mathrm{min}}\right)^2 + \left(\frac{C -
380:           C_{m,0}}{C_\mathrm{max}-C_\mathrm{min}}\right)^2}
381:   \end{equation}
382:   measure the distance in $r$-$C$ space from the current position
383:   $(r,C)$ to the center of the target pixel.
384: \item Pick edge-pair candidates according to steps~\ref{step:choose}
385:   and \ref{step:check} of the assortativity optimization algorithm.
386: \item Calculate $\Delta (r,C)=\delta(r',C')-\delta(r,C)$ where $r$
387:   and $C$ are the current assortativity and clustering values, and
388:   $r'$ and $C'$ are the values after the pending move has been
389:   performed. If $\Delta (r,C)<0$ perform the move.
390: \item \label{step:rw} If the updated $(r,C)$ belongs to $P_n$, then: First, make
391:   $\nu_\mathrm{rnd.}$ random edge swappings such that $(r,C)$ does not
392:   leave $P_n$. (This is to sample the pixel more uniformly.) Then,
393:   measure network structural quantities of $P_n$, save these values
394:   for statistics, and go to step~\ref{step:pick}.
395: \item If not all pixels have been measured go to step~\ref{step:pick}.
396: \item Go to step~\ref{step:perm} until each pixel has been sampled
397:   $\nu_\mathrm{samp.}$ times.
398: \end{enumerate}
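
To get a feeling for the scale of $\delta$ in Eq.~(\ref{eq:dist}), the following sketch assumes that the segments are indexed from $n=1$ and have equal width, so that
\begin{equation*}
  r_n=r_\mathrm{min}+(n-1)\,\frac{r_\mathrm{max}-r_\mathrm{min}}{L},
\end{equation*}
and analogously for $C_m$. Since Eq.~(\ref{eq:dist}) rescales both axes by the full extent of the grid, every pixel has side $1/L$ in the rescaled coordinates, and any point $(r,C)$ inside the target pixel satisfies
\begin{equation*}
  \delta(r,C)\leq\sqrt{\Big(\frac{1}{2L}\Big)^2+\Big(\frac{1}{2L}\Big)^2}=\frac{1}{\sqrt{2}\,L} .
\end{equation*}
For a resolution of $L=50$ this bound is approximately $0.014$.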
399: 
400: The parameter values we use in this study are (unless otherwise stated):
401: $\nu_\mathrm{same}=10^5$, $\nu_\mathrm{rep.}=5$,
402: $\nu_\mathrm{samp.}=100$, $\nu_\mathrm{rnd.}=1000$ and $L=50$. The
403: choice of parameters and further considerations are discussed in the
404: Appendix. Due to the uncertain stopping conditions of
405: steps~\ref{step:conclude}, \ref{step:stop}, \ref{step:conclude2} and
406: \ref{step:stop2} it is hard to derive meaningful bounds on the
407: computational complexity. We note, however, that the optimization is
408: faster in the $r$- than in the $C$-direction; this probably relates to the observation in
409: Fig.~\ref{fig:ill}(b) that the swapping procedure moves faster in the $r$-
410: than in the $C$-direction. (The speed in the $C$-direction is roughly the
411: same per 1000 steps, but the speed in the $r$-direction decreases.)
412: 
413: \section{Networks}
414: 
415: Our method can be applied to every kind of system that can be modeled
416: as an undirected network.  To limit ourselves, we use four networks
417: from biology as examples in this paper. These networks
418: nonetheless represent fundamentally different systems.
419: 
420: \begin{table}
421: \caption{Basic statistical properties of the example networks we
422:   use. The number of vertices $N$, number of edges $M$,
423:   assortativity $r$, clustering coefficient $C$, relative size of the
424:   largest component $s$, average distance in the largest component
425:   $\langle d\rangle$, the error robustness $R_\mathrm{error}$ and the
426:   attack robustness $R_\mathrm{attack}$.}
427: \begin{ruledtabular}
428:   \begin{tabular}{r|dddd}\label{tab:stat}
429:     & \multicolumn{1}{c}{gene fusion} &
430:     \multicolumn{1}{c}{protein interaction} &
431:     \multicolumn{1}{c}{metabolic} &
432:     \multicolumn{1}{c}{neural}\\\hline
433:     $N$ & 291 & 4168 & 1905 & 280 \\
434:     $M$ & 278 & 7434 & 3526 & 1973 \\
435:     $r$ & -0.36 & -0.13 & -0.10 & -0.069 \\
436:     $C$ & 0.0016 & 0.034 & 0.039 & 0.20 \\
437:     $s$ & 0.38 & 0.94 & 0.87 & 1 \\
438:     $\langle d\rangle$ & 4.2 & 4.8 & 4.5 & 2.6 \\
439:     $R_\mathrm{error}$ & 0.43 & 0.36 & 0.36 & 0.50 \\
440:     $R_\mathrm{attack}$ & 0.012 & 0.048 & 0.046 & 0.38 \\
441:   \end{tabular}
442: \end{ruledtabular}
443: \end{table}
444: 
445: \subsection{Gene fusion network}
446: 
447: Cancer is a disease that occurs due to changes in the genome. One
448: important process causing such changes is gene fusion---when two
449: genes merge to form a hybrid gene~\cite{mitelman}. In
450: Ref.~\cite{hoglund} the authors construct a network of human genes
451: that have been observed to be fused in the development of tumors in
452: humans. Some genes can fuse with many others but most of the genes have
453: only been observed fusing with one, or a few others. The resulting
454: network structure has a skewed, power-law-like degree distribution and
455: is rather fragmented---the largest component spanning only $38\%$ of the
456: vertices. Statistics of this and the other networks are listed in
457: Table~\ref{tab:stat}.
458: 
459: 
460: \subsection{Metabolic network\label{sect:meta}}
461: 
462: A cell can be regarded as a machine driven by biochemical reactions. The
463: possible reactions of the metabolism (the cellular
464: biochemistry except signaling processes) and its environment
465: determine the state of the cell. The metabolism of an organism is a very
466: complex system---so complex that one has to choose between studying
467: a part of it in detail, or the whole with a coarser method. One approach in the latter category is to construct a network by connecting the chemical
468: substrates occurring in the same reactions, and employ
469: network analysis to characterize the large-scale structure of the
470: metabolism. The way to construct a biochemical network is not entirely
471: straightforward~\cite{zhao:meta}. Should the substances be
472: linked to each other (in a \textit{substrate graph}), or to the
473: reactions they participate in? If one uses a substrate graph, should
474: the substrates be linked only to products, or to all reactants
475: (i.e.\ in a reaction A + B $\leftrightarrow$ C + D, should A be
476: linked to C and D, or to all three other vertices)? Furthermore, some
477: chemical substances (like H$_2$O, ATP, NADH, and so on) are abundant
478: throughout the cell and seldom pose any restriction on the reaction
479: dynamics. For many purposes, one obtains a more meaningful network by
480: deleting such \textit{currency metabolites}. The biochemical network
481: we use is the human metabolic network of Ref.~\cite{our:curr}. In this
482: network, substrates are linked only to products (A to C and D in the
483: above example). Currency metabolites are identified and deleted
484: according to a self-consistent, graph-theoretic method~\cite{our:curr}.
485: 
486: 
487: \subsection{Protein interaction}
488: 
489: In protein interaction networks the vertices are proteins and two
490: proteins constitute an edge if they can interact physically. Examples
491: of interaction are the ability to form complexes, carrying another
492: protein across a membrane or modifying another protein. We use the
493: (``physical interaction'') data set from Ref.~\cite{hh:pfp}
494: of protein interaction in the budding yeast \textit{S. cerevisiae}.
495: 
496: 
497: \subsection{Neural network}
498: 
499: For the biochemistry of an organism, the network representation is a
500: crude model of the system as a whole (as an alternative to a detailed
501: model of a subsystem). Neural networks are yet more complex. For these
502: the choices are either to make a coarse-grained network
503: representation~\cite{sporns:cortex} or study the full network of a very
504: simple organism. In this work, we take the latter approach and study
505: the neural network of \textit{C. elegans}~\cite{white:420}. In this
506: data set, the strength of the neuronal coupling has been measured, but we make the network undirected by
507: letting an edge represent a non-zero coupling.
508: 
509: 
510: \begin{figure*}
511:   \centering\resizebox*{0.75\linewidth}{!}{\includegraphics{exa.eps}}
512:   \caption{The valid region demarcated by the $C_\mathrm{min}(r)$- and
513:     $C_\mathrm{max}(r)$-curves (a), and three networks: (b) is the
514:     original gene fusion network, (c) shows a random sample with
515:     $r$-$C$ coordinates close to those of the real network. (d) shows a network
516:     with high clustering and high assortativity. The largest components
517:     of (b), (c) and (d) are indicated with a different color.
518:   }
519:   \label{fig:exp}
520: \end{figure*}
521: 
522: \section{Numerical results}
523: 
524: 
525: In this section we present numerical results for our four
526: network-structural measures over the $\mathcal{G}(G)$ ensembles of the
527: four test graphs. To get a first view, we display the valid region of the
528: gene fusion graph in Fig.~\ref{fig:exp}(a). As seen, the valid region does
529: not cover a large part of the theoretical ranges of $r$ ($-1\leq r\leq 1$) and
530: $C$ ($0\leq C < 1$). (Note that only fully
531: connected graphs have $C=1$, and for these $r$ is undefined.) The
532: requirement that the graph should be simple (no multiple edges or
533: self-edges) puts hard constraints on the actual $r$-values that can
534: occur (cf.\ Ref.~\cite{maslov:inet}). Fig.~\ref{fig:exp}(a) shows that, considering the entire $r$-$C$ plane, such constraints are even harder.
535: The general shape of the valid region is consistent with
536: the observations that the simple-graph constraint induces a positive
537: correlation between $r$ and $C$~\cite{maslov:inet,mejn:why}.
538: 
539: In Fig.~\ref{fig:exp}(b), (c) and (d) we show three example networks
540: of $\mathcal{G}(G)$ (where $G$ is the gene fusion
541: network). Fig.~\ref{fig:exp}(b) displays the relatively fragmented
real network. Fig.~\ref{fig:exp}(c) is a random network $G'$ with
almost the same $r$-$C$ coordinates as the real network
($\delta(G,G')\approx 0.0026$). Perhaps the most visible difference
545: between $G$ and $G'$ is the larger size of the largest component of
546: $G'$. Is it true that the gene fusion network is unusually fragmented,
547: given the degree sequence and $r$-$C$ coordinates? If so, there might
548: be an evolutionary pressure for gene fusion networks to be fragmented. (This
549: will be discussed further in Sect.~\ref{sect:size}.)
550: Fig.~\ref{fig:exp}(d) shows, as a contrast, a network far away from
551: $G$ and $G'$. The network has a well-defined core where high-degree
552: vertices connect to each other. There are also a number of peripheral
553: triangles, which indicates that the network evolves toward a maximal
554: $C$-value, given its assortativity.
555: 
556: \begin{figure*}
557:   \centering\resizebox*{0.75\linewidth}{!}{\includegraphics{ngc.eps}}
558:   \caption{The relative size of the largest component $s$ as a function of
559:     $r$ and $C$. The networks are (a) the network of gene fusions in
560:     tumors in humans, (b) protein interaction network of
561:     \textit{S. cerevisiae}, (c) human metabolic network and (d) the
562:     \textit{C. elegans} neural network. The x-like symbols of the main
563:     figures and the diamond symbols of the color-bars indicate
564:     the values of the real networks. The plus-like symbol
565:     indicates the average $(r,C)$-value of the $\mathcal{G}(G)$
566:     ensemble.}
567:   \label{fig:ngc}
568: \end{figure*}
569: 
570: \subsection{Location in $r$-$C$ space and size of largest
571:   component\label{sect:size}}
572: 
573: 
574: In Fig.~\ref{fig:ngc} we plot the relative size of the largest
575: component of the four test networks. We also display the locations of
576: the actual networks in the $r$-$C$ plane, and the
577: $\mathcal{G}(G)$-averages. (The $\mathcal{G}(G)$ averages are obtained
578: from a rewiring sampling of $\mathcal{G}(G)$, with
579: step~\ref{step:check} of the algorithm as the only constraint.) We see
580: that the $C$-value of the gene fusion graph lies close to the
581: $C_\mathrm{min}(r)$-boundary of its valid region. $C$ averaged over
582: the whole $\mathcal{G}(G)$ is about three times larger ($\langle
583: C\rangle_{\mathcal{G}(G)}=0.0061\pm 0.0001$) than the observed value
584: ($C=0.0017$). Furthermore, we see that the assortativity is lower than
585: the $\mathcal{G}(G)$ average. This kind of analysis has been used by
586: many authors (following Ref.~\cite{maslov:pro}). The interpretation is
587: usually that the network is, effectively, disassortative and clustered
588: (i.e., $r<\langle r\rangle_{\mathcal{G}(G)}$ and $C>\langle
589: C\rangle_{\mathcal{G}(G)}$). However, looking at the entire valid
590: region, we can get another perspective: If high clustering really
591: would have been an important goal for the network to obtain (given the
592: degree sequence) there is large room for improvement. For the
593: assortativity, on the other hand, the observed network is rather close to the
minimum. This might be telling us that assortativity is a more
important factor than clustering in the evolution of the gene fusion
596: networks. The protein interaction network of Fig.~\ref{fig:ngc}(b) is
597: located quite far from the ensemble average---the assortativity is much
598: lower than the $\mathcal{G}(G)$-average, and given that assortativity,
599: the clustering is maximal. Also
600: the metabolic (Fig.~\ref{fig:ngc}(c)) and neural
601: (Fig.~\ref{fig:ngc}(d)) networks are more clustered than the average,
602: but here the assortativity is slightly larger than the
603: $\mathcal{G}(G)$ average. From Fig.~\ref{fig:ngc} we also note that
the density of states is very inhomogeneously distributed: the
605: average $(r,C)$ is close to $C=0$ and (except for the neural network)
606: left of the middle of the assortativity spectrum. The shapes of the
607: valid regions are rather similar, with an exception for the broader
region of the neural network. This can be related to the
narrower degree sequence of the neural network~\cite{amaral:classes}. We
610: have established a correlation between $r$ and
611: $C$. Ref.~\cite{mejn:why} argues that such correlation occurs in
612: social networks because of their modularity (or ``community
613: structure'' as the authors call it). However, our large-$r$ networks
614: have no explicit bias towards high modularity, which leads us to
conjecture that the correlation between $C$ and $r$, or more
fundamentally between $C$ and the sum $\sum_{(i,j)\in E} k_ik_j$ (which, given a
degree sequence, is the only factor of Eq.~\ref{eq:assmix} that can
vary), is a more general phenomenon. Since $r$ is normalized by,
essentially, the variance of the degree, it follows that the valid
region for a $\mathcal{G}(G)$ with a narrower degree sequence will
appear stretched (larger).
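
To make the role of this sum explicit, one can rewrite the assortativity for a fixed degree sequence. The following is a sketch that assumes Eq.~\ref{eq:assmix} has the standard Pearson form with averages taken over the $M$ edges; the symbols $S$, $K_2$ and $K_3$ are shorthand introduced here only for illustration.
\begin{equation*}
  r = \frac{4\langle k_1 k_2\rangle - \langle k_1+k_2\rangle^2}
           {2\langle k_1^2+k_2^2\rangle - \langle k_1+k_2\rangle^2}
    = \frac{4MS - K_2^2}{2MK_3 - K_2^2},
\end{equation*}
where $S=\sum_{(i,j)\in E}k_ik_j$, $K_2=\sum_i k_i^2$ and $K_3=\sum_i k_i^3$. Because every vertex $i$ is an end point of $k_i$ edges, $M\langle k_1+k_2\rangle=K_2$ and $M\langle k_1^2+k_2^2\rangle=K_3$ are fixed by the degree sequence. Under these assumptions $r$ is a linear, increasing function of $S$ whenever the degrees are not all equal.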
622: 
623: Turning to the average size of the largest component, we observe that the
624: gene fusion network is indeed more fragmented than the average network of the
625: same $(r,C)$-coordinates (as anticipated from comparing Figs.~\ref{fig:exp}(b) and (c)). The protein interaction and neural networks
626: have no particular bias in this respect, whereas the metabolic network
627: is more fragmented than expected. The relatively low $s$ of the
628: metabolic network can be attributed to the ``modularity'' of such
629: networks~\cite{zhao:meta,our:curr}. Such modules are subgraphs that
630: are densely connected within, and sparsely inter-connected. Sometimes
631: they are even disconnected from the largest component (which explains
632: the lower $s$). In general, $s$ decreases with assortativity. This
633: is natural---in more assortative networks high degree vertices are
634: connected to each other, forming a highly connected core and a
635: periphery too sparse to be connected (viz.\ Fig.~\ref{fig:exp}(c) and
636: (d)). For the denser networks (the protein interaction, metabolic and
637: neural networks) $s$ increases with $C$ (for a fixed $r$). For the
638: sparser gene-fusion network $s$ has a peak at
intermediate $C$. We do not speculate further about the combinatorial
causes of these dependencies, but we note (comparing
e.g.\ Figs.~\ref{fig:ngc}(a) and (b)) that even though the shapes of
the valid regions are similar, the $s$ behavior can be qualitatively
different.
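
For reference, the quantity $s$ can be computed with a breadth-first search over the components. The Python sketch below assumes that $s$ is the number of vertices in the largest component divided by the total number of vertices $N$, with vertices labeled $0,\dots,N-1$; the function name is hypothetical.
\begin{verbatim}
```python
from collections import deque


def largest_component_fraction(n, edges):
    """Fraction of the n vertices that lie in the largest component."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = [False] * n
    best = 0
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search over the component containing start.
        seen[start] = True
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        best = max(best, size)
    return best / n
```
\end{verbatim}
For example, five vertices with edges $(0,1)$, $(1,2)$ and $(3,4)$ give $s=0.6$.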
644: 
645: 
646: \begin{figure*}
647:   \centering\resizebox*{0.75\linewidth}{!}{\includegraphics{length.eps}}
648:   \caption{The average distance within the largest component $\langle
649:     d\rangle$ as a function of $r$ and $C$. The panes and symbols
650:     correspond to those of Fig.~\ref{fig:ngc}.}
651:   \label{fig:length}
652: \end{figure*}
653: 
654: \subsection{Distances in the largest component}
655: 
656: In Fig.~\ref{fig:length} we display the average distance in the
657: largest component. As mentioned, measuring the distance can give complementary
658: information to the $s(r,C)$ graphs of Fig.~\ref{fig:ngc}---while $s$
tells us how much of the network can be reached, $\langle
660: d\rangle$ tells us how fast that can happen. For all networks the big picture is that large connected
661: components have large average distances. This is expected from most
662: network models. There is, however, more information than this in
663: Fig.~\ref{fig:length}: for components of the same size, the average
664: distance is increasing with both $r$ and $C$. That $\langle d\rangle$
665: should increase with $C$ seems quite natural---if one of a triangle's
666: edges is rewired to connect two distant vertices, the distances in the
surrounding of the triangle would increase by one, but this would be
668: more than compensated by the connection of the two previously distant
669: areas. Disassortative networks typically lack a well-defined core.
670: Such cores are known to keep the average distance of general power-law networks
short~\cite{chung_lu:pnas}. Thus one would expect an increase of $r$
to cause a smaller $\langle d\rangle$, but apparently the
clustering-related length increase outweighs this effect.
674: In contrast to the relative size, the average distances of the real
675: networks are close to the $\mathcal{G}(G)$-averages at the same
676: $r$-$C$ coordinates. For the gene fusion network (with a relatively
677: small largest component), this means the distances are rather
678: large.
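
The measurement of $\langle d\rangle$ can be sketched as follows. This Python example assumes that $\langle d\rangle$ is the mean shortest-path length over all vertex pairs of the largest component; the function name is hypothetical, and the all-pairs breadth-first search is chosen for clarity rather than speed.
\begin{verbatim}
```python
from collections import deque


def average_distance_largest_component(n, edges):
    """Mean shortest-path length within the largest component."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def bfs(source):
        # Distances from source to every vertex it can reach.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        return dist

    seen, largest = set(), []
    for v in range(n):
        if v not in seen:
            comp = list(bfs(v))
            seen.update(comp)
            if len(comp) > len(largest):
                largest = comp
    if len(largest) < 2:
        return 0.0
    # Every unordered pair is counted twice in both sums.
    total = sum(sum(bfs(v).values()) for v in largest)
    ordered_pairs = len(largest) * (len(largest) - 1)
    return total / ordered_pairs
```
\end{verbatim}
For a path of four vertices plus one separate edge, the largest component is the path and the sketch returns $10/6\approx 1.67$.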
679: 
680: 
681: \begin{figure*}
682:   \centering\resizebox*{0.75\linewidth}{!}{\includegraphics{error.eps}}
683:   \caption{The error robustness $R_\mathrm{error}$ as a function of
684:     $r$ and $C$. The panes and symbols correspond to those of
685:     Fig.~\ref{fig:ngc}.}
686:   \label{fig:error}
687: \end{figure*}
688: 
689: \subsection{Error robustness}
690: 
691: Next, we turn to the error robustness problem. As seen in
692: Fig.~\ref{fig:error} the gene fusion network (Fig.~\ref{fig:error}(a)),
693: once again, has a qualitatively different behavior than the other
694: three networks (Fig.~\ref{fig:error}(b), (c) and (d)). While the gene
695: fusion network is most robust for high $r$- and $C$-values the other
networks are most robust for low $r$. A tentative explanation can be found
697: in the chain-like subgraphs extending from the largest component in
698: a large-$r$ network (cf.\ Fig.~\ref{fig:exp})---with a random deletion
699: of vertices, these subgraphs are likely to be disconnected from the
700: core rather soon (whereas in a disassortative network alternative paths may
still exist). If the deletion-robust core is then less than half of
the original component size, it follows that it may soon be
703: isolated. The sparsity of the gene fusion network makes the low-$r$
704: $\mathcal{G}(G)$-graphs much like trees (i.e., having few cycles), and
705: since cycles provide redundant paths that can make a network robust,
it follows that these graphs are fragile. For a fixed
$r$, $R_\mathrm{error}$ is an increasing function of $C$ for the three
708: largest networks. We believe this is an effect of the local path
709: redundancy induced by triangles---if one vertex of a triangle is
710: deleted, the other two are still connected.
711: 
712: The $R_\mathrm{error}$-values for the real networks are always
713: markedly higher than the $\mathcal{G}(G)$-averages for the same
714: $(r,C)$-coordinates. Networks with highly skewed degree distributions
715: (the gene fusion, protein interaction and metabolic networks) are
716: known to be robust to errors by virtue of degree distribution
alone~\cite{alb:attack}. Fig.~\ref{fig:error} now tells us that all
these networks have a yet higher error tolerance, which is an indication
719: that error robustness is an important factor in the evolution of these
720: networks.
721: 
722: 
723: \begin{figure*}
724:   \centering\resizebox*{0.75\linewidth}{!}{\includegraphics{attack.eps}}
  \caption{The attack robustness $R_\mathrm{attack}$ as a function of
726:     $r$ and $C$. The panes and symbols correspond to those of
727:     Fig.~\ref{fig:ngc}.}
728:   \label{fig:attack}
729: \end{figure*}
730: 
731: \subsection{Attack robustness}
732: 
733: The final quantity we measure is the attack robustness (see
734: Fig.~\ref{fig:attack}). $R_\mathrm{attack}$'s functional dependence
735: on $r$ and $C$ is quite different from that of $R_\mathrm{error}$. The
736: gene fusion $\mathcal{G}(G)$ has the highest attack robustness at high
737: $r$- and low $C$-values. The other networks have higher robustness
738: values for high assortativity, but no clear tendency in the
739: $C$-direction. The attack mechanism we study targets the high degree
740: vertices. Having all high degree vertices connected to each other is
741: probably the only way to keep the network from instantaneous
742: fragmentation. The observed $r$-dependence is thus rather
743: expected. The real-world networks all have $R_\mathrm{attack}$-values
744: of the same order of magnitude as the average values for the
745: $\mathcal{G}(G)$ networks of the same location in $r$-$C$ space.
We note that for studying the attack problem of metabolic networks, the (less common) enzyme-centric graph representation is more appropriate (see Sect.~\ref{sect:meta}). The reason is that one can suppress an enzyme much more easily than one can remove a substrate.
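
The two removal processes can be contrasted in a small Python sketch. The definitions of $R_\mathrm{error}$ and $R_\mathrm{attack}$ are given earlier in the paper and are not reproduced here; the sketch only records the size of the largest component while vertices are deleted, either in random order (errors) or in order of decreasing initial degree (attack). That the attack uses the initial rather than a recalculated degree is an assumption, and all function names are hypothetical.
\begin{verbatim}
```python
import random


def largest_component(adj, alive):
    """Size of the largest component among the alive vertices."""
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w in alive and w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


def removal_curve(n, edges, order):
    """Largest-component size before and after each deletion."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(range(n))
    curve = [largest_component(adj, alive)]
    for v in order:
        alive.discard(v)
        curve.append(largest_component(adj, alive))
    return curve


def attack_order(n, edges):
    """Vertices sorted by decreasing initial degree (assumption)."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sorted(range(n), key=lambda v: -deg[v])


def error_order(n, seed=0):
    """A uniformly random deletion order."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order
```
\end{verbatim}
For a star graph the attack order deletes the hub first, so the largest component drops from five vertices to one after a single deletion.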
747: 
748: 
749: \begin{table}
\caption{Summary of the network structural measures of the real world
  networks relative to the average values of the $\mathcal{G}(G)$ graphs within a
  distance $\delta < 0.02$ of the real network. ``$<$'' indicates that
  the real network has a lower value than the corresponding
  $\mathcal{G}(G)$-value. All results are significant with p-values
  $<0.01$, except the $s$-value of the neural network, which has a
  p-value of $\sim 0.05$.}
757: \begin{ruledtabular}
758:   \begin{tabular}{r|dddd}\label{tab:pval}
759:     & \multicolumn{1}{c}{gene fusion} &
760:     \multicolumn{1}{c}{protein interaction} &
761:     \multicolumn{1}{c}{metabolic} &
762:     \multicolumn{1}{c}{neural}\\\hline
763:     $s$ & < & < & < & > \\
764:     $\langle d\rangle$ & < & < & < & < \\
765:     $R_\mathrm{error}$ & > & > & > & > \\
766:     $R_\mathrm{attack}$ & > & > & > & > \\
767:   \end{tabular}
768: \end{ruledtabular}
769: \end{table}
770: 
771: \subsection{Comparison between the graphs}
772: 
773: Even though all our example networks are constructed from biological
774: data, they represent fundamentally different systems---the neural
775: network is spatial by nature, the protein interaction and (even more
776: so) the metabolic networks are the background topology for an active
777: dynamic system, whereas the gene fusion network is a representation
778: of possible but undesired events. The protein interaction, metabolic
779: and neural networks have one thing in common---the organism needs them
780: to be robust to errors (caused by injuries, mutations, disease
781: etc.)~\cite{wagner:robu}. As mentioned above and summarized in
782: Table~\ref{tab:pval} the error robustness is indeed higher for the
783: real networks than the $\mathcal{G}(G)$-ensemble at the same
$(r,C)$-coordinates. The attack robustness of the
real networks is of the same order as the $\mathcal{G}(G)$-average at
the same $(r,C)$-coordinates, but there is in fact a significant
tendency for these networks to also be more robust to
788: attacks. Furthermore, the distances in the largest component, and the
789: relative sizes $s$ are (with the neural network $s$-value as the only
790: exception) smaller in the real than the $\mathcal{G}(G)$ networks.
791: 
Despite these similarities between the statistics of the
real-world networks, the $r$-$C$ spaces of the different degree
sequences have qualitatively different network structure. In particular,
795: the gene fusion network behaves almost the opposite of the other
796: networks (at least for $s$, $\langle d\rangle$ and
797: $R_\mathrm{error}$). The source of this opposite behavior (as we
798: discuss above) is probably that it is much sparser than the other
networks. The neural network is the densest network and the only one
that does not have a power-law-like degree distribution.
801: 
802: 
803: \section{Discussion}
804: 
805: Many complex network studies use the ensemble $\mathcal{G}(G)$ of
806: graphs with the same degree sequence as the subject graph $G$ as a
807: null model. In contrast to a generative network model, with a few
808: degrees of freedom that has to be fitted approximately, such an
809: ensemble has $O(N)$ degrees of freedom that can be matched exactly
810: with the values of $G$. We argue that $\mathcal{G}(G)$ is more than a
811: null model---by resolving the graphs of $\mathcal{G}(G)$ in a space
812: defined by some network-structural measures, one can get a picture of
the opportunities and limits there are (or have been) in the evolution
814: of $G$. In this work we map out $\mathcal{G}(G)$ in the
815: two-dimensional space defined by the clustering coefficient and the
816: assortativity. Then we measure other network structural quantities
817: throughout this space. One formal way to see our method is that we
818: resolve $\mathcal{G}(G)$ in the (high dimensional) space of all
819: sensible network measures. Then, for simplicity, we project to a few
820: dimensions. (The case of projection to one dimension has been studied
821: in a less formalized way earlier---projection to
822: assortativity~\cite{zj:spectrum} or a ``hierarchy''
823: measure~\cite{rosv:mountain}.) An interesting open question is to find
824: the principal components of the space of all sensible network
825: measures. Using four example networks from biology, we measure
826: average values of four network-structural quantities over the $r$-$C$
827: space and compare these with the values of the real networks.
828: 
The functional characteristics of the $r$-$C$ spaces vary considerably
between the four example networks. For example, the
\textit{C. elegans} neural network covers a much larger area of the
$r$-$C$ space, something that probably relates to its narrower
degree distribution. The human gene fusion network, on the other hand,
has a broad degree distribution similar to the \textit{S. cerevisiae}
protein interaction and human metabolic networks; still, the structural
836: dependency on $r$ and $C$ is very different for the gene fusion
837: network compared to the others. We argue that this difference stems
838: from the sparseness of the gene fusion network. To achieve a
839: comprehensive understanding about how the network structure throughout
840: the $r$-$C$ space depends on the degree sequence, one would need a
841: systematic investigation of different artificial degree sequences. In
842: this paper, we do not pursue this goal beyond the analysis of the four
843: biological data sets. The position of the real networks in the valid
844: region of the $r$-$C$ space adds some further information. For
845: example, it may have been the case that networks with lower
846: assortativity have been favored during the evolution of the gene fusion
847: network. Clustering, on the other hand, has probably not put any
848: constraint on the network evolution. Furthermore we compare the
849: network structure of the real networks with the average values of
850: networks in $\mathcal{G}(G)$ that are close to the $(r,C)$-coordinates
of the real network. From this analysis, we conclude that all
four of our example networks are more robust to both random errors and
853: targeted attacks than what can be expected from a random network
854: constrained to the same degree distribution, assortativity and clustering
855: coefficient. For all networks, except maybe the gene fusion network,
856: this is in line with robustness being an important factor in the
857: network evolution. Note that in this work we assume the subject
858: network to be accurate. To get more valid error estimates one would
859: need to take the accuracy of the edges into account.
860: 
861: The analysis scheme presented in this paper can be further extended
and analyzed. As mentioned, a quantitative
evaluation of the network-structural spaces, and of how they depend on
the degree sequence, would be interesting. One can also try, for time-resolved data sets, to
865: incorporate dynamic information in the analysis by monitoring the
866: network-evolutionary trajectory in the $r$-$C$ space.
867: 
868: 
869: \acknowledgements{
870:   The authors thank Mikael Huss and Martin Rosvall for helpful
871:   suggestions and comments.
872:   PH acknowledges financial support from the Wenner-Gren
873:   Foundations and the National Science Foundation (grant
874:   CCR--0331580).
875: }
876: 
877: \appendix
878: 
879:  \begin{figure}
880:   \centering\resizebox*{\linewidth}{!}{\includegraphics{conv.eps}}
881:   \caption{Convergence of the optimization algorithm. (a) shows the
882:     average maximal assortativity $\langle
883:     r_\mathrm{max}\rangle$ with $\nu_\mathrm{rep.}=1$. The horizontal
884:     line represents the result of the maximization algorithm of
885:     Ref.~\cite{doyle:big}. (b) shows the further improvement by
886:     finding the maximum over many independent runs (for
    $\nu_\mathrm{same}=10000$). The vertical bars indicate the
888:     standard deviation of the observed maxima.}
889:   \label{fig:conv}
890: \end{figure}
891: 
892: \section{Convergence and sampling uniformity}
893: 
894: In this Appendix, we address some technical issues of our method
895: related to the convergence of our optimization algorithm and uniformity of
896: the sampling. We will also motivate our choice of parameters.
897: 
898: 
899: \subsection{Assortativity and clustering extremes}
900: 
901: To find the extremal assortativity values we use the edge-swapping
902: algorithm described in Sect.~\ref{sect:analysis}. To find
903: $r_\mathrm{min}$ we start from a random member of $\mathcal{G}(G)$ and
904: swap random edge-pairs (keeping the graph simple at all times) that
905: lower $r$. When no graph of lower $r$ has been found for
906: $\nu_\mathrm{same}$ time steps, we break the iteration. To avoid the
907: effect of being trapped in local minima, this process is repeated
908: $\nu_\mathrm{rep.}$ times. The main motivation for using this method
909: is that it is at heart the same scheme as for obtaining the extremal
910: clustering values and sampling the valid region (and thus we can
911: re-use the same code for many steps of the calculations). In this
912: section, we argue that the optimization performance of this method is sufficiently good for our purpose.
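
The search described above can be sketched in Python as follows. The sketch minimizes $\sum_{(i,j)\in E}k_ik_j$, which for a fixed, non-regular degree sequence is equivalent to minimizing $r$. For brevity every repetition starts from the input graph instead of a random member of $\mathcal{G}(G)$, which is a simplification; the function name and the default parameter values are hypothetical.
\begin{verbatim}
```python
import random


def minimize_degree_product(edges, nu_same=1000, nu_rep=3, seed=0):
    """Greedy edge-pair swapping that lowers sum of k_i * k_j."""
    rng = random.Random(seed)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    best_sum, best_edges = None, None
    for _ in range(nu_rep):
        cur = [tuple(e) for e in edges]
        present = {frozenset(e) for e in cur}
        total = sum(deg[u] * deg[v] for u, v in cur)
        fails = 0
        while fails < nu_same:
            a, b = rng.sample(range(len(cur)), 2)
            (i, j), (u, v) = cur[a], cur[b]
            if rng.random() < 0.5:
                u, v = v, u  # try both ways of reconnecting
            # Proposed swap: (i,j),(u,v) -> (i,v),(u,j).
            new1, new2 = frozenset((i, v)), frozenset((u, j))
            delta = (deg[i] * deg[v] + deg[u] * deg[j]
                     - deg[i] * deg[j] - deg[u] * deg[v])
            simple = (i != v and u != j
                      and new1 not in present
                      and new2 not in present)
            if simple and delta < 0:
                present -= {frozenset((i, j)), frozenset((u, v))}
                present |= {new1, new2}
                cur[a], cur[b] = (i, v), (u, j)
                total += delta
                fails = 0
            else:
                fails += 1
        if best_sum is None or total < best_sum:
            best_sum, best_edges = total, list(cur)
    return best_sum, best_edges
```
\end{verbatim}
Here \texttt{nu\_same} and \texttt{nu\_rep} play the roles of $\nu_\mathrm{same}$ and $\nu_\mathrm{rep.}$. Each accepted swap preserves all degrees and keeps the graph simple, so the result stays in $\mathcal{G}(G)$.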
913: 
914: There is a deterministic method to maximize the assortativity that is,
915: if it exits properly, guaranteed to find
916: $r_\mathrm{max}$~\cite{doyle:big}. The method works as
917: follows: First all vertex-pairs $(i,j)$ are ranked in decreasing order of
918: the product of their degrees, $k_ik_j$. Then the edges are added
in order of this list unless the degree of one of the vertices is
already fulfilled. There are some other technicalities from the additional
921: constraint (of the authors) that the network should be connected. Of
922: our networks, only the neural network has such an evolutionary
923: constraint, so we do not impose it.
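
The ranking step can be sketched in a few lines of Python. This is a simplified reading of the description above, without the connectivity constraint, and not the full algorithm of Ref.~\cite{doyle:big}; the function name is hypothetical. The second return value lists the unused degree of each vertex, so a caller can check whether the construction completed.
\begin{verbatim}
```python
from itertools import combinations


def greedy_assortative_graph(degrees):
    """Add vertex pairs in decreasing order of k_i * k_j."""
    n = len(degrees)
    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: -degrees[p[0]] * degrees[p[1]])
    remaining = list(degrees)  # degree still to be filled
    edges = []
    for i, j in pairs:
        # Skip the pair if either vertex already has its full degree.
        if remaining[i] > 0 and remaining[j] > 0:
            edges.append((i, j))
            remaining[i] -= 1
            remaining[j] -= 1
    return edges, remaining
```
\end{verbatim}
For the degree sequence $(3,3,2,2,1,1)$ the sketch connects the four highest-degree vertices into a core and joins the two degree-one vertices to each other, leaving no unused degree.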
924: 
925: In Fig.~\ref{fig:conv} we display the parameter dependence of the
926: convergence for the gene fusion network. The horizontal line is the
927: theoretical maximum obtained by the algorithm of
928: Ref.~\cite{doyle:big}. When $\nu_\mathrm{same} = 10000$ we obtain an
929: average maximal assortativity within $0.001$ of the
930: theoretical maximum (Fig.~\ref{fig:conv}(a)). By increasing
931: $\nu_\mathrm{rep.}$ the accuracy can be increased further
(Fig.~\ref{fig:conv}(b)). The lattice spacing we use is $0.005\lesssim
\Delta r \lesssim 0.02$, so we deem a precision of $0.001$ sufficient. The
934: gene fusion network is our smallest network but the other networks are
935: not harder to converge. When one edge-pair is swapped so that $r$
936: decreases, the only term of Eq.~\ref{eq:assmix} that changes is
937: $\langle k_1\, k_2\rangle$. The potential change of the sum
938: $\sum_{(i,j)\in E} k_ik_j$, in the calculation of $\langle k_1\, k_2\rangle$
939: (close to the extrema) is of the order of the typical degree values
of the network. These values grow more slowly than the network itself,
941: which means that a larger network can be closer in $r$, but further
942: away in number of edge swaps to reach the global optimum, than a
943: smaller network. Some authors~\cite{doyle:big} use $\sum_{(i,j)\in E}
944: k_ik_j$ to measure the degree correlations, but since we strive for a
945: macroscopic level of description (consistent in the large-$N$ limit),
946: $r$ is a more appropriate quantity for the present work.
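
The effect of a single swap on this sum can be written out explicitly. As a sketch (the vertex labels $a$, $b$, $c$ and $d$ are introduced here only for illustration), suppose the edges $(a,b)$ and $(c,d)$ are replaced by $(a,d)$ and $(c,b)$. All degrees are unchanged, so
\begin{equation*}
  \Delta\!\!\sum_{(i,j)\in E}\!\! k_ik_j
  = k_ak_d + k_ck_b - k_ak_b - k_ck_d
  = -(k_a-k_c)(k_b-k_d).
\end{equation*}
The swap therefore lowers the sum, and with it $r$, exactly when $k_a-k_c$ and $k_b-k_d$ have the same sign.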
947: 
948: The optimization of the clustering to find the minima (maxima) of
949: the segments of assortativity space follows the same pattern as the
950: method to find the minimal (maximal) $r$. Changes of the parameters
951: ($\nu_\mathrm{same}$ and $\nu_\mathrm{rep.}$) have the same effect as
952: in Fig.~\ref{fig:conv}, and the same values seem sufficient.
953: 
954:  \begin{figure}
955:   \centering\resizebox*{0.95 \linewidth}{!}{\includegraphics{uni.eps}}
956:   \caption{Histograms of $s$ for the discussion of sampling
957:     uniformity. All the histograms are from the gene fusion network and a
    pixel centered around $r=-0.1$, $C=0.1$ (the dimensions of a pixel
    are $\Delta r = 0.013$, $\Delta C = 0.0096$). The error bars
    represent standard errors. Lines are guides for the eye. (a)
    shows the histograms with different numbers of random edge-pair
962:     swappings $\nu_\mathrm{rnd.}$ within the pixel before the
963:     measurements of quantities. (b) illustrates the location of the
964:     starting point pixels used in panels (c) and (d). (c) compares
965:     histograms for swapping processes starting at W, S with the
966:     regular algorithm. (d) compares the average histogram of walks starting in
967:     the four peripheral points of (b) with the result of the regular
968:     algorithm. In panels (c) and (d) $\nu_\mathrm{rnd.}=1000$. The
969:     whole range of the histograms is not shown, which is why the areas
970:     under the curves appear different.}
971:   \label{fig:uni}
972: \end{figure}
973: 
974: \subsection{Sampling uniformity}
975: 
976: The other technical issue we address in this Appendix is the
977: uniformity of our sampling procedure. Ideally we would like all
978: unique (i.e., non-isomorphic) members of $\mathcal{G}(G)$ to be
979: sampled with the same probability. The most important observation is
980: trivial---by edge-pair swapping one can go from one member of
981: $\mathcal{G}(G)$ to any other, and thus all members of the ensemble
982: will contribute to the averages. A much harder question is whether or
983: not every member of $\mathcal{G}(G)$ is sampled with uniform
984: probability. In this section, we will argue that our algorithm does a
985: reasonably good job in the sense that there are no inconsistencies and
986: parameter values are appropriate.
987: 
988: When the target pixel is found (step~\ref{step:rw} of the algorithm)
989: we perform $\nu_\mathrm{rnd.}$ additional random edge-pair swaps. The idea
990: is to sample the $\mathcal{G}(G)$-members of the pixel more
991: uniformly (and indeed to be able to reach into the interior of the
992: pixel). In Fig.~\ref{fig:uni}(a) we illustrate the effect of these random
moves. We plot normalized histograms of the relative size $s$ of the
largest component for $0$, $100$ and $10000$ random moves. We see that these
995: moves do make a difference (the $\nu_\mathrm{rnd.}=0$ is different
996: from the $\nu_\mathrm{rnd.}=100$) but it does not matter if
997: $\nu_\mathrm{rnd.}=100$ or $\nu_\mathrm{rnd.}=10000$. The same
998: situation is observed for other pixels, networks and
999: quantities. Therefore, we use $\nu_\mathrm{rnd.}=1000$ in this work.
1000: 
1001: Next, we will illustrate the use of the randomly permuted list
1002: in the sampling of the pixels (steps~\ref{step:perm} and
1003: \ref{step:pick} of the algorithm). The motivation for this procedure
1004: is that the network structure can depend on the direction from which
1005: the search arrives to the pixel. In Fig.~\ref{fig:uni}(b) we
1006: illustrate the test procedure---we sample separate histograms from four
1007: starting points in the four cardinal directions with respect to the central
1008: $(r,C)=(-0.1,0.1)$ pixel. In Fig.~\ref{fig:uni}(c) we
1009: see that the histograms from the W and S pixels are different. There
appear to be two regions of $\mathcal{G}(G)$ contributing to these
histograms (one with $s\approx 0.65$, one with $s\approx
0.75$). Searches starting from W seem to arrive at the $s\approx 0.75$
region more frequently, and searches starting at S end up around
$s\approx 0.65$ more frequently. The curve of the actual algorithm
weighs the two peaks more equally. The curves from N and E coincide
almost completely with the curve for the regular algorithm (and are
1017: therefore omitted for clarity). The impression we get is that the
1018: search from one direction can induce a bias in the network structure
1019: (symbolically speaking, the graphs have a preference for ending up in
1020: a certain region of $\mathcal{G}(G)$). However, from other directions,
1021: or by the random sampling of pixels (step~\ref{step:perm}), the bias is
1022: reduced. This picture is further strengthened in Fig.~\ref{fig:uni}(d)
where we show that the average of the histograms from the four
starting points overlaps with the histogram of the regular
1025: algorithm.
1026: 
1027: \begin{thebibliography}{10}
1028: 
1029: \bibitem{ba:rev}
1030: R.~Albert and A.-L. Barab\'{a}si.
1031: \newblock Statistical mechanics of complex networks.
\newblock {\em Rev. Mod. Phys.}, 74:47--98, 2002.
1033: 
1034: \bibitem{alb:attack}
1035: R.~Albert, H.~Jeong, and A.-L. Barab\'{a}si.
1036: \newblock Attack and error tolerance of complex networks.
1037: \newblock {\em Nature}, 406:378--382, 2000.
1038: 
1039: \bibitem{amaral:classes}
1040: L.~A.~N. Amaral, A.~Scala, M.~Barth\'{e}l\'{e}my, and H.~E. Stanley.
1041: \newblock Classes of small-world networks.
1042: \newblock {\em Proc. Natl. Acad. Sci. USA}, 97:11149--11152, 2000.
1043: 
1044: \bibitem{rosv:mountain}
1045: J.~B. Axelsen, S.~Bernhardsson, M.~Rosvall, K.~Sneppen, and A.~Trusina.
1046: \newblock Degree landscapes in scale-free networks.
1047: \newblock {\em Phys. Rev. E}, 74:036119, 2006.
1048: 
1049: \bibitem{ba:model}
1050: A.-L. Barab\'{a}si and R.~Albert.
1051: \newblock Emergence of scaling in random networks.
1052: \newblock {\em Science}, 286:509--512, 1999.
1053: 
1054: \bibitem{bw:sw}
1055: A.~Barrat and M.~Weigt.
1056: \newblock On the properties of small-world network models.
1057: \newblock {\em Eur. Phys. J. B}, 13:547--560, 2000.
1058: 
1059: \bibitem{chung_lu:pnas}
1060: F.~Chung and L.~Lu.
1061: \newblock The average distances in random graphs with given expected degrees.
1062: \newblock {\em Proc. Natl. Acad. Sci. USA}, 99:15879--15882, 2002.
1063: 
1064: \bibitem{doromen:book}
1065: S.~N. Dorogovtsev and J.~F.~F. Mendes.
1066: \newblock {\em Evolution of Networks: From Biological Nets to the Internet and
1067:   WWW}.
1068: \newblock Oxford University Press, Oxford, 2003.
1069: 
1070: \bibitem{er:on}
1071: P.~Erd\H{o}s and A.~R\'{e}nyi.
1072: \newblock On random graphs {I}.
1073: \newblock {\em Publ. Math. Debrecen}, 6:290--297, 1959.
1074: 
1075: \bibitem{gale:rew}
1076: D.~Gale.
1077: \newblock A theorem of flows in networks.
1078: \newblock {\em Pacific J. Math.}, 7:1073--1082, 1957.
1079: 
1080: \bibitem{hoglund}
1081: M.~H\"{o}glund, A.~Frigyesi, and F.~Mitelman.
1082: \newblock A gene fusion network in human neoplasia.
1083: \newblock {\em Oncogene}, 25:2674--2678, 2006.
1084: 
1085: \bibitem{holl:72}
1086: P.~W. Holland and S.~Leinhardt.
1087: \newblock Some evidence on the transitivity of positive interpersonal
1088:   sentiment.
1089: \newblock {\em Am. J. Sociol.}, 72:1205--1209, 1972.
1090: 
1091: \bibitem{hh:pfp}
1092: P.~Holme and M.~Huss.
1093: \newblock Role-similarity based functional prediction in networked systems:
1094:   application to the yeast proteome.
1095: \newblock {\em J. Roy. Soc. Interface}, 2:327--333, 2005.
1096: 
1097: \bibitem{our:attack}
1098: P.~Holme, B.~J. Kim, C.~N. Yoon, and S.~K. Han.
1099: \newblock Attack vulnerability of complex networks.
1100: \newblock {\em Phys. Rev. E}, 65:056109, 2002.
1101: 
1102: \bibitem{our:curr}
1103: M.~Huss and P.~Holme.
1104: \newblock Currency and commodity metabolites: Their identification and relation
1105:   to the modularity of metabolic networks.
1106: \newblock e-print q-bio/0603038.
1107: 
1108: \bibitem{katz:cug}
1109: L.~Katz and J.~H. Powell.
1110: \newblock Probability distributions of random variables associated with a
1111:   structure of the sample space of sociometric investigations.
1112: \newblock {\em Ann. Math. Stat.}, 28:442--448, 1957.
1113: 
1114: \bibitem{latora:eff}
1115: V.~Latora and M.~Marchiori.
1116: \newblock Efficient behavior of small-world networks.
1117: \newblock {\em Phys. Rev. Lett.}, 87:198701, 2001.
1118: 
1119: \bibitem{doyle:big}
1120: L.~Li, D.~Alderson, J.~C. Doyle, and W.~Willinger.
1121: \newblock Towards a theory of scale-free graphs: Definition, properties, and
1122:   implications.
1123: \newblock {\em Internet Mathematics}, 2:431--523, 2005.
1124: 
1125: \bibitem{maslov:pro}
1126: S.~Maslov and K.~Sneppen.
1127: \newblock Specificity and stability in topology of protein networks.
1128: \newblock {\em Science}, 296:910--913, 2002.
1129: 
1130: \bibitem{maslov:inet}
1131: S.~Maslov, K.~Sneppen, and A.~Zaliznyak.
1132: \newblock Detection of topological patterns in complex networks: Correlation
1133:   profile of the {I}nternet.
1134: \newblock {\em Physica A}, 333:529--540, 2004.
1135: 
1136: \bibitem{mitelman}
1137: F.~Mitelman, B.~Johansson, and F.~Mertens.
1138: \newblock Fusion genes and rearranged genes as a linear function of chromosome
1139:   aberrations in cancer.
1140: \newblock {\em Nature Genetics}, 36:331--334, 2004.
1141: 
1142: \bibitem{motter:cascade}
1143: A.~E. Motter.
1144: \newblock Cascade control and defense in complex networks.
1145: \newblock {\em Phys. Rev. Lett.}, 93:098701, 2004.
1146: 
1147: \bibitem{mejn:assmix}
1148: M.~E.~J. Newman.
1149: \newblock Assortative mixing in networks.
1150: \newblock {\em Phys. Rev. Lett.}, 89:208701, 2002.
1151: 
1152: \bibitem{mejn:rev}
1153: M.~E.~J. Newman.
1154: \newblock The structure and function of complex networks.
1155: \newblock {\em SIAM Review}, 45:167--256, 2003.
1156: 
1157: \bibitem{mejn:why}
1158: M.~E.~J. Newman and J.~Park.
1159: \newblock Why social networks are different from other types of networks.
1160: \newblock {\em Phys. Rev. E}, 68:036122, 2003.
1161: 
1162: \bibitem{alon}
1163: S.~Shen-Orr, R.~Milo, S.~Mangan, and U.~Alon.
1164: \newblock Network motifs in the transcriptional regulation network of
1165:   {E}scherichia coli.
1166: \newblock {\em Nature Genetics}, 31:64--68, 2002.
1167: 
1168: \bibitem{sporns:cortex}
1169: O.~Sporns, G.~Tononi, and G.~M. Edelman.
1170: \newblock Theoretical neuroanatomy: Relating anatomical and functional
1171:   connectivity in graphs and cortical connection matrices.
1172: \newblock {\em Cerebral Cortex}, 10:127--141, 2000.
1173: 
1174: \bibitem{wagner:robu}
1175: A.~Wagner.
1176: \newblock {\em Robustness and Evolvability in Living Systems}.
1177: \newblock Princeton University Press, Princeton NJ, 2005.
1178: 
1179: \bibitem{walker_walstedt}
1180: L.~R. Walker and R.~E. Walstedt.
1181: \newblock Computer model of metallic spin-glasses.
1182: \newblock {\em Phys. Rev. B}, 22:3816--3842, 1980.
1183: 
1184: \bibitem{wf}
1185: S.~Wasserman and K.~Faust.
1186: \newblock {\em Social network analysis: Methods and applications}.
1187: \newblock Cambridge University Press, Cambridge, 1994.
1188: 
1189: \bibitem{white:420}
1190: J.~G. White, E.~Southgate, J.~N. Thompson, and S.~Brenner.
1191: \newblock The structure of the nervous system of the nematode {C.} elegans.
1192: \newblock {\em Philos. Trans. Roy. Soc. Lond. B}, 314:1--340, 1986.
1193: 
1194: \bibitem{zj:spectrum}
1195: J.~Zhao, L.~Tao, H.~Yu, J.-H. Luo, Z.-W. Cao, and Y.-X. Li.
1196: \newblock The spectrum of degree correlations: topological diversity of
1197:   networks with a given degree sequence.
1198: \newblock e-print cond-mat/0611104.
1199: 
1200: \bibitem{zhao:meta}
1201: J.~Zhao, H.~Yu, J.~Luo, Z.-W. Cao, and Y.-X. Li.
1202: \newblock Complex networks theory for analyzing metabolic networks.
1203: \newblock {\em Chinese Science Bulletin}, 51:1529--1537, 2006.
1204: 
1205: \end{thebibliography}
1206: 
1207: 
1208: \end{document}
1209: 
1210: 
1211: