0611:q-bio0611020/xtr.tex

1: \documentclass[rmp,twocolumn,showpacs]{revtex4}

2:

3: \usepackage{graphicx,amsmath,amssymb,txfonts,dcolumn}

4:

5: \begin{document}

6:

7: \title{Exploring the assortativity-clustering space of a network's

8:   degree sequence}

9:

10: \author{Petter Holme}

11: \affiliation{Department of Computer Science, University of New Mexico,

12:   Albuquerque, NM 87131, U.S.A.}

13:

14: \author{Jing Zhao}

15: \affiliation{School of Life Sciences \& Technology, Shanghai Jiao Tong

16:   University, Shanghai 200240, China}

17: \affiliation{Shanghai Center for Bioinformation and Technology,

18:   Shanghai 200235, China}

19: \affiliation{Department of Mathematics, Logistical Engineering

20:   University, Chongqing 400016, China}

21:

22: \begin{abstract}

23:   Nowadays there is a multitude of measures designed to

24:   capture different aspects of network structure. To be able to say if

25:   the structure of a certain network is expected or not, one needs a

26:   reference model (null model). One frequently used null model is the

27:   ensemble of graphs with the same set of degrees as the original

28:   network. In this paper we argue that this ensemble can be more than just

29:   a null model---it also carries information about the original network and

30:   factors that affect its evolution. By mapping out this ensemble in the

31:   space of some low-level network structure---in our case those

32:   measured by the assortativity and clustering coefficients---one can

33:   for example study how close to the valid region of the parameter

34:   space the observed networks are. Such analysis suggests which

35:   quantities are actively optimized during the evolution of the

36:   network. We use four very different biological networks to exemplify

37:   our method. Among other things, we find that high clustering might

38:   be a force in the evolution of protein interaction networks. We also

39:   find that all four networks are conspicuously robust to both random

40:   errors and targeted attacks.

41: \end{abstract}

42:

43: \pacs{89.75.Fb, 82.39.Rt, 89.75.Hc}

44: % 89.75.Fb -- Structures and organization in complex systems

45: % 82.39.Rt -- Reactions in complex biological systems

46: % 89.75.Hc -- Networks and genealogical trees

47:

48: \maketitle

49:

50: \section{Introduction}

51:

52: Network structure~\cite{mejn:rev,doromen:book,ba:rev} is usually

53: defined as the way a network differs from what is expected. What

54: ``expected'' means depends on the fundamental constraints on the

55: network, and this can vary from system to system. For example, if

56: the network is made of units that must be connected to two, and only

57: two, others, then it is not interesting whether or not a vertex lies

58: on a cycle (we already know that it will). The ensemble of all

59: networks fulfilling the fundamental constraints on the system is

60: usually called a \textit{null model} (or \textit{reference model}). When

61: we have pinned down the null model we can measure the network

62: structure by standard quantities. If the values of these quantities

63: differ significantly from the null-model average, then we call the network

64: structured. The baseline assumption of complex network theory is that

65: network structure carries information about the forces that have

66: formed the network. Ever since the studies of Barab\'{a}si and

67: coworkers~\cite{ba:model,ba:rev}, the degree distribution (or, if

68: referring to the set of degrees of one particular network,

69: \textit{degree sequence}) has been regarded as the most fundamental

70: network structure. For many networks, the degrees are related to outer

71: factors (not emerging from the network evolution). In such cases the

72: ensemble of all graphs with the same degree sequence as the original

73: network is a natural null model. Another interpretation is that the

74: network structures measured relative to this null model are of higher

75: order than the degree---i.e., what remains after the effects of the

76: more fundamental structure (the degree sequence) are filtered away. The

77: usual way to use a null model is to compare a network measure with the

78: ensemble average value of the null model. In this paper we will argue

79: that one can glean more information about the original network by studying

80: the null model ensemble in greater detail than just measuring

81: averages.

82:

83:

84: We consider networks that can be modeled as a graph $G=(V,E)$ where

85: $V$ is the set of $N$ vertices and $E$ is the set of $M$ undirected

86: edges. We denote the ensemble of graphs with the same degree sequence

87: as $G$ as $\mathcal{G}(G)$. Our basic approach to study

88: $\mathcal{G}(G)$ is to resolve its members in the space of higher

89: order network structures. The two such higher order network structures

90: we consider in this paper are: the correlation between the

91: degrees at either side of an edge (measured by the \textit{assortative

92:   mixing coefficient}, $r$~\cite{mejn:assmix}, or simply

93: \textit{assortativity}); and, the fraction of triangles in the network

94: (measured by the \textit{clustering

95:   coefficient}, $C$~\cite{bw:sw,mejn:rev}). By mapping out

96: $\mathcal{G}(G)$ in the space defined by $r$ and $C$ one can pose

97: questions such as: How large is the region in $r$-$C$ space where

98: members of $\mathcal{G}(G)$ actually exist? (This helps us answer how

99: constrained the network evolution is if the degrees are given.) Is the

100: real network close to $\mathcal{G}(G)$'s boundaries in $r$-$C$ space?

101: (Which would indicate whether or not $r$ or $C$ are actively

102: optimized.)

103:

104:

105: The basis for our exploration of an ensemble $\mathcal{G}(G)$ is to map

106: out its members in the space defined by some network-structural

107: measures, in our case the assortativity and clustering. We explore the

108: $r$-$C$ space by successively rewiring pairs of edges, $(i,j)$ and

109: $(i',j')$ to $(i,j')$ and $(i',j)$, whenever this takes the system in a desired

110: direction. Rewiring techniques for studying networks are half a century

111: old~\cite{gale:rew} (randomization for obtaining null models was

112: studied in Ref.~\cite{katz:cug}). In the physics literature these

113: techniques were first used in Refs.~\cite{maslov:pro,alon}.

114:

115:

116: \section{Network structural measures}

117:

118: Before going into details of our algorithm, we will review the network

119: structural quantities that we use to describe our networks: both the

120: independent variables (the assortative and clustering

121: coefficients) that form the basis for our space of interest; and the

122: quantities we use for characterizing the regions of this space.

123:

124:

125: \subsection{Assortative mixing coefficient}

126:

127: It is quite well accepted that the set of degrees, the degree

128: sequence, is the network quantity that contains most information about both the evolution and function of the network. Degree can (in most

129: contexts) be identified as how influential the vertex is~\cite{wf} (in

130: some sense)---high-degree

131: vertices are assumed to be more influential in both the formation

132: of the network and the flow of dynamic systems on the network. In this

133: paper we assume the degree sequence is inherent to the system and look

134: at higher order structures arising from how the vertices are linked to

135: one another. The simplest such higher-order structure is the

136: correlations between the degrees of vertices at either side of an

137: edge. Is it the case that high-degree vertices are primarily connected

138: to other high-degree vertices, or are they linked to low-degree

139: vertices? A simple way of measuring this tendency is by the

140: assortativity~\cite{mejn:rev} $r$. Basically

141: speaking, $r$ is the linear correlation coefficient of the degrees at

142: either side of an edge. One complication is that since the edges are undirected, $r$ has to be symmetric with respect to edge-reversal, but the correlation coefficient is not symmetric. The solution is to let one edge contribute

143: twice to the covariance, i.e.\ represent an undirected edge by two directed edges pointing in opposite directions. If one uses an edge list representation

144: internally (i.e., let the edges be stored in an array of ordered pairs

145: $(i_1,j_1),\cdots,(i_M,j_M)$) then~\cite{mejn:assmix}

146: \begin{equation}\label{eq:assmix}

147:   r=\frac{4\langle k_1\, k_2\rangle - \langle k_1 + k_2\rangle^2}

148:   {2\langle k_1^2+k_2^2\rangle - \langle k_1+ k_2\rangle^2}

149: \end{equation}

150: where, for an edge $(i,j)$, $k_1$ is the degree of the first argument

151: (i.e., the degree of $i$) and $k_2$ is the degree of the second

152: argument. The range of $r$ is $[-1,1]$ where negative values indicate

153: a preference for highly connected vertices to attach to low-degree

154: vertices, and positive values mean that vertices tend to be attached

155: to others with degrees of similar magnitudes.
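
As a quick check of Eq.~(\ref{eq:assmix}) (a worked example added for illustration, not taken from the data sets of this paper), consider a star with one hub of degree three and three leaves of degree one. Every edge joins degrees $3$ and $1$, so $\langle k_1 k_2\rangle = 3$, $\langle k_1+k_2\rangle = 4$ and $\langle k_1^2+k_2^2\rangle = 10$, which gives
\begin{equation*}
  r=\frac{4\cdot 3 - 4^2}{2\cdot 10 - 4^2}=\frac{-4}{4}=-1 ,
\end{equation*}
the most disassortative value possible. For a path of four vertices (degrees $1,2,2,1$) the three edges give $\langle k_1 k_2\rangle = 8/3$, $\langle k_1+k_2\rangle = 10/3$ and $\langle k_1^2+k_2^2\rangle = 6$, so
\begin{equation*}
  r=\frac{32/3 - 100/9}{12 - 100/9}=\frac{-4/9}{8/9}=-\frac{1}{2} .
\end{equation*}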

156:

157:

158: \subsection{Clustering coefficient}

159:

160: Several simple random network models (such as the

161: Erd\H{o}s-R\'{e}nyi model~\cite{er:on} or the model for generating networks

162: of a given $r$-value in Ref.~\cite{mejn:assmix}) have rather few triangles (fully connected

163: subgraphs of three vertices). For some classes of real-world networks (notably

164: social networks~\cite{holl:72}) there is a strong tendency for triangles to form, which makes such models fail. The network measure of the density of

165: triangles is called the \textit{clustering coefficient}. We use the definition

166: of Ref.~\cite{bw:sw}:

167: \begin{equation}\label{eq:clust}

168:   C = 3 n_\mathrm{triangle}\:\big/\:n_\mathrm{triple},

169: \end{equation}

170: where $n_\mathrm{triangle}$ is the number of triangles and

171: $n_\mathrm{triple}$ is the number of connected triples (subgraphs

172: consisting of three vertices and two or three edges). The factor three

173: is included to normalize the quantity to the interval $[0,1]$.
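
A small worked example of Eq.~(\ref{eq:clust}) may help. It assumes the common counting convention in which each connected triple is counted once per central vertex, so that a vertex of degree $k$ contributes $k(k-1)/2$ triples; this convention is an assumption made here for illustration. For a single triangle, $n_\mathrm{triangle}=1$ and $n_\mathrm{triple}=3$, so $C=1$. Attaching one extra vertex by a single edge to one corner of the triangle gives the degrees $3,2,2,1$, hence
\begin{equation*}
  n_\mathrm{triple} = 3 + 1 + 1 + 0 = 5, \qquad
  C = \frac{3\cdot 1}{5} = 0.6 .
\end{equation*}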

174:

175:

176: \subsection{Distance and component size}

177:

178: Two quantities that are, perhaps more than any other, related to the functionality of dynamic processes on the network are the relative size of the largest

179: component (connected subgraph) $s$, and the average distance $\langle

180: d\rangle$. $s$ is simply defined as the number of vertices in the

181: largest component divided by $N$. The distance $d(i,j)$ between two

182: vertices $i$ and $j$ is defined as the number of edges in the shortest

183: path between these two vertices. $\langle d\rangle$ is $d(i,j)$

184: averaged over all vertex pairs ($i\neq j$) in the largest

185: component. In a network with large $s$ and small $\langle d\rangle$,

186: spreading processes will be fast and far-reaching. This is a good

187: property of information networks, but bad in the context of, for

188: example, disease spreading. Some authors have combined the distance

189: and component size aspects by considering the average reciprocal

190: distances~\cite{our:attack,latora:eff}. For most purposes, we believe,

191: valuable information gets lost in such a combination (a fragmented

192: network $G$ with short average distances can be something very

193: different from a connected graph of large distances and the same

194: average reciprocal distances as $G$).

195:

196:

197: \subsection{Robustness}

198:

199: One line of complex network research is the study of the response of

200: the network to attacks, errors, failures and other events that effectively change

201: the structure. The error response problem is usually formulated as: how

202: does the functionality of the network change if a random fraction of

203: the vertices, or edges, is removed~\cite{mejn:rev}? The attack

204: problem is the same, except that the vertices are not selected

205: randomly but according to some strategy intended to decrease the

206: network's functionality as rapidly as possible~\cite{our:attack,alb:attack}.

207: A frequently used metric for functionality is the ratio of $s$

208: before and after the

209: event~\cite{our:attack,alb:attack,motter:cascade}. In the error and

210: attack robustness problems, this quantity is typically plotted as a

211: function of the number of removed vertices. The idea is that even if

212: one network $G$ is more robust than another network $G'$ to the

213: removal of, say, $1\%$ of the vertices, $G'$ can be less vulnerable

214: than $G$ if $10\%$ of the vertices are deleted. Since we aim at

215: mapping out the $r$-$C$ space of degree sequences, we would like to

216: capture the robustness with just one number. We will use what we call

217: the $f$-\textit{robustness} $R_f$ of a network as the expected

218: fraction of vertices that needs to be removed for the relative size of

219: the largest component to decrease to a fraction $f\in (0,1)$ of its

220: original value. The way of removal can either be random (the error

221: problem) or selective (the attack problem). For the rest of the paper

222: we will set $f=1/2$, and refer to the $1/2$-robustness just as

223: ``robustness'' $R$. Other $f$-values give slightly different results, but

224: our conclusions will hold for a range of intermediate $f$-values.
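
One possible way to write this definition as a formula is the following sketch, in which the notation $s_n$ is introduced only for illustration. Let $s_n$ denote the relative size of the largest component after $n$ vertices have been removed, so that $s_0=s$. Then
\begin{equation*}
  R_f = \frac{1}{N}\,\Bigl\langle \min\{\,n : s_n \leq f\, s_0\,\}\Bigr\rangle ,
\end{equation*}
where the average runs over realizations of the removal sequence. For example, in a connected network of $N=10$ vertices, a removal sequence that first brings the largest component down to five or fewer vertices at the third removal contributes $3/10$ to $R_{1/2}$.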

225:

226:  \begin{figure}

227:   \centering\resizebox*{0.9\linewidth}{!}{\includegraphics{ill.eps}}

228:   \caption{Illustration of the analysis scheme applied to the

229:     \textit{C. elegans} neural network. (a) shows how the valid region

230:     is mapped out: 1. $r_\mathrm{min}$ is located. 2. $r_\mathrm{max}$

231:     is found and the interval $[r_\mathrm{min}, r_\mathrm{max}]$ is

232:     divided into $L$ segments. 3. $C_\mathrm{min}(n)$ is

233:     constructed. 4. $C_\mathrm{max}(n)$ is traced and the interval

234:     $[C_\mathrm{min}, C_\mathrm{max}]$ is segmented into $L$

235:     regions. (b) illustrates the sampling of the pixels. The next

236:     pixel to go to is chosen from a random permutation of the

237:     pixels. In this example $n$ and $n'$ are chosen to be far

238:     apart. The line shows the path taken by the algorithm. The circles

239:     indicate every thousandth step on the way from $n$ to $n'$. The

240:     blow-up illustrates the random walk within a pixel to sample the

241:     graphs of the pixel more randomly.

242: }

243:   \label{fig:ill}

244: \end{figure}

245:

246:

247: \section{The analysis scheme\label{sect:analysis}}

248:

249: The fundamental idea of our method is simple: we update the network by

250: choosing pairs of edges randomly, say $(i,j)$ and $(i',j')$, and swap

251: one end of them (forming $(i,j')$ and $(i',j)$). This guarantees that

252: the degree sequence stays intact. We navigate in the $r$-$C$ space by

253: only accepting changes that move us in the desired direction.  If an

254: edge-swap would introduce a self-edge (i.e.\ if $i=j'$ or $i'=j$) or a

255: multiple edge (i.e.\ if $(i,j')$ or $(i',j)$ belongs to $E$ before the

256: swapping, or \textit{move}) it is not performed. There are many other

257: technicalities concerning the convergence to extremes, uniformity of

258: the sampling and more that we discuss in the Appendix.

259:

260: The members of the ensemble $\mathcal{G}(G)$ do not, in general, cover

261: the whole range of $(r,C)$-values. Indeed, for any finite $G$,

262: $\mathcal{G}(G)$ defines a set of points, rather than a continuous

263: region, in the $r$-$C$ space. We will perform a more coarse-grained

264: analysis, breaking down the $r$-$C$ space into pixels and averaging quantities over the graphs of $\mathcal{G}(G)$ with $(r,C)$-values within each pixel. (Thus, a pixel constitutes a graph ensemble in itself; our aim is to sample its members uniformly at random.) For a computationally tractable

265: resolution, the pixels containing members of $\mathcal{G}(G)$

266: typically form contiguous regions. We will refer to the pixels that

267: contain a member of $\mathcal{G}(G)$ as \textit{valid pixels}, and all

268: pixels that are valid or between valid pixels the \textit{valid

269:   region} of $\mathcal{G}(G)$.

270:

271: To trace the valid region of $\mathcal{G}(G)$ we start by finding the

272: lowest and highest assortativity value, $r_\mathrm{min}$ and

273: $r_\mathrm{max}$ respectively. Briefly speaking (more details follow

274: below), to find $r_\mathrm{min}$ we rewire edge-pairs that lower $r$

275: (and vice versa for $r_\mathrm{max}$). After finding the extremal

276: $r$-values, we split the region between these into $L$ segments. Then

277: we go through the segments and for each segment $n\in [1,L]$ we find the

278: minimal and maximal $C$-values, $C_\mathrm{min}(n)$ and

279: $C_\mathrm{max}(n)$. The region in $C$-space between the lowest

280: $C_\mathrm{min}=\min_{1\leq n\leq L}C_\mathrm{min}(n)$ and highest

281: $C_\mathrm{max}=\max_{1\leq n\leq L}C_\mathrm{max}(n)$ observed

282: clustering coefficient is segmented into $L$ regions. (Note that $C_\mathrm{min}$, without argument,

283: is the global clustering minimum, whereas $C_\mathrm{min}(n)$ is the

284: minimum conditioned on $r$ being in the $n$'th segment.) Thus we

285: (assuming our method works) obtain an $L\times L$ grid of the $r$-$C$

286: space that contains the valid region of $\mathcal{G}(G)$. The method

287: is illustrated in Fig.~\ref{fig:ill}.

288:

289: To find the $\mathcal{G}(G)$ elements of minimal and maximal

290: assortativity is a non-trivial optimization problem. There are

291: deterministic methods that, if they terminate, are guaranteed to give the

292: maximal (or minimal) assortativity~\cite{doyle:big,zj:spectrum}. To avoid

293: such technicalities and to simplify the program, we will use the same

294: kind of optimization algorithm to find $r_\mathrm{max}$ and

295: $r_\mathrm{min}$ as to find $C_\mathrm{min}(n)$ and

296: $C_\mathrm{max}(n)$. In the Appendix we will argue that this method

297: allows us to come as close to the optimal $r$-values as we need.

298: A method we find efficient is to repeat the

299: simple edge-pair swapping procedure (where only changes in the desired

300: direction are accepted) with different random seeds until no lower

301: state is found during a number $\nu_\mathrm{rep.}$ of

302: repetitions~\cite{walker_walstedt}. Each individual run of edge-pair swaps is

303: terminated when no lower state is found for $\nu_\mathrm{same}$

304: swaps. In general, the larger the network is, the more densely

305: distributed are the points close to the border of the valid region. If

306: one is satisfied with finding a value a certain distance from the

307: extrema, then $\nu_\mathrm{rep.}$ and $\nu_\mathrm{same}$ do not need to be

308: increased for larger $N$. To find $C_\mathrm{min}(n)$ and

309: $C_\mathrm{max}(n)$ almost the same procedure is employed. First,

310: edge-pairs are swapped until the desired segment of $r$ is

311: found. Second, unless $r$ is outside the segment $n$ and the move

312: takes the system yet further from the segment, edge-pairs are swapped provided

313: the clustering would decrease (for $C_\mathrm{min}(n)$) or increase

314: (for $C_\mathrm{max}(n)$). When the valid region is traced out and we

315: sample networks of different pixels, we select the pixels

316: randomly. The idea is to sample the space of networks more randomly.

317:

318: To summarize, the algorithm for finding the extremal assortativity

319: values, $r_\mathrm{min}$ and $r_\mathrm{max}$, is:

320: \begin{enumerate}

321: \item \label{step:choose} Choose two undirected edges $(i,j)$ and

322:   $(i',j')$ at random. If the program makes a difference between the

323:   arguments of the edge, the direction of the reading of the edge also

324:   has to be randomized (so $(i,j)$ is read as $(j,i)$ with probability

325:   $1/2$).

326: \item \label{step:check} Check if swapping these edges to $(i,j')$ and

327:   $(i',j)$ would introduce a self-edge or multiple edge in the

328:   network. If so, go to step~\ref{step:choose}.

329: \item \label{step:accept} Let $\Delta r$ be the change in $r$ if the

330:   move in step~\ref{step:choose} is executed. If $r$ is

331:   to be minimized and $\Delta r<0$, then accept the change (vice versa for maximization of $r$).

332: \item \label{step:conclude} If no move has been executed during the last

333:   $\nu_\mathrm{same}$ executions of step~\ref{step:accept}, then take

334:   the current $r$ as $\tilde{r}_\mathrm{min}$ (or $\tilde{r}_\mathrm{max}$).

335: \item \label{step:stop} Repeat from the beginning $\nu_\mathrm{rep.}$

336:   times and return the lowest observed $\tilde{r}_\mathrm{min}$ (or the highest observed $\tilde{r}_\mathrm{max}$) during these iterations.

337: \end{enumerate}
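
The change $\Delta r$ in step~\ref{step:accept} need not be recomputed from scratch. The following short derivation is a sketch based on Eq.~(\ref{eq:assmix}), not a description of the actual implementation. Since the degree sequence is conserved, the sums $\sum_{(i,j)\in E}(k_i+k_j)=\sum_{v\in V} k_v^2$ and $\sum_{(i,j)\in E}(k_i^2+k_j^2)=\sum_{v\in V} k_v^3$ are unchanged by a swap, so only $\langle k_1 k_2\rangle$ changes. Writing $D=2\langle k_1^2+k_2^2\rangle-\langle k_1+k_2\rangle^2$ for the constant denominator (assumed positive), replacing $(i,j)$ and $(i',j')$ by $(i,j')$ and $(i',j)$ gives
\begin{equation*}
  \Delta r = \frac{4}{MD}\left(k_i k_{j'} + k_{i'} k_j - k_i k_j - k_{i'} k_{j'}\right)
  = -\frac{4}{MD}\,(k_i-k_{i'})(k_j-k_{j'}) .
\end{equation*}
A swap thus lowers $r$ exactly when $k_i-k_{i'}$ and $k_j-k_{j'}$ have the same sign, and raises it when they have opposite signs.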

338:

339: Given $r_\mathrm{min}$ and $r_\mathrm{max}$, and a division of the $r$

340: space into $L$ segments of width $(r_\mathrm{max}-r_\mathrm{min})/L$,

341: we trace the boundaries of the valid region as follows:

342: \begin{enumerate}\setcounter{enumi}{5}

343: \item \label{step:choose2} Go through the regions sequentially. Say

344:   the $n$'th region is the interval $[r_n,

345:   r_{n+1})$.

346: \item Perform steps~\ref{step:choose} and \ref{step:check} of the

347:   assortativity optimization algorithm.

348: \item Let $\Delta r$ and $\Delta C$ be the changes in $r$ and $C$ that the

349:   candidate swap of the previous step would cause. Perform the swap if

350:   $r<r_n$ and $\Delta r>0$; or if $r\geq r_{n+1}$ and $\Delta r<0$; or if

351:   $r_n \leq r < r_{n+1}$ and $\Delta C < 0$ (for minimization) or

352:   $\Delta C > 0$ (for maximization). Otherwise, discard the candidate

353:   swap.

354: \item \label{step:conclude2} If, counting from the first time the

355:   system entered the desired $r$-segment, the minimal (maximal)

356:   $C$-value has been repeated $\nu_\mathrm{same}$ times, take this

357:   value as $\tilde{C}_\mathrm{min}(n)$ ($\tilde{C}_\mathrm{max}(n)$).

358: \item \label{step:stop2} Repeat from step~\ref{step:choose2}

359:   $\nu_\mathrm{rep.}$ times. Let the lowest

360:   $\tilde{C}_\mathrm{min}(n)$-values and largest

361:   $\tilde{C}_\mathrm{max}(n)$ during these iterations be

362:   $C_\mathrm{min}(n)$ and $C_\mathrm{max}(n)$.

363: \end{enumerate}

364:

365: Then, when the valid region is mapped out, we split the $C$-range

366: (between $C_\mathrm{min}$ and $C_\mathrm{max}$) into $L$ segments of

367: equal width, thus forming an $L\times L$-grid enclosing the valid

368: region. This grid is sampled as follows:

369: \begin{enumerate}\setcounter{enumi}{10}

370: \item \label{step:perm} Construct a random permutation of the valid

371:   pixels.

372: \item \label{step:pick} Pick the next pixel

373:   $P_n=[r_n,r_{n+1})\times

374:   [C_m,C_{m+1})$ from the index-list of

375:   step~\ref{step:perm}. Denote the center $((r_n+r_{n+1})/2,(C_m+C_{m+1})/2)$ of the pixel

376:   by $(r_{n,0},C_{m,0})$. Let

377:   \begin{equation}\label{eq:dist}

378:     \delta (r,C)=\sqrt{\left(\frac{r - r_{n,0}}{r_\mathrm{max} -

379:           r_\mathrm{min}}\right)^2 + \left(\frac{C -

380:           C_{m,0}}{C_\mathrm{max}-C_\mathrm{min}}\right)^2}

381:   \end{equation}

382:   measure the distance in $r$-$C$ space from the current position

383:   $(r,C)$ to the center of the target pixel.

384: \item Pick edge-pair candidates according to steps~\ref{step:choose}

385:   and \ref{step:check} of the assortativity optimization algorithm.

386: \item Calculate $\Delta (r,C)=\delta(r',C')-\delta(r,C)$ where $r$

387:   and $C$ are the current assortativity and clustering values, and

388:   $r'$ and $C'$ are the values after the pending move has been

389:   performed. If $\Delta (r,C)<0$ perform the move.

390: \item \label{step:rw} If the updated $(r,C)$ belongs to $P_n$, then: First, make

391:   $\nu_\mathrm{rnd.}$ random edge swappings such that $(r,C)$ does not

392:   leave $P_n$. (This is to sample the pixel more uniformly.) Then,

393:   measure network structural quantities of $P_n$, save these values

394:   for statistics, and go to step~\ref{step:pick}.

395: \item If not all pixels have been measured go to step~\ref{step:pick}.

396: \item Go to step~\ref{step:perm} until each pixel has been sampled

397:   $\nu_\mathrm{samp.}$ times.

398: \end{enumerate}
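
For concreteness, the grid indexing used in the steps above can be written out as follows. This is a sketch: the explicit formulas, and the convention that the upper boundary belongs to the last segment, are assumptions made for illustration. With segment boundaries $r_n = r_\mathrm{min} + (n-1)(r_\mathrm{max}-r_\mathrm{min})/L$ and $C_m$ defined analogously, a graph with coordinates $(r,C)$ belongs to the pixel with indices
\begin{equation*}
  n = \min\!\left(L,\ \left\lfloor L\,\frac{r-r_\mathrm{min}}{r_\mathrm{max}-r_\mathrm{min}}\right\rfloor + 1\right), \qquad
  m = \min\!\left(L,\ \left\lfloor L\,\frac{C-C_\mathrm{min}}{C_\mathrm{max}-C_\mathrm{min}}\right\rfloor + 1\right) .
\end{equation*}
In the normalized units of Eq.~(\ref{eq:dist}) each pixel is a square of side $1/L$, so the midpoints of its edges lie at $\delta = 1/(2L)$ and its corners at $\delta=\sqrt{2}/(2L)$ from the center. For $L=50$ these distances are $0.01$ and approximately $0.014$.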

399:

400: The parameter values we use in this study are (unless otherwise stated):

401: $\nu_\mathrm{same}=10^5$, $\nu_\mathrm{rep.}=5$,

402: $\nu_\mathrm{samp.}=100$, $\nu_\mathrm{rnd.}=1000$ and $L=50$. The

403: choice of parameters and further considerations are discussed in the

404: Appendix. Due to the uncertain stopping conditions of

405: steps~\ref{step:conclude}, \ref{step:stop}, \ref{step:conclude2} and

406: \ref{step:stop2} it is hard to derive meaningful bounds on the

407: computational complexity. We note, however, that the optimization is

408: faster in the $r$- than in the $C$-direction; this probably relates to the observation in

409: Fig.~\ref{fig:ill}(b) that the swapping procedure moves faster in the $r$-

410: than in the $C$-direction. (The speed in the $C$-direction is roughly the

411: same per 1000 steps, but the speed in the $r$-direction decreases.)

412:

413: \section{Networks}

414:

415: Our method can be applied to every kind of system that can be modeled

416: as an undirected network.  To limit ourselves, we use four networks

417: from biology as examples in this paper. These networks

418: nonetheless represent fundamentally different systems.

419:

420: \begin{table}

421: \caption{Basic statistical properties of the example networks we

422:   use. The number of vertices $N$, number of edges $M$,

423:   assortativity $r$, clustering coefficient $C$, relative size of the

424:   largest cluster $s$, average distance in the largest cluster

425:   $\langle d\rangle$, the error robustness $R_\mathrm{error}$ and the

426:   attack robustness $R_\mathrm{attack}$.}

427: \begin{ruledtabular}

428:   \begin{tabular}{r|dddd}\label{tab:stat}

429:     & \multicolumn{1}{c}{gene fusion} &

430:     \multicolumn{1}{c}{protein interaction} &

431:     \multicolumn{1}{c}{metabolic} &

432:     \multicolumn{1}{c}{neural}\\\hline

433:     $N$ & 291 & 4168 & 1905 & 280 \\

434:     $M$ & 278 & 7434 & 3526 & 1973 \\

435:     $r$ & -0.36 & -0.13 & -0.10 & -0.069 \\

436:     $C$ & 0.0016 & 0.034 & 0.039 & 0.20 \\

437:     $s$ & 0.38 & 0.94 & 0.87 & 1 \\

438:     $\langle d\rangle$ & 4.2 & 4.8 & 4.5 & 2.6 \\

439:     $R_\mathrm{error}$ & 0.43 & 0.36 & 0.36 & 0.50 \\

440:     $R_\mathrm{attack}$ & 0.012 & 0.048 & 0.046 & 0.38 \\

441:   \end{tabular}

442: \end{ruledtabular}

443: \end{table}

444:

445: \subsection{Gene fusion network}

446:

447: Cancer is a disease that occurs due to changes in the genome. One

448: important process causing such changes is gene fusion---when two

449: genes merge to form a hybrid gene~\cite{mitelman}. In

450: Ref.~\cite{hoglund} the authors construct a network of human genes

451: that have been observed to be fused in the development of tumors in

452: humans. Some genes can fuse with many others but most of the genes have

453: only been observed fusing with one, or a few others. The resulting

454: network structure has a skewed, power-law like degree distribution and

455: is rather fragmented---the largest component spanning only $38\%$ of the

456: vertices. Statistics of this and the other networks are listed in

457: Table~\ref{tab:stat}.

458:

459:

460: \subsection{Metabolic network\label{sect:meta}}

461:

462: A cell can be regarded as a machine driven by biochemical reactions. The

463: possible reactions of the metabolism (the cellular

464: biochemistry except signaling processes) and its environment

465: determine the state of the cell. The metabolism of an organism is a very

466: complex system---so complex that one has to choose between studying

467: a part of it in detail, or the whole with a coarser method. One approach in the latter category is to construct a network connecting the chemical

468: substrates occurring in the same reactions, and employ

469: network analysis to characterize the large-scale structure of the

470: metabolism. The way to construct a biochemical network is not entirely

471: straightforward~\cite{zhao:meta}. Should the substances be

472: linked to each other (in a \textit{substrate graph}), or to the

473: reactions they participate in? If one uses a substrate graph, should

474: the substrates be linked only to products, or to all reactants

475: (i.e.\ in a reaction A + B $\leftrightarrow$ C + D, should A be

476: linked to C and D, or to all three other vertices)? Furthermore, some

477: chemical substances (like H$_2$O, ATP, NADH, and so on) are abundant

478: throughout the cell and seldom pose any restriction on the reaction

479: dynamics. For many purposes, one obtains a more meaningful network by

480: deleting such \textit{currency metabolites}. The biochemical network

481: we use is the human metabolic network of Ref.~\cite{our:curr}. In this

482: network, substrates are linked only to products (A to C and D in the

483: above example). Currency metabolites are identified and deleted

484: according to a self-consistent, graph-theoretic method~\cite{our:curr}.

485:

486:

487: \subsection{Protein interaction}

488:

489: In protein interaction networks the vertices are proteins and two

490: proteins constitute an edge if they can interact physically. Examples

491: of interaction are the ability to form complexes, carrying another

492: protein across a membrane or modifying another protein. We use the

493: (``physical interaction'') data set from Ref.~\cite{hh:pfp}

494: of protein interaction in the budding yeast \textit{S. cerevisiae}.

495:

496:

497: \subsection{Neural network}

498:

499: For the biochemistry of an organism, the network representation is a

500: crude model of the system as a whole (as an alternative to a detailed

501: model of a subsystem). Neural networks are yet more complex. For these

502: the choices are either to make a coarse-grained network

503: representation~\cite{sporns:cortex} or study the full network of a very

504: simple organism. In this work, we take the latter approach and study

505: the neural network of \textit{C. elegans}~\cite{white:420}. In this

506: data set, the strength of the neuronal coupling has been measured, but we make the network undirected by

507: letting an edge represent a non-zero coupling.

508:

509:

510: \begin{figure*}

511:   \centering\resizebox*{0.75\linewidth}{!}{\includegraphics{exa.eps}}

512:   \caption{The valid region demarcated by the $C_\mathrm{min}(r)$- and

513:     $C_\mathrm{max}(r)$-curves (a), and three networks: (b) is the

514:     original gene fusion network, (c) shows a random sample with

515:     $r$-$C$ coordinates close to those of the real network. (d) shows a network

516:     with high clustering and high assortativity. The largest components

517:     of (b), (c) and (d) are indicated with a different color.

518:   }

519:   \label{fig:exp}

520: \end{figure*}

521:

522: \section{Numerical results}

523:

524:

525: In this section we present numerical results for our four

526: network-structural measures over the $\mathcal{G}(G)$ ensembles of the

527: four test graphs. To get a first view, we display the valid region of the

528: gene fusion graph in Fig.~\ref{fig:exp}(a). As seen, the valid region

529: covers only a small part of the theoretical ranges of $r$ ($-1\leq r\leq 1$) and

530: $C$ ($0\leq C < 1$). (Note that only fully

531: connected graphs have $C=1$, and for these $r$ is undefined.) The

532: requirement that the graph should be simple (no multiple edges or

533: self-edges) puts hard constraints on the actual $r$-values that can

534: occur (cf.\ Ref.~\cite{maslov:inet}). Fig.~\ref{fig:exp}(a) shows that, considering the entire $r$-$C$ plane, such constraints are even harder.

535: The general shape of the valid region is consistent with

536: the observations that the simple-graph constraint induces a positive

537: correlation between $r$ and $C$~\cite{maslov:inet,mejn:why}.

538:

539: In Fig.~\ref{fig:exp}(b), (c) and (d) we show three example networks

540: of $\mathcal{G}(G)$ (where $G$ is the gene fusion

541: network). Fig.~\ref{fig:exp}(b) displays the relatively fragmented

542: real network. Fig.~\ref{fig:exp}(c) is a random network $G'$ with

543: almost the same $r$-$C$ coordinates as the real network

544: ($\delta(G,G')\approx 0.0026$). Maybe the biggest visible difference

545: between $G$ and $G'$ is the larger size of the largest component of

546: $G'$. Is it true that the gene fusion network is unusually fragmented,

547: given the degree sequence and $r$-$C$ coordinates? If so, there might

548: be an evolutionary pressure for gene fusion networks to be fragmented. (This

549: will be discussed further in Sect.~\ref{sect:size}.)

550: Fig.~\ref{fig:exp}(d) shows, as a contrast, a network far away from

551: $G$ and $G'$. The network has a well-defined core where high-degree

552: vertices connect to each other. There are also a number of peripheral

553: triangles, which indicates that the network evolves toward a maximal

554: $C$-value, given its assortativity.

555:

556: \begin{figure*}

557:   \centering\resizebox*{0.75\linewidth}{!}{\includegraphics{ngc.eps}}

558:   \caption{The relative size of the largest component $s$ as a function of

559:     $r$ and $C$. The networks are (a) the network of gene fusions in

560:     tumors in humans, (b) protein interaction network of

561:     \textit{S. cerevisiae}, (c) human metabolic network and (d) the

562:     \textit{C. elegans} neural network. The x-like symbols of the main

563:     figures and the diamond symbols of the color-bars indicate

564:     the values of the real networks. The plus-like symbol

565:     indicates the average $(r,C)$-value of the $\mathcal{G}(G)$

566:     ensemble.}

567:   \label{fig:ngc}

568: \end{figure*}

569:

570: \subsection{Location in $r$-$C$ space and size of largest

571:   component\label{sect:size}}

572:

573:

574: In Fig.~\ref{fig:ngc} we plot the relative size of the largest

575: component of the four test networks. We also display the locations of

576: the actual networks in the $r$-$C$ plane, and the

577: $\mathcal{G}(G)$-averages. (The $\mathcal{G}(G)$ averages are obtained

578: from a rewiring sampling of $\mathcal{G}(G)$, with

579: step~\ref{step:check} of the algorithm as the only constraint.) We see

580: that the $C$-value of the gene fusion graph lies close to the

581: $C_\mathrm{min}(r)$-boundary of its valid region. $C$ averaged over

582: the whole $\mathcal{G}(G)$ is about three times larger ($\langle

583: C\rangle_{\mathcal{G}(G)}=0.0061\pm 0.0001$) than the observed value

584: ($C=0.0017$). Furthermore, we see that the assortativity is lower than

585: the $\mathcal{G}(G)$ average. This kind of analysis has been used by

586: many authors (following Ref.~\cite{maslov:pro}). The interpretation is

587: usually that the network is, effectively, disassortative and clustered

588: (i.e., $r<\langle r\rangle_{\mathcal{G}(G)}$ and $C>\langle

589: C\rangle_{\mathcal{G}(G)}$). However, looking at the entire valid

590: region, we can get another perspective: If high clustering really

591: had been an important goal for the network to attain (given the

592: degree sequence), there would be ample room for improvement. For the

593: assortativity, on the other hand, the observed network is rather close to the

594: minimum. This might be telling us that assortativity is a more

595: important factor than clustering in the evolution of the gene fusion

596: networks. The protein interaction network of Fig.~\ref{fig:ngc}(b) is

597: located quite far from the ensemble average---the assortativity is much

598: lower than the $\mathcal{G}(G)$-average, and given that assortativity,

599: the clustering is maximal. Also

600: the metabolic (Fig.~\ref{fig:ngc}(c)) and neural

601: (Fig.~\ref{fig:ngc}(d)) networks are more clustered than the average,

602: but here the assortativity is slightly larger than the

603: $\mathcal{G}(G)$ average. From Fig.~\ref{fig:ngc} we also note that

604: the density of states is very inhomogeneously distributed---the

605: average $(r,C)$ is close to $C=0$ and (except for the neural network)

606: left of the middle of the assortativity spectrum. The shapes of the

607: valid regions are rather similar, with an exception for the broader

608: region of the neural network. This can be related to the more

609: narrow degree sequence of the neural network~\cite{amaral:classes}. We

610: have established a correlation between $r$ and

611: $C$. Ref.~\cite{mejn:why} argues that such correlation occurs in

612: social networks because of their modularity (or ``community

613: structure'' as the authors call it). However, our large-$r$ networks

614: have no explicit bias towards high modularity, which leads us to

615: conjecture that the correlation between $r$ and $C$, or more

616: fundamentally the sum $\sum_{(i,j)\in E} k_ik_j$ (which, given a

617: degree sequence, is the only factor of Eq.~\ref{eq:assmix} that can

618: vary) is a more general phenomenon. Since $r$ is normalized by,

619: essentially, the variance of the degree, it follows that the valid

620: region for $\mathcal{G}(G)$ with more narrow degree sequence will

621: appear stretched (larger).

622:

623: Turning to the average size of the largest component, we observe that the

624: gene fusion network is indeed more fragmented than the average network of the

625: same $(r,C)$-coordinates (as anticipated from comparing Figs.~\ref{fig:exp}(b) and (c)). The protein interaction and neural networks

626: have no particular bias in this respect, whereas the metabolic network

627: is more fragmented than expected. The relatively low $s$ of the

628: metabolic network can be attributed to the ``modularity'' of such

629: networks~\cite{zhao:meta,our:curr}. Such modules are subgraphs that

630: are densely connected within, and sparsely inter-connected. Sometimes

631: they are even disconnected from the largest component (which explains

632: the lower $s$). In general, $s$ decreases with assortativity. This

633: is natural---in more assortative networks high degree vertices are

634: connected to each other, forming a highly connected core and a

635: periphery too sparse to be connected (viz.\ Fig.~\ref{fig:exp}(c) and

636: (d)). For the denser networks (the protein interaction, metabolic and

637: neural networks) $s$ increases with $C$ (for a fixed $r$). For the

638: sparser gene-fusion network $s$ has a peak at

639: intermediate $C$. We do not speculate further about the combinatorial

640: causes of these dependencies, but we note (comparing

641: e.g.\ Figs.~\ref{fig:ngc}(a) and (b)) that even though the shapes of

642: the valid regions are similar, the $s$ behavior can be qualitatively

643: different.

644:

645:

646: \begin{figure*}

647:   \centering\resizebox*{0.75\linewidth}{!}{\includegraphics{length.eps}}

648:   \caption{The average distance within the largest component $\langle

649:     d\rangle$ as a function of $r$ and $C$. The panes and symbols

650:     correspond to those of Fig.~\ref{fig:ngc}.}

651:   \label{fig:length}

652: \end{figure*}

653:

654: \subsection{Distances in the largest component}

655:

656: In Fig.~\ref{fig:length} we display the average distance in the

657: largest component. As mentioned, measuring the distance can give complementary

658: information to the $s(r,C)$ graphs of Fig.~\ref{fig:ngc}---while $s$

659: tells us how much of the network can be reached, $\langle

660: d\rangle$ tells us how fast that can happen. For all networks the big picture is that large connected

661: components have large average distances. This is expected from most

662: network models. There is, however, more information than this in

663: Fig.~\ref{fig:length}: for components of the same size, the average

664: distance is increasing with both $r$ and $C$. That $\langle d\rangle$

665: should increase with $C$ seems quite natural---if one of a triangle's

666: edges is rewired to connect two distant vertices, the distances in the

667: surroundings of the triangle would increase by one, but this would be

668: more than compensated by the connection of the two previously distant

669: areas. Disassortative networks typically lack a well-defined core.

670: Such cores are known to keep the average distance of general power-law networks

671: short~\cite{chung_lu:pnas}. Thus one would expect an increase of $r$

672: to cause a smaller $\langle d\rangle$, but apparently the

673: clustering-related length increase outweighs this effect.

674: In contrast to the relative size, the average distances of the real

675: networks are close to the $\mathcal{G}(G)$-averages at the same

676: $r$-$C$ coordinates. For the gene fusion network (with a relatively

677: small largest component), this means the distances are rather

678: large.

679:

680:

681: \begin{figure*}

682:   \centering\resizebox*{0.75\linewidth}{!}{\includegraphics{error.eps}}

683:   \caption{The error robustness $R_\mathrm{error}$ as a function of

684:     $r$ and $C$. The panes and symbols correspond to those of

685:     Fig.~\ref{fig:ngc}.}

686:   \label{fig:error}

687: \end{figure*}

688:

689: \subsection{Error robustness}

690:

691: Next, we turn to the error robustness problem. As seen in

692: Fig.~\ref{fig:error} the gene fusion network (Fig.~\ref{fig:error}(a)),

693: once again, has a qualitatively different behavior than the other

694: three networks (Fig.~\ref{fig:error}(b), (c) and (d)). While the gene

695: fusion network is most robust for high $r$- and $C$-values, the other

696: networks are most robust for low $r$. A tentative explanation can be found

697: in the chain-like subgraphs extending from the largest component in

698: a large-$r$ network (cf.\ Fig.~\ref{fig:exp})---with a random deletion

699: of vertices, these subgraphs are likely to be disconnected from the

700: core rather soon (whereas in a disassortative network alternative paths may

701: still exist). If the deletion-robust core is then less than half of

702: the original component size, it follows that it may soon be

703: isolated. The sparsity of the gene fusion network makes the low-$r$

704: $\mathcal{G}(G)$-graphs much like trees (i.e., having few cycles), and

705: since cycles provide redundant paths that can make a network robust,

706: it follows that these graphs are fragile. For a fixed

707: $r$, $R_\mathrm{error}$ is an increasing function of $C$ for the three

708: largest networks. We believe this is an effect of the local path

709: redundancy induced by triangles---if one vertex of a triangle is

710: deleted, the other two are still connected.

711:

712: The $R_\mathrm{error}$-values for the real networks are always

713: markedly higher than the $\mathcal{G}(G)$-averages for the same

714: $(r,C)$-coordinates. Networks with highly skewed degree distributions

715: (the gene fusion, protein interaction and metabolic networks) are

716: known to be robust to errors by virtue of degree distribution

717: alone~\cite{alb:attack}. Now Fig.~\ref{fig:error} tells us that all

718: these networks have a yet higher error tolerance, which is an indication

719: that error robustness is an important factor in the evolution of these

720: networks.

721:

722:

723: \begin{figure*}

724:   \centering\resizebox*{0.75\linewidth}{!}{\includegraphics{attack.eps}}

725:   \caption{The attack robustness $R_\mathrm{attack}$ as a function of

726:     $r$ and $C$. The panes and symbols correspond to those of

727:     Fig.~\ref{fig:ngc}.}

728:   \label{fig:attack}

729: \end{figure*}

730:

731: \subsection{Attack robustness}

732:

733: The final quantity we measure is the attack robustness (see

734: Fig.~\ref{fig:attack}). $R_\mathrm{attack}$'s functional dependence

735: on $r$ and $C$ is quite different from that of $R_\mathrm{error}$. The

736: gene fusion $\mathcal{G}(G)$ has the highest attack robustness at high

737: $r$- and low $C$-values. The other networks have higher robustness

738: values for high assortativity, but no clear tendency in the

739: $C$-direction. The attack mechanism we study targets the high degree

740: vertices. Having all high degree vertices connected to each other is

741: probably the only way to keep the network from instantaneous

742: fragmentation. The observed $r$-dependence is thus rather

743: expected. The real-world networks all have $R_\mathrm{attack}$-values

744: of the same order of magnitude as the average values for the

745: $\mathcal{G}(G)$ networks of the same location in $r$-$C$ space.

746: We note that for studying the attack problem of metabolic networks, the (less common) enzyme-centric graph representation is more appropriate (see Sect.~\ref{sect:meta}). The reason is that it is much easier to suppress an enzyme than to remove the substrates.

747:

748:

749: \begin{table}

750: \caption{Summary of the network structural measures of the real world

751:   networks relative to the average values of the $\mathcal{G}(G)$ graphs within a

752:   distance $\delta < 0.02$ from the real network. ``$<$'' indicates that

753:   the real network has a lower value than the corresponding

754:   $\mathcal{G}(G)$-value. All results are significant with p-values

755:   $<0.01$, except the $s$-value of the neural network, which has a

756:   p-value of $\sim 0.05$.}

757: \begin{ruledtabular}

758:   \begin{tabular}{r|dddd}\label{tab:pval}

759:     & \multicolumn{1}{c}{gene fusion} &

760:     \multicolumn{1}{c}{protein interaction} &

761:     \multicolumn{1}{c}{metabolic} &

762:     \multicolumn{1}{c}{neural}\\\hline

763:     $s$ & < & < & < & > \\

764:     $\langle d\rangle$ & < & < & < & < \\

765:     $R_\mathrm{error}$ & > & > & > & > \\

766:     $R_\mathrm{attack}$ & > & > & > & > \\

767:   \end{tabular}

768: \end{ruledtabular}

769: \end{table}

770:

771: \subsection{Comparison between the graphs}

772:

773: Even though all our example networks are constructed from biological

774: data, they represent fundamentally different systems---the neural

775: network is spatial by nature, the protein interaction and (even more

776: so) the metabolic networks are the background topology for an active

777: dynamic system, whereas the gene fusion network is a representation

778: of possible but undesired events. The protein interaction, metabolic

779: and neural networks have one thing in common---the organism needs them

780: to be robust to errors (caused by injuries, mutations, disease

781: etc.)~\cite{wagner:robu}. As mentioned above and summarized in

782: Table~\ref{tab:pval} the error robustness is indeed higher for the

783: real networks than the $\mathcal{G}(G)$-ensemble at the same

784: $(r,C)$-coordinates. As mentioned above, the attack robustness of the

785: real network is of the same order as the $\mathcal{G}(G)$-average at

786: the same $(r,C)$-coordinate, but actually there is a significant

787: tendency for these networks also to be more robust to

788: attacks. Furthermore, the distances in the largest component, and the

789: relative sizes $s$ are (with the neural network $s$-value as the only

790: exception) smaller in the real than the $\mathcal{G}(G)$ networks.

791:

792: Despite these similarities between the statistics of the

793: real-world networks, the $r$-$C$ spaces of the different degree

794: sequences have qualitatively different network structure. In particular,

795: the gene fusion network behaves almost the opposite of the other

796: networks (at least for $s$, $\langle d\rangle$ and

797: $R_\mathrm{error}$). The source of this opposite behavior (as we

798: discuss above) is probably that it is much sparser than the other

799: networks. The neural network is the densest network and the only one

800: that does not have a power-law like degree distribution.

801:

802:

803: \section{Discussion}

804:

805: Many complex network studies use the ensemble $\mathcal{G}(G)$ of

806: graphs with the same degree sequence as the subject graph $G$ as a

807: null model. In contrast to a generative network model, with a few

808: degrees of freedom that has to be fitted approximately, such an

809: ensemble has $O(N)$ degrees of freedom that can be matched exactly

810: with the values of $G$. We argue that $\mathcal{G}(G)$ is more than a

811: null model---by resolving the graphs of $\mathcal{G}(G)$ in a space

812: defined by some network-structural measures, one can get a picture of

813: the opportunities and limits there are (or have been) in the evolution

814: of $G$. In this work we map out $\mathcal{G}(G)$ in the

815: two-dimensional space defined by the clustering coefficient and the

816: assortativity. Then we measure other network structural quantities

817: throughout this space. One formal way to see our method is that we

818: resolve $\mathcal{G}(G)$ in the (high dimensional) space of all

819: sensible network measures. Then, for simplicity, we project to a few

820: dimensions. (The case of projection to one dimension has been studied

821: in a less formalized way earlier---projection to

822: assortativity~\cite{zj:spectrum} or a ``hierarchy''

823: measure~\cite{rosv:mountain}.) An interesting open question is to find

824: the principal components of the space of all sensible network

825: measures. Using four example networks from biology, we measure

826: average values of four network-structural quantities over the $r$-$C$

827: space and compare these with the values of the real networks.

828:

829: The functional characteristics of the $r$-$C$ spaces vary much

830: between the four example networks. For example, the

831: \textit{C. elegans} neural network covers a much larger area of the

832: $r$-$C$ space, something that probably relates to its more narrow

833: degree distribution. The human gene fusion network, on the other hand,

834: has a broad degree distribution similar to the \textit{S. cerevisiae}

835: protein interaction and human metabolic networks; still, the structural

836: dependency on $r$ and $C$ is very different for the gene fusion

837: network compared to the others. We argue that this difference stems

838: from the sparseness of the gene fusion network. To achieve a

839: comprehensive understanding about how the network structure throughout

840: the $r$-$C$ space depends on the degree sequence, one would need a

841: systematic investigation of different artificial degree sequences. In

842: this paper, we do not pursue this goal beyond the analysis of the four

843: biological data sets. The position of the real networks in the valid

844: region of the $r$-$C$ space adds some further information. For

845: example, it may have been the case that networks with lower

846: assortativity have been favored during the evolution of the gene fusion

847: network. Clustering, on the other hand, has probably not put any

848: constraint on the network evolution. Furthermore we compare the

849: network structure of the real networks with the average values of

850: networks in $\mathcal{G}(G)$ that are close to the $(r,C)$-coordinates

851: of the real network. From this analysis, we conclude that all

852: our four example networks are more robust to both random errors and

853: targeted attacks than what can be expected from a random network

854: constrained to the same degree distribution, assortativity and clustering

855: coefficient. For all networks, except maybe the gene fusion network,

856: this is in line with robustness being an important factor in the

857: network evolution. Note that in this work we assume the subject

858: network to be accurate. To get more valid error estimates one would

859: need to take the accuracy of the edges into account.

860:

861: The analysis scheme presented in this paper can be further extended

862: and analyzed. As mentioned, it would be interesting to make a quantitative

863: evaluation of the network-structural spaces, and how they depend on

864: the degree sequence. One can also try, for time-resolved data sets, to

865: incorporate dynamic information in the analysis by monitoring the

866: network-evolutionary trajectory in the $r$-$C$ space.

867:

868:

869: \acknowledgements{

870:   The authors thank Mikael Huss and Martin Rosvall for helpful

871:   suggestions and comments.

872:   PH acknowledges financial support from the Wenner-Gren

873:   Foundations and the National Science Foundation (grant

874:   CCR--0331580).

875: }

876:

877: \appendix

878:

879:  \begin{figure}

880:   \centering\resizebox*{\linewidth}{!}{\includegraphics{conv.eps}}

881:   \caption{Convergence of the optimization algorithm. (a) shows the

882:     average maximal assortativity $\langle

883:     r_\mathrm{max}\rangle$ with $\nu_\mathrm{rep.}=1$. The horizontal

884:     line represents the result of the maximization algorithm of

885:     Ref.~\cite{doyle:big}. (b) shows the further improvement by

886:     finding the maximum over many independent runs (for

887:     $\nu_\mathrm{same}=10000$). The vertical bars indicate the

888:     standard deviation of the observed maxima.}

889:   \label{fig:conv}

890: \end{figure}

891:

892: \section{Convergence and sampling uniformity}

893:

894: In this Appendix, we address some technical issues of our method

895: related to the convergence of our optimization algorithm and uniformity of

896: the sampling. We will also motivate our choice of parameters.

897:

898:

899: \subsection{Assortativity and clustering extremes}

900:

901: To find the extremal assortativity values we use the edge-swapping

902: algorithm described in Sect.~\ref{sect:analysis}. To find

903: $r_\mathrm{min}$ we start from a random member of $\mathcal{G}(G)$ and

904: swap random edge-pairs (keeping the graph simple at all times) that

905: lower $r$. When no graph of lower $r$ has been found for

906: $\nu_\mathrm{same}$ time steps, we break the iteration. To avoid the

907: effect of being trapped in local minima, this process is repeated

908: $\nu_\mathrm{rep.}$ times. The main motivation for using this method

909: is that it is at heart the same scheme as for obtaining the extremal

910: clustering values and sampling the valid region (and thus we can

911: re-use the same code for many steps of the calculations). In this

912: section, we argue that the optimization performance of this method is sufficiently good for our purpose.

913:

914: There is a deterministic method to maximize the assortativity that is,

915: if it exits properly, guaranteed to find

916: $r_\mathrm{max}$~\cite{doyle:big}. The method works as

917: follows: First all vertex-pairs $(i,j)$ are ranked in decreasing order of

918: the product of their degrees, $k_ik_j$. Then the edges are added

919: in order of this list unless the degree of one of the vertices already

920: is fulfilled. There are some other technicalities from the additional

921: constraint (of the authors) that the network should be connected. Of

922: our networks, only the neural network has such an evolutionary

923: constraint, so we do not impose it.
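
As a purely illustrative sketch of the ranking step (the toy degree sequence is our own choice and is not taken from any of the data sets), consider six vertices with degrees $(k_1,\dots,k_6)=(3,3,2,2,1,1)$. The pair $(1,2)$ has the largest product, $k_1k_2=9$, and is added first. The four pairs $(1,3)$, $(1,4)$, $(2,3)$ and $(2,4)$ all have $k_ik_j=6$ and can all be added, after which vertices $1$ to $4$ have reached their degrees. The pair $(3,4)$ and all pairs joining one of the vertices $1$ to $4$ with vertex $5$ or $6$ are therefore skipped, and the last edge added is $(5,6)$ with $k_5k_6=1$. In this small case the order among pairs of equal product does not affect the outcome, and the resulting sum is
\begin{equation*}
  \sum_{(i,j)\in E} k_ik_j = 9 + 4\times 6 + 1 = 34 .
\end{equation*}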

924:

925: In Fig.~\ref{fig:conv} we display the parameter dependence of the

926: convergence for the gene fusion network. The horizontal line is the

927: theoretical maximum obtained by the algorithm of

928: Ref.~\cite{doyle:big}. When $\nu_\mathrm{same} = 10000$ we obtain an

929: average maximal assortativity within $0.001$ of the

930: theoretical maximum (Fig.~\ref{fig:conv}(a)). By increasing

931: $\nu_\mathrm{rep.}$ the accuracy can be increased further

932: (Fig.~\ref{fig:conv}(b)). The lattice spacing we use is $0.005\lesssim

933: \Delta r \lesssim 0.02$, so we deem a precision of $0.001$ sufficient. The

934: gene fusion network is our smallest network but the other networks are

935: not harder to converge. When one edge-pair is swapped so that $r$

936: decreases, the only term of Eq.~\ref{eq:assmix} that changes is

937: $\langle k_1\, k_2\rangle$. The potential change of the sum

938: $\sum_{(i,j)\in E} k_ik_j$, in the calculation of $\langle k_1\, k_2\rangle$

939: (close to the extrema) is of the order of the typical degree values

940: of the network. These values grow slower than the network itself,

941: which means that a larger network can be closer in $r$, but further

942: away in number of edge swaps to reach the global optimum, than a

943: smaller network. Some authors~\cite{doyle:big} use $\sum_{(i,j)\in E}

944: k_ik_j$ to measure the degree correlations, but since we strive for a

945: macroscopic level of description (consistent in the large-$N$ limit),

946: $r$ is a more appropriate quantity for the present work.
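
To make the effect of a single swap explicit, one can write $S=\sum_{(u,v)\in E} k_uk_v$ (our shorthand, introduced here only for illustration) and consider a swap that replaces the edges $(i,j)$ and $(k,l)$ by $(i,l)$ and $(k,j)$. Since the swap preserves all degrees, only the four affected products change, and
\begin{equation*}
  \Delta S = k_ik_l + k_kk_j - k_ik_j - k_kk_l = -(k_i-k_k)(k_j-k_l) .
\end{equation*}
Such a swap thus lowers $S$ when $k_i-k_k$ and $k_j-k_l$ have the same sign, raises it when the signs differ, and leaves it unchanged when either difference is zero.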

947:

948: The optimization of the clustering to find the minima (maxima) of

949: the segments of assortativity space follows the same pattern as the

950: method to find the minimal (maximal) $r$. Changes of the parameters

951: ($\nu_\mathrm{same}$ and $\nu_\mathrm{rep.}$) have the same effect as

952: in Fig.~\ref{fig:conv}, and the same values seem sufficient.

953:

954:  \begin{figure}

955:   \centering\resizebox*{0.95 \linewidth}{!}{\includegraphics{uni.eps}}

956:   \caption{Histograms of $s$ for the discussion of sampling

957:     uniformity. All the histograms are from the gene fusion network and a

958:     pixel centered around $r=-0.1$, $C=0.1$ (the dimensions of a pixel

959:     are $\Delta r = 0.013$, $\Delta C = 0.0096$). The error bars

960:     represent standard errors. Lines are guides for the eyes. (a)

961:     shows the histograms with different numbers of random edge-pair

962:     swappings $\nu_\mathrm{rnd.}$ within the pixel before the

963:     measurements of quantities. (b) illustrates the location of the

964:     starting point pixels used in panels (c) and (d). (c) compares

965:     histograms for swapping processes starting at W and S with the

966:     regular algorithm. (d) compares the average histogram of walks starting in

967:     the four peripheral points of (b) with the result of the regular

968:     algorithm. In panels (c) and (d) $\nu_\mathrm{rnd.}=1000$. The

969:     whole range of the histograms is not shown, which is why the areas

970:     under the curves appear different.}

971:   \label{fig:uni}

972: \end{figure}

973:

974: \subsection{Sampling uniformity}

975:

976: The other technical issue we address in this Appendix is the

977: uniformity of our sampling procedure. Ideally we would like all

978: unique (i.e., non-isomorphic) members of $\mathcal{G}(G)$ to be

979: sampled with the same probability. The most important observation is

980: trivial---by edge-pair swapping one can go from one member of

981: $\mathcal{G}(G)$ to any other, and thus all members of the ensemble

982: will contribute to the averages. A much harder question is whether or

983: not every member of $\mathcal{G}(G)$ is sampled with uniform

984: probability. In this section, we will argue that our algorithm does a

985: reasonably good job in the sense that there are no inconsistencies and

986: parameter values are appropriate.

987:

988: When the target pixel is found (step~\ref{step:rw} of the algorithm)

989: we perform $\nu_\mathrm{rnd.}$ additional random edge-pair swaps. The idea

990: is to sample the $\mathcal{G}(G)$-members of the pixel more

991: uniformly (and indeed to be able to reach into the interior of the

992: pixel). In Fig.~\ref{fig:uni}(a) we illustrate the effect of these random

993: moves. We plot a normalized histogram of the relative largest cluster

994: size $s$ for $0$, $100$ and $10000$ random moves. We see that these

995: moves do make a difference (the $\nu_\mathrm{rnd.}=0$ histogram is different

996: from the $\nu_\mathrm{rnd.}=100$ one), but it does not matter if

997: $\nu_\mathrm{rnd.}=100$ or $\nu_\mathrm{rnd.}=10000$. The same

998: situation is observed for other pixels, networks and

999: quantities. Therefore, we use $\nu_\mathrm{rnd.}=1000$ in this work.

1000:

1001: Next, we will illustrate the use of the randomly permuted list

1002: in the sampling of the pixels (steps~\ref{step:perm} and

1003: \ref{step:pick} of the algorithm). The motivation for this procedure

1004: is that the network structure can depend on the direction from which

1005: the search arrives at the pixel. In Fig.~\ref{fig:uni}(b) we

1006: illustrate the test procedure---we sample separate histograms from four

1007: starting points in the four cardinal directions with respect to the central

1008: $(r,C)=(-0.1,0.1)$ pixel. In Fig.~\ref{fig:uni}(c) we

1009: see that the histograms from the W and S pixels are different. There

1010: appear to be two regions of $\mathcal{G}(G)$ contributing to these

1011: histograms (one with $s\approx 0.65$, one with $s\approx

1012: 0.75$). Searches starting from W seem to arrive at the $s\approx 0.75$

1013: region more frequently, and searches starting at S end up around

1014: $s\approx 0.65$ more frequently. The curve of the actual algorithm

1015: weighs the two peaks more equally. The curves from N and E coincide

1016: almost completely with the curve for the regular algorithm (and are

1017: therefore omitted for clarity). The impression we get is that the

1018: search from one direction can induce a bias in the network structure

1019: (symbolically speaking, the graphs have a preference for ending up in

1020: a certain region of $\mathcal{G}(G)$). However, from other directions,

1021: or by the random sampling of pixels (step~\ref{step:perm}), the bias is

1022: reduced. This picture is further strengthened in Fig.~\ref{fig:uni}(d)

1023: where we show that the average of the histograms from the four

1024: starting points overlaps with the histogram of the regular

1025: algorithm.

1026:

1027: \begin{thebibliography}{10}

1028:

1029: \bibitem{ba:rev}

1030: R.~Albert and A.-L. Barab\'{a}si.

1031: \newblock Statistical mechanics of complex networks.

1032: \newblock {\em Rev. Mod. Phys.}, 74:47--98, 2002.

1033:

1034: \bibitem{alb:attack}

1035: R.~Albert, H.~Jeong, and A.-L. Barab\'{a}si.

1036: \newblock Attack and error tolerance of complex networks.

1037: \newblock {\em Nature}, 406:378--382, 2000.

1038:

1039: \bibitem{amaral:classes}

1040: L.~A.~N. Amaral, A.~Scala, M.~Barth\'{e}l\'{e}my, and H.~E. Stanley.

1041: \newblock Classes of small-world networks.

1042: \newblock {\em Proc. Natl. Acad. Sci. USA}, 97:11149--11152, 2000.

1043:

1044: \bibitem{rosv:mountain}

1045: J.~B. Axelsen, S.~Bernhardsson, M.~Rosvall, K.~Sneppen, and A.~Trusina.

1046: \newblock Degree landscapes in scale-free networks.

1047: \newblock {\em Phys. Rev. E}, 74:036119, 2006.

1048:

1049: \bibitem{ba:model}

1050: A.-L. Barab\'{a}si and R.~Albert.

1051: \newblock Emergence of scaling in random networks.

1052: \newblock {\em Science}, 286:509--512, 1999.

1053:

1054: \bibitem{bw:sw}

1055: A.~Barrat and M.~Weigt.

1056: \newblock On the properties of small-world network models.

1057: \newblock {\em Eur. Phys. J. B}, 13:547--560, 2000.

1058:

1059: \bibitem{chung_lu:pnas}

1060: F.~Chung and L.~Lu.

1061: \newblock The average distances in random graphs with given expected degrees.

1062: \newblock {\em Proc. Natl. Acad. Sci. USA}, 99:15879--15882, 2002.

1063:

1064: \bibitem{doromen:book}

1065: S.~N. Dorogovtsev and J.~F.~F. Mendes.

1066: \newblock {\em Evolution of Networks: From Biological Nets to the Internet and

1067:   WWW}.

1068: \newblock Oxford University Press, Oxford, 2003.

1069:

1070: \bibitem{er:on}

1071: P.~Erd\H{o}s and A.~R\'{e}nyi.

1072: \newblock On random graphs {I}.

1073: \newblock {\em Publ. Math. Debrecen}, 6:290--297, 1959.

1074:

1075: \bibitem{gale:rew}

1076: D.~Gale.

1077: \newblock A theorem on flows in networks.

1078: \newblock {\em Pacific J. Math.}, 7:1073--1082, 1957.

1079:

1080: \bibitem{hoglund}

1081: M.~H\"{o}glund, A.~Frigyesi, and F.~Mitelman.

1082: \newblock A gene fusion network in human neoplasia.

1083: \newblock {\em Oncogene}, 25:2674--2678, 2006.

1084:

1085: \bibitem{holl:72}

1086: P.~W. Holland and S.~Leinhardt.

1087: \newblock Some evidence on the transitivity of positive interpersonal

1088:   sentiment.

1089: \newblock {\em Am. J. Sociol.}, 77:1205--1209, 1972.

1090:

1091: \bibitem{hh:pfp}

1092: P.~Holme and M.~Huss.

1093: \newblock Role-similarity based functional prediction in networked systems:

1094:   application to the yeast proteome.

1095: \newblock {\em J. Roy. Soc. Interface}, 2:327--333, 2005.

1096:

1097: \bibitem{our:attack}

1098: P.~Holme, B.~J. Kim, C.~N. Yoon, and S.~K. Han.

1099: \newblock Attack vulnerability of complex networks.

1100: \newblock {\em Phys. Rev. E}, 65:056109, 2002.

1101:

1102: \bibitem{our:curr}

1103: M.~Huss and P.~Holme.

1104: \newblock Currency and commodity metabolites: Their identification and relation

1105:   to the modularity of metabolic networks.

1106: \newblock e-print q-bio/0603038.

1107:

1108: \bibitem{katz:cug}

1109: L.~Katz and J.~H. Powell.

1110: \newblock Probability distributions of random variables associated with a

1111:   structure of the sample space of sociometric investigations.

1112: \newblock {\em Ann. Math. Stat.}, 28:442--448, 1957.

1113:

1114: \bibitem{latora:eff}

1115: V.~Latora and M.~Marchiori.

1116: \newblock Efficient behavior of small-world networks.

1117: \newblock {\em Phys. Rev. Lett.}, 87:198701, 2001.

1118:

1119: \bibitem{doyle:big}

1120: L.~Li, D.~Alderson, J.~C. Doyle, and W.~Willinger.

1121: \newblock Towards a theory of scale-free graphs: Definition, properties, and

1122:   implications.

1123: \newblock {\em Internet Mathematics}, 2:431--523, 2005.

1124:

1125: \bibitem{maslov:pro}

1126: S.~Maslov and K.~Sneppen.

1127: \newblock Specificity and stability in topology of protein networks.

1128: \newblock {\em Science}, 296:910--913, 2002.

1129:

1130: \bibitem{maslov:inet}

1131: S.~Maslov, K.~Sneppen, and A.~Zaliznyak.

1132: \newblock Detection of topological patterns in complex networks: Correlation

1133:   profile of the {I}nternet.

1134: \newblock {\em Physica A}, 333:529--540, 2004.

1135:

1136: \bibitem{mitelman}

1137: F.~Mitelman, B.~Johansson, and F.~Mertens.

1138: \newblock Fusion genes and rearranged genes as a linear function of chromosome

1139:   aberrations in cancer.

1140: \newblock {\em Nature Genetics}, 36:331--334, 2004.

1141:

1142: \bibitem{motter:cascade}

1143: A.~E. Motter.

1144: \newblock Cascade control and defense in complex networks.

1145: \newblock {\em Phys. Rev. Lett.}, 93:098701, 2004.

1146:

1147: \bibitem{mejn:assmix}

1148: M.~E.~J. Newman.

1149: \newblock Assortative mixing in networks.

1150: \newblock {\em Phys. Rev. Lett.}, 89:208701, 2002.

1151:

1152: \bibitem{mejn:rev}

1153: M.~E.~J. Newman.

1154: \newblock The structure and function of complex networks.

1155: \newblock {\em SIAM Review}, 45:167--256, 2003.

1156:

1157: \bibitem{mejn:why}

1158: M.~E.~J. Newman and J.~Park.

1159: \newblock Why social networks are different from other types of networks.

1160: \newblock {\em Phys. Rev. E}, 68:036122, 2003.

1161:

1162: \bibitem{alon}

1163: S.~Shen-Orr, R.~Milo, S.~Mangan, and U.~Alon.

1164: \newblock Network motifs in the transcriptional regulation network of

1165:   {E}scherichia coli.

1166: \newblock {\em Nature Genetics}, 31:64--68, 2002.

1167:

1168: \bibitem{sporns:cortex}

1169: O.~Sporns, G.~Tononi, and G.~M. Edelman.

1170: \newblock Theoretical neuroanatomy: Relating anatomical and functional

1171:   connectivity in graphs and cortical connection matrices.

1172: \newblock {\em Cerebral Cortex}, 10:127--141, 2000.

1173:

1174: \bibitem{wagner:robu}

1175: A.~Wagner.

1176: \newblock {\em Robustness and Evolvability in Living Systems}.

1177: \newblock Princeton University Press, Princeton NJ, 2005.

1178:

1179: \bibitem{walker_walstedt}

1180: L.~R. Walker and R.~E. Walstedt.

1181: \newblock Computer model of metallic spin-glasses.

1182: \newblock {\em Phys. Rev. B}, 22:3816--3842, 1980.

1183:

1184: \bibitem{wf}

1185: S.~Wasserman and K.~Faust.

1186: \newblock {\em Social network analysis: Methods and applications}.

1187: \newblock Cambridge University Press, Cambridge, 1994.

1188:

1189: \bibitem{white:420}

1190: J.~G. White, E.~Southgate, J.~N. Thompson, and S.~Brenner.

1191: \newblock The structure of the nervous system of the nematode {C}aenorhabditis elegans.

1192: \newblock {\em Philos. Trans. Roy. Soc. Lond. B}, 314:1--340, 1986.

1193:

1194: \bibitem{zj:spectrum}

1195: J.~Zhao, L.~Tao, H.~Yu, J.-H. Luo, Z.-W. Cao, and Y.-X. Li.

1196: \newblock The spectrum of degree correlations: topological diversity of

1197:   networks with a given degree sequence.

1198: \newblock e-print cond-mat/0611104.

1199:

1200: \bibitem{zhao:meta}

1201: J.~Zhao, H.~Yu, J.~Luo, Z.~W. Cao, and Y.-X. Li.

1202: \newblock Complex networks theory for analyzing metabolic networks.

1203: \newblock {\em Chinese Science Bulletin}, 51:1529--1537, 2006.

1204:

1205: \end{thebibliography}

1206:

1207:

1208: \end{document}

1209:

1210:

1211: