0605:q-bio0605045/pnas10.tex

1: %%%pnas10

2:

3: \documentclass[aps,twocolumn,10pt]{revtex4}

4: %\textwidth 44pc

5:

6: \usepackage{dcolumn}

7: \usepackage{graphicx}

8: \ifnum\lefthyphenmin<2\lefthyphenmin=2\fi

9: \ifnum\righthyphenmin<2\righthyphenmin=2\fi

10: \begin{document}

11:

12: \title{Spontaneous Self-Assembly of Transcription Factor Based Gene Regulation Networks}

13:

14: \author{ D. Balcan$^1$, A. Kabak\c c\i o\u glu$^2$, M. Mungan$^{3,4}$, and  A. Erzan$^{1,4}$\\}

15:

16: \affiliation{$^1$Department of Physics, Faculty of Sciences and

17: Letters\\

18: Istanbul Technical University, Maslak 34469, Istanbul, Turkey}

19: \affiliation{$^2$Department of Physics, Faculty of Arts and Sciences, Koc University, 34450 Sariyer Istanbul, Turkey}

20: \affiliation{$^3$Department of Physics, Faculty of Arts and Sciences \\

21: Bogazi\c ci University, 34342 Bebek Istanbul, Turkey}

22: \affiliation{$^4$G\"ursey Institute, P.O.B. 6, \c Cengelk\"oy, 34680 Istanbul, Turkey}

23:

24: \date{\today }

25:

26: \begin{abstract}

27: We model the transcription factor based regulation network of

28: yeast using a content-based network model that mimics the

29: recognition of binding motifs on the regulatory regions of the

30: genes. We are thereby able to faithfully reproduce many of the

31: topological features of the gene regulatory network of yeast once

32: the parameters of the yeast genome, in particular the distribution

33: of information coded by the ``binding sequences'' within the

34: promoter regions, are provided as input. The length distribution for

35: the promoter regions is fixed by comparing the k-core analysis of

36: the model network with that of yeast. Our results strongly point

37: to the possibility that the observed topological features are

38: generic to networks formed via sequence-matching between random

39: strings obeying certain length distributions.

40: \end{abstract}

41: \pacs{87.17.Aa, 89.75.Fb

42: }

43: \maketitle

44:

45: \section{Introduction}

46: Development of new experimental techniques, such as DNA microarrays,

47: in the late 1990s~\cite{microarray,spellman} made a huge impact on

48: cell biology research. Such experiments generated a flood of

49: expression data for several well-studied single-cell species for which we

50: now have an almost complete list of not only the genes, but also the

51: interactions between them.

52: A cell is able to survive, grow and replicate due to the

53: collective actions of its genes. The adaptation and robustness of

54: its activities in a constantly changing environment is maintained

55: by the complex network of interactions between the genes.

56:

57: The regulation of gene expression in a cell relies to a major

58: extent on dedicated proteins called transcription factors

59: (TFs).~\cite{Cell} These proteins come with a structure suited to

60: recognize and bind the DNA at specific locations called binding

61: sites.  The binding affinity of a TF on a certain DNA segment is

62: determined by the base sequence at the location. Each TF

63: preferentially binds certain regulatory sequences or binding

64: motifs, within the promoter regions (PRs) responsible for the

65: regulation of the gene. In the case of yeast, {\it Saccharomyces

66: cerevisiae}, a list of the binding motifs for more than 100 TFs

67: has  recently been provided.~\cite{Lee,Harbison} It was also

68: reported~\cite{Harbison} that the TF binding sites are located

69: with high probability within a window of several hundred bases

70: upstream of the transcription activation site (preceding the start

71: codon of the gene), although longer-distance action is also

72: possible.  In fact, the existence of a high-affinity binding motif

73: in a promoter region is a necessary but not sufficient condition

74: for TF-based expression regulation~\cite{Harbison}.  Moreover,

75: especially in eukaryotic cells, gene regulation relies on the

76: simultaneous action of multiple TFs.

77:

78: We argue that the global features of the gene regulation network

79: depend very little on such details and are largely determined by

80: the distribution of the amount of shared information or content,

81: that is required for the establishment of regulatory interactions.

82: It may be conjectured that information sharing and its

83: distribution is the basic organizing principle which is

84: responsible for the universality of the degree distribution of gene regulatory

85: networks across diverse species~\cite{Barkai}.

86:

87: In this paper we propose to model the transcription regulation network

88: of yeast using the ideas of the content-based model we

89: introduced earlier~\cite{Balcan,Mungan}.  We are able to

90: faithfully reproduce many of the topological features of the gene

91: regulatory network of yeast when the parameters of the yeast genome,

92: in particular the distribution of information coded by the ``binding

93: sequences'' of the regulatory segments, are given as input.  We compare

94: the ensemble of the resulting model networks with the data on the

95: yeast regulatory network available in different databases.

96:

97: Gene regulatory networks can be naturally described as a directed

98: graph where the nodes are the genes. A directed edge from node A

99: to node B implies that the transcription factor produced by gene A

100: regulates the activity of gene B. Since the edges are directed,

101: one distinguishes the in-degree (the number of incoming edges),

102: the out-degree (number of outgoing edges) and the total degree of

103: a node, each with their own (possibly distinct) probability

104: distributions. These distributions serve as distinguishing

105: features of the network which a realistic model is expected to

106: reproduce. Further structural aspects of these networks are probed

107: by measures such as the clustering coefficient

108: $C(k)$~\cite{Dorogovtsev,Watts-Strogats98}, the degree-degree

109: correlation between connected

110: vertices~\cite{kk-correlation_colizza}, the ``rich-club

111: coefficient''~\cite{rich-club,rich-club_colizza}, or the $k$-core

112: decomposition~\cite{bollobas} recently

113: employed to predict new

114: interactions in various biological systems~\cite{protein_k-core,yeast_k-core,bader,amin,wuchty}.

115:

116: This report is organized as follows: In Section \ref{modelSect} we

117: introduce our model, which we compare with the experimentally

118: determined yeast regulatory network in Section~\ref{SimSect}. A

119: discussion is provided in Section \ref{DiscSect}, while Section

120: \ref{MethodsSect} outlines our methods.

121:

122: \section{The Model}

123: \label{modelSect}

124:

125:

126: The nodes of our model network correspond to genes. We

127: differentiate between genes which code for a Transcription Factor

128: (TF) and those which do not.  All genes are assumed to be possible

129: targets of regulation by one or more TFs.  Each node has a

130: sequence associated with it, representing the promoter region (PR)

131: through which the corresponding gene may be regulated. We pick a

132: given percentage of nodes (around 5\%, see Table~\ref{tabyeast}) at random, to

133: represent TF-producing genes. With each TF-producing node/gene we

134: also associate a second sequence, which stands for the binding

135: motif, which the TF recognizes and binds in the promoter region of

136: another gene.

137:

138: We represent both the binding motifs and the PRs as random binary

139: sequences of variable length. The mechanism for establishing

140: connections between nodes of the gene regulatory network is given

141: by a string matching condition~\cite{Balcan,Mungan}, between the

142: binding motifs of the TFs and all possible uninterrupted

143: subsequences of the PRs. The (directed) network of regulatory gene

144: interactions is then obtained by connecting each TF-producing node

145: ${\rm A}$ to all those nodes ${\rm B},\; {\rm B}^\prime,\;{\rm

146: B}^{\prime \prime} \ldots$ whose PRs contain the binding motif

147: associated with node ${\rm A}$. The amount of information coded in

148: these randomly generated binding motifs and promoter regions

149: constitutes the essential ingredient of our model and dictates the

150: overall topology of the resultant networks.
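
The connection rule described above can be illustrated with a minimal Python sketch. The function names \texttt{random\_bits} and \texttt{build\_network}, as well as the toy sequences, are hypothetical and serve only as an illustration; in the model itself the sequence lengths are drawn from the distributions discussed below. The sketch also assumes that a TF whose motif occurs in the PR of its own gene yields a self-edge, a detail the text does not specify.

```python
import random

def random_bits(length, rng):
    """Return a random binary string of the given length."""
    return "".join(rng.choice("01") for _ in range(length))

def build_network(promoters, motifs):
    """Return the set of directed edges (a, b) such that the binding motif
    of TF-coding gene a occurs as an uninterrupted substring of the
    promoter region (PR) of gene b.

    promoters: list of binary strings, one PR per gene (index = gene id)
    motifs:    dict mapping the id of each TF-coding gene to its motif
    Assumption: self-edges (a == b) are kept.
    """
    return {(a, b)
            for a, motif in motifs.items()
            for b, pr in enumerate(promoters)
            if motif in pr}

# Toy illustration with hand-picked sequences (not yeast data).
promoters = ["0110100", "1111", "0001011", "10"]
motifs = {0: "101", 3: "11"}   # genes 0 and 3 code for TFs
edges = build_network(promoters, motifs)
print(sorted(edges))

# A random PR of 20 bits, as would be used in an actual realization.
rng = random.Random(0)
random_pr = random_bits(20, rng)
```

In this toy case the short motif \texttt{11} matches three of the four PRs while the longer motif \texttt{101} matches only two, which illustrates how the motif length controls the out-degree of a TF.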

151:

152: Experimentally determined TF binding motifs are typically short sequences

153: with a narrow length distribution, since a TF  selectively

154: binds 5-10 bases and not much more. A single TF can bind a range of

155: similar motifs, and the relative frequencies of the four bases at each

156: position within the motif contribute to the information exchanged in

157: the binding process.  The promoter regions (PRs) which lie in the

158: intergenic portions of the genome are typically longer and may

159: accommodate several binding motifs (as shown in Fig.~\ref{model}) to allow

160: graded and/or combinatorial regulation~\cite{Cell,Harbison}.

161:

162: The bitwise length distribution of the model binding motifs was

163: derived from the yeast data provided by Harbison et al. in

164: \cite{Harbison}. The motifs were reported~\cite{Harbison} as letter

165: sequences comprising the symbols for the four bases \{ATGC\}, or

166: the symbols \{YMKRSW\} for incompletely specified bases, with the

167: corresponding lower case letters indicating a lower confidence level.

168: In order to account for such variations in the information content of

169: the motifs, we assigned two bits to each of the letters \{ACTG\}

170: appearing in the motif, signifying a high information content at

171: that position, and one bit otherwise. The

172: length of the bit sequence obtained in this way roughly corresponds

173: to the amount of shared information, measured by the Shannon

174: entropy~\cite{Shannon}, required for the binding of the TF.

175: Performing this calculation for each TF in~\cite{Harbison}, we obtain

176: the length distribution shown in Fig.~\ref{RS_dist}.
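
As a hedged illustration of the bit-counting rule just described, the following Python sketch computes the bitwise length of a motif string. The function name \texttt{motif\_bit\_length} and the example motifs are hypothetical. The sketch assumes that lower-case (low-confidence) letters fall under the ``one bit otherwise'' case; the text does not state this explicitly.

```python
def motif_bit_length(motif):
    """Bitwise length of a motif: two bits for each fully specified,
    high-confidence base (upper-case A, C, G, T), one bit for any other
    symbol (degenerate codes such as Y, M, K, R, S, W, and, by assumption,
    all lower-case letters)."""
    return sum(2 if symbol in "ACGT" else 1 for symbol in motif)

# Hypothetical example motifs, for illustration only.
print(motif_bit_length("TGACTCA"))   # seven fully specified bases
print(motif_bit_length("ACGTYRa"))   # four specified bases, three others
```

Collecting this value over all motifs in a data set gives a histogram of the kind shown in Fig.~\ref{RS_dist}.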

177:

178:

179: \begin{figure}[h]

180: \vspace*{0.0cm}

181: \includegraphics[width=7cm]{Fig_model.eps}

182: \caption[]{The mechanism of interaction between the genes as envisaged

183:   in our model.  The genes are indicated by ellipses (green if

184:   TF-coding, blue otherwise), the transcription factors by triangles

185:   with the associated binding motif in the box underneath. Non-TF

186:   proteins are symbolized by the ``P'' shape, and the promoter regions

187:   (PR) upstream of each gene are shown as red boxes. Binding occurs if

188:   the binding motif matches a subsequence in the PR, as is the case

189:   here at PR4. PRs in the model are typically much longer than

190:   depicted here.}

191: \label{model}

192: \end{figure}

193:

194: In choosing the length distribution of the promoter regions, about

195: which less is known, we are guided by the finding~\cite{Harbison}

196: that most of the probability for encountering a TF binding site is

197: contained within a window of 250 base pairs (bps) located

198: approximately 100 bps upstream of a gene. The PR length

199: distribution that we adopt within this range decays with a power

200: law  $p(l) \propto l^{-1-\mu}$, with $0\le\mu\le2$ after the

201: findings of Almirantis and Provata~\cite{Provata} for the lengths

202: of intergenic regions. We also assign a minimum length chosen to

203: coincide with the peak of the motif-length distribution shown in

204: Fig.~\ref{RS_dist}. Note that the 250 bps window does not double

205: as we move from the 4 letter alphabet to a binary one, because the

206: matching probabilities and the total number of positions at which

207: the TFs may bind are required to remain invariant under this

208: transformation.
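
A minimal sketch of how PR lengths obeying $p(l) \propto l^{-1-\mu}$ could be sampled on a bounded range is given below. The function name \texttt{sample\_pr\_lengths} is hypothetical, and the minimum length of 10 used in the example is a placeholder, since the actual value (the peak of the motif-length distribution) is not quoted in the text.

```python
import random

def sample_pr_lengths(n, mu, l_min, l_max, rng):
    """Draw n integer lengths from p(l) ~ l**(-1 - mu), restricted to
    l_min <= l <= l_max (truncated discrete power law)."""
    lengths = list(range(l_min, l_max + 1))
    weights = [l ** (-1.0 - mu) for l in lengths]
    return rng.choices(lengths, weights=weights, k=n)

rng = random.Random(1)
# l_min = 10 is a placeholder value; l_max = 250 follows the window size
# quoted in the text.
lengths = sample_pr_lengths(2000, mu=0.1, l_min=10, l_max=250, rng=rng)
print(min(lengths), max(lengths))
```

For small $\mu$ the distribution is broad, so that a sizeable fraction of the PRs is long enough to accommodate many distinct binding motifs.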

209:

210: The value of $\mu$ remains as the only adjustable

211: parameter in our model, and is determined by comparing the $k$-core

212: decomposition of the gene regulatory network of yeast as extracted

213: from experimental data (Table~\ref{tabyeast}) with our content-based network model,

214: as explained in the Methods section.

215:

216: \begin{figure}[h]

217: \vspace*{0.8cm}

218: \includegraphics[width=6cm]{Fig_lenDist.eps}

219: %\rotatebox{270}{\scalebox{.4}{\includegraphics{Fig0.eps}}}

220: \caption[]{Distribution of the amount of bitwise information coded by each

221:   regulatory sequence recognized and bound by the 102 TFs in the yeast

222:   genome (compiled from the recently published data by Harbison et

223:   al.~\cite{Harbison}). This distribution is adopted as the length

224:   distribution of the random regulatory sequences (``binding motifs'') in our model.}

225: \label{RS_dist}

226: \end{figure}

227:

228: The collection of such model networks forms an ensemble whose

229: features are a direct consequence of the string-matching mechanism

230: and the length distributions. Clearly, each realization of the

231: model will result in a different collection of random PRs and

232: binding motifs, and hence a somewhat different network. These

233: features turn out to be strikingly distinct from those encountered

234: in random~\cite{erdos-renyi} or scale-free~\cite{Barabasi}

235: networks. We show below that the ``signatures'' of this ensemble

236: are shared by the yeast regulatory network.

237:

238:

239: \section{Results}

240: \label{SimSect}

241:

242: Our purpose here is to show that the experimentally determined

243: features of the yeast regulation network follow closely those typical

244: of the ensemble defined by our model. The topological features we will

245: focus on are the following:

246:

247: \begin{enumerate}

248: \item {\bf degree distribution} (in-, out-, and total): the

249: distribution of the number of connections of the nodes in a network.

250: \item {\bf clustering coefficient}: the fraction of pairs of neighbors of a node that are themselves connected, a measure of the local modularity of the network.

251: \item {\bf degree-degree correlations}: average degree of the neighbors

252: of a node with degree $k$.

253: \item {\bf ``rich-club'' coefficient}:  a measure of the relative

254: connectivity among nodes whose degree is higher than a given number.

255: \item {\bf $k$-core structure}: the hierarchical structuring in the network.

256: \end{enumerate}

257:

258: The precise definition of these quantities is given in the Methods

259: section.

260:

261: Here we will report the comparison of our results with the most

262: recent Yeastract~\cite{Nucleicacids} data.  Analogous comparisons

263: with each of the data sources listed in Table~\ref{tabyeast} yield

264: similar results (see Supplementary Material) showing that our

265: conclusions are consistent with all the different data sets

266: available.

267:

268: In order to compare our results with the available data we

269: generate an ensemble of realizations, with an average of $N_G =

270: 6000$ genes in total, 4167 of which contribute to the network on

271: the average. Out of these, 202 (making up 4.8\% of the genes)

272: are TF-coding genes, taking part in a total of 14365 interactions,

273: again on the average.  The corresponding values for the yeast

274: regulatory networks reported in the publicly available data bases

275: are given in Table~\ref{tabyeast}.

276:

277: The total degree distribution is obtained by ignoring the

278: directionality of the interactions and is different from the

279: superposition of in- and out-degree distributions. In

280: Fig.~\ref{degree-dists}a,  Yeastract data for the degree

281: distribution  is shown on top of a scatter plot obtained by

282: superposing the results from 100 artificial model genomes

283: independently generated according to the rules described in

284: Section \ref{modelSect}. In Fig.~\ref{degree-dists}b, we exhibit

285: the in-degree distribution obtained from the Yeastract data, and

286: the corresponding scatter plot.

287:

288: \vspace{0.0cm}

289: \begin{table}

290: \caption[]{The number of interacting genes, TFs, and interacting pairs that appear

291: in the yeast regulatory network as obtained from different sources.}

292: \begin{tabular}{l|c|c|c}

293: \hline Source & Genes & TFs & Interacting Pairs \\ \hline \hline

294: Fraenkel Lab\footnote{http://fraenkel.mit.edu/Harbison/release\_v24/bound\_by\_factor/} & 2884 & 102 & 6441  \\ \hline

295: Yeastract\footnote{http://www.yeastract.com}

296:  & 4252 & 146 & 12530 \\ \hline

297: Luscombe et al.\footnote{http://sandy.topnet.gersteinlab.org/index2.html}

298:  & 3459 & 142 & 7071 \\ \hline

299: K\i rdar et al.\footnote{private communication} & 3763 & 180 & 9135 \\

300: \end{tabular}

301: \label{tabyeast}

302: \end{table}

303:

304:

305: The out-degree distribution of the yeast and model networks exhibits a rather

306: large scatter of points due to the relatively small number of TFs.

307: Comparing with the scatter plot obtained from 100 realizations, we find again

308: that the actual yeast data falls within the boundaries set by the model ensemble

309: (Fig.~\ref{degree-dists}c).

310:

311:

312: In Fig.~\ref{coefficients}, we report the three topological

313: coefficients, the clustering coefficient, the degree-degree

314: correlation and the ``rich-club'' coefficient, that go beyond

315: degree-distributions in characterizing the network. The agreement is extremely good;

316: in particular, the shoulder observed in the ``rich-club''

317: coefficient in Fig.~\ref{coefficients}(c), a feature common to both

318: gene-regulation and protein-protein interaction networks

319: \cite{kk-correlation_colizza}, is captured accurately in our model.

320:

321: The agreement observed with the Yeastract data is not

322: source-specific, as can be seen from a comparison of the

323: topological properties of our model networks, with those

324: %for the yeast networks as

325: obtained from the different sources listed in Table

326: \ref{tabyeast} (see Supplement).

327:

328: Finally, in Fig.~\ref{k-core}, left, the $k$-core analysis of the

329: model network is shown, which should be compared with that of the

330: Yeastract data on the right. The $k$-core analysis provides a much

331: more stringent characterization of a network than the other single

332: topological features considered above. To give an idea of the

333: sensitivity of the $k$-core analysis to the structure of the

334: network,  let us point out that, under a shuffling of the edges of

335: the network keeping the degree of each node fixed,  the typical

336: value of the maximum number of $k$-cores, $k_{\rm max}$, becomes

337: 29 rather than 9 as observed in both the real yeast regulatory

338: network and the model (see Supplement).
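
The text does not specify how the degree-preserving shuffling was carried out. One common procedure, sketched below purely as an assumption, is repeated edge swapping: two directed edges $(a,b)$ and $(c,d)$ are replaced by $(a,d)$ and $(c,b)$, which leaves the in- and out-degree of every node unchanged. The function name \texttt{shuffle\_edges} and the toy edge list are hypothetical.

```python
import random

def shuffle_edges(edges, n_swaps, rng):
    """Randomize a directed network by attempting n_swaps edge swaps
    (a, b), (c, d) -> (a, d), (c, b). Swaps that would create a self-loop
    or a duplicate edge are rejected, so every node keeps its in- and
    out-degree. Requires at least two distinct edges."""
    edge_list = list(set(edges))
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        new1, new2 = (a, d), (c, b)
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue  # rejected swap
        edge_set -= {(a, b), (c, d)}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
    return edge_set

# Toy edge list, for illustration only.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0), (3, 1)]
shuffled = shuffle_edges(edges, 100, random.Random(0))
```

Comparing the $k$-core decomposition of the shuffled and the original network then isolates the structure that is not already implied by the degree sequence alone.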

339:

340: \begin{widetext}

341:

342: \begin{figure}

343: \vspace*{0.0cm}

344: \includegraphics[width=17.0cm]{Fig_DegDists.eps}

345: \caption[]{Degree distributions extracted from the

346:   Yeastract~\cite{Nucleicacids} data (red circles), superposed on the

347:   corresponding degree distributions of 100 realizations of the model

348:   network (black dots). From left to right, a) The total degree distribution

349:   with an inset showing a log-linear plot for $k/k_{\rm av} \le 10$,

350:   where one may observe that both the model and the data points almost

351:   fall on a straight line. b) The in-degree distribution

352:   plotted on a semi-logarithmic scale. c) The out-degree distribution

353:   plotted on a log-log scale. The axes are  scaled by the average

354:   total degree in order to factor out sample-to-sample fluctuations in the network

355:   size.}

356: \label{degree-dists}

357: \end{figure}

358:

359:

360: \begin{figure}

361: \vspace*{0.0cm}

362: \includegraphics[width=17cm]{Fig_Coeff.eps}

363: \caption[]{ Comparison of a) the clustering coefficient $C(k)$, b)

364: the degree-degree correlations between neighboring nodes

365: $k_{nn}(k)$, and c) the rich-club coefficient $r(k)$, from left to right, for $100$

366: realizations of the model (black dots) and the Yeastract data (red

367: circles).} \label{coefficients}

368: \end{figure}

369:

370: \end{widetext}

371:

372:

373: \section{Discussion}

374: \label{DiscSect}

375:

376: The close structural similarity between the model and the real yeast regulatory network, with respect to a diverse set of criteria, shows that they are

377: part of the same statistical ensemble of networks, formed by random strings connected by the sequence matching rule.

378:

379: The sequence matching rule could more generally be viewed as an

380: information-theoretical constraint, where the interaction between two genes

381: requires the fulfillment of a set of conditions which we

382: symbolically represent as the matching of two random sequences. The more

383: stringent the prerequisites of the interaction, the longer is the random

384: ``binding motif'' that is to be matched.

385: The length of the PR establishes the size of the phase space in which the motif is to be sought.

386: The properties of the network are then determined by the distributions obeyed by the lengths of the binding motifs as well as the promoter regions.

387:

388: Interpreted within this information-theoretical framework, our model has sufficient

389: generality to accommodate other interactions based on lock-and-key mechanisms, such as protein networks, where the

390: interactions are dictated by certain steric and chemical conditions.

391:

392: The topological features of the networks investigated here and

393: shown to be shared by the yeast regulatory network

394: strongly point to the possibility that these networks did not have to

395: be assembled from scratch, but rather emerged spontaneously,

396: given any sufficiently long linear code.

397: This proposition by no means minimizes the role of evolutionary pressures on such networks; instead, it

398: suggests that a network with essentially the current topology could have provided

399: a starting point for further fine-tuning. As a case in point, it has recently been demonstrated that evolution under duplication and divergence~\cite{Wagner} may leave the topological features of such networks essentially invariant~\cite{sengun}. Such a perspective will hopefully bring us a step

400: closer to envisioning how complex structures may have

401: come into existence, by shifting some of the load from the shoulders of

402: evolution onto the laws of probability.

403:

404: \begin{widetext}

405:

406: \begin{figure}

407: \vspace*{0.0cm}

408: \includegraphics[width=17cm]{Fig_kCore_compare.eps}

409: \caption[]{Left: The $k$-core decomposition of a single realization of our model

410:   network obtained with the visualization tool lanet-vi~\cite{lanet-vi}.

411:   The length distribution exponent of the PR sequences has been

412:   adjusted to $\mu=0.1$ to optimize the similarity with the $k$-core

413:   distribution of the Yeastract data (Right). Dots represent the nodes

414:   of the network, while edges between nodes depict connections. Nodes

415:   belonging to different $k$-shells are indicated by different colors

416:   (on the right hand side) and are arranged around concentric circles,

417:   whose average radius decreases with $k$. In particular, a node of a

418:   given shell is placed just inside (outside) the corresponding circle, if it

419:   is preferentially connected to lower (higher) $k$-shells. The size of

420: the dots indicates the degree of the respective nodes; see legends to the left of the figures.

421: }

422: \label{k-core}

423: \end{figure}

424:

425: \end{widetext}

426:

427: \section{Methods}

428: \label{MethodsSect}

429:

430: The degree $k$ of a node is the number of edges connected to it.

431: When the graph is directed, one distinguishes in-, out-, and

432: total-degrees of a node, with their corresponding distributions.

433: In the measures below we have ignored the directionality of the

434: network.
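
The three kinds of degree can be computed from a list of directed edges as in the following sketch. The function name \texttt{degree\_counts} and the toy edges are hypothetical, and dropping self-regulation in the undirected view is an assumption. The example shows why the total degree is not simply the sum of in- and out-degree: a pair of genes regulating each other contributes a single undirected edge.

```python
from collections import Counter

def degree_counts(edges):
    """edges: iterable of directed pairs (a, b), meaning a regulates b.
    Returns (in_degree, out_degree, total_degree); the total degree counts
    distinct neighbours once direction is ignored."""
    edges = set(edges)
    out_deg = Counter(a for a, _ in edges)
    in_deg = Counter(b for _, b in edges)
    neighbors = {}
    for a, b in edges:
        if a != b:  # assumption: self-regulation is dropped here
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    total_deg = {v: len(nb) for v, nb in neighbors.items()}
    return in_deg, out_deg, total_deg

# Toy example: A and B regulate each other, and A also regulates C.
edges = [("A", "B"), ("B", "A"), ("A", "C")]
in_deg, out_deg, total_deg = degree_counts(edges)
print(in_deg["A"], out_deg["A"], total_deg["A"])
```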

435:

436: The clustering coefficient is given by the formula:

437: \[ C_i = \frac{\Delta_i}{k_i(k_i-1)/2}\;,\]

438: where $\Delta_i $ is the number of triangles that contain node $i$.

439: The quantity $C(k)$ plotted

440: in Fig.~\ref{coefficients} is the average of $C_i$ over the nodes with

441: degree $k$.
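
A short sketch of this definition, for an undirected network stored as a dictionary of neighbor sets, is given below. The function name \texttt{clustering\_by\_degree} and the toy graph are hypothetical; nodes with $k_i<2$, for which $C_i$ is undefined, are skipped by assumption.

```python
from itertools import combinations

def clustering_by_degree(adj):
    """adj: dict node -> set of neighbours (undirected, no self-loops).
    Returns {k: average of C_i over the nodes with degree k}, k >= 2."""
    sums, counts = {}, {}
    for i, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # C_i is undefined for k < 2; skipped by assumption
        # Number of triangles containing i = connected pairs of neighbours.
        triangles = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        c_i = triangles / (k * (k - 1) / 2)
        sums[k] = sums.get(k, 0.0) + c_i
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
c_of_k = clustering_by_degree(adj)
print(c_of_k)
```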

442:

443: The degree-degree correlation function $k_{nn}(k)$  is

444: \[

445: k_{nn}(k) = \sum_{k^\prime} k^\prime p(k^\prime \vert k),

446: \]

447: where $p(k^\prime \vert k)$ is the conditional probability that a node with

448: degree $k$ is connected to a node with degree $k^\prime$.
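
In practice $k_{nn}(k)$ can be estimated as the mean degree found at the far end of all edges that emanate from nodes of degree $k$, as in the sketch below. The function name \texttt{knn\_by\_degree} and the toy graph are hypothetical.

```python
def knn_by_degree(adj):
    """adj: dict node -> set of neighbours (undirected).
    Returns {k: mean degree of the neighbours of nodes with degree k}."""
    deg = {v: len(nb) for v, nb in adj.items()}
    totals, ends = {}, {}
    for v, nbrs in adj.items():
        k = deg[v]
        if k == 0:
            continue  # isolated nodes have no neighbours
        totals[k] = totals.get(k, 0) + sum(deg[u] for u in nbrs)
        ends[k] = ends.get(k, 0) + k   # number of edge ends at degree-k nodes
    return {k: totals[k] / ends[k] for k in totals}

# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
knn = knn_by_degree(adj)
print(knn)
```

A $k_{nn}(k)$ that decreases with $k$ signals that highly connected nodes tend to link to nodes of low degree.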

449:

450:

451: The ``rich-club'' coefficient~\cite{rich-club,rich-club_colizza}

452: $r(k)$ is the total number $e_{>k}$

453: of edges connecting nodes with degree greater than $k$, normalized by the

454: maximum possible number of such connections,

455: \[

456: r(k) = \frac{2e_{>k}}{N_{>k} (N_{>k} -1)},

457: \]

458: where $N_{>k}$ is the total number of nodes with degree greater than

459: $k$.
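
The formula translates directly into the following sketch, in which the function name \texttt{rich\_club} and the toy graph are hypothetical. Returning \texttt{None} when fewer than two nodes exceed degree $k$ is a convention chosen here, since $r(k)$ is undefined in that case.

```python
def rich_club(adj, k):
    """adj: dict node -> set of neighbours (undirected, no self-loops).
    Returns r(k) = 2 e_{>k} / (N_{>k} (N_{>k} - 1)), or None if fewer
    than two nodes have degree greater than k."""
    rich = {v for v, nb in adj.items() if len(nb) > k}
    n = len(rich)
    if n < 2:
        return None
    # Each edge inside the rich set is seen from both of its ends.
    e = sum(1 for v in rich for u in adj[v] if u in rich) // 2
    return 2 * e / (n * (n - 1))

# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(rich_club(adj, 0), rich_club(adj, 1), rich_club(adj, 2))
```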

460:

461: The $k$-core decomposition performs a successive pruning on the least

462: connected vertices of a network~\cite{bollobas}. At each step one

463: removes all nodes with a degree less than $k$ along with their edges and

464: continues in this manner until all

465: nodes have at least degree $k$. The remaining nodes constitute

466: the $k$ core. Next, $k$ is incremented by one, and the process

467: is repeated until no nodes are left. The $k$-shell is defined as

468: the set of nodes that belong to the $k$-core, but not the $(k+1)$-core.
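
The pruning procedure can be sketched as follows. This is a straightforward, unoptimized illustration, not the implementation behind the lanet-vi tool; the function name \texttt{k\_shells} and the toy graph are hypothetical.

```python
def k_shells(adj):
    """adj: dict node -> set of neighbours (undirected, no self-loops).
    Returns {node: index of the k-shell the node belongs to}."""
    remaining = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    shell = {}
    k = 1
    while remaining:
        # Prune until every remaining node has degree >= k (the k-core).
        while True:
            low = [v for v, nb in remaining.items() if len(nb) < k]
            if not low:
                break
            for v in low:
                shell[v] = k - 1  # in the (k-1)-core but not in the k-core
                for u in remaining.pop(v):
                    if u in remaining:
                        remaining[u].discard(v)
        k += 1
    return shell

# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
shell = k_shells(adj)
print(shell, max(shell.values()))
```

The largest shell index found in this way corresponds to $k_{\rm max}$; isolated nodes, if present, are assigned to shell 0.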

469:

470: Once the shape of the TF length distribution, the width of the PR region, as well

471: as the functional form of its distribution have been fixed through the available

472: biological data, the only remaining adjustable parameter in our model is the exponent

473: $\mu$ of the power law distribution of PR lengths, $p(l) \propto l^{-1-\mu}$. The $k$-core decomposition turns

474: out to provide the most detailed and stringent topological characterization of the

475: network, with both the total number of shells, and the distribution of the nodes

476: over the shells, being contained in the $k$-core plots (see Fig.~\ref{k-core}). The

477: $k$-core plots also incorporate such qualitative features  as inter- and

478: intra-shell connectivity. We have therefore used qualitative and quantitative

479: comparison of the $k$-core plots for the Yeastract and the model network to determine $\mu$.

480: The best agreement was obtained for $\mu=0.1$.  Once $\mu$ has been fixed, no further

481: adjustment is needed in order to obtain the extremely close matching that is found

482: between the degree distributions, clustering coefficients, degree

483: correlations and the rich-club coefficient, as displayed in

484: Figs.~\ref{degree-dists} and~\ref{coefficients}.

485:

486: We cannot rule out the possibility of obtaining similar agreement between our model and the real genomic network with respect to the features considered here, for a different choice of the functional form of the length distribution for the PR sequences, once more determining an adjustable parameter from a  comparison of the $k$-core plots. However, the present choice seems to be the only reasonable one within the physical constraints and the available information.

487:

488:

489: \section{Acknowledgments}

490:

491: We would like to thank Bet\"ul K\i rdar and Beste K\i n\i ko\u glu

492: for the use of their data and useful discussions. It is a pleasure

493: to thank Alessandro Vespignani and Ignacio Alvarez-Hamelin for

494: bringing $k$-core analysis to our attention, and for the use of

495: their web-based $k$-core analysis tool. AE would like to thank

496: Tam\'as Vicsek and Andr\'as Czir\'ok for a useful discussion and

497: is grateful for partial support from the Turkish Academy of

498: Sciences.

499:

500: \begin{thebibliography}{99}

501:

502: \bibitem{microarray} Lockhart, D.J., Winzeler, E.A. (2000)

503: %Genomics, gene expression and DNA arrays.

504: {\it Nature} {\bf 405}, 827-36.

505:

506: \bibitem{spellman} Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher,

507: B. (1998)

508: %Comprehensive identification of cell cycle-regulated genes of

509: %the yeast Saccharomyces cerevisiae by microarray

510: %hybridization.

511: {\it Molecular Biology of the Cell} {\bf 9},3273-3297.

512:

513: \bibitem{Cell} Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., Walter, P. (2002) in {\it Molecular Biology of the

514: Cell}. Chapter 9. (Garland Science, N.Y.).

515:

516: \bibitem{Lee} Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I. et al. (2002)

517: %Transcriptional Regulatory Networks in Saccharomyces cerevisiae .

518: {\it Science}, {\bf 298}, 799-804.

519:

520: \bibitem{Harbison} Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N.J.,

521: Macisaac, K.D., Danford, T.W., Hannett, N.M., Tagne, J.B., Reynolds, D.B., Yoo, J., et al. (2004)

522: %Transcriptional regulatory code of a eukaryotic genome.

523: {\it Nature} {\bf 431}, 99-104.

524:

525:

526: \bibitem{Barkai} Bergmann, S., Ihmels, J., Barkai, N. (2004)

527: % Similarities and differences in genome-wide expression data of six organisms

528: {\it PLoS Biol.} {\bf 2}, 85-93.

529:

530: \bibitem{Balcan} Balcan, D., Erzan, A.

531: (2004) {\it Eur. Phys. J.} B {\bf 38}, 253.

532:

533: \bibitem{Mungan} Mungan, M., Kabakcioglu, A., Balcan, D.,

534: Erzan, A.

535: %Analytical solution of a stochastic content-based network model,

536: (2005) {\it J. Phys. A} {\bf 38} (44), 9599-9620.

537:

538:

539: \bibitem{Dorogovtsev}

540: Dorogovtsev, S.N., Mendes, J.F.F.

541: % Evolution of Networks.

542: (2002) {\it Adv. Phys.} {\bf 51}, 1079--1187.

543:

544: \bibitem{Watts-Strogats98} Watts, D.J. \& Strogatz, S.H.

545: % Collective dynamics of `small-world' networks.

546: (1998) {\it Nature} (London) {\bf 393}, 440-442.

547:

548: \bibitem{kk-correlation_colizza} Colizza, V., Flammini, A., Maritan, A.,

549: Vespignani, A.

550: %Characterization and modeling of protein-protein interaction networks,

551: (2005) {\it Physica A} {\bf 352}, 1-27.

552:

553:

554: \bibitem{rich-club} Zhou, S. \& Mondragon, R.J.

555: %The rich-club phenomenon in the Internet topology,

556: (2004) {\it IEEE Commun. Lett.} {\bf 8}, 180-182.

557:

558:

559: \bibitem{rich-club_colizza} Colizza, V., Flammini, A., Serrano, M.A. \&

560: Vespignani, A.

561: %Detecting rich-club ordering in complex networks,

562: (2006) {\it Nature Physics} {\bf 2},110-115.

563:

564: \bibitem{bollobas} Bollobas, B., (1998) {\it Modern Graph Theory} (Springer Verlag, New York).

565:

566: \bibitem{protein_k-core} Tong, A.H.Y., Drees, B., Nardelli, G.,

567: Bader, G.D., Brannetti, B., Castagnoli, L., Evangelista, M., Ferracuti, S.,

568:  Nelson, B., Paoluzi, S. {\it et al.}

569: %A Combined Experimental and

570: %Computational Strategy to Define Protein Interaction Networks for

571: %Peptide Recognition Modules,

572: (2002) {\it Science} {\bf 295}, 321-324.

573:

574: \bibitem{yeast_k-core} Bader, G.D. \& Hogue, C.W.V.

575: %Analyzing yeast

576: %protein-protein interaction data obtained from different sources,

577: (2002) {\it Nature Biotechnology} {\bf 20}, 991-997.

578:

579: \bibitem{bader} Bader, G.D. \&  Hogue, C.W.V.

580: % An automated method for  finding molecular complexes in large protein

581: % interaction network

582: (2003) {\it BMC Bioinformatics}, {\bf 4}(2)

583:

584: \bibitem{amin} Altaf-Ul-Amin, M., Nishikata, K., Koma, T., Miyasato, T., Shinbo, Y., Arifuzzaman, M., Wada, C., Maeda, M., Oshima, T., Mori, H. {\it et al.}

585: % Prediction of Protein Functions Based on

586: % K-Cores of Protein-Protein

587: % Interaction Networks and Amino Acid Sequences

588: (2003) {\it Genome Informatics}, {\bf 14},  498-499.

589:

590: \bibitem{wuchty} Wuchty, S. \& Almaas, E.

591: % Peeling the yeast protein network

592: (2005) {\it Proteomics}, {\bf 5}(2), 444-449.

593:

594:

595: \bibitem{Shannon} Shannon, C. E.,

596: % Communication in the Presence of Noise,

597: (1949) {\it Proc. IRE} {\bf 37}, 10-21.

598:

599: \bibitem{Provata} Almirantis, Y. and Provata, A. (1999) {\it J. Stat. Phys},

600: {\bf 97}, 233-262.

601:

602: \bibitem{erdos-renyi} Erd\"{o}s, P. \&  R\'{e}nyi, A.,

603: %On the evolution of random graphs,

604: (1960) {\it Publ. Math. Inst. Hung. Acad. Sci.} {\bf 5}, 17-60.

605:

606: \bibitem{Barabasi} Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N.,

607: Barabasi, A.-L.

608: % The large-scale organization of metabolic networks,

609: (2000) {\it Nature} {\bf 407}, 651-654; Albert, R., Jeong, H., Barabasi, A.-L.,

610: % The diameter of the world-wide-web,

611: (1999) {\it Nature} {\bf 401}, 130-131.

612:

613:

614: \bibitem{Nucleicacids} Teixeira, M.C., Monteiro, P., Jain, P.,

615: Tenreiro, S., Fernandes, A.R., Mira, N.P., Alenquer, M., Freitas,

616: A.T., Oliveira, A.L., Correia, I. (2006)

617: %The YEASTRACT database: a tool for the analysis of transcription

618: %regulatory associations in Saccharomyces cerevisiae,

619: {\it Nucl. Acids Res.} {\bf 34}, D446-451.

620:

621: \bibitem{lanet-vi} Alvarez-Hamelin, I., Dall'Asta, L.,

622: Barrat, A., Vespignani, A.

623: %k-core decomposition: a tool for the visualization of large scale networks.

624: Arxiv preprint cs.NI/0504107

625:

626: \bibitem{Wagner} Wagner, A. (2001) {\it Mol. Biol. Evol.} {\bf 18}, 1283.

627:

628: \bibitem{sengun} \c Seng\"un, Y., Erzan, A. (2006) {\it Physica A} {\bf 365}, 446-462.

629:

630: \bibitem{BA} Albert, R. \& Barabasi, A.-L. (2002) {\it Rev. Mod. Phys.} {\bf 74}, 47-97.

631:

632: \end{thebibliography}

633:

634:

635: \begin{widetext}

636: \newpage

637:

638:

639:

640: \newpage

641: %\bigskip

642: {\bf Supplementary Material 1}

643:

644: {\bf Comparison with yeast data from different databases}

645:

646: %\begin{widetext}

647:

648: \begin{figure}[h]

649: \vspace*{0.0cm}

650: \includegraphics[width=12cm]{Suppl_Fig.eps}

651: \caption[]{The network statistics extracted from the sources

652: listed in Table~\ref{tabyeast} superposed on the simulation results corresponding to

653: 100 realizations of the model network (black dots). The agreement is extremely good with all of these sets of data,

654: which almost completely cover, but do not exceed, the phase space

655: of our model. (Black, red, blue, green yellow and maroon correspond to the model, Yeastract,

656: Fraenkel Lab, K\i rdar and Luscombe data, respectively).

657: }

658: \label{supp_fig1}

659: \end{figure}

660:

661: \newpage

662:

663: {\bf Supplementary Material 2}

664:

665: {\bf Comparison with Randomized Networks}

666:

667: To double-check the significance of our other results, we also

668: compared the clustering coefficients, the degree-degree correlations

669: and the rich-club coefficients of the Yeastract data

670: with those obtained after randomly reconnecting the edges of the network while keeping the degree of each node fixed. In this process, the directionality of the bonds is ignored.

671: The comparison of the topological coefficients of the randomized yeast and randomized model networks with those of the yeast network, shown in Fig.~\ref{randomized}, confirms that

672: the observed agreement between the yeast and model networks is not spurious.
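
The randomization step described above can be sketched as follows. This is only an illustrative sketch, assuming that the reconnection is realized by repeated double-edge swaps on the undirected, simple version of the network; the text does not specify the exact procedure, and the function names \texttt{degree\_preserving\_rewire} and \texttt{degrees} are hypothetical.

```python
import random


def degree_preserving_rewire(edges, n_swaps, seed=0, max_tries_factor=100):
    """Randomize an undirected simple graph by double-edge swaps.

    Each accepted swap replaces the edges (a, b), (c, d) by (a, d), (c, b),
    so every node keeps its degree. Swaps that would create a self-loop or
    a duplicate edge are rejected. (Assumed procedure, not the authors'.)
    """
    rng = random.Random(seed)
    # Direction is ignored: store each edge once, as a sorted tuple.
    edge_list = sorted({tuple(sorted(e)) for e in edges if e[0] != e[1]})
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_list
    done, tries = 0, 0
    while done < n_swaps and tries < max_tries_factor * n_swaps:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose one of the two possible re-pairings
        if a == d or c == b:
            continue  # would create a self-loop
        new1 = tuple(sorted((a, d)))
        new2 = tuple(sorted((c, b)))
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a duplicate edge
        edge_set.discard(edge_list[i])
        edge_set.discard(edge_list[j])
        edge_set.add(new1)
        edge_set.add(new2)
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list


def degrees(edges):
    """Return {node: degree} for an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg
```

Because each swap only exchanges end points between two edges, the degree sequence, and hence the degree distribution, is unchanged by construction, while correlations beyond the degree sequence are progressively scrambled.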

673:

674: %\begin{widetext}

675:

676: \begin{figure}[h]

677: \vspace*{0.0cm}

678: \includegraphics[width=17cm]{Fig_Coeff_randomized.eps}

679: \caption[]{a) The clustering coefficient, b) the degree-degree

680: correlations between neighboring nodes, and c) the rich-club

681: coefficient of Yeastract data (red circles) compared with the results

682: for the same obtained by randomizing the Yeastract data (red dots) and

683: randomizing a realization of the model network (black dots),

684: keeping the degrees of the individual nodes, and thereby the degree distributions, fixed.}

685: \label{randomized}

686: \end{figure}

687: %\end{widetext}

688:

689: %\newpage

690:

691: In Fig.~\ref{k-core-random} we display the effect of the randomization procedure described above on the $k$-core plots. It is instructive to note that, while in the yeast and model networks a large fraction of the connections is between nearby shells, the situation is reversed in the randomized networks, where there is a high degree of intra-shell connectivity, as can be seen by comparing with Fig.~\ref{k-core}.
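
For readers who wish to reproduce this kind of comparison, the shell index of each node and the split between intra-shell and inter-shell edges can be computed along the following lines. This is a minimal sketch of the standard peeling algorithm for the $k$-core decomposition, assuming an undirected network without self-loops; it is not the visualization tool used for the figures, and the function names are hypothetical.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: shell index} for an undirected simple graph.

    The shell index of a node is the largest k such that the node belongs
    to the k-core, i.e. the maximal subgraph with all degrees >= k.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    deg = {u: len(nb) for u, nb in adj.items()}
    core = {}
    remaining = set(adj)
    k = 0
    while remaining:
        # Collect nodes that cannot belong to the (k+1)-core.
        stack = [u for u in remaining if deg[u] <= k]
        if not stack:
            k += 1
            continue
        while stack:
            u = stack.pop()
            if u not in remaining:
                continue
            remaining.remove(u)
            core[u] = k
            # Removing u lowers the degree of its surviving neighbours.
            for w in adj[u]:
                if w in remaining:
                    deg[w] -= 1
                    if deg[w] <= k:
                        stack.append(w)
    return core


def shell_edge_counts(edges, core):
    """Count edges joining nodes of the same shell and of different shells."""
    intra = sum(1 for u, v in edges if core[u] == core[v])
    return intra, len(edges) - intra
```

Applying \texttt{shell\_edge\_counts} to a network and to its degree-preserving randomization gives a simple numerical counterpart to the visual comparison of the $k$-core plots.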

692:

693: \begin{figure}[h]

694: \vspace*{0.0cm}

695: \includegraphics[width=17cm]{Fig_kCore_compare_randomized.eps}

696: \caption[]{The $k$-core analysis of the randomized versions of the model (left panel) and Yeastract (right panel) networks yields results that differ quantitatively and qualitatively from the originals. The number of shells has gone up to 29 from 9, and the much higher intra-shell than inter-shell connectivity (as can be seen by following the edges) indicates that the hierarchical nature of the yeast network, which is faithfully reproduced by the model, is destroyed by the randomization process.

697: }

698: \label{k-core-random}

699: \end{figure}

700:

701: \newpage

702:

703: {\bf Supplementary Material 3}

704:

705: {\bf The $k$-core structure of the Balcan-Erzan and Barabasi-Albert

706: Networks}

707:

708: In Fig.~\ref{k-core-others} we show the $k$-core structure of the Balcan-Erzan~\cite{Balcan}

709: and Barabasi-Albert~\cite{BA} networks, two other models for complex networks. Note the absence

710: of well-defined hierarchical structures.

711:

712: \begin{figure}[h]

713: \vspace*{0.0cm}

714: \includegraphics[width=17cm]{Fig_kCore_others.eps}

715: \caption[]{The $k$-core analysis of the content-based network of

716: Balcan and Erzan~\cite{Balcan} (left panel) and the

717: Barabasi-Albert (BA) model~\cite{BA}. In the left panel, the total

718: length of the single sequences associated with all of the nodes is

719: $L = 15000$. The individual sequences obey the length distribution

720: $p(l) \propto q^l$, with $q = 0.95$. The BA model network (right

721: panel) has  5000 nodes, and is built by starting from a fully

722: connected four-cluster and adding nodes with two edges at a time.

723: In the $k$-core plot for the latter, only 5\% of the edges are

724: shown for better visibility. } \label{k-core-others}

725: \end{figure}
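
As a rough consistency check on the numbers quoted in the caption of Fig.~\ref{k-core-others}, the following estimates may be useful. They are a sketch under assumptions not stated explicitly above: that each new node of the BA network attaches with exactly two edges to two distinct existing nodes, and that the sequence lengths start at $l=1$. The symbols $E_{\rm BA}$ and $N$ are introduced here only for illustration. The number of edges and the mean degree of the BA network with $N=5000$ nodes are then
\begin{equation}
E_{\rm BA} = \frac{4\cdot 3}{2} + 2\,(5000-4) = 9998, \qquad
\langle k \rangle = \frac{2 E_{\rm BA}}{N} \approx 4 ,
\end{equation}
so that displaying 5\% of the edges corresponds to roughly 500 edges. Under the same assumptions, every node added after the initial four-cluster enters with degree two, so peeling the nodes in reverse order of arrival removes all of them before the $3$-core is reached; only two shells, $k=2$ and $k=3$, can then appear, which is consistent with the absence of a pronounced hierarchy. For the length distribution of the left panel, normalization gives
\begin{equation}
p(l) = (1-q)\, q^{\,l-1}, \quad l = 1, 2, \ldots, \qquad
\langle l \rangle = \frac{1}{1-q} = 20 \quad (q = 0.95) .
\end{equation}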

726:

727:

728: \newpage

729: %\bigskip

730:

731: {\bf Supplementary Material 4}

732:

733: {\bf Ranking of overlapping sets of regulated genes and motif inclusion}

734:

735: Here we report a statistical fact in support of

736: the basic assumption underlying our model. The matching condition we

737: employ dictates a certain correlation between the sets of genes

738: regulated by each TF: if the binding motif of a TF (A) is embedded in that

739: of a TF (B), then the set of genes \{G$_i$\}$_{\mbox{B}}$ regulated by

740: TF$_{\mbox{B}}$ in our model is a subset of \{G$_i$\}$_{\mbox{A}}$. A

741: similar investigation of the yeast databases listed below reveals that

742: the top 50\% of the TF pairs related by the motif inclusion relation

743: above, rank in the top 3\% when all the TF pairs are listed according

744: to the overlap of their \{G$_i$\} sets. The actual ranking

745: of the TF pairs obtained among all possible pairs of 102 TFs with

746: known binding motifs is shown in Fig.~\ref{TFcorr}.
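
To make the inclusion argument and the scale of the ranking explicit, a short worked estimate follows. It is a back-of-the-envelope sketch that assumes unordered TF pairs are counted; the symbols $m_{\mbox{A}}$, $m_{\mbox{B}}$ (the binding motifs of the two TFs) and $N_{\rm pairs}$ are introduced here only for illustration. Any promoter region that contains the longer motif $m_{\mbox{B}}$ necessarily contains its substring $m_{\mbox{A}}$, hence
\begin{equation}
m_{\mbox{A}} \subseteq m_{\mbox{B}} \;\Longrightarrow\;
\{{\rm G}_i\}_{\mbox{B}} \subseteq \{{\rm G}_i\}_{\mbox{A}} , \qquad
N_{\rm pairs} = \frac{102 \cdot 101}{2} = 5151 ,
\end{equation}
so that the top 3\% of the overlap-ranked list comprises about $0.03 \times 5151 \approx 155$ pairs.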

747:

748:

749: \begin{figure}[h]

750: \vspace*{0.8cm}

751: \includegraphics[width=8cm]{Fig_TFcorr.eps}

752: \caption[]{Correlation between the sets of proteins regulated by

753:   the TFs with similar binding motifs.  The vertical axis is the

754:   percentage overlap of the two sets of genes regulated by an

755:   arbitrary pair of TFs, which are ranked on the horizontal axis

756:   according to their overlap.  The red vertical lines mark those pairs

757:   of TFs that are also related by binding motif inclusion. The

758:   accumulation of the red lines to the left of the graph is indicative

759:   of the correlation described in the text.}

760: \label{TFcorr}

761: \end{figure}

762:

763: On the other hand, the more straightforward expectation that TFs with

764: short binding motifs should regulate more genes is not verified by the same

765: data. This curious fact probably points to certain sequence

766: correlations arising from the duplication and divergence processes~\cite{Wagner}

768: that distort the occurrence statistics of the binding motifs in PRs. Note that

769: the result in Fig.~\ref{TFcorr} is robust to such deviations from the

770: unbiased probabilities for the occurrence of different strings.

771:

772: \end{widetext}

773:

774: \end{document}

775:

776: