1: %%%pnas10
2: 
3: \documentclass[aps,twocolumn,10pt]{revtex4}
4: %\textwidth 44pc
5: 
6: \usepackage{dcolumn}
7: \usepackage{graphicx}
8: \ifnum\lefthyphenmin<2\lefthyphenmin=2\fi
9: \ifnum\righthyphenmin<2\righthyphenmin=2\fi
10: \begin{document}
11: 
12: \title{Spontaneous Self-Assembly of Transcription Factor Based Gene Regulation Networks}
13: 
14: \author{ D. Balcan$^1$, A. Kabak\c c\i o\u glu$^2$, M. Mungan$^{3,4}$, and  A. Erzan$^{1,4}$\\}
15: 
16: \affiliation{$^1$Department of Physics, Faculty of Sciences and
17: Letters\\
18: Istanbul Technical University, Maslak 34469, Istanbul, Turkey}
19: \affiliation{$^2$Department of Physics, Faculty of Arts and Sciences, Koc University, 34450 Sariyer Istanbul, Turkey}
20: \affiliation{$^3$Department of Physics, Faculty of Arts and Sciences \\
21: Bogazi\c ci University, 34342 Bebek Istanbul, Turkey}
22: \affiliation{$^4$G\"ursey Institute, P.O.B. 6, \c Cengelk\"oy, 34680 Istanbul, Turkey}
23: 
24: \date{\today }
25: 
26: \begin{abstract}
27: We model the transcription factor based regulation network of
28: yeast using a content-based network model that mimics the
29: recognition of binding motifs on the regulatory regions of the
30: genes. We are thereby able to faithfully reproduce many of the
31: topological features of the gene regulatory network of yeast once
32: the parameters of the yeast genome, in particular the distribution
33: of information coded by the ``binding sequences'' within the
34: promoter regions, are provided as input. The length distribution for
35: the promoter regions is fixed by comparing the k-core analysis of
36: the model network with that of yeast. Our results strongly point
37: to the possibility that the observed topological features are
38: generic to networks formed via sequence-matching between random
39: strings obeying certain length distributions.
40: \end{abstract}
41: \pacs{87.17.Aa, 89.75.Fb
42: }
43: \maketitle
44: 
45: \section{Introduction}
46: Development of new experimental techniques, such as DNA microarrays,
47: in the late 1990s~\cite{microarray,spellman} had a huge impact on
48: cell biology research. Such experiments generated a flood of
49: expression data for several well-studied single-cell species for which we
50: now have an almost complete list of not only the genes, but also the
51: interactions between them.
52: A cell is able to survive, grow and replicate due to the
53: collective actions of its genes. The adaptation and robustness of
54: its activities in a constantly changing environment is maintained
55: by the complex network of interactions between the genes.
56: 
57: The regulation of gene expression in a cell relies to a major
58: extent on dedicated proteins called transcription factors
59: (TFs).~\cite{Cell} These proteins come with a structure suited to
60: recognize and bind the DNA at specific locations called binding
61: sites.  The binding affinity of a TF on a certain DNA segment is
62: determined by the base sequence at the location. Each TF
63: preferentially binds certain regulatory sequences or binding
64: motifs, within the promoter regions (PRs) responsible for the
65: regulation of the gene. In the case of yeast, {\it Saccharomyces
66: cerevisiae}, a list of the binding motifs for more than 100 TFs
67: has  recently been provided.~\cite{Lee,Harbison} It was also
68: reported~\cite{Harbison} that the TF binding sites are located
69: with high probability within a window of several hundred bases
70: upstream of the transcription activation site (preceding the start
71: codon of the gene), although longer-distance action is also
72: possible.  In fact, the existence of a high-affinity binding motif
73: in a promoter region is a necessary but not sufficient condition
74: for TF-based expression regulation~\cite{Harbison}.  Moreover,
75: especially in eukaryotic cells, gene regulation relies on the
76: simultaneous action of multiple TFs.
77: 
78: We argue that the global features of the gene regulation network
79: depend very little on such details and are largely determined by
80: the distribution of the amount of shared information, or content,
81: that is required for the establishment of regulatory interactions.
82: It may be conjectured that information sharing and its 
83: distribution is the basic organizing principle which is
84: responsible for the universality of the degree distribution of gene regulatory
85: networks across diverse species~\cite{Barkai}. 
86: 
87: In this paper we propose to model the transcription regulation network
88: of yeast using the ideas of the content-based model we
89: introduced earlier~\cite{Balcan,Mungan}.  We are able to
90: faithfully reproduce the topological features of the gene
91: regulatory network of yeast considered here when the parameters of the yeast genome,
92: in particular the distribution of information coded by the ``binding
93: sequences'' of the regulatory segments, are given as input.  We compare
94: the ensemble of the resulting model networks with the data on the
95: yeast regulatory network available in different databases.
96: 
97: Gene regulatory networks can be naturally described as a directed
98: graph where the nodes are the genes. A directed edge from node A
99: to node B implies that the transcription factor produced by gene A
100: regulates the activity of gene B. Since the edges are directed,
101: one distinguishes the in-degree (the number of incoming edges),
102: the out-degree (number of outgoing edges) and the total degree of
103: a node, each with their own (possibly distinct) probability
104: distributions. These distributions serve as distinguishing
105: features of the network which a realistic model is expected to
106: reproduce. Further structural aspects of these networks are probed
107: by measures such as the clustering coefficient
108: $C(k)$~\cite{Dorogovtsev,Watts-Strogats98}, the degree-degree
109: correlation between connected
110: vertices~\cite{kk-correlation_colizza}, the ``rich-club
111: coefficient''~\cite{rich-club,rich-club_colizza}, or the $k$-core
112: decomposition~\cite{bollobas} recently
113: employed to predict new
114: interactions in various biological systems~\cite{protein_k-core,yeast_k-core,bader,amin,wuchty}.
115: 
116: This report is organized as follows: In Section \ref{modelSect} we
117: introduce our model, which we compare with the experimentally
118: determined yeast regulatory network in Section~\ref{SimSect}. A
119: discussion is provided in Section \ref{DiscSect}, while Section
120: \ref{MethodsSect} outlines our methods.
121: 
122: \section{The Model}
123: \label{modelSect}
124: 
125: 
126: The nodes of our model network correspond to genes. We
127: differentiate between genes which code for a Transcription Factor
128: (TF) and those which do not.  All genes are assumed to be possible
129: targets of regulation by one or more TFs.  Each node has a
130: sequence associated with it, representing the promoter region (PR)
131: through which the corresponding gene may be regulated. We pick a
132: given percentage of nodes (around 5\%, see Table~\ref{tabyeast}) at random, to
133: represent TF-producing genes. With each TF-producing node/gene we
134: also associate a second sequence, which stands for the binding
135: motif, which the TF recognizes and binds in the promoter region of
136: another gene.
137: 
138: We represent both the binding motifs and the PRs as random binary
139: sequences of variable length. The mechanism for establishing
140: connections between nodes of the gene regulatory network is given
141: by a string matching condition~\cite{Balcan,Mungan} between the
142: binding motifs of the TFs and all possible uninterrupted
143: subsequences of the PRs. The (directed) network of regulatory gene
144: interactions is then obtained by connecting each TF-producing node
145: ${\rm A}$ to all those nodes ${\rm B},\; {\rm B}^\prime,\;{\rm
146: B}^{\prime \prime} \ldots$ whose PRs contain the binding motif
147: associated with node ${\rm A}$. The amount of information coded in
148: these randomly generated binding motifs and promoter regions
149: constitutes the essential ingredient of our model and dictates the
150: overall topology of the resultant networks.
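
As a minimal illustration of the matching rule (the strings below are
hypothetical and much shorter than those used in the model), suppose
node ${\rm A}$ carries the binding motif $101$, while nodes ${\rm B}$
and ${\rm B}^\prime$ have the PRs $0010110$ and $0011100$, respectively.
A PR of length $L$ contains $L-l+1$ uninterrupted subsequences of
length $l$, so that here
\[
0010110 \;\to\; \{001,\,010,\,101,\,011,\,110\}\;, \qquad
0011100 \;\to\; \{001,\,011,\,111,\,110,\,100\}\;.
\]
The motif appears in the first list but not in the second, so the
directed edge ${\rm A}\to{\rm B}$ is drawn, while
${\rm A}\to{\rm B}^\prime$ is not.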
151: 
152: Experimentally determined TF binding motifs are typically short sequences
153: with a narrow length distribution, since a TF  selectively
154: binds 5-10 bases and not much more. A single TF can bind a range of
155: similar motifs, and the relative frequencies of the four bases at each
156: position within the motif contribute to the information exchanged in
157: the binding process.  The promoter regions (PRs) which lie in the
158: intergenic portions of the genome are typically longer and may
159: accommodate several binding motifs (as shown in Fig.~\ref{model}) to allow
160: graded and/or combinatorial regulation~\cite{Cell,Harbison}.
161: 
162: The bitwise length distribution of the model binding motifs was
163: derived from the yeast data provided by Harbison et al. in
164: \cite{Harbison}. The motifs were reported~\cite{Harbison} as letter
165: sequences comprising the symbols for the four bases \{ATGC\}, or
166: the symbols \{YMKRSW\} for incompletely specified bases, with the
167: corresponding lower case letters indicating a lower confidence level.
168: In order to account for such variations in the information content of
169: the motifs, we assigned two bits to each of the letters \{ATGC\}
170: appearing in the motif, signifying a high information content at
171: that position, and one bit otherwise. The
172: length of the bit sequence obtained in this way roughly corresponds
173: to the amount of shared information, measured by the Shannon
174: entropy~\cite{Shannon}, required for the binding of the TF.
175: Performing this calculation for each TF in~\cite{Harbison}, we obtain
176: the length distribution shown in Fig.~\ref{RS_dist}.
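
For illustration, consider a hypothetical motif (not taken
from~\cite{Harbison}) written as {\tt TGAYrC}. Assuming that
``otherwise'' covers both the degenerate symbols and the lower-case
letters, the rule above gives a bit length
\[
n_{\rm bits} = \underbrace{2+2+2}_{{\tt T},\,{\tt G},\,{\tt A}}
 + \underbrace{1}_{{\tt Y}} + \underbrace{1}_{{\tt r}}
 + \underbrace{2}_{{\tt C}} = 10\;.
\]
The two weights are consistent with a simple entropy count under the
simplifying assumption of equiprobable bases: a fully specified base
selects one out of four letters, $\log_2 4 = 2$ bits, whereas each of
the symbols \{YMKRSW\} leaves two bases open and carries
$\log_2 (4/2) = 1$ bit.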
177: 
178: 
179: \begin{figure}[h]
180: \vspace*{0.0cm}
181: \includegraphics[width=7cm]{Fig_model.eps}
182: \caption[]{The mechanism of interaction between the genes as envisaged
183:   in our model.  The genes are indicated by ellipses (green if
184:   TF-coding, blue otherwise), the transcription factors by triangles
185:   with the associated binding motif in the box underneath. Non-TF
186:   proteins are symbolized by the ``P'' shape, and the promoter regions
187:   (PR) upstream of each gene are shown as red boxes. Binding occurs if
188:   the binding motif matches a subsequence in the PR, as is the case
189:   here at PR4. PRs in the model are typically much longer than
190:   depicted here.}
191: \label{model}
192: \end{figure}
193: 
194: In choosing the length distribution of the promoter regions, about
195: which less is known, we are guided by the finding~\cite{Harbison}
196: that most of the probability for encountering a TF binding site is
197: contained within a window of 250 base pairs (bps) located
198: approximately 100 bps upstream of a gene. The PR length
199: distribution that we adopt within this range decays as a power
200: law, $p(l) \propto l^{-1-\mu}$ with $0\le\mu\le2$, following the
201: findings of Almirantis and Provata~\cite{Provata} for the lengths
202: of intergenic regions. We also assign a minimum length chosen to
203: coincide with the peak of the motif-length distribution shown in
204: Fig.~\ref{RS_dist}. Note that the 250 bps window does not double
205: as we move from the 4 letter alphabet to a binary one, because the
206: matching probabilities and the total number of positions at which
207: the TFs may bind are required to remain invariant under this
208: transformation.
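
A rough estimate makes this invariance explicit. Assuming, as a
simplification, that all letters are independent and equiprobable, a
motif of $n$ fully specified bases matches at a given position with
probability $4^{-n}$, and a window of $L$ bases offers $L-n+1$
positions, so that the expected number of matches is
\[
\langle m \rangle = (L-n+1)\, 4^{-n}\;.
\]
For instance, $n=6$ and $L=250$ give $\langle m \rangle = 245/4096
\approx 0.06$. In the binary representation the same motif has $2n$
bits and matches at a given position with probability
$2^{-2n} = 4^{-n}$. Keeping $\langle m \rangle$ unchanged therefore
requires the number of admissible positions, and hence the window
length, to remain of order $L$ rather than $2L$.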
209: 
210: The value of $\mu$ remains as the only adjustable
211: parameter in our model, and is determined by comparing the $k$-core
212: decomposition of the gene regulatory network of yeast as extracted
213: from experimental data (Table~\ref{tabyeast}) with that of our content-based network model,
214: as explained in the Methods section.
215: 
216: \begin{figure}[h]
217: \vspace*{0.8cm}
218: \includegraphics[width=6cm]{Fig_lenDist.eps}
219: %\rotatebox{270}{\scalebox{.4}{\includegraphics{Fig0.eps}}}
220: \caption[]{Distribution of the amount of bitwise information coded by each
221:   regulatory sequence recognized and bound by the 102 TFs in the yeast
222:   genome (compiled from the recently published data by Harbison et
223:   al.~\cite{Harbison}). This distribution is adopted as the length
224:   distribution of the random regulatory sequences (``binding motifs'') in our model.}
225: \label{RS_dist}
226: \end{figure}
227: 
228: The collection of such model networks forms an ensemble whose
229: features are a direct consequence of the string-matching mechanism
230: and the length distributions. Clearly, each realization of the
231: model will result in a different collection of random PRs and
232: binding motifs, and hence a somewhat different network. The
233: features of the ensemble turn out to be strikingly distinct from those encountered
234: in random~\cite{erdos-renyi} or scale-free~\cite{Barabasi}
235: networks. We show below that the ``signatures'' of this ensemble
236: are shared by the yeast regulatory network.
237: 
238: 
239: \section{Results}
240: \label{SimSect}
241: 
242: Our purpose here is to show that the experimentally determined
243: features of the yeast regulation network follow closely those typical
244: of the ensemble defined by our model. The topological features we will
245: focus on are the following:
246: 
247: \begin{enumerate}
248: \item {\bf degree distribution} (in-, out-, and total): the
249: distribution of the number of connections of the nodes in a network.
250: \item {\bf clustering coefficient}: the density of triangles around a node, a measure of the local modularity of the network.
251: \item {\bf degree-degree correlations}: average degree of the neighbors
252: of a node with degree $k$.
253: \item {\bf ``rich-club'' coefficient}:  a measure of the relative
254: connectivity among nodes whose degree is higher than a given number.
255: \item {\bf $k$-core structure}: the hierarchical structuring in the network.
256: \end{enumerate}
257: 
258: The precise definition of these quantities is given in the Methods
259: section.
260: 
261: Here we will report the comparison of our results with the most
262: recent Yeastract~\cite{Nucleicacids} data.  Analogous comparisons
263: with each of the data sources listed in Table~\ref{tabyeast} yield
264: similar results (see Supplementary Material) showing that our
265: conclusions are consistent with all the different data sets
266: available.
267: 
268: In order to compare our results with the available data we
269: generate an ensemble of realizations, with an average of $N_G =
270: 6000$ genes in total, 4167 of which contribute to the network on
271: average. Out of these, 202 (making up 4.8\% of the genes)
272: are TF-coding genes, taking part in a total of 14365 interactions,
273: again on average.  The corresponding values for the yeast
274: regulatory networks reported in the publicly available data bases
275: are given in Table~\ref{tabyeast}.
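
As a rough consistency check, if every interaction is counted as one
undirected edge (an approximation that ignores self-regulation and
reciprocal pairs), the mean total degree follows from the numbers
above as
\[
k_{\rm av} \simeq \frac{2 \times 14365}{4167} \approx 6.9
\]
for the model ensemble, compared with
$2 \times 12530 / 4252 \approx 5.9$ for the Yeastract entry of
Table~\ref{tabyeast}.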
276: 
277: The total degree distribution is obtained by ignoring the
278: directionality of the interactions and is different from the
279: superposition of in- and out-degree distributions. In
280: Fig.~\ref{degree-dists}a,  Yeastract data for the degree
281: distribution  is shown on top of a scatter plot obtained by
282: superposing the results from 100 artificial model genomes
283: independently generated according to the rules described in
284: Section \ref{modelSect}. In Fig.~\ref{degree-dists}b, we exhibit
285: the in-degree distribution obtained from the Yeastract data, and
286: the corresponding scatter plot.
287: 
288: \vspace{0.0cm}
289: \begin{table}
290: \caption[]{The number of interacting genes, TFs, and interacting pairs that appear
291: in the yeast regulatory network as obtained from different sources.}
292: \begin{tabular}{l|c|c|c}
293: \hline Source & Genes & TFs & Interacting Pairs \\ \hline \hline
294: Fraenkel Lab\footnote{http://fraenkel.mit.edu/Harbison/release\_v24/bound\_by\_factor/} & 2884 & 102 & 6441  \\ \hline
295: Yeastract\footnote{http://www.yeastract.com}
296:  & 4252 & 146 & 12530 \\ \hline
297: Luscombe et al.\footnote{http://sandy.topnet.gersteinlab.org/index2.html}
298:  & 3459 & 142 & 7071 \\ \hline
299: K\i rdar et al.\footnote{private communication} & 3763 & 180 & 9135 \\
300: \end{tabular}
301: \label{tabyeast}
302: \end{table}
303: 
304: 
305: The out-degree distribution of the yeast and model networks exhibits a rather
306: large scatter of points due to the relatively small number of TFs.
307: Comparing with the scatter plot obtained from 100 realizations, we find again
308: that the actual yeast data falls within the boundaries set by the model ensemble
309: (Fig.~\ref{degree-dists}c).
310: 
311: 
312: In Fig.~\ref{coefficients}, we report the three topological
313: coefficients, the clustering coefficient, the degree-degree
314: correlation and the ``rich-club'' coefficient, that go beyond
315: degree-distributions in characterizing the network. The agreement is extremely good;
316: in particular, the shoulder observed in the ``rich-club''
317: coefficient in Fig.~\ref{coefficients}(c), a feature common to both
318: gene-regulation and protein-protein interaction networks
319: \cite{kk-correlation_colizza}, is captured accurately in our model.
320: 
321: The agreement observed with the Yeastract data is not
322: source-specific, as can be seen from a comparison of the
323: topological properties of our model networks with those
324: %for the yeast networks as
325: obtained from the different sources listed in
326: Table~\ref{tabyeast} (see Supplement).
327: 
328: Finally, in Fig.~\ref{k-core}, left, the $k$-core analysis of the
329: model network is shown, which should be compared with that of the
330: Yeastract data on the right. The $k$-core analysis provides a much
331: more stringent characterization of a network than the other single
332: topological features considered above. To give an idea of the
333: sensitivity of the $k$-core analysis to the structure of the
334: network,  let us point out that, under a shuffling of the edges of
335: the network keeping the degree of each node fixed,  the typical
336: value of the maximum number of $k$-cores, $k_{\rm max}$, becomes
337: 29 rather than 9 as observed in both the real yeast regulatory
338: network and the model (see Supplement).
339: 
340: \begin{widetext}
341: 
342: \begin{figure}
343: \vspace*{0.0cm}
344: \includegraphics[width=17.0cm]{Fig_DegDists.eps}
345: \caption[]{Degree distributions extracted from the
346:   Yeastract~\cite{Nucleicacids} data (red circles), superposed on the
347:   corresponding degree distributions of 100 realizations of the model
348:   network (black dots). From left to right, a) The total degree distribution
349:   with an inset showing a log-linear plot for $k/k_{\rm av} \le 10$,
350:   where one may observe that both the model and the data points almost
351:   fall on a straight line. b) The in-degree distribution
352:   plotted on a semi-logarithmic scale. c) The out-degree distribution
353:   plotted on a log-log scale. The axes are  scaled by the average
354:   total degree in order to factor out sample-to-sample fluctuations in the network
355:   size.}
356: \label{degree-dists}
357: \end{figure}
358: 
359: 
360: \begin{figure}
361: \vspace*{0.0cm}
362: \includegraphics[width=17cm]{Fig_Coeff.eps}
363: \caption[]{ Comparison of a) the clustering coefficient $C(k)$, b)
364: the degree-degree correlations between neighboring nodes
365: $k_{nn}(k)$, and c) the rich-club coefficient $r(k)$, from left to right, for $100$
366: realizations of the model (black dots) and the Yeastract data (red
367: circles).} \label{coefficients}
368: \end{figure}
369: 
370: \end{widetext}
371: 
372: 
373: \section{Discussion}
374: \label{DiscSect}
375: 
376: The close structural similarity between the model and the real yeast regulatory network, with respect to a diverse set of criteria, suggests that they belong to
377: the same statistical ensemble of networks, formed by random strings connected by the sequence matching rule.
378: 
379: The sequence matching rule could more generally be viewed as an
380: information-theoretical constraint, where the interaction between two genes
381: requires the fulfillment of a set of conditions which we
382: symbolically represent as the matching of two random sequences. The more
383: stringent the prerequisites of the interaction, the longer the random
384: ``binding motif'' that is to be matched.
385: The length of the PR establishes the size of the phase space in which the motif is to be sought.
386: The properties of the network are then determined by the distributions obeyed by the lengths of the binding motifs as well as the promoter regions.
387: 
388: Interpreted within this information-theoretical framework, our model has sufficient
389: generality to accommodate other interactions based on lock-and-key mechanisms, such as protein networks, where the
390: interactions are dictated by certain steric and chemical conditions.
391: 
392: The topological features of the networks investigated here and
393: shown to be shared by the yeast regulatory network
394: strongly point to the possibility that these networks did not have to
395: be assembled from scratch, but rather emerged spontaneously,
396: given any sufficiently long linear code.
397: This proposition by no means minimizes the role of evolutionary pressures on such networks; instead, it
398: suggests that a network with essentially the current topology could have provided
399: a starting point for further fine-tuning. As a case in point, it has recently been demonstrated that evolution under duplication and divergence~\cite{Wagner} may leave the topological features of such networks essentially invariant~\cite{sengun}. Such a perspective will hopefully bring us a step
400: closer to envisioning how complex structures may have
401: come into existence, by shifting some of the load from the shoulders of
402: evolution onto the laws of probability.
403: 
404: \begin{widetext}
405: 
406: \begin{figure}
407: \vspace*{0.0cm}
408: \includegraphics[width=17cm]{Fig_kCore_compare.eps}
409: \caption[]{Left: The $k$-core decomposition of a single realization of our model
410:   network obtained with the visualization tool lanet-vi~\cite{lanet-vi}.
411:   The length distribution exponent of the PR sequences has been
412:   adjusted to $\mu=0.1$ to optimize the similarity with the $k$-core
413:   distribution of the Yeastract data (Right). Dots represent the nodes
414:   of the network, while edges between nodes depict connections. Nodes
415:   belonging to different $k$-shells are indicated by different colors
416:   (on the right hand side) and are arranged around concentric circles,
417:   whose average radius decreases with $k$. In particular, a node of a
418:   given shell is placed just inside (outside) the corresponding circle, if it
419:   is preferentially connected to lower (higher) $k$-shells. The sizes of the
420: dots indicate the degrees of the respective nodes; see legends to the left of the figures.
421: }
422: \label{k-core}
423: \end{figure}
424: 
425: \end{widetext}
426: 
427: \section{Methods}
428: \label{MethodsSect}
429: 
430: The degree $k$ of a node is the number of edges connected to it.
431: When the graph is directed, one distinguishes in-, out-, and
432: total-degrees of a node, with their corresponding distributions.
433: In the measures below we have ignored the directionality of the
434: network.
435: 
436: The clustering coefficient is given by the formula:
437: \[ C_i = \frac{\Delta_i}{k_i(k_i-1)/2}\;,\]
438: where $\Delta_i $ is the number of triangles that contain node $i$ and $k_i$ is its degree.
439: The quantity $C(k)$ plotted
440: in Fig.~\ref{coefficients} is the average of $C_i$ over the nodes with
441: degree $k$.
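
For example, a hypothetical node $i$ with $k_i = 4$ neighbors has
$k_i(k_i-1)/2 = 6$ possible neighbor pairs. If two of these pairs are
themselves connected, then $\Delta_i = 2$ and
\[
C_i = \frac{2}{6} = \frac{1}{3}\;.
\]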
442: 
443: The degree-degree correlation function $k_{nn}(k)$  is
444: \[
445: k_{nn}(k) = \sum_{k^\prime} k^\prime p(k^\prime \vert k),
446: \]
447: where $p(k^\prime \vert k)$ is the conditional probability that a node with
448: degree $k$ is connected to a node with degree $k^\prime$.
449: 
450: 
451: The ``rich-club'' coefficient \cite{rich-club,rich-club_colizza}
452: $r(k)$ is the total number $e_{>k}$
453: of edges connecting nodes with degree greater than $k$, normalized by the
454: maximum possible number of such connections,
455: \[
456: r(k) = \frac{2e_{>k}}{N_{>k} (N_{>k} -1)},
457: \]
458: where $N_{>k}$ is the total number of nodes with degree greater than
459: $k$.
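
As an illustration with hypothetical numbers, suppose that
$N_{>k} = 5$ nodes have degree greater than $k$ and that
$e_{>k} = 6$ edges run among them. The maximum possible number of such
edges is $5 \times 4/2 = 10$, so that
\[
r(k) = \frac{2 \times 6}{5 \times 4} = 0.6\;.
\]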
460: 
461: The $k$-core decomposition performs a successive pruning on the least
462: connected vertices of a network~\cite{bollobas}. At each step one
463: removes all nodes with a degree less than $k$ along with their edges and
464: continues in this manner until all
465: nodes have at least degree $k$. The remaining nodes constitute
466: the $k$-core. Next, $k$ is incremented by one, and the process
467: is repeated until no nodes are left. The $k$-shell is defined as
468: the set of nodes that belong to the $k$-core, but not the $(k+1)$-core.
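
A small hypothetical example illustrates the procedure. Take a
triangle of nodes $a$, $b$, $c$ with a fourth node $d$ attached to $a$
only, so that the degrees are $(k_a, k_b, k_c, k_d) = (3,2,2,1)$. No
node has degree less than one, so the $1$-core is the whole graph. For
$k=2$, node $d$ is removed, after which $a$, $b$ and $c$ each have
degree two and form the $2$-core. For $k=3$ all remaining nodes are
pruned. Hence
\[
\mbox{$1$-shell} = \{d\}\;, \qquad
\mbox{$2$-shell} = \{a,b,c\}\;, \qquad
k_{\rm max} = 2\;.
\]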
469: 
470: Once the shape of the binding motif length distribution, the width of the PR window, as well
471: as the functional form of the PR length distribution have been fixed through the available
472: biological data, the only remaining adjustable parameter in our model is the exponent
473: $\mu$ of the power law distribution of PR lengths, $p(l) \propto l^{-1-\mu}$. The $k$-core decomposition turns
474: out to provide the most detailed and stringent topological characterization of the
475: network, with both the total number of shells, and the distribution of the nodes
476: over the shells, being contained in the $k$-core plots (see Fig.~\ref{k-core}). The
477: $k$-core plots also incorporate such qualitative features  as inter- and
478: intra-shell connectivity. We have therefore used qualitative and quantitative
479: comparison of the $k$-core plots for the Yeastract and the model network to determine $\mu$.
480: The best agreement was obtained for $\mu=0.1$.  Once $\mu$ has been fixed, no further
481: adjustment is needed in order to obtain the extremely close matching that is found
482: between the degree distributions, clustering coefficients, degree
483: correlations and the rich-club coefficient, as displayed in
484: Figs.~\ref{degree-dists} and~\ref{coefficients}.
485: 
486: We cannot rule out the possibility of obtaining similar agreement between our model and the real genomic network with respect to the features considered here, for a different choice of the functional form of the length distribution for the PR sequences, once more determining an adjustable parameter from a  comparison of the $k$-core plots. However, the present choice seems to be the only reasonable one within the physical constraints and the available information.
487: 
488: 
489: \section{Acknowledgments}
490: 
491: We would like to thank Bet\"ul K\i rdar and Beste K\i n\i ko\u glu
492: for the use of their data and useful discussions. It is a pleasure
493: to thank Alessandro Vespignani and Ignacio Alvarez-Hamelin for
494: bringing $k$-core analysis to our attention, and for the use of
495: their web-based $k$-core analysis tool. AE would like to thank
496: Tam\'as Vicsek and Andr\'as Czir\'ok for a useful discussion and
497: is grateful for partial support from the Turkish Academy of
498: Sciences.
499: 
500: \begin{thebibliography}{99}
501: 
502: \bibitem{microarray} Lockhart, D.J., Winzeler, E.A. (2000)
503: %Genomics, gene expression and DNA arrays.
504: {\it Nature} {\bf 405}, 827-836.
505: 
506: \bibitem{spellman} Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher,
507: B. (1998)
508: %Comprehensive identification of cell cycle-regulated genes of
509: %the yeast Saccharomyces cerevisiae by microarray
510: %hybridization.
511: {\it Molecular Biology of the Cell} {\bf 9}, 3273-3297.
512: 
513: \bibitem{Cell} Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., Walter, P. (2002) in {\it Molecular Biology of the
514: Cell}. Chapter 9. (Garland Science, N.Y.).
515: 
516: \bibitem{Lee} Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I. et al. (2002)
517: %Transcriptional Regulatory Networks in Saccharomyces cerevisiae .
518: {\it Science}, {\bf 298}, 799-804.
519: 
520: \bibitem{Harbison} Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N.J.,
521: Macisaac, K.D., Danford, T.W., Hannett, N.M., Tagne, J.B., Reynolds, D.B., Yoo, J., et al. (2004)
522: %Transcriptional regulatory code of a eukaryotic genome.
523: {\it Nature} {\bf 431}, 99-104.
524: 
525: 
526: \bibitem{Barkai} Bergmann, S., Ihmels, J., Barkai, N. (2004)
527: % Similarities and differences in genome-wide expression data of six organisms
528: {\it PLoS Biol.} {\bf 2}, 85-93.
529: 
530: \bibitem{Balcan} Balcan, D., Erzan, A.
531: (2004) {\it Eur. Phys. J.} B {\bf 38}, 253.
532: 
533: \bibitem{Mungan} Mungan, M., Kabakcioglu, A., Balcan, D.,
534: Erzan, A.
535: %Analytical solution of a stochastic content-based network model,
536: (2005) {\it J. Phys. A} {\bf 38} (44), 9599-9620.
537: 
538: 
539: \bibitem{Dorogovtsev}
540: Dorogovtsev, S.N., Mendes, J.F.F.
541: % Evolution of Networks.
542: (2002) {\it Adv. Phys.} {\bf 51}, 1079--1187.
543: 
544: \bibitem{Watts-Strogats98} Watts, D.J. \& Strogatz, S.H.
545: % Collective dynamics of `small-world' networks.
546: (1998) {\it Nature} (London) {\bf 393}, 440-442.
547: 
548: \bibitem{kk-correlation_colizza} Colizza, V., Flammini, A., Maritan, A.,
549: Vespignani, A.
550: %Characterization and modeling of protein-protein interaction networks,
551: (2005) {\it Physica A} {\bf 352}, 1-27.
552: 
553: 
554: \bibitem{rich-club} Zhou, S. \& Mondragon, R.J.
555: %The rich-club phenomenon in the Internet topology,
556: (2004) {\it IEEE Commun. Lett.} {\bf 8}, 180-182.
557: 
558: 
559: \bibitem{rich-club_colizza} Colizza, V., Flammini, A., Serrano, M.A. \&
560: Vespignani, A.
561: %Detecting rich-club ordering in complex networks,
562: (2006) {\it Nature Physics} {\bf 2}, 110-115.
563: 
564: \bibitem{bollobas} Bollobas, B., (1998) {\it Modern Graph Theory} (Springer Verlag, New York).
565: 
566: \bibitem{protein_k-core} Tong, A.H.Y., Drees, B., Nardelli, G.,
567: Bader, G.D., Brannetti, B., Castagnoli, L., Evangelista, M., Ferracuti, S.,
568:  Nelson, B., Paoluzi, S. {\it et al.}
569: %A Combined Experimental and
570: %Computational Strategy to Define Protein Interaction Networks for
571: %Peptide Recognition Modules,
572: (2002) {\it Science} {\bf 295}, 321-324.
573: 
574: \bibitem{yeast_k-core} Bader, G.D. \& Hogue, C.W.V.
575: %Analyzing yeast
576: %protein-protein interaction data obtained from different sources,
577: (2002) {\it Nature Biotechnology} {\bf 20}, 991-997.
578: 
579: \bibitem{bader} Bader, G.D. \&  Hogue, C.W.V.
580: % An automated method for  finding molecular complexes in large protein
581: % interaction network
582: (2003) {\it BMC Bioinformatics}, {\bf 4}(2).
583: 
584: \bibitem{amin} Altaf-Ul-Amin, M., Nishikata, K., Koma, T., Miyasato, T., Shinbo, Y., Arifuzzaman, M., Wada, C., Maeda, M., Oshima, T., Mori, H. {\it et al.}
585: % Prediction of Protein Functions Based on
586: % K-Cores of Protein-Protein
587: % Interaction Networks and Amino Acid Sequences
588: (2003) {\it Genome Informatics}, {\bf 14},  498-499.
589: 
590: \bibitem{wuchty} Wuchty, S. \& Almaas, E.
591: % Peeling the yeast protein network
592: (2005) {\it Proteomics}, {\bf 5}(2), 444-449.
593: 
594: 
595: \bibitem{Shannon} Shannon, C. E.,
596: % Communication in the Presence of Noise,
597: (1949) {\it Proc. IRE} {\bf 37}, 10-21.
598: 
599: \bibitem{Provata} Almirantis, Y. and Provata, A. (1999) {\it J. Stat. Phys.},
600: {\bf 97}, 233-262.
601: 
\bibitem{erdos-renyi} Erd\H{o}s, P. \& R\'{e}nyi, A.
603: %On the evolution of random graphs,
604: (1960) {\it Publ. Math. Inst. Hung. Acad. Sci.} {\bf 5}, 17-60.
605: 
\bibitem{Barabasi} Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N. \&
Barab\'asi, A.-L.
% The large-scale organization of metabolic networks,
(2000) {\it Nature} {\bf 407}, 651-654; Albert, R., Jeong, H. \& Barab\'asi, A.-L.
% The diameter of the world-wide-web,
(1999) {\it Nature} {\bf 401}, 130-131.
612: 
613: 
614: \bibitem{Nucleicacids} Teixeira, M.C., Monteiro, P., Jain, P.,
615: Tenreiro, S., Fernandes, A.R., Mira, N.P., Alenquer, M., Freitas,
A.T., Oliveira, A.L. \& S\'a-Correia, I. (2006)
617: %The YEASTRACT database: a tool for the analysis of transcription
618: %regulatory associations in Saccharomyces cerevisiae,
619: {\it Nucl. Acids Res.} {\bf 34}, D446-451.
620: 
\bibitem{lanet-vi} Alvarez-Hamelin, J.I., Dall'Asta, L.,
Barrat, A. \& Vespignani, A.
%k-core decomposition: a tool for the visualization of large scale networks.
(2005) arXiv preprint cs.NI/0504107.
625: 
\bibitem{Wagner} Wagner, A. (2001) {\it Mol. Biol. Evol.} {\bf 18}, 1283.

\bibitem{sengun} \c Seng\"un, Y. \& Erzan, A. (2006) {\it Physica A} {\bf 365}, 446-462.

\bibitem{BA} Albert, R. \& Barab\'asi, A.-L. (2002) {\it Rev. Mod. Phys.} {\bf 74}, 47-97.
631: 
632: \end{thebibliography}
633: 
634: 
635: \begin{widetext}
636: \newpage
637: 
638: 
639: 
640: \newpage
641: %\bigskip
642: {\bf Supplementary Material 1}
643: 
{\bf Comparison with yeast data from different databases}
645: 
646: %\begin{widetext}
647: 
648: \begin{figure}[h]
649: \vspace*{0.0cm}
650: \includegraphics[width=12cm]{Suppl_Fig.eps}
\caption[]{The network statistics extracted from the sources
listed in Table~\ref{tabyeast}, superposed on the simulation results for
100 realizations of the model network (black dots). The agreement is extremely good for all of these data sets,
which almost completely cover, but do not exceed, the phase space
of our model. (Black, red, blue, green yellow and maroon correspond to the model, Yeastract,
Fraenkel Lab, K\i rdar and Luscombe data, respectively.)
657: }
658: \label{supp_fig1}
659: \end{figure}
660: 
661: \newpage
662: 
663: {\bf Supplementary Material 2}
664: 
665: {\bf Comparison with Randomized Networks}
666: 
As a further check on the significance of our results, we also
compared the clustering coefficients, the degree-degree correlations
and the rich-club coefficients of the Yeastract data
with those obtained after randomly reconnecting the edges of the network while keeping the degree of each node fixed. In this process, the directionality of the bonds is ignored.
The comparison of the topological coefficients of the randomized yeast and randomized model networks with those of the yeast network, shown in Fig.~\ref{randomized}, confirms that
the observed agreement between the yeast and model networks is not spurious.
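
One common way to realize such a degree-preserving randomization is the repeated pairwise edge swap. We sketch it here only as an illustration, under the assumption that this or an equivalent scheme is used; the node labels $a$, $b$, $c$, $d$ are introduced for this purpose. Two edges are picked at random and their end points are exchanged,
\[
\{(a,b),\,(c,d)\} \;\longrightarrow\; \{(a,d),\,(c,b)\} ,
\]
the move being rejected whenever it would create a self-loop or a multiple edge. Each of the four nodes loses one edge and gains one, so that the degrees $k_a$, $k_b$, $k_c$ and $k_d$, and hence the degree distribution, are unchanged, while correlations beyond those imposed by the degree sequence are progressively erased.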
673: 
674: %\begin{widetext}
675: 
676: \begin{figure}[h]
677: \vspace*{0.0cm}
678: \includegraphics[width=17cm]{Fig_Coeff_randomized.eps}
\caption[]{a) The clustering coefficient, b) the degree-degree
correlations between neighboring nodes, and c) the rich-club
coefficient of the Yeastract data (red circles), compared with the same
quantities obtained by randomizing the Yeastract data (red dots) and
by randomizing a realization of the model network (black dots),
keeping the degrees of the individual nodes, and thereby the degree distributions, fixed.}
685: \label{randomized}
686: \end{figure}
687: %\end{widetext}
688: 
689: %\newpage
690: 
In Fig.~\ref{k-core-random} we display the effect of the randomization procedure described above on the $k$-core plots. It is instructive to note that, while in the yeast and model networks a large fraction of the connections is between nearby shells, the situation is reversed in the randomized networks, where there is a high degree of intra-shell connectivity, as can be seen by comparing with Fig.~\ref{k-core}.
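
For reference, the standard definition underlying these plots can be written as follows; the symbols $H_k$, $S_k$ and $k_i^{(H_k)}$ are notation introduced for this note only. The $k$-core $H_k$ of a graph is the largest subgraph in which every node has at least $k$ neighbors within the subgraph,
\[
k_i^{(H_k)} \ge k \quad \mbox{for all } i \in H_k ,
\]
and it is obtained by recursively removing all nodes of degree less than $k$. The cores are nested, $H_{k+1} \subseteq H_k$, and the $k$-shell is the set of nodes that belong to the $k$-core but not to the $(k+1)$-core,
\[
S_k = H_k \setminus H_{k+1} .
\]
With this definition, the number of shells is naturally read as the number of nonempty sets $S_k$.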
692: 
693: \begin{figure}[h]
694: \vspace*{0.0cm}
695: \includegraphics[width=17cm]{Fig_kCore_compare_randomized.eps}
\caption[]{The $k$-core analysis of the randomized versions of the model (left panel) and Yeastract (right panel) networks yields results that differ quantitatively and qualitatively from the originals. The number of shells has gone up to 29 from 9, and the much higher intra-shell than inter-shell connectivity (as can be seen by following the edges) indicates that the hierarchical nature of the yeast network, which is faithfully reproduced by the model, is destroyed by the randomization process.
697: }
698: \label{k-core-random}
699: \end{figure}
700: 
701: \newpage
702: 
703: {\bf Supplementary Material 3}
704: 
{\bf The $k$-core structure of the Balcan-Erzan and Barab\'asi-Albert
Networks}
707: 
In Fig.~\ref{k-core-others} we show the $k$-core structure of the Balcan-Erzan~\cite{Balcan}
and Barab\'asi-Albert~\cite{BA} networks, two models for complex networks. Note the absence
of well-defined hierarchical structures.
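
The flat structure of the BA network can be made plausible by a simple counting argument, sketched here under the assumption that each new node attaches its two edges to two distinct existing nodes; the symbols $N$, $m$ and $E$ are introduced for this note. With the parameters given in the caption of Fig.~\ref{k-core-others}, namely $N = 5000$ nodes, $m = 2$ edges per added node and an initial fully connected cluster of four nodes, the number of edges is
\[
E = \frac{4 \cdot 3}{2} + m\,(N - 4) = 6 + 2 \cdot 4996 = 9998 ,
\]
so that the mean degree is $2E/N \simeq 4$. If the added nodes are removed in the reverse order of their arrival, each node has, at the moment of its removal, only the $m = 2$ edges with which it entered the network. Hence no added node can belong to a $3$-core, and all nodes except the initial four lie in the single shell $k = 2$.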
711: 
712: \begin{figure}[h]
713: \vspace*{0.0cm}
714: \includegraphics[width=17cm]{Fig_kCore_others.eps}
\caption[]{The $k$-core analysis of the content-based network of
Balcan and Erzan~\cite{Balcan} (left panel) and the
Barab\'asi-Albert (BA) model~\cite{BA} (right panel). In the left panel, the total
length of the single sequences associated with all of the nodes is
$L = 15000$. The individual sequences obey the length distribution
$p(l) \propto q^l$, with $q = 0.95$. The BA model network (right
panel) has 5000 nodes, and is built by starting from a fully
connected four-cluster and adding nodes one at a time, each with two edges.
In the $k$-core plot for the latter, only 5\% of the edges are
shown for better visibility. } \label{k-core-others}
725: \end{figure}
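
As a worked illustration of the length distribution quoted in the caption of Fig.~\ref{k-core-others}, and assuming that the lengths take the values $l = 0, 1, 2, \ldots$ (the support is our assumption here), normalization gives
\[
p(l) = (1-q)\, q^{l}, \qquad \langle l \rangle = \sum_{l \ge 0} l\, p(l) = \frac{q}{1-q} ,
\]
so that $q = 0.95$ corresponds to a mean sequence length $\langle l \rangle = 19$. If the lengths instead start from $l = 1$, one has $p(l) = (1-q)\, q^{l-1}$ and $\langle l \rangle = 1/(1-q) = 20$.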
726: 
727: 
728: \newpage
729: %\bigskip
730: 
731: {\bf Supplementary Material 4}
732: 
733: {\bf Ranking of overlapping sets of regulated genes and motif inclusion}
734: 
Here we report a statistical fact in support of
the basic assumption underlying our model. The matching condition we
employ dictates a certain correlation between the sets of genes
regulated by each TF: if the binding motif of a TF (A) is embedded in that
of a TF (B), then the set of genes \{G$_i$\}$_{\mbox{B}}$ regulated by
TF$_{\mbox{B}}$ in our model is a subset of \{G$_i$\}$_{\mbox{A}}$. A
similar investigation of the yeast databases listed below reveals that
the top 50\% of the TF pairs related by the motif inclusion relation
above rank in the top 3\% when all the TF pairs are listed according
to the overlap of their \{G$_i$\} sets. The actual ranking
of the TF pairs obtained among all possible pairs of 102 TFs with
known binding motifs is shown in Fig.~\ref{TFcorr}.
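
The inclusion relation and the ranking can be stated compactly as follows. The symbols $m_{\rm A}$, $G_{\rm A}$ and $O_{\rm AB}$, and in particular the normalization of the overlap, are assumptions made here for illustration and need not coincide with the measure used in Fig.~\ref{TFcorr}. Writing $m_{\rm A}$ for the binding motif of TF$_{\mbox{A}}$, $G_{\rm A}$ for the set of genes it regulates, and $m_{\rm A} \subseteq m_{\rm B}$ for ``$m_{\rm A}$ occurs as a contiguous substring of $m_{\rm B}$'', the matching condition implies
\[
m_{\rm A} \subseteq m_{\rm B} \;\Longrightarrow\; G_{\rm B} \subseteq G_{\rm A} ,
\]
since any promoter region that contains $m_{\rm B}$ necessarily contains $m_{\rm A}$. One possible overlap measure is
\[
O_{\rm AB} = \frac{|G_{\rm A} \cap G_{\rm B}|}{\min\left(|G_{\rm A}|,\,|G_{\rm B}|\right)} ,
\]
which takes its maximal value $O_{\rm AB} = 1$ for any pair related by inclusion in the model. Pairs related by inclusion are therefore expected to rank near the top of the overlap list.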
747: 
748: 
749: \begin{figure}[h]
750: \vspace*{0.8cm}
751: \includegraphics[width=8cm]{Fig_TFcorr.eps}
\caption[]{Correlation between the sets of genes regulated by
  the TFs with similar binding motifs.  The vertical axis is the
  percentage overlap of the two sets of genes regulated by an
  arbitrary pair of TFs, which are ranked on the horizontal axis
  according to their overlap.  The red vertical lines mark those pairs
  of TFs that are also related by binding motif inclusion. The
  accumulation of the red lines toward the left of the graph is indicative
  of the correlation described in the text.}
760: \label{TFcorr}
761: \end{figure}
762: 
On the other hand, the more straightforward expectation that TFs with
short binding motifs should regulate more genes is not verified by the same
data. This curious fact probably points to certain sequence
correlations arising from the duplication and divergence
processes~\cite{Wagner}
that distort the occurrence statistics of the binding motifs in PRs. Note that
the result in Fig.~\ref{TFcorr} is robust to such deviations from the
unbiased probabilities for the occurrence of different strings.
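
The expectation referred to above follows from a simple estimate, sketched here for uncorrelated and unbiased random sequences; the symbols $r$, $l$, $\ell_{\rm PR}$ and $\langle n \rangle$ are introduced for this note. For an alphabet of $r$ letters, a given motif of length $l$ matches at a given position with probability $r^{-l}$, so that the expected number of occurrences in a PR of length $\ell_{\rm PR} \ge l$ is
\[
\langle n \rangle = (\ell_{\rm PR} - l + 1)\, r^{-l} ,
\]
which decreases roughly geometrically with $l$. For nucleotide sequences ($r = 4$), each additional letter in the motif reduces the expected number of random matches by about a factor of four. Sequence correlations invalidate the factor $r^{-l}$, but not the inclusion argument, which holds for any sequence statistics.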
771: 
772: \end{widetext}
773: 
774: \end{document}
775: 
776: