1: \documentclass[12pt,a4paper,aps,rmp,onecolumn]{revtex4}
2: \usepackage{latexsym}
3: \usepackage{times}
4: \usepackage{graphicx}
5: \usepackage[tight,FIGTOPCAP]{subfigure}
6: %\usepackage{overcite}
7: \usepackage{amsmath,amssymb,enumerate,mathrsfs,boxedminipage,fancybox}
8:
9:
10: \newcommand{\eulang}[2]{\genfrac{\langle}{\rangle}{0pt}{}{#1}{#2}}
11: \newcommand{\su}[2]{\genfrac{(}{)}{0pt}{}{#1}{#2}}
12:
13:
14: \def\de#1#2{\frac {d#1}{d#2}}
15: \def\dde#1#2{\frac {d^2#1}{d#2^2}}
16: \def\dep#1#2{\frac {\partial#1}{\partial#2}}
17:
18: \def\e#1{{\rm e}^{#1}}
19:
20: \def\tr#1{{\rm tr} \: [{#1}]}
21:
22:
23: \def\Media#1{\big < {#1}\big >}
24: \def\media#1{< {#1}>}
25:
26: \def\mezzo{\frac{1}{2}}
27:
28: \def\O{\Omega}
29: \def\o{\omega}
30: \def\U{\bigcup}
31: \def\I{\bigcap}
32: \def\eps{\epsilon}
33: \def\s{\sigma}
34: \def\S{\Sigma}
35: \def\a{\alpha}
36: \def\b{\beta}
37: \def\t{\tau}
38: \def\l{\lambda}
39: \def\ub{\frac{1}{\beta}}
40: \def\d{\delta}
41: \def\D{\Delta}
42: \def\g{\gamma}
43: \def\G{\Gamma}
44: \def\L{{\mathcal L}}
45: \def\ene{{\mathcal E}}
46: \def\h{\hbar}
47:
48: \def\bi{\begin{itemize}}
49: \def\ei{\end{itemize}}
50:
51: \def\be{\begin{enumerate}}
52: \def\ee{\end{enumerate}}
53:
54: \def\beq{\begin{equation}}
55: \def\eeq{\end{equation}}
56:
57: \def\bdm{\begin{displaymath}}
58: \def\edm{\end{displaymath}}
59:
60: \def\bsp{\begin{split}}
61: \def\ensp{\end{split}}
62:
63: \def\C{\subset}
64: \def\und{\underline}
65: \def\qua{\quad ; \quad}
66: \def\es{\boxgiallo{Esempio}}
67: \def\nota{\Frecciaverde ~~ NOTA:~~}
68: \def\Frecciagialla { \boxgiallo {\( \Rightarrow \)} ~~}
69: \def\verdegiu{\begin{center} \boxverde{$ \Downarrow $} \end{center}}
70: %%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%
71:
72: \bibliographystyle{apalike}
73:
74:
75: \begin{document}
76:
77: \author{M. Cosentino Lagomarsino}
78: \affiliation{UMR 168 / Institut Curie, 26 rue d'Ulm 75005 Paris, France}
79: \email[ e-mail address: ]{mcl@curie.fr}
80: %
81: \author{P. Jona}
82: \affiliation{Politecnico di Milano, Dip. Fisica, Pza Leonardo Da Vinci
83: 32, 20133 Milano, Italy}
84: \email[ e-mail address: ]{ patrizia.jona@fisi.polimi.it}
85: %
86: \author{B. Bassetti}
87: \affiliation{Universit\`a degli Studi di Milano, Dip.
88: Fisica, and I.N.F.N. Via Celoria 16, 20133 Milano, Italy }
89: \email[e-mail address: ]{ bassetti@mi.infn.it }
90: % {+39 - (0)2 - 50317477 ; fax +39 - (0)2 - 50317480}
91:
92:
93: \title{The large-scale logico-chemical structure of a transcriptional
94: regulation network}
95:
96:
97: \begin{abstract}
98: Identity, response to external stimuli, and spatial architecture of a living
99: system are central topics of molecular biology. Presently, they are largely
100: seen as a result of the interplay between a gene repertoire and the
101: regulatory machinery of the cell. At the transcriptional level, the
102: \emph{cis}-regulatory regions establish sets of interdependencies between
103: transcription factors and genes, including other transcription factors.
104: These ``transcription networks'' are too large to be approached globally
105: with a detailed dynamical model. In this paper, we describe an approach to
106: this problem that focuses solely on the \emph{compatibility} between gene
107: expression patterns and signal integration functions, discussing
calculations carried out on the simplest, Boolean, realization of the model, and
a first application to experimental data sets.
110: \end{abstract}
111:
112: \maketitle
113:
114:
115: \section{Introduction}
116:
117:
118:
% TO BE REWRITTEN ENTIRELY: MUCH OF THIS BELONGS DIRECTLY IN THE MODEL SECTION
% it has no structure
121:
122: Regulation can be defined as the set of physico-chemical constraints operating
123: within a living cell that modulate the expression of the cell's genes. In the
124: present view of molecular biology, regulatory processes are often used as a
125: primary causal explanation for many phenomena, playing a role in this
126: discipline that is comparable to the role fundamental interactions play in
physics. In fact, it is widely believed that the repertoire of signal
responses (and, more generally, of all the information processing and
129: structural tasks) of living systems is encoded in interconnected threads of
130: genes regulating the activity of each other. These networks of
131: interdependencies are still largely uncharacterized, although they have begun
to fall within reach of systematic experimentation in recent
years~\cite{HCP04,NRF04,BLA+04,WA03,LRR+02}.
134: %cita generali:
135:
136: Considering the so-called ``central dogma'' of molecular biology,
137: \begin{displaymath}
138: \textrm{DNA} \stackrel{\textrm{transcription}}{\longrightarrow}
\textrm{mRNA} \stackrel{\textrm{translation}}{\longrightarrow}
140: \textrm{protein} \stackrel{\textrm{folding}}{\longrightarrow}
141: \textrm{function},
142: \end{displaymath}
regulation processes can intervene at each of the separate steps (and also in
different sub-steps). Regulation acting on the process of transcription, or
transcriptional regulation, is to date the best understood among all
the possible regulation mechanisms.
147: %cita
148:
149:
150: \begin{figure}[htbp]
151: \centering
152: \includegraphics[width=.85\textwidth]{Fig1}
153: \caption{Schematics of our representation of a signal integration function
154: at the \emph{cis-}regulatory region of a gene as a constraint on the gene
155: expression variables. For general variables, the constraint involves
156: minimization of the free energy of the Shea-Ackers model. In GR1, the
157: constraint is Boolean.}
158: \label{fig:Fagraph}
159: \end{figure}
160:
161: Transcriptional regulation networks are defined starting from the basic
162: functional building blocks involved in transcription. These are (i) the
163: promoter region of a gene or operon along the DNA sequence, which contains the
164: \emph{cis} regulatory binding sites for the transcription factors, (ii) the
165: transcription factors, which are proteins that regulate the binding of
166: RNA-polymerase, and (iii) RNA-polymerase, the protein complex that performs
167: transcription of a gene or an operon in mRNA form~\cite{Pta92,ABL+03}. The
168: amount of mRNA transcribed is related to the expression of a particular gene
only if one takes for granted all the other steps that lead to a functional
protein. If this (big) leap is accepted, the ``state'' of a cell is identified
with the mRNA concentration of its genes. Experimentally, this is particularly
sound for prokaryotes and simple unicellular organisms, but it is often assumed in
173: more complex contexts, for example in DNA microarray experiments.
174: %
175: Under this assumption, the locations and orientations of the binding sites for
176: transcription factors, as well as the affinity of the transcription factors to
177: different binding sites, determine the expression levels of a gene in response
178: to changes in the active transcription factor concentrations inside the cell.
179: In turn, the concentration of active transcription factors (the ones that can
180: actually bind) encodes the configuration of the environment, for example
181: through degradation or activation by internal and external signaling
182: molecules.
183: %
184: A \emph{cis-}regulatory region can contain many binding sites for many
185: transcription factors which act in cooperation (or competition) on the
186: promoter region, to control in a combinatorial way the binding of RNA
187: polymerase. This process, referred to as signal integration, is the
188: logic heart of the network.
189: %da qui in poi quasi tutto in modello
190: %
191: %FIGURA1
192: A transcriptional regulation network can be represented as a hypergraph
193: containing both gene expression (``variable'') nodes and signal integration
194: (``function'') nodes. The connectivity is the source of the network complexity
195: (Fig.~\ref{fig:grafone}).
196: %
197: \begin{figure}[htbp]
198: \centering
199: \includegraphics[width=.8\textwidth]{Grafone}
200: \caption{Graph representation of a transcription network. Each diamond node
201: represents a signal integration function, while each black circle is a
202: variable. The directionality of the constraint is represented graphically
203: by labeling the diamonds with IN and OUT on two different sides.}
204: \label{fig:grafone}
205: \end{figure}
206:
207:
208:
209: %piu' generale che transcription??? PER ORA NO
210: Thus, a transcriptional regulation network can work independently as a
211: computational unit in a living cell, being able to make decisions on which
212: genes will be switched on at different times. Studies focusing simply on the
\emph{structure} of the underlying graph have led to interesting
214: results~\cite{BLA+04,SMM+02,WtW04}. However, characterizing and predicting
215: gene expression patterns given a network structure remains an enormous
216: challenge. Two main problems exist. Firstly, the networks are only partially
217: characterized experimentally~\cite{HCP04,NRF04}. In some
218: instances~\cite{BLA+04,DRO+02}, the wiring diagram is well known, but in
219: general the functions are described only qualitatively, typically with
220: annotations such as activation, repression or dual effects, and little is
221: known about their actual structure.
222: %
223: Secondly, transcription networks are fairly large. While detailed models or
224: simulations work well on small (sub-)systems~\cite{MA97,ARM98}, typically a
225: coarse grained approach is needed.
226: %
227: %This problem has also an influence on the choice of the dynamics.
228: Microscopically, it is well accepted that the Gillespie algorithm~\cite{GD77},
229: %CITA
230: while disregarding spatial correlations, correctly describes the stochastic
asynchronous events of the chemical kinetics involved. On the other hand, with a
232: mesoscopic average in time, it is still unclear what the emergent time scales
233: might be. In particular, the pioneering approach of
234: Kauffman~\cite{Kau69,Kau69b,Kau93,Kau04}, suggesting a synchronous
235: deterministic dynamics for a Boolean (i.e. ON/OFF) representation of the
236: network is still being debated, both in its assumptions and in its
237: results~\cite{Ger04,KD05,SaK03,BP98}.
238:
239: % CITA CONDMAT CICLI!!!
240:
241:
% We build an equilibrium model that accounts for the expression of the
% genes, in addition to the structure
244: We consider the second problem, and develop a model (called GR, from Gene
245: Regulation), that focuses, rather than on pure dynamics, on the compatibility
246: between gene expression patterns and signal integration functions. The
247: compatibility constraints are generated by the clauses encoded by signal
248: integration functions at \emph{cis}-regulatory regions. Our framework
249: describes the system as a combinatorial optimization problem where $N$
250: variables, the gene expression levels, are subject to $M$ constraints,
251: representing the signal integration nodes. Simply put, a cell with $N$ genes
252: can express them in exponentially many ways, $2^N$ in the Boolean ON/OFF
253: representation. However, the cell never explores all the possible patterns of
expression. It generates only clusters of correlated configurations. To fix
ideas, one can think of the very elementary example of the cI-cro switch
256: of $\lambda$-phage~\cite{Pta92,Tho73}. In this case one could observe the
257: states 10 (where cI is ON, and cro is OFF), 01, or perhaps 00, but never 11,
258: because this state is ruled out by the signal integration function.
259: % Bruno - ma a me sembra un po' ridondante
260: In a cell, with the added complexity of the regulation network, we can think
261: that many of the states are not observable for the same compatibility reasons.
The approach is easily connected to a detailed thermodynamic treatment of
263: transcription from a
264: %nb da equilibrium diventa local equilibrium
265: signal integration node on one side~\cite{BGH03}, and to the statistical
266: mechanics of spin glasses and combinatorial optimization problems on the
267: other~\cite{MPZ02}.
268:
269:
270: %% Despite of the wealth of knowledge on network structure and expression
271: %% profiles in small organisms such as E.~coli and budding yeast, there
272: %% is to date no transcription network for which all the function nodes
273: %% are completely characterized.
274: %
275: Rather than the detailed quantitative prediction of mRNA expression states,
276: the current challenge is to set a conceptual framework which can help to
277: interpret the observations in concrete examples, integrating as much as
278: possible with known data.
279: %
As the simplest example of this, we study the behavior of the Boolean
281: version of our model on the network structure of E.~coli.
282: %% Analysis of the structure of the E.~coli and yeast transcription
283: %% networks shows a modular structure. The modules that form them, called
284: %% \emph{network motifs}, carry elementary identifiable functions in the
285: %% life-cycle of these organisms. However, while some of these
286: %% motifs are simple, and can be interpreted easily as ``circuitry''
287: %% elements such as filters or amplifiers, others, sometimes referred to
288: %% as ``dense overlapping regulons'', are intrinsically combinatorial,
289: %% and ``wire'' many genes with many others.
290: %non si sa se
291: % conditions.
292: %cita ptashne
293: % più se mi gaso dico anche che è un concetto ristretto e unificato
294: % della nozione di complessità
295: Building up from this simplest case, the aim is to analyze increasingly
296: realistic network structures, in order to generate a theory that, while being
297: consistent with the generic qualitative features of regulation networks, is
298: useful to analyze single instances and realizations. This is maximally
299: important as biological knowledge is constructed on specificities, and not on
300: typical case behavior.
301:
302:
303: %Structure of the paper!
304:
305: This paper is structured as follows. Sec.~\ref{sec:model} introduces the model
abstractly, as an optimization problem, which, in sec.~\ref{sec:shea}, is
307: connected to the more concrete thermodynamic Shea-Ackers model of
308: transcriptional regulation. Sec.~\ref{sec:satmap} abandons this general
309: setting, and takes on the simplest possible formulation of the model, GR1,
310: which has Boolean functions and variables, showing that this case maps
directly to a so-called satisfiability problem (Sat). The aim of
sec.~\ref{sec:leaf} is to analyze the typical number $\mathcal{N}$ of gene
patterns of a random instance of GR1, starting from the case of fixed
connectivity. The ``leaf removal'' algorithm allows this analysis to be
carried out in the annealed approximation. An important premise is the fact,
which is evident from the data~\cite{SMM+02}, that some genes are essentially
317: ``free'' from the point of view of transcription. These are mainly controllers
and are connected to external stimuli. The expression of the rest is
conditioned on the state of other genes. The algorithm allows one to define the
320: ``complex combinatorial core'' (CCC) of the network, as the set of genes able
321: to control its global state. The number of non-controlled, or ``free''
322: variables in the core determines the complexity of the system. The phase
323: diagram shows three distinct regimes of gene control. In the first (UNSAT),
324: there are no free genes in the core, and the system cannot control the
325: simultaneous expression of all its genes. In the second regime (``complex
326: control'' or HARD-SAT), the core contains free genes that control, both
327: directly and indirectly, many others. The general dynamics is residual (many
328: variables are fixed, the others can change). In the third regime, the core is
329: empty. Each free gene (which is external to the core) controls the state of a
330: small number of genes (``simple control'', or SAT phase).
331: Sec.~\ref{sec:selfav} concerns itself with the \emph{width} of the
332: distribution of $\mathcal{N}$, which has both a technical significance as a
333: validity test of the annealed approximation, and a biological one, as the
334: variability in the number of gene patterns at fixed gene number.
335: Sec.~\ref{sec:multipoiss} discusses generalizations of these results to
non-fixed connectivities. Finally, sec.~\ref{sec:concr} describes a first
attempt to put these findings to work on an experimental data set.
338:
339: \section{Model}
340: \label{sec:model}
341:
342: Our aim is to describe in a minimal way gene expression in a transcription
343: network, separating the issues related to the dynamics from those related to
344: its logical and computational structure. In order to do this, we will
345: formulate a model that sees the system as an optimization problem, where a set
346: of variables, the genes, is subject to a set of constraints, the signal
integration nodes. Upon this logic backbone, many different dynamics can be
superimposed, including in the most general case the kinetic Monte Carlo scheme
commonly used to model genetic networks.
350: %dire meglio!!!!!!!!
351: Rather than going towards the direction of highest detail, we will choose to
352: simplify the model as much as possible, reducing the number of details to the
353: minimum, and studying the general qualitative features of the system.
354:
355: The model is specified by
356: \begin{itemize}
357: \item[1)] A set of $N$ discrete variables $\{ x_i \}_{i=1..N}$ associated to
358: genes or operons, which in the simplest picture are identified with their
359: transcripts and protein products. These variables represent the expression
360: levels and in general take discrete values in $\{0,..,q\}$. In particular
361: situations, they are well-approximated by continuous variables.
362: \item[2)] A set of $M$ interactions, or constraints $\{ I_b(x_{i_{0}},
363: x_{i_1}, ..., x_{i_{k_b}}) \}_{b=1..M}$ between the genes, representing the
364: signal integration from transcription nodes.
365: \end{itemize}
366: This formulates an optimization problem, which we call GR, from Gene
Regulation. GR asks for the states compatible with the constraints.
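
As a minimal illustration, consider the Boolean case and assume (a convention
adopted here only for the sake of the example) that each constraint is written
as an indicator, with $I_b = 1$ when the constraint is satisfied and $I_b = 0$
otherwise. Under this assumption, the number $\mathcal{N}$ of states compatible
with all the constraints can be written as
\begin{displaymath}
  \mathcal{N} = \sum_{x_1, .., x_N} \ \prod_{b=1}^{M}
  I_b(x_{i_0}, x_{i_1}, .., x_{i_{k_b}}) \ \ .
\end{displaymath}
For the cI-cro switch mentioned in the introduction, taking $x_1$ for cI and
$x_2$ for cro, a single constraint excluding simultaneous expression can be
sketched as $I = 1 - x_1 x_2$, which gives
\begin{displaymath}
  \mathcal{N} = \sum_{x_1, x_2 \in \{0,1\}} (1 - x_1 x_2) = 1 + 1 + 1 + 0 = 3 \ \ ,
\end{displaymath}
i.e. the patterns 00, 10 and 01 out of the $2^2 = 4$ possible ones.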
368:
369: The model can be easily generalized to include other relevant degrees of
370: freedom, such as translation, protein modification and protein-protein
371: interactions. However, each addition adds complexity and parameters.
372: Therefore we start with the minimal possible description.
373: % e' una cagata
374: Admittedly, neglecting non-transcriptional regulation is a drastic
375: simplification of the system. A complete genetic network should in principle
376: include all forms. On the other hand, the justification for considering
377: transcription alone is that it is the first step in the chain of regulation
378: events and it is experimentally well characterized. From the physics point of
379: view, this model can be seen as a ``spin glass'', a system where some
380: variables, our gene expression levels, interact through some coupling
381: constants, specified by the constraints~\cite{M02}. This approach to
optimization problems of computer science has proved to be very useful in
recent years~\cite{MPZ02}.
384:
385: % factor graph
386: The network structure is naturally represented on a ``factor graph'', where
two kinds of nodes are present, $N$ ``variable nodes'' and $M$ ``function nodes''
388: respectively (Fig.~\ref{fig:grafone}). Here $k_{b} = 1+ K_{b}$ can be seen as
389: the local connectivity of a function node. Note that the factor graph is also
390: defined by a variable connectivity $c_{i}$, the number of functions connected
391: to $x_{i}$.
392: %figura
393: In fact, this is the typical structure of a constraint satisfaction problem of
394: theoretical computer science, such as q-coloring or satisfiability~\cite{M02}.
395:
396: %figure 1, c'e' gia da intro?
397:
398: %description of individual node is shea ackers
399:
400:
401: \section{The constraints and the Shea-Ackers model}
402: \label{sec:shea}
403:
404: To specify the model one has to give a structure for the function nodes, i.e.
405: the constraints. This requires a physical model for signal integration. In
406: order to take this step, in this section we start from the well-known and
407: widely accepted thermodynamic model of Shea and Ackers of gene activation by
408: recruitment, to show that it is the natural setting to express
409: our constraints.
410: %Later on we will proceed with the analysis of a simpler, if
411: %not the simplest, example.
412:
413: Let us consider a function node with $K$ regulators and one output variable.
414: This is modeled, in the version presented by Buchler and
415: collaborators~\cite{BGH03}, as a neural network (a ``Boltzmann machine'') with
416: Hamiltonian
417: \begin{displaymath}
418: H = \sum_{\stackrel{i,j = 0..L}{ i\ne j} } J_{ij} s_i s_j
419: + \sum_{j = 0..L} h_j s_j \ \ ,
420: \end{displaymath}
421: where $s_1, .., s_L$ are the occupation variables of the \emph{cis-} binding
422: sites, $h_{j}$ are external fields, functions of $x_1, ..., x_K$ representing
423: the concentrations of ``input'' transcription factors, and $J_{ij}$ are
424: interaction constants associated to competitive versus cooperative binding.
More precisely, $h_{j} = - \beta^{-1} \log(Q_{j})$, where $Q_{j} =
\frac{[TF_{i}]}{\kappa_{j}} \sim \frac{x_{i}}{\kappa_{j}}$ is the binding
affinity of site $j$ for the transcription factor $i$ that binds it, and
$\kappa_{j}$ is the corresponding dissociation constant. Normally,
the concentrations are approximated with continuous variables.
429: %
430: In general, $L \ge K$, because multiple binding sites are present. Finally,
431: $s_0, h_0$ are the occupation variable and the external field (a fixed
432: parameter corresponding to the polymerase binding affinity) associated to the
output node of the function (see Fig.~\ref{fig:Sheanode}).
434: %
435: \begin{figure}[htbp]
436: \centering
437: \includegraphics[width=.7\textwidth]{SAnodeSs}
438: \caption{Exemplification of a Shea-Ackers node. $s_i$ are occupation
439: variables for the transcription factors binding sites, while the coupling
440: constants $J_{ij}$ encode cooperative or competitive binding. The external
441: fields $h_i$ are the phase space variables of GR.}
442: \label{fig:Sheanode}
443: \end{figure}
444:
445:
Given all the binding constants and the interaction parameters, the
(intrinsically probabilistic) output of the gate is computed as a function of
the input fields, simply as the probability that $s_0 = 1$.
449: %
450: This expectation value can be obtained through the partition function
451: \begin{displaymath}
Z[h_0,...,h_L] = \sum_{\{s\}} e^{- \beta H}
453: \end{displaymath}
454: as
455: \begin{displaymath}
P(s_0 = 1) = \frac{1}{Z} \sum_{s_1,...,s_L} e^{- \beta
H(1,s_1,...,s_L)}
458: \end{displaymath}
459: where $\beta = 1/kT$.
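
As an illustrative sketch (not taken from the cited references), consider a
single input site, $L=1$, with occupation variables $s_0, s_1 \in \{0,1\}$.
Three shorthand symbols are introduced here as assumptions of this example:
$J \equiv J_{01} + J_{10}$ for the total coupling between the site and the
promoter (both orderings appear in the sum over $i \ne j$), $\omega \equiv
e^{-\beta J}$, and $Q_0 \equiv e^{-\beta h_0}$ by analogy with $Q_j$. Summing
over the four states $(s_0, s_1)$ then gives
\begin{displaymath}
  Z = 1 + Q_0 + Q_1 + \omega Q_0 Q_1 \ \ ,
  \qquad
  P(s_0 = 1) = \frac{Q_0 + \omega Q_0 Q_1}{1 + Q_0 + Q_1 + \omega Q_0 Q_1}
  \ \ .
\end{displaymath}
For $\omega = 1$ (no coupling) the output reduces to $Q_0/(1+Q_0)$ and does not
depend on the input, while for $\omega > 1$ (cooperative coupling, $J < 0$) the
bound transcription factor increases the probability of polymerase binding.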
460: %NB guarda che beta e' fittizio x' h,J~1/beta
461:
462: Equivalently, one can compute the local free energy
463: \begin{equation}
464: - \frac{1}{\beta} \log Z = F[h_0,...,h_L] = F[x_0, ..., x_K]
465: \label{eq:fensa}
466: \end{equation}
467: and find the average output through minimization with respect
468: to the $x_0$ coordinate.
469:
470: In other words, the function nodes are local equilibrium conditions for the
471: variable nodes, specified by the Shea-Ackers model of the
472: \emph{cis-}regulatory region of each variable node. The expression variables
473: $\{ x_i \}_{i=1..N}$ need to satisfy the constraints specified by the local
474: minimizations of the free energies $\{ F_b(x_{i_{0}}, x_{i_1}, ...,
475: x_{i_{k_b}}) \}_{b=1..M}$. Since there is a clear input-output logic encoded by
the chemical equilibrium of each signal integration node, one could refer to
this static backbone structure as the ``logico-chemical'' structure of the
network, and separate it from its ``dynamic'' structure. The logic it encodes
479: is of course not Boolean. In fact, it is intrinsically non-Boolean even with
480: Boolean variables, as the outputs are probabilistic functions of the inputs.
481:
482:
483: % analogy with SG
484: From the point of view of statistical mechanics, this is a Potts spin system
485: with diluted interactions described by the local free energies $F_b$ (which in
486: this context should be interpreted as effective Hamiltonians).
487: % cavity analogy boh taglia
488: Interestingly for the analogy with spin glasses, the model for the gate can be
489: seen as a message-passing procedure analogous to that exploited by the cavity
490: method~\cite{MPV87,MP03,MZ02}, where, in the approximation of factorized
491: probability distribution of the variables, one evaluates the local fields
492: $h_{i \rightarrow b}$, describing the local influence of the couplings on
493: variable $i$ in absence of interaction $b$, and $u_{b \rightarrow i}$, the
494: contribution of interaction $b$ on the local magnetic field on spin $i$,
495: together with their histograms in the presence of many states.
496: %
497: In our case, the ``messages'' described above can only travel in the
498: input-to-output direction. We are currently investigating whether this
499: analogy can be exploited for further calculations, and we are aware of work in
500: this direction, in a simpler setting, by another group~\cite{CLP+}.
501: %boh
502:
503: %coarse grain
In principle, the variables $x_i$ directly stand for the number of expressed
molecules in a cell. Provided the set of all binding constants and
506: interactions is known, all the function nodes can be computed and the model is
507: complete. It could be solved, for example by numerical simulations, once a
508: dynamics is specified.
509:
510: On the other hand, with a few exceptions of small systems, these (many)
511: parameters are in general not known. For this reason, rather than
512: aiming for the highest level of detail, we choose to simplify as much
513: as possible, while trying to keep the most relevant features.
514: % (furthermore, a high level of detail would be useless without
515: % incorporating the rest of the genetic network in teh description with
516: % the same level of detail).
517: For practical purposes, in order to be able to advance further analytically
518: and numerically in the understanding of the model, it is convenient to
519: introduce coarse grained expression levels, thereby effectively reducing $q$.
The resulting model, GRq, is identical, except that variables and
521: constraints are now subject to implicit averaging. One advantage of this
522: approach is that local free energies become easier and easier to specify, and
523: it is possible to study, as is commonly done for spin glasses, the typical
524: behavior of the system as a function of the parameters.
525:
526:
527: In the simplest possible scenario $q=1$, and the expression levels are Boolean
528: variables. The assumption behind this is that what matters is only if the
529: level of expression is high or low~\cite{Kau93,Kau69b}.
530: %buchler hwa giustificaz
531: %l'ideale è x_i = 1..q . Si può fare.
532: % Nota questo non e' kauffmann, vedi lettera.
533: % dire GR1 GRq
The simplest case (which we will still call GR1) also assumes Boolean
535: functions. In the following section, we will show how GR1 maps to a
536: Satisfiability problem (Sat), an optimization problem where $N$
537: Boolean variables are constrained by $M$ conjunctive normal form (CNF)
538: constraints (i.e. by a Boolean polynomial constructed as a product ($\wedge$)
539: of $M$ disjunctive monomials ($\vee$)).
540:
541:
542: \section{GR1, mapping on a Satisfiability Problem}
543: \label{sec:satmap}
544:
545: As we have shown in the above section, for general variables $x_i$ that
546: represent real expression levels, the constraints can be derived directly from
547: the model of Shea and Ackers of gene activation by
548: recruitment~\cite{SA85,BGH03}. If the $x_i$ represent coarse-grained
549: expression levels, the same model can be used to construct the local free
550: energy in Eq.~\eqref{eq:fensa}, associated to each signal integration node,
551: that generates the constraints through minimization.
552: %
Here we consider the simplest possible scenario, treating the
expression levels as Boolean variables, setting $q=1$, and the signal
integration functions as Boolean functions $\{ f_b(x_{b_1}, ..,
x_{b_{K_b}}) \}_{b=1..M}$.
% no, do this later!!!
We also restrict to the case of fixed inward connectivity $K_b = K, \ \
559: \forall b$. These conditions, defining $K$-GR1, are also found in Kauffman
560: networks~\cite{Kau93,Ger04}. We will relax the hypothesis of fixed $K$ to
561: explore networks with fluctuating connectivity in section
562: \ref{sec:multipoiss}.
563:
564:
565: \begin{figure}[htbp]
566: \includegraphics[width=.7\textwidth]{Nodi}
567: \caption{Translation of a 2-GR1 node in 3-Sat Nodes. The
568: input-output direction is from top to bottom. Sat constraints are
569: represented as squares, where the black and white vertices indicate that
570: the corresponding variable enters negated or affirmed respectively in the
571: 3-Sat constraint.}
572: \label{fig:schemino}
573: \end{figure}
574:
575: The expression
576: \begin{math}
x_{b_0} = f_b (x_{b_1}, .. ,x_{b_{K}}),
578: \end{math}
579: which imposes that the variable $x_{b_0}$ is the output of the
580: function $f_b$, translates into the Boolean constraint
581: \begin{equation}
582: \neg(x_{b_0} \dot{\vee} f_b).
583: \label{eq:constr}
584: \end{equation}
585: In a Kauffman network, this expression is equivalent to the fixed
586: point condition, and there is one such constraint for every variable
587: $x_i$. The formula
588: \begin{displaymath}
589: I = \bigwedge_{b=1..M} I_b = \bigwedge_{b=1..M} \neg(x_{b_0} \dot{\vee} f_b)
590: \end{displaymath}
591: defines a Satisfiability problem (Sat) on the variables $x_1,.., x_N$.
592: From the biological viewpoint, this is a logic representation of the
593: computational tasks encoded in the transcription network by evolution,
594: i.e. which sets of genes have to be switched on at any given
595: condition. More abstractly, having in mind Kauffman networks, each
596: satisfying solution of this problem corresponds to a fixed point in
597: the Kauffman dynamics (independently from the update scheme). The
598: number of variables involved in one constraint is always exactly
599: $k=K+1$, therefore this mapping associates a network with fixed
600: connectivity $K$, to a $k$-Satisfiability problem whose connectivity is
601: increased by one unit. For example, a $K=2$ Kauffman network
602: corresponds to a 3-Sat problem, and so on.
603: %
The suitable order parameter for such a system is $\gamma = M/N$. GR1 assumes
that each gene expression variable is regulated by at most one signal
integration function, so that $\gamma \le 1$.
607:
608:
To further understand the logic structure of GR1, we can write the CNF
constraints on each output variable $x_{b_0}$. This allows us to make a
connection to the order parameter used in $k$-Sat, i.e. the constraint density
$\alpha$, defined as the number of conjunctive-normal-form (CNF) clauses per
Boolean variable. In order to do this, we recast the Boolean formulas $I_b$
into CNF. Reshuffling the truth table of $f_b$ in such a way that the first
$z$ rows ($1,..,z$) give zero as an output, a simple procedure shows that
616: \begin{equation}
617: \begin{array}{ccc}
618: I_b &=&
619: \Big( \bigwedge_{\alpha=1}^{z}
620: (\neg x_{b_0} \vee \xi_{\alpha_1} \vee ..\vee \xi_{\alpha_K})
621: \Big) \\ & & \\
622: && \bigwedge
623: \Big(
624: \bigwedge_{\alpha=z+1}^{2^K} ( x_{b_0} \vee \xi_{\alpha_1} \vee ..\vee
625: \xi_{\alpha_K})
626: \Big),
627: \end{array}
628: \label{eq:satform}
629: \end{equation}
630: with
631: \begin{displaymath}
632: \xi_{\alpha_j} =
633: \left\{
634: \begin{array}{cc}
635: x_j & \textrm{if element } \ \alpha,j \ \textrm{of truth table} = 0 \\
636: \neg x_j & \textrm{if element } \ \alpha,j \ \textrm{of truth table} = 1
637: \end{array}
638: \right. ,
639: \end{displaymath}
640: having exactly $2^K$ clauses of $K+1$ elements.
641: % MA prima: ogni GATE ha 4 nodi !
642: % questi nodi non sono tot random, ma correlati negli input
643: Thus, a network with connectivity $K$ maps into a $(K+1)$-Sat having always
644: $\alpha = 2^K \gamma$.
645: %
646: %
However, we cannot directly infer that the typical behavior, and therefore the
phase diagram of the system, will be the same as that of the corresponding
random $k$-Sat. Considering random realizations of the constraints, these are
\emph{a priori} only a subset of all the possible realizations of a $k$-Sat
constraint. In fact, it is immediate to realize that the $2^K$ CNF clauses
written in Eq.~\eqref{eq:satform} contain all the possible (fixed)
combinations for the input variables and the corresponding (random) $2^K$
654: outputs for the output variable. This is best exemplified on the factor graph
655: (Fig.~\ref{fig:schemino}).
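
As a minimal illustration of Eq.~\eqref{eq:satform}, consider a hypothetical
constraint with $K=2$ in which the output $x_0$ is the logical AND of its two
inputs, $f = x_1 \wedge x_2$ (this choice of function is ours, made only for
the sake of the example). The truth table has $z=3$ rows with output zero,
$(0,0)$, $(0,1)$ and $(1,0)$, and one row with output one, $(1,1)$, so that
\begin{displaymath}
  \begin{array}{rcl}
    I &=& (\neg x_0 \vee x_1 \vee x_2) \wedge (\neg x_0 \vee x_1 \vee \neg x_2)
    \wedge (\neg x_0 \vee \neg x_1 \vee x_2) \\
    & & \wedge \, ( x_0 \vee \neg x_1 \vee \neg x_2) \ .
  \end{array}
\end{displaymath}
Each clause forbids exactly one assignment of $(x_0,x_1,x_2)$, namely the one
in which the inputs take the values of a given row of the truth table while
$x_0$ takes the wrong output. There are $2^K = 4$ clauses of $K+1 = 3$
literals, and every combination of input values appears exactly once.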
656:
657: Having established a link between GR1 and a particular optimization problem we
658: set out to look at random instances for the signal integration functions. For
fixed connectivity one can expect a similar behavior for $K$-GR1 as for random
$k$-Sat, or $k$-XORSAT (a Sat problem whose clauses contain only XOR
operations), with the presence of a phase transition in the number of satisfying
states. The suitable order parameter for such a transition is $\gamma = M/N$.
Notably, the spaces of functions of the three models have different dimensions.
664: Furthermore, differently from Sat or XORSAT, in GR1 each gene expression
665: variable is regulated at most by one signal integration function, so that
666: $\gamma \le 1$. In practice, both for ease of interpretation and for
667: simplicity of the analytical formulation, from now on we will abandon the
668: formulation of GR1 in terms of CNF constraints, and work with the input-output
669: functions $f_{n}$.
670:
671: \section{Leaf Removal and the Computational Core}
672: \label{sec:leaf}
673:
674: The aim of this section is to compute the number of satisfying solutions
675: $\mathcal{N}$, for random instances of the constraints. For accessory reasons,
676: we will map GR1 to a spin system. Quite simply, each Boolean variable $x_i
677: \in \{0,1 \}$ is transformed into a spin $\sigma_i \in \{-1,1\}$.
678: %
679: Each constraint, or diamond function node generates the interaction
680: Hamiltonian $ H_{\diamond,b}$
681: \begin{equation}
682: 2^k H_{\diamond,b} = \sum_{J_{b_1},..,J_{b_K}}
683: \prod_{l=1..K} (1+ J_{b_l} \sigma_{b_l})
(1- J_{b_0}^{\{ J_{b_1},..,J_{b_K}\}} \sigma_{b_0})\ ,
685: \label{eq:ham_diam}
686: \end{equation}
687: %
688: Under the restrictions defining GR1, the total energy of the system is simply
689: the number of violated constraints. A zero-energy configuration satisfies all
the constraints, and is therefore able to comply with all the logic functions
691: encoded by the network. The $2^K$ coupling constants $J_{b_0}^{\{
692: J_{b_1},..,J_{b_K}\}} = \pm 1$ are a representation of the truth table of
693: the function $f_b$. With the correspondence $\{0,1\} \leftrightarrow
694: \{-1,1\}$ , $J_{b_1},..,J_{b_K}$ stand for the possible values of the input
695: variables $ x_{b_1},..,x_{b_K}$, while $J_{b_0}^{\{J_{b_1},..,J_{b_K}\}}$ is
696: the associated output value. The Hamiltonian (\ref{eq:ham_diam}) is the cost
function of the corresponding optimization problem $K$-GR1. It encodes the logic
698: constraint of Eq.~(\ref{eq:constr}), in the sense that it is one whenever the
699: constraint is violated, and zero otherwise. From the point of view of
700: transcription networks, $H_{\diamond,b}$ is the coarse grained local free
701: energy of the Shea-Ackers model. Note that, even in the case of Boolean
702: variables, the coupling constants represent binding affinities and
703: interactions between transcription factors. In general, they need not be plus
or minus one. Here we make this further assumption.
705:
The conventional average of $\mathcal{N}$ over the realizations might be biased
by the weight of rare exceptions~\cite{MPV87}. The correct quantity to compute is
the ``quenched average'' of the system's free energy,
$\overline{\log{\mathcal{N}}}$, which is usually accessed with the replica, or
similar methods~\cite{MPZ02}, starting from Hamiltonians like $H_{\diamond,b}$.
711: %salta forse
712: %Having in mind the Shea-Ackers picture, biochemical noise may be
713: %included to a certain extent in our framework by specifying a
714: %probability distribution for the coupling constants.
715: %
716: %
717: However, in the case under examination we will use a simpler method, based on
the ``leaf removal''~\cite{MRZ03} algorithm, which allows us to compute only the
719: \emph{annealed} average $\log{\overline{\mathcal{N}}}$. As we will discuss,
720: this method has the advantage of an immediate biological interpretation in
721: terms of the roles played by genes in the network.
722: %
For the case of random XORSAT, M\'ezard and collaborators~\cite{MRZ03} have
724: shown that the annealed average on the core variables coincides with the
725: quenched one. In general $\overline{\log{\mathcal{N}}} \leq
726: \log{\overline{\mathcal{N}}}$.
727: %
728: For GR1, we performed estimates that indicate the presence of the same
729: self-averaging property in a well-defined region of parameter-space (see
730: sec.~\ref{sec:selfav}). Within this region, our annealed calculation is exact.
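
The inequality $\overline{\log{\mathcal{N}}} \leq
\log{\overline{\mathcal{N}}}$ quoted above follows from the concavity of the
logarithm (Jensen's inequality). A toy example, with numbers chosen by us
purely for illustration, shows how far apart the two averages can be: if
$\mathcal{N}=1$ for half of the realizations and $\mathcal{N}=2^{n}$ for the
other half, then
\begin{displaymath}
  \overline{\log_2 \mathcal{N}} = \frac{n}{2} \ , \qquad
  \log_2 \overline{\mathcal{N}} = \log_2 \frac{1+2^{n}}{2} \simeq n-1
  \quad (n \gg 1) \ ,
\end{displaymath}
so that the annealed average is dominated by the realizations with the largest
number of solutions.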
731:
732: For a given realization of the constraints $ \{ \vec{I}, \vec{f} \} $, the
733: number of satisfying states $\mathcal{N}$ can be written as
734: \begin{displaymath}
735: \mathcal{N}(\vec{I}, \vec{f}) = \sum_{\vec{\sigma}} \prod_{b=1}^M \delta(1;
736: f_{b}(\sigma_{i(b,1)}, .., \sigma_{i(b,K_b)})\sigma_{i(b,0)}).
737: \end{displaymath}
738: Here, the randomness is contained: (i) in the specification of the network
739: structure, $\vec{I} = (I_1,...,I_M)$, i.e. in the coordinates $i(b,l)$, which
740: point at the variable occupying place $l$ in the $b$th constraint; (ii) in the
741: specification of the functions $\vec{f}= (f_1,...,f_M) $ with a certain
742: probability distribution in the class $\mathcal{F}$. An overbar ($\bar{\ ~}$)
743: indicates an average on both distributions, $p(\vec{I})$ and $p(\vec{f})$. We
744: will first concentrate on the case with fixed in-ward connectivity $K$.
745:
746:
In carrying out this average, three preliminary remarks are relevant. The first is
748: that not all the $M$ equations and $N$ variables are meaningful to calculate
749: $\mathcal{N}$. Indeed, every output variable that appears in only one
750: constraint can be trivially fixed according to its function. Thus, both the
751: constraint and the variable can be eliminated without affecting the number of
752: solutions. This procedure is called ``leaf removal''~\cite{MRZ03}. It is a
753: nonlinear procedure, as more variables can disappear together with a single
754: constraint, because input free genes that regulate a leaf remain as isolated
755: points. The iteration of this mechanism leads to the definition of a
756: ``core'', the CCC, of significant variables and constraints, in numbers of
757: $N_C$ and $M_C$ respectively. In the CCC, $M_{C}$ genes are controlled, and
758: $\Delta_{C} = N_{C} - M_{C}$ are the ``free'' genes with an essential role in
759: controlling the expression states, as a function of an input signal.
760: % FIRST: FIXED p = k
761: The second relevant fact is the hypothesis that the functions are independent
762: and identically distributed random variables. Thirdly, we consider a set of
763: functions in a family $\mathcal{F}_K$ which satisfy the condition
764: \begin{math}
765: \frac{1}{2^{2^K}} \sum_{f \in \mathcal{F}_K } p(\vec{f}) f(\vec{x}) = \rho
766: \, ,
767: \end{math}
768: as will always be the case if the outputs of the functions are
769: uncorrelated, even in the presence of bias.
770:
771: It is then easy to verify that
772: \begin{displaymath}
773: \overline{ \mathcal{N}} = \sum_{\vec{\sigma}, \vec{I}} p(\vec{I})
774: \prod_{b=1}^M \left(
775: \rho \delta(1;\sigma_{i(b,0)}) + (1- \rho) \delta(-1;\sigma_{i(b,0)}
776: ) \right) \ ;
777: \end{displaymath}
778: so that, as for the XORSAT model,
779: \begin{equation}
780: \overline{\mathcal{N}} = 2^{N_{C}-M_{C}} \ .
781: \label{eq:nstati}
782: \end{equation}
783: Incidentally, we note that the same procedure is valid for GRq, where
784: $\overline{\mathcal{N}} = q^{N_{C}-M_{C}}$.
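
The step leading to Eq.~(\ref{eq:nstati}) can be sketched as follows, under
the GR1 assumption that each spin is the output of at most one constraint. For
a fixed structure $\vec{I}$, the sum over $\vec{\sigma}$ factorizes: each of
the $M$ output spins contributes a factor $\rho + (1-\rho) = 1$, while each of
the remaining $N-M$ spins is unconstrained and contributes a factor $2$, so
that
\begin{displaymath}
  \sum_{\vec{\sigma}} \prod_{b=1}^M \left( \rho \,\delta(1;\sigma_{i(b,0)})
    + (1-\rho)\, \delta(-1;\sigma_{i(b,0)}) \right) = 2^{N-M} \ ,
\end{displaymath}
independently of $\vec{I}$ and of the bias $\rho$. Applied to the core alone,
the same counting gives $2^{N_C-M_C}$.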
785:
786:
787: \begin{figure}[htb]
788: \includegraphics[width=.75\textwidth]{Fig2}
789: \caption{Phase diagram of $4$-GR1. For $\gamma >
790: \gamma_c$ no solutions exist in the typical realization (UNSAT phase). For
$\gamma < \gamma_d$ the system is paramagnetic, and typically a satisfying
792: state exists (SAT phase, or simple gene control). For $\gamma_d <\gamma<
793: \gamma_c$ there is complex gene control. }
794: \label{fig:pd}
795: \end{figure}
796:
797:
798:
799: %These genes can react to an external input and determine a state.
800: We can use $\frac{\Delta_C}{N_C}$ as an order parameter in the thermodynamic
801: limit $N \to \infty, \, \gamma$ const., to distinguish three types of
802: phenomenology, or three ``phases''. (i) For $\frac{\Delta_C}{N_C} \leq 0$,
there are no free genes in the core, and the system cannot comply with all the
expression programs encoded into its DNA. This is an UNSAT phase from the
computational point of view. (ii) For $0 \leq \frac{\Delta_C}{N_C} < 1$, the
$\Delta_{C}$ genes of the CCC control $O(N)$ genes each, and therefore are
able to determine an expression state. This is a ``complex gene control''
phase, or HARD-SAT. (iii) For $ \frac{\Delta_C}{N_C} = 1$ the number of
809: controlled variables is a vanishing fraction of the total number of genes. In
810: other words, $ M_C = 0$ in the thermodynamic limit, the free genes control at
811: most $O(1)$ genes to generate a satisfying state (``simple control'', or SAT
812: phase). In the simple control phase, the system is underconstrained, which
813: means that the logic conditions imposed by the signal integration functions
814: are generally insufficient for a strict determination of the expression
815: patterns.
816:
817: The three phases described above depend both on the value of $\gamma$ and on
818: the class of random functions considered. In general, if all the possible
819: functions are taken into account, the phase diagram can be explored studying
820: the rank and the kernel of the connectivity matrix~\cite{CS02}.
821: %
822: % parametri d'ordine di Bruno e figura che li rappresenta.
823: %
824: % caso generico rango e kernel
825: %
826: %
Following the analysis of M\'ezard and collaborators~\cite{MRZ03}, in the case
828: of Poisson variable connectivity, i.e. with the distribution $\pi(c) =
829: \frac{(k \gamma)^{c}}{c!} e^{-k \gamma} $, the phase diagram as a function of
830: $\gamma$ is identical to the random XORSAT problem. It is illustrated in
831: Fig.~\ref{fig:pd}. For $\gamma>\gamma_c$ no solutions exist in the typical
832: realization (UNSAT phase). For $\gamma<\gamma_d$ the system is paramagnetic
833: (SAT phase). For $\gamma_d<\gamma<\gamma_c$ exponentially many satisfying
834: states exist.
835: %% Physically,
836: %% this is a glassy phase, the typical dynamics is slow in finding a satisfying
837: %% solution.
838: Here, the space of solutions breaks down into clusters separated by free
839: energy barriers. The typical dynamics in a cluster will be residual, in the
840: sense that a block of genes are fixed (on or off) and the rest may move. The
841: number of clusters is controlled by the (computational) complexity $\Sigma$ of
the system. The number of observed configurations is $ \mathcal{N}^{*} \sim
843: \exp[N \Delta_{C}]$. Thus, by definition $\Sigma$ is directly related to the
844: order parameter $\Delta_{C}/N_{C}$, i.e. to the partitioning of the core
845: genes. How the system explores (or not) the clusters depends on details of
846: its dynamics.
847:
848:
849:
850: % SE METTI DISCUTI!!!
851: %% \begin{figure}[htbp]
852: %% \includegraphics[width=.4\textwidth]{pd-k-leaf}
853: %% \caption{Zero-temperature phase diagram of $K$-GR1 computed with the leaf
854: %% removal algorithm (as described in~\cite{MRZ03}).}
855: %% \label{fig:pdk}
856: %% \end{figure}
857:
858:
859: % casi K fisso poisson = hsat (prima devi aver detto che a priori e'
860: % piu' generale!!!!) --> diagr di fase in fs di p=K+1
861:
862: %% For every value of $K$, the critical values for $\gamma$ are of order
863: %% one. Remembering our condition $\gamma \le 1$, it is interesting to
864: %% ask in which phase a hypothetical fixed connectivity CCC would
865: %% typically lie. This is illustrated in Fig.~\ref{fig:pdk}.
866:
867: % non piu' vero
868: % The first
869: % striking fact is that, for any $K$, the system never finds itself in the
870: % UNSAT phase.
871: %zzo vuol dire?
872:
873: % da vedere ....
874: %% Secondly, there is a transition (around $K=7$, for $\gamma=1$) between
875: %% the SAT and the HARD-SAT phase.
876: %% %boh
877: %% Remarkably, the transcription network of \emph{escherichia coli} has
878: %% signal integration functions with (exponentially distributed) $K \le
879: %% 6$ (according to the dataset used in~\cite{SMM+02}).
880: %cita DB
881:
882: %TL bias non influisce da notare
883: % To study the case of biased functions, a new calculation is
884: % necessary to compute the phase diagram. NOT THERE
885:
886:
887: \section{State-Fluctuations and Self-average}
888: \label{sec:selfav}
889:
890: In this section, we discuss a calculation of the \emph{width} of the
891: distribution of the number of compatible states. In order to do this, we
892: compute the quantity $\overline{[\mathcal{N}]^2}$. This calculation is
893: relevant both from the technical and from the qualitative point of view. The
894: technical aspect, as already anticipated, deals with the self-averaging
895: property, which holds when the quantity
896: \begin{displaymath}
897: \frac{\overline{[\mathcal{N}]^{2}} - \left[\overline{\mathcal{N}}\right]^2}
898: {\left[\overline{\mathcal{N}}\right]^2}
899: \end{displaymath}
900: vanishes in the thermodynamic limit $N \to \infty$ at constant $\gamma$. When
901: this condition is met, the annealed average computed above coincides with the
902: quenched one, and no extra effort is required. When it is not, the behavior of
903: the system can be qualitatively different from what emerges in the annealed
picture. In particular, the typical number of solutions is overestimated. In
905: this case, more complicated formalisms, such as replicas, need to be
906: adopted~\cite{MPV87}.
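
The link between this ratio and the concentration of $\mathcal{N}$ can be made
explicit with Chebyshev's inequality, a standard bound that we quote here only
as an illustration: for any $\epsilon > 0$,
\begin{displaymath}
  \textrm{Prob}\left( \left| \frac{\mathcal{N}}{\overline{\mathcal{N}}} - 1
    \right| \geq \epsilon \right) \leq \frac{1}{\epsilon^{2}} \,
  \frac{\overline{[\mathcal{N}]^{2}} - \left[\overline{\mathcal{N}}\right]^2}
  {\left[\overline{\mathcal{N}}\right]^2} \ .
\end{displaymath}
If the right-hand side vanishes for $N \to \infty$, the ratio
$\mathcal{N}/\overline{\mathcal{N}}$ converges to one in probability, so that
typical realizations have $\log \mathcal{N} \simeq \log
\overline{\mathcal{N}}$ to leading order.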
907:
908:
909: The qualitative aspect involves the possible physical, and biological,
910: interpretation of $\overline{[\mathcal{N}]^2}$. This is an indicator of the
width of the probability distribution of the number of compatible states, in
the presence of random functions and network structure. Therefore, it can be seen
as the freedom the system has of varying the number of states that comply with
the signal integration functions by acting on its constraints. Biologically,
915: having in mind Darwinian evolution one may interpret it as a kind of
916: ``adaptability''. This is done as follows. If $\mathcal{N}$ is interpreted as
917: the number of possible responses of gene patterns to external or internal
918: conditions, it is reasonable to assume that a given system with a fixed number
919: of genes will be fitter by maximizing $\mathcal{N}$. Now, if the distribution
is wide, the system can greatly vary $\mathcal{N}$ by acting on the signal
921: integration functions. If the distribution is peaked, the changes in the
922: functions will be less effective in increasing the number of states.
923:
924: If the self-averaging property holds, all the width, this adaptability, comes
925: only as a finite-size effect. This is not irrelevant, considering that the
926: value of $N$ is in the range $10^{3}-10^{5}$, quite far from the usual
927: Avogadro's number!
928: %
929: On the other hand, if the self-averaging property does not hold, it means that
930: a residual width exists even in the thermodynamic limit. In this case, an
931: evolved system can be ultra-specific, finding the very exceptional situations
932: in which a high number of solutions exists against the typical odds in which
the system cannot express compatible gene patterns, the proverbial needle in a
934: haystack. Such putative highly specialized organisms would be particularly
935: sensitive to changes in the environment. Given these considerations, it seems
936: that the lack of self-averaging for a model like GR1 would make it slightly
937: less appealing.
938:
939: The final result for GR1 is that $\mathcal{N}$ is indeed self-averaging. The
940: details of the calculation are reported in Appendix~\ref{sec:self-aver-prop}.
941: It relies on two basic assumptions. The first relates to the choice of a
942: family of random functions such that:
943:
944: (a)
945: \begin{math}
946: \frac{1}{\Lambda} \sum_{f \in \mathcal{F}} p(\vec{f})
947: f(\vec{\s}) = 0 \, ,
948: \end{math}
949: as above, and
950:
951: (b)
952: \begin{math}
953: \frac{1}{\Lambda} \sum_{f \in \mathcal{F} } p(\vec{f})
954: f(\vec{\s})f(\vec{\t}) = \delta(\vec{\s}; \vec{\t}) \,
955: \end{math}.
956:
957: \noindent
958: Here, $\Lambda$ indicates the size of the family of functions, and we used the
959: obvious notation that makes the functions assume values $\pm 1$ if expressed
960: in terms of spins.
961:
962: The second assumption in the computation can be seen as a mean-field-like
963: hypothesis of independence of spins belonging to different clauses. It can be
964: argued as follows. Differently from the evaluation of
965: $\overline{\mathcal{N}}$, the computation of $\overline{[\mathcal{N}]^2}$
966: depends both on the class of functions and on the underlying network. The
967: essential problem is that while the function nodes are independent random
968: variables, the variable nodes clustered by functions can be repeated. Thus,
one has to answer the question: how many distinct variables $n_{v}$ are
connected to $r$ constraints? For small $r$, we can estimate that $n_{v} \simeq
971: r \cdot k$, while for $ r \sim M, \ n_{v} \simeq \frac{M}{\gamma}$. These
972: extremes set the ``fork'' of values for which our estimate is consistent and
973: robust.
974: %
975: In these conditions, we obtain the scaling as $4^{\Delta_{C}}$, in the
976: relevant regime of the phase diagram $ 1/K < \gamma < \gamma^{*}$, with
977: $\gamma^{*} > \gamma_{c}$. This enforces the self-averaging property.
978: %Dire di piu' / discutere l'energia libera. - ha trans di fase con gamma
The same procedure can be carried out with the $k$-Sat model, yielding no
980: self-averaging for the number of satisfying states.
981:
982: %il modello da solo le possibilita' - poi c'e' la selezione naturale
983:
984: %More formally, given the set of functions that can be explored and a
985: %probability measure on it,
986: % posso fare prediction con width?
987: % posso misurarla fissato il grafo - per es dell'ecoli?
988:
989: % calcolo
990:
991: %etc la stori di k gamma / e forse dell'energia libera
992: %Discuss $k > \gamma^{-1} $ etc.....
993:
994: \section{Different Connectivities}
995: \label{sec:multipoiss}
996:
997:
998: So far we have discussed the idealized case where the in-ward connectivity,
999: the number of transcription factors controlling one gene, is fixed. In that
1000: case, only the out-ward connectivity, or the number of genes controlled by a
1001: transcription factor, can fluctuate.
1002: %
1003: A biologically more realistic case is when both the inward connectivity $K$
1004: and the outward connectivity $c$ vary along the network, and the decay of the
1005: latter is slower (see Fig.~\ref{fig:colidati}a).
1006:
1007: Considering $p(k|c) = \frac{(k\gamma)^c}{c!} e^{-(k\gamma) }$, the conditioned
1008: probability that a variable is in $c$ clauses of the $k$ kind, we have $\pi(c)
1009: = \sum_k\frac{(k\gamma)^c}{c!} e^{-(k\gamma) }\cdot p(k)$. The leaf removal
1010: algorithm can be applied separately to sets of clauses with a given
connectivity, defining $ N_C \equiv \langle N_C \rangle_{k}$ and $M_C \equiv \langle M_C \rangle_{k}$,
where $\langle X \rangle_{k}= \sum_k p(k) \cdot X(k)$.
1013: %
1014: Choosing $p(k) = Z^{-1}(\nu) e^{-\nu k}$ does not affect the exponential
1015: asymptotic decay of $\pi(c)$ for large $c$.
1016:
1017: \begin{figure}[htbp]
1018: \centering
1019: \includegraphics[width=.75\textwidth]{Fig3}
1020: \caption{$\Delta_C$ as a function of $\gamma$ for different values of $\nu$
1021: in the multi-Poisson case. The discrete jumps are due to the onset of HARD
1022: phases for the different values of $k$. $\Delta_C$ can become negative
1023: many times, giving rise to reentrant UNSAT phases. The figure refers to a
1024: connectivity distribution with a cutoff at $k = 12$}
1025: \label{fig:mp-delta}
1026: \end{figure}
1027: %
1028: \begin{figure}[htbp]
1029: \centering
1030: \includegraphics[width=.75\textwidth]{Fig4}
1031: \caption{Phase diagram $\gamma - \nu$ for the multi-Poisson case. The dashed
1032: line represents the mean value of the numerically evaluated critical
1033: parameter $\gamma_d (\nu)$ for the SAT-HARD transition of network
1034: realizations with $N=3 \times 10^4$.}
1035: \label{fig:mp-pd}
1036: \end{figure}
1037:
To show this, let us construct the graph with the mentioned properties:
1039: \begin{enumerate}
1040: \item The probability to have a clause with $k$ elements is $p(k) =
1041: Z^{-1}(\nu) e^{-\nu k}\quad (k>1) $.
1042: \item $p(k|c) = \frac{(k\gamma)^c}{c!} e^{-(k\gamma) }$ is the conditioned
1043: probability that a variable is in $c$ clauses of the $k$ kind.
1044: \item $\pi(c) = \sum_k\frac{(k\gamma)^c}{c!} e^{-(k\gamma) }\cdot p(k)$.
1045: \end{enumerate}
1046: and compute $\pi(c)$.
Setting $\xi = \gamma + \nu$,
\begin{displaymath}
\pi(c) = \frac{\gamma^{c}}{Z c!} \left( -e^{-\xi} + \mathcal{Z}[k^{c}]
\right) \ \ ,
\end{displaymath}
where $\mathcal{Z}[f(k)] = \sum_{k} \frac{f(k)}{z^{k}}$ is the Z-transform, and
for us $z = e^{\xi}$. Now, $ \mathcal{Z}[k^{c}] = \textrm{Li}_{-c}(z^{-1})$,
where Li is the polylogarithm, which can be defined for negative integer order as
\begin{displaymath}
\textrm{Li}_{-c}(z) := \frac{1}{(1-z)^{c+1}} \sum_{i=0}^{c-1} \eulang{c}{i} z^{c-i} \ \ ,
\end{displaymath}
%
where $ \eulang{c}{i} $ are the Eulerian numbers.
1060: This gives a condensed expression for $\pi(c)$:
1061: \begin{displaymath}
1062: \pi(c) = \frac{\gamma^{c}}{Z c!} \left\{ \textrm{Li}_{-c}(e^{-\xi}) -
1063: e^{-\xi} \right\} \ \ .
1064: \end{displaymath}
1065: It can be easily checked that, for large $c$, this function decays
1066: exponentially, after having reached a maximum.
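
A quick way to see the exponential decay, sketched here as a leading-order
estimate rather than a rigorous bound, is to approximate the sum defining the
polylogarithm by an integral for large $c$:
\begin{displaymath}
  \textrm{Li}_{-c}(e^{-\xi}) = \sum_{k \geq 1} k^{c} e^{-k\xi} \simeq
  \int_0^{\infty} k^{c} e^{-k\xi} \, dk = \frac{c!}{\xi^{c+1}} \ ,
\end{displaymath}
which gives
\begin{displaymath}
  \pi(c) \simeq \frac{1}{Z \xi} \left( \frac{\gamma}{\gamma+\nu} \right)^{c} \ ,
\end{displaymath}
i.e. a geometric decay with rate $\log(1+\nu/\gamma)$, the term $e^{-\xi}$
being negligible in this limit.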
1067:
We can call this case multi-Poisson, as the graph is a superposition of
graphs that follow a Poisson distribution, each graph having in turn a fixed
1070: clause-connectivity and a Poisson variable-connectivity.
1071: %
1072: % cambiare e espandere un po'
1073: %
1074: The behavior of GR1 on such a topology is different from the
1075: fixed connectivity case. The main reason for this is that, while
1076: $\Delta_{C}(\gamma)$ is still locally decreasing, many new discontinuities
1077: emerge, due to the influence of clauses with different connectivities. This
1078: gives rise to two phenomena. Firstly, $\Delta_{C}$ can increase globally with
1079: increasing $\gamma$. Indeed, it does increase after $\gamma_{d}$, to decrease
1080: again before $\gamma_{c}$ (Fig.~\ref{fig:mp-delta}). After the onset of the
1081: complex control phase, the complexity initially increases (step-wise), reaches
1082: a maximum, and then decreases monotonically. This has an influence on the
1083: number of observed states as a function of $\gamma$. Secondly, $\Delta_{C}$
can become negative and then jump back to a positive value, creating a
1085: reentrant HARD-SAT phase (Fig.~\ref{fig:mp-pd}).
1086: We are currently studying ways to extend our calculation of the mean number of
1087: compatible states and the width of its distribution on graphs with more
1088: general connectivities.
1089:
1090:
1091:
1092: \section{An Example from an Experimental Setting}
1093: \label{sec:concr}
1094:
1095: The results described so far focused on the typical behavior of GR1 as a
formal model for a genetic network. To summarize, we can predict the
1097: existence of a core of variables, the CCC, which determines the behavior of
1098: the system. The phase diagram of the system contains two regimes of gene
1099: control, simple and complex. In the complex control phase, the free genes of
1100: the core control $O(N)$ other genes. These phases also depend on connectivity.
1101: On the other hand, a very important question is how to relate them to concrete
1102: systems. There are many possibilities in this direction that we are currently
1103: exploring. In this section, we will discuss a first attempt. Specifically, we
1104: will make use of the data set for the structure of the E.~coli
1105: %and S.~Cerevisiae
1106: %cita alon
1107: transcription network from the RegulonDB database~\cite{SSG+01}, with the
1108: modifications of~\cite{SMM+02}. The goal is to apply the leaf removal
1109: algorithm using the information contained in the data set.
1110:
1111: % descrivi datasets
1112: The data set consists of an annotated graph, where the signal integration
functions are described as sets of annotated links. The annotations consist of
1114: three modes of activity: activation, repression, and ``dual'' activity
1115: (meaning that the activity depends on the context). The data on the
1116: combinatorial activity of transcription factors are not part of the set. For
1117: this reason, in what follows we will ignore the annotations, concentrating on
1118: the study of random GR1 realizations on the given experimental network
1119: structures.
1120: %
1121: Considering the connectivity matrix $C_{ij}$ defined as
1122: \begin{displaymath}
1123: C_{ij} = \left\{
1124: \begin{array}{lll}
1 ; \ \ \textrm{Gene $j$ regulates gene $i$} \\
1126: 0 ; \ \ \textrm{Otherwise}
1127: \end{array}
1128: \right. \ ,
1129: \end{displaymath}
1130: a ``leaf'' corresponds to a column containing only zeros. An iteration of the
algorithm removes these columns, together with the corresponding rows. Note
1132: that the leaf removal algorithm is not guaranteed to preserve the network
1133: structure as in the abstract cases discussed above.
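
As a toy illustration (a hypothetical three-gene chain, not taken from the
data set), let gene 1 regulate gene 2 and gene 2 regulate gene 3, so that
\begin{displaymath}
  C = \left(
    \begin{array}{ccc}
      0 & 0 & 0 \\
      1 & 0 & 0 \\
      0 & 1 & 0
    \end{array}
  \right) \ .
\end{displaymath}
The third column contains only zeros, so gene 3 is a leaf, and the third row
and column are removed. In the reduced $2 \times 2$ matrix the column of gene
2 is now empty, so gene 2 is removed at the next iteration, followed by gene
1: the core is empty. By contrast, for two genes regulating each other,
$C_{12} = C_{21} = 1$, no column is empty and both genes remain in the core.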
1134: % figura?
1135: % and it makes sense because it's unicellular? oppurepunto su # ecoli con N
1136: % geni?
1137: \begin{figure}[htbp]
1138: \centering
1139: \subfigure[]{\includegraphics[width=.5\textwidth]{INOUTedges}}
1140: \subfigure[]{\includegraphics[width=.5\textwidth]{ar}}
\caption{Data inferred from the E.~coli transcription network. (a) Degree
distribution. (b) Activity of autoregulators. See also~\cite{MC03}.}
1143: \label{fig:colidati}
1144: \end{figure}
1145:
1146:
1147: %
During the iterative leaf removal procedure, one is confronted with an
important choice, concerning how to deal with autoregulators. These pose a
problem because, for particular assignments of the functions, they generate trivial
contradictions.
%
In reality, this self-contradiction does not exist, as negative
autoregulations are known to play the role of controlling the overexpression
of a particular gene. A standard, and the simplest, way to avoid the problem
is simply to eliminate the autoregulations, and impose that the diagonal of
$C_{ij}$ is zero.
%
However, this total cancellation is not biologically motivated, as
1160: autoregulations might reflect some global properties of the system, other than
1161: control of overexpression. To clarify, let us consider a gene that regulates
1162: itself and is regulated by some others (``rest''), that is a ``regulated
1163: autoregulator'' (RAR). It is then subject to the constraint $\s_0 = f ( \s_0 |
1164: \textrm{rest} ) = A (\textrm{rest})\s_{0} + B (\textrm{rest})$. If
1165: $A(\textrm{rest}) = 0$ the gene is regulated simply, for $B\in \big\{-1,1
1166: \big\}$ and the autoregulation is irrelevant. Conversely, when $B
1167: (\textrm{rest}) = 0$, the autoregulation plays a role, but if $A
1168: (\textrm{rest} ) = -1$ the system is UNSAT.
1169:
1170:
To solve this problem, we propose a way to take the role of autoregulators
into account, while at the same time avoiding the trivial self-contradiction.
In order to do this, we introduce the constraint $A (\textrm{rest} ) = 1$ that
codes for the avoidance of trivial contradictions. With this technique, we aim
to save the autoregulation role, while taking into account the well-known fact
that auto-inhibitions cannot be represented with Boolean variables. We can
call this the ``RAR hypothesis''. We will see that this hypothesis leads to
a different final result. The same reasoning can be carried out for GRq-type
1179: variables.
1180: %
1181: Assuming the RAR hypothesis, the problem becomes a mixed optimization
1182: problem, that includes the usual GR1 constraints, plus a set of Sat-like
1183: constraints that come from the $A_{n} (\textrm{rest} ) = 1$ conditions on the
1184: RARs.
1185:
1186: \subsection{The E.~coli Transcription Network}
1187:
In the E.~coli data set there are 423 genes, and 59 autoregulations. Among
1189: these, 24 are RARs (Fig.~\ref{fig:colidati}). Applying the leaf-removal
1190: algorithm with cancellation of autoregulations leads to an empty core. This
1191: means that the system finds itself in the simple control, SAT phase. However,
1192: the application of the RAR hypothesis leads to a non-empty core
1193: (Fig.~\ref{fig:colirar0}).
1194: % figura
1195: \begin{figure}[htbp]
1196: \centering
1197: \includegraphics[width=.8\textwidth]{leaf0}
1198: \caption{The CCC of E.~coli with the RAR hypothesis. It contains 22
1199: variables (of which 14 are free or regulated only by themselves, 4 are
1200: non-free, and 4 are RARs) and 22 constraints (of which 18 are RAR
1201: constraints). }
1202: \label{fig:colirar0}
1203: \end{figure}
1204:
1205:
The genes in the core can be divided into three different classes: free, which
we will denote by $\tau$, non-free ($\sigma$), and RARs ($\alpha$). The
1208: core contains a total of 22 variables (of which 14 are free or regulated only
1209: by themselves, 4 are non-free, and 4 are RARs) and 22 constraints (of which 18
1210: are RAR constraints).
1211: %
Biologically speaking, these core genes include some ``global regulators'', or
transcription factors with a high out-ward connectivity~\cite{MC03}, including
(a) the sigma factors rpoS and rpoN, (b) proteins belonging to the family of
the DNA bending global regulator crp, (c) himA, or IHF, another DNA bending
factor.
%
More interestingly, lower connectivity proteins, connected to metabolism
(e.g. respiratory control and iron transport), and to structural tasks (e.g.
synthesis of the flagellum) are also found in the core.
1221:
The residual optimization problem on the core variables is small and simple
enough to be solved for general functions, as exemplified in
Fig.~\ref{fig:colirar12}. Once the free genes are fixed, the final solution
admits only two states.
1226: %
1227: \begin{figure}[htbp]
1228: \centering
1229: \subfigure[]{\includegraphics[width=.75\textwidth]{leaf1}}
1230: \subfigure[]{\includegraphics[width=.75\textwidth]{leaf2}}
1231: \caption{Solution of the general optimization problem on the core variables
1232: of E.~coli in the RAR hypothesis. In this procedure, variables are fixed
1233: with respect to each other according to the constraints that connect them.
  (a) Second step of the computation. (b) Last step of the computation.}
1235: \label{fig:colirar12}
1236: \end{figure}
1237:
1238:
What is the meaning of the CCC in the RAR hypothesis, if any? The answer can
come from two directions: simulations and experiments. On the numerical
side, one must study how fixing the core variables affects the approach to a
fixed point or a steady state in a Boolean network. Our simulations with both
asynchronous spin-flip and synchronous update dynamics show that, for a fixed
choice of random functions on the whole network, the core free genes control a
larger set of genes than the non-core ones (these results will be published
elsewhere~\cite{CLB04}). This is an indication that the CCC found with the RAR
hypothesis might have some significance. The same feature can be tested with
microarray expression experiments.
1249:
% and YEAST

%RAR

% FINITE N ????
1255:
1256: \section{Conclusions}
1257:
1258:
In conclusion, we have presented and discussed a novel conceptual framework
for the equilibrium modeling of large-scale transcription networks. In its
most general formulation, our approach is directly connected to the
Shea-Ackers model for the \emph{cis}-regulatory region of a gene, and
consists of a compatibility analysis of the constraints established by the
signal integration functions.
% for this reason we prefer to think of GR first of all as a model of
% transcription, although - at the coarse-grained level of GR1 - it can be
% generalized to generic regulation networks
1268:
The advantage of this approach is that it allows one to separate issues
related to the dynamics of the network from the basic logic structure that
underlies it. Obviously, dynamics is a very important factor of a real
biochemical network, possibly the most important. On the other hand, we feel
that disentangling the two aspects might lead to further insight.
%
In the spirit of theoretical computer science, any dynamics superimposed on
GR1 can be seen as an algorithm. The problem then becomes the following: how
effectively does a given algorithm, modelling chemical kinetics, explore
configuration space? Naturally, this addition may carry intricate issues,
connected to the nonequilibrium nature and the asymmetry of the interactions.
These issues are particularly complex if one wants to add a coarse-graining
of time, as is commonly done in Kauffman networks.
%
In the absence of explicit knowledge of the emergent time scales involved in
the dynamics, we feel ours is an appropriate approach, particularly in the
Boolean approximation, GR1, which we treat here.
1286: %
1287:
% keep?
1289: %A dynamics in the space of constraints, or signal integration
1290: %functions, corresponds to Darwinian evolution of a particular genetic network.
1291: %
1292: %Thus, GR is an ideal substrate to construct evolutionary models, with a
1293: %fitness landscape on the space of constraints. Models of this class can be
1294: %seen as relatives of Kauffman's NK model.
1295:
From a general, speculative standpoint, our model shows that ``biological
complexity'' is not simply measured by the number of genes. For a
transcription network, a more appropriate indicator is $\Delta_{C}$.
Interestingly, for GR1 this coincides exactly with what is called the
``computational'' complexity, $\Sigma$.
%
Looking at the phase diagram, $\Sigma$ depends on the order parameter
$\gamma$, or, loosely, on the number of transcription factors per gene. At
fixed number of genes, it is known that this quantity increases in bacteria
that need to react to more environments~\cite{CdL03}.
%
Imagining that prokaryotes, being unicellular, naturally find themselves in a
simple control phase, our phase diagram predicts an intrinsic limit to this
adaptation process, represented by the phase boundary with the HARD-SAT
phase.
%
Considering varying $N$, one may wonder why, in real organisms, a small
$\gamma$ is correlated with a small $N$~\cite{vN03}. A possible answer to this
question is the following. With large $N$ and small $\gamma$, the system is
shifted to the SAT phase, and therefore needs to explore a very large
configuration space without sufficient ``guidelines''. In other words, the
available configurations are too many to be reached in a reasonable time by
the dynamics.
1318:
Similar considerations can be made for the \emph{width} of the distribution
of satisfying solutions. The fact that the self-averaging property holds
indicates that this width is negligible in the thermodynamic limit. On the
other hand, the typical value of $N$ for a living system is in the range
$10^{3}$--$10^{5}$. While large for detailed modeling, this is a
smaller number than the size of the typical system treated with statistical
mechanics. Thus, the effects of the system size are expected to be important.
1326:
1327:
Considering the phase diagram in Fig.~\ref{fig:pd}, the complex control phase,
having general residual dynamics, matches a qualitative feature of many cells,
where some genes are constantly expressed, and the rest vary.
% to be substantiated! or cite a textbook / articles on expression in E. coli, Drosophila, clustering?
On the other hand, the dynamical slowing down characteristic of any glassy
phase raises an issue that must be solved by the chemical dynamics of the
cell. In analogy with Kauffman's ideas, the breakdown into many different
basins of attraction might be interpreted as epigenesis. That is, in the
HARD-SAT phase there will typically be many cell types. Their number is
determined by the complexity $\Sigma$, which is directly measured by our
$\Delta_{C}$. While for fixed $K$ this quantity simply decreases with
increasing $\gamma$, its behavior is more interesting in the multi-Poisson
case.
1341: %% Thus, interpreting the number of attraction basins as the number of cell
1342: %% types
1343: %% is tempting.
% TO BE REVISED.
However, this remains an open issue that must be examined in more detail.
1346: The experimental scaling of the number of cell types is sub-linear in the
1347: number of genes $N$~\cite{Kau93,Kau04}. In the fixed $K$ case, GR1 gives
1348: exponential scaling with $N$ at fixed $\gamma$ in the complex control phase.
1349: On the other hand, the results of random-GR1 are perhaps more easily related
1350: to the number of species times the number of cell types at equal number of
1351: genes. In the same way, the behavior of $\overline{[\mathcal{N}]^2}$ with $N$
1352: should roughly predict the scaling of the \emph{variability} in the total
1353: number of cell types for all the species with equal number of genes $N$.
1354: According to GR1, this quantity should vanish in the thermodynamic limit.
1355:
1356:
To be biologically useful, the model has to deal with the details of an
individual realization of the system. In this respect, an advantage of the
leaf-removal algorithm is that it transforms a problem related to the states
of variables on a graph, the gene expression patterns, into a problem
regarding the \emph{structure} of the graph. This is of particular interest as
long as the data regarding the activity of function nodes are only partially
known. For example, the first application to the E.~coli core in the RAR
hypothesis leads to interesting results, which have a numerical counterpart
and might be tested with expression correlation data. The application to
more and larger data sets, and to other forms of regulation, might lead to
further insight.
1367: %
1368: Notably, some of the core variables do not have a high connectivity. This is
1369: an indication that additional, global, properties of the network structure
1370: other than local order parameters must contribute to establish the hierarchy
1371: of states in configuration space.
1372:
Finally, besides the extensions to the work presented here, we believe the
framework of GR might be used as a setting for many different problems
involving fairly large networks, from evolutionary models to regulation
network optimization, from network inference to design. It will potentially
be useful in the years to come, as more and more data become available from
high-throughput experiments.
1379:
1380:
1381: %The RAR hypothesis should be seen as an attempt to overcome a limitation of
1382: % the Boolean formalism, it
1383: % we simpy note that
1384: %disregarding self-regulations might an equally radical choice.
1385:
1386:
% extensions
1388: % evolution is a dynamics on the function nodes structure and coordination
1389: %P(k)
1390: %GR is naturally fit to study networks with non-Boolean variables and non-fixed
1391: %in-ward connectivities. It can be easily extended to nonzero temperature
1392: % (probabilistic constraints).
1393: %
1394:
1395: %
1396:
1397: %
% end: A biologically significant model has to be able
%
% we tried

% CHANGE: elaborate on leaf removal and other possible algorithms, as for SAT
1403:
1404: % Furthermore, there exist techniques able to analyze single realizations of
1405: % optimization problems which might be applicable to sufficiently characterized
1406: % genetic networks. For example, a typical experiment in molecular biology
1407: % involves a local interaction with a regulation network with overexpression or
1408: % deletion of a gene. Such methods of analysis could be used to generate
1409: % predictions of the effects that keep into account the network connections.
1410:
% there is much to do. What? 1) 2) 3) etc
1412:
1413:
1414: %
1415: %==================================
1416:
1417: \begin{acknowledgments}
We would like to acknowledge interesting discussions with L.~Finzi,
A.~Sportiello, J.~Berg, M.~Leone, M.~Caselle, P.R.~ten~Wolde, and R.~Zecchina.
1420: \end{acknowledgments}
1421:
1422:
1423: \begin{appendix}
1424:
\section{Self-averaging property of GR1}
1426: \label{sec:self-aver-prop}
1427:
1428:
The following paragraphs describe the calculation of the width of the
distribution of $\mathcal{N}$. By definition,
$$\overline{[\mathcal{N}]^2}=\sum_{\vec{\s},\vec{\t}}\sum_{C}p(C)\sum_{f^1 \in
  \mathcal{F} }p(f^1)\sum_{f^2 \in \mathcal{F}}p(f^2)\cdots\sum_{f^M\in
  \mathcal{F}}p(f^M) \cdot$$
1434: %
1435: %
1436: $$\cdot \prod_{m=1}^M \d
1437: \big(1-f^m(\s_{n(1,m)},\cdots,\s_{n(k,m)})\cdot\s_{n(0,m)}\big) \d
1438: \big(1-f^m(\t_{n(1,m)},\cdots,\t_{n(k,m)})\cdot\t_{n(0,m)}\big) \ \ ;$$
1439:
1440: thus,
1441: $$\overline{[\mathcal{N}]^2}= \sum_{\vec{\s},\vec{\t}}\sum_{C}p(C)
1442: \prod_{m=1}^M \Big(\sum_{f^m \in \mathcal{F}}p(f^m)
1443: \d \big(1-f^m(\s)\cdot\s_{n(0,m)}\big)
1444: \d \big(1-f^m(\t)\cdot\t_{n(0,m)}\big)
1445: \Big) \ \ .$$
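
The passage from the first to the second expression uses only the fact that
each function $f^m$ appears in a single factor of the product. As a sketch,
writing $g_m(f^m)$ (a shorthand introduced here, not used elsewhere) for the
pair of deltas attached to clause $m$, and assuming, as in the definition
above, that the $f^m$ are drawn independently with weights $p(f^m)$, one has
\begin{displaymath}
  \sum_{f^1 \in \mathcal{F}} p(f^1) \cdots \sum_{f^M \in \mathcal{F}} p(f^M)
  \prod_{m=1}^{M} g_m(f^m) =
  \prod_{m=1}^{M} \Big( \sum_{f^m \in \mathcal{F}} p(f^m) \, g_m(f^m) \Big)
  \ \ .
\end{displaymath}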
1446:
% sigma and tau need the vector arrow!
1448: For fixed states $\vec{\s}$ and $\vec{\tau}$, we can write the factors of the
1449: product above as
1450: $$\Big(
1451: \d \big(1-\s_{n(0,m)}\big)
1452: \d \big(1-\t_{n(0,m)}\big)\cdot A(\s_m ;\tau_m)+
1453: \d \big(1-\s_{n(0,m)}\big)
1454: \d \big(1+\t_{n(0,m)}\big)\cdot B(\s_m ;\tau_m) + $$
1455: $$\d \big(1+\s_{n(0,m)}\big)
1456: \d \big(1-\t_{n(0,m)}\big)\cdot C(\s_m ;\tau_m) +
1457: \d \big(1+\s_{n(0,m)}\big)
1458: \d \big(1+\t_{n(0,m)}\big)\cdot D(\s_m ;\tau_m)
1459: \Big) \ \ ,$$
1460:
where $A$, $B$, $C$, $D$ are the weights of the functions $ f \in
\mathcal{F}$ such that, respectively,
1463: \begin{displaymath}
1464: \left \{ \begin{array}{lllll}
1465: \textrm{$ A(\s_m ;\tau_m) $}&\textrm{$\leftarrow$} & \textrm{$f(\s \in m)=1$}
1466: &\& & \textrm{$f(\t \in m)=1$} \\
1467: \textrm{$ B(\s_m ;\tau_m)$}&\textrm{$\leftarrow$} & \textrm{$f(\s \in m)=1$}
1468: &\& & \textrm{$f(\t \in m)=-1$} \\
1469: \textrm{$ C(\s_m ;\tau_m)$}&\textrm{$\leftarrow$} & \textrm{$f(\s \in m)=-1$}
1470: &\& & \textrm{$f(\t \in m)=1$} \\
1471: \textrm{$ D(\s_m ;\tau_m)$}&\textrm{$\leftarrow$} & \textrm{$f(\s \in m)=-1$}
1472: &\& & \textrm{$f(\t \in m)=-1$} \\
1473: \end{array}\right.
1474: \end{displaymath}
1475:
Now, as $A+B+C+D =1$ and $A+B = A+C = \rho$, we can write,
choosing $\rho=1/2$,
% check whether $a$ was called $\rho$ earlier!!

$$\sum_{C } p(C) \sum_{\vec{\sigma}, \vec{\tau}} \prod_{m} $$
%
$$\left\{\right. A(\s_{m},\t_{m}) \cdot \big[\d \big(1-\s_{n(0,m)}\big)\d
\big(1-\t_{n(0,m)}\big) + \d \big(1+\s_{n(0,m)}\big)\d
\big(1+\t_{n(0,m)}\big) $$
$$ - \d \big(1-\s_{n(0,m)}\big)\d \big(1+\t_{n(0,m)}\big) - \d
\big(1+\s_{n(0,m)}\big)\d \big(1-\t_{n(0,m)}\big)\big] \ \ + \ \ \ \
(\mathcal{A})$$
%
$$ + \frac{1}{2} \big[ \d \big(1-\s_{n(0,m)}\big)\d \big(1+\t_{n(0,m)}\big) +
\d \big(1+\s_{n(0,m)}\big)\d \big(1-\t_{n(0,m)}\big) \big] \left. \right\}
\ \ \ \ (\mathcal{R})$$
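
As a sketch of the intermediate algebra (our reading of the conditions
$A+B+C+D=1$ and $A+B=A+C=\rho$ stated above), the weights $B$, $C$ and $D$ can
be eliminated in favor of $A$:
\begin{displaymath}
  B = C = \rho - A \ , \qquad D = 1 - A - B - C = 1 - 2\rho + A \ ,
\end{displaymath}
so that for $\rho = 1/2$ one has $B = C = \mezzo - A$ and $D = A$. Inserting
these values in the four-term factor, the contributions with equal outputs
($\s_{n(0,m)} = \t_{n(0,m)}$) carry weight $A$, while those with opposite
outputs carry weight $\mezzo - A$.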
1489: %
1490:
The product over $m$ gives rise to $2^{M}$ terms of the kind
$$
\prod_{k=1}^{r} \mathcal{A}^{(k)} \prod_{k'=r+1}^{M} \mathcal{R}^{(k')} \ \
,$$
where $\mathcal{A}$ and $\mathcal{R}$ indicate factors of the two types in
the previous expression. For every $r$ there are $\su{M}{r}$ terms of this
kind in the sum. Applying the properties that characterize our family of
functions, we find $A= \sum_{f \in \mathcal{F}} p(f) \big[
\big(\frac{1+f(\s)}{2}\big) \big(\frac{1+f(\t)}{2}\big)\big]=
\frac{1}{4}\big[1 + \d(\s,\t)\big]$.
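
This value of $A$ can be recovered with a short computation, sketched here
under the assumption (not restated in this appendix) that the family
$\mathcal{F}$ with weights $p(f)$ assigns outputs to distinct inputs
independently and, for $\rho = 1/2$, without bias. Writing $\langle \cdot
\rangle$ for the average over $p(f)$, a notation introduced only for this
check,
\begin{displaymath}
  \Big\langle \frac{1+f(\s)}{2} \, \frac{1+f(\t)}{2} \Big\rangle =
  \frac{1}{4} \Big[ 1 + \langle f(\s) \rangle + \langle f(\t) \rangle +
  \langle f(\s) f(\t) \rangle \Big] \ .
\end{displaymath}
With $\langle f \rangle = 2\rho - 1 = 0$, and $\langle f(\s) f(\t) \rangle$
equal to $1$ when the two input configurations coincide and to $0$ otherwise,
the right-hand side reduces to $\frac{1}{4}\big[1 + \d(\s,\t)\big]$.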
1501:
With this consideration, the sum over the configurations $\vec{\sigma},
\vec{\tau}$ can be simplified. It involves the product
1504: $$ \prod_{m} $$
1505: %
1506: $$\left\{ \right. A(\s,\t) \cdot \big[\s_{n(0,m)} \t_{n(0,m)}\big] \ \ + \ \ \
1507: \ \ \ \ \ (\mathcal{A})$$
1508: %
1509: $$ + \frac{1}{4} \big[1 - \s_{n(0,m)} \t_{n(0,m)} \big] \left.
1510: \right\} \ \ \ \ \ \ \ \ \ \ \ \ (\mathcal{R})$$
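
The form of the two factors follows from an identity valid for variables
taking the values $\pm 1$, which we spell out as a check: $\d(1-s) =
(1+s)/2$ and $\d(1+s) = (1-s)/2$. Hence, for $s = \s_{n(0,m)}$ and $s' =
\t_{n(0,m)}$ (shorthands used only here),
\begin{displaymath}
  \d(1-s)\d(1-s') + \d(1+s)\d(1+s') = \frac{1 + s s'}{2} \ , \qquad
  \d(1-s)\d(1+s') + \d(1+s)\d(1-s') = \frac{1 - s s'}{2} \ .
\end{displaymath}
The difference of the two combinations is $s s'$, which multiplies $A$ in the
$\mathcal{A}$ part, while one half of the second combination gives the
$\frac{1}{4}[1 - s s']$ of the $\mathcal{R}$ part.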
1511: %
1512:
Let us now distinguish again between free genes and non-free ones, which are
outputs of some signal integration function. The sum over non-free genes is
such that (i) there is a contribution $\left(\mezzo\right)^{r}$ due to the
Kronecker deltas in the $\mathcal{A}$ part and (ii) if a type $\mathcal{R}$
non-free $\s_{n(0,k')}$ or $\tau_{n(0,k')}$ variable appears in the input of a
type $\mathcal{A}$ clause, the contribution is zero. A little thought leads to
the conclusion that the non-free genes sum up to the term
1520: \begin{displaymath}
1521: \left(\mezzo\right)^{r} \left(1-\gamma + \frac{r}{N}\right)^{kr}
1522: \end{displaymath}
1523:
The sum over the $2(N-M)$ free genes would give a $4^{\Delta}$ contribution in
case of complete independence. However, a delta function on the input genes
reduces the double sum to a single one. To estimate this contribution one has
to evaluate the probability that two free genes appear as input of a type
$\mathcal{A}$ clause. In a mean-field-like estimate, this gives an expected
number $ kr \frac{N-M}{N-M +r}$ of such inputs, leading to the contribution
1530: \begin{displaymath}
1531: 4^{\Delta} \cdot 2^{-\frac{kr}{1+\frac{r}{\Delta}}}
1532: \end{displaymath}
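
The exponent can be read off as follows; this is a sketch that assumes
$\Delta = N-M$, the number of free genes, as the expression above suggests.
Each free gene that is constrained by a delta function contributes a factor
$2$ instead of $4$ to the double sum, that is a relative factor $\mezzo$, and
\begin{displaymath}
  kr \, \frac{N-M}{N-M+r} = \frac{kr}{1 + \frac{r}{N-M}} =
  \frac{kr}{1+\frac{r}{\Delta}} \ ,
\end{displaymath}
so that the free genes contribute $4^{\Delta} \cdot
2^{-kr/(1+r/\Delta)}$.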
1533:
The $4^{\Delta}$ term factors out of everything, and alone would give the
desired self-averaging property. It remains to evaluate the sum over $r$.
Restricting the sum to the core genes and evaluating it with a saddle-point
method leads to the minimization of the free-energy-like functional
1538: \begin{displaymath}
1539: G(x) = x \log x + (1-x) \log (1-x) + \log(2) x\left( 1 +
1540: \frac{k(1-\gamma)}{1-\gamma(x-1)}\right) - k x
1541: \log\left(1-\gamma(x-1)\right) \ ,
1542: \end{displaymath}
1543: where $x \in [0,1]$, and $\gamma := \frac{M_{C}}{N_{C}}$.
1544: %....
Minimization of this functional always leads to the solution $x=0$, with the
exception of the regions $\gamma < 1/k$ and $\gamma > \gamma^{*}$, where
$\gamma^{*}$ is a threshold that always lies in the UNSAT region. For $k=3$,
$\gamma^{*} \simeq 0.9722$.
1549:
1550: \end{appendix}
1551:
1552: \begin{thebibliography}{}
1553:
1554: \bibitem[Alberts et~al., 2003]{ABL+03}
1555: Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., and Watson, J. (2003).
1556: \newblock {\em Molecular {B}iology of the {C}ell}.
1557: \newblock Garland, London, New York, 4th edition.
1558:
1559: \bibitem[Arkin et~al., 1998]{ARM98}
1560: Arkin, A., Ross, J., and McAdams, H. (1998).
1561: \newblock Stochastic kinetic analysis of developmental pathway bifurcation in
1562: phage lambda-infected {E}scherichia coli cells.
1563: \newblock {\em Genetics}, 149(4):1633--48.
1564:
1565: \bibitem[Babu et~al., 2004]{BLA+04}
1566: Babu, M., Luscombe, N., Aravind, L., Gerstein, M., and Teichmann, S. (2004).
1567: \newblock Structure and evolution of transcriptional regulatory networks.
1568: \newblock {\em Curr Opin Struct Biol}, 14(3):283--91.
1569:
1570: \bibitem[Bastolla and Parisi, 1998]{BP98}
1571: Bastolla, U. and Parisi, G. (1998).
1572: \newblock Relevant elements, magnetization and dynamical properties in
1573: {K}auffman networks. {A} numerical study.
1574: \newblock {\em Physica D}, 115(3-4):203--218.
1575:
1576: \bibitem[Buchler et~al., 2003]{BGH03}
1577: Buchler, N., Gerland, U., and Hwa, T. (2003).
1578: \newblock On schemes of combinatorial transcription logic.
1579: \newblock {\em Proc Natl Acad Sci U S A}, 100(9):5136--41.
1580:
1581: \bibitem[Caracciolo and Sportiello, 2002]{CS02}
1582: Caracciolo, S. and Sportiello, A. (2002).
1583: \newblock An exactly solvable random satisfiability problem.
1584: \newblock {\em J. Phys. A}, 35:7661--7688.
1585:
1586: \bibitem[Cases et~al., 2003]{CdL03}
1587: Cases, I., de~Lorenzo, V., and Ouzounis, C. (2003).
1588: \newblock Transcription regulation and environmental adaptation in bacteria.
1589: \newblock {\em Trends Microbiol}, 11(6):248--53.
1590:
1591: \bibitem[Correale et~al., 2004]{CLP+}
1592: Correale, L., Leone, M., Pagnani, A., Weigt, M., and Zecchina, R. (2004).
1593: \newblock Private Communication.
1594:
1595: \bibitem[Cosentino~Lagomarsino et~al., 2005]{CLB04}
1596: Cosentino~Lagomarsino, M., Bassetti, B., and Jona, P. (2005).
1597: \newblock In Preparation.
1598:
1599: \bibitem[Davidson et~al., 2002]{DRO+02}
1600: Davidson, E., Rast, J., Oliveri, P., Ransick, A., Calestani, C., Yuh, C.,
1601: Minokawa, T., Amore, G., Hinman, V., Arenas-Mena, C., Otim, O., Brown, C.,
1602: Livi, C., Lee, P., Revilla, R., Rust, A., Pan, Z., Schilstra, M., Clarke, P.,
1603: Arnone, M., Rowen, L., Cameron, R., McClay, D., Hood, L., and Bolouri, H.
1604: (2002).
1605: \newblock A genomic regulatory network for development.
1606: \newblock {\em Science}, 295(5560):1669--78.
1607:
1608: \bibitem[Gershenson, 2004]{Ger04}
1609: Gershenson, C. (2004).
1610: \newblock Introduction to {R}andom {B}oolean {N}etworks.
1611: \newblock In {\em Artificial {L}ife {IX} {W}orkshops and {T}utorials}.
1612:
1613: \bibitem[Gillespie, 1977]{GD77}
1614: Gillespie, D. (1977).
1615: \newblock Exact stochastic simulation of coupled chemical reactions.
\newblock {\em J. Phys. Chem.}, 81(25):2340--61.
1617:
1618: \bibitem[Herrgard et~al., 2004]{HCP04}
1619: Herrgard, M., Covert, M., and Palsson, B. (2004).
1620: \newblock Reconstruction of microbial transcriptional regulatory networks.
1621: \newblock {\em Curr Opin Biotechnol}, 15(1):70--7.
1622:
1623: \bibitem[Kauffman, 1969a]{Kau69}
1624: Kauffman, S. (1969a).
1625: \newblock Homeostasis and differentiation in random genetic control networks.
1626: \newblock {\em Nature}, 224(215):177--8.
1627:
1628: \bibitem[Kauffman, 1969b]{Kau69b}
1629: Kauffman, S. (1969b).
1630: \newblock Metabolic stability and epigenesis in randomly constructed genetic
1631: nets.
1632: \newblock {\em J Theor Biol}, 22(3):437--67.
1633:
1634: \bibitem[Kauffman, 1993]{Kau93}
1635: Kauffman, S. (1993).
1636: \newblock {\em The {O}rigins of {O}rder}.
1637: \newblock Oxford Univ. Press, New York.
1638:
1639: \bibitem[Kauffman, 2004]{Kau04}
1640: Kauffman, S. (2004).
1641: \newblock A proposal for using the ensemble approach to understand genetic
1642: regulatory networks.
1643: \newblock {\em J Theor Biol}, 230(4):581--90.
1644:
1645: \bibitem[Kaufman and Drossel, 2005]{KD05}
1646: Kaufman, V. and Drossel, B. (2005).
1647: \newblock On the properties of cycles of simple {B}oolean networks.
1648: \newblock {\em Eur Phys J B}.
1649: \newblock in publication.
1650:
1651: \bibitem[Lee et~al., 2002]{LRR+02}
1652: Lee, T., Rinaldi, N., Robert, F., Odom, D., Bar-Joseph, Z., Gerber, G.,
1653: Hannett, N., Harbison, C., Thompson, C., Simon, I., Zeitlinger, J., Jennings,
1654: E., Murray, H., Gordon, D., Ren, B., Wyrick, J., Tagne, J., Volkert, T.,
1655: Fraenkel, E., Gifford, D., and Young, R. (2002).
1656: \newblock Transcriptional regulatory networks in {S}accharomyces cerevisiae.
1657: \newblock {\em Science}, 298(5594):799--804.
1658:
1659: \bibitem[Martinez-Antonio and Collado-Vides, 2003]{MC03}
1660: Martinez-Antonio, A. and Collado-Vides, J. (2003).
1661: \newblock Identifying global regulators in transcriptional regulatory networks
1662: in bacteria.
1663: \newblock {\em Curr Opin Microbiol}, 6(5):482--9.
1664:
1665: \bibitem[McAdams and Arkin, 1997]{MA97}
1666: McAdams, H. and Arkin, A. (1997).
1667: \newblock Stochastic mechanisms in gene expression.
1668: \newblock {\em Proc Natl Acad Sci U S A}, 94(3):814--9.
1669:
1670: \bibitem[Mertens, 2002]{M02}
1671: Mertens, S. (2002).
1672: \newblock Computational complexity for physicists.
1673: \newblock {\em Computing in Science and Engineering}, 4(3):31--47.
1674:
1675: \bibitem[Mezard and Parisi, 2003]{MP03}
1676: Mezard, M. and Parisi, G. (2003).
1677: \newblock The cavity method at zero temperature.
1678: \newblock {\em J. Stat. Phys}, 111:1.
1679:
1680: \bibitem[Mezard et~al., 1987]{MPV87}
1681: Mezard, M., Parisi, G., and Virasoro, M. (1987).
1682: \newblock {\em Spin {G}lass {T}heory and {B}eyond}.
1683: \newblock World Scientific, Singapore.
1684:
1685: \bibitem[Mezard et~al., 2002]{MPZ02}
1686: Mezard, M., Parisi, G., and Zecchina, R. (2002).
1687: \newblock Analytic and algorithmic solution of random satisfiability problems.
1688: \newblock {\em Science}, 297(5582):812--5.
1689:
1690: \bibitem[Mezard et~al., 2003]{MRZ03}
1691: Mezard, M., Ricci-Tersenghi, F., and Zecchina, R. (2003).
1692: \newblock Alternative solutions to diluted p-spin models and {XORSAT} problems.
1693: \newblock {\em J. Stat. Phys}, 505.
1694:
1695: \bibitem[Mezard and Zecchina, 2002]{MZ02}
1696: Mezard, M. and Zecchina, R. (2002).
1697: \newblock Random {K}-satisfiability problem: from an analytic solution to an
1698: efficient algorithm.
1699: \newblock {\em Phys Rev E Stat Nonlin Soft Matter Phys}, 66(5 Pt 2):056126.
1700:
1701: \bibitem[Nachman et~al., 2004]{NRF04}
1702: Nachman, I., Regev, A., and Friedman, N. (2004).
1703: \newblock Inferring quantitative models of regulatory networks from expression
1704: data.
1705: \newblock {\em Bioinformatics}, 20 Suppl 1:I248--I256.
1706:
1707: \bibitem[Ptashne, 1992]{Pta92}
1708: Ptashne, M. (1992).
1709: \newblock {\em A {G}enetic {S}witch}.
\newblock Cell Press, MA, second edition.
1711:
1712: \bibitem[Salgado et~al., 2001]{SSG+01}
1713: Salgado, H., Santos-Zavaleta, A., Gama-Castro, S., Millan-Zarate, D.,
1714: Diaz-Peredo, E., Sanchez-Solano, F., Perez-Rueda, E., Bonavides-Martinez, C.,
1715: and Collado-Vides, J. (2001).
1716: \newblock Regulon{DB} (version 3.2): transcriptional regulation and operon
1717: organization in {E}scherichia coli {K}-12.
1718: \newblock {\em Nucleic Acids Res}, 29(1):72--4.
1719:
1720: \bibitem[Shea and Ackers, 1985]{SA85}
1721: Shea, M. and Ackers, G. (1985).
1722: \newblock The {OR} control system of bacteriophage lambda. {A}
1723: physical-chemical model for gene regulation.
1724: \newblock {\em J Mol Biol}, 181(2):211--30.
1725:
1726: \bibitem[Shen-Orr et~al., 2002]{SMM+02}
1727: Shen-Orr, S., Milo, R., Mangan, S., and Alon, U. (2002).
1728: \newblock Network motifs in the transcriptional regulation network of
1729: {E}scherichia coli.
1730: \newblock {\em Nat Genet}, 31(1):64--8.
1731:
\bibitem[Socolar and Kauffman, 2003]{SaK03}
Socolar, J. E.~S. and Kauffman, S.~A. (2003).
1734: \newblock Scaling in {O}rdered and {C}ritical {R}andom {B}oolean {N}etworks.
1735: \newblock {\em Phys Rev Lett}, 90:068702.
1736:
1737: \bibitem[Thomas, 1973]{Tho73}
1738: Thomas, R. (1973).
1739: \newblock Boolean formalization of genetic control circuits.
1740: \newblock {\em J Theor Biol}, 42(3):563--85.
1741:
1742: \bibitem[van Nimwegen, 2003]{vN03}
1743: van Nimwegen, E. (2003).
1744: \newblock Scaling laws in the functional content of genomes.
1745: \newblock {\em Trends Genet}, 19(9):479--84.
1746:
1747: \bibitem[Warren and ten Wolde, 2004]{WtW04}
1748: Warren, P. and ten Wolde, P. (2004).
1749: \newblock Statistical analysis of the spatial distribution of operons in the
1750: transcriptional regulation network of {E}scherichia coli.
1751: \newblock {\em J Mol Biol}, 342(5):1379--90.
1752:
1753: \bibitem[Wolf and Arkin, 2003]{WA03}
1754: Wolf, D. and Arkin, A. (2003).
1755: \newblock Motifs, modules and games in bacteria.
1756: \newblock {\em Curr Opin Microbiol}, 6(2):125--34.
1757:
1758: \end{thebibliography}
1759:
1760:
1761:
1762: \end{document}
1763:
1764:
1765:
1766:
1767: