0502:q-bio0502017/rev05.tex

1: \documentclass[12pt,a4paper,aps,rmp,onecolumn]{revtex4}

2: \usepackage{latexsym}

3: \usepackage{times}

4: \usepackage{graphicx}

5: \usepackage[tight,FIGTOPCAP]{subfigure}

6: %\usepackage{overcite}

7: \usepackage{amsmath,amssymb,enumerate,mathrsfs,boxedminipage,fancybox}

8:

9:

10: \newcommand{\eulang}[2]{\genfrac{\langle}{\rangle}{0pt}{}{#1}{#2}}

11: \newcommand{\su}[2]{\genfrac{(}{)}{0pt}{}{#1}{#2}}

12:

13:

14: \def\de#1#2{\frac {d#1}{d#2}}

15: \def\dde#1#2{\frac {d^2#1}{d#2^2}}

16: \def\dep#1#2{\frac {\partial#1}{\partial#2}}

17:

18: \def\e#1{{\rm e}^{#1}}

19:

20: \def\tr#1{{\rm tr} \: [{#1}]}

21:

22:

23: \def\Media#1{\big < {#1}\big >}

24: \def\media#1{< {#1}>}

25:

26: \def\mezzo{\frac{1}{2}}

27:

28: \def\O{\Omega}

29: \def\o{\omega}

30: \def\U{\bigcup}

31: \def\I{\bigcap}

32: \def\eps{\epsilon}

33: \def\s{\sigma}

34: \def\S{\Sigma}

35: \def\a{\alpha}

36: \def\b{\beta}

37: \def\t{\tau}

38: \def\l{\lambda}

39: \def\ub{\frac{1}{\beta}}

40: \def\d{\delta}

41: \def\D{\Delta}

42: \def\g{\gamma}

43: \def\G{\Gamma}

44: \def\L{{\mathcal L}}

45: \def\ene{{\mathcal E}}

46: \def\h{\hbar}

47:

48: \def\bi{\begin{itemize}}

49: \def\ei{\end{itemize}}

50:

51: \def\be{\begin{enumerate}}

52: \def\ee{\end{enumerate}}

53:

54: \def\beq{\begin{equation}}

55: \def\eeq{\end{equation}}

56:

57: \def\bdm{\begin{displaymath}}

58: \def\edm{\end{displaymath}}

59:

60: \def\bsp{\begin{split}}

61: \def\ensp{\end{split}}

62:

63: \def\C{\subset}

64: \def\und{\underline}

65: \def\qua{\quad ; \quad}

66: \def\es{\boxgiallo{Esempio}}

67: \def\nota{\Frecciaverde ~~ NOTA:~~}

68: \def\Frecciagialla { \boxgiallo {\( \Rightarrow \)} ~~}

69: \def\verdegiu{\begin{center} \boxverde{$ \Downarrow $} \end{center}}

70: %%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%

71:

72: \bibliographystyle{apalike}

73:

74:

75: \begin{document}

76:

77: \author{M. Cosentino Lagomarsino}

78: \affiliation{UMR 168 / Institut Curie, 26 rue d'Ulm 75005 Paris, France}

79: \email[ e-mail address: ]{mcl@curie.fr}

80: %

81: \author{P. Jona}

82: \affiliation{Politecnico di Milano, Dip. Fisica, Pza Leonardo Da Vinci

83:   32, 20133 Milano, Italy}

84: \email[ e-mail address: ]{ patrizia.jona@fisi.polimi.it}

85: %

86: \author{B.  Bassetti}

87: \affiliation{Universit\`a degli Studi di Milano, Dip.

88:     Fisica, and I.N.F.N. Via Celoria 16, 20133 Milano, Italy }

89: \email[e-mail address: ]{ bassetti@mi.infn.it }

90: % {+39 - (0)2 - 50317477 ; fax +39 - (0)2 - 50317480}

91:

92:

93: \title{The large-scale logico-chemical structure of a transcriptional

94:   regulation network}

95:

96:

97: \begin{abstract}

98:   Identity, response to external stimuli, and spatial architecture of a living

99:   system are central topics of molecular biology. Presently, they are largely

100:   seen as a result of the interplay between a gene repertoire and the

101:   regulatory machinery of the cell.  At the transcriptional level, the

102:   \emph{cis}-regulatory regions establish sets of interdependencies between

103:   transcription factors and genes, including other transcription factors.

104:   These ``transcription networks'' are too large to be approached globally

105:   with a detailed dynamical model.  In this paper, we describe an approach to

106:   this problem that focuses solely on the \emph{compatibility} between gene

107:   expression patterns and signal integration functions, discussing

108:   calculations carried out on the simplest, Boolean, realization of the model, and

109:   a first application to experimental data sets.

110: \end{abstract}

111:

112: \maketitle

113:

114:

115: \section{Introduction}

116:

117:

118:

119: %DA CAMBIARE TUTTA _ MOLTE INFO VANNO DIRETTAMENTE NEL MODELLO

120: % non ha struttura

121:

122: Regulation can be defined as the set of physico-chemical constraints operating

123: within a living cell that modulate the expression of the cell's genes. In the

124: present view of molecular biology, regulatory processes are often used as a

125: primary causal explanation for many phenomena, playing a role in this

126: discipline that is comparable to the role fundamental interactions play in

127: physics.  In fact, it is widely believed that the repertoire of signal

128: responses (and, more generally, of all the information processing and

129: structural tasks) of living systems is encoded in interconnected threads of

130: genes regulating the activity of each other.  These networks of

131: interdependencies are still largely uncharacterized, although they have begun

132: to fall within reach of systematic experimentation in recent

133: years~\cite{HCP04,NRF04,BLA+04,WA03,LRR+02}.

134: %cita generali:

135:

136: Considering the so-called ``central dogma'' of molecular biology,

137: \begin{displaymath}

138: \textrm{DNA} \stackrel{\textrm{transcription}}{\longrightarrow}

139: \textrm{mRNA} \stackrel{\textrm{translation}}{\longrightarrow}

140: \textrm{protein} \stackrel{\textrm{folding}}{\longrightarrow}

141: \textrm{function},

142: \end{displaymath}

143: regulation processes can intervene at all the separate steps (and also in

144: different sub-steps).  Regulation exploiting the process of transcription, or

145: transcriptional regulation, constitutes to date the best understood among all

146: the possible regulation mechanisms.

147: %cita

148:

149:

150: \begin{figure}[htbp]

151:   \centering

152:   \includegraphics[width=.85\textwidth]{Fig1}

153:   \caption{Schematics of our representation of a signal integration function

154:     at the \emph{cis-}regulatory region of a gene as a constraint on the gene

155:     expression variables. For general variables, the constraint involves

156:     minimization of the free energy of the Shea-Ackers model. In GR1, the

157:     constraint is Boolean.}

158:   \label{fig:Fagraph}

159: \end{figure}

160:

161: Transcriptional regulation networks are defined starting from the basic

162: functional building blocks involved in transcription. These are (i) the

163: promoter region of a gene or operon along the DNA sequence, which contains the

164: \emph{cis} regulatory binding sites for the transcription factors, (ii) the

165: transcription factors, which are proteins that regulate the binding of

166: RNA-polymerase, and (iii) RNA-polymerase, the protein complex that performs

167: transcription of a gene or an operon in mRNA form~\cite{Pta92,ABL+03}.  The

168: amount of mRNA transcribed is related to the expression of a particular gene

169: only if one takes for granted all the other steps that lead to a functional

170: protein. If this (big) leap is accepted, the ``state'' of a cell is identified

171: with the mRNA concentration of its genes. Experimentally, this is particularly

172: sound for prokaryotes and simple unicellular organisms, but often assumed in

173: more complex contexts, for example in DNA microarray experiments.

174: %

175: Under this assumption, the locations and orientations of the binding sites for

176: transcription factors, as well as the affinity of the transcription factors to

177: different binding sites, determine the expression levels of a gene in response

178: to changes in the active transcription factor concentrations inside the cell.

179: In turn, the concentration of active transcription factors (the ones that can

180: actually bind) encodes the configuration of the environment, for example

181: through degradation or activation by internal and external signaling

182: molecules.

183: %

184: A \emph{cis-}regulatory region can contain many binding sites for many

185: transcription factors which act in cooperation (or competition) on the

186: promoter region, to control in a combinatorial way the binding of RNA

187: polymerase.  This process, referred to as signal integration, is the

188: logic heart of the network.

189: %da qui in poi quasi tutto in modello

190: %

191: %FIGURA1

192: A transcriptional regulation network can be represented as a hypergraph

193: containing both gene expression (``variable'') nodes and signal integration

194: (``function'') nodes. The connectivity is the source of the network complexity

195: (Fig.~\ref{fig:grafone}).

196: %

197: \begin{figure}[htbp]

198:   \centering

199:   \includegraphics[width=.8\textwidth]{Grafone}

200:   \caption{Graph representation of a transcription network.  Each diamond node

201:     represents a signal integration function, while each black circle is a

202:     variable. The directionality of the constraint is represented graphically

203:     by labeling the diamonds with IN and OUT on two different sides.}

204:   \label{fig:grafone}

205: \end{figure}

206:

207:

208:

209: %piu' generale che transcription??? PER ORA NO

210: Thus, a transcriptional regulation network can work independently as a

211: computational unit in a living cell, being able to make decisions on which

212: genes will be switched on at different times. Studies focusing simply on the

213: \emph{structure} of the underlying graph have led to interesting

214: results~\cite{BLA+04,SMM+02,WtW04}.  However, characterizing and predicting

215: gene expression patterns given a network structure remains an enormous

216: challenge. Two main problems exist.  Firstly, the networks are only partially

217: characterized experimentally~\cite{HCP04,NRF04}.  In some

218: instances~\cite{BLA+04,DRO+02}, the wiring diagram is well known, but in

219: general the functions are described only qualitatively, typically with

220: annotations such as activation, repression or dual effects, and little is

221: known about their actual structure.

222: %

223: Secondly, transcription networks are fairly large. While detailed models or

224: simulations work well on small (sub-)systems~\cite{MA97,ARM98}, typically a

225: coarse grained approach is needed.

226: %

227: %This problem has also an influence on the choice of the dynamics.

228: Microscopically, it is well accepted that the Gillespie algorithm~\cite{GD77},

229: %CITA

230: while disregarding spatial correlations, correctly describes the stochastic

231: asynchronous events of chemical kinetics involved. On the other hand, with a

232: mesoscopic average in time, it is still unclear what the emergent time scales

233: might be.  In particular, the pioneering approach of

234: Kauffman~\cite{Kau69,Kau69b,Kau93,Kau04}, suggesting a synchronous

235: deterministic dynamics for a Boolean (i.e.  ON/OFF) representation of the

236: network is still being debated, both in its assumptions and in its

237: results~\cite{Ger04,KD05,SaK03,BP98}.

238:

239: % CITA CONDMAT CICLI!!!

240:

241:

242: % Costruiamo un modello di equilbrio che tiene conto, oltre che della

243: % struttura , dell'espressione dei geni

244: We consider the second problem, and develop a model (called GR, from Gene

245: Regulation), that focuses, rather than on pure dynamics, on the compatibility

246: between gene expression patterns and signal integration functions. The

247: compatibility constraints are generated by the clauses encoded by signal

248: integration functions at \emph{cis}-regulatory regions.  Our framework

249: describes the system as a combinatorial optimization problem where $N$

250: variables, the gene expression levels, are subject to $M$ constraints,

251: representing the signal integration nodes.  Simply put, a cell with $N$ genes

252: can express them in exponentially many ways, $2^N$ in the Boolean ON/OFF

253: representation. However, the cell never explores all the possible patterns of

254: expression. It generates only clusters of correlated configurations.  To fix

255: ideas, one can think of the very elementary example of the cI-cro switch

256: of $\lambda$-phage~\cite{Pta92,Tho73}. In this case one could observe the

257: states 10 (where cI is ON, and cro is OFF), 01, or perhaps 00, but never 11,

258: because this state is ruled out by the signal integration function.

259: % Bruno - ma a me sembra un po' ridondante

260: In a cell, with the added complexity of the regulation network, we can think

261: that many of the states are not observable for the same compatibility reasons.

262: The approach is easily connected to a detailed thermodynamic treatment of

263: transcription from a

264: %nb da equilibrium diventa local equilibrium

265: signal integration node on one side~\cite{BGH03}, and to the statistical

266: mechanics of spin glasses and combinatorial optimization problems on the

267: other~\cite{MPZ02}.
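
As a minimal sketch of the compatibility idea, the cI-cro example above can be written as a brute-force enumeration of expression patterns. The single constraint used here (cI and cro are never both ON) is an assumption read off the list of observable states, not a model of the actual $\lambda$-phage promoter logic, and the function name is purely illustrative.

```python
from itertools import product

def compatible_states(n_genes, constraints):
    """Return all Boolean expression patterns that satisfy every constraint."""
    return [s for s in product((0, 1), repeat=n_genes)
            if all(c(s) for c in constraints)]

# Assumed constraint: cI (index 0) and cro (index 1) are never both ON.
mutual_exclusion = lambda s: not (s[0] and s[1])

states = compatible_states(2, [mutual_exclusion])
print(states)
```

Of the $2^2$ possible patterns, only those passing the constraint survive; the same enumeration becomes impractical as the number of genes grows, which motivates the statistical treatment developed in the following sections.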

268:

269:

270: %% Despite of the wealth of knowledge on network structure and expression

271: %% profiles in small organisms such as E.~coli and budding yeast, there

272: %% is to date no transcription network for which all the function nodes

273: %% are completely characterized.

274: %

275: Rather than the detailed quantitative prediction of mRNA expression states,

276: the current challenge is to set a conceptual framework which can help to

277: interpret the observations in concrete examples, integrating as much as

278: possible with known data.

279: %

280: As the simplest example of this, we study the behavior of the Boolean

281: version of our model on the network structure of E.~coli.

282: %% Analysis of the structure of the E.~coli and yeast transcription

283: %% networks shows a modular structure. The modules that form them, called

284: %% \emph{network motifs}, carry elementary identifiable functions in the

285: %% life-cycle of these organisms.  However, while some of these

286: %% motifs are simple, and can be interpreted easily as ``circuitry''

287: %% elements such as filters or amplifiers, others, sometimes referred to

288: %% as ``dense overlapping regulons'', are intrinsically combinatorial,

289: %% and ``wire'' many genes with many others.

290: %non si sa se

291: % conditions.

292: %cita ptashne

293: % also, if I get carried away, I can say that it is a restricted and unified concept

294: % of the notion of complexity

295: Building up from this simplest case, the aim is to analyze increasingly

296: realistic network structures, in order to generate a theory that, while being

297: consistent with the generic qualitative features of regulation networks, is

298: useful to analyze single instances and realizations. This is maximally

299: important as biological knowledge is constructed on specificities, and not on

300: typical case behavior.

301:

302:

303: %Structure of the paper!

304:

305: This paper is structured as follows. Sec.~\ref{sec:model} introduces the model

306: abstractly, as an optimization problem, which, in sec.~\ref{sec:shea} is

307: connected to the more concrete thermodynamic Shea-Ackers model of

308: transcriptional regulation. Sec.~\ref{sec:satmap} abandons this general

309: setting, and takes on the simplest possible formulation of the model, GR1,

310: which has Boolean functions and variables, showing that this case maps

311: directly to a so-called satisfiability problem (Sat). The aim of

312: sec.~\ref{sec:leaf} is to analyze the typical number $\mathcal{N}$ of gene

313: patterns of random instances of GR1, starting from the case of fixed

314: connectivity. The ``leaf removal'' algorithm allows us to carry out this analysis in

315: the annealed approximation.  An important premise is the fact, which is

316: evident looking at the data~\cite{SMM+02}, that some genes are essentially

317: ``free'' from the point of view of transcription. These are mainly controllers

318: and are connected to external stimuli. The expression of the rest is

319: conditioned on the state of other genes.  The algorithm allows us to define the

320: ``complex combinatorial core'' (CCC) of the network, as the set of genes able

321: to control its global state.  The number of non-controlled, or ``free''

322: variables in the core determines the complexity of the system. The phase

323: diagram shows three distinct regimes of gene control.  In the first (UNSAT),

324: there are no free genes in the core, and the system cannot control the

325: simultaneous expression of all its genes.  In the second regime (``complex

326: control'' or HARD-SAT), the core contains free genes that control, both

327: directly and indirectly, many others.  The general dynamics is residual (many

328: variables are fixed, the others can change).  In the third regime, the core is

329: empty.  Each free gene (which is external to the core) controls the state of a

330: small number of genes (``simple control'', or SAT phase).

331: Sec.~\ref{sec:selfav} concerns itself with the \emph{width} of the

332: distribution of $\mathcal{N}$, which has both a technical significance as a

333: validity test of the annealed approximation, and a biological one, as the

334: variability in the number of gene patterns at fixed gene number.

335: Sec.~\ref{sec:multipoiss} discusses generalizations of these results to

336: non-fixed connectivities.  Finally, sec.~\ref{sec:concr} describes a first

337: attempt to put these findings to work on an experimental data set.

338:

339: \section{Model}

340: \label{sec:model}

341:

342: Our aim is to describe in a minimal way gene expression in a transcription

343: network, separating the issues related to the dynamics from those related to

344: its logical and computational structure. In order to do this, we will

345: formulate a model that sees the system as an optimization problem, where a set

346: of variables, the genes, is subject to a set of constraints, the signal

347: integration nodes. Upon this logic backbone, many a dynamics can be

348: superimposed, including in the most general case the kinetic Monte Carlo scheme

349: commonly used to model genetic networks.

350: %dire meglio!!!!!!!!

351: Rather than going towards the direction of highest detail, we will choose to

352: simplify the model as much as possible, reducing the number of details to the

353: minimum, and studying the general qualitative features of the system.

354:

355: The model is specified by

356: \begin{itemize}

357: \item[1)] A set of $N$ discrete variables $\{ x_i \}_{i=1..N}$ associated to

358:   genes or operons, which in the simplest picture are identified with their

359:   transcripts and protein products. These variables represent the expression

360:   levels and in general take discrete values in $\{0,..,q\}$. In particular

361:   situations, they are well-approximated by continuous variables.

362: \item[2)] A set of $M$ interactions, or constraints $\{ I_b(x_{i_{0}},

363:   x_{i_1}, ..., x_{i_{k_b}}) \}_{b=1..M}$ between the genes, representing the

364:   signal integration from transcription nodes.

365: \end{itemize}

366: This formulates an optimization problem, which we call GR, from Gene

367: Regulation. GR asks to find the states compatible with the constraints.

368:

369: The model can be easily generalized to include other relevant degrees of

370: freedom, such as translation, protein modification and protein-protein

371: interactions. However, each addition adds complexity and parameters.

372: Therefore we start with the minimal possible description.

373: % e' una cagata

374: Admittedly, neglecting non-transcriptional regulation is a drastic

375: simplification of the system. A complete genetic network should in principle

376: include all forms. On the other hand, the justification for considering

377: transcription alone is that it is the first step in the chain of regulation

378: events and it is experimentally well characterized.  From the physics point of

379: view, this model can be seen as a ``spin glass'', a system where some

380: variables, our gene expression levels, interact through some coupling

381: constants, specified by the constraints~\cite{M02}. This approach to

382: optimization problems of computer science has proved to be very useful in

383: recent years~\cite{MPZ02}.

384:

385: % factor graph

386: The network structure is naturally represented on a ``factor graph'', where

387: two kinds of nodes are present, $N$ ``variable nodes'' and $M$ ``function nodes''

388: respectively (Fig.~\ref{fig:grafone}). Here $k_{b} = 1+ K_{b}$, where $K_b$ is the number of input variables of function node $b$, can be seen as

389: the local connectivity of that function node. Note that the factor graph is also

390: defined by a variable connectivity $c_{i}$, the number of functions connected

391: to $x_{i}$.

392: %figura

393: In fact, this is the typical structure of a constraint satisfaction problem of

394: theoretical computer science, such as q-coloring or satisfiability~\cite{M02}.
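
The two connectivities of the factor graph can be made concrete with a small sketch. The toy network below is hypothetical (it is not taken from any data set), and the dictionary layout is just one convenient assumption for storing function nodes as an output gene plus a tuple of input genes.

```python
from collections import Counter

# Hypothetical toy network: each function node b is (output gene, tuple of input genes).
function_nodes = {
    "b1": (2, (0, 1)),   # gene 2 is regulated by genes 0 and 1
    "b2": (3, (2,)),     # gene 3 is regulated by gene 2
}
n_genes = 4

# k_b = 1 + K_b: number of variables attached to function node b.
k = {b: 1 + len(inputs) for b, (_, inputs) in function_nodes.items()}

# c_i: number of function nodes attached to variable i (as output or as input).
attached = Counter()
for out, inputs in function_nodes.values():
    for i in {out, *inputs}:
        attached[i] += 1
c = [attached[i] for i in range(n_genes)]
```

Here gene 2 has the largest variable connectivity because it is both the output of one function node and an input of another.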

395:

396: %figure 1, c'e' gia da intro?

397:

398: %description of individual node is shea ackers

399:

400:

401: \section{The constraints and the Shea-Ackers model}

402: \label{sec:shea}

403:

404: To specify the model one has to give a structure for the function nodes, i.e.

405: the constraints. This requires a physical model for signal integration. In

406: order to take this step, in this section we start from the well-known and

407: widely accepted thermodynamic model of Shea and Ackers of gene activation by

408: recruitment, to show that it is the natural setting to express

409: our constraints.

410: %Later on we will proceed with the analysis of a simpler, if

411: %not the simplest, example.

412:

413: Let us consider a function node with $K$ regulators and one output variable.

414: This is modeled, in the version presented by Buchler and

415: collaborators~\cite{BGH03}, as a neural network (a ``Boltzmann machine'') with

416: Hamiltonian

417: \begin{displaymath}

418:   H = \sum_{\stackrel{i,j = 0..L}{ i\ne j} } J_{ij} s_i s_j

419:   + \sum_{j = 0..L} h_j s_j \ \ ,

420: \end{displaymath}

421: where $s_1, .., s_L$ are the occupation variables of the \emph{cis-} binding

422: sites, $h_{j}$ are external fields, functions of $x_1, ..., x_K$ representing

423: the concentrations of ``input'' transcription factors, and $J_{ij}$ are

424: interaction constants associated to competitive versus cooperative binding.

425: More precisely, $h_{j} = - \beta^{-1} \log(Q_{j})$, where $Q_{j} =

426: \frac{[TF_{i}]}{\kappa_{j}} \sim \frac{x_{i}}{\kappa_{j}}$ is the binding

427: affinity of site $j$ for the transcription factor $i$ that binds it, and $\kappa_{j}$ the corresponding dissociation constant. Normally,

428: the concentrations are approximated with continuous variables.

429: %

430: In general, $L \ge K$, because multiple binding sites are present.  Finally,

431: $s_0, h_0$ are the occupation variable and the external field (a fixed

432: parameter corresponding to the polymerase binding affinity) associated to the

433: output node of the function (see Fig.~\ref{fig:Sheanode}).

434: %

435: \begin{figure}[htbp]

436:   \centering

437:     \includegraphics[width=.7\textwidth]{SAnodeSs}

438:   \caption{Exemplification of a Shea-Ackers node. $s_i$ are occupation

439:     variables for the transcription factors binding sites, while the coupling

440:     constants $J_{ij}$ encode cooperative or competitive binding. The external

441:     fields $h_i$ are the phase space variables of GR.}

442:   \label{fig:Sheanode}

443: \end{figure}

444:

445:

446: Given all the binding constants and the interaction parameters, the

447: (intrinsically probabilistic) output of the gate is computed as a function of

448: the input fields, simply as the probability that $s_0 = 1$.

449: %

450: This expectation value can be obtained through the partition function

451: \begin{displaymath}

452:   Z[h_0,...,h_L] = \sum_{\{s\}} e^{- \beta H}

453: \end{displaymath}

454: as

455: \begin{displaymath}

456:   P(s_0 = 1) = \frac{1}{Z}  \sum_{s_1,\ldots,s_L}  e^{- \beta

457:   H(1,s_1,...,s_L)}

458: \end{displaymath}

459: where $\beta = 1/kT$.
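
As an illustration, consider a worked sketch under the assumption of a single regulator site ($L=1$), using $Q_j = e^{-\beta h_j}$ as defined above and the shorthand $\omega = e^{-\beta (J_{01}+J_{10})}$, which is introduced here only for convenience. Summing over the four occupation states $(s_0,s_1)$ gives
\begin{displaymath}
  Z = 1 + Q_0 + Q_1 + \omega Q_0 Q_1 , \qquad
  P(s_0 = 1) = \frac{Q_0 + \omega Q_0 Q_1}{1 + Q_0 + Q_1 + \omega Q_0 Q_1} .
\end{displaymath}
For $\omega > 1$ (cooperative binding) the output probability increases with the regulator concentration through $Q_1$, while for $\omega < 1$ (competitive binding) it decreases. For $\omega = 1$ it reduces to $Q_0/(1+Q_0)$, independent of the input.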

460: %NB guarda che beta e' fittizio x' h,J~1/beta

461:

462: Equivalently, one can compute the local free energy

463: \begin{equation}

464:   - \frac{1}{\beta} \log Z = F[h_0,...,h_L]  = F[x_0, ..., x_K]

465:   \label{eq:fensa}

466: \end{equation}

467: and find the average output through minimization with respect

468: to the $x_0$ coordinate.

469:

470: In other words, the function nodes are local equilibrium conditions for the

471: variable nodes, specified by the Shea-Ackers model of the

472: \emph{cis-}regulatory region of each variable node. The expression variables

473: $\{ x_i \}_{i=1..N}$ need to satisfy the constraints specified by the local

474: minimizations of the free energies $\{ F_b(x_{i_{0}}, x_{i_1}, ...,

475: x_{i_{k_b}}) \}_{b=1..M}$. Since there is a clear input-output logic encoded by

476: the chemical equilibrium of each signal integration node, one could refer to

477: this static backbone as the ``logico-chemical'' structure of the

478: network, and separate it from its ``dynamic'' structure.  The logic it encodes

479: is of course not Boolean. In fact, it is intrinsically non-Boolean even with

480: Boolean variables, as the outputs are probabilistic functions of the inputs.

481:

482:

483: % analogy with SG

484: From the point of view of statistical mechanics, this is a Potts spin system

485: with diluted interactions described by the local free energies $F_b$ (which in

486: this context should be interpreted as effective Hamiltonians).

487: % cavity analogy boh taglia

488: Interestingly for the analogy with spin glasses, the model for the gate can be

489: seen as a message-passing procedure analogous to that exploited by the cavity

490: method~\cite{MPV87,MP03,MZ02}, where, in the approximation of factorized

491: probability distribution of the variables, one evaluates the local fields

492: $h_{i \rightarrow b}$, describing the local influence of the couplings on

493: variable $i$ in absence of interaction $b$, and $u_{b \rightarrow i}$, the

494: contribution of interaction $b$ on the local magnetic field on spin $i$,

495: together with their histograms in the presence of many states.

496: %

497: In our case, the ``messages'' described above can only travel in the

498: input-to-output direction.  We are currently investigating whether this

499: analogy can be exploited for further calculations, and we are aware of work in

500: this direction, in a simpler setting, by another group~\cite{CLP+}.

501: %boh

502:

503: %coarse grain

504: In principle, the variables $x_i$ directly stand for the number of expressed

505: molecules in a cell. Provided the set of all binding constants and

506: interactions is known, all the function nodes can be computed and the model is

507: complete.  It could be solved, for example by numerical simulations, once a

508: dynamics is specified.

509:

510: On the other hand, with a few exceptions of small systems, these (many)

511: parameters are in general not known. For this reason, rather than

512: aiming for the highest level of detail, we choose to simplify as much

513: as possible, while trying to keep the most relevant features.

514: % (furthermore, a high level of detail would be useless without

515: % incorporating the rest of the genetic network in teh description with

516: % the same level of detail).

517: For practical purposes, in order to be able to advance further analytically

518: and numerically in the understanding of the model, it is convenient to

519: introduce coarse grained expression levels, thereby effectively reducing $q$.

520: The resulting model, GRq, is identical, except that variables and

521: constraints are now subject to implicit averaging. One advantage of this

522: approach is that local free energies become easier and easier to specify, and

523: it is possible to study, as is commonly done for spin glasses, the typical

524: behavior of the system as a function of the parameters.

525:

526:

527: In the simplest possible scenario $q=1$, and the expression levels are Boolean

528: variables. The assumption behind this is that what matters is only if the

529: level of expression is high or low~\cite{Kau93,Kau69b}.

530: %buchler hwa giustificaz

531: % the ideal is x_i = 1..q. It can be done.

532: % Nota questo non e' kauffmann, vedi lettera.

533: % dire GR1 GRq

534: The simplest case (which we will still call GR1) also assumes Boolean

535: functions.  In the following section, we will show how GR1 maps to a

536: Satisfiability problem (Sat), an optimization problem where $N$

537: Boolean variables are constrained by $M$ conjunctive normal form (CNF)

538: constraints (i.e. by a Boolean polynomial constructed as a product ($\wedge$)

539: of $M$ disjunctive monomials ($\vee$)).

540:

541:

542: \section{GR1, mapping on a Satisfiability Problem}

543: \label{sec:satmap}

544:

545: As we have shown in the above section, for general variables $x_i$ that

546: represent real expression levels, the constraints can be derived directly from

547: the model of Shea and Ackers of gene activation by

548: recruitment~\cite{SA85,BGH03}. If the $x_i$ represent coarse-grained

549: expression levels, the same model can be used to construct the local free

550: energy in Eq.~\eqref{eq:fensa}, associated to each signal integration node,

551: that generates the constraints through minimization.

552: %

553: Here we consider the simplest possible scenario, treating the

554: expression levels as Boolean variables, setting $q=1$, and the signal

555: integration functions as Boolean functions $\{ f_b(x_{b_1}, ..,

556: x_{b_{K_b}}) \}_{b=1..M}$.

557: %no fallo dopo!!!

558: We also restrict to the case of fixed inward connectivity $K_b = K, \ \

559: \forall b$.  These conditions, defining $K$-GR1, are also found in Kauffman

560: networks~\cite{Kau93,Ger04}. We will relax the hypothesis of fixed $K$ to

561: explore networks with fluctuating connectivity in section

562: \ref{sec:multipoiss}.

563:

564:

565: \begin{figure}[htbp]

566:   \includegraphics[width=.7\textwidth]{Nodi}

567:   \caption{Translation of a 2-GR1 node in 3-Sat Nodes. The

568:     input-output direction is from top to bottom. Sat constraints are

569:     represented as squares, where the black and white vertices indicate that

570:     the corresponding variable enters negated or affirmed respectively in the

571:     3-Sat constraint.}

572:   \label{fig:schemino}

573: \end{figure}

574:

575: The expression

576: \begin{math}

577:    x_{b_0} = f_b (x_{b_1}, .. ,x_{b_{K}}),

578: \end{math}

579: which imposes that the variable $x_{b_0}$ is the output of the

580: function $f_b$, translates into the Boolean constraint

581: \begin{equation}

582:   \neg(x_{b_0} \dot{\vee} f_b).

583:   \label{eq:constr}

584: \end{equation}

585: In a Kauffman network, this expression is equivalent to the fixed

586: point condition, and there is one such constraint for every variable

587: $x_i$. The formula

588: \begin{displaymath}

589:   I = \bigwedge_{b=1..M} I_b = \bigwedge_{b=1..M} \neg(x_{b_0} \dot{\vee} f_b)

590: \end{displaymath}

591: defines a Satisfiability problem (Sat) on the variables $x_1,.., x_N$.

592: From the biological viewpoint, this is a logic representation of the

593: computational tasks encoded in the transcription network by evolution,

594: i.e. which sets of genes have to be switched on at any given

595: condition.  More abstractly, having in mind Kauffman networks, each

596: satisfying solution of this problem corresponds to a fixed point in

597: the Kauffman dynamics (independently from the update scheme).  The

598: number of variables involved in one constraint is always exactly

599: $k=K+1$, therefore this mapping associates a network with fixed

600: connectivity $K$, to a $k$-Satisfiability problem whose connectivity is

601: increased by one unit.  For example, a $K=2$ Kauffman network

602: corresponds to a 3-Sat problem, and so on.

603: %

604: The suitable order parameter for such a system is $\gamma = M/N$. GR1 assumes

605: that each gene expression variable is regulated at most by one signal

606: integration function, so that $\gamma \le 1$.

607:

608:

609: To further understand the logic structure of GR1, we can write the CNF

610: constraints on each variable $x_n$. This allows us to make a connection to the

611: order parameter used in $k$-Sat, i.e. the local constraint density $\alpha$, defined

612: as the number of conjunctive-normal-form (CNF) clauses per Boolean variable.

613: In order to do this, we recast the Boolean formulas $I_b$ into CNF.

614: Reordering the truth table of $f_b$ so that the first $z$ rows

615: ($1,..,z$) give zero as an output, a simple procedure shows that

616: \begin{equation}

617: \begin{array}{ccc}

618:   I_b &=&

619:   \Big( \bigwedge_{\alpha=1}^{z}

620:     (\neg x_{b_0} \vee \xi_{\alpha_1} \vee ..\vee \xi_{\alpha_K})

621:   \Big) \\ & & \\

622:  &&  \bigwedge

623:   \Big(

624:     \bigwedge_{\alpha=z+1}^{2^K} ( x_{b_0} \vee \xi_{\alpha_1} \vee ..\vee

625:     \xi_{\alpha_K})

626:   \Big),

627: \end{array}

628:    \label{eq:satform}

629: \end{equation}

630: with

631: \begin{displaymath}

632: \xi_{\alpha_j} =

633: \left\{

634:   \begin{array}{cc}

635:     x_j & \textrm{if element } \  \alpha,j \ \textrm{of truth table} = 0 \\

636:     \neg x_j & \textrm{if element } \ \alpha,j \ \textrm{of truth table} = 1

637:   \end{array}

638: \right. ,

639: \end{displaymath}

640: having exactly $2^K$ clauses of $K+1$ elements.

641: % MA prima: ogni GATE ha 4 nodi !

642: % questi nodi non sono tot random, ma correlati negli input

643: Thus, a network with connectivity $K$ maps into a $(K+1)$-Sat having always

644: $\alpha = 2^K \gamma$.
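
A worked instance may help fix the notation. The choice of $K=2$ and of the AND function is ours, made only for illustration. For $x_{b_0} = x_1 \wedge x_2$, the truth table ordered as $(0,0),(0,1),(1,0),(1,1)$ has $z=3$ rows with output zero, and Eq.~\eqref{eq:satform} becomes
\begin{displaymath}
\begin{array}{rcl}
  I_b &=& (\neg x_{b_0} \vee x_1 \vee x_2) \wedge
          (\neg x_{b_0} \vee x_1 \vee \neg x_2) \wedge
          (\neg x_{b_0} \vee \neg x_1 \vee x_2) \\
      & & \wedge \; (x_{b_0} \vee \neg x_1 \vee \neg x_2) ,
\end{array}
\end{displaymath}
that is, $2^K = 4$ clauses of $K+1 = 3$ literals each, all acting on the same three variables. The first three clauses force $x_1 = x_2 = 1$ whenever $x_{b_0} = 1$, and the last one forbids $x_{b_0} = 0$ when $x_1 = x_2 = 1$, so that $I_b$ is true exactly when $x_{b_0} = x_1 \wedge x_2$. With $M = \gamma N$ such constraints there are $4 \gamma N$ clauses on $N$ variables, i.e. $\alpha = 4\gamma$.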

645: %

646: %

647: However, we cannot directly infer that the typical behavior, and therefore the

648: phase diagram, of the system will be the same as that of the corresponding

649: random $k$-Sat. The random realizations of the GR1 constraints are

650: \emph{a priori} only a subset of all the possible realizations of a $k$-Sat

651: constraint. In fact, it is easy to see that the $2^K$ CNF clauses

652: written in Eq.~\eqref{eq:satform} contain all the possible (fixed)

653: combinations of the input variables and the corresponding (random) $2^K$

654: values of the output variable. This is best exemplified on the factor graph

655: (Fig.~\ref{fig:schemino}).

656:

657: Having established a link between GR1 and a particular optimization problem we

658: set out to look at random instances for the signal integration functions. For

659: fixed connectivity one can expect for $K$-GR1 a behavior similar to that of random

660: $k$-Sat, or $k$-XORSAT (a Sat problem whose clauses contain only XOR

661: operations), with the presence of a phase transition in the number of satisfying

662: states.  The suitable order parameter for such a transition is $\gamma = M/N$.

663: Notably, the spaces of functions of the three models have different dimensions.

664: Furthermore, differently from Sat or XORSAT, in GR1 each gene expression

665: variable is regulated at most by one signal integration function, so that

666: $\gamma \le 1$. In practice, both for ease of interpretation and for

667: simplicity of the analytical formulation, from now on we will abandon the

668: formulation of GR1 in terms of CNF constraints, and work with the input-output

669: functions $f_{n}$.

670:

671: \section{Leaf Removal and the Computational Core}

672: \label{sec:leaf}

673:

674: The aim of this section is to compute the number of satisfying solutions

675: $\mathcal{N}$, for random instances of the constraints. For convenience,

676: we will map GR1 to a spin system.  Quite simply, each Boolean variable $x_i

677: \in \{0,1 \}$ is transformed into a spin $\sigma_i \in \{-1,1\}$.

678: %

679: Each constraint, or diamond function node, generates the interaction

680: Hamiltonian $ H_{\diamond,b}$

681: \begin{equation}

682:   2^k H_{\diamond,b} = \sum_{J_{b_1},..,J_{b_K}}

683:                        \Big[ \prod_{l=1}^{K} (1+ J_{b_l} \sigma_{b_l}) \Big]

684:                        (1- J_{b_0}^{\{ J_{b_1},..,J_{b_K}\}} \sigma_{b_0})\ .

685:    \label{eq:ham_diam}

686: \end{equation}

687: %

688: Under the restrictions defining GR1, the total energy of the system is simply

689: the number of violated constraints.  A zero-energy configuration satisfies all

690: the constraints, and is therefore able to comply with all the logic functions

691: encoded by the network.  The $2^K$ coupling constants $J_{b_0}^{\{

692:   J_{b_1},..,J_{b_K}\}} = \pm 1$ are a representation of the truth table of

693: the function $f_b$.  With the correspondence $\{0,1\} \leftrightarrow

694: \{-1,1\}$ , $J_{b_1},..,J_{b_K}$ stand for the possible values of the input

695: variables $ x_{b_1},..,x_{b_K}$, while $J_{b_0}^{\{J_{b_1},..,J_{b_K}\}}$ is

696: the associated output value.  The Hamiltonian (\ref{eq:ham_diam}) is the cost

697: function of the corresponding optimization problem $K$-GR1. It encodes the logic

698: constraint of Eq.~(\ref{eq:constr}), in the sense that it is one whenever the

699: constraint is violated, and zero otherwise.  From the point of view of

700: transcription networks, $H_{\diamond,b}$ is the coarse grained local free

701: energy of the Shea-Ackers model. Note that, even in the case of Boolean

702: variables, the coupling constants represent binding affinities and

703: interactions between transcription factors. In general, they need not be plus

704: or minus one. Here we took this further assumption.
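
The way Eq.~(\ref{eq:ham_diam}) counts violations can be made explicit with a short check. This is a sketch, assuming that the last factor is read as $(1 - J_{b_0}\sigma_{b_0})$, the sign convention for which a violated constraint costs one unit of energy. For a given spin configuration, each factor $(1 + J_{b_l}\sigma_{b_l})$ equals $2$ if $J_{b_l} = \sigma_{b_l}$ and $0$ otherwise, so only the term with $J_{b_l} = \sigma_{b_l}$ for all $l$ survives in the sum:
\begin{displaymath}
  2^{k} H_{\diamond,b} = 2^{K}
  \left( 1 - f_b(\sigma_{b_1},..,\sigma_{b_K}) \, \sigma_{b_0} \right)
  \quad \Rightarrow \quad
  H_{\diamond,b} = \frac{1}{2}
  \left( 1 - f_b(\sigma_{b_1},..,\sigma_{b_K}) \, \sigma_{b_0} \right) ,
\end{displaymath}
where we used $k = K+1$ and wrote the surviving coupling $J_{b_0}^{\{\sigma_{b_1},..,\sigma_{b_K}\}}$ as the $\pm 1$ output of $f_b$. Hence $H_{\diamond,b} = 0$ when $\sigma_{b_0}$ equals the output of $f_b$, and $H_{\diamond,b} = 1$ otherwise.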

705:

706: The conventional average of $\mathcal{N}$ on the realizations might be biased

707: by the weight of exceptions~\cite{MPV87}. The correct quantity to compute is

708: the ``quenched average'' of the system's free energy,

709: $\overline{\log{\mathcal{N}}}$, which is usually accessed with the replica, or

710: similar methods~\cite{MPZ02}, passing from Hamiltonians like $H_{\diamond,b}$.

711: %salta forse

712: %Having in mind the Shea-Ackers picture, biochemical noise may be

713: %included to a certain extent in our framework by specifying a

714: %probability distribution for the coupling constants.

715: %

716: %

717: However, in the case under examination we will use a simpler method, based on

718: the ``leaf removal''~\cite{MRZ03} algorithm, which allows one to compute only the

719: \emph{annealed} average $\log{\overline{\mathcal{N}}}$. As we will discuss,

720: this method has the advantage of an immediate biological interpretation in

721: terms of the roles played by genes in the network.

722: %

723: For the case of random XORSAT, Mezard and collaborators~\cite{MRZ03} have

724: shown that the annealed average on the core variables coincides with the

725: quenched one. In general $\overline{\log{\mathcal{N}}} \leq

726: \log{\overline{\mathcal{N}}}$.

727: %

728: For GR1, we performed estimates that indicate the presence of the same

729: self-averaging property in a well-defined region of parameter-space (see

730: sec.~\ref{sec:selfav}). Within this region, our annealed calculation is exact.
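
The difference between the two averages can be illustrated with a toy distribution. The numbers $s_1$, $s_2$ and $c$ are hypothetical and are not derived from GR1. Suppose that $\mathcal{N} = e^{N s_1}$ with probability $1 - e^{-Nc}$ and $\mathcal{N} = e^{N s_2}$ with probability $e^{-Nc}$, with $c > 0$ and $s_2 - c > s_1 > 0$. Then, for large $N$,
\begin{displaymath}
  \frac{1}{N} \, \overline{\log \mathcal{N}}
  = (1 - e^{-Nc}) \, s_1 + e^{-Nc} s_2 \to s_1 ,
  \qquad
  \frac{1}{N} \log \overline{\mathcal{N}} \to s_2 - c > s_1 .
\end{displaymath}
The annealed average is dominated by exponentially rare realizations, while the quenched one reflects the typical realization. The general inequality between the two follows from Jensen's inequality, since the logarithm is concave.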

731:

732: For a given realization of the constraints $ \{ \vec{I}, \vec{f} \} $, the

733: number of satisfying states $\mathcal{N}$ can be written as

734: \begin{displaymath}

735:   \mathcal{N}(\vec{I}, \vec{f}) = \sum_{\vec{\sigma}} \prod_{b=1}^M \delta(1;

736:   f_{b}(\sigma_{i(b,1)}, .., \sigma_{i(b,K_b)})\sigma_{i(b,0)}).

737: \end{displaymath}

738: Here, the randomness is contained: (i) in the specification of the network

739: structure, $\vec{I} = (I_1,...,I_M)$, i.e. in the coordinates $i(b,l)$, which

740: point at the variable occupying place $l$ in the $b$th constraint; (ii) in the

741: specification of the functions $\vec{f}= (f_1,...,f_M) $ with a certain

742: probability distribution in the class $\mathcal{F}$. An overbar ($\bar{\ ~}$)

743: indicates an average on both distributions, $p(\vec{I})$ and $p(\vec{f})$.  We

744: will first concentrate on the case with fixed in-ward connectivity $K$.

745:

746:

747: In carrying out this average, three preliminary remarks are relevant. The first is

748: that not all the $M$ equations and $N$ variables are meaningful to calculate

749: $\mathcal{N}$.  Indeed, every output variable that appears in only one

750: constraint can be trivially fixed according to its function. Thus, both the

751: constraint and the variable can be eliminated without affecting the number of

752: solutions.  This procedure is called ``leaf removal''~\cite{MRZ03}.  It is a

753: nonlinear procedure, as more variables can disappear together with a single

754: constraint, because input-free genes that regulate a leaf remain as isolated

755: points.  The iteration of this mechanism leads to the definition of a

756: ``core'', the CCC, of significant variables and constraints, in numbers of

757: $N_C$ and $M_C$ respectively.  In the CCC, $M_{C}$ genes are controlled, and

758: $\Delta_{C} = N_{C} - M_{C}$ are the ``free'' genes with an essential role in

759: controlling the expression states, as a function of an input signal.

760: % FIRST: FIXED p = k

761: The second relevant fact is the hypothesis that the functions are independent

762: and identically distributed random variables.  Thirdly, we consider a set of

763: functions in a family $\mathcal{F}_K$ which satisfy the condition

764: \begin{math}

765:   \frac{1}{2^{2^K}} \sum_{f \in \mathcal{F}_K } p(\vec{f}) f(\vec{x}) = \rho

766:   \, ,

767: \end{math}

768: as will always be the case if the outputs of the functions are

769: uncorrelated, even in the presence of bias.
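
The leaf removal procedure described above can be followed on a toy network. The four-gene chain with $K=1$ used here is a hypothetical example, chosen only for illustration:
\begin{displaymath}
  x_2 = f_2(x_1), \qquad x_3 = f_3(x_2), \qquad x_4 = f_4(x_3),
\end{displaymath}
with $N = 4$, $M = 3$ and $x_1$ input-free. The variable $x_4$ appears only in its own constraint, so it is a leaf and is removed together with $f_4$. After this step $x_3$ appears only in the constraint $f_3$ and is removed in turn, and the same then holds for $x_2$. What remains is the isolated variable $x_1$ and no constraints, so the core is empty ($N_C = M_C = 0$). Reading the removed constraints backwards, each value of $x_1$ fixes $x_2$, $x_3$ and $x_4$, giving $\mathcal{N} = 2$ solutions for any choice of the functions. By contrast, for the pair $x_1 = f_1(x_2)$, $x_2 = f_2(x_1)$ each output variable enters two constraints, no leaf exists, and the core coincides with the whole system ($N_C = M_C = 2$).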

770:

771: It is then easy to verify that

772: \begin{displaymath}

773:   \overline{ \mathcal{N}} = \sum_{\vec{\sigma}, \vec{I}} p(\vec{I})

774:   \prod_{b=1}^M \left(

775:     \rho \delta(1;\sigma_{i(b,0)}) + (1- \rho) \delta(-1;\sigma_{i(b,0)}

776:     ) \right) \ ;

777: \end{displaymath}

778: so that, as for the XORSAT model,

779: \begin{equation}

780:   \overline{\mathcal{N}} = 2^{N_{C}-M_{C}} \ .

781:   \label{eq:nstati}

782: \end{equation}

783: Incidentally, we note that the same procedure is valid for GRq, where

784: $\overline{\mathcal{N}} = q^{N_{C}-M_{C}}$.
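
The counting step behind Eq.~(\ref{eq:nstati}) can be sketched as follows, under the GR1 assumption that each spin is the output of at most one function. After the average over the functions, the factor attached to constraint $b$ depends only on the output spin $\sigma_{i(b,0)}$, and summing over that spin gives
\begin{displaymath}
  \sum_{\sigma = \pm 1} \left( \rho \, \delta(1;\sigma)
  + (1-\rho) \, \delta(-1;\sigma) \right) = \rho + (1 - \rho) = 1 .
\end{displaymath}
Each of the $M$ output spins therefore contributes a factor $1$, each of the remaining $N - M$ spins contributes a free factor $2$, and $p(\vec{I})$ sums to one, so that the result is $2^{N-M}$ for any value of $\rho$. The same counting restricted to the core, with $N_C$ variables and $M_C$ constraints, gives the exponent $N_C - M_C$.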

785:

786:

787: \begin{figure}[htb]

788:   \includegraphics[width=.75\textwidth]{Fig2}

789:   \caption{Phase diagram of $4$-GR1. For $\gamma >

790:     \gamma_c$ no solutions exist in the typical realization (UNSAT phase). For

791:     $\gamma < \gamma_d$ the system is paramagnetic and typically a satisfying

792:     state exists (SAT phase, or simple gene control). For $\gamma_d <\gamma<

793:     \gamma_c$ there is complex gene control. }

794:   \label{fig:pd}

795: \end{figure}

796:

797:

798:

799: %These genes can react to an external input and determine a state.

800: We can use $\frac{\Delta_C}{N_C}$ as an order parameter in the thermodynamic

801: limit $N \to \infty, \, \gamma$ const., to distinguish three types of

802: phenomenology, or three ``phases''. (i) For $\frac{\Delta_C}{N_C} \leq 0$,

803: there are no free genes in the core, and the system cannot comply with all the

804: expression programs encoded into its DNA. This is an UNSAT phase from the

805: computational point of view. (ii) For $0 < \frac{\Delta_C}{N_C} < 1$, the

806: $\Delta_{C}$ genes of the CCC control $O(N)$ genes each, and therefore are

807: able to determine an expression state. This is a ``complex gene control''

808: phase, or HARD-SAT. (iii) For $ \frac{\Delta_C}{N_C} = 1$ the number of

809: controlled variables is a vanishing fraction of the total number of genes.  In

810: other words, $ M_C = 0$ in the thermodynamic limit, the free genes control at

811: most $O(1)$ genes to generate a satisfying state (``simple control'', or SAT

812: phase).  In the simple control phase, the system is underconstrained, which

813: means that the logic conditions imposed by the signal integration functions

814: are generally insufficient for a strict determination of the expression

815: patterns.

816:

817: The three phases described above depend both on the value of $\gamma$ and on

818: the class of random functions considered. In general, if all the possible

819: functions are taken into account, the phase diagram can be explored studying

820: the rank and the kernel of the connectivity matrix~\cite{CS02}.

821: %

822: % parametri d'ordine di Bruno e figura che li rappresenta.

823: %

824: % caso generico rango e kernel

825: %

826: %

827: Following the analysis of Mezard and collaborators~\cite{MRZ03}, in the case

828: of Poisson variable connectivity, i.e. with the distribution $\pi(c) =

829: \frac{(k \gamma)^{c}}{c!} e^{-k \gamma} $, the phase diagram as a function of

830: $\gamma$ is identical to that of the random XORSAT problem.  It is illustrated in

831: Fig.~\ref{fig:pd}. For $\gamma>\gamma_c$ no solutions exist in the typical

832: realization (UNSAT phase).  For $\gamma<\gamma_d$ the system is paramagnetic

833: (SAT phase). For $\gamma_d<\gamma<\gamma_c$ exponentially many satisfying

834: states exist.

835: %% Physically,

836: %% this is a glassy phase, the typical dynamics is slow in finding a satisfying

837: %% solution.

838: Here, the space of solutions breaks down into clusters separated by free

839: energy barriers.  The typical dynamics in a cluster will be residual, in the

840: sense that a block of genes are fixed (on or off) and the rest may move.  The

841: number of clusters is controlled by the (computational) complexity $\Sigma$ of

842: the system.  The number of observed configurations is $ \mathcal{N}^{*} \sim

843: \exp[N \Delta_{C}]$.  Thus, by definition $\Sigma$ is directly related to the

844: order parameter $\Delta_{C}/N_{C}$, i.e. to the partitioning of the core

845: genes.  How the system explores (or not) the clusters depends on details of

846: its dynamics.

847:

848:

849:

850: % SE METTI DISCUTI!!!

851: %%  \begin{figure}[htbp]

852: %%    \includegraphics[width=.4\textwidth]{pd-k-leaf}

853: %%    \caption{Zero-temperature phase diagram of $K$-GR1 computed with the leaf

854: %%      removal algorithm (as described in~\cite{MRZ03}).}

855: %%    \label{fig:pdk}

856: %%  \end{figure}

857:

858:

859: % casi K fisso poisson = hsat (prima devi aver detto che a priori e'

860: % piu' generale!!!!) --> diagr di fase in fs di p=K+1

861:

862: %% For every value of $K$, the critical values for $\gamma$ are of order

863: %% one. Remembering our condition $\gamma \le 1$, it is interesting to

864: %% ask in which phase a hypothetical fixed connectivity CCC would

865: %% typically lie. This is illustrated in Fig.~\ref{fig:pdk}.

866:

867: % non  piu'  vero

868: % The first

869: % striking fact is that, for any $K$, the system never finds itself in the

870: % UNSAT phase.

871: %zzo vuol dire?

872:

873: % da vedere ....

874: %% Secondly, there is a transition (around $K=7$, for $\gamma=1$) between

875: %% the SAT and the HARD-SAT phase.

876: %% %boh

877: %% Remarkably, the transcription network of \emph{escherichia coli} has

878: %% signal integration functions with (exponentially distributed) $K \le

879: %% 6$ (according to the dataset used in~\cite{SMM+02}).

880: %cita DB

881:

882: %TL bias non influisce  da notare

883: % To study the case of biased functions, a new calculation is

884: % necessary to compute the phase diagram. NOT THERE

885:

886:

887: \section{State-Fluctuations and Self-average}

888: \label{sec:selfav}

889:

890: In this section, we discuss a calculation of the \emph{width} of the

891: distribution of the number of compatible states.  In order to do this, we

892: compute the quantity $\overline{[\mathcal{N}]^2}$.  This calculation is

893: relevant both from the technical and from the qualitative point of view. The

894: technical aspect, as already anticipated, deals with the self-averaging

895: property, which holds when the quantity

896: \begin{displaymath}

897:   \frac{\overline{[\mathcal{N}]^{2}} - \left[\overline{\mathcal{N}}\right]^2}

898:   {\left[\overline{\mathcal{N}}\right]^2}

899: \end{displaymath}

900: vanishes in the thermodynamic limit $N \to \infty$ at constant $\gamma$.  When

901: this condition is met, the annealed average computed above coincides with the

902: quenched one, and no extra effort is required. When it is not, the behavior of

903: the system can be qualitatively different from what emerges in the annealed

904: picture. In particular, the typical number of solutions is overestimated.  In

905: this case, more complicated formalisms, such as replicas, need to be

906: adopted~\cite{MPV87}.
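
The link between the vanishing of this ratio and the concentration of $\mathcal{N}$ can be made explicit with Chebyshev's inequality. This is a standard argument, sketched here for completeness, where $P$ denotes the probability over realizations. For any fixed $\epsilon > 0$,
\begin{displaymath}
  P \left( \left| \frac{\mathcal{N}}{\overline{\mathcal{N}}} - 1 \right|
  \geq \epsilon \right) \leq
  \frac{\overline{[\mathcal{N}]^{2}} - \left[\overline{\mathcal{N}}\right]^2}
  {\epsilon^2 \left[\overline{\mathcal{N}}\right]^2} \ .
\end{displaymath}
If the right-hand side vanishes for $N \to \infty$, the ratio $\mathcal{N} / \overline{\mathcal{N}}$ tends to one in probability, so that $\log \mathcal{N}$ and $\log \overline{\mathcal{N}}$ differ by a vanishing amount in the typical realization.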

907:

908:

909: The qualitative aspect involves the possible physical, and biological,

910: interpretation of $\overline{[\mathcal{N}]^2}$. This is an indicator of the

911: width of the probability distribution in the number of compatible states, in

912: presence of random functions and network structure.  Therefore, it can be seen

913: as the freedom the system has of varying the number of states that comply with

914: the signal integration functions by acting on its constraints.  Biologically,

915: having in mind Darwinian evolution one may interpret it as a kind of

916: ``adaptability''. This is done as follows. If $\mathcal{N}$ is interpreted as

917: the number of possible responses of gene patterns to external or internal

918: conditions, it is reasonable to assume that a given system with a fixed number

919: of genes will be fitter by maximizing $\mathcal{N}$. Now, if the distribution

920: is wide, the system can vary $\mathcal{N}$ greatly by acting on the signal

921: integration functions.  If the distribution is peaked, the changes in the

922: functions will be less effective in increasing the number of states.

923:

924: If the self-averaging property holds, all the width, this adaptability, comes

925: only as a finite-size effect. This is not irrelevant, considering that the

926: value of $N$ is in the range $10^{3}-10^{5}$, quite far from the usual

927: Avogadro's number!

928: %

929: On the other hand, if the self-averaging property does not hold, it means that

930: a residual width exists even in the thermodynamic limit. In this case, an

931: evolved system can be ultra-specific, finding the very exceptional situations

932: in which a high number of solutions exists against the typical odds in which

933: the system cannot express compatible gene patterns, the biblical needle in a

934: haystack. Such putative highly specialized organisms would be particularly

935: sensitive to changes in the environment. Given these considerations, it seems

936: that the lack of self-averaging for a model like GR1 would make it slightly

937: less appealing.

938:

939: The final result for GR1 is that $\mathcal{N}$ is indeed self-averaging.  The

940: details of the calculation are reported in Appendix~\ref{sec:self-aver-prop}.

941: It relies on two basic assumptions. The first relates to the choice of a

942: family of random functions such that:

943:

944: (a)

945: \begin{math}

946:   \frac{1}{\Lambda} \sum_{f \in \mathcal{F}} p(\vec{f})

947:   f(\vec{\s}) = 0 \, ,

948: \end{math}

949: as above, and

950:

951: (b)

952: \begin{math}

953:   \frac{1}{\Lambda} \sum_{f \in \mathcal{F} } p(\vec{f})

954:   f(\vec{\s})f(\vec{\t}) = \delta(\vec{\s}; \vec{\t}) \,

955: \end{math}.

956:

957: \noindent

958: Here, $\Lambda$ indicates the size of the family of functions, and we used the

959: obvious notation that makes the functions assume values $\pm 1$ if expressed

960: in terms of spins.

961:

962: The second assumption in the computation can be seen as a mean-field-like

963: hypothesis of independence of spins belonging to different clauses. It can be

964: argued as follows. Differently from the evaluation of

965: $\overline{\mathcal{N}}$, the computation of $\overline{[\mathcal{N}]^2}$

966: depends both on the class of functions and on the underlying network. The

967: essential problem is that while the function nodes are independent random

968: variables, the variable nodes clustered by functions can be repeated. Thus,

969: one has to answer the question: how many distinct variables $n_{v}$ are

970: connected to $r$ constraints? For small $r$, we can estimate that $n_{v} \simeq

971: r \cdot k$, while for $ r \sim M, \ n_{v} \simeq \frac{M}{\gamma}$. These

972: extremes set the ``fork'' of values for which our estimate is consistent and

973: robust.

974: %

975: In these conditions, we obtain the scaling as $4^{\Delta_{C}}$, in the

976: relevant regime of the phase diagram $ 1/K < \gamma < \gamma^{*}$, with

977: $\gamma^{*} > \gamma_{c}$. This enforces the self-averaging property.

978: %Dire di piu' / discutere l'energia libera. - ha trans di fase con gamma

979: The same procedure can be carried out with the $k$-Sat model, yielding no

980: self-averaging for the number of satisfying states.
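
The scaling quoted above can be compared directly with Eq.~(\ref{eq:nstati}). As a short consistency check in the notation of the text, with $\Delta_C = N_C - M_C$,
\begin{displaymath}
  \left[\overline{\mathcal{N}}\right]^2 = \left( 2^{\Delta_C} \right)^2
  = 4^{\Delta_C} ,
\end{displaymath}
so that a second moment $\overline{[\mathcal{N}]^2}$ scaling as $4^{\Delta_C}$ grows at the same exponential rate as the squared mean. In the same spirit, since $\gamma = M/N$, the estimate $n_v \simeq M/\gamma$ for $r \sim M$ simply states that $n_v \simeq N$, i.e. that a number of constraints of order $M$ reaches essentially all the variables.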

981:

982: %il modello da solo le possibilita' - poi c'e' la selezione naturale

983:

984: %More formally, given the set of functions that can be explored and a

985: %probability measure on it,

986: % posso fare prediction con width?

987: % posso misurarla fissato il grafo - per es dell'ecoli?

988:

989: % calcolo

990:

991: %etc la stori di k gamma / e forse dell'energia libera

992: %Discuss $k > \gamma^{-1} $ etc.....

993:

994: \section{Different Connectivities}

995: \label{sec:multipoiss}

996:

997:

998: So far we have discussed the idealized case where the in-ward connectivity,

999: the number of transcription factors controlling one gene, is fixed. In that

1000: case, only the out-ward connectivity, or the number of genes controlled by a

1001: transcription factor, can fluctuate.

1002: %

1003: A biologically more realistic case is when both the inward connectivity $K$

1004: and the outward connectivity $c$ vary along the network, and the decay of the

1005: latter is slower (see Fig.~\ref{fig:colidati}a).

1006:

1007: Considering $p(c|k) = \frac{(k\gamma)^c}{c!} e^{-(k\gamma) }$, the conditional

1008: probability that a variable is in $c$ clauses of the $k$ kind, we have $\pi(c)

1009: = \sum_k\frac{(k\gamma)^c}{c!} e^{-(k\gamma) }\cdot p(k)$. The leaf removal

1010: algorithm can be applied separately to sets of clauses with a given

1011: connectivity, defining $ N_C \equiv \langle N_C \rangle_{k}$ and $M_C \equiv \langle M_C \rangle_{k}$,

1012: where $\langle X \rangle_{k}= \sum_k p(k) \cdot X(k)$.

1013: %

1014: Choosing $p(k) = Z^{-1}(\nu) e^{-\nu k}$ does not affect the exponential

1015: asymptotic decay of $\pi(c)$ for large $c$.

1016:

1017: \begin{figure}[htbp]

1018:   \centering

1019:   \includegraphics[width=.75\textwidth]{Fig3}

1020:   \caption{$\Delta_C$ as a function of $\gamma$ for different values of $\nu$

1021:     in the multi-Poisson case. The discrete jumps are due to the onset of HARD

1022:     phases for the different values of $k$. $\Delta_C$ can become negative

1023:     many times, giving rise to reentrant UNSAT phases. The figure refers to a

1024:     connectivity distribution with a cutoff at $k = 12$}

1025:   \label{fig:mp-delta}

1026: \end{figure}

1027: %

1028: \begin{figure}[htbp]

1029:   \centering

1030:    \includegraphics[width=.75\textwidth]{Fig4}

1031:    \caption{Phase diagram $\gamma - \nu$ for the multi-Poisson case. The dashed

1032:      line represents the mean value of the numerically evaluated critical

1033:      parameter $\gamma_d (\nu)$ for the SAT-HARD transition of network

1034:      realizations with $N=3 \times 10^4$.}

1035:   \label{fig:mp-pd}

1036: \end{figure}

1037:

1038: To show this, let us construct the graph with the mentioned properties

1039: \begin{enumerate}

1040: \item The probability to have a clause with $k$ elements is $p(k) =

1041:   Z^{-1}(\nu) e^{-\nu k}\quad (k>1) $.

1042: \item $p(c|k) = \frac{(k\gamma)^c}{c!} e^{-(k\gamma) }$ is the conditional

1043:   probability that a variable is in $c$ clauses of the $k$ kind.

1044: \item $\pi(c) = \sum_k\frac{(k\gamma)^c}{c!} e^{-(k\gamma) }\cdot p(k)$.

1045: \end{enumerate}

1046: and compute $\pi(c)$.

1047: Setting $\xi = \gamma + \nu$,

1048: \begin{displaymath}

1049:   \pi(c) = \frac{\gamma^{c}}{Z c!} \left( -e^{-\xi} +  \mathcal{Z}[k^{c}]

1050:   \right) \ \ ,

1051: \end{displaymath}

1052: where $\mathcal{Z}[f(k)] = \sum_{k \ge 0} \frac{f(k)}{z^{k}}$ is the Z-transform, and

1053: for us $z = e^{\xi}$. Now, $ \mathcal{Z}[k^{c}] = \textrm{Li}_{-c}(z^{-1})$,

1054: where Li is the polylogarithm, which can be defined for negative integers as

1055: \begin{displaymath}

1056:   \textrm{Li}_{-c}(z) := \frac{1}{(1-z)^{c+1}} \sum_{i=0}^{c-1} \eulang{c}{i} z^{c-i} \ \ ,

1057: \end{displaymath}

1058: %

1059: where  $ \eulang{c}{i} $ are the Eulerian numbers.

1060: This gives a condensed expression for $\pi(c)$:

1061: \begin{displaymath}

1062:   \pi(c) = \frac{\gamma^{c}}{Z c!} \left\{  \textrm{Li}_{-c}(e^{-\xi}) -

1063:     e^{-\xi} \right\} \ \ .

1064: \end{displaymath}

1065: It can be easily checked that, for large $c$, this function decays

1066: exponentially, after having reached a maximum.
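
The first cases and the large-$c$ behavior can be written explicitly. The asymptotic estimate is a leading-order sketch, obtained by replacing the sum over $k$ with an integral. For $c = 1$ and $c = 2$ the polylogarithm reduces to
\begin{displaymath}
  \textrm{Li}_{-1}(z) = \frac{z}{(1-z)^{2}} , \qquad
  \textrm{Li}_{-2}(z) = \frac{z(1+z)}{(1-z)^{3}} .
\end{displaymath}
For large $c$, $\sum_{k} k^{c} e^{-\xi k} \simeq \int_0^{\infty} k^{c} e^{-\xi k} \, dk = c! / \xi^{c+1}$, and the subtracted term $e^{-\xi}$ is negligible, so that
\begin{displaymath}
  \pi(c) \simeq \frac{1}{Z \xi} \left( \frac{\gamma}{\gamma + \nu} \right)^{c} ,
\end{displaymath}
which decays exponentially in $c$ with rate $\log (1 + \nu / \gamma)$.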

1067:

1068: We can call this case multi-Poisson, as the graph is a superimposition of

1069: graphs that follows a Poisson distribution, each graph having in turn a fixed

1070: clause-connectivity and a Poisson variable-connectivity.

1071: %

1072: % cambiare e espandere un po'

1073: %

1074: The behavior of GR1 on such a topology is different from the

1075: fixed connectivity case.  The main reason for this is that, while

1076: $\Delta_{C}(\gamma)$ is still locally decreasing, many new discontinuities

1077: emerge, due to the influence of clauses with different connectivities. This

1078: gives rise to two phenomena. Firstly, $\Delta_{C}$ can increase globally with

1079: increasing $\gamma$. Indeed, it does increase after $\gamma_{d}$, to decrease

1080: again before $\gamma_{c}$ (Fig.~\ref{fig:mp-delta}). After the onset of the

1081: complex control phase, the complexity initially increases (step-wise), reaches

1082: a maximum, and then decreases monotonically.  This has an influence on the

1083: number of observed states as a function of $\gamma$.  Secondly, $\Delta_{C}$

1084: can become negative and then jump back to a positive state, creating a

1085: reentrant HARD-SAT phase (Fig.~\ref{fig:mp-pd}).

1086: We are currently studying ways to extend our calculation of the mean number of

1087: compatible states and the width of its distribution on graphs with more

1088: general connectivities.

1089:

1090:

1091:

1092: \section{An Example from an Experimental Setting}

1093: \label{sec:concr}

1094:

1095: The results described so far focused on the typical behavior of GR1 as a

1096: formal model for a genetic network. To summarize, we can predict the

1097: existence of a core of variables, the CCC, which determines the behavior of

1098: the system. The phase diagram of the system contains two regimes of gene

1099: control, simple and complex.  In the complex control phase, the free genes of

1100: the core control $O(N)$ other genes. These phases also depend on connectivity.

1101: On the other hand, a very important question is how to relate them to concrete

1102: systems.  There are many possibilities in this direction that we are currently

1103: exploring. In this section, we will discuss a first attempt. Specifically, we

1104: will make use of the data set for the structure of the E.~coli

1105: %and S.~Cerevisiae

1106: %cita alon

1107: transcription network from the RegulonDB database~\cite{SSG+01}, with the

1108: modifications of~\cite{SMM+02}. The goal is to apply the leaf removal

1109: algorithm using the information contained in the data set.

1110:

1111: % descrivi datasets

1112: The data set consists of an annotated graph, where the signal integration

1113: functions are described as sets of annotated links. The annotations consist in

1114: three modes of activity: activation, repression, and ``dual'' activity

1115: (meaning that the activity depends on the context). The data on the

1116: combinatorial activity of transcription factors are not part of the set. For

1117: this reason, in what follows we will ignore the annotations, concentrating on

1118: the study of random GR1 realizations on the given experimental network

1119: structures.

1120: %

1121: Considering the connectivity matrix $C_{ij}$ defined as

1122: \begin{displaymath}

1123:   C_{ij} = \left\{

1124:   \begin{array}{ll}

1125:     1 & \textrm{if gene $j$ regulates gene $i$} \\

1126:     0 & \textrm{otherwise}

1127:   \end{array}

1128:   \right. \ ,

1129: \end{displaymath}

1130: a ``leaf'' corresponds to a column containing only zeros.  An iteration of the

1131: algorithm removes these columns, together with the corresponding rows. Note

1132: that the leaf removal algorithm is not guaranteed to preserve the network

1133: structure as in the abstract cases discussed above.
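
As an illustration of the matrix form of the algorithm, consider a hypothetical four-gene network (not taken from the data set) in which genes 1 and 2 regulate each other, gene 1 regulates gene 3, and gene 3 regulates gene 4:
\begin{displaymath}
  C = \left( \begin{array}{cccc}
    0 & 1 & 0 & 0 \\
    1 & 0 & 0 & 0 \\
    1 & 0 & 0 & 0 \\
    0 & 0 & 1 & 0
  \end{array} \right)
  \; \to \;
  \left( \begin{array}{ccc}
    0 & 1 & 0 \\
    1 & 0 & 0 \\
    1 & 0 & 0
  \end{array} \right)
  \; \to \;
  \left( \begin{array}{cc}
    0 & 1 \\
    1 & 0
  \end{array} \right) .
\end{displaymath}
The fourth column of $C$ contains only zeros, so gene 4 is a leaf and its row and column are removed. In the reduced matrix the third column is empty, and gene 3 is removed in turn. No column of the last matrix vanishes, so the procedure stops with a core made of genes 1 and 2.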

1134: % figura?

1135: % and it makes sense because it's unicellular? oppurepunto su # ecoli con N

1136: % geni?

1137: \begin{figure}[htbp]

1138:   \centering

1139:   \subfigure[]{\includegraphics[width=.5\textwidth]{INOUTedges}}

1140:   \subfigure[]{\includegraphics[width=.5\textwidth]{ar}}

1141:   \caption{Data inferred from the E.~coli transcription network. (a) Degree

1142:     distribution. (b) Activity of autoregulators. See also~\cite{MC03}.}

1143:    \label{fig:colidati}

1144: \end{figure}

1145:

1146:

1147: %

1148: During the iterative leaf removal procedure, one is confronted with an

1149: important choice, concerning how to deal with autoregulators. These create a

1150: problem because, for particular assignments of the functions, they create trivial

1151: contradictions.

1152: %

1153: In reality, this self-contradiction is only apparent, as negative

1154: autoregulations are known to play the role of controlling the overexpression

1155: of a particular gene.  A standard, and the simplest, way to avoid the problem

1156: is simply to eliminate the autoregulations, and impose that the diagonal of

1157: $C_{ij}$ is zero.

1158: %

1159: However, this total cancellation is not biologically motivated, as

1160: autoregulations might reflect some global properties of the system, other than

1161: control of overexpression. To clarify, let us consider a gene that regulates

1162: itself and is regulated by some others (``rest''), that is a ``regulated

1163: autoregulator'' (RAR). It is then subject to the constraint $\s_0 = f ( \s_0 |

1164: \textrm{rest} ) = A (\textrm{rest})\s_{0} + B (\textrm{rest})$.  If

1165: $A(\textrm{rest}) = 0$ the gene is regulated simply, for $B\in \big\{-1,1

1166: \big\}$ and the autoregulation is irrelevant. Conversely, when $B

1167: (\textrm{rest}) = 0$, the autoregulation plays a role, but if $A

1168: (\textrm{rest} ) = -1$ the system is UNSAT.

1169:

1170:

1171: To solve this problem, we propose a way to take the role of autoregulators

1172: into account, while at the same time avoiding the trivial self-contradiction.

1173: In order to do this, we introduce the constraint $A (\textrm{rest} ) = 1$ that

1174: codes for the avoidance of trivial contradictions. With this technique, we aim

1175: to preserve the role of autoregulation, while taking into account the well-known fact

1176: that auto-inhibitions cannot be represented with Boolean variables. We can

1177: call this the ``RAR hypothesis''.  We will see that this hypothesis leads to

1178: a different final result. The same reasoning can be carried out for GRq-type

1179: variables.

1180: %

1181: Assuming the RAR hypothesis, the problem becomes a mixed optimization

1182: problem, that includes the usual GR1 constraints, plus a set of Sat-like

1183: constraints that come from the $A_{n} (\textrm{rest} ) = 1$ conditions on the

1184: RARs.
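
The decomposition used above can be written explicitly. The formulas are a sketch in the notation of the text, and the two-gene example is hypothetical. For a function with values $\pm 1$,
\begin{displaymath}
  A(\textrm{rest}) = \frac{f(+1 | \textrm{rest}) - f(-1 | \textrm{rest})}{2} ,
  \qquad
  B(\textrm{rest}) = \frac{f(+1 | \textrm{rest}) + f(-1 | \textrm{rest})}{2} ,
\end{displaymath}
so that, for each value of the other inputs, either $A = 0$ and $B = \pm 1$, or $B = 0$ and $A = \pm 1$. As an example, take $f(\sigma_0 | \sigma_1) = \sigma_0 \sigma_1$, for which $A = \sigma_1$ and $B = 0$. For $\sigma_1 = 1$ the constraint reads $\sigma_0 = \sigma_0$ and is satisfied by both values of $\sigma_0$. For $\sigma_1 = -1$ it reads $\sigma_0 = -\sigma_0$, which has no solution. Imposing $A = 1$ amounts here to the additional constraint $\sigma_1 = 1$ on the regulator.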

1185:

1186: \subsection{The E.~coli Transcription Network}

1187:

1188: In the E.~coli data set there are 423 genes, and 59 autoregulations. Among

1189: these, 24 are RARs (Fig.~\ref{fig:colidati}). Applying the leaf-removal

1190: algorithm with cancellation of autoregulations leads to an empty core. This

1191: means that the system finds itself in the simple control, SAT phase.  However,

1192: the application of the RAR hypothesis leads to a non-empty core

1193: (Fig.~\ref{fig:colirar0}).

1194: % figura

1195: \begin{figure}[htbp]

1196:   \centering

1197:   \includegraphics[width=.8\textwidth]{leaf0}

1198:   \caption{The CCC of E.~coli with the RAR hypothesis. It contains 22

1199:     variables (of which 14 are free or regulated only by themselves, 4 are

1200:     non-free, and 4 are RARs) and 22 constraints (of which 18 are RAR

1201:     constraints). }

1202:   \label{fig:colirar0}

1203: \end{figure}

1204:

1205:

1206: The genes in the core can be divided into three classes: free, which

1207: we will denote by $\tau$, non-free ($\sigma$), and RARs ($\alpha$). The

1208: core contains a total of 22 variables (of which 14 are free or regulated only

1209: by themselves, 4 are non-free, and 4 are RARs) and 22 constraints (of which 18

1210: are RAR constraints).

1211: %

1212: Biologically speaking, these core genes include some ``global regulators'', or

1213: transcription factors with a high out-ward connectivity~\cite{MC03}, including

1214: (a) the sigma factors rpoS and rpoN, (b) proteins belonging to the family of

1215: the DNA bending global regulator crp, and (c) himA, or IHF, another DNA bending

1216: factor.

1217: %

1218: More interestingly, lower-connectivity proteins, connected to metabolism

1219: (e.g. respiratory control and iron transport) and to structural tasks (e.g.

1220: synthesis of the flagellum), are also found in the core.

1221:

1222: The residual optimization problem on the core variables is small and simple

1223: enough to be solved for general functions, as exemplified in

1224: Fig.~\ref{fig:colirar12}. The final solution gives only two states, after

1225: having fixed the free genes.

1226: %

1227: \begin{figure}[htbp]

1228:   \centering

1229:   \subfigure[]{\includegraphics[width=.75\textwidth]{leaf1}}

1230:   \subfigure[]{\includegraphics[width=.75\textwidth]{leaf2}}

1231:   \caption{Solution of the general optimization problem on the core variables

1232:     of E.~coli in the RAR hypothesis. In this procedure, variables are fixed

1233:     with respect to each other according to the constraints that connect them.

1234:     (a) Second step of the computation (b) Last step of the computation.}

1235:   \label{fig:colirar12}

1236: \end{figure}

1237:

1238:

1239: What is the meaning of the CCC in the RAR hypothesis, if any? The answer can

1240: come from two directions: simulations and experiments.  On the numerical

1241: side, one must study how fixing the core variables affects the approach to a

1242: fixed point or a steady state in a Boolean network.  Our simulations on both

1243: asynchronous spin-flip and synchronous update dynamics show that, for fixed

1244: random functions on the whole network, the core free genes control a larger

1245: set of genes than the non-core ones (these results will be published

1246: elsewhere~\cite{CLB04}). This is an indication that the CCC found with the RAR

1247: hypothesis might have some significance.  The same feature can be tested with

1248: microarray expression experiments.

1249:

1250: %and YEAST

1251:

1252: %RAR

1253:

1254: %FINITE N ????

1255:

1256: \section{Conclusions}

1257:

1258:

1259: In conclusion, we have presented and discussed a novel conceptual framework

1260: for the equilibrium modeling of large scale transcription networks. In its

1261: most general formulation, our approach is directly connected to the

1262: Shea-Ackers model for the \emph{cis}-regulatory region of a gene, and

1263: consists of a compatibility analysis of the constraints established by the

1264: signal integration functions.

1265: % for this reason we prefer to think of GR primarily as a model of

1266: % transcription, although - at the coarse-grained level of GR1 - it can be

1267: % generalized to generic regulation networks

1268:

1269: The advantage of this approach is that it allows one to separate issues related to

1270: the dynamics of the network from the basic logic structure that underlies it.

1271: Obviously, dynamics is a very important factor of a real biochemical network,

1272: possibly the most important.  On the other hand, we feel that the

1273: disentanglement of the two aspects might lead to further insight.

1274: %

1275: In the spirit of theoretical computer science, any dynamics superimposed on

1276: GR1 can be seen as an algorithm.  The problem then becomes the following: how

1277: effectively does a given algorithm, modelling chemical kinetics, explore

1278: configuration space?  Naturally, this addition may carry intricate issues,

1279: connected to the nonequilibrium nature and the asymmetry of the interactions.

1280: These issues are particularly complex if one wants to add a coarse-graining

1281: of time, as is commonly done in Kauffman networks.

1282: %

1283: In the absence of explicit knowledge of the emergent time scales involved in

1284: the dynamics, we feel ours is an appropriate approach, particularly in the

1285: Boolean approximation, GR1, which we treat here.

1286: %

1287:

1288: %keep?

1289: %A dynamics in the space of constraints, or signal integration

1290: %functions, corresponds to Darwinian evolution of a particular genetic network.

1291: %

1292: %Thus, GR is an ideal substrate to construct evolutionary models, with a

1293: %fitness landscape on the space of constraints. Models of this class can be

1294: %seen as relatives of Kauffman's NK model.

1295:

1296: From a general, speculative standpoint, our model shows that the ``biological

1297: complexity'' is not simply measured by the number of genes. For a

1298: transcription network, a more proper indicator is $\Delta_{C}$. Interestingly,

1299: for GR1 this coincides exactly with what is called the ``computational''

1300: complexity, $\Sigma$.

1301: %

1302: Looking at the phase diagram, $\Sigma$ depends on the order parameter

1303: $\gamma$, or, loosely, on the number of transcription factors per gene.  At

1304: fixed number of genes, it is known that this quantity increases in bacteria

1305: that need to react to more environments~\cite{CdL03}.

1306: %

1307: If we imagine that prokaryotes, being unicellular, naturally find themselves in a

1308: simple control phase, our phase diagram predicts an intrinsic limit to this

1309: adaptation process, represented by the phase boundary with the HARD-SAT phase.

1310: %

1311: Considering varying $N$, one may wonder why, in real organisms, a small

1312: $\gamma$ is correlated with a small $N$~\cite{vN03}. A possible answer to this

1313: question is the following. With large $N$ and small $\gamma$, the system is

1314: shifted to the SAT phase, and therefore needs to explore a very big

1315: configuration space without sufficient ``guidelines''. In other words, the

1316: available configurations are too many to be reached in reasonable time by the

1317: dynamics.

1318:

1319: Similar considerations can be made for the \emph{width} of the distribution

1320: of satisfying solutions. The fact that the self-averaging property holds

1321: indicates that this is negligible in the thermodynamic limit.  On the other

1322: hand, the typical value of $N$ for a living system is in the range

1323: $10^{3}-10^{5}$.  While being large for detailed modeling, this is a

1324: smaller number than the size of the typical system treated with statistical

1325: mechanics. Thus, the effects of the system size are expected to be important.

1326:

1327:

1328: Considering the phase diagram in Fig.~\ref{fig:pd}, the complex control phase,

1329: having general residual dynamics, matches a qualitative feature of many cells,

1330: where some genes are constantly expressed, and the rest vary.

1331: %to be substantiated! or cite textbook / article on expression in E. coli, Drosophila, clustering?

1332: On the other hand, the dynamical slowing down characteristic of any glassy

1333: phase raises an issue that must be solved by the chemical dynamics of the

1334: cell. In analogy with Kauffman's ideas, the breakdown into many different

1335: attraction basins might be interpreted as epigenesis.  That is, in the

1336: HARD-SAT phase there will be typically many cell types. How many, is

1337: determined by the complexity $\Sigma$, which is directly measured by our

1338: $\Delta_{C}$.  While for fixed $K$ this quantity simply decreases with

1339: increasing $\gamma$, its behavior is more interesting in the multi-Poisson

1340: case.

1341: %% Thus, interpreting the number of attraction basins as the number of cell

1342: %% types

1343: %% is tempting.

1344: %TO BE REVISED.

1345: However, this remains an open issue which has to be examined in more detail.

1346: The experimental scaling of the number of cell types is sub-linear in the

1347: number of genes $N$~\cite{Kau93,Kau04}.  In the fixed $K$ case, GR1 gives

1348: exponential scaling with $N$ at fixed $\gamma$ in the complex control phase.

1349: On the other hand, the results of random-GR1 are perhaps more easily related

1350: to the number of species times the number of cell types at equal number of

1351: genes. In the same way, the behavior of $\overline{[\mathcal{N}]^2}$ with $N$

1352: should roughly predict the scaling of the \emph{variability} in the total

1353: number of cell types for all the species with equal number of genes $N$.

1354: According to GR1, this quantity should vanish in the thermodynamic limit.

1355:

1356:

1357: To be biologically useful, the model has to deal with the details of an

1358: individual realization of the system.  In this respect, an advantage of the leaf

1359: removal algorithm is that it transforms a problem related to the states of

1360: variables on a graph, the gene expression patterns, into a problem regarding

1361: the \emph{structure} of the graph.  This is particularly of interest as long

1362: as the data regarding the activity of function nodes are only partially known.

1363: For example, the first application to the E.~coli core, in the RAR hypothesis,

1364: leads to interesting results that have a numerical counterpart and might be

1365: tested with expression correlation data. The application to more, larger, data

1366: sets and to other forms of regulation might lead to further insight.

1367: %

1368: Notably, some of the core variables do not have a high connectivity.  This is

1369: an indication that additional, global, properties of the network structure

1370: other than local order parameters must contribute to establish the hierarchy

1371: of states in configuration space.

1372:

1373: Finally, besides the extensions to the work presented here, we believe the

1374: framework of GR might be used as a setting for many different problems

1375: involving fairly large networks, from evolutionary models to regulation

1376: network optimization, from network inference to design. It will be potentially

1377: useful in the years to come, as more and more data become available from

1378: high-throughput experiments.

1379:

1380:

1381: %The RAR hypothesis should be seen as an attempt to overcome a limitation of

1382: % the Boolean formalism, it

1383: % we simpy note that

1384: %disregarding self-regulations might an equally radical choice.

1385:

1386:

1387: %extensions

1388: % evolution is a dynamics on the function nodes structure and coordination

1389: %P(k)

1390: %GR is naturally fit to study networks with non-Boolean variables and non-fixed

1391: %in-ward connectivities. It can be easily extended to nonzero temperature

1392: % (probabilistic constraints).

1393: %

1394:

1395: %

1396:

1397: %

1398: %fine A biologically significant model has to be able

1399: %

1400: % we have tried

1401:

1402: % CHANGE: expand on leaf removal and other possible algorithms, as for SAT

1403:

1404: % Furthermore, there exist techniques able to analyze single realizations of

1405: % optimization problems which might be applicable to sufficiently characterized

1406: % genetic networks.  For example, a typical experiment in molecular biology

1407: % involves a local interaction with a regulation network with overexpression or

1408: % deletion of a gene. Such methods of analysis could be used to generate

1409: % predictions of the effects that keep into account the network connections.

1410:

1411: % there is much to do. What? 1) 2) 3) etc

1412:

1413:

1414: %

1415: %==================================

1416:

1417: \begin{acknowledgments}

1418:   We would like to acknowledge interesting discussions with L.~Finzi,

1419:   A.~Sportiello, J.~Berg, M.~Leone, M.~Caselle, P.R.~ten~Wolde, R.~Zecchina.

1420: \end{acknowledgments}

1421:

1422:

1423: \begin{appendix}

1424:

1425: \section{Self-averaging property of GR1}

1426: \label{sec:self-aver-prop}

1427:

1428:

1429: The following paragraphs describe the calculation of the width of the

1430: distribution of $\mathcal{N}$. By definition,

1431: $$\overline{[\mathcal{N}]^2}=\sum_{\vec{\s},\vec{\t}}\sum_{C}p(C)\sum_{f^1 \in

1432:   \mathcal{F} }p(f^1)\sum_{f^2 \in \mathcal{F}}p(f^2)\cdots\sum_{f^M\in

1433:   \mathcal{F}}p(f^M) \cdot$$

1434: %

1435: %

1436: $$\cdot \prod_{m=1}^M \d

1437: \big(1-f^m(\s_{n(1,m)},\cdots,\s_{n(k,m)})\cdot\s_{n(0,m)}\big) \d

1438: \big(1-f^m(\t_{n(1,m)},\cdots,\t_{n(k,m)})\cdot\t_{n(0,m)}\big) \ \  ;$$

1439:

1440: thus,

1441: $$\overline{[\mathcal{N}]^2}= \sum_{\vec{\s},\vec{\t}}\sum_{C}p(C)

1442: \prod_{m=1}^M \Big(\sum_{f^m  \in \mathcal{F}}p(f^m)

1443:  \d \big(1-f^m(\s)\cdot\s_{n(0,m)}\big)

1444:  \d \big(1-f^m(\t)\cdot\t_{n(0,m)}\big)

1445: \Big) \ \ .$$
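
The step from the first to the second expression only uses the fact that each
function $f^m$ is summed with its own weight and enters a single factor of the
product. As a sketch for $M=2$, writing $g_m(f^m)$ as a shorthand (introduced
here only for illustration) for the pair of delta functions of clause $m$,
\begin{displaymath}
\sum_{f^1 \in \mathcal{F}} p(f^1) \sum_{f^2 \in \mathcal{F}} p(f^2) \, g_1(f^1) \, g_2(f^2)
= \Big( \sum_{f^1 \in \mathcal{F}} p(f^1) \, g_1(f^1) \Big)
\Big( \sum_{f^2 \in \mathcal{F}} p(f^2) \, g_2(f^2) \Big) \ ,
\end{displaymath}
and the general case follows by repeating the argument for each $m$.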

1446:

1447: % sigma and tau need the vector arrow!

1448: For fixed states $\vec{\s}$ and $\vec{\tau}$, we can write the factors of the

1449: product above as

1450: $$\Big(

1451:  \d \big(1-\s_{n(0,m)}\big)

1452:  \d \big(1-\t_{n(0,m)}\big)\cdot A(\s_m ;\tau_m)+

1453: \d \big(1-\s_{n(0,m)}\big)

1454:  \d \big(1+\t_{n(0,m)}\big)\cdot B(\s_m ;\tau_m) + $$

1455: $$\d \big(1+\s_{n(0,m)}\big)

1456:  \d \big(1-\t_{n(0,m)}\big)\cdot C(\s_m ;\tau_m)  +

1457: \d \big(1+\s_{n(0,m)}\big)

1458:  \d \big(1+\t_{n(0,m)}\big)\cdot  D(\s_m ;\tau_m)

1459: \Big) \ \ ,$$

1460:

1461: where $A$, $B$, $C$, $D$ are the weights of the functions $ f \in

1462: \mathcal{F}$ such that, respectively,

1463: \begin{displaymath}

1464: \left \{ \begin{array}{lllll}

1465: \textrm{$ A(\s_m ;\tau_m) $}&\textrm{$\leftarrow$} & \textrm{$f(\s \in m)=1$}

1466: &\& & \textrm{$f(\t \in m)=1$} \\

1467: \textrm{$ B(\s_m ;\tau_m)$}&\textrm{$\leftarrow$} & \textrm{$f(\s \in m)=1$}

1468: &\& & \textrm{$f(\t \in m)=-1$} \\

1469: \textrm{$ C(\s_m ;\tau_m)$}&\textrm{$\leftarrow$}  & \textrm{$f(\s \in m)=-1$}

1470: &\& & \textrm{$f(\t \in m)=1$} \\

1471:  \textrm{$ D(\s_m ;\tau_m)$}&\textrm{$\leftarrow$} & \textrm{$f(\s \in m)=-1$}

1472:  &\& & \textrm{$f(\t \in m)=-1$} \\

1473: \end{array}\right.

1474: \end{displaymath}

1475:

1476: Now, as $A+B+C+D =1$, and $A+B = A+C = \rho$, we can write,

1477: choosing $\rho=1/2$ (so that $B = C = \mezzo - A$ and $D = A$),

1478: %check whether a was rho earlier!!

1479:

1480: $$\sum_{C } \sum_{\vec{\sigma} \vec{\tau}} \prod_{m} $$

1481: %

1482: $$\left\{\right. A(\s_{m},\t_{m}) \cdot \big[\d \big(1-\s_{n(0,m)}\big)\d \big(1-\t_{n(0,m)}\big) + \d \big(1+\s_{n(0,m)}\big)\d \big(1+\t_{n(0,m)}\big) $$

1483: $$ - \d \big(1-\s_{n(0,m)}\big)\d \big(1+\t_{n(0,m)}\big) - \d \big(1+\s_{n(0,m)}\big)\d

1484: \big(1-\t_{n(0,m)}\big)\big] \ \ +  \ \ \ \ (\mathcal{A})$$

1485: %

1486: $$ + \frac{1}{2} \big[ \d \big(1-\s_{n(0,m)}\big)\d \big(1+\t_{n(0,m)}\big) + \d

1487: \big(1+\s_{n(0,m)}\big)\d \big(1-\t_{n(0,m)}\big) \big] \left. \right\} \ \ \ \

1488: (\mathcal{R})$$

1489: %

1490:

1491: The product over $m$ gives rise to $2^{M}$ terms of the kind

1492: $$

1493: \prod_{k=1}^{r} \mathcal{A}^{(k)} \prod_{k'=r+1}^{M} \mathcal{R}^{(k')} \ \

1494: ,$$

1495: where $\mathcal{A}$ and $\mathcal{R}$ indicate factors of the two types in

1496: the previous expression. For every $r$ there are $\su{M}{r}$ terms of this

1497: kind in the sum. Applying the properties that characterize our family of

1498: functions, we find $A= \sum_{f \in \mathcal{F}} p(f) \big[

1499: \big(\frac{1+f(\s)}{2}\big) \big(\frac{1+f(\t)}{2}\big)\big]=

1500: \frac{1}{4}\big[1 + \d(\s,\t)\big]$.
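
One way to see this is to expand the product. The sketch below uses
$\rho = 1/2$, which gives a vanishing average of $f$ on any input, and assumes
(for illustration) that the values of a random function on two distinct inputs
are uncorrelated:
\begin{displaymath}
A = \frac{1}{4} \sum_{f \in \mathcal{F}} p(f)
\big[ 1 + f(\sigma) + f(\tau) + f(\sigma) f(\tau) \big]
= \frac{1}{4} \big[ 1 + 0 + 0 + \delta(\sigma,\tau) \big] \ ,
\end{displaymath}
since $f(\sigma) f(\tau) = 1$ identically when the two inputs coincide.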

1501:

1502: With this consideration, the sum over the configurations $\vec{\sigma},

1503: \vec{\tau}$ can be simplified. It involves the product

1504: $$ \prod_{m} $$

1505: %

1506: $$\left\{ \right. A(\s,\t) \cdot \big[\s_{n(0,m)} \t_{n(0,m)}\big] \ \ + \ \ \

1507:   \ \ \ \ \ (\mathcal{A})$$

1508: %

1509: $$ + \frac{1}{4} \big[1 - \s_{n(0,m)} \t_{n(0,m)} \big] \left.

1510: \right\} \ \ \ \ \ \ \ \ \ \ \ \  (\mathcal{R})$$
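
The two factors can be checked directly from the four-weight expression, using
$B = C = \frac{1}{2} - A$ and $D = A$ (which follow from $A+B+C+D=1$ and
$A+B=A+C=\frac{1}{2}$) and the identities $\delta(1 \mp \sigma) = (1 \pm \sigma)/2$,
valid for $\sigma = \pm 1$. As a sketch, dropping the index $n(0,m)$ and writing
$\sigma_0$, $\tau_0$ for the two output variables (a shorthand used only here),
\begin{displaymath}
\delta(1-\sigma_0)\,\delta(1-\tau_0) + \delta(1+\sigma_0)\,\delta(1+\tau_0)
= \frac{1+\sigma_0\tau_0}{2} \ , \qquad
\delta(1-\sigma_0)\,\delta(1+\tau_0) + \delta(1+\sigma_0)\,\delta(1-\tau_0)
= \frac{1-\sigma_0\tau_0}{2} \ ,
\end{displaymath}
so that
\begin{displaymath}
A \, \frac{1+\sigma_0\tau_0}{2} + \Big(\frac{1}{2} - A\Big) \frac{1-\sigma_0\tau_0}{2}
= A \, \sigma_0\tau_0 + \frac{1}{4} \big( 1 - \sigma_0\tau_0 \big) \ .
\end{displaymath}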

1511: %

1512:

1513: Let us now distinguish again between free genes and non-free ones, which are

1514: outputs of some signal integration function.  The sum over non-free genes is

1515: such that (i) there is a contribution $\left(\mezzo\right)^{r}$ due to the Kronecker

1516: deltas in the $\mathcal{A}$ part and (ii) if a type $\mathcal{R}$ non-free

1517: $\s_{n(0,k')}$ or $\tau_{n(0,k')}$ variable appears in the input of a type

1518: $\mathcal{A}$ clause the contribution is zero.  A little thought leads to the

1519: conclusion that the non-free genes sum up to the term

1520: \begin{displaymath}

1521:   \left(\mezzo\right)^{r} \left(1-\gamma + \frac{r}{N}\right)^{kr}

1522: \end{displaymath}

1523:

1524: The sum over the $2(N-M)$ free genes would give a $4^{\Delta}$ contribution in

1525: case of complete independence.  However, a delta function on the input genes

1526: reduces the double sum to a single one. To estimate this contribution one has

1527: to evaluate the probability that two free genes appear as input of a type

1528: $\mathcal{A}$ clause. In a mean-field like estimate, this is $ kr

1529: \frac{N-M}{N-M +r}$, leading to the contribution

1530: \begin{displaymath}

1531:   4^{\Delta} \cdot 2^{-\frac{kr}{1+\frac{r}{\Delta}}}

1532: \end{displaymath}
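
The exponent follows from a one-line rearrangement, reading $\Delta = N - M$
(as implied by the count of $2(N-M)$ free variables giving $4^{\Delta}$):
\begin{displaymath}
kr \, \frac{N-M}{N-M+r} = kr \, \frac{\Delta}{\Delta + r}
= \frac{kr}{1 + \frac{r}{\Delta}} \ .
\end{displaymath}
In this reading, each free gene constrained by a delta function has its four
$(\sigma,\tau)$ configurations reduced to two, which contributes a factor
$2^{-1}$.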

1533:

1534: The $4^{\Delta}$ term factors out of everything, and alone would give the

1535: desired self-averaging property. It remains to evaluate the sum over $r$.

1536: Restricting the sum over the core genes, and evaluating it with a saddle point

1537: method leads to the minimization of the free-energy-like functional

1538: \begin{displaymath}

1539:   G(x) = x \log x + (1-x) \log (1-x) + \log(2) x\left( 1 +

1540:     \frac{k(1-\gamma)}{1-\gamma(x-1)}\right) - k x

1541:     \log\left(1-\gamma(x-1)\right)  \ ,

1542: \end{displaymath}

1543: where $x \in [0,1]$, and $\gamma := \frac{M_{C}}{N_{C}}$.

1544: %....

1545: Minimization of this functional always leads to the solution $x=0$, with the

1546: exception of the regions: $\gamma < 1/K$ and $\gamma > \gamma^{*}$, where

1547: $\gamma^{*}$ is a threshold that always lies in the UNSAT region. For $k=3$,

1548: $\gamma^{*} \simeq 0.9722$.
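
As a simple sanity check on the functional (endpoint values only, which does
not replace the full minimization), one can evaluate $G$ at the two ends of the
interval, using $x \log x \to 0$ for $x \to 0$:
\begin{displaymath}
G(0) = 0 \ , \qquad G(1) = \log(2) \big[ 1 + k(1-\gamma) \big] \ ,
\end{displaymath}
so that $G(1) > G(0)$ whenever $\gamma < 1 + 1/k$.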

1549:

1550: \end{appendix}

1551:

1552: \begin{thebibliography}{}

1553:

1554: \bibitem[Alberts et~al., 2003]{ABL+03}

1555: Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., and Watson, J. (2003).

1556: \newblock {\em Molecular {B}iology of the {C}ell}.

1557: \newblock Garland, London, New York, 4th edition.

1558:

1559: \bibitem[Arkin et~al., 1998]{ARM98}

1560: Arkin, A., Ross, J., and McAdams, H. (1998).

1561: \newblock Stochastic kinetic analysis of developmental pathway bifurcation in

1562:   phage lambda-infected {E}scherichia coli cells.

1563: \newblock {\em Genetics}, 149(4):1633--48.

1564:

1565: \bibitem[Babu et~al., 2004]{BLA+04}

1566: Babu, M., Luscombe, N., Aravind, L., Gerstein, M., and Teichmann, S. (2004).

1567: \newblock Structure and evolution of transcriptional regulatory networks.

1568: \newblock {\em Curr Opin Struct Biol}, 14(3):283--91.

1569:

1570: \bibitem[Bastolla and Parisi, 1998]{BP98}

1571: Bastolla, U. and Parisi, G. (1998).

1572: \newblock Relevant elements, magnetization and dynamical properties in

1573:   {K}auffman networks. {A} numerical study.

1574: \newblock {\em Physica D}, 115(3-4):203--218.

1575:

1576: \bibitem[Buchler et~al., 2003]{BGH03}

1577: Buchler, N., Gerland, U., and Hwa, T. (2003).

1578: \newblock On schemes of combinatorial transcription logic.

1579: \newblock {\em Proc Natl Acad Sci U S A}, 100(9):5136--41.

1580:

1581: \bibitem[Caracciolo and Sportiello, 2002]{CS02}

1582: Caracciolo, S. and Sportiello, A. (2002).

1583: \newblock An exactly solvable random satisfiability problem.

1584: \newblock {\em J. Phys. A}, 35:7661--7688.

1585:

1586: \bibitem[Cases et~al., 2003]{CdL03}

1587: Cases, I., de~Lorenzo, V., and Ouzounis, C. (2003).

1588: \newblock Transcription regulation and environmental adaptation in bacteria.

1589: \newblock {\em Trends Microbiol}, 11(6):248--53.

1590:

1591: \bibitem[Correale et~al., 2004]{CLP+}

1592: Correale, L., Leone, M., Pagnani, A., Weigt, M., and Zecchina, R. (2004).

1593: \newblock Private Communication.

1594:

1595: \bibitem[Cosentino~Lagomarsino et~al., 2005]{CLB04}

1596: Cosentino~Lagomarsino, M., Bassetti, B., and Jona, P. (2005).

1597: \newblock In Preparation.

1598:

1599: \bibitem[Davidson et~al., 2002]{DRO+02}

1600: Davidson, E., Rast, J., Oliveri, P., Ransick, A., Calestani, C., Yuh, C.,

1601:   Minokawa, T., Amore, G., Hinman, V., Arenas-Mena, C., Otim, O., Brown, C.,

1602:   Livi, C., Lee, P., Revilla, R., Rust, A., Pan, Z., Schilstra, M., Clarke, P.,

1603:   Arnone, M., Rowen, L., Cameron, R., McClay, D., Hood, L., and Bolouri, H.

1604:   (2002).

1605: \newblock A genomic regulatory network for development.

1606: \newblock {\em Science}, 295(5560):1669--78.

1607:

1608: \bibitem[Gershenson, 2004]{Ger04}

1609: Gershenson, C. (2004).

1610: \newblock Introduction to {R}andom {B}oolean {N}etworks.

1611: \newblock In {\em Artificial {L}ife {IX} {W}orkshops and {T}utorials}.

1612:

1613: \bibitem[Gillespie, 1977]{GD77}

1614: Gillespie, D. (1977).

1615: \newblock Exact stochastic simulation of coupled chemical reactions.

1616: \newblock {\em J. Phys. Chem.}, 81(25):2340--61.

1617:

1618: \bibitem[Herrgard et~al., 2004]{HCP04}

1619: Herrgard, M., Covert, M., and Palsson, B. (2004).

1620: \newblock Reconstruction of microbial transcriptional regulatory networks.

1621: \newblock {\em Curr Opin Biotechnol}, 15(1):70--7.

1622:

1623: \bibitem[Kauffman, 1969a]{Kau69}

1624: Kauffman, S. (1969a).

1625: \newblock Homeostasis and differentiation in random genetic control networks.

1626: \newblock {\em Nature}, 224(215):177--8.

1627:

1628: \bibitem[Kauffman, 1969b]{Kau69b}

1629: Kauffman, S. (1969b).

1630: \newblock Metabolic stability and epigenesis in randomly constructed genetic

1631:   nets.

1632: \newblock {\em J Theor Biol}, 22(3):437--67.

1633:

1634: \bibitem[Kauffman, 1993]{Kau93}

1635: Kauffman, S. (1993).

1636: \newblock {\em The {O}rigins of {O}rder}.

1637: \newblock Oxford Univ. Press, New York.

1638:

1639: \bibitem[Kauffman, 2004]{Kau04}

1640: Kauffman, S. (2004).

1641: \newblock A proposal for using the ensemble approach to understand genetic

1642:   regulatory networks.

1643: \newblock {\em J Theor Biol}, 230(4):581--90.

1644:

1645: \bibitem[Kaufman and Drossel, 2005]{KD05}

1646: Kaufman, V. and Drossel, B. (2005).

1647: \newblock On the properties of cycles of simple {B}oolean networks.

1648: \newblock {\em Eur Phys J B}.

1649: \newblock in publication.

1650:

1651: \bibitem[Lee et~al., 2002]{LRR+02}

1652: Lee, T., Rinaldi, N., Robert, F., Odom, D., Bar-Joseph, Z., Gerber, G.,

1653:   Hannett, N., Harbison, C., Thompson, C., Simon, I., Zeitlinger, J., Jennings,

1654:   E., Murray, H., Gordon, D., Ren, B., Wyrick, J., Tagne, J., Volkert, T.,

1655:   Fraenkel, E., Gifford, D., and Young, R. (2002).

1656: \newblock Transcriptional regulatory networks in {S}accharomyces cerevisiae.

1657: \newblock {\em Science}, 298(5594):799--804.

1658:

1659: \bibitem[Martinez-Antonio and Collado-Vides, 2003]{MC03}

1660: Martinez-Antonio, A. and Collado-Vides, J. (2003).

1661: \newblock Identifying global regulators in transcriptional regulatory networks

1662:   in bacteria.

1663: \newblock {\em Curr Opin Microbiol}, 6(5):482--9.

1664:

1665: \bibitem[McAdams and Arkin, 1997]{MA97}

1666: McAdams, H. and Arkin, A. (1997).

1667: \newblock Stochastic mechanisms in gene expression.

1668: \newblock {\em Proc Natl Acad Sci U S A}, 94(3):814--9.

1669:

1670: \bibitem[Mertens, 2002]{M02}

1671: Mertens, S. (2002).

1672: \newblock Computational complexity for physicists.

1673: \newblock {\em Computing in Science and Engineering}, 4(3):31--47.

1674:

1675: \bibitem[Mezard and Parisi, 2003]{MP03}

1676: Mezard, M. and Parisi, G. (2003).

1677: \newblock The cavity method at zero temperature.

1678: \newblock {\em J. Stat. Phys}, 111:1.

1679:

1680: \bibitem[Mezard et~al., 1987]{MPV87}

1681: Mezard, M., Parisi, G., and Virasoro, M. (1987).

1682: \newblock {\em Spin {G}lass {T}heory and {B}eyond}.

1683: \newblock World Scientific, Singapore.

1684:

1685: \bibitem[Mezard et~al., 2002]{MPZ02}

1686: Mezard, M., Parisi, G., and Zecchina, R. (2002).

1687: \newblock Analytic and algorithmic solution of random satisfiability problems.

1688: \newblock {\em Science}, 297(5582):812--5.

1689:

1690: \bibitem[Mezard et~al., 2003]{MRZ03}

1691: Mezard, M., Ricci-Tersenghi, F., and Zecchina, R. (2003).

1692: \newblock Alternative solutions to diluted p-spin models and {XORSAT} problems.

1693: \newblock {\em J. Stat. Phys}, 505.

1694:

1695: \bibitem[Mezard and Zecchina, 2002]{MZ02}

1696: Mezard, M. and Zecchina, R. (2002).

1697: \newblock Random {K}-satisfiability problem: from an analytic solution to an

1698:   efficient algorithm.

1699: \newblock {\em Phys Rev E Stat Nonlin Soft Matter Phys}, 66(5 Pt 2):056126.

1700:

1701: \bibitem[Nachman et~al., 2004]{NRF04}

1702: Nachman, I., Regev, A., and Friedman, N. (2004).

1703: \newblock Inferring quantitative models of regulatory networks from expression

1704:   data.

1705: \newblock {\em Bioinformatics}, 20 Suppl 1:I248--I256.

1706:

1707: \bibitem[Ptashne, 1992]{Pta92}

1708: Ptashne, M. (1992).

1709: \newblock {\em A {G}enetic {S}witch}.

1710: \newblock Cell Press, MA, second edition.

1711:

1712: \bibitem[Salgado et~al., 2001]{SSG+01}

1713: Salgado, H., Santos-Zavaleta, A., Gama-Castro, S., Millan-Zarate, D.,

1714:   Diaz-Peredo, E., Sanchez-Solano, F., Perez-Rueda, E., Bonavides-Martinez, C.,

1715:   and Collado-Vides, J. (2001).

1716: \newblock Regulon{DB} (version 3.2): transcriptional regulation and operon

1717:   organization in {E}scherichia coli {K}-12.

1718: \newblock {\em Nucleic Acids Res}, 29(1):72--4.

1719:

1720: \bibitem[Shea and Ackers, 1985]{SA85}

1721: Shea, M. and Ackers, G. (1985).

1722: \newblock The {OR} control system of bacteriophage lambda. {A}

1723:   physical-chemical model for gene regulation.

1724: \newblock {\em J Mol Biol}, 181(2):211--30.

1725:

1726: \bibitem[Shen-Orr et~al., 2002]{SMM+02}

1727: Shen-Orr, S., Milo, R., Mangan, S., and Alon, U. (2002).

1728: \newblock Network motifs in the transcriptional regulation network of

1729:   {E}scherichia coli.

1730: \newblock {\em Nat Genet}, 31(1):64--8.

1731:

1732: \bibitem[Socolar and Kauffman, 2003]{SaK03}

1733: Socolar, J. E.~S. and Kauffman, S.~A. (2003).

1734: \newblock Scaling in {O}rdered and {C}ritical {R}andom {B}oolean {N}etworks.

1735: \newblock {\em Phys Rev Lett}, 90:068702.

1736:

1737: \bibitem[Thomas, 1973]{Tho73}

1738: Thomas, R. (1973).

1739: \newblock Boolean formalization of genetic control circuits.

1740: \newblock {\em J Theor Biol}, 42(3):563--85.

1741:

1742: \bibitem[van Nimwegen, 2003]{vN03}

1743: van Nimwegen, E. (2003).

1744: \newblock Scaling laws in the functional content of genomes.

1745: \newblock {\em Trends Genet}, 19(9):479--84.

1746:

1747: \bibitem[Warren and ten Wolde, 2004]{WtW04}

1748: Warren, P. and ten Wolde, P. (2004).

1749: \newblock Statistical analysis of the spatial distribution of operons in the

1750:   transcriptional regulation network of {E}scherichia coli.

1751: \newblock {\em J Mol Biol}, 342(5):1379--90.

1752:

1753: \bibitem[Wolf and Arkin, 2003]{WA03}

1754: Wolf, D. and Arkin, A. (2003).

1755: \newblock Motifs, modules and games in bacteria.

1756: \newblock {\em Curr Opin Microbiol}, 6(2):125--34.

1757:

1758: \end{thebibliography}

1759:

1760:

1761:

1762: \end{document}

1763:

1764:

1765:

1766:

1767: