
1: \documentclass[12pt]{iopart}

2:

3: \usepackage{graphicx}

4: %\usepackage{amsmath}

5: %\usepackage{amssymb}

6:

7: %\eqnobysec

8:

9: \begin{document}

10:

11: \title[Numbers and affinity]{Noise-filtering features of

12:     transcription regulation in the yeast {\it S. cerevisiae}}

13:

14: \author{%

15:   Erik Aurell$^1$ and

16:   Aymeric Fouquier d'H\'erou\"el$^{1,2}$ and

17:   Claes Malmn\"as$^{1}$ and

18:   Massimo Vergassola$^2$

19: }

20:

21: \address{$^1$ Computational Biological Physics, Royal Institute of Technology, AlbaNova University Center, Stockholm, Sweden}

22: \address{$^2$ CNRS, URA 2171, Institut Pasteur, Dept. ``G\'enomes et G\'en\'etique'', Research Unit ``G\'en\'etique in Silico'', 25 rue du Dr Roux, Paris, France}

23:

24: \eads{%

25:   \mailto{eaurell@kth.se},

26:   \mailto{afd@kth.se},

27:   \mailto{malmnas@kth.se},

28:   \mailto{massimo@pasteur.fr}

29: }

30:

31: \begin{abstract}

32:   Transcription regulation is largely governed by the profile and the dynamics of transcription factors' binding to DNA. Stochastic effects are intrinsic to these dynamics, and the binding to functional sites must be controlled with a certain specificity for living organisms to be able to elicit specific cellular responses. Specificity stems here from the interplay between binding affinity and cellular abundance of transcription factor proteins, and the binding of such proteins to DNA is thus controlled by their chemical potential.

33:

34:   We combine large-scale protein abundance data in the budding yeast with binding affinities for all transcription factors with known DNA binding site sequences to assess the behavior of their chemical potentials. A sizable fraction of transcription factors is apparently bound non-specifically to DNA and the observed abundances are marginally sufficient to ensure high occupations of the functional sites. We argue that a biological cause of this feature is related to its noise-filtering consequences: abundances below physiological levels do not yield significant binding of functional targets and mis-expressions of regulated genes are thus tamed.

35: \end{abstract}

36:

37: \pacs{87.80.Vt}

38:

39: \vspace{1.0cm}

40: \begin{flushleft}

41: Running title: \textit{Numbers and affinity}

42: \end{flushleft}

43:

44: %\submitto{\PB}

45:

46: \maketitle

47:

48:

49:

50:

51: \section{Introduction}

52:

53: A major determinant in transcription regulation is the pattern of

54: transcription factor proteins (TFs) bound in the physical proximity of

55: the transcribed genomic locus~\cite{Ptashne,Davidson02,Davidson01}.

56: Intense effort is currently devoted to identifying transcriptional

57: regulatory networks~\cite{PughGilmour2001,Lee,OCT4SOX2NANOG}, their

58: topology~\cite{Milo2002,Shen-Orr2002,Ping,MBV05} and signs and

59: strengths of the interactions~\cite{Ronen02}. Specificity is an

60: obvious need in transcription regulation: functional binding sites

61: ought to be sufficiently low in energy compared to typical sequences

62: in the rest of the genome (the so-called background). This energetic

63: constraint should be coupled with its kinetic counterpart: the TF

64: should be able to rapidly find its functional targets. Existing

65: evidence points to a search taking place via 1D sliding along the DNA,

66: alternated with 3D excursions \cite{Berg3,Marko}. The TF is kept along

67: the DNA by non-specific electrostatic interactions, recently

68: characterized experimentally~\cite{Mirny,Gowers}.

69:

70: Two quantitative variables govern the binding of TFs to DNA: their

71: cellular abundance and the affinity between the amino acids forming

72: their binding domains and the various possible stretches of

73: nucleotides. It has long been recognized in concrete examples that

74: equilibrium statistical-mechanics models are poised to describe the

75: binding site occupancy as a function of those parameters, and that

76: these occupancies are proxies for transcription rates

77: \cite{Ptashne,SheaAckers85}.  Detailed models for the

78: probability of binding to DNA by TFs have recently been reviewed

79: in~\cite{Bintu2005a,Bintu2005b} and we refer the interested reader

80: thereto (see also Methods for a concise summary).

81:

82: The qualitative point of importance here is that the probability of

83: a TF binding to DNA is controlled by its so-called chemical potential

84: $\mu$. As illustrated in figure~\ref{fig:1}, strong binding sites (with energy

85: much lower than $\mu$) are occupied almost certainly, while weak

86: sites, with energies much higher than $\mu$, are most frequently

87: empty.  The chemical potential $\mu$ increases with the number of

88: copies $n$ of the transcription factor as $\log n$ (see,

89: e.g.,~\cite{Bintu2005a,Bintu2005b}). For a single copy $n=1$, the

90: value of the chemical potential defines an offset $F_b$, usually

91: called background energy.  The reason is that $F_b$ controls the

92: fraction of TF copies bound to DNA either non-specifically or to the

93: genomic background. Indeed, let $E^*$ denote the minimal binding

94: energy, i.e. the energy of binding to the consensus sequence of the

95: TF. From the previous relation $\mu-F_b\propto \log n$, it follows

96: that if $F_b\simeq E^*$, then the threshold defined by the chemical

97: potential $\mu$ is larger than (or equal to) $E^*$ for any $n\geq 1$. In

98: other words, even a single copy of the transcription factor would then

99: be sufficient to ensure persistent binding, at least of the consensus

100: sequence.  Conversely, as $F_{\rm b}$ becomes less than $E^*$, more

101: and more TF copies $n$ are needed to have $\mu\geq E^*$,

102: i.e. persistent occupancy of at least the strongest binding sites. A

103: minimal abundance (which depends exponentially on the difference

104: $E^*-F_b$) is then required to have persistent binding of the

105: strongest sites (supposed to be the functional ones).
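
\medskip
As a rough numerical illustration (a sketch that assumes the proportionality constant in $\mu-F_b\propto\log n$ is $k_{\rm B}T$, as holds when only a small fraction of the sites is occupied), persistent binding of the consensus sequence, $\mu\geq E^*$, requires
\[
n \geq n_{\min} = \exp\left(\frac{E^*-F_b}{k_{\rm B}T}\right).
\]
For instance, a hypothetical gap $E^*-F_b = 7\,k_{\rm B}T$ gives $n_{\min}=e^{7}\simeq 1.1\times 10^{3}$ copies, and each additional $k_{\rm B}T$ in the gap multiplies the required abundance by $e\simeq 2.7$.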

106:

107: \medskip

108: Detailed quantitative information on the behavior of the chemical

109: potential for transcription factors of biological interest is

110: scanty. The relation between binding affinities and abundances was

111: analyzed in \cite{hwa} for three coliphage TFs (\textit{Mnt},

112: \textit{CI} and \textit{Cro}) and one bacterial TF

113: (\textit{LacR}). The result was that the offset $F_b$ is comparable to

114: the consensus energy $E^*$ for those four TFs. This type of relation

115: endows the cell with the widest possible window to vary the TF copy

116: number and differentially regulate various sets of genes. It was

117: therefore dubbed ``maximum programmability'' \cite{hwa}.

118:

119: Positing that $F_b\simeq E^*$ is generally valid seems, however, too strong a

120: requirement for the cellular dynamics, as it would make regulation

121: too prone to errors. In fact, as already

122: noted in \cite{hwa}, the four TFs which were considered are rather

123: special: they are all repressors, they operate without many

124: combinatorial interactions with other factors and their expression is

125: tightly controlled. This is not the situation encountered in

126: general. Namely, combinatorial regulation is much more frequent,

127: especially in eukaryotes, and a large fraction of genes are activated

128: by TFs to their physiological expression levels. Specificity is not

129: arising from a single transcription factor but from the synergistic

130: and cooperative combination of several factors. We then expect that

131: the relation $F_b\simeq E^*$, found in \cite{hwa} for four particular

132: TFs, does not have general validity and that a different relation

133: holds in the majority of cases. Our goal here is to quantify and

134: support this expectation by analyzing experimental data for a large

135: set of transcription factors.

136:

137: A good model organism to quantitatively investigate the previous issue

138: is the budding yeast {\it S. cerevisiae}. Concentration data in the

139: log-growth phase~\cite{tf_amount} and large-scale chromatin

140: immunoprecipitation binding data, as given by~\cite{Lee,Harbison}, are

141: both available. The intersection of the two data sets leaves us with a

142: set of 63 TFs. The difficulty to be overcome is that large-scale

143: experimental data on binding do not directly provide affinities.  {\it

144:   A priori}, calorimetric methods \cite{Zhang,Takeda} might be

145: employed to measure the strength of the interaction of a TF with its

146: binding sites, but these methods have been hard to scale up, and

147: values are typically not available for a given TF. One is thus forced

148: to infer affinity matrices {\it in silico}, from a list of

149: experimentally detected binding sites.  The procedures and the

150: limitations of these inferences are recalled in the Methods, together

151: with the basics of statistical models for TF-DNA interactions. Two

152: different inference methods were employed: the classical maximum

153: likelihood argument by Berg and von~Hippel~\cite{BergvonHippel87} and

154: the QPMEME method, recently introduced in~\cite{Marko03}.  Results for

155: the relation between affinity and TF abundance, for both ways of

156: determining the binding energies, are presented hereafter. Biological

157: consequences, in particular for the control of noise in transcription

158: regulation, are presented in the Discussion.

159:

160:

161: \section{Results}

162:

163: Combining the two experimental data sets on abundance \cite{tf_amount}

164: and chromatin immunoprecipitation \cite{Lee,Harbison}, a set of 63 TFs was

165: identified. Affinity matrices for those TFs were then inferred as

166: detailed in Methods, using both the classical maximum likelihood

167: procedure \cite{BergvonHippel87} and the QPMEME method \cite{Marko03}.

168: In both cases, the matrices are {\it a priori} determined only up to a

169: scale factor. In the first case, following~\cite{BergvonHippel87}, the

170: factor was set to one in units of $k_{\rm B} T$.  In the QPMEME

171: method, the scale factor was determined as described in the Methods

172: via a self-consistency condition, based on the experimental

173: information on TF abundances.  This condition could be satisfied in 41

174: out of the 63 cases. In the remaining 22 cases no solution could be

175: found, for reasons that will be presented in the Discussion.

176:

177: The matrices derived by the two aforementioned methods agree well in

178: the majority of the 41 cases where both methods could be employed. A

179: first measure of the agreement between two energy matrices is whether

180: they give the same best binder at each position, which is indeed the case for

181: 26 TFs out of 41. These 26 instances include cases where one TF admits

182: more than one consensus sequence, but where both matrices agree on at

183: least one consensus binder at each position. In 14 cases the sets of

184: consensus sequences agree completely.  For 15 TFs the sets of best

185: binders of the two matrices at some position are not overlapping,

186: \textit{i.e.} in at least one position the sets of best binders

187: differ.

188:

189: A more quantitative comparison, sensitive to the full energy matrix

190: and not just to the best binder, is to consider the normalized

191: probabilities $q_{i,\alpha}$, i.e.  the probability that nucleotide

192: $\alpha$ be found at the position $i$ of the DNA-TF binding

193: complex. The probabilities computed using the maximum likelihood

194: procedure \cite{BergvonHippel87} or QPMEME \cite{Marko03} are denoted

195: by $q^{\rm BvH}_{i,\alpha}$ and $q^{\rm QP}_{i,\alpha}$,

196: respectively. The difference between the two sets of probabilities is

197: quantified by the symmetric Kullback-Leibler relative entropy

198: \cite{CT06}\,:

199: \begin{equation}

200: S(q^{\rm BvH}_i,q^{\rm QP}_i) = \frac{1}{2}\sum_{\alpha}

201: \left(q^{\rm BvH}_{i,\alpha}-q^{\rm QP}_{i,\alpha}\right)

202: \log\frac{q^{\rm BvH}_{i,\alpha}}{q^{\rm QP}_{i,\alpha}}\,.

203: \end{equation}

204: Figure~\ref{fig:2} shows the mean Kullback-Leibler relative entropy per base

205: pair for the 41 TFs. Except in a few cases, the average differences

206: per base pair are moderate, on the order of $0.1$--$0.2$. No correlation

207: was detectable between the relative entropies and the number of

208: observed binding sites employed to infer the affinity matrices,

209: indicating that the differences between the QPMEME and Berg-von~Hippel

210: matrices are {\it bona fide} fluctuations and not due to finite sample

211: effects. Detailed properties of the affinity matrices computed using

212: the two methods are reported in table~\ref{tab:tf_info}.
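
To make the definition of the symmetric relative entropy concrete, consider a purely illustrative position $i$ (the numbers are hypothetical and not taken from the data) with $q^{\rm BvH}_{i}=(0.7,0.1,0.1,0.1)$ and $q^{\rm QP}_{i}=(0.6,0.2,0.1,0.1)$ for $\alpha=A,C,G,T$. Only the first two nucleotides contribute and, assuming natural logarithms,
\[
S = \frac{1}{2}\left[0.1\,\log\frac{0.7}{0.6} - 0.1\,\log\frac{0.1}{0.2}\right]
\simeq \frac{1}{2}\left(0.015+0.069\right)\simeq 0.04\,.
\]
The expression is the average of the two one-sided relative entropies, $S=\frac{1}{2}\left[D(q^{\rm BvH}_i\|q^{\rm QP}_i)+D(q^{\rm QP}_i\|q^{\rm BvH}_i)\right]$ with $D(a\|b)=\sum_\alpha a_\alpha\log(a_\alpha/b_\alpha)$, so it is non-negative and vanishes only when the two distributions coincide.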

213:

214: %%%%%%%%%%%%%%%%%%%%%%

215: % FIGURE 1

216: %%%%%%%%%%%%%%%%%%%%%%

217: \begin{figure}[htp]

218:   \begin{center}

219:     \includegraphics[width=16cm]{figure1.pdf}

220:   \end{center}

221:   \caption{Left panel: A schematic view of the relation between the

222:     probability of binding to DNA for a transcription factor and its

223:     chemical potential $\mu$. Strong binding sites (with energies much

224:     lower than the chemical potential) have a high occupation

225:     probability (purple solid line), while the probability to bind

226:     decreases rapidly as the energy increases. Right panel: the

227:     relation between the chemical potential and the abundance $n$. The

228:     background (free) energy $F_b$ is the value of $\mu$ for $n=1$.}

229:   \label{fig:1}

230: \end{figure}

231:

232: As a side remark, note that the average discrimination energy per site

233: generally decreases with the length of the binding site, indicating a

234: trade-off between these two quantities. Figure~\ref{fig:3} displays the data for

235: the QPMEME-derived energy matrices; a similar behavior is found for

236: matrices inferred by maximum likelihood.

237: %%%%%%%%%%%%%%%%%%%%%%

238: % FIGURE 2

239: %%%%%%%%%%%%%%%%%%%%%%

240: \begin{figure}[htp]

241:   \begin{center}

242:     \includegraphics[scale=0.4]{figure2-a.pdf}

243:     \includegraphics[scale=0.4]{figure2-b.pdf}

244:   \end{center}

245:   \caption{Average Kullback-Leibler distance per base pair between the

246:     probability distributions of binding based on computing discrimination

247:     energies by maximum likelihood arguments~\cite{BergvonHippel87} or

248:     QPMEME \cite{Marko03} (see also Methods).}

249:   \label{fig:2}

250: \end{figure}

251:

252: %%%%%%%%%%%%%%%%%%%%%%

253: % TABLE 1

254: %%%%%%%%%%%%%%%%%%%%%%

255: \begin{table}[htp]

256:   \fontsize{8}{8}\selectfont

257:   \lineup

258:   \begin{center}

259:     \begin{tabular}{@{}*{7}{c}}

260:       \br

261:       TF name & ${}^{({\bf a})}$ $N_{\rm BS}$ & ${}^{({\bf b})}$ $n_{\rm obs}$ & ${}^{({\bf c})}$ $-r^*$ & ${}^{({\bf d})}$ $|\mu-\langle E_{\rm QP} \rangle|$ & ${}^{({\bf e})}$ $E_{\rm QP}^*-\langle E_{\rm QP} \rangle$ & ${}^{({\bf f})}$ $E_{\rm BvH}^*-\langle E_{\rm BvH} \rangle$ \\ \mr

262:       ABF1  & 139 & \04818.49 & 1.17 & 12.86 & -15.11 & -30.01 \\

263:       ACE2  & 11 & \0\0538.41 & 1.00 & 17.63 & -17.63 & -12.65 \\

264:       BAS1  & 33 & \0\0861.14 & 1.00 & 23.25 & -23.25 & -15.21 \\

265:       CAD1  & 8 & \0\0622.84 & 1.11 & 17.51 & -19.35 & -14.44 \\

266:       DIG1  & 131 & \01458.26 & 1.44 &  --- & \m--- & -16.37 \\

267:       FHL1  & 75 & \0\0639.38 & 1.22 & 20.60 & -25.08 & -26.41 \\

268:       FKH1  & 89 & \01720.43 & 1.24 &  --- & \m--- & -19.71 \\

269:       FKH2  & 53 & \0\0655.80 & 1.32 &  --- & \m--- & -23.54 \\

270:       GAL4  & 9 & \0\0166.39 & 1.21 & 15.89 & -19.29 & -20.41 \\

271:       GAL80  & 3 & \0\0783.80 & 1.03 & 12.28 & -12.59 & -15.10 \\

272:       GCR1  & 7 & \0\0258.92 & 1.13 & 18.83 & -21.36 & -12.34 \\

273:       GLN3  & 48 & \0\0589.44 & 1.20 &  --- & \m--- & -18.08 \\

274:       HAP5  & 22 & \0\0450.49 & 1.00 &  --- & \m--- & -11.57 \\

275:       INO2  & 27 & \0\0783.80 & 1.17 &  --- & \m--- & -15.26 \\

276:       INO4  & 23 & \0\0521.13 & 1.20 & 34.08 & -41.01 & -18.95 \\

277:       LEU3  & 9 & \0\0124.51 & 1.11 & 18.42 & -20.37 & -15.24 \\

278:       MAC1  & 5 & 14841.76 & 1.03 & 10.71 & -10.98 & \0-8.60 \\

279:       MBP1  & 130 & \0\0521.13 & 1.19 &  --- & \m--- & -20.12 \\

280:       MCM1  & 63 & \08965.92 & 1.24 & 11.06 & -13.67 & -26.07 \\

281:       MET31  & 5 & \0\0521.13 & 1.00 & 16.14 & -16.14 & -10.87 \\

282:       MET4  & 5 & \01295.19 & 1.14 & 12.34 & -14.02 & -15.96 \\

283:       MOT2  & 2 & \04276.97 & 1.03 & 10.31 & -10.63 & \0-8.11 \\

284:       MOT3  & 11 & \01694.68 & 1.00 &  --- & \m--- & \0-9.25 \\

285:       MSN2  & 14 & \0\0124.51 & 1.15 &  --- & \m--- & -14.01 \\

286:       NDD1  & 27 & \0\0799.42 & 1.15 & 16.10 & -18.44 & -21.42 \\

287:       NRG1  & 45 & \0\0555.55 & 1.18 &  --- & \m--- & -17.68 \\

288:       PDR1  & 2 & \01295.19 & 1.03 & 12.51 & -12.93 & \0-7.01 \\

289:       PDR3  & 2 & \0\0166.39 & 1.04 &  --- & \m--- & \0-5.18 \\

290:       PHD1  & 61 & \01417.94 & 1.64 &  --- & \m--- & -15.45 \\

291:       PHO2  & 3 & \06418.52 & 1.09 & 10.40 & -11.37 & \0-8.97 \\

292:       PUT3  & 3 & \0\0736.46 & 1.05 & 11.85 & -12.49 & -15.96 \\

293:       RAP1  & 70 & \04387.61 & 1.25 & 14.60 & -18.31 & -23.09 \\

294:       RCS1  & 28 & \02733.41 & 1.16 & 21.04 & -24.37 & -16.85 \\

295:       REB1  & 156 & \07514.31 & 1.16 & 13.02 & -15.12 & -23.68 \\

296:       RFX1  & 9 & \0\0376.92 & 1.11 & 14.97 & -16.66 & -18.44 \\

297:       RGT1  & 12 & \0\0194.71 & 1.02 &  --- & \m--- & -12.58 \\

298:       RLM1  & 9 & \0\0736.46 & 1.13 & 24.46 & -27.54 & -14.40 \\

299:       RLR1  & 4 & \0\0521.13 & 1.06 & 19.66 & -20.81 & -11.10 \\

300:       ROX1  & 10 & \0\0238.01 & 1.03 &  --- & \m--- & -13.49 \\

301:       RTG3  & 4 & \01054.72 & 1.00 & 15.63 & -15.63 & \0-7.43 \\

302:       SFP1  & 19 & \0\0258.92 & 1.21 & 46.90 & -56.71 & -17.81 \\

303:       SKN7  & 64 & \02572.11 & 1.25 & 20.63 & -25.77 & -18.55 \\

304:       SKO1  & 8 & \0\0503.71 & 1.00 & 22.35 & -22.35 & \0-9.89 \\

305:       SOK2  & 81 & \0\0314.38 & 1.20 &  --- & \m--- & -16.16 \\

306:       SPT23  & 23 & \0\0432.40 & 1.19 &  --- & \m--- & -13.00 \\

307:       SPT2  & 24 & \01239.69 & 1.25 &  --- & \m--- & -16.26 \\

308:       STB1  & 23 & \0\0319.30 & 1.15 & 27.00 & -31.10 & -17.94 \\

309:       STB4  & 4 & \0\0\098.94 & 1.00 & 18.69 & -18.70 & -10.11 \\

310:       STB5  & 15 & \0\0279.40 & 1.12 & 25.07 & -28.06 & -16.32 \\

311:       STE12  & 147 & \01923.06 & 1.23 &  --- & \m--- & -18.19 \\

312:       SUM1  & 43 & \0\0148.81 & 1.22 &  --- & \m--- & -20.65 \\

313:       SWI4  & 99 & \0\0589.44 & 1.28 &  --- & \m--- & -21.42 \\

314:       SWI5  & 40 & \0\0688.35 & 1.00 & 37.53 & -37.53 & -14.88 \\

315:       SWI6  & 128 & \03335.05 & 1.25 & 26.45 & -32.95 & -19.93 \\

316:       TEC1  & 57 & \0\0529.79 & 1.20 &  --- & \m--- & -15.84 \\

317:       TYE7  & 20 & \0\0486.13 & 1.09 & 19.15 & -20.86 & -18.00 \\

318:       UME1  & 2 & \03037.98 & 1.04 & 11.80 & -12.27 & \0-7.29 \\

319:       UME6  & 63 & \0\0216.63 & 1.21 & 24.83 & -30.04 & -26.54 \\

320:       XBP1  & 2 & \0\0194.71 & 1.00 & 19.74 & -19.74 & \0-5.83 \\

321:       YAP1  & 11 & \01616.86 & 1.06 & 18.47 & -19.66 & -13.73 \\

322:       YAP6  & 3 & \01350.09 & 1.00 & 19.51 & -19.51 & \0-6.87 \\

323:       YAP7  & 48 & \01694.68 & 1.09 &  --- & \m--- & -20.14 \\

324:       YOX1  & 4 & \0\0861.14 & 1.10 & 17.42 & -19.11 & -10.67 \\ \br

325:     \end{tabular}

326:   \end{center}

327:   \caption{Binding parameters for a set of 63 TFs of the yeast {\it S. cerevisiae}: number of binding sites used in the analysis ({\bf a}), experimentally measured protein abundance ({\bf b}), maximal ratio of binding energy to chemical potential (cf. equation~4 in Methods) ({\bf c}) and, in units of $k_{\rm B}T$, the estimate of the chemical potential ({\bf d}) and the minimal (consensus) binding energies stemming from the QPMEME ({\bf e}) and BvH ({\bf f}) matrices, respectively. Dashes mark the 22 TFs for which the QPMEME scale factor could not be determined.}

328:   \label{tab:tf_info}

329: \end{table}

330:

331: Figure~\ref{fig:4} reports the behavior of the background energy $F_b$,

332: previously defined as the offset of the chemical potential $\mu$ (its

333: value at unit copy number $n=1$).  The result is that the maximum

334: programmability relation $F_b\simeq E^*$ proposed in \cite{hwa} is

335: indeed peculiar to the three coliphage and the bacterial TFs which

336: were considered. A different behavior is clearly observed in the yeast

337: {\it S. cerevisiae}. The background energy $F_b$ is not comparable to

338: the consensus binding energy $E^*$, but is generally smaller and the

339: difference is correlated with the experimentally observed abundance

340: $n_{\rm obs}$, as can be seen in figure~\ref{fig:4}. In other words, the

341: experimental observations are more in agreement with the behavior $E^*

342: - F_{\rm b} \propto \log n_{\rm obs}$ than the maximum programmability

343: relation $E^* - F_{\rm b} \approx 0$.  Note that this holds

344: irrespective of the method (maximum likelihood or QPMEME) used to

345: estimate the discrimination matrices.

346:

347: %%%%%%%%%%%%%%%%%%%%%%

348: % FIGURE 3

349: %%%%%%%%%%%%%%%%%%%%%%

350: \begin{figure}[htp]

351:   \begin{center}

352:     \includegraphics[width=8cm]{figure3.pdf}

353:   \end{center}

354:   \caption{Average discrimination energy {\it vs} length

355:     of binding sites.  Reported values refer to energy

356:     matrices computed using the QPMEME method, as described

357:     in the Methods.}

358:   \label{fig:3}

359: \end{figure}

360:

361: As a further test, we compared the experimentally measured TF

362: abundances with the number of binding sites found in

363: SGD~\cite{SGD-homepage} and as reported by Lee~{\it et

364:   al.}~\cite{Lee}.  For the latter, we counted all sites of

365: protein-DNA-interaction with associated $p$-values $<1\cdot10^{-3}$

366: (L1) and $<5\cdot10^{-3}$ (L5).  The {\it rationale} of this analysis

367: is as follows.  If the maximum programmability ansatz $F_{\rm b} - E^*

368: \approx 0$ were satisfied, we should expect that TF abundances are the

369: main leverage in the control of the number of binding sites. This is

370: the heuristic advantage provided by maximum programmability \cite{hwa}

371: and a strong dependence of the number of binding sites on the TF

372: abundance should then be present.  No such behavior is expected for

373: the alternative hypothesis $E^* -F_{\rm b}\propto \log n_{\rm obs}$: a

374: sizable fraction of the TF copies are weakly attached to the DNA, yet

375: the sites are sufficiently numerous to compete with high-specificity

376: sites.  A straightforward regression analysis gives coefficients of

377: determination $R^2$ close to zero, \textit{viz.} $0.0440$ for the SGD set

378: and $0.0513$ and $0.0900$ for the L1 and L5 sets, respectively. Even

379: though the $p$-values for the three sets show some statistical

380: significance ($0.07,~0.04,~0.006$, respectively), the low values of

381: $R^2$ indicate that the fraction of the variance explained by the

382: regression is scanty. To summarize, the correlation between the number

383: of binding sites and abundance is slightly significant (as should be

384: expected) but the weakness of the dependency confirms previous

385: conclusions.

386:

387: \section{Discussion}

388: \label{s:discussion}

389:

390: The integration of binding data provided by chromatin

391: immunoprecipitation experiments \cite{SGD-homepage,Lee} and abundance

392: data from~\cite{tf_amount} allowed us to extract information on

393: the relation between binding affinities and abundances of TFs in

394: the log-growth phase of the budding yeast {\it S. cerevisiae}.  The

395: availability of experimental data for other conditions would enable a

396: wider perspective, yet two main points have already emerged here and

397: are worth discussing for their biological consequences and

398: significance.

399:

400: \medskip A first technical point is that, while bioinformatic tools to

401: infer binding free energies generally only give these up to a scale

402: factor, we have shown that combining the recent method QPMEME

403: \cite{Marko03} and abundance data can provide an estimate of that

404: factor. This may be of general methodological interest and useful for

405: future applications.

406:

407: For the budding yeast problem considered here, the scale factor could

408: be estimated for 41 transcription factors out of 63. For the remaining

409: 22 TFs ``individual specificity'' is not ensured by the observed

410: affinities and abundances, i.e. the binding sites are bound even

411: though their energy is larger than the chemical potential. This

412: prevents using QPMEME, since the method works in the strong binding

413: regime and supposes that all binding sites have energy below the

414: chemical potential. Biologically, having binding sites occupied

415: despite their energies being above the chemical potential does not

416: pose any contradiction, since additional effects such as other factors

417: and/or regulations of the chromosomal structure might crucially

418: contribute to specificity.  Indeed, ChIP data (see figure~2 in

419: \cite{Lee}) clearly indicate that many genes of {\it S. cerevisiae}

420: are regulated by multiple TFs. Furthermore, global chromatin

421: remodeling effects will reduce the effective size of the genome which

422: is accessible to TFs and increase specificity.  Finally, in eukaryotes

423: it is well known that combinatorial regulation is widespread

424: \cite{Davidson01} and its mode of action hinges on strong cooperative

425: effects among the TFs.  The corresponding loci are often structured so

426: as to require the synergistic action of various TFs and to remain

427: unbound and inactive if only one of them is present.  Results of our

428: analysis are in quantitative agreement with this picture.

429:

430: \medskip The second and main result of our work is that experimentally

431: observed abundances are marginally sufficient to ensure strong and

432: persistent binding of {\it S. cerevisiae} TFs to DNA sites. This is

433: quantified and supported by the results presented in figure~\ref{fig:4}.  More

434: technically, the background free energy $F_{\rm b}$ was found to be

435: negative and proportional to $\log n_{\rm obs}$, where $n_{\rm obs}$

436: is the abundance experimentally measured in

437: \cite{tf_amount}. Consequently, the chemical potential $\mu$ remains

438: below the minimal consensus energy $E^*$ if $n\ll n_{\rm obs}$.  This

439: implies that a sizable part of the TF copies are ``lost in the

440: background'' and that the \textit{in vivo} observed binding sites are

441: only occupied with low probability if the abundance is significantly

442: lower than $n_{\rm obs}$.

443:

444: What might superficially appear as a waste in fact ensures an

445: effective noise-filtering procedure. Fluctuations in the copy number

446: of proteins are unavoidable in the molecular world and have been

447: experimentally demonstrated in various cases (see, e.g.,

448: \cite{Elowitz-Science-2002}).  A few spurious copies of TFs might be

449: present in the cell due to a variety of mechanisms, ranging from delayed

450: degradation to leaks, lack of tight regulatory control and

451: fluctuations in the expression rates.  In an \textit{E. coli} system,

452: it has recently been shown that extrinsic effects, over and above

453: cell-cycle dependent changes in gene copy number, acting \textit{e.g.}

454: through different concentrations of metabolites, ribosomes and

455: polymerases, may amount to 35\% fluctuations in gene expression

456: levels, and may persist over a cell cycle~\cite{Elowitz-Science-2005}.

457: Intrinsic fluctuations, while persisting for shorter times, are also

458: significant, at the 20\% level~\cite{Elowitz-Science-2005}.  The

459: relation $E^* - F_{\rm b}\propto \log n_{\rm obs}$ between the

460: background affinity energy and the abundance of the transcription

461: factors shown in figure~\ref{fig:4} ensures an effective way to filter out those

462: fluctuations and control mis-regulations.

463:

464:

465: %%%%%%%%%%%%%%%%%%%%%%

466: \section{Conclusions}

467: In conclusion, our results point to the importance of quantitative

468: effects of abundances in the regulatory dynamics of the cell. In

469: particular, the abundance-affinity relationship $E^*-F_{\rm b}\propto

470: \log n_{\rm obs}$ demonstrated here is a powerful control lever to

471: ensure global coherent responses of the cellular regulatory networks

472: despite the noisy nature of their individual molecular components.

473:

474:

475:

476: %%%%%%%%%%%%%%%%%%

477: \section{Methods}

478:

479: Let us consider a TF that diffuses in a cell containing a genomic

480: sequence of length $L$. The partition function of specific and

481: non-specific binding to DNA is

482: \begin{equation}

483:  Z_b = \sum_{j=1}^{L} e^{-\beta E(S_j)}

484: + L e^{-\beta E_{\rm ns}}\,,

485: \label{eq:Z-def}

486: \end{equation}

487: where $\beta=1/(k_{\rm B}T)$ is the inverse temperature

488: and $S_j$ is the subsequence of length $l$

489: starting at position $j$ in the genomic sequence.  In~\eref{eq:Z-def}

490: we have omitted the contribution from TF copies freely diffusing in the

491: cytoplasm, assuming their number to be much smaller than the number of

492: TFs bound.  $E_{\rm ns}$ denotes the energy of the state where the

493: TF is bound non-specifically to the DNA \cite{Berg3,Marko,Mirny}. From

494: \eref{eq:Z-def} follows the definition of the effective

495: background (free) energy $F_{\rm b}$ as:

496: \begin{eqnarray}

497: \label{eq:binding-probability-2}

498: F_{\rm b} = -\beta^{-1}\log Z_b\,.

499: \end{eqnarray}

500:

501: A commonly employed expression for the binding energies $E(S)$ is the

502: additive \textit{energy matrix} form

503: \cite{Stormo82,Staden84,Stormo00}:

504: \begin{eqnarray}

505: \label{eq:additivity}

506: E(S) = \sum_{i=1}^l \sum_{\alpha=1}^4 \varepsilon_{i,\alpha} S_{i,\alpha}\,.

507: \end{eqnarray}

508: Here, the indicator vector $S_{i,\alpha}$ has entries zero or one

509: depending on which nucleotide $\alpha$ stands at position $i$ in the

510: sequence $S$, $\varepsilon_{i,\alpha}$ is the free energy contribution

511: of nucleotide $\alpha$ at $i$ and $l$ is the length of the binding

512: domain. Even though exceptions are known \cite{Bulyk}, the linear form~\eref{eq:additivity} generally gives a good approximation of the

513: energy profile \cite{StormoFields98}.
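
As a minimal illustration of~\eref{eq:additivity} (the three-base site is hypothetical), a sequence $S=\mathrm{ACG}$ with $l=3$ has
\[
E(\mathrm{ACG}) = \varepsilon_{1,A}+\varepsilon_{2,C}+\varepsilon_{3,G}\,,
\]
since exactly one indicator $S_{i,\alpha}$ is non-zero at each position. Under additivity the consensus energy is obtained by minimizing each position independently, and the mean over a background of independent nucleotides with frequencies $p_\alpha$ is likewise a sum over positions:
\[
E^* = \sum_{i=1}^{l}\min_{\alpha}\varepsilon_{i,\alpha}\,,\qquad
\langle E\rangle = \sum_{i=1}^{l}\sum_{\alpha}p_{\alpha}\,\varepsilon_{i,\alpha}\,.
\]
Reading the differences $E^*-\langle E\rangle$ of table~\ref{tab:tf_info} in this way assumes that the average there is taken over such an independent-nucleotide background.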

514:

515: Expression~\eref{eq:binding-probability-2} of the background energy

516: $F_{\rm b}$ may be approximated by an average over a random

517: ensemble (background).  The approximation is justified in \cite{hwa}

518: by a mapping to the Random Energy Model~\cite{Derrida}. As for the

519: choice of the random ensemble, the simplest background model features

520: independent nucleotides generated with the average genomic frequencies

521: $p_{\alpha}$ ($\alpha=A,C,G,T$), yielding:

522: \begin{eqnarray}

523: \label{eq:background}\sum_{j} e^{-\beta E(S_j)}\simeq L \langle

524: e^{-\beta E} \rangle & \equiv & L \prod_{i=1}^{l} \left[\sum_{\alpha}

525: p_{\alpha} e^{-\beta \varepsilon_{i,\alpha}} \right]\,.

526: \end{eqnarray}
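The factorization in~\eref{eq:background} is a direct consequence of the additivity~\eref{eq:additivity} and of the independence of the nucleotides. Spelled out, with $\alpha_i$ denoting the nucleotide at position $i$ (a notation introduced only for this sketch),
\[
\langle e^{-\beta E}\rangle
= \sum_{\alpha_1,\ldots,\alpha_l}\;\prod_{i=1}^{l} p_{\alpha_i}\,e^{-\beta\varepsilon_{i,\alpha_i}}
= \prod_{i=1}^{l}\left[\sum_{\alpha}p_{\alpha}\,e^{-\beta\varepsilon_{i,\alpha}}\right],
\]
so that a sum over $4^l$ sequences reduces to $l$ sums of four terms each.
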

527: It follows that

528: \begin{eqnarray}

529: \label{eq:F_b_interm}

530: F_{\rm b} \simeq -\beta^{-1}\log \left\{L\int dE~\rho(E)\,e^{-\beta E} +

531: L\,e^{-\beta E_{\rm ns}}\right\}\,,

532: \end{eqnarray}

533: where $\rho(E)=\langle\delta\left(E-\sum_{i,\alpha}

534: \varepsilon_{i,\alpha} S_{i,\alpha}\right)\rangle$ is the density of

535: states for the random ensemble.  The background density $\rho(E)$ can

536: be computed by a saddle point expansion, where the first term is

537: Gaussian \cite{Marko03}.  Figure~\ref{fig:5} compares the empirical energy

538: density (obtained by the histogram of the energies measured over the

539: whole genome) with the Gaussian and the first correction. While the

540: former alone would not be appropriate (the empirical curve is not

541: symmetric), the correspondence with the latter is quite fair. For a

542: few TFs the match is poorer, mainly because discretization effects

543: are more pronounced.
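
For orientation, the leading (Gaussian) term can be written explicitly. The following is a sketch under the independent-nucleotide background model; the symbols $\bar\varepsilon_i$ and $\sigma^2$ are introduced here only for illustration. The mean and variance of $E$ over the random ensemble are
\begin{eqnarray*}
\langle E\rangle = \sum_{i=1}^{l}\bar\varepsilon_i\,, \qquad
\sigma^2 = \sum_{i=1}^{l}\left[\sum_{\alpha} p_{\alpha}\,\varepsilon_{i,\alpha}^2 - \bar\varepsilon_i^{\,2}\right]\,, \qquad
\bar\varepsilon_i \equiv \sum_{\alpha} p_{\alpha}\,\varepsilon_{i,\alpha}\,,
\end{eqnarray*}
and a Gaussian $\rho(E)$ with these two moments gives
\begin{eqnarray*}
\int dE~\rho(E)\,e^{-\beta E} = \exp\left(-\beta\langle E\rangle + \frac{\beta^2\sigma^2}{2}\right)\,.
\end{eqnarray*}
The exact ensemble average is the product in~\eref{eq:background}. Expanding its logarithm in powers of $\beta$ reproduces the two terms in the exponent above at the first two orders, while the third and higher cumulants encode the asymmetry of the density.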

544:

545: For a TF present with $n$ copies in the cell, the probability that a

546: sequence $S_i$ be bound by the TF takes the Fermi-Dirac form (see,

547: e.g.,~\cite{hwa,Bintu2005a,Bintu2005b} for more details):

548: \begin{eqnarray}

549: \label{eq:binding_n} \mathcal P(S_i) & = & \frac{1}{1 + e^{\beta

550: (E(S_i) - \mu)}}\,,

551: \end{eqnarray}

552: with the chemical potential $\mu$ implicitly defined by

553: \begin{eqnarray}

554: \label{eq:mu_relation_1}

555: n & = & L \int dE~\left[\rho(E) +

556: \delta(E-E_{\rm ns})\right] \frac{1}{1 + e^{\beta (E-\mu)}}\,.

557: \end{eqnarray}

558: Equation~\eref{eq:mu_relation_1} simply states that the sum over all

559: the binding sites, weighted by the probability that a TF is bound

560: there, equals the copy number of the TF in the system.

561:

562: \subsection{Inference of binding properties}

563: \label{s:inference}

564:

565: %%%%%%%%%%%%

566:

567: A list of binding sites for a wide set of TFs of {\it S. cerevisiae}

568: was downloaded from the SGD database \cite{SGD-homepage}. The binding

569: sites were extracted from the intergenic regions identified by

570: chromatin immunoprecipitation experimental data \cite{Lee} as detailed

571: in \cite{Harbison}. We retained those TFs for which at least two

572: binding sites and their abundance were available and processed them as

573: detailed hereafter.

574:

575: A proxy of the binding properties of the TFs is provided by the

576: log-odds ratios based on the classical work \cite{BergvonHippel87}:

577: \begin{eqnarray}

578: \label{eq:BvH}

579: \Delta\varepsilon_{i,\alpha} = \frac{1}{\lambda}\log\frac{1+n^*_i}

580: {1+n_{i,\alpha}}\,,

581: \end{eqnarray}

582: where $n_{i,\alpha}$ is the number of observations of nucleotide

583: $\alpha$ at the $i$-th position in the binding site and $n^*_i$ is the

584: number of observations of the most frequently observed nucleotide in

585: that position. $\lambda$ is an unknown scale factor in units of

586: $k_{\rm B}T$.

587:

588: The discrimination energy of a sequence $S$ is defined as the

589: difference between $E(S)$ and the consensus energy and is hence

590: directly given by the sum over positions of the $\Delta\varepsilon_{i,\alpha}$ in~\eref{eq:BvH}.

591: The scale factor $\lambda$ must be determined from at least one

592: experimentally measured affinity. In the absence of experimental data,

593: we have set it to unity (in units of $k_{\rm B} T$), which is a fair

594: average of the values found for a number of prokaryotic examples

595: in~\cite{BergvonHippel87}, and agrees with bioinformatic

596: practice~\cite{StormoFields98}.
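
As a worked illustration of~\eref{eq:BvH} with $\lambda=1$ and natural logarithms, consider a hypothetical position $i$ at which ten aligned sites show the counts $n_{i,{\rm A}}=8$, $n_{i,{\rm C}}=1$, $n_{i,{\rm G}}=0$ and $n_{i,{\rm T}}=1$ (these numbers are chosen for illustration and are not taken from the data set). Then $n^*_i=8$ and
\begin{eqnarray*}
\Delta\varepsilon_{i,{\rm A}} = \log\frac{9}{9} = 0\,, \qquad
\Delta\varepsilon_{i,{\rm C}} = \Delta\varepsilon_{i,{\rm T}} = \log\frac{9}{2} \simeq 1.50\,, \qquad
\Delta\varepsilon_{i,{\rm G}} = \log\frac{9}{1} \simeq 2.20\,,
\end{eqnarray*}
in units of $k_{\rm B}T$. The pseudocount of one keeps the penalty finite for a nucleotide that is never observed, and the most frequent nucleotide has zero discrimination energy by construction.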

597:

598: As a second proxy we have used the recently introduced QPMEME

599: method~\cite{Marko03}. This also does not give access to the binding

600: energies as such, but to the ratio of binding energies to a chemical

601: potential, shifted by the mean free energy of binding of the

602: corresponding TF:

603: \begin{eqnarray}

604: \label{eq:ratio}

605: r \equiv \frac{E-\langle E\rangle}{|\mu-\langle E\rangle|}\equiv

606: \frac{\hat{E}}{|\hat{\mu}|}

607: =\sum_{i,\alpha}\hat{\varepsilon}_{i,\alpha}S_{i,\alpha}\,,

608: \end{eqnarray}

609: where $\langle\bullet\rangle$ denotes the average over the random

610: background ensemble defined as before.  The calculation of the matrix

611: $\hat{\varepsilon}_{i,\alpha}$ boils down to a convex optimization

612: problem, where the width of the background probability distribution is

613: minimized under the constraints that all sequences in the training set

614: be bound.  Note that neither the average energy $\langle

615: E\rangle\equiv \sum_{i,\alpha} p_{\alpha}\varepsilon_{i,\alpha}$ nor

616: the chemical potential $\mu$ is determined by QPMEME. {\em

617:   Differences} between pairs of energies, \textit{e.g.} discrimination

618: energies, are determined up to the {\em scale factor} $|\hat\mu|$.

619:

620: %%%%%%%%%%%%%%%%%%%%%%

621: % FIGURE 4

622: %%%%%%%%%%%%%%%%%%%%%%

623: \begin{figure}[htp]

624:   \begin{center}

625:     \includegraphics[width=7.6cm]{figure4-a.pdf}

626:     \includegraphics[width=7.6cm]{figure4-b.pdf}

627:   \end{center}

628:   \caption{Comparison of the relation between the background energy $F_{\rm b}$ and

629:     the abundance for a set of {\it S. cerevisiae} transcription

630:     factors. Values of the difference between the consensus energy $E^*$

631:     and the background energy $F_{\rm b}$ are reported as squares. Their

632:     values shifted by the logarithm of the TF abundance (as measured

633:     experimentally) are reported as circles.  Vertical dashed lines

634:     correspond to the average values for the two sets of points.  Points

635:     have a sizeable scatter but circles are clearly centered around

636:     zero.  No relation has been found between the deviation of the

637:     points around zero and the functional role of the corresponding

638:     TFs. Long panels: results for log-odds ratio matrices; short panels:

639:     results for QPMEME matrices.  Histograms give better visual access

640:     to the distribution widths.}

641:   \label{fig:4}

642: \end{figure}

643:

644: The energy matrices $\Delta\varepsilon_{i,\alpha}$ and

645: $\hat{\varepsilon}_{i,\alpha}$ have finite sample errors, which could

646: in principle be estimated as in~\cite{BergvonHippel87}.  Assuming the

647: sample to be non-biased, these errors decrease with the number of

648: known binding sites $N_{\rm BS}$ as $1/\sqrt{N_{\rm BS}}$. A

649: comparison with table~1 reveals that this error is at least on the

650: order of 10\% (for those TFs for which about a hundred binding sites

651: are known), ranging up to 50\% (for those with only a few binding

652: sites known). The chemical potential is determined by the reduced

653: energy matrix and the observed abundance $n_{\rm obs}$, which also has

654: experimental errors and is likely to fluctuate {\it in vivo}.  The

655: error on the inferred chemical potential is

656: thus at best on the order of 10\%. This should nevertheless be

657: sufficient to elucidate statistical trends, which is our purpose here.

658:

659: The probabilities $q_{i,\alpha}$ appearing in the Results denote the

660: probabilities that nucleotide $\alpha$ is found at position $i$ in the

661: TF-DNA complex. They are computed from the energy matrices

662: $\Delta\varepsilon_{i,\alpha}$ as\,:

663: \begin{equation}

664: q_{i,\alpha} = \frac{e^{-\beta\Delta \varepsilon_{i,\alpha}}}

665: {\sum_{\alpha'} e^{-\beta\Delta \varepsilon_{i,\alpha'}}}\,.

666: \end{equation}
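
For the log-odds matrices~\eref{eq:BvH}, and assuming $\beta/\lambda=1$ (energies in units of $k_{\rm B}T$ with $\lambda$ set to unity), this expression reduces to pseudocount-regularised frequencies. Indeed, $e^{-\beta\Delta\varepsilon_{i,\alpha}} = (1+n_{i,\alpha})/(1+n^*_i)$, the factor $1+n^*_i$ cancels between numerator and denominator, and
\begin{eqnarray*}
q_{i,\alpha} = \frac{1+n_{i,\alpha}}{\sum_{\alpha'}\left(1+n_{i,\alpha'}\right)}
= \frac{1+n_{i,\alpha}}{4+N_{\rm BS}}\,,
\end{eqnarray*}
where the last equality assumes that each of the $N_{\rm BS}$ aligned binding sites contributes exactly one nucleotide at position $i$. A nucleotide that is never observed is thus assigned the small but non-zero probability $1/(4+N_{\rm BS})$.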

667:

668: \subsection{Computing the background free energy}

669: \label{p:fbcomp}

670:

671: Definition (\ref{eq:binding-probability-2}) involves two terms: one

672: describing binding to the genomic background and the other

673: non-specific electrostatic interactions with the DNA. The latter is

674: crucial to the target search \cite{Berg3}. As shown in \cite{hwa}, the

675: background contribution cannot be larger than the non-specific part:

676: the TF would otherwise diffuse in the background random medium and get

677: slowed down by its local minima.  In fact, the two contributions are

678: expected to be comparable.  The division in background and functional

679: binding sites is indeed dynamical and the former provides the

680: evolutionary reservoir for the latter. Therefore, evolvability of the

681: regulatory network suggests that the background energy will tend to be

682: low, as far as is compatible with the aforementioned specificity and kinetic

683: constraints (see \cite{Lassig1,Lassig2} about evolvability).

684:

685: %%%%%%%%%%%%%%%%%%%%%%

686: % FIGURE 5

687: %%%%%%%%%%%%%%%%%%%%%%

688: \begin{figure}[htp]

689:   \begin{center}

690:     \includegraphics{figure5.pdf}

691:   \end{center}

692:   \caption{The density of states for the TF ABF1. Dashed in black,

693:     the curve obtained for a random background. In red, the empirical

694:     curve found by computing the distribution of energies over the

695:     genome. The energy scale has been chosen so as to have the chemical

696:     potential $\mu=-1$.}

697:   \label{fig:5}

698: \end{figure}

699:

700: Our estimate for the background energy $F_{\rm b}$ in (\ref{eq:F_b_interm})

701: is then:

702: \begin{eqnarray}

703: \label{eq:F_b_approx}

704: \beta\left(E^* - F_{\rm b}\right) = \log\left[ 2L \int dr ~

705: \rho(r)

706: e^{-\left(\beta|\hat{\mu}|\right)\left(r-r^*\right)}\right]\,.

707: \end{eqnarray} Here, $\rho(r)$ is the background

708: density of states for the energy matrix $\hat{\varepsilon}_{i,\alpha}$

709: obtained by QPMEME and $r^*$ is the minimal value of the ratio

710: (\ref{eq:ratio}), that is for the energy $E^*$ of the consensus

711: sequence(s) $S^*$. The shift to $E^*$ in (\ref{eq:F_b_approx}) is

712: introduced just to facilitate comparison with the results in figure~\ref{fig:4}.

713:

714: The quantity $\beta|\hat{\mu}|$ is not determined by the QPMEME method

715: proper. We estimate it using the relation (\ref{eq:mu_relation_1}),

716: the fact that in QPMEME binding energies are only determined up to the

717: relative chemical potential, and the additional information on the TF

718: abundance $n_{\rm obs}$ from \cite{tf_amount}. Using the previous

719: arguments on background and non-specific contributions, we get:

720: \begin{eqnarray}

721: n_{\rm obs}=2L \int dr ~

722: \rho(r) ~ \frac{1}{1 + e^{\beta|\hat{\mu}|

723: (r+1)}}\,,

724: \label{eq:mu_relation_2}

725: \end{eqnarray}

726: whence $\beta|\hat{\mu}|$ is extracted and inserted back into

727: (\ref{eq:F_b_approx}) to obtain the value of the background effective

728: (free) energy $F_{\rm b}$.  As previously discussed,

729: (\ref{eq:mu_relation_2}) only has a solution for 41 cases out of 63

730: TFs. It is instructive to compare with (\ref{eq:mu_relation_1}), which

731: has a solution for every TF.  The chemical potential $\mu$ then simply

732: acts as a cut-off, so that sites with energies lower than $\mu$ are

733: mostly bound, while sites with higher energies are not, and the total

734: number of bound TFs equals $n_{\rm obs}$.  Depending on $n_{\rm obs}$,

735: the mostly unbound sites may or may not include \textit{in vivo}

736: observed binding sites, \textit{i.e.}, part of the set of sites from

737: which the maximum likelihood energy matrices have been constructed.

738: In (\ref{eq:mu_relation_2}), on the other hand, all the \textit{in

739:   vivo} binding sites must necessarily have binding energy below the

740: chemical potential, because these are the constraints under which the

741: QPMEME reduced energy matrix $\hat\varepsilon$ is determined. Hence,

742: all sites for which the QPMEME reduced energy is below the

743: threshold $-1$ will be at least half-filled. Each of these is actually

744: present in the genome with some probability, which leads to a total

745: expected number of at least half-filled sites.  Therefore,

746: (\ref{eq:mu_relation_2}) cannot be solved if $n_{\rm obs}$ is low

747: enough, because the right-hand side has a lower bound. This happens in

748: about one third of the cases at hand.
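
The lower bound can be made explicit; what follows is a sketch of a sufficient condition rather than a sharp criterion. For $r\le -1$ the exponent $\beta|\hat{\mu}|(r+1)$ is non-positive, so the Fermi-Dirac weight is at least $1/2$ whatever the value of $\beta|\hat{\mu}|$, while for $r>-1$ the weight is positive. Hence
\begin{eqnarray*}
2L \int dr~\rho(r)\,\frac{1}{1 + e^{\beta|\hat{\mu}|(r+1)}}
\ge 2L \int_{r\le -1} dr~\rho(r)\,\frac{1}{2}
= L \int_{r\le -1} dr~\rho(r)\,,
\end{eqnarray*}
and (\ref{eq:mu_relation_2}) has no solution whenever $n_{\rm obs}$ is smaller than the right-hand side, i.e. smaller than the expected number of background sequences with reduced energy below $-1$.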

749:

750: \subsection{Maximal programmability}

751: \label{s:maximals}

752: In the simplest scenario where the major contribution in

753: \eref{eq:mu_relation_1} stems from energies where the Fermi-Dirac

754: weight can be approximated by the Boltzmann factor, one can invert

755: \eref{eq:mu_relation_1} to obtain

756: \begin{eqnarray}

757: \label{eq:mu_relation_3} \mu \simeq \beta^{-1}\log n + F_{\rm b}\,.

758: \end{eqnarray}

759: The occupation probability of a site $t$ reads then $P_t =

760: \frac{1}{1+{\tilde n}_t/n}$, where the threshold concentration

761: ${\tilde n}_t$ is $e^{\beta(E_t - F_{\rm b})}$.  The minimal copy

762: number required for strong binding (to the consensus) must then be at

763: least $e^{\beta(E^* - F_{\rm b})}$.
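
The steps leading to~\eref{eq:mu_relation_3} can be spelled out as follows. Replacing $1/(1+e^{\beta(E-\mu)})$ by $e^{-\beta(E-\mu)}$ in~\eref{eq:mu_relation_1} gives
\begin{eqnarray*}
n \simeq e^{\beta\mu}\, L \int dE~\left[\rho(E) + \delta(E-E_{\rm ns})\right] e^{-\beta E}
\simeq e^{\beta(\mu - F_{\rm b})}\,,
\end{eqnarray*}
where the second step uses~\eref{eq:F_b_interm}. Taking the logarithm yields~\eref{eq:mu_relation_3}, and inserting the result into~\eref{eq:binding_n} gives $e^{\beta(E_t-\mu)} = {\tilde n}_t/n$. As a purely illustrative numerical example (the value is hypothetical), a site with $\beta(E_t - F_{\rm b}) = \log 100$ has ${\tilde n}_t = 100$: it is half-occupied at $n=100$ copies, occupied with probability $1/(1+0.1)\simeq 0.91$ at $n=1000$ and with probability $1/(1+10)\simeq 0.09$ at $n=10$.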

764:

765: Maximal programmability \cite{hwa} amounts to positing the lowest

766: (unity) threshold. The approximate equality $F_{\rm b} \approx E^*$

767: should then hold. One consequence, which motivates the term, is that

768: the consensus sequence is then half-bound if there is just a single

769: copy of the TF present in the cell.  Different regulatory elements can

770: then have their thresholds set, or programmed, from one, if their sequences

771: are the consensus sequence, and upwards, independently of a feedback

772: induced by the actual TF copy number.

773:

774:

775: %%%%%%%%%%%%%%%%%%%%%%%%%%%

776: \section*{Acknowledgements}

777: This work was supported by the Swedish Research Council through contract

778: 2003-4614 (E.A., C.M. and A.F.d'H.).

779:

780: \section*{References}

781: \bibliographystyle{unsrt}

782: \bibliography{pb-afmv-references}

783:

784: \end{document}

785: