q-bio0703063/afmv.tex
1: \documentclass[12pt]{iopart}
2: 
3: \usepackage{graphicx}
4: %\usepackage{amsmath}
5: %\usepackage{amssymb}
6: 
7: %\eqnobysec
8: 
9: \begin{document}
10: 
11: \title[Numbers and affinity]{Noise-filtering features of      
12:     transcription regulation in the yeast {\it S. cerevisiae}}
13: 
14: \author{%
15:   Erik Aurell$^1$ and
16:   Aymeric Fouquier d'H\'erou\"el$^{1,2}$ and
17:   Claes Malmn\"as$^{1}$ and
18:   Massimo Vergassola$^2$
19: }
20: 
21: \address{$^1$ Computational Biological Physics, Royal Institute of Technology, AlbaNova University Center, Stockholm, Sweden}
22: \address{$^2$ CNRS, URA 2171, Institut Pasteur, Dept. ``G\'enomes et G\'enetique'', Research Unit ``G\'enetique in Silico'', 25 rue du Dr Roux, Paris, France}
23: 
24: \eads{%
25:   \mailto{eaurell@kth.se},
26:   \mailto{afd@kth.se},
27:   \mailto{malmnas@kth.se},
28:   \mailto{massimo@pasteur.fr}
29: }
30: 
31: \begin{abstract}
32:   Transcription regulation is largely governed by the profile and the dynamics of transcription factors' binding to DNA. Stochastic effects are intrinsic to these dynamics, and the binding to functional sites must be controlled with a certain specificity for living organisms to be able to elicit specific cellular responses. Specificity stems here from the interplay between the binding affinity and the cellular abundance of transcription factor proteins, and the binding of such proteins to DNA is thus controlled by their chemical potential.
33: 
34:   We combine large-scale protein abundance data in the budding yeast with binding affinities for all transcription factors with known DNA binding site sequences to assess the behavior of their chemical potentials. A sizable fraction of transcription factors is apparently bound non-specifically to DNA and the observed abundances are marginally sufficient to ensure high occupations of the functional sites. We argue that a biological cause of this feature is related to its noise-filtering consequences: abundances below physiological levels do not yield significant binding of functional targets and mis-expressions of regulated genes are thus tamed.
35: \end{abstract}
36: 
37: \pacs{87.80.Vt}
38: 
39: \vspace{1.0cm}
40: \begin{flushleft}
41: Running title: \textit{Numbers and affinity}
42: \end{flushleft}
43: 
44: %\submitto{\PB}
45: 
46: \maketitle
47: 
48: 
49: 
50: 
51: \section{Introduction}
52: 
53: A major determinant in transcription regulation is the pattern of
54: transcription factor proteins (TFs) bound in the physical proximity of
55: the transcribed genomic locus~\cite{Ptashne,Davidson02,Davidson01}.
56: Intense efforts are currently devoted to identifying transcriptional
57: regulatory networks~\cite{PughGilmour2001,Lee,OCT4SOX2NANOG}, their
58: topology~\cite{Milo2002,Shen-Orr2002,Ping,MBV05} and signs and
59: strengths of the interactions~\cite{Ronen02}. Specificity is an
60: obvious need in transcription regulation: functional binding sites
61: ought to be sufficiently low in energy compared to typical sequences
62: in the rest of the genome (the so-called background). This energetic
63: constraint should be coupled with its kinetic counterpart: the TF
64: should be able to rapidly find its functional targets. Existing
65: evidence points at a search taking place via 1D sliding along the DNA,
66: alternated with 3D excursions \cite{Berg3,Marko}. The TF is kept along
67: the DNA by non-specific electrostatic interactions, recently
68: characterized experimentally~\cite{Mirny,Gowers}.
69: 
70: Two quantitative variables govern the binding of TFs to DNA: their
71: cellular abundance and the affinity between the amino acids forming
72: their binding domains and the various possible stretches of
73: nucleotides. It has long been recognized in concrete examples that
74: equilibrium statistical-mechanics models are poised to describe the
75: binding site occupancy as a function of those parameters, and that
76: these occupancies are proxies for transcription
77: rates~\cite{Ptashne,SheaAckers85}.  Detailed models for the
78: probability of binding to DNA by TFs have recently been reviewed
79: in~\cite{Bintu2005a,Bintu2005b} and we refer the interested reader
80: thereto (see also Methods for a concise summary).
81: 
82: The qualitative point of importance here is that the probability of
83: a TF binding to DNA is controlled by its so-called chemical potential
84: $\mu$. As illustrated in figure~\ref{fig:1}, strong binding sites (with energy
85: much lower than $\mu$) are occupied almost certainly, while weak
86: sites, with energies much higher than $\mu$, are most frequently
87: empty.  The chemical potential $\mu$ increases with the number of
88: copies $n$ of the transcription factor as $\log n$ (see,
89: e.g.,~\cite{Bintu2005a,Bintu2005b}). For a single copy $n=1$, the
90: value of the chemical potential defines an offset $F_b$, usually
91: called background energy.  The reason is that $F_b$ controls the
92: fraction of TF copies bound to DNA either non-specifically or to the
93: genomic background. Indeed, let $E^*$ denote the minimal binding
94: energy, i.e. the energy of binding to the consensus sequence of the
95: TF. From the previous relation $\mu-F_b\propto \log n$, it follows
96: that if $F_b\simeq E^*$, then the threshold defined by the chemical
97: potential $\mu$ is larger than (or equals) $E^*$ for any $n\geq 1$. In
98: other words, even a single copy of the transcription factor would then
99: be sufficient to ensure persistent binding, at least of the consensus
100: sequence.  Conversely, as $F_{\rm b}$ becomes less than $E^*$, more
101: and more TF copies $n$ are needed to have $\mu\geq E^*$,
102: i.e. persistent occupancy of at least the strongest binding sites. A
103: minimal abundance (which depends exponentially on the difference
104: $E^*-F_b$) is then required to have persistent binding of the
105: strongest sites (supposed to be the functional ones).
106: 
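As a rough illustration of the exponential dependence just mentioned, assume (this is an assumption of the sketch, not a statement derived from the data) that the proportionality constant in $\mu-F_b\propto\log n$ is $k_{\rm B}T$, as holds when all sites are weakly occupied. The condition $\mu\geq E^*$ for persistent binding of the consensus sequence then reads
\[
  n \;\geq\; n_{\min} \simeq \exp\left(\frac{E^*-F_b}{k_{\rm B}T}\right)\,,
\]
so that hypothetical gaps $E^*-F_b$ of $5\,k_{\rm B}T$ and $7\,k_{\rm B}T$ would require about $1.5\times10^{2}$ and $1.1\times10^{3}$ copies, respectively. These values are chosen for illustration only.
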
107: \medskip
108: Detailed quantitative information on the behavior of the chemical
109: potential for transcription factors of biological interest is
110: scanty. The relation between binding affinities and abundances was
111: analyzed in \cite{hwa} for three coliphage TFs (\textit{Mnt},
112: \textit{CI} and \textit{Cro}) and one bacterial TF
113: (\textit{LacR}). The result was that the offset $F_b$ is comparable to
114: the consensus energy $E^*$ for those four TFs. This type of relation
115: endows the cell with the widest possible window to vary the TF copy
116: number and differentially regulate various sets of genes. It was
117: therefore dubbed ``maximum programmability'' \cite{hwa}.
118: 
119: Positing that $F_b\simeq E^*$ is generally valid seems, however, too strong a
120: requirement for the cellular dynamics, as it would make regulation
121: too prone to errors. In fact, as already
122: noted in \cite{hwa}, the four TFs which were considered are rather
123: special: they are all repressors, they operate without much
124: combinatorial interactions with other factors and their expression is
125: tightly controlled. This is not the situation encountered in
126: general. Namely, combinatorial regulation is much more frequent,
127: especially in eukaryotes, and a large fraction of genes are activated
128: by TFs to their physiological expression levels. Specificity does not
129: arise from a single transcription factor but from the synergistic
130: and cooperative combination of several factors. We then expect that
131: the relation $F_b\simeq E^*$, found in \cite{hwa} for four particular
132: TFs, does not have general validity and that a different relation
133: holds in the majority of cases. Our goal here is to quantify and
134: support this expectation by analyzing experimental data for a large
135: set of transcription factors.
136: 
137: A good model organism to quantitatively investigate the previous issue
138: is the budding yeast {\it S. cerevisiae}. Concentration data in the
139: log-growth phase~\cite{tf_amount} and large-scale chromatin
140: immunoprecipitation binding data, as given by~\cite{Lee,Harbison}, are
141: both available. The intersection of the two data sets leaves us with a
142: set of 63 TFs. The difficulty to be overcome is that large-scale
143: experimental data on binding do not directly provide affinities.  {\it
144:   A priori}, calorimetric methods \cite{Zhang,Takeda} might be
145: employed to measure the strength of the interaction of a TF with its
146: binding sites, but these methods have been hard to scale up, and
147: values are typically not available for a given TF. One is thus forced
148: to infer affinity matrices {\it in silico}, from a list of
149: experimentally detected binding sites.  The procedures and the
150: limitations of these inferences are recalled in the Methods, together
151: with the basics of statistical models for TF-DNA interactions. Two
152: different inference methods were employed: the classical maximum
153: likelihood argument by Berg and von~Hippel~\cite{BergvonHippel87} and
154: the QPMEME method, recently introduced in~\cite{Marko03}.  Results for
155: the relation between affinity and TF abundance, for both ways of
156: determining the binding energies, are presented hereafter. Biological
157: consequences, in particular for the control of noise in transcription
158: regulation, are presented in the Discussion.
159: 
160:  
161: \section{Results}
162: 
163: Combining the two experimental data sets on abundance \cite{tf_amount}
164: and chromatin immunoprecipitation \cite{Lee,Harbison}, a set of 63 TFs was
165: identified. Affinity matrices for those TFs were then inferred as
166: detailed in Methods, using both the classical maximum likelihood
167: procedure \cite{BergvonHippel87} and the QPMEME method \cite{Marko03}.
168: In both cases, the matrices are {\it a priori} determined only up to a
169: scale factor. In the first case, following~\cite{BergvonHippel87}, the
170: factor was set to one in units of $k_{\rm B} T$.  In the QPMEME
171: method, the scale factor was determined as described in the Methods
172: via a self-consistency condition, based on the experimental
173: information on TF abundances.  This condition could be satisfied in 41
174: out of the 63 cases. In the remaining 22 cases no solution could be
175: found, for reasons that will be presented in the Discussion.
176: 
177: The matrices derived by the two aforementioned methods agree well in
178: the majority of the 41 cases where both methods could be employed. A
179: first measure of the agreement between two energy matrices is whether
180: they give the same best binder at each position, which is indeed the case for
181: 26 TFs out of 41. These 26 instances include cases where one TF admits
182: more than one consensus sequence, but where both matrices agree on at
183: least one consensus binder at each position. In 14 cases the sets of
184: consensus sequences agree completely.  For 15 TFs the sets of best
185: binders of the two matrices at some position are not overlapping,
186: \textit{i.e.} in at least one position the sets of best binders
187: differ.
188: 
189: A more quantitative comparison, sensitive to the full energy matrix
190: and not just to the best binder, is to consider the normalized
191: probabilities $q_{i,\alpha}$, i.e.  the probability that nucleotide
192: $\alpha$ be found at the position $i$ of the DNA-TF binding
193: complex. The probabilities computed using the maximum likelihood
194: procedure \cite{BergvonHippel87} or QPMEME \cite{Marko03} are denoted
195: by $q^{\rm BvH}_{i,\alpha}$ and $q^{\rm QP}_{i,\alpha}$,
196: respectively. The difference between the two sets of probabilities is
197: quantified by the symmetric Kullback-Leibler relative entropy
198: \cite{CT06}\,:
199: \begin{equation}
200: S(q^{\rm BvH}_i,q^{\rm QP}_i) = \frac{1}{2}\sum_{\alpha} 
201: \left(q^{\rm BvH}_{i,\alpha}-q^{\rm QP}_{i,\alpha}\right)
202: \log\frac{q^{\rm BvH}_{i,\alpha}}{q^{\rm QP}_{i,\alpha}}\,.
203: \end{equation}
204: Figure~\ref{fig:2} shows the mean Kullback-Leibler relative entropy per base
205: pair for the 41 TFs. Except in a few cases, the average differences
206: per base pair are moderate, on the order of $0.1$--$0.2$. No correlation
207: was detectable between the relative entropies and the number of
208: observed binding sites employed to infer the affinity matrices,
209: indicating that the differences between the QPMEME and Berg-von~Hippel
210: matrices are {\it bona fide} fluctuations and not due to finite sample
211: effects. Detailed properties of the affinity matrices computed using
212: the two methods are reported in table~\ref{tab:tf_info}.
213: 
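For orientation, the following worked example evaluates the symmetric relative entropy defined above at a single position. The two distributions are hypothetical and chosen only for illustration, and natural logarithms are assumed (the base of the logarithm is an assumption of this sketch). Taking $q^{\rm BvH}_i=(0.7,0.1,0.1,0.1)$ and $q^{\rm QP}_i=(0.4,0.2,0.2,0.2)$ over $\alpha=A,C,G,T$, one finds
\[
 S = \frac{1}{2}\left[(0.7-0.4)\log\frac{0.7}{0.4}
   + 3\,(0.1-0.2)\log\frac{0.1}{0.2}\right]
   \simeq \frac{1}{2}\left(0.168+0.208\right) \simeq 0.19\,.
\]
Each term in the sum is non-negative, so $S$ vanishes only when the two distributions coincide.
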
214: %%%%%%%%%%%%%%%%%%%%%%
215: % FIGURE 1
216: %%%%%%%%%%%%%%%%%%%%%%
217: \begin{figure}[htp]
218:   \begin{center}
219:     \includegraphics[width=16cm]{figure1.pdf}
220:   \end{center}
221:   \caption{Left panel: A schematic view of the relation between the
222:     probability of binding to DNA for a transcription factor and its
223:     chemical potential $\mu$. Strong binding sites (with energies much
224:     lower than the chemical potential) have a high occupation
225:     probability (purple solid line), while the probability to bind
226:     decreases rapidly as the energy increases. Right panel: the
227:     relation between the chemical potential and the abundance $n$. The
228:     background (free) energy $F_b$ is the value of $\mu$ for $n=1$.}
229:   \label{fig:1}
230: \end{figure}
231: 
232: As a side remark, note that the average discrimination energy per site
233: generally decreases with the length of the binding site, indicating a
234: trade-off between these two quantities. Figure~\ref{fig:3} displays the data for
235: the QPMEME-derived energy matrices; a similar behavior is found for
236: matrices inferred by maximum likelihood.
237: %%%%%%%%%%%%%%%%%%%%%%
238: % FIGURE 2
239: %%%%%%%%%%%%%%%%%%%%%%
240: \begin{figure}[htp]
241:   \begin{center}
242:     \includegraphics[scale=0.4]{figure2-a.pdf}
243:     \includegraphics[scale=0.4]{figure2-b.pdf}
244:   \end{center}
245:   \caption{Average Kullback-Leibler distance per base pair between the
246:     probability distributions of binding based on computing discrimination
247:     energies by maximum likelihood arguments~\cite{BergvonHippel87} or
248:     QPMEME \cite{Marko03} (see also Methods).}
249:   \label{fig:2}
250: \end{figure}
251: 
252: %%%%%%%%%%%%%%%%%%%%%%
253: % TABLE 1
254: %%%%%%%%%%%%%%%%%%%%%%
255: \begin{table}[htp]
256:   \fontsize{8}{8}\selectfont
257:   \lineup
258:   \begin{center}
259:     \begin{tabular}{@{}*{7}{c}}
260:       \br
261:       TF name & ${}^{({\bf a})}$ $N_{\rm BS}$ & ${}^{({\bf b})}$ $n_{\rm obs}$ & ${}^{({\bf c})}$ $-r^*$ & ${}^{({\bf d})}$ $|\mu-\langle E_{\rm QP} \rangle|$ & ${}^{({\bf e})}$ $E_{\rm QP}^*-\langle E_{\rm QP} \rangle$ & ${}^{({\bf f})}$ $E_{\rm BvH}^*-\langle E_{\rm BvH} \rangle$ \\ \mr
262:       ABF1  & 139 & \04818.49 & 1.17 & 12.86 & -15.11 & -30.01 \\ 
263:       ACE2  & 11 & \0\0538.41 & 1.00 & 17.63 & -17.63 & -12.65 \\ 
264:       BAS1  & 33 & \0\0861.14 & 1.00 & 23.25 & -23.25 & -15.21 \\ 
265:       CAD1  & 8 & \0\0622.84 & 1.11 & 17.51 & -19.35 & -14.44 \\ 
266:       DIG1  & 131 & \01458.26 & 1.44 &  --- & \m--- & -16.37 \\ 
267:       FHL1  & 75 & \0\0639.38 & 1.22 & 20.60 & -25.08 & -26.41 \\ 
268:       FKH1  & 89 & \01720.43 & 1.24 &  --- & \m--- & -19.71 \\ 
269:       FKH2  & 53 & \0\0655.80 & 1.32 &  --- & \m--- & -23.54 \\ 
270:       GAL4  & 9 & \0\0166.39 & 1.21 & 15.89 & -19.29 & -20.41 \\ 
271:       GAL80  & 3 & \0\0783.80 & 1.03 & 12.28 & -12.59 & -15.10 \\ 
272:       GCR1  & 7 & \0\0258.92 & 1.13 & 18.83 & -21.36 & -12.34 \\ 
273:       GLN3  & 48 & \0\0589.44 & 1.20 &  --- & \m--- & -18.08 \\ 
274:       HAP5  & 22 & \0\0450.49 & 1.00 &  --- & \m--- & -11.57 \\ 
275:       INO2  & 27 & \0\0783.80 & 1.17 &  --- & \m--- & -15.26 \\ 
276:       INO4  & 23 & \0\0521.13 & 1.20 & 34.08 & -41.01 & -18.95 \\ 
277:       LEU3  & 9 & \0\0124.51 & 1.11 & 18.42 & -20.37 & -15.24 \\ 
278:       MAC1  & 5 & 14841.76 & 1.03 & 10.71 & -10.98 & \0-8.60 \\ 
279:       MBP1  & 130 & \0\0521.13 & 1.19 &  --- & \m--- & -20.12 \\ 
280:       MCM1  & 63 & \08965.92 & 1.24 & 11.06 & -13.67 & -26.07 \\ 
281:       MET31  & 5 & \0\0521.13 & 1.00 & 16.14 & -16.14 & -10.87 \\ 
282:       MET4  & 5 & \01295.19 & 1.14 & 12.34 & -14.02 & -15.96 \\ 
283:       MOT2  & 2 & \04276.97 & 1.03 & 10.31 & -10.63 & \0-8.11 \\ 
284:       MOT3  & 11 & \01694.68 & 1.00 &  --- & \m--- & \0-9.25 \\ 
285:       MSN2  & 14 & \0\0124.51 & 1.15 &  --- & \m--- & -14.01 \\ 
286:       NDD1  & 27 & \0\0799.42 & 1.15 & 16.10 & -18.44 & -21.42 \\ 
287:       NRG1  & 45 & \0\0555.55 & 1.18 &  --- & \m--- & -17.68 \\ 
288:       PDR1  & 2 & \01295.19 & 1.03 & 12.51 & -12.93 & \0-7.01 \\ 
289:       PDR3  & 2 & \0\0166.39 & 1.04 &  --- & \m--- & \0-5.18 \\ 
290:       PHD1  & 61 & \01417.94 & 1.64 &  --- & \m--- & -15.45 \\ 
291:       PHO2  & 3 & \06418.52 & 1.09 & 10.40 & -11.37 & \0-8.97 \\ 
292:       PUT3  & 3 & \0\0736.46 & 1.05 & 11.85 & -12.49 & -15.96 \\
293:       RAP1  & 70 & \04387.61 & 1.25 & 14.60 & -18.31 & -23.09 \\ 
294:       RCS1  & 28 & \02733.41 & 1.16 & 21.04 & -24.37 & -16.85 \\ 
295:       REB1  & 156 & \07514.31 & 1.16 & 13.02 & -15.12 & -23.68 \\ 
296:       RFX1  & 9 & \0\0376.92 & 1.11 & 14.97 & -16.66 & -18.44 \\ 
297:       RGT1  & 12 & \0\0194.71 & 1.02 &  --- & \m--- & -12.58 \\ 
298:       RLM1  & 9 & \0\0736.46 & 1.13 & 24.46 & -27.54 & -14.40 \\ 
299:       RLR1  & 4 & \0\0521.13 & 1.06 & 19.66 & -20.81 & -11.10 \\ 
300:       ROX1  & 10 & \0\0238.01 & 1.03 &  --- & \m--- & -13.49 \\ 
301:       RTG3  & 4 & \01054.72 & 1.00 & 15.63 & -15.63 & \0-7.43 \\ 
302:       SFP1  & 19 & \0\0258.92 & 1.21 & 46.90 & -56.71 & -17.81 \\ 
303:       SKN7  & 64 & \02572.11 & 1.25 & 20.63 & -25.77 & -18.55 \\ 
304:       SKO1  & 8 & \0\0503.71 & 1.00 & 22.35 & -22.35 & \0-9.89 \\ 
305:       SOK2  & 81 & \0\0314.38 & 1.20 &  --- & \m--- & -16.16 \\ 
306:       SPT23  & 23 & \0\0432.40 & 1.19 &  --- & \m--- & -13.00 \\ 
307:       SPT2  & 24 & \01239.69 & 1.25 &  --- & \m--- & -16.26 \\ 
308:       STB1  & 23 & \0\0319.30 & 1.15 & 27.00 & -31.10 & -17.94 \\ 
309:       STB4  & 4 & \0\0\098.94 & 1.00 & 18.69 & -18.70 & -10.11 \\ 
310:       STB5  & 15 & \0\0279.40 & 1.12 & 25.07 & -28.06 & -16.32 \\ 
311:       STE12  & 147 & \01923.06 & 1.23 &  --- & \m--- & -18.19 \\ 
312:       SUM1  & 43 & \0\0148.81 & 1.22 &  --- & \m--- & -20.65 \\ 
313:       SWI4  & 99 & \0\0589.44 & 1.28 &  --- & \m--- & -21.42 \\ 
314:       SWI5  & 40 & \0\0688.35 & 1.00 & 37.53 & -37.53 & -14.88 \\ 
315:       SWI6  & 128 & \03335.05 & 1.25 & 26.45 & -32.95 & -19.93 \\ 
316:       TEC1  & 57 & \0\0529.79 & 1.20 &  --- & \m--- & -15.84 \\ 
317:       TYE7  & 20 & \0\0486.13 & 1.09 & 19.15 & -20.86 & -18.00 \\ 
318:       UME1  & 2 & \03037.98 & 1.04 & 11.80 & -12.27 & \0-7.29 \\ 
319:       UME6  & 63 & \0\0216.63 & 1.21 & 24.83 & -30.04 & -26.54 \\ 
320:       XBP1  & 2 & \0\0194.71 & 1.00 & 19.74 & -19.74 & \0-5.83 \\ 
321:       YAP1  & 11 & \01616.86 & 1.06 & 18.47 & -19.66 & -13.73 \\ 
322:       YAP6  & 3 & \01350.09 & 1.00 & 19.51 & -19.51 & \0-6.87 \\ 
323:       YAP7  & 48 & \01694.68 & 1.09 &  --- & \m--- & -20.14 \\ 
324:       YOX1  & 4 & \0\0861.14 & 1.10 & 17.42 & -19.11 & -10.67 \\ \br
325:     \end{tabular}
326:   \end{center}
327:   \caption{Binding parameters for a set of 63 TFs of the yeast {\it S. cerevisiae}: number of binding sites used in the analysis ({\bf a}), experimentally measured protein abundance ({\bf b}), maximal ratio of binding energy to chemical potential (cf. equation~4 in Methods) ({\bf c}) and, in units of $k_{\rm B}T$, the estimate of the chemical potential ({\bf d}) and the minimal (consensus) binding energies stemming from the QPMEME ({\bf e}) and BvH ({\bf f}) matrices, respectively.}
328:   \label{tab:tf_info}
329: \end{table}
330: 
331: Figure~\ref{fig:4} reports the behavior of the background energy $F_b$,
332: previously defined as the offset of the chemical potential $\mu$ (its
333: value at unit copy number $n=1$).  The result is that the maximum
334: programmability relation $F_b\simeq E^*$ proposed in \cite{hwa} is
335: indeed peculiar to the three coliphage and the bacterial TFs which
336: were considered. A different behavior is clearly observed in the yeast
337: {\it S. cerevisiae}. The background energy $F_b$ is not comparable to
338: the consensus binding energy $E^*$, but is generally smaller and the
339: difference is correlated with the experimentally observed abundance
340: $n_{\rm obs}$, as can be seen in figure~\ref{fig:4}. In other words, the
341: experimental observations are more in agreement with the behavior $E^*
342: - F_{\rm b} \propto \log n_{\rm obs}$ than the maximum programmability
343: relation $E^* - F_{\rm b} \approx 0$.  Note that this holds
344: irrespective of the method (maximum likelihood or QPMEME) used to
345: estimate the discrimination matrices.
346: 
347: %%%%%%%%%%%%%%%%%%%%%%
348: % FIGURE 3
349: %%%%%%%%%%%%%%%%%%%%%%
350: \begin{figure}[htp]
351:   \begin{center}
352:     \includegraphics[width=8cm]{figure3.pdf}
353:   \end{center}
354:   \caption{Average discrimination energy {\it vs} length
355:     of binding sites.  Reported values refer to energy
356:     matrices computed using the QPMEME method, as described
357:     in the Methods.}
358:   \label{fig:3}
359: \end{figure}
360: 
361: As a further test, we compared the experimentally measured TF
362: abundances with the number of binding sites found in
363: SGD~\cite{SGD-homepage} and as reported by Lee~{\it et
364:   al.}~\cite{Lee}.  For the latter, we counted all sites of
365: protein-DNA-interaction with associated $p$-values $<1\cdot10^{-3}$
366: (L1) and $<5\cdot10^{-3}$ (L5).  The {\it rationale} of this analysis
367: is as follows.  If the maximum programmability ansatz $F_{\rm b} - E^*
368: \approx 0$ were satisfied, we should expect that TF abundances are the
369: main leverage in the control of the number of binding sites. This is
370: the heuristic advantage provided by maximum programmability \cite{hwa}
371: and a strong dependence of the number of binding sites on the TF
372: abundance should then be present.  No such behavior is expected for
373: the alternative hypothesis $E^* -F_{\rm b}\propto \log n_{\rm obs}$: a
374: sizable fraction of the TF copies are weakly attached to the DNA, yet
375: the sites are sufficiently numerous to compete with high-specificity
376: sites.  A straightforward regression analysis gives coefficients of
377: determination $R^2$ close to zero, \textit{viz.} $0.0440$ for the SGD set
378: and $0.0513$ and $0.0900$ for the L1 and L5 sets, respectively. Even
379: though the $p$-values for the three sets show some statistical
380: significance ($0.07$, $0.04$ and $0.006$, respectively), the low values of
381: $R^2$ indicate that the fraction of the variance explained by the
382: regression is small. To summarize, the correlation between the number
383: of binding sites and abundance is weakly significant (as should be
384: expected), but the weakness of the dependence confirms the previous
385: conclusions.
386: 
387: \section{Discussion}
388: \label{s:discussion}
389: 
390: The integration of binding data provided by chromatin
391: immunoprecipitation experiments \cite{SGD-homepage,Lee} and abundance
392: data from~\cite{tf_amount} allowed us to extract information on
393: the relation between binding affinities and abundances of TFs in
394: the log-growth phase of the budding yeast {\it S. cerevisiae}.  The
395: availability of experimental data for other conditions would enable a
396: wider perspective, yet two main points have already emerged here and
397: are worth being discussed in their biological consequences and
398: significance.
399: 
400: \medskip A first technical point is that, while bioinformatic tools to
401: infer binding free energies generally only give these up to a scale
402: factor, we have shown that combining the recent method QPMEME
403: \cite{Marko03} and abundance data can provide an estimate of that
404: factor. This may be of general methodological interest and useful for
405: future applications.
406: 
407: For the budding yeast problem considered here, the scale factor could
408: be estimated for 41 transcription factors out of 63. For the remaining
409: 22 TFs ``individual specificity'' is not ensured by the observed
410: affinities and abundances, i.e. the binding sites are bound even
411: though their energy is larger than the chemical potential. This
412: prevents using QPMEME, since the method works in the strong binding
413: regime and supposes that all binding sites have energy below the
414: chemical potential. Biologically, having binding sites occupied
415: despite their energies being above the chemical potential does not
416: pose any contradiction, since additional effects such as other factors
417: and/or regulations of the chromosomal structure might crucially
418: contribute to specificity.  Indeed, ChIP data (see figure~2 in
419: \cite{Lee}) clearly indicate that many genes of {\it S. cerevisiae}
420: are regulated by multiple TFs. Furthermore, global chromatin
421: remodeling effects will reduce the effective size of the genome which
422: is accessible to TFs and increase specificity.  Finally, in eukaryotes
423: it is well known that combinatorial regulation is widespread
424: \cite{Davidson01} and its mode of action hinges on strong cooperative
425: effects among the TFs.  The corresponding loci are often structured so
426: as to require the synergistic action of various TFs and to remain
427: unbound and inactive if only one of them is present.  Results of our
428: analysis are in quantitative agreement with this picture.
429: 
430: \medskip The second and main result of our work is that experimentally
431: observed abundances are marginally sufficient to ensure strong and
432: persistent binding of {\it S. cerevisiae} TFs to DNA sites. This is
433: quantified and supported by the results presented in figure~\ref{fig:4}.  More
434: technically, the background free energy $F_{\rm b}$ was found to be
435: negative and proportional to $\log n_{\rm obs}$, where $n_{\rm obs}$
436: is the abundance experimentally measured in
437: \cite{tf_amount}. Consequently, the chemical potential $\mu$ remains
438: below the minimal consensus energy $E^*$ if $n\ll n_{\rm obs}$.  This
439: implies that a sizable part of the TF copies are ``lost in the
440: background'' and that the \textit{in vivo} observed binding sites are
441: only occupied with low probability if the abundance is significantly
442: lower than $n_{\rm obs}$.
443: 
444: What might superficially appear as a waste in fact ensures an
445: effective noise-filtering procedure. Fluctuations in the copy number
446: of proteins are unavoidable in the molecular world and have been
447: experimentally demonstrated in various cases (see, e.g.,
448: \cite{Elowitz-Science-2002}).  A few spurious copies of TFs might be
449: present in the cell due to a variety of mechanisms, ranging from delayed
450: degradation to leaks, lack of tight regulatory control and
451: fluctuations in the expression rates.  In an \textit{E. coli} system,
452: it has recently been shown that extrinsic effects, over and above
453: cell-cycle dependent changes in gene copy number, acting \textit{e.g.}
454: through different concentrations of metabolites, ribosomes and
455: polymerases, may amount to 35\% fluctuations in gene expression
456: levels, and may persist over a cell cycle~\cite{Elowitz-Science-2005}.
457: Intrinsic fluctuations, while persisting for shorter times, are also
458: significant, at the 20\% level~\cite{Elowitz-Science-2005}.  The
459: relation $E^* - F_{\rm b}\propto \log n_{\rm obs}$ between the
460: background affinity energy and the abundance of the transcription
461: factors shown in figure~\ref{fig:4} ensures an effective way to filter out those
462: fluctuations and control mis-regulations.
463:     
464: 
465: %%%%%%%%%%%%%%%%%%%%%%
466: \section{Conclusions}
467: In conclusion, our results point at the importance of quantitative
468: effects of abundances in the regulatory dynamics of the cell. In
469: particular, the abundance-affinity relationship $E^*-F_{\rm b}\propto
470: \log n_{\rm obs}$ demonstrated here is a powerful control lever to
471: ensure global coherent responses of the cellular regulatory networks
472: despite the noisy nature of their individual molecular components.
473: 
474: 
475:   
476: %%%%%%%%%%%%%%%%%%
477: \section{Methods}
478: 
479: Let us consider a TF that diffuses in a cell containing a genomic
480: sequence of length $L$. The partition function of specific and
481: non-specific binding to DNA is
482: \begin{equation}
483:  Z_b = \sum_{j=1}^{L} e^{-\beta E(S_j)}
484: + L e^{-\beta E_{\rm ns}}\,,
485: \label{eq:Z-def}
486: \end{equation}
487: where $\beta$ is the inverse temperature in units of the Boltzmann
488: constant $k_{\rm B}$ and $S_j$ is the subsequence of length $l$
489: starting at position $j$ in the genomic sequence.  In~\eref{eq:Z-def}
490: we have omitted the contribution from TF copies freely diffusing in the
491: cytoplasm, assuming their number to be much smaller than the number of
492: copies bound to DNA.  $E_{\rm ns}$ denotes the energy of the state where the
493: TF is bound non-specifically to the DNA \cite{Berg3,Marko,Mirny}. The
494: partition function~\eref{eq:Z-def} defines the effective
495: background (free) energy $F_{\rm b}$ as:
496: \begin{eqnarray}
497: \label{eq:binding-probability-2} 
498: F_{\rm b} = -\beta^{-1}\log Z_b\,.
499: \end{eqnarray}
500: 
501: A commonly employed expression for the binding energies $E(S)$ is the
502: additive \textit{energy matrix} form
503: \cite{Stormo82,Staden84,Stormo00}:
504: \begin{eqnarray}
505: \label{eq:additivity}
506: E(S) = \sum_{i=1}^l \sum_{\alpha=1}^4 \varepsilon_{i,\alpha} S_{i,\alpha}\,.
507: \end{eqnarray}
508: Here, the indicator vector $S_{i,\alpha}$ has entries zero or one
509: depending on which nucleotide $\alpha$ stands at position $i$ in the
510: sequence $S$, $\varepsilon_{i,\alpha}$ is the free energy contribution
511: of nucleotide $\alpha$ at $i$ and $l$ is the length of the binding
512: domain. Even though exceptions are known \cite{Bulyk}, the linear form~\eref{eq:additivity} generally gives a good approximation of the
513: energy profile \cite{StormoFields98}.
514: 
515: Expression~\eref{eq:binding-probability-2} of the background energy
516: $F_{\rm b}$ may be approximated by an average over a random
517: ensemble (background).  The approximation is justified in \cite{hwa}
518: by a mapping to the Random Energy Model~\cite{Derrida}. As for the
519: choice of the random ensemble, the simplest background model features
520: independent nucleotides generated with the average genomic frequencies
521: $p_{\alpha} (\alpha=A,C,G,T)$, yielding:
522: \begin{eqnarray}
523: \label{eq:background}\sum_{j} e^{-\beta E(S_j)}\simeq L \langle
524: e^{-\beta E} \rangle & \equiv & L \prod_{i=1}^{l} \left[\sum_{\alpha}
525: p_{\alpha} e^{-\beta \varepsilon_{i,\alpha}} \right]\,.
526: \end{eqnarray}
527: It follows that 
528: \begin{eqnarray}
529: \label{eq:F_b_interm}
530: F_{\rm b} \simeq -\beta^{-1}\log \left\{L\int dE~\rho(E)\,e^{-\beta E} +
531: L\,e^{-\beta E_{\rm ns}}\right\}\,,
532: \end{eqnarray}
533: where $\rho(E)=\langle\delta\left(E-\sum_{i,\alpha}
534: \varepsilon_{i,\alpha} S_{i,\alpha}\right)\rangle$ is the density of
535: states for the random ensemble.  The background density $\rho(E)$ can
536: be computed by a saddle point expansion, where the first term is
537: Gaussian \cite{Marko03}.  Figure~\ref{fig:5} compares the empirical energy
538: density (obtained by the histogram of the energies measured over the
539: whole genome) with the Gaussian and the first correction. While the
540: former alone would not be appropriate (the empirical curve is not
541: symmetric), the correspondence with the latter is quite fair. For a
542: few TFs the match is poorer, mainly because discretization effects
543: are more pronounced.
544: 
545: For a TF present with $n$ copies in the cell, the probability that a
546: sequence $S_i$ be bound by the TF takes the Fermi-Dirac form (see,
547: e.g.,~\cite{hwa,Bintu2005a,Bintu2005b} for more details):
548: \begin{eqnarray}
549: \label{eq:binding_n} \mathcal P(S_i) & = & \frac{1}{1 + e^{\beta
550: (E(S_i) - \mu)}}\,,  
551: \end{eqnarray}
552: with the chemical potential $\mu$ implicitly defined by
553: \begin{eqnarray} 
554: \label{eq:mu_relation_1} 
555: n & = & L \int dE~\left[\rho(E) +
556: \delta(E-E_{\rm ns})\right] \frac{1}{1 + e^{\beta (E-\mu)}}\,.
557: \end{eqnarray} 
558: Equation~\eref{eq:mu_relation_1} simply states that the sum over all
559: the binding sites, weighted by the probability that a TF is bound
560: there, equals the copy number of the TF in the system.
561: 
562: \subsection{Inference of binding properties}
563: \label{s:inference}
564: 
565: %%%%%%%%%%%%
566: 
567: A list of binding sites for a wide set of TFs of {\it S. cerevisiae}
568: was downloaded from the SGD database \cite{SGD-homepage}. The binding
569: sites were extracted from the intergenic regions identified by
570: chromatin immunoprecipitation experimental data \cite{Lee} as detailed
571: in \cite{Harbison}. We retained those TFs for which at least two
572: binding sites and their abundance were available and processed them as
573: detailed hereafter.
574: 
575: A proxy of the binding properties of the TFs is provided by the
576: log-odds ratios based on the classical work \cite{BergvonHippel87}:
577: \begin{eqnarray} 
578: \label{eq:BvH} 
579: \Delta\varepsilon_{i,\alpha} = \frac{1}{\lambda}\log\frac{1+n^*_i}
580: {1+n_{i,\alpha}}\,,
581: \end{eqnarray} 
582: where $n_{i,\alpha}$ is the number of observations of nucleotide
583: $\alpha$ at the $i$-th position in the binding site and $n^*_i$ is the
584: number of observations of the most frequently observed nucleotide in
585: that position. Here, $\lambda$ is an unknown scale factor in units of
586: $k_{\rm B}T$.
587: 
588: The discrimination energy of a sequence $S$ is defined as the
589: difference between $E(S)$ and the consensus energy and is hence
590: given by the sum $\sum_{i,\alpha}\Delta\varepsilon_{i,\alpha}S_{i,\alpha}$ of the entries~\eref{eq:BvH}.
591: The scale factor $\lambda$ must be determined from at least one
592: experimentally measured affinity. In the absence of experimental data,
593: we have set it to unity (in units of $k_{\rm B} T$), which is a fair
594: average of the values found for a number of prokaryotic examples
595: in~\cite{BergvonHippel87}, and agrees with bioinformatic
596: practice~\cite{StormoFields98}.
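
As a purely illustrative example of~\eref{eq:BvH} (the counts below are
hypothetical and not taken from the data set), suppose that at some
position $i$ the most frequent nucleotide is observed $n^*_i=9$ times
and another nucleotide $\alpha$ is observed $n_{i,\alpha}=4$ times.
Taking $\lambda=1$ and the natural logarithm, one would obtain
\[
\Delta\varepsilon_{i,\alpha} = \log\frac{1+9}{1+4} = \log 2 \simeq 0.69
\]
in units of $k_{\rm B}T$, whereas a nucleotide never observed at that
position ($n_{i,\alpha}=0$) would be assigned $\log 10 \simeq 2.3$. The
pseudocount of one in~\eref{eq:BvH} thus keeps the penalty for
unobserved nucleotides finite.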
597: 
598: As a second proxy we have used the recently introduced QPMEME
599: method~\cite{Marko03}. Like the log-odds ratios, it does not give access to the binding
600: energies as such, but only to the ratio of binding energies to a chemical
601: potential, shifted by the mean free energy of binding of the
602: corresponding TF:
603: \begin{eqnarray} 
604: \label{eq:ratio} 
605: r \equiv \frac{E-\langle E\rangle}{|\mu-\langle E\rangle|}\equiv
606: \frac{\hat{E}}{|\hat{\mu}|}
607: =\sum_{i,\alpha}\hat{\varepsilon}_{i,\alpha}S_{i,\alpha}\,,
608: \end{eqnarray}
609: where $\langle\bullet\rangle$ denotes the average over the random
610: background ensemble defined as before.  The calculation of the matrix
611: $\hat{\varepsilon}_{i,\alpha}$ boils down to a convex optimization
612: problem, where the width of the background probability distribution is
613: minimized under the constraints that all sequences in the training set
614: be bound.  Note that neither the average energy $\langle
615: E\rangle\equiv \sum_{i,\alpha} p_{\alpha}\varepsilon_{i,\alpha}$ nor
616: the chemical potential $\mu$ is determined by QPMEME. {\em
617:   Differences} between pairs of energies, \textit{e.g.} discrimination
618: energies, are determined up to the {\em scale factor} $|\hat\mu|$.
619: 
620: %%%%%%%%%%%%%%%%%%%%%%
621: % FIGURE 4
622: %%%%%%%%%%%%%%%%%%%%%%
623: \begin{figure}[htp]
624:   \begin{center}
625:     \includegraphics[width=7.6cm]{figure4-a.pdf}
626:     \includegraphics[width=7.6cm]{figure4-b.pdf}
627:   \end{center}
628:   \caption{Comparison of the relation between the background energy $F_{\rm b}$ and
629:     the abundance for a set of {\it S. cerevisiae} transcription
630:     factors. Values of the difference between the consensus energy $E^*$
631:     and the background energy $F_{\rm b}$ are reported as squares. Their
632:     values shifted by the logarithm of the TF abundance (as measured
633:     experimentally) are reported as circles.  Vertical dashed lines
634:     correspond to the average values for the two sets of points.  Points
635:     have a sizeable scatter but circles are clearly centered around
636:     zero.  No relation has been found between the deviation of the
637:     points around zero and the functional role of the corresponding
638:     TFs. Long panels: results for log-odds ratio matrices; short panels:
639:     results for QPMEME matrices.  Histograms give better visual access
640:     to the distribution widths.}
641:   \label{fig:4}
642: \end{figure}
643: 
644: The energy matrices $\Delta\varepsilon_{i,\alpha}$ and
645: $\hat{\varepsilon}_{i,\alpha}$ have finite sample errors, which could
646: in principle be estimated as in~\cite{BergvonHippel87}.  Assuming the
647: sample to be non-biased, these errors decrease with the number of
648: known binding sites $N_{\rm BS}$ as $1/\sqrt{N_{\rm BS}}$. A
649: comparison with table~1 reveals that this error is at least on the
650: order of 10\% (for those TFs for which about a hundred binding sites
651: are known), ranging up to 50\% (for those with only a few binding
652: sites known). The chemical potential is determined by the reduced
653: energy matrix and the observed abundance $n_{\rm obs}$, which also has
654: experimental errors and is likely to fluctuate {\it in vivo}.  The
655: error in the estimated chemical potential is therefore
656: at best on the order of 10\%. This should nevertheless be
657: sufficient to elucidate statistical trends, which is our purpose here.
658: 
659: The probabilities $q_{i,\alpha}$ appearing in the Results denote the
660: probabilities that nucleotide $\alpha$ is found at position $i$ in the
661: TF-DNA complex. They are computed from the energy matrices
662: $\Delta\varepsilon_{i,\alpha}$ as
663: \begin{equation}
664: q_{i,\alpha} = \frac{e^{-\beta\Delta \varepsilon_{i,\alpha}}}
665: {\sum_{\alpha'} e^{-\beta\Delta \varepsilon_{i,\alpha'}}}\,.
666: \end{equation}
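
For the log-odds matrices~\eref{eq:BvH} this expression reduces to
regularized nucleotide frequencies. As a sketch, assume $\lambda=1$, so
that $\beta\Delta\varepsilon_{i,\alpha}=\log[(1+n^*_i)/(1+n_{i,\alpha})]$,
and assume that each of the $N_{\rm BS}$ known sites contributes exactly
one nucleotide at position $i$, so that $\sum_{\alpha}n_{i,\alpha}=N_{\rm
BS}$. The factor $1+n^*_i$ then cancels between numerator and
denominator, leaving
\[
q_{i,\alpha} = \frac{1+n_{i,\alpha}}{\sum_{\alpha'}\left(1+n_{i,\alpha'}\right)}
= \frac{1+n_{i,\alpha}}{4+N_{\rm BS}}\,.
\]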
667: 
668: \subsection{Computing the background free energy}
669: \label{p:fbcomp}
670: 
671: Definition (\ref{eq:binding-probability-2}) involves two terms: one
672: describing binding to the genomic background and the other
673: non-specific electrostatic interactions with the DNA. The latter is
674: crucial to the target search \cite{Berg3}. As shown in \cite{hwa}, the
675: background contribution cannot be larger than the non-specific part:
676: the TF would otherwise diffuse in the background random medium and get
677: slowed down by its local minima.  In fact, the two contributions are
678: expected to be comparable.  The division in background and functional
679: binding sites is indeed dynamical and the former provides the
680: evolutionary reservoir for the latter. Therefore, evolvability of the
681: regulatory network suggests that the background energy will tend to be
682: as low as is compatible with the aforementioned specificity and kinetic
683: constraints (see \cite{Lassig1,Lassig2} about evolvability).
684: 
685: %%%%%%%%%%%%%%%%%%%%%%
686: % FIGURE 5
687: %%%%%%%%%%%%%%%%%%%%%%
688: \begin{figure}[htp]
689:   \begin{center}
690:     \includegraphics{figure5.pdf}
691:   \end{center}
692:   \caption{The density of states for the TF ABF1. Dashed in black,
693:     the curve obtained for a random background. In red, the empirical
694:     curve found by computing the distribution of energies over the
695:     genome. The energy scale has been chosen so as to have the chemical
696:     potential $\mu=-1$.}
697:   \label{fig:5}
698: \end{figure}
699: 
700: Our estimate for the background energy $F_{\rm b}$ in (\ref{eq:F_b_interm})
701: is then:
702: \begin{eqnarray} 
703: \label{eq:F_b_approx} 
704: \beta\left(E^* - F_{\rm b}\right) = \log\left[ 2L \int dr ~
705: \rho(r)
706: e^{-\left(\beta|\hat{\mu}|\right)\left(r-r^*\right)}\right]\,.
707: \end{eqnarray} Here, $\rho(r)$ is the background
708: density of states for the energy matrix $\hat{\varepsilon}_{i,\alpha}$
709: obtained by QPMEME and $r^*$ is the minimal value of the ratio
710: (\ref{eq:ratio}), attained at the energy $E^*$ of the consensus
711: sequence(s) $S^*$. The shift to $E^*$ in (\ref{eq:F_b_approx}) is
712: introduced just to facilitate comparison with the results in figure~\ref{fig:4}.
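
The steps leading from (\ref{eq:F_b_interm}) to (\ref{eq:F_b_approx})
can be sketched as follows. The sketch assumes, as suggested but not
derived by the argument above, that the non-specific term equals the
background term, $e^{-\beta E_{\rm ns}}\simeq\int dE~\rho(E)\,e^{-\beta
E}$. Using $E-E^*=|\hat{\mu}|\,(r-r^*)$, which follows from
(\ref{eq:ratio}), and $\rho(r)\,dr=\rho(E)\,dE$, one has
\begin{eqnarray}
-\beta F_{\rm b} & \simeq & \log\left[2L\int dE~\rho(E)\,e^{-\beta E}\right]\,,
\nonumber \\
\beta\left(E^*-F_{\rm b}\right) & \simeq & \log\left[2L\int dE~\rho(E)\,
e^{-\beta\left(E-E^*\right)}\right]
= \log\left[2L\int dr~\rho(r)\,
e^{-\left(\beta|\hat{\mu}|\right)\left(r-r^*\right)}\right]\,.
\nonumber
\end{eqnarray}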
713: 
714: The quantity $\beta|\hat{\mu}|$ is not determined by the QPMEME method
715: proper. We estimate it using the relation (\ref{eq:mu_relation_1}),
716: the fact that in QPMEME binding energies are only determined up to the
717: relative chemical potential, and the additional information on the TF
718: abundance $n_{\rm obs}$ from \cite{tf_amount}. Using the previous
719: arguments on background and non-specific contributions, we get:
720: \begin{eqnarray} 
721: n_{\rm obs}=2L \int dr ~
722: \rho(r) ~ \frac{1}{1 + e^{\beta|\hat{\mu}|
723: (r+1)}}\,,
724: \label{eq:mu_relation_2}
725: \end{eqnarray} 
726: whence $\beta|\hat{\mu}|$ is extracted and inserted back into
727: (\ref{eq:F_b_approx}) to obtain the value of the background effective
728: (free) energy $F_{\rm b}$.  As previously discussed,
729: (\ref{eq:mu_relation_2}) only has a solution for 41 cases out of 63
730: TFs. It is instructive to compare with (\ref{eq:mu_relation_1}), which
731: has a solution for every TF.  The chemical potential $\mu$ then simply
732: acts as a cut-off, so that sites with energies lower than $\mu$ are
733: mostly bound, while sites with higher energies are not, and the total
734: number of bound TFs equals $n_{\rm obs}$.  Depending on $n_{\rm obs}$,
735: the mostly unbound sites may or may not include \textit{in vivo}
736: observed binding sites, \textit{i.e.}, part of the set of sites from
737: which the maximum likelihood energy matrices have been constructed.
738: In (\ref{eq:mu_relation_2}), on the other hand, all the \textit{in
739:   vivo} binding sites must necessarily have binding energy below the
740: chemical potential, because these are the constraints under which the
741: QPMEME reduced energy matrix $\hat\varepsilon$ is determined. Hence,
742: all sites for which the QPMEME reduced energy is below the
743: threshold $-1$ will be at least half-filled. Each of these is actually
744: present in the genome with some probability, which leads to a total
745: expected number of at least half-filled sites.  Therefore,
746: (\ref{eq:mu_relation_2}) cannot be solved if $n_{\rm obs}$ is low
747: enough, because the right-hand side has a lower bound. This happens in
748: about one third of the cases at hand.
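
The lower bound can be made explicit with a simple estimate, given here
as a sufficient condition rather than the sharpest possible one. For
every $r\le -1$ and every $\beta|\hat{\mu}|>0$ the Fermi-Dirac factor in
(\ref{eq:mu_relation_2}) is at least $1/2$, and it is positive
elsewhere, so that
\[
2L\int dr~\rho(r)\,\frac{1}{1+e^{\beta|\hat{\mu}|(r+1)}}
\;\ge\; L\int_{-\infty}^{-1} dr~\rho(r)\,.
\]
Hence no value of $\beta|\hat{\mu}|$ can satisfy
(\ref{eq:mu_relation_2}) when $n_{\rm obs}$ is smaller than the
right-hand side, which is the expected number of positions with $r\le
-1$ among $L$ sequences drawn from the random ensemble.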
749: 
750: \subsection{Maximal programmability}
751: \label{s:maximals}
752: In the simplest scenario where the major contribution in
753: \eref{eq:mu_relation_1} stems from energies where the Fermi-Dirac
754: weight can be approximated by the Boltzmann factor, one can invert
755: \eref{eq:mu_relation_1} to obtain
756: \begin{eqnarray}
757: \label{eq:mu_relation_3} \mu \simeq \beta^{-1}\log n + F_{\rm b}\,.
758: \end{eqnarray}
759: The occupation probability of a site $t$ then reads $P_t =
760: \frac{1}{1+{\tilde n}_t/n}$, where the threshold concentration
761: ${\tilde n}_t$ is $e^{\beta(E_t - F_{\rm b})}$.  The copy
762: number required for strong binding (to the consensus) must then be at
763: least $e^{\beta(E^* - F_{\rm b})}$.
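
The intermediate steps can be sketched as follows, assuming that the
Boltzmann approximation holds for all energies that contribute
appreciably, i.e.\ $E-\mu\gg\beta^{-1}$. Replacing the Fermi-Dirac weight
in \eref{eq:mu_relation_1} by $e^{-\beta(E-\mu)}$ and using
\eref{eq:F_b_interm} gives
\begin{eqnarray}
n & \simeq & e^{\beta\mu}\,L\left[\int dE~\rho(E)\,e^{-\beta E}
+e^{-\beta E_{\rm ns}}\right] \simeq e^{\beta(\mu-F_{\rm b})}\,,
\nonumber \\
P_t & = & \frac{1}{1+e^{\beta(E_t-\mu)}}
\simeq \frac{1}{1+e^{\beta(E_t-F_{\rm b})}/n}\,.
\nonumber
\end{eqnarray}
Taking the logarithm of the first line yields \eref{eq:mu_relation_3},
and the second line identifies ${\tilde n}_t=e^{\beta(E_t-F_{\rm b})}$.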
764: 
765: Maximal programmability \cite{hwa} amounts to positing the lowest
766: (unity) threshold. The approximate equality $F_{\rm b} \approx E^*$
767: should then hold. One consequence, which motivates the term, is that
768: the consensus sequence is then half-bound if there is just a single
769: copy of the TF present in the cell.  Different regulatory elements can
770: then have their thresholds set, or programmed, from one (if their sequence
771: is the consensus sequence) upwards, independently of any feedback
772: induced by the actual TF copy number.
773: 
774: 
775: %%%%%%%%%%%%%%%%%%%%%%%%%%%
776: \section*{Acknowledgements}
777: This work was supported by the Swedish Research Council through contract
778: 2003-4614 (E.A., C.M. and A.F.d'H.).
779: 
780: \section*{References}
781: \bibliographystyle{unsrt}
782: \bibliography{pb-afmv-references}
783: 
784: \end{document}
785: