q-bio0605008/affy2.tex
1: \documentclass[aps,pre,twocolumn,floatfix,showpacs]{revtex4}
2: \usepackage{graphicx}
3: \usepackage{bm}
4: \bibstyle{apsrev.bst}
5: 
6: \newcommand{\ave}[1]{\langle #1\rangle}
7: 
8: \begin{document}
9: \title{Physics-based analysis of Affymetrix microarray data}
10: \author{T. Heim}
11: \affiliation{Interdisciplinary Research Institute c/o IEMN, Cit\'e
12: Scientifique BP 60069, F-59652 Villeneuve d'Ascq, France}
13: \author{L.-C. Tranchevent}
14: \affiliation{Interdisciplinary Research Institute c/o IEMN, Cit\'e
15: Scientifique BP 60069, F-59652 Villeneuve d'Ascq, France}
16: \author{E. Carlon}
17: \affiliation{Interdisciplinary Research Institute c/o IEMN, Cit\'e
18: Scientifique BP 60069, F-59652 Villeneuve d'Ascq, France}
19: \affiliation{Ecole Polytechnique Universitaire de Lille, Cit\'e
20: Scientifique, F-59655 Villeneuve d'Ascq, France}
21: \author{G. T. Barkema}
22: \affiliation{Institute for Theoretical Physics, University of Utrecht,
23: Leuvenlaan 4, 3584 CE Utrecht}
24: \date{\today}
25: 
26: \begin{abstract}
27: We analyze publicly available data on Affymetrix microarrays spike-in
28: experiments on the human HGU133 chipset in which sequences are added in
29: solution at known concentrations.  The spike-in set contains sequences
30: of bacterial, human and artificial origin.  Our analysis is based on a
31: recently introduced molecular-based model [E. Carlon and T. Heim, Physica
32: A {\bf 362}, 433 (2006)] which takes into account both probe-target
33: hybridization and target-target partial hybridization in solution.
34: The hybridization free energies are obtained from the nearest-neighbor
35: model with experimentally determined parameters.  The molecular-based
36: model suggests a rescaling that should result in a ``collapse'' of the
37: data at different concentrations into a single universal curve.  We indeed
38: find such a collapse, with the same parameters as obtained before for the
39: older HGU95 chip set.  The quality of the collapse varies according to
40: the probe set considered.  Artificial sequences, chosen by Affymetrix
41: to be as different as possible from any other human genome sequence,
42: generally show a much better collapse and thus a better agreement with
43: the model than all other sequences. This suggests that the observed
44: deviations from the predicted collapse are related to the choice of
45: probes or have a biological origin, rather than being a problem with
46: the proposed model.  \end{abstract}
47: 
48: \pacs{87.15.-v,82.39.Pj}
49: 
50: \maketitle
51: 
52: \newcommand{\ul}{\underline}
53: \newcommand{\bc}{\begin{center}}
54: \newcommand{\ec}{\end{center}}
55: \newcommand{\be}{\begin{equation}}
56: \newcommand{\ee}{\end{equation}}
57: \newcommand{\ba}{\begin{array}}
58: \newcommand{\ea}{\end{array}}
59: \newcommand{\beqn}{\begin{eqnarray}}
60: \newcommand{\eeqn}{\end{eqnarray}}
61: 
62: \section{Introduction}
63: \label{sec:intro}
64: 
65: DNA microarrays \cite{sche95} allow one to measure the expression levels
66: of thousands of genes simultaneously. This is a major step forward
67: compared to traditional methods in molecular biology (such as Northern
68: blots), which are applicable only to a limited set of genes at a time.
69: The determination of gene expression levels is not the only application
70: of DNA microarrays, which have also been used for the analysis of
71: genetic variance between individuals (single nucleotide polymorphisms),
72: as efficient tools for DNA sequencing, for the study of chromosomal
73: defects and for the determination of alternative splicing events.
74: 
75: Despite the increasing popularity of microarrays in
76: recent years, there are still some problems with the technology. There
77: has been, for instance, only a moderate effort in comparing different
78: microarray platforms on the same biological system \cite{mars04}. When
79: this comparison was made, as in a recent study on expression analysis of
80: stressed-out pancreas cells, it was found that different commercial
81: platforms produced wildly incompatible data \cite{tan03_sh}.  These
82: problems call for a better fundamental understanding of the functioning of
83: the microarrays. Such understanding will help researchers to design better
84: algorithms for microarray data analysis based on the physical-chemistry
85: of the underlying hybridization process.
86: 
87: In the past years several experiments have addressed
88: the equilibrium and dynamical properties of DNA hybridization to
89: probes anchored on solid surfaces, using different techniques such as
90: surface plasmon resonance \cite{pete02} and the quartz
91: microbalance \cite{okah98_sh}.  At the same time several papers
92: \cite{vain02,held03,naef03,haga04,halp04,bind05,carl06} have been dedicated
93: to theoretical aspects of hybridization, mostly discussing the Langmuir
94: model and variants thereof.
95: 
96: In a previous paper \cite{carl06} we have analyzed a series of publicly
97: available data of experiments performed on Affymetrix microarrays, using a
98: simple model of the hybridization process. In these experiments a set of
99: selected genes are ``spiked-in'' at fixed concentrations into a solution
100: containing other types of RNAs. This set of data has been widely used
101: as a test ground for algorithms designed to extract gene expression levels
102: from the raw data. Affymetrix is one of the major commercial producers of
103: microarrays. In Affymetrix arrays the surface-bound probes are prepared in
104: situ by photolithographic techniques.  Although the technique is limited
105: to rather short oligos (25 nucleotides long), one of its advantages is
106: that a high density of probe sequences per array can be obtained. In the
107: latest generation 1,400,000 different probes have been placed on a single
108: array. The large number of probes compensates for their limited length.
109: Indeed Affymetrix uses multiple probes per gene, which define a probe set.
110: Another special feature of Affymetrix chips is that they use as a control a
111: mismatch (MM) probe sequence, which differs from a perfect-matching (PM)
112: sequence only at the base at position 13: a nucleotide A is interchanged
113: with T and a nucleotide C is interchanged with G.
114: 
115: In our previous work \cite{carl06} we focused on the spike-in data set
116: of the HGU95 human chipset. More recently this has been replaced
117: by the HGU133 chipset. Probe sets have been completely redesigned in
118: the HGU133 chipset; moreover there are only 11 probes per probe set
119: compared to the 16 probes of the HGU95 array.  In this paper we focus
120: on the analysis of publicly available spike-in data on the HGU133 chip,
121: building on our previous work \cite{carl06} on HGU95. This will allow us
122: to test the robustness of the model introduced in Ref. \cite{carl06}
123: to a new set of data.  There is another interesting feature of the
124: spike-in data of the HGU133 chipset: unlike the HGU95 data,
125: where spikes correspond to human genes, the spikes in the HGU133 have
126: been selected among human, bacterial and ``artificial'' sequences.
127: The latter were selected by Affymetrix to avoid cross-hybridization with
128: any known human coding sequence.
129: 
130: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
131: \begin{figure}[t]
132: \includegraphics[width=8.5cm]{FIG01.eps}
133: \caption{(a) The simple model of hybridization in Affymetrix microarrays
134: used throughout this paper is defined by two basic reactions: 1)
135: Hybridization of target molecules ({\it t}) to surface-anchored
136: probes ({\it p}) leading to a duplex {\it pt} and 2) The hybridization
137: between target molecules in solution leading to the partial duplexes
138: $t {\hat t}_{i,j}$.  In the model, the effect of the hybridization in
139: solution amounts to a reduction of the original target concentration
140: $c$ to a value $\alpha c$.  (b) Partial hybridization of a fragment in
141: solution complementary to the target RNA sequence from base $i$ to base
142: $j$ ($1 \leq i < j \leq 25$).
143: }
144: \label{FIG00}
145: \end{figure}
146: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
147: 
148: \section{A simple model for hybridization in Affymetrix arrays}
149: \label{sec:model}
150: 
151: In this section we briefly recall the model introduced in
152: Ref. \cite{carl06}. Two basic processes are considered: 1) Target-probe
153: hybridization and 2) Target-target hybridization in solution.  According
154: to the model the fluorescence signal measured from a given probe is:
155: \be
156: I = I_0 + \frac{A \alpha c e^{\beta \Delta G}}{1 + \alpha c e^{\beta \Delta G}}
157: \label{fluorescence}
158: \ee
159: where $I_0$ indicates a background level due to non-specific
160: hybridization, $A$ sets the scale of intensities, $c$ is the target
161: concentration (a measure of the gene expression level), $\Delta G$
162: the target/probe hybridization free energy, $\beta = 1/RT$ the
163: inverse temperature, $R$ the universal gas constant. Here, $\alpha$
164: models the reduction in the concentration of available targets due to
165: the target-target hybridization in solution: only a fraction $\alpha
166: c$ is available for the hybridization with probes as the remaining
167: $(1-\alpha)c$ form stable duplexes with other partners in solution (see
168: Fig. \ref{FIG00}(a)).
169: 
170: In the model introduced in Ref. \cite{carl06}, we approximate the
171: target-target hybridization with the expression
172: \be
173: \alpha \approx \frac 1 
174: {1 + \tilde{c} \exp{\left( \beta' \Delta G_R^{(37)} \right)}}
175: \label{alpha}
176: \ee
177: with $\beta'$ and $\tilde{c}$ fitted parameters and $\Delta G_R^{(37)}
178: \equiv \Delta G_R (1,25)$ the (sequence dependent) RNA/RNA free energy
179: for duplex formation in solution at 37 degrees calculated over the whole
180: 25-mer length; in close approximation, the binding free energies at 37 and
181: 45 degrees (the actual experimental temperature) are almost identical,
182: apart from a small scaling factor, which is absorbed into the rescaled
183: temperature $\beta'$. In the next section, we will discuss the steps
184: leading to Eq. (\ref{alpha}) in more detail.
185: 
186: The model defined in Eqs. (\ref{fluorescence}) and (\ref{alpha}) contains
187: the four fitting parameters $A$, $\beta$, $\beta'$ and $\tilde{c}$ which
188: were fitted against the spike-in data of the Affymetrix array HGU95a in
189: Ref. \cite{carl06}.  The parameters $\beta'$, $\tilde{c}$ and $A$ will
190: be discussed in Sec. \ref{sec:hyb_sol} and Sec. \ref{sec:saturation}.
191: The parameter $\beta$ is the inverse temperature. Instead of fixing
192: it to the experimental value we have kept it as a fitting parameter
193: as explained in Ref. \cite{carl06}. The hybridization free energies
194: $\Delta G$ and $\Delta G_R$ are calculated from tabulated experimental
195: data for DNA/RNA \cite{sugi95_sh,sugi00} and RNA/RNA \cite{xia98_sh}
196: duplex formation in solution.
197: 
198: We note that we fit mismatches and perfect matches with the same model.
199: The difference between the two is that there is a different hybridization
200: free energy $\Delta G$: one expects a lower signal for mismatches compared
201: to perfect matches, due to weaker binding. This is not always the case;
202: as remarked in several studies for a substantial fraction of probes (30\%,
203: as reported in Ref. \cite{naef03}) one observes ``bright mismatches''
204: for which the mismatch intensity $I_{\rm MM}$ exceeds the intensity
205: $I_{\rm PM}$ of the perfect match. However, it has been observed
206: \cite{bind05} that bright MM come predominantly from probes with low
207: intensity, which suggests that bright mismatches are associated with
208: weak specific hybridization when the signal $I$ is dominated by $I_0$
209: in Eq. (\ref{fluorescence}).
210: 
211: In recent work \cite{heim06} we also compared the current model
212: with the approach based on position-dependent effective affinities as
213: for instance described in Refs. \cite{naef03,bind05}. The conclusion is
214: that the two approaches are fully consistent with each other, provided
215: that various effects are incorporated such as partial unzipping of
216: the probe-target complex, less than 100\% efficiency in the probe
217: growth during lithography, and entropic repulsion between the target
218: and the substrate.  These additional effects are the main factors
219: causing position-dependence (and thus allowing for a comparison with
220: position-dependent effective affinities); for a quantitative prediction
221: of the intensities, their combined effect can be well approximated by
222: a slight decrease of $\beta$ in Eq. (\ref{fluorescence}) and they are
223: therefore not included in the current study.
224: 
225: \section{On the hybridization in solution}
226: \label{sec:hyb_sol}
227: 
228: We now discuss the approximations leading to the form of $\alpha$.
229: We denote the concentration of free 25-mer targets in solution as $[t]$,
230: the concentration of free target strands that are complementary from
231: nucleotide $i$ up to and including nucleotide $j$ as $[\hat{t}_{i,j}]$, and
232: the concentration of duplexes between these two as $[t\,\hat{t}_{i,j}]$.
233: Chemical equilibrium (see Fig. \ref{FIG00}(b)) yields for the equilibrium
234: constant:
235: \be
236: K_{i,j} = \frac{[t][\hat{t}_{i,j}]} {[t\hat{t}_{i,j}]} = e^{-\beta \Delta G_R (i,j)},
237: \label{equilib}
238: \ee
239: where $\Delta G_R (i,j)$ is the RNA/RNA hybridization free energy for
240: target molecules in solution, which are complementary from nucleotide $i$ up to
241: and including $j$, and $\beta=1.59$ mol/kcal (corresponding to the experimental
242: temperature of 45 degrees).
243: For a given gene, the measure of the
244: gene expression level which one wants to determine is the total target
245: concentration $c$ given by
246: \be
247: c=[t] + \sum_{i,j} [t \hat{t}_{i,j}].
248: \label{conserv}
249: \ee
250: Solving Eqs.(\ref{conserv}) and (\ref{equilib}) we find
251: for the fraction of single stranded target in solution:
252: \be
253: \alpha_f = \frac{[t]}{c} = \frac{1}{1+\sum_{i,j} [\hat{t}_{i,j}] 
254: \exp (\beta \Delta G_R (i,j)) }.
255: \label{alpha_full}
256: \ee
257: Note that the summation in the denominator of Eq. (\ref{alpha_full})
258: was replaced in the approximate expression Eq. (\ref{alpha}) by
259: the single term $\tilde{c}\exp(\beta' \Delta G_R^{(37)})$, with
260: fitting parameters $\tilde{c}$ and $\beta'$.
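
The intermediate algebra can be sketched as follows, assuming (as
Eq.~(\ref{conserv}) implicitly does) that each target binds at most
one complementary fragment at a time.  Eq.~(\ref{equilib}) gives
$[t\hat{t}_{i,j}] = [t][\hat{t}_{i,j}] \exp (\beta \Delta G_R (i,j))$,
and inserting this into Eq.~(\ref{conserv}) yields
\[
c = [t] \left( 1 + \sum_{i,j} [\hat{t}_{i,j}]
\exp (\beta \Delta G_R (i,j)) \right),
\]
from which Eq.~(\ref{alpha_full}) follows upon dividing $[t]$ by $c$.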
261: 
262: Eq.~(\ref{alpha_full}) requires as input estimates of the
263: concentration $[\hat{t}_{i,j}]$ of complementary sequences with length
264: $l=j-i+1$, present in solution.  Assuming that all four nucleotides
265: are roughly equally abundant, and that there are no correlations along
266: the sequence, the abundance of short sequences with length $l$ will
267: decrease as $[\hat{t}_{i,j}] \sim 4^{-l}$.  This scaling breaks down
268: beyond some length $L$; assuming for the human transcriptome a total
269: length of $10^7$ nucleotides, a random sequence longer than 12 nucleotides is most
270: likely not present at all, since $4^{12} \gtrsim 10^7$. We therefore
271: take as our approximation
272: \be
273: [\hat{t}_{i,j}] = 
274: \left\{ 
275: \begin{array}{cc}
276: c_0\cdot 4^{-(j-i)}	& {\rm for \ j-i< 12,}\\
277: 0		        & {\rm otherwise.}
278: \end{array} 
279: \right.
280: \label{concdrop}
281: \ee
282: Here, $c_0$ is a measure of the RNA concentration.  Using this
283: approximation for the concentration of complementary strands, we can now
284: compare Eqs.~(\ref{alpha}) and (\ref{alpha_full}).  Fig. \ref{alphacompare}
285: shows the more elaborate model Eq.(\ref{alpha_full}) as a function of
286: the approximate form Eq.~(\ref{alpha}), with the values for the fitting
287: parameters $\beta'$ and $\tilde{c}$ taken from Ref.~\cite{carl06}.
288: There is a reasonable agreement between the two.
289: 
290: 
291: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
292: \begin{figure}
293: \includegraphics[width=8.0cm]{FIG02.eps}
294: \caption{Comparison of the summation in Eq.~(\ref{alpha_full}), equal
295: to $\alpha_f^{-1}-1$, and its approximation in Eq.~(\ref{alpha}),
296: equal to $\alpha^{-1}-1$, for the first 1,000 spike-in sequences of
297: HGU133.  Note that a change in $c_0$ corresponds to a vertical shift
298: over $\log(c_0)$; in this figure, we used $c_0=1$.  The straight line
299: is a fit, given by $y=x+b$ with $b=-14.1$.
300: \label{alphacompare}
301: }
302: \end{figure}
303: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
304: 
305: Since Eq.~(\ref{alpha_full}) has a better microscopic foundation than
306: Eq.~(\ref{alpha}), it should in principle allow for a better estimate
307: of the hybridization in solution.  There are however severe limitations
308: to the use of Eq.~(\ref{alpha_full}).
309: In the hybridization in solution, there is a competition between the
310: contributions of short sequences, which are abundant but have a low
311: affinity, versus long sequences, for which the concentration is low but
312: the affinity high. The concentration drops on average approximately by a
313: factor of 4 per added length (see Eq.~(\ref{concdrop})), but the affinity
314: grows by approximately $\langle \Delta G\rangle \approx$ 2 or 3 kcal/mol,
315: the average value of RNA/RNA interaction parameters~\cite{bloo00}. Since
316: $\exp(\beta \langle \Delta G\rangle) > 4$, the longer sequences dominate
317: the hybridization in solution. However, as discussed above, beyond length
318: $L\approx 12$, there simply are no complementary strands. The accuracy
319: of the more elaborate model Eq.~(\ref{alpha_full}) thus hinges crucially
320: on knowing the longest complementary strand which is transcribed, as
321: well as its affinity and its concentration.  Since the approximate model
322: Eq.~(\ref{alpha}) is not expected to perform worse than the more elaborate
323: model Eq.~(\ref{alpha_full}), we keep using the former.
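
As a rough numerical check of this argument (a sketch using only the
representative values quoted above, $\beta = 1.59$ mol/kcal and
$\langle \Delta G\rangle = 2$ or 3 kcal/mol), the Boltzmann factor
gained per added nucleotide is
\[
e^{\beta \langle \Delta G\rangle} \approx e^{3.18} \approx 24
\quad {\rm or} \quad
e^{\beta \langle \Delta G\rangle} \approx e^{4.77} \approx 118 ,
\]
to be compared with the factor of 4 lost in concentration.  Under these
assumptions, each term in the sum of Eq.~(\ref{alpha_full}) grows by a
net factor of roughly 6 to 30 per added nucleotide, up to the cutoff
length $L$.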
324: 
325: The data points in Fig.~\ref{alphacompare} can be fitted by a
326: straight line with slope 1: the value of $\beta'=0.67$ mol/kcal
327: in Ref.\cite{carl06}, corresponding to 725 K, apparently is the
328: appropriate value to describe the experiments at a temperature
329: of 45 degrees. The offset in the straight-line fit is equal to
330: $\log(\tilde{c})-\log(c_0)$. Since the straight-line fit has an offset of
331: $-14.1$, and since we used the fitted value of $\tilde{c}=2\cdot 10^{-2}$
332: pM in Ref.~\cite{carl06}, an estimate of the RNA concentration is
333: $c_0=\exp(14.1)\cdot \tilde{c}=30$ nM.  Even if we do not use the more
334: elaborate model Eq.~(\ref{alpha_full}), it provides us with a microscopic
335: basis for the values of the parameters $\beta'$ and $\tilde{c}$ in the
336: approximate model Eq.~(\ref{alpha}).
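
For illustration, the arithmetic behind this estimate can be written
out explicitly, assuming that $\log$ denotes the natural logarithm and
using the rounded inputs quoted above:
\[
c_0 = e^{14.1} \, \tilde{c}
\approx 1.3 \cdot 10^{6} \times 2 \cdot 10^{-2} \ {\rm pM}
\approx 2.7 \cdot 10^{4} \ {\rm pM} \approx 27 \ {\rm nM},
\]
i.e. about 30 nM to one significant digit.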
337: 
338: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
339: \begin{figure}[t]
340: \includegraphics[width=7.0cm]{FIG03.eps}
341: \caption{Plot of intensity vs. concentration for three spike-in genes
342: of the HGU133 chipset. $I_{\rm max}$ indicates the saturation value 
343: obtained from a three-parameter ($I_0$, $A$ and $K$) non-linear fit 
344: based on Eq. (\ref{fit_c}).}
345: \label{Ivsc}
346: \end{figure}
347: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
348: 
349: \section{On the signal saturation level}
350: \label{sec:saturation}
351: 
352: If the target concentration $c$ and the binding energy $\Delta G$
353: are sufficiently high, the Langmuir isotherm saturates to a maximal
354: value. From Eq. (\ref{fluorescence}) we find for
355: $\alpha c \exp(\beta \Delta G) \gg 1$
356: \be
357: I_{\rm max} = I_0 + A \approx A,
358: \ee
359: where we have used the fact that typically the background level, $I_0$,
360: is much lower than the value of $A$. The saturation intensity arises if
361: targets are bound to almost all probes. Since the number of probes does
362: not vary between the sequences being measured, this saturation intensity
363: is also expected to be sequence-independent, and more specifically,
364: should not distinguish between perfect matches and mismatches.  A recent
365: analysis of the Latin square set \cite{held03,burd04} reported widely
366: different values for the saturation intensity. It is worth clarifying
367: this issue further here.
368: 
369: The obvious procedure to determine the saturation intensity is to look at
370: the intensity of a probe as a function of concentration. Assuming an
371: effective affinity $K_s$ for probe sequence $s$, the intensity $I_s(c)$ as a
372: function of concentration $c$ is given by
373: \be
374: I_s(c) = I_{0,s} + \frac{A_s c K_s}{1+c K_s},
375: \label{fit_c}
376: \ee
377: in which $I_{0,s}$ is the (sequence-dependent) background intensity
378: due to non-specific binding.  A plot of $I_s$ vs. $c$ for two probes of
379: the HGU133 spike-in set is shown in Fig. \ref{Ivsc}. Taking $I_0$, $A$
380: and $K$ in Eq.~(\ref{fit_c}) as fitting parameters, and extrapolating
381: to high concentration then yields the saturation intensity.
382: 
383: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
384: \begin{figure}[t]
385: \includegraphics[width=8.0cm]{FIG04.eps}
386: \caption{Plot of $I-I_0$ as a function of $\Delta G - R T\log \alpha$
387: for 4 sequences spiked-in at a concentration of $c=512$ pM.  The numbers
388: indicate the probe set numbers. Smaller characters are used for the
389: MM signals. Solid lines represent the Langmuir model as given by
390: Eq. (\ref{fluorescence}). The data are consistent, except for a few outliers, with 
391: the Langmuir model with roughly constant saturation level $A \approx 10^4$.}
392: \label{IvsDG_star}
393: \end{figure}
394: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
395: 
396: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
397: \begin{figure*}[t]
398: \includegraphics[width=7.5cm]{FIG05a.eps}
399: \includegraphics[width=7.5cm]{FIG05b.eps}
400: \caption{Histograms of the PM and MM intensities for the Latin square
401: experiments in log-log scale for the chips HGU95a (a) and HGU133 (b). The
402: plots contain 19 histograms referring to different experiments (a) and
403: 12 experiments (b).  The dashed lines are positioned at $I=10000$ and
404: $I=15000$ (intensities are given in Affymetrix scale). Insets: histograms
405: of the total intensity of PM and MM together.}
406: \label{FIG0h}
407: \end{figure*}
408: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
409: 
410: Two research groups \cite{held03,burd04} followed this procedure, and both
411: found saturation intensities that vary wildly between different sequences.
412: A first effect that can cause deviations from the Langmuir fit
413: Eq.~(\ref{fit_c}) is that the lithographic process, through which
414: the probes are synthesized in situ in Affymetrix chips, is not 100\%
415: efficient. As estimated by Burden~\cite{burd04}, only about 10\% of the
416: probes reach the full length of 25 nucleotides. At low intensities far
417: from saturation, the incomplete probes can be safely ignored since their
418: affinity is much lower than that of the fully grown probes. However, under
419: conditions where the fully grown probes are saturated, clearly there will
420: be contributions to the fluorescent intensity from the almost complete
421: probes, and an even further increase in concentration will bring into
422: play shorter and shorter incomplete probes.  Consequently, the Langmuir
423: fit Eq.~(\ref{fit_c}) breaks down near saturation; extrapolation to high
424: concentration is an unreliable procedure.
425: 
426: A second cause of worry is that comparing fluorescent intensities from
427: different chips is also potentially unreliable, since the microarrays
428: might have undergone slightly different processing during the washing
429: and staining. Since Affymetrix microarrays cannot be reused, the
430: spike-in measurements used in Refs.~\cite{held03,burd04} required a new
431: chip for each concentration.
432: 
433: To avoid these two potential sources of error, we consider
434: the intensities for a given probe set at a specific concentration,
435: i.e. $c$ constant and $\Delta G$ and $\alpha$ variables in
436: Eq. (\ref{fluorescence}).  The data belong to the same array.
437: An example of this type of analysis is shown in Fig. \ref{IvsDG_star}
438: for a concentration of $c=512$ pM.  On the horizontal axis we plot
439: $\Delta G^* = \Delta G - RT \log \alpha$.  The solid lines are
440: given by the Langmuir curve Eq. (\ref{fluorescence}).  Note that
441: the large majority of the probes align along the expected curve,
442: with a few exceptions, such as probe 11 (both PM and MM) of
443: the probe set 204414\_at.  Therefore, the data are consistent with
444: a value of $A$ roughly constant in Eq. (\ref{fluorescence}), which
445: suggests indeed that the large variations in $I_{\rm max}$ obtained
446: from the extrapolations of the data in the earlier analysis are more
447: likely to be an artifact of the extrapolations. Note however that
448: some variability of the saturation level can be seen in the data of
449: Fig. \ref{IvsDG_star}. Typically this variability is about $20\%$. In
450: order to keep the model simple we will keep $A$ constant in the rest of
451: the paper. An interesting possible explanation of the variability of $A$
452: has been given in Ref. \cite{burd04}, i.e. that this variation is due
453: to the post-hybridization washing of the array.
454: 
455: Yet another way of addressing the issue of the saturation
456: intensities is to analyze the histogram of the intensities on the whole
457: chip, as in Fig. \ref{FIG0h}, which shows both the intensities for the
458: HGU95 and HGU133 spike-in data. To reveal the data at high intensities,
459: they are plotted on a log-log scale. In the figure we note a drop in the
460: histogram around $I \approx 10\ 000$, sharper in the HGU133 chipset,
461: which is consistent with the estimate of the saturation intensity
462: obtained from the fits of intensities vs $\Delta G - R T \log \alpha$,
463: as given in Fig. \ref{IvsDG_star}. Note that in Fig. \ref{FIG0h}(b)
464: the drop is 100-fold in the range $10\ 000 < I < 15\ 000$, which
465: suggests that the data are consistent with a roughly constant value
466: of the saturation. However, a closer inspection of the histogram
467: of the HGU133 for PM and MM intensities separately reveals that the
468: estimated saturation value for the two may be different. In the case of
469: PM intensities alone the drop is rather sharp at around $I \approx 10\
470: 000$; the MM intensities, however, seem to saturate at lower intensities,
471: which is not seen in the HGU95 data (Fig. \ref{FIG0h}(a)).  The number
472: of MM probes reaching an intensity close to the saturation level in the
473: histogram of Fig. \ref{FIG0h}(b) is quite small, so one cannot conclude
474: with certainty that the MM and PM reach different saturation levels.
475: 
476: The low-intensity side of the histograms in Fig.~\ref{FIG0h} also contains
477: interesting information. Both for the HGU95 and HGU133, the intensity
478: drops steeply below a minimal intensity. For HGU95, this drop occurs
479: around $I_{\rm min}\approx 70$, while for HGU133 the drop occurs around
480: $I_{\rm min}\approx 30$. This increase of the dynamic intensity range
481: by more than a factor of two is a clear demonstration of the fast rate
482: of improvement in microarray technology.
483: 
484: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
485: \begin{figure*}[t]
486: \includegraphics[width=4.2cm]{FIG06_AFFX-DapX.eps}
487: \includegraphics[width=4.2cm]{FIG06_AFFX-LysX.eps}
488: \includegraphics[width=4.2cm]{FIG06_AFFX-PheX.eps}
489: \includegraphics[width=4.2cm]{FIG06_AFFX-ThrX.eps}
490: 
491: \includegraphics[width=4.2cm]{FIG06_AFFX-TagA.eps}
492: \includegraphics[width=4.2cm]{FIG06_AFFX-TagB.eps}
493: \includegraphics[width=4.2cm]{FIG06_AFFX-TagC.eps}
494: \includegraphics[width=4.2cm]{FIG06_AFFX-TagD.eps}
495: 
496: \includegraphics[width=4.2cm]{FIG06_AFFX-TagE.eps}
497: \includegraphics[width=4.2cm]{FIG06_AFFX-TagF.eps}
498: \includegraphics[width=4.2cm]{FIG06_AFFX-TagG.eps}
499: \includegraphics[width=4.2cm]{FIG06_AFFX-TagH.eps}
500: \caption{Collapse plots for the 4 bacterial and the 8 artificial 
501: sequences of the HGU133 spike-in set.
502: In these plots the background subtracted intensities for a given probe
503: set are plotted as functions of the rescaled variable $x'$ given in 
504: Eq.~(\ref{xprime}). The data correspond to all spike-in concentrations
505: for a given probe set. Solid lines correspond to the Langmuir isotherm.
506: Compared with the human and bacterial sequences the 
507: artificial sequences are characterized by the best collapses.}
508: % \caption{Collapse plots for the Artificial sequences in the HGU133 
509: % spike-in set. Compared with the human and bacterial sequences the 
510: % artificial sequences are characterized by the best collapses.}
511: \label{collapse_A}
512: \end{figure*}
513: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
514: 
515: \section{Analysis of data collapses}
516: \label{sec:analysis}
517: 
518: As a test of the validity of the model we plotted \cite{carl06} the
519: data as a function of the rescaled variable:
520: \be
521: x' = \alpha c e^{\beta \Delta G}.
522: \label{xprime}
523: \ee
524: If the model is to be trusted the data for different values of $c$ and
525: different probe sequences (i.e. different $\Delta G$ and $\alpha$)
526: ought to ``collapse'' onto a single master curve
527: \be
528: I - I_0 = \frac{A x'}{1+x'}.
529: \label{rescaled}
530: \ee
531: This collapse has indeed been observed in the large majority of the
532: spike-in genes of the HGU95a chipset \cite{carl06}. Interestingly, the
533: very few outliers observed in that case could be explained as annotation
534: errors or an imbalance of the free energies used for specific nucleotides,
535: as discussed in Ref. \cite{carl06}.
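
Two limits of Eq.~(\ref{rescaled}) may help when reading the collapse
plots; they are sketched here as a direct consequence of the Langmuir
form, not as an additional result:
\[
I - I_0 \approx A x' \quad (x' \ll 1),
\qquad
I - I_0 \approx A \quad (x' \gg 1).
\]
On a log-log plot the master curve therefore has unit slope at small
$x'$ and crosses over to a plateau around $x' \approx 1$, where
$I - I_0 = A/2$.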
536: 
537: We choose here the same fitting parameters used in Ref. \cite{carl06}
538: for the HGU95 chipset, that is: $A= 10\ 000$, $\beta = 0.74$ mol/kcal,
539: $\beta'= 0.67$ mol/kcal and $\tilde{c}= 10^{-2}$ pM. These parameters
540: fit equally well the HGU133 spike-in data.
541: 
542: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
543: \begin{figure*}[t]
544: \includegraphics[width=4.2cm]{FIG07_200665_s_at.eps}
545: \includegraphics[width=4.2cm]{FIG07_203471_s_at.eps}
546: \includegraphics[width=4.2cm]{FIG07_203508_at.eps}
547: \includegraphics[width=4.2cm]{FIG07_204205_at.eps}
548: 
549: \includegraphics[width=4.2cm]{FIG07_204417_at.eps}
550: \includegraphics[width=4.2cm]{FIG07_204430_s_at.eps}
551: \includegraphics[width=4.2cm]{FIG07_204513_s_at.eps}
552: \includegraphics[width=4.2cm]{FIG07_204563_at.eps}
553: 
554: \includegraphics[width=4.2cm]{FIG07_204836_at.eps}
555: \includegraphics[width=4.2cm]{FIG07_204912_at.eps}
556: \includegraphics[width=4.2cm]{FIG07_204951_at.eps}
557: \includegraphics[width=4.2cm]{FIG07_204959_at.eps}
558: 
559: % \includegraphics[width=4.2cm]{../TXT/204951_at1_ALPHA.eps}
560: % \includegraphics[width=4.2cm]{../TXT/204959_at1_ALPHA.eps}
561: \includegraphics[width=4.2cm]{FIG07_205267_at.eps}
562: \includegraphics[width=4.2cm]{FIG07_205291_at.eps}
563: \caption{Collapse plots for Human sequences of the HGU133 spike-in set
564: (part 1). The probes which are complementary to targets with the largest
565: folding free energies are emphasized (see Table \ref{table_fold}). They
566: correspond to probes 204912\_at10 and 204513\_s\_at4.
567: }
568: \label{collapse_H1}
569: \end{figure*}
570: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
571: 
572: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
573: \begin{table}[b]
574: \caption{List of values of $\langle w \rangle$ and $\sigma_w$ for the bacterial 
575: and the artificial sequences in the spike-in set HGU133.}
576: \begin{ruledtabular}
577: \begin{tabular}{lll|lll}
578: Probe set & $\langle w \rangle$ & $\sigma_w$ & Probe set & $\langle w \rangle$ & $\sigma_w$\\
579: \hline
580: AFFX-DapX-3\_at & 0.08 & 1.49 & AFFX-PheX-3\_at     & 0.16  & 1.55\\
581: AFFX-LysX-3\_at & 0.89 & 2.46 & AFFX-ThrX-3\_at     & 0.22  & 1.59\\
582: AFFX-r2-TagA\_at  & -1.05 & 0.97 & AFFX-r2-TagE\_at & -0.32 & 0.82\\
583: AFFX-r2-TagB\_at  & -0.51 & 0.83 & AFFX-r2-TagF\_at & -0.46 & 1.09\\
584: AFFX-r2-TagC\_at  &  0.43 & 1.08 & AFFX-r2-TagG\_at & -0.11 & 0.90\\
585: AFFX-r2-TagD\_at  & -0.03 & 0.90 & AFFX-r2-TagH\_at &  0.11 & 1.22
586: \end{tabular}
587: \end{ruledtabular}
588: \label{table_bacterial}
589: \end{table}
590: 
591: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
592: \begin{table}[t]
593: \caption{List of values of $\langle w \rangle$ and $\sigma_w$ for 
594: the human sequences in the spike-in set HGU133.}
595: \begin{ruledtabular}
596: \begin{tabular}{lcc|lcc}
597: Probe set & $\langle w \rangle$ & $\sigma_w$ & Probe set & $\langle w \rangle$ & $\sigma_w$\\
598: \hline
599: 200665\_s\_at   &  0.54 & 1.26   & 205569\_at         &  -0.28 & 1.12\\
600: 203471\_s\_at   &  0.39 & 1.43   & 205692\_s\_at      &   0.24 & 1.27\\
601: 203508\_at      &  0.45 & 1.83   & 205790\_at         &  -0.78 & 0.76\\
602: 204205\_at      &  0.86 & 2.11   & 206060\_s\_at      &   0.52 & 1.66\\
603: 204417\_at      & -0.24 & 1.18   & 207160\_at         &  -0.32 & 1.06\\
604: 204430\_s\_at   & -0.48 & 1.13   & 207540\_s\_at      &  -0.29 & 0.62\\
605: 204513\_s\_at   & -0.68 & 1.16   & 207641\_at         &   0.24 & 2.72\\
606: 204563\_at      & -0.57 & 1.44   & 207655\_s\_at      &   0.76 & 1.06\\
607: 204836\_at      & -0.04 & 1.41   & 207777\_s\_at      &  -0.14 & 1.11\\
608: 204912\_at      & -0.31 & 1.35   & 207968\_s\_at      &  -0.85 & 1.66\\
609: 204951\_at      & -0.15 & 1.48   & 209354\_at         &   0.04 & 1.41\\
610: 204959\_at      &  1.33 & 1.62   & 209606\_at         &   0.77 & 1.44\\
611: 205267\_at      &  0.36 & 1.23   & 209734\_at         &  -0.20 & 1.51\\
612: 205291\_at      & -0.44 & 1.24   & 209795\_at         &   0.63 & 1.71\\
613: 205398\_s\_at   & -0.15 & 1.37   & 212827\_at         &   0.61 & 2.53
614: \end{tabular}
615: \end{ruledtabular}
616: \label{table_human}
617: \end{table}
618: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
619: 
620: In Figs.~\ref{collapse_A}, \ref{collapse_H1} and \ref{collapse_H2}
621: we show the collapse plots for all 42 genes of the HGU133 spike-in data
622: set. Each plot contains about 200 points, which all tend to
623: cluster (in some cases much better than in others) along the Langmuir
624: curve $Ax'/(1+x')$. All 13 concentrations, which range from $0.125$
625: pM to $512$ pM in the spike-in experiment, are shown. The intensities
626: measured at $c=0$ are taken as estimates of the background level $I_0$
627: in Eq.~(\ref{rescaled}).  In the collapse plots only the MM sequences
628: for which a $\Delta G$ could be estimated are shown, as the mismatch
629: free energies in RNA/DNA duplexes are known only for a limited set of
630: mismatches \cite{sugi00} (we could associate a free energy with about 30\%
631: of the mismatches, as discussed in Ref.~\cite{carl06}).
632: 
633: The HGU133 spike-in set contains 4 bacterial sequences and 8
634: artificial sequences (Fig.~\ref{collapse_A}) and 30 human sequences
635: (Figs.~\ref{collapse_H1} and \ref{collapse_H2}).  Perfect agreement
636: with the Langmuir theory would imply that the data all align along the
637: curve given by Eq.~(\ref{rescaled}) and shown as a solid line in Figs.
638: \ref{collapse_A}, \ref{collapse_H1} and \ref{collapse_H2}. In general the
639: agreement is best for the artificial sequences. Some
640: human sequences also collapse well onto a single curve, in good agreement with
641: the Langmuir model, but in general they behave worse than the artificial
642: ones.  In order to measure the data dispersion we introduce the variable:
643: \be
644: w = \log \left( \frac{I}{I_{\rm th}} \right),
645: \label{define_w}
646: \ee
647: where $I$ is the measured intensity and $I_{\rm th}$ the theoretical
648: value predicted by the Langmuir isotherm, Eq.~(\ref{rescaled}),
649: for the $x'$ corresponding to the measured $I$. In computing $w$
650: from Eq.~(\ref{define_w}) we have kept only the values of $I$ in the
651: range $100 < I < 10000$.  We determine its average $\langle w \rangle$
652: and standard deviation $\sigma_w$. If the data are well centered around
653: the expected behavior one has $\langle w \rangle =0$, while $\sigma_w$
654: is a measure of the spread in the data.
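
As an illustration of how these two quantities can be obtained (a sketch
under assumptions introduced here, since the normalization is not
specified above), for the $N$ data points of a probe set with
$100 < I < 10000$ one may take
\be
\langle w \rangle = \frac{1}{N} \sum_{i=1}^N w_i ,
\qquad
\sigma_w^2 = \frac{1}{N} \sum_{i=1}^N \left( w_i - \langle w \rangle \right)^2 ,
\ee
where $w_i$ is the value of Eq.~(\ref{define_w}) for the $i$-th point.
The use of $1/N$ rather than $1/(N-1)$ is an assumption, which makes
little difference for large $N$. If the logarithm in
Eq.~(\ref{define_w}) is the natural one (again an assumption), a value
$\sigma_w \approx 1$ corresponds to a typical scatter of the measured
intensities by a factor $e \approx 2.7$ around the Langmuir curve.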
655: 
656: The values of $\langle w \rangle$ and $\sigma_w$ for the bacterial and
657: artificial sequences and for the human sequences are given in Tables
658: \ref{table_bacterial} and \ref{table_human}, respectively.  We note
659: that $\sigma_w$ is on average the lowest for the artificial sequences,
660: with a typical value $\sigma_w \approx 1$. Only for two human probe sets
661: (205790\_at and 207540\_s\_at, with $\sigma_w \approx 0.7$) is the collapse
662: better than that of the artificial sequences. For three human probe
663: sets (204205\_at, 207641\_at and 212827\_at) the collapse is very poor,
664: as indicated by $\sigma_w > 2$. The collapses for the four bacterial
665: sequences have a somewhat higher dispersion than those for the human sequences.
666: 
667: 
668: A very interesting feature of the whole analysis is that the quality
669: of the collapses is on average much better for artificial sequences than for
670: the other sequences. Artificial sequences have been chosen by Affymetrix
671: to be as different as possible from any human RNA, so as to minimize the
672: effects of cross-hybridization. Their preparation, as far as labeling and
673: target fragmentation are concerned, is the same as for all other spikes
674: \cite{private_affy}. As the same set of parameters is used in all collapses,
675: the high $\sigma_w$ for some probe sets is very likely an indication
676: that the selected probes are not yet optimal.  A possible source of deviations
677: from the theory is cross-hybridization.
678: 
679: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
680: \begin{figure*}[t]
681: \includegraphics[width=4.2cm]{FIG08_205398_s_at.eps}
682: \includegraphics[width=4.2cm]{FIG08_205569_at.eps}
683: \includegraphics[width=4.2cm]{FIG08_205692_s_at.eps}
684: \includegraphics[width=4.2cm]{FIG08_205790_at.eps}
685: 
686: \includegraphics[width=4.2cm]{FIG08_206060_s_at.eps}
687: \includegraphics[width=4.2cm]{FIG08_207160_at.eps}
688: \includegraphics[width=4.2cm]{FIG08_207540_s_at.eps}
689: \includegraphics[width=4.2cm]{FIG08_207641_at.eps}
690: 
691: \includegraphics[width=4.2cm]{FIG08_207655_s_at.eps}
692: \includegraphics[width=4.2cm]{FIG08_207777_s_at.eps}
693: \includegraphics[width=4.2cm]{FIG08_207968_s_at.eps}
694: \includegraphics[width=4.2cm]{FIG08_209354_at.eps}
695: 
696: \includegraphics[width=4.2cm]{FIG08_209606_at.eps}
697: \includegraphics[width=4.2cm]{FIG08_209734_at.eps}
698: \includegraphics[width=4.2cm]{FIG08_209795_at.eps}
699: \includegraphics[width=4.2cm]{FIG08_212827_at.eps}
700: \caption{Collapse plots for the human sequences of the HGU133 spike-in set
701: (part 2). The probes complementary to the targets with the largest
702: folding free energies are emphasized (see Table \ref{table_fold}). They
703: correspond to probes 207641\_at5 and 209354\_at8.
704: }
705: \label{collapse_H2}
706: \end{figure*}
707: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
708: 
709: \section{Determination of the expression level}
710: 
711: The model defined by Eqs.~(\ref{fluorescence}) and (\ref{alpha}), once
712: all parameters have been fixed, can be used to fit the concentration
713: $c$ starting from the measured intensities. The target concentration
714: in solution is a measure of the gene expression level and it
715: is the quantity one wants to compute from the raw microarray data.
716: As the concentrations in the spike-in experiments are known, we can
717: compare the known values with the fitted ones. Figure \ref{CvsCspike}
718: shows a plot of fitted concentration vs. spike-in concentration for the
719: artificial sequences. We limit ourselves here to showing the data for these
720: sequences, but the trend is quite general and valid for other
721: genes as well.  The solid line in Fig.~\ref{CvsCspike} corresponds to
722: the line $y=x$, which means perfect agreement between spike-in and fitted
723: values. The two other lines correspond to $y=2x$ and $y = x/2$, drawn as
724: guides to the eye.
725: 
726: As shown in Fig. \ref{CvsCspike}, most of the data fall in the range between
727: the two lines, except for the spikes TagA and TagF which give a much
728: lower fitted concentration. All the points follow approximately straight
729: lines with slope 1, except for the highest spike-in concentrations,
730: corresponding to $256$ and $512$ pM. This is due to the fact that at
731: high concentrations many probes are very close to saturation.
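
One way to see why saturation degrades the fit (a sketch based only on
Eq.~(\ref{rescaled}), with the shorthand $y \equiv I - I_0$ introduced
here for illustration) is to invert the Langmuir isotherm,
\be
x' = \frac{y}{A - y},
\ee
and to propagate a small error $\delta y$ on the background-subtracted
intensity:
\be
\frac{\delta x'}{x'} = \frac{A}{A - y}\, \frac{\delta y}{y}.
\ee
The amplification factor $A/(A-y)$ is close to $1$ far from saturation
($y \ll A$) and diverges as $y \to A$. Near saturation even small
fluctuations in the intensity thus translate into large uncertainties
in $x'$, and hence in any quantity estimated from it.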
732: 
733: We note also that the fitted concentrations are systematically
734: lower than the spike-in values, as most of the concentrations fall
735: in the interval $[c_{\rm spike-in}/2,c_{\rm spike-in}]$. This is a
736: consequence of our choice to use the fitting parameters which we took from
737: a previous study \cite{carl06} of spike-in experiments on the HGU95 chipset. We
738: have chosen not to refit these parameters here again for HGU133,
739: to illustrate their universal validity. The slight underestimation of
740: the absolute concentration is not a problem, since in gene expression
741: measurements one is typically interested in fold-variations of expression
742: levels between different experimental conditions. The fact that the data
743: of Fig.~\ref{CvsCspike} follow lines with a slope of approximately one
744: indicates that the fold-change in concentration between different experiments
745: is correctly estimated.
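
The argument can be made explicit with a short calculation (an
illustrative sketch; the prefactor $k$ and the exponent $s$ are notation
introduced here, not fitted quantities). Assuming that the axes of
Fig.~\ref{CvsCspike} are logarithmic, a straight line in the plot
corresponds to $c_{\rm fit} = k\, c_{\rm spike-in}^{\,s}$, and the ratio
of the fitted concentrations in two experiments $1$ and $2$ is
\be
\frac{c_{\rm fit}^{(1)}}{c_{\rm fit}^{(2)}} =
\left( \frac{c_{\rm spike-in}^{(1)}}{c_{\rm spike-in}^{(2)}} \right)^{s},
\ee
which is independent of $k$. For $s=1$ the fold-change is reproduced
exactly, whatever the value of $k$. A systematic shift with
$1/2 \leq k \leq 1$, as observed for most of the data, therefore does
not affect it.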
746: 
747: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
748: \begin{figure}[t]
749: \includegraphics[width=8.5cm]{FIG09.eps}
750: \caption{Plot of the fitted target concentration as a function of the
751: spike-in concentration for the artificial sequences. The solid line corresponds
752: to the diagonal $y=x$, while the two dotted lines are $y=x/2$ and $y=2x$
753: and are drawn as guides to the eye. We note a systematic shift of the 
754: estimated absolute concentration compared to the spike-in one, although
755: the fold-variations of the concentrations are correctly estimated
756: as the majority of the data follow lines parallel to the diagonal in
757: the plot.}
758: \label{CvsCspike}
759: \end{figure}
760: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
761: 
762: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
763: \begin{table}[t]
764: \caption{Minimal folding free energies for the targets (assumed to be 25-mers)
765: complementary to the probes forming the spike-in HGU133 data set. These
766: free energies are calculated with the program RNAfold.}
767: \begin{ruledtabular}
768: \begin{tabular}{ccc}
769: Probe set & Probe number & $-\Delta G_{\rm fold}$ (kcal/mol) \\
770: \hline
771: 204513\_s\_at	& 4	& 8.70 \\
772: 207641\_at	& 5	& 8.16 \\
773: 204430\_s\_at	& 10	& 7.79 \\
774: 209354\_at	& 8	& 7.67 \\
775: 207540\_s\_at	& 10	& 7.45 \\
776: AFFX-r2-TagA\_at& 1	& 6.52 \\
777: 205398\_s\_at	& 1	& 6.43 \\
778: AFFX-PheX-3\_at	& 10	& 6.18 \\
779: 204836\_at	& 10	& 6.17 \\
780: 203508\_at	& 2	& 6.10 \\
781: 206060\_s\_at 	& 3	& 6.05 \\
782: \end{tabular}
783: \end{ruledtabular}
784: \label{table_fold}
785: \end{table}
786: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
787: 
788: \section{One cause of outliers: Target secondary structures}
789: 
790: It is well known that single-stranded nucleic acids, particularly RNA,
791: tend to form stable folded conformations by binding of complementary
792: bases. Currently, algorithms that calculate RNA secondary structures
793: can be trusted for sufficiently short molecules, say shorter than 50
794: nucleotides, which is the situation in Affymetrix microarrays, where RNA
795: targets are fragmented before hybridization. The average target length
796: is $50$ nucleotides, but probably only shorter fragments contribute to hybridization.
797: 
798: We used the Vienna package \cite{hofa03} for the calculation of folded
799: RNA structures that may form in solution and impede hybridization.
800: We considered first 25-mer targets in solution exactly complementary to
801: the probes of the HGU133 spike-in data set.  Table \ref{table_fold} shows
802: a list of the probes in this set whose complementary target has the lowest
803: folding free energy (i.e. that of its most stable conformation), calculated
804: at the experimental temperature of $45^\circ$C. Given a folding free energy
805: $\Delta G_{\rm fold}$, one can use the two-state model approximation to
806: find $p_{\rm fold}$, the probability that the sequence is folded into the
807: most stable conformation:
808: \be 
809: p_{\rm fold} = \frac{e^{-\Delta G_{\rm fold}/RT}}{1+e^{-\Delta G_{\rm fold}/RT}} 
810: \label{p_fold} 
811: \ee 
812: where we use $T=45^\circ$C.  According to this expression, for a folding
813: free energy $\Delta G_{\rm fold}= -8$~kcal/mol one finds $1 - p_{\rm
814: fold} \approx 4\cdot 10^{-6}$, and for $\Delta G_{\rm fold}= -6$~kcal/mol $1
815: - p_{\rm fold} \approx 10^{-4}$. Therefore the large majority of the
816: targets complementary to the probes listed in Table \ref{table_fold}
817: are folded and not expected to participate in hybridization.
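
These orders of magnitude can be checked directly (a worked estimate,
assuming the gas constant $R \approx 1.987 \cdot 10^{-3}$ kcal/(mol K)
and $T = 318.15$ K, so that $RT \approx 0.63$ kcal/mol). From
Eq.~(\ref{p_fold}),
\be
1 - p_{\rm fold} = \frac{1}{1+e^{-\Delta G_{\rm fold}/RT}}
\approx e^{\Delta G_{\rm fold}/RT}
\qquad (-\Delta G_{\rm fold} \gg RT).
\ee
For $\Delta G_{\rm fold} = -8$ kcal/mol one has
$\Delta G_{\rm fold}/RT \approx -12.7$ and
$1 - p_{\rm fold} \approx 3 \cdot 10^{-6}$, while for
$\Delta G_{\rm fold} = -6$ kcal/mol one has
$\Delta G_{\rm fold}/RT \approx -9.5$ and
$1 - p_{\rm fold} \approx 8 \cdot 10^{-5}$, of the same order as the
values quoted above.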
818: 
819: Figure \ref{FIG0r} shows the folding configurations for the four
820: targets with the lowest free energy in Table \ref{table_fold}. As shown
821: in Figs.~\ref{collapse_H1} and \ref{collapse_H2}, the corresponding
822: probes have a signal which is a few orders of magnitude lower than that
823: expected from the Langmuir model, although not as low as derived from
824: Eq. (\ref{p_fold}), using the $\Delta G_{\rm fold}$ listed in Table
825: \ref{table_fold}.  For instance, from the measured signals we find
826: an intensity lower by a factor $10^3$ for the probe 204513\_s\_at4,
827: instead of a factor $10^6$ as deduced from Eq.~(\ref{p_fold}). This
828: difference could have several origins. First, the hybridization in
829: solution described by the term $\alpha$ in Eq.~(\ref{alpha}) may
830: already take into account some secondary structure formation. Second,
831: the RNA in solution is present with sequences of all lengths. The free
832: energies listed in Table \ref{table_fold} refer to 25-mers, so shorter
833: sequences will have a lower folding probability than that deduced from
834: Eq.~(\ref{p_fold}) on the basis of the free energies of 25-mers. Third,
835: even if some secondary structure is present, hybridization with the
836: surface-bound probes is still possible if the folded configuration has
837: some dangling ends from which binding can initiate.
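
To quantify this mismatch (an illustrative estimate within the two-state
model, with $RT \approx 0.63$ kcal/mol at $45^\circ$C and assuming that
the signal is reduced in proportion to the unfolded fraction; the symbol
$\Delta G_{\rm eff}$ is introduced here only for this purpose), one can
ask which folding free energy would produce the observed reduction.
Writing the reduction factor as
$1 - p_{\rm fold} \approx e^{\Delta G_{\rm eff}/RT}$, a suppression by
$10^{-3}$ gives
\be
\Delta G_{\rm eff} \approx RT \ln 10^{-3} \approx -4.4\ {\rm kcal/mol},
\ee
to be compared with $\Delta G_{\rm fold} = -8.70$ kcal/mol for the probe
204513\_s\_at4 in Table \ref{table_fold}, for which Eq.~(\ref{p_fold})
gives $1 - p_{\rm fold} \approx 10^{-6}$.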
838: 
839: We have analyzed the folding free energies of 25-mers complementary to
840: all the probes in the HGU133 spike-in set. We found that about $50\%$ of
841: the targets have a folding free energy lower than $1$ kcal/mol in absolute value, so that
842: secondary structure formation can be safely neglected. About $10\%$
843: of the targets have a folding free energy higher than $4$ kcal/mol in absolute value, so
844: that for this fraction the secondary structure formation may interfere
845: with the target-probe hybridization.
846: 
847: Summarizing, the correct estimate of the folding probability involves
848: a complex calculation over fragments of all lengths, possibly including
849: sequences neighboring the 25-mer part complementary to the probe. However,
850: folding is expected to have a relevant effect for about $10\%$ of
851: the probes. A possible way out is to exclude from the analysis
852: of the gene expression levels those probes whose 25-mer folding free
853: energy exceeds, in absolute value, a certain threshold.
854: 
855: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
856: \begin{figure}[t]
857: \includegraphics[height=3.6cm]{FIG10_fold-204513_s_at4.ps}
858: \includegraphics[height=3.3cm]{FIG10_fold-207641_at5.ps}
859: \includegraphics[height=2.8cm]{FIG10_fold-204430_s_at10.ps}
860: \includegraphics[height=3.4cm]{FIG10_fold-209354_at8.ps}
861: \caption{Folding configurations for the four targets with the lowest
862: free energy. From left to right: 204513\_s\_at4, 207641\_at5,
863: 204430\_s\_at10 and 209354\_at8.}
864: \label{FIG0r}
865: \end{figure}
866: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
867: 
868: \section{Conclusion}
869: 
870: In this paper we have extended a previous study \cite{carl06} of
871: Affymetrix spike-in experiments on the HGU95 chip to the more recent HGU133
872: chipset. We used the model introduced in Ref.~\cite{carl06}, which
873: takes into account both target-probe and target-target hybridization
874: in solution. The hybridization free energies are calculated from the
875: nearest-neighbor model \cite{bloo00} using the experimental parameters
876: for RNA/DNA \cite{sugi95_sh,sugi00} and RNA/RNA \cite{xia98_sh}.
877: There are four global fitting parameters in the model, which we
878: took from Ref.~\cite{carl06}. We found that these parameters also fit
879: the current data on the HGU133 chipset well, apart from a small systematic
880: shift of all the estimates of the absolute target concentrations.
881: 
882: There are several features that make the spike-in data of the more recent
883: HGU133 chip interesting. First of all, the spike-in set contains a larger
884: number of sequences compared to the HGU95 experiments (42 instead of 14)
885: and the chip has been entirely redesigned. Secondly, the spike-in
886: sequences contain some of artificial origin, designed to avoid any
887: cross-hybridization with human RNAs, but prepared and labeled exactly
888: as all other spikes.  We find that these artificial sequences fit
889: the hybridization model best, as they show the best collapses when the data
890: are rescaled and plotted as a function of an appropriate thermodynamic
891: variable. The good agreement suggests indeed that the simple model
892: describes rather well the hybridization in Affymetrix arrays and that
893: the deviations observed for some human sequences are probably related to
894: the non-optimal design of the sequences for a given probe.
895: 
896: Compared to the human sequences of the HGU95 spike-in experiments
897: analyzed in Ref.~\cite{carl06}, the artificial spikes of
898: the HGU133 set show definitely better collapses.  However, when comparing
899: the human sequences of the HGU133 with those in the HGU95 experiment we
900: find on average a better collapse for the latter. Only a few probes out
901: of the 30 human spikes of the HGU133 experiment have a better collapse
902: than those of the HGU95.
903: 
904: Interestingly, the physics-based modeling developed here allows one to assign
905: to each probe set a quality score based on the level of agreement with the
906: Langmuir model. This information may be used to reconsider and possibly
907: redesign the probe sets of low quality.
908: 
909: Finally, we have discussed the physical basis of hybridization in solution
910: and of RNA secondary structure formation. The latter effect, according
911: to the statistics over the spike-in probes, will be relevant for about
912: 10\% of the probes only. The sequences with the highest folding probability
913: correspond to probes whose measured fluorescence intensity is well below
914: that predicted from the Langmuir model.
915: 
916: According to our current understanding of the system (see also
917: Refs.~\cite{carl06,heim06}), the hybridization in solution of
918: partially complementary RNA molecules has a strong influence on the measured intensities.  One of
919: the reasons for this is that RNA/RNA interaction parameters are, at a
920: given temperature and salt concentration, stronger than the DNA/DNA or
921: RNA/DNA parameters. The simple approximation given in Eq.~(\ref{alpha})
922: captures the major features of the hybridization in solution. However,
923: an improvement over this approach, as discussed above, remains an open
924: challenge.
925: 
926: We acknowledge financial support from the Van Gogh Programme d'Actions
927: Int\'egr\'ees (PAI) 08505PB of the French Ministry of Foreign Affairs
928: and the NWO grant 62403735.
929: 
930: % \bibliography{/home/enrico/TEX/biblio.bib}
931: 
932: \begin{thebibliography}{20}
933: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
934: \expandafter\ifx\csname bibnamefont\endcsname\relax
935:   \def\bibnamefont#1{#1}\fi
936: \expandafter\ifx\csname bibfnamefont\endcsname\relax
937:   \def\bibfnamefont#1{#1}\fi
938: \expandafter\ifx\csname citenamefont\endcsname\relax
939:   \def\citenamefont#1{#1}\fi
940: \expandafter\ifx\csname url\endcsname\relax
941:   \def\url#1{\texttt{#1}}\fi
942: \expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi
943: \providecommand{\bibinfo}[2]{#2}
944: \providecommand{\eprint}[2][]{\url{#2}}
945: 
946: \bibitem[{\citenamefont{Schena et~al.}(1995)\citenamefont{Schena, Shalon,
947:   Davis, and Brown}}]{sche95}
948: \bibinfo{author}{\bibfnamefont{M.}~\bibnamefont{Schena}},
949:   \bibinfo{author}{\bibfnamefont{D.}~\bibnamefont{Shalon}},
950:   \bibinfo{author}{\bibfnamefont{R.~W.} \bibnamefont{Davis}}, \bibnamefont{and}
951:   \bibinfo{author}{\bibfnamefont{P.~O.} \bibnamefont{Brown}},
952:   \bibinfo{journal}{Science} \textbf{\bibinfo{volume}{270}},
953:   \bibinfo{pages}{467} (\bibinfo{year}{1995}).
954: 
955: \bibitem[{\citenamefont{Marshall}(2004)}]{mars04}
956: \bibinfo{author}{\bibfnamefont{E.}~\bibnamefont{Marshall}},
957:   \bibinfo{journal}{Science} \textbf{\bibinfo{volume}{306}},
958:   \bibinfo{pages}{630} (\bibinfo{year}{2004}).
959: 
960: \bibitem[{\citenamefont{Tan et~al.}(2003)}]{tan03_sh}
961: \bibinfo{author}{\bibfnamefont{P.~K.} \bibnamefont{Tan}} \bibnamefont{et~al.},
962:   \bibinfo{journal}{Nucleic Acids Res.} \textbf{\bibinfo{volume}{31}},
963:   \bibinfo{pages}{5676} (\bibinfo{year}{2003}).
964: 
965: \bibitem[{\citenamefont{Peterson et~al.}(2002)\citenamefont{Peterson, Wolf, and
966:   Georgiadis}}]{pete02}
967: \bibinfo{author}{\bibfnamefont{A.~W.} \bibnamefont{Peterson}},
968:   \bibinfo{author}{\bibfnamefont{L.~K.} \bibnamefont{Wolf}}, \bibnamefont{and}
969:   \bibinfo{author}{\bibfnamefont{R.~M.} \bibnamefont{Georgiadis}},
970:   \bibinfo{journal}{J. Am. Chem. Soc.} \textbf{\bibinfo{volume}{124}},
971:   \bibinfo{pages}{14601} (\bibinfo{year}{2002}).
972: 
973: \bibitem[{\citenamefont{Okahata et~al.}(1998)}]{okah98_sh}
974: \bibinfo{author}{\bibfnamefont{Y.}~\bibnamefont{Okahata}} \bibnamefont{et~al.},
975:   \bibinfo{journal}{Anal. Chem.} \textbf{\bibinfo{volume}{70}},
976:   \bibinfo{pages}{1288} (\bibinfo{year}{1998}).
977: 
978: \bibitem[{\citenamefont{Vainrub and Pettitt}(2002)}]{vain02}
979: \bibinfo{author}{\bibfnamefont{A.}~\bibnamefont{Vainrub}} \bibnamefont{and}
980:   \bibinfo{author}{\bibfnamefont{B.~M.} \bibnamefont{Pettitt}},
981:   \bibinfo{journal}{Phys. Rev. E} \textbf{\bibinfo{volume}{66}},
982:   \bibinfo{pages}{041905} (\bibinfo{year}{2002}).
983: 
984: \bibitem[{\citenamefont{Held et~al.}(2003)\citenamefont{Held, Grinstein, and
985:   Tu}}]{held03}
986: \bibinfo{author}{\bibfnamefont{G.~A.} \bibnamefont{Held}},
987:   \bibinfo{author}{\bibfnamefont{G.}~\bibnamefont{Grinstein}},
988:   \bibnamefont{and} \bibinfo{author}{\bibfnamefont{Y.}~\bibnamefont{Tu}},
989:   \bibinfo{journal}{Proc. Natl. Acad. Sci.} \textbf{\bibinfo{volume}{100}},
990:   \bibinfo{pages}{7575} (\bibinfo{year}{2003}).
991: 
992: \bibitem[{\citenamefont{Naef and Magnasco}(2003)}]{naef03}
993: \bibinfo{author}{\bibfnamefont{F.}~\bibnamefont{Naef}} \bibnamefont{and}
994:   \bibinfo{author}{\bibfnamefont{M.~O.} \bibnamefont{Magnasco}},
995:   \bibinfo{journal}{Phys. Rev. E} \textbf{\bibinfo{volume}{68}},
996:   \bibinfo{pages}{011906} (\bibinfo{year}{2003}).
997: 
998: \bibitem[{\citenamefont{Hagan and Chakraborty}(2004)}]{haga04}
999: \bibinfo{author}{\bibfnamefont{M.~F.} \bibnamefont{Hagan}} \bibnamefont{and}
1000:   \bibinfo{author}{\bibfnamefont{A.~K.} \bibnamefont{Chakraborty}},
1001:   \bibinfo{journal}{J. Chem. Phys.} \textbf{\bibinfo{volume}{120}},
1002:   \bibinfo{pages}{4958} (\bibinfo{year}{2004}).
1003: 
1004: \bibitem[{\citenamefont{Halperin et~al.}(2004)\citenamefont{Halperin, Buhot,
1005:   and Zhulina}}]{halp04}
1006: \bibinfo{author}{\bibfnamefont{A.}~\bibnamefont{Halperin}},
1007:   \bibinfo{author}{\bibfnamefont{A.}~\bibnamefont{Buhot}}, \bibnamefont{and}
1008:   \bibinfo{author}{\bibfnamefont{E.~B.} \bibnamefont{Zhulina}},
1009:   \bibinfo{journal}{Biophys. J.} \textbf{\bibinfo{volume}{86}},
1010:   \bibinfo{pages}{718} (\bibinfo{year}{2004}).
1011: 
1012: \bibitem[{\citenamefont{Binder and Preibisch}(2005)}]{bind05}
1013: \bibinfo{author}{\bibfnamefont{H.}~\bibnamefont{Binder}} \bibnamefont{and}
1014:   \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Preibisch}},
1015:   \bibinfo{journal}{Biophys. J.} \textbf{\bibinfo{volume}{89}},
1016:   \bibinfo{pages}{337} (\bibinfo{year}{2005}).
1017: 
1018: \bibitem[{\citenamefont{Carlon and Heim}(2006)}]{carl06}
1019: \bibinfo{author}{\bibfnamefont{E.}~\bibnamefont{Carlon}} \bibnamefont{and}
1020:   \bibinfo{author}{\bibfnamefont{T.}~\bibnamefont{Heim}},
1021:   \bibinfo{journal}{Physica A} \textbf{\bibinfo{volume}{362}},
1022:   \bibinfo{pages}{433} (\bibinfo{year}{2006}).
1023: 
1024: \bibitem[{\citenamefont{Sugimoto et~al.}(1995)}]{sugi95_sh}
1025: \bibinfo{author}{\bibfnamefont{N.}~\bibnamefont{Sugimoto}}
1026:   \bibnamefont{et~al.}, \bibinfo{journal}{Biochemistry}
1027:   \textbf{\bibinfo{volume}{34}}, \bibinfo{pages}{11211} (\bibinfo{year}{1995}).
1028: 
1029: \bibitem[{\citenamefont{Sugimoto et~al.}(2000)\citenamefont{Sugimoto, Nakano,
1030:   and Nakano}}]{sugi00}
1031: \bibinfo{author}{\bibfnamefont{N.}~\bibnamefont{Sugimoto}},
1032:   \bibinfo{author}{\bibfnamefont{M.}~\bibnamefont{Nakano}}, \bibnamefont{and}
1033:   \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Nakano}},
1034:   \bibinfo{journal}{Biochemistry} \textbf{\bibinfo{volume}{39}},
1035:   \bibinfo{pages}{11270} (\bibinfo{year}{2000}).
1036: 
1037: \bibitem[{\citenamefont{Xia et~al.}(1998)}]{xia98_sh}
1038: \bibinfo{author}{\bibfnamefont{T.}~\bibnamefont{Xia}} \bibnamefont{et~al.},
1039:   \bibinfo{journal}{Biochemistry} \textbf{\bibinfo{volume}{37}},
1040:   \bibinfo{pages}{14719} (\bibinfo{year}{1998}).
1041: 
1042: \bibitem[{\citenamefont{Heim et~al.}(2006)\citenamefont{Heim, {Klein
1043:   Wolterink}, Carlon, and Barkema}}]{heim06}
1044: \bibinfo{author}{\bibfnamefont{T.}~\bibnamefont{Heim}},
1045:   \bibinfo{author}{\bibfnamefont{J.}~\bibnamefont{{Klein Wolterink}}},
1046:   \bibinfo{author}{\bibfnamefont{E.}~\bibnamefont{Carlon}}, \bibnamefont{and}
1047:   \bibinfo{author}{\bibfnamefont{G.~T.} \bibnamefont{Barkema}},
1048:   \bibinfo{journal}{J. Phys.: Cond. Matt.} \textbf{\bibinfo{volume}{18}},
1049:   \bibinfo{pages}{S525} (\bibinfo{year}{2006}).
1050: 
1051: \bibitem[{\citenamefont{Bloomfield et~al.}(2000)\citenamefont{Bloomfield,
1052:   Crothers, and {Tinoco, Jr.}}}]{bloo00}
1053: \bibinfo{author}{\bibfnamefont{V.~A.} \bibnamefont{Bloomfield}},
1054:   \bibinfo{author}{\bibfnamefont{D.~M.} \bibnamefont{Crothers}},
1055:   \bibnamefont{and} \bibinfo{author}{\bibfnamefont{I.}~\bibnamefont{{Tinoco,
1056:   Jr.}}}, \emph{\bibinfo{title}{Nucleic Acids Structures, Properties and
1057:   Functions}} (\bibinfo{publisher}{University Science Books, Mill Valley},
1058:   \bibinfo{year}{2000}).
1059: 
1060: \bibitem[{bur()}]{burd04}
1061: \bibinfo{howpublished}{C. J. Burden and Y. Pittelkow and S. R. Wilson, ``An
1062:   adsorption model of hybridization behaviour on oligonucleotide microarrays'',
1063:   preprint q-bio.BM/0411005}.
1064: 
1065: \bibitem[{pri()}]{private_affy}
1066: \bibinfo{howpublished}{Affymetrix Europe, private communication.}
1067: 
1068: \bibitem[{\citenamefont{Hofacker}(2003)}]{hofa03}
1069: \bibinfo{author}{\bibfnamefont{I.~L.} \bibnamefont{Hofacker}},
1070:   \bibinfo{journal}{Nucleic Acids Res.} \textbf{\bibinfo{volume}{31}},
1071:   \bibinfo{pages}{3429} (\bibinfo{year}{2003}).
1072: 
1073: \end{thebibliography}
1074: \end{document}
1075: