
1: \documentclass[aps,pre,twocolumn,floatfix,showpacs]{revtex4}

2: \usepackage{graphicx}

3: \usepackage{bm}

4: \bibstyle{apsrev.bst}

5:

6: \newcommand{\ave}[1]{\langle #1\rangle}

7:

8: \begin{document}

9: \title{Physics-based analysis of Affymetrix microarray data}

10: \author{T. Heim}

11: \affiliation{Interdisciplinary Research Institute c/o IEMN, Cit\'e

12: Scientifique BP 60069, F-59652 Villeneuve d'Ascq, France}

13: \author{L.-C. Tranchevent}

14: \affiliation{Interdisciplinary Research Institute c/o IEMN, Cit\'e

15: Scientifique BP 60069, F-59652 Villeneuve d'Ascq, France}

16: \author{E. Carlon}

17: \affiliation{Interdisciplinary Research Institute c/o IEMN, Cit\'e

18: Scientifique BP 60069, F-59652 Villeneuve d'Ascq, France}

19: \affiliation{Ecole Polytechnique Universitaire de Lille, Cit\'e

20: Scientifique, F-59655 Villeneuve d'Ascq, France}

21: \author{G. T. Barkema}

22: \affiliation{Institute for Theoretical Physics, University of Utrecht,

23: Leuvenlaan 4, 3584 CE Utrecht}

24: \date{\today}

25:

26: \begin{abstract}

27: We analyze publicly available data on Affymetrix microarray spike-in

28: experiments on the human HGU133 chipset in which sequences are added in

29: solution at known concentrations.  The spike-in set contains sequences

30: of bacterial, human and artificial origin.  Our analysis is based on a

31: recently introduced molecular-based model [E. Carlon and T. Heim, Physica

32: A {\bf 362}, 433 (2006)] which takes into account both probe-target

33: hybridization and target-target partial hybridization in solution.

34: The hybridization free energies are obtained from the nearest-neighbor

35: model with experimentally determined parameters.  The molecular-based

36: model suggests a rescaling that should result in a ``collapse'' of the

37: data at different concentrations into a single universal curve.  We indeed

38: find such a collapse, with the same parameters as obtained before for the

39: older HGU95 chip set.  The quality of the collapse varies according to

40: the probe set considered.  Artificial sequences, chosen by Affymetrix

41: to be as different as possible from any other human genome sequence,

42: generally show a much better collapse and thus a better agreement with

43: the model than all other sequences. This suggests that the observed

44: deviations from the predicted collapse are related to the choice of

45: probes or have a biological origin, rather than being a problem with

46: the proposed model.  \end{abstract}

47:

48: \pacs{87.15.-v,82.39.Pj}

49:

50: \maketitle

51:

52: \newcommand{\ul}{\underline}

53: \newcommand{\bc}{\begin{center}}

54: \newcommand{\ec}{\end{center}}

55: \newcommand{\be}{\begin{equation}}

56: \newcommand{\ee}{\end{equation}}

57: \newcommand{\ba}{\begin{array}}

58: \newcommand{\ea}{\end{array}}

59: \newcommand{\beqn}{\begin{eqnarray}}

60: \newcommand{\eeqn}{\end{eqnarray}}

61:

62: \section{Introduction}

63: \label{sec:intro}

64:

65: DNA microarrays \cite{sche95} make it possible to measure the expression level

66: of thousands of genes simultaneously. This is a major step forward

67: compared to traditional methods in molecular biology (such as Northern

68: blots) which are applicable only to a limited set of genes at a time.

69: The determination of gene expression levels is not the only application

70: of DNA microarrays, which have been used also for the analysis of

71: genetic variance between individuals (single nucleotide polymorphisms),

72: as efficient tools for DNA sequencing, for the study of chromosomal

73: defects and for the determination of alternative splicing events.

74:

75: Despite the increasing popularity of microarrays in recent years,

76: there are still some problems with the technology. There

77: has been, for instance, only a moderate effort in comparing different

78: microarray platforms on the same biological system \cite{mars04}. When

79: this comparison was made, as in a recent study on expression analysis of

80: stressed-out pancreas cells, it was found that different commercial

81: platforms produced wildly incompatible data \cite{tan03_sh}.  These

82: problems call for a better fundamental understanding of the functioning of

83: the microarrays. Such understanding will help researchers to design better

84: algorithms for microarray data analysis based on the physical-chemistry

85: of the underlying hybridization process.

86:

87: In recent years several experiments have addressed the analysis

88: of equilibrium and dynamical properties of DNA hybridization to

89: probes anchored on solid surfaces, using techniques such as

90: surface plasmon resonance \cite{pete02} and the quartz

91: microbalance \cite{okah98_sh}.  At the same time several papers

92: \cite{vain02,held03,naef03,haga04,halp04,bind05,carl06} have been dedicated

93: to theoretical aspects of hybridization, mostly discussing the Langmuir

94: model and variants thereof.

95:

96: In a previous paper \cite{carl06} we have analyzed a series of publicly

97: available data of experiments performed on Affymetrix microarrays, using a

98: simple model of the hybridization process. In these experiments a set of

99: selected genes are ``spiked-in'' at fixed concentrations into a solution

100: containing other types of RNAs. This set of data has been widely used

101: as a testing ground for algorithms designed to extract gene expression levels

102: from the raw data. Affymetrix is one of the major commercial producers of

103: microarrays. In Affymetrix arrays the surface-bound probes are prepared in

104: situ by photolithographic techniques.  Although the technique is limited

105: to rather short oligos (25 nucleotides long) one of the advantages is

106: that a high density of probe sequences per array can be obtained. In the

107: latest generation 1,400,000 different probes have been placed in a single

108: array. The large number of probes compensates for their limited length.

109: Indeed Affymetrix uses multiple probes per gene, which define a probe set.

110: Another special feature of Affymetrix chips is the use, as a control, of a

111: mismatch (MM) probe sequence, which differs from a perfect-matching (PM)

112: sequence only at the base at position 13: a nucleotide A is interchanged

113: with T and a nucleotide C is interchanged with G.

114:

115: In our previous work \cite{carl06} we focused on the spike-in data set

116: of the HGU95 human chipset. More recently this has been substituted

117: by the HGU133 chipset. Probe sets have been completely redesigned in

118: the HGU133 chipset; moreover there are only 11 probes per probe set

119: compared to the 16 probes of the HGU95 array.  In this paper we focus

120: on the analysis of publicly available spike-in data on the HGU133 chip,

121: building on our previous work \cite{carl06} on HGU95. This will allow us

122: to test the robustness of the model introduced in Ref. \cite{carl06}

123: to a new set of data.  There is another interesting feature of the

124: spike-in data of the HGU133 chipset: unlike the HGU95 data,

125: where spikes correspond to human genes, the spikes in the HGU133 have

126: been selected from human, bacterial and ``artificial'' sequences.

127: The latter were selected by Affymetrix to avoid cross-hybridization with

128: any known human coding sequence.

129:

130: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

131: \begin{figure}[t]

132: \includegraphics[width=8.5cm]{FIG01.eps}

133: \caption{(a) The simple model of hybridization in Affymetrix microarrays

134: used throughout this paper is defined by two basic reactions: 1)

135: Hybridization of target molecules ({\it t}) to surface-anchored

136: probes ({\it p}) leading to a duplex {\it pt} and 2) The hybridization

137: between target molecules in solution leading to the partial duplexes

138: $t {\hat t}_{i,j}$.  In the model, the effect of the hybridization in

139: solution amounts to a reduction of the original target concentration

140: $c$ to a value $\alpha c$.  (b) Partial hybridization of a fragment in

141: solution complementary to the target RNA sequence from base $i$ to base

142: $j$ ($1 \leq i < j \leq 25$).

143: }

144: \label{FIG00}

145: \end{figure}

146: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

147:

148: \section{A simple model for hybridization in Affymetrix arrays}

149: \label{sec:model}

150:

151: In this section we briefly recall the model introduced in

152: Ref. \cite{carl06}. Two basic processes are considered: 1) Target-probe

153: hybridization and 2) Target-target hybridization in solution.  According

154: to the model the fluorescence signal measured from a given probe is:

155: \be

156: I = I_0 + \frac{A \alpha c e^{\beta \Delta G}}{1 + \alpha c e^{\beta \Delta G}}

157: \label{fluorescence}

158: \ee

159: where $I_0$ indicates a background level due to non-specific

160: hybridization, $A$ sets the scale of intensities, $c$ is the target

161: concentration (a measure of the gene expression level), $\Delta G$

162: the target/probe hybridization free energy, $\beta = 1/RT$ the

163: inverse temperature, $R$ the universal gas constant. Here, $\alpha$

164: models the reduction in the concentration of available targets due to

165: the target-target hybridization in solution: only a fraction $\alpha

166: c$ is available for the hybridization with probes as the remaining

167: $(1-\alpha)c$ form stable duplexes with other partners in solution (see

168: Fig. \ref{FIG00}(a)).
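
As a consistency check on Eq. (\ref{fluorescence}), it may help to spell
out its two limiting regimes. Writing $x = \alpha c\, e^{\beta \Delta G}$
as a shorthand used only in this remark, one has
\[
I - I_0 \simeq A x \quad (x \ll 1),
\qquad
I - I_0 \simeq A \quad (x \gg 1),
\]
so that the signal is linear in the available target concentration
$\alpha c$ far from saturation and becomes independent of it close to
saturation. The half-saturation point $I - I_0 = A/2$ is reached at $x=1$,
i.e. at $c = e^{-\beta \Delta G}/\alpha$.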

169:

170: In the model introduced in Ref. \cite{carl06}, we approximate the

171: target-target hybridization with the expression

172: \be

173: \alpha \approx \frac 1

174: {1 + \tilde{c} \exp{\left( \beta' \Delta G_R^{(37)} \right)}}

175: \label{alpha}

176: \ee

177: with $\beta'$ and $\tilde{c}$ fitted parameters and $\Delta G_R^{(37)}

178: \equiv \Delta G_R (1,25)$ the (sequence dependent) RNA/RNA free energy

179: for duplex formation in solution at 37 degrees calculated over the whole

180: 25-mer length; in close approximation, the binding free energies at 37 and

181: 45 degrees (the actual experimental temperature) are almost identical,

182: apart from a small scaling factor, which is absorbed into the rescaled

183: temperature $\beta'$. In the next section, we will discuss the steps

184: leading to Eq. (\ref{alpha}) in more detail.

185:

186: The model defined in Eqs. (\ref{fluorescence}) and (\ref{alpha}) contains

187: the four fitting parameters $A$, $\beta$, $\beta'$ and $\tilde{c}$ which

188: were fitted against the spike-in data of the Affymetrix array HGU95a in

189: Ref. \cite{carl06}.  The parameters $\beta'$, $\tilde{c}$ and $A$ will

190: be discussed in Sec. \ref{sec:hyb_sol} and Sec. \ref{sec:saturation}.

191: The parameter $\beta$ is the inverse temperature. Instead of fixing

192: it to the experimental value we have kept it as a fitting parameter

193: as explained in Ref. \cite{carl06}. The hybridization free energies

194: $\Delta G$ and $\Delta G_R$ are calculated from tabulated experimental

195: data for DNA/RNA \cite{sugi95_sh,sugi00} and RNA/RNA \cite{xia98_sh}

196: duplex formation in solution.

197:

198: We note that we fit mismatches and perfect matches with the same model.

199: The difference between the two is that there is a different hybridization

200: free energy $\Delta G$: one expects a lower signal for mismatches compared

201: to perfect matches, due to weaker binding. This is not always the case;

202: as remarked in several studies for a substantial fraction of probes (30\%,

203: as reported in Ref. \cite{naef03}) one observes ``bright mismatches''

204: for which the mismatch intensity $I_{\rm MM}$ exceeds the intensity

205: $I_{\rm PM}$ of the perfect match. However, it has been observed

206: \cite{bind05} that bright MM come predominantly from probes with low

207: intensity, which suggests that bright mismatches are associated with

208: weak specific hybridization when the signal $I$ is dominated by $I_0$

209: in Eq. (\ref{fluorescence}).

210:

211: In recent work \cite{heim06} we also compared the current model

212: with the approach based on position-dependent effective affinities as

213: for instance described in Refs. \cite{naef03,bind05}. The conclusion is

214: that the two approaches are fully consistent with each other, provided

215: that various effects are incorporated such as partial unzipping of

216: the probe-target complex, less than 100\% efficiency in the probe

217: growth during lithography, and entropic repulsion between the target

218: and the substrate.  These additional effects are the main factors

219: causing position-dependence (and thus allowing for a comparison with

220: position-dependent effective affinities); for a quantitative prediction

221: of the intensities, their combined effect can be well approximated by

222: a slight decrease of $\beta$ in Eq. (\ref{fluorescence}) and they are

223: therefore not included in the current study.

224:

225: \section{On the hybridization in solution}

226: \label{sec:hyb_sol}

227:

228: We now discuss the approximations leading to the form of $\alpha$.

229: We denote the concentration of free 25-mer targets in solution as $[t]$,

230: the concentration of free target strands that are complementary from

231: nucleotide $i$ up to and including nucleotide $j$ as $[\hat{t}_{i,j}]$, and

232: the concentration of duplexes between these two as $[t\,\hat{t}_{i,j}]$.

233: Chemical equilibrium (see Fig. \ref{FIG00}(b)) yields for the equilibrium

234: constant:

235: \be

236: K_{i,j} = \frac{[t][\hat{t}_{i,j}]} {[t\hat{t}_{i,j}]} = e^{-\beta \Delta G_R (i,j)},

237: \label{equilib}

238: \ee

239: where $\Delta G_R (i,j)$ is the RNA/RNA hybridization free energy for

240: target molecules in solution, which are complementary from nucleotide $i$ up to

241: and including $j$, and $\beta=1.59$ mol/kcal (corresponding to the experimental

242: temperature of 45 degrees).

243: For a given gene, the measure of the

244: gene expression level which one wants to determine is the total target

245: concentration $c$ given by

246: \be

247: c=[t] + \sum_{i,j} [t \hat{t}_{i,j}].

248: \label{conserv}

249: \ee

250: Solving Eqs.(\ref{conserv}) and (\ref{equilib}) we find

251: for the fraction of single stranded target in solution:

252: \be

253: \alpha_f = \frac{[t]}{c} = \frac{1}{1+\sum_{i,j} [\hat{t}_{i,j}]

254: \exp (\beta \Delta G_R (i,j)) }.

255: \label{alpha_full}

256: \ee

257: Note that the summation in the denominator of Eq. (\ref{alpha_full})

258: was replaced in the approximate expression Eq. (\ref{alpha}) by

259: the single term $\tilde{c}\exp(\beta' \Delta G_R^{(37)})$, with

260: fitting parameters $\tilde{c}$ and $\beta'$.
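
For completeness, the intermediate step between Eqs. (\ref{equilib}),
(\ref{conserv}) and (\ref{alpha_full}) can be sketched as follows, treating
the free concentrations $[\hat{t}_{i,j}]$ as given inputs.
Eq. (\ref{equilib}) gives $[t\hat{t}_{i,j}] = [t]\,[\hat{t}_{i,j}]\,
e^{\beta \Delta G_R(i,j)}$, and inserting this into Eq. (\ref{conserv})
yields
\[
c = [t] \left( 1 + \sum_{i,j} [\hat{t}_{i,j}]\,
e^{\beta \Delta G_R(i,j)} \right),
\]
from which Eq. (\ref{alpha_full}) follows by taking the ratio $[t]/c$.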

261:

262: Eq.~(\ref{alpha_full}) requires as input estimates of the

263: concentration $[\hat{t}_{i,j}]$ of complementary sequences with length

264: $l=j-i+1$, present in solution.  Assuming that all four nucleotides

265: are roughly equally abundant, and that there are no correlations along

266: the sequence, the abundance of short sequences with length $l$ will

267: decrease as $[\hat{t}_{i,j}] \sim 4^{-l}$.  This scaling breaks down

268: beyond some length $L$; assuming for the human transcriptome a total

269: length of $10^7$ nucleotides, a random sequence longer than 12 is more

270: likely than not to be absent altogether, since $4^{12} \gtrsim 10^7$. We therefore

271: take as our approximation

272: \be

273: [\hat{t}_{i,j}] =

274: \left\{

275: \begin{array}{cc}

276: c_0\cdot 4^{-(j-i)}	& {\rm for}\ j-i< 12,\\

277: 0		        & {\rm otherwise.}

278: \end{array}

279: \right.

280: \label{concdrop}

281: \ee

282: Here, $c_0$ is a measure of the RNA concentration.  Using this

283: approximation for the concentration of complementary strands, we can now

284: compare Eqs.~(\ref{alpha}) and (\ref{alpha_full}).  Fig. \ref{alphacompare}

285: shows the more elaborate model Eq.(\ref{alpha_full}) as a function of

286: the approximate form Eq.~(\ref{alpha}), with the values for the fitting

287: parameters $\beta'$ and $\tilde{c}$ taken from Ref.~\cite{carl06}.

288: There is a reasonable agreement between the two.
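
The cutoff length used in Eq.~(\ref{concdrop}) can be checked with a
rough estimate, under the same assumptions of equal nucleotide abundance
and no correlations along the sequence. In a random sequence of
$N = 10^7$ nucleotides, the expected number of occurrences of a given
sequence of length $l$ is approximately $N\, 4^{-l}$:
\[
\begin{array}{c|ccc}
l & 11 & 12 & 13 \\ \hline
N\,4^{-l} & 2.4 & 0.60 & 0.15
\end{array}
\]
The expected count drops below one around $l=12$, consistent with
$4^{12} \approx 1.7 \times 10^7 \gtrsim 10^7$.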

289:

290:

291: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

292: \begin{figure}

293: \includegraphics[width=8.0cm]{FIG02.eps}

294: \caption{Comparison of the summation in Eq.~(\ref{alpha_full}), equal

295: to $\alpha_f^{-1}-1$, and its approximation in Eq.~(\ref{alpha}),

296: equal to $\alpha^{-1}-1$, for the first 1,000 spike-in sequences of

297: HGU133.  Note that a change in $c_0$ corresponds to a vertical shift

298: over $\log(c_0)$; in this figure, we used $c_0=1$.  The straight line

299: is a fit, given by $y=x+b$ with $b=-14.1$.

300: \label{alphacompare}

301: }

302: \end{figure}

303: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

304:

305: Since Eq.~(\ref{alpha_full}) has a better microscopic foundation than

306: Eq.~(\ref{alpha}), it should in principle allow for a better estimate

307: of the hybridization in solution.  There are however severe limitations

308: to the use of Eq.~(\ref{alpha_full}).

309: In the hybridization in solution, there is a competition between the

310: contributions of short sequences, which are abundant but have a low

311: affinity, versus long sequences, for which the concentration is low but

312: the affinity high. The concentration drops on average approximately by a

313: factor of 4 per added nucleotide (see Eq.~(\ref{concdrop})), but the binding free energy

314: grows by approximately $\langle \Delta G\rangle \approx$ 2 or 3 kcal/mol,

315: the average value of RNA/RNA interaction parameters~\cite{bloo00}. Since

316: $\exp(\beta \langle \Delta G\rangle) > 4$, the longer sequences dominate

317: the hybridization in solution. However, as discussed above, beyond length

318: $L\approx 12$, there simply are no complementary strands. The accuracy

319: of the more elaborate model Eq.~(\ref{alpha_full}) thus hinges crucially

320: on knowing the longest complementary strand which is transcribed, as

321: well as its affinity and its concentration.  Since the approximate model

322: Eq.~(\ref{alpha}) is not expected to perform worse than the more elaborate

323: model Eq.~(\ref{alpha_full}), we keep using the former.
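
The inequality $\exp(\beta \langle \Delta G\rangle) > 4$ used above can be
made explicit. Taking the nominal value $\beta=1.59$ mol/kcal quoted
below Eq.~(\ref{equilib}) as an illustrative assumption, one finds
\[
e^{\beta \langle \Delta G \rangle} \approx e^{3.18} \approx 24
\quad (\langle \Delta G \rangle = 2\ {\rm kcal/mol}),
\qquad
e^{\beta \langle \Delta G \rangle} \approx e^{4.77} \approx 118
\quad (\langle \Delta G \rangle = 3\ {\rm kcal/mol}).
\]
Each added nucleotide thus changes the corresponding term in the sum of
Eq.~(\ref{alpha_full}) by a net factor of roughly $24/4 = 6$ to
$118/4 \approx 30$, which favors the longest complementary strands
available.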

324:

325: The data points in Fig.~\ref{alphacompare} can be fitted by a

326: straight line with slope 1: the value of $\beta'=0.67$ mol/kcal

327: in Ref.\cite{carl06}, corresponding to 725 K, apparently is the

328: appropriate value to describe the experiments at a temperature

329: of 45 degrees. The offset in the straight-line fit is equal to

330: $\log(\tilde{c})-\log(c_0)$. Since the straight-line fit has an offset of

331: $-14.1$, and since we used the fitted value of $\tilde{c}=2\cdot 10^{-2}$

332: pM in Ref.~\cite{carl06}, an estimate of the RNA concentration is

333: $c_0=\exp(14.1)\cdot \tilde{c}=30$ nM.  Even if we do not use the more

334: elaborate model Eq.~(\ref{alpha_full}), it provides us with a microscopic

335: basis for the values of the parameters $\beta'$ and $\tilde{c}$ in the

336: approximate model Eq.~(\ref{alpha}).
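
The arithmetic behind this estimate, assuming natural logarithms in the
fit of Fig.~\ref{alphacompare}, is
\[
c_0 = \tilde{c}\, e^{14.1}
\approx 2\cdot 10^{-2}\ {\rm pM} \times 1.3 \cdot 10^{6}
\approx 2.7 \cdot 10^{4}\ {\rm pM} \approx 27\ {\rm nM},
\]
which rounds to the value of 30 nM quoted above.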

337:

338: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

339: \begin{figure}[t]

340: \includegraphics[width=7.0cm]{FIG03.eps}

341: \caption{Plot of intensity vs. concentration for three spike-in genes

342: of the HGU133 chipset. $I_{\rm max}$ indicates the saturation value

343: obtained from a three-parameter ($I_0$, $A$ and $K$) non-linear fit

344: based on Eq. (\ref{fit_c}).}

345: \label{Ivsc}

346: \end{figure}

347: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

348:

349: \section{On the signal saturation level}

350: \label{sec:saturation}

351:

352: If the target concentration $c$ and the binding energy $\Delta G$

353: are sufficiently high, the Langmuir isotherm saturates to a maximal

354: value. From Eq. (\ref{fluorescence}) we find for

355: $\alpha c \exp(\beta \Delta G) \gg 1$

356: \be

357: I_{\rm max} = I_0 + A \approx A,

358: \ee

359: where we have used the fact that typically the background level, $I_0$,

360: is much lower than the value of $A$. The saturation intensity arises if

361: targets are bound to almost all probes. Since the number of probes does

362: not vary between the sequences being measured, this saturation intensity

363: is also expected to be sequence-independent, and more specifically,

364: should not distinguish between perfect matches and mismatches.  A recent

365: analysis of the Latin square set \cite{held03,burd04} reported widely

366: different values for the saturation intensity. It is worth clarifying

367: this issue further here.

368:

369: The obvious procedure to determine the saturation intensity is to look at

370: the intensity of a probe as a function of concentration. Assuming an

371: effective affinity $K_s$ for probe sequence $s$, the intensity $I_s(c)$ as a

372: function of concentration $c$ is given by

373: \be

374: I_s(c) = I_{0,s} + \frac{A_s c K_s}{1+c K_s},

375: \label{fit_c}

376: \ee

377: in which $I_{0,s}$ is the (sequence-dependent) background intensity

378: due to non-specific binding.  A plot of $I_s$ vs. $c$ for two probes of

379: the HGU133 spike-in set is shown in Fig. \ref{Ivsc}. Taking $I_0$, $A$

380: and $K$ in Eq.~(\ref{fit_c}) as fitting parameters, and extrapolating

381: to high concentration then yields the saturation intensity.
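
As a side remark, and only as a sketch that is not part of the analyses
cited here, expanding Eq.~(\ref{fit_c}) for $c K_s \ll 1$ gives
\[
I_s(c) - I_{0,s} \simeq A_s K_s\, c \left( 1 - c K_s \right).
\]
As long as the measured concentrations satisfy $c K_s \ll 1$, the data
constrain mainly the product $A_s K_s$, while $A_s$ itself enters only
through the small curvature term. For weakly binding probes the
extrapolated saturation value is therefore expected to be sensitive
to noise.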

382:

383: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

384: \begin{figure}[t]

385: \includegraphics[width=8.0cm]{FIG04.eps}

386: \caption{Plot of $I-I_0$ as a function of $\Delta G - R T\log \alpha$

387: for 4 sequences spiked-in at a concentration of $c=512$ pM.  The numbers

388: indicate the probe set numbers. Smaller characters are used for the

389: MM signals. Solid lines represent the Langmuir model as given by

390: Eq. (\ref{fluorescence}). The data are consistent, except for a few outliers, with

391: the Langmuir model with roughly constant saturation level $A \approx 10^4$.}

392: \label{IvsDG_star}

393: \end{figure}

394: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

395:

396: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

397: \begin{figure*}[t]

398: \includegraphics[width=7.5cm]{FIG05a.eps}

399: \includegraphics[width=7.5cm]{FIG05b.eps}

400: \caption{Histograms of the PM and MM intensities for the Latin square

401: experiments in log-log scale for the chips HGU95a (a) and HGU133 (b). The

402: plots contain 19 histograms referring to different experiments (a) and

403: 12 experiments (b).  The dashed lines are positioned at $I=10000$ and

404: $I=15000$ (intensities are given in Affymetrix scale). Insets: histograms

405: of the total intensity of PM and MM together.}

406: \label{FIG0h}

407: \end{figure*}

408: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

409:

410: Two research groups \cite{held03,burd04} followed this procedure, and both

411: found saturation intensities that vary wildly between different sequences.

412: A first effect that can cause deviations from the Langmuir fit

413: Eq.~(\ref{fit_c}) is that the lithographic process, through which

414: the probes are synthesized in situ in Affymetrix chips, is not 100\%

415: efficient. As estimated by Burden~\cite{burd04}, only about 10\% of the

416: probes reach the full length of 25 nucleotides. At low intensities far

417: from saturation, the incomplete probes can be safely ignored since their

418: affinity is much lower than that of the fully grown probes. However, under

419: conditions where the fully grown probes are saturated, clearly there will

420: be contributions to the fluorescent intensity from the almost complete

421: probes, and an even further increase in concentration will bring into

422: play shorter and shorter incomplete probes.  Consequently, the Langmuir

423: fit Eq.~(\ref{fit_c}) breaks down near saturation; extrapolation to high

424: concentration is an unreliable procedure.

425:

426: A second cause of worry is that comparing fluorescent intensities from

427: different chips is also potentially unreliable, since the microarrays

428: might have undergone slightly different processing during the washing

429: and staining. Since Affymetrix microarrays cannot be reused, the

430: spike-in measurements used in Refs.~\cite{held03,burd04} required a new

431: chip for each concentration.

432:

433: To avoid these two potential sources of error, we therefore consider

434: the intensities for a given probe set at a specific concentration,

435: i.e. $c$ constant and $\Delta G$ and $\alpha$ variables in

436: Eq. (\ref{fluorescence}).  The data belong to the same array.

437: An example of this type of analysis is shown in Fig. \ref{IvsDG_star}

438: for a concentration of $c=512$ pM.  On the horizontal axis we plot

439: $\Delta G^* = \Delta G - RT \log \alpha$.  The solid lines are

440: given by the Langmuir curve Eq. (\ref{fluorescence}).  Note that

441: the large majority of the probes align along the expected curve,

442: with a few exceptions, such as probe 11 (both PM and MM) for

443: the probe set 204414\_at.  Therefore, the data are consistent with

444: a value of $A$ roughly constant in Eq. (\ref{fluorescence}), which

445: suggests indeed that the large variations in $I_{\rm max}$ obtained

446: from the extrapolations of the data in the earlier analysis are more

447: likely to be an artifact of the extrapolations. Note however that

448: some variability of the saturation level can be seen in the data of

449: Fig. \ref{IvsDG_star}. Typically this variability is about $20\%$. In

450: order to keep the model simple we will keep $A$ constant in the rest of

451: the paper. An interesting possible explanation of the variability of $A$

452: has been given in Ref. \cite{burd04}, i.e. that this variation is due

453: to the post-hybridization washing of the array.

454:

455: Yet another way of addressing the issue of the saturation

456: intensities is to analyze the histogram of the intensities on the whole

457: chip, as in Fig. \ref{FIG0h}, which shows both the intensities for the

458: HGU95 and HGU133 spike-in data. To reveal the data at high intensities,

459: they are plotted in a log-log scale. In the figure we note a drop in the

460: histogram around $I \approx 10\ 000$, sharper in the HGU133 chipset,

461: which is consistent with the estimate of the saturation intensity

462: obtained from the fits of intensities vs $\Delta G - R T \log \alpha$,

463: as given in Fig. \ref{IvsDG_star}. Note that in Fig. \ref{FIG0h}(b)

464: the drop is 100-fold in the range $10\ 000 < I < 15\ 000$, which

465: suggests that the data are consistent with a roughly constant value

466: of the saturation. However, a closer inspection of the histogram

467: of the HGU133 for PM and MM intensities separately reveals that the

468: estimated saturation value for the two may be different. In the case of

469: PM intensities alone the drop is rather sharp at around $I \approx 10\

470: 000$; the MM intensities, however, seem to saturate at lower intensities,

471: which is not seen in the HGU95 data (Fig. \ref{FIG0h}(a)).  The number

472: of MM probes reaching an intensity close to the saturation level in the

473: histogram of Fig. \ref{FIG0h}(b) is quite small, so it cannot be concluded with

474: certainty that the MM and PM probes reach different saturation levels.

475:

476: The low-intensity side of the histograms in Fig.~\ref{FIG0h} also contains

477: interesting information. For both the HGU95 and HGU133, the histogram

478: drops steeply below a minimal intensity. For HGU95, this drop occurs

479: around $I_{\rm min}\approx 70$, while for HGU133 the drop occurs around

480: $I_{\rm min}\approx 30$. This increase of the dynamical intensity range

481: by more than a factor of two is a clear demonstration of the fast rate

482: of improvement in microarray technology.

483:

484: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

485: \begin{figure*}[t]

486: \includegraphics[width=4.2cm]{FIG06_AFFX-DapX.eps}

487: \includegraphics[width=4.2cm]{FIG06_AFFX-LysX.eps}

488: \includegraphics[width=4.2cm]{FIG06_AFFX-PheX.eps}

489: \includegraphics[width=4.2cm]{FIG06_AFFX-ThrX.eps}

490:

491: \includegraphics[width=4.2cm]{FIG06_AFFX-TagA.eps}

492: \includegraphics[width=4.2cm]{FIG06_AFFX-TagB.eps}

493: \includegraphics[width=4.2cm]{FIG06_AFFX-TagC.eps}

494: \includegraphics[width=4.2cm]{FIG06_AFFX-TagD.eps}

495:

496: \includegraphics[width=4.2cm]{FIG06_AFFX-TagE.eps}

497: \includegraphics[width=4.2cm]{FIG06_AFFX-TagF.eps}

498: \includegraphics[width=4.2cm]{FIG06_AFFX-TagG.eps}

499: \includegraphics[width=4.2cm]{FIG06_AFFX-TagH.eps}

500: \caption{Collapse plots for the 4 bacterial and the 8 artificial

501: sequences of the HGU133 spike-in set.

502: In these plots the background subtracted intensities for a given probe

503: set are plotted as functions of the rescaled variable $x'$ given in

504: Eq.~(\ref{xprime}). The data correspond to all spike-in concentrations

505: for a given probe set. Solid lines correspond to the Langmuir isotherm.

506: Compared with the human and bacterial sequences the

507: artificial sequences are characterized by the best collapses.}


511: \label{collapse_A}

512: \end{figure*}

513: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

514:

515: \section{Analysis of data collapses}

516: \label{sec:analysis}

517:

518: As a test of the validity of the model we plotted \cite{carl06} the

519: data as a function of the rescaled variable:

520: \be

521: x' = \alpha c e^{\beta \Delta G}.

522: \label{xprime}

523: \ee

524: If the model is to be trusted the data for different values of $c$ and

525: different probe sequences (i.e. different $\Delta G$ and $\alpha$)

526: ought to ``collapse'' onto a single master curve

527: \be

528: I - I_0 = \frac{A x'}{1+x'}.

529: \label{rescaled}

530: \ee

531: This collapse has indeed been observed in the large majority of the

532: spike-in genes of the HGU95a chipset \cite{carl06}. Interestingly, the

533: very few outliers observed in that case could be explained as annotation

534: errors or an imbalance of the free energies used for specific nucleotides,

535: as discussed in Ref. \cite{carl06}.

536:

537: We choose here the same fitting parameters used in Ref. \cite{carl06}

538: for the HGU95 chipset, that is: $A= 10\ 000$, $\beta = 0.74$ mol/kcal,

539: $\beta'= 0.67$ mol/kcal and $\tilde{c}= 10^{-2}$ pM. These parameters

540: fit equally well the HGU133 spike-in data.
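
To give a feeling for the master curve, Eq.~(\ref{rescaled}) evaluated
with $A = 10\ 000$ gives the following illustrative values (in Affymetrix
intensity units):
\[
\begin{array}{c|ccc}
x' & 0.1 & 1 & 10 \\ \hline
I - I_0 & 909 & 5000 & 9091
\end{array}
\]
Conversely, Eq.~(\ref{rescaled}) can be inverted for $I - I_0 < A$ to give
\[
x' = \frac{I - I_0}{A - (I - I_0)},
\]
so that a background-subtracted intensity maps onto a unique value of
the rescaled variable as long as the probe is below saturation.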

541:

542: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

543: \begin{figure*}[t]

544: \includegraphics[width=4.2cm]{FIG07_200665_s_at.eps}

545: \includegraphics[width=4.2cm]{FIG07_203471_s_at.eps}

546: \includegraphics[width=4.2cm]{FIG07_203508_at.eps}

547: \includegraphics[width=4.2cm]{FIG07_204205_at.eps}

548:

549: \includegraphics[width=4.2cm]{FIG07_204417_at.eps}

550: \includegraphics[width=4.2cm]{FIG07_204430_s_at.eps}

551: \includegraphics[width=4.2cm]{FIG07_204513_s_at.eps}

552: \includegraphics[width=4.2cm]{FIG07_204563_at.eps}

553:

554: \includegraphics[width=4.2cm]{FIG07_204836_at.eps}

555: \includegraphics[width=4.2cm]{FIG07_204912_at.eps}

556: \includegraphics[width=4.2cm]{FIG07_204951_at.eps}

557: \includegraphics[width=4.2cm]{FIG07_204959_at.eps}

558:

559: % \includegraphics[width=4.2cm]{../TXT/204951_at1_ALPHA.eps}

560: % \includegraphics[width=4.2cm]{../TXT/204959_at1_ALPHA.eps}

561: \includegraphics[width=4.2cm]{FIG07_205267_at.eps}

562: \includegraphics[width=4.2cm]{FIG07_205291_at.eps}

563: \caption{Collapse plots for human sequences of the HGU133 spike-in set

564: (part 1). The probes complementary to the targets with the largest

565: folding free energies (in absolute value) are emphasized (see Table \ref{table_fold}). They

566: correspond to probes 204912\_at10 and 204513\_s\_at4.

567: }

568: \label{collapse_H1}

569: \end{figure*}

570: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

571:

572: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

573: \begin{table}[b]

574: \caption{List of values of $\langle w \rangle$ and $\sigma_w$ for the bacterial

575: and the artificial sequences in the spike-in set HGU133.}

576: \begin{ruledtabular}

577: \begin{tabular}{lll|lll}

578: Probe set & $\langle w \rangle$ & $\sigma_w$ & Probe set & $\langle w \rangle$ & $\sigma_w$\\

579: \hline

580: AFFX-DapX-3\_at & 0.08 & 1.49 & AFFX-PheX-3\_at     & 0.16  & 1.55\\

581: AFFX-LysX-3\_at & 0.89 & 2.46 & AFFX-ThrX-3\_at     & 0.22  & 1.59\\

582: AFFX-r2-TagA\_at  & -1.05 & 0.97 & AFFX-r2-TagE\_at & -0.32 & 0.82\\

583: AFFX-r2-TagB\_at  & -0.51 & 0.83 & AFFX-r2-TagF\_at & -0.46 & 1.09\\

584: AFFX-r2-TagC\_at  &  0.43 & 1.08 & AFFX-r2-TagG\_at & -0.11 & 0.90\\

585: AFFX-r2-TagD\_at  & -0.03 & 0.90 & AFFX-r2-TagH\_at &  0.11 & 1.22

586: \end{tabular}

587: \end{ruledtabular}

588: \label{table_bacterial}

589: \end{table}

590:

591: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

592: \begin{table}[t]

593: \caption{List of values of $\langle w \rangle$ and $\sigma_w$ for

594: the human sequences in the spike-in set HGU133.}

595: \begin{ruledtabular}

596: \begin{tabular}{lcc|lcc}

597: Probe set & $\langle w \rangle$ & $\sigma_w$ & Probe set & $\langle w \rangle$ & $\sigma_w$\\

598: \hline

599: 200665\_s\_at   &  0.54 & 1.26   & 205569\_at         &  -0.28 & 1.12\\

600: 203471\_s\_at   &  0.39 & 1.43   & 205692\_s\_at      &   0.24 & 1.27\\

601: 203508\_at      &  0.45 & 1.83   & 205790\_at         &  -0.78 & 0.76\\

602: 204205\_at      &  0.86 & 2.11   & 206060\_s\_at      &   0.52 & 1.66\\

603: 204417\_at      & -0.24 & 1.18   & 207160\_at         &  -0.32 & 1.06\\

604: 204430\_s\_at   & -0.48 & 1.13   & 207540\_s\_at      &  -0.29 & 0.62\\

605: 204513\_s\_at   & -0.68 & 1.16   & 207641\_at         &   0.24 & 2.72\\

606: 204563\_at      & -0.57 & 1.44   & 207655\_s\_at      &   0.76 & 1.06\\

607: 204836\_at      & -0.04 & 1.41   & 207777\_s\_at      &  -0.14 & 1.11\\

608: 204912\_at      & -0.31 & 1.35   & 207968\_s\_at      &  -0.85 & 1.66\\

609: 204951\_at      & -0.15 & 1.48   & 209354\_at         &   0.04 & 1.41\\

610: 204959\_at      &  1.33 & 1.62   & 209606\_at         &   0.77 & 1.44\\

611: 205267\_at      &  0.36 & 1.23   & 209734\_at         &  -0.20 & 1.51\\

612: 205291\_at      & -0.44 & 1.24   & 209795\_at         &   0.63 & 1.71\\

613: 205398\_s\_at   & -0.15 & 1.37   & 212827\_at         &   0.61 & 2.53

614: \end{tabular}

615: \end{ruledtabular}

616: \label{table_human}

617: \end{table}

618: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

619:

620: In Figs. \ref{collapse_A}, \ref{collapse_H1} and \ref{collapse_H2}

621: we show the collapse plots for all the 42 genes of the spike-in data

622: set HGU133. Each plot contains about 200 points, which all tend to

623: cluster (in some cases much better than others) along the Langmuir

624: curve $Ax'/(1+x')$. All the 13 concentrations, which range from $0.125$

625: pM to $512$ pM in the spike-in experiment, are shown. The intensities

626: measured at $c=0$ are taken as estimates of the background level $I_0$

627: in Eq.~(\ref{rescaled}).  In the collapse plots, only the MM sequences

628: for which a $\Delta G$ could be estimated are shown, as the mismatch

629: free energies in RNA/DNA duplexes are known only for a limited set of

630: mismatches \cite{sugi00} (we could associate a free energy to about 30\%

631: of mismatches, as discussed in Ref. \cite{carl06}).

632:

633: The HGU133 spike-in set contains 4 bacterial sequences and 8

634: artificial sequences (Fig. \ref{collapse_A}) and 30 human sequences

635: (Figs. \ref{collapse_H1} and \ref{collapse_H2}).  A perfect agreement

636: with the Langmuir theory would imply that the data all align along the

637: curve given by Eq. (\ref{rescaled}) and shown as a solid line in Figs.

638: \ref{collapse_A}, \ref{collapse_H1} and \ref{collapse_H2}. In general the

639: agreement is best for the artificial sequences. Occasionally, some

640: human sequences also collapse well onto a single curve in good agreement with

641: the Langmuir model, but in general their behavior is worse than that of the artificial

642: ones.  In order to measure the data dispersion we introduce the variable:

643: \be

644: w = \log \left( \frac{I}{I_{\rm th}} \right),

645: \label{define_w}

646: \ee

647: where $I$ is the measured intensity and $I_{\rm th}$ the theoretical

648: value as predicted from the Langmuir isotherm (Eq.~(\ref{rescaled}))

649: for the $x'$ corresponding to the measured $I$. For the definition of $w$

650: in Eq. (\ref{define_w}) we have kept only the values of $I$ in the

651: range $100 < I < 10000$.  We determine its average $\langle w \rangle$

652: and standard deviation $\sigma_w$. If the data are well-centered around

653: the expected behavior one has $\langle w \rangle =0$, while $\sigma_w$

654: is a measure of the spread in the data.

655:
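For concreteness, the two statistics can be written out explicitly. This is
a sketch under assumptions not specified above: the index $i$ runs over the
$N$ data points of a probe set retained after the intensity cut, and the
variance is normalized by $N$ (a normalization by $N-1$ would change
$\sigma_w$ only marginally for $N \gg 1$):
\begin{equation}
\langle w \rangle = \frac{1}{N} \sum_{i=1}^{N} w_i ,
\qquad
\sigma_w^2 = \frac{1}{N} \sum_{i=1}^{N}
\left( w_i - \langle w \rangle \right)^2 .
\end{equation}
Since $w$ is the logarithm of a ratio, a nonzero $\langle w \rangle$
corresponds to a systematic multiplicative offset of the measured intensities
with respect to $I_{\rm th}$, whereas $\sigma_w$ quantifies the multiplicative
scatter around that offset.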

656: The values of $\langle w \rangle$ and $\sigma_w$ are given in

657: Table \ref{table_bacterial} for the bacterial and artificial sequences and in

658: Table \ref{table_human} for the human sequences.  We note

659: that $\sigma_w$ is on average the lowest for the artificial sequences,

660: with a typical value $\sigma_w \approx 1$. Only for two human probe sets

661: (205790\_at and 207540\_s\_at, with $\sigma_w \approx 0.7$) is the collapse

662: better than that of the artificial sequences. For three human probe

663: sets (204205\_at, 207641\_at and 212827\_at) the collapse is very poor,

664: as indicated by $\sigma_w > 2$. The collapses for the four bacterial

665: sequences have a somewhat higher dispersion than those for the human sequences.

666:
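As a rough summary of the two tables, one may average $\sigma_w$ over the
probe sets of each class. The unweighted averages below are computed directly
from the entries of Tables \ref{table_bacterial} and \ref{table_human}; the
symbol $\overline{\sigma}_w$ is introduced here only for this illustration:
\begin{eqnarray}
\overline{\sigma}_w^{\rm (artificial)} &=& 7.81/8 \approx 0.98, \nonumber \\
\overline{\sigma}_w^{\rm (human)} &=& 42.88/30 \approx 1.43, \nonumber \\
\overline{\sigma}_w^{\rm (bacterial)} &=& 7.09/4 \approx 1.77 .
\end{eqnarray}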

667:

668: A very interesting feature of the whole analysis is that the quality

669: of the collapses is generally much better for the artificial sequences than for the

670: other sequences. Artificial sequences have been chosen by Affymetrix

671: to be as different as possible from any human RNA so as to minimize the

672: effects of cross-hybridization. Their preparation, as far as labeling and

673: target fragmentation are concerned, is the same as for all other spikes

674: \cite{private_affy}. As the same set of parameters is used in all collapses,

675: the high $\sigma_w$ for some probe sets is very likely an indication

676: that the selected probes are not yet optimal.  Deviations from

677: the theory may be due to cross-hybridization.

678:

679: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

680: \begin{figure*}[t]

681: \includegraphics[width=4.2cm]{FIG08_205398_s_at.eps}

682: \includegraphics[width=4.2cm]{FIG08_205569_at.eps}

683: \includegraphics[width=4.2cm]{FIG08_205692_s_at.eps}

684: \includegraphics[width=4.2cm]{FIG08_205790_at.eps}

685:

686: \includegraphics[width=4.2cm]{FIG08_206060_s_at.eps}

687: \includegraphics[width=4.2cm]{FIG08_207160_at.eps}

688: \includegraphics[width=4.2cm]{FIG08_207540_s_at.eps}

689: \includegraphics[width=4.2cm]{FIG08_207641_at.eps}

690:

691: \includegraphics[width=4.2cm]{FIG08_207655_s_at.eps}

692: \includegraphics[width=4.2cm]{FIG08_207777_s_at.eps}

693: \includegraphics[width=4.2cm]{FIG08_207968_s_at.eps}

694: \includegraphics[width=4.2cm]{FIG08_209354_at.eps}

695:

696: \includegraphics[width=4.2cm]{FIG08_209606_at.eps}

697: \includegraphics[width=4.2cm]{FIG08_209734_at.eps}

698: \includegraphics[width=4.2cm]{FIG08_209795_at.eps}

699: \includegraphics[width=4.2cm]{FIG08_212827_at.eps}

700: \caption{Collapse plots for human sequences of the HGU133 spike-in set

701: (part 2). The probes complementary to the targets with the largest

702: folding free energies (in absolute value) are emphasized (see Table \ref{table_fold}). They

703: correspond to probes 207641\_at5 and 209354\_at8.

704: }

705: \label{collapse_H2}

706: \end{figure*}

707: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

708:

709: \section{Determination of the expression level}

710:

711: The model defined by Eqs. (\ref{fluorescence}) and (\ref{alpha}), once

712: all parameters have been fixed, can be used to fit the concentration

713: $c$ starting from the measured intensities. The target concentration

714: in solution is a measure of the gene expression level and it

715: is the quantity one wants to compute from the raw microarray data.

716: As the concentrations in the spike-in experiments are known, we can

717: compare the known values with the fitted ones. Figure \ref{CvsCspike}

718: shows a plot of fitted concentration vs. spike-in concentration for the

719: artificial sequences. We limit ourselves here to showing the data for these

720: sequences, but the trend is quite general and valid for other

721: genes as well.  The solid line in Fig. \ref{CvsCspike} corresponds to

722: a line $y=x$, which means perfect agreement between spike-in and fitted

723: values. The two other lines correspond to $y=2x$ and $y = x/2$, drawn as

724: a guide to the eye.

725:
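The fitting procedure is not spelled out in detail here. As a minimal
per-probe sketch, assuming that $\alpha$ does not itself depend on $c$ and
that the rescaled form of Eqs.~(\ref{xprime}) and (\ref{rescaled}) applies,
a single background-subtracted intensity can be inverted analytically:
\begin{equation}
x' = \frac{I - I_0}{A - (I - I_0)},
\qquad
c = \frac{e^{-\beta \Delta G}}{\alpha} \,
\frac{I - I_0}{A - (I - I_0)},
\end{equation}
valid for $0 < I - I_0 < A$. An estimate for a whole probe set could then be
obtained, for instance, by combining the values of $c$ from the individual
probes; this combination is given as an illustration and is not necessarily
the procedure used to produce Fig.~\ref{CvsCspike}.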

726: As shown in Fig. \ref{CvsCspike}, most of the data fall in the range between

727: the two lines, except for the spikes TagA and TagF which give a much

728: lower fitted concentration. All the points follow approximately straight

729: lines with slope 1, except for the highest spike-in concentrations,

730: corresponding to $256$ and $512$ pM. This is due to the fact that at

731: high concentrations many probes are very close to saturation.

732:
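The loss of accuracy near saturation can be made quantitative with a short
estimate, based only on Eqs.~(\ref{xprime}) and (\ref{rescaled}) and on the
assumption that $\alpha$ is independent of $c$. With the shorthand
$\Delta I \equiv I - I_0$ (a notation introduced here), one has
$x' = \Delta I/(A - \Delta I)$, and a small uncertainty $\delta(\Delta I)$
on the intensity propagates to the concentration as
\begin{equation}
\frac{\delta c}{c} = \frac{\delta x'}{x'} =
\frac{A}{A - \Delta I} \, \frac{\delta (\Delta I)}{\Delta I} .
\end{equation}
A probe at $90\%$ of saturation ($\Delta I = 0.9 A$) thus amplifies the
relative uncertainty on the intensity by a factor of $10$, whereas far from
saturation the amplification factor approaches one.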

733: We note also that the fitted concentrations are all systematically

734: lower than the spike-in values, as most of the concentrations fall

735: in the interval $[c_{\rm spike-in}/2,c_{\rm spike-in}]$. This is a

736: consequence of our choice to use the fitting parameters which we took from

737: a previous study \cite{carl06} of spike-in experiments on the HGU95 chipset. We

738: have chosen not to refit these parameters here again for HGU133,

739: to illustrate their universal validity. The slight underestimation of

740: the absolute concentration is not a problem, since in gene expression

741: measurements one is only interested in fold-variations of expression

742: levels between different experimental conditions. The fact that the data

743: of Fig. \ref{CvsCspike} follow lines with a slope of approximately one

744: guarantees that the fold-change in concentration in different experiments

745: is correctly estimated.

746:
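This argument can be stated in one line. Suppose, as an idealization of the
data in Fig.~\ref{CvsCspike}, that the fitted concentration is proportional
to the spike-in one, $c_{\rm fit} = k \, c_{\rm spike-in}$, with a constant
$k$ independent of $c$ (the symbols $c_{\rm fit}$ and $k$ are introduced here
for illustration). Then, for two experimental conditions $1$ and $2$,
\begin{equation}
\frac{c_{\rm fit}^{(1)}}{c_{\rm fit}^{(2)}} =
\frac{k \, c_{\rm spike-in}^{(1)}}{k \, c_{\rm spike-in}^{(2)}} =
\frac{c_{\rm spike-in}^{(1)}}{c_{\rm spike-in}^{(2)}} ,
\end{equation}
so that the prefactor $k$ cancels. On doubly logarithmic scales,
$\log c_{\rm fit} = \log k + \log c_{\rm spike-in}$ is a line of slope one
shifted vertically by $\log k$.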

747: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

748: \begin{figure}[t]

749: \includegraphics[width=8.5cm]{FIG09.eps}

750: \caption{Plot of the fitted target concentration as a function of the

751: spike-in concentration for the artificial sequences. The solid line corresponds

752: to the diagonal $y=x$, while the two dotted lines are $y=x/2$ and $y=2x$

753: and are drawn as guides to the eye. We note a systematic shift of the

754: estimated absolute concentration compared to the spike-in one, although

755: the fold-variations of the concentrations are correctly estimated

756: as the majority of the data follow lines parallel to the diagonal in

757: the plot.}

758: \label{CvsCspike}

759: \end{figure}

760: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

761:

762: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

763: \begin{table}[t]

764: \caption{Minimal folding free energies for the targets (assumed to be 25-mers)

765: complementary to the probes forming the spike-in HGU133 data set. These

766: free energies are calculated with the program RNAfold.}

767: \begin{ruledtabular}

768: \begin{tabular}{ccc}

769: Probe set & Probe number & $-\Delta G_{\rm fold}$ (kcal/mol) \\

770: \hline

771: 204513\_s\_at	& 4	& 8.70 \\

772: 207641\_at	& 5	& 8.16 \\

773: 204430\_s\_at	& 10	& 7.79 \\

774: 209354\_at	& 8	& 7.67 \\

775: 207540\_s\_at	& 10	& 7.45 \\

776: AFFX-r2-TagA\_at& 1	& 6.52 \\

777: 205398\_s\_at	& 1	& 6.43 \\

778: AFFX-PheX-3\_at	& 10	& 6.18 \\

779: 204836\_at	& 10	& 6.17 \\

780: 203508\_at	& 2	& 6.10 \\

781: 206060\_s\_at 	& 3	& 6.05 \\

782: \end{tabular}

783: \end{ruledtabular}

784: \label{table_fold}

785: \end{table}

786: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

787:

788: \section{One cause of outliers: Target secondary structures}

789:

790: It is well known that single-stranded nucleic acids, particularly RNA,

791: tend to form stable folded conformations by binding of complementary

792: bases. Currently, algorithms that calculate RNA secondary structures

793: are to be trusted for sufficiently short molecules, say less than 50

794: nucleotides, which is the case for Affymetrix microarrays, where RNA

795: targets are fragmented before hybridization. The average target length

796: is $50$ nucleotides, but probably only shorter fragments contribute to hybridization.

797:

798: We used the Vienna package \cite{hofa03} for the calculation of folded

799: RNA structures that may form in solution and impede hybridization.

800: We considered first 25-mer targets in solution exactly complementary to

801: the probes of the HGU133 spike-in data set.  Table \ref{table_fold} shows

802: a list of probes in this set, whose complementary target has the lowest

803: folding free energy, i.e. that of the most stable conformation, calculated

804: at the experimental temperature of $45^\circ$ C. Given a folding free energy

805: $\Delta G_{\rm fold}$, one can use the two-state model approximation to

806: find $p_{\rm fold}$, the probability that the sequence is folded into the

807: most stable conformation:

808: \be

809: p_{\rm fold} = \frac{e^{-\Delta G_{\rm fold}/RT}}{1+e^{-\Delta G_{\rm fold}/RT}},

810: \label{p_fold}

811: \ee

812: where we use $T=45^\circ$ C.  According to this expression, for a folding

813: free energy $\Delta G_{\rm fold}= -8$~kcal/mol one finds

814: $1 - p_{\rm fold} \approx 4\cdot 10^{-6}$, and for $\Delta G_{\rm fold}= -6$~kcal/mol one finds

815: $1 - p_{\rm fold} \approx 10^{-4}$. Therefore the large majority of the

816: targets complementary to the probes listed in Table \ref{table_fold}

817: are folded and not expected to participate in hybridization.

818:
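The orders of magnitude quoted above can be checked as follows. We take
$R \approx 1.987 \times 10^{-3}$ kcal/(mol K) and $T = 318.15$ K (numerical
values supplied here for the illustration), so that $RT \approx 0.632$
kcal/mol. From Eq.~(\ref{p_fold}) the unfolded fraction is
\begin{equation}
1 - p_{\rm fold} = \frac{1}{1 + e^{-\Delta G_{\rm fold}/RT}}
\approx e^{\Delta G_{\rm fold}/RT}
\qquad (-\Delta G_{\rm fold} \gg RT) ,
\end{equation}
which gives $e^{-12.7} \approx 3 \cdot 10^{-6}$ for
$\Delta G_{\rm fold} = -8$~kcal/mol and $e^{-9.5} \approx 8 \cdot 10^{-5}$
for $\Delta G_{\rm fold} = -6$~kcal/mol, of the same order as the values
given in the text. For the most stable target of Table \ref{table_fold}
($\Delta G_{\rm fold} = -8.70$~kcal/mol) the same estimate yields
$1 - p_{\rm fold} \approx 10^{-6}$.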

819: Figure \ref{FIG0r} shows the folding configurations for the four

820: targets with the lowest free energy of Table \ref{table_fold}. As shown

821: in Figs. \ref{collapse_H1} and \ref{collapse_H2} the corresponding

822: probes have a signal which is a few orders of magnitude lower than that

823: expected from the Langmuir model, although not as low as derived from

824: Eq. (\ref{p_fold}), using the $\Delta G_{\rm fold}$ listed in Table

825: \ref{table_fold}.  For instance, from the measured signals we find

826: an intensity lower by a factor of $10^3$ for the probe 204513\_s\_at4,

827: instead of a factor of $10^6$ as deduced from Eq.~(\ref{p_fold}). This

828: difference could have several origins. First, the hybridization in

829: solution described by the term $\alpha$ in Eq.~(\ref{alpha}) may

830: already take into account some secondary structure formation. Second,

831: the RNA in solution is present with sequences of all lengths. The free

832: energies listed in Table \ref{table_fold} refer to 25-mers, so shorter

833: sequences will have a lower folding probability than that deduced from

834: Eq.~(\ref{p_fold}) on the basis of the free energies of 25-mers. Third,

835: even if some secondary structure is present, hybridization with the

836: surface-bound probes is still possible if the folded configuration has

837: some dangling ends from which binding can initiate.

838:

839: We have analyzed the folding free energies of 25-mers complementary to

840: all the probes in the HGU133 spike-in set. We found that about $50\%$ of

841: the targets have $-\Delta G_{\rm fold}$ lower than $1$ kcal/mol, so that

842: secondary structure formation can be safely neglected. About $10\%$

843: of the targets have $-\Delta G_{\rm fold}$ higher than $4$ kcal/mol, so

844: that for this fraction the secondary structure formation may interfere

845: with the target-probe hybridization.

846:

847: Summarizing, the correct estimate of the folding probability involves

848: a complex calculation over fragments of all lengths, possibly including

849: sequences neighboring the 25-mer part complementary to the probe. However,

850: the folding is expected to have a relevant effect for at most $10\%$ of

851: the probes. A possible way out is to exclude from the analysis

852: of the gene expression levels those probes whose 25-mer folding free

853: energy exceeds, in absolute value, a certain threshold.

854:

855: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

856: \begin{figure}[t]

857: \includegraphics[height=3.6cm]{FIG10_fold-204513_s_at4.ps}

858: \includegraphics[height=3.3cm]{FIG10_fold-207641_at5.ps}

859: \includegraphics[height=2.8cm]{FIG10_fold-204430_s_at10.ps}

860: \includegraphics[height=3.4cm]{FIG10_fold-209354_at8.ps}

861: \caption{Folding configurations for the four targets with the lowest

862: free energy. From left to right: 204513\_s\_at4, 207641\_at5,

863: 204430\_s\_at10 and 209354\_at8.}

864: \label{FIG0r}

865: \end{figure}

866: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIG_01 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

867:

868: \section{Conclusion}

869:

870: In this paper we have extended a previous study \cite{carl06} of

871: Affymetrix spike-in experiments on the HGU95 chip to the more recent HGU133

872: chipset. We used the model introduced in Ref. \cite{carl06} which

873: takes into account both target-probe and target-target hybridization

874: in solution. The hybridization free energies are calculated from the

875: nearest-neighbor model \cite{bloo00} using the experimental parameters

876: for RNA/DNA \cite{sugi95_sh,sugi00} and RNA/RNA \cite{xia98_sh}.

877: There are four global fitting parameters in the model that we

878: took from Ref. \cite{carl06}. We found that these parameters also fit

879: the current data on the HGU133 chipset well, apart from a small systematic

880: shift of all the estimates of the absolute target concentrations.

881:

882: There are several features that make the spike-in data of the more recent

883: HGU133 chip interesting. First of all the spike-in set contains a larger

884: number of sequences compared to the HGU95 experiments (42 instead of 14)

885: and the chip has been entirely redesigned. Secondly, the spike-in

886: sequences contain some of artificial origin, designed to avoid any

887: cross-hybridization with human RNAs, but prepared and labeled exactly

888: as all other spikes.  We find that these artificial sequences fit

889: the hybridization model best, as they show the best collapses when the data

890: are rescaled and plotted as a function of an appropriate thermodynamic

891: variable. The good agreement suggests indeed that the simple model

892: describes rather well the hybridization in Affymetrix arrays and that

893: the deviations observed for some human sequences are probably related to

894: the non-optimal design of the sequences for a given probe.

895:

896: When compared to the human sequences of the HGU95 spike-in experiments

897: analyzed in Ref. \cite{carl06}, we find that the artificial spikes of

898: the HGU133 set show definitely better collapses.  However, when comparing

899: the human sequences of the HGU133 with those in the HGU95 experiment we

900: find on average a better collapse for the latter. Only a few probes out

901: of the 30 human spikes of the HGU133 experiment have a better collapse

902: than those of the HGU95.

903:

904: Interestingly, the physics-based modeling developed here allows one to assign

905: to each probe set a quality score based on the level of agreement with the

906: Langmuir model. This information may be used to reconsider and possibly

907: redesign the probe sets of low quality.

908:

909: Finally, we have discussed the physical basis of hybridization in solution

910: and of RNA secondary structure formation. The latter effect, according

911: to the statistics over the spike-in probes, will be relevant for about

912: 10\% of the probes only. The sequences with the highest folding probability

913: correspond to probes whose measured fluorescence intensity is well below

914: that predicted by the Langmuir model.

915:

916: According to our current understanding of the system (see also

917: Refs. \cite{carl06,heim06}), the hybridization in solution of

918: partially complementary RNA molecules has a strong influence.  One of

919: the reasons for that is that RNA/RNA interaction parameters are, at a

920: given temperature and salt concentration, stronger than the DNA/DNA or

921: RNA/DNA parameters. The simple approximation given in Eq.~(\ref{alpha})

922: captures the major features of the hybridization in solution. However,

923: an improvement over this approach, as discussed above, remains an open

924: challenge.

925:

926: We acknowledge financial support from the Van Gogh Programme d'Actions

927: Int\'egr\'ees (PAI) 08505PB of the French Ministry of Foreign Affairs

928: and the NWO grant 62403735.

929:

930: % \bibliography{/home/enrico/TEX/biblio.bib}

931:

932: \begin{thebibliography}{20}

933: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi

934: \expandafter\ifx\csname bibnamefont\endcsname\relax

935:   \def\bibnamefont#1{#1}\fi

936: \expandafter\ifx\csname bibfnamefont\endcsname\relax

937:   \def\bibfnamefont#1{#1}\fi

938: \expandafter\ifx\csname citenamefont\endcsname\relax

939:   \def\citenamefont#1{#1}\fi

940: \expandafter\ifx\csname url\endcsname\relax

941:   \def\url#1{\texttt{#1}}\fi

942: \expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi

943: \providecommand{\bibinfo}[2]{#2}

944: \providecommand{\eprint}[2][]{\url{#2}}

945:

946: \bibitem[{\citenamefont{Schena et~al.}(1995)\citenamefont{Schena, Shalon,

947:   Davis, and Brown}}]{sche95}

948: \bibinfo{author}{\bibfnamefont{M.}~\bibnamefont{Schena}},

949:   \bibinfo{author}{\bibfnamefont{D.}~\bibnamefont{Shalon}},

950:   \bibinfo{author}{\bibfnamefont{R.~W.} \bibnamefont{Davis}}, \bibnamefont{and}

951:   \bibinfo{author}{\bibfnamefont{P.~O.} \bibnamefont{Brown}},

952:   \bibinfo{journal}{Science} \textbf{\bibinfo{volume}{270}},

953:   \bibinfo{pages}{467} (\bibinfo{year}{1995}).

954:

955: \bibitem[{\citenamefont{Marshall}(2004)}]{mars04}

956: \bibinfo{author}{\bibfnamefont{E.}~\bibnamefont{Marshall}},

957:   \bibinfo{journal}{Science} \textbf{\bibinfo{volume}{306}},

958:   \bibinfo{pages}{630} (\bibinfo{year}{2004}).

959:

960: \bibitem[{\citenamefont{Tan et~al.}(2003)}]{tan03_sh}

961: \bibinfo{author}{\bibfnamefont{P.~K.} \bibnamefont{Tan}} \bibnamefont{et~al.},

962:   \bibinfo{journal}{Nucleic Acids Res.} \textbf{\bibinfo{volume}{31}},

963:   \bibinfo{pages}{5676} (\bibinfo{year}{2003}).

964:

965: \bibitem[{\citenamefont{Peterson et~al.}(2002)\citenamefont{Peterson, Wolf, and

966:   Georgiadis}}]{pete02}

967: \bibinfo{author}{\bibfnamefont{A.~W.} \bibnamefont{Peterson}},

968:   \bibinfo{author}{\bibfnamefont{L.~K.} \bibnamefont{Wolf}}, \bibnamefont{and}

969:   \bibinfo{author}{\bibfnamefont{R.~M.} \bibnamefont{Georgiadis}},

970:   \bibinfo{journal}{J. Am. Chem. Soc.} \textbf{\bibinfo{volume}{124}},

971:   \bibinfo{pages}{14601} (\bibinfo{year}{2002}).

972:

973: \bibitem[{\citenamefont{Okahata et~al.}(1998)}]{okah98_sh}

974: \bibinfo{author}{\bibfnamefont{Y.}~\bibnamefont{Okahata}} \bibnamefont{et~al.},

975:   \bibinfo{journal}{Anal. Chem.} \textbf{\bibinfo{volume}{70}},

976:   \bibinfo{pages}{1288} (\bibinfo{year}{1998}).

977:

978: \bibitem[{\citenamefont{Vainrub and Pettitt}(2002)}]{vain02}

979: \bibinfo{author}{\bibfnamefont{A.}~\bibnamefont{Vainrub}} \bibnamefont{and}

980:   \bibinfo{author}{\bibfnamefont{B.~M.} \bibnamefont{Pettitt}},

981:   \bibinfo{journal}{Phys. Rev. E} \textbf{\bibinfo{volume}{66}},

982:   \bibinfo{pages}{041905} (\bibinfo{year}{2002}).

983:

984: \bibitem[{\citenamefont{Held et~al.}(2003)\citenamefont{Held, Grinstein, and

985:   Tu}}]{held03}

986: \bibinfo{author}{\bibfnamefont{G.~A.} \bibnamefont{Held}},

987:   \bibinfo{author}{\bibfnamefont{G.}~\bibnamefont{Grinstein}},

988:   \bibnamefont{and} \bibinfo{author}{\bibfnamefont{Y.}~\bibnamefont{Tu}},

989:   \bibinfo{journal}{Proc. Natl. Acad. Sci.} \textbf{\bibinfo{volume}{100}},

990:   \bibinfo{pages}{7575} (\bibinfo{year}{2003}).

991:

992: \bibitem[{\citenamefont{Naef and Magnasco}(2003)}]{naef03}

993: \bibinfo{author}{\bibfnamefont{F.}~\bibnamefont{Naef}} \bibnamefont{and}

994:   \bibinfo{author}{\bibfnamefont{M.~O.} \bibnamefont{Magnasco}},

995:   \bibinfo{journal}{Phys. Rev. E} \textbf{\bibinfo{volume}{68}},

996:   \bibinfo{pages}{011906} (\bibinfo{year}{2003}).

997:

998: \bibitem[{\citenamefont{Hagan and Chakraborty}(2004)}]{haga04}

999: \bibinfo{author}{\bibfnamefont{M.~F.} \bibnamefont{Hagan}} \bibnamefont{and}

1000:   \bibinfo{author}{\bibfnamefont{A.~K.} \bibnamefont{Chakraborty}},

1001:   \bibinfo{journal}{J. Chem. Phys.} \textbf{\bibinfo{volume}{120}},

1002:   \bibinfo{pages}{4958} (\bibinfo{year}{2004}).

1003:

1004: \bibitem[{\citenamefont{Halperin et~al.}(2004)\citenamefont{Halperin, Buhot,

1005:   and Zhulina}}]{halp04}

1006: \bibinfo{author}{\bibfnamefont{A.}~\bibnamefont{Halperin}},

1007:   \bibinfo{author}{\bibfnamefont{A.}~\bibnamefont{Buhot}}, \bibnamefont{and}

1008:   \bibinfo{author}{\bibfnamefont{E.~B.} \bibnamefont{Zhulina}},

1009:   \bibinfo{journal}{Biophys. J.} \textbf{\bibinfo{volume}{86}},

1010:   \bibinfo{pages}{718} (\bibinfo{year}{2004}).

1011:

1012: \bibitem[{\citenamefont{Binder and Preibisch}(2005)}]{bind05}

1013: \bibinfo{author}{\bibfnamefont{H.}~\bibnamefont{Binder}} \bibnamefont{and}

1014:   \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Preibisch}},

1015:   \bibinfo{journal}{Biophys. J.} \textbf{\bibinfo{volume}{89}},

1016:   \bibinfo{pages}{337} (\bibinfo{year}{2005}).

1017:

1018: \bibitem[{\citenamefont{Carlon and Heim}(2006)}]{carl06}

1019: \bibinfo{author}{\bibfnamefont{E.}~\bibnamefont{Carlon}} \bibnamefont{and}

1020:   \bibinfo{author}{\bibfnamefont{T.}~\bibnamefont{Heim}},

1021:   \bibinfo{journal}{Physica A} \textbf{\bibinfo{volume}{362}},

1022:   \bibinfo{pages}{433} (\bibinfo{year}{2006}).

1023:

1024: \bibitem[{\citenamefont{Sugimoto et~al.}(1995)}]{sugi95_sh}

1025: \bibinfo{author}{\bibfnamefont{N.}~\bibnamefont{Sugimoto}}

1026:   \bibnamefont{et~al.}, \bibinfo{journal}{Biochemistry}

1027:   \textbf{\bibinfo{volume}{34}}, \bibinfo{pages}{11211} (\bibinfo{year}{1995}).

1028:

1029: \bibitem[{\citenamefont{Sugimoto et~al.}(2000)\citenamefont{Sugimoto, Nakano,

1030:   and Nakano}}]{sugi00}

1031: \bibinfo{author}{\bibfnamefont{N.}~\bibnamefont{Sugimoto}},

1032:   \bibinfo{author}{\bibfnamefont{M.}~\bibnamefont{Nakano}}, \bibnamefont{and}

1033:   \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Nakano}},

1034:   \bibinfo{journal}{Biochemistry} \textbf{\bibinfo{volume}{39}},

1035:   \bibinfo{pages}{11270} (\bibinfo{year}{2000}).

1036:

1037: \bibitem[{\citenamefont{Xia et~al.}(1998)}]{xia98_sh}

1038: \bibinfo{author}{\bibfnamefont{T.}~\bibnamefont{Xia}} \bibnamefont{et~al.},

1039:   \bibinfo{journal}{Biochemistry} \textbf{\bibinfo{volume}{37}},

1040:   \bibinfo{pages}{14719} (\bibinfo{year}{1998}).

1041:

1042: \bibitem[{\citenamefont{Heim et~al.}(2006)\citenamefont{Heim, {Klein

1043:   Wolterink}, Carlon, and Barkema}}]{heim06}

1044: \bibinfo{author}{\bibfnamefont{T.}~\bibnamefont{Heim}},

1045:   \bibinfo{author}{\bibfnamefont{J.}~\bibnamefont{{Klein Wolterink}}},

1046:   \bibinfo{author}{\bibfnamefont{E.}~\bibnamefont{Carlon}}, \bibnamefont{and}

1047:   \bibinfo{author}{\bibfnamefont{G.~T.} \bibnamefont{Barkema}},

1048:   \bibinfo{journal}{J. Phys.: Cond. Matt.} \textbf{\bibinfo{volume}{18}},

1049:   \bibinfo{pages}{S525} (\bibinfo{year}{2006}).

1050:

1051: \bibitem[{\citenamefont{Bloomfield et~al.}(2000)\citenamefont{Bloomfield,

1052:   Crothers, and {Tinoco, Jr.}}}]{bloo00}

1053: \bibinfo{author}{\bibfnamefont{V.~A.} \bibnamefont{Bloomfield}},

1054:   \bibinfo{author}{\bibfnamefont{D.~M.} \bibnamefont{Crothers}},

1055:   \bibnamefont{and} \bibinfo{author}{\bibfnamefont{I.}~\bibnamefont{{Tinoco,

1056:   Jr.}}}, \emph{\bibinfo{title}{Nucleic Acids Structures, Properties and

1057:   Functions}} (\bibinfo{publisher}{University Science Books, Mill Valley},

1058:   \bibinfo{year}{2000}).

1059:

1060: \bibitem[{bur()}]{burd04}

1061: \bibinfo{howpublished}{C. J. Burden and Y. Pittelkow and S. R. Wilson, ``An

1062:   adsorption model of hybridization behaviour on oligonucleotide microarrays'',

1063:   preprint q-bio.BM/0411005}.

1064:

1065: \bibitem[{pri()}]{private_affy}

1066: \bibinfo{howpublished}{Affymetrix Europe, private communication.}

1067:

1068: \bibitem[{\citenamefont{Hofacker}(2003)}]{hofa03}

1069: \bibinfo{author}{\bibfnamefont{I.~L.} \bibnamefont{Hofacker}},

1070:   \bibinfo{journal}{Nucleic Acids Res.} \textbf{\bibinfo{volume}{31}},

1071:   \bibinfo{pages}{3429} (\bibinfo{year}{2003}).

1072:

1073: \end{thebibliography}

1074: \end{document}

1075: