%level%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%2345678901234567890123456789012345678901234567890123456789012345678901234567890
%        1         2         3         4         5         6         7         8

%\documentclass[letterpaper, 10 pt, conference]{ieeeconf}  % Comment this line out
                                                          % if you need a4paper
\documentclass[a4paper, 10pt, conference]{ieeeconf}      % Use this line for a4
                                                          % paper

\IEEEoverridecommandlockouts                              % This command is only
                                                          % needed if you want to
                                                          % use the \thanks command
\overrideIEEEmargins
% See the \addtolength command later in the file to balance the column lengths
% on the last page of the document

% The following packages can be found on http://www.ctan.org
\usepackage{graphics} % for pdf, bitmapped graphics files
\usepackage{epsfig} % for postscript graphics files
\usepackage{rotating}
%\usepackage{mathptmx} % assumes new font selection scheme installed
%\usepackage{times} % assumes new font selection scheme installed
%\usepackage{amsmath} % assumes amsmath package installed
%\usepackage{amssymb}  % assumes amsmath package installed

\title{\LARGE \bf
Does Logarithm Transformation of Microarray Data
Affect Ranking Order of Differentially Expressed Genes?
}

\author{Wentian Li, Young Ju Suh, Jingshan Zhang % <-this % stops a space
\thanks{W. Li is a Research Scientist with the Robert S Boas Center for Genomics and Human Genetics, Feinstein Institute for Medical Research, North Shore LIJ Health System,
        Manhasset, NY 11030, USA
        {\tt\small wli@nslij-genetics.org}}%
\thanks{Y.J. Suh is a Research Professor of
        The Research Institute of Natural Sciences, Sookmyung Women's University,
        Seoul 140-742, Korea.
        {\tt\small yjsprite@yahoo.co.kr}}%
\thanks{J. Zhang is a Senior Statistician at
        Forest Research Institute, Jersey City, NJ 07311, USA
        {\tt\small jingshan.zhang@frx.com}}%
}

\begin{document}

\maketitle
\thispagestyle{empty}
\pagestyle{empty}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{abstract}

A common practice in microarray analysis is to apply a logarithmic
transformation to the raw data (light intensity); the usual justification
is that it makes the distribution more symmetric and Gaussian-like. Since
this transformation is not universally practiced in
microarray analysis, we examined whether this
discrepancy in the treatment of raw data affects the
``high-level'' analysis result. In particular, we ask whether
the differentially expressed genes obtained by the
$t$-test, regularized $t$-test, or logistic regression have altered rank orders
due to the presence or absence of the transformation.
We show that as many as 20\%--40\% of significant genes
are ``discordant'' (significant in only one form of the
data and not in both), depending on the test being used and the
threshold value for claiming significance. The
$t$-test is more likely to be affected by the logarithmic
transformation than logistic regression, and the regularized $t$-test
is more affected than the $t$-test. On the other hand,
the very top ranking genes (e.g., up to the top 20--50 genes,
depending on the test) are not affected by
the logarithmic transformation.

\end{abstract}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{INTRODUCTION}

The number of copies of single-stranded messenger RNA (mRNA)
can be used to infer the amount of protein product produced
by a given gene, and is called the ``expression level''.
Ideally, one would like to count the number of copies of
a given mRNA directly. In microarray chips, however, the
amount of a specific mRNA is measured indirectly by
the emission of fluorescent light. It is necessary to
transform the raw light-intensity data obtained by
optical detection into a summarized quantity
that indicates the expression level. Deriving the
expression level from raw data is called ``low-level''
analysis, and it can be complicated by the details
of the technology and chip platform \cite{liwong,irizarry}.
Reaching conclusions from the expression level data, such as
determining which genes are differentially expressed, is
called ``high-level'' analysis.

After the expression level is derived from the raw data,
another preprocessing step is commonly applied: log-transformation.
The standard motivation for the log-transformation is
that the distribution of the derived expression level
is typically asymmetric, with a long tail at the high
expression end. Many parametric statistical tests
require variables to follow a Gaussian (normal) distribution.
The log-transformation is an attempt to convert
an asymmetric distribution to a symmetric and Gaussian-like
one. Other transformations for the purpose of ``normality''
are also possible \cite{sokal}, such as the square-root, Box-Cox
\cite{boxcox}, and arcsine transformations. For microarray
data, transformations have also been proposed along the
lines of variance stabilization \cite{durbin1,durbin2}.

A novel alternative explanation for the use of
log-transformation might be that humans perceive
the brightness of light as the logarithm of light
energy, similar to our perceiving the loudness of sound
as the logarithm of sound intensity. More generally,
the perceived magnitude of a physical stimulus has been
described as proportional to the logarithm of the stimulus
intensity (the Weber-Fechner law \cite{weber,fechner});
Stevens' law \cite{stevens} instead describes it as a power
function of the intensity. For the light-intensity-derived
expression level, log-transformation can be
viewed as a way to measure the ``perception
signal'' from the data.

From the statistical point of view, a logarithm
transformation can pull in an outlier with an
extremely high value, and thus change the group mean.
On the other hand, a logarithm transformation, or
any strictly monotonic transformation, will not shuffle
the relative order of expression values, and thus
will not affect the result of a rank-based test such
as the Wilcoxon-Mann-Whitney test \cite{mann}.
For a specific test or statistical model,
the effect of log-transformation on the
result is not clear in advance, even though we know it
has no effect if the test is rank-based and
has some effect if there are outliers. For
linear classifiers, violation of the Gaussian
assumption affects some methods more (e.g., Fisher's
linear discriminant analysis, perceptron)
and others less (e.g.,
logistic regression, support vector machine)
\cite{hastie}.
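
As a hypothetical illustration of both points (the numbers are invented for the example and are not taken from the dataset), suppose one group has expression values $100, 100, 100, 10000$. The mean on the original scale is dominated by the single large value,
\[ \bar{x} = \frac{100+100+100+10000}{4} = 2575 , \]
whereas on the $\log_{10}$ scale the values are $2, 2, 2, 4$, with
\[ \overline{\log_{10} x} = \frac{2+2+2+4}{4} = 2.5 , \qquad 10^{2.5} \approx 316 . \]
The ordering of the four values is the same on both scales, because $x < y$ implies $\log x < \log y$ for positive $x$ and $y$. A statistic that depends only on ranks is therefore unchanged, while one that depends on means and variances is not.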

Another point in investigating the effect of
log-transformation is that one can focus either on
the whole list of genes or only on the
more interesting top-ranking genes. For example,
with a log-transformation, the first- and second-ranked
differentially expressed genes may be switched
while the ranks of all other genes are unchanged.
Even though the effect of log-transformation
on the whole list of genes would then be small, such a
minor rearrangement of the top-ranking genes
can be crucial in designing subsequent experiments
such as gene validation by real-time PCR.

We examine the effect of log-transformation
on three simple methods for selecting differentially
expressed genes, using a real microarray dataset.
Log-transformation is just one factor that changes
the apparent value of the data; other
factors include the normalization
procedure during the ``low-level'' analysis,
changes of the probe set design, and changes of the
microarray platform.

   % \begin{figure}[thpb]
   \begin{figure}[t]
      \centering
        \begin{turn}{-90}
      % \includegraphics[scale=0.5]{yj-fig1.eps}
      % \includegraphics{yj-fig1.eps}
      \resizebox{8.0cm}{6.0cm}{ \includegraphics{yj-fig1.eps} }
        \end{turn}
      \caption{Minus log of $p$-values of tests on log-transformed vs. original
data. The $x$ axis is $-\log_{10}(p$-value) for the original
expression data, and the $y$ axis is $-\log_{10}(p$-value) for the log-transformed
data. The top plot is for logistic regression and the bottom plot
for the $t$-test. The four quadrants as split by $x=5$ and $y=5$
are indicated. Each point represents a gene.
        }
      \label{fig1}
   \end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{METHODS AND DATA}

\subsection{Student's $t$-test}

Student's $t$-test is used here as a representative of
tests that assume normality of the variable.
We expect the normality requirement to be met better
by the log-transformed data than by the original data. The $t$-statistic
is defined as the ratio of the difference between the
two group means to the standard error of
this difference: $t= (E_1 - E_2)/\sqrt{ s^2_1/n_1 + s^2_2/n_2}$,
where $E_{1,2}$, $s^2_{1,2}$, $n_{1,2}$ are the
mean, variance, and sample size of groups 1 and 2.
The $p$-value for a given $t$-statistic is determined
by Student's $t$-distribution with $df$ degrees of
freedom. Usually, $df$ is equal to $n_1+n_2-2$,
but when the variances in the two groups are not
equal, a more complicated formula for $df$ can
be used \cite{welsh}. We use such a method as
implemented in the $R$ statistical package ({\sl http://www.r-project.org/}).
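
For reference, the unequal-variance degrees of freedom are commonly computed with the Welch--Satterthwaite approximation, which is the default of R's {\tt t.test}; that this is the formula meant above is an assumption made here. In the notation of this section,
\[ df \approx \frac{\left( s_1^2/n_1 + s_2^2/n_2 \right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}} . \]
As a purely illustrative case, if the two sample variances happened to be equal, then with $n_1=43$ and $n_2=48$ this gives $df \approx 87.9$, slightly below $n_1+n_2-2=89$. In general the approximation lies between $\min(n_1-1, n_2-1)$ and $n_1+n_2-2$.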

\subsection{Logistic regression}

Logistic regression is used to represent statistical
models that do not have a strong normality requirement.
The advantage of models or tests lacking such a
requirement is that they are more robust. The
disadvantage is that when the variable is in fact
Gaussian distributed, they are less ``efficient''
as classifiers \cite{efron}. The significance of a
single-gene logistic regression can be determined
by a likelihood-ratio test: under the null hypothesis, the difference between
$(-2)$ times the maximized log-likelihood of a null model and
that of the logistic regression model
follows a $\chi^2$ distribution
with one degree of freedom.
Thus, given the $(-2)$ log-likelihood ratio (called
``deviance''), the $p$-value can be determined using the
$\chi^2$ distribution.
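
Written out as a sketch, the single-gene model and the test statistic are as follows. The symbols are introduced here for illustration and are not used elsewhere in the paper: $y_j \in \{0,1\}$ is the disease status of sample $j$, $x_j$ its expression level for the gene under test, and $\hat{L}_1$, $\hat{L}_0$ the maximized likelihoods of the logistic regression model and of the null model containing only the intercept $\beta_0$.
\[ \log \frac{P(y_j=1)}{1-P(y_j=1)} = \beta_0 + \beta_1 x_j , \]
\[ D = -2\log \hat{L}_0 - \left( -2\log \hat{L}_1 \right) = 2 \log \frac{\hat{L}_1}{\hat{L}_0} , \qquad p = P(\chi^2_1 \ge D) . \]
For instance, a threshold of $p_0 = 10^{-5}$ corresponds to $D$ of roughly 19.5.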

\subsection{Regularized $t$-test and significance analysis of microarrays (SAM)}

Since a low expression level also leads to a low variance,
the $t$-statistic can be high merely because of a low expression level.
Penalized or regularized statistics add an extra
term $s_0$ to the denominator to prevent this small variance from inflating the
statistic: $d= (E_1 - E_2)/(\sqrt{ s^2_1/n_1 + s^2_2/n_2}+s_0)$.
SAM (significance analysis of microarrays) is a method
for determining the value of $s_0$ \cite{tusher}.
The SAM test statistic, the $d$-score, was calculated with the
SAM package obtained from
{\sl http://www-stat.stanford.edu/\~{}tibs/SAM/}.
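
The role of $s_0$ can be seen by writing $e = \sqrt{s_1^2/n_1+s_2^2/n_2}$ for the standard error (an abbreviation introduced here), so that
\[ d = \frac{E_1-E_2}{e+s_0} = \frac{t}{1+s_0/e} . \]
When $e \gg s_0$, $d \approx t$; when $e \ll s_0$, $d \approx (E_1-E_2)/s_0$, which stays bounded however small the variance is. As an illustration with invented numbers, a gene with $E_1-E_2=2$ and $e=0.1$ has $t=20$, but with $s_0=0.9$ its score is only $d=2$.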

\subsection{Microarray data}

The illustrative microarray data come from a profiling study of
rheumatoid arthritis. There are 43 patients
and 48 normal controls, more than the 29 patients
and 21 controls used in the previous publication \cite{batli}.
The mRNA was extracted from peripheral blood mononuclear cells.
The microarray data were obtained from the Affymetrix
HG-U133A GeneChip with 22,283 genes/probe-sets, and
were normalized by the Affymetrix Microarray Suite (MAS) program.

\begin{table}
\caption{Percentage of discordant genes: (I+IV)/(I+II+IV)}
\label{tab1}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
\multicolumn{1}{c}{} & \multicolumn{3}{c}{\em logistic regression} & \multicolumn{3}{c}{\em $t$-test} \\
\hline
$p_0$ & I+IV & II & \% (95\% CI) & I+IV & II & \% (95\% CI) \\
\hline
$10^{-9}$ & 0 & 10 & 0 (0--0) & 7 & 4 & 64 (35--92) \\
$10^{-8}$ & 6 & 20 & 23 (7--39) & 8 & 11 & 42 (20--64) \\
$10^{-7}$ & 22 & 40 & 35 (24--47) & 21 & 21 & 50 (35--65) \\
$10^{-6}$ & 44 & 84 & 34 (26--43) & 40 & 52 & 43 (33--54) \\
$10^{-5}$ & 82 & 176 & 32 (26--37) & 92 & 119 & 44 (37--50) \\
$10^{-4}$ & 163 & 346 & 32 (28--36) & 170 & 266 & 39 (34--44) \\
0.001 & 328 & 709 & 32 (29--34) & 345 & 593 & 37 (34--40) \\
0.01 & 744 & 1698 & 30 (29--32) & 771 & 1520 & 34 (32--36) \\
\hline
\end{tabular}
\end{center}
\end{table}

\section{RESULTS}

\subsection{Proportion of discordant differentially expressed genes}

Fig.~\ref{fig1} shows the minus log of the $p$-values for the log-transformed
expression data vs. those for the un-log-transformed (raw)
expression data, for both
logistic regression (top) and the $t$-test (bottom). Taking
all genes as a whole, the two sets of $p$-values are highly
correlated (correlation coefficients are 0.94 and 0.93,
respectively). In order to highlight the
differences, especially for the high-ranking differentially
expressed genes, we split the plot into four quadrants
by a vertical line at $x=a$ and a horizontal line at
$y=a$. The parameter $a=-\log_{10}(p_0)$ corresponds
to the gene selection threshold $p_0$ for $p$-values.
For example, $a=5$ in Fig.~\ref{fig1} corresponds
to a $p$-value threshold of $p_0=0.00001$.

The genes in quadrants I, II, and IV have at least
one of the two $p$-values (log and raw data)
smaller than $p_0$, whereas the genes in quadrant II
have both $p$-values smaller than $p_0$.
If log-transformation had no effect on the gene selection,
there would be no points in quadrants I and IV. We use the
percentage of points in I and IV out of all points in I, II, and IV
as a measure of the inconsistency between the test
results on raw and log-transformed data. If
points in quadrants I and IV are called ``discordant''
and those in quadrant II ``concordant'', this
measure is the percentage of discordant genes among
all genes that are differentially expressed in at least one type
of data.
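
In symbols (with $N_{\rm I}$, $N_{\rm II}$, $N_{\rm IV}$ denoting the number of genes in each quadrant and $N$ their sum, a notation introduced here), the discordant fraction is
\[ q = \frac{N_{\rm I}+N_{\rm IV}}{N_{\rm I}+N_{\rm II}+N_{\rm IV}} . \]
For example, with the logistic-regression counts at $p_0=10^{-5}$ in Table~\ref{tab1}, $q = 82/(82+176) \approx 0.318$. The tabulated confidence intervals are consistent with the usual normal approximation for a proportion, which is assumed in this sketch:
\[ q \pm 1.96\sqrt{\frac{q(1-q)}{N}} = 0.318 \pm 1.96\sqrt{\frac{0.318 \times 0.682}{258}} \approx 0.318 \pm 0.057 , \]
i.e., roughly 26\%--37\%.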

Table \ref{tab1} shows the discordant percentages and
their 95\% confidence intervals (CI) at various
gene selection thresholds $p_0$ ($=10^{-9}, \cdots, 10^{-4}, 0.001, 0.01$).
As expected, the $t$-test result is more affected by the
log transformation than logistic regression: at all $p_0$
threshold values, the percentage of discordant differentially
expressed genes is higher for the $t$-test than for logistic
regression. The average discordant percentage over the eight
$p_0$ values is 27\% for logistic regression and 44\%
for the $t$-test.
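
The two averages follow directly from the percentage columns of Table~\ref{tab1}:
\[ \frac{0+23+35+34+32+32+32+30}{8} = 27.25 , \qquad \frac{64+42+50+43+44+39+37+34}{8} = 44.125 . \]
These are unweighted averages over thresholds. A gene significant at $p_0=10^{-9}$ is also significant at every larger $p_0$, so the eight percentages are not independent of one another.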

It was, however, surprising that for logistic regression
the discordant percentage is not negligible,
except for the most extremely differentially expressed
genes (e.g., when the $p$-value is $< 10^{-9}$, the discordant percentage
is zero).
If only one of the raw or log-transformed data is
used for logistic regression analysis, as many as 10\%--20\%
of the claimed differentially expressed genes will not be
claimed so with the other form of the data.
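
The relation between this figure and the percentages in Table~\ref{tab1} depends on how the discordant genes divide between quadrants I and IV, which the table does not report. As a sketch under the assumption of an even split ($N_{\rm I}=N_{\rm IV}$, where $N_{\rm I}$, $N_{\rm II}$, $N_{\rm IV}$ are the quadrant counts), the fraction of genes selected from one form of the data that are missed by the other is
\[ \frac{N_{\rm IV}}{N_{\rm II}+N_{\rm IV}} = \frac{q/2}{1-q/2} = \frac{q}{2-q} , \]
where $q=(N_{\rm I}+N_{\rm IV})/(N_{\rm I}+N_{\rm II}+N_{\rm IV})$ is the discordant fraction. With $q \approx 0.32$ this gives about $0.19$; for example, at $p_0=10^{-5}$ an even split would give $41/(176+41) \approx 0.19$.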

   % \begin{figure}[thpb]
   \begin{figure}[t]
      \centering
        \begin{turn}{-90}
      \resizebox{8.0cm}{8.5cm}{ \includegraphics{yj-fig2.eps} }
        \end{turn}
      \caption{Rank difference $d$ as a function of averaged
rank $R_a$ for all 22283 genes (A,B,C) and for the top-400 genes
(D,E,F). Both the rank difference $d$ and the averaged rank $R_a$ concern
the same gene in the two types of data (raw and log-transformed).
(A) and (D) are results for logistic regression, (B) and (E) are
for the $t$-test, and (C) and (F) for SAM. The $x$-axis in (D,E,F) is in
log scale to highlight the top-ranking genes. In (D,E,F), the lines
$d=50, -50, 100, -100$ and $d=R_a$, $d= -R_a$
are drawn.
        }
      \label{fig2}
   \end{figure}

\subsection{Ranking change due to log transformation}

The effect of log-transformation can also be examined through
the ranking of a gene in the two datasets. If log-transformation
has no effect, the rank of a gene by (e.g.) $p$-value
will be unchanged. We use the notation $R_n(i)$ and $R_l(i)$
for the rank of gene $i$ in the raw and log-transformed data,
and define $R_a(i)$ as the average of the two,
$R_a(i) \equiv (R_n(i)+R_l(i))/2$, and $d(i)$ as the
rank difference, $d(i)= R_n(i)-R_l(i)$. Fig.~\ref{fig2} (A,B,C)
shows $d$ vs. $R_a$ for logistic regression, the $t$-test, and
SAM (genes are ranked by the absolute value of the $d$-score)
for all 22283 genes.
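
As an illustration with invented ranks, a gene ranked 30th on the raw data and 130th on the log-transformed data has
\[ R_a = \frac{30+130}{2} = 80 , \qquad d = 30-130 = -100 . \]
A negative rank difference thus means that the gene ranks closer to the top on the raw data, and a positive one that it ranks closer to the top on the log-transformed data.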

Fig.~\ref{fig2} (A,B,C) indicates that for the whole gene set
there is a similar pattern for all three test statistics:
the highest- and lowest-ranking genes are ranked high and low, respectively, in
both the raw and log-transformed data (and thus have smaller rank differences).
As the majority of genes are not differentially expressed,
the overall scattering pattern in Fig.~\ref{fig2} (A,B,C)
may not be as interesting as the behavior of the high-ranking
differentially expressed genes.

To focus on the top-ranking genes, Fig.~\ref{fig2} (D,E,F)
zooms in on the top-400 genes (the $x$-axis is in log scale).
First, we notice that for the very top genes (e.g., up to the
top 10), the ranking is unchanged or changed very little
by the log transformation in any of the three tests/models. Second, the $t$-test
reaches rank differences of $d=50$ and $d=100$ sooner
(i.e., at a higher ranking) than logistic regression, reconfirming
our previous conclusion that the $t$-test is more likely to
be affected by log transformation than logistic regression.
Using the $d=R_a$ and $d=-R_a$ envelope, we see that
points are more likely to fall outside the envelope for the
$t$-test than for logistic regression. The third
observation is that the SAM test result is affected
even more by log transformation than the $t$-test. In
Fig.~\ref{fig2} (F), many points are far outside the
envelope region.
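
The envelope has a simple interpretation. Writing $R_{\max}$ and $R_{\min}$ for the larger and smaller of the two ranks $R_n$ and $R_l$ of a gene (symbols introduced here), and $R_a$ and $d$ for their average and difference,
\[ |d| > R_a \;\Longleftrightarrow\; R_{\max}-R_{\min} > \frac{R_{\max}+R_{\min}}{2} \;\Longleftrightarrow\; R_{\max} > 3R_{\min} , \]
so a point outside the envelope is a gene whose rank in one form of the data is more than three times its rank in the other. Since ranks are at least 1, $|d| \le 2R_a-2$ always holds, which bounds how far outside the envelope a point can fall.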

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{CONCLUSIONS AND FUTURE WORKS}

\subsection{Conclusions}

Using one microarray dataset, we have shown that log transformation
may affect the selection of differentially expressed genes.
If we call all genes that are significant by tests on either raw or
log-transformed data ``differentially expressed genes'', and
those genes that are significant in tests of only one of the two
types of data ``discordant'', the discordant genes as a proportion of
all (discordant and concordant) differentially expressed genes
average 27\% for logistic regression and 44\% for the
$t$-test over the eight significance thresholds examined.
The larger discordant percentage for the $t$-test confirms
our general understanding that tests that require variable normality
are more likely to be affected by variable transformation.

\subsection{Future Works}

We plan to extend the results here to other public
domain microarray datasets and to other tests, models,
and measures for determining differentially expressed genes.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{ACKNOWLEDGMENTS}

We thank Franak Batliwalla for providing the data.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{thebibliography}{99}

\bibitem{liwong}
C. Li, W.H. Wong,
``Model-based analysis of oligonucleotide arrays: Expression index
computation and outlier detection'',
{\it Proc. Natl. Acad. Sci.}, vol 98, 2001, pp. 31--36.

\bibitem{irizarry}
R.A. Irizarry, B.M. Bolstad, F. Collin, L.M. Cope, B. Hobbs, T.P. Speed,
``Summaries of Affymetrix GeneChip probe level data'',
{\it Nucl. Acids Res.}, vol 31, 2003, e15.

\bibitem{sokal}
R.R. Sokal, F.J. Rohlf,
{\it Biometry}, 3rd edition, W.H. Freeman and Co., New York;
1995.

\bibitem{boxcox}
G.E.P. Box, D.R. Cox,
``An analysis of transformations'',
{\it J. R. Stat. Soc. B}, vol 26, 1964, pp. 211--243.

\bibitem{durbin1}
B.P. Durbin, J.S. Hardin, D.M. Hawkins, D.M. Rocke,
``A variance-stabilizing transformation for gene-expression microarray data'',
{\it Bioinformatics}, vol 18(suppl 1), 2002, pp. S105--S110.

\bibitem{durbin2}
B. Durbin, D.M. Rocke,
``Estimation of transformation parameters for microarray data'',
{\it Bioinformatics}, vol 19, 2003, pp. 1360--1367.

\bibitem{weber}
E.H. Weber,
{\it De pulsu, resorptione, auditu et tactu.
Annotationes anatomicae et physiologicae},
C.F. K\"{o}hler, Leipzig; 1834.

\bibitem{fechner}
G.T. Fechner,
{\it Elemente der Psychophysik},
Breitkopf \& H\"{a}rtel, Leipzig; 1860.

\bibitem{stevens}
S.S. Stevens,
``On the psychophysical law'',
{\it Psychol. Rev.}, vol 64, 1957, pp. 153--181.

\bibitem{mann}
H.B. Mann, D.R. Whitney,
``On a test of whether one of two random variables is stochastically
larger than the other'',
{\it Ann. Math. Stat.}, vol 18, 1947, pp. 50--60.

\bibitem{hastie}
T. Hastie, R. Tibshirani, J. Friedman,
{\it The Elements of Statistical Learning},
Springer, New York; 2001.

\bibitem{welsh}
B.L. Welch,
``The generalization of `Student's' problem
when several different population variances are involved'',
{\it Biometrika}, vol 34, 1947, pp. 28--35.

\bibitem{efron}
B. Efron,
``The efficiency of logistic regression compared
to normal discriminant analysis'',
{\it J. Am. Stat. Assoc.}, vol 70, 1975, pp. 892--898.

\bibitem{tusher}
V. Tusher, R. Tibshirani, G. Chu,
``Significance analysis of microarrays applied to the ionizing
radiation response'',
{\it Proc. Natl. Acad. Sci.}, vol 98, 2001, pp. 5116--5121.

\bibitem{batli}
F.M. Batliwalla, E.C. Baechler, X. Xiao, W. Li, S. Balasubramanian, H. Khalili,
A. Damle, W.A. Ortmann, A. Perrone, A.B. Kantor, P.S. Gulko, M. Kern, R. Furie,
T.W. Behrens, P.K. Gregersen,
``Peripheral blood gene expression profiling in rheumatoid arthritis'',
{\it Genes and Immunity}, vol 6, 2005, pp. 388--397.

\end{thebibliography}

\end{document}