1: %level%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2: %2345678901234567890123456789012345678901234567890123456789012345678901234567890
3: % 1 2 3 4 5 6 7 8
4:
5: %\documentclass[letterpaper, 10 pt, conference]{ieeeconf} % Comment this line out
6: % if you need a4paper
7: \documentclass[a4paper, 10pt, conference]{ieeeconf} % Use this line for a4
8: % paper
9:
10: \IEEEoverridecommandlockouts % This command is only
11: % needed if you want to
12: % use the \thanks command
13: \overrideIEEEmargins
14: % See the \addtolength command later in the file to balance the column lengths
15: % on the last page of the document
16:
17:
18:
% The following packages can be found on http://www.ctan.org
20: \usepackage{graphics} % for pdf, bitmapped graphics files
21: \usepackage{epsfig} % for postscript graphics files
22: \usepackage{rotating}
23: %\usepackage{mathptmx} % assumes new font selection scheme installed
24: %\usepackage{times} % assumes new font selection scheme installed
25: %\usepackage{amsmath} % assumes amsmath package installed
26: %\usepackage{amssymb} % assumes amsmath package installed
27:
28: \title{\LARGE \bf
29: Does Logarithm Transformation of Microarray Data
30: Affect Ranking Order of Differentially Expressed Genes?
31: }
32:
33:
34: \author{Wentian Li, Young Ju Suh, Jingshan Zhang % <-this % stops a space
35: \thanks{W. Li is a Research Scientist with the Robert S Boas Center for Genomics and Human Genetics, Feinstein Institute for Medical Research, North Shore LIJ Health System,
36: Manhasset, NY 11030, USA
37: {\tt\small wli@nslij-genetics.org}}%
38: \thanks{Y.J. Suh is a Research Professor of
39: The Research Institute of Natural Sciences, Sookmyung Women's University,
40: Seoul 140-742, Korea.
41: {\tt\small yjsprite@yahoo.co.kr}}%
42: \thanks{J. Zhang is a Senior Statistician at
43: Forest Research Institute, Jersey City, NJ 07311, USA
44: {\tt\small jingshan.zhang@frx.com}}%
45: }
46:
47:
48: \begin{document}
49:
50:
51:
52: \maketitle
53: \thispagestyle{empty}
54: \pagestyle{empty}
55:
56:
57: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
58: \begin{abstract}
59:
A common practice in microarray analysis is to transform
the microarray raw data (light intensity) by a logarithmic
transformation; the usual justification is that it
makes the distribution more symmetric and Gaussian-like. Since
this transformation is not universally practiced in
microarray analysis, we examined whether applying or
omitting it affects the
``high-level'' analysis result, in particular whether
the differentially expressed genes obtained by the
$t$-test, regularized $t$-test, or logistic regression change rank order
depending on the presence or absence of the transformation.
We show that as many as 20\%--40\% of significant genes
are ``discordant'' (significant in only one form of the
data and not in both), depending on the test used and the
threshold for claiming significance. The
$t$-test is more likely to be affected by the logarithmic
transformation than logistic regression, and the regularized $t$-test
is more affected than the $t$-test. On the other hand,
the very top-ranking genes (e.g., up to the top 20--50 genes,
depending on the test) are not affected by
the logarithmic transformation.
81:
82:
83: \end{abstract}
84:
85:
86: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
87: \section{INTRODUCTION}
88:
The number of copies of single-stranded messenger RNA (mRNA)
can be used to infer the amount of protein product produced
by a given gene, and is called the ``expression level''.
Ideally, one would like to count the number of copies of
a given mRNA directly. But in microarray chips, the
94: amount of a specific mRNA is measured indirectly by
95: the emission of fluorescence light. It is necessary to
96: transform the raw data of light intensity obtained by
97: optical detection to a summarized quantity
98: that indicates the expression level. Deriving the
99: expression level from raw data is called the ``low-level"
100: analysis, and it can be complicated by the details
101: of the technology and chip platform \cite{liwong,irizarry}.
102: Reaching conclusions such as the determination of differentially
103: expressed genes using the expression level data is
104: called the ``high-level" analysis.
105:
106: After the expression level is derived from the raw data,
107: another preprocessing step is commonly practiced: log-transformation.
108: The standard motivation for the log-transformation is
109: that the distribution of the derived expression level
is typically asymmetric, with a long tail at the high
expression end. Many parametric statistical tests
require variables to follow a Gaussian/normal distribution.
The log-transformation is an attempt to convert
an asymmetric distribution to a symmetric and Gaussian-like
one. Other transformations for the purpose of ``normality''
are also possible \cite{sokal}, such as the square-root, Box-Cox
\cite{boxcox}, and arcsine transformations. For microarray
data, transformations have also been proposed along the
lines of variance stabilization \cite{durbin1,durbin2}.
120:
A novel alternative explanation of the use of
log-transformation might be that humans perceive the
brightness of light as the logarithm of light
energy, similar to our perceiving the loudness of sound
as the logarithm of sound intensity. In general,
human perception of a physical stimulus is often described
as proportional to the logarithm of the stimulus magnitude
(the Weber-Fechner law \cite{weber,fechner}), or as a power
function of it (Stevens' law \cite{stevens}). For the light-intensity-derived
expression level, log-transformation can be
viewed as a way to measure the ``perception
signal'' from the data.
133:
From the statistical point of view, a logarithmic
transformation can pull in an outlier with an
extremely high value, thus changing the group mean.
On the other hand, the logarithmic transformation, like
any strictly increasing transformation, does not shuffle
the relative order of expression values, and thus
does not affect the result of a rank-based test such
as the Wilcoxon-Mann-Whitney test \cite{mann}.
For a specific test or statistical model,
the effect of log-transformation on the
result is not clear in advance, even though we know it
has no effect if the test is rank-based, and
has some effect if there are outliers. For
linear classifiers, violation of the Gaussian
assumption affects some methods more (e.g., Fisher's
linear discriminant analysis, perceptron)
and others less (e.g.,
logistic regression, support vector machine)
\cite{hastie}.
153:
154: Another note on investigating the effect of
155: log-transformation is that one can focus either on
156: the whole list of genes, or only on the
157: more interesting top ranking genes. For example,
with a log-transformation, the genes ranked first and second
among the differentially expressed genes may be switched
while the ranks of all other genes are unchanged.
161: Even though the effect of log-transformation
162: on the whole list of genes could be small, the
163: minor rearrangement of the top ranking genes
164: can be crucial in designing the subsequent experiments
165: such as gene validation by real-time PCR.
166:
We examine the effect of log-transformation
on three simple methods for selecting differentially
expressed genes, using a real microarray dataset.
Log-transformation is just one factor that changes
the apparent value of the data; other
factors include the normalization
procedure during the ``low-level'' analysis,
changes in the probe-set design, and changes of the
microarray platform.
176:
177:
178: % \begin{figure}[thpb]
179: \begin{figure}[t]
180: \centering
181: \begin{turn}{-90}
182: % \includegraphics[scale=0.5]{yj-fig1.eps}
183: % \includegraphics{yj-fig1.eps}
184: \resizebox{8.0cm}{6.0cm}{ \includegraphics{yj-fig1.eps} }
185: \end{turn}
186: \caption{Minus log of $p$-values of tests on log transformed vs. original
187: data. The $x$ axis is $-\log_{10}(p$-value) for the original
188: expression data, and $y$ axis is $-\log_{10}(p$-value) for the log-transformed
189: data. The top plot is for logistic regression and bottom plot
190: for $t$-test. The four quadrants as split by $x=5$ and $y=5$
191: are indicated. Each point represents a gene.
192: }
193: \label{fig1}
194: \end{figure}
195:
196: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
197: \section{METHODS AND DATA}
198:
199: \subsection{Student's $t$-test}
200:
The Student's $t$-test is used here as a representative of
tests that assume the variable is normally distributed.
We expect the normality requirement to be met better
by the log-transformed data than by the original data. The $t$-statistic
is defined as the ratio of the difference of
two group means to the standard error of
this difference: $t= (E_1 - E_2)/\sqrt{ s^2_1/n_1 + s^2_2/n_2}$,
where $E_{1,2}$, $s^2_{1,2}$, $n_{1,2}$ are the
mean, variance, and sample size of groups 1 and 2.
The $p$-value for a given $t$-statistic is determined
by the Student's $t$-distribution with $df$ degrees of
freedom. Usually, $df$ is equal to $n_1+n_2-2$,
but when the variances in the two groups are not
equal, a more complicated formula for $df$ can
be used \cite{welsh}. We use this method as
implemented in the R statistical package ({\sl http://www.r-project.org/}).
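
For reference, a common form of this unequal-variance correction is the
Welch--Satterthwaite approximation. It is assumed here to be the formula
meant, since it is what the default two-sample $t$-test in R uses. In the
notation above,
\[
df \approx \frac{\left( s^2_1/n_1 + s^2_2/n_2 \right)^2}
{\frac{(s^2_1/n_1)^2}{n_1-1} + \frac{(s^2_2/n_2)^2}{n_2-1}} ,
\]
which is generally not an integer and satisfies
$\min(n_1,n_2)-1 \le df \le n_1+n_2-2$.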
217:
218:
219: \subsection{Logistic regression}
220:
Logistic regression is used to represent statistical
models that do not have a strong normality requirement.
The advantage of models or tests lacking such a
requirement is that they are more robust. The
disadvantage is that when the variable is in fact
Gaussian distributed, they are less ``efficient''
as classifiers \cite{efron}. The significance of a
single-gene logistic regression can be determined
by a likelihood-ratio test: under the null hypothesis,
$(-2)$ times the maximized log-likelihood of the null model
minus $(-2)$ times that of the logistic regression model
follows a $\chi^2$ distribution
with one degree of freedom.
Thus, given this $(-2)$ log-likelihood ratio (the
difference in ``deviance''), the $p$-value can be determined using the
$\chi^2$ distribution.
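
As a sketch of this test in symbols (the names $y_j$, $x_j$, $\beta_0$,
$\beta_1$, $L_0$, $L_1$, and $D$ are notation introduced for illustration
only), let $y_j \in \{0,1\}$ be the disease status of sample $j$ and $x_j$
the expression level of the gene. The single-gene model and its null
hypothesis are
\[
\log \frac{P(y_j=1)}{1-P(y_j=1)} = \beta_0 + \beta_1 x_j
\quad \mbox{vs.} \quad \beta_1 = 0 ,
\]
and, with $L_1$ and $L_0$ the maximized likelihoods of the full and null
models,
\[
D = -2 \log (L_0/L_1), \qquad
p = P(\chi^2_1 \ge D) .
\]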
237:
\subsection{Regularized $t$-test and significance analysis of microarrays (SAM)}

Because a low expression level tends to come with a low variance,
the $t$-statistic can be large simply because the expression level is low.
Penalized or regularized statistics add an extra
term $s_0$ to the denominator to prevent a small variance from inflating the
statistic: $d= (E_1 - E_2)/(\sqrt{ s^2_1/n_1 + s^2_2/n_2}+s_0)$.
SAM (significance analysis of microarrays) is a method
for determining the value of $s_0$ \cite{tusher}.
The SAM test statistic, the $d$-score, was calculated with the
SAM package obtained from
{\sl http://www-stat.stanford.edu/\~{}tibs/SAM/}.
250:
251:
252: \subsection{Microarray data}
253:
The illustrative microarray data come from a profiling study of
rheumatoid arthritis. There are 43 patients
and 48 normal controls, more than the 29 patients
and 21 controls used in the previous publication \cite{batli}.
The mRNA was extracted from peripheral blood mononuclear cells.
The microarray data were obtained from the Affymetrix
HG-U133A GeneChip with 22,283 genes/probe-sets, and
were normalized by the Affymetrix Microarray Suite (MAS) program.
262:
263:
264:
265: \begin{table}
\caption{Percentage of discordant genes, (I+IV)/(I+II+IV), at each $p$-value threshold $p_0$}
\label{tab1}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
\multicolumn{4}{c}{\em logistic regression} & \multicolumn{3}{c}{\em $t$-test} \\
\hline
$p_0$ & I+IV & II & \% (95\% CI) & I+IV & II & \% (95\% CI) \\
\hline
$10^{-9}$ & 0 & 10 & 0 (0--0) & 7 & 4 & 64 (35--92) \\
$10^{-8}$ & 6 & 20 & 23 (7--39) & 8 & 11 & 42 (20--64) \\
$10^{-7}$ & 22 & 40 & 35 (24--47) & 21 & 21 & 50 (35--65) \\
$10^{-6}$ & 44 & 84 & 34 (26--43) & 40 & 52 & 43 (33--54) \\
$10^{-5}$ & 82 & 176 & 32 (26--37) & 92 & 119 & 44 (37--50) \\
$10^{-4}$ & 163 & 346 & 32 (28--36) & 170 & 266 & 39 (34--44) \\
0.001 & 328 & 709 & 32 (29--34) & 345 & 593 & 37 (34--40) \\
0.01 & 744 & 1698 & 30 (29--32) & 771 & 1520 & 34 (32--36) \\
282: \hline
283: \end{tabular}
284: \end{center}
285: \end{table}
286:
287: \section{RESULTS}
288:
289: \subsection{Proportion of discordant differentially expressed genes}
290:
Fig.~\ref{fig1} shows the minus log of $p$-values for log-transformed
expression data vs. that for un-log-transformed (raw)
expression data, for both
logistic regression (top) and $t$-test (bottom). Taking
all genes as a whole, the two sets of $p$-values are highly
correlated (correlation coefficients are 0.94 and 0.93,
respectively). In order to highlight the
differences, especially for the high-ranking differentially
expressed genes, we split the plot into four quadrants
by a vertical line at $x=a$ and a horizontal line at
$y=a$. The parameter $a=-\log_{10}(p_0)$ corresponds
to the gene selection threshold $p_0$ for $p$-values.
For example, $a=5$ in Fig.~\ref{fig1} corresponds
to a $p$-value threshold of $p_0=10^{-5}$.
305:
The genes in quadrants I, II, and IV have at least
one of the two $p$-values (log and raw data)
smaller than $p_0$, whereas the genes in quadrant II
have both $p$-values smaller than $p_0$.
If log-transformation has no effect on the gene selection,
there will be no points in quadrants I and IV. We use the
percentage of points in I and IV out of all points in I, II, and IV
as a measure of the inconsistency between the test
results on raw and log-transformed data. If
points in quadrants I and IV are called ``discordant''
and those in quadrant II ``concordant'', this
measure is the percentage of discordant genes among
all genes found differentially expressed in at least one type
of data.
320:
321:
Table \ref{tab1} shows the discordant percentages and
their 95\% confidence intervals (CI) at various
gene selection thresholds $p_0$ (=$10^{-9}, \cdots, 10^{-4}, 0.001, 0.01$).
As expected, the $t$-test result is more affected by the
log transformation than logistic regression: at all $p_0$
threshold values, the percentage of discordant differentially
expressed genes is higher for the $t$-test than for logistic
regression. The discordant percentage averaged over the eight
$p_0$ values is 27\% for logistic regression and 44\%
for the $t$-test.
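
How the confidence intervals were computed is not stated above; a simple
normal-approximation (Wald) interval, assumed here for illustration, is
consistent with the values in Table \ref{tab1}. With $N_d$ discordant genes
(quadrants I and IV) and $N_c$ concordant genes (quadrant II), where $N_d$,
$N_c$, $N$, and $f$ are notation introduced only for this sketch,
\[
f = \frac{N_d}{N}, \quad N = N_d + N_c, \quad
\mbox{CI} = f \pm 1.96 \sqrt{\frac{f(1-f)}{N}} .
\]
For logistic regression at $p_0=10^{-5}$, $N_d=82$ and $N_c=176$ give
$f = 82/258 \approx 0.318$ and a half-width of
$1.96\sqrt{0.318 \times 0.682/258} \approx 0.057$, i.e., 32\% (26\%--37\%).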
332:
It was, however, surprising that for logistic regression
the discordant percentage is not negligible, except for
the most extremely differentially expressed
genes (e.g., when $p$-value $< 10^{-9}$, the discordant percentage
is zero).
If only one of the raw or log-transformed data is
used for logistic regression analysis, as many as 10\%--20\%
of the genes claimed as differentially expressed would not be
claimed so using the other form of the data.
341:
342:
343: % \begin{figure}[thpb]
344: \begin{figure}[t]
345: \centering
346: \begin{turn}{-90}
347: \resizebox{8.0cm}{8.5cm}{ \includegraphics{yj-fig2.eps} }
348: \end{turn}
349: \caption{Rank difference $d$ as a function of averaged
350: rank $R_a$ for all 22283 genes (A,B,C) and for top-400 genes
351: (D,E,F). Both rank difference $d$ and averaged rank $R_a$ concern
352: the same gene on two different types of data (raw and log-transformed).
353: (A) and (D) are results for logistic regression, (B) and (E) are
354: for $t$-test, (C) and (F) for SAM. The $x$-axis in (D,E,F) is in
355: log scale to highlight the top-ranking genes. In (D,E,F),
356: $d=50, -50, 100, -100$ and $d=R_a$, $d= -R_a$ lines
357: are drawn.
358: }
359: \label{fig2}
360: \end{figure}
361:
362:
363:
364: \subsection{Ranking change due to log transformation}
365:
The effect of log-transformation can also be examined through
the ranking of a gene in both datasets. If log-transformation
has no effect, the rank of a gene by (e.g.) $p$-value
will be unchanged. We use the notation $R_n(i)$ and $R_l(i)$
for the rank of gene $i$ in the raw and log-transformed data,
and define $R_a(i)$ as the average of the two,
$R_a(i) \equiv (R_n(i)+R_l(i))/2$, and $d(i)$ as the
rank difference, $d(i) \equiv R_n(i)-R_l(i)$ (not to be confused
with the SAM $d$-score). Fig.~\ref{fig2} (A,B,C)
shows $d$ vs. $R_a$ for logistic regression, $t$-test, and
SAM (genes are ranked by the absolute value of the $d$-score)
for all 22283 genes.
377:
Fig.~\ref{fig2} (A,B,C) indicates that for the whole gene set
all three test statistics show a similar pattern:
genes ranked very high or very low in one form of the data are
ranked similarly in the other (hence the smaller rank differences
at both ends).
As the majority of genes are not differentially expressed,
the overall scatter pattern in Fig.~\ref{fig2} (A,B,C)
may not be as interesting as the behavior of the high-ranking
differentially expressed genes.
386:
To focus on the top-ranking genes, Fig.~\ref{fig2} (D,E,F)
zooms in on the top-400 genes (the $x$-axis is in log scale).
First, we notice that for the very top genes (e.g., up to the
top 10), the ranking is unchanged or changed very little
by the log transformation in any of the three tests/models. Second, the $t$-test
reaches rank differences of $d=50$ and $d=100$ sooner
(i.e., at a higher ranking) than logistic regression, reconfirming
our previous conclusion that the $t$-test is more likely to
be affected by log transformation than logistic regression.
Using the $d=R_a$ and $d=-R_a$ envelope, we see that
points are more likely to fall outside the envelope for the
$t$-test than for logistic regression. The third
observation is that the SAM test result is affected
even more by log transformation than the $t$-test. In
Fig.~\ref{fig2} (F), many points are far outside the
envelope region.
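
The $d=\pm R_a$ envelope has a simple interpretation, shown here as a short
derivation that uses only the definitions of $d$ and $R_a$:
\[
d > R_a
\;\Longleftrightarrow\;
R_n - R_l > \frac{R_n + R_l}{2}
\;\Longleftrightarrow\;
R_n > 3 R_l ,
\]
and symmetrically $d < -R_a$ when $R_l > 3 R_n$. A point outside the
envelope is therefore a gene whose rank in one form of the data is more
than three times its rank in the other; for example, ranks 10 and 40 give
$R_a=25$ and $|d|=30$.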
403:
404:
405:
406:
407:
408:
409:
410:
411: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{CONCLUSIONS AND FUTURE WORK}
413:
414: \subsection{Conclusions}
415:
Using one microarray dataset, we have shown that log transformation
may affect the selection of differentially expressed genes.
If we call all genes that are significant by tests on either raw or
log-transformed data ``differentially expressed genes'', and
those genes that are significant in the test on only one of the two
types of data ``discordant'', the discordant genes as a proportion of
all (discordant and concordant) differentially expressed genes
average 27\% for logistic regression and 44\% for the
$t$-test over the thresholds examined. The larger discordant percentage for the $t$-test confirms
our general understanding that tests that require variable normality
are more likely to be affected by variable transformation.
427:
428:
\subsection{Future Work}
430:
431: We plan to extend the results here to other public
432: domain microarray datasets and to other tests, models,
433: and measures for determining differentially expressed genes.
434:
435:
436: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
437: \section{ACKNOWLEDGMENTS}
438:
439: We thank Franak Batliwalla for providing the data.
440:
441:
442: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
443: \begin{thebibliography}{99}
444:
445: \bibitem{liwong}
446: C. Li, W.H. Wong,
447: ``Model-based analysis of oligonucleotide arrays: Expression index
448: computation and outlier detection",
{\it Proc. Natl. Acad. Sci.}, vol 98, 2001, pp.31-36.
450:
451: \bibitem{irizarry}
452: R.A. Irizarry, B.M. Bolstad, F. Collin, L.M. Cope, B. Hobbs, T. P. Speed,
453: ``Summaries of Affymetrix GeneChip probe level data",
454: {\it Nucl. Acids Res. }, vol 31, 2003, e15.
455:
456: \bibitem{sokal}
457: R.R. Sokal, F.J. Rohlf,
458: {\it Biometry}, 3rd edition, W.H. Freeman and Co., New York;
459: 1995.
460:
461: \bibitem{boxcox}
462: G.E.P. Box, D.R. Cox ,
463: ``An analysis of transformations",
464: {\it J. R. Stat. Soc. B}, vol 26, 1964, pp.211-243.
465:
466: \bibitem{durbin1}
467: B.P. Durbin, J.S. Hardin, D.M. Hawkins, D.M. Rocke,
468: ``A variance-stabilizing transformation for gene-expression microarray data",
469: {\it Bioinformatics}, vol 18(suppl 1), 2002, pp.S105-S110.
470:
471: \bibitem{durbin2}
472: B. Durbin, D.M. Rocke,
473: ``Estimation of transformation parameters for microarray data",
474: {\it Bioinformatics}, vol 19, 2003, pp.1360-1367.
475:
476: \bibitem{weber}
477: E.H. Weber,
{\it De pulsu, resorptione, auditu et tactu.
Annotationes anatomicae et physiologicae},
C.F. Koehler, Leipzig; 1834.
481:
482: \bibitem{fechner}
483: G.T. Fechner,
{\it Elemente der Psychophysik},
485: Breitkopf \& H\"{a}rtel, Leipzig; 1860.
486:
487: \bibitem{stevens}
488: S.S. Stevens,
489: ``On the psychophysical law",
490: {\it Psychol. Rev.}, vol 64, 1957, pp.153-181.
491:
492: \bibitem{mann}
493: H.B. Mann, D.R. Whitney,
``On a test of whether one of two random variables is stochastically
495: larger than the other",
496: {\it Ann. Math. Stat. }, vol 18, 1947, pp.50-60.
497:
498: \bibitem{hastie}
499: T. Hastie, R. Tibshirani, J. Friedman,
500: {\it The Elements of Statistical Learning},
501: Springer, New York; 2001.
502:
503: \bibitem{welsh}
B. L. Welch,
505: ``The generalization of `Student's' problem
506: when several different population variances are involved",
507: {\it Biometrika}, vol 34, 1947, pp.28-35.
508:
509: \bibitem{efron}
510: B. Efron,
511: ``The efficiency of logistic regression compared
512: to normal discriminant analysis",
513: {\it J. Am. Stat. Asso.}, vol 70, 1975, pp.892-898.
514:
515: \bibitem{tusher}
V. Tusher, R. Tibshirani, C. Chu,
517: ``Significance analysis of microarrays applied to the ionizing
518: radiation response",
519: {\it Proc. Natl. Acad. Sci.}, vol 98, 2001, pp. 5116-5121.
520:
521: \bibitem{batli}
522: F.M. Batliwalla, E.C. Baechler, X. Xiao, W. Li, S. Balasubramaniuan, H. Khalili,
523: A. Damle, W.A. Ortmann, A. Perrone, A.B. Kantor, P.S. Gulko, M. Kern, R. Furie,
524: T. W. Behrens, P. K. Gregersen,
525: ``Peripheral blood gene expression profiling in rheumatoid arthritis",
{\it Genes and Immunity}, vol 6, 2005, pp. 388-397.
527:
528:
529: \end{thebibliography}
530:
531: \end{document}
532:
533:
534:
535: