q-bio0606017/arxiv1.tex
1: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2: %2345678901234567890123456789012345678901234567890123456789012345678901234567890
3: %        1         2         3         4         5         6         7         8
4: 
5: %\documentclass[letterpaper, 10 pt, conference]{ieeeconf}  % Comment this line out
6:                                                           % if you need a4paper
7: \documentclass[a4paper, 10pt, conference]{ieeeconf}      % Use this line for a4
8:                                                           % paper
9: 
10: \IEEEoverridecommandlockouts                              % This command is only
11:                                                           % needed if you want to
12:                                                           % use the \thanks command
13: \overrideIEEEmargins
14: % See the \addtolength command later in the file to balance the column lengths
15: % on the last page of the document
16: 
17: 
18: 
19: % The following packages can be found on http://www.ctan.org
20: \usepackage{graphics} % for pdf, bitmapped graphics files
21: \usepackage{epsfig} % for postscript graphics files
22: \usepackage{rotating}
23: %\usepackage{mathptmx} % assumes new font selection scheme installed
24: %\usepackage{times} % assumes new font selection scheme installed
25: %\usepackage{amsmath} % assumes amsmath package installed
26: %\usepackage{amssymb}  % assumes amsmath package installed
27: 
28: \title{\LARGE \bf
29: Overlapping Probabilities of Top Ranking Gene Lists,
30: Hypergeometric Distribution, and Stringency of Gene Selection Criterion
31: }
32: 
33: 
34: \author{Wen Fury, Franak Batliwalla, Peter K. Gregersen, and Wentian Li% <-this % stops a space
35: \thanks{W. Fury is a Senior Bioinformatics Scientist at Regeneron Pharmaceutical, Inc.
36: 	Tarrytown, NY 10591, USA.
37:         {\tt\small wen.fury@regeneron.com}}%
38: \thanks{F. Batliwalla, P.K. Gregersen, and W. Li are Research Scientists
39: with the Robert S Boas Center for Genomics and Human Genetics, 
40: Feinstein Institute for Medical Research, North Shore LIJ Health System,
41: 	Manhasset, NY 11030, USA
42:         {\tt\small fb@nshs.edu},
43:         {\tt\small peterg@nshs.edu},
44:         {\tt\small wli@nslij-genetics.org}}%
45: }
46: 
47: 
48: \begin{document}
49: 
50: 
51: 
52: \maketitle
53: \thispagestyle{empty}
54: \pagestyle{empty}
55: 
56: 
57: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
58: \begin{abstract}
59: 
60: When the same set of genes appears in two top-ranking gene lists from
61: two different studies, it is often of interest to estimate
62: the probability that this is a chance event. This overlapping
63: probability is well known to follow the hypergeometric
64: distribution.  Usually, the lengths of top-ranking gene lists 
65: are assumed to be fixed, by using a pre-set criterion on, e.g.,
66: $p$-value for the $t$-test. We investigate how overlapping probability
67: changes with the gene selection criterion, or simply, with the
68: length of the top-ranking gene lists. It is concluded that 
69: the overlapping probability is indeed a function of the gene list 
70: length, and its statistical significance should be quoted in 
71: the context of the gene selection criterion. 
72: 
73: 
74: \end{abstract}
75: 
76: 
77: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
78: \section{INTRODUCTION}
79: 
80: One of the most common tasks in microarray analysis 
81: is to identify a list of genes that are differentially
82: expressed under two conditions, such as being affected by
83: a disease vs. normal, before vs. after a medical
84: treatment, and one vs. another disease subtype. The
85: number of genes on the top-ranking list is
86: usually much smaller than the total number of genes
87: on the chip, $n$. If the same type of microarray chip is used for 
88: two different studies (e.g. disease-A vs. control, 
89: and disease-B vs. control), two differentially
90: expressed gene lists can be obtained, with $n_1$ and
91: $n_2$ genes. Researchers often find the same genes
92: appear in both lists and hypothesize that these common
93: genes are involved in the etiology of both diseases.
94: 
95: However, for such a hypothesis to be convincing,
96: one has to first estimate the probability for 
97: overlapping genes by chance alone. In other words,
98: if two lists of genes are selected out of $n$ genes 
99: randomly, we would like to calculate the probability
100: that the two lists have $m$ genes in common,
101: with the lengths of the two lists being $n_1$ and $n_2$.
102: This overlapping probability is known to follow the
103: hypergeometric distribution\footnote{Despite a certain
104: similarity, this problem is not the birthday problem,
105: i.e., the probability that two people in a room
106: have the same birthday.}. The name hypergeometric
107: distribution was first used in \cite{hyper}, and
108: was popularized by its role in Fisher's exact
109: test \cite{fisher}.
110: 
111: In microarray analysis, overlapping probability and
112: the hypergeometric distribution mainly appear in testing
113: the enrichment of genes in a certain functional
114: category \cite{tavazoie, draghici, fino, hosack,  
115: boorsma, curtis, mao, tian}. In this application,
116: the first list is the top-ranking differentially
117: expressed genes, and a gene selection process is
118: involved. The second list, in contrast, is given: 
119: $n_2$ genes are known to be in a pathway, to be members
120: of a protein family, to be described by a gene ontology term,
121: etc. One asks for the chance probability
122: that $m$ out of $n_1$ selected genes are in 
123: a given pathway, in a protein family, or described 
124: by a gene ontology term.  Whether $n_2$ is fixed or not is the 
125: main difference between that application and ours.
126: 
127: 
128: When a different gene selection criterion is used,
129: the number of genes in the two top-ranking lists
130: of two studies ($n_1$ and $n_2$) will also change.
131: Because the stringency of a gene selection criterion
132: is always adjustable and to some extent arbitrary,
133: we would like to examine whether these changes will
134: affect the overlapping probability. In two
135: extreme situations, very small $n_1 = n_2 \approx 1 $
136: and very large $n_1=n_2 =n$, it is clear that
137: the number of overlapping genes is $m=0$ and $m=n$, respectively.
138: These $m$ values appear (nearly) 100\% of the time, so
139: the corresponding $p$-value is equal to 1, i.e.,
140: not significant. For intermediate $n_1 \approx n_2$
141: values, it is not clear what the overlapping
142: probability and significance will be, and this is
143: the topic of this abstract.
144: 
145: 
146:  
147: 
148: 
149: 
150: 
151: 
152: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
153: \section{HYPERGEOMETRIC DISTRIBUTION AND OVERLAPPING P-VALUES}
154: 
155: 
156: Given integers $n$, $n_1$, $n_2$, $m$ 
157: ($\max(n_1, n_2) \le n$ and $m \le \min(n_1, n_2)$), the hypergeometric
158: distribution is defined as
159: $$
160: P(m) =\frac{ C(n_1, m ) C(n-n_1, n_2-m )}{ C(n, n_2) }
161: = \frac{ \left( \begin{array}{c} n_1 \\ m \end{array} \right)
162: \left(  \begin{array}{c} n-n_1 \\ n_2-m  \end{array} \right)}
163: { \left( \begin{array}{c} n \\ n_2 \end{array}  \right) }
164: $$
165: where $C(n, m)$ is the number of possibilities of choosing
166: $m$ objects out of $n$ objects: $C(n, m)= n!/[m! (n-m) !] $.
167: 
168: When $n_1$ genes are randomly chosen from the total of
169: $n$ genes, and another random sampling leads to $n_2$
170: genes, the probability that the two lists of genes have
171: $m$ in common is exactly the hypergeometric probability
172: $P(m)$. This can be proven by the following steps:
173: 1) The total number of possible choices for the two
174: lists of genes is $C(n, n_1) \cdot C(n, n_2)$.
175: 2) There are $C(n, n_1)$ possibilities for choosing the first
176: list.
177: 3) Among the $n_1$ genes in the first list, there are
178: $C(n_1, m)$ possibilities  for choosing $m$ genes to
179: be in common with the second list.
180: 4) In the second list, besides the $m$ genes that are in
181: common with the first list, the remaining $n_2-m$ genes
182: are chosen among the $n-n_1$ ``leftover'' genes not
183: in the first list, thus $C(n-n_1, n_2-m)$ possibilities.
184: $P(m)$ is then simply (\#2 $\times$ \#3 $\times$ \#4) / \#1.
185: Note that $n_1$ and $n_2$ can be switched without
186: changing the $P(m)$ value.
187: 
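As a minimal worked check of this counting argument, consider hypothetical
values chosen only for ease of computation (they are not taken from any
dataset): $n=10$, $n_1=4$, $n_2=3$, and $m=2$. Then
$$
P(2) = \frac{C(4,2)\, C(6,1)}{C(10,3)} = \frac{6 \times 6}{120} = 0.3 .
$$
Switching the two list lengths gives
$C(3,2)\, C(7,2)/C(10,4) = (3 \times 21)/210 = 0.3$, the same value,
as the symmetry between $n_1$ and $n_2$ requires.
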
188: It is usually more interesting to calculate the sum of
189: $P(k)$ for all $k$ equal to or larger than the observed value $m$
190: (i.e., the $p$-value):
191: $$
192: p\mbox{-value} =  \sum_{k = m}^{\min(n_1, n_2)} P(k)
193: = \sum_{k=0}^{\min(n_1, n_2)} P(k)
194: -\sum_{k=0}^{m-1} P(k)
195: $$
196: In the statistical package $R$ ({\sl http://www.r-project.org/}), 
197: there are at least two ways to calculate the overlapping $p$-value.
198: The first is to use the cumulative distribution function of the
199: hypergeometric distribution, {\sl phyper(m, $n_1$, $n-n_1$, $n_2$)}:
200: $p$-value $= phyper(\min(n_1, n_2), n_1, n-n_1, n_2)
201: - phyper(m-1, n_1, n-n_1, n_2)$ if $m >0$, and
202: $p$-value=1 if $m=0$. The second method is to use
203: the  $p$-value from the one-sided Fisher's exact test on 
204: the following 2-by-2 table:
205: $$ 
206: \begin{array}{c|cc|c}
207:  & col_1 & col_2 & total \\
208: \hline
209:  row_1& m & n_1 -m & n_1 \\
210: row_2& n_2-m & n -n_1-n_2+m & n-n_1 \\
211: \hline
212: total & n_2 & n-n_2 & n
213: \end{array}
214: $$ 
215: The two approaches lead to identical results.
216:  
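To illustrate with hypothetical values ($n=10$, $n_1=4$, $n_2=3$, observed
$m=2$; chosen only to keep the arithmetic small), the probabilities of
smaller overlaps are $P(0)= C(6,3)/C(10,3) = 20/120$ and
$P(1)= C(4,1)\, C(6,2)/C(10,3) = 60/120$, so that
$$
p\mbox{-value} = 1 - \frac{20+60}{120} = \frac{1}{3} \approx 0.33 .
$$
The direct sum gives the same number, $P(2)+P(3) = (36+4)/120$.
The corresponding 2-by-2 table has rows $(2, 2)$ and $(1, 5)$, with
row totals 4 and 6 and column totals 3 and 7.
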
217:    % \begin{figure}[thpb]
218:    \begin{figure}[t]
219:       \centering
220: 	\begin{turn}{-90}
221:       	% \includegraphics[scale=1.0]{wen-fig1.eps}
222: 	\resizebox{8.0cm}{8.0cm}{ \includegraphics{wen-fig1.eps} }
223: 	\end{turn}
224:       \caption{First column: proportion of overlapping genes between
225: two top ranking gene lists for a pair of studies ($m/n_1$)
226: as a function of the gene list length ($n_1(=n_2)$). Top is
227: for gene ranking by $t$-test and bottom is for gene ranking
228: by logistic regression. The overlapping proportion for
229: two randomly shuffled lists is shown in crosses, and the line
230: $m/n_1 = n_1/n$ is marked. Second column: observed number
231: of overlapping genes ($m$) minus the expected number
232: of overlapping genes ($n_1^2/n$).
233: 	}
234:       \label{fig1}
235:    \end{figure}
236: 
237: \section{PROPORTION OF OVERLAPPING GENES IN A COLLECTION
238: OF MICROARRAY DATASETS}
239: 
240: 
241: In the hypergeometric distribution, the number of overlapping
242: elements $m$ is a variable that is not determined by the
243: list lengths $n_1, n_2$. In order to get a rough idea of
244: how $m$ changes with the list lengths, we use three real
245: microarray datasets.  These studies concern three 
246: autoimmune diseases: rheumatoid 
247: arthritis (RA), systemic lupus erythematosus (SLE), and 
248: psoriatic arthritis (PsA), described in detail in
249: \cite{ra, sle, psa}.  The numbers of controls (C) and patients (P)
250: in these three datasets are (C=39, P=46), (C=41, P=81), and 
251: (C=19, P=19), respectively. The total number of genes/probe-sets
252: is $n=$22283, and  the expression levels are log transformed.
253: Genes are ranked by their degree of differential expression,
254: which can be measured by various tests or models, such 
255: as the $t$-test and logistic regression.
256: 
257: For any pair of studies, with a fixed length of the top-ranking
258: gene lists $n_1(=n_2)$, one can count the number of overlapping genes
259: $m$ and the proportion $m/n_1(=m/n_2)$. Fig.\ref{fig1} (left
260: column) shows this proportion as a function of $n_1(=n_2)$ 
261: for three study-pairs (RA-SLE, SLE-PsA, RA-PsA) as well as for two ranking methods 
262: ($t$-test and logistic regression). The corresponding overlapping 
263: proportion for two randomly shuffled lists is also 
264: indicated in Fig.\ref{fig1} as crosses.
265: 
266: When $n_1(=n_2)$ is small, $m$ is most likely to be zero, so
267: the proportion is also zero. When $n_1(=n_2)$ approaches the
268: total number of genes, $n$, all genes are overlapping genes,
269: and the proportion is 1. Fig. \ref{fig1} indeed shows these
270: trends at the two extreme points. In order to check the
271: behavior in-between, we draw a reference line in Fig.\ref{fig1}
272: (left column) that assumes the linear relationship
273: $m/n_1 = n_1/n$.  Most of the points in Fig.\ref{fig1} 
274: are above this line, whereas the overlapping proportion of two 
275: random lists falls on this line.
276: 
277: To have an idea of the absolute number of common genes
278: in excess of that expected by random chance, Fig.\ref{fig1} (right
279: column) plots the observed $m$ minus the expected $m_{exp}= n_1^2/n(=n_2^2/n)$
280: as a function of $n_1(=n_2)$. The maximum difference between
281: the observed and expected values is reached between $n_1=5000$ and
282: $n_1=10000$. The difference between the observed and expected $m$'s 
283: can be as much as 600--800.
284: 
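The expected count $m_{exp}$ used here is the mean of the hypergeometric
distribution. A short way to see it: under random selection each of the
$n_2$ genes in the second list falls in the first list with probability
$n_1/n$, so
$$
m_{exp} = \sum_{m} m\, P(m) = \frac{n_1 n_2}{n} ,
$$
which equals $n_1^2/n$ when $n_1=n_2$. As an illustrative calculation,
for $n=22283$ and $n_1=n_2=5000$ this gives
$m_{exp} = 5000^2/22283 \approx 1122$, i.e. $m_{exp}/n_1 = n_1/n \approx 0.22$,
which is the reference line in the left column of Fig.\ref{fig1}.
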
285:    % \begin{figure}[thpb]
286:    \begin{figure}[t]
287:       \centering
288: 	\begin{turn}{-90}
289: 	\resizebox{4.0cm}{7.50cm}{ \includegraphics{wen-fig2.eps} }
290: 	\end{turn}
291:       \caption{
292: 	Overlapping significance as measured by $-\log_{10}(p$-value)
293: where $p$-value is obtained by the hypergeometric distribution,
294: as a function of $n_1(=n_2)$, the number of genes in the
295: top-ranking gene lists. The $R$ program reports $p$-value to
296: be zero whenever it is lower than 2.2$\times 10^{-16}$, and
297: we use a ceiling of 15.65758 $=-\log_{10}(2.2 \times 10^{-16})$
298: in the plot.  Six lines are shown for three
299: study pairs (RA-SLE, SLE-PsA, RA-PsA) and two tests/models
300: ($t$-test and logistic regression). Similar overlapping significance
301: for two randomly shuffled lists is also shown (indicated by crosses).
302: 	}
303:       \label{fig2}
304:    \end{figure}
305: 
306: \section{OVERLAPPING SIGNIFICANCE}
307: 
308: The overlapping $p$-value corresponding to the $m$ counts
309: plotted in Fig.\ref{fig1} was calculated by the hypergeometric
310: distribution, and is shown in Fig.\ref{fig2}:
311: $y$-axis is $-\log_{10}(p$-value), and $x$-axis is
312: $n_1(=n_2)$. Six lines are shown for
313: three comparisons (RA-SLE, SLE-PsA, RA-PsA) and two
314: measurements of the differential expression ($t$-test and
315: logistic regression).  Zero $p$-values are converted to
316: 2.2 $\times 10^{-16}$, which is the minimum value
317: reported by the $R$ program.  Fig.\ref{fig2}  shows that, 
318: except at the two ends ($m=n_1=n_2=0$ and $m=n_1=n_2=n$) where 
319: the $p$-value is 1, the overlapping significance
320: quickly increases with the length of the top-ranking gene list
321: $n_1(=n_2)$, and can be extremely significant when a
322: large number of genes are kept in the two lists
323: for comparison.
324: 
325: This result confirms our previous suspicion that overlapping
326: significance is a function of the gene list lengths.
327: If the selection of $n_1, n_2$ is arbitrary, the
328: overlapping significance thus calculated is also
329: arbitrary. It is not surprising that
330: overlapping significance may keep increasing
331: (or, $p$-value decreasing) with the increase of $n_1(=n_2)$,
332: because a $p$-value in general depends on the sample
333: size. When a signal is real (true positive), the $p$-value
334: tends to decrease with the sample size.
335: In contrast, if a true signal is absent, the
336: sample size does not affect the conclusion. As
337: can be seen in Fig.\ref{fig2}, the overlapping significance
338: for two random lists does not really change with $n_1(=n_2)$.
339: 
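The dependence on list length can be seen in a small hypothetical example
(the values below are illustrative and not taken from the three datasets).
Let $n=100$ and keep the overlapping proportion fixed at $m/n_1 = 1/2$.
For $n_1=n_2=2$ and $m=1$,
$$
p\mbox{-value} = 1 - \frac{C(98,2)}{C(100,2)} = 1 - \frac{4753}{4950} \approx 0.040 ,
$$
whereas for $n_1=n_2=4$ and $m=2$, summing $P(2)$, $P(3)$ and $P(4)$ gives
$$
p\mbox{-value} = \frac{27360 + 384 + 1}{3921225} \approx 0.0071 ,
$$
where the three terms in the numerator are $C(4,2)\, C(96,2)$,
$C(4,3)\, C(96,1)$ and $C(4,4)$, and the denominator is $C(100,4)$.
The same proportion of overlap is thus more significant for the longer lists.
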
340: One may argue that it is unrealistic to consider the
341: top 5000 genes as being differentially expressed,
342: because by a typical selection criterion (e.g. $p$-value of
343: $t$-test smaller than 0.01, with or without multiple
344: testing correction), the number of genes selected
345: is less than a few hundred. However, as can be
346: seen in Fig.\ref{fig2},  even in the range
347: of 10--500, the overlapping $p$-value changes dramatically.
348: 
349: This pitfall of gene-list-length dependence of overlapping
350: $p$-values  has not been noticed before,
351: perhaps because in other applications of the hypergeometric
352: distribution for calculating overlapping probability,
353: the length of the second list $n_2$ is fixed, for example,
354: in the study of overrepresentation of genes in
355: a certain pathway. The number of overlapping genes $m$
356: is then bounded from above by $\min(n_1, n_2)$ even though
357: the length of the first list, $n_1$, might increase
358: by relaxing the gene selection criterion.
359: 
360:    % \begin{figure}[thpb]
361:    \begin{figure}[t]
362:       \centering
363: 	\begin{turn}{-90}
364: 	\resizebox{4.0cm}{7.0cm}{ \includegraphics{wen-fig3.eps} }
365: 	\end{turn}
366:       \caption{The test significance ($-\log_{10}(p$-value))
367: from $t$-test of $n=$22283 genes sorted by the averaged expression 
368: level (log-transformed) across all 245 samples in 3 studies 
369: (RA, SLE, PsA). The three $t$-tests are for RA vs. control, SLE vs. control,
370: and PsA vs. control.
371:  	}
372:       \label{fig3}
373:    \end{figure}
374: 
375:    % \begin{figure}[thpb]
376:    \begin{figure}[t]
377:       \centering
378: 	\begin{turn}{-90}
379: 	\resizebox{8.0cm}{8.0cm}{ \includegraphics{wen-fig4.eps} }
380: 	\end{turn}
381:       \caption{Several measures of overlapping genes between
382: a pair of studies as a function of the number of genes included
383: in the top-ranking list, for the reduced dataset with 15283 genes.
384: First column: proportion of overlapping genes ($m/n_1$); 
385: second column: number of observed overlapping genes minus the 
386: expected number ($m- n_1^2/15283$); third column: $-\log_{10}(p$-value)
387: by the hypergeometric distribution. First row is for lists ranked
388: by $t$-test result, and second row is for lists ranked by
389: logistic regression. 
390:        }
391:       \label{fig4}
392:    \end{figure}
393: \section{THE EFFECTS OF UNEXPRESSED GENES}
394: 
395: There are many genes/probe-sets on the microarray chip
396: that do not register much signal. Since the expression of these
397: genes is low in both control and patient
398: samples, they usually do not appear in the top-ranking
399: differentially expressed gene list.  Fig.\ref{fig3}
400: shows $-\log_{10}(p$-value) of each gene for the 3 $t$-tests, 
401: sorted by average expression (log-transformed)
402: across all 245 samples in 3 datasets (for both cases and controls). Although
403: we cannot use the average expression level to predict
404: the degree of differential expression, there is 
405: a general trend for low-expressed genes to rank lower in the
406: differentially expressed list as seen from Fig.\ref{fig3}.
407: 
408: We removed the 7000 genes with the lowest overall expression across
409: all samples, leaving $n=15283$ genes. Figs.\ref{fig1} and \ref{fig2}
410: are reproduced in Fig.\ref{fig4} for the dataset with a reduced gene pool.
411: As in Figs.\ref{fig1} and \ref{fig2}, the observed number
412: of overlapping genes $m$ is much larger than the expected,
413: though the difference peaks at 400--600, versus 600--800
414: in Fig.\ref{fig1}. The overlapping significance as measured
415: by $-\log_{10}(p$-value) again quickly moves up with $n_1(=n_2)$
416: as shown in the last column of Fig.\ref{fig4}. 
417: 
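Note that shrinking the gene pool also changes the chance expectation
itself. As an illustrative calculation with $n_1=n_2=5000$, the expected
number of overlapping genes $n_1^2/n$ is
$$
\frac{5000^2}{22283} \approx 1122
\quad \mbox{versus} \quad
\frac{5000^2}{15283} \approx 1636 ,
$$
so a given observed $m$ represents a smaller excess over chance in the
reduced dataset.
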
418: The qualitative similarity between Figs.\ref{fig1}, \ref{fig2}
419: and Fig.\ref{fig4} indicates that the presence of 
420: low-expressed genes does not affect our conclusion.
421: 
422: \addtolength{\textheight}{-12cm}   % This command serves to balance the column lengths
423:                                   % on the last page of the document manually. It shortens
424:                                   % the textheight of the last page by a suitable amount.
425:                                   % This command does not take effect until the next page
426:                                   % so it should come on the page before the last. Make
427:                                   % sure that you do not shorten the textheight too much.
428: 
429: 
430: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
431: \section{CONCLUSIONS AND FUTURE WORK}
432: 
433: \subsection{Conclusions}
434: 
435: Using the hypergeometric distribution to calculate the
436: overlapping probability between two top-ranking differentially
437: expressed gene lists in two studies, we have shown that the
438: overlapping significance depends on the stringency of the
439: gene selection criterion, or equivalently, the length
440: of the gene lists. This observation presents a problem
441: when an overlapping $p$-value is reported but the
442: gene selection criterion is not specified. On the other
443: hand, the increase of the overlapping significance
444: with the gene list length can be an indication that
445: the significant overlapping of genes is a true signal.
446: 
447: 
448: \subsection{Future Work}
449: 
450: The overlapping probability calculated here assumes the two 
451: top-ranking gene lists are selected from the same pool of $n$ 
452: genes. If the two studies are based on different chip
453: platforms, the two initial gene pools are not identical,
454: though there are perhaps certain common genes. We plan to
455: derive the overlapping distribution for this situation.
456: 
457: We also plan to study the probability for genes appearing
458: in three top-ranking gene lists. Although a permutation based
459: approach comparing multiple studies was proposed in \cite{rhode},
460: there is no analytic formula available.
461: 
462: 
463: \section{ACKNOWLEDGMENTS}
464: 
465: We would like to thank Prof. Richard Friedberg for suggestions.
466: 
467: 
468: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
469: 
470: 
471: \begin{thebibliography}{99}
472: 
473: \bibitem{hyper}
474: H.T. Gonin,
475: ``The use of factorial moments in the treatment of the hypergeometric
476: distribution and in tests for regression'',
477: {\it Philosophical Mag.}, vol 7, 1936, pp 215-226.
478: 
479: \bibitem{fisher}
480: R.A. Fisher,
481: {\sl Statistical Methods for Research Workers},
482: Oliver and Boyd, Edinburgh, 1934.
483: 
484: \bibitem{tavazoie}
485: S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, G.M. Church,
486: ``Systematic determination of genetic network architecture'',
487: {\it Nature Genet.}, vol 22, 1999, pp 281-285.
488: 
489: \bibitem{draghici}
490: S.  Dr\v{a}ghici, P. Khatri, R.P. Martins, G.C. Ostermeier,
491: S.A.  Krawetz,
492: ``Global functional profiling of gene expression'',
493: {\it Genomics}, vol 81, 2003, pp 98-104.
494: 
495: \bibitem{fino}
496: G. Finocchiaro, F. Mancuso, H. Muller,
497: ``Mining published lists of cancer related microarray experiments:
498: identification of a gene expression signature having a
499: critical role in cell-cycle control'',
500: {\it BMC Bioinf.},  vol 6(suppl 4), 2005, S14.
501: 
502: \bibitem{hosack}
503: D.A. Hosack, G. Dennis Jr., B.T. Sherman,
504: H.C. Lane, R.A. Lempicki,
505: ``Identifying biological themes within lists of genes
506: with EASE'',
507: {\it Genome Biol.}, vol 4, 2003, R70.
508: 
509: 
510: \bibitem{boorsma}
511: A.  Boorsma, B.C. Foat, D. Vis, F. Klis, H.J. Bussemaker,
512: ``T-profiler: scoring the activity of predefined groups
513: of genes using gene expression data'',
514: {\it Nucleic Acids Res.}, vol 33, 2005,  pp W592-W595.
515: 
516: \bibitem{curtis}
517: R.K. Curtis, M.  Ore\v{s}i\v{c}, A. Vidal-Puig,
518: ``Pathways to the analysis of microarray data'',
519: {\it Trends Biotech.}, vol 23, 2005, pp 429-435.
520: 
521: \bibitem{mao}
522: X. Mao, T. Cai, J.G. Olyarchuk, L. Wei,
523: ``Automated genome annotation and pathway identification using
524: the KEGG Orthology (KO) as a controlled vocabulary'',
525: {\it Bioinfo.}, vol 21, 2005,  pp 3787-3793.
526: 
527: \bibitem{tian}
528: L. Tian, S.A. Greenberg, S.W. Kong, J. Altschuler,
529: I.S.  Kohane, P.J. Park,
530: ``Discovering statistically significant pathways in expression profiling studies'',
531: {\it Proc. Natl. Acad. Sci.}, vol 102, 2005, pp 13544-13549.
532: 
533: \bibitem{ra}
534: F.M. Batliwalla, E.C.  Baechler, X.  Xiao, W.  Li, 
535: S. Balasubramaniuan, H. Khalili, A. Damle, W.A. Ortmann, A. Perrone,
536: A.B. Kantor, M. Kern, P.S. Gulko, M. Kern, R. Furie, T.W. Behrens, P.K.  Gregersen,
537: ``Peripheral blood gene expression profiling in rheumatoid arthritis'',
538: {\it Genes and Immunity}, vol 6, 2005, pp 388-397.
539: 
540: \bibitem{sle}
541: E.C. Baechler, F.M. Batliwalla, G. Karypis, P.M. Gaffney, W.A. Ortmann,
542: K.J.  Espe, K.B. Shark, W.J. Grande, K.M. Hughes, V. Kapur, P.K.  Gregersen,
543: T.W. Behrens, 
544: ``Interferon-inducible gene expression signature in peripheral
545: blood cells of patients with severe lupus'',
546: {\it Proc. Natl. Acad. Sci.}, vol 100, 2003, pp 2610-2615.
547: 
548: \bibitem{psa}
549: F.M. Batliwalla, W. Li, C.T. Ritchlin, X. Xiao, M. Brenner,
550: T.  Laragione, T. Shao, R. Durham, S. Kemshetti, E. Schwarz,
551: R.  Coe, M. Kern, E.C. Baechler, T.W. Behrens, P.K. Gregersen, P.K. Gulko,
552: ``Microarray analyses of peripheral blood cells identifies
553: unique expression signature in psoriatic arthritis'',
554: {\it Mol. Med.}, 2006, to appear.
555: 
556: \bibitem{rhode}
557: D.R. Rhodes, J. Yu, K. Shanker, N. Deshpande, R. Varambally, D. Ghosh,
558: T.  Barrette, A. Pandey, A.M. Chinnaiyan, 
559: ``Large-scale meta-analysis of cancer microarray data identifies
560: common transcriptional profiles of neoplastic transformation and progression'',
561: {\it Proc. Natl. Acad. Sci.}, vol 101, 2004, pp 9309-9314.
562: 
563: 
564: 
565: 
566: 
567: \end{thebibliography}
568: 
569: \end{document}
570: 
571: