
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%2345678901234567890123456789012345678901234567890123456789012345678901234567890
%        1         2         3         4         5         6         7         8

%\documentclass[letterpaper, 10 pt, conference]{ieeeconf}  % Comment this line out
                                                          % if you need a4paper
\documentclass[a4paper, 10pt, conference]{ieeeconf}      % Use this line for a4
                                                          % paper

\IEEEoverridecommandlockouts                              % This command is only
                                                          % needed if you want to
                                                          % use the \thanks command
\overrideIEEEmargins
% See the \addtolength command later in the file to balance the column lengths
% on the last page of the document

% The following packages can be found on http://www.ctan.org
\usepackage{graphics} % for pdf, bitmapped graphics files
\usepackage{epsfig} % for postscript graphics files
\usepackage{rotating}
%\usepackage{mathptmx} % assumes new font selection scheme installed
%\usepackage{times} % assumes new font selection scheme installed
%\usepackage{amsmath} % assumes amsmath package installed
%\usepackage{amssymb}  % assumes amsmath package installed

\title{\LARGE \bf
Overlapping Probabilities of Top Ranking Gene Lists,
Hypergeometric Distribution, and Stringency of Gene Selection Criterion
}

\author{Wen Fury, Franak Batliwalla, Peter K. Gregersen, and Wentian Li% <-this % stops a space
\thanks{W. Fury is a Senior Bioinformatics Scientist at Regeneron Pharmaceutical, Inc.
        Tarrytown, NY 10591, USA.
        {\tt\small wen.fury@regeneron.com}}%
\thanks{F. Batliwalla, P.K. Gregersen, and W. Li are Research Scientists
with the Robert S Boas Center for Genomics and Human Genetics,
Feinstein Institute for Medical Research, North Shore LIJ Health System,
        Manhasset, NY 11030, USA
        {\tt\small fb@nshs.edu},
        {\tt\small peterg@nshs.edu},
        {\tt\small wli@nslij-genetics.org}}%
}

\begin{document}

\maketitle
\thispagestyle{empty}
\pagestyle{empty}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{abstract}

When the same set of genes appears in the top-ranking gene lists of
two different studies, it is often of interest to estimate
the probability that this is a chance event. This overlapping
probability is well known to follow the hypergeometric
distribution. Usually, the lengths of the top-ranking gene lists
are assumed to be fixed by a pre-set criterion on, e.g., the
$p$-value of a $t$-test. We investigate how the overlapping probability
changes with the gene selection criterion, or simply, with the
length of the top-ranking gene lists. We conclude that the
overlapping probability is indeed a function of the gene list
length, and that its statistical significance should be quoted in
the context of the gene selection criterion.

\end{abstract}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{INTRODUCTION}

One of the most common tasks in microarray analysis
is to identify a list of genes that are differentially
expressed under two conditions, such as affected by
a disease vs. normal, before vs. after a medical
treatment, or one disease subtype vs. another. The
number of genes on the top-ranking list is
usually much smaller than the total number of genes
on the chip, $n$. If the same type of microarray chip is used for
two different studies (e.g., disease A vs. control
and disease B vs. control), two differentially
expressed gene lists can be obtained, with $n_1$ and
$n_2$ genes. Researchers often find that the same genes
appear in both lists and hypothesize that these common
genes are involved in the etiology of both diseases.

However, for such a hypothesis to be convincing,
one first has to estimate the probability of
overlapping genes by chance alone. In other words,
if two lists of genes are selected randomly out of $n$ genes,
we would like to calculate the probability
that the two lists have $m$ genes in common,
given that the lengths of the two lists are $n_1$ and $n_2$.
This overlapping probability is known to follow the
hypergeometric distribution\footnote{Despite a certain
similarity, this problem is not the birthday problem,
i.e., the probability that two people in a room
have the same birthday.}. The name hypergeometric
distribution was first used in \cite{hyper}, and
was popularized by its role in Fisher's exact
test \cite{fisher}.

In microarray analysis, the overlapping probability and the
hypergeometric distribution mainly appear in testing
the enrichment of genes in a certain functional
category \cite{tavazoie, draghici, fino, hosack,
boorsma, curtis, mao, tian}. In this application,
the first list is the top-ranking differentially
expressed genes, and a gene selection process is
involved. The second list, in contrast, is given:
$n_2$ genes are known to be in a pathway, to be
members of a protein family, to be described by a gene ontology term,
etc. One asks for the chance probability
that $m$ out of the $n_1$ selected genes are in
the given pathway or protein family, or are described
by the given gene ontology term. Whether $n_2$ is fixed is the
main difference between that application and ours.

When a different gene selection criterion is used,
the numbers of genes in the two top-ranking lists
of the two studies ($n_1$ and $n_2$) also change.
Because the stringency of a gene selection criterion
is always adjustable and to some extent arbitrary,
we would like to examine whether these changes
affect the overlapping probability. In the two
extreme situations, very small $n_1 = n_2 \approx 1$
and very large $n_1 = n_2 = n$, it is clear that
the number of overlapping genes is $m=0$ and $m=n$, respectively.
These $m$ values appear essentially 100\% of the time under chance, so
the corresponding $p$-value is equal to 1, i.e.,
not significant. For intermediate $n_1 \approx n_2$
values, it is not clear what the overlapping
probability and significance will be, and this is
the topic of this paper.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{HYPERGEOMETRIC DISTRIBUTION AND OVERLAPPING P-VALUES}

Given integers $n$, $n_1$, $n_2$, $m$
($\max(n_1, n_2) \le n$ and $m \le \min(n_1, n_2)$), the hypergeometric
distribution is defined as
$$
P(m) =\frac{ C(n_1, m ) C(n-n_1, n_2-m )}{ C(n, n_2) }
= \frac{ \left( \begin{array}{c} n_1 \\ m \end{array} \right)
\left(  \begin{array}{c} n-n_1 \\ n_2-m  \end{array} \right)}
{ \left( \begin{array}{c} n \\ n_2 \end{array}  \right) }
$$
where $C(n, m)$ is the number of ways of choosing
$m$ objects out of $n$ objects: $C(n, m)= n!/[m! (n-m) !]$.

When $n_1$ genes are randomly chosen from the total of
$n$ genes, and another independent random sampling yields $n_2$
genes, the probability that the two lists of genes have
$m$ in common is exactly the hypergeometric probability
$P(m)$. This can be shown by the following counting steps:
\begin{enumerate}
\item The total number of possible choices for the two
lists of genes is $C(n, n_1) \cdot C(n, n_2)$.
\item There are $C(n, n_1)$ possibilities for choosing the first
list.
\item Among the $n_1$ genes in the first list, there are
$C(n_1, m)$ possibilities for choosing the $m$ genes that
are in common with the second list.
\item In the second list, besides the $m$ genes that are in
common with the first list, the remaining $n_2-m$ genes
are chosen among the $n-n_1$ ``leftover'' genes not
in the first list, giving $C(n-n_1, n_2-m)$ possibilities.
\end{enumerate}
$P(m)$ is then simply (\#2 $\times$ \#3 $\times$ \#4) / \#1.
Note that $n_1$ and $n_2$ can be switched without
changing the value of $P(m)$.
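The counting argument above can be checked numerically on a toy example. The following Python sketch is an illustration only and is not part of the original analysis; the function names and the toy sizes ($n=7$, $n_1=3$, $n_2=4$) are assumptions chosen to keep the enumeration small. It enumerates every possible pair of lists and compares the exact fraction of pairs sharing $m$ genes with $P(m)$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def hypergeom_pmf(m, n, n1, n2):
    """P(m) = C(n1, m) C(n-n1, n2-m) / C(n, n2), as an exact fraction."""
    if m < 0 or m > min(n1, n2) or n2 - m > n - n1:
        return Fraction(0)
    return Fraction(comb(n1, m) * comb(n - n1, n2 - m), comb(n, n2))


def brute_force_pmf(m, n, n1, n2):
    """Fraction of all (first list, second list) pairs sharing exactly m genes."""
    genes = range(n)
    hits = 0
    total = 0
    for first in combinations(genes, n1):
        first_set = set(first)
        for second in combinations(genes, n2):
            total += 1
            if len(first_set.intersection(second)) == m:
                hits += 1
    return Fraction(hits, total)


# Toy sizes (hypothetical, only to keep the enumeration small).
n, n1, n2 = 7, 3, 4
for m in range(min(n1, n2) + 1):
    # The enumeration agrees with the closed-form probability.
    assert brute_force_pmf(m, n, n1, n2) == hypergeom_pmf(m, n, n1, n2)
    # Switching n1 and n2 leaves P(m) unchanged.
    assert hypergeom_pmf(m, n, n1, n2) == hypergeom_pmf(m, n, n2, n1)
```

Exact fractions are used so that the comparison does not depend on floating-point rounding. The brute-force enumeration is only feasible for very small $n$.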

It is usually more interesting to calculate the sum of
$P(k)$ over all $k$ equal to or larger than the observed value $m$
(i.e., the $p$-value):
$$
p\mbox{-value} =  \sum_{k = m}^{\min(n_1, n_2)} P(k)
= \sum_{k=0}^{\min(n_1, n_2)} P(k)
-\sum_{k=0}^{m-1} P(k)
$$
In the statistical package $R$ ({\sl http://www.r-project.org/}),
there are at least two ways to calculate the overlapping $p$-value.
The first is to use the cumulative distribution function of the
hypergeometric distribution, {\sl phyper(m, $n_1$, $n-n_1$, $n_2$)}:
$p$-value $= phyper(\min(n_1, n_2), n_1, n-n_1, n_2)
- phyper(m-1, n_1, n-n_1, n_2)$ if $m >0$, and
$p$-value=1 if $m=0$. The second method is to use
the one-sided $p$-value (alternative ``greater'') from Fisher's exact test on
the following 2-by-2 table:
$$
\begin{array}{c|cc|c}
 & col_1 & col_2 & total \\
\hline
 row_1& m & n_1 -m & n_1 \\
row_2& n_2-m & n -n_1-n_2+m & n-n_1 \\
\hline
total & n_2 & n-n_2 & n
\end{array}
$$
The two approaches lead to identical results.
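The two equivalent forms of the $p$-value sum can be cross-checked outside of $R$. The following is a minimal Python sketch using exact rational arithmetic; the function names and the example sizes are assumptions made for illustration and do not come from the datasets. The first function sums the upper tail directly, and the second uses the complement of the lower tail, which is what the difference of two cumulative probabilities computes.

```python
from fractions import Fraction
from math import comb


def pmf(k, n, n1, n2):
    """Hypergeometric probability P(k) as an exact fraction."""
    if k < max(0, n1 + n2 - n) or k > min(n1, n2):
        return Fraction(0)
    return Fraction(comb(n1, k) * comb(n - n1, n2 - k), comb(n, n2))


def overlap_pvalue(m, n, n1, n2):
    """Upper tail: sum of P(k) for k = m, ..., min(n1, n2)."""
    return sum((pmf(k, n, n1, n2) for k in range(m, min(n1, n2) + 1)),
               Fraction(0))


def overlap_pvalue_complement(m, n, n1, n2):
    """Same p-value written as 1 minus the sum of P(k) for k = 0, ..., m-1."""
    lower = sum((pmf(k, n, n1, n2) for k in range(0, m)), Fraction(0))
    return 1 - lower


# Hypothetical small example: 20 genes, lists of length 5 and 6, 3 in common.
p = overlap_pvalue(3, 20, 5, 6)
assert p == overlap_pvalue_complement(3, 20, 5, 6)
assert overlap_pvalue(0, 20, 5, 6) == 1   # m = 0 always gives p-value 1
print(float(p))
```

In $R$, the complement form should correspond to calling {\sl phyper} with $m-1$ and the upper-tail option (lower.tail = FALSE), which avoids the explicit subtraction.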

   % \begin{figure}[thpb]
   \begin{figure}[t]
      \centering
        \begin{turn}{-90}
        % \includegraphics[scale=1.0]{wen-fig1.eps}
        \resizebox{8.0cm}{8.0cm}{ \includegraphics{wen-fig1.eps} }
        \end{turn}
      \caption{First column: proportion of overlapping genes between
two top-ranking gene lists for a pair of studies ($m/n_1$)
as a function of the gene list length ($n_1(=n_2)$). The top row is
for gene ranking by $t$-test and the bottom row is for gene ranking
by logistic regression. The overlapping proportion for
two randomly shuffled lists is shown by crosses, and the line
$m/n_1 = n_1/n$ is marked. Second column: observed number
of overlapping genes ($m$) minus the expected number
of overlapping genes ($n_1^2/n$).
        }
      \label{fig1}
   \end{figure}

\section{PROPORTION OF OVERLAPPING GENES IN A COLLECTION
OF MICROARRAY DATASETS}

In the hypergeometric distribution, the number of overlapping
elements $m$ is a variable that is not determined by the
list lengths $n_1, n_2$. To get a rough idea of
how $m$ changes with the list lengths, we use three real
microarray datasets. These studies concern three
autoimmune diseases: rheumatoid
arthritis (RA), systemic lupus erythematosus (SLE), and
psoriatic arthritis (PsA), described in detail in
\cite{ra, sle, psa}. The numbers of controls (C) and patients (P)
in these three datasets are (C=39, P=46), (C=41, P=81), and
(C=19, P=19), respectively. The total number of genes/probe-sets
is $n=22283$, and the expression levels are log-transformed.
Genes are ranked by their degree of differential expression,
which can be measured by various tests or models, such
as the $t$-test and logistic regression.

For any pair of studies, with a fixed length
$n_1(=n_2)$ of the top-ranking gene lists, one can count the number of overlapping genes
$m$ and the proportion $m/n_1(=m/n_2)$. Fig.\ref{fig1} (left
column) shows this proportion as a function of $n_1(=n_2)$
for three study pairs (RA-SLE, SLE-PsA, RA-PsA) and for two ranking methods
($t$-test and logistic regression). The corresponding overlapping
proportion for two randomly shuffled lists is also
indicated in Fig.\ref{fig1} by crosses.

When $n_1(=n_2)$ is small, $m$ is more likely to be zero, so
the proportion is also zero. When $n_1(=n_2)$ approaches the
total number of genes, $n$, all genes are overlapping genes,
and the proportion is 1. Fig.\ref{fig1} indeed shows these
trends at the two extreme points. To check the
behavior in between, we draw a reference line in Fig.\ref{fig1}
(left column) that assumes a linear relationship,
$m/n_1 = n_1/n$, which is the overlapping proportion expected by chance.
Most of the points in Fig.\ref{fig1}
are above this line, whereas the overlapping proportion of two
random lists falls on this line.

To give an idea of how many more common genes are observed
than expected by random chance, Fig.\ref{fig1} (right
column) plots the observed $m$ minus the expected $m_{exp}= n_1^2/n(=n_2^2/n)$
as a function of $n_1(=n_2)$. The maximum difference between
the observed and expected values is reached between $n_1=5000$ and
$n_1=10000$. The difference between the observed and expected $m$
can be as much as 600--800.
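The chance expectation $n_1 n_2/n$ used for the reference line can be illustrated by simulation. The sketch below is a hedged Python example, not the procedure used to produce the crosses in Fig.\ref{fig1}: it draws two independent random lists from a pool of $n$ labels and averages their overlap. The function names, the list length of 1000, the number of trials, and the seed are all assumptions made for illustration.

```python
import random


def expected_overlap(n, n1, n2):
    """Mean of the hypergeometric distribution: n1 * n2 / n."""
    return n1 * n2 / n


def simulated_mean_overlap(n, n1, n2, trials, seed=0):
    """Average overlap of two independently drawn random gene lists."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    genes = range(n)
    total = 0
    for _ in range(trials):
        first = set(rng.sample(genes, n1))
        second = rng.sample(genes, n2)
        total += sum(1 for g in second if g in first)
    return total / trials


n = 22283                              # pool size quoted in the text
sim = simulated_mean_overlap(n, 1000, 1000, trials=300)
exp_m = expected_overlap(n, 1000, 1000)
print(f"simulated mean overlap {sim:.2f}, expected {exp_m:.2f}")
```

With $n_1 = n_2$, the expected proportion $m/n_1$ reduces to $n_1/n$, which is the straight reference line described above.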

   % \begin{figure}[thpb]
   \begin{figure}[t]
      \centering
        \begin{turn}{-90}
        \resizebox{4.0cm}{7.50cm}{ \includegraphics{wen-fig2.eps} }
        \end{turn}
      \caption{
Overlapping significance as measured by $-\log_{10}(p$-value),
where the $p$-value is obtained from the hypergeometric distribution,
as a function of $n_1(=n_2)$, the number of genes in the
top-ranking gene lists. The $R$ program reports the $p$-value as
zero whenever it is lower than $2.2 \times 10^{-16}$, and
we use a ceiling of $15.65758 = -\log_{10}(2.2 \times 10^{-16})$
in the plot. Six lines are shown for three
study pairs (RA-SLE, SLE-PsA, RA-PsA) and two tests/models
($t$-test and logistic regression). The corresponding overlapping significance
for two randomly shuffled lists is also shown (indicated by crosses).
        }
      \label{fig2}
   \end{figure}

\section{OVERLAPPING SIGNIFICANCE}

The overlapping $p$-values corresponding to the $m$ counts
plotted in Fig.\ref{fig1} were calculated from the hypergeometric
distribution and are shown in Fig.\ref{fig2}:
the $y$-axis is $-\log_{10}(p$-value), and the $x$-axis is
$n_1(=n_2)$. Six lines are shown for
three comparisons (RA-SLE, SLE-PsA, RA-PsA) and two
measures of differential expression ($t$-test and
logistic regression). Zero $p$-values are converted to
$2.2 \times 10^{-16}$, which is the minimum value
reported by the $R$ program. Fig.\ref{fig2} shows that,
apart from the two ends ($m=n_1=n_2=0$ and $m=n_1=n_2=n$) where
the $p$-value is 1, the overlapping significance
quickly increases with the length of the top-ranking gene lists
$n_1(=n_2)$, and can be extremely significant when a
large number of genes are kept in the two lists
for comparison.

This result confirms our earlier suspicion that the overlapping
significance is a function of the gene list lengths.
If the selection of $n_1, n_2$ is arbitrary, the
overlapping significance thus calculated is also
arbitrary. It is not surprising that the
overlapping significance may keep increasing
(i.e., the $p$-value keeps decreasing) as $n_1(=n_2)$ increases,
because a $p$-value in general depends on the sample
size. When a signal is real (true positive), the $p$-value
tends to decrease with the sample size.
In contrast, if a true signal is absent, the
sample size does not affect the conclusion. As
can be seen in Fig.\ref{fig2}, the overlapping significance
for two random lists does not really change with $n_1(=n_2)$.
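The dependence of significance on list length can be illustrated with a synthetic scenario. In the Python sketch below, the overlap is set to a hypothetical 1.5 times the chance expectation at every list length; this multiplier, the list lengths, and the function names are assumptions for illustration only and are not values from the RA, SLE, or PsA data. Binomial coefficients are evaluated through the log-gamma function to avoid overflow, and $p$-values are floored at $2.2 \times 10^{-16}$ as in Fig.\ref{fig2}.

```python
from math import exp, lgamma, log10


def log_comb(a, b):
    """Natural log of C(a, b), via log-gamma to avoid huge integers."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)


def neg_log10_pvalue(m, n, n1, n2, floor=2.2e-16):
    """-log10 of the upper-tail hypergeometric p-value, floored at `floor`."""
    lo = max(m, 0, n1 + n2 - n)        # smallest feasible k in the tail
    hi = min(n1, n2)
    if lo > hi:
        p = 0.0                        # observed overlap is impossible
    else:
        logs = [log_comb(n1, k) + log_comb(n - n1, n2 - k) - log_comb(n, n2)
                for k in range(lo, hi + 1)]
        peak = max(logs)               # factor out the largest term
        p = exp(peak) * sum(exp(x - peak) for x in logs)
    p = min(1.0, max(p, floor))
    return -log10(p) if p < 1.0 else 0.0


n = 22283                              # pool size quoted in the text
lengths = [100, 500, 1000, 2000, 5000]
values = []
for length in lengths:
    expected = length * length / n     # chance expectation n1^2 / n
    m = round(1.5 * expected)          # hypothetical overlap, 1.5x chance
    values.append(neg_log10_pvalue(m, n, length, length))

for length, value in zip(lengths, values):
    print(length, round(value, 2))
```

Even though the relative excess over chance is held fixed in this sketch, the significance grows with the list length until it reaches the floor, which is the sample-size effect described above.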

One may argue that it is unrealistic to consider the
top 5000 genes as being differentially expressed,
because by a typical selection criterion (e.g., $p$-value of the
$t$-test smaller than 0.01, with or without multiple
testing correction), the number of genes selected
is fewer than a few hundred. However, as can be
seen in Fig.\ref{fig2}, even in the range
of 10--500, the overlapping $p$-value changes dramatically.

This pitfall of the gene-list-length dependence of overlapping
$p$-values has not been noticed before,
perhaps because in other applications of the hypergeometric
distribution for calculating overlapping probability,
the length of the second list $n_2$ is fixed, for example,
in the study of overrepresentation of genes in
a certain pathway. The number of overlapping genes $m$
is then bounded from above by $\min(n_1, n_2)$ even though
the length of the first list, $n_1$, might increase
as the gene selection criterion is relaxed.

   % \begin{figure}[thpb]
   \begin{figure}[t]
      \centering
        \begin{turn}{-90}
        \resizebox{4.0cm}{7.0cm}{ \includegraphics{wen-fig3.eps} }
        \end{turn}
      \caption{The test significance ($-\log_{10}(p$-value))
from the $t$-test for $n=22283$ genes sorted by the average expression
level (log-transformed) across all 245 samples in the 3 studies
(RA, SLE, PsA). The three $t$-tests are for RA vs. control, SLE vs. control,
and PsA vs. control.
        }
      \label{fig3}
   \end{figure}

   % \begin{figure}[thpb]
   \begin{figure}[t]
      \centering
        \begin{turn}{-90}
        \resizebox{8.0cm}{8.0cm}{ \includegraphics{wen-fig4.eps} }
        \end{turn}
      \caption{Several measures of overlapping genes between
a pair of studies as a function of the number of genes included
in the top-ranking list, for the reduced dataset with 15283 genes.
First column: proportion of overlapping genes ($m/n_1$);
second column: observed number of overlapping genes minus the
expected number ($m- n_1^2/15283$); third column: $-\log_{10}(p$-value)
from the hypergeometric distribution. The first row is for lists ranked
by $t$-test result, and the second row is for lists ranked by
logistic regression.
       }
      \label{fig4}
   \end{figure}

\section{THE EFFECTS OF UNEXPRESSED GENES}

There are many genes/probe-sets on the microarray chip
that do not register much signal. Since these genes
are lowly expressed in both control and patient
samples, they usually do not appear in the top-ranking
differentially expressed gene list. Fig.\ref{fig3}
shows $-\log_{10}(p$-value) of each gene for the 3 $t$-tests,
with genes sorted by average expression (log-transformed)
across all 245 samples in the 3 datasets (both cases and controls). Although
we cannot use the average expression level to predict
the degree of differential expression, there is
a general trend for low-expressed genes to rank lower in the
differentially expressed list, as seen in Fig.\ref{fig3}.

We removed 7000 genes with lower overall expression across
all samples, leaving $n=15283$ genes. Figs.\ref{fig1} and \ref{fig2}
are reproduced in Fig.\ref{fig4} for the dataset with this reduced gene pool.
As in Figs.\ref{fig1} and \ref{fig2}, the observed number
of overlapping genes $m$ is much larger than the expected number,
though the difference peaks at 400--600, versus 600--800
in Fig.\ref{fig1}. The overlapping significance as measured
by $-\log_{10}(p$-value) again quickly moves up with $n_1(=n_2)$,
as shown in the last column of Fig.\ref{fig4}.
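For orientation, the chance expectation $n_1^2/n$ depends on the pool size $n$ as well as on the list length. As a worked example at a hypothetical list length $n_1=n_2=5000$ (a value chosen only for illustration, not a result read from the figures):
$$
\frac{5000^2}{22283} \approx 1122, \qquad \frac{5000^2}{15283} \approx 1636 .
$$
Removing genes from the pool therefore raises the overlap expected by chance at a given list length, so the observed excess has to be judged against a different baseline.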

The qualitative similarity between Figs.\ref{fig1}, \ref{fig2}
and Fig.\ref{fig4} indicates that the presence of
low-expressed genes does not affect our conclusion.

\addtolength{\textheight}{-12cm}   % This command serves to balance the column lengths
                                  % on the last page of the document manually. It shortens
                                  % the textheight of the last page by a suitable amount.
                                  % This command does not take effect until the next page
                                  % so it should come on the page before the last. Make
                                  % sure that you do not shorten the textheight too much.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{CONCLUSIONS AND FUTURE WORK}

\subsection{Conclusions}

Using the hypergeometric distribution to calculate the
overlapping probability between two top-ranking differentially
expressed gene lists in two studies, we have shown that the
overlapping significance depends on the stringency of the
gene selection criterion, or equivalently, on the length
of the gene lists. This observation presents a problem
when an overlapping $p$-value is reported but the
gene selection criterion is not specified. On the other
hand, an increase of the overlapping significance
with the gene list length can be an indication that
the significant overlap of genes is a true signal.

\subsection{Future Work}

The overlapping probability calculated here assumes that the two
top-ranking gene lists are selected from the same pool of $n$
genes. If the two studies are based on different chip
platforms, the two initial gene pools are not identical,
though there are perhaps certain common genes. We plan to
derive the overlapping distribution for this situation.

We also plan to study the probability of genes appearing
in three top-ranking gene lists. Although a permutation-based
approach for comparing multiple studies was proposed in \cite{rhode},
there is no analytic formula available.

463: \section{ACKNOWLEDGMENTS}

464:

465: We would like to thank Prof. Richard Friedberg for suggestions.

466:

467:

468: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

469:

470:

\begin{thebibliography}{99}

\bibitem{hyper}
H.T. Gonin,
``The use of factorial moments in the treatment of the hypergeometric
distribution and in tests for regression",
{\it Philosophical Mag.}, vol 7, 1936, pp 215-226.

\bibitem{fisher}
R.A. Fisher,
{\sl Statistical Methods for Research Workers},
Oliver and Boyd, Edinburgh; 1934.

\bibitem{tavazoie}
S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, G.M. Church,
``Systematic determination of genetic network architecture",
{\it Nature Genet.}, vol 22, 1999, pp 281-285.

\bibitem{draghici}
S. Dr\v{a}ghici, P. Khatri, R.P. Martins, G.C. Ostermeier,
S.A. Krawetz,
``Global functional profiling of gene expression",
{\it Genomics}, vol 81, 2003, pp 98-104.

\bibitem{fino}
G. Finocchiaro, F. Mancuso, H. Muller,
``Mining published lists of cancer related microarray experiments:
identification of a gene expression signature having a
critical role in cell-cycle control",
{\it BMC Bioinf.}, vol 6(suppl 4), 2005, S14.

\bibitem{hosack}
D.A. Hosack, G. Dennis Jr., B.T. Sherman, H.C. Lane,
R.A. Lempicki,
``Identifying biological themes within lists of genes with EASE",
{\it Genome Biol.}, vol 4, 2003, R70.

\bibitem{boorsma}
A. Boorsma, B.C. Foat, D. Vis, F. Klis, H.J. Bussemaker,
``T-profiler: scoring the activity of predefined groups
of genes using gene expression data",
{\it Nucleic Acids Res.}, vol 33, 2005, pp W592-W595.

\bibitem{curtis}
R.K. Curtis, M. Ore\v{s}i\v{c}, A. Vidal-Puig,
``Pathways to the analysis of microarray data",
{\it Trends Biotech.}, vol 23, 2005, pp 429-435.

\bibitem{mao}
X. Mao, T. Cai, J.G. Olyarchuk, L. Wei,
``Automated genome annotation and pathway identification using
the KEGG Orthology (KO) as a controlled vocabulary",
{\it Bioinfo.}, vol 21, 2005, pp 3787-3793.

\bibitem{tian}
L. Tian, S.A. Greenberg, S.W. Kong, J. Altschuler,
I.S. Kohane, P.J. Park,
``Discovering statistically significant pathways in expression profiling studies",
{\it Proc. Natl. Acad. Sci.}, vol 102, 2005, pp 13544-13549.

\bibitem{ra}
F.M. Batliwalla, E.C. Baechler, X. Xiao, W. Li,
S. Balasubramanian, H. Khalili, A. Damle, W.A. Ortmann, A. Perrone,
A.B. Kantor, P.S. Gulko, M. Kern, R. Furie, T.W. Behrens, P.K. Gregersen,
``Peripheral blood gene expression profiling in rheumatoid arthritis",
{\it Genes and Immunity}, vol 6, 2005, pp 388-397.

\bibitem{sle}
E.C. Baechler, F.M. Batliwalla, G. Karypis, P.M. Gaffney, W.A. Ortmann,
K.J. Espe, K.B. Shark, W.J. Grande, K.M. Hughes, V. Kapur, P.K. Gregersen,
T.W. Behrens,
``Interferon-inducible gene expression signature in peripheral
blood cells of patients with severe lupus",
{\it Proc. Natl. Acad. Sci.}, vol 100, 2003, pp 2610-2615.

\bibitem{psa}
F.M. Batliwalla, W. Li, C.T. Ritchlin, X. Xiao, M. Brenner,
T. Laragione, T. Shao, R. Durham, S. Kemshetti, E. Schwarz,
R. Coe, M. Kern, E.C. Baechler, T.W. Behrens, P.K. Gregersen, P.K. Gulko,
``Microarray analyses of peripheral blood cells identifies
unique expression signature in psoriatic arthritis",
{\it Mol. Med.}, 2006, to appear.

\bibitem{rhode}
D.R. Rhodes, J. Yu, K. Shanker, N. Deshpande, R. Varambally, D. Ghosh,
T. Barrette, A. Pandey, A.M. Chinnaiyan,
``Large-scale meta-analysis of cancer microarray data identifies
common transcriptional profiles of neoplastic transformation and progression",
{\it Proc. Natl. Acad. Sci.}, vol 101, 2004, pp 9309-9314.

\end{thebibliography}

\end{document}