0503:q-bio0503025/rfVS.tex

1: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3: %%%%%%%%%%%%%%%%%%%%                                  %%%%%%%%%%%%%%%%%%%%

4: %%%%%%%%%%%%%%%%%%%%       Technical report           %%%%%%%%%%%%%%%%%%%%

5: %%%%%%%%%%%%%%%%%%%%                                  %%%%%%%%%%%%%%%%%%%%

6: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

7: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

8:

9:    \documentclass[10pt]{article}

10:    \usepackage[latin1]{inputenc}

11:    \usepackage{geometry}

12:    \geometry{verbose,a4paper,tmargin=20mm,bmargin=20mm,lmargin=20mm,rmargin=20mm}

13:    \usepackage{setspace}

14:    \usepackage{graphics}

15:    \singlespacing

16:    \usepackage{verbatim}

17:    \usepackage{amsmath}

18:    \usepackage{url}

19:    \bibliographystyle{bioinformatics}

20:    \usepackage[authoryear, round, sort]{natbib}

21:   \usepackage{hyperref} %%??

22:

23:   \title{Variable selection from random forests: application to gene expression data}

24:    \author{\vspace{20pt}

25:      Ram\'on D\'{\i}az-Uriarte$^{1,3}$, Sara Alvarez de Andr\'es$^2$\\

26:    $^1$Bioinformatics Unit, $^2$Cytogenetics Unit\\

27:    Biotechnology Programme\\

28:    Spanish National Cancer Center (CNIO)\\

29:    Melchor Fern\'andez Almagro 3 \\

30:    Madrid, 28029\\

31: \vspace{20pt}

32:    Spain. \\

33: $^3$ Author for correspondence.\\

34:    \texttt{rdiaz@ligarto.org}\\

35:    \url{http://ligarto.org/rdiaz}\\

36:    }

37:    \date{

38:    \vspace*{40pt}

39:    2005-06-22 \\

40: \vspace{20pt}

41:    {\bf Running Head:} Gene selection with random forest.} %%%% eliminate for tech report}

42:   \begin{document}

43:   \maketitle

44:   \newpage

45:   \begin{abstract}

46:

47:   Random forest is a classification algorithm well suited for

48:   microarray data: it shows excellent performance even when most

49:   predictive variables are noise, can be used when the number of

50:   variables is much larger than the number of observations, and

51:   returns measures of variable importance. Thus, it is important to

52:   understand the performance of random forest with microarray data and

53:   its use for gene selection.

54:

55:

56:    We first show the effects of changes in parameters of random forest on the

57:    prediction error.  Then we present an approach for gene selection

58:    that uses measures of variable importance and error rate,

59:    and is targeted towards the selection of small sets of genes.  Using

60:    simulated and real microarray data, we show that the gene selection

61:    procedure yields small sets of genes while preserving predictive accuracy.

62:

63:

64:   We first show the effects of changes in parameters of random forest

65:   on the prediction error rate with microarray data. Then we present

66:   two approaches for gene selection with random forest: 1) comparing

67:   variable importance plots from original and permuted data

68:   sets; 2) using backwards variable elimination. Using simulated and

69:   real microarray data, we show: 1) variable importance plots can be used to recover

70:   the full set of genes related to the outcome of interest, without

71:   being adversely affected by collinearities; 2) backwards variable

72:   elimination yields small sets of genes while preserving predictive

73:   accuracy (compared to several state-of-the-art algorithms). Thus,

74:   both methods are useful for gene selection.

75:

76:   All code is available as an R package, varSelRF, from CRAN

77: \href{http://cran.r-project.org/src/contrib/PACKAGES.html}

78: {http://cran.r-project.org/src/contrib/PACKAGES.html} or from the supplementary

79: material page.

80:

81: Supplementary information:

82: \href{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}

83:

84:

85:   \end{abstract}

86:

87:

88:  \footnotetext[1]{To whom correspondence should be addressed}

89:

90:

91:

92: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

93: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

94: %%%%%%%%%%%%%%%%%%%%                                  %%%%%%%%%%%%%%%%%%%%

95: %%%%%%%%%%%%%%%%%%%%       Bioinformatics             %%%%%%%%%%%%%%%%%%%%

96: %%%%%%%%%%%%%%%%%%%%                                  %%%%%%%%%%%%%%%%%%%%

97: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

98: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

99:

100: %%% \renewcommand{\thefootnote}{\fnsymbol{footnote}}

101:

102: %%%   \documentclass{bioinfo}

103: %%%   \copyrightyear{2005}

104: %%%   \pubyear{2005}

105: %%%   \usepackage[latin1]{inputenc}

106:

107: %%%   \begin{document}

108: %%%   \firstpage{1}

109:

110: %%%   \title[Gene selection with random forests]{Variable selection from random forests: application to gene expression data}

111:

112: %%% %   \author{Ram\'on D\'{\i}az-Uriarte}

113: %%% %   \address{Bioinformatics Unit\\

114: %%% %   Spanish National Cancer Center (CNIO)\\

115: %%% %   Melchor Fern\'andez Almagro 3 \\

116: %%% %   Madrid, 28029\\

117: %%% %   Spain

118: %%% %   }

119:

120: %%%    \author{Ram\'on D\'{\i}az-Uriarte\,$^{\rm a}$\footnote{To whom correspondence should be addressed}, Sara Alvarez de

121: %%%      Andr\'es\,$^{\rm b}$}

122:

123: %%%    \author{Ram\'on D\'{\i}az-Uriarte\,$^{\rm a,}$\footnotemark[1], Sara Alvarez de

124: %%%      Andr\'es\,$^{\rm b}$}

125: %%%    \address{$^{a}$Bioinformatics Unit, $^{b}$Cytogenetics Unit\\

126: %%%      Biotechnology Programme\\

127: %%%   Spanish National Cancer Centre (CNIO)\\

128: %%%   Melchor Fern\'andez Almagro 3 \\

129: %%%   Madrid, 28029\\

130: %%%   Spain

131: %%%   }

132:

133: %%%   \maketitle

134:

135: %%%   \begin{abstract}

136:

137: %%%   \section{Motivation:}

138: %%%Random forest is a classification algorithm well suited

139: %%%for microarray data: it shows excellent performance

140: %%%even when most predictive variables are noise, can be

141: %%%used when the number of variables is much larger than

142: %%%the number of observations, and returns measures of

143: %%%variable importance. Thus, it is important to

144: %%%understand the performance of random forest with

145: %%%microarray data and its use for gene selection.

146:

147:

148: %%%   \section{Results:}

149: %%%   We first show the effects of changes in parameters of random forest on the

150: %%%   prediction error.  Then we present an approach for gene selection

151: %%%   that uses measures of variable importance and error rate,

152: %%%   and is targeted towards the selection of small sets of genes.  Using

153: %%%   simulated and real microarray data, we show that the gene selection

154: %%%   procedure yields small sets of genes while preserving predictive accuracy.

155:

156: %%%   \section{Availability:}

157: %%%  All code is available as an R package, varSelRF, from CRAN,

158: %%%\href{http://cran.r-project.org/src/contrib/PACKAGES.html}

159: %%%{http://cran.r-project.org/src/contrib/PACKAGES.html}, or from the supplementary

160: %%%material page.

161:

162:

163: %%%   \section{Contact:}

164: %%%   \href{rdiaz@ligarto.org}{rdiaz@ligarto.org}

165: %%%   \section{Supplementary information:}\\

166: %%%   \href{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}

167:

168:

169:

170: %%%   \end{abstract}

171:

172: %%% \footnotetext[1]{To whom correspondence should be addressed}

173: %%%%%%%%%%%%%%%%%%%%%%%%5   end bioinformatics

174:

175:

176:

177: \section{Introduction}

178:

179:

180: Random forest is an algorithm for classification developed by Leo

181: Breiman \citep{breiman-rf} that uses an ensemble of classification

182: trees \citep{cart, ripley-96, htf-01}. Each of the classification

183: trees is built using a bootstrap sample of the data, and at each split

184: the candidate set of variables is a random subset of the variables.

185: Thus, random forest uses both bagging (bootstrap aggregation), a

186: successful approach for combining unstable learners

187: \citep{breiman-bagging, htf-01}, and random variable selection for

188: tree building.  Each tree is unpruned (grown fully), so as to obtain

189: low-bias trees; at the same time, bagging and random variable

190: selection result in low correlation of the individual trees.  The

191: algorithm yields an ensemble that can achieve both low bias and low

192: variance (from averaging over a large ensemble of low-bias,

193: high-variance but low-correlation trees).
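
The variance argument above can be made explicit with an idealized
sketch (our notation; a standard identity, not taken from the original
description of the algorithm). If the $B$ trees of the ensemble were
identically distributed, each with variance $\sigma^2$ and pairwise
correlation $\rho$, the variance of their average would be
\begin{equation*}
  \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
\end{equation*}
The second term vanishes as the number of trees $B$ grows, so the
remaining variance is governed by $\rho$. Under these simplifying
assumptions, this is why lowering the correlation among trees (through
bagging and random variable selection) matters.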

194:

195:

196: Random forest has excellent performance in classification tasks,

197: comparable to support vector machines. Although random forest is not

198: widely used in the microarray literature \citep[but see][]{SaraRF,

199:   Izmir2004, Wu.Zhao2003, Gunther.Heyes2003, Man-rf,

200:   Schwender.Bolt2004}, it has several characteristics that make it

201: ideal for these data sets: a) can be used when there are many more

202: variables than observations; b) has good predictive performance

203: even when most predictive variables are noise; c) does not

204: overfit; d) can handle a mixture of categorical and continuous

205: predictors; e) incorporates interactions among predictor variables; f)

206: the output is invariant to monotone transformations of the predictors;

207: g) there are high quality and free implementations: the original

208: Fortran code from L.\ Breiman and A.\ Cutler, and an R package from

209: A.\ Liaw and M.\ Wiener \citep{rf-rnews}; h) there is little need to

210: fine-tune parameters to achieve excellent performance; i) returns measures of

211: variable (gene) importance. The most

212: important parameter to choose is $mtry$, the number of input variables

213: tried at each split, but it has been reported that the default value

214: is often a good choice \citep{rf-rnews}.  In addition, the user needs

215: to decide how many trees to grow for each forest ($ntree$) as well as

216: the minimum size of the terminal nodes ($nodesize$). These three

217: parameters will be thoroughly examined in this paper.

218:

219: Given these promising features, it is important to understand the

220: performance of random forest compared to alternative state-of-the-art

221: prediction methods with microarray data, as well as the effects

222: of changes in the parameters of random forest. In this paper we present, as necessary

223: background for the main topic of the paper (gene selection), the first

224: thorough examination of these issues, including evaluating the effects

225: of $mtry$, $ntree$ and $nodesize$ on error rate using

226: nine real microarray data sets and simulated data.

227:

228: The main question addressed in this paper is gene selection using random

229: forest.  A few authors have previously used variable selection with random

230: forest.  \citet{dudoit-inbook} and \citet{Wu.Zhao2003} use filtering approaches

231: and, thus, do not take advantage of the measures of variable importance

232: returned by random forest as part of the algorithm. \citet{svetnik} propose a

233: method that is somewhat similar to our approach. The main difference is that

234: \citet{svetnik} first find the ``best'' dimension ($p$) of the model, and then

235: choose the $p$ most important variables. This is a sound strategy when the

236: objective is to build accurate predictors, without any regard for model

237: interpretability.  But this might not be the most appropriate for our purposes

238: as it shifts the emphasis away from selection of specific genes, and in genomic

239: studies the identity of the selected genes is relevant (e.g., to understand

240: molecular pathways or to find targets for drug development).

241:

242:

243: The last issue addressed in this paper is the multiplicity (or lack of

244: uniqueness or lack of stability) problem. Variable selection with microarray

245: data can lead to many solutions that are equally good from the point of view of

246: prediction rates, but that share few common genes.  This multiplicity problem

247: has been emphasized by \citet{Somorjai2003} and recent examples are shown in

248: \citet{EinDor} and \citet{Michielis}. Although multiplicity of results is not a problem when

249: the only objective of our method is prediction, it casts serious doubts on the

250: biological interpretability of the results \citep{Somorjai2003}.  Unfortunately

251: most ``methods papers'' in bioinformatics do not evaluate the stability of the

252: results obtained, leading to a false sense of trust in the biological

253: interpretability of the output obtained. Our paper presents a thorough and

254: critical evaluation of the stability of the lists of selected genes with the

255: proposed (and two competing) methods.

256:

257:

258:

259:

260:

261:

262:

263:

264: \section{Variable selection methods}

265:

266: \subsection{Two objectives of variable selection}

267:

268: When facing gene selection problems, biomedical researchers often show

269: interest in one of the following objectives:

270:

271: \begin{enumerate}

272: \item To identify relevant genes for subsequent research; this

273:   involves obtaining a (probably large) set of genes that are related

274:   to the outcome of interest, and this set should include genes even if they

275:   perform similar functions and are highly correlated.

276:

277: \item To identify small sets of genes to be used for diagnostic

278:   purposes in clinical practice; this involves obtaining the smallest

279:   possible set of genes that can still achieve good predictive

280:   performance (thus, ``redundant'' genes should not be selected).

281: \end{enumerate}

282:

283: We will focus on the second objective. The use of random forest for the first

284: objective is under investigation and will be reported elsewhere.

285:

286:

287: \subsection{Variable importance from random forest}

288:

289: Random forest returns several measures of variable importance. The

290: most reliable measure is based on the decrease

291: of classification accuracy when values of a variable in a node of a

292: tree are permuted randomly \citep{breiman-rf, Bureau2003},

293: and this is the measure of variable importance (in its unscaled

294: version ---see supplementary material) that we will use in the rest of

295: the paper.
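
As a hedged sketch of this measure (our notation; one common
formulation of permutation importance, which may differ in detail from
any given implementation), let $a_t$ be the fraction of out-of-bag
cases correctly classified by tree $t$, and $a_t^{(j)}$ the same
fraction after the values of variable $j$ have been randomly permuted
among those cases. The unscaled importance of variable $j$ is then the
average decrease in accuracy over the $T$ trees of the forest,
\begin{equation*}
  I_j = \frac{1}{T} \sum_{t=1}^{T} \left( a_t - a_t^{(j)} \right).
\end{equation*}
A variable unrelated to the class yields $I_j \approx 0$, whereas a
variable the trees rely on yields $I_j > 0$. The scaled version divides
$I_j$ by a quantity analogous to its standard error; we use the
unscaled $I_j$.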

296:

297:

298: % . This measure is sometimes reported as such, and

299: % sometimes it is reported after scaling it, or dividing by a quantity

300: % somewhat analogous to its standard error (``somewhat analogous''

301: % because the data used to obtain that ``standard error'' are not truly

302: % independent, and thus the true standard error can be severely

303: % underestimated).  We use in this paper the unscaled importance

304: % measure, because it allows us to compare directly runs with different

305: % settings of $ntree$ and $mtry$ (in contrast, scaled importances

306: % increase monotonically as we increase the value of $ntree$).

307: % %%%Explain why we use unscaled importance:

308: %%%a) allows direct comparison between runs with different ntrees and mtries.

309: %%%b) does not mislead into considering them Z scores.

310:

311:

312:

313: \subsection{Backwards elimination of variables (genes) using OOB error}

314:

315: To select genes we can iteratively fit random forests, at each iteration

316: building a new forest after discarding those variables (genes) with the

317: smallest variable importances; the selected set of genes is the one that yields the

318: smallest error rate.  Random forest returns a measure of error rate based on

319: the out-of-bag cases for each fitted tree, the OOB error, and this is the

320: measure of error we will use.  Note that in this section we are using OOB

321: error to choose the final set of genes, not to obtain unbiased estimates of the

322: error rate of this rule.  Because of the iterative approach, the OOB error is

323: biased down and cannot be used to assess the overall error rate of the approach,

324: for reasons analogous to those leading to ``selection bias'' \citep{ambroise, simon-03}. To assess prediction error rates we will use the bootstrap, not

325: OOB error (see section \ref{boot}). (Using error rates

326:   affected by selection bias to select the optimal number of genes is

327:   not necessarily a bad procedure from the point of view of selecting

328:   the final number of genes; see \citet{Braga-Neto.Carroll2004}).

329: %\citet{svetnik} leave aside a set of data,

330: %and decide on the stopping criterion using the error rate on the test data.

331: %This approach, however, is problematic when, as in our case, we are interested

332: %in specific genes and not in using the test set error rate to select the

333: %number of genes.

334:

335: In our algorithm we examine all forests that result from eliminating,

336: iteratively, a fraction, $fraction.dropped$, of the genes (the

337: least important ones) used in the previous iteration. By default,

338: $fraction.dropped = 0.2$ which allows for relatively fast operation,

339: is coherent with the idea of an ``aggressive variable selection''

340: approach, and increases the resolution as the number of genes

341: considered becomes smaller.  We do not recalculate variable

342: importances at each step as \citet{svetnik} mention severe overfitting

343: resulting from recalculating variable importances. After fitting all

344: forests, we examine the OOB error rates from all the fitted random

345: forests. We choose the solution with the smallest number of genes

346: whose error rate is within $u$ standard errors of the minimum error

347: rate of all forests.

348: % (The standard error is calculated using the

349: % expression for a binomial error count [$\sqrt{p (1-p) * 1/N}$]).

350: Setting $u = 0$ is the same as selecting the set of genes that

351: leads to the smallest error rate.  Setting $u = 1$ is similar to the

352: common ``1 s.e.  rule'', used in the classification trees literature

353: \citep{ripley-96, cart}; this strategy can lead to solutions with

354: fewer genes than selecting the solution with the smallest error

355: rate, while achieving an error rate that is not

356: different, within sampling error, from the ``best solution''. In this

357: paper we will examine both the ``1 s.e. rule'' and the ``0 s.e.

358: rule''.
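
The elimination schedule and the selection rule described above can be
summarized with a short worked sketch. The notation ($p_k$, $f$,
$\hat{e}_k$, $N$) is ours, rounding is ignored, and the binomial
expression for the standard error is an assumption of this sketch. If
$p_0$ genes are available initially and a fraction $f =
fraction.dropped$ of them is discarded at each iteration, then after
$k$ iterations approximately
\begin{equation*}
  p_k \approx p_0 \, (1 - f)^k
\end{equation*}
genes remain. With $f = 0.2$ and, for instance, $p_0 = 3051$, reaching
$p_k \approx 2$ takes about $\log(3051/2) / \log(1/0.8) \approx 33$
iterations, and the steps become finer as $p_k$ decreases. Let
$\hat{e}_k$ denote the OOB error rate of the forest fitted with $p_k$
genes, $\hat{e}_{\min} = \min_k \hat{e}_k$, and $N$ the number of
samples. Assuming a binomial standard error,
\begin{equation*}
  \widehat{se} = \sqrt{\frac{\hat{e}_{\min} \, (1 - \hat{e}_{\min})}{N}},
\end{equation*}
the selected solution is the forest with the smallest $p_k$ such that
\begin{equation*}
  \hat{e}_k \le \hat{e}_{\min} + u \cdot \widehat{se}.
\end{equation*}
As a purely hypothetical example, with $N = 62$ and $\hat{e}_{\min} =
0.15$ we obtain $\widehat{se} \approx 0.045$, so that with $u = 1$ any
forest with OOB error of at most about $0.195$ is a candidate, and the
candidate with the fewest genes is chosen.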

359:

360:

361:

362: %Note here no need for very large mtries, etc, since we do not want all

363: %the important genes, but just enough genes to do a good job.

364:

365: %Besides the stopping criterion we have also chosen the following settings:

366:

367: %\begin{itemize}

368: %\item We examine all forest that result from iteratively

369: %  \textbf{eliminating the lower 50\% of the genes}; this

370: %  allows for relatively fast operation, and is coherent with the

371: %  idea of an ``aggressive variable selection'' approach, and

372: %  increases the ``resolution'' as the number of genes

373: %  considered becomes smaller.

374: %\item \textbf{Variable importances are not recalculated at each step}, but

375: %  instead we use the variable importances computed at the end of

376: %  the run; we have not observed important differences whether or

377: %  not variable importances are recalculated, but \citet{svetnik}

378: %  mention severe overfitting resulting from recalculating

379: %  variable importances.

380: %\item We examine the OOB error rates from all the fitted random

381: %  forests. We choose the \textbf{solution with the smallest number of

382: %  genes whose error rate is within 1 standard error of the

383: %  minimum error rate of all forests}.  and the ``1 SE rule'' is common in the

384: %  classification trees literature ).

385: %\end{itemize}

386:

387:

388:

389: \section{Evaluation of performance}

390:

391: \subsection{Data sets}

392: We have used both simulated and real microarray data sets to evaluate

393: the variable selection procedure. For the real

394: data sets, the original reference paper and main features are shown in

395: Table \ref{datasets}. Further details are provided in the

396: supplementary material.

397:

398:

399: \begin{table}

400: \caption{\label{datasets} Main characteristics of the microarray data

401:   sets used.}

402: {\footnotesize

403: \begin{tabular}{l|lrrr}

404: Dataset & Original ref.&Genes&Patients&Classes \\

405: \hline

406: Leukemia &\citet{golub}&3051&38&2\\

407: Breast &\citet{vveer}&4869&78&2\\

408: Breast &\citet{vveer}&4869&96&3\\

409: NCI 60 &\citet{ross}&5244&61&8\\

410: Adenocar-\\

411: cinoma &\citet{ramas-03}&9868&76&2\\

412: Brain &\citet{pomeroy}&5597&42&5\\

413: Colon &\citet{alon}&2000&62&2\\

414: Lymphoma &\citet{alizadeh}&4026&62&3\\

415: Prostate &\citet{singh}&6033&102&2\\

416: Srbct &\citet{khan}&2308&63&4\\

417: \hline

418: \end{tabular}

419: }

420: \end{table}

421:

422:

423:

424: % first four data sets. For the last five, the binary R data files were

425: % obtained from M.\ Dettling's web page

426: % \url{http://stat.ethz.ch/~dettling/bagboost.html}; the data sets

427: % and their preprocessing are fully described in \cite{wilma}.

428:

429:

430: To evaluate if the proposed procedure can recover the signal in the

431: data, we need to use simulated data, so that we know exactly which

432: genes are relevant.  Data have been simulated using different numbers

433: of classes of patients (2 to 4), number of independent dimensions (1

434: to 3), and number of genes per dimension (5, 20, 100).  In all cases,

435: we have set to 25 the number of subjects per class. Each independent

436: dimension has the same relevance for discrimination of the classes.

437: The data come from a multivariate normal distribution with variance of

438: 1, a (within-class) correlation among genes within dimension of

439: 0.9, and a within-class correlation of 0 between genes from different

440: dimensions, as those are independent.  The multivariate means have

441: been set so that the unconditional prediction error rate

442: \citep{mclach-dlda} of a linear discriminant analysis using one gene

443: from each dimension is approximately 5\%.  To each data set we have

444: added 2000 random normal variates (mean 0, variance 1) and 2000 random

445: uniform $[-1, 1]$ variates.  In addition, we have generated data sets

446: for 2, 3, and 4 classes where no genes have signal (all 4000 genes are

447: random).  For the non-signal data sets we have generated four

448: replicate data sets for each level of number of classes. Further

449: details are provided in the supplementary material.
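
The structure of the simulated signal genes can be written compactly
as follows (a sketch in our own notation; the class means
$\boldsymbol{\mu}_c$ are not given here, see the supplementary
material). Within class $c$, the vector of $D \times g$ signal genes
($D$ independent dimensions, $g$ genes per dimension) follows
\begin{equation*}
  \mathbf{x} \mid c \sim N(\boldsymbol{\mu}_c, \boldsymbol{\Sigma}),
  \qquad
  \boldsymbol{\Sigma} =
  \begin{pmatrix}
    \boldsymbol{\Sigma}_1 & & \\
    & \ddots & \\
    & & \boldsymbol{\Sigma}_D
  \end{pmatrix},
  \qquad
  \boldsymbol{\Sigma}_d = (1 - \rho)\,\mathbf{I}_g + \rho\,\mathbf{1}_g \mathbf{1}_g^{T},
\end{equation*}
with $\rho = 0.9$, so that every gene has variance 1, genes in the same
dimension have correlation 0.9, and genes in different dimensions are
uncorrelated. Each such data set therefore contains $D g + 4000$
variables (from $1 \times 5 + 4000 = 4005$ to $3 \times 100 + 4000 =
4300$) measured on $25 \times (\mbox{number of classes})$ subjects,
i.e., between 50 and 100 subjects.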

450:

451:

452: \subsection{Competing methods}

453:

454: We have compared the predictive performance of the variable selection

455: approach with: a) random forest without any variable selection (using

456: $mtry = \sqrt{number\ of \ genes}$, $ntree = 5000$, $nodesize =

457: 1$); b) three other methods that have shown good

458: performance in reviews of classification methods with microarray data

459: \citep{dudoit-dlda, romualdi-03, bag-boost} but that do not include

460: any variable selection; c) two methods that carry out

461: variable selection.
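
To give a sense of scale for the default in a), the relation between
$mtry$ and the $mtryFactor$ used in Figure
\ref{mtry.ntree.paper.real} is
\begin{equation*}
  mtry = mtryFactor \times \sqrt{p},
\end{equation*}
where $p$ is the number of genes. How the result is rounded to an
integer is an implementation detail that we do not specify here. Using
the numbers of genes in Table \ref{datasets}, the default ($mtryFactor
= 1$) corresponds to $\sqrt{2000} \approx 44.7$ genes tried at each
split for the Colon data, $\sqrt{3051} \approx 55.2$ for the Leukemia
data, and $\sqrt{9868} \approx 99.3$ for the Adenocarcinoma data.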

462:

463: For the three methods that do not carry out variable selection,

464: \textbf{Diagonal Linear Discriminant Analysis (DLDA)}, \textbf{K

465:   nearest neighbor (KNN)}, and \textbf{Support Vector Machines (SVM)}

466: with linear kernel, we have used, based on \cite{dudoit-dlda}, the 200

467: genes with the largest $F$-ratio of between to within groups sums of

468: squares. For \textbf{KNN}, the number of neighbors ($K$) was

469: chosen by cross-validation as in \cite{dudoit-dlda}.
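
For reference, the $F$-ratio used for this filtering can be written as
follows. This is the usual form of the ratio of between-group to
within-group sums of squares, as in \cite{dudoit-dlda}; the notation is
ours. For gene $j$,
\begin{equation*}
  BW(j) =
  \frac{\sum_{i} \sum_{k} I(y_i = k)\,(\bar{x}_{kj} - \bar{x}_{\cdot j})^2}
       {\sum_{i} \sum_{k} I(y_i = k)\,(x_{ij} - \bar{x}_{kj})^2},
\end{equation*}
where $x_{ij}$ is the expression of gene $j$ in sample $i$, $y_i$ is
the class of sample $i$, $\bar{x}_{kj}$ is the mean of gene $j$ in
class $k$, $\bar{x}_{\cdot j}$ is the overall mean of gene $j$, and
$I(\cdot)$ is the indicator function. Genes are ranked by $BW(j)$ and
the 200 with the largest values are retained.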

470:

471:

472: One of the methods that incorporates gene selection is

473: \textbf{Shrunken centroids (SC)}, developed by \cite{shrunkenc}. We

474: have used two different approaches to determine the best number of

475: features. In the first one, \textbf{SC.l}, we choose the number of

476: genes that minimizes the cross-validated error rate and, in case of

477: several solutions with minimal error rates, we choose the one with

478: largest likelihood. In the second approach, \textbf{SC.s}, we choose

479: the number of genes that minimizes the cross-validated error rate and,

480: in case of several solutions with minimal error rates, we choose the

481: one with smallest number of genes (larger penalty). The second method

482: that incorporates gene selection is \textbf{Nearest neighbor +

483:   variable selection (NN.vs)}, where we filter genes using the

484: F-ratio, and select the number of genes that leads to the smallest

485: error rate; in our implementation, we run a Nearest Neighbor

486: classifier (KNN with K = 1) on all subsets of genes that result from

487: eliminating $20\%$ of the genes (the ones with the smallest F-ratio)

488: used in the previous iteration.  This approach, in its many variants

489: (changing both the classifier and the ordering criterion) is popular

490: in microarray papers; a recent example is \cite{roepman}, and

491: similar general strategies are implemented in the program Tnasas

492: \citep{gepas2}. Further

493: details of all these methods are provided in the supplementary

494: material. All simulations and analyses were carried out with R

495: \citep[http://www.r-project.org; ][]{R}, using

496: packages randomForest (from A.\ Liaw and M.\ Wiener) for random

497: forest, e1071 (E.\ Dimitriadou, K.\ Hornik, F.\ Leisch, D.\ Meyer, and

498: A.\ Weingessel) for SVM, class (B.\ Ripley and W.\ Venables) for KNN,

499: PAM \citep{shrunkenc} for shrunken centroids, and

500: geSignatures (by R.D.-U.) for DLDA.

501:

502:

503:

504:

505: \subsection{\label{boot}Estimation of error rates}

506: To estimate the prediction error rate of all methods we have used the

507: .632+ bootstrap method \citep{ambroise, 632-rule}. It must be

508: emphasized that the error rate used when performing variable selection

509: is not the error rate reported as the prediction error rate (e.g.,

510: Table \ref{error.rates}), nor the error used to compute the .632+

511: estimate. To calculate the prediction error rate (as reported, for

512: example, in Table \ref{error.rates}) the .632+ bootstrap method is

513: applied to the complete procedure, and thus the ``out-of-bag'' samples

514: used in the .632+ method are samples that are not used when fitting

515: the random forest, or carrying out variable selection. This also

516: applies when evaluating the competing methods.
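
For readers unfamiliar with the estimator, the .632+ bootstrap is
usually defined as follows \citep{632-rule}. We state it from the
general literature, in our own notation, as a sketch rather than as a
description of our code. Let $\overline{err}$ be the resubstitution
error, $\widehat{Err}^{(1)}$ the average error on the samples left out
of each bootstrap sample, and $\hat{\gamma}$ the no-information error
rate. Then
\begin{equation*}
  \widehat{Err}^{(.632+)} = (1 - \hat{w})\,\overline{err} + \hat{w}\,\widehat{Err}^{(1)},
  \qquad
  \hat{w} = \frac{0.632}{1 - 0.368\,\hat{R}},
  \qquad
  \hat{R} = \frac{\widehat{Err}^{(1)} - \overline{err}}{\hat{\gamma} - \overline{err}},
\end{equation*}
with the relative overfitting rate $\hat{R}$ restricted to $[0, 1]$.
When there is no overfitting ($\hat{R} = 0$) the weight is $\hat{w} =
0.632$; with maximal overfitting ($\hat{R} = 1$) the weight is $\hat{w}
= 1$ and the estimate equals $\widehat{Err}^{(1)}$.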

517:

518:

519: \subsection{Stability (uniqueness) of results}

520: Following \citet{Faraway-92}, \citet{harrell-01}, and

521: \citet{efron-gong},  we have evaluated

522: the stability of the variable selection procedure using the

523: bootstrap.  This allows us to assess how often a given

524: gene, selected when running the variable selection procedure in the

525: original sample, is selected when running the procedure on bootstrap

526: samples.
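
In symbols (our notation, as a sketch of the summary described above):
if $S$ is the set of genes selected in the original sample and
$S^{*}_b$ is the set selected in bootstrap sample $b$, for $b = 1,
\ldots, B$, the selection frequency of a gene $g \in S$ is
\begin{equation*}
  \mathrm{freq}(g) = \frac{1}{B} \sum_{b=1}^{B} I(g \in S^{*}_b),
\end{equation*}
where $I(\cdot)$ is the indicator function. Values of
$\mathrm{freq}(g)$ close to 1 indicate that the selection of gene $g$
is stable under resampling, whereas values close to 0 indicate that
its selection depends strongly on the particular sample at hand.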

527:

528:

529:

530:  \begin{figure}

531:  \begin{center}

532:  {\resizebox{!}{7.5cm}{%

533:  \includegraphics{mtry.ntree.paper.real.eps}}}

534:

535:

536:

537: % \begin{figure}

538: % {\resizebox{!}{7.5cm}{%

539: % \centerline{\includegraphics{$mtry$.$ntree$.paper.real.eps}}}}

540:

541:

542:

543: \caption{\label{mtry.ntree.paper.real} Out-of-Bag (OOB) vs

544:   $mtryFactor$ for the nine microarray data sets.  $mtryFactor$ is the

545:   multiplicative factor of the default $mtry$

546:   ($\sqrt{number.of.genes}$); thus, an $mtryFactor$ of 3 means the

547:   number of genes tried at each split is $3 *\sqrt{number.of.genes}$;

548:   an $mtryFactor = 0$ means the number of genes tried was 1; the

549:   $mtryFactor$s examined were $\{0, 0.05, 0.1, 0.17, 0.25, 0.33, 0.5,

550:   0.75, 0.8, 1, 1.15, 1.33, 1.5, 2, 3,$ $4, 5, 6, 8, 10, 13\}$. Results

551:   shown for six different $ntree = \{1000, 2000, 5000,

552:   10000, 20000, 40000\}$.  $nodesize = 1$.}

553: \end{center}

554: \end{figure}

555:

556:

557:

558: \section{Results}

559:

560:

561:

562:

563:  \begin{table*}[b!]  \begin{center} %\processtable{

564:        \caption{\label{error.rates} Error rates (estimated using the

565:          0.632+ bootstrap method with 200 bootstrap samples) for the

566:          microarray data sets using different methods (see text for

567:          description of alternative methods).  The results shown for

568:          variable selection with random forest used $ntree = 2000,

569:          fraction.dropped = 0.2, mtryFactor = 1$.  Note that the OOB

570:          error used for variable selection \emph{is not} the error

571:          reported in this table; the error rate reported is obtained

572:          using bootstrap on the complete variable selection process.

573:          The column ``no info'' denotes the minimal error we can make

574:          if we use no information from the genes (i.e., we always bet

575:          on the most frequent class).}

576:

577:

578:

579:   {\footnotesize

580:     \begin{tabular}{l|cccccccccc}

581:       % Data set& SVM & KNN & DLDA& SC.l & SC.s & NN.vs & random forest & \multicolumn{4}{c}{random forest var.sel.}\\

582:       % & & & & & & & & \multicolumn{2}{c}{s.e.\ 0} & \multicolumn{2}{c}{s.e.\ 1}\\

583:       % & & & & & & & & m.f.\ 1 & m.f.\ 13 & m.f.\ 1 & m.f.\ 13 \\

584:

585: Data set& no info & SVM & KNN & DLDA& SC.l & SC.s & NN.vs & random forest &

586: \multicolumn{2}{c}{random forest var.sel.}\\

587: && & & & & & & & s.e.\ 0 & s.e.\ 1\\

588:

589: \hline

590: Leukemia &       0.289 &0.014 &  0.029 &  0.020 &   0.025& 0.062  &   0.056&      0.051 &   0.087  &   0.075   \\

591: Breast 2 cl.&    0.429 &0.325 &  0.337 &  0.331 &   0.324& 0.326  &  0.337&       0.342 &   0.337  &   0.332  \\

592: Breast 3 cl.&    0.537 &0.380 &  0.449 &  0.370 &   0.396& 0.401  &   0.424&      0.351 &   0.346  &   0.364  \\

593: NCI 60      &    0.852 &0.256 &  0.317 &  0.286 &   0.256& 0.246  &   0.237&      0.252 &   0.327  &   0.353  \\

594: Adenocar.&       0.158 &0.203 &  0.174 &  0.194 &   0.177 & 0.179 &    0.181&     0.125 &   0.185  &   0.207  \\

595: Brain&           0.761 &0.138 &  0.174 &  0.183 &   0.163 & 0.159 &    0.194&     0.154 &   0.216  &   0.216  \\

596: Colon&           0.355 &0.147 &  0.152 &  0.137 &   0.123 & 0.122 &    0.158&     0.127 &   0.159  &   0.177  \\

597: Lymphoma &       0.323 &0.010 &  0.008 &  0.021 &   0.028 & 0.033 &    0.04 &     0.009 &   0.047  &   0.042  \\

598: Prostate &       0.490 &0.064 &  0.100 &  0.149 &   0.088 & 0.089 &    0.081&     0.077 &   0.061  &   0.064  \\

599: Srbct &          0.635 &0.017 &  0.023 &  0.011 &   0.012 & 0.025 &    0.031&     0.021 &   0.039  &   0.038   \\

600: \hline

601: \end{tabular}

602: }

603: % \caption{\label{error.rates} Error rates (estimated using 0.632+

604: %   bootstrap method with 200 bootstrap samples) for each data set using

605: %   different methods (see text for description of alternative methods).

606: %   The results shown for variable selection with random forest used

607: %   $ntree = 2000, fraction.dropped = 0.2$, $mtry$Factor = 1$ (error rates with

608: %   $ntree=20000$ and $ntree=5000$ and with $fraction.dropped = 0.5$ and

609: %   $mtry$Factor = 13$ are very similar; see supplementary material and

610: %   Table \ref{stability}). When using variable selection with random

611: %   forest, we display four genes. The first two, correspond to using

612: %   the ``s.e.0'' rule, where the model selected is the one with the

613: %   smallest OOB error rate, and two to the ``s.e. 1'' rule, where the

614: %   model selected is the smallest model whose error rate is within 1

615: %   standard error of the minimum error rate of all forests. For each of

616: %   these, we show the error corresponding to using an $mtry$ factor

617: %   (m.f.) of 13 (i.e., $mtry = 13 * sqrt(number of colums)) and an $mtry$

618: %   factor of 1 ($mtry = sqrt(number of genes)). Note that the OOB

619: %   error used for variable selection \emph{is not} the error reported

620: %   in the table (which is obtained using bootstrap on the complete

621: %   variable selection process).}

622: \end{center}

623: \end{table*}

624:

625:

626: \subsection{Choosing $mtry$ and $ntree$}

627:

628: Preliminary data suggested that $mtry$ and $ntree$ could affect the shape of

629: variable importance plots.  At the same time, use of the OOB error rate as

630: guidance to select $mtry$ could be affected by $ntree$ and, potentially,

631: $nodesize$. Thus, we first examined whether the OOB error rate is substantially

632: affected by changes in $mtry$, $ntree$, and $nodesize$.

633:

634:

635:

636:

637:

638:

639:

640:

641: Figure \ref{mtry.ntree.paper.real} and the supplementary material (Figure

642: ``error.vs.mtry.pdf''), however, show that, for both real and simulated data,

643: the relation of OOB error rate with $mtry$ is largely independent of $ntree$

644: (for $ntree$ between 1000 and 40000) and $nodesize$ (nodesizes 1 and 5). In

645: addition, the default setting of $mtry$ ($mtryFactor = 1$ in the figures) is

646: often a good choice in terms of OOB error rate. In some cases, increasing

647: $mtry$ can lead to small decreases in error rate, and decreases in $mtry$ often

648: lead to increases in the error rate. This is especially the case with simulated

649: data with very few relevant genes (with very few relevant genes, small $mtry$

650: results in many trees being built that do not incorporate any of the relevant

651: genes). Since the OOB error and the relation between OOB error and $mtry$ do

652: not change whether we use $nodesize$ of 1 or 5, and because the increase in

653: speed from using $nodesize$ of 5 is inconsequential, all further analyses will

654: use only the default $nodesize = 1$.
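The effect of a small $mtry$ when few genes are relevant can be illustrated with a simple calculation (a sketch under stated assumptions; the values of $p$ and $r$ below are hypothetical and are not taken from any of the data sets analysed). Suppose that $r$ of the $p$ genes are relevant and that $m = mtry = mtryFactor \times \sqrt{p}$ candidate genes are drawn without replacement at each split. The probability that no relevant gene is among the candidates at a given split is
\begin{equation*}
P(\text{no relevant gene}) = \binom{p-r}{m} \Big/ \binom{p}{m}
  = \prod_{j=0}^{r-1} \frac{p-m-j}{p-j}
  \approx \left(1 - \frac{m}{p}\right)^{r}.
\end{equation*}
With $p = 10000$ and $r = 5$, $mtryFactor = 1$ gives $m = 100$ and a probability of about $0.99^{5} \approx 0.95$, whereas $mtryFactor = 13$ gives $m = 1300$ and about $0.87^{5} \approx 0.50$.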

655:

656:

657:

658:

659:

660:

661:

662:

663:

664:

665:

666:

667:  \subsection{Backwards elimination of variables (genes) using OOB

668:    error} On the simulated data sets (see supplementary material,

669:  Tables 3 and 4) %\ref{simplify.signal.02}, \ref{simplify.signal.05}),

670:  backwards elimination often leads to very small sets of genes, often

671:  much smaller than the set of ``true genes''. The error rate of the

672:  variable selection procedure, estimated using the .632+ bootstrap

673:  method, indicates that the variable selection procedure does not lead

674:  to overfitting, and can achieve the objective of aggressively

675:  reducing the set of selected genes.  In contrast, when the

676:  simplification procedure is applied to simulated data sets without

677:  signal (see Tables 1 and 2

678: %\ref{simplify.no.signal.02} \ref{simplify.no.signal.05}

679: in supplementary material), the number of

680:  genes selected is consistently much larger and, as should be the

681:  case, the estimated error rate using the bootstrap corresponds to

682:  that achieved by always betting on the most probable class.

683:

684:

685:

686: Results for the real data sets are shown in Tables \ref{error.rates} and

687: \ref{stability} (see also supplementary material, Tables 5, 6, 7,

688: %%\ref{stability-20000}, stability-5000, stability-02

689: for additional results using different combinations of $ntree =

690: \{2000,5000,20000\}$, $mtryFactor = \{1, 13\}, se=\{0, 1\},

691: fraction.dropped=\{0.2, 0.5\}$). Error rates (see Table

692: \ref{error.rates}) when performing variable selection are in most cases comparable

693: (within sampling error) to those from random forest without variable

694: selection, and comparable also to the error rates from competing

695: state-of-the-art prediction methods. The number of genes selected

696: varies by data set, but generally (Table \ref{stability}) the

697: variable selection procedure leads to small ($< 50$) sets of predictor

698: genes, often much smaller than those from competing approaches

699: (see also Table 8 in supplementary material). There are no relevant

700: differences in error rate related to differences in $mtry$, $ntree$ or

701: whether we use the ``s.e.\ 1'' or ``s.e.\ 0'' rules. The use of the

702: ``s.e.\ 1'' rule, however, tends to result in smaller sets of selected

703: genes.
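The two rules can be written compactly as follows (a sketch; the binomial form of the standard error is an assumption made here for illustration, and the numerical values are hypothetical). Let $e_k$ denote the OOB error rate of the forest fitted at step $k$ of the backwards elimination, $n_k$ the number of genes it uses, and $N$ the number of samples. With $k^{*} = \arg\min_k e_k$, the rule with $u$ standard errors selects
\begin{equation*}
\hat{k} = \arg\min_{k} \left\{ n_k : e_k \le e_{k^{*}} + u \, \sqrt{\frac{e_{k^{*}} (1 - e_{k^{*}})}{N}} \right\},
\qquad u \in \{0, 1\}.
\end{equation*}
For example, with $e_{k^{*}} = 0.10$ and $N = 100$ the standard error is $\sqrt{0.10 \times 0.90 / 100} = 0.03$, so under the ``s.e.\ 1'' rule any forest with OOB error of at most $0.13$ is eligible and the one with the fewest genes is chosen; under the ``s.e.\ 0'' rule only forests attaining the minimum error are eligible.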

704:

705:

706:

707:

708: % \begin{table*}[ph!]

709: % \begin{center}

710: %   \caption{\label{stability} Stability of results of backwards

711: %     elimination of variables using OOB error, and of two alternative

712: %     variable selection methods. Stability evaluated using 200

713: %     bootstrap samples. ``\# Vars'' denotes the number of variables

714: %     selected on the original data set. ``\# Vars bootstrap'' shows the

715: %     median (1st quartile, 3rd quartile) number of variables selected

716: %     when the procedure is run on the bootstrap samples. ``Freq. vars''

717: %     is the median (1st quartile, 3rd quartile) of the frequency with

718: %     which each variable in the original data set appears in the

719: %     variables selected when the procedure is run on the bootstrap

720: %     samples. For further results see supplementary material.}

721: % \end{center}

722:

723: % \begin{center}

724: % {\small

725: % \begin{tabular}{l|rrrr|rrrr}

726: % Data set& Error rate & \# Vars & \# Vars bootstrap & Freq. vars& Error rate & \# Vars & \# Vars bootstrap & Freq. vars\\

727: % \hline

728: % \hline

729: % \multicolumn{5}{c}{\textbf{Backwards elimination of variables from random forest}}\\ %%% OK

730: % \hline

731: % & \multicolumn{4}{c}{$s.e.\ = 0} & \multicolumn{4}{$s.e.\ = 1}\\ %%% OK

732: % \hline

733: % %%$mtry$1, se1, $ntree = 2000

734: % Leukemia    &     0.087 &          2 &          2 (2, 2)   &   0.38 (0.29, 0.48)\footnotemark[1]

735: % Breast 2 cl.&     0.337 &         14 &          9 (5, 23)   &   0.15 (0.1, 0.28)

736: % Breast 3 cl.&     0.346 &        110 &         14 (9, 31)   &   0.08 (0.04, 0.13)

737: % NCI 60      &     0.327 &        230 &         60 (30, 94)   &    0.1 (0.06, 0.19)

738: % Adenocar.   &     0.185 &          6 &          3 (2, 8)   &   0.14 (0.12, 0.15)

739: % Brain       &     0.216 &         22 &         14 (7, 22)   &   0.18 (0.09, 0.25)

740: % Colon       &     0.159 &         14 &          5 (3, 12)   &   0.29 (0.19, 0.42)

741: % Lymphoma    &     0.047 &         73 &         14 (4, 58)   &   0.26 (0.18, 0.38)

742: % Prostate    &     0.061 &         18 &          5 (3, 14)   &   0.22 (0.17, 0.43)

743: % Srbct       &     0.039 &        101 &         18 (11, 27)   &    0.1 (0.04, 0.29)

744: % \hline

745: % \hline

746: % \multicolumn{4}{c}{$mtryFactor = 1, s.e.\ = 1, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2$}\\ %%% OK

747: % \hline

748: % %%$mtry$1, se1, $ntree = 2000, $ntree$Iterat = 1000

749: % Leukemia    &     0.075 &          2 &          2 (2, 2)   &    0.4 (0.32, 0.5)\footnotemark[1]\\

750: % Breast 2 cl.&     0.332 &         14 &          4 (2, 7)   &   0.12 (0.07, 0.17)\\

751: % Breast 3 cl.&     0.364 &          6 &          7 (4, 14)   &   0.27 (0.22, 0.31)\\

752: % NCI 60      &     0.353 &         24 &         30 (19, 60)   &   0.26 (0.17, 0.38)\\

753: % Adenocar.   &     0.207 &          8 &          3 (2, 5)   &   0.06 (0.03, 0.12)\\

754: % Brain       &     0.216 &          9 &         14 (7, 22)   &   0.26 (0.14, 0.46)\\

755: % Colon       &     0.177 &          3 &          3 (2, 6)   &   0.36 (0.32, 0.36)\\

756: % Lymphoma    &     0.042 &         58 &         12 (5, 73)   &   0.32 (0.24, 0.42)\\

757: % Prostate    &     0.064 &          2 &          3 (2, 5)   &    0.9 (0.82, 0.99)\footnotemark[1]\\

758: % Srbct       &     0.038 &         22 &         18 (11, 34)   &   0.57 (0.4, 0.88)\\

759: % \hline

760: % \hline

761: % \multicolumn{4}{c}{\textbf{Alternative approaches}}\\ %%% OK

762: % \hline

763: % \multicolumn{4}{c}{Shrunken centroids; mimimizing error rate then

764: %   minimizing number of genes selected}\\ %%% OK

765: % \hline

766: % Leukemia    &     0.062 &         82 &         46 (14, 504)   &   0.48 (0.45, 0.59)\\

767: % Breast 2 cl.&     0.326 &         31 &         55 (24, 296)   &   0.54 (0.51, 0.66)\\

768: % Breast 3 cl.&     0.401 &       2166 &       4341 (2379, 4804)   &   0.84 (0.78, 0.88)\\

769: % NCI 60      &     0.246 &       5118 &       4919 (3711, 5243)   &   0.84 (0.74, 0.92)\\

770: % Adenocar.   &     0.179 &          0 &          9 (0, 18)   &     NA (NA, NA)\footnotemark[2]\\

771: % Brain       &     0.159 &       4177 &       1257 (295, 3483)   &   0.38 (0.3, 0.5)\\

772: % Colon       &     0.122 &         15 &         22 (15, 34)   &    0.8 (0.66, 0.87)\\

773: % Lymphoma    &     0.033 &       2796 &       2718 (2030, 3269)   &   0.82 (0.68, 0.86)\\

774: % Prostate    &     0.089 &          4 &          3 (2, 4)   &   0.72 (0.49, 0.92)\\

775: % Srbct       &     0.025 &         37 &         18 (12, 40)   &   0.45 (0.34, 0.61)\\

776: % \hline

777: % \hline

778: % \multicolumn{4}{c}{Nearest Neighbor with variable selection}\\ %%% OK

779: % \hline

780: % Leukemia    &     0.056 &        512 &         23 (4, 134)   &   0.17 (0.14, 0.24)\\

781: % Breast 2 cl.&     0.337 &         88 &         23 (4, 110)   &   0.24 (0.2, 0.31)\\

782: % Breast 3 cl.&     0.424 &          9 &         45 (6, 214)   &   0.66 (0.61, 0.72)\\

783: % NCI 60      &     0.237 &       1718 &        880 (360, 1718)   &   0.44 (0.34, 0.57)\\

784: % Adenocar.   &     0.181 &       9868 &         73 (8, 1324)   &   0.13 (0.1, 0.18)\\

785: % Brain       &     0.194 &       1834 &        158 (52, 601)   &   0.16 (0.12, 0.25)\\

786: % Colon       &     0.158 &          8 &          9 (4, 45)   &   0.57 (0.45, 0.72)\\

787: % Lymphoma    &     0.04 &         15 &         15 (5, 39)   &    0.5 (0.4, 0.6)\\

788: % Prostate    &     0.081 &          7 &          6 (3, 18)   &   0.46 (0.39, 0.78)\\

789: % Srbct       &     0.031 &         11 &         17 (11, 33)   &    0.7 (0.66, 0.85)\\

790: % \hline

791:

792: % \end{tabular}

793: % }

794: % \end{center}

795: % \renewcommand{\baselinestretch}{0.2}\footnotesize\normalsize\footnotesize

796: % {%\setlength{\baselineskip}{1pt} \renewcommand{\baselinestretch}{0.1}\footnotesize\normalsize\footnotesize

797: % $^1$As only two variables are selected from the complete data set, the values are the actual

798: %   frequencies of those two variables, not the 25th and 75th

799: %   percentiles.\\

800: % $^2$No variables were selected.\\

801: % }

802: % \end{table*}

803:

804:

805:

806:

807:

808:

809:

810: \subsection{Stability (uniqueness) of results}

811: The results here will focus on the real microarray data sets (results

812: from the simulated data are presented in the supplementary material).

813: Table \ref{stability} (see also supplementary material, Tables 5, 6, 7,

814: % \ref{stability-20000}

815: for other combinations of $ntree, mtryFactor, fraction.dropped, se$)

816: shows the variation in the number of genes selected in bootstrap

817: samples, and the frequency with which the genes selected in the

818: original sample appear among the genes selected from the bootstrap

819: samples. In most cases, there is a wide range in the number of genes

820: selected; more importantly, the genes selected in the original samples

821: are rarely selected in more than 50\% of the bootstrap samples. These

822: results are not strongly affected by variations in $ntree$ or $mtry$;

823: using the ``s.e.\ 1'' rule can lead, in some cases, to increased

824: stability of the results.
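The ``Freq.\ genes'' summary of Table \ref{stability} can be stated explicitly (the notation is introduced here for illustration only). Let $S_0$ be the set of genes selected from the original data and $S_b$ the set selected from bootstrap sample $b$, $b = 1, \ldots, B$, with $B = 200$. For each gene $g \in S_0$,
\begin{equation*}
f_g = \frac{1}{B} \sum_{b=1}^{B} I(g \in S_b),
\end{equation*}
where $I(\cdot)$ is the indicator function; the table reports the median and the first and third quartiles of $\{f_g : g \in S_0\}$. A value of $f_g = 0.5$ thus means that gene $g$ was selected again in 100 of the 200 bootstrap samples.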

825:

826:

827: As a comparison, we also show in Table \ref{stability} the stability

828: of two alternative approaches for gene selection, the shrunken

829: centroids method, and a filter approach combined with a Nearest

830: Neighbor classifier (see Table 8 in the supplementary material for

831: results of SC.l). Error rates are comparable, but both alternative

832: methods lead to much larger sets of selected genes than backwards

833: variable selection with random forests. The alternative approaches

834: seem to lead to somewhat more stable results in variable selection (probably a

835: consequence of the large number of genes selected) but

836: in practical applications this increase in stability is probably far

837: outweighed by the very large number of selected genes.

838:

839:

840:

841:

842:

843:

844:

845:  \begin{table}[p]

846:  \begin{center}

847:    \caption{\label{stability} Stability of variable (gene) selection evaluated

848:      using 200 bootstrap samples. ``\# Genes'': number of genes

849:      selected on the original data set. ``\# Genes boot.'': median

850:      (1st quartile, 3rd quartile) of number of genes selected from

851:      the bootstrap samples. ``Freq. genes'': median (1st quartile,

852:      3rd quartile) of the frequency with which each gene in the

853:      original data set appears in the genes selected from the

854:      bootstrap samples. Parameters for backwards elimination with

855:      random forest: $mtryFactor = 1, s.e.\ = 0, ntree = 2000,

856:      ntreeIterat = 1000, fraction.dropped = 0.2$.}

857:  \end{center}

858:  \begin{center}

859: \vspace{-32pt} %%% use for bioinformatics.

860:  {\footnotesize

861:  \begin{tabular}{l|rrrr}

862:  Data set& Error & \# Genes & \# Genes boot. & Freq. genes\\

863:  \hline

864:  \hline

865:  \multicolumn{5}{c}{\textbf{Backwards elimination of genes from random forest}}\\ %%% OK

866:  \hline

867: %\multicolumn{5}{c}{$mtryFactor = 1, s.e.\ = 0, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2$}\\ %%% OK

868: \multicolumn{5}{c}{$s.e.\ = 0$}\\ %%% OK

869:  \hline

870: % %$mtry$1, se1, $ntree = 2000

871:  Leukemia    &  0.087 &   2 &  2 (2, 2) &   0.38 (0.29, 0.48)\footnotemark[1]\\

872:  Breast 2 cl.&  0.337 &  14 &  9 (5, 23)&   0.15 (0.1, 0.28)\\

873:  Breast 3 cl.&  0.346 & 110 & 14 (9, 31)&   0.08 (0.04, 0.13)\\

874:  NCI 60      &  0.327 & 230 & 60 (30, 94)&    0.1 (0.06, 0.19)\\

875:  Adenocar.   &  0.185 &   6 &  3 (2, 8)&   0.14 (0.12, 0.15)\\

876:  Brain       &  0.216 &  22 & 14 (7, 22)&   0.18 (0.09, 0.25)\\

877:  Colon       &  0.159 &  14 &  5 (3, 12)&   0.29 (0.19, 0.42)\\

878:  Lymphoma    &  0.047 &  73 & 14 (4, 58)&   0.26 (0.18, 0.38)\\

879:  Prostate    &  0.061 &  18 &  5 (3, 14)&   0.22 (0.17, 0.43)\\

880:  Srbct       &  0.039 & 101 & 18 (11, 27)&    0.1 (0.04, 0.29)\\

881:  \hline

882:  \hline

883: %\multicolumn{4}{c}{$mtryFactor = 1, s.e.\ = 1, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2$}\\ %%% OK

884: \multicolumn{5}{c}{$s.e.\ = 1$}\\ %%% OK

885:  \hline

886: % %$mtry$1, se1, $ntree = 2000, $ntree$Iterat = 1000

887:  Leukemia    & 0.075 &  2 &  2 (2, 2)&    0.4 (0.32, 0.5)\footnotemark[1]\\

888:  Breast 2 cl.& 0.332 & 14 &  4 (2, 7)&   0.12 (0.07, 0.17)\\

889:  Breast 3 cl.& 0.364 &  6 &  7 (4, 14)&   0.27 (0.22, 0.31)\\

890:  NCI 60      & 0.353 & 24 & 30 (19, 60)&   0.26 (0.17, 0.38)\\

891:  Adenocar.   & 0.207 &  8 &  3 (2, 5)&   0.06 (0.03, 0.12)\\

892:  Brain       & 0.216 &  9 & 14 (7, 22)&   0.26 (0.14, 0.46)\\

893:  Colon       & 0.177 &  3 &  3 (2, 6)&   0.36 (0.32, 0.36)\\

894:  Lymphoma    & 0.042 & 58 & 12 (5, 73)&   0.32 (0.24, 0.42)\\

895:  Prostate    & 0.064 &  2 &  3 (2, 5)&    0.9 (0.82, 0.99)\footnotemark[1]\\

896:  Srbct       & 0.038 & 22 & 18 (11, 34)&   0.57 (0.4, 0.88)\\

897:  \hline

898:  \hline

899:  \multicolumn{5}{c}{\textbf{Alternative approaches}}\\ %%% OK

900:  \hline

901: %  \multicolumn{5}{c}{Shrunken centroids; minimizing error rate then}\\

902: %   \multicolumn{5}{c}{minimizing number of genes selected}\\ %%% OK

903:   \multicolumn{5}{c}{SC.s}\\

904:  \hline

905:  Leukemia    & 0.062 &   82\footnotemark[2] &   46 (14, 504)&   0.48 (0.45, 0.59)\\

906:  Breast 2 cl.& 0.326 &   31 &   55 (24, 296)&   0.54 (0.51, 0.66)\\

907:  Breast 3 cl.& 0.401 & 2166 & 4341 (2379, 4804)&   0.84 (0.78, 0.88)\\

908:  NCI 60      & 0.246 & 5118 & 4919 (3711, 5243)&   0.84 (0.74, 0.92)\\

909:  Adenocar.   & 0.179 &    0 &    9 (0, 18)&     NA (NA, NA)\\

910:  Brain       & 0.159 & 4177 & 1257 (295, 3483)&   0.38 (0.3, 0.5)\\

911:  Colon       & 0.122 &   15 &   22 (15, 34)&    0.8 (0.66, 0.87)\\

912:  Lymphoma    & 0.033 & 2796 & 2718 (2030, 3269)&   0.82 (0.68, 0.86)\\

913:  Prostate    & 0.089 &    4 &    3 (2, 4)&   0.72 (0.49, 0.92)\\

914:  Srbct       & 0.025 &   37\footnotemark[3] &   18 (12, 40)&   0.45 (0.34, 0.61)\\

915:  \hline

916:  \hline

917: %\multicolumn{5}{c}{Nearest Neighbor with variable selection}\\ %%% OK

918: \multicolumn{5}{c}{NN.vs}\\ %%% OK

919:  \hline

920:  Leukemia    & 0.056 &  512 &  23 (4, 134)&   0.17 (0.14, 0.24)\\

921:  Breast 2 cl.& 0.337 &   88 &  23 (4, 110)&   0.24 (0.2, 0.31)\\

922:  Breast 3 cl.& 0.424 &    9 &  45 (6, 214)&   0.66 (0.61, 0.72)\\

923:  NCI 60      & 0.237 & 1718 & 880 (360, 1718)&   0.44 (0.34, 0.57)\\

924:  Adenocar.   & 0.181 & 9868 &  73 (8, 1324)&   0.13 (0.1, 0.18)\\

925:  Brain       & 0.194 & 1834 & 158 (52, 601)&   0.16 (0.12, 0.25)\\

926:  Colon       & 0.158 &    8 &   9 (4, 45)&   0.57 (0.45, 0.72)\\

927:  Lymphoma    & 0.04 &   15 &  15 (5, 39)&    0.5 (0.4, 0.6)\\

928:  Prostate    & 0.081 &    7 &   6 (3, 18)&   0.46 (0.39, 0.78)\\

929:  Srbct       & 0.031 &   11 &  17 (11, 33)&    0.7 (0.66, 0.85)\\

930:  \hline

931:

932:  \end{tabular}

933:  }

934:  \end{center}

935:  \renewcommand{\baselinestretch}{0.2}\footnotesize\normalsize\footnotesize

936:  {%\setlength{\baselineskip}{1pt} \renewcommand{\baselinestretch}{0.1}\footnotesize\normalsize\footnotesize

937:    $^*$Only two genes are selected from the complete data set; the values are the actual

938:    frequencies of those two genes.\\

939:    $^{\dagger}$\citet{shrunkenc} select 21 genes after visually inspecting

940:    the plot of

941:    cross-validation error rate vs. amount of shrinkage and number of

942:    genes. Their procedure is hard to automate and thus it is very difficult to obtain estimates of the error

943:    rate of their procedure.\\

944:    $^{\ddagger}$\citet{shrunkenc} select 43 genes. The difference is likely due

945:    to differences in the random partitions for cross-validation. Repeating 100 times

946:    the gene selection process with the full data set the median, 1st quartile, and 3rd

947:    quartile of the number of selected genes are 13, 8, and 147.\\

948:

949:

950:  }

951:  \end{table}

952:

953:

954:

955:

956:

957: \section{Discussion}

958:

959: We have examined the performance of an approach for gene selection using random

960: forest, and compared it to alternative approaches. Our results, using both

961: simulated and real microarray data sets, show that this method of gene

962: selection accomplishes the proposed objectives.  Our method returns very small

963: sets of genes compared to two alternative variable selection methods, while

964: retaining predictive performance comparable to that of seven alternative

965: state-of-the-art methods.  Recently, \citet{BMA-selection} have proposed a

966: Bayesian model averaging (BMA) approach for gene selection; comparing the

967: results for the two common data sets between our study and theirs, in one case

968: (Leukemia) our procedure returns a much smaller set of genes (2 vs. 15),

969: whereas in another (Breast, 2 class) their BMA procedure returns 8 fewer genes

970: (14 vs. 6); our procedure does not require setting a limit on the maximum

971: number of relevant genes to be selected, nor does it require prespecifying a

972: number of top-ranked genes as relevant (the latter is not required by the BMA

973: procedure either).

974:

975: Our method of gene selection will not return sets of genes

976: that are highly correlated, because they are redundant.  This method will be

977: most useful under two scenarios: a) when considering the design of diagnostic

978: tools, where having a small set of probes is often desirable; b) to help

979: understand the results from other gene selection approaches that return many

980: genes, so as to understand which ones of those genes have the largest signal to

981: noise ratio and could be used as surrogates for complex processes involving

982: many correlated genes. A backwards elimination method, precursor to the one

983: used here, has already been used to predict breast tumor type based on

984: chromosomal alterations \citep{SaraRF}.

985:

986:

987: We have also thoroughly examined the effects of changes in the

988: parameters of random forest (specifically $mtry$, $ntree$, $nodesize$)

989: and the variable selection algorithm ($se$, $fraction.dropped$).

990: Changes in these parameters have in most cases negligible effects,

991: suggesting that the default values are often good options, but we can

992: make some general recommendations.

993: Execution time increases approximately linearly with $ntree$.

994: Larger $ntree$ values lead to slightly more stable values of variable

995: importances, but for the data sets examined, $ntree = 2000$ or $ntree = 5000$

996: seem quite adequate, with further increases having negligible effects. The

997: change in $nodesize$ from 1 to 5 has negligible effects, and thus its default

998: setting of 1 is appropriate.  For the backwards elimination algorithm, the

999: parameter $fraction.dropped$ can be adjusted to modify the resolution of the

1000: number of variables selected; smaller values of $fraction.dropped$ lead to finer

1001: resolution in the examination of number of genes, but to slower execution of

1002: the code.  Finally, the parameter $se$ also has minor effects on the results of

1003: the backwards variable selection algorithm, but a value of $se = 1$ leads to

1004: slightly more stable results.
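The trade-off between resolution and speed controlled by $fraction.dropped$ can be quantified roughly (a sketch that ignores rounding and assumes, for illustration, that elimination starts from $p$ genes and stops when $n_{\min}$ genes remain; $p$ and $n_{\min}$ are hypothetical values). If a fraction $f = fraction.dropped$ of the genes is discarded at each iteration, $p (1 - f)^{k}$ genes remain after $k$ iterations, so the number of forests that must be fitted is approximately
\begin{equation*}
k \approx \frac{\ln (p / n_{\min})}{-\ln (1 - f)}.
\end{equation*}
With $p = 10000$ and $n_{\min} = 2$, $f = 0.2$ requires about 38 iterations, whereas $f = 0.5$ requires about 12, at the cost of halving the number of genes between consecutive candidate sets.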

1005:

1006:

1007:

1008: The final issue addressed in this paper is instability or multiplicity of the

1009: selected sets of genes. From this point of view, the results are slightly

1010: disappointing. But so are the results of the competing methods. And so are the

1011: results of most methods examined so far with microarray data, as shown in

1012: \citet{EinDor} and \citet{Michielis} and discussed thoroughly by

1013: \citet{Somorjai2003} for classification and by \citet{pan-pnas} for the related

1014: problem of the effect of threshold choice in gene selection.  However, and

1015: except for the above cited papers and the review in \citet{Yo-azuaje}, this is

1016: an issue that still seems largely ignored in the microarray literature. As

1017: these papers and the statistical literature on variable selection

1018: \citep[e.g.,][]{breiman-2-cultures, harrell-01} discuss, the causes of the

1019: problem are small sample sizes and the extremely small ratio of samples to

1020: variables (i.e., number of arrays to number of genes). Thus, we might need to

1021: learn to live with the problem, and try to assess the stability and robustness

1022: of our results by using a variety of gene selection features, and examining

1023: whether there is a subset of features that tends to be repeatedly selected.

1024: This concern is explicitly taken into account in our results, and facilities

1025: for examining this problem are part of our R code.

1026:

1027:

1028: The multiplicity problem, however, need not result in large

1029: prediction errors.  This and other papers \citep{dudoit-dlda, pelora,

1030:   simon.book, romualdi-03, bag-boost, Somorjai2003} show that very different

1031: classifiers often lead to comparable and successful error rates with

1032: a variety of microarray data sets. Thus, although improving prediction

1033: rates is important \citep[especially if giving consideration to ROC

1034: curves, and not just overall prediction error rates;][]{pepe-book},

1035: when trying to address questions of biological mechanism or discover

1036: therapeutic targets, probably a more challenging and relevant issue is

1037: to identify sets of genes with biological relevance.

1038:

1039:

1040: Two areas of future research are using random forest for the selection of

1041: potentially large sets of genes that include correlated genes, and improving

1042: the computational efficiency of these approaches; in the present work, we have

1043: used parallelization of the ``embarrassingly parallelizable'' tasks using MPI

1044: with the Rmpi and Snow packages \citep{Rmpi, snow} for R. In a broader context,

1045: further work is warranted on the stability properties and biological relevance

1046: of this and other gene-selection approaches, because the multiplicity problem

1047: casts doubts on the biological interpretability of most results based on a

1048: single run of one gene-selection approach.

1049:

1050:

1051:

1052:

1053:

1054: %%% Both allow var sel; the type of var sel is wrapper approach, which

1055: %%% should be superior to ``filter'' approaches. variable importance plots not affected

1056: %%% by multicol. However, not many unique (stable) results. Select only

1057: %%% the most important from variable importance plots, or use a large set of candidates.

1058: %%% With backwards, can help to examine if the different, non-overlapping,

1059: %%% sets of vars are in similar routes, etc.

1060:

1061: %%% Examinar tambi�n plots of ``flatness'' of OOB and numero de genes.

1062: %%% Indication of how important things are (and plots of flatness in

1063: %%% bootstrap samples?). Problem is: no longer emphasis on which are the

1064: %%% selected genes. Here the approach of Svetnik et al more relevant?

1065:

1066:

1067:

1068: \section{Conclusion}

1069: The proposed method can be used for variable selection fulfilling the

1070: objectives above: we can obtain very small sets of non-redundant genes while

1071: preserving predictive accuracy. These results clearly indicate that the

1072: proposed method can be profitably used with microarray data. Given its

1073: performance, random forest and variable selection using random forest should

1074: probably become part of the ``standard tool-box'' of methods for the analysis

1075: of microarray data.

1076:

1077:

1078:

1079:

1080: \section{Acknowledgements}

1081:

1082: % This work arised out of work I did in collaboration with S.\ �lvarez

1083: % de Andr�s; I thank her for the opportunity to collaborate in that

1084: % work, and for her patience and enthusiasm.

1085:

1086: Most of the simulations and analyses were carried out in the Beowulf

1087: cluster of the Bioinformatics unit at CNIO, financed by the RTICCC

1088: from the FIS; J.~M.\ Vaquerizas provided help with the administration

1089: of the cluster. A.\ Liaw provided discussion, unpublished manuscripts,

1090: and code.  C.\ L�zaro-Perea provided many discussions and comments on

1091: the ms. A.\ S�nchez provided comments on the ms.  I.\ D�az showed

1092: R.D.-U. the forest, or the trees, or both.  R.D.-U. partially

1093: supported by the Ram�n y Cajal program of the Spanish MEC (Ministry

1094: of Education and Science); S.A.A. supported by project C.A.M.

1095: GR/SAL/0219/2004; funding provided by project TIC2003-09331-C02-02 of

1096: the Spanish MEC.

1097:

1098:

1099: \bibliography{signatures2}


1101:

1102:

1103:

1104: %\end{multicols}

1105: \newpage

1106:

1107:

1108:

1109:

1110:

1111:

1112:

1113:

1114:

1115:

1116:

1117:

1118: \end{document}

1119:

1120:

1121: %All with R, library  randomForest. Code available.

1122:

1123:

1124:

1125:

1126: %Although occassionally plots of variable importance show a clear

1127: %pattern where only a few variables stand out, most often

1128:

1129:

1130: **************************

1131:

1132:

1133: We will also mention computational requirements.

1134:

1135: It should be possible to

1136: use these measures of variable importance to single out

1137: genes of particular relevance for a given condition.

1138:

1139:

1140:

1141:

1142: Future work:

1143: ------------

1144: - Changes in $mtry$, since small mtries should lead to faster

1145: runs and further decreases in correlations of trees.

1146:

1147: - After variable reduction: use all variables in models

1148: building (i.e., $mtry = number of variables)?

1149:

1150: