1: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3: %%%%%%%%%%%%%%%%%%%%                                  %%%%%%%%%%%%%%%%%%%%
4: %%%%%%%%%%%%%%%%%%%%       Technical report           %%%%%%%%%%%%%%%%%%%%
5: %%%%%%%%%%%%%%%%%%%%                                  %%%%%%%%%%%%%%%%%%%%
6: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
7: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
8: 
9:    \documentclass[10pt]{article}
10:    \usepackage[latin1]{inputenc}
11:    \usepackage{geometry}
12:    \geometry{verbose,a4paper,tmargin=20mm,bmargin=20mm,lmargin=20mm,rmargin=20mm}
13:    \usepackage{setspace}
14:    \usepackage{graphics} 
15:    \singlespacing
16:    \usepackage{verbatim}
17:    \usepackage{amsmath}
18:    \usepackage{url}
19:    \bibliographystyle{bioinformatics}
20:    \usepackage[authoryear, round, sort]{natbib}
21:   \usepackage{hyperref} %%??
22: 
23:   \title{Variable selection from random forests: application to gene expression data} 
24:    \author{\vspace{20pt}
25:      Ramón Díaz-Uriarte$^{1,3}$, Sara Alvarez de Andrés$^2$\\
   $^1$Bioinformatics Unit, $^2$Cytogenetics Unit\\
27:    Biotechnology Programme\\
28:    Spanish National Cancer Center (CNIO)\\
29:    Melchor Fernández Almagro 3 \\
30:    Madrid, 28029\\
31: \vspace{20pt}
32:    Spain. \\
33: $^3$ Author for correspondence.\\
34:    \texttt{rdiaz@ligarto.org}\\
35:    \url{http://ligarto.org/rdiaz}\\
36:    }
37:    \date{
38:    \vspace*{40pt}
39:    2005-06-22 \\
40: \vspace{20pt}
41:    {\bf Running Head:} Gene selection with random forest.} %%%% eliminate for tech report}
42:   \begin{document}
43:   \maketitle
44:   \newpage
45:   \begin{abstract}
46: 
47:   Random forest is a classification algorithm well suited for
48:   microarray data: it shows excellent performance even when most
49:   predictive variables are noise, can be used when the number of
50:   variables is much larger than the number of observations, and
51:   returns measures of variable importance. Thus, it is important to
52:   understand the performance of random forest with microarray data and
53:   its use for gene selection.
54: 
55: 
56:    We first show the effects of changes in parameters of random forest on the
57:    prediction error.  Then we present an approach for gene selection
58:    that uses measures of variable importance and error rate,
59:    and is targeted towards the selection of small sets of genes.  Using
60:    simulated and real microarray data, we show that the gene selection
61:    procedure yields small sets of genes while preserving predictive accuracy.
62: 
63: 
64:   We first show the effects of changes in parameters of random forest
65:   on the prediction error rate with microarray data. Then we present
  two approaches for gene selection with random forest: 1) comparing
  plots of variable importance from original and permuted data
68:   sets; 2) using backwards variable elimination. Using simulated and
69:   real microarray data, we show: 1) variable importance plots can be used to recover
70:   the full set of genes related to the outcome of interest, without
71:   being adversely affected by collinearities; 2) backwards variable
72:   elimination yields small sets of genes while preserving predictive
  accuracy (compared to several state-of-the-art algorithms). Thus,
74:   both methods are useful for gene selection. 
75: 
76:   All code is available as an R package, varSelRF, from CRAN
77: \href{http://cran.r-project.org/src/contrib/PACKAGES.html}
78: {http://cran.r-project.org/src/contrib/PACKAGES.html} or from the supplementary
79: material page.
80: 
81: Supplementary information:
82: \href{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}
83: 
84: 
85:   \end{abstract}
86: 
87: 
88:  \footnotetext[1]{To whom correspondence should be addressed}
89: 
90: 
91: 
92: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
93: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
94: %%%%%%%%%%%%%%%%%%%%                                  %%%%%%%%%%%%%%%%%%%%
95: %%%%%%%%%%%%%%%%%%%%       Bioinformatics             %%%%%%%%%%%%%%%%%%%%
96: %%%%%%%%%%%%%%%%%%%%                                  %%%%%%%%%%%%%%%%%%%%
97: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
98: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
99: 
100: %%% \renewcommand{\thefootnote}{\fnsymbol{footnote}}
101: 
102: %%%   \documentclass{bioinfo}
103: %%%   \copyrightyear{2005}
104: %%%   \pubyear{2005}
105: %%%   \usepackage[latin1]{inputenc}
106: 
107: %%%   \begin{document}
108: %%%   \firstpage{1}
109: 
110: %%%   \title[Gene selection with random forests]{Variable selection from random forests: application to gene expression data} 
111: 
112: %%% %   \author{Ramón Díaz-Uriarte}
113: %%% %   \address{Bioinformatics Unit\\
114: %%% %   Spanish National Cancer Center (CNIO)\\
115: %%% %   Melchor Fernández Almagro 3 \\
116: %%% %   Madrid, 28029\\
117: %%% %   Spain 
118: %%% %   }
119: 
120: %%%    \author{Ramón Díaz-Uriarte\,$^{\rm a}$\footnote{To whom correspondence should be addressed}, Sara Alvarez de
121: %%%      Andrés\,$^{\rm b}$}
122: 
123: %%%    \author{Ramón Díaz-Uriarte\,$^{\rm a,}$\footnotemark[1], Sara Alvarez de
124: %%%      Andrés\,$^{\rm b}$}
125: %%%    \address{$^{a}$Bioinformatics Unit, $^{b}$Cytogenetics Unit\\
126: %%%      Biotechnology Programme\\
127: %%%   Spanish National Cancer Centre (CNIO)\\
128: %%%   Melchor Fernández Almagro 3 \\
129: %%%   Madrid, 28029\\
130: %%%   Spain 
131: %%%   }
132: 
133: %%%   \maketitle
134: 
135: %%%   \begin{abstract}
136: 
137: %%%   \section{Motivation:} 
138: %%%Random forest is a classification algorithm well suited
139: %%%for microarray data: it shows excellent performance
140: %%%even when most predictive variables are noise, can be
141: %%%used when the number of variables is much larger than
142: %%%the number of observations, and returns measures of
143: %%%variable importance. Thus, it is important to
144: %%%understand the performance of random forest with
145: %%%microarray data and its use for gene selection.
146: 
147: 
148: %%%   \section{Results:} 
149: %%%   We first show the effects of changes in parameters of random forest on the
150: %%%   prediction error.  Then we present an approach for gene selection
151: %%%   that uses measures of variable importance and error rate,
152: %%%   and is targeted towards the selection of small sets of genes.  Using
153: %%%   simulated and real microarray data, we show that the gene selection
154: %%%   procedure yields small sets of genes while preserving predictive accuracy.
155: 
156: %%%   \section{Availability:}
157: %%%  All code is available as an R package, varSelRF, from CRAN,
158: %%%\href{http://cran.r-project.org/src/contrib/PACKAGES.html}
159: %%%{http://cran.r-project.org/src/contrib/PACKAGES.html}, or from the supplementary
160: %%%material page.
161: 
162: 
163: %%%   \section{Contact:}
164: %%%   \href{rdiaz@ligarto.org}{rdiaz@ligarto.org}
165: %%%   \section{Supplementary information:}\\
166: %%%   \href{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}
167: 
168: 
169: 
170: %%%   \end{abstract}
171: 
172: %%% \footnotetext[1]{To whom correspondence should be addressed}
173: %%%%%%%%%%%%%%%%%%%%%%%%5   end bioinformatics
174: 
175: 
176: 
177: \section{Introduction}
178: 
179: 
180: Random forest is an algorithm for classification developed by Leo
181: Breiman \citep{breiman-rf} that uses an ensemble of classification
182: trees \citep{cart, ripley-96, htf-01}. Each of the classification
183: trees is built using a bootstrap sample of the data, and at each split
184: the candidate set of variables is a random subset of the variables.
185: Thus, random forest uses both bagging (bootstrap aggregation), a
186: successful approach for combining unstable learners
187: \citep{breiman-bagging, htf-01}, and random variable selection for
188: tree building.  Each tree is unpruned (grown fully), so as to obtain
189: low-bias trees; at the same time, bagging and random variable
190: selection result in low correlation of the individual trees.  The
191: algorithm yields an ensemble that can achieve both low bias and low
192: variance (from averaging over a large ensemble of low-bias,
193: high-variance but low correlation trees).
194: 
195: 
196: Random forest has excellent performance in classification tasks,
197: comparable to support vector machines. Although random forest is not
198: widely used in the microarray literature \citep[but see][]{SaraRF,
199:   Izmir2004, Wu.Zhao2003, Gunther.Heyes2003, Man-rf,
200:   Schwender.Bolt2004}, it has several characteristics that make it
201: ideal for these data sets: a) can be used when there are many more
202: variables than observations; b) has good predictive performance
203: even when most predictive variables are noise; c) does not
204: overfit; d) can handle a mixture of categorical and continuous
205: predictors; e) incorporates interactions among predictor variables; f)
206: the output is invariant to monotone transformations of the predictors;
207: g) there are high quality and free implementations: the original
208: Fortran code from L.\ Breiman and A.\ Cutler, and an R package from
209: A.\ Liaw and M.\ Wiener \citep{rf-rnews}; h) there is little need to
210: fine-tune parameters to achieve excellent performance; i) returns measures of
211: variable (gene) importance. The most
212: important parameter to choose is $mtry$, the number of input variables
213: tried at each split, but it has been reported that the default value
214: is often a good choice \citep{rf-rnews}.  In addition, the user needs
215: to decide how many trees to grow for each forest ($ntree$) as well as
216: the minimum size of the terminal nodes ($nodesize$). These three
parameters will be thoroughly examined in this paper.
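
The two sources of randomness described above (a bootstrap sample per tree and a random subset of $mtry$ candidate variables per split) can be illustrated with a minimal Python sketch. This is not the randomForest implementation: the function names are hypothetical, tree growing itself is omitted, and taking the default $mtry$ as the integer part of $\sqrt{number\ of\ genes}$ is an assumption made for illustration.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


def default_mtry(n_variables):
    """Assumed default for classification: floor(sqrt(n_variables)), at least 1."""
    return max(1, math.isqrt(n_variables))


def candidate_variables(n_variables, mtry, rng):
    """Random subset of mtry variables examined at one split (no replacement)."""
    return rng.sample(range(n_variables), mtry)


rng = random.Random(1)
in_bag, out_of_bag = bootstrap_indices(38, rng)    # e.g., 38 patients
mtry = default_mtry(3051)                          # e.g., 3051 genes
candidates = candidate_variables(3051, mtry, rng)  # drawn anew at every split
```

The out-of-bag cases of each tree are the ones later used to compute the OOB error and the variable importances.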
218: 
219: Given these promising features, it is important to understand the
220: performance of random forest compared to alternative state-of-the-art
221: prediction methods with microarray data, as well as the effects
222: of changes in the parameters of random forest. In this paper we present, as necessary
background for the main topic of the paper (gene selection), the first
thorough examination of these issues, including evaluating the effects
225: of $mtry$, $ntree$ and $nodesize$ on error rate using
226: nine real microarray data sets and simulated data.
227: 
228: The main question addressed in this paper is gene selection using random
229: forest.  A few authors have previously used variable selection with random
230: forest.  \citet{dudoit-inbook} and \citet{Wu.Zhao2003} use filtering approaches
231: and, thus, do not take advantage of the measures of variable importance
232: returned by random forest as part of the algorithm. \citet{svetnik} propose a
233: method that is somewhat similar to our approach. The main difference is that
234: \citet{svetnik} first find the ``best'' dimension ($p$) of the model, and then
choose the $p$ most important variables. This is a sound strategy when the
objective is to build accurate predictors, without any regard for model
interpretability.  But this might not be the most appropriate strategy for our purposes
238: as it shifts the emphasis away from selection of specific genes, and in genomic
239: studies the identity of the selected genes is relevant (e.g., to understand
240: molecular pathways or to find targets for drug development).
241: 
242: 
243: The last issue addressed in this paper is the multiplicity (or lack of
244: uniqueness or lack of stability) problem. Variable selection with microarray
245: data can lead to many solutions that are equally good from the point of view of
246: prediction rates, but that share few common genes.  This multiplicity problem
247: has been emphasized by \citet{Somorjai2003} and recent examples are shown in
248: \citet{EinDor} and \citet{Michielis}. Although multiplicity of results is not a problem when
249: the only objective of our method is prediction, it casts serious doubts on the
250: biological interpretability of the results \citep{Somorjai2003}.  Unfortunately
251: most ``methods papers'' in bioinformatics do not evaluate the stability of the
results obtained, leading to a false sense of trust in the biological
interpretability of the output obtained. Our paper presents a thorough and
254: critical evaluation of the stability of the lists of selected genes with the
255: proposed (and two competing) methods.
256: 
257: 
258: 
259: 
260: 
261: 
262: 
263: 
264: \section{Variable selection methods}
265: 
266: \subsection{Two objectives of variable selection}
267: 
268: When facing gene selection problems, biomedical researchers often show
269: interest in one of the following objectives:
270: 
271: \begin{enumerate}
272: \item To identify relevant genes for subsequent research; this
273:   involves obtaining a (probably large) set of genes that are related
274:   to the outcome of interest, and this set should include genes even if they
275:   perform similar functions and are highly correlated.
276:   
277: \item To identify small sets of genes to be used for diagnostic
278:   purposes in clinical practice; this involves obtaining the smallest
279:   possible set of genes that can still achieve good predictive
280:   performance (thus, ``redundant'' genes should not be selected). 
281: \end{enumerate}
282: 
283: We will focus on the second objective. The use of random forest for the first
284: objective is under investigation and will be reported elsewhere.
285: 
286: 
287: \subsection{Variable importance from random forest}
288: 
289: Random forest returns several measures of variable importance. The
290: most reliable measure is based on the decrease
291: of classification accuracy when values of a variable in a node of a
292: tree are permuted randomly \citep{breiman-rf, Bureau2003},
293: and this is the measure of variable importance (in its unscaled
294: version ---see supplementary material) that we will use in the rest of
295: the paper.
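
The idea behind a permutation-based importance can be illustrated with a simplified, model-agnostic Python sketch. It is not the computation carried out inside random forest (which works tree by tree on out-of-bag cases); the function names and the toy classifier are hypothetical and serve only to show how a decrease in accuracy after permuting one variable is measured.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, column, rng, n_repeats=20):
    """Mean decrease in accuracy after randomly permuting one column."""
    baseline = accuracy(predict, rows, labels)
    decreases = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [row[:column] + [value] + row[column + 1:]
                    for row, value in zip(rows, shuffled)]
        decreases.append(baseline - accuracy(predict, permuted, labels))
    return sum(decreases) / n_repeats


# Toy data: column 0 separates the two classes, column 1 is pure noise.
rng = random.Random(0)
labels = [0] * 10 + [1] * 10
rows = [[label + rng.gauss(0, 0.1), rng.gauss(0, 1)] for label in labels]


def toy_classifier(row):
    """Hypothetical stand-in for a fitted model: thresholds column 0."""
    return int(row[0] > 0.5)


importance_signal = permutation_importance(toy_classifier, rows, labels, 0, rng)
importance_noise = permutation_importance(toy_classifier, rows, labels, 1, rng)
```

A variable that the classifier does not use has an importance of zero, whereas permuting an informative variable lowers the accuracy.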
296: 
297: 
298: % . This measure is sometimes reported as such, and
299: % sometimes it is reported after scaling it, or dividing by a quantity
300: % somewhat analogous to its standard error (``somewhat analogous''
301: % because the data used to obtain that ``standard error'' are not truly
302: % independent, and thus the true standard error can be severely
303: % underestimated).  We use in this paper the unscaled importance
304: % measure, because it allows us to compare directly runs with different
305: % settings of $ntree$ and $mtry$ (in contrast, scaled importances
306: % increase monotonically as we increase the value of $ntree$).
307: % %%%Explain why we use unscaled importance:
308: %%%a) allows direct comparison between runs with different ntrees and mtries.
309: %%%b) does not mislead into considering them Z scores.
310: 
311: 
312: 
313: \subsection{Backwards elimination of variables (genes) using OOB error}
314: 
To select genes we can iteratively fit random forests, at each iteration
316: building a new forest after discarding those variables (genes) with the
317: smallest variable importances; the selected set of genes is the one that yields the
318: smallest error rate.  Random forest returns a measure of error rate based on
319: the out-of-bag cases for each fitted tree, the OOB error, and this is the
320: measure of error we will use.  Note that in this section we are using OOB
321: error to choose the final set of genes, not to obtain unbiased estimates of the
error rate of this rule.  Because of the iterative approach, the OOB error is
biased down and cannot be used to assess the overall error rate of the approach,
324: for reasons analogous to those leading to ``selection bias'' \citep{ambroise, simon-03}. To assess prediction error rates we will use the bootstrap, not
325: OOB error (see section \ref{boot}). (Using error rates
326:   affected by selection bias to select the optimal number of genes is
327:   not necessarily a bad procedure from the point of view of selecting
328:   the final number of genes; see \citet{Braga-Neto.Carroll2004}).
329: %\citet{svetnik} leave aside a set of data,
330: %and decide on the stopping criterion using the error rate on the test data.
331: %This approach, however, is problematic when, as in our case, we are interested
332: %in specific genes and not in using the test set error rate to select the
333: %number of genes.
334: 
335: In our algorithm we examine all forests that result from eliminating,
336: iteratively, a fraction, $fraction.dropped$, of the genes (the
337: least important ones) used in the previous iteration. By default,
$fraction.dropped = 0.2$, which allows for relatively fast operation,
339: is coherent with the idea of an ``aggressive variable selection''
340: approach, and increases the resolution as the number of genes
341: considered becomes smaller.  We do not recalculate variable
342: importances at each step as \citet{svetnik} mention severe overfitting
343: resulting from recalculating variable importances. After fitting all
344: forests, we examine the OOB error rates from all the fitted random
345: forests. We choose the solution with the smallest number of genes
346: whose error rate is within $u$ standard errors of the minimum error
347: rate of all forests. 
348: % (The standard error is calculated using the
349: % expression for a binomial error count [$\sqrt{p (1-p) * 1/N}$]).
350: Setting $u = 0$ is the same as selecting the set of genes that
351: leads to the smallest error rate.  Setting $u = 1$ is similar to the
352: common ``1 s.e.  rule'', used in the classification trees literature
353: \citep{ripley-96, cart}; this strategy can lead to solutions with
354: fewer genes than selecting the solution with the smallest error
355: rate, while achieving an error rate that is not
356: different, within sampling error, from the ``best solution''. In this
357: paper we will examine both the ``1 s.e. rule'' and the ``0 s.e.
358: rule''.
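
The selection rule described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the varSelRF code: the function names, the rounding of the number of genes dropped, the stopping size of two genes, and the use of the binomial standard error $\sqrt{p (1-p) / N}$ evaluated at the minimum OOB error are assumptions made for illustration. Fitting the forests is omitted; the OOB error of each forest is taken as given.

```python
import math


def elimination_schedule(n_genes, fraction_dropped=0.2, min_genes=2):
    """Numbers of genes kept at successive iterations (assumed rounding)."""
    sizes = [n_genes]
    while sizes[-1] > min_genes:
        n_drop = max(1, round(fraction_dropped * sizes[-1]))
        sizes.append(max(min_genes, sizes[-1] - n_drop))
    return sizes


def select_solution(sizes, oob_errors, n_samples, u=1.0):
    """Smallest gene set whose OOB error is within u s.e. of the minimum."""
    best = min(range(len(sizes)), key=lambda i: oob_errors[i])
    p = oob_errors[best]
    se = math.sqrt(p * (1.0 - p) / n_samples)   # binomial s.e. (assumption)
    threshold = p + u * se
    eligible = [i for i in range(len(sizes)) if oob_errors[i] <= threshold]
    chosen = min(eligible, key=lambda i: sizes[i])
    return sizes[chosen], oob_errors[chosen]


schedule = elimination_schedule(100)

# Made-up OOB errors for four forests fitted to 50 samples.
sizes = [100, 80, 64, 51]
oob_errors = [0.10, 0.08, 0.09, 0.15]
rule_0se = select_solution(sizes, oob_errors, n_samples=50, u=0.0)
rule_1se = select_solution(sizes, oob_errors, n_samples=50, u=1.0)
```

With these made-up error rates, the ``0 s.e. rule'' keeps the 80-gene forest (the minimum OOB error), whereas the ``1 s.e. rule'' accepts the smaller 64-gene forest because its error lies within one standard error of the minimum.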
359: 
360: 
361: 
362: %Note here no need for very large mtries, etc, since we do not want all
363: %the important genes, but just enough genes to do a good job.
364: 
365: %Besides the stopping criterion we have also chosen the following settings:
366: 
367: %\begin{itemize}
368: %\item We examine all forest that result from iteratively
369: %  \textbf{eliminating the lower 50\% of the genes}; this
370: %  allows for relatively fast operation, and is coherent with the
371: %  idea of an ``aggressive variable selection'' approach, and
372: %  increases the ``resolution'' as the number of genes
373: %  considered becomes smaller.
374: %\item \textbf{Variable importances are not recalculated at each step}, but
375: %  instead we use the variable importances computed at the end of
376: %  the run; we have not observed important differences whether or
377: %  not variable importances are recalculated, but \citet{svetnik}
378: %  mention severe overfitting resulting from recalculating
379: %  variable importances.
380: %\item We examine the OOB error rates from all the fitted random
381: %  forests. We choose the \textbf{solution with the smallest number of
382: %  genes whose error rate is within 1 standard error of the
383: %  minimum error rate of all forests}.  and the ``1 SE rule'' is common in the
384: %  classification trees literature ).
385: %\end{itemize}
386: 
387: 
388: 
389: \section{Evaluation of performance}
390: 
391: \subsection{Data sets}
392: We have used both simulated and real microarray data sets to evaluate
the variable selection procedure. For the real
data sets, the original reference papers and main features are shown in
395: Table \ref{datasets}. Further details are provided in the
396: supplementary material.
397: 
398: 
399: \begin{table}
400: \caption{\label{datasets} Main characteristics of the microarray data
401:   sets used.}
402: {\footnotesize
403: \begin{tabular}{l|lrrr}
404: Dataset & Original ref.&Genes&Patients&Classes \\
405: \hline
406: Leukemia &\citet{golub}&3051&38&2\\
407: Breast &\citet{vveer}&4869&78&2\\
408: Breast &\citet{vveer}&4869&96&3\\
409: NCI 60 &\citet{ross}&5244&61&8\\
410: Adenocar-\\
411: cinoma &\citet{ramas-03}&9868&76&2\\
412: Brain &\citet{pomeroy}&5597&42&5\\
413: Colon &\citet{alon}&2000&62&2\\
414: Lymphoma &\citet{alizadeh}&4026&62&3\\
415: Prostate &\citet{singh}&6033&102&2\\
416: Srbct &\citet{khan}&2308&63&4\\
417: \hline
418: \end{tabular}
419: }
420: \end{table}
421: 
422: 
423: 
424: % first four data sets. For the last five, the binary R data files were
425: % obtained from M.\ Dettling's web page
426: % \url{http://stat.ethz.ch/~dettling/bagboost.html}; the data sets
427: % and their preprocessing are fully described in \cite{wilma}. 
428: 
429: 
430: To evaluate if the proposed procedure can recover the signal in the
431: data, we need to use simulated data, so that we know exactly which
432: genes are relevant.  Data have been simulated using different numbers
433: of classes of patients (2 to 4), number of independent dimensions (1
434: to 3), and number of genes per dimension (5, 20, 100).  In all cases,
435: we have set to 25 the number of subjects per class. Each independent
436: dimension has the same relevance for discrimination of the classes.
437: The data come from a multivariate normal distribution with variance of
438: 1, a (within-class) correlation among genes within dimension of
439: 0.9, and a within-class correlation of 0 between genes from different
440: dimensions, as those are independent.  The multivariate means have
441: been set so that the unconditional prediction error rate
442: \citep{mclach-dlda} of a linear discriminant analysis using one gene
443: from each dimension is approximately 5\%.  To each data set we have
444: added 2000 random normal variates (mean 0, variance 1) and 2000 random
445: uniform $[-1, 1]$ variates.  In addition, we have generated data sets
446: for 2, 3, and 4 classes where no genes have signal (all 4000 genes are
447: random).  For the non-signal data sets we have generated four
448: replicate data sets for each level of number of classes. Further
449: details are provided in the supplementary material.
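
The structure of the simulated data can be illustrated with a minimal Python sketch. It is not the simulation code used for the paper: the function name is hypothetical, and the class means in the example are placeholders (the means actually used were set to reach the target error rate and are given in the supplementary material). Within a dimension, each gene is built as a shared factor plus a gene-specific term, which yields unit variance and a within-class correlation of 0.9.

```python
import math
import random


def simulate_dataset(class_means, genes_per_dim, rng, n_per_class=25,
                     rho=0.9, n_normal_noise=2000, n_uniform_noise=2000):
    """Return (rows, labels); class_means[c][d] is the mean of class c in dimension d."""
    rows, labels = [], []
    for label, means in enumerate(class_means):
        for _ in range(n_per_class):
            row = []
            for mean in means:
                shared = rng.gauss(0.0, 1.0)       # factor common to the dimension
                for _ in range(genes_per_dim):
                    own = rng.gauss(0.0, 1.0)      # gene-specific part
                    row.append(mean + math.sqrt(rho) * shared
                               + math.sqrt(1.0 - rho) * own)
            # Noise genes: standard normal and uniform on [-1, 1].
            row.extend(rng.gauss(0.0, 1.0) for _ in range(n_normal_noise))
            row.extend(rng.uniform(-1.0, 1.0) for _ in range(n_uniform_noise))
            rows.append(row)
            labels.append(label)
    return rows, labels


# Three classes, two independent dimensions, five genes per dimension.
# The class means below are placeholders, not the values used in the paper.
placeholder_means = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
rows, labels = simulate_dataset(placeholder_means, genes_per_dim=5,
                                rng=random.Random(42))
```

Each simulated data set then has 25 subjects per class, the signal genes in the first columns, and 4000 noise genes in the remaining columns.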
450: 
451: 
452: \subsection{Competing methods}
453: 
454: We have compared the predictive performance of the variable selection
455: approach with: a) random forest without any variable selection (using
456: $mtry = \sqrt{number\ of \ genes}$, $ntree = 5000$, $nodesize =
457: 1$); b) three other methods that have shown good
458: performance in reviews of classification methods with microarray data
459: \citep{dudoit-dlda, romualdi-03, bag-boost} but that do not include
460: any variable selection; c) two methods that carry out
461: variable selection. 
462: 
463: For the three methods that do not carry out variable selection,
464: \textbf{Diagonal Linear Discriminant Analysis (DLDA)}, \textbf{K
465:   nearest neighbor (KNN)}, and \textbf{Support Vector Machines (SVM)}
466: with linear kernel, we have used, based on \cite{dudoit-dlda}, the 200
467: genes with the largest $F$-ratio of between to within groups sums of
468: squares. For \textbf{KNN}, the number of neighbors ($K$) was
469: chosen by cross-validation as in \cite{dudoit-dlda}.
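
The filtering step can be illustrated with a minimal Python sketch of the ratio of between-group to within-group sums of squares, computed gene by gene. The function names are hypothetical, and this is a sketch rather than the code used for the analyses; the ratio computed here is proportional to the $F$ statistic and therefore ranks genes in the same order.

```python
def bw_ratio(values, labels):
    """Between-group over within-group sum of squares for one gene."""
    overall = sum(values) / len(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between = within = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        between += len(members) * (mean - overall) ** 2
        within += sum((v - mean) ** 2 for v in members)
    if within == 0.0:
        # Degenerate gene: no within-group variation.
        return 0.0 if between == 0.0 else float("inf")
    return between / within


def top_genes(matrix, labels, n_keep=200):
    """Indices of the n_keep genes (columns) with the largest ratio."""
    n_genes = len(matrix[0])
    scores = [bw_ratio([row[j] for row in matrix], labels)
              for j in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
    return ranked[:n_keep]


# Toy example: four samples (rows), two genes (columns), two classes.
toy_matrix = [[0.0, 1.0], [0.1, 3.0], [1.0, 2.0], [1.1, 4.0]]
toy_labels = [0, 0, 1, 1]
best_gene = top_genes(toy_matrix, toy_labels, n_keep=1)
```

When such a filter is part of a classifier that is evaluated with the bootstrap, the ranking has to be recomputed within each bootstrap sample.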
470: 
471: 
472: One of the methods that incorporates gene selection is
473: \textbf{Shrunken centroids (SC)}, developed by \cite{shrunkenc}. We
474: have used two different approaches to determine the best number of
475: features. In the first one, \textbf{SC.l}, we choose the number of
476: genes that minimizes the cross-validated error rate and, in case of
477: several solutions with minimal error rates, we choose the one with
478: largest likelihood. In the second approach, \textbf{SC.s}, we choose
479: the number of genes that minimizes the cross-validated error rate and,
480: in case of several solutions with minimal error rates, we choose the
481: one with smallest number of genes (larger penalty). The second method
482: that incorporates gene selection is \textbf{Nearest neighbor +
483:   variable selection (NN.vs)}, where we filter genes using the
484: F-ratio, and select the number of genes that leads to the smallest
485: error rate; in our implementation, we run a Nearest Neighbor
486: classifier (KNN with K = 1) on all subsets of genes that result from
487: eliminating $20\%$ of the genes (the ones with the smallest F-ratio)
488: used in the previous iteration.  This approach, in its many variants
489: (changing both the classifier and the ordering criterion) is popular
490: in microarray papers; a recent example is \cite{roepman}, and
491: similar general strategies are implemented in the program Tnasas
492: \citep{gepas2}. Further
493: details of all these methods are provided in the supplementary
494: material. All simulations and analyses were carried out with R
495: \citep[http://www.r-project.org; ][]{R}, using
496: packages randomForest (from A.\ Liaw and M.\ Wiener) for random
497: forest, e1071 (E.\ Dimitriadou, K.\ Hornik, F.\ Leisch, D.\ Meyer, and
498: A.\ Weingessel) for SVM, class (B.\ Ripley and W.\ Venables) for KNN,
499: PAM \citep{shrunkenc} for shrunken centroids, and
500: geSignatures (by R.D.-U.) for DLDA.
501: 
502: 
503: 
504: 
505: \subsection{\label{boot}Estimation of error rates} 
506: To estimate the prediction error rate of all methods we have used the
507: .632+ bootstrap method \citep{ambroise, 632-rule}. It must be
508: emphasized that the error rate used when performing variable selection
509: is not the error rate reported as the prediction error rate (e.g.,
510: Table \ref{error.rates}), nor the error used to compute the .632+
511: estimate. To calculate the prediction error rate (as reported, for
512: example, in Table \ref{error.rates}) the .632+ bootstrap method is
513: applied to the complete procedure, and thus the ``out-of-bag'' samples
514: used in the .632+ method are samples that are not used when fitting
515: the random forest, or carrying out variable selection. This also
516: applies when evaluating the competing methods.
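
For reference, the .632+ estimator \citep{632-rule} is usually written as a weighted average of the resubstitution error and the error on samples left out of each bootstrap sample. The notation below is introduced here only for illustration: $\overline{err}$ is the resubstitution error, $\widehat{Err}^{(1)}$ is the leave-one-out bootstrap error, $\hat{p}_l$ is the observed proportion of class $l$, and $\hat{q}_l$ is the proportion of predictions assigned to class $l$.
\begin{align*}
\widehat{Err}^{(.632+)} &= (1 - \hat{w})\, \overline{err} + \hat{w}\, \widehat{Err}^{(1)}, &
\hat{w} &= \frac{0.632}{1 - 0.368\, \hat{R}}, \\
\hat{R} &= \frac{\widehat{Err}^{(1)} - \overline{err}}{\hat{\gamma} - \overline{err}}, &
\hat{\gamma} &= \sum_{l} \hat{p}_l \, (1 - \hat{q}_l),
\end{align*}
with $\hat{R}$ restricted to $[0, 1]$. The weight $\hat{w}$ ranges from 0.632, when there is no overfitting ($\hat{R} = 0$), to 1, when overfitting is maximal ($\hat{R} = 1$). In our setting, $\widehat{Err}^{(1)}$ is obtained by repeating the complete procedure, including gene selection, on each bootstrap sample.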
517: 
518: 
519: \subsection{Stability (uniqueness) of results}
520: Following \citet{Faraway-92}, \citet{harrell-01}, and
521: \citet{efron-gong},  we have evaluated
522: the stability of the variable selection procedure using the
523: bootstrap.  This allows us to asses how often a given
524: gene, selected when running the variable selection procedure in the
525: original sample, is selected when running the procedure on bootstrap
526: samples.  
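
The stability measure can be illustrated with a minimal Python sketch that, given the genes selected in the original sample and the genes selected in each bootstrap sample, returns the fraction of bootstrap runs in which each original gene is selected again. The function name and the gene identifiers are hypothetical, and running the selection procedure itself is omitted.

```python
from collections import Counter


def selection_frequencies(original_genes, bootstrap_selections):
    """Fraction of bootstrap runs in which each originally selected gene reappears."""
    counts = Counter()
    for selected in bootstrap_selections:
        counts.update(set(selected) & set(original_genes))
    n_runs = len(bootstrap_selections)
    return {gene: counts[gene] / n_runs for gene in original_genes}


# Made-up example: three genes selected in the original sample, four bootstrap runs.
original = ["g1", "g2", "g3"]
bootstrap_runs = [["g1", "g2"], ["g1", "g9"], ["g1", "g3", "g7"], ["g2"]]
frequencies = selection_frequencies(original, bootstrap_runs)
```

In this made-up example, the first gene is selected in three of the four bootstrap runs, the second in two, and the third in one; genes with low frequencies indicate unstable solutions.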
527: 
528: 
529: 
530:  \begin{figure}                                                         
531:  \begin{center}
532:  {\resizebox{!}{7.5cm}{%
533:  \includegraphics{mtry.ntree.paper.real.eps}}}
534: 
535: 
536: 
537: % \begin{figure}                                                         
538: % {\resizebox{!}{7.5cm}{%
539: % \centerline{\includegraphics{$mtry$.$ntree$.paper.real.eps}}}}
540: 
541: 
542: 
\caption{\label{mtry.ntree.paper.real} Out-of-bag (OOB) error rate vs.\
  $mtryFactor$ for the nine microarray data sets.  $mtryFactor$ is the
  multiplicative factor of the default $mtry$
  ($\sqrt{number.of.genes}$); thus, an $mtryFactor$ of 3 means the
  number of genes tried at each split is $3 *\sqrt{number.of.genes}$;
  an $mtryFactor = 0$ means the number of genes tried was 1. The
  $mtryFactor$s examined were $\{0, 0.05, 0.1, 0.17, 0.25, 0.33, 0.5,
  0.75, 0.8, 1, 1.15, 1.33, 1.5, 2, 3,$ $4, 5, 6, 8, 10, 13\}$. Results
  are shown for six different values of $ntree$: $\{1000, 2000, 5000,
  10000, 20000, 40000\}$.  In all cases, $nodesize = 1$.}
553: \end{center}
554: \end{figure}
555: 
556: 
557: 
558: \section{Results}
559: 
560: 
561: 
562: 
563:  \begin{table*}[b!]  \begin{center} %\processtable{
564:        \caption{\label{error.rates} Error rates (estimated using the
565:          0.632+ bootstrap method with 200 bootstrap samples) for the
566:          microarray data sets using different methods (see text for
567:          description of alternative methods).  The results shown for
568:          variable selection with random forest used $ntree = 2000,
569:          fraction.dropped = 0.2, mtryFactor = 1$.  Note that the OOB
570:          error used for variable selection \emph{is not} the error
571:          reported in this table; the error rate reported is obtained
572:          using bootstrap on the complete variable selection process.
573:          The column ``no info'' denotes the minimal error we can make
574:          if we use no information from the genes (i.e., we always bet
575:          on the most frequent class).}
576: 
577: 
578: 
579:   {\footnotesize
580:     \begin{tabular}{l|cccccccccc}
581:       % Data set& SVM & KNN & DLDA& SC.l & SC.s & NN.vs & random forest & \multicolumn{4}{c}{random forest var.sel.}\\
582:       % & & & & & & & & \multicolumn{2}{c}{s.e.\ 0} & \multicolumn{2}{c}{s.e.\ 1}\\
583:       % & & & & & & & & m.f.\ 1 & m.f.\ 13 & m.f.\ 1 & m.f.\ 13 \\
584: 
585: Data set& no info & SVM & KNN & DLDA& SC.l & SC.s & NN.vs & random forest &
586: \multicolumn{2}{c}{random forest var.sel.}\\
587: && & & & & & & & s.e.\ 0 & s.e.\ 1\\
588: 
589: \hline
590: Leukemia &       0.289 &0.014 &  0.029 &  0.020 &   0.025& 0.062  &   0.056&      0.051 &   0.087  &   0.075   \\
591: Breast 2 cl.&    0.429 &0.325 &  0.337 &  0.331 &   0.324& 0.326  &  0.337&       0.342 &   0.337  &   0.332  \\
592: Breast 3 cl.&    0.537 &0.380 &  0.449 &  0.370 &   0.396& 0.401  &   0.424&      0.351 &   0.346  &   0.364  \\
593: NCI 60      &    0.852 &0.256 &  0.317 &  0.286 &   0.256& 0.246  &   0.237&      0.252 &   0.327  &   0.353  \\
594: Adenocar.&       0.158 &0.203 &  0.174 &  0.194 &   0.177 & 0.179 &    0.181&     0.125 &   0.185  &   0.207  \\
595: Brain&           0.761 &0.138 &  0.174 &  0.183 &   0.163 & 0.159 &    0.194&     0.154 &   0.216  &   0.216  \\
596: Colon&           0.355 &0.147 &  0.152 &  0.137 &   0.123 & 0.122 &    0.158&     0.127 &   0.159  &   0.177  \\
597: Lymphoma &       0.323 &0.010 &  0.008 &  0.021 &   0.028 & 0.033 &    0.04 &     0.009 &   0.047  &   0.042  \\
598: Prostate &       0.490 &0.064 &  0.100 &  0.149 &   0.088 & 0.089 &    0.081&     0.077 &   0.061  &   0.064  \\
599: Srbct &          0.635 &0.017 &  0.023 &  0.011 &   0.012 & 0.025 &    0.031&     0.021 &   0.039  &   0.038   \\
600: \hline
601: \end{tabular}
602: }
603: % \caption{\label{error.rates} Error rates (estimated using 0.632+
604: %   bootstrap method with 200 bootstrap samples) for each data set using
605: %   different methods (see text for description of alternative methods).
606: %   The results shown for variable selection with random forest used
607: %   $ntree = 2000, fraction.dropped = 0.2$, $mtry$Factor = 1$ (error rates with
608: %   $ntree=20000$ and $ntree=5000$ and with $fraction.dropped = 0.5$ and
609: %   $mtry$Factor = 13$ are very similar; see supplementary material and
610: %   Table \ref{stability}). When using variable selection with random
611: %   forest, we display four genes. The first two, correspond to using
612: %   the ``s.e.0'' rule, where the model selected is the one with the
613: %   smallest OOB error rate, and two to the ``s.e. 1'' rule, where the
614: %   model selected is the smallest model whose error rate is within 1
615: %   standard error of the minimum error rate of all forests. For each of
616: %   these, we show the error corresponding to using an $mtry$ factor
617: %   (m.f.) of 13 (i.e., $mtry = 13 * sqrt(number of colums)) and an $mtry$
618: %   factor of 1 ($mtry = sqrt(number of genes)). Note that the OOB
619: %   error used for variable selection \emph{is not} the error reported
620: %   in the table (which is obtained using bootstrap on the complete
621: %   variable selection process).}
622: \end{center}
623: \end{table*}
624: 
625: 
626: \subsection{Choosing $mtry$ and $ntree$}
627: 
628: Preliminary data suggested that $mtry$ and $ntree$ could affect the shape of
629: variable importance plots.  At the same time, use of OOB error rate as a
630: guidance to select $mtry$ could be affected by $ntree$ and, potentially,
631: $nodesize$. Thus, we first examined whether the OOB error rate is substantially
632: affected by changes in $mtry$, $ntree$, and $nodesize$.
633: 
634: 
635: 
636: 
637: 
638: 
639: 
640: 
641: Figure \ref{mtry.ntree.paper.real} and the supplementary material (Figure
642: \\``error.vs.mtry.pdf''), however, show that, for both real and simulated data,
643: the relation of OOB error rate with $mtry$ is largely independent of $ntree$
644: (for $ntree$ between 1000 and 40000) and $nodesize$ (nodesizes 1 and 5). In
645: addition, the default setting of $mtry$ ($mtryFactor = 1$ in the figures) is
646: often a good choice in terms of OOB error rate. In some cases, increasing
647: $mtry$ can lead to small decreases in error rate, and decreases in $mtry$ often
lead to increases in the error rate. This is especially the case with simulated
649: data with very few relevant genes (with very few relevant genes, small $mtry$
650: results in many trees being built that do not incorporate any of the relevant
651: genes). Since the OOB error and the relation between OOB error and $mtry$ do
652: not change whether we use $nodesize$ of 1 or 5, and because the increase in
653: speed from using $nodesize$ of 5 is inconsequential, all further analyses will
654: use only the default $nodesize = 1$.
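
To make the argument about small $mtry$ concrete, the following is a rough
illustration under stated assumptions (the symbols $p$, $k$, and $m$, and the
numerical values, are ours and hypothetical, not taken from the data sets
analyzed). Suppose that $k$ of the $p$ genes are relevant and that, at each
split, $m = mtry$ candidate genes are sampled without replacement. The
probability that none of the candidates for a given split is a relevant gene
is then
\begin{equation*}
  P(\text{no relevant candidate}) = \frac{\binom{p-k}{m}}{\binom{p}{m}}
  = \prod_{i=0}^{k-1} \frac{p - m - i}{p - i}
  \approx \left(1 - \frac{m}{p}\right)^{k}.
\end{equation*}
For instance, with $p = 4000$, $k = 10$, and $mtryFactor = 1$ (so that $m
\approx \sqrt{4000} \approx 63$), this probability is about 0.85, whereas with
$mtryFactor = 13$ ($m \approx 819$) it drops to about 0.10.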
655: 
656: 
657: 
658: 
659: 
660: 
661: 
662: 
663: 
664: 
665: 
666: 
667:  \subsection{Backwards elimination of variables (genes) using OOB
668:    error} On the simulated data sets (see supplementary material,
669:  Tables 3 and 4) %\ref{simplify.signal.02}, \ref{simplify.signal.05}),
670:  backwards elimination often leads to very small sets of genes, often
671:  much smaller than the set of ``true genes''. The error rate of the
672:  variable selection procedure, estimated using the .632+ bootstrap
673:  method, indicates that the variable selection procedure does not lead
674:  to overfitting, and can achieve the objective of aggressively
675:  reducing the set of selected genes.  In contrast, when the
676:  simplification procedure is applied to simulated data sets without
677:  signal (see Tables 1 and 2 
678: %\ref{simplify.no.signal.02} \ref{simplify.no.signal.05} 
679: in supplementary material), the number of
680:  genes selected is consistently much larger and, as should be the
681:  case, the estimated error rate using the bootstrap corresponds to
682:  that achieved by always betting on the most probable class.
683: 
684: 
685: 
686: Results for the real data sets are shown in Tables \ref{error.rates} and
687: \ref{stability} (see also supplementary material, Tables 5, 6, 7, 
688: %%\ref{stability-20000}, stability-5000, stability-02
689: for additional results using different combinations of $ntree =
690: \{2000,5000,20000\}$, $mtryFactor = \{1, 13\}, se=\{0, 1\},
691: fraction.dropped=\{0.2, 0.5\}$). Error rates (see Table
692: \ref{error.rates}) when performing variable selection are in most cases comparable
693: (within sampling error) to those from random forest without variable
694: selection, and comparable also to the error rates from competing
695: state-of-the-art prediction methods. The number of genes selected
696: varies by data set, but generally (Table \ref{stability}) the
697: variable selection procedure leads to small ($< 50$) sets of predictor
698: genes, often much smaller than those from competing approaches
699: (see also Table 8 in supplementary material). There are no relevant
700: differences in error rate related to differences in $mtry$, $ntree$ or
701: whether we use the ``s.e.\ 1'' or ``s.e.\ 0'' rules. The use of the
702: ``s.e.\ 1'' rule, however, tends to result in smaller sets of selected
703: genes.
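
The two selection rules can be stated compactly as follows (the notation is
introduced here only for illustration). Let $e_j$ be the OOB error rate of the
$j$th forest in the backwards elimination sequence, $n_j$ its number of genes,
and $e_{\min} = \min_j e_j$. With $c = 0$ for the ``s.e.\ 0'' rule and $c = 1$
for the ``s.e.\ 1'' rule, the selected forest is
\begin{equation*}
  j^{*} = \arg\min_{j} \left\{ n_j : e_j \le e_{\min} + c \cdot
    \widehat{se}(e_{\min}) \right\}.
\end{equation*}
A natural choice for the standard error, assumed here for the sake of the
example, is the binomial expression $\widehat{se}(e) = \sqrt{e (1 - e)/N}$,
with $N$ the number of samples. Because the set of admissible forests can only
grow with $c$, the ``s.e.\ 1'' rule can never select more genes than the
``s.e.\ 0'' rule on the same sequence of forests.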
704: 
705: 
706: 
707: 
708: % \begin{table*}[ph!]
709: % \begin{center}
710: %   \caption{\label{stability} Stability of results of backwards
711: %     elimination of variables using OOB error, and of two alternative
712: %     variable selection methods. Stability evaluated using 200
713: %     bootstrap samples. ``\# Vars'' denotes the number of variables
714: %     selected on the original data set. ``\# Vars bootstrap'' shows the
715: %     median (1st quartile, 3rd quartile) number of variables selected
716: %     when the procedure is run on the bootstrap samples. ``Freq. vars''
717: %     is the median (1st quartile, 3rd quartile) of the frequency with
718: %     which each variable in the original data set appears in the
719: %     variables selected when the procedure is run on the bootstrap
720: %     samples. For further results see supplementary material.}
721: % \end{center}
722: 
723: % \begin{center}
724: % {\small
725: % \begin{tabular}{l|rrrr|rrrr}
726: % Data set& Error rate & \# Vars & \# Vars bootstrap & Freq. vars& Error rate & \# Vars & \# Vars bootstrap & Freq. vars\\
727: % \hline
728: % \hline
729: % \multicolumn{5}{c}{\textbf{Backwards elimination of variables from random forest}}\\ %%% OK
730: % \hline
731: % & \multicolumn{4}{c}{$s.e.\ = 0} & \multicolumn{4}{$s.e.\ = 1}\\ %%% OK
732: % \hline
733: % %%$mtry$1, se1, $ntree = 2000
734: % Leukemia    &     0.087 &          2 &          2 (2, 2)   &   0.38 (0.29, 0.48)\footnotemark[1] 
735: % Breast 2 cl.&     0.337 &         14 &          9 (5, 23)   &   0.15 (0.1, 0.28) 
736: % Breast 3 cl.&     0.346 &        110 &         14 (9, 31)   &   0.08 (0.04, 0.13) 
737: % NCI 60      &     0.327 &        230 &         60 (30, 94)   &    0.1 (0.06, 0.19) 
738: % Adenocar.   &     0.185 &          6 &          3 (2, 8)   &   0.14 (0.12, 0.15) 
739: % Brain       &     0.216 &         22 &         14 (7, 22)   &   0.18 (0.09, 0.25) 
740: % Colon       &     0.159 &         14 &          5 (3, 12)   &   0.29 (0.19, 0.42) 
741: % Lymphoma    &     0.047 &         73 &         14 (4, 58)   &   0.26 (0.18, 0.38) 
742: % Prostate    &     0.061 &         18 &          5 (3, 14)   &   0.22 (0.17, 0.43) 
743: % Srbct       &     0.039 &        101 &         18 (11, 27)   &    0.1 (0.04, 0.29) 
744: % \hline
745: % \hline
746: % \multicolumn{4}{c}{$mtryFactor = 1, s.e.\ = 1, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2$}\\ %%% OK
747: % \hline
748: % %%$mtry$1, se1, $ntree = 2000, $ntree$Iterat = 1000
749: % Leukemia    &     0.075 &          2 &          2 (2, 2)   &    0.4 (0.32, 0.5)\footnotemark[1]\\ 
750: % Breast 2 cl.&     0.332 &         14 &          4 (2, 7)   &   0.12 (0.07, 0.17)\\ 
751: % Breast 3 cl.&     0.364 &          6 &          7 (4, 14)   &   0.27 (0.22, 0.31)\\ 
752: % NCI 60      &     0.353 &         24 &         30 (19, 60)   &   0.26 (0.17, 0.38)\\ 
753: % Adenocar.   &     0.207 &          8 &          3 (2, 5)   &   0.06 (0.03, 0.12)\\ 
754: % Brain       &     0.216 &          9 &         14 (7, 22)   &   0.26 (0.14, 0.46)\\ 
755: % Colon       &     0.177 &          3 &          3 (2, 6)   &   0.36 (0.32, 0.36)\\ 
756: % Lymphoma    &     0.042 &         58 &         12 (5, 73)   &   0.32 (0.24, 0.42)\\ 
757: % Prostate    &     0.064 &          2 &          3 (2, 5)   &    0.9 (0.82, 0.99)\footnotemark[1]\\ 
758: % Srbct       &     0.038 &         22 &         18 (11, 34)   &   0.57 (0.4, 0.88)\\ 
759: % \hline
760: % \hline
761: % \multicolumn{4}{c}{\textbf{Alternative approaches}}\\ %%% OK
762: % \hline
763: % \multicolumn{4}{c}{Shrunken centroids; mimimizing error rate then
764: %   minimizing number of genes selected}\\ %%% OK
765: % \hline
766: % Leukemia    &     0.062 &         82 &         46 (14, 504)   &   0.48 (0.45, 0.59)\\ 
767: % Breast 2 cl.&     0.326 &         31 &         55 (24, 296)   &   0.54 (0.51, 0.66)\\ 
768: % Breast 3 cl.&     0.401 &       2166 &       4341 (2379, 4804)   &   0.84 (0.78, 0.88)\\ 
769: % NCI 60      &     0.246 &       5118 &       4919 (3711, 5243)   &   0.84 (0.74, 0.92)\\ 
770: % Adenocar.   &     0.179 &          0 &          9 (0, 18)   &     NA (NA, NA)\footnotemark[2]\\ 
771: % Brain       &     0.159 &       4177 &       1257 (295, 3483)   &   0.38 (0.3, 0.5)\\ 
772: % Colon       &     0.122 &         15 &         22 (15, 34)   &    0.8 (0.66, 0.87)\\ 
773: % Lymphoma    &     0.033 &       2796 &       2718 (2030, 3269)   &   0.82 (0.68, 0.86)\\ 
774: % Prostate    &     0.089 &          4 &          3 (2, 4)   &   0.72 (0.49, 0.92)\\ 
775: % Srbct       &     0.025 &         37 &         18 (12, 40)   &   0.45 (0.34, 0.61)\\ 
776: % \hline
777: % \hline
778: % \multicolumn{4}{c}{Nearest Neighbor with variable selection}\\ %%% OK
779: % \hline
780: % Leukemia    &     0.056 &        512 &         23 (4, 134)   &   0.17 (0.14, 0.24)\\ 
781: % Breast 2 cl.&     0.337 &         88 &         23 (4, 110)   &   0.24 (0.2, 0.31)\\ 
782: % Breast 3 cl.&     0.424 &          9 &         45 (6, 214)   &   0.66 (0.61, 0.72)\\ 
783: % NCI 60      &     0.237 &       1718 &        880 (360, 1718)   &   0.44 (0.34, 0.57)\\ 
784: % Adenocar.   &     0.181 &       9868 &         73 (8, 1324)   &   0.13 (0.1, 0.18)\\ 
785: % Brain       &     0.194 &       1834 &        158 (52, 601)   &   0.16 (0.12, 0.25)\\ 
786: % Colon       &     0.158 &          8 &          9 (4, 45)   &   0.57 (0.45, 0.72)\\ 
787: % Lymphoma    &     0.04 &         15 &         15 (5, 39)   &    0.5 (0.4, 0.6)\\ 
788: % Prostate    &     0.081 &          7 &          6 (3, 18)   &   0.46 (0.39, 0.78)\\ 
789: % Srbct       &     0.031 &         11 &         17 (11, 33)   &    0.7 (0.66, 0.85)\\ 
790: % \hline
791: 
792: % \end{tabular}
793: % }
794: % \end{center}
795: % \renewcommand{\baselinestretch}{0.2}\footnotesize\normalsize\footnotesize
796: % {%\setlength{\baselineskip}{1pt} \renewcommand{\baselinestretch}{0.1}\footnotesize\normalsize\footnotesize
797: % $^1$As only two variables are selected from the complete data set, the values are the actual
798: %   frequencies of those two variables, not the 25th and 75th
799: %   percentiles.\\
800: % $^2$No variables were selected.\\
801: % }
802: % \end{table*}
803: 
804: 
805: 
806: 
807: 
808: 
809: 
810: \subsection{Stability (uniqueness) of results}
The results here will focus on the real microarray data sets (results
from the simulated data are presented in the supplementary material).
813: Table \ref{stability} (see also supplementary material, Tables 5, 6, 7,
814: % \ref{stability-20000} 
815: for other combinations of $ntree, mtryFactor, fraction.dropped, se$)
816: shows the variation in the number of genes selected in bootstrap
817: samples, and the frequency with which the genes selected in the
818: original sample appear among the genes selected from the bootstrap
819: samples. In most cases, there is a wide range in the number of genes
820: selected; more importantly, the genes selected in the original samples
821: are rarely selected in more than 50\% of the bootstrap samples. These
822: results are not strongly affected by variations in $ntree$ or $mtry$;
823: using the ``s.e.\ 1'' rule can lead, in some cases, to increased
824: stability of the results.
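
The ``Freq.\ genes'' summary used above can be written explicitly (the symbols
are ours). Let $S_0$ be the set of genes selected from the original sample and
$S_b$ the set selected from the $b$th of $B = 200$ bootstrap samples. For each
gene $g \in S_0$,
\begin{equation*}
  f_g = \frac{1}{B} \sum_{b=1}^{B} I(g \in S_b),
\end{equation*}
where $I(\cdot)$ is the indicator function; Table \ref{stability} reports the
median and quartiles of $f_g$ over $g \in S_0$. A perfectly reproducible
selection would give $f_g = 1$ for every gene in $S_0$.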
825: 
826: 
827: As a comparison, we also show in Table \ref{stability} the stability
828: of two alternative approaches for gene selection, the shrunken
829: centroids method, and a filter approach combined with a Nearest
830: Neighbor classifier (see Table 8 in the supplementary material for
831: results of SC.l). Error rates are comparable, but both alternative
832: methods lead to much larger sets of selected genes than backwards
833: variable selection with random forests. The alternative approaches
seem to lead to somewhat more stable results in variable selection (probably a
consequence of the large number of genes selected) but
in practical applications this increase in stability is probably far
outweighed by the very large number of selected genes.
838:   
839: 
840: 
841: 
842: 
843: 
844: 
845:  \begin{table}[p]
846:  \begin{center}
847:    \caption{\label{stability} Stability of variable (gene) selection evaluated
848:      using 200 bootstrap samples. ``\# Genes'': number of genes
849:      selected on the original data set. ``\# Genes boot.'': median
     (1st quartile, 3rd quartile) of the number of genes selected from
     the bootstrap samples. ``Freq.\ genes'': median (1st quartile,
     3rd quartile) of the frequency with which each gene selected from the
     original data set appears among the genes selected from the
854:      bootstrap samples. Parameters for backwards elimination with
855:      random forest: $mtryFactor = 1, s.e.\ = 0, ntree = 2000,
856:      ntreeIterat = 1000, fraction.dropped = 0.2$.}
857:  \end{center}
858:  \begin{center}
859: \vspace{-32pt} %%% use for bioinformatics.
860:  {\footnotesize
861:  \begin{tabular}{l|rrrr}
862:  Data set& Error & \# Genes & \# Genes boot. & Freq. genes\\
863:  \hline
864:  \hline
865:  \multicolumn{5}{c}{\textbf{Backwards elimination of genes from random forest}}\\ %%% OK
866:  \hline
867: %\multicolumn{5}{c}{$mtryFactor = 1, s.e.\ = 0, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2$}\\ %%% OK
868: \multicolumn{5}{c}{$s.e.\ = 0$}\\ %%% OK
869:  \hline
870: % %$mtry$1, se1, $ntree = 2000
871:  Leukemia    &  0.087 &   2 &  2 (2, 2) &   0.38 (0.29, 0.48)\footnotemark[1]\\ 
872:  Breast 2 cl.&  0.337 &  14 &  9 (5, 23)&   0.15 (0.1, 0.28)\\ 
873:  Breast 3 cl.&  0.346 & 110 & 14 (9, 31)&   0.08 (0.04, 0.13)\\ 
874:  NCI 60      &  0.327 & 230 & 60 (30, 94)&    0.1 (0.06, 0.19)\\ 
875:  Adenocar.   &  0.185 &   6 &  3 (2, 8)&   0.14 (0.12, 0.15)\\ 
876:  Brain       &  0.216 &  22 & 14 (7, 22)&   0.18 (0.09, 0.25)\\ 
877:  Colon       &  0.159 &  14 &  5 (3, 12)&   0.29 (0.19, 0.42)\\ 
878:  Lymphoma    &  0.047 &  73 & 14 (4, 58)&   0.26 (0.18, 0.38)\\ 
879:  Prostate    &  0.061 &  18 &  5 (3, 14)&   0.22 (0.17, 0.43)\\ 
880:  Srbct       &  0.039 & 101 & 18 (11, 27)&    0.1 (0.04, 0.29)\\ 
881:  \hline
882:  \hline
883: %\multicolumn{4}{c}{$mtryFactor = 1, s.e.\ = 1, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2$}\\ %%% OK
884: \multicolumn{5}{c}{$s.e.\ = 1$}\\ %%% OK
885:  \hline
886: % %$mtry$1, se1, $ntree = 2000, $ntree$Iterat = 1000
887:  Leukemia    & 0.075 &  2 &  2 (2, 2)&    0.4 (0.32, 0.5)\footnotemark[1]\\ 
888:  Breast 2 cl.& 0.332 & 14 &  4 (2, 7)&   0.12 (0.07, 0.17)\\ 
889:  Breast 3 cl.& 0.364 &  6 &  7 (4, 14)&   0.27 (0.22, 0.31)\\ 
890:  NCI 60      & 0.353 & 24 & 30 (19, 60)&   0.26 (0.17, 0.38)\\ 
891:  Adenocar.   & 0.207 &  8 &  3 (2, 5)&   0.06 (0.03, 0.12)\\ 
892:  Brain       & 0.216 &  9 & 14 (7, 22)&   0.26 (0.14, 0.46)\\ 
893:  Colon       & 0.177 &  3 &  3 (2, 6)&   0.36 (0.32, 0.36)\\ 
894:  Lymphoma    & 0.042 & 58 & 12 (5, 73)&   0.32 (0.24, 0.42)\\ 
895:  Prostate    & 0.064 &  2 &  3 (2, 5)&    0.9 (0.82, 0.99)\footnotemark[1]\\ 
896:  Srbct       & 0.038 & 22 & 18 (11, 34)&   0.57 (0.4, 0.88)\\ 
897:  \hline
898:  \hline
899:  \multicolumn{5}{c}{\textbf{Alternative approaches}}\\ %%% OK
900:  \hline
901: %  \multicolumn{5}{c}{Shrunken centroids; minimizing error rate then}\\
902: %   \multicolumn{5}{c}{minimizing number of genes selected}\\ %%% OK
903:   \multicolumn{5}{c}{SC.s}\\
904:  \hline
905:  Leukemia    & 0.062 &   82\footnotemark[2] &   46 (14, 504)&   0.48 (0.45, 0.59)\\ 
906:  Breast 2 cl.& 0.326 &   31 &   55 (24, 296)&   0.54 (0.51, 0.66)\\ 
907:  Breast 3 cl.& 0.401 & 2166 & 4341 (2379, 4804)&   0.84 (0.78, 0.88)\\ 
908:  NCI 60      & 0.246 & 5118 & 4919 (3711, 5243)&   0.84 (0.74, 0.92)\\ 
909:  Adenocar.   & 0.179 &    0 &    9 (0, 18)&     NA (NA, NA)\\ 
910:  Brain       & 0.159 & 4177 & 1257 (295, 3483)&   0.38 (0.3, 0.5)\\ 
911:  Colon       & 0.122 &   15 &   22 (15, 34)&    0.8 (0.66, 0.87)\\ 
912:  Lymphoma    & 0.033 & 2796 & 2718 (2030, 3269)&   0.82 (0.68, 0.86)\\ 
913:  Prostate    & 0.089 &    4 &    3 (2, 4)&   0.72 (0.49, 0.92)\\ 
914:  Srbct       & 0.025 &   37\footnotemark[3] &   18 (12, 40)&   0.45 (0.34, 0.61)\\ 
915:  \hline
916:  \hline
917: %\multicolumn{5}{c}{Nearest Neighbor with variable selection}\\ %%% OK
918: \multicolumn{5}{c}{NN.vs}\\ %%% OK
919:  \hline
920:  Leukemia    & 0.056 &  512 &  23 (4, 134)&   0.17 (0.14, 0.24)\\ 
921:  Breast 2 cl.& 0.337 &   88 &  23 (4, 110)&   0.24 (0.2, 0.31)\\ 
922:  Breast 3 cl.& 0.424 &    9 &  45 (6, 214)&   0.66 (0.61, 0.72)\\ 
923:  NCI 60      & 0.237 & 1718 & 880 (360, 1718)&   0.44 (0.34, 0.57)\\ 
924:  Adenocar.   & 0.181 & 9868 &  73 (8, 1324)&   0.13 (0.1, 0.18)\\ 
925:  Brain       & 0.194 & 1834 & 158 (52, 601)&   0.16 (0.12, 0.25)\\ 
926:  Colon       & 0.158 &    8 &   9 (4, 45)&   0.57 (0.45, 0.72)\\ 
927:  Lymphoma    & 0.04 &   15 &  15 (5, 39)&    0.5 (0.4, 0.6)\\ 
928:  Prostate    & 0.081 &    7 &   6 (3, 18)&   0.46 (0.39, 0.78)\\ 
929:  Srbct       & 0.031 &   11 &  17 (11, 33)&    0.7 (0.66, 0.85)\\ 
930:  \hline
931: 
932:  \end{tabular}
933:  }
934:  \end{center}
935:  \renewcommand{\baselinestretch}{0.2}\footnotesize\normalsize\footnotesize
936:  {%\setlength{\baselineskip}{1pt} \renewcommand{\baselinestretch}{0.1}\footnotesize\normalsize\footnotesize
   $^1$Only two genes are selected from the complete data set; the values are the actual
   frequencies of those two genes.\\
   $^2$\citet{shrunkenc} select 21 genes after visually inspecting
   the plot of
   cross-validation error rate vs.\ amount of shrinkage and number of
   genes. Their procedure is hard to automate and thus it is very difficult to obtain estimates of the error
   rate of their procedure.\\
   $^3$\citet{shrunkenc} select 43 genes. The difference is likely due
   to differences in the random partitions for cross-validation. Repeating the gene selection process 100 times
   with the full data set, the median, 1st quartile, and 3rd
   quartile of the number of selected genes are 13, 8, and 147.\\
948: 
949: 
950:  }
951:  \end{table}
952: 
953: 
954: 
955: 
956: 
957: \section{Discussion}
958: 
959: We have examined the performance of an approach for gene selection using random
960: forest, and compared it to alternative approaches. Our results, using both
961: simulated and real microarray data sets, show that this method of gene
962: selection accomplishes the proposed objectives.  Our method returns very small
963: sets of genes compared to two alternative variable selection methods, while
964: retaining predictive performance comparable to that of seven alternative
965: state-of-the-art methods.  Recently, \citet{BMA-selection} have proposed a
966: Bayesian model averaging (BMA) approach for gene selection; comparing the
967: results for the two common data sets between our study and theirs, in one case
968: (Leukemia) our procedure returns a much smaller set of genes (2 vs. 15),
969: whereas in another (Breast, 2 class) their BMA procedure returns 8 fewer genes
(14 vs. 6); our procedure does not require setting a limit on the maximum
number of relevant genes to be selected, nor does it require prespecifying a
number of top-ranked genes as relevant (the latter is not required by the BMA
procedure either).  
974: 
975: Our method of gene selection will not return sets of genes
976: that are highly correlated, because they are redundant.  This method will be
977: most useful under two scenarios: a) when considering the design of diagnostic
978: tools, where having a small set of probes is often desirable; b) to help
979: understand the results from other gene selection approaches that return many
980: genes, so as to understand which ones of those genes have the largest signal to
981: noise ratio and could be used as surrogates for complex processes involving
many correlated genes. A backwards elimination method, precursor to the one
used here, has already been used to predict breast tumor type based on
chromosomal alterations \citep{SaraRF}.
985: 
986: 
We have also thoroughly examined the effects of changes in the
988: parameters of random forest (specifically $mtry$, $ntree$, $nodesize$)
989: and the variable selection algorithm ($se$, $fraction.dropped$).
990: Changes in these parameters have in most cases negligible effects,
991: suggesting that the default values are often good options, but we can
992: make some general recommendations. 
Time of execution of the code increases approximately linearly with $ntree$.
994: Larger $ntree$ values lead to slightly more stable values of variable
995: importances, but for the data sets examined, $ntree = 2000$ or $ntree = 5000$
996: seem quite adequate, with further increases having negligible effects. The
997: change in $nodesize$ from 1 to 5 has negligible effects, and thus its default
998: setting of 1 is appropriate.  For the backwards elimination algorithm, the
999: parameter $fraction.dropped$ can be adjusted to modify the resolution of the
number of variables selected; smaller values of $fraction.dropped$ lead to finer
1001: resolution in the examination of number of genes, but to slower execution of
the code.  Finally, the parameter $se$ also has minor effects on the results of
the backwards variable selection algorithm, but a value of $se = 1$ leads to
1004: slightly more stable results.
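
As a rough guide to this trade-off, and assuming that each iteration discards
a fixed fraction $f = fraction.dropped$ of the remaining genes (ignoring
rounding; $p$ and the numerical values below are hypothetical), the number of
genes left after $j$ iterations is approximately
\begin{equation*}
  n_j \approx p\,(1 - f)^{j}, \qquad \text{so that} \qquad
  j \approx \frac{\log(p/n_j)}{-\log(1 - f)}.
\end{equation*}
Starting from $p = 4000$ genes, reaching $n_j = 2$ takes about 34 iterations
with $f = 0.2$ but only about 11 with $f = 0.5$; the smaller value thus
examines roughly three times as many candidate gene-set sizes, at the cost of
fitting roughly three times as many forests.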
1005: 
1006: 
1007: 
1008: The final issue addressed in this paper is instability or multiplicity of the
1009: selected sets of genes. From this point of view, the results are slightly
disappointing. But so are the results of the competing methods. And so are the
results of most methods examined so far with microarray data, as shown by
\citet{EinDor} and \citet{Michielis} and discussed thoroughly by
1013: \citet{Somorjai2003} for classification and by \citet{pan-pnas} for the related
1014: problem of the effect of threshold choice in gene selection.  However, and
1015: except for the above cited papers and the review in \citet{Yo-azuaje}, this is
1016: an issue that still seems largely ignored in the microarray literature. As
1017: these papers and the statistical literature on variable selection
\citep[e.g.,][]{breiman-2-cultures, harrell-01} discuss, the causes of the
1019: problem are small sample sizes and the extremely small ratio of samples to
1020: variables (i.e., number of arrays to number of genes). Thus, we might need to
1021: learn to live with the problem, and try to assess the stability and robustness
1022: of our results by using a variety of gene selection features, and examining
1023: whether there is a subset of features that tends to be repeatedly selected.
1024: This concern is explicitly taken into account in our results, and facilities
1025: for examining this problem are part of our R code.
1026: 
1027: 
1028: The multiplicity problem, however, does not need to result in large
1029: prediction errors.  This and other papers \citep{dudoit-dlda, pelora,
1030:   simon.book, romualdi-03, bag-boost, Somorjai2003} show that very different
1031: classifiers often lead to comparable and successful error rates with
1032: a variety of microarray data sets. Thus, although improving prediction
rates is important \citep[especially if giving consideration to ROC
1034: curves, and not just overall prediction error rates;][]{pepe-book},
1035: when trying to address questions of biological mechanism or discover
1036: therapeutic targets, probably a more challenging and relevant issue is
1037: to identify sets of genes with biological relevance.
1038: 
1039: 
1040: Two areas of future research are using random forest for the selection of
1041: potentially large sets of genes that include correlated genes, and improving
1042: the computational efficiency of these approaches; in the present work, we have
1043: used parallelization of the ``embarrassingly parallelizable'' tasks using MPI
1044: with the Rmpi and Snow packages \citep{Rmpi, snow} for R. In a broader context,
1045: further work is warranted on the stability properties and biological relevance
1046: of this and other gene-selection approaches, because the multiplicity problem
1047: casts doubts on the biological interpretability of most results based on a
1048: single run of one gene-selection approach.
1049: 
1050: 
1051: 
1052: 
1053: 
1054: %%% Both allow var sel; the type of var sel is wrapper approach, which
1055: %%% should be superior to ``filter'' approaches. variable importance plots not affected
1056: %%% by multicol. However, not many unique (stable) results. Select only
1057: %%% the most important from variable importance plots, or use a large set of candidates.
1058: %%% With backwards, can help to examine if the different, non-overlapping,
1059: %%% sets of vars are in similar routes, etc.
1060: 
1061: %%% Examinar también plots of ``flatness'' of OOB and numero de genes.
1062: %%% Indication of how important things are (and plots of flatness in
1063: %%% bootstrap samples?). Problem is: no longer emphasis on which are the
1064: %%% selected genes. Here the approach of Svetnik et al more relevant?
1065: 
1066: 
1067: 
1068: \section{Conclusion}
1069: The proposed method can be used for variable selection fulfilling the
1070: objectives above: we can obtain very small sets of non-redundant genes while
1071: preserving predictive accuracy. These results clearly indicate that the
1072: proposed method can be profitably used with microarray data. Given its
1073: performance, random forest and variable selection using random forest should
1074: probably become part of the ``standard tool-box'' of methods for the analysis
1075: of microarray data.
1076: 
1077: 
1078: 
1079: 
1080: \section{Acknowledgements}
1081: 
1082: % This work arised out of work I did in collaboration with S.\ Álvarez
1083: % de Andrés; I thank her for the opportunity to collaborate in that
1084: % work, and for her patience and enthusiasm. 
1085: 
1086: Most of the simulations and analyses were carried out in the Beowulf
1087: cluster of the Bioinformatics unit at CNIO, financed by the RTICCC
1088: from the FIS; J.~M.\ Vaquerizas provided help with the administration
1089: of the cluster. A.\ Liaw provided discussion, unpublished manuscripts,
1090: and code.  C.\ Lázaro-Perea provided many discussions and comments on
1091: the ms. A.\ Sánchez provided comments on the ms.  I.\ Díaz showed
1092: R.D.-U. the forest, or the trees, or both.  R.D.-U. partially
1093: supported by the Ramón y Cajal program of the Spanish MEC (Ministry
1094: of Education and Science); S.A.A. supported by project C.A.M.
1095: GR/SAL/0219/2004; funding provided by project TIC2003-09331-C02-02 of
1096: the Spanish MEC.
1097: 
1098: 
1099: \bibliography{signatures2}
1100: \bibliographystyle{bioinformatics}
1101: 
1102: 
1103: 
1104: %\end{multicols}
1105: \newpage
1106: 
1107: 
1108: 
1109: 
1110: 
1111: 
1112: 
1113: 
1114: 
1115: 
1116: 
1117: 
1118: \end{document}
1119: 
1120: 
1121: %All with R, library  randomForest. Code available.
1122: 
1123: 
1124: 
1125: 
1126: %Although occassionally plots of variable importance show a clear
1127: %pattern where only a few variables stand out, most often
1128: 
1129: 
1130: **************************
1131: 
1132: 
1133: We will also mention computational requirements.
1134: 
1135: It should be possible to
1136: use these measures of variable importance to single out
1137: genes of particular relevance for a given condition.
1138: 
1139: 
1140: 
1141: 
1142: Future work:
1143: ------------
1144: - Changes in $mtry$, since small mtries should lead to faster
1145: runs and further decreases in correlations of trees.
1146: 
1147: - After variable reduction: use all variables in models
1148: building (i.e., $mtry = number of variables)?
1149: 
1150: