1: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3: %%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%
4: %%%%%%%%%%%%%%%%%%%% Technical report %%%%%%%%%%%%%%%%%%%%
5: %%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%
6: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
7: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
8:
9: \documentclass[10pt]{article}
10: \usepackage[latin1]{inputenc}
11: \usepackage{geometry}
12: \geometry{verbose,a4paper,tmargin=20mm,bmargin=20mm,lmargin=20mm,rmargin=20mm}
13: \usepackage{setspace}
14: \usepackage{graphics}
15: \singlespacing
16: \usepackage{verbatim}
17: \usepackage{amsmath}
18: \usepackage{url}
19: \bibliographystyle{bioinformatics}
20: \usepackage[authoryear, round, sort]{natbib}
21: \usepackage{hyperref} %%??
22:
23: \title{Variable selection from random forests: application to gene expression data}
24: \author{\vspace{20pt}
25: Ramón Díaz-Uriarte$^{1,3}$, Sara Alvarez de Andrés$^2$\\
$^1$Bioinformatics Unit, $^2$Cytogenetics Unit\\
27: Biotechnology Programme\\
28: Spanish National Cancer Center (CNIO)\\
29: Melchor Fernández Almagro 3 \\
30: Madrid, 28029\\
31: \vspace{20pt}
32: Spain. \\
33: $^3$ Author for correspondence.\\
34: \texttt{rdiaz@ligarto.org}\\
35: \url{http://ligarto.org/rdiaz}\\
36: }
37: \date{
38: \vspace*{40pt}
39: 2005-06-22 \\
40: \vspace{20pt}
41: {\bf Running Head:} Gene selection with random forest.} %%%% eliminate for tech report}
42: \begin{document}
43: \maketitle
44: \newpage
45: \begin{abstract}
46:
47: Random forest is a classification algorithm well suited for
48: microarray data: it shows excellent performance even when most
49: predictive variables are noise, can be used when the number of
50: variables is much larger than the number of observations, and
51: returns measures of variable importance. Thus, it is important to
52: understand the performance of random forest with microarray data and
53: its use for gene selection.
54:
55:
56: We first show the effects of changes in parameters of random forest on the
57: prediction error. Then we present an approach for gene selection
58: that uses measures of variable importance and error rate,
59: and is targeted towards the selection of small sets of genes. Using
60: simulated and real microarray data, we show that the gene selection
61: procedure yields small sets of genes while preserving predictive accuracy.
62:
63:
64: We first show the effects of changes in parameters of random forest
65: on the prediction error rate with microarray data. Then we present
two approaches for gene selection with random forest: 1) comparing
variable importance plots from original and permuted data
sets; 2) using backwards variable elimination. Using simulated and
real microarray data, we show that: 1) variable importance plots can be used to recover
the full set of genes related to the outcome of interest, without
being adversely affected by collinearities; 2) backwards variable
elimination yields small sets of genes while preserving predictive
accuracy (compared to several state-of-the-art algorithms). Thus,
both methods are useful for gene selection.
75:
76: All code is available as an R package, varSelRF, from CRAN
77: \href{http://cran.r-project.org/src/contrib/PACKAGES.html}
78: {http://cran.r-project.org/src/contrib/PACKAGES.html} or from the supplementary
79: material page.
80:
81: Supplementary information:
82: \href{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}
83:
84:
85: \end{abstract}
86:
87:
88: \footnotetext[1]{To whom correspondence should be addressed}
89:
90:
91:
92: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
93: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
94: %%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%
95: %%%%%%%%%%%%%%%%%%%% Bioinformatics %%%%%%%%%%%%%%%%%%%%
96: %%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%
97: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
98: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
99:
100: %%% \renewcommand{\thefootnote}{\fnsymbol{footnote}}
101:
102: %%% \documentclass{bioinfo}
103: %%% \copyrightyear{2005}
104: %%% \pubyear{2005}
105: %%% \usepackage[latin1]{inputenc}
106:
107: %%% \begin{document}
108: %%% \firstpage{1}
109:
110: %%% \title[Gene selection with random forests]{Variable selection from random forests: application to gene expression data}
111:
112: %%% % \author{Ramón Díaz-Uriarte}
113: %%% % \address{Bioinformatics Unit\\
114: %%% % Spanish National Cancer Center (CNIO)\\
115: %%% % Melchor Fernández Almagro 3 \\
116: %%% % Madrid, 28029\\
117: %%% % Spain
118: %%% % }
119:
120: %%% \author{Ramón Díaz-Uriarte\,$^{\rm a}$\footnote{To whom correspondence should be addressed}, Sara Alvarez de
121: %%% Andrés\,$^{\rm b}$}
122:
123: %%% \author{Ramón Díaz-Uriarte\,$^{\rm a,}$\footnotemark[1], Sara Alvarez de
124: %%% Andrés\,$^{\rm b}$}
125: %%% \address{$^{a}$Bioinformatics Unit, $^{b}$Cytogenetics Unit\\
126: %%% Biotechnology Programme\\
127: %%% Spanish National Cancer Centre (CNIO)\\
128: %%% Melchor Fernández Almagro 3 \\
129: %%% Madrid, 28029\\
130: %%% Spain
131: %%% }
132:
133: %%% \maketitle
134:
135: %%% \begin{abstract}
136:
137: %%% \section{Motivation:}
138: %%%Random forest is a classification algorithm well suited
139: %%%for microarray data: it shows excellent performance
140: %%%even when most predictive variables are noise, can be
141: %%%used when the number of variables is much larger than
142: %%%the number of observations, and returns measures of
143: %%%variable importance. Thus, it is important to
144: %%%understand the performance of random forest with
145: %%%microarray data and its use for gene selection.
146:
147:
148: %%% \section{Results:}
149: %%% We first show the effects of changes in parameters of random forest on the
150: %%% prediction error. Then we present an approach for gene selection
151: %%% that uses measures of variable importance and error rate,
152: %%% and is targeted towards the selection of small sets of genes. Using
153: %%% simulated and real microarray data, we show that the gene selection
154: %%% procedure yields small sets of genes while preserving predictive accuracy.
155:
156: %%% \section{Availability:}
157: %%% All code is available as an R package, varSelRF, from CRAN,
158: %%%\href{http://cran.r-project.org/src/contrib/PACKAGES.html}
159: %%%{http://cran.r-project.org/src/contrib/PACKAGES.html}, or from the supplementary
160: %%%material page.
161:
162:
163: %%% \section{Contact:}
164: %%% \href{rdiaz@ligarto.org}{rdiaz@ligarto.org}
165: %%% \section{Supplementary information:}\\
166: %%% \href{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}{http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html}
167:
168:
169:
170: %%% \end{abstract}
171:
172: %%% \footnotetext[1]{To whom correspondence should be addressed}
173: %%%%%%%%%%%%%%%%%%%%%%%%5 end bioinformatics
174:
175:
176:
177: \section{Introduction}
178:
179:
180: Random forest is an algorithm for classification developed by Leo
181: Breiman \citep{breiman-rf} that uses an ensemble of classification
182: trees \citep{cart, ripley-96, htf-01}. Each of the classification
183: trees is built using a bootstrap sample of the data, and at each split
184: the candidate set of variables is a random subset of the variables.
185: Thus, random forest uses both bagging (bootstrap aggregation), a
186: successful approach for combining unstable learners
187: \citep{breiman-bagging, htf-01}, and random variable selection for
188: tree building. Each tree is unpruned (grown fully), so as to obtain
189: low-bias trees; at the same time, bagging and random variable
190: selection result in low correlation of the individual trees. The
191: algorithm yields an ensemble that can achieve both low bias and low
192: variance (from averaging over a large ensemble of low-bias,
high-variance but low-correlation trees).
194:
195:
196: Random forest has excellent performance in classification tasks,
197: comparable to support vector machines. Although random forest is not
198: widely used in the microarray literature \citep[but see][]{SaraRF,
199: Izmir2004, Wu.Zhao2003, Gunther.Heyes2003, Man-rf,
200: Schwender.Bolt2004}, it has several characteristics that make it
ideal for these data sets: a) it can be used when there are many more
variables than observations; b) it has good predictive performance
even when most predictive variables are noise; c) it does not
overfit; d) it can handle a mixture of categorical and continuous
predictors; e) it incorporates interactions among predictor variables; f)
its output is invariant to monotone transformations of the predictors;
g) there are high-quality, free implementations: the original
Fortran code from L.\ Breiman and A.\ Cutler, and an R package from
A.\ Liaw and M.\ Wiener \citep{rf-rnews}; h) there is little need to
fine-tune parameters to achieve excellent performance; i) it returns measures of
variable (gene) importance. The most
212: important parameter to choose is $mtry$, the number of input variables
213: tried at each split, but it has been reported that the default value
214: is often a good choice \citep{rf-rnews}. In addition, the user needs
215: to decide how many trees to grow for each forest ($ntree$) as well as
216: the minimum size of the terminal nodes ($nodesize$). These three
parameters will be thoroughly examined in this paper.
218:
219: Given these promising features, it is important to understand the
220: performance of random forest compared to alternative state-of-the-art
221: prediction methods with microarray data, as well as the effects
222: of changes in the parameters of random forest. In this paper we present, as necessary
background for the main topic of the paper (gene selection), the first
thorough examination of these issues, including evaluating the effects
225: of $mtry$, $ntree$ and $nodesize$ on error rate using
226: nine real microarray data sets and simulated data.
227:
228: The main question addressed in this paper is gene selection using random
229: forest. A few authors have previously used variable selection with random
230: forest. \citet{dudoit-inbook} and \citet{Wu.Zhao2003} use filtering approaches
231: and, thus, do not take advantage of the measures of variable importance
232: returned by random forest as part of the algorithm. \citet{svetnik} propose a
233: method that is somewhat similar to our approach. The main difference is that
234: \citet{svetnik} first find the ``best'' dimension ($p$) of the model, and then
choose the $p$ most important variables. This is a sound strategy when the
objective is to build accurate predictors, without any regard for model
interpretability. However, it might not be the most appropriate strategy for our purposes,
as it shifts the emphasis away from selection of specific genes, and in genomic
239: studies the identity of the selected genes is relevant (e.g., to understand
240: molecular pathways or to find targets for drug development).
241:
242:
243: The last issue addressed in this paper is the multiplicity (or lack of
244: uniqueness or lack of stability) problem. Variable selection with microarray
245: data can lead to many solutions that are equally good from the point of view of
246: prediction rates, but that share few common genes. This multiplicity problem
247: has been emphasized by \citet{Somorjai2003} and recent examples are shown in
248: \citet{EinDor} and \citet{Michielis}. Although multiplicity of results is not a problem when
249: the only objective of our method is prediction, it casts serious doubts on the
biological interpretability of the results \citep{Somorjai2003}. Unfortunately,
most ``methods papers'' in bioinformatics do not evaluate the stability of the
results obtained, leading to a false sense of trust in the biological
interpretability of the output obtained. Our paper presents a thorough and
critical evaluation of the stability of the lists of selected genes with the
proposed (and two competing) methods.
256:
257:
258:
259:
260:
261:
262:
263:
264: \section{Variable selection methods}
265:
266: \subsection{Two objectives of variable selection}
267:
268: When facing gene selection problems, biomedical researchers often show
269: interest in one of the following objectives:
270:
271: \begin{enumerate}
272: \item To identify relevant genes for subsequent research; this
273: involves obtaining a (probably large) set of genes that are related
274: to the outcome of interest, and this set should include genes even if they
275: perform similar functions and are highly correlated.
276:
277: \item To identify small sets of genes to be used for diagnostic
278: purposes in clinical practice; this involves obtaining the smallest
279: possible set of genes that can still achieve good predictive
280: performance (thus, ``redundant'' genes should not be selected).
281: \end{enumerate}
282:
283: We will focus on the second objective. The use of random forest for the first
284: objective is under investigation and will be reported elsewhere.
285:
286:
287: \subsection{Variable importance from random forest}
288:
289: Random forest returns several measures of variable importance. The
290: most reliable measure is based on the decrease
291: of classification accuracy when values of a variable in a node of a
292: tree are permuted randomly \citep{breiman-rf, Bureau2003},
293: and this is the measure of variable importance (in its unscaled
294: version ---see supplementary material) that we will use in the rest of
295: the paper.
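
As a hedged sketch of this measure (the notation is ours and is not
taken from the random forest code), let $a_t$ be the proportion of
out-of-bag cases of tree $t$ that are correctly classified, and let
$a_t^{(j)}$ be the same proportion after the values of variable $j$
have been randomly permuted among those out-of-bag cases. Assuming
this usual formulation, the unscaled importance of variable $j$ is the
average decrease in accuracy over the trees of the forest:
\[
  I_j = \frac{1}{ntree} \sum_{t=1}^{ntree} \left( a_t - a_t^{(j)} \right).
\]
A variable unrelated to the outcome should have $I_j$ close to 0
(possibly slightly negative), because permuting it does not change
the predictions in any systematic way.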
296:
297:
298: % . This measure is sometimes reported as such, and
299: % sometimes it is reported after scaling it, or dividing by a quantity
300: % somewhat analogous to its standard error (``somewhat analogous''
301: % because the data used to obtain that ``standard error'' are not truly
302: % independent, and thus the true standard error can be severely
303: % underestimated). We use in this paper the unscaled importance
304: % measure, because it allows us to compare directly runs with different
305: % settings of $ntree$ and $mtry$ (in contrast, scaled importances
306: % increase monotonically as we increase the value of $ntree$).
307: % %%%Explain why we use unscaled importance:
308: %%%a) allows direct comparison between runs with different ntrees and mtries.
309: %%%b) does not mislead into considering them Z scores.
310:
311:
312:
313: \subsection{Backwards elimination of variables (genes) using OOB error}
314:
To select genes we can iteratively fit random forests, at each iteration
316: building a new forest after discarding those variables (genes) with the
317: smallest variable importances; the selected set of genes is the one that yields the
318: smallest error rate. Random forest returns a measure of error rate based on
319: the out-of-bag cases for each fitted tree, the OOB error, and this is the
320: measure of error we will use. Note that in this section we are using OOB
321: error to choose the final set of genes, not to obtain unbiased estimates of the
error rate of this rule. Because of the iterative approach, the OOB error is
biased downwards and cannot be used to assess the overall error rate of the approach,
324: for reasons analogous to those leading to ``selection bias'' \citep{ambroise, simon-03}. To assess prediction error rates we will use the bootstrap, not
325: OOB error (see section \ref{boot}). (Using error rates
326: affected by selection bias to select the optimal number of genes is
327: not necessarily a bad procedure from the point of view of selecting
328: the final number of genes; see \citet{Braga-Neto.Carroll2004}).
329: %\citet{svetnik} leave aside a set of data,
330: %and decide on the stopping criterion using the error rate on the test data.
331: %This approach, however, is problematic when, as in our case, we are interested
332: %in specific genes and not in using the test set error rate to select the
333: %number of genes.
334:
335: In our algorithm we examine all forests that result from eliminating,
336: iteratively, a fraction, $fraction.dropped$, of the genes (the
least important ones) used in the previous iteration. By default,
$fraction.dropped = 0.2$, which allows for relatively fast operation,
is coherent with the idea of an ``aggressive variable selection''
approach, and increases the resolution as the number of genes
considered becomes smaller. We do not recalculate variable
importances at each step, because \citet{svetnik} mention severe overfitting
resulting from recalculating variable importances. After fitting all
344: forests, we examine the OOB error rates from all the fitted random
345: forests. We choose the solution with the smallest number of genes
346: whose error rate is within $u$ standard errors of the minimum error
347: rate of all forests.
348: % (The standard error is calculated using the
349: % expression for a binomial error count [$\sqrt{p (1-p) * 1/N}$]).
350: Setting $u = 0$ is the same as selecting the set of genes that
351: leads to the smallest error rate. Setting $u = 1$ is similar to the
352: common ``1 s.e. rule'', used in the classification trees literature
353: \citep{ripley-96, cart}; this strategy can lead to solutions with
354: fewer genes than selecting the solution with the smallest error
355: rate, while achieving an error rate that is not
356: different, within sampling error, from the ``best solution''. In this
357: paper we will examine both the ``1 s.e. rule'' and the ``0 s.e.
358: rule''.
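
The schedule of forests and the selection rule can be written down
explicitly. The following is a sketch in our own notation; the
handling of rounding and the point at which the standard error is
evaluated are assumptions, not a description of the actual
implementation. If $p_0$ is the initial number of genes and
$f = fraction.dropped$, then, ignoring rounding, the forest fitted at
iteration $k$ uses
\[
  p_k \approx p_0 (1 - f)^k
\]
genes. For example, with $p_0 = 2000$ and $f = 0.2$, about
$2000 \times 0.8^{10} \approx 215$ genes remain after 10 iterations,
and about 31 iterations are needed to reach 2 genes. Let $e_k$ be the
OOB error rate of the forest with $p_k$ genes, $e_{\min} = \min_k e_k$,
and $N$ the number of samples. Assuming a binomial standard error
evaluated at the minimum error rate,
\[
  se = \sqrt{\frac{e_{\min} (1 - e_{\min})}{N}}, \qquad
  p^{*} = \min \{ p_k : e_k \le e_{\min} + u \cdot se \}.
\]
With $u = 0$ only forests that attain $e_{\min}$ qualify; with $u = 1$
any smaller forest whose OOB error is within one standard error of
$e_{\min}$ can be selected instead.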
359:
360:
361:
362: %Note here no need for very large mtries, etc, since we do not want all
363: %the important genes, but just enough genes to do a good job.
364:
365: %Besides the stopping criterion we have also chosen the following settings:
366:
367: %\begin{itemize}
368: %\item We examine all forest that result from iteratively
369: % \textbf{eliminating the lower 50\% of the genes}; this
370: % allows for relatively fast operation, and is coherent with the
371: % idea of an ``aggressive variable selection'' approach, and
372: % increases the ``resolution'' as the number of genes
373: % considered becomes smaller.
374: %\item \textbf{Variable importances are not recalculated at each step}, but
375: % instead we use the variable importances computed at the end of
376: % the run; we have not observed important differences whether or
377: % not variable importances are recalculated, but \citet{svetnik}
378: % mention severe overfitting resulting from recalculating
379: % variable importances.
380: %\item We examine the OOB error rates from all the fitted random
381: % forests. We choose the \textbf{solution with the smallest number of
382: % genes whose error rate is within 1 standard error of the
383: % minimum error rate of all forests}. and the ``1 SE rule'' is common in the
384: % classification trees literature ).
385: %\end{itemize}
386:
387:
388:
389: \section{Evaluation of performance}
390:
391: \subsection{Data sets}
392: We have used both simulated and real microarray data sets to evaluate
393: the variable selection procedure. For the real
394: data sets, original reference paper and main features are shown in
395: Table \ref{datasets}. Further details are provided in the
396: supplementary material.
397:
398:
399: \begin{table}
400: \caption{\label{datasets} Main characteristics of the microarray data
401: sets used.}
402: {\footnotesize
403: \begin{tabular}{l|lrrr}
404: Dataset & Original ref.&Genes&Patients&Classes \\
405: \hline
406: Leukemia &\citet{golub}&3051&38&2\\
Breast 2 cl. &\citet{vveer}&4869&78&2\\
Breast 3 cl. &\citet{vveer}&4869&96&3\\
409: NCI 60 &\citet{ross}&5244&61&8\\
410: Adenocar-\\
411: cinoma &\citet{ramas-03}&9868&76&2\\
412: Brain &\citet{pomeroy}&5597&42&5\\
413: Colon &\citet{alon}&2000&62&2\\
414: Lymphoma &\citet{alizadeh}&4026&62&3\\
415: Prostate &\citet{singh}&6033&102&2\\
416: Srbct &\citet{khan}&2308&63&4\\
417: \hline
418: \end{tabular}
419: }
420: \end{table}
421:
422:
423:
424: % first four data sets. For the last five, the binary R data files were
425: % obtained from M.\ Dettling's web page
426: % \url{http://stat.ethz.ch/~dettling/bagboost.html}; the data sets
427: % and their preprocessing are fully described in \cite{wilma}.
428:
429:
430: To evaluate if the proposed procedure can recover the signal in the
431: data, we need to use simulated data, so that we know exactly which
432: genes are relevant. Data have been simulated using different numbers
433: of classes of patients (2 to 4), number of independent dimensions (1
434: to 3), and number of genes per dimension (5, 20, 100). In all cases,
435: we have set to 25 the number of subjects per class. Each independent
436: dimension has the same relevance for discrimination of the classes.
437: The data come from a multivariate normal distribution with variance of
438: 1, a (within-class) correlation among genes within dimension of
439: 0.9, and a within-class correlation of 0 between genes from different
440: dimensions, as those are independent. The multivariate means have
441: been set so that the unconditional prediction error rate
442: \citep{mclach-dlda} of a linear discriminant analysis using one gene
443: from each dimension is approximately 5\%. To each data set we have
444: added 2000 random normal variates (mean 0, variance 1) and 2000 random
445: uniform $[-1, 1]$ variates. In addition, we have generated data sets
446: for 2, 3, and 4 classes where no genes have signal (all 4000 genes are
447: random). For the non-signal data sets we have generated four
448: replicate data sets for each level of number of classes. Further
449: details are provided in the supplementary material.
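
In symbols (a sketch in our own notation, which does not appear in the
simulation code), the expression values $\mathbf{x}$ of the signal
genes for a subject of class $c$ are drawn as
\[
  \mathbf{x} \mid c \sim N(\boldsymbol{\mu}_c, \Sigma), \qquad
  \Sigma = \mathrm{diag}(\Sigma_1, \ldots, \Sigma_D), \qquad
  \Sigma_d = (1 - \rho) I_g + \rho\, \mathbf{1}_g \mathbf{1}_g^{T},
\]
where $D$ is the number of independent dimensions (1 to 3), $g$ is the
number of genes per dimension (5, 20, or 100), $\rho = 0.9$,
$I_g$ is the $g \times g$ identity matrix, $\mathbf{1}_g$ is a
vector of $g$ ones, and $\mathrm{diag}(\cdot)$ denotes a block-diagonal
matrix. Each $\Sigma_d$ therefore has unit variances and all
off-diagonal entries equal to 0.9, and the zero off-diagonal blocks of
$\Sigma$ encode the independence between dimensions. The largest
design thus has $3 \times 100 = 300$ signal genes plus the 4000 noise
genes.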
450:
451:
452: \subsection{Competing methods}
453:
454: We have compared the predictive performance of the variable selection
455: approach with: a) random forest without any variable selection (using
456: $mtry = \sqrt{number\ of \ genes}$, $ntree = 5000$, $nodesize =
457: 1$); b) three other methods that have shown good
458: performance in reviews of classification methods with microarray data
459: \citep{dudoit-dlda, romualdi-03, bag-boost} but that do not include
460: any variable selection; c) two methods that carry out
461: variable selection.
462:
463: For the three methods that do not carry out variable selection,
464: \textbf{Diagonal Linear Discriminant Analysis (DLDA)}, \textbf{K
465: nearest neighbor (KNN)}, and \textbf{Support Vector Machines (SVM)}
466: with linear kernel, we have used, based on \cite{dudoit-dlda}, the 200
467: genes with the largest $F$-ratio of between to within groups sums of
468: squares. For \textbf{KNN}, the number of neighbors ($K$) was
469: chosen by cross-validation as in \cite{dudoit-dlda}.
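
For reference, the ratio of between to within groups sums of squares
for gene $j$ can be written as follows (our notation: $x_{ij}$ is the
expression of gene $j$ in sample $i$, $y_i$ is the class of sample $i$,
$\bar{x}_{kj}$ is the mean of gene $j$ in class $k$, and
$\bar{x}_{\cdot j}$ is its overall mean):
\[
  BW(j) = \frac{\sum_i \sum_k I(y_i = k) (\bar{x}_{kj} - \bar{x}_{\cdot j})^2}
               {\sum_i \sum_k I(y_i = k) (x_{ij} - \bar{x}_{kj})^2}.
\]
Assuming no missing values, this differs from the usual one-way ANOVA
$F$ statistic only by a degrees-of-freedom factor that is the same for
all genes, so both quantities rank the genes identically.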
470:
471:
472: One of the methods that incorporates gene selection is
473: \textbf{Shrunken centroids (SC)}, developed by \cite{shrunkenc}. We
474: have used two different approaches to determine the best number of
475: features. In the first one, \textbf{SC.l}, we choose the number of
476: genes that minimizes the cross-validated error rate and, in case of
477: several solutions with minimal error rates, we choose the one with
478: largest likelihood. In the second approach, \textbf{SC.s}, we choose
479: the number of genes that minimizes the cross-validated error rate and,
480: in case of several solutions with minimal error rates, we choose the
481: one with smallest number of genes (larger penalty). The second method
482: that incorporates gene selection is \textbf{Nearest neighbor +
483: variable selection (NN.vs)}, where we filter genes using the
484: F-ratio, and select the number of genes that leads to the smallest
485: error rate; in our implementation, we run a Nearest Neighbor
486: classifier (KNN with K = 1) on all subsets of genes that result from
487: eliminating $20\%$ of the genes (the ones with the smallest F-ratio)
488: used in the previous iteration. This approach, in its many variants
489: (changing both the classifier and the ordering criterion) is popular
490: in microarray papers; a recent example is \cite{roepman}, and
491: similar general strategies are implemented in the program Tnasas
492: \citep{gepas2}. Further
493: details of all these methods are provided in the supplementary
494: material. All simulations and analyses were carried out with R
495: \citep[http://www.r-project.org; ][]{R}, using
496: packages randomForest (from A.\ Liaw and M.\ Wiener) for random
497: forest, e1071 (E.\ Dimitriadou, K.\ Hornik, F.\ Leisch, D.\ Meyer, and
498: A.\ Weingessel) for SVM, class (B.\ Ripley and W.\ Venables) for KNN,
499: PAM \citep{shrunkenc} for shrunken centroids, and
500: geSignatures (by R.D.-U.) for DLDA.
501:
502:
503:
504:
505: \subsection{\label{boot}Estimation of error rates}
506: To estimate the prediction error rate of all methods we have used the
507: .632+ bootstrap method \citep{ambroise, 632-rule}. It must be
508: emphasized that the error rate used when performing variable selection
509: is not the error rate reported as the prediction error rate (e.g.,
510: Table \ref{error.rates}), nor the error used to compute the .632+
511: estimate. To calculate the prediction error rate (as reported, for
512: example, in Table \ref{error.rates}) the .632+ bootstrap method is
513: applied to the complete procedure, and thus the ``out-of-bag'' samples
514: used in the .632+ method are samples that are not used when fitting
515: the random forest, or carrying out variable selection. This also
516: applies when evaluating the competing methods.
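
For reference, the .632+ estimator \citep{632-rule} can be sketched as
follows; the symbols are ours, and ``procedure'' refers to the complete
method, including any variable selection. Let $\overline{err}$ be the
resubstitution error of the procedure, $\widehat{Err}^{(1)}$ the
leave-one-out bootstrap error (the average error on samples not
included in each bootstrap sample), and $\hat{\gamma}$ the
no-information error rate. Then
\[
  \widehat{Err}^{(.632+)} = (1 - \hat{w})\, \overline{err} + \hat{w}\, \widehat{Err}^{(1)},
  \qquad
  \hat{w} = \frac{.632}{1 - .368 \hat{R}},
  \qquad
  \hat{R} = \frac{\widehat{Err}^{(1)} - \overline{err}}{\hat{\gamma} - \overline{err}},
\]
where the relative overfitting rate $\hat{R}$ is restricted to
$[0, 1]$. With no overfitting ($\hat{R} = 0$) the weight is
$\hat{w} = .632$; with maximal overfitting ($\hat{R} = 1$) the weight
is $\hat{w} = 1$ and the estimate reduces to $\widehat{Err}^{(1)}$.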
517:
518:
519: \subsection{Stability (uniqueness) of results}
520: Following \citet{Faraway-92}, \citet{harrell-01}, and
521: \citet{efron-gong}, we have evaluated
522: the stability of the variable selection procedure using the
bootstrap. This allows us to assess how often a given
524: gene, selected when running the variable selection procedure in the
525: original sample, is selected when running the procedure on bootstrap
526: samples.
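
In our notation (introduced here only for illustration), if $S$ is the
set of genes selected in the original sample and $S^{*}_b$ is the set
selected in bootstrap sample $b$, $b = 1, \ldots, B$, the quantity
examined for each gene $g \in S$ is the selection frequency
\[
  \hat{\pi}_g = \frac{1}{B} \sum_{b=1}^{B} I(g \in S^{*}_b).
\]
Values of $\hat{\pi}_g$ close to 1 indicate a gene that is selected
almost regardless of the particular sample, whereas small values
indicate that the selected set is unstable.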
527:
528:
529:
530: \begin{figure}
531: \begin{center}
532: {\resizebox{!}{7.5cm}{%
533: \includegraphics{mtry.ntree.paper.real.eps}}}
534:
535:
536:
537: % \begin{figure}
538: % {\resizebox{!}{7.5cm}{%
539: % \centerline{\includegraphics{$mtry$.$ntree$.paper.real.eps}}}}
540:
541:
542:
\caption{\label{mtry.ntree.paper.real} Out-of-bag (OOB) error rate vs.\
  $mtryFactor$ for the nine microarray data sets. $mtryFactor$ is the
  multiplicative factor of the default $mtry$
  ($\sqrt{number.of.genes}$); thus, an $mtryFactor$ of 3 means the
  number of genes tried at each split is $3 \times \sqrt{number.of.genes}$;
  an $mtryFactor = 0$ means the number of genes tried was 1. The
  values of $mtryFactor$ examined were $\{0, 0.05, 0.1, 0.17, 0.25, 0.33, 0.5,
  0.75, 0.8, 1, 1.15, 1.33, 1.5, 2, 3,$ $4, 5, 6, 8, 10, 13\}$. Results
  are shown for six different values of $ntree$: $\{1000, 2000, 5000,
  10000, 20000, 40000\}$. $nodesize = 1$.}
553: \end{center}
554: \end{figure}
555:
556:
557:
558: \section{Results}
559:
560:
561:
562:
563: \begin{table*}[b!] \begin{center} %\processtable{
564: \caption{\label{error.rates} Error rates (estimated using the
565: 0.632+ bootstrap method with 200 bootstrap samples) for the
566: microarray data sets using different methods (see text for
567: description of alternative methods). The results shown for
568: variable selection with random forest used $ntree = 2000,
569: fraction.dropped = 0.2, mtryFactor = 1$. Note that the OOB
570: error used for variable selection \emph{is not} the error
571: reported in this table; the error rate reported is obtained
572: using bootstrap on the complete variable selection process.
573: The column ``no info'' denotes the minimal error we can make
574: if we use no information from the genes (i.e., we always bet
575: on the most frequent class).}
576:
577:
578:
579: {\footnotesize
580: \begin{tabular}{l|cccccccccc}
581: % Data set& SVM & KNN & DLDA& SC.l & SC.s & NN.vs & random forest & \multicolumn{4}{c}{random forest var.sel.}\\
582: % & & & & & & & & \multicolumn{2}{c}{s.e.\ 0} & \multicolumn{2}{c}{s.e.\ 1}\\
583: % & & & & & & & & m.f.\ 1 & m.f.\ 13 & m.f.\ 1 & m.f.\ 13 \\
584:
585: Data set& no info & SVM & KNN & DLDA& SC.l & SC.s & NN.vs & random forest &
586: \multicolumn{2}{c}{random forest var.sel.}\\
587: && & & & & & & & s.e.\ 0 & s.e.\ 1\\
588:
589: \hline
590: Leukemia & 0.289 &0.014 & 0.029 & 0.020 & 0.025& 0.062 & 0.056& 0.051 & 0.087 & 0.075 \\
591: Breast 2 cl.& 0.429 &0.325 & 0.337 & 0.331 & 0.324& 0.326 & 0.337& 0.342 & 0.337 & 0.332 \\
592: Breast 3 cl.& 0.537 &0.380 & 0.449 & 0.370 & 0.396& 0.401 & 0.424& 0.351 & 0.346 & 0.364 \\
593: NCI 60 & 0.852 &0.256 & 0.317 & 0.286 & 0.256& 0.246 & 0.237& 0.252 & 0.327 & 0.353 \\
594: Adenocar.& 0.158 &0.203 & 0.174 & 0.194 & 0.177 & 0.179 & 0.181& 0.125 & 0.185 & 0.207 \\
595: Brain& 0.761 &0.138 & 0.174 & 0.183 & 0.163 & 0.159 & 0.194& 0.154 & 0.216 & 0.216 \\
596: Colon& 0.355 &0.147 & 0.152 & 0.137 & 0.123 & 0.122 & 0.158& 0.127 & 0.159 & 0.177 \\
597: Lymphoma & 0.323 &0.010 & 0.008 & 0.021 & 0.028 & 0.033 & 0.04 & 0.009 & 0.047 & 0.042 \\
598: Prostate & 0.490 &0.064 & 0.100 & 0.149 & 0.088 & 0.089 & 0.081& 0.077 & 0.061 & 0.064 \\
599: Srbct & 0.635 &0.017 & 0.023 & 0.011 & 0.012 & 0.025 & 0.031& 0.021 & 0.039 & 0.038 \\
600: \hline
601: \end{tabular}
602: }
603: % \caption{\label{error.rates} Error rates (estimated using 0.632+
604: % bootstrap method with 200 bootstrap samples) for each data set using
605: % different methods (see text for description of alternative methods).
606: % The results shown for variable selection with random forest used
607: % $ntree = 2000, fraction.dropped = 0.2$, $mtry$Factor = 1$ (error rates with
608: % $ntree=20000$ and $ntree=5000$ and with $fraction.dropped = 0.5$ and
609: % $mtry$Factor = 13$ are very similar; see supplementary material and
610: % Table \ref{stability}). When using variable selection with random
611: % forest, we display four genes. The first two, correspond to using
612: % the ``s.e.0'' rule, where the model selected is the one with the
613: % smallest OOB error rate, and two to the ``s.e. 1'' rule, where the
614: % model selected is the smallest model whose error rate is within 1
615: % standard error of the minimum error rate of all forests. For each of
616: % these, we show the error corresponding to using an $mtry$ factor
617: % (m.f.) of 13 (i.e., $mtry = 13 * sqrt(number of colums)) and an $mtry$
618: % factor of 1 ($mtry = sqrt(number of genes)). Note that the OOB
619: % error used for variable selection \emph{is not} the error reported
620: % in the table (which is obtained using bootstrap on the complete
621: % variable selection process).}
622: \end{center}
623: \end{table*}
624:
625:
626: \subsection{Choosing $mtry$ and $ntree$}
627:
628: Preliminary data suggested that $mtry$ and $ntree$ could affect the shape of
629: variable importance plots. At the same time, use of OOB error rate as a
630: guidance to select $mtry$ could be affected by $ntree$ and, potentially,
631: $nodesize$. Thus, we first examined whether the OOB error rate is substantially
632: affected by changes in $mtry$, $ntree$, and $nodesize$.
633:
634:
635:
636:
637:
638:
639:
640:
Figure \ref{mtry.ntree.paper.real} and the supplementary material (Figure
``error.vs.mtry.pdf''), however, show that, for both real and simulated data,
643: the relation of OOB error rate with $mtry$ is largely independent of $ntree$
644: (for $ntree$ between 1000 and 40000) and $nodesize$ (nodesizes 1 and 5). In
645: addition, the default setting of $mtry$ ($mtryFactor = 1$ in the figures) is
646: often a good choice in terms of OOB error rate. In some cases, increasing
647: $mtry$ can lead to small decreases in error rate, and decreases in $mtry$ often
648: lead to increases in the error rate. This is specially the case with simulated
649: data with very few relevant genes (with very few relevant genes, small $mtry$
650: results in many trees being built that do not incorporate any of the relevant
651: genes). Since the OOB error and the relation between OOB error and $mtry$ do
652: not change whether we use $nodesize$ of 1 or 5, and because the increase in
653: speed from using $nodesize$ of 5 is inconsequential, all further analyses will
654: use only the default $nodesize = 1$.
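
The effect of a small $mtry$ when there are few relevant genes can be
illustrated with a rough calculation (a sketch under simplifying assumptions,
not part of the analyses reported here). Assume $p$ genes, of which $r$ are
relevant, and recall that random forest draws $mtry$ candidate genes without
replacement at each split; the symbols $p$ and $r$ and the numerical values
below are chosen only for illustration. The probability that none of the
candidates at a given split is a relevant gene is then
\begin{equation*}
  P(\mbox{no relevant candidate}) =
  \frac{\binom{p - r}{mtry}}{\binom{p}{mtry}} =
  \prod_{j=0}^{r-1} \frac{p - mtry - j}{p - j}.
\end{equation*}
With $p = 1000$, $r = 5$, and $mtry = 31$ (i.e., $mtryFactor = 1$, since
$\sqrt{1000} \approx 31.6$), this probability is approximately $0.85$; with
$mtry = 411$ ($mtryFactor = 13$) it drops to approximately $0.07$.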
655:
656:
657:
658:
659:
660:
661:
662:
663:
664:
665:
666:
\subsection{Backwards elimination of variables (genes) using OOB
  error}
On the simulated data sets (see supplementary material,
Tables 3 and 4), %\ref{simplify.signal.02}, \ref{simplify.signal.05}),
backwards elimination often leads to very small sets of genes, often
much smaller than the set of ``true genes''. The error rate of the
variable selection procedure, estimated using the .632+ bootstrap
method, indicates that the variable selection procedure does not lead
to overfitting, and can achieve the objective of aggressively
reducing the set of selected genes. In contrast, when the
simplification procedure is applied to simulated data sets without
signal (see Tables 1 and 2
%\ref{simplify.no.signal.02} \ref{simplify.no.signal.05}
in the supplementary material), the number of
genes selected is consistently much larger and, as should be the
case, the estimated error rate using the bootstrap corresponds to
that achieved by always betting on the most probable class.
683:
684:
685:
686: Results for the real data sets are shown in Tables \ref{error.rates} and
687: \ref{stability} (see also supplementary material, Tables 5, 6, 7,
688: %%\ref{stability-20000}, stability-5000, stability-02
689: for additional results using different combinations of $ntree =
690: \{2000,5000,20000\}$, $mtryFactor = \{1, 13\}, se=\{0, 1\},
691: fraction.dropped=\{0.2, 0.5\}$). Error rates (see Table
692: \ref{error.rates}) when performing variable selection are in most cases comparable
693: (within sampling error) to those from random forest without variable
694: selection, and comparable also to the error rates from competing
695: state-of-the-art prediction methods. The number of genes selected
696: varies by data set, but generally (Table \ref{stability}) the
697: variable selection procedure leads to small ($< 50$) sets of predictor
698: genes, often much smaller than those from competing approaches
699: (see also Table 8 in supplementary material). There are no relevant
700: differences in error rate related to differences in $mtry$, $ntree$ or
701: whether we use the ``s.e.\ 1'' or ``s.e.\ 0'' rules. The use of the
702: ``s.e.\ 1'' rule, however, tends to result in smaller sets of selected
703: genes.
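
The two selection rules can be stated compactly (a sketch; the notation is
introduced here only for illustration). Let $\widehat{e}_k$ denote the OOB
error rate of the forest fitted at step $k$ of the backwards elimination,
$g_k$ the number of genes it uses, and
$\widehat{e}_{\min} = \min_k \widehat{e}_k$. For $c \in \{0, 1\}$, the
``s.e.\ $c$'' rule selects
\begin{equation*}
  k^{*} = \arg\min_{k} \left\{ g_k :
    \widehat{e}_k \le \widehat{e}_{\min} + c \cdot \widehat{se} \right\},
\end{equation*}
where $\widehat{se}$ is the standard error of $\widehat{e}_{\min}$. If, as an
assumption made here, the standard error is taken to be that of a binomial
proportion estimated from $N$ samples, then
$\widehat{se} = \sqrt{\widehat{e}_{\min} (1 - \widehat{e}_{\min}) / N}$.
For a given sequence of forests, the forests admissible under $c = 1$ include
those admissible under $c = 0$, so the ``s.e.\ 1'' rule cannot select a larger
set of genes; across separate runs the randomness of the forests can alter
this ordering.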
704:
705:
706:
707:
803:
804:
805:
806:
807:
808:
809:
810: \subsection{Stability (uniqueness) of results}
The results here will focus on the real microarray data sets (results
from the simulated data are presented in the supplementary material).
813: Table \ref{stability} (see also supplementary material, Tables 5, 6, 7,
814: % \ref{stability-20000}
815: for other combinations of $ntree, mtryFactor, fraction.dropped, se$)
816: shows the variation in the number of genes selected in bootstrap
817: samples, and the frequency with which the genes selected in the
818: original sample appear among the genes selected from the bootstrap
819: samples. In most cases, there is a wide range in the number of genes
820: selected; more importantly, the genes selected in the original samples
821: are rarely selected in more than 50\% of the bootstrap samples. These
822: results are not strongly affected by variations in $ntree$ or $mtry$;
823: using the ``s.e.\ 1'' rule can lead, in some cases, to increased
824: stability of the results.
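
The ``Freq.\ genes'' summary can be written explicitly (the notation is
introduced here only for illustration). Let $S_0$ be the set of genes selected
on the original data set and $S_b$ the set selected on bootstrap sample $b$,
$b = 1, \ldots, B$, with $B = 200$. For each gene $g \in S_0$, the selection
frequency is
\begin{equation*}
  f_g = \frac{1}{B} \sum_{b=1}^{B} I(g \in S_b),
\end{equation*}
where $I(\cdot)$ is the indicator function; Table \ref{stability} reports the
median and quartiles of $f_g$ over $g \in S_0$. As an example, a gene that is
selected in 76 of the 200 bootstrap samples has $f_g = 0.38$.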
825:
826:
827: As a comparison, we also show in Table \ref{stability} the stability
828: of two alternative approaches for gene selection, the shrunken
829: centroids method, and a filter approach combined with a Nearest
830: Neighbor classifier (see Table 8 in the supplementary material for
831: results of SC.l). Error rates are comparable, but both alternative
832: methods lead to much larger sets of selected genes than backwards
833: variable selection with random forests. The alternative approaches
seem to lead to somewhat more stable results in variable selection (probably a
consequence of the large number of genes selected), but
in practical applications this increase in stability is probably far
outweighed by the very large number of selected genes.
838:
839:
840:
841:
842:
843:
844:
845: \begin{table}[p]
846: \begin{center}
847: \caption{\label{stability} Stability of variable (gene) selection evaluated
848: using 200 bootstrap samples. ``\# Genes'': number of genes
 selected on the original data set. ``\# Genes boot.'': median
 (1st quartile, 3rd quartile) of the number of genes selected
 on the bootstrap samples. ``Freq. genes'': median (1st quartile,
852: 3rd quartile) of the frequency with which each gene in the
853: original data set appears in the genes selected from the
854: bootstrap samples. Parameters for backwards elimination with
855: random forest: $mtryFactor = 1, s.e.\ = 0, ntree = 2000,
856: ntreeIterat = 1000, fraction.dropped = 0.2$.}
857: \end{center}
858: \begin{center}
859: \vspace{-32pt} %%% use for bioinformatics.
860: {\footnotesize
861: \begin{tabular}{l|rrrr}
862: Data set& Error & \# Genes & \# Genes boot. & Freq. genes\\
863: \hline
864: \hline
865: \multicolumn{5}{c}{\textbf{Backwards elimination of genes from random forest}}\\ %%% OK
866: \hline
867: %\multicolumn{5}{c}{$mtryFactor = 1, s.e.\ = 0, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2$}\\ %%% OK
868: \multicolumn{5}{c}{$s.e.\ = 0$}\\ %%% OK
869: \hline
870: % %$mtry$1, se1, $ntree = 2000
871: Leukemia & 0.087 & 2 & 2 (2, 2) & 0.38 (0.29, 0.48)\footnotemark[1]\\
872: Breast 2 cl.& 0.337 & 14 & 9 (5, 23)& 0.15 (0.1, 0.28)\\
873: Breast 3 cl.& 0.346 & 110 & 14 (9, 31)& 0.08 (0.04, 0.13)\\
874: NCI 60 & 0.327 & 230 & 60 (30, 94)& 0.1 (0.06, 0.19)\\
875: Adenocar. & 0.185 & 6 & 3 (2, 8)& 0.14 (0.12, 0.15)\\
876: Brain & 0.216 & 22 & 14 (7, 22)& 0.18 (0.09, 0.25)\\
877: Colon & 0.159 & 14 & 5 (3, 12)& 0.29 (0.19, 0.42)\\
878: Lymphoma & 0.047 & 73 & 14 (4, 58)& 0.26 (0.18, 0.38)\\
879: Prostate & 0.061 & 18 & 5 (3, 14)& 0.22 (0.17, 0.43)\\
880: Srbct & 0.039 & 101 & 18 (11, 27)& 0.1 (0.04, 0.29)\\
881: \hline
882: \hline
883: %\multicolumn{4}{c}{$mtryFactor = 1, s.e.\ = 1, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2$}\\ %%% OK
884: \multicolumn{5}{c}{$s.e.\ = 1$}\\ %%% OK
885: \hline
886: % %$mtry$1, se1, $ntree = 2000, $ntree$Iterat = 1000
887: Leukemia & 0.075 & 2 & 2 (2, 2)& 0.4 (0.32, 0.5)\footnotemark[1]\\
888: Breast 2 cl.& 0.332 & 14 & 4 (2, 7)& 0.12 (0.07, 0.17)\\
889: Breast 3 cl.& 0.364 & 6 & 7 (4, 14)& 0.27 (0.22, 0.31)\\
890: NCI 60 & 0.353 & 24 & 30 (19, 60)& 0.26 (0.17, 0.38)\\
891: Adenocar. & 0.207 & 8 & 3 (2, 5)& 0.06 (0.03, 0.12)\\
892: Brain & 0.216 & 9 & 14 (7, 22)& 0.26 (0.14, 0.46)\\
893: Colon & 0.177 & 3 & 3 (2, 6)& 0.36 (0.32, 0.36)\\
894: Lymphoma & 0.042 & 58 & 12 (5, 73)& 0.32 (0.24, 0.42)\\
895: Prostate & 0.064 & 2 & 3 (2, 5)& 0.9 (0.82, 0.99)\footnotemark[1]\\
896: Srbct & 0.038 & 22 & 18 (11, 34)& 0.57 (0.4, 0.88)\\
897: \hline
898: \hline
899: \multicolumn{5}{c}{\textbf{Alternative approaches}}\\ %%% OK
900: \hline
901: % \multicolumn{5}{c}{Shrunken centroids; minimizing error rate then}\\
902: % \multicolumn{5}{c}{minimizing number of genes selected}\\ %%% OK
903: \multicolumn{5}{c}{SC.s}\\
904: \hline
905: Leukemia & 0.062 & 82\footnotemark[2] & 46 (14, 504)& 0.48 (0.45, 0.59)\\
906: Breast 2 cl.& 0.326 & 31 & 55 (24, 296)& 0.54 (0.51, 0.66)\\
907: Breast 3 cl.& 0.401 & 2166 & 4341 (2379, 4804)& 0.84 (0.78, 0.88)\\
908: NCI 60 & 0.246 & 5118 & 4919 (3711, 5243)& 0.84 (0.74, 0.92)\\
909: Adenocar. & 0.179 & 0 & 9 (0, 18)& NA (NA, NA)\\
910: Brain & 0.159 & 4177 & 1257 (295, 3483)& 0.38 (0.3, 0.5)\\
911: Colon & 0.122 & 15 & 22 (15, 34)& 0.8 (0.66, 0.87)\\
912: Lymphoma & 0.033 & 2796 & 2718 (2030, 3269)& 0.82 (0.68, 0.86)\\
913: Prostate & 0.089 & 4 & 3 (2, 4)& 0.72 (0.49, 0.92)\\
914: Srbct & 0.025 & 37\footnotemark[3] & 18 (12, 40)& 0.45 (0.34, 0.61)\\
915: \hline
916: \hline
917: %\multicolumn{5}{c}{Nearest Neighbor with variable selection}\\ %%% OK
918: \multicolumn{5}{c}{NN.vs}\\ %%% OK
919: \hline
920: Leukemia & 0.056 & 512 & 23 (4, 134)& 0.17 (0.14, 0.24)\\
921: Breast 2 cl.& 0.337 & 88 & 23 (4, 110)& 0.24 (0.2, 0.31)\\
922: Breast 3 cl.& 0.424 & 9 & 45 (6, 214)& 0.66 (0.61, 0.72)\\
923: NCI 60 & 0.237 & 1718 & 880 (360, 1718)& 0.44 (0.34, 0.57)\\
924: Adenocar. & 0.181 & 9868 & 73 (8, 1324)& 0.13 (0.1, 0.18)\\
925: Brain & 0.194 & 1834 & 158 (52, 601)& 0.16 (0.12, 0.25)\\
926: Colon & 0.158 & 8 & 9 (4, 45)& 0.57 (0.45, 0.72)\\
927: Lymphoma & 0.04 & 15 & 15 (5, 39)& 0.5 (0.4, 0.6)\\
928: Prostate & 0.081 & 7 & 6 (3, 18)& 0.46 (0.39, 0.78)\\
929: Srbct & 0.031 & 11 & 17 (11, 33)& 0.7 (0.66, 0.85)\\
930: \hline
931:
932: \end{tabular}
933: }
934: \end{center}
935: \renewcommand{\baselinestretch}{0.2}\footnotesize\normalsize\footnotesize
936: {%\setlength{\baselineskip}{1pt} \renewcommand{\baselinestretch}{0.1}\footnotesize\normalsize\footnotesize
937: $^*$Only two genes are selected from the complete data set; the values are the actual
938: frequencies of those two genes.\\
939: $^{\dagger}$\citet{shrunkenc} select 21 genes after visually inspecting
940: the plot of
941: cross-validation error rate vs. amount of shrinkage and number of
942: genes. Their procedure is hard to automate and thus it is very difficult to obtain estimates of the error
943: rate of their procedure.\\
$^{\ddagger}$\citet{shrunkenc} select 43 genes. The difference is likely due
to differences in the random partitions for cross-validation. Repeating the
gene selection process 100 times with the full data set, the median, 1st quartile, and 3rd
quartile of the number of selected genes are 13, 8, and 147.\\
948:
949:
950: }
951: \end{table}
952:
953:
954:
955:
956:
957: \section{Discussion}
958:
959: We have examined the performance of an approach for gene selection using random
960: forest, and compared it to alternative approaches. Our results, using both
961: simulated and real microarray data sets, show that this method of gene
962: selection accomplishes the proposed objectives. Our method returns very small
963: sets of genes compared to two alternative variable selection methods, while
964: retaining predictive performance comparable to that of seven alternative
state-of-the-art methods. Recently, \citet{BMA-selection} have proposed a
Bayesian model averaging (BMA) approach for gene selection. Comparing the
results for the two data sets common to our study and theirs, in one case
(Leukemia) our procedure returns a much smaller set of genes (2 vs.\ 15),
whereas in the other (Breast, 2 class) their BMA procedure returns 8 fewer genes
(14 vs.\ 6). Our procedure does not require setting a limit on the maximum
number of relevant genes to be selected, nor does it require prespecifying a
number of top-ranked genes as relevant (the latter is not required by the BMA
procedure either).
974:
975: Our method of gene selection will not return sets of genes
976: that are highly correlated, because they are redundant. This method will be
977: most useful under two scenarios: a) when considering the design of diagnostic
978: tools, where having a small set of probes is often desirable; b) to help
979: understand the results from other gene selection approaches that return many
980: genes, so as to understand which ones of those genes have the largest signal to
981: noise ratio and could be used as surrogates for complex processes involving
many correlated genes. A backwards elimination method, precursor to the one
used here, has already been used to predict breast tumor type based on
chromosomal alterations \citep{SaraRF}.
985:
986:
We have also thoroughly examined the effects of changes in the
parameters of random forest (specifically $mtry$, $ntree$, $nodesize$)
and the variable selection algorithm ($se$, $fraction.dropped$).
Changes in these parameters have in most cases negligible effects,
suggesting that the default values are often good options, but we can
make some general recommendations.
Execution time increases approximately linearly with $ntree$.
Larger $ntree$ values lead to slightly more stable values of variable
importances, but for the data sets examined, $ntree = 2000$ or $ntree = 5000$
seem quite adequate, with further increases having negligible effects. The
change in $nodesize$ from 1 to 5 has negligible effects, and thus its default
setting of 1 is appropriate. For the backwards elimination algorithm, the
parameter $fraction.dropped$ can be adjusted to modify the resolution of the
number of variables selected; smaller values of $fraction.dropped$ lead to finer
resolution in the examination of the number of genes, but to slower execution of
the code. Finally, the parameter $se$ also has minor effects on the results of
the backwards variable selection algorithm, but a value of $se = 1$ leads to
slightly more stable results.
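
The trade-off governed by $fraction.dropped$ can be made concrete with a rough
calculation (a sketch that ignores rounding and assumes, for illustration
only, that elimination starts from $p$ genes and stops when two genes remain).
If a fraction $f = fraction.dropped$ of the remaining genes is removed at each
iteration, about $p (1 - f)^k$ genes remain after $k$ iterations, so the number
of forests that must be fitted is approximately
\begin{equation*}
  k \approx \frac{\ln (p / 2)}{-\ln (1 - f)}.
\end{equation*}
For a hypothetical data set with $p = 4000$ genes, this gives $k \approx 34$
for $f = 0.2$ and $k \approx 11$ for $f = 0.5$: the smaller value examines
roughly three times as many candidate gene-set sizes, at roughly three times
the number of forests fitted.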
1005:
1006:
1007:
The final issue addressed in this paper is instability or multiplicity of the
selected sets of genes. From this point of view, the results are slightly
disappointing. But so are the results of the competing methods. And so are the
results of most methods examined so far with microarray data, as shown in
\citet{EinDor} and \citet{Michielis} and discussed thoroughly by
\citet{Somorjai2003} for classification and by \citet{pan-pnas} for the related
problem of the effect of threshold choice in gene selection. However, and
except for the above cited papers and the review in \citet{Yo-azuaje}, this is
an issue that still seems largely ignored in the microarray literature. As
these papers and the statistical literature on variable selection
\citep[e.g.,][]{breiman-2-cultures, harrell-01} discuss, the causes of the
problem are small sample sizes and the extremely small ratio of samples to
variables (i.e., number of arrays to number of genes). Thus, we might need to
learn to live with the problem, and try to assess the stability and robustness
of our results by using a variety of gene selection features, and examining
whether there is a subset of features that tends to be repeatedly selected.
This concern is explicitly taken into account in our results, and facilities
for examining this problem are part of our R code.
1026:
1027:
1028: The multiplicity problem, however, does not need to result in large
1029: prediction errors. This and other papers \citep{dudoit-dlda, pelora,
1030: simon.book, romualdi-03, bag-boost, Somorjai2003} show that very different
1031: classifiers often lead to comparable and successful error rates with
1032: a variety of microarray data sets. Thus, although improving prediction
rates is important \citep[especially if giving consideration to ROC
curves, and not just overall prediction error rates;][]{pepe-book},
1035: when trying to address questions of biological mechanism or discover
1036: therapeutic targets, probably a more challenging and relevant issue is
1037: to identify sets of genes with biological relevance.
1038:
1039:
1040: Two areas of future research are using random forest for the selection of
1041: potentially large sets of genes that include correlated genes, and improving
1042: the computational efficiency of these approaches; in the present work, we have
1043: used parallelization of the ``embarrassingly parallelizable'' tasks using MPI
with the Rmpi and snow packages \citep{Rmpi, snow} for R. In a broader context,
1045: further work is warranted on the stability properties and biological relevance
1046: of this and other gene-selection approaches, because the multiplicity problem
1047: casts doubts on the biological interpretability of most results based on a
1048: single run of one gene-selection approach.
1049:
1050:
1051:
1052:
1053:
1054: %%% Both allow var sel; the type of var sel is wrapper approach, which
1055: %%% should be superior to ``filter'' approaches. variable importance plots not affected
1056: %%% by multicol. However, not many unique (stable) results. Select only
1057: %%% the most important from variable importance plots, or use a large set of candidates.
1058: %%% With backwards, can help to examine if the different, non-overlapping,
1059: %%% sets of vars are in similar routes, etc.
1060:
%%% Also examine plots of ``flatness'' of OOB and number of genes.
1062: %%% Indication of how important things are (and plots of flatness in
1063: %%% bootstrap samples?). Problem is: no longer emphasis on which are the
1064: %%% selected genes. Here the approach of Svetnik et al more relevant?
1065:
1066:
1067:
1068: \section{Conclusion}
1069: The proposed method can be used for variable selection fulfilling the
1070: objectives above: we can obtain very small sets of non-redundant genes while
1071: preserving predictive accuracy. These results clearly indicate that the
1072: proposed method can be profitably used with microarray data. Given its
1073: performance, random forest and variable selection using random forest should
1074: probably become part of the ``standard tool-box'' of methods for the analysis
1075: of microarray data.
1076:
1077:
1078:
1079:
1080: \section{Acknowledgements}
1081:
1082: % This work arised out of work I did in collaboration with S.\ Álvarez
1083: % de Andrés; I thank her for the opportunity to collaborate in that
1084: % work, and for her patience and enthusiasm.
1085:
1086: Most of the simulations and analyses were carried out in the Beowulf
1087: cluster of the Bioinformatics unit at CNIO, financed by the RTICCC
1088: from the FIS; J.~M.\ Vaquerizas provided help with the administration
1089: of the cluster. A.\ Liaw provided discussion, unpublished manuscripts,
1090: and code. C.\ Lázaro-Perea provided many discussions and comments on
1091: the ms. A.\ Sánchez provided comments on the ms. I.\ Díaz showed
1092: R.D.-U. the forest, or the trees, or both. R.D.-U. partially
1093: supported by the Ramón y Cajal program of the Spanish MEC (Ministry
1094: of Education and Science); S.A.A. supported by project C.A.M.
1095: GR/SAL/0219/2004; funding provided by project TIC2003-09331-C02-02 of
1096: the Spanish MEC.
1097:
1098:
1099: \bibliography{signatures2}
1101:
1102:
1103:
1104: %\end{multicols}
1105: \newpage
1106:
1107:
1108:
1109:
1110:
1111:
1112:
1113:
1114:
1115:
1116:
1117:
1118: \end{document}
1119:
1120:
1121: %All with R, library randomForest. Code available.
1122:
1123:
1124:
1125:
1126: %Although occassionally plots of variable importance show a clear
1127: %pattern where only a few variables stand out, most often
1128:
1129:
1130: **************************
1131:
1132:
1133: We will also mention computational requirements.
1134:
1135: It should be possible to
1136: use these measures of variable importance to single out
1137: genes of particular relevance for a given condition.
1138:
1139:
1140:
1141:
1142: Future work:
1143: ------------
1144: - Changes in $mtry$, since small mtries should lead to faster
1145: runs and further decreases in correlations of trees.
1146:
1147: - After variable reduction: use all variables in models
1148: building (i.e., $mtry = number of variables)?
1149:
1150: