
\begin{abstract}

Although bagging and random forests are among the most widely used prediction methods, relatively little is known about their algorithmic convergence.
In particular, there are few theoretical guarantees for deciding when an ensemble is ``large enough'',
in the sense that its accuracy is close to that of an ideal infinite ensemble.

%
Because bagging and random forests are randomized algorithms, the choice of ensemble size is closely related to the notion of ``algorithmic variance'' (i.e.~the variance of prediction error due only to the training algorithm). In the present work, we propose a bootstrap method to estimate this variance for bagging, random forests, and related methods in the context of classification.

%
%
To be specific, suppose the training dataset is fixed, and let the random variable $\Err_t$ denote the prediction error of a randomized ensemble of size $t$.
Working under a ``first-order model'' for randomized ensembles, we prove that the centered law of $\Err_t$ can be consistently approximated via the proposed method as $t\to\infty$. Moreover, an extrapolation technique keeps the computational cost of the method modest. As a consequence, the method offers a practical guideline for deciding when the algorithmic fluctuations of $\Err_t$ are negligible.

\end{abstract}
