
1: \chapter{Literature}\label{c:lit}

2:

3: This chapter reviews and discusses the literature on three

4: subjects. The first section reviews some theoretical results concerning the

5: frequentist performance of Bayesian procedures. The second section

6: gives a survey of some of the

7: work done by authors on related Bayesian efforts. The final section briefly

8: surveys some salient examples of alternative approaches to the

9: classification problem.

10:

11: \section{Theoretical Results}

12: The frequentist performance of Bayesian methods is of fundamental

13: interest in statistics. Given a large sample from a smooth,

14: finite-dimensional statistical model, the situation is quite well

15: understood. The Bernstein-von~Mises

16: theorem~\cite{lecam:yang,freedman:1999} shows that the Bayes estimate and the maximum likelihood

17: estimate will be close. Furthermore, the posterior distribution of the

18: parameter vector around the posterior mean is close to the

19: distribution of the maximum likelihood estimate around the truth: both

20: are asymptotically normal with mean 0 and the same covariance

21: matrix. Unfortunately, though, in more general circumstances, such as

22: those needed for this work, the situation can be much more complex. In

23: particular, the basic model is based on an infinite hierarchy of finite

24: dimensional models. Moreover, even for a given finite dimensional

25: submodel, the dependency of the likelihood function on the parameter

26: is not smooth; the functions are allowed to take jumps. Consequently, a more

27: general theory is needed.

28:

29: This section reviews some of the literature on this subject with

30: a focus on results that address the question of consistency: i.e. as

31: the number of data points tends to infinity, will the Bayesian

32: estimate converge to the true value (in some suitable sense) almost

33: surely (resp. in probability)?  The literature contains a number of

34: useful and quite flexible positive results, but also a variety of

35: interesting negative examples showing that the regularity conditions

36: under which the theorems hold are not to be taken lightly. A good

37: introduction to these issues is by Diaconis and

38: Freedman~\cite{diac:free:1986}. Throughout this section, the reader

39: may envision a family $\P_\theta(dx)$, a prior $\pi(d\theta)$, and

40: posterior $\pi(\theta | x_1, \dots, x_n)$, where the $x_i$ are drawn

41: \iid\ from $\P_{\theta_0}(dx)$. Consistency means that the posterior

42: concentrates at $\theta_0$ for large samples.
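
One common way to make this precise is sketched below; the neighborhood $U$ and the underlying topology are assumptions introduced here, and they vary from paper to paper. The posterior is called consistent at $\theta_0$ if, for every neighborhood $U$ of $\theta_0$,
\[
\pi(U \mid x_1, \dots, x_n) \to 1 \quad \text{as } n \to \infty,
\]
almost surely (resp.\ in probability) when the $x_i$ are drawn \iid\ from $\P_{\theta_0}(dx)$.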

43:

44: Doob~\cite{doob} established a fundamental result under minimal

45: regularity assumptions using a martingale convergence

46: argument. Roughly speaking, the result states that if

47: consistent estimators exist at all, then a Bayes procedure will

48: provide an almost surely consistent estimate of the true parameter

49: $\theta$ under sampling from the $\theta$ distribution for any

50: $\theta$ in some set $B$ which has prior probability of $1$. Notice,

51: though, that this does not specify if consistency will obtain at any

52: {\em particular} point of interest $\theta_0$, unless $\theta_0$

53: happens to be a point-mass of the prior, or unless it is possible to determine

54: $B$ by some more detailed line of argumentation.

55:

56: Freedman~\cite{freedman:1963} considered the case in which the

57: observations are discrete. If the set of possible observations is

58: finite, the posterior is consistent exactly for parameter values in

59: the topological support of the prior. The countably infinite case is

60: more complex. He constructs a class of examples showing that it is

61: possible to construct a prior which assigns positive mass to every

62: (weak star) neighborhood of the true parameter value, but for which

63: the posterior converges to a point mass at some other (chosen)

64: parameter value. Furthermore, he finds a prior which assigns positive

65: prior mass to every (weak star) open set of parameters, but for which

66: the posterior is consistent only at a set of parameters of the first

67: category. The reader should note that this prior did not assign mass

68: to all entropy-neighborhoods. This sort of subtle distinction can make

69: all of the difference and explains the necessity of some such

70: assumption in the following consistency theorems. He introduces the

71: ``tail-free'' priors for the countably-infinite case and

72: demonstrates that these are always consistent.

73:

74: Lorraine Schwartz~\cite{schw:1965} explored the question of

75: consistency in a very general setting. She extended Doob's result to a

76: broad class of loss functions~\cite[lemma 4.2]{schw:1965}.  She also

77: found sufficient conditions for the posterior to be consistent under

78: $\iid$ sampling. These conditions, she says, are ``of an essentially

79: weaker nature'' than the conditions established for the consistency of

80: maximum likelihood estimators. Nevertheless, she constructs an example

81: where the maximum likelihood estimate is consistent and the estimates

82: based on certain priors are not. The example (\cite[example 3]{schw:1965}) involves a simple parametric family of densities

83: which satisfies Wald's conditions, thereby guaranteeing that the

84: maximum likelihood estimate will be consistent, but for which the

85: posterior can be inconsistent. The consistency of the posterior in

86: this case is found to depend critically on the amount of mass that

87: the prior ascribes to small neighborhoods of the true parameter value;

88: if this mass shrinks too quickly, the prior ``ignores'' the data. One

89: clever aspect of her construction is the way the densities are

90: parametrized. Parameter values close to the target value $\theta_0$

91: correspond to densities that are close to the $\theta_0$-density in an

92: $\cL^1$ sense, but which are farther and farther away in

93: Kullback-Leibler discrepancy. In fact, there is only one point in

94: parameter space (the true parameter) that has Kullback-Leibler

95: discrepancy from the truth smaller than $\epsilon$, for $\epsilon$

96: sufficiently small.

97:

98: Schwartz then shows that the posterior will be consistent under $\iid$

99: sampling under two basic conditions. First, the prior should have

100: positive mass on Kullback-Leibler neighborhoods of the true parameter

101: (defined in \autoref{s:notation} of this thesis), and second, the

102: model class should not be too rich; specifically, she requires that

103: uniformly consistent tests of the hypothesis that $\theta=\theta_0$

104: against the alternative that $\theta$ lies outside a given (open)

105: neighborhood of $\theta_0$ exist.
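
In symbols, the two conditions can be sketched as follows. The notation is assumed here only for illustration (the thesis's own definitions are in \autoref{s:notation}): $p_\theta$ denotes the density of $\P_\theta$ with respect to a common dominating measure $\mu$, and
\[
K(\theta_0,\theta) = \int p_{\theta_0} \log \frac{p_{\theta_0}}{p_\theta} \, d\mu
\]
is the Kullback-Leibler discrepancy. The first condition then reads
\[
\pi\bigl(\{\theta : K(\theta_0,\theta) < \epsilon\}\bigr) > 0 \quad \text{for every } \epsilon > 0 .
\]
The second asks, for each neighborhood $U$ of $\theta_0$, for tests $\phi_n = \phi_n(x_1, \dots, x_n)$ with values in $[0,1]$ such that
\[
\int \phi_n \, d\P_{\theta_0}^n \to 0
\quad \text{and} \quad
\inf_{\theta \notin U} \int \phi_n \, d\P_{\theta}^n \to 1 ,
\]
where $\P_\theta^n$ denotes the $n$-fold product measure.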

106:

107: It is not always obvious how to verify the latter property

108: directly. Modern authors have employed entropy-type bounds to

109: guarantee their existence. Ghosal, Ghosh, and van der

110: Vaart~\cite{ggv} state a theorem (\cite[theorem 7.3]{ggv}) which

111: proves that the posterior converges at a certain rate if certain

112: uniform tests exist (and the prior mass is suitably distributed) and

113: go on to find a variety of entropy-type conditions that suffice to be

114: able to construct the necessary tests. Shen and

115: Wasserman~\cite{shen:wasserman} show related results, requiring

116: slightly different conditions on how mass needs to be allocated; they

117: do not make a connection with testing. Barron, Schervish, and

118: Wasserman~\cite{barron:schervish:wasserman:1999} find sufficient

119: conditions for the posterior to be consistent; their results are

120: reviewed and then used in \autoref{c:proof}.

121:

122: It should be noted that these various conditions for consistency are

123: not necessary, but merely sufficient. Nevertheless, it is important to

124: treat this subject with care because of the variety of examples for

125: which consistency fails.

126:

127: Barron, Schervish, and Wasserman also give an interesting example

128: where consistency fails. In this example, they show that the prior

129: puts too much mass on a very rich class of models that will be able to

130: match any spurious structure that the data might have by chance,

131: overwhelming the true parameter. Furthermore, lest the reader get the

132: wrong idea, inconsistency does not only occur in artificial

133: examples. A series of ``natural'' yet still inconsistent estimators

134: for the symmetric location problem are discussed by Diaconis and

135: Freedman~\cite{diac:free:1986}. In addition, the binary regression

136: example explained in the next section has a natural motivation based on

137: conditional exchangeability.

138:

139: \section{Related Bayesian Work}\label{s:relatedpriors}

140:

141: The following subsections contain a review of work by other authors

142: that is closely related to this thesis. It is followed by a brief synopsis

143: of the contributions that this thesis makes to the literature.

144:

145: \subsection{A Dyadic Prior for Binary Regression}\label{s:DFprior}

146: The most relevant examples for the work of this thesis are the

147: nonparametric binary regression examples of Diaconis and

148: Freedman~\cite{diac:free:1993, diac:free:1995}. They use a different

149: prior; call it $\piDF$, a hierarchical, dyadic prior on $f$. To

150: describe $\piDF$, let $A_k$ be the set of intervals which result from

151: partitioning the unit interval into $2^k$ equal pieces. Let $\F_k$ be

152: the subset of functions which are constant on all intervals $a \in

153: A_k$. Finally, fix a prior distribution $\kappa$ on the non-negative

154: integers. Assume, for simplicity, that $\kappa(k) >0$ for all $k$. To

155: draw $f$ from $\piDF$, draw $K$ from $\kappa$ and then, conditional on

156: hierarchy-level $K=k$, draw $f$ uniformly at random from $\F_k$. In

157: effect then, at level $k$ one draws $2^k$ independent $\U(0,1)$ random

158: variables to describe the success probability on each of the $2^k$

159: pieces.
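
As a compact summary of this description (the symbols $\pi_k$ and $\theta_a$ are introduced here only for illustration), the prior is the mixture
\[
\piDF = \sum_{k=0}^{\infty} \kappa(k) \, \pi_k ,
\]
where $\pi_k$ is the law of the step function
\[
f(u) = \sum_{a \in A_k} \theta_a \, \II(u \in a),
\]
with the $2^k$ heights $\theta_a$ drawn independently from $\U(0,1)$.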

160:

161: They show that for any $\kappa$ and any $f_0$ (except possibly for

162: $f_0 \equiv \frac{1}{2}$), the posterior estimates are consistent (in

163: the sense that any $\mathcal{L}^1$ neighborhood of $f_0$ has posterior

164: probability tending to $1$ a.s.). Remarkably, however, for $f_0 \equiv

165: \frac{1}{2}$, the posterior can be an inconsistent estimate if the

166: tail of $\kappa$ is sufficiently heavy.  Specifically, let

167: $\lambda_k=-\log (\kappa(K \geq k) ) /k $. Then if $\limsup \lambda_k

168: < \lambdac=2^{-\frac{1}{4}} \approx 0.841$, the posterior is

169: inconsistent at $f_0 \equiv \frac{1}{2}$. On the other hand, if

170: $\limsup \lambda_k >\lambdac$, the posterior is consistent for any

171: $f_0$. To put this in perspective, for $\kappa(k)=(1-\beta)\beta^k$ (a

172: shifted $\Geometric(1-\beta)$ prior), $\limsup

173: \lambda_k=-\log(\beta)$. The critical value for $\beta$ is

174: $\exp(-\lambdac) \approx 0.431$; for larger $\beta$ (longer tails)

175: inconsistency will occur (but only for $f_0 \equiv \half$).
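
The geometric case can be checked directly. With $\kappa(k)=(1-\beta)\beta^k$ for $k = 0, 1, \dots$ and $0 < \beta < 1$,
\[
\kappa(K \geq k) = \sum_{j \geq k} (1-\beta)\beta^j = \beta^k,
\qquad
\lambda_k = -\frac{\log \beta^k}{k} = -\log \beta
\]
for every $k \geq 1$. Thus $\lambda_k$ does not depend on $k$, and it equals $\lambdac$ exactly when $\beta = \exp(-2^{-1/4}) \approx 0.431$. A larger $\beta$ (a heavier tail) corresponds to a smaller value of $\lambda_k$.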

176:

177: This result is substantially stronger than the result I have

178: obtained for my prior $\pi$. In particular, applying the same

179: (general) method of proof that I employed to prove consistency for

180: $\pi$ to $\piDF$ yields only the result that $\piDF$ is consistent if

181: the tails of $\kappa$ drop off at least as fast as those of a

182: Poisson. (Recall that at level $k$, $\pi$ only divides $[0,1]$ into

183: $k$ intervals, but $\piDF$ divides it into $2^k$.) Their method of

184: proof is direct: using Bernstein's inequality, Poissonization, and

185: special features of the prior. My method of proof is indirect; it uses

186: general results that employ entropy-type bounds.

187:

188: There are striking similarities between $\pi$ and $\piDF$. In fact,

189: $\pi$ is equivalent to a suitably randomized $\piDF$. To achieve this,

190: it is not enough to simply randomize the dyadic split points. Instead, recall

191: that $\piDF$ has an alternative interpretation in terms of binary

192: sequences. At hierarchy-level $k$, $\piDF$ is uniform over $\F_k$. This

193: corresponds to independently assigning uniform success probabilities to

194: each binary sequence of length $k$. Here is an alternative way

195: to draw $f$ from $\pi$. Draw $g$ from $\piDF$ and interpret $g$ as

196: function on binary sequences of length $k$ ($k$ depends on $g$). Let

197: $V_i$ ($i=1, \dots, k$) be $\iid$ $U(0,1)$ random variables. To any

198: point $u \in [0,1]$ associate the binary random variables

199: $\eta_i(u)=\II(u \leq V_i)$ ($i=1, \dots, k$). Define $f$ via

200: $f(u)=g((\eta_1(u), \dots, \eta_k(u) ))$. Note that only a small

201: fraction of possible binary sequences are realized in this manner (at

202: level $k$ (which ranges from $0$ to $\infty$ under $\piDF$), $k+1$

203: sequences out of the full set of $2^k$ possible sequences are achieved).
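
A small worked case may help; the numerical values are hypothetical and chosen only for illustration. Take $k=2$ with $V_1 = 0.3$ and $V_2 = 0.7$. Then
\[
(\eta_1(u), \eta_2(u)) =
\begin{cases}
(1,1), & 0 \leq u \leq 0.3, \\
(0,1), & 0.3 < u \leq 0.7, \\
(0,0), & 0.7 < u \leq 1,
\end{cases}
\]
so $f$ is a step function with $k+1 = 3$ pieces, taking the values $g((1,1))$, $g((0,1))$, and $g((0,0))$. The remaining sequence $(1,0)$ is never realized, because it would require both $u \leq 0.3$ and $u > 0.7$.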

204:

205: \subsection{Bayesian CART}

206:

207: Two other closely related priors can be described as Bayesian versions

208: of the CART algorithm. This was pursued by Chipman, George, and

209: McCulloch, whose prior closely parallels the choices made in the

210: original CART algorithm~\cite{cgm:1998a, cgm:1998b, cgm:2000a,

211: cgm:2000b}. Here is a description of their prior when the covariate

212: space is $\cR^p$. Their prior starts with a root node (which

213: represents the whole space); this node is then recursively

214: partitioned in a random way. For each node, randomly choose whether to

215: split it or not, then choose a coordinate to split on, then choose a

216: split point (i.e. the cutoff value) randomly {\em from among the

217: midpoints between the ordered values of this coordinate}; finally each

218: leaf node is given an independent regression value. The details of how

219: these decisions are made differ in their particulars from the ones

220: that I described in the introduction. In early work, these authors

221: observed that using MCMC to sample from the posterior of this prior

222: provides a rudimentary (global) search procedure, which has certain

223: (apparent) advantages over the {\em greedy} search procedure commonly

224: implemented in CART-type algorithms. In later work, they examined and

225: computed the (approximate) posterior mean (working primarily on the

226: least-squares white-noise regression problem) and found that it had

227: good performance. They also considered extended priors that modeled

228: the regression values as additively (not independently)

229: generated~\cite{cgm:2000b}.

230:

231: Denison, Mallick, and Smith independently considered another version

232: of Bayes\-ian CART~\cite{dms:1998b, dms:1998a, dms:1998c}. For

233: one-dimen\-sion\-al problems they propose using random splines (the prior

234: I use is essentially a special case of this prior). They consider some

235: of the regression examples that are standard in the wavelet literature

236: and show that their spline methods perform equally well. Additionally,

237: they propose a Bayesian version of Friedman's MARS which puts a prior on

238: functions that are constructed by adding together random spline-type

239: ridge functions. Denison, Adams, Holmes, and Hand discuss the

240: usefulness of random partitions in~\cite{dahh:2002}.

241:

242: Very recently, Denison, Holmes, Mallick, and Smith have written a

243: book~\cite{dhms:2002} which surveys some related Bayesian regression

244: schemes, including a Bayesian method for (multiple class)

245: classification using Voronoi partitions that is very closely related

246: to (albeit independent of) the work that I present in

247: \autoref{c:voronoi}. The book also discusses Bayesian wavelet methods,

248: and an interesting Bayesian nearest-neighbor prior. As a default

249: prior, they recommend assuming that every model in a ``single

250: dimension'' is equally likely, and each dimension is equally probable,

251: \apriori. This ``flat prior,'' they claim, should serve perfectly well

252: because of the ``natural tendency'' for the marginalized likelihood

253: to penalize complex models:

254:

255: \begin{quotation}

256: On the face of it, we might be concerned that the flexible modeling

257: strategy we advocate might be prone to overfitting the data by adding

258: too many basis functions. Indeed, many papers found in the literature

259: advocate explicit priors on the model space that penalize the

260: dimension of the model. However, throughout this book we argue that

261: such a measure is unnecessary.  The Bayesian framework contains

262: a natural penalty against over [sic] complex models, sometimes called

263: {\it Occam's razor}, which essentially states that a simpler theory is

264: to be favoured over a more complex one, all other things being

265: equal.

266: \end{quotation}%page 21-22 dhms

267:

268: There is no consideration given to the possibility that this might

269: give rise to inconsistent estimates (e.g. as in the Diaconis and

270: Freedman non-parametric regression example explained earlier); indeed

271: there are few theoretical considerations at all in the book. Their

272: explanation of why the Markov chain techniques that they develop

273: should actually give meaningful samples from the posterior appeals to

274: Green's reversible jump~\cite{green:1995}. The explanation given is

275: vague and ultimately they decide to avoid the issue and appeal to the

276: fact that their chains are discrete. The chains in

277: \autoref{c:implement} of this thesis involve a continuous state space

278: and do not simply avoid this issue by discretizing the continuous

279: modeling space as these authors seem to do.

280:

281: Overall, the book emphasizes main ideas, algorithms, and results. It

282: seems that for every existing regression technique, they want to

283: demonstrate that they can make a ``Bayesian'' version of it too. The

284: book does not emphasize subjectivism, but rather adopts an

285: ``$\cM_{\text{\it open}}$'' perspective to Bayesian modeling: ``we

286: never believe that the true model lies in the set of possible

287: models.'' The book does do a good job of supplying default priors for

288: a wide variety of possible parametric models. Similarly, Denison's

289: thesis~\cite{denison:1997} emphasizes the wide variety of problems to

290: which Bayesian partitioning methods of this sort can be applied.

291:

292: \subsection{Poisson Rate Estimates Using Random Partitions}

293: Green~\cite{green:1995} and

294: Scargle~\cite{scargle} develop priors on piecewise constant

295: functions on the real line and $\bbR^d$ using Voronoi

296: cells. Their priors are quite similar to the ones developed in this thesis,

297: but are intended to address the problem of estimating the rate function of a

298: $\Poisson$ process. In principle, one could apply their techniques to

299: the problem of binary regression by generating an estimate of the rate

300: function of the ``heads'' process and the ``tails'' process separately

301: and then combining the results. I do not think that this has been

302: tried and it seems substantially less ``natural.''

303:

304: Green applies his method to a coal mining dataset and a synthetic

305: two-dimen\-sion\-al example. For these examples, Green assumes that an

306: individual cell's rate-para\-met\-er is drawn independently from a

307: $\Gamma(\alpha,\beta)$ prior. For the one-dimen\-sion\-al case he

308: advocates a prior which ``probabilistically'' spaces out the

309: change-point locations; specifically, if there are $j$ change-points,

310: the ordered locations of the change-points are distributed like the

311: even order statistics of $2j+1$ independent uniform values. He argues

312: that this is good because it prevents small change-point intervals from

313: entering into the posterior. For the two-dimensional example, the

314: generating points of the Voronoi partition are drawn independently and

315: uniformly. Green's methods are given, in part, as

316: examples of his ``reversible jump'' MCMC technique. This technique has

317: become an accepted part of MCMC practice, but is not accepted by all

318: experts in MCMC theory because it does not lay down in a

319: straightforward ``theorem-proof'' manner the necessary conditions and

320: consequent conclusions. For this reason, detailed verifications for

321: the chains used in this thesis are given in \autoref{c:implement}.
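
For concreteness, the spacing prior on change-point locations mentioned above has a simple closed form. As a sketch, assume the observation interval is $[0,L]$ (the symbols $L$ and $s_i$ are introduced here only for illustration). The even order statistics $s_1 < \dots < s_j$ of $2j+1$ independent uniform draws on $[0,L]$ have joint density
\[
\frac{(2j+1)!}{L^{2j+1}} \prod_{i=0}^{j} (s_{i+1} - s_i),
\qquad s_0 = 0, \quad s_{j+1} = L .
\]
The density vanishes whenever two adjacent change-points coincide, which is the sense in which short intervals are discouraged. For $j=1$ and $L=1$ this reduces to the familiar density $6 s_1 (1 - s_1)$ of the median of three uniform draws.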

322:

323: Scargle's work is applied to astronomical data; he concentrates on the

324: problem of finding the mode of the posterior, rather than the

325: posterior mean. Fortunately, he and coworkers have developed a way of

326: computing this mode in the one-dimensional case exactly and

327: efficiently using a dynamic programming

328: approach~\cite{scargle:dynamic}. Instead of giving each cell an

329: independent value, Scargle gives each cell a (logical) ``color'' and

330: then associates each unique color with an independent

331: rate-parameter. This allows him to use a fine partition and then group

332: ``chunks'' back together into more complicated shapes. The way he

333: forms this partition is also different; in particular his ``prior'' is

334: data dependent, but not quite in the way of the ``prior'' that I

335: consider in \autoref{c:voronoi}. Rather, the data is used once and for

336: all to generate the fine Voronoi partition of space that results from

337: using all of the data points as generators. These cells are then

338: ``clumped'' (i.e. given a logical color) and the clumps are given an

339: independent rate parameter.

340:

341: \subsection{Bayesian ``Image'' Analysis}

342: M{\o}ller and Skare~\cite{moller:skare} apply their work to reservoir

343: modeling and connect their work to efforts in Bayesian image analysis

344: (including Markov random fields). They use a random Voronoi partition

345: of the data and assign each partition element a random color (in a way

346: that depends only on the colors of neighboring cells). They supply

347: several further references to work in Bayesian image analysis which

348: use Voronoi cells. From their perspective, to calculate their

349: posterior they are simulating from a special ``marked point'' process.

350: The generators of the Voronoi cells are regarded as a point set that has

351: been drawn from a homogeneous Poisson process of rate $\beta$ on the

352: unit cube. In the simplest case, the marks or ``colors'' of these

353: points are just integers from $1$ up to $M$ that have been drawn

354: independently. More generally, according to their prior, the

355: conditional distribution of the coloring of cells given the generators is an Ising or

356: Potts model. The graphical structure of this model is determined by

357: consideration of which Voronoi cells are neighbors, and the $\theta$

358: parameter is chosen to reflect their prior belief that neighboring

359: cells tend to be of the same color. They consider two problems. The

360: first is a simulation experiment in which a ``true'' binary image is

361: degraded with Gaussian noise. The second is a three dimensional

362: reservoir problem based on real data. It is supposed that a certain

363: three dimensional cube (the reservoir) consists of 4 different types

364: of rock. The rock types are observed along seven vertical lines,

365: representing the observations of rock that were made as seven wells

366: were dug into the reservoir. In both problems, the true object to be

367: recovered is itself a certain ``coloring'' of space (i.e. rather than

368: a continuous regression function). For the MCMC computation of their

369: posterior they apply the birth-death type Metropolis-Hastings

370: algorithm for point processes, as studied by Geyer and

371: M{\o}ller~\cite{geyer:moller:1994} and claim that their target

372: distribution satisfies a local stability condition (see

373: Geyer~\cite{geyer:1999}, Kendall and

374: M{\o}ller~\cite{kendal:moller:2000}, and M{\o}ller~\cite{moller:1999})

375: so that the MCMC is actually geometrically ergodic.

376:

377: \subsection{Polya Trees}

378: Finally, Polya trees~\cite{lavine:1992} and especially randomized

379: Polya trees~\cite{paddock:1999} deserve to be mentioned. The basic

380: Polya tree puts a prior on distribution functions on the unit

381: interval. The unit interval is divided recursively in a dyadic binary

382: way and mass is allocated to each piece of the partition in a

383: stagewise manner by first determining how much of the mass that is

384: available will be on the left versus the right half and then

385: continuing with such determinations layer by layer. Each of these

386: assignments is ultimately determined by an independent $\Beta$ random

387: variable, whose parameters depend upon its location in the ``tree.''

388: If a suitable choice of these parameters is made, the resulting prior on

389: distribution functions concentrates on distributions that are

390: absolutely continuous with respect to Lebesgue measure. The essential

391: advantage of Polya trees is that the posterior of a Polya tree prior is

392: easily and analytically computable, being itself another Polya

393: tree. For randomized Polya trees, the partitioning scheme is

394: independently ``jittered'' at random in a particular

395: way~\cite{paddock:1999}. A Hybrid MCMC can be employed to sample from

396: the randomized Polya tree posterior which uses a Gibbs step to take

397: advantage of the ease with which the (internal) Polya tree posterior

398: can be computed. Both methods can be extended (essentially by taking

399: ``direct products'') to put a prior on distributions on the unit cube.
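
A minimal sketch of the basic construction follows, in notation assumed here for illustration: binary strings $\epsilon = \epsilon_1 \cdots \epsilon_m$ index the dyadic intervals $B_\epsilon$ at depth $m$, and $\alpha_\epsilon > 0$ are the $\Beta$ parameters. With independent $Y_{\epsilon 0} \sim \Beta(\alpha_{\epsilon 0}, \alpha_{\epsilon 1})$ and $Y_{\epsilon 1} = 1 - Y_{\epsilon 0}$, the random distribution $F$ assigns
\[
F(B_{\epsilon_1 \cdots \epsilon_m}) = \prod_{j=1}^{m} Y_{\epsilon_1 \cdots \epsilon_j} .
\]
Given observations $x_1, \dots, x_n$ drawn \iid\ from $F$, the posterior is again a Polya tree, with each $\alpha_\epsilon$ replaced by $\alpha_\epsilon + n_\epsilon$, where $n_\epsilon$ counts the observations that fall in $B_\epsilon$.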

400:

401: \subsection{The Contributions of this Thesis}

402:

403: The depth and breadth of the literature reviewed above may

404: leave the reader in doubt about the contributions of this

405: thesis. After all, the one-dimensional prior that I consider is

406: essentially a special case of the univariate spline model and the idea

407: of using Voronoi partitions is certainly not new, although effective

408: Bayesian methods using them only started springing up fairly

409: recently.

410:

411: Still, there is room for careful analysis. This thesis establishes that

412: the posterior is consistent under suitable conditions on the prior and

413: for any measurable regression function (see \autoref{c:proof} for

414: details): an issue which none of the ``Bayesian CART'' or ``Voronoi

415: Partition'' authors address at all. This thesis also gives an explicit

416: Markov chain Monte Carlo algorithm (see \autoref{s:algo}). Broadly

417: speaking it is a fairly standard birth-death Markov chain as

418: considered by Geyer and M{\o}ller~\cite{geyer:moller:1994}, but the

419: technicalities of the analysis seem to be somewhat different. This

420: thesis proceeds to show in detail that it satisfies detailed balance

421: by direct self-contained argumentation; further, the chain is shown to

422: have an ergodicity property (see \autoref{s:ergoproof}). These

423: considerations are often glossed over in modern writing.

424:

425: On the more practical side, \autoref{c:examples} scrutinizes the

426: behavior of the posterior mean estimate under a variety of carefully

427: designed simulation experiments. These experiments serve both to

428: analyze the posterior mean and to give insight into the relationship

429: between Bayesian methods and their classical counterparts. See for

430: example the discussion of CART and bagging in \autoref{s:cartexp}.

431:

432: \section{Other Approaches}\label{s:otherlit}

433: The literature on classification and regression methods is huge; the

434: interested reader is urged to consult good modern books on the subject

435: like {\em The Elements of Statistical Learning,} by Hastie,

436: Tibshirani, and Friedman~\cite{HTF:2001}. The following paragraphs

437: outline some of the methods that have had the most impact upon the author.

438:

439: In the statistics literature, classical approaches to the

440: classification and binary regression problem include logistic

441: regression, Fisher's discriminant analysis, and projection pursuit

442: methods. Logistic regression specifies that the success probability

443: regression function is such that its log-odds follows a linear model

444: with a user specified basis (e.g. by using polynomial or spline

445: functions of the covariate-data) and estimates the parameters by

446: maximum likelihood. Model selection is commonly performed using

447: classical methods to select a subset of the covariate

448: variables. Fisher's discriminant analysis finds a hyperplane which

449: ``optimally'' separates the two classes using a within versus between

450: variance criterion. Projection pursuit seeks an interesting linear (or

451: sometimes nonlinear) projection of the covariate-data onto a lower

452: dimensional subspace (e.g. $\bbR$). Various criteria have been

453: proposed to define ``interesting,'' some of which are suitable for the

454: classification problem. Each of these methods has undergone a variety

455: of generalizations and tweaks to address a wider range of problems over

456: the years.
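
The logistic regression model described above can be written out as follows. This is only a sketch in notation assumed here: the basis functions $b_1, \dots, b_m$, the coefficients $c_j$, and the data $(x_i, y_i)$ with $y_i \in \{0,1\}$ are introduced for illustration. The success probability $f$ is taken to satisfy
\[
\log \frac{f(x)}{1 - f(x)} = \sum_{j=1}^{m} c_j \, b_j(x),
\]
and the coefficients are chosen to maximize the log-likelihood
\[
\sum_{i=1}^{n} \bigl[ y_i \log f(x_i) + (1 - y_i) \log (1 - f(x_i)) \bigr] .
\]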

457:

458: The first {\em general} method to solve the classification problem

459: automatically was the $k$-nearest neighbor

460: approach~\cite{cover:hart:1967}. $k$-Nearest neighbor estimates are

461: known to be universally consistent if $k=k(n) \tendsup \infty$ slowly

462: enough~\cite{devroye:gyorfi:lugosi:1996}. Their convergence, however,

463: especially in high dimensional problems, can be slow in

464: practice~\cite{friedman:1996}.
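
As a sketch of what ``slowly enough'' means, in notation assumed here, let $(x_i, y_i)$ be the training pairs with $y_i \in \{0,1\}$ and let $N_k(x)$ index the $k$ training points nearest to $x$. The estimate of the success probability at $x$ is
\[
\hat f_n(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i ,
\]
and the standard sufficient condition for universal consistency is that $k \to \infty$ while $k/n \to 0$ as $n \to \infty$.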

465:

466: Local regression methods are a clever extension of

467: this approach. To predict at a given point, instead of averaging the

468: values given at the neighbors, they fit a low-order linear model to a

469: locally-weighted version of the data set~\cite{clev:load:1996}.

470:

471: Trees~\cite{bfos:1984} and neural nets~\cite{ripley:1996} differ in

472: that they search through a globally-parametrized class of

473: functions. In all of these methods, cross-validation is often employed

474: to estimate frequentist ``out-of-sample'' performance and select a

475: regularization parameter which governs the trade-off between bias and

476: variance~\cite{HTF:2001}.

477:

478: Wavelet methods are in some ways a compromise between the local and

479: the global approaches mentioned above. They fit an explicit global

480: linear model to the data, but the basis elements in this model are

481: carefully constructed to maintain ``localization'' (in space and

482: frequency domains). They boast powerful asymptotic compression and

483: approximation properties, computationally efficient transforms, and

484: can employ special thresholding methods which ``optimally'' choose

485: which coefficients in the model are kept~\cite{donoho:johnstone:1994}. However,

486: their practical use seems to remain concentrated on the case of

487: regularly-spaced regression data. Some recent papers address this

488: shortcoming~\cite{daubechies}.

489:

490: Support vector machines (SVMs)~\cite{vapnik:1996} employ a

491: ``kernel-trick'' to reduce consideration of a certain

492: globally-parametrized model class to consideration of an equivalent

493: linear model class in an abstract Hilbert space. The

494: estimated decision rule corresponds to the solution of a convex

495: optimization problem. This objective function still involves an

496: unknown regularization parameter. In practice, this parameter is often

497: chosen by cross-validation, but, in principle, it can be chosen

498: through consideration of the structural risk minimization (SRM)

499: paradigm. The advantage of using the SRM paradigm is that one obtains

500: provably valid confidence statements about the error rate that will

501: obtain on future data. Moreover, these confidence bounds improve at an

502: exponential rate in the number of data points. With realistic sample

503: sizes, however, the bounds are often too crude to be of practical

504: use. There are hidden connections between SVMs and (1) Bayesian

505: methods employing Gaussian-process priors on the regression function

506: (including the generalized spline methods of Wahba~\cite{wahba,wahba:svms}) and (2)

507: projection pursuit regression~\cite{coram:svm:pp}.
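
As a rough sketch of the convex problem referred to above, consider the usual soft-margin form. The symbols $w$, $b$, $\Phi$, and $C$ are assumptions introduced here, with labels coded as $y_i \in \{-1,+1\}$ and $\Phi$ the feature map into the Hilbert space. One solves
\[
\min_{w,\, b} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i (\langle w, \Phi(x_i) \rangle + b)\bigr),
\]
where $C > 0$ plays the role of the regularization parameter. The kernel trick enters because the solution depends on the data only through the inner products $\langle \Phi(x_i), \Phi(x_j) \rangle$.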

508:

509: Bagging~\cite{breiman96bagging} and

510: boosting~\cite{freund:schapire:1996a} are meta-algorithms that

511: ``boost'' the performance of other classification algorithms (especially

512: trees) by taking carefully chosen weighted averages of the results of

513: the bagged (respectively, boosted) algorithm. There are close

514: connections between boosting and the Lasso penalty, which itself is

515: closely related to the least angle regression method

516: (LARS)~\cite{efron:lars}.

517: \nocite{friedman:mars}

518:

519: % LocalWords:  frequentist von Mises submodel resp Diaconis rom Doob Doob's der

520: % LocalWords:  Wald's Kullback Leibler Ghoshal Ghosh Vaart Shen Schervish cgm

521: % LocalWords:  exchangability nonparametric Bernstein's Poissonization Chipman

522: % LocalWords:  McCulloch MCMC Denison Mallick wavelet Friedman's Voronoi dhms

523: % LocalWords:  voronoi marginalized overfitting favoured Green's discretizing

524: % LocalWords:  subjectivism Denison's Scargle cell's Scargle's ller Skare Geyer

525: % LocalWords:  Kendal Polya stagewise jittered Monte algo Moller ergoproof SVMs

526: % LocalWords:  cartexp Hastie Tibshirani Fisher's hyperplane regularization SRM

527: % LocalWords:  Wahba

528: