1: \chapter{Literature}\label{c:lit}
2: 
3: This chapter reviews and discusses the literature on three
4: subjects. The first section reviews some theoretical results concerning the
5: frequentist performance of Bayesian procedures. The second section
6: surveys work by other authors
7: on closely related Bayesian methods. The final section briefly
8: surveys some salient examples of alternative approaches to the
9: classification problem. 
10: 
11: \section{Theoretical Results}
12: The frequentist performance of Bayesian methods is of fundamental
13: interest in statistics. Given a large sample from a smooth,
14: finite-dimensional statistical model, the situation is quite well
15: understood. The Bernstein-von~Mises
16: theorem~\cite{lecam:yang,freedman:1999} shows that the Bayes estimate and the maximum likelihood
17: estimate will be close. Furthermore, the posterior distribution of the
18: parameter vector around the posterior mean is close to the
19: distribution of the maximum likelihood estimate around the truth: both
20: are asymptotically normal with mean 0 and the same covariance
21: matrix. Unfortunately, though, in more general circumstances, such as
22: those needed for this work, the situation can be much more complex. In
23: particular, the basic model is based on an infinite hierarchy of finite
24: dimensional models. Moreover, even for a given finite dimensional
25: submodel, the dependency of the likelihood function on the parameter
26: is not smooth; the functions are allowed to take jumps. Consequently, a more
27: general theory is needed.
28: 
29: This section reviews some of the literature on this subject with
30: a focus on results that address the question of consistency: i.e. as
31: the number of data points tends to infinity, will the Bayesian
32: estimate converge to the true value (in some suitable sense) almost
33: surely (resp. in probability)?  The literature contains a number of
34: useful and quite flexible positive results, but also a variety of
35: interesting negative examples showing that the regularity conditions
36: under which the theorems hold are not to be taken lightly. A good
37: introduction to these issues is given by Diaconis and
38: Freedman~\cite{diac:free:1986}. Throughout this section, the reader
39: may envision a family $\P_\theta(dx)$, a prior $\pi(d\theta)$, and
40: posterior $\pi(\theta | x_1, \dots, x_n)$, where the $x_i$ are drawn
41: \iid\ from $\P_{\theta_0}(dx)$. Consistency means that the posterior
42: concentrates at $\theta_0$ for large samples.
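
To make this precise, one common formalization (a sketch in notation
that is assumed here rather than taken from the cited papers) says that
the posterior is consistent at $\theta_0$ if, for every neighborhood
$U$ of $\theta_0$,
\[
\pi(U^c \mid x_1, \dots, x_n) \longrightarrow 0 \quad \text{as } n \to \infty,
\]
almost surely (resp. in probability) under \iid\ sampling from
$\P_{\theta_0}$. The choice of topology that defines the neighborhoods
$U$ is part of the statement; the results reviewed below differ in
which topology they use.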
43: 
44: Doob~\cite{doob} established a fundamental result under minimal
45: regularity assumptions using a martingale convergence
46: argument. Roughly speaking, the result states that if
47: consistent estimators exist at all, then a Bayes procedure will
48: provide an almost surely consistent estimate of the true parameter
49: $\theta$ under sampling from the $\theta$ distribution for any
50: $\theta$ in some set $B$ which has prior probability of $1$. Notice,
51: though, that this does not specify if consistency will obtain at any
52: {\em particular} point of interest $\theta_0$, unless $\theta_0$
53: happens to be a point-mass of the prior, or unless it is possible to determine
54: $B$ by some more detailed line of argumentation.
55: 
56: Freedman~\cite{freedman:1963} considered the case in which the
57: observations are discrete. If the set of possible observations is
58: finite, the posterior is consistent exactly for parameter values in
59: the topological support of the prior. The countably infinite case is
60: more complex. He constructs a class of examples showing that it is
61: possible to construct a prior which assigns positive mass to every
62: (weak star) neighborhood of the true parameter value, but for which
63: the posterior converges to a point mass at some other (chosen)
64: parameter value. Furthermore, he finds a prior which assigns positive
65: prior mass to every (weak star) open set of parameters, but for which
66: the posterior is consistent only at a set of parameters of the first
67: category. The reader should note that this prior did not assign mass
68: to all entropy-neighborhoods. This sort of subtle distinction can make
69: all of the difference and explains the necessity of some such
70: assumption in the following consistency theorems. He introduces the
71: ``tail-free'' priors for the countably-infinite case and
72: demonstrates that these are always consistent.
73: 
74: Lorraine Schwartz~\cite{schw:1965} explored the question of
75: consistency in a very general setting. She extended Doob's result to a
76: broad class of loss functions~\cite[lemma 4.2]{schw:1965}.  She also
77: found sufficient conditions for the posterior to be consistent under
78: $\iid$ sampling. These conditions, she says, are ``of an essentially
79: weaker nature'' than the conditions established for the consistency of
80: maximum likelihood estimators. Nevertheless, she constructs an example
81: where the maximum likelihood estimate is consistent and the estimates
82: based on certain priors are not. The example (\cite[example 3]{schw:1965}) involves a simple parametric family of densities
83: which satisfies Wald's conditions, thereby guaranteeing that the
84: maximum likelihood estimate will be consistent, but for which the
85: posterior can be inconsistent. The consistency of the posterior in
86: this case is found to depend critically on the amount of mass that
87: the prior ascribes to small neighborhoods of the true parameter value;
88: if this mass shrinks too quickly, the prior ``ignores'' the data. One
89: clever aspect of her construction is the way the densities are
90: parametrized. Parameter values close to the target value $\theta_0$
91: correspond to densities that are close to the $\theta_0$-density in an
92: $\cL^1$ sense, but which are farther and farther away in
93: Kullback-Leibler discrepancy. In fact, there is only one point in
94: parameter space (the true parameter) that has Kullback-Leibler
95: discrepancy from the truth smaller than $\epsilon$, for $\epsilon$
96: sufficiently small.
97: 
98: Schwartz then shows that the posterior will be consistent under $\iid$
99: sampling under two basic conditions. First, the prior should have
100: positive mass on Kullback-Leibler neighborhoods of the true parameter
101: (defined in \autoref{s:notation} of this thesis), and second, the
102: model class should not be too rich; specifically, she requires that
103: uniformly consistent tests of the hypothesis that $\theta=\theta_0$
104: against the alternative that $\theta$ lies outside a given (open)
105: neighborhood of $\theta_0$ exist.
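
In symbols, and under notation that is assumed here for illustration
(the thesis's own definitions are in \autoref{s:notation}), write
$p_\theta$ for the density of $\P_\theta$ and
$K(\theta_0,\theta)=\int p_{\theta_0} \log (p_{\theta_0}/p_\theta)$ for
the Kullback-Leibler discrepancy. The two conditions can then be
sketched as
\[
\pi\bigl(\{\theta : K(\theta_0,\theta) < \epsilon\}\bigr) > 0
\quad \text{for every } \epsilon > 0,
\]
and, for a given neighborhood $U$ of $\theta_0$, the existence of tests
$\phi_n=\phi_n(x_1,\dots,x_n)$ taking values in $[0,1]$ with
\[
\mathrm{E}_{\theta_0}\, \phi_n \to 0
\quad \text{and} \quad
\sup_{\theta \notin U} \mathrm{E}_{\theta}\,(1-\phi_n) \to 0 .
\]
The first display is the prior-mass requirement; the second is what is
meant by a uniformly consistent sequence of tests.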
106: 
107: It is not always obvious how to verify the latter property
108: directly. Modern authors have employed entropy-type bounds to
109: guarantee the existence of such tests. Ghosal, Ghosh, and van der
110: Vaart~\cite{ggv} state a theorem (\cite[theorem 7.3]{ggv}) which
111: proves that the posterior converges at a certain rate if certain
112: uniform tests exist (and the prior mass is suitably distributed) and
113: go on to find a variety of entropy-type conditions that suffice to be
114: able to construct the necessary tests. Shen and
115: Wasserman~\cite{shen:wasserman} show related results, requiring
116: slightly different conditions on how mass needs to be allocated; they
117: do not make a connection with testing. Barron, Schervish, and
118: Wasserman~\cite{barron:schervish:wasserman:1999} find sufficient
119: conditions for the posterior to be consistent; their results are
120: reviewed and then used in \autoref{c:proof}.
121: 
122: It should be noted that these various conditions for consistency are
123: not necessary, but merely sufficient. Nevertheless, it is important to
124: treat this subject with care because of the variety of examples for
125: which consistency fails. 
126: 
127: Barron, Schervish, and Wasserman also give an interesting example
128: where consistency fails. In this example, they show that the prior
129: puts too much mass on a very rich class of models that will be able to
130: match any spurious structure that the data might have by chance,
131: overwhelming the true parameter. Furthermore, lest the reader get the
132: wrong idea, inconsistency does not only occur in artificial
133: examples. A series of ``natural'' yet still inconsistent estimators
134: for the symmetric location problem are discussed by Diaconis and
135: Freedman~\cite{diac:free:1986}. In addition, the binary regression
136: example explained in the next section has a natural motivation based on
137: conditional exchangeability.
138: 
139: \section{Related Bayesian Work}\label{s:relatedpriors}
140: 
141: The following subsections contain a review of work by other authors
142: that is closely related to this thesis. It is followed by a brief synopsis
143: of the contributions that this thesis makes to the literature.
144: 
145: \subsection{A Dyadic Prior for Binary Regression}\label{s:DFprior}
146: The most relevant examples for the work of this thesis are the
147: nonparametric binary regression examples of Diaconis and
148: Freedman~\cite{diac:free:1993, diac:free:1995}. They use a different
149: prior, call it $\piDF$, which is a hierarchical, dyadic prior on $f$. To
150: describe $\piDF$, let $A_k$ be the set of intervals which result from
151: partitioning the unit interval into $2^k$ equal pieces. Let $\F_k$ be
152: the subset of functions which are constant on all intervals $a \in
153: A_k$. Finally, fix a prior distribution $\kappa$ on the non-negative
154: integers. Assume, for simplicity, that $\kappa(k) >0$ for all $k$. To
155: draw $f$ from $\piDF$, draw $K$ from $\kappa$ and then, conditional on
156: hierarchy-level $K=k$, draw $f$ uniformly at random from $\F_k$. In
157: effect then, at level $k$ one draws $2^k$ independent $\U(0,1)$ random
158: variables to describe the success probability on each of the $2^k$
159: pieces.
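
Written out, and using $\theta_a$ as an assumed name for the success
probability on the interval $a$, a draw from $\piDF$ at hierarchy-level
$k$ takes the form
\[
f(u) = \sum_{a \in A_k} \theta_a \, \II(u \in a),
\qquad \theta_a \sim \U(0,1) \ \text{independently},
\]
so that $\piDF$ is the mixture $\sum_{k \geq 0} \kappa(k)\, \pi_k$, where
$\pi_k$ (again an assumed name) denotes the uniform distribution on
$\F_k$. For example, at level $k=2$ the function $f$ is determined by
four independent uniform values, one for each quarter of the unit
interval (the treatment of the interval endpoints is immaterial here).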
160: 
161: They show that for any $\kappa$ and any $f_0$ (except possibly for
162: $f_0 \equiv \frac{1}{2}$), the posterior estimates are consistent (in
163: the sense that any $\mathcal{L}^1$ neighborhood of $f_0$ has posterior
164: probability tending to $1$ a.s.). Remarkably, however, for $f_0 \equiv
165: \frac{1}{2}$, the posterior can be an inconsistent estimate if the
166: tail of $\kappa$ is sufficiently heavy.  Specifically, let
167: $\lambda_k=-\log (\kappa(K \geq k) ) /k $. Then if $\limsup \lambda_k
168: < \lambdac=2^{-\frac{1}{4}} \approx 0.841$, the posterior is
169: inconsistent at $f_0 \equiv \frac{1}{2}$. On the other hand, if
170: $\limsup \lambda_k >\lambdac$, the posterior is consistent for any
171: $f_0$. To put this in perspective, for $\kappa(k)=(1-\beta)\beta^k$ (a
172: shifted $\Geometric(1-\beta)$ prior), $\limsup
173: \lambda_k=-\log(\beta)$. The critical value for $\beta$ is
174: $\exp(-\lambdac) \approx 0.431$; for larger $\beta$ (longer tails)
175: inconsistency will occur (but only for $f_0 \equiv \half$).
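
The arithmetic behind the geometric example can be checked
directly. With $\kappa(k)=(1-\beta)\beta^k$ and $k \geq 1$,
\[
\kappa(K \geq k) = \sum_{j \geq k} (1-\beta)\beta^j = \beta^k,
\qquad
\lambda_k = -\frac{\log \beta^k}{k} = -\log \beta ,
\]
so $\lambda_k$ does not depend on $k$ and equals its own
$\limsup$. Setting $-\log\beta = \lambdac$ gives the critical value
$\beta = \exp(-2^{-1/4}) \approx \exp(-0.841) \approx 0.431$. Since
$-\log\beta$ is decreasing in $\beta$, longer tails (larger $\beta$)
correspond to smaller values of $\lambda_k$.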
176: 
177: This result is substantially stronger than the result I have
178: obtained for my prior $\pi$. In particular, applying the same
179: (general) method of proof that I employed to prove consistency for
180: $\pi$ to $\piDF$ yields only the result that $\piDF$ is consistent if
181: the tails of $\kappa$ drop off at least as fast as those of a
182: Poisson. (Recall that at level $k$, $\pi$ only divides $[0,1]$ into
183: $k$ intervals, but $\piDF$ divides it into $2^k$.) Their method of
184: proof is direct; it uses Bernstein's inequality, Poissonization, and
185: special features of the prior. My method of proof is indirect; it uses
186: general results that employ entropy-type bounds.
187: 
188: There are striking similarities between $\pi$ and $\piDF$. In fact,
189: $\pi$ is equivalent to a suitably randomized $\piDF$. To achieve this,
190: it is not enough to simply randomize the dyadic split points. Instead, recall
191: that $\piDF$ has an alternative interpretation in terms of binary
192: sequences. At hierarchy-level $k$, $\piDF$ is uniform over $\F_k$. This
193: corresponds to independently assigning uniform success probabilities to
194: each binary sequence of length $k$. Here is an alternative way
195: to draw $f$ from $\pi$. Draw $g$ from $\piDF$ and interpret $g$ as
196: a function on binary sequences of length $k$ ($k$ depends on $g$). Let
197: $V_i$ ($i=1, \dots, k$) be $\iid$ $U(0,1)$ random variables. To any
198: point $u \in [0,1]$ associate the binary random variables
199: $\eta_i(u)=\II(u \leq V_i)$ ($i=1, \dots, k$). Define $f$ via
200: $f(u)=g((\eta_1(u), \dots, \eta_k(u) ))$. Note that only a small
201: fraction of possible binary sequences are realized in this manner: at
202: level $k$ (which ranges from $0$ to $\infty$ under $\piDF$), $k+1$
203: sequences out of the full set of $2^k$ possible sequences are achieved.
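
As a small illustration (a worked case added here under the
definitions above, not an example taken from the cited papers), take
$k=2$ and suppose that $V_1 < V_2$. Then
\[
(\eta_1(u), \eta_2(u)) =
\left\{
\begin{array}{ll}
(1,1) & \text{if } 0 \leq u \leq V_1, \\
(0,1) & \text{if } V_1 < u \leq V_2, \\
(0,0) & \text{if } V_2 < u \leq 1,
\end{array}
\right.
\]
so $f$ is piecewise constant on three intervals, with values
$g((1,1))$, $g((0,1))$, and $g((0,0))$. The sequence $(1,0)$ is never
realized, so $k+1=3$ of the $2^k=4$ possible sequences occur.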
204: 
205: \subsection{Bayesian CART}
206: 
207: Two other closely related priors can be described as Bayesian versions
208: of the CART algorithm. This was pursued by Chipman, George, and
209: McCulloch, whose prior closely parallels the choices made in the
210: original CART algorithm~\cite{cgm:1998a, cgm:1998b, cgm:2000a,
211: cgm:2000b}. Here is a description of their prior when the covariate
212: space is $\cR^p$. Their prior starts with a root node (which
213: represents the whole space); this node is then recursively
214: partitioned in a random way. For each node, randomly choose whether to
215: split it or not, then choose a coordinate to split on, then choose a
216: split point (i.e. the cutoff value) randomly {\em from among the
217: midpoints between the ordered values of this coordinate}; finally each
218: leaf node is given an independent regression value. The details of how
219: these decisions are made differ in their particulars from the ones
220: that I described in the introduction. In early work, these authors
221: observed that using MCMC to sample from the posterior of this prior
222: provides a rudimentary (global) search procedure, which has certain
223: (apparent) advantages over the {\em greedy} search procedure commonly
224: implemented in CART-type algorithms. In later work, they examined and
225: computed the (approximate) posterior mean (working primarily on the
226: least-squares white-noise regression problem) and found that it had
227: good performance. They also considered extended priors that modeled
228: the regression values as additively (not independently)
229: generated~\cite{cgm:2000b}.
230: 
231: Denison, Mallick, and Smith independently considered another version
232: of Bayes\-ian CART~\cite{dms:1998b, dms:1998a, dms:1998c}. For
233: one-dimen\-sion\-al problems they propose using random splines (the prior
234: I use is essentially a special case of this prior). They consider some
235: of the regression examples that are standard in the wavelet literature
236: and show that their spline methods perform equally well. Additionally,
237: they propose a Bayesian version of Friedman's MARS which puts a prior on
238: functions that are constructed by adding together random spline-type
239: ridge functions. Denison, Adams, Holmes, and Hand discuss the
240: usefulness of random partitions in~\cite{dahh:2002}.
241: 
242: Very recently, Denison, Holmes, Mallick, and Smith have written a
243: book~\cite{dhms:2002} which surveys some related Bayesian regression
244: schemes, including a Bayesian method for (multiple class)
245: classification using Voronoi partitions that is very closely related to
246: (albeit independent of) the work that I present in
247: \autoref{c:voronoi}. The book also discusses Bayesian wavelet methods,
248: and an interesting Bayesian nearest-neighbor prior. As a default
249: prior, they recommend assuming that every model in a ``single
250: dimension'' is equally likely, and each dimension is equally probable,
251: \apriori. This ``flat prior,'' they claim, should serve perfectly well
252: because of the ``natural tendency'' for the marginalized likelihood
253: to penalize complex models:
254: 
255: \begin{quotation}
256: On the face of it, we might be concerned that the flexible modeling
257: strategy we advocate might be prone to overfitting the data by adding
258: too many basis functions. Indeed, many papers found in the literature
259: advocate explicit priors on the model space that penalize the
260: dimension of the model. However, throughout this book we argue that
261: such a measure is unnecessary.  The Bayesian framework contains
262: a natural penalty against over [sic] complex models, sometimes called
263: {\it Occam's razor}, which essentially states that a simpler theory is
264: to be favoured over a more complex one, all other things being
265: equal.
266: \end{quotation}%page 21-22 dhms
267: 
268: There is no consideration given to the possibility that this might
269: give rise to inconsistent estimates (e.g. as in the Diaconis and
270: Freedman non-parametric regression example explained earlier); indeed
271: there are few theoretical considerations at all in the book. Their
272: explanation of why the Markov chain techniques that they develop
273: should actually give meaningful samples from the posterior appeals to
274: Green's reversible jump~\cite{green:1995}. The explanation given is
275: vague and ultimately they decide to avoid the issue and appeal to the
276: fact that their chains are discrete. The chains in
277: \autoref{c:implement} of this thesis involve a continuous state space
278: and do not simply avoid this issue by discretizing the continuous
279: modeling space as these authors seem to do. 
280: 
281: Overall, the book emphasizes main ideas, algorithms, and results. It
282: seems that for every existing regression technique, they want to
283: demonstrate that they can make a ``Bayesian'' version of it too. The
284: book does not emphasize subjectivism, but rather adopts an
285: ``$\cM_{\text{\it open}}$'' perspective to Bayesian modeling: ``we
286: never believe that the true model lies in the set of possible
287: models.'' The book does do a good job of supplying default priors for
288: a wide variety of possible parametric models. Similarly, Denison's
289: thesis~\cite{denison:1997} emphasizes the wide variety of problems to
290: which Bayesian partitioning methods of this sort can be applied.
291: 
292: \subsection{Poisson Rate Estimates Using Random Partitions}
293: Green~\cite{green:1995} and
294: Scargle~\cite{scargle} develop priors on piecewise constant
295: functions on the real line and $\bbR^d$ using Voronoi
296: cells. Their priors are quite similar to the ones developed in this thesis,
297: but are intended to address the problem of estimating the rate function of a
298: $\Poisson$ process. In principle, one could apply their techniques to
299: the problem of binary regression by generating an estimate of the rate
300: function of the ``heads'' process and the ``tails'' process separately
301: and then combining the results. I do not think that this has been
302: tried and it seems substantially less ``natural.''
303: 
304: Green applies his method to a coal mining dataset and a synthetic
305: two-dimen\-sion\-al example. For these examples, Green assumes that an
306: individual cell's rate-para\-met\-er is drawn independently from a
307: $\Gamma(\alpha,\beta)$ prior. For the one-dimen\-sion\-al case he
308: advocates a prior which ``probabilistically'' spaces out the
309: change-point locations; specifically, if there are $j$ change-points,
310: the ordered locations of the change-points are distributed like the
311: even order statistics of $2j+1$ independent uniform values. He argues
312: that this is good because it prevents small change-point intervals from
313: entering into the posterior. For the two-dimensional example, the
314: generating points of the Voronoi partition are drawn independently and
315: uniformly. Green's methods are given, in part, as
316: examples of his ``reversible jump'' MCMC technique. This technique has
317: become an accepted part of MCMC practice, but is not accepted by all
318: experts in MCMC theory because it does not lay down in a
319: straightforward ``theorem-proof'' manner the necessary conditions and
320: consequent conclusions. For this reason, detailed verifications for
321: the chains used in this thesis are given in \autoref{c:implement}.
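
The effect of Green's one-dimensional change-point prior on spacing can
be seen from its density. As a sketch, assume the observation window is
$[0,L]$ and write $s_1 < \dots < s_j$ for the ordered change-points,
with the conventions $s_0=0$ and $s_{j+1}=L$ (these symbols are
introduced here for illustration). Integrating the odd order
statistics out of the joint density of the $2j+1$ ordered uniform
values gives
\[
p(s_1,\dots,s_j) = \frac{(2j+1)!}{L^{2j+1}} \prod_{i=0}^{j} (s_{i+1}-s_i),
\]
which vanishes whenever two adjacent change-points coincide. By
contrast, the ordered locations of $j$ independent uniform values have
the constant density $j!/L^j$, which does not penalize short intervals.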
322: 
323: Scargle's work is applied to astronomical data; he concentrates on the
324: problem of finding the mode of the posterior, rather than the
325: posterior mean. Fortunately, he and coworkers have developed a way of
326: computing this mode in the one-dimensional case exactly and
327: efficiently using a dynamic programming
328: approach~\cite{scargle:dynamic}. Instead of giving each cell an
329: independent value, Scargle gives each cell a (logical) ``color'' and
330: then associates each unique color with an independent
331: rate-parameter. This allows him to use a fine partition and then group
332: ``chunks'' back together into more complicated shapes. The way he
333: forms this partition is also different; in particular his ``prior'' is
334: data dependent, but not quite in the way of the ``prior'' that I
335: consider in \autoref{c:voronoi}. Rather, the data is used once and for
336: all to generate the fine Voronoi partition of space that results from
337: using all of the data points as generators. These cells are then
338: ``clumped'' (i.e. given a logical color) and each clump is given an
339: independent rate parameter.
340: 
341: \subsection{Bayesian ``Image'' Analysis}
342: M{\o}ller and Skare~\cite{moller:skare} apply their work to reservoir
343: modeling and connect their work to efforts in Bayesian image analysis
344: (including Markov random fields). They use a random Voronoi partition
345: of the data and assign each partition element a random color (in a way
346: that depends only on the colors of neighboring cells). They supply
347: several further references to work in Bayesian image analysis which
348: use Voronoi cells. From their perspective, to calculate their
349: posterior they are simulating from a special ``marked point'' process.
350: The generators of the Voronoi cells are regarded as a point set that has
351: been drawn from a homogeneous Poisson process of rate $\beta$ on the
352: unit cube. In the simplest case, the marks or ``colors'' of these
353: points are just integers from $1$ up to $M$ that have been drawn
354: independently. More generally, according to their prior, the
355: conditional distribution of the coloring of cells given the generators is an Ising or
356: Potts model. The graphical structure of this model is determined by
357: consideration of which Voronoi cells are neighbors, and the $\theta$
358: parameter is chosen to reflect their prior belief that neighboring
359: cells tend to be of the same color. They consider two problems. The
360: first is a simulation experiment in which a ``true'' binary image is
361: degraded with Gaussian noise. The second is a three dimensional
362: reservoir problem based on real data. It is supposed that a certain
363: three dimensional cube (the reservoir) consists of 4 different types
364: of rock. The rock types are observed along seven vertical lines,
365: representing the observations of rock that were made as seven wells
366: were dug into the reservoir. In both problems, the true object to be
367: recovered is itself a certain ``coloring'' of space (i.e. rather than
368: a continuous regression function). For the MCMC computation of their
369: posterior they apply the birth-death type Metropolis-Hastings
370: algorithm for point processes, as studied by Geyer and
371: M{\o}ller~\cite{geyer:moller:1994} and claim that their target
372: distribution satisfies a local stability condition (see
373: Geyer~\cite{geyer:1999}, Kendall and
374: M{\o}ller~\cite{kendal:moller:2000}, and M{\o}ller~\cite{moller:1999})
375: so that the MCMC is actually geometrically ergodic.
376: 
377: \subsection{Polya Trees}
378: Finally, Polya trees~\cite{lavine:1992} and especially randomized
379: Polya trees~\cite{paddock:1999} deserve to be mentioned. The basic
380: Polya tree puts a prior on distribution functions on the unit
381: interval. The unit interval is divided recursively in a dyadic
382: way, and mass is allocated to each piece of the partition in a
383: stagewise manner by first determining how much of the mass that is
384: available will be on the left versus the right half and then
385: continuing with such determinations layer by layer. Each of these
386: assignments is ultimately determined by an independent $\Beta$ random
387: variable whose parameters depend upon its location in the ``tree.''
388: If a suitable choice of these parameters is made, the resulting prior on
389: distribution functions concentrates on distributions that are
390: absolutely continuous with respect to Lebesgue measure. The essential
391: advantage of Polya trees is that the posterior of a Polya tree prior is
392: easily and analytically computable, being itself another Polya
393: tree. For randomized Polya trees, the partitioning scheme is
394: independently ``jittered'' at random in a particular
395: way~\cite{paddock:1999}. A hybrid MCMC scheme can be employed to sample from
396: the randomized Polya tree posterior which uses a Gibbs step to take
397: advantage of the ease with which the (internal) Polya tree posterior
398: can be computed. Both methods can be extended (essentially by taking
399: ``direct products'') to put a prior on distributions on the unit cube.
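
The conjugacy of the basic Polya tree can be made explicit. In a sketch
with assumed notation, index the sets of the dyadic partition by finite
binary strings $\epsilon$, let $B_{\epsilon 0}$ and $B_{\epsilon 1}$ be
the left and right halves of $B_\epsilon$, and let $Y_\epsilon$ be the
fraction of the mass of $B_\epsilon$ that is assigned to
$B_{\epsilon 0}$. If
\[
Y_\epsilon \sim \Beta(\alpha_{\epsilon 0}, \alpha_{\epsilon 1})
\quad \text{independently},
\]
the observations $x_1,\dots,x_n$ are drawn \iid\ from the resulting
random distribution, and $n_\epsilon$ denotes the number of
observations that fall in $B_\epsilon$, then the posterior is again a
Polya tree with
\[
Y_\epsilon \mid x_1,\dots,x_n \sim
\Beta(\alpha_{\epsilon 0}+n_{\epsilon 0},\ \alpha_{\epsilon 1}+n_{\epsilon 1}) .
\]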
400: 
401: \subsection{The Contributions of this Thesis}
402: 
403: The depth and breadth of the literature reviewed above may
404: leave the reader in doubt about the contributions of this
405: thesis. After all, the one-dimensional prior that I consider is
406: essentially a special case of the univariate spline model and the idea
407: of using Voronoi partitions is certainly not new, although effective
408: Bayesian methods using them only started springing up fairly
409: recently. 
410: 
411: Still, there is room for careful analysis. This thesis establishes that
412: the posterior is consistent under suitable conditions on the prior and
413: for any measurable regression function (see \autoref{c:proof} for
414: details): an issue which none of the ``Bayesian CART'' or ``Voronoi
415: Partition'' authors address at all. This thesis also gives an explicit
416: Markov chain Monte Carlo algorithm (see \autoref{s:algo}). Broadly
417: speaking it is a fairly standard birth-death Markov chain as
418: considered by Geyer and M{\o}ller~\cite{geyer:moller:1994}, but the
419: technicalities of the analysis seem to be somewhat different. This
420: thesis proceeds to show in detail that it satisfies detailed balance
421: by direct self-contained argumentation; further, the chain is shown to
422: have an ergodicity property (see \autoref{s:ergoproof}). These
423: considerations are often glossed over in modern writing.
424: 
425: On the more practical side, \autoref{c:examples} scrutinizes the
426: behavior of the posterior mean estimate under a variety of carefully
427: designed simulation experiments. These experiments serve both to
428: analyze the posterior mean and to give insight into the relationship
429: between Bayesian methods and their classical counterparts. See for
430: example the discussion of CART and bagging in \autoref{s:cartexp}.
431: 
432: \section{Other Approaches}\label{s:otherlit}
433: The literature on classification and regression methods is huge; the
434: interested reader is urged to consult good modern books on the subject
435: like {\em The Elements of Statistical Learning,} by Hastie,
436: Tibshirani, and Friedman~\cite{HTF:2001}. The following paragraphs
437: outline some of the methods that have had the most impact upon the author.
438: 
439: In the statistics literature, classical approaches to the
440: classification and binary regression problem include logistic
441: regression, Fisher's discriminant analysis, and projection pursuit
442: methods. Logistic regression specifies that the success probability
443: regression function is such that its log-odds follows a linear model
444: with a user specified basis (e.g. by using polynomial or spline
445: functions of the covariate-data) and estimates the parameters by
446: maximum likelihood. Model selection is commonly performed using
447: classical methods to select a subset of the covariate
448: variables. Fisher's discriminant analysis finds a hyperplane which
449: ``optimally'' separates the two classes using a within versus between
450: variance criterion. Projection pursuit seeks an interesting linear (or
451: sometimes nonlinear) projection of the covariate-data onto a lower
452: dimensional subspace (e.g. $\bbR$). Various criteria have been
453: proposed to define ``interesting,'' some of which are suitable for the
454: classification problem. Each of these methods has undergone a variety
455: of generalizations and tweaks to address a wider range of problems over
456: the years.
457: 
458: The first {\em general} method to solve the classification problem
459: automatically was the $k$-nearest neighbor
460: approach~\cite{cover:hart:1967}. $k$-Nearest neighbor estimates are
461: known to be universally consistent if $k=k(n) \tendsup \infty$ slowly
462: enough~\cite{devroye:gyorfi:lugosi:1996}. Their convergence, however,
463: especially in high dimensional problems, can be slow in
464: practice~\cite{friedman:1996}.
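
For concreteness, here is a sketch of the estimate in assumed notation:
given data $(x_1,y_1),\dots,(x_n,y_n)$ with $y_i \in \{0,1\}$, let
$N_k(x)$ be the set of indices of the $k$ data points nearest to $x$
(with ties broken arbitrarily). The $k$-nearest neighbor estimate of
the success probability is
\[
\hat{f}_n(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i ,
\]
and the associated classifier predicts $1$ when
$\hat{f}_n(x) > \frac{1}{2}$. The phrase ``slowly enough'' is usually
taken to mean that $k(n) \to \infty$ while $k(n)/n \to 0$.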
465: 
466: Local regression methods are a clever extension of
467: this approach. To predict at a given point, instead of averaging the
468: values given at the neighbors, they fit a low-order linear model to a
469: locally-weighted version of the data set~\cite{clev:load:1996}.
470: 
471: Trees~\cite{bfos:1984} and neural nets~\cite{ripley:1996} differ in
472: that they search through a globally-parametrized class of
473: functions. In all of these methods, cross-validation is often employed
474: to estimate frequentist ``out-of-sample'' performance and select a
475: regularization parameter which governs the trade-off between bias and
476: variance~\cite{HTF:2001}.
477: 
478: Wavelet methods are in some ways a compromise between the local and
479: the global approaches mentioned above. They fit an explicit global
480: linear model to the data, but the basis elements in this model are
481: carefully constructed to maintain ``localization'' (in space and
482: frequency domains). They boast powerful asymptotic compression and
483: approximation properties, computationally efficient transforms, and
484: can employ special thresholding methods which ``optimally'' choose
485: which coefficients in the model are kept~\cite{donoho:johnstone:1994}. However,
486: their practical use seems to remain concentrated on the case of
487: regularly-spaced regression data. Some recent papers address this
488: shortcoming~\cite{daubechies}.
489: 
490: Support vector machines (SVMs)~\cite{vapnik:1996} employ a
491: ``kernel-trick'' to reduce consideration of a certain
492: globally-parametrized model class to consideration of an equivalent
493: linear model class in an abstract Hilbert space. The
494: estimated decision rule corresponds to the solution of a convex
495: optimization problem. The objective function of this problem still involves an
496: unknown regularization parameter. In practice, this parameter is often
497: chosen by cross-validation, but, in principle, it can be chosen
498: through consideration of the structural risk minimization (SRM)
499: paradigm. The advantage of using the SRM paradigm is that one obtains
500: provably valid confidence statements about the error rate that will
501: obtain on future data. Moreover, these confidence bounds improve at an
502: exponential rate in the number of data points. With realistic sample
503: sizes, however, the bounds are often too crude to be of practical
504: use. There are hidden connections between SVMs and (1) Bayesian
505: methods employing Gaussian-process priors on the regression function
506: (including the generalized spline methods of Wahba~\cite{wahba,wahba:svms}) and (2)
507: projection pursuit regression~\cite{coram:svm:pp}.
508: 
509: Bagging~\cite{breiman96bagging} and
510: boosting~\cite{freund:schapire:1996a} are meta-algorithms that
511: ``boost'' the performance of other classification algorithms (especially
512: trees) by taking carefully chosen weighted averages of the results of
513: the bagged (respectively, boosted) algorithm. There are close
514: connections between boosting and the Lasso penalty, which itself is
515: closely related to the least angle regression method
516: (LARS)~\cite{efron:lars}.
517: \nocite{friedman:mars}
518: 
519: % LocalWords:  frequentist von Mises submodel resp Diaconis rom Doob Doob's der
520: % LocalWords:  Wald's Kullback Leibler Ghoshal Ghosh Vaart Shen Schervish cgm
521: % LocalWords:  exchangability nonparametric Bernstein's Poissonization Chipman
522: % LocalWords:  McCulloch MCMC Denison Mallick wavelet Friedman's Voronoi dhms
523: % LocalWords:  voronoi marginalized overfitting favoured Green's discretizing
524: % LocalWords:  subjectivism Denison's Scargle cell's Scargle's ller Skare Geyer
525: % LocalWords:  Kendal Polya stagewise jittered Monte algo Moller ergoproof SVMs
526: % LocalWords:  cartexp Hastie Tibshirani Fisher's hyperplane regularization SRM
527: % LocalWords:  Wahba
528: