math0603228/ch1.tex
1: \newcommand{\X}{\mathcal{X}}
2: \newcommand{\Y}{\mathcal{Y}}
3: 
4: \newcommand{\Cvec}{\mathbf{C}}
5: \newcommand{\cvec}{\mathbf{c}}
6: 
7: In this thesis we consider the problem of binary regression and the
8: related classification (discrimination) problem. Specifically, the
9: data we consider is a list of pairs $(X_1,Y_1), (X_2,Y_2), \dots, \breakpt
(X_n, Y_n)$. Each $X_i$ is an explanatory variable through which we attempt
to ``predict'' the corresponding label $Y_i$, which is either $0$ or
12: $1$. We model these pairs as being drawn $\iid$ from an unknown
13: distribution $F_0$ which we decompose as:
14: \[
15: dF_0\left( (x,y)
16: \right)=\big[f_0(x)\1{y=1}+(1-f_0(x))\1{y=0}\big]
17: \mu(dx)\eta(dy).
18: \]
19: Here $\1{}$ signifies the indicator function; $f_0(x)$ represents the
20: probability that $Y=1$ given that $X=x$; $\mu$ is the marginal
21: distribution of $X$ over the space $\X$; and $\eta$ is counting
22: measure on the set $\Y=\{0,1\}$. This decomposition is unique except
that, technically, $f_0$ is determined by $F_0$ only $\mu$-a.e.
24: 
25: In binary regression, our goal is to estimate $f_0$ (say in an $\cL^2$
or $\cL^1$ sense). For classification, we are concerned only with
predicting the most likely value of $Y$; equivalently, we
are concerned with estimating the set $\{x:
f_0(x)>\frac{1}{2}\}$. There are a great many ways to proceed, as
30: evidenced by the vast literature on these subjects. Some references
31: are given in \secref{@@}.
32: 
\section{An Example}\label{s:example}
34: 
35: To get started, let us consider the specific case in which
36: $\X=[0,1]$ and $\mu$ is uniform measure. Further, I choose a specific
37: $f_0$ which is piecewise continuous with two constant regions and a
38: smooth transition region. I take $f_0(x)$ to be:
\[
0.6\,\1{0 \leq x < \frac{1}{6}} + 0.4\,\1{\frac{1}{6} \leq x \leq
\frac{1}{2}} + \Psi(x)\,\1{\frac{1}{2} < x \leq 1},
\]
where
\[
\Psi(x)=\frac{\phi_\sigma(x-\frac{1}{2})}{\phi_\sigma(x-\frac{1}{2})+\phi_\sigma(x-1)}
\]
and $\phi_\sigma$ is the p.d.f.\ of a normal distribution with mean $0$ and
standard deviation $\sigma=0.25$.
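
The example above can be sketched in code. The following is a minimal Python illustration, not the code used for the thesis; the names \texttt{normal\_pdf}, \texttt{psi}, \texttt{f0} and \texttt{sample\_data} are hypothetical and introduced only here. It evaluates the example $f_0$ and draws pairs $(X,Y)$ with $X$ uniform on $[0,1]$ and $Y$ equal to $1$ with probability $f_0(X)$.

```python
import math
import random

SIGMA = 0.25  # standard deviation of phi_sigma, as stated in the text


def normal_pdf(z, sigma=SIGMA):
    """Density of a N(0, sigma^2) random variable at z."""
    return math.exp(-0.5 * (z / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def psi(x):
    """Smooth transition Psi(x), used on (1/2, 1]."""
    left = normal_pdf(x - 0.5)
    right = normal_pdf(x - 1.0)
    return left / (left + right)


def f0(x):
    """Example regression function P(Y = 1 | X = x) on [0, 1]."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if x < 1.0 / 6.0:
        return 0.6
    if x <= 0.5:
        return 0.4
    return psi(x)


def sample_data(n, rng):
    """Draw n i.i.d. pairs (X, Y): X uniform on [0, 1), Y ~ Bernoulli(f0(X))."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if rng.random() < f0(x) else 0
        data.append((x, y))
    return data
```

For instance, \texttt{sample\_data(16384, random.Random(0))} would produce a data set of the size used later in this chapter; the seed is arbitrary.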
52: 
53: \begin{figure}
54: \noindent
55: \begin{minipage}[t]{\linewidth}
56:   \centering%\epsfig{figure=figs/prob1plot.ps,height=5cm,angle=-90}
57:   \caption[]{$f(x)$}
58: \end{minipage}
59: \end{figure}
60: 
61: \section{A Nonparametric Prior on Regression Functions}
62: 
In the Bayesian approach, we specify our uncertainty about $F_0$, or,
equivalently, $f_0$, by specifying a probability distribution on
them. Ideally, we want this distras
randomly chosen.

To specify our prior, I explain how to choose an $f$ at random
from it. First, choose $K$, the number of split points, where:
68: \[
P(K=k)=(1-\alpha)\alpha^k \quad \textrm{for $k=0,1,2,\dots$}
70: \] 
71: 
72: That is, $K+1$ (the number of intervals) is geometric with parameter
73: $1-\alpha$.
74: 
Now, conditional on $K=k$, choose $C_1, C_2, \dots, C_k$ \iid from
$U(0,1)$. Let $C_{(i)}$ (for $i=1, \dots, k$) be the $i$th smallest
of the $C$'s. This produces $k+1$ intervals:
78: \[
79: I_1=\{0 \leq x < C_{(1)}\}, I_2=\{C_{(1)} \leq x < C_{(2)}\}, \dots,
80: I_{k+1}=\{C_{(k)} \leq x \leq 1\}
81: \]
82: 
Finally, conditional on $K=k$, choose $k+1$ values $D_i$
(for $i=1, \dots, k+1$) \iid from $U(0,1)$. This generates a random
function $f$:
86: 
87: \[
88: f(x)=\sum_{i=1}^{k+1} \1{x \in I_{i}} D_i
89: \]
90: 
91: As a default, take $\alpha=\frac{1}{2}$.
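
The three sampling steps above can be sketched as follows. This is a minimal Python illustration under the stated prior; the function names \texttt{sample\_prior} and \texttt{evaluate} are hypothetical and not taken from the thesis.

```python
import bisect
import random


def sample_prior(rng, alpha=0.5):
    """Draw (split_points, levels) from the prior.

    split_points: sorted list of K points, where P(K = k) = (1 - alpha) * alpha**k.
    levels: K + 1 i.i.d. U(0, 1) heights, one per interval.
    """
    k = 0
    while rng.random() < alpha:  # each pass adds a split point with probability alpha
        k += 1
    split_points = sorted(rng.random() for _ in range(k))
    levels = [rng.random() for _ in range(k + 1)]
    return split_points, levels


def evaluate(split_points, levels, x):
    """Evaluate the piecewise constant f at x in [0, 1]."""
    # bisect_right counts the split points <= x, which is the 0-based index
    # of the half-open interval [C_(i), C_(i+1)) containing x.
    return levels[bisect.bisect_right(split_points, x)]
```

Storing the split points in sorted order lets each evaluation of $f$ use a binary search rather than a scan over all intervals.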
92: 
This prior is related to the prior described in Diaconis and
Freedman~\cite{df2}, but it differs in that the positions of the
split points are random. It is similar, however, in that it specifies a
prior on piecewise constant functions.
97: 
98: 
99: 
100: 
101: \section{The Posterior}
102: 
Figure~\ref{f:post1} shows an example of the posterior mean under this
prior after seeing $16{,}384$ points randomly labeled according to the
function $f_0$ described in Section~\ref{s:example}. The solid grey
curve is the true $f_0$, as before. The solid black curve is the result
of running $30{,}000$ MCMC iterations following the general procedures
described in M{\o}ller~\cite{moller}.
109: 
To give some sense of how well we
are doing, a very na\"{\i}ve ``frequentist'' estimate is
superimposed. Specifically, the unit interval is divided into $20$
equal-width segments. For each segment, we count the data lying within
it, giving $k_0$ 0's and $k_1$ 1's. Each segment is then labeled with
three horizontal bars. The
middle bar is the sample mean on that segment
(i.e.\ $\hat{d}=k_1/(k_0+k_1)$). The upper and lower bars give a na\"{\i}ve
95\% confidence interval for the value there via $\hat{d} \pm
2\sqrt{\hat{d}(1-\hat{d})/(k_0+k_1)}$. We can see that there is still
plenty of noise left in the data, despite its apparent abundance.
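
The binned estimate just described can be sketched in a few lines. This is an illustrative Python version only; the name \texttt{binned\_estimate} and the choice to return \texttt{None} for an empty segment are assumptions made here, not part of the original procedure.

```python
import math


def binned_estimate(data, n_bins=20):
    """Return, per segment, (lower, mean, upper), or None if the segment is empty.

    data is an iterable of (x, y) pairs with x in [0, 1] and y in {0, 1}.
    """
    ones = [0] * n_bins
    totals = [0] * n_bins
    for x, y in data:
        j = min(int(x * n_bins), n_bins - 1)  # x = 1 falls in the last segment
        totals[j] += 1
        ones[j] += y
    bars = []
    for k1, n in zip(ones, totals):
        if n == 0:
            bars.append(None)
            continue
        d_hat = k1 / n  # sample mean k1 / (k0 + k1)
        half_width = 2.0 * math.sqrt(d_hat * (1.0 - d_hat) / n)
        bars.append((d_hat - half_width, d_hat, d_hat + half_width))
    return bars
```

Note that the interval collapses to a point when a segment contains only one kind of label, which is one reason the text calls it na\"{\i}ve.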
121: 
Roughly speaking, the posterior mean seems to be doing a good job of
finding the split points between the piecewise constant regions. It
manages to approximate the continuously varying region fairly well,
while detecting little deviation from constancy on the two constant intervals.
126: 
127: \section{Computing the Posterior}
128: 
129: Computation of the posterior is simplified by identifying the
posterior distribution of the $k+1$ values $D_i$. If we
condition on $K=k$ and on the locations of the split points then, a
priori, the $D_i$ are independent $U(0,1)$. A posteriori, they are
(conditionally) independent and beta distributed with parameters:
134: \begin{align*}
135: a_i&=(\textrm{\# of data points in $I_i$ labeled 1})+1 \\
136: b_i&=(\textrm{\# of data points in $I_i$ labeled 0})+1.
137: \end{align*}
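
For readers who want the intermediate step, here is a short sketch of why these parameters arise. The symbols $n_{i,1}$ and $n_{i,0}$ are introduced only for this derivation and denote the numbers of data points in $I_i$ labeled 1 and 0, respectively. Conditional on $K=k$ and the split points, the likelihood factors over the intervals and the prior density of each $D_i$ is constant on $[0,1]$, so
\[
p(d_1,\dots,d_{k+1} \mid \textrm{data}) \propto \prod_{i=1}^{k+1}
d_i^{\,n_{i,1}}(1-d_i)^{\,n_{i,0}}, \qquad 0 \leq d_i \leq 1.
\]
Each factor is an unnormalized beta density with parameters $n_{i,1}+1$ and $n_{i,0}+1$, which agrees with $a_i$ and $b_i$ above.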
138: 
139: Using this fact, we can integrate out the $D$'s from our
140: posterior (i.e. we do not have to explicitly simulate realizations of
141: the $D$'s since this completely describes their (conditional)
142: posterior distribution).
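
As an illustration of how the conditional posterior of the $D$'s is used, the sketch below computes $(a_i,b_i)$ for given split points and the corresponding conditional posterior means $a_i/(a_i+b_i)$. It is a minimal Python example; the function names are hypothetical, and the split points are assumed to be supplied in sorted order.

```python
import bisect


def beta_parameters(split_points, data):
    """Return [(a_i, b_i)] for the k + 1 intervals defined by sorted split_points."""
    # Start from (1, 1): the U(0, 1) prior on each D_i is a Beta(1, 1).
    params = [[1, 1] for _ in range(len(split_points) + 1)]
    for x, y in data:
        i = bisect.bisect_right(split_points, x)  # 0-based interval index of x
        if y == 1:
            params[i][0] += 1
        else:
            params[i][1] += 1
    return [tuple(p) for p in params]


def conditional_posterior_means(split_points, data):
    """E[D_i | data, K = k, C = c] = a_i / (a_i + b_i) for each interval."""
    return [a / (a + b) for a, b in beta_parameters(split_points, data)]
```

Averaging such conditional means over sampled configurations of split points is one way a posterior mean of $f$ could be estimated without ever simulating the $D$'s.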
143: 
144: All that remains, then, is to perform a random walk over the point
145: process with $K=k$ points $C_1, \dots, C_k$. Our posterior weight on a
146: particular choice $K=k$ and $\Cvec=\cvec$ (treating the $C$'s as a random
147: vector) is proportional to the prior
148: probability times the predictive probability of the data (having
149: integrated out the $D$'s).
150: 
Specifically, denote the predictive probability of the data under $K=k$ and
$\Cvec=\cvec$ by
\[
\psi(k, \cvec)=\prod_{i=1}^{k+1} B(a_i,b_i),
\]
where $B(a,b)=\int_0^1 t^{a-1}(1-t)^{b-1}\,dt$ is the beta integral,
and denote the density of the prior (with respect to a Poisson point process
with parameter $\lambda=1$) by
\[
\pi(k,\cvec)=@@.
\]
Then our posterior has density proportional to
\[
\pi(k,\cvec)\psi(k, \cvec).
\]
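
The predictive term $\psi(k,\cvec)$ is a product of beta integrals and is best computed on the log scale to avoid underflow with thousands of data points. The following Python sketch shows one way to do this using the standard identity $B(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b)$; the function names are hypothetical, and the prior term $\pi(k,\cvec)$ is deliberately not included.

```python
import bisect
import math


def log_beta(a, b):
    """log B(a, b) via log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_psi(split_points, data):
    """log psi(k, c): log predictive probability of the labels given sorted split points."""
    # counts[i] = [number of 0 labels, number of 1 labels] in interval i
    counts = [[0, 0] for _ in range(len(split_points) + 1)]
    for x, y in data:
        counts[bisect.bisect_right(split_points, x)][y] += 1
    # a_i = (# labeled 1) + 1 and b_i = (# labeled 0) + 1
    return sum(log_beta(n1 + 1, n0 + 1) for n0, n1 in counts)
```

As a small check, with the two points $(0.2,1)$ and $(0.8,0)$ a split at $0.5$ gives $\psi=B(2,1)\,B(1,2)=1/4$, whereas no split gives $\psi=B(2,2)=1/6$, so the predictive term favors the split that separates the labels.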
168: 
169: To sample from this posterior, we make proposals according to a base
170: chain, and accept the proposals according to the Metropolis-Hastings
algorithm. At each step, our base chain proposes one of six actions,
chosen at random with the following probabilities:
173: \begin{verbatim}
174:  1         2         3         4         5         6
175:  0.0476    0.0476    0.2381    0.1429    0.4762    0.0476
176: \end{verbatim}
177: 
Action 1 adds a new coordinate chosen uniformly on $[0,1]$. Action 2
deletes a coordinate at random. Action 3 chooses a random coordinate
and moves it by a small random normal amount (standard deviation $0.1$);
no action is taken if the move is invalid. Action 4 chooses a random
coordinate and moves it to a new value chosen uniformly from
$[0,1]$. Action 5 moves all coordinates by a very small random normal
amount (standard deviation $0.01$); a coordinate does not move if this would
take it outside the interval, but all valid moves are still
taken. Action 6 re-randomizes all the coordinates with uniform values
from $[0,1]$.
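
The six proposal actions can be sketched as below. This Python example covers only the base chain's proposal step, not the Metropolis-Hastings acceptance step. Two details are assumptions made here rather than statements from the text: an ``invalid'' move in Action 3 is read as one that leaves $[0,1]$, and Actions 2 through 6 are treated as doing nothing when there are no coordinates. The tabulated probabilities are taken to be $(1,1,5,3,10,1)/21$, which match the rounded values shown.

```python
import random

# Action probabilities from the table: (1, 1, 5, 3, 10, 1) / 21.
ACTION_WEIGHTS = [1, 1, 5, 3, 10, 1]


def propose(points, rng):
    """Return (action, proposal); proposal is a new list of split points.

    points is the current list of split points in [0, 1]; it is not modified.
    """
    action = rng.choices(range(1, 7), weights=ACTION_WEIGHTS)[0]
    new = list(points)
    if action == 1:  # add a coordinate, uniform on [0, 1]
        new.append(rng.random())
    elif not new:  # assumption: actions 2-6 are no-ops without coordinates
        pass
    elif action == 2:  # delete a coordinate at random
        new.pop(rng.randrange(len(new)))
    elif action == 3:  # small normal move of one coordinate (sd 0.1)
        i = rng.randrange(len(new))
        moved = new[i] + rng.gauss(0.0, 0.1)
        if 0.0 <= moved <= 1.0:  # an invalid move is ignored
            new[i] = moved
    elif action == 4:  # redraw one coordinate uniformly
        new[rng.randrange(len(new))] = rng.random()
    elif action == 5:  # very small normal move of every coordinate (sd 0.01)
        for i, c in enumerate(new):
            moved = c + rng.gauss(0.0, 0.01)
            if 0.0 <= moved <= 1.0:  # coordinates with invalid moves stay put
                new[i] = moved
    else:  # action 6: redraw all coordinates uniformly
        new = [rng.random() for _ in new]
    return action, new
```

The mix of moves is heavily weighted toward small local perturbations (Actions 3 and 5), with occasional global moves that help the chain escape poor configurations.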
188: 
189: \begin{figure}
190: \noindent
191: \begin{minipage}[t]{\linewidth}
192:   \centering%\epsfig{figure=../unifpost_N16384_05_1.ps,height=7cm,angle=90}
  \caption[]{Posterior mean et al.}\label{f:post1}
194: \end{minipage}
195: \end{figure}
196: 
A selection of popular algorithms includes: nearest
neighbors, local regression (e.g.\ kernel methods and loess-type
algorithms), ``global'' regressions (e.g.\ neural nets and support
vector machines), and partition-based methods (e.g.\ CART). (bagging/boosting?)
201: Most of these methods can be described (at least in principle) as
202: specifying a certain class of functions (either locally or globally)
203: and choosing among the candidates by optimizing some criterion which
204: is intended to balance the ever-important tradeoff between fidelity to
205: the data and the possibility of ``over-fitting.'' There are at least
206: three different competing ``philosophies'' for how exactly such
207: criteria should be chosen. Traditional frequentists tend to evaluate
208: criteria in terms of ``bias/variance tradeoffs.''  Typically, they
209: propose criteria which utilize theoretically motivated point estimates
210: of risk (e.g. $C_p$, GCV, and cross-validation). Advocates of
211: ``Structural Risk Minimization,'' on the other hand, penalize a
212: function according to the complexity of the function class that
213: contains it (according to a pre-decided ``structure'') in order to
214: maintain upper-confidence levels on the risk. A purely Bayesian
215: approach, however, would be to specify whatever subjective prior the
216: practitioner happens to believe, and choose the estimate which
217: minimizes \emph{posterior} risk.
218: 
219: Each approach has its advantages and disadvantages. The frequentist
220: approach has the
221: most direct information about the (asymptotic) sampling variation of its estimates
222: (under a fixed truth). Little, however, can typically be said about
223: the finite sample properties of its estimates for general $F$. Structural Risk Minimization offers
224: theoretically sound confidence statements about its particular risk on
225: a particular data set, under sampling from an arbitrarily complicated
$F$. It does not, however, have anything to say about the
finite-sample optimality of its estimation scheme under any
particular (prior) scenario; this is especially true if one considers
a scenario in which arbitrary $F$ may be possible
but are vanishingly unlikely. A subjective Bayesian approach offers
the assurance that it behaves optimally for at least some scenario (the
one specified by its prior). It typically, however, shows little concern for
its sampling variation (under a particular $F$) and little interest in
offering uniformly valid confidence statements about its risk.
234: 
235: It is the opinion of this paper that the best chance of attaining all
236: of the advantages outlined above is to employ the Bayesian approach,
237: but carefully choose the prior to satisfy both of the ``empiricist'' goals.
238: 
239: \begin{thebibliography}{1}
240: \bibitem{df1}
P. Diaconis and D. Freedman (1986). On the Consistency of Bayes Estimates. \emph{Ann. Stat.} \textbf{14}, 1--26.
\bibitem{df2}
P. Diaconis and D. Freedman (1990). On the Uniform Consistency of Bayes Estimates for Multinomial Probabilities. \emph{Ann. Stat.} \textbf{18}, 1317--1327.
\bibitem{df3}
P. Diaconis and D. Freedman (1993). Nonparametric Binary Regression: A
Bayesian Approach. \emph{Ann. Stat.} \textbf{21}, 2108--2137.
\bibitem{moller}
J. M{\o}ller (1999). Stochastic Geometry: Likelihood and Computation
(chapter 4). O.~Barndorff-Nielsen, W.~Kendall, and M.~van~Lieshout,
eds. Chapman and Hall.
\bibitem{mollerskare}
J. M{\o}ller and {\O}. Skare (??). Bayesian Image Analysis with Coloured Voronoi Tessellations and a View to Application in Reservoir Modelling.
253: \end{thebibliography}
254: 
255: We
256: also define the function:
257: \[
258: \tilde{f_0}(x,y)=f_0(x)\1{y=1}+(1-f_0(x))\1{y=0}
259: \]
260: and when it does not result in confusion we drop the tilde.
261: