
1: \newcommand{\X}{\mathcal{X}}

2: \newcommand{\Y}{\mathcal{Y}}

3:

4: \newcommand{\Cvec}{\mathbf{C}}

5: \newcommand{\cvec}{\mathbf{c}}

6:

7: In this thesis we consider the problem of binary regression and the

8: related classification (discrimination) problem. Specifically, the

9: data we consider is a list of pairs $(X_1,Y_1), (X_2,Y_2), \dots, \breakpt

(X_n, Y_n)$. Each $X$ is an explanatory variable through which we attempt

11: to ``predict'' the corresponding label $Y$ which is either $0$ or

12: $1$. We model these pairs as being drawn $\iid$ from an unknown

13: distribution $F_0$ which we decompose as:

\[
dF_0\left((x,y)\right)=\big[f_0(x)\1{y=1}+(1-f_0(x))\1{y=0}\big]\,\mu(dx)\,\eta(dy).
\]

19: Here $\1{}$ signifies the indicator function; $f_0(x)$ represents the

20: probability that $Y=1$ given that $X=x$; $\mu$ is the marginal

21: distribution of $X$ over the space $\X$; and $\eta$ is counting

22: measure on the set $\Y=\{0,1\}$. This decomposition is unique except

that, technically, $f_0$ is determined by $F_0$ only $\mu$-a.e.

24:

25: In binary regression, our goal is to estimate $f_0$ (say in an $\cL^2$

26: or $\cL^1$ sense). For classification, we are only concerned with

being able to predict the most likely value of $Y$; equivalently, we

28: are concerned with estimating the set $\{x:

29: f_0(x)>\frac{1}{2}\}$. There are a great many ways to proceed, as

30: evidenced by the vast literature on these subjects. Some references

31: are given in \secref{@@}.

32:

\section{An Example}\label{s:example}

34:

35: To get started, let us consider the specific case in which

$\X=[0,1]$ and $\mu$ is the uniform measure. Further, I choose a specific

37: $f_0$ which is piecewise continuous with two constant regions and a

38: smooth transition region. I take $f_0(x)$ to be:

\[
\1{0 \leq x < \frac{1}{6}}\, 0.6 + \1{\frac{1}{6} \leq x \leq \frac{1}{2}}\, 0.4 + \1{\frac{1}{2} < x \leq 1}\, \Psi(x)
\]

43:

where $\Psi(x)$ is

45:

46: \[

47: \Psi(x)=\frac{\phi_\sigma(x-\frac{1}{2})}{\phi_\sigma(x-\frac{1}{2})+\phi_\sigma(x-1)}

48: \]

49:

50: and where $\phi_\sigma$ is the p.d.f. of a normal with mean 0 and

51: standard deviation $\sigma=0.25$.
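
As a quick check on this definition (a worked simplification, not part of the original construction), the common normalizing constants of the two normal densities cancel and $\Psi$ reduces to a logistic curve:
\[
\Psi(x)=\frac{1}{1+\exp\!\left(\frac{(x-\frac{1}{2})^2-(x-1)^2}{2\sigma^2}\right)}
=\frac{1}{1+\exp\!\left(\frac{x-\frac{3}{4}}{2\sigma^2}\right)}
=\frac{1}{1+e^{8(x-3/4)}} \quad \textrm{for $\sigma=0.25$.}
\]
Hence $f_0$ jumps from $0.4$ up to $\Psi(\frac{1}{2})=1/(1+e^{-2})\approx 0.88$ just to the right of $x=\frac{1}{2}$, crosses $\frac{1}{2}$ at $x=\frac{3}{4}$, and decreases smoothly to $\Psi(1)=1/(1+e^{2})\approx 0.12$.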

52:

53: \begin{figure}

54: \noindent

55: \begin{minipage}[t]{\linewidth}

56:   \centering%\epsfig{figure=figs/prob1plot.ps,height=5cm,angle=-90}

  \caption[]{$f_0(x)$}

58: \end{minipage}

59: \end{figure}

60:

61: \section{A Nonparametric Prior on Regression Functions}

62:

63: In the Bayesian approach, we specify our uncertainty about $F_0$, or,

64: equivalently, $f_0$ by specifying a probability distribution on

them. Ideally, we want this distras
randomly chosen. To specify our prior, I explain how to choose an $f$ at random
from it. First, choose $K$, the number of split points, where:

68: \[

P(K=k)=(1-\alpha)\alpha^k \quad \textrm{for $k=0,1,2,\dots$}

70: \]

71:

72: That is, $K+1$ (the number of intervals) is geometric with parameter

73: $1-\alpha$.

74:

Now, conditional on $K=k$, choose $C_1, C_2, \dots, C_k$ \iid from
$U(0,1)$. Let $C_{(i)}$ (for $i=1, \dots, k$) be the $i$th smallest
of the $C$'s. This produces $k+1$ intervals:

\[
I_1=\{0 \leq x < C_{(1)}\},\quad I_2=\{C_{(1)} \leq x < C_{(2)}\},\quad \dots,\quad
I_{k+1}=\{C_{(k)} \leq x \leq 1\}
\]

82:

83: Finally, conditional on $K=k$, choose $k+1$ values $D_i$

(for $i=1, \dots, k+1$) \iid from $U(0,1)$. This generates a random

85: function $f$:

86:

87: \[

88: f(x)=\sum_{i=1}^{k+1} \1{x \in I_{i}} D_i

89: \]

90:

91: As a default, take $\alpha=\frac{1}{2}$.
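
The three sampling steps above can be sketched in code. The following is a minimal Python illustration under stated assumptions, not the implementation used for the figures; the function names \texttt{sample\_prior} and \texttt{evaluate} are hypothetical, and the draw of $K$ by repeated coin flips is just one convenient way to realize $P(K=k)=(1-\alpha)\alpha^k$.

```python
import bisect
import random


def sample_prior(rng, alpha=0.5):
    """Draw one piecewise-constant f from the prior as (split points, heights)."""
    # Keep adding split points with probability alpha, so P(K = k) = (1 - alpha) * alpha**k.
    k = 0
    while rng.random() < alpha:
        k += 1
    splits = sorted(rng.random() for _ in range(k))   # ordered values C_(1) <= ... <= C_(k)
    heights = [rng.random() for _ in range(k + 1)]    # D_1, ..., D_{k+1}, one per interval
    return splits, heights


def evaluate(splits, heights, x):
    """Return f(x), the height D_i of the interval I_i that contains x."""
    # bisect_right counts the split points <= x, which is the 0-based index of I_i.
    return heights[bisect.bisect_right(splits, x)]


rng = random.Random(0)
splits, heights = sample_prior(rng)
values = [evaluate(splits, heights, i / 10) for i in range(11)]
```

Storing only the sorted split points and the heights keeps each evaluation of $f$ to a single binary search.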

92:

This prior is related to the prior described in Diaconis and
Freedman~\cite{df2}, but differs in that the positions of the
split points are random. It is similar, however, in that it specifies a
prior on piecewise constant functions.

97:

98:

99:

100:

101: \section{The Posterior}

102:

Figure~\ref{f:post1} shows an example of the posterior mean under this
prior after observing $16{,}384$ points randomly labeled according to the
function $f_0$ described in Section~\ref{s:example}. The solid grey
curve is the true $f_0$, as before. The solid black curve is the result
of running $30{,}000$ MCMC iterations following the general procedures
described in M{\o}ller~\cite{moller}.

109:

To give some sense of how well we
are doing, a very na\"ive ``frequentist'' estimate is
superimposed. Specifically, the unit interval is divided into $20$
equal-width segments. For each segment, all the data lying within
the segment are considered, giving $k_0$ 0's and $k_1$ 1's within the
segment. Each segment is then labeled with three horizontal bars. The
middle bar is the sample mean on that segment
(i.e., $\hat{d}=k_1/(k_0+k_1)$). The upper and lower bars give a na\"ive
95\% confidence interval for the value there via $\hat{d} \pm
2\sqrt{\hat{d}(1-\hat{d})/(k_0+k_1)}$. We can see that there is still
plenty of noise left in the data, despite its apparent abundance.
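
To put a rough number on this noise (an illustrative calculation, assuming the $16{,}384$ points of the example fall roughly evenly across the $20$ segments, as they would on average under uniform $\mu$), each segment holds about $16{,}384/20 \approx 819$ points. On a segment where $\hat{d} \approx 0.4$, the half-width of the na\"ive interval is then about
\[
2\sqrt{\frac{0.4 \times 0.6}{819}} \approx 0.034,
\]
so each set of bars pins the local value of $f_0$ down only to within roughly $\pm 0.03$.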

121:

122: Roughly speaking, the posterior mean seems to be doing a good job of

123: finding the split points between the piecewise constant regions. It

manages to approximate the continuously varying region fairly well,
while detecting little deviation from constancy on the two constant intervals.

126:

127: \section{Computing the Posterior}

128:

129: Computation of the posterior is simplified by identifying the

130: posterior distribution of the $(k+1)$ values $D_i$. If we

condition on $K=k$ and on the locations of the split points, then, a
priori, the $D_i$ are independent $U(0,1)$. A posteriori, they are

133: (conditionally) independent and beta distributed with parameters:

134: \begin{align*}

135: a_i&=(\textrm{\# of data points in $I_i$ labeled 1})+1 \\

136: b_i&=(\textrm{\# of data points in $I_i$ labeled 0})+1.

137: \end{align*}
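
The beta form follows from a one-line conjugacy calculation, sketched here with $n_1$ and $n_0$ as shorthand (not used elsewhere in the text) for the numbers of data points in $I_i$ labeled $1$ and $0$. Given the split points, the labels in $I_i$ are independent Bernoulli draws with success probability $D_i$, so with the uniform prior density on $D_i$,
\[
p(d \mid \textrm{data in } I_i) \;\propto\; d^{\,n_1}(1-d)^{\,n_0}\cdot 1
= d^{\,a_i-1}(1-d)^{\,b_i-1}, \qquad 0\le d\le 1,
\]
which is the beta density with parameters $(a_i,b_i)$ up to its normalizing constant $\int_0^1 t^{a_i-1}(1-t)^{b_i-1}\,dt$. For instance, an interval containing three points labeled $1$ and one labeled $0$ gives $a_i=4$, $b_i=2$ and posterior mean $a_i/(a_i+b_i)=2/3$.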

138:

139: Using this fact, we can integrate out the $D$'s from our

140: posterior (i.e. we do not have to explicitly simulate realizations of

141: the $D$'s since this completely describes their (conditional)

142: posterior distribution).

143:

144: All that remains, then, is to perform a random walk over the point

145: process with $K=k$ points $C_1, \dots, C_k$. Our posterior weight on a

146: particular choice $K=k$ and $\Cvec=\cvec$ (treating the $C$'s as a random

147: vector) is proportional to the prior

148: probability times the predictive probability of the data (having

149: integrated out the $D$'s).
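
The predictive probability with the $D$'s integrated out is a product of one beta integral per interval, which can be sketched as follows. This is a minimal Python illustration under assumptions: the function names \texttt{log\_beta} and \texttt{log\_predictive} are hypothetical, and working on the log scale is a choice made here because a product over thousands of data points would underflow.

```python
import bisect
import math


def log_beta(a, b):
    """log of the beta integral B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_predictive(splits, xs, ys):
    """log of prod_i B(a_i, b_i) for split points `splits` and data (xs, ys)."""
    splits = sorted(splits)
    ones = [0] * (len(splits) + 1)    # labels equal to 1 in each interval
    zeros = [0] * (len(splits) + 1)   # labels equal to 0 in each interval
    for x, y in zip(xs, ys):
        i = bisect.bisect_right(splits, x)   # 0-based index of the interval containing x
        if y == 1:
            ones[i] += 1
        else:
            zeros[i] += 1
    # a_i = (count of 1's) + 1 and b_i = (count of 0's) + 1, as in the beta posterior
    return sum(log_beta(n1 + 1, n0 + 1) for n1, n0 in zip(ones, zeros))


xs, ys = [0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0]
no_split = log_predictive([], xs, ys)       # B(3, 3) = 1/30
one_split = log_predictive([0.5], xs, ys)   # B(3, 1) * B(1, 3) = 1/9
```

In this toy data set a split at $0.5$ separates the labels perfectly, and the predictive probability rises from $1/30$ to $1/9$ accordingly.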

150:

151: Specifically, if we denote the predictive probability of the data under $K=k$ and

152: $\Cvec=\cvec$ by:

\[
\psi(k, \cvec)=\prod_{i=1}^{k+1} B(a_i,b_i),
\qquad B(a,b)=\int_0^1 t^{a-1}(1-t)^{b-1}\,dt,
\]

156:

and the density of the prior (with respect to a Poisson point process

158: with parameter $\lambda=1$) as:

159:

160: \[

161: \pi(k,\cvec)=@@.

162: \]

163:

then our posterior has density proportional to:

165: \[

166: \pi(k,\cvec)\psi(k, \cvec).

167: \]

168:

To sample from this posterior, we make proposals according to a base
chain and accept or reject them according to the Metropolis--Hastings
algorithm. At each step, our base chain proposes one of six actions, chosen
at random with probabilities:

\begin{verbatim}
 Action:       1         2         3         4         5         6
 Probability:  0.0476    0.0476    0.2381    0.1429    0.4762    0.0476
\end{verbatim}

177:

Action 1 adds a new coordinate chosen uniformly on $[0,1]$. Action 2
deletes a coordinate at random. Action 3 chooses a random coordinate
and moves it by a small normal increment (standard deviation $0.1$);
no action is taken if the move is invalid. Action 4 chooses a random
coordinate and moves it to a new value chosen uniformly from
$[0,1]$. Action 5 moves all coordinates by very small normal
increments (standard deviation $0.01$); a coordinate does not move if this would
propose a move outside the interval, but all valid moves are still
taken. Action 6 re-randomizes all the coordinates with uniform values
from $[0,1]$.
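
The six actions can be sketched as a single proposal function. The following Python illustration covers only the base chain (no acceptance step) and rests on several assumptions: the name \texttt{propose} is hypothetical, an ``invalid'' move is read as one that leaves $[0,1]$, actions 2 through 6 are taken to do nothing when there are no coordinates, and the tabulated probabilities are read as the weights $1:1:5:3:10:1$ out of $21$, which match the table to four decimals.

```python
import random

# Relative weights 1:1:5:3:10:1 (out of 21) reproduce the tabulated probabilities
# to four decimals; reading the table this way is an assumption.
WEIGHTS = [1, 1, 5, 3, 10, 1]


def propose(c, rng):
    """Return (action, proposed split points) from the current split points c."""
    c = list(c)                          # work on a copy; the caller's list is untouched
    action = rng.choices(range(1, 7), weights=WEIGHTS)[0]
    if action == 1:                      # add a uniform coordinate
        c.append(rng.random())
    elif not c:                          # assumption: actions 2-6 do nothing when empty
        pass
    elif action == 2:                    # delete a random coordinate
        c.pop(rng.randrange(len(c)))
    elif action == 3:                    # small normal move of one coordinate
        i = rng.randrange(len(c))
        moved = c[i] + rng.gauss(0.0, 0.1)
        if 0.0 <= moved <= 1.0:          # assumption: "invalid" means leaving [0, 1]
            c[i] = moved
    elif action == 4:                    # uniform redraw of one coordinate
        c[rng.randrange(len(c))] = rng.random()
    elif action == 5:                    # very small normal move of every coordinate
        for i, value in enumerate(c):
            moved = value + rng.gauss(0.0, 0.01)
            if 0.0 <= moved <= 1.0:      # out-of-range coordinates stay put
                c[i] = moved
    else:                                # action 6: redraw every coordinate
        c = [rng.random() for _ in c]
    return action, c


rng = random.Random(0)
state = [0.2, 0.5]
action, proposal = propose(state, rng)
```

Returning the action alongside the proposal lets a surrounding sampler apply a different acceptance rule to the dimension-changing moves (actions 1 and 2) than to the others.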

188:

189: \begin{figure}

190: \noindent

191: \begin{minipage}[t]{\linewidth}

192:   \centering%\epsfig{figure=../unifpost_N16384_05_1.ps,height=7cm,angle=90}

  \caption[]{Posterior mean et al.}\label{f:post1}

194: \end{minipage}

195: \end{figure}

196:

A selection of popular algorithms includes nearest
neighbors, local regression (e.g., kernel methods and loess-type
algorithms), ``global'' regressions (e.g., neural nets and support
vector machines), and partition-based methods (e.g., CART). % (bagging/boosting?)

201: Most of these methods can be described (at least in principle) as

202: specifying a certain class of functions (either locally or globally)

203: and choosing among the candidates by optimizing some criterion which

204: is intended to balance the ever-important tradeoff between fidelity to

205: the data and the possibility of ``over-fitting.'' There are at least

206: three different competing ``philosophies'' for how exactly such

207: criteria should be chosen. Traditional frequentists tend to evaluate

208: criteria in terms of ``bias/variance tradeoffs.''  Typically, they

209: propose criteria which utilize theoretically motivated point estimates

210: of risk (e.g. $C_p$, GCV, and cross-validation). Advocates of

211: ``Structural Risk Minimization,'' on the other hand, penalize a

212: function according to the complexity of the function class that

213: contains it (according to a pre-decided ``structure'') in order to

214: maintain upper-confidence levels on the risk. A purely Bayesian

215: approach, however, would be to specify whatever subjective prior the

216: practitioner happens to believe, and choose the estimate which

217: minimizes \emph{posterior} risk.

218:

219: Each approach has its advantages and disadvantages. The frequentist

220: approach has the

221: most direct information about the (asymptotic) sampling variation of its estimates

222: (under a fixed truth). Little, however, can typically be said about

223: the finite sample properties of its estimates for general $F$. Structural Risk Minimization offers

224: theoretically sound confidence statements about its particular risk on

225: a particular data set, under sampling from an arbitrarily complicated

226: $F$. It does not, however, have anything to say about the

finite-sample optimality of its estimation scheme under any
particular (prior) scenario. This is especially true if one considers
a scenario in which arbitrary $F$ may be possible
but are vanishingly unlikely. A subjective Bayesian approach offers
the assurance that it behaves optimally for at least some scenario (the
one specified by its prior). It typically, however, shows little concern for
its sampling variation (under particular $F$) or interest in offering uniformly valid confidence statements about its risk.

234:

235: It is the opinion of this paper that the best chance of attaining all

236: of the advantages outlined above is to employ the Bayesian approach,

237: but carefully choose the prior to satisfy both of the ``empiricist'' goals.

238:

\begin{thebibliography}{1}
\bibitem{df1}
P. Diaconis and D. Freedman (1986). On the Consistency of Bayes Estimates. \emph{Ann. Stat.} \textbf{14}, 1--26.
\bibitem{df2}
P. Diaconis and D. Freedman (1990). On the Uniform Consistency of Bayes Estimates for Multinomial Probabilities. \emph{Ann. Stat.} \textbf{18}, 1317--1327.
\bibitem{df3}
P. Diaconis and D. Freedman (1993). Nonparametric Binary Regression: A
Bayesian Approach. \emph{Ann. Stat.} \textbf{21}, 2108--2137.
\bibitem{moller}
J. M{\o}ller (1999). Stochastic Geometry: Likelihood and Computation
(chapter 4). O.~Barndorff-Nielsen, W.~Kendall, and M.~van~Lieshout,
eds. Chapman and Hall.
\bibitem{mollerskare}
J. M{\o}ller and {\O}ivind Skare (??). Bayesian Image Analysis with Coloured Voronoi Tessellations and a View to Application in Reservoir Modelling.
\end{thebibliography}

254:

We also define the function
\[
\tilde{f}_0(x,y)=f_0(x)\1{y=1}+(1-f_0(x))\1{y=0},
\]
and when it does not result in confusion we drop the tilde.

261: