0605:math0605793/vc.tex

1: \newcommand{\w}[1]{\widehat{#1}}

2: \section*{Introduction}

3: \addcontentsline{toc}{section}{Introduction}

4:

5: Among the possible approaches to pattern recognition,

6: statistical learning theory has received a lot of attention

7: in the last few years. Although a realistic pattern recognition

8: scheme involves data pre-processing and post-processing that

9: need a theory of their own, a central role is often played

10: by some kind of supervised learning algorithm. This central

11: piece of work is the subject we are going to analyse in

12: these notes.

13:

14: Accordingly, we assume that we have prepared in some way or another

15: a {\em sample} of $N$ labelled patterns $(X_i, Y_i)_{i=1}^N$,

16: where $X_i$ ranges in some pattern space $\C{X}$ and $Y_i$ ranges

17: in some finite label set $\C{Y}$. We also assume that we have devised

18: our experiment in such a way that the couples of random variables

19: $(X_i, Y_i)$ are independent (but not necessarily equidistributed).

20: Here, randomness should be understood to come from the way the

21: statistician has planned his experiment. He may for instance

22: have drawn the $X_i$s

23: at random from some larger population of patterns the algorithm

24: is meant to be applied to in a second stage. The labels $Y_i$

25: may have been set with the help of some external expertise

26: (which may itself be faulty or

27: contain some amount of randomness, therefore we do not assume

28: that $Y_i$ is a function of $X_i$, and allow the couple of

29: random variables $(X_i, Y_i)$ to follow any kind of joint distribution).

30: In practice, patterns will be extracted from some high dimensional and highly

31: structured data, like digital images, speech signals, DNA sequences, etc.

32: We will not discuss here this pre-processing stage

33: (although it poses crucial problems dealing with segmentation

34: and the choice of a representation).

35:

36: To fix notations, let $(X_i,Y_i)_{i=1}^N$ be the canonical process

37: on $\Omega = (\C{X} \times \C{Y})^N$ (which means

38: the coordinate process).

39: Let the pattern space

40: be provided with a sigma-algebra $\C{B}$ turning it into

41: a measurable space $(\C{X}, \C{B})$. On the finite label space $\C{Y}$,

42: we will consider the trivial algebra $\C{B}'$ made of all its subsets.

43: Let $\C{M}_+^1\bigl[(\C{X} \times \C{Y})^N, (\C{B}

44: \otimes \C{B}')^{\otimes N} \bigr]$ be our notation for

45: the set of probability measures (i.e. of positive measures

46: of total mass equal to $1$) on the measurable space

47: $\bigl[ (\C{X} \times \C{Y})^N, (\C{B} \otimes \C{B}')^{\otimes N}

48: \bigr]$.

49: Once some probability distribution

50: $\PP \in \C{M}_+^1\bigl[ (\C{X} \times \C{Y})^N, (\C{B} \otimes

51: \C{B}')^{\otimes N} \bigr]$ is chosen,

52: it turns $(X_i,Y_i)_{i=1}^N$

53: into the canonical realization of a stochastic process modeling the

54: observed sample (also called the training set).

55: We will assume that $\PP = \bigotimes_{i=1}^N P_i$, where

56: for each $i = 1, \dots, N$,

57: $P_i \in \C{M}_+^1(\C{X} \times \C{Y}, \C{B} \otimes \C{B}')$,

58: to reflect

59: the assumption that we observe independent pairs of patterns and labels.

60: We will also assume that we are provided with some indexed set of

61: possible classification rules

62: $$

63: \C{R}_{\Theta} = \bigl\{ f_{\theta} : \C{X} \rightarrow \C{Y};

64: \theta \in \Theta \bigr\},

65: $$

66: where $(\Theta, \C{T})$ is some measurable index set. Assuming

67: some indexation of the classification rules is just a matter

68: of presentation. Although it leads to longer notations, it

69: allows us to integrate over the space of classification rules

70: as well as over $\Omega$ using the usual formalism of multiple

71: integrals. For this purpose, we will assume that

72: $(\theta, x) \mapsto f_{\theta}(x) : ( \Theta \times \C{X},

73: \C{T} \otimes \C{B} ) \rightarrow (\C{Y}, \C{B}')$

74: is a measurable function.

75:

76: In many cases $\Theta = \bigcup_{i \in I} \Theta_i$ will be a finite

77: (or more generally countable) union of subspaces, dividing the classification

78: model $\C{R}_{\Theta} = \bigcup_{i \in I} \C{R}_{\Theta_i}$ into a union of

79: submodels. The importance of introducing such a structure has been

80: put forward by V. Vapnik, as a way to avoid making strong hypotheses

81: on the distribution $\PP$ of the sample.

82: If neither the distribution of the sample nor the set of

83: classification rules were constrained, it is well known indeed that

84: no kind of statistical inference would be possible.

85: Considering a family of submodels is a way to

86: provide for adaptive classification where

87: the choice of the model depends on the observed

88: sample. Restricting the set of classification rules is more realistic

89: than restricting the distribution of patterns, since the classification

90: rules are a processing tool left to the choice of the statistician,

91: whereas the distribution of the patterns is not fully under his control,

92: except for some planning of the learning experiment which may enforce

93: some weak properties like independence, but not the precise shapes of

94: the marginal distributions $P_i$ which are as a rule unknown distributions

95: on some high dimensional space.

96:

97: \newcommand{\wtheta}{\widehat{\theta}}

98: In these notes, we will concentrate on general issues concerned with

99: a natural measure of risk, namely the {\em expected error rate}

100: of each classification rule $f_{\theta}$, expressed as

101: $$

102: R(\theta) = \frac{1}{N} \sum_{i=1}^N \PP\bigl[ f_{\theta}(X_i) \neq Y_i

103: \bigr].

104: $$

105: As this quantity is unobserved, we will be led to work with

106: the corresponding  {\em empirical error rate}

107: $$

108: r(\theta,\omega) = \frac{1}{N} \sum_{i=1}^N \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr].

109: $$

110: This does not mean that practical learning algorithms will

111: always try to minimize this criterion. They often on the contrary

112: try to minimize some other criterion which is linked with

113: the structure of the problem and has some nice additional properties

114: (like smoothness and convexity, for example). Nevertheless, and independently

115: from the precise form of the estimator $\wtheta : \Omega \rightarrow \Theta$

116: under study, the analysis of $R(\wtheta)$ is a natural question,

117: and often corresponds to what is required in practice.

118:

119: Answering this question is not straightforward because,

120: although $R(\theta)$ is the expectation of $r(\theta)$,

121: a sum of independent Bernoulli random variables,

122: $R(\wtheta)$ is not the expectation of $r(\wtheta)$,

123: because of the dependence of $\wtheta$ on the sample,

124: and neither is $r(\wtheta)$ a sum of independent

125: random variables.

126: To circumvent this unfortunate situation,

127: some uniform control over the deviations of $r$ with respect to $R$

128: is needed.

129:

130: The PAC-Bayesian approach to this problem, which originated in the machine

131: learning community and was pioneered by

132: D. McAllester \cite{McAllester,McAllester2},

133: can be seen as some variant of the more classical approach of $M$-estimators

134: relying on empirical process theory (as exposed for instance in

135: \cite{VanDeGeer}).

136:

137: It is built on three cornerstones:

138: \begin{itemize}

139: \item One idea is to embed the set of estimators of the type $\wtheta

140: : \Omega \rightarrow \Theta$ into the larger set of

141: regular conditional probability measures

142: $\rho : \bigl( \Omega,

143: (\C{B} \otimes \C{B}')^{\otimes N} \bigr) \rightarrow \C{M}_+^1(\Theta, \C{T})$.

144: We will call these conditional probability measures {\em posterior distributions},

145: to follow a usual terminology.

146: \item A second idea is to measure the fluctuations of $\rho$

147: with respect to the sample, using some prior distribution $\pi \in

148: \C{M}_+^1(\Theta, \C{T})$, and the Kullback divergence function

149: $\C{K}(\rho, \pi)$. The expectation $\PP \bigl\{ \C{K}(\rho, \pi) \bigr\}$

150: measures the randomness of $\rho$.

151: The optimal choice of

152: $\pi$ would be $\PP(\rho)$, resulting in a measure of the

153: randomness of $\rho$ equal to the mutual information between

154: the sample and the estimated parameter drawn from $\rho$.

155: Anyhow, since $\PP(\rho)$ is as a rule no more observed than

156: $\PP$, we will have to be content with some less concentrated

157: prior distribution $\pi$, resulting in some looser measure

158: of randomness, as shown by the identity

159: $\PP \bigl[ \C{K}(\rho, \pi) \bigr] = \PP \bigl\{ \C{K}\bigl[\rho,

160: \PP(\rho)\bigr] \bigr\} + \C{K}\bigl[\PP(\rho), \pi\bigr]$.

161: \item A third idea is to analyze the fluctuations of the random

162: process $\theta \mapsto r(\theta)$ with respect to its mean

163: process $\theta \mapsto R(\theta)$ through the $\log$-Laplace

164: transform

165: $$

166: - \frac{1}{\lambda}

167: \log \left\{ \iint \exp \bigl[ - \lambda r(\theta,\omega) \bigr]

168: \pi(d \theta) \PP(d \omega) \right\},

169: $$ as a physicist prone to statistical mechanics

170: (where this is called the free energy) would do. This transform

171: is well suited

172: to relate $\min_{\theta \in \Theta} r(\theta)$

173: to $\inf_{\theta \in \Theta} R(\theta)$.

174: \end{itemize}
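
The identity quoted in the second item can be checked in a few lines. The following is only a sketch, under the assumption (not stated above) that $\rho \ll \PP(\rho) \ll \pi$ almost surely and that all the divergences involved are finite, where $\PP(\rho)$ denotes the probability measure $A \mapsto \PP \bigl[ \rho(A) \bigr]$ on $(\Theta, \C{T})$. Since $\log \frac{d \PP(\rho)}{d \pi}$ does not depend on the sample, its integral with respect to $\rho$, averaged under $\PP$, is equal to its integral with respect to $\PP(\rho)$, so that
\begin{align*}
\PP \bigl[ \C{K}(\rho, \pi) \bigr]
& = \PP \biggl\{ \rho \biggl[ \log \frac{d \rho}{d \PP(\rho)} \biggr] \biggr\}
+ \PP \biggl\{ \rho \biggl[ \log \frac{d \PP(\rho)}{d \pi} \biggr] \biggr\} \\
& = \PP \bigl\{ \C{K}\bigl[ \rho, \PP(\rho) \bigr] \bigr\}
+ \C{K}\bigl[ \PP(\rho), \pi \bigr].
\end{align*}
In particular, the second term vanishes when $\pi = \PP(\rho)$, which is the sense in which this choice of prior is optimal.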

175:

176: This monograph is divided into two sections. The first one deals with the

177: inductive setting presented in these lines, the second one with

178: the {\em transductive} setting, where, following Vapnik's seminal

179: approach \cite{Vapnik}, a shadow sample is considered.

180:

181: In the first section, two types of bounds are shown. {\em Empirical bounds}

182: can be used to choose between estimators or to build estimators.

183: {\em Non random bounds} can be used to assess the speed of convergence

184: of estimators, relating this speed to the speed of convergence

185: of the Gibbs prior expected error rate $\beta \mapsto

186: \pi_{\exp ( - \beta R)}(R)$ towards $\ess \inf_{\pi} R$

187: as $\beta$ goes to infinity, and to other quantities

188: akin to the margin assumption of Mammen and Tsybakov in more

189: sophisticated cases. We will progress from the most straightforward

190: bounds to more elaborate ones, built to achieve a better

191: asymptotic behaviour. We will thus introduce {\em local bounds}

192: and {\em relative bounds}.

193: From an asymptotic point of view, the culminating result of

194: these notes is Theorem \ref{thm1.1.43} (page \pageref{thm1.1.43}).

195: It is used in Proposition \ref{prop1.1.37} to build a classification

196: rule which is proved to be adaptive in all the parameters

197: of the Mammen and Tsybakov margin assumption and of

198: a parametric complexity assumption

199: in Corollary \ref{cor1.52} (page \pageref{cor1.52}) of Theorem

200: \ref{thm1.50} (page \pageref{thm1.50}). This opens the road to Theorem

201: \ref{thm1.59} (page \pageref{thm1.59}) which performs two step localization

202: on top of Theorem

203: \ref{thm1.1.43} in order to be able to achieve adaptive model selection

204: with a decreased influence of the number of empirically inefficient

205: models included in the comparison. The analysis of this bound is

206: hinted at in subsequent pages, but not fully developed, since

207: we are not sure the amount of technicalities it requires is worth it.

208: We would not, however, like to lead the

209: reader into thinking that each result in the first section is

210: actually an {\em improvement} on the previous one: it is as a rule

211: only an {\em asymptotic improvement}, and the price to pay for

212: being asymptotically tighter is to get looser bounds for small sample sizes.

213: What is a small sample size in practice is a question of ratio between

214: the number of examples and the complexity (roughly speaking the number

215: of parameters) of the model used to classify. Since our aim here is

216: to describe classification methods suitable for complex data (images,

217: speech, DNA, \dots), we suspect that practitioners wanting to make use

218: of our proposals will be confronted with small sample sizes more often

219: than with large ones, and should try to make use of the simplest

220: bounds first and see only afterwards whether the asymptotically

221: better ones can bring them more for the size of samples their computers can handle

222: and their data bases can provide. Let us advocate also that the results

223: of this first section are not only of a theoretical nature for two

224: reasons. The first one is that posterior parameter distributions

225: can be computed effectively using Monte Carlo techniques: there is

226: a whole tradition of such computations in Bayesian statistics,

227: proving that what we call here Gibbs estimators are not

228: only a way to show that some optimal speeds of convergence can

229: be reached in some theoretically well understood situations,

230: but that they can also be computed in practice. The second reason

231: is that a traditional non randomized estimator $\w{\theta} \in \Theta$ of the

232: parameter can be approximated by a posterior distribution $\rho$ which

233: is supported by a fairly narrow neighbourhood of $\w{\theta} \in \Theta$,

234: without spoiling excessively our bounds, resulting in a classification

235: rule which is to provide a randomized answer only for a small amount

236: of dubious examples and will most of the time issue the same deterministic

237: answer as the classification rule indexed by $\w{\theta}$ it is

238: derived from. This is

239: explained on page \pageref{eq1.1.2}.

240:

241: In the second section, we show first how we can transport

242: all the results obtained in the inductive case to the transductive case,

243: allowing us to replace prior distributions by {\em partially exchangeable posterior

244: distributions} depending on an extended sample where unlabelled shadow

245: examples are added, with increased possibilities of adaptation to the data.

246: We then focus on the small sample case, where local and relative

247: bounds are not expected to be of great help. Using

248: a fictitious (that is unobserved) shadow sample, we study Vapnik

249: type generalization bounds, showing how to tighten and extend them

250: using some original ideas, like making no Gaussian approximation to the

251: $\log$-Laplace of Bernoulli random variables, --- using a shadow sample

252: of arbitrary size, --- shrinking from the use of any symmetrization trick ---

253: and using a subset of the group of permutations suitable to cover the

254: case of independent non identically distributed data. The culminating

255: result of the second section is Theorem \ref{thm2.3.3} on page \pageref{thm2.3.3},

256: subsequent bounds showing the separate influence of the above ideas and

257: providing an easier comparison with Vapnik's original results.

258: Vapnik type generalization bounds have a broad applicability, not

259: only through the concept of VC dimension, but also through the use

260: of compression schemes \cite{Little}, which are briefly described

261: on page \pageref{compression}.

262:

263: \section{Inductive PAC-Bayesian learning}

264:

265: The setting of inductive inference (as opposed to transductive

266: inference to be discussed later) is the one described in the

267: introduction.

268:

269: When we have to take the expectation of

270: a random variable $Z : \Omega \rightarrow \RR$ as well as of a function

271: of the parameter $h : \Theta \rightarrow \RR$ with respect to

272: some probability measure, we will as a rule use functional

273: short notations instead of resorting to the integral sign:

274: thus we will write $\PP(Z)$ for $\int_{\Omega} Z(\omega) \PP(d \omega)$

275: and $\pi(h)$ for $\int_{\Theta} h(\theta) \pi(d \theta)$.

276:

277: The PAC-Bayesian approach, in its simplest form, relies on some

278: basic upper bound for the Laplace transform of

279: $\sup_{\rho \in \C{M}_+^1(\Theta)} \bigl[

280: \rho(R) - \rho(r) \bigr]$, or more technically on some penalized

281: variant of it, as will be seen. This will be the subject of the

282: next subsection, where we will start with the Laplace

283: transform of $R(\theta) - r(\theta)$, for any $\theta \in \Theta$,

284: before encompassing posterior distributions. As it is already

285: easy to guess, the purpose of these preliminaries is to

286: gain some uniform control on the lower deviations of the

287: empirical error rate from the expected error rate under

288: any posterior distribution.

289: \subsection{Basic inequality}

290: In the setting described in the introduction,

291: let us consider the Bernoulli random variables

292: $\sigma_i(\theta) = \B{1} \bigl[ Y_i \neq f_{\theta} (X_i) \bigr]$.

293: Using independence and the concavity of the logarithm

294: function, it is readily seen that for any real constant $\lambda$

295: \begin{multline*}

296: \log \Bigl\{ \PP \bigl\{ \exp \bigl[ - \lambda r(\theta) \bigr]

297: \bigr\} \Bigr\}

298: = \sum_{i=1}^N \log \Bigl\{ \PP \Bigl[ \exp\bigl(

299: - \tfrac{\lambda}{N} \sigma_i \bigr) \Bigr] \Bigr\}

300: \\ \leq N \log \biggl\{ \frac{1}{N}\sum_{i=1}^N

301: \PP \Bigl[ \exp \bigl( - \tfrac{\lambda}{N}

302: \sigma_i \bigr) \Bigr]

303: \biggr\}.

304: \end{multline*}

305: The right-hand side of this inequality is the $\log$ Laplace

306: transform of a Bernoulli distribution with parameter

307: $\frac{1}{N} \sum_{i=1}^N \PP(\sigma_i) = R(\theta)$.

308: As any Bernoulli distribution is fully defined

309: by its parameter, this $\log$ Laplace transform

310: is necessarily a function of $R(\theta)$. It can

311: be expressed with the help of the family of functions

312: $$

313: \Phi_{a}(p) = - a^{-1} \log \bigl\{

314: 1 - \bigl[1 - \exp( - a)\bigr]

315: p \bigr\}, \quad a \in \RR, p \in (0,1).

316: $$

317: It is immediately seen that $\Phi_{a}$ is an increasing

318: one to one mapping of the unit interval onto itself, and that it

319: is convex when $a > 0$, concave when $a < 0$ and can be defined

320: by continuity to be the identity when $a = 0$.

321: Moreover the inverse of $\Phi_{a}$ is given by the

322: formula

323: $$

324: \Phi_{a}^{-1}(q) = \frac{1 - \exp (- a q )}{1 - \exp ( - a )},

325: \qquad a \in \RR, q \in (0,1).

326: $$

327: This formula may be used to extend $\Phi_a^{-1}$

328: to $q \in \RR$, and we will use this extension without

329: further notice when required.
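
The following routine verification, spelled out here for convenience, makes the link between $\Phi_a$ and the Bernoulli distribution explicit. Since $\sigma_i(\theta) \in \{0, 1\}$, we have $\exp \bigl( - \tfrac{\lambda}{N} \sigma_i \bigr) = 1 - \bigl[ 1 - \exp( - \tfrac{\lambda}{N}) \bigr] \sigma_i$, so that, for $\lambda \neq 0$,
$$
N \log \biggl\{ \frac{1}{N} \sum_{i=1}^N \PP \Bigl[ \exp \bigl( - \tfrac{\lambda}{N} \sigma_i \bigr) \Bigr] \biggr\}
= N \log \Bigl\{ 1 - \bigl[ 1 - \exp( - \tfrac{\lambda}{N}) \bigr] R(\theta) \Bigr\}
= - \lambda \Phi_{\frac{\lambda}{N}} \bigl[ R(\theta) \bigr].
$$
Similarly, the equation $q = \Phi_a(p)$ reads $\exp( - a q) = 1 - \bigl[ 1 - \exp( - a ) \bigr] p$, and solving it for $p$ gives the expression of $\Phi_a^{-1}(q)$ stated above.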

330:

331: Using these notations, the previous inequality becomes

332: $$

333: \log \Bigl\{ \PP \bigl\{ \exp \bigl[ - \lambda r(\theta)

334: \bigr] \bigr\} \Bigr\} \leq

335: - \lambda \Phi_{\frac{\lambda}{N}} \bigl[ R(\theta) \bigr],

336: \quad \text{proving}

337: $$

338:

339: \begin{lemma}

340: \label{lemma1.1.1} \mypoint For any real constant $\lambda$ and

341: any parameter $\theta \in \Theta$,

342: $$

343: \PP \biggl\{ \exp \Bigl\{

344: \lambda \Bigl[ \Phi_{\frac{\lambda}{N}} \bigl[ R(\theta) \bigr]

345: - r(\theta) \Bigr]

346: \Bigr\} \biggr\} \leq 1.

347: $$

348: \end{lemma}

349: In previous versions of this study, we had used some Bernstein

350: bound, instead of this lemma. Anyhow, as it will turn out,

351: keeping the $\log$ Laplace of a Bernoulli instead of approximating

352: it provides simpler and tighter results.

353:

354: Lemma \ref{lemma1.1.1} implies that

355: for any constants $\lambda \in \RR_+$ and $\epsilon \in (0,1)$,

356: $$

357: \PP \biggl[ \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] +

358: \frac{\log(\epsilon)}{\lambda} \leq r(\theta) \biggr] \geq 1 - \epsilon.

359: $$

360: Choosing $\ds \overline{\lambda} \in \arg\max_{\RR_+}

361: \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] + \frac{\log(\epsilon)}{\lambda}$,

362: we deduce

363: \begin{lemma}\mypoint

364: For any $\epsilon \in (0,1)$, any $\theta \in \Theta$,

365: $$

366: \PP \Biggl\{ R(\theta) \leq \inf_{\lambda \in \RR_+}

367: \Phi_{\frac{\lambda}{N}}^{-1} \biggl[

368: r(\theta) - \frac{\log(\epsilon)}{\lambda} \biggr] \Biggr\}

369: \geq 1 - \epsilon.

370: $$

371: \end{lemma}
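
For the reader's convenience, here is a sketch of how the deviation inequality stated before this lemma follows from Lemma \ref{lemma1.1.1}. It is nothing more than Markov's inequality applied, for a fixed $\lambda > 0$, to the exponential moment:
\begin{align*}
\PP \biggl[ \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] +
\frac{\log(\epsilon)}{\lambda} > r(\theta) \biggr]
& = \PP \biggl[ \exp \Bigl\{ \lambda \Bigl[ \Phi_{\frac{\lambda}{N}} \bigl[ R(\theta) \bigr]
- r(\theta) \Bigr] \Bigr\} > \epsilon^{-1} \biggr] \\
& \leq \epsilon \, \PP \biggl\{ \exp \Bigl\{ \lambda \Bigl[ \Phi_{\frac{\lambda}{N}} \bigl[ R(\theta) \bigr]
- r(\theta) \Bigr] \Bigr\} \biggr\} \leq \epsilon.
\end{align*}
On the complementary event, $\Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] \leq r(\theta) - \frac{\log(\epsilon)}{\lambda}$, and applying the increasing function $\Phi_{\frac{\lambda}{N}}^{-1}$ (extended to $\RR$) to both sides gives the bound on $R(\theta)$.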

372:

373: We will illustrate throughout these notes the bounds we prove with

374: a small numerical example: in the case where $N = 1000$,

375: $\epsilon = 0.01$ and $r(\theta) = 0.2$,

376: we get with a confidence level of $0.99$ that $ R(\theta) \leq .2402$,

377: this being obtained for $\lambda = 234$.
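
The numerical example above can be reproduced with a few lines of code. The following Python sketch is only an illustration: the function names \texttt{phi\_inv} and \texttt{deviation\_bound}, as well as the crude grid search over integer values of $\lambda$, are our own assumptions and are not part of the theory. It evaluates $\Phi_{\frac{\lambda}{N}}^{-1} \bigl[ r(\theta) - \frac{\log(\epsilon)}{\lambda} \bigr]$ and minimizes it over the grid.

```python
import math


def phi_inv(a: float, q: float) -> float:
    """Inverse of Phi_a, extended to real q: (1 - exp(-a q)) / (1 - exp(-a)), a != 0."""
    return (1.0 - math.exp(-a * q)) / (1.0 - math.exp(-a))


def deviation_bound(r_emp: float, n: int, eps: float, lam: float) -> float:
    """Upper bound on R(theta) for a fixed lam > 0, valid with probability >= 1 - eps."""
    return phi_inv(lam / n, r_emp - math.log(eps) / lam)


# Values taken from the numerical example in the text.
N, EPS, R_EMP = 1000, 0.01, 0.2

# Illustrative choice (an assumption): search lambda on the integer grid 1..2000.
best_lam = min(range(1, 2001), key=lambda lam: deviation_bound(R_EMP, N, EPS, lam))
best_bound = deviation_bound(R_EMP, N, EPS, best_lam)

print(f"lambda = {best_lam}, bound on R(theta) = {best_bound:.4f}")
print(f"bound at lambda = 234: {deviation_bound(R_EMP, N, EPS, 234):.4f}")
```

The bound is quite flat around its minimum, so the minimizing $\lambda$ found on a grid need not coincide exactly with the value quoted in the text, while the value of the bound should agree with it to the displayed precision.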

378:

379: Now, to proceed towards the analysis of posterior

380: distributions, let us put for short $U_{\lambda}(\theta, \omega) =

381: \lambda \Bigl[ \Phi_{\frac{\lambda}{N}} \bigl[ R(\theta) \bigr]

382: - r(\theta, \omega) \Bigr],

383: $ and let us consider \linebreak

384: $\log \Bigl\{ \PP \Bigl[ \pi \bigl[ \exp ( U_{\lambda}) \bigr] \Bigr] \Bigr\}$, where

385: $\pi \in \C{M}_+^1(\Theta, \C{T})$ is some prior probability

386: measure on the parameter space. Using Fubini's theorem

387: for non negative functions, we see that

388: $$

389: \log \Bigl\{ \PP \Bigl[ \pi \bigl[ \exp ( U_{\lambda}) \bigr] \Bigr] \Bigr\}

390: = \log \Bigl\{ \pi \Bigl[ \PP \bigl[ \exp ( U_{\lambda} ) \bigr] \Bigr]

391: \Bigr\} \leq 0.

392: $$

393:

394: To relate this quantity

395: to the expectation $\rho(U_{\lambda})$ with respect to

396: any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

397: we will use the properties of the Kullback divergence

398: $\C{K}(\rho, \pi)$

399: of $\rho$ with respect to $\pi$, which is defined as

400: $$

401: \C{K}(\rho, \pi) = \begin{cases}

402: \int \log( \frac{d\rho}{d \pi}) d \rho, & \text{ when $\rho \ll

403: \pi$},\\

404: + \infty, & \text{ otherwise}.

405: \end{cases}

406: $$

407: The following lemma shows in which sense the Kullback divergence

408: function can be thought of as the dual of the $\log$ Laplace

409: transform.

410: \begin{lemma} \mypoint

411: \label{lemma1.3}

412: For any bounded measurable function $h : \Theta \rightarrow \RR$,

413: and any probability distribution $\rho \in \C{M}_+^1(\Theta)$

414: such that $\C{K}(\rho,\pi) < \infty$,

415: $$

416: \log \bigl\{ \pi \bigl[ \exp (h) \bigr]

417: \bigr\} = \rho(h)

418: - \C{K}(\rho,\pi) + \C{K}(\rho, \pi_{\exp(h)}),

419: $$

420: where by definition $\ds \frac{d \pi_{\exp(h)}}{d \pi} =

421: \frac{\exp[h(\theta)]}{\pi[\exp(h)]}$. Consequently

422: $$

423: \log \bigl\{ \pi \bigl[ \exp (h) \bigr] \bigr\}

424: = \sup_{\rho \in \C{M}_+^1(\Theta)} \rho (h)

425: - \C{K}(\rho, \pi).

426: $$

427: \end{lemma}

428: The proof is just a matter of writing down the definition

429: of the quantities involved and using the fact that the Kullback

430: divergence function is non negative.

431: It can be found in \cite[page 160]{Cat7}.
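
On a finite parameter set, the two identities of Lemma \ref{lemma1.3} can be checked numerically. The Python sketch below is only an illustration under assumptions of our own: $\Theta$ has three points, and the values chosen for $\pi$, $h$ and $\rho$ are arbitrary. It verifies the decomposition of $\log \bigl\{ \pi \bigl[ \exp(h) \bigr] \bigr\}$ for one $\rho$, and that the supremum is reached at $\rho = \pi_{\exp(h)}$.

```python
import math


def kl(rho, pi):
    """Kullback divergence K(rho, pi) on a finite set; +inf unless rho << pi."""
    total = 0.0
    for r, p in zip(rho, pi):
        if r > 0.0:
            if p == 0.0:
                return math.inf
            total += r * math.log(r / p)
    return total


def expect(mu, h):
    """Integral of h with respect to the finite measure mu."""
    return sum(m * x for m, x in zip(mu, h))


def gibbs(pi, h):
    """Density exp(h) / pi[exp(h)] with respect to pi, returned as a probability vector."""
    z = sum(p * math.exp(x) for p, x in zip(pi, h))
    return [p * math.exp(x) / z for p, x in zip(pi, h)]


# Arbitrary illustrative values (assumptions, not taken from the text).
pi = [0.5, 0.3, 0.2]
h = [-1.0, 0.5, 2.0]
rho = [0.2, 0.2, 0.6]

log_laplace = math.log(sum(p * math.exp(x) for p, x in zip(pi, h)))
pi_h = gibbs(pi, h)

# First identity: log pi[exp(h)] = rho(h) - K(rho, pi) + K(rho, pi_exp(h)).
decomposition = expect(rho, h) - kl(rho, pi) + kl(rho, pi_h)

# Second identity: the supremum of rho(h) - K(rho, pi) is reached at rho = pi_exp(h).
value_at_gibbs = expect(pi_h, h) - kl(pi_h, pi)

print(log_laplace, decomposition, value_at_gibbs)
```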

432: In the duality between measurable functions and probability measures,

433: we thus see that the $\log$ Laplace transform with respect to

434: $\pi$ is the Legendre transform of the Kullback divergence function

435: with respect to $\pi$.

436: Using this, we get

437: $$

438: \PP \Bigl\{ \exp \bigl\{ \sup_{\rho \in \C{M}_+^1(\Theta)}

439: \rho [ U_{\lambda}(\theta) ] - \C{K}(\rho, \pi) \bigr\} \Bigr\} \leq 1,

440: $$

441: which, combined with the convexity of $\lambda \Phi_{\frac{\lambda}{N}}$, proves

442: the basic inequality we were looking for.

443: \begin{thm}

444: \label{thm2.3}

445: \mypoint For any real constant $\lambda$,

446: \begin{multline*}

447: \PP \biggl\{ \exp \biggl[

448: \sup_{\rho \in \C{M}_+^1(\Theta)} \lambda

449: \Bigl[ \Phi_{\frac{\lambda}{N}}\bigl[ \rho(R) \bigr]

450: - \rho(r) \Bigr] - \C{K}(\rho,\pi) \biggr] \biggr\}

451: \\ \leq

452: \PP \biggl\{ \exp \biggl[

453: \sup_{\rho \in \C{M}_+^1(\Theta)} \lambda

454: \Bigl[ \rho \bigl( \Phi_{\frac{\lambda}{N}}\!\circ\!R \bigr)

455: - \rho(r) \Bigr] - \C{K}(\rho,\pi) \biggr] \biggr\}

456: \leq 1.

457: \end{multline*}

458: \end{thm}

459: The following sections will show how to use this theorem.

460: \subsection{Non local bounds}

461: At least three sorts of bounds can be deduced from Theorem \ref{thm2.3}.

462:

463: The most interesting ones to build estimators and tune parameters,

464: as well as the first that have been considered in the development of

465: the PAC-Bayesian approach, are deviation bounds. They provide an

466: empirical upper bound for $\rho(R)$ --- that is a bound which can be computed from

467: observed data --- with some probability $1 - \epsilon$, where $\epsilon$

468: is a presumably small and tunable confidence level.

469:

470: Anyhow, since most

471: of the results about the convergence speed of estimators to be found

472: in the statistical literature are concerned with the expectation $\PP \bigl[

473: \rho(R) \bigr]$, it is also enlightening to bound this quantity.

474: In order to know at which rate it may be approaching $\inf_{\Theta} R$,

475: a non random upper bound is required, which will relate the average of

476: the expected risk $\PP \bigl[ \rho(R) \bigr]$ with the properties of

477: the contrast function $\theta \mapsto R(\theta)$.

478:

479: Since the values of constants do matter a lot when a bound is to be used

480: to select between various estimators using classification models of various

481: complexities, a third kind of bound, related to the first, may be considered

482: for the sake of its hopefully better constants: we will call them

483: {\em unbiased empirical bounds}, to stress the fact that they provide some

484: empirical quantity whose expectation under $\PP$ can be proved to

485: be an upper bound for $\PP \bigl[ \rho(R) \bigr]$, the average expected

486: risk. The price to pay for these better constants is of course the lack

487: of formal guarantee given by the bound: two random variables whose

488: expectations are ordered in a certain way may very well be ordered

489: in the reverse way with a large probability, so that basing the

490: estimation of parameters or the selection of an estimator on some

491: unbiased empirical bound is a hazardous business. Anyhow, since it is

492: common practice to use the inequalities provided by mathematical statistical

493: theory while replacing the proven constants with smaller values showing

494: a better practical efficiency, considering unbiased empirical bounds

495: akin to deviation bounds provides an indication about how much

496: the constants may be decreased while not violating the theory too

497: outrageously.

498:

499: \subsubsection{Unbiased empirical bounds}

500: Let $\rho : \Omega

501: \rightarrow \C{M}_+^1(\Theta)$ be some fixed (and arbitrary)

502: posterior distribution, describing some randomized estimator of $\theta$.

503: As we already mentioned, in these notes a posterior distribution

504: will always be a regular conditional probability measure. By this

505: we mean that

506: \begin{itemize}

507: \item for any $A \in \C{T}$, the map $\omega \mapsto \rho (\omega, A)

508: : \bigl(\Omega, ( \C{B} \otimes

509: \C{B}')^{\otimes N} \bigr) \rightarrow \RR_+$

510: is assumed to be measurable;

511: \item for any $\omega \in \Omega$, the map $A \mapsto \rho(\omega, A):

512: \C{T} \rightarrow \RR_+$

513: is assumed to be a probability measure.

514: \end{itemize}

515: We will also assume without further notice that the $\sigma$-algebras

516: we deal with are always countably generated.

517: The technical implications of these assumptions are standard

518: and discussed for instance in \cite[pages 50-54]{Cat7}

519: (where, among other things, a detailed proof of the decomposition

520: of the Kullback-Leibler divergence is given).

521:

522: Let us restrict to the case when the constant $\lambda$ is positive.

523: We get from Theorem \ref{thm2.3} that

524: \begin{equation}

525: \label{eq2.2.1bis}

526: \exp \biggl[ \lambda \Bigl\{ \Phi_{\frac{\lambda}{N}}

527: \Bigl[ \PP \bigl[ \rho(R) \bigr]

528: \Bigr] - \PP \bigl[ \rho(r) \bigr] \Bigr\} - \PP \bigl[\C{K}(\rho, \pi)

529: \bigr] \biggr]

530: \leq 1,

531: \end{equation}

532: where we have used the convexity of the $\exp$ function and of $\Phi_{\frac{

533: \lambda}{N}}$.

534: Since we have restricted our attention to positive values of the constant $\lambda$,

535: Equation \eqref{eq2.2.1bis} can also be written

536: $$

537: \PP \bigl[ \rho(R) \bigr]

538: \leq \Phi_{\frac{\lambda}{N}}^{-1} \Bigl\{

539: \PP \bigl[ \rho(r) + \lambda^{-1} \C{K}(\rho,\pi) \bigr] \Bigr\},

540: $$

541: leading to

542: \begin{thm}

543: \label{thm2.4}

544: \mypoint For any posterior distribution $\rho: \Omega \rightarrow \C{M}_+^1(\Theta)$,

545: for any positive parameter $\lambda$,

546: \begin{align*}

547: \PP \bigl[ \rho (R) \bigr]

548: & \leq \frac{\ds

549: 1 - \exp \Bigl[ - N^{-1} \PP \bigl[

550: \lambda \rho(r) + \C{K}(\rho,\pi) \bigr]  \Bigr] }{\ds 1 - \exp( - \tfrac{\lambda}{N})} \\

551: & \leq \PP \Biggl\{ \frac{\lambda}{N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr]}

552: \left[ \rho(r) + \frac{\C{K}(\rho,\pi)}{\lambda} \right] \Biggr\}.

553: \end{align*}

554: \end{thm}

555: The last inequality provides the {\em unbiased empirical upper

556: bound} for $\rho(R)$ we were looking for, meaning that the expectation of

557: \linebreak $\frac{\lambda}{N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr]}

558: \left[ \rho(r) + \frac{\C{K}(\rho,\pi)}{\lambda} \right]$

559: is larger than the expectation of $\rho(R)$. Let us notice that

560: $1 \leq \frac{\lambda}{N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr]} \leq

561: \bigl[ 1 - \frac{\lambda}{2N} \bigr]^{-1}$ and therefore that this

562: coefficient is close to $1$ when $\lambda$ is significantly smaller

563: than $N$.
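
To get a feeling for the size of the coefficient $\frac{\lambda}{N [ 1 - \exp( - \frac{\lambda}{N}) ]}$, one may evaluate it numerically. The Python sketch below is only an illustration: the function names and the values of $\lambda$ and $N$ are assumptions of ours. It computes the coefficient, compares it with the two bounds $1$ and $\bigl[ 1 - \frac{\lambda}{2N} \bigr]^{-1}$ (the latter being meaningful for $\lambda < 2N$), and evaluates the random variable of Theorem \ref{thm2.4} from given values of $\rho(r)$ and $\C{K}(\rho, \pi)$.

```python
import math


def coefficient(lam: float, n: int) -> float:
    """lam / (n * (1 - exp(-lam / n))) for lam > 0."""
    x = lam / n
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is small.
    return x / (-math.expm1(-x))


def unbiased_empirical_bound(rho_r: float, kl_div: float, lam: float, n: int) -> float:
    """Quantity whose expectation under P bounds P[rho(R)], given rho(r) and K(rho, pi)."""
    return coefficient(lam, n) * (rho_r + kl_div / lam)


N = 1000  # illustrative sample size (assumption)
for lam in (1.0, 10.0, 234.0, 1000.0):
    c = coefficient(lam, N)
    upper = 1.0 / (1.0 - lam / (2 * N))
    print(f"lambda = {lam:7.1f}: 1 <= {c:.6f} <= {upper:.6f}")
```

The values $\rho(r)$ and $\C{K}(\rho, \pi)$ passed to \texttt{unbiased\_empirical\_bound} depend on the sample, so that the returned number is itself random: only its expectation is guaranteed to dominate $\PP \bigl[ \rho(R) \bigr]$.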

564:

565: If we are ready to believe in this bound (although this belief is not

566: mathematically well founded, as we already mentioned), we can use

567: it to optimize $\lambda$ and to choose $\rho$. While the optimal choice

568: of $\rho$ when $\lambda$ is fixed is to take it equal to $\pi_{\exp( - \lambda r)}$,

569: a Gibbs posterior distribution, as it is sometimes called, we may for

570: computational reasons be more interested in choosing $\rho$ in some

571: other class of posterior distributions.
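
As a side remark, the optimality of the Gibbs posterior for a fixed $\lambda$ can be read off Lemma \ref{lemma1.3} applied to $h = - \lambda r$. The following short verification is spelled out for convenience and assumes $\C{K}(\rho, \pi) < \infty$:
$$
\rho(r) + \frac{\C{K}(\rho, \pi)}{\lambda}
= - \frac{1}{\lambda} \log \bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr] \bigr\}
+ \frac{\C{K}\bigl(\rho, \pi_{\exp( - \lambda r)}\bigr)}{\lambda}.
$$
The first term on the right-hand side does not depend on $\rho$, and the second one is non negative and vanishes when $\rho = \pi_{\exp( - \lambda r)}$.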

572:

573: For instance, our real interest

574: may be to select some deterministic estimator from a

575: family $\wtheta_m : \Omega \rightarrow

576: \Theta_m$, $m \in M$, of possible ones, where $\Theta_m$ are

577: measurable subsets of $\Theta$ and where $M$ is an arbitrary (not necessarily

578: countable) index set. We may for instance think of

579: the case when $\wtheta_m \in \arg\min_{\Theta_m} r$.

580: We may slightly randomize the estimators to start with,

581: considering for any $\theta \in \Theta_m$ and any $m \in M$,

582: $$

583: \Delta_m(\theta) = \Bigl\{ \theta' \in \Theta_m :

584: \bigl[ f_{\theta'}(X_i) \bigr]_{i=1}^N = \bigl[ f_{\theta}(X_i) \bigr]_{i=1}^N

585: \Bigr\},

586: $$

587: and defining $\rho_m$ by the formula

588: $$

589: \frac{d \rho_m}{d \pi} (\theta) = \frac{\B{1}\bigl[ \theta \in \Delta_m(\wtheta_m)

590: \bigr]}{\pi \bigl[ \Delta_m(\wtheta_m) \bigr]}.

591: $$

592: Our posterior is minimizing $\C{K}(\rho, \pi)$ among those

593: whose support is restricted to the values of $\theta$

594: in $\Theta_m$ for which the classification rule $f_{\theta}$

595: is identical to the estimated one $f_{\wtheta_m}$ on

596: the observed sample.

597: Presumably, in many practical situations, $f_{\theta}(x)$

598: will be $\rho_m$ almost surely identical to

599: $f_{\wtheta_m}(x)$ when $\theta$ is drawn from

600: $\rho_m$, for the vast majority of the values of $x \in \C{X}$

601: and all the submodels $\Theta_m$ not plagued with too much overfitting

602: (since this is by construction the case when $x \in \{ X_i : i = 1, \dots, N \}$).

603: Therefore replacing $\wtheta_m$ with $\rho_m$ can be expected to be

604: a minor change in many situations. This change by the way can be

605: estimated in the (admittedly not so common) case when the

606: distribution of the patterns $(X_i)_{i=1}^N$ is known.

607: Indeed, introducing the pseudo distance

608: \begin{equation}

609: \label{eq1.1.2}

610: D(\theta, \theta') = \frac{1}{N} \sum_{i=1}^N

611: \PP \bigl[ f_{\theta}(X_i) \neq f_{\theta'}(X_i) \bigr], \qquad \theta, \theta' \in

612: \Theta,

613: \end{equation}

614: one immediately sees that $R(\theta') \leq R(\theta) + D(\theta, \theta')$,

615: for any $\theta, \theta' \in \Theta$, and

616: therefore that

617: $$

618: R(\wtheta_m) \leq \rho_m(R) + \rho_m\bigl[ D(\cdot,\wtheta_m) \bigr].

619: $$
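To spell out the step ``one immediately sees'' (a sketch, assuming as we understand the notation of these notes that $R(\theta) = \frac{1}{N} \sum_{i=1}^N \PP \bigl[ f_{\theta}(X_i) \neq Y_i \bigr]$): for each $i$, the event $\{ f_{\theta'}(X_i) \neq Y_i \}$ is contained in $\{ f_{\theta}(X_i) \neq Y_i \} \cup \{ f_{\theta}(X_i) \neq f_{\theta'}(X_i) \}$, so that
$$
\PP \bigl[ f_{\theta'}(X_i) \neq Y_i \bigr] \leq
\PP \bigl[ f_{\theta}(X_i) \neq Y_i \bigr] +
\PP \bigl[ f_{\theta}(X_i) \neq f_{\theta'}(X_i) \bigr].
$$
Averaging over $i$ gives $R(\theta') \leq R(\theta) + D(\theta, \theta')$. Taking $\theta' = \wtheta_m$ and integrating with respect to $\rho_m(d \theta)$ then yields the bound on $R(\wtheta_m)$.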

620: Let us notice also that in the case where $\Theta_m

621: \subset \RR^{d_m}$, and $R$ happens to be convex on

622: $\Delta_m(\wtheta_m)$, then $\rho_m(R) \geq R \bigl[

623: \int \theta \rho_m(d \theta)\bigr]$, and we can replace

624: $\wtheta_m$ with $\T_m = \int \theta \rho_m( d\theta)$,

625: and obtain bounds for $R(\T_m)$.

626: This is not a very heavy assumption about $R$, in the case

627: where we consider $\wtheta_m \in \arg\min_{\Theta_m} r$.

628: Indeed, $\wtheta_m$, and therefore $\Delta_m(\wtheta_m)$,

629: will be presumably close to $\arg\min_{\Theta_m} R$,

630: and requiring a function to be convex in the neighbourhood of

631: its minima is not a very strong assumption.

632:

633: Since $r(\wtheta_m) = \rho_m(r)$,

634: and $\C{K}(\rho_m, \pi) = - \log \bigl\{

635: \pi\bigl[ \Delta_m(\wtheta_m) \bigr] \bigr\}$,

636: our unbiased empirical upper

637: bound in this context reads as

638: $$

639: \frac{\lambda}{N\bigl[ 1 - \exp( - \frac{\lambda}{N})\bigr]} \left\{

640: r(\wtheta_m) - \frac{\log\bigl\{ \pi \bigl[ \Delta_m(\wtheta_m) \bigr]

641: \bigr\}}{\lambda} \right\}.

642: $$
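The two identities used here can be checked directly from the definition of $\rho_m$ (a sketch, assuming that $r(\theta)$ is the empirical error rate of $f_{\theta}$ on the sample and that $\pi \bigl[ \Delta_m(\wtheta_m) \bigr] > 0$). Since $r(\theta)$ depends on $\theta$ only through $\bigl[ f_{\theta}(X_i) \bigr]_{i=1}^N$, it is constant on $\Delta_m(\wtheta_m)$, whence $\rho_m(r) = r(\wtheta_m)$. Moreover, the density of $\rho_m$ with respect to $\pi$ is constant on the support of $\rho_m$, so that
$$
\C{K}(\rho_m, \pi) = \int \log \biggl[ \frac{d \rho_m}{d \pi}(\theta) \biggr] \rho_m(d \theta)
= \log \biggl\{ \frac{1}{\pi \bigl[ \Delta_m(\wtheta_m) \bigr]} \biggr\}
= - \log \bigl\{ \pi \bigl[ \Delta_m(\wtheta_m) \bigr] \bigr\}.
$$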

643: Let us notice that we obtain a complexity factor $- \log \bigl\{

644: \pi \bigl[ \Delta_m(\wtheta_m) \bigr] \bigr\}$ which may be

645: compared with the Vapnik--Cervonenkis dimension. Indeed, in the

646: case of binary classification, when using a classification model

647: with VC dimension not greater than $h_m$, that is when any subset

648: of $\C{X}$ which can be split in any arbitrary way by some

649: classification rule $f_{\theta}$ of the model $\Theta_m$ has at most $h_m$

650: points, then

651: $$

652: \bigl\{ \Delta_m(\theta) : \theta \in \Theta_m  \bigr\}

653: $$

654: is a partition of $\Theta_m$ with at most $\left( \frac{eN}{h_m} \right)^{h_m}$

655: components. Therefore

656: $$

657: \inf_{\theta \in \Theta_m} - \log \bigl\{

658: \pi \bigl[ \Delta_m(\theta) \bigr] \bigr\} \leq h_m \log \left( \frac{e N}{h_m}

659: \right) - \log \bigl[ \pi(\Theta_m) \bigr].

660: $$
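This is a counting argument, which can be sketched as follows (the notation $n_m$ for the number of components is introduced only for this remark, and the components are assumed to be measurable). Since the components form a partition of $\Theta_m$ and $n_m \leq \left( \frac{eN}{h_m} \right)^{h_m}$,
$$
\pi(\Theta_m) = \sum_{\Delta} \pi(\Delta) \leq n_m \sup_{\theta \in \Theta_m}
\pi \bigl[ \Delta_m(\theta) \bigr],
\quad \text{so that} \quad
\sup_{\theta \in \Theta_m} \pi \bigl[ \Delta_m(\theta) \bigr] \geq
\pi(\Theta_m) \left( \frac{h_m}{eN} \right)^{h_m},
$$
where the sum runs over the components $\Delta$ of the partition. Taking minus the logarithm of both sides gives the stated inequality.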

661: Thus, if the model and prior distribution are well suited to the classification

662: task, in the sense that there is more ``room'' (where room is measured with $\pi$)

663: between the two clusters defined by $\wtheta_m$ than between other partitions

664: of the sample of patterns $(X_i)_{i=1}^N$, then we will have

665: $$

666: -\log \bigl\{ \pi \bigl[ \Delta_m(\wtheta_m) \bigr] \bigr\} \leq h_m

667: \log \left( \frac{e N}{h_m} \right) - \log \bigl[ \pi(\Theta_m) \bigr].

668: $$

669: \newcommand{\wm}{\widehat{m}}

670: An optimal value $\wm$ may be selected so that

671: $$

672: \wm \in \arg\min_{m \in M} \left\{ \inf_{\lambda \in \RR_+}

673: \frac{\lambda}{N\bigl[ 1 - \exp( - \frac{\lambda}{N})\bigr]} \left(

674: r(\wtheta_m) - \frac{\log\bigl\{ \pi \bigl[ \Delta_m(\wtheta_m) \bigr] \bigr\}}{\lambda} \right) \right\}.

675: $$

676: Since $\rho_{\wm}$ is still another posterior distribution, we can be sure that

677: \begin{multline*}

678: \PP \Bigl\{ R(\wtheta_{\wm}) - \rho_{\wm} \bigl[ D(\cdot, \wtheta_{\wm}) \bigr]\Bigr\}

679: \leq \PP \bigl[ \rho_{\wm}(R) \bigr]

680: \\ \leq \inf_{\lambda \in \RR_+} \PP

681: \left\{ \frac{\lambda}{N\bigl[ 1 - \exp( - \frac{\lambda}{N})\bigr]} \left(

682: r(\wtheta_{\wm}) - \frac{\log\bigl\{ \pi \bigl[ \Delta_{\wm}

683: (\wtheta_{\wm}) \bigr] \bigr\}}{\lambda} \right) \right\}.

684: \end{multline*}

685: (Taking the infimum in $\lambda$ inside the expectation with respect to $\PP$

686: would be possible at the price of some supplementary technicalities

687: and a slight increase of the bound that we prefer to postpone to the discussion

688: of deviation bounds, since they are the only ones to provide a rigorous mathematical

689: foundation to the adaptive selection of estimators.)

690:

691: \subsubsection{Optimizing explicitly the exponential parameter $\lambda$}

692: We would like to deal in this section with a technical issue that we think

693: helpful to the understanding of Theorem \ref{thm2.4}

694: (see page \pageref{thm2.4}): namely to investigate

695: how the upper bound it provides could be optimized, or at least approximately

696: optimized, in $\lambda$. It turns out that this can be done quite

697: explicitly.

698:

699: So we will consider in this discussion the

700: posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$

701: to be fixed, and our aim will be to eliminate the constant $\lambda$

702: from the bound by choosing its value in some nearly optimal way as

703: a function of $\PP\bigl[ \rho(r) \bigr]$, the average of the

704: empirical risk, and of

705: $\PP \bigl[ \C{K}(\rho, \pi) \bigr]$, which controls overfitting.

706:

707: Let the bound be written as

708: $$

709: \varphi ( \lambda) = \bigl[ 1 - \exp( - \tfrac{\lambda}{N}) \bigr]^{-1}

710: \left\{ 1 - \exp \Bigl[ - \tfrac{\lambda}{N} \PP \bigl[ \rho(r) \bigr]

711: - N^{-1}\PP \bigl[ \C{K}(\rho,\pi) \bigr] \Bigr] \right\}.

712: $$

713: We see that

714: $$

715: N \frac{\partial}{\partial \lambda} \log \bigl[ \varphi(\lambda) \bigr]

716: = \frac{\PP\bigl[\rho(r)\bigr]}{\exp \Bigl[ \frac{\lambda}{N} \PP\bigl[\rho(r)\bigr]

717: + N^{-1} \PP\bigl[ \C{K}(\rho, \pi) \bigr] \Bigr] - 1} -

718: \frac{1}{\exp(\frac{\lambda}{N}) - 1}.

719: $$

720: Thus, the optimal value for $\lambda$ is such that

721: $$

722: \bigl[ \exp( \tfrac{\lambda}{N}) - 1 \bigr] \PP \bigl[\rho(r)\bigr]

723: = \exp \Bigl[ \tfrac{\lambda}{N} \PP \bigl[ \rho(r) \bigr] + N^{-1}

724: \PP \bigl[ \C{K}(\rho, \pi) \bigr] \Bigr] - 1.

725: $$

726: Assuming that $1 \gg \frac{\lambda}{N} \PP \bigl[ \rho(r) \bigr]

727: \gg \frac{\PP [ \C{K}(\rho,\pi) ]}{N}$,

728: and keeping only the leading terms of the expansions, we are led to choose

729: $$

730: \lambda = \sqrt{ \frac{2 N \PP \bigl[ \C{K}(\rho,\pi) \bigr]}{\PP \bigl[ \rho(r) \bigr]

731: \bigl\{ 1 - \PP \bigl[\rho(r) \bigr] \bigr\}}},

732: $$

733: obtaining

734: \begin{thm}

735: \label{thm1.6}

736: \mypoint For any posterior distribution $\rho: \Omega \rightarrow \C{M}_+^1(\Theta)$,

737: $$

738: \PP \bigl[ \rho(R) \bigr] \leq

739: \frac{ 1 - \exp \left\{ - \sqrt{\frac{ 2 \PP [ \C{K}(\rho,\pi) ] \PP [

740: \rho(r)]}{N \{ 1 - \PP [ \rho(r) ] \}}} -

741: \frac{\PP [ \C{K}(\rho,\pi) ]}{N} \right\}}{

742: 1 - \exp \left\{ - \sqrt{ \frac{ 2 \PP [ \C{K}(\rho,\pi) ]}{

743: N \PP [ \rho(r) ] \{1 - \PP [ \rho(r) ] \}}}

744: \right\}}.

745: $$

746: \end{thm}

747: This result is of course not very useful in itself, since neither of the

748: two quantities $\PP\bigl[ \rho(r) \bigr]$ and $\PP\bigl[ \C{K}(\rho, \pi) \bigr]$

749: is easy to evaluate. Still, it gives a hint that replacing them boldly

750: with $\rho(r)$ and $\C{K}(\rho, \pi)$ could produce something close to

751: a legitimate empirical upper bound for $\rho(R)$. We will see in the subsection

752: about deviation bounds that this is indeed essentially true.
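The approximately optimal value of $\lambda$ leading to Theorem \ref{thm1.6} can be recovered by a second order expansion. In the following sketch, $x = \frac{\lambda}{N}$, $a = \PP \bigl[ \rho(r) \bigr]$ and $k = \frac{1}{N} \PP \bigl[ \C{K}(\rho, \pi) \bigr]$ are shorthands introduced only for this remark. The optimality condition reads $\bigl[ \exp(x) - 1 \bigr] a = \exp(ax + k) - 1$. Expanding both sides to second order,
$$
a \Bigl( x + \frac{x^2}{2} \Bigr) \simeq ax + k + \frac{(ax + k)^2}{2}
\simeq ax + k + \frac{a^2 x^2}{2},
$$
where the terms $akx$ and $k^2$ have been neglected because $1 \gg ax \gg k$. Hence $\frac{a(1-a)}{2} x^2 \simeq k$, that is
$$
x \simeq \sqrt{\frac{2k}{a(1-a)}}, \qquad
\lambda = N x \simeq \sqrt{\frac{2 N \PP \bigl[ \C{K}(\rho, \pi) \bigr]}{
\PP \bigl[ \rho(r) \bigr] \bigl\{ 1 - \PP \bigl[ \rho(r) \bigr] \bigr\}}}.
$$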

753:

754: Let us remark that in the second section of these notes,

755: we will see another way of bounding

756: $$

757: \inf_{\lambda \in \RR_+} \Phi_{\frac{\lambda}{N}}^{-1}

758: \left(q + \frac{d}{\lambda}\right),\text{ leading to}

759: $$

760: \begin{thm}\mypoint

761: \label{thm1.1.6}

762: For any prior distribution $\pi \in \C{M}_+^1(\Theta)$,

763: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

764: \begin{multline*}

765: \PP \bigl[ \rho(R) \bigr] \leq

766: \left(1 + \frac{2\PP\bigl[\C{K}(\rho, \pi) \bigr]}{N}\right)^{-1}

767: \Biggl\{ \PP \bigl[ \rho(r) \bigr] + \frac{\PP\bigl[\C{K}(\rho, \pi)\bigr]}{N}

768: \\* \shoveright{+ \sqrt{ \frac{2 \PP \bigl[ \C{K}(\rho, \pi) \bigr] \PP \bigl[ \rho(r) \bigr]

769: \bigl\{ 1 - \PP \bigl[ \rho(r) \bigr] \bigr\}}{N} + \frac{

770: \PP\bigl[\C{K}(\rho,\pi)\bigr]^2}{N^2}} \Biggr\},}\\

771: \text{as soon as }

772: \PP \bigl[ \rho(r)  \bigr] + \sqrt{ \frac{\PP \bigl[ \C{K}(\rho, \pi) \bigr]}{2N}}

773: \leq \frac{1}{2},\\

774: \text{and }

775: \PP\bigl[\rho(R)\bigr] \leq \PP\bigl[\rho(r)\bigr] +

776: \sqrt{\frac{\PP\bigl[\C{K}(\rho,\pi)\bigr]}{2N}} \text{ otherwise.}

777: \end{multline*}

778: \end{thm}

779: This theorem highlights the influence of three terms on the average expected

780: risk:

781:

782: $\bullet$ the average empirical risk, $\PP \bigl[ \rho(r) \bigr]$, which

783: as a rule will decrease as the size of the classification model increases,

784: acts as a {\em bias} term, grasping the ability of the model to

785: account for the observed sample itself;

786:

787: $\bullet$ a {\em variance} term $\PP \bigl[ \rho(r) \bigr] \bigl\{ 1 - \PP \bigl[ \rho(r) \bigr]

788: \bigr\}$ is due to the random fluctuations of $\rho(r)$;

789:

790: $\bullet$

791: a {\em complexity} term $\PP \bigl[ \C{K}(\rho, \pi) \bigr]$, which as a rule will

792: increase with the size of the classification model,

793: eventually acts as a multiplier of the variance term.

794: \bigskip

795:

796: We observed numerically that the bound provided by Theorem \ref{thm1.6}

797: is better than the more classical Vapnik-like bound of Theorem \ref{thm1.1.6}.

798: For instance, when $N = 1000$, $\PP\bigl[\rho(r) \bigr] = 0.2$

799: and $\PP\bigl[\C{K}(\rho,\pi)\bigr] = 10$, Theorem \ref{thm1.6} gives a bound

800: lower than $0.2604$, whereas the more classical Vapnik-like approximation

801: of Theorem \ref{thm1.1.6} gives a bound larger than $0.2622$. Numerical simulations tend to suggest

802: that the two bounds are always ordered in the same way,

803: although this could be a little tedious

804: to prove mathematically.
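These figures can be checked by hand (our own computation, rounded, under the stated values $N = 1000$, $\PP\bigl[\rho(r)\bigr] = 0.2$ and $\PP\bigl[\C{K}(\rho,\pi)\bigr] = 10$). The three quantities entering Theorem \ref{thm1.6} are $\frac{10}{1000} = 0.01$, $\sqrt{\frac{2 \times 10 \times 0.2}{1000 \times 0.8}} = \sqrt{0.005} \simeq 0.07071$ and $\sqrt{\frac{2 \times 10}{1000 \times 0.2 \times 0.8}} = \sqrt{0.125} \simeq 0.35355$, so that
\begin{align*}
\text{Theorem \ref{thm1.6}:} \quad &
\frac{1 - \exp \bigl( - \sqrt{0.005} - 0.01 \bigr)}{1 - \exp \bigl( - \sqrt{0.125} \bigr)}
\simeq \frac{0.077539}{0.297811} \simeq 0.26036, \\
\text{Theorem \ref{thm1.1.6}:} \quad &
\frac{0.2 + 0.01 + \sqrt{0.0032 + 0.0001}}{1.02}
\simeq \frac{0.267446}{1.02} \simeq 0.26220.
\end{align*}
The first case of Theorem \ref{thm1.1.6} is the relevant one here, since $0.2 + \sqrt{0.005} \simeq 0.2707 \leq \frac{1}{2}$.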

805:

806: \subsubsection{Non random bounds}

807: It is time now to come to less tentative results and

808: see how far the average expected error rate $\PP \bigl[ \rho(R) \bigr]$

809: is from its best possible value $\inf_{\Theta} R$.

810:

811: Let us notice first that

812: $$

813: \lambda \rho(r) + \C{K}(\rho,\pi) =

814: \C{K}(\rho, \pi_{\exp( - \lambda r)})

815: - \log \Bigl\{ \pi \bigl[ \exp ( - \lambda r) \bigr] \Bigr\}.

816: $$
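This identity can be verified as follows (a sketch, assuming that $\pi_{\exp( - \lambda r)}$ denotes the distribution with density $\exp( - \lambda r) / \pi \bigl[ \exp( - \lambda r) \bigr]$ with respect to $\pi$, and that $\rho$ is absolutely continuous with respect to $\pi$, both sides being infinite otherwise):
\begin{align*}
\C{K}(\rho, \pi_{\exp( - \lambda r)})
& = \rho \biggl[ \log \frac{d \rho}{d \pi} \biggr]
- \rho \biggl[ \log \frac{d \pi_{\exp( - \lambda r)}}{d \pi} \biggr] \\
& = \C{K}(\rho, \pi) + \lambda \rho(r)
+ \log \Bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr] \Bigr\}.
\end{align*}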

817: Let us remark moreover that $r \mapsto \log \Bigl[ \pi \bigl[

818: \exp ( - \lambda r) \bigr] \Bigr]$ is a convex functional,

819: a property which can be used in the following way:

820: \begin{multline}

821: \label{eq1.1.3Ter}

822: \PP \Bigl\{ \log \Bigl[ \pi \bigl[ \exp ( - \lambda r) \bigr]

823: \Bigr] \Bigr\}

824: = \PP \Bigl\{ \sup_{\rho \in \C{M}_+^1(\Theta)}

825: - \lambda \rho(r) - \C{K}(\rho,\pi) \Bigr\}

826: \\ \geq \sup_{\rho \in \C{M}_+^1(\Theta)} \PP \Bigl\{

827: - \lambda \rho(r) - \C{K}(\rho, \pi) \Bigr\}

828: = \sup_{\rho \in \C{M}_+^1(\Theta)} - \lambda \rho(R) - \C{K}(\rho, \pi)

829: \\ = \log \Bigl\{ \pi \bigl[ \exp ( - \lambda R) \bigr] \Bigr\}

830: = - \int_{0}^{\lambda} \pi_{\exp( - \beta R)}(R) d \beta.

831: \end{multline}

832: These remarks applied to Theorem \ref{thm2.4} lead to

833: \begin{thm}

834: \label{thm2.5}

835: \mypoint For any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

836: for any positive parameter $\lambda$,

837: \begin{align*}

838: \PP \bigl[ \rho(R) \bigr] &

839: \leq

840: \frac{1 - \exp \left\{ - \frac{1}{N} \int_0^{\lambda} \pi_{\exp( - \beta R)}(R)

841: d \beta - \frac{1}{N} \PP \bigl[ \C{K}(\rho, \pi_{\exp(- \lambda r)}) \bigr]

842: \right\}}{

843: 1 - \exp( - \frac{\lambda}{N})}

844: \\ & \leq \frac{1}{N \bigl[ 1 - \exp ( - \frac{\lambda}{N}) \bigr]}

845: \biggl\{ \int_0^{\lambda} \pi_{\exp( - \beta R)}(R) d \beta

846: + \PP \bigl[ \C{K}(\rho, \pi_{\exp( - \lambda r)}) \bigr]  \biggr\}.

847: \end{align*}

848: \end{thm}

849: This theorem is particularly well suited to the case

850: of the Gibbs posterior distribution $\rho = \pi_{\exp(- \lambda r)}$,

851: where the entropy factor cancels and where

852: $\PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr]$

853: is shown to get close to $\inf_{\Theta} R$ when $N$ goes to $\infty$,

854: as soon as $\lambda/N$ goes to $0$ while $\lambda$ goes to $+ \infty$.

855:

856: We can elaborate on Theorem \ref{thm2.5} and define a notion of dimension

857: of $(\Theta, R)$, with margin $\eta \geq 0$ putting

858: \begin{multline}

859: \label{eq1.1.3Bis}

860: d_{\eta} (\Theta, R) = \sup_{\beta \in \RR_+} \beta \bigl[

861: \pi_{\exp( - \beta R)}(R) - \ess\inf_{\pi} R - \eta \bigr]

862: \\ \leq - \log \Bigl\{ \pi \bigl[ R \leq \ess\inf_{\pi} R + \eta \bigr] \Bigr\}.

863: \end{multline}

864: This last inequality can be established by the chain of inequalities:

865: \begin{multline*}

866: \beta \pi_{\exp( - \beta R)}(R) \leq \int_0^{\beta}

867: \pi_{\exp( - \gamma R)}(R) d \gamma =

868: - \log \Bigl\{ \pi \bigl[

869: \exp ( - \beta R) \bigr] \Bigr\} \\ \leq \beta \Bigl( \ess \inf_{\pi} R

870: + \eta \Bigr) - \log \Bigl[ \pi\bigl( R \leq \ess \inf_{\pi} R + \eta

871: \bigr) \Bigr],

872: \end{multline*}

873: where we have used successively the fact that $\lambda \mapsto

874: \pi_{\exp( - \lambda R)}(R)$ is decreasing (because it is

875: the derivative of the concave function $ \lambda \mapsto -\log

876: \bigl\{ \pi \bigl[ \exp( - \lambda R) \bigr] \bigr\}$)

877: and the fact that the exponential function takes positive values.
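The last inequality of the chain can be detailed as follows (the shorthand $c = \ess \inf_{\pi} R + \eta$ is introduced only for this remark). Since the exponential function is positive, restricting the integral to the event $\{ R \leq c \}$ can only decrease it, so that
$$
\pi \bigl[ \exp( - \beta R) \bigr] \geq
\pi \bigl[ \exp( - \beta R) \B{1}( R \leq c ) \bigr]
\geq \exp( - \beta c) \, \pi \bigl( R \leq c \bigr),
$$
and taking minus the logarithm gives $- \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\} \leq \beta c - \log \bigl[ \pi( R \leq c) \bigr]$. Comparing the first and last terms of the chain, we get $\beta \bigl[ \pi_{\exp( - \beta R)}(R) - c \bigr] \leq - \log \bigl[ \pi ( R \leq c ) \bigr]$ for any $\beta \in \RR_+$, which is inequality \eqref{eq1.1.3Bis} after taking the supremum in $\beta$.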

878:

879: In typical ``parametric'' situations $d_0(\Theta, R)$ will be finite,

880: and in all circumstances $d_{\eta}(\Theta, R)$

881: will be finite for any $\eta > 0$ (this is a direct consequence

882: of the definition of the essential infimum).

883: Using this notion of dimension, we see that

884: \begin{multline*}

885: \int_{0}^{\lambda} \pi_{\exp( -\beta R)}(R) d \beta \leq

886: \lambda  \bigl( \ess \inf_{\pi} R  + \eta \bigr)

887: \\ \shoveright{+ \int_{0}^{\lambda} \left[ \frac{d_{\eta}}{\beta} \wedge (1 - \ess

888: \inf_{\pi} R - \eta)

889: \right] d \beta \quad}\\ = \lambda \bigl(\ess \inf_{\pi} R + \eta \bigr) +

890: d_{\eta}(\Theta, R) \log \left[ \frac{e \lambda}{d_{\eta}(\Theta, R)}

891: \bigl(1 - \ess \inf_{\pi} R - \eta \bigr) \right].

892: \end{multline*}

893: This leads to

894: \begin{cor}

895: With the above notations, for any margin $\eta \in \RR_+$,

896: for any posterior distribution

897: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

898: $$

899: \PP \bigl[ \rho(R) \bigr] \leq \inf_{\lambda \in \RR_+}

900: \Phi_{\frac{\lambda}{N}}^{-1} \left[ \ess \inf_{\pi} R + \eta +

901: \frac{d_{\eta}}{\lambda} \log \left( \frac{e \lambda}{d_{\eta}} \right)

902: + \frac{\PP \bigl\{ \C{K}\bigl[\rho, \pi_{\exp( - \lambda r)}\bigr] \bigr\}}{\lambda}

903: \right].

904: $$

905: \end{cor}
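The closed form of the integral used just before this corollary can be checked as follows (a sketch, where $d = d_{\eta}(\Theta, R)$ and $c = 1 - \ess \inf_{\pi} R - \eta$ are shorthands introduced only for this remark, and where we assume $\lambda \geq d / c$). The integrand is bounded by $d / \beta$ by definition of $d_{\eta}(\Theta, R)$, and by $c$ because $R \leq 1$. Splitting the integral at $\beta = d/c$, where the two bounds coincide,
$$
\int_0^{\lambda} \Bigl( \frac{d}{\beta} \wedge c \Bigr) d \beta
= \int_0^{d/c} c \, d \beta + \int_{d/c}^{\lambda} \frac{d}{\beta} \, d \beta
= d + d \log \Bigl( \frac{\lambda c}{d} \Bigr)
= d \log \Bigl( \frac{e \lambda c}{d} \Bigr).
$$
For $\lambda < d/c$ the integral is simply equal to $\lambda c$. Since $c \leq 1$, the right-hand side is not greater than $d \log \bigl( \frac{e \lambda}{d} \bigr)$, which is the form appearing in the corollary.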

906:

907: If one wants a posterior distribution with a small support,

908: the theorem can also be applied to the case when $\rho$ is obtained by truncating $\pi_{\exp ( - \lambda r)}$

909: to some level set to reduce its support: let

910: $\Theta_{p} = \{ \theta \in \Theta : r(\theta) \leq p \}$,

911: and let us define for any $q \in )0,1)$ the level

912: $p_{q} = \inf \{ p : \pi_{\exp( - \lambda r)}(\Theta_p) \geq

913: q \}$,

914: let us then define $\rho_{q}$ by its density

915: $$

916: \frac{\ds d \rho_q}{\ds d \pi_{\exp(- \lambda r)}} (\theta)

917: = \frac{\ds \B{1}(\theta \in \Theta_{p_q})}{\ds \pi_{\exp( - \lambda r)}(\Theta_{p_q})},

918: $$

919: then $\rho_1 = \pi_{\exp ( - \lambda r)}$ and for any $q \in )0,1)$,

920: \begin{align*}

921: \PP \bigl[ \rho_q(R) \bigr] &

922: \leq

923: \frac{1 - \exp \left\{ - \frac{1}{N} \int_0^{\lambda} \pi_{\exp( - \beta R)}(R)

924: d \beta - \frac{\log(q)}{N}

925: \right\}}{

926: 1 - \exp( - \frac{\lambda}{N})} \\

927: & \leq \frac{1}{N \bigl[ 1 - \exp ( - \frac{\lambda}{N}) \bigr]}

928: \biggl\{ \int_0^{\lambda} \pi_{\exp( - \beta R)}(R) d \beta

929: - \log(q) \biggr\}.

930: \end{align*}
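The entropy term behind this bound can be computed explicitly (a sketch, assuming that the infimum defining $p_q$ is reached, which should be the case since $r$ takes finitely many values, so that $\pi_{\exp( - \lambda r)}(\Theta_{p_q}) \geq q$). The density of $\rho_q$ with respect to $\pi_{\exp( - \lambda r)}$ being constant on $\Theta_{p_q}$,
$$
\C{K} \bigl[ \rho_q, \pi_{\exp( - \lambda r)} \bigr]
= - \log \bigl[ \pi_{\exp( - \lambda r)} (\Theta_{p_q}) \bigr] \leq - \log(q),
$$
and it remains to substitute this upper bound for the entropy term of Theorem \ref{thm2.5}.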

931:

932: \subsubsection{Deviation bounds}

933: Deviation bounds provide results holding under the distribution $\PP$

934: of the sample with probability at least $1 - \epsilon$, for any

935: given confidence level, set by the choice of $\epsilon \in )0, 1($.

936: Using them is the only way to be sure, with probability at least $1-\epsilon$,

937: to do the right thing,

938: although this right thing may be overly pessimistic, since

939: deviation upper bounds are larger than the corresponding unbiased bounds.

940:

941: Starting again

942: from Theorem \ref{thm2.3}, and using Markov's inequality \linebreak $\PP \bigl[

943: \exp (h) \geq 1 \bigr] \leq \PP \bigl[ \exp(h) \bigr]$, we

944: obtain

945: \begin{thm}

946: \label{thm2.7}

947: \mypoint For any positive parameter $\lambda$, with $\PP$ probability at least $1 - \epsilon$,

948: for any posterior distribution $\rho : \Omega \rightarrow

949: \C{M}_+^1(\Theta)$,

950: \begin{align*}

951: \rho(R) & \leq \Phi_{\frac{\lambda}{N}}^{-1} \left\{

952: \rho(r) + \frac{\C{K}(\rho, \pi) - \log(\epsilon)}{\lambda} \right\}\\

953: & = \frac{\ds 1 - \exp \left\{ - \frac{\lambda \rho(r)}{N}

954: - \frac{\C{K}(\rho,\pi) - \log(\epsilon)}{N} \right\}}{\ds 1

955: - \exp\bigl( - \tfrac{\lambda}{N}\bigr)} \\

956: & \leq \frac{\lambda}{\ds N \left[ 1 - \exp \left( -

957: \tfrac{\lambda}{N} \right) \right]}

958: \left[ \rho(r)+ \frac{ \C{K}(\rho, \pi) - \log(\epsilon)}{\lambda}

959: \right].

960: \end{align*}

961: \end{thm}
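The equality and the last inequality of this theorem can be checked from the expression $\Phi_a(p) = - a^{-1} \log \bigl\{ 1 - p \bigl[ 1 - \exp( - a) \bigr] \bigr\}$ used in these notes. Solving $\Phi_a(p) = q$ for $p$ gives
$$
\Phi_a^{-1}(q) = \frac{1 - \exp( - a q)}{1 - \exp( - a)},
$$
which, with $a = \frac{\lambda}{N}$ and $q = \rho(r) + \frac{\C{K}(\rho, \pi) - \log(\epsilon)}{\lambda}$, is the second line of the theorem. The third line then follows from the inequality $1 - \exp( - x) \leq x$ applied to the numerator with $x = a q$.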

962:

963: We see that for a fixed value of the parameter $\lambda$,

964: the upper bound is optimized when the posterior is chosen

965: to be the Gibbs distribution $\rho = \pi_{\exp( - \lambda r)}$.

966:

967: Moreover we would like to be entitled to optimize the bound

968: in $\lambda$. Gaining the required uniformity in $\lambda$

969: can be done in the following way.

970: Let us notice first that values of $\lambda$ less than $1$

971: are not interesting (because they provide a bound larger than

972: one, at least as soon as $\epsilon \leq \exp(-1)$). Let us consider some real parameter

973: $\alpha > 1$, and the set $\Lambda =

974: \{ \alpha^k ; k \in \NN \}$. Let us put on this set

975: the probability measure $\nu(\alpha^k) = [(k+1)(k+2)]^{-1}$.

976: Applying the previous theorem to $\lambda = \alpha^k$ at

977: confidence level $1 - \frac{\epsilon}{(k+1)(k+2)}$,

978: and using a union bound, we see that

979: with probability at least $1 - \epsilon$,

980: for any posterior distribution $\rho$,

981: $$

982: \rho(R) \leq \inf_{\lambda' \in \Lambda}

983: \Phi_{\frac{\lambda'}{N}}^{-1}

984: \left\{ \rho(r) + \frac{\C{K}(\rho,\pi) - \log(\epsilon) +

985: 2 \log \Bigl[\tfrac{\log(\alpha^2\lambda')}{\log(\alpha)} \Bigr]}{

986: \lambda'}

987: \right\}.

988: $$
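The bookkeeping behind this union bound can be sketched as follows (the notation $\epsilon_k = \frac{\epsilon}{(k+1)(k+2)}$ is introduced only for this remark). Theorem \ref{thm2.7} applied to $\lambda = \alpha^k$ fails with probability at most $\epsilon_k$, and
$$
\sum_{k \in \NN} \epsilon_k = \epsilon \sum_{k \in \NN}
\Bigl( \frac{1}{k+1} - \frac{1}{k+2} \Bigr) = \epsilon.
$$
On the complement of these exceptional events, the bound holds for every $\lambda' = \alpha^k$ with $- \log(\epsilon)$ replaced by
$$
- \log(\epsilon_k) = - \log(\epsilon) + \log \bigl[ (k+1)(k+2) \bigr]
\leq - \log(\epsilon) + 2 \log (k+2),
\qquad k + 2 = \frac{\log( \alpha^2 \lambda')}{\log(\alpha)}.
$$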

989: Now we can remark that for any $\lambda \in (1, + \infty($,

990: there is $\lambda' \in \Lambda$ such that $\alpha^{-1} \lambda \leq \lambda' \leq

991: \lambda$. Moreover, for any $q \in (0,1)$, $\beta \mapsto \Phi_{\beta}^{-1}(q)$

992: is increasing on $\RR_+$. Thus

993: with probability at least $1 - \epsilon$,

994: for any posterior distribution $\rho$,

995: \begin{align*}

996: \rho(R) & \leq \inf_{\lambda \in (1, \infty(}

997: \Phi_{\frac{\lambda}{N}}^{-1}

998: \left\{ \rho(r) + \frac{\alpha}{\lambda} \left[

999: \C{K}(\rho,\pi) - \log(\epsilon) + 2 \log

1000: \Bigl( \tfrac{\log(\alpha^2 \lambda)}{\log(\alpha)} \Bigr)

1001: \right] \right\} \\

1002: & = \inf_{\lambda \in (1, \infty(}\frac{ 1 - \exp \left\{ - \frac{\lambda}{N}\rho(r) -

1003: \frac{\alpha}{N}\left[ \C{K}(\rho,\pi) - \log(\epsilon) +

1004: 2 \log \Bigl( \frac{\log(\alpha^2 \lambda)}{\log(\alpha)}

1005: \Bigr) \right] \right\}}{ 1 -

1006: \exp( - \frac{\lambda}{N} )}.

1007: \end{align*}

1008: Taking the approximately optimal value

1009: $$

1010: \lambda = \sqrt{ \frac{2 N \alpha \left[ \C{K}(\rho,\pi) - \log (\epsilon) \right]}{

1011: \rho(r)[ 1 - \rho(r) ]}},

1012: $$

1013: we obtain

1014: \begin{thm}

1015: \label{thm1.1.11}

1016: \mypoint With probability at least $1 - \epsilon$, for any posterior distribution

1017: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$, putting

1018: $d(\rho,\epsilon) = \C{K}(\rho,\pi) - \log(\epsilon)$,

1019: \begin{multline*}

1020: \rho(R)

1021:  \leq \inf_{k \in \NN}\frac{\ds 1 - \exp \left\{ -

1022:  \frac{\alpha^k}{N}\rho(r) -

1023: \frac{1}{N}\Bigl[ d(\rho,\epsilon)+

1024: \log \bigl[

1025: (k+1)(k+2)\bigr] \Bigr] \right\}}{\ds 1 -

1026: \exp \left( - \frac{\alpha^k}{N} \right)} \\

1027: \leq \frac{\ds 1 - \exp \left\{ - \sqrt{\frac{2 \alpha \rho(r)

1028: d(\rho,\epsilon)}{N [1 - \rho(r)]}} - \frac{\alpha}{N}

1029: \Biggl[ d(\rho,\epsilon)+

1030: 2 \log \biggl( \tfrac{\log \left( \alpha^2

1031: \sqrt{\frac{2 N \alpha d(\rho,\epsilon)}{

1032: \rho(r)[1 - \rho(r)]}}\right)}{\log(\alpha)} \biggr) \Biggr] \right\}}{\ds

1033: 1 - \exp \left[ - \sqrt{\frac{2 \alpha d(\rho,\epsilon)}{

1034: N \rho(r) [1 - \rho(r)]}} \right]}.

1035: \end{multline*}

1036: Moreover with probability at least $1 - \epsilon$, for any

1037: posterior distribution $\rho$ such that $\rho(r) = 0$,

1038: $$

1039: \rho(R) \leq 1 - \exp \left[ - \frac{\C{K}(\rho,\pi) - \log(\epsilon)}{N} \right].

1040: $$

1041: \end{thm}

1042:

1043: We can also elaborate on the results in another direction by introducing

1044: the {\em empirical dimension}

1045: \begin{equation}

1046: \label{eq1.1.3}

1047: d_e = \sup_{\beta \in \RR_+} \beta \bigl[ \pi_{\exp( - \beta r)}(r) -

1048: \ess\inf_{\pi} r

1049: \bigr] \leq - \log \bigl[ \pi \bigl( r = \ess \inf_{\pi} r\bigr) \bigr].

1050: \end{equation}

1051: (There is no need to introduce a margin in this definition, since $r$ takes

1052: at most $N+1$ values, and therefore $\pi \bigl( r = \ess \inf_{\pi}

1053: r \bigr)$

1054: will be strictly positive.)

1055: This leads to

1056: \begin{cor}

1057: \label{cor1.1.12}

1058: \mypoint

1059: For any positive real constant $\lambda$,

1060: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution

1061: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

1062: $$

1063: \rho(R) \leq \Phi_{\frac{\lambda}{N}}^{-1}

1064: \left[ \ess \inf_{\pi} r + \frac{d_e}{\lambda} \log \left( \frac{e \lambda}{d_e}

1065: \right) + \frac{\C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr]- \log(\epsilon)

1066: }{\lambda} \right].

1067: $$

1068: \end{cor}

1069: We could then make the bound uniform in $\lambda$ and optimize this parameter

1070: in a way similar to what was done to obtain Theorem \ref{thm1.1.11}.

1071:

1072: \subsection{Local bounds}

1073: In this subsection, better bounds will be achieved through a better choice

1074: of the prior distribution. This better prior distribution turns out to

1075: depend on the unknown sample distribution $\PP$, and some work is required to

1076: circumvent this and obtain empirical bounds.

1077: \subsubsection{Choice of the prior}

1078: As mentioned in the introduction, if one

1079: wishes to minimize the bound in expectation provided by Theorem

1080: \ref{thm2.4} (page \pageref{thm2.4}),

1081: one is led to consider the optimal choice $\pi =

1082: \PP(\rho)$. However, this is but an ideal choice, since

1083: $\PP$ is in all conceivable situations unknown. Nevertheless it

1084: shows that it is possible through Theorem \ref{thm2.4} to measure

1085: the {\em complexity} of the classification model

1086: with $\PP \bigl\{ \C{K}\bigl[\rho, \PP(\rho) \bigr] \bigr\}$,

1087: which is nothing but the {\em mutual information}

1088: between the random sample $(X_i,Y_i)_{i=1}^N$

1089: and the estimated parameter $\Hat{\theta}$, when the sample

1090: is drawn according to $\PP$ and the

1091: estimated parameter knowing the sample is drawn according

1092: to $\rho$.

1093:

1094: In practice, since we cannot choose $\pi = \PP(\rho)$,

1095: we have to be content with a {\em flat} prior $\pi$,

1096: resulting in a bound measuring complexity according to

1097: $\PP \bigl[ \C{K}(\rho,\pi) \bigr] = \PP \bigl\{ \C{K} \bigl[ \rho, \PP(\rho) \bigr]

1098: \bigr\} + \C{K} \bigl[ \PP(\rho), \pi \bigr]$ larger by the entropy

1099: factor $\C{K}\bigl[ \PP(\rho), \pi \bigr]$ than the optimal one

1100: (we are still commenting on Theorem \ref{thm2.4}).
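The decomposition of the complexity term used here can be checked as follows (a sketch, assuming that $\PP(\rho)$ denotes the mixture $\PP \bigl[ \rho( \cdot ) \bigr]$ and that it is absolutely continuous with respect to $\pi$). Writing $\log \frac{d \rho}{d \pi} = \log \frac{d \rho}{d \PP(\rho)} + \log \frac{d \PP(\rho)}{d \pi}$ and noticing that the second term is a non random function of $\theta$,
\begin{align*}
\PP \bigl[ \C{K}(\rho, \pi) \bigr]
& = \PP \biggl\{ \rho \biggl[ \log \frac{d \rho}{d \PP(\rho)} \biggr] \biggr\}
+ \PP \biggl\{ \rho \biggl[ \log \frac{d \PP(\rho)}{d \pi} \biggr] \biggr\} \\
& = \PP \bigl\{ \C{K} \bigl[ \rho, \PP(\rho) \bigr] \bigr\}
+ \PP(\rho) \biggl[ \log \frac{d \PP(\rho)}{d \pi} \biggr]
= \PP \bigl\{ \C{K} \bigl[ \rho, \PP(\rho) \bigr] \bigr\}
+ \C{K} \bigl[ \PP(\rho), \pi \bigr].
\end{align*}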

1101:

1102: If we want to base the choice of $\pi$ on Theorem \ref{thm2.5}

1103: (page \pageref{thm2.5}), and if we

1104: choose

1105: $\rho = \pi_{\exp( - \lambda r)}$

1106: to optimize this bound, we will be inclined to choose some $\pi$ such

1107: that

1108: $$

1109: \frac{1}{\lambda} \int_0^{\lambda} \pi_{\exp( - \beta R)}(R) d \beta

1110: = - \frac{1}{\lambda} \log \Bigl\{ \pi \bigl[ \exp( - \lambda R) \bigr] \Bigr\}

1111: $$

1112: is as close as possible to $\inf_{\theta \in \Theta} R(\theta)$ in all circumstances. To give

1113: some more specific example, in

1114: the case when the distribution of the design $(X_i)_{i=1}^N$ is known,

1115: one can introduce on the parameter space $\Theta$ the metric $D$

1116: already defined by equation (\ref{eq1.1.2}, page \pageref{eq1.1.2})

1117: (or some available upper bound for this distance). In view of the fact that

1118: $R(\theta) - R(\theta') \leq D(\theta, \theta')$, for any $\theta$, $\theta'

1119: \in \Theta$, it can be meaningful, at least theoretically,

1120: to choose $\pi$ as

1121: $$

1122: \pi = \sum_{k=1}^{\infty} \frac{1}{k(k+1)} \pi_k,

1123: $$

1124: where $\pi_k$ is the uniform measure on some minimal (or close

1125: to minimal) $2^{-k}$-net $\C{N}(\Theta,

1126: D,2^{-k})$ of the metric space $(\Theta, D)$. With this choice

1127: \begin{multline*}

1128: - \frac{1}{\lambda} \log \Bigl\{ \pi \bigl[ \exp (- \lambda R) \bigr] \Bigr\}

1129: \leq \inf_{\theta \in \Theta} R(\theta)

1130: \\ + \inf_k \left\{ 2^{-k} + \frac{\log ( \lvert \C{N}(\Theta, D, 2^{-k}) \rvert

1131: ) + \log[k(k+1)]}{\lambda} \right\}.

1132: \end{multline*}
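This bound can be obtained along the following lines (a sketch, where $\theta^*$ is an arbitrary point of $\Theta$ and $\theta_k$ a point of the net $\C{N}(\Theta, D, 2^{-k})$ such that $D(\theta^*, \theta_k) \leq 2^{-k}$, both notations being introduced only for this remark). Since $R(\theta_k) \leq R(\theta^*) + 2^{-k}$ and $\pi$ puts a mass at least $\bigl[ k(k+1) \lvert \C{N}(\Theta, D, 2^{-k}) \rvert \bigr]^{-1}$ on $\theta_k$,
$$
\pi \bigl[ \exp( - \lambda R) \bigr] \geq
\frac{\exp \bigl\{ - \lambda \bigl[ R(\theta^*) + 2^{-k} \bigr] \bigr\}}{
k(k+1) \lvert \C{N}(\Theta, D, 2^{-k}) \rvert}.
$$
Taking the logarithm, dividing by $- \lambda$ and optimizing in $\theta^*$ and $k$ gives the stated inequality.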

1133:

1134: Another possibility, when we have to deal with real valued parameters,

1135: meaning that $\Theta \subset \RR^d$, is to code each real component

1136: $\theta_i \in \RR$ of $\theta = (\theta_i)_{i=1}^d$ to some precision

1137: and to use a prior $\mu$ which is atomic on dyadic numbers. More

1138: precisely let us parametrize the set of dyadic real numbers as

1139: \begin{multline*}

1140: \C{D} = \Biggl\{

1141: r\bigl[ s, m, p, (b_j)_{j=1}^p\bigr] = s 2^m \biggl( 1 + \sum_{j=1}^p b_j 2^{-j}

1142: \biggr)\,\\ :\,

1143: s \in \{-1, +1\}, m \in \ZZ, p \in \NN, b_j \in \{0,1\} \Biggr\},

1144: \end{multline*}

1145: where, as can be seen, $s$ codes the sign, $m$ the order of magnitude,

1146: $p$ the precision and $(b_j)_{j=1}^p$ the binary representation of

1147: the dyadic number $r\bigl[ s,m,p, (b_j)_{j=1}^p \bigr]$. We can for

1148: instance consider on $\C{D}$ the probability distribution

1149: \begin{equation}

1150: \label{eq1.1.4bis}

1151: \mu\bigl\{ r\bigl[ s,m,p,(b_j)_{j=1}^p \bigr] \bigr\}

1152: = \Bigl[ 3 (\lvert m \rvert + 1)(\lvert m \rvert + 2) (p+1)(p+2) 2^p  \Bigr]^{-1},

1153: \end{equation}

1154: and define $\pi \in \C{M}_+^1(\RR^d)$ as $\pi = \mu^{\otimes d}$.

1155: This kind of ``coding'' prior distribution can be used also to define

1156: a prior on the integers (by renormalizing the restriction of $\mu$

1157: to integers to get a probability distribution).
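As a sanity check (our own computation), the weights defined by equation \eqref{eq1.1.4bis} sum to one over all the codes $\bigl[ s, m, p, (b_j)_{j=1}^p \bigr]$. Indeed there are two signs and $2^p$ binary sequences of length $p$, so that the total mass is
$$
\frac{2}{3} \Biggl[ \sum_{m \in \ZZ} \frac{1}{(\lvert m \rvert + 1)(\lvert m \rvert + 2)} \Biggr]
\Biggl[ \sum_{p \in \NN} \frac{2^p}{(p+1)(p+2) 2^p} \Biggr]
= \frac{2}{3} \times \frac{3}{2} \times 1 = 1,
$$
where we have used the telescoping sum $\sum_{k \in \NN} \frac{1}{(k+1)(k+2)} = 1$, which gives $\frac{1}{2} + 2 \bigl( 1 - \frac{1}{2} \bigr) = \frac{3}{2}$ for the sum over $m \in \ZZ$. Let us notice that the same dyadic number may be reached through several codes (appending zero bits increases $p$ without changing the number), in which case its mass should be understood as the sum over its codes.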

1158: Using $\mu$ is somehow equivalent to picking a representative of

1159: each dyadic interval, and makes it possible to restrict to the

1160: case when the posterior $\rho$ is a Dirac mass without losing

1161: too much (when $\Theta = (0,1)$, this approach is somewhat equivalent

1162: to considering as prior distribution the Lebesgue measure and using

1163: as posterior distributions the uniform probability measures on dyadic

1164: intervals, with the advantage of obtaining non randomized estimators).

1165: When one uses in this way an atomic prior and Dirac masses as posterior

1166: distributions, the bounds proven so far can be obtained through a

1167: simpler union bound argument. This is so true that some of the

1168: detractors of the PAC-Bayesian approach (which, as a newcomer,

1169: has sometimes received a suspicious greeting among statisticians)

1170: have argued that it cannot bring anything that elementary union bound

1171: arguments could not essentially provide. We do not share of course

1172: this derogatory opinion, and while we think that allowing for

1173: non atomic priors and posteriors is worthwhile, we also would

1174: like to stress that the local and relative bounds to come could

1175: hardly be obtained with the help of union bounds alone.

1176:

1177: Although the choice of a {\em flat} prior seems at first glance to be

1178: the only alternative when nothing is known about the sample distribution

1179: $\PP$, the previous discussion shows that this type of choice is

1180: lacking proper localisation, namely that we lose a factor

1181: $\C{K}\bigl\{ \PP\bigl[\pi_{\exp(- \lambda r)}\bigr],\pi \bigr\}$, the divergence

1182: between the bound-optimal prior $\PP\bigl[ \pi_{\exp( - \lambda r)} \bigr]$,

1183: which is concentrated near the minima of $R$ in favourable situations,

1184: and the flat prior $\pi$. Fortunately, there are technical ways to

1185: get around this difficulty and to obtain more local empirical bounds.

1186:

1187: \subsubsection{Unbiased local empirical bounds}

1188: The idea is to start with some flat prior $\pi \in \C{M}_+^1(\Theta)$, and the

1189: posterior distribution $\rho = \pi_{\exp( - \lambda r)}$ minimizing the bound of

1190: Theorem \ref{thm2.4}

1191: (page \pageref{thm2.4}), when $\pi$ is used as a prior. To improve the bound, we

1192: would like to use $\PP \bigl[ \pi_{\exp(- \lambda r)}\bigr]$ instead of $\pi$,

1193: and we are going to make the guess that we could approximate it with $\pi_{\exp(

1194: - \beta R)}$ (we have replaced the parameter $\lambda$ with some distinct

1195: parameter $\beta$ to give some more freedom to our investigation,

1196: and also because, intuitively, $\PP \bigl[ \pi_{\exp( - \lambda r)} \bigr]$

1197: may be expected to be less concentrated than each of the $\pi_{\exp( - \lambda r)}$

1198: it is mixing,

1199: which suggests that the best approximation of $\PP \bigl[

1200: \pi_{\exp( - \lambda r)} \bigr]$ by some $\pi_{\exp( - \beta R)}$

1201: may be obtained for some parameter $\beta < \lambda$). We are then

1202: led to look for some empirical upper bound of $\C{K}\bigl[

1203: \rho, \pi_{\exp( -\beta R)} \bigr]$. This is happily provided by the

1204: following computation

1205: \begin{multline*}

1206: \PP \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr] \bigr\}

1207: = \PP \bigl[ \C{K}(\rho, \pi) \bigr] + \beta \PP \bigl[ \rho (R) \bigr]

1208: + \log \Bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \Bigr\}

1209: \\ = \PP \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)}\bigr] \bigr\}

1210: + \beta \PP \bigl[ \rho(R-r) \bigr]

1211: \\ + \log \Bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \Bigr\}

1212: - \PP \Bigl\{ \log \pi \bigl[ \exp( - \beta r) \bigr] \Bigr\}.

1213: \end{multline*}

1214: Using the convexity of $r \mapsto \log \bigl\{ \pi \bigl[

1215: \exp ( - \beta r) \bigr] \bigr\}$ as in equation

1216: \eqref{eq1.1.3Ter} on page \pageref{eq1.1.3Ter}, we see that

1217: $$

1218: 0 \leq \PP \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr] \bigr\}

1219: \leq \beta \PP \bigl[ \rho(R - r) \bigr] + \PP \bigl\{ \C{K} \bigl[ \rho,

1220: \pi_{\exp( - \beta r)} \bigr] \bigr\}.

1221: $$

1222: This inequality has an interest of its own, since it provides a lower

1223: bound for $\PP \bigl[ \rho(R) \bigr]$. Moreover we can plug it

1224: into Theorem \ref{thm2.4} (page \pageref{thm2.4}) applied to the prior distribution

1225: $\pi_{\exp( - \beta R)}$ and obtain for any posterior distribution $\rho$

1226: and any positive parameter $\lambda$ that

1227: $$

1228: \Phi_{\frac{\lambda}{N}} \bigl\{ \PP \bigl[ \rho(R) \bigr] \bigr\}

1229: \leq \PP \biggl\{ \rho(r) + \frac{\beta}{\lambda} \rho(R-r)

1230: + \frac{1}{\lambda} \PP \Bigl\{ \C{K}\bigl[

1231: \rho, \pi_{\exp( - \beta r)} \bigr] \Bigr\} \biggr\}.

1232: $$

1233: In view of this, it is convenient to introduce the function

1234: \newcommand{\TPhi}{\widetilde{\Phi}}

1235: \begin{multline*}

1236: \TPhi_{a,b}(p) = (1 - b)^{-1}

1237: \bigl[ \Phi_a(p) - bp \bigr] \\

1238: = - (1 - b)^{-1} \Bigl\{ a^{-1} \log \bigl\{ 1 - p

1239: \bigl[ 1 - \exp( - a) \bigr] \bigr\} + bp \Bigr\},\\

1240: p \in (0,1), a \in )0,\infty(, b \in (0,1(.

1241: \end{multline*}

1242: This is a convex function of $p$, moreover

1243: $$

1244: \TPhi_{a,b}'(0)

1245: = \Bigl\{ a^{-1} \bigl[ 1 - \exp(- a) \bigr] - b \Bigr\} (1 - b)^{-1},$$

1246: showing that it is an increasing one-to-one convex map of the unit interval onto

1247: itself as soon as $b \leq a^{-1}

1248: \bigl[ 1 - \exp( - a ) \bigr]$.

1249: Its convexity, combined with the value of its derivative at the origin, shows

1250: that

1251: $$

1252: \TPhi_{a,b}(p) \geq \frac{a^{-1} \bigl[ 1 - \exp ( - a) \bigr] - b}{1-b} p.

1253: $$

1254: Using these notations and remarks, we can state

1255: \begin{thm}

1256: \label{thm3.1}

1257: \mypoint For any positive real constants

1258: $\beta$ and $\lambda$ such that

1259: $0 \leq \beta < N [1 - \exp( - \frac{\lambda}{N})]$, for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

1260: \begin{multline*}

1261: \PP \biggl\{ \rho(r) - \frac{ \C{K} \bigl[ \rho, \pi_{\exp( - \beta r)} \bigr]}{\beta}

1262: \biggr\} \leq

1263: \PP \bigl[ \rho(R) \bigr] \\ \leq

1264: \TPhi_{\frac{\lambda}{N}, \frac{\beta}{\lambda}}^{-1}

1265: \biggl\{ \PP \biggl[ \rho(r) + \frac{\C{K}\bigl[ \rho, \pi_{\exp( - \beta r)}

1266: \bigr]}{\lambda - \beta}

1267: \biggr] \biggr\}

1268: \\ \leq

1269: \frac{\lambda - \beta}{N [ 1 - \exp( - \frac{\lambda}{N})] - \beta}

1270: \PP \biggl[ \rho(r) + \frac{\C{K} \bigl[ \rho, \pi_{\exp( - \beta r)}

1271: \bigr]}{\lambda - \beta} \biggr].

1272: \end{multline*}

1273: Thus (taking $\lambda = 2 \beta$), for any $\beta$ such that $0 \leq \beta < \frac{N}{2}$,

1274: $$

1275: \PP \bigl[ \rho(R) \bigr]

1276: \leq \frac{1}{1 - \frac{2 \beta}{N}} \PP \biggl\{ \rho(r) + \frac{\C{K}\bigl[

1277: \rho, \pi_{\exp(- \beta r)} \bigr]}{\beta} \biggr\}.

1278: $$

1279: \end{thm}

1280: Note that the last inequality is obtained using the fact that

1281: $1 - \exp( - x) \geq x - \frac{x^2}{2}$, $x \in \RR_+$.
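To spell out this last step (a short verification that uses only the elementary inequality just quoted), take $\lambda = 2 \beta$ and $x = \frac{2 \beta}{N}$:
$$
N \bigl[ 1 - \exp\bigl( - \tfrac{2 \beta}{N} \bigr) \bigr] - \beta
\geq N \Bigl( \tfrac{2 \beta}{N} - \tfrac{2 \beta^2}{N^2} \Bigr) - \beta
= \beta \Bigl( 1 - \tfrac{2 \beta}{N} \Bigr),
$$
so that, when $0 < \beta < \frac{N}{2}$,
$$
\frac{\lambda - \beta}{N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr] - \beta}
\leq \frac{1}{1 - \frac{2 \beta}{N}},
$$
while the entropy term is divided by $\lambda - \beta = \beta$.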

1282: \begin{cor}

1283: \label{cor3.2}

1284: \mypoint For any $\beta \in (0,N($,

1285: \begin{multline*}

1286: \PP \bigl[ \pi_{\exp( - \beta r)}(r) \bigr] \leq

1287: \PP \bigl[ \pi_{\exp(- \beta r)}(R) \bigr] \\

1288: \leq \inf_{\lambda \in (- N \log(1 - \frac{\beta}{N}),

1289: \infty(} \frac{\lambda - \beta}{N[1 - \exp( - \frac{\lambda}{N})] - \beta}

1290: \PP \bigl[ \pi_{\exp( - \beta r)}(r) \bigr]

1291: \\ \leq \frac{1}{1 - \frac{2 \beta}{N}} \PP \bigl[

1292: \pi_{\exp( - \beta r)}(r) \bigr],

1293: \end{multline*}

1294: the last inequality holding only when $\beta < \frac{N}{2}$.

1295: \end{cor}

1296:

1297: It is interesting to compare the upper bound provided by

1298: this corollary with Theorem \ref{thm2.4} on page \pageref{thm2.4}

1299: when the posterior is a Gibbs measure $\rho = \pi_{\exp( - \beta r)}$.

1300: We see that we have succeeded in getting rid of the entropy term

1301: $\C{K}\bigl[\pi_{\exp( - \beta r)}, \pi \bigr]$, but at the price

1302: of an increase of the multiplicative factor, which for small values of

1303: $\frac{\beta}{N}$ grows from $( 1 - \frac{\beta}{2N})^{-1}$

1304: (when we take $\lambda = \beta$ in Theorem \ref{thm2.4}),

1305: to $(1 - \frac{2 \beta}{N})^{-1}$. Therefore non localized bounds

1306: have an interest of their own, and are superseded by localized

1307: bounds only in favourable circumstances (presumably when the sample

1308: is large enough when compared with the complexity of the classification

1309: model).

1310:

1311: Corollary \ref{cor3.2} shows that when $\frac{2 \beta}{N}$ is

1312: small, $\pi_{\exp( - \beta r)}(r)$ is a tight approximation of

1313: $\pi_{\exp( - \beta r)}(R)$ in the mean (since we have

1314: an upper bound and a lower bound which are close together).

1315:

1316: Another corollary is obtained by optimizing the bound

1317: given by Theorem \ref{thm3.1} in $\rho$, which is done

1318: by taking $\rho = \pi_{\exp( - \lambda r)}$.

1319: \begin{cor}

1320: \mypoint For any positive real constants $\beta$ and $\lambda$ such that

1321: $0 \leq \beta < N[1 - \exp( - \frac{\lambda}{N})]$,

1322: \begin{multline*}

1323: \PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr]

1324: \leq \TPhi_{\frac{\lambda}{N}, \frac{\beta}{\lambda}}^{-1}

1325: \biggl\{ \PP \biggl[ \frac{1}{\lambda - \beta} \int_{\beta}^{\lambda}

1326: \pi_{\exp( - \gamma r)}(r) d \gamma \biggr] \biggr\}

1327: \\ \leq \frac{1}{N[1 - \exp( - \frac{\lambda}{N})] - \beta} \PP

1328: \biggr[ \int_{\beta}^{\lambda}

1329: \pi_{\exp( - \gamma r)}(r) d \gamma \biggr].

1330: \end{multline*}

1331: \end{cor}

1332: Although this inequality gives by construction a better

1333: upper bound for $\inf_{\lambda \in \RR_+} \PP \bigl[

1334: \pi_{\exp( - \lambda r)}(R) \bigr]$ than Corollary

1335: \ref{cor3.2}, it is not easy to tell which of the two inequalities

1336: gives the better bound on $\PP \bigl[ \pi_{\exp( - \lambda r)}(R)\bigr]$

1337: for a fixed (and possibly suboptimal) value of

1338: $\lambda$, because in this case, one factor is improved while the other is worsened.

1339:

1340: Using the {\em empirical dimension} $d_e$ defined by equation \eqref{eq1.1.3}

1341: on page \pageref{eq1.1.3}, we see that

1342: $$

1343: \frac{1}{\lambda - \beta} \int_{\beta}^{\lambda} \pi_{\exp( - \gamma r)}(r)

1344: d \gamma \leq \ess \inf_{\pi} r + \frac{d_e}{\lambda - \beta} \log \left( \frac{\lambda}{\beta} \right).

1345: $$

1346: Therefore, in the case when we keep the ratio $\frac{\lambda}{\beta}$

1347: bounded, we get a better dependence on the empirical dimension $d_e$

1348: than in Corollary \ref{cor1.1.12} (page \pageref{cor1.1.12}).
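The integral inequality used here can be checked in one line. The sketch below assumes that the empirical dimension is defined by $d_e = \sup_{\xi \in \RR_+} \xi \bigl[ \pi_{\exp( - \xi r)}(r) - \ess \inf_{\pi} r \bigr]$ (this mirrors the definition of $d_e(m)$ given for submodels later in this section; the reader should check it against equation \eqref{eq1.1.3}). Under this assumption, $\pi_{\exp( - \gamma r)}(r) \leq \ess \inf_{\pi} r + \frac{d_e}{\gamma}$ for any $\gamma > 0$, and therefore
$$
\int_{\beta}^{\lambda} \pi_{\exp( - \gamma r)}(r) d \gamma
\leq \int_{\beta}^{\lambda} \Bigl( \ess \inf_{\pi} r + \frac{d_e}{\gamma} \Bigr) d \gamma
= (\lambda - \beta) \ess \inf_{\pi} r + d_e \log \left( \frac{\lambda}{\beta} \right).
$$
Dividing by $\lambda - \beta$ bounds the averaged integral by $\ess \inf_{\pi} r + \frac{d_e}{\lambda - \beta} \log \bigl( \frac{\lambda}{\beta} \bigr)$.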

1349:

1350: \subsubsection{Non random local bounds} Let us come now to the localization

1351: of the non random upper

1352: bound given by Theorem \ref{thm2.5} on page \pageref{thm2.5}.

1353: According to Theorem \ref{thm2.4} (page \pageref{thm2.4})

1354: applied to the localized prior $\pi_{\exp( - \beta R)}$,

1355: \begin{multline*}

1356: \lambda \Phi_{\frac{\lambda}{N}} \bigl\{ \PP \bigl[ \rho(R) \bigr] \bigr\}

1357: \leq \PP \Bigl\{ \lambda \rho(r) + \C{K}(\rho, \pi) + \beta \rho(R) \Bigr\}

1358: + \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\} \\

1359: = \PP \Bigl\{ \C{K}\bigl[\rho, \pi_{\exp( - \lambda r)}\bigr]

1360: - \log \bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr] \bigr\} +

1361: \beta \rho(R) \Bigr\} + \log \bigl\{ \pi \bigl[ \exp (- \beta R) \bigr] \bigr\}\\

1362: \leq \PP \Bigl\{ \C{K}\bigl[\rho, \pi_{\exp( - \lambda r)}\bigr]

1363: + \beta \rho(R) \Bigr\} - \log \bigl\{ \pi \bigl[ \exp( - \lambda R) \bigr]

1364: \bigr\} + \log \bigl\{ \pi \bigl[ \exp ( - \beta R) \bigr] \bigr\},

1365: \end{multline*}

1366: where we have used as previously inequality \eqref{eq1.1.3Ter}

1367: (page \pageref{eq1.1.3Ter}).

1368: This proves

1369: \begin{thm}

1370: \mypoint For any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

1371: for any real parameters $\beta$ and $\lambda$ such that

1372: $0 \leq \beta < N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr]$,

1373: \begin{multline*}

1374: \PP \bigl[ \rho(R) \bigr]

1375: \leq \TPhi_{\frac{\lambda}{N}, \frac{\beta}{\lambda}}^{-1}

1376: \biggl\{

1377: \frac{1}{ \lambda - \beta} \int_{\beta}^{\lambda}

1378: \pi_{\exp( - \gamma R)}(R) d \gamma + \PP \biggl[ \frac{\C{K}\bigl[ \rho,

1379: \pi_{\exp( - \lambda r)}\bigr]}{\lambda - \beta} \biggr] \biggr\} \\

1380: \leq \frac{ 1}{N \bigl[ 1 - \exp( - \frac{\lambda}{N} )

1381: \bigr] - \beta} \biggl\{

1382: \int_{\beta}^{\lambda}

1383: \pi_{\exp( - \gamma R)}(R) d \gamma + \PP \Bigl\{ \C{K}\bigl[

1384: \rho, \pi_{\exp( - \lambda r)}\bigr] \Bigr\} \biggr\}.

1385: \end{multline*}

1386: \end{thm}

1387: Let us notice in particular that this theorem contains Theorem \ref{thm2.5}

1388: (page \pageref{thm2.5})

1389: which corresponds to the case $\beta = 0$. As a corollary, we see also,

1390: taking $\rho = \pi_{\exp( - \lambda r)}$ and $\lambda = 2 \beta$,

1391: and noticing that $\gamma \mapsto \pi_{\exp( -\gamma R)}(R)$ is decreasing, that

1392: \begin{align*}

1393: \PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr]

1394: & \leq  \inf_{\beta, \beta < N[ 1 - \exp( - \frac{\lambda}{N})]}

1395: \frac{\lambda - \beta}{N \bigl[ 1 - \exp( - \frac{\lambda}{N} ) \bigr]

1396: - \beta} \pi_{\exp( - \beta R)}(R)

1397: \\ & \leq \frac{1}{1 - \frac{\lambda}{N}} \pi_{\exp( - \frac{\lambda}{2} R)}(R).

1398: \end{align*}

1399: We can use this inequality in conjunction with the notion of

1400: dimension with margin $\eta$ introduced by equation

1401: \eqref{eq1.1.3Bis} on page \pageref{eq1.1.3Bis},

1402: to see that the Gibbs posterior achieves for

1403: a proper choice of $\lambda$ and any margin parameter $\eta \geq 0$

1404: (which can be chosen to be equal to zero in parametric

1405: situations)

1406: \begin{multline}

1407: \label{eq1.1.7}

1408: \inf_{\lambda} \PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr]

1409: \leq \ess \inf_{\pi} R + \eta + \frac{4 d_{\eta}}{N} \\ +

1410: 2 \sqrt{ \frac{2d_{\eta} \bigl( \ess \inf_{\pi} R + \eta

1411: \bigr) }{N} + \frac{4 d_{\eta}^2}{N^2}}.

1412: \end{multline}

1413: Deviation bounds to come next will show that the optimal

1414: $\lambda$ can be estimated from empirical data.

1415:

1416: Let us propose a little numerical example as an illustration: assuming

1417: that $d_{0} = 10$, $N=1000$ and $\ess \inf_{\pi}

1418: R = 0.2$, we obtain from equation

1419: \eqref{eq1.1.7} that

1420: $\inf_{\lambda} \PP \bigl[ \pi_{\exp(-\lambda r)}(R) \bigr]

1421: \leq 0.373$.
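For the reader's convenience, the arithmetic behind this figure can be spelled out as follows, taking $\eta = 0$ (which is what the use of $d_0$ suggests) in the right-hand side of \eqref{eq1.1.7}:
$$
0.2 + \frac{4 \times 10}{1000}
+ 2 \sqrt{ \frac{2 \times 10 \times 0.2}{1000} + \frac{4 \times 10^2}{1000^2}}
= 0.24 + 2 \sqrt{0.0044} \simeq 0.24 + 0.1327 \simeq 0.373.
$$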

1422: \subsubsection{Local deviation bounds}

1423: %\newcommand{\BPsi}{\overline{\Phi}}

1424: When it comes to deviation bounds, we will for technical reasons

1425: choose a slightly more involved change of prior distribution and

1426: apply Theorem \ref{thm2.7} (page \pageref{thm2.7}) to the prior $

1427: \pi_{\exp [ - \beta \Phi_{- \frac{\beta}{N}}

1428: \circ R ]}$. The advantage of tweaking $R$ with the nonlinear function

1429: $\Phi_{- \frac{\beta}{N}}$ will appear in the search for an empirical upper

1430: bound of the local entropy term.

1431: Theorem \ref{thm2.3} (page \pageref{thm2.3}), used with the above mentioned local prior,

1432: shows that

1433: \begin{equation}

1434: \label{eq1.1.4}

1435: \PP \Biggl\{ \sup_{\rho \in \C{M}_+^1(\Theta)}

1436: \lambda \Bigl\{ \rho \bigl(\Phi_{\frac{\lambda}{N}}\!\circ\!R \bigr)

1437: - \rho(r) \Bigr\} - \C{K}\bigl[\rho, \pi_{\exp (- \beta \Phi_{- \frac{\beta}{N}}

1438: \!\circ R)}\bigr] \Biggr\} \leq 1.

1439: \end{equation}

1440: \newcommand{\Brho}{\Bar{\rho}}Moreover

1441: \begin{multline}

1442: \label{eq1.1.5bis}

1443: \C{K}\bigl[ \rho, \pi_{\exp[ - \beta \Phi_{- \frac{\beta}{N}}\circ R ]} \bigr]

1444: = \C{K}\bigl[ \rho,\pi_{\exp( - \beta r)}

1445: \bigr] + \beta \rho \Bigl[ \Phi_{- \frac{\beta}{N}}\!\circ\!R - r \Bigr] \\*

1446: + \log \Bigl\{ \pi \Bigl[ \exp \bigl( - \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R

1447: \bigr) \Bigr] \Bigr\} - \log \Bigl\{ \pi \Bigl[ \exp ( - \beta r) \Bigr]

1448: \Bigr\},

1449: \end{multline}

1450: which is an invitation to find an upper bound for

1451: $\log \Bigl\{ \pi \Bigl[ \exp \bigl[ - \beta \Phi_{- \frac{\beta}{N}}\!\circ R

1452: \big] \Bigr] \Bigr\} - \log \Bigl\{ \pi \bigl[ \exp ( - \beta r) \bigr] \Bigr\}$.

1453: \newcommand{\Bpi}{\overline{\pi}}

1454: Let us call for short $\Bpi$ our localized prior distribution, thus defined as

1455: $$

1456: \frac{d \Bpi}{d \pi}(\theta)

1457: = \frac{\ds

1458: \exp \Bigl\{ - \beta \Phi_{- \frac{\beta}{N}} \bigl[ R(\theta) \bigr] \Bigr\}}{\ds

1459: \pi \Bigl\{ \exp \bigl[ - \beta

1460: \Phi_{- \frac{\beta}{N}}\!\circ\!R \bigr] \Bigr\}}.

1461: $$

1462: Applying once again Theorem \ref{thm2.3} (page \pageref{thm2.3}),

1463: but this time to $- \beta$, we see that

1464: \begin{multline}

1465: \label{eq1.1.5}

1466: \PP \biggl\{ \exp \biggl[

1467: \log \Bigl\{ \pi \Bigl[ \exp \bigl( - \beta \Phi_{-

1468: \frac{\beta}{N}}\!\circ\!R

1469: \bigr) \Bigr] \Bigr\}

1470: - \log \Bigl\{ \pi \bigl[ \exp ( - \beta r) \bigr] \Bigr\} \biggr] \biggr\}

1471: \\ = \PP \biggl\{ \exp \biggl[

1472: \log \Bigl\{ \pi \Bigl[ \exp \bigl( - \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R

1473: \bigr) \Bigr] \Bigr\}

1474: + \inf_{\rho \in \C{M}_+^1(\Theta)}

1475: \beta \rho(r) + \C{K}(\rho, \pi)  \biggr] \biggr\}

1476: \\ \leq \PP \biggl\{ \exp \biggl[

1477: \log \Bigl\{ \pi \Bigl[ \exp \bigl( - \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R

1478: \bigr) \Bigr] \Bigr\}  + \beta \Bpi(r)

1479: + \C{K}(\Bpi , \pi) \biggr] \biggr\}

1480: \\ = \PP \biggl\{ \exp \biggl[

1481: \beta \Bigl[ \Bpi(r) - \Bpi \bigl( \Phi_{- \frac{\beta}{N}}\!\circ\!R \bigr)

1482: \Bigr] - \C{K}(\Bpi,\Bpi) \biggl]

1483: \biggr\} \leq 1.

1484: \end{multline}

1485: Combining equations \eqref{eq1.1.5bis} and \eqref{eq1.1.5}

1486: and using the concavity of $\Phi_{- \frac{\beta}{N}}$,

1487: we see that with $\PP$ probability at least $1 - \epsilon$,

1488: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

1489: $$

1490: 0 \leq \C{K}(\rho, \Bpi) \leq \C{K} \bigl[\rho, \pi_{\exp(-\beta r)}\bigr]

1491: + \beta \Bigl[ \Phi_{-\frac{\beta}{N}}\bigl[ \rho(R) \bigr] - \rho(r) \Bigr]

1492: - \log(\epsilon).

1493: $$

1494: We have proved a lower deviation bound:

1495: \begin{thm} For any positive real constant $\beta$,

1496: with $\PP$ probability at least $1 - \epsilon$,

1497: for any posterior distribution $\rho : \Omega \rightarrow

1498: \C{M}_+^1(\Theta)$,

1499: $$

1500: \frac{\ds \exp \biggl\{ \frac{\beta}{N} \biggl[

1501: \rho(r) - \frac{\C{K}[\rho, \pi_{\exp( - \beta r)}]

1502: - \log(\epsilon)}{\beta} \biggr] \biggr\} - 1}{\ds

1503: \exp\bigl( \tfrac{\beta}{N} \bigr) - 1} \leq \rho (R).

1504: $$

1505: \end{thm}
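The passage from the preceding inequality to the statement of the theorem is a simple inversion, which may be sketched as follows. Putting $a = - \frac{\beta}{N}$ in the expression $\Phi_a(p) = - a^{-1} \log \bigl\{ 1 - p \bigl[ 1 - \exp( - a) \bigr] \bigr\}$ gives
$\Phi_{- \frac{\beta}{N}}(p) = \frac{N}{\beta} \log \bigl\{ 1 + \bigl[ \exp\bigl( \tfrac{\beta}{N} \bigr) - 1 \bigr] p \bigr\}$, so that the inequality
$0 \leq \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)} \bigr] + \beta \bigl\{ \Phi_{- \frac{\beta}{N}} \bigl[ \rho(R) \bigr] - \rho(r) \bigr\} - \log(\epsilon)$ reads
$$
N \log \Bigl\{ 1 + \bigl[ \exp\bigl( \tfrac{\beta}{N} \bigr) - 1 \bigr] \rho(R) \Bigr\}
\geq \beta \rho(r) - \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)} \bigr] + \log(\epsilon).
$$
Dividing by $N$, taking exponentials and solving for $\rho(R)$ yields the stated lower bound.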

1506: Let us now seek an upper bound. Using the Cauchy-Schwarz inequality to combine

1507: equations \eqref{eq1.1.4} and \eqref{eq1.1.5},

1508: we obtain

1509: \begin{multline}

1510: \label{eq1.1.11Bis}

1511: \PP \biggl\{ \exp \biggl[ \frac{1}{2}

1512: \sup_{\rho \in \C{M}_+^1(\Theta)} \lambda

1513: \rho \bigl( \Phi_{\frac{\lambda}{N}}\!\circ\!R \bigr) - \beta

1514: \rho \bigl( \Phi_{- \frac{\beta}{N}}\!\circ\!R \bigr) - (\lambda - \beta)

1515: \rho(r) - \C{K}\bigl[ \rho, \pi_{\exp(- \beta r)}\bigr] \biggr] \biggr\}

1516: \\ =

1517: \PP \biggl\{ \exp \biggl[

1518: \tfrac{1}{2} \sup_{\rho \in \C{M}_+^1(\Theta)} \biggl(\lambda \Bigl\{

1519: \rho \bigl( \Phi_{\frac{\lambda}{N}}\!\circ\!R \bigr)

1520: - \rho(r) \Bigr\} - \C{K}(\rho, \Bpi) \biggr) \bigg] \\

1521: \times \exp \biggl[ \tfrac{1}{2}

1522: \biggl( \log \Bigl\{ \pi \Bigl[

1523: \exp\bigl( - \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R\bigr)

1524: \Bigr] \Bigr\} - \log \Bigl\{ \pi \Bigl[

1525: \exp ( - \beta r) \Bigr] \Bigr\} \biggr) \biggr] \biggr\}

1526: \\ \leq

1527: \PP \biggl\{ \exp \biggl[

1528: \sup_{\rho \in \C{M}_+^1(\Theta)} \biggl(\lambda \Bigl\{

1529: \rho \bigl( \Phi_{\frac{\lambda}{N}}\!\circ\!R \bigr)

1530: - \rho(r) \Bigr\} - \C{K}(\rho, \Bpi) \biggr) \biggl] \biggr\}^{1/2}\\

1531: \times \PP \biggl\{ \exp \biggl[

1532: \biggl( \log \Bigl\{ \pi \Bigl[

1533: \exp\bigl( - \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R\bigr)

1534: \Bigr] \Bigr\} - \log \Bigl\{ \pi \Bigl[

1535: \exp ( - \beta r) \Bigr] \Bigr\} \biggr) \biggr] \biggr\}^{1/2}

1536: \leq 1.

1537: \end{multline}

1538: Thus with $\PP$ probability

1539: at least $1 - \epsilon$, for any posterior distribution $\rho$,

1540: $$

1541: \lambda \Phi_{\frac{\lambda}{N}}\bigl[ \rho(R) \bigr]

1542: - \beta \Phi_{- \frac{\beta}{N}} \bigl[ \rho(R) \bigr]

1543: \leq (\lambda - \beta) \rho(r) + \C{K}(\rho, \pi_{\exp(- \beta r)})

1544: - 2 \log(\epsilon).

1545: $$

1546: (It would have been more straightforward to use a union bound on

1547: deviation inequalities instead of the Cauchy-Schwarz

1548: inequality on exponential moments; however, this would have

1549: replaced $- 2 \log(\epsilon)$ with the worse term

1550: $2 \log(\frac{2}{\epsilon})$.)

1551: Let us now recall that

1552: \begin{multline*}

1553: \lambda \Phi_{\frac{\lambda}{N}}(p) - \beta \Phi_{-\frac{\beta}{N}}(p)

1554: = - N \log \Bigl\{ 1 - \bigl[ 1 - \exp\bigl(- \tfrac{\lambda}{N}\bigr)\bigr] p

1555: \Bigr\} \\ - N \log \Bigl\{ 1 + \bigl[\exp\bigl( \tfrac{\beta}{N} \bigr) - 1\bigr] p

1556: \Bigr\},

1557: \end{multline*}

1558: and let us put

1559: \begin{multline*}

1560: B  = (\lambda - \beta) \rho(r) + \C{K}\bigl[ \rho, \pi_{\exp(- \beta r)}\bigr]

1561: - 2 \log(\epsilon) \\

1562: = \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr]

1563: + \int_{\beta}^{\lambda} \pi_{\exp( - \xi r)}(r) d \xi - 2 \log(\epsilon).

1564: \end{multline*}

1565: Let us consider moreover the change of variables

1566: $\alpha = 1 - \exp( - \frac{\lambda}{N})$ and $\gamma = \exp(\frac{\beta}{N}) - 1$.\\

1567: We obtain

1568: $

1569: \bigl[ 1 - \alpha \rho(R)  \big] \bigl[ 1 + \gamma \rho(R) \bigr]

1570: \geq \exp( - \tfrac{B}{N}),

1571: $

1572: leading to

1573: \begin{thm}

1574: \label{thm1.1.17}\mypoint

1575: For any positive constants $\alpha$, $\gamma$, such that $0 \leq \gamma < \alpha <1$,

1576: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution

1577: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

1578: the bound

1579: \begin{align*}

1580: M(\rho) & = - \frac{\log\bigl[ (1 - \alpha)(1 + \gamma) \bigr]}{\alpha - \gamma} \rho(r)

1581: + \frac{\ds \C{K}(\rho, \pi_{\exp[ - N \log( 1 + \gamma)r ]})

1582: - 2 \log(\epsilon)}{\ds N (\alpha - \gamma)} \\

1583: & = \frac{\ds \C{K}\bigl[ \rho, \pi_{\exp[ N\log(1 - \alpha) r]}\bigr]

1584: + \int_{N \log(1 + \gamma)}^{- N \log(1 - \alpha)} \pi_{\exp( - \xi r)}(r)

1585: d \xi - 2 \log(\epsilon)}{N (\alpha - \gamma)},

1586: \end{align*}

1587: is such that

1588: $$

1589: \rho(R) \leq \frac{\alpha - \gamma}{2 \alpha \gamma}

1590: \left( \sqrt{1+ \frac{4 \alpha \gamma}{(\alpha - \gamma)^2} \bigl\{ 1 - \exp\bigl[

1591: - (\alpha - \gamma) M(\rho) \bigr]  \bigr\}}- 1 \right) \leq

1592: M(\rho).

1593: $$

1594: \end{thm}
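For completeness, here is a sketch of how the bound of the theorem follows from the inequality
$\bigl[ 1 - \alpha \rho(R) \bigr] \bigl[ 1 + \gamma \rho(R) \bigr] \geq \exp\bigl( - \tfrac{B}{N} \bigr)$, where $\frac{B}{N} = (\alpha - \gamma) M(\rho)$. Writing $p = \rho(R)$ and expanding the product, we get
$$
\alpha \gamma p^2 + (\alpha - \gamma) p - \Bigl\{ 1 - \exp \bigl[ - (\alpha - \gamma) M(\rho) \bigr] \Bigr\} \leq 0,
$$
so that (when $\gamma > 0$) $p$ is at most the positive root of this quadratic polynomial:
$$
p \leq \frac{ - (\alpha - \gamma) + \sqrt{ (\alpha - \gamma)^2 + 4 \alpha \gamma
\bigl\{ 1 - \exp \bigl[ - (\alpha - \gamma) M(\rho) \bigr] \bigr\}}}{2 \alpha \gamma},
$$
which is the non linear bound after factoring out $\alpha - \gamma$. The linear bound then follows from the two elementary inequalities $\sqrt{1 + x} \leq 1 + \frac{x}{2}$ and $1 - \exp( - x) \leq x$.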

1595: Using the {\em empirical dimension} $d_e$ defined by equation \eqref{eq1.1.3}

1596: on page \pageref{eq1.1.3},

1597: we can use the inequality

1598: $$

1599: \int_{\beta}^{\lambda} \pi_{\exp(- \xi r)}(r) d \xi

1600: \leq (\lambda - \beta) \ess \inf_{\pi} r + d_e \log \left( \frac{\lambda}{\beta} \right),

1601: $$

1602: to prove that

1603: \begin{multline*}

1604: M(\rho) \leq \frac{\log\bigl[ (1+\gamma)(1-\alpha) \bigr]}{\gamma - \alpha}

1605: \ess \inf_{\pi} r \\

1606: + \frac{d_e

1607: \log \left[ \frac{ - \log( 1- \alpha)}{\log(1 + \gamma)} \right]

1608: + \C{K}\bigl[ \rho, \pi_{\exp [ N \log(1 - \alpha)r]}\bigr] - 2 \log(\epsilon)}{

1609: N(\alpha - \gamma)}.

1610: \end{multline*}

1611:

1612: Let us give a little numerical illustration: assuming that

1613: $d_e = 10$ and $N = 1000$, taking $\epsilon = 0.01$,

1614: $\alpha = 0.5$ and $\gamma = 0.1$, we obtain from

1615: Theorem \ref{thm1.1.17} $\pi_{\exp[ N\log(1-\alpha)r]}(R) \simeq \pi_{\exp(- 693 r)}(R)

1616: \leq 0.332\leq 0.372$, where we have given respectively the non linear and

1617: the linear bound. This shows the practical interest of keeping the non-linearity.

1618: Let us also mention that optimizing the values of the parameters

1619: $\alpha$ and $\gamma$ would not have yielded a significantly lower bound.
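These two figures can be recovered by hand under two additional assumptions that are not stated explicitly in the example: that $\ess \inf_{\pi} r = 0.2$ (the value used in the first numerical example of this section), and that $\rho = \pi_{\exp[ N \log(1 - \alpha) r]}$, so that the entropy term vanishes. With $\lambda = - N \log(1 - \alpha) \simeq 693.1$ and $\beta = N \log(1 + \gamma) \simeq 95.3$, the upper bound for $M(\rho)$ based on the empirical dimension reads
$$
\frac{\log(0.55)}{- 0.4} \times 0.2
+ \frac{10 \log \bigl( \frac{693.1}{95.3} \bigr) - 2 \log(0.01)}{1000 \times 0.4}
\simeq 0.299 + \frac{19.84 + 9.21}{400} \simeq 0.372,
$$
and plugging this value into the non linear bound gives
$$
\frac{0.4}{2 \times 0.5 \times 0.1} \left( \sqrt{1 + \frac{4 \times 0.5 \times 0.1}{0.4^2}
\bigl[ 1 - \exp( - 0.4 \times 0.372) \bigr]} - 1 \right)
\simeq 4 \bigl( \sqrt{1.1726} - 1 \bigr) \simeq 0.332.
$$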

1620:

1621: The following corollary is obtained by taking $\lambda = 2 \beta$ and

1622: keeping only the linear bound; we give it for the sake of its simplicity:

1623: \begin{cor}\mypoint

1624: For any positive real constant $\beta$ such that

1625: \hfill $\exp(\frac{\beta}{N})

1626: + \exp( - \frac{2 \beta}{N}) < 2$, which is the case when $\beta < 0.48 N$,

1627: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution

1628: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

1629: \begin{multline*}

1630: \rho(R) \leq \frac{ \beta \rho(r) + \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)}\bigr]

1631: - 2 \log(\epsilon)}{N \bigl[ 2 - \exp\bigl( \frac{\beta}{N}\bigr) -

1632: \exp \bigl( - \frac{2 \beta}{N} \bigr) \bigr]}

1633: \\ = \frac{

1634: \int_{\beta}^{2 \beta}

1635: \pi_{\exp( - \xi r)}(r) d \xi + \C{K}\bigl[ \rho, \pi_{\exp( - 2 \beta r)}\bigr] - 2 \log(\epsilon)}{

1636: N \bigl[ 2 - \exp( \frac{\beta}{N}) - \exp( - \frac{2 \beta}{N}) \bigr]}.

1637: \end{multline*}

1638: \end{cor}

1639: Let us mention that this corollary applied to the above numerical example

1640: gives $\pi_{\exp(-200 r)}(R) \leq 0.475$ (when we take $\beta = 100$, consistently

1641: with the choice $\gamma = 0.1$).
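This value can be checked by hand, assuming $d_e = 10$, $N = 1000$, $\epsilon = 0.01$, $\ess \inf_{\pi} r = 0.2$ (the last value being an assumption borrowed from the first numerical example of this section), and $\rho = \pi_{\exp( - 2 \beta r)}$ with $\beta = 100$, so that the entropy term vanishes. Bounding the integral by $\beta \ess \inf_{\pi} r + d_e \log(2)$, we get
$$
\frac{100 \times 0.2 + 10 \log(2) - 2 \log(0.01)}{1000 \bigl[ 2 - \exp(0.1) - \exp( - 0.2) \bigr]}
\simeq \frac{20 + 6.93 + 9.21}{76.1} \simeq 0.475.
$$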

1642:

1643: \subsubsection{Partially local bounds}

1644:

1645: Local bounds are suitable when the lowest values of the empirical

1646: error rate $r$ are reached only on a small part of the parameter

1647: set $\Theta$. When $\Theta$ is the disjoint union of submodels

1648: of different complexities, the minimum of $r$ will as a rule

1649: not be ``localized'' in a way that calls for the use of

1650: local bounds. Just think for instance of the case when

1651: $\Theta = \bigsqcup_{m=1}^M \Theta_m$, where the sets $\Theta_1 \subset

1652: \Theta_2 \subset \dots \subset \Theta_M$ are nested.

1653: In this case we will have $\inf_{\Theta_1} r \geq \inf_{\Theta_2} r

1654: \geq \dots \geq \inf_{\Theta_M} r$, although $\Theta_M$ may be

1655: too large to be the right model to use. In this situation, we

1656: do not want to localize the bound completely. Let us make a

1657: more specific, fanciful but typical pseudo computation.

1658: Just imagine we have a countable collection $(\Theta_m)_{m \in M}$ of submodels.

1659: Let us assume we are interested in choosing between the

1660: estimators $\wtheta_m \in \arg\min_{\Theta_m} r$,

1661: maybe randomizing them (e.g. replacing them

1662: with $\pi^m_{\exp( - \lambda r)}$). Let us

1663: imagine moreover that we are in a typically parametric

1664: situation, where, for some priors $\pi^m \in \C{M}_+^1(\Theta_m)$,

1665: $m \in M$, there is a ``dimension'' $d_m$ such that

1666: $\lambda \bigl[ \pi^m_{\exp( - \lambda r)}(r) - r(\wtheta_m)

1667: \bigr] \simeq d_m$. Let $\mu \in \C{M}_+^1(M)$ be some distribution

1668: on the index set $M$.

1669: It is easy to see that $(\mu \pi)_{\exp( - \lambda r)}$ will

1670: typically not be properly local, in the sense that

1671: typically

1672: \begin{multline*}

1673: (\mu \pi)_{\exp( - \lambda r)}(r) =

1674: \frac{\ds \mu \Bigl\{ \pi_{\exp( - \lambda r)}(r) \pi \bigl[ \exp( - \lambda r) \bigr]

1675: \Bigr\}}{

1676: \mu \Bigl\{ \pi  \bigl[ \exp( - \lambda r) \bigr] \Bigr\}

1677: } \\ \simeq

1678: \frac{\ds \sum_{m \in M}

1679: \bigl[ (\inf_{\Theta_m} r) + \tfrac{d_m}{\lambda} \bigr] \exp \bigl[ - \lambda

1680: (\inf_{\Theta_m} r) - d_m \log\bigl(\tfrac{e \lambda}{d_m}\bigr) \bigr]

1681: \mu(m)}{\ds

1682: \sum_{m \in M} \exp \Bigl[ - \lambda (\inf_{\Theta_m} r) - d_m \log \bigl(\tfrac{e

1683: \lambda}{d_m}

1684: \bigr) \Bigr] \mu(m)}

1685: \\ \simeq \biggl\{ \inf_{m \in M} (\inf_{\Theta_m} r) + \tfrac{d_m}{\lambda}

1686: \log \bigl(

1687: \tfrac{e \lambda}{d_m \mu(m)}\bigr) \biggr\} \\ + \log

1688: \biggl\{ \sum_{m \in M}

1689: \exp \bigl[ - d_m \log(\tfrac{\lambda}{d_m})\bigr] \mu(m)\biggr\},

1690: \end{multline*}

1691: where we have used the estimate

1692: \begin{multline*}

1693: - \log \Bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr]

1694: \Bigr\} = \int_0^{\lambda} \pi_{\exp( - \beta r)}(r) d \beta

1695: \\ \simeq \int_0^{\lambda } (\inf_{\Theta_m} r) + \bigl[

1696: \tfrac{d_m}{\beta} \wedge 1 \bigr]

1697: d \beta \simeq  \lambda (\inf_{\Theta_m} r) + d_m

1698: \bigl[ \log \bigl( \tfrac{\lambda}{d_m} \bigr) + 1 \bigr].

1699: \end{multline*}

1700: Our approximations make no claim to be rigorous or

1701: very accurate, but they nevertheless give the best order

1702: of magnitude we can expect in typical situations, and

1703: show that this order of magnitude is not what we are

1704: looking for: mixing different models with the help

1705: of $\mu$ spoils the localization, introducing a multiplier

1706: $\log \bigl( \tfrac{\lambda}{d_m} \bigr)$ to the dimension

1707: $d_m$ which is precisely what we would have got if we had

1708: not localized the bound at all. What we would

1709: really like to do in such situations is to use a {\em partially

1710: localized} posterior distribution, such as

1711: $\pi^{\widehat{m}}_{\exp( - \lambda r)}$, where

1712: $\widehat{m}$ is an estimator of the best submodel

1713: to be used. While the most straightforward way to

1714: do this is to use a union bound on results obtained

1715: for each submodel $\Theta_m$, we are going here

1716: to show how to allow arbitrary posterior distributions

1717: on the index set (corresponding to a randomization of

1718: the choice of $\widehat{m}$).

1719:

1720: Let us consider the framework we just mentioned: let the

1721: measurable parameter

1722: set $(\Theta, \C{T})$ be a disjoint union of measurable submodels,

1723: $\Theta = \bigsqcup_{m \in M} \Theta_m$. Let the index set $(M, \C{M})$ be

1724: some measurable space (most of the time it will be a countable set).

1725: Let $\mu \in \C{M}_+^1(M)$ be a prior probability distribution on

1726: $(M, \C{M})$. Let $\pi : M \rightarrow \C{M}_+^1(\Theta)$ be a regular

1727: conditional probability measure such that $\pi(m,\Theta_m) = 1$,

1728: for any $m \in M$.

1729: Let $\mu \pi \in \C{M}_+^1(M \times \Theta)$ be the product probability

1730: measure defined by

1731: $\mu\pi(h) = \int_{m \in M} \left( \int_{\theta \in \Theta} h(m,\theta)

1732: \pi(m, d \theta) \right) \mu(dm)$, for any bounded measurable

1733: function $h : M \times \Theta \rightarrow \RR$.

1734: Let $\pi_{\exp(h)} \in \C{M}_+(M \times \Theta)$ be the regular

1735: conditional probability measure defined by

1736: $$

1737: \frac{d \pi_{\exp(h)}}{d \pi} (m, \theta) = \frac{ \exp\bigl[ h(\theta) \bigr]}{

1738: \pi \bigl[ m, \exp(h) \bigr]},

1739: $$

1740: where consistently with previous notations $\pi(m,h) = \int_{\Theta}

1741: h(m,\theta) \pi(m, d \theta)$ (we will also often use the less explicit

1742: notation $\pi(h)$).

1743: Let for short

1744: $$

1745: U(\theta, \omega) = \lambda \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] -

1746: \beta \Phi_{- \frac{\beta}{N}}\bigl[ R(\theta) \bigr] - (\lambda - \beta) r

1747: (\theta, \omega).

1748: $$

1749: Integrating with respect to $\mu$ equation \eqref{eq1.1.11Bis} on page \pageref{eq1.1.11Bis},

1750: written in each submodel $\Theta_m$ using the prior distribution $\pi(m, \cdot)$,

1751: we see that

1752: \begin{multline*}

1753: \PP \biggl\{ \exp \biggl[

1754: \sup_{\nu \in \C{M}_+^1(M)} \sup_{\rho : M \rightarrow \C{M}_+^1(\Theta)}

1755: \frac{1}{2} \Bigl[ (\nu \rho)(U) - \nu \bigl\{

1756: \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)}\bigr] \bigr\} \Bigr] - \C{K}(\nu,\mu)

1757: \biggr] \biggr\}

1758: \\ \leq

1759: \PP \biggl\{ \exp \biggl[

1760: \sup_{\nu \in \C{M}_+^1(M)} \frac{1}{2} \nu \biggl( \sup_{\rho : M \rightarrow \C{M}_+^1(\Theta)}

1761: \rho(U) - \C{K}(\rho, \pi_{\exp( - \beta r)}) \biggr)

1762: - \C{K}(\nu, \mu) \biggr] \biggr\}

1763: \\ =

1764: \PP \biggl\{ \mu \biggl[ \exp \Bigl\{ \tfrac{1}{2} \sup_{\rho : M \rightarrow

1765: \C{M}_+^1(\Theta)} \Bigl[ \rho(U) - \C{K} \bigl[ \rho, \pi_{\exp( - \beta r)}\bigr]

1766: \Bigr] \Bigr\} \biggr] \biggr\}\\

1767: = \mu \biggl\{ \PP \biggl[ \exp \Bigl\{ \tfrac{1}{2} \sup_{\rho : M \rightarrow

1768: \C{M}_+^1(\Theta)} \Bigl[ \rho(U) - \C{K} \bigl[ \rho, \pi_{\exp( - \beta r)}\bigr]

1769: \Bigr] \Bigr\} \biggr] \biggr\} \leq 1.

1770: \end{multline*}

1771: This proves that

1772: \begin{multline}

1773: \label{eq1.1.10}

1774: \PP \Biggl\{ \exp \Biggl[ \frac{1}{2}

1775: \sup_{\nu \in \C{M}_+^1(M)} \sup_{\rho:M\rightarrow \C{M}_+^1(\Theta)}

1776: \lambda \Phi_{\frac{\lambda}{N}} \bigl[\nu \rho(R) \bigr]

1777: - \beta \Phi_{-\frac{\beta}{N}} \bigl[ \nu \rho(R) \bigr]

1778: \\ -(\lambda - \beta) \nu \rho(r) - 2 \C{K}(\nu,\mu) - \nu \bigl\{

1779: \C{K} \bigl[ \rho,

1780: \pi_{\exp( - \beta r)}\bigr] \bigr\} \Biggr] \Biggr\} \leq 1.

1781: \end{multline}

1782: \newcommand{\sR}{R^{\star}}

1783: \newcommand{\sr}{r^{\star}}

1784: \newcommand{\stheta}{\theta^{\star}}

1785: Introducing the optimal value of $r$ on each submodel

1786: $\sr(m) = \ess \inf_{\pi(m,\cdot)} r$ and the empirical dimensions

1787: $$

1788: d_e(m) = \sup_{\xi \in \RR_+} \xi \bigl[

1789: \pi_{\exp( - \xi r)}(m,r) - \sr(m) \bigr],

1790: $$

1791: we can thus state

1792: \begin{thm}

1793: \label{thm1.1.20}

1794: \mypoint

1795: For any positive real constants $\beta < \lambda$,

1796: with $\PP$ probability at least $1 - \epsilon$,

1797: for any posterior distribution $\nu : \Omega \rightarrow \C{M}_+^1(M)$,

1798: for any conditional posterior distribution $\rho : \Omega \times

1799: M \rightarrow \C{M}_+^1(\Theta)$,

1800: $$

1801: \lambda \Phi_{\frac{\lambda}{N}} \bigl[ \nu \rho(R) \bigr]

1802: - \beta \Phi_{-\frac{\beta}{N}} \bigl[ \nu \rho(R) \bigr]

1803: \leq B_1(\nu, \rho),

1804: $$

1805: \begin{multline*}

1806: \text{where } B_1(\nu, \rho) =

1807: (\lambda - \beta) \nu \rho(r) + 2\C{K}(\nu,\mu)+

1808: \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)} \bigr] \bigr\} - 2

1809: \log(\epsilon)\\

1810: = \nu \biggl[ \int_{\beta}^{\lambda}

1811: \pi_{\exp ( - \alpha r)}(r) d\alpha \biggr] + 2 \C{K}(\nu, \mu)

1812: + \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr] \bigr\}

1813: - 2 \log(\epsilon)

1814: \\

1815: = - 2 \log \biggl\{ \mu \biggl[ \exp \biggl( - \frac{1}{2}

1816: \int_{\beta}^{\lambda} \pi_{\exp( - \alpha r)}(r) d \alpha \biggr)

1817: \biggr] \biggr\} \\

1818: \shoveright{+ 2 \C{K}\bigl[ \nu, \mu_{\left(\frac{\pi[\exp(-\lambda r)]}{

1819: \pi[\exp(-\beta r)]}\right)^{1/2}}\bigr] + \nu \bigl\{ \C{K}\bigl[

1820: \rho, \pi_{\exp( - \lambda r)} \bigr] \bigr\} - 2 \log(\epsilon),}\\

1821: \shoveleft{\text{and therefore }

1822: B_1(\nu,\rho) \leq  \nu \Bigl[ (\lambda - \beta) \sr + \log \Bigl( \tfrac{\lambda}{\beta}

1823: \Bigr) d_e

1824: \Bigr] + 2 \C{K}(\nu, \mu)} \\\shoveright{ + \nu \bigl\{ \C{K} \bigl[

1825: \rho, \pi_{\exp( - \lambda r)} \bigr] \bigr\} - 2 \log(\epsilon),}

1826: \\\shoveleft{\text{as well as }

1827: B_1(\nu, \rho) \leq - 2 \log \biggl\{ \mu \biggl[

1828: \exp \biggl( - \frac{1}{2} (\lambda - \beta) \sr - \frac{1}{2}

1829: \log \Bigl( \tfrac{\lambda}{\beta} \Bigr) d_e \biggr) \biggr] \biggr\}

1830: }\\+ 2 \C{K} \bigl[ \nu, \mu_{\left(\frac{\pi[\exp( -

1831: \lambda r)]}{\pi[\exp( - \beta r)]}\right)^{1/2}}

1832: \bigr] + \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr] \bigr\}

1833: - 2 \log(\epsilon).

1834: \end{multline*}

1835: Thus, for any real constants $\alpha$ and $\gamma$ such that

1836: $0 \leq \gamma < \alpha < 1$, with $\PP$ probability

1837: at least $1 - \epsilon$, for any posterior distribution

1838: $\nu : \Omega \rightarrow \C{M}_+^1(M)$ and any conditional posterior

1839: distribution $\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,

1840: the bound

1841: \begin{multline*}

1842: B_2(\nu,\rho) = - \tfrac{\log \bigl[ (1 - \alpha)(1 + \gamma)\bigr]}{\alpha-\gamma}

1843: \nu\rho(r) + \tfrac{ 2 \C{K}(\nu,\mu) + \nu \bigl\{ \C{K}\bigl[

1844: \rho, \pi_{(1 + \gamma)^{-Nr}}\bigr] \bigr\} - 2 \log(\epsilon)}{N (\alpha - \gamma)}

1845: \\ = \tfrac{

1846: 2 \C{K}\bigl[ \nu, \mu_{\left( \frac{ \pi [ (1 -\alpha)^{Nr}]}{\pi [ (1 + \gamma)^{-N

1847: r}]}\right)^{1/2}} \bigr]

1848: + \nu \bigl\{ \C{K}\bigl[\rho, \pi_{(1 - \alpha)^{Nr}}\bigr] \bigr\}}{

1849: N(\alpha - \gamma)} \\ - \tfrac{

1850: 2 \log \Bigl\{ \mu \Bigl[ \exp \biggl[ - \frac{1}{2}

1851: \int_{N \log(1 + \gamma)}^{- N \log(1 - \alpha)} \pi_{\exp( - \xi r)}(\cdot,r) d \xi

1852: \biggr] \Bigr] \Bigr\}

1853: + 2 \log(\epsilon)}{

1854: N(\alpha - \gamma)}

1855: \end{multline*}

1856: satisfies

1857: $$

1858: \nu \rho(R) \leq \frac{\alpha - \gamma}{2 \alpha \gamma}

1859: \left( \sqrt{1 + \frac{4 \alpha \gamma}{(\alpha - \gamma)^2} \Bigl\{

1860: 1 - \exp \bigl[ - (\alpha - \gamma) B_2(\nu,\rho) \bigr] \Bigr\}} - 1

1861: \right) \leq B_2(\nu,\rho).

1862: $$

1863: \end{thm}

1864: Let us remark that in the case when $\nu = \mu_{\left( \frac{

1865: \pi[(1 - \alpha)^{Nr}]}{\pi[(1 + \gamma)^{-Nr}]} \right)^{1/2}}$

1866: and $\rho = \pi_{(1-\alpha)^{Nr}}$,

1867: we get as desired a bound that is adaptively local in all the $\Theta_m$

1868: (at least when $M$ is countable and $\mu$ is atomic):

1869: \begin{multline*}

1870: B_2(\nu,\rho) \leq - \tfrac{2}{N(\alpha - \gamma)}

1871: \log \Biggl\{ \mu \biggl\{

1872: \exp \biggl[ \tfrac{N}{2} \log\bigl[(1+\gamma)(1 - \alpha)\bigr]

1873: \sr  \\\shoveright{ - \log \left( \tfrac{-\log(1-\alpha)}{\log(1 + \gamma)}

1874: \right) \tfrac{d_e}{2} \biggr] \biggr\} \Biggr\}

1875: - \frac{2 \log(\epsilon)}{N(\alpha - \gamma)}\qquad}

1876: \\\shoveleft{\qquad \qquad \leq \inf_{m \in M} \biggl\{

1877: - \tfrac{\log\bigl[ (1- \alpha)(1+\gamma)\bigr]}{\alpha

1878: -\gamma} \sr(m)} \\ +

1879: \log \left( \tfrac{- \log(1 - \alpha)}{\log(1 + \gamma)}\right)

1880: \tfrac{d_e(m)}{N(\alpha - \gamma)} -

1881: 2 \tfrac{\log\bigl[\epsilon \mu(m) \bigr]}{N(\alpha - \gamma)} \biggr\}.

1882: \end{multline*}

1883: The penalization by the {\em empirical dimension} $d_e(m)$ in each submodel

1884: is as desired linear in $d_e(m)$. Non random partially local bounds could

1885: be obtained in a way that is easy to imagine. We leave this investigation

1886: to the reader.

1887:

1888: \subsubsection{Two step localization}

1889:

1890: We have seen that the choice of the posterior

1891: distribution $\nu$ on the index set optimizing the bound of Theorem \ref{thm1.1.20}

1892: (page \pageref{thm1.1.20}) is such that

1893: $$

1894: \frac{d\nu}{d \mu}(m)  \sim

1895: \left( \frac{\pi \bigl[ \exp\bigl( - \lambda r(m, \cdot) \bigr) \bigr]}{\pi

1896: \bigl[ \exp\bigl( - \beta r(m,\cdot) \bigr)  \bigr]}\right)^{\frac{1}{2}}

1897: = \exp \biggl[ - \frac{1}{2} \int_{\beta}^{\lambda}

1898: \pi_{\exp( - \alpha r)}(m,r)  d \alpha \biggr].

1899: $$
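The last equality can be checked directly. As a short verification (the shorthand $Z_m(\alpha)$ is introduced here for convenience only and is not used elsewhere), put $Z_m(\alpha) = \pi \bigl[ \exp \bigl( - \alpha r(m, \cdot) \bigr) \bigr]$. Since $r$ is bounded, we may differentiate under the integral sign, which gives
$$
\frac{d}{d \alpha} \log Z_m(\alpha) = - \frac{\pi \bigl[ r(m,\cdot) \exp \bigl( - \alpha r(m, \cdot) \bigr) \bigr]}{\pi \bigl[ \exp \bigl( - \alpha r(m, \cdot) \bigr) \bigr]} = - \pi_{\exp( - \alpha r)}(m, r),
$$
and therefore, integrating between $\beta$ and $\lambda$,
$$
\log \biggl( \frac{Z_m(\lambda)}{Z_m(\beta)} \biggr) = - \int_{\beta}^{\lambda} \pi_{\exp( - \alpha r)}(m, r) d \alpha.
$$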

1900: \newcommand{\ov}[1]{\overline{#1}}

1901: This suggests replacing the prior distribution $\mu$ with $\ov{\mu}$

1902: defined by its density

1903: \begin{multline}

1904: \label{eq1.13}

1905: \frac{d \ov{\mu}}{d \mu} (m) = \frac{ \exp \bigl[ - h(m) \bigr]}{\mu

1906: \bigl[ \exp( - h ) \bigr]},

1907: \\ \text{ where }

1908: h(m) = \xi \int_{\beta}^{\gamma} \pi_{\exp( - \alpha \Phi_{- \frac{\eta}{N}}

1909: \circ R)} \bigl[ \Phi_{- \frac{\eta}{N}}\!\circ\!R(m, \cdot) \bigr] d \alpha.

1910: \end{multline}

1911: The use of $\Phi_{- \frac{\eta}{N}}\!\circ\!R$ instead of $R$ is motivated

1912: by technical reasons which will appear in subsequent computations.

1913: Indeed, we will need to bound

1914: $$

1915: \nu \biggl[ \int_{\beta}^{\gamma} \pi_{\exp ( - \alpha

1916: \Phi_{- \frac{\eta}{N}} \circ R)} \bigl(

1917: \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) d \alpha \biggr]

1918: $$

1919: in order to handle $\C{K}(\nu, \ov{\mu})$.

1920: In the spirit of equation (\ref{eq1.1.4}, page \pageref{eq1.1.4}),

1921: starting back from Theorem \ref{thm2.3} (page \pageref{thm2.3}),

1922: applied in each submodel $\Theta_m$ to the prior

1923: distribution $\pi_{\exp( - \gamma \Phi_{-\frac{\eta}{N}} \circ

1924: R )}$ and integrated with respect to

1925: $\ov{\mu}$, we see that for any

1926: positive real constants $\lambda$, $\gamma$ and $\eta$,

1927: with $\PP$ probability at least $1 - \epsilon$,

1928: for any posterior distribution $\nu : \Omega \rightarrow \C{M}_+^1(M)$ on the index set

1929: and any conditional posterior distribution $\rho : \Omega \times M \rightarrow

1930: \C{M}_+^1(\Theta)$,

1931: \begin{multline}

1932: \label{eq1.1.13}

1933: \nu \rho \bigl( \lambda \Phi_{\frac{\lambda}{N}}\!\circ\!R - \gamma

1934: \Phi_{-\frac{\eta}{N}}\!\circ\!R \bigr) \leq \lambda \nu \rho(r) \\ +

1935: \nu \C{K}(\rho, \pi)

1936: + \C{K}(\nu, \ov{\mu}) +

1937: \nu \Bigl\{ \log \Bigl[ \pi \bigl[ \exp \bigl(

1938: - \gamma \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) \bigr]  \Bigr] \Bigr\} -

1939: \log(\epsilon).

1940: \end{multline}

1941: Since $x \mapsto f(x) \overset{\text{\rm def}}{=}

1942: \lambda \Phi_{\frac{\lambda}{N}}(x)

1943: - \gamma \Phi_{- \frac{\eta}{N}}(x)$ is a convex function, it is such

1944: that

1945: $$

1946: f(x) \geq x f'(0)= x N \Bigl\{

1947:  \bigl[1 - \exp( - \tfrac{\lambda}{N}) \bigr] - \tfrac{\gamma}{\eta}

1948: \bigl[ \exp( \tfrac{\eta}{N}) - 1 \bigr] \Bigr\}.

1949: $$

1950: Thus if we put

1951: \begin{equation}

1952: \label{eq1.14}

1953: \gamma = \frac{\eta \bigl[ 1 - \exp (- \frac{\lambda}{N}) \bigr]}{\exp(

1954: \frac{\eta}{N}) - 1},

1955: \end{equation}

1956: we obtain that $f(x) \geq 0$, $x \in \RR$, and therefore that

1957: the left-hand side of equation \eqref{eq1.1.13} is non negative.
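To make this step explicit, here is a short verification, assuming the definition $\Phi_a(p) = - a^{-1} \log \bigl\{ 1 - \bigl[ 1 - \exp(-a) \bigr] p \bigr\}$ used in the earlier sections of these notes. Since $\Phi_a(0) = 0$ and $\Phi_a'(0) = a^{-1} \bigl[ 1 - \exp(-a) \bigr]$, we have $f(0) = 0$ and
$$
\lambda \Phi_{\frac{\lambda}{N}}'(0) = N \bigl[ 1 - \exp( - \tfrac{\lambda}{N} ) \bigr],
\qquad
\gamma \Phi_{-\frac{\eta}{N}}'(0) = \tfrac{\gamma N}{\eta} \bigl[ \exp( \tfrac{\eta}{N} ) - 1 \bigr].
$$
Condition \eqref{eq1.14} is exactly the requirement that these two slopes coincide, so that $f'(0) = 0$ and the tangent inequality for the convex function $f$ reduces to $f(x) \geq 0$.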

1958: We can moreover introduce the prior conditional distribution $\ov{\pi}$ defined

1959: by

1960: $$

1961: \frac{d \ov{\pi}}{d \pi}(m, \theta) =

1962: \frac{ \exp \bigl[ - \beta \Phi_{- \frac{\eta}{N}} \circ R(\theta) \bigr]}{

1963: \pi \bigl\{m, \exp \bigl[ - \beta \Phi_{- \frac{\eta}{N}} \circ R \bigr] \bigr\}}.

1964: $$

1965: With $\PP$ probability at least $1 - \epsilon$, for any posterior distributions

1966: $\nu : \Omega \rightarrow \C{M}_+^1(M)$ and $\rho: \Omega \times M \rightarrow

1967: \C{M}_+^1(\Theta)$,

1968: \begin{multline*}

1969: \beta \nu \rho(r) + \nu \bigl[ \C{K}( \rho, \pi) \bigr] =

1970: \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp (- \beta r)} \bigr] \bigr\} -

1971: \nu \biggl[ \log \Bigl\{ \pi \bigl[ \exp ( - \beta r) \bigr] \Bigr\}  \biggr]

1972: \\ \leq \nu \bigl\{ \C{K} \bigl[ \rho, \pi_{\exp( - \beta r)} \bigr] \bigr\}

1973: + \beta \nu \ov{\pi} (r) + \nu \bigl[ \C{K}(\ov{\pi}, \pi) \bigr] \\

1974: \leq \nu \bigl\{ \C{K} \bigl[ \rho, \pi_{\exp ( - \beta r)} \bigr] \bigr\}

1975: + \beta \nu \ov{\pi} \bigl( \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr)

1976: \\\shoveright{+ \tfrac{\beta}{\eta} \bigl[ \C{K}(\nu, \ov{\mu})- \log(\epsilon) \bigr]

1977: + \nu \bigl[ \C{K}(\ov{\pi}, \pi) \bigr] \qquad}

1978: \\\shoveleft{\qquad

1979: = \nu \bigl\{ \C{K} \bigl[ \rho, \pi_{\exp ( - \beta r)} \bigr] \bigr\}

1980: - \nu \Bigl\{ \log \Bigl[ \pi \bigl[ \exp \bigl( -

1981: \beta \Phi_{-\frac{\eta}{N}}\!\circ\!R \bigr) \bigr] \Bigr] \Bigr\}}

1982: \\ + \tfrac{\beta}{\eta} \bigl[ \C{K}(\nu, \ov{\mu}) - \log(\epsilon) \bigr].

1983: \end{multline*}

1984: Thus, coming back to equation \eqref{eq1.1.13}, we see that under condition

1985: \eqref{eq1.14},

1986: with $\PP$ probability at least $1 - \epsilon$,

1987: \begin{multline*}

1988: 0 \leq (\lambda - \beta) \nu \rho(r) + \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{

1989: \exp( - \beta r)}\bigr] \bigr\} \\ - \nu \biggl[

1990: \int_{\beta}^{\gamma} \pi_{\exp( - \alpha \Phi_{- \frac{\eta}{N}} \circ R)}

1991: \bigl( \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) d \alpha \biggr]

1992: + (1 + \tfrac{\beta}{\eta}) \bigl[ \C{K}(\nu, \ov{\mu}) + \log(\tfrac{2}{\epsilon})

1993: \bigr].

1994: \end{multline*}

1995: Noticing moreover that

1996: \begin{multline*}

1997: (\lambda - \beta) \nu \rho(r) + \nu \bigl\{ \C{K} \bigl[

1998: \rho, \pi_{\exp( - \beta r)}\bigr] \bigr\} \\ =

1999: \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp ( - \lambda r)}\bigr] \bigr\}

2000: + \nu \biggl[ \int_{\beta}^{\lambda} \pi_{\exp( - \alpha r)}(r) d \alpha \biggr],

2001: \end{multline*}

2002: and choosing $\rho = \pi_{\exp( - \lambda r)}$, we have proved

2003: \begin{thm}

2004: For any positive real constants $\beta$, $\gamma$ and $\eta$, such that

2005: \linebreak $\gamma < \eta \bigl[ \exp( \frac{\eta}{N}) - 1 \bigr]^{-1}$, defining

2006: $\lambda$ by condition \eqref{eq1.14}, so that \linebreak

2007: $\lambda = - N \log \Bigl\{ 1 - \frac{\gamma}{\eta} \bigl[ \exp(

2008: \frac{\eta}{N}) - 1 \bigr] \Bigr\}$,

2009: with $\PP$ probability at least $1 - \epsilon$,

2010: for any posterior distribution $\nu : \Omega \rightarrow \C{M}_+^1(M)$,

2011: any conditional posterior distribution $\rho: \Omega \times M

2012: \rightarrow \C{M}_+^1(\Theta)$,

2013: \begin{multline*}

2014: \nu \biggl[ \int_{\beta}^{\gamma}

2015: \pi_{\exp( - \alpha \Phi_{- \frac{\eta}{N}}\circ R)}

2016: \bigl( \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) d \alpha \biggr]

2017: \\ \leq \nu \biggl[ \int_{\beta}^{\lambda} \pi_{\exp( - \alpha r)}(r)

2018: d \alpha \biggr] + \bigl( 1 + \tfrac{\beta}{\eta} \bigr)

2019: \bigl[ \C{K}(\nu, \ov{\mu}) + \log\bigl(\tfrac{2}{\epsilon}\bigr) \bigr].

2020: \end{multline*}

2021: \end{thm}

2022: Let us remark that this theorem does not require that $\beta < \gamma$,

2023: and thus provides both an upper and a lower bound for the quantity of

2024: interest:

2025: \begin{cor}

2026: For any positive real constants $\beta$, $\gamma$ and $\eta$

2027: such that

2028: $\max \{ \beta, \gamma \} < \eta \bigl[ \exp(\frac{\eta}{N}) - 1 \bigr]^{-1}$,

2029: with $\PP$ probability at least $1- \epsilon$, for any posterior distributions

2030: $\nu : \Omega \rightarrow \C{M}_+^1(M)$ and $\rho: \Omega \times M \rightarrow

2031: \C{M}_+^1(\Theta)$,

2032: \begin{multline*}

2033: \nu \biggl[ \int_{- N \log \{ 1 - \frac{\beta}{\eta} [

2034: \exp (\frac{\eta}{N}) -1 ] \}}^{\gamma} \pi_{\exp( - \alpha r)}(r) d \alpha \biggr]

2035: - \bigl( 1 + \tfrac{\gamma}{\eta} \bigr)\bigl[ \C{K}(\nu, \ov{\mu}) +

2036: \log \bigl( \tfrac{3}{\epsilon} \bigr) \bigr]

2037: \\ \shoveleft{\qquad \leq  \nu \biggl[ \int_{\beta}^{\gamma} \pi_{\exp( - \alpha

2038: \Phi_{- \frac{\eta}{N}}\circ R)} \bigl(

2039: \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) d \alpha \biggr] }

2040: \\ \leq \nu \biggl[ \int_{\beta}^{- N \log \{ 1 - \frac{\gamma}{\eta}

2041: [ \exp(\frac{\eta}{N})-1 ] \}}

2042: \pi_{\exp( - \alpha r)}(r) d \alpha \biggr]

2043: \\ + \bigl( 1 + \tfrac{\beta}{\eta} \bigr) \bigl[

2044: \C{K}(\nu, \ov{\mu}) + \log \bigl( \tfrac{3}{\epsilon} \bigr) \bigr].

2045: \end{multline*}

2046: \end{cor}

2047: We can then recall that

2048: $$

2049: \C{K}(\nu, \ov{\mu}) = \xi \bigl( \nu - \ov{\mu} \bigr)  \biggl[ \int_{\beta}^{\gamma}

2050: \pi_{\exp( - \alpha \Phi_{- \frac{\eta}{N}}\circ R)} \bigl(

2051: \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) d \alpha \biggr] + \C{K}(\nu, \mu) -

2052: \C{K}(\ov{\mu}, \mu),

2053: $$

2054: to conclude that, putting

2055: \begin{equation}

2056: \label{eq1.16}

2057: G_{\eta}(\alpha) =

2058: -N \log \bigl\{ 1 - \frac{\alpha}{\eta} \bigl[

2059: \exp \bigl( \frac{\eta}{N} \bigr) - 1 \bigr] \bigr\} \geq \alpha, \qquad \alpha \in \RR_+,

2060: \end{equation}

2061: and

2062: \begin{equation}

2063: \label{eq1.15}

2064: \frac{d \w{\nu}}{d \mu} (m) \overset{\text{\rm def}}{=}

2065: \frac{\exp \bigl[ - h(m) \bigr]}{\mu \bigl[ \exp( - h)\bigr]}

2066: \text{ where }

2067: h(m) = \xi \int_{G_{\eta}(\beta)}^{\gamma} \pi_{\exp( - \alpha r)}(m, r) d \alpha,

2068: \end{equation}

2069: the divergence of $\nu$ with respect to the local prior $\ov{\mu}$ is bounded by

2070: \begin{multline*}

2071: \bigl[ 1 - \xi \bigl( 1 + \tfrac{\beta}{\eta} \bigr) \bigr]

2072: \C{K}(\nu, \ov{\mu}) \\

2073: \shoveleft{\qquad \leq \xi \nu \biggl[ \int_{\beta}^{

2074: G_{\eta}(\gamma)}

2075: \pi_{\exp( - \alpha r)}(r) d \alpha \biggr]

2076: - \xi \ov{\mu} \biggl[ \int_{G_{\eta}(\beta)}^{\gamma} \pi_{\exp( - \alpha r)}(r)

2077: d \alpha \biggr]} \\ \shoveright{+ \C{K}(\nu, \mu)

2078: - \C{K}(\ov{\mu}, \mu)

2079: + \xi \bigl( 2 +

2080: \tfrac{\beta + \gamma}{\eta} \bigr)

2081: \log\bigl(\tfrac{3}{\epsilon}\bigr)} \\

2082: \shoveleft{\qquad \leq \xi \nu \biggl[ \int_{\beta}^{G_{\eta}(\gamma)} \pi_{\exp( - \alpha r)}(r)

2083: d \alpha \biggr] + \C{K}(\nu, \mu)} \\ +

2084: \log \biggl\{ \mu \biggl[ \exp \biggl( - \xi \int_{G_{\eta}(\beta)}^{\gamma}

2085: \pi_{\exp(- \alpha r)}(r) d \alpha \biggr) \biggr] \biggr\}

2086: \\

2087: \shoveright{+ \xi \bigl( 2 +

2088: \tfrac{\beta + \gamma}{\eta} \bigr)

2089: \log\bigl(\tfrac{3}{\epsilon}\bigr)}

2090: \\

2091: \shoveleft{\qquad = \C{K}(\nu, \w{\nu}) + \xi \nu \biggl[ \biggl( \int_{\beta}^{G_{\eta}(\beta)}

2092: + \int_{\gamma}^{G_{\eta}(\gamma)}\biggr)  \pi_{\exp( - \alpha r)}(r) d \alpha \biggr]}

2093: \\

2094: + \xi \bigl( 2 + \tfrac{\beta+\gamma}{\eta} \bigr) \log \bigl( \tfrac{3}{\epsilon}

2095: \bigr).

2096: \end{multline*}

2097: We have proved

2098: \begin{thm}

2099: \mypoint

2100: \label{thm1.23}

2101: For any positive constants $\beta$, $\gamma$ and $\eta$ such that

2102: \linebreak $\max \{ \beta, \gamma \}

2103: < \eta \bigl[ \exp( \frac{\eta}{N}) - 1 \bigr]^{-1}$,

2104: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution

2105: $\nu : \Omega \rightarrow \C{M}_+^1(M)$ and any conditional posterior distribution

2106: $\rho: \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,

2107: \begin{multline*}

2108: \C{K}(\nu, \ov{\mu}) \leq \Bigl[1 - \xi\Bigl(1

2109: + \frac{\beta}{\eta}\Bigr)\Bigr]^{-1}

2110: \biggl\{

2111: \C{K}(\nu, \w{\nu})

2112: \\

2113: + \xi \nu \biggl[ \biggl( \int_{\beta}^{G_{\eta}(\beta)}

2114: + \int_{\gamma}^{G_{\eta}(\gamma)}\biggr)

2115: \pi_{\exp( - \alpha r)} (r) d \alpha \biggr]

2116: \\\shoveright{ + \xi \bigl( 2 + \tfrac{\beta+\gamma}{\eta} \bigr)

2117: \log \bigl( \tfrac{3}{\epsilon}

2118: \bigr) \biggr\}}

2119: \\ \shoveleft{ \leq  \Bigl[ 1 - \xi\Bigl(1 + \frac{\beta}{\eta}\Bigr) \Bigr]^{-1}

2120: \biggl\{ \C{K}(\nu, \w{\nu})}\\ + \xi \nu \biggl[

2121: \bigl[ G_{\eta}(\gamma)

2122: - \gamma  + G_{\eta}(\beta)- \beta \bigr] \sr +

2123: \log \biggl( \frac{G_{\eta}(\beta)

2124: G_{\eta}(\gamma)}{\beta \gamma}\biggr)

2125: d_e \biggr] \\ +

2126: \xi \bigl( 2 + \tfrac{\beta+\gamma}{\eta} \bigr) \log \bigl(

2127: \tfrac{3}{\epsilon} \bigr) \biggr\},

2128: \end{multline*}

2129: where the local prior $\ov{\mu}$ is defined by equation \eqref{eq1.13}

2130: on page \pageref{eq1.13}, the local posterior $\w{\nu}$ by equation \eqref{eq1.15}

2131: above, and the function $G_{\eta}$ by equation \eqref{eq1.16}.

2132: \end{thm}

2133: We can then use this theorem to give a local version of Theorem

2134: \ref{thm1.1.20} (page \pageref{thm1.1.20}). To get something pleasing

2135: to read, we can apply Theorem \ref{thm1.23} with constants

2136: $\beta'$, $\gamma'$ and $\eta$ chosen so that

2137: $ \frac{2 \xi}{1 - \xi(1 + \frac{\beta'}{\eta})} = 1,$

2138: $G_{\eta}(\beta') = \beta$ and $\gamma' = \lambda$, where

2139: $\beta$ and $\lambda$ are the constants appearing in Theorem

2140: \ref{thm1.1.20}. This gives

2141: \begin{thm}\mypoint

2142: \label{thm1.24}

2143: For any positive real constants $\beta < \lambda$ and $\eta$

2144: such that $\lambda < \eta \bigl[ \exp(\frac{\eta}{N}) - 1 \bigr]^{-1}$,

2145: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution

2146: $\nu : \Omega \rightarrow \C{M}_+^1(M)$, for any conditional posterior distribution

2147: $\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,

2148: \begin{multline*}

2149: \hfill \lambda \Phi_{\frac{\lambda}{N}} \bigl[ \nu \rho(R) \bigr]

2150: - \beta \Phi_{- \frac{\beta}{N}} \bigl[ \nu \rho(R) \bigr]

2151: \leq B_3(\nu, \rho),\text{ where}\hfill\\

2152: \shoveleft{B_3(\nu, \rho) =

2153: \nu \biggl[ \int_{G_{\eta}^{-1} (\beta)}^{G_{\eta}(\lambda)}

2154: \pi_{\exp( - \alpha r)}(r) d \alpha \biggr] }

2155: \\ + \Bigl(3 + \tfrac{G_{\eta}^{-1}(\beta)}{

2156: \eta} \Bigr) \C{K}\bigl[ \nu, \mu_{\exp \bigl[ - \bigl(3

2157: + \frac{G_{\eta}^{-1}(\beta)}{\eta}\bigr)^{-1}

2158: \int_{\beta}^{\lambda} \pi_{\exp( - \alpha

2159: r)}(r) d \alpha \bigr]}\bigr]

2160: \\\shoveright{ + \nu \bigl\{ \C{K}\bigl[\rho,

2161: \pi_{\exp( - \lambda r)}\bigr] \bigr\} + \Bigl( 4 +

2162: \tfrac{G_{\eta}^{-1}(\beta)+\lambda}{\eta} \Bigr) \log \bigl( \tfrac{4}{\epsilon}

2163: \bigr)}\\

2164: \shoveleft{\qquad \leq \nu \Bigl[ \bigl[ G_{\eta}(\lambda) - G_{\eta}^{-1}(\beta)  \bigr]

2165: \sr + \log \Bigl(\tfrac{G_{\eta}(\lambda)}{G_{\eta}^{-1}(\beta)} \Bigr) d_e

2166: \Bigr]}

2167: \\

2168: + \Bigl(3 + \tfrac{G_{\eta}^{-1}(\beta)}{

2169: \eta} \Bigr) \C{K}\bigl[ \nu, \mu_{\exp \bigl[ - \bigl(3+\frac{

2170: G_{\eta}^{-1}(\beta)}{\eta}\bigr)^{-1} \int_{\beta}^{\lambda} \pi_{\exp( - \alpha

2171: r)}(r) d \alpha \bigr]}\bigr]

2172: \\ + \nu \bigl\{ \C{K}\bigl[\rho,

2173: \pi_{\exp( - \lambda r)}\bigr] \bigr\} + \Bigl( 4 +

2174: \tfrac{G_{\eta}^{-1}(\beta)+\lambda}{\eta} \Bigr) \log \bigl( \tfrac{4}{\epsilon}

2175: \bigr),

2176: \end{multline*}

2177: and where the function $G_{\eta}$ is defined by equation

2178: \eqref{eq1.16} on page \pageref{eq1.16}.

2179: \end{thm}

2180: A first remark: if we had the stamina to use Cauchy--Schwarz inequalities

2181: (or more generally H\"older inequalities) on exponential moments

2182: instead of using weighted union bounds on deviation inequalities, we could have

2183: replaced $\log(\frac{4}{\epsilon})$ with $- \log(\epsilon)$ in the above inequalities.

2184:

2185: We see that we have achieved the desired kind of localization of Theorem

2186: \ref{thm1.1.20} (page \pageref{thm1.1.20}), since the new empirical

2187: entropy term \\\mbox{} \hfill$\C{K}[\nu, \mu_{\exp [

2188: - \xi \int_{\beta}^{\lambda} \pi_{\exp( - \alpha r)}(r) d\alpha ]}]$

2189: \hfill\mbox{}\\

2190: cancels for a value of the posterior distribution on the index set $\nu$

2191: which is of the same form as the one minimizing the bound $B_1(\nu, \rho)$

2192: of Theorem \ref{thm1.1.20} (with a decreased constant, as could be expected).

2193: In a typical parametric setting, we will have

2194: $$

2195: \int_{\beta}^{\lambda} \pi_{\exp( - \alpha r)}(r) d\alpha

2196: \simeq (\lambda - \beta) \sr(m) + \log \left( \tfrac{\lambda}{\beta} \right)

2197: d_e(m),

2198: $$

2199: and therefore, if we choose for $\nu$ the Dirac mass at\\\mbox{}\hfill

2200: $\w{m} \in \arg \min_{m \in M} \sr(m) +

2201: \frac{\log(\frac{\lambda}{\beta})}{\lambda - \beta} d_e(m)$,\hfill

2202: \mbox{}\\

2203: and $\rho(m,\cdot) = \pi_{\exp( - \lambda r)}(m, \cdot)$,

2204: we will get, in the case when the index set $M$ is countable,

2205: \begin{multline*}

2206: B_3(\nu, \rho) \lesssim

2207: \max \left\{ \bigl[ G_{\eta}(\lambda) - G_{\eta}^{-1}(\beta) \bigr]

2208: , (\lambda - \beta)\tfrac{\log\bigl[\frac{G_{\eta}(\lambda)}{

2209: G_{\eta}^{-1}(\beta)}\bigr]}{

2210: \log(\frac{\lambda}{\beta})}\right\}

2211: \\ \shoveright{\times \Bigl[ \sr(\w{m}) + \tfrac{\log(\frac{\lambda}{\beta})}{\lambda - \beta}

2212: d_e(\w{m}) \Bigr]\quad}\\

2213: \shoveleft{\quad + \Bigl( 3 +

2214: \tfrac{G_{\eta}^{-1}(\beta)}{\eta} \Bigr)

2215: \log \Biggl\{ \sum_{m \in M} \tfrac{\mu(m)}{\mu(\w{m})}

2216: \exp \biggl[ - \Bigl( 3 + \tfrac{G_{\eta}^{-1} (\beta)}{\eta}\Bigr)^{-1}}\\

2217: \times

2218: \Bigl\{ (\lambda - \beta) \bigl[ \sr(m) - \sr(\w{m}) \bigr]

2219: + \log \bigl( \tfrac{\lambda}{\beta} \bigr)

2220: \bigl[ d_e(m)- d_e(\w{m}) \bigr] \Bigr\} \biggr] \Biggr\} \\

2221: + \Bigl(4 + \tfrac{G_{\eta}^{-1}(\beta)+\lambda}{\eta}\Bigr)\log\bigl(\tfrac{4}{

2222: \epsilon}\bigr).

2223: \end{multline*}

2224: Therefore, as long as there are not too many of them, the models

2225: for which the penalized minimum empirical

2226: risk $\sr(m) + \frac{\log(\frac{\lambda}{\beta})}{\lambda - \beta}

2227: \,d_e(m)$

2228: is far from optimal have little influence on this bound.

2229:

2230: \subsection{Relative bounds}

2231: The behaviour of the minimum

2232: of the empirical process $\theta \mapsto r(\theta)$

2233: is known to depend on the covariances between pairs $\bigl[

2234: r(\theta), r(\theta') \bigr]$, $\theta, \theta' \in \Theta$.

2235: Accordingly, our previous study, based on the analysis of the variance

2236: of $r(\theta)$ (or technically on some exponential moment playing

2237: quite the same role), is missing some accuracy in some circumstances

2238: (namely when $\inf_{\Theta} R$ is not close enough to zero).

2239: In this subsection, instead of bounding the expected risk $\rho(R)$,

2240: we are going to upper bound the difference $\rho(R) - \inf_{\Theta} R$,

2241: and more generally $\rho(R) - R(\T)$, where $\T \in \Theta$ is some

2242: fixed parameter value. Eventually in the next subsection

2243: we will analyze $\rho(R) - \pi_{\exp( - \beta R)}(R)$, allowing us to compare the expected error

2244: rate of a posterior distribution $\rho$ with the error rate

2245: of a Gibbs prior distribution.

2246: Thus relative bounds are not exactly of the

2247: same nature as previous ones: although it is not possible to estimate

2248: $\rho(R)$ with an order of precision higher than $(\rho(R) / N)^{1/2}$,

2249: it is still possible in some situations to reach a better precision

2250: for $\rho(R) - \inf_{\Theta} R$, as we will see.

2251: The study of PAC-Bayesian relative bounds stems from the second and

2252: third parts of J. Y. Audibert's dissertation \cite{Audibert2}.

2253:

2254: We will suggest two different kinds of applications of these bounds.

2255: The first, more obvious one is to upper bound $\rho(R) - \inf_{\Theta} R$

2256: to get an idea of the performance of the posterior distribution $\rho$.

2257:

2258: The second application is to compare the classification model indexed by

2259: $\Theta$ with a submodel indexed by one of its measurable subsets

2260: $\Theta_1 \subset \Theta$. For this purpose we are

2261: going to compare $\rho(R)$, where $\rho : \Omega \rightarrow

2262: \C{M}_+^1(\Theta)$ is any posterior distribution, with

2263: $R(\T)$, where $\T \in \Theta_1$ is some possibly unobservable

2264: value of the parameter in the submodel defined by $\Theta_1$.

2265: We will typically consider the case when $\T \in \arg\min_{\Theta_1} R$.

2266: In this special case, a negative bound for $\rho(R) - R(\T)

2267: = \rho(R) - \inf_{\Theta_1} R$ indicates that it is definitely

2268: worth using a randomized estimator $\rho$ supported by

2269: the larger parameter set $\Theta$ instead of using only

2270: the classification model defined by the smaller set $\Theta_1$.

2271:

2272: \subsubsection{Basic inequalities}

2273: Relative bounds in this section are based on the control of

2274: $r(\theta) - r(\T)$, where $\theta, \T \in \Theta$. These

2275: differences are related to the random variables

2276: $$

2277: \psi_i(\theta, \T) = \sigma_i(\theta) - \sigma_i(\T)

2278: = \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr] -

2279: \B{1} \bigl[ f_{\T}(X_i) \neq Y_i \bigr].

2280: $$

2281:

2282: Some supplementary technical difficulties, as compared to

2283: the previous sections, come from the fact that

2284: $\psi_i(\theta, \T)$ takes three values, whereas $\sigma_i(\theta)$

2285: takes only two. Let $\rr(\theta, \T) = r(\theta) - r(\T)$

2286: and $\R(\theta, \T) = R(\theta) - R(\T)$. We have as usual from

2287: independence that

2288: \begin{multline*}

2289: \log \Bigl\{ \PP \Bigl[ \exp \bigl[

2290: - \lambda \rr(\theta, \T) \bigr] \Bigr] \Bigr\}

2291: = \sum_{i=1}^N \log \Bigl\{ \PP \Bigl[

2292: \exp \bigl[ - \tfrac{\lambda}{N} \psi_i(\theta, \T) \bigr] \Bigr] \Bigr\}

2293: \\ \leq N \log \biggl\{ \frac{1}{N} \sum_{i=1}^N \PP

2294: \Bigl\{ \exp \Bigl[ - \frac{\lambda}{N} \psi_i(\theta, \T) \Bigr] \Bigr\} \biggr\}.

2295: \end{multline*}
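The inequality in the last display is an instance of Jensen's inequality for the concave function $\log$. Spelled out, with the shorthand $a_i$ introduced here only for this remark,
$$
\frac{1}{N} \sum_{i=1}^N \log(a_i) \leq \log \biggl( \frac{1}{N} \sum_{i=1}^N a_i \biggr),
\qquad a_i = \PP \Bigl\{ \exp \Bigl[ - \frac{\lambda}{N} \psi_i(\theta, \T) \Bigr] \Bigr\} > 0.
$$
Equality holds when the $a_i$ are all equal, in particular when the sample is i.i.d.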

2296: Let $C_i$ be the distribution of $\psi_i(\theta, \T)$ under $\PP$ and let

2297: $\Bar{C} = \frac{1}{N} \sum_{i=1}^N C_i \in \C{M}_+^1\bigl( \{-1, 0, 1\} \bigr)$.

2298: With these notations

2299: \begin{equation}

2300: \label{eq2.2.2Bis}

2301: \log \Bigl\{ \PP \Bigl[ \exp \bigl[ - \lambda \rr( \theta, \T) \bigr]

2302: \Bigr] \Bigr\} \leq N \log \biggl\{ \int \exp \Bigl( - \frac{\lambda}{N}

2303: \psi \Bigr) \Bar{C}(d \psi) \biggr\}.

2304: \end{equation}

2305: \newcommand{\BM}{{M'}}

2306: The right-hand side of this inequality is a function of $\Bar{C}$. On the

2307: other hand, $\Bar{C}$ being a probability measure on a three point set, is

2308: defined by two parameters, that we may take equal to $\int \psi \Bar{C}(d \psi)$ and

2309: $\int \psi^2 \Bar{C}(d \psi)$. To this purpose, let us introduce

2310: $$

2311: \BM(\theta, \T) = \int \psi^2 \Bar{C}(d \psi) = \Bar{C}(+1)

2312: + \Bar{C}(-1) = \frac{1}{N} \sum_{i=1}^N \PP \bigl[

2313: \psi_i^2(\theta, \T) \bigr], \quad \theta, \T \in \Theta.

2314: $$

2315: It is a pseudo distance

2316: (meaning that it is symmetric and satisfies the triangle inequality),

2317: since it can also be written as

2318: $$

2319: \BM(\theta, \T) = \frac{1}{N} \sum_{i=1}^N

2320: \PP \Bigl\{ \Bigl\lvert \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr]

2321: - \B{1} \bigl[ f_{\T}(X_i) \neq Y_i \bigr] \Bigr\rvert \Bigr\},

2322: \quad \theta, \T \in \Theta.

2323: $$

2324: It is readily seen that

2325: $$

2326: N \log \left\{ \int \exp \left( - \frac{\lambda}{N} \psi \right) \Bar{C}(d \psi)

2327: \right\} = - \lambda \Psi_{\frac{\lambda}{N}} \bigl[ R'(\theta, \T), M'(\theta, \T) \bigr],

2328: $$

2329: where

2330: \begin{align*}

2331: \Psi_a(p,m) & = - a^{-1}

2332: \log \Bigl[ (1 - m) + \frac{m+p}{2} \exp(-a)

2333: + \frac{m-p}{2} \exp (a) \Bigr]

2334: \\ & = - a^{-1} \log \Bigl\{

2335: 1 - \sinh(a) \bigl[ p - m \tanh(\tfrac{a}{2}) \bigr] \Bigr\}.

2336: \end{align*}
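For completeness, here is a sketch of the computation behind these formulas. A probability measure $\Bar{C}$ on $\{-1, 0, 1\}$ with mean $p$ and second moment $m$ satisfies $\Bar{C}(+1) = \frac{m+p}{2}$, $\Bar{C}(-1) = \frac{m-p}{2}$ and $\Bar{C}(0) = 1 - m$, which gives the first expression of $\Psi_a(p,m)$ with $a = \frac{\lambda}{N}$. The second expression follows from
$$
(1 - m) + \frac{m+p}{2} \exp(-a) + \frac{m-p}{2} \exp(a)
= 1 + m \bigl[ \cosh(a) - 1 \bigr] - p \sinh(a)
= 1 - \sinh(a) \bigl[ p - m \tanh( \tfrac{a}{2} ) \bigr],
$$
where we have used the identity $\frac{\cosh(a) - 1}{\sinh(a)} = \tanh(\frac{a}{2})$. This establishes the equality stated above between the log-Laplace transform of $\Bar{C}$ and $- \lambda \Psi_{\frac{\lambda}{N}}$.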

2337: Thus plugging this equality into inequality \eqref{eq2.2.2Bis} we see that for

2338: any real parameter $\lambda$,

2339: $$

2340: \log \Bigl\{ \PP \Bigl[ \exp \bigl[ - \lambda \rr( \theta, \T) \bigr]

2341: \Bigr] \Bigr\} \leq - \lambda \Psi_{\frac{\lambda}{N}}

2342: \bigl[ \R(\theta, \T), \BM(\theta, \T) \bigr].

2343: $$

2344: To make a link with previous works initiated by Mammen and Tsybakov

2345: (see e.g. \cite{Mammen,Tsybakov}), we may consider the pseudo

2346: distance $D$ on $\Theta$ defined on page \pageref{eq1.1.2} by equation

2347: \eqref{eq1.1.2}.

2348: This distance only depends on the distribution of the patterns. It

2349: is often used to formulate margin assumptions (in the sense of Mammen

2350: and Tsybakov).

2351: Here we are going to work rather with

2352: $\BM$: as it is dominated by $D$ in the sense that

2353: $\BM(\theta, \T) \leq D(\theta, \T)$, $\theta, \T \in \Theta$, with equality

2354: in the important case of binary classification, hypotheses formulated on

2355: $D$ induce hypotheses on $M'$, and working with $M'$ may only sharpen the

2356: results when compared to working with $D$.

2357:

2358: Using the same reasoning as in the previous section, we deduce

2359: \begin{thm}

2360: \label{thm4.1}

2361: \mypoint For any real parameter $\lambda$, any $\T \in \Theta$,

2362: $$

2363: \PP \biggl\{ \exp \biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}

2364: \lambda \Bigl[ \rho \bigl\{ \Psi_{\frac{\lambda}{N}} \bigl[

2365: \R(\cdot, \T\,), \BM(\cdot, \T\,) \bigr]  \bigr\}

2366: - \rho\bigl[\rr(\cdot, \T) \bigr] \Bigr]

2367: - \C{K}(\rho, \pi) \biggr] \biggr\} \leq 1.

2368: $$

2369: \end{thm}

2370:

2371: We are now going to derive a variant of Theorem \ref{thm4.1}.

2372: In Theorem \ref{thm4.1}, we obtained an inequality comparing one observed quantity

2373: $\rho\bigl[r'(\cdot, \T\,)\bigr]$ with two unobserved ones, $\rho\bigl[R'(

2374: \cdot, \T\,)\bigr]$ and $\rho\bigl[M'(\cdot, \T\,) \bigr]$

2375: (because of the convexity of the function $\lambda \Psi_{\frac{\lambda}{N}}$,

2376: $$

2377: \lambda \rho

2378: \bigl\{ \Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \T\,),M'(\cdot, \T\,) \bigr]

2379: \bigr\} \geq

2380: \lambda \Psi_{\frac{\lambda}{N}} \bigl\{ \rho\bigl[R'(\cdot, \T\,)\bigr],

2381: \rho\bigl[ M'(\cdot, \T\,) \bigr] \bigr\}.)

2382: $$

2383: This may be inconvenient when looking for

2384: an empirical bound for $\rho\bigl[ R'(\cdot, \T) \bigr]$, and we are now going to seek

2385: an inequality comparing $\rho\bigl[R'(\cdot, \T\,)\bigr]$ with empirical quantities

2386: only. This is possible through a change of variables in the

2387: exponential inequality. Indeed, if we consider now random variables

2388: $\chi_i(\theta, \T)$, such that

2389: $$

2390: 1 - \frac{\lambda}{N} \psi_i = \exp \left( - \frac{\lambda}{N} \chi_i \right),

2391: $$

2392: which is possible when $\frac{\lambda}{N} \in \; )\!-\!\!1, 1($ and leads us to define

2393: $$

2394: \chi_i = - \frac{N}{\lambda} \log \left( 1 - \frac{\lambda}{N}\psi_i \right),

2395: $$

2396: we obtain easily following the same reasoning as previously

2397: \begin{multline*}

2398: \log \Biggl\{ \PP \biggl\{ \exp \biggl[ \sum_{i=1}^N \log \Bigl(

2399: 1 - \frac{\lambda}{N} \psi_i

2400: \Bigr) \biggr] \biggr\} \Biggr\}

2401: \\ \leq \sum_{i=1}^N \log \Bigl[ 1 - \frac{\lambda}{N} \PP(\psi_i) \Bigr]

2402: \leq N  \log \Bigl[ 1 - \frac{\lambda}{N} R'(\theta,\T\,) \Bigr].

2403: \end{multline*}

2404: Let us replace for simplicity $\lambda / N$ with $\lambda$.

2405: Let us also introduce the random pseudo distance

2406: \begin{multline}

2407: \label{eq1.3}

2408: m'(\theta, \T) = \frac{1}{N} \sum_{i=1}^N \psi_i(\theta,\T)^2

2409: \\ = \frac{1}{N} \sum_{i=1}^N \Bigl\lvert \B{1} \bigl[

2410: f_{\theta}(X_i) \neq Y_i \bigr] - \B{1} \bigl[ f_{\T}(

2411: X_i) \neq Y_i \bigr] \Bigr\rvert, \quad \theta, \T \in \Theta.

2412: \end{multline}

2413: This is the empirical counterpart of $M'$, since $\PP(m') = M'$.

2414: Let us notice that

2415: \begin{multline*}

2416: \frac{1}{N} \sum_{i=1}^N \log \bigl[ 1 - \lambda \psi_i(\theta, \T) \bigr]

2417: = \frac{\log(1 - \lambda) - \log(1 + \lambda)}{2} r'(\theta, \T)

2418: \\ \shoveright{+ \frac{\log(1 - \lambda) + \log(1 + \lambda)}{2} m'(\theta,\T)

2419: \qquad} \\

2420: = \frac{1}{2} \log \left( \frac{1 - \lambda}{1 + \lambda} \right)

2421: r'\bigl(\theta, \T\,\bigr) + \frac{1}{2} \log( 1 - \lambda^2)

2422: m'\bigl(\theta, \T\,\bigr).

2423: \end{multline*}
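This identity rests on the fact that $\psi_i$ takes its values in $\{-1, 0, 1\}$. As a quick check, for any $\psi \in \{-1, 0, 1\}$ and any $\lambda \in \; )\!-\!\!1, 1($,
$$
\log( 1 - \lambda \psi) = \frac{\log(1 - \lambda) - \log(1 + \lambda)}{2} \psi
+ \frac{\log(1 - \lambda) + \log(1 + \lambda)}{2} \psi^2,
$$
since the right-hand side is equal to $\log(1 - \lambda)$ when $\psi = 1$, to $\log(1 + \lambda)$ when $\psi = -1$, and to $0$ when $\psi = 0$. Averaging over $i = 1, \dots, N$ and using the definitions of $r'$ and $m'$ gives the stated formula.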

2424: With these notations, we can

2425: conveniently write the previous inequality as

2426: \begin{multline*}

2427: \PP \Biggl\{ \exp \Biggl[ -N \log \bigl[ 1 - \lambda R'(\theta, \T) \bigr]

2428: \\ - \frac{N}{2} \log \biggl(\frac{1+\lambda}{1-\lambda}\biggr) r'\bigl(\theta,

2429: \T\,\bigr) + \frac{N}{2} \log\bigl(1 - \lambda^2\bigr) m'\bigl(\theta, \T\, \bigr) \Biggr] \Biggr\}

2430: \leq 1.

2431: \end{multline*}

2432: Integrating with respect to a prior probability measure $\pi \in \C{M}_+^1(\Theta)$,

2433: we obtain

2434: \begin{thm}

2435: \label{thm2.2.18}

2436: \mypoint For any real parameter $\lambda \in \; )\!\!-\!\!1,1($, for any $\T \in \Theta$,

2437: for any prior probability distribution $\pi \in \C{M}_+^1(\Theta)$,

2438: \begin{multline*}

2439: \PP \Biggl\{ \exp \Biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)} \biggl\{

2440: -N \rho \Bigl\{ \log \bigl[ 1 - \lambda R'(\cdot, \T\,) \bigr] \Bigr\}

2441: \\ - \frac{N}{2} \log \biggl( \frac{1+\lambda}{1-\lambda}\biggr)

2442: \rho \bigl[r'(\cdot, \T\,)\bigr]\qquad \\ + \frac{N}{2} \log(1 - \lambda^2)

2443: \rho\bigl[m'(\cdot, \T\,) \bigr]

2444: - \C{K}(\rho, \pi) \biggr\} \Biggr] \Biggr\} \leq 1.

2445: \end{multline*}

2446: \end{thm}

2447:

2448: \subsubsection{Non random bounds}

2449: Let us first deduce a non random bound from Theorem \ref{thm4.1}.

2450: This theorem can be conveniently taken advantage of by

2451: throwing the non linearity into a localized prior, considering

2452: the prior probability measure $\mu$ defined by

2453: $$

2454: \frac{d \mu}{d \pi}(\theta) = \frac{\exp \bigl\{ - \lambda \Psi_{\frac{\lambda}{N}}

2455: \bigl[ R'(\theta, \T\,), \BM(\theta, \T\,) \bigr] + \beta \R(\theta, \T\,) \bigr\}}

2456: {\pi \Bigl\{ \exp \bigl\{ - \lambda \Psi_{\frac{\lambda}{N}}

2457: \bigl[ R'(\cdot, \T\,), \BM(\cdot, \T\,) \bigr] + \beta \R(\cdot, \T\,) \bigr\}

2458: \Bigr\}}.

2459: $$

2460: Indeed, for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

2461: \begin{multline*}

2462: \C{K}(\rho,\mu) = \C{K}(\rho,\pi) + \lambda \rho \Bigl\{

2463: \Psi_{\frac{\lambda}{N}} \bigl[ R'(\cdot, \T\,),M'(\cdot, \T\,) \bigr]

2464: \Bigr\} - \beta \rho \bigl[ R'(\cdot, \T\,) \bigr] \\ +

2465: \log \Bigl\{ \pi \Bigl[ \exp \bigl\{

2466: - \lambda \Psi_{\frac{\lambda}{N}}\bigl[ R'(\cdot, \T\,),

2467: M'(\cdot, \T\,) \bigr] + \beta R'(\cdot, \T\,) \bigr\} \Bigr] \Bigr\}.

2468: \end{multline*}

2469: Plugging this into Theorem \ref{thm4.1} and using the convexity of the

2470: exponential function, we see that for any posterior probability distribution

2471: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

2472: \begin{multline*}

2473: \beta \PP \bigl\{ \rho \bigl[ R'(\cdot, \T\,) \bigr] \bigr\}

2474: \leq \lambda \PP \bigl\{ \rho \bigl[ r'(\cdot, \T\,) \bigr] \bigr\}

2475: + \PP \bigl[ \C{K}(\rho, \pi) \bigr] \\ +

2476: \log \Bigl\{ \pi \Bigl[ \exp \bigl\{

2477: - \lambda \Psi_{\frac{\lambda}{N}}\bigl[ R'(\cdot, \T\,),

2478: M'(\cdot, \T\,) \bigr] + \beta R'(\cdot, \T\,) \bigr\} \Bigr] \Bigr\}.

2479: \end{multline*}

2480: We can then recall that

2481: $$

2482: \lambda \rho\bigl[ r'(\cdot, \T\,) \bigr] + \C{K}(\rho, \pi)

2483: = \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)}\bigr] - \log

2484: \Bigl\{ \pi \Bigl[ \exp \bigl[ - \lambda r'(\cdot, \T\,) \bigr] \Bigr] \Bigr\},

2485: $$

2486: and notice moreover that

2487: $$

2488: - \PP \biggl\{ \log \Bigl\{ \pi \Bigl[

2489: \exp \bigl[ - \lambda r'(\cdot, \T\,) \bigr] \Bigr] \Bigr\} \biggr\}

2490: \leq

2491: - \log \Bigl\{ \pi \Bigl[

2492: \exp \bigl[ - \lambda R'(\cdot, \T\,) \bigr] \Bigr] \Bigr\},

2493: $$

2494: since $R' = \PP(r')$ and $h \mapsto \log \Bigl\{ \pi \bigl[ \exp ( h) \bigr] \Bigr\}$

2495: is a convex functional. Putting these two remarks together, we obtain

2496: \begin{thm}

2497: \mypoint \label{thm2.2.19}

2498: For any real positive parameters $\beta$ and $\lambda$, for any prior distribution $\pi

2499: \in \C{M}_+^1(\Theta)$, for any posterior distribution $\rho : \Omega

2500: \rightarrow \C{M}_+^1(\Theta)$,

2501: \begin{multline*}

2502: \PP \bigl\{ \rho \bigl[ R'(\cdot, \T\,) \bigr] \bigr\}

2503: \leq \frac{1}{\beta} \PP \bigl[ \C{K}(\rho, \pi_{\exp( - \lambda r)}) \bigr]

2504: \\ + \frac{1}{\beta} \log \Bigl\{ \pi \Bigl[ \exp \bigl\{

2505: - \lambda \Psi_{\frac{\lambda}{N}}\bigl[ R'(\cdot, \T\,),

2506: M'(\cdot, \T\,) \bigr] + \beta R'(\cdot, \T\,) \bigr\} \Bigr] \Bigr\}\\

2507: \shoveright{- \frac{1}{\beta} \log \Bigl\{ \pi \Bigl[

2508: \exp \bigl[ - \lambda R'(\cdot, \T\,) \bigr] \Bigr] \Bigr\}\quad}\\\shoveleft{\qquad

2509: \leq \frac{1}{\beta} \PP \bigl[ \C{K}(\rho, \pi_{\exp( - \lambda r)})\bigr]}

2510: \\ + \frac{1}{\beta} \log \Bigl\{ \pi \Bigl[

2511: \exp \bigl\{ - \bigl[ N \sinh(\tfrac{\lambda}{N}) - \beta \bigr] R'(\cdot, \T\,)

2512: \\ \shoveright{+ 2 N \sinh(\tfrac{\lambda}{2N})^2 M'(\cdot, \T\,) \bigr\} \Bigr] \Bigr\}

2513: \qquad} \\ - \frac{1}{\beta} \log \Bigl\{ \pi \Bigl[

2514: \exp \bigl[ - \lambda R'(\cdot, \T\,) \bigr] \Bigr] \Bigr\}.

2515: \end{multline*}

2516: \end{thm}

2517: It may be interesting to derive a more suggestive (but slightly weaker)

2518: bound in the important case when $\Theta_1 = \Theta$ and $R(\T) = \inf_{\Theta} R$.

2519: In this case, it is convenient to introduce the {\em margin function}

2520: \begin{equation}

2521: \label{eq1.1.16Bis}

2522: \varphi(x) = \sup_{\theta \in \Theta} \BM(\theta, \T) -

2523: x \R(\theta, \T), \quad x \in \RR_+.

2524: \end{equation}

2525: We see that $\varphi$ is convex and nonnegative on $\RR_+$.

2526: Using the bound $M'(\theta, \T\,) \leq x R'(\theta, \T\,) + \varphi(x)$,

2527: we obtain

2528: \begin{multline*}

2529: \PP \bigl\{ \rho \bigl[ R'(\cdot, \T\,) \bigr] \bigr\}

2530: \leq \frac{1}{\beta} \PP \bigl[ \C{K}(\rho, \pi_{\exp( - \lambda r)})\bigr]

2531: \\ + \frac{1}{\beta} \log \biggl\{ \pi \biggl[

2532: \exp \Bigl\{ -

2533: \bigl\{ N \sinh(\tfrac{\lambda}{N})\bigl[

2534: 1 - x\tanh(\tfrac{\lambda}{2N})\bigr] - \beta \bigr\}

2535: R'(\cdot, \T\,) \Bigr\}

2536: \biggr] \biggr\}

2537: \\ + \frac{N \sinh(\tfrac{\lambda}{N}) \tanh(\tfrac{\lambda}{2N})}{\beta} \varphi(x)

2538: - \frac{1}{\beta} \log \Bigl\{ \pi \Bigl[

2539: \exp \bigl[ - \lambda R'(\cdot, \T\,) \bigr] \Bigr] \Bigr\}.

2540: \end{multline*}

2541: Let us make the change of variable $\gamma =

2542: N \sinh(\tfrac{\lambda}{N})\bigl[

2543: 1 - x\tanh(\tfrac{\lambda}{2N})\bigr] - \beta$ to obtain

2544: \begin{cor}

2545: \label{cor1.1.21}\mypoint

2546: For any real positive parameters $x$, $\gamma$ and $\lambda$ such that

2547: $x \leq \tanh(\frac{\lambda}{2N})^{-1}$ and $0 \leq \gamma <

2548: N \sinh(\frac{\lambda}{N}) \bigl[ 1 - x \tanh(\frac{\lambda}{2N}) \bigr]$,

2549: \begin{multline*}

2550: \PP \bigl[ \rho(R) \bigr] - \inf_{\Theta} R

2551: \leq \Bigl\{

2552: N \sinh(\tfrac{\lambda}{N}) \bigl[ 1 - x

2553: \tanh(\tfrac{\lambda}{2N})\bigr] - \gamma \Bigr\}^{-1} \\

2554: \shoveleft{\qquad \times

2555: \biggl\{ \int_{\gamma}^{\lambda}

2556: \bigl[ \pi_{\exp( - \alpha R)}(R) - \inf_{\Theta} R\bigr]

2557: d \alpha }\\ + N \sinh\bigl(\tfrac{\lambda}{N}\bigr) \tanh\bigl(\tfrac{\lambda}{2N}\bigr)

2558: \varphi(x) + \PP \bigl[ \C{K}(\rho, \pi_{\exp( - \lambda r)}) \bigr]

2559: \biggr\}.

2560: \end{multline*}

2561: \end{cor}

2562: Let us remark that these results, although well suited to study Mammen and Tsybakov's

2563: margin assumptions, hold in the general case: introducing the convex {\em expected

2564: margin function} $\varphi$ is a substitute for making hypotheses about the relations

2565: between $R$ and $D$.

2566:

2567: Using the fact that $R'(\theta, \T\,) \geq 0$, $\theta \in \Theta$  and

2568: that $\varphi(x) \geq 0$, $x \in \RR_+$, we can weaken and simplify even more

2569: the preceding corollary to get

2570: \begin{cor}

2571: \label{cor4.3}

2572: \mypoint For any real parameters $\beta$, $\lambda$ and $x$ such that

2573: $x \geq 0$ and $0 \leq \beta < \lambda - x \frac{\lambda^2}{2N}$,

2574: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

2575: \begin{multline*}

2576: \PP \bigl[ \rho(R) \bigr] \leq \inf_{\Theta} R

2577: \\ +

2578: \Bigl[\lambda - x \tfrac{\lambda^2}{2N} - \beta \Bigr]^{-1}

2579: \biggl\{ \int_{\beta}^{\lambda}

2580: \bigl[ \pi_{\exp( - \alpha R)}(R) - \inf_{\Theta} R \bigr]  d \alpha

2581: \\ + \PP \bigl\{ \C{K}\bigl[\rho, \pi_{\exp( - \lambda r)} \bigr] \bigr\}

2582: + \varphi(x) \frac{\lambda^2}{2N} \biggr\}.

2583: \end{multline*}

2584: \end{cor}

2585: Let us apply this bound under the {\em margin assumption}

2586: first considered by Mammen and Tsybakov \cite{Mammen,Tsybakov},

2587: which states that for some real positive constant $c$ and some

2588: real exponent $\kappa \geq 1$,

2589: \begin{equation}

2590: \label{eq1.1.17Bis}

2591: \R(\theta, \T) \geq

2592: c D(\theta, \T)^{\kappa}, \qquad \theta \in \Theta.

2593: \end{equation}

2594: In the

2595: case when $\kappa = 1$, then $\varphi(c^{-1}) = 0$, proving that

2596: \begin{align*}

2597: \PP \bigl\{ \pi_{\exp( - \lambda r)}\bigl[  \R(\cdot, \T\,) \bigr] \bigr\}

2598: & \leq \frac{\int_{\beta}^{\lambda} \pi_{\exp(

2599: - \gamma R)}\bigl[ \R(\cdot, \T\,)\bigr]

2600: d \gamma}{N \sinh(\frac{\lambda}{N})

2601: \bigl[ 1 - c^{-1} \tanh(\frac{\lambda}{2N}) \bigr] - \beta}

2602: \\ & \leq \frac{ \int_{\beta}^{\lambda} \pi_{\exp( - \gamma R)}\bigl[

2603: \R(\cdot, \T\,)\bigr]

2604: d \gamma}{

2605: \lambda - \frac{ \lambda^2}{2 c N} - \beta}.

2606: \end{align*}
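Let us spell out the first step, under the assumption (made here for the
sake of this check, and to be compared with the definitions of $M'$ and $D$)
that $M'(\theta, \T\,) \leq D(\theta, \T\,)$ for any $\theta \in \Theta$.
When $\kappa = 1$, the margin assumption \eqref{eq1.1.17Bis} gives
$$
M'(\theta, \T\,) - c^{-1} R'(\theta, \T\,)
\leq D(\theta, \T\,) - c^{-1} R'(\theta, \T\,) \leq 0, \qquad \theta \in \Theta,
$$
so that $\varphi(c^{-1}) \leq 0$, and therefore $\varphi(c^{-1}) = 0$,
since $\varphi$ is nonnegative. The first inequality displayed above is then
Corollary \ref{cor1.1.21} applied with $x = c^{-1}$, with its lower
integration bound set to $\beta$, and with $\rho = \pi_{\exp( - \lambda r)}$,
for which the divergence term $\C{K}(\rho, \pi_{\exp( - \lambda r)})$ vanishes.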

2607: Taking for example  $\lambda = \frac{cN}{2}$, $\beta = \frac{\lambda}{2}

2608: = \frac{cN}{4}$,

2609: we obtain

2610: \begin{align*}

2611: \PP \bigl[ \pi_{\exp( - 2^{-1} c N r)}(R) \bigr] & \leq \inf R +

2612: \frac{8}{cN} \int_{\frac{c N}{4}}^{\frac{cN}{2}}

2613: \pi_{\exp( - \gamma R)}\bigl[\R(\cdot, \T)\bigr]

2614: d \gamma \\* & \leq \inf R + 2 \pi_{\exp(- \frac{cN}{4} R)}\bigl[ \R(\cdot, \T\,)\bigr].

2615: \end{align*}
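The constants come from the following elementary computation:
with $\lambda = \frac{cN}{2}$ and $\beta = \frac{cN}{4}$,
$$
\lambda - \frac{\lambda^2}{2 c N} - \beta
= \frac{cN}{2} - \frac{cN}{8} - \frac{cN}{4} = \frac{cN}{8},
$$
which gives the factor $\frac{8}{cN}$. The second inequality uses the fact that
$\gamma \mapsto \pi_{\exp( - \gamma R)}(R)$ is non increasing (its derivative
is minus the variance of $R$ under $\pi_{\exp( - \gamma R)}$), so that the
integral over an interval of length $\frac{cN}{4}$ is at most
$\frac{cN}{4}$ times the value of the integrand at $\gamma = \frac{cN}{4}$,
and $\frac{8}{cN} \times \frac{cN}{4} = 2$.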

2616: If moreover the behaviour of the prior distribution $\pi$ is parametric,

2617: meaning that $\pi_{\exp( - \beta R)}\bigl[ \R(\cdot, \T\,) \bigr]

2618: \leq \frac{d}{\beta}$,

2619: for some positive real constant $d$ linked with the dimension of the

2620: classification model, then

2621: $$

2622: \PP \bigl[ \pi_{\exp( - \frac{c N}{2} r)}(R) \bigr]

2623: \leq \inf R + \frac{8 \log(2) d}{cN}

2624: \leq \inf R + \frac{5.55 \, d}{cN}.

2625: $$
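Indeed, under this parametric hypothesis,
$$
\frac{8}{cN} \int_{\frac{cN}{4}}^{\frac{cN}{2}} \frac{d}{\gamma} \, d \gamma
= \frac{8 d}{cN} \log \biggl( \frac{cN/2}{cN/4} \biggr) = \frac{8 \log(2) d}{cN},
$$
and $8 \log(2) \simeq 5.545$.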

2626: In the case when $\kappa > 1$,

2627: $$\varphi(x) \leq (\kappa -1) \kappa^{- \frac{\kappa}{

2628: \kappa -1}} (c x)^{- \frac{1}{\kappa - 1}} = (1 - \kappa^{-1})(\kappa c x)^{-\frac{1}{

2629: \kappa - 1}},$$

2630: \begin{multline*}

2631: \hspace{-10pt}\text{thus }\PP \bigl\{ \pi_{\exp(- \lambda r)}\bigl[ \R(\cdot, \T\,)\bigr] \bigr\}

2632: \\ \leq \frac{\int_{\beta}^{\lambda} \pi_{\exp( - \gamma R)}\bigl[ \R(\cdot, \T\,)\bigr] d \gamma

2633: + (1 - \kappa^{-1}) (\kappa c x)^{-\frac{1}{\kappa - 1}}

2634: \frac{\lambda^2}{2N} }{

2635: \lambda - \frac{x\lambda^2}{2N}  - \beta}.

2636: \end{multline*}
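The bound on $\varphi$ can be recovered as follows, assuming that
$M'(\theta, \T\,) \leq D(\theta, \T\,)$, $\theta \in \Theta$
(an assumption made here for the sake of this check, to be compared with the
definitions of $M'$ and $D$). For any $x > 0$, the margin assumption
\eqref{eq1.1.17Bis} gives
$$
M'(\theta, \T\,) - x R'(\theta, \T\,) \leq D(\theta, \T\,) - x c D(\theta, \T\,)^{\kappa}
\leq \sup_{t \in \RR_+} \bigl( t - x c \, t^{\kappa} \bigr).
$$
The supremum is reached at the point $t_*$ where $1 - \kappa c x \, t_*^{\kappa - 1} = 0$,
that is at $t_* = (\kappa c x)^{- \frac{1}{\kappa - 1}}$, where
$x c \, t_*^{\kappa} = \kappa^{-1} t_*$, so that
$$
\sup_{t \in \RR_+} \bigl( t - x c \, t^{\kappa} \bigr) = (1 - \kappa^{-1}) \, t_*
= (1 - \kappa^{-1}) (\kappa c x)^{- \frac{1}{\kappa - 1}}.
$$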

2637: Taking for instance $\beta = \frac{\lambda}{2}$, $x = \frac{N}{2 \lambda}$,

2638: and putting $b = (1 - \kappa^{-1}) (c \kappa)^{- \frac{1}{\kappa -1}}$,

2639: we obtain

2640: $$

2641: \PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr] - \inf R

2642: \leq \frac{4}{\lambda} \int_{\lambda/2}^{\lambda}

2643: \pi_{\exp( - \gamma R)}\bigl[ \R(\cdot, \T\,)\bigr] d \gamma + b \left(\frac{2 \lambda}{N}\right)^{\frac{

2644: \kappa}{\kappa -1}}.

2645: $$
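The computation behind this choice is the following:
with $\beta = \frac{\lambda}{2}$ and $x = \frac{N}{2 \lambda}$,
$$
\lambda - \frac{x \lambda^2}{2N} - \beta
= \lambda - \frac{\lambda}{4} - \frac{\lambda}{2} = \frac{\lambda}{4},
$$
whereas the margin term becomes
$$
\frac{4}{\lambda} (1 - \kappa^{-1})
\Bigl( \frac{\kappa c N}{2 \lambda} \Bigr)^{- \frac{1}{\kappa - 1}}
\frac{\lambda^2}{2N}
= b \Bigl( \frac{2 \lambda}{N} \Bigr)^{\frac{1}{\kappa - 1}} \frac{2 \lambda}{N}
= b \Bigl( \frac{2 \lambda}{N} \Bigr)^{\frac{\kappa}{\kappa - 1}}.
$$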

2646: In the {\em parametric} case when $\pi_{\exp( - \gamma R)}\bigl[ \R(\cdot, \T\,)\bigr]

2647: \leq \frac{d}{\gamma}$,

2648: we get

2649: $$

2650: \PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr] - \inf R

2651: \leq \frac{4 \log(2) d}{\lambda} + b \left( \frac{2 \lambda}{N} \right)^{\frac{

2652: \kappa}{\kappa - 1}}.

2653: $$

2654: Taking

2655: \newcommand{\Blambda}{\overline{\lambda}}

2656: $$

2657: \Blambda = 2^{-1} \bigl[ 8 \log(2) d \bigr]^{\frac{\kappa-1}{2 \kappa -1}}

2658: (\kappa c)^{\frac{1}{2 \kappa -1}}

2659: N^{\frac{\kappa}{2 \kappa -1 }},

2660: $$

2661: we obtain

2662: $$

2663: \PP \bigl[ \pi_{\exp( - \Blambda r)}(R) \bigr] - \inf R

2664: \leq (2 - \kappa^{-1}) (\kappa c)^{-\frac{1}{2 \kappa - 1}}

2665: \left( \frac{ 8 \log(2) d}{N} \right)^{\frac{\kappa}{2 \kappa - 1}}.

2666: $$
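This can be checked term by term. With the above value of $\Blambda$,
\begin{align*}
\frac{4 \log(2) d}{\Blambda}
& = (\kappa c)^{- \frac{1}{2 \kappa - 1}}
\left( \frac{8 \log(2) d}{N} \right)^{\frac{\kappa}{2 \kappa - 1}}, \\
b \left( \frac{2 \Blambda}{N} \right)^{\frac{\kappa}{\kappa - 1}}
& = (1 - \kappa^{-1}) (\kappa c)^{- \frac{1}{2 \kappa - 1}}
\left( \frac{8 \log(2) d}{N} \right)^{\frac{\kappa}{2 \kappa - 1}},
\end{align*}
where the second line uses the fact that
$\frac{\kappa}{(2 \kappa - 1)(\kappa - 1)} - \frac{1}{\kappa - 1}
= - \frac{1}{2 \kappa - 1}$.
Adding the two terms gives the factor $2 - \kappa^{-1}$. Let us also notice that
the second term is equal to the first one multiplied by
$1 - \kappa^{-1} = \bigl( \frac{\kappa}{\kappa - 1} \bigr)^{-1}$, which is
the first order condition for the minimum in $\lambda$ of
$\frac{4 \log(2) d}{\lambda} + b \bigl( \frac{2 \lambda}{N} \bigr)^{
\frac{\kappa}{\kappa - 1}}$.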

2667: We see that this formula coincides with the result for $\kappa = 1$.

2668: We can thus reduce the two cases to a single one and state

2669: \begin{cor}

2670: \mypoint

2671: \label{cor1.1.23} Let us assume that for some $\T \in \Theta$, some

2672: positive real constant $c$, some real exponent $\kappa \geq 1$

2673: and for any $\theta \in \Theta$,

2674: $R(\theta)\geq R(\T) + c D(\theta, \T)^{\kappa}$.

2675: Let us also assume that for some positive real

2676: constant $d$ and any positive real parameter $\gamma$,

2677: $\pi_{\exp( - \gamma R)}(R) - \inf R \leq \frac{d}{\gamma}$.

2678: Then

2679: \begin{multline*}

2680: \PP \Bigl[ \pi_{\exp \bigl\{ -

2681: 2^{-1}[ 8 \log(2) d ]^{\frac{\kappa-1}{2 \kappa -1}}

2682: (\kappa c)^{\frac{1}{2 \kappa -1}}

2683: N^{\frac{\kappa}{2 \kappa -1 }}

2684: r\bigr\}}(R) \Bigr]

2685: \\ \leq \inf R + (2 - \kappa^{-1}) (\kappa c)^{-\frac{1}{2 \kappa - 1}}

2686: \left( \frac{ 8 \log(2) d}{N} \right)^{\frac{\kappa}{2 \kappa - 1}}.

2687: \end{multline*}

2688: \end{cor}

2689: Let us remark that the exponent of $N$ in this corollary is

2690: known to be the minimax exponent under these assumptions:

2691: it is unimprovable, whatever estimator is used in place of

2692: the Gibbs posterior shown here (at least in the worst case

2693: compatible with the hypotheses). The interest of the corollary

2694: is to show not only the minimax exponent in $N$, but also

2695: an explicit non asymptotic bound with reasonable and simple

2696: constants. It is also clear that we could have got slightly

2697: better constants if we had kept the full strength of Theorem

2698: \ref{thm2.2.19} (page \pageref{thm2.2.19})

2699: instead of using the weaker Corollary \ref{cor4.3}

2700: (page \pageref{cor4.3}).

2701:

2702: In what follows, we will prove empirical bounds showing

2703: how the constant $\lambda$ can be estimated from the data

2704: instead of being chosen according to some margin and

2705: complexity assumptions.

2706:

2707: \subsubsection{Unbiased empirical bounds}

2708: We are going to provide an empirical counterpart for the

2709: {\em expected margin function} $\varphi$. It will appear

2710: in empirical bounds having otherwise the same structure as

2711: the non random bound we just proved. Anyhow, we will not

2712: launch into trying to compare the behaviour of our proposed

2713: {\em empirical margin function} with the {\em expected margin function},

2714: since the margin function involves taking a supremum

2715: which is not straightforward to handle.

2716:

2717: Let us start as in the previous subsection with the inequality

2718: \begin{multline*}

2719: \beta \PP \Bigl\{ \rho\bigl[\R(\cdot,\T\,) \bigr] \Bigr\} \leq

2720: \PP \Bigl\{ \lambda \rho\bigl[ r'(\cdot, \T\,) \bigr]+ \C{K}(\rho, \pi) \Bigr\}

2721: \\ + \log \Bigl\{ \pi \Bigl[ \exp \bigl\{ - \lambda \Psi_{\frac{\lambda}{N}}\bigl[\R

2722: (\cdot, \T\,), \BM(\cdot, \T\,) \bigr] + \beta \R(\cdot, \T\,) \, \bigr\} \Bigr]

2723: \Bigr\} .

2724: \end{multline*}

2725: We have already defined by equation \eqref{eq1.3} the empirical pseudo distance

2726: \newcommand{\m}{{m'}}

2727: $$

2728: \m( \theta, \T\,) = \frac{1}{N} \sum_{i=1}^N \psi_i(\theta, \T\,)^2.

2729: $$

2730: Recalling that $\PP \bigl[ \m(\theta, \T\,) \bigr] = \BM(\theta, \T\,)$,

2731: and using the convexity of $h \mapsto \log \Bigl\{ \pi \bigl[ \exp( h ) \bigr] \Bigr\}$,

2732: leads to the following inequalities:

2733: \begin{multline*}

2734: \log \Bigl\{ \pi \Bigl[ \exp \bigl\{ - \lambda \Psi_{\frac{\lambda}{N}}\bigl[

2735: \R(\cdot, \T\,), \BM(\cdot, \T\,)\bigr] + \beta \R(\cdot, \T\,) \bigr\} \Bigr] \Bigr\}

2736: \\*\shoveleft{\qquad \leq \log \Bigl\{ \pi \Bigl[ \exp \bigl\{

2737: - N \sinh(\tfrac{\lambda}{N}) \R(\cdot, \T\,)  }

2738: \\ \shoveright{+  N \sinh(\tfrac{\lambda}{N})\tanh(\tfrac{\lambda}{2N}) \BM(\cdot, \T\,)

2739: + \beta \R(\cdot,\T\,) \bigr\} \Bigr] \Bigr\} \qquad}

2740: \\* \leq \PP \biggl\{

2741: \log \Bigl\{ \pi \Bigl[

2742: \exp \bigl\{ - \bigl[N \sinh(\tfrac{\lambda}{N})

2743: - \beta \bigr] \rr(\cdot, \T\,)

2744: \\ + N \sinh(\tfrac{\lambda}{N}) \tanh(\tfrac{\lambda}{2N})

2745: \m(\cdot, \T\,) \bigr\} \Bigr] \Bigr\} \biggr\}.

2746: \end{multline*}

2747: We may moreover remark that

2748: \begin{multline*}

2749: \lambda \rho\bigl[ \rr(\cdot, \T\,) \bigr]

2750: + \C{K}(\rho, \pi)

2751: = \bigl[ \beta - N \sinh(\tfrac{\lambda}{N}) + \lambda \bigr]

2752: \rho \bigl[ \rr(\cdot, \T\,)\bigr] \\ + \C{K}\bigl[ \rho, \pi_{\exp \{-[ N \sinh(\frac{\lambda}{N}) - \beta

2753: ] r \}} \bigr] \\ - \log \Bigl\{ \pi \Bigl[ \exp \bigl\{

2754: - \bigl[ N \sinh(\tfrac{\lambda}{N}) - \beta \bigr] \rr(\cdot, \T\,) \bigr\} \Bigr]

2755: \Bigr\}.

2756: \end{multline*}

2757: This completes the proof of

2758: \begin{thm}

2759: \mypoint For any positive real parameters $\beta$ and $\lambda$,

2760: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

2761: \begin{multline*}

2762: \PP \bigl\{ \rho\bigl[ \R(\cdot, \T\,) \bigr] \bigr\}

2763: \leq \PP \biggl\{

2764: \biggl[ 1 - \frac{ N \sinh(\frac{\lambda}{N}) - \lambda}{\beta} \biggr]

2765: \rho\bigl[ \rr(\cdot, \T\,)\bigr]

2766: \\\shoveright{ + \frac{\C{K}\bigl[\rho, \pi_{\exp \{ - [ N \sinh(\frac{\lambda}{N})

2767: - \beta ] r \}} \bigr]}{\beta} \qquad}

2768: \\ + \beta^{-1}

2769: \log \Bigl\{

2770: \pi_{\exp \{ - [N \sinh(\frac{\lambda}{N}) - \beta ] r \}} \Bigl[

2771: \exp \bigl[ N \sinh(\tfrac{\lambda}{N}) \tanh(\tfrac{\lambda}{2N})\m(\cdot, \T\,)

2772: \bigr] \Bigr] \Bigr\} \biggr\}.

2773: \end{multline*}

2774: \end{thm}

2775: Taking $\beta = \frac{N}{2} \sinh (\frac{\lambda}{N})$, using the

2776: fact that $\sinh(a) \geq a$, $a \geq 0$ and expressing

2777: $\tanh(\frac{a}{2}) = \sinh(a)^{-1} \bigl[ \sqrt{1 + \sinh(a)^2}- 1 \bigr]$

2778: and $a = \log \bigl[ \sqrt{1 + \sinh(a)^2} + \sinh(a) \bigr]$,

2779: we deduce

2780: \begin{cor}

2781: \mypoint For any positive real constant $\beta$ and any posterior distribution

2782: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

2783: \begin{multline*}

2784: \PP \bigl\{ \rho\bigl[ \R(\cdot, \T\,) \bigr] \bigr\} \leq

2785: \PP \Biggl\{ \underbrace{\biggl[ \tfrac{N}{\beta}\log \Bigl(

2786: \sqrt{1 + \tfrac{4 \beta^2}{N^2}} + \tfrac{2 \beta}{N} \Bigr) - 1  \biggr]}_{\leq 1}

2787: \rho\bigl[ \rr(\cdot, \T\,) \bigr]  \\

2788: \shoveleft{\qquad

2789: + \frac{1}{\beta} \biggl\{ \C{K}\bigl[ \rho,\pi_{\exp( - \beta r)} \bigr]}

2790: \\ + \log \biggl[ \pi_{\exp( - \beta r)} \Bigl\{ \exp \Bigl[ N\Bigl(

2791: \sqrt{1 + \tfrac{4 \beta^2}{N^2}}

2792: - 1 \Bigr) \m(\cdot, \T\,) \Bigr] \Bigr\} \biggr] \biggr\} \Biggr\}.

2793: \end{multline*}

2794: \end{cor}
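Let us detail the substitution leading to this corollary.
With $\beta = \frac{N}{2} \sinh(\frac{\lambda}{N})$, we have
$\sinh(\frac{\lambda}{N}) = \frac{2 \beta}{N}$ and
$\cosh(\frac{\lambda}{N}) = \sqrt{1 + \frac{4 \beta^2}{N^2}}$, so that
\begin{align*}
N \sinh(\tfrac{\lambda}{N}) - \beta & = \beta, \\
\lambda & = N \log \Bigl( \sqrt{1 + \tfrac{4 \beta^2}{N^2}} + \tfrac{2 \beta}{N} \Bigr), \\
N \sinh(\tfrac{\lambda}{N}) \tanh(\tfrac{\lambda}{2N})
& = N \bigl[ \cosh(\tfrac{\lambda}{N}) - 1 \bigr]
= N \Bigl( \sqrt{1 + \tfrac{4 \beta^2}{N^2}} - 1 \Bigr), \\
1 - \frac{N \sinh(\frac{\lambda}{N}) - \lambda}{\beta}
& = \frac{\lambda}{\beta} - 1
= \tfrac{N}{\beta} \log \Bigl( \sqrt{1 + \tfrac{4 \beta^2}{N^2}} + \tfrac{2 \beta}{N} \Bigr) - 1.
\end{align*}
The third line uses the identity $\sinh(a) \tanh(\frac{a}{2}) = \cosh(a) - 1$.
The last quantity is not greater than $1$, since
$\lambda \leq N \sinh(\frac{\lambda}{N}) = 2 \beta$.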

2795: This theorem and its corollary are really analogous to

2796: Theorem \ref{thm2.2.19} (page \pageref{thm2.2.19}) and it

2797: could easily be proved that under Mammen and Tsybakov's margin assumptions,

2798: we obtain an upper bound of the same order as Corollary \ref{cor1.1.23}

2799: (page \pageref{cor1.1.23}).

2800: Anyhow, in order to obtain an empirical bound, we are now going to take

2801: a supremum over all possible values of $\T$, that is over $\Theta_1$.

2802: Although we believe that taking this supremum will not spoil the bound

2803: in cases when overfitting remains under control, we will not try

2804: to investigate precisely if and when this is actually true, and

2805: provide our empirical bound as such. Let us only say that, on qualitative

2806: grounds, the values of the margin function quantify how steep the

2807: contrast function $R$ or its empirical counterpart $r$ is, and

2808: that the definition

2809: of the empirical margin function is obtained by substituting $\PP$, the true

2810: sample distribution, with $\overline{\PP} = \bigl( \frac{1}{N} \sum_{i=1}^N

2811: \delta_{(X_i, Y_i)}\bigr)^{\otimes N}$, the empirical sample distribution,

2812: in the definition of the expected margin function. Therefore, on qualitative

2813: grounds, it seems hopeless to presume that $R$ is steep when $r$ is

2814: not, or in other words that a classification model that would be inefficient

2815: at estimating a bootstrapped sample according to our non random bound

2816: would be by some miracle efficient at estimating the true sample distribution

2817: according to the same bound. To this extent, we feel that our empirical

2818: bounds bring a satisfactory counterpart of our non random bounds.

2819: Anyhow, we will also produce estimators which can be proved

2820: to be adaptive

2821: using PAC-Bayesian tools in the next subsection, at the price of

2822: a more sophisticated construction involving comparisons between

2823: a posterior distribution and a Gibbs prior distribution.

2824:

2825: \newcommand{\Btheta}{\widehat{\theta}}

2826: Let us now restrict ourselves to the important case when $\T \in \arg\min_{\Theta_1} R$.

2827: To obtain an observable bound, let $\Btheta \in \arg\min_{\theta

2828: \in \Theta} r(\theta)$ and let us introduce the {\em empirical margin

2829: functions}

2830: \newcommand{\Tphi}{\widetilde{\varphi}}

2831: \newcommand{\Bphi}{\overline{\varphi}}

2832: \begin{align*}

2833: \Bphi(x) & = \sup_{\theta \in \Theta} \m(\theta, \Btheta) - x \bigl[

2834: r(\theta) - r(\Btheta) \bigr], \quad x \in \RR_+,\\

2835: \Tphi(x) & = \sup_{\theta \in \Theta_1} \m(\theta, \Btheta) - x \bigl[

2836: r(\theta) - r(\Btheta) \bigr], \quad x \in \RR_+.

2837: \end{align*}

2838: Using the fact that $\m(\theta, \T) \leq \m(\theta, \Btheta)

2839: + \m(\Btheta, \T)$, we get

2840: \begin{cor}

2841: \mypoint For any positive real parameters $\beta$ and $\lambda$,

2842: for any posterior distribution $\rho : \Omega

2843: \rightarrow \C{M}_+^1(\Theta)$,

2844: \begin{multline*}

2845: \PP \bigl[ \rho (R) \bigr] - \inf_{\Theta_1} R

2846: \leq \PP \biggl\{

2847: \Bigl[ 1 - \tfrac{ N \sinh(\frac{\lambda}{N}) - \lambda}{\beta}

2848: \Bigr] \bigl[ \rho(r) - r(\Btheta)\bigr] \\

2849: + \frac{ \C{K}\bigl[ \rho, \pi_{\exp\{-[N \sinh(\frac{\lambda}{N})

2850: - \beta]r\}} \bigr]}{\beta}\\

2851: + \beta^{-1} \log \Bigl\{ \pi_{\exp \{-[N \sinh(\frac{\lambda}{N})

2852: - \beta]r\}} \Bigl[ \exp \bigl[

2853: N \sinh\bigl(\tfrac{\lambda}{N}\bigr) \tanh\bigl(\tfrac{\lambda}{2N}\bigr) \m(\cdot,\Btheta)

2854: \bigr] \Bigr] \Bigr\} \\ +

2855: \beta^{-1}N \sinh(\tfrac{\lambda}{N}) \tanh(\tfrac{\lambda}{2N})

2856: \Tphi \biggl[ \frac{\beta}{N\sinh(\frac{\lambda}{N}) \tanh(\frac{\lambda}{

2857: 2N})} \left(1 - \frac{N\sinh(\frac{\lambda}{N}) - \lambda}{\beta}

2858: \right)\biggr] \biggr\}.

2859: \end{multline*}

2860: Taking $\beta = \frac{N}{2} \sinh(\frac{\lambda}{N})$, we also

2861: obtain

2862: \begin{multline*}

2863: \PP \bigl[ \rho(R) \bigr] - \inf_{\Theta_1} R \leq

2864: \PP \Biggl\{ \underbrace{\biggl[ \tfrac{N}{\beta}\log \Bigl(

2865: \sqrt{1 + \tfrac{4 \beta^2}{N^2}}

2866: + \tfrac{2 \beta}{N} \Bigr) - 1  \biggr]}_{\leq 1}

2867: \bigl[ \rho(r) - r(\Btheta) \bigr] \\

2868: \shoveleft{\qquad + \frac{1}{\beta} \biggl\{ \C{K}\bigl[

2869: \rho,\pi_{\exp( - \beta r)} \bigr]}

2870: \\\qquad + \log \biggl[ \pi_{\exp( - \beta r)} \Bigl\{ \exp \Bigl[ N\Bigl(

2871: \sqrt{1 + \tfrac{4 \beta^2}{N^2}}

2872: - 1 \Bigr) \m(\cdot, \Btheta) \Bigr] \Bigr\} \biggr] \biggr\} \\

2873: + \frac{N}{\beta}\Bigl(\sqrt{1 + \tfrac{4 \beta^2}{N^2}} - 1\Bigr)

2874: \Tphi \Biggl[ \frac{\log \Bigl( \sqrt{1 + \frac{4 \beta^2}{N^2}}

2875: + \frac{2 \beta}{N} \Bigr) - \frac{\beta}{N}}{\Bigl(

2876: \sqrt{1 + \frac{4 \beta^2}{N^2}} - 1 \Bigr)}\Biggr]

2877: \Biggr\}.

2878: \end{multline*}

2879: \end{cor}

2880: Note that we could also use the upper bound

2881: $\m(\theta, \Btheta) \leq x \bigl[ r(\theta) - r(\Btheta)

2882: \bigr] + \Bphi(x)$ and put $\alpha =

2883: N \sinh(\frac{\lambda}{N}) \bigl[ 1 -

2884: x \tanh(\frac{\lambda}{2N}) \bigr] - \beta$, to obtain

2885: \begin{cor}

2886: \label{cor1.1.27}

2887: \mypoint For any non negative

2888: real parameters $x$, $\alpha$ and $\lambda$,

2889: such that $\alpha < N \sinh(\frac{\lambda}{N}) \bigl[

2890: 1 - x \tanh(\frac{\lambda}{2N}) \bigr]$, for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

2891: \begin{multline*}

2892: \PP \bigl[ \rho(R) \bigr] - \inf_{\Theta_1} R

2893: \\ \shoveleft{\quad \leq \PP

2894: \Biggl\{ \biggl[ 1 - \frac{N\sinh(\frac{\lambda}{N})\bigl[1 - x

2895: \tanh(\frac{\lambda}{2N})\bigr] - \lambda}{

2896: N \sinh(\frac{\lambda}{N})\bigl[ 1 - x \tanh(\frac{\lambda}{2N})

2897: \bigr] - \alpha} \biggr] \bigl[ \rho(r) - r(\Btheta) \bigr]}

2898: \\ \shoveleft{\quad \qquad \qquad + \frac{\C{K} \bigl[ \rho, \pi_{\exp(- \alpha r)} \bigr]}{

2899: N \sinh(\frac{\lambda}{N})\bigl[1 - x \tanh(\frac{\lambda}{2N})\bigr]

2900: - \alpha} }\\

2901: \shoveleft{\quad\qquad \qquad + \frac{N\sinh(\tfrac{\lambda}{N})

2902: \tanh(\tfrac{\lambda}{2N})}{

2903: N \sinh(\frac{\lambda}{N}) \bigl[ 1 - x \tanh(\frac{\lambda}{2N}) \bigr]

2904: - \alpha}}\\\times

2905: \biggl[ \Bphi(x) + \Tphi \biggl(

2906: \frac{\lambda - \alpha}{N \sinh(\frac{\lambda}{N})

2907: \tanh(\frac{\lambda}{2N})}\biggr) \biggr] \Biggr\}.

2908: \end{multline*}

2909: \end{cor}

2910: Let us notice that in the case when $\Theta_1 = \Theta$,

2911: the upper bound provided by this corollary

2912: has the same general form as the upper bound provided by Corollary

2913: \ref{cor1.1.21} (page \pageref{cor1.1.21}), with the sample

2914: distribution $\PP$ replaced with

2915: the empirical distribution of the sample $\overline{\PP}

2916: = \bigl( \frac{1}{N} \sum_{i=1}^N \delta_{(X_i, Y_i)} \bigr)^{\otimes N}$.

2917: Therefore, our empirical bound can be of a larger order of magnitude

2918: than our non random bound only in the case when our non random

2919: bound applied to the bootstrapped sample distribution $\overline{\PP}$

2920: would be of a larger order of magnitude than when applied to

2921: the true sample distribution $\PP$. In other words, we can say that

2922: our empirical bound is close to our non random bound in every situation

2923: where the bootstrapped sample distribution $\overline{\PP}$ is not

2924: harder to bound than the true sample distribution $\PP$. Although

2925: this does not prove that our empirical bound is always of the same

2926: order as our non random bound, this is a good qualitative hint that

2927: this will be the case in most practical situations of interest,

2928: since in situations of ``underfitting'', if they exist, it is likely

2929: that the choice of the classification model is inappropriate to the data

2930: and should be modified.

2931:

2932: Another reassuring remark is that the empirical margin functions

2933: $\Bphi$ and $\Tphi$ behave well in the case when $\inf_{\Theta} r

2934: = 0$. Indeed in this case $m'(\theta, \wtheta)

2935: = r'(\theta, \wtheta) = r(\theta)$, $\theta \in \Theta$,

2936: and thus $\Bphi(1) = \Tphi(1) = 0$, and\\

2937: \mbox{}\hfill $\Tphi(x)

2938: \leq - (x -1 ) \inf_{\Theta_1} r$, $x \geq 1$.\hfill \mbox{}\\

2939: This shows that we recover in this case the same

2940: accuracy as with non relative local empirical bounds.

2941: Thus the bound of Corollary \ref{cor1.1.27} does not

2942: collapse in the presence of massive overfitting in the larger

2943: model, causing $r(\wtheta) = 0$, which is another hint

2944: that this may be an accurate bound in many situations.
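Let us detail the computation behind this remark. When $r(\Btheta) = 0$,
the classification rule $f_{\Btheta}$ makes no error on the sample, so that
$\psi_i(\theta, \Btheta) = \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr] \geq 0$,
$i = 1, \dots, N$, and consequently, for any $\theta \in \Theta$ and any $x \in \RR_+$,
$$
\m(\theta, \Btheta) - x \bigl[ r(\theta) - r(\Btheta) \bigr] = (1 - x) \, r(\theta).
$$
This quantity is identically zero when $x = 1$, and when $x \geq 1$ its
supremum over $\theta \in \Theta_1$ is equal to $- (x - 1) \inf_{\Theta_1} r$.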

2945:

2946: \subsubsection{Relative empirical deviation bounds}

2947:

2948: It is natural to make use of Theorem \ref{thm2.2.18}

2949: on page \pageref{thm2.2.18} to obtain

2950: empirical deviation bounds, since this theorem provides an empirical

2951: variance term.

2952:

2953: Theorem \ref{thm2.2.18} is written in a way which exploits the

2954: fact that $\psi_i$ takes only the three values -1, 0 and +1.

2955: However, it will be more convenient for the following computations

2956: to use it in its more general form, which only makes use of the

2957: fact that $\psi_i \in [-1, 1]$.

2958: With notations to be

2959: explained hereafter, it can indeed also be written as

2960: \newcommand{\BP}{\overline{P}}

2961: \begin{multline}

2962: \label{eq2.2.2}

2963: \PP \Biggl\{ \exp \Biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)} \biggl\{

2964: - N \rho \Bigl\{ \log \Bigl[ 1 - \lambda P(\psi) \Bigr] \Bigr\}

2965: \\ + N \rho \Bigl\{ \BP \Bigl[ \log(1 - \lambda \psi) \Bigr]

2966: \Bigr\} - \C{K}(\rho,\pi) \biggr\} \Biggr] \Biggr\} \leq 1.

2967: \end{multline}

2968: We have used the following notations in this inequality. We have put

2969: $$

2970: \BP = \frac{1}{N} \sum_{i=1}^N \delta_{(X_i,Y_i)},

2971: $$

2972: so that $\BP$ is our notation for the empirical distribution of the

2973: process \linebreak $(X_i,Y_i)_{i=1}^N$. Moreover we have also used

2974: $$

2975: P = \PP(\BP) = \frac{1}{N} \sum_{i=1}^N P_i,

2976: $$

2977: where it should be remembered that the joint distribution of the

2978: process $(X_i,Y_i)_{i=1}^N$ is $\PP = \bigotimes_{i=1}^N P_i$.

2979: We have considered $\psi(\theta, \T)$ as a function defined on $\C{X} \times \C{Y}$,\\

2980: \mbox{}\hfill as $\psi(\theta, \T) (x,y) = \B{1}\bigl[ y \neq f_{\theta}(x) \bigr] - \B{1} \bigl[

2981: y \neq f_{\T}(x) \bigr], \quad (x,y) \in \C{X} \times \C{Y} $  \hfill\mbox{}\\

2982: so that it should be understood that

2983: \begin{multline*}

2984: P(\psi) = \frac{1}{N} \sum_{i=1}^N \PP \bigl[ \psi_i(\theta, \T) \bigr]

2985: \\ = \frac{1}{N} \sum_{i=1}^N \PP \Bigl\{

2986: \B{1} \bigl[ Y_i \neq f_{\theta}(X_i) \bigr] - \B{1} \bigl[

2987: Y_i \neq f_{\T}(X_i) \bigr] \Bigr\} = R'(\theta, \T).

2988: \end{multline*}

2989: In the same way

2990: $$

2991: \BP \Bigl[ \log(1 - \lambda \psi) \Bigr]

2992: = \frac{1}{N} \sum_{i=1}^N \log \bigl[ 1 - \lambda \psi_i(\theta, \T) \bigr].

2993: $$

2994: Moreover integration with respect to $\rho$ bears on the index $\theta$,

2995: so that

2996: \begin{align*}

2997: \rho \Bigl\{ \log \Bigl[ 1 - \lambda P(\psi) \Bigr] \Bigr\}

2998: & = \int_{\theta \in \Theta} \log \biggl\{ 1 - \frac{\lambda}{N}

2999: \sum_{i=1}^N \PP\bigl[ \psi_i(\theta, \T) \bigr] \biggr\} \rho(d \theta),\\

3000: \rho \Bigl\{ \BP \Bigl[ \log (1 - \lambda \psi) \Bigr] \Bigr\}

3001: & = \int_{\theta \in \Theta} \biggl\{ \frac{1}{N} \sum_{i=1}^N \log \bigl[

3002: 1 - \lambda \psi_i(\theta, \T) \bigr] \biggr\} \rho(d \theta).

3003: \end{align*}

3004:

3005: We have chosen concise notations, as we did throughout these notes,

3006: in order to make the computations easier to follow.

3007:

3008: To get an alternate version of empirical relative deviation bounds,

3009: we need to find some convenient way to localize the choice of

3010: the prior distribution $\pi$ in equation (\ref{eq2.2.2},

3011: page \pageref{eq2.2.2}).

3012: Here we propose to replace

3013: $\pi$ with $\mu = \pi_{\exp \{ - N \log[1 + \beta P(\psi)] \}}$,

3014: which can also be written $\pi_{\exp \{ - N \log[1 + \beta

3015: R'(\cdot, \T)]\}}$. Indeed we see that

3016: \begin{multline*}

3017: \C{K}(\rho, \mu)

3018: = N \rho \Bigl\{ \log \bigl[ 1 + \beta P(\psi) \bigr] \Bigr\}

3019: + \C{K}(\rho, \pi)

3020: \\ + \log \Bigl\{ \pi \Bigl[ \exp \bigl\{

3021: - N \log \bigl[ 1 + \beta P(\psi) \bigr] \bigr\} \Bigr] \Bigr\}.

3022: \end{multline*}

3023: Moreover, we deduce from our deviation inequality applied

3024: to $- \psi$, that (as long as $\beta > -1$),

3025: $$

3026: \PP \biggl\{ \exp \biggl[ N \mu \Bigl\{ \BP \bigl[

3027: \log( 1 + \beta \psi) \bigr] \Bigr\}

3028: -N \mu \Bigl\{ \log \bigl[ 1 + \beta P(\psi) \bigr] \Bigr\}

3029: \biggr] \biggr\} \leq 1.

3030: $$

3031: Thus

3032: \begin{multline*}

3033: \PP \biggl\{ \exp \biggl[

3034: \log \Bigl\{ \pi \Bigl[ \exp \bigl\{

3035: - N \log \bigl[ 1 + \beta P(\psi) \bigr] \bigr\} \Bigr] \Bigr\}

3036: \\ \shoveright{- \log \Bigl\{ \pi \Bigl[ \exp \bigl\{

3037: - N \BP \bigl[ \log(1 + \beta \psi) \bigr] \bigr\} \Bigr] \Bigr\}

3038: \biggr] \biggr\}\qquad}

3039: \\ \leq

3040: \PP \biggl\{ \exp \biggl[

3041: - N \mu \Bigl\{ \log \bigl[ 1 + \beta P(\psi) \bigr] \Bigr\}

3042: - \C{K}(\mu,\pi) \\ + N \mu \Bigl\{

3043: \BP \bigl[ \log(1 + \beta \psi) \bigr] \Bigr\} + \C{K}(\mu, \pi) \biggr] \biggr\}

3044: \leq 1.

3045: \end{multline*}

3046: This can be used to handle $\C{K}(\rho, \mu)$, making use

3047: of the Cauchy--Schwarz inequality as follows

3048: \begin{multline*}

3049: \PP \Biggl\{ \exp \Biggl[ \frac{1}{2} \biggl[

3050: -N \log \Bigl\{ \Bigl( 1 - \lambda \rho\bigl[P(\psi)\bigr] \Bigr)

3051: \Bigl( 1 + \beta \rho \bigl[ P (\psi) \bigr] \Bigr) \Bigr\}

3052: \\* \shoveright{ \begin{aligned} + N \rho \Bigl\{ & \BP \Bigl[ \log

3053: ( 1 - \lambda \psi) \Bigr] \Bigr\}

3054: \\* & - \C{K}(\rho, \pi) - \log \Bigl\{ \pi \Bigl[

3055: \exp \bigl\{ - N \BP \bigl[ \log(1 + \beta \psi) \bigr]

3056: \bigr\} \Bigr] \Bigr\} \biggr] \Biggr] \Biggr\}\end{aligned}}

3057: \\* \shoveleft{\qquad \leq \PP \Biggl\{ \exp \Biggl[ - N \log \Bigl\{ \Bigl(

3058: 1 - \lambda \rho \bigl[ P(\psi) \bigr] \Bigr) \Bigr\}}

3059: \\*\shoveright{ + N \rho \Bigl\{ \BP \Bigl[ \log(1 - \lambda \psi) \Bigr] \Bigr\}

3060: - \C{K}(\rho, \mu) \Biggr] \Biggr\}^{1/2} \qquad} \\

3061: \shoveleft{\qquad \times \PP \Biggl\{ \exp \Biggl[ \log

3062: \Bigl\{ \pi \Bigl[ \exp \bigl\{

3063: - N \log \bigl[1 + \beta P(\psi)\bigr] \bigr\} \Bigr] \Bigr\} }

3064: \\*- \log \Bigl\{ \pi \Bigl[ \exp \bigl\{ - N \BP \bigl[

3065: \log(1 + \beta \psi) \bigr] \bigr\} \Bigr] \Bigr\} \Biggr] \Biggr\}^{1/2}

3066: \leq 1.

3067: \end{multline*}

3068: This implies that with $\PP$ probability at least $1 - \epsilon$,

3069: \begin{multline*}

3070: -N \log \Bigl\{ \Bigl( 1 - \lambda \rho\bigl[P(\psi)\bigr] \Bigr)

3071: \Bigl( 1 + \beta \rho \bigl[ P (\psi) \bigr] \Bigr) \Bigr\}

3072: \\ \begin{aligned} \leq -N \rho & \Bigl\{ \BP \Bigl[ \log

3073: ( 1 - \lambda \psi) \Bigr] \Bigr\}

3074: \\ & + \C{K}(\rho, \pi) + \log \Bigl\{ \pi \Bigl[

3075: \exp \bigl\{ - N \BP \bigl[ \log(1 + \beta \psi) \bigr]

3076: \bigr\} \Bigr] \Bigr\} -

3077: 2 \log(\epsilon).\end{aligned}

3078: \end{multline*}

3079: It is now convenient to remember that

3080: $$

3081: \BP \Bigl[\log(1 - \lambda \psi) \Bigr]

3082: = \frac{1}{2} \log \left( \frac{1 - \lambda}{1 + \lambda} \right) r'(\theta, \T)

3083: + \frac{1}{2} \log (1 - \lambda^2) m'(\theta, \T).

3084: $$
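This identity can be checked pointwise. The check below assumes, as we
understand the definitions given earlier in these notes (they lie outside
the present passage), that $\psi$ takes its values in $\{-1, 0, +1\}$,
that $r'(\theta, \T) = \BP(\psi)$ and that $m'(\theta, \T) = \BP(\psi^2)$.
Under these assumptions, for any $\psi \in \{-1, 0, +1\}$,
$$
\log(1 - \lambda \psi) = \frac{\psi}{2} \log \left( \frac{1 - \lambda}{1 + \lambda}
\right) + \frac{\psi^2}{2} \log(1 - \lambda^2),
$$
since the right-hand side is equal to $0$ when $\psi = 0$, to
$\frac{1}{2} \bigl[ \log(1 - \lambda) - \log(1 + \lambda) + \log(1 - \lambda)
+ \log(1 + \lambda) \bigr] = \log(1 - \lambda)$ when $\psi = 1$, and to
$\log(1 + \lambda)$ when $\psi = -1$. Integrating with respect to $\BP$
gives the stated formula, and replacing $\lambda$ with $- \beta$ gives
in the same way
$$
\BP \bigl[ \log(1 + \beta \psi) \bigr]
= \frac{1}{2} \log \left( \frac{1 + \beta}{1 - \beta} \right) r'(\theta, \T)
+ \frac{1}{2} \log(1 - \beta^2) m'(\theta, \T).
$$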

3085: We thus can write the previous inequality as

3086: \begin{multline*}

3087: - N \log \Bigl\{ \Bigl( 1 - \lambda \rho\bigl[R'(\cdot,\T) \bigr] \Bigr)

3088: \Bigl(1 + \beta \rho \bigl[ R'(\cdot,\T) \bigr] \Bigr) \Bigr\} \\ \leq

3089: \frac{N}{2} \log \left( \frac{1+\lambda}{1-\lambda}\right)

3090: \rho \bigl[ r'(\cdot,\T) \bigr] - \frac{N}{2} \log(1 - \lambda^2)

3091: \rho \bigl[ m'(\cdot, \T) \bigr] +

3092: \C{K}(\rho, \pi) \\ \begin{aligned}+ \log \biggl\{ \pi \biggl[

3093: \exp \Bigl\{ & - \frac{N}{2}

3094: \log \Bigl( \frac{1 + \beta}{1 - \beta} \Bigr) r'(\cdot, \T)

3095: \\ & - \frac{N}{2} \log( 1 - \beta^2) m'(\cdot, \T) \Bigr\} \biggr] \biggr\}

3096: - 2 \log(\epsilon).\end{aligned}

3097: \end{multline*}

3098: Let us assume now that $\T \in \arg\min_{\Theta_1} R$.

3099: Let us introduce $\Btheta \in \arg\min_{\Theta} r$.

3100: Decomposing

3101: $r'(\theta, \T) = r'(\theta, \Btheta) + r'(\Btheta,\T)$ and

3102: considering that \\

3103: \mbox{} \hfill $m'(\theta, \T) \leq m'(\theta,

3104: \Btheta) + m'(\Btheta,\T)$, \hfill \mbox{}\\

3105: we see that with $\PP$ probability at least $1 - \epsilon$,

3106: for any posterior distribution $\rho :

3107: \Omega \rightarrow \C{M}_+^1(\Theta)$,

3108:

3109: \begin{multline*}

3110: - N \log \Bigl\{ \Bigl( 1 -

3111: \lambda \rho \bigl[ R'(\cdot, \T) \bigr] \Bigr) \Bigl(

3112: 1 + \beta \rho \bigl[ R'(\cdot, \T) \bigr] \Bigr) \Bigr\}

3113: \\* \leq \frac{N}{2} \log \biggl( \frac{1 + \lambda}{1 - \lambda} \biggr)

3114: \rho \bigl[ r'(\cdot, \Btheta) \bigr] -

3115: \frac{N}{2} \log(1 - \lambda^2) \rho \bigl[ m'(\cdot, \Btheta) \bigr]

3116: + \C{K}(\rho,\pi) \\* + \log \biggl\{ \pi \biggl[

3117: \exp \Bigl\{ - \tfrac{N}{2} \log \Bigl( \tfrac{1+\beta}{1-\beta} \Bigr)

3118: \bigl[r'(\cdot, \Btheta\,) \bigr] - \tfrac{N}{2} \log(1 - \beta^2) m'(\cdot, \Btheta\,)

3119: \Bigr\} \biggr] \biggr\} \\*

3120: + \tfrac{N}{2} \log \Bigl[ \tfrac{(1 + \lambda)(1 - \beta)}{(1 - \lambda)(1 + \beta)}

3121: \Bigr] \bigl[ r(\Btheta\,) - r(\T) \bigr]

3122: \\* - \tfrac{N}{2} \log \bigl[ (1 - \lambda^2)(1 - \beta^2) \bigr] m'(\Btheta\,,\T)

3123: - 2 \log(\epsilon).

3124: \end{multline*}

3125:

3126: Let us now define for simplicity the posterior $\nu : \Omega \rightarrow

3127: \C{M}_+^1(\Theta)$ by the identity

3128: $$

3129: \frac{d \nu}{d \pi}(\theta) = \frac{ \exp \Bigl\{

3130: - \frac{N}{2}  \log \Bigl( \frac{1+\lambda}{1-\lambda} \Bigr)

3131: r'(\theta,\Btheta) + \frac{N}{2} \log(1 - \lambda^2) m'(\theta, \Btheta)

3132: \Bigr\}}{ \pi

3133: \biggl[ \exp \Bigl\{

3134: - \frac{N}{2}  \log \Bigl( \frac{1+\lambda}{1-\lambda} \Bigr)

3135: r'(\cdot,\Btheta) + \frac{N}{2} \log(1 - \lambda^2) m'(\cdot, \Btheta)

3136: \Bigr\}\biggr]}.

3137: $$

3138: Let us also introduce the random bound

3139: \begin{multline*}

3140: B =

3141: \frac{1}{N} \log \biggl\{ \nu \biggl[ \exp \Bigl[ \tfrac{N}{2} \log \Bigl[

3142: \tfrac{(1 + \lambda)(1 - \beta)}{(1 - \lambda) (1 + \beta) } \Bigr]

3143: r'(\cdot, \Btheta) \\ \shoveright{- \tfrac{N}{2} \log \bigl[ (1 - \lambda^2)

3144: (1 - \beta^2) \bigr] m'(\cdot, \Btheta\,) \Bigr] \biggr] \biggr\}\qquad} \\

3145: \shoveleft{\qquad + \sup_{\theta \in \Theta_1}

3146: \frac{1}{2} \log \Bigl[\tfrac{(1 - \lambda)(1 + \beta)}{(1 + \lambda)(1 - \beta)}

3147: \Bigr]

3148: r'(\theta,\Btheta\,)} \\ - \frac{1}{2} \log\bigl[ (1 - \lambda^2)(1 - \beta^2)\bigr]

3149: m'(\theta,\Btheta\,) - \frac{2}{N} \log(\epsilon).

3150: \end{multline*}

3151: \begin{thm}\mypoint

3152: Using the above notations, for any real constants $0 \leq \beta < \lambda < 1$,

3153: for any prior distribution $\pi \in \C{M}_+^1(\Theta)$,

3154: for any subset $\Theta_1 \subset \Theta$,

3155: with $\PP$ probability at least $1 - \epsilon$,

3156: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

3157: $$

3158: - \log \Bigl\{ \Bigl( 1 - \lambda \bigl[ \rho (R) - \inf_{\Theta_1} R \bigr]

3159: \Bigr)\Bigl(1 + \beta \bigl[ \rho (R) - \inf_{\Theta_1} R \bigr] \Bigr) \Bigr\}

3160: \leq \frac{\C{K}(\rho, \nu)}{N} + B.

3161: $$

3162: Therefore,

3163: \begin{multline*}

3164: \rho(R) - \inf_{\Theta_1} R \\* \leq \frac{\lambda - \beta}{2 \lambda \beta}

3165: \left( \sqrt{1 + 4 \frac{\lambda \beta}{(\lambda - \beta)^2}

3166: \left[ 1 - \exp \left( - B - \frac{\C{K}(\rho, \nu)}{N} \right) \right]}-1\right)

3167: \\ \leq \frac{1}{\lambda - \beta} \left( B + \frac{\C{K}(\rho,\nu)}{N} \right).

3168: \end{multline*}

3169: \end{thm}
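The passage from the first inequality of the theorem to the two
explicit bounds is elementary; we sketch it here in the case
$\beta > 0$, using the shorthand notations (introduced only for this
remark) $x = \rho(R) - \inf_{\Theta_1} R$ and
$C = B + \frac{\C{K}(\rho, \nu)}{N}$. The first inequality reads
$(1 - \lambda x)(1 + \beta x) \geq \exp(-C)$, that is
$$
\lambda \beta x^2 + (\lambda - \beta) x - \bigl[ 1 - \exp(-C) \bigr] \leq 0.
$$
Therefore $x$ is not greater than the largest root of this polynomial of
degree two:
$$
x \leq \frac{- (\lambda - \beta) + \sqrt{(\lambda - \beta)^2 + 4 \lambda \beta
\bigl[ 1 - \exp(-C) \bigr]}}{2 \lambda \beta}
= \frac{\lambda - \beta}{2 \lambda \beta} \left(
\sqrt{1 + \frac{4 \lambda \beta}{(\lambda - \beta)^2} \bigl[ 1 - \exp(-C) \bigr]}
- 1 \right).
$$
The linear bound then follows from the two inequalities
$\sqrt{1 + u} \leq 1 + \frac{u}{2}$, $u \geq -1$, and $1 - \exp(-C) \leq C$:
$$
\frac{\lambda - \beta}{2 \lambda \beta} \left(
\sqrt{1 + \frac{4 \lambda \beta}{(\lambda - \beta)^2} \bigl[ 1 - \exp(-C) \bigr]}
- 1 \right)
\leq \frac{\lambda - \beta}{2 \lambda \beta} \times
\frac{2 \lambda \beta}{(\lambda - \beta)^2} \, C = \frac{C}{\lambda - \beta}.
$$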

3170: Let us define the posterior $\widehat{\nu}$ by the identity

3171: $$

3172: \frac{d\widehat{\nu}}{d\pi} (\theta) = \frac{\exp

3173: \Bigl[ - \frac{N}{2} \log \left(

3174: \frac{1+\beta}{1-\beta}\right) r'(\theta, \Btheta) - \frac{N}{2}

3175: \log(1 - \beta^2) m'(\theta, \Btheta)\Bigr]}{

3176: \pi \Bigl\{ \exp

3177: \Bigl[ - \frac{N}{2} \log \left(

3178: \frac{1+\beta}{1-\beta}\right) r'(\cdot, \Btheta) - \frac{N}{2}

3179: \log(1 - \beta^2) m'(\cdot, \Btheta)\Bigr]\Bigr\}}.

3180: $$

3181: It is useful to remark that

3182: \begin{multline*}

3183: \frac{1}{N} \log \biggl\{ \nu \biggl[ \exp \Bigl[ \frac{N}{2} \log \Bigl(

3184: \frac{(1 + \lambda)(1 - \beta)}{(1 - \lambda) (1 + \beta) } \Bigr)

3185: r'(\cdot, \Btheta) \\ \shoveright{- \frac{N}{2} \log \bigl[ (1 - \lambda^2)

3186: (1 - \beta^2) \bigr] m'(\cdot, \Btheta) \Bigr] \biggr] \biggr\}\qquad} \\

3187: \shoveleft{\qquad \leq

3188: \widehat{\nu}

3189: \biggl\{ \frac{1}{2}

3190: \log \Bigl( \frac{(1+\lambda)(1-\beta)}{(1 - \lambda)(1+\beta)}\Bigr)

3191: r'( \cdot, \Btheta) }\\ - \frac{1}{2} \log\bigl[ (1 - \lambda^2)(1 - \beta^2) \bigr]

3192: m'(\cdot, \Btheta) \biggr\}.

3193: \end{multline*}
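Here is a sketch of the reason why this inequality holds. Let us write,
only for the purpose of this remark,
$$
f = \frac{N}{2} \log \Bigl( \frac{(1 + \lambda)(1 - \beta)}{(1 - \lambda)(1 + \beta)}
\Bigr) r'(\cdot, \Btheta) - \frac{N}{2} \log \bigl[ (1 - \lambda^2)(1 - \beta^2) \bigr]
m'(\cdot, \Btheta).
$$
Adding $f$ to the exponent defining $\nu$ gives
$$
- \frac{N}{2} \log \Bigl( \frac{1 + \lambda}{1 - \lambda} \Bigr) r'(\cdot, \Btheta)
+ \frac{N}{2} \log(1 - \lambda^2) m'(\cdot, \Btheta) + f
= - \frac{N}{2} \log \Bigl( \frac{1 + \beta}{1 - \beta} \Bigr) r'(\cdot, \Btheta)
- \frac{N}{2} \log(1 - \beta^2) m'(\cdot, \Btheta),
$$
which is the exponent defining $\widehat{\nu}$. In other words
$\frac{d \widehat{\nu}}{d \nu} = \frac{\exp(f)}{\nu [ \exp(f) ]}$, so that
$\widehat{\nu} \bigl[ \exp(-f) \bigr] = \nu \bigl[ \exp(f) \bigr]^{-1}$.
Jensen's inequality then gives
$$
\frac{1}{N} \log \bigl\{ \nu \bigl[ \exp(f) \bigr] \bigr\}
= - \frac{1}{N} \log \bigl\{ \widehat{\nu} \bigl[ \exp(-f) \bigr] \bigr\}
\leq \frac{1}{N} \widehat{\nu}(f).
$$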

3194: Let us introduce as previously

3195: $

3196: \Bphi(x) = \sup_{\theta \in \Theta} m'(\theta, \Btheta) -

3197: x \, r'(\theta, \Btheta)$, $x \in \RR_+$.

3198: Let us moreover consider $

3199: \Tphi(x) = \sup_{\theta \in \Theta_1} m'(\theta, \Btheta) -

3200: x \, r'(\theta, \Btheta)$, $x \in \RR_+$. These functions can be

3201: used to produce a result which is slightly weaker, but maybe easier

3202: to read and understand. Indeed, going back a few steps,

3203: we see that, for any $x \in \RR_+$, with $\PP$ probability at least $1 - \epsilon$,

3204: for any posterior distribution $\rho$,

3205:

3206: \begin{multline*}

3207: - N \log \Bigl\{\Bigl( 1 - \lambda \rho \bigl[R'(\cdot, \T)\bigr] \Bigr)

3208: \Bigl(1 + \beta \rho \bigl[ R'(\cdot, \T) \bigr] \Bigr) \Bigr\}

3209: \\*\shoveleft{\qquad \leq \frac{N}{2} \log \left[ \frac{(1+\lambda)}{(1-\lambda)(1 - \lambda^2)^x}\right]

3210: \rho \bigl[ r'(\cdot, \Btheta) \bigr] }

3211: \\*\shoveleft{\qquad\qquad - \frac{N}{2} \log\bigl[ (1 - \lambda^2)(1 -

3212: \beta^2) \bigr]  \Bphi(x)} + \C{K}(\rho, \pi)

3213: \\*\shoveleft{\qquad\qquad + \log \biggl\{ \pi \biggl[ \exp \Bigl\{

3214: - \tfrac{N}{2} \log \Bigl[ \tfrac{(1+\beta)}{(1-\beta)(1 - \beta^2)^x}\Bigr]

3215: r'(\cdot, \Btheta) \Bigr\} \biggr] \biggr\}

3216: }\\* \shoveleft{\qquad\qquad - \frac{N}{2} \log\bigl[

3217: (1-\lambda^2)(1-\beta^2) \bigr]

3218: \Tphi \left( \frac{ \log \left[ \frac{(1+\lambda)(1-\beta)}{(1-\lambda)(1+\beta)}

3219: \right]}{- \log\left[ (1 - \lambda^2)(1 - \beta^2) \right]} \right)

3220: }\\*\shoveright{- 2 \log(\epsilon)\qquad}

3221: \\ \shoveleft{ \qquad =

3222: \int_{\frac{N}{2} \log \left[ \frac{(1+\beta)}{(1 - \beta)(1 - \beta^2)^x} \right]}^{

3223: \frac{N}{2} \log \left[ \frac{(1+\lambda)}{(1 - \lambda)(1 - \lambda^2)^x} \right]}

3224: \pi_{\exp (- \alpha r)}\bigl[ r'(\cdot, \Btheta)\bigr] d \alpha}

3225: \\* \shoveright{+ \C{K}(\rho, \pi_{\exp \{ - \frac{N}{2} \log [ \frac{(1+\lambda)}{(1-\lambda)

3226: (1-\lambda^2)^x}] r \}}) - 2 \log (\epsilon)\quad}

3227: \\* - \frac{N}{2} \log \bigl[ (1 - \lambda^2)(1 - \beta^2) \bigr]

3228: \left[ \Bphi(x) + \Tphi \left( \frac{\log \left[ \frac{(1+\lambda)(1-\beta)}{(1-\lambda)

3229: (1 + \beta)} \right]}{- \log [ (1 - \lambda^2)(1 - \beta^2) ]} \right) \right].

3230: \end{multline*}

3231: \begin{thm}\mypoint

3232: With the previous notations, for any real constants $0 \leq \beta < \lambda < 1$,

3233: for any positive real constant $x$, for any prior probability distribution

3234: $\pi \in \C{M}_+^1(\Theta)$, for any subset $\Theta_1 \subset \Theta$,

3235: with $\PP$ probability at least $1 - \epsilon$,

3236: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

3237: putting

3238: \begin{multline*}

3239: B(\rho) =

3240: \frac{1}{N(\lambda - \beta)}

3241: \int_{\frac{N}{2} \log \left[ \frac{(1+\beta)}{(1 - \beta)(1 - \beta^2)^x} \right]}^{

3242: \frac{N}{2} \log \left[ \frac{(1+\lambda)}{(1 - \lambda)(1 - \lambda^2)^x} \right]}

3243: \pi_{\exp (- \alpha r)}\bigl[ r'(\cdot, \Btheta)\bigr] d \alpha

3244: \\ + \frac{\C{K}(\rho, \pi_{\exp \{ - \frac{N}{2} \log [ \frac{(1+\lambda)}{(1-\lambda)

3245: (1-\lambda^2)^x}] r \}}) - 2 \log (\epsilon)}{N(\lambda - \beta)}\\

3246: - \frac{1}{2(\lambda - \beta)} \log \bigl[ (1 - \lambda^2)(1 - \beta^2) \bigr]

3247: \left[ \Bphi(x) + \Tphi \left( \frac{\log \left[ \frac{(1+\lambda)(1-\beta)}{(1-\lambda)

3248: (1 + \beta)} \right]}{- \log [ (1 - \lambda^2)(1 - \beta^2) ]} \right) \right]

3249: \\ \shoveleft{\leq

3250: \frac{1}{N(\lambda - \beta)}

3251: d_e \log \left( \frac{\log \Bigl[ \frac{(1+\lambda)}{(1-\lambda)(1-\lambda^2)^x}\Bigr]}{

3252: \log \Bigl(\frac{(1+\beta)}{(1-\beta)(1-\beta^2)^x}\Bigr)}\right)}

3253: \\ + \frac{\C{K}(\rho, \pi_{\exp \{ - \frac{N}{2} \log [ \frac{(1+\lambda)}{(1-\lambda)

3254: (1-\lambda^2)^x}] r \}}) - 2 \log (\epsilon)}{N(\lambda - \beta)}\\

3255: - \frac{1}{2(\lambda - \beta)} \log \bigl[ (1 - \lambda^2)(1 - \beta^2) \bigr]

3256: \left[ \Bphi(x) + \Tphi \left( \frac{\log \left[ \frac{(1+\lambda)(1-\beta)}{(1-\lambda)

3257: (1 + \beta)} \right]}{- \log [ (1 - \lambda^2)(1 - \beta^2) ]} \right) \right],

3258: \end{multline*}

3259: the following bounds hold true:

3260: \begin{multline*}

3261: \rho(R) - \inf_{\Theta_1} R \\ \leq \frac{\lambda - \beta}{2 \lambda \beta}

3262: \Biggl(

3263: \sqrt{

3264: 1 + \frac{4 \lambda \beta}{(\lambda - \beta)^2}

3265: \Bigl\{ 1 - \exp \bigl[ - (\lambda - \beta)  B(\rho)

3266: \bigr] \Bigr\}} - 1 \Biggr) \\ \leq B(\rho).

3267: \end{multline*}

3268: \end{thm}

3269: Let us remark that this alternative way of handling

3270: relative deviation bounds

3271: made it possible to carry on with non linear bounds up to the final result.

3272: (For instance, if $\lambda = 0.5$, $\beta = 0.2$ and $B(\rho) = 0.1$,

3273: the non linear bound gives $\rho(R) - \inf_{\Theta_1} R \leq 0.096$.)
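The arithmetic behind this example can be detailed as follows (rounded
intermediate values are given only as a check). With $\lambda = 0.5$,
$\beta = 0.2$ and $B(\rho) = 0.1$,
$$
\frac{\lambda - \beta}{2 \lambda \beta} = \frac{0.3}{0.2} = 1.5, \qquad
\frac{4 \lambda \beta}{(\lambda - \beta)^2} = \frac{0.4}{0.09} \simeq 4.444, \qquad
1 - \exp \bigl[ - (\lambda - \beta) B(\rho) \bigr] = 1 - \exp(-0.03) \simeq 0.02955,
$$
so that the non linear bound is approximately
$$
1.5 \bigl( \sqrt{1 + 4.444 \times 0.02955} - 1 \bigr)
\simeq 1.5 \times 0.06365 \simeq 0.0955,
$$
to be compared with the linear bound $B(\rho) = 0.1$.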

3274:

3275: \subsection{Bounds relative to a Gibbs distribution} The empirical bounds

3276: of the previous section

3277: involve taking suprema in $\theta \in \Theta$, and replacing the

3278: {\em margin function} $\varphi$ by some empirical counterparts

3279: $\Bphi$ or $\Tphi$, which may prove unsafe

3280: when using very complex classification models. Moreover,

3281: they are not easy to analyze

3282: with PAC-Bayesian tools. To remedy these

3283: weaknesses, we are now going to propose

3284: another type of relative bounds. We will first explain how to

3285: compare

3286: the expected error rate $\rho(R)$ of any posterior distribution

3287: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$

3288: with $\pi_{\exp( - \beta R)}(R)$,

3289: the expected risk of a Gibbs prior distribution.

3290: We will then show how to analyze the behaviour of this

3291: bound. This will provide an

3292: estimator proven to reach adaptively the best possible

3293: asymptotic behaviour of the error rate under Mammen

3294: and Tsybakov margin assumptions and parametric complexity

3295: assumptions.

3296:

3297: Then, we will provide an empirical bound for the Kullback

3298: divergence $\C{K}(\rho, \pi_{\exp( - \beta R)})$

3299: of a posterior distribution with respect to a Gibbs prior,

3300: making use of relative deviation inequalities.

3301:

3302: To tackle the question of model selection,

3303: we will estimate the relative performance

3304: of one posterior distribution with respect to another,

3305: which is useful when the two posteriors are supported by

3306: different models.

3307:

3308: Finally, we will propose a more integrated approach to model selection,

3309: showing how to build a two-step localization strategy, in which

3310: the performance of the posterior distribution to be analyzed is

3311: compared with some {\em two step} Gibbs prior.

3312:

3313: \subsubsection{Comparing a posterior distribution with a Gibbs prior}

3314: \newcommand{\wt}[1]{\widetilde{#1}}

3315: Similarly to Theorem \ref{thm2.2.18} we can prove that for any prior distribution

3316: $\wt{\pi} \in \C{M}_+^1(\Theta)$,

3317: \begin{multline}

3318: \label{eq1.1.15}

3319: \PP \Biggl\{ \wt{\pi} \otimes \wt{\pi} \biggl\{ \exp \biggl[ -

3320: N \log (1 - \lambda R') \\ - \frac{N}{2}\log \left( \frac{1+\lambda}{1-\lambda}

3321: \right) r' + \frac{N}{2} \log \bigl(1 - \lambda^2\bigr) m' \biggr] \biggr\}

3322: \Biggr\} \leq 1.

3323: \end{multline}

3324: Replacing $\wt{\pi}$ with $\pi_{\exp( - \beta R)}$ and considering

3325: the posterior distribution $\rho \otimes \pi_{\exp( - \beta R)}$,

3326: provides a starting point in the comparison of

3327: $\rho$ with $\pi_{\exp( - \beta R)}$; we can indeed

3328: state with $\PP$ probability at least $1 - \epsilon$ that

3329: \begin{multline}

3330: \label{eq1.1.17}

3331: - N \log \Bigl\{ 1 - \lambda \Bigl[

3332: \rho(R) - \pi_{\exp( - \beta R)}(R) \Bigr] \Bigr\}

3333: \\ \leq \frac{N}{2} \log \left( \frac{1+\lambda}{1-\lambda}\right)

3334: \bigl[ \rho(r) - \pi_{\exp(- \beta R)}(r) \bigr]

3335: \\ \qquad - \frac{N}{2} \log\bigl(1 - \lambda^2\bigr) \rho \otimes \pi_{\exp( - \beta R)}

3336: (m') \\ + \C{K}\bigl[ \rho, \pi_{\exp(- \beta R)} \bigr] - \log(\epsilon).

3337: \end{multline}

3338: Using the parameter

3339: $\gamma = \frac{N}{2} \log \left( \frac{1+\lambda}{1-\lambda}\right)$,

3340: so that $\lambda = \tanh \left(\frac{\gamma}{N}\right)$ and

3341: $-\frac{N}{2} \log ( 1 - \lambda^2) = N \log \bigl[ \cosh(\frac{\gamma}{N})\bigr]$,

3342: and noticing that

3343: \begin{multline}

3344: \label{eq1.1.16}

3345: \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]

3346: = \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]

3347: \\ + \C{K}(\rho, \pi) - \C{K}\bigl[\pi_{\exp( - \beta R)}, \pi\bigr],

3348: \end{multline}

3349: takes us one step further in the proper handling of the entropy term:

3350: \begin{multline}

3351: \label{eq1.1.20}

3352: - N \log \Bigl\{ 1 - \tanh(\tfrac{\gamma}{N})

3353: \Bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \Bigr] \Bigr\}

3354: - \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]

3355: \\ \leq \gamma \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]

3356: + N \log \bigl[ \cosh \bigl(\tfrac{\gamma}{N} \bigr)\bigr]

3357: \rho \otimes \pi_{\exp( - \beta R)}(m')

3358: \\ + \C{K}(\rho, \pi) - \C{K}\bigl[ \pi_{\exp( - \beta R)}, \pi \bigr]

3359: - \log(\epsilon).

3360: \end{multline}
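The two ingredients of this step can be verified as follows. First,
the definition of $\gamma$ says that
$\frac{\gamma}{N} = \frac{1}{2} \log \bigl( \frac{1 + \lambda}{1 - \lambda} \bigr)$
is the inverse hyperbolic tangent of $\lambda$, and
$$
1 - \lambda^2 = 1 - \tanh \bigl( \tfrac{\gamma}{N} \bigr)^2
= \cosh \bigl( \tfrac{\gamma}{N} \bigr)^{-2},
\quad \text{so that} \quad
- \frac{N}{2} \log(1 - \lambda^2) = N \log \bigl[ \cosh \bigl( \tfrac{\gamma}{N}
\bigr) \bigr].
$$
Second, equation \eqref{eq1.1.16} follows from the explicit form of the
density of $\pi_{\exp( - \beta R)}$ with respect to $\pi$, which gives
\begin{align*}
\C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]
& = \beta \rho(R) + \C{K}(\rho, \pi) + \log \bigl\{ \pi \bigl[ \exp( - \beta R)
\bigr] \bigr\}, \\
\C{K}\bigl[ \pi_{\exp( - \beta R)}, \pi \bigr]
& = - \beta \pi_{\exp( - \beta R)}(R) - \log \bigl\{ \pi \bigl[ \exp( - \beta R)
\bigr] \bigr\}.
\end{align*}
Eliminating $\log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\}$ between
these two lines gives \eqref{eq1.1.16}, and moving
$\beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]$ to the left-hand
side of \eqref{eq1.1.17} gives \eqref{eq1.1.20}.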

3361:

3362: We can then decompose in the right-hand side

3363: $\gamma \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]$ into

3364: $(\gamma - \lambda) \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]

3365: + \lambda \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]$

3366: and use the fact that

3367: \begin{multline*}

3368: \lambda \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]

3369: + N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho \otimes

3370: \pi_{\exp( - \beta R)}(m') \\ \shoveright{+ \C{K}(\rho, \pi)

3371: - \C{K}\bigl[ \pi_{\exp( - \beta R)}, \pi \bigr]}

3372: \\ \leq \lambda \rho(r) + \C{K}(\rho, \pi) + \log \Bigl\{

3373: \pi \Bigl[ \exp \bigl\{ - \lambda r + N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho(m') \bigr\}

3374: \Bigr] \Bigr\} \\

3375: = \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)}\bigr]

3376: + \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[ \exp \bigl\{ N \log \bigl[

3377: \cosh(\tfrac{\gamma}{N}) \bigr] \rho(m') \bigr\} \Bigr] \Bigr\},

3378: \end{multline*}

3379: to get rid of the appearance of the unobserved Gibbs prior $\pi_{\exp( - \beta R)}$

3380: in most places of the right-hand side of our inequality, leading to

3381: \begin{thm}

3382: \mypoint

3383: \label{thm1.1.41Bis}

3384: For any real constants $\beta$ and $\gamma$,

3385: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution

3386: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$, for any real constant $\lambda$,

3387: \begin{multline*}

3388: \bigl[ N \tanh(\tfrac{\gamma}{N}) - \beta \bigr]

3389: \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr] \\

3390: \shoveleft{\qquad \leq - N \log \Bigl\{ 1 - \tanh(\tfrac{\gamma}{N}) \Bigl[ \rho(R)

3391: - \pi_{\exp( - \beta R)}(R) \Bigr] \Bigr\} }

3392: \\ \shoveright{- \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]}

3393: \\ \shoveleft{\qquad \leq (\gamma - \lambda) \bigl[

3394: \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]

3395: + \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)}\bigr]}

3396: \\\shoveright{ + \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[ \exp \bigl\{ N

3397: \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho(m') \bigr\} \Bigr] \Bigr\} -

3398: \log(\epsilon)}

3399: \\ \shoveleft{\qquad = \C{K}\bigl[ \rho, \pi_{\exp (- \gamma r)} \bigr] }

3400: \\ + \log \Bigl\{ \pi_{\exp( - \gamma r)} \Bigl[

3401: \exp \bigl\{ (\gamma - \lambda) r + N \log \bigl[ \cosh(\tfrac{\gamma}{N})

3402: \bigr] \rho(m') \bigr\} \Bigr] \Bigr\} \\

3403: -( \gamma - \lambda) \pi_{\exp( - \beta R)}(r)

3404: - \log(\epsilon).

3405: \end{multline*}

3406: \end{thm}

3407: We would like to have a fully empirical upper bound even in the case when $\lambda

3408: \neq \gamma$. This can be done by using the theorem twice. We will

3409: need the following lemma.

3410: \begin{lemma}

3411: \label{lemma1.38}

3412: For any probability distribution $\pi \in \C{M}_+^1(\Theta)$,

3413: for any bounded measurable functions $g,h: \Theta \rightarrow \RR$,

3414: $$

3415: \pi_{\exp( -g )}(g) - \pi_{\exp(-h)}(g) \leq

3416: \pi_{\exp(-g)}(h) - \pi_{\exp(-h)}(h).

3417: $$

3418: \end{lemma}

3419: \begin{proof}

3420: Let us notice that

3421: \begin{multline*}

3422: 0 \leq \C{K}(\pi_{\exp( - g)}, \pi_{\exp( - h )})

3423: = \pi_{\exp( - g)}(h)

3424: + \log \bigl\{ \pi \bigl[ \exp ( - h) \bigr] \bigr\} + \C{K}(\pi_{\exp( - g)}, \pi)

3425: \\ = \pi_{\exp( - g)}(h) - \pi_{\exp( - h)}(h) - \C{K}(\pi_{\exp( - h)}, \pi)

3426: + \C{K}(\pi_{\exp( - g)}, \pi)

3427: \\ = \pi_{\exp( - g)}(h) - \pi_{\exp( - h)}(h) - \C{K}(\pi_{\exp( - h)}, \pi)

3428: - \pi_{\exp( - g)}(g) - \log \bigl\{ \pi \bigl[ \exp ( - g) \bigr] \bigr\}.

3429: \end{multline*}

3430: Moreover

3431: $$

3432: - \log \bigl\{ \pi \bigl[ \exp( - g) \bigr] \bigr\} \leq \pi_{\exp( - h)}(g)

3433: + \C{K}(\pi_{\exp( - h)}, \pi),

3434: $$

3435: which completes the proof.

3436: \end{proof}

3437:

3438: For any positive real constants $\beta$ and $\lambda$,

3439: we can then apply Theorem \ref{thm1.1.41Bis} to $\rho = \pi_{\exp( - \lambda r)}$,

3440: and use the inequality

3441: \begin{equation}

3442: \label{eq1.1.22}

3443: \frac{\lambda}{\beta} \bigl[

3444: \pi_{\exp( - \lambda r)}(r) - \pi_{\exp( - \beta R)}(r) \bigr]

3445: \leq \pi_{\exp( - \lambda r)}(R) -

3446: \pi_{\exp( - \beta R) }(R)

3447: \end{equation}

3448: provided by the previous lemma.
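To spell out how inequality \eqref{eq1.1.22} is obtained, one can apply
Lemma \ref{lemma1.38} to the functions $g = \lambda r$ and $h = \beta R$,
which gives
$$
\lambda \bigl[ \pi_{\exp( - \lambda r)}(r) - \pi_{\exp( - \beta R)}(r) \bigr]
\leq \beta \bigl[ \pi_{\exp( - \lambda r)}(R) - \pi_{\exp( - \beta R)}(R) \bigr],
$$
and then divide both sides by $\beta > 0$.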

3449: We thus obtain with $\PP$ probability at least $1 - \epsilon$

3450: \begin{multline*}

3451: - N \log \Bigl\{ 1 - \tanh(\tfrac{\gamma}{N}) \tfrac{\lambda}{\beta}

3452: \Bigl[ \pi_{\exp

3453: (- \lambda r)} (r) - \pi_{\exp( - \beta R)}(r) \Bigr] \Bigr\}

3454: \\ \shoveright{- \gamma \bigl[

3455: \pi_{\exp( - \lambda r)}(r) - \pi_{\exp( - \beta R)}(r) \bigr] }

3456: \\ \leq \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[

3457: \exp \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \pi_{\exp( - \lambda r)}

3458: (m') \bigr\} \Bigr] \Bigr\} - \log(\epsilon).

3459: \end{multline*}

3460: Let us

3461: introduce the convex function

3462: $$

3463: F_{\gamma, \alpha}(x) = - N \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N})

3464: x \bigr] - \alpha x \geq \bigl[ N \tanh(\tfrac{\gamma}{N}) - \alpha \bigr] x.

3465: $$
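The lower bound in this definition comes from the elementary inequality
$- \log(1 - u) \geq u$, $u < 1$, applied to
$u = \tanh(\frac{\gamma}{N}) x$. The convexity can be checked from the
second derivative
$$
F_{\gamma, \alpha}''(x) = \frac{N \tanh(\frac{\gamma}{N})^2}{\bigl[ 1 -
\tanh(\frac{\gamma}{N}) x \bigr]^2} \geq 0.
$$
When $N \tanh(\frac{\gamma}{N}) > \alpha$, the function $F_{\gamma, \alpha}$
is increasing on the non negative part of its domain, and, reading
$F_{\gamma, \alpha}^{-1}$ as the inverse of this increasing branch
(this is our interpretation of the notation), the linear lower bound
translates into
$$
F_{\gamma, \alpha}^{-1}(y) \leq \frac{y}{N \tanh(\frac{\gamma}{N}) - \alpha},
\qquad y \geq 0,
$$
which is a convenient way to linearize bounds involving
$F_{\gamma, \alpha}^{-1}$.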

3466: With $\PP$ probability at least $1 - \epsilon$,

3467: \begin{multline*}

3468: - \pi_{\exp( - \beta R)}(r)

3469: \leq \inf_{\lambda \in \RR_+^*} \biggl\{ - \pi_{\exp( - \lambda r)}(r) \\*

3470: + \frac{\beta}{\lambda} F_{\gamma,

3471: \frac{\beta \gamma}{\lambda}}^{-1} \biggl[

3472: \log \Bigl\{ \pi_{\exp(- \lambda r)} \Bigl[ \exp

3473: \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr]

3474: \pi_{\exp( - \lambda r)}(m') \bigr\} \Bigr] \Bigr\}

3475: \\ - \log(\epsilon) \biggr] \biggr\}.

3476: \end{multline*}

3477: Since Theorem \ref{thm1.1.41Bis} holds uniformly for any posterior distribution

3478: $\rho$, we can apply it again to some arbitrary posterior distribution $\rho$.

3479: We can moreover make the result uniform in $\beta$ and $\gamma$ by considering

3480: some atomic measure $\nu \in \C{M}_+^1(\RR)$ on the real line and using a union bound.

3481: This leads to

3482: \begin{thm}

3483: \mypoint

3484: \label{thm1.1.43}

3485: For any atomic probability distribution on the positive real line

3486: $\nu \in \C{M}_+^1(\RR_+)$,

3487: with $\PP$ probability

3488: at least $1 - \epsilon$, for any posterior distribution $\rho :

3489: \Omega \rightarrow \C{M}_+^1(\Theta)$, for any positive real constants $\beta$

3490: and $\gamma$,

3491: \begin{multline*}

3492: \bigl[ N \tanh(\tfrac{\gamma}{N}) - \beta \bigr] \bigl[ \rho(R) -

3493: \pi_{\exp( - \beta R)}(R) \bigr]

3494: \\* \shoveright{\leq

3495: F_{\gamma, \beta}\bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]

3496: \leq  B(\rho, \beta, \gamma), \text{ where}}\\\shoveleft{B(\rho, \beta, \gamma) = \inf_{

3497: \substack{\lambda_1 \in \RR_+, \lambda_1 \leq \gamma\\

3498: \lambda_2 \in \RR, \lambda_2 >

3499: \frac{\beta \gamma}{N} \tanh(\frac{\gamma}{N})^{-1}

3500: }} \Biggl\{

3501: \C{K}\bigl[ \rho, \pi_{\exp( - \lambda_1 r)} \bigr] }

3502: \\\shoveleft{\qquad + (\gamma - \lambda_1) \bigl[ \rho(r)

3503: - \pi_{\exp( - \lambda_2 r)}(r) \bigr]}

3504: \\\shoveleft{\qquad + \log \Bigl\{ \pi_{\exp( - \lambda_1 r)} \Bigl[ \exp \bigl\{

3505: N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho(m') \bigr\} \Bigr] \Bigr\}

3506: - \log \bigl[ \epsilon \nu(\beta) \nu(\gamma) \bigr]}\\

3507: \shoveleft{\qquad + (\gamma - \lambda_1) \frac{\beta}{\lambda_2}

3508: F_{\gamma, \frac{\beta \gamma}{\lambda_2}}^{-1}  \biggl[

3509: \log \Bigl\{ }\\ \pi_{\exp( - \lambda_2 r)} \Bigl[ \exp \bigl\{

3510: N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \pi_{\exp( - \lambda_2 r)}(m')

3511: \bigr\} \Bigr] \Bigr\} \\\shoveright{ - \log \bigl[ \epsilon \nu(\beta)

3512: \nu(\gamma)\bigr]  \biggr] \Biggr\}}

3513: \\\shoveleft{\leq  \inf_{

3514: \substack{\lambda_1 \in \RR_+, \lambda_1 \leq \gamma\\

3515: \lambda_2 \in \RR, \lambda_2 >

3516: \frac{\beta \gamma}{N} \tanh(\frac{\gamma}{N})^{-1}

3517: }} \Biggl\{

3518: \C{K}\bigl[ \rho, \pi_{\exp( - \lambda_1 r)} \bigr]

3519: }\\\shoveleft{\qquad+ (\gamma - \lambda_1) \bigl[

3520: \rho(r) - \pi_{\exp( - \lambda_2 r)}(r) \bigr]}

3521: \\\shoveleft{\qquad+ \log \Bigl\{ \pi_{\exp( - \lambda_1 r)} \Bigl[ \exp \bigl\{

3522: N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho(m') \bigr\} \Bigr] \Bigr\}}

3523: \\\shoveleft{\qquad + \frac{\beta}{\lambda_2} \frac{(1 - \frac{\lambda_1}{\gamma})}{

3524: \bigl[ \frac{N}{\gamma} \tanh(\frac{\gamma}{N}) - \frac{\beta}{\lambda_2}\bigr]}

3525: \log \Bigl\{ \pi_{\exp( - \lambda_2 r)} \Bigl[ }

3526: \\ \exp \bigl\{

3527: N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \pi_{\exp( - \lambda_2 r)}(m')

3528: \bigr\} \Bigr] \Bigr\} \\

3529: - \Bigl\{ 1 + \frac{\beta}{\lambda_2} \tfrac{(1 - \frac{\lambda_1}{\gamma})}{

3530: [ \frac{N}{\gamma} \tanh(\frac{\gamma}{N}) - \frac{\beta}{\lambda_2}]} \Bigr\}

3531: \log \bigl[ \epsilon \nu( \beta) \nu( \gamma) \bigr]  \Biggr\},

3532: \end{multline*}

3533: where we have written for short $\nu(\beta)$ and $\nu(\gamma)$ instead

3534: of $\nu(\{\beta\})$ and $\nu(\{\gamma\})$.

3535: \end{thm}

3536: Let us notice that $B(\rho, \beta, \gamma) = + \infty$ when $\nu(\beta) = 0$

3537: or $\nu(\gamma) = 0$, the uniformity in $\beta$ and $\gamma$ of the

3538: theorem therefore necessarily bears on a countable number of values of these parameters.

3539: We can typically choose for $\nu$ distributions such as the one

3540: used in Theorem \ref{thm1.1.11} on page \pageref{thm1.1.11}:

3541: namely we can put for some positive real ratio $\alpha > 1$

3542: $$

3543: \nu(\alpha^k) = \frac{1}{(k+1)(k+2)}, \qquad k \in \NN,

3544: $$

3545: or alternatively, since we are interested in values of the parameters

3546: less than $N$, we can prefer

3547: $$

3548: \nu(\alpha^k) = \frac{\log(\alpha)}{\log(\alpha N)},

3549: \qquad 0 \leq k < \frac{\log(N)}{\log(\alpha)}.

3550: $$
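As a quick check of these two choices, the first one is a probability
distribution because the series is telescopic:
$$
\sum_{k \in \NN} \frac{1}{(k+1)(k+2)} = \sum_{k \in \NN} \left(
\frac{1}{k+1} - \frac{1}{k+2} \right) = 1.
$$
For the second one, the number of integers $k$ such that
$0 \leq k < \frac{\log(N)}{\log(\alpha)}$ is at most
$\frac{\log(N)}{\log(\alpha)} + 1 = \frac{\log(\alpha N)}{\log(\alpha)}$,
so that the total mass is at most one, which is what is needed for the
union bound argument (the remaining mass, if any, can be put on an
arbitrary point). In terms of the penalty appearing in the bound, the first
choice costs $- \log \bigl[ \nu(\alpha^k) \bigr] = \log \bigl[ (k+1)(k+2) \bigr]$,
which depends on $k$, whereas the second one costs the constant
$\log \bigl[ \frac{\log(\alpha N)}{\log(\alpha)} \bigr]$.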

3551: We can also use such a coding distribution on dyadic numbers

3552: as the one defined by equation \eqref{eq1.1.4bis} on page \pageref{eq1.1.4bis}.

3553:

3554: \subsubsection{The effective temperature of a posterior distribution}

3555: Using the parametric approximation $\pi_{\exp( - \alpha r)}(r)

3556: - \inf_{\Theta} r \simeq \frac{d_e}{\alpha}$, we get as an order of magnitude

3557: \begin{multline*}

3558: B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma) \lesssim

3559: - (\gamma - \lambda_1) d_e \bigl[ \lambda_2^{-1} - \lambda_1^{-1} \bigr]

3560: \\ \shoveleft{\qquad + 2 d_e \log \frac{\lambda_1}{ \lambda_1

3561: - N\log\bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] x}}\\*

3562: \qquad\qquad + 2 \frac{\beta}{\lambda_2} \frac{(1 - \frac{\lambda_1}{\gamma})}{

3563: \bigl[ \frac{N}{\gamma}\tanh(\tfrac{\gamma}{N}) - \frac{\beta}{\lambda_2} \bigr]} d_e \log

3564: \left( \frac{ \lambda_2}{\lambda_2 - N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] x}

3565: \right) \\*

3566: \qquad\qquad\qquad\qquad + 2 N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \biggl[ 1 + \frac{\beta}{\lambda_2}

3567: \frac{(1 - \frac{\lambda_1}{\gamma})}{ \bigl[ \frac{N}{\gamma}

3568: \tanh(\frac{\gamma}{N}) - \frac{\beta}{\lambda_2} \bigr]} \biggr] \Tphi(x)

3569: \\ - \Bigl\{ 1 + \frac{\beta}{\lambda_2}

3570: \frac{(1 - \frac{\lambda_1}{\gamma})}{[\frac{N}{\gamma} \tanh(\tfrac{\gamma}{N})

3571: - \frac{\beta}{\lambda_2}]} \Bigr\} \log\bigl[ \nu(\beta) \nu(\gamma) \epsilon

3572: \bigr].

3573: \end{multline*}

3574: Therefore, if the empirical dimension $d_e$ stays bounded when $N$ increases,

3575: we are going to obtain a negative upper bound for any values of the constants

3576: $\lambda_1 > \lambda_2 > \beta$, as soon as $\gamma$ and $\frac{N}{\gamma}$

3577: are chosen to be large enough.

3578: This ability to obtain negative values for the bound $B(\pi_{\exp( - \lambda_1 r)},

3579: \beta, \gamma)$, and more generally $B(\rho, \beta, \gamma)$, leads the way

3580: to introducing the new concept of the {\em effective temperature} of an estimator.

3581: \begin{dfn}

3582: For any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$ we define

3583: the {\em effective temperature} $T(\rho) \in

3584: \RR \cup \{ - \infty, + \infty \}$ of $\rho$ by the equation

3585: $$

3586: \rho(R) = \pi_{\exp( - \frac{R}{T(\rho)})}(R).

3587: $$

3588: \end{dfn}

3589: Note that $\beta \mapsto \pi_{\exp( - \beta R)}(R) : \RR \cup \{ - \infty, + \infty \}

3590: \rightarrow (0,1)$ is continuous and strictly decreasing from $\ess \sup_{\pi} R$

3591: to $\ess \inf_{\pi} R$ (as soon as these two bounds do not coincide). This shows

3592: that the effective temperature $T(\rho)$ is a well defined random variable.

3593:

3594: Theorem \ref{thm1.1.43} provides a bound for $T(\rho)$, indeed:

3595: \begin{prop}\mypoint

3596: \label{prop1.1.37}

3597: Let

3598: $$

3599: \w{\beta}(\rho) = \sup \bigl\{ \beta \in \RR; \inf_{\gamma, N \tanh(\frac{\gamma}{N})

3600: > \beta}

3601: B(\rho, \beta, \gamma) \leq 0 \bigr\},

3602: $$

3603: where $B(\rho, \beta, \gamma)$ is as in Theorem \ref{thm1.1.43}.

3604: Then with $\PP$ probability at least $1 - \epsilon$, for any posterior

3605: distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

3606: $T(\rho) \leq \w{\beta}(\rho)^{-1}$, or equivalently

3607: $\rho(R) \leq \pi_{\exp[ - \w{\beta}(\rho)  R]}(R)$.

3608: \end{prop}
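The argument can be sketched as follows, in the simple case where the
infimum in $\gamma$ is reached. On the event of probability at least
$1 - \epsilon$ described by Theorem \ref{thm1.1.43}, if $\beta$ and
$\gamma$ are such that $N \tanh(\frac{\gamma}{N}) > \beta$ and
$B(\rho, \beta, \gamma) \leq 0$, then
$$
\bigl[ N \tanh(\tfrac{\gamma}{N}) - \beta \bigr] \bigl[ \rho(R) -
\pi_{\exp( - \beta R)}(R) \bigr] \leq B(\rho, \beta, \gamma) \leq 0,
\quad \text{so that} \quad \rho(R) \leq \pi_{\exp( - \beta R)}(R).
$$
Since $\beta \mapsto \pi_{\exp( - \beta R)}(R)$ is continuous and
non increasing, this inequality extends to the supremum
$\w{\beta}(\rho)$ of such values of $\beta$, and comparing with the
definition $\rho(R) = \pi_{\exp( - \frac{R}{T(\rho)})}(R)$ of the
effective temperature gives $T(\rho)^{-1} \geq \w{\beta}(\rho)$.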

3609: This notion of {\em effective temperature} of a (randomized) estimator

3610: $\rho$ is interesting for two reasons:

3611:

3612: $\bullet$ the difference $\rho(R) - \pi_{\exp( - \beta R)}(R)$ can be estimated

3613: with a better accuracy than $\rho(R)$ itself, due to the use of relative deviation

3614: inequalities, leading to convergence rates up to $1/N$ in favourable situations,

3615: even when $\inf_{\Theta} R$ is not close to zero;

3616:

3617: $\bullet$ and of course $\pi_{\exp( - \beta R)}(R)$ is a decreasing function

3618: of $\beta$, thus being able to estimate $\rho(R) - \pi_{\exp( - \beta R)}(R)$

3619: with some given accuracy, means being able to discriminate between values

3620: of $\rho(R)$ with the same accuracy, although doing so through the

3621: parametrization $\beta \mapsto \pi_{\exp( - \beta R)}(R)$, which can neither

3622: be observed nor estimated with the same precision!

3623:

3624: \subsubsection{Analysis of an empirical bound for the effective temperature}

3625: We are now going to launch into a mathematically rigorous analysis of

3626: the bound $B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma)$

3627: provided by Theorem \ref{thm1.1.43},

3628: to show that \linebreak $\inf_{\rho \in \C{M}_+^1(\Theta)}

3629: \pi_{\exp[ - \w{\beta}(\rho) R]}(R)$ converges indeed to $\inf_{\Theta} R$

3630: at some unimprovable rates in favourable situations.

3631:

3632: It is more convenient for this purpose to use deviation inequalities involving

3633: $M'$ rather than $m'$. It is straightforward to extend Theorem \ref{thm4.1} on

3634: page \pageref{thm4.1} to

3635: \begin{thm}

3636: \mypoint

3637: For any real constants $\beta$ and $\gamma$, for any prior distribution

3638: $\mu \in \C{M}_+^1(\Theta)$, with $\PP$ probability at least $1 - \eta$,

3639: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

3640: $$

3641: \gamma \rho \otimes \pi_{\exp( - \beta R)} \bigl[ \Psi_{\frac{\gamma}{N}}(R', M') \bigr]

3642: \leq \gamma \rho \otimes \pi_{\exp( - \beta R)}(r') + \C{K}(\rho, \mu) - \log(\eta).

3643: $$

3644: \end{thm}

3645: In order to transform the left-hand side into a linear expression and

3646: at the same time to localize this theorem, let us choose $\mu$ defined by its density

3647:

3648: \begin{multline*}

3649: \frac{d \mu}{d \pi}(\theta_1)

3650: = C^{-1} \exp \biggl[ - \beta R(\theta_1)

3651: \\* - \gamma \int_{\Theta} \Bigl\{

3652: \Psi_{\frac{\gamma}{N}} \bigl[ R'(\theta_1, \theta_2),

3653: M'(\theta_1, \theta_2) \bigr] \\* - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})

3654: R'(\theta_1, \theta_2) \Bigr\}  \pi_{\exp( - \beta R)}(d \theta_2) \biggr],

3655: \end{multline*}

3656: where $C$ is such that $\mu(\Theta) = 1$.

3657: We get

3658: \begin{multline*}

3659: \C{K}(\rho, \mu) = \beta \rho(R) + \gamma

3660: \rho \otimes \pi_{\exp( - \beta R)} \bigl[

3661: \Psi_{\frac{\gamma}{N}} (R', M') - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})

3662: R' \bigr] + \C{K}(\rho, \pi) \\

3663: \shoveleft{\qquad + \log \biggl\{ \int_{\Theta} \exp \biggl[ - \beta R(\theta_1)}

3664: \\ - \gamma \int_{\Theta} \Bigl\{

3665: \Psi_{\frac{\gamma}{N}} \bigl[ R'(\theta_1, \theta_2), M'(\theta_1,

3666: \theta_2) \bigr]\\\shoveright{ - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})

3667: R'(\theta_1, \theta_2) \Bigr\} \pi_{\exp( -

3668: \beta R)}(d \theta_2) \biggr] \pi ( d \theta_1) \biggr\}}

3669: \\\shoveleft{\quad= \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]}\\

3670: + \gamma \rho \otimes \pi_{\exp ( - \beta R)} \bigl[

3671: \Psi_{\frac{\gamma}{N}}(R', M') - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})

3672: R' \bigr]

3673: \\\shoveright{+ \C{K}(\rho, \pi) - \C{K}(\pi_{\exp( - \beta R)}, \pi)

3674: \qquad}\\

3675: \shoveleft{\qquad + \log \biggl\{ \int_{\Theta} \exp

3676: \biggl[ - \gamma \int_{\Theta} \Bigl\{ \Psi_{\frac{\gamma}{N}}

3677: \bigl[ R'(\theta_1, \theta_2),M'(\theta_1, \theta_2) \bigr]

3678: }\\ - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})

3679: R'(\theta_1, \theta_2) \Bigr\} \pi_{\exp( - \beta R)}(d \theta_2)

3680: \biggr] \pi_{\exp( - \beta R)}(d \theta_1) \biggr\}.

3681: \end{multline*}

3682: Thus with $\PP$ probability at least $1 - \eta$,

3683: \begin{multline}

3684: \label{eq1.1.23}

3685: \bigl[ N \sinh(\tfrac{\gamma}{N}) - \beta \bigr]

3686: \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]

3687: \\\shoveleft{\qquad \leq \gamma \bigl[ \rho(r) - \pi_{\exp ( - \beta R)}(r) \bigr] +

3688: \C{K}(\rho, \pi) - \C{K}(\pi_{\exp( - \beta R)}, \pi) - \log(\eta) +

3689: C(\beta, \gamma)}

3690: \\

3691: \shoveleft{\text{where } C(\beta, \gamma) = \log \biggl\{ \int_{\Theta} \exp

3692: \biggl[ - \gamma \int_{\Theta} \Bigl\{ \Psi_{\frac{\gamma}{N}}

3693: \bigl[ R'(\theta_1, \theta_2),M'(\theta_1, \theta_2) \bigr]

3694: }\\- \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})

3695: R'(\theta_1, \theta_2) \Bigr\} \pi_{\exp( - \beta R)}(d \theta_2)

3696: \biggr] \pi_{\exp( - \beta R)}(d \theta_1) \biggr\}.

3697: \end{multline}

3698: Remarking that

3699: $$

3700: \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]

3701: = \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]

3702: + \C{K}(\rho, \pi) - \C{K}(\pi_{\exp( - \beta R)}, \pi),

3703: $$

3704: we deduce from the previous inequality

3705: \begin{thm}\mypoint

3706: \label{thm1.1.45}

3707: For any real constants $\beta$ and $\gamma$, with $\PP$ probability

3708: at least $1 - \eta$, for any posterior distribution $\rho : \Omega

3709: \rightarrow \C{M}_+^1(\Theta)$,

3710: \begin{multline*}

3711: N \sinh(\tfrac{\gamma}{N}) \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R)

3712: \bigr] \leq \gamma \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]

3713: \\ + \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr] - \log(\eta)

3714: + C(\beta, \gamma).

3715: \end{multline*}

3716: \end{thm}
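
For the reader's convenience, here is a short verification of the entropy identity
used just before this theorem. It is only a sketch, written for the case when $\rho$ is
absolutely continuous with respect to $\pi$ (otherwise both sides are equal to $+ \infty$).
Since $\frac{d \pi_{\exp( - \beta R)}}{d \pi} = \frac{\exp( - \beta R)}{\pi[\exp( - \beta R)]}$,
\begin{align*}
\C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr] & = \C{K}(\rho, \pi) + \beta \rho(R)
+ \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\}, \\
\C{K}(\pi_{\exp( - \beta R)}, \pi) & = - \beta \pi_{\exp( - \beta R)}(R)
- \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\}.
\end{align*}
Eliminating $\log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\}$ between these two lines gives
$$
\C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]
= \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]
+ \C{K}(\rho, \pi) - \C{K}(\pi_{\exp( - \beta R)}, \pi).
$$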

3717: We can also go in a slightly different direction, starting

3718: again from equation \eqref{eq1.1.23} on page \pageref{eq1.1.23} and

3719: remarking that for any real constant $\lambda$,

3720: \begin{multline*}

3721: \lambda \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]

3722: + \C{K}(\rho, \pi) - \C{K}(\pi_{\exp(- \beta R)}, \pi)

3723: \\ \leq \lambda \rho(r) + \C{K}(\rho, \pi) + \log \bigl\{

3724: \pi \bigl[ \exp ( - \lambda r) \bigr] \bigr\} =

3725: \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr].

3726: \end{multline*}

3727: This leads to

3728: \begin{thm}\mypoint

3729: For any real constants $\beta$ and $\gamma$, with $\PP$ probability at least $1 - \eta$,

3730: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$ and any real constant $\lambda$,

3731: \begin{multline*}

3732: \bigl[ N \sinh(\tfrac{\gamma}{N}) - \beta \bigr]

3733: \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]

3734: \\ \leq (\gamma - \lambda)

3735: \bigl[ \rho(r) - \pi_{\exp ( - \beta R)}(r) \bigr] +

3736: \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr] - \log(\eta) + C(\beta, \gamma),

3737: \end{multline*}

3738: where the definition of $C(\beta, \gamma)$ is given by equation \eqref{eq1.1.23}

3739: on page \pageref{eq1.1.23}.

3740: \end{thm}
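
The inequality used to derive this theorem may be checked as follows (a brief sketch,
relying only on the usual Legendre duality between the entropy and the log-Laplace transform).
For any $\rho' \in \C{M}_+^1(\Theta)$ and any real constant $\lambda$,
$$
- \lambda \rho'(r) - \C{K}(\rho', \pi) \leq \log \bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr] \bigr\},
$$
with equality when $\rho' = \pi_{\exp( - \lambda r)}$. Applying this to
$\rho' = \pi_{\exp( - \beta R)}$ and adding $\lambda \rho(r) + \C{K}(\rho, \pi)$ to both sides gives
\begin{multline*}
\lambda \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]
+ \C{K}(\rho, \pi) - \C{K}(\pi_{\exp( - \beta R)}, \pi) \\
\leq \lambda \rho(r) + \C{K}(\rho, \pi)
+ \log \bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr] \bigr\}.
\end{multline*}
The theorem then follows from equation \eqref{eq1.1.23} after splitting
$\gamma \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]$ into
$(\gamma - \lambda) \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]
+ \lambda \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]$.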

3741:

3742: We can now use this inequality in the case when $\rho = \pi_{\exp( - \lambda r)}$

3743: and combine it with inequality \eqref{eq1.1.22} on page \pageref{eq1.1.22}

3744: to obtain

3745: \begin{thm}

3746: For any real constants $\beta$ and $\gamma$,

3747: with $\PP$ probability at least $1 - \eta$, for any real constant

3748: $\lambda$,

3749: $$

3750: \bigl[ \tfrac{N \lambda}{\beta} \sinh(\tfrac{\gamma}{N}) - \gamma \bigr]

3751: \bigl[ \pi_{\exp( - \lambda r)}(r) - \pi_{\exp( - \beta R)}(r) \bigr]

3752: \leq C(\beta, \gamma) - \log(\eta).

3753: $$

3754: \end{thm}

3755: We deduce from this theorem

3756: \begin{prop}

3757: For any real positive constants $\beta_1$, $\beta_2$ and

3758: $\gamma$, with $\PP$ probability at least $1 - \eta$, for any real constants

3759: $\lambda_1$ and $\lambda_2$, such that $\lambda_2 < \beta_2 \frac{\gamma}{N}

3760: \sinh(\frac{\gamma}{N})^{-1}$ and $\lambda_1 > \beta_1 \frac{\gamma}{N}

3761: \sinh(\frac{\gamma}{N})^{-1}$,

3762: \begin{multline*}

3763: \pi_{\exp( - \lambda_1 r)}(r) - \pi_{\exp( - \lambda_2 r)}(r)

3764: \leq \pi_{\exp( - \beta_1 R)}(r) - \pi_{\exp( - \beta_2 R)}(r)

3765: \\ + \frac{C(\beta_1, \gamma) + \log( 2 /\eta)}{\frac{N\lambda_1}{\beta_1}

3766: \sinh(\frac{\gamma}{N})- \gamma}

3767: + \frac{C(\beta_2, \gamma) + \log( 2 /\eta)}{\gamma - \frac{N\lambda_2}{\beta_2}

3768: \sinh(\frac{\gamma}{N})}.

3769: \end{multline*}

3770: \end{prop}
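
The way the two conditions on $\lambda_1$ and $\lambda_2$ enter may be made explicit as follows
(a sketch, where the shorthand $\Delta_i$ is introduced here only for this remark). Put
$\Delta_i = \frac{N \lambda_i}{\beta_i} \sinh(\frac{\gamma}{N}) - \gamma$, $i = 1, 2$, so that the
assumptions read $\Delta_1 > 0$ and $\Delta_2 < 0$. Applying the previous theorem twice, to
$(\beta_1, \gamma)$ and to $(\beta_2, \gamma)$, each time at confidence level $1 - \eta/2$, and using a
union bound, we get with $\PP$ probability at least $1 - \eta$
\begin{align*}
\pi_{\exp( - \lambda_1 r)}(r) - \pi_{\exp( - \beta_1 R)}(r)
& \leq \frac{C(\beta_1, \gamma) + \log(2/\eta)}{\Delta_1}, \\
\pi_{\exp( - \beta_2 R)}(r) - \pi_{\exp( - \lambda_2 r)}(r)
& \leq \frac{C(\beta_2, \gamma) + \log(2/\eta)}{- \Delta_2},
\end{align*}
the second line being obtained by dividing by the negative factor $\Delta_2$, which reverses the
inequality. Adding the two lines gives the proposition.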

3771: Moreover, $\pi_{\exp( - \beta_1 R)}$ and $\pi_{\exp( - \beta_2 R)}$

3772: being prior distributions,

3773: with $\PP$ probability at least $1 - \eta$,

3774: \begin{multline*}

3775: \gamma \bigl[ \pi_{\exp( - \beta_1 R)}(r) - \pi_{\exp( - \beta_2 R)}(r) \bigr]

3776: \\ \leq \gamma \pi_{\exp( - \beta_1 R)} \otimes \pi_{\exp( - \beta_2 R)}

3777:  \bigl[ \Psi_{- \frac{\gamma}{N}}(R',M') \bigr] - \log( \eta).

3778: \end{multline*}

3779: Hence

3780: \begin{prop}

3781: For any positive real constants $\beta_1$, $\beta_2$ and $\gamma$,

3782: with $\PP$ probability at least $1 - \eta$,

3783: for any positive real constants $\lambda_1$ and $\lambda_2$

3784: such that $\lambda_2 < \beta_2 \frac{\gamma}{N} \sinh(\tfrac{\gamma}{N})^{-1}$

3785: and $\lambda_1 > \beta_1 \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1}$,

3786: \begin{multline*}

3787: \pi_{\exp ( - \lambda_1 r)}(r) - \pi_{\exp( - \lambda_2 r)}(r)

3788: \\ \leq \pi_{\exp( - \beta_1 R)} \otimes

3789: \pi_{\exp( - \beta_2 R)} \bigl[ \Psi_{- \frac{\gamma}{N}} (R',M')\bigr] \\

3790: + \frac{\log(\frac{3}{\eta})}{\gamma} + \frac{C(\beta_1,\gamma) + \log(\frac{3}{\eta})}{

3791: \frac{N \lambda_1}{\beta_1} \sinh(\frac{\gamma}{N})- \gamma}

3792: + \frac{C(\beta_2, \gamma) + \log (\frac{3}{\eta})}{\gamma -

3793: \frac{N \lambda_2}{\beta_2} \sinh(\frac{\gamma}{N})}.

3794: \end{multline*}

3795: \end{prop}

3796:

3797: In order to complete the analysis of the bound $B(\pi_{\exp( - \lambda_1 r)}, \beta,

3798: \gamma)$

3799: given by Theorem \ref{thm1.1.43}, it now remains to bound quantities of the

3800: general form

3801: \begin{multline*}

3802: \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[

3803: \exp \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \pi_{\exp(

3804: - \lambda r)}(m') \bigr\} \Bigr] \Bigr\} \\

3805: = \sup_{\rho \in \C{M}_+^1(\Theta)}

3806: N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho \otimes

3807: \pi_{\exp( - \lambda r)}(m') -

3808: \C{K}\bigl[\rho, \pi_{\exp( - \lambda r)}\bigr].

3809: \end{multline*}

3810:

3811: Let us consider the prior distribution $\mu \in \C{M}_+^1(\Theta \times \Theta)$

3812: on couples of parameters defined by its density

3813: $$

3814: \frac{d \mu}{d (\pi \otimes \pi)} (\theta_1, \theta_2)

3815: = C^{-1} \exp \Bigl\{

3816: - \beta R(\theta_1) - \beta R(\theta_2) + \alpha

3817: \Phi_{- \frac{\alpha}{N}} \bigl[ M'(\theta_1, \theta_2) \bigr] \Bigr\},

3818: $$

3819: where the normalizing constant $C$ is such that $\mu( \Theta \times \Theta) = 1$.

3820: Since for fixed values of the parameters $\theta$

3821: and $\theta' \in \Theta$, $m'(\theta, \theta')$, like $r(\theta)$, is a sum

3822: of independent Bernoulli random variables, we can easily

3823: adapt the proof of Theorem \ref{thm2.3} on page \pageref{thm2.3},

3824: to establish that with $\PP$ probability at least $1 - \eta$,  for any posterior distribution

3825: $\rho$ and any real constant $\lambda$,

3826: \begin{multline*}

3827: \alpha \rho \otimes \pi_{\exp( - \lambda r)}(m')

3828: \leq \alpha \rho \otimes \pi_{\exp( - \lambda r)} \bigl[ \Phi_{- \frac{\alpha}{N}}(M') \bigr]

3829: \\\shoveright{ +  \C{K}(\rho \otimes \pi_{\exp( - \lambda r)}, \mu) -

3830: \log( \eta)} \\

3831: \shoveleft{\qquad = \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr] + \C{K}\bigl[

3832: \pi_{\exp( - \lambda r)}, \pi_{\exp( - \beta R)}\bigr] }

3833: \\* + \log \Bigl\{ \pi_{\exp( - \beta R)} \otimes \pi_{\exp( - \beta

3834: R)} \Bigl[ \exp \bigl( \alpha \Phi_{-\frac{\alpha}{N}}\!\circ\!M' \bigr)

3835: \Bigr] \Bigr\} - \log(\eta).

3836: \end{multline*}

3837: Thus for any real constant $\beta$ and any positive real constants

3838: $\alpha$ and $\gamma$,

3839: with $\PP$ probability at least $1 - \eta$,  for any real constant

3840: $\lambda$,

3841: \begin{multline}

3842: \label{eq1.1.24}

3843: \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[ \exp

3844: \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] \pi_{\exp( - \lambda r)}

3845: (m') \bigr\} \Bigr] \Bigr\}

3846: \\ \leq \sup_{\rho \in \C{M}_+^1(\Theta)} \biggl(

3847: \tfrac{N}{\alpha} \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr]

3848: \Bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]

3849: + \C{K} \bigl[ \pi_{\exp( - \lambda r)}, \pi_{\exp( - \beta R)} \bigr]

3850: \\

3851: + \log \bigl\{ \pi_{\exp( - \beta R)} \otimes \pi_{\exp(- \beta R)}

3852: \bigl[ \exp ( \alpha \Phi_{- \frac{\alpha}{N}}\!\circ\!M') \bigr] \bigr\}

3853: \\ - \log( \eta) \Bigr\} - \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)}\bigr] \biggr).

3854: \end{multline}

3855:

3856: To conclude, we need some suitable upper bound for the entropy

3857: \linebreak $\C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]$. This question can

3858: be handled in the following way:

3859: using Theorem \ref{thm1.1.45} on page \pageref{thm1.1.45},

3860: we see that for any positive real constants $\gamma$ and $\beta$,

3861: with $\PP$ probability at least $1 - \eta$, for any posterior distribution

3862: $\rho$,

3863: \begin{multline*}

3864: \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]

3865: = \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]

3866: + \C{K}(\rho, \pi) - \C{K}(\pi_{\exp( - \beta R)}, \pi)

3867: \\ \shoveleft{\qquad \leq \frac{\beta}{N \sinh(\frac{\gamma}{N})} \biggl[

3868: \gamma \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr] }

3869:  \\ + \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]

3870: - \log(\eta) + C(\beta, \gamma) \biggr]\\\shoveright{+ \C{K}(\rho, \pi)

3871: - \C{K}(\pi_{\exp( - \beta R)}, \pi)\qquad}

3872: \\ \shoveleft{\qquad \leq \C{K} \bigl[ \rho, \pi_{\exp( - \frac{\beta \gamma}{N

3873: \sinh(\frac{\gamma}{N})} r)}

3874: \bigr]} \\ + \frac{\beta}{N \sinh(\frac{\gamma}{N})}

3875: \Bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]

3876: + C(\beta, \gamma) - \log(\eta) \Bigr\}.

3877: \end{multline*}

3878: In other words,

3879: \begin{thm}

3880: \mypoint

3881: For any positive real constants $\beta$ and $\gamma$ such that

3882: $\beta < N \sinh(\tfrac{\gamma}{N})$, with $\PP$ probability at least $1 - \eta$, for any posterior

3883: distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

3884: $$

3885: \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]

3886: \leq \frac{\ds \C{K} \bigl[ \rho, \pi_{\exp[ - \beta \frac{\gamma}{N}

3887: \sinh(\frac{\gamma}{N})^{-1} r]} \bigr]}{\ds 1 - \frac{\beta}{N \sinh(\frac{\gamma}{N})}}

3888: + \frac{\ds C(\beta, \gamma) - \log(\eta)}{\ds \frac{N \sinh(\frac{\gamma}{N})}{\beta}

3889: - 1}.

3890: $$

3891: \end{thm}
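
The passage from the chain of inequalities above to this theorem is a one-line computation,
which we sketch here with the shorthand
$\kappa = \frac{\beta}{N \sinh(\frac{\gamma}{N})} \in (0,1)$ (a notation used only in this remark).
Writing $K = \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]$ and
$K' = \C{K}\bigl[ \rho, \pi_{\exp( - \kappa \gamma r)} \bigr]$, the chain reads
$$
K \leq K' + \kappa \bigl[ K + C(\beta, \gamma) - \log(\eta) \bigr].
$$
When $K$ is finite, we can subtract $\kappa K$ from both sides and divide by $1 - \kappa > 0$ to get
$$
K \leq \frac{K'}{1 - \kappa} + \frac{\kappa}{1 - \kappa} \bigl[ C(\beta, \gamma) - \log(\eta) \bigr],
\qquad \text{where } \frac{\kappa}{1 - \kappa} = \frac{1}{\kappa^{-1} - 1},
$$
which is the stated bound. When $K$ is infinite, $K'$ is infinite as well, since $r$ and $R$
are bounded, and there is nothing to prove.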

3892:

3893: Choosing in equation \eqref{eq1.1.24} on page \pageref{eq1.1.24}

3894: $\ds \alpha = \frac{N \log \bigl[ \cosh(\frac{\gamma}{N})\bigr]}{1

3895: - \frac{\beta}{N \sinh(\frac{\gamma}{N})}}$ and \linebreak

3896: $\beta = \lambda \frac{N}{\gamma} \sinh(\frac{\gamma}{N})$, so that

3897: $\ds \alpha = \frac{N \log \bigl[ \cosh(\frac{\gamma}{N})\bigr]}{1 - \frac{\lambda}{\gamma}

3898: }$, we obtain with $\PP$

3899: probability at least $1 - \eta$,

3900: \begin{multline*}

3901: \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[

3902: \exp \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] \pi_{\exp( -

3903: \lambda r)}(m') \bigr\} \Bigr] \Bigr\}

3904: \\ \shoveleft{\qquad \leq \tfrac{2 \lambda}{\gamma} \bigl[

3905: C(\beta, \gamma) + \log( \tfrac{2}{\eta}) \bigr]

3906: } \\ + \Bigl( 1 - \tfrac{\lambda}{\gamma} \Bigr) \biggl[ \log \Bigl\{ \pi_{\exp( - \beta R)} \otimes \pi_{\exp( - \beta R)}

3907: \bigl[ \exp( \alpha \Phi_{-\frac{\alpha}{N}}\!\circ\!M')\bigr] \Bigr\} \\+

3908: \log( \tfrac{2}{\eta}) \biggr].

3909: \end{multline*}

3910: This proves

3911: \begin{prop}

3912: \mypoint

3913: For any positive real constants $\lambda < \gamma$,

3914: with $\PP$ probability at least $1 - \eta$,

3915: \begin{multline*}

3916: \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[

3917: \exp \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] \pi_{\exp( -

3918: \lambda r)}(m') \bigr\} \Bigr] \Bigr\} \\

3919: \shoveleft{\qquad \leq

3920: \frac{2 \lambda}{\gamma} \bigl[ C( \tfrac{N \lambda}{\gamma} \sinh(

3921: \tfrac{\gamma}{N}), \gamma)

3922: + \log ( \tfrac{2}{\eta}) \bigr]}

3923: \\\shoveleft{\qquad\qquad + \Bigl(1 - \tfrac{\lambda}{\gamma}\Bigr)

3924: \log \biggl\{ \pi_{\exp[ - \frac{N\lambda}{\gamma} \sinh(\frac{\gamma}{N}) R]

3925: }^{\otimes 2}

3926: \biggl[}\\\shoveright{

3927: \exp \biggl( \frac{N \log [ \cosh(\tfrac{\gamma}{N})]}{1 - \frac{\lambda}{\gamma}}

3928: \Phi_{- \frac{\log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda}{\gamma}}}\!\circ\!M'

3929: \biggr)

3930: \biggr] \biggr\}\qquad}\\

3931: + \Bigl( 1 - \tfrac{\lambda}{\gamma} \Bigr) \log( \tfrac{2}{\eta}).

3932: \end{multline*}

3933: \end{prop}

3934:

3935: We are now ready to analyse the bound $B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma)$ of

3936: Theorem \ref{thm1.1.43} on page \pageref{thm1.1.43}.

3937: \begin{thm}\mypoint

3938: \label{thm1.1.52}

3939: For any positive real constants $\lambda_1$, $\lambda_2$, $\beta_1$,

3940: $\beta_2$, $\beta$ and $\gamma$, such that

3941: \begin{align*}

3942: \lambda_1 & < \gamma,&

3943: \beta_1 & < \tfrac{N \lambda_1}{\gamma} \sinh(\tfrac{\gamma}{N}),\\

3944: \lambda_2 & < \gamma, & \beta_2 & > \tfrac{N \lambda_2}{\gamma} \sinh(\tfrac{\gamma}{N}),\\

3945: & & \beta & < \tfrac{N \lambda_2}{\gamma} \tanh(\tfrac{\gamma}{N}),

3946: \end{align*}

3947: with $\PP$ probability at least $1 - \eta$, the bound

3948: $B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma)$

3949: of Theorem \ref{thm1.1.43} on page \pageref{thm1.1.43} satisfies

3950: \begin{multline*}

3951: B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma) \\ \leq

3952: (\gamma - \lambda_1) \Biggl\{ \pi_{\exp( - \beta_1 R)} \otimes

3953: \pi_{\exp( - \beta_2 R)} \bigl[ \Psi_{- \frac{\gamma}{N}} (R',M') \bigr]

3954: + \frac{\log(\frac{7}{\eta})}{\gamma} \\*

3955: \shoveright{+ \frac{C(\beta_1, \gamma) + \log( \frac{7}{\eta})}{

3956: \frac{N \lambda_1}{\beta_1} \sinh(\frac{\gamma}{N}) - \gamma}

3957: + \frac{C(\beta_2, \gamma)+ \log(\frac{7}{\eta})}{\gamma -

3958: \frac{N\lambda_2}{\beta_2} \sinh( \frac{\gamma}{N})}

3959: \Biggr\}} \\*

3960: \qquad+ \frac{2 \lambda_1}{\gamma}

3961: \Bigl[ C \bigl(\tfrac{N \lambda_1}{\gamma} \sinh(\tfrac{\gamma}{N}), \gamma\bigr)

3962: + \log(\tfrac{7}{\eta}) \Bigr] \\*

3963: \shoveleft{\qquad + \left( 1 - \tfrac{\lambda_1}{\gamma} \right)

3964: \log \biggl\{ \pi_{\exp [ - \frac{N \lambda_1}{\gamma} \sinh(\frac{\gamma}{N})

3965: R]}^{\otimes 2} \biggl[}\\\shoveright{ \exp \biggl( \tfrac{N \log [ \cosh(\frac{\gamma}{N})] }{1

3966: - \frac{\lambda_1}{\gamma}} \Phi_{- \frac{\log[\cosh(\frac{\gamma}{N})]}{1

3967: - \frac{\lambda_1}{\gamma}}}\!\circ\!M'\biggr)\biggr] \biggr\} }

3968: \\* + \Bigl( 1 - \tfrac{\lambda_1}{\gamma} \Bigr)

3969: \log(\tfrac{7}{\eta}) - \log\bigl[ \nu(\{\beta\}) \nu(\{\gamma\})\epsilon

3970: \bigr]\\*

3971: \shoveleft{\qquad+ (\gamma - \lambda_1) \tfrac{\beta}{\lambda_2}

3972: F_{\gamma, \frac{\beta \gamma}{\lambda_2}}^{-1} \Biggl\{

3973: \frac{2 \lambda_2}{\gamma}

3974: \Bigl[ C \bigl( \tfrac{N \lambda_2}{\gamma} \sinh(\tfrac{\gamma}{N}), \gamma \bigr)

3975: + \log \bigl( \tfrac{7}{\eta}\bigr) \Bigr]}\\*

3976: \shoveleft{\qquad \qquad + \Bigl( 1 - \tfrac{\lambda_2}{\gamma}

3977: \Bigr)

3978: \log \biggl\{

3979: \pi_{\exp[ - \frac{N \lambda_2}{\gamma} \sinh(\frac{\gamma}{N})R]}^{\otimes 2}

3980: \biggl[}\\

3981: \exp \biggl( \frac{N\log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda_2}{\gamma}}

3982: \Phi_{- \frac{\log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda_2}{\gamma}}}\!\circ\!M'

3983: \biggr) \biggr] \biggr\} \\* + \Bigl(1 - \tfrac{\lambda_2}{\gamma} \Bigr)

3984: \log\bigl(\tfrac{7}{\eta}\bigr) - \log\bigl[\nu(\{\beta\}) \nu(\{\gamma\})\epsilon\bigr]

3985: \Biggr\},

3986: \end{multline*}

3987: where the function $C(\beta, \gamma)$ is defined by equation \eqref{eq1.1.23}

3988: on page \pageref{eq1.1.23}.

3989: \end{thm}

3990: \subsubsection{Adaptation to parametric and margin assumptions}

3991: To help understand the previous theorem, it may be useful to

3992: give linear upper-bounds to the factors appearing in the

3993: right-hand side of the previous inequality.

3994: Introducing $\T$ such that $R(\T) = \inf_{\Theta} R$

3995: (assuming that such a parameter exists) and remembering that

3996: \begin{align*}

3997: \Psi_{-a}(p,m) & \leq a^{-1} \sinh(a) p + 2 a^{-1} \sinh(\tfrac{a}{2})^2 m, & a \in \RR_+,\\

3998: \Phi_{-a}(p) & \leq a^{-1} \bigl[ \exp(a)-1 \bigr] p, & a \in \RR_+,\\

3999: \Psi_{a}(p,m) & \geq a^{-1} \sinh(a) p - 2a^{-1}\sinh(\tfrac{a}{2})^2 m, & a \in \RR_+,\\

4000: M'(\theta_1, \theta_2) & \leq M'(\theta_1, \T) + M'(\theta_2, \T), & \theta_1, \theta_2

4001: \in \Theta,\\

4002: M'(\theta_1, \T) & \leq x R'(\theta_1, \T) + \varphi(x), & x \in \RR_+, \theta_1 \in

4003: \Theta,

4004: \end{align*}

4005: (the last inequality being rather

4006: a consequence of the definition of $\varphi$ than a property of $M'$),

4007: we easily see that

4008: \begin{multline*}

4009: \pi_{\exp( - \beta_1 R)}\otimes \pi_{\exp( - \beta_2 R)}

4010: \bigl[ \Psi_{- \frac{\gamma}{N}}(R',M') \bigr]

4011: \\\shoveleft{\quad  \leq

4012: \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})

4013: \bigl[ \pi_{\exp( - \beta_1 R)}(R) - \pi_{\exp( - \beta_2 R)}(R) \bigr]}

4014: \\\shoveright{+  \tfrac{2N}{\gamma}\sinh(\tfrac{\gamma}{2N})^{2}

4015: \pi_{\exp( - \beta_1 R)} \otimes \pi_{\exp( - \beta_2 R)}

4016: (M')\qquad} \\

4017: \shoveleft{\quad\leq \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) \bigl[ \pi_{

4018: \exp( - \beta_1 R)}(R) -

4019: \pi_{\exp( - \beta_2 R)}(R) \bigr]} \\

4020: \qquad + \frac{2xN}{\gamma} \sinh(\tfrac{\gamma}{2N})^{2} \Bigl\{

4021: \pi_{\exp( - \beta_1 R)}\bigl[ R'(\cdot, \T) \bigr] +

4022: \pi_{\exp( - \beta_2 R)} \bigl[ R'(\cdot, \T) \bigr] \Bigr\}

4023: \\ + \frac{4N}{\gamma} \sinh(\tfrac{\gamma}{2N})^2 \varphi(x).

4024: \end{multline*}

4025: \begin{multline*}

4026: C(\beta, \gamma) \leq

4027: \log \biggl\{ \pi_{\exp( - \beta R)} \Bigl\{ \exp \Bigl[

4028: 2 N \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^{2} \pi_{\exp( - \beta R)}(M') \Bigr] \Bigr\}

4029: \biggr\} \\\shoveleft{\qquad\qquad\leq

4030: \log \biggl\{ \pi_{\exp( - \beta R)} \Bigl\{ \exp \Bigl[

4031: 2 N \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^{2} M'(\cdot, \T) \Bigr] \Bigr\}

4032: \biggr\}} \\\shoveright{ + 2N\sinh(\tfrac{\gamma}{2N})^{2} \pi_{\exp( - \beta R)}

4033: \bigl[ M'(\cdot, \T)\bigr]}\\

4034: \shoveleft{\qquad\qquad \leq \log \biggl\{ \pi_{\exp( - \beta R)} \Bigl\{ \exp \Bigl[

4035: 2 x N \sinh(\tfrac{\gamma}{2N})^{2} R'( \cdot, \T) \Bigr] \Bigr\} \biggr\}}

4036: \\\shoveright{+ 2 x N \sinh(\tfrac{\gamma}{2N})^{2} \pi_{\exp( - \beta R)}

4037: \bigl[ R'(\cdot, \T) \bigr] + 4 N \sinh(\tfrac{\gamma}{2N})^{2}

4038: \varphi(x)}\\

4039: \shoveleft{\qquad\qquad = \int_{\beta - 2xN\sinh(\frac{\gamma}{2N})^2}^{\beta}

4040: \pi_{\exp( - \alpha R)}\bigl[ R'(\cdot, \T) \bigr]  d \alpha}\\

4041: \shoveright{+ 2 x N \sinh(\tfrac{\gamma}{2N})^{2} \pi_{\exp( - \beta R)}

4042: \bigl[ R'(\cdot, \T) \bigr] + 4 N \sinh(\tfrac{\gamma}{2N})^{2}

4043: \varphi(x)}\\

4044:  \shoveleft{\qquad \qquad \leq 4xN\sinh(\tfrac{\gamma}{2N})^2 \pi_{\exp[ - (\beta - 2 x N

4045: \sinh(\frac{\gamma}{2N})^2)R]}\bigl[ R'(\cdot, \T) \bigr]

4046: }\\ + 4 N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x).

4047: \end{multline*}

4048:

4049: \begin{multline*}

4050: \log \Bigl\{ \pi_{\exp( - \beta R)}^{\otimes 2} \Bigl[

4051: \exp \Bigl( N \alpha \Phi_{- \alpha} \!\circ\!M' \Bigr) \Bigr] \Bigr\}

4052: \\ \leq 2 \log \Bigl\{ \pi_{\exp( - \beta R)} \Bigl[ \exp \Bigl( N

4053: \bigl[ \exp( \alpha) - 1 \bigr] M'(\cdot, \T) \Bigr) \Bigr] \Bigr\}

4054: \\ \leq 2 x N \bigl[ \exp( \alpha) - 1\bigr]

4055: \pi_{\exp[ - (\beta - x N [\exp(\alpha) - 1]) R]} \bigl[ R'(\cdot, \T) \bigr]

4056: \\* + 2 N \bigl[ \exp( \alpha) - 1 \bigr] \varphi(x).

4057: \end{multline*}

4058:

4059: Let us push the investigation further under the parametric

4060: assumption that for some positive real constant $d$

4061: \begin{equation}

4062: \label{parametric}

4063: \lim_{\beta \rightarrow + \infty} \beta \pi_{\exp( - \beta R)}\bigl[ R'( \cdot,

4064: \T) \bigr] = d.

4065: \end{equation}

4066: This assumption will for instance hold true

4067: with $d = \frac{n}{2}$ when $R : \Theta \rightarrow (0,1)$

4068: is a smooth function defined on a compact subset $\Theta$ of $\RR^n$ that

4069: reaches its minimum value on a finite number of non degenerate (i.e. with

4070: a positive definite Hessian) interior points of $\Theta$, and $\pi$

4071: is absolutely continuous with respect to the

4072: Lebesgue measure on $\Theta$ and has a smooth density.
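
The value $d = \frac{n}{2}$ can be made plausible by a purely heuristic Gaussian computation
(a sketch only: it ignores the boundary of $\Theta$, the variations of the density of $\pi$,
the higher-order terms of $R$ and the case of several minimizers, none of which is justified here;
the matrix $H$ below is a notation introduced for this remark).
Suppose that $R(\theta) = R(\T) + \frac{1}{2} (\theta - \T)^{\top} H (\theta - \T)$ with $H$
positive definite, and that $\pi$ is uniform near $\T$. For large $\beta$,
$\pi_{\exp( - \beta R)}$ is then approximately the Gaussian distribution with mean $\T$ and
covariance matrix $(\beta H)^{-1}$, so that
$$
\beta \pi_{\exp( - \beta R)} \bigl[ R'(\cdot, \T) \bigr]
\simeq \frac{\beta}{2} \, \mathrm{tr} \bigl[ H (\beta H)^{-1} \bigr] = \frac{n}{2}.
$$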

4073:

4074: Under assumption \eqref{parametric}, if we restrict to sufficiently large values of the

4075: constants $\beta$, $\beta_1$, $\beta_2$, $\lambda_1$, $\lambda_2$ and $\gamma$

4076: (the smallest of which is as a rule $\beta$, as we will see), we can

4077: use the fact that for some (small) positive constant $\delta$, and

4078: some (large) positive constant $A$,

4079: \begin{equation}

4080: \label{eq1.1.25}

4081: \frac{d}{\alpha}(1 - \delta) \leq \pi_{\exp(- \alpha R)}\bigl[ R'(\cdot, \T)

4082: \bigr] \leq

4083: \frac{d}{\alpha}(1 + \delta), \qquad \alpha \geq A.

4084: \end{equation}

4085: Under this assumption,

4086: \begin{multline*}

4087: \pi_{\exp( - \beta_1 R)} \otimes \pi_{\exp( - \beta_2 R)}

4088: \bigl[ \Psi_{- \frac{\gamma}{N}}(R', M') \bigr]

4089: \\ \leq \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})

4090: \bigl[ \tfrac{d}{\beta_1}(1 + \delta) - \tfrac{d}{\beta_2}(1 - \delta) \bigr]

4091: \qquad \qquad  \\ \shoveright{+ \tfrac{2 x N}{\gamma}

4092: \sinh(\tfrac{\gamma}{2N})^2 (1 + \delta)

4093: \bigl[ \tfrac{d}{\beta_1}

4094: + \tfrac{d}{\beta_2} \bigr] + \tfrac{4N}{\gamma} \sinh(\tfrac{\gamma}{2N})^2

4095: \varphi(x).}

4096: \\

4097: \shoveleft{C(\beta, \gamma) \leq d(1 + \delta) \log \Bigl( \tfrac{\beta}{\beta -

4098: 2xN\sinh(\frac{\gamma}{2N})^2} \Bigr)} \\

4099: \shoveright{+ 2 x N \sinh(\tfrac{\gamma}{2N})^2

4100: \tfrac{(1 + \delta)d}{\beta} + 4N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x).}\\

4101: \shoveleft{\log \Bigl\{ \pi_{\exp( - \beta R)}^{\otimes 2}

4102: \Bigl[ \exp \Bigl( N \alpha \Phi_{- \alpha}\!\circ\!M' \Bigr) \Bigr] \Bigr\}

4103: } \\ \leq 2xN\bigl[ \exp( \alpha) - 1 \bigr] \frac{d(1 + \delta)}{ \beta -

4104: x N [\exp(\alpha) - 1]} + 2 N \bigl[ \exp( \alpha) - 1 \bigr] \varphi(x).

4105: \end{multline*}

4106: Thus with $\PP$ probability at least $1 - \eta$,

4107: \begin{multline*}

4108: B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma)

4109: \leq - (\gamma - \lambda_1) \tfrac{N}{\gamma}

4110: \sinh(\tfrac{\gamma}{N}) \tfrac{d}{\beta_2}( 1

4111: - \delta)

4112: \\ \shoveleft{+

4113: (\gamma - \lambda_1) \biggl\{

4114: \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) \tfrac{(1+\delta)d}{\beta_1}

4115: }\\*\shoveright{+ \tfrac{2xN}{\gamma} \sinh(\tfrac{\gamma}{2N})^2(1+\delta) \bigl[ \tfrac{d}{\beta_1}

4116: + \tfrac{d}{\beta_2} \bigr]

4117: + \tfrac{4N}{\gamma} \sinh(\tfrac{\gamma}{2N})^2 \varphi(x)

4118: + \frac{\log(\tfrac{7}{\eta})}{\gamma}}\\

4119: + \frac{4xN\sinh(\tfrac{\gamma}{2N})^2 \tfrac{(1+\delta)d}{\beta_1 -

4120: 2xN\sinh(\frac{\gamma}{2N})^2} + 4 N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x)

4121: + \log(\frac{7}{\eta})}{\frac{N\lambda_1}{\beta_1}\sinh(\frac{\gamma}{N}) -

4122: \gamma}\\

4123: \shoveright{+ \frac{4xN\sinh(\tfrac{\gamma}{2N})^2 \tfrac{(1+\delta)d}{\beta_2 -

4124: 2xN\sinh(\frac{\gamma}{2N})^2} + 4 N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x)

4125: + \log(\frac{7}{\eta})}{\gamma - \frac{N\lambda_2}{\beta_2}\sinh(\frac{\gamma}{N})}

4126: \biggr\}}

4127: \\ \shoveleft{+

4128: \frac{2 \lambda_1}{\gamma}

4129: \biggl\{ 4xN\sinh(\tfrac{\gamma}{2N})^2 \tfrac{(1+\delta)d}{\tfrac{N\lambda_1}{\gamma}

4130: \sinh(\tfrac{\gamma}{N}) -

4131: 2xN\sinh(\frac{\gamma}{2N})^2}}\\

4132: \shoveright{ + 4 N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x)

4133: + \log(\tfrac{7}{\eta}) \biggr\}}\\

4134: \shoveleft{+ \Bigl( 1 - \frac{\lambda_1}{\gamma} \Bigr) \Biggl\{

4135: 2 d(1+\delta) \Biggl( \tfrac{\lambda_1\sinh\bigl(\tfrac{\gamma}{N}\bigr)}{x \gamma

4136: \Bigl[ \exp\Bigl(\frac{\log[\cosh(\frac{\gamma}{N})]}{1-\frac{\lambda_1}{\gamma}}

4137: \Bigr)-1

4138: \Bigr]}-1 \Biggr)^{-1}}\\\shoveright{ + 2N\Bigl[ \exp \Bigl( \tfrac{\log[\cosh(\frac{\gamma}{N})]}{1 -

4139: \frac{\lambda_1}{\gamma}} \Bigr) - 1 \Bigr] \varphi(x)

4140: \Biggr\}}\\

4141: + \Bigl(1 - \tfrac{\lambda_1}{\gamma} \Bigr)

4142: \log(\tfrac{7}{\eta}) - \log\bigl[ \nu(\{\beta\}) \nu(\{\gamma\}) \epsilon\bigr]\\

4143: \shoveleft{+ \frac{1 - \frac{\lambda_1}{\gamma}}{ \frac{N \lambda_2}{\beta \gamma}

4144: \tanh(\frac{\gamma}{N}) - 1} \Biggl\{

4145: \frac{2 \lambda_2}{\gamma}

4146: \biggl\{ 4xN\sinh(\tfrac{\gamma}{2N})^2 \tfrac{(1+\delta)d}{\tfrac{N\lambda_2}{\gamma}

4147: \sinh(\tfrac{\gamma}{N}) -

4148: 2xN\sinh(\frac{\gamma}{2N})^2}}\\

4149: \shoveright{+ 4 N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x)

4150: + \log(\tfrac{7}{\eta}) \biggr\}}\\

4151: \shoveleft{+ \Bigl( 1 - \frac{\lambda_2}{\gamma} \Bigr) \Biggl[

4152: 2 d(1+\delta) \Biggl( \tfrac{\lambda_2\sinh\bigl(\tfrac{\gamma}{N}\bigr)}{x \gamma

4153: \Bigl[ \exp\Bigl(\frac{\log[\cosh(\frac{\gamma}{N})]}{1-\frac{\lambda_2}{\gamma}}

4154: \Bigr)-1

4155: \Bigr]}-1 \Biggr)^{-1}} \\

4156: \shoveright{+ 2N\Bigl[ \exp \Bigl( \tfrac{\log[\cosh(\frac{\gamma}{N})]}{1 -

4157: \frac{\lambda_2}{\gamma}} \Bigr) - 1 \Bigr] \varphi(x)

4158: \Biggr]\qquad\quad}\\

4159: + \Bigl(1 - \tfrac{\lambda_2}{\gamma} \Bigr)

4160: \log(\tfrac{7}{\eta}) - \log\bigl[ \nu(\{\beta\}) \nu(\{\gamma\}) \epsilon\bigr]

4161: \Biggr\}.

4162: \end{multline*}

4163:

4164: Now let us choose for simplicity

4165: $\beta_2 = 2 \lambda_2 = 4 \beta$, $\beta_1 = \lambda_1 / 2 = \gamma / 4$,

4166: and let us introduce the notations

4167: \begin{align*}

4168: C_1 & = \frac{N}{\gamma}\sinh(\frac{\gamma}{N}),\\

4169: C_2 & = \frac{N}{\gamma} \tanh(\frac{\gamma}{N}),\\

4170: C_3 & = \frac{N^2}{\gamma^2}

4171: \bigl[ \exp( \frac{\gamma^2}{N^2} ) - 1 \bigr]\\

4172: \text{and }\quad

4173: C_4 & = \frac{2 N^2(1 - \frac{2 \beta}{\gamma})}{\gamma^2}

4174: \Bigl[ \exp \Bigl( \frac{\gamma^2}{2 N^2 (1 - \frac{2 \beta}{\gamma})}

4175: \Bigr) - 1 \Bigr],

4176: \end{align*}

4177: to obtain

4178: \begin{multline*}

4179: B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma) \leq

4180: - \frac{C_1 \gamma}{8 \beta} (1 - \delta)d

4181: \\ + \frac{C_1 \gamma}{2} \biggl\{

4182:  \tfrac{4(1+\delta)d}{\gamma} + x \tfrac{\gamma}{2 N}(1+\delta)

4183: \bigl[ \tfrac{4 d}{\gamma} + \tfrac{d}{4\beta} \bigr]

4184: + \tfrac{\gamma}{N} \varphi(x) \biggr\} +

4185: \tfrac{1}{2} \log\bigl(\tfrac{7}{\eta}\bigr)\\*

4186: \qquad + \frac{1}{2C_1-1}  \Bigl[(1+\delta) d \Bigl( \tfrac{N}{2xC_1\gamma} -1 \Bigr)^{-1}

4187: + C_1 \frac{\gamma^2}{2N} \varphi(x) + \tfrac{1}{2} \log(\tfrac{7}{\eta}) \Bigr]

4188: \\*\hfill \hfill \hfill + \frac{1}{2 - C_1} \biggl[ 2 (1+\delta)d \Bigl( \tfrac{8 N \beta}{x C_1 \gamma^2}

4189: - 1\Bigr)^{-1} + C_1 \frac{\gamma^2}{N} \varphi(x) + \log(\tfrac{7}{\eta}) \biggr]

4190: \hfill \\*

4191: \shoveright{+ \frac{2 x \gamma (1 + \delta) d}{N - x \gamma} + C_1 \tfrac{\gamma^2}{N} \varphi(x)

4192: + \log( \tfrac{7}{\eta})} \\*

4193: \shoveright{+ d(1+\delta)\frac{x \gamma}{N} \biggl( \frac{C_1}{2

4194: C_3 } - \frac{x \gamma}{N} \biggr)^{-1} +  \frac{\gamma^2}{N} C_3

4195: \varphi(x) + \frac{\log(\frac{7}{\eta})}{2} -

4196: \log\bigl[ \nu(\beta) \nu(\gamma) \epsilon\bigr]}\\*

4197: \shoveleft{\qquad + \Bigl( 4 C_2  - 2\Bigr)^{-1}

4198: \Biggl\{ \frac{4 \beta}{\gamma} \biggl\{

4199: x \frac{\gamma^2}{N} C_1 (1 + \delta) d \Bigl(

4200: 2 \beta C_1 - x C_1 \frac{\gamma^2}{2N} \Bigr)^{-1}} \\\shoveright{

4201: + \tfrac{\gamma^2}{N} \varphi(x)

4202: + \log(\tfrac{7}{\eta})\biggr\}\quad }

4203: \\* \shoveleft{\qquad + \Bigl(1 - \frac{2 \beta}{\gamma} \Bigr) \biggl\{

4204: 2 d (1 + \delta) \frac{x \gamma}{N}

4205: \biggl[ \frac{4   \beta C_1}{

4206: \gamma C_4}\biggl(1 - \frac{2 \beta}{\gamma}\biggr) - \frac{x \gamma}{N}

4207: \biggr]^{-1}}\\ \shoveright{

4208: + \frac{\gamma^2}{N(1 - \frac{2 \beta}{\gamma})} C_4 \varphi(x)

4209: \biggr\}\quad }

4210: \\* + \Bigl( 1 - \tfrac{2 \beta}{\gamma} \Bigr) \log(\tfrac{7}{\eta}) - \log

4211: \bigl[ \nu(\beta) \nu(\gamma) \epsilon \bigr]

4212: \Biggr\}.

4213: \end{multline*}

4214: This simplifies to

4215: \begin{multline*}

4216: B( \pi_{\exp( - \lambda_1 r)}, \beta, \gamma) \leq

4217: - \frac{C_1}{8}(1- \delta)d \frac{\gamma}{\beta}

4218: \\ + 2 C_1(1 + \delta) d + \log(\tfrac{7}{\eta})

4219: \biggl[ 2  +  \tfrac{3 C_1}{(4C_1-2)(2-C_1)}

4220: +  \frac{ 1 + \frac{2 \beta}{\gamma}}{4C_2 - 2}

4221: \biggr] \\ \hfill - \bigl( 1 + \tfrac{1}{4 C_2 - 2} \bigr)

4222: \log\bigl[ \nu(\beta) \nu( \gamma) \epsilon\bigr]\qquad

4223: \\\qquad  + \frac{(1 + \delta) d x \gamma}{N} \biggl\{

4224: C_1 + \tfrac{1}{2 C_1 - 1} \Bigl(

4225: \tfrac{1}{2C_1} - \tfrac{\gamma x}{N} \Bigr)^{-1}

4226: \hfill \\\hfill + 2 \Bigl( 1 - \tfrac{\gamma x}{N} \Bigr)^{-1}

4227: + \Bigl( \tfrac{C_1}{2 C_3} -

4228: \tfrac{\gamma x}{N} \Bigr)^{-1} + \tfrac{4C_1\beta}{\gamma(4C_2-2)}

4229: \biggr\}\qquad \\

4230: \qquad + \frac{(1 + \delta) d x \gamma^2}{N \beta} \biggl\{

4231: \tfrac{C_1}{16} + \tfrac{2}{2-C_1} \Bigl( \tfrac{8}{C_1} -

4232: \tfrac{x \gamma^2}{N \beta} \Bigr)^{-1} \hfill \\

4233: \hfill +

4234: \Bigl(1 - \tfrac{2 \beta}{\gamma} \Bigr) \tfrac{1}{2C_2 -1}

4235: \Bigl[ \tfrac{4C_1}{C_4}\Bigl(1 - \tfrac{2 \beta}{\gamma}\Bigr)

4236: - \tfrac{\gamma^2 x}{\beta N} \Bigr]^{-1}

4237: \biggr\} \qquad

4238: \\

4239: + \frac{\gamma^2}{N} \varphi(x) \biggl\{

4240: \tfrac{3 C_1}{2} + \tfrac{C_1}{4C_1 - 2} + \tfrac{C_1}{2 - C_1} + C_3

4241: + \tfrac{4 \beta}{\gamma( 4 C_2 - 2)} + \tfrac{C_4}{4 C_2 - 2}

4242: \biggr\}.

4243: \end{multline*}

4244:

4245: This shows that there exist universal positive real constants $A_1$, $A_2$, $B_1$, $B_2$, $B_3$,

4246: and $B_4$

4247: such that as soon as $\frac{\gamma \max\{x, 1\}}{N} \leq A_1 \frac{\beta}{\gamma}

4248: \leq A_2$,

4249: \begin{multline*}

4250: B( \pi_{\exp( - \lambda_1 r) }, \beta, \gamma) \leq

4251: - B_1 (1 - \delta) d \frac{\gamma}{\beta} + B_2 (1 + \delta) d \\

4252: - B_3 \log\bigl[

4253: \nu(\beta) \nu(\gamma) \epsilon\,\eta\bigr]

4254: + B_4 \frac{\gamma^2}{N} \varphi(x).

4255: \end{multline*}

4256: Thus $\pi_{\exp( - \lambda_1 r)}(R)

4257: \leq \pi_{\exp( - \beta R)}(R) \leq \inf_{\Theta} R + \frac{ (1 + \delta) d}{\beta}$

4258: as soon as moreover

4259: $$

4260: \frac{\beta}{\gamma} \leq \frac{ B_1}{

4261: B_2\frac{(1 + \delta)}{(1 - \delta)} + \frac{B_4 \frac{\gamma^2}{N} \varphi(x)

4262: - B_3 \log[\nu(\beta) \nu(\gamma) \epsilon \eta]}{(1-\delta) d}}.

4263: $$

4264:

4265: Choosing some real ratio $\alpha > 1$,

4266: we can now make the above result uniform for any

4267: \begin{equation}

4268: \label{eq1.1.27}

4269: \beta, \gamma \in

4270: \Lambda_{\alpha} \overset{\text{def}}{=}

4271: \Bigl\{ \alpha^k ; k \in \NN, 0 \leq k < \tfrac{\log(N)}{\log(\alpha)} \Bigr\},

4272: \end{equation}

4273: by substituting $\nu(\beta)$ and $\nu(\gamma)$

4274: with $\frac{\log(\alpha)}{\log(\alpha N)}$ and $- \log(\eta)$ with

4275: $ - \log( \eta) + 2 \log \left[ \frac{\log( \alpha N)}{\log(\alpha)} \right]$.
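
As a short check of the first substitution, assume that $\nu$ is the
uniform probability measure on $\Lambda_{\alpha}$ and that $N \geq 2$
(these two assumptions are made here only for this verification).
The set $\Lambda_{\alpha}$ has $\lceil \log(N) / \log(\alpha) \rceil$ elements, so that
$$
\lvert \Lambda_{\alpha} \rvert \leq \frac{\log(N)}{\log(\alpha)} + 1
= \frac{\log(\alpha N)}{\log(\alpha)}
\quad \text{and} \quad
\nu(\beta) = \lvert \Lambda_{\alpha} \rvert^{-1} \geq
\frac{\log(\alpha)}{\log(\alpha N)}, \qquad \beta \in \Lambda_{\alpha}.
$$
Consequently, for any $\beta, \gamma \in \Lambda_{\alpha}$,
$$
- \log \bigl[ \nu(\beta) \nu(\gamma) \bigr] \leq
2 \log \left[ \frac{\log(\alpha N)}{\log(\alpha)} \right].
$$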

4276:

4277: Taking moreover for simplicity $\eta = \epsilon$,

4278: let us summarize the type of result we have obtained in

4279: \begin{thm}

4280: \mypoint

4281: \label{thm1.50}

4282: There exist positive real universal constants

4283: $A$, $B_1$, $B_2$, $B_3$ and $B_4$ such that

4284: for any positive real constants $\alpha > 1$, $d$ and $\delta$, for any

4285: prior distribution $\pi \in \C{M}_+^1(\Theta)$,

4286: with

4287: $\PP$ probability at least $1 - \epsilon$,

4288: for any $\beta, \gamma

4289: \in \Lambda_{\alpha}$ (where $\Lambda_{\alpha}$ is defined by equation

4290: \eqref{eq1.1.27} above) such that

4291: $$

4292: \sup_{\beta' \in \RR, \beta' \geq \beta}

4293: \biggl\lvert \frac{\beta'}{d} \bigl[

4294: \pi_{\exp( - \beta' R)}(R) - \inf_{\Theta} R \bigr]  - 1 \biggr\rvert

4295: \leq \delta

4296: $$

4297: and such that also for some positive real parameter $x$

4298: $$

4299: \frac{\gamma \max\{x, 1\}}{N} \leq \frac{A \beta}{\gamma} \text{ and }

4300: \frac{\beta}{\gamma} \leq

4301: \frac{B_1}{B_2 \frac{(1 + \delta)}{(1 - \delta)}

4302: + \frac{ B_4 \frac{\gamma^2}{N}\varphi(x) - 2 B_3 \log(\epsilon) + 4

4303: B_3 \log \bigl[ \frac{\log(N)}{\log(\alpha)}\bigr]}{(1 - \delta) d}},

4304: $$

4305: the bound $B(\pi_{\exp( - \frac{\gamma}{2} r)}, \beta, \gamma)$

4306: given by Theorem \ref{thm1.1.43} on page \pageref{thm1.1.43}

4307: in the case where we have chosen $\nu$

4308: to be the uniform probability measure on $\Lambda_{\alpha}$,

4309: satisfies

4310: $B(\pi_{\exp( - \frac{\gamma}{2} r)}, \beta, \gamma)

4311: \leq 0,$ proving that $\w{\beta}(\pi_{\exp( - \frac{\gamma}{2} r)})

4312: \geq \beta$ and therefore that

4313: $$

4314: \pi_{\exp( - \gamma \frac{r}{2} )}(R) \leq \pi_{\exp ( - \beta R)}(R)

4315: \leq \inf_{\Theta} R + \frac{(1 + \delta) d}{\beta}.

4316: $$

4317: \end{thm}

4318: What is important in this result is that we do not only bound

4319: $\pi_{\exp( - \frac{\gamma}{2} r)}(R)$, but also

4320: $B(\pi_{\exp( - \frac{\gamma}{2} r)}, \beta, \gamma)$,

4321: and that we do it uniformly on a grid of values of $\beta$ and

4322: $\gamma$, showing that we can indeed

4323: set the constants $\beta$ and $\gamma$

4324: adaptively using the empirical bound

4325: $B( \pi_{\exp( - \frac{\gamma}{2} r)}, \beta, \gamma)$.

4326:

4327: Let us see what we get under the margin assumption \eqref{eq1.1.17Bis}

4328: (see page \pageref{eq1.1.17Bis}).

4329: When $\kappa = 1$, $\varphi(c^{-1}) \leq 0$, leading to

4330: \begin{cor}\mypoint

4331: Assuming that the margin

4332: assumption \eqref{eq1.1.17Bis} (on page \pageref{eq1.1.17Bis}) is

4333: satisfied for $\kappa = 1$, that $R : \Theta \rightarrow (0,1)$

4334: is independent of $N$ (which is the case for instance when

4335: $\PP = P^{\otimes N}$), and is such that

4336: $$

4337: \lim_{\beta' \rightarrow + \infty} \beta'

4338: \bigl[ \pi_{\exp( - \beta'

4339: R)}(R) - \inf_{\Theta} R \bigr] = d,

4340: $$

4341: there are universal positive real constants

4342: $B_5$ and $B_6$

4343: and $N_1 \in \NN$

4344: such that

4345: for any $N \geq N_1$,

4346: with $\PP$ probability at least $1 - \epsilon$

4347: $$

4348: \pi_{\exp( - \widehat{\gamma}\frac{r}{2} )}(R) \leq

4349: \inf_{\Theta} R + \frac{ B_5  d}{c N}

4350: \left[1 + \frac{B_6}{d}  \log \biggl( \frac{\log(N)}{

4351: \epsilon } \biggr) \right]^2,

4352: $$

4353: where $\w{\gamma} \in \arg\max_{\gamma \in \Lambda_2} \max \bigl\{ \beta \in \Lambda_2

4354: ; B(\pi_{\exp( - \gamma \frac{r}{2})}, \beta, \gamma) \leq 0 \bigr\}$

4355: (where $\Lambda_2$ is defined by equation \eqref{eq1.1.27} on page \pageref{eq1.1.27}).

4356: \end{cor}

4357: When $\kappa > 1$, $\varphi(x) \leq (1 - \kappa^{-1}) \bigl( \kappa c x \bigr)^{-

4358: \frac{1}{\kappa -1}}$, and we can choose $\gamma$ and $x$ such that

4359: $\frac{\gamma^2}{N} \varphi(x) \simeq d$ to prove

4360: \begin{cor}\mypoint

4361: \label{cor1.52}

4362: Assuming that the margin assumption \eqref{eq1.1.17Bis} is satisfied

4363: for some exponent $\kappa > 1$, that $R : \Theta \rightarrow (0,1)$

4364: is independent of $N$ (which is for instance the case when

4365: $\PP = P^{\otimes N}$), and is such that

4366: $$

4367: \lim_{\beta' \rightarrow + \infty} \beta'

4368: \bigl[ \pi_{\exp ( - \beta' R)}(R) - \inf_{\Theta} R \bigr] = d,

4369: $$

4370: there are universal positive constants

4371: $B_7$ and $B_8$

4372: and $N_1 \in \NN$ such that for any $N \geq N_1$, with $\PP$

4373: probability at least $1 - \epsilon$,

4374: $$

4375: \pi_{\exp( - \widehat{\gamma} \frac{r}{2} )}(R)

4376: \leq \inf_{\Theta} R +  B_7

4377: c^{ - \frac{1}{2 \kappa -1}}

4378: \biggl[ 1 + \frac{B_8}{d} \log

4379: \biggl( \frac{\log(N)}{\epsilon} \biggr)

4380: \biggr]^{\frac{2 \kappa}{2 \kappa - 1}} \left(

4381: \frac{d}{N} \right)^{ \frac{\kappa}{2 \kappa - 1}},

4382: $$

4383: where $\widehat{\gamma} \in \arg \max_{\gamma \in \Lambda_2}

4384: \max \bigl\{ \beta \in \Lambda_2; B(\pi_{\exp( - \gamma \frac{r}{2})},

4385: \beta, \gamma) \leq 0 \bigr\}$ ($\Lambda_2$ being defined by equation \eqref{eq1.1.27}

4386: on page \pageref{eq1.1.27}).

4387: \end{cor}

4388: We find the same rate of convergence as in Corollary

4389: \ref{cor1.1.23} on page \pageref{cor1.1.23}, but this

4390: time, we were able to provide an empirical posterior distribution

4391: $\pi_{\exp( - \w{\gamma} \frac{r}{2})}$

4392: which achieves this rate adaptively in all the parameters

4393: (meaning in particular that we do not need to know $d$,

4394: $c$ or $\kappa$). Moreover, as

4395: already mentioned, the power

4396: of $N$ in this rate of convergence is known to be unimprovable

4397: in the worst case (see \cite{Mammen,Tsybakov,Tsybakov2}, and

4398: more specifically \cite[Theorem 3.3 on page 132]{Audibert2}, which is

4399: downloadable from its author's web page).

4400:

4401: \subsubsection{Estimating the divergence of a posterior

4402: with respect to a Gibbs prior}

4403: Another interesting question is to estimate

4404: $\C{K} \bigl[ \rho, \pi_{\exp ( - \beta R)} \bigr]$

4405: using relative deviation inequalities.

4406: We follow here an idea to be found first

4407: in Audibert \cite[page 93]{Audibert2}.

4408: Indeed, combining equation \eqref{eq1.1.17} with

4409: equation \eqref{eq1.1.16} on page \pageref{eq1.1.16}, we see that

4410: for any positive real parameters $\beta$ and $\lambda$,

4411: with $\PP$ probability at least $1 - \epsilon$, for any

4412: posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

4413: \begin{multline*}

4414: \C{K}\bigl[\rho, \pi_{\exp( - \beta R)}\bigr]

4415: \leq \frac{\beta}{N \lambda} \biggl\{

4416: \frac{N}{2} \log\left( \frac{1 + \lambda}{1 - \lambda}\right)

4417: \bigl[ \rho(r) - \pi_{\exp(- \beta R)}(r) \bigr]

4418: \\ \hfill - \frac{N}{2} \log(1 - \lambda^2) \rho \otimes \pi_{\exp( - \beta R)}

4419: (m') \qquad \\\hfill  + \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]

4420: - \log(\epsilon) \biggr\} + \C{K}(\rho, \pi) - \C{K} \bigl[

4421: \pi_{\exp( - \beta R)}, \pi \bigr]\quad

4422: \\ \leq \C{K} \bigl[ \rho, \pi_{\exp [ - \frac{\beta}{2\lambda}

4423: \log(\frac{1+\lambda}{1-\lambda}) r]} \bigr] + \frac{\beta}{N \lambda} \C{K}\bigl[

4424: \rho, \pi_{\exp( - \beta R)}\bigr] - \frac{\beta}{N \lambda}

4425: \log(\epsilon) \\ +

4426: \log \biggl[ \pi_{\exp [ - \frac{\beta}{2\lambda}\log(\frac{1+\lambda}{1-\lambda})r]}

4427: \Bigl\{ \exp \Bigl[ - \frac{\beta}{2 \lambda} \log(1 - \lambda^2)

4428: \rho(m')\Bigr] \Bigr\} \biggr].

4429: \end{multline*}

4430: Thus, putting $\gamma = \frac{N}{2} \log( \frac{1+\lambda}{1 - \lambda})$,

4431: we obtain

4432: \begin{thm}

4433: \mypoint

4434: \label{thm1.1.37}

4435: For any positive real constants $\beta$ and $\gamma$ such

4436: that $\beta < N \tanh ( \frac{\gamma}{N})$,

4437: with $\PP$ probability at least $1 - \epsilon$, for any

4438: posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

4439: \begin{multline*}

4440: \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]

4441: \leq \left( 1 - \frac{\beta}{N}\tanh\left(\frac{\gamma}{N}\right)^{-1}\right)^{-1}

4442: \\ \times \Biggl\{ \C{K}\bigl[ \rho, \pi_{\exp [ - \frac{\beta\gamma}{N}

4443: \tanh(\frac{\gamma}{N})^{-1}r]}

4444: \bigr] - \frac{\beta}{N \tanh(\frac{\gamma}{N})} \log(\epsilon)

4445: \\ + \log \Bigl\{ \pi_{\exp[ -

4446: \frac{\beta \gamma}{N} \tanh(\frac{\gamma}{N})^{-1} r]} \Bigl[

4447: \exp \bigl\{ \beta \tanh(\tfrac{\gamma}{N})^{-1} \log[\cosh(\tfrac{\gamma}{N})]

4448: \rho(m') \bigr\} \Bigr] \Bigr\} \Biggr\}.

4449: \end{multline*}

4450: \end{thm}
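
For the reader's convenience, here is a sketch of the change of variables
behind this statement (the shorthands $k$ and $a$ are introduced here only
for this purpose). Putting $\gamma = \frac{N}{2} \log( \frac{1 + \lambda}{1 - \lambda})$
is the same as putting $\lambda = \tanh(\frac{\gamma}{N})$, and since
$1 - \tanh(u)^2 = \cosh(u)^{-2}$,
$$
\frac{\beta}{2 \lambda} \log \left( \frac{1 + \lambda}{1 - \lambda} \right)
= \frac{\beta \gamma}{N} \tanh(\tfrac{\gamma}{N})^{-1},
\qquad
- \frac{\beta}{2 \lambda} \log(1 - \lambda^2)
= \beta \tanh(\tfrac{\gamma}{N})^{-1} \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr].
$$
The chain of inequalities preceding the theorem is of the form
$k \leq \frac{\beta}{N \lambda} k + a$, where
$k = \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]$ and $a$ collects the
remaining terms. When $k$ is finite and $\beta < N \lambda = N \tanh(\frac{\gamma}{N})$,
it can be solved in $k$ to give
$$
k \leq \left( 1 - \frac{\beta}{N \lambda} \right)^{-1} a.
$$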

4451: This theorem provides another way of measuring overfitting,

4452: since it gives an upper bound for $\C{K}\bigl[

4453: \pi_{\exp[ - \frac{\beta \gamma}{N}

4454: \tanh(\frac{\gamma}{N})^{-1} r]}, \pi_{\exp( - \beta R)} \bigr]$.

4455: It may be used in combination with Theorem \ref{thm2.7}

4456: on page \pageref{thm2.7} as an alternative to Theorem

4457: \ref{thm1.1.17} on page \pageref{thm1.1.17}.

4458: It will also be used in the next section.

4459:

4460: An alternative parametrization of the same result providing a simpler

4461: right-hand side is also useful:

4462: \begin{cor}

4463: For any positive real constants $\beta$ and $\gamma$ such that $

4464: \beta < \gamma$, with $\PP$ probability at least $1 - \epsilon$, for any

4465: posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,

4466: \begin{multline*}

4467: \C{K}\bigl[ \rho, \pi_{\exp[ - N \frac{\beta}{\gamma} \tanh(\frac{\gamma}{N}) R]}

4468: \bigr] \leq \biggl(1 - \frac{\beta}{\gamma} \biggr)^{-1}

4469: \Biggl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)}\bigr] - \frac{\beta}{\gamma}

4470: \log( \epsilon) \\ +

4471: \log \Bigl\{ \pi_{\exp( - \beta r)} \Bigl[ \exp \bigl\{

4472: N \tfrac{\beta}{\gamma} \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] \rho

4473: (m') \bigr\} \Bigr] \Bigr\} \Biggr\}.

4474: \end{multline*}

4475: \end{cor}
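
One way to read this corollary off Theorem \ref{thm1.1.37} is to apply the
theorem with $\beta' = N \frac{\beta}{\gamma} \tanh(\frac{\gamma}{N})$ in place
of $\beta$ (the symbol $\beta'$ is introduced here only for this check). Indeed
$$
\frac{\beta'}{N} \tanh(\tfrac{\gamma}{N})^{-1} = \frac{\beta}{\gamma},
\qquad
\frac{\beta' \gamma}{N} \tanh(\tfrac{\gamma}{N})^{-1} = \beta,
\qquad
\beta' \tanh(\tfrac{\gamma}{N})^{-1} \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr]
= N \tfrac{\beta}{\gamma} \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr],
$$
and the condition $\beta' < N \tanh(\frac{\gamma}{N})$ is equivalent to
$\beta < \gamma$.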

4476:

4477: \subsubsection{Comparing two posterior distributions}

4478: Estimating the effective temperature of an estimator provides an efficient

4479: way to tune parameters in a model with a parametric behaviour. On the other

4480: hand, it is not suited to choosing between different models, especially

4481: when they are nested (because, as we already saw in the case

4482: when $\Theta$ is a union of nested models, the prior distribution

4483: $\pi_{\exp( - \beta R)}$ does not provide an efficient localization of the parameter

4484: in this case, in the sense that $\pi_{\exp( - \beta R)}(R)$

4485: does not decrease to $\inf_{\Theta} R$ at the desired rate when

4486: $\beta$ goes to $+ \infty$, so that one has to resort to partial localization).

4487:

4488: Once some estimator (in the form of a posterior distribution) has been

4489: chosen in each submodel, these estimators can be compared between themselves

4490: with the help of the relative bounds that we will establish in this section.

4491:

4492: From equation \eqref{eq1.1.15} (slightly modified by replacing $\pi \otimes \pi$

4493: with $\pi^1 \otimes \pi^2$), we obtain easily

4494: \begin{thm}

4495: \mypoint

4496: \label{thm1.1.38}

4497: For any positive real constant $\lambda$,

4498: for any prior distributions $\pi^1, \pi^2 \in \C{M}_+^1(\Theta)$,

4499: with $\PP$ probability at least $1 - \epsilon$,

4500: for any posterior distributions $\rho_1$ and $\rho_2 :

4501: \Omega \rightarrow \C{M}_+^1(\Theta)$,

4502: \begin{multline*}

4503: - N \log \Bigl\{ 1 - \tanh\bigl( \tfrac{\lambda}{N} \bigr)

4504: \Bigl[ \rho_2(R) - \rho_1(R) \Bigr] \Bigr\}

4505: \leq \lambda \bigl[ \rho_2(r) - \rho_1(r) \bigr]

4506: \\ + N \log \bigl[ \cosh \bigl( \tfrac{\lambda}{N} \bigr) \bigr]

4507: \rho_1 \otimes \rho_2 (m') \\ + \C{K}\bigl( \rho_1, \pi^1 \bigr)

4508: + \C{K}\bigl( \rho_2, \pi^2\bigr) - \log(\epsilon).

4509: \end{multline*}

4510: \end{thm}

4511:

4512: The entropy bound of the previous section now comes into play,

4513: providing a localized version of Theorem \ref{thm1.1.38}.

4514: We will use the notation

4515: $$

4516: \Xi_{a} (q) = \tanh(a)^{-1} \bigl[ 1 -

4517: \exp( - aq) \bigr] \leq \frac{a}{\tanh(a)}q, \qquad a, q \in \RR.

4518: $$
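
Let us indicate, as a sketch, how this notation is meant to be used (the
symbols $\Delta$ and $Q$ are shorthands introduced here only for this remark,
and we restrict to $a > 0$, which is the case $a = \frac{\lambda}{N}$ needed below).
The inequality in the definition comes from $1 - \exp(-u) \leq u$, $u \in \RR$.
Moreover, for any $\Delta \in (-1,1)$ and any $Q \in \RR$,
$$
- N \log \bigl\{ 1 - \tanh(\tfrac{\lambda}{N}) \Delta \bigr\} \leq Q
\quad \text{implies} \quad
\Delta \leq \tanh(\tfrac{\lambda}{N})^{-1} \bigl[ 1 - \exp( - \tfrac{Q}{N}) \bigr]
= \Xi_{\frac{\lambda}{N}} \bigl( \tfrac{Q}{\lambda} \bigr),
$$
which is how a bound of the type of Theorem \ref{thm1.1.38} is turned
into a bound on $\rho_2(R) - \rho_1(R)$ itself.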

4519: \begin{thm}

4520: \mypoint

4521: \label{thm1.1.39}

4522: For any sequence of prior distributions $(\pi^i)_{i \in \NN } \in

4523: \C{M}_+^1(\Theta)^{\NN}$,

4524: any probability distribution $\mu$ on $\NN$,

4525: any atomic probability distribution $\nu$ on $\RR_+$,

4526: with $\PP$ probability at least $1 - \epsilon$, for any posterior distributions

4527: $\rho_1, \rho_2 : \Omega \rightarrow \C{M}_+^1(\Theta)$,

4528: \begin{multline*}

4529: \hfill \rho_2(R) - \rho_1(R) \leq B(\rho_1, \rho_2), \text{ where} \hfill

4530: \\

4531: \shoveleft{B(\rho_1, \rho_2) = \inf_{\lambda, \beta_1 < \gamma_1, \beta_2 <

4532: \gamma_2 \in \RR_+, i, j \in \NN} \Xi_{\frac{\lambda}{N}}  \Biggl\{

4533: \bigl[ \rho_2(r) - \rho_1(r) \bigr]}\\\shoveright{ + \tfrac{N}{\lambda} \log

4534: \bigl[ \cosh(

4535: \tfrac{\lambda}{N}) \bigr] \rho_1 \otimes \rho_2(m')

4536: }\\\shoveleft{ + \frac{1}{\lambda \Bigl(1 - \frac{\beta_1}{\gamma_1}\Bigr)}

4537: \biggl\{ \C{K} \bigl[ \rho_1, \pi^i_{\exp( - \beta_1 r)}\bigr]

4538: }\\ \shoveright{+ \log \Bigl\{ \pi^i_{\exp( - \beta_1 r)} \Bigl[ \exp \bigl\{

4539: \beta_1 \tfrac{N}{\gamma_1}

4540: \log \bigl[ \cosh(\tfrac{\gamma_1}{N})\bigr] \rho_1(m') \bigr\}

4541: \Bigr] \Bigr\} \biggr\} \quad}

4542: \\ \shoveleft{+ \frac{1}{\lambda \Bigl( 1 - \frac{\beta_2}{\gamma_2} \Bigr)} \biggl\{

4543: \C{K} \bigl[ \rho_2, \pi^j_{\exp( - \beta_2 r)}\bigr]

4544: }\\ \shoveright{+ \log \Bigl\{ \pi^j_{\exp( - \beta_2 r)} \Bigl[ \exp \bigl\{ \beta_2

4545: \tfrac{N}{\gamma_2}

4546: \log \bigl[ \cosh(\tfrac{\gamma_2}{N})\bigr] \rho_2(m') \bigr\}

4547: \Bigr] \Bigr\} \biggr\}\quad }

4548: \\ \shoveleft{- \Bigl[ \bigl( \tfrac{\gamma_1}{\beta_1} - 1 \bigr)^{-1}

4549: + \bigl( \tfrac{\gamma_2}{\beta_2} - 1 \bigr)^{-1} + 1 \Bigr]

4550: }\\ \times \frac{

4551: \log\bigl[3^{-1} \nu(\beta_1) \nu(\beta_2) \nu(\gamma_1) \nu(\gamma_2)

4552: \nu(\lambda) \mu(i) \mu(j) \epsilon\bigr]}{\lambda}

4553: \Biggr\}.

4554: \end{multline*}

4555: \end{thm}

4556: The sequence of prior distributions $(\pi^i)_{i \in \NN}$

4557: should be understood

4558: to be typically supported by subsets of $\Theta$ corresponding to

4559: parametric submodels, that is submodels for which it

4560: is reasonable to expect that \\

4561: \mbox{} \hfill $\ds \lim_{\beta \rightarrow

4562: + \infty} \beta \bigl[ \pi^i_{\exp( - \beta R)}(R) -

4563: \ess \inf_{\pi^i} R \bigr]$\hfill\mbox{}\\

4564: exists and is positive and finite.

4565: As there is no reason why the bound $B(\rho_1, \rho_2)$ provided by

4566: the previous theorem should be subadditive (in the sense that

4567: $B(\rho_1, \rho_3) \leq B(\rho_1, \rho_2) + B(\rho_2, \rho_3)$),

4568: it is adequate, at least from a theoretical point of view, to

4569: consider some workable subset $\C{P} \subset \C{M}_+^1(\Theta)$

4570: of posterior distributions (for instance the distributions of

4571: the form $\pi^i_{\exp( - \beta r)}$, $i \in \NN$, $\beta \in \RR_+$;

4572: it is understood that $\C{P}$ is allowed to be a random

4573: subset of $\C{M}_+^1(\Theta)$, as in this suggested example),

4574: and to define the subadditive chained bound

4575: \newcommand{\TB}{\widetilde{B}}

4576: \begin{multline*}

4577: \TB (\rho, \rho') = \inf \Biggl\{

4578: \sum_{k=0}^{n-1} B(\rho_k, \rho_{k+1});\, n \in \NN^*,

4579: (\rho_k)_{k=0}^{n} \in \C{P}^{n+1},\\  \rho_0 = \rho,

4580: \rho_n = \rho' \Biggr\}, \quad \rho, \rho' \in \C{P}.

4581: \end{multline*}

4582: \begin{prop}\mypoint

4583: \label{prop1.1.54}

4584: With $\PP$ probability at least $1 - \epsilon$,

4585: for any posterior distributions $\rho_1, \rho_2

4586: \in \C{P}$,

4587: $

4588: \rho_2(R) - \rho_1(R) \leq \TB(\rho_1, \rho_2).

4589: $

4590: Moreover for any

4591: posterior distribution $\rho_1 \in \C{P}$,

4592: any posterior distribution $\rho_2 \in \C{P}$ such that

4593: $\TB(\rho_1, \rho_2) = \inf_{\rho_3 \in \C{P}} \TB(\rho_1, \rho_3)$

4594: is unimprovable with the help of $\TB$ in $\C{P}$

4595: in the sense that $\inf_{\rho_3 \in \C{P}}

4596: \TB(\rho_2, \rho_3) \geq 0$.

4597: \end{prop}

4598: \begin{proof} The first assertion is a direct consequence of the

4599: previous theorem, therefore only the second assertion requires a proof: for

4600: any $\rho_3 \in \C{P}$, we deduce from

4601: the optimality of $\rho_2$ and the subadditivity of $\TB$ that

4602: $

4603: \TB(\rho_1,\rho_2) \leq \TB(\rho_1, \rho_3) \leq \TB(\rho_1, \rho_2) +

4604: \TB(\rho_2, \rho_3).

4605: $

4606: \end{proof}
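
Let us spell out the two facts used in this proof. The subadditivity of
$\TB$ holds by construction: concatenating a chain from $\rho_1$ to $\rho_2$
with a chain from $\rho_2$ to $\rho_3$ gives a chain from $\rho_1$ to $\rho_3$,
so that, taking the infimum over both chains,
$$
\TB(\rho_1, \rho_3) \leq \TB(\rho_1, \rho_2) + \TB(\rho_2, \rho_3).
$$
Subtracting $\TB(\rho_1, \rho_2)$ (which we assume here to be finite) from
both ends of the inequality displayed in the proof then yields
$$
0 \leq \TB(\rho_2, \rho_3), \qquad \rho_3 \in \C{P}.
$$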

4607:

4608: This proposition provides a way to improve a posterior distribution

4609: $\rho_1 \in \C{P}$ by choosing $\rho_2 \in \arg\min_{\rho \in \C{P}}

4610: \TB(\rho_1, \rho)$ whenever $\TB(\rho_1, \rho_2) < 0$.

4611: According to Proposition \ref{prop1.1.54}, this improvement

4612: is a one-step process: the improved posterior $\rho_2$

4613: cannot be improved again using the same technique.

4614:

4615: Let us give an example of a possible starting

4616: distribution $\rho_1$ for this improvement scheme: $\rho_1$ may be chosen as

4617: the best posterior Gibbs distribution

4618: according to Proposition \ref{prop1.1.37} on page

4619: \pageref{prop1.1.37}. More precisely, we may build

4620: from the prior distributions $\pi^i$, $i \in \NN$,

4621: a global prior $\pi = \sum_{i \in \NN} \mu(i) \pi^i$.

4622: We can then define the estimator of the inverse effective

4623: temperature as in Proposition \ref{prop1.1.37}

4624: and choose $\rho_1 \in \arg \min_{\rho \in \C{P}} \w{\beta}(\rho)$,

4625: where $\C{P}$ is as suggested above the set of posterior

4626: distributions

4627: $$

4628: \C{P} = \Bigl\{ \pi^i_{\exp( - \beta r)};\, i \in \NN, \beta \in \RR_+ \Bigr\}.

4629: $$

4630: (This starting point $\rho_1$ should already be pretty good,

4631: at least in an asymptotic perspective, the only

4632: gain in the rate of convergence to be expected bearing

4633: on spurious $\log(N)$ factors).

4634:

4635: For more elaborate uses of relative bounds, we refer to

4636: the third section of the second chapter of Audibert \cite{Audibert2}, where an algorithm

4637: is proposed and analyzed, which makes it possible to use relative bounds

4638: between two posterior distributions as a stand-alone estimation

4639: tool.

4640:

4641: \subsubsection{Two step localization of relative bounds}

4642:

4643:   Let us consider again in this section

4644: the case when we want to choose adaptively between a family

4645: of parametric models. Let us thus assume that the parameter

4646: set is a disjoint union of measurable submodels, so that we can write

4647: $\Theta = \sqcup_{m \in M} \Theta_m$, where $M$ is some measurable

4648: index set. Let us choose some prior probability distribution

4649: on the index set $\mu \in \C{M}_+^1(M)$, and some regular conditional

4650: prior distribution on $(M,\Theta)$, $\pi : M \rightarrow \C{M}_+^1(\Theta)$,

4651: such that $\pi(m, \Theta_m) = 1$, $m \in M$. Let us then study some

4652: arbitrary posterior distributions $\nu : \Omega \rightarrow \C{M}_+^1(M)$

4653: and $\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$, such

4654: that $\rho(\omega, m, \Theta_m) = 1$, $\omega \in \Omega$, $m \in M$.

4655: We would like to compare $\nu \rho(R)$ with some doubly localized

4656: prior distribution $\mu_{\exp[ - \frac{\beta}{1 + \zeta_2} \pi_{

4657: \exp( - \beta R)}(R)]} \bigl[ \pi_{\exp( - \beta R)} \bigr](R)$

4658: (where $\zeta_2$ is a positive parameter to be set as needed later on).

4659: We will define to ease notations two prior distributions (one

4660: being more precisely a conditional distribution) depending on

4661: the positive real parameters $\beta$ and $\zeta_2$, putting

4662: \begin{equation}

4663: \label{eqprior}

4664: \ov{\pi} = \pi_{\exp( - \beta R)}

4665: \text{ and }\ov{\mu} = \mu_{\exp[ - \frac{\beta}{1 + \zeta_2}

4666: \ov{\pi}(R)]}.

4667: \end{equation}

4668:

4669: Similarly to Theorem \ref{thm2.2.18} on page \pageref{thm2.2.18}

4670: we can write for any positive real constants $\beta$ and $\gamma$

4671: \begin{multline*}

4672: \PP \biggl\{ (\ov{\mu}\,\ov{\pi}) \otimes (\ov{\mu}\,\ov{\pi})

4673: \biggl[ \exp \Bigl[ - N \log \bigl[  1 - \tanh(\tfrac{\gamma}{N})R' \bigr]

4674: \\ - \gamma r' - N \log \bigl[

4675: \cosh(\tfrac{\gamma}{N})\bigr] m' \Bigr] \biggr] \biggr\}

4676: \leq 1,

4677: \end{multline*}

4678: and deduce, using Lemma \ref{lemma1.3} on page \pageref{lemma1.3}

4679: \begin{multline}

4680: \label{eq1.31}

4681: \PP \biggl\{ \exp \biggl[

4682: \sup_{\nu \in \C{M}_+^1(M)} \sup_{\rho : M \rightarrow \C{M}_+^1(\Theta)}

4683: \Bigl\{ - N

4684: \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N})

4685: (\nu \rho - \ov{\mu}\,\ov{\pi}) (R) \bigr]\\* - \gamma (\nu \rho - \ov{\mu}

4686: \,\ov{\pi})(r)

4687: - N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] (\nu \rho) \otimes

4688: (\ov{\mu}\,\ov{\pi}) (m') \\* - \C{K}(\nu, \ov{\mu}) - \nu

4689: \bigl[ \C{K}(\rho, \ov{\pi}) \bigr] \Bigr\} \biggr] \biggr\} \leq 1.

4690: \end{multline}

4691: This will be our starting point in comparing

4692: $\nu \rho(R)$ with $\ov{\mu}\,\ov{\pi}(R)$.

4693: However, obtaining an empirical bound will require some additional effort.

4694: For each $m \in M$, we can write

4695: in the same way

4696: $$

4697: \PP \biggl\{ \ov{\pi} \otimes \ov{\pi}

4698: \biggl[ \exp \Bigl[ - N \log \bigl[  1 - \tanh(\tfrac{\gamma}{N})R' \bigr]

4699: - \gamma r' - N \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] m' \Bigr] \biggr] \biggr\}

4700: \leq 1.

4701: $$

4702: Integrating this inequality with respect to $\ov{\mu}$ and using Fubini's lemma

4703: for positive functions, we get

4704: $$

4705: \PP \biggl\{ \ov{\mu}(\ov{\pi} \otimes \ov{\pi})

4706: \biggl[ \exp \Bigl[ - N \log \bigl[  1 - \tanh(\tfrac{\gamma}{N})R' \bigr]

4707: - \gamma r' - N \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] m' \Bigr] \biggr] \biggr\}

4708: \leq 1.

4709: $$

4710: Let us make clear that $\ov{\mu}(\ov{\pi} \otimes \ov{\pi})$ is a probability

4711: measure on $M \times \Theta \times \Theta$, whereas $(\ov{\mu}\,\ov{\pi})

4712: \otimes (\ov{\mu}\,\ov{\pi})$ considered previously is a probability measure

4713: on \linebreak $(M\times \Theta) \times (M \times \Theta)$.

4714: We get as previously

4715: \begin{multline}

4716: \label{eq1.31bis}

4717: \PP \biggl\{ \exp \biggl[

4718: \sup_{\nu \in \C{M}_+^1(M)}

4719: \sup_{\rho : M \rightarrow \C{M}_+^1(\Theta)} \Bigl\{

4720: - N

4721: \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N})

4722: \nu (\rho - \ov{\pi}) (R) \bigr]

4723: \\ - \gamma \nu (\rho - \ov{\pi})(r) - N \log

4724: \bigl[\cosh(\tfrac{\gamma}{N})\bigr]

4725: \nu ( \rho \otimes \ov{\pi} ) (m') \\ - \C{K}(\nu, \ov{\mu})

4726: - \nu \bigl[ \C{K}(\rho, \ov{\pi}) \bigr]

4727: \Bigr\} \biggr] \biggr\} \leq 1.

4728: \end{multline}

4729: Let us finally recall that

4730: \begin{align}

4731: \C{K}(\nu, \ov{\mu}) & = \tfrac{\beta}{1 + \zeta_2} (\nu - \ov{\mu})\ov{\pi}(R) + \C{K}(\nu, \mu)

4732: - \C{K}(\ov{\mu}, \mu),\\

4733: \label{eq1.31ter}

4734: \C{K}(\rho, \ov{\pi}) & = \beta (\rho - \ov{\pi})(R) + \C{K}(\rho, \pi)

4735: - \C{K}(\ov{\pi}, \pi).

4736: \end{align}
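
These two identities can be checked directly from the definition of the
Gibbs measures; here is a sketch for the second one, assuming that
$\C{K}(\rho, \pi)$ is finite. Since
$\frac{d \ov{\pi}}{d \pi} = \exp( - \beta R) \big/ \pi \bigl[ \exp( - \beta R) \bigr]$,
\begin{align*}
\C{K}(\rho, \ov{\pi}) & = \C{K}(\rho, \pi) + \beta \rho(R)
+ \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\}, \\
\C{K}(\ov{\pi}, \pi) & = - \beta \ov{\pi}(R)
- \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\},
\end{align*}
and adding these two lines gives
$\C{K}(\rho, \ov{\pi}) + \C{K}(\ov{\pi}, \pi) = \beta (\rho - \ov{\pi})(R)
+ \C{K}(\rho, \pi)$.
The first identity is obtained in the same way, with $\mu$,
$\frac{\beta}{1 + \zeta_2}$ and $m \mapsto \ov{\pi}(R)$ playing the roles
of $\pi$, $\beta$ and $R$.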

4737: From equations \eqref{eq1.31}, \eqref{eq1.31bis} and \eqref{eq1.31ter} we deduce

4738: \begin{prop}\mypoint

4739: \label{prop1.58}

4740: For any positive real constants $\beta$, $\gamma$ and $\zeta_2$,

4741: with $\PP$ probability at least $1 - \epsilon$, for any posterior

4742: distribution $\nu : \Omega \rightarrow \C{M}_+^1(M)$ and any conditional posterior

4743: distribution $\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,

4744: \begin{multline*}

4745: - N \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N})(\nu \rho - \ov{\mu}\,\ov{\pi})(R)

4746: \bigr] - \beta \nu(\rho - \ov{\pi})(R) \\ \leq \gamma (\nu \rho - \ov{\mu}\,\ov{\pi}) (r)

4747: + N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] (\nu \rho) \otimes

4748: (\ov{\mu}\,\ov{\pi}) (m') \\ + \C{K}(\nu, \ov{\mu}) + \nu \bigl[ \C{K}(\rho, \pi) \bigr]

4749: - \nu \bigl[ \C{K}( \ov{\pi}, \pi) \bigr] + \log \bigl( \tfrac{2}{\epsilon} \bigr),

4750: \end{multline*}

4751: and

4752: \begin{multline*}

4753: - N \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N}) \nu(\rho - \ov{\pi})(R) \bigr]

4754: \\\leq \gamma \nu(\rho - \ov{\pi})(r)

4755: + N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr]

4756: \nu( \rho\otimes \ov{\pi})(m') \\ + \C{K}(\nu, \ov{\mu}) + \nu\bigl[ \C{K}(\rho,

4757: \ov{\pi}) \bigr] +

4758: \log\bigl(\tfrac{2}{\epsilon}\bigr),

4759: \end{multline*}

4760: where the prior distribution $\ov{\mu}\,\ov{\pi}$ is defined by equation

4761: \eqref{eqprior} on page \pageref{eqprior} and depends on $\beta$ and $\zeta_2$.

4762: \end{prop}

4763: Let us put for short

4764: $$

4765: T = \tanh(\tfrac{\gamma}{N}) \text{ and } C = N \log \bigl[ \cosh(\tfrac{\gamma}{N})

4766: \bigr].

4767: $$
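
To fix the orders of magnitude, it may help to keep in mind the following
elementary bounds (added here as a side remark), which come from
$\tanh(u) \leq u$ and $\log \bigl[ \cosh(u) \bigr] \leq \frac{u^2}{2}$, $u \in \RR_+$:
$$
N T \leq \gamma \quad \text{and} \quad C \leq \frac{\gamma^2}{2 N}.
$$
In particular, a condition of the form $\beta < N T$ implies that $\beta < \gamma$.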

4768:

4769: \newcommand{\omu}{\ov{\mu}}

4770: \newcommand{\opi}{\ov{\pi}}

4771: We will use some entropy compensation strategy for which we need a couple

4772: of entropy bounds. Let us assume that $\beta < NT$.

4773: We have according to Proposition \ref{prop1.58},

4774: with $\PP$ probability at least $1 - \epsilon$,

4775: \begin{multline*}

4776: \nu \bigl[ \C{K}(\rho, \opi) \bigr]

4777: = \beta \nu(\rho - \opi)(R) + \nu \bigl[ \C{K}(\rho, \pi) -

4778: \C{K}(\opi, \pi) \bigr] \\\shoveleft{\qquad

4779: \leq \frac{\beta}{NT} \biggl[ \gamma \nu(\rho - \opi) (r)

4780: + C \nu(\rho \otimes \opi)(m')} \\ + \C{K}(\nu, \omu)

4781: +  \nu \bigl[ \C{K}( \rho, \opi) \bigr]

4782: + \log( \tfrac{2}{\epsilon} ) \biggr] \\ + \nu \bigl[ \C{K}(\rho, \pi)

4783: - \C{K}(\opi, \pi) \bigr].

4784: \end{multline*}

4785: Similarly

4786: \begin{multline*}

4787: \C{K}(\nu, \omu) = \frac{\beta}{1 + \zeta_2} (\nu - \omu) \opi(R)

4788: + \C{K}(\nu, \mu) - \C{K}(\omu, \mu) \\

4789: \leq \frac{\beta}{(1 + \zeta_2) NT} \biggl[

4790: \gamma (\nu - \omu) \opi(r) + C (\nu \opi) \otimes ( \omu\,\opi) (m')

4791: \\ + \C{K}(\nu, \omu) + \log (\tfrac{2}{\epsilon}) \biggr]

4792: + \C{K}(\nu, \mu) - \C{K}(\omu, \mu).

4793: \end{multline*}

4794: Thus, for any positive real constants $\beta$, $\gamma$ and $\zeta_i$,

4795: $i = 1, \dots, 5$, with $\PP$ probability at least $1 - \epsilon$,

4796: for any posterior distributions $\nu, \nu_3

4797: : \Omega \rightarrow \C{M}_+^1(M)$, any posterior conditional distributions

4798: $\rho, \rho_1, \rho_2, \rho_4, \rho_5

4799: : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,

4800: \begin{multline*}

4801: - N \log \bigl[ 1 - T (\nu \rho - \omu\,\opi)(R) \bigr]

4802: - \beta \nu (\rho - \opi)(R) \\ \leq

4803: \gamma (\nu \rho - \omu\,\opi)(r) + C (\nu \rho) \otimes (\omu\,\opi)(m')

4804: \\

4805: \hfill + \C{K}(\nu, \omu) + \nu \bigl[ \C{K}(\rho, \pi)

4806: - \C{K}(\opi, \pi) \bigr] + \log(\tfrac{2}{\epsilon}),

4807: \quad\\\quad

4808: \zeta_1 \frac{NT}{\beta} \omu \bigl[ \C{K}(\rho_1, \opi) \bigr]

4809: \leq \zeta_1 \gamma \omu(\rho_1 - \opi)(r) + \zeta_1 C \omu(\rho_1 \otimes \opi)(m')

4810: \hfill \\ \hfill + \zeta_1 \omu \bigl[ \C{K}(\rho_1, \opi) \bigr] +

4811: \zeta_1 \log( \tfrac{2}{\epsilon})

4812: + \zeta_1 \frac{NT}{\beta} \omu \bigl[ \C{K}(\rho_1, \pi)

4813: - \C{K}(\opi, \pi) \bigr],\quad\\\quad

4814: \zeta_2 \frac{NT}{\beta} \nu \bigl[ \C{K}(\rho_2, \opi) \bigr]

4815: \leq \zeta_2 \gamma \nu(\rho_2- \opi)(r) + \zeta_2 C \nu(

4816: \rho_2 \otimes \opi)(m') \hfill \\

4817: + \zeta_2 \C{K}(\nu, \omu) + \zeta_2 \nu \bigl[ \C{K}(\rho_2, \opi) \bigr]

4818: + \zeta_2 \log( \tfrac{2}{\epsilon}) \\ \hfill

4819: + \zeta_2 \frac{NT}{\beta} \nu \bigl[ \C{K}(\rho_2, \pi) - \C{K}(\opi, \pi)

4820: \bigr],\quad\\\quad

4821: \zeta_3 (1 + \zeta_2)\frac{ N T}{\beta} \C{K}(\nu_3, \omu)

4822: \leq \zeta_3 \gamma( \nu_3 - \omu) \opi(r)

4823: \hfill \\ +

4824: \zeta_3 C \bigl[ (\nu_3 \opi) \otimes (\nu_3 \rho_1) + (\nu_3 \rho_1)

4825: \otimes ( \omu \, \opi) \bigr] (m')

4826: + \zeta_3 \C{K}(\nu_3, \omu) + \zeta_3 \log(\tfrac{2}{\epsilon})

4827: \\ \hfill + \zeta_3 (1 + \zeta_2)\frac{NT}{ \beta}

4828:  \bigl[ \C{K}(\nu_3, \mu) - \C{K}(\ov{\mu}, \mu) \bigr],\quad\\\quad

4829: \zeta_4 \frac{NT}{\beta} \nu_3 \bigl[ \C{K}(\rho_4, \opi) \bigr]

4830: \leq \zeta_4 \gamma \nu_3(\rho_4 - \opi)(r) \hfill \\

4831: + \zeta_4 C \nu_3(\rho_4 \otimes \opi)

4832: (m') + \zeta_4 \C{K}(\nu_3, \omu) + \zeta_4 \nu_3 \bigl[ \C{K}(\rho_4, \opi) \bigr]

4833: + \zeta_4 \log( \tfrac{2}{\epsilon}) \\

4834: \hfill + \zeta_4 \frac{NT}{\beta} \nu_3 \bigl[ \C{K}(\rho_4,

4835: \pi) - \C{K}( \opi, \pi) \bigr],

4836: \quad\\\quad

4837: \zeta_5 \frac{NT}{\beta} \omu \bigl[ \C{K}(\rho_5, \opi) \bigr]

4838: \leq \zeta_5 \gamma \omu(\rho_5 - \opi)(r) + \zeta_5 C \omu(\rho_5 \otimes \opi)(m')

4839: \hfill \\ \hfill + \zeta_5 \omu \bigl[ \C{K}(\rho_5, \opi) \bigr] +

4840: \zeta_5 \log( \tfrac{2}{\epsilon})

4841: + \zeta_5 \frac{NT}{\beta} \omu \bigl[ \C{K}(\rho_5, \pi)

4842: - \C{K}(\opi, \pi) \bigr].

4843: \end{multline*}

4844: Adding these six inequalities and assuming that $\zeta_4 \leq \zeta_3 \bigl[

4845: ( 1 + \zeta_2) \tfrac{NT}{\beta} - 1 \bigr]$, we find

4846: \begin{multline*}

4847: - N \log \bigl[ 1 - T (\nu \rho - \omu\,\opi)(R) \bigr]

4848: - \beta (\nu \rho - \omu \, \opi)(R) \\\qquad \leq

4849: - N \log \bigl[ 1 - T (\nu \rho - \omu\,\opi)(R) \bigr]

4850: - \beta (\nu \rho - \omu \, \opi)(R)\hfill\\+

4851: \zeta_1 \bigl( \tfrac{NT}{\beta} - 1\bigr)

4852: \omu \bigl[ \C{K}(\rho_1, \opi)\bigr]

4853: + \zeta_2 \bigl( \tfrac{NT}{\beta} - 1 \bigr)

4854: \nu \bigl[ \C{K}(\rho_2, \opi) \bigr] \\ +

4855: \bigl[ \zeta_3(1 + \zeta_2) \tfrac{NT}{\beta} - \zeta_3

4856: - \zeta_4 \bigr] \C{K}(\nu_3, \omu)\\\hfill

4857: + \zeta_4 \bigl( \tfrac{NT}{\beta} - 1 \bigr)

4858: \nu_3 \bigl[ \C{K}(\rho_4, \opi) \bigr] +

4859: \zeta_5 \bigl( \tfrac{NT}{\beta} - 1 \bigr)

4860: \omu \bigl[ \C{K}(\rho_5, \opi) \bigr] \quad\\\qquad

4861: \leq \gamma (\nu \rho - \omu\,\opi)(r)

4862: + \zeta_1 \gamma \omu(\rho_1 - \opi) (r) +

4863: \zeta_2 \gamma \nu(\rho_2 - \opi) (r)

4864: \hfill \\ + \zeta_3 \gamma(\nu_3 - \omu) \opi(r) +

4865: \zeta_4 \gamma \nu_3(\rho_4 - \opi)(r) + \zeta_5 \gamma \omu(\rho_5 - \opi)

4866: (r) \qquad\\ \hfill

4867: + C \bigl[ (\nu \rho) \otimes (\omu\,\opi)+ \zeta_1

4868: \omu(\rho_1 \otimes \opi) + \zeta_2 \nu( \rho_2 \otimes \opi)\qquad\\

4869: \quad + \zeta_3 (\nu_3 \opi) \otimes (\nu_3 \rho_1) +

4870: \zeta_3 (\nu_3 \rho_1) \otimes ( \omu \, \opi)\hfill \\

4871: \hfill + \zeta_4

4872: \nu_3 ( \rho_4 \otimes \opi) + \zeta_5 \omu(\rho_5\otimes \opi) \bigr] (m')\qquad\\

4873: \quad + (1 + \zeta_2) \bigl[\C{K}(\nu, \mu) - \C{K}(\omu, \mu)\bigr]

4874: + \nu \bigl[ \C{K}(\rho, \pi) - \C{K}(\opi, \pi) \bigr]\hfill\\

4875: \hfill + \zeta_1 \tfrac{NT}{\beta} \omu \bigl[ \C{K}(\rho_1, \pi)

4876: - \C{K}(\opi, \pi) \bigr] + \zeta_2 \tfrac{NT}{\beta}

4877: \nu \bigl[ \C{K}(\rho_2, \pi) - \C{K}(\opi, \pi) \bigr] \qquad

4878: \\\quad + \zeta_3 (1 + \zeta_2) \tfrac{NT}{\beta} \bigl[ \C{K}(\nu_3, \mu)

4879: - \C{K}(\omu, \mu) \bigr]

4880: + \zeta_4 \tfrac{NT}{\beta} \nu_3 \bigl[ \C{K}( \rho_4, \pi)

4881: - \C{K}(\opi, \pi) \bigr] \hfill \\

4882: + \zeta_5 \tfrac{NT}{\beta} \omu \bigl[

4883: \C{K}(\rho_5, \pi) - \C{K}(\opi, \pi) \bigr]

4884: + (1 + \zeta_1 + \zeta_2 + \zeta_3 + \zeta_4 + \zeta_5 ) \log( \tfrac{2}{\epsilon}).

4885: \end{multline*}

4886: Let us now apply to $\opi$ (we shall later do the same with $\omu$)

4887: the following inequalities, holding for any random

4888: functions of the sample and the parameters $h : \Omega \times \Theta \rightarrow

4889: \RR$ and $g : \Omega \times \Theta \rightarrow \RR$,

4890: \begin{multline*}

4891: \opi(g-h) - \C{K}(\opi, \pi) \leq

4892: \sup_{\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)} \rho( g - h) - \C{K}(\rho, \pi) \\

4893: \shoveleft{\qquad = \log \bigl\{ \pi \bigl[ \exp (g - h)  \bigr] \bigr\}} \\

4894: \shoveleft{\qquad \qquad =

4895: \log \bigl\{ \pi \bigl[ \exp ( - h ) \bigr] \bigr\}

4896: + \log \bigl\{ \pi_{\exp( - h)} \bigl[ \exp (g) \bigr] \bigr\}}

4897: \\ = - \pi_{\exp( - h)}(h) - \C{K}(\pi_{\exp( - h)}, \pi)

4898: + \log \bigl\{ \pi_{\exp( - h)} \bigl[ \exp (g) \bigr] \bigr\}.

4899: \end{multline*}
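For the reader's convenience, here is a brief sketch of why the supremum above
is a log-Laplace transform. It only assumes that $f = g - h$ satisfies
$\pi \bigl[ \exp(f) \bigr] < \infty$, the shorthand $f$ being introduced here for
illustration. For any $\rho$ such that $\C{K}(\rho, \pi) < \infty$,
$$
\rho(f) - \C{K}(\rho, \pi) = \log \bigl\{ \pi \bigl[ \exp(f) \bigr] \bigr\}
- \C{K}\bigl[ \rho, \pi_{\exp(f)} \bigr],
$$
so that the supremum is reached at $\rho = \pi_{\exp(f)}$, the divergence on
the right-hand side being non-negative and vanishing in this case only.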

4900: When $h$ and $g$ are observable, and $h$ is not too far from

4901: $\beta r \simeq \beta R$, this gives a way to replace $\opi$ with

4902: some satisfactory empirical approximation.

4903: We will apply this method, choosing $\rho_1$ and $\rho_5$ such that

4904: $\omu\,\opi$ is replaced either with $\omu \rho_1$,

4905: when it comes from the first two inequalities or

4906: with $\omu \rho_5$ otherwise,

4907: choosing $\rho_2$ such that $\nu \opi$ is replaced with $\nu \rho_2$

4908: and $\rho_4$ such that $\nu_3 \opi$ is replaced with $\nu_3 \rho_4$. We will do

4909: so because it leads to a lot of helpful cancellations.

4910: For those to happen, we need to choose $\rho_i = \pi_{\exp( - \lambda_i r)}$,

4911: $i=1,2,4,5$, where $\lambda_1$, $\lambda_2$, $\lambda_4$ and $\lambda_5$ are such that

4912: \begin{align*}

4913: (1 + \zeta_1) \gamma & = \zeta_1 \tfrac{NT}{\beta} \lambda_1,\\

4914: \zeta_2 \gamma & = \bigl(1 + \zeta_2 \tfrac{NT}{\beta} \bigr) \lambda_2,\\

4915: (\zeta_4 - \zeta_3) \gamma & = \zeta_4 \tfrac{NT}{\beta} \lambda_4,\\

4916: \zeta_3 \gamma & = \zeta_5 \tfrac{NT}{\beta} \lambda_5,

4917: \end{align*}

4918: and to assume that

4919: $\zeta_4 > \zeta_3$.

4920: We obtain that with $\PP$ probability at least $1 - \epsilon$,

4921: \begin{multline*}

4922: - N \log \bigl[ 1 - T(\nu \rho - \omu\,\opi)(R) \bigr]

4923: - \beta (\nu \rho - \omu\,\opi)(R)\\

4924: \leq \gamma(\nu \rho - \omu\,\rho_1)(r) +

4925: \zeta_3 \gamma(\nu_3 \rho_4 - \omu \rho_5)(r)

4926: \\

4927: + \zeta_1 \tfrac{NT}{\beta} \omu \Biggl\{

4928: \log \Biggl[ \rho_1 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_1}

4929: \bigl[ \nu \rho + \zeta_1 \rho_1 \bigr](m') \biggr]

4930: \biggr\} \Biggr] \Biggr\}\\

4931: + \bigl( 1 + \zeta_2 \tfrac{NT}{\beta}\bigr) \nu \Biggl\{

4932: \log \Biggl[ \rho_2 \biggl\{ \exp \biggl[ \tfrac{C}{1 + \zeta_2

4933: \frac{NT}{\beta}} \zeta_2 \rho_2 (m') \biggr] \biggr\} \Biggr] \Biggr\}\\

4934: + \zeta_4 \tfrac{NT}{\beta} \nu_3 \Biggl\{ \log \Biggl[

4935: \rho_4 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_4}

4936: \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_4

4937: \rho_4 \bigr] (m') \biggr] \biggr\} \Biggr] \Biggr\}\\

4938: + \zeta_5 \tfrac{NT}{\beta} \omu \Biggl\{

4939: \log \Biggl[ \rho_5 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_5}

4940: \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_5 \rho_5 \bigr] (m') \biggr]

4941: \biggr\} \Biggr] \Biggr\}\\

4942: + (1 + \zeta_2) \bigl[ \C{K}(\nu, \mu) - \C{K}(\omu, \mu) \bigr]

4943: + \nu \bigl[ \C{K}(\rho, \pi) - \C{K}(\rho_2, \pi) \bigr]

4944: \\ + \zeta_3(1 + \zeta_2) \tfrac{NT}{\beta} \bigl[

4945: \C{K}(\nu_3, \mu) - \C{K}(\omu, \mu) \bigr] \\

4946: +

4947: \biggl(1 + \sum_{i=1}^5 \zeta_i\biggr) \log \bigl( \tfrac{2}{\epsilon} \bigr).

4948: \end{multline*}

4949: In order to obtain more cancellations while replacing $\omu$ by

4950: some posterior distribution, we will choose the constants such that

4951: $\lambda_5 = \lambda_4$, which can be done by choosing

4952: $$

4953: \zeta_5 = \frac{\zeta_3 \zeta_4}{\zeta_4 - \zeta_3}.

4954: $$
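Indeed, a short check from the relations defining $\lambda_4$ and $\lambda_5$
above shows that
\begin{align*}
\lambda_4 & = \frac{(\zeta_4 - \zeta_3) \gamma \beta}{\zeta_4 N T}, &
\lambda_5 & = \frac{\zeta_3 \gamma \beta}{\zeta_5 N T},
\end{align*}
so that $\lambda_5 = \lambda_4$ if and only if
$\zeta_3 \zeta_4 = \zeta_5 (\zeta_4 - \zeta_3)$, which is the stated value
of $\zeta_5$ (positive since $\zeta_4 > \zeta_3$).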

4955: We can now replace $\omu$ with

4956: $\mu_{\exp( - \xi_1 \rho_1(r) - \xi_4 \rho_4(r))}$,

4957: where

4958: \begin{align*}

4959: \xi_1 & = \frac{\gamma}{(1 + \zeta_2)\bigl(1 + \tfrac{NT}{\beta} \zeta_3 \bigr)},\\

4960: \xi_4 & = \frac{\gamma\zeta_3}{(1 + \zeta_2)\bigl(1 + \tfrac{NT}{\beta} \zeta_3 \bigr)}.

4961: \end{align*}

4962: Choosing moreover $\nu_3 = \mu_{\exp( - \xi_1 \rho_1(r) - \xi_4 \rho_4(r))}$,

4963: to induce some more cancellations,

4964: we get

4965: \begin{thm}\mypoint

4966: \label{thm1.59}

4967: For any positive real constants satisfying the above mentioned constraints,

4968: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution

4969: $\nu : \Omega \rightarrow \C{M}_+^1(M)$ and any conditional posterior

4970: distribution $\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,

4971: \begin{multline*}

4972: - N \log \bigl[ 1 - T(\nu \rho - \omu\,\opi)(R) \bigr]

4973: - \beta (\nu \rho - \omu\,\opi)(R) \leq B(\nu, \rho, \beta),\\

4974: \shoveleft{\text{where }

4975: B(\nu, \rho, \beta) \overset{\text{\rm def}}{=} \gamma ( \nu \rho -

4976: \nu_3 \rho_1)(r)} \\*

4977: \shoveleft{\qquad + (1 + \zeta_2) \bigl( 1 + \tfrac{NT}{\beta} \zeta_3 \bigr) }

4978: \\ \times

4979: \log \Biggl\{ \nu_3 \Biggl[ \rho_1 \biggl\{

4980: \exp \biggl[ C \tfrac{\beta}{NT \zeta_1} \bigl[ \nu \rho

4981: + \zeta_1 \rho_1 \bigr] (m') \biggr] \biggr\}^{\frac{\zeta_1 N T}{\beta

4982: (1 + \zeta_2)(1 + \frac{NT}{\beta}\zeta_3)}} \\

4983: \shoveright{\times \rho_4 \biggl\{ \exp \biggl[

4984: C \tfrac{\beta}{NT \zeta_5} \bigl[

4985: \zeta_3 \nu_3 \rho_1 + \zeta_5 \rho_4 \bigr] (m')

4986: \biggr] \biggr\}^{\frac{\zeta_5 N T}{\beta(1 + \zeta_2)(1 + \frac{NT}{\beta}

4987: \zeta_3)}} \Biggr] \Biggr\}}\\

4988: + \bigl( 1 + \zeta_2 \tfrac{NT}{\beta}\bigr) \nu \Biggl\{

4989: \log \Biggl[ \rho_2 \biggl\{ \exp \biggl[ \tfrac{C}{1 + \zeta_2

4990: \frac{NT}{\beta}} \zeta_2 \rho_2 (m') \biggr] \biggr\} \Biggr] \Biggr\}\\

4991: + \zeta_4 \tfrac{NT}{\beta} \nu_3 \Biggl\{ \log \Biggl[

4992: \rho_4 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_4}

4993: \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_4

4994: \rho_4 \bigr] (m') \biggr] \biggr\} \Biggr] \Biggr\}\\

4995: \shoveleft{\qquad + (1 + \zeta_2) \bigl[ \C{K}(\nu, \mu) - \C{K}(\nu_3, \mu) \bigr]

4996: } \\ + \nu \bigl[ \C{K}(\rho, \pi) - \C{K}(\rho_2, \pi) \bigr]

4997: + \biggl( 1 + \sum_{i=1}^5 \zeta_i \biggr)

4998: \log \bigl( \tfrac{2}{\epsilon} \bigr).

4999: \end{multline*}

5000: \end{thm}

5001:

5002: This theorem can be used to find the largest value $\w{\beta}(\nu \rho)$ of

5003: $\beta$ such that

5004: $ B( \nu, \rho, \beta) \leq 0$, thus providing an estimator for

5005: $\beta(\nu \rho)$ defined as $\nu \rho(R) = \ov{\mu}_{\beta(\nu \rho)}

5006: \ov{\pi}_{\beta(\nu \rho)}(R)$, where we have mentioned explicitly

5007: the dependence of $\ov{\mu}$ and $\ov{\pi}$ on $\beta$, the constant

5008: $\zeta_2$ staying fixed. The posterior distribution $\nu \rho$ may

5009: then be chosen to maximize $\w{\beta}(\nu \rho)$ within some manageable

5010: subset of posterior distributions $\C{P}$, thus gaining the assurance

5011: that $\nu \rho(R) \leq \ov{\mu}_{\w{\beta}(\nu \rho)}\ov{\pi}_{\w{\beta}(\nu \rho)}

5012: (R)$, with the largest parameter $\w{\beta}(\nu \rho)$ that this

5013: approach can provide. Maximizing $\w{\beta}(\nu \rho)$ is supported by the

5014: fact that $\lim_{\beta \rightarrow + \infty} \ov{\mu}_{\beta}\ov{\pi}_{\beta}(R)

5015: = \ess \inf_{\mu \pi} R$. However, there is no assurance (to our knowledge) that

5016: $\beta \mapsto \ov{\mu}_{\beta} \ov{\pi}_{\beta}(R)$ will be a decreasing

5017: function of $\beta$ all the way, although this may be expected to be the case

5018: in many practical situations.

5019:

5020: We can make the bound more explicit in several ways. One point

5021: of view is to put forward the optimal values of $\rho$ and $\nu$.

5022: We can thus remark that

5023: \begin{multline*}

5024: \nu \bigl[ \gamma \rho(r) + \C{K}(\rho, \pi) -

5025: \C{K}(\rho_2, \pi) \bigr] + (1 + \zeta_2) \C{K}(\nu, \mu)

5026: \\ =

5027: \nu \biggl[ \C{K}\bigl[ \rho, \pi_{\exp( - \gamma r)} \bigr]

5028: + \lambda_2 \rho_2(r)

5029: + \int_{\lambda_2}^{\gamma}

5030: \pi_{\exp( - \alpha r)}(r) d \alpha \biggr]

5031: + (1 + \zeta_2) \C{K}( \nu, \mu)

5032: \\ = \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \gamma r)} \bigr]

5033: \bigr\} + (1 + \zeta_2)

5034: \C{K}\bigl[ \nu, \mu_{ \exp

5035: \bigl( - \frac{\lambda_2 \rho_2(r)}{1 + \zeta_2}

5036: - \frac{1}{1 + \zeta_2} \int_{\lambda_2}^{\gamma}

5037: \pi_{\exp( - \alpha r)}(r) d \alpha \bigr)} \bigr]

5038: \\ - (1 + \zeta_2) \log \Biggl\{ \mu \Biggl[ \exp \biggl\{

5039: - \frac{\lambda_2}{1 + \zeta_2} \rho_2(r)

5040: - \frac{1}{1 + \zeta_2} \int_{\lambda_2}^{\gamma}

5041: \pi_{\exp( - \alpha r )}(r) d \alpha \biggr\} \Biggr] \Biggr\}.

5042: \end{multline*}

5043: Thus

5044: \begin{multline*}

5045: B(\nu, \rho, \beta) =

5046: (1 + \zeta_2) \Bigl[ \xi_1 \nu_3 \rho_1(r) + \xi_4

5047: \nu_3 \rho_4(r) \\ + \log \bigl\{ \mu \bigl[ \exp

5048: \bigl( - \xi_1 \rho_1(r) - \xi_4 \rho_4(r) \bigr) \bigr] \bigr\}

5049: \Bigr] \\ - (1 + \zeta_2) \log \Biggl\{ \mu \Biggl[ \exp \biggl\{

5050: - \frac{\lambda_2}{1 + \zeta_2} \rho_2(r)

5051: - \frac{1}{1 + \zeta_2} \int_{\lambda_2}^{\gamma}

5052: \pi_{\exp( - \alpha r )}(r) d \alpha \biggr\} \Biggr] \Biggr\} \\ \shoveleft{\quad

5053: - \gamma \nu_3 \rho_1 (r)

5054: + (1 + \zeta_2) \bigl( 1 + \tfrac{NT}{\beta} \zeta_3 \bigr) }

5055: \\ \times

5056: \log \Biggl\{ \nu_3 \Biggl[ \rho_1 \biggl\{

5057: \exp \biggl[ C \tfrac{\beta}{NT \zeta_1} \bigl[ \nu \rho

5058: + \zeta_1 \rho_1 \bigr] (m') \biggr] \biggr\}^{\frac{\zeta_1 N T}{\beta

5059: (1 + \zeta_2)(1 + \frac{NT}{\beta}\zeta_3)}} \\

5060: \shoveright{\times \rho_4 \biggl\{ \exp \biggl[

5061: C \tfrac{\beta}{NT \zeta_5} \bigl[

5062: \zeta_3 \nu_3 \rho_1 + \zeta_5 \rho_4 \bigr] (m')

5063: \biggr] \biggr\}^{\frac{\zeta_5 N T}{\beta(1 + \zeta_2)(1 + \frac{NT}{\beta}

5064: \zeta_3)}} \Biggr] \Biggr\}}\\

5065: + \bigl( 1 + \zeta_2 \tfrac{NT}{\beta}\bigr) \nu \Biggl\{

5066: \log \Biggl[ \rho_2 \biggl\{ \exp \biggl[ \tfrac{C}{1 + \zeta_2

5067: \frac{NT}{\beta}} \zeta_2 \rho_2 (m') \biggr] \biggr\} \Biggr] \Biggr\}\\

5068: + \zeta_4 \tfrac{NT}{\beta} \nu_3 \Biggl\{ \log \Biggl[

5069: \rho_4 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_4}

5070: \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_4

5071: \rho_4 \bigr] (m') \biggr] \biggr\} \Biggr] \Biggr\}\\

5072: \shoveleft{\quad + \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \gamma r)} \bigr]

5073: \bigr\}} \\  + (1 + \zeta_2)

5074: \C{K}\bigl[ \nu, \mu_{ \exp

5075: \bigl( - \frac{\lambda_2 \rho_2(r)}{1 + \zeta_2}

5076: - \frac{1}{1 + \zeta_2} \int_{\lambda_2}^{\gamma}

5077: \pi_{\exp( - \alpha r)}(r) d \alpha \bigr)} \bigr]\\

5078: + \biggl(1 + \sum_{i=1}^5 \zeta_i \biggr) \log\bigl(\tfrac{2}{\epsilon}

5079: \bigr).

5080: \end{multline*}

5081: This formula is better understood by considering

5082: the following upper bound for the first two lines

5083: in the expression of $B(\nu, \rho, \beta)$:

5084: \begin{multline*}

5085: (1 + \zeta_2) \Bigl[ \xi_1 \nu_3 \rho_1(r) + \xi_4

5086: \nu_3 \rho_4(r) + \log \bigl\{ \mu \bigl[ \exp

5087: \bigl( - \xi_1 \rho_1(r) - \xi_4 \rho_4(r) \bigr) \bigr] \bigr\}

5088: \Bigr] \\ \shoveleft{\qquad - (1 + \zeta_2) \log \Biggl\{ \mu \Biggl[ \exp \biggl\{

5089: - \frac{\lambda_2}{1 + \zeta_2} \rho_2(r) }

5090: \\ \shoveright{ - \frac{1}{1 + \zeta_2} \int_{\lambda_2}^{\gamma}

5091: \pi_{\exp( - \alpha r )}(r) d \alpha \biggr\} \Biggr] \Biggr\} -

5092: \gamma \nu_3 \rho_1 (r)\qquad}\\

5093: \leq \nu_3 \biggl[ \lambda_2 \rho_2(r) + \int_{\lambda_2}^{\gamma}

5094: \pi_{\exp( - \alpha r)}(r) d \alpha - \gamma \rho_1(r) \biggr].

5095: \end{multline*}

5096: Another approach to understanding Theorem \ref{thm1.59} is

5097: to put forward $\rho_0 = \pi_{\exp(- \lambda_0 r)}$,

5098: for some positive real constant $\lambda_0 < \gamma$,

5099: noticing that

5100: $$

5101: \nu \bigl[ \C{K}(\rho_0, \pi) - \C{K}(\rho_2, \pi) \bigr]

5102: = \lambda_0 \nu (\rho_2 - \rho_0)(r) - \nu \bigl[

5103: \C{K}(\rho_2, \rho_0) \bigr].

5104: $$
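This identity can be checked as follows (a sketch using only the definition
$\rho_0 = \pi_{\exp( - \lambda_0 r)}$): for any model $m \in M$,
\begin{align*}
\C{K}(\rho_0, \pi) & = - \lambda_0 \rho_0(r)
- \log \bigl\{ \pi \bigl[ \exp( - \lambda_0 r) \bigr] \bigr\},\\
\C{K}(\rho_2, \rho_0) & = \C{K}(\rho_2, \pi) + \lambda_0 \rho_2(r)
+ \log \bigl\{ \pi \bigl[ \exp( - \lambda_0 r) \bigr] \bigr\},
\end{align*}
and adding these two equalities before integrating with respect to $\nu$
gives the result.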

5105: Thus

5106: \begin{multline*}

5107: B(\nu, \rho_0, \beta) \leq

5108: \nu_3 \bigl[ (\gamma - \lambda_0) (\rho_0 - \rho_1)(r) + \lambda_0

5109: (\rho_2 - \rho_1)(r) \bigr]  \\

5110: \shoveleft{\quad + (1 + \zeta_2) \bigl( 1 + \tfrac{NT}{\beta} \zeta_3 \bigr)

5111: } \\ \times \log \Biggl\{ \nu_3 \Biggl[ \rho_1 \biggl\{

5112: \exp \biggl[ C \tfrac{\beta}{NT \zeta_1} \bigl[ \nu \rho_0

5113: + \zeta_1 \rho_1 \bigr] (m') \biggr] \biggr\}^{\frac{\zeta_1 N T}{\beta

5114: (1 + \zeta_2)(1 + \frac{NT}{\beta}\zeta_3)}} \\

5115: \shoveright{ \times \rho_4 \biggl\{ \exp \biggl[

5116: C \tfrac{\beta}{NT \zeta_5} \bigl[

5117: \zeta_3 \nu_3 \rho_1 + \zeta_5 \rho_4 \bigr] (m')

5118: \biggr] \biggr\}^{\frac{\zeta_5 N T}{\beta(1 + \zeta_2)(1 + \frac{NT}{\beta}

5119: \zeta_3)}} \Biggr] \Biggr\}\quad}\\

5120: + \bigl( 1 + \zeta_2 \tfrac{NT}{\beta}\bigr) \nu \Biggl\{

5121: \log \Biggl[ \rho_2 \biggl\{ \exp \biggl[ \tfrac{C}{1 + \zeta_2

5122: \frac{NT}{\beta}} \zeta_2 \rho_2 (m') \biggr] \biggr\} \Biggr] \Biggr\}\\

5123: + \zeta_4 \tfrac{NT}{\beta} \nu_3 \Biggl\{ \log \Biggl[

5124: \rho_4 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_4}

5125: \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_4

5126: \rho_4 \bigr] (m') \biggr] \biggr\} \Biggr] \Biggr\}\\

5127: \shoveleft{\quad + (1 + \zeta_2) \C{K}\Bigl[

5128: \nu, \mu_{\exp \bigl( - \frac{(\gamma - \lambda_0) \rho_0(r) + \lambda_0 \rho_2(r)}{

5129: 1 + \zeta_2} \bigr)} \Bigr] }\\

5130: - \nu \bigl[ \C{K}(\rho_2, \rho_0) \bigr]

5131: + \biggl( 1 + \sum_{i=1}^5 \zeta_i \biggr)

5132: \log \bigl( \tfrac{2}{\epsilon} \bigr).

5133: \end{multline*}

5134:

5135: In the case when we want to select a single model $\wm(\omega)$,

5136: and therefore to set $\nu = \delta_{\wm}$, the previous

5137: inequality suggests taking \\

5138: \mbox{} \hfill $\ds \wm \in \arg \min_{m \in M}

5139: (\gamma - \lambda_0) \rho_0(m, r) + \lambda_0 \rho_2(m, r)$.

5140: \hfill \mbox{}\\

5141: In parametric situations where $\pi_{\exp( - \lambda r)}(r)

5142: \simeq \sr(m) + \frac{d_e(m)}{\lambda}$,

5143: we get\\\mbox{}\hfill

5144: $(\gamma - \lambda_0) \rho_0(m, r) + \lambda_0 \rho_2(m, r)

5145: \simeq \gamma \bigl[ \sr(m) + d_e(m) \bigl( \tfrac{1}{\lambda_0}

5146: + \tfrac{\lambda_0 - \lambda_2}{\gamma \lambda_2} \bigr)\bigr]$,\hfill

5147: \mbox{}\\

5148: resulting in a linear penalization of the empirical dimension of the

5149: models.
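To see where this approximation comes from, here is a short computation,
taking the parametric approximation above for granted and recalling that
$\rho_0 = \pi_{\exp( - \lambda_0 r)}$ and $\rho_2 = \pi_{\exp( - \lambda_2 r)}$:
\begin{align*}
(\gamma - \lambda_0) \rho_0(m, r) + \lambda_0 \rho_2(m, r)
& \simeq (\gamma - \lambda_0) \Bigl[ \sr(m) + \tfrac{d_e(m)}{\lambda_0} \Bigr]
+ \lambda_0 \Bigl[ \sr(m) + \tfrac{d_e(m)}{\lambda_2} \Bigr] \\
& = \gamma \sr(m) + d_e(m) \Bigl( \tfrac{\gamma}{\lambda_0} - 1
+ \tfrac{\lambda_0}{\lambda_2} \Bigr) \\
& = \gamma \Bigl[ \sr(m) + d_e(m) \Bigl( \tfrac{1}{\lambda_0}
+ \tfrac{\lambda_0 - \lambda_2}{\gamma \lambda_2} \Bigr) \Bigr].
\end{align*}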

5150:

5151: \subsubsection{Analysis of the two step relative bound}

5152: We will not state a formal result, but will nevertheless give some

5153: hints about how to establish one.

5154: We should start from Theorem \ref{thm4.1}, which gives a deterministic variance

5155: term. From Theorem \ref{thm4.1}, after a

5156: change of prior distribution, we obtain

5157: for any positive constants $\alpha_1$ and $\alpha_2$,

5158: any prior distributions $\wt{\mu}_1$ and $\wt{\mu}_2

5159: \in \C{M}_+^1(M)$,

5160: for any prior conditional distributions $\wt{\pi}_1$

5161: and $\wt{\pi}_2 : M \rightarrow \C{M}_+^1(\Theta)$,

5162: with $\PP$ probability at least $1 - \eta$,

5163: for any posterior distributions $\nu_1 \rho_1$ and

5164: $\nu_2 \rho_2$,

5165: \begin{multline*}

5166: \alpha_1(\nu_1 \rho_1 - \nu_2 \rho_2)(R) \leq

5167: \alpha_2(\nu_1 \rho_1 - \nu_2 \rho_2)(r) \\ +

5168: \C{K}\bigl[ (\nu_1 \rho_1) \otimes (\nu_2 \rho_2),

5169: (\wt{\mu}_1\,\wt{\pi}_1)\otimes(\wt{\mu}_2\,\wt{\pi}_2)

5170: \bigr] \\

5171: + \log \Bigl\{ (\wt{\mu}_1\,\wt{\pi}_1)\otimes (\wt{\mu}_2\,\wt{\pi}_2) \Bigl[

5172: \exp \bigl\{ - \alpha_2 \Psi_{\frac{\alpha_2}{N}}(R',M') + \alpha_1 R' \bigr\}

5173: \Bigr] \Bigr\} - \log(\eta).

5174: \end{multline*}

5175: Applying this to $\alpha_1 = 0$, we get that

5176: \begin{multline*}

5177: (\nu \rho - \nu_3 \rho_1)(r)

5178: \leq \frac{1}{\alpha_2} \biggl[ \C{K}\bigl[

5179: (\nu \rho) \otimes (\nu_3 \rho_1), (\wt{\mu}\,\wt{\pi})\otimes (

5180: \wt{\mu}_3\,\wt{\pi}_1) \bigr]

5181: \\ + \log \Bigl\{  (\wt{\mu}\,\wt{\pi})\otimes(\wt{\mu}_3\,\wt{\pi}_1)

5182: \Bigl[ \exp \bigl\{

5183: \alpha_2 \Psi_{-\frac{\alpha_2}{N}} (R', M') \bigr\} \Bigr] \Bigr\}

5184: - \log(\eta) \biggr].

5185: \end{multline*}

5186: In the same way, to bound quantities of the form

5187: \begin{multline*}

5188: \log \Biggl\{ \nu_3 \Biggl[ \rho_1 \biggl\{

5189: \exp \biggl[ C_1 (\nu \rho + \zeta_1 \rho_1)(m') \biggr] \biggr\}^{p_1}

5190: \\ \times \rho_4 \biggl\{ \exp \biggl[ C_2 \bigl[

5191: \zeta_3 \nu_3 \rho_1 + \zeta_5 \rho_4 \bigr] (m') \biggr]

5192: \biggr\}^{p_2} \Biggr] \Biggr\}

5193: \\ = \sup_{\nu_5} \biggl\{ p_1 \sup_{\rho_5} \Bigl\{

5194: C_1 \bigl[ (\nu \rho) \otimes (\nu_5 \rho_5) + \zeta_1 \nu_5(\rho_1

5195: \otimes \rho_5) \bigr](m') - \C{K}(\rho_5, \rho_1) \Bigr\}

5196: \\\qquad \qquad + p_2 \sup_{\rho_6} \Bigl\{  C_2 \bigl[ \zeta_3

5197: (\nu_3 \rho_1) \otimes (\nu_5 \rho_6) \hfill \\ + \zeta_5 \nu_5(\rho_4

5198: \otimes \rho_6) \bigr] (m') - \C{K}(\rho_6, \rho_4) \Bigr\}

5199: - \C{K}(\nu_5, \nu_3) \biggr\},

5200: \end{multline*}

5201: where $C_1$, $C_2$, $p_1$ and $p_2$ are positive constants,

5202: and similar terms,

5203: we need to use inequalities of the type: for any prior distributions

5204: $\wt{\mu}_i\,\wt{\pi}_i$, $i = 1, 2$, with $\PP$ probability

5205: at least $1 - \eta$, for any posterior distributions

5206: $\nu_i \rho_i$, $i = 1,2$,

5207: \begin{multline*}

5208: \alpha_3 (\nu_1 \rho_1) \otimes (\nu_2 \rho_2)(m')

5209: \leq

5210: \log \Bigl\{ (\wt{\mu}_1\,\wt{\pi}_1) \otimes

5211: (\wt{\mu}_2\,\wt{\pi}_2) \exp \Bigl[ \alpha_3 \Phi_{\frac{- \alpha_3}{N}}

5212: (M') \Bigr] \Bigr\} \\ + \C{K}\bigl[

5213: (\nu_1 \rho_1) \otimes (\nu_2 \rho_2), (\wt{\mu}_1\,\wt{\pi}_1)

5214: \otimes (\wt{\mu}_2\,\wt{\pi}_2) \bigr] - \log(\eta).

5215: \end{multline*}

5216: We need also the variant: with $\PP$ probability at least $1 - \eta$,

5217: for any posterior distribution $\nu_1 : \Omega \rightarrow \C{M}_+^1(M)$

5218: and any conditional posterior distributions $\rho_1, \rho_2 :

5219: \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,

5220: \begin{multline*}

5221: \alpha_3 \nu_1 (\rho_1 \otimes \rho_2)(m')

5222: \leq

5223: \log \Bigl\{ \wt{\mu}_1\bigl(\wt{\pi}_1 \otimes \wt{\pi}_2 \bigr)

5224: \exp \Bigl[ \alpha_3 \Phi_{- \frac{\alpha_3}{N}}(M') \Bigr] \Bigr\}

5225: \\ + \C{K}(\nu_1, \wt{\mu}_1) + \nu_1 \bigl\{

5226: \C{K}\bigl[

5227: \rho_1 \otimes \rho_2, \wt{\pi}_1

5228: \otimes \wt{\pi}_2 \bigr] \bigr\} - \log(\eta).

5229: \end{multline*}

5230: We deduce that

5231: \begin{multline*}

5232: \log \Biggl\{ \nu_3 \Biggl[

5233: \rho_1 \biggl\{ \exp \biggl[

5234: C_1 (\nu \rho + \zeta_1 \rho_1)(m') \biggr]

5235: \biggr\}^{p_1}

5236: \\ \shoveright{ \times \rho_4 \biggl\{ \exp

5237: \biggl[

5238: C_2 \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_5

5239: \rho_4 \bigr] (m') \biggr] \biggr\}^{p_2} \Biggr] \Biggr\} \quad } \\

5240: \leq \sup_{\nu_5} \Biggl\{ p_1

5241: \sup_{\rho_5} \Biggl[

5242: \frac{C_1}{\alpha_3} \biggl\{ \log \Bigl\{ (\wt{\mu} \, \wt{\pi})

5243: \otimes (\wt{\mu}_5\,\wt{\pi}_5) \exp \Bigl[

5244: \alpha_3 \Phi_{- \frac{\alpha_3}{N}}(M') \Bigr] \Bigr\}

5245: \\ + \C{K}\bigl[ (\nu \rho) \otimes (\nu_5 \rho_5),

5246: (\wt{\mu}\,\wt{\pi}) \otimes (\wt{\mu}_5\,\wt{\pi}_5) \bigr]

5247: + \log(\tfrac{2}{\eta}) \\

5248: + \zeta_1 \biggl[

5249: \log \Bigl\{ \wt{\mu}_5 \bigl(

5250: \wt{\pi}_1 \otimes \wt{\pi}_5 \bigr)

5251: \exp \Bigl[ \alpha_3 \Phi_{- \frac{\alpha_3}{N}}

5252: (M') \Bigr] \Bigr\}

5253: \\ + \C{K}(\nu_5, \wt{\mu}_5)

5254: + \nu_5 \bigl\{ \C{K} \bigl[

5255: \rho_1 \otimes \rho_5,

5256: \wt{\pi}_1 \otimes \wt{\pi}_5 \bigr] \bigr\}

5257: + \log \bigl(  \tfrac{2}{\eta} \bigr)

5258: \biggr] \biggr\} - \C{K}(\rho_5, \rho_1) \Biggr] \\

5259: + p_2 \sup_{\rho_6} \Biggl[

5260: \frac{C_2}{\alpha_3} \biggl\{ \log \Bigl\{ (\wt{\mu}_3 \, \wt{\pi}_1)

5261: \otimes (\wt{\mu}_5\,\wt{\pi}_6) \exp \Bigl[

5262: \alpha_3 \Phi_{- \frac{\alpha_3}{N}}(M') \Bigr] \Bigr\}

5263: \\ + \C{K}\bigl[ (\nu_3 \rho_1) \otimes (\nu_5 \rho_6),

5264: (\wt{\mu}_3\,\wt{\pi}_1) \otimes (\wt{\mu}_5\,\wt{\pi}_6) \bigr]

5265: + \log(\tfrac{2}{\eta}) \\

5266: + \zeta_5 \biggl[

5267: \log \Bigl\{ \wt{\mu}_5 \bigl(

5268: \wt{\pi}_4 \otimes \wt{\pi}_6 \bigr)

5269: \exp \Bigl[ \alpha_3 \Phi_{- \frac{\alpha_3}{N}}

5270: (M') \Bigr] \Bigr\}

5271: \\ \hfill + \C{K}(\nu_5, \wt{\mu}_5)

5272: + \nu_5 \bigl\{ \C{K} \bigl[

5273: \rho_4 \otimes \rho_6,

5274: \wt{\pi}_4 \otimes \wt{\pi}_6 \bigr] \bigr\}

5275: + \log \bigl(  \tfrac{2}{\eta} \bigr)

5276: \biggr] \biggr\}\qquad \\ - \C{K}(\rho_6, \rho_4) \Biggr]

5277: - \C{K}(\nu_5, \nu_3) \Biggr\}.

5278: \end{multline*}

5279:

5280: We are then left with the need to bound entropy terms like

5281: $\C{K}(\nu_3 \rho_1, \wt{\mu}_3\wt{\pi}_1)$, where we have the choice of

5282: $\wt{\mu}_3$ and $\wt{\pi}_1$, to obtain a useful bound.

5283: As could be expected, we decompose it into

5284: $$

5285: \C{K}(\nu_3 \rho_1, \wt{\mu}_3\wt{\pi}_1) =

5286: \C{K}(\nu_3, \wt{\mu}_3) + \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr].

5287: $$

5288: Let us deal with the second term first, choosing $\wt{\pi}_1 = \pi_{\exp

5289: ( - \beta_1 R)}$:

5290: \begin{multline*}

5291: \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr]

5292: = \nu_3 \bigl[ \beta_1 (\rho_1 - \wt{\pi}_1)(R) + \C{K}(\rho_1, \pi)

5293: - \C{K}(\wt{\pi}_1, \pi) \bigr]

5294: \\ \leq \frac{\beta_1}{\alpha_1}  \biggl[ \alpha_2 \nu_3(\rho_1 - \wt{\pi}_1)(r)

5295: + \C{K}(\nu_3, \wt{\mu}_3) + \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr]

5296: \\+ \log \Bigl\{ \wt{\mu}_3 \bigl( \wt{\pi}_1^{\otimes 2}

5297: \bigr) \Bigl[

5298: \exp \bigl\{ - \alpha_2 \Psi_{\frac{\alpha_2}{N}}

5299: (R', M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\} - \log(\eta) \biggr]

5300: \\ \shoveright{+ \nu_3 \bigl[ \C{K}(\rho_1, \pi) - \C{K}(\wt{\pi}_1, \pi) \bigr]

5301: \qquad}

5302: \\ \quad \leq \frac{\beta_1}{\alpha_1} \biggl[

5303: \C{K}(\nu_3, \wt{\mu}_3) + \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr]

5304: \hfill \\ + \log \Bigl\{

5305: \wt{\mu}_3 \bigl( \wt{\pi}_1^{\otimes 2} \bigr)

5306: \Bigl[ \exp \bigl\{

5307: - \alpha_2 \Psi_{\frac{\alpha_2}{N}}(R', M') + \alpha_1 R' \bigr\}

5308: \Bigr] \Bigr\} - \log(\eta) \biggr]

5309: \\ + \nu_3

5310: \bigl\{ \C{K}\bigl[ \rho_1 , \pi_{\exp ( -

5311: \frac{\beta_1 \alpha_2}{\alpha_1} r)} \bigr] \bigr\}.

5312: \end{multline*}

5313: Thus, when the constraint $\lambda_1 = \frac{\beta_1 \alpha_2}{\alpha_1}$

5314: is satisfied,

5315: \begin{multline*}

5316: \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr]

5317: \leq \Bigl( 1 - \frac{\beta_1}{\alpha_1} \Bigr)^{-1} \frac{\beta_1}{\alpha_1} \biggl[

5318: \C{K}(\nu_3, \wt{\mu}_3) \\ + \log \Bigl\{

5319: \wt{\mu}_3 \bigl(\wt{\pi}_1^{\otimes 2} \bigr)

5320: \Bigl[ \exp \bigl\{ - \alpha_2 \Psi_{\frac{\alpha_2}{N}}(R', M') + \alpha_1

5321: R' \bigr\} \Bigr] \Bigr\}

5322: - \log(\eta) \biggr].

5323: \end{multline*}
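The rearrangement leading to this bound is elementary. As a sketch, put
$x = \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr]$, $a = \tfrac{\beta_1}{\alpha_1}$,
and let $A$ denote the sum of $\C{K}(\nu_3, \wt{\mu}_3)$, of the log-Laplace
term and of $- \log(\eta)$ (the symbols $x$, $a$ and $A$ are introduced here
for illustration only). Since $\rho_1 = \pi_{\exp( - \lambda_1 r)}$,
the last entropy term of the previous inequality vanishes under the
constraint, and we are left with
$$
x \leq a (A + x), \quad \text{so that} \quad x \leq \frac{a}{1 - a} A,
\quad \text{provided that } a < 1 \text{ and } x < \infty.
$$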

5324: We can further specialize the constants, choosing $\alpha_1

5325: = N \sinh(\frac{\alpha_2}{N})$, so that

5326: $$

5327: - \alpha_2 \Psi_{\frac{\alpha_2}{N}}(R', M') + \alpha_1 R'

5328: \leq 2 N \sinh\Bigl(\frac{\alpha_2}{2 N}\Bigr)^2 M'.

5329: $$

5330: We can for instance choose $\alpha_2 = \gamma$, $\alpha_1 = N \sinh(\frac{\gamma}{N})$,

5331: and $\beta_1 = \lambda_1 \frac{N}{\gamma} \sinh(\frac{\gamma}{N})$,

5332: leading to

5333: \begin{prop}\mypoint

5334: With the notations of Theorem \ref{thm1.59}, the constants being

5335: set as explained above, putting $

5336: \wt{\pi}_1  = \pi_{\exp( - \lambda_1 \frac{N}{\gamma}\sinh(\frac{\gamma}{N}) R)}$,

5337: with $\PP$ probability at least $1 - \eta$,

5338: \begin{multline*}

5339: \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr]

5340: \leq \Bigl( 1 - \frac{\lambda_1}{\gamma} \Bigr)^{-1}

5341: \frac{\lambda_1}{\gamma} \biggl[ \C{K}(\nu_3, \wt{\mu}_3)

5342: \\ + \log \Bigl\{

5343: \wt{\mu}_3 \bigl( \wt{\pi}_1^{\otimes 2} \bigr)\Bigl[

5344: \exp \bigl\{ 2 N \sinh(\tfrac{\gamma}{2N})^2 M' \bigr\} \Bigr] \Bigr\}

5345: - \log(\eta) \biggr].

5346: \end{multline*}

5347: More generally

5348: \begin{multline*}

5349: \nu_3 \bigl[ \C{K}(\rho, \wt{\pi}_1) \bigr]

5350: \leq \Bigl( 1 - \frac{\lambda_1}{\gamma} \Bigr)^{-1}

5351: \frac{\lambda_1}{\gamma} \biggl[ \C{K}(\nu_3, \wt{\mu}_3)

5352: \\ + \log \Bigl\{

5353: \wt{\mu}_3 \bigl( \wt{\pi}_1^{\otimes 2} \bigr)\Bigl[

5354: \exp \bigl\{ 2 N \sinh(\tfrac{\gamma}{2N})^2 M' \bigr\}

5355: \Bigr] \Bigr\} - \log(\eta) \biggr]

5356: \\ + \Bigl( 1 - \frac{\lambda_1}{\gamma} \Bigr)^{-1} \nu_3 \bigl[ \C{K}(

5357: \rho, \rho_1) \bigr].

5358: \end{multline*}

5359: \end{prop}

5360: In a similar way, let us now choose $\wt{\mu}_3 = \mu_{\exp[ - \alpha_3 \opi(R)]}$.

5361: We can write

5362: \begin{multline*}

5363: \C{K}(\nu, \wt{\mu}_3) = \alpha_3 (\nu - \wt{\mu}_3)\opi(R)

5364: + \C{K}(\nu, \mu) - \C{K}(\wt{\mu}_3, \mu)

5365: \\ \leq \frac{\alpha_3}{\alpha_1} \biggl[ \alpha_2 (\nu - \wt{\mu}_3)\opi(r)

5366: + \C{K}(\nu, \wt{\mu}_3) \\ + \log \Bigl\{ (\wt{\mu}_3 \opi) \otimes

5367: (\wt{\mu}_3 \opi) \Bigl[ \exp \bigl\{

5368: - \alpha_2 \Psi_{\frac{\alpha_2}{N}}(R',M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\}

5369: - \log(\eta) \biggr] \\

5370: + \C{K}(\nu, \mu) - \C{K}(\wt{\mu}_3, \mu).

5371: \end{multline*}

5372: Let us choose $\alpha_2 = \gamma$, $\alpha_1 = N \sinh(\frac{\gamma}{N})$, and

5373: let us add some other entropy inequalities to get

5374: rid of $\opi$ in a suitable way, the approach of entropy

5375: compensation being quite the same as the one used

5376: to obtain the empirical bound of Theorem \ref{thm1.59}.

5377: This results with $\PP$ probability

5378: at least $1 - \eta$ in

5379: \begin{multline*}

5380: \Bigl( 1 - \frac{\alpha_3}{\alpha_1} \Bigr)

5381: \C{K}(\nu, \wt{\mu}_3) \leq \frac{\alpha_3}{\alpha_1}  \biggl[

5382: \gamma (\nu - \wt{\mu}_3)\opi(r)

5383: \\+ \log \Bigl\{ ( \wt{\mu}_3 \opi) \otimes ( \wt{\mu}_3 \opi)

5384: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\}

5385: \Bigr] \Bigr\} + \log(\tfrac{2}{\eta}) \biggr]

5386: \\ \hfill + \C{K}(\nu, \mu) - \C{K}(\wt{\mu}_3, \mu),\quad\\\quad

5387: \zeta_6 \Bigl(1 - \frac{\beta}{\alpha_1} \Bigr)

5388: \wt{\mu}_3 \bigl[ \C{K}(\rho_6, \opi) \bigr]

5389: \leq \zeta_6 \frac{\beta}{\alpha_1} \biggl[

5390: \gamma \wt{\mu}_3 (\rho_6 - \opi)(r)\hfill\\

5391: + \log \Bigl\{ \wt{\mu}_3\bigl(\opi^{\otimes 2}\bigr)

5392: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M')

5393: + \alpha_1 R' \bigr\} \Bigr] \Bigr\} + \log(\tfrac{2}{\eta}) \biggr]

5394: \\ \hfill + \zeta_6 \wt{\mu}_3 \bigl[

5395: \C{K}(\rho_6, \pi) - \C{K}(\opi, \pi) \bigr],\quad\\\quad

5396: \zeta_7 \Bigl(1 - \frac{\beta}{\alpha_1} \Bigr)

5397: \wt{\mu}_3 \bigl[ \C{K}(\rho_7, \opi) \bigr]

5398: \leq \zeta_7 \frac{\beta}{\alpha_1} \biggl[

5399: \gamma \wt{\mu}_3 (\rho_7 - \opi)(r)\hfill \\

5400: + \log \Bigl\{ \wt{\mu}_3\bigl(\opi^{\otimes 2}\bigr)

5401: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M')

5402: + \alpha_1 R' \bigr\} \Bigr] \Bigr\} + \log(\tfrac{2}{\eta}) \biggr]

5403: \\ \hfill + \zeta_7 \wt{\mu}_3 \bigl[

5404: \C{K}(\rho_7, \pi) - \C{K}(\opi, \pi) \bigr],\quad\\\quad

5405: \zeta_8 \Bigl( 1 - \frac{\beta}{\alpha_1} \Bigr) \nu \bigl[ \C{K}(\rho_8, \opi) \bigr]

5406: \leq \zeta_8 \frac{\beta}{\alpha_1} \biggl[ \gamma \nu ( \rho_8 - \opi) (r)

5407: + \C{K}(\nu, \wt{\mu}_3) \hfill\\ +

5408: \log \Bigl\{ \wt{\mu}_3\bigl(\opi^{\otimes 2}\bigr)

5409: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\}

5410: \Bigr] \Bigr\} + \log(\tfrac{2}{\eta}) \biggr]

5411: \\ \hfill + \zeta_8 \nu \bigl[ \C{K}(\rho_8, \pi)

5412: - \C{K}(\opi, \pi) \bigr],\quad\\\quad

5413: \zeta_9 \Bigl( 1 - \frac{\beta}{\alpha_1} \Bigr) \nu \bigl[ \C{K}(\rho_9, \opi) \bigr]

5414: \leq \zeta_9 \frac{\beta}{\alpha_1} \biggl[ \gamma \nu ( \rho_9 - \opi) (r)

5415: + \C{K}(\nu, \wt{\mu}_3) \hfill\\ +

5416: \log \Bigl\{ \wt{\mu}_3\bigl(\opi^{\otimes 2}\bigr)

5417: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\}

5418: \Bigr] \Bigr\} + \log(\tfrac{2}{\eta}) \biggr]

5419: \\ \hfill + \zeta_9 \nu \bigl[ \C{K}(\rho_9, \pi)

5420: - \C{K}(\opi, \pi) \bigr],

5421: \end{multline*}

5422: where we have introduced several constants, assumed to be positive,

5423: that we will constrain more precisely by

5424: \begin{align*}

5425: x_8 + x_9 & = 1,\\

5426: ( \zeta_6 \beta + x_8 \alpha_3) \frac{\gamma}{\alpha_1} & = \lambda_6,\\

5427: ( \zeta_7 \beta + x_9 \alpha_3) \frac{\gamma}{\alpha_1} & = \lambda_7,\\

5428: ( \zeta_8 \beta - x_8 \alpha_3) \frac{\gamma}{\alpha_1} & = \lambda_8,\\

5429: ( \zeta_9 \beta - x_9 \alpha_3) \frac{\gamma}{\alpha_1} & = \lambda_9.

5430: \end{align*}

5431: We get with $\PP$ probability at least $1 - \eta$,

5432: \begin{multline*}

5433: \Bigl( 1 - \frac{\alpha_3}{\alpha_1} -

5434: (\zeta_8 + \zeta_9)  \frac{\beta}{\alpha_1} \Bigr)

5435: \C{K}(\nu, \wt{\mu}_3) \leq

5436: \\ \frac{\alpha_3}{\alpha_1} \gamma \bigl[ \nu (

5437: x_8 \rho_8 + x_9 \rho_9)(r) - \wt{\mu}_3 (x_8 \rho_6 + x_9 \rho_7) (r) \bigr]

5438: \\ + \frac{\alpha_3}{\alpha_1} \log

5439: \Bigl\{ (\wt{\mu}_3 \opi) \otimes (\wt{\mu}_3 \opi)

5440: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M')

5441: + \alpha_1 R' \bigr\} \Bigr] \Bigr\} \\

5442: + (\zeta_6 + \zeta_7 + \zeta_8 + \zeta_9) \frac{\beta}{\alpha_1}

5443: \log \Bigl\{ \wt{\mu}_3 \bigl(

5444: \opi^{\otimes 2} \bigr)

5445: \Bigl[ \exp \bigl\{ - \gamma

5446: \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\}\\

5447: + \C{K}(\nu, \mu) - \C{K}(\wt{\mu}_3, \mu)

5448: + \Bigl( \frac{\alpha_3}{\alpha_1} + (\zeta_6 + \zeta_7 + \zeta_8 +

5449: \zeta_9) \frac{\beta}{\alpha_1} \Bigr) \log\bigl( \tfrac{2}{\eta} \bigr).

5450: \end{multline*}

5451: Let us choose the constants so that

5452: $\lambda_1 = \lambda_7 = \lambda_9$, $\lambda_4 = \lambda_6 = \lambda_8$,

5453: $\alpha_3 x_9 \frac{\gamma}{\alpha_1} = \xi_1$ and $ \alpha_3 x_8

5454: \frac{\gamma}{\alpha_1} = \xi_4$.

5455: This is done by setting

5456: \begin{align*}

5457: x_8 & = \frac{\xi_4}{\xi_1 + \xi_4},\\

5458: x_9 & = \frac{\xi_1}{\xi_1 + \xi_4},\\

5459: \alpha_3 & = \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) ( \xi_1 + \xi_4),\\

5460: \zeta_6 & = \tfrac{N}{\gamma}\sinh(\tfrac{\gamma}{N}) \frac{(\lambda_4 - \xi_4)}{\beta},\\

5461: \zeta_7 & = \tfrac{N}{\gamma}\sinh(\tfrac{\gamma}{N})

5462: \frac{(\lambda_1 - \xi_1)}{\beta},\\

5463: \zeta_8 & = \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) \frac{(\lambda_4 +

5464: \xi_4)}{\beta},\\

5465: \zeta_9 & = \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) \frac{(\lambda_1 + \xi_1)}{

5466: \beta}.

5467: \end{align*}

5468: The inequality $\lambda_1 > \xi_1$ is always satisfied. The inequality

5469: $\lambda_4 > \xi_4$ is required for the above choice of constants, and

5470: will be satisfied for a suitable choice of $\zeta_3$ and $\zeta_4$.

5471:

5472: Under these assumptions, we obtain with $\PP$ probability at least $1 - \eta$

5473: \begin{multline*}

5474: \Bigl( 1 - \frac{\alpha_3}{\alpha_1} -

5475: (\zeta_8 + \zeta_9)  \frac{\beta}{\alpha_1} \Bigr)

5476: \C{K}(\nu, \wt{\mu}_3) \leq

5477: (\nu - \wt{\mu}_3) (\xi_1 \rho_1 + \xi_4 \rho_4)(r)

5478: \\ + \frac{\alpha_3}{\alpha_1} \log

5479: \Bigl\{ (\wt{\mu}_3 \opi) \otimes (\wt{\mu}_3 \opi)

5480: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M')

5481: + \alpha_1 R' \bigr\} \Bigr] \Bigr\} \\

5482: + (\zeta_6 + \zeta_7 + \zeta_8 + \zeta_9) \frac{\beta}{\alpha_1}

5483: \log \Bigl\{ \wt{\mu}_3 \bigl(

5484: \opi^{\otimes 2} \bigr)

5485: \Bigl[ \exp \bigl\{ - \gamma

5486: \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\}\\

5487: + \C{K}(\nu, \mu) - \C{K}(\wt{\mu}_3, \mu)

5488: + \Bigl( \frac{\alpha_3}{\alpha_1} + (\zeta_6 + \zeta_7 + \zeta_8 +

5489: \zeta_9) \frac{\beta}{\alpha_1} \Bigr) \log\bigl( \tfrac{2}{\eta} \bigr).

5490: \end{multline*}

5491: This proves

5492: \begin{prop}

5493: \mypoint

5494: The constants being set as explained above,

5495: with $\PP$ probability at least $1 - \eta$,

5496: for any posterior distribution $\nu : \Omega \rightarrow \C{M}_+^1(M)$,

5497: \begin{multline*}

5498: \C{K}(\nu, \wt{\mu}_3) \leq \Bigl( 1 - \frac{\alpha_3}{\alpha_1} -

5499: (\zeta_8 + \zeta_9)  \frac{\beta}{\alpha_1} \Bigr)^{-1}

5500: \biggl[ \C{K}(\nu, \nu_3)

5501: \\ + \frac{\alpha_3}{\alpha_1} \log

5502: \Bigl\{ (\wt{\mu}_3 \opi) \otimes (\wt{\mu}_3 \opi)

5503: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M')

5504: + \alpha_1 R' \bigr\} \Bigr] \Bigr\} \\

5505: + (\zeta_6 + \zeta_7 + \zeta_8 + \zeta_9) \frac{\beta}{\alpha_1}

5506: \log \Bigl\{ \wt{\mu}_3 \bigl(

5507: \opi^{\otimes 2} \bigr)

5508: \Bigl[ \exp \bigl\{ - \gamma

5509: \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\}\\

5510: + \Bigl( \frac{\alpha_3}{\alpha_1} + (\zeta_6 + \zeta_7 + \zeta_8 +

5511: \zeta_9) \frac{\beta}{\alpha_1} \Bigr) \log\bigl( \tfrac{2}{\eta} \bigr)\biggr] .

5512: \end{multline*}

5513: \end{prop}

5514: Thus

5515: \begin{multline*}

5516: \C{K}(\nu_3 \rho_1, \wt{\mu}_3\,\wt{\pi}_1) \leq

5517: \frac{1 + \bigl(1 - \frac{\lambda_1}{\gamma}\bigr)^{-1} \frac{\lambda_1}{\gamma}}{

5518: 1 - \frac{\alpha_3}{\alpha_1} - (\zeta_8+\zeta_9)\frac{\beta}{\alpha_1}} \\ \times

5519: \biggl[ \frac{\alpha_3}{\alpha_1} \log \Bigl\{

5520: (\wt{\mu}_3 \ov{\pi}) \otimes (\wt{\mu}_3 \ov{\pi}) \Bigl[

5521: \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}

5522: (R',M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\}

5523: \\ + (\zeta_6 + \zeta_7 + \zeta_8 + \zeta_9) \frac{\beta}{\alpha_1}

5524: \log \Bigl\{ \wt{\mu}_3 \bigl( \ov{\pi}^{\otimes 2} \bigr) \Bigl[

5525: \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\} \Bigr]

5526: \Bigr\} \\

5527: + \Bigl( \frac{\alpha_3}{\alpha_1} + (

5528: \zeta_6 + \zeta_7 + \zeta_8 + \zeta_9) \frac{\beta}{\alpha_1} \Bigr)

5529: \log \bigl( \tfrac{2}{\eta} \bigr) \biggr] \\

5530: + \Bigl( 1 - \frac{\lambda_1}{\gamma} \Bigr)^{-1} \frac{\lambda_1}{\gamma} \biggl[

5531: \log \Bigl\{ \wt{\mu}_3 \bigl( \wt{\pi}_1^{\otimes 2} \bigr)

5532: \Bigl[ \exp \bigl\{ 2 N \sinh\bigl(\tfrac{\gamma}{2N} \bigr)^2

5533: M' \bigr\} \Bigr] \Bigr\} - \log( \tfrac{2}{\eta} ) \biggr].

5534: \end{multline*}

5535: We will not go further, lest it become tedious, but we hope we have

5536: given sufficient hints to state informally that the bound $B(\nu, \rho, \beta)$

5537: of Theorem \ref{thm1.59} is upper bounded

5538: with $\PP$ probability close to one by a

5539: bound of the same flavour where the empirical quantities $r$ and $m'$

5540: have been replaced with their expectations $R$ and $M'$.

5541:

5542: \section{Transductive PAC-Bayesian learning}

5543:

5544: \subsection{Basic inequalities}

5545: In this section the observed sample $(X_i, Y_i)_{i=1}^N$

5546: will be supplemented with a {\em shadow sample}

5547: $(X_i,Y_i)_{i=N+1}^{(k+1)N}$.

5548: This point of view, called {\em transductive classification},

5549: has been introduced by V. Vapnik. It may be justified in different

5550: ways.

5551:

5552: On the practical side,

5553: one interest of the transductive setting is that it is

5554: often a lot easier to collect examples than it is to label them,

5555: so that it is not unrealistic to assume that we indeed have

5556: two training samples, one labelled and one unlabelled.

5557: It also covers the case when a batch of patterns

5558: is to be classified and we are allowed to observe

5559: the whole batch before issuing the classification.

5560:

5561: On the mathematical side, considering a shadow sample

5562: proves technically fruitful. Indeed, when introducing

5563: the VC entropy and VC dimension concepts, as well as when

5564: dealing with compression

5565: schemes, although the {\em inductive} setting is our

5566: final concern, the transductive setting is a

5567: useful detour.

5568: In this second scenario, intermediate technical results

5569: involving the shadow sample are integrated with respect

5570: to unobserved random variables in a second stage of the proofs.

5571:

5572: Let us describe now the changes to be made to previous

5573: notations to adapt them to the transductive setting.

5574: The distribution $\PP$ will be a probability measure on the

5575: canonical space $\Omega = (\C{X} \times \C{Y})^{(k+1)N}$,

5576: and $(X_i,Y_i)_{i=1}^{(k+1)N}$

5577: will be the canonical process on this space

5578: (that is the coordinate process).

5579: Unless explicitly mentioned, the parameter $k$ indicating the

5580: size of the shadow sample will remain fixed.

5581: Assuming the shadow sample size is a multiple of the

5582: training sample size is convenient without significantly

5583: restricting the generality.

5584: For a while, we will use a weaker assumption than independence,

5585: assuming that $\PP$ is {\em partially exchangeable},

5586: since this is all we need in the proofs.

5587: \begin{dfn}

5588: \mypoint For $i = 1, \dots, N$,

5589: let $\tau_i : \Omega \rightarrow \Omega$ be defined

5590: for any \linebreak $\omega = (\omega_j)_{j=1}^{(k+1)N} \in \Omega$ by

5591: $$

5592: \begin{cases}

5593: \tau_i(\omega)_{i + jN} = \omega_{i + (j-1)N}, & j=1, \dots, k,\\

5594: \tau_i(\omega)_{i} = \omega_{i+kN}, & \\

5595: \text{and } \tau_i(\omega)_{m + j N} = \omega_{m + j N}, &

5596: m\neq i, m = 1, \dots, N, j=0, \dots k.

5597: \end{cases}

5598: $$

5599: Clearly, if we arrange the $(k+1)N$ samples in an $N \times (k+1)$ array,

5600: $\tau_i$ performs a circular permutation of $k+1$ entries

5601: on the $i$th row, leaving the

5602: other rows unchanged.

5603: Moreover, all the circular permutations of the $i$th

5604: row have the form $\tau_i^j$, $j$ ranging from $0$ to $k$.

5605:

5606: The probability distribution $\PP$ is said to be partially exchangeable if

5607: for any $i = 1, \dots, N$, $\PP \circ \tau_i^{-1} = \PP$.

5608:

5609: This means equivalently that for any

5610: bounded measurable function $h : \Omega \rightarrow \RR$,  $\PP ( h \circ \tau_i) = \PP (h)$.

5611:

5612: In the same way a function $h$ defined on $\Omega$ will be said to

5613: be partially exchangeable if $h \circ \tau_i = h$ for

5614: any $i=1, \dots, N$.

5615: Accordingly a posterior distribution

5616: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta, \C{T})$ will be said to

5617: be partially exchangeable when $\rho(\omega, A) = \rho \bigl[\tau_i(\omega), A

5618: \bigr]$, for any $\omega \in \Omega$, any $i = 1, \dots, N$

5619: and any $A \in \C{T}$.

5620: \end{dfn}
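As a concrete illustration of this definition (the values $N = 2$ and $k = 2$
are chosen here only for the sake of the example), the array has rows
$(\omega_1, \omega_3, \omega_5)$ and $(\omega_2, \omega_4, \omega_6)$, and
$$
\tau_1(\omega_1, \omega_2, \omega_3, \omega_4, \omega_5, \omega_6)
= (\omega_5, \omega_2, \omega_1, \omega_4, \omega_3, \omega_6),
$$
so that $\tau_1$ shifts the first row circularly and leaves the second row
untouched, while $\tau_1^3$ is the identity.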

5621: For any bounded measurable function $h$, let us define

5622: $T_i(h) = \frac{1}{k+1} \sum_{j=0}^k h \circ \tau_i^j$.

5623: Let $T(h) = T_N \circ \dots \circ T_1(h)$.

5624: For any partially exchangeable probability distribution $\PP$, and for

5625: any bounded measurable function $h$, $\PP \bigl[ T(h) \bigr] = \PP(h)$.

5626: Let us put

5627: \renewcommand{\rr}{\overline{r}}

5628: \begin{align*}

5629: \sigma_i(\theta)  & = \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr],

5630: \quad \begin{tabular}[t]{l}indicating the success or failure of $f_{\theta}$\\

5631: to predict $Y_i$ from $X_i$,\end{tabular}\\

5632: r_1(\theta) & = \frac{1}{N} \sum_{i=1}^N \sigma_i(\theta),

5633: \quad \begin{tabular}[t]{l} the empirical error rate of $f_{\theta}$ \\

5634: on the observed sample,\end{tabular}\\

5635: r_2(\theta) & = \frac{1}{kN} \sum_{i=N+1}^{(k+1)N}

5636: \sigma_i(\theta),\quad \text{the error rate of $f_{\theta}$

5637: on the shadow sample,}\\

5638: \rr(\theta) & = \frac{r_1(\theta) + k r_2(\theta)}{k+1}

5639: = \frac{1}{(k+1)N} \sum_{i=1}^{(k+1)N}

5640: \sigma_i(\theta), \quad \begin{tabular}[t]{l}the global error \\

5641: rate of $f_{\theta}$,\end{tabular}\\

5642: R_i(\theta) & = \PP \bigl[ f_{\theta}(X_i) \neq Y_i \bigr],\quad

5643: \begin{tabular}[t]{l}the expected error \\ rate of $f_{\theta}$ on the $i$th

5644: input,\end{tabular}\\

5645: R(\theta) & = \frac{1}{N} \sum_{i=1}^N R_i(\theta) =

5646: \PP \bigl[ r_1(\theta) \bigr] = \PP \bigl[ r_2(\theta) \bigr],

5647: \quad \text{the average expected} \\*  \text{error} & \text{ rate of $f_{\theta}$

5648: on all inputs.}

5649: \end{align*}

5650: We will allow for posterior

5651: distributions $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$

5652: depending on the shadow sample. The most interesting ones will anyhow

5653: be independent of the shadow labels $Y_{N+1}, \dots, Y_{(k+1)N}$.

5654: We will be interested in the conditional expected

5655: error rate of the randomized classification

5656: rule described by $\rho$ on the shadow sample, given the observed

5657: sample, which reads as

5658: $\PP \bigl[ \rho(r_2) \lvert (X_i,Y_i)_{i=1}^N\bigr]$.

5659:

5660: Let us comment on the case when $\PP$ is invariant

5661: under any permutation of the rows, meaning that

5662: \\ \mbox{} \hfill $\PP

5663: \bigl[ h(\omega \circ s) \bigr] = \PP \bigl[ h(\omega) \bigr]$

5664: for all $s \in \mathfrak{S}(\{i+jN ; j=0, \dots, k \})$

5665: \hfill\mbox{}\\ and all $i=1,

5666: \dots, N$ (where $\mathfrak{S}(A)$ is the set of permutations of $A$,

5667: extended to $\{1, \dots, (k+1)N \}$ so as to be the identity outside

5668: of $A$).

5669: In this case, if $\rho$ is invariant under permutations of the rows of

5670: the shadow sample, meaning that $\rho(\omega \circ s) = \rho(\omega)

5671: \in \C{M}_+^1(\Theta)$, $s \in \mathfrak{S}(\{i+jN; j=1, \dots, k \})$,

5672: $i = 1, \dots, N$, then $\PP \bigl[ \rho(r_2) \lvert (X_i,Y_i)_{i=1}^N \bigr] =

5673: \frac{1}{N} \sum_{i=1}^N \PP \bigl[ \rho(\sigma_{i+N})

5674: \lvert (X_i,Y_i)_{i=1}^N \bigr]$, meaning that

5675: the expectation can be taken on a restricted shadow sample

5676: of the same size as the observed sample.

5677: If moreover the rows are equidistributed (meaning that their marginal distributions

5678: are equal), then

5679: \\\mbox{}\hfill $\PP \bigl[ \rho(r_2)

5680: \lvert (X_i,Y_i)_{i=1}^N \bigr] = \PP \bigl[ \rho(\sigma_{N+1})

5681: \lvert (X_i,Y_i)_{i=1}^N \bigr]$. \hfill \mbox{}\\

5682: This means that under these quite commonly fulfilled assumptions,

5683: the expectation can be taken on a single

5684: new object to be classified;

5685: our study thus covers the case when only one of the

5686: patterns from the shadow sample is to be labelled and one is interested

5687: in the expected error rate of this single labelling.

5688: Of course, in the case when

5689: $\PP$ is i.i.d. and $\rho$ depends only on the

5690: training sample $(X_i,Y_i)_{i=1}^N$, we fall back on

5691: the usual criterion of performance

5692: $\PP \bigl[ \rho(r_2) \lvert (Z_i)_{i=1}^N \bigr] = \rho(R)

5693: = \rho(R_1)$.

5694:

5695: Let us recall the notation

5696: $

5697: \Phi_{a}(p) = - a^{-1} \log \bigl\{ 1 - p \bigl[ 1 - \exp( - a) \bigr] \bigr\}.

5698: $

5699:

5700: Using an obvious factorization, and considering for the moment

5701: a fixed value of $\theta$ and any partially exchangeable positive real measurable

5702: function $\lambda : \Omega \rightarrow \RR_+$, we can compute the

5703: $\log$ Laplace transform of $r_1$ under $T$, which acts like a

5704: conditional probability distribution:

5705: \begin{multline*}

5706: \log \Bigl\{ T \bigl[ \exp ( - \lambda r_1 ) \bigr] \Bigr\}

5707: = \sum_{i=1}^N \log \Bigl\{ T_i \bigl[ \exp ( - \tfrac{\lambda}{N} \sigma_i ) \bigr]

5708: \Bigr\}  \\

5709: \leq N \log \biggl\{ \frac{1}{N} \sum_{i=1}^N T_i \Bigl[

5710: \exp \bigl( - \tfrac{\lambda}{N} \sigma_i \bigr) \Bigr] \biggr\}

5711: = - \lambda \Phi_{\frac{\lambda}{N}}(\rr).

5712: \end{multline*}
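For the reader's convenience, here is a sketch of how the steps of the
previous display may be checked. Since $\sigma_i \in \{0, 1\}$, we have
$\exp \bigl( - \tfrac{\lambda}{N} \sigma_i \bigr) = 1 - \sigma_i \bigl[
1 - \exp ( - \tfrac{\lambda}{N} ) \bigr]$. As $\lambda$ is partially
exchangeable, it is left unchanged by $\tau_i$ and behaves as a constant
under $T_i$, so that
\begin{align*}
T_i \bigl[ \exp \bigl( - \tfrac{\lambda}{N} \sigma_i \bigr) \bigr]
& = 1 - T_i(\sigma_i) \bigl[ 1 - \exp ( - \tfrac{\lambda}{N} ) \bigr],\\
\frac{1}{N} \sum_{i=1}^N T_i(\sigma_i) & = \frac{1}{(k+1)N}
\sum_{i=1}^{(k+1)N} \sigma_i = \rr.
\end{align*}
The first equality of the display comes from the fact that $T_i$ acts only on
the $i$th row of the array (this is the factorization), the inequality is
Jensen's inequality for the concave function $\log$, and the last equality
follows from the two identities above and the definition of
$\Phi_{\frac{\lambda}{N}}$.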

5713: Remarking that $T \Bigl\{ \exp \Bigl[

5714: \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) - r_1 \bigr] \Bigr] \Bigr\}

5715: = \exp \bigl[ \lambda \Phi_{\frac{\lambda}{N}}(\rr) \bigr]  T \bigl[

5716: \exp ( - \lambda r_1) \bigr]$ we obtain

5717: \begin{lemma}

5718: \mypoint For any $\theta \in \Theta$ and any partially

5719: exchangeable positive real

5720: measurable function $\lambda : \Omega \rightarrow \RR_+$,

5721: $$

5722: T \Bigl\{ \exp \Bigl[ \lambda \bigl\{ \Phi_{\frac{\lambda}{N}}

5723: \bigl[ \rr(\theta) \bigr]  - r_1(\theta) \bigr\} \Bigr]

5724: \Bigr\} \leq 1.

5725: $$

5726: \end{lemma}

5727: We deduce from this lemma a result analogous to the inductive case:

5728: \begin{thm}

5729: \label{thm1.2}

5730: \mypoint For any partially exchangeable positive real measurable

5731: function $\lambda : \Omega \times \Theta \rightarrow \RR_+$,

5732: for any partially exchangeable posterior distribution

5733: $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,

5734: $$

5735: \PP \biggl\{ \exp \biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}

5736: \rho \Bigl[ \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) - r_1 \bigr] \Bigr]

5737: - \C{K}(\rho, \pi) \biggr] \biggr\} \leq 1.

5738: $$

5739: \end{thm}

5740: The proof is deduced from the previous lemma, using the

5741: fact that $\pi$ is partially exchangeable:

5742: \begin{multline*}

5743: \PP \biggl\{ \exp \biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}

5744: \rho \Bigl[ \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) - r_1 \bigr] \Bigr]

5745: - \C{K}(\rho, \pi) \biggr] \biggr\} \\ =

5746: \PP \biggl\{ \pi \Bigl\{ \exp \Bigl[ \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) -

5747: r_1 \bigr] \Bigr] \Bigr\} \biggr\} =

5748: \PP \biggl\{ T \pi \Bigl\{ \exp \Bigl[ \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) -

5749: r_1 \bigr] \Bigr] \Bigr\} \biggr\} \\ =

5750: \PP \biggl\{  \pi \Bigl\{ T \exp \Bigl[ \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) -

5751: r_1 \bigr] \Bigr] \Bigr\} \biggr\} \leq 1.

5752: \end{multline*}
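Let us sketch the justification of each step. The first equality is the usual
duality formula for the Kullback divergence, recalled here for a bounded
measurable function $h : \Theta \rightarrow \RR$,
$$
\sup_{\rho \in \C{M}_+^1(\Theta)} \bigl\{ \rho(h) - \C{K}(\rho, \pi) \bigr\}
= \log \bigl\{ \pi \bigl[ \exp (h) \bigr] \bigr\},
$$
applied to $h = \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) - r_1 \bigr]$.
The second equality uses the invariance $\PP \bigl[ T(g) \bigr] = \PP(g)$.
The third one uses the partial exchangeability of $\pi$: for any bounded
measurable function $g$ defined on $\Omega \times \Theta$, integrating with
respect to $\pi$ and composing with $\tau_i$ can be done in any order,
so that $T$ and $\pi$ commute.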

5753:

5754: Introducing in the same way

5755: \newcommand{\Bm}{\overline{m}}

5756: \begin{align*}

5757: m'(\theta, \theta') & = \frac{1}{N}

5758: \sum_{i=1}^{N} \Bigl\lvert \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr]

5759: - \B{1}\bigl[ f_{\theta'}(X_i) \neq Y_i \bigr] \Bigr\rvert\\

5760: \text{and } \quad \Bm(\theta, \theta') & = \frac{1}{(k+1)N}

5761: \sum_{i=1}^{(k+1)N} \Bigl\lvert \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr]

5762: - \B{1}\bigl[ f_{\theta'}(X_i) \neq Y_i \bigr] \Bigr\rvert,

5763: \end{align*}

5764: we could prove along the same line of reasoning

5765: \begin{thm}\mypoint

5766: For any real parameter $\lambda$, any $\T \in \Theta$, any partially exchangeable

5767: posterior distribution $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,

5768: \begin{multline*}

5769: \PP \biggl\{ \exp \biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}

5770: \lambda \Bigl[ \rho \bigl\{

5771: \Psi_{\frac{\lambda}{N}} \bigl[ \rr(\cdot) - \rr(\T), \Bm(\cdot, \T)\bigr]

5772: \bigr\} \\* -

5773: \bigl[ \rho(r_1) - r_1(\T) \bigr] \Bigr] - \C{K}(\rho, \pi) \biggr] \biggr\}

5774: \leq 1.

5775: \end{multline*}

5776: \end{thm}

5777: \begin{thm}\mypoint

5778: For any real constant $\gamma$, for any $\T \in \Theta$,

5779: for any partially exchangeable posterior distribution $\pi : \Omega

5780: \rightarrow \C{M}_+^1(\Theta)$,

5781: \begin{multline*}

5782: \PP \Biggl\{ \exp \Biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}

5783: \biggl\{ - N \rho \Bigl\{ \log \Bigl[ 1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr) \bigl[ \rr(\cdot) - \rr(\T) \bigr]

5784: \Bigr] \Bigr\} \\

5785: - \gamma

5786: \bigl[\rho(r_1) - r_1(\T) \bigr] -

5787: N \log \bigl[ \cosh \bigl( \tfrac{\gamma}{N} \bigr) \bigr] \rho \bigl[ m'( \cdot, \T) \bigr] -

5788: \C{K}(\rho, \pi) \biggr\} \Biggr] \Biggr\} \leq 1.

5789: \end{multline*}

5790: \end{thm}

5791: This last theorem can be generalized to give

5792: \begin{thm}\mypoint

5793: For any real constant $\gamma$, for any partially

5794: exchangeable posterior distributions $\pi^1, \pi^2: \Omega

5795: \rightarrow \C{M}_+^1(\Theta)$,

5796: \begin{multline*}

5797: \PP \Biggl\{ \exp \Biggl[

5798: \sup_{\rho_1, \rho_2 \in \C{M}_+^1(\Theta)}

5799: \biggl\{

5800: - N \log \Bigl\{ 1 - \tanh\bigl( \tfrac{\gamma}{N} \bigr)

5801: \bigl[ \rho_1(\rr) - \rho_2(\rr) \bigr] \Bigr\} \\

5802: - \gamma \bigl[ \rho_1(r_1) - \rho_2(r_1) \bigr]

5803: - N \log \bigl[ \cosh \bigl( \tfrac{\gamma}{N}

5804: \bigr) \bigr]

5805: \rho_1 \otimes \rho_2 (m') \\ - \C{K}(\rho_1, \pi^1) -

5806: \C{K}(\rho_2, \pi^2) \biggr\} \Biggr] \Biggr\} \leq 1.

5807: \end{multline*}

5808: \end{thm}

5809:

5810: To conclude this section, we see that the basic theorems of transductive PAC-Bayesian

5811: classification have exactly the same form as the basic inequalities of inductive

5812: classification, Theorems \ref{thm2.3}, \ref{thm4.1} and \ref{thm2.2.18}

5813: {\em with $R(\theta)$ replaced with $\rr(\theta)$}, $r(\theta)$ replaced

5814: with $r_1(\theta)$ and $M'(\theta, \T)$

5815: replaced with $\Bm(\theta, \T)$.

5816: \label{page97}

5817:

5818: {\em Thus all the results of the first section remain true under the hypotheses

5819: of transductive classification, with $R(\theta)$ replaced with $\rr(\theta)$,

5820: $r(\theta)$ replaced with $r_1(\theta)$

5821: and $M'(\theta, \T\,)$ replaced with $\Bm(\theta, \T)$.}

5822:

5823: {\em Consequently, in the case when the unlabelled shadow sample is observed,

5824: it is possible

5825: to improve on Vapnik's bounds to be discussed hereafter by using

5826: an explicit partially exchangeable posterior distribution $\pi$ and

5827: resorting to localized or to relative bounds (in the case at least of

5828: unlimited computing resources, which of course may still be unrealistic

5829: in many real world situations, and with the caveat, to be recalled in

5830: the conclusion of this article, that for small sample sizes and comparatively

5831: complex classification models, the improvement may not be so decisive).}

5832:

5833: Let us notice also that the transductive setting, when experimentally available,

5834: has the advantage that

5835: \newcommand{\Bd}{\overline{d}}

5836: \begin{multline*}

5837: \Bd(\theta, \theta') = \frac{1}{(k+1)N}

5838: \sum_{i=1}^{(k+1)N} \B{1} \bigl[ f_{\theta'}(X_i) \neq f_{\theta}(X_i) \bigr]

5839: \\ \geq \Bm(\theta, \theta') \geq \rr(\theta) - \rr(\theta'), \qquad

5840: \theta, \theta' \in \Theta,

5841: \end{multline*}

5842: is observable in this context, providing an empirical upper bound for

5843: the difference

5844: $\rr(\wtheta) - \rho(\rr)$ for any non randomized estimator

5845: $\wtheta$ and any posterior distribution $\rho$, namely

5846: $$

5847: \rr(\wtheta) \leq \rho(\rr) + \rho\bigl[\,\Bd( \cdot, \wtheta)\bigr].

5848: $$
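This inequality may be checked as follows: for any $\theta \in \Theta$,
the chain of inequalities displayed above, applied to the couple
$(\wtheta, \theta)$, gives
$$
\rr(\wtheta) - \rr(\theta) \leq \Bm(\wtheta, \theta) \leq \Bd(\wtheta, \theta)
= \Bd(\theta, \wtheta),
$$
and it suffices to integrate both sides with respect to $\rho(d \theta)$.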

5849: Thus in the setting of transductive statistical experiments,

5850: the PAC-Bayesian framework provides fully empirical bounds

5851: for the error rate of non randomized estimators $\wtheta :

5852: \Omega \rightarrow \Theta$, even when using a non atomic

5853: prior $\pi$ (or more generally a non atomic partially exchangeable

5854: posterior distribution $\pi$), when $\Theta$

5855: is not a vector space and $\theta \mapsto R(\theta)$

5856: cannot be proved to be convex on the support of some useful

5857: posterior distribution $\rho$.

5858:

5859: \subsection{Vapnik's bounds for transductive classification}

5860: In this section, we are going to stick to plain unlocalized non relative

5861: bounds. As we have already mentioned (and as it was put forward

5862: by Vapnik himself in his seminal works), these bounds are not always

5863: superseded by the asymptotically better ones, and deserve all our efforts

5864: since they deal in many situations better with small samples.

5865: \subsubsection{With a shadow sample of arbitrary size}

5866: The great thing with the transductive setting is that we are manipulating

5867: only $r_1$ and $\rr$ which can take but a finite number of values

5868: and therefore are piecewise constant on $\Theta$. To make use of this,

5869: let us consider for any value $\theta \in \Theta$ of the parameter

5870: the subset $\Delta(\theta) \subset \Theta$ of parameters $\theta'$ such

5871: that the classification rule $f_{\theta'}$ answers the same on the

5872: extended sample $(X_i)_{i=1}^{(k+1)N}$ as $f_{\theta}$. Namely, let us put

5873: for any $\theta \in \Theta$

5874: $$

5875: \Delta(\theta) = \bigl\{ \theta' \in \Theta ; f_{\theta'}(X_i) = f_{\theta}(X_i),

5876: i = 1, \dots, (k+1)N \bigr\}.

5877: $$

5878: We see immediately that $\Delta(\theta)$ is an exchangeable parameter subset on

5879: which $r_1$ and $r_2$ (and therefore also $\rr$) take a constant value.

5880: Thus for any $\theta \in \Theta$ we may consider the posterior $\rho_{\theta}$

5881: defined by

5882: $$

5883: \frac{d\rho_{\theta}}{d \pi}(\theta') = \B{1} \bigl[ \theta' \in \Delta(\theta) \bigr]\pi

5884: \bigl[ \Delta(\theta) \bigr]^{-1},

5885: $$

5886: and use the fact that $\rho_{\theta}(r_1) = r_1(\theta)$ and $\rho_{\theta}(\rr) = \rr(\theta)$,

5887: to prove that

5888: \begin{lemma}

5889: \mypoint For any partially exchangeable positive real measurable function

5890: $\lambda : \Omega \times \Theta \rightarrow \RR$ such that

5891: \begin{equation}

5892: \label{eq2.2.1}

5893: \lambda(\omega, \theta') = \lambda(\omega, \theta), \quad \theta \in \Theta, \theta'

5894: \in \Delta(\theta), \omega \in \Omega,

5895: \end{equation}

5896: and any partially exchangeable posterior distribution

5897: $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,

5898: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

5899: $$

5900: \Phi_{\frac{\lambda}{N}}\bigl[ \rr(\theta) \bigr] + \frac{\log \bigl\{ \epsilon \pi \bigl[

5901: \Delta(\theta) \bigr] \bigr\}}{\lambda(\theta)} \leq r_1(\theta).

5902: $$

5903: \end{lemma}
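Here is a sketch of the argument, in the case when
$\pi \bigl[ \Delta(\theta) \bigr] > 0$ (otherwise the inequality is trivial).
From the definition of $\rho_{\theta}$,
$$
\C{K}(\rho_{\theta}, \pi) = \rho_{\theta} \Bigl[ \log \Bigl(
\frac{d \rho_{\theta}}{d \pi} \Bigr) \Bigr] = - \log \bigl\{
\pi \bigl[ \Delta(\theta) \bigr] \bigr\}.
$$
Moreover $\lambda$, $r_1$ and $\rr$ are constant on $\Delta(\theta)$.
Theorem \ref{thm1.2} and Markov's inequality show that with $\PP$
probability at least $1 - \epsilon$ the supremum in $\rho$ appearing in this
theorem is not greater than $- \log(\epsilon)$. Taking $\rho = \rho_{\theta}$,
we get simultaneously for any $\theta \in \Theta$
$$
\lambda(\theta) \Bigl\{ \Phi_{\frac{\lambda}{N}} \bigl[ \rr(\theta) \bigr]
- r_1(\theta) \Bigr\} + \log \bigl\{ \pi \bigl[ \Delta(\theta) \bigr] \bigr\}
\leq - \log(\epsilon),
$$
which is the stated inequality after dividing by $\lambda(\theta)$.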

5904: We can then remark that for any value of $\lambda$ independent of $\omega$,

5905: the left-hand side of the previous inequality is a partially exchangeable function of

5906: $\omega \in \Omega$. Thus this left-hand side is maximized by some

5907: partially exchangeable function $\lambda$, namely $$

5908: \arg\max_{\lambda}

5909: \Phi_{\frac{\lambda}{N}} \bigl[ \rr(\theta) \bigr]

5910: + \frac{\log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda}

5911: $$

5912: is partially exchangeable as depending only on partially exchangeable quantities.

5913: Moreover this choice of $\lambda(\omega, \theta)$ satisfies also condition

5914: \eqref{eq2.2.1}

5915: stated in the previous lemma of being constant on $\Delta(\theta)$,

5916: proving

5917: \begin{lemma}

5918: \mypoint For any partially exchangeable posterior distribution $\pi : \Omega \rightarrow

5919: \C{M}_+^1(\Theta)$, with $\PP$ probability at least $1 - \epsilon$,

5920: for any $\theta \in \Theta$ and any $\lambda \in \RR_+$,

5921: $$

5922: \Phi_{\frac{\lambda}{N}} \bigl[ \rr(\theta) \bigr] + \frac{\log \bigl\{

5923: \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda} \leq r_1(\theta).

5924: $$

5925: \end{lemma}

5926:

5927: Writing $\rr = \frac{r_1 + k r_2}{k+1}$ and rearranging terms we obtain

5928: \begin{thm}

5929: \label{thm2.1.5}

5930: \mypoint For any partially exchangeable posterior

5931: distribution $\pi : \Omega \rightarrow

5932: \C{M}_+^1(\Theta)$, with $\PP$ probability at least $1 - \epsilon$,

5933: for any $\theta \in \Theta$,

5934: $$

5935: r_2(\theta) \leq \frac{k+1}{k} \inf_{\lambda \in \RR_+}

5936: \frac{\ds 1 - \exp \left( - \frac{\lambda}{N} r_1(\theta) + \frac{ \log \bigl\{

5937: \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{N} \right)}{\ds 1

5938: - \exp \bigl( - \tfrac{\lambda}{N}\bigr)} - \frac{r_1(\theta)}{k}.

5939: $$

5940: \end{thm}
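The rearrangement may be spelled out as follows. By definition of
$\Phi_{\frac{\lambda}{N}}$, the inequality of the previous lemma reads
$$
- \frac{N}{\lambda} \log \Bigl\{ 1 - \rr(\theta) \bigl[ 1 - \exp \bigl(
- \tfrac{\lambda}{N} \bigr) \bigr] \Bigr\} \leq r_1(\theta) -
\frac{\log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda},
$$
that is
$$
\rr(\theta) \leq \frac{\ds 1 - \exp \left( - \frac{\lambda}{N} r_1(\theta)
+ \frac{\log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{N}
\right)}{\ds 1 - \exp \bigl( - \tfrac{\lambda}{N} \bigr)},
$$
and it remains to substitute
$\ds r_2 = \frac{(k+1) \rr - r_1}{k}$.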

5941:

5942: Let us remind the reader that in the case when we have a set of binary

5943: classification rules $\{ f_{\theta}; \theta \in \Theta \}$ whose

5944: VC dimension is not greater than $h$, then we can choose $\pi$ such

5945: that $\pi \bigl[ \Delta(\theta) \bigr]$ is independent of $\theta$

5946: and not less than

5947: $\ds \left(\frac{h}{e(k+1)N}\right)^h$.

5948:

5949: Another important case when the complexity term $- \log \bigl\{

5950: \pi \bigl[ \Delta(\theta) \bigr] \bigr\}$ can easily be controlled

5951: is the setting of {\em compression schemes},

5952: introduced by Littlestone and Warmuth \cite{Little}.

5953: In this case, we are given for each labelled subsample

5954: $(X_i, Y_i)_{i \in J}$, $J \subset \{1, \dots, N\}$,

5955: an estimator of the parameter

5956: $$

5957: \wtheta\bigl[ (X_i, Y_i)_{i \in J} \bigr]

5958: = \wtheta_J, \quad J \subset \{ 1, \dots, N \}, \lvert J \rvert \leq h,

5959: $$

5960: \label{compression} where

5961: $$

5962: \wtheta : \bigsqcup_{k=1}^N \bigl( \C{X} \times \C{Y} \bigr)^k \rightarrow \Theta

5963: $$

5964: is a function providing estimators for

5965: subsamples of arbitrary size.

5966: Let us assume that $\w{\theta}$

5967: is exchangeable, meaning that for any $k = 1, \dots, N$ and

5968: any permutation $\sigma$ of $\{1, \dots, k\}$

5969: $$

5970: \w{\theta} \bigl[ (x_i, y_i)_{i=1}^k \bigr]

5971: = \w{\theta} \bigl[ (x_{\sigma(i)}, y_{\sigma(i)})_{i=1}^k

5972: \bigr], \qquad

5973: (x_i, y_i)_{i=1}^k \in \bigl( \C{X} \times \C{Y} \bigr)^k.

5974: $$

5975: In this situation, we can introduce the exchangeable subset

5976: $$

5977: \Bigl\{ \wtheta_J ; J \subset \{1, \dots, (k+1)N\}, \lvert J

5978: \rvert \leq h \Bigr\} \subset \Theta,

5979: $$

5980: which is seen to contain at most $\ds \sum_{j=0}^h \binom{(k+1)N}{j}

5981: \leq \left( \frac{e(k+1)N}{h} \right)^h$ classification rules

5982: (as will be proved later on in Theorem \ref{th2} on page \pageref{th2}).

5983: Note that we had to extend the range of $J$ to all the subsets

5984: of the extended sample, although we will use for estimation

5985: only those of the training sample, on which the labels

5986: are observed.

5987: Thus in this case also we can find a partially exchangeable posterior

5988: distribution $\pi$ such that $\ds \pi \bigl[ \Delta(\wtheta_J) \bigr]

5989: \geq \left( \frac{h}{e(k+1)N} \right)^h$. We see that the size of

5990: the compression scheme plays the same role in this complexity bound

5991: as the VC dimension for VC classes.

5992:

5993: In these two cases of binary classification with VC dimension

5994: not greater than $h$ and compression schemes depending on a

5995: compression set with at most $h$ points, we get a bound of

5996: \begin{multline*}

5997: r_2(\theta) \leq \frac{k+1}{k} \inf_{\lambda \in \RR_+}

5998: \frac{\ds 1 - \exp \left( - \frac{\lambda}{N} r_1(\theta) - \frac{ h

5999: \log \left( \frac{e(k+1)N}{h} \right) - \log(\epsilon)}{N} \right)}{\ds 1

6000: - \exp \bigl( - \tfrac{\lambda}{N}\bigr)} \\ - \frac{r_1(\theta)}{k}.

6001: \end{multline*}

6002: Let us give a numerical illustration: when $N = 1000$, $h = 10$, $\epsilon = 0.01$,

6003: and $\inf_{\Theta} r_1 = r_1(\w{\theta}) = 0.2$,

6004: we find that $r_2(\w{\theta}) \leq 0.4093$, for $k$ between

6005: $15$ and $17$, and values of $\lambda$ equal respectively to $965$,

6006: $968$ and $971$. For $k=1$, we find only $r_2(\w{\theta}) \leq 0.539$, showing

6007: the interest of allowing $k$ to be larger than $1$.
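As a rough check of the order of magnitude in the case $k = 1$
(the value $\lambda = 850$ used below is picked by hand for this illustration
and is not claimed to be the exact minimizer), the complexity term is
$$
h \log \left( \frac{e(k+1)N}{h} \right) - \log(\epsilon)
= 10 \bigl[ 1 + \log(200) \bigr] + \log(100) \simeq 67.59,
$$
so that the right-hand side of the bound is approximately
$$
2 \times \frac{1 - \exp \bigl( - \frac{850 \times 0.2 + 67.59}{1000} \bigr)}{
1 - \exp( - 0.85)} - 0.2 \simeq 2 \times \frac{0.2115}{0.5726} - 0.2
\simeq 0.539.
$$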

6008:

6009: \subsubsection{When the shadow sample has the same size as the training sample}

6010: In the case when $k = 1$, we can improve Theorem \ref{thm1.2} by taking advantage

6011: of the fact that $T_i(\sigma_i)$ can take only $3$ values, namely $0$, $0.5$

6012: and $1$. We see thus that $T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}\bigl[

6013: T_i(\sigma_i) \bigr]$ can take only two values, $0$ and $\frac{1}{2} - \Phi_{\frac{

6014: \lambda}{N}}(\frac{1}{2})$, because $\Phi_{\frac{\lambda}{N}}(0) = 0$ and

6015: $\Phi_{\frac{\lambda}{N}}(1) = 1$. Thus

6016: $$

6017: T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}} \bigl[ T_i(\sigma_i) \bigr]

6018: = \bigl[ 1 - \lvert 1 - 2 T_i(\sigma_i) \rvert \bigr] \bigl[

6019: \tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}(\tfrac{1}{2}) \bigr].

6020: $$

6021: This shows that in the case when $k=1$,

6022: \begin{multline*}

6023: \log \Bigl\{ T \bigl[ \exp ( - \lambda r_1) \bigr] \Bigr\}

6024: = - \lambda \rr

6025: + \frac{\lambda}{N} \sum_{i=1}^N \Bigl\{ T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}

6026: \bigl[ T_i(\sigma_i) \bigr] \Bigr\}\\

6027: = - \lambda \rr + \frac{\lambda}{N} \sum_{i=1}^N \bigl[ 1 - \lvert 1 - 2 T_i(\sigma_i) \rvert

6028: \bigr] \bigl[ \tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}(\tfrac{1}{2}) \bigr]

6029: \\ \leq - \lambda \rr + \lambda \bigl[ \tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}(\tfrac{1}{2}) \bigr] \bigl[ 1 - \lvert 1 - 2 \rr \rvert \bigr].

6030: \end{multline*}

6031: Noticing that $\frac{1}{2} - \Phi_{\frac{\lambda}{N}}(\frac{1}{2}) =

6032: \frac{N}{\lambda} \log \bigl[ \cosh(\frac{\lambda}{2N}) \bigr]$,

6033: we obtain

6034: \begin{thm}

6035: \mypoint For any partially exchangeable function $\lambda : \Omega \times \Theta

6036: \rightarrow \RR_+$, for any partially exchangeable posterior distribution

6037: $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,

6038: \begin{multline*}

6039: \PP \biggl\{ \exp \biggl[

6040: \sup_{\rho \in \C{M}_+^1(\Theta)}

6041: \rho \Bigl[ \lambda ( \rr - r_1) \\ -

6042: N \log \bigl[ \cosh(\tfrac{\lambda}{2N}) \bigr]

6043: \bigl( 1 - \lvert 1 - 2 \rr \rvert \bigr) \Bigr] - \C{K}(\rho, \pi) \biggr]

6044: \biggr\} \leq 1.

6045: \end{multline*}

6046: \end{thm}
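Two elementary facts were used to reach this theorem, which we sketch here
for completeness. First, putting $a = \frac{\lambda}{N}$, the definition of
$\Phi_a$ gives
$$
\Phi_a(\tfrac{1}{2}) = - a^{-1} \log \Bigl[ \frac{1 + \exp(-a)}{2} \Bigr]
= - a^{-1} \log \bigl[ \exp( - \tfrac{a}{2}) \cosh(\tfrac{a}{2}) \bigr]
= \tfrac{1}{2} - a^{-1} \log \bigl[ \cosh(\tfrac{a}{2}) \bigr].
$$
Second, the function $t \mapsto 1 - \lvert 1 - 2 t \rvert$ is concave on
$[0,1]$ and $\frac{1}{N} \sum_{i=1}^N T_i(\sigma_i) = \rr$ when $k = 1$,
so that by Jensen's inequality
$$
\frac{1}{N} \sum_{i=1}^N \bigl[ 1 - \lvert 1 - 2 T_i(\sigma_i) \rvert \bigr]
\leq 1 - \lvert 1 - 2 \rr \rvert,
$$
the factor $\frac{1}{2} - \Phi_{\frac{\lambda}{N}}(\frac{1}{2})$ being non
negative.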

6047: As a consequence, reasoning as previously, we deduce

6048: \begin{thm}

6049: \label{thm2.2.5}

6050: \mypoint In the case when $k=1$,

6051: for any partially exchangeable posterior distribution $\pi: \Omega

6052: \rightarrow \C{M}_+^1(\Theta)$, with $\PP$ probability at least

6053: $1 - \epsilon$, for any $\theta \in \Theta$ and any

6054: $\lambda \in \RR_+$,

6055: $$

6056: \rr(\theta) - \tfrac{N}{\lambda} \log \bigl[

6057: \cosh(\tfrac{\lambda}{2N}) \bigr] \bigl( 1 - \lvert 1

6058: - 2 \rr(\theta) \rvert \bigr) + \frac{ \log \bigl\{ \epsilon

6059: \pi\bigl[\Delta(\theta)\bigr] \bigr\}}{\lambda} \leq r_1(\theta);

6060: $$

6061: and consequently for any $\theta \in \Theta$,

6062: $$

6063: r_2(\theta) \leq 2 \inf_{\lambda \in \RR_+} \frac{\ds r_1(\theta) - \frac{\log \bigl\{

6064: \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda}}{

6065: 1 - \frac{2N}{\lambda} \log \bigl[ \cosh(\frac{\lambda}{2N})

6066: \bigr]} - r_1(\theta).

6067: $$

6068: \end{thm}
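To pass from the first inequality of this theorem to the second one,
we may argue as follows, for $\lambda > 0$. Since
$1 - \lvert 1 - 2 \rr(\theta) \rvert \leq 2 \rr(\theta)$ and
$\log \bigl[ \cosh(\frac{\lambda}{2N}) \bigr] \geq 0$, the first
inequality implies
$$
\rr(\theta) \Bigl\{ 1 - \tfrac{2N}{\lambda} \log \bigl[
\cosh(\tfrac{\lambda}{2N}) \bigr] \Bigr\} \leq r_1(\theta)
- \frac{\log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda}.
$$
As $\log \bigl[ \cosh(x) \bigr] < x$ for any $x > 0$, the factor between
braces is positive. Dividing by this factor and using the identity
$r_2 = 2 \rr - r_1$, valid when $k = 1$, gives the second inequality.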

6069:

6070: In the case of binary classification using a VC class

6071: of VC dimension not greater than $h$, we can choose $\pi$ such that

6072: $- \log \bigl\{ \pi \bigl[ \Delta(\theta) \bigr] \bigr\}

6073: \leq h \log ( \frac{2eN}{h})$ and obtain the following

6074: numerical illustration of this theorem: for $N = 1000$, $h = 10$,

6075: $\epsilon = 0.01$ and $\inf_{\Theta} r_1 = r_1(\w{\theta}) = 0.2$,

6076: we find an upper bound $r_2(\w{\theta})

6077: \leq 0.5033$, which improves on Theorem \ref{thm2.1.5} but is still

6078: not below the significance level $\frac{1}{2}$ (achieved by

6079: blind random classification). This indicates that, in some noisy situations,

6080: considering shadow samples of arbitrary sizes brings

6081: a significant improvement on bounds obtained with a shadow sample

6082: of the same size as the training sample.

6083:

6084: \subsubsection{When moreover the distribution of the augmented sample

6085: is exchangeable} In the case when $k=1$ and $\PP$ is exchangeable, meaning that for

6086: any bounded measurable function $h : \Omega \rightarrow \RR$

6087: and any permutation $s \in \mathfrak{S} \bigl(

6088: \{1, \dots, 2N \} \bigr)$, $\PP \bigl[ h( \omega \circ s ) \bigr]

6089: = \PP \bigl[ h(\omega) \bigr]$, then we can still improve the bound

6090: as follows. Let

6091: $$

6092: T' (h) = \frac{1}{N!} \sum_{s \in \mathfrak{S}

6093: \bigl( \{ N+1, \dots, 2N \} \bigr)} h(\omega \circ s).

6094: $$

6095: Then we can write

6096: $$

6097: 1 - \lvert 1 - 2 T_i(\sigma_i) \rvert = (\sigma_i - \sigma_{i+N})^2

6098: = \sigma_i + \sigma_{i+N} - 2 \sigma_i \sigma_{i+N}.

6099: $$

6100: Using this identity, we get for any exchangeable function

6101: $\lambda : \Omega \times \Theta \rightarrow \RR_+$,

6102: $$

6103: T \biggl\{ \exp \biggl[ \lambda (\rr - r_1) - \log \bigl[ \cosh(\tfrac{\lambda}{2N}

6104: ) \bigr] \sum_{i=1}^N \bigl( \sigma_i + \sigma_{i+N} - 2 \sigma_i \sigma_{i+N}

6105: \bigr) \biggr] \biggr\} \leq 1.

6106: $$

6107: Let us put

6108: \label{page39}

6109: \begin{align}

6110: \label{eq2.2}

6111: A(\lambda) & = \tfrac{2N}{\lambda} \log \bigl[ \cosh(\tfrac{\lambda}{2N}

6112: ) \bigr],\\

6113: v(\theta) & = \frac{1}{2N} \sum_{i=1}^N (\sigma_i + \sigma_{i+N}

6114: - 2 \sigma_i \sigma_{i+N}).

6115: \end{align}

6116: With these notations

6117: $$

6118: T \Bigl\{ \exp \bigl\{ \lambda \bigl[ \rr - r_1 - A(\lambda) v \bigr] \bigr\}

6119: \Bigr\} \leq 1.

6120: $$

6121: Let us now notice that

6122: $$

6123: T'\bigl[ v(\theta) \bigr] = \rr(\theta) - r_1(\theta) r_2(\theta).

6124: $$
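This identity can be checked as follows, assuming (as the notation suggests) that $r_1 = \frac{1}{N} \sum_{i=1}^N \sigma_i$, $r_2 = \frac{1}{N} \sum_{i=1}^N \sigma_{i+N}$ and $\rr = \frac{r_1 + r_2}{2}$. Averaging over the permutations of the shadow sample leaves $\sigma_i$ unchanged for $i \leq N$ and replaces $\sigma_{i+N}$ with its mean value, namely $T'(\sigma_{i+N}) = r_2$ and $T'(\sigma_i \sigma_{i+N}) = \sigma_i r_2$. Hence
$$
T' \bigl[ v(\theta) \bigr] = \frac{1}{2N} \sum_{i=1}^N \bigl( \sigma_i + r_2 - 2 \sigma_i r_2 \bigr)
= \frac{r_1 + r_2}{2} - r_1 r_2 = \rr(\theta) - r_1(\theta) r_2(\theta).
$$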

6125: Let $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$ be any given

6126: exchangeable posterior distribution. Using the exchangeability

6127: of $\PP$ and $\pi$ and the convexity of the exponential

6128: function, we get

6129: \begin{align*}

6130: \PP & \Bigl\{ \pi \Bigl[ \exp \bigl\{ \lambda \bigl[

6131: \rr - r_1 - A(\rr - r_1 r_2) \bigr] \bigr\} \Bigr] \Bigr\}

6132:  = \PP \Bigl\{ \pi \Bigl[ \exp \bigl\{ \lambda \bigl[

6133: \rr - r_1 - AT'(v) \bigr] \bigr\} \Bigr] \Bigr\}

6134: \\ & \leq

6135: \PP \Bigl\{ \pi \Bigl[ T' \exp \bigl\{ \lambda \bigl[

6136: \rr - r_1 - Av \bigr] \bigr\} \Bigr] \Bigr\}

6137:  =

6138: \PP \Bigl\{ T' \pi \Bigl[ \exp \bigl\{ \lambda \bigl[

6139: \rr - r_1 - Av \bigr] \bigr\} \Bigr] \Bigr\}

6140: \\ & =

6141: \PP \Bigl\{ \pi \Bigl[ \exp \bigl\{ \lambda \bigl[

6142: \rr - r_1 - Av \bigr] \bigr\} \Bigr] \Bigr\}

6143:  =

6144: \PP \Bigl\{ T \pi \Bigl[ \exp \bigl\{ \lambda \bigl[

6145: \rr - r_1 - Av \bigr] \bigr\} \Bigr] \Bigr\}

6146: \\  & =

6147: \PP \Bigl\{ \pi \Bigl[ T \exp \bigl\{ \lambda \bigl[

6148: \rr - r_1 - Av \bigr] \bigr\} \Bigr] \Bigr\}

6149: \leq 1.

6150: \end{align*}

6151: We are thus ready to state

6152: \begin{thm}

6153: \label{thm3.3.8}

6154: \mypoint

6155: In the case when $k = 1$, for any exchangeable probability distribution $\PP$,

6156: for any exchangeable posterior distribution $\pi : \Omega \rightarrow

6157: \C{M}_+^1(\Theta)$, for any exchangeable function

6158: $\lambda : \Omega \times \Theta \rightarrow \RR_+$,

6159: $$

6160: \PP \biggl\{ \exp \biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}

6161: \rho \Bigl\{ \lambda \bigl[ \rr - r_1 - A(\lambda)(\rr - r_1 r_2)\bigr] \Bigr\}

6162: - \C{K}(\rho, \pi) \biggr] \biggr\} \leq 1,

6163: $$

6164: where $A(\lambda)$ is defined by equation \eqref{eq2.2} above.

6165: \end{thm}

6166: We then deduce as previously

6167: \begin{cor}

6168: \label{thm2.2.6}

6169: \mypoint For any exchangeable posterior distribution $\pi :

6170: \Omega \rightarrow \C{M}_+^1(\Theta)$, for any

6171: exchangeable probability measure $\PP \in \C{M}_+^1(\Omega)$,

6172: for any measurable exchangeable function $\lambda: \Omega \times \Theta

6173: \rightarrow \RR_+$,

6174: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

6175: $$

6176: \rr(\theta) \leq r_1(\theta) + A(\lambda) \bigl[ \rr(\theta) - r_1( \theta)

6177: r_2(\theta) \bigr] - \frac{ \log \bigl\{ \epsilon \pi\bigl[

6178: \Delta(\theta) \bigr] \bigr\}}{\lambda},

6179: $$

6180: where $A(\lambda)$ is defined by equation \eqref{eq2.2}

6181: on page \pageref{eq2.2}.

6182: \end{cor}

6183: In order to deduce an empirical bound from this theorem, we have

6184: to make some choice for $\lambda(\omega, \theta)$.

6185: Fortunately, it is easy to show that the bound indeed holds uniformly

6186: in $\lambda$. This is the case because the inequality can

6187: be rewritten as a function of only one non exchangeable quantity,

6188: namely $r_1(\theta)$. Indeed, since

6189: $r_2 = 2 \rr - r_1$, we see that the

6190: inequality can be written as

6191: $$

6192: \rr(\theta) \leq r_1(\theta) + A(\lambda) \bigl[

6193: \rr(\theta) - 2 \rr(\theta) r_1(\theta) + r_1(\theta)^2 \bigr]

6194: - \frac{\log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta)\bigr] \bigr\}}{\lambda}.

6195: $$

6196: It can be solved in $r_1(\theta)$, to get

6197: $$

6198: r_1(\theta) \geq f \Bigl(\lambda, \rr(\theta), -\log \bigl\{ \epsilon

6199: \pi\bigl[ \Delta(\theta) \bigr] \bigr\} \Bigr),

6200: $$

6201: where

6202: \begin{multline*}

6203: f(\lambda, \rr, d) = \bigl[2 A(\lambda)\bigr]^{-1}

6204: \biggl\{ 2 \rr A(\lambda) - 1 \\ + \sqrt{\bigl[1 - 2 \rr A(\lambda)\bigr]^2

6205: + 4 A(\lambda) \Bigl\{ \rr\bigl[ 1 - A(\lambda) \bigr] - \tfrac{d}{\lambda}

6206: \Bigr\}} \biggr\}.

6207: \end{multline*}
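For the reader's convenience, here is a sketch of the computation behind $f$, with the shorthands $A = A(\lambda)$ and $d = - \log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}$. Gathering the terms according to the powers of $r_1$, the inequality reads
$$
A \, r_1^2 + \bigl( 1 - 2 \rr A \bigr) r_1 - \Bigl\{ \rr \bigl( 1 - A \bigr) - \tfrac{d}{\lambda} \Bigr\} \geq 0.
$$
The left-hand side is a polynomial of degree two in $r_1$ with positive leading coefficient. In the main case of interest, when $\rr (1 - A) \geq \frac{d}{\lambda}$, the product of its roots is non-positive, so that its smaller root is non-positive and the inequality holds for $r_1 \geq 0$ if and only if $r_1$ is not smaller than the larger root, which is
$$
\frac{ - \bigl( 1 - 2 \rr A \bigr) + \sqrt{ \bigl( 1 - 2 \rr A \bigr)^2 + 4 A \bigl\{ \rr ( 1 - A ) - \tfrac{d}{\lambda} \bigr\}}}{2A} = f(\lambda, \rr, d).
$$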

6208: Thus we can find some exchangeable function $\lambda(\omega, \theta)$,

6209: such that

6210: $$

6211: f\Bigl( \lambda(\omega, \theta), \rr(\theta), -

6212: \log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\} \Bigr)

6213: = \sup_{\beta \in \RR_+} f \Bigl( \beta, \rr(\theta), - \log\bigl\{

6214: \epsilon \pi \bigl[ \Delta(\theta) \bigr]\bigr\} \Bigr).

6215: $$

6216: Applying Corollary \ref{thm2.2.6} to that choice of $\lambda$, we

6217: see that

6218: \begin{thm}

6219: \mypoint For any exchangeable probability measure

6220: $\PP \in \C{M}_+^1(\Omega)$, for any exchangeable posterior

6221: probability distribution $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,

6222: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

6223: for any $\lambda \in \RR_+$,

6224: $$

6225: \rr(\theta) \leq  r_1(\theta) + A(\lambda) \bigl[

6226: \rr(\theta) - r_1(\theta) r_2(\theta) \bigr] - \frac{

6227: \log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda},

6228: $$

6229: where $A(\lambda)$ is defined by equation \eqref{eq2.2} on

6230: page \pageref{eq2.2}.

6231: \end{thm}

6232: Solving the previous inequality in $r_2(\theta)$, we get

6233: \begin{cor}

6234: \mypoint Under the same assumptions as in the

6235: previous theorem, with

6236: $\PP$ probability at least $1 - \epsilon$, for any

6237: $\theta \in \Theta$,

6238: $$

6239: r_2(\theta) \leq \inf_{\lambda \in \RR_+}

6240: \frac{\ds r_1(\theta) \Bigl\{ 1 + \tfrac{2N}{\lambda}\log \bigl[

6241: \cosh(\tfrac{\lambda}{2N})\bigr] \Bigr\} - \frac{ 2 \log \bigl\{ \epsilon \pi

6242: \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda}}{\ds 1 - \tfrac{2N}{\lambda}

6243: \log \bigl[ \cosh(\tfrac{\lambda}{2N})\bigr] \bigl[

6244: 1 - 2 r_1(\theta) \bigr]}.

6245: $$

6246: \end{cor}
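The computation leading to this corollary can be sketched as follows, with the shorthands $A = A(\lambda)$ and $d = - \log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}$. Substituting $\rr = \frac{r_1 + r_2}{2}$ in the inequality of the previous theorem and multiplying by $2$, we get
$$
r_1 + r_2 \leq 2 r_1 + A \bigl( r_1 + r_2 - 2 r_1 r_2 \bigr) + \frac{2d}{\lambda},
$$
which, after gathering the terms containing $r_2$, reads
$$
r_2 \bigl[ 1 - A ( 1 - 2 r_1 ) \bigr] \leq r_1 ( 1 + A ) + \frac{2d}{\lambda}.
$$
Since $A < 1$ and $\lvert 1 - 2 r_1 \rvert \leq 1$, the factor $1 - A(1 - 2 r_1)$ is positive, and we can divide by it.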

6247: Applying this to our usual numerical example of a binary classification

6248: model with VC dimension not greater than $h = 10$, when $N=1000$, $

6249: \inf_{\Theta} r_1 = r_1(\w{\theta}) = 0.2$ and

6250: $\epsilon = 0.01$, we obtain that $r_2(\w{\theta}) \leq 0.4450$.

6251:

6252: \subsection{Vapnik's bounds for inductive classification}

6253: \subsubsection{Arbitrary shadow sample size}

6254: \newcommand{\F}[1]{\mathfrak{#1}}

6255: We assume in this section that

6256: $$

6257: \PP = \biggl( \bigotimes_{i=1}^N P_i

6258: \biggr)^{\otimes \, \infty} \in \C{M}_+^1 \Bigl\{ \bigl[

6259: \bigl( \C{X} \times \C{Y} \bigr)^N \bigr]^{\NN} \Bigr\},

6260: $$

6261: where

6262: $P_i \in \C{M}_+^1\bigl( \C{X} \times \C{Y} \bigr)$:

6263: we consider an infinite i.i.d. sequence of independent

6264: {\em not} identically distributed samples of size $N$,

6265: the first one only being observed. The shadow samples will only appear

6266: in the proofs. The aim of this section is to improve on Vapnik's

6267: bounds, generalizing them at the same time to the independent

6268: non i.i.d. setting, which to our knowledge had not been done before.

6269:

6270: Let us introduce the notation $\PP'\bigl[h(\omega) \bigr]  =

6271: \PP \bigl[ h(\omega) \,\lvert\, (X_i,Y_i)_{i=1}^N \bigr]$,

6272: where $h$ may be any suitable (e.g. bounded)

6273: random variable. Let us also put

6274: $\Omega = \bigl[(\C{X} \times \C{Y})^N \bigr]^{\NN}$.

6275: \begin{dfn}

6276: \mypoint For any subset $A \subset \NN$ of

6277: integers, let $\F{C}(A)$ be the set of circular permutations of the

6278: totally ordered set $A$, extended to a permutation of $\NN$ by

6279: taking it to be the identity on the complement $\NN \setminus A$

6280: of $A$.

6281: We will say that a random function $h : \Omega \rightarrow \RR$ is $k$-partially

6282: exchangeable if

6283: $$

6284: h( \omega \circ s ) = h( \omega ), \quad s \in \F{C}\bigl(

6285: \{i + j N\,;\,j=0, \dots, k \} \bigr), i=1, \dots, N.

6286: $$

6287: In the same way, we will say that a posterior distribution

6288: $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$ is $k$-partially

6289: exchangeable if

6290: $$

6291: \pi( \omega \circ s ) = \pi ( \omega ) \in \C{M}_+^1(\Theta), \quad s \in \F{C}\bigl(

6292: \{i + j N\,;\,j=0, \dots, k \} \bigr), i=1, \dots, N.

6293: $$

6294: \end{dfn}

6295: Note that $\PP$ itself is $k$-partially exchangeable for any $k$ in the

6296: sense that for any bounded measurable function $h : \Omega \rightarrow \RR$

6297: $$

6298: \PP \bigl[ h( \omega \circ s ) \bigr]  =  \PP \bigl[ h( \omega ) \bigr] , \quad s \in \F{C}\bigl(

6299: \{i + j N\,;\,j=0, \dots, k \} \bigr), i=1, \dots, N.

6300: $$

6301: Let $\ds

6302: \Delta_k(\theta) = \Bigl\{ \theta' \in \Theta \,;\,

6303: \bigl[ f_{\theta'}(X_i) \bigr]_{i=1}^{(k+1)N} =

6304: \bigl[ f_{\theta}(X_i) \bigr]_{i=1}^{(k+1)N} \Bigr\},$ $\theta \in \Theta,

6305: k \in \NN^*$,

6306: and let also $\ds \rr_k(\theta) = \frac{1}{(k+1)N} \sum_{i=1}^{(k+1) N}

6307: \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr]$.

6308: Theorem \ref{thm1.2} shows that for any positive real parameter

6309: $\lambda$

6310: and any $k$-partially exchangeable posterior distribution $\pi_k : \Omega

6311: \rightarrow \C{M}_+^1(\Theta)$,

6312: $$

6313: \PP \biggl\{ \exp \biggl[ \sup_{\theta \in \Theta}

6314: \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 \bigr]

6315: + \log \bigl\{ \epsilon \pi_k \bigl[ \Delta_k (\theta) \bigr] \bigr\} \biggr] \biggr\}

6316: \leq \epsilon.

6317: $$

6318: Using the general fact that

6319: $$

6320: \PP \bigl[ \exp( h ) \bigr] =

6321: \PP \Bigl\{ \PP' \bigl[ \exp( h) \bigr] \Bigr\} \geq \PP \Bigl\{

6322: \exp \bigl[ \PP' (h) \bigr] \Bigr\},

6323: $$

6324: and the fact that the expectation of a supremum is larger than the

6325: supremum of an expectation, we see that with $\PP$ probability

6326: at least $1 - \epsilon$, for any $\theta \in \Theta$,

6327: $$

6328: \PP'\Bigl\{ \Phi_{\frac{\lambda}{N}} \bigl[ \rr_k(\theta) \bigr]

6329: \Bigr\} \leq r_1(\theta) - \frac{

6330: \PP' \Bigl\{ \log \bigl\{ \epsilon \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\}

6331: \Bigr\}}{\lambda}.

6332: $$

6333: Let us put for short

6334: \newcommand{\dd}{\Bar{d}}

6335: \begin{align*}

6336: \dd_k(\theta)  & = - \log \bigl\{ \epsilon \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\},\\

6337: d'_k(\theta) & = - \PP' \Bigl\{ \log \bigl\{ \epsilon \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\}

6338: \Bigr\},\\

6339: d_k(\theta) & = - \PP \Bigl\{ \log \bigl\{ \epsilon \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\}

6340: \Bigr\}.

6341: \end{align*}

6342: We can use the convexity of $\Phi_{\frac{\lambda}{N}}$ and the fact

6343: that $\PP'(\rr_k) = \frac{r_1 + k R}{k+1}$, to see that

6344: $$

6345: \PP' \Bigl\{ \Phi_{\frac{\lambda}{N}} \bigl[ \rr_k(\theta) \bigr]

6346: \Bigr\} \geq \Phi_{\frac{\lambda}{N}}

6347: \left[ \frac{r_1(\theta) + k R(\theta)}{k+1} \right].

6348: $$

6349: We have proved

6350: \begin{thm}

6351: \mypoint Using the above hypotheses and notations,

6352: for any sequence

6353: $\pi_k : \Omega \rightarrow \C{M}_+^1(\Theta)$, where $\pi_k$

6354: is a $k$-partially exchangeable posterior distribution,

6355: for any positive real constant $\lambda$, any positive integer $k$,

6356: with $\PP$ probability

6357: at least $1 - \epsilon$, for any $\theta \in \Theta$,

6358: $$

6359: \Phi_{\frac{\lambda}{N}} \left[

6360: \frac{ r_1(\theta) + k R(\theta)}{k+1} \right]

6361: \leq r_1(\theta) + \frac{d'_k(\theta)}{\lambda}.

6362: $$

6363: \end{thm}
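To read this theorem as an explicit bound on $R(\theta)$, one can invert $\Phi_{\frac{\lambda}{N}}$. The following sketch assumes the closed form $\Phi_a(p) = - a^{-1} \log \bigl\{ 1 - \bigl[ 1 - \exp(-a) \bigr] p \bigr\}$, recalled here for convenience, which is increasing in $p$ with inverse
$$
\Phi_a^{-1}(q) = \frac{1 - \exp( - a q)}{1 - \exp( - a)}.
$$
Applying $\Phi_{\frac{\lambda}{N}}^{-1}$ to both sides of the inequality and solving in $R(\theta)$, we get
$$
R(\theta) \leq \frac{k+1}{k} \, \frac{1 - \exp \Bigl\{ - \frac{\lambda}{N} r_1(\theta) - \frac{d'_k(\theta)}{N} \Bigr\}}{1 - \exp \bigl( - \frac{\lambda}{N} \bigr)} - \frac{r_1(\theta)}{k}.
$$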

6364: As we did with

6365: Theorem \ref{thm2.7} on page \pageref{thm2.7}, we can make the

6366: result of this theorem uniform in $\lambda \in \{ \alpha^j\,;\,

6367: j \in \NN^* \}$ and $k \in \NN^*$ (considering

6368: on $k$ the prior $\frac{1}{k(k+1)}$ and on $j$ the prior

6369: $\frac{1}{j(j+1)}$), and obtain

6370:

6371: \begin{thm}

6372: \mypoint For any real parameter

6373: $\alpha > 1$, with $\PP$ probability at least $1 - \epsilon$,

6374: for any $\theta \in \Theta$,

6375: \begin{multline*}

6376: R(\theta) \leq  \\* \inf_{k \in \NN^*, j \in \NN^*}

6377: \frac{1 - \exp \biggl\{ - \frac{\alpha^j}{N} r_1(\theta) - \frac{1}{N}

6378: \Bigl\{ d'_k(\theta) + \log \bigl[ k (k+1) j (j+1)\bigr]

6379: \Bigr\} \biggr\}}{\frac{k}{k+1} \left[ 1 -

6380: \exp \left( - \frac{\alpha^j}{N}\right) \right] } \\* - \frac{r_1(\theta)}{k}.

6381: \end{multline*}

6382: \end{thm}

6383: Note that as a special case we can choose $\pi_k$ such that $

6384: \log \bigl\{ \pi_k\bigl[ \Delta_k(\theta) \bigr] \bigr\}$ is independent of

6385: $\theta$ and equal to $- \log (\F{N}_k)$, where $\F{N}_k = \bigl\lvert  \bigl\{

6386: \bigl[ f_{\theta}(X_i) \bigr]_{i=1}^{(k+1)N} \,;\,

6387: \theta \in \Theta \bigr\} \bigr\rvert$ is the size of the trace of the

6388: classification model on the extended sample

6389: of size $(k+1)N$.

6390: With this choice, we obtain a bound involving a new flavour

6391: of conditional Vapnik's entropy, namely

6392: $$

6393: d'_k(\theta) = \PP \bigl[ \log (\F{N}_k) \,\lvert (Z_i)_{i=1}^N \bigr] - \log(\epsilon).

6394: $$

6395:

6396: In the case of binary classification using a VC class of VC dimension not

6397: greater than $h = 10$, when $N = 1000$, $\inf_{\Theta}

6398: r_1 = r_1(\w{\theta}) = 0.2$ and $\epsilon = 0.01$,

6399: choosing $\alpha = 1.1$, we obtain $R(\w{\theta}) \leq 0.4271$

6400: (for an optimal value of $\lambda = 1071.8$, and an optimal

6401: value of $k = 16$).

6402:

6403: \subsubsection{A better minimization with respect to the exponential parameter} If we are not satisfied with optimizing $\lambda$ on a discrete

6404: subset of the real line, we can use a slightly different approach.

6405: From Theorem \ref{thm1.2}, we see that for any positive integer

6406: $k$, for any $k$-partially exchangeable

6407: positive real measurable function $\lambda : \Omega \times \Theta

6408: \rightarrow \RR_+$ satisfying equation \eqref{eq2.2.1} on

6409: page \pageref{eq2.2.1} (with $\Delta(\theta)$ replaced

6410: with $\Delta_k(\theta)$),

6411: for any $\epsilon \in (0,1)$ and $\eta \in (0,1)$,

6412: $$

6413: \PP \biggl\{ \PP' \biggl[ \exp \Bigl[ \sup_{\theta}

6414: \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 \bigr] +

6415: \log \bigl\{ \epsilon \eta \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\}

6416: \Bigr] \biggr] \biggr\}

6417: \leq \epsilon \eta,

6418: $$

6419: therefore with $\PP$ probability at least $1 - \epsilon$,

6420: $$

6421: \PP' \biggl\{ \exp \Bigl[ \sup_{\theta}

6422: \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 \bigr] +

6423: \log \bigl\{ \epsilon \eta \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\}

6424: \Bigr]

6425: \biggr\}

6426: \leq \eta,

6427: $$

6428: and consequently, with $\PP$ probability at least $1 - \epsilon$,

6429: with $\PP'$ probability at least $1 - \eta$, for any $\theta \in \Theta$,

6430: $$

6431: \Phi_{\frac{\lambda}{N}}(\rr_k) +

6432: \frac{\log \bigl\{ \epsilon \eta \pi_{k} \bigl[ \Delta_k(\theta)

6433: \bigr] \bigr\}}{\lambda}

6434: \leq r_1.

6435: $$

6436: Now we are entitled to choose $$

6437: \lambda(\omega, \theta)

6438: \in \arg \max_{\lambda' \in \RR_+} \Phi_{\frac{\lambda'}{N}}(\rr_k)

6439: + \frac{\log \bigl\{ \epsilon \eta \pi_{k} \bigl[ \Delta_k(\theta)

6440: \bigr] \bigr\}}{\lambda'}.

6441: $$

6442: This shows that with $\PP$ probability

6443: at least $1 - \epsilon$, with $\PP'$ probability at least $1 - \eta$,

6444: for any $\theta \in \Theta$,

6445: $$

6446: \sup_{\lambda \in \RR_+} \Phi_{\frac{\lambda}{N}}(\rr_k) -

6447: \frac{\dd_k(\theta) - \log(\eta)}{\lambda}

6448: \leq r_1,

6449: $$

6450: which can also be written

6451: $$

6452: \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 - \frac{

6453: \dd_k(\theta)}{\lambda} \leq - \frac{\log(\eta)}{\lambda}, \quad \lambda \in \RR_+.

6454: $$

6455: Thus with $\PP$ probability at least $1 - \epsilon$,

6456: for any $\theta \in \Theta$, any $\lambda \in \RR_+$,

6457: $$

6458: \PP'\biggl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 -

6459: \frac{\dd_k(\theta)}{\lambda} \biggr] \leq - \frac{

6460: \log(\eta)}{\lambda} + \biggl[1 - r_1 + \frac{\log(\eta)}{\lambda}

6461: \biggr] \eta.

6462: $$

6463: On the other hand, $\Phi_{\frac{\lambda}{N}}$ being a convex function,

6464: \begin{align*}

6465: \PP'\biggl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 -

6466: \frac{\dd_k(\theta)}{\lambda} \biggr]

6467: & \geq \Phi_{\frac{\lambda}{N}}\bigl[ \PP'(\rr_k) \bigr] - r_1

6468: - \frac{d'_k}{\lambda} \\ & = \Phi_{\frac{\lambda}{N}}

6469: \biggl( \frac{kR+r_1}{k+1} \biggr) - r_1 - \frac{d'_k}{\lambda}.

6470: \end{align*}

6471: Thus with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

6472: $$

6473: \frac{kR+r_1}{k+1} \leq \inf_{\lambda \in \RR_+}

6474: \Phi_{\frac{\lambda}{N}}^{-1} \biggl[ r_1(1 - \eta) + \eta +

6475: \frac{d'_k - \log(\eta) (1 - \eta)}{\lambda} \biggr].

6476: $$

6477: We can generalize this approach by considering a finite decreasing sequence

6478: $\eta_0=1 > \eta_1 > \eta_2 > \dots > \eta_J > \eta_{J+1} = 0$, and

6479: the corresponding sequence of levels

6480: \begin{align*}

6481: L_j & = - \frac{\log(\eta_j)}{\lambda}, 0 \leq j \leq J,\\

6482: L_{J+1} & = 1 - r_1 - \frac{\log(J) - \log(\epsilon)}{\lambda}.

6483: \end{align*}

6484: Taking a union bound in $j$, we see that with $\PP$ probability at least $1 - \epsilon$,

6485: for any $\theta \in \Theta$, for any $\lambda \in \RR_+$,

6486: $$

6487: \PP' \biggl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1

6488: - \frac{\dd_k + \log(J)}{\lambda} \geq L_j \biggr] \leq \eta_j, \quad j=0, \dots, J+1,

6489: $$

6490: and consequently

6491: \begin{align*}

6492: \PP' & \biggl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1

6493: - \frac{\dd_k + \log(J)}{\lambda} \biggr] \\

6494: & \leq \int_{0}^{L_{J+1}}

6495: \PP' \biggl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1

6496: - \frac{\dd_k+ \log(J)}{\lambda} \geq \alpha \biggr] d \alpha

6497: \quad \leq \sum_{j=1}^{J+1} \eta_{j-1}(L_j - L_{j-1})

6498: \\ & = \eta_J \biggl[ 1 - r_1 - \frac{\log(J) -

6499: \log(\epsilon) - \log(\eta_J)}{\lambda}

6500: \biggr] - \frac{\log(\eta_1)}{\lambda} + \sum_{j=1}^{J-1}

6501: \frac{\eta_{j}}{\lambda} \log \biggl(

6502: \frac{\eta_{j}}{\eta_{j+1}}\biggr).

6503: \end{align*}

6504: Let us put

6505: \begin{multline*}

6506: d''_k\bigl[\theta, (\eta_j)_{j=1}^J \bigr]

6507: = d'_k(\theta) +

6508: \log(J) - \log(\eta_1)

6509: \\ + \sum_{j=1}^{J-1}

6510: \eta_j \log \left( \frac{\eta_j}{\eta_{j+1}} \right)

6511: + \log\left(\frac{\epsilon \eta_J}{J} \right) \eta_J.

6512: \end{multline*}

6513:

6514: We have proved that for any decreasing sequence $(\eta_j)_{j=1}^J$,

6515: with $\PP$ probability at least $1 - \epsilon$,

6516: for any $\theta \in \Theta$,

6517: $$

6518: \frac{k R + r_1}{k+1}

6519: \leq \inf_{\lambda \in \RR_+}

6520: \Phi_{\frac{\lambda}{N}}^{-1} \biggl[

6521: r_1(1 - \eta_J) + \eta_J +

6522: \frac{ d''_k \bigl[ \theta, (\eta_j)_{j=1}^J \bigr]}{\lambda} \biggr].

6523: $$

6524:

6525: \begin{rmk}

6526: \mypoint We can for instance choose

6527: $J=2$, $\eta_2 = \frac{1}{10N}$, $\eta_1 =

6528: \frac{1}{\log(10 N)}$,

6529: resulting in

6530: $$

6531: d''_k = d'_k + \log(2) + \log\log(10 N) + 1 -

6532: \frac{\log\log(10N)}{\log(10N)} - \frac{\log \left( \frac{20N}{\epsilon} \right)}{10N}.

6533: $$

6534: In the case when $N = 1000$ and for any $\epsilon \in (0,1)$,

6535: we get $d''_k \leq d'_k + 3.7$, in the case when $N = 10^6$,

6536: we get $d''_k \leq d'_k + 4.4$, and in the case $N = 10^9$,

6537: we get $d''_k \leq d'_k + 4.7$.

6538:

6539: Therefore, for any practical

6540: purpose we could take $d''_k = d'_k + 4.7$ and $\eta_J = \frac{1}{10N}$

6541: in the above inequality.

6542: \end{rmk}
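As a quick arithmetic check of these constants, let us put $c(N) = \log(2) + \log\log(10N) + 1 - \frac{\log\log(10N)}{\log(10N)}$ (a notation local to this check, where the last term of $d''_k - d'_k$, being negative, is simply dropped, and where the values are rounded). Then
\begin{align*}
c(10^3) & \simeq 0.693 + 2.220 + 1 - 0.241 \simeq 3.67,\\
c(10^6) & \simeq 0.693 + 2.780 + 1 - 0.172 \simeq 4.30,\\
c(10^9) & \simeq 0.693 + 3.137 + 1 - 0.136 \simeq 4.69,
\end{align*}
in agreement with the upper bounds $3.7$, $4.4$ and $4.7$ quoted in the remark.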

6543:

6544: Taking moreover a weighted union bound in $k$, we get

6545: \begin{thm}

6546: \label{thm2.3.3}

6547: \mypoint For any $\epsilon \in (0,1)$, any sequence

6548: $1 > \eta_1 > \dots > \eta_J > 0$,

6549: any sequence $\pi_k : \Omega \rightarrow \C{M}_+^1(\Theta)$,

6550: where $\pi_k$ is a $k$-partially exchangeable posterior distribution,

6551: with $\PP$ probability at least $1 - \epsilon$, for any $\theta

6552: \in \Theta$,

6553: \begin{multline*}

6554: R(\theta) \leq \inf_{k \in \NN^*} \frac{k+1}{k} \inf_{\lambda \in \RR_+}

6555: \Phi_{\frac{\lambda}{N}}^{-1}

6556: \biggl[ r_1(\theta) + \eta_J \bigl[1 - r_1(\theta) \bigr]

6557: \\ + \frac{d''_k\bigl[\theta, (\eta_j)_{j=1}^J \bigr] + \log\bigl[k(k+1)\bigr]}{\lambda}

6558: \biggr] - \frac{r_1(\theta)}{k}.

6559: \end{multline*}

6560: \end{thm}

6561: \begin{cor}

6562: \label{cor3.3.14}

6563: \mypoint For any $\epsilon \in (0,1)$, for any $N \leq 10^9$, with $\PP$ probability

6564: at least $1 - \epsilon$, for any $\theta \in \Theta$,

6565: \begin{multline*}

6566: R(\theta) \leq

6567: \inf_{k \in \NN^*} \inf_{\lambda \in \RR_+}

6568: \frac{k+1}{k} \bigl[ 1 - \exp( - \tfrac{\lambda}{N}) \bigr]^{-1}

6569: \biggl\{ 1 - \exp \biggl[ - \tfrac{\lambda}{N} \bigl[ r_1(\theta) +

6570: \tfrac{1}{10N} \bigr]

6571: \\ - \frac{ \PP' \bigl[ \log(\F{N}_k)\,\lvert\,(Z_i)_{i=1}^N

6572: \bigr]

6573: - \log(\epsilon) + \log\bigl[k(k+1)\bigr] + 4.7}{N} \biggr]

6574: \biggr\}

6575: - \frac{r_1(\theta)}{k}.

6576: \end{multline*}

6577: \end{cor}

6578:

6579: Let us end this section with a numerical example: in the case of binary classification

6580: with a VC class of dimension not greater than $10$, when $N=1000$,

6581: $\inf_{\Theta} r_1 = r_1(\w{\theta}) = 0.2$

6582: and $\epsilon = 0.01$, we get a bound $R(\w{\theta}) \leq 0.4211$ (for optimal

6583: values of $k = 15$ and of $\lambda = 1010$).

6584:

6585: \subsubsection{Equal shadow and training sample sizes} In the case when $k=1$, we can use Theorem \ref{thm2.2.5}, and replace

6586: $\Phi_{\frac{\lambda}{N}}^{-1}(q)$ with $\bigl\{ 1 - \frac{2N}{\lambda}

6587: \log \bigl[ \cosh(\frac{\lambda}{2N}) \bigr] \bigr\}^{-1}q$,

6588: resulting in

6589: \begin{thm}

6590: \mypoint For any $\epsilon \in (0,1)$, any $N \leq 10^9$, any 1-partially exchangeable

6591: posterior distribution

6592: $\pi_1 : \Omega \rightarrow \C{M}_+^1(\Theta)$,

6593: with $\PP$ probability at least $1 - \epsilon$,

6594: for any $\theta \in \Theta$,

6595: $$

6596: R(\theta) \leq

6597: \inf_{\lambda \in \RR_+} \frac{\ds

6598: \Bigl\{ 1 + \tfrac{2N}{\lambda} \log \bigl[ \cosh(\tfrac{\lambda}{2N}) \bigr] \Bigr\} r_1(\theta)

6599: + \frac{1}{5N} + 2 \frac{d_1'(\theta) + 4.7}{\lambda}}{\ds

6600: 1 - \tfrac{2N}{\lambda} \log \bigl[ \cosh(\tfrac{\lambda}{2N}

6601: ) \bigr]}.

6602: $$

6603: \end{thm}

6604:

6605: \subsubsection{Improvement on the equal sample size bound in the i.i.d.~case}

6606: Finally, in the case when $\PP$ is i.i.d., meaning that all the

6607: $P_i$ are equal, we can improve the previous bound. For any

6608: partially exchangeable function $\lambda : \Omega \times \Theta

6609: \rightarrow \RR_+$, we saw in the discussion preceding Theorem

6610: \ref{thm3.3.8} on page \pageref{thm3.3.8} that

6611: $$

6612: T \Bigl[ \exp \bigl\{ \lambda \bigl[ \rr_k - r_1 - A(\lambda) v \bigr] \bigr\} \Bigr]

6613: \leq 1,

6614: $$

6615: with the notations introduced therein.

6616: Thus for any partially exchangeable positive real measurable function

6617: $\lambda : \Omega \times \Theta \rightarrow \RR_+$ satisfying equation

6618: \eqref{eq2.2.1} on page \pageref{eq2.2.1}, any 1-partially exchangeable

6619: posterior distribution $\pi_1 : \Omega \rightarrow \C{M}_+^1(\Theta)$,

6620: $$

6621: \PP \Bigl\{ \exp \Bigl[  \sup_{\theta \in \Theta}

6622: \lambda \bigl[ \rr_k(\theta) - r_1(\theta) - A(\lambda)v(\theta) \bigr] + \log \bigl[

6623: \epsilon \pi_1 \bigl[ \Delta(\theta) \bigr] \bigr] \Bigr] \Bigr\} \leq \epsilon.

6624: $$

6625: Therefore with $\PP$ probability at least $1 - \epsilon$, with $\PP'$

6626: probability at least $1 - \eta$,

6627: $$

6628: \rr_k(\theta) \leq r_1(\theta) + A(\lambda) v(\theta) + \frac{1}{\lambda} \bigl[

6629: \dd_1(\theta) - \log(\eta) \bigr]

6630: $$

6631:

6632: We can then choose $\ds \lambda(\omega, \theta) \in

6633: \arg\min_{\lambda' \in \RR_+} A(\lambda') v(\theta) + \frac{\dd_1(\theta)

6634: - \log(\eta)}{\lambda'}$, which satisfies the required

6635: conditions, to show that with $\PP$ probability at least $1 - \epsilon$,

6636: for any $\theta \in \Theta$, with $\PP'$ probability at least $1 - \eta$,

6637: for any $\lambda \in \RR_+$,

6638: $$

6639: \rr_k(\theta) \leq r_1(\theta) +

6640: A(\lambda)v(\theta) + \frac{\dd_1(\theta) - \log(\eta)}{\lambda}.

6641: $$

6642:

6643: We can then take a union bound on a decreasing sequence of $J$

6644: values $\eta_1 \geq \dots \geq \eta_J$ of $\eta$.

6645: Slightly weakening the order of quantifiers,

6646: we then obtain the following statement:

6647: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

6648: for any $\lambda \in \RR_+$, for any $j=1, \dots, J$

6649: $$

6650: \PP' \biggl[ \rr_k(\theta) - r_1(\theta) -

6651: A(\lambda) v(\theta) - \frac{\dd_1(\theta) + \log(J)}{\lambda}

6652: \geq - \frac{\log(\eta_j)}{\lambda}  \biggr] \leq \eta_j.

6653: $$

6654: Consequently for any $\lambda \in \RR_+$,

6655: \begin{multline*}

6656: \PP' \biggl[ \rr_k(\theta) - r_1(\theta) -

6657: A(\lambda) v(\theta) - \frac{\dd_1(\theta) + \log(J)}{\lambda} \biggr]

6658: \\ \leq - \frac{  \log(\eta_1)}{\lambda} +

6659: \eta_J \biggl[1 - r_1(\theta) - \frac{\log(J) - \log(\epsilon) - \log(\eta_J)}{\lambda}

6660: \biggr]

6661: \\ + \sum_{j=1}^{J-1} \frac{\eta_{j}}{\lambda} \log \left( \frac{\eta_j}{\eta_{j+1}}

6662: \right).

6663: \end{multline*}

6664: Moreover $\PP' \bigl[ v(\theta) \bigr] = \frac{r_1 + R}{2} - r_1 R$,

6665: (this is where we need equidistribution) thus proving that

6666: $$

6667: \frac{R - r_1}{2} \leq

6668: \frac{A(\lambda)}{2} \Bigl[ R+r_1 - 2 r_1 R \Bigr]

6669: + \frac{

6670: d''_1\bigl[\theta, (\eta_j)_{j=1}^J\bigr]

6671: }{\lambda} + \eta_J\bigl[1 - r_1(\theta)\bigr].

6672: $$

6673: Keeping track of quantifiers, we obtain

6674: \begin{thm}

6675: \label{thm2.3.9}

6676: \mypoint For any decreasing sequence $(\eta_j)_{j=1}^J$, any

6677: $\epsilon \in (0,1)$, any 1-partially exchangeable posterior

6678: distribution $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,

6679: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

6680: \begin{multline*}

6681: R(\theta) \leq \inf_{\lambda \in \RR_+} \\

6682: \frac{\ds \Bigl\{ 1  + \tfrac{2N}{\lambda}\log \bigl[ \cosh(\tfrac{\lambda}{2N})

6683: \bigr] \Bigr\} r_1(\theta) + \frac{2 d''_1\bigl[ \theta, (\eta_j)_{j=1}^J

6684: \bigr] }{\lambda} + 2 \eta_J

6685: \bigl[ 1 - r_1(\theta) \bigr]}{\ds

6686: 1 - \tfrac{2N}{\lambda}\log\bigl[ \cosh(\tfrac{\lambda}{2N})

6687: \bigr] \bigl[ 1 - 2 r_1(\theta) \bigr] }.

6688: \end{multline*}

6689: \end{thm}

6690:

6691: \subsection{Gaussian approximation in Vapnik's bounds}

6692: To obtain formulas which can be easily compared with Vapnik's original bounds,

6693: we may replace $p - \Phi_a(p)$ with a Gaussian upper bound:

6694: \begin{lemma}

6695: \mypoint For any $p \in (0,\frac{1}{2})$, any $a \in \RR_+$,

6696: $$

6697: p - \Phi_a(p) \leq \frac{a}{2} p(1-p).

6698: $$

6699: For any $p \in (\frac{1}{2}, 1)$,

6700: $$

6701: p - \Phi_a(p) \leq \frac{a}{8} .

6702: $$

6703:

6704: \end{lemma}

6705: \begin{proof}

6706: Let us notice that for any $p \in (0,1)$,

6707: \begin{align*}

6708: \frac{\partial}{\partial a} \bigl[ - a \Phi_a(p) \bigr]

6709: & = - \frac{p \exp(-a) }{1 - p + p \exp( - a)},\\

6710: \frac{\partial^2}{\partial a^2} \bigl[ - a \Phi_a(p) \bigr]

6711: & =

6712: \frac{p \exp(-a) }{1 - p + p \exp( - a)}

6713: \left( 1 - \frac{p \exp( - a)}{1 - p + p\exp( - a)} \right) \\

6714: & \leq

6715: \begin{cases}

6716: p(1-p) & p \in (0, \frac{1}{2}),\\

6717: \frac{1}{4} & p \in (\frac{1}{2}, 1).

6718: \end{cases}

6719: \end{align*}

6720: Thus, taking a Taylor expansion of order one with integral remainder, we get

6721: $$

6722: -a \Phi_a(p) \leq

6723: \begin{cases}

6724: \begin{aligned}[b]-a p + \int_0^a p (1-p) & (a-b) db \\

6725: & = -a p + \frac{a^2}{2}p(1-p),\end{aligned} & p \in

6726: (0,\frac{1}{2}),\\

6727: \ds -a p + \int_0^a \frac{1}{4}(a -b) db = -a p + \frac{a^2}{8}, & p \in

6728: (\frac{1}{2}, 1).

6729: \end{cases}

6730: $$

6731: This ends the proof of our lemma. \end{proof}

6732: \begin{lemma}

6733: \mypoint Let us consider the bound

6734: $$

6735: B(q,d) = \left(1 + \frac{2 d}{N} \right)^{-1}

6736: \biggl[ q + \frac{d}{N} + \sqrt{ \frac{2 d q(1-q)}{N}

6737: + \frac{d^2}{N^2}} \biggr], \quad q \in \RR_+, d \in \RR_+.

6738: $$

6739: Let us also put

6740: $$

6741: \Bar{B}(q,d) =

6742: \begin{cases}

6743: B(q,d) & B(q,d) \leq \frac{1}{2},\\

6744: q + \sqrt{\frac{d}{2N}} & \text{ otherwise}.

6745: \end{cases}

6746: $$

6747: For any positive real parameters $q$ and $d$

6748: $$

6749: \inf_{\lambda \in \RR_+} \Phi_{\frac{\lambda}{N}}^{-1}

6750: \biggl( q + \frac{d}{\lambda} \biggr) \leq \Bar{B}(q,d).

6751: $$

6752: \end{lemma}

6753: \begin{proof}

6754: Let $\ds p = \inf_{\lambda} \Phi_{\frac{\lambda}{N}}^{-1} \biggl(

6755: q + \frac{d}{\lambda}\,\biggr)$. For any $\lambda \in \RR_+$,

6756: $$

6757: p - \frac{\lambda}{2N} (p \wedge \tfrac{1}{2})\bigl[1 -

6758: (p \wedge \tfrac{1}{2}) \bigr] \leq \Phi_{\frac{\lambda}{N}}(p)

6759: \leq q + \frac{d}{\lambda}.

6760: $$

6761: Thus

6762: \begin{multline*}

6763: p \leq q + \inf_{\lambda \in \RR_+} \frac{\lambda}{2N}

6764: (p \wedge \tfrac{1}{2}) \bigl[ 1 - ( p \wedge \tfrac{1}{2}) \bigr]

6765: + \frac{d}{\lambda} \\ = q + \sqrt{\frac{2 d

6766: (p \wedge \tfrac{1}{2}) \bigl[ 1 - ( p \wedge \tfrac{1}{2}) \bigr]}{N}}

6767: \leq q + \sqrt{\frac{d}{2N}}.

6768: \end{multline*}

6769: Then let us remark that

6770: $\ds

6771: B(q,d) = \sup \left\{ p' \in \RR_+ \,;\, p' \leq q + \sqrt{\frac{2dp'(1-p')}{N}}

6772: \right\}.$

6773: If moreover $\tfrac{1}{2} \geq B(q,d)$, then according

6774: to this remark $\tfrac{1}{2} \geq q + \sqrt{\frac{d}{2N}} \geq p$.

6775: Therefore $p \leq \tfrac{1}{2}$, and consequently $p \leq q + \sqrt{\frac{2dp(1-p)}{N}}$,

6776: implying that $p \leq B(q,d)$.

6777: \end{proof}
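As a check of the two elementary optimizations used in this proof, note first that for any positive constants $c$ and $d$, the function $\lambda \mapsto \frac{\lambda c}{2N} + \frac{d}{\lambda}$ is minimized on $\RR_+$ at $\lambda = \sqrt{2Nd/c}$, where it is equal to $\sqrt{2dc/N}$. Second, for $q \in [0,1]$, the closed form of $B(q,d)$ can be recovered by squaring the equality case $p' - q = \sqrt{2 d p'(1-p')/N}$, which gives
$$
\Bigl( 1 + \frac{2d}{N} \Bigr) p'^2 - 2 \Bigl( q + \frac{d}{N} \Bigr) p' + q^2 = 0.
$$
The largest root of this polynomial is
$$
\Bigl( 1 + \frac{2d}{N} \Bigr)^{-1} \biggl[ q + \frac{d}{N}
+ \sqrt{ \Bigl( q + \frac{d}{N} \Bigr)^2 - \Bigl( 1 + \frac{2d}{N} \Bigr) q^2 } \biggr],
\quad \text{where } \Bigl( q + \frac{d}{N} \Bigr)^2 - \Bigl( 1 + \frac{2d}{N} \Bigr) q^2
= \frac{2 d q (1-q)}{N} + \frac{d^2}{N^2},
$$
in accordance with the definition of $B(q,d)$.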

6778:

6779: \subsubsection{Arbitrary shadow sample size}

6780: This lemma combined with Corollary \ref{cor3.3.14}

6781: on page \pageref{cor3.3.14} implies

6782: \begin{cor}

6783: \label{cor2.3.7}

6784: \mypoint For any $\epsilon \in (0,1)$, any integer $N \leq 10^9$,

6785: with $\PP$ probability at least $1 - \epsilon$,

6786: for any $\theta \in \Theta$,

6787: $$

6788: R(\theta) \leq \inf_{k \in \NN^*}

6789: \frac{k+1}{k} \Bigl\{

6790: \Bar{B}\Bigl[r_1(\theta) + \frac{1}{10N}, d'_k(\theta) + \log \bigl[

6791: k(k+1)\bigr] + 4.7 \Bigr] \Bigr\} - \frac{r_1(\theta)}{k}.

6792: $$

6793: \end{cor}

6794:

6795: \subsubsection{Equal sample sizes in the i.i.d.~case}

6796: To make a link with Vapnik's result, it is useful to work out

6797: the Gaussian approximation to Theorem \ref{thm2.3.9}

6798: on page \pageref{thm2.3.9}.

6799: Indeed, using the upper bound $A(\lambda) \leq \frac{\lambda}{4N}$,

6800: where $A(\lambda)$ is defined by equation \eqref{eq2.2}

6801: on page \pageref{eq2.2}, we

6802: get with $\PP$ probability at least $1 - \epsilon$

6803: $$

6804: R  - r_1 - 2 \eta_J \leq \inf_{\lambda \in \RR_+}

6805: \frac{\lambda}{4N} \bigl[ R + r_1 - 2 r_1 R \bigr]

6806: + \frac{2 d''_1}{\lambda}

6807: = \sqrt{\frac{2 d''_1 (R + r_1 - 2 r_1 R)}{N}},

6808: $$

6809: which can be solved in $R$ to obtain

6810: \begin{cor}

6811: \label{cor2.3.10}

6812: \mypoint With $\PP$ probability at least

6813: $1 - \epsilon$, for any $\theta \in \Theta$,

6814: \begin{multline*}

6815: R(\theta) \leq r_1(\theta) + \frac{d''_1(\theta)}{N}

6816: \bigl[ 1 - 2 r_1(\theta) \bigr]

6817: + 2 \eta_J

6818: \\ + \sqrt{ \frac{4 d''_1(\theta) \bigl[ 1 - r_1(\theta) \bigr] r_1(\theta)}{N}

6819: + \frac{{d''_1}(\theta)^2}{N^2} \bigl[ 1 - 2 r_1(\theta) \bigr]^2

6820: + \frac{4 d''_1(\theta)}{N} \bigl[ 1 - 2 r_1(\theta) \bigr] \eta_J}.

6821: \end{multline*}

6822: \end{cor}
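For the reader's convenience, here is a sketch of the elementary computation behind this corollary, using the shorthand $x = R - r_1 - 2 \eta_J$ (a notation introduced only for this computation). Since
$$
R + r_1 - 2 r_1 R = (1 - 2 r_1) x + 2 r_1 (1 - r_1) + 2 (1 - 2 r_1) \eta_J,
$$
squaring the inequality $x \leq \sqrt{2 d''_1 (R + r_1 - 2 r_1 R)/N}$ (which is informative only when $x \geq 0$) yields
$$
x^2 - \frac{2 d''_1}{N} (1 - 2 r_1) x - \frac{4 d''_1}{N} r_1 (1 - r_1)
- \frac{4 d''_1}{N} (1 - 2 r_1) \eta_J \leq 0.
$$
Consequently $x$ is not greater than the largest root of this polynomial, that is
$$
x \leq \frac{d''_1}{N} (1 - 2 r_1) + \sqrt{ \frac{{d''_1}^2}{N^2} (1 - 2 r_1)^2
+ \frac{4 d''_1}{N} r_1 (1 - r_1) + \frac{4 d''_1}{N} (1 - 2 r_1) \eta_J},
$$
which is the bound stated in Corollary \ref{cor2.3.10}.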

6823: This is to be compared with Vapnik's result, as proved in \cite[page 138]{Vapnik}:

6824: \begin{thm}[Vapnik]

6825: \label{thmVapnik}

6826: \mypoint For any i.i.d. probability distribution $\PP$,

6827: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

6828: putting

6829: $$

6830: d_V = \log \bigl[ \PP (\F{N}_1) \bigr] + \log(4/\epsilon),

6831: $$

6832: $$

6833: R(\theta) \leq r_1(\theta) + \frac{2 d_V}{N} +

6834: \sqrt{ \frac{4 d_V r_1(\theta)}{N} + \frac{4 d_V^2}{N^2}}.

6835: $$

6836: \end{thm}

6837: Recalling that we can choose $(\eta_j)_{j=1}^2$ such that

6838: $\eta_J = \frac{1}{10N}$ (which is negligible by all means) and

6839: such that for any $N \leq

6840: 10^9$,

6841: $$

6842: d''_1( \theta) \leq \PP \bigl[ \log ( \F{N}_1 ) \,\lvert\,

6843: (Z_i)_{i=1}^N\bigr]

6844: - \log(\epsilon) + 4.7,

6845: $$

6846: we see that our complexity term is somewhat more satisfactory than Vapnik's,

6847: since it is integrated outside the logarithm, with a little larger additional

6848: constant (remember that $\log(4) \simeq 1.4$, which is better than our $4.7$,

6849: which could presumably be improved by working out a better sequence $\eta_j$,

6850: but not down to $\log(4)$). Our variance term is better, since we get

6851: $r_1(1-r_1)$ as we should, instead of only $r_1$.

6852: We also have $\ds \frac{d''_1}{N}$ instead of

6853: $\ds 2 \frac{d_V}{N}$, coming from the fact that we do not use any symmetrization

6854: trick.

6855:

6856: Let us illustrate these bounds on a numerical example (corresponding to

6857: a situation where the sample is noisy or the classification model is

6858: weak). Let us assume that $N = 1000$, $

6859: \inf_{\Theta} r_1 = r_1(\w{\theta}) = 0.2$, that we

6860: are performing binary classification with a model with VC dimension

6861: not greater than $h = 10$, and that we work at level of confidence

6862: $\epsilon = 0.01$. Vapnik's theorem provides an upper bound for

6863: $R(\w{\theta})$ not smaller than

6864: $0.610$, whereas Corollary \ref{cor2.3.10} gives

6865: $R(\w{\theta}) \leq 0.461$ (using the bound $d''_1 \leq d'_1 + 3.7$ when $N = 1000$).

6866: Now if we go for Theorem

6867: \ref{thm2.3.9} and do not make a Gaussian approximation,

6868: we get $R(\w{\theta}) \leq 0.453$.  It is interesting to

6869: remark that this bound is achieved for $\lambda = 1195 > N = 1000$.

6870: This explains why the Gaussian approximation in Vapnik's bound

6871: can be improved: for such a large value of $\lambda$, $\lambda r_1(\theta)$

6872: does not behave like a Gaussian random variable.
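The figure $0.610$ quoted for Theorem \ref{thmVapnik} can be recovered as follows, under the assumption (made here for illustration only, and not stated in the text above) that the logarithm of the shattering number is bounded by the classical estimate $h \bigl[ \log(2N/h) + 1 \bigr]$ for a model of VC dimension $h$. With $N = 1000$, $h = 10$ and $\epsilon = 0.01$, this gives
$$
d_V \simeq 10 \bigl[ \log(200) + 1 \bigr] + \log(400) \simeq 62.98 + 5.99 = 68.97,
$$
and therefore, with $r_1 = 0.2$,
$$
r_1 + \frac{2 d_V}{N} + \sqrt{ \frac{4 d_V r_1}{N} + \frac{4 d_V^2}{N^2}}
\simeq 0.2 + 0.1379 + \sqrt{ 0.0552 + 0.0190 } \simeq 0.610.
$$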

6873:

6874: Let us recall in conclusion that the best bound is provided by

6875: Theorem \ref{thm2.3.3}, giving $R(\w{\theta}) \leq 0.4211$,

6876: (that is approximately $2/3$

6877: of Vapnik's bound), for optimal values

6878: of $k = 15$, and of $\lambda = 1010$. This bound can be seen to

6879: take advantage of the fact that Bernoulli random variables

6880: are not Gaussian (its Gaussian approximation, Corollary \ref{cor2.3.7},

6881: gives a bound $R(\theta) \simeq 0.4325$, still with an optimal $k = 15$),

6882: and of the fact that the optimal size of

6883: the shadow sample is significantly larger than the size

6884: of the observed sample. Moreover, Theorem \ref{thm2.3.3} does not

6885: assume that the sample is i.i.d., but only that it is

6886: independent, thus generalizing Vapnik's bounds to inhomogeneous

6887: data (this will presumably be the case when data are collected

6888: from different places where the experimental conditions may

6889: not be expected to be the same, although they may reasonably

6890: be assumed to be independent). We would also like to emphasize

6891: that our little numerical example shows that Vapnik's bounds

6892: can be expected to be appropriate when dealing with moderate

6893: sample sizes. More sophisticated bounds obviously have a better

6894: asymptotic behaviour as shown in the first section. Nevertheless

6895: the numerical illustration

6896: of Theorem \ref{thm1.1.17} given on page \pageref{thm1.1.17}

6897: suggests that

6898: Vapnik's bounds are not doing so badly for small

6899: to medium ratios between the sample size and the dimension of

6900: the classification model (with local bounds, we could only get

6901: down to $0.332$, although using a quite stronger dimension assumption).

6902:

6903: We chose on purpose an example where it is non trivial

6904: to decide whether the chosen classifier does better than the $0.5$

6905: error rate of blind random classification. We think that this

6906: situation of weak learning is of practical interest, since

6907: ``significant'' weak learning rules may afterwards be aggregated

6908: or combined in various ways to achieve better classification rates.

6909:

6910: \section{Support Vector Machines}

6911: \subsection{How to build them}

6912: \subsubsection{The canonical hyperplane}

6913: \label{chapSVM}

6914:

6915: Support

6916: Vector Machines, widely used and well known,

6917: were introduced by V. Vapnik \cite{Vapnik}.

6918: Before introducing them,

6919: we will study as a prerequisite the separation of points by hyperplanes

6920: in a finite dimensional Euclidean space.

6921: Support Vector Machines perform the same kind of linear

6922: separation after

6923: an implicit change of pattern space.

6924: The preceding PAC-Bayesian results provide a

6925: suitable framework to analyze their generalization properties.

6926:

6927: We will deal in this section with the classification

6928: of points in $\RR^d$ in two classes.

6929: Let $Z = (x_i, y_i)_{i=1}^N \in \bigl(\RR^d \times \{-1,+1\}

6930: \bigr)^N$ be some set of labelled examples (called

6931: the training set hereafter). Let us split the set of

6932: indices $I = \{1, \dots, N\}$

6933: according to the labels into two subsets

6934: \begin{align*}

6935: I_+ & = \{ i \in I\,: y_i = + 1 \},\\

6936: I_- & = \{ i \in I\,: y_i = - 1 \}.

6937: \end{align*}

6938: Let us then consider the set of admissible separating directions

6939: $$

6940: A_Z = \bigl\{ w \in \RR^d \,: \sup_{b \in \RR} \inf_{i \in I}

6941: ( \langle w, x_i \rangle - b ) y_i \geq 1 \bigr\},

6942: $$

6943: which can also be written as

6944: $$

6945: A_Z = \bigl\{ w \in \RR^d\,:

6946: \max_{i \in I_-} \langle w, x_i

6947: \rangle + 2 \leq \min_{i \in I_+} \langle w, x_i \rangle \bigr\}.

6948: $$

6949: As it is easily seen, the optimal value of $b$ for a fixed value of $w$, in other

6950: words the value of $b$ which maximizes $\inf_{i \in I}

6951: (\langle w, x_i \rangle - b)y_i$, is equal to

6952: $$

6953: b_w = \frac{1}{2} \Bigl[ \max_{i \in I_-} \langle w, x_i \rangle +

6954: \min_{i \in I_+} \langle w, x_i \rangle \Bigr].

6955: $$
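Both claims can be checked directly, assuming that both classes are present in the training set. For any $w \in \RR^d$ and any $b \in \RR$,
$$
\inf_{i \in I} ( \langle w, x_i \rangle - b ) y_i
= \min \Bigl\{ \min_{i \in I_+} \langle w, x_i \rangle - b, \;
b - \max_{i \in I_-} \langle w, x_i \rangle \Bigr\}.
$$
The first term decreases with $b$ whereas the second one increases, so that the minimum of the two is maximized when they are equal, that is when $b = b_w$. Its value is then
$$
\frac{1}{2} \Bigl[ \min_{i \in I_+} \langle w, x_i \rangle -
\max_{i \in I_-} \langle w, x_i \rangle \Bigr],
$$
and requiring this value to be at least $1$ gives the second expression of $A_Z$.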

6956: \begin{lemma}\mypoint

6957: When $A_Z \neq \varnothing$, $\inf \{ \lVert w \rVert^2 \,: w

6958: \in A_Z \}$ is reached for only one value $w_Z$ of $w$.

6959: \end{lemma}

6960: \begin{proof}

6961: Let $w_0 \in A_Z$. The set $A_Z \cap \{ w \in \RR^d :

6962: \lVert w \rVert \leq \lVert w_0 \rVert \}$ is a compact convex set and $w \mapsto \lVert w \rVert^2$ is strictly

6963: convex and therefore has a unique minimum on this set, which

6964: is also obviously its minimum on $A_Z$.

6965: \end{proof}

6966: \begin{dfn}\mypoint

6967: When $A_Z \neq \varnothing$, the training set $Z$ is said

6968: to be linearly separable. The hyperplane

6969: $$

6970: H = \{ x \in \RR^d \,: \langle w_Z, x \rangle - b_Z = 0 \},

6971: $$

6972: where

6973: \begin{align*}

6974: w_Z & = \arg\min \{ \lVert w \rVert \,: w \in A_Z \},\\

6975: b_Z & = b_{w_Z},

6976: \end{align*}

6977: is called the canonical separating hyperplane of the training set $Z$.

6978: The quantity $\lVert w_Z \rVert^{-1}$ is called the margin of the

6979: canonical hyperplane.

6980: \end{dfn}

6981: Note that as $\min_{i \in I_+} \langle w_Z, x_i \rangle -

6982: \max_{i \in I_-} \langle w_Z, x_i \rangle = 2$, the margin is

6983: also equal to half the distance between the projections

6984: on the direction $w_Z$ of the positive and negative patterns.

6985:

6986: \subsubsection{Computation of the canonical hyperplane}

6987:

6988: Let us consider the convex hulls $\C{X}_+$ and $\C{X}_-$ of the positive

6989: and negative patterns:

6990: \begin{align*}

6991: \C{X}_+ & = \Bigl\{ \sum_{i \in I_+} \lambda_i x_i\,:\bigl( \lambda_i

6992: \bigr)_{i \in I_+} \in \RR_+^{I_+}, \sum_{i \in I_+} \lambda_i

6993: = 1 \Bigr\},\\

6994: \C{X}_- & = \Bigl\{ \sum_{i \in I_-} \lambda_i x_i\,:\bigl( \lambda_i

6995: \bigr)_{i \in I_-} \in \RR_+^{I_-}, \sum_{i \in I_-} \lambda_i

6996: = 1 \Bigr\}.

6997: \end{align*}

6998: Let us introduce the closed convex set

6999: $$

7000: \C{V} = \C{X}_+ - \C{X}_- = \bigl\{ x_+ - x_-\,: x_+ \in \C{X}_+, x_- \in

7001: \C{X}_- \bigr\}.

7002: $$

7003: As $v \mapsto \lVert v \rVert^{2}$ is strictly convex,

7004: with compact lower level sets, there is a unique

7005: vector $v^*$ such that

7006: $$

7007: \lVert v^* \rVert^2 = \inf \bigl\{ \lVert v \rVert^2\,: v \in \C{V} \bigr\}.

7008: $$

7009: \begin{lemma}\mypoint

7010: The set $A_Z$ is non empty (i.e. the training set $Z$

7011: is linearly separable) if and only if $v^* \neq 0$. In this case

7012: $$

7013: w_Z = \frac{2}{\lVert v^* \rVert^{2}} v^*,

7014: $$

7015: and the margin of the canonical hyperplane is equal to $\frac{1}{2}

7016: \lVert v^* \rVert$.

7017: \end{lemma}

7018: \begin{proof}

7019: Let us assume first that $v^* = 0$, or equivalently that

7020: $\C{X}_+ \cap \C{X}_- \neq \varnothing$. As for any vector $w \in \RR^d$,

7021: \begin{align*}

7022: \min_{i \in I_+} \langle w, x_i \rangle & = \min_{x \in \C{X}_+}

7023: \langle w, x \rangle,\\

7024: \max_{i \in I_-} \langle w, x_i \rangle & = \max_{x \in \C{X}_-}

7025: \langle w, x \rangle,

7026: \end{align*}

7027: we see that necessarily $ \min_{i \in I_+}

7028: \langle w, x_i \rangle - \max_{i \in I_-}

7029: \langle w, x_i \rangle \leq 0$, which shows that

7030: $w$ cannot be in $A_Z$ and therefore that $A_Z$

7031: is empty.

7032:

7033: Let us assume now that $v^* \neq 0$, or equivalently that

7034: $\C{X}_+ \cap \C{X}_- = \varnothing$. Let us put

7035: $w^* = \frac{2}{\lVert v^* \rVert^2} v^*$.

7036: Let us remark first that

7037: \begin{align*}

7038: \min_{i \in I_+} \langle w^*, x_i \rangle -

7039: \max_{i \in I_-} \langle w^*, x_i \rangle & =

7040: \inf_{x \in \C{X}_+} \langle w^*, x \rangle -

7041: \sup_{x \in \C{X}_-} \langle w^*, x \rangle

7042: \\ & = \inf_{x_+ \in \C{X}_+, x_- \in \C{X}_-}

7043: \langle w^*, x_+ - x_- \rangle \\ & =

7044: \frac{2}{\lVert v^* \rVert^2}

7045: \inf_{v \in \C{V}} \langle v^*, v \rangle.

7046: \end{align*}

7047: Let us now prove that $\inf_{v \in \C{V}}

7048: \langle v^*, v \rangle = \lVert v^* \rVert^2$.

7049: Some arbitrary $v \in \C{V}$ being fixed,

7050: consider the function $$\beta \mapsto \lVert

7051: \beta v + (1 - \beta) v^* \rVert^2 : [0,1]

7052: \rightarrow \RR.$$ By definition of $v^*$,

7053: it reaches its minimum value for $\beta = 0$,

7054: and therefore has a non negative derivative at

7055: this point. Computing this derivative, we find

7056: that $\langle v - v^*, v^* \rangle \geq 0$,

7057: as claimed. We have proved that

7058: $$

7059: \min_{i \in I_+} \langle w^*, x_i \rangle

7060: - \max_{i \in I_-} \langle w^*, x_i \rangle

7061: = 2,

7062: $$

7063: and therefore that $w^* \in A_Z$. On the other hand,

7064: any $w \in A_Z$ is such that

7065: $$

7066: 2 \leq \min_{i \in I_+} \langle w, x_i \rangle

7067: - \max_{i \in I_-} \langle w, x_i \rangle

7068: = \inf_{v \in \C{V}} \langle w, v \rangle \leq \lVert w \rVert

7069: \inf_{v \in \C{V}} \lVert v \rVert = \lVert w \rVert

7070: \,\lVert v^* \rVert.

7071: $$

7072: This proves that $\lVert w^* \rVert = \inf \bigl\{ \lVert w \rVert\,:

7073: w \in A_Z \bigr\}$, and therefore that $w^* = w_Z$ as claimed.

7074: \end{proof}
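As a simple illustration (the data are chosen here for the sake of the example), consider in $\RR^2$ the training set made of $x_1 = (1,1)$ with label $y_1 = +1$ and $x_2 = (-1,-1)$ with label $y_2 = -1$. Then $\C{X}_+ = \{ x_1 \}$, $\C{X}_- = \{ x_2 \}$ and $\C{V} = \{ (2,2) \}$, so that $v^* = (2,2)$, $\lVert v^* \rVert^2 = 8$ and
$$
w_Z = \tfrac{2}{8} (2,2) = \bigl( \tfrac{1}{2}, \tfrac{1}{2} \bigr), \qquad
b_Z = \tfrac{1}{2} ( -1 + 1 ) = 0.
$$
One checks that $\langle w_Z, x_1 \rangle = 1$ and $\langle w_Z, x_2 \rangle = -1$, and that the margin $\lVert w_Z \rVert^{-1} = \sqrt{2}$ is indeed equal to $\frac{1}{2} \lVert v^* \rVert = \frac{1}{2} \sqrt{8}$.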

7075: One way to compute $w_Z$ would be therefore to compute $v^*$ by minimizing

7076: $$

7077: \bigl\{ \lVert \sum_{i \in I} \lambda_i y_i x_i \rVert^2\,:

7078: (\lambda_i)_{i \in I} \in \RR_+^I, \sum_{i \in I} \lambda_i = 2,

7079: \sum_{i \in I} y_i \lambda_i = 0 \bigr\}.

7080: $$

7081: Although this is a tractable quadratic programming problem, a

7082: direct computation of $w_Z$ through the following proposition

7083: is usually preferred.

7084: \begin{prop}\mypoint

7085: \label{wComp}

7086: The canonical direction $w_Z$ can be expressed as

7087: $$

7088: w_Z = \sum_{i=1}^N \alpha_i^* y_i x_i,

7089: $$

7090: where $(\alpha_i^*)_{i=1}^N$ is obtained by minimizing

7091: $$

7092: \inf \bigl\{ F(\alpha)\,: \alpha \in \C{A} \bigr\},

7093: $$

7094: where

7095: $$

7096: \C{A} = \Bigl\{ (\alpha_i)_{i \in I}

7097: \in \RR_+^{I}, \sum_{i \in I} \alpha_i y_i = 0 \Bigr\},

7098: $$

7099: and

7100: $$

7101: F(\alpha) = \Bigl\lVert \sum_{i \in I} \alpha_i y_i x_i \Bigr\rVert^2

7102: - 2 \sum_{i \in I} \alpha_i.

7103: $$

7104: \end{prop}

7105: \begin{proof}

7106: Let $w(\alpha) = \sum_{i \in I} \alpha_i y_i x_i$ and

7107: let $S(\alpha) = \frac{1}{2} \sum_{i\in I}\alpha_i$.

7108: We can express the function $F(\alpha)$ as

7109: $F(\alpha) = \lVert w(\alpha) \rVert^2 - 4 S(\alpha)$.

7110: Moreover it is important to notice that for any $s \in \RR_+$

7111: $\{ w(\alpha)\,: \alpha \in \C{A}, S(\alpha) = s\} = s \C{V}$.

7112: This shows that for any $s \in \RR_+$, $\inf \{ F(\alpha)

7113: : \alpha \in \C{A}, S(\alpha) = s \}$ is reached and that for any

7114: \linebreak $\alpha_s \in \{ \alpha \in \C{A}\,: S(\alpha)  = s \}$ reaching this infimum,

7115: $w(\alpha_s) = s v^*$. As \linebreak $s \mapsto s^2 \lVert v^* \rVert^2 - 4 s :

7116: \RR_+ \rightarrow \RR$ reaches its infimum for only one value

7117: $s^*$ of $s$, namely at $s^* = \frac{2}{\lVert v^* \rVert^2}$,

7118: this shows that $F(\alpha)$ reaches its infimum on $\C{A}$,

7119: and that for any $\alpha^* \in \C{A}$ such that $F(\alpha^*) =

7120: \inf \{ F(\alpha)\,: \alpha \in \C{A} \}$, $w(\alpha^*)

7121: = \frac{2}{\lVert v^* \rVert^2} v^* = w_Z$.

7122: \end{proof}
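To illustrate Proposition \ref{wComp} on a toy case (chosen here for the sake of the example), take in $\RR^2$ the two labelled points $x_1 = (1,1)$, $y_1 = +1$ and $x_2 = (-1,-1)$, $y_2 = -1$. The constraint $\alpha_1 y_1 + \alpha_2 y_2 = 0$ forces $\alpha_1 = \alpha_2 = a$ for some $a \in \RR_+$, so that $w(\alpha) = a (x_1 - x_2) = (2a, 2a)$ and
$$
F(\alpha) = 8 a^2 - 4 a,
$$
which is minimal at $a = \frac{1}{4}$. This gives $w_Z = \bigl( \frac{1}{2}, \frac{1}{2} \bigr)$, in accordance with the formula $w_Z = \frac{2}{\lVert v^* \rVert^2} v^*$, since here $v^* = x_1 - x_2 = (2,2)$ and $\lVert v^* \rVert^2 = 8$.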

7123:

7124: \subsubsection{Support vectors}

7125: \begin{dfn}\mypoint

7126: The set of support vectors $\C{S}$ is defined by

7127: $$

7128: \C{S} = \{ x_i \,: \langle w_Z , x_i \rangle - b_Z = y_i \}.

7129: $$

7130: \end{dfn}

7131:

7132: \begin{prop}\mypoint

7133: \label{chap4Prop3.1}

7134: Any $\alpha^*$ minimizing $F(\alpha)$ on $\C{A}$

7135: is such that

7136: $$

7137: \{ x_i\,: \alpha_i^* > 0 \} \subset \C{S}.

7138: $$

7139: This implies that the representation $w_Z = w(\alpha^*)$

7140: involves in general only a limited number of non zero

7141: coefficients and that $w_Z = w_{Z'}$, where $Z' =

7142: \{ (x_i,y_i)\,: x_i \in \C{S} \}$.

7143: \end{prop}

7144: \begin{proof}

7145: Let us consider any given $i \in I_+$ and $j \in I_-$, such that

7146: $\alpha_i^* > 0$ and $\alpha_j^* > 0$ (there exists at least

7147: one such index in each set $I_-$ and $I_+$, since the sums of the

7148: components of $\alpha^*$ on each of these sets are equal and

7149: since $\sum_{k \in I} \alpha^*_k > 0$).

7150: For any $t \in \RR$, consider

7151: $$

7152: \alpha_k(t) = \alpha_k^* + t \B{1}(k \in \{i,j\}), \quad k \in I.

7153: $$

7154: The vector $\alpha(t)$ is in $\C{A}$

7155: for any value of $t$ in some neighborhood of $0$,

7156: therefore $\frac{\partial}{\partial t}_{|t = 0} F\bigl[\alpha(t) \bigr] = 0$.

7157: Computing this derivative, we find that

7158: $$

7159: y_i \langle w(\alpha^*), x_i \rangle +

7160: y_j \langle w(\alpha^*) , x_j \rangle = 2.

7161: $$
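Indeed, since $w\bigl[ \alpha(t) \bigr] = w(\alpha^*) + t ( y_i x_i + y_j x_j )$ and $\sum_{k \in I} \alpha_k(t) = \sum_{k \in I} \alpha_k^* + 2t$,
$$
\frac{\partial}{\partial t}_{|t = 0} F\bigl[\alpha(t) \bigr]
= 2 \langle w(\alpha^*), y_i x_i + y_j x_j \rangle - 4,
$$
and equating this derivative to zero gives the identity above.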

7162: As $y_i = - y_j$, this can also be written as

7163: $$

7164: y_i \bigl[ \langle w(\alpha^*), x_i \rangle - b_Z \bigr] +

7165: y_j \bigl[ \langle w(\alpha^*) , x_j \rangle -b_Z \bigr] = 2.

7166: $$

7167: As $w(\alpha^*)\in A_Z$,

7168: $$

7169: y_k \bigl[ \langle w(\alpha^*), x_k \rangle - b_Z \bigr] \geq 1,

7170: \qquad k \in I,

7171: $$

7172: which implies necessarily as claimed that

7173: $$

7174: y_i \bigl[ \langle w(\alpha^*), x_i \rangle - b_Z \bigr]

7175: = y_j \bigl[ \langle w(\alpha^*) , x_j \rangle -b_Z \bigr] = 1.

7176: $$

7177: \end{proof}

7178: \subsubsection{The non separable case}

7179: In the case when the training set $Z = (x_i, y_i)_{i=1}^N$

7180: is not linearly separable, we can define a noisy canonical

7181: hyperplane as follows. We can choose $w \in \RR^d$ and

7182: $b \in \RR$ to minimize

7183: \begin{equation}

7184: C(w,b) =

7185: \sum_{i=1}^N \bigl[ 1 - \bigl( \langle w, x_i \rangle - b \bigr)

7186: y_i \bigr]_+ + \tfrac{1}{2} \lVert w \rVert^2,

7187: \end{equation}

7188: where for any real number $r$, $r_+ = \max \{r, 0\}$ is

7189: the positive part of $r$.

7190: \newcommand{\Bw}{\overline{w}}

7191: \begin{thm}\mypoint

7192: Let us introduce the dual criterion

7193: $$

7194: F(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2}

7195: \biggl\lVert \sum_{i=1}^N y_i \alpha_i x_i \biggr\rVert^2

7196: $$

7197: and the domain

7198: $\ds

7199: \C{A}' = \biggl\{ \alpha \in \RR_+^N : \alpha_i \leq 1, i = 1, \dots, N,

7200: \sum_{i=1}^N y_i \alpha_i = 0 \biggr\}.

7201: $

7202: Let $\alpha^* \in \C{A}'$ be such that $ F(\alpha^*) = \sup_{\alpha \in

7203: \C{A}'} F(\alpha)$.

7204: Let $w^* = \sum_{i=1}^N y_i \alpha^*_i x_i$. There is

7205: a threshold $b^*$ (whose construction will be detailed

7206: in the proof), such that

7207: $$

7208: C(w^*, b^*) = \inf_{w \in \RR^d, b \in \RR}

7209: C(w, b).

7210: $$

7211: \end{thm}

7212: \begin{cor}\mypoint \!\!{\sc(scaled criterion)}

7213: For any positive real parameter $\lambda$

7214: let us consider the criterion

7215: $$

7216: C_{\lambda}(w,b) = \lambda^2

7217: \sum_{i=1}^N \bigl[ 1 - (\langle w, x_i \rangle - b ) y_i

7218: \bigr]_+ + \tfrac{1}{2} \lVert w \rVert^2

7219: $$

7220: and the domain

7221: $\ds

7222: \C{A}'_{\lambda} = \biggl\{

7223: \alpha \in \RR_+^N : \alpha_i \leq \lambda^2, i = 1, \dots, N,

7224: \sum_{i=1}^N y_i \alpha_i = 0 \biggr\}.

7225: $

7226: For any solution $\alpha^*$ of the maximization problem

7227: $ F(\alpha^*) = \sup_{\alpha \in \C{A}'_{\lambda}} F(\alpha)$,

7228: the vector $w^* = \sum_{i=1}^N y_i \alpha^*_i x_i$

7229: is such that

7230: $$

7231: \inf_{b \in \RR} C_{\lambda}(w^*, b)

7232: = \inf_{w \in \RR^d, b \in \RR} C_{\lambda}(w, b).

7233: $$

7234: \end{cor}

7235: Let us remark that in the separable case, the scaled criterion is

7236: minimized by the canonical hyperplane for $\lambda$ large enough.

7237: This extension of the canonical hyperplane computation

7238: in dual space is often called {\em the box constraint},

7239: for obvious reasons.

7240:

7241: \noindent{\sc Proof.}

7242: The corollary is a straightforward consequence of

7243: the scale property $C_{\lambda}(w, b, x) = \lambda^2 C(\lambda^{-1}

7244: w, b, \lambda x)$, where we have made the dependence

7245: of the criterion in $x \in \RR^{d N}$ explicit.
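Indeed, since $\langle \lambda^{-1} w, \lambda x_i \rangle = \langle w, x_i \rangle$,
$$
\lambda^2 C(\lambda^{-1} w, b, \lambda x) = \lambda^2 \sum_{i=1}^N
\bigl[ 1 - ( \langle w, x_i \rangle - b ) y_i \bigr]_+
+ \tfrac{\lambda^2}{2} \lVert \lambda^{-1} w \rVert^2 = C_{\lambda}(w, b, x).
$$
Accordingly, if $\alpha' \in \C{A}'$ (a notation introduced here) maximizes the dual criterion computed on the rescaled patterns $\lambda x$, the theorem provides the minimizer $\lambda^{-1} w^* = \sum_{i=1}^N y_i \alpha'_i \lambda x_i$, that is $w^* = \sum_{i=1}^N y_i ( \lambda^2 \alpha'_i ) x_i$. Putting $\alpha = \lambda^2 \alpha' \in \C{A}'_{\lambda}$, we also get
$$
\sum_{i=1}^N \alpha'_i - \frac{1}{2} \biggl\lVert \sum_{i=1}^N y_i \alpha'_i \lambda x_i \biggr\rVert^2
= \lambda^{-2} F(\alpha),
$$
so that maximizing the rescaled dual criterion on $\C{A}'$ amounts to maximizing $F$ on $\C{A}'_{\lambda}$.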

7246: Let us come now to the proof of the theorem.

7247:

7248: The minimization of $C(w, b)$ can be performed in dual

7249: space extending the couple of parameters $(w, b)$

7250: to $\Bw = (w, b, \gamma) \in \RR^d \times \RR \times \RR_+^N$

7251: and introducing the dual multipliers $\alpha \in \RR_+^N$

7252: and the criterion

7253: $$

7254: G( \alpha, \Bw ) =

7255: \sum_{i = 1}^N \gamma_i + \sum_{i=1}^N \alpha_i

7256: \bigl\{ \bigl[ 1 - (\langle w, x_i \rangle - b ) y_i \bigr] - \gamma_i

7257: \bigr\} + \tfrac{1}{2} \lVert w \rVert^2.

7258: $$

7259: We see that

7260: $$

7261: C(w, b) = \inf_{\gamma \in \RR_+^N} \sup_{\alpha \in \RR_+^N}

7262: G\bigl[ \alpha, (w, b, \gamma) \bigr],

7263: $$

7264: and therefore, putting $\ov{\C{W}} = \{ (w, b, \gamma) :

7265: w \in \RR^d, b \in \RR, \gamma \in \RR_+^N \bigr \}$,

7266: we are led to solve the minimization problem

7267: $$

7268: G(\alpha_*, \Bw_*) = \inf_{\Bw \in \ov{\C{W}}} \sup_{\alpha \in \RR_+^N}

7269: G(\alpha, \Bw),

7270: $$

7271: whose solution $\Bw_* = (w_*, b_*, \gamma_*)$ is such that

7272: $C(w_*, b_*) = \inf_{(w, b) \in \RR^{d+1}} C(w, b)$,

7273: according to the preceding identity.
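The preceding identity can be checked coordinate by coordinate. For fixed $(w, b, \gamma)$, putting $u_i = 1 - ( \langle w, x_i \rangle - b ) y_i$ (a shorthand introduced here),
$$
\sup_{\alpha_i \in \RR_+} \alpha_i ( u_i - \gamma_i ) =
\begin{cases}
0, & \gamma_i \geq u_i,\\
+ \infty, & \text{otherwise},
\end{cases}
$$
so that the infimum in $\gamma \in \RR_+^N$ is reached at $\gamma_i = \max \{ u_i, 0 \}$, and is equal to $\sum_{i=1}^N (u_i)_+ + \tfrac{1}{2} \lVert w \rVert^2 = C(w,b)$.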

7274: As for any value of $\alpha' \in \RR_+^N$,

7275: $$

7276: \inf_{\Bw \in \ov{\C{W}}} \sup_{\alpha \in \RR_+^N}

7277: G(\alpha, \ov{w}) \geq

7278: \inf_{\Bw \in \ov{\C{W}}} G(\alpha', \ov{w}),

7279: $$

7280: it is immediately seen that

7281: $$

7282: \inf_{\Bw \in \ov{\C{W}}} \sup_{\alpha \in \RR_+^N}

7283: G(\alpha, \ov{w}) \geq

7284: \sup_{\alpha \in \RR_+^N} \inf_{\Bw \in \ov{\C{W}}}

7285: G(\alpha, \ov{w}).

7286: $$

7287: We are going to show that there is no duality gap,

7288: meaning that this inequality is indeed an equality.

7289: More importantly, we will do so by exhibiting

7290: a saddle point, which, solving the dual maximization

7291: problem will also solve the original one.

7292:

7293: Let us first make explicit the solution of the

7294: dual problem (the interest of this dual problem

7295: precisely lies in the fact that it can more easily

7296: be solved explicitly).

7297: Introducing the admissible set of values

7298: of $\alpha$,

7299: $$

7300: \C{A}' =  \bigl\{ \alpha \in \RR^N : 0 \leq \alpha_i \leq

7301: 1, i = 1, \dots, N, \sum_{i=1}^N y_i \alpha_i = 0 \bigr\},

7302: $$

7303: it is elementary to check that

7304: $$

7305: \inf_{\Bw \in \ov{\C{W}}} G(\alpha, \Bw) =

7306: \begin{cases}\ds

7307: \inf_{w \in \RR^d} G \bigl[ \alpha, (w,0,0) \bigr],

7308: & \alpha \in \C{A}',\\

7309: - \infty, & \text{otherwise}.

7310: \end{cases}

7311: $$
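Indeed, the criterion can be rearranged as
$$
G(\alpha, \Bw) = \sum_{i=1}^N (1 - \alpha_i) \gamma_i + b \sum_{i=1}^N \alpha_i y_i
+ \sum_{i=1}^N \alpha_i \bigl( 1 - \langle w, x_i \rangle y_i \bigr)
+ \tfrac{1}{2} \lVert w \rVert^2.
$$
It is linear in $b \in \RR$ and in each $\gamma_i \in \RR_+$, so that its infimum is equal to $- \infty$ unless $\sum_{i=1}^N \alpha_i y_i = 0$ and $\alpha_i \leq 1$ for any $i$. When these conditions are satisfied, the term in $b$ vanishes and the infimum in $\gamma$ is reached at $\gamma = 0$.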

7312: As

7313: $$

7314: G \bigl[ \alpha, (w, 0, 0) \bigr]

7315: = \tfrac{1}{2} \lVert w \rVert^2 + \sum_{i=1}^N \alpha_i \bigl(

7316: 1 -  \langle w, x_i \rangle y_i \bigr),

7317: $$

7318: we see that $\inf_{w \in \RR^d} G\bigl[ \alpha, (w,0,0) \bigr]$

7319: is reached at

7320: $$

7321: w_{\alpha} = \sum_{i=1}^N y_i \alpha_i x_i.

7322: $$

7323: This proves that

7324: \newcommand{\BW}{\ov{\C{W}}}

7325: $$

7326: \inf_{\Bw \in \BW} G(\alpha, \Bw) = F(\alpha).

7327: $$
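Indeed, for any $\alpha \in \C{A}'$, the gradient in $w$ of $G \bigl[ \alpha, (w,0,0) \bigr]$ is $w - \sum_{i=1}^N \alpha_i y_i x_i$, which vanishes at $w = w_{\alpha}$, and
$$
G \bigl[ \alpha, (w_{\alpha}, 0, 0) \bigr]
= \tfrac{1}{2} \lVert w_{\alpha} \rVert^2 + \sum_{i=1}^N \alpha_i
- \Bigl\langle w_{\alpha}, \sum_{i=1}^N y_i \alpha_i x_i \Bigr\rangle
= \sum_{i=1}^N \alpha_i - \tfrac{1}{2} \lVert w_{\alpha} \rVert^2 = F(\alpha).
$$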

7328: The continuous map $\alpha \mapsto \inf_{\Bw \in \ov{\C{W}}}

7329: G(\alpha, \Bw)$ reaches a (not necessarily unique) maximum

7330: $\alpha^*$

7331: on the compact convex set $\C{A}'$.

7332: We are now going to exhibit a choice of $\Bw^* \in \BW$

7333: such that $(\alpha^*, \Bw^*)$ is a {\em saddle point}.

7334: This means that we are going to show that

7335: $$

7336: G(\alpha^*, \Bw^*) =

7337: \inf_{\Bw \in \BW} G(\alpha^*, \Bw) =

7338: \sup_{\alpha \in \RR_+^N} G(\alpha, \Bw^*).

7339: $$

7340: It will imply that

7341: $$

7342: \inf_{\Bw \in \BW} \sup_{\alpha \in \RR_+^N} G(\alpha, \Bw)

7343: \leq \sup_{\alpha \in \RR_+^N} G(\alpha, \Bw^*) = G(\alpha^*, \Bw^*)

7344: $$

7345: on the one hand and that

7346: $$

7347: \inf_{\Bw \in \BW} \sup_{\alpha \in \RR_+^N} G(\alpha, \Bw)

7348: \geq \inf_{\Bw \in \BW} G(\alpha^*, \Bw) = G(\alpha^*, \Bw^*)

7349: $$

7350: on the other hand, proving that

7351: $$

7352: G(\alpha^*, \Bw^*) = \inf_{\Bw \in \BW} \sup_{\alpha \in \RR_+^N}

7353: G(\alpha, \Bw)

7354: $$

7355: as required.

7356:

7357: \noindent{\sc Construction of $\Bw^*$.}

7358: \begin{itemize}

7359: \item Let us put $w^* = w_{\alpha^*}$.

7360: \item If there is $j \in \{1, \dots, N \}$

7361: such that $0 < \alpha^*_j < 1$,

7362: let us put

7363: $$

7364: b^* = \langle x_j , w^* \rangle - y_j.

7365: $$

7366: Otherwise, let us put

7367: $$

7368: b^* = \sup \{ \langle x_i , w^* \rangle - 1 : \alpha^*_i > 0 , y_i = + 1,

7369: i = 1, \dots, N\}.

7370: $$

7371: \item Let us then put

7372: $$

7373: \gamma^*_i =

7374: \begin{cases}

7375: 0, & \alpha^*_i < 1,\\

7376: 1 - (\langle w^*, x_i \rangle - b^*)y_i, & \alpha^*_i = 1.

7377: \end{cases}

7378: $$

7379: \end{itemize}

7380: If we can prove that

7381: \begin{equation}

7382: \label{eq3.2}

7383: 1 - (\langle w^*, x_i \rangle - b^*)y_i

7384: \begin{cases}

7385: \leq 0, & \alpha^*_i = 0,\\

7386: = 0, & 0 < \alpha^*_i < 1,\\

7387: \geq 0, & \alpha^*_i = 1,

7388: \end{cases}

7389: \end{equation}

7390: it will show that $\gamma^* \in \RR_+^N$

7391: and therefore that $\Bw^* = (w^*, b^*, \gamma^*) \in \BW$.

7392: It will also show that

7393: $$

7394: G(\alpha, \Bw^*) = \sum_{i=1}^N \gamma^*_i

7395: + \sum_{i, \alpha^*_i = 0}  \alpha_i \bigl[ 1 -

7396: (\langle w^*, x_i \rangle - b^*) y_i \bigr]

7397: + \tfrac{1}{2} \lVert w^* \rVert^2,

7398: $$

7399: proving that

7400: $G(\alpha^*, \Bw^*) = \sup_{\alpha \in \RR_+^N} G(\alpha,

7401: \Bw^*)$. As obviously $G (\alpha^*, \Bw^*) = G \bigl[ \alpha^*,

7402: (w^*, 0 , 0) \bigr]$, we already know that

7403: $G(\alpha^*, \Bw^*) = \inf_{\Bw \in \BW} G(\alpha^*, \Bw)$.

7404: This will show that $(\alpha^*, \Bw^*)$ is the saddle

7405: point we were looking for, thus ending the proof of the

7406: theorem.

7407:

7408: \noindent{\sc Proof of equation \eqref{eq3.2}:} Let us deal first with the case when there is $j \in \{1, \dots, N\}$

7409: such that $0 < \alpha_j^* < 1$.

7410:

7411: For any $i \in \{1, \dots, N\}$

7412: such that $0< \alpha^*_i < 1$, there is $\epsilon > 0$ such

7413: that for any $t \in (-\epsilon, \epsilon)$, $\alpha^* + t y_i e_i - t y_j e_j

7414: \in \C{A}'$, where $(e_k)_{k=1}^N$ is the canonical basis of $\RR^N$.

7415: Thus $\frac{\partial}{\partial t}_{|t=0} F(\alpha^* + t y_i e_i -

7416: t y_j e_j ) = 0$. Computing this derivative,

7417: we obtain

7418: \begin{align*}

7419: \frac{\partial}{\partial t}_{|t=0}

7420: F(\alpha^* + t y_i e_i - t y_j e_j)

7421: & = y_i  - \langle w^*, x_i \rangle + \langle w^*, x_j \rangle - y_j \\

7422: & = y_i \bigl[ 1 - \bigl(\langle w^*, x_i \rangle - b^* \bigr) y_i \bigr].

7423: \end{align*}

7424: Thus $1 - \bigl(\langle w^*, x_i \rangle - b^* \bigr) y_i = 0$,

7425: as required. This shows also that the definition of $b^*$ does not

7426: depend on the choice of $j$ such that $0 < \alpha^*_j < 1$.

7427:

7428: For any $i \in \{1, \dots, N\}$ such that $\alpha^*_i = 0$,

7429: there is $\epsilon > 0$ such that for any $t \in (0, \epsilon)$,

7430: $\alpha^* + t e_i - t y_i y_j e_j \in \C{A}'$.

7431: Thus $\frac{\partial}{\partial t}_{|t=0} F(\alpha^* + t e_i

7432: - t y_i y_j e_j) \leq 0$, showing that

7433: $1 - \bigl( \langle w^*, x_i \rangle - b^* \bigr) y_i \leq 0$ as

7434: required.

7435:

7436: For any $i \in \{1, \dots, N\}$ such that $\alpha^*_i

7437: = 1$, there is $\epsilon > 0$ such that for any $t \in (0, \epsilon)$, $

7438: \alpha^* - t e_i + t y_i y_j e_j \in \C{A}'$.

7439: Thus $\frac{\partial}{\partial t}_{| t = 0} F(

7440: \alpha^* - t e_i + t y_i y_j e_j) \leq 0$, showing

7441: that  $1 - \bigl( \langle w^*, x_i \rangle - b^* \bigr) y_i \geq 0$

7442: as required. This completes the proof that $(\alpha^*, \Bw^*)$

7443: is a saddle point in this case.

7444:

7445: Let us deal now with the case where $\alpha^* \in \{0, 1\}^N$.

7446: If we are not in the trivial case where the vector $(y_i)_{i=1}^N$

7447: is constant, the case $\alpha^* = 0$ is ruled out. Indeed,

7448: in this case, considering $\alpha^* + t e_i + t e_j$, where

7449: $y_i y_j = -1$, we would get the contradiction

7450: $2 = \frac{\partial}{\partial t}_{|t=0} F(\alpha^*+te_i+te_j)

7451: \leq 0$.

7452:

7453: Thus there are values of $j$ such that $\alpha^*_j = 1$,

7454: and since $\sum_{i=1}^N \alpha^*_i y_i = 0$, both classes are

7455: present in the set $\{ j : \alpha^*_j = 1 \}$.

7456:

7457: Now for any $i, j \in \{1, \dots, N\}$ such that

7458: $\alpha^*_i = \alpha^*_j = 1$ and such that $y_i = +1$ and $y_j = -1$,

7459: $ \frac{\partial}{\partial t}_{|t=0} F( \alpha^* - t e_i

7460: - t e_j) = - 2 + \langle w^* , x_i \rangle - \langle

7461: w^*, x_j \rangle \leq 0$.

7462: Thus

7463: $$

7464: \sup \{ \langle w^*, x_i \rangle - 1 : \alpha^*_i = 1, y_i = +1 \}

7465: \leq \inf \{ \langle w^*, x_j \rangle + 1 : \alpha^*_j = 1, y_j = -1 \},

7466: $$

7467: showing that

7468: $$

7469: 1 - \bigl( \langle w^*, x_k \rangle - b^* \bigr) y_k \geq 0, \alpha^*_k = 1.

7470: $$

7471: Finally, for any $i$ such that $\alpha^*_i = 0$,

7472: for any $j$ such that $\alpha^*_j = 1$ and

7473: $y_j = y_i$

7474: $$

7475: \frac{\partial}{\partial t}_{|t=0}F(\alpha^*

7476: + t e_i - t e_j) = - y_i \langle w^*, x_i - x_j \rangle  \leq 0,

7477: $$

7478: showing that $1 - \bigl( \langle w^*, x_i \rangle - b^* \bigr) y_i

7479: \leq 0$. This completes the proof that $(\alpha^*, \Bw^*)$ is in all

7480: circumstances a saddle point.

7481:

7482: \subsubsection{Support Vector Machines}

7483: \begin{dfn}\mypoint

7484: The symmetric measurable kernel $K : \C{X} \times \C{X}

7485: \rightarrow \RR$ is said to

7486: be positive (or more precisely positive semi-definite) if

7487: for any $n \in \NN$, any $(x_i)_{i=1}^n \in \C{X}^n$,

7488: $$

7489: \inf_{\alpha \in \RR^n} \sum_{i=1}^n \sum_{j=1}^n \alpha_i K(x_i, x_j)

7490: \alpha_j \geq 0.

7491: $$

7492: \end{dfn}
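For instance, when $\C{X} = \RR^d$, the linear kernel $K(x, x') = \langle x, x' \rangle$ is positive in this sense, since for any $\alpha \in \RR^n$
$$
\sum_{i=1}^n \sum_{j=1}^n \alpha_i \langle x_i, x_j \rangle \alpha_j
= \Bigl\lVert \sum_{i=1}^n \alpha_i x_i \Bigr\rVert^2 \geq 0.
$$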

7493: Let $Z = (x_i,y_i)_{i=1}^N$ be some training set. Let us consider

7494: as previously

7495: $$

7496: \C{A} = \bigl\{ \alpha \in \RR_+^N \,: \sum_{i=1}^N \alpha_i y_i = 0 \bigr\}.

7497: $$

7498: Let

7499: $$

7500: F(\alpha) = \sum_{i=1}^N \sum_{j=1}^N \alpha_i y_iK(x_i,x_j)y_j \alpha_j

7501: - 2 \sum_{i=1}^N \alpha_i.

7502: $$

7503: \begin{dfn}\mypoint

7504: Let $K$ be a positive symmetric kernel.

7505: The training set $Z$ is said to be $K$-separable

7506: if

7507: $$

7508: \inf \bigl\{ F(\alpha)\,: \alpha \in \C{A} \bigr\} > - \infty.

7509: $$

7510: \end{dfn}

7511: \begin{lemma}\mypoint

7512: When $Z$ is $K$-separable, $\inf\{ F(\alpha)\,: \alpha \in \C{A} \}$ is

7513: reached.

7514: \end{lemma}

7515: \begin{proof}

7516: Consider the training set $Z' = (x_i',y_i)_{i=1}^N$, where

7517: $$

7518: x_i' = \biggl\{ \biggl[ \Bigl\{ K(x_k,x_{\ell})\Bigr\}_{k=1, \ell=1}^{N

7519: \quad N} \biggr]^{1/2}(i,j) \biggr\}_{j=1}^N \in \RR^N.

7520: $$

7521: We see that $F(\alpha) = \left\lVert \sum_{i=1}^N \alpha_i y_i x_i'

7522: \right\rVert^2 - 2 \sum_{i=1}^N \alpha_i$.
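
To spell out this step (the notation $M$ for the $N \times N$ Gram matrix $\bigl\{ K(x_k,x_{\ell}) \bigr\}_{k,\ell=1}^N$ is introduced only for this remark): since $M^{1/2}$ is symmetric,
$$
\langle x_i', x_j' \rangle = \sum_{k=1}^N M^{1/2}(i,k) M^{1/2}(k,j) = M(i,j) = K(x_i, x_j),
$$
so that expanding the squared norm gives back the double sum
$\sum_{i=1}^N \sum_{j=1}^N \alpha_i y_i K(x_i,x_j) y_j \alpha_j$.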

7523: We have proved in the previous section that $Z'$ is linearly separable

7524: if and only if $\inf \{ F(\alpha)\,: \alpha \in \C{A} \} > - \infty$,

7525: and that the infimum is reached in this case.

7526: \end{proof}

7527:

7528: \begin{proposition}\mypoint

7529: \label{chap4Prop4.1} Let $K$ be a symmetric positive kernel and let

7530: $Z = (x_i, y_i)_{i=1}^N$ be some $K$-separable training set. Let

7531: $\alpha^* \in \C{A}$ be such that $F(\alpha^*)

7532: = \inf \{ F(\alpha) \,: \alpha \in \C{A} \}$.

7533: Let

7534: \begin{align*}

7535: I_-^* & = \{ i \in \NN\,:1 \leq i \leq N, y_i = -1, \alpha_i^* > 0 \}\\

7536: I_+^* & = \{ i \in \NN\,:1 \leq i \leq N, y_i = +1, \alpha_i^* > 0 \}\\

7537: b^* & = \frac{1}{2} \Bigl\{

7538: \sum_{j=1}^N \alpha_j^* y_j K(x_j,x_{i_-})

7539: + \sum_{j=1}^N \alpha_j^* y_j K(x_j,x_{i_+}) \Bigr\}, \qquad i_- \in

7540: I_-^*, i_+ \in I_+^*,

7541: \end{align*}

7542: where the value of $b^*$ does not depend on the choice of $i_-$ and

7543: $i_+$.

7544: The classification rule $f : \C{X} \rightarrow \C{Y}$

7545: defined by the formula

7546: $$

7547: f(x) = \sign \left( \sum_{i=1}^N \alpha_i^* y_i K(x_i,x) -

7548: b^* \right)

7549: $$

7550: is independent of the choice of $\alpha^*$ and is called

7551: the support vector machine defined by $K$ and $Z$.

7552: The set

7553: $\C{S} = \{ x_j\,: \sum_{i=1}^N \alpha_i^* y_i K(x_i,x_j) - b^* = y_j \}$

7554: is called the set of support vectors. For any choice of $\alpha^*$,

7555: $\{ x_i\,: \alpha_i^* > 0 \} \subset \C{S}$.

7556: \end{proposition}

7557: An important consequence of this proposition is that the support

7558: vector machine defined by $K$ and $Z$ is also the support vector

7559: machine defined by $K$ and $Z' = \{ (x_i, y_i) : \alpha^*_i > 0,

7560: 1 \leq i \leq N \}$, since this restriction of the index set

7561: contains the value $\alpha^*$ where the minimum of $F$ is reached.

7562:

7563: \begin{proof}

7564: The independence from the choice of $\alpha^*$, which is not

7565: necessarily unique, is seen as follows.

7566: Let $(x_i)_{i=1}^N$ and $x \in \C{X}$ be fixed.

7567: Let us put for ease of notation $x_{N+1} = x$.

7568: Let $M$ be the $(N+1) \times (N+1)$ symmetric

7569: semi-definite matrix defined by $M(i,j) = K(x_i,x_j)$,

7570: $i=1,\dots, N+1$, $j=1, \dots, N+1$.

7571: Let us consider the mapping

7572: $\Psi : \{ x_i\,:i=1, \dots, N+1 \} \rightarrow \RR^{N+1}$

7573: defined by

7574: \begin{equation}

7575: \label{PsiDef}

7576: \Psi(x_i) = \bigl[M^{1/2}(i,j)\bigr]_{j=1}^{N+1} \in \RR^{N+1}.

7577: \end{equation}

7578: Let us consider the training set $Z' = \bigl[ \Psi(x_i),y_i \bigr]_{i=1}^N$.

7579: Then $Z'$ is linearly separable,

7580: $$F(\alpha) =

7581: \Bigl\lVert \sum_{i=1}^N \alpha_i y_i \Psi(x_i) \Bigr\rVert^2

7582: - 2 \sum_{i=1}^N \alpha_i,$$

7583: and we have proved that

7584: for any choice of $\alpha^* \in \C{A}$ minimizing $F(\alpha)$,

7585: \linebreak $w_{Z'} = \sum_{i=1}^N \alpha_i^* y_i \Psi(x_i)$.

7586: Thus the support vector machine defined by $K$ and $Z$ can also be expressed by the formula

7587: $$

7588: f(x) = \sign \Bigl[ \langle w_{Z'}, \Psi(x) \rangle - b_{Z'} \Bigr]

7589: $$

7590: which does not depend on $\alpha^*$. The definition of $\C{S}$

7591: is such that $\Psi(\C{S})$ is the set of support vectors

7592: defined in the linear case, where its stated property has already been

7593: proved.

7594: \end{proof}

7595:

7596: We can in the same way use the box constraint and show

7597: that any solution $\alpha^* \in \arg \min

7598: \{ F(\alpha) : \alpha \in \C{A}, \alpha_i \leq \lambda^2,

7599: i = 1, \dots, N \}$ minimizes

7600: \begin{multline}

7601: \label{eq3.4}

7602: \inf_{b \in \RR} \lambda^2 \sum_{i=1}^N \biggl[ 1 -

7603: \biggl( \sum_{j=1}^N y_j \alpha_j K(x_j, x_i) - b

7604: \biggr) y_i \biggr]_+ \\ + \frac{1}{2}

7605: \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j).

7606: \end{multline}

7607:

7608: \subsubsection{Building kernels}

7609:

7610: The results of this section (except the last one) are drawn from

7611: \cite{Cristianini}. We have no reference for the last

7612: proposition of this section, although we believe it is well known.

7613: We include them for the convenience of the reader.

7614:

7615: \begin{prop}\mypoint

7616: Let $K_1$ and $K_2$ be positive symmetric kernels on $\C{X}$.

7617: Then for any $a \in \RR_+$

7618: \begin{align*}

7619: (a K_1 + K_2)(x,x') & \overset{\text{\rm def}}{=} a K_1(x,x')

7620: + K_2(x,x')\\

7621: \text{ and }(K_1 \cdot K_2)(x,x') &\overset{\text{\rm def}}{=}

7622: K_1(x,x') K_2(x,x')

7623: \end{align*}

7624: are also positive symmetric kernels.

7625: Moreover, for any measurable function \linebreak $g : \C{X} \rightarrow \RR$,

7626: $K_g(x,x') \overset{\text{\rm def}}{=} g(x)g(x')$ is also a positive symmetric kernel.

7627: \end{prop}

7628: \begin{proof}

7629: It is enough to prove the proposition in the case when $\C{X}$ is

7630: finite and kernels are just ordinary symmetric matrices.

7631: Thus we can assume without loss of generality that

7632: $\C{X} = \{ 1, \dots, n\}$. Then for any $\alpha \in \RR^n$,

7633: using usual matrix notations,

7634: \begin{align*}

7635: \langle \alpha , (a K_1 + K_2) \alpha \rangle & =

7636: a \langle \alpha, K_1 \alpha \rangle + \langle \alpha , K_2 \alpha \rangle

7637: \geq 0,\\

7638: \langle \alpha, (K_1 \cdot K_2) \alpha \rangle & =

7639: \sum_{i,j} \alpha_i K_1(i,j) K_2(i,j) \alpha_j\\

7640: & = \sum_{i,j,k} \alpha_i K_1^{1/2}(i,k) K_1^{1/2}(k,j)K_2(i,j) \alpha_j

7641: \\ & = \sum_{k} \underbrace{\sum_{i,j} \bigl[K_1^{1/2}(k,i) \alpha_i \bigr] K_2(i,j)

7642: \bigl[K_1^{1/2}(k,j) \alpha_j \bigr]}_{

7643: \geq 0} \geq 0,\\

7644: \langle \alpha, K_g \alpha \rangle & = \sum_{i,j} \alpha_i g(i) g(j) \alpha_j

7645: = \left( \sum_i \alpha_i g(i) \right)^2 \geq 0.

7646: \end{align*}

7647: \end{proof}

7648:

7649: \begin{prop}\mypoint

7650: Let $K$ be some positive symmetric kernel on $\C{X}$. Let $p : \RR \rightarrow

7651: \RR$ be a polynomial with positive coefficients.

7652: Let $g : \C{X} \rightarrow \RR^d$ be a measurable function.

7653: Then

7654: \begin{align*}

7655: p(K)(x,x') & \overset{\text{def}}{=}

7656: p\bigl[ K(x,x')\bigr], \\

7657: \exp(K)(x,x') & \overset{\text{def}}{=}

7658: \exp \bigl[ K(x,x') \bigr]\\

7659: \text{ and } G_{g}(x,x') & \overset{\text{def}}{=}

7660: \exp \bigl( - \lVert g(x) - g(x') \rVert^2 \bigr)

7661: \end{align*}are all

7662: positive symmetric kernels.

7663: \end{prop}

7664: \begin{proof}

7665: The first assertion is a direct consequence of the previous proposition.

7666: The second one comes from the fact that the exponential function is

7667: the pointwise limit of a sequence of polynomial functions

7668: with positive coefficients.

7669: The third one is seen from the second one and the decomposition

7670: $$

7671: G_{g}(x,x') = \Bigl[ \exp\bigl( - \lVert g(x) \rVert^2 \bigr)

7672: \exp \bigl( - \lVert g(x') \rVert^2 \bigr) \Bigr]

7673: \exp \bigl[ 2 \langle g(x), g(x') \rangle \bigr]

7674: $$

7675: \end{proof}
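
As a quick sanity check of the last assertion, consider the case of two points
$x_1, x_2 \in \C{X}$ and put $\rho = G_g(x_1,x_2) = \exp \bigl( - \lVert g(x_1) - g(x_2) \rVert^2 \bigr) \in (0,1]$
(the symbol $\rho$ is introduced only for this illustration).
Since $G_g(x_i,x_i) = 1$, for any $\alpha \in \RR^2$,
$$
\sum_{i=1}^2 \sum_{j=1}^2 \alpha_i G_g(x_i,x_j) \alpha_j
= \alpha_1^2 + 2 \rho \alpha_1 \alpha_2 + \alpha_2^2
= (1 - \rho) (\alpha_1^2 + \alpha_2^2) + \rho (\alpha_1 + \alpha_2)^2 \geq 0,
$$
and this quadratic form vanishes only at $\alpha = 0$ as soon as
$g(x_1) \neq g(x_2)$, that is when $\rho < 1$.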

7676: \begin{prop}\mypoint

7677: With the notations of the previous proposition,

7678: {\em any} training set $Z = (x_i,y_i)_{i=1}^N \in \bigl( \C{X}\times \{-1,+1\}

7679: \bigr)^N$ is $G_g$-separable as soon as $g(x_i)$, $i = 1, \dots, N$ are

7680: distinct points of $\RR^d$.

7681: \end{prop}

7682: \begin{proof}

7683: It is clearly enough to prove the case when $\C{X} = \RR^d$ and

7684: $g$ is the identity.

7685: Let us consider some other generic point $x_{N+1} \in \RR^d$

7686: and define $\Psi$ as in \eqref{PsiDef}.

7687: It is enough to prove that

7688: $\Psi(x_1), \dots, \Psi(x_N)$ are affine independent, since the

7689: simplex, and therefore any affine independent set of points can

7690: be shattered by affine half-spaces. Let us assume that

7691: $\Psi(x_1), \dots, \Psi(x_N)$ are affine dependent; this means that

7692: for some $(\lambda_1, \dots, \lambda_N) \neq 0$ such that

7693: $\sum_{i=1}^N \lambda_i = 0$,

7694: $$

7695: \sum_{i=1}^N \sum_{j=1}^N \lambda_i G(x_i, x_j) \lambda_j = 0.

7696: $$

7697: Thus, $(\lambda_i)_{i=1}^{N+1}$, where we have put $\lambda_{N+1} = 0$

7698: is in the kernel of the symmetric positive semi-definite matrix

7699: $G(x_i,x_j)_{i,j \in \{1, \dots, N+1\}}$. Therefore

7700: $$

7701: \sum_{i=1}^N \lambda_i G(x_i, x_{N+1}) = 0,

7702: $$

7703: for any $x_{N+1} \in \RR^d$. This would mean that

7704: the functions $x \mapsto \exp (- \lVert x - x_i \rVert^2)$ are

7705: linearly dependent, which can be easily proved to be false.

7706: Indeed, let $n \in \RR^d$ be such that $\lVert n \rVert = 1$

7707: and $\langle n, x_i \rangle$, $i = 1, \dots, N$ are distinct

7708: (such a vector exists, because it has to be outside the

7709: union of a finite number of hyperplanes, which is of zero

7710: Lebesgue measure on the sphere). Let us assume for

7711: a while that for some $(\lambda_i)_{i=1}^N \in \RR^N$,

7712: for any $x \in \RR^d$,

7713: $$

7714: \sum_{i=1}^N \lambda_i \exp( - \lVert x - x_i \rVert^2) = 0.

7715: $$

7716: Considering $x = t n$, for $t \in \RR$, we would get

7717: $$

7718: \sum_{i=1}^N \lambda_i \exp( 2 t \langle n, x_i \rangle

7719: - \lVert x_i \rVert^2 ) = 0, \qquad t \in \RR.

7720: $$

7721: Letting $t$ go to infinity, we see that this is only

7722: possible if $\lambda_i = 0$ for all values of $i$.

7723: \end{proof}

7724:

7725: \subsection{Bounds for Support Vector Machines}

7726:

7727: \subsubsection{Compression scheme bounds}

7728:

7729: We can use Support Vector Machines in the framework of compression

7730: schemes and apply Theorem \ref{thm2.3.3} on page \pageref{thm2.3.3}.

7731: More precisely, given some positive symmetric kernel $K$ on $\C{X}$,

7732: we may consider for any training set $Z' = (x_i',y_i')_{i=1}^h$

7733: the classifier $\Hat{f}_{Z'}: \C{X} \rightarrow \C{Y}$ which is

7734: equal to the Support Vector Machine defined by $K$ and $Z'$

7735: whenever $Z'$ is $K$-separable, and which is equal to some

7736: constant classification rule otherwise (we take this convention

7737: to stick to the framework described on page \pageref{compression}, we

7738: will only use $\Hat{f}_{Z'}$ in the $K$-separable case,

7739: so this extension of the definition is just a matter of

7740: presentation). In the application of Theorem \ref{thm2.3.3}

7741: in the case when the observed sample $(X_i,Y_i)_{i=1}^N$ is $K$-separable,

7742: a natural (if not always optimal) choice of $Z'$ is to choose for

7743: $(x_i')$ the set of support vectors defined by $Z = (X_i,Y_i)_{i=1}^N$

7744: and to choose for $(y_i')$ the corresponding values of $Y$.

7745: This is justified by the fact that $\Hat{f}_{Z}=\Hat{f}_{Z'}$,

7746: as shown in Proposition \ref{chap4Prop4.1} (page \pageref{chap4Prop4.1}).

7747: In the case when

7748: $Z$ is not $K$-separable,

7749: we can train a Support Vector Machine with the box constraint,

7750: then remove all the errors to obtain a $K$-separable subsample

7751: $Z' = \{ (X_i, Y_i) : \alpha^*_i < \lambda^2, 1 \leq i \leq N \}$,

7752: (using the same notations as in equation \eqref{eq3.4}

7753: on page \pageref{eq3.4})

7754: and then

7755: consider its support vectors as the compression set.

7756: Still using the notations of page \pageref{eq3.4},

7757: this means we have to compute successively

7758: $\alpha^* \in \arg\min \{ F(\alpha) : \alpha \in \C{A},

7759: \alpha_i \leq \lambda^2 \}$, and $\alpha^{**}

7760: \in  \arg \min \{ F(\alpha) : \alpha \in \C{A},

7761: \alpha_i = 0 \text{ when } \alpha^*_i = \lambda^2 \}$,

7762: and finally to keep the compression set indexed by

7763: $J = \{ i : 1 \leq i \leq N, \alpha^{**}_i > 0 \}$,

7764: and the corresponding Support Vector Machine $\w{f}_{J}$.

7765: Different values of $\lambda$ can be used at this

7766: stage, producing different candidate compression

7767: sets : when $\lambda$ increases, the number of

7768: errors should decrease, on the other hand when

7769: $\lambda$ decreases, the margin $\lVert w \rVert^{-1}$

7770: of the separable subset $Z'$

7771: increases, supporting the hope for a smaller set of

7772: support vectors, thus we can use $\lambda$

7773: to monitor the number of errors on the training set

7774: we accept from the compression scheme.

7775: As we can use whatever heuristic we want while

7776: selecting the compression set, we can also try

7777: to threshold in the previous construction $\alpha_i^{**}$

7778: at different levels $\eta \geq 0$, to produce candidate

7779: compression sets

7780: $J_{\eta} = \{ i : 1 \leq i \leq N, \alpha^{**}_i > \eta \}$

7781: of various sizes.

7782:

7783: As the size $\lvert J \rvert$ of the compression

7784: set is random in this construction, we have to

7785: use a version of Theorem \ref{thm2.3.3} (page

7786: \pageref{thm2.3.3}) which handles compression

7787: sets of arbitrary sizes. This is done by choosing

7788: for each $k$ a $k$-partially exchangeable posterior distribution

7789: $\pi_k$ which weights the compression sets of all dimensions.

7790: We immediately see that we can choose $\pi_k$ such that

7791: $- \log \bigl[ \pi_k (\Delta_k(J)) \bigr]

7792: \leq \log \bigl[ \lvert J \rvert (\lvert J \rvert + 1)

7793: \bigr] + \lvert J \rvert  \log \Bigl[

7794: \tfrac{(k+1)eN}{\lvert J \rvert} \Bigr]$.

7795:

7796: If we observe the shadow sample patterns, and if computer

7797: resources permit, we can of

7798: course use more elaborate bounds than Theorem \ref{thm2.3.3},

7799: such as the transductive counterpart of Theorem \ref{thm1.24}

7800: (page \pageref{thm1.24}) (where we may consider the submodels

7801: made of all the compression sets of the same size). Theorems

7802: based on relative bounds, such as Theorem \ref{thm1.59} (

7803: page \pageref{thm1.59}) can also be used. Gibbs distributions

7804: can be approximated by Monte Carlo techniques, where

7805: a Markov chain with the proper invariant measure

7806: consists in suitable local perturbations of the

7807: compression set.

7808:

7809: Let us mention also that the use of compression schemes based

7810: on Support Vector Machines

7811: can be tailored to perform some kind of {\em feature aggregation}.

7812: Imagine that the kernel $K$ is defined as the scalar

7813: product in $L_2(\pi)$, where $\pi \in \C{M}_+^1(\Theta)$.

7814: More precisely let us consider for some set of

7815: soft classification rules $\bigl\{ f_{\theta} : \C{X} \rightarrow

7816: \RR\,; \theta \in \Theta \bigr\}$ the kernel

7817: $$

7818: K(x,x') = \int_{\theta \in \Theta} f_{\theta}(x) f_{\theta}(x')

7819: \pi(d \theta).

7820: $$

7821: In this setting, the Support Vector Machine

7822: applied to the training set $Z = (x_i, y_i)_{i=1}^N$

7823: has the form

7824: $$

7825: f_{Z}(x) = \sign \left( \int_{\theta \in \Theta} f_{\theta}(x)

7826: \sum_{i=1}^N y_i \alpha_i

7827: f_{\theta}(x_i) \pi(d \theta) - b \right)

7828: $$

7829: and, should it be too burdensome to compute,

7830: we can replace it with some finite approximation

7831: $$

7832: \widetilde{f}_{Z}(x) = \sign \left(

7833: \sum_{k=1}^m f_{\theta_k}(x) w_k - b \right),

7834: $$

7835: where the set $\{\theta_k,\, k=1, \dots, m\}$ and the

7836: weights $\{ w_k,\,k=1, \dots, m\}$ are computed

7837: in some suitable way from $Z' = (x_i, y_i)_{i , \alpha_i > 0}$,

7838: the set of support vectors

7839: of $f_Z$. For instance,

7840: we can draw $\{ \theta_k,\,k=1, \dots, m\}$ at random according to

7841: the probability distribution proportional to

7842: $$

7843: \left\lvert \sum_{i=1}^N y_i \alpha_i f_{\theta}(x_i) \right\rvert

7844: \pi(d \theta),

7845: $$

7846: define the weights $w_k$ by

7847: $$

7848: w_k =

7849: \sign \left( \sum_{i=1}^N y_i \alpha_i f_{\theta_k}(x_i)

7850: \right) \int_{\theta \in \Theta} \left\lvert

7851: \sum_{i = 1}^N y_i \alpha_i f_{\theta}(x_i) \right\rvert \pi(d\theta),

7852: $$

7853: and choose the smallest value of $m$ for which this approximation

7854: still classifies $Z'$ without errors.

7855: Let us remark that we have built

7856: $\widetilde{f}_Z$ in such a way that

7857: $$

7858: \lim_{m \rightarrow + \infty}

7859: \widetilde{f}_Z(x_i) = f_Z(x_i) = y_i, \quad \text{a.s.}

7860: $$ for any support index

7861: $i$ such that $\alpha_i > 0$.

7862:

7863: Alternatively, given $Z'$, we can select a finite set of features

7864: $\Theta' \subset \Theta$ such that $Z'$ is $K_{\Theta'}$ separable,

7865: where

7866: $K_{\Theta'}(x,x') = \sum_{\theta \in \Theta'}

7867: f_{\theta}(x) f_{\theta}(x')$

7868: and consider the Support Vector Machines $f_{Z'}$ built with the

7869: kernel $K_{\Theta'}$. As soon as $\Theta'$ is chosen as a function

7870: of $Z'$ only, Theorem \ref{thm2.3.3} (page \pageref{thm2.3.3}) applies

7871: and provides

7872: some level of confidence for the risk of $f_{Z'}$.

7873:

7874: \subsubsection{The Vapnik--Cervonenkis dimension

7875: of a family of subsets}

7876:

7877: Let us consider some set $X$ and some set

7878: $S \subset \{0,1\}^X$ of subsets of $X$.

7879: Let $h(S)$ be the VC dimension of $S$, defined as

7880: $$

7881: h(S) = \max \{ \lvert A \rvert : A \text{ finite and }

7882: A \cap S = \{0,1\}^{A} \},

7883: $$

7884: where by definition $A \cap S = \{ A \cap B : B \in S \}$.

7885: Let us notice that this definition does not depend on

7886: the choice of the reference set $X$. Indeed $X$ can

7887: be chosen to be $\bigcup S$, the union of all the sets in $S$

7888: or any bigger set. Let us notice also that for any set $B$,

7889: $h(B \cap S) \leq h(S)$, the reason being that

7890: $A \cap (B \cap S) = B \cap (A \cap S)$.

7891:

7892: This notion of VC dimension is useful because

7893: it can, as we will see about Support Vector

7894: Machines, be computed in some important special cases.

7895: Let us prove here as an illustration that

7896: $h(S) = d+1$ when $X = \RR^d$

7897: and $S$ is made of all the half spaces :

7898: $$

7899: S = \{ A_{w,b}\,: w \in \RR^d, b \in \RR \},

7900: \text{ where } A_{w,b} = \{ x \in X \,:

7901: \langle w, x \rangle \geq b \}.

7902: $$

7903: \begin{prop}\mypoint

7904: With the previous notations, $h(S) = d+1$.

7905: \end{prop}

7906: \begin{proof}

7907: Let $(e_i)_{i=1}^{d+1}$ be the canonical basis of $\RR^{d+1}$,

7908: and let $X$ be the affine subspace it generates, which

7909: can be identified with $\RR^d$. For any $(\epsilon_i)_{i=1}^{d+1}

7910: \in \{-1,+1\}^{d+1}$, let $w = \sum_{i=1}^{d+1} \epsilon_i e_i$

7911: and $b = 0$. The half space $A_{w,b} \cap X$ is such that

7912: $\{e_i\,; i=1, \dots, d+1 \} \cap (A_{w,b} \cap X) = \{ e_i \,;

7913: \epsilon_i = +1 \}$. This proves that $h(S) \geq d + 1$.

7914:

7915: To prove that $h(S) \leq d + 1$, we have to show that

7916: for any set $A \subset \RR^d$

7917: of size $|A| = d+2$, there is $B \subset A$ such

7918: that $B \not\in (A \cap S)$. This will obviously

7919: be the case if the convex hulls of $B$ and $A \setminus

7920: B$ have a non empty intersection : indeed if a hyperplane

7921: separates two sets of points, it also separates

7922: their convex hulls. As $\lvert A \rvert

7923: > d+1$, $A$ is affine dependent : there is

7924: $(\lambda_x)_{x \in A} \in \RR^{d+2} \setminus

7925: \{0\}$ such that

7926: $\sum_{x \in A} \lambda_x x = 0$ and $\sum_{x \in A}

7927: \lambda_x = 0$. The set

7928: $B = \{ x \in A\,: \lambda_x > 0\}$ is non-empty,

7929: as well as its complement $A \setminus B$,

7930: because $\sum_{x \in A} \lambda_x = 0$ and $\lambda \neq

7931: 0$. Moreover $\sum_{x \in B} \lambda_x =

7932: \sum_{x \in A \setminus B} - \lambda_x > 0$.

7933: The relation

7934: $$

7935: \frac{1}{\sum_{x \in B} \lambda_x} \sum_{x \in B}

7936: \lambda_x x = \frac{1}{\sum_{x \in B} \lambda_x}

7937: \sum_{x \in A \setminus B} - \lambda_x x

7938: $$

7939: shows that the convex hulls of $B$ and $A \setminus B$

7940: have a non void intersection.

7941: \end{proof}
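
To illustrate the second half of the proof in the case $d = 2$, one may take
for $A$ the four vertices of the unit square (an example chosen here for
concreteness), $a_1 = (0,0)$, $a_2 = (1,0)$, $a_3 = (0,1)$, $a_4 = (1,1)$,
with the weights $(\lambda_{a_1}, \lambda_{a_2}, \lambda_{a_3}, \lambda_{a_4}) = (1,-1,-1,1)$.
Then $\sum_{x \in A} \lambda_x = 0$ and $\sum_{x \in A} \lambda_x x = a_1 - a_2 - a_3 + a_4 = 0$,
so that $B = \{ a_1, a_4 \}$ and
$$
\tfrac{1}{2} (a_1 + a_4) = \tfrac{1}{2} (a_2 + a_3) = \bigl( \tfrac{1}{2}, \tfrac{1}{2} \bigr)
$$
belongs to both convex hulls: no half plane contains $a_1$ and $a_4$
while excluding $a_2$ and $a_3$.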

7942:

7943: Let us introduce the function of two integers

7944: $$

7945: \Phi_n^h = \sum_{k=0}^h \binom{n}{k}

7946: $$

7947: Let us notice that $\Phi$ can alternatively be defined

7948: by the relations :

7949: $$

7950: \Phi_n^h =

7951: \begin{cases}

7952: 2^n & \text{ when } n \leq h,\\

7953: \Phi_{n-1}^{h-1} + \Phi_{n-1}^h & \text{ when } n > h.

7954: \end{cases}

7955: $$
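
The second line of this recurrence can be checked from Pascal's rule
$\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$; the following short
computation is a sketch, assuming $h \geq 1$:
$$
\Phi_n^h = \sum_{k=0}^h \binom{n-1}{k} + \sum_{k=1}^h \binom{n-1}{k-1}
= \Phi_{n-1}^h + \Phi_{n-1}^{h-1}.
$$
For instance $\Phi_4^2 = 1 + 4 + 6 = 11 = 4 + 7 = \Phi_3^1 + \Phi_3^2$.
The first line simply expresses that $\binom{n}{k} = 0$ when $k > n$.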

7956: \begin{thm}\mypoint

7957: \label{th1}

7958: Whenever $\bigcup S$ is finite,

7959: $$

7960: \lvert S \rvert \leq

7961: \Phi_{\left\lvert \bigcup S \right\rvert}^{h(S)}.

7962: $$

7963: \end{thm}

7964: \begin{thm}\mypoint

7965: \label{th2}

7966: For any $h \leq n$,

7967: $$

7968: \Phi_n^h \leq \exp \bigl( n H(\tfrac{h}{n}) \bigr)

7969: \leq \exp \bigl[ h \bigl( \log ( \tfrac{n}{h} ) + 1 \bigr) \bigr],

7970: $$

7971: where $H(p) = - p \log(p) - (1-p)\log(1-p)$ is the Shannon

7972: entropy of the Bernoulli distribution with parameter $p$.

7973: \end{thm}

7974: {\sc Proof of theorem \ref{th1}.}

7975: Let us prove this theorem by induction on $\left\lvert \bigcup

7976: S \right\rvert$. It is easy to check that it holds

7977: true when $\left\lvert \bigcup

7978: S \right\rvert = 1$.

7979: Let $X = \bigcup S$, let

7980: $x \in X$ and $X' = X \setminus \{x\}$. Define ($\bigtriangleup$

7981: denoting the symmetric difference of two sets)

7982: \begin{align*}

7983: S' & = \{ A \in S : A \bigtriangleup \{x\} \in S \},\\

7984: S'' & = \{ A \in S : A \bigtriangleup \{x\} \not\in S \}.

7985: \end{align*}

7986: Clearly, $\sqcup$ denoting the disjoint union,

7987: $S = S' \sqcup S''$ and $S \cap X' = (S' \cap X')

7988: \sqcup (S'' \cap X')$. Moreover $\lvert S' \rvert =

7989: 2 \lvert S' \cap X' \rvert$ and $\lvert S'' \rvert = \lvert

7990: S'' \cap X' \rvert$. Thus $\lvert S \rvert =

7991: \lvert S' \rvert + \lvert S'' \rvert = 2 \lvert S' \cap X' \rvert

7992: + \lvert S'' \rvert = \lvert S \cap X' \rvert + \lvert S' \cap

7993: X' \rvert$. Obviously $h(S \cap X') \leq h(S)$. Moreover

7994: $h(S' \cap X') = h(S') - 1$, because if $A \subset X'$

7995: is shattered by $S'$ (or equivalently by $S' \cap X'$),

7996: then $A \cup \{x\}$ is shattered by $S'$ (we say that $A$

7997: is shattered by $S$ when $S \cap A = \{0,1\}^A$).

7998: Using the induction hypothesis, we then see that

7999: $\lvert S \rvert \leq \Phi_{\lvert X' \rvert}^{h(S)}

8000: + \Phi_{\lvert X' \rvert}^{h(S)-1}$. But as $\lvert X' \rvert =

8001: \lvert X \rvert - 1$, the right-hand side of this inequality

8002: is equal to $\Phi_{\lvert X \rvert}^{h(S)}$, according to

8003: the recurrence equation satisfied by $\Phi$.

8004:

8005: {\sc Proof of theorem \ref{th2}:}

8006: This is the well known Chernoff bound for the deviation of sums

8007: of Bernoulli r.v.: let $(\sigma_1, \dots, \sigma_n)$ be i.i.d.

8008: Bernoulli r.v. with parameter $1/2$. Let us notice that

8009: $$

8010: \Phi_n^h = 2^n \PP \left( \sum_{i=1}^n \sigma_i \leq h \right).

8011: $$

8012: For any positive real number $\lambda$ ,

8013: \begin{align*}

8014: \PP( \sum_{i=1}^n \sigma_i \leq h ) & \leq \exp (\lambda h) \EE \left[

8015: \exp \left( - \lambda \sum_{i=1}^n \sigma_i \right) \right] \\ & =

8016: \exp \Bigl\{ \lambda h + n \log \bigl\{

8017: \EE \bigl[ \exp \bigl( - \lambda \sigma_1 \bigr)

8018: \bigr] \bigr\} \Bigr\}.

8019: \end{align*}

8020: Differentiating the right-hand side in $\lambda$ shows that its

8021: minimal value is \linebreak

8022: $\exp \bigl[ - n \C{K}(\tfrac{h}{n},\tfrac{1}{2}) \bigr]$,

8023: where $\C{K}(p,q) = p \log(\tfrac{p}{q}) + (1-p) \log(\tfrac{1-p}{1-q})$

8024: is the Kullback divergence function between two Bernoulli distributions

8025: $B_p$ and $B_q$

8026: of parameters $p$ and $q$. Indeed the optimal value $\lambda^*$ of $\lambda$

8027: is such that $h = n \frac{\EE \bigl[\sigma_1 \exp ( - \lambda^* \sigma_1)

8028: \bigr]}{\EE \bigl[ \exp ( - \lambda^* \sigma_1) \bigr]}

8029: = n B_{h/n}(\sigma_1)$. Therefore (using the fact that two Bernoulli

8030: distributions with the same expectations are equal)

8031: $$

8032: \log \bigl\{ \EE \bigl[ \exp ( - \lambda^* \sigma_1)\bigr] \bigr\}

8033: = - \lambda^* B_{h/n}(\sigma_1) - \C{K}(B_{h/n},B_{1/2}) =

8034: - \lambda^* \tfrac{h}{n} - \C{K}(\tfrac{h}{n},\tfrac{1}{2}).

8035: $$

8036: The announced result then follows from

8037: the identity

8038: \begin{multline*}

8039: H(p) = \log(2) - \C{K}(p,\tfrac{1}{2}) \\= p \log(p^{-1})

8040: + (1- p) \log(1 + \frac{p}{1-p}) \leq p \bigl[ \log(p^{-1})+1\bigr].

8041: \end{multline*}
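
As a numerical illustration (the values $n = 10$ and $h = 2$ are chosen
arbitrarily, with $h/n \leq 1/2$ so that the optimal $\lambda^*$ above is
indeed non-negative), one finds
$$
\Phi_{10}^2 = 1 + 10 + 45 = 56, \qquad
\exp \bigl( 10 \, H(\tfrac{1}{5}) \bigr) \approx 149.0, \qquad
\exp \bigl[ 2 \bigl( \log(5) + 1 \bigr) \bigr] = 25 e^2 \approx 184.7,
$$
in agreement with the two inequalities of Theorem \ref{th2}.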

8042:

8043: \subsubsection{VC dimension of linear rules with margin}

8044: The proof of the following theorem has been suggested to us

8045: by a similar proof presented in \cite{Cristianini}.

8046: \begin{thm}\mypoint

8047: \label{chap5Th1.1}

8048: Consider a family of points $(x_1, \dots, x_n)$ in some Euclidean

8049: vector space $E$ and a family of affine functions

8050: $$

8051: \C{H} = \bigl\{ g_{w,b} : E \rightarrow \RR\,; w \in E, \lVert w \rVert = 1,

8052: b \in \RR \bigr\},

8053: $$

8054: where

8055: $$

8056: g_{w,b}(x) = \langle w, x \rangle - b, \qquad x \in E.

8057: $$

8058:

8059: Assume that there is a set of thresholds $(b_i)_{i=1}^n

8060: \in \RR^n$ such that for any \linebreak $(y_i)_{i=1}^n \in \{-1,+1\}^n$,

8061: there is $g_{w,b} \in \C{H}$ such that

8062: $$

8063: \inf_{i=1}^n  \bigl( g_{w,b}(x_i) - b_i \bigr) y_i \geq

8064: \gamma.

8065: $$

8066: Let us also introduce the empirical variance of $(x_i)_{i=1}^n$,

8067: $$

8068: \Var(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^n

8069: \biggl\lVert x_i - \frac{1}{n} \sum_{j=1}^n x_j \biggr\rVert^2.

8070: $$

8071: In this case and with these notations,

8072: \begin{equation}

8073: \label{firstPart}

8074: \frac{\Var(x_1, \dots, x_n)}{\gamma^2} \geq

8075: \begin{cases}

8076: n-1 & \text{ when } n \text{ is even,}\\

8077: (n-1) \frac{n^2 - 1}{n^2} & \text{ when } n \text{ is odd.}

8078: \end{cases}

8079: \end{equation}

8080: Moreover, equality is reached when $\gamma$ is optimal,

8081: $b_i = 0$, $i = 1, \dots, n$

8082: and $(x_1, \dots, x_n)$

8083: is a regular simplex

8084: (i.e. when $2 \gamma$ is the minimum distance

8085: between the convex hulls of any two subsets of $\{x_1, \dots, x_n\}$

8086: and $\lVert x_i - x_j \rVert$ does not depend on $i \neq j$).

8087: \end{thm}

8088: \begin{proof}

8089: Let $(s_i)_{i=1}^n \in \RR^n$ be such that $\sum_{i=1}^n s_i = 0$.

8090: Let $\sigma$ be a uniformly distributed random variable with values

8091: in $\mathfrak{S}_{n}$, the set of permutations of the first $n$

8092: integers $\{1, \dots, n \}$. By assumption, for any value of $\sigma$,

8093: there is an affine function $g_{w,b} \in \C{H}$ such that

8094: $$

8095: \min_{i=1, \dots, n} \bigl[ g_{w,b}(x_i) - b_i \bigr] \bigl[

8096: 2 \B{1}(s_{\sigma(i)} > 0) - 1 \bigr] \geq \gamma.

8097: $$

8098: As a consequence

8099: \begin{align*}

8100: \left\langle \sum_{i=1}^n s_{\sigma(i)} x_i, w \right\rangle

8101: & =

8102: \sum_{i=1}^n s_{\sigma(i)} \bigl( \langle x_i, w \rangle - b - b_i\bigr)

8103: + \sum_{i=1}^n s_{\sigma(i)} b_i\\

8104: & \geq \sum_{i=1}^n

8105: \gamma \lvert s_{\sigma(i)} \rvert + s_{\sigma(i)} b_i.

8106: \end{align*}

8107: Therefore, using the fact that the map $x \mapsto

8108: \Bigl(\max \bigl\{0,x\bigr\}\Bigr)^2$ is convex,

8109: \begin{multline*}

8110: \EE \left(

8111: \biggl\lVert \sum_{i=1}^n s_{\sigma(i)} x_i \biggr\rVert^2 \right)

8112: \geq

8113: \EE \left[ \left( \max \left\{ 0,

8114: \sum_{i=1}^n \gamma \lvert s_{\sigma(i)} \rvert + s_{\sigma(i)} b_i

8115: \right\} \right)^2 \right] \\ \geq

8116: \left(\max \left\{ 0, \sum_{i=1}^n \gamma \EE \bigl(

8117: \lvert s_{\sigma(i)} \rvert \bigr) + \EE \bigl( s_{\sigma(i)} \bigr)

8118: b_i \right\} \right)^2

8119: = \gamma^2 \left( \sum_{i=1}^n \lvert s_i \rvert \right)^2,

8120: \end{multline*}

8121: where $\EE$ is the expectation with respect to the random permutation

8122: $\sigma$.

8123: On the other hand

8124: $$

8125: \EE \left( \biggl\lVert \sum_{i=1}^n s_{\sigma(i)} x_i \biggr\rVert^2 \right)

8126: = \sum_{i=1}^n \EE(s_{\sigma(i)}^2) \lVert x_i \rVert^2 +

8127: \sum_{i\neq j} \EE(s_{\sigma(i)} s_{\sigma(j)}) \langle x_i, x_j \rangle.

8128: $$

8129: Moreover

8130: $$

8131: \EE ( s_{\sigma(i)}^2 ) = \frac{1}{n} \EE \left(

8132: \sum_{i=1}^n s_{\sigma(i)}^2 \right) = \frac{1}{n} \sum_{i=1}^n

8133: s_i^2.

8134: $$

8135: In the same way, for any $i \neq j$,

8136: \begin{align*}

8137: \EE \left( s_{\sigma(i)} s_{\sigma(j)} \right) & =

8138: \frac{1}{n(n-1)} \EE \left( \sum_{i \neq j} s_{\sigma(i)} s_{\sigma(j)}

8139: \right) \\ & = \frac{1}{n(n-1)} \sum_{i\neq j} s_i s_j\\

8140: & = \frac{1}{n(n-1)} \Biggl[

8141: \Biggl( \underbrace{\sum_{i=1}^n s_i}_{=0} \Biggr)^2 - \sum_{i=1}^n s_i^2

8142: \Biggr] \\ & = - \frac{1}{n(n-1)} \sum_{i=1}^n s_i^2.

8143: \end{align*}

8144: Thus

8145: \begin{align*}

8146: \EE \left( \biggl\lVert \sum_{i=1}^n s_{\sigma(i)} x_i \biggr\rVert^2 \right)

8147: & = \left( \sum_{i=1}^n s_i^2 \right) \left[ \frac{1}{n} \sum_{i=1}^n \lVert

8148: x_i \rVert^2 -

8149: \frac{1}{n(n-1)} \sum_{i\neq j} \langle x_i, x_j \rangle \right] \\ & =

8150: \left( \sum_{i=1}^n s_i^2 \right) \Biggl[

8151: \left( \frac{1}{n} + \frac{1}{n(n-1)} \right) \sum_{i=1}^n \lVert x_i \rVert^2

8152: \\ & \qquad - \frac{1}{n(n-1)} \biggl\lVert \sum_{i=1}^n x_i

8153: \biggr\rVert^2 \Biggr] \\ & =

8154: \frac{n}{n-1} \left( \sum_{i=1}^n s_i^2 \right) \Var(x_1, \dots, x_n).

8155: \end{align*}

8156: We have proved that

8157: $$

8158: \frac{\Var(x_1, \dots, x_n)}{\gamma^2} \geq \frac{\ds (n-1) \biggl(

8159: \sum_{i=1}^n \lvert s_i \rvert \biggr)^2}{\ds n \sum_{i=1}^n s_i^2}.

8160: $$

8161: This can be used with $s_i = \B{1}( i \leq \frac{n}{2}) - \B{1}(

8162: i > \frac{n}{2})$ in the case when $n$ is even and

8163: $s_i = \frac{2}{(n-1)} \B{1}( i \leq \frac{n-1}{2} ) -

8164: \frac{2}{n+1} \B{1}(i > \frac{n-1}{2} )$ in the case when

8165: $n$ is odd to establish the first inequality \eqref{firstPart} of the theorem.

8166:

8167: Checking that equality is reached for the simplex is an easy computation

8168: when the simplex $(x_i)_{i=1}^n \in (\RR^n)^n$ is parametrized in such a

8169: way that

8170: $$

8171: x_i(j) = \begin{cases}

8172: 1 & \text{ if } i = j,\\

8173: 0 & \text{ otherwise.}

8174: \end{cases}

8175: $$

8176: Indeed the distance between the convex hulls of any two subsets of

8177: the simplex is the distance between their mean values (i.e. centers of mass).

8178: \end{proof}
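
The computation alluded to at the end of the proof can be sketched as follows
(the integers $p$ and $q$ are notation introduced for this sketch only).
With $x_i = e_i$, the mean of the $x_i$ is $(1/n, \dots, 1/n)$, so that
$$
\Var(x_1, \dots, x_n) = \Bigl( 1 - \frac{1}{n} \Bigr)^2 + \frac{n-1}{n^2} = 1 - \frac{1}{n}.
$$
The centers of mass of two disjoint subsets of sizes $p$ and $q$ are at
squared distance $\frac{1}{p} + \frac{1}{q}$, which is minimal when the two
subsets are complementary and as balanced as possible. Hence
$$
(2 \gamma)^2 =
\begin{cases}
\frac{4}{n} & \text{ when } n \text{ is even,}\\
\frac{2}{n-1} + \frac{2}{n+1} = \frac{4n}{n^2 - 1} & \text{ when } n \text{ is odd,}
\end{cases}
$$
and dividing $1 - \frac{1}{n}$ by $\gamma^2$ gives $n - 1$ in the first case
and $(n-1) \frac{n^2 - 1}{n^2}$ in the second one, as in \eqref{firstPart}.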

8179:

8180: \subsubsection{Application to Support Vector Machines}

8181:

8182: We are going to apply Theorem \ref{chap5Th1.1} (page

8183: \pageref{chap5Th1.1}) to Support Vector

8184: Machines in the transductive case. So let us consider

8185: $(X_i, Y_i)_{i=1}^{(k+1)N}$ distributed according to some partially exchangeable

8186: distribution $\PP$ and assume that $(X_i)_{i=1}^{(k+1)N}$ and

8187: $(Y_i)_{i=1}^N$ are observed. Let us consider some positive

8188: kernel $K$ on $\C{X}$. For any $K$-separable training set of

8189: the form $Z' = (X_i,y_i')_{i=1}^{(k+1)N}$, where $(y_i')_{i=1}^{(k+1)N}

8190: \in \C{Y}^{(k+1)N}$, let $\Hat{f}_{Z'}$ be the Support Vector Machine

8191: defined by $K$ and $Z'$ and let $\gamma(Z')$ be its margin.

8192: Let

8193: \begin{multline*}

8194: R^2 = \max_{i=1, \dots, (k+1)N} \biggl[ K(X_i,X_i) + \frac{1}{(k+1)^2 N^2}

8195: \sum_{j=1}^{(k+1)N} \sum_{\ell=1}^{(k+1)N} K(X_j,X_{\ell}) \\

8196: - \frac{2}{(k+1)N}

8197: \sum_{j=1}^{(k+1)N} K(X_i,X_j) \biggr].

8198: \end{multline*}

8199: (This is an easily computable upper bound for the radius

8200: of some ball containing the image of $(X_1, \dots, X_{(k+1)N})$

8201: in feature space.)
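To see where this expression may come from, let us write $M = (k+1)N$ and let $\Phi$ denote a feature map such that $K(x,x') = \langle \Phi(x), \Phi(x') \rangle$ (the symbols $M$ and $\Phi$ are introduced here for illustration only). Expanding the squared distance between $\Phi(X_i)$ and the center of mass of the sample in feature space gives
$$
\biggl\lVert \Phi(X_i) - \frac{1}{M} \sum_{j=1}^M \Phi(X_j) \biggr\rVert^2
= K(X_i,X_i) - \frac{2}{M} \sum_{j=1}^M K(X_i,X_j)
+ \frac{1}{M^2} \sum_{j=1}^M \sum_{\ell=1}^M K(X_j,X_{\ell}).
$$
Under this reading, the maximum over $i$ of the right-hand side is $R^2$, so that $R$ is the radius of the smallest ball centered at the center of mass which contains the image of the sample.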

8202:

8203: Let us define for any integer $h$ the margins

8204: \begin{equation}

8205: \label{margin}

8206: \gamma_{2h} = (2h - 1)^{-1/2}

8207: \text{ and } \gamma_{2h+1} = \left[ 2h\left(

8208: 1 - \frac{1}{(2h+1)^2}\right) \right]^{-1/2}.

8209: \end{equation}
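For instance, the first few values given by equation \eqref{margin} are (a direct numerical evaluation, added as an illustration)
$$
\gamma_2 = 1, \qquad
\gamma_3 = \bigl( \tfrac{16}{9} \bigr)^{-1/2} = \tfrac{3}{4}, \qquad
\gamma_4 = 3^{-1/2} \simeq 0.577, \qquad
\gamma_5 = \bigl( \tfrac{96}{25} \bigr)^{-1/2} \simeq 0.510,
$$
so that $\gamma_h$ decreases roughly like $h^{-1/2}$.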

8210: Let us consider for any $h =1, \dots, N$ the exchangeable model

8211: $$

8212: \C{R}_h = \bigl\{ \Hat{f}_{Z'}\,:Z' = (X_i, y_i')_{i=1}^{(k+1)N}

8213: \text{ is $K$-separable and } \gamma(Z') \geq R \gamma_h \bigr\}.

8214: $$

8215: The family of models $\C{R}_h$, $h=1, \dots, N$ is nested,

8216: and we know from Theorem \ref{chap5Th1.1} (page \pageref{chap5Th1.1}) and

8217: Theorems \ref{th1} (page \pageref{th1}) and

8218: \ref{th2} (page \pageref{th2}) that

8219: $$

8220: \log \bigl( \lvert \C{R}_h \rvert \bigr) \leq h \log

8221: \bigl( \tfrac{(k+1)e N}{h} \bigr).

8222: $$

8223: We can then consider on the large model $\C{R} = \bigsqcup_{h=1}^N

8224: \C{R}_h$ (the disjoint union of the submodels)

8225: an exchangeable prior $\pi$ which is uniform on each $\C{R}_h$

8226: and is such that $\pi(\C{R}_h) \geq \frac{1}{h(h+1)}$.
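Such a prior exists because the prescribed weights sum to at most one (a short verification, added as an illustration):
$$
\sum_{h=1}^N \frac{1}{h(h+1)} = \sum_{h=1}^N \Bigl( \frac{1}{h} - \frac{1}{h+1} \Bigr)
= 1 - \frac{1}{N+1} \leq 1.
$$
Since $\pi$ is uniform on each $\C{R}_h$, any $f$ in a non empty $\C{R}_h$ then satisfies
$$
- \log \bigl[ \pi(f) \bigr] \leq \log \bigl( \lvert \C{R}_h \rvert \bigr) + \log \bigl[ h(h+1) \bigr]
\leq h \log \bigl( \tfrac{(k+1)eN}{h} \bigr) + \log \bigl[ h(h+1) \bigr],
$$
which are the two complexity terms appearing in the exponent of the bound stated next.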

8227: Applying Theorem \ref{thm2.1.5}

8228: (page \pageref{thm2.1.5})

8229: we get

8230: \begin{proposition}\mypoint

8231: With $\PP$ probability at least $1 - \epsilon$, for any

8232: $h = 1, \dots, N$, any Support Vector Machine $f \in \C{R}_h$,

8233: \begin{multline*}

8234: r_2(f)  \leq \\*

8235: \frac{k+1}{k} \inf_{\lambda \in \RR_+}

8236: \frac{1 - \exp \Bigl[ - \frac{\lambda}{N} r_1(f) - \frac{h}{N} \log

8237: \Bigl( \frac{e(k+1)N}{h} \Bigr) - \frac{\log[h(h+1)] -

8238: \log(\epsilon)}{N}

8239: \Bigr]}{

8240: 1 - \exp( - \frac{\lambda}{N})} \\* - \frac{r_1(f)}{k}.

8241: \end{multline*}

8242: \end{proposition}

8243: Searching the whole model $\C{R}_h$ may be infeasible;

8244: nonetheless, any heuristic can be applied to choose $f$. For instance,

8245: a Support Vector Machine $f'$ can be trained from

8246: the training set $(X_i, Y_i)_{i=1}^N$ and then $(y'_i)_{i=1}^{

8247: (k+1)N}$ can be set to $y'_i = \sign(f'(X_i))$, $i = 1,

8248: \dots, (k+1)N$.

8249:

8250: \subsubsection[Inductive margin bounds]{Inductive margin bounds for Support

8251: Vector Machines}

8252:

8253: In order to establish inductive margin bounds, we will

8254: need a different combinatorial lemma. It is due to \cite{Alon}.

8255: We will reproduce their proof with some tiny improvements on

8256: the values of constants.

8257:

8258: Let us consider the finite case when $\C{X} = \{1, \dots, n\}$,

8259: $\C{Y} = \{1, \dots, b\}$ and \linebreak $b \geq 3$ (the question

8260: we will study would be meaningless in the case when $b \leq 2$). Assume as usual that we are

8261: dealing with a prescribed set of classification rules

8262: \linebreak $\C{R} = \bigl\{ f : \C{X} \rightarrow \C{Y} \bigr\}$.

8263: Let us say that a pair $(A,s)$, where $A \subset \C{X}$

8264: is a non empty set of shapes

8265: and $s : A \rightarrow \{2, \dots, b-1\}$ a threshold function,

8266: is {\em shattered}

8267: by the set of functions $F \subset \C{R}$

8268: if for any $(\sigma_x)_{x \in A} \in \{-1,+1\}^{A}$,

8269: there exists some $f \in F$ such that $\min_{x \in A}

8270: \sigma_x \bigl[ f(x) - s(x) \bigr] \geq 1$.
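As a toy illustration of this definition (an example which is not part of the original argument), take $b = 3$, a single shape $A = \{x\}$ and the threshold $s(x) = 2$. The set $F = \{f_-, f_+\}$, where $f_-(x) = 1$ and $f_+(x) = 3$, shatters $(A,s)$, since
$$
\sigma_x = +1: \quad f_+(x) - s(x) = 1, \qquad
\sigma_x = -1: \quad - \bigl[ f_-(x) - s(x) \bigr] = 1.
$$
Let us also notice that a threshold function takes its values in $\{2, \dots, b-1\}$, a set of size $b-2$, so that there are exactly $\binom{n}{i} (b-2)^i$ pairs $(A,s)$ such that $\lvert A \rvert = i$.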

8271:

8272: \begin{dfn}\mypoint

8273: \label{fatDef}

8274: Let the {\em fat shattering

8275: dimension} of $(\C{X},\C{R})$ be the maximal size $\lvert A \rvert$

8276: of the first component of the pairs which are shattered by $\C{R}$.

8277: \end{dfn}

8278:

8279: Let us say that a subset of classification rules $F \subset

8280: \C{Y}^{\C{X}}$ is {\em separated} whenever for any pair

8281: $(f,g) \in F^2$ such that $f\neq g$, $\lVert f - g \rVert_{\infty}

8282: = \max_{x \in \C{X}} \lvert f(x) - g(x) \rvert \geq 2$.

8283: Let $\mathfrak{M}(\C{R})$ be the maximum size $\lvert F \rvert$

8284: of separated subsets $F$ of $\C{R}$. Note that if $F$ is a

8285: separated subset of $\C{R}$ such that $\lvert F \rvert =

8286: \mathfrak{M}(\C{R})$, then it is a $1$-net for the $\C{L}_{\infty}$

8287: distance: for any function $f \in \C{R}$ there exists $g \in F$

8288: such that $\lVert f - g \rVert_{\infty} \leq 1$ (otherwise $f$ could be

8289: added to $F$ to create a larger separated set).

8290:

8291: \begin{lemma}\mypoint

8292: \label{lemma3.1}

8293: With the above notations,

8294: whenever the fat shattering dimension of

8295: $(\C{X}, \C{R})$ is not greater than $h$,

8296: \begin{multline*}

8297: \log \bigl[ \mathfrak{M}(\C{R}) \bigr] < \log \bigl[ (b-1)(b-2) n \bigr]

8298: \Biggl\{\frac{\log \bigl[ \sum_{i=1}^h \binom{n}{i} (b-2)^i \bigr]}{

8299: \log(2)}+1 \Biggr\} + \log(2)

8300: \\ \leq \log \bigl[ (b-1)(b-2) n \bigr]

8301: \Biggl\{ \biggl[ \log \Bigl[ \tfrac{(b-2) n}{h}

8302: \Bigr] + 1 \biggr] \frac{h}{\log(2)} + 1\Biggr\} + \log(2).

8303: \end{multline*}

8304: \end{lemma}

8305: \begin{proof}

8306: For any set of functions $F \subset \C{Y}^{\C{X}}$,

8307: let $t(F)$ be the number of pairs $(A, s)$ shattered by $F$.

8308: Let $t(m,n)$ be the minimum of $t(F)$ over

8309: all {\em separated} sets of functions $F \subset \C{Y}^{\C{X}}$ of size $\lvert

8310: F \rvert = m$ ($n$ is here to recall that the shape space $\C{X}$

8311: is made of $n$ shapes). For any $m$ such that $t(m,n) > \sum_{i=1}^h

8312: \binom{n}{i} (b-2)^i$, it is clear that any separated set of functions

8313: of size $\lvert F \rvert \geq m$ shatters at least one pair

8314: $(A,s)$ such that $\lvert A \rvert > h$. Indeed, $t(m,n)$ is

8315: clearly from its definition a non decreasing function of $m$,

8316: so that $t(\lvert F \rvert, n) > \sum_{i=1}^h \binom{n}{i}

8317: (b-2)^i$.

8318: Moreover there are only $\sum_{i=1}^h \binom{n}{i}(b-2)^i$

8319: pairs $(A,s)$ such that $\lvert A \rvert \leq h$.

8320: As a consequence, whenever the fat shattering dimension

8321: of $(\C{X}, \C{R})$ is not greater than $h$ we have $\mathfrak{M}(\C{R})

8322: < m$.

8323:

8324: It is clear that for any $n \geq 1$, $t(2,n) = 1$.

8325: \begin{lemma}\mypoint

8326: For any $m \geq 1$,

8327: $t\bigl[mn(b-1)(b-2), n \bigr] \geq 2 t\bigl[ m, n-1 \bigr]$,

8328: and therefore $t\bigl[ 2 n(n-1) \dots (n-r+1) (b-1)^r(b-2)^r, n \bigr]

8329: \geq 2^r$.

8330: \end{lemma}

8331: \begin{proof}

8332: Let $F = \{f_1, \dots, f_{mn(b-1)(b-2)}\}$

8333: be some separated set of functions of size

8334: $mn(b-1)(b-2)$. For any pair $(f_{2i-1},f_{2i})$,

8335: $i=1,\dots, mn(b-1)(b-2)/2$, there is $x_i \in \C{X}$

8336: such that $\lvert f_{2i-1}(x_i) - f_{2i}(x_i) \rvert

8337: \geq 2$. Since $\lvert \C{X} \rvert = n$, there is

8338: $x \in \C{X}$ such that $\sum_{i=1}^{mn(b-1)(b-2)/2}

8339: \B{1}(x_i = x) \geq m(b-1)(b-2)/2$. Let $I = \{ i \,:

8340: x_i = x\}$.

8341: Since there are

8342: $(b-1)(b-2)/2$ pairs $(y_1,y_2) \in \C{Y}^2$

8343: such that $1\leq y_1 < y_2 - 1 \leq b -1$, there is some pair

8344: $(y_1,y_2)$ such that $1 \leq y_1 < y_2 - 1 \leq b - 1$

8345: and such that $\sum_{i\in I} \B{1}(\{y_1,y_2\} = \{f_{2i-1}(x),

8346: f_{2i}(x)\}) \geq m$.

8347: Let $J = \bigl\{i \in I\,: \{f_{2i-1}(x),f_{2i}(x)\} = \{y_1,y_2\}

8348: \bigr\}$. Let

8349: \begin{align*}

8350: F_1 & =

8351: \{ f_{2i-1} \,:i \in J, f_{2i-1}(x) = y_1\}

8352: \cup

8353: \{ f_{2i} \,:i \in J, f_{2i}(x) = y_1\},\\

8354: F_2 & =

8355: \{ f_{2i-1} \,:i \in J, f_{2i-1}(x) = y_2\}

8356: \cup

8357: \{ f_{2i} \,:i \in J, f_{2i}(x) = y_2\}.

8358: \end{align*}

8359: Removing some indices from $J$ if necessary, we may assume that $\lvert J \rvert = m$, so that $\lvert F_1 \rvert = \lvert F_2 \rvert =

8360: \lvert J \rvert = m$. Moreover the restrictions

8361: of the functions of $F_1$ to $\C{X} \setminus \{x\}$

8362: are separated, and it is the same with $F_2$. Thus

8363: $F_1$ shatters at least $t(m,n-1)$

8364: pairs $(A,s)$ such that $A \subset \C{X} \setminus \{x\}$

8365: and it is the same with $F_2$. Finally,

8366: if the pair $(A,s)$ where $A \subset \C{X} \setminus \{x\}$

8367: is shattered by both $F_1$ and $F_2$, then

8368: $F_1 \cup F_2$ also shatters $(A \cup \{x\}, s')$

8369: where $s'(x') = s(x')$ for any $x' \in A$ and $s'(x) =

8370: \lfloor \frac{y_1+y_2}{2} \rfloor$. Thus $F_1 \cup F_2$,

8371: and therefore $F$, shatters at least $2t(m,n-1)$

8372: pairs $(A,s)$.

8373: \end{proof}
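The second claim of the previous lemma follows from iterating the first one. The bookkeeping can be sketched as follows, assuming that $r < n$ and writing $m_j$ for the successive sizes (a notation introduced only for this sketch). Put $m_0 = 2$ and $m_j = m_{j-1} (n-r+j)(b-1)(b-2)$ for $j = 1, \dots, r$. Applying the first inequality on a shape space made of $n-r+j$ shapes gives
$$
t(m_j, n-r+j) \geq 2\, t(m_{j-1}, n-r+j-1), \qquad j = 1, \dots, r,
$$
and therefore
$$
t(m_r, n) \geq 2^r\, t(2, n-r) = 2^r, \quad \text{where } m_r = 2 n (n-1) \dots (n-r+1) (b-1)^r (b-2)^r.
$$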

8374:

8375: Resuming the proof of Lemma \ref{lemma3.1}, let us choose

8376: for $r$ the smallest integer such that

8377: $2^r > \sum_{i=1}^h \binom{n}{i} (b-2)^i$, which is no greater than

8378: \\ \mbox{} \hfill $\left\{ \frac{\log \bigl[ \sum_{i=1}^h \binom{n}{i} (b-2)^i \bigr]}{

8379: \log(2)} + 1 \right\}$.

8380: \hfill \mbox{}\\

8381: In the case when $1 \leq n \leq r$,

8382: $$

8383: \log( \mathfrak{M}(\C{R}) ) < {\lvert \C{X} \rvert} \log(\lvert \C{Y} \rvert)

8384:  = n \log(b) \leq r \log( b) \leq r \log \bigl[ (b-1)(b-2)n \bigr] + \log(2),

8385:  $$

8386:  which proves the lemma. In the remaining case $n > r$,

8387: \begin{multline*}

8388:  t \bigl[ 2 n^r (b-1)^r (b-2)^r, n \bigr]

8389: \\ \geq t \bigl[ 2n(n-1) \dots (n-r+1)(b-1)^r(b-2)^r, n\bigr]

8390:  \\ > \sum_{i=1}^h \binom{n}{i} (b-2)^i.

8391: \end{multline*}

8392:  Thus $\mathfrak{M}(\C{R}) < 2 \bigl[(b-1)(b-2)n\bigr]^r$ as

8393: claimed.

8394: \end{proof}

8395:

8396: In order to apply this combinatorial lemma to Support Vector

8397: Machines, let us consider now the case of separating

8398: hyperplanes in $\RR^d$ (the generalization to Support Vector Machines

8399: being straightforward).

8400: Assume that $\C{X} = \RR^d$ and

8401: $\C{Y}= \{-1,+1\}$.

8402: For any sample $(X_i)_{i=1}^{(k+1)N}$, let

8403: $$

8404: R(X_1^{(k+1)N}) = \max \{ \lVert X_i \rVert \,: 1 \leq i \leq (k+1)N \}.

8405: $$

8406: Let us consider the set of parameters

8407: $$

8408: \Theta = \bigl\{ (w,b) \in \RR^d \times \RR\,: \lVert w \rVert = 1 \bigr\}.

8409: $$

8410: For any $(w,b) \in \Theta$, let

8411: $g_{w,b}(x) = \langle w, x \rangle - b$.

8412: Let $h$ be some fixed integer and let $\gamma = R(X_1^{(k+1)N})\gamma_h$,

8413: where $\gamma_h$ is defined by equation \eqref{margin} on page \pageref{margin}.

8414:

8415: Let us define $\zeta : \RR \rightarrow \ZZ$ by

8416: $$

8417: \zeta (r) =

8418: \left\{

8419: \begin{aligned}

8420: -5  & & \text{ when }&&   & r \leq -4\gamma,\\

8421: -3  & & \text{ when }&&   -4 \gamma < & r \leq -2 \gamma,\\

8422: -1  & & \text{ when }&&  -2 \gamma < & r \leq 0,\\

8423: +1  & & \text{ when }&&   0 < & r \leq 2 \gamma,\\

8424: +3  & & \text{ when }&&   2 \gamma < & r \leq 4 \gamma,\\

8425: +5  & & \text{ when }&&   4 \gamma < & r.

8426: \end{aligned}\right.

8427: $$

8428: Let $G_{w,b}(x) = \zeta \bigl[ g_{w,b}(x) \bigr]$.

8429: The fat shattering dimension (as defined in Definition \ref{fatDef})

8430: of

8431: $$

8432: \Bigl( X_1^{(k+1)N}, \bigl\{ (G_{w,b}+7)/2 :

8433: (w,b) \in \Theta \bigr\} \Bigr)

8434: $$

8435: is not greater than $h$ (according to Theorem \ref{chap5Th1.1}, page

8436: \pageref{chap5Th1.1}),

8437: therefore there is some set $\C{F}$

8438: of functions from $X_1^{(k+1)N}$ to $\{-5,-3,-1,+1,+3,+5\}$

8439: such that

8440: $$

8441: \log \bigl(\lvert \C{F} \rvert \bigr) \leq

8442: \log\bigl[ 20(k+1) N \bigr] \Biggl\{ \frac{h}{\log(2)}

8443: \biggl[ \log \left( \frac{4(k+1)N}{h} \right) + 1 \biggr]

8444: + 1 \Biggr\} + \log(2),

8445: $$

8446: and

8447: for any $(w,b) \in \Theta$, there is

8448: $f_{w,b} \in \C{F}$ such that $\sup \bigl\{ \lvert f_{w,b}

8449: (X_i) - G_{w,b}(X_i) \rvert\,: i=1, \dots, (k+1)N \bigr\} \leq 2.$

8450: Moreover, the choice of $f_{w,b}$ may be required to depend

8451: on $(X_i)_{i=1}^{(k+1)N}$ in an exchangeable way.
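The constants in the bound on $\log \bigl( \lvert \C{F} \rvert \bigr)$ can be traced back to Lemma \ref{lemma3.1} in the following way (this is our reading of the argument, spelled out as an illustration). The relabelled functions $(G_{w,b}+7)/2$ take their values in $\{1, \dots, 6\}$, so that the lemma is applied with $b = 6$ and $n = (k+1)N$, giving
$$
(b-1)(b-2) n = 20 (k+1) N \quad \text{and} \quad \frac{(b-2)n}{h} = \frac{4(k+1)N}{h}.
$$
A maximal separated set is a $1$-net for the relabelled functions, and undoing the relabelling multiplies distances by $2$, which explains why $f_{w,b}$ and $G_{w,b}$ differ by at most $2$ on the sample.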

8452: Similarly to Theorem \ref{thm2.1.5} (page \pageref{thm2.1.5}),

8453: it can be proved that for any partially exchangeable probability

8454: distribution $\PP \in \C{M}_+^1 (\Omega)$,

8455: with $\PP$ probability at least $1 - \epsilon$,

8456: for any $f_{w,b} \in \C{F}$,

8457:

8458: \begin{multline*}

8459: \frac{1}{kN} \sum_{i=N+1}^{(k+1)N}

8460: \B{1}\bigl[f_{w,b}(X_i) Y_i \leq 1 \bigr] \\

8461: \begin{aligned} \leq \frac{k+1}{k} & \inf_{\lambda \in \RR_+}

8462: \bigl[ 1 - \exp( - \tfrac{\lambda}{N} ) \bigr]^{-1}

8463: \biggl\{ 1 - \\

8464: & \exp \biggl[ - \frac{\lambda}{N^2}

8465: \sum_{i=1}^N \B{1} \bigl[ f_{w,b}(X_i) Y_i \leq 1 \bigr]

8466: - \frac{\log \bigl( \lvert \C{F} \rvert \bigr) - \log(\epsilon)}{N}

8467: \biggr] \biggr\} \end{aligned}\\- \frac{1}{k N} \sum_{i=1}^{N} \B{1} \bigl[

8468: f_{w,b}(X_i) Y_i \leq 1 \bigr].

8469: \end{multline*}

8470:

8471: Let us remark that

8472: $$

8473: \B{1} \Bigl\{

8474: 2 \B{1} \bigl[g_{w,b}(X_i) \geq 0 \bigr] - 1 \neq Y_i \Bigr\}

8475: = \B{1}\bigl[ G_{w,b}(X_i) Y_i < 0 \bigr] \leq

8476: \B{1} \bigl[ f_{w,b}(X_i) Y_i \leq 1 \bigr]

8477: $$

8478: and

8479: $$

8480: \B{1}\bigl[ f_{w,b}(X_i) Y_i \leq 1 \bigr]

8481: \leq \B{1}\bigl[ G_{w,b}(X_i) Y_i \leq 3 \bigr]

8482: \leq \B{1} \bigl[ g_{w,b}(X_i) Y_i \leq 4 \gamma \bigr].

8483: $$
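The inequalities in these two remarks can be checked as follows (a short verification, using only that $Y_i \in \{-1,+1\}$, that $\lvert f_{w,b}(X_i) - G_{w,b}(X_i) \rvert \leq 2$ and the definition of $\zeta$). Since $G_{w,b}$ takes odd values,
$$
G_{w,b}(X_i) Y_i < 0 \implies G_{w,b}(X_i) Y_i \leq -1 \implies
f_{w,b}(X_i) Y_i \leq G_{w,b}(X_i) Y_i + 2 \leq 1,
$$
and in the same way
$$
f_{w,b}(X_i) Y_i \leq 1 \implies G_{w,b}(X_i) Y_i \leq f_{w,b}(X_i) Y_i + 2 \leq 3.
$$
Finally, if $g_{w,b}(X_i) Y_i > 4 \gamma$, then $\zeta \bigl[ g_{w,b}(X_i) \bigr] Y_i = 5 > 3$, so that $G_{w,b}(X_i) Y_i \leq 3$ implies $g_{w,b}(X_i) Y_i \leq 4 \gamma$.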

8484: This proves the following theorem.

8485: \begin{thm}\mypoint

8486: With $\PP$ probability at least

8487: $1 - \epsilon$, for any $(w,b) \in \Theta$,

8488: \begin{multline*}

8489: \frac{1}{kN} \sum_{i=N+1}^{(k+1)N}

8490: \B{1} \Bigl\{ 2 \B{1} \bigl[ g_{w,b}(X_i) \geq 0 \bigr] - 1 \neq Y_i \Bigr\}\\

8491: \begin{aligned} \leq \frac{k+1}{k} & \inf_{\lambda \in \RR_+, h \in \NN^*}

8492: \bigl[ 1 - \exp( - \tfrac{\lambda}{N} ) \bigr]^{-1}

8493: \Biggl\{ 1 - \\

8494: \exp \Biggl[ - & \frac{\lambda}{N^2}

8495:  \sum_{i=1}^N \B{1} \bigl[ g_{w,b}(X_i)Y_i \leq 4 R \gamma_h \bigr]

8496: \\ - & \frac{\log

8497: \bigl[ 20 (k+1)N \bigr] \Bigl\{

8498: \tfrac{h}{\log(2)} \log \Bigl( \tfrac{4e (k+1)N}{h} \Bigr)

8499: + 1 \Bigr\} + \log\Bigl[ \tfrac{2h(h+1)}{\epsilon} \Bigr] }{N}

8500: \Biggr] \Biggr\} \end{aligned}\\- \frac{1}{k N} \sum_{i=1}^{N} \B{1}

8501: \bigl[ g_{w,b}(X_i)Y_i \leq 4 R \gamma_h \bigr].

8502: \end{multline*}

8503: \end{thm}

8504: As a consequence,

8505: we obtain with $\PP$ probability at least $1 - \epsilon$,

8506: for any $(w,b) \in \Theta$ such that

8507: $$

8508: \gamma = \min_{i=1, \dots, N}  g_{w,b}(X_i)Y_i > 0,

8509: $$

8510: \begin{multline*}

8511: \frac{1}{kN} \sum_{i=N+1}^{(k+1)N}

8512: \B{1} \bigl[ g_{w,b}(X_i) Y_i < 0 \bigr]

8513: \\ \leq \tfrac{k+1}{k} \biggl\{

8514: 1 - \exp \biggl[ - \tfrac{\log\bigl[ 20(k+1)N \bigr] }{N}

8515: \Bigl\{ \tfrac{16 R^2 + 2 \gamma^2}{\log(2) \gamma^2}

8516: \log \Bigl( \tfrac{e (k+1)N \gamma^2}{4R^2} \Bigr)  + 1 \Bigr\}

8517: \\ +  \frac{1}{N} \log ( \tfrac{\epsilon}{2} ) \biggr] \biggr\}.

8518: \end{multline*}

8519: This inequality compares favourably with similar inequalities

8520: in \cite{Cristianini}, which moreover do not extend to the margin

8521: quantile case as this one does.
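The way the theorem specializes to the separable case can be outlined as follows (a sketch only, where $C_h$ is a shorthand introduced here for the complexity term of the theorem divided by $N$). If $h$ is such that $4 R \gamma_h < \gamma$, the empirical sums vanish and the bound of the theorem reads
$$
\frac{k+1}{k} \inf_{\lambda \in \RR_+}
\frac{1 - \exp ( - C_h )}{1 - \exp ( - \frac{\lambda}{N} )}
= \frac{k+1}{k} \bigl[ 1 - \exp( - C_h ) \bigr],
$$
the infimum being approached when $\lambda$ tends to $+ \infty$, since the denominator then increases to $1$. The constraint $4 R \gamma_h < \gamma$ requires $\gamma_h^{-2} > 16 R^2 / \gamma^2$, which presumably accounts for the factor $\frac{16 R^2 + 2 \gamma^2}{\gamma^2} = \frac{16 R^2}{\gamma^2} + 2$ replacing $h$ in the displayed inequality.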

8522:

8523: Let us also remark that it is easy to circumvent the fact that

8524: $R$ is not observed when the test set

8525: $X_{N+1}^{(k+1)N}$ is not observed.

8526:

8527: Indeed, we can consider the sample obtained by projecting $X_1^{(k+1)N}$

8528: on some ball of fixed radius $R_{\max}$, putting

8529: $$

8530: t_{R_{\max}}(X_i) = \min \left\{ 1, \frac{R_{\max}}{\lVert X_i \rVert} \right\} X_i.

8531: $$
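Let us notice that, for any $X_i \neq 0$ (a one line verification, added as an illustration),
$$
\lVert t_{R_{\max}}(X_i) \rVert = \min \left\{ 1, \frac{R_{\max}}{\lVert X_i \rVert} \right\} \lVert X_i \rVert
= \min \bigl\{ \lVert X_i \rVert, R_{\max} \bigr\} \leq R_{\max},
$$
so that the radius of the projected sample is at most $R_{\max}$ whether or not the test patterns are observed, while the patterns already lying in the ball are left unchanged.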

8532: We can further consider an atomic prior distribution $\nu \in \C{M}_+^1(\RR_+)$

8533: bearing on $R_{\max}$, to obtain a uniform result through a union bound.

8534: Indeed, the previous theorem has the following consequence.

8535: \begin{cor}\mypoint

8536: For any atomic prior $\nu \in \C{M}_+^1(\RR_+)$,

8537: for any partially exchangeable probability measure $\PP \in \C{M}_+^1(\Omega)$,

8538: with $\PP$ probability at least

8539: $1 - \epsilon$, for any $(w,b) \in \Theta$, any $R_{\max} \in \RR_+$,

8540: \begin{multline*}

8541: \frac{1}{kN} \sum_{i=N+1}^{(k+1)N}

8542: \B{1} \Bigl\{ 2 \B{1} \bigl[ g_{w,b} \circ t_{R_{\max}}(X_i)

8543: \geq 0 \bigr] - 1 \neq Y_i \Bigr\}\\*

8544: \begin{aligned} \leq \frac{k+1}{k} & \inf_{\lambda \in \RR_+, h \in \NN^*}

8545: \bigl[ 1 - \exp( - \tfrac{\lambda}{N} ) \bigr]^{-1}

8546: \Biggl\{ 1 - \\

8547: \exp \Biggl[ - & \frac{\lambda}{N^2}

8548:  \sum_{i=1}^N \B{1} \bigl[ g_{w,b} \circ t_{R_{\max}}(X_i)Y_i \leq 4 R_{\max}

8549:  \gamma_h \bigr]

8550: \\ - & \frac{\log

8551: \bigl[ 20 (k+1)N \bigr] \Bigl\{

8552: \tfrac{h}{\log(2)} \log \Bigl( \tfrac{4e (k+1)N}{h} \Bigr)

8553: + 1 \Bigr\} + \log\Bigl[ \tfrac{2h(h+1)}{\epsilon \nu(R_{\max})} \Bigr] }{N}

8554: \Biggr] \Biggr\} \end{aligned}\\- \frac{1}{k N} \sum_{i=1}^{N} \B{1}

8555: \bigl[ g_{w,b}\circ t_{R_{\max}} (X_i)Y_i \leq 4 R_{\max} \gamma_h \bigr].

8556: \end{multline*}

8557: \end{cor}

8558:

8559: \input{appendix}

8560: