1: \newcommand{\w}[1]{\widehat{#1}}
2: \section*{Introduction}
3: \addcontentsline{toc}{section}{Introduction}
4:
5: Among the possible approaches to pattern recognition,
6: statistical learning theory has received a lot of attention
7: in the last few years. Although a realistic pattern recognition
8: scheme involves data pre-processing and post-processing that
9: need a theory of their own, a central role is often played
10: by some kind of supervised learning algorithm. This central
11: piece of work is the subject we are going to analyse in
12: these notes.
13:
14: Accordingly, we assume that we have prepared in some way or another
15: a {\em sample} of $N$ labelled patterns $(X_i, Y_i)_{i=1}^N$,
16: where $X_i$ ranges in some pattern space $\C{X}$ and $Y_i$ ranges
17: in some finite label set $\C{Y}$. We also assume that we have devised
18: our experiment in such a way that the couples of random variables
19: $(X_i, Y_i)$ are independent (but not necessarily equidistributed).
20: Here, randomness should be understood to come from the way the
21: statistician has planned his experiment. He may for instance
22: have drawn the $X_i$s
23: at random from some larger population of patterns the algorithm
24: is meant to be applied to in a second stage. The labels $Y_i$
25: may have been set with the help of some external expertise
26: (which may itself be faulty or
27: contain some amount of randomness, therefore we do not assume
28: that $Y_i$ is a function of $X_i$, and allow the couple of
29: random variables $(X_i, Y_i)$ to follow any kind of joint distribution).
30: In practice, patterns will be extracted from some high dimensional and highly
31: structured data, like digital images, speech signals, DNA sequences, etc.
32: We will not discuss here this pre-processing stage
33: (although it poses crucial problems dealing with segmentation
34: and the choice of a representation).
35:
36: To fix notations, let $(X_i,Y_i)_{i=1}^N$ be the canonical process
37: on $\Omega = (\C{X} \times \C{Y})^N$ (which means
38: the coordinate process).
39: Let the pattern space
40: be provided with a sigma-algebra $\C{B}$ turning it into
41: a measurable space $(\C{X}, \C{B})$. On the finite label space $\C{Y}$,
42: we will consider the trivial algebra $\C{B}'$ made of all its subsets.
Let $\C{M}_+^1\bigl[(\C{X} \times \C{Y})^N, (\C{B}
\otimes \C{B}')^{\otimes N} \bigr]$ be our notation for
the set of probability measures (i.e. of positive measures
of total mass equal to $1$) on the measurable space
$\bigl[ (\C{X} \times \C{Y})^N, (\C{B} \otimes \C{B}')^{\otimes N}
\bigr]$.
49: Once some probability distribution
50: $\PP \in \C{M}_+^1\bigl[ (\C{X} \times \C{Y})^N, (\C{B} \otimes
51: \C{B}')^{\otimes N} \bigr]$ is chosen,
52: it turns $(X_i,Y_i)_{i=1}^N$
53: into the canonical realization of a stochastic process modeling the
54: observed sample (also called the training set).
55: We will assume that $\PP = \bigotimes_{i=1}^N P_i$, where
56: for each $i = 1, \dots, N$,
57: $P_i \in \C{M}_+^1(\C{X} \times \C{Y}, \C{B} \otimes \C{B}')$,
58: to reflect
59: the assumption that we observe independent pairs of patterns and labels.
60: We will also assume that we are provided with some indexed set of
61: possible classification rules
62: $$
63: \C{R}_{\Theta} = \bigl\{ f_{\theta} : \C{X} \rightarrow \C{Y};
64: \theta \in \Theta \bigr\},
65: $$
66: where $(\Theta, \C{T})$ is some measurable index set. Assuming
67: some indexation of the classification rules is just a matter
of presentation. Although it leads to longer notations, it
allows us to integrate over the space of classification rules
as well as over $\Omega$ using the usual formalism of multiple
integrals. For this purpose, we will assume that
$(\theta, x) \mapsto f_{\theta}(x) : ( \Theta \times \C{X},
\C{T} \otimes \C{B} ) \rightarrow (\C{Y}, \C{B}')$
is a measurable function.
75:
76: In many cases $\Theta = \bigcup_{i \in I} \Theta_i$ will be a finite
77: (or more generally countable) union of subspaces, dividing the classification
78: model $\C{R}_{\Theta} = \bigcup_{i \in I} \C{R}_{\Theta_i}$ into a union of
79: submodels. The importance of introducing such a structure has been
80: put forward by V. Vapnik, as a way to avoid making strong hypotheses
81: on the distribution $\PP$ of the sample.
82: If neither the distribution of the sample nor the set of
83: classification rules were constrained, it is well known indeed that
84: no kind of statistical inference would be possible.
85: Considering a family of submodels is a way to
86: provide for adaptive classification where
87: the choice of the model depends on the observed
88: sample. Restricting the set of classification rules is more realistic
89: than restricting the distribution of patterns, since the classification
90: rules are a processing tool left to the choice of the statistician,
91: whereas the distribution of the patterns is not fully under his control,
92: except for some planning of the learning experiment which may enforce
93: some weak properties like independence, but not the precise shapes of
94: the marginal distributions $P_i$ which are as a rule unknown distributions
95: on some high dimensional space.
96:
97: \newcommand{\wtheta}{\widehat{\theta}}
98: In these notes, we will concentrate on general issues concerned with
99: a natural measure of risk, namely the {\em expected error rate}
100: of each classification rule $f_{\theta}$, expressed as
101: $$
102: R(\theta) = \frac{1}{N} \sum_{i=1}^N \PP\bigl[ f_{\theta}(X_i) \neq Y_i
103: \bigr].
104: $$
105: As this quantity is unobserved, we will be led to work with
106: the corresponding {\em empirical error rate}
107: $$
108: r(\theta,\omega) = \frac{1}{N} \sum_{i=1}^N \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr].
109: $$
This does not mean that practical learning algorithms will
111: always try to minimize this criterion. They often on the contrary
112: try to minimize some other criterion which is linked with
113: the structure of the problem and has some nice additional properties
114: (like smoothness and convexity, for example). Nevertheless, and independently
115: from the precise form of the estimator $\wtheta : \Omega \rightarrow \Theta$
116: under study, the analysis of $R(\wtheta)$ is a natural question,
117: and often corresponds to what is required in practice.
118:
119: Answering this question is not straightforward because,
120: although $R(\theta)$ is the expectation of $r(\theta)$,
121: a sum of independent Bernoulli random variables,
122: $R(\wtheta)$ is not the expectation of $r(\wtheta)$,
123: because of the dependence of $\wtheta$ on the sample,
124: and neither is $r(\wtheta)$ a sum of independent
125: random variables.
126: To circumvent this unfortunate situation,
127: some uniform control over the deviations of $r$ with respect to $R$
128: is needed.
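
To make the difficulty concrete, consider the special case (taken here only as an
illustration) where $\wtheta \in \arg\min_{\Theta} r$ is an empirical risk minimizer,
assuming that such a measurable minimizer exists. Since $r(\wtheta) \leq r(\theta)$
for any fixed $\theta \in \Theta$, taking expectations and then the infimum in $\theta$
gives
$$
\PP \bigl[ r(\wtheta) \bigr] \leq \inf_{\theta \in \Theta} R(\theta)
\leq \PP \bigl[ R(\wtheta) \bigr],
$$
so that in this case $r(\wtheta)$ is on average an optimistic estimate of $R(\wtheta)$.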
129:
The PAC-Bayesian approach to this problem, which originated in the machine
learning community and was pioneered by
132: D. McAllester \cite{McAllester,McAllester2},
133: can be seen as some variant of the more classical approach of $M$-estimators
134: relying on empirical process theory (as exposed for instance in
135: \cite{VanDeGeer}).
136:
It is built on three cornerstones:
138: \begin{itemize}
139: \item One idea is to embed the set of estimators of the type $\wtheta
140: : \Omega \rightarrow \Theta$ into the larger set of
141: regular conditional probability measures
142: $\rho : \bigl( \Omega,
143: (\C{B} \otimes \C{B}')^{\otimes N} \bigr) \rightarrow \C{M}_+^1(\Theta, \C{T})$.
144: We will call these conditional probability measures {\em posterior distributions},
145: to follow a usual terminology.
146: \item A second idea is to measure the fluctuations of $\rho$
147: with respect to the sample, using some prior distribution $\pi \in
148: \C{M}_+^1(\Theta, \C{T})$, and the Kullback divergence function
149: $\C{K}(\rho, \pi)$. The expectation $\PP \bigl\{ \C{K}(\rho, \pi) \bigr\}$
150: measures the randomness of $\rho$.
151: The optimal choice of
152: $\pi$ would be $\PP(\rho)$, resulting in a measure of the
153: randomness of $\rho$ equal to the mutual information between
154: the sample and the estimated parameter drawn from $\rho$.
155: Anyhow, since $\PP(\rho)$ is as a rule no more observed than
156: $\PP$, we will have to be content with some less concentrated
157: prior distribution $\pi$, resulting in some looser measure
158: of randomness, as shown by the identity
159: $\PP \bigl[ \C{K}(\rho, \pi) \bigr] = \PP \bigl\{ \C{K}\bigl[\rho,
160: \PP(\rho)\bigr] \bigr\} + \C{K}\bigl[\PP(\rho), \pi\bigr]$.
161: \item A third idea is to analyze the fluctuations of the random
162: process $\theta \mapsto r(\theta)$ with respect to its mean
163: process $\theta \mapsto R(\theta)$ through the $\log$-Laplace
164: transform
165: $$
166: - \frac{1}{\lambda}
167: \log \left\{ \iint \exp \bigl[ - \lambda r(\theta,\omega) \bigr]
168: \pi(d \theta) \PP(d \omega) \right\},
169: $$ as a physicist prone to statistical mechanics
170: (where this is called the free energy) would do. This transform
171: is well suited
172: to relate $\min_{\theta \in \Theta} r(\theta)$
173: to $\inf_{\theta \in \Theta} R(\theta)$.
174: \end{itemize}
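
The identity quoted in the second item can be checked in a few lines. As a sketch,
write $\PP(\rho)$ for the probability measure $A \mapsto \PP \bigl[ \rho(A) \bigr]$ on
$(\Theta, \C{T})$, and assume that $\rho \ll \PP(\rho) \ll \pi$, $\PP$ almost surely,
and that the integrals below are well defined. Then
$$
\log \Bigl( \frac{d \rho}{d \pi} \Bigr) = \log \Bigl( \frac{d \rho}{d \PP(\rho)} \Bigr)
+ \log \Bigl( \frac{d \PP(\rho)}{d \pi} \Bigr).
$$
Integrating with respect to $\rho$ and then with respect to $\PP$, the first term
gives $\PP \bigl\{ \C{K}\bigl[ \rho, \PP(\rho) \bigr] \bigr\}$. The second term is the
integral of a non random function of $\theta$, so that only the average
$\PP(\rho)$ of $\rho$ matters, and it gives $\C{K}\bigl[ \PP(\rho), \pi \bigr]$.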
175:
This monograph is divided into two sections. The first one deals with the
177: inductive setting presented in these lines, the second one with
178: the {\em transductive} setting, where, following Vapnik's seminal
179: approach \cite{Vapnik}, a shadow sample is considered.
180:
181: In the first section, two types of bounds are shown. {\em Empirical bounds}
182: can be used to choose between estimators or to build estimators.
183: {\em Non random bounds} can be used to assess the speed of convergence
184: of estimators, relating this speed to the speed of convergence
185: of the Gibbs prior expected error rate $\beta \mapsto
186: \pi_{\exp ( - \beta R)}(R)$ towards $\ess \inf_{\pi} R$
187: as $\beta$ goes to infinity, and to other quantities
188: akin to the margin assumption of Mammen and Tsybakov in more
sophisticated cases. We will progress from the most straightforward
190: bounds to more elaborate ones, built to achieve a better
191: asymptotic behaviour. We will thus introduce {\em local bounds}
192: and {\em relative bounds}.
193: From an asymptotic point of view, the culminating result of
194: these notes is Theorem \ref{thm1.1.43} (page \pageref{thm1.1.43}).
195: It is used in Proposition \ref{prop1.1.37} to build a classification
196: rule which is proved to be adaptive in all the parameters
197: of the Mammen and Tsybakov margin assumption and of
198: a parametric complexity assumption
199: in Corollary \ref{cor1.52} (page \pageref{cor1.52}) of Theorem
200: \ref{thm1.50} (page \pageref{thm1.50}). This opens the road to Theorem
201: \ref{thm1.59} (page \pageref{thm1.59}) which performs two step localization
202: on top of Theorem
203: \ref{thm1.1.43} in order to be able to achieve adaptive model selection
with a decreased influence of the number of empirically inefficient
205: models included in the comparison. The analysis of this bound is
206: hinted at in subsequent pages, but not fully developed, since
207: we are not sure the amount of technicalities it requires is worth it.
Anyhow we would not like to induce the
reader into thinking that each result in the first section is
actually an {\em improvement} on the previous one: it is as a rule
only an {\em asymptotic improvement}, and the price to pay for
being asymptotically tighter is to get looser bounds for small sample sizes.
213: What is a small sample size in practice is a question of ratio between
214: the number of examples and the complexity (roughly speaking the number
215: of parameters) of the model used to classify. Since our aim here is
216: to describe classification methods suitable for complex data (images,
217: speech, DNA, \dots), we suspect that practitioners wanting to make use
218: of our proposals will be confronted with small sample sizes more often
219: than with large ones, and should try to make use of the simplest
220: bounds first and see only afterwards whether the asymptotically
better ones can bring them more for the size of samples their computers can handle
and their databases can provide. Let us advocate also that the results
of this first section are not only of a theoretical nature, for two
reasons. The first one is that posterior parameter distributions
can be computed effectively, using Monte Carlo techniques: there is
a whole tradition about these computations in Bayesian statistics,
proving that what we call here Gibbs estimators are not
228: only a way to show that some optimal speeds of convergence can
229: be reached in some theoretically well understood situations,
230: but that they can also be computed in practice. The second reason
231: is that a traditional non randomized estimator $\w{\theta} \in \Theta$ of the
232: parameter can be approximated by a posterior distribution $\rho$ which
is supported by a fairly narrow neighbourhood of $\w{\theta} \in \Theta$,
234: without spoiling excessively our bounds, resulting in a classification
235: rule which is to provide a randomized answer only for a small amount
236: of dubious examples and will most of the time issue the same deterministic
237: answer as the classification rule indexed by $\w{\theta}$ it is
238: derived from. This is
239: explained on page \pageref{eq1.1.2}.
240:
241: In the second section, we show first how we can transport
242: all the results obtained in the inductive case to the transductive case,
allowing us to replace prior distributions by {\em partially exchangeable posterior
distributions} depending on an extended sample where unlabelled shadow
245: examples are added, with increased possibilities of adaptation to the data.
246: We then focus on the small sample case, where local and relative
247: bounds are not expected to be of great help. Using
248: a fictitious (that is unobserved) shadow sample, we study Vapnik
249: type generalization bounds, showing how to tighten and extend them
using some original ideas: making no Gaussian approximation to the
$\log$-Laplace transform of Bernoulli random variables, using a shadow sample
of arbitrary size, avoiding any symmetrization trick,
and using a subset of the group of permutations suitable to cover the
case of independent, not identically distributed data. The culminating
255: result of the second section is Theorem \ref{thm2.3.3} on page \pageref{thm2.3.3},
256: subsequent bounds showing the separate influence of the above ideas and
257: providing an easier comparison with Vapnik's original results.
258: Vapnik type generalization bounds have a broad applicability, not
259: only through the concept of VC dimension, but also through the use
260: of compression schemes \cite{Little}, which are briefly described
261: on page \pageref{compression}.
262:
263: \section{Inductive PAC-Bayesian learning}
264:
265: The setting of inductive inference (as opposed to transductive
266: inference to be discussed later) is the one described in the
267: introduction.
268:
When we have to take the expectation of
270: a random variable $Z : \Omega \rightarrow \RR$ as well as of a function
271: of the parameter $h : \Theta \rightarrow \RR$ with respect to
272: some probability measure, we will as a rule use functional
273: short notations instead of resorting to the integral sign:
274: thus we will write $\PP(Z)$ for $\int_{\Omega} Z(\omega) \PP(d \omega)$
275: and $\pi(h)$ for $\int_{\Theta} h(\theta) \pi(d \theta)$.
276:
277: The PAC-Bayesian approach, in its simplest form, relies on some
278: basic upper bound for the Laplace transform of
279: $\sup_{\rho \in \C{M}_+^1(\Theta)} \bigl[
280: \rho(R) - \rho(r) \bigr]$, or more technically on some penalized
281: variant of it, as will be seen. This will be the subject of the
282: next subsection, where we will start with the Laplace
283: transform of $R(\theta) - r(\theta)$, for any $\theta \in \Theta$,
284: before encompassing posterior distributions. As it is already
285: easy to guess, the purpose of these preliminaries is to
286: gain some uniform control on the lower deviations of the
287: empirical error rate from the expected error rate under
288: any posterior distribution.
289: \subsection{Basic inequality}
290: In the setting described in the introduction,
291: let us consider the Bernoulli random variables
292: $\sigma_i(\theta) = \B{1} \bigl[ Y_i \neq f_{\theta} (X_i) \bigr]$.
293: Using independence and the concavity of the logarithm
294: function, it is readily seen that for any real constant $\lambda$
295: \begin{multline*}
296: \log \Bigl\{ \PP \bigl\{ \exp \bigl[ - \lambda r(\theta) \bigr]
297: \bigr\} \Bigr\}
298: = \sum_{i=1}^N \log \Bigl\{ \PP \Bigl[ \exp\bigl(
299: - \tfrac{\lambda}{N} \sigma_i \bigr) \Bigr] \Bigr\}
300: \\ \leq N \log \biggl\{ \frac{1}{N}\sum_{i=1}^N
301: \PP \Bigl[ \exp \bigl( - \tfrac{\lambda}{N}
302: \sigma_i \bigr) \Bigr]
303: \biggr\}.
304: \end{multline*}
305: The right-hand side of this inequality is the $\log$ Laplace
306: transform of a Bernoulli distribution with parameter
307: $\frac{1}{N} \sum_{i=1}^N \PP(\sigma_i) = R(\theta)$.
308: As any Bernoulli distribution is fully defined
309: by its parameter, this $\log$ Laplace transform
310: is necessarily a function of $R(\theta)$. It can
311: be expressed with the help of the family of functions
312: $$
313: \Phi_{a}(p) = - a^{-1} \log \bigl\{
314: 1 - \bigl[1 - \exp( - a)\bigr]
315: p \bigr\}, \quad a \in \RR, p \in (0,1).
316: $$
It is immediately seen that $\Phi_{a}$ is an increasing
one to one mapping of the unit interval onto itself, and that it
319: is convex when $a > 0$, concave when $a < 0$ and can be defined
320: by continuity to be the identity when $a = 0$.
321: Moreover the inverse of $\Phi_{a}$ is given by the
322: formula
323: $$
324: \Phi_{a}^{-1}(q) = \frac{1 - \exp (- a q )}{1 - \exp ( - a )},
325: \qquad a \in \RR, q \in (0,1).
326: $$
327: This formula may be used to extend $\Phi_a^{-1}$
328: to $q \in \RR$, and we will use this extension without
329: further notice when required.
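
As a sketch of why this family appears, note that the Laplace transform at $-a$
of a Bernoulli distribution with parameter $p$ is $1 - p + p \exp(-a)$, so that
its logarithm is
$$
\log \bigl\{ 1 - \bigl[ 1 - \exp(-a) \bigr] p \bigr\} = - a \, \Phi_a(p).
$$
With $a = \frac{\lambda}{N}$ and $p = R(\theta)$, this gives
$N \log \bigl\{ 1 - \bigl[ 1 - \exp( - \tfrac{\lambda}{N}) \bigr] R(\theta) \bigr\}
= - \lambda \Phi_{\frac{\lambda}{N}} \bigl[ R(\theta) \bigr]$.
As a side remark, a second order expansion in $a$ shows that
$$
\Phi_a(p) = p - \frac{a}{2} \, p (1 - p) + O(a^2), \qquad a \rightarrow 0,
$$
which is consistent with $\Phi_0$ being the identity.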
330:
331: Using these notations, the previous inequality becomes
332: $$
333: \log \Bigl\{ \PP \bigl\{ \exp \bigl[ - \lambda r(\theta)
334: \bigr] \bigr\} \Bigr\} \leq
335: - \lambda \Phi_{\frac{\lambda}{N}} \bigl[ R(\theta) \bigr],
336: \quad \text{proving}
337: $$
338:
339: \begin{lemma}
340: \label{lemma1.1.1} \mypoint For any real constant $\lambda$ and
341: any parameter $\theta \in \Theta$,
342: $$
343: \PP \biggl\{ \exp \Bigl\{
344: \lambda \Bigl[ \Phi_{\frac{\lambda}{N}} \bigl[ R(\theta) \bigr]
345: - r(\theta) \Bigr]
346: \Bigr\} \biggr\} \leq 1.
347: $$
348: \end{lemma}
349: In previous versions of this study, we had used some Bernstein
350: bound, instead of this lemma. Anyhow, as it will turn out,
351: keeping the $\log$ Laplace of a Bernoulli instead of approximating
352: it provides simpler and tighter results.
353:
354: Lemma \ref{lemma1.1.1} implies that
for any constants $\lambda \in \RR_+$ and $\epsilon \in (0,1)$,
356: $$
357: \PP \biggl[ \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] +
358: \frac{\log(\epsilon)}{\lambda} \leq r(\theta) \biggr] \geq 1 - \epsilon.
359: $$
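Indeed, spelling out the step for $\lambda > 0$, Markov's inequality
applied to the exponential moment of Lemma \ref{lemma1.1.1} gives
\begin{multline*}
\PP \biggl[ \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] +
\frac{\log(\epsilon)}{\lambda} > r(\theta) \biggr]
= \PP \biggl[ \exp \Bigl\{ \lambda \Bigl[ \Phi_{\frac{\lambda}{N}}
\bigl[ R(\theta) \bigr] - r(\theta) \Bigr] \Bigr\} > \epsilon^{-1} \biggr]
\\ \leq \epsilon \, \PP \biggl\{ \exp \Bigl\{
\lambda \Bigl[ \Phi_{\frac{\lambda}{N}} \bigl[ R(\theta) \bigr]
- r(\theta) \Bigr] \Bigr\} \biggr\} \leq \epsilon.
\end{multline*}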
360: Choosing $\ds \overline{\lambda} \in \arg\max_{\RR_+}
361: \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] + \frac{\log(\epsilon)}{\lambda}$,
362: we deduce
363: \begin{lemma}\mypoint
For any $\epsilon \in (0,1)$, any $\theta \in \Theta$,
365: $$
366: \PP \Biggl\{ R(\theta) \leq \inf_{\lambda \in \RR_+}
367: \Phi_{\frac{\lambda}{N}}^{-1} \biggl[
368: r(\theta) - \frac{\log(\epsilon)}{\lambda} \biggr] \Biggr\}
369: \geq 1 - \epsilon.
370: $$
371: \end{lemma}
372:
373: We will illustrate throughout these notes the bounds we prove with
374: a small numerical example: in the case where $N = 1000$,
375: $\epsilon = 0.01$ and $r(\theta) = 0.2$,
376: we get with a confidence level of $0.99$ that $ R(\theta) \leq .2402$,
377: this being obtained for $\lambda = 234$.
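
For the reader who wishes to reproduce these figures, the arithmetic can be
sketched as follows (all values are rounded): with $\lambda = 234$, so that
$\frac{\lambda}{N} = 0.234$,
$$
r(\theta) - \frac{\log(\epsilon)}{\lambda} = 0.2 + \frac{\log(100)}{234}
\simeq 0.21968,
$$
and
$$
\Phi_{0.234}^{-1}(0.21968) = \frac{1 - \exp(- 0.234 \times 0.21968)}{1 - \exp(-0.234)}
\simeq \frac{0.05011}{0.20864} \simeq 0.2402.
$$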
378:
379: Now, to proceed towards the analysis of posterior
380: distributions, let us put for short $U_{\lambda}(\theta, \omega) =
381: \lambda \Bigl[ \Phi_{\frac{\lambda}{N}} \bigl[ R(\theta) \bigr]
382: - r(\theta, \omega) \Bigr],
383: $ and let us consider \linebreak
384: $\log \Bigl\{ \PP \Bigl[ \pi \bigl[ \exp ( U_{\lambda}) \bigr] \Bigr] \Bigr\}$, where
385: $\pi \in \C{M}_+^1(\Theta, \C{T})$ is some prior probability
386: measure on the parameter space. Using Fubini's theorem
387: for non negative functions, we see that
388: $$
389: \log \Bigl\{ \PP \Bigl[ \pi \bigl[ \exp ( U_{\lambda}) \bigr] \Bigr] \Bigr\}
390: = \log \Bigl\{ \pi \Bigl[ \PP \bigl[ \exp ( U_{\lambda} ) \bigr] \Bigr]
391: \Bigr\} \leq 0.
392: $$
393:
394: To relate this quantity
395: to the expectation $\rho(U_{\lambda})$ with respect to
396: any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
397: we will use the properties of the Kullback divergence
398: $\C{K}(\rho, \pi)$
399: of $\rho$ with respect to $\pi$, which is defined as
400: $$
401: \C{K}(\rho, \pi) = \begin{cases}
402: \int \log( \frac{d\rho}{d \pi}) d \rho, & \text{ when $\rho \ll
403: \pi$},\\
404: + \infty, & \text{ otherwise}.
405: \end{cases}
406: $$
407: The following lemma shows in which sense the Kullback divergence
408: function can be thought of as the dual of the $\log$ Laplace
409: transform.
410: \begin{lemma} \mypoint
411: \label{lemma1.3}
412: For any bounded measurable function $h : \Theta \rightarrow \RR$,
413: and any probability distribution $\rho \in \C{M}_+^1(\Theta)$
414: such that $\C{K}(\rho,\pi) < \infty$,
415: $$
416: \log \bigl\{ \pi \bigl[ \exp (h) \bigr]
417: \bigr\} = \rho(h)
418: - \C{K}(\rho,\pi) + \C{K}(\rho, \pi_{\exp(h)}),
419: $$
420: where by definition $\ds \frac{d \pi_{\exp(h)}}{d \pi} =
421: \frac{\exp[h(\theta)]}{\pi[\exp(h)]}$. Consequently
422: $$
\log \bigl\{ \pi \bigl[ \exp (h) \bigr] \bigr\}
424: = \sup_{\rho \in \C{M}_+^1(\Theta)} \rho (h)
425: - \C{K}(\rho, \pi).
426: $$
427: \end{lemma}
428: The proof is just a matter of writing down the definition
429: of the quantities involved and using the fact that the Kullback
430: divergence function is non negative.
431: It can be found in \cite[page 160]{Cat7}.
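As a sketch of this computation: since $h$ is bounded, $\pi$ and $\pi_{\exp(h)}$
are mutually absolutely continuous, so that for any $\rho \ll \pi$ with
$\C{K}(\rho, \pi) < \infty$,
$$
\C{K}(\rho, \pi_{\exp(h)}) = \int \log \Bigl( \frac{d \rho}{d \pi} \Bigr) d \rho
- \int \log \Bigl( \frac{d \pi_{\exp(h)}}{d \pi} \Bigr) d \rho
= \C{K}(\rho, \pi) - \rho(h) + \log \bigl\{ \pi \bigl[ \exp(h) \bigr] \bigr\}.
$$
This is the first identity, and the second one follows because
$\C{K}(\rho, \pi_{\exp(h)}) \geq 0$, with equality when $\rho = \pi_{\exp(h)}$.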
432: In the duality between measurable functions and probability measures,
433: we thus see that the $\log$ Laplace transform with respect to
434: $\pi$ is the Legendre transform of the Kullback divergence function
435: with respect to $\pi$.
436: Using this, we get
437: $$
438: \PP \Bigl\{ \exp \bigl\{ \sup_{\rho \in \C{M}_+^1(\Theta)}
439: \rho [ U_{\lambda}(\theta) ] - \C{K}(\rho, \pi) \bigr\} \Bigr\} \leq 1,
440: $$
441: which, combined with the convexity of $\lambda \Phi_{\frac{\lambda}{N}}$, proves
442: the basic inequality we were looking for.
443: \begin{thm}
444: \label{thm2.3}
445: \mypoint For any real constant $\lambda$,
446: \begin{multline*}
\PP \biggl\{ \exp \biggl[
\sup_{\rho \in \C{M}_+^1(\Theta)} \lambda
\Bigl[ \Phi_{\frac{\lambda}{N}}\bigl[ \rho(R) \bigr]
- \rho(r) \Bigr] - \C{K}(\rho,\pi) \biggr] \biggr\}
\\ \leq
\PP \biggl\{ \exp \biggl[
\sup_{\rho \in \C{M}_+^1(\Theta)} \lambda
\Bigl[ \rho \bigl( \Phi_{\frac{\lambda}{N}}\!\circ\!R \bigr)
- \rho(r) \Bigr] - \C{K}(\rho,\pi) \biggr] \biggr\}
\leq 1.
457: \end{multline*}
458: \end{thm}
459: The following sections will show how to use this theorem.
460: \subsection{Non local bounds}
461: At least three sorts of bounds can be deduced from Theorem \ref{thm2.3}.
462:
463: The most interesting ones to build estimators and tune parameters,
464: as well as the first that have been considered in the development of
465: the PAC-Bayesian approach, are deviation bounds. They provide an
466: empirical upper bound for $\rho(R)$ --- that is a bound which can be computed from
467: observed data --- with some probability $1 - \epsilon$, where $\epsilon$
468: is a presumably small and tunable confidence level.
469:
470: Anyhow, since most
471: of the results about the convergence speed of estimators to be found
472: in the statistical literature are concerned with the expectation $\PP \bigl[
473: \rho(R) \bigr]$, it is also enlightening to bound this quantity.
474: In order to know at which rate it may be approaching $\inf_{\Theta} R$,
475: a non random upper bound is required, which will relate the average of
476: the expected risk $\PP \bigl[ \rho(R) \bigr]$ with the properties of
477: the contrast function $\theta \mapsto R(\theta)$.
478:
479: Since the values of constants do matter a lot when a bound is to be used
480: to select between various estimators using classification models of various
481: complexities, a third kind of bound, related to the first, may be considered
482: for the sake of its hopefully better constants: we will call them
483: {\em unbiased empirical bounds}, to stress the fact that they provide some
484: empirical quantity whose expectation under $\PP$ can be proved to
485: be an upper bound for $\PP \bigl[ \rho(R) \bigr]$, the average expected
486: risk. The price to pay for these better constants is of course the lack
of formal guarantee given by the bound: two random variables whose
488: expectations are ordered in a certain way may very well be ordered
489: in the reverse way with a large probability, so that basing the
490: estimation of parameters or the selection of an estimator on some
491: unbiased empirical bound is a hazardous business. Anyhow, since it is
492: common practice to use the inequalities provided by mathematical statistical
493: theory while replacing the proven constants with smaller values showing
494: a better practical efficiency, considering unbiased empirical bounds
495: akin to deviation bounds provides an indication about how much
496: the constants may be decreased while not violating the theory too
497: outrageously.
498:
499: \subsubsection{Unbiased empirical bounds}
500: Let $\rho : \Omega
501: \rightarrow \C{M}_+^1(\Theta)$ be some fixed (and arbitrary)
502: posterior distribution, describing some randomized estimator of $\theta$.
503: As we already mentioned, in these notes a posterior distribution
504: will always be a regular conditional probability measure. By this
505: we mean that
506: \begin{itemize}
507: \item for any $A \in \C{T}$, the map $\omega \mapsto \rho (\omega, A)
508: : \bigl(\Omega, ( \C{B} \otimes
509: \C{B}')^{\otimes N} \bigr) \rightarrow \RR_+$
510: is assumed to be measurable;
511: \item for any $\omega \in \Omega$, the map $A \mapsto \rho(\omega, A):
512: \C{T} \rightarrow \RR_+$
513: is assumed to be a probability measure.
514: \end{itemize}
515: We will also assume without further notice that the $\sigma$-algebras
516: we deal with are always countably generated.
517: The technical implications of these assumptions are standard
518: and discussed for instance in \cite[pages 50-54]{Cat7}
519: (where, among other things, a detailed proof of the decomposition
of the Kullback-Leibler divergence is given).
521:
522: Let us restrict to the case when the constant $\lambda$ is positive.
523: We get from Theorem \ref{thm2.3} that
524: \begin{equation}
525: \label{eq2.2.1bis}
526: \exp \biggl[ \lambda \Bigl\{ \Phi_{\frac{\lambda}{N}}
527: \Bigl[ \PP \bigl[ \rho(R) \bigr]
528: \Bigr] - \PP \bigl[ \rho(r) \bigr] \Bigr\} - \PP \bigl[\C{K}(\rho, \pi)
529: \bigr] \biggr]
530: \leq 1,
531: \end{equation}
532: where we have used the convexity of the $\exp$ function and of $\Phi_{\frac{
533: \lambda}{N}}$.
534: Since we have restricted our attention to positive values of the constant $\lambda$,
535: Equation \eqref{eq2.2.1bis} can also be written
536: $$
537: \PP \bigl[ \rho(R) \bigr]
538: \leq \Phi_{\frac{\lambda}{N}}^{-1} \Bigl\{
539: \PP \bigl[ \rho(r) + \lambda^{-1} \C{K}(\rho,\pi) \bigr] \Bigr\},
540: $$
541: leading to
542: \begin{thm}
543: \label{thm2.4}
544: \mypoint For any posterior distribution $\rho: \Omega \rightarrow \C{M}_+^1(\Theta)$,
545: for any positive parameter $\lambda$,
546: \begin{align*}
547: \PP \bigl[ \rho (R) \bigr]
548: & \leq \frac{\ds
549: 1 - \exp \Bigl[ - N^{-1} \PP \bigl[
550: \lambda \rho(r) + \C{K}(\rho,\pi) \bigr] \Bigr] }{\ds 1 - \exp( - \tfrac{\lambda}{N})} \\
551: & \leq \PP \Biggl\{ \frac{\lambda}{N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr]}
552: \left[ \rho(r) + \frac{\C{K}(\rho,\pi)}{\lambda} \right] \Biggr\}.
553: \end{align*}
554: \end{thm}
555: The last inequality provides the {\em unbiased empirical upper
556: bound} for $\rho(R)$ we were looking for, meaning that the expectation of
557: \linebreak $\frac{\lambda}{N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr]}
558: \left[ \rho(r) + \frac{\C{K}(\rho,\pi)}{\lambda} \right]$
559: is larger than the expectation of $\rho(R)$. Let us notice that
560: $1 \leq \frac{\lambda}{N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr]} \leq
561: \bigl[ 1 - \frac{\lambda}{2N} \bigr]^{-1}$ and therefore that this
562: coefficient is close to $1$ when $\lambda$ is significantly smaller
563: than $N$.
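Both facts rest on the elementary inequalities
$$
x - \frac{x^2}{2} \leq 1 - \exp(-x) \leq x, \qquad x \in \RR_+.
$$
The right-hand one, applied to the numerator of the first bound, gives the
second line of Theorem \ref{thm2.4}, and applied with $x = \frac{\lambda}{N}$ it
shows that the coefficient is not smaller than $1$. The left-hand one, with
$x = \frac{\lambda}{N} < 2$, gives the upper bound on the coefficient. As a
rounded numerical illustration, for $\frac{\lambda}{N} = 0.234$ the coefficient is
about $1.1216$, whereas $\bigl[ 1 - \frac{\lambda}{2N} \bigr]^{-1} \simeq 1.1325$.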
564:
565: If we are ready to believe in this bound (although this belief is not
566: mathematically well founded, as we already mentioned), we can use
567: it to optimize $\lambda$ and to choose $\rho$. While the optimal choice
568: of $\rho$ when $\lambda$ is fixed is to take it equal to $\pi_{\exp( - \lambda r)}$,
569: a Gibbs posterior distribution, as it is sometimes called, we may for
570: computational reasons be more interested in choosing $\rho$ in some
571: other class of posterior distributions.
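
The optimality of the Gibbs posterior can be read off Lemma \ref{lemma1.3}.
As a sketch, applying this lemma to $h = - \lambda r$ for a fixed value of the sample
gives, for any $\rho$ such that $\C{K}(\rho, \pi) < \infty$,
$$
\rho(r) + \frac{\C{K}(\rho, \pi)}{\lambda}
= - \frac{1}{\lambda} \log \bigl\{ \pi \bigl[ \exp ( - \lambda r ) \bigr] \bigr\}
+ \frac{1}{\lambda} \C{K}\bigl( \rho, \pi_{\exp( - \lambda r)} \bigr).
$$
The first term on the right-hand side does not depend on $\rho$, and the second
one is non negative and vanishes when $\rho = \pi_{\exp( - \lambda r)}$.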
572:
573: For instance, our real interest
574: may be to select some deterministic estimator from a
575: family $\wtheta_m : \Omega \rightarrow
576: \Theta_m$, $m \in M$, of possible ones, where $\Theta_m$ are
measurable subsets of $\Theta$ and where $M$ is an arbitrary (not necessarily
578: countable) index set. We may for instance think of
579: the case when $\wtheta_m \in \arg\min_{\Theta_m} r$.
580: We may slightly randomize the estimators to start with,
581: considering for any $\theta \in \Theta_m$ and any $m \in M$,
582: $$
583: \Delta_m(\theta) = \Bigl\{ \theta' \in \Theta_m :
584: \bigl[ f_{\theta'}(X_i) \bigr]_{i=1}^N = \bigl[ f_{\theta}(X_i) \bigr]_{i=1}^N
585: \Bigr\},
586: $$
587: and defining $\rho_m$ by the formula
588: $$
589: \frac{d \rho_m}{d \pi} (\theta) = \frac{\B{1}\bigl[ \theta \in \Delta_m(\wtheta_m)
590: \bigr]}{\pi \bigl[ \Delta_m(\wtheta_m) \bigr]}.
591: $$
592: Our posterior is minimizing $\C{K}(\rho, \pi)$ among those
593: whose support is restricted to the values of $\theta$
594: in $\Theta_m$ for which the classification rule $f_{\theta}$
595: is identical to the estimated one $f_{\wtheta_m}$ on
596: the observed sample.
597: Presumably, in many practical situations, $f_{\theta}(x)$
598: will be $\rho_m$ almost surely identical to
599: $f_{\wtheta_m}(x)$ when $\theta$ is drawn from
600: $\rho_m$, for the vast majority of the values of $x \in \C{X}$
601: and all the submodels $\Theta_m$ not plagued with too much overfitting
602: (since this is by construction the case when $x \in \{ X_i : i = 1, \dots, N \}$).
603: Therefore replacing $\wtheta_m$ with $\rho_m$ can be expected to be
604: a minor change in many situations. This change by the way can be
605: estimated in the (admittedly not so common) case when the
606: distribution of the patterns $(X_i)_{i=1}^N$ is known.
607: Indeed, introducing the pseudo distance
608: \begin{equation}
609: \label{eq1.1.2}
610: D(\theta, \theta') = \frac{1}{N} \sum_{i=1}^N
611: \PP \bigl[ f_{\theta}(X_i) \neq f_{\theta'}(X_i) \bigr], \qquad \theta, \theta' \in
612: \Theta,
613: \end{equation}
614: one immediately sees that $R(\theta') \leq R(\theta) + D(\theta, \theta')$,
615: for any $\theta, \theta' \in \Theta$, and
616: therefore that
617: $$
618: R(\wtheta_m) \leq \rho_m(R) + \rho_m\bigl[ D(\cdot,\wtheta_m) \bigr].
619: $$
620: Let us notice also that in the case where $\Theta_m
621: \subset \RR^{d_m}$, and $R$ happens to be convex on
622: $\Delta_m(\wtheta_m)$, then $\rho_m(R) \geq R \bigl[
623: \int \theta \rho_m(d \theta)\bigr]$, and we can replace
624: $\wtheta_m$ with $\T_m = \int \theta \rho_m( d\theta)$,
625: and obtain bounds for $R(\T_m)$.
626: This is not a very heavy assumption about $R$, in the case
627: where we consider $\wtheta_m \in \arg\min_{\Theta_m} r$.
628: Indeed, $\wtheta_m$, and therefore $\Delta_m(\wtheta_m)$,
629: will be presumably close to $\arg\min_{\Theta_m} R$,
and requiring a function to be convex in the neighbourhood of
631: its minima is not a very strong assumption.
632:
633: Since $r(\wtheta_m) = \rho_m(r)$,
634: and $\C{K}(\rho_m, \pi) = - \log \bigl\{
635: \pi\bigl[ \Delta_m(\wtheta_m) \bigr] \bigr\}$,
636: our unbiased empirical upper
637: bound in this context reads as
638: $$
639: \frac{\lambda}{N\bigl[ 1 - \exp( - \frac{\lambda}{N})\bigr]} \left\{
640: r(\wtheta_m) - \frac{\log\bigl\{ \pi \bigl[ \Delta_m(\wtheta_m) \bigr]
641: \bigr\}}{\lambda} \right\}.
642: $$
643: Let us notice that we obtain a complexity factor $- \log \bigl\{
644: \pi \bigl[ \Delta_m(\wtheta_m) \bigr] \bigr\}$ which may be
compared with the Vapnik--Cervonenkis dimension. Indeed, in the
646: case of binary classification, when using a classification model
647: with VC dimension not greater than $h_m$, that is when any subset
648: of $\C{X}$ which can be split in any arbitrary way by some
649: classification rule $f_{\theta}$ of the model $\Theta_m$ has at most $h_m$
650: points, then
651: $$
652: \bigl\{ \Delta_m(\theta) : \theta \in \Theta_m \bigr\}
653: $$
is a partition of $\Theta_m$ with at most $\left( \frac{eN}{h_m} \right)^{h_m}$
655: components. Therefore
656: $$
657: \inf_{\theta \in \Theta_m} - \log \bigl\{
658: \pi \bigl[ \Delta_m(\theta) \bigr] \bigr\} \leq h_m \log \left( \frac{e N}{h_m}
659: \right) - \log \bigl[ \pi(\Theta_m) \bigr].
660: $$
661: Thus, if the model and prior distribution are well suited to the classification
662: task, in the sense that there is more ``room'' (where room is measured with $\pi$)
663: between the two clusters defined by $\wtheta_m$ than between other partitions
664: of the sample of patterns $(X_i)_{i=1}^N$, then we will have
665: $$
-\log \bigl\{ \pi \bigl[ \Delta_m(\wtheta_m) \bigr] \bigr\} \leq h_m
667: \log \left( \frac{e N}{h_m} \right) - \log \bigl[ \pi(\Theta_m) \bigr].
668: $$
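
To make the role of $\pi \bigl[ \Delta_m(\wtheta_m) \bigr]$ more concrete, here is a
minimal sketch on a toy model which is not taken from these notes: threshold rules
$f_{\theta}(x) = \B{1}(x \geq \theta)$ on the unit interval, with $\pi$ the Lebesgue
measure on $\Theta_m = (0,1)$. This model has VC dimension $1$, and
$\Delta_m(\theta)$ is the gap between two consecutive sample points containing
$\theta$, so that the complexity factor is minus the logarithm of the length of
this gap. The function names \texttt{cell}, \texttt{complexity} and
\texttt{vc\_bound} are hypothetical.

\begin{verbatim}
```python
import bisect
import math


def cell(theta, xs):
    """Interval (lo, hi] of thresholds labelling the sample xs like theta.

    Toy assumption: f_theta(x) = 1 if x >= theta else 0, theta in [0, 1],
    sample points distinct and strictly inside (0, 1).
    """
    pts = [0.0] + sorted(xs) + [1.0]
    # First index j >= 1 with pts[j] >= theta: pts[j-1] < theta <= pts[j].
    j = bisect.bisect_left(pts, theta, 1)
    return pts[j - 1], pts[j]


def complexity(theta, xs):
    """-log pi[Delta(theta)] for the uniform (Lebesgue) prior on [0, 1]."""
    lo, hi = cell(theta, xs)
    return -math.log(hi - lo)


def vc_bound(n, h):
    """h * log(e * n / h), the VC-type bound when pi(Theta_m) = 1."""
    return h * math.log(math.e * n / h)


xs = [0.1, 0.2, 0.6, 0.7]      # two well separated clusters of patterns
theta_hat = 0.4                # a threshold separating the two clusters
print(cell(theta_hat, xs))     # the wide gap between the clusters
print(complexity(theta_hat, xs), vc_bound(len(xs), 1))
```
\end{verbatim}

In this toy example a threshold falling in the wide gap between the two clusters
has a smaller complexity factor than one falling in a narrow gap inside a cluster,
which is the sense in which there is more ``room'' around a well suited estimator.
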
669: \newcommand{\wm}{\widehat{m}}
670: An optimal value $\wm$ may be selected so that
671: $$
672: \wm \in \arg\min_{m \in M} \left\{ \inf_{\lambda \in \RR_+}
673: \frac{\lambda}{N\bigl[ 1 - \exp( - \frac{\lambda}{N})\bigr]} \left(
674: r(\wtheta_m) - \frac{\log\bigl\{ \pi \bigl[ \Delta_m(\wtheta_m) \bigr] \bigr\}}{\lambda} \right) \right\}.
675: $$
676: Since $\rho_{\wm}$ is still another posterior distribution, we can be sure that
677: \begin{multline*}
678: \PP \Bigl\{ R(\wtheta_{\wm}) - \rho_{\wm} \bigl[ D(\cdot, \wtheta_{\wm}) \bigr]\Bigr\}
679: \leq \PP \bigl[ \rho_{\wm}(R) \bigr]
680: \\ \leq \inf_{\lambda \in \RR_+} \PP
681: \left\{ \frac{\lambda}{N\bigl[ 1 - \exp( - \frac{\lambda}{N})\bigr]} \left(
682: r(\wtheta_{\wm}) - \frac{\log\bigl\{ \pi \bigl[ \Delta_{\wm}
683: (\wtheta_{\wm}) \bigr] \bigr\}}{\lambda} \right) \right\}.
684: \end{multline*}
685: (Taking the infimum in $\lambda$ inside the expectation with respect to $\PP$
686: would be possible at the price of some supplementary technicalities
687: and a slight increase of the bound that we prefer to postpone to the discussion
688: of deviation bounds, since they are the only ones to provide a rigorous mathematical
689: foundation to the adaptive selection of estimators.)
690:
691: \subsubsection{Optimizing explicitly the exponential parameter $\lambda$}
We would like to deal in this section with a technical issue that we think
is helpful to the understanding of Theorem \ref{thm2.4}
694: (see page \pageref{thm2.4}): namely to investigate
695: how the upper bound it provides could be optimized, or at least approximately
696: optimized, in $\lambda$. It turns out that this can be done quite
explicitly.
698:
699: So we will consider in this discussion the
700: posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$
701: to be fixed, and our aim will be to eliminate the constant $\lambda$
702: from the bound by choosing its value in some nearly optimal way as
703: a function of $\PP\bigl[ \rho(r) \bigr]$, the average of the
704: empirical risk, and of
705: $\PP \bigl[ \C{K}(\rho, \pi) \bigr]$, which controls overfitting.
706:
707: Let the bound be written as
708: $$
709: \varphi ( \lambda) = \bigl[ 1 - \exp( - \tfrac{\lambda}{N}) \bigr]^{-1}
710: \left\{ 1 - \exp \Bigl[ - \tfrac{\lambda}{N} \PP \bigl[ \rho(r) \bigr]
711: - N^{-1}\PP \bigl[ \C{K}(\rho,\pi) \bigr] \Bigr] \right\}.
712: $$
713: We see that
714: $$
715: N \frac{\partial}{\partial \lambda} \log \bigl[ \varphi(\lambda) \bigr]
716: = \frac{\PP\bigl[\rho(r)\bigr]}{\exp \Bigl[ \frac{\lambda}{N} \PP\bigl[\rho(r)\bigr]
717: + N^{-1} \PP\bigl[ \C{K}(\rho, \pi) \bigr] \Bigr] - 1} -
718: \frac{1}{\exp(\frac{\lambda}{N}) - 1}.
719: $$
720: Thus, the optimal value for $\lambda$ is such that
721: $$
722: \bigl[ \exp( \tfrac{\lambda}{N}) - 1 \bigr] \PP \bigl[\rho(r)\bigr]
723: = \exp \Bigl[ \tfrac{\lambda}{N} \PP \bigl[ \rho(r) \bigr] + N^{-1}
724: \PP \bigl[ \C{K}(\rho, \pi) \bigr] \Bigr] - 1.
725: $$
726: Assuming that $1 \gg \frac{\lambda}{N} \PP \bigl[ \rho(r) \bigr]
727: \gg \frac{\PP [ \C{K}(\rho,\pi) ]}{N}$,
and keeping only the leading order terms, we are led to choose
729: $$
730: \lambda = \sqrt{ \frac{2 N \PP \bigl[ \C{K}(\rho,\pi) \bigr]}{\PP \bigl[ \rho(r) \bigr]
731: \bigl\{ 1 - \PP \bigl[\rho(r) \bigr] \bigr\}}},
732: $$
733: obtaining
734: \begin{thm}
735: \label{thm1.6}
736: \mypoint For any posterior distribution $\rho: \Omega \rightarrow \C{M}_+^1(\Theta)$,
737: $$
738: \PP \bigl[ \rho(R) \bigr] \leq
739: \frac{ 1 - \exp \left\{ - \sqrt{\frac{ 2 \PP [ \C{K}(\rho,\pi) ] \PP [
740: \rho(r)]}{N \{ 1 - \PP [ \rho(r) ] \}}} -
741: \frac{\PP [ \C{K}(\rho,\pi) ]}{N} \right\}}{
742: 1 - \exp \left\{ - \sqrt{ \frac{ 2 \PP [ \C{K}(\rho,\pi) ]}{
743: N \PP [ \rho(r) ] \{1 - \PP [ \rho(r) ] \}}}
744: \right\}}.
745: $$
746: \end{thm}
This result of course is not very useful in itself, since neither of the
two quantities $\PP\bigl[ \rho(r) \bigr]$ and $\PP\bigl[ \C{K}(\rho, \pi) \bigr]$
is easy to evaluate. Anyhow it gives a hint that replacing them boldly
750: with $\rho(r)$ and $\C{K}(\rho, \pi)$ could produce something close to
751: a legitimate empirical upper bound for $\rho(R)$. We will see in the subsection
752: about deviation bounds that this is indeed essentially true.
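
For the reader's convenience, the approximation behind the choice of $\lambda$
made just before Theorem \ref{thm1.6} can be sketched as follows. The shorthand
$t = \lambda / N$, $\bar{r} = \PP \bigl[ \rho(r) \bigr]$ and
$\kappa = N^{-1} \PP \bigl[ \C{K}(\rho, \pi) \bigr]$ is introduced here only for
this computation, and we assume moreover that $t$ itself is small enough for a
second order expansion to be meaningful. The optimality condition reads
$\bar{r} \bigl[ \exp(t) - 1 \bigr] = \exp( \bar{r} t + \kappa) - 1$.
Expanding both exponentials to the second order, and neglecting
$\kappa \bar{r} t$ and $\kappa^2$ in front of $\kappa$ (since
$\kappa \ll \bar{r} t \ll 1$), we get
$$
\bar{r} \Bigl( t + \frac{t^2}{2} \Bigr) \simeq \bar{r} t + \kappa
+ \frac{\bar{r}^2 t^2}{2},
\qquad \text{that is} \qquad
\frac{\bar{r} ( 1 - \bar{r}) t^2}{2} \simeq \kappa,
$$
so that $t \simeq \sqrt{ \frac{2 \kappa}{\bar{r}(1 - \bar{r})}}$, which is the
value of $\lambda = N t$ plugged into the bound.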
753:
754: Let us remark that in the second section of these notes,
755: we will see another way of bounding
756: $$
757: \inf_{\lambda \in \RR_+} \Phi_{\frac{\lambda}{N}}^{-1}
758: \left(q + \frac{d}{\lambda}\right),\text{ leading to}
759: $$
760: \begin{thm}\mypoint
761: \label{thm1.1.6}
762: For any prior distribution $\pi \in \C{M}_+^1(\Theta)$,
763: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
764: \begin{multline*}
765: \PP \bigl[ \rho(R) \bigr] \leq
766: \left(1 + \frac{2\PP\bigl[\C{K}(\rho, \pi) \bigr]}{N}\right)^{-1}
767: \Biggl\{ \PP \bigl[ \rho(r) \bigr] + \frac{\PP\bigl[\C{K}(\rho, \pi)\bigr]}{N}
768: \\* \shoveright{+ \sqrt{ \frac{2 \PP \bigl[ \C{K}(\rho, \pi) \bigr] \PP \bigl[ \rho(r) \bigr]
769: \bigl\{ 1 - \PP \bigl[ \rho(r) \bigr] \bigr\}}{N} + \frac{
770: \PP\bigl[\C{K}(\rho,\pi)\bigr]^2}{N^2}} \Biggr\},}\\
771: \text{as soon as }
772: \PP \bigl[ \rho(r) \bigr] + \sqrt{ \frac{\PP \bigl[ \C{K}(\rho, \pi) \bigr]}{2N}}
773: \leq \frac{1}{2},\\
774: \text{and }
775: \PP\bigl[\rho(R)\bigr] \leq \PP\bigl[\rho(r)\bigr] +
776: \sqrt{\frac{\PP\bigl[\C{K}(\rho,\pi)\bigr]}{2N}} \text{ otherwise.}
777: \end{multline*}
778: \end{thm}
This theorem highlights the influence of three terms on the average expected
risk:
781:
782: $\bullet$ the average empirical risk, $\PP \bigl[ \rho(r) \bigr]$, which
783: as a rule will decrease as the size of the classification model increases,
acts as a {\em bias} term, capturing the ability of the model to
785: account for the observed sample itself;
786:
787: $\bullet$ a {\em variance} term $\PP \bigl[ \rho(r) \bigr] \bigl\{ 1 - \PP \bigl[ \rho(r) \bigr]
788: \bigr\}$ is due to the random fluctuations of $\rho(r)$;
789:
790: $\bullet$
791: a {\em complexity} term $\PP \bigl[ \C{K}(\rho, \pi) \bigr]$, which as a rule will
792: increase with the size of the classification model,
793: eventually acts as a multiplier of the variance term.
794: \bigskip
795:
We observed numerically that the bound provided by Theorem \ref{thm1.6}
is better than the more classical Vapnik-like bound of Theorem \ref{thm1.1.6}.
For instance, when $N = 1000$, $\PP\bigl[\rho(r) \bigr] = 0.2$
and $\PP\bigl[\C{K}(\rho,\pi)\bigr] = 10$, Theorem \ref{thm1.6} gives a bound
lower than $0.2604$, whereas the more classical Vapnik-like approximation
of Theorem \ref{thm1.1.6} gives a bound larger than $0.2622$. Numerical simulations tend to suggest
that the two bounds are always ordered in the same way,
although this could be a little tedious
to prove mathematically.
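
The numerical comparison above can be reproduced with a few lines of code. The
following is only a sketch: the function names are hypothetical, and the
arguments \texttt{r\_bar}, \texttt{kl} and \texttt{n} stand for
$\PP \bigl[ \rho(r) \bigr]$, $\PP \bigl[ \C{K}(\rho,\pi) \bigr]$ and $N$.

\begin{verbatim}
```python
import math


def bound_thm_1_6(r_bar, kl, n):
    """Right-hand side of the bound with the explicit near-optimal lambda."""
    t = math.sqrt(2.0 * kl / (n * r_bar * (1.0 - r_bar)))   # t = lambda / N
    num = 1.0 - math.exp(-t * r_bar - kl / n)
    return num / (1.0 - math.exp(-t))


def bound_thm_1_1_6(r_bar, kl, n):
    """Right-hand side of the Vapnik-like bound, with its fallback case."""
    k = kl / n
    if r_bar + math.sqrt(k / 2.0) <= 0.5:
        root = math.sqrt(2.0 * k * r_bar * (1.0 - r_bar) + k * k)
        return (r_bar + k + root) / (1.0 + 2.0 * k)
    return r_bar + math.sqrt(k / 2.0)


b1 = bound_thm_1_6(0.2, 10.0, 1000)
b2 = bound_thm_1_1_6(0.2, 10.0, 1000)
print(f"{b1:.5f} {b2:.5f}")
```
\end{verbatim}

Scanning other values of the three arguments with these two functions is an easy
way to explore the conjectured ordering of the two bounds.
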
805:
806: \subsubsection{Non random bounds}
It is time now to come to less tentative results and
see how far the average expected error rate $\PP \bigl[ \rho(R) \bigr]$
is from its best possible value $\inf_{\Theta} R$.
810:
811: Let us notice first that
812: $$
813: \lambda \rho(r) + \C{K}(\rho,\pi) =
814: \C{K}(\rho, \pi_{\exp( - \lambda r)})
815: - \log \Bigl\{ \pi \bigl[ \exp ( - \lambda r) \bigr] \Bigr\}.
816: $$
817: Let us remark moreover that $r \mapsto \log \Bigl[ \pi \bigl[
818: \exp ( - \lambda r) \bigr] \Bigr]$ is a convex functional,
819: a property which can be used in the following way:
820: \begin{multline}
821: \label{eq1.1.3Ter}
822: \PP \Bigl\{ \log \Bigl[ \pi \bigl[ \exp ( - \lambda r) \bigr]
823: \Bigr] \Bigr\}
824: = \PP \Bigl\{ \sup_{\rho \in \C{M}_+^1(\Theta)}
825: - \lambda \rho(r) - \C{K}(\rho,\pi) \Bigr\}
826: \\ \geq \sup_{\rho \in \C{M}_+^1(\Theta)} \PP \Bigl\{
827: - \lambda \rho(r) - \C{K}(\rho, \pi) \Bigr\}
828: = \sup_{\rho \in \C{M}_+^1(\Theta)} - \lambda \rho(R) - \C{K}(\rho, \pi)
829: \\ = \log \Bigl\{ \pi \bigl[ \exp ( - \lambda R) \bigr] \Bigr\}
830: = - \int_{0}^{\lambda} \pi_{\exp( - \beta R)}(R) d \beta.
831: \end{multline}
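
The decomposition of $\lambda \rho(r) + \C{K}(\rho, \pi)$ and the variational
formula used in the first equality of \eqref{eq1.1.3Ter} are easy to check
numerically when $\Theta$ is finite. The sketch below does so on a hypothetical
four point parameter set; the error rates and the posterior \texttt{rho} are
arbitrary illustrative values, and the helper names are not part of these notes.

\begin{verbatim}
```python
import math


def gibbs(prior, r, lam):
    """Gibbs distribution pi_{exp(-lam r)} on a finite parameter set."""
    w = [p * math.exp(-lam * ri) for p, ri in zip(prior, r)]
    z = sum(w)                        # z = pi[exp(-lam r)]
    return [wi / z for wi in w], z


def kl(rho, pi):
    """Kullback divergence K(rho, pi), with the convention 0 log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(rho, pi) if a > 0)


def mean(rho, r):
    """Expectation rho(r) of r under rho."""
    return sum(a * ri for a, ri in zip(rho, r))


prior = [0.25, 0.25, 0.25, 0.25]
r = [0.10, 0.20, 0.30, 0.50]          # hypothetical empirical error rates
lam = 7.0
post, z = gibbs(prior, r, lam)

rho = [0.4, 0.3, 0.2, 0.1]            # an arbitrary posterior distribution
lhs = lam * mean(rho, r) + kl(rho, prior)
rhs = kl(rho, post) - math.log(z)
print(lhs, rhs)                       # the two sides of the identity

# The Gibbs posterior attains the supremum of -lam rho(r) - K(rho, pi).
free_energy = -lam * mean(post, r) - kl(post, prior)
print(free_energy, math.log(z))
```
\end{verbatim}
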
832: These remarks applied to Theorem \ref{thm2.4} lead to
833: \begin{thm}
834: \label{thm2.5}
835: \mypoint For any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
836: for any positive parameter $\lambda$,
837: \begin{align*}
838: \PP \bigl[ \rho(R) \bigr] &
839: \leq
840: \frac{1 - \exp \left\{ - \frac{1}{N} \int_0^{\lambda} \pi_{\exp( - \beta R)}(R)
841: d \beta - \frac{1}{N} \PP \bigl[ \C{K}(\rho, \pi_{\exp(- \lambda r)}) \bigr]
842: \right\}}{
843: 1 - \exp( - \frac{\lambda}{N})}
844: \\ & \leq \frac{1}{N \bigl[ 1 - \exp ( - \frac{\lambda}{N}) \bigr]}
845: \biggl\{ \int_0^{\lambda} \pi_{\exp( - \beta R)}(R) d \beta
846: + \PP \bigl[ \C{K}(\rho, \pi_{\exp( - \lambda r)}) \bigr] \biggr\}.
847: \end{align*}
848: \end{thm}
This theorem is particularly well suited to the case
of the Gibbs posterior distribution $\rho = \pi_{\exp(- \lambda r)}$,
where the entropy factor cancels and where
$\PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr]$
is shown to get close to $\inf_{\Theta} R$ when $N$ goes to $\infty$,
854: as soon as $\lambda/N$ goes to $0$ while $\lambda$ goes to $+ \infty$.
855:
856: We can elaborate on Theorem \ref{thm2.5} and define a notion of dimension
of $(\Theta, R)$, with margin $\eta \geq 0$, by putting
858: \begin{multline}
859: \label{eq1.1.3Bis}
860: d_{\eta} (\Theta, R) = \sup_{\beta \in \RR_+} \beta \bigl[
861: \pi_{\exp( - \beta R)}(R) - \ess\inf_{\pi} R - \eta \bigr]
862: \\ \leq - \log \Bigl\{ \pi \bigl[ R \leq \ess\inf_{\pi} R + \eta \bigr] \Bigr\}.
863: \end{multline}
864: This last inequality can be established by the chain of inequalities:
865: \begin{multline*}
866: \beta \pi_{\exp( - \beta R)}(R) \leq \int_0^{\beta}
867: \pi_{\exp( - \gamma R)}(R) d \gamma =
868: - \log \Bigl\{ \pi \bigl[
869: \exp ( - \beta R) \bigr] \Bigr\} \\ \leq \beta \Bigl( \ess \inf_{\pi} R
870: + \eta \Bigr) - \log \Bigl[ \pi\bigl( R \leq \ess \inf_{\pi} R + \eta
871: \bigr) \Bigr],
872: \end{multline*}
873: where we have used successively the fact that $\lambda \mapsto
874: \pi_{\exp( - \lambda R)}(R)$ is decreasing (because it is
875: the derivative of the concave function $ \lambda \mapsto -\log
876: \bigl\{ \pi \bigl[ \exp( - \lambda R) \bigr] \bigr\}$)
877: and the fact that the exponential function takes positive values.
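
In more detail, the last inequality of the chain is obtained by restricting
the integral with respect to $\pi$ to the set where
$R \leq \ess \inf_{\pi} R + \eta$, which is legitimate because the integrand is
non negative:
$$
\pi \bigl[ \exp( - \beta R) \bigr] \geq
\pi \Bigl[ \exp( - \beta R) \B{1} \bigl( R \leq \ess \inf_{\pi} R + \eta \bigr) \Bigr]
\geq \exp \Bigl[ - \beta \bigl( \ess \inf_{\pi} R + \eta \bigr) \Bigr]
\pi \bigl( R \leq \ess \inf_{\pi} R + \eta \bigr).
$$
It then remains to take minus the logarithm of both sides.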
878:
879: In typical ``parametric'' situations $d_0(\Theta, R)$ will be finite,
880: and in all circumstances $d_{\eta}(\Theta, R)$
881: will be finite for any $\eta > 0$ (this is a direct consequence
882: of the definition of the essential infimum).
883: Using this notion of dimension, we see that
884: \begin{multline*}
885: \int_{0}^{\lambda} \pi_{\exp( -\beta R)}(R) d \beta \leq
886: \lambda \bigl( \ess \inf_{\pi} R + \eta \bigr)
887: \\ \shoveright{+ \int_{0}^{\lambda} \left[ \frac{d_{\eta}}{\beta} \wedge (1 - \ess
888: \inf_{\pi} R - \eta)
889: \right] d \beta \quad}\\ = \lambda \bigl(\ess \inf_{\pi} R + \eta \bigr) +
890: d_{\eta}(\Theta, R) \log \left[ \frac{e \lambda}{d_{\eta}(\Theta, R)}
891: \bigl(1 - \ess \inf_{\pi} R - \eta \bigr) \right].
892: \end{multline*}
893: This leads to
894: \begin{cor}
895: With the above notations, for any margin $\eta \in \RR_+$,
for any posterior distribution
897: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
898: $$
899: \PP \bigl[ \rho(R) \bigr] \leq \inf_{\lambda \in \RR_+}
900: \Phi_{\frac{\lambda}{N}}^{-1} \left[ \ess \inf_{\pi} R + \eta +
901: \frac{d_{\eta}}{\lambda} \log \left( \frac{e \lambda}{d_{\eta}} \right)
902: + \frac{\PP \bigl\{ \C{K}\bigl[\rho, \pi_{\exp( - \lambda r)}\bigr] \bigr\}}{\lambda}
903: \right].
904: $$
905: \end{cor}
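
The evaluation of the integral preceding the corollary can be spelled out as
follows. Let us write, only for this computation, $d = d_{\eta}(\Theta, R)$ and
$c = 1 - \ess \inf_{\pi} R - \eta$, and let us assume that $d > 0$, $c > 0$
and $\lambda \geq d / c$. The minimum $\frac{d}{\beta} \wedge c$ is equal to $c$
when $\beta \leq d/c$ and to $\frac{d}{\beta}$ otherwise, so that
$$
\int_0^{\lambda} \Bigl( \frac{d}{\beta} \wedge c \Bigr) d \beta
= \int_0^{d/c} c \, d \beta + \int_{d/c}^{\lambda} \frac{d}{\beta} \, d \beta
= d + d \log \Bigl( \frac{\lambda c}{d} \Bigr)
= d \log \Bigl( \frac{e \lambda c}{d} \Bigr).
$$
(When $\lambda < d/c$, the integrand is constant and the integral is simply
$\lambda c$.) The statement of the corollary then uses $c \leq 1$ to drop
the factor $c$ inside the logarithm.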
906:
If one wants a posterior distribution with a small support,
908: the theorem can also be applied to the case when $\rho$ is obtained by truncating $\pi_{\exp ( - \lambda r)}$
909: to some level set to reduce its support: let
910: $\Theta_{p} = \{ \theta \in \Theta : r(\theta) \leq p \}$,
911: and let us define for any $q \in )0,1)$ the level
912: $p_{q} = \inf \{ p : \pi_{\exp( - \lambda r)}(\Theta_p) \geq
913: q \}$,
914: let us then define $\rho_{q}$ by its density
915: $$
916: \frac{\ds d \rho_q}{\ds d \pi_{\exp(- \lambda r)}} (\theta)
917: = \frac{\ds \B{1}(\theta \in \Theta_{p_q})}{\ds \pi_{\exp( - \lambda r)}(\Theta_{p_q})},
918: $$
then $\rho_1 = \pi_{\exp ( - \lambda r)}$ and for any $q \in )0,1)$,
920: \begin{align*}
921: \PP \bigl[ \rho_q(R) \bigr] &
922: \leq
923: \frac{1 - \exp \left\{ - \frac{1}{N} \int_0^{\lambda} \pi_{\exp( - \beta R)}(R)
924: d \beta - \frac{\log(q)}{N}
925: \right\}}{
926: 1 - \exp( - \frac{\lambda}{N})} \\
927: & \leq \frac{1}{N \bigl[ 1 - \exp ( - \frac{\lambda}{N}) \bigr]}
928: \biggl\{ \int_0^{\lambda} \pi_{\exp( - \beta R)}(R) d \beta
929: - \log(q) \biggr\}.
930: \end{align*}
931:
932: \subsubsection{Deviation bounds}
933: They provide results holding under the distribution $\PP$
934: of the sample with probability at least $1 - \epsilon$, for any
935: given confidence level, set by the choice of $\epsilon \in )0, 1($.
936: Using them is the only way to be quite (i.e. with probability $1-\epsilon$)
937: sure to do the right thing,
938: although this right thing may be overpessimistic, since
deviation upper bounds are larger than the corresponding unbiased bounds.
940:
941: Starting again
942: from Theorem \ref{thm2.3}, and using Markov's inequality \linebreak $\PP \bigl[
943: \exp (h) \geq 1 \bigr] \leq \PP \bigl[ \exp(h) \bigr]$, we
944: obtain
945: \begin{thm}
946: \label{thm2.7}
947: \mypoint For any positive parameter $\lambda$, with $\PP$ probability at least $1 - \epsilon$,
948: for any posterior distribution $\rho : \Omega \rightarrow
949: \C{M}_+^1(\Theta)$,
950: \begin{align*}
951: \rho(R) & \leq \Phi_{\frac{\lambda}{N}}^{-1} \left\{
952: \rho(r) + \frac{\C{K}(\rho, \pi) - \log(\epsilon)}{\lambda} \right\}\\
953: & = \frac{\ds 1 - \exp \left\{ - \frac{\lambda \rho(r)}{N}
954: - \frac{\C{K}(\rho,\pi) - \log(\epsilon)}{N} \right\}}{\ds 1
955: - \exp\bigl( - \tfrac{\lambda}{N}\bigr)} \\
956: & \leq \frac{\lambda}{\ds N \left[ 1 - \exp \left( -
957: \tfrac{\lambda}{N} \right) \right]}
958: \left[ \rho(r)+ \frac{ \C{K}(\rho, \pi) - \log(\epsilon)}{\lambda}
959: \right].
960: \end{align*}
961: \end{thm}
962:
963: We see that for a fixed value of the parameter $\lambda$,
964: the upper bound is optimized when the posterior is chosen
965: to be the Gibbs distribution $\rho = \pi_{\exp( - \lambda r)}$.
966:
967: Moreover we would like to be entitled to optimize the bound
968: in $\lambda$. Gaining the required uniformity in $\lambda$
969: can be done in the following way.
970: Let us notice first that values of $\lambda$ less than $1$
971: are not interesting (because they provide a bound larger than
972: one, at least as soon as $\epsilon \leq \exp(-1)$). Let us consider some real parameter
973: $\alpha > 1$, and the set $\Lambda =
974: \{ \alpha^k ; k \in \NN \}$. Let us put on this set
975: the probability measure $\nu(\alpha^k) = [(k+1)(k+2)]^{-1}$.
976: Applying the previous theorem to $\lambda = \alpha^k$ at
977: confidence level $1 - \frac{\epsilon}{(k+1)(k+2)}$,
978: and using a union bound, we see that
979: with probability at least $1 - \epsilon$,
980: for any posterior distribution $\rho$,
981: $$
982: \rho(R) \leq \inf_{\lambda' \in \Lambda}
983: \Phi_{\frac{\lambda'}{N}}^{-1}
984: \left\{ \rho(r) + \frac{\C{K}(\rho,\pi) - \log(\epsilon) +
985: 2 \log \Bigl[\tfrac{\log(\alpha^2\lambda')}{\log(\alpha)} \Bigr]}{
986: \lambda'}
987: \right\}.
988: $$
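
Let us make explicit the two elementary facts used here. The weights
$\nu(\alpha^k)$ sum to one by telescoping, and for $\lambda' = \alpha^k$ the
logarithmic term in the above bound dominates the penalty
$\log \bigl[ (k+1)(k+2) \bigr]$ coming from the union bound:
$$
\sum_{k \in \NN} \frac{1}{(k+1)(k+2)} = \sum_{k \in \NN}
\Bigl( \frac{1}{k+1} - \frac{1}{k+2} \Bigr) = 1,
\qquad
\log \bigl[ (k+1)(k+2) \bigr] \leq 2 \log (k+2)
= 2 \log \Bigl[ \tfrac{\log(\alpha^2 \alpha^k)}{\log(\alpha)} \Bigr].
$$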
989: Now we can remark that for any $\lambda \in (1, + \infty($,
990: there is $\lambda' \in \Lambda$ such that $\alpha^{-1} \lambda \leq \lambda' \leq
991: \lambda$. Moreover, for any $q \in (0,1)$, $\beta \mapsto \Phi_{\beta}^{-1}(q)$
992: is increasing on $\RR_+$. Thus
993: with probability at least $1 - \epsilon$,
994: for any posterior distribution $\rho$,
995: \begin{align*}
996: \rho(R) & \leq \inf_{\lambda \in (1, \infty(}
997: \Phi_{\frac{\lambda}{N}}^{-1}
998: \left\{ \rho(r) + \frac{\alpha}{\lambda} \left[
999: \C{K}(\rho,\pi) - \log(\epsilon) + 2 \log
1000: \Bigl( \tfrac{\log(\alpha^2 \lambda)}{\log(\alpha)} \Bigr)
1001: \right] \right\} \\
1002: & = \inf_{\lambda \in (1, \infty(}\frac{ 1 - \exp \left\{ - \frac{\lambda}{N}\rho(r) -
1003: \frac{\alpha}{N}\left[ \C{K}(\rho,\pi) - \log(\epsilon) +
1004: 2 \log \Bigl( \frac{\log(\alpha^2 \lambda)}{\log(\alpha)}
1005: \Bigr) \right] \right\}}{ 1 -
1006: \exp( - \frac{\lambda}{N} )}.
1007: \end{align*}
1008: Taking the approximately optimal value
1009: $$
1010: \lambda = \sqrt{ \frac{2 N \alpha \left[ \C{K}(\rho,\pi) - \log (\epsilon) \right]}{
1011: \rho(r)[ 1 - \rho(r) ]}},
1012: $$
1013: we obtain
1014: \begin{thm}
1015: \label{thm1.1.11}
\mypoint With probability at least $1 - \epsilon$, for any posterior distribution
1017: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$, putting
1018: $d(\rho,\epsilon) = \C{K}(\rho,\pi) - \log(\epsilon)$,
1019: \begin{multline*}
1020: \rho(R)
1021: \leq \inf_{k \in \NN}\frac{\ds 1 - \exp \left\{ -
1022: \frac{\alpha^k}{N}\rho(r) -
1023: \frac{1}{N}\Bigl[ d(\rho,\epsilon)+
1024: \log \bigl[
1025: (k+1)(k+2)\bigr] \Bigr] \right\}}{\ds 1 -
1026: \exp \left( - \frac{\alpha^k}{N} \right)} \\
1027: \leq \frac{\ds 1 - \exp \left\{ - \sqrt{\frac{2 \alpha \rho(r)
1028: d(\rho,\epsilon)}{N [1 - \rho(r)]}} - \frac{\alpha}{N}
1029: \Biggl[ d(\rho,\epsilon)+
1030: 2 \log \biggl( \tfrac{\log \left( \alpha^2
1031: \sqrt{\frac{2 N \alpha d(\rho,\epsilon)}{
1032: \rho(r)[1 - \rho(r)]}}\right)}{\log(\alpha)} \biggr) \Biggr] \right\}}{\ds
1033: 1 - \exp \left[ - \sqrt{\frac{2 \alpha d(\rho,\epsilon)}{
1034: N \rho(r) [1 - \rho(r)]}} \right]}.
1035: \end{multline*}
1036: Moreover with probability at least $1 - \epsilon$, for any
1037: posterior distribution $\rho$ such that $\rho(r) = 0$,
1038: $$
1039: \rho(R) \leq 1 - \exp \left[ - \frac{\C{K}(\rho,\pi) - \log(\epsilon)}{N} \right].
1040: $$
1041: \end{thm}
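
The first bound of Theorem \ref{thm1.1.11} is straightforward to evaluate
numerically. Here is a minimal sketch, in which the function name, the default
value of $\alpha$ and the truncation level \texttt{k\_max} of the infimum are
assumptions made for the illustration (truncating the infimum can only increase
the returned value, which therefore remains a valid bound).

\begin{verbatim}
```python
import math


def deviation_bound(r_emp, kl, eps, n, alpha=1.1, k_max=200):
    """Infimum over the grid lambda = alpha**k, k = 0, ..., k_max.

    r_emp stands for rho(r), kl for K(rho, pi), eps for the confidence
    parameter epsilon and n for the sample size N.
    """
    d = kl - math.log(eps)                    # d(rho, epsilon)
    best = 1.0                                # the trivial bound
    for k in range(k_max + 1):
        lam = alpha ** k
        penalty = d + math.log((k + 1) * (k + 2))
        num = 1.0 - math.exp(-(lam * r_emp + penalty) / n)
        den = 1.0 - math.exp(-lam / n)
        best = min(best, num / den)
    return best


b = deviation_bound(0.2, 10.0, 0.05, 1000)
print(b)
```
\end{verbatim}

As expected, the returned value increases when the divergence
$\C{K}(\rho, \pi)$ increases or when $\epsilon$ decreases.
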
1042:
We can also elaborate on the results in another direction by introducing
1044: the {\em empirical dimension}
1045: \begin{equation}
1046: \label{eq1.1.3}
1047: d_e = \sup_{\beta \in \RR_+} \beta \bigl[ \pi_{\exp( - \beta r)}(r) -
1048: \ess\inf_{\pi} r
1049: \bigr] \leq - \log \bigl[ \pi \bigl( r = \ess \inf_{\pi} r\bigr) \bigr].
1050: \end{equation}
(There is no need to introduce a margin in this definition, since $r$ takes
at most $N+1$ values, and therefore $\pi \bigl( r = \ess \inf_{\pi}
1053: r \bigr)$
1054: will be strictly positive.)
1055: This leads to
1056: \begin{cor}
1057: \label{cor1.1.12}
1058: \mypoint
1059: For any positive real constant $\lambda$,
1060: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution
1061: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
1062: $$
1063: \rho(R) \leq \Phi_{\frac{\lambda}{N}}^{-1}
1064: \left[ \ess \inf_{\pi} r + \frac{d_e}{\lambda} \log \left( \frac{e \lambda}{d_e}
1065: \right) + \frac{\C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr]- \log(\epsilon)
1066: }{\lambda} \right].
1067: $$
1068: \end{cor}
1069: We could then make the bound uniform in $\lambda$ and optimize this parameter
1070: in a way similar to what was done to obtain Theorem \ref{thm1.1.11}.
1071:
1072: \subsection{Local bounds}
1073: In this subsection, better bounds will be achieved through a better choice
1074: of the prior distribution. This better prior distribution turns out to
1075: depend on the unknown sample distribution $\PP$, and some work is required to
1076: circumvent this and obtain empirical bounds.
1077: \subsubsection{Choice of the prior}
1078: As mentioned in the introduction, if one is
1079: willing to minimize the bound in expectation provided by Theorem
1080: \ref{thm2.4} (page \pageref{thm2.4}),
1081: one is led to consider the optimal choice $\pi =
1082: \PP(\rho)$. However, this is but an ideal choice, since
1083: $\PP$ is in all conceivable situations unknown. Nevertheless it
1084: shows that it is possible through Theorem \ref{thm2.4} to measure
1085: the {\em complexity} of the classification model
1086: with $\PP \bigl\{ \C{K}\bigl[\rho, \PP(\rho) \bigr] \bigr\}$,
1087: which is nothing but the {\em mutual information}
1088: between the random sample $(X_i,Y_i)_{i=1}^N$
1089: and the estimated parameter $\Hat{\theta}$, when the sample
1090: is drawn according to $\PP$ and the
1091: estimated parameter knowing the sample is drawn according
1092: to $\rho$.
1093:
1094: In practice, since we cannot choose $\pi = \PP(\rho)$,
1095: we have to be content with a {\em flat} prior $\pi$,
1096: resulting in a bound measuring complexity according to
1097: $\PP \bigl[ \C{K}(\rho,\pi) \bigr] = \PP \bigl\{ \C{K} \bigl[ \rho, \PP(\rho) \bigr]
1098: \bigr\} + \C{K} \bigl[ \PP(\rho), \pi \bigr]$ larger by the entropy
1099: factor $\C{K}\bigl[ \PP(\rho), \pi \bigr]$ than the optimal one
1100: (we are still commenting on Theorem \ref{thm2.4}).
1101:
1102: If we want to base the choice of $\pi$ on Theorem \ref{thm2.5}
1103: (page \pageref{thm2.5}), and if we
1104: choose
1105: $\rho = \pi_{\exp( - \lambda r)}$
1106: to optimize this bound, we will be inclined to choose some $\pi$ such
1107: that
1108: $$
1109: \frac{1}{\lambda} \int_0^{\lambda} \pi_{\exp( - \beta R)}(R) d \beta
1110: = - \frac{1}{\lambda} \log \Bigl\{ \pi \bigl[ \exp( - \lambda R) \bigr] \Bigr\}
1111: $$
is as close as possible to $\inf_{\theta \in \Theta} R(\theta)$ in all circumstances. To give
1113: some more specific example, in
1114: the case when the distribution of the design $(X_i)_{i=1}^N$ is known,
1115: one can introduce on the parameter space $\Theta$ the metric $D$
1116: already defined by equation (\ref{eq1.1.2}, page \pageref{eq1.1.2})
1117: (or some available upper bound for this distance). In view of the fact that
1118: $R(\theta) - R(\theta') \leq D(\theta, \theta')$, for any $\theta$, $\theta'
1119: \in \Theta$, it can be meaningful, at least theoretically,
1120: to choose $\pi$ as
1121: $$
1122: \pi = \sum_{k=1}^{\infty} \frac{1}{k(k+1)} \pi_k,
1123: $$
1124: where $\pi_k$ is the uniform measure on some minimal (or close
1125: to minimal) $2^{-k}$-net $\C{N}(\Theta,
1126: D,2^{-k})$ of the metric space $(\Theta, D)$. With this choice
1127: \begin{multline*}
1128: - \frac{1}{\lambda} \log \Bigl\{ \pi \bigl[ \exp (- \lambda R) \bigr] \Bigr\}
1129: \leq \inf_{\theta \in \Theta} R(\theta)
1130: \\ + \inf_k \left\{ 2^{-k} + \frac{\log ( \lvert \C{N}(\Theta, D, 2^{-k}) \rvert
1131: ) + \log[k(k+1)]}{\lambda} \right\}.
1132: \end{multline*}
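
This bound can be checked as follows (a sketch, assuming for simplicity that
$\inf_{\Theta} R$ is reached at some point $\theta_*$, introduced here only for
this argument; otherwise one argues in the same way with approximate minimizers).
For each $k$, there is $\theta_k \in \C{N}(\Theta, D, 2^{-k})$ such that
$D(\theta_*, \theta_k) \leq 2^{-k}$, and therefore such that
$R(\theta_k) \leq R(\theta_*) + 2^{-k}$. Keeping only the contribution of the
atom of $\pi_k$ at $\theta_k$, we get
$$
\pi \bigl[ \exp( - \lambda R) \bigr] \geq
\frac{1}{k(k+1) \, \lvert \C{N}(\Theta, D, 2^{-k}) \rvert}
\exp \Bigl\{ - \lambda \bigl[ \inf_{\theta \in \Theta} R(\theta) + 2^{-k} \bigr] \Bigr\},
$$
and it remains to take logarithms, to divide by $- \lambda$ and to optimize in $k$.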
1133:
1134: Another possibility, when we have to deal with real valued parameters,
1135: meaning that $\Theta \subset \RR^d$, is to code each real component
1136: $\theta_i \in \RR$ of $\theta = (\theta_i)_{i=1}^d$ to some precision
1137: and to use a prior $\mu$ which is atomic on dyadic numbers. More
1138: precisely let us parametrize the set of dyadic real numbers as
1139: \begin{multline*}
1140: \C{D} = \Biggl\{
1141: r\bigl[ s, m, p, (b_j)_{j=1}^p\bigr] = s 2^m \biggl( 1 + \sum_{j=1}^p b_j 2^{-j}
1142: \biggr)\,\\ :\,
1143: s \in \{-1, +1\}, m \in \ZZ, p \in \NN, b_j \in \{0,1\} \Biggr\},
1144: \end{multline*}
1145: where, as can be seen, $s$ codes the sign, $m$ the order of magnitude,
1146: $p$ the precision and $(b_j)_{j=1}^p$ the binary representation of
1147: the dyadic number $r\bigl[ s,m,p, (b_j)_{j=1}^p \bigr]$. We can for
1148: instance consider on $\C{D}$ the probability distribution
1149: \begin{equation}
1150: \label{eq1.1.4bis}
1151: \mu\bigl\{ r\bigl[ s,m,p,(b_j)_{j=1}^p \bigr] \bigr\}
1152: = \Bigl[ 3 (\lvert m \rvert + 1)(\lvert m \rvert + 2) (p+1)(p+2) 2^p \Bigr]^{-1},
1153: \end{equation}
1154: and define $\pi \in \C{M}_+^1(\RR^d)$ as $\pi = \mu^{\otimes d}$.
1155: This kind of ``coding'' prior distribution can be used also to define
1156: a prior on the integers (by renormalizing the restriction of $\mu$
1157: to integers to get a probability distribution).
Using $\mu$ is somehow equivalent to picking a representative of
1159: each dyadic interval, and makes it possible to restrict to the
1160: case when the posterior $\rho$ is a Dirac mass without losing
1161: too much (when $\Theta = (0,1)$, this approach is somewhat equivalent
1162: to considering as prior distribution the Lebesgue measure and using
1163: as posterior distributions the uniform probability measures on dyadic
1164: intervals, with the advantage of obtaining non randomized estimators).
1165: When one uses in this way an atomic prior and Dirac masses as posterior
1166: distributions, the bounds proven so far can be obtained through a
1167: simpler union bound argument. This is so true that some of the
1168: detractors of the PAC-Bayesian approach (which, as a newcomer,
1169: has sometimes received a suspicious greeting among statisticians)
1170: have argued that it cannot bring anything that elementary union bound
1171: arguments could not essentially provide. We do not share of course
1172: this derogatory opinion, and while we think that allowing for
1173: non atomic priors and posteriors is worthwhile, we also would
like to stress that the local and relative bounds to come could
hardly be obtained with the help of union bounds alone.
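
As a check on equation \eqref{eq1.1.4bis}, let us verify that $\mu$, viewed as
a distribution on the codes $\bigl[ s, m, p, (b_j)_{j=1}^p \bigr]$, has total
mass one. There are two signs and, for each precision $p$, $2^p$ binary
strings $(b_j)_{j=1}^p$, so that the total mass is
$$
\frac{2}{3} \Biggl[ \sum_{m \in \ZZ} \frac{1}{(\lvert m \rvert + 1)(\lvert m \rvert + 2)}
\Biggr] \Biggl[ \sum_{p \in \NN} \frac{2^p}{(p+1)(p+2) 2^p} \Biggr]
= \frac{2}{3} \Bigl( \frac{1}{2} + 2 \times \frac{1}{2} \Bigr) \times 1 = 1,
$$
where we have used the telescoping sum
$\sum_{k \in \NN} \frac{1}{(k+1)(k+2)} = 1$. Distinct codes may represent the
same dyadic number (for instance when the last bits are zero), in which case
the corresponding masses simply add up.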
1176:
1177: Although the choice of a {\em flat} prior seems at first glance to be
1178: the only alternative when nothing is known about the sample distribution
1179: $\PP$, the previous discussion shows that this type of choice is
lacking proper localisation, namely that we lose a factor
1181: $\C{K}\bigl\{ \PP\bigl[\pi_{\exp(- \lambda r)}\bigr],\pi \bigr\}$, the divergence
1182: between the bound-optimal prior $\PP\bigl[ \pi_{\exp( - \lambda r)} \bigr]$,
1183: which is concentrated near the minima of $R$ in favourable situations,
1184: and the flat prior $\pi$. Fortunately, there are technical ways to
1185: get around this difficulty and to obtain more local empirical bounds.
1186:
1187: \subsubsection{Unbiased local empirical bounds}
1188: The idea is to start with some flat prior $\pi \in \C{M}_+^1(\Theta)$, and the
1189: posterior distribution $\rho = \pi_{\exp( - \lambda r)}$ minimizing the bound of
1190: Theorem \ref{thm2.4}
1191: (page \pageref{thm2.4}), when $\pi$ is used as a prior. To improve the bound, we
1192: would like to use $\PP \bigl[ \pi_{\exp(- \lambda r)}\bigr]$ instead of $\pi$,
1193: and we are going to make the guess that we could approximate it with $\pi_{\exp(
1194: - \beta R)}$ (we have replaced the parameter $\lambda$ with some distinct
1195: parameter $\beta$ to give some more freedom to our investigation,
1196: and also because, intuitively, $\PP \bigl[ \pi_{\exp( - \lambda r)} \bigr]$
1197: may be expected to be less concentrated than each of the $\pi_{\exp( - \lambda r)}$
1198: it is mixing,
1199: which suggests that the best approximation of $\PP \bigl[
1200: \pi_{\exp( - \lambda r)} \bigr]$ by some $\pi_{\exp( - \beta R)}$
1201: may be obtained for some parameter $\beta < \lambda$). We are then
1202: led to look for some empirical upper bound of $\C{K}\bigl[
1203: \rho, \pi_{\exp( -\beta R)} \bigr]$. This is happily provided by the
1204: following computation
1205: \begin{multline*}
1206: \PP \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr] \bigr\}
1207: = \PP \bigl[ \C{K}(\rho, \pi) \bigr] + \beta \PP \bigl[ \rho (R) \bigr]
1208: + \log \Bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \Bigr\}
1209: \\ = \PP \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)}\bigr] \bigr\}
1210: + \beta \PP \bigl[ \rho(R-r) \bigr]
1211: \\ + \log \Bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \Bigr\}
1212: - \PP \Bigl\{ \log \pi \bigl[ \exp( - \beta r) \bigr] \Bigr\}.
1213: \end{multline*}
1214: Using the convexity of $r \mapsto \log \bigl\{ \pi \bigl[
1215: \exp ( - \beta r) \bigr] \bigr\}$ as in equation
1216: \eqref{eq1.1.3Ter} on page \pageref{eq1.1.3Ter}, we see that
1217: $$
1218: 0 \leq \PP \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr] \bigr\}
1219: \leq \beta \PP \bigl[ \rho(R - r) \bigr] + \PP \bigl\{ \C{K} \bigl[ \rho,
1220: \pi_{\exp( - \beta r)} \bigr] \bigr\}.
1221: $$
1222: This inequality has an interest of its own, since it provides a lower
1223: bound for $\PP \bigl[ \rho(R) \bigr]$. Moreover we can plug it
1224: into Theorem \ref{thm2.4} (page \pageref{thm2.4}) applied to the prior distribution
1225: $\pi_{\exp( - \beta R)}$ and obtain for any posterior distribution $\rho$
and any positive parameter $\lambda$ that
1227: $$
1228: \Phi_{\frac{\lambda}{N}} \bigl\{ \PP \bigl[ \rho(R) \bigr] \bigr\}
1229: \leq \PP \biggl\{ \rho(r) + \frac{\beta}{\lambda} \rho(R-r)
1230: + \frac{1}{\lambda} \PP \Bigl\{ \C{K}\bigl[
1231: \rho, \pi_{\exp( - \beta r)} \bigr] \Bigr\} \biggr\}.
1232: $$
In view of this, it is convenient to introduce the function
1234: \newcommand{\TPhi}{\widetilde{\Phi}}
1235: \begin{multline*}
1236: \TPhi_{a,b}(p) = (1 - b)^{-1}
1237: \bigl[ \Phi_a(p) - bp \bigr] \\
1238: = - (1 - b)^{-1} \Bigl\{ a^{-1} \log \bigl\{ 1 - p
1239: \bigl[ 1 - \exp( - a) \bigr] \bigr\} + bp \Bigr\},\\
1240: p \in (0,1), a \in )0,\infty(, b \in (0,1(.
1241: \end{multline*}
1242: This is a convex function of $p$, moreover
1243: $$
1244: \TPhi_{a,b}'(0)
1245: = \Bigl\{ a^{-1} \bigl[ 1 - \exp(- a) \bigr] - b \Bigr\} (1 - b)^{-1},$$
showing that it is an increasing one-to-one convex map of the unit interval onto
1247: itself as soon as $b \leq a^{-1}
1248: \bigl[ 1 - \exp( - a ) \bigr]$.
1249: Its convexity, combined with the value of its derivative at the origin, shows
1250: that
1251: $$
1252: \TPhi_{a,b}(p) \geq \frac{a^{-1} \bigl[ 1 - \exp ( - a) \bigr] - b}{1-b} p.
1253: $$
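To see how $\TPhi$ enters, here is a short sketch of the rearrangement,
writing for brevity $p = \PP \bigl[ \rho(R) \bigr]$, $q = \PP \bigl[ \rho(r) \bigr]$
and $k = \PP \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)} \bigr] \bigr\}$
(these three abbreviations are introduced only for this remark).
Assuming $\beta < \lambda$, the inequality preceding the definition of $\TPhi$ reads
$$
\Phi_{\frac{\lambda}{N}}(p) \leq q + \frac{\beta}{\lambda} (p - q) + \frac{k}{\lambda},
\quad \text{that is} \quad
\Phi_{\frac{\lambda}{N}}(p) - \frac{\beta}{\lambda} p
\leq \Bigl( 1 - \frac{\beta}{\lambda} \Bigr) q + \frac{k}{\lambda}.
$$
Dividing by $1 - \frac{\beta}{\lambda}$ gives
$\TPhi_{\frac{\lambda}{N}, \frac{\beta}{\lambda}}(p) \leq q + \frac{k}{\lambda - \beta}$.
With $a = \frac{\lambda}{N}$ and $b = \frac{\beta}{\lambda}$, the slope at the origin
computed above becomes
$$
\frac{a^{-1} \bigl[ 1 - \exp( - a) \bigr] - b}{1 - b}
= \frac{N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr] - \beta}{\lambda - \beta},
$$
which is positive exactly when $\beta < N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr]$.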
1254: Using these notations and remarks, we can state
1255: \begin{thm}
1256: \label{thm3.1}
1257: \mypoint For any positive real constants
1258: $\beta$ and $\lambda$ such that
1259: $0 \leq \beta < N [1 - \exp( - \frac{\lambda}{N})]$, for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
1260: \begin{multline*}
1261: \PP \biggl\{ \rho(r) - \frac{ \C{K} \bigl[ \rho, \pi_{\exp( - \beta r)} \bigr]}{\beta}
1262: \biggr\} \leq
1263: \PP \bigl[ \rho(R) \bigr] \\ \leq
1264: \TPhi_{\frac{\lambda}{N}, \frac{\beta}{\lambda}}^{-1}
1265: \biggl\{ \PP \biggl[ \rho(r) + \frac{\C{K}\bigl[ \rho, \pi_{\exp( - \beta r)}
1266: \bigr]}{\lambda - \beta}
1267: \biggr] \biggr\}
1268: \\ \leq
1269: \frac{\lambda - \beta}{N [ 1 - \exp( - \frac{\lambda}{N})] - \beta}
1270: \PP \biggl[ \rho(r) + \frac{\C{K} \bigl[ \rho, \pi_{\exp( - \beta r)}
1271: \bigr]}{\lambda - \beta} \biggr].
1272: \end{multline*}
1273: Thus (taking $\lambda = 2 \beta$), for any $\beta$ such that $0 \leq \beta < \frac{N}{2}$,
1274: $$
1275: \PP \bigl[ \rho(R) \bigr]
1276: \leq \frac{1}{1 - \frac{2 \beta}{N}} \PP \biggl\{ \rho(r) + \frac{\C{K}\bigl[
1277: \rho, \pi_{\exp(- \beta r)} \bigr]}{\beta} \biggr\}.
1278: $$
1279: \end{thm}
1280: Note that the last inequality is obtained using the fact that
1281: $1 - \exp( - x) \geq x - \frac{x^2}{2}$, $x \in \RR_+$.
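As a quick check of this last step (a sketch added for convenience),
take $\lambda = 2 \beta$ and $x = \frac{2 \beta}{N}$. Then
$N \bigl[ 1 - \exp( - \frac{2 \beta}{N}) \bigr] \geq 2 \beta - \frac{2 \beta^2}{N}$,
so that, provided $\beta < \frac{N}{2}$,
$$
\frac{\lambda - \beta}{N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr] - \beta}
\leq \frac{\beta}{\beta - \frac{2 \beta^2}{N}} = \frac{1}{1 - \frac{2 \beta}{N}},
$$
while the entropy term is divided by $\lambda - \beta = \beta$.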
1282: \begin{cor}
1283: \label{cor3.2}
1284: \mypoint For any $\beta \in (0,N($,
1285: \begin{multline*}
1286: \PP \bigl[ \pi_{\exp( - \beta r)}(r) \bigr] \leq
1287: \PP \bigl[ \pi_{\exp(- \beta r)}(R) \bigr] \\
1288: \leq \inf_{\lambda \in (- N \log(1 - \frac{\beta}{N}),
1289: \infty(} \frac{\lambda - \beta}{N[1 - \exp( - \frac{\lambda}{N})] - \beta}
1290: \PP \bigl[ \pi_{\exp( - \beta r)}(r) \bigr]
1291: \\ \leq \frac{1}{1 - \frac{2 \beta}{N}} \PP \bigl[
1292: \pi_{\exp( - \beta r)}(r) \bigr],
1293: \end{multline*}
1294: the last inequality holding only when $\beta < \frac{N}{2}$.
1295: \end{cor}
1296:
1297: It is interesting to compare the upper bound provided by
1298: this corollary with Theorem \ref{thm2.4} on page \pageref{thm2.4}
1299: when the posterior is a Gibbs measure $\rho = \pi_{\exp( - \beta r)}$.
We see that we have succeeded in getting rid of the entropy term
1301: $\C{K}\bigl[\pi_{\exp( - \beta r)}, \pi \bigr]$, but at the price
1302: of an increase of the multiplicative factor, which for small values of
1303: $\frac{\beta}{N}$ grows from $( 1 - \frac{\beta}{2N})^{-1}$
1304: (when we take $\lambda = \beta$ in Theorem \ref{thm2.4}),
1305: to $(1 - \frac{2 \beta}{N})^{-1}$. Therefore non localized bounds
1306: have an interest of their own, and are superseded by localized
1307: bounds only in favourable circumstances (presumably when the sample
1308: is large enough when compared with the complexity of the classification
1309: model).
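To give an idea of the size of this increase, take for instance
$\frac{\beta}{N} = 0.1$ (a value chosen here purely for illustration). The two
multiplicative factors mentioned above are then
$$
\Bigl( 1 - \frac{\beta}{2N} \Bigr)^{-1} = \frac{1}{0.95} \simeq 1.053
\quad \text{and} \quad
\Bigl( 1 - \frac{2 \beta}{N} \Bigr)^{-1} = \frac{1}{0.8} = 1.25.
$$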
1310:
1311: Corollary \ref{cor3.2} shows that when $\frac{2 \beta}{N}$ is
1312: small, $\pi_{\exp( - \beta r)}(r)$ is a tight approximation of
1313: $\pi_{\exp( - \beta r)}(R)$ in the mean (since we have
1314: an upper bound and a lower bound which are close together).
1315:
1316: Another corollary is obtained by optimizing the bound
1317: given by Theorem \ref{thm3.1} in $\rho$, which is done
1318: by taking $\rho = \pi_{\exp( - \lambda r)}$.
1319: \begin{cor}
1320: \mypoint For any positive real constants $\beta$ and $\lambda$ such that
1321: $0 \leq \beta < N[1 - \exp( - \frac{\lambda}{N})]$,
1322: \begin{multline*}
1323: \PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr]
1324: \leq \TPhi_{\frac{\lambda}{N}, \frac{\beta}{\lambda}}^{-1}
1325: \biggl\{ \PP \biggl[ \frac{1}{\lambda - \beta} \int_{\beta}^{\lambda}
1326: \pi_{\exp( - \gamma r)}(r) d \gamma \biggr] \biggr\}
1327: \\ \leq \frac{1}{N[1 - \exp( - \frac{\lambda}{N})] - \beta} \PP
\biggl[ \int_{\beta}^{\lambda}
1329: \pi_{\exp( - \gamma r)}(r) d \gamma \biggr].
1330: \end{multline*}
1331: \end{cor}
1332: Although this inequality gives by construction a better
1333: upper bound for $\inf_{\lambda \in \RR_+} \PP \bigl[
1334: \pi_{\exp( - \lambda r)}(R) \bigr]$ than Corollary
\ref{cor3.2}, it is not easy to tell which of the two inequalities
gives the better bound on $\PP \bigl[ \pi_{\exp( - \lambda r)}(R)\bigr]$
for a fixed (and possibly suboptimal) value of
$\lambda$, because in this case one factor is improved while the other is worsened.
1339:
1340: Using the {\em empirical dimension} $d_e$ defined by equation \eqref{eq1.1.3}
1341: on page \pageref{eq1.1.3}, we see that
1342: $$
1343: \frac{1}{\lambda - \beta} \int_{\beta}^{\lambda} \pi_{\exp( - \gamma r)}(r)
d \gamma \leq \ess \inf_{\pi} r + \frac{d_e}{\lambda - \beta} \log \left( \frac{\lambda}{\beta} \right).
1345: $$
1346: Therefore, in the case when we keep the ratio $\frac{\lambda}{\beta}$
1347: bounded, we get a better dependence on the empirical dimension $d_e$
1348: than in Corollary \ref{cor1.1.12} (page \pageref{cor1.1.12}).
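The integral bound can be sketched as follows, assuming that equation
\eqref{eq1.1.3} defines the empirical dimension as
$d_e = \sup_{\xi \in \RR_+} \xi \bigl[ \pi_{\exp( - \xi r)}(r) - \ess \inf_{\pi} r \bigr]$
(the form used for submodels later in this section). Under this assumption,
$\pi_{\exp( - \gamma r)}(r) \leq \ess \inf_{\pi} r + \frac{d_e}{\gamma}$ for any
$\gamma > 0$, and integrating over $\gamma \in (\beta, \lambda)$ gives
$$
\int_{\beta}^{\lambda} \pi_{\exp( - \gamma r)}(r) d \gamma
\leq (\lambda - \beta) \ess \inf_{\pi} r + d_e \int_{\beta}^{\lambda} \frac{d \gamma}{\gamma}
= (\lambda - \beta) \ess \inf_{\pi} r + d_e \log \left( \frac{\lambda}{\beta} \right).
$$
Dividing by $\lambda - \beta$ yields the averaged form. For instance, with
$\lambda = 2 \beta$, the dimension dependent term is $d_e \log(2)$ before
normalization, whatever the value of $\beta$.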
1349:
1350: \subsubsection{Non random local bounds} Let us come now to the localization
1351: of the non random upper
1352: bound given by Theorem \ref{thm2.5} on page \pageref{thm2.5}.
1353: According to Theorem \ref{thm2.4} (page \pageref{thm2.4})
1354: applied to the localized prior $\pi_{\exp( - \beta R)}$,
1355: \begin{multline*}
1356: \lambda \Phi_{\frac{\lambda}{N}} \bigl\{ \PP \bigl[ \rho(R) \bigr] \bigr\}
1357: \leq \PP \Bigl\{ \lambda \rho(r) + \C{K}(\rho, \pi) + \beta \rho(R) \Bigr\}
1358: + \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\} \\
1359: = \PP \Bigl\{ \C{K}\bigl[\rho, \pi_{\exp( - \lambda r)}\bigr]
1360: - \log \bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr] \bigr\} +
1361: \beta \rho(R) \Bigr\} + \log \bigl\{ \pi \bigl[ \exp (- \beta R) \bigr] \bigr\}\\
1362: \leq \PP \Bigl\{ \C{K}\bigl[\rho, \pi_{\exp( - \lambda r)}\bigr]
1363: + \beta \rho(R) \Bigr\} - \log \bigl\{ \pi \bigl[ \exp( - \lambda R) \bigr]
1364: \bigr\} + \log \bigl\{ \pi \bigl[ \exp ( - \beta R) \bigr] \bigr\},
1365: \end{multline*}
1366: where we have used as previously inequality \eqref{eq1.1.3Ter}
1367: (page \pageref{eq1.1.3Ter}).
1368: This proves
1369: \begin{thm}
1370: \mypoint For any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
1371: for any real parameters $\beta$ and $\lambda$ such that
1372: $0 \leq \beta < N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr]$,
1373: \begin{multline*}
1374: \PP \bigl[ \rho(R) \bigr]
1375: \leq \TPhi_{\frac{\lambda}{N}, \frac{\beta}{\lambda}}^{-1}
1376: \biggl\{
1377: \frac{1}{ \lambda - \beta} \int_{\beta}^{\lambda}
1378: \pi_{\exp( - \gamma R)}(R) d \gamma + \PP \biggl[ \frac{\C{K}\bigl[ \rho,
1379: \pi_{\exp( - \lambda r)}\bigr]}{\lambda - \beta} \biggr] \biggr\} \\
1380: \leq \frac{ 1}{N \bigl[ 1 - \exp( - \frac{\lambda}{N} )
1381: \bigr] - \beta} \biggl\{
1382: \int_{\beta}^{\lambda}
1383: \pi_{\exp( - \gamma R)}(R) d \gamma + \PP \Bigl\{ \C{K}\bigl[
1384: \rho, \pi_{\exp( - \lambda r)}\bigr] \Bigr\} \biggr\}.
1385: \end{multline*}
1386: \end{thm}
1387: Let us notice in particular that this theorem contains Theorem \ref{thm2.5}
1388: (page \pageref{thm2.5})
1389: which corresponds to the case $\beta = 0$. As a corollary, we see also,
1390: taking $\rho = \pi_{\exp( - \lambda r)}$ and $\lambda = 2 \beta$,
1391: and noticing that $\gamma \mapsto \pi_{\exp( -\gamma R)}(R)$ is decreasing, that
1392: \begin{align*}
1393: \PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr]
1394: & \leq \inf_{\beta, \beta < N[ 1 - \exp( - \frac{\lambda}{N})]}
\frac{\lambda - \beta}{N \bigl[ 1 - \exp( - \frac{\lambda}{N} ) \bigr]
1396: - \beta} \pi_{\exp( - \beta R)}(R)
1397: \\ & \leq \frac{1}{1 - \frac{\lambda}{N}} \pi_{\exp( - \frac{\lambda}{2} R)}(R).
1398: \end{align*}
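For the reader's convenience, here is a sketch of the two elementary steps
behind the last display. Since $\gamma \mapsto \pi_{\exp( - \gamma R)}(R)$ is decreasing,
$$
\int_{\beta}^{\lambda} \pi_{\exp( - \gamma R)}(R) d \gamma
\leq (\lambda - \beta) \pi_{\exp( - \beta R)}(R),
$$
while the entropy term of the theorem vanishes when $\rho = \pi_{\exp( - \lambda r)}$.
Taking then $\beta = \frac{\lambda}{2}$ and using
$1 - \exp( - x) \geq x - \frac{x^2}{2}$, we get for any $\lambda < N$
$$
\frac{\frac{\lambda}{2}}{N \bigl[ 1 - \exp( - \frac{\lambda}{N}) \bigr] - \frac{\lambda}{2}}
\leq \frac{\frac{\lambda}{2}}{\frac{\lambda}{2} - \frac{\lambda^2}{2N}}
= \frac{1}{1 - \frac{\lambda}{N}}.
$$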
1399: We can use this inequality in conjunction with the notion of
1400: dimension with margin $\eta$ introduced by equation
1401: \eqref{eq1.1.3Bis} on page \pageref{eq1.1.3Bis},
1402: to see that the Gibbs posterior achieves for
1403: a proper choice of $\lambda$ and any margin parameter $\eta \geq 0$
1404: (which can be chosen to be equal to zero in parametric
1405: situations)
1406: \begin{multline}
1407: \label{eq1.1.7}
1408: \inf_{\lambda} \PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr]
1409: \leq \ess \inf_{\pi} R + \eta + \frac{4 d_{\eta}}{N} \\ +
1410: 2 \sqrt{ \frac{2d_{\eta} \bigl( \ess \inf_{\pi} R + \eta
1411: \bigr) }{N} + \frac{4 d_{\eta}^2}{N^2}}.
1412: \end{multline}
1413: Deviation bounds to come next will show that the optimal
1414: $\lambda$ can be estimated from empirical data.
1415:
Let us propose a little numerical example as an illustration: assuming
1417: that $d_{0} = 10$, $N=1000$ and $\ess \inf_{\pi}
1418: R = 0.2$, we obtain from equation
1419: \eqref{eq1.1.7} that
1420: $\inf_{\lambda} \PP \bigl[ \pi_{\exp(-\lambda r)}(R) \bigr]
1421: \leq 0.373$.
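The arithmetic behind this figure can be checked as follows (a worked
evaluation of the right-hand side of \eqref{eq1.1.7} with $\eta = 0$,
$d_0 = 10$, $N = 1000$ and $\ess \inf_{\pi} R = 0.2$):
$$
0.2 + \frac{4 \times 10}{1000}
+ 2 \sqrt{ \frac{2 \times 10 \times 0.2}{1000} + \frac{4 \times 10^2}{1000^2}}
= 0.24 + 2 \sqrt{0.0044} \simeq 0.24 + 2 \times 0.0663 \simeq 0.3727.
$$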
1422: \subsubsection{Local deviation bounds}
1423: %\newcommand{\BPsi}{\overline{\Phi}}
1424: When it comes to deviation bounds, we will for technical reasons
1425: choose a slightly more involved change of prior distribution and
1426: apply Theorem \ref{thm2.7} (page \pageref{thm2.7}) to the prior $
1427: \pi_{\exp [ - \beta \Phi_{- \frac{\beta}{N}}
1428: \circ R ]}$. The advantage of tweaking $R$ with the nonlinear function
1429: $\Phi_{- \frac{\beta}{N}}$ will appear in the search for an empirical upper
1430: bound of the local entropy term.
1431: Theorem \ref{thm2.3} (page \pageref{thm2.3}), used with the above mentioned local prior,
1432: shows that
1433: \begin{equation}
1434: \label{eq1.1.4}
1435: \PP \Biggl\{ \sup_{\rho \in \C{M}_+^1(\Theta)}
1436: \lambda \Bigl\{ \rho \bigl(\Phi_{\frac{\lambda}{N}}\!\circ\!R \bigr)
1437: - \rho(r) \Bigr\} - \C{K}\bigl[\rho, \pi_{\exp (- \beta \Phi_{- \frac{\beta}{N}}
1438: \!\circ R)}\bigr] \Biggr\} \leq 1.
1439: \end{equation}
1440: \newcommand{\Brho}{\Bar{\rho}}Moreover
1441: \begin{multline}
1442: \label{eq1.1.5bis}
1443: \C{K}\bigl[ \rho, \pi_{\exp[ - \beta \Phi_{- \frac{\beta}{N}}\circ R ]} \bigr]
1444: = \C{K}\bigl[ \rho,\pi_{\exp( - \beta r)}
1445: \bigr] + \beta \rho \Bigl[ \Phi_{- \frac{\beta}{N}}\!\circ\!R - r \Bigr] \\*
1446: + \log \Bigl\{ \pi \Bigl[ \exp \bigl( - \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R
1447: \bigr) \Bigr] \Bigr\} - \log \Bigl\{ \pi \Bigl[ \exp ( - \beta r) \Bigr]
1448: \Bigr\},
1449: \end{multline}
which is an invitation to find an upper bound for
$\log \Bigl\{ \pi \Bigl[ \exp \bigl( - \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R
\bigr) \Bigr] \Bigr\} - \log \Bigl\{ \pi \bigl[ \exp ( - \beta r) \bigr] \Bigr\}$.
1453: \newcommand{\Bpi}{\overline{\pi}}
1454: Let us call for short $\Bpi$ our localized prior distribution, thus defined as
1455: $$
1456: \frac{d \Bpi}{d \pi}(\theta)
1457: = \frac{\ds
1458: \exp \Bigl\{ - \beta \Phi_{- \frac{\beta}{N}} \bigl[ R(\theta) \bigr] \Bigr\}}{\ds
1459: \pi \Bigl\{ \exp \bigl[ - \beta
1460: \Phi_{- \frac{\beta}{N}}\!\circ\!R \bigr] \Bigr\}}.
1461: $$
1462: Applying once again Theorem \ref{thm2.3} (page \pageref{thm2.3}),
1463: but this time to $- \beta$, we see that
1464: \begin{multline}
1465: \label{eq1.1.5}
1466: \PP \biggl\{ \exp \biggl[
1467: \log \Bigl\{ \pi \Bigl[ \exp \bigl( - \beta \Phi_{-
1468: \frac{\beta}{N}}\!\circ\!R
1469: \bigr) \Bigr] \Bigr\}
1470: - \log \Bigl\{ \pi \bigl[ \exp ( - \beta r) \bigr] \Bigr\} \biggr] \biggr\}
1471: \\ = \PP \biggl\{ \exp \biggl[
\log \Bigl\{ \pi \Bigl[ \exp \bigl( - \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R
1473: \bigr) \Bigr] \Bigr\}
1474: + \inf_{\rho \in \C{M}_+^1(\Theta)}
1475: \beta \rho(r) + \C{K}(\rho, \pi) \biggr] \biggr\}
1476: \\ \leq \PP \biggl\{ \exp \biggl[
\log \Bigl\{ \pi \Bigl[ \exp \bigl( - \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R
1478: \bigr) \Bigr] \Bigr\} + \beta \Bpi(r)
1479: + \C{K}(\Bpi , \pi) \biggr] \biggr\}
1480: \\ = \PP \biggl\{ \exp \biggl[
1481: \beta \Bigl[ \Bpi(r) - \Bpi \bigl( \Phi_{- \frac{\beta}{N}}\!\circ\!R \bigr)
\Bigr] - \C{K}(\Bpi,\Bpi) \biggr]
1483: \biggr\} \leq 1.
1484: \end{multline}
1485: Combining equations \eqref{eq1.1.5bis} and \eqref{eq1.1.5}
1486: and using the concavity of $\Phi_{- \frac{\beta}{N}}$,
1487: we see that with $\PP$ probability at least $1 - \epsilon$,
1488: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
1489: $$
1490: 0 \leq \C{K}(\rho, \Bpi) \leq \C{K} \bigl[\rho, \pi_{\exp(-\beta r)}\bigr]
1491: + \beta \Bigl[ \Phi_{-\frac{\beta}{N}}\bigl[ \rho(R) \bigr] - \rho(r) \Bigr]
1492: - \log(\epsilon).
1493: $$
1494: We have proved a lower deviation bound:
1495: \begin{thm} For any positive real constant $\beta$,
1496: with $\PP$ probability at least $1 - \epsilon$,
1497: for any posterior distribution $\rho : \Omega \rightarrow
1498: \C{M}_+^1(\Theta)$,
1499: $$
1500: \frac{\ds \exp \biggl\{ \frac{\beta}{N} \biggl[
1501: \rho(r) - \frac{\C{K}[\rho, \pi_{\exp( - \beta r)}]
1502: - \log(\epsilon)}{\beta} \biggr] \biggr\} - 1}{\ds
1503: \exp\bigl( \tfrac{\beta}{N} \bigr) - 1} \leq \rho (R).
1504: $$
1505: \end{thm}
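The statement follows from the previous inequality by a short rearrangement,
sketched here with the abbreviation $p = \rho(R)$. From the definition of
$\Phi_a$ with $a = - \frac{\beta}{N}$,
$$
\Phi_{- \frac{\beta}{N}}(p) = \frac{N}{\beta} \log \Bigl\{ 1 + p
\bigl[ \exp \bigl( \tfrac{\beta}{N} \bigr) - 1 \bigr] \Bigr\},
$$
and the inequality
$0 \leq \C{K} \bigl[\rho, \pi_{\exp(-\beta r)}\bigr]
+ \beta \bigl[ \Phi_{-\frac{\beta}{N}}(p) - \rho(r) \bigr] - \log(\epsilon)$
can be written as
$$
\log \Bigl\{ 1 + p \bigl[ \exp \bigl( \tfrac{\beta}{N} \bigr) - 1 \bigr] \Bigr\}
\geq \frac{\beta}{N} \biggl[ \rho(r) - \frac{\C{K}[\rho, \pi_{\exp( - \beta r)}]
- \log(\epsilon)}{\beta} \biggr].
$$
Taking exponentials, subtracting $1$ and dividing by
$\exp \bigl( \tfrac{\beta}{N} \bigr) - 1 > 0$ gives the lower bound on $\rho(R)$.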
Let us now seek an upper bound. Using the Cauchy-Schwarz inequality to combine
1507: equations \eqref{eq1.1.4} and \eqref{eq1.1.5},
1508: we obtain
1509: \begin{multline}
1510: \label{eq1.1.11Bis}
1511: \PP \biggl\{ \exp \biggl[ \frac{1}{2}
1512: \sup_{\rho \in \C{M}_+^1(\Theta)} \lambda
1513: \rho \bigl( \Phi_{\frac{\lambda}{N}}\!\circ\!R \bigr) - \beta
1514: \rho \bigl( \Phi_{- \frac{\beta}{N}}\!\circ\!R \bigr) - (\lambda - \beta)
1515: \rho(r) - \C{K}\bigl[ \rho, \pi_{\exp(- \beta r)}\bigr] \biggr] \biggr\}
1516: \\ =
1517: \PP \biggl\{ \exp \biggl[
1518: \tfrac{1}{2} \sup_{\rho \in \C{M}_+^1(\Theta)} \biggl(\lambda \Bigl\{
1519: \rho \bigl( \Phi_{\frac{\lambda}{N}}\!\circ\!R \bigr)
- \rho(r) \Bigr\} - \C{K}(\rho, \Bpi) \biggr) \biggr] \\
1521: \times \exp \biggl[ \tfrac{1}{2}
1522: \biggl( \log \Bigl\{ \pi \Bigl[
1523: \exp\bigl( - \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R\bigr)
1524: \Bigr] \Bigr\} - \log \Bigl\{ \pi \Bigl[
1525: \exp ( - \beta r) \Bigr] \Bigr\} \biggr) \biggr] \biggr\}
1526: \\ \leq
1527: \PP \biggl\{ \exp \biggl[
1528: \sup_{\rho \in \C{M}_+^1(\Theta)} \biggl(\lambda \Bigl\{
1529: \rho \bigl( \Phi_{\frac{\lambda}{N}}\!\circ\!R \bigr)
- \rho(r) \Bigr\} - \C{K}(\rho, \Bpi) \biggr) \biggr] \biggr\}^{1/2}\\
1531: \times \PP \biggl\{ \exp \biggl[
1532: \biggl( \log \Bigl\{ \pi \Bigl[
1533: \exp\bigl( - \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R\bigr)
1534: \Bigr] \Bigr\} - \log \Bigl\{ \pi \Bigl[
1535: \exp ( - \beta r) \Bigr] \Bigr\} \biggr) \biggr] \biggr\}^{1/2}
1536: \leq 1.
1537: \end{multline}
1538: Thus with $\PP$ probability
1539: at least $1 - \epsilon$, for any posterior distribution $\rho$,
1540: $$
1541: \lambda \Phi_{\frac{\lambda}{N}}\bigl[ \rho(R) \bigr]
1542: - \beta \Phi_{- \frac{\beta}{N}} \bigl[ \rho(R) \bigr]
1543: \leq (\lambda - \beta) \rho(r) + \C{K}(\rho, \pi_{\exp(- \beta r)})
1544: - 2 \log(\epsilon).
1545: $$
(It would have been more straightforward to use a union bound on
deviation inequalities instead of the Cauchy-Schwarz
inequality on exponential moments, but this would have
replaced $- 2 \log(\epsilon)$ with the worse term
$2 \log(\frac{2}{\epsilon})$.)
Let us now recall that
1552: \begin{multline*}
1553: \lambda \Phi_{\frac{\lambda}{N}}(p) - \beta \Phi_{-\frac{\beta}{N}}(p)
1554: = - N \log \Bigl\{ 1 - \bigl[ 1 - \exp\bigl(- \tfrac{\lambda}{N}\bigr)\bigr] p
1555: \Bigr\} \\ - N \log \Bigl\{ 1 + \bigl[\exp\bigl( \tfrac{\beta}{N} \bigr) - 1\bigr] p
1556: \Bigr\},
1557: \end{multline*}
1558: and let us put
1559: \begin{multline*}
1560: B = (\lambda - \beta) \rho(r) + \C{K}\bigl[ \rho, \pi_{\exp(- \beta r)}\bigr]
1561: - 2 \log(\epsilon) \\
1562: = \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr]
1563: + \int_{\beta}^{\lambda} \pi_{\exp( - \xi r)}(r) d \xi - 2 \log(\epsilon).
1564: \end{multline*}
1565: Let us consider moreover the change of variables
1566: $\alpha = 1 - \exp( - \frac{\lambda}{N})$ and $\gamma = \exp(\frac{\beta}{N}) - 1$.\\
1567: We obtain
1568: $
1569: \bigl[ 1 - \alpha \rho(R) \big] \bigl[ 1 + \gamma \rho(R) \bigr]
1570: \geq \exp( - \tfrac{B}{N}),
1571: $
1572: leading to
1573: \begin{thm}
1574: \label{thm1.1.17}\mypoint
1575: For any positive constants $\alpha$, $\gamma$, such that $0 \leq \gamma < \alpha <1$,
1576: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution
1577: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
1578: the bound
1579: \begin{align*}
1580: M(\rho) & = - \frac{\log\bigl[ (1 - \alpha)(1 + \gamma) \bigr]}{\alpha - \gamma} \rho(r)
1581: + \frac{\ds \C{K}(\rho, \pi_{\exp[ - N \log( 1 + \gamma)r ]})
1582: - 2 \log(\epsilon)}{\ds N (\alpha - \gamma)} \\
1583: & = \frac{\ds \C{K}\bigl[ \rho, \pi_{\exp[ N\log(1 - \alpha) r]}\bigr]
1584: + \int_{N \log(1 + \gamma)}^{- N \log(1 - \alpha)} \pi_{\exp( - \xi r)}(r)
1585: d \xi - 2 \log(\epsilon)}{N (\alpha - \gamma)},
1586: \end{align*}
1587: is such that
1588: $$
1589: \rho(R) \leq \frac{\alpha - \gamma}{2 \alpha \gamma}
1590: \left( \sqrt{1+ \frac{4 \alpha \gamma}{(\alpha - \gamma)^2} \bigl\{ 1 - \exp\bigl[
1591: - (\alpha - \gamma) M(\rho) \bigr] \bigr\}}- 1 \right) \leq
M(\rho).
1593: $$
1594: \end{thm}
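The passage from the product inequality to the theorem can be sketched as
follows, writing $p = \rho(R)$ and $\delta = 1 - \exp( - \frac{B}{N})$ (two
abbreviations used only in this remark). Expanding the product gives
$$
\alpha \gamma p^2 + (\alpha - \gamma) p - \delta \leq 0,
$$
hence, when $\gamma > 0$, $p$ is not greater than the positive root of this
polynomial:
$$
p \leq \frac{ - (\alpha - \gamma) + \sqrt{ (\alpha - \gamma)^2 + 4 \alpha \gamma \delta}}{
2 \alpha \gamma}
= \frac{\alpha - \gamma}{2 \alpha \gamma} \left(
\sqrt{1 + \frac{4 \alpha \gamma}{(\alpha - \gamma)^2} \delta} - 1 \right).
$$
Since $\lambda = - N \log(1 - \alpha)$ and $\beta = N \log(1 + \gamma)$, we have
$\frac{B}{N} = (\alpha - \gamma) M(\rho)$, which gives the first inequality.
For the linear bound, $- \log(1 - \alpha p) \geq \alpha p$ and
$\log(1 + \gamma p) \leq \gamma p$ imply
$$
(\alpha - \gamma) p \leq - \log(1 - \alpha p) - \log(1 + \gamma p) \leq \frac{B}{N}
= (\alpha - \gamma) M(\rho).
$$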
1595: Using the {\em empirical dimension} $d_e$ defined by equation \eqref{eq1.1.3}
1596: on page \pageref{eq1.1.3},
1597: we can use the inequality
1598: $$
1599: \int_{\beta}^{\lambda} \pi_{\exp(- \xi r)}(r) d \xi
1600: \leq (\lambda - \beta) \ess \inf_{\pi} r + d_e \log \left( \frac{\lambda}{\beta} \right),
1601: $$
1602: to prove that
1603: \begin{multline*}
1604: M(\rho) \leq \frac{\log\bigl[ (1+\gamma)(1-\alpha) \bigr]}{\gamma - \alpha}
1605: \ess \inf_{\pi} r \\
1606: + \frac{d_e
1607: \log \left[ \frac{ - \log( 1- \alpha)}{\log(1 + \gamma)} \right]
1608: + \C{K}\bigl[ \rho, \pi_{\exp [ N \log(1 - \alpha)r]}\bigr] - 2 \log(\epsilon)}{
1609: N(\alpha - \gamma)}.
1610: \end{multline*}
1611:
Let us give a little numerical illustration: assuming that
$d_e = 10$, $N = 1000$ and $\ess \inf_{\pi} r = 0.2$ (as in the previous example),
taking $\epsilon = 0.01$,
$\alpha = 0.5$ and $\gamma = 0.1$, we obtain from
Theorem \ref{thm1.1.17} $\pi_{\exp[ N\log(1-\alpha)r]}(R) \simeq \pi_{\exp(- 693 r)}(R)
\leq 0.332\leq 0.372$, where we have given respectively the nonlinear and
the linear bound. This shows the practical interest of keeping the nonlinearity.
Let us also mention that optimizing the values of the parameters
$\alpha$ and $\gamma$ would not have yielded a significantly lower bound.
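
These two figures can be recovered by the following worked evaluation, with
$\ess \inf_{\pi} r = 0.2$ (the value used in the earlier example, and the one
that reproduces the quoted figures) and
$\rho = \pi_{\exp[N \log(1 - \alpha) r]}$, so that the entropy term vanishes.
The linear bound is
$$
M(\rho) \leq \frac{- \log(0.55)}{0.4} \times 0.2
+ \frac{10 \log \bigl( \frac{0.6931}{0.0953} \bigr) - 2 \log(0.01)}{1000 \times 0.4}
\simeq 0.2989 + \frac{19.84 + 9.21}{400} \simeq 0.3715,
$$
and the nonlinear bound, evaluated at this upper bound for $M(\rho)$, is
$$
\frac{0.4}{2 \times 0.05} \left( \sqrt{1 + \frac{0.2}{0.16} \bigl[ 1
- \exp( - 0.4 \times 0.3715) \bigr]} - 1 \right)
\simeq 4 \bigl( \sqrt{1.1726} - 1 \bigr) \simeq 0.3315.
$$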
1620:
The following corollary is obtained by taking $\lambda = 2 \beta$ and
keeping only the linear bound; we give it for the sake of its simplicity:
1623: \begin{cor}\mypoint
1624: For any positive real constant $\beta$ such that
1625: \hfill $\exp(\frac{\beta}{N})
1626: + \exp( - \frac{2 \beta}{N}) < 2$, which is the case when $\beta < 0.48 N$,
1627: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution
1628: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
1629: \begin{multline*}
1630: \rho(R) \leq \frac{ \beta \rho(r) + \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)}\bigr]
1631: - 2 \log(\epsilon)}{N \bigl[ 2 - \exp\bigl( \frac{\beta}{N}\bigr) -
1632: \exp \bigl( - \frac{2 \beta}{N} \bigr) \bigr]}
1633: \\ = \frac{
1634: \int_{\beta}^{2 \beta}
1635: \pi_{\exp( - \xi r)}(r) d \xi + \C{K}\bigl[ \rho, \pi_{\exp( - 2 \beta r)}\bigr] - 2 \log(\epsilon)}{
1636: N \bigl[ 2 - \exp( \frac{\beta}{N}) - \exp( - \frac{2 \beta}{N}) \bigr]}.
1637: \end{multline*}
1638: \end{cor}
1639: Let us mention that this corollary applied to the above numerical example
1640: gives $\pi_{\exp(-200 r)}(R) \leq 0.475$ (when we take $\beta = 100$, consistently
1641: with the choice $\gamma = 0.1$).
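As a check of this figure (a worked evaluation, assuming as in the previous
example $d_e = 10$, $N = 1000$, $\epsilon = 0.01$, $\ess \inf_{\pi} r = 0.2$
and $\rho = \pi_{\exp( - 2 \beta r)}$ with $\beta = 100$), the integral term is at most
$\beta \ess \inf_{\pi} r + d_e \log(2) \simeq 20 + 6.93$, so that
$$
\frac{20 + 6.93 + 9.21}{1000 \bigl[ 2 - \exp(0.1) - \exp( - 0.2) \bigr]}
\simeq \frac{36.14}{1000 \times 0.0761} \simeq 0.475.
$$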
1642:
1643: \subsubsection{Partially local bounds}
1644:
1645: Local bounds are suitable when the lowest values of the empirical
1646: error rate $r$ are reached only on a small part of the parameter
1647: set $\Theta$. When $\Theta$ is the disjoint union of submodels
1648: of different complexities, the minimum of $r$ will as a rule
1649: not be ``localized'' in a way that calls for the use of
1650: local bounds. Just think for instance of the case when
1651: $\Theta = \bigsqcup_{m=1}^M \Theta_m$, where the sets $\Theta_1 \subset
1652: \Theta_2 \subset \dots \subset \Theta_M$ are nested.
1653: In this case we will have $\inf_{\Theta_1} r \geq \inf_{\Theta_2} r
1654: \geq \dots \geq \inf_{\Theta_M} r$, although $\Theta_M$ may be
1655: too large to be the right model to use. In this situation, we
1656: do not want to localize the bound completely. Let us make a
more specific, fanciful but typical, pseudo computation.
1658: Just imagine we have a countable collection $(\Theta_m)_{m \in M}$ of submodels.
1659: Let us assume we are interested in choosing between the
1660: estimators $\wtheta_m \in \arg\min_{\Theta_m} r$,
1661: maybe randomizing them (e.g. replacing them
1662: with $\pi^m_{\exp( - \lambda r)}$). Let us
1663: imagine moreover that we are in a typically parametric
1664: situation, where, for some priors $\pi^m \in \C{M}_+^1(\Theta_m)$,
1665: $m \in M$, there is a ``dimension'' $d_m$ such that
1666: $\lambda \bigl[ \pi^m_{\exp( - \lambda r)}(r) - r(\wtheta_m)
1667: \bigr] \simeq d_m$. Let $\mu \in \C{M}_+^1(M)$ be some distribution
1668: on the index set $M$.
1669: It is easy to see that $(\mu \pi)_{\exp( - \lambda r)}$ will
1670: typically not be properly local, in the sense that
1671: typically
1672: \begin{multline*}
1673: (\mu \pi)_{\exp( - \lambda r)}(r) =
1674: \frac{\ds \mu \Bigl\{ \pi_{\exp( - \lambda r)}(r) \pi \bigl[ \exp( - \lambda r) \bigr]
1675: \Bigr\}}{
1676: \mu \Bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr] \Bigr\}
1677: } \\ \simeq
1678: \frac{\ds \sum_{m \in M}
1679: \bigl[ (\inf_{\Theta_m} r) + \tfrac{d_m}{\lambda} \bigr] \exp \bigl[ - \lambda
1680: (\inf_{\Theta_m} r) - d_m \log\bigl(\tfrac{e \lambda}{d_m}\bigr) \bigr]
1681: \mu(m)}{\ds
1682: \sum_{m \in M} \exp \Bigl[ - \lambda (\inf_{\Theta_m} r) - d_m \log \bigl(\tfrac{e
1683: \lambda}{d_m}
1684: \bigr) \Bigr] \mu(m)}
1685: \\ \simeq \biggl\{ \inf_{m \in M} (\inf_{\Theta_m} r) + \tfrac{d_m}{\lambda}
1686: \log \bigl(
1687: \tfrac{e \lambda}{d_m \mu(m)}\bigr) \biggr\} \\ + \log
1688: \biggl\{ \sum_{m \in M}
\exp \bigl[ - d_m \log(\tfrac{\lambda}{d_m})\bigr] \mu(m)\biggr\},
1690: \end{multline*}
1691: where we have used the estimate
1692: \begin{multline*}
1693: - \log \Bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr]
1694: \Bigr\} = \int_0^{\lambda} \pi_{\exp( - \beta r)}(r) d \beta
1695: \\ \simeq \int_0^{\lambda } (\inf_{\Theta_m} r) + \bigl[
1696: \tfrac{d_m}{\beta} \wedge 1 \bigr]
1697: d \beta \simeq \lambda (\inf_{\Theta_m} r) + d_m
1698: \bigl[ \log \bigl( \tfrac{\lambda}{d_m} \bigr) + 1 \bigr].
1699: \end{multline*}
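The last step of this estimate is an elementary integral, spelled out here
under the assumption $\lambda \geq d_m$:
$$
\int_0^{\lambda} \bigl[ \tfrac{d_m}{\beta} \wedge 1 \bigr] d \beta
= \int_0^{d_m} d \beta + \int_{d_m}^{\lambda} \frac{d_m}{\beta} d \beta
= d_m + d_m \log \bigl( \tfrac{\lambda}{d_m} \bigr)
= d_m \log \bigl( \tfrac{e \lambda}{d_m} \bigr).
$$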
Our approximations do not pretend to be rigorous or
very accurate, but they nevertheless give the best order
of magnitude we can expect in typical situations, and
show that this order of magnitude is not what we are
looking for: mixing different models with the help
of $\mu$ spoils the localization, introducing a multiplier
$\log \bigl( \tfrac{\lambda}{d_m} \bigr)$ to the dimension
$d_m$ which is precisely what we would have got if we had
not localized the bound at all. What we would
1709: really like to do in such situations is to use a {\em partially
1710: localized} posterior distribution, such as
1711: $\mu^{\widehat{m}}_{\exp( - \lambda r)}$, where
1712: $\widehat{m}$ is an estimator of the best submodel
1713: to be used. While the most straightforward way to
1714: do this is to use a union bound on results obtained
1715: for each submodel $\Theta_m$, we are going here
1716: to show how to allow arbitrary posterior distributions
1717: on the index set (corresponding to a randomization of
1718: the choice of $\widehat{m}$).
1719:
1720: Let us consider the framework we just mentioned: let the
1721: measurable parameter
1722: set $(\Theta, \C{T})$ be a disjoint union of measurable submodels,
1723: $\Theta = \bigsqcup_{m \in M} \Theta_m$. Let the index set $(M, \C{M})$ be
1724: some measurable space (most of the time it will be a countable set).
1725: Let $\mu \in \C{M}_+^1(M)$ be a prior probability distribution on
1726: $(M, \C{M})$. Let $\pi : M \rightarrow \C{M}_+^1(\Theta)$ be a regular
1727: conditional probability measure such that $\pi(m,\Theta_m) = 1$,
1728: for any $m \in M$.
1729: Let $\mu \pi \in \C{M}_+^1(M \times \Theta)$ be the product probability
1730: measure defined by
1731: $\mu\pi(h) = \int_{m \in M} \left( \int_{\theta \in \Theta} h(m,\theta)
1732: \pi(m, d \theta) \right) \mu(dm)$, for any bounded measurable
1733: function $h : M \times \Theta \rightarrow \RR$.
1734: Let $\pi_{\exp(h)} \in \C{M}_+(M \times \Theta)$ be the regular
conditional probability measure defined by
1736: $$
1737: \frac{d \pi_{\exp(h)}}{d \pi} (m, \theta) = \frac{ \exp\bigl[ h(\theta) \bigr]}{
1738: \pi \bigl[ m, \exp(h) \bigr]},
1739: $$
1740: where consistently with previous notations $\pi(m,h) = \int_{\Theta}
1741: h(m,\theta) \pi(m, d \theta)$ (we will also often use the less explicit
1742: notation $\pi(h)$).
For short, let
1744: $$
1745: U(\theta, \omega) = \lambda \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] -
1746: \beta \Phi_{- \frac{\beta}{N}}\bigl[ R(\theta) \bigr] - (\lambda - \beta) r
1747: (\theta, \omega).
1748: $$
1749: Integrating with respect to $\mu$ equation \eqref{eq1.1.11Bis} on page \pageref{eq1.1.11Bis},
1750: written in each submodel $\Theta_m$ using the prior distribution $\pi(m, \cdot)$,
1751: we see that
1752: \begin{multline*}
1753: \PP \biggl\{ \exp \biggl[
1754: \sup_{\nu \in \C{M}_+^1(M)} \sup_{\rho : M \rightarrow \C{M}_+^1(\Theta)}
1755: \frac{1}{2} \Bigl[ (\nu \rho)(U) - \nu \bigl\{
\C{K}\bigl[ \rho, \pi_{\exp( - \beta r)}\bigr] \bigr\} \Bigr] - \C{K}(\nu,\mu)
1757: \biggr] \biggr\}
1758: \\ \leq
1759: \PP \biggl\{ \exp \biggl[
1760: \sup_{\nu \in \C{M}_+^1(M)} \frac{1}{2} \nu \biggl( \sup_{\rho : M \rightarrow \C{M}_+^1(\Theta)}
1761: \rho(U) - \C{K}(\rho, \pi_{\exp( - \beta r)}) \biggr)
1762: - \C{K}(\nu, \mu) \biggr] \biggr\}
1763: \\ =
1764: \PP \biggl\{ \mu \biggl[ \exp \Bigl\{ \tfrac{1}{2} \sup_{\rho : M \rightarrow
1765: \C{M}_+^1(\Theta)} \Bigl[ \rho(U) - \C{K} \bigl[ \rho, \pi_{\exp( - \beta r)}\bigr]
1766: \Bigr] \Bigr\} \biggr] \biggr\}\\
1767: = \mu \biggl\{ \PP \biggl[ \exp \Bigl\{ \tfrac{1}{2} \sup_{\rho : M \rightarrow
1768: \C{M}_+^1(\Theta)} \Bigl[ \rho(U) - \C{K} \bigl[ \rho, \pi_{\exp( - \beta r)}\bigr]
1769: \Bigr] \Bigr\} \biggr] \biggr\} \leq 1.
1770: \end{multline*}
1771: This proves that
1772: \begin{multline}
1773: \label{eq1.1.10}
1774: \PP \Biggl\{ \exp \Biggl[ \frac{1}{2}
1775: \sup_{\nu \in \C{M}_+^1(M)} \sup_{\rho:M\rightarrow \C{M}_+^1(\Theta)}
1776: \lambda \Phi_{\frac{\lambda}{N}} \bigl[\nu \rho(R) \bigr]
1777: - \beta \Phi_{-\frac{\beta}{N}} \bigl[ \nu \rho(R) \bigr]
1778: \\ -(\lambda - \beta) \nu \rho(r) - 2 \C{K}(\nu,\mu) - \nu \bigl\{
1779: \C{K} \bigl[ \rho,
1780: \pi_{\exp( - \beta r)}\bigr] \bigr\} \Biggr] \Biggr\} \leq 1.
1781: \end{multline}
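The chain of inequalities leading to this bound relies on two standard facts,
recalled here as a sketch. The first one is the Legendre duality between the
Kullback divergence and the log-Laplace transform: for any measurable
function $g : M \rightarrow \RR$ bounded from above,
$$
\sup_{\nu \in \C{M}_+^1(M)} \bigl\{ \nu(g) - \C{K}(\nu, \mu) \bigr\}
= \log \bigl\{ \mu \bigl[ \exp(g) \bigr] \bigr\},
$$
which is applied to
$g(m) = \tfrac{1}{2} \sup_{\rho} \bigl\{ \rho(U) - \C{K}\bigl[ \rho,
\pi_{\exp( - \beta r)} \bigr] \bigr\}$. The second one is the convexity of
$$
p \mapsto \lambda \Phi_{\frac{\lambda}{N}}(p) - \beta \Phi_{- \frac{\beta}{N}}(p)
= - N \log \Bigl\{ 1 - \bigl[ 1 - \exp\bigl(- \tfrac{\lambda}{N}\bigr)\bigr] p \Bigr\}
- N \log \Bigl\{ 1 + \bigl[\exp\bigl( \tfrac{\beta}{N} \bigr) - 1\bigr] p \Bigr\},
$$
which, by Jensen's inequality, allows us to replace
$\nu \rho \bigl( \lambda \Phi_{\frac{\lambda}{N}}\!\circ\!R
- \beta \Phi_{- \frac{\beta}{N}}\!\circ\!R \bigr)$ with the smaller quantity
$\lambda \Phi_{\frac{\lambda}{N}} \bigl[ \nu \rho(R) \bigr]
- \beta \Phi_{- \frac{\beta}{N}} \bigl[ \nu \rho(R) \bigr]$.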
1782: \newcommand{\sR}{R^{\star}}
1783: \newcommand{\sr}{r^{\star}}
1784: \newcommand{\stheta}{\theta^{\star}}
1785: Introducing the optimal value of $r$ on each submodel
1786: $\sr(m) = \ess \inf_{\pi(m,\cdot)} r$ and the empirical dimensions
1787: $$
1788: d_e(m) = \sup_{\xi \in \RR_+} \xi \bigl[
1789: \pi_{\exp( - \xi r)}(m,r) - \sr(m) \bigr],
1790: $$
1791: we can thus state
1792: \begin{thm}
1793: \label{thm1.1.20}
1794: \mypoint
1795: For any positive real constants $\beta < \lambda$,
1796: with $\PP$ probability at least $1 - \epsilon$,
1797: for any posterior distribution $\nu : \Omega \rightarrow \C{M}_+^1(M)$,
1798: for any conditional posterior distribution $\rho : \Omega \times
1799: M \rightarrow \C{M}_+^1(\Theta)$,
1800: $$
1801: \lambda \Phi_{\frac{\lambda}{N}} \bigl[ \nu \rho(R) \bigr]
1802: - \beta \Phi_{-\frac{\beta}{N}} \bigl[ \nu \rho(R) \bigr]
1803: \leq B_1(\nu, \rho),
1804: $$
1805: \begin{multline*}
1806: \text{where } B_1(\nu, \rho) =
1807: (\lambda - \beta) \nu \rho(r) + 2\C{K}(\nu,\mu)+
1808: \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)} \bigr] \bigr\} - 2
1809: \log(\epsilon)\\
1810: = \nu \biggl[ \int_{\beta}^{\lambda}
1811: \pi_{\exp ( - \alpha r)}(r) d\alpha \biggr] + 2 \C{K}(\nu, \mu)
1812: + \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr] \bigr\}
1813: - 2 \log(\epsilon)
1814: \\
= - 2 \log \biggl\{ \mu \biggl[ \exp \biggl( - \frac{1}{2}
1816: \int_{\beta}^{\lambda} \pi_{\exp( - \alpha r)}(r) d \alpha \biggr)
1817: \biggr] \biggr\} \\
1818: \shoveright{+ 2 \C{K}\bigl[ \nu, \mu_{\left(\frac{\pi[\exp(-\lambda r)]}{
1819: \pi[\exp(-\beta r)]}\right)^{1/2}}\bigr] + \nu \bigl\{ \C{K}\bigl[
1820: \rho, \pi_{\exp( - \lambda r)} \bigr] \bigr\} - 2 \log(\epsilon),}\\
1821: \shoveleft{\text{and therefore }
1822: B_1(\nu,\rho) \leq \nu \Bigl[ (\lambda - \beta) \sr + \log \Bigl( \tfrac{\lambda}{\beta}
1823: \Bigr) d_e
1824: \Bigr] + 2 \C{K}(\nu, \mu)} \\\shoveright{ + \nu \bigl\{ \C{K} \bigl[
1825: \rho, \pi_{\exp( - \lambda r)} \bigr] \bigr\} - 2 \log(\epsilon),}
\\\shoveleft{\text{as well as }
B_1(\nu, \rho) \leq - 2 \log \biggl\{ \mu \biggl[
\exp \biggl( - \frac{\lambda - \beta}{2} \sr - \frac{1}{2}
\log \Bigl( \tfrac{\lambda}{\beta} \Bigr) d_e \biggr) \biggr] \biggr\}
}\\+ 2 \C{K} \bigl[ \nu, \mu_{\left(\frac{\pi[\exp( -
\lambda r)]}{\pi[\exp( - \beta r)]}\right)^{1/2}}
\bigr] + \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr] \bigr\}
- 2 \log(\epsilon).
1834: \end{multline*}
1835: Thus, for any real constants $\alpha$ and $\gamma$ such that
1836: $0 \leq \gamma < \alpha < 1$, with $\PP$ probability
1837: at least $1 - \epsilon$, for any posterior distribution
1838: $\nu : \Omega \rightarrow \C{M}_+^1(M)$ and any conditional posterior
1839: distribution $\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,
1840: the bound
1841: \begin{multline*}
1842: B_2(\nu,\rho) = - \tfrac{\log \bigl[ (1 - \alpha)(1 + \gamma)\bigr]}{\alpha-\gamma}
1843: \nu\rho(r) + \tfrac{ 2 \C{K}(\nu,\mu) + \nu \bigl\{ \C{K}\bigl[
1844: \rho, \pi_{(1 + \gamma)^{-Nr}}\bigr] \bigr\} - 2 \log(\epsilon)}{N (\alpha - \gamma)}
1845: \\ = \tfrac{
1846: 2 \C{K}\bigl[ \nu, \mu_{\left( \frac{ \pi [ (1 -\alpha)^{Nr}]}{\pi [ (1 + \gamma)^{-N
1847: r}]}\right)^{1/2}} \bigr]
1848: + \nu \bigl\{ \C{K}\bigl[\rho, \pi_{(1 - \alpha)^{Nr}}\bigr] \bigr\}}{
1849: N(\alpha - \gamma)} \\ - \tfrac{
1850: 2 \log \Bigl\{ \mu \Bigl[ \exp \biggl[ - \frac{1}{2}
1851: \int_{N \log(1 + \gamma)}^{- N \log(1 - \alpha)} \pi_{\exp( - \xi r)}(\cdot,r) d \xi
1852: \bigr] \Bigr] \Bigr\}
1853: + 2 \log(\epsilon)}{
1854: N(\alpha - \gamma)}
1855: \end{multline*}
1856: satisfies
1857: $$
1858: \nu \rho(R) \leq \frac{\alpha - \gamma}{2 \alpha \gamma}
\left( \sqrt{1 + \frac{4 \alpha \gamma}{(\alpha - \gamma)^2} \Bigl\{
1 - \exp \bigl[ - (\alpha - \gamma) B_2(\nu,\rho) \bigr] \Bigr\}} - 1
\right) \leq B_2(\nu,\rho).
1862: $$
1863: \end{thm}
1864: Let us remark that in the case when $\nu = \mu_{\left( \frac{
1865: \pi[(1 - \alpha)^{Nr}]}{\pi[(1 + \gamma)^{-Nr}]} \right)^{1/2}}$
1866: and $\rho = \pi_{(1-\alpha)^{Nr}}$,
1867: we get as desired a bound that is adaptively local in all the $\Theta_m$
1868: (at least when $M$ is countable and $\mu$ is atomic):
1869: \begin{multline*}
B_2(\nu,\rho) \leq - \tfrac{2}{N(\alpha - \gamma)}
1871: \log \Biggl\{ \mu \biggl\{
1872: \exp \biggl[ \tfrac{N}{2} \log\bigl[(1+\gamma)(1 - \alpha)\bigr]
1873: \sr \\\shoveright{ - \log \left( \tfrac{-\log(1-\alpha)}{\log(1 + \gamma)}
1874: \right) \tfrac{d_e}{2} \biggr] \biggr\} \Biggr\}
1875: - \frac{2 \log(\epsilon)}{N(\alpha - \gamma)}\qquad}
1876: \\\shoveleft{\qquad \qquad \leq \inf_{m \in M} \biggl\{
1877: - \tfrac{\log\bigl[ (1- \alpha)(1+\gamma)\bigr]}{\alpha
1878: -\gamma} \sr(m)} \\ +
1879: \log \left( \tfrac{- \log(1 - \alpha)}{\log(1 + \gamma)}\right)
1880: \tfrac{d_e(m)}{N(\alpha - \gamma)} -
1881: 2 \tfrac{\log\bigl[\epsilon \mu(m) \bigr]}{N(\alpha - \gamma)} \biggr\}.
1882: \end{multline*}
1883: The penalization by the {\em empirical dimension} $d_e(m)$ in each submodel
1884: is as desired linear in $d_e(m)$. Non random partially local bounds could
1885: be obtained in a way that is easy to imagine. We leave this investigation
1886: to the reader.
1887:
1888: \subsubsection{Two step localization}
1889:
We have seen that the choice of the posterior
distribution $\nu$ on the index set which optimizes the bound of
Theorem \ref{thm1.1.20} (page \pageref{thm1.1.20}) is such that
1893: $$
1894: \frac{d\nu}{d \mu}(m) \sim
1895: \left( \frac{\pi \bigl[ \exp\bigl( - \lambda r(m, \cdot) \bigr) \bigr]}{\pi
1896: \bigl[ \exp\bigl( - \beta r(m,\cdot) \bigr) \bigr]}\right)^{\frac{1}{2}}
1897: = \exp \biggl[ - \frac{1}{2} \int_{\beta}^{\lambda}
1898: \pi_{\exp( - \alpha r)}(m,r) d \alpha \biggr].
1899: $$
1900: \newcommand{\ov}[1]{\overline{#1}}
This suggests replacing the prior distribution $\mu$ with $\ov{\mu}$
1902: defined by its density
1903: \begin{multline}
1904: \label{eq1.13}
1905: \frac{d \ov{\mu}}{d \mu} (m) = \frac{ \exp \bigl[ - h(m) \bigr]}{\mu
1906: \bigl[ \exp( - h ) \bigr]},
1907: \\ \text{ where }
h(m) = \xi \int_{\beta}^{\gamma} \pi_{\exp( - \alpha \Phi_{- \frac{\eta}{N}}
1909: \circ R)} \bigl[ \Phi_{- \frac{\eta}{N}}\!\circ\!R(m, \cdot) \bigr] d \alpha.
1910: \end{multline}
1911: The use of $\Phi_{- \frac{\eta}{N}}\!\circ\!R$ instead of $R$ is motivated
1912: by technical reasons which will appear in subsequent computations.
1913: Indeed, we will need to bound
1914: $$
\nu \biggl[ \int_{\beta}^{\gamma} \pi_{\exp ( - \alpha
1916: \Phi_{- \frac{\eta}{N}} \circ R)} \bigl(
1917: \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) d \alpha \biggr]
1918: $$
1919: in order to handle $\C{K}(\nu, \ov{\mu})$.
1920: In the spirit of equation (\ref{eq1.1.4}, page \pageref{eq1.1.4}),
1921: starting back from Theorem \ref{thm2.3} (page \pageref{thm2.3}),
1922: applied in each submodel $\Theta_m$ to the prior
1923: distribution $\pi_{\exp( - \gamma \Phi_{-\frac{\eta}{N}} \circ
1924: R )}$ and integrated with respect to
1925: $\ov{\mu}$, we see that for any
1926: positive real constants $\lambda$, $\gamma$ and $\eta$,
1927: with $\PP$ probability at least $1 - \epsilon$,
1928: for any posterior distribution $\nu : \Omega \rightarrow \C{M}_+^1(M)$ on the index set
1929: and any conditional posterior distribution $\rho : \Omega \times M \rightarrow
1930: \C{M}_+^1(\Theta)$,
1931: \begin{multline}
1932: \label{eq1.1.13}
1933: \nu \rho \bigl( \lambda \Phi_{\frac{\lambda}{N}}\!\circ\!R - \gamma
1934: \Phi_{-\frac{\eta}{N}}\!\circ\!R \bigr) \leq \lambda \nu \rho(r) \\ +
1935: \nu \C{K}(\rho, \pi)
1936: + \C{K}(\nu, \ov{\mu}) +
1937: \nu \Bigl\{ \log \Bigl[ \pi \bigl[ \exp \bigl(
1938: - \gamma \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) \bigr] \Bigr] \Bigr\} -
1939: \log(\epsilon).
1940: \end{multline}
Since $x \mapsto f(x) \overset{\text{\rm def}}{=}
\lambda \Phi_{\frac{\lambda}{N}}(x)
- \gamma \Phi_{- \frac{\eta}{N}}(x)$ is a convex function
vanishing at $x = 0$, it is such
that
$$
f(x) \geq x f'(0)= x N \Bigl\{
\bigl[1 - \exp( - \tfrac{\lambda}{N}) \bigr] - \tfrac{\gamma}{\eta}
\bigl[ \exp( \tfrac{\eta}{N}) - 1 \bigr] \Bigr\}.
1949: $$
1950: Thus if we put
1951: \begin{equation}
1952: \label{eq1.14}
1953: \gamma = \frac{\eta \bigl[ 1 - \exp (- \frac{\lambda}{N}) \bigr]}{\exp(
1954: \frac{\eta}{N}) - 1},
1955: \end{equation}
1956: we obtain that $f(x) \geq 0$, $x \in \RR$, and therefore that
1957: the left-hand side of equation \eqref{eq1.1.13} is non negative.
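Condition \eqref{eq1.14} can equally be read as a definition of $\lambda$
in terms of $\gamma$ and $\eta$. As a short check (a mere rewriting, which introduces
nothing new), it is equivalent to
$$
1 - \exp \bigl( - \tfrac{\lambda}{N} \bigr) = \tfrac{\gamma}{\eta}
\bigl[ \exp \bigl( \tfrac{\eta}{N} \bigr) - 1 \bigr],
$$
and solving for $\lambda$ gives
$$
\lambda = - N \log \Bigl\{ 1 - \tfrac{\gamma}{\eta} \bigl[
\exp \bigl( \tfrac{\eta}{N} \bigr) - 1 \bigr] \Bigr\},
$$
which is a well defined positive real number, for positive $\gamma$ and $\eta$,
if and only if
$\gamma < \eta \bigl[ \exp ( \frac{\eta}{N}) - 1 \bigr]^{-1}$.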
1958: We can moreover introduce the prior conditional distribution $\ov{\pi}$ defined
1959: by
1960: $$
1961: \frac{d \ov{\pi}}{d \pi}(m, \theta) =
1962: \frac{ \exp \bigl[ - \beta \Phi_{- \frac{\eta}{N}} \circ R(\theta) \bigr]}{
1963: \pi \bigl\{m, \exp \bigl[ - \beta \Phi_{- \frac{\eta}{N}} \circ R \bigr] \bigr\}}.
1964: $$
1965: With $\PP$ probability at least $1 - \epsilon$, for any posterior distributions
$\nu : \Omega \rightarrow \C{M}_+^1(M)$ and $\rho: \Omega \times M \rightarrow
1967: \C{M}_+^1(\Theta)$,
1968: \begin{multline*}
1969: \beta \nu \rho(r) + \nu \bigl[ \C{K}( \rho, \pi) \bigr] =
1970: \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp (- \beta r)} \bigr] \bigr\} -
1971: \nu \biggl[ \log \Bigl\{ \pi \bigl[ \exp ( - \beta r) \bigr] \Bigr\} \biggr]
1972: \\ \leq \nu \bigl\{ \C{K} \bigl[ \rho, \pi_{\exp( - \beta r)} \bigr] \bigr\}
1973: + \beta \nu \ov{\pi} (r) + \nu \bigl[ \C{K}(\ov{\pi}, \pi) \bigr] \\
1974: \leq \nu \bigl\{ \C{K} \bigl[ \rho, \pi_{\exp ( - \beta r)} \bigr] \bigr\}
1975: + \beta \nu \ov{\pi} \bigl( \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr)
1976: \\\shoveright{+ \tfrac{\beta}{\eta} \bigl[ \C{K}(\nu, \ov{\mu})- \log(\epsilon) \bigr]
1977: + \nu \bigl[ \C{K}(\ov{\pi}, \pi) \bigr] \qquad}
1978: \\\shoveleft{\qquad
1979: = \nu \bigl\{ \C{K} \bigl[ \rho, \pi_{\exp ( - \beta r)} \bigr] \bigr\}
1980: - \nu \Bigl\{ \log \Bigl[ \pi \bigl[ \exp \bigl( -
1981: \beta \Phi_{-\frac{\eta}{N}}\!\circ\!R \bigr) \bigr] \Bigr] \Bigr\}}
1982: \\ + \tfrac{\beta}{\eta} \bigl[ \C{K}(\nu, \ov{\mu}) - \log(\epsilon) \bigr].
1983: \end{multline*}
1984: Thus, coming back to equation \eqref{eq1.1.13}, we see that under condition
1985: \eqref{eq1.14},
1986: with $\PP$ probability at least $1 - \epsilon$,
1987: \begin{multline*}
1988: 0 \leq (\lambda - \beta) \nu \rho(r) + \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{
1989: \exp( - \beta r)}\bigr] \bigr\} \\ - \nu \biggl[
1990: \int_{\beta}^{\gamma} \pi_{\exp( - \alpha \Phi_{- \frac{\eta}{N}} \circ R)}
1991: \bigl( \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) d \alpha \biggr]
1992: + (1 + \tfrac{\beta}{\eta}) \bigl[ \C{K}(\nu, \ov{\mu}) + \log(\tfrac{2}{\epsilon})
1993: \bigr].
1994: \end{multline*}
1995: Noticing moreover that
1996: \begin{multline*}
1997: (\lambda - \beta) \nu \rho(r) + \nu \bigl\{ \C{K} \bigl[
1998: \rho, \pi_{\exp( - \beta r)}\bigr] \bigr\} \\ =
1999: \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp ( - \lambda r)}\bigr] \bigr\}
2000: + \nu \biggl[ \int_{\beta}^{\lambda} \pi_{\exp( - \alpha r)}(r) d \alpha \biggr],
2001: \end{multline*}
2002: and choosing $\rho = \pi_{\exp( - \lambda r)}$, we have proved
2003: \begin{thm}
2004: For any positive real constants $\beta$, $\gamma$ and $\eta$, such that
2005: \linebreak $\gamma < \eta \bigl[ \exp( \frac{\eta}{N}) - 1 \bigr]^{-1}$, defining
2006: $\lambda$ by condition \eqref{eq1.14}, so that \linebreak
2007: $\lambda = - N \log \Bigl\{ 1 - \frac{\gamma}{\eta} \bigl[ \exp(
2008: \frac{\eta}{N}) - 1 \bigr] \Bigr\}$,
2009: with $\PP$ probability at least $1 - \epsilon$,
2010: for any posterior distribution $\nu : \Omega \rightarrow \C{M}_+^1(M)$,
2011: any conditional posterior distribution $\rho: \Omega \times M
2012: \rightarrow \C{M}_+^1(\Theta)$,
2013: \begin{multline*}
2014: \nu \biggl[ \int_{\beta}^{\gamma}
2015: \pi_{\exp( - \alpha \Phi_{- \frac{\eta}{N}}\circ R)}
2016: \bigl( \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) d \alpha \biggr]
2017: \\ \leq \nu \biggl[ \int_{\beta}^{\lambda} \pi_{\exp( - \alpha r)}(r)
2018: d \alpha \biggr] + \bigl( 1 + \tfrac{\beta}{\eta} \bigr)
2019: \bigl[ \C{K}(\nu, \ov{\mu}) + \log\bigl(\tfrac{2}{\epsilon}\bigr) \bigr].
2020: \end{multline*}
2021: \end{thm}
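The identity used just before the theorem can be checked as follows. This is
only a sketch, written in a fixed submodel and assuming that $\C{K}(\rho, \pi)$
is finite and that differentiation under the integral sign is legitimate (which should
be harmless here, since $r$ takes its values in $[0,1]$). From the definition of
the Gibbs distribution $\pi_{\exp( - \alpha r)}$,
$$
\C{K}\bigl[ \rho, \pi_{\exp( - \alpha r)} \bigr] = \alpha \rho(r) + \C{K}(\rho, \pi)
+ \log \Bigl\{ \pi \bigl[ \exp( - \alpha r) \bigr] \Bigr\},
\qquad
\frac{d}{d \alpha} \log \Bigl\{ \pi \bigl[ \exp ( - \alpha r) \bigr] \Bigr\}
= - \pi_{\exp( - \alpha r)}(r),
$$
so that
\begin{multline*}
\C{K}\bigl[ \rho, \pi_{\exp( - \beta r)} \bigr]
- \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr]
\\ = - (\lambda - \beta) \rho(r) +
\log \Bigl\{ \pi \bigl[ \exp( - \beta r) \bigr] \Bigr\}
- \log \Bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr] \Bigr\}
\\ = - (\lambda - \beta) \rho(r) + \int_{\beta}^{\lambda}
\pi_{\exp( - \alpha r)}(r) d \alpha.
\end{multline*}
Integrating with respect to $\nu$ gives the identity in question.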
2022: Let us remark that this theorem does not require that $\beta < \gamma$,
2023: and thus provides both an upper and a lower bound for the quantity of
2024: interest:
2025: \begin{cor}
2026: For any positive real constants $\beta$, $\gamma$ and $\eta$
2027: such that
2028: $\max \{ \beta, \gamma \} < \eta \bigl[ \exp(\frac{\eta}{N}) - 1 \bigr]^{-1}$,
2029: with $\PP$ probability at least $1- \epsilon$, for any posterior distributions
2030: $\nu : \Omega \rightarrow \C{M}_+^1(M)$ and $\rho: \Omega \times M \rightarrow
2031: \C{M}_+^1(\Theta)$,
2032: \begin{multline*}
\nu \biggl[ \int_{- N \log \{ 1 - \frac{\beta}{\eta} [
2034: \exp (\frac{\eta}{N}) -1 ] \}}^{\gamma} \pi_{\exp( - \alpha r)}(r) d \alpha \biggr]
2035: - \bigl( 1 + \tfrac{\gamma}{\eta} \bigr)\bigl[ \C{K}(\nu, \ov{\mu}) +
2036: \log \bigl( \tfrac{3}{\epsilon} \bigr) \bigr]
2037: \\ \shoveleft{\qquad \leq \nu \biggl[ \int_{\beta}^{\gamma} \pi_{\exp( - \alpha
2038: \Phi_{- \frac{\eta}{N}}\circ R)} \bigl(
2039: \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) d \alpha \biggr] }
2040: \\ \leq \nu \biggl[ \int_{\beta}^{- N \log \{ 1 - \frac{\gamma}{\eta}
2041: [ \exp(\frac{\eta}{N})-1 ] \}}
2042: \pi_{\exp( - \alpha r)}(r) d \alpha \biggr]
2043: \\ + \bigl( 1 + \tfrac{\beta}{\eta} \bigr) \bigl[
2044: \C{K}(\nu, \ov{\mu}) + \log \bigl( \tfrac{3}{\epsilon} \bigr) \bigr].
2045: \end{multline*}
2046: \end{cor}
2047: We can then remember that
2048: $$
2049: \C{K}(\nu, \ov{\mu}) = \xi \bigl( \nu - \ov{\mu} \bigr) \biggl[ \int_{\beta}^{\gamma}
2050: \pi_{\exp( - \alpha \Phi_{- \frac{\eta}{N}}\circ R)} \bigl(
2051: \Phi_{- \frac{\eta}{N}}\!\circ\!R \bigr) d \alpha \biggr] + \C{K}(\nu, \mu) -
2052: \C{K}(\ov{\mu}, \mu),
2053: $$
2054: to conclude that, putting
2055: \begin{equation}
2056: \label{eq1.16}
2057: G_{\eta}(\alpha) =
-N \log \bigl\{ 1 - \frac{\alpha}{\eta} \bigl[
\exp \bigl( \frac{\eta}{N} \bigr) - 1 \bigr] \bigr\} \geq \alpha, \qquad \alpha \in \RR_+,
2060: \end{equation}
2061: and
2062: \begin{equation}
2063: \label{eq1.15}
2064: \frac{d \w{\nu}}{d \mu} (m) \overset{\text{\rm def}}{=}
2065: \frac{\exp \bigl[ - h(m) \bigr]}{\mu \bigl[ \exp( - h)\bigr]}
2066: \text{ where }
2067: h(m) = \xi \int_{G_{\eta}(\beta)}^{\gamma} \pi_{\exp( - \alpha r)}(m, r) d \alpha,
2068: \end{equation}
2069: the divergence of $\nu$ with respect to the local prior $\ov{\mu}$ is bounded by
2070: \begin{multline*}
2071: \bigl[ 1 - \xi \bigl( 1 + \tfrac{\beta}{\eta} \bigr) \bigr]
2072: \C{K}(\nu, \ov{\mu}) \\
2073: \shoveleft{\qquad \leq \xi \nu \biggl[ \int_{\beta}^{
2074: G_{\eta}(\gamma)}
2075: \pi_{\exp( - \alpha r)}(r) d \alpha \biggr]
2076: - \xi \ov{\mu} \biggl[ \int_{G_{\eta}(\beta)}^{\gamma} \pi_{\exp( - \alpha r)}(r)
2077: d \alpha \biggr]} \\ \shoveright{+ \C{K}(\nu, \mu)
2078: - \C{K}(\ov{\mu}, \mu)
2079: + \xi \bigl( 2 +
2080: \tfrac{\beta + \gamma}{\eta} \bigr)
2081: \log\bigl(\tfrac{3}{\epsilon}\bigr)} \\
2082: \shoveleft{\qquad \leq \xi \nu \biggl[ \int_{\beta}^{G_{\eta}(\gamma)} \pi_{\exp( - \alpha r)}(r)
2083: d \alpha \biggr] + \C{K}(\nu, \mu)} \\ +
2084: \log \biggl\{ \mu \biggl[ \exp \biggl( - \xi \int_{G_{\eta}(\beta)}^{\gamma}
2085: \pi_{\exp(- \alpha r)}(r) d \alpha \biggr) \biggr] \biggr\}
2086: \\
2087: \shoveright{+ \xi \bigl( 2 +
2088: \tfrac{\beta + \gamma}{\eta} \bigr)
2089: \log\bigl(\tfrac{3}{\epsilon}\bigr)}
2090: \\
2091: \shoveleft{\qquad = \C{K}(\nu, \w{\nu}) + \xi \nu \biggl[ \biggl( \int_{\beta}^{G_{\eta}(\beta)}
2092: + \int_{\gamma}^{G_{\eta}(\gamma)}\biggr) \pi_{\exp( - \alpha r)}(r) d \alpha \biggr]}
2093: \\
2094: + \xi \bigl( 2 + \tfrac{\beta+\gamma}{\eta} \bigr) \log \bigl( \tfrac{3}{\epsilon}
2095: \bigr).
2096: \end{multline*}
2097: We have proved
2098: \begin{thm}
2099: \mypoint
2100: \label{thm1.23}
2101: For any positive constants $\beta$, $\gamma$ and $\eta$ such that
2102: \linebreak $\max \{ \beta, \gamma \}
2103: < \eta \bigl[ \exp( \frac{\eta}{N}) - 1 \bigr]^{-1}$,
2104: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution
2105: $\nu : \Omega \rightarrow \C{M}_+^1(M)$ and any conditional posterior distribution
2106: $\rho: \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,
2107: \begin{multline*}
2108: \C{K}(\nu, \ov{\mu}) \leq \Bigl[1 - \xi\Bigl(1
2109: + \frac{\beta}{\eta}\Bigr)\Bigr]^{-1}
2110: \biggl\{
2111: \C{K}(\nu, \w{\nu})
2112: \\
2113: + \xi \nu \biggl[ \biggl( \int_{\beta}^{G_{\eta}(\beta)}
2114: + \int_{\gamma}^{G_{\eta}(\gamma)}\biggr)
2115: \pi_{\exp( - \alpha r)} (r) d \alpha \biggr]
2116: \\\shoveright{ + \xi \bigl( 2 + \tfrac{\beta+\gamma}{\eta} \bigr)
2117: \log \bigl( \tfrac{3}{\epsilon}
2118: \bigr) \biggr\}}
2119: \\ \shoveleft{ \leq \Bigl[ 1 - \xi\Bigl(1 + \frac{\beta}{\eta}\Bigr) \Bigr]^{-1}
2120: \biggl\{ \C{K}(\nu, \w{\nu})}\\ + \xi \nu \biggl[
2121: \bigl[ G_{\eta}(\gamma)
2122: - \gamma + G_{\eta}(\beta)- \beta \bigr] \sr +
2123: \log \biggl( \frac{G_{\eta}(\beta)
2124: G_{\eta}(\gamma)}{\beta \gamma}\biggr)
2125: d_e \biggr] \\ +
2126: \xi \bigl( 2 + \tfrac{\beta+\gamma}{\eta} \bigr) \log \bigl(
2127: \tfrac{3}{\epsilon} \bigr) \biggr\},
2128: \end{multline*}
where the local prior $\ov{\mu}$ is defined by equation \eqref{eq1.13}
on page \pageref{eq1.13}, the function $G_{\eta}$ is defined by equation
\eqref{eq1.16} and the local posterior $\w{\nu}$ by equation \eqref{eq1.15} above.
2132: \end{thm}
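To get a feeling for the size of the correction terms $G_{\eta}(\beta) - \beta$
and $G_{\eta}(\gamma) - \gamma$, one may expand $G_{\eta}$ when
$\frac{\alpha}{N}$ and $\frac{\eta}{N}$ are small. The following is only a heuristic
second order computation, not a bound:
$$
\tfrac{\alpha}{\eta} \bigl[ \exp \bigl( \tfrac{\eta}{N} \bigr) - 1 \bigr]
\simeq \tfrac{\alpha}{N} + \tfrac{\alpha \eta}{2 N^2},
\qquad \text{ so that } \qquad
G_{\eta}(\alpha) \simeq \alpha + \tfrac{\alpha(\alpha + \eta)}{2N},
$$
where we have used $- \log(1 - x) \simeq x + \frac{x^2}{2}$.
In this regime, the factor $G_{\eta}(\gamma) - \gamma + G_{\eta}(\beta) - \beta$ is thus
approximately
$\frac{\gamma(\gamma + \eta) + \beta(\beta + \eta)}{2N}$.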
2133: We can then use this theorem to give a local version of Theorem
2134: \ref{thm1.1.20} (page \pageref{thm1.1.20}). To get something pleasing
2135: to read, we can apply Theorem \ref{thm1.23} with constants
2136: $\beta'$, $\gamma'$ and $\eta$ chosen so that
2137: $ \frac{2 \xi}{1 - \xi(1 + \frac{\beta'}{\eta})} = 1,$
2138: $G_{\eta}(\beta') = \beta$ and $\gamma' = \lambda$, where
2139: $\beta$ and $\lambda$ are the constants appearing in Theorem
2140: \ref{thm1.1.20}. This gives
2141: \begin{thm}\mypoint
2142: \label{thm1.24}
2143: For any positive real constants $\beta < \lambda$ and $\eta$
2144: such that $\lambda < \eta \bigl[ \exp(\frac{\eta}{N}) - 1 \bigr]^{-1}$,
2145: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution
2146: $\nu : \Omega \rightarrow \C{M}_+^1(M)$, for any conditional posterior distribution
2147: $\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,
2148: \begin{multline*}
2149: \hfill \lambda \Phi_{\frac{\lambda}{N}} \bigl[ \nu \rho(R) \bigr]
2150: - \beta \Phi_{- \frac{\beta}{N}} \bigl[ \nu \rho(R) \bigr]
2151: \leq B_3(\nu, \rho),\text{ where}\hfill\\
2152: \shoveleft{B_3(\nu, \rho) =
2153: \nu \biggl[ \int_{G_{\eta}^{-1} (\beta)}^{G_{\eta}(\lambda)}
2154: \pi_{\exp( - \alpha r)}(r) d \alpha \biggr] }
2155: \\ + \Bigl(3 + \tfrac{G_{\eta}^{-1}(\beta)}{
2156: \eta} \Bigr) \C{K}\bigl[ \nu, \mu_{\exp \bigl[ - \bigl(3
2157: + \frac{G_{\eta}^{-1}(\beta)}{\eta}\bigr)^{-1}
2158: \int_{\beta}^{\lambda} \pi_{\exp( - \alpha
2159: r)}(r) d \alpha \bigr]}\bigr]
\\\shoveright{ + \nu \bigl\{ \C{K}\bigl[\rho,
\pi_{\exp( - \lambda r)}\bigr] \bigr\} + \Bigl( 4 +
2162: \tfrac{G_{\eta}^{-1}(\beta)+\lambda}{\eta} \Bigr) \log \bigl( \tfrac{4}{\epsilon}
2163: \bigr)}\\
2164: \shoveleft{\qquad \leq \nu \Bigl[ \bigl[ G_{\eta}(\lambda) - G_{\eta}^{-1}(\beta) \bigr]
2165: \sr + \log \Bigl(\tfrac{G_{\eta}(\lambda)}{G_{\eta}^{-1}(\beta)} \Bigr) d_e
2166: \Bigr]}
2167: \\
2168: + \Bigl(3 + \tfrac{G_{\eta}^{-1}(\beta)}{
2169: \eta} \Bigr) \C{K}\bigl[ \nu, \mu_{\exp \bigl[ - \bigl(3+\frac{
2170: G_{\eta}^{-1}(\beta)}{\eta}\bigr)^{-1} \int_{\beta}^{\lambda} \pi_{\exp( - \alpha
2171: r)}(r) d \alpha \bigr]}\bigr]
\\ + \nu \bigl\{ \C{K}\bigl[\rho,
\pi_{\exp( - \lambda r)}\bigr] \bigr\} + \Bigl( 4 +
2174: \tfrac{G_{\eta}^{-1}(\beta)+\lambda}{\eta} \Bigr) \log \bigl( \tfrac{4}{\epsilon}
2175: \bigr),
2176: \end{multline*}
2177: and where the function $G_{\eta}$ is defined by equation
2178: \eqref{eq1.16} on page \pageref{eq1.16}.
2179: \end{thm}
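The constants appearing in $B_3(\nu, \rho)$ can be traced back as follows
(a bookkeeping remark, under the choices of $\beta'$, $\gamma'$ and $\xi$ made
before the theorem). The condition
$\frac{2 \xi}{1 - \xi(1 + \frac{\beta'}{\eta})} = 1$ reads
$2 \xi = 1 - \xi \bigl( 1 + \frac{\beta'}{\eta} \bigr)$, and the condition
$G_{\eta}(\beta') = \beta$ can be inverted explicitly, so that
$$
\xi = \Bigl( 3 + \tfrac{\beta'}{\eta} \Bigr)^{-1}
= \Bigl( 3 + \tfrac{G_{\eta}^{-1}(\beta)}{\eta} \Bigr)^{-1},
\qquad
G_{\eta}^{-1}(\beta) = \frac{ \eta \bigl[ 1 - \exp( - \frac{\beta}{N})
\bigr]}{\exp( \frac{\eta}{N}) - 1} \leq \beta.
$$
This value of $\xi$ is the exponent appearing in the localized prior on the index set.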
A first remark: if we had the stamina to use Cauchy--Schwarz inequalities
2181: (or more generally H\"older inequalities) on exponential moments
2182: instead of using weighted union bounds on deviation inequalities, we could have
2183: replaced $\log(\frac{4}{\epsilon})$ with $- \log(\epsilon)$ in the above inequalities.
2184:
2185: We see that we have achieved the desired kind of localization of Theorem
2186: \ref{thm1.1.20} (page \pageref{thm1.1.20}), since the new empirical
2187: entropy term \\\mbox{} \hfill$\C{K}[\nu, \mu_{\exp [
2188: - \xi \int_{\beta}^{\lambda} \pi_{\exp( - \alpha r)}(r) d\alpha ]}]$
2189: \hfill\mbox{}\\
2190: cancels for a value of the posterior distribution on the index set $\nu$
2191: which is of the same form as the one minimizing the bound $B_1(\nu, \rho)$
2192: of Theorem \ref{thm1.1.20} (with a decreased constant, as could be expected).
2193: In a typical parametric setting, we will have
2194: $$
2195: \int_{\beta}^{\lambda} \pi_{\exp( - \alpha r)}(r) d\alpha
2196: \simeq (\lambda - \beta) \sr(m) + \log \left( \tfrac{\lambda}{\beta} \right)
2197: d_e(m),
2198: $$
2199: and therefore, if we choose for $\nu$ the Dirac mass at\\\mbox{}\hfill
2200: $\w{m} \in \arg \min_{m \in M} \sr(m) +
2201: \frac{\log(\frac{\lambda}{\beta})}{\lambda - \beta} d_e(m)$,\hfill
2202: \mbox{}\\
2203: and $\rho(m,\cdot) = \pi_{\exp( - \lambda r)}(m, \cdot)$,
2204: we will get, in the case when the index set $M$ is countable,
2205: \begin{multline*}
2206: B_3(\nu, \rho) \lesssim
2207: \max \left\{ \bigl[ G_{\eta}(\lambda) - G_{\eta}^{-1}(\beta) \bigr]
2208: , (\lambda - \beta)\tfrac{\log\bigl[\frac{G_{\eta}(\lambda)}{
2209: G_{\eta}^{-1}(\beta)}\bigr]}{
2210: \log(\frac{\lambda}{\beta})}\right\}
2211: \\ \shoveright{\times \Bigl[ \sr(\w{m}) + \tfrac{\log(\frac{\lambda}{\beta})}{\lambda - \beta}
2212: d_e(\w{m}) \Bigr]\quad}\\
2213: \shoveleft{\quad + \Bigl( 3 +
2214: \tfrac{G_{\eta}^{-1}(\beta)}{\eta} \Bigr)
2215: \log \Biggl\{ \sum_{m \in M} \tfrac{\mu(m)}{\mu(\w{m})}
2216: \exp \biggl[ - \Bigl( 3 + \tfrac{G_{\eta}^{-1} (\beta)}{\eta}\Bigr)^{-1}}\\
2217: \times
2218: \Bigl\{ (\lambda - \beta) \bigl[ \sr(m) - \sr(\w{m}) \bigr]
2219: + \log \bigl( \tfrac{\lambda}{\beta} \bigr)
2220: \bigl[ d_e(m)- d_e(\w{m}) \bigr] \Bigr\} \biggr] \Biggr\} \\
2221: + \Bigl(4 + \tfrac{G_{\eta}^{-1}(\beta)+\lambda}{\eta}\Bigr)\log\bigl(\tfrac{4}{
2222: \epsilon}\bigr).
2223: \end{multline*}
Therefore, as long as there are not too many of them, the models
for which the penalized minimum empirical
risk $\sr(m) + \frac{\log(\frac{\lambda}{\beta})}{\lambda - \beta}
\,d_e(m)$
is far from optimal contribute little to this bound.
2229:
2230: \subsection{Relative bounds}
2231: The behaviour of the minimum
2232: of the empirical process $\theta \mapsto r(\theta)$
2233: is known to depend on the covariances between pairs $\bigl[
2234: r(\theta), r(\theta') \bigr]$, $\theta, \theta' \in \Theta$.
Accordingly, our previous study, based on the analysis of the variance
of $r(\theta)$ (or technically on some exponential moment playing
much the same role), lacks accuracy in some circumstances
(namely when $\inf_{\Theta} R$ is not close enough to zero).
2239: In this subsection, instead of bounding the expected risk $\rho(R)$,
2240: we are going to upper bound the difference $\rho(R) - \inf_{\Theta} R$,
2241: and more generally $\rho(R) - R(\T)$, where $\T \in \Theta$ is some
2242: fixed parameter value. Eventually in the next subsection
we will analyze $\rho(R) - \pi_{\exp( - \beta R)}(R)$, which will allow us to compare the expected error
2244: rate of a posterior distribution $\rho$ with the error rate
2245: of a Gibbs prior distribution.
2246: Thus relative bounds are not exactly of the
2247: same nature as previous ones: although it is not possible to estimate
2248: $\rho(R)$ with an order of precision higher than $(\rho(R) / N)^{1/2}$,
2249: it is still possible in some situations to reach a better precision
2250: for $\rho(R) - \inf_{\Theta} R$, as we will see.
The study of PAC-Bayesian relative bounds stems from the second and
third parts of J. Y. Audibert's dissertation \cite{Audibert2}.
2253:
2254: We will suggest two different kinds of applications of these bounds.
The first, more obvious one is to upper bound $\rho(R) - \inf_{\Theta} R$
2256: to get an idea of the performance of the posterior distribution $\rho$.
2257:
2258: The second application is to compare the classification model indexed by
2259: $\Theta$ with a submodel indexed by one of its measurable subsets
2260: $\Theta_1 \subset \Theta$. For this purpose we are
2261: going to compare $\rho(R)$, where $\rho : \Omega \rightarrow
2262: \C{M}_+^1(\Theta)$ is any posterior distribution, with
2263: $R(\T)$, where $\T \in \Theta_1$ is some possibly unobservable
2264: value of the parameter in the submodel defined by $\Theta_1$.
2265: We will typically consider the case when $\T \in \arg\min_{\Theta_1} R$.
2266: In this special case, a negative bound for $\rho(R) - R(\T)
2267: = \rho(R) - \inf_{\Theta_1} R$ indicates that it is definitely
2268: worth using a randomized estimator $\rho$ supported by
2269: the larger parameter set $\Theta$ instead of using only
2270: the classification model defined by the smaller set $\Theta_1$.
2271:
2272: \subsubsection{Basic inequalities}
2273: Relative bounds in this section are based on the control of
2274: $r(\theta) - r(\T)$, where $\theta, \T \in \Theta$. These
2275: differences are related to the random variables
2276: $$
2277: \psi_i(\theta, \T) = \sigma_i(\theta) - \sigma_i(\T)
2278: = \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr] -
2279: \B{1} \bigl[ f_{\T}(X_i) \neq Y_i \bigr].
2280: $$
2281:
2282: Some supplementary technical difficulties, as compared to
2283: the previous sections, come from the fact that
2284: $\psi_i(\theta, \T)$ takes three values, whereas $\sigma_i(\theta)$
2285: takes only two. Let $\rr(\theta, \T) = r(\theta) - r(\T)$
2286: and $\R(\theta, \T) = R(\theta) - R(\T)$. We have as usual from
2287: independence that
2288: \begin{multline*}
2289: \log \Bigl\{ \PP \Bigl[ \exp \bigl[
2290: - \lambda \rr(\theta, \T) \bigr] \Bigr] \Bigr\}
2291: = \sum_{i=1}^N \log \Bigl\{ \PP \Bigl[
2292: \exp \bigl[ - \tfrac{\lambda}{N} \psi_i(\theta, \T) \bigr] \Bigr] \Bigr\}
2293: \\ \leq N \log \biggl\{ \frac{1}{N} \sum_{i=1}^N \PP
2294: \Bigl\{ \exp \Bigl[ - \frac{\lambda}{N} \psi_i(\theta, \T) \Bigr] \Bigr\} \biggr\}.
2295: \end{multline*}
2296: Let $C_i$ be the distribution of $\psi_i(\theta, \T)$ under $\PP$ and let
2297: $\Bar{C} = \frac{1}{N} \sum_{i=1}^N C_i \in \C{M}_+^1\bigl( \{-1, 0, 1\} \bigr)$.
2298: With these notations
2299: \begin{equation}
2300: \label{eq2.2.2Bis}
2301: \log \Bigl\{ \PP \Bigl[ \exp \bigl[ - \lambda \rr( \theta, \T) \bigr]
2302: \Bigr] \Bigr\} \leq N \log \biggl\{ \int \exp \Bigl( - \frac{\lambda}{N}
2303: \psi \Bigr) \Bar{C}(d \psi) \biggr\}.
2304: \end{equation}
2305: \newcommand{\BM}{{M'}}
The right-hand side of this inequality is a function of $\Bar{C}$. On the
other hand, $\Bar{C}$, being a probability measure on a three-point set, is
determined by two parameters, which we may take to be $\int \psi \Bar{C}(d \psi)$ and
$\int \psi^2 \Bar{C}(d \psi)$. For this purpose, let us introduce
2310: $$
2311: \BM(\theta, \T) = \int \psi^2 \Bar{C}(d \psi) = \Bar{C}(+1)
2312: + \Bar{C}(-1) = \frac{1}{N} \sum_{i=1}^N \PP \bigl[
2313: \psi_i^2(\theta, \T) \bigr], \quad \theta, \T \in \Theta.
2314: $$
2315: It is a pseudo distance
2316: (meaning that it is symmetric and satisfies the triangle inequality),
2317: since it can also be written as
2318: $$
2319: \BM(\theta, \T) = \frac{1}{N} \sum_{i=1}^N
2320: \PP \Bigl\{ \Bigl\lvert \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr]
2321: - \B{1} \bigl[ f_{\T}(X_i) \neq Y_i \bigr] \Bigr\rvert \Bigr\},
2322: \quad \theta, \T \in \Theta.
2323: $$
2324: It is readily seen that
2325: $$
2326: N \log \left\{ \int \exp \left( - \frac{\lambda}{N} \psi \right) \Bar{C}(d \psi)
2327: \right\} = - \lambda \Psi_{\frac{\lambda}{N}} \bigl[ R'(\theta, \T), M'(\theta, \T) \bigr],
2328: $$
2329: where
2330: \begin{align*}
2331: \Psi_a(p,m) & = - a^{-1}
2332: \log \Bigl[ (1 - m) + \frac{m+p}{2} \exp(-a)
2333: + \frac{m-p}{2} \exp (a) \Bigr]
2334: \\ & = - a^{-1} \log \Bigl\{
2335: 1 - \sinh(a) \bigl[ p - m \tanh(\tfrac{a}{2}) \bigr] \Bigr\}.
2336: \end{align*}
2337: Thus plugging this equality into inequality \eqref{eq2.2.2Bis} we see that for
2338: any real parameter $\lambda$,
2339: $$
2340: \log \Bigl\{ \PP \Bigl[ \exp \bigl[ - \lambda \rr( \theta, \T) \bigr]
2341: \Bigr] \Bigr\} \leq - \lambda \Psi_{\frac{\lambda}{N}}
\bigl[ \R(\theta, \T), \BM(\theta, \T) \bigr].
2343: $$
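The closed form of $\Psi_a$ used above can be checked directly; the following
lines only unfold the definitions. Writing
$p = \int \psi \Bar{C}(d \psi) = \Bar{C}(+1) - \Bar{C}(-1)$ and
$m = \int \psi^2 \Bar{C}(d \psi) = \Bar{C}(+1) + \Bar{C}(-1)$, we get
$\Bar{C}(+1) = \frac{m+p}{2}$, $\Bar{C}(-1) = \frac{m-p}{2}$ and
$\Bar{C}(0) = 1 - m$, so that
\begin{multline*}
\int \exp( - a \psi) \Bar{C}(d \psi)
= (1 - m) + \frac{m+p}{2} \exp(-a) + \frac{m-p}{2} \exp(a)
\\ = 1 + m \bigl[ \cosh(a) - 1 \bigr] - p \sinh(a)
= 1 - \sinh(a) \bigl[ p - m \tanh(\tfrac{a}{2}) \bigr],
\end{multline*}
where the last equality uses
$\cosh(a) - 1 = 2 \sinh(\tfrac{a}{2})^2 = \sinh(a) \tanh(\tfrac{a}{2})$.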
To make a link with previous work initiated by Mammen and Tsybakov
2345: (see e.g. \cite{Mammen,Tsybakov}), we may consider the pseudo
2346: distance $D$ on $\Theta$ defined on page \pageref{eq1.1.2} by equation
2347: \eqref{eq1.1.2}.
2348: This distance only depends on the distribution of the patterns. It
2349: is often used to formulate margin assumptions (in the sense of Mammen
2350: and Tsybakov).
2351: Here we are going to work rather with
2352: $\BM$: as it is dominated by $D$ in the sense that
2353: $\BM(\theta, \T) \leq D(\theta, \T)$, $\theta, \T \in \Theta$, with equality
2354: in the important case of binary classification, hypotheses formulated on
2355: $D$ induce hypotheses on $M'$, and working with $M'$ may only sharpen the
2356: results when compared to working with $D$.
2357:
2358: Using the same reasoning as in the previous section, we deduce
2359: \begin{thm}
2360: \label{thm4.1}
2361: \mypoint For any real parameter $\lambda$, any $\T \in \Theta$,
2362: $$
2363: \PP \biggl\{ \exp \biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}
2364: \lambda \Bigl[ \rho \bigl\{ \Psi_{\frac{\lambda}{N}} \bigl[
2365: \R(\cdot, \T\,), \BM(\cdot, \T\,) \bigr] \bigr\}
2366: - \rho\bigl[\rr(\cdot, \T) \bigr] \Bigr]
2367: - \C{K}(\rho, \pi) \biggr] \biggr\} \leq 1.
2368: $$
2369: \end{thm}
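For the reader's convenience, here is a sketch of the reasoning alluded to
before the theorem. It assumes, as in the previous section, enough measurability
to exchange the integrations with respect to $\PP$ and $\pi$, and it recalls
the duality formula only for bounded measurable functions $h$.
For each fixed $\theta \in \Theta$, the exponential inequality established above reads
$$
\PP \Bigl\{ \exp \Bigl[ \lambda \Psi_{\frac{\lambda}{N}} \bigl[
R'(\theta, \T\,), M'(\theta, \T\,) \bigr] - \lambda r'(\theta, \T\,) \Bigr]
\Bigr\} \leq 1.
$$
Integrating this inequality with respect to $\pi$ and applying Fubini's theorem
shows that the $\PP$ expectation of the $\pi$ integral of the same exponential
is also bounded by one. The supremum over $\rho$ then comes from the
Legendre transform identity
$$
\log \Bigl\{ \pi \bigl[ \exp(h) \bigr] \Bigr\}
= \sup_{\rho \in \C{M}_+^1(\Theta)} \bigl[ \rho(h) - \C{K}(\rho, \pi) \bigr],
$$
applied to $h = \lambda \Psi_{\frac{\lambda}{N}} \bigl[
R'(\cdot, \T\,), M'(\cdot, \T\,) \bigr] - \lambda r'(\cdot, \T\,)$.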
2370:
We are now going to derive a variant of Theorem \ref{thm4.1}.
In this theorem, we obtain an inequality comparing one observed quantity
$\rho\bigl[r'(\cdot, \T\,)\bigr]$ with two unobserved ones, $\rho\bigl[R'(
\cdot, \T\,)\bigr]$ and $\rho\bigl[M'(\cdot, \T\,) \bigr]$
(because of the convexity of the function $\lambda \Psi_{\frac{\lambda}{N}}$,
2376: $$
2377: \lambda \rho
2378: \bigl\{ \Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \T\,),M'(\cdot, \T\,) \bigr]
2379: \bigr\} \geq
2380: \lambda \Psi_{\frac{\lambda}{N}} \bigl\{ \rho\bigl[R'(\cdot, \T\,)\bigr],
2381: \rho\bigl[ M'(\cdot, \T\,) \bigr] \bigr\}.)
2382: $$
This may be inconvenient when looking for
an empirical bound for $\rho\bigl[ R'(\cdot, \T) \bigr]$, and we are now going to seek
2385: an inequality comparing $\rho\bigl[R'(\cdot, \T\,)\bigr]$ with empirical quantities
2386: only. This is possible through a change of variables in the
2387: exponential inequality. Indeed, if we consider now random variables
2388: $\chi_i(\theta, \T)$, such that
2389: $$
2390: 1 - \frac{\lambda}{N} \psi_i = \exp \left( - \frac{\lambda}{N} \chi_i \right),
2391: $$
2392: which is possible when $\frac{\lambda}{N} \in \; )\!-\!\!1, 1($ and leads to define
2393: $$
2394: \chi_i = - \frac{N}{\lambda} \log \left( 1 - \frac{\lambda}{N}\psi_i \right),
2395: $$
we easily obtain, following the same reasoning as before,
2397: \begin{multline*}
2398: \log \Biggl\{ \PP \biggl\{ \exp \biggl[ \sum_{i=1}^N \log \Bigl(
2399: 1 - \frac{\lambda}{N} \psi_i
2400: \Bigr) \biggr] \biggr\} \Biggr\}
2401: \\ \leq \sum_{i=1}^N \log \Bigl[ 1 - \frac{\lambda}{N} \PP(\psi_i) \Bigr]
2402: \leq N \log \Bigl[ 1 - \frac{\lambda}{N} R'(\theta,\T\,) \Bigr].
2403: \end{multline*}
2404: Let us replace for simplicity $\lambda / N$ with $\lambda$.
2405: Let us also introduce the random pseudo distance
2406: \begin{multline}
2407: \label{eq1.3}
2408: m'(\theta, \T) = \frac{1}{N} \sum_{i=1}^N \psi_i(\theta,\T)^2
2409: \\ = \frac{1}{N} \sum_{i=1}^N \Bigl\lvert \B{1} \bigl[
2410: f_{\theta}(X_i) \neq Y_i \bigr] - \B{1} \bigl[ f_{\T}(
2411: X_i) \neq Y_i \bigr] \Bigr\rvert, \quad \theta, \T \in \Theta.
2412: \end{multline}
This is the empirical counterpart of $M'$, since $\PP(m') = M'$.
2414: Let us notice that
2415: \begin{multline*}
2416: \frac{1}{N} \sum_{i=1}^N \log \bigl[ 1 - \lambda \psi_i(\theta, \T) \bigr]
2417: = \frac{\log(1 - \lambda) - \log(1 + \lambda)}{2} r'(\theta, \T)
\\ \shoveright{+ \frac{\log(1 - \lambda) + \log(1 + \lambda)}{2} m'(\theta,\T)
\qquad}
\\ = \frac{1}{2} \log \left( \frac{1 - \lambda}{1 + \lambda} \right)
2421: r'\bigl(\theta, \T\,\bigr) + \frac{1}{2} \log( 1 - \lambda^2)
2422: m'\bigl(\theta, \T\,\bigr).
2423: \end{multline*}
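This identity is a matter of counting, as the following short verification shows.
Since $\psi_i(\theta, \T)$ takes its values in $\{-1, 0, 1\}$, and since
$r'(\theta, \T) = \frac{1}{N} \sum_{i=1}^N \psi_i(\theta, \T)$ and
$m'(\theta, \T) = \frac{1}{N} \sum_{i=1}^N \psi_i(\theta, \T)^2$,
the proportions of indices $i$ for which $\psi_i = +1$ and $\psi_i = -1$
are respectively $\frac{m' + r'}{2}$ and $\frac{m' - r'}{2}$.
The terms with $\psi_i = 0$ do not contribute, so that
$$
\frac{1}{N} \sum_{i=1}^N \log \bigl[ 1 - \lambda \psi_i(\theta, \T) \bigr]
= \frac{m'(\theta, \T) + r'(\theta, \T)}{2} \log(1 - \lambda)
+ \frac{m'(\theta, \T) - r'(\theta, \T)}{2} \log(1 + \lambda),
$$
which gives the stated expression after collecting the terms in $r'$ and $m'$.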
2424: With these notations, we can
2425: conveniently write the previous inequality as
2426: \begin{multline*}
2427: \PP \Biggl\{ \exp \Biggl[ -N \log \bigl[ 1 - \lambda R'(\theta, \T) \bigr]
2428: \\ - \frac{N}{2} \log \biggl(\frac{1+\lambda}{1-\lambda}\biggr) r'\bigl(\theta,
2429: \T\,\bigr) + \frac{N}{2} \log\bigl(1 - \lambda^2\bigr) m'\bigl(\theta, \T\, \bigr) \Biggr] \Biggr\}
2430: \leq 1.
2431: \end{multline*}
2432: Integrating with respect to a prior probability measure $\pi \in \C{M}_+^1(\Theta)$,
2433: we obtain
2434: \begin{thm}
2435: \label{thm2.2.18}
2436: \mypoint For any real parameter $\lambda \in \; )\!\!-\!\!1,1($, for any $\T \in \Theta$,
2437: for any prior probability distribution $\pi \in \C{M}_+^1(\Theta)$,
2438: \begin{multline*}
2439: \PP \Biggl\{ \exp \Biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)} \biggl\{
2440: -N \rho \Bigl\{ \log \bigl[ 1 - \lambda R'(\cdot, \T\,) \bigr] \Bigr\}
2441: \\ - \frac{N}{2} \log \biggl( \frac{1+\lambda}{1-\lambda}\biggr)
2442: \rho \bigl[r'(\cdot, \T\,)\bigr]\qquad \\ + \frac{N}{2} \log(1 - \lambda^2)
2443: \rho\bigl[m'(\cdot, \T\,) \bigr]
2444: - \C{K}(\rho, \pi) \biggr\} \Biggr] \Biggr\} \leq 1.
2445: \end{multline*}
2446: \end{thm}
2447:
2448: \subsubsection{Non random bounds}
Let us first deduce a non random bound from Theorem \ref{thm4.1}.
A convenient way to take advantage of this theorem is to
throw the non linearity into a localized prior, considering
the prior probability measure $\mu$ defined by
2453: $$
2454: \frac{d \mu}{d \pi}(\theta) = \frac{\exp \bigl\{ - \lambda \Psi_{\frac{\lambda}{N}}
2455: \bigl[ R'(\theta, \T\,), \BM(\theta, \T\,) \bigr] + \beta \R(\theta, \T\,) \bigr\}}
2456: {\pi \Bigl\{ \exp \bigl\{ - \lambda \Psi_{\frac{\lambda}{N}}
2457: \bigl[ R'(\cdot, \T\,), \BM(\cdot, \T\,) \bigr] + \beta \R(\cdot, \T\,) \bigr\}
2458: \Bigr\}}.
2459: $$
2460: Indeed, for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
2461: \begin{multline*}
2462: \C{K}(\rho,\mu) = \C{K}(\rho,\pi) + \lambda \rho \Bigl\{
2463: \Psi_{\frac{\lambda}{N}} \bigl[ R'(\cdot, \T\,),M'(\cdot, \T\,) \bigr]
2464: \Bigr\} - \beta \rho \bigl[ R'(\cdot, \T\,) \bigr] \\ +
2465: \log \Bigl\{ \pi \Bigl[ \exp \bigl\{
2466: - \lambda \Psi_{\frac{\lambda}{N}}\bigl[ R'(\cdot, \T\,),
M'(\cdot, \T\,) \bigr] + \beta R'(\cdot, \T\,) \bigr\} \Bigr] \Bigr\}.
2468: \end{multline*}
2469: Plugging this into Theorem \ref{thm4.1} and using the convexity of the
2470: exponential function, we see that for any posterior probability distribution
2471: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
2472: \begin{multline*}
2473: \beta \PP \bigl\{ \rho \bigl[ R'(\cdot, \T\,) \bigr] \bigr\}
2474: \leq \lambda \PP \bigl\{ \rho \bigl[ r'(\cdot, \T\,) \bigr] \bigr\}
2475: + \PP \bigl[ \C{K}(\rho, \pi) \bigr] \\ +
2476: \log \Bigl\{ \pi \Bigl[ \exp \bigl\{
2477: - \lambda \Psi_{\frac{\lambda}{N}}\bigl[ R'(\cdot, \T\,),
M'(\cdot, \T\,) \bigr] + \beta R'(\cdot, \T\,) \bigr\} \Bigr] \Bigr\}.
2479: \end{multline*}
2480: We can then recall that
2481: $$
2482: \lambda \rho\bigl[ r'(\cdot, \T\,) \bigr] + \C{K}(\rho, \pi)
2483: = \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)}\bigr] - \log
2484: \Bigl\{ \pi \Bigl[ \exp \bigl[ - \lambda r'(\cdot, \T\,) \bigr] \Bigr] \Bigr\},
2485: $$
2486: and notice moreover that
2487: $$
2488: - \PP \biggl\{ \log \Bigl\{ \pi \Bigl[
2489: \exp \bigl[ - \lambda r'(\cdot, \T\,) \bigr] \Bigr] \Bigr\} \biggr\}
2490: \leq
2491: - \log \Bigl\{ \pi \Bigl[
2492: \exp \bigl[ - \lambda R'(\cdot, \T\,) \bigr] \Bigr] \Bigr\},
2493: $$
2494: since $R' = \PP(r')$ and $h \mapsto \log \Bigl\{ \pi \bigl[ \exp ( h) \bigr] \Bigr\}$
2495: is a convex functional. Putting these two remarks together, we obtain
2496: \begin{thm}
2497: \mypoint \label{thm2.2.19}
For any real positive parameters $\beta$ and $\lambda$, for any prior distribution $\pi
2499: \in \C{M}_+^1(\Theta)$, for any posterior distribution $\rho : \Omega
2500: \rightarrow \C{M}_+^1(\Theta)$,
2501: \begin{multline*}
2502: \PP \bigl\{ \rho \bigl[ R'(\cdot, \T\,) \bigr] \bigr\}
2503: \leq \frac{1}{\beta} \PP \bigl[ \C{K}(\rho, \pi_{\exp( - \lambda r)}) \bigr]
2504: \\ + \frac{1}{\beta} \log \Bigl\{ \pi \Bigl[ \exp \bigl\{
2505: - \lambda \Psi_{\frac{\lambda}{N}}\bigl[ R'(\cdot, \T\,),
M'(\cdot, \T\,) \bigr] + \beta R'(\cdot, \T\,) \bigr\} \Bigr] \Bigr\}\\
2507: \shoveright{- \frac{1}{\beta} \log \Bigl\{ \pi \Bigl[
2508: \exp \bigl[ - \lambda R'(\cdot, \T\,) \bigr] \Bigr] \Bigr\}\quad}\\\shoveleft{\qquad
2509: \leq \frac{1}{\beta} \PP \bigl[ \C{K}(\rho, \pi_{\exp( - \lambda r)})\bigr]}
2510: \\ + \frac{1}{\beta} \log \Bigl\{ \pi \Bigl[
\exp \bigl\{ - \bigl[ N \sinh(\tfrac{\lambda}{N}) - \beta \bigr] R'(\cdot, \T\,)
2512: \\ \shoveright{+ 2 N \sinh(\tfrac{\lambda}{2N})^2 M'(\cdot, \T\,) \bigr\} \Bigr] \Bigr\}
2513: \qquad} \\ - \frac{1}{\beta} \log \Bigl\{ \pi \Bigl[
2514: \exp \bigl[ - \lambda R'(\cdot, \T\,) \bigr] \Bigr] \Bigr\}.
2515: \end{multline*}
2516: \end{thm}
2517: It may be interesting to derive some more suggestive (but slightly weaker)
2518: bound in the important case when $\Theta_1 = \Theta$ and $R(\T) = \inf_{\Theta} R$.
2519: In this case, it is convenient to introduce the {\em margin function}
2520: \begin{equation}
2521: \label{eq1.1.16Bis}
2522: \varphi(x) = \sup_{\theta \in \Theta} \BM(\theta, \T) -
2523: x \R(\theta, \T), \quad x \in \RR_+.
2524: \end{equation}
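We see that $\varphi$ is convex and nonnegative on $\RR_+$: it is a supremum
of affine functions of $x$, and the choice $\theta = \T$, for which
$\BM(\T, \T) = \R(\T, \T) = 0$, shows that $\varphi(x) \geq 0$.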
2526: Using the bound $M'(\theta, \T\,) \leq x R'(\theta, \T\,) + \varphi(x)$,
2527: we obtain
2528: \begin{multline*}
2529: \PP \bigl\{ \rho \bigl[ R'(\cdot, \T\,) \bigr] \bigr\}
2530: \leq \frac{1}{\beta} \PP \bigl[ \C{K}(\rho, \pi_{\exp( - \lambda r)})\bigr]
2531: \\ + \frac{1}{\beta} \log \biggl\{ \pi \biggl[
2532: \exp \Bigl\{ -
2533: \bigl\{ N \sinh(\tfrac{\lambda}{N})\bigl[
2534: 1 - x\tanh(\tfrac{\lambda}{2N})\bigr] - \beta \bigr\}
2535: R'(\cdot, \T\,) \Bigr\}
2536: \biggr] \biggr\}
2537: \\ + \frac{N \sinh(\tfrac{\lambda}{N}) \tanh(\tfrac{\lambda}{2N})}{\beta} \varphi(x)
2538: - \frac{1}{\beta} \log \Bigl\{ \pi \Bigl[
2539: \exp \bigl[ - \lambda R'(\cdot, \T\,) \bigr] \Bigr] \Bigr\}.
2540: \end{multline*}
2541: Let us make the change of variable $\gamma =
2542: N \sinh(\tfrac{\lambda}{N})\bigl[
2543: 1 - x\tanh(\tfrac{\lambda}{2N})\bigr] - \beta$ to obtain
2544: \begin{cor}
2545: \label{cor1.1.21}\mypoint
2546: For any real positive parameters $x$, $\gamma$ and $\lambda$ such that
2547: $x \leq \tanh(\frac{\lambda}{2N})^{-1}$ and $0 \leq \gamma <
2548: N \sinh(\frac{\lambda}{N}) \bigl[ 1 - x \tanh(\frac{\lambda}{2N}) \bigr]$,
2549: \begin{multline*}
2550: \PP \bigl[ \rho(R) \bigr] - \inf_{\Theta} R
2551: \leq \Bigl\{
2552: N \sinh(\tfrac{\lambda}{N}) \bigl[ 1 - x
2553: \tanh(\tfrac{\lambda}{2N})\bigr] - \gamma \Bigr\}^{-1} \\
2554: \shoveleft{\qquad \times
2555: \biggl\{ \int_{\gamma}^{\lambda}
2556: \bigl[ \pi_{\exp( - \alpha R)}(R) - \inf_{\Theta} R\bigr]
2557: d \alpha }\\ + N \sinh\bigl(\tfrac{\lambda}{N}\bigr) \tanh\bigl(\tfrac{\lambda}{2N}\bigr)
2558: \varphi(x) + \PP \bigl[ \C{K}(\rho, \pi_{\exp( - \lambda r)}) \bigr]
2559: \biggr\}.
2560: \end{multline*}
2561: \end{cor}
2562: Let us remark that these results, although well suited to study Mammen and Tsybakov's
2563: margin assumptions, hold in the general case: introducing the convex {\em expected
2564: margin function} $\varphi$ is a substitute for making hypotheses about the relations
2565: between $R$ and $D$.
2566:
2567: Using the fact that $R'(\theta, \T\,) \geq 0$, $\theta \in \Theta$ and
2568: that $\varphi(x) \geq 0$, $x \in \RR_+$, we can weaken and simplify even more
2569: the preceding corollary to get
2570: \begin{cor}
2571: \label{cor4.3}
2572: \mypoint For any real parameters $\beta$, $\lambda$ and $x$ such that
2573: $x \geq 0$ and $0 \leq \beta < \lambda - x \frac{\lambda^2}{2N}$,
2574: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
2575: \begin{multline*}
2576: \PP \bigl[ \rho(R) \bigr] \leq \inf_{\Theta} R
2577: \\ +
2578: \Bigl[\lambda - x \tfrac{\lambda^2}{2N} - \beta \Bigr]^{-1}
2579: \biggl\{ \int_{\beta}^{\lambda}
2580: \bigl[ \pi_{\exp( - \alpha R)}(R) - \inf_{\Theta} R \bigr] d \alpha
2581: \\ + \PP \bigl\{ \C{K}\bigl[\rho, \pi_{\exp( - \lambda r)} \bigr] \bigr\}
2582: + \varphi(x) \frac{\lambda^2}{2N} \biggr\}.
2583: \end{multline*}
2584: \end{cor}
2585: Let us apply this bound under the {\em margin assumption}
2586: first considered by Mammen and Tsybakov \cite{Mammen,Tsybakov},
which states that for some real positive constant $c$ and some
2588: real exponent $\kappa \geq 1$,
2589: \begin{equation}
2590: \label{eq1.1.17Bis}
2591: \R(\theta, \T) \geq
2592: c D(\theta, \T)^{\kappa}, \qquad \theta \in \Theta.
2593: \end{equation}
When $\kappa = 1$, we have $\varphi(c^{-1}) = 0$. Taking $x = c^{-1}$ and
$\rho = \pi_{\exp( - \lambda r)}$, for which the divergence term vanishes, we obtain
2596: \begin{align*}
2597: \PP \bigl\{ \pi_{\exp( - \lambda r)}\bigl[ \R(\cdot, \T\,) \bigr] \bigr\}
2598: & \leq \frac{\int_{\beta}^{\lambda} \pi_{\exp(
2599: - \gamma R)}\bigl[ \R(\cdot, \T\,)\bigr]
2600: d \gamma}{N \sinh(\frac{\lambda}{N})
2601: \bigl[ 1 - c^{-1} \tanh(\frac{\lambda}{2N}) \bigr] - \beta}
2602: \\ & \leq \frac{ \int_{\beta}^{\lambda} \pi_{\exp( - \gamma R)}\bigl[
2603: \R(\cdot, \T\,)\bigr]
2604: d \gamma}{
2605: \lambda - \frac{ \lambda^2}{2 c N} - \beta}.
2606: \end{align*}
2607: Taking for example $\lambda = \frac{cN}{2}$, $\beta = \frac{\lambda}{2}
2608: = \frac{cN}{4}$,
2609: we obtain
2610: \begin{align*}
2611: \PP \bigl[ \pi_{\exp( - 2^{-1} c N r)}(R) \bigr] & \leq \inf R +
2612: \frac{8}{cN} \int_{\frac{c N}{4}}^{\frac{cN}{2}}
2613: \pi_{\exp( - \gamma R)}\bigl[\R(\cdot, \T)\bigr]
2614: d \gamma \\* & \leq \inf R + 2 \pi_{\exp(- \frac{cN}{4} R)}\bigl[ \R(\cdot, \T\,)\bigr].
2615: \end{align*}
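Let us check the constants in this computation.
With $\lambda = \frac{cN}{2}$ and $\beta = \frac{cN}{4}$,
$$
\lambda - \frac{\lambda^2}{2cN} - \beta
= \frac{cN}{2} - \frac{cN}{8} - \frac{cN}{4} = \frac{cN}{8},
$$
which gives the factor $\frac{8}{cN}$. The second inequality comes from the
fact that $\gamma \mapsto \pi_{\exp( - \gamma R)}\bigl[ \R(\cdot, \T\,) \bigr]$
is non-increasing (its derivative is minus the variance of $R$ under
$\pi_{\exp( - \gamma R)}$), so that the integral over an interval of
length $\frac{cN}{4}$ is at most $\frac{cN}{4}$ times the value of the
integrand at the lower end point $\gamma = \frac{cN}{4}$.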
If moreover the behaviour of the prior distribution $\pi$ is parametric,
2617: meaning that $\pi_{\exp( - \beta R)}\bigl[ \R(\cdot, \T\,) \bigr]
2618: \leq \frac{d}{\beta}$,
2619: for some positive real constant $d$ linked with the dimension of the
2620: classification model, then
2621: $$
2622: \PP \bigl[ \pi_{\exp( - \frac{c N}{2} r)}(R) \bigr]
2623: \leq \inf R + \frac{8 \log(2) d}{cN}
2624: \leq \inf R + \frac{5.55 \, d}{cN}.
2625: $$
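Indeed, under this parametric assumption,
$$
\frac{8}{cN} \int_{\frac{cN}{4}}^{\frac{cN}{2}} \frac{d}{\gamma} \, d \gamma
= \frac{8 d}{cN} \log \biggl( \frac{cN/2}{cN/4} \biggr)
= \frac{8 \log(2) d}{cN},
$$
and $8 \log(2) \simeq 5.545$.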
2626: In the case when $\kappa > 1$,
2627: $$\varphi(x) \leq (\kappa -1) \kappa^{- \frac{\kappa}{
2628: \kappa -1}} (c x)^{- \frac{1}{\kappa - 1}} = (1 - \kappa^{-1})(\kappa c x)^{-\frac{1}{
2629: \kappa - 1}},$$
2630: \begin{multline*}
2631: \hspace{-10pt}\text{thus }\PP \bigl\{ \pi_{\exp(- \lambda r)}\bigl[ \R(\cdot, \T\,)\bigr] \bigr\}
2632: \\ \leq \frac{\int_{\beta}^{\lambda} \pi_{\exp( - \gamma R)}\bigl[ \R(\cdot, \T\,)\bigr] d \gamma
2633: + (1 - \kappa^{-1}) (\kappa c x)^{-\frac{1}{\kappa - 1}}
2634: \frac{\lambda^2}{2N} }{
2635: \lambda - \frac{x\lambda^2}{2N} - \beta}.
2636: \end{multline*}
2637: Taking for instance $\beta = \frac{\lambda}{2}$, $x = \frac{N}{2 \lambda}$,
2638: and putting $b = (1 - \kappa^{-1}) (c \kappa)^{- \frac{1}{\kappa -1}}$,
2639: we obtain
2640: $$
2641: \PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr] - \inf R
2642: \leq \frac{4}{\lambda} \int_{\lambda/2}^{\lambda}
2643: \pi_{\exp( - \gamma R)}\bigl[ \R(\cdot, \T\,)\bigr] d \gamma + b \left(\frac{2 \lambda}{N}\right)^{\frac{
2644: \kappa}{\kappa -1}}.
2645: $$
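To see where these constants come from, note that
with $\beta = \frac{\lambda}{2}$ and $x = \frac{N}{2 \lambda}$,
$$
\lambda - \frac{x \lambda^2}{2N} - \beta
= \lambda - \frac{\lambda}{4} - \frac{\lambda}{2} = \frac{\lambda}{4},
$$
whereas
$$
\frac{4}{\lambda} (1 - \kappa^{-1}) (\kappa c x)^{- \frac{1}{\kappa - 1}}
\frac{\lambda^2}{2N}
= b \left( \frac{2 \lambda}{N} \right)^{\frac{1}{\kappa - 1}} \frac{2 \lambda}{N}
= b \left( \frac{2 \lambda}{N} \right)^{\frac{\kappa}{\kappa - 1}}.
$$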
2646: In the {\em parametric} case when $\pi_{\exp( - \gamma R)}\bigl[ \R(\cdot, \T\,)\bigr]
2647: \leq \frac{d}{\gamma}$,
2648: we get
2649: $$
2650: \PP \bigl[ \pi_{\exp( - \lambda r)}(R) \bigr] - \inf R
2651: \leq \frac{4 \log(2) d}{\lambda} + b \left( \frac{2 \lambda}{N} \right)^{\frac{
2652: \kappa}{\kappa - 1}}.
2653: $$
2654: Taking
2655: \newcommand{\Blambda}{\overline{\lambda}}
2656: $$
2657: \Blambda = 2^{-1} \bigl[ 8 \log(2) d \bigr]^{\frac{\kappa-1}{2 \kappa -1}}
2658: (\kappa c)^{\frac{1}{2 \kappa -1}}
2659: N^{\frac{\kappa}{2 \kappa -1 }},
2660: $$
2661: we obtain
2662: $$
2663: \PP \bigl[ \pi_{\exp( - \Blambda r)}(R) \bigr] - \inf R
2664: \leq (2 - \kappa^{-1}) (\kappa c)^{-\frac{1}{2 \kappa - 1}}
2665: \left( \frac{ 8 \log(2) d}{N} \right)^{\frac{\kappa}{2 \kappa - 1}}.
2666: $$
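This value of $\Blambda$ is the one that minimizes the right-hand side of the
previous bound when $\kappa > 1$. Indeed, using the auxiliary notation
$a = 8 \log(2) d$ and $u = \frac{2 \lambda}{N}$, this bound reads
$\frac{a}{N u} + b u^{\frac{\kappa}{\kappa - 1}}$. Since
$b \frac{\kappa}{\kappa - 1} = (\kappa c)^{- \frac{1}{\kappa - 1}}$,
its derivative with respect to $u$ vanishes when
$$
u^{\frac{2 \kappa - 1}{\kappa - 1}} = \frac{a}{N} (\kappa c)^{\frac{1}{\kappa - 1}},
\quad \text{that is when} \quad
u = \left( \frac{a}{N} \right)^{\frac{\kappa - 1}{2 \kappa - 1}}
(\kappa c)^{\frac{1}{2 \kappa - 1}}.
$$
For this value of $u$,
$$
\frac{a}{N u} = (\kappa c)^{- \frac{1}{2 \kappa - 1}}
\left( \frac{a}{N} \right)^{\frac{\kappa}{2 \kappa - 1}}
\quad \text{and} \quad
b u^{\frac{\kappa}{\kappa - 1}} = (1 - \kappa^{-1})
(\kappa c)^{- \frac{1}{2 \kappa - 1}}
\left( \frac{a}{N} \right)^{\frac{\kappa}{2 \kappa - 1}},
$$
and these two terms add up to the bound just stated.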
2667: We see that this formula coincides with the result for $\kappa = 1$.
2668: We can thus reduce the two cases to a single one and state
2669: \begin{cor}
2670: \mypoint
2671: \label{cor1.1.23} Let us assume that for some $\T \in \Theta$, some
2672: positive real constant $c$, some real exponent $\kappa \geq 1$
2673: and for any $\theta \in \Theta$,
2674: $R(\theta)\geq R(\T) + c D(\theta, \T)^{\kappa}$.
2675: Let us also assume that for some positive real
2676: constant $d$ and any positive real parameter $\gamma$,
2677: $\pi_{\exp( - \gamma R)}(R) - \inf R \leq \frac{d}{\gamma}$.
2678: Then
2679: \begin{multline*}
2680: \PP \Bigl[ \pi_{\exp \bigl\{ -
2681: 2^{-1}[ 8 \log(2) d ]^{\frac{\kappa-1}{2 \kappa -1}}
2682: (\kappa c)^{\frac{1}{2 \kappa -1}}
2683: N^{\frac{\kappa}{2 \kappa -1 }}
2684: r\bigr\}}(R) \Bigr]
2685: \\ \leq \inf R + (2 - \kappa^{-1}) (\kappa c)^{-\frac{1}{2 \kappa - 1}}
2686: \left( \frac{ 8 \log(2) d}{N} \right)^{\frac{\kappa}{2 \kappa - 1}}.
2687: \end{multline*}
2688: \end{cor}
Let us remark that the exponent of $N$ in this corollary is
2690: known to be the minimax exponent under these assumptions:
2691: it is unimprovable, whatever estimator is used in place of
2692: the Gibbs posterior shown here (at least in the worst case
2693: compatible with the hypotheses). The interest of the corollary
2694: is to show not only the minimax exponent in $N$, but also
2695: an explicit non asymptotic bound with reasonable and simple
2696: constants. It is also clear that we could have got slightly
2697: better constants if we had kept the full strength of Theorem
2698: \ref{thm2.2.19} (page \pageref{thm2.2.19})
2699: instead of using the weaker Corollary \ref{cor4.3}
2700: (page \pageref{cor4.3}).
2701:
In what follows, we will prove empirical bounds showing
how the constant $\lambda$ can be estimated from the data
instead of being chosen according to some margin and
complexity assumptions.
2706:
2707: \subsubsection{Unbiased empirical bounds}
We are going to provide an empirical counterpart for the
{\em expected margin function} $\varphi$. It will appear
in empirical bounds having otherwise the same structure as
the non random bound we just proved. However, we will not
attempt to compare the behaviour of our proposed
{\em empirical margin function} with that of the {\em expected margin function},
since the margin function involves taking a supremum
which is not straightforward to handle.
2716:
2717: Let us start as in the previous subsection with the inequality
2718: \begin{multline*}
2719: \beta \PP \Bigl\{ \rho\bigl[\R(\cdot,\T\,) \bigr] \Bigr\} \leq
2720: \PP \Bigl\{ \lambda \rho\bigl[ r'(\cdot, \T\,) \bigr]+ \C{K}(\rho, \pi) \Bigr\}
2721: \\ + \log \Bigl\{ \pi \Bigl[ \exp \bigl\{ - \lambda \Psi_{\frac{\lambda}{N}}\bigl[\R
2722: (\cdot, \T\,), \BM(\cdot, \T\,) \bigr] + \beta \R(\cdot, \T\,) \, \bigr\} \Bigr]
2723: \Bigr\} .
2724: \end{multline*}
2725: We have already defined by equation \eqref{eq1.3} the empirical pseudo distance
2726: \newcommand{\m}{{m'}}
2727: $$
2728: \m( \theta, \T\,) = \frac{1}{N} \sum_{i=1}^N \psi_i(\theta, \T\,)^2.
2729: $$
2730: Recalling that $\PP \bigl[ \m(\theta, \T\,) \bigr] = \BM(\theta, \T\,)$,
2731: and using the convexity of $h \mapsto \log \Bigl\{ \pi \bigl[ \exp( h ) \bigr] \Bigr\}$,
2732: leads to the following inequalities:
2733: \begin{multline*}
2734: \log \Bigl\{ \pi \Bigl[ \exp \bigl\{ - \lambda \Psi_{\frac{\lambda}{N}}\bigl[
2735: \R(\cdot, \T\,), \BM(\cdot, \T\,)\bigr] + \beta \R(\cdot, \T\,) \bigr\} \Bigr] \Bigr\}
2736: \\*\shoveleft{\qquad \leq \log \Bigl\{ \pi \Bigl[ \exp \bigl\{
2737: - N \sinh(\tfrac{\lambda}{N}) \R(\cdot, \T\,) }
2738: \\ \shoveright{+ N \sinh(\tfrac{\lambda}{N})\tanh(\tfrac{\lambda}{2N}) \BM(\cdot, \T\,)
+ \beta \R(\cdot,\T\,) \bigr\} \Bigr] \Bigr\} \qquad}
2740: \\* \leq \PP \biggl\{
2741: \log \Bigl\{ \pi \Bigl[
2742: \exp \bigl\{ - \bigl[N \sinh(\tfrac{\lambda}{N})
2743: - \beta \bigr] \rr(\cdot, \T\,)
2744: \\ + N \sinh(\tfrac{\lambda}{N}) \tanh(\tfrac{\lambda}{2N})
2745: \m(\cdot, \T\,) \bigr\} \Bigr] \Bigr\} \biggr\}.
2746: \end{multline*}
2747: We may moreover remark that
2748: \begin{multline*}
2749: \lambda \rho\bigl[ \rr(\cdot, \T\,) \bigr]
2750: + \C{K}(\rho, \pi)
2751: = \bigl[ \beta - N \sinh(\tfrac{\lambda}{N}) + \lambda \bigr]
2752: \rho \bigl[ \rr(\cdot, \T\,)\bigr] \\ + \C{K}\bigl[ \rho, \pi_{\exp \{-[ N \sinh(\frac{\lambda}{N}) - \beta
2753: ] r \}} \bigr] \\ - \log \Bigl\{ \pi \Bigl[ \exp \bigl\{
2754: - \bigl[ N \sinh(\tfrac{\lambda}{N}) - \beta \bigr] \rr(\cdot, \T\,) \bigr\} \Bigr]
2755: \Bigr\}.
2756: \end{multline*}
This completes the proof of
2758: \begin{thm}
2759: \mypoint For any positive real parameters $\beta$ and $\lambda$,
2760: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
2761: \begin{multline*}
2762: \PP \bigl\{ \rho\bigl[ \R(\cdot, \T\,) \bigr] \bigr\}
2763: \leq \PP \biggl\{
2764: \biggl[ 1 - \frac{ N \sinh(\frac{\lambda}{N}) - \lambda}{\beta} \biggr]
2765: \rho\bigl[ \rr(\cdot, \T\,)\bigr]
2766: \\\shoveright{ + \frac{\C{K}\bigl[\rho, \pi_{\exp \{ - [ N \sinh(\frac{\lambda}{N})
2767: - \beta ] r \}} \bigr]}{\beta} \qquad}
2768: \\ + \beta^{-1}
2769: \log \Bigl\{
2770: \pi_{\exp \{ - [N \sinh(\frac{\lambda}{N}) - \beta ] r \}} \Bigl[
2771: \exp \bigl[ N \sinh(\tfrac{\lambda}{N}) \tanh(\tfrac{\lambda}{2N})\m(\cdot, \T\,)
2772: \bigr] \Bigr] \Bigr\} \biggr\}.
2773: \end{multline*}
2774: \end{thm}
Taking $\beta = \frac{N}{2} \sinh (\frac{\lambda}{N})$, using the
fact that $\sinh(a) \geq a$, $a \geq 0$ and expressing
$\tanh(\frac{a}{2}) = \sinh(a)^{-1} \bigl[ \sqrt{1 + \sinh(a)^2}- 1 \bigr]$
and $a = \log \bigl[ \sqrt{1 + \sinh(a)^2} + \sinh(a) \bigr]$,
we deduce
2780: \begin{cor}
2781: \mypoint For any positive real constant $\beta$ and any posterior distribution
2782: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
2783: \begin{multline*}
2784: \PP \bigl\{ \rho\bigl[ \R(\cdot, \T\,) \bigr] \bigr\} \leq
2785: \PP \Biggl\{ \underbrace{\biggl[ \tfrac{N}{\beta}\log \Bigl(
2786: \sqrt{1 + \tfrac{4 \beta^2}{N^2}} + \tfrac{2 \beta}{N} \Bigr) - 1 \biggr]}_{\leq 1}
2787: \rho\bigl[ \rr(\cdot, \T\,) \bigr] \\
2788: \shoveleft{\qquad
2789: + \frac{1}{\beta} \biggl\{ \C{K}\bigl[ \rho,\pi_{\exp( - \beta r)} \bigr]}
2790: \\ + \log \biggl[ \pi_{\exp( - \beta r)} \Bigl\{ \exp \Bigl[ N\Bigl(
2791: \sqrt{1 + \tfrac{4 \beta^2}{N^2}}
2792: - 1 \Bigr) \m(\cdot, \T\,) \Bigr] \Bigr\} \biggr] \biggr\} \Biggr\}.
2793: \end{multline*}
2794: \end{cor}
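Let us spell out the substitutions behind this corollary.
The choice of $\beta$ means that $\sinh(\frac{\lambda}{N}) = \frac{2 \beta}{N}$,
so that $\cosh(\frac{\lambda}{N}) = \sqrt{1 + \frac{4 \beta^2}{N^2}}$ and
$$
\lambda = N \log \Bigl( \sqrt{1 + \tfrac{4 \beta^2}{N^2}} + \tfrac{2 \beta}{N} \Bigr),
\qquad N \sinh(\tfrac{\lambda}{N}) - \beta = \beta,
$$
$$
N \sinh(\tfrac{\lambda}{N}) \tanh(\tfrac{\lambda}{2N})
= N \bigl[ \cosh(\tfrac{\lambda}{N}) - 1 \bigr]
= N \Bigl( \sqrt{1 + \tfrac{4 \beta^2}{N^2}} - 1 \Bigr).
$$
The factor in front of $\rho\bigl[ \rr(\cdot, \T\,) \bigr]$ becomes
$1 - \frac{2 \beta - \lambda}{\beta} = \frac{\lambda}{\beta} - 1$,
which is not greater than $1$ because
$\lambda \leq N \sinh(\frac{\lambda}{N}) = 2 \beta$.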
This theorem and its corollary are closely analogous to
Theorem \ref{thm2.2.19} (page \pageref{thm2.2.19}), and it
could easily be proved that under Mammen and Tsybakov's margin assumptions,
we obtain an upper bound of the same order as Corollary \ref{cor1.1.23}
(page \pageref{cor1.1.23}).
However, in order to obtain an empirical bound, we are now going to take
a supremum over all possible values of $\T$, that is over $\Theta_1$.
2802: Although we believe that taking this supremum will not spoil the bound
2803: in cases when overfitting remains under control, we will not try
2804: to investigate precisely if and when this is actually true, and
provide our empirical bound as such. Let us only say that, on qualitative
grounds, the values of the margin function quantify how steep the
contrast function $R$ or its empirical counterpart $r$ is, and
that the definition
2809: of the empirical margin function is obtained by substituting $\PP$, the true
2810: sample distribution, with $\overline{\PP} = \bigl( \frac{1}{N} \sum_{i=1}^N
2811: \delta_{(X_i, Y_i)}\bigr)^{\otimes N}$, the empirical sample distribution,
in the definition of the expected margin function. Therefore, on qualitative
grounds, it seems hopeless to presume that $R$ is steep when $r$ is
not, or in other words that a classification model that would be inefficient
at estimating a bootstrapped sample according to our non random bound
would be by some miracle efficient at estimating the true sample distribution
according to the same bound. To this extent, we feel that our empirical
bounds bring a satisfactory counterpart of our non random bounds.
However, we will also produce estimators which can be proved
2820: to be adaptive
2821: using PAC-Bayesian tools in the next subsection, at the price of
2822: a more sophisticated construction involving comparisons between
2823: a posterior distribution and a Gibbs prior distribution.
2824:
2825: \newcommand{\Btheta}{\widehat{\theta}}
Let us now restrict ourselves to the important case when $\T \in \arg\min_{\Theta_1} R$.
2827: To obtain an observable bound, let $\Btheta \in \arg\min_{\theta
2828: \in \Theta} r(\theta)$ and let us introduce the {\em empirical margin
2829: functions}
2830: \newcommand{\Tphi}{\widetilde{\varphi}}
2831: \newcommand{\Bphi}{\overline{\varphi}}
2832: \begin{align*}
2833: \Bphi(x) & = \sup_{\theta \in \Theta} \m(\theta, \Btheta) - x \bigl[
2834: r(\theta) - r(\Btheta) \bigr], \quad x \in \RR_+,\\
2835: \Tphi(x) & = \sup_{\theta \in \Theta_1} \m(\theta, \Btheta) - x \bigl[
2836: r(\theta) - r(\Btheta) \bigr], \quad x \in \RR_+.
2837: \end{align*}
2838: Using the fact that $\m(\theta, \T) \leq \m(\theta, \Btheta)
2839: + \m(\Btheta, \T)$, we get
2840: \begin{cor}
2841: \mypoint For any positive real parameters $\beta$ and $\lambda$,
2842: for any posterior distribution $\rho : \Omega
2843: \rightarrow \C{M}_+^1(\Theta)$,
2844: \begin{multline*}
2845: \PP \bigl[ \rho (R) \bigr] - \inf_{\Theta_1} R
2846: \leq \PP \biggl\{
2847: \Bigl[ 1 - \tfrac{ N \sinh(\frac{\lambda}{N}) - \lambda}{\beta}
2848: \Bigr] \bigl[ \rho(r) - r(\Btheta)\bigr] \\
2849: + \frac{ \C{K}\bigl[ \rho, \pi_{\exp\{-[N \sinh(\frac{\lambda}{N})
2850: - \beta]r\}} \bigr]}{\beta}\\
2851: + \beta^{-1} \log \Bigl\{ \pi_{\exp \{-[N \sinh(\frac{\lambda}{N})
2852: - \beta]r\}} \Bigl[ \exp \bigl[
2853: N \sinh\bigl(\tfrac{\lambda}{N}\bigr) \tanh\bigl(\tfrac{\lambda}{2N}\bigr) \m(\cdot,\Btheta)
2854: \bigr] \Bigr] \Bigr\} \\ +
2855: \beta^{-1}N \sinh(\tfrac{\lambda}{N}) \tanh(\tfrac{\lambda}{2N})
2856: \Tphi \biggl[ \frac{\beta}{N\sinh(\frac{\lambda}{N}) \tanh(\frac{\lambda}{
2857: 2N})} \left(1 - \frac{N\sinh(\frac{\lambda}{N}) - \lambda}{\beta}
2858: \right)\biggr] \biggr\}.
2859: \end{multline*}
2860: Taking $\beta = \frac{N}{2} \sinh(\frac{\lambda}{N})$, we also
2861: obtain
2862: \begin{multline*}
2863: \PP \bigl[ \rho(R) \bigr] - \inf_{\Theta_1} R \leq
2864: \PP \Biggl\{ \underbrace{\biggl[ \tfrac{N}{\beta}\log \Bigl(
2865: \sqrt{1 + \tfrac{4 \beta^2}{N^2}}
2866: + \tfrac{2 \beta}{N} \Bigr) - 1 \biggr]}_{\leq 1}
2867: \bigl[ \rho(r) - r(\Btheta) \bigr] \\
2868: \shoveleft{\qquad + \frac{1}{\beta} \biggl\{ \C{K}\bigl[
2869: \rho,\pi_{\exp( - \beta r)} \bigr]}
2870: \\\qquad + \log \biggl[ \pi_{\exp( - \beta r)} \Bigl\{ \exp \Bigl[ N\Bigl(
2871: \sqrt{1 + \tfrac{4 \beta^2}{N^2}}
2872: - 1 \Bigr) \m(\cdot, \Btheta) \Bigr] \Bigr\} \biggr] \biggr\} \\
2873: + \frac{N}{\beta}\Bigl(\sqrt{1 + \tfrac{4 \beta^2}{N^2}} - 1\Bigr)
2874: \Tphi \Biggl[ \frac{\log \Bigl( \sqrt{1 + \frac{4 \beta^2}{N^2}}
2875: + \frac{2 \beta}{N} \Bigr) - \frac{\beta}{N}}{\Bigl(
2876: \sqrt{1 + \frac{4 \beta^2}{N^2}} - 1 \Bigr)}\Biggr]
2877: \Biggr\}.
2878: \end{multline*}
2879: \end{cor}
2880: Note that we could also use the upper bound
2881: $\m(\theta, \Btheta) \leq x \bigl[ r(\theta) - r(\Btheta)
2882: \bigr] + \Bphi(x)$ and put $\alpha =
2883: N \sinh(\frac{\lambda}{N}) \bigl[ 1 -
2884: x \tanh(\frac{\lambda}{2N}) \bigr] - \beta$, to obtain
2885: \begin{cor}
2886: \label{cor1.1.27}
2887: \mypoint For any non negative
2888: real parameters $x$, $\alpha$ and $\lambda$,
2889: such that $\alpha < N \sinh(\frac{\lambda}{N}) \bigl[
2890: 1 - x \tanh(\frac{\lambda}{2N}) \bigr]$, for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
2891: \begin{multline*}
2892: \PP \bigl[ \rho(R) \bigr] - \inf_{\Theta_1} R
2893: \\ \shoveleft{\quad \leq \PP
2894: \Biggl\{ \biggl[ 1 - \frac{N\sinh(\frac{\lambda}{N})\bigl[1 - x
2895: \tanh(\frac{\lambda}{2N})\bigr] - \lambda}{
2896: N \sinh(\frac{\lambda}{N})\bigl[ 1 - x \tanh(\frac{\lambda}{2N})
2897: \bigr] - \alpha} \biggr] \bigl[ \rho(r) - r(\Btheta) \bigr]}
2898: \\ \shoveleft{\quad \qquad \qquad + \frac{\C{K} \bigl[ \rho, \pi_{\exp(- \alpha r)} \bigr]}{
2899: N \sinh(\frac{\lambda}{N})\bigl[1 - x \tanh(\frac{\lambda}{2N})\bigr]
2900: - \alpha} }\\
2901: \shoveleft{\quad\qquad \qquad + \frac{N\sinh(\tfrac{\lambda}{N})
2902: \tanh(\tfrac{\lambda}{2N})}{
2903: N \sinh(\frac{\lambda}{N}) \bigl[ 1 - x \tanh(\frac{\lambda}{2N}) \bigr]
2904: - \alpha}}\\\times
2905: \biggl[ \Bphi(x) + \Tphi \biggl(
2906: \frac{\lambda - \alpha}{N \sinh(\frac{\lambda}{N})
2907: \tanh(\frac{\lambda}{2N})}\biggr) \biggr] \Biggr\}.
2908: \end{multline*}
2909: \end{cor}
2910: Let us notice that in the case when $\Theta_1 = \Theta$,
2911: the upper bound provided by this corollary
2912: has the same general form as the upper bound provided by Corollary
2913: \ref{cor1.1.21} (page \pageref{cor1.1.21}), with the sample
2914: distribution $\PP$ replaced with
2915: the empirical distribution of the sample $\overline{\PP}
2916: = \bigl( \frac{1}{N} \sum_{i=1}^N \delta_{(X_i, Y_i)} \bigr)^{\otimes N}$.
2917: Therefore, our empirical bound can be of a larger order of magnitude
2918: than our non random bound only in the case when our non random
2919: bound applied to the bootstrapped sample distribution $\overline{\PP}$
2920: would be of a larger order of magnitude than when applied to
2921: the true sample distribution $\PP$. In other words, we can say that
2922: our empirical bound is close to our non random bound in every situation
2923: where the bootstrapped sample distribution $\overline{\PP}$ is not
2924: harder to bound than the true sample distribution $\PP$. Although
2925: this does not prove that our empirical bound is always of the same
2926: order as our non random bound, this is a good qualitative hint that
2927: this will be the case in most practical situations of interest,
2928: since in situations of ``underfitting'', if they exist, it is likely
2929: that the choice of the classification model is inappropriate to the data
2930: and should be modified.
2931:
2932: Another reassuring remark is that the empirical margin functions
2933: $\Bphi$ and $\Tphi$ behave well in the case when $\inf_{\Theta} r
2934: = 0$. Indeed in this case $m'(\theta, \wtheta)
2935: = r'(\theta, \wtheta) = r(\theta)$, $\theta \in \Theta$,
2936: and thus $\Bphi(1) = \Tphi(1) = 0$, and\\
2937: \mbox{}\hfill $\Tphi(x)
2938: \leq - (x -1 ) \inf_{\Theta_1} r$, $x \geq 1$.\hfill \mbox{}\\
2939: This shows that we recover in this case the same
2940: accuracy as with non relative local empirical bounds.
2941: Thus the bound of Corollary \ref{cor1.1.27} does not
collapse in the presence of massive overfitting in the larger
2943: model, causing $r(\wtheta) = 0$, which is another hint
2944: that this may be an accurate bound in many situations.
2945:
2946: \subsubsection{Relative empirical deviation bounds}
2947:
2948: It is natural to make use of Theorem \ref{thm2.2.18}
2949: on page \pageref{thm2.2.18} to obtain
2950: empirical deviation bounds, since this theorem provides an empirical
2951: variance term.
2952:
2953: Theorem \ref{thm2.2.18} is written in a way which exploits the
fact that $\psi_i$ takes only the three values $-1$, $0$ and $+1$.
2955: However, it will be more convenient for the following computations
2956: to use it in its more general form, which only makes use of the
fact that $\psi_i \in [-1, 1]$.
2958: With notations to be
2959: explained hereafter, it can indeed also be written as
2960: \newcommand{\BP}{\overline{P}}
2961: \begin{multline}
2962: \label{eq2.2.2}
2963: \PP \Biggl\{ \exp \Biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)} \biggl\{
2964: - N \rho \Bigl\{ \log \Bigl[ 1 - \lambda P(\psi) \Bigr] \Bigr\}
2965: \\ + N \rho \Bigl\{ \BP \Bigl[ \log(1 - \lambda \psi) \Bigr]
2966: \Bigr\} - \C{K}(\rho,\pi) \biggr\} \Biggr] \Biggr\} \leq 1.
2967: \end{multline}
2968: We have used the following notations in this inequality. We have put
2969: $$
2970: \BP = \frac{1}{N} \sum_{i=1}^N \delta_{(X_i,Y_i)},
2971: $$
2972: so that $\BP$ is our notation for the empirical distribution of the
2973: process \linebreak $(X_i,Y_i)_{i=1}^N$. Moreover we have also used
2974: $$
2975: P = \PP(\BP) = \frac{1}{N} \sum_{i=1}^N P_i,
2976: $$
2977: where it should be remembered that the joint distribution of the
2978: process $(X_i,Y_i)_{i=1}^N$ is $\PP = \bigotimes_{i=1}^N P_i$.
2979: We have considered $\psi(\theta, \T)$ as a function defined on $\C{X} \times \C{Y}$,\\
2980: \mbox{}\hfill as $\psi(\theta, \T) (x,y) = \B{1}\bigl[ y \neq f_{\theta}(x) \bigr] - \B{1} \bigl[
2981: y \neq f_{\T}(x) \bigr], \quad (x,y) \in \C{X} \times \C{Y} $ \hfill\mbox{}\\
2982: so that it should be understood that
2983: \begin{multline*}
2984: P(\psi) = \frac{1}{N} \sum_{i=1}^N \PP \bigl[ \psi_i(\theta, \T) \bigr]
2985: \\ = \frac{1}{N} \sum_{i=1}^N \PP \Bigl\{
2986: \B{1} \bigl[ Y_i \neq f_{\theta}(X_i) \bigr] - \B{1} \bigl[
2987: Y_i \neq f_{\T}(X_i) \bigr] \Bigr\} = R'(\theta, \T).
2988: \end{multline*}
2989: In the same way
2990: $$
2991: \BP \Bigl[ \log(1 - \lambda \psi) \Bigr]
2992: = \frac{1}{N} \sum_{i=1}^N \log \bigl[ 1 - \lambda \psi_i(\theta, \T) \bigr].
2993: $$
2994: Moreover integration with respect to $\rho$ bears on the index $\theta$,
2995: so that
2996: \begin{align*}
2997: \rho \Bigl\{ \log \Bigl[ 1 - \lambda P(\psi) \Bigr] \Bigr\}
2998: & = \int_{\theta \in \Theta} \log \biggl\{ 1 - \frac{\lambda}{N}
2999: \sum_{i=1}^N \PP\bigl[ \psi_i(\theta, \T) \bigr] \biggr\} \rho(d \theta),\\
3000: \rho \Bigl\{ \BP \Bigl[ \log (1 - \lambda \psi) \Bigr] \Bigr\}
3001: & = \int_{\theta \in \Theta} \biggl\{ \frac{1}{N} \sum_{i=1}^N \log \bigl[
3002: 1 - \lambda \psi_i(\theta, \T) \bigr] \biggr\} \rho(d \theta).
3003: \end{align*}
3004:
3005: We have chosen concise notations, as we did throughout these notes,
3006: in order to make the computations easier to follow.
3007:
3008: To get an alternate version of empirical relative deviation bounds,
3009: we need to find some convenient way to localize the choice of
3010: the prior distribution $\pi$ in equation (\ref{eq2.2.2},
3011: page \pageref{eq2.2.2}).
3012: Here we propose to replace
3013: $\pi$ with $\mu = \pi_{\exp \{ - N \log[1 + \beta P(\psi)] \}}$,
3014: which can also be written $\pi_{\exp \{ - N \log[1 + \beta
3015: R'(\cdot, \T)]\}}$. Indeed we see that
3016: \begin{multline*}
3017: \C{K}(\rho, \mu)
3018: = N \rho \Bigl\{ \log \bigl[ 1 + \beta P(\psi) \bigr] \Bigr\}
3019: + \C{K}(\rho, \pi)
3020: \\ + \log \Bigl\{ \pi \Bigl[ \exp \bigl\{
3021: - N \log \bigl[ 1 + \beta P(\psi) \bigr] \bigr\} \Bigr] \Bigr\}.
3022: \end{multline*}
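This identity comes from the expression of the density of $\mu$ with
respect to $\pi$, namely
$$
\frac{d \mu}{d \pi} = \frac{\exp \bigl\{ - N \log \bigl[ 1 + \beta P(\psi) \bigr] \bigr\}}{
\pi \Bigl[ \exp \bigl\{ - N \log \bigl[ 1 + \beta P(\psi) \bigr] \bigr\} \Bigr]},
$$
and from the fact that
$\C{K}(\rho, \mu) = \C{K}(\rho, \pi) - \rho \bigl[ \log \bigl( \tfrac{d \mu}{d \pi}
\bigr) \bigr]$ when $\rho$ is absolutely continuous with respect to $\pi$,
both sides being infinite otherwise.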
3023: Moreover, we deduce from our deviation inequality applied
3024: to $- \psi$, that (as long as $\beta > -1$),
3025: $$
3026: \PP \biggl\{ \exp \biggl[ N \mu \Bigl\{ \BP \bigl[
3027: \log( 1 + \beta \psi) \bigr] \Bigr\}
3028: -N \mu \Bigl\{ \log \bigl[ 1 + \beta P(\psi) \bigr] \Bigr\}
3029: \biggr] \biggr\} \leq 1.
3030: $$
3031: Thus
3032: \begin{multline*}
3033: \PP \biggl\{ \exp \biggl[
3034: \log \Bigl\{ \pi \Bigl[ \exp \bigl\{
3035: - N \log \bigl[ 1 + \beta P(\psi) \bigr] \bigr\} \Bigr] \Bigr\}
3036: \\ \shoveright{- \log \Bigl\{ \pi \Bigl[ \exp \bigl\{
3037: - N \BP \bigl[ \log(1 + \beta \psi) \bigr] \bigr\} \Bigr] \Bigr\}
\biggr] \biggr\}\qquad}
3039: \\ \leq
3040: \PP \biggl\{ \exp \biggl[
3041: - N \mu \Bigl\{ \log \bigl[ 1 + \beta P(\psi) \bigr] \Bigr\}
3042: - \C{K}(\mu,\pi) \\ + N \mu \Bigl\{
3043: \BP \bigl[ \log(1 + \beta \psi) \bigr] \Bigr\} + \C{K}(\mu, \pi) \biggr] \biggr\}
3044: \leq 1.
3045: \end{multline*}
This can be used to handle $\C{K}(\rho, \mu)$, making use
of the Cauchy--Schwarz inequality as follows:
3048: \begin{multline*}
3049: \PP \Biggl\{ \exp \Biggl[ \frac{1}{2} \biggl[
3050: -N \log \Bigl\{ \Bigl( 1 - \lambda \rho\bigl[P(\psi)\bigr] \Bigr)
3051: \Bigl( 1 + \beta \rho \bigl[ P (\psi) \bigr] \Bigr) \Bigr\}
3052: \\* \shoveright{ \begin{aligned} + N \rho \Bigl\{ & \BP \Bigl[ \log
3053: ( 1 - \lambda \psi) \Bigr] \Bigr\}
3054: \\* & - \C{K}(\rho, \pi) - \log \Bigl\{ \pi \Bigl[
3055: \exp \bigl\{ - N \BP \bigl[ \log(1 + \beta \psi) \bigr]
3056: \bigr\} \Bigr] \Bigr\} \biggr] \Biggr] \Biggr\}\end{aligned}}
3057: \\* \shoveleft{\qquad \leq \PP \Biggl\{ \exp \Biggl[ - N \log \Bigl\{ \Bigl(
3058: 1 - \lambda \rho \bigl[ P(\psi) \bigr] \Bigr) \Bigr\}}
3059: \\*\shoveright{ + N \rho \Bigl\{ \BP \Bigl[ \log(1 - \lambda \psi) \Bigr] \Bigr\}
3060: - \C{K}(\rho, \mu) \Biggr] \Biggr\}^{1/2} \qquad} \\
3061: \shoveleft{\qquad \times \PP \Biggl\{ \exp \Biggl[ \log
3062: \Bigl\{ \pi \Bigl[ \exp \bigl\{
3063: - N \log \bigl[1 + \beta P(\psi)\bigr] \bigr\} \Bigr] \Bigr\} }
3064: \\*- \log \Bigl\{ \pi \Bigl[ \exp \bigl\{ - N \BP \bigl[
3065: \log(1 + \beta \psi) \bigr] \bigr\} \Bigr] \Bigr\} \Biggr] \Biggr\}^{1/2}
3066: \leq 1.
3067: \end{multline*}
3068: This implies that with $\PP$ probability at least $1 - \epsilon$,
3069: \begin{multline*}
3070: -N \log \Bigl\{ \Bigl( 1 - \lambda \rho\bigl[P(\psi)\bigr] \Bigr)
3071: \Bigl( 1 + \beta \rho \bigl[ P (\psi) \bigr] \Bigr) \Bigr\}
3072: \\ \begin{aligned} \leq -N \rho & \Bigl\{ \BP \Bigl[ \log
3073: ( 1 - \lambda \psi) \Bigr] \Bigr\}
3074: \\ & + \C{K}(\rho, \pi) + \log \Bigl\{ \pi \Bigl[
3075: \exp \bigl\{ - N \BP \bigl[ \log(1 + \beta \psi) \bigr]
3076: \bigr\} \Bigr] \Bigr\} -
3077: 2 \log(\epsilon).\end{aligned}
3078: \end{multline*}
3079: It is now convenient to remember that
3080: $$
3081: \BP \Bigl[\log(1 - \lambda \psi) \Bigr]
3082: = \frac{1}{2} \log \left( \frac{1 - \lambda}{1 + \lambda} \right) r'(\theta, \T)
3083: + \frac{1}{2} \log (1 - \lambda^2) m'(\theta, \T).
3084: $$
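This identity can be checked value by value: since
$\psi_i(\theta, \T)$ takes its values in $\{-1, 0, +1\}$,
$$
\log \bigl[ 1 - \lambda \psi_i(\theta, \T) \bigr]
= \frac{\psi_i(\theta, \T)}{2} \log \left( \frac{1 - \lambda}{1 + \lambda} \right)
+ \frac{\psi_i(\theta, \T)^2}{2} \log(1 - \lambda^2),
$$
both sides being equal to $\log(1 - \lambda)$ when $\psi_i(\theta, \T) = 1$,
to $0$ when $\psi_i(\theta, \T) = 0$ and to $\log(1 + \lambda)$ when
$\psi_i(\theta, \T) = -1$. Averaging over $i$ gives the stated formula,
since $r'(\theta, \T) = \frac{1}{N} \sum_{i=1}^N \psi_i(\theta, \T)$
and $m'(\theta, \T) = \frac{1}{N} \sum_{i=1}^N \psi_i(\theta, \T)^2$.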
3085: We thus can write the previous inequality as
3086: \begin{multline*}
3087: - N \log \Bigl\{ \Bigl( 1 - \lambda \rho\bigl[R'(\cdot,\T) \bigr] \Bigr)
3088: \Bigl(1 + \beta \rho \bigl[ R'(\cdot,\T) \bigr] \Bigr) \Bigr\} \\ \leq
3089: \frac{N}{2} \log \left( \frac{1+\lambda}{1-\lambda}\right)
3090: \rho \bigl[ r'(\cdot,\T) \bigr] - \frac{N}{2} \log(1 - \lambda^2)
3091: \rho \bigl[ m'(\cdot, \T) \bigr] +
3092: \C{K}(\rho, \pi) \\ \begin{aligned}+ \log \biggl\{ \pi \biggl[
3093: \exp \Bigl\{ & - \frac{N}{2}
3094: \log \Bigl( \frac{1 + \beta}{1 - \beta} \Bigr) r'(\cdot, \T)
3095: \\ & - \frac{N}{2} \log( 1 - \beta^2) m'(\cdot, \T) \Bigr\} \biggr] \biggr\}
3096: - 2 \log(\epsilon).\end{aligned}
3097: \end{multline*}
3098: Let us assume now that $\T \in \arg\min_{\Theta_1} R$.
3099: Let us introduce $\Btheta \in \arg\min_{\Theta} r$.
3100: Decomposing
3101: $r'(\theta, \T) = r'(\theta, \Btheta) + r'(\Btheta,\T)$ and
3102: considering that \\
3103: \mbox{} \hfill $m'(\theta, \T) \leq m'(\theta,
3104: \Btheta) + m'(\Btheta,\T)$, \hfill \mbox{}\\
3105: we see that with $\PP$ probability at least $1 - \epsilon$,
3106: for any posterior distribution $\rho :
3107: \Omega \rightarrow \C{M}_+^1(\Theta)$,
3108:
3109: \begin{multline*}
3110: - N \log \Bigl\{ \Bigl( 1 -
3111: \lambda \rho \bigl[ R'(\cdot, \T) \bigr] \Bigr) \Bigl(
1 + \beta \rho \bigl[ R'(\cdot, \T) \bigr] \Bigr) \Bigr\}
3113: \\* \leq \frac{N}{2} \log \biggl( \frac{1 + \lambda}{1 - \lambda} \biggr)
3114: \rho \bigl[ r'(\cdot, \Btheta) \bigr] -
3115: \frac{N}{2} \log(1 - \lambda^2) \rho \bigl[ m'(\cdot, \Btheta) \bigr]
3116: + \C{K}(\rho,\pi) \\* + \log \biggl\{ \pi \biggl[
3117: \exp \Bigl\{ - \tfrac{N}{2} \log \Bigl( \tfrac{1+\beta}{1-\beta} \Bigr)
3118: \bigl[r'(\cdot, \Btheta\,) \bigr] - \tfrac{N}{2} \log(1 - \beta^2) m'(\cdot, \Btheta\,)
3119: \Bigr\} \biggr] \biggr\} \\*
3120: + \tfrac{N}{2} \log \Bigl[ \tfrac{(1 + \lambda)(1 - \beta)}{(1 - \lambda)(1 + \beta)}
3121: \Bigr] \bigl[ r(\Btheta\,) - r(\T) \bigr]
3122: \\* - \tfrac{N}{2} \log \bigl[ (1 - \lambda^2)(1 - \beta^2) \bigr] m'(\Btheta\,,\T)
3123: - 2 \log(\epsilon).
3124: \end{multline*}
3125:
3126: Let us now define for simplicity the posterior $\nu : \Omega \rightarrow
3127: \C{M}_+^1(\Theta)$ by the identity
3128: $$
3129: \frac{d \nu}{d \pi}(\theta) = \frac{ \exp \Bigl\{
3130: - \frac{N}{2} \log \Bigl( \frac{1+\lambda}{1-\lambda} \Bigr)
3131: r'(\theta,\Btheta) + \frac{N}{2} \log(1 - \lambda^2) m'(\theta, \Btheta)
3132: \Bigr\}}{ \pi
3133: \biggl[ \exp \Bigl\{
3134: - \frac{N}{2} \log \Bigl( \frac{1+\lambda}{1-\lambda} \Bigr)
3135: r'(\cdot,\Btheta) + \frac{N}{2} \log(1 - \lambda^2) m'(\cdot, \Btheta)
\Bigr\}\biggr]}.
3137: $$
3138: Let us also introduce the random bound
3139: \begin{multline*}
3140: B =
3141: \frac{1}{N} \log \biggl\{ \nu \biggl[ \exp \Bigl[ \tfrac{N}{2} \log \Bigl[
3142: \tfrac{(1 + \lambda)(1 - \beta)}{(1 - \lambda) (1 + \beta) } \Bigr]
3143: r'(\cdot, \Btheta) \\ \shoveright{- \tfrac{N}{2} \log \bigl[ (1 - \lambda^2)
3144: (1 - \beta^2) \bigr] m'(\cdot, \Btheta\,) \Bigr] \biggr] \biggr\}\qquad} \\
3145: \shoveleft{\qquad + \sup_{\theta \in \Theta_1}
\frac{1}{2} \log \Bigl[\tfrac{(1 - \lambda)(1 + \beta)}{(1 + \lambda)(1 - \beta)}
3147: \Bigr]
3148: r'(\theta,\Btheta\,)} \\ - \frac{1}{2} \log\bigl[ (1 - \lambda^2)(1 - \beta^2)\bigr]
3149: m'(\theta,\Btheta\,) - \frac{2}{N} \log(\epsilon).
3150: \end{multline*}
3151: \begin{thm}\mypoint
3152: Using the above notations, for any real constants $0 \leq \beta < \lambda < 1$,
3153: for any prior distribution $\pi \in \C{M}_+^1(\Theta)$,
3154: for any subset $\Theta_1 \subset \Theta$,
3155: with $\PP$ probability at least $1 - \epsilon$,
3156: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
3157: $$
3158: - \log \Bigl\{ \Bigl( 1 - \lambda \bigl[ \rho (R) - \inf_{\Theta_1} R \bigr]
3159: \Bigr)\Bigl(1 + \beta \bigl[ \rho (R) - \inf_{\Theta_1} R \bigr] \Bigr) \Bigr\}
3160: \leq \frac{\C{K}(\rho, \nu)}{N} + B.
3161: $$
3162: Therefore,
3163: \begin{multline*}
3164: \rho(R) - \inf_{\Theta_1} R \\* \leq \frac{\lambda - \beta}{2 \lambda \beta}
3165: \left( \sqrt{1 + 4 \frac{\lambda \beta}{(\lambda - \beta)^2}
3166: \left[ 1 - \exp \left( - B - \frac{\C{K}(\rho, \nu)}{N} \right) \right]}-1\right)
3167: \\ \leq \frac{1}{\lambda - \beta} \left( B + \frac{\C{K}(\rho,\nu)}{N} \right).
3168: \end{multline*}
3169: \end{thm}
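The passage from the first inequality of the theorem to the following ones
can be sketched as follows. This is only a reading aid, in which we write for short
$\Delta = \rho(R) - \inf_{\Theta_1} R$ and $b = B + \C{K}(\rho, \nu)/N$
(two notations introduced for this remark only) and assume that $\beta > 0$.
The first inequality reads $(1 - \lambda \Delta)(1 + \beta \Delta) \geq e^{-b}$, that is
$$
\lambda \beta \Delta^2 + (\lambda - \beta) \Delta - \bigl( 1 - e^{-b} \bigr) \leq 0,
$$
so that $\Delta$ cannot exceed the largest root of this polynomial:
$$
\Delta \leq \frac{ - (\lambda - \beta) + \sqrt{ (\lambda - \beta)^2 + 4 \lambda \beta
\bigl( 1 - e^{-b} \bigr)}}{2 \lambda \beta}
= \frac{\lambda - \beta}{2 \lambda \beta} \left( \sqrt{ 1 + \frac{4 \lambda \beta}{
(\lambda - \beta)^2} \bigl( 1 - e^{-b} \bigr)} - 1 \right).
$$
The last inequality of the theorem then follows from
$\sqrt{1 + u} - 1 \leq \frac{u}{2}$ and $1 - e^{-b} \leq b$.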
3170: Let us define the posterior $\widehat{\nu}$ by the identity
3171: $$
3172: \frac{d\widehat{\nu}}{d\pi} (\theta) = \frac{\exp
3173: \Bigl[ - \frac{N}{2} \log \left(
3174: \frac{1+\beta}{1-\beta}\right) r'(\theta, \Btheta) - \frac{N}{2}
3175: \log(1 - \beta^2) m'(\theta, \Btheta)\Bigr]}{
3176: \pi \Bigl\{ \exp
3177: \Bigl[ - \frac{N}{2} \log \left(
3178: \frac{1+\beta}{1-\beta}\right) r'(\cdot, \Btheta) - \frac{N}{2}
3179: \log(1 - \beta^2) m'(\cdot, \Btheta)\Bigr]\Bigr\}}.
3180: $$
3181: It is useful to remark that
3182: \begin{multline*}
3183: \frac{1}{N} \log \biggl\{ \nu \biggl[ \exp \Bigl[ \frac{N}{2} \log \Bigl(
3184: \frac{(1 + \lambda)(1 - \beta)}{(1 - \lambda) (1 + \beta) } \Bigr)
3185: r'(\cdot, \Btheta) \\ \shoveright{- \frac{N}{2} \log \bigl[ (1 - \lambda^2)
3186: (1 - \beta^2) \bigr] m'(\cdot, \Btheta) \Bigr] \biggr] \biggr\}\qquad} \\
\shoveleft{\qquad \leq
3188: \widehat{\nu}
3189: \biggl\{ \frac{1}{2}
3190: \log \Bigl( \frac{(1+\lambda)(1-\beta)}{(1 - \lambda)(1+\beta)}\Bigr)
3191: r'( \cdot, \Btheta) }\\ - \frac{1}{2} \log\bigl[ (1 - \lambda^2)(1 - \beta^2) \bigr]
3192: m'(\cdot, \Btheta) \biggr\}.
3193: \end{multline*}
3194: Let us introduce as previously
3195: $
3196: \Bphi(x) = \sup_{\theta \in \Theta} m'(\theta, \Btheta) -
3197: x \, r'(\theta, \Btheta)$, $x \in \RR_+$.
3198: Let us moreover consider $
3199: \Tphi(x) = \sup_{\theta \in \Theta_1} m'(\theta, \Btheta) -
3200: x \, r'(\theta, \Btheta)$, $x \in \RR_+$. These functions can be
3201: used to produce a result which is slightly weaker, but maybe easier
to read and understand. Indeed, going back a few steps,
3203: we see that, for any $x \in \RR_+$, with $\PP$ probability at least $1 - \epsilon$,
3204: for any posterior distribution $\rho$,
3205:
3206: \begin{multline*}
3207: - N \log \Bigl\{\Bigl( 1 - \lambda \rho \bigl[R'(\cdot, \T)\bigr] \Bigr)
3208: \Bigl(1 + \beta \rho \bigl[ R'(\cdot, \T) \bigr] \Bigr) \Bigr\}
3209: \\*\shoveleft{\qquad \leq \frac{N}{2} \log \left[ \frac{(1+\lambda)}{(1-\lambda)(1 - \lambda^2)^x}\right]
3210: \rho \bigl[ r'(\cdot, \Btheta) \bigr] }
3211: \\*\shoveleft{\qquad\qquad - \frac{N}{2} \log\bigl[ (1 - \lambda^2)(1 -
3212: \beta^2) \bigr] \Bphi(x)} + \C{K}(\rho, \pi)
3213: \\*\shoveleft{\qquad\qquad + \log \biggl\{ \pi \biggl[ \exp \Bigl\{
3214: - \tfrac{N}{2} \log \Bigl[ \tfrac{(1+\beta)}{(1-\beta)(1 - \beta^2)^x}\Bigr]
3215: r'(\cdot, \Btheta) \Bigr\} \biggr] \biggr\}
3216: }\\* \shoveleft{\qquad\qquad - \frac{N}{2} \log\bigl[
3217: (1-\lambda^2)(1-\beta^2) \bigr]
3218: \Tphi \left( \frac{ \log \left[ \frac{(1+\lambda)(1-\beta)}{(1-\lambda)(1+\beta)}
3219: \right]}{- \log\left[ (1 - \lambda^2)(1 - \beta^2) \right]} \right)
3220: }\\*\shoveright{- 2 \log(\epsilon)\qquad}
3221: \\ \shoveleft{ \qquad =
3222: \int_{\frac{N}{2} \log \left[ \frac{(1+\beta)}{(1 - \beta)(1 - \beta^2)^x} \right]}^{
3223: \frac{N}{2} \log \left[ \frac{(1+\lambda)}{(1 - \lambda)(1 - \lambda^2)^x} \right]}
3224: \pi_{\exp (- \alpha r)}\bigl[ r'(\cdot, \Btheta)\bigr] d \alpha}
3225: \\* \shoveright{+ \C{K}(\rho, \pi_{\exp \{ - \frac{N}{2} \log [ \frac{(1+\lambda)}{(1-\lambda)
3226: (1-\lambda^2)^x}] r \}}) - 2 \log (\epsilon)\quad}
3227: \\* - \frac{N}{2} \log \bigl[ (1 - \lambda^2)(1 - \beta^2) \bigr]
3228: \left[ \Bphi(x) + \Tphi \left( \frac{\log \left[ \frac{(1+\lambda)(1-\beta)}{(1-\lambda)
3229: (1 + \beta)} \right]}{- \log [ (1 - \lambda^2)(1 - \beta^2) ]} \right) \right].
3230: \end{multline*}
3231: \begin{thm}\mypoint
3232: With the previous notations, for any real constants $0 \leq \beta < \lambda < 1$,
3233: for any positive real constant $x$, for any prior probability distribution
3234: $\pi \in \C{M}_+^1(\Theta)$, for any subset $\Theta_1 \subset \Theta$,
3235: with $\PP$ probability at least $1 - \epsilon$,
3236: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
3237: putting
3238: \begin{multline*}
3239: B(\rho) =
3240: \frac{1}{N(\lambda - \beta)}
3241: \int_{\frac{N}{2} \log \left[ \frac{(1+\beta)}{(1 - \beta)(1 - \beta^2)^x} \right]}^{
3242: \frac{N}{2} \log \left[ \frac{(1+\lambda)}{(1 - \lambda)(1 - \lambda^2)^x} \right]}
3243: \pi_{\exp (- \alpha r)}\bigl[ r'(\cdot, \Btheta)\bigr] d \alpha
3244: \\ + \frac{\C{K}(\rho, \pi_{\exp \{ - \frac{N}{2} \log [ \frac{(1+\lambda)}{(1-\lambda)
3245: (1-\lambda^2)^x}] r \}}) - 2 \log (\epsilon)}{N(\lambda - \beta)}\\
3246: - \frac{1}{2(\lambda - \beta)} \log \bigl[ (1 - \lambda^2)(1 - \beta^2) \bigr]
3247: \left[ \Bphi(x) + \Tphi \left( \frac{\log \left[ \frac{(1+\lambda)(1-\beta)}{(1-\lambda)
3248: (1 + \beta)} \right]}{- \log [ (1 - \lambda^2)(1 - \beta^2) ]} \right) \right]
3249: \\ \shoveleft{\leq
3250: \frac{1}{N(\lambda - \beta)}
3251: d_e \log \left( \frac{\log \Bigl[ \frac{(1+\lambda)}{(1-\lambda)(1-\lambda^2)^x}\Bigr]}{
3252: \log \Bigl(\frac{(1+\beta)}{(1-\beta)(1-\beta^2)^x}\Bigr)}\right)}
3253: \\ + \frac{\C{K}(\rho, \pi_{\exp \{ - \frac{N}{2} \log [ \frac{(1+\lambda)}{(1-\lambda)
3254: (1-\lambda^2)^x}] r \}}) - 2 \log (\epsilon)}{N(\lambda - \beta)}\\
3255: - \frac{1}{2(\lambda - \beta)} \log \bigl[ (1 - \lambda^2)(1 - \beta^2) \bigr]
3256: \left[ \Bphi(x) + \Tphi \left( \frac{\log \left[ \frac{(1+\lambda)(1-\beta)}{(1-\lambda)
3257: (1 + \beta)} \right]}{- \log [ (1 - \lambda^2)(1 - \beta^2) ]} \right) \right],
3258: \end{multline*}
3259: the following bounds hold true:
3260: \begin{multline*}
3261: \rho(R) - \inf_{\Theta_1} R \\ \leq \frac{\lambda - \beta}{2 \lambda \beta}
3262: \Biggl(
3263: \sqrt{
3264: 1 + \frac{4 \lambda \beta}{(\lambda - \beta)^2}
3265: \Bigl\{ 1 - \exp \bigl[ - (\lambda - \beta) B(\rho)
3266: \bigr] \Bigr\}} - 1 \Biggr) \\ \leq B(\rho).
3267: \end{multline*}
3268: \end{thm}
3269: Let us remark that this alternative way of handling
3270: relative deviation bounds
3271: made it possible to carry on with non linear bounds up to the final result.
3272: (For instance, if $\lambda = 0.5$, $\beta = 0.2$ and $B(\rho) = 0.1$,
3273: the non linear bound gives $\rho(R) - \inf_{\Theta_1} R \leq 0.096$.)
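As a check of the arithmetic in this example (with rounded intermediate values),
we have $\frac{\lambda - \beta}{2 \lambda \beta} = 1.5$,
$\frac{4 \lambda \beta}{(\lambda - \beta)^2} = \frac{0.4}{0.09} \simeq 4.4444$
and $1 - \exp(-0.03) \simeq 0.02955$, so that
$$
1.5 \, \Bigl( \sqrt{1 + 4.4444 \times 0.02955} - 1 \Bigr)
\simeq 1.5 \, \bigl( \sqrt{1.1314} - 1 \bigr) \simeq 1.5 \times 0.0637
\simeq 0.0955,
$$
which is consistent with the value $0.096$ quoted above and is to be
compared with the linear bound $B(\rho) = 0.1$.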
3274:
3275: \subsection{Bounds relative to a Gibbs distribution} The empirical bounds
3276: of the previous section
3277: involve taking suprema in $\theta \in \Theta$, and replacing the
{\em margin function} $\varphi$ by some empirical counterparts
3279: $\Bphi$ or $\Tphi$, which may prove unsafe
3280: when using very complex classification models. Moreover,
3281: they are not easy to analyze
3282: with PAC-Bayesian tools. To remedy these
weaknesses, we are now going to propose
3284: another type of relative bounds. We will first explain how to
3285: compare
3286: the expected error rate $\rho(R)$ of any posterior distribution
3287: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$
3288: with $\pi_{\exp( - \beta R)}(R)$,
3289: the expected risk of a Gibbs prior distribution.
3290: We will then show how to analyze the behaviour of this
3291: bound. This will provide an
3292: estimator proven to reach adaptively the best possible
3293: asymptotic behaviour of the error rate under Mammen
3294: and Tsybakov margin assumptions and parametric complexity
3295: assumptions.
3296:
3297: Then, we will provide an empirical bound for the Kullback
3298: divergence $\C{K}(\rho, \pi_{\exp( - \beta R)})$
3299: of a posterior distribution with respect to a Gibbs prior,
3300: making use of relative deviation inequalities.
3301:
3302: To tackle the question of model selection,
3303: we will estimate the relative performance
3304: of one posterior distribution with respect to another,
3305: which is useful when the two posteriors are supported by
3306: different models.
3307:
Finally, we will propose a more integrated approach to model selection,
showing how to build a two-step localization strategy, in which
the performance of the posterior distribution to be analyzed is
compared with some {\em two-step} Gibbs prior.
3312:
3313: \subsubsection{Comparing a posterior distribution with a Gibbs prior}
3314: \newcommand{\wt}[1]{\widetilde{#1}}
3315: Similarly to Theorem \ref{thm2.2.18} we can prove that for any prior distribution
3316: $\wt{\pi} \in \C{M}_+^1(\Theta)$,
3317: \begin{multline}
3318: \label{eq1.1.15}
3319: \PP \Biggl\{ \wt{\pi} \otimes \wt{\pi} \biggl\{ \exp \biggl[ -
3320: N \log (1 - \lambda R') \\ - \frac{N}{2}\log \left( \frac{1+\lambda}{1-\lambda}
\right) r' + \frac{N}{2} \log \bigl(1 - \lambda^2\bigr) m' \biggr] \biggr\}
3322: \Biggr\} \leq 1.
3323: \end{multline}
3324: Replacing $\wt{\pi}$ with $\pi_{\exp( - \beta R)}$ and considering
3325: the posterior distribution $\rho \otimes \pi_{\exp( - \beta R)}$,
3326: provides a starting point in the comparison of
3327: $\rho$ with $\pi_{\exp( - \beta R)}$; we can indeed
3328: state with $\PP$ probability at least $1 - \epsilon$ that
3329: \begin{multline}
3330: \label{eq1.1.17}
3331: - N \log \Bigl\{ 1 - \lambda \Bigl[
3332: \rho(R) - \pi_{\exp( - \beta R)}(R) \Bigr] \Bigr\}
3333: \\ \leq \frac{N}{2} \log \left( \frac{1+\lambda}{1-\lambda}\right)
3334: \bigl[ \rho(r) - \pi_{\exp(- \beta R)}(r) \bigr]
3335: \\ \qquad - \frac{N}{2} \log\bigl(1 - \lambda^2\bigr) \rho \otimes \pi_{\exp( - \beta R)}
3336: (m') \\ + \C{K}\bigl[ \rho, \pi_{\exp(- \beta R)} \bigr] - \log(\epsilon).
3337: \end{multline}
3338: Using the parameter
3339: $\gamma = \frac{N}{2} \log \left( \frac{1+\lambda}{1-\lambda}\right)$,
3340: so that $\lambda = \tanh \left(\frac{\gamma}{N}\right)$ and
3341: $-\frac{N}{2} \log ( 1 - \lambda^2) = N \log \bigl[ \cosh(\frac{\gamma}{N})\bigr]$,
3342: and noticing that
3343: \begin{multline}
3344: \label{eq1.1.16}
3345: \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]
3346: = \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]
3347: \\ + \C{K}(\rho, \pi) - \C{K}\bigl[\pi_{\exp( - \beta R)}, \pi\bigr],
3348: \end{multline}
takes us a step further in the proper handling of the entropy term:
3350: \begin{multline}
3351: \label{eq1.1.20}
3352: - N \log \Bigl\{ 1 - \tanh(\tfrac{\gamma}{N})
3353: \Bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \Bigr] \Bigr\}
3354: - \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]
3355: \\ \leq \gamma \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]
3356: + N \log \bigl[ \cosh \bigl(\tfrac{\gamma}{N} \bigr)\bigr]
3357: \rho \otimes \pi_{\exp( - \beta R)}(m')
3358: \\ + \C{K}(\rho, \pi) - \C{K}\bigl[ \pi_{\exp( - \beta R)}, \pi \bigr]
3359: - \log(\epsilon).
3360: \end{multline}
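For the reader's convenience, here is a sketch of the two computations used
in this change of parameter (a reading aid, relying on our understanding that
$\pi_{\exp( - \beta R)}$ denotes the distribution with density
$\exp( - \beta R) / \pi \bigl[ \exp( - \beta R) \bigr]$ with respect to $\pi$).
From $\gamma = \frac{N}{2} \log \left( \frac{1 + \lambda}{1 - \lambda} \right)$
we get $\exp \bigl( \frac{2 \gamma}{N} \bigr) = \frac{1 + \lambda}{1 - \lambda}$, hence
$$
\lambda = \frac{\exp ( \frac{2 \gamma}{N} ) - 1}{\exp ( \frac{2 \gamma}{N} ) + 1}
= \tanh \bigl( \tfrac{\gamma}{N} \bigr), \qquad
1 - \lambda^2 = \cosh \bigl( \tfrac{\gamma}{N} \bigr)^{-2}.
$$
As for identity \eqref{eq1.1.16}, the expression of the density gives
$$
\begin{aligned}
\C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr] & = \C{K}(\rho, \pi) + \beta \rho(R)
+ \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\}, \\
\C{K}\bigl[ \pi_{\exp( - \beta R)}, \pi \bigr] & = - \beta \pi_{\exp( - \beta R)}(R)
- \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\},
\end{aligned}
$$
and it suffices to add these two lines.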
3361:
3362: We can then decompose in the right-hand side
3363: $\gamma \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]$ into
3364: $(\gamma - \lambda) \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]
3365: + \lambda \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]$
3366: and use the fact that
3367: \begin{multline*}
3368: \lambda \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]
3369: + N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho \otimes
3370: \pi_{\exp( - \beta R)}(m') \\ \shoveright{+ \C{K}(\rho, \pi)
3371: - \C{K}\bigl[ \pi_{\exp( - \beta R)}, \pi \bigr]}
3372: \\ \leq \lambda \rho(r) + \C{K}(\rho, \pi) + \log \Bigl\{
3373: \pi \Bigl[ \exp \bigl\{ - \lambda r + N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho(m') \bigr\}
3374: \Bigr] \Bigr\} \\
3375: = \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)}\bigr]
3376: + \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[ \exp \bigl\{ N \log \bigl[
3377: \cosh(\tfrac{\gamma}{N}) \bigr] \rho(m') \bigr\} \Bigr] \Bigr\},
3378: \end{multline*}
3379: to get rid of the appearance of the unobserved Gibbs prior $\pi_{\exp( - \beta R)}$
3380: in most places of the right-hand side of our inequality, leading to
3381: \begin{thm}
3382: \mypoint
3383: \label{thm1.1.41Bis}
3384: For any real constants $\beta$ and $\gamma$,
3385: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution
3386: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$, for any real constant $\lambda$,
3387: \begin{multline*}
3388: \bigl[ N \tanh(\tfrac{\gamma}{N}) - \beta \bigr]
3389: \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr] \\
3390: \shoveleft{\qquad \leq - N \log \Bigl\{ 1 - \tanh(\tfrac{\gamma}{N}) \Bigl[ \rho(R)
3391: - \pi_{\exp( - \beta R)}(R) \Bigr] \Bigr\} }
3392: \\ \shoveright{- \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]}
3393: \\ \shoveleft{\qquad \leq (\gamma - \lambda) \bigl[
3394: \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]
3395: + \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)}\bigr]}
3396: \\\shoveright{ + \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[ \exp \bigl\{ N
3397: \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho(m') \bigr\} \Bigr] \Bigr\} -
3398: \log(\epsilon)}
3399: \\ \shoveleft{\qquad = \C{K}\bigl[ \rho, \pi_{\exp (- \gamma r)} \bigr] }
3400: \\ + \log \Bigl\{ \pi_{\exp( - \gamma r)} \Bigl[
3401: \exp \bigl\{ (\gamma - \lambda) r + N \log \bigl[ \cosh(\tfrac{\gamma}{N})
3402: \bigr] \rho(m') \bigr\} \Bigr] \Bigr\} \\
3403: -( \gamma - \lambda) \pi_{\exp( - \beta R)}(r)
3404: - \log(\epsilon).
3405: \end{multline*}
3406: \end{thm}
We would like to have a fully empirical upper bound even in the case when $\lambda
\neq \gamma$. This can be done by using the theorem twice. We will
need the following lemma.
3410: \begin{lemma}
3411: \label{lemma1.38}
3412: For any probability distribution $\pi \in \C{M}_+^1(\Theta)$,
3413: for any bounded measurable functions $g,h: \Theta \rightarrow \RR$,
3414: $$
3415: \pi_{\exp( -g )}(g) - \pi_{\exp(-h)}(g) \leq
3416: \pi_{\exp(-g)}(h) - \pi_{\exp(-h)}(h).
3417: $$
3418: \end{lemma}
3419: \begin{proof}
3420: Let us notice that
3421: \begin{multline*}
3422: 0 \leq \C{K}(\pi_{\exp( - g)}, \pi_{\exp( - h )})
3423: = \pi_{\exp( - g)}(h)
3424: + \log \bigl\{ \pi \bigl[ \exp ( - h) \bigr] \bigr\} + \C{K}(\pi_{\exp( - g)}, \pi)
3425: \\ = \pi_{\exp( - g)}(h) - \pi_{\exp( - h)}(h) - \C{K}(\pi_{\exp( - h)}, \pi)
3426: + \C{K}(\pi_{\exp( - g)}, \pi)
3427: \\ = \pi_{\exp( - g)}(h) - \pi_{\exp( - h)}(h) - \C{K}(\pi_{\exp( - h)}, \pi)
3428: - \pi_{\exp( - g)}(g) - \log \bigl\{ \pi \bigl[ \exp ( - g) \bigr] \bigr\}.
3429: \end{multline*}
3430: Moreover
3431: $$
3432: - \log \bigl\{ \pi \bigl[ \exp( - g) \bigr] \bigr\} \leq \pi_{\exp( - h)}(g)
3433: + \C{K}(\pi_{\exp( - h)}, \pi),
3434: $$
which completes the proof.
3436: \end{proof}
3437:
3438: For any positive real constants $\beta$ and $\lambda$,
3439: we can then apply Theorem \ref{thm1.1.41Bis} to $\rho = \pi_{\exp( - \lambda r)}$,
3440: and use the inequality
3441: \begin{equation}
3442: \label{eq1.1.22}
3443: \frac{\lambda}{\beta} \bigl[
3444: \pi_{\exp( - \lambda r)}(r) - \pi_{\exp( - \beta R)}(r) \bigr]
3445: \leq \pi_{\exp( - \lambda r)}(R) -
3446: \pi_{\exp( - \beta R) }(R)
3447: \end{equation}
3448: provided by the previous lemma.
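More precisely (this intermediate step is our reading of the argument),
Lemma \ref{lemma1.38} applied to the bounded functions $g = \lambda r$
and $h = \beta R$ gives
$$
\lambda \bigl[ \pi_{\exp( - \lambda r)}(r) - \pi_{\exp( - \beta R)}(r) \bigr]
\leq \beta \bigl[ \pi_{\exp( - \lambda r)}(R) - \pi_{\exp( - \beta R)}(R) \bigr],
$$
and dividing both sides by $\beta > 0$ yields inequality \eqref{eq1.1.22}.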
3449: We thus obtain with $\PP$ probability at least $1 - \epsilon$
3450: \begin{multline*}
3451: - N \log \Bigl\{ 1 - \tanh(\tfrac{\gamma}{N}) \tfrac{\lambda}{\beta}
3452: \Bigl[ \pi_{\exp
3453: (- \lambda r)} (r) - \pi_{\exp( - \beta R)}(r) \Bigr] \Bigr\}
3454: \\ \shoveright{- \gamma \bigl[
3455: \pi_{\exp( - \lambda r)}(r) - \pi_{\exp( - \beta R)}(r) \bigr] }
3456: \\ \leq \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[
3457: \exp \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \pi_{\exp( - \lambda r)}
3458: (m') \bigr\} \Bigr] \Bigr\} - \log(\epsilon).
3459: \end{multline*}
3460: Let us
3461: introduce the convex function
3462: $$
3463: F_{\gamma, \alpha}(x) = - N \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N})
3464: x \bigr] - \alpha x \geq \bigl[ N \tanh(\tfrac{\gamma}{N}) - \alpha \bigr] x.
3465: $$
3466: With $\PP$ probability at least $1 - \epsilon$,
3467: \begin{multline*}
3468: - \pi_{\exp( - \beta R)}(r)
3469: \leq \inf_{\lambda \in \RR_+^*} \biggl\{ - \pi_{\exp( - \lambda r)}(r) \\*
3470: + \frac{\beta}{\lambda} F_{\gamma,
3471: \frac{\beta \gamma}{\lambda}}^{-1} \biggl[
3472: \log \Bigl\{ \pi_{\exp(- \lambda r)} \Bigl[ \exp
3473: \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr]
3474: \pi_{\exp( - \lambda r)}(m') \bigr\} \Bigr] \Bigr\}
3475: \\ - \log(\epsilon) \biggr] \biggr\}.
3476: \end{multline*}
3477: Since Theorem \ref{thm1.1.41Bis} holds uniformly for any posterior distribution
3478: $\rho$, we can apply it again to some arbitrary posterior distribution $\rho$.
3479: We can moreover make the result uniform in $\beta$ and $\gamma$ by considering
3480: some atomic measure $\nu \in \C{M}_+^1(\RR)$ on the real line and using a union bound.
3481: This leads to
3482: \begin{thm}
3483: \mypoint
3484: \label{thm1.1.43}
3485: For any atomic probability distribution on the positive real line
3486: $\nu \in \C{M}_+^1(\RR_+)$,
3487: with $\PP$ probability
3488: at least $1 - \epsilon$, for any posterior distribution $\rho :
3489: \Omega \rightarrow \C{M}_+^1(\Theta)$, for any positive real constants $\beta$
3490: and $\gamma$,
3491: \begin{multline*}
3492: \bigl[ N \tanh(\tfrac{\gamma}{N}) - \beta \bigr] \bigl[ \rho(R) -
3493: \pi_{\exp( - \beta R)}(R) \bigr]
3494: \\* \shoveright{\leq
3495: F_{\gamma, \beta}\bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]
3496: \leq B(\rho, \beta, \gamma), \text{ where}}\\\shoveleft{B(\rho, \beta, \gamma) = \inf_{
3497: \substack{\lambda_1 \in \RR_+, \lambda_1 \leq \gamma\\
3498: \lambda_2 \in \RR, \lambda_2 >
3499: \frac{\beta \gamma}{N} \tanh(\frac{\gamma}{N})^{-1}
}} \Biggl\{
3501: \C{K}\bigl[ \rho, \pi_{\exp( - \lambda_1 r)} \bigr] }
3502: \\\shoveleft{\qquad + (\gamma - \lambda_1) \bigl[ \rho(r)
3503: - \pi_{\exp( - \lambda_2 r)}(r) \bigr]}
3504: \\\shoveleft{\qquad + \log \Bigl\{ \pi_{\exp( - \lambda_1 r)} \Bigl[ \exp \bigl\{
3505: N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho(m') \bigr\} \Bigr] \Bigr\}
3506: - \log \bigl[ \epsilon \nu(\beta) \nu(\gamma) \bigr]}\\
3507: \shoveleft{\qquad + (\gamma - \lambda_1) \frac{\beta}{\lambda_2}
3508: F_{\gamma, \frac{\beta \gamma}{\lambda_2}}^{-1} \biggl[
3509: \log \Bigl\{ }\\ \pi_{\exp( - \lambda_2 r)} \Bigl[ \exp \bigl\{
3510: N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \pi_{\exp( - \lambda_2 r)}(m')
3511: \bigr\} \Bigr] \Bigr\} \\\shoveright{ - \log \bigl[ \epsilon \nu(\beta)
3512: \nu(\gamma)\bigr] \biggr] \Biggr\}}
3513: \\\shoveleft{\leq \inf_{
3514: \substack{\lambda_1 \in \RR_+, \lambda_1 \leq \gamma\\
3515: \lambda_2 \in \RR, \lambda_2 >
3516: \frac{\beta \gamma}{N} \tanh(\frac{\gamma}{N})^{-1}
}} \Biggl\{
3518: \C{K}\bigl[ \rho, \pi_{\exp( - \lambda_1 r)} \bigr]
3519: }\\\shoveleft{\qquad+ (\gamma - \lambda_1) \bigl[
3520: \rho(r) - \pi_{\exp( - \lambda_2 r)}(r) \bigr]}
3521: \\\shoveleft{\qquad+ \log \Bigl\{ \pi_{\exp( - \lambda_1 r)} \Bigl[ \exp \bigl\{
3522: N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho(m') \bigr\} \Bigr] \Bigr\}}
3523: \\\shoveleft{\qquad + \frac{\beta}{\lambda_2} \frac{(1 - \frac{\lambda_1}{\gamma})}{
3524: \bigl[ \frac{N}{\gamma} \tanh(\frac{\gamma}{N}) - \frac{\beta}{\lambda_2}\bigr]}
3525: \log \Bigl\{ \pi_{\exp( - \lambda_2 r)} \Bigl[ }
3526: \\ \exp \bigl\{
3527: N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \pi_{\exp( - \lambda_2 r)}(m')
3528: \bigr\} \Bigr] \Bigr\} \\
3529: - \Bigl\{ 1 + \frac{\beta}{\lambda_2} \tfrac{(1 - \frac{\lambda_1}{\gamma})}{
3530: [ \frac{N}{\gamma} \tanh(\frac{\gamma}{N}) - \frac{\beta}{\lambda_2}]} \Bigr\}
3531: \log \bigl[ \epsilon \nu( \beta) \nu( \gamma) \bigr] \Biggr\},
3532: \end{multline*}
3533: where we have written for short $\nu(\beta)$ and $\nu(\gamma)$ instead
3534: of $\nu(\{\beta\})$ and $\nu(\{\gamma\})$.
3535: \end{thm}
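The last inequality of the theorem can be understood as follows
(a sketch, under our reading of $F_{\gamma, \alpha}^{-1}$ as the inverse of
$F_{\gamma, \alpha}$ on the range where it is increasing).
Since $- \log(1 - u) \geq u$ for any $u < 1$, we have
$F_{\gamma, \alpha}(x) \geq \bigl[ N \tanh(\frac{\gamma}{N}) - \alpha \bigr] x$,
so that, whenever $N \tanh(\frac{\gamma}{N}) > \alpha$,
$$
F_{\gamma, \alpha}^{-1}(y) \leq \frac{y}{N \tanh(\frac{\gamma}{N}) - \alpha}.
$$
Taking $\alpha = \frac{\beta \gamma}{\lambda_2}$, the condition
$N \tanh(\frac{\gamma}{N}) > \alpha$ is exactly the constraint
$\lambda_2 > \frac{\beta \gamma}{N} \tanh(\frac{\gamma}{N})^{-1}$
put on $\lambda_2$ in the infimum, and
$$
(\gamma - \lambda_1) \frac{\beta}{\lambda_2} \,
\frac{y}{N \tanh(\frac{\gamma}{N}) - \frac{\beta \gamma}{\lambda_2}}
= \frac{\beta}{\lambda_2} \frac{(1 - \frac{\lambda_1}{\gamma})}{
\bigl[ \frac{N}{\gamma} \tanh(\frac{\gamma}{N}) - \frac{\beta}{\lambda_2} \bigr]} \, y,
$$
which is the coefficient appearing in the last bound.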
Let us notice that $B(\rho, \beta, \gamma) = + \infty$ when $\nu(\beta) = 0$
or $\nu(\gamma) = 0$; the uniformity in $\beta$ and $\gamma$ of the
theorem therefore necessarily bears on at most countably many values of these parameters.
3539: We can typically choose for $\nu$ distributions such as the one
3540: used in Theorem \ref{thm1.1.11} on page \pageref{thm1.1.11}:
3541: namely we can put for some positive real ratio $\alpha > 1$
3542: $$
3543: \nu(\alpha^k) = \frac{1}{(k+1)(k+2)}, \qquad k \in \NN,
3544: $$
3545: or alternatively, since we are interested in values of the parameters
3546: less than $N$, we can prefer
3547: $$
3548: \nu(\alpha^k) = \frac{\log(\alpha)}{\log(\alpha N)},
3549: \qquad 0 \leq k < \frac{\log(N)}{\log(\alpha)}.
3550: $$
3551: We can also use such a coding distribution on dyadic numbers
3552: as the one defined by equation \eqref{eq1.1.4bis} on page \pageref{eq1.1.4bis}.
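One can check that the two explicit choices above have a total mass
not greater than one, as required by the union bound. For the first one,
$$
\sum_{k \in \NN} \frac{1}{(k+1)(k+2)} = \sum_{k \in \NN} \left(
\frac{1}{k+1} - \frac{1}{k+2} \right) = 1.
$$
For the second one, there are at most
$\frac{\log(N)}{\log(\alpha)} + 1 = \frac{\log(\alpha N)}{\log(\alpha)}$
integers $k$ such that $0 \leq k < \frac{\log(N)}{\log(\alpha)}$, each of them
receiving the weight $\frac{\log(\alpha)}{\log(\alpha N)}$, so that the
total mass is at most one.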
3553:
3554: \subsubsection{The effective temperature of a posterior distribution}
3555: Using the parametric approximation $\pi_{\exp( - \alpha r)}(r)
3556: - \inf_{\Theta} r \simeq \frac{d_e}{\alpha}$, we get as an order of magnitude
3557: \begin{multline*}
3558: B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma) \lesssim
3559: - (\gamma - \lambda_1) d_e \bigl[ \lambda_2^{-1} - \lambda_1^{-1} \bigr]
3560: \\ \shoveleft{\qquad + 2 d_e \log \frac{\lambda_1}{ \lambda_1
3561: - N\log\bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] x}}\\*
3562: \qquad\qquad + 2 \frac{\beta}{\lambda_2} \frac{(1 - \frac{\lambda_1}{\gamma})}{
3563: \bigl[ \frac{N}{\gamma}\tanh(\tfrac{\gamma}{N}) - \frac{\beta}{\lambda_2} \bigr]} d_e \log
3564: \left( \frac{ \lambda_2}{\lambda_2 - N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] x}
3565: \right) \\*
3566: \qquad\qquad\qquad\qquad + 2 N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \biggl[ 1 + \frac{\beta}{\lambda_2}
3567: \frac{(1 - \frac{\lambda_1}{\gamma})}{ \bigl[ \frac{N}{\gamma}
3568: \tanh(\frac{\gamma}{N}) - \frac{\beta}{\lambda_2} \bigr]} \biggr] \Tphi(x)
3569: \\ - \Bigl\{ 1 + \frac{\beta}{\lambda_2}
3570: \frac{(1 - \frac{\lambda_1}{\gamma})}{[\frac{N}{\gamma} \tanh(\tfrac{\gamma}{N})
3571: - \frac{\beta}{\lambda_2}]} \Bigr\} \log\bigl[ \nu(\beta) \nu(\gamma) \epsilon
3572: \bigr].
3573: \end{multline*}
3574: Therefore, if the empirical dimension $d_e$ stays bounded when $N$ increases,
3575: we are going to obtain a negative upper bound for any values of the constants
3576: $\lambda_1 > \lambda_2 > \beta$, as soon as $\gamma$ and $\frac{N}{\gamma}$
3577: are chosen to be large enough.
This ability to obtain negative values for the bound $B(\pi_{\exp( - \lambda_1 r)},
\beta, \gamma)$, and more generally $B(\rho, \beta, \gamma)$, leads us
to introduce the new concept of the {\em effective temperature} of an estimator.
3581: \begin{dfn}
3582: For any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$ we define
3583: the {\em effective temperature} $T(\rho) \in
3584: \RR \cup \{ - \infty, + \infty \}$ of $\rho$ by the equation
3585: $$
3586: \rho(R) = \pi_{\exp( - \frac{R}{T(\rho)})}(R).
3587: $$
3588: \end{dfn}
3589: Note that $\beta \mapsto \pi_{\exp( - \beta R)}(R) : \RR \cup \{ - \infty, + \infty \}
3590: \rightarrow (0,1)$ is continuous and strictly decreasing from $\ess \sup_{\pi} R$
3591: to $\ess \inf_{\pi} R$ (as soon as these two bounds do not coincide). This shows
3592: that the effective temperature $T(\rho)$ is a well defined random variable.
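The monotonicity can be checked by differentiating under the integral sign
(a standard computation, which we recall here as a reading aid and which is
legitimate since $R$ is bounded): for any finite value of $\beta$,
$$
\frac{\partial}{\partial \beta} \pi_{\exp( - \beta R)}(R)
= - \pi_{\exp( - \beta R)} \Bigl\{ \bigl[ R - \pi_{\exp( - \beta R)}(R) \bigr]^2 \Bigr\}
\leq 0,
$$
with a strict inequality unless $R$ is $\pi$ almost surely constant.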
3593:
Theorem \ref{thm1.1.43} indeed provides a bound for $T(\rho)$:
3595: \begin{prop}\mypoint
3596: \label{prop1.1.37}
3597: Let
3598: $$
3599: \w{\beta}(\rho) = \sup \bigl\{ \beta \in \RR; \inf_{\gamma, N \tanh(\frac{\gamma}{N})
3600: > \beta}
3601: B(\rho, \beta, \gamma) \leq 0 \bigr\},
3602: $$
3603: where $B(\rho, \beta, \gamma)$ is as in Theorem \ref{thm1.1.43}.
3604: Then with $\PP$ probability at least $1 - \epsilon$, for any posterior
3605: distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
3606: $T(\rho) \leq \w{\beta}(\rho)^{-1}$, or equivalently
3607: $\rho(R) \leq \pi_{\exp[ - \w{\beta}(\rho) R]}(R)$.
3608: \end{prop}
3609: This notion of {\em effective temperature} of a (randomized) estimator
3610: $\rho$ is interesting for two reasons:
3611:
3612: $\bullet$ the difference $\rho(R) - \pi_{\exp( - \beta R)}(R)$ can be estimated
3613: with a better accuracy than $\rho(R)$ itself, due to the use of relative deviation
3614: inequalities, leading to convergence rates up to $1/N$ in favourable situations,
3615: even when $\inf_{\Theta} R$ is not close to zero;
3616:
$\bullet$ and of course $\pi_{\exp( - \beta R)}(R)$ is a decreasing function
of $\beta$; thus being able to estimate $\rho(R) - \pi_{\exp( - \beta R)}(R)$
with some given accuracy means being able to discriminate between values
of $\rho(R)$ with the same accuracy, although doing so through the
parametrization $\beta \mapsto \pi_{\exp( - \beta R)}(R)$, which can
be neither observed nor estimated with the same precision!
3623:
3624: \subsubsection{Analysis of an empirical bound for the effective temperature}
3625: We are now going to launch into a mathematically rigorous analysis of
the bound $B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma)$
3627: provided by Theorem \ref{thm1.1.43},
3628: to show that \linebreak $\inf_{\rho \in \C{M}_+^1(\Theta)}
\pi_{\exp[ - \w{\beta}(\rho) R]}(R)$ indeed converges to $\inf_{\Theta} R$
3630: at some unimprovable rates in favourable situations.
3631:
3632: It is more convenient for this purpose to use deviation inequalities involving
3633: $M'$ rather than $m'$. It is straightforward to extend Theorem \ref{thm4.1} on
3634: page \pageref{thm4.1} to
3635: \begin{thm}
3636: \mypoint
3637: For any real constants $\beta$ and $\gamma$, for any prior distribution
3638: $\mu \in \C{M}_+^1(\Theta)$, with $\PP$ probability at least $1 - \eta$,
3639: for any posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
3640: $$
3641: \gamma \rho \otimes \pi_{\exp( - \beta R)} \bigl[ \Psi_{\frac{\gamma}{N}}(R', M') \bigr]
3642: \leq \gamma \rho \otimes \pi_{\exp( - \beta R)}(r') + \C{K}(\rho, \mu) - \log(\eta).
3643: $$
3644: \end{thm}
In order to transform the left-hand side into a linear expression and
at the same time to localize this theorem, let us choose $\mu$ defined by its density
3647:
3648: \begin{multline*}
3649: \frac{d \mu}{d \pi}(\theta_1)
3650: = C^{-1} \exp \biggl[ - \beta R(\theta_1)
3651: \\* - \gamma \int_{\Theta} \Bigl\{
3652: \Psi_{\frac{\gamma}{N}} \bigl[ R'(\theta_1, \theta_2),
3653: M'(\theta_1, \theta_2) \bigr] \\* - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})
3654: R'(\theta_1, \theta_2) \Bigr\} \pi_{\exp( - \beta R)}(d \theta_2) \biggr],
3655: \end{multline*}
3656: where $C$ is such that $\mu(\Theta) = 1$.
3657: We get
3658: \begin{multline*}
3659: \C{K}(\rho, \mu) = \beta \rho(R) + \gamma
3660: \rho \otimes \pi_{\exp( - \beta R)} \bigl[
3661: \Psi_{\frac{\gamma}{N}} (R', M') - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})
3662: R' \bigr] + \C{K}(\rho, \pi) \\
3663: \shoveleft{\qquad + \log \biggl\{ \int_{\Theta} \exp \biggl[ - \beta R(\theta_1)}
3664: \\ - \gamma \int_{\Theta} \Bigl\{
3665: \Psi_{\frac{\gamma}{N}} \bigl[ R'(\theta_1, \theta_2), M'(\theta_1,
3666: \theta_2) \bigr]\\\shoveright{ - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})
3667: R'(\theta_1, \theta_2) \Bigr\} \pi_{\exp( -
3668: \beta R)}(d \theta_2) \biggr] \pi ( d \theta_1) \biggr\}}
3669: \\\shoveleft{\quad= \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]}\\
3670: + \gamma \rho \otimes \pi_{\exp ( - \beta R)} \bigl[
3671: \Psi_{\frac{\gamma}{N}}(R', M') - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})
3672: R' \bigr]
3673: \\\shoveright{+ \C{K}(\rho, \pi) - \C{K}(\pi_{\exp( - \beta R)}, \pi)
3674: \qquad}\\
3675: \shoveleft{\qquad + \log \biggl\{ \int_{\Theta} \exp
3676: \biggl[ - \gamma \int_{\Theta} \Bigl\{ \Psi_{\frac{\gamma}{N}}
3677: \bigl[ R'(\theta_1, \theta_2),M'(\theta_1, \theta_2) \bigr]
3678: }\\ - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})
3679: R'(\theta_1, \theta_2) \Bigr\} \pi_{\exp( - \beta R)}(d \theta_2)
3680: \biggr] \pi_{\exp( - \beta R)}(d \theta_1) \biggr\}.
3681: \end{multline*}
3682: Thus with $\PP$ probability at least $1 - \eta$,
3683: \begin{multline}
3684: \label{eq1.1.23}
3685: \bigl[ N \sinh(\tfrac{\gamma}{N}) - \beta \bigr]
3686: \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]
3687: \\\shoveleft{\qquad \leq \gamma \bigl[ \rho(r) - \pi_{\exp ( - \beta R)}(r) \bigr] +
3688: \C{K}(\rho, \pi) - \C{K}(\pi_{\exp( - \beta R)}, \pi) - \log(\eta) +
3689: C(\beta, \gamma)}
3690: \\
3691: \shoveleft{\text{where } C(\beta, \gamma) = \log \biggl\{ \int_{\Theta} \exp
3692: \biggl[ - \gamma \int_{\Theta} \Bigl\{ \Psi_{\frac{\gamma}{N}}
3693: \bigl[ R'(\theta_1, \theta_2),M'(\theta_1, \theta_2) \bigr]
3694: }\\- \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})
3695: R'(\theta_1, \theta_2) \Bigr\} \pi_{\exp( - \beta R)}(d \theta_2)
3696: \biggr] \pi_{\exp( - \beta R)}(d \theta_1) \biggr\}.
3697: \end{multline}
3698: Remarking that
3699: $$
3700: \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]
3701: = \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]
3702: + \C{K}(\rho, \pi) - \C{K}(\pi_{\exp( - \beta R)}, \pi),
3703: $$
3704: we deduce from the previous inequality
3705: \begin{thm}\mypoint
3706: \label{thm1.1.45}
3707: For any real constants $\beta$ and $\gamma$, with $\PP$ probability
3708: at least $1 - \eta$, for any posterior distribution $\rho : \Omega
3709: \rightarrow \C{M}_+^1(\Theta)$,
3710: \begin{multline*}
3711: N \sinh(\tfrac{\gamma}{N}) \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R)
3712: \bigr] \leq \gamma \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]
3713: \\ + \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr] - \log(\eta)
3714: + C(\beta, \gamma).
3715: \end{multline*}
3716: \end{thm}
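As a reading aid, here is a short check of the entropy identity used just
before the theorem. It is only a sketch, assuming the usual convention that
$\pi_{\exp( - \beta R)}$ has density $\exp( - \beta R) / \pi \bigl[ \exp( - \beta R) \bigr]$
with respect to $\pi$, and that $\C{K}(\rho, \pi)$ is finite:
\begin{align*}
\C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]
& = \C{K}(\rho, \pi) + \beta \rho(R) + \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\},\\
\C{K}\bigl( \pi_{\exp( - \beta R)}, \pi \bigr)
& = - \beta \pi_{\exp( - \beta R)}(R) - \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\}.
\end{align*}
Adding these two lines cancels the log-Laplace transform
$\log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\}$ and gives the identity.
Substituting it into \eqref{eq1.1.23} moves the term
$\beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]$ to the right-hand side,
which is why only the factor $N \sinh(\tfrac{\gamma}{N})$ remains on the left.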
We can also go in a slightly different direction, starting
again from equation \eqref{eq1.1.23} on page \pageref{eq1.1.23} and
noting that for any real constant $\lambda$,
3720: \begin{multline*}
3721: \lambda \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr]
3722: + \C{K}(\rho, \pi) - \C{K}(\pi_{\exp(- \beta R)}, \pi)
3723: \\ \leq \lambda \rho(r) + \C{K}(\rho, \pi) + \log \bigl\{
3724: \pi \bigl[ \exp ( - \lambda r) \bigr] \bigr\} =
3725: \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr].
3726: \end{multline*}
3727: This leads to
3728: \begin{thm}\mypoint
For any real constants $\beta$ and $\gamma$, with $\PP$ probability at least $1 - \eta$,
for any real constant $\lambda$ and any posterior distribution
$\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
3731: \begin{multline*}
3732: \bigl[ N \sinh(\tfrac{\gamma}{N}) - \beta \bigr]
3733: \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]
3734: \\ \leq (\gamma - \lambda)
3735: \bigl[ \rho(r) - \pi_{\exp ( - \beta R)}(r) \bigr] +
3736: \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr] - \log(\eta) + C(\beta, \gamma),
3737: \end{multline*}
3738: where the definition of $C(\beta, \gamma)$ is given by equation \eqref{eq1.1.23}
3739: on page \pageref{eq1.1.23}.
3740: \end{thm}
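The inequality used just before the theorem can be read as an instance of the
variational formula
$\log \bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr] \bigr\}
= \sup_{\rho'} \bigl\{ - \lambda \rho'(r) - \C{K}(\rho', \pi) \bigr\}$
(the same duality that is used elsewhere in these notes), applied to the
particular choice $\rho' = \pi_{\exp( - \beta R)}$. As a sketch:
\begin{align*}
- \lambda \pi_{\exp( - \beta R)}(r) - \C{K}(\pi_{\exp( - \beta R)}, \pi)
& \leq \log \bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr] \bigr\},\\
\lambda \rho(r) + \C{K}(\rho, \pi) + \log \bigl\{ \pi \bigl[ \exp( - \lambda r) \bigr] \bigr\}
& = \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr].
\end{align*}
Adding $\lambda \rho(r) + \C{K}(\rho, \pi)$ to both sides of the first line and using
the second one gives the inequality. Writing
$\gamma = (\gamma - \lambda) + \lambda$ in the right-hand side of \eqref{eq1.1.23}
then produces the factor $(\gamma - \lambda)$ of the theorem.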
3741:
3742: We can now use this inequality in the case when $\rho = \pi_{\exp( - \lambda r)}$
3743: and combine it with inequality \eqref{eq1.1.22} on page \pageref{eq1.1.22}
3744: to obtain
3745: \begin{thm}
3746: For any real constants $\beta$ and $\gamma$,
3747: with $\PP$ probability at least $1 - \eta$, for any real constant
3748: $\lambda$,
3749: $$
3750: \bigl[ \tfrac{N \lambda}{\beta} \sinh(\tfrac{\gamma}{N}) - \gamma \bigr]
3751: \bigl[ \pi_{\exp( - \lambda r)}(r) - \pi_{\exp( - \beta R)}(r) \bigr]
3752: \leq C(\beta, \gamma) - \log(\eta).
3753: $$
3754: \end{thm}
3755: We deduce from this theorem
3756: \begin{prop}
3757: For any real positive constants $\beta_1$, $\beta_2$ and
3758: $\gamma$, with $\PP$ probability at least $1 - \eta$, for any real constants
3759: $\lambda_1$ and $\lambda_2$, such that $\lambda_2 < \beta_2 \frac{\gamma}{N}
3760: \sinh(\frac{\gamma}{N})^{-1}$ and $\lambda_1 > \beta_1 \frac{\gamma}{N}
3761: \sinh(\frac{\gamma}{N})^{-1}$,
3762: \begin{multline*}
3763: \pi_{\exp( - \lambda_1 r)}(r) - \pi_{\exp( - \lambda_2 r)}(r)
3764: \leq \pi_{\exp( - \beta_1 R)}(r) - \pi_{\exp( - \beta_2 R)}(r)
3765: \\ + \frac{C(\beta_1, \gamma) + \log( 2 /\eta)}{\frac{N\lambda_1}{\beta_1}
3766: \sinh(\frac{\gamma}{N})- \gamma}
3767: + \frac{C(\beta_2, \gamma) + \log( 2 /\eta)}{\gamma - \frac{N\lambda_2}{\beta_2}
3768: \sinh(\frac{\gamma}{N})}.
3769: \end{multline*}
3770: \end{prop}
3771: Moreover, $\pi_{\exp( - \beta_1 R)}$ and $\pi_{\exp( - \beta_2 R)}$
3772: being prior distributions,
3773: with $\PP$ probability at least $1 - \eta$,
3774: \begin{multline*}
3775: \gamma \bigl[ \pi_{\exp( - \beta_1 R)}(r) - \pi_{\exp( - \beta_2 R)}(r) \bigr]
3776: \\ \leq \gamma \pi_{\exp( - \beta_1 R)} \otimes \pi_{\exp( - \beta_2 R)}
3777: \bigl[ \Psi_{- \frac{\gamma}{N}}(R',M') \bigr] - \log( \eta).
3778: \end{multline*}
3779: Hence
3780: \begin{prop}
3781: For any positive real constants $\beta_1$, $\beta_2$ and $\gamma$,
3782: with $\PP$ probability at least $1 - \eta$,
3783: for any positive real constants $\lambda_1$ and $\lambda_2$
3784: such that $\lambda_2 < \beta_2 \frac{\gamma}{N} \sinh(\tfrac{\gamma}{N})^{-1}$
3785: and $\lambda_1 > \beta_1 \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1}$,
3786: \begin{multline*}
3787: \pi_{\exp ( - \lambda_1 r)}(r) - \pi_{\exp( - \lambda_2 r)}(r)
3788: \\ \leq \pi_{\exp( - \beta_1 R)} \otimes
3789: \pi_{\exp( - \beta_2 R)} \bigl[ \Psi_{- \frac{\gamma}{N}} (R',M')\bigr] \\
3790: + \frac{\log(\frac{3}{\eta})}{\gamma} + \frac{C(\beta_1,\gamma) + \log(\frac{3}{\eta})}{
3791: \frac{N \lambda_1}{\beta_1} \sinh(\frac{\gamma}{N})- \gamma}
3792: + \frac{C(\beta_2, \gamma) + \log (\frac{3}{\eta})}{\gamma -
3793: \frac{N \lambda_2}{\beta_2} \sinh(\frac{\gamma}{N})}.
3794: \end{multline*}
3795: \end{prop}
3796:
In order to complete the analysis of the bound $B(\pi_{\exp( - \lambda_1 r)}, \beta,
\gamma)$
given by Theorem \ref{thm1.1.43}, it remains to bound quantities of the
general form
3801: \begin{multline*}
3802: \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[
3803: \exp \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \pi_{\exp(
3804: - \lambda r)}(m') \bigr\} \Bigr] \Bigr\} \\
3805: = \sup_{\rho \in \C{M}_+^1(\Theta)}
3806: N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] \rho \otimes
\pi_{\exp( - \lambda r)}(m') -
3808: \C{K}\bigl[\rho, \pi_{\exp( - \lambda r)}\bigr].
3809: \end{multline*}
3810:
3811: Let us consider the prior distribution $\mu \in \C{M}_+^1(\Theta \times \Theta)$
3812: on couples of parameters defined by its density
3813: $$
3814: \frac{d \mu}{d (\pi \otimes \pi)} (\theta_1, \theta_2)
3815: = C^{-1} \exp \Bigl\{
3816: - \beta R(\theta_1) - \beta R(\theta_2) + \alpha
3817: \Phi_{- \frac{\alpha}{N}} \bigl[ M'(\theta_1, \theta_2) \bigr] \Bigr\},
3818: $$
3819: where the normalizing constant $C$ is such that $\mu( \Theta \times \Theta) = 1$.
3820: Since for fixed values of the parameters $\theta$
3821: and $\theta' \in \Theta$, $m'(\theta, \theta')$, like $r(\theta)$, is a sum
3822: of independent Bernoulli random variables, we can easily
3823: adapt the proof of Theorem \ref{thm2.3} on page \pageref{thm2.3},
3824: to establish that with $\PP$ probability at least $1 - \eta$, for any posterior distribution
3825: $\rho$ and any real constant $\lambda$,
3826: \begin{multline*}
3827: \alpha \rho \otimes \pi_{\exp( - \lambda r)}(m')
3828: \leq \alpha \rho \otimes \pi_{\exp( - \lambda r)} \bigl[ \Phi_{- \frac{\alpha}{N}}(M') \bigr]
3829: \\\shoveright{ + \C{K}(\rho \otimes \pi_{\exp( - \lambda r)}, \mu) -
3830: \log( \eta)} \\
3831: \shoveleft{\qquad = \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr] + \C{K}\bigl[
3832: \pi_{\exp( - \lambda r)}, \pi_{\exp( - \beta R)}\bigr] }
3833: \\* + \log \Bigl\{ \pi_{\exp( - \beta R)} \otimes \pi_{\exp( - \beta
3834: R)} \Bigl[ \exp \bigl( \alpha \Phi_{-\frac{\alpha}{N}}\!\circ\!M' \bigr)
3835: \Bigr] \Bigr\} - \log(\eta).
3836: \end{multline*}
3837: Thus for any real constant $\beta$ and any positive real constants
3838: $\alpha$ and $\gamma$,
3839: with $\PP$ probability at least $1 - \eta$, for any real constant
3840: $\lambda$,
3841: \begin{multline}
3842: \label{eq1.1.24}
3843: \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[ \exp
3844: \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] \pi_{\exp( - \lambda r)}
3845: (m') \bigr\} \Bigr] \Bigr\}
3846: \\ \leq \sup_{\rho \in \C{M}_+^1(\Theta)} \biggl(
3847: \tfrac{N}{\alpha} \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr]
3848: \Bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]
3849: + \C{K} \bigl[ \pi_{\exp( - \lambda r)}, \pi_{\exp( - \beta R)} \bigr]
3850: \\
3851: + \log \bigl\{ \pi_{\exp( - \beta R)} \otimes \pi_{\exp(- \beta R)}
3852: \bigl[ \exp ( \alpha \Phi_{- \frac{\alpha}{N}}\!\circ\!M') \bigr] \bigr\}
3853: \\ - \log( \eta) \Bigr\} - \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)}\bigr] \biggr).
3854: \end{multline}
3855:
3856: To conclude, we need some suitable upper bound for the entropy
3857: \linebreak $\C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]$. This question can
3858: be handled in the following way:
3859: using Theorem \ref{thm1.1.45} on page \pageref{thm1.1.45},
3860: we see that for any positive real constants $\gamma$ and $\beta$,
3861: with $\PP$ probability at least $1 - \eta$, for any posterior distribution
3862: $\rho$,
3863: \begin{multline*}
3864: \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]
3865: = \beta \bigl[ \rho(R) - \pi_{\exp( - \beta R)}(R) \bigr]
3866: + \C{K}(\rho, \pi) - \C{K}(\pi_{\exp( - \beta R)}, \pi)
3867: \\ \shoveleft{\qquad \leq \frac{\beta}{N \sinh(\frac{\gamma}{N})} \biggl[
3868: \gamma \bigl[ \rho(r) - \pi_{\exp( - \beta R)}(r) \bigr] }
3869: \\ + \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]
3870: - \log(\eta) + C(\beta, \gamma) \biggr]\\\shoveright{+ \C{K}(\rho, \pi)
3871: - \C{K}(\pi_{\exp( - \beta R)}, \pi)\qquad}
3872: \\ \shoveleft{\qquad \leq \C{K} \bigl[ \rho, \pi_{\exp( - \frac{\beta \gamma}{N
3873: \sinh(\frac{\gamma}{N})} r)}
3874: \bigr]} \\ + \frac{\beta}{N \sinh(\frac{\gamma}{N})}
3875: \Bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]
3876: + C(\beta, \gamma) - \log(\eta) \Bigr\}.
3877: \end{multline*}
3878: In other words,
3879: \begin{thm}
3880: \mypoint
3881: For any positive real constants $\beta$ and $\gamma$ such that
3882: $\beta < N \sinh(\tfrac{\gamma}{N})$, with $\PP$ probability at least $1 - \eta$, for any posterior
3883: distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
3884: $$
3885: \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]
3886: \leq \frac{\ds \C{K} \bigl[ \rho, \pi_{\exp[ - \beta \frac{\gamma}{N}
3887: \sinh(\frac{\gamma}{N})^{-1} r]} \bigr]}{\ds 1 - \frac{\beta}{N \sinh(\frac{\gamma}{N})}}
3888: + \frac{\ds C(\beta, \gamma) - \log(\eta)}{\ds \frac{N \sinh(\frac{\gamma}{N})}{\beta}
3889: - 1}.
3890: $$
3891: \end{thm}
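The passage from the chain of inequalities to the theorem is a simple
rearrangement, spelled out here with shorthand notation introduced only for
this remark: $a = \frac{\beta}{N \sinh(\frac{\gamma}{N})}$,
$u = \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]$ and
$v = \C{K}\bigl[ \rho, \pi_{\exp( - a \gamma r)} \bigr]$. The chain reads
$u \leq v + a \bigl[ u + C(\beta, \gamma) - \log(\eta) \bigr]$, so that, when $u$ is finite,
\begin{align*}
(1 - a) u & \leq v + a \bigl[ C(\beta, \gamma) - \log(\eta) \bigr],\\
u & \leq \frac{v}{1 - a} + \frac{C(\beta, \gamma) - \log(\eta)}{a^{-1} - 1}.
\end{align*}
The division by $1 - a$ requires $a < 1$, which is exactly the
condition $\beta < N \sinh(\tfrac{\gamma}{N})$ of the theorem.
When $u$ is infinite, $\C{K}(\rho, \pi)$ is infinite as well (the functions $R$ and $r$ being
bounded), so that $v$ is also infinite and the statement is trivial.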
3892:
3893: Choosing in equation \eqref{eq1.1.24} on page \pageref{eq1.1.24}
3894: $\ds \alpha = \frac{N \log \bigl[ \cosh(\frac{\gamma}{N})\bigr]}{1
3895: - \frac{\beta}{N \sinh(\frac{\gamma}{N})}}$ and \linebreak
3896: $\beta = \lambda \frac{N}{\gamma} \sinh(\frac{\gamma}{N})$, so that
3897: $\ds \alpha = \frac{N \log \bigl[ \cosh(\frac{\gamma}{N})\bigr]}{1 - \frac{\lambda}{\gamma}
3898: }$, we obtain with $\PP$
3899: probability at least $1 - \eta$,
3900: \begin{multline*}
3901: \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[
3902: \exp \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] \pi_{\exp( -
3903: \lambda r)}(m') \bigr\} \Bigr] \Bigr\}
3904: \\ \shoveleft{\qquad \leq \tfrac{2 \lambda}{\gamma} \bigl[
3905: C(\beta, \gamma) + \log( \tfrac{2}{\eta}) \bigr]
3906: } \\ + \Bigl( 1 - \tfrac{\lambda}{\gamma} \Bigr) \biggl[ \log \Bigl\{ \pi_{\exp( - \beta R)} \otimes \pi_{\exp( - \beta R)}
3907: \bigl[ \exp( \alpha \Phi_{-\frac{\alpha}{N}}\!\circ\!M')\bigr] \Bigr\} \\+
3908: \log( \tfrac{2}{\eta}) \biggr].
3909: \end{multline*}
3910: This proves
3911: \begin{prop}
3912: \mypoint
3913: For any positive real constants $\lambda < \gamma$,
3914: with $\PP$ probability at least $1 - \eta$,
3915: \begin{multline*}
3916: \log \Bigl\{ \pi_{\exp( - \lambda r)} \Bigl[
3917: \exp \bigl\{ N \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] \pi_{\exp( -
3918: \lambda r)}(m') \bigr\} \Bigr] \Bigr\} \\
3919: \shoveleft{\qquad \leq
3920: \frac{2 \lambda}{\gamma} \bigl[ C( \tfrac{N \lambda}{\gamma} \sinh(
3921: \tfrac{\gamma}{N}), \gamma)
3922: + \log ( \tfrac{2}{\eta}) \bigr]}
3923: \\\shoveleft{\qquad\qquad + \Bigl(1 - \tfrac{\lambda}{\gamma}\Bigr)
3924: \log \biggl\{ \pi_{\exp[ - \frac{N\lambda}{\gamma} \sinh(\frac{\gamma}{N}) R]
3925: }^{\otimes 2}
3926: \biggl[}\\\shoveright{
3927: \exp \biggl( \frac{N \log [ \cosh(\tfrac{\gamma}{N})]}{1 - \frac{\lambda}{\gamma}}
3928: \Phi_{- \frac{\log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda}{\gamma}}}\!\circ\!M'
3929: \biggr)
3930: \biggr] \biggr\}\qquad}\\
3931: + \Bigl( 1 - \tfrac{\lambda}{\gamma} \Bigr) \log( \tfrac{2}{\eta}).
3932: \end{multline*}
3933: \end{prop}
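To see where the constants of this proposition come from, here is a sketch of the
computation, with the shorthand $\beta = \frac{N \lambda}{\gamma} \sinh(\frac{\gamma}{N})$
and $L = \log \bigl\{ \pi_{\exp( - \beta R)}^{\otimes 2}
\bigl[ \exp( \alpha \Phi_{- \frac{\alpha}{N}}\!\circ\!M') \bigr] \bigr\}$
(the letter $L$ is used only in this remark). With these choices,
$\frac{\beta \gamma}{N \sinh(\frac{\gamma}{N})} = \lambda$ and
$\frac{N}{\alpha} \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] = 1 - \frac{\lambda}{\gamma}$,
so that the previous theorem, applied at level $\eta/2$, gives for any posterior
distribution $\rho$
$$
\Bigl( 1 - \tfrac{\lambda}{\gamma} \Bigr) \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]
\leq \C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr]
+ \tfrac{\lambda}{\gamma} \bigl[ C(\beta, \gamma) + \log(\tfrac{2}{\eta}) \bigr].
$$
Using this inequality twice in \eqref{eq1.1.24} (itself taken at level $\eta/2$),
once for a general $\rho$ and once for $\rho = \pi_{\exp( - \lambda r)}$, for which
the first term of the right-hand side vanishes, the quantity under the supremum
is bounded by
$$
\tfrac{2 \lambda}{\gamma} \bigl[ C(\beta, \gamma) + \log(\tfrac{2}{\eta}) \bigr]
+ \Bigl( 1 - \tfrac{\lambda}{\gamma} \Bigr) \bigl[ L + \log(\tfrac{2}{\eta}) \bigr],
$$
the two entropy terms $\C{K}\bigl[ \rho, \pi_{\exp( - \lambda r)} \bigr]$ cancelling
each other. This bound no longer depends on $\rho$, and the condition
$\lambda < \gamma$ is what makes $\beta < N \sinh(\tfrac{\gamma}{N})$.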
3934:
3935: We are now ready to analyse the bound $B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma)$ of
3936: Theorem \ref{thm1.1.43} on page \pageref{thm1.1.43}.
3937: \begin{thm}\mypoint
3938: \label{thm1.1.52}
3939: For any positive real constants $\lambda_1$, $\lambda_2$, $\beta_1$,
3940: $\beta_2$, $\beta$ and $\gamma$, such that
3941: \begin{align*}
3942: \lambda_1 & < \gamma,&
3943: \beta_1 & < \tfrac{N \lambda_1}{\gamma} \sinh(\tfrac{\gamma}{N}),\\
3944: \lambda_2 & < \gamma, & \beta_2 & > \tfrac{N \lambda_2}{\gamma} \sinh(\tfrac{\gamma}{N}),\\
3945: & & \beta & < \tfrac{N \lambda_2}{\gamma} \tanh(\tfrac{\gamma}{N}),
3946: \end{align*}
with $\PP$ probability at least $1 - \eta$, the bound
3948: $B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma)$
3949: of Theorem \ref{thm1.1.43} on page \pageref{thm1.1.43} satisfies
3950: \begin{multline*}
3951: B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma) \\ \leq
3952: (\gamma - \lambda_1) \Biggl\{ \pi_{\exp( - \beta_1 R)} \otimes
3953: \pi_{\exp( - \beta_2 R)} \bigl[ \Psi_{- \frac{\gamma}{N}} (R',M') \bigr]
3954: + \frac{\log(\frac{7}{\eta})}{\gamma} \\*
3955: \shoveright{+ \frac{C(\beta_1, \gamma) + \log( \frac{7}{\eta})}{
3956: \frac{N \lambda_1}{\beta_1} \sinh(\frac{\gamma}{N}) - \gamma}
3957: + \frac{C(\beta_2, \gamma)+ \log(\frac{7}{\eta})}{\gamma -
3958: \frac{N\lambda_2}{\beta_2} \sinh( \frac{\gamma}{N})}
3959: \Biggr\}} \\*
3960: \qquad+ \frac{2 \lambda_1}{\gamma}
3961: \Bigl[ C \bigl(\tfrac{N \lambda_1}{\gamma} \sinh(\tfrac{\gamma}{N}), \gamma\bigr)
3962: + \log(\tfrac{7}{\eta}) \Bigr] \\*
3963: \shoveleft{\qquad + \left( 1 - \tfrac{\lambda_1}{\gamma} \right)
3964: \log \biggl\{ \pi_{\exp [ - \frac{N \lambda_1}{\gamma} \sinh(\frac{\gamma}{N})
3965: R]}^{\otimes 2} \biggl[}\\\shoveright{ \exp \biggl( \tfrac{N \log [ \cosh(\frac{\gamma}{N})] }{1
3966: - \frac{\lambda_1}{\gamma}} \Phi_{- \frac{\log[\cosh(\frac{\gamma}{N})]}{1
3967: - \frac{\lambda_1}{\gamma}}}\!\circ\!M'\biggr)\biggr] \biggr\} }
3968: \\* + \Bigl( 1 - \tfrac{\lambda_1}{\gamma} \Bigr)
3969: \log(\tfrac{7}{\eta}) - \log\bigl[ \nu(\{\beta\}) \nu(\{\gamma\})\epsilon
3970: \bigr]\\*
3971: \shoveleft{\qquad+ (\gamma - \lambda_1) \tfrac{\beta}{\lambda_2}
3972: F_{\gamma, \frac{\beta \gamma}{\lambda_2}}^{-1} \Biggl\{
3973: \frac{2 \lambda_2}{\gamma}
3974: \Bigl[ C \bigl( \tfrac{N \lambda_2}{\gamma} \sinh(\tfrac{\gamma}{N}), \gamma \bigr)
3975: + \log \bigl( \tfrac{7}{\eta}\bigr) \Bigr]}\\*
3976: \shoveleft{\qquad \qquad + \Bigl( 1 - \tfrac{\lambda_2}{\gamma}
3977: \Bigr)
3978: \log \biggl\{
3979: \pi_{\exp[ - \frac{N \lambda_2}{\gamma} \sinh(\frac{\gamma}{N})R]}^{\otimes 2}
3980: \biggl[}\\
3981: \exp \biggl( \frac{N\log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda_2}{\gamma}}
3982: \Phi_{- \frac{\log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda_2}{\gamma}}}\!\circ\!M'
3983: \biggr) \biggr] \biggr\} \\* + \Bigl(1 - \tfrac{\lambda_2}{\gamma} \Bigr)
3984: \log\bigl(\tfrac{7}{\eta}\bigr) - \log\bigl[\nu(\{\beta\}) \nu(\{\gamma\})\epsilon\bigr]
3985: \Biggr\},
3986: \end{multline*}
3987: where the function $C(\beta, \gamma)$ is defined by equation \eqref{eq1.1.23}
3988: on page \pageref{eq1.1.23}.
3989: \end{thm}
3990: \subsubsection{Adaptation to parametric and margin assumptions}
3991: To help understand the previous theorem, it may be useful to
3992: give linear upper-bounds to the factors appearing in the
3993: right-hand side of the previous inequality.
3994: Introducing $\T$ such that $R(\T) = \inf_{\Theta} R$
3995: (assuming that such a parameter exists) and remembering that
3996: \begin{align*}
3997: \Psi_{-a}(p,m) & \leq a^{-1} \sinh(a) p + 2 a^{-1} \sinh(\tfrac{a}{2})^2 m, & a \in \RR_+,\\
3998: \Phi_{-a}(p) & \leq a^{-1} \bigl[ \exp(a)-1 \bigr] p, & a \in \RR_+,\\
3999: \Psi_{a}(p,m) & \geq a^{-1} \sinh(a) p - 2a^{-1}\sinh(\tfrac{a}{2})^2 m, & a \in \RR_+,\\
4000: M'(\theta_1, \theta_2) & \leq M'(\theta_1, \T) + M'(\theta_2, \T), & \theta_1, \theta_2
4001: \in \Theta,\\
4002: M'(\theta_1, \T) & \leq x R'(\theta_1, \T) + \varphi(x), & x \in \RR_+, \theta_1 \in
4003: \Theta,
4004: \end{align*}
4005: (the last inequality being rather
4006: a consequence of the definition of $\varphi$ than a property of $M'$),
4007: we easily see that
4008: \begin{multline*}
4009: \pi_{\exp( - \beta_1 R)}\otimes \pi_{\exp( - \beta_2 R)}
4010: \bigl[ \Psi_{- \frac{\gamma}{N}}(R',M') \bigr]
4011: \\\shoveleft{\quad \leq
4012: \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})
4013: \bigl[ \pi_{\exp( - \beta_1 R)}(R) - \pi_{\exp( - \beta_2 R)}(R) \bigr]}
4014: \\\shoveright{+ \tfrac{2N}{\gamma}\sinh(\tfrac{\gamma}{2N})^{2}
4015: \pi_{\exp( - \beta_1 R)} \otimes \pi_{\exp( - \beta_2 R)}
4016: (M')\qquad} \\
4017: \shoveleft{\quad\leq \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) \bigl[ \pi_{
4018: \exp( - \beta_1 R)}(R) -
4019: \pi_{\exp( - \beta_2 R)}(R) \bigr]} \\
4020: \qquad + \frac{2xN}{\gamma} \sinh(\tfrac{\gamma}{2N})^{2} \Bigl\{
4021: \pi_{\exp( - \beta_1 R)}\bigl[ R'(\cdot, \T) \bigr] +
4022: \pi_{\exp( - \beta_2 R)} \bigl[ R'(\cdot, \T) \bigr] \Bigr\}
4023: \\ + \frac{4N}{\gamma} \sinh(\tfrac{\gamma}{2N})^2 \varphi(x).
4024: \end{multline*}
4025: \begin{multline*}
4026: C(\beta, \gamma) \leq
4027: \log \biggl\{ \pi_{\exp( - \beta R)} \Bigl\{ \exp \Bigl[
4028: 2 N \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^{2} \pi_{\exp( - \beta R)}(M') \Bigr] \Bigr\}
4029: \biggr\} \\\shoveleft{\qquad\qquad\leq
4030: \log \biggl\{ \pi_{\exp( - \beta R)} \Bigl\{ \exp \Bigl[
4031: 2 N \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^{2} M'(\cdot, \T) \Bigr] \Bigr\}
4032: \biggr\}} \\\shoveright{ + 2N\sinh(\tfrac{\gamma}{2N})^{2} \pi_{\exp( - \beta R)}
4033: \bigl[ M'(\cdot, \T)\bigr]}\\
4034: \shoveleft{\qquad\qquad \leq \log \biggl\{ \pi_{\exp( - \beta R)} \Bigl\{ \exp \Bigl[
4035: 2 x N \sinh(\tfrac{\gamma}{2N})^{2} R'( \cdot, \T) \Bigr] \Bigr\} \biggr\}}
4036: \\\shoveright{+ 2 x N \sinh(\tfrac{\gamma}{2N})^{2} \pi_{\exp( - \beta R)}
4037: \bigl[ R'(\cdot, \T) \bigr] + 4 N \sinh(\tfrac{\gamma}{2N})^{2}
4038: \varphi(x)}\\
4039: \shoveleft{\qquad\qquad = \int_{\beta - 2xN\sinh(\frac{\gamma}{2N})^2}^{\beta}
4040: \pi_{\exp( - \alpha R)}\bigl[ R'(\cdot, \T) \bigr] d \alpha}\\
4041: \shoveright{+ 2 x N \sinh(\tfrac{\gamma}{2N})^{2} \pi_{\exp( - \beta R)}
4042: \bigl[ R'(\cdot, \T) \bigr] + 4 N \sinh(\tfrac{\gamma}{2N})^{2}
4043: \varphi(x)}\\
4044: \shoveleft{\qquad \qquad \leq 4xN\sinh(\tfrac{\gamma}{2N})^2 \pi_{\exp[ - (\beta - 2 x N
4045: \sinh(\frac{\gamma}{2N})^2)R]}\bigl[ R'(\cdot, \T) \bigr]
4046: }\\ + 4 N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x).
4047: \end{multline*}
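The equality and the last inequality of the previous chain can be checked as
follows. This is a sketch, assuming as in the first display of this subsection
that $R'(\theta, \T) = R(\theta) - R(\T)$, and writing
$s = 2 x N \sinh(\tfrac{\gamma}{2N})^2$ for short (a notation used only here).
Since $\frac{d}{d \alpha} \log \bigl\{ \pi \bigl[ \exp( - \alpha R) \bigr] \bigr\}
= - \pi_{\exp( - \alpha R)}(R)$,
\begin{align*}
\log \Bigl\{ \pi_{\exp( - \beta R)} \Bigl[ \exp \bigl( s R'(\cdot, \T) \bigr) \Bigr] \Bigr\}
& = \log \bigl\{ \pi \bigl[ \exp \bigl( - (\beta - s) R \bigr) \bigr] \bigr\}
- \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\} - s R(\T) \\
& = \int_{\beta - s}^{\beta} \pi_{\exp( - \alpha R)}(R) \, d \alpha - s R(\T)
= \int_{\beta - s}^{\beta} \pi_{\exp( - \alpha R)} \bigl[ R'(\cdot, \T) \bigr] d \alpha.
\end{align*}
Moreover $\alpha \mapsto \pi_{\exp( - \alpha R)}(R)$ is non-increasing (its derivative
is minus the variance of $R$ under $\pi_{\exp( - \alpha R)}$), so that both the
integral and the term $s \, \pi_{\exp( - \beta R)} \bigl[ R'(\cdot, \T) \bigr]$ are bounded
by $s \, \pi_{\exp[ - (\beta - s) R]} \bigl[ R'(\cdot, \T) \bigr]$, which accounts for the
factor $4 x N \sinh(\tfrac{\gamma}{2N})^2 = 2 s$ in the last line.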
4048:
4049: \begin{multline*}
4050: \log \Bigl\{ \pi_{\exp( - \beta R)}^{\otimes 2} \Bigl[
4051: \exp \Bigl( N \alpha \Phi_{- \alpha} \!\circ\!M' \Bigr) \Bigr] \Bigr\}
4052: \\ \leq 2 \log \Bigl\{ \pi_{\exp( - \beta R)} \Bigl[ \exp \Bigl( N
4053: \bigl[ \exp( \alpha) - 1 \bigr] M'(\cdot, \T) \Bigr) \Bigr] \Bigr\}
4054: \\ \leq 2 x N \bigl[ \exp( \alpha) - 1\bigr]
4055: \pi_{\exp[ - (\beta - x N [\exp(\alpha) - 1]) R]} \bigl[ R'(\cdot, \T) \bigr]
\\* + 2 N \bigl[ \exp( \alpha) - 1 \bigr] \varphi(x).
4057: \end{multline*}
4058:
4059: Let us push further the investigation under the parametric
4060: assumption that for some positive real constant $d$
4061: \begin{equation}
4062: \label{parametric}
4063: \lim_{\beta \rightarrow + \infty} \beta \pi_{\exp( - \beta R)}\bigl[ R'( \cdot,
\T) \bigr] = d.
4065: \end{equation}
4066: This assumption will for instance hold true
4067: with $d = \frac{n}{2}$ when $R : \Theta \rightarrow (0,1)$
4068: is a smooth function defined on a compact subset $\Theta$ of $\RR^n$ that
4069: reaches its minimum value on a finite number of non degenerate (i.e. with
4070: a positive definite Hessian) interior points of $\Theta$, and $\pi$
4071: is absolutely continuous with respect to the
4072: Lebesgue measure on $\Theta$ and has a smooth density.
4073:
4074: In case of assumption \eqref{parametric}, if we restrict to sufficiently large values of the
4075: constants $\beta$, $\beta_1$, $\beta_2$, $\lambda_1$, $\lambda_2$ and $\gamma$
(the smallest of which is, as a rule, $\beta$, as we will see), we can
4077: use the fact that for some (small) positive constant $\delta$, and
4078: some (large) positive constant $A$,
4079: \begin{equation}
4080: \label{eq1.1.25}
4081: \frac{d}{\alpha}(1 - \delta) \leq \pi_{\exp(- \alpha R)}\bigl[ R'(\cdot, \T)
4082: \bigr] \leq
4083: \frac{d}{\alpha}(1 + \delta), \qquad \alpha \geq A.
4084: \end{equation}
4085: Under this assumption,
4086: \begin{multline*}
4087: \pi_{\exp( - \beta_1 R)} \otimes \pi_{\exp( - \beta_2 R)}
4088: \bigl[ \Psi_{- \frac{\gamma}{N}}(R', M') \bigr]
4089: \\ \leq \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N})
4090: \bigl[ \tfrac{d}{\beta_1}(1 + \delta) - \tfrac{d}{\beta_2}(1 - \delta) \bigr]
4091: \qquad \qquad \\ \shoveright{+ \tfrac{2 x N}{\gamma}
4092: \sinh(\tfrac{\gamma}{2N})^2 (1 + \delta)
4093: \bigl[ \tfrac{d}{\beta_1}
4094: + \tfrac{d}{\beta_2} \bigr] + \tfrac{4N}{\gamma} \sinh(\tfrac{\gamma}{2N})^2
4095: \varphi(x).}
4096: \\
4097: \shoveleft{C(\beta, \gamma) \leq d(1 + \delta) \log \Bigl( \tfrac{\beta}{\beta -
4098: 2xN\sinh(\frac{\gamma}{2N})^2} \Bigr)} \\
4099: \shoveright{+ 2 x N \sinh(\tfrac{\gamma}{2N})^2
4100: \tfrac{(1 + \delta)d}{\beta} + 4N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x).}\\
4101: \shoveleft{\log \Bigl\{ \pi_{\exp( - \beta R)}^{\otimes 2}
4102: \Bigl[ \exp \Bigl( N \alpha \Phi_{- \alpha}\!\circ\!M' \Bigr) \Bigr] \Bigr\}
4103: } \\ \leq 2xN\bigl[ \exp( \alpha) - 1 \bigr] \frac{d(1 + \delta)}{ \beta -
4104: x N [\exp(\alpha) - 1]} + 2 N \bigl[ \exp( \alpha) - 1 \bigr] \varphi(x).
4105: \end{multline*}
4106: Thus with $\PP$ probability at least $1 - \eta$,
4107: \begin{multline*}
4108: B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma)
4109: \leq - (\gamma - \lambda_1) \tfrac{N}{\gamma}
4110: \sinh(\tfrac{\gamma}{N}) \tfrac{d}{\beta_2}( 1
4111: - \delta)
4112: \\ \shoveleft{+
4113: (\gamma - \lambda_1) \biggl\{
4114: \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) \tfrac{(1+\delta)d}{\beta_1}
4115: }\\*\shoveright{+ \tfrac{2xN}{\gamma} \sinh(\tfrac{\gamma}{2N})^2(1+\delta) \bigl[ \tfrac{d}{\beta_1}
4116: + \tfrac{d}{\beta_2} \bigr]
4117: + \tfrac{4N}{\gamma} \sinh(\tfrac{\gamma}{2N})^2 \varphi(x)
4118: + \frac{\log(\tfrac{7}{\eta})}{\gamma}}\\
4119: + \frac{4xN\sinh(\tfrac{\gamma}{2N})^2 \tfrac{(1+\delta)d}{\beta_1 -
4120: 2xN\sinh(\frac{\gamma}{2N})^2} + 4 N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x)
4121: + \log(\frac{7}{\eta})}{\frac{N\lambda_1}{\beta_1}\sinh(\frac{\gamma}{N}) -
4122: \gamma}\\
4123: \shoveright{+ \frac{4xN\sinh(\tfrac{\gamma}{2N})^2 \tfrac{(1+\delta)d}{\beta_2 -
4124: 2xN\sinh(\frac{\gamma}{2N})^2} + 4 N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x)
4125: + \log(\frac{7}{\eta})}{\gamma - \frac{N\lambda_2}{\beta_2}\sinh(\frac{\gamma}{N})}
4126: \biggr\}}
4127: \\ \shoveleft{+
4128: \frac{2 \lambda_1}{\gamma}
4129: \biggl\{ 4xN\sinh(\tfrac{\gamma}{2N})^2 \tfrac{(1+\delta)d}{\tfrac{N\lambda_1}{\gamma}
4130: \sinh(\tfrac{\gamma}{N}) -
4131: 2xN\sinh(\frac{\gamma}{2N})^2}}\\
4132: \shoveright{ + 4 N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x)
4133: + \log(\tfrac{7}{\eta}) \biggr\}}\\
4134: \shoveleft{+ \Bigl( 1 - \frac{\lambda_1}{\gamma} \Bigr) \Biggl\{
4135: 2 d(1+\delta) \Biggl( \tfrac{\lambda_1\sinh\bigl(\tfrac{\gamma}{N}\bigr)}{x \gamma
4136: \Bigl[ \exp\Bigl(\frac{\log[\cosh(\frac{\gamma}{N})]}{1-\frac{\lambda_1}{\gamma}}
4137: \Bigr)-1
4138: \Bigr]}-1 \Biggr)^{-1}}\\\shoveright{ + 2N\Bigl[ \exp \Bigl( \tfrac{\log[\cosh(\frac{\gamma}{N})]}{1 -
4139: \frac{\lambda_1}{\gamma}} \Bigr) - 1 \Bigr] \varphi(x)
4140: \Biggr\}}\\
4141: + \Bigl(1 - \tfrac{\lambda_1}{\gamma} \Bigr)
4142: \log(\tfrac{7}{\eta}) - \log\bigl[ \nu(\{\beta\}) \nu(\{\gamma\}) \epsilon\bigr]\\
4143: \shoveleft{+ \frac{1 - \frac{\lambda_1}{\gamma}}{ \frac{N \lambda_2}{\beta \gamma}
4144: \tanh(\frac{\gamma}{N}) - 1} \Biggl\{
4145: \frac{2 \lambda_2}{\gamma}
4146: \biggl\{ 4xN\sinh(\tfrac{\gamma}{2N})^2 \tfrac{(1+\delta)d}{\tfrac{N\lambda_2}{\gamma}
4147: \sinh(\tfrac{\gamma}{N}) -
4148: 2xN\sinh(\frac{\gamma}{2N})^2}}\\
4149: \shoveright{+ 4 N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x)
4150: + \log(\tfrac{7}{\eta}) \biggr\}}\\
4151: \shoveleft{+ \Bigl( 1 - \frac{\lambda_2}{\gamma} \Bigr) \Biggl[
4152: 2 d(1+\delta) \Biggl( \tfrac{\lambda_2\sinh\bigl(\tfrac{\gamma}{N}\bigr)}{x \gamma
4153: \Bigl[ \exp\Bigl(\frac{\log[\cosh(\frac{\gamma}{N})]}{1-\frac{\lambda_2}{\gamma}}
4154: \Bigr)-1
4155: \Bigr]}-1 \Biggr)^{-1}} \\
4156: \shoveright{+ 2N\Bigl[ \exp \Bigl( \tfrac{\log[\cosh(\frac{\gamma}{N})]}{1 -
4157: \frac{\lambda_2}{\gamma}} \Bigr) - 1 \Bigr] \varphi(x)
4158: \Biggr]\qquad\quad}\\
4159: + \Bigl(1 - \tfrac{\lambda_2}{\gamma} \Bigr)
\log(\tfrac{7}{\eta}) - \log\bigl[ \nu(\{\beta\}) \nu(\{\gamma\}) \epsilon\bigr]
4161: \Biggr\}.
4162: \end{multline*}
4163:
4164: Now let us choose for simplicity
4165: $\beta_2 = 2 \lambda_2 = 4 \beta$, $\beta_1 = \lambda_1 / 2 = \gamma / 4$,
4166: and let us introduce the notations
4167: \begin{align*}
4168: C_1 & = \frac{N}{\gamma}\sinh(\frac{\gamma}{N}),\\
4169: C_2 & = \frac{N}{\gamma} \tanh(\frac{\gamma}{N}),\\
4170: C_3 & = \frac{N^2}{\gamma^2}
4171: \bigl[ \exp( \frac{\gamma^2}{N^2} ) - 1 \bigr]\\
4172: \text{and }\quad
4173: C_4 & = \frac{2 N^2(1 - \frac{2 \beta}{\gamma})}{\gamma^2}
4174: \Bigl[ \exp \Bigl( \frac{\gamma^2}{2 N^2 (1 - \frac{2 \beta}{\gamma})}
4175: \Bigr) - 1 \Bigr],
4176: \end{align*}
4177: to obtain
4178: \begin{multline*}
4179: B(\pi_{\exp( - \lambda_1 r)}, \beta, \gamma) \leq
4180: - \frac{C_1 \gamma}{8 \beta} (1 - \delta)d
4181: \\ + \frac{C_1 \gamma}{2} \biggl\{
4182: \tfrac{4(1+\delta)d}{\gamma} + x \tfrac{\gamma}{2 N}(1+\delta)
4183: \bigl[ \tfrac{4 d}{\gamma} + \tfrac{d}{4\beta} \bigr]
4184: + \tfrac{\gamma}{N} \varphi(x) \biggr\} +
4185: \tfrac{1}{2} \log\bigl(\tfrac{7}{\eta}\bigr)\\*
4186: \qquad + \frac{1}{2C_1-1} \Bigl[(1+\delta) d \Bigl( \tfrac{N}{2xC_1\gamma} -1 \Bigr)^{-1}
4187: + C_1 \frac{\gamma^2}{2N} \varphi(x) + \tfrac{1}{2} \log(\tfrac{7}{\eta}) \Bigr]
4188: \\*\hfill \hfill \hfill + \frac{1}{2 - C_1} \biggl[ 2 (1+\delta)d \Bigl( \tfrac{8 N \beta}{x C_1 \gamma^2}
4189: - 1\Bigr)^{-1} + C_1 \frac{\gamma^2}{N} \varphi(x) + \log(\tfrac{7}{\eta}) \biggr]
4190: \hfill \\*
4191: \shoveright{+ \frac{2 x \gamma (1 + \delta) d}{N - x \gamma} + C_1 \tfrac{\gamma^2}{N} \varphi(x)
4192: + \log( \tfrac{7}{\eta})} \\*
4193: \shoveright{+ d(1+\delta)\frac{x \gamma}{N} \biggl( \frac{C_1}{2
4194: C_3 } - \frac{x \gamma}{N} \biggr)^{-1} + \frac{\gamma^2}{N} C_3
4195: \varphi(x) + \frac{\log(\frac{7}{\eta})}{2} -
4196: \log\bigl[ \nu(\beta) \nu(\gamma) \epsilon\bigr]}\\*
4197: \shoveleft{\qquad + \Bigl( 4 C_2 - 2\Bigr)^{-1}
4198: \Biggl\{ \frac{4 \beta}{\gamma} \biggl\{
4199: x \frac{\gamma^2}{N} C_1 (1 + \delta) d \Bigl(
4200: 2 \beta C_1 - x C_1 \frac{\gamma^2}{2N} \Bigr)^{-1}} \\\shoveright{
4201: + \tfrac{\gamma^2}{N} \varphi(x)
4202: + \log(\tfrac{7}{\eta})\biggr\}\quad }
4203: \\* \shoveleft{\qquad + \Bigl(1 - \frac{2 \beta}{\gamma} \Bigr) \biggl\{
4204: 2 d (1 + \delta) \frac{x \gamma}{N}
4205: \biggl[ \frac{4 \beta C_1}{
4206: \gamma C_4}\biggl(1 - \frac{2 \beta}{\gamma}\biggr) - \frac{x \gamma}{N}
4207: \biggr]^{-1}}\\ \shoveright{
4208: + \frac{\gamma^2}{N(1 - \frac{2 \beta}{\gamma})} C_4 \varphi(x)
4209: \biggr\}\quad }
4210: \\* + \Bigl( 1 - \tfrac{2 \beta}{\gamma} \Bigr) \log(\tfrac{7}{\eta}) - \log
4211: \bigl[ \nu(\beta) \nu(\gamma) \epsilon \bigr]
4212: \Biggr\}.
4213: \end{multline*}
4214: This simplifies to
4215: \begin{multline*}
4216: B( \pi_{\exp( - \lambda_1 r)}, \beta, \gamma) \leq
4217: - \frac{C_1}{8}(1- \delta)d \frac{\gamma}{\beta}
4218: \\ + 2 C_1(1 + \delta) d + \log(\tfrac{7}{\eta})
4219: \biggl[ 2 + \tfrac{3 C_1}{(4C_1-2)(2-C_1)}
4220: + \frac{ 1 + \frac{2 \beta}{\gamma}}{4C_2 - 2}
4221: \biggr] \\ \hfill - \bigl( 1 + \tfrac{1}{4 C_2 - 2} \bigr)
4222: \log\bigl[ \nu(\beta) \nu( \gamma) \epsilon\bigr]\qquad
4223: \\\qquad + \frac{(1 + \delta) d x \gamma}{N} \biggl\{
4224: C_1 + \tfrac{1}{2 C_1 - 1} \Bigl(
4225: \tfrac{1}{2C_1} - \tfrac{\gamma x}{N} \Bigr)^{-1}
4226: \hfill \\\hfill + 2 \Bigl( 1 - \tfrac{\gamma x}{N} \Bigr)^{-1}
4227: + \Bigl( \tfrac{C_1}{2 C_3} -
4228: \tfrac{\gamma x}{N} \Bigr)^{-1} + \tfrac{4C_1\beta}{\gamma(4C_2-2)}
4229: \biggr\}\qquad \\
4230: \qquad + \frac{(1 + \delta) d x \gamma^2}{N \beta} \biggl\{
4231: \tfrac{C_1}{16} + \tfrac{2}{2-C_1} \Bigl( \tfrac{8}{C_1} -
4232: \tfrac{x \gamma^2}{N \beta} \Bigr)^{-1} \hfill \\
4233: \hfill +
4234: \Bigl(1 - \tfrac{2 \beta}{\gamma} \Bigr) \tfrac{1}{2C_2 -1}
4235: \Bigl[ \tfrac{4C_1}{C_4}\Bigl(1 - \tfrac{2 \beta}{\gamma}\Bigr)
4236: - \tfrac{\gamma^2 x}{\beta N} \Bigr]^{-1}
4237: \biggr\} \qquad
4238: \\
4239: + \frac{\gamma^2}{N} \varphi(x) \biggl\{
4240: \tfrac{3 C_1}{2} + \tfrac{C_1}{4C_1 - 2} + \tfrac{C_1}{2 - C_1} + C_3
4241: + \tfrac{4 \beta}{\gamma( 4 C_2 - 2)} + \tfrac{C_4}{4 C_2 - 2}
4242: \biggr\}.
4243: \end{multline*}
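The prefactors appearing in the two previous displays can be traced back to the
following substitutions, listed here as a reading aid. They follow by direct
computation from the choices $\beta_2 = 2 \lambda_2 = 4 \beta$,
$\beta_1 = \lambda_1 / 2 = \gamma / 4$ and from the definitions of $C_1$ and $C_2$:
\begin{align*}
\gamma - \lambda_1 & = \tfrac{\gamma}{2}, &
1 - \tfrac{\lambda_1}{\gamma} & = \tfrac{1}{2}, \\
\tfrac{\beta}{\lambda_2} & = \tfrac{1}{2}, &
1 - \tfrac{\lambda_2}{\gamma} & = 1 - \tfrac{2 \beta}{\gamma},\\
\tfrac{N \lambda_1}{\beta_1} \sinh(\tfrac{\gamma}{N}) - \gamma & = \gamma (2 C_1 - 1), &
\gamma - \tfrac{N \lambda_2}{\beta_2} \sinh(\tfrac{\gamma}{N}) & = \tfrac{\gamma}{2} (2 - C_1), \\
\tfrac{N \lambda_2}{\beta \gamma} \tanh(\tfrac{\gamma}{N}) - 1 & = 2 C_2 - 1, &
\frac{1 - \frac{\lambda_1}{\gamma}}{\frac{N \lambda_2}{\beta \gamma}
\tanh(\frac{\gamma}{N}) - 1} & = \tfrac{1}{4 C_2 - 2}.
\end{align*}
In particular, the conditions of Theorem \ref{thm1.1.52} require here
$2 \beta < \gamma$, $C_1 < 2$ and $C_2 > \frac{1}{2}$; the last two hold as soon as
$\frac{\gamma}{N}$ is small enough, since $C_1$ and $C_2$ both tend to $1$ when
$\frac{\gamma}{N}$ tends to $0$.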
4244:
4245: This shows that there exist universal positive real constants $A_1$, $A_2$, $B_1$, $B_2$, $B_3$,
4246: and $B_4$
4247: such that as soon as $\frac{\gamma \max\{x, 1\}}{N} \leq A_1 \frac{\beta}{\gamma}
4248: \leq A_2$,
4249: \begin{multline*}
4250: B( \pi_{\exp( - \lambda_1 r) }, \beta, \gamma) \leq
4251: - B_1 (1 - \delta) d \frac{\gamma}{\beta} + B_2 (1 + \delta) d \\
4252: - B_3 \log\bigl[
4253: \nu(\beta) \nu(\gamma) \epsilon\,\eta\bigr]
4254: + B_4 \frac{\gamma^2}{N} \varphi(x).
4255: \end{multline*}
4256: Thus $\pi_{\exp( - \lambda_1 r)}(R)
4257: \leq \pi_{\exp( - \beta R)}(R) \leq \inf_{\Theta} R + \frac{ (1 + \delta) d}{\beta}$
4258: as soon as moreover
4259: $$
4260: \frac{\beta}{\gamma} \leq \frac{ B_1}{
4261: B_2\frac{(1 + \delta)}{(1 - \delta)} + \frac{B_4 \frac{\gamma^2}{N} \varphi(x)
4262: - B_3 \log[\nu(\beta) \nu(\gamma) \epsilon \eta]}{(1-\delta) d}}.
4263: $$
4264:
4265: Choosing some real ratio $\alpha > 1$,
4266: we can now make the above result uniform for any
4267: \begin{equation}
4268: \label{eq1.1.27}
4269: \beta, \gamma \in
4270: \Lambda_{\alpha} \overset{\text{def}}{=}
4271: \Bigl\{ \alpha^k ; k \in \NN, 0 \leq k < \tfrac{\log(N)}{\log(\alpha)} \Bigr\},
4272: \end{equation}
4273: by substituting $\nu(\beta)$ and $\nu(\gamma)$
4274: with $\frac{\log(\alpha)}{\log(\alpha N)}$ and $- \log(\eta)$ with
4275: $ - \log( \eta) + 2 \log \left[ \frac{\log( \alpha N)}{\log(\alpha)} \right]$.
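This substitution can be understood through a counting argument, sketched here
under the assumption that $\nu$ is the uniform probability measure on
$\Lambda_{\alpha}$. The number of integers $k$ such that
$0 \leq k < \frac{\log(N)}{\log(\alpha)}$ is at most
$\frac{\log(N)}{\log(\alpha)} + 1 = \frac{\log(\alpha N)}{\log(\alpha)}$, so that
$$
\nu(\{\beta\}) = \lvert \Lambda_{\alpha} \rvert^{-1}
\geq \frac{\log(\alpha)}{\log(\alpha N)}, \qquad \beta \in \Lambda_{\alpha}.
$$
A union bound over the at most
$\bigl[ \frac{\log(\alpha N)}{\log(\alpha)} \bigr]^2$ couples
$(\beta, \gamma) \in \Lambda_{\alpha}^2$ then amounts to replacing $\eta$ with
$\eta \bigl[ \frac{\log(\alpha)}{\log(\alpha N)} \bigr]^2$, which is the stated change
of $- \log(\eta)$.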
4276:
4277: Taking moreover for simplicity $\eta = \epsilon$,
let us summarize the type of result we have obtained by
4279: \begin{thm}
4280: \mypoint
4281: \label{thm1.50}
4282: There exist positive real universal constants
4283: $A$, $B_1$, $B_2$, $B_3$ and $B_4$ such that
4284: for any positive real constants $\alpha > 1$, $d$ and $\delta$, for any
4285: prior distribution $\pi \in \C{M}_+^1(\Theta)$,
4286: with
4287: $\PP$ probability at least $1 - \epsilon$,
4288: for any $\beta, \gamma
4289: \in \Lambda_{\alpha}$ (where $\Lambda_{\alpha}$ is defined by equation
4290: \eqref{eq1.1.27} above) such that
4291: $$
4292: \sup_{\beta' \in \RR, \beta' \geq \beta}
4293: \biggl\lvert \frac{\beta'}{d} \bigl[
4294: \pi_{\exp( - \beta' R)}(R) - \inf_{\Theta} R \bigr] - 1 \biggr\rvert
4295: \leq \delta
4296: $$
4297: and such that also for some positive real parameter $x$
4298: $$
4299: \frac{\gamma \max\{x, 1\}}{N} \leq \frac{A \beta}{\gamma} \text{ and }
4300: \frac{\beta}{\gamma} \leq
4301: \frac{B_1}{B_2 \frac{(1 + \delta)}{(1 - \delta)}
4302: + \frac{ B_4 \frac{\gamma^2}{N}\varphi(x) - 2 B_3 \log(\epsilon) + 4
4303: B_3 \log \bigl[ \frac{\log(N)}{\log(\alpha)}\bigr]}{(1 - \delta) d}},
4304: $$
4305: the bound $B(\pi_{\exp( - \frac{\gamma}{2} r)}, \beta, \gamma)$
4306: given by Theorem \ref{thm1.1.43} on page \pageref{thm1.1.43}
4307: in the case where we have chosen $\nu$
4308: to be the uniform probability measure on $\Lambda_{\alpha}$,
4309: satisfies
4310: $B(\pi_{\exp( - \frac{\gamma}{2} r)}, \beta, \gamma)
4311: \leq 0,$ proving that $\w{\beta}(\pi_{\exp( - \frac{\gamma}{2} r)})
4312: \geq \beta$ and therefore that
4313: $$
4314: \pi_{\exp( - \gamma \frac{r}{2} )}(R) \leq \pi_{\exp ( - \beta R)}(R)
4315: \leq \inf_{\Theta} R + \frac{(1 + \delta) d}{\beta}.
4316: $$
4317: \end{thm}
What is important in this result is that we bound not only
$\pi_{\exp( - \frac{\gamma}{2} r)}(R)$, but also
4320: $B(\pi_{\exp( - \frac{\gamma}{2} r)}, \beta, \gamma)$,
4321: and that we do it uniformly on a grid of values of $\beta$ and
4322: $\gamma$, showing that we can indeed
4323: set the constants $\beta$ and $\gamma$
4324: adaptively using the empirical bound
4325: $B( \pi_{\exp( - \frac{\gamma}{2} r)}, \beta, \gamma)$.
4326:
4327: Let us see what we get under the margin assumption \eqref{eq1.1.17Bis}
4328: (see page \pageref{eq1.1.17Bis}).
4329: When $\kappa = 1$, $\varphi(c^{-1}) \leq 0$, leading to
4330: \begin{cor}\mypoint
4331: Assuming that the margin
assumption \eqref{eq1.1.17Bis} (on page \pageref{eq1.1.17Bis}) is
4333: satisfied for $\kappa = 1$, that $R : \Theta \rightarrow (0,1)$
4334: is independent of $N$ (which is the case for instance when
4335: $\PP = P^{\otimes N}$), and is such that
4336: $$
4337: \lim_{\beta' \rightarrow + \infty} \beta'
4338: \bigl[ \pi_{\exp( - \beta'
4339: R)}(R) - \inf_{\Theta} R \bigr] = d,
4340: $$
4341: there are universal positive real constants
4342: $B_5$ and $B_6$
4343: and $N_1 \in \NN$
4344: such that
4345: for any $N \geq N_1$,
4346: with $\PP$ probability at least $1 - \epsilon$
4347: $$
4348: \pi_{\exp( - \widehat{\gamma}\frac{r}{2} )}(R) \leq
4349: \inf_{\Theta} R + \frac{ B_5 d}{c N}
4350: \left[1 + \frac{B_6}{d} \log \biggl( \frac{\log(N)}{
4351: \epsilon } \biggr) \right]^2,
4352: $$
4353: where $\w{\gamma} \in \arg\max_{\gamma \in \Lambda_2} \max \bigl\{ \beta \in \Lambda_2
4354: ; B(\pi_{\exp( - \gamma \frac{r}{2})}, \beta, \gamma) \leq 0 \bigr\}$
4355: (where $\Lambda_2$ is defined by equation \eqref{eq1.1.27} on page \pageref{eq1.1.27}).
4356: \end{cor}
4357: When $\kappa > 1$, $\varphi(x) \leq (1 - \kappa^{-1}) \bigl( \kappa c x \bigr)^{-
4358: \frac{1}{\kappa -1}}$, and we can choose $\gamma$ and $x$ such that
4359: $\frac{\gamma^2}{N} \varphi(x) \simeq d$ to prove
4360: \begin{cor}\mypoint
4361: \label{cor1.52}
4362: Assuming that the margin assumption \eqref{eq1.1.17Bis} is satisfied
4363: for some exponent $\kappa > 1$, that $R : \Theta \rightarrow (0,1)$
4364: is independent of $N$ (which is for instance the case when
4365: $\PP = P^{\otimes N}$), and is such that
4366: $$
4367: \lim_{\beta' \rightarrow + \infty} \beta'
4368: \bigl[ \pi_{\exp ( - \beta' R)}(R) - \inf_{\Theta} R \bigr] = d,
4369: $$
4370: there are universal positive constants
4371: $B_7$ and $B_8$
4372: and $N_1 \in \NN$ such that for any $N \geq N_1$, with $\PP$
4373: probability at least $1 - \epsilon$,
4374: $$
4375: \pi_{\exp( - \widehat{\gamma} \frac{r}{2} )}(R)
4376: \leq \inf_{\Theta} R + B_7
4377: c^{ - \frac{1}{2 \kappa -1}}
4378: \biggl[ 1 + \frac{B_8}{d} \log
4379: \biggl( \frac{\log(N)}{\epsilon} \biggr)
4380: \biggr]^{\frac{2 \kappa}{2 \kappa - 1}} \left(
4381: \frac{d}{N} \right)^{ \frac{\kappa}{2 \kappa - 1}},
4382: $$
4383: where $\widehat{\gamma} \in \arg \max_{\gamma \in \Lambda_2}
4384: \max \bigl\{ \beta \in \Lambda_2; B(\pi_{\exp( - \gamma \frac{r}{2})},
4385: \beta, \gamma) \leq 0 \bigr\}$ ($\Lambda_2$ being defined by equation \eqref{eq1.1.27}
4386: on page \pageref{eq1.1.27}).
4387: \end{cor}
4388: We find the same rate of convergence as in Corollary
4389: \ref{cor1.1.23} on page \pageref{cor1.1.23}, but this
4390: time, we were able to provide an empirical posterior distribution
4391: $\pi_{\exp( - \w{\gamma} \frac{r}{2})}$
4392: which achieves this rate adaptively in all the parameters
4393: (meaning in particular that we do not need to know $d$,
4394: $c$ or $\kappa$). Moreover, as
4395: already mentioned, the power
4396: of $N$ in this rate of convergence is known to be unimprovable
in the worst case (see \cite{Mammen,Tsybakov,Tsybakov2}, and
more specifically \cite[Theorem 3.3 on page 132]{Audibert2}, which is
downloadable from its author's web page).
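
As a quick consistency check on the exponents (a remark added for
orientation, using only the two corollaries above), one may formally
put $\kappa = 1$ in the bound of Corollary \ref{cor1.52}, although this
corollary is stated for $\kappa > 1$ only. In this case
$$
c^{-\frac{1}{2 \kappa - 1}} = c^{-1}, \qquad
\frac{2 \kappa}{2 \kappa - 1} = 2, \qquad
\left( \frac{d}{N} \right)^{\frac{\kappa}{2 \kappa - 1}} = \frac{d}{N},
$$
so that the right-hand side takes the form
$\inf_{\Theta} R + \frac{B_7 d}{c N} \bigl[ 1 + \frac{B_8}{d}
\log \bigl( \frac{\log(N)}{\epsilon} \bigr) \bigr]^2$, which is
the form of the bound obtained when $\kappa = 1$. At the other end,
$\frac{\kappa}{2 \kappa - 1}$ decreases to $\frac{1}{2}$ when $\kappa$
goes to $+ \infty$, so that the power of $N$ ranges
between $N^{-1}$ and $N^{-1/2}$.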
4400:
4401: \subsubsection{Estimating the divergence of a posterior
4402: with respect to a Gibbs prior}
4403: Another interesting question is to estimate
4404: $\C{K} \bigl[ \rho, \pi_{\exp ( - \beta R)} \bigr]$
4405: using relative deviation inequalities.
We follow here an idea first found
in Audibert \cite[page 93]{Audibert2}.
4408: Indeed, combining equation \eqref{eq1.1.17} with
4409: equation \eqref{eq1.1.16} on page \pageref{eq1.1.16}, we see that
4410: for any positive real parameters $\beta$ and $\lambda$,
4411: with $\PP$ probability at least $1 - \epsilon$, for any
4412: posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
4413: \begin{multline*}
4414: \C{K}\bigl[\rho, \pi_{\exp( - \beta R)}\bigr]
4415: \leq \frac{\beta}{N \lambda} \biggl\{
4416: \frac{N}{2} \log\left( \frac{1 + \lambda}{1 - \lambda}\right)
4417: \bigl[ \rho(r) - \pi_{\exp(- \beta R)}(r) \bigr]
4418: \\ \hfill - \frac{N}{2} \log(1 - \lambda^2) \rho \otimes \pi_{\exp( - \beta R)}
4419: (m') \qquad \\\hfill + \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]
4420: - \log(\epsilon) \biggr\} + \C{K}(\rho, \pi) - \C{K} \bigl[
4421: \pi_{\exp( - \beta R)}, \pi \bigr]\quad
4422: \\ \leq \C{K} \bigl[ \rho, \pi_{\exp [ - \frac{\beta}{2\lambda}
4423: \log(\frac{1+\lambda}{1-\lambda}) r]} \bigr] + \frac{\beta}{N \lambda} \C{K}\bigl[
4424: \rho, \pi_{\exp( - \beta R)}\bigr] - \frac{\beta}{N \lambda}
4425: \log(\epsilon) \\ +
4426: \log \biggl[ \pi_{\exp [ - \frac{\beta}{2\lambda}\log(\frac{1+\lambda}{1-\lambda})r]}
4427: \Bigl\{ \exp \Bigl[ - \frac{\beta}{2 \lambda} \log(1 - \lambda^2)
4428: \rho(m')\Bigr] \Bigr\} \biggr].
4429: \end{multline*}
4430: Thus, putting $\gamma = \frac{N}{2} \log( \frac{1+\lambda}{1 - \lambda})$,
4431: we obtain
4432: \begin{thm}
4433: \mypoint
4434: \label{thm1.1.37}
4435: For any positive real constants $\beta$ and $\gamma$ such
4436: that $\beta < N \tanh ( \frac{\gamma}{N})$,
4437: with $\PP$ probability at least $1 - \epsilon$, for any
4438: posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
4439: \begin{multline*}
4440: \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)}\bigr]
4441: \leq \left( 1 - \frac{\beta}{N}\tanh\left(\frac{\gamma}{N}\right)^{-1}\right)^{-1}
4442: \\ \times \Biggl\{ \C{K}\bigl[ \rho, \pi_{\exp [ - \frac{\beta\gamma}{N}
4443: \tanh(\frac{\gamma}{N})^{-1}r]}
4444: \bigr] - \frac{\beta}{N \tanh(\frac{\gamma}{N})} \log(\epsilon)
4445: \\ + \log \Bigl\{ \pi_{\exp[ -
4446: \frac{\beta \gamma}{N} \tanh(\frac{\gamma}{N})^{-1} r]} \Bigl[
4447: \exp \bigl\{ \beta \tanh(\tfrac{\gamma}{N})^{-1} \log[\cosh(\tfrac{\gamma}{N})]
4448: \rho(m') \bigr\} \Bigr] \Bigr\} \Biggr\}.
4449: \end{multline*}
4450: \end{thm}
4451: This theorem provides another way of measuring overfitting,
4452: since it gives an upper bound for $\C{K}\bigl[
4453: \pi_{\exp[ - \frac{\beta \gamma}{N}
4454: \tanh(\frac{\gamma}{N})^{-1} r]}, \pi_{\exp( - \beta R)} \bigr]$.
4455: It may be used in combination with Theorem \ref{thm2.7}
4456: on page \pageref{thm2.7} as an alternative to Theorem
4457: \ref{thm1.1.17} on page \pageref{thm1.1.17}.
4458: It will also be used in the next section.
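
For the reader's convenience, here is a sketch of the change of
parameter leading from the preceding chain of inequalities to
Theorem \ref{thm1.1.37}. Putting
$\gamma = \frac{N}{2} \log \bigl( \frac{1 + \lambda}{1 - \lambda} \bigr)$
is the same as putting $\lambda = \tanh(\frac{\gamma}{N})$, and since
$1 - \tanh(a)^2 = \cosh(a)^{-2}$, we get
\begin{align*}
\frac{\beta}{2 \lambda} \log \Bigl( \frac{1 + \lambda}{1 - \lambda} \Bigr)
& = \frac{\beta \gamma}{N} \tanh(\tfrac{\gamma}{N})^{-1},\\
- \frac{\beta}{2 \lambda} \log ( 1 - \lambda^2 )
& = \beta \tanh(\tfrac{\gamma}{N})^{-1} \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr],\\
\frac{\beta}{N \lambda} & = \frac{\beta}{N} \tanh(\tfrac{\gamma}{N})^{-1}.
\end{align*}
It then remains to move the term
$\frac{\beta}{N \lambda} \C{K}\bigl[ \rho, \pi_{\exp( - \beta R)} \bigr]$
to the left-hand side and to divide by $1 - \frac{\beta}{N \lambda}$,
which is positive exactly when $\beta < N \tanh(\frac{\gamma}{N})$.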
4459:
4460: An alternative parametrization of the same result providing a simpler
4461: right-hand side is also useful:
4462: \begin{cor}
4463: For any positive real constants $\beta$ and $\gamma$ such that $
4464: \beta < \gamma$, with $\PP$ probability at least $1 - \epsilon$, for any
4465: posterior distribution $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$,
4466: \begin{multline*}
4467: \C{K}\bigl[ \rho, \pi_{\exp[ - N \frac{\beta}{\gamma} \tanh(\frac{\gamma}{N}) R]}
4468: \bigr] \leq \biggl(1 - \frac{\beta}{\gamma} \biggr)^{-1}
4469: \Biggl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \beta r)}\bigr] - \frac{\beta}{\gamma}
4470: \log( \epsilon) \\ +
4471: \log \Bigl\{ \pi_{\exp( - \beta r)} \Bigl[ \exp \bigl\{
4472: N \tfrac{\beta}{\gamma} \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] \rho
4473: (m') \bigr\} \Bigr] \Bigr\} \Biggr\}.
4474: \end{multline*}
4475: \end{cor}
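This corollary can be read off from Theorem \ref{thm1.1.37} by a
change of parameter, which we sketch here (the symbol $\beta'$ is
introduced only for this remark). Apply the theorem with $\beta$ replaced by
$\beta' = N \frac{\beta}{\gamma} \tanh(\frac{\gamma}{N})$. Then
$$
\frac{\beta' \gamma}{N} \tanh(\tfrac{\gamma}{N})^{-1} = \beta, \qquad
\frac{\beta'}{N \tanh(\frac{\gamma}{N})} = \frac{\beta}{\gamma}, \qquad
\beta' \tanh(\tfrac{\gamma}{N})^{-1} \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr]
= N \tfrac{\beta}{\gamma} \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr],
$$
and the constraint $\beta' < N \tanh(\frac{\gamma}{N})$ becomes
$\beta < \gamma$.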
4476:
4477: \subsubsection{Comparing two posterior distributions}
4478: Estimating the effective temperature of an estimator provides an efficient
4479: way to tune parameters in a model with a parametric behaviour. On the other
hand, it is not suited to choosing between different models, especially
when they are nested. Indeed, as we already saw in the case
when $\Theta$ is a union of nested models, the prior distribution $\pi_{\exp
( - \beta R)}$ does not provide an efficient localization of the parameter
in this case, in the sense that $\pi_{\exp( - \beta R)}(R)$
does not go down to $\inf_{\Theta} R$ at the desired rate when
$\beta$ goes to $+ \infty$, so that one has to resort to partial localization.
4487:
4488: Once some estimator (in the form of a posterior distribution) has been
4489: chosen in each submodel, these estimators can be compared between themselves
4490: with the help of the relative bounds that we will establish in this section.
4491:
4492: From equation \eqref{eq1.1.15} (slightly modified by replacing $\pi \otimes \pi$
4493: with $\pi^1 \otimes \pi^2$), we obtain easily
4494: \begin{thm}
4495: \mypoint
4496: \label{thm1.1.38}
4497: For any positive real constant $\lambda$,
4498: for any prior distributions $\pi^1, \pi^2 \in \C{M}_+^1(\Theta)$,
4499: with $\PP$ probability at least $1 - \epsilon$,
4500: for any posterior distributions $\rho_1$ and $\rho_2 :
4501: \Omega \rightarrow \C{M}_+^1(\Theta)$,
4502: \begin{multline*}
4503: - N \log \Bigl\{ 1 - \tanh\bigl( \tfrac{\lambda}{N} \bigr)
4504: \Bigl[ \rho_2(R) - \rho_1(R) \Bigr] \Bigr\}
4505: \leq \lambda \bigl[ \rho_2(r) - \rho_1(r) \bigr]
4506: \\ + N \log \bigl[ \cosh \bigl( \tfrac{\lambda}{N} \bigr) \bigr]
4507: \rho_1 \otimes \rho_2 (m') \\ + \C{K}\bigl( \rho_1, \pi^1 \bigr)
4508: + \C{K}\bigl( \rho_2, \pi^2\bigr) - \log(\epsilon).
4509: \end{multline*}
4510: \end{thm}
4511:
The entropy bound of the previous section now comes into play,
providing a localized version of Theorem \ref{thm1.1.38}.
4514: We will use the notation
4515: $$
4516: \Xi_{a} (q) = \tanh(a)^{-1} \bigl[ 1 -
4517: \exp( - aq) \bigr] \leq \frac{a}{\tanh(a)}q, \qquad a, q \in \RR.
4518: $$
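The role of $\Xi_a$ is to invert the logarithm appearing on the
left-hand side of Theorem \ref{thm1.1.38}. As a short sketch,
restricted to the case $a = \frac{\lambda}{N} > 0$ (the letters
$u$ and $q$ below are generic real numbers, with
$\tanh(\frac{\lambda}{N}) u < 1$),
$$
- N \log \bigl[ 1 - \tanh(\tfrac{\lambda}{N}) u \bigr] \leq \lambda q
\quad \Longrightarrow \quad
u \leq \tanh(\tfrac{\lambda}{N})^{-1} \bigl[ 1 - \exp( - \tfrac{\lambda}{N} q) \bigr]
= \Xi_{\frac{\lambda}{N}}(q).
$$
The upper bound $\Xi_a(q) \leq \frac{a}{\tanh(a)} q$ then follows, for
$a > 0$, from the elementary inequality $1 - \exp( - t) \leq t$, $t \in \RR$.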
4519: \begin{thm}
4520: \mypoint
4521: \label{thm1.1.39}
4522: For any sequence of prior distributions $(\pi^i)_{i \in \NN } \in
4523: \C{M}_+^1(\Theta)^{\NN}$,
4524: any probability distribution $\mu$ on $\NN$,
4525: any atomic probability distribution $\nu$ on $\RR_+$,
4526: with $\PP$ probability at least $1 - \epsilon$, for any posterior distributions
4527: $\rho_1, \rho_2 : \Omega \rightarrow \C{M}_+^1(\Theta)$,
4528: \begin{multline*}
4529: \hfill \rho_2(R) - \rho_1(R) \leq B(\rho_1, \rho_2), \text{ where} \hfill
4530: \\
4531: \shoveleft{B(\rho_1, \rho_2) = \inf_{\lambda, \beta_1 < \gamma_1, \beta_2 <
4532: \gamma_2 \in \RR_+, i, j \in \NN} \Xi_{\frac{\lambda}{N}} \Biggl\{
4533: \bigl[ \rho_2(r) - \rho_1(r) \bigr]}\\\shoveright{ + \tfrac{N}{\lambda} \log
4534: \bigl[ \cosh(
4535: \tfrac{\lambda}{N}) \bigr] \rho_1 \otimes \rho_2(m')
4536: }\\\shoveleft{ + \frac{1}{\lambda \Bigl(1 - \frac{\beta_1}{\gamma_1}\Bigr)}
4537: \biggl\{ \C{K} \bigl[ \rho_1, \pi^i_{\exp( - \beta_1 r)}\bigr]
4538: }\\ \shoveright{+ \log \Bigl\{ \pi^i_{\exp( - \beta_1 r)} \Bigl[ \exp \bigl\{
4539: \beta_1 \tfrac{N}{\gamma_1}
4540: \log \bigl[ \cosh(\tfrac{\gamma_1}{N})\bigr] \rho_1(m') \bigr\}
4541: \Bigr] \Bigr\} \biggr\} \quad}
4542: \\ \shoveleft{+ \frac{1}{\lambda \Bigl( 1 - \frac{\beta_2}{\gamma_2} \Bigr)} \biggl\{
4543: \C{K} \bigl[ \rho_2, \pi^j_{\exp( - \beta_2 r)}\bigr]
4544: }\\ \shoveright{+ \log \Bigl\{ \pi^j_{\exp( - \beta_2 r)} \Bigl[ \exp \bigl\{ \beta_2
4545: \tfrac{N}{\gamma_2}
4546: \log \bigl[ \cosh(\tfrac{\gamma_2}{N})\bigr] \rho_2(m') \bigr\}
4547: \Bigr] \Bigr\} \biggr\}\quad }
4548: \\ \shoveleft{- \Bigl[ \bigl( \tfrac{\gamma_1}{\beta_1} - 1 \bigr)^{-1}
4549: + \bigl( \tfrac{\gamma_2}{\beta_2} - 1 \bigr)^{-1} + 1 \Bigr]
4550: }\\ \times \frac{
4551: \log\bigl[3^{-1} \nu(\beta_1) \nu(\beta_2) \nu(\gamma_1) \nu(\gamma_2)
4552: \nu(\lambda) \mu(i) \mu(j) \epsilon\bigr]}{\lambda}
4553: \Biggr\}.
4554: \end{multline*}
4555: \end{thm}
4556: The sequence of prior distributions $(\pi^i)_{i \in \NN}$
4557: should be understood
4558: to be typically supported by subsets of $\Theta$ corresponding to
4559: parametric submodels, that is submodels for which it
4560: is reasonable to expect that \\
4561: \mbox{} \hfill $\ds \lim_{\beta \rightarrow
4562: + \infty} \beta \bigl[ \pi^i_{\exp( - \beta R)}(R) -
4563: \ess \inf_{\pi^i} R \bigr]$\hfill\mbox{}\\
4564: exists and is positive and finite.
4565: As there is no reason why the bound $B(\rho_1, \rho_2)$ provided by
4566: the previous theorem should be subadditive (in the sense that
4567: $B(\rho_1, \rho_3) \leq B(\rho_1, \rho_2) + B(\rho_2, \rho_3)$),
4568: it is adequate, at least from a theoretical point of view, to
4569: consider some workable subset $\C{P} \subset \C{M}_+^1(\Theta)$
of posterior distributions (for instance the distributions of
the form $\pi^i_{\exp( - \beta r)}$, $i \in \NN$, $\beta \in \RR_+$;
it is understood that $\C{P}$ is allowed to be a random
subset of $\C{M}_+^1(\Theta)$, as in this suggested example),
4574: and to define the subadditive chained bound
4575: \newcommand{\TB}{\widetilde{B}}
4576: \begin{multline*}
4577: \TB (\rho, \rho') = \inf \Biggl\{
4578: \sum_{k=0}^{n-1} B(\rho_k, \rho_{k+1});\, n \in \NN^*,
4579: (\rho_k)_{k=0}^{n} \in \C{P}^{n+1},\\ \rho_0 = \rho,
4580: \rho_n = \rho' \Biggr\}, \quad \rho, \rho' \in \C{P}.
4581: \end{multline*}
4582: \begin{prop}\mypoint
4583: \label{prop1.1.54}
4584: With $\PP$ probability at least $1 - \epsilon$,
4585: for any posterior distributions $\rho_1, \rho_2
4586: \in \C{P}$,
4587: $
4588: \rho_2(R) - \rho_1(R) \leq \TB(\rho_1, \rho_2).
4589: $
4590: Moreover for any
4591: posterior distribution $\rho_1 \in \C{P}$,
4592: any posterior distribution $\rho_2 \in \C{P}$ such that
4593: $\TB(\rho_1, \rho_2) = \inf_{\rho_3 \in \C{P}} \TB(\rho_1, \rho_3)$
4594: is unimprovable with the help of $\TB$ in $\C{P}$
4595: in the sense that $\inf_{\rho_3 \in \C{P}}
4596: \TB(\rho_2, \rho_3) \geq 0$.
4597: \end{prop}
4598: \begin{proof} The first assertion is a direct consequence of the
4599: previous theorem, therefore only the second assertion requires a proof: for
4600: any $\rho_3 \in \C{P}$, we deduce from
4601: the optimality of $\rho_2$ and the subadditivity of $\TB$ that
4602: $
4603: \TB(\rho_1,\rho_2) \leq \TB(\rho_1, \rho_3) \leq \TB(\rho_1, \rho_2) +
4604: \TB(\rho_2, \rho_3).
4605: $
4606: \end{proof}
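Two steps of this proof can be spelled out; what follows is a sketch,
assuming that $\TB(\rho_1, \rho_2)$ is finite. The subadditivity of
$\TB$ comes from the concatenation of chains: if
$(\rho_k)_{k=0}^n \in \C{P}^{n+1}$ joins $\rho$ to $\rho'$ and
$(\rho'_k)_{k=0}^{n'} \in \C{P}^{n'+1}$ joins $\rho'$ to $\rho''$,
then the concatenated sequence is a chain of length $n + n'$ joining $\rho$
to $\rho''$, and taking the infimum over both chains gives
$$
\TB(\rho, \rho'') \leq \TB(\rho, \rho') + \TB(\rho', \rho''), \qquad
\rho, \rho', \rho'' \in \C{P}.
$$
Subtracting $\TB(\rho_1, \rho_2)$ from both ends of the chain of
inequalities written in the proof then gives
$0 \leq \TB(\rho_2, \rho_3)$ for any $\rho_3 \in \C{P}$.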
4607:
4608: This proposition provides a way to improve a posterior distribution
4609: $\rho_1 \in \C{P}$ by choosing $\rho_2 \in \arg\min_{\rho \in \C{P}}
4610: \TB(\rho_1, \rho)$ whenever $\TB(\rho_1, \rho_2) < 0$.
According to Proposition \ref{prop1.1.54}, this improvement process
is a one step process: the obtained improved posterior $\rho_2$
4613: cannot be improved again using the same technique.
4614:
Let us give an example of a possible starting
distribution $\rho_1$ for this improvement scheme: $\rho_1$ may be chosen as
4617: the best posterior Gibbs distribution
4618: according to Proposition \ref{prop1.1.37} on page
4619: \pageref{prop1.1.37}. More precisely, we may build
4620: from the prior distributions $\pi^i$, $i \in \NN$,
4621: a global prior $\pi = \sum_{i \in \NN} \mu(i) \pi^i$.
4622: We can then define the estimator of the inverse effective
4623: temperature as in Proposition \ref{prop1.1.37}
4624: and choose $\rho_1 \in \arg \min_{\rho \in \C{P}} \w{\beta}(\rho)$,
4625: where $\C{P}$ is as suggested above the set of posterior
4626: distributions
4627: $$
4628: \C{P} = \Bigl\{ \pi^i_{\exp( - \beta r)};\, i \in \NN, \beta \in \RR_+ \Bigr\}.
4629: $$
(This starting point $\rho_1$ should already be pretty good,
at least from an asymptotic point of view: the only
gain in the rate of convergence to be expected concerns
spurious $\log(N)$ factors.)
4634:
For more elaborate uses of relative bounds, we refer to
the third section of the second chapter of Audibert \cite{Audibert2}, where an algorithm
is proposed and analyzed which allows relative bounds
between two posterior distributions to be used as a stand-alone estimation
tool.
4640:
4641: \subsubsection{Two step localization of relative bounds}
4642:
4643: Let us consider again in this section
4644: the case when we want to choose adaptively between a family
4645: of parametric models. Let us thus assume that the parameter
4646: set is a disjoint union of measurable submodels, so that we can write
4647: $\Theta = \sqcup_{m \in M} \Theta_m$, where $M$ is some measurable
4648: index set. Let us choose some prior probability distribution
4649: on the index set $\mu \in \C{M}_+^1(M)$, and some regular conditional
4650: prior distribution on $(M,\Theta)$, $\pi : M \rightarrow \C{M}_+^1(\Theta)$,
4651: such that $\pi(m, \Theta_m) = 1$, $m \in M$. Let us then study some
4652: arbitrary posterior distributions $\nu : \Omega \rightarrow \C{M}_+^1(M)$
and $\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$, such
4654: that $\rho(\omega, m, \Theta_m) = 1$, $\omega \in \Omega$, $m \in M$.
4655: We would like to compare $\nu \rho(R)$ with some doubly localized
4656: prior distribution $\mu_{\exp[ - \frac{\beta}{1 + \zeta_2} \pi_{
4657: \exp( - \beta R)}(R)]} \bigl[ \pi_{\exp( - \beta R)} \bigr](R)$
4658: (where $\zeta_2$ is a positive parameter to be set as needed later on).
To ease notation, we define two prior distributions (one
being more precisely a conditional distribution) depending on
the positive real parameters $\beta$ and $\zeta_2$, putting
4662: \begin{equation}
4663: \label{eqprior}
4664: \ov{\pi} = \pi_{\exp( - \beta R)}
4665: \text{ and }\ov{\mu} = \mu_{\exp[ - \frac{\beta}{1 + \zeta_2}
4666: \ov{\pi}(R)]}.
4667: \end{equation}
4668:
4669: Similarly to Theorem \ref{thm2.2.18} on page \pageref{thm2.2.18}
4670: we can write for any positive real constants $\beta$ and $\gamma$
4671: \begin{multline*}
4672: \PP \biggl\{ (\ov{\mu}\,\ov{\pi}) \otimes (\ov{\mu}\,\ov{\pi})
4673: \biggl[ \exp \Bigl[ - N \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N})R' \bigr]
4674: \\ - \gamma r' - N \log \bigl[
4675: \cosh(\tfrac{\gamma}{N})\bigr] m' \Bigr] \biggr] \biggr\}
4676: \leq 1,
4677: \end{multline*}
4678: and deduce, using Lemma \ref{lemma1.3} on page \pageref{lemma1.3}
4679: \begin{multline}
4680: \label{eq1.31}
4681: \PP \biggl\{ \exp \biggl[
4682: \sup_{\nu \in \C{M}_+^1(M)} \sup_{\rho : M \rightarrow \C{M}_+^1(\Theta)}
4683: \Bigl\{ - N
4684: \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N})
4685: (\nu \rho - \ov{\mu}\,\ov{\pi}) (R) \bigr]\\* - \gamma (\nu \rho - \ov{\mu}
4686: \,\ov{\pi})(r)
4687: - N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] (\nu \rho) \otimes
4688: (\ov{\mu}\,\ov{\pi}) (m') \\* - \C{K}(\nu, \ov{\mu}) - \nu
4689: \bigl[ \C{K}(\rho, \ov{\pi}) \bigr] \Bigr\} \biggr] \biggr\} \leq 1.
4690: \end{multline}
4691: This will be our starting point in comparing
4692: $\nu \rho(R)$ with $\ov{\mu}\,\ov{\pi}(R)$.
However, obtaining an empirical bound will require some additional effort.
4694: For each $m \in M$, we can write
4695: in the same way
4696: $$
4697: \PP \biggl\{ \ov{\pi} \otimes \ov{\pi}
4698: \biggl[ \exp \Bigl[ - N \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N})R' \bigr]
4699: - \gamma r' - N \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] m' \Bigr] \biggr] \biggr\}
4700: \leq 1.
4701: $$
Integrating this inequality with respect to $\ov{\mu}$ and using Fubini's lemma
4703: for positive functions, we get
4704: $$
4705: \PP \biggl\{ \ov{\mu}(\ov{\pi} \otimes \ov{\pi})
4706: \biggl[ \exp \Bigl[ - N \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N})R' \bigr]
4707: - \gamma r' - N \log \bigl[ \cosh(\tfrac{\gamma}{N})\bigr] m' \Bigr] \biggr] \biggr\}
4708: \leq 1.
4709: $$
4710: Let us make clear that $\ov{\mu}(\ov{\pi} \otimes \ov{\pi})$ is a probability
4711: measure on $M \times \Theta \times \Theta$, whereas $(\ov{\mu}\,\ov{\pi})
4712: \otimes (\ov{\mu}\,\ov{\pi})$ considered previously is a probability measure
4713: on \linebreak $(M\times \Theta) \times (M \times \Theta)$.
4714: We get as previously
4715: \begin{multline}
4716: \label{eq1.31bis}
4717: \PP \biggl\{ \exp \biggl[
4718: \sup_{\nu \in \C{M}_+^1(M)}
4719: \sup_{\rho : M \rightarrow \C{M}_+^1(\Theta)} \Bigl\{
4720: - N
4721: \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N})
4722: \nu (\rho - \ov{\pi}) (R) \bigr]
4723: \\ - \gamma \nu (\rho - \ov{\pi})(r) - N \log
4724: \bigl[\cosh(\tfrac{\gamma}{N})\bigr]
4725: \nu ( \rho \otimes \ov{\pi} ) (m') \\ - \C{K}(\nu, \ov{\mu})
4726: - \nu \bigl[ \C{K}(\rho, \ov{\pi}) \bigr]
4727: \Bigr\} \biggr] \biggr\} \leq 1.
4728: \end{multline}
Let us finally recall that
4730: \begin{align}
4731: \C{K}(\nu, \ov{\mu}) & = \tfrac{\beta}{1 + \zeta_2} (\nu - \ov{\mu})\ov{\pi}(R) + \C{K}(\nu, \mu)
4732: - \C{K}(\ov{\mu}, \mu),\\
4733: \label{eq1.31ter}
4734: \C{K}(\rho, \ov{\pi}) & = \beta (\rho - \ov{\pi})(R) + \C{K}(\rho, \pi)
4735: - \C{K}(\ov{\pi}, \pi).
4736: \end{align}
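Both identities come from the explicit density of a Gibbs prior.
As a sketch for the second one, assuming $\C{K}(\rho, \pi) < \infty$:
since $\frac{d \ov{\pi}}{d \pi} = \exp( - \beta R) /
\pi \bigl[ \exp( - \beta R) \bigr]$,
\begin{align*}
\C{K}(\rho, \ov{\pi}) & = \C{K}(\rho, \pi) + \beta \rho(R)
+ \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\},\\
\C{K}(\ov{\pi}, \pi) & = - \beta \ov{\pi}(R)
- \log \bigl\{ \pi \bigl[ \exp( - \beta R) \bigr] \bigr\},
\end{align*}
and adding these two lines eliminates the log-Laplace transform.
The first identity is obtained in the same way, replacing $\pi$ with $\mu$,
$R$ with $\ov{\pi}(R)$ and $\beta$ with $\frac{\beta}{1 + \zeta_2}$.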
4737: From equations \eqref{eq1.31}, \eqref{eq1.31bis} and \eqref{eq1.31ter} we deduce
4738: \begin{prop}\mypoint
4739: \label{prop1.58}
4740: For any positive real constants $\beta$, $\gamma$ and $\zeta_2$,
4741: with $\PP$ probability at least $1 - \epsilon$, for any posterior
4742: distribution $\nu : \Omega \rightarrow \C{M}_+^1(M)$ and any conditional posterior
4743: distribution $\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,
4744: \begin{multline*}
4745: - N \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N})(\nu \rho - \ov{\mu}\,\ov{\pi})(R)
4746: \bigr] - \beta \nu(\rho - \ov{\pi})(R) \\ \leq \gamma (\nu \rho - \ov{\mu}\,\ov{\pi}) (r)
4747: + N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr] (\nu \rho) \otimes
4748: (\ov{\mu}\,\ov{\pi}) (m') \\ + \C{K}(\nu, \ov{\mu}) + \nu \bigl[ \C{K}(\rho, \pi) \bigr]
- \nu \bigl[ \C{K}( \ov{\pi}, \pi) \bigr] + \log \bigl( \tfrac{2}{\epsilon} \bigr),
4750: \end{multline*}
4751: and
4752: \begin{multline*}
4753: - N \log \bigl[ 1 - \tanh(\tfrac{\gamma}{N}) \nu(\rho - \ov{\pi})(R) \bigr]
4754: \\\leq \gamma \nu(\rho - \ov{\pi})(r)
4755: + N \log \bigl[ \cosh(\tfrac{\gamma}{N}) \bigr]
4756: \nu( \rho\otimes \ov{\pi})(m') \\ + \C{K}(\nu, \ov{\mu}) + \nu\bigl[ \C{K}(\rho,
4757: \ov{\pi}) \bigr] +
4758: \log\bigl(\tfrac{2}{\epsilon}\bigr),
4759: \end{multline*}
4760: where the prior distribution $\ov{\mu}\,\ov{\pi}$ is defined by equation
4761: \eqref{eqprior} on page \pageref{eqprior} and depends on $\beta$ and $\zeta_2$.
4762: \end{prop}
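The constant $\log(\frac{2}{\epsilon})$ comes from a union bound.
As a sketch, let $S_1$ and $S_2$ (notation introduced only for this
remark) be the suprema appearing inside the exponentials in equations
\eqref{eq1.31} and \eqref{eq1.31bis}, so that the expectation of
$\exp(S_j)$ is at most $1$, $j = 1, 2$. Markov's inequality gives
$$
\PP \bigl[ S_j \geq \log(\tfrac{2}{\epsilon}) \bigr]
= \PP \bigl[ \exp(S_j) \geq \tfrac{2}{\epsilon} \bigr]
\leq \tfrac{\epsilon}{2}, \qquad j = 1, 2,
$$
so that with $\PP$ probability at least $1 - \epsilon$ both $S_1$ and
$S_2$ are smaller than $\log(\frac{2}{\epsilon})$. The first inequality
of the proposition is then obtained by substituting equation
\eqref{eq1.31ter} into the bound on $S_1$.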
4763: Let us put for short
4764: $$
4765: T = \tanh(\tfrac{\gamma}{N}) \text{ and } C = N \log \bigl[ \cosh(\tfrac{\gamma}{N})
4766: \bigr].
4767: $$
4768:
4769: \newcommand{\omu}{\ov{\mu}}
4770: \newcommand{\opi}{\ov{\pi}}
4771: We will use some entropy compensation strategy for which we need a couple
4772: of entropy bounds. Let us assume that $\beta < NT$.
4773: We have according to Proposition \ref{prop1.58},
4774: with $\PP$ probability at least $1 - \epsilon$,
4775: \begin{multline*}
4776: \nu \bigl[ \C{K}(\rho, \opi) \bigr]
4777: = \beta \nu(\rho - \opi)(R) + \nu \bigl[ \C{K}(\rho, \pi) -
4778: \C{K}(\opi, \pi) \bigr] \\\shoveleft{\qquad
4779: \leq \frac{\beta}{NT} \biggl[ \gamma \nu(\rho - \opi) (r)
4780: + C \nu(\rho \otimes \opi)(m')} \\ + \C{K}(\nu, \omu)
4781: + \nu \bigl[ \C{K}( \rho, \opi) \bigr]
4782: + \log( \tfrac{2}{\epsilon} ) \biggr] \\ + \nu \bigl[ \C{K}(\rho, \pi)
4783: - \C{K}(\opi, \pi) \bigr].
4784: \end{multline*}
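(To be explicit about the inequality just written, here is a sketch of
the step involved: it combines the second inequality of Proposition
\ref{prop1.58} with the elementary bound $- \log(1 - u) \geq u$, $u < 1$,
applied to $u = T \nu(\rho - \opi)(R)$, which gives
\begin{multline*}
NT \nu (\rho - \opi)(R) \leq - N \log \bigl[ 1 - T \nu(\rho - \opi)(R) \bigr]
\\ \leq \gamma \nu(\rho - \opi)(r) + C \nu(\rho \otimes \opi)(m')
+ \C{K}(\nu, \omu) + \nu \bigl[ \C{K}(\rho, \opi) \bigr]
+ \log(\tfrac{2}{\epsilon}),
\end{multline*}
and it then remains to multiply by $\frac{\beta}{NT} > 0$.)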
4785: Similarly
4786: \begin{multline*}
4787: \C{K}(\nu, \omu) = \frac{\beta}{1 + \zeta_2} (\nu - \omu) \opi(R)
4788: + \C{K}(\nu, \mu) - \C{K}(\omu, \mu) \\
4789: \leq \frac{\beta}{(1 + \zeta_2) NT} \biggl[
4790: \gamma (\nu - \omu) \opi(r) + C (\nu \opi) \otimes ( \omu\,\opi) (m')
4791: \\ + \C{K}(\nu, \omu) + \log (\tfrac{2}{\epsilon}) \biggr]
4792: + \C{K}(\nu, \mu) - \C{K}(\omu, \mu).
4793: \end{multline*}
4794: Thus, for any positive real constants $\beta$, $\gamma$ and $\zeta_i$,
4795: $i = 1, \dots, 5$, with $\PP$ probability at least $1 - \epsilon$,
for any posterior distributions $\nu, \nu_3
: \Omega \rightarrow \C{M}_+^1(M)$, any posterior conditional distributions
4798: $\rho, \rho_1, \rho_2, \rho_4, \rho_5
4799: : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,
4800: \begin{multline*}
4801: - N \log \bigl[ 1 - T (\nu \rho - \omu\,\opi)(R) \bigr]
4802: - \beta \nu (\rho - \opi)(R) \\ \leq
4803: \gamma (\nu \rho - \omu\,\opi)(r) + C (\nu \rho) \otimes (\omu\,\opi)(m')
4804: \\
4805: \hfill + \C{K}(\nu, \omu) + \nu \bigl[ \C{K}(\rho, \pi)
4806: - \C{K}(\opi, \pi) \bigr] + \log(\tfrac{2}{\epsilon}),
4807: \quad\\\quad
4808: \zeta_1 \frac{NT}{\beta} \omu \bigl[ \C{K}(\rho_1, \opi) \bigr]
4809: \leq \zeta_1 \gamma \omu(\rho_1 - \opi)(r) + \zeta_1 C \omu(\rho_1 \otimes \opi)(m')
4810: \hfill \\ \hfill + \zeta_1 \omu \bigl[ \C{K}(\rho_1, \opi) \bigr] +
4811: \zeta_1 \log( \tfrac{2}{\epsilon})
4812: + \zeta_1 \frac{NT}{\beta} \omu \bigl[ \C{K}(\rho_1, \pi)
4813: - \C{K}(\opi, \pi) \bigr],\quad\\\quad
4814: \zeta_2 \frac{NT}{\beta} \nu \bigl[ \C{K}(\rho_2, \opi) \bigr]
4815: \leq \zeta_2 \gamma \nu(\rho_2- \opi)(r) + \zeta_2 C \nu(
4816: \rho_2 \otimes \opi)(m') \hfill \\
4817: + \zeta_2 \C{K}(\nu, \omu) + \zeta_2 \nu \bigl[ \C{K}(\rho_2, \opi) \bigr]
4818: + \zeta_2 \log( \tfrac{2}{\epsilon}) \\ \hfill
4819: + \zeta_2 \frac{NT}{\beta} \nu \bigl[ \C{K}(\rho_2, \pi) - \C{K}(\opi, \pi)
4820: \bigr],\quad\\\quad
4821: \zeta_3 (1 + \zeta_2)\frac{ N T}{\beta} \C{K}(\nu_3, \omu)
4822: \leq \zeta_3 \gamma( \nu_3 - \omu) \opi(r)
4823: \hfill \\ +
4824: \zeta_3 C \bigl[ (\nu_3 \opi) \otimes (\nu_3 \rho_1) + (\nu_3 \rho_1)
4825: \otimes ( \omu \, \opi) \bigr] (m')
4826: + \zeta_3 \C{K}(\nu_3, \omu) + \zeta_3 \log(\tfrac{2}{\epsilon})
4827: \\ \hfill + \zeta_3 (1 + \zeta_2)\frac{NT}{ \beta}
4828: \bigl[ \C{K}(\nu_3, \mu) - \C{K}(\ov{\mu}, \mu) \bigr],\quad\\\quad
4829: \zeta_4 \frac{NT}{\beta} \nu_3 \bigl[ \C{K}(\rho_4, \opi) \bigr]
4830: \leq \zeta_4 \gamma \nu_3(\rho_4 - \opi)(r) \hfill \\
4831: + \zeta_4 C \nu_3(\rho_4 \otimes \opi)
4832: (m') + \zeta_4 \C{K}(\nu_3, \omu) + \zeta_4 \nu_3 \bigl[ \C{K}(\rho_4, \opi) \bigr]
4833: + \zeta_4 \log( \tfrac{2}{\epsilon}) \\
4834: \hfill + \zeta_4 \frac{NT}{\beta} \nu_3 \bigl[ \C{K}(\rho_4,
4835: \pi) - \C{K}( \opi, \pi) \bigr],
4836: \quad\\\quad
4837: \zeta_5 \frac{NT}{\beta} \omu \bigl[ \C{K}(\rho_5, \opi) \bigr]
4838: \leq \zeta_5 \gamma \omu(\rho_5 - \opi)(r) + \zeta_5 C \omu(\rho_5 \otimes \opi)(m')
4839: \hfill \\ \hfill + \zeta_5 \omu \bigl[ \C{K}(\rho_5, \opi) \bigr] +
4840: \zeta_5 \log( \tfrac{2}{\epsilon})
4841: + \zeta_5 \frac{NT}{\beta} \omu \bigl[ \C{K}(\rho_5, \pi)
4842: - \C{K}(\opi, \pi) \bigr].
4843: \end{multline*}
4844: Adding these six inequalities and assuming that $\zeta_4 \leq \zeta_3 \bigl[
4845: ( 1 + \zeta_2) \tfrac{NT}{\beta} - 1 \bigr]$, we find
4846: \begin{multline*}
4847: - N \log \bigl[ 1 - T (\nu \rho - \omu\,\opi)(R) \bigr]
4848: - \beta (\nu \rho - \omu \, \opi)(R) \\\qquad \leq
4849: - N \log \bigl[ 1 - T (\nu \rho - \omu\,\opi)(R) \bigr]
4850: - \beta (\nu \rho - \omu \, \opi)(R)\hfill\\+
4851: \zeta_1 \bigl( \tfrac{NT}{\beta} - 1\bigr)
4852: \omu \bigl[ \C{K}(\rho_1, \opi)\bigr]
4853: + \zeta_2 \bigl( \tfrac{NT}{\beta} - 1 \bigr)
4854: \nu \bigl[ \C{K}(\rho_2, \opi) \bigr] \\ +
4855: \bigl[ \zeta_3(1 + \zeta_2) \tfrac{NT}{\beta} - \zeta_3
4856: - \zeta_4 \bigr] \C{K}(\nu_3, \omu)\\\hfill
4857: + \zeta_4 \bigl( \tfrac{NT}{\beta} - 1 \bigr)
4858: \nu_3 \bigl[ \C{K}(\rho_4, \opi) \bigr] +
4859: \zeta_5 \bigl( \tfrac{NT}{\beta} - 1 \bigr)
4860: \omu \bigl[ \C{K}(\rho_5, \opi) \bigr] \quad\\\qquad
4861: \leq \gamma (\nu \rho - \omu\,\opi)(r)
4862: + \zeta_1 \gamma \omu(\rho_1 - \opi) (r) +
4863: \zeta_2 \gamma \nu(\rho_2 - \opi) (r)
4864: \hfill \\ + \zeta_3 \gamma(\nu_3 - \omu) \opi(r) +
4865: \zeta_4 \gamma \nu_3(\rho_4 - \opi)(r) + \zeta_5 \gamma \omu(\rho_5 - \opi)
4866: (r) \qquad\\ \hfill
4867: + C \bigl[ (\nu \rho) \otimes (\omu\,\opi)+ \zeta_1
4868: \omu(\rho_1 \otimes \opi) + \zeta_2 \nu( \rho_2 \otimes \opi)\qquad\\
4869: \quad + \zeta_3 (\nu_3 \opi) \otimes (\nu_3 \rho_1) +
4870: \zeta_3 (\nu_3 \rho_1) \otimes ( \omu \, \opi)\hfill \\
4871: \hfill + \zeta_4
4872: \nu_3 ( \rho_4 \otimes \opi) + \zeta_5 \omu(\rho_5\otimes \opi) \bigr] (m')\qquad\\
4873: \quad + (1 + \zeta_2) \bigl[\C{K}(\nu, \mu) - \C{K}(\omu, \mu)\bigr]
4874: + \nu \bigl[ \C{K}(\rho, \pi) - \C{K}(\opi, \pi) \bigr]\hfill\\
4875: \hfill + \zeta_1 \tfrac{NT}{\beta} \omu \bigl[ \C{K}(\rho_1, \pi)
4876: - \C{K}(\opi, \pi) \bigr] + \zeta_2 \tfrac{NT}{\beta}
4877: \nu \bigl[ \C{K}(\rho_2, \pi) - \C{K}(\opi, \pi) \bigr] \qquad
4878: \\\quad + \zeta_3 (1 + \zeta_2) \tfrac{NT}{\beta} \bigl[ \C{K}(\nu_3, \mu)
4879: - \C{K}(\omu, \mu) \bigr]
4880: + \zeta_4 \tfrac{NT}{\beta} \nu_3 \bigl[ \C{K}( \rho_4, \pi)
4881: - \C{K}(\opi, \pi) \bigr] \hfill \\
4882: + \zeta_5 \tfrac{NT}{\beta} \omu \bigl[
4883: \C{K}(\rho_5, \pi) - \C{K}(\opi, \pi) \bigr]
4884: + (1 + \zeta_1 + \zeta_2 + \zeta_3 + \zeta_4 + \zeta_5 ) \log( \tfrac{2}{\epsilon}).
4885: \end{multline*}
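The bookkeeping behind this sum can be sketched as follows.
The first and third inequalities contribute
$(1 + \zeta_2) \C{K}(\nu, \omu)$ to the right-hand side, and by the
definition of $\omu$
$$
(1 + \zeta_2) \C{K}(\nu, \omu) = \beta (\nu - \omu) \opi(R)
+ (1 + \zeta_2) \bigl[ \C{K}(\nu, \mu) - \C{K}(\omu, \mu) \bigr].
$$
Moving $\beta(\nu - \omu) \opi (R)$ to the left-hand side and using
$$
- \beta \nu(\rho - \opi)(R) - \beta (\nu - \omu) \opi(R)
= - \beta (\nu \rho - \omu\,\opi)(R)
$$
gives the first two terms. The entropy $\C{K}(\nu_3, \omu)$ appears with
coefficient $\zeta_3 (1 + \zeta_2) \tfrac{NT}{\beta}$ on the left and
$\zeta_3 + \zeta_4$ on the right, whence the constraint on $\zeta_4$,
which makes the difference non-negative. The first inequality of the
display then only drops non-negative entropy terms, using $\beta < NT$.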
4886: Let us now apply to $\opi$ (we shall later do the same with $\omu$)
4887: the following inequalities, holding for any random
4888: functions of the sample and the parameters $h : \Omega \times \Theta \rightarrow
4889: \RR$ and $g : \Omega \times \Theta \rightarrow \RR$,
4890: \begin{multline*}
4891: \opi(g-h) - \C{K}(\opi, \pi) \leq
4892: \sup_{\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)} \rho( g - h) - \C{K}(\rho, \pi) \\
4893: \shoveleft{\qquad = \log \bigl\{ \pi \bigl[ \exp (g - h) \bigr] \bigr\}} \\
4894: \shoveleft{\qquad \qquad =
4895: \log \bigl\{ \pi \bigl[ \exp ( - h ) \bigr] \bigr\}
4896: + \log \bigl\{ \pi_{\exp( - h)} \bigl[ \exp (g) \bigr] \bigr\}}
4897: \\ = - \pi_{\exp( - h)}(h) - \C{K}(\pi_{\exp( - h)}, \pi)
4898: + \log \bigl\{ \pi_{\exp( - h)} \bigl[ \exp (g) \bigr] \bigr\}.
4899: \end{multline*}
4900: When $h$ and $g$ are observable, and $h$ is not too far from
4901: $\beta r \simeq \beta R$, this gives a way to replace $\opi$ with
4902: some satisfactory empirical approximation.
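The first equality in the previous display is the usual variational
formula for the log-Laplace transform. As a sketch, writing
$f = g - h$ and assuming that $\pi \bigl[ \exp(f) \bigr] < \infty$,
for any $\rho$ such that $\C{K}(\rho, \pi) < \infty$,
$$
\rho(f) - \C{K}(\rho, \pi) = \log \bigl\{ \pi \bigl[ \exp(f) \bigr] \bigr\}
- \C{K}(\rho, \pi_{\exp(f)}) \leq
\log \bigl\{ \pi \bigl[ \exp(f) \bigr] \bigr\},
$$
with equality when $\rho = \pi_{\exp(f)}$.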
4903: We will apply this method, choosing $\rho_1$ and $\rho_5$ such that
4904: $\omu\,\opi$ is replaced either with $\omu \rho_1$,
4905: when it comes from the first two inequalities or
4906: with $\omu \rho_5$ otherwise,
4907: choosing $\rho_2$ such that $\nu \opi$ is replaced with $\nu \rho_2$
4908: and $\rho_4$ such that $\nu_3 \opi$ is replaced with $\nu_3 \rho_4$. We will do
4909: so because it leads to a lot of helpful cancellations.
For those to happen, we need to choose $\rho_i = \pi_{\exp( - \lambda_i r)}$,
$i=1,2,4,5$, where $\lambda_1$, $\lambda_2$, $\lambda_4$ and $\lambda_5$ are such that
4912: \begin{align*}
4913: (1 + \zeta_1) \gamma & = \zeta_1 \tfrac{NT}{\beta} \lambda_1,\\
4914: \zeta_2 \gamma & = \bigl(1 + \zeta_2 \tfrac{NT}{\beta} \bigr) \lambda_2,\\
4915: (\zeta_4 - \zeta_3) \gamma & = \zeta_4 \frac{NT}{\beta} \lambda_4,\\
4916: \zeta_3 \gamma & = \zeta_5 \tfrac{NT}{\beta} \lambda_5,
4917: \end{align*}
4918: and to assume that
4919: $\zeta_4 > \zeta_3$.
4920: We obtain that with $\PP$ probability at least $1 - \epsilon$,
4921: \begin{multline*}
- N \log \bigl[ 1 - T(\nu \rho - \omu\,\opi)(R) \bigr]
4923: - \beta (\nu \rho - \omu\,\opi)(R)\\
4924: \leq \gamma(\nu \rho - \omu\,\rho_1)(r) +
4925: \zeta_3 \gamma(\nu_3 \rho_4 - \omu \rho_5)(r)
4926: \\
4927: + \zeta_1 \tfrac{NT}{\beta} \omu \Biggl\{
4928: \log \Biggl[ \rho_1 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_1}
4929: \bigl[ \nu \rho + \zeta_1 \rho_1 \bigr](m') \biggr]
4930: \biggr\} \Biggr] \Biggr\}\\
4931: + \bigl( 1 + \zeta_2 \tfrac{NT}{\beta}\bigr) \nu \Biggl\{
\log \Biggl[ \rho_2 \biggl\{ \exp \biggl[ \tfrac{C}{1 + \zeta_2
\frac{NT}{\beta}} \zeta_2 \rho_2 (m') \biggr] \biggr\} \Biggr] \Biggr\}\\
4934: + \zeta_4 \tfrac{NT}{\beta} \nu_3 \Biggl\{ \log \Biggl[
4935: \rho_4 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_4}
4936: \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_4
4937: \rho_4 \bigr] (m') \biggr] \biggr\} \Biggr] \Biggr\}\\
4938: + \zeta_5 \tfrac{NT}{\beta} \omu \Biggl\{
4939: \log \Biggl[ \rho_5 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_5}
4940: \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_5 \rho_5 \bigr] (m') \biggr]
4941: \biggr\} \Biggr] \Biggr\}\\
4942: + (1 + \zeta_2) \bigl[ \C{K}(\nu, \mu) - \C{K}(\omu, \mu) \bigr]
4943: + \nu \bigl[ \C{K}(\rho, \pi) - \C{K}(\rho_2, \pi) \bigr]
4944: \\ + \zeta_3(1 + \zeta_2) \tfrac{NT}{\beta} \bigl[
4945: \C{K}(\nu_3, \mu) - \C{K}(\omu, \mu) \bigr] \\
4946: +
4947: \biggl(1 + \sum_{i=1}^5 \zeta_i\biggr) \log \bigl( \tfrac{2}{\epsilon} \bigr).
4948: \end{multline*}
4949: In order to obtain more cancellations while replacing $\omu$ by
4950: some posterior distribution, we will choose the constants such that
4951: $\lambda_5 = \lambda_4$, which can be done by choosing
4952: $$
4953: \zeta_5 = \frac{\zeta_3 \zeta_4}{\zeta_4 - \zeta_3}.
4954: $$
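To check this choice (a short verification using only the constraints
displayed above), note that the last two constraints give
\begin{align*}
\lambda_4 & = \frac{\beta \gamma}{NT} \, \frac{\zeta_4 - \zeta_3}{\zeta_4}, &
\lambda_5 & = \frac{\beta \gamma}{NT} \, \frac{\zeta_3}{\zeta_5},
\end{align*}
so that $\lambda_5 = \lambda_4$ if and only if
$\zeta_3 \zeta_4 = \zeta_5 (\zeta_4 - \zeta_3)$, which is the stated value
of $\zeta_5$; it is positive precisely because we assumed $\zeta_4 > \zeta_3$.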
4955: We can now replace $\omu$ with
4956: $\mu_{\exp - \xi_1 \rho_1(r) - \xi_4 \rho_4(r)}$,
4957: where
4958: \begin{align*}
4959: \xi_1 & = \frac{\gamma}{(1 + \zeta_2)\bigl(1 + \tfrac{NT}{\beta} \zeta_3 \bigr)},\\
4960: \xi_4 & = \frac{\gamma\zeta_3}{(1 + \zeta_2)\bigl(1 + \tfrac{NT}{\beta} \zeta_3 \bigr)}.
4961: \end{align*}
4962: Choosing moreover $\nu_3 = \mu_{\exp - \xi_1 \rho_1(r) - \xi_4 \rho_4(r)}$,
4963: to induce some more cancellations,
4964: we get
4965: \begin{thm}\mypoint
4966: \label{thm1.59}
4967: For any positive real constants satisfying the above mentioned constraints,
4968: with $\PP$ probability at least $1 - \epsilon$, for any posterior distribution
4969: $\nu : \Omega \rightarrow \C{M}_+^1(M)$ and any conditional posterior
4970: distribution $\rho : \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,
4971: \begin{multline*}
4972: - N \log \bigl[ 1 - T(\nu \rho - \omu\,\opi)(R) \bigr]
4973: - \beta (\nu \rho - \omu\,\opi)(R) \leq B(\nu, \rho, \beta),\\
4974: \shoveleft{\text{where }
4975: B(\nu, \rho, \beta) \overset{\text{\rm def}}{=} \gamma ( \nu \rho -
4976: \nu_3 \rho_1)(r)} \\*
4977: \shoveleft{\qquad + (1 + \zeta_2) \bigl( 1 + \tfrac{NT}{\beta} \zeta_3 \bigr) }
4978: \\ \times
4979: \log \Biggl\{ \nu_3 \Biggl[ \rho_1 \biggl\{
4980: \exp \biggl[ C \tfrac{\beta}{NT \zeta_1} \bigl[ \nu \rho
4981: + \zeta_1 \rho_1 \bigr] (m') \biggr] \biggr\}^{\frac{\zeta_1 N T}{\beta
4982: (1 + \zeta_2)(1 + \frac{NT}{\beta}\zeta_3)}} \\
4983: \shoveright{\times \rho_4 \biggl\{ \exp \biggl[
4984: C \tfrac{\beta}{NT \zeta_5} \bigl[
4985: \zeta_3 \nu_3 \rho_1 + \zeta_5 \rho_4 \bigr] (m')
4986: \biggr] \biggr\}^{\frac{\zeta_5 N T}{\beta(1 + \zeta_2)(1 + \frac{NT}{\beta}
4987: \zeta_3)}} \Biggr] \Biggr\}}\\
4988: + \bigl( 1 + \zeta_2 \tfrac{NT}{\beta}\bigr) \nu \Biggl\{
\log \Biggl[ \rho_2 \biggl\{ \exp \biggl[ \tfrac{C}{1 + \zeta_2
4990: \frac{NT}{\beta}} \zeta_2 \rho_2 (m') \biggr] \biggr\} \Biggr] \Biggr\}\\
4991: + \zeta_4 \tfrac{NT}{\beta} \nu_3 \Biggl\{ \log \Biggl[
4992: \rho_4 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_4}
4993: \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_4
4994: \rho_4 \bigr] (m') \biggr] \biggr\} \Biggr] \Biggr\}\\
4995: \shoveleft{\qquad + (1 + \zeta_2) \bigl[ \C{K}(\nu, \mu) - \C{K}(\nu_3, \mu) \bigr]
4996: } \\ + \nu \bigl[ \C{K}(\rho, \pi) - \C{K}(\rho_2, \pi) \bigr]
4997: + \biggl( 1 + \sum_{i=1}^5 \zeta_i \biggr)
4998: \log \bigl( \tfrac{2}{\epsilon} \bigr).
4999: \end{multline*}
5000: \end{thm}
5001:
5002: This theorem can be used to find the largest value $\w{\beta}(\nu \rho)$ of
5003: $\beta$ such that
5004: $ B( \nu, \rho, \beta) \leq 0$, thus providing an estimator for
5005: $\beta(\nu \rho)$ defined as $\nu \rho(R) = \ov{\mu}_{\beta(\nu \rho)}
\ov{\pi}_{\beta(\nu \rho)}(R)$, where we have made explicit
the dependence of $\ov{\mu}$ and $\ov{\pi}$ on $\beta$, the constant
5008: $\zeta_2$ staying fixed. The posterior distribution $\nu \rho$ may
5009: then be chosen to maximize $\w{\beta}(\nu \rho)$ within some manageable
5010: subset of posterior distributions $\C{P}$, thus gaining the assurance
5011: that $\nu \rho(R) \leq \ov{\mu}_{\w{\beta}(\nu \rho)}\ov{\pi}_{\w{\beta}(\nu \rho)}
5012: (R)$, with the largest parameter $\w{\beta}(\nu \rho)$ that this
5013: approach can provide. Maximizing $\w{\beta}(\nu \rho)$ is supported by the
5014: fact that $\lim_{\beta \rightarrow + \infty} \ov{\mu}_{\beta}\ov{\pi}_{\beta}(R)
= \ess \inf_{\mu \pi} R$. However, there is no guarantee (to our knowledge) that
$\beta \mapsto \ov{\mu}_{\beta} \ov{\pi}_{\beta}(R)$ is a decreasing
function of $\beta$ everywhere, although this may be expected to be the case
in many practical situations.
5019:
5020: We can make the bound more explicit in several ways. One point
5021: of view is to put forward the optimal values of $\rho$ and $\nu$.
5022: We can thus remark that
5023: \begin{multline*}
5024: \nu \bigl[ \gamma \rho(r) + \C{K}(\rho, \pi) -
5025: \C{K}(\rho_2, \pi) \bigr] + (1 + \zeta_2) \C{K}(\nu, \mu)
5026: \\ =
5027: \nu \biggl[ \C{K}\bigl[ \rho, \pi_{\exp( - \gamma r)} \bigr]
5028: + \lambda_2 \rho_2(r)
+ \int_{\lambda_2}^{\gamma}
5030: \pi_{\exp( - \alpha r)}(r) d \alpha \biggr]
5031: + (1 + \zeta_2) \C{K}( \nu, \mu)
5032: \\ = \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \gamma r)} \bigr]
5033: \bigr\} + (1 + \zeta_2)
5034: \C{K}\bigl[ \nu, \mu_{ \exp
5035: \bigl( - \frac{\lambda_2 \rho_2(r)}{1 + \zeta_2}
5036: - \frac{1}{1 + \zeta_2} \int_{\lambda_2}^{\gamma}
5037: \pi_{\exp( - \alpha r)}(r) d \alpha \bigr)} \bigr]
5038: \\ - (1 + \zeta_2) \log \Biggl\{ \mu \Biggl[ \exp \biggl\{
5039: - \frac{\lambda_2}{1 + \zeta_2} \rho_2(r)
5040: - \frac{1}{1 + \zeta_2} \int_{\lambda_2}^{\gamma}
5041: \pi_{\exp( - \alpha r )}(r) d \alpha \biggr\} \Biggr] \Biggr\}.
5042: \end{multline*}
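The first equality can be checked as follows (a sketch, assuming as above
that $\rho_2 = \pi_{\exp( - \lambda_2 r)}$, and using the fact that
$\alpha \mapsto - \log \bigl\{ \pi \bigl[ \exp( - \alpha r) \bigr] \bigr\}$
has derivative $\pi_{\exp( - \alpha r)}(r)$):
\begin{align*}
\gamma \rho(r) + \C{K}(\rho, \pi) & =
\C{K}\bigl[ \rho, \pi_{\exp( - \gamma r)} \bigr]
- \log \bigl\{ \pi \bigl[ \exp( - \gamma r) \bigr] \bigr\},\\
- \C{K}(\rho_2, \pi) & = \lambda_2 \rho_2(r)
+ \log \bigl\{ \pi \bigl[ \exp( - \lambda_2 r) \bigr] \bigr\},\\
\log \bigl\{ \pi \bigl[ \exp( - \lambda_2 r) \bigr] \bigr\}
- \log \bigl\{ \pi \bigl[ \exp( - \gamma r) \bigr] \bigr\}
& = \int_{\lambda_2}^{\gamma} \pi_{\exp( - \alpha r)}(r) d \alpha.
\end{align*}
The second equality is the same change of reference measure applied to
$\nu$ and $\mu$, after factoring out $1 + \zeta_2$.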
5043: Thus
5044: \begin{multline*}
5045: B(\nu, \rho, \beta) =
5046: (1 + \zeta_2) \Bigl[ \xi_1 \nu_3 \rho_1(r) + \xi_4
5047: \nu_3 \rho_4(r) \\ + \log \bigl\{ \mu \bigl[ \exp
5048: \bigl( - \xi_1 \rho_1(r) - \xi_4 \rho_4(r) \bigr) \bigr] \bigr\}
5049: \Bigr] \\ - (1 + \zeta_2) \log \Biggl\{ \mu \Biggl[ \exp \biggl\{
5050: - \frac{\lambda_2}{1 + \zeta_2} \rho_2(r)
5051: - \frac{1}{1 + \zeta_2} \int_{\lambda_2}^{\gamma}
5052: \pi_{\exp( - \alpha r )}(r) d \alpha \biggr\} \Biggr] \Biggr\} \\ \shoveleft{\quad
5053: - \gamma \nu_3 \rho_1 (r)
5054: + (1 + \zeta_2) \bigl( 1 + \tfrac{NT}{\beta} \zeta_3 \bigr) }
5055: \\ \times
5056: \log \Biggl\{ \nu_3 \Biggl[ \rho_1 \biggl\{
5057: \exp \biggl[ C \tfrac{\beta}{NT \zeta_1} \bigl[ \nu \rho
5058: + \zeta_1 \rho_1 \bigr] (m') \biggr] \biggr\}^{\frac{\zeta_1 N T}{\beta
5059: (1 + \zeta_2)(1 + \frac{NT}{\beta}\zeta_3)}} \\
5060: \shoveright{\times \rho_4 \biggl\{ \exp \biggl[
5061: C \tfrac{\beta}{NT \zeta_5} \bigl[
5062: \zeta_3 \nu_3 \rho_1 + \zeta_5 \rho_4 \bigr] (m')
5063: \biggr] \biggr\}^{\frac{\zeta_5 N T}{\beta(1 + \zeta_2)(1 + \frac{NT}{\beta}
5064: \zeta_3)}} \Biggr] \Biggr\}}\\
5065: + \bigl( 1 + \zeta_2 \tfrac{NT}{\beta}\bigr) \nu \Biggl\{
\log \Biggl[ \rho_2 \biggl\{ \exp \biggl[ \tfrac{C}{1 + \zeta_2
5067: \frac{NT}{\beta}} \zeta_2 \rho_2 (m') \biggr] \biggr\} \Biggr] \Biggr\}\\
5068: + \zeta_4 \tfrac{NT}{\beta} \nu_3 \Biggl\{ \log \Biggl[
5069: \rho_4 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_4}
5070: \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_4
5071: \rho_4 \bigr] (m') \biggr] \biggr\} \Biggr] \Biggr\}\\
5072: \shoveleft{\quad + \nu \bigl\{ \C{K}\bigl[ \rho, \pi_{\exp( - \gamma r)} \bigr]
5073: \bigr\}} \\ + (1 + \zeta_2)
5074: \C{K}\bigl[ \nu, \mu_{ \exp
5075: \bigl( - \frac{\lambda_2 \rho_2(r)}{1 + \zeta_2}
5076: - \frac{1}{1 + \zeta_2} \int_{\lambda_2}^{\gamma}
5077: \pi_{\exp( - \alpha r)}(r) d \alpha \bigr)} \bigr]\\
5078: + \biggl(1 + \sum_{i=1}^5 \zeta_i \biggr) \log\bigl(\tfrac{2}{\epsilon}
5079: \bigr).
5080: \end{multline*}
This formula is better understood in the light of
the following upper bound for the first two lines
in the expression of $B(\nu, \rho, \beta)$:
5084: \begin{multline*}
5085: (1 + \zeta_2) \Bigl[ \xi_1 \nu_3 \rho_1(r) + \xi_4
5086: \nu_3 \rho_4(r) + \log \bigl\{ \mu \bigl[ \exp
5087: \bigl( - \xi_1 \rho_1(r) - \xi_4 \rho_4(r) \bigr) \bigr] \bigr\}
5088: \Bigr] \\ \shoveleft{\qquad - (1 + \zeta_2) \log \Biggl\{ \mu \Biggl[ \exp \biggl\{
5089: - \frac{\lambda_2}{1 + \zeta_2} \rho_2(r) }
5090: \\ \shoveright{ - \frac{1}{1 + \zeta_2} \int_{\lambda_2}^{\gamma}
5091: \pi_{\exp( - \alpha r )}(r) d \alpha \biggr\} \Biggr] \Biggr\} -
5092: \gamma \nu_3 \rho_1 (r)\qquad}\\
5093: \leq \nu_3 \biggl[ \lambda_2 \rho_2(r) + \int_{\lambda_2}^{\gamma}
5094: \pi_{\exp( - \alpha r)}(r) d \alpha - \gamma \rho_1(r) \biggr].
5095: \end{multline*}
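One possible way to see this (our reading, under the choice
$\nu_3 = \mu_{\exp - \xi_1 \rho_1(r) - \xi_4 \rho_4(r)}$ made before
Theorem \ref{thm1.59}) is to write
$h = \lambda_2 \rho_2(r) + \int_{\lambda_2}^{\gamma}
\pi_{\exp( - \alpha r)}(r) d \alpha$, a shorthand introduced only for this
remark, and to combine the identity
$$
\xi_1 \nu_3 \rho_1(r) + \xi_4 \nu_3 \rho_4(r) + \log \bigl\{ \mu \bigl[ \exp
\bigl( - \xi_1 \rho_1(r) - \xi_4 \rho_4(r) \bigr) \bigr] \bigr\}
= - \C{K}(\nu_3, \mu)
$$
with the usual variational inequality for the relative entropy
$$
- \log \Bigl\{ \mu \Bigl[ \exp \Bigl( - \frac{h}{1 + \zeta_2} \Bigr) \Bigr] \Bigr\}
\leq \frac{\nu_3(h)}{1 + \zeta_2} + \C{K}(\nu_3, \mu).
$$
Multiplying both relations by $1 + \zeta_2$ and adding them, the entropy
terms cancel, leaving $\nu_3(h) - \gamma \nu_3 \rho_1(r)$ as an upper bound.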
5096: Another approach to understanding Theorem \ref{thm1.59} is
5097: to put forward $\rho_0 = \pi_{\exp(- \lambda_0 r)}$,
5098: for some positive real constant $\lambda_0 < \gamma$,
5099: noticing that
5100: $$
5101: \nu \bigl[ \C{K}(\rho_0, \pi) - \C{K}(\rho_2, \pi) \bigr]
5102: = \lambda_0 \nu (\rho_2 - \rho_0)(r) - \nu \bigl[
5103: \C{K}(\rho_2, \rho_0) \bigr].
5104: $$
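Indeed (a direct check, writing
$Z_0 = \pi \bigl[ \exp( - \lambda_0 r) \bigr]$, a notation used only here),
for each model
\begin{align*}
\C{K}(\rho_0, \pi) & = - \lambda_0 \rho_0(r) - \log(Z_0),\\
\C{K}(\rho_2, \rho_0) & = \C{K}(\rho_2, \pi) + \lambda_0 \rho_2(r) + \log(Z_0),
\end{align*}
and adding these two identities before integrating with respect to $\nu$
gives the claimed equality.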
5105: Thus
5106: \begin{multline*}
5107: B(\nu, \rho_0, \beta) \leq
5108: \nu_3 \bigl[ (\gamma - \lambda_0) (\rho_0 - \rho_1)(r) + \lambda_0
5109: (\rho_2 - \rho_1)(r) \bigr] \\
5110: \shoveleft{\quad + (1 + \zeta_2) \bigl( 1 + \tfrac{NT}{\beta} \zeta_3 \bigr)
5111: } \\ \times \log \Biggl\{ \nu_3 \Biggl[ \rho_1 \biggl\{
5112: \exp \biggl[ C \tfrac{\beta}{NT \zeta_1} \bigl[ \nu \rho_0
5113: + \zeta_1 \rho_1 \bigr] (m') \biggr] \biggr\}^{\frac{\zeta_1 N T}{\beta
5114: (1 + \zeta_2)(1 + \frac{NT}{\beta}\zeta_3)}} \\
5115: \shoveright{ \times \rho_4 \biggl\{ \exp \biggl[
5116: C \tfrac{\beta}{NT \zeta_5} \bigl[
5117: \zeta_3 \nu_3 \rho_1 + \zeta_5 \rho_4 \bigr] (m')
5118: \biggr] \biggr\}^{\frac{\zeta_5 N T}{\beta(1 + \zeta_2)(1 + \frac{NT}{\beta}
5119: \zeta_3)}} \Biggr] \Biggr\}\quad}\\
5120: + \bigl( 1 + \zeta_2 \tfrac{NT}{\beta}\bigr) \nu \Biggl\{
\log \Biggl[ \rho_2 \biggl\{ \exp \biggl[ \tfrac{C}{1 + \zeta_2
5122: \frac{NT}{\beta}} \zeta_2 \rho_2 (m') \biggr] \biggr\} \Biggr] \Biggr\}\\
5123: + \zeta_4 \tfrac{NT}{\beta} \nu_3 \Biggl\{ \log \Biggl[
5124: \rho_4 \biggl\{ \exp \biggl[ C \tfrac{\beta}{NT \zeta_4}
5125: \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_4
5126: \rho_4 \bigr] (m') \biggr] \biggr\} \Biggr] \Biggr\}\\
5127: \shoveleft{\quad + (1 + \zeta_2) \C{K}\Bigl[
5128: \nu, \mu_{\exp \bigl( - \frac{(\gamma - \lambda_0) \rho_0(r) + \lambda_0 \rho_2(r)}{
5129: 1 + \zeta_2} \bigr)} \Bigr] }\\
5130: - \nu \bigl[ \C{K}(\rho_2, \rho_0) \bigr]
5131: + \biggl( 1 + \sum_{i=1}^5 \zeta_i \biggr)
5132: \log \bigl( \tfrac{2}{\epsilon} \bigr).
5133: \end{multline*}
5134:
In the case when we want to select a single model $\wm(\omega)$,
and therefore to set $\nu = \delta_{\wm}$, the previous
inequality suggests taking \\
5138: \mbox{} \hfill $\ds \wm \in \arg \min_{m \in M}
5139: (\gamma - \lambda_0) \rho_0(m, r) + \lambda_0 \rho_2(m, r)$.
5140: \hfill \mbox{}\\
5141: In parametric situations where $\pi_{\exp( - \lambda r)}(r)
5142: \simeq \sr(m) + \frac{d_e(m)}{\lambda}$,
5143: we get\\\mbox{}\hfill
$(\gamma - \lambda_0) \rho_0(m, r) + \lambda_0 \rho_2(m, r)
5145: \simeq \gamma \bigl[ \sr(m) + d_e(m) \bigl( \tfrac{1}{\lambda_0}
5146: + \tfrac{\lambda_0 - \lambda_2}{\gamma \lambda_2} \bigr)\bigr]$,\hfill
5147: \mbox{}\\
5148: resulting in a linear penalization of the empirical dimension of the
5149: models.
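For the reader's convenience, here is the computation behind this
approximation (a sketch, taking the parametric approximation above at face
value for $\lambda = \lambda_0$ and $\lambda = \lambda_2$, with
$\rho_0 = \pi_{\exp( - \lambda_0 r)}$ and $\rho_2 = \pi_{\exp( - \lambda_2 r)}$),
applied to the criterion minimized by $\wm$:
\begin{align*}
(\gamma - \lambda_0) \rho_0(m, r) + \lambda_0 \rho_2(m, r)
& \simeq (\gamma - \lambda_0) \Bigl[ \sr(m) + \tfrac{d_e(m)}{\lambda_0} \Bigr]
+ \lambda_0 \Bigl[ \sr(m) + \tfrac{d_e(m)}{\lambda_2} \Bigr] \\
& = \gamma \sr(m) + d_e(m) \Bigl( \tfrac{\gamma}{\lambda_0} - 1
+ \tfrac{\lambda_0}{\lambda_2} \Bigr) \\
& = \gamma \Bigl[ \sr(m) + d_e(m) \Bigl( \tfrac{1}{\lambda_0}
+ \tfrac{\lambda_0 - \lambda_2}{\gamma \lambda_2} \Bigr) \Bigr].
\end{align*}
The penalty is thus proportional to $d_e(m)$, with a multiplicative constant
depending only on $\gamma$, $\lambda_0$ and $\lambda_2$.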
5150:
5151: \subsubsection{Analysis of the two step relative bound}
We will not state a formal result, but will nevertheless give some
5153: hints about how to establish one.
5154: We should start from Theorem \ref{thm4.1}, which gives a deterministic variance
5155: term. From Theorem \ref{thm4.1}, after a
5156: change of prior distribution, we obtain
5157: for any positive constants $\alpha_1$ and $\alpha_2$,
5158: any prior distributions $\wt{\mu}_1$ and $\wt{\mu}_2
5159: \in \C{M}_+^1(M)$,
5160: for any prior conditional distributions $\wt{\pi}_1$
5161: and $\wt{\pi}_2 : M \rightarrow \C{M}_+^1(\Theta)$,
5162: with $\PP$ probability at least $1 - \eta$,
5163: for any posterior distributions $\nu_1 \rho_1$ and
5164: $\nu_2 \rho_2$,
5165: \begin{multline*}
5166: \alpha_1(\nu_1 \rho_1 - \nu_2 \rho_2)(R) \leq
5167: \alpha_2(\nu_1 \rho_1 - \nu_2 \rho_2)(r) \\ +
5168: \C{K}\bigl[ (\nu_1 \rho_1) \otimes (\nu_2 \rho_2),
5169: (\wt{\mu}_1\,\wt{\pi}_1)\otimes(\wt{\mu}_2\,\wt{\pi}_2)
5170: \bigr] \\
5171: + \log \Bigl\{ (\wt{\mu}_1\,\wt{\pi}_1)\otimes (\wt{\mu}_2\,\wt{\pi}_2) \Bigl[
5172: \exp \bigl\{ - \alpha_2 \Psi_{\frac{\alpha_2}{N}}(R',M') + \alpha_1 R' \bigr\}
5173: \Bigr] \Bigr\} - \log(\eta).
5174: \end{multline*}
5175: Applying this to $\alpha_1 = 0$, we get that
5176: \begin{multline*}
5177: (\nu \rho - \nu_3 \rho_1)(r)
5178: \leq \frac{1}{\alpha_2} \biggl[ \C{K}\bigl[
5179: (\nu \rho) \otimes (\nu_3 \rho_1), (\wt{\mu}\,\wt{\pi})\otimes (
5180: \wt{\mu}_3\,\wt{\pi}_1) \bigr]
\\ + \log \Bigl\{ (\wt{\mu}\,\wt{\pi})\otimes(\wt{\mu}_3\,\wt{\pi}_1)
5182: \Bigl[ \exp \bigl\{
5183: \alpha_2 \Psi_{-\frac{\alpha_2}{N}} (R', M') \bigr\} \Bigr] \Bigr\}
5184: - \log(\eta) \biggr].
5185: \end{multline*}
5186: In the same way, to bound quantities of the form
5187: \begin{multline*}
5188: \log \Biggl\{ \nu_3 \Biggl[ \rho_1 \biggl\{
5189: \exp \biggl[ C_1 (\nu \rho + \zeta_1 \rho_1)(m') \biggr] \biggr\}^{p_1}
5190: \\ \times \rho_4 \biggl\{ \exp \biggl[ C_2 \bigl[
5191: \zeta_3 \nu_3 \rho_1 + \zeta_5 \rho_4 \bigr] (m') \biggr]
5192: \biggr\}^{p_2} \Biggr] \Biggr\}
5193: \\ = \sup_{\nu_5} \biggl\{ p_1 \sup_{\rho_5} \Bigl\{
5194: C_1 \bigl[ (\nu \rho) \otimes (\nu_5 \rho_5) + \zeta_1 \nu_5(\rho_1
5195: \otimes \rho_5) \bigr](m') - \C{K}(\rho_5, \rho_1) \Bigr\}
5196: \\\qquad \qquad + p_2 \sup_{\rho_6} \Bigl\{ C_2 \bigl[ \zeta_3
5197: (\nu_3 \rho_1) \otimes (\nu_5 \rho_6) \hfill \\ + \zeta_5 \nu_5(\rho_4
5198: \otimes \rho_6) \bigr] (m') - \C{K}(\rho_6, \rho_4) \Bigr\}
5199: - \C{K}(\nu_5, \nu_3) \biggr\},
5200: \end{multline*}
5201: where $C_1$, $C_2$, $p_1$ and $p_2$ are positive constants,
5202: and similar terms,
5203: we need to use inequalities of the type: for any prior distributions
5204: $\wt{\mu}_i\,\wt{\pi}_i$, $i = 1, 2$, with $\PP$ probability
5205: at least $1 - \eta$, for any posterior distributions
5206: $\nu_i \rho_i$, $i = 1,2$,
5207: \begin{multline*}
5208: \alpha_3 (\nu_1 \rho_1) \otimes (\nu_2 \rho_2)(m')
5209: \leq
5210: \log \Bigl\{ (\wt{\mu}_1\,\wt{\pi}_1) \otimes
5211: (\wt{\mu}_2\,\wt{\pi}_2) \exp \Bigl[ \alpha_3 \Phi_{\frac{- \alpha_3}{N}}
5212: (M') \Bigr] \Bigr\} \\ + \C{K}\bigl[
5213: (\nu_1 \rho_1) \otimes (\nu_2 \rho_2), (\wt{\mu}_1\,\wt{\pi}_1)
5214: \otimes (\wt{\mu}_2\,\wt{\pi}_2) \bigr] - \log(\eta).
5215: \end{multline*}
We also need the variant: with $\PP$ probability at least $1 - \eta$,
5217: for any posterior distribution $\nu_1 : \Omega \rightarrow \C{M}_+^1(M)$
5218: and any conditional posterior distributions $\rho_1, \rho_2 :
5219: \Omega \times M \rightarrow \C{M}_+^1(\Theta)$,
5220: \begin{multline*}
5221: \alpha_3 \nu_1 (\rho_1 \otimes \rho_2)(m')
5222: \leq
5223: \log \Bigl\{ \wt{\mu}_1\bigl(\wt{\pi}_1 \otimes \wt{\pi}_2 \bigr)
5224: \exp \Bigl[ \alpha_3 \Phi_{- \frac{\alpha_3}{N}}(M') \Bigr] \Bigr\}
5225: \\ + \C{K}(\nu_1, \wt{\mu}_1) + \nu_1 \bigl\{
5226: \C{K}\bigl[
5227: \rho_1 \otimes \rho_2, \wt{\pi}_1
5228: \otimes \wt{\pi}_2 \bigr] \bigr\} - \log(\eta).
5229: \end{multline*}
5230: We deduce that
5231: \begin{multline*}
5232: \log \Biggl\{ \nu_3 \Biggl[
5233: \rho_1 \biggl\{ \exp \biggl[
5234: C_1 (\nu \rho + \zeta_1 \rho_1)(m') \biggr]
5235: \biggr\}^{p_1}
5236: \\ \shoveright{ \times \rho_4 \biggl\{ \exp
5237: \biggl[
5238: C_2 \bigl[ \zeta_3 \nu_3 \rho_1 + \zeta_5
5239: \rho_4 \bigr] (m') \biggr] \biggr\}^{p_2} \Biggr] \Biggr\} \quad } \\
5240: \leq \sup_{\nu_5} \Biggl\{ p_1
5241: \sup_{\rho_5} \Biggl[
5242: \frac{C_1}{\alpha_3} \biggl\{ \log \Bigl\{ (\wt{\mu} \, \wt{\pi})
5243: \otimes (\wt{\mu}_5\,\wt{\pi}_5) \exp \Bigl[
5244: \alpha_3 \Phi_{- \frac{\alpha_3}{N}}(M') \Bigr] \Bigr\}
5245: \\ + \C{K}\bigl[ (\nu \rho) \otimes (\nu_5 \rho_5),
(\wt{\mu}\,\wt{\pi}) \otimes (\wt{\mu}_5\,\wt{\pi}_5) \bigr]
5247: + \log(\tfrac{2}{\eta}) \\
5248: + \zeta_1 \biggl[
5249: \log \Bigl\{ \wt{\mu}_5 \bigl(
5250: \wt{\pi}_1 \otimes \wt{\pi}_5 \bigr)
5251: \exp \Bigl[ \alpha_3 \Phi_{- \frac{\alpha_3}{N}}
5252: (M') \Bigr] \Bigr\}
5253: \\ + \C{K}(\nu_5, \wt{\mu}_5)
5254: + \nu_5 \bigl\{ \C{K} \bigl[
5255: \rho_1 \otimes \rho_5,
5256: \wt{\pi}_1 \otimes \wt{\pi}_5 \bigr] \bigr\}
5257: + \log \bigl( \tfrac{2}{\eta} \bigr)
5258: \biggr] \biggr\} - \C{K}(\rho_5, \rho_1) \Biggr] \\
+ p_2 \sup_{\rho_6} \Biggl[
\frac{C_2}{\alpha_3} \biggl\{ \zeta_3 \biggl[ \log \Bigl\{ (\wt{\mu}_3 \, \wt{\pi}_1)
\otimes (\wt{\mu}_5\,\wt{\pi}_6) \exp \Bigl[
\alpha_3 \Phi_{- \frac{\alpha_3}{N}}(M') \Bigr] \Bigr\}
\\ + \C{K}\bigl[ (\nu_3 \rho_1) \otimes (\nu_5 \rho_6),
(\wt{\mu}_3\,\wt{\pi}_1) \otimes (\wt{\mu}_5\,\wt{\pi}_6) \bigr]
+ \log(\tfrac{2}{\eta}) \biggr] \\
+ \zeta_5 \biggl[
5267: \log \Bigl\{ \wt{\mu}_5 \bigl(
5268: \wt{\pi}_4 \otimes \wt{\pi}_6 \bigr)
5269: \exp \Bigl[ \alpha_3 \Phi_{- \frac{\alpha_3}{N}}
5270: (M') \Bigr] \Bigr\}
5271: \\ \hfill + \C{K}(\nu_5, \wt{\mu}_5)
5272: + \nu_5 \bigl\{ \C{K} \bigl[
5273: \rho_4 \otimes \rho_6,
5274: \wt{\pi}_4 \otimes \wt{\pi}_6 \bigr] \bigr\}
5275: + \log \bigl( \tfrac{2}{\eta} \bigr)
5276: \biggr] \biggr\}\qquad \\ - \C{K}(\rho_6, \rho_4) \Biggr]
5277: - \C{K}(\nu_5, \nu_3) \Biggr\}.
5278: \end{multline*}
5279:
5280: We are then left with the need to bound entropy terms like
5281: $\C{K}(\nu_3 \rho_1, \wt{\mu}_3\wt{\pi}_1)$, where we have the choice of
5282: $\wt{\mu}_3$ and $\wt{\pi}_1$, to obtain a useful bound.
5283: As could be expected, we decompose it into
5284: $$
5285: \C{K}(\nu_3 \rho_1, \wt{\mu}_3\wt{\pi}_1) =
5286: \C{K}(\nu_3, \wt{\mu}_3) + \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr].
5287: $$
Let us deal with the second term first, choosing $\wt{\pi}_1 = \pi_{\exp
5289: ( - \beta_1 R)}$:
5290: \begin{multline*}
5291: \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr]
5292: = \nu_3 \bigl[ \beta_1 (\rho_1 - \wt{\pi}_1)(R) + \C{K}(\rho_1, \pi)
5293: - \C{K}(\wt{\pi}_1, \pi) \bigr]
5294: \\ \leq \frac{\beta_1}{\alpha_1} \biggl[ \alpha_2 \nu_3(\rho_1 - \wt{\pi}_1)(r)
5295: + \C{K}(\nu_3, \wt{\mu}_3) + \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr]
5296: \\+ \log \Bigl\{ \wt{\mu}_3 \bigl( \wt{\pi}_1^{\otimes 2}
5297: \bigr) \Bigl[
5298: \exp \bigl\{ - \alpha_2 \Psi_{\frac{\alpha_2}{N}}
5299: (R', M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\} - \log(\eta) \biggr]
5300: \\ \shoveright{+ \nu_3 \bigl[ \C{K}(\rho_1, \pi) - \C{K}(\wt{\pi}_1, \pi) \bigr]
5301: \qquad}
5302: \\ \quad \leq \frac{\beta_1}{\alpha_1} \biggl[
5303: \C{K}(\nu_3, \wt{\mu}_3) + \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr]
5304: \hfill \\ + \log \Bigl\{
5305: \wt{\mu}_3 \bigl( \wt{\pi}_1^{\otimes 2} \bigr)
5306: \Bigl[ \exp \bigl\{
5307: - \alpha_2 \Psi_{\frac{\alpha_2}{N}}(R', M') + \alpha_1 R' \bigr\}
5308: \Bigr] \Bigr\} - \log(\eta) \biggr]
5309: \\ + \nu_3
5310: \bigl\{ \C{K}\bigl[ \rho_1 , \pi_{\exp ( -
5311: \frac{\beta_1 \alpha_2}{\alpha_1} r)} \bigr] \bigr\}.
5312: \end{multline*}
5313: Thus, when the constraint $\lambda_1 = \frac{\beta_1 \alpha_2}{\alpha_1}$
5314: is satisfied,
5315: \begin{multline*}
5316: \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr]
5317: \leq \Bigl( 1 - \frac{\beta_1}{\alpha_1} \Bigr)^{-1} \frac{\beta_1}{\alpha_1} \biggl[
5318: \C{K}(\nu_3, \wt{\mu}_3) \\ + \log \Bigl\{
5319: \wt{\mu}_3 \bigl(\wt{\pi}_1^{\otimes 2} \bigr)
5320: \Bigl[ \exp \bigl\{ - \alpha_2 \Psi_{\frac{\alpha_2}{N}}(R', M') + \alpha_1
5321: R' \bigr\} \Bigr] \Bigr\}
5322: - \log(\eta) \biggr].
5323: \end{multline*}
5324: We can further specialize the constants, choosing $\alpha_1
5325: = N \sinh(\frac{\alpha_2}{N})$, so that
5326: $$
5327: - \alpha_2 \Psi_{\frac{\alpha_2}{N}}(R', M') + \alpha_1 R'
5328: \leq 2 N \sinh\Bigl(\frac{\alpha_2}{2 N}\Bigr)^2 M'.
5329: $$
5330: We can for instance choose $\alpha_2 = \gamma$, $\alpha_1 = N \sinh(\frac{\gamma}{N})$,
5331: and $\beta_1 = \lambda_1 \frac{N}{\gamma} \sinh(\frac{\gamma}{N})$,
5332: leading to
5333: \begin{prop}\mypoint
5334: With the notations of Theorem \ref{thm1.59}, the constants being
5335: set as explained above, putting $
5336: \wt{\pi}_1 = \pi_{\exp( - \lambda_1 \frac{N}{\gamma}\sinh(\frac{\gamma}{N}) R)}$,
5337: with $\PP$ probability at least $1 - \eta$,
5338: \begin{multline*}
5339: \nu_3 \bigl[ \C{K}(\rho_1, \wt{\pi}_1) \bigr]
5340: \leq \Bigl( 1 - \frac{\lambda_1}{\gamma} \Bigr)^{-1}
5341: \frac{\lambda_1}{\gamma} \biggl[ \C{K}(\nu_3, \wt{\mu}_3)
5342: \\ + \log \Bigl\{
5343: \wt{\mu}_3 \bigl( \wt{\pi}_1^{\otimes 2} \bigr)\Bigl[
5344: \exp \bigl\{ 2 N \sinh(\tfrac{\gamma}{2N})^2 M' \bigr\} \Bigr] \Bigr\}
5345: - \log(\eta) \biggr].
5346: \end{multline*}
5347: More generally
5348: \begin{multline*}
5349: \nu_3 \bigl[ \C{K}(\rho, \wt{\pi}_1) \bigr]
5350: \leq \Bigl( 1 - \frac{\lambda_1}{\gamma} \Bigr)^{-1}
5351: \frac{\lambda_1}{\gamma} \biggl[ \C{K}(\nu_3, \wt{\mu}_3)
5352: \\ + \log \Bigl\{
5353: \wt{\mu}_3 \bigl( \wt{\pi}_1^{\otimes 2} \bigr)\Bigl[
5354: \exp \bigl\{ 2 N \sinh(\tfrac{\gamma}{2N})^2 M' \bigr\}
5355: \Bigr] \Bigr\} - \log(\eta) \biggr]
5356: \\ + \Bigl( 1 - \frac{\lambda_1}{\gamma} \Bigr)^{-1} \nu_3 \bigl[ \C{K}(
5357: \rho, \rho_1) \bigr].
5358: \end{multline*}
5359: \end{prop}
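As a sanity check on the constants of this proposition (a short verification
using only the choices $\alpha_2 = \gamma$,
$\alpha_1 = N \sinh(\frac{\gamma}{N})$ and
$\beta_1 = \lambda_1 \frac{N}{\gamma} \sinh(\frac{\gamma}{N})$ made above),
\begin{align*}
\frac{\beta_1 \alpha_2}{\alpha_1} & =
\frac{\lambda_1 \frac{N}{\gamma} \sinh(\frac{\gamma}{N}) \, \gamma}{
N \sinh(\frac{\gamma}{N})} = \lambda_1, &
\frac{\beta_1}{\alpha_1} & = \frac{\lambda_1}{\gamma},
\end{align*}
so that the constraint $\lambda_1 = \frac{\beta_1 \alpha_2}{\alpha_1}$ is
satisfied and the factor
$\bigl( 1 - \frac{\beta_1}{\alpha_1} \bigr)^{-1} \frac{\beta_1}{\alpha_1}$
becomes $\bigl( 1 - \frac{\lambda_1}{\gamma} \bigr)^{-1} \frac{\lambda_1}{\gamma}$.
Note that dividing by $1 - \frac{\lambda_1}{\gamma}$ presupposes
$\lambda_1 < \gamma$.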
5360: In a similar way, let us choose now $\wt{\mu}_3 = \mu_{\exp[ - \alpha_3 \opi(R)]}$.
5361: We can write
5362: \begin{multline*}
5363: \C{K}(\nu, \wt{\mu}_3) = \alpha_3 (\nu - \wt{\mu}_3)\opi(R)
5364: + \C{K}(\nu, \mu) - \C{K}(\wt{\mu}_3, \mu)
5365: \\ \leq \frac{\alpha_3}{\alpha_1} \biggl[ \alpha_2 (\nu - \wt{\mu}_3)\opi(r)
5366: + \C{K}(\nu, \wt{\mu}_3) \\ + \log \Bigl\{ (\wt{\mu}_3 \opi) \otimes
5367: (\wt{\mu}_3 \opi) \Bigl[ \exp \bigl\{
5368: - \alpha_2 \Psi_{\frac{\alpha_2}{N}}(R',M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\}
5369: - \log(\eta) \biggr] \\
5370: + \C{K}(\nu, \mu) - \C{K}(\wt{\mu}_3, \mu).
5371: \end{multline*}
5372: Let us choose $\alpha_2 = \gamma$, $\alpha_1 = N \sinh(\frac{\gamma}{N})$, and
5373: let us add some other entropy inequalities to get
5374: rid of $\opi$ in a suitable way, the approach of entropy
5375: compensation being quite the same as the one used
5376: to obtain the empirical bound of Theorem \ref{thm1.59}.
5377: This results with $\PP$ probability
5378: at least $1 - \eta$ in
5379: \begin{multline*}
5380: \Bigl( 1 - \frac{\alpha_3}{\alpha_1} \Bigr)
5381: \C{K}(\nu, \wt{\mu}_3) \leq \frac{\alpha_3}{\alpha_1} \biggl[
5382: \gamma (\nu - \wt{\mu}_3)\opi(r)
5383: \\+ \log \Bigl\{ ( \wt{\mu}_3 \opi) \otimes ( \wt{\mu}_3 \opi)
5384: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\}
5385: \Bigr] \Bigr\} + \log(\tfrac{2}{\eta}) \biggr]
5386: \\ \hfill + \C{K}(\nu, \mu) - \C{K}(\wt{\mu}_3, \mu),\quad\\\quad
5387: \zeta_6 \Bigl(1 - \frac{\beta}{\alpha_1} \Bigr)
5388: \wt{\mu}_3 \bigl[ \C{K}(\rho_6, \opi) \bigr]
5389: \leq \zeta_6 \frac{\beta}{\alpha_1} \biggl[
5390: \gamma \wt{\mu}_3 (\rho_6 - \opi)(r)\hfill\\
5391: + \log \Bigl\{ \wt{\mu}_3\bigl(\opi^{\otimes 2}\bigr)
5392: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M')
5393: + \alpha_1 R' \bigr\} \Bigr] \Bigr\} + \log(\tfrac{2}{\eta}) \biggr]
5394: \\ \hfill + \zeta_6 \wt{\mu}_3 \bigl[
5395: \C{K}(\rho_6, \pi) - \C{K}(\opi, \pi) \bigr],\quad\\\quad
5396: \zeta_7 \Bigl(1 - \frac{\beta}{\alpha_1} \Bigr)
5397: \wt{\mu}_3 \bigl[ \C{K}(\rho_7, \opi) \bigr]
5398: \leq \zeta_7 \frac{\beta}{\alpha_1} \biggl[
5399: \gamma \wt{\mu}_3 (\rho_7 - \opi)(r)\hfill \\
5400: + \log \Bigl\{ \wt{\mu}_3\bigl(\opi^{\otimes 2}\bigr)
5401: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M')
5402: + \alpha_1 R' \bigr\} \Bigr] \Bigr\} + \log(\tfrac{2}{\eta}) \biggr]
5403: \\ \hfill + \zeta_7 \wt{\mu}_3 \bigl[
5404: \C{K}(\rho_7, \pi) - \C{K}(\opi, \pi) \bigr],\quad\\\quad
5405: \zeta_8 \Bigl( 1 - \frac{\beta}{\alpha_1} \Bigr) \nu \bigl[ \C{K}(\rho_8, \opi) \bigr]
5406: \leq \zeta_8 \frac{\beta}{\alpha_1} \biggl[ \gamma \nu ( \rho_8 - \opi) (r)
5407: + \C{K}(\nu, \wt{\mu}_3) \hfill\\ +
5408: \log \Bigl\{ \wt{\mu}_3\bigl(\opi^{\otimes 2}\bigr)
5409: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\}
5410: \Bigr] \Bigr\} + \log(\tfrac{2}{\eta}) \biggr]
5411: \\ \hfill + \zeta_8 \nu \bigl[ \C{K}(\rho_8, \pi)
5412: - \C{K}(\opi, \pi) \bigr],\quad\\\quad
5413: \zeta_9 \Bigl( 1 - \frac{\beta}{\alpha_1} \Bigr) \nu \bigl[ \C{K}(\rho_9, \opi) \bigr]
5414: \leq \zeta_9 \frac{\beta}{\alpha_1} \biggl[ \gamma \nu ( \rho_9 - \opi) (r)
5415: + \C{K}(\nu, \wt{\mu}_3) \hfill\\ +
5416: \log \Bigl\{ \wt{\mu}_3\bigl(\opi^{\otimes 2}\bigr)
5417: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\}
5418: \Bigr] \Bigr\} + \log(\tfrac{2}{\eta}) \biggr]
5419: \\ \hfill + \zeta_9 \nu \bigl[ \C{K}(\rho_9, \pi)
5420: - \C{K}(\opi, \pi) \bigr],
5421: \end{multline*}
where we have introduced several constants, assumed to be positive,
which we will constrain more precisely by
5424: \begin{align*}
5425: x_8 + x_9 & = 1,\\
5426: ( \zeta_6 \beta + x_8 \alpha_3) \frac{\gamma}{\alpha_1} & = \lambda_6,\\
5427: ( \zeta_7 \beta + x_9 \alpha_3) \frac{\gamma}{\alpha_1} & = \lambda_7,\\
5428: ( \zeta_8 \beta - x_8 \alpha_3) \frac{\gamma}{\alpha_1} & = \lambda_8,\\
5429: ( \zeta_9 \beta - x_9 \alpha_3) \frac{\gamma}{\alpha_1} & = \lambda_9.
5430: \end{align*}
5431: We get with $\PP$ probability at least $1 - \eta$,
5432: \begin{multline*}
5433: \Bigl( 1 - \frac{\alpha_3}{\alpha_1} -
5434: (\zeta_8 + \zeta_9) \frac{\beta}{\alpha_1} \Bigr)
5435: \C{K}(\nu, \wt{\mu}_3) \leq
\\ \frac{\alpha_3}{\alpha_1} \gamma \bigl[ \nu (
5437: x_8 \rho_8 + x_9 \rho_9)(r) - \wt{\mu}_3 (x_8 \rho_6 + x_9 \rho_7) (r) \bigr]
5438: \\ + \frac{\alpha_3}{\alpha_1} \log
5439: \Bigl\{ (\wt{\mu}_3 \opi) \otimes (\wt{\mu}_3 \opi)
5440: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M')
5441: + \alpha_1 R' \bigr\} \Bigr] \Bigr\} \\
5442: + (\zeta_6 + \zeta_7 + \zeta_8 + \zeta_9) \frac{\beta}{\alpha_1}
5443: \log \Bigl\{ \wt{\mu}_3 \bigl(
5444: \opi^{\otimes 2} \bigr)
5445: \Bigl[ \exp \bigl\{ - \gamma
5446: \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\}\\
5447: + \C{K}(\nu, \mu) - \C{K}(\wt{\mu}_3, \mu)
5448: + \Bigl( \frac{\alpha_3}{\alpha_1} + (\zeta_6 + \zeta_7 + \zeta_8 +
5449: \zeta_9) \frac{\beta}{\alpha_1} \Bigr) \log\bigl( \tfrac{2}{\eta} \bigr).
5450: \end{multline*}
5451: Let us choose the constants so that
5452: $\lambda_1 = \lambda_7 = \lambda_9$, $\lambda_4 = \lambda_6 = \lambda_8$,
5453: $\alpha_3 x_9 \frac{\gamma}{\alpha_1} = \xi_1$ and $ \alpha_3 x_8
5454: \frac{\gamma}{\alpha_1} = \xi_4$.
5455: This is done by setting
5456: \begin{align*}
5457: x_8 & = \frac{\xi_4}{\xi_1 + \xi_4},\\
5458: x_9 & = \frac{\xi_1}{\xi_1 + \xi_4},\\
5459: \alpha_3 & = \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) ( \xi_1 + \xi_4),\\
5460: \zeta_6 & = \tfrac{N}{\gamma}\sinh(\tfrac{\gamma}{N}) \frac{(\lambda_4 - \xi_4)}{\beta},\\
5461: \zeta_7 & = \tfrac{N}{\gamma}\sinh(\tfrac{\gamma}{N})
5462: \frac{(\lambda_1 - \xi_1)}{\beta},\\
5463: \zeta_8 & = \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) \frac{(\lambda_4 +
5464: \xi_4)}{\beta},\\
5465: \zeta_9 & = \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) \frac{(\lambda_1 + \xi_1)}{
5466: \beta}.
5467: \end{align*}
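These values can be checked against the constraints (a verification sketch;
the shorthand $c = \frac{N}{\gamma} \sinh(\frac{\gamma}{N}) =
\frac{\alpha_1}{\gamma}$ is introduced only for this purpose).
Since $x_8 \alpha_3 = c \, \xi_4$ and $x_9 \alpha_3 = c \, \xi_1$, we get
\begin{align*}
\lambda_6 & = \tfrac{1}{c} ( \zeta_6 \beta + x_8 \alpha_3)
= (\lambda_4 - \xi_4) + \xi_4 = \lambda_4,\\
\lambda_7 & = \tfrac{1}{c} ( \zeta_7 \beta + x_9 \alpha_3)
= (\lambda_1 - \xi_1) + \xi_1 = \lambda_1,\\
\lambda_8 & = \tfrac{1}{c} ( \zeta_8 \beta - x_8 \alpha_3)
= (\lambda_4 + \xi_4) - \xi_4 = \lambda_4,\\
\lambda_9 & = \tfrac{1}{c} ( \zeta_9 \beta - x_9 \alpha_3)
= (\lambda_1 + \xi_1) - \xi_1 = \lambda_1,
\end{align*}
while $\alpha_3 x_9 \frac{\gamma}{\alpha_1} = \xi_1$ and
$\alpha_3 x_8 \frac{\gamma}{\alpha_1} = \xi_4$, as required.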
5468: The inequality $\lambda_1 > \xi_1$ is always satisfied. The inequality
5469: $\lambda_4 > \xi_4$ is required for the above choice of constants, and
5470: will be satisfied for a suitable choice of $\zeta_3$ and $\zeta_4$.
5471:
Under these assumptions, we obtain with $\PP$ probability at least $1 - \eta$
5473: \begin{multline*}
5474: \Bigl( 1 - \frac{\alpha_3}{\alpha_1} -
5475: (\zeta_8 + \zeta_9) \frac{\beta}{\alpha_1} \Bigr)
5476: \C{K}(\nu, \wt{\mu}_3) \leq
5477: (\nu - \wt{\mu}_3) (\xi_1 \rho_1 + \xi_4 \rho_4)(r)
5478: \\ + \frac{\alpha_3}{\alpha_1} \log
5479: \Bigl\{ (\wt{\mu}_3 \opi) \otimes (\wt{\mu}_3 \opi)
5480: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M')
5481: + \alpha_1 R' \bigr\} \Bigr] \Bigr\} \\
5482: + (\zeta_6 + \zeta_7 + \zeta_8 + \zeta_9) \frac{\beta}{\alpha_1}
5483: \log \Bigl\{ \wt{\mu}_3 \bigl(
5484: \opi^{\otimes 2} \bigr)
5485: \Bigl[ \exp \bigl\{ - \gamma
5486: \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\}\\
5487: + \C{K}(\nu, \mu) - \C{K}(\wt{\mu}_3, \mu)
5488: + \Bigl( \frac{\alpha_3}{\alpha_1} + (\zeta_6 + \zeta_7 + \zeta_8 +
5489: \zeta_9) \frac{\beta}{\alpha_1} \Bigr) \log\bigl( \tfrac{2}{\eta} \bigr).
5490: \end{multline*}
5491: This proves
5492: \begin{prop}
5493: \mypoint
5494: The constants being set as explained above,
5495: with $\PP$ probability at least $1 - \eta$,
5496: for any posterior distribution $\nu : \Omega \rightarrow \C{M}_+^1(M)$,
5497: \begin{multline*}
5498: \C{K}(\nu, \wt{\mu}_3) \leq \Bigl( 1 - \frac{\alpha_3}{\alpha_1} -
5499: (\zeta_8 + \zeta_9) \frac{\beta}{\alpha_1} \Bigr)^{-1}
5500: \biggl[ \C{K}(\nu, \nu_3)
5501: \\ + \frac{\alpha_3}{\alpha_1} \log
5502: \Bigl\{ (\wt{\mu}_3 \opi) \otimes (\wt{\mu}_3 \opi)
5503: \Bigl[ \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M')
5504: + \alpha_1 R' \bigr\} \Bigr] \Bigr\} \\
5505: + (\zeta_6 + \zeta_7 + \zeta_8 + \zeta_9) \frac{\beta}{\alpha_1}
5506: \log \Bigl\{ \wt{\mu}_3 \bigl(
5507: \opi^{\otimes 2} \bigr)
5508: \Bigl[ \exp \bigl\{ - \gamma
5509: \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\}\\
5510: + \Bigl( \frac{\alpha_3}{\alpha_1} + (\zeta_6 + \zeta_7 + \zeta_8 +
5511: \zeta_9) \frac{\beta}{\alpha_1} \Bigr) \log\bigl( \tfrac{2}{\eta} \bigr)\biggr] .
5512: \end{multline*}
5513: \end{prop}
5514: Thus
5515: \begin{multline*}
5516: \C{K}(\nu_3 \rho_1, \wt{\mu}_3\,\wt{\pi}_1) \leq
5517: \frac{1 + \bigl(1 - \frac{\lambda_1}{\gamma}\bigr)^{-1} \frac{\lambda_1}{\gamma}}{
5518: 1 - \frac{\alpha_3}{\alpha_1} - (\zeta_8+\zeta_9)\frac{\beta}{\alpha_1}} \\ \times
5519: \biggl[ \frac{\alpha_3}{\alpha_1} \log \Bigl\{
(\wt{\mu}_3 \ov{\pi}) \otimes (\wt{\mu}_3 \ov{\pi}) \Bigl[
5521: \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}
5522: (R',M') + \alpha_1 R' \bigr\} \Bigr] \Bigr\}
5523: \\ + (\zeta_6 + \zeta_7 + \zeta_8 + \zeta_9) \frac{\beta}{\alpha_1}
5524: \log \Bigl\{ \wt{\mu}_3 \bigl( \ov{\pi}^{\otimes 2} \bigr) \Bigl[
5525: \exp \bigl\{ - \gamma \Psi_{\frac{\gamma}{N}}(R', M') + \alpha_1 R' \bigr\} \Bigr]
5526: \Bigr\} \\
5527: + \Bigl( \frac{\alpha_3}{\alpha_1} + (
5528: \zeta_6 + \zeta_7 + \zeta_8 + \zeta_9) \frac{\beta}{\alpha_1} \Bigr)
5529: \log \bigl( \tfrac{2}{\eta} \bigr) \biggr] \\
5530: + \Bigl( 1 - \frac{\lambda_1}{\gamma} \Bigr)^{-1} \frac{\lambda_1}{\gamma} \biggl[
5531: \log \Bigl\{ \wt{\mu}_3 \bigl( \wt{\pi}_1^{\otimes 2} \bigr)
5532: \Bigl[ \exp \bigl\{ 2 N \sinh\bigl(\tfrac{\gamma}{2N} \bigr)^2
M' \bigr\} \Bigr] \Bigr\} + \log( \tfrac{2}{\eta} ) \biggr].
5534: \end{multline*}
We will not go further, lest it become tedious, but we hope we have
5536: given sufficient hints to state informally that the bound $B(\nu, \rho, \beta)$
5537: of Theorem \ref{thm1.59} is upper bounded
5538: with $\PP$ probability close to one by a
5539: bound of the same flavour where the empirical quantities $r$ and $m'$
5540: have been replaced with their expectations $R$ and $M'$.
5541:
5542: \section{Transductive PAC-Bayesian learning}
5543:
5544: \subsection{Basic inequalities}
5545: In this section the observed sample $(X_i, Y_i)_{i=1}^N$
5546: will be supplemented with a {\em shadow sample}
5547: $(X_i,Y_i)_{i=N+1}^{(k+1)N}$.
5548: This point of view, called {\em transductive classification},
was introduced by V. Vapnik. It may be justified in different
5550: ways.
5551:
5552: On the practical side,
5553: one interest of the transductive setting is that it is
5554: often a lot easier to collect examples than it is to label them,
so that it is not unrealistic to assume that we indeed have
5556: two training samples, one labelled and one unlabelled.
5557: It also covers the case when a batch of patterns
5558: is to be classified and we are allowed to observe
5559: the whole batch before issuing the classification.
5560:
5561: On the mathematical side, considering a shadow sample
proves technically fruitful. Indeed, when introducing
the VC entropy and VC dimension concepts, as well as when
dealing with compression
schemes, although the {\em inductive} setting is our
5566: final concern, the transductive setting is a
5567: useful detour.
5568: In this second scenario, intermediate technical results
5569: involving the shadow sample are integrated with respect
5570: to unobserved random variables in a second stage of the proofs.
5571:
5572: Let us describe now the changes to be made to previous
5573: notations to adapt them to the transductive setting.
5574: The distribution $\PP$ will be a probability measure on the
5575: canonical space $\Omega = (\C{X} \times \C{Y})^{(k+1)N}$,
5576: and $(X_i,Y_i)_{i=1}^{(k+1)N}$
5577: will be the canonical process on this space
5578: (that is the coordinate process).
Unless explicitly mentioned, the parameter $k$ indicating the
5580: size of the shadow sample will remain fixed.
5581: Assuming the shadow sample size is a multiple of the
5582: training sample size is convenient without significantly
5583: restricting the generality.
5584: For a while, we will use a weaker assumption than independence,
5585: assuming that $\PP$ is {\em partially exchangeable},
since this is all that we need in the proofs.
5587: \begin{dfn}
5588: \mypoint For $i = 1, \dots, N$,
5589: let $\tau_i : \Omega \rightarrow \Omega$ be defined
5590: for any \linebreak $\omega = (\omega_j)_{j=1}^{(k+1)N} \in \Omega$ by
5591: $$
5592: \begin{cases}
5593: \tau_i(\omega)_{i + jN} = \omega_{i + (j-1)N}, & j=1, \dots, k,\\
5594: \tau_i(\omega)_{i} = \omega_{i+kN}, & \\
5595: \text{and } \tau_i(\omega)_{m + j N} = \omega_{m + j N}, &
5596: m\neq i, m = 1, \dots, N, j=0, \dots k.
5597: \end{cases}
5598: $$
Clearly, if we arrange the $(k+1)N$ samples in an $N \times (k+1)$ array,
$\tau_i$ performs a circular permutation of the $k+1$ entries
of the $i$th row, leaving the
other rows unchanged.
5603: Moreover, all the circular permutations of the $i$th
5604: row have the form $\tau_i^j$, $j$ ranging from $0$ to $k$.
5605:
5606: The probability distribution $\PP$ is said to be partially exchangeable if
5607: for any $i = 1, \dots, N$, $\PP \circ \tau_i^{-1} = \PP$.
5608:
5609: This means equivalently that for any
5610: bounded measurable function $h : \Omega \rightarrow \RR$, $\PP ( h \circ \tau_i) = \PP (h)$.
5611:
5612: In the same way a function $h$ defined on $\Omega$ will be said to
5613: be partially exchangeable if $h \circ \tau_i = h$ for
5614: any $i=1, \dots, N$.
5615: Accordingly a posterior distribution
5616: $\rho : \Omega \rightarrow \C{M}_+^1(\Theta, \C{T})$ will be said to
5617: be partially exchangeable when $\rho(\omega, A) = \rho \bigl[\tau_i(\omega), A
5618: \bigr]$, for any $\omega \in \Omega$, any $i = 1, \dots, N$
5619: and any $A \in \C{T}$.
5620: \end{dfn}
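As a small illustration of this definition (the values $N = 2$ and $k = 2$
are chosen here only for the sake of the example), the point
$\omega = (\omega_1, \dots, \omega_6)$ is arranged in the array whose first
row is $(\omega_1, \omega_3, \omega_5)$ and whose second row is
$(\omega_2, \omega_4, \omega_6)$, and
\begin{align*}
\tau_1(\omega) & = (\omega_5, \omega_2, \omega_1, \omega_4, \omega_3, \omega_6),\\
\tau_1^2(\omega) & = (\omega_3, \omega_2, \omega_5, \omega_4, \omega_1, \omega_6),\\
\tau_1^3(\omega) & = \omega,
\end{align*}
so that $\tau_1^0$, $\tau_1$ and $\tau_1^2$ are the three circular
permutations of the first row, the second row being left unchanged.
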
5621: For any bounded measurable function $h$, let us define
5622: $T_i(h) = \frac{1}{k+1} \sum_{j=0}^k h \circ \tau_i^j$.
5623: Let $T(h) = T_N \circ \dots \circ T_1(h)$.
5624: For any partially exchangeable probability distribution $\PP$, and for
5625: any bounded measurable function $h$, $\PP \bigl[ T(h) \bigr] = \PP(h)$.
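This invariance can be checked in a few lines, using only the
partial exchangeability of $\PP$. Since $\PP( h \circ \tau_i) = \PP(h)$
for any bounded measurable function $h$, an induction on $j$ shows that
$\PP ( h \circ \tau_i^j ) = \PP(h)$ for any $j$, and therefore
$$
\PP \bigl[ T_i(h) \bigr] = \frac{1}{k+1} \sum_{j=0}^k \PP \bigl( h \circ \tau_i^j \bigr)
= \PP(h), \qquad i = 1, \dots, N.
$$
Applying this successively to $T_N, T_{N-1}, \dots, T_1$ (each $T_i(h)$ being
itself bounded and measurable) gives $\PP \bigl[ T(h) \bigr] = \PP(h)$.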
5626: Let us put
5627: \renewcommand{\rr}{\overline{r}}
5628: \begin{align*}
5629: \sigma_i(\theta) & = \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr],
5630: \quad \begin{tabular}[t]{l}indicating the success or failure of $f_{\theta}$\\
5631: to predict $Y_i$ from $X_i$,\end{tabular}\\
5632: r_1(\theta) & = \frac{1}{N} \sum_{i=1}^N \sigma_i(\theta),
5633: \quad \begin{tabular}[t]{l} the empirical error rate of $f_{\theta}$ \\
5634: on the observed sample,\end{tabular}\\
5635: r_2(\theta) & = \frac{1}{kN} \sum_{i=N+1}^{(k+1)N}
5636: \sigma_i(\theta),\quad \text{the error rate of $f_{\theta}$
5637: on the shadow sample,}\\
5638: \rr(\theta) & = \frac{r_1(\theta) + k r_2(\theta)}{k+1}
5639: = \frac{1}{(k+1)N} \sum_{i=1}^{(k+1)N}
5640: \sigma_i(\theta), \quad \begin{tabular}[t]{l}the global error \\
5641: rate of $f_{\theta}$,\end{tabular}\\
5642: R_i(\theta) & = \PP \bigl[ f_{\theta}(X_i) \neq Y_i \bigr],\quad
5643: \begin{tabular}[t]{l}the expected error \\ rate of $f_{\theta}$ on the $i$th
5644: input,\end{tabular}\\
5645: R(\theta) & = \frac{1}{N} \sum_{i=1}^N R_i(\theta) =
5646: \PP \bigl[ r_1(\theta) \bigr] = \PP \bigl[ r_2(\theta) \bigr],
5647: \quad \text{the average expected} \\* \text{error} & \text{ rate of $f_{\theta}$
5648: on all inputs.}
5649: \end{align*}
5650: We will allow for posterior
5651: distributions $\rho : \Omega \rightarrow \C{M}_+^1(\Theta)$
5652: depending on the shadow sample. The most interesting ones will anyhow
5653: be independent of the shadow labels $Y_{N+1}, \dots, Y_{(k+1)N}$.
5654: We will be interested in the conditional expected
5655: error rate of the randomized classification
5656: rule described by $\rho$ on the shadow sample, given the observed
5657: sample, which reads as
5658: $\PP \bigl[ \rho(r_2) \lvert (X_i,Y_i)_{i=1}^N\bigr]$.
5659:
Let us comment on the case when $\PP$ is invariant
under any permutation of the entries of each row, meaning that
5662: \\ \mbox{} \hfill $\PP
5663: \bigl[ h(\omega \circ s) \bigr] = \PP \bigl[ h(\omega) \bigr]$
5664: for all $s \in \mathfrak{S}(\{i+jN ; j=0, \dots, k \})$
5665: \hfill\mbox{}\\ and all $i=1,
5666: \dots, N$ (where $\mathfrak{S}(A)$ is the set of permutations of $A$,
5667: extended to $\{1, \dots, (k+1)N \}$ so as to be the identity outside
5668: of $A$).
5669: In this case, if $\rho$ is invariant by permutations of the rows of
5670: the shadow sample, meaning that $\rho(\omega \circ s) = \rho(\omega)
5671: \in \C{M}_+^1(\Theta)$, $s \in \mathfrak{S}(\{i+jN; j=1, \dots, k \})$,
5672: $i = 1, \dots, N$, then $\PP \bigl[ \rho(r_2) \lvert (X_i,Y_i)_{i=1}^N \bigr] =
5673: \frac{1}{N} \sum_{i=1}^N \PP \bigl[ \rho(\sigma_{i+N})
5674: \lvert (X_i,Y_i)_{i=1}^N \bigr]$, meaning that
5675: the expectation can be taken on a restricted shadow sample
5676: of the same size as the observed sample.
5677: If moreover the rows are equidistributed (meaning that their marginal distributions
5678: are equal), then
5679: \\\mbox{}\hfill $\PP \bigl[ \rho(r_2)
5680: \lvert (X_i,Y_i)_{i=1}^N \bigr] = \PP \bigl[ \rho(\sigma_{N+1})
5681: \lvert (X_i,Y_i)_{i=1}^N \bigr]$. \hfill \mbox{}\\
This means that under these quite commonly fulfilled assumptions,
the expectation can be taken on a single
new object to be classified.
Our study thus covers the case when only one of the
patterns from the shadow sample is to be labelled and one is interested
in the expected error rate of this single labelling.
5688: Of course, in the case when
5689: $\PP$ is i.i.d. and $\rho$ depends only on the
5690: training sample $(X_i,Y_i)_{i=1}^N$, we fall back on
5691: the usual criterion of performance
5692: $\PP \bigl[ \rho(r_2) \lvert (Z_i)_{i=1}^N \bigr] = \rho(R)
5693: = \rho(R_1)$.
5694:
5695: Let us recall the notation
5696: $
5697: \Phi_{a}(p) = - a^{-1} \log \bigl\{ 1 - p \bigl[ 1 - \exp( - a) \bigr] \bigr\}.
5698: $
5699:
5700: Using an obvious factorization, and considering for the moment
5701: a fixed value of $\theta$ and any partially exchangeable positive real measurable
5702: function $\lambda : \Omega \rightarrow \RR_+$, we can compute the
5703: $\log$ Laplace transform of $r_1$ under $T$, which acts like a
5704: conditional probability distribution:
5705: \begin{multline*}
5706: \log \Bigl\{ T \bigl[ \exp ( - \lambda r_1 ) \bigr] \Bigr\}
5707: = \sum_{i=1}^N \log \Bigl\{ T_i \bigl[ \exp ( - \tfrac{\lambda}{N} \sigma_i ) \bigr]
5708: \Bigr\} \\
5709: \leq N \log \biggl\{ \frac{1}{N} \sum_{i=1}^N T_i \Bigl[
5710: \exp \bigl( - \tfrac{\lambda}{N} \sigma_i \bigr) \Bigr] \biggr\}
5711: = - \lambda \Phi_{\frac{\lambda}{N}}(\rr).
5712: \end{multline*}
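The three steps of this computation can be spelled out as follows
(this is only a sketch, in which $\lambda$ is treated as a constant, as is
legitimate under $T$ since $\lambda$ is partially exchangeable).
The first equality comes from the factorization
$\exp( - \lambda r_1 ) = \prod_{i=1}^N \exp ( - \tfrac{\lambda}{N} \sigma_i )$,
where the $i$th factor depends only on the $i$th row, on which $T_i$ alone acts.
The inequality is the concavity of the logarithm.
For the last equality, since $\sigma_i \in \{0,1\}$,
$$
\exp \bigl( - \tfrac{\lambda}{N} \sigma_i \bigr) = 1 - \sigma_i
\bigl[ 1 - \exp ( - \tfrac{\lambda}{N} ) \bigr],
\quad \text{and} \quad
\frac{1}{N} \sum_{i=1}^N T_i (\sigma_i) = \frac{1}{(k+1)N}
\sum_{i=1}^N \sum_{j=0}^k \sigma_{i + jN} = \rr,
$$
so that
$$
N \log \biggl\{ \frac{1}{N} \sum_{i=1}^N T_i \Bigl[
\exp \bigl( - \tfrac{\lambda}{N} \sigma_i \bigr) \Bigr] \biggr\}
= N \log \Bigl\{ 1 - \rr \bigl[ 1 - \exp ( - \tfrac{\lambda}{N} ) \bigr] \Bigr\}
= - \lambda \Phi_{\frac{\lambda}{N}}(\rr).
$$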
5713: Remarking that $T \Bigl\{ \exp \Bigl[
5714: \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) - r_1 \bigr] \Bigr] \Bigr\}
5715: = \exp \bigl[ \lambda \Phi_{\frac{\lambda}{N}}(\rr) \bigr] T \bigl[
5716: \exp ( - \lambda r_1) \bigr]$ we obtain
5717: \begin{lemma}
5718: \mypoint For any $\theta \in \Theta$ and any partially
5719: exchangeable positive real
5720: measurable function $\lambda : \Omega \rightarrow \RR_+$,
5721: $$
5722: T \Bigl\{ \exp \Bigl[ \lambda \bigl\{ \Phi_{\frac{\lambda}{N}}
5723: \bigl[ \rr(\theta) \bigr] - r_1(\theta) \bigr\} \Bigr]
5724: \Bigr\} \leq 1.
5725: $$
5726: \end{lemma}
5727: We deduce from this lemma a result analogous to the inductive case:
5728: \begin{thm}
5729: \label{thm1.2}
5730: \mypoint For any partially exchangeable positive real measurable
5731: function $\lambda : \Omega \times \Theta \rightarrow \RR_+$,
5732: for any partially exchangeable posterior distribution
5733: $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,
5734: $$
5735: \PP \biggl\{ \exp \biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}
5736: \rho \Bigl[ \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) - r_1 \bigr] \Bigr]
5737: - \C{K}(\rho, \pi) \biggr] \biggr\} \leq 1.
5738: $$
5739: \end{thm}
5740: The proof is deduced from the previous lemma, using the
fact that $\pi$ is partially exchangeable:
5742: \begin{multline*}
5743: \PP \biggl\{ \exp \biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}
5744: \rho \Bigl[ \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) - r_1 \bigr] \Bigr]
5745: - \C{K}(\rho, \pi) \biggr] \biggr\} \\ =
5746: \PP \biggl\{ \pi \Bigl\{ \exp \Bigl[ \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) -
5747: r_1 \bigr] \Bigr] \Bigr\} \biggr\} =
5748: \PP \biggl\{ T \pi \Bigl\{ \exp \Bigl[ \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) -
5749: r_1 \bigr] \Bigr] \Bigr\} \biggr\} \\ =
5750: \PP \biggl\{ \pi \Bigl\{ T \exp \Bigl[ \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) -
5751: r_1 \bigr] \Bigr] \Bigr\} \biggr\} \leq 1.
5752: \end{multline*}
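Let us detail the steps of this chain. The first equality
is the usual variational formula for the Kullback divergence, recalled here
in the notation of this section: for instance for any measurable function
$g : \Theta \rightarrow \RR$ bounded from above,
$$
\sup_{\rho \in \C{M}_+^1(\Theta)} \bigl[ \rho(g) - \C{K}(\rho, \pi) \bigr]
= \log \bigl\{ \pi \bigl[ \exp (g) \bigr] \bigr\},
$$
applied to $g = \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr) - r_1 \bigr]$.
The second equality is the identity $\PP \bigl[ T(h) \bigr] = \PP(h)$.
The third one exchanges $T$ and $\pi$, which is possible because
$\pi \circ \tau_i = \pi$, so that
$\bigl[ \pi (h) \bigr] \circ \tau_i = \pi ( h \circ \tau_i )$
for any bounded measurable function $h : \Omega \times \Theta \rightarrow \RR$.
The final inequality is the previous lemma, applied for each fixed value of $\theta$.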
5753:
5754: Introducing in the same way
5755: \newcommand{\Bm}{\overline{m}}
5756: \begin{align*}
5757: m'(\theta, \theta') & = \frac{1}{N}
5758: \sum_{i=1}^{N} \Bigl\lvert \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr]
5759: - \B{1}\bigl[ f_{\theta'}(X_i) \neq Y_i \bigr] \Bigr\rvert\\
5760: \text{and } \quad \Bm(\theta, \theta') & = \frac{1}{(k+1)N}
5761: \sum_{i=1}^{(k+1)N} \Bigl\lvert \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr]
5762: - \B{1}\bigl[ f_{\theta'}(X_i) \neq Y_i \bigr] \Bigr\rvert,
5763: \end{align*}
5764: we could prove along the same line of reasoning
5765: \begin{thm}\mypoint
5766: For any real parameter $\lambda$, any $\T \in \Theta$, any partially exchangeable
5767: posterior distribution $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,
5768: \begin{multline*}
5769: \PP \biggl\{ \exp \biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}
5770: \lambda \Bigl[ \rho \bigl\{
5771: \Psi_{\frac{\lambda}{N}} \bigl[ \rr(\cdot) - \rr(\T), \Bm(\cdot, \T)\bigr]
5772: \bigr\} \\* -
5773: \bigl[ \rho(r_1) - r_1(\T) \bigr] \Bigr] - \C{K}(\rho, \pi) \biggr] \biggr\}
5774: \leq 1.
5775: \end{multline*}
5776: \end{thm}
5777: \begin{thm}\mypoint
5778: For any real constant $\gamma$, for any $\T \in \Theta$,
5779: for any partially exchangeable posterior distribution $\pi : \Omega
5780: \rightarrow \C{M}_+^1(\Theta)$,
5781: \begin{multline*}
5782: \PP \Biggl\{ \exp \Biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}
5783: \biggl\{ - N \rho \Bigl\{ \log \Bigl[ 1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr) \bigl[ \rr(\cdot) - \rr(\T) \bigr]
5784: \Bigr] \Bigr\} \\
5785: - \gamma
5786: \bigl[\rho(r_1) - r_1(\T) \bigr] -
N \log \bigl[ \cosh \bigl( \tfrac{\gamma}{N} \bigr) \bigr] \rho \bigl[ \Bm( \cdot, \T) \bigr] -
5788: \C{K}(\rho, \pi) \biggr\} \Biggr] \Biggr\} \leq 1.
5789: \end{multline*}
5790: \end{thm}
5791: This last theorem can be generalized to give
5792: \begin{thm}\mypoint
5793: For any real constant $\gamma$, for any partially
5794: exchangeable posterior distributions $\pi^1, \pi^2: \Omega
5795: \rightarrow \C{M}_+^1(\Theta)$,
5796: \begin{multline*}
5797: \PP \Biggl\{ \exp \Biggl[
5798: \sup_{\rho_1, \rho_2 \in \C{M}_+^1(\Theta)}
5799: \biggl\{
5800: - N \log \Bigl\{ 1 - \tanh\bigl( \tfrac{\gamma}{N} \bigr)
5801: \bigl[ \rho_1(\rr) - \rho_2(\rr) \bigr] \Bigr\} \\
5802: - \gamma \bigl[ \rho_1(r_1) - \rho_2(r_1) \bigr]
5803: - N \log \bigl[ \cosh \bigl( \tfrac{\gamma}{N}
5804: \bigr) \bigr]
\rho_1 \otimes \rho_2 (\Bm) \\ - \C{K}(\rho_1, \pi^1) -
5806: \C{K}(\rho_2, \pi^2) \biggr\} \Biggr] \Biggr\} \leq 1.
5807: \end{multline*}
5808: \end{thm}
5809:
5810: To conclude this section, we see that the basic theorems of transductive PAC-Bayesian
5811: classification have exactly the same form as the basic inequalities of inductive
5812: classification, Theorems \ref{thm2.3}, \ref{thm4.1} and \ref{thm2.2.18}
5813: {\em with $R(\theta)$ replaced with $\rr(\theta)$}, $r(\theta)$ replaced
5814: with $r_1(\theta)$ and $M'(\theta, \T)$
5815: replaced with $\Bm(\theta, \T)$.
5816: \label{page97}
5817:
5818: {\em Thus all the results of the first section remain true under the hypotheses
5819: of transductive classification, with $R(\theta)$ replaced with $\rr(\theta)$,
5820: $r(\theta)$ replaced with $r_1(\theta)$
5821: and $M'(\theta, \T\,)$ replaced with $\Bm(\theta, \T)$.}
5822:
5823: {\em Consequently, in the case when the unlabelled shadow sample is observed,
5824: it is possible
5825: to improve on Vapnik's bounds to be discussed hereafter by using
5826: an explicit partially exchangeable posterior distribution $\pi$ and
5827: resorting to localized or to relative bounds (in the case at least of
5828: unlimited computing resources, which of course may still be unrealistic
5829: in many real world situations, and with the caveat, to be recalled in
5830: the conclusion of this article, that for small sample sizes and comparatively
5831: complex classification models, the improvement may not be so decisive).}
5832:
Let us also notice that the transductive setting, when experimentally available,
has the advantage that
5835: \newcommand{\Bd}{\overline{d}}
5836: \begin{multline*}
5837: \Bd(\theta, \theta') = \frac{1}{(k+1)N}
5838: \sum_{i=1}^{(k+1)N} \B{1} \bigl[ f_{\theta'}(X_i) \neq f_{\theta}(X_i) \bigr]
5839: \\ \geq \Bm(\theta, \theta') \geq \rr(\theta) - \rr(\theta'), \qquad
5840: \theta, \theta' \in \Theta,
5841: \end{multline*}
5842: is observable in this context, providing an empirical upper bound for
5843: the difference
5844: $\rr(\wtheta) - \rho(\rr)$ for any non randomized estimator
5845: $\wtheta$ and any posterior distribution $\rho$, namely
5846: $$
5847: \rr(\wtheta) \leq \rho(\rr) + \rho\bigl[\,\Bd( \cdot, \wtheta)\bigr].
5848: $$
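This inequality is obtained by integration. Indeed, since
$\bigl\lvert \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr]
- \B{1} \bigl[ f_{\theta'}(X_i) \neq Y_i \bigr] \bigr\rvert
\leq \B{1} \bigl[ f_{\theta}(X_i) \neq f_{\theta'}(X_i) \bigr]$,
we have for any $\theta \in \Theta$
$$
\rr(\wtheta) - \rr(\theta) \leq \Bm(\wtheta, \theta) \leq \Bd(\theta, \wtheta),
$$
and integrating both sides with respect to $\rho(d \theta)$ gives
$\rr(\wtheta) - \rho(\rr) \leq \rho \bigl[ \, \Bd(\cdot, \wtheta) \bigr]$.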
5849: Thus in the setting of transductive statistical experiments,
5850: the PAC-Bayesian framework provides fully empirical bounds
5851: for the error rate of non randomized estimators $\wtheta :
5852: \Omega \rightarrow \Theta$, even when using a non atomic
5853: prior $\pi$ (or more generally a non atomic partially exchangeable
5854: posterior distribution $\pi$), when $\Theta$
5855: is not a vector space and $\theta \mapsto R(\theta)$
5856: cannot be proved to be convex on the support of some useful
5857: posterior distribution $\rho$.
5858:
5859: \subsection{Vapnik's bounds for transductive classification}
5860: In this section, we are going to stick to plain unlocalized non relative
5861: bounds. As we have already mentioned, (and as it was put forward
5862: by Vapnik himself in his seminal works), these bounds are not always
5863: superseded by the asymptotically better ones, and deserve all our efforts
since in many situations they deal better with small samples.
5865: \subsubsection{With a shadow sample of arbitrary size}
5866: The great thing with the transductive setting is that we are manipulating
5867: only $r_1$ and $\rr$ which can take but a finite number of values
5868: and therefore are piecewise constant on $\Theta$. To make use of this,
5869: let us consider for any value $\theta \in \Theta$ of the parameter
5870: the subset $\Delta(\theta) \subset \Theta$ of parameters $\theta'$ such
5871: that the classification rule $f_{\theta'}$ answers the same on the
5872: extended sample $(X_i)_{i=1}^{(k+1)N}$ as $f_{\theta}$. Namely, let us put
5873: for any $\theta \in \Theta$
5874: $$
5875: \Delta(\theta) = \bigl\{ \theta' \in \Theta ; f_{\theta'}(X_i) = f_{\theta}(X_i),
5876: i = 1, \dots, (k+1)N \bigr\}.
5877: $$
5878: We see immediately that $\Delta(\theta)$ is an exchangeable parameter subset on
5879: which $r_1$ and $r_2$ (and therefore also $\rr$) take a constant value.
5880: Thus for any $\theta \in \Theta$ we may consider the posterior $\rho_{\theta}$
5881: defined by
5882: $$
5883: \frac{d\rho_{\theta}}{d \pi}(\theta') = \B{1} \bigl[ \theta' \in \Delta(\theta) \bigr]\pi
5884: \bigl[ \Delta(\theta) \bigr]^{-1},
5885: $$
5886: and use the fact that $\rho_{\theta}(r_1) = r_1(\theta)$ and $\rho_{\theta}(\rr) = \rr(\theta)$,
5887: to prove that
5888: \begin{lemma}
5889: \mypoint For any partially exchangeable positive real measurable function
$\lambda : \Omega \times \Theta \rightarrow \RR_+$ such that
5891: \begin{equation}
5892: \label{eq2.2.1}
5893: \lambda(\omega, \theta') = \lambda(\omega, \theta), \quad \theta \in \Theta, \theta'
5894: \in \Delta(\theta), \omega \in \Omega,
5895: \end{equation}
5896: and any partially exchangeable posterior distribution
5897: $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,
5898: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,
5899: $$
5900: \Phi_{\frac{\lambda}{N}}\bigl[ \rr(\theta) \bigr] + \frac{\log \bigl\{ \epsilon \pi \bigl[
5901: \Delta(\theta) \bigr] \bigr\}}{\lambda(\theta)} \leq r_1(\theta).
5902: $$
5903: \end{lemma}
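The derivation of this lemma from Theorem \ref{thm1.2} can be sketched as
follows. By Markov's inequality, the moment bound of Theorem \ref{thm1.2}
implies that with $\PP$ probability at least $1 - \epsilon$,
for any $\rho \in \C{M}_+^1(\Theta)$,
$$
\rho \Bigl[ \lambda \bigl[ \Phi_{\frac{\lambda}{N}} ( \rr ) - r_1 \bigr] \Bigr]
\leq \C{K}(\rho, \pi) - \log(\epsilon).
$$
Choosing $\rho = \rho_{\theta}$, we have
$\C{K}(\rho_{\theta}, \pi) = - \log \bigl\{ \pi \bigl[ \Delta(\theta) \bigr] \bigr\}$,
whereas $\lambda$, $\rr$ and $r_1$ are constant on $\Delta(\theta)$, so that
the left-hand side is equal to
$\lambda(\theta) \bigl\{ \Phi_{\frac{\lambda}{N}} \bigl[ \rr(\theta) \bigr]
- r_1(\theta) \bigr\}$. Dividing by $\lambda(\theta)$ gives the stated inequality.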
5904: We can then remark that for any value of $\lambda$ independent of $\omega$,
5905: the left-hand side of the previous inequality is a partially exchangeable function of
5906: $\omega \in \Omega$. Thus this left-hand side is maximized by some
5907: partially exchangeable function $\lambda$, namely $$
5908: \arg\max_{\lambda}
5909: \Phi_{\frac{\lambda}{N}} \bigl[ \rr(\theta) \bigr]
5910: + \frac{\log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda}
5911: $$
5912: is partially exchangeable as depending only on partially exchangeable quantities.
5913: Moreover this choice of $\lambda(\omega, \theta)$ satisfies also condition
5914: \eqref{eq2.2.1}
5915: stated in the previous lemma of being constant on $\Delta(\theta)$,
5916: proving
5917: \begin{lemma}
5918: \mypoint For any partially exchangeable posterior distribution $\pi : \Omega \rightarrow
5919: \C{M}_+^1(\Theta)$, with $\PP$ probability at least $1 - \epsilon$,
5920: for any $\theta \in \Theta$ and any $\lambda \in \RR_+$,
5921: $$
5922: \Phi_{\frac{\lambda}{N}} \bigl[ \rr(\theta) \bigr] + \frac{\log \bigl\{
5923: \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda} \leq r_1(\theta).
5924: $$
5925: \end{lemma}
5926:
5927: Writing $\rr = \frac{r_1 + k r_2}{k+1}$ and rearranging terms we obtain
5928: \begin{thm}
5929: \label{thm2.1.5}
5930: \mypoint For any partially exchangeable posterior
5931: distribution $\pi : \Omega \rightarrow
5932: \C{M}_+^1(\Theta)$, with $\PP$ probability at least $1 - \epsilon$,
5933: for any $\theta \in \Theta$,
5934: $$
5935: r_2(\theta) \leq \frac{k+1}{k} \inf_{\lambda \in \RR_+}
5936: \frac{\ds 1 - \exp \left( - \frac{\lambda}{N} r_1(\theta) + \frac{ \log \bigl\{
5937: \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{N} \right)}{\ds 1
5938: - \exp \bigl( - \tfrac{\lambda}{N}\bigr)} - \frac{r_1(\theta)}{k}.
5939: $$
5940: \end{thm}
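The rearrangement goes as follows, where $a = \lambda / N$ and
$d = - \log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}$ are
shorthand notations introduced only for this remark.
From the definition of $\Phi_a$, the inequality of the previous lemma reads
$$
- \frac{1}{a} \log \Bigl\{ 1 - \rr(\theta) \bigl[ 1 - \exp( - a) \bigr] \Bigr\}
\leq r_1(\theta) + \frac{d}{\lambda},
$$
which is equivalent to
$$
\rr(\theta) \leq \frac{\ds 1 - \exp \Bigl( - a\, r_1(\theta) - \frac{d}{N} \Bigr)}{
\ds 1 - \exp( - a)}.
$$
It remains to substitute
$r_2(\theta) = \frac{k+1}{k} \rr(\theta) - \frac{r_1(\theta)}{k}$.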
5941:
5942: Let us remind the reader that in the case when we have a set of binary
5943: classification rules $\{ f_{\theta}; \theta \in \Theta \}$ whose
5944: VC dimension is not greater than $h$, then we can choose $\pi$ such
5945: that $\pi \bigl[ \Delta(\theta) \bigr]$ is independent of $\theta$
and not less than
5947: $\ds \left(\frac{h}{e(k+1)N}\right)^h$.
5948:
5949: Another important case when the complexity term $- \log \bigl\{
5950: \pi \bigl[ \Delta(\theta) \bigr] \bigr\}$ can easily be controlled
5951: is the setting of {\em compression schemes},
introduced by Littlestone and Warmuth \cite{Little}.
5953: In this case, we are given for each labelled subsample
5954: $(X_i, Y_i)_{i \in J}$, $J \subset \{1, \dots, N\}$,
5955: an estimator of the parameter
5956: $$
5957: \wtheta\bigl[ (X_i, Y_i)_{i \in J} \bigr]
5958: = \wtheta_J, \quad J \subset \{ 1, \dots, N \}, \lvert J \rvert \leq h,
5959: $$
5960: \label{compression} where
5961: $$
5962: \wtheta : \bigsqcup_{k=1}^N \bigl( \C{X} \times \C{Y} \bigr)^k \rightarrow \Theta
5963: $$
is a function providing estimators for
subsamples of arbitrary size.
Let us assume that $\w{\theta}$
is exchangeable, meaning that for any $k = 1, \dots, N$ and
5968: any permutation $\sigma$ of $\{1, \dots, k\}$
5969: $$
5970: \w{\theta} \bigl[ (x_i, y_i)_{i=1}^k \bigr]
5971: = \w{\theta} \bigl[ (x_{\sigma(i)}, y_{\sigma(i)})_{i=1}^k
5972: \bigr], \qquad
5973: (x_i, y_i)_{i=1}^k \in \bigl( \C{X} \times \C{Y} \bigr)^k.
5974: $$
5975: In this situation, we can introduce the exchangeable subset
5976: $$
5977: \Bigl\{ \wtheta_J ; J \subset \{1, \dots, (k+1)N\}, \lvert J
5978: \rvert \leq h \Bigr\} \subset \Theta,
5979: $$
5980: which is seen to contain at most $\ds \sum_{j=0}^h \binom{(k+1)N}{j}
5981: \leq \left( \frac{e(k+1)N}{h} \right)^h$ classification rules
5982: (as will be proved later on in Theorem \ref{th2} on page \pageref{th2}).
5983: Note that we had to extend the range of $J$ to all the subsets
5984: of the extended sample, although we will use for estimation
5985: only those of the training sample, on which the labels
5986: are observed.
5987: Thus in this case also we can find a partially exchangeable posterior
5988: distribution $\pi$ such that $\ds \pi \bigl[ \Delta(\wtheta_J) \bigr]
5989: \geq \left( \frac{h}{e(k+1)N} \right)^h$. We see that the size of
5990: the compression scheme plays the same role in this complexity bound
as the VC dimension for VC classes.
5992:
5993: In these two cases of binary classification with VC dimension
5994: not greater than $h$ and compression schemes depending on a
5995: compression set with at most $h$ points, we get a bound of
5996: \begin{multline*}
5997: r_2(\theta) \leq \frac{k+1}{k} \inf_{\lambda \in \RR_+}
5998: \frac{\ds 1 - \exp \left( - \frac{\lambda}{N} r_1(\theta) - \frac{ h
5999: \log \left( \frac{e(k+1)N}{h} \right) - \log(\epsilon)}{N} \right)}{\ds 1
6000: - \exp \bigl( - \tfrac{\lambda}{N}\bigr)} \\ - \frac{r_1(\theta)}{k}.
6001: \end{multline*}
Let us give a numerical application: when $N = 1000$, $h = 10$, $\epsilon = 0.01$,
6003: and $\inf_{\Theta} r_1 = r_1(\w{\theta}) = 0.2$,
6004: we find that $r_2(\w{\theta}) \leq 0.4093$, for $k$ between
6005: $15$ and $17$, and values of $\lambda$ equal respectively to $965$,
6006: $968$ and $971$. For $k=1$, we find only $r_2(\w{\theta}) \leq 0.539$, showing
6007: the interest of allowing $k$ to be larger than $1$.
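
The first of these values can be checked by hand. The following
computation is only a rounded sanity check of the previous bound for
$k = 16$ and $\lambda = 968$, using natural logarithms; the intermediate
figures are approximations. We have
$$
h \log \Bigl( \frac{e(k+1)N}{h} \Bigr) - \log(\epsilon)
\simeq 10 \times 8.438 + 4.605 \simeq 88.99,
$$
so that the exponent is
$\frac{\lambda}{N} r_1 + \frac{88.99}{N} \simeq 0.1936 + 0.0890 = 0.2826$, and
$$
\frac{17}{16} \times \frac{1 - \exp( - 0.2826)}{1 - \exp( - 0.968)}
- \frac{0.2}{16}
\simeq \frac{17}{16} \times \frac{0.2462}{0.6202} - 0.0125
\simeq 0.4093.
$$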
6008:
6009: \subsubsection{When the shadow sample has the same size as the training sample}
6010: In the case when $k = 1$, we can improve Theorem \ref{thm1.2} by taking advantage
6011: of the fact that $T_i(\sigma_i)$ can take only $3$ values, namely $0$, $0.5$
6012: and $1$. We see thus that $T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}\bigl[
6013: T_i(\sigma_i) \bigr]$ can take only two values, $0$ and $\frac{1}{2} - \Phi_{\frac{
6014: \lambda}{N}}(\frac{1}{2})$, because $\Phi_{\frac{\lambda}{N}}(0) = 0$ and
6015: $\Phi_{\frac{\lambda}{N}}(1) = 1$. Thus
6016: $$
6017: T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}} \bigl[ T_i(\sigma_i) \bigr]
6018: = \bigl[ 1 - \lvert 1 - 2 T_i(\sigma_i) \rvert \bigr] \bigl[
6019: \tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}(\tfrac{1}{2}) \bigr].
6020: $$
6021: This shows that in the case when $k=1$,
6022: \begin{multline*}
6023: \log \Bigl\{ T \bigl[ \exp ( - \lambda r_1) \bigr] \Bigr\}
6024: = - \lambda \rr
+ \frac{\lambda}{N} \sum_{i=1}^N \Bigl\{ T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}
\bigl[ T_i(\sigma_i) \bigr] \Bigr\}\\
6027: = - \lambda \rr + \frac{\lambda}{N} \sum_{i=1}^N \bigl[ 1 - \lvert 1 - 2 T_i(\sigma_i) \rvert
6028: \bigr] \bigl[ \tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}(\tfrac{1}{2}) \bigr]
6029: \\ \leq - \lambda \rr + \lambda \bigl[ \tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}(\tfrac{1}{2}) \bigr] \bigl[ 1 - \lvert 1 - 2 \rr \rvert \bigr].
6030: \end{multline*}
6031: Noticing that $\frac{1}{2} - \Phi_{\frac{\lambda}{N}}(\frac{1}{2}) =
6032: \frac{N}{\lambda} \log \bigl[ \cosh(\frac{\lambda}{2N}) \bigr]$,
6033: we obtain
6034: \begin{thm}
6035: \mypoint For any partially exchangeable function $\lambda : \Omega \times \Theta
6036: \rightarrow \RR_+$, for any partially exchangeable posterior distribution
6037: $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,
6038: \begin{multline*}
6039: \PP \biggl\{ \exp \biggl[
6040: \sup_{\rho \in \C{M}_+^1(\Theta)}
6041: \rho \Bigl[ \lambda ( \rr - r_1) \\ -
6042: N \log \bigl[ \cosh(\tfrac{\lambda}{2N}) \bigr]
6043: \bigl( 1 - \lvert 1 - 2 \rr \rvert \bigr) \Bigr] - \C{K}(\rho, \pi) \biggr]
6044: \biggr\} \leq 1.
6045: \end{multline*}
6046: \end{thm}
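The identity $\frac{1}{2} - \Phi_{\frac{\lambda}{N}}(\frac{1}{2}) =
\frac{N}{\lambda} \log \bigl[ \cosh(\frac{\lambda}{2N}) \bigr]$
used to state this theorem can be checked directly from the definition of
$\Phi_a$. Putting $a = \lambda/N$ (a shorthand used only here),
\begin{align*}
\Phi_{a}(\tfrac{1}{2}) & = - a^{-1} \log \bigl\{ \tfrac{1}{2} \bigl[ 1 + \exp( - a) \bigr] \bigr\}
= - a^{-1} \log \bigl\{ \exp( - \tfrac{a}{2} ) \cosh ( \tfrac{a}{2} ) \bigr\} \\
& = \tfrac{1}{2} - a^{-1} \log \bigl[ \cosh ( \tfrac{a}{2} ) \bigr].
\end{align*}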
As a consequence, reasoning as previously, we deduce
6048: \begin{thm}
6049: \label{thm2.2.5}
6050: \mypoint In the case when $k=1$,
6051: for any partially exchangeable posterior distribution $\pi: \Omega
6052: \rightarrow \C{M}_+^1(\Theta)$, with $\PP$ probability at least
6053: $1 - \epsilon$, for any $\theta \in \Theta$ and any
6054: $\lambda \in \RR_+$,
6055: $$
6056: \rr(\theta) - \tfrac{N}{\lambda} \log \bigl[
6057: \cosh(\tfrac{\lambda}{2N}) \bigr] \bigl( 1 - \lvert 1
6058: - 2 \rr(\theta) \rvert \bigr) + \frac{ \log \bigl\{ \epsilon
6059: \pi\bigl[\Delta(\theta)\bigr] \bigr\}}{\lambda} \leq r_1(\theta);
6060: $$
6061: and consequently for any $\theta \in \Theta$,
6062: $$
6063: r_2(\theta) \leq 2 \inf_{\lambda \in \RR_+} \frac{\ds r_1(\theta) - \frac{\log \bigl\{
6064: \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda}}{
6065: 1 - \frac{2N}{\lambda} \log \bigl[ \cosh(\frac{\lambda}{2N})
6066: \bigr]} - r_1(\theta).
6067: $$
6068: \end{thm}
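The second inequality of this theorem is deduced from the first one
in the following way (a sketch of the omitted step).
Since $1 - \lvert 1 - 2 \rr(\theta) \rvert \leq 2 \rr(\theta)$
and $\log \bigl[ \cosh(\tfrac{\lambda}{2N}) \bigr] \geq 0$, the first inequality implies
$$
\rr(\theta) \Bigl\{ 1 - \tfrac{2N}{\lambda} \log \bigl[ \cosh(\tfrac{\lambda}{2N}) \bigr] \Bigr\}
\leq r_1(\theta) - \frac{\log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda}.
$$
The factor between braces is positive, because $\log \bigl[ \cosh(x) \bigr] < x$
for any $x > 0$, so that we can divide by it, and conclude using the fact that
$r_2(\theta) = 2 \rr(\theta) - r_1(\theta)$ when $k = 1$.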
6069:
6070: In the case of binary classification using a VC class
6071: of VC dimension not greater than $h$, we can choose $\pi$ such that
6072: $- \log \bigl\{ \pi \bigl[ \Delta(\theta) \bigr] \bigr\}
6073: \leq h \log ( \frac{2eN}{h})$ and obtain the following
numerical illustration of this theorem: for $N = 1000$, $h = 10$,
6075: $\epsilon = 0.01$ and $\inf_{\Theta} r_1 = r_1(\w{\theta}) = 0.2$,
6076: we find an upper bound $r_2(\w{\theta})
6077: \leq 0.5033$, which improves on Theorem \ref{thm2.1.5} but still
6078: is not under the significance level $\frac{1}{2}$ (achieved by
6079: blind random classification). This indicates that considering
6080: shadow samples of arbitrary sizes brings in some noisy situations
6081: a significant improvement on bounds obtained with a shadow sample
6082: of the same size as the training sample.
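
This value can be checked by hand. The following is a rounded sanity
check only, with natural logarithms, and the value $\lambda = 920$ is an
assumption made for this check (it is close to the optimum but is not
taken from the text). We have
$h \log ( \frac{2eN}{h} ) - \log(\epsilon) \simeq 62.98 + 4.61 = 67.59$
and $\frac{2N}{\lambda} \log \bigl[ \cosh(\frac{\lambda}{2N}) \bigr]
\simeq \frac{0.1023}{0.46} \simeq 0.2223$, so that
$$
2 \times \frac{0.2 + \frac{67.59}{920}}{1 - 0.2223} - 0.2
\simeq 2 \times \frac{0.2735}{0.7777} - 0.2 \simeq 0.5033.
$$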
6083:
6084: \subsubsection{When moreover the distribution of the augmented sample
is exchangeable} In the case when $k=1$ and $\PP$ is exchangeable, meaning that for
any bounded measurable function $h : \Omega \rightarrow \RR$
and any permutation $s \in \mathfrak{S} \bigl(
\{1, \dots, 2N \} \bigr)$, $\PP \bigl[ h( \omega \circ s ) \bigr]
= \PP \bigl[ h(\omega) \bigr]$, we can improve the bound further
as follows. Let
6091: $$
6092: T' (h) = \frac{1}{N!} \sum_{s \in \mathfrak{S}
6093: \bigl( \{ N+1, \dots, 2N \} \bigr)} h(\omega \circ s).
6094: $$
6095: Then we can write
6096: $$
6097: 1 - \lvert 1 - 2 T_i(\sigma_i) \rvert = (\sigma_i - \sigma_{i+N})^2
6098: = \sigma_i + \sigma_{i+N} - 2 \sigma_i \sigma_{i+N}.
6099: $$
6100: Using this identity, we get for any exchangeable function
6101: $\lambda : \Omega \times \Theta \rightarrow \RR_+$,
6102: $$
6103: T \biggl\{ \exp \biggl[ \lambda (\rr - r_1) - \log \bigl[ \cosh(\tfrac{\lambda}{2N}
6104: ) \bigr] \sum_{i=1}^N \bigl( \sigma_i + \sigma_{i+N} - 2 \sigma_i \sigma_{i+N}
\bigr) \biggr] \biggr\} \leq 1.
6106: $$
6107: Let us put
6108: \label{page39}
6109: \begin{align}
6110: \label{eq2.2}
6111: A(\lambda) & = \tfrac{2N}{\lambda} \log \bigl[ \cosh(\tfrac{\lambda}{2N}
6112: ) \bigr],\\
6113: v(\theta) & = \frac{1}{2N} \sum_{i=1}^N (\sigma_i + \sigma_{i+N}
6114: - 2 \sigma_i \sigma_{i+N}).
6115: \end{align}
6116: With these notations
6117: $$
6118: T \Bigl\{ \exp \bigl\{ \lambda \bigl[ \rr - r_1 - A(\lambda) v \bigr] \bigr\}
6119: \Bigr\} \leq 1.
6120: $$
Let us notice now that
6122: $$
6123: T'\bigl[ v(\theta) \bigr] = \rr(\theta) - r_1(\theta) r_2(\theta).
6124: $$
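This identity can be checked as follows. The operator $T'$ averages over
all the permutations of the shadow sample, leaving the observed
sample unchanged, so that for any $i = 1, \dots, N$,
$$
T'(\sigma_i) = \sigma_i, \qquad
T'(\sigma_{i+N}) = \frac{1}{N} \sum_{j=1}^N \sigma_{j+N} = r_2, \qquad
T'(\sigma_i \sigma_{i+N}) = \sigma_i r_2.
$$
Consequently, since $\rr = \frac{r_1 + r_2}{2}$ when $k=1$,
$$
T'\bigl[ v(\theta) \bigr] = \frac{1}{2N} \sum_{i=1}^N \bigl( \sigma_i + r_2 - 2 \sigma_i r_2 \bigr)
= \frac{r_1 + r_2}{2} - r_1 r_2 = \rr - r_1 r_2.
$$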
6125: Let $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$ be any given
exchangeable posterior distribution. Using the exchangeability
of $\PP$ and $\pi$ and the convexity of the exponential
function, we get
6129: \begin{align*}
6130: \PP & \Bigl\{ \pi \Bigl[ \exp \bigl\{ \lambda \bigl[
6131: \rr - r_1 - A(\rr - r_1 r_2) \bigr] \bigr\} \Bigr] \Bigr\}
6132: = \PP \Bigl\{ \pi \Bigl[ \exp \bigl\{ \lambda \bigl[
6133: \rr - r_1 - AT'(v) \bigr] \bigr\} \Bigr] \Bigr\}
6134: \\ & \leq
6135: \PP \Bigl\{ \pi \Bigl[ T' \exp \bigl\{ \lambda \bigl[
6136: \rr - r_1 - Av \bigr] \bigr\} \Bigr] \Bigr\}
6137: =
6138: \PP \Bigl\{ T' \pi \Bigl[ \exp \bigl\{ \lambda \bigl[
6139: \rr - r_1 - Av \bigr] \bigr\} \Bigr] \Bigr\}
6140: \\ & =
6141: \PP \Bigl\{ \pi \Bigl[ \exp \bigl\{ \lambda \bigl[
6142: \rr - r_1 - Av \bigr] \bigr\} \Bigr] \Bigr\}
6143: =
6144: \PP \Bigl\{ T \pi \Bigl[ \exp \bigl\{ \lambda \bigl[
6145: \rr - r_1 - Av \bigr] \bigr\} \Bigr] \Bigr\}
6146: \\ & =
6147: \PP \Bigl\{ \pi \Bigl[ T \exp \bigl\{ \lambda \bigl[
6148: \rr - r_1 - Av \bigr] \bigr\} \Bigr] \Bigr\}
6149: \leq 1.
6150: \end{align*}
6151: We are thus ready to state
6152: \begin{thm}
6153: \label{thm3.3.8}
6154: \mypoint
6155: In the case when $k = 1$, for any exchangeable probability distribution $\PP$,
6156: for any exchangeable posterior distribution $\pi : \Omega \rightarrow
6157: \C{M}_+^1(\Theta)$, for any exchangeable function
6158: $\lambda : \Omega \times \Theta \rightarrow \RR_+$,
6159: $$
6160: \PP \biggl\{ \exp \biggl[ \sup_{\rho \in \C{M}_+^1(\Theta)}
6161: \rho \Bigl\{ \lambda \bigl[ \rr - r_1 - A(\lambda)(\rr - r_1 r_2)\bigr] \Bigr\}
6162: - \C{K}(\rho, \pi) \biggr] \biggr\} \leq 1,
6163: $$
6164: where $A(\lambda)$ is defined by equation \eqref{eq2.2} above.
6165: \end{thm}
6166: We then deduce as previously
6167: \begin{cor}
6168: \label{thm2.2.6}
6169: \mypoint For any exchangeable posterior distribution $\pi :
6170: \Omega \rightarrow \C{M}_+^1(\Theta)$, for any
6171: exchangeable probability measure $\PP \in \C{M}_+^1(\Omega)$,
6172: for any measurable exchangeable function $\lambda: \Omega \times \Theta
6173: \rightarrow \RR_+$,
6174: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,
6175: $$
6176: \rr(\theta) \leq r_1(\theta) + A(\lambda) \bigl[ \rr(\theta) - r_1( \theta)
6177: r_2(\theta) \bigr] - \frac{ \log \bigl\{ \epsilon \pi\bigl[
6178: \Delta(\theta) \bigr] \bigr\}}{\lambda},
6179: $$
6180: where $A(\lambda)$ is defined by equation \eqref{eq2.2}
6181: on page \pageref{eq2.2}.
6182: \end{cor}
In order to deduce an empirical bound from this corollary, we have
6184: to make some choice for $\lambda(\omega, \theta)$.
6185: Fortunately, it is easy to show that the bound indeed holds uniformly
6186: in $\lambda$. This is the case because the inequality can
6187: be rewritten as a function of only one non exchangeable quantity,
6188: namely $r_1(\theta)$. Indeed, since
6189: $r_2 = 2 \rr - r_1$, we see that the
6190: inequality can be written as
6191: $$
6192: \rr(\theta) \leq r_1(\theta) + A(\lambda) \bigl[
6193: \rr(\theta) - 2 \rr(\theta) r_1(\theta) + r_1(\theta)^2 \bigr]
- \frac{\log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta)\bigr] \bigr\}}{\lambda}.
6195: $$
6196: It can be solved in $r_1(\theta)$, to get
6197: $$
6198: r_1(\theta) \geq f \Bigl(\lambda, \rr(\theta), -\log \bigl\{ \epsilon
6199: \pi\bigl[ \Delta(\theta) \bigr] \bigr\} \Bigr),
6200: $$
where
6202: \begin{multline*}
6203: f(\lambda, \rr, d) = \bigl[2 A(\lambda)\bigr]^{-1}
6204: \biggl\{ 2 \rr A(\lambda) - 1 \\ + \sqrt{\bigl[1 - 2 \rr A(\lambda)\bigr]^2
6205: + 4 A(\lambda) \Bigl\{ \rr\bigl[ 1 - A(\lambda) \bigr] - \tfrac{d}{\lambda}
6206: \Bigr\}} \biggr\}.
6207: \end{multline*}
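For the reader's convenience, here is a sketch of this elementary computation,
in which we write $A$ for $A(\lambda)$, assume that $A > 0$, and put
$d = - \log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}$.
Gathering the terms according to the powers of $r_1(\theta)$, the inequality reads
$$
A \, r_1(\theta)^2 + \bigl[ 1 - 2 \rr(\theta) A \bigr] r_1(\theta)
- \Bigl\{ \rr(\theta) \bigl[ 1 - A \bigr] - \tfrac{d}{\lambda} \Bigr\} \geq 0.
$$
The left-hand side is a second degree polynomial in $r_1(\theta)$ with a positive
leading coefficient, and $f(\lambda, \rr, d)$ is its largest root. In the case when
$\rr(\theta) \bigl[ 1 - A \bigr] > \tfrac{d}{\lambda}$, the product of the two roots
is negative, so that the smallest root is negative, and since $r_1(\theta) \geq 0$,
the inequality is satisfied if and only if $r_1(\theta) \geq f(\lambda, \rr, d)$.
The complementary case is left aside in this sketch.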
6208: Thus we can find some exchangeable function $\lambda(\omega, \theta)$,
6209: such that
6210: $$
6211: f\Bigl( \lambda(\omega, \theta), \rr(\theta), -
6212: \log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\} \Bigr)
6213: = \sup_{\beta \in \RR_+} f \Bigl( \beta, \rr(\theta), - \log\bigl\{
6214: \epsilon \pi \bigl[ \Delta(\theta) \bigr]\bigr\} \Bigr).
6215: $$
6216: Applying Corollary \ref{thm2.2.6} to that choice of $\lambda$, we
6217: see that
6218: \begin{thm}
6219: \mypoint For any exchangeable probability measure
6220: $\PP \in \C{M}_+^1(\Omega)$, for any exchangeable posterior
6221: probability distribution $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,
6222: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,
6223: for any $\lambda \in \RR_+$,
6224: $$
6225: \rr(\theta) \leq r_1(\theta) + A(\lambda) \bigl[
6226: \rr(\theta) - r_1(\theta) r_2(\theta) \bigr] - \frac{
6227: \log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda},
6228: $$
6229: where $A(\lambda)$ is defined by equation \eqref{eq2.2} on
6230: page \pageref{eq2.2}.
6231: \end{thm}
6232: Solving the previous inequality in $r_2(\theta)$, we get
6233: \begin{cor}
6234: \mypoint Under the same assumptions as in the
6235: previous theorem, with
6236: $\PP$ probability at least $1 - \epsilon$, for any
6237: $\theta \in \Theta$,
6238: $$
6239: r_2(\theta) \leq \inf_{\lambda \in \RR_+}
6240: \frac{\ds r_1(\theta) \Bigl\{ 1 + \tfrac{2N}{\lambda}\log \bigl[
6241: \cosh(\tfrac{\lambda}{2N})\bigr] \Bigr\} - \frac{ 2 \log \bigl\{ \epsilon \pi
6242: \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda}}{\ds 1 - \tfrac{2N}{\lambda}
6243: \log \bigl[ \cosh(\tfrac{\lambda}{2N})\bigr] \bigl[
6244: 1 - 2 r_1(\theta) \bigr]}.
6245: $$
6246: \end{cor}
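This can be checked as follows (a sketch, using only the identity
$\rr = \frac{r_1 + r_2}{2}$ already used above, and reading
$A(\lambda) = \tfrac{2N}{\lambda} \log \bigl[ \cosh(\tfrac{\lambda}{2N}) \bigr]$
from the statement of the corollary, which we take here to be the content of
equation \eqref{eq2.2}). After substitution, the inequality of the previous theorem becomes
$$
\frac{r_1(\theta) + r_2(\theta)}{2} \leq r_1(\theta) + A(\lambda) \biggl[
\frac{r_1(\theta) + r_2(\theta)}{2} - r_1(\theta) r_2(\theta) \biggr]
- \frac{\log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda}.
$$
Multiplying by $2$ and gathering the terms in $r_2(\theta)$ on the left-hand side, we get
$$
r_2(\theta) \Bigl\{ 1 - A(\lambda) \bigl[ 1 - 2 r_1(\theta) \bigr] \Bigr\}
\leq r_1(\theta) \bigl[ 1 + A(\lambda) \bigr]
- \frac{2 \log \bigl\{ \epsilon \pi \bigl[ \Delta(\theta) \bigr] \bigr\}}{\lambda}.
$$
Since $\log \bigl[ \cosh(x) \bigr] < x$ for any $x > 0$, we have $A(\lambda) < 1$,
so that the factor of $r_2(\theta)$ is positive, and dividing by this factor gives
the bound stated in the corollary.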
6247: Applying this to our usual numerical example of a binary classification
6248: model with VC dimension not greater than $h = 10$, when $N=1000$, $
\inf_{\Theta} r_1 = r_1(\w{\theta}) = 0.2$ and
6250: $\epsilon = 0.01$, we obtain that $r_2(\w{\theta}) \leq 0.4450$.
6251:
6252: \subsection{Vapnik's bounds for inductive classification}
6253: \subsubsection{Arbitrary shadow sample size}
6254: \newcommand{\F}[1]{\mathfrak{#1}}
6255: We assume in this section that
6256: $$
6257: \PP = \biggl( \bigotimes_{i=1}^N P_i
6258: \biggr)^{\otimes \, \infty} \in \C{M}_+^1 \Bigl\{ \bigl[
6259: \bigl( \C{X} \times \C{Y} \bigr)^N \bigr]^{\NN} \Bigr\},
6260: $$
6261: where
6262: $P_i \in \C{M}_+^1\bigl( \C{X} \times \C{Y} \bigr)$:
we consider an infinite i.i.d. sequence of samples of size $N$,
each made of independent but {\em not} necessarily identically distributed
observations, only the first sample being observed. The other samples,
the shadow samples, will only appear
in the proofs. The aim of this section is to prove improved Vapnik
bounds, generalizing them at the same time to the independent
non i.i.d. setting, which to our knowledge had not been done before.
6269:
6270: Let us introduce the notation $\PP'\bigl[h(\omega) \bigr] =
6271: \PP \bigl[ h(\omega) \,\lvert\, (X_i,Y_i)_{i=1}^N \bigr]$,
6272: where $h$ may be any suitable (e.g. bounded)
random variable. Let us also put
6274: $\Omega = \bigl[(\C{X} \times \C{Y})^N \bigr]^{\NN}$.
6275: \begin{dfn}
6276: \mypoint For any subset $A \subset \NN$ of
6277: integers, let $\F{C}(A)$ be the set of circular permutations of the
6278: totally ordered set $A$, extended to a permutation of $\NN$ by
6279: taking it to be the identity on the complement $\NN \setminus A$
6280: of $A$.
6281: We will say that a random function $h : \Omega \rightarrow \RR$ is $k$-partially
6282: exchangeable if
6283: $$
6284: h( \omega \circ s ) = h( \omega ), \quad s \in \F{C}\bigl(
6285: \{i + j N\,;\,j=0, \dots, k \} \bigr), i=1, \dots, N.
6286: $$
6287: In the same way, we will say that a posterior distribution
6288: $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$ is $k$-partially
6289: exchangeable if
6290: $$
6291: \pi( \omega \circ s ) = \pi ( \omega ) \in \C{M}_+^1(\Theta), \quad s \in \F{C}\bigl(
6292: \{i + j N\,;\,j=0, \dots, k \} \bigr), i=1, \dots, N.
6293: $$
6294: \end{dfn}
6295: Note that $\PP$ itself is $k$-partially exchangeable for any $k$ in the
6296: sense that for any bounded measurable function $h : \Omega \rightarrow \RR$
6297: $$
6298: \PP \bigl[ h( \omega \circ s ) \bigr] = \PP \bigl[ h( \omega ) \bigr] , \quad s \in \F{C}\bigl(
6299: \{i + j N\,;\,j=0, \dots, k \} \bigr), i=1, \dots, N.
6300: $$
6301: Let $\ds
6302: \Delta_k(\theta) = \Bigl\{ \theta' \in \Theta \,;\,
6303: \bigl[ f_{\theta'}(X_i) \bigr]_{i=1}^{(k+1)N} =
6304: \bigl[ f_{\theta}(X_i) \bigr]_{i=1}^{(k+1)N} \Bigr\},$ $\theta \in \Theta,
6305: k \in \NN^*$,
6306: and let also $\ds \rr_k(\theta) = \frac{1}{(k+1)N} \sum_{i=1}^{(k+1) N}
6307: \B{1} \bigl[ f_{\theta}(X_i) \neq Y_i \bigr]$.
6308: Theorem \ref{thm1.2} shows that for any positive real parameter
6309: $\lambda$
6310: and any $k$-partially exchangeable posterior distribution $\pi_k : \Omega
6311: \rightarrow \C{M}_+^1(\Theta)$,
6312: $$
6313: \PP \biggl\{ \exp \biggl[ \sup_{\theta \in \Theta}
6314: \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 \bigr]
6315: + \log \bigl\{ \epsilon \pi_k \bigl[ \Delta_k (\theta) \bigr] \bigr\} \biggr] \biggr\}
6316: \leq \epsilon.
6317: $$
6318: Using the general fact that
6319: $$
6320: \PP \bigl[ \exp( h ) \bigr] =
6321: \PP \Bigl\{ \PP' \bigl[ \exp( h) \bigr] \Bigr\} \geq \PP \Bigl\{
6322: \exp \bigl[ \PP' (h) \bigr] \Bigr\},
6323: $$
6324: and the fact that the expectation of a supremum is larger than the
6325: supremum of an expectation, we see that with $\PP$ probability
at least $1 - \epsilon$, for any $\theta \in \Theta$,
6327: $$
6328: \PP'\Bigl\{ \Phi_{\frac{\lambda}{N}} \bigl[ \rr_k(\theta) \bigr]
6329: \Bigr\} \leq r_1(\theta) - \frac{
6330: \PP' \Bigl\{ \log \bigl\{ \epsilon \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\}
6331: \Bigr\}}{\lambda}.
6332: $$
6333: Let us put for short
6334: \newcommand{\dd}{\Bar{d}}
6335: \begin{align*}
6336: \dd_k(\theta) & = - \log \bigl\{ \epsilon \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\},\\
6337: d'_k(\theta) & = - \PP' \Bigl\{ \log \bigl\{ \epsilon \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\}
6338: \Bigr\},\\
6339: d_k(\theta) & = - \PP \Bigl\{ \log \bigl\{ \epsilon \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\}
6340: \Bigr\}.
6341: \end{align*}
6342: We can use the convexity of $\Phi_{\frac{\lambda}{N}}$ and the fact
6343: that $\PP'(\rr_k) = \frac{r_1 + k R}{k+1}$, to see that
6344: $$
6345: \PP' \Bigl\{ \Phi_{\frac{\lambda}{N}} \bigl[ \rr_k(\theta) \bigr]
6346: \Bigr\} \geq \Phi_{\frac{\lambda}{N}}
6347: \left[ \frac{r_1(\theta) + k R(\theta)}{k+1} \right].
6348: $$
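The value of $\PP'(\rr_k)$ used here can be understood as follows (a sketch,
assuming that $R(\theta)$ stands for the expected error rate
$\frac{1}{N} \sum_{i=1}^N P_i \bigl[ f_{\theta}(X) \neq Y \bigr]$, whose definition
lies outside this section). Splitting the sum defining $\rr_k$ into $k+1$ blocks of
size $N$, we can write
$$
\rr_k(\theta) = \frac{1}{k+1} \biggl\{ r_1(\theta) + \sum_{j=1}^k
\frac{1}{N} \sum_{i=1}^N \B{1} \bigl[ f_{\theta}(X_{i+jN}) \neq Y_{i + jN} \bigr] \biggr\}.
$$
The first block is the empirical error rate $r_1(\theta)$ of the observed sample,
which is constant under $\PP'$, whereas each of the $k$ shadow blocks is independent of
the observed sample and has an expectation equal to $R(\theta)$.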
6349: We have proved
6350: \begin{thm}
6351: \mypoint Using the above hypotheses and notations,
6352: for any sequence
6353: $\pi_k : \Omega \rightarrow \C{M}_+^1(\Theta)$, where $\pi_k$
6354: is a $k$-partially exchangeable posterior distribution,
6355: for any positive real constant $\lambda$, any positive integer $k$,
6356: with $\PP$ probability
6357: at least $1 - \epsilon$, for any $\theta \in \Theta$,
6358: $$
6359: \Phi_{\frac{\lambda}{N}} \left[
6360: \frac{ r_1(\theta) + k R(\theta)}{k+1} \right]
6361: \leq r_1(\theta) + \frac{d'_k(\theta)}{\lambda}.
6362: $$
6363: \end{thm}
As we did with Theorem \ref{thm2.7} on page \pageref{thm2.7}, we can make the
result of this theorem uniform in $\lambda \in \{ \alpha^j\,;\,
j \in \NN^* \}$ and $k \in \NN^*$ (considering
on $k$ the prior $\frac{1}{k(k+1)}$ and on $j$ the prior
$\frac{1}{j(j+1)}$), and obtain
6370:
6371: \begin{thm}
6372: \mypoint For any real parameter
6373: $\alpha > 1$, with $\PP$ probability at least $1 - \epsilon$,
6374: for any $\theta \in \Theta$,
6375: \begin{multline*}
6376: R(\theta) \leq \\* \inf_{k \in \NN^*, j \in \NN^*}
6377: \frac{1 - \exp \biggl\{ - \frac{\alpha^j}{N} r_1(\theta) - \frac{1}{N}
6378: \Bigl\{ d'_k(\theta) + \log \bigl[ k (k+1) j (j+1)\bigr]
6379: \Bigr\} \biggr\}}{\frac{k}{k+1} \left[ 1 -
6380: \exp \left( - \frac{\alpha^j}{N}\right) \right] } \\* - \frac{r_1(\theta)}{k}.
6381: \end{multline*}
6382: \end{thm}
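Here is a sketch of the computation, assuming, as the formulas of this section
indicate, that $\Phi_a(p) = - a^{-1} \log \bigl\{ 1 - p \bigl[ 1 - \exp(-a) \bigr] \bigr\}$,
so that
$$
\Phi_a^{-1}(q) = \frac{1 - \exp(- a q)}{1 - \exp( - a)}.
$$
Since $\sum_{k \in \NN^*} \frac{1}{k(k+1)} = \sum_{j \in \NN^*} \frac{1}{j(j+1)} = 1$,
we can apply the previous theorem to $\lambda = \alpha^j$ at confidence level
$\frac{\epsilon}{k(k+1)j(j+1)}$ and take a union bound, to see that with $\PP$
probability at least $1 - \epsilon$, for any $k \in \NN^*$, any $j \in \NN^*$ and
any $\theta \in \Theta$,
$$
\Phi_{\frac{\alpha^j}{N}} \left[ \frac{r_1(\theta) + k R(\theta)}{k+1} \right]
\leq r_1(\theta) + \frac{d'_k(\theta) + \log \bigl[ k(k+1)j(j+1) \bigr]}{\alpha^j}.
$$
Applying the increasing function $\Phi_{\frac{\alpha^j}{N}}^{-1}$ to both sides gives
$$
\frac{r_1(\theta) + k R(\theta)}{k+1} \leq
\frac{1 - \exp \Bigl\{ - \frac{\alpha^j}{N} r_1(\theta) - \frac{1}{N}
\bigl\{ d'_k(\theta) + \log \bigl[ k(k+1)j(j+1) \bigr] \bigr\} \Bigr\}}{
1 - \exp \bigl( - \frac{\alpha^j}{N} \bigr)},
$$
and the theorem follows by solving this inequality in $R(\theta)$.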
6383: Note that as a special case we can choose $\pi_k$ such that $
6384: \log \bigl\{ \pi_k\bigl[ \Delta_k(\theta) \bigr] \bigr\}$ is independent of
$\theta$ and equal to $- \log (\F{N}_k)$, where $\F{N}_k = \bigl\lvert \bigl\{
6386: \bigl[ f_{\theta}(X_i) \bigr]_{i=1}^{(k+1)N} \,;\,
6387: \theta \in \Theta \bigr\} \bigr\rvert$ is the size of the trace of the
6388: classification model on the extended sample
6389: of size $(k+1)N$.
6390: With this choice, we obtain a bound involving a new flavour
6391: of conditional Vapnik's entropy, namely
6392: $$
6393: d'_k(\theta) = \PP \bigl[ \log (\F{N}_k) \,\lvert (Z_i)_{i=1}^N \bigr] - \log(\epsilon).
6394: $$
6395:
6396: In the case of binary classification using a VC class of VC dimension not
6397: greater than $h = 10$, when $N = 1000$, $\inf_{\Theta}
6398: r_1 = r_1(\w{\theta}) = 0.2$ and $\epsilon = 0.01$,
6399: choosing $\alpha = 1.1$, we obtain $R(\w{\theta}) \leq 0.4271$
6400: (for an optimal value of $\lambda = 1071.8$, and an optimal
6401: value of $k = 16$).
6402:
\subsubsection{A better minimization with respect to the exponential parameter}
If we are not satisfied with optimizing $\lambda$ on a discrete
6404: subset of the real line, we can use a slightly different approach.
6405: From Theorem \ref{thm1.2}, we see that for any positive integer
6406: $k$, for any $k$-partially exchangeable
6407: positive real measurable function $\lambda : \Omega \times \Theta
6408: \rightarrow \RR_+$ satisfying equation \eqref{eq2.2.1} on
6409: page \pageref{eq2.2.1} (with $\Delta(\theta)$ replaced
6410: with $\Delta_k(\theta)$),
for any $\epsilon \in (0,1)$ and $\eta \in (0,1)$,
6412: $$
6413: \PP \biggl\{ \PP' \biggl[ \exp \Bigl[ \sup_{\theta}
6414: \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 \bigr] +
6415: \log \bigl\{ \epsilon \eta \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\}
\Bigr] \biggr] \biggr\}
6417: \leq \epsilon \eta,
6418: $$
6419: therefore with $\PP$ probability at least $1 - \epsilon$,
6420: $$
6421: \PP' \biggl\{ \exp \Bigl[ \sup_{\theta}
6422: \lambda \bigl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 \bigr] +
6423: \log \bigl\{ \epsilon \eta \pi_k \bigl[ \Delta_k(\theta) \bigr] \bigr\}
6424: \Bigr]
6425: \biggr\}
6426: \leq \eta,
6427: $$
6428: and consequently, with $\PP$ probability at least $1 - \epsilon$,
6429: with $\PP'$ probability at least $1 - \eta$, for any $\theta \in \Theta$,
6430: $$
6431: \Phi_{\frac{\lambda}{N}}(\rr_k) +
6432: \frac{\log \bigl\{ \epsilon \eta \pi_{k} \bigl[ \Delta_k(\theta)
6433: \bigr] \bigr\}}{\lambda}
6434: \leq r_1.
6435: $$
Now we are entitled to choose
$$
6437: \lambda(\omega, \theta)
6438: \in \arg \max_{\lambda' \in \RR_+} \Phi_{\frac{\lambda'}{N}}(\rr_k)
6439: + \frac{\log \bigl\{ \epsilon \eta \pi_{k} \bigl[ \Delta_k(\theta)
6440: \bigr] \bigr\}}{\lambda'}.
6441: $$
6442: This shows that with $\PP$ probability
6443: at least $1 - \epsilon$, with $\PP'$ probability at least $1 - \eta$,
6444: for any $\theta \in \Theta$,
6445: $$
6446: \sup_{\lambda \in \RR_+} \Phi_{\frac{\lambda}{N}}(\rr_k) -
6447: \frac{\dd_k(\theta) - \log(\eta)}{\lambda}
6448: \leq r_1,
6449: $$
6450: which can also be written
6451: $$
6452: \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 - \frac{
6453: \dd_k(\theta)}{\lambda} \leq - \frac{\log(\eta)}{\lambda}, \quad \lambda \in \RR_+.
6454: $$
6455: Thus with $\PP$ probability at least $1 - \epsilon$,
6456: for any $\theta \in \Theta$, any $\lambda \in \RR_+$,
6457: $$
6458: \PP'\biggl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 -
6459: \frac{\dd_k(\theta)}{\lambda} \biggr] \leq - \frac{
6460: \log(\eta)}{\lambda} + \biggl[1 - r_1 + \frac{\log(\eta)}{\lambda}
6461: \biggr] \eta.
6462: $$
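To see this (a sketch, assuming the usual expression
$\Phi_a(p) = - a^{-1} \log \bigl\{ 1 - p \bigl[ 1 - \exp(-a) \bigr] \bigr\}$,
for which $\Phi_a(p) \leq p$ by concavity of the logarithm), let us put
$\xi = \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 - \frac{\dd_k(\theta)}{\lambda}$.
We have just seen that $\xi \leq - \frac{\log(\eta)}{\lambda}$ with $\PP'$
probability at least $1 - \eta$. Moreover $\xi \leq 1 - r_1$ in any case, since
$\Phi_{\frac{\lambda}{N}}(\rr_k) \leq \rr_k \leq 1$ and $\dd_k(\theta) \geq 0$
(because $\epsilon \pi_k \bigl[ \Delta_k(\theta) \bigr] \leq 1$). As $r_1$ is constant
under $\PP'$, we get
$$
\PP'(\xi) \leq - \frac{\log(\eta)}{\lambda} + \biggl[ 1 - r_1 + \frac{\log(\eta)}{\lambda}
\biggr] \PP' \biggl[ \xi > - \frac{\log(\eta)}{\lambda} \biggr].
$$
When the factor between brackets is nonnegative, the claim follows from the
fact that the last probability is not greater than $\eta$. When it is negative,
$\xi \leq 1 - r_1 < - \frac{\log(\eta)}{\lambda}$, so that $\PP'(\xi) \leq 1 - r_1$,
which is smaller than the convex combination
$(1 - \eta) \bigl[ - \frac{\log(\eta)}{\lambda} \bigr] + \eta (1 - r_1)$
appearing on the right-hand side.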
6463: On the other hand, $\Phi_{\frac{\lambda}{N}}$ being a convex function,
6464: \begin{align*}
6465: \PP'\biggl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1 -
6466: \frac{\dd_k(\theta)}{\lambda} \biggr]
6467: & \geq \Phi_{\frac{\lambda}{N}}\bigl[ \PP'(\rr_k) \bigr] - r_1
6468: - \frac{d'_k}{\lambda} \\ & = \Phi_{\frac{\lambda}{N}}
6469: \biggl( \frac{kR+r_1}{k+1} \biggr) - r_1 - \frac{d'_k}{\lambda}.
6470: \end{align*}
6471: Thus with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,
6472: $$
6473: \frac{kR+r_1}{k+1} \leq \inf_{\lambda \in \RR_+}
6474: \Phi_{\frac{\lambda}{N}}^{-1} \biggl[ r_1(1 - \eta) + \eta +
6475: \frac{d'_k - \log(\eta) (1 - \eta)}{\lambda} \biggr].
6476: $$
6477: We can generalize this approach by considering a finite decreasing sequence
6478: $\eta_0=1 > \eta_1 > \eta_2 > \dots > \eta_J > \eta_{J+1} = 0$, and
6479: the corresponding sequence of levels
6480: \begin{align*}
6481: L_j & = - \frac{\log(\eta_j)}{\lambda}, 0 \leq j \leq J,\\
6482: L_{J+1} & = 1 - r_1 - \frac{\log(J) - \log(\epsilon)}{\lambda}.
6483: \end{align*}
6484: Taking a union bound in $j$, we see that with $\PP$ probability at least $1 - \epsilon$,
6485: for any $\theta \in \Theta$, for any $\lambda \in \RR_+$,
6486: $$
6487: \PP' \biggl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1
6488: - \frac{\dd_k + \log(J)}{\lambda} \geq L_j \biggr] \leq \eta_j, \quad j=0, \dots, J+1,
6489: $$
6490: and consequently
6491: \begin{align*}
6492: \PP' & \biggl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1
6493: - \frac{\dd_k + \log(J)}{\lambda} \biggr] \\
6494: & \leq \int_{0}^{L_{J+1}}
6495: \PP' \biggl[ \Phi_{\frac{\lambda}{N}}(\rr_k) - r_1
6496: - \frac{\dd_k+ \log(J)}{\lambda} \geq \alpha \biggr] d \alpha
6497: \quad \leq \sum_{j=1}^{J+1} \eta_{j-1}(L_j - L_{j-1})
6498: \\ & = \eta_J \biggl[ 1 - r_1 - \frac{\log(J) -
6499: \log(\epsilon) - \log(\eta_J)}{\lambda}
6500: \biggr] - \frac{\log(\eta_1)}{\lambda} + \sum_{j=1}^{J-1}
6501: \frac{\eta_{j}}{\lambda} \log \biggl(
6502: \frac{\eta_{j}}{\eta_{j+1}}\biggr).
6503: \end{align*}
6504: Let us put
6505: \begin{multline*}
6506: d''_k\bigl[\theta, (\eta_j)_{j=1}^J \bigr]
6507: = d'_k(\theta) +
6508: \log(J) - \log(\eta_1)
6509: \\ + \sum_{j=1}^{J-1}
6510: \eta_j \log \left( \frac{\eta_j}{\eta_{j+1}} \right)
6511: + \log\left(\frac{\epsilon \eta_J}{J} \right) \eta_J.
6512: \end{multline*}
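As a consistency check (a sketch), consider the case when $J = 1$: the sum is
empty and $\log(J) = 0$, so that
$$
d''_k \bigl[ \theta, (\eta_1) \bigr] = d'_k(\theta) - (1 - \eta_1) \log(\eta_1)
+ \eta_1 \log(\epsilon).
$$
Up to the nonpositive term $\eta_1 \log(\epsilon)$, which comes from the use of
the inequality $\dd_k(\theta) \geq - \log(\epsilon)$ in the definition of the last
level $L_{J+1}$, we recover the term $d'_k - \log(\eta)(1 - \eta)$ of the bound
obtained above with a single value of $\eta$.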
6513:
6514: We have proved that for any decreasing sequence $(\eta_j)_{j=1}^J$,
6515: with $\PP$ probability at least $1 - \epsilon$,
6516: for any $\theta \in \Theta$,
6517: $$
6518: \frac{k R + r_1}{k+1}
6519: \leq \inf_{\lambda \in \RR_+}
6520: \Phi_{\frac{\lambda}{N}}^{-1} \biggl[
6521: r_1(1 - \eta_J) + \eta_J +
6522: \frac{ d''_k \bigl[ \theta, (\eta_j)_{j=1}^J \bigr]}{\lambda} \biggr].
6523: $$
6524:
6525: \begin{rmk}
6526: \mypoint We can for instance choose
6527: $J=2$, $\eta_2 = \frac{1}{10N}$, $\eta_1 =
6528: \frac{1}{\log(10 N)}$,
6529: resulting in
6530: $$
6531: d''_k = d'_k + \log(2) + \log\log(10 N) + 1 -
6532: \frac{\log\log(10N)}{\log(10N)} - \frac{\log \left( \frac{20N}{\epsilon} \right)}{10N}.
6533: $$
6534: In the case when $N = 1000$ and for any $\epsilon \in (0,1)$,
6535: we get $d''_k \leq d'_k + 3.7$, in the case when $N = 10^6$,
6536: we get $d''_k \leq d'_k + 4.4$, and in the case $N = 10^9$,
6537: we get $d''_k \leq d'_k + 4.7$.
6538:
6539: Therefore, for any practical
6540: purpose we could take $d''_k = d'_k + 4.7$ and $\eta_J = \frac{1}{10N}$
6541: in the above inequality.
6542: \end{rmk}
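These constants can be checked by hand (a sketch, with values rounded to three
decimal places). When $N = 1000$, $\log(10N) \simeq 9.210$ and
$\log \log(10N) \simeq 2.220$, so that
$$
\log(2) + \log\log(10N) + 1 - \frac{\log\log(10N)}{\log(10N)}
\simeq 0.693 + 2.220 + 0.759 = 3.672 \leq 3.7,
$$
while the remaining term $- \frac{1}{10N} \log \bigl( \frac{20N}{\epsilon} \bigr)$
is negative for any $\epsilon \in (0,1)$. The same computation gives approximately
$4.301$ when $N = 10^6$ and $4.694$ when $N = 10^9$.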
6543:
6544: Taking moreover a weighted union bound in $k$, we get
6545: \begin{thm}
6546: \label{thm2.3.3}
\mypoint For any $\epsilon \in (0,1)$, any sequence
6548: $1 > \eta_1 > \dots > \eta_J > 0$,
6549: any sequence $\pi_k : \Omega \rightarrow \C{M}_+^1(\Theta)$,
6550: where $\pi_k$ is a $k$-partially exchangeable posterior distribution,
6551: with $\PP$ probability at least $1 - \epsilon$, for any $\theta
6552: \in \Theta$,
6553: \begin{multline*}
6554: R(\theta) \leq \inf_{k \in \NN^*} \frac{k+1}{k} \inf_{\lambda \in \RR_+}
6555: \Phi_{\frac{\lambda}{N}}^{-1}
6556: \biggl[ r_1(\theta) + \eta_J \bigl[1 - r_1(\theta) \bigr]
6557: \\ + \frac{d''_k\bigl[\theta, (\eta_j)_{j=1}^J \bigr] + \log\bigl[k(k+1)\bigr]}{\lambda}
6558: \biggr] - \frac{r_1(\theta)}{k}.
6559: \end{multline*}
6560: \end{thm}
6561: \begin{cor}
6562: \label{cor3.3.14}
\mypoint For any $\epsilon \in (0,1)$, for any $N \leq 10^9$, with $\PP$ probability
6564: at least $1 - \epsilon$, for any $\theta \in \Theta$,
6565: \begin{multline*}
6566: R(\theta) \leq
6567: \inf_{k \in \NN^*} \inf_{\lambda \in \RR_+}
6568: \frac{k+1}{k} \bigl[ 1 - \exp( - \tfrac{\lambda}{N}) \bigr]^{-1}
6569: \biggl\{ 1 - \exp \biggl[ - \tfrac{\lambda}{N} \bigl[ r_1(\theta) +
6570: \tfrac{1}{10N} \bigr]
6571: \\ - \frac{ \PP' \bigl[ \log(\F{N}_k)\,\lvert\,(Z_i)_{i=1}^N
6572: \bigr]
6573: - \log(\epsilon) + \log\bigl[k(k+1)\bigr] + 4.7}{N} \biggr]
6574: \biggr\}
6575: - \frac{r_1(\theta)}{k}.
6576: \end{multline*}
6577: \end{cor}
6578:
6579: Let us end this section with a numerical example: in the case of binary classification
6580: with a VC class of dimension not greater than $10$, when $N=1000$,
6581: $\inf_{\Theta} r_1 = r_1(\w{\theta}) = 0.2$
6582: and $\epsilon = 0.01$, we get a bound $R(\w{\theta}) \leq 0.4211$ (for optimal
6583: values of $k = 15$ and of $\lambda = 1010$).
6584:
\subsubsection{Equal shadow and training sample sizes}
In the case when $k=1$, we can use Theorem \ref{thm2.2.5}, and replace
6586: $\Phi_{\frac{\lambda}{N}}^{-1}(q)$ with $\bigl\{ 1 - \frac{2N}{\lambda}
6587: \log \bigl[ \cosh(\frac{\lambda}{2N}) \bigr] \bigr\}^{-1}q$,
6588: resulting in
6589: \begin{thm}
\mypoint For any $\epsilon \in (0,1)$, any $N \leq 10^9$, any 1-partially exchangeable
6591: posterior distribution
6592: $\pi_1 : \Omega \rightarrow \C{M}_+^1(\Theta)$,
6593: with $\PP$ probability at least $1 - \epsilon$,
6594: for any $\theta \in \Theta$,
6595: $$
6596: R(\theta) \leq
6597: \inf_{\lambda \in \RR_+} \frac{\ds
6598: \Bigl\{ 1 + \tfrac{2N}{\lambda} \log \bigl[ \cosh(\tfrac{\lambda}{2N}) \bigr] \Bigr\} r_1(\theta)
6599: + \frac{1}{5N} + 2 \frac{d_1'(\theta) + 4.7}{\lambda}}{\ds
6600: 1 - \tfrac{2N}{\lambda} \log \bigl[ \cosh(\tfrac{\lambda}{2N}
6601: ) \bigr]}.
6602: $$
6603: \end{thm}
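Indeed (a sketch, in which no union bound on $k$ is needed, the term
$r_1 + \eta_J(1 - r_1)$ is bounded by $r_1 + \frac{1}{10N}$, and we put for short
$A = \tfrac{2N}{\lambda} \log \bigl[ \cosh(\tfrac{\lambda}{2N}) \bigr]$),
the replacement described above gives
$$
\frac{r_1(\theta) + R(\theta)}{2} \leq \bigl[ 1 - A \bigr]^{-1}
\biggl[ r_1(\theta) + \frac{1}{10N} + \frac{d'_1(\theta) + 4.7}{\lambda} \biggr],
$$
and therefore
$$
R(\theta) \leq \frac{\ds 2 r_1(\theta) + \frac{1}{5N} + 2 \frac{d'_1(\theta) + 4.7}{\lambda}}{
\ds 1 - A} - r_1(\theta)
= \frac{\ds \bigl[ 1 + A \bigr] r_1(\theta) + \frac{1}{5N}
+ 2 \frac{d'_1(\theta) + 4.7}{\lambda}}{\ds 1 - A}.
$$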
6604:
6605: \subsubsection{Improvement on the equal sample size bound in the i.i.d.~case}
Finally, in the case when $\PP$ is i.i.d., meaning that all the
6607: $P_i$ are equal, we can improve the previous bound. For any
6608: partially exchangeable function $\lambda : \Omega \times \Theta
6609: \rightarrow \RR_+$, we saw in the discussion preceding Theorem
6610: \ref{thm3.3.8} on page \pageref{thm3.3.8} that
6611: $$
T \Bigl\{ \exp \Bigl[ \lambda \bigl( \rr_k - r_1 - A(\lambda) v \bigr) \Bigr] \Bigr\}
6613: \leq 1,
6614: $$
6615: with the notations introduced therein.
6616: Thus for any partially exchangeable positive real measurable function
6617: $\lambda : \Omega \times \Theta \rightarrow \RR_+$ satisfying equation
6618: \eqref{eq2.2.1} on page \pageref{eq2.2.1}, any 1-partially exchangeable
6619: posterior distribution $\pi_1 : \Omega \rightarrow \C{M}_+^1(\Theta)$,
6620: $$
6621: \PP \Bigl\{ \exp \Bigl[ \sup_{\theta \in \Theta}
\lambda \bigl[ \rr_k(\theta) - r_1(\theta) - A(\lambda)v(\theta) \bigr] + \log \bigl\{
\epsilon \pi_1 \bigl[ \Delta(\theta) \bigr] \bigr\} \Bigr] \Bigr\} \leq \epsilon.
6624: $$
6625: Therefore with $\PP$ probability at least $1 - \epsilon$, with $\PP'$
probability at least $1 - \eta$,
$$
\rr_k(\theta) \leq r_1(\theta) + A(\lambda) v(\theta) + \frac{1}{\lambda} \bigl[
\dd_1(\theta) - \log(\eta) \bigr].
6630: $$
6631:
6632: We can then choose $\ds \lambda(\omega, \theta) \in
6633: \arg\min_{\lambda' \in \RR_+} A(\lambda') v(\theta) + \frac{\dd_1(\theta)
- \log(\eta)}{\lambda'}$, which satisfies the required
6635: conditions, to show that with $\PP$ probability at least $1 - \epsilon$,
6636: for any $\theta \in \Theta$, with $\PP'$ probability at least $1 - \eta$,
6637: for any $\lambda \in \RR_+$,
6638: $$
6639: \rr_k(\theta) \leq r_1(\theta) +
6640: A(\lambda)v(\theta) + \frac{\dd_1(\theta) - \log(\eta)}{\lambda}.
6641: $$
6642:
6643: We can then take a union bound on a decreasing sequence of $J$
6644: values $\eta_1 \geq \dots \geq \eta_J$ of $\eta$.
6645: Weakening a little the order of quantifiers,
6646: we then obtain the following statement:
6647: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,
6648: for any $\lambda \in \RR_+$, for any $j=1, \dots, J$
6649: $$
6650: \PP' \biggl[ \rr_k(\theta) - r_1(\theta) -
6651: A(\lambda) v(\theta) - \frac{\dd_1(\theta) + \log(J)}{\lambda}
6652: \geq - \frac{\log(\eta_j)}{\lambda} \biggr] \leq \eta_j.
6653: $$
6654: Consequently for any $\lambda \in \RR_+$,
6655: \begin{multline*}
6656: \PP' \biggl[ \rr_k(\theta) - r_1(\theta) -
6657: A(\lambda) v(\theta) - \frac{\dd_1(\theta) + \log(J)}{\lambda} \biggr]
6658: \\ \leq - \frac{ \log(\eta_1)}{\lambda} +
6659: \eta_J \biggl[1 - r_1(\theta) - \frac{\log(J) - \log(\epsilon) - \log(\eta_J)}{\lambda}
6660: \biggr]
6661: \\ + \sum_{j=1}^{J-1} \frac{\eta_{j}}{\lambda} \log \left( \frac{\eta_j}{\eta_{j+1}}
6662: \right).
6663: \end{multline*}
Moreover $\PP' \bigl[ v(\theta) \bigr] = \frac{r_1 + R}{2} - r_1 R$
(this is where we need equidistribution), thus proving that
6666: $$
6667: \frac{R - r_1}{2} \leq
6668: \frac{A(\lambda)}{2} \Bigl[ R+r_1 - 2 r_1 R \Bigr]
6669: + \frac{
6670: d''_1\bigl[\theta, (\eta_j)_{j=1}^J\bigr]
6671: }{\lambda} + \eta_J\bigl[1 - r_1(\theta)\bigr].
6672: $$
6673: Keeping track of quantifiers, we obtain
6674: \begin{thm}
6675: \label{thm2.3.9}
6676: \mypoint For any decreasing sequence $(\eta_j)_{j=1}^J$, any
6677: $\epsilon \in (0,1)$, any 1-partially exchangeable posterior
6678: distribution $\pi : \Omega \rightarrow \C{M}_+^1(\Theta)$,
6679: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,
6680: \begin{multline*}
6681: R(\theta) \leq \inf_{\lambda \in \RR_+} \\
6682: \frac{\ds \Bigl\{ 1 + \tfrac{2N}{\lambda}\log \bigl[ \cosh(\tfrac{\lambda}{2N})
6683: \bigr] \Bigr\} r_1(\theta) + \frac{2 d''_1\bigl[ \theta, (\eta_j)_{j=1}^J
6684: \bigr] }{\lambda} + 2 \eta_J
6685: \bigl[ 1 - r_1(\theta) \bigr]}{\ds
6686: 1 - \tfrac{2N}{\lambda}\log\bigl[ \cosh(\tfrac{\lambda}{2N})
6687: \bigr] \bigl[ 1 - 2 r_1(\theta) \bigr] }.
6688: \end{multline*}
6689: \end{thm}
6690:
6691: \subsection{Gaussian approximation in Vapnik's bounds}
6692: To obtain formulas which could be easily compared with original Vapnik's bounds,
6693: we may replace $p - \Phi_a(p)$ with a Gaussian upper bound:
6694: \begin{lemma}
6695: \mypoint For any $p \in (0,\frac{1}{2})$, any $a \in \RR_+$,
6696: $$
6697: p - \Phi_a(p) \leq \frac{a}{2} p(1-p).
6698: $$
6699: For any $p \in (\frac{1}{2}, 1)$,
6700: $$
6701: p - \Phi_a(p) \leq \frac{a}{8} .
6702: $$
6703:
6704: \end{lemma}
6705: \begin{proof}
6706: Let us notice that for any $p \in (0,1)$,
6707: \begin{align*}
6708: \frac{\partial}{\partial a} \bigl[ - a \Phi_a(p) \bigr]
6709: & = - \frac{p \exp(-a) }{1 - p + p \exp( - a)},\\
\frac{\partial^2}{\partial a^2} \bigl[ - a \Phi_a(p) \bigr]
6711: & =
6712: \frac{p \exp(-a) }{1 - p + p \exp( - a)}
6713: \left( 1 - \frac{p \exp( - a)}{1 - p + p\exp( - a)} \right) \\
6714: & \leq
6715: \begin{cases}
6716: p(1-p) & p \in (0, \frac{1}{2}),\\
6717: \frac{1}{4} & p \in (\frac{1}{2}, 1).
6718: \end{cases}
6719: \end{align*}
Thus taking a Taylor expansion of order one with integral remainder,
$$
-a \Phi_a(p) \leq
6723: \begin{cases}
6724: \begin{aligned}[b]-a p + \int_0^a p (1-p) & (a-b) db \\
6725: & = -a p + \frac{a^2}{2}p(1-p),\end{aligned} & p \in
6726: (0,\frac{1}{2}),\\
6727: \ds -a p + \int_0^a \frac{1}{4}(a -b) db = -a p + \frac{a^2}{8}, & p \in
6728: (\frac{1}{2}, 1).
6729: \end{cases}
6730: $$
6731: This ends the proof of our lemma. \end{proof}
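In more detail (a sketch), the bound on the second derivative comes from the
fact that
$$
u_a = \frac{p \exp(-a)}{1 - p + p \exp(-a)} = \frac{p}{p + (1-p) \exp(a)}
$$
is a nonincreasing function of $a \in \RR_+$, equal to $p$ when $a = 0$.
When $p \leq \frac{1}{2}$, we thus have $u_a \leq p \leq \frac{1}{2}$ and therefore
$u_a(1 - u_a) \leq p(1-p)$, because $x \mapsto x(1-x)$ is nondecreasing on
$(0, \frac{1}{2})$, whereas $u_a(1 - u_a) \leq \frac{1}{4}$ in any case.
Moreover $a \mapsto - a \Phi_a(p)$ and its first derivative take the values
$0$ and $-p$ at $a = 0$, which gives the first two terms of the Taylor expansion.
Dividing the resulting inequalities by $a > 0$, we obtain
$$
p - \Phi_a(p) \leq
\begin{cases}
\frac{a}{2} p (1-p), & p \in (0, \frac{1}{2}),\\
\frac{a}{8}, & p \in (\frac{1}{2}, 1),
\end{cases}
$$
as stated in the lemma.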
6732: \begin{lemma}
6733: \mypoint Let us consider the bound
6734: $$
6735: B(q,d) = \left(1 + \frac{2 d}{N} \right)^{-1}
6736: \biggl[ q + \frac{d}{N} + \sqrt{ \frac{2 d q(1-q)}{N}
6737: + \frac{d^2}{N^2}} \biggr], \quad q \in \RR_+, d \in \RR_+.
6738: $$
6739: Let us also put
6740: $$
6741: \Bar{B}(q,d) =
6742: \begin{cases}
6743: B(q,d) & B(q,d) \leq \frac{1}{2},\\
6744: q + \sqrt{\frac{d}{2N}} & \text{ otherwise}.
6745: \end{cases}
6746: $$
6747: For any positive real parameters $q$ and $d$
6748: $$
6749: \inf_{\lambda \in \RR_+} \Phi_{\frac{\lambda}{N}}^{-1}
6750: \biggl( q + \frac{d}{\lambda} \biggr) \leq \Bar{B}(q,d).
6751: $$
6752: \end{lemma}
6753: \begin{proof}
6754: Let $\ds p = \inf_{\lambda} \Phi_{\frac{\lambda}{N}}^{-1} \biggl(
6755: q + \frac{d}{\lambda}\,\biggr)$. For any $\lambda \in \RR_+$,
6756: $$
6757: p - \frac{\lambda}{2N} (p \wedge \tfrac{1}{2})\bigl[1 -
6758: (p \wedge \tfrac{1}{2}) \bigr] \leq \Phi_{\frac{\lambda}{N}}(p)
6759: \leq q + \frac{d}{\lambda}.
6760: $$
6761: Thus
6762: \begin{multline*}
6763: p \leq q + \inf_{\lambda \in \RR_+} \frac{\lambda}{2N}
6764: (p \wedge \tfrac{1}{2}) \bigl[ 1 - ( p \wedge \tfrac{1}{2}) \bigr]
6765: + \frac{d}{\lambda} \\ = q + \sqrt{\frac{2 d
6766: (p \wedge \tfrac{1}{2}) \bigl[ 1 - ( p \wedge \tfrac{1}{2}) \bigr]}{N}}
6767: \leq q + \sqrt{\frac{d}{2N}}.
6768: \end{multline*}
6769: Then let us remark that
6770: $\ds
6771: B(q,d) = \sup \left\{ p' \in \RR_+ \,;\, p' \leq q + \sqrt{\frac{2dp'(1-p')}{N}}
6772: \right\}.$
6773: If moreover $\tfrac{1}{2} \geq B(q,d)$, then according
6774: to this remark $\tfrac{1}{2} \geq q + \sqrt{\frac{d}{2N}} \geq p$.
6775: Therefore $p \leq \tfrac{1}{2}$, and consequently $p \leq q + \sqrt{\frac{2dp(1-p)}{N}}$,
6776: implying that $p \leq B(q,d)$.
6777: \end{proof}
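The closed form of $B(q,d)$ used in this proof can be checked as follows
(a sketch, for $q \in (0,1)$). For any $p' \in (q, 1)$, the condition
$p' \leq q + \sqrt{\frac{2 d p'(1-p')}{N}}$ is equivalent to
$(p' - q)^2 \leq \frac{2 d p' (1 - p')}{N}$, that is to
$$
\left( 1 + \frac{2d}{N} \right) {p'}^2 - 2 \left( q + \frac{d}{N} \right) p' + q^2 \leq 0.
$$
The value of this polynomial at $p' = q$ is $- \frac{2 d q (1-q)}{N} \leq 0$, so that
$q$ lies between its two roots, and the largest one is
$$
\left( 1 + \frac{2d}{N} \right)^{-1} \left[ q + \frac{d}{N}
+ \sqrt{ \left( q + \frac{d}{N} \right)^2 - \left( 1 + \frac{2d}{N} \right) q^2}
\, \right],
$$
where
$$
\left( q + \frac{d}{N} \right)^2 - \left( 1 + \frac{2d}{N} \right) q^2
= \frac{2 d q (1 - q)}{N} + \frac{d^2}{N^2}.
$$
This is the expression defining $B(q,d)$.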
6778:
6779: \subsubsection{Arbitrary shadow sample size}
6780: This lemma combined with Corollary \ref{cor3.3.14}
6781: on page \pageref{cor3.3.14} implies
6782: \begin{cor}
6783: \label{cor2.3.7}
\mypoint For any $\epsilon \in (0,1)$, any integer $N \leq 10^9$,
6785: with $\PP$ probability at least $1 - \epsilon$,
6786: for any $\theta \in \Theta$,
6787: $$
6788: R(\theta) \leq \inf_{k \in \NN^*}
6789: \frac{k+1}{k} \Bigl\{
6790: \Bar{B}\Bigl[r_1(\theta) + \frac{1}{10N}, d'_k(\theta) + \log \bigl[
6791: k(k+1)\bigr] + 4.7 \Bigr] \Bigr\} - \frac{r_1(\theta)}{k}.
6792: $$
6793: \end{cor}
6794:
6795: \subsubsection{Equal sample sizes in the i.i.d.~case}
6796: To make a link with Vapnik's result, it is useful to work out
6797: the Gaussian approximation to Theorem \ref{thm2.3.9}
6798: on page \pageref{thm2.3.9}.
6799: Indeed, using the upper bound $A(\lambda) \leq \frac{\lambda}{4N}$,
6800: where $A(\lambda)$ is defined by equation \eqref{eq2.2}
6801: on page \pageref{eq2.2}, we
6802: get with $\PP$ probability at least $1 - \epsilon$
6803: $$
6804: R - r_1 - 2 \eta_J \leq \inf_{\lambda \in \RR_+}
6805: \frac{\lambda}{4N} \bigl[ R + r_1 - 2 r_1 R \bigr]
6806: + \frac{2 d''_1}{\lambda}
6807: = \sqrt{\frac{2 d''_1 (R + r_1 - 2 r_1 R)}{N}},
6808: $$
6809: which can be solved in $R$ to obtain
6810: \begin{cor}
6811: \label{cor2.3.10}
6812: \mypoint With $\PP$ probability at least
6813: $1 - \epsilon$, for any $\theta \in \Theta$,
6814: \begin{multline*}
6815: R(\theta) \leq r_1(\theta) + \frac{d''_1(\theta)}{N}
6816: \bigl[ 1 - 2 r_1(\theta) \bigr]
6817: + 2 \eta_J
6818: \\ + \sqrt{ \frac{4 d''_1(\theta) \bigl[ 1 - r_1(\theta) \bigr] r_1(\theta)}{N}
6819: + \frac{{d''_1}(\theta)^2}{N^2} \bigl[ 1 - 2 r_1(\theta) \bigr]^2
6820: + \frac{4 d''_1(\theta)}{N} \bigl[ 1 - 2 r_1(\theta) \bigr] \eta_J}.
6821: \end{multline*}
6822: \end{cor}
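Here is a sketch of the computation, where the short notations
$c = r_1 + 2 \eta_J$ and $\delta = \frac{d''_1}{N}$ are introduced for this
purpose only. The infimum in $\lambda$ is reached when
$\lambda^2 = \frac{8 N d''_1}{R + r_1 - 2 r_1 R}$, which gives the value
$\sqrt{\frac{2 d''_1 (R + r_1 - 2 r_1 R)}{N}}$. When $R \geq c$, squaring both sides of the
inequality $R - c \leq \sqrt{2 \delta \bigl[ R (1 - 2 r_1) + r_1 \bigr]}$ gives
$$
R^2 - 2 \bigl[ c + \delta (1 - 2 r_1) \bigr] R + c^2 - 2 \delta r_1 \leq 0,
$$
so that $R$ is not greater than the largest root of this polynomial, namely
$$
c + \delta (1 - 2 r_1) + \sqrt{ \bigl[ c + \delta(1 - 2 r_1) \bigr]^2 - c^2 + 2 \delta r_1}.
$$
The expression under the square root is equal to
$$
2 c \delta (1 - 2 r_1) + \delta^2 (1 - 2 r_1)^2 + 2 \delta r_1
= 4 \delta r_1 (1 - r_1) + \delta^2 (1 - 2 r_1)^2 + 4 \delta (1 - 2 r_1) \eta_J,
$$
which is the statement of Corollary \ref{cor2.3.10}.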
6823: This is to be compared with Vapnik's result, as proved in \cite[page 138]{Vapnik}:
6824: \begin{thm}[Vapnik]
6825: \label{thmVapnik}
6826: \mypoint For any i.i.d. probability distribution $\PP$,
6827: with $\PP$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,
6828: putting
6829: $$
6830: d_V = \log \bigl[ \PP (\F{N}_1) \bigr] + \log(4/\epsilon),
6831: $$
6832: $$
6833: R(\theta) \leq r_1(\theta) + \frac{2 d_V}{N} +
6834: \sqrt{ \frac{4 d_V r_1(\theta)}{N} + \frac{4 d_V^2}{N^2}}.
6835: $$
6836: \end{thm}
6837: Recalling that we can choose $(\eta_j)_{j=1}^2$ such that
$\eta_J = \frac{1}{10N}$ (which is negligible for all practical purposes) and
6839: such that for any $N \leq
6840: 10^9$,
6841: $$
6842: d''_1( \theta) \leq \PP \bigl[ \log ( \F{N}_1 ) \,\lvert\,
6843: (Z_i)_{i=1}^N\bigr]
6844: - \log(\epsilon) + 4.7,
6845: $$
we see that our complexity term is somewhat more satisfactory than Vapnik's,
since it is integrated outside the logarithm, at the price of a slightly larger additional
constant (remember that $\log(4) \simeq 1.4$, which is better than our $4.7$,
which could presumably be improved by working out a better sequence $\eta_j$,
but not down to $\log(4)$). Our variance term is better, since we get
$r_1(1-r_1)$ as we should, instead of only $r_1$.
We also have $\ds \frac{d''_1}{N}$ instead of
$\ds 2 \frac{d_V}{N}$, coming from the fact that we do not use any symmetrization
trick.
6855:
Let us illustrate these bounds on a numerical example (corresponding to
6857: a situation where the sample is noisy or the classification model is
6858: weak). Let us assume that $N = 1000$, $
6859: \inf_{\Theta} r_1 = r_1(\w{\theta}) = 0.2$, that we
6860: are performing binary classification with a model with VC dimension
not greater than $h = 10$, and that we work at confidence level
6862: $\epsilon = 0.01$. Vapnik's theorem provides an upper bound for
6863: $R(\w{\theta})$ not smaller than
6864: $0.610$, whereas Corollary \ref{cor2.3.10} gives
6865: $R(\w{\theta}) \leq 0.461$ (using the bound $d''_1 \leq d'_1 + 3.7$ when $N = 1000$).
6866: Now if we go for Theorem
6867: \ref{thm2.3.9} and do not make a Gaussian approximation,
6868: we get $R(\w{\theta}) \leq 0.453$. It is interesting to
6869: remark that this bound is achieved for $\lambda = 1195 > N = 1000$.
6870: This explains why the Gaussian approximation in Vapnik's bound
6871: can be improved: for such a large value of $\lambda$, $\lambda r_1(\theta)$
6872: does not behave like a Gaussian random variable.
6873:
Let us recall in conclusion that the best bound is provided by
Theorem \ref{thm2.3.3}, giving $R(\w{\theta}) \leq 0.4211$
(that is approximately $2/3$
of Vapnik's bound), for optimal values
of $k = 15$ and of $\lambda = 1010$. This bound can be seen to
6879: take advantage of the fact that Bernoulli random variables
6880: are not Gaussian (its Gaussian approximation, Corollary \ref{cor2.3.7},
gives a bound $R(\w{\theta}) \leq 0.4325$, still with an optimal $k = 15$),
6882: and of the fact that the optimal size of
6883: the shadow sample is significantly larger than the size
6884: of the observed sample. Moreover, Theorem \ref{thm2.3.3} does not
6885: assume that the sample is i.i.d., but only that it is
6886: independent, thus generalizing Vapnik's bounds to inhomogeneous
6887: data (this will presumably be the case when data are collected
6888: from different places where the experimental conditions may
6889: not be expected to be the same, although they may reasonably
be assumed to be independent). We would also like to emphasize
that our little numerical example shows that Vapnik's bounds
can be expected to be appropriate when dealing with moderate
sample sizes. More sophisticated bounds obviously have a better
asymptotic behaviour as shown in the first section. Nevertheless
the numerical illustration
of Theorem \ref{thm1.1.17} given on page \pageref{thm1.1.17}
suggests that
Vapnik's bounds do not perform so badly for small
to medium ratios between the sample size and the dimension of
the classification model (with local bounds, we could only get
down to $0.332$, although using a considerably stronger dimension assumption).
6902:
6903: We chose on purpose an example where it is non trivial
6904: to decide whether the chosen classifier does better than the $0.5$
6905: error rate of blind random classification. We think that this
6906: situation of weak learning is of practical interest, since
6907: ``significant'' weak learning rules may afterwards be aggregated
6908: or combined in various ways to achieve better classification rates.
6909:
6910: \section{Support Vector Machines}
6911: \subsection{How to build them}
6912: \subsubsection{The canonical hyperplane}
6913: \label{chapSVM}
6914:
Support
Vector Machines, which are widely used and well known,
were introduced by V. Vapnik \cite{Vapnik}.
6918: Before introducing them,
6919: we will study as a prerequisite the separation of points by hyperplanes
6920: in a finite dimensional Euclidean space.
6921: Support Vector Machines perform the same kind of linear
6922: separation after
6923: an implicit change of pattern space.
The preceding PAC-Bayesian results provide a
suitable framework to analyze their generalization properties.
6926:
6927: We will deal in this section with the classification
6928: of points in $\RR^d$ in two classes.
6929: Let $Z = (x_i, y_i)_{i=1}^N \in \bigl(\RR^d \times \{-1,+1\}
6930: \bigr)^N$ be some set of labelled examples (called
6931: the training set hereafter). Let us split the set of
6932: indices $I = \{1, \dots, N\}$
6933: according to the labels into two subsets
6934: \begin{align*}
6935: I_+ & = \{ i \in I\,: y_i = + 1 \},\\
6936: I_- & = \{ i \in I\,: y_i = - 1 \}.
6937: \end{align*}
6938: Let us then consider the set of admissible separating directions
6939: $$
6940: A_Z = \bigl\{ w \in \RR^d \,: \sup_{b \in \RR} \inf_{i \in I}
6941: ( \langle w, x_i \rangle - b ) y_i \geq 1 \bigr\},
6942: $$
6943: which can also be written as
6944: $$
6945: A_Z = \bigl\{ w \in \RR^d\,:
6946: \max_{i \in I_-} \langle w, x_i
6947: \rangle + 2 \leq \min_{i \in I_+} \langle w, x_i \rangle \bigr\}.
6948: $$
As is easily seen, the optimal value of $b$ for a fixed value of $w$, in other
6950: words the value of $b$ which maximizes $\inf_{i \in I}
6951: (\langle w, x_i \rangle - b)y_i$, is equal to
6952: $$
6953: b_w = \frac{1}{2} \Bigl[ \max_{i \in I_-} \langle w, x_i \rangle +
6954: \min_{i \in I_+} \langle w, x_i \rangle \Bigr].
6955: $$
6956: \begin{lemma}\mypoint
6957: When $A_Z \neq \varnothing$, $\inf \{ \lVert w \rVert^2 \,: w
6958: \in A_Z \}$ is reached for only one value $w_Z$ of $w$.
6959: \end{lemma}
6960: \begin{proof}
6961: Let $w_0 \in A_Z$. The set $A_Z \cap \{ w \in \RR^d :
6962: \lVert w \rVert \leq \lVert w_0 \rVert \}$ is a compact convex set and $w \mapsto \lVert w \rVert^2$ is strictly
6963: convex and therefore has a unique minimum on this set, which
6964: is also obviously its minimum on $A_Z$.
6965: \end{proof}
6966: \begin{dfn}\mypoint
6967: When $A_Z \neq \varnothing$, the training set $Z$ is said
6968: to be linearly separable. The hyperplane
6969: $$
6970: H = \{ x \in \RR^d \,: \langle w_Z, x \rangle - b_Z = 0 \},
6971: $$
6972: where
6973: \begin{align*}
6974: w_Z & = \arg\min \{ \lVert w \rVert \,: w \in A_Z \},\\
6975: b_Z & = b_{w_Z},
6976: \end{align*}
6977: is called the canonical separating hyperplane of the training set $Z$.
6978: The quantity $\lVert w_Z \rVert^{-1}$ is called the margin of the
6979: canonical hyperplane.
6980: \end{dfn}
6981: Note that as $\min_{i \in I_+} \langle w_Z, x_i \rangle -
6982: \max_{i \in I_-} \langle w_Z, x_i \rangle = 2$, the margin is
6983: also equal to half the distance between the projections
6984: on the direction $w_Z$ of the positive and negative patterns.
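\par\noindent{\sc Illustration.}
As a small worked example (the numerical values are ours and serve only
as an illustration), take $d = 2$ and the training set made of
$x_1 = (0,0)$, $y_1 = -1$, $x_2 = (2,0)$, $y_2 = +1$ and
$x_3 = (3,1)$, $y_3 = +1$. Writing $w = (w_1, w_2)$, the second
expression of $A_Z$ reads
$$
A_Z = \bigl\{ w \in \RR^2 \,: 2 \leq \min \{ 2 w_1, 3 w_1 + w_2 \} \bigr\}.
$$
Any $w \in A_Z$ satisfies $w_1 \geq 1$, hence $\lVert w \rVert \geq 1$,
and $w = (1,0)$ belongs to $A_Z$, so that $w_Z = (1,0)$. Consequently
$$
b_Z = \tfrac{1}{2} \bigl[ 0 + 2 \bigr] = 1,
$$
the canonical hyperplane is the vertical line of points whose first
coordinate is equal to $1$, and the margin is
$\lVert w_Z \rVert^{-1} = 1$, which is indeed half the distance between
the projections $0$ and $2$ on the first axis of the negative pattern
and of the closest positive pattern.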
6985:
6986: \subsubsection{Computation of the canonical hyperplane}
6987:
Let us consider the convex hulls $\C{X}_+$ and $\C{X}_-$ of the positive
and negative patterns:
6990: \begin{align*}
6991: \C{X}_+ & = \Bigl\{ \sum_{i \in I_+} \lambda_i x_i\,:\bigl( \lambda_i
6992: \bigr)_{i \in I_+} \in \RR_+^{I_+}, \sum_{i \in I_+} \lambda_i
6993: = 1 \Bigr\},\\
6994: \C{X}_- & = \Bigl\{ \sum_{i \in I_-} \lambda_i x_i\,:\bigl( \lambda_i
6995: \bigr)_{i \in I_-} \in \RR_+^{I_-}, \sum_{i \in I_-} \lambda_i
6996: = 1 \Bigr\}.
6997: \end{align*}
6998: Let us introduce the closed convex set
6999: $$
7000: \C{V} = \C{X}_+ - \C{X}_- = \bigl\{ x_+ - x_-\,: x_+ \in \C{X}_+, x_- \in
7001: \C{X}_- \bigr\}.
7002: $$
7003: As $v \mapsto \lVert v \rVert^{2}$ is strictly convex,
7004: with compact lower level sets, there is a unique
7005: vector $v^*$ such that
7006: $$
\lVert v^* \rVert^2 = \inf \bigl\{ \lVert v \rVert^2\,: v \in \C{V} \bigr\}.
7008: $$
7009: \begin{lemma}\mypoint
7010: The set $A_Z$ is non empty (i.e. the training set $Z$
7011: is linearly separable) if and only if $v^* \neq 0$. In this case
7012: $$
7013: w_Z = \frac{2}{\lVert v^* \rVert^{2}} v^*,
7014: $$
7015: and the margin of the canonical hyperplane is equal to $\frac{1}{2}
7016: \lVert v^* \rVert$.
7017: \end{lemma}
7018: \begin{proof}
7019: Let us assume first that $v^* = 0$, or equivalently that
7020: $\C{X}_+ \cap \C{X}_- \neq \varnothing$. As for any vector $w \in \RR^d$,
7021: \begin{align*}
7022: \min_{i \in I_+} \langle w, x_i \rangle & = \min_{x \in \C{X}_+}
7023: \langle w, x \rangle,\\
7024: \max_{i \in I_-} \langle w, x_i \rangle & = \max_{x \in \C{X}_-}
7025: \langle w, x \rangle,
7026: \end{align*}
7027: we see that necessarily $ \min_{i \in I_+}
7028: \langle w, x_i \rangle - \max_{i \in I_-}
7029: \langle w, x_i \rangle \leq 0$, which shows that
7030: $w$ cannot be in $A_Z$ and therefore that $A_Z$
7031: is empty.
7032:
7033: Let us assume now that $v^* \neq 0$, or equivalently that
7034: $\C{X}_+ \cap \C{X}_- = \varnothing$. Let us put
7035: $w^* = \frac{2}{\lVert v^* \rVert^2} v^*$.
7036: Let us remark first that
7037: \begin{align*}
7038: \min_{i \in I_+} \langle w^*, x_i \rangle -
7039: \max_{i \in I_-} \langle w^*, x_i \rangle & =
7040: \inf_{x \in \C{X}_+} \langle w^*, x \rangle -
7041: \sup_{x \in \C{X}_-} \langle w^*, x \rangle
7042: \\ & = \inf_{x_+ \in \C{X}_+, x_- \in \C{X}_-}
7043: \langle w^*, x_+ - x_- \rangle \\ & =
7044: \frac{2}{\lVert v^* \rVert^2}
7045: \inf_{v \in \C{V}} \langle v^*, v \rangle.
7046: \end{align*}
7047: Let us now prove that $\inf_{v \in \C{V}}
7048: \langle v^*, v \rangle = \lVert v^* \rVert^2$.
7049: Some arbitrary $v \in \C{V}$ being fixed,
7050: consider the function $$\beta \mapsto \lVert
7051: \beta v + (1 - \beta) v^* \rVert^2 : [0,1]
7052: \rightarrow \RR.$$ By definition of $v^*$,
7053: it reaches its minimum value for $\beta = 0$,
7054: and therefore has a non negative derivative at
7055: this point. Computing this derivative, we find
7056: that $\langle v - v^*, v^* \rangle \geq 0$,
7057: as claimed. We have proved that
7058: $$
7059: \min_{i \in I_+} \langle w^*, x_i \rangle
7060: - \max_{i \in I_-} \langle w^*, x_i \rangle
7061: = 2,
7062: $$
7063: and therefore that $w^* \in A_Z$. On the other hand,
7064: any $w \in A_Z$ is such that
7065: $$
7066: 2 \leq \min_{i \in I_+} \langle w, x_i \rangle
7067: - \max_{i \in I_-} \langle w, x_i \rangle
7068: = \inf_{v \in \C{V}} \langle w, v \rangle \leq \lVert w \rVert
7069: \inf_{v \in \C{V}} \lVert v \rVert = \lVert w \rVert
7070: \,\lVert v^* \rVert.
7071: $$
7072: This proves that $\lVert w^* \rVert = \inf \bigl\{ \lVert w \rVert\,:
7073: w \in A_Z \bigr\}$, and therefore that $w^* = w_Z$ as claimed.
7074: \end{proof}
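\noindent{\sc Illustration.}
To see the lemma at work on a toy case (the numerical values are ours
and are only meant as an illustration), take $d = 2$, one negative
pattern $x_1 = (0,0)$ and two positive patterns $x_2 = (2,0)$ and
$x_3 = (3,1)$. Then $\C{X}_- = \{ 0 \}$ and
$$
\C{V} = \C{X}_+ = \bigl\{ (2 + t, t) \,: t \in [0,1] \bigr\}.
$$
As $\lVert (2 + t, t) \rVert^2 = 2 t^2 + 4 t + 4$ is increasing
on $[0,1]$, we get $v^* = (2,0)$ and $\lVert v^* \rVert = 2$.
We can also check directly that
$\inf_{v \in \C{V}} \langle v^*, v \rangle = \inf_{t \in [0,1]} 2(2 + t)
= 4 = \lVert v^* \rVert^2$. The lemma then gives
$$
w_Z = \frac{2}{4} \, (2, 0) = (1, 0),
$$
with a margin equal to $\frac{1}{2} \lVert v^* \rVert = 1$.
\par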
One way to compute $w_Z$ would therefore be to compute $v^*$ by minimizing
7076: $$
7077: \bigl\{ \lVert \sum_{i \in I} \lambda_i y_i x_i \rVert^2\,:
7078: (\lambda_i)_{i \in I} \in \RR_+^I, \sum_{i \in I} \lambda_i = 2,
7079: \sum_{i \in I} y_i \lambda_i = 0 \bigr\}.
7080: $$
7081: Although this is a tractable quadratic programming problem, a
7082: direct computation of $w_Z$ through the following proposition
is usually preferred.
7084: \begin{prop}\mypoint
7085: \label{wComp}
7086: The canonical direction $w_Z$ can be expressed as
7087: $$
7088: w_Z = \sum_{i=1}^N \alpha_i^* y_i x_i,
7089: $$
where $(\alpha_i^*)_{i=1}^N$ is any solution of the minimization problem
$$
F(\alpha^*) = \inf \bigl\{ F(\alpha)\,: \alpha \in \C{A} \bigr\},
7093: $$
7094: where
7095: $$
7096: \C{A} = \Bigl\{ (\alpha_i)_{i \in I}
7097: \in \RR_+^{I}, \sum_{i \in I} \alpha_i y_i = 0 \Bigr\},
7098: $$
7099: and
7100: $$
7101: F(\alpha) = \Bigl\lVert \sum_{i \in I} \alpha_i y_i x_i \Bigr\rVert^2
7102: - 2 \sum_{i \in I} \alpha_i.
7103: $$
7104: \end{prop}
7105: \begin{proof}
7106: Let $w(\alpha) = \sum_{i \in I} \alpha_i y_i x_i$ and
7107: let $S(\alpha) = \frac{1}{2} \sum_{i\in I}\alpha_i$.
7108: We can express the function $F(\alpha)$ as
7109: $F(\alpha) = \lVert w(\alpha) \rVert^2 - 4 S(\alpha)$.
7110: Moreover it is important to notice that for any $s \in \RR_+$
7111: $\{ w(\alpha)\,: \alpha \in \C{A}, S(\alpha) = s\} = s \C{V}$.
7112: This shows that for any $s \in \RR_+$, $\inf \{ F(\alpha)
7113: : \alpha \in \C{A}, S(\alpha) = s \}$ is reached and that for any
7114: \linebreak $\alpha_s \in \{ \alpha \in \C{A}\,: S(\alpha) = s \}$ reaching this infimum,
7115: $w(\alpha_s) = s v^*$. As \linebreak $s \mapsto s^2 \lVert v^* \rVert^2 - 4 s :
7116: \RR_+ \rightarrow \RR$ reaches its infimum for only one value
7117: $s^*$ of $s$, namely at $s^* = \frac{2}{\lVert v^* \rVert^2}$,
7118: this shows that $F(\alpha)$ reaches its infimum on $\C{A}$,
7119: and that for any $\alpha^* \in \C{A}$ such that $F(\alpha^*) =
7120: \inf \{ F(\alpha)\,: \alpha \in \C{A} \}$, $w(\alpha^*)
7121: = \frac{2}{\lVert v^* \rVert^2} v^* = w_Z$.
7122: \end{proof}
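\noindent{\sc Illustration.}
The following toy computation (with numerical values of our own choosing,
given only as an illustration) shows the proposition at work. Take
$d = 2$, $x_1 = (0,0)$, $y_1 = -1$, $x_2 = (2,0)$, $y_2 = +1$ and
$x_3 = (3,1)$, $y_3 = +1$. On $\C{A}$ we have
$\alpha_1 = \alpha_2 + \alpha_3$, so that
$w(\alpha) = (2 \alpha_2 + 3 \alpha_3, \alpha_3)$ and
$$
F(\alpha) = (2 \alpha_2 + 3 \alpha_3)^2 + \alpha_3^2
- 4 (\alpha_2 + \alpha_3), \qquad (\alpha_2, \alpha_3) \in \RR_+^2.
$$
At the point $\alpha^* = \bigl( \tfrac{1}{2}, \tfrac{1}{2}, 0 \bigr)$,
$$
\frac{\partial F}{\partial \alpha_2} = 4 (2 \alpha_2 + 3 \alpha_3) - 4 = 0,
\qquad
\frac{\partial F}{\partial \alpha_3} = 6 (2 \alpha_2 + 3 \alpha_3)
+ 2 \alpha_3 - 4 = 2 \geq 0.
$$
Since $F$ is convex and the only active constraint is $\alpha_3 \geq 0$,
these first order conditions show that $\alpha^*$ minimizes $F$ on $\C{A}$.
We get $w(\alpha^*) = (1,0)$ and $F(\alpha^*) = -1$. It can be checked
directly that $(1,0)$ is the element of least norm of
$A_Z = \{ w \in \RR^2 : 2 \leq \min \{ 2 w_1, 3 w_1 + w_2 \} \}$,
in accordance with the proposition.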
7123:
7124: \subsubsection{Support vectors}
7125: \begin{dfn}\mypoint
7126: The set of support vectors $\C{S}$ is defined by
7127: $$
7128: \C{S} = \{ x_i \,: \langle w_Z , x_i \rangle - b_Z = y_i \}.
7129: $$
7130: \end{dfn}
7131:
7132: \begin{prop}\mypoint
7133: \label{chap4Prop3.1}
7134: Any $\alpha^*$ minimizing $F(\alpha)$ on $\C{A}$
7135: is such that
7136: $$
7137: \{ x_i\,: \alpha_i^* > 0 \} \subset \C{S}.
7138: $$
7139: This implies that the representation $w_Z = w(\alpha^*)$
7140: involves in general only a limited number of non zero
7141: coefficients and that $w_Z = w_{Z'}$, where $Z' =
7142: \{ (x_i,y_i)\,: x_i \in \C{S} \}$.
7143: \end{prop}
7144: \begin{proof}
7145: Let us consider any given $i \in I_+$ and $j \in I_-$, such that
7146: $\alpha_i^* > 0$ and $\alpha_j^* > 0$ (there exists at least
one such index in each set $I_-$ and $I_+$, since the sums of the
components of $\alpha^*$ on each of these sets are equal and
7149: since $\sum_{k \in I} \alpha^*_k > 0$).
7150: For any $t \in \RR$, consider
7151: $$
7152: \alpha_k(t) = \alpha_k^* + t \B{1}(k \in \{i,j\}), \quad k \in I.
7153: $$
7154: The vector $\alpha(t)$ is in $\C{A}$
7155: for any value of $t$ in some neighborhood of $0$,
7156: therefore $\frac{\partial}{\partial t}_{|t = 0} F\bigl[\alpha(t) \bigr] = 0$.
7157: Computing this derivative, we find that
7158: $$
7159: y_i \langle w(\alpha^*), x_i \rangle +
7160: y_j \langle w(\alpha^*) , x_j \rangle = 2.
7161: $$
7162: As $y_i = - y_j$, this can also be written as
7163: $$
7164: y_i \bigl[ \langle w(\alpha^*), x_i \rangle - b_Z \bigr] +
7165: y_j \bigl[ \langle w(\alpha^*) , x_j \rangle -b_Z \bigr] = 2.
7166: $$
7167: As $w(\alpha^*)\in A_Z$,
7168: $$
7169: y_k \bigl[ \langle w(\alpha^*), x_k \rangle - b_Z \bigr] \geq 1,
7170: \qquad k \in I,
7171: $$
7172: which implies necessarily as claimed that
7173: $$
7174: y_i \bigl[ \langle w(\alpha^*), x_i \rangle - b_Z \bigr]
7175: = y_j \bigl[ \langle w(\alpha^*) , x_j \rangle -b_Z \bigr] = 1.
7176: $$
7177: \end{proof}
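\noindent{\sc Illustration.}
Here is a toy case (the numerical values are ours and serve only as an
illustration). Take $d = 2$, $x_1 = (0,0)$, $y_1 = -1$,
$x_2 = (2,0)$, $y_2 = +1$ and $x_3 = (3,1)$, $y_3 = +1$.
One can check that $w_Z = (1,0)$, $b_Z = 1$ and that
$F$ is minimized on $\C{A}$ at
$\alpha^* = \bigl( \tfrac{1}{2}, \tfrac{1}{2}, 0 \bigr)$. The values of
$\langle w_Z, x_i \rangle - b_Z$ are $-1$, $+1$ and $+2$ for
$i = 1, 2, 3$ respectively, so that
$$
\C{S} = \{ x_1, x_2 \} = \{ x_i \,: \alpha_i^* > 0 \}.
$$
The pattern $x_3$ is not a support vector and can be removed without
changing the canonical direction: for $Z' = \{ (x_1, y_1), (x_2, y_2) \}$,
we get $A_{Z'} = \{ w \in \RR^2 : 2 \leq 2 w_1 \}$ and therefore
$w_{Z'} = (1,0) = w_Z$.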
7178: \subsubsection{The non separable case}
7179: In the case when the training set $Z = (x_i, y_i)_{i=1}^N$
7180: is not linearly separable, we can define a noisy canonical
7181: hyperplane as follows. We can choose $w \in \RR^d$ and
7182: $b \in \RR$ to minimize
7183: \begin{equation}
7184: C(w,b) =
7185: \sum_{i=1}^N \bigl[ 1 - \bigl( \langle w, x_i \rangle - b \bigr)
7186: y_i \bigr]_+ + \tfrac{1}{2} \lVert w \rVert^2,
7187: \end{equation}
7188: where for any real number $r$, $r_+ = \max \{r, 0\}$ is
7189: the positive part of $r$.
7190: \newcommand{\Bw}{\overline{w}}
7191: \begin{thm}\mypoint
7192: Let us introduce the dual criterion
7193: $$
7194: F(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2}
7195: \biggl\lVert \sum_{i=1}^N y_i \alpha_i x_i \biggr\rVert^2
7196: $$
7197: and the domain
7198: $\ds
7199: \C{A}' = \biggl\{ \alpha \in \RR_+^N : \alpha_i \leq 1, i = 1, \dots, N,
7200: \sum_{i=1}^N y_i \alpha_i = 0 \biggr\}.
7201: $
7202: Let $\alpha^* \in \C{A}'$ be such that $ F(\alpha^*) = \sup_{\alpha \in
7203: \C{A}'} F(\alpha)$.
7204: Let $w^* = \sum_{i=1}^N y_i \alpha^*_i x_i$. There is
7205: a threshold $b^*$ (whose construction will be detailed
7206: in the proof), such that
7207: $$
7208: C(w^*, b^*) = \inf_{w \in \RR^d, b \in \RR}
7209: C(w, b).
7210: $$
7211: \end{thm}
7212: \begin{cor}\mypoint \!\!{\sc(scaled criterion)}
7213: For any positive real parameter $\lambda$
7214: let us consider the criterion
7215: $$
7216: C_{\lambda}(w,b) = \lambda^2
7217: \sum_{i=1}^N \bigl[ 1 - (\langle w, x_i \rangle - b ) y_i
7218: \bigr]_+ + \tfrac{1}{2} \lVert w \rVert^2
7219: $$
7220: and the domain
7221: $\ds
7222: \C{A}'_{\lambda} = \biggl\{
7223: \alpha \in \RR_+^N : \alpha_i \leq \lambda^2, i = 1, \dots, N,
7224: \sum_{i=1}^N y_i \alpha_i = 0 \biggr\}.
7225: $
For any solution $\alpha^*$ of the maximization problem
7227: $ F(\alpha^*) = \sup_{\alpha \in \C{A}'_{\lambda}} F(\alpha)$,
7228: the vector $w^* = \sum_{i=1}^N y_i \alpha^*_i x_i$
7229: is such that
7230: $$
7231: \inf_{b \in \RR} C_{\lambda}(w^*, b)
7232: = \inf_{w \in \RR^d, b \in \RR} C_{\lambda}(w, b).
7233: $$
7234: \end{cor}
7235: Let us remark that in the separable case, the scaled criterion is
7236: minimized by the canonical hyperplane for $\lambda$ large enough.
7237: This extension of the canonical hyperplane computation
7238: in dual space is often called {\em the box constraint},
7239: for obvious reasons.
7240:
7241: \noindent{\sc Proof.}
7242: The corollary is a straightforward consequence of
7243: the scale property $C_{\lambda}(w, b, x) = \lambda^2 C(\lambda^{-1}
7244: w, b, \lambda x)$, where we have made the dependence
7245: of the criterion in $x \in \RR^{d N}$ explicit.
7246: Let us come now to the proof of the theorem.
7247:
7248: The minimization of $C(w, b)$ can be performed in dual
7249: space extending the couple of parameters $(w, b)$
7250: to $\Bw = (w, b, \gamma) \in \RR^d \times \RR \times \RR_+^N$
7251: and introducing the dual multipliers $\alpha \in \RR_+^N$
7252: and the criterion
7253: $$
7254: G( \alpha, \Bw ) =
7255: \sum_{i = 1}^N \gamma_i + \sum_{i=1}^N \alpha_i
7256: \bigl\{ \bigl[ 1 - (\langle w, x_i \rangle - b ) y_i \bigr] - \gamma_i
7257: \bigr\} + \tfrac{1}{2} \lVert w \rVert^2.
7258: $$
7259: We see that
7260: $$
7261: C(w, b) = \inf_{\gamma \in \RR_+^N} \sup_{\alpha \in \RR_+^N}
7262: G\bigl[ \alpha, (w, b, \gamma) \bigr],
7263: $$
and therefore, putting $\ov{\C{W}} = \bigl\{ (w, b, \gamma) :
w \in \RR^d, b \in \RR, \gamma \in \RR_+^N \bigr\}$,
7266: we are led to solve the minimization problem
7267: $$
7268: G(\alpha_*, \Bw_*) = \inf_{\Bw \in \ov{\C{W}}} \sup_{\alpha \in \RR_+^N}
7269: G(\alpha, \Bw),
7270: $$
7271: whose solution $\Bw_* = (w_*, b_*, \gamma_*)$ is such that
$C(w_*, b_*) = \inf_{(w, b) \in \RR^{d+1}} C(w, b)$,
7273: according to the preceding identity.
7274: As for any value of $\alpha' \in \RR_+^N$,
7275: $$
7276: \inf_{\Bw \in \ov{\C{W}}} \sup_{\alpha \in \RR_+^N}
7277: G(\alpha, \ov{w}) \geq
7278: \inf_{\Bw \in \ov{\C{W}}} G(\alpha', \ov{w}),
7279: $$
7280: it is immediately seen that
7281: $$
7282: \inf_{\Bw \in \ov{\C{W}}} \sup_{\alpha \in \RR_+^N}
7283: G(\alpha, \ov{w}) \geq
7284: \sup_{\alpha \in \RR_+^N} \inf_{\Bw \in \ov{\C{W}}}
7285: G(\alpha, \ov{w}).
7286: $$
7287: We are going to show that there is no duality gap,
7288: meaning that this inequality is indeed an equality.
More importantly, we will do so by exhibiting
a saddle point which, solving the dual maximization
problem, will also solve the original one.
7292:
7293: Let us first make explicit the solution of the
7294: dual problem (the interest of this dual problem
7295: precisely lies in the fact that it can more easily
7296: be solved explicitly).
7297: Introducing the admissible set of values
7298: of $\alpha$,
7299: $$
7300: \C{A}' = \bigl\{ \alpha \in \RR^N : 0 \leq \alpha_i \leq
7301: 1, i = 1, \dots, N, \sum_{i=1}^N y_i \alpha_i = 0 \bigr\},
7302: $$
7303: it is elementary to check that
7304: $$
7305: \inf_{\Bw \in \ov{\C{W}}} G(\alpha, \Bw) =
7306: \begin{cases}\ds
7307: \inf_{w \in \RR^d} G \bigl[ \alpha, (w,0,0) \bigr],
7308: & \alpha \in \C{A}',\\
7309: - \infty, & \text{otherwise}.
7310: \end{cases}
7311: $$
7312: As
7313: $$
7314: G \bigl[ \alpha, (w, 0, 0) \bigr]
7315: = \tfrac{1}{2} \lVert w \rVert^2 + \sum_{i=1}^N \alpha_i \bigl(
7316: 1 - \langle w, x_i \rangle y_i \bigr),
7317: $$
7318: we see that $\inf_{w \in \RR^d} G\bigl[ \alpha, (w,0,0) \bigr]$
7319: is reached at
7320: $$
7321: w_{\alpha} = \sum_{i=1}^N y_i \alpha_i x_i.
7322: $$
7323: This proves that
7324: \newcommand{\BW}{\ov{\C{W}}}
7325: $$
\inf_{\Bw \in \BW} G(\alpha, \Bw) = F(\alpha), \qquad \alpha \in \C{A}'.
7327: $$
7328: The continuous map $\alpha \mapsto \inf_{\Bw \in \ov{\C{W}}}
G(\alpha, \Bw)$ reaches a (not necessarily unique) maximum
7330: $\alpha^*$
7331: on the compact convex set $\C{A}'$.
7332: We are now going to exhibit a choice of $\Bw^* \in \BW$
7333: such that $(\alpha^*, \Bw^*)$ is a {\em saddle point}.
7334: This means that we are going to show that
7335: $$
7336: G(\alpha^*, \Bw^*) =
7337: \inf_{\Bw \in \BW} G(\alpha^*, \Bw) =
7338: \sup_{\alpha \in \RR_+^N} G(\alpha, \Bw^*).
7339: $$
7340: It will imply that
7341: $$
\inf_{\Bw \in \BW} \sup_{\alpha \in \RR_+^N} G(\alpha, \Bw)
7343: \leq \sup_{\alpha \in \RR_+^N} G(\alpha, \Bw^*) = G(\alpha^*, \Bw^*)
7344: $$
7345: on the one hand and that
7346: $$
\inf_{\Bw \in \BW} \sup_{\alpha \in \RR_+^N} G(\alpha, \Bw)
7348: \geq \inf_{\Bw \in \BW} G(\alpha^*, \Bw) = G(\alpha^*, \Bw^*)
7349: $$
7350: on the other hand, proving that
7351: $$
7352: G(\alpha^*, \Bw^*) = \inf_{\Bw \in \BW} \sup_{\alpha \in \RR_+^N}
7353: G(\alpha, \Bw)
7354: $$
7355: as required.
7356:
7357: \noindent{\sc Construction of $\Bw^*$.}
7358: \begin{itemize}
7359: \item Let us put $w^* = w_{\alpha^*}$.
7360: \item If there is $j \in \{1, \dots, N \}$
7361: such that $0 < \alpha^*_j < 1$,
7362: let us put
7363: $$
7364: b^* = \langle x_j , w^* \rangle - y_j.
7365: $$
7366: Otherwise, let us put
7367: $$
7368: b^* = \sup \{ \langle x_i , w^* \rangle - 1 : \alpha^*_i > 0 , y_i = + 1,
7369: i = 1, \dots, N\}.
7370: $$
7371: \item Let us then put
7372: $$
7373: \gamma^*_i =
7374: \begin{cases}
7375: 0, & \alpha^*_i < 1,\\
7376: 1 - (\langle w^*, x_i \rangle - b^*)y_i, & \alpha^*_i = 1.
7377: \end{cases}
7378: $$
7379: \end{itemize}
7380: If we can prove that
7381: \begin{equation}
7382: \label{eq3.2}
7383: 1 - (\langle w^*, x_i \rangle - b^*)y_i
7384: \begin{cases}
7385: \leq 0, & \alpha^*_i = 0,\\
7386: = 0, & 0 < \alpha^*_i < 1,\\
7387: \geq 0, & \alpha^*_i = 1,
7388: \end{cases}
7389: \end{equation}
7390: it will show that $\gamma^* \in \RR_+^N$
7391: and therefore that $\Bw^* = (w^*, b^*, \gamma^*) \in \BW$.
7392: It will also show that
7393: $$
7394: G(\alpha, \Bw^*) = \sum_{i=1}^N \gamma^*_i
7395: + \sum_{i, \alpha^*_i = 0} \alpha_i \bigl[ 1 -
(\langle w^*, x_i \rangle - b^*) y_i \bigr]
+ \tfrac{1}{2} \lVert w^* \rVert^2,
7398: $$
7399: proving that
7400: $G(\alpha^*, \Bw^*) = \sup_{\alpha \in \RR_+^N} G(\alpha,
7401: \Bw^*)$. As obviously $G (\alpha^*, \Bw^*) = G \bigl[ \alpha^*,
7402: (w^*, 0 , 0) \bigr]$, we already know that
7403: $G(\alpha^*, \Bw^*) = \inf_{\Bw \in \BW} G(\alpha^*, \Bw)$.
7404: This will show that $(\alpha^*, \Bw^*)$ is the saddle
7405: point we were looking for, thus ending the proof of the
7406: theorem.
7407:
7408: \noindent{\sc Proof of equation \eqref{eq3.2}:} Let us deal first with the case when there is $j \in \{1, \dots, N\}$
7409: such that $0 < \alpha_j^* < 1$.
7410:
7411: For any $i \in \{1, \dots, N\}$
7412: such that $0< \alpha^*_i < 1$, there is $\epsilon > 0$ such
7413: that for any $t \in (-\epsilon, \epsilon)$, $\alpha^* + t y_i e_i - t y_j e_j
\in \C{A}'$, where $(e_k)_{k=1}^N$ is the canonical basis of $\RR^N$.
7415: Thus $\frac{\partial}{\partial t}_{|t=0} F(\alpha^* + t y_i e_i -
7416: t y_j e_j ) = 0$. Computing this derivative,
7417: we obtain
7418: \begin{align*}
7419: \frac{\partial}{\partial t}_{|t=0}
7420: F(\alpha^* + t y_i e_i - t y_j e_j)
7421: & = y_i - \langle w^*, x_i \rangle + \langle w^*, x_j \rangle - y_j \\
& = y_i \bigl[ 1 - \bigl(\langle w^*, x_i \rangle - b^* \bigr) y_i \bigr].
\end{align*}
Thus $1 - \bigl(\langle w^*, x_i \rangle - b^* \bigr) y_i = 0$,
7425: as required. This shows also that the definition of $b^*$ does not
7426: depend on the choice of $j$ such that $0 < \alpha^*_j < 1$.
7427:
7428: For any $i \in \{1, \dots, N\}$ such that $\alpha^*_i = 0$,
7429: there is $\epsilon > 0$ such that for any $t \in (0, \epsilon)$,
7430: $\alpha^* + t e_i - t y_i y_j e_j \in \C{A}'$.
7431: Thus $\frac{\partial}{\partial t}_{|t=0} F(\alpha^* + t e_i
7432: - t y_i y_j e_j) \leq 0$, showing that
7433: $1 - \bigl( \langle w^*, x_i \rangle - b^* \bigr) y_i \leq 0$ as
7434: required.
7435:
7436: For any $i \in \{1, \dots, N\}$ such that $\alpha^*_i
= 1$, there is $\epsilon > 0$ such that for any $t \in (0, \epsilon)$, $
7438: \alpha^* - t e_i + t y_i y_j e_j \in \C{A}'$.
7439: Thus $\frac{\partial}{\partial t}_{| t = 0} F(
7440: \alpha^* - t e_i + t y_i y_j e_j) \leq 0$, showing
7441: that $1 - \bigl( \langle w^*, x_i \rangle - b^* \bigr) y_i \geq 0$
as required. This completes the proof that $(\alpha^*, \Bw^*)$
is a saddle point in this case.
7444:
7445: Let us deal now with the case where $\alpha^* \in \{0, 1\}^N$.
7446: If we are not in the trivial case where the vector $(y_i)_{i=1}^N$
7447: is constant, the case $\alpha^* = 0$ is ruled out. Indeed,
7448: in this case, considering $\alpha^* + t e_i + t e_j$, where
7449: $y_i y_j = -1$, we would get the contradiction
7450: $2 = \frac{\partial}{\partial t}_{|t=0} F(\alpha^*+te_i+te_j)
7451: \leq 0$.
7452:
7453: Thus there are values of $j$ such that $\alpha^*_j = 1$,
and since $\sum_{i=1}^N \alpha^*_i y_i = 0$, both classes are
7455: present in the set $\{ j : \alpha^*_j = 1 \}$.
7456:
7457: Now for any $i, j \in \{1, \dots, N\}$ such that
7458: $\alpha^*_i = \alpha^*_j = 1$ and such that $y_i = +1$ and $y_j = -1$,
7459: $ \frac{\partial}{\partial t}_{|t=0} F( \alpha^* - t e_i
7460: - t e_j) = - 2 + \langle w^* , x_i \rangle - \langle
7461: w^*, x_j \rangle \leq 0$.
7462: Thus
7463: $$
7464: \sup \{ \langle w^*, x_i \rangle - 1 : \alpha^*_i = 1, y_i = +1 \}
7465: \leq \inf \{ \langle w^*, x_j \rangle + 1 : \alpha^*_j = 1, y_j = -1 \},
7466: $$
7467: showing that
7468: $$
1 - \bigl( \langle w^*, x_k \rangle - b^* \bigr) y_k \geq 0, \qquad \alpha^*_k = 1.
7470: $$
Finally, for any $i$ such that $\alpha^*_i = 0$,
7472: for any $j$ such that $\alpha^*_j = 1$ and
7473: $y_j = y_i$
7474: $$
7475: \frac{\partial}{\partial t}_{|t=0}F(\alpha^*
+ t e_i - t e_j) = y_i \langle w^*, x_j - x_i \rangle \leq 0,
7477: $$
7478: showing that $1 - \bigl( \langle w^*, x_i \rangle - b^* \bigr) y_i
\leq 0$. This completes the proof that $(\alpha^*, \Bw^*)$ is in all
circumstances a saddle point.
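\par\noindent{\sc Illustration.}
The construction can be followed by hand on a minimal case (the
numerical values are ours and are only meant as an illustration).
Take $d = 1$, $N = 2$, $x_1 = 0$, $y_1 = -1$, $x_2 = 1$ and $y_2 = +1$.
Then $\C{A}' = \{ (a, a) : 0 \leq a \leq 1 \}$ and
$F(a,a) = 2 a - \tfrac{1}{2} a^2$ is increasing on $[0,1]$, so that
$\alpha^* = (1,1)$, $w^* = 1$ and $F(\alpha^*) = \tfrac{3}{2}$.
As no index satisfies $0 < \alpha^*_j < 1$, the second definition of the
threshold applies and gives $b^* = \langle x_2, w^* \rangle - 1 = 0$,
whence $\gamma^*_1 = 1$ and $\gamma^*_2 = 0$. We get
$$
C(w^*, b^*) = 1 + 0 + \tfrac{1}{2} = \tfrac{3}{2} = F(\alpha^*).
$$
On the other hand, for any $(w, b) \in \RR^2$,
$$
C(w, b) = [1 - b]_+ + [1 - w + b]_+ + \tfrac{1}{2} w^2
\geq 2 - w + \tfrac{1}{2} w^2 = \tfrac{3}{2} + \tfrac{1}{2} (w - 1)^2
\geq \tfrac{3}{2},
$$
which confirms that $(w^*, b^*)$ minimizes $C$. With the scaled
criterion, the same computation gives
$\alpha^* = (a^*, a^*)$ where $a^* = \min \{ 2, \lambda^2 \}$, so that
for $\lambda^2 \geq 2$ we recover $w^* = 2$, which is the canonical
direction of this separable training set, since here
$A_Z = \{ w \in \RR : w \geq 2 \}$.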
7481:
7482: \subsubsection{Support Vector Machines}
7483: \begin{dfn}\mypoint
7484: The symmetric measurable kernel $K : \C{X} \times \C{X}
7485: \rightarrow \RR$ is said to
7486: be positive (or more precisely positive semi-definite) if
7487: for any $n \in \NN$, any $(x_i)_{i=1}^n \in \C{X}^n$,
7488: $$
7489: \inf_{\alpha \in \RR^n} \sum_{i=1}^n \sum_{j=1}^n \alpha_i K(x_i, x_j)
7490: \alpha_j \geq 0.
7491: $$
7492: \end{dfn}
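\noindent{\sc Illustration.}
As an elementary check of this definition (the notation $\rho$ is ours
and is introduced only for this illustration), consider $n = 2$ and two
points $x_1$, $x_2$ such that $K(x_1, x_1) = K(x_2, x_2) = 1$ and
$K(x_1, x_2) = K(x_2, x_1) = \rho$. Then
$$
\sum_{i=1}^2 \sum_{j=1}^2 \alpha_i K(x_i, x_j) \alpha_j
= \alpha_1^2 + 2 \rho \alpha_1 \alpha_2 + \alpha_2^2
= (\alpha_1 + \rho \alpha_2)^2 + (1 - \rho^2) \alpha_2^2,
$$
so that the required inequality holds for this pair of points
if and only if $\lvert \rho \rvert \leq 1$.
\par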
7493: Let $Z = (x_i,y_i)_{i=1}^N$ be some training set. Let us consider
7494: as previously
7495: $$
7496: \C{A} = \bigl\{ \alpha \in \RR_+^N \,: \sum_{i=1}^N \alpha_i y_i = 0 \bigr\}.
7497: $$
7498: Let
7499: $$
7500: F(\alpha) = \sum_{i=1}^N \sum_{j=1}^N \alpha_i y_iK(x_i,x_j)y_j \alpha_j
7501: - 2 \sum_{i=1}^N \alpha_i.
7502: $$
7503: \begin{dfn}\mypoint
Let $K$ be a positive symmetric kernel.
7505: The training set $Z$ is said to be $K$-separable
7506: if
7507: $$
7508: \inf \bigl\{ F(\alpha)\,: \alpha \in \C{A} \bigr\} > - \infty.
7509: $$
7510: \end{dfn}
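\noindent{\sc Illustration.}
A simple case where $K$-separability fails, whatever the kernel, is
the following (this example is ours and is only meant as an
illustration). Take $N = 2$, $x_1 = x_2$, $y_1 = -1$ and $y_2 = +1$.
The elements of $\C{A}$ are of the form $\alpha = (a, a)$, $a \in \RR_+$,
and since $K(x_i, x_j) = K(x_1, x_1)$ for any $i, j \in \{1, 2\}$,
$$
F(\alpha) = K(x_1, x_1) \Bigl( \sum_{i=1}^2 \alpha_i y_i \Bigr)^2
- 2 \sum_{i=1}^2 \alpha_i = - 4 a,
$$
so that $\inf \{ F(\alpha) \,: \alpha \in \C{A} \} = - \infty$:
two identical patterns with opposite labels can never be separated.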
7511: \begin{lemma}\mypoint
7512: When $Z$ is $K$-separable, $\inf\{ F(\alpha)\,: \alpha \in \C{A} \}$ is
7513: reached.
7514: \end{lemma}
7515: \begin{proof}
7516: Consider the training set $Z' = (x_i',y_i)_{i=1}^N$, where
7517: $$
7518: x_i' = \biggl\{ \biggl[ \Bigl\{ K(x_k,x_{\ell})\Bigr\}_{k=1, \ell=1}^{N
7519: \quad N} \biggr]^{1/2}(i,j) \biggr\}_{j=1}^N \in \RR^N.
7520: $$
7521: We see that $F(\alpha) = \left\lVert \sum_{i=1}^N \alpha_i y_i x_i'
7522: \right\rVert^2 - 2 \sum_{i=1}^N \alpha_i$.
7523: We have proved in the previous section that $Z'$ is linearly separable
7524: if and only if $\inf \{ F(\alpha)\,: \alpha \in \C{A} \} > - \infty$,
7525: and that the infimum is reached in this case.
7526: \end{proof}
7527:
7528: \begin{proposition}\mypoint
7529: \label{chap4Prop4.1} Let $K$ be a symmetric positive kernel and let
7530: $Z = (x_i, y_i)_{i=1}^N$ be some $K$-separable training set. Let
7531: $\alpha^* \in \C{A}$ be such that $F(\alpha^*)
7532: = \inf \{ F(\alpha) \,: \alpha \in \C{A} \}$.
7533: Let
7534: \begin{align*}
7535: I_-^* & = \{ i \in \NN\,:1 \leq i \leq N, y_i = -1, \alpha_i^* > 0 \}\\
7536: I_+^* & = \{ i \in \NN\,:1 \leq i \leq N, y_i = +1, \alpha_i^* > 0 \}\\
7537: b^* & = \frac{1}{2} \Bigl\{
7538: \sum_{j=1}^N \alpha_j^* y_j K(x_j,x_{i_-})
7539: + \sum_{j=1}^N \alpha_j^* y_j K(x_j,x_{i_+}) \Bigr\}, \qquad i_- \in
7540: I_-^*, i_+ \in I_+^*,
7541: \end{align*}
7542: where the value of $b^*$ does not depend on the choice of $i_-$ and
7543: $i_+$.
7544: The classification rule $f : \C{X} \rightarrow \C{Y}$
7545: defined by the formula
7546: $$
7547: f(x) = \sign \left( \sum_{i=1}^N \alpha_i^* y_i K(x_i,x) -
7548: b^* \right)
7549: $$
7550: is independent of the choice of $\alpha^*$ and is called
7551: the support vector machine defined by $K$ and $Z$.
7552: The set
7553: $\C{S} = \{ x_j\,: \sum_{i=1}^N \alpha_i^* y_i K(x_i,x_j) - b^* = y_j \}$
7554: is called the set of support vectors. For any choice of $\alpha^*$,
7555: $\{ x_i\,: \alpha_i^* > 0 \} \subset \C{S}$.
7556: \end{proposition}
7557: An important consequence of this proposition is that the support
7558: vector machine defined by $K$ and $Z$ is also the support vector
7559: machine defined by $K$ and $Z' = \{ (x_i, y_i) : \alpha^*_i > 0,
7560: 1 \leq i \leq N \}$, since this restriction of the index set
7561: contains the value $\alpha^*$ where the minimum of $F$ is reached.
7562:
7563: \begin{proof}
7564: The independence from the choice of $\alpha^*$, which is not
7565: necessarily unique, is seen as follows.
7566: Let $(x_i)_{i=1}^N$ and $x \in \C{X}$ be fixed.
7567: Let us put for ease of notations $x_{N+1} = x$.
7568: Let $M$ be the $(N+1) \times (N+1)$ symmetric
positive semi-definite matrix defined by $M(i,j) = K(x_i,x_j)$,
7570: $i=1,\dots, N+1$, $j=1, \dots, N+1$.
7571: Let us consider the mapping
7572: $\Psi : \{ x_i\,:i=1, \dots, N+1 \} \rightarrow \RR^{N+1}$
7573: defined by
7574: \begin{equation}
7575: \label{PsiDef}
7576: \Psi(x_i) = \bigl[M^{1/2}(i,j)\bigr]_{j=1}^{N+1} \in \RR^{N+1}.
7577: \end{equation}
7578: Let us consider the training set $Z' = \bigl[ \Psi(x_i),y_i \bigr]_{i=1}^N$.
7579: Then $Z'$ is linearly separable,
7580: $$F(\alpha) =
7581: \Bigl\lVert \sum_{i=1}^N \alpha_i y_i \Psi(x_i) \Bigr\rVert^2
7582: - 2 \sum_{i=1}^N \alpha_i,$$
7583: and we have proved that
7584: for any choice of $\alpha^* \in \C{A}$ minimizing $F(\alpha)$,
7585: \linebreak $w_{Z'} = \sum_{i=1}^N \alpha_i^* y_i \Psi(x_i)$.
7586: Thus the support vector machine defined by $K$ and $Z$ can also be expressed by the formula
7587: $$
f(x) = \sign \bigl[ \langle w_{Z'}, \Psi(x) \rangle - b_{Z'} \bigr],
7589: $$
7590: which does not depend on $\alpha^*$. The definition of $\C{S}$
7591: is such that $\Psi(\C{S})$ is the set of support vectors
7592: defined in the linear case, where its stated property has already been
proved.
7594: \end{proof}
7595:
7596: We can in the same way use the box constraint and show
7597: that any solution $\alpha^* \in \arg \min
7598: \{ F(\alpha) : \alpha \in \C{A}, \alpha_i \leq \lambda^2,
7599: i = 1, \dots, N \}$ minimizes
7600: \begin{multline}
7601: \label{eq3.4}
7602: \inf_{b \in \RR} \lambda^2 \sum_{i=1}^N \biggl[ 1 -
7603: \biggl( \sum_{j=1}^N y_j \alpha_j K(x_j, x_i) - b
7604: \biggr) y_i \biggr]_+ \\ + \frac{1}{2}
7605: \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j).
7606: \end{multline}
7607:
7608: \subsubsection{Building kernels}
7609:
The results of this section (except the last one) are drawn from
7611: \cite{Cristianini}. We have no reference for the last
7612: proposition of this section, although we believe it is well known.
7613: We include them for the convenience of the reader.
7614:
7615: \begin{prop}\mypoint
7616: Let $K_1$ and $K_2$ be positive symmetric kernels on $\C{X}$.
7617: Then for any $a \in \RR_+$
7618: \begin{align*}
7619: (a K_1 + K_2)(x,x') & \overset{\text{\rm def}}{=} a K_1(x,x')
7620: + K_2(x,x')\\
7621: \text{ and }(K_1 \cdot K_2)(x,x') &\overset{\text{\rm def}}{=}
7622: K_1(x,x') K_2(x,x')
7623: \end{align*}
7624: are also positive symmetric kernels.
7625: Moreover, for any measurable function \linebreak $g : \C{X} \rightarrow \RR$,
7626: $K_g(x,x') \overset{\text{\rm def}}{=} g(x)g(x')$ is also a positive symmetric kernel.
7627: \end{prop}
7628: \begin{proof}
7629: It is enough to prove the proposition in the case when $\C{X}$ is
7630: finite and kernels are just ordinary symmetric matrices.
7631: Thus we can assume without loss of generality that
$\C{X} = \{ 1, \dots, n\}$. Then for any $\alpha \in \RR^n$,
7633: using usual matrix notations,
7634: \begin{align*}
7635: \langle \alpha , (a K_1 + K_2) \alpha \rangle & =
7636: a \langle \alpha, K_1 \alpha \rangle + \langle \alpha , K_2 \alpha \rangle
7637: \geq 0,\\
7638: \langle \alpha, (K_1 \cdot K_2) \alpha \rangle & =
7639: \sum_{i,j} \alpha_i K_1(i,j) K_2(i,j) \alpha_j\\
7640: & = \sum_{i,j,k} \alpha_i K_1^{1/2}(i,k) K_1^{1/2}(k,j)K_2(i,j) \alpha_j
7641: \\ & = \sum_{k} \underbrace{\sum_{i,j} \bigl[K_1^{1/2}(k,i) \alpha_i \bigr] K_2(i,j)
7642: \bigl[K_1^{1/2}(k,j) \alpha_j \bigr]}_{
7643: \geq 0} \geq 0,\\
7644: \langle \alpha, K_g \alpha \rangle & = \sum_{i,j} \alpha_i g(i) g(j) \alpha_j
7645: = \left( \sum_i \alpha_i g(i) \right)^2 \geq 0.
7646: \end{align*}
7647: \end{proof}
7648:
7649: \begin{prop}\mypoint
7650: Let $K$ be some positive symmetric kernel on $\C{X}$. Let $p : \RR \rightarrow
7651: \RR$ be a polynomial with positive coefficients.
7652: Let $g : \C{X} \rightarrow \RR^d$ be a measurable function.
7653: Then
7654: \begin{align*}
7655: p(K)(x,x') & \overset{\text{def}}{=}
7656: p\bigl[ K(x,x')\bigr], \\
7657: \exp(K)(x,x') & \overset{\text{def}}{=}
7658: \exp \bigl[ K(x,x') \bigr]\\
7659: \text{ and } G_{g}(x,x') & \overset{\text{def}}{=}
7660: \exp \bigl( - \lVert g(x) - g(x') \rVert^2 \bigr)
7661: \end{align*}are all
7662: positive symmetric kernels.
7663: \end{prop}
7664: \begin{proof}
7665: The first assertion is a direct consequence of the previous proposition.
7666: The second one comes from the fact that the exponential function is
7667: the pointwise limit of a sequence of polynomial functions
7668: with positive coefficients.
7669: The third one is seen from the second one and the decomposition
7670: $$
7671: G_{g}(x,x') = \Bigl[ \exp\bigl( - \lVert g(x) \rVert^2 \bigr)
7672: \exp \bigl( - \lVert g(x') \rVert^2 \bigr) \Bigr]
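\exp \bigl[ 2 \langle g(x), g(x') \rangle \bigr],
$$
where the first factor is of the form $h(x) h(x')$ with
$h = \exp \bigl( - \lVert g \rVert^2 \bigr)$, and where
$(x, x') \mapsto \langle g(x), g(x') \rangle$ is a positive symmetric
kernel, being the sum of the kernels built in the same way from
each coordinate of $g$.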
7675: \end{proof}
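\noindent{\sc Illustration.}
Two familiar kernels on $\C{X} = \RR^d$ can be recovered from these
propositions (the parameter $\sigma$ and the coordinate functions $g_k$
are notations of ours, introduced only for this illustration).
Let $g_k(x)$ denote the $k$-th coordinate of $x \in \RR^d$. The kernel
$$
\langle x, x' \rangle = \sum_{k=1}^d g_k(x) g_k(x')
$$
is positive, as a sum of kernels of the form $K_{g_k}$. Since
$p(t) = (1 + t)^2 = 1 + 2 t + t^2$ has positive coefficients,
the polynomial kernel
$(x, x') \mapsto \bigl( 1 + \langle x, x' \rangle \bigr)^2$ is positive.
In the same way, choosing $g(x) = x / (\sigma \sqrt{2})$ for some
$\sigma > 0$ gives
$$
G_g(x, x') = \exp \Bigl( - \frac{\lVert x - x' \rVert^2}{2 \sigma^2} \Bigr),
$$
which shows that the Gaussian kernel is positive for any choice of
the scale parameter $\sigma$.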
7676: \begin{prop}\mypoint
7677: With the notations of the previous proposition,
7678: {\em any} training set $Z = (x_i,y_i)_{i=1}^N \in \bigl( \C{X}\times \{-1,+1\}
7679: \bigr)^N$ is $G_g$-separable as soon as $g(x_i)$, $i = 1, \dots, N$ are
7680: distinct points of $\RR^d$.
7681: \end{prop}
7682: \begin{proof}
7683: It is clearly enough to prove the case when $\C{X} = \RR^d$ and
7684: $g$ is the identity.
7685: Let us consider some other generic point $x_{N+1} \in \RR^d$
7686: and define $\Psi$ as in \eqref{PsiDef}.
It is enough to prove that
$\Psi(x_1), \dots, \Psi(x_N)$ are affine independent, since the
simplex, and therefore any affine independent set of points, can
be shattered by affine half-spaces. Let us assume that
$\Psi(x_1), \dots, \Psi(x_N)$ are affine dependent. This means that
for some $(\lambda_1, \dots, \lambda_N) \neq 0$ such that
$\sum_{i=1}^N \lambda_i = 0$,
7694: $$
7695: \sum_{i=1}^N \sum_{j=1}^N \lambda_i G(x_i, x_j) \lambda_j = 0.
7696: $$
7697: Thus, $(\lambda_i)_{i=1}^{N+1}$, where we have put $\lambda_{N+1} = 0$
7698: is in the kernel of the symmetric positive semi-definite matrix
7699: $G(x_i,x_j)_{i,j \in \{1, \dots, N+1\}}$. Therefore
7700: $$
7701: \sum_{i=1}^N \lambda_i G(x_i, x_{N+1}) = 0,
7702: $$
7703: for any $x_{N+1} \in \RR^d$. This would mean that
7704: the functions $x \mapsto \exp (- \lVert x - x_i \rVert^2)$ are
7705: linearly dependent, which can be easily proved to be false.
7706: Indeed, let $n \in \RR^d$ be such that $\lVert n \rVert = 1$
7707: and $\langle n, x_i \rangle$, $i = 1, \dots, N$ are distinct
7708: (such a vector exists, because it has to be outside the
7709: union of a finite number of hyperplanes, which is of zero
7710: Lebesgue measure on the sphere). Let us assume for
7711: a while that for some $(\lambda_i)_{i=1}^N \in \RR^N$,
7712: for any $x \in \RR^d$,
7713: $$
7714: \sum_{i=1}^N \lambda_i \exp( - \lVert x - x_i \rVert^2) = 0.
7715: $$
7716: Considering $x = t n$, for $t \in \RR$, we would get
7717: $$
7718: \sum_{i=1}^N \lambda_i \exp( 2 t \langle n, x_i \rangle
7719: - \lVert x_i \rVert^2 ) = 0, \qquad t \in \RR.
7720: $$
7721: Letting $t$ go to infinity, we see that this is only
7722: possible if $\lambda_i = 0$ for all values of $i$.
7723: \end{proof}
7724:
7725: \subsection{Bounds for Support Vector Machines}
7726:
7727: \subsubsection{Compression scheme bounds}
7728:
7729: We can use Support Vector Machines in the framework of compression
7730: schemes and apply Theorem \ref{thm2.3.3} on page \pageref{thm2.3.3}.
7731: More precisely, given some positive symmetric kernel $K$ on $\C{X}$,
7732: we may consider for any training set $Z' = (x_i',y_i')_{i=1}^h$
7733: the classifier $\Hat{f}_{Z'}: \C{X} \rightarrow \C{Y}$ which is
7734: equal to the Support Vector Machine defined by $K$ and $Z'$
7735: whenever $Z'$ is $K$-separable, and which is equal to some
7736: constant classification rule otherwise (we take this convention
7737: to stick to the framework described on page \pageref{compression}, we
7738: will only use $\Hat{f}_{Z'}$ in the $K$-separable case,
7739: so this extension of the definition is just a matter of
7740: presentation). In the application of Theorem \ref{thm2.3.3}
7741: in the case when the observed sample $(X_i,Y_i)_{i=1}^N$ is $K$-separable,
7742: a natural (if not always optimal) choice of $Z'$ is to choose for
7743: $(x_i')$ the set of support vectors defined by $Z = (X_i,Y_i)_{i=1}^N$
7744: and to choose for $(y_i')$ the corresponding values of $Y$.
7745: This is justified by the fact that $\Hat{f}_{Z}=\Hat{f}_{Z'}$,
7746: as shown in Proposition \ref{chap4Prop4.1} (page \pageref{chap4Prop4.1}).
7747: In the case when
7748: $Z$ is not $K$-separable,
7749: we can train a Support Vector Machine with the box constraint,
7750: then remove all the errors to obtain a $K$-separable subsample
7751: $Z' = \{ (X_i, Y_i) : \alpha^*_i < \lambda^2, 1 \leq i \leq N \}$,
7752: (using the same notations as in equation \eqref{eq3.4}
7753: on page \pageref{eq3.4})
7754: and then
7755: consider its support vectors as the compression set.
7756: Still using the notations of page \pageref{eq3.4},
7757: this means we have to compute successively
7758: $\alpha^* \in \arg\min \{ F(\alpha) : \alpha \in \C{A},
7759: \alpha_i \leq \lambda^2 \}$, and $\alpha^{**}
7760: \in \arg \min \{ F(\alpha) : \alpha \in \C{A},
7761: \alpha_i = 0 \text{ when } \alpha^*_i = \lambda^2 \}$,
to finally keep the compression set indexed by
7763: $J = \{ i : 1 \leq i \leq N, \alpha^{**}_i > 0 \}$,
7764: and the corresponding Support Vector Machine $\w{f}_{J}$.
7765: Different values of $\lambda$ can be used at this
7766: stage, producing different candidate compression
sets: when $\lambda$ increases, the number of
errors should decrease; on the other hand, when
$\lambda$ decreases, the margin $\lVert w \rVert^{-1}$
of the separable subset $Z'$
increases, supporting the hope for a smaller set of
support vectors. Thus we can use $\lambda$
to monitor the number of errors on the training set
we accept from the compression scheme.
7775: As we can use whatever heuristic we want while
7776: selecting the compression set, we can also try
7777: to threshold in the previous construction $\alpha_i^{**}$
7778: at different levels $\eta \geq 0$, to produce candidate
7779: compression sets
7780: $J_{\eta} = \{ i : 1 \leq i \leq N, \alpha^{**}_i > \eta \}$
7781: of various sizes.
7782:
7783: As the size $\lvert J \rvert$ of the compression
7784: set is random in this construction, we have to
7785: use a version of Theorem \ref{thm2.3.3} (page
7786: \pageref{thm2.3.3}) which handles compression
7787: sets of arbitrary sizes. This is done by choosing
7788: for each $k$ a $k$-partially exchangeable posterior distribution
7789: $\pi_k$ which weights the compression sets of all dimensions.
7790: We immediately see that we can choose $\pi_k$ such that
7791: $- \log \bigl[ \pi_k (\Delta_k(J)) \bigr]
7792: \leq \log \bigl[ \lvert J \rvert (\lvert J \rvert + 1)
7793: \bigr] + \lvert J \rvert \log \Bigl[
7794: \tfrac{(k+1)eN}{\lvert J \rvert} \Bigr]$.
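One possible way to obtain such a bound, given here as a sketch under the
assumption that a compression set of size $h$ is described by a subset of
size $h$ of the $(k+1)N$ indices of the extended sample, is to give
the mass $\frac{1}{h(h+1)}$ to the compression sets of size $h$ and to spread
it uniformly among them. Since $\sum_{h \geq 1} \frac{1}{h(h+1)} = 1$,
the total mass does not exceed one, and since
$$
\binom{(k+1)N}{h} \leq \Bigl[ \tfrac{(k+1)eN}{h} \Bigr]^h,
\qquad 1 \leq h \leq (k+1)N,
$$
a compression set $J$ receives a mass at least equal to
$$
\frac{1}{\lvert J \rvert (\lvert J \rvert + 1)}
\Bigl[ \tfrac{(k+1)eN}{\lvert J \rvert} \Bigr]^{- \lvert J \rvert}.
$$
For instance, when $k = 1$, $N = 1000$ and $\lvert J \rvert = 20$,
the right-hand side of the bound is equal to
$\log(420) + 20 \log(100\, e) \simeq 6.04 + 112.10 = 118.14$.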
7795:
7796: If we observe the shadow sample patterns, and if computer
7797: resources permit, we can of
7798: course use more elaborate bounds than Theorem \ref{thm2.3.3},
such as the transductive counterpart of Theorem \ref{thm1.24}
(page \pageref{thm1.24}) (where we may consider the submodels
made of all the compression sets of the same size). Theorems
based on relative bounds, such as Theorem \ref{thm1.59}
(page \pageref{thm1.59}), can also be used. Gibbs distributions
7804: can be approximated by Monte Carlo techniques, where
7805: a Markov chain with the proper invariant measure
7806: consists in suitable local perturbations of the
7807: compression set.
7808:
7809: Let us mention also that the use of compression schemes based
7810: on Support Vector Machines
7811: can be tailored to perform some kind of {\em feature aggregation}.
7812: Imagine that the kernel $K$ is defined as the scalar
7813: product in $L_2(\pi)$, where $\pi \in \C{M}_+^1(\Theta)$.
7814: More precisely let us consider for some set of
7815: soft classification rules $\bigl\{ f_{\theta} : \C{X} \rightarrow
7816: \RR\,; \theta \in \Theta \bigr\}$ the kernel
7817: $$
7818: K(x,x') = \int_{\theta \in \Theta} f_{\theta}(x) f_{\theta}(x')
7819: \pi(d \theta).
7820: $$
7821: In this setting, the Support Vector Machine
7822: applied to the training set $Z = (x_i, y_i)_{i=1}^N$
7823: has the form
7824: $$
7825: f_{Z}(x) = \sign \left( \int_{\theta \in \Theta} f_{\theta}(x)
7826: \sum_{i=1}^N y_i \alpha_i
7827: f_{\theta}(x_i) \pi(d \theta) - b \right)
7828: $$
and, should it be too burdensome to compute,
7830: we can replace it with some finite approximation
7831: $$
7832: \widetilde{f}_{Z}(x) = \sign \left(
7833: \sum_{k=1}^m f_{\theta_k}(x) w_k - b \right),
7834: $$
7835: where the set $\{\theta_k,\, k=1, \dots, m\}$ and the
7836: weights $\{ w_k,\,k=1, \dots, m\}$ are computed
7837: in some suitable way from $Z' = (x_i, y_i)_{i , \alpha_i > 0}$,
7838: the set of support vectors
7839: of $f_Z$. For instance,
7840: we can draw $\{ \theta_k,\,k=1, \dots, m\}$ at random according to
7841: the probability distribution proportional to
7842: $$
7843: \left\lvert \sum_{i=1}^N y_i \alpha_i f_{\theta}(x_i) \right\rvert
7844: \pi(d \theta),
7845: $$
define the weights $w_k$ by
$$
w_k = \frac{1}{m}
\sign \left( \sum_{i=1}^N y_i \alpha_i f_{\theta_k}(x_i)
\right) \int_{\theta \in \Theta} \left\lvert
\sum_{i = 1}^N y_i \alpha_i f_{\theta}(x_i) \right\rvert \pi(d\theta),
$$
7853: and choose the smallest value of $m$ for which this approximation
7854: still classifies $Z'$ without errors.
7855: Let us remark that we have built
7856: $\widetilde{f}_Z$ in such a way that
7857: $$
7858: \lim_{m \rightarrow + \infty}
7859: \widetilde{f}_Z(x_i) = f_Z(x_i) = y_i, \quad \text{a.s.}
7860: $$ for any support index
7861: $i$ such that $\alpha_i > 0$.
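
The reason behind this choice of sampling distribution and weights can be
sketched as follows. Let us put
$\varphi(\theta) = \sum_{i=1}^N y_i \alpha_i f_{\theta}(x_i)$ and
$C = \int_{\theta \in \Theta} \lvert \varphi(\theta) \rvert \pi(d \theta)$
(these notations are introduced only for this remark), and let us assume
that $C$ is finite and positive, that
$\theta \mapsto f_{\theta}(x) \varphi(\theta)$ is $\pi$-integrable and that
$\theta_1, \dots, \theta_m$ are drawn independently.
When $\theta_1$ is distributed according to
$C^{-1} \lvert \varphi(\theta) \rvert \pi(d \theta)$,
\begin{align*}
\EE \Bigl\{ C \sign \bigl[ \varphi(\theta_1) \bigr] f_{\theta_1}(x) \Bigr\}
& = \int_{\theta \in \Theta} C \sign \bigl[ \varphi(\theta) \bigr]
f_{\theta}(x) \, C^{-1} \lvert \varphi(\theta) \rvert \pi(d \theta) \\
& = \int_{\theta \in \Theta} f_{\theta}(x) \varphi(\theta) \pi(d \theta).
\end{align*}
Under these assumptions, the law of large numbers shows that the empirical mean
$\frac{1}{m} \sum_{k=1}^m C \sign \bigl[ \varphi(\theta_k) \bigr]
f_{\theta_k}(x)$ converges almost surely
to the integral appearing in the definition of $f_Z(x)$.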
7862:
Alternatively, given $Z'$, we can select a finite set of features
$\Theta' \subset \Theta$ such that $Z'$ is $K_{\Theta'}$-separable,
where
$K_{\Theta'}(x,x') = \sum_{\theta \in \Theta'}
f_{\theta}(x) f_{\theta}(x')$,
and consider the Support Vector Machine $f_{Z'}$ built with the
kernel $K_{\Theta'}$. As soon as $\Theta'$ is chosen as a function
7870: of $Z'$ only, Theorem \ref{thm2.3.3} (page \pageref{thm2.3.3}) applies
7871: and provides
7872: some level of confidence for the risk of $f_{Z'}$.
7873:
\subsubsection{The Vapnik--\v{C}ervonenkis dimension
of a family of subsets}
7876:
7877: Let us consider some set $X$ and some set
7878: $S \subset \{0,1\}^X$ of subsets of $X$.
7879: Let $h(S)$ be the VC dimension of $S$, defined as
7880: $$
7881: h(S) = \max \{ \lvert A \rvert : A \text{ finite and }
7882: A \cap S = \{0,1\}^{A} \},
7883: $$
7884: where by definition $A \cap S = \{ A \cap B : B \in S \}$.
7885: Let us notice that this definition does not depend on
7886: the choice of the reference set $X$. Indeed $X$ can
7887: be chosen to be $\bigcup S$, the union of all the sets in $S$
7888: or any bigger set. Let us notice also that for any set $B$,
7889: $h(B \cap S) \leq h(S)$, the reason being that
7890: $A \cap (B \cap S) = B \cap (A \cap S)$.
7891:
7892: This notion of VC dimension is useful because
7893: it can, as we will see about Support Vector
7894: Machines, be computed in some important special cases.
7895: Let us prove here as an illustration that
7896: $h(S) = d+1$ when $X = \RR^d$
7897: and $S$ is made of all the half spaces :
7898: $$
7899: S = \{ A_{w,b}\,: w \in \RR^d, b \in \RR \},
7900: \text{ where } A_{w,b} = \{ x \in X \,:
7901: \langle w, x \rangle \geq b \}.
7902: $$
7903: \begin{prop}\mypoint
7904: With the previous notations, $h(S) = d+1$.
7905: \end{prop}
7906: \begin{proof}
7907: Let $(e_i)_{i=1}^{d+1}$ be the canonical base of $\RR^{d+1}$,
7908: and let $X$ be the affine subspace it generates, which
7909: can be identified with $\RR^d$. For any $(\epsilon_i)_{i=1}^{d+1}
7910: \in \{-1,+1\}^{d+1}$, let $w = \sum_{i=1}^{d+1} \epsilon_i e_i$
7911: and $b = 0$. The half space $A_{w,b} \cap X$ is such that
7912: $\{e_i\,; i=1, \dots, d+1 \} \cap (A_{w,b} \cap X) = \{ e_i \,;
7913: \epsilon_i = +1 \}$. This proves that $h(S) \geq d + 1$.
7914:
7915: To prove that $h(S) \leq d + 1$, we have to show that
7916: for any set $A \subset \RR^d$
7917: of size $|A| = d+2$, there is $B \subset A$ such
7918: that $B \not\in (A \cap S)$. This will obviously
7919: be the case if the convex hulls of $B$ and $A \setminus
7920: B$ have a non empty intersection : indeed if a hyperplane
7921: separates two sets of points, it also separates
7922: their convex hulls. As $\lvert A \rvert
7923: > d+1$, $A$ is affine dependent : there is
7924: $(\lambda_x)_{x \in A} \in \RR^{d+2} \setminus
7925: \{0\}$ such that
7926: $\sum_{x \in A} \lambda_x x = 0$ and $\sum_{x \in A}
7927: \lambda_x = 0$. The set
7928: $B = \{ x \in A\,: \lambda_x > 0\}$ is non-empty,
7929: as well as its complement $A \setminus B$,
7930: because $\sum_{x \in A} \lambda_x = 0$ and $\lambda \neq
7931: 0$. Moreover $\sum_{x \in B} \lambda_x =
7932: \sum_{x \in A \setminus B} - \lambda_x > 0$.
7933: The relation
7934: $$
7935: \frac{1}{\sum_{x \in B} \lambda_x} \sum_{x \in B}
7936: \lambda_x x = \frac{1}{\sum_{x \in B} \lambda_x}
7937: \sum_{x \in A \setminus B} - \lambda_x x
7938: $$
7939: shows that the convex hulls of $B$ and $A \setminus B$
7940: have a non void intersection.
7941: \end{proof}
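
As an illustration of the second part of the proof in the case $d = 2$,
consider the set $A = \{ (0,0), (1,1), (1,0), (0,1) \}$ with
$\lambda_{(0,0)} = \lambda_{(1,1)} = 1$ and
$\lambda_{(1,0)} = \lambda_{(0,1)} = -1$. Then
$\sum_{x \in A} \lambda_x = 0$, $\sum_{x \in A} \lambda_x x = 0$,
$B = \{ (0,0), (1,1) \}$ and
$$
\tfrac{1}{2} \bigl[ (0,0) + (1,1) \bigr]
= \bigl( \tfrac{1}{2}, \tfrac{1}{2} \bigr)
= \tfrac{1}{2} \bigl[ (1,0) + (0,1) \bigr].
$$
Any half plane containing $B$ contains this common point, and therefore
contains $(1,0)$ or $(0,1)$: the four vertices of the unit square
cannot be shattered by half planes.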
7942:
Let us introduce the function of two integers
$$
\Phi_n^h = \sum_{k=0}^h \binom{n}{k}.
$$
7947: Let us notice that $\Phi$ can alternatively be defined
7948: by the relations :
7949: $$
7950: \Phi_n^h =
7951: \begin{cases}
7952: 2^n & \text{ when } n \leq h,\\
7953: \Phi_{n-1}^{h-1} + \Phi_{n-1}^h & \text{ when } n > h.
7954: \end{cases}
7955: $$
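For instance, $\Phi_4^2 = 1 + 4 + 6 = 11$, whereas
$\Phi_3^1 + \Phi_3^2 = (1 + 3) + (1 + 3 + 3) = 11$, in accordance with
the second relation, which is nothing but Pascal's rule
$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$ summed over
$k = 0, \dots, h$ (with the convention $\binom{n-1}{-1} = 0$).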
7956: \begin{thm}\mypoint
7957: \label{th1}
7958: Whenever $\bigcup S$ is finite,
7959: $$
\lvert S \rvert \leq \Phi_{\lvert \bigcup S \rvert}^{h(S)}.
7962: $$
7963: \end{thm}
7964: \begin{thm}\mypoint
7965: \label{th2}
For any $h \leq n/2$,
7967: $$
7968: \Phi_n^h \leq \exp \bigl( n H(\tfrac{h}{n}) \bigr)
7969: \leq \exp \bigl[ h \bigl( \log ( \tfrac{n}{h} ) + 1 \bigr) \bigr],
7970: $$
7971: where $H(p) = - p \log(p) - (1-p)\log(1-p)$ is the Shannon
7972: entropy of the Bernoulli distribution with parameter $p$.
7973: \end{thm}
7974: {\sc Proof of theorem \ref{th1}.}
7975: Let us prove this theorem by induction on $\left\lvert \bigcup
7976: S \right\rvert$. It is easy to check that it holds
7977: true when $\left\lvert \bigcup
7978: S \right\rvert = 1$.
7979: Let $X = \bigcup S$, let
7980: $x \in X$ and $X' = X \setminus \{x\}$. Define ($\bigtriangleup$
7981: denoting the symmetric difference of two sets)
7982: \begin{align*}
7983: S' & = \{ A \in S : A \bigtriangleup \{x\} \in S \},\\
7984: S'' & = \{ A \in S : A \bigtriangleup \{x\} \not\in S \}.
7985: \end{align*}
7986: Clearly, $\sqcup$ denoting the disjoint union,
7987: $S = S' \sqcup S''$ and $S \cap X' = (S' \cap X')
7988: \sqcup (S'' \cap X')$. Moreover $\lvert S' \rvert =
7989: 2 \lvert S' \cap X' \rvert$ and $\lvert S'' \rvert = \lvert
7990: S'' \cap X' \rvert$. Thus $\lvert S \rvert =
7991: \lvert S' \rvert + \lvert S'' \rvert = 2 \lvert S' \cap X' \rvert
7992: + \lvert S'' \rvert = \lvert S \cap X' \rvert + \lvert S' \cap
7993: X' \rvert$. Obviously $h(S \cap X') \leq h(S)$. Moreover
7994: $h(S' \cap X') = h(S') - 1$, because if $A \subset X'$
7995: is shattered by $S'$ (or equivalently by $S' \cap X'$),
7996: then $A \cup \{x\}$ is shattered by $S'$ (we say that $A$
7997: is shattered by $S$ when $S \cap A = \{0,1\}^A$).
Using the induction hypothesis, we then see that
$\lvert S \rvert \leq \Phi_{\lvert X' \rvert}^{h(S)}
+ \Phi_{\lvert X' \rvert}^{h(S)-1}$. But as $\lvert X' \rvert =
\lvert X \rvert - 1$, the right-hand side of this inequality
is equal to $\Phi_{\lvert X \rvert}^{h(S)}$, according to
the recurrence equation satisfied by $\Phi$.
8004:
{\sc Proof of theorem \ref{th2}.}
8006: This is the well known Chernoff bound for the deviation of sums
8007: of Bernoulli r.v.: let $(\sigma_1, \dots, \sigma_n)$ be i.i.d.
8008: Bernoulli r.v. with parameter $1/2$. Let us notice that
8009: $$
8010: \Phi_n^h = 2^n \PP \left( \sum_{i=1}^n \sigma_i \leq h \right).
8011: $$
For any positive real number $\lambda$,
\begin{align*}
\PP \left( \sum_{i=1}^n \sigma_i \leq h \right) & \leq \exp (\lambda h) \EE \left[
\exp \left( - \lambda \sum_{i=1}^n \sigma_i \right) \right] \\ & =
8016: \exp \Bigl\{ \lambda h + n \log \bigl\{
8017: \EE \bigl[ \exp \bigl( - \lambda \sigma_1 \bigr)
8018: \bigr] \bigr\} \Bigr\}.
8019: \end{align*}
8020: Differentiating the right-hand side in $\lambda$ shows that its
8021: minimal value is \linebreak
8022: $\exp \bigl[ - n \C{K}(\tfrac{h}{n},\tfrac{1}{2}) \bigr]$,
8023: where $\C{K}(p,q) = p \log(\tfrac{p}{q}) + (1-p) \log(\tfrac{1-p}{1-q})$
8024: is the Kullback divergence function between two Bernoulli distributions
8025: $B_p$ and $B_q$
8026: of parameters $p$ and $q$. Indeed the optimal value $\lambda^*$ of $\lambda$
8027: is such that $h = n \frac{\EE \bigl[\sigma_1 \exp ( - \lambda^* \sigma_1)
8028: \bigr]}{\EE \bigl[ \exp ( - \lambda^* \sigma_1) \bigr]}
8029: = n B_{h/n}(\sigma_1)$. Therefore (using the fact that two Bernoulli
8030: distributions with the same expectations are equal)
8031: $$
8032: \log \bigl\{ \EE \bigl[ \exp ( - \lambda^* \sigma_1)\bigr] \bigr\}
8033: = - \lambda^* B_{h/n}(\sigma_1) - \C{K}(B_{h/n},B_{1/2}) =
8034: - \lambda^* \tfrac{h}{n} - \C{K}(\tfrac{h}{n},\tfrac{1}{2}).
8035: $$
8036: The announced result then follows from
8037: the identity
8038: \begin{multline*}
8039: H(p) = \log(2) - \C{K}(p,\tfrac{1}{2}) \\= p \log(p^{-1})
8040: + (1- p) \log(1 + \frac{p}{1-p}) \leq p \bigl[ \log(p^{-1})+1\bigr].
8041: \end{multline*}
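As a numerical illustration, for $n = 10$ and $h = 2$, we get
$\Phi_{10}^2 = 1 + 10 + 45 = 56$, whereas
$$
\exp \bigl( 10\, H(\tfrac{1}{5}) \bigr) \simeq 149.0
\quad \text{and} \quad
\exp \bigl[ 2 \bigl( \log(5) + 1 \bigr) \bigr] \simeq 184.7.
$$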
8042:
8043: \subsubsection{VC dimension of linear rules with margin}
8044: The proof of the following theorem has been suggested to us
8045: by a similar proof presented in \cite{Cristianini}.
8046: \begin{thm}\mypoint
8047: \label{chap5Th1.1}
8048: Consider a family of points $(x_1, \dots, x_n)$ in some Euclidean
8049: vector space $E$ and a family of affine functions
8050: $$
8051: \C{H} = \bigl\{ g_{w,b} : E \rightarrow \RR\,; w \in E, \lVert w \rVert = 1,
8052: b \in \RR \bigr\},
8053: $$
8054: where
8055: $$
8056: g_{w,b}(x) = \langle w, x \rangle - b, \qquad x \in E.
8057: $$
8058:
8059: Assume that there is a set of thresholds $(b_i)_{i=1}^n
8060: \in \RR^n$ such that for any \linebreak $(y_i)_{i=1}^n \in \{-1,+1\}^n$,
8061: there is $g_{w,b} \in \C{H}$ such that
8062: $$
\min_{i=1, \dots, n} \bigl( g_{w,b}(x_i) - b_i \bigr) y_i \geq
8064: \gamma.
8065: $$
8066: Let us also introduce the empirical variance of $(x_i)_{i=1}^n$,
8067: $$
8068: \Var(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^n
8069: \biggl\lVert x_i - \frac{1}{n} \sum_{j=1}^n x_j \biggr\rVert^2.
8070: $$
8071: In this case and with these notations,
8072: \begin{equation}
8073: \label{firstPart}
8074: \frac{\Var(x_1, \dots, x_n)}{\gamma^2} \geq
8075: \begin{cases}
8076: n-1 & \text{ when } n \text{ is even,}\\
8077: (n-1) \frac{n^2 - 1}{n^2} & \text{ when } n \text{ is odd.}
8078: \end{cases}
8079: \end{equation}
8080: Moreover, equality is reached when $\gamma$ is optimal,
8081: $b_i = 0$, $i = 1, \dots, n$
8082: and $(x_1, \dots, x_n)$
8083: is a regular simplex
8084: (i.e. when $2 \gamma$ is the minimum distance
8085: between the convex hulls of any two subsets of $\{x_1, \dots, x_n\}$
8086: and $\lVert x_i - x_j \rVert$ does not depend on $i \neq j$).
8087: \end{thm}
8088: \begin{proof}
8089: Let $(s_i)_{i=1}^n \in \RR^n$ be such that $\sum_{i=1}^n s_i = 0$.
8090: Let $\sigma$ be a uniformly distributed random variable with values
8091: in $\mathfrak{S}_{n}$, the set of permutations of the first $n$
8092: integers $\{1, \dots, n \}$. By assumption, for any value of $\sigma$,
8093: there is an affine function $g_{w,b} \in \C{H}$ such that
8094: $$
8095: \min_{i=1, \dots, n} \bigl[ g_{w,b}(x_i) - b_i \bigr] \bigl[
8096: 2 \B{1}(s_{\sigma(i)} > 0) - 1 \bigr] \geq \gamma.
8097: $$
8098: As a consequence
8099: \begin{align*}
8100: \left\langle \sum_{i=1}^n s_{\sigma(i)} x_i, w \right\rangle
8101: & =
8102: \sum_{i=1}^n s_{\sigma(i)} \bigl( \langle x_i, w \rangle - b - b_i\bigr)
8103: + \sum_{i=1}^n s_{\sigma(i)} b_i\\
8104: & \geq \sum_{i=1}^n
8105: \gamma \lvert s_{\sigma(i)} \rvert + s_{\sigma(i)} b_i.
8106: \end{align*}
8107: Therefore, using the fact that the map $x \mapsto
8108: \Bigl(\max \bigl\{0,x\bigr\}\Bigr)^2$ is convex,
8109: \begin{multline*}
8110: \EE \left(
8111: \biggl\lVert \sum_{i=1}^n s_{\sigma(i)} x_i \biggr\rVert^2 \right)
8112: \geq
8113: \EE \left[ \left( \max \left\{ 0,
8114: \sum_{i=1}^n \gamma \lvert s_{\sigma(i)} \rvert + s_{\sigma(i)} b_i
8115: \right\} \right)^2 \right] \\ \geq
8116: \left(\max \left\{ 0, \sum_{i=1}^n \gamma \EE \bigl(
8117: \lvert s_{\sigma(i)} \rvert \bigr) + \EE \bigl( s_{\sigma(i)} \bigr)
8118: b_i \right\} \right)^2
8119: = \gamma^2 \left( \sum_{i=1}^n \lvert s_i \rvert \right)^2,
8120: \end{multline*}
8121: where $\EE$ is the expectation with respect to the random permutation
8122: $\sigma$.
8123: On the other hand
8124: $$
8125: \EE \left( \biggl\lVert \sum_{i=1}^n s_{\sigma(i)} x_i \biggr\rVert^2 \right)
8126: = \sum_{i=1}^n \EE(s_{\sigma(i)}^2) \lVert x_i \rVert^2 +
8127: \sum_{i\neq j} \EE(s_{\sigma(i)} s_{\sigma(j)}) \langle x_i, x_j \rangle.
8128: $$
8129: Moreover
8130: $$
8131: \EE ( s_{\sigma(i)}^2 ) = \frac{1}{n} \EE \left(
8132: \sum_{i=1}^n s_{\sigma(i)}^2 \right) = \frac{1}{n} \sum_{i=1}^n
8133: s_i^2.
8134: $$
8135: In the same way, for any $i \neq j$,
8136: \begin{align*}
8137: \EE \left( s_{\sigma(i)} s_{\sigma(j)} \right) & =
8138: \frac{1}{n(n-1)} \EE \left( \sum_{i \neq j} s_{\sigma(i)} s_{\sigma(j)}
8139: \right) \\ & = \frac{1}{n(n-1)} \sum_{i\neq j} s_i s_j\\
8140: & = \frac{1}{n(n-1)} \Biggl[
8141: \Biggl( \underbrace{\sum_{i=1}^n s_i}_{=0} \Biggr)^2 - \sum_{i=1}^n s_i^2
8142: \Biggr] \\ & = - \frac{1}{n(n-1)} \sum_{i=1}^n s_i^2.
8143: \end{align*}
8144: Thus
8145: \begin{align*}
8146: \EE \left( \biggl\lVert \sum_{i=1}^n s_{\sigma(i)} x_i \biggr\rVert^2 \right)
8147: & = \left( \sum_{i=1}^n s_i^2 \right) \left[ \frac{1}{n} \sum_{i=1}^n \lVert
8148: x_i \rVert^2 -
8149: \frac{1}{n(n-1)} \sum_{i\neq j} \langle x_i, x_j \rangle \right] \\ & =
8150: \left( \sum_{i=1}^n s_i^2 \right) \Biggl[
8151: \left( \frac{1}{n} + \frac{1}{n(n-1)} \right) \sum_{i=1}^n \lVert x_i \rVert^2
8152: \\ & \qquad - \frac{1}{n(n-1)} \biggl\lVert \sum_{i=1}^n x_i
8153: \biggr\rVert^2 \Biggr] \\ & =
8154: \frac{n}{n-1} \left( \sum_{i=1}^n s_i^2 \right) \Var(x_1, \dots, x_n).
8155: \end{align*}
8156: We have proved that
8157: $$
8158: \frac{\Var(x_1, \dots, x_n)}{\gamma^2} \geq \frac{\ds (n-1) \biggl(
8159: \sum_{i=1}^n \lvert s_i \rvert \biggr)^2}{\ds n \sum_{i=1}^n s_i^2}.
8160: $$
8161: This can be used with $s_i = \B{1}( i \leq \frac{n}{2}) - \B{1}(
8162: i > \frac{n}{2})$ in the case when $n$ is even and
8163: $s_i = \frac{2}{(n-1)} \B{1}( i \leq \frac{n-1}{2} ) -
8164: \frac{2}{n+1} \B{1}(i > \frac{n-1}{2} )$ in the case when
8165: $n$ is odd to establish the first inequality \eqref{firstPart} of the theorem.
8166:
8167: Checking that equality is reached for the simplex is an easy computation
8168: when the simplex $(x_i)_{i=1}^n \in (\RR^n)^n$ is parametrized in such a
8169: way that
8170: $$
8171: x_i(j) = \begin{cases}
8172: 1 & \text{ if } i = j,\\
8173: 0 & \text{ otherwise.}
8174: \end{cases}
8175: $$
8176: Indeed the distance between the convex hulls of any two subsets of
8177: the simplex is the distance between their mean values (i.e. centers of mass).
8178: \end{proof}
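
For the reader's convenience, here is a sketch of this computation.
With the above parametrization,
$\Var(x_1, \dots, x_n) = 1 - \frac{1}{n}$, and the squared distance
between the center of mass of a subset of $p \in \{1, \dots, n-1\}$
vertices and the center of mass of the $n-p$ remaining ones is
$$
\frac{1}{p} + \frac{1}{n-p}.
$$
This quantity is minimal when $p = \frac{n}{2}$ if $n$ is even and
when $p = \frac{n-1}{2}$ if $n$ is odd, so that
$$
4 \gamma^2 =
\begin{cases}
\frac{4}{n} & \text{ when } n \text{ is even,}\\
\frac{2}{n-1} + \frac{2}{n+1} = \frac{4n}{n^2-1} & \text{ when } n \text{ is odd,}
\end{cases}
$$
and therefore
$$
\frac{\Var(x_1, \dots, x_n)}{\gamma^2} =
\begin{cases}
\frac{n-1}{n} \times n = n - 1 & \text{ when } n \text{ is even,}\\
\frac{n-1}{n} \times \frac{n^2 - 1}{n} = (n-1) \frac{n^2-1}{n^2}
& \text{ when } n \text{ is odd.}
\end{cases}
$$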
8179:
8180: \subsubsection{Application to Support Vector Machines}
8181:
8182: We are going to apply Theorem \ref{chap5Th1.1} (page
8183: \pageref{chap5Th1.1}) to Support Vector
8184: Machines in the transductive case. So let us consider
8185: $(X_i, Y_i)_{i=1}^{(k+1)N}$ distributed according to some partially exchangeable
8186: distribution $\PP$ and assume that $(X_i)_{i=1}^{(k+1)N}$ and
8187: $(Y_i)_{i=1}^N$ are observed. Let us consider some positive
8188: kernel $K$ on $\C{X}$. For any $K$-separable training set of
8189: the form $Z' = (X_i,y_i')_{i=1}^{(k+1)N}$, where $(y_i')_{i=1}^{(k+1)N}
8190: \in \C{Y}^{(k+1)N}$, let $\Hat{f}_{Z'}$ be the Support Vector Machine
8191: defined by $K$ and $Z'$ and let $\gamma(Z')$ be its margin.
8192: Let
\begin{multline*}
R^2 = \max_{i=1, \dots, (k+1)N} \Biggl[ K(X_i,X_i) + \frac{1}{(k+1)^2 N^2}
\sum_{j=1}^{(k+1)N} \sum_{\ell=1}^{(k+1)N} K(X_j,X_{\ell}) \\
- \frac{2}{(k+1)N}
\sum_{j=1}^{(k+1)N} K(X_i,X_j) \Biggr].
\end{multline*}
8199: (This is an easily computable upper-bound for the radius
8200: of some ball containing the image of $(X_1, \dots, X_{(k+1)N})$
8201: in feature space.)
8202:
8203: Let us define for any integer $h$ the margins
8204: \begin{equation}
8205: \label{margin}
8206: \gamma_{2h} = (2h - 1)^{-1/2}
8207: \text{ and } \gamma_{2h+1} = \left[ 2h\left(
8208: 1 - \frac{1}{(2h+1)^2}\right) \right]^{-1/2}.
8209: \end{equation}
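For instance, the first values given by this definition are
$$
\gamma_2 = 1, \quad \gamma_3 = \tfrac{3}{4}, \quad
\gamma_4 = 3^{-1/2} \simeq 0.577, \quad
\gamma_5 = \tfrac{5}{\sqrt{96}} \simeq 0.510,
$$
and one can check that, for $n = 2, \dots, 5$, $\gamma_n^{-2}$ coincides with
the right-hand side of \eqref{firstPart}.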
8210: Let us consider for any $h =1, \dots, N$ the exchangeable model
8211: $$
8212: \C{R}_h = \bigl\{ \Hat{f}_{Z'}\,:Z' = (X_i, y_i')_{i=1}^{(k+1)N}
8213: \text{ is $K$-separable and } \gamma(Z') \geq R \gamma_h \bigr\}.
8214: $$
8215: The family of models $\C{R}_h$, $h=1, \dots, N$ is nested,
8216: and we know from Theorem \ref{chap5Th1.1} (page \pageref{chap5Th1.1}) and
8217: Theorems \ref{th1} (page \pageref{th1}) and
8218: \ref{th2} (page \pageref{th2}) that
8219: $$
8220: \log \bigl( \lvert \C{R}_h \rvert \bigr) \leq h \log
8221: \bigl( \tfrac{(k+1)e N}{h} \bigr).
8222: $$
8223: We can then consider on the large model $\C{R} = \bigsqcup_{h=1}^N
8224: \C{R}_h$ (the disjoint union of the submodels)
8225: an exchangeable prior $\pi$ which is uniform on each $\C{R}_h$
8226: and is such that $\pi(\C{R}_h) \geq \frac{1}{h(h+1)}$.
8227: Applying Theorem \ref{thm2.1.5}
8228: (page \pageref{thm2.1.5})
8229: we get
8230: \begin{proposition}\mypoint
8231: With $\PP$ probability at least $1 - \epsilon$, for any
8232: $h = 1, \dots, N$, any Support Vector Machine $f \in \C{R}_h$,
8233: \begin{multline*}
8234: r_2(f) \leq \\*
8235: \frac{k+1}{k} \inf_{\lambda \in \RR_+}
8236: \frac{1 - \exp \Bigl[ - \frac{\lambda}{N} r_1(f) - \frac{h}{N} \log
8237: \Bigl( \frac{e(k+1)N}{h} \Bigr) - \frac{\log[h(h+1)] -
8238: \log(\epsilon)}{N}
8239: \Bigr]}{
8240: 1 - \exp( - \frac{\lambda}{N})} \\* - \frac{r_1(f)}{k}.
8241: \end{multline*}
8242: \end{proposition}
Searching the whole model $\C{R}_h$ may be infeasible;
nonetheless, any heuristic can be applied to choose $f$. For instance,
8245: a Support Vector Machine $f'$ can be trained from
8246: the training set $(X_i, Y_i)_{i=1}^N$ and then $(y'_i)_{i=1}^{
8247: (k+1)N}$ can be set to $y'_i = \sign(f'(X_i))$, $i = 1,
8248: \dots, (k+1)N$.
8249:
8250: \subsubsection[Inductive margin bounds]{Inductive margin bounds for Support
8251: Vector Machines}
8252:
8253: In order to establish inductive margin bounds, we will
8254: need a different combinatorial lemma. It is due to \cite{Alon}.
8255: We will reproduce their proof with some tiny improvements on
8256: the values of constants.
8257:
8258: Let us consider the finite case when $\C{X} = \{1, \dots, n\}$,
8259: $\C{Y} = \{1, \dots, b\}$ and \linebreak $b \geq 3$ (the question
8260: we will study would be meaningless in the case when $b \leq 2$). Assume as usual that we are
8261: dealing with a prescribed set of classification rules
8262: \linebreak $\C{R} = \bigl\{ f : \C{X} \rightarrow \C{Y} \bigr\}$.
8263: Let us say that a pair $(A,s)$, where $A \subset \C{X}$
8264: is a non empty set of shapes
8265: and $s : A \rightarrow \{2, \dots, b-1\}$ a threshold function,
8266: is {\em shattered}
8267: by the set of functions $F \subset \C{R}$
8268: if for any $(\sigma_x)_{x \in A} \in \{-1,+1\}^{A}$,
8269: there exists some $f \in F$ such that $\min_{x \in A}
8270: \sigma_x \bigl[ f(x) - s(x) \bigr] \geq 1$.
8271:
8272: \begin{dfn}\mypoint
8273: \label{fatDef}
8274: Let the {\em fat shattering
8275: dimension} of $(\C{X},\C{R})$ be the maximal size $\lvert A \rvert$
8276: of the first component of the pairs which are shattered by $\C{R}$.
8277: \end{dfn}
8278:
8279: Let us say that a subset of classification rules $F \subset
8280: \C{Y}^{\C{X}}$ is {\em separated} whenever for any pair
8281: $(f,g) \in F^2$ such that $f\neq g$, $\lVert f - g \rVert_{\infty}
8282: = \max_{x \in \C{X}} \lvert f(x) - g(x) \rvert \geq 2$.
8283: Let $\mathfrak{M}(\C{R})$ be the maximum size $\lvert F \rvert$
8284: of separated subsets $F$ of $\C{R}$. Note that if $F$ is a
8285: separated subset of $\C{R}$ such that $\lvert F \rvert =
8286: \mathfrak{M}(\C{R})$, then it is a $1$-net for the $\C{L}_{\infty}$
8287: distance: for any function $f \in \C{R}$ there exists $g \in F$
8288: such that $\lVert f - g \rVert_{\infty} \leq 1$ (otherwise $f$ could be
8289: added to $F$ to create a larger separated set).
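
As a minimal example of these two notions, take $n = 1$, $b = 3$ and
$F = \{ f, g \}$, where $f(1) = 1$ and $g(1) = 3$. Then
$\lVert f - g \rVert_{\infty} = 2$, so that $F$ is separated, and
the pair $(A,s)$ defined by $A = \{1\}$ and $s(1) = 2$ is shattered by $F$,
since
$$
(+1) \bigl[ g(1) - s(1) \bigr] = 1
\quad \text{and} \quad
(-1) \bigl[ f(1) - s(1) \bigr] = 1.
$$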
8290:
8291: \begin{lemma}\mypoint
8292: \label{lemma3.1}
8293: With the above notations,
8294: whenever the fat shattering dimension of
8295: $(\C{X}, \C{R})$ is not greater than $h$,
8296: \begin{multline*}
8297: \log \bigl[ \mathfrak{M}(\C{R}) \bigr] < \log \bigl[ (b-1)(b-2) n \bigr]
8298: \Biggl\{\frac{\log \bigl[ \sum_{i=1}^h \binom{n}{i} (b-2)^i \bigr]}{
8299: \log(2)}+1 \Biggr\} + \log(2)
8300: \\ \leq \log \bigl[ (b-1)(b-2) n \bigr]
8301: \Biggl\{ \biggl[ \log \Bigl[ \tfrac{(b-2) n}{h}
8302: \Bigr] + 1 \biggr] \frac{h}{\log(2)} + 1\Biggr\} + \log(2).
8303: \end{multline*}
8304: \end{lemma}
8305: \begin{proof}
8306: For any set of functions $F \subset \C{Y}^{\C{X}}$,
8307: let $t(F)$ be the number of pairs $(A, s)$ shattered by $F$.
8308: Let $t(m,n)$ be the minimum of $t(F)$ over
8309: all {\em separated} sets of functions $F \subset \C{Y}^{\C{X}}$ of size $\lvert
8310: F \rvert = m$ ($n$ is here to recall that the shape space $\C{X}$
8311: is made of $n$ shapes). For any $m$ such that $t(m,n) > \sum_{i=1}^h
8312: \binom{n}{i} (b-2)^i$, it is clear that any separated set of functions
8313: of size $\lvert F \rvert \geq m$ shatters at least one pair
8314: $(A,s)$ such that $\lvert A \rvert > h$. Indeed, $t(m,n)$ is
8315: clearly from its definition a non decreasing function of $m$,
8316: so that $t(\lvert F \rvert, n) > \sum_{i=1}^h \binom{n}{i}
8317: (b-2)^i$.
8318: Moreover there are only $\sum_{i=1}^h \binom{n}{i}(b-2)^i$
8319: pairs $(A,s)$ such that $\lvert A \rvert \leq h$.
8320: As a consequence, whenever the fat shattering dimension
8321: of $(\C{X}, \C{R})$ is not greater than $h$ we have $\mathfrak{M}(\C{R})
8322: < m$.
8323:
8324: It is clear that for any $n \geq 1$, $t(2,n) = 1$.
8325: \begin{lemma}\mypoint
8326: For any $m \geq 1$,
8327: $t\bigl[mn(b-1)(b-2), n \bigr] \geq 2 t\bigl[ m, n-1 \bigr]$,
8328: and therefore $t\bigl[ 2 n(n-1) \dots (n-r+1) (b-1)^r(b-2)^r, n \bigr]
8329: \geq 2^r$.
8330: \end{lemma}
8331: \begin{proof}
8332: Let $F = \{f_1, \dots, f_{mn(b-1)(b-2)}\}$
8333: be some separated set of functions of size
8334: $mn(b-1)(b-2)$. For any pair $(f_{2i-1},f_{2i})$,
8335: $i=1,\dots, mn(b-1)(b-2)/2$, there is $x_i \in \C{X}$
8336: such that $\lvert f_{2i-1}(x_i) - f_{2i}(x_i) \rvert
8337: \geq 2$. Since $\lvert \C{X} \rvert = n$, there is
8338: $x \in \C{X}$ such that $\sum_{i=1}^{mn(b-1)(b-2)/2}
8339: \B{1}(x_i = x) \geq m(b-1)(b-2)/2$. Let $I = \{ i \,:
8340: x_i = x\}$.
8341: Since there are
8342: $(b-1)(b-2)/2$ pairs $(y_1,y_2) \in \C{Y}^2$
8343: such that $1\leq y_1 < y_2 - 1 \leq b -1$, there is some pair
8344: $(y_1,y_2)$, such that $1 \leq y_1 < y_2 \leq b$
8345: and such that $\sum_{i\in I} \B{1}(\{y_1,y_2\} = \{f_{2i-1}(x),
8346: f_{2i}(x)\}) \geq m$.
8347: Let $J = \bigl\{i \in I\,: \{f_{2i-1}(x),f_{2i}(x)\} = \{y_1,y_2\}
8348: \bigr\}$. Let
8349: \begin{align*}
8350: F_1 & =
8351: \{ f_{2i-1} \,:i \in J, f_{2i-1}(x) = y_1\}
8352: \cup
8353: \{ f_{2i} \,:i \in J, f_{2i}(x) = y_1\},\\
8354: F_2 & =
8355: \{ f_{2i-1} \,:i \in J, f_{2i-1}(x) = y_2\}
8356: \cup
8357: \{ f_{2i} \,:i \in J, f_{2i}(x) = y_2\}.
8358: \end{align*}
Dropping some indices if necessary, we may assume that
$\lvert J \rvert = m$, so that $\lvert F_1 \rvert = \lvert F_2 \rvert =
\lvert J \rvert = m$. Moreover the restrictions
of the functions of $F_1$ to $\C{X} \setminus \{x\}$
are separated, and it is the same with $F_2$. Thus
$F_1$ shatters at least $t(m,n-1)$
pairs $(A,s)$ such that $A \subset \C{X} \setminus \{x\}$
and it is the same with $F_2$. Finally,
8366: if the pair $(A,s)$ where $A \subset \C{X} \setminus \{x\}$
8367: is both shattered by $F_1$ and $F_2$, then
8368: $F_1 \cup F_2$ shatters also $(A \cup \{x\}, s')$
8369: where $s'(x') = s(x')$ for any $x' \in A$ and $s'(x) =
8370: \lfloor \frac{y_1+y_2}{2} \rfloor$. Thus $F_1 \cup F_2$,
8371: and therefore $F$, shatters at least $2t(m,n-1)$
8372: pairs $(A,s)$.
8373: \end{proof}
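
The second inequality of the lemma follows by iterating the first one.
As a sketch, assume that $n > r$ and put, for $j = 0, \dots, r$
(the notation $n_j$, $m_j$ is introduced here for illustration only),
$$
n_j = n - r + j, \qquad m_0 = 2, \qquad m_j = m_{j-1} \, n_j (b-1)(b-2),
$$
so that $n_r = n$ and
$m_r = 2 n (n-1) \dots (n-r+1) (b-1)^r (b-2)^r$.
Applying the first inequality with $m = m_{j-1}$ and $n_j$ points gives
$t\bigl[ m_j, n_j \bigr] \geq 2 t\bigl[ m_{j-1}, n_{j-1} \bigr]$, whence
$$
t\bigl[ m_r, n \bigr] \geq 2 t\bigl[ m_{r-1}, n_{r-1} \bigr]
\geq \dots \geq 2^r t\bigl[ 2, n - r \bigr] = 2^r,
$$
where the last equality uses $t(2, n-r) = 1$, valid since $n - r \geq 1$.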
8374:
Resuming the proof of Lemma \ref{lemma3.1}, let us choose
for $r$ the smallest integer such that
$2^r > \sum_{i=1}^h \binom{n}{i} (b-2)^i$, which is no greater than
$$
\left\{ \frac{\log \bigl[ \sum_{i=1}^h \binom{n}{i} (b-2)^i \bigr]}{
\log(2)} + 1 \right\}.
$$
8381: In the case when $1 \leq n \leq r$,
8382: $$
8383: \log( \mathfrak{M}(\C{R}) ) < {\lvert \C{X} \rvert} \log(\lvert \C{Y} \rvert)
8384: = n \log(b) \leq r \log( b) \leq r \log \bigl[ (b-1)(b-2)n \bigr] + \log(2),
8385: $$
8386: which proves the lemma. In the remaining case $n > r$,
8387: \begin{multline*}
8388: t \bigl[ 2 n^r (b-1)^r (b-2)^r, n \bigr]
8389: \\ \geq t \bigl[ 2n(n-1) \dots (n-r+1)(b-1)^r(b-2)^r, n\bigr]
8390: \\ > \sum_{i=1}^h \binom{n}{i} (b-2)^i.
8391: \end{multline*}
Thus $\mathfrak{M}(\C{R}) < 2 \Bigl[(b-2)(b-1)n\Bigr]^r$ as
8393: claimed.
8394: \end{proof}
8395:
8396: In order to apply this combinatorial lemma to Support Vector
8397: Machines, let us consider now the case of separating
8398: hyperplanes in $\RR^d$ (the generalization to Support Vector Machines
8399: being straightforward).
8400: Assume that $\C{X} = \RR^d$ and
8401: $\C{Y}= \{-1,+1\}$.
For any sample $(X_i)_{i=1}^{(k+1)N}$, let
8403: $$
8404: R(X_1^{(k+1)N}) = \max \{ \lVert X_i \rVert \,: 1 \leq i \leq (k+1)N \}.
8405: $$
8406: Let us consider the set of parameters
8407: $$
8408: \Theta = \bigl\{ (w,b) \in \RR^d \times \RR\,: \lVert w \rVert = 1 \bigr\}.
8409: $$
8410: For any $(w,b) \in \Theta$, let
8411: $g_{w,b}(x) = \langle w, x \rangle - b$.
8412: Let $h$ be some fixed integer and let $\gamma = R(X_1^{(k+1)N})\gamma_h$,
8413: where $\gamma_h$ is defined by equation \eqref{margin} on page \pageref{margin}.
8414:
8415: Let us define $\zeta : \RR \rightarrow \ZZ$ by
8416: $$
8417: \zeta (r) =
8418: \left\{
8419: \begin{aligned}
8420: -5 & & \text{ when }&& & r \leq -4\gamma,\\
8421: -3 & & \text{ when }&& -4 \gamma < & r \leq -2 \gamma,\\
8422: -1 & & \text{ when }&& -2 \gamma < & r \leq 0,\\
8423: +1 & & \text{ when }&& 0 < & r \leq 2 \gamma,\\
8424: +3 & & \text{ when }&& 2 \gamma < & r \leq 4 \gamma,\\
8425: +5 & & \text{ when }&& 4 \gamma < & r.
8426: \end{aligned}\right.
8427: $$
8428: Let $G_{w,b}(x) = \zeta \bigl[ g_{w,b}(x) \bigr]$.
8429: The fat shattering dimension (as defined in \ref{fatDef})
8430: of
8431: $$
8432: \Bigl( X_1^{(k+1)N}, \bigl\{ (G_{w,b}+7)/2 :
8433: (w,b) \in \Theta \bigr\} \Bigr)
8434: $$
8435: is not greater than $h$ (according to Theorem \ref{chap5Th1.1}, page
8436: \pageref{chap5Th1.1}),
8437: therefore there is some set $\C{F}$
8438: of functions from $X_1^{(k+1)N}$ to $\{-5,-3,-1,+1,+3,+5\}$
8439: such that
8440: $$
8441: \log \bigl(\lvert \C{F} \rvert \bigr) \leq
8442: \log\bigl[ 20(k+1) N \bigr] \Biggl\{ \frac{h}{\log(2)}
8443: \biggl[ \log \left( \frac{4(k+1)N}{h} \right) + 1 \biggr]
+ 1 \Biggr\} + \log(2),
8445: $$
8446: and
8447: for any $(w,b) \in \Theta$, there is
8448: $f_{w,b} \in \C{F}$ such that $\sup \bigl\{ \lvert f_{w,b}
8449: (X_i) - G_{w,b}(X_i) \rvert\,: i=1, \dots, (k+1)N \bigr\} \leq 2.$
8450: Moreover, the choice of $f_{w,b}$ may be required to depend
8451: on $(X_i)_{i=1}^{(k+1)N}$ in an exchangeable way.
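
For orientation, the constants in the bound on
$\log \bigl( \lvert \C{F} \rvert \bigr)$ can be traced as follows.
This is only a sketch, assuming that $\lvert \C{F} \rvert$ is controlled by
the bound $\mathfrak{M}(\C{R}) < 2 \bigl[ (b-1)(b-2) n \bigr]^r$ of
Lemma \ref{lemma3.1}, applied with $n = (k+1)N$ points and $b = 6$ values
(the functions $(G_{w,b}+7)/2$ taking their values in $\{1, \dots, 6\}$),
and that $1 \leq h \leq n$.
In this case $(b-1)(b-2)n = 20(k+1)N$ and, using the classical inequality
$\sum_{i=0}^h \binom{n}{i} \leq (en/h)^h$,
$$
\sum_{i=1}^h \binom{n}{i} 4^i \leq 4^h \sum_{i=0}^h \binom{n}{i}
\leq \left( \frac{4 e n}{h} \right)^h,
\quad \text{ so that } \quad
r \leq \frac{h}{\log(2)} \biggl[ \log \left( \frac{4n}{h} \right) + 1 \biggr] + 1.
$$
Substituting this value of $r$ in $r \log(20 n) + \log(2)$ gives the
right-hand side of the bound stated above.
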
8452: Similarly to Theorem \ref{thm2.1.5} (page \pageref{thm2.1.5}),
8453: it can be proved that for any partially exchangeable probability
8454: distribution $\PP \in \C{M}_+^1 (\Omega)$,
8455: with $\PP$ probability at least $1 - \epsilon$,
8456: for any $f_{w,b} \in \C{F}$,
8457:
8458: \begin{multline*}
8459: \frac{1}{kN} \sum_{i=N+1}^{(k+1)N}
8460: \B{1}\bigl[f_{w,b}(X_i) Y_i \leq 1 \bigr] \\
8461: \begin{aligned} \leq \frac{k+1}{k} & \inf_{\lambda \in \RR_+}
8462: \bigl[ 1 - \exp( - \tfrac{\lambda}{N} ) \bigr]^{-1}
8463: \biggl\{ 1 - \\
8464: & \exp \biggl[ - \frac{\lambda}{N^2}
8465: \sum_{i=1}^N \B{1} \bigl[ f_{w,b}(X_i) Y_i \leq 1 \bigr]
8466: - \frac{\log \bigl( \lvert \C{F} \rvert \bigr) - \log(\epsilon)}{N}
8467: \biggr] \biggr\} \end{aligned}\\- \frac{1}{k N} \sum_{i=1}^{N} \B{1} \bigl[
8468: f_{w,b}(X_i) Y_i \leq 1 \bigr].
8469: \end{multline*}
8470:
8471: Let us remark that
8472: $$
8473: \B{1} \Bigl\{
8474: 2 \B{1} \bigl[g_{w,b}(X_i) \geq 0 \bigr] - 1 \neq Y_i \Bigr\}
8475: = \B{1}\bigl[ G_{w,b}(X_i) Y_i < 0 \bigr] \leq
8476: \B{1} \bigl[ f_{w,b}(X_i) Y_i \leq 1 \bigr]
8477: $$
8478: and
8479: $$
8480: \B{1}\bigl[ f_{w,b}(X_i) Y_i \leq 1 \bigr]
8481: \leq \B{1}\bigl[ G_{w,b}(X_i) Y_i \leq 3 \bigr]
8482: \leq \B{1} \bigl[ g_{w,b}(X_i) Y_i \leq 4 \gamma \bigr].
8483: $$
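These two chains of inequalities rest on three elementary facts:
$Y_i \in \{-1, +1\}$, $G_{w,b}$ takes odd integer values, and
$\lvert f_{w,b}(X_i) - G_{w,b}(X_i) \rvert \leq 2$. Spelled out, as a sketch,
\begin{align*}
G_{w,b}(X_i) Y_i < 0 & \Longrightarrow G_{w,b}(X_i) Y_i \leq -1
\Longrightarrow f_{w,b}(X_i) Y_i \leq G_{w,b}(X_i) Y_i + 2 \leq 1, \\
f_{w,b}(X_i) Y_i \leq 1 & \Longrightarrow
G_{w,b}(X_i) Y_i \leq f_{w,b}(X_i) Y_i + 2 \leq 3, \\
g_{w,b}(X_i) Y_i > 4 \gamma & \Longrightarrow G_{w,b}(X_i) Y_i = 5,
\end{align*}
the last line being the contrapositive of the last inequality and following
from the definition of $\zeta$: when $Y_i = +1$, $g_{w,b}(X_i) > 4 \gamma$
gives $G_{w,b}(X_i) = 5$, and when $Y_i = -1$, $g_{w,b}(X_i) < - 4 \gamma$
gives $G_{w,b}(X_i) = -5$.
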
8484: This proves the following theorem.
8485: \begin{thm}\mypoint
8486: With $\PP$ probability at least
8487: $1 - \epsilon$, for any $(w,b) \in \Theta$,
8488: \begin{multline*}
8489: \frac{1}{kN} \sum_{i=N+1}^{(k+1)N}
8490: \B{1} \Bigl\{ 2 \B{1} \bigl[ g_{w,b}(X_i) \geq 0 \bigr] - 1 \neq Y_i \Bigr\}\\
8491: \begin{aligned} \leq \frac{k+1}{k} & \inf_{\lambda \in \RR_+, h \in \NN^*}
8492: \bigl[ 1 - \exp( - \tfrac{\lambda}{N} ) \bigr]^{-1}
8493: \Biggl\{ 1 - \\
8494: \exp \Biggl[ - & \frac{\lambda}{N^2}
8495: \sum_{i=1}^N \B{1} \bigl[ g_{w,b}(X_i)Y_i \leq 4 R \gamma_h \bigr]
8496: \\ - & \frac{\log
8497: \bigl[ 20 (k+1)N \bigr] \Bigl\{
8498: \tfrac{h}{\log(2)} \log \Bigl( \tfrac{4e (k+1)N}{h} \Bigr)
8499: + 1 \Bigr\} + \log\Bigl[ \tfrac{2h(h+1)}{\epsilon} \Bigr] }{N}
8500: \Biggr] \Biggr\} \end{aligned}\\- \frac{1}{k N} \sum_{i=1}^{N} \B{1}
8501: \bigl[ g_{w,b}(X_i)Y_i \leq 4 R \gamma_h \bigr].
8502: \end{multline*}
8503: \end{thm}
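
The term $\log \bigl[ \tfrac{2h(h+1)}{\epsilon} \bigr]$ is consistent with
a union bound over $h \in \NN^*$. As a sketch (the weights $\epsilon_h$
are introduced here for illustration, and are an assumption about how
uniformity in $h$ is obtained), one may apply the previous inequality for each
$h$ at confidence level $\epsilon_h = \frac{\epsilon}{h(h+1)}$. Since
$$
\sum_{h=1}^{+\infty} \frac{1}{h(h+1)}
= \sum_{h=1}^{+\infty} \left( \frac{1}{h} - \frac{1}{h+1} \right) = 1,
$$
the inequalities then hold simultaneously for all $h \in \NN^*$ with
$\PP$ probability at least $1 - \epsilon$, and
$$
\log(2) - \log(\epsilon_h) = \log \left[ \frac{2 h (h+1)}{\epsilon} \right],
$$
where $\log(2)$ is the additive constant appearing in the bound on
$\log \bigl( \lvert \C{F} \rvert \bigr)$.
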
8504: As a consequence,
8505: we obtain with $\PP$ probability at least $1 - \epsilon$,
8506: for any $(w,b) \in \Theta$ such that
8507: $$
8508: \gamma = \min_{i=1, \dots, N} g_{w,b}(X_i)Y_i > 0,
8509: $$
8510: \begin{multline*}
8511: \frac{1}{kN} \sum_{i=N+1}^{(k+1)N}
8512: \B{1} \bigl[ g_{w,b}(X_i) Y_i < 0 \bigr]
8513: \\ \leq \tfrac{k+1}{k} \biggl\{
8514: 1 - \exp \biggl[ - \tfrac{\log\bigl[ 20(k+1)N \bigr] }{N}
8515: \Bigl\{ \tfrac{16 R^2 + 2 \gamma^2}{\log(2) \gamma^2}
8516: \log \Bigl( \tfrac{e (k+1)N \gamma^2}{4R^2} \Bigr) + 1 \Bigr\}
8517: \\ + \frac{1}{N} \log ( \tfrac{\epsilon}{2} ) \biggr] \biggr\}.
8518: \end{multline*}
8519: This inequality compares favourably with similar inequalities
in \cite{Cristianini}, which moreover do not extend to the margin
quantile case as this one does.
8522:
Let us also remark that it is easy to circumvent the fact that
8524: $R$ is not observed when the test set
8525: $X_{N+1}^{(k+1)N}$ is not observed.
8526:
8527: Indeed, we can consider the sample obtained by projecting $X_1^{(k+1)N}$
8528: on some ball of fixed radius $R_{\max}$, putting
8529: $$
8530: t_{R_{\max}}(X_i) = \min \left\{ 1, \frac{R_{\max}}{\lVert X_i \rVert} \right\} X_i.
8531: $$
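As a quick check that this map sends the sample into the ball of radius
$R_{\max}$ (with the convention, assumed here, that $t_{R_{\max}}(0) = 0$),
$$
\lVert t_{R_{\max}}(X_i) \rVert
= \min \bigl\{ \lVert X_i \rVert, R_{\max} \bigr\} \leq R_{\max},
\qquad i = 1, \dots, (k+1)N,
$$
so that the radius of the projected sample is at most $R_{\max}$, whether
or not the test set is observed; points with $\lVert X_i \rVert \leq R_{\max}$
are left unchanged.
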
8532: We can further consider an atomic prior distribution $\nu \in \C{M}_+^1(\RR_+)$
8533: bearing on $R_{\max}$, to obtain a uniform result through a union bound.
As a consequence of the previous theorem, we obtain the following corollary.
8535: \begin{cor}\mypoint
8536: For any atomic prior $\nu \in \C{M}_+^1(\RR_+)$,
8537: for any partially exchangeable probability measure $\PP \in \C{M}_+^1(\Omega)$,
8538: with $\PP$ probability at least
8539: $1 - \epsilon$, for any $(w,b) \in \Theta$, any $R_{\max} \in \RR_+$,
8540: \begin{multline*}
8541: \frac{1}{kN} \sum_{i=N+1}^{(k+1)N}
8542: \B{1} \Bigl\{ 2 \B{1} \bigl[ g_{w,b} \circ t_{R_{\max}}(X_i)
8543: \geq 0 \bigr] - 1 \neq Y_i \Bigr\}\\*
8544: \begin{aligned} \leq \frac{k+1}{k} & \inf_{\lambda \in \RR_+, h \in \NN^*}
8545: \bigl[ 1 - \exp( - \tfrac{\lambda}{N} ) \bigr]^{-1}
8546: \Biggl\{ 1 - \\
8547: \exp \Biggl[ - & \frac{\lambda}{N^2}
8548: \sum_{i=1}^N \B{1} \bigl[ g_{w,b} \circ t_{R_{\max}}(X_i)Y_i \leq 4 R_{\max}
8549: \gamma_h \bigr]
8550: \\ - & \frac{\log
8551: \bigl[ 20 (k+1)N \bigr] \Bigl\{
8552: \tfrac{h}{\log(2)} \log \Bigl( \tfrac{4e (k+1)N}{h} \Bigr)
8553: + 1 \Bigr\} + \log\Bigl[ \tfrac{2h(h+1)}{\epsilon \nu(R_{\max})} \Bigr] }{N}
8554: \Biggr] \Biggr\} \end{aligned}\\- \frac{1}{k N} \sum_{i=1}^{N} \B{1}
8555: \bigl[ g_{w,b}\circ t_{R_{\max}} (X_i)Y_i \leq 4 R_{\max} \gamma_h \bigr].
8556: \end{multline*}
8557: \end{cor}
8558:
8559: \input{appendix}
8560: