0201:cs0201014/paper.tex

1: \documentclass{article}

2: \usepackage{nips01,times}

3: \usepackage{graphicx}

4: \usepackage{subfigure}

5: \usepackage{psfig}

6:

7: %% \documentstyle[nips01]{article}

8:

9: \title{The Dynamics of AdaBoost Weights \\ Tells You What's Hard to Classify}

10:

11: \author{B. Caprile\\

12: ITC-irst \\

13: I-38050 Povo, Trento\\

14: Italy\\

15: {\it caprile@itc.it} \\

16: \And

17: C. Furlanello \\

18: ITC-irst \\

19: I-38050 Povo, Trento\\

20: Italy\\

21: {\it furlan@itc.it} \\\\

22: \And

23: S. Merler\\

24: ITC-irst \\

25: I-38050 Povo, Trento\\

26: Italy\\

27: {\it merler@itc.it} \\\\

28: }

29:

30:

31:

32: \begin{document}

33: \maketitle

34:

35: \bibliographystyle{plain}

36:

37: \newcommand{\REM}[1]{

38: {\bf #1}

39: }

40:

41: \newcommand{\Ada}{AdaBoost

42: }

43:

44: \newcommand{\Adaa}{{\tt AdaBoost} algorithm

45: }

46:

47:

48: \begin{abstract}

49: The dynamical evolution of weights in the \Ada algorithm contains

50: useful information about the r{\^o}le that the associated data points

51: play in building the \Ada model. In particular, the dynamics

52: induces a bipartition of the data set into two (easy/hard)

53: classes. Easy points have no influence on the making of the model,

54: while the varying relevance of hard points can be gauged in terms of

55: an entropy value associated with their evolution. Smooth approximations

56: of entropy highlight regions where classification is most

57: uncertain. Promising results are obtained when the proposed methods are

58: applied in the Optimal Sampling framework.

59: \end{abstract}

60:

61:

62: \begin{section}{Introduction}

63:

64: In this paper we investigate the boosting weight dynamics induced by

65: classification procedures of the AdaBoost family

66: \cite{FreSch97,SchFreBarLee98}, and show how it can be exploited to

67: highlight points and regions of uncertain

68: classification. Friedman et al. \cite{FriHasTib00} proposed to analyze

69: and trim the distribution of weights over a training sample in order

70: to reduce computation without sacrificing accuracy. Here, we focus

71: instead on tracking the dynamics of the boosting weight of individual

72: points. By introducing the notion of entropy of the weight evolution,

73: we can clarify the notions of ``easy'' and ``hard'' points as the

74: two types of weight dynamics being observed: in particular, in

75: different classification tasks and with different base models it is

76: found that a group of points may be selected which have very low

77: (ideally, zero) entropy of weight evolution: the easy points. In this

78: framework, we can answer questions such as: do easy points play any role in

79: building the AdaBoost model? For hard points, can different degrees

80: of ``hardness'' be identified which account for different degrees of

81: classification uncertainty? Do easy/hard points show any preference about

82: where to concentrate? The first two questions are clearly connected to

83: equivalent results in the framework of Support Vector Machines: in a

84: number of experiments, hard points are indeed

85: found mostly near the classification boundary.  In the second

86: part of this paper, the smooth approximation (by kernel regression) of

87: the weight entropy at training data is proposed as an indicator

88: function of classification uncertainty, thereby obtaining a region

89: highlighting methodology. As a natural application,

90: a strategy for optimal sampling in classification tasks was implemented:

91: compared with uniform random sampling, the entropy-based strategy is

92: clearly more effective. Moreover, it compares favorably with an

93: alternative margin-based sampling strategy.

94:

95: \end{section}

96:

97: \begin{section}{The Dynamics of Weights}

98: \label{sec:dynamics}

99:

100: In the present section, the dynamics that the \Ada algorithm sets over

101: the weights is singled out for study. In particular, the intuition is

102: substantiated that the evolution of weights yields information about

103: the varying relevance that different data points have in building

104: the \Ada model.

105:

106: Let $D \equiv \{{\bf x}_{i}, y_{i}\}_{i=1}^{N}$ be a two-class set of

107: data points, where the ${\bf x}_{i}$s belong to a suitable region,

108: $X$, of some (metric) feature space, and $y_{i}$ takes values in $\{1,

109: -1\}$, for $1 \leq i \leq N$. The \Ada algorithm iteratively builds a

110: class membership estimator over $X$ as a thresholded linear

111: superposition of different realizations, $M_{k}$, of a same base

112: model, $M$. Any model instance, $M_{k}$, resulting from training at

113: step $k$ depends on the values taken at the same step by a set of $N$

114: numbers (in the following, the {\em weights}), ${\bf w} = (w_{1}, \dots,

115: w_{N})$ -- one for each data point. After training, weights are

116: updated: those associated with points misclassified by the current model

117: instance are increased, while those for which the

118: associated point is classified correctly are decreased. An interesting variant of

119: this basic scheme consists in training the different realizations of

120: the base model, not on the whole data set, but on Bootstrap replicates

121: of it \cite{Qui96}. In this second scheme, samples are drawn

122: according to the discrete probability distribution defined by the

123: weights associated with data points, normalized to sum to one.
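
As an illustration, the update just described can be written explicitly.
The sketch below assumes one common formulation of discrete AdaBoost
\cite{FreSch97}, with $M_{k}({\bf x}) \in \{1, -1\}$ and weights
normalized to sum to one; the symbols $\epsilon_{k}$, $\alpha_{k}$ and
$Z_{k}$ are introduced here only for this purpose, and other
parametrizations of the update are possible.

$$
\epsilon_{k} = \sum_{i:\, M_{k}({\bf x}_{i}) \neq y_{i}} w_{i} , \qquad
\alpha_{k} = \frac{1}{2} \ln \frac{1 - \epsilon_{k}}{\epsilon_{k}} , \qquad
w_{i} \leftarrow \frac{w_{i} \, \exp\left(-\alpha_{k} \, y_{i} \, M_{k}({\bf x}_{i})\right)}{Z_{k}} ,
$$

\noindent
where $Z_{k}$ is the constant that makes the updated weights sum to
one. When $\epsilon_{k} < 1/2$, $\alpha_{k}$ is positive, so that the
weight of a misclassified point ($y_{i} M_{k}({\bf x}_{i}) = -1$) is
multiplied by $e^{\alpha_{k}} > 1$ and that of a correctly classified
point by $e^{-\alpha_{k}} < 1$, before normalization.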

124:

125: In Fig. \ref{fig:weights-traces-and-histograms}a we plot

126: the evolution of the weights associated with 3 data points

127: when the \Ada algorithm is applied to a simple binary classification

128: task on synthetic two-dimensional data (experiment A-{\tt Gaussians}

129: as described in Sec. \ref{subsec:appendix-data-a}). Except for

130: occasional bursts, the weight associated with the first point goes

131: rapidly to zero, while the weights associated with the second and third

132: points keep on going up and down in a seemingly chaotic fashion. Our

133: experience is that these two types of behaviour are not specific to

134: the case under consideration, but can be observed in any \Ada

135: experiment. Moreover, {\em tertium non datur}, i.e., no other

136: qualitative behaviour is observed (as, for example, that some weight

137: tends to a strictly positive value).

138:

139: \begin{subsection}{Easy Vs. Hard Data Points}

140: \label{easy-hard-data-points}

141:

142: \begin{figure*}[ht]

143:   \begin{center}

144:     \leavevmode

145:     \psfig{figure=gaussian-5000-weights-trace-1.epsi,width=0.3\textwidth}

146:     \psfig{figure=gaussian-5000-weights-trace-2.epsi,width=0.3\textwidth}

147:     \psfig{figure=gaussian-5000-weights-trace-3.epsi,width=0.3\textwidth}(a)

148:     \psfig{figure=gaussian-5000-histogram-1.epsi,width=0.3\textwidth}

149:     \psfig{figure=gaussian-5000-histogram-2.epsi,width=0.3\textwidth}

150:     \psfig{figure=gaussian-5000-histogram-3.epsi,width=0.3\textwidth}(b)

151:     \caption{{\em Evolution of weights in the \Ada algorithm. (a)

152:     The evolutions over 5000 steps of the \Ada algorithm are reported

153:     for the weights associated to 3 data points of experiment {\rm

154:     A-{\tt Gaussians}}. From left to right: an ``easy'' data point

155:     (the weight tends to zero), and two ``hard'' data points (the

156:     weight follows a seemingly random pattern). (b) The corresponding

157:     frequency histograms.}}

158:

159:     \label{fig:weights-traces-and-histograms}

160:   \end{center}

161: \end{figure*}

162:

163: The hypothesis therefore emerges that the \Ada algorithm sets a

164: partition of data points into two classes: on one side the points

165: whose weight tends rapidly to zero; on the other, the points whose

166: weight shows an apparently chaotic behaviour. In fact, the hypothesis is

167: perfectly consistent with the rationale underlying the \Ada algorithm:

168: weights associated to those data points that several model instances

169: classify correctly even when they are {\em not} contained in the

170: training sample follow the first kind of behaviour. In practice,

171: independently of which bootstrap sample is extracted, these points are

172: classified correctly, and their weight consequently keeps

173: decreasing. We call them the ``easy'' points. The second type of

174: behaviour is followed by the points that, when not contained in the

175: training set, happen to be often misclassified. A series of

176: misclassifications makes the weight associated with any such point

177: increase, thereby increasing the probability for the point to be

178: contained in the following bootstrap sample. As the probability

179: increases and the point is finally extracted (and classified

180: correctly), its weight is decreased; this in turn makes the point less

181: likely to be extracted -- and so forth. We call points of this kind

182: ``hard''.
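
The mechanism can be reproduced with a few lines of code. The
following Python sketch is only illustrative: it assumes decision
stumps as base model, the resampling variant of \cite{Qui96} and the
usual exponential weight update; all function names are hypothetical.
It returns the sequence of normalized weight vectors, one per step.

\begin{verbatim}
import math
import random


def train_stump(points, labels):
    """Return the axis-aligned stump (axis, threshold, sign) with the
    lowest unweighted error on the given (resampled) training set."""
    best = None
    for axis in (0, 1):
        for thr in sorted({p[axis] for p in points}):
            for sign in (1, -1):
                err = sum(1 for p, y in zip(points, labels)
                          if (sign if p[axis] > thr else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, axis, thr, sign)
    return best[1:]


def stump_predict(stump, p):
    axis, thr, sign = stump
    return sign if p[axis] > thr else -sign


def adaboost_weight_history(points, labels, steps, seed=0):
    """Run AdaBoost with weighted resampling and return the list of
    weight vectors (one per step), each normalized to sum to one."""
    rng = random.Random(seed)
    n = len(points)
    w = [1.0 / n] * n
    history = []
    for _ in range(steps):
        # Draw a bootstrap replicate according to the current weights.
        idx = rng.choices(range(n), weights=w, k=n)
        stump = train_stump([points[i] for i in idx],
                            [labels[i] for i in idx])
        pred = [stump_predict(stump, p) for p in points]
        # Weighted error on the whole data set, clipped away from 0 and 1.
        eps = sum(wi for wi, yi, pi in zip(w, labels, pred) if pi != yi)
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        # Increase the weights of misclassified points, decrease the others.
        w = [wi * math.exp(-alpha * yi * pi)
             for wi, yi, pi in zip(w, labels, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
        history.append(list(w))
    return history
\end{verbatim}

\noindent
Reading the $i$-th entry of each vector returned by
{\tt adaboost\_weight\_history(points, labels, 5000)} gives the weight
trace of the $i$-th data point.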

183:

184: In Fig. \ref{fig:weights-traces-and-histograms}b, histograms are

185: reported of the values that the weights associated to the same 3 data

186: points of Fig. \ref{fig:weights-traces-and-histograms}a take over the

187: same 5000 iterations of the \Ada algorithm. As expected, the histogram

188: of (easy) point 1 is very much squeezed towards zero (more than 80\%

189: of weights lie below $10^{-6}$). Histograms of (hard) points 2 and 3

190: exhibit the same Gamma-like shape, but differ remarkably in

191: mean and dispersion. Naturally, the first question is

192: whether any limit exists for these distributions. For each data point,

193: two unbinned cumulative distributions were therefore built by taking

194: the weights generated by the first 3000 steps of the \Ada algorithm,

195: and those generated over the whole 5000 steps. The same-distribution

196: hypothesis was then tested by means of the Kolmogorov-Smirnov (KS)

197: test \cite{PreTeuVetFla92}. Results are reported in

198: Fig. \ref{fig:mean-vs-entropy-ks-test-and-histogram}a, where

199: $p$-values are plotted against the mean value of all 5000 values. It

200: is interesting to notice that for mean values close to 0 (easy points)

201: the same-distribution hypothesis is always rejected, while it is

202: typically not rejected for higher values (hard points). It seems that

203: easy points may be confidently identified by simply considering the

204: average of their weight distribution. A binary LDA classifier was

205: therefore trained on the data of

206: Fig. \ref{fig:mean-vs-entropy-ks-test-and-histogram}a. By setting a

207: $p$-value threshold equal to 0.05, the resulting {\em precision} (the

208: complement to 1 of the fraction of false negatives) was equal to 0.79

209: and {\em recall} (the complement to 1 of the fraction of false

210: positives) was equal to 0.96.
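
The same-distribution test can be sketched as follows, assuming the
two-sample KS statistic together with the asymptotic approximation of
its $p$-value given in \cite{PreTeuVetFla92}. Function names are
hypothetical, and the cut-off below which the $p$-value is returned as
$1$ is a numerical shortcut of this sketch.

\begin{verbatim}
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum distance between the
    empirical cumulative distributions of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical distributions past the value x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n1, n2, terms=100):
    """Asymptotic p-value of the two-sample KS statistic d for
    sample sizes n1 and n2."""
    ne = math.sqrt(n1 * n2 / (n1 + n2))
    lam = (ne + 0.12 + 0.11 / ne) * d
    if lam < 0.3:
        # The series converges slowly here and its value is ~1.
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(max(2.0 * total, 0.0), 1.0)
\end{verbatim}

\noindent
For a weight trace {\tt w} of 5000 values, the comparison described
above would read
{\tt ks\_pvalue(ks\_statistic(w[:3000], w), 3000, 5000)}.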

211:

212: \end{subsection}

213:

214: \begin{subsection}{Entropy}

215: \label{subsec:entropy}

216:

217: Can we do any better at separating easy points from hard ones? For

218: hard points, can different degrees of ``hardness'' be identified which

219: account for different degrees of classification uncertainty? What we

220: are going to show is that by associating a notion of {\em entropy} to

221: the evolutions of weights both questions can be answered in the

222: positive. To this end, the interval $[0,1]$ is partitioned into $L$

223: subintervals of length $1/L$, and the entropy value is computed as

224: $-\sum_{i=1}^{L} f_{i} \log_{2} f_{i}$, where $f_{i}$ is the relative

225: frequency of weight values falling in the $i$-th subinterval ($0

226: \log_{2} 0$ is set to $0$). For our cases, $L$ was set to 1000.
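
A minimal Python sketch of this computation is given below, with the
usual convention that entropy is non-negative. The function name and
its arguments are hypothetical; {\tt trace} stands for the list of
values taken by the weight of one data point over the boosting steps.

\begin{verbatim}
import math


def weight_entropy(trace, bins=1000):
    """Entropy (in bits) of the histogram of the weight values in
    `trace`, using `bins` equal subintervals of [0, 1]."""
    counts = [0] * bins
    for w in trace:
        # A weight equal to 1.0 is assigned to the last subinterval.
        counts[min(int(w * bins), bins - 1)] += 1
    n = len(trace)
    # Empty subintervals are skipped, i.e. 0 log 0 is taken as 0.
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)
\end{verbatim}

\noindent
A trace whose values all fall in the same subinterval has zero
entropy, while a trace spread evenly over $m$ subintervals has entropy
$\log_{2} m$.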

227:

228: \begin{figure*}[ht]

229:   \begin{center}

230:     \leavevmode

231: 	\psfig{figure=ks-test-mean.epsi,width=0.29\textwidth}(a)

232: %%    \psfig{figure=figures/gaussian-5000-weights-mean-vs-entropy.epsi,width=0.28\textwidth}(a)

233:     \psfig{figure=ks-test-entropy.epsi,width=0.29\textwidth}(b)

234:     \psfig{figure=entropy-histogram.epsi,width=0.29\textwidth}(c)

235: %%     \caption{{\em Mean Vs. entropy plot for the weights frequency

236: %%         histograms of the 400 data points of experiment {\rm A-{\tt

237: %%             Gaussians}}. Marked data points are those whose evolution

238: %%         and frequency histograms are reported in Fig.

239: %%         \ref{fig:weights-traces-and-histograms}. The vertical line

240: %%         shows the value of the initial weights. (b) $p$-values of the

241: %%         Kolmogorov-Smirnov test are plotted against entropy of

242: %%         frequency histograms. High values of the entropy indicate

243: %%         stability of frequency histograms. (c) Histogram of entropy

244: %%         values for the 400 data points of experiment {\rm A-{\tt

245: %%             Gaussians}}. Low entropy points are clearly separable from

246: %%         the others.}}

247:     \caption{{\em Separating easy from hard points. (a) $p$-values of

248:     the KS test Vs. mean values of frequency histograms. (b)

249:     $p$-values of the KS test Vs. entropy of frequency histograms. As

250:     in (a), the horizontal line marks the threshold value for the LDA

251:     classifier. (c) Histogram of entropy values for the 400 data

252:     points of experiment {\rm A-{\tt Gaussians}}.}}

253:

254:     \label{fig:mean-vs-entropy-ks-test-and-histogram}

255:   \end{center}

256: \end{figure*}

257:

258: Qualitatively, the relationship between entropy and $p$-values of the

259: KS test is similar to the one holding for the mean

260: (Fig. \ref{fig:mean-vs-entropy-ks-test-and-histogram}a-b). Quantitatively,

261: however, a difference is observed, since the LDA classifier trained on

262: these data performs much better in precision and slightly worse in

263: recall (respectively, 0.99 and 0.90, as compared to 0.79 and

264: 0.96). This implies that the class of easy points can be identified

265: with higher confidence by using the entropy in place of the mean value

266: of the distribution. Further support for the hypothesis of a bipartite

267: (easy/hard) nature of data points is gained by observing the frequency

268: histogram of entropies for the 400 points of experiment A-{\tt

269: Gaussians} (Fig. \ref{fig:mean-vs-entropy-ks-test-and-histogram}c),

270: from which two groups of data points emerge as clearly separated. The

271: first is the zero entropy group of easy points, and the second is the

272: group of hard points.

273:

274: Do easy/hard points show any preference about where to concentrate?

275: In Fig. \ref{fig:using-entropy}a hard and easy points are shown as

276: determined for the experiment A-{\tt Sin} (see

277: Sec. \ref{subsec:appendix-data-a} for details). Hard points are mostly

278: found near the two-class boundary; yet, their density is much lower

279: along the straight segment of the boundary (where the boundary is

280: smoother), and they therefore appear to concentrate where the

281: classification uncertainty is highest. The opposite holds for easy

282: points. Considering that easy points stay well clear of the boundary

283: (i.e., hard points typically interpose between them and the boundary),

284: what one may then question is whether they play any r{\^o}le in

285: building the \Ada model. The answer is no. In fact, the models built

286: disregarding the easy points are practically the same as the models

287: built on the complete data set. In the experiment of

288: Fig. \ref{fig:using-entropy} only $0.55\%$ of $10000$ test points

289: were classified differently by the two models, despite the

290: reduction of the training set from $400$ to only $111$ (hard)

291: points.

292:

293: \end{subsection}

294:

295: \begin{subsection}{Smoothing the Entropy}

296: \label{subsec:extending-entropy}

297:

298: In the previous section, the entropy of the weight frequency histogram

299: was introduced as an indicator of the uncertainty of classifying the

300: associated data point as belonging to class $-1$ or $1$. By defining a

301: smooth approximation to the pointwise entropy values associated with data

302: points, we now extend the notion of classification uncertainty to the

303: whole domain of our binary classifier. For simplicity, kernel

304: regression was employed -- i.e., the entropy values at data points are

305: convolved with a Gaussian kernel of fixed bandwidth \cite{Har90}. In

306: so doing, a scalar entropy function, $H = H({\bf x})$, is defined on

307: $X$. In Fig. \ref{fig:using-entropy}b, the grey levels encode the

308: values of $H$ (increasing from black to white) for the experiment {\rm

309: A-{\tt Sin}}.
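
The smoothing step can be sketched in Python as follows. The sketch
assumes the normalized (Nadaraya-Watson) form of kernel regression
with an isotropic Gaussian kernel; the function name, its arguments
and the choice of normalization are assumptions made for illustration.

\begin{verbatim}
import math


def smoothed_entropy(x, points, entropies, bandwidth):
    """Kernel-regression estimate of the entropy at location `x`,
    given the entropy values observed at the training `points`."""
    num = den = 0.0
    for p, h in zip(points, entropies):
        # Squared Euclidean distance between x and the data point.
        d2 = sum((a - b) ** 2 for a, b in zip(x, p))
        k = math.exp(-d2 / (2.0 * bandwidth ** 2))
        num += k * h
        den += k
    # Far from every data point the kernel weights underflow to zero.
    return num / den if den > 0.0 else 0.0
\end{verbatim}

\noindent
Evaluating {\tt smoothed\_entropy} on a regular grid of locations
yields a level-plot of $H$; a larger {\tt bandwidth} gives a smoother
but less localized function.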

310:

311: \begin{figure*}[ht]

312:   \begin{center} \leavevmode

313:     \psfig{figure=sinusoidal-5000-leaving-out-easy-points.epsi,height=0.4\textwidth}(a)

314:     \psfig{figure=sinusoidal-convolution-0.5.epsi,height=0.4\textwidth}(b)

315:     \caption{{\em (a) Easy (white) and hard (black) data points of

316:     experiment A-{\tt Sin} obtained by thresholding the histogram of

317:     entropy. Squares and circles indicate the class. (b) Level-plot of

318:     the $H$ function. Grey levels encode $H$ values (see scale on the

319:     right).}}

320: \label{fig:using-entropy}

321: \end{center}

322: \end{figure*}

323:

324: The method appears capable of highlighting regions where

325: classification turns out to be uncertain -- due to the distribution of data

326: points, the morphology of the class boundary or both. Of course,

327: function $H$ depends on the geometric properties specific to the base

328: model adopted, and its degree of smoothness depends on the size of the

329: convolution kernel. It should be noticed, however, that the

330: bias/variance balance can be controlled by suitably tuning the

331: convolution parameters. Finally, more sophisticated local smoothing

332: techniques may be employed as well (e.g., Radial Basis Functions)

333: which may adapt to directionality, known morphology of the boundary or

334: local density of sample points.

335:

336: \end{subsection}

337:

338: \end{section}

339:

340: \begin{section}{An Application to Optimal Sampling}

341: \label{sec:optimal-design}

342:

343: To illustrate the applicability of the notions developed above to

344: practical cases, we refer to the framework of optimal sampling

345: \cite{Fed72}. In general, an optimal sampling problem is one in which

346: a {\em cost} is associated with the acquisition of data points, in such

347: a way that solving the problem consists not only in minimizing the

348: classification (or regression) error but also in keeping the sampling

349: cost as low as possible. A typical setting for this class of problems

350: is the one in which we start from an assigned set of (sparse) data

351: points, and we then incrementally add points to the training set on

352: the basis of certain information extracted from intermediate

353: results.

354:

355: %% Training points may already belong to some pre-assigned,

356: %% unlabelled totality, or may be chosen and labelled at run time.

357:

358: \begin{figure*}[ht]

359:   \begin{center}

360:     \leavevmode

361:     \psfig{figure=sinusoidal-incremental-40-1000-x10-error.epsi,width=0.45\textwidth}(a)

362:     \psfig{figure=spiral-incremental-40-1000-x10-error.epsi,width=0.45\textwidth}(b)

363:     \caption{{\em Misclassification error as a function of the number

364:     of training points for the entropy based scheme is compared to

365:     the uniform random sampling and the margin sampling

366:     strategy. (a) Experiment {\rm B-{\tt Sin}}. (b) Experiment {\rm B-{\tt Spiral}}.}}

367:     \label{fig:optimal-sampling-errors}

368:   \end{center}

369: \end{figure*}

370: \end{section}

371:

372: For the experiments reported below, which are based on the same

373: settings as {\tt Sin} and {\tt Spiral} of

374: Sec. \ref{subsec:appendix-data-a} (see also

375: Sec. \ref{subsec:experiment-b} for details), we started from a small

376: set of sparse two-dimensional binary classification

377: data. High-uncertainty areas are identified by means of the method

378: described in Sec. \ref{subsec:extending-entropy}, and additional

379: training points are chosen in these areas. Assuming a unit cost for

380: each new point, performance of the procedure is finally evaluated by

381: analyzing the sampling cost against the classification error.
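
One round of such a procedure can be sketched in Python as follows.
How the high-uncertainty area is delimited is an assumption of this
sketch: candidate locations are drawn uniformly in the domain and only
those in the top fraction of the uncertainty function are retained.
The function name and all parameters are hypothetical; {\tt h} stands
for any function returning the smoothed entropy at a location.

\begin{verbatim}
import random


def sample_high_entropy(h, box, n_new=10, quantile=0.9,
                        pool=2000, seed=0):
    """Draw `n_new` locations approximately uniformly from the part
    of `box` where `h` lies in its top (1 - quantile) fraction."""
    rng = random.Random(seed)
    (x0, x1), (y0, y1) = box
    # Uniform pool of candidate locations over the whole domain.
    cand = [(rng.uniform(x0, x1), rng.uniform(y0, y1))
            for _ in range(pool)]
    # Keep the candidates with the highest uncertainty values.
    top = sorted(cand, key=h)[int(quantile * pool):]
    return rng.sample(top, n_new)
\end{verbatim}

\noindent
The returned locations would then be labelled and added to the
training set before the model is retrained and $H$ is recomputed.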

382:

383: In Fig. \ref{fig:optimal-sampling-errors}, two plots are reported of

384: the classification error as a function of the number of training

385: points. Comparison is made with a blind (uniformly random) sampling

386: strategy, and with a specialization of the {\em uncertainty sampling

387: strategy} as recently proposed in \cite{LewCat94}. The latter consists

388: in adding training points where the classifier is least certain of

389: class membership. In particular, the classifier was the \Ada model and

390: the uncertainty indicator was the margin of the prediction.
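
For reference, a margin-based uncertainty indicator can be sketched as
follows. The expression below is an assumption made for illustration:
it takes the normalized margin of \cite{SchFreBarLee98} in absolute
value, because the label of a candidate point is unknown, with
$\alpha_{k}$ denoting (in our notation) the coefficient of $M_{k}$ in
the linear superposition:

$$
m({\bf x}) = \frac{\left| \sum_{k} \alpha_{k} M_{k}({\bf x}) \right|}{\sum_{k} | \alpha_{k} |} \in [0, 1] .
$$

\noindent
Under this assumption, new training points would be added where
$m({\bf x})$ is close to $0$, i.e., where the votes of the model
instances nearly cancel out.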

391:

392: Results reported in Fig. \ref{fig:optimal-sampling-errors} show that

393: in both experiments the entropy sampling method holds a definite

394: advantage over the random sampling strategy. In the first experiment, an

395: initial advantage of entropy over the margin based sampling is also

396: observed, but the margin strategy takes over as the number of

397: samples goes beyond 400. It should be noticed, however, that the

398: margin sampling automatically adapts its spatial scale to the

399: increased density of sampling points, while our entropy method does

400: not (the size of the convolution kernel is fixed). In fact, in the

401: experiment {\rm B-{\tt Spiral}}

402: (Fig. \ref{fig:optimal-sampling-errors}b), where the boundary has a

403: more complex structure (and the size of the convolution kernel is smaller),

404: 1000 samples are not sufficient for the margin based method to

405: exhibit an advantage over the entropy method (but the latter loses the

406: initial advantage exhibited in the first experiment).

407:

408: \begin{section}{Final Comments}

409: \label{sec:conclusions}

410:

411: Among the many possible interpretations of learning by boosting, it

412: is promising to create diagnostic indicator functions alternative to

413: margins \cite{SchFreBarLee98} by tracing the dynamics of boosting

414: weights for individual points. We have used entropy (in the punctual

415: and then smoothed versions) as a descriptor of classification

416: uncertainty, identifying easy and hard points, and designing a

417: specific optimal sampling strategy. The strategy needs to be further

418: automated, e.g. considering adaptive selection of smoothing parameters

419: as a function of spatial variability. A direct numerical relationship

420: with the weights of Support Vector expansions is also clearly needed.

421: On the other hand, it would also be interesting to associate the

422: main types of weight dynamics (or point hardness) with the

423: regularity of the boundary surface and of the noise structure.

424:

425: \end{section}

426:

427: \begin{thebibliography}{1}

428:

429: \bibitem{Fed72}

430: V.~Fedorov.

431: \newblock {\em {Theory of Optimal Experiments}}.

432: \newblock Academic Press, New York, 1972.

433:

434: \bibitem{FreSch97}

435: Y.~Freund and R.~E. Schapire.

436: \newblock {A Decision-theoretic Generalization of Online Learning and an

437:   Application to Boosting}.

438: \newblock {\em Journal of Computer and System Sciences}, 55(1):{119--139},

439:   {August} 1997.

440:

441: \bibitem{FriHasTib00}

442: J.~Friedman, T.~Hastie, and R.~Tibshirani.

443: \newblock Additive logistic regression: a statistical view of boosting.

444: \newblock {\em The Annals of Statistics}, 28(2):{337--407}, 2000.

445:

446: \bibitem{LewCat94}

447: D.~D. Lewis and J.~Catlett.

448: \newblock {Heterogeneous Uncertainty Sampling for Supervised Learning}.

449: \newblock In Cohen and Hirsh, editors, {\em Eleventh International Conference

450:   on Machine Learning}, pages {148--156}, {San Francisco}, 1994. {Morgan

451:   Kaufmann}.

452:

453: \bibitem{PreTeuVetFla92}

454: W.~H. Press, S.~A. Teukolsky, W.~T. Vetterling, and B.~P. Flannery.

455: \newblock {\em {Numerical Recipes in C -- The Art of Scientific Computing}}.

456: \newblock Cambridge University Press, second edition, 1992.

457:

458: \bibitem{Qui96}

459: J.~R. Quinlan.

460: \newblock {Bagging, Boosting, and C4.5}.

461: \newblock In {\em {Thirteenth National Conference on Artificial Intelligence}},

462:   pages {725--730}, {Cambridge}, 1996. AAAI Press/MIT Press.

463:

464: \bibitem{RavInt99}

465: Y.~Raviv and N.~Intrator.

466: \newblock {Variance Reduction via Noise and Bias Constraints.}

467: \newblock In A.J.C. Sharkey, editor, {\em {Combining Artificial Neural Nets:

468:   Ensemble and Modular Multi-Net Systems}}, pages {163--175}, {London}, 1999.

469:   Springer-Verlag.

470:

471: \bibitem{SchFreBarLee98}

472: R.~E. Schapire, Y.~Freund, P.~Bartlett, and W.~S. Lee.

473: \newblock {Boosting the Margin: A New Explanation for the Effectiveness of

474:   Voting Methods}.

475: \newblock {\em The Annals of Statistics}, 26(5):{1651--1686}, 1998.

476:

477: \bibitem{Har90}

478: {W. H\"{a}rdle}.

479: \newblock {\em {Applied Nonparametric Regression}}, volume~{19} of {\em

480:   {Econometric Society Monographs}}.

481: \newblock {Cambridge University Press}, 1990.

482:

483: \end{thebibliography}

484:

485: \appendix

486:

487: \begin{section}{Data}

488: \label{sec:appendix-data}

489:

490: Details are given on the data employed in experiments of

491: Sec. \ref{sec:dynamics} and \ref{sec:optimal-design}. Full details and

492: data are accessible at {\tt http://www.mpa.itc.it/nips-2001/data/}.

493:

494: \begin{subsection}{Experiment A}

495: \label{subsec:appendix-data-a}

496:

497: %% This group of data sets was generated for the analysis of the weights

498: %% dynamics.

499:

500: \begin{description}

501:

502:         \item[{\tt Gaussians}:] 4 sets of points (100 points each) were

503: generated by sampling 4 two-dimensional Gaussian distributions,

504: respectively centered in $(-1.0,0.5)$, $(0.0,-0.5)$, $(0.0,0.5)$ and

505: $(1.0,-0.5)$. Covariance matrices were diagonal for all the 4

506: distributions; variance was constant and equal to 0.4. Points coming

507: from the sampling of the first two Gaussians were labelled with class

508: $-1$; the others with class $1$.

509:

510: %% (see Fig. \ref{fig:experiment-a}a).

511:

512:         \item[{\tt Sin}:] The box in $R^{2}$, $R \equiv

513: [-10,10]\times[-5,5]$, was partitioned into two class regions $R_{1}$

514: (upper) and $R_{-1}$ (lower) by means of the curve $C$ of parametric

515: equations:

516:

517: $$

518: C \equiv \left\{

519:     \begin{array}{rcl}

520:       x(t) & = & t \\

521:       y(t) & = & 2 \sin(3 t) \mbox{ if } -10 \leq t \leq 0 ; 0 \mbox{

522:     if } 0 \leq t \leq 10 .\\

523:     \end{array}

524:   \right.

525: $$

526:

527: \noindent

528: 400 two-dimensional data points were generated by randomly sampling region

529: $R$, and labelled with either $-1$ or $1$ according to whether they

530: belonged to $R_{-1}$ or $R_{1}$.

531:

532:         \item[{\tt Spiral}:] As in the previous case, the idea was to

533: have a bipartition of a rectangular subset, $S$, of $R^{2}$ presenting

534: fairly complex boundaries ($S \equiv [-5,5]\times[-5,5]$). Taking

535: inspiration from \cite{RavInt99}, a spiral shaped boundary was

536: defined. 400 two-dimensional data points were then generated by randomly

537: sampling region $S$, and were labelled with either $-1$ or $1$

538: according to whether they belonged to one or the other of the two

539: class regions.

540:

541: \end{description}
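
As an example, data for the {\tt Sin} setting described above can be
generated with the following Python sketch. Uniform sampling of the
box is assumed, and the function names and the seed are hypothetical.

\begin{verbatim}
import math
import random


def sin_boundary(x):
    """Height of the class boundary C at abscissa x."""
    return 2.0 * math.sin(3.0 * x) if x <= 0.0 else 0.0


def make_sin_data(n=400, seed=0):
    """Sample n points uniformly in [-10, 10] x [-5, 5]; label them
    1 above the boundary (upper region) and -1 otherwise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-10.0, 10.0)
        y = rng.uniform(-5.0, 5.0)
        data.append(((x, y), 1 if y > sin_boundary(x) else -1))
    return data
\end{verbatim}

\noindent
Each element of the returned list is a pair made of a location
$(x, y)$ and its class label.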

542:

543: \end{subsection}

544:

545: \begin{subsection}{Experiment B}

546: \label{subsec:experiment-b}

547:

548: This group of data was generated in support of the optimal sampling

549: experiments described in Sec. \ref{sec:optimal-design}. More

550: specifically, two initial data sets, each containing 40 points, were

551: generated for both the {\tt Sin} and {\tt Spiral} settings by

552: employing the same procedures as above. At each round of the optimal

553: sampling procedure, 10 new data points were generated by uniformly

554: sampling a suitable, high entropy subregion of the domain. Data

555: points were then labelled according to their belonging to one or the

556: other of the two class regions.

557:

558: \end{subsection}

559:

560: \end{section}

561:

562: \end{document}
