cs0201014/paper.tex
1: \documentclass{article}
2: \usepackage{nips01,times}
3: \usepackage{graphicx}
4: \usepackage{subfigure}
5: \usepackage{psfig}
6: 
7: %% \documentstyle[nips01]{article}
8: 
9: \title{The Dynamics of AdaBoost Weights \\ Tells You What's Hard to Classify}
10: 
11: \author{B. Caprile\\
12: ITC-irst \\
13: I-38050 Povo, Trento\\
14: Italy\\
15: {\it caprile@itc.it} \\
16: \And
17: C. Furlanello \\
18: ITC-irst \\
19: I-38050 Povo, Trento\\
20: Italy\\
21: {\it furlan@itc.it} \\\\
22: \And
23: S. Merler\\
24: ITC-irst \\
25: I-38050 Povo, Trento\\
26: Italy\\
27: {\it merler@itc.it} \\\\
28: }
29: 
30: 
31: 
32: \begin{document}
33: \maketitle
34: 
35: \bibliographystyle{plain}
36: 
37: \newcommand{\REM}[1]{
38: {\bf #1}
39: }
40: 
41: \newcommand{\Ada}{AdaBoost
42: }
43: 
44: \newcommand{\Adaa}{{\tt AdaBoost} algorithm
45: }
46: 
47: 
48: \begin{abstract}
49: The dynamical evolution of weights in the \Ada algorithm contains
50: useful information about the r{\^o}le that the associated data points
51: play in building the \Ada model. In particular, the dynamics
52: induces a bipartition of the data set into two (easy/hard)
53: classes. Easy points have no influence on the making of the model,
54: while the varying relevance of hard points can be gauged in terms of
55: an entropy value associated with their evolution. Smooth approximations
56: of entropy highlight regions where classification is most
57: uncertain. Promising results are obtained when the proposed methods are
58: applied in the Optimal Sampling framework.
59: \end{abstract}
60: 
61: 
62: \begin{section}{Introduction}
63: 
64: In this paper we investigate the boosting weight dynamics induced by
65: classification procedures of the AdaBoost family
66: \cite{FreSch97,SchFreBarLee98}, and show how it can be exploited
67: to highlight points and regions of uncertain
68: classification. Friedman et al. \cite{FriHasTib00} proposed to analyze
69: and trim the distribution of weights over a training sample in order
70: to reduce computation without sacrificing accuracy. Here, we focus
71: instead on tracking the dynamics of the boosting weight of individual
72: points. By introducing the notion of entropy of the weight evolution,
73: we can clarify the notions of ``easy'' and ``hard'' points as the
74: two types of weight dynamics being observed: in particular, in
75: different classification tasks and with different base models it is
76: found that a group of points may be selected which have very low
77: (ideally, zero) entropy of weight evolution: the easy points. In this
78: framework, we can answer questions such as: do easy points play any role in
79: building the AdaBoost model? For hard points, can different degrees
80: of ``hardness'' be identified which account for different degrees of
81: classification uncertainty? Do easy/hard points show any preference about
82: where to concentrate? The first two questions are clearly connected to
83: equivalent results in the framework of Support Vector Machines: in a
84: number of experiments, hard points are
85: indeed found mostly near the classification boundary.  In the second
86: part of this paper, the smooth approximation (by kernel regression) of
87: the weight entropy at training data is proposed as an indicator
88: function of classification uncertainty, thereby obtaining a region
89: highlighting methodology. As a natural application, 
90: a strategy for optimal sampling in classification tasks was implemented:
91: compared with uniform random sampling, the entropy-based strategy is
92: clearly more effective. Moreover, it compares favorably with an
93: alternative margin-based sampling strategy. 
94: 
95: \end{section}
96: 
97: \begin{section}{The Dynamics of Weights}
98: \label{sec:dynamics}
99: 
100: In the present section, the dynamics that the \Ada algorithm sets over
101: the weights is singled out for study. In particular, the intuition is
102: substantiated that the evolution of weights yields information about
103: the varying relevance that different data points have in building
104: the \Ada model. 
105: 
106: Let $D \equiv \{{\bf x}_{i}, y_{i}\}_{i=1}^{N}$ be a two-class set of
107: data points, where the ${\bf x}_{i}$s belong to a suitable region,
108: $X$, of some (metric) feature space, and $y_{i}$ takes values in $\{1,
109: -1\}$, for $1 \leq i \leq N$. The \Ada algorithm iteratively builds a
110: class membership estimator over $X$ as a thresholded linear
111: superposition of different realizations, $M_{k}$, of the same base
112: model, $M$. Any model instance, $M_{k}$, resulting from training at
113: step $k$ depends on the values taken at the same step by a set of $N$
114: numbers (in the following, the {\em weights}), ${\bf w} = (w_{1}, \dots,
115: w_{N})$ -- one for each data point. After training, weights are
116: updated: those associated with points misclassified by the current model
117: instance are increased, while those associated with correctly
118: classified points are decreased. An interesting variant of
119: this basic scheme consists in training the different realizations of
120: the base model, not on the whole data set, but on Bootstrap replicates
121: of it \cite{Qui96}. In this second scheme, samples are drawn
122: according to the discrete probability distribution defined by the
123: weights associated with data points, normalized to sum to one.
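
For concreteness, the update just described can be sketched in one
common formulation of discrete AdaBoost. This is given as an
assumption: the symbols $\epsilon_{k}$, $\alpha_{k}$ and $Z_{k}$ are
introduced here for illustration only, and the variant used in the
experiments may differ in its details. With $M_{k}({\bf x}) \in
\{1,-1\}$ and weights $w_{i}^{(k)}$ normalized to sum to one at step
$k$,
$$
\begin{array}{rcl}
\epsilon_{k} & = & \sum_{i:\, M_{k}({\bf x}_{i}) \neq y_{i}} w_{i}^{(k)} , \\
\alpha_{k} & = & \frac{1}{2} \ln \frac{1-\epsilon_{k}}{\epsilon_{k}} , \\
w_{i}^{(k+1)} & = & w_{i}^{(k)} \exp \left( -\alpha_{k}\, y_{i}\, M_{k}({\bf x}_{i}) \right) / Z_{k} ,
\end{array}
$$
where $Z_{k}$ rescales the new weights to sum to one. For $0 <
\epsilon_{k} < 1/2$, the weight of a misclassified point is multiplied
by $e^{\alpha_{k}} > 1$ and that of a correctly classified point by
$e^{-\alpha_{k}} < 1$ before rescaling. In the Bootstrap variant,
point $i$ would be drawn at step $k$ with probability $w_{i}^{(k)}$.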
124: 
125: In Fig. \ref{fig:weights-traces-and-histograms}a the plots are
126: reported of the evolution of the weights associated to 3 data points
127: when the \Ada algorithm is applied to a simple binary classification
128: task on synthetic two-dimensional data (experiment A-{\tt Gaussians}
129: as described in Sec. \ref{subsec:appendix-data-a}). Except for
130: occasional bursts, the weight associated to the first point goes
131: rapidly to zero, while the weights associated to the second and third
132: point keep on going up and down in a seemingly chaotic fashion. Our
133: experience is that these two types of behaviour are not specific to
134: the case under consideration, but can be observed in any \Ada
135: experiment. Moreover, {\em tertium non datur}, i.e., no other
136: qualitative behaviour is observed (as, for example, that some weight
137: tends to a strictly positive value).
138: 
139: \begin{subsection}{Easy Vs. Hard Data Points}
140: \label{easy-hard-data-points}	
141: 
142: \begin{figure*}[ht]
143:   \begin{center} 
144:     \leavevmode
145:     \psfig{figure=gaussian-5000-weights-trace-1.epsi,width=0.3\textwidth}
146:     \psfig{figure=gaussian-5000-weights-trace-2.epsi,width=0.3\textwidth}
147:     \psfig{figure=gaussian-5000-weights-trace-3.epsi,width=0.3\textwidth}(a)
148:     \psfig{figure=gaussian-5000-histogram-1.epsi,width=0.3\textwidth}
149:     \psfig{figure=gaussian-5000-histogram-2.epsi,width=0.3\textwidth}
150:     \psfig{figure=gaussian-5000-histogram-3.epsi,width=0.3\textwidth}(b)
151:     \caption{{\em Evolution of weights in the \Ada algorithm. (a)
152:     The evolutions over 5000 steps of the \Ada algorithm are reported
153:     for the weights associated to 3 data points of experiment {\rm
154:     A-{\tt Gaussians}}. From left to right: an ``easy'' data point
155:     (the weight tends to zero), and two ``hard'' data points (the
156:     weight follows a seemingly random pattern). (b) The corresponding
157:     frequency histograms.}}
158:     
159:     \label{fig:weights-traces-and-histograms}
160:   \end{center}
161: \end{figure*}
162: 
163: The hypothesis therefore emerges that the \Ada algorithm induces a
164: partition of data points into two classes: on one side the points
165: whose weight tends rapidly to zero; on the other, the points whose
166: weight shows an apparently chaotic behaviour. In fact, the hypothesis is
167: perfectly consistent with the rationale underlying the \Ada algorithm:
168: weights associated with those data points that several model instances
169: classify correctly even when they are {\em not} contained in the
170: training sample follow the first kind of behaviour. In practice,
171: independently of which bootstrap sample is extracted, these points are
172: classified correctly, and their weight is consequently decreased again
173: and again. We call them the ``easy'' points. The second type of
174: behaviour is followed by the points that, when not contained in the
175: training set, happen to be often misclassified. A series of
176: misclassifications makes the weight associated with any such point
177: increase, thereby increasing the probability for the point to be
178: contained in the following bootstrap sample. As the probability
179: increases and the point is finally extracted (and classified
180: correctly), its weight is decreased; this in turn makes the point less
181: likely to be extracted -- and so forth. We call points of this kind
182: ``hard''.
183: 
184: In Fig. \ref{fig:weights-traces-and-histograms}b, histograms are
185: reported of the values that the weights associated to the same 3 data
186: points of Fig. \ref{fig:weights-traces-and-histograms}a take over the
187: same 5000 iterations of the \Ada algorithm. As expected, the histogram
188: of (easy) point 1 is very much squeezed towards zero (more than 80\%
189: of weights lie below $10^{-6}$). Histograms of (hard) points 2 and 3
190: exhibit the same Gamma-like shape, but differ markedly in mean
191: and dispersion. Naturally, the first question is
192: whether any limit exists for these distributions. For each data point,
193: two unbinned cumulative distributions were therefore built by taking
194: the weights generated by the first 3000 steps of the \Ada algorithm,
195: and those generated over the whole 5000 steps. The same-distribution
196: hypothesis was then tested by means of the Kolmogorov-Smirnov (KS)
197: test \cite{PreTeuVetFla92}. Results are reported in
198: Fig. \ref{fig:mean-vs-entropy-ks-test-and-histogram}a, where
199: $p$-values are plotted against the mean value of all 5000 values. It
200: is interesting to notice that for mean values close to 0 (easy points)
201: the same-distribution hypothesis is always rejected, while it is
202: typically not rejected for higher values (hard points). It seems that
203: easy points may be confidently identified by simply considering the
204: average of their weight distribution. A binary LDA classifier was
205: therefore trained on the data of
206: Fig. \ref{fig:mean-vs-entropy-ks-test-and-histogram}a. By setting a
207: $p$-value threshold equal to 0.05, the resulting {\em precision} (the
208: complement to 1 of the fraction of false negatives) was equal to 0.79
209: and {\em recall} (the complement to 1 of the fraction of false
210: positives) was equal to 0.96.
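
For reference, the statistic underlying the test can be written as
follows. The notation is introduced here for illustration, and the
$p$-value approximation is the one given in \cite{PreTeuVetFla92},
which is assumed (the text does not confirm it) to be the routine
actually applied. Denoting by $F_{3000}$ and $F_{5000}$ the empirical
cumulative distributions of a point's weight over the first 3000 and
over all 5000 steps,
$$
D = \sup_{w} \left| F_{3000}(w) - F_{5000}(w) \right| , \qquad
N_{e} = \frac{N_{1} N_{2}}{N_{1}+N_{2}} = \frac{3000 \cdot 5000}{8000} = 1875 ,
$$
$$
p \approx Q_{KS}\left( \left[ \sqrt{N_{e}} + 0.12 + 0.11/\sqrt{N_{e}} \right] D \right) , \qquad
Q_{KS}(\lambda) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^{2} \lambda^{2}} .
$$
A small $p$-value thus corresponds to a large distance $D$ between the
two cumulative distributions, i.e., to a weight distribution that has
not yet stabilized.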
211: 
212: \end{subsection}
213: 
214: \begin{subsection}{Entropy}
215: \label{subsec:entropy}
216: 
217: Can we do any better at separating easy points from hard ones? For
218: hard points, can different degrees of ``hardness'' be identified which
219: account for different degrees of classification uncertainty? What we
220: are going to show is that by associating a notion of {\em entropy} to
221: the evolutions of weights both questions can be answered in the
222: positive. To this end, the interval $[0,1]$ is partitioned into $L$
223: subintervals of length $1/L$, and the entropy value is computed as
224: $-\sum_{i=1}^{L} f_{i} \log_{2} f_{i}$, where $f_{i}$ is the relative
225: frequency of weight values falling in the $i$-th subinterval ($0
226: \log_{2} 0$ is set to $0$). For our cases, $L$ was set to 1000.
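
As a worked illustration of this definition, take the entropy with
the usual leading minus sign, so that it is non-negative. The symbols
$h$ and $m$ are introduced here for illustration only:
$$
h = -\sum_{i=1}^{L} f_{i} \log_{2} f_{i} , \qquad 0 \leq h \leq \log_{2} L .
$$
If all the weight values of a point fall into a single subinterval,
then $f_{i} = 1$ for that subinterval and $h = 0$. If instead they
are spread evenly over $m$ subintervals, then $f_{i} = 1/m$ on each of
them and $h = \log_{2} m$; for example, $m = 8$ gives $h = 3$. With $L
= 1000$ the largest attainable value is $\log_{2} 1000 \approx 9.97$.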
227: 
228: \begin{figure*}[ht]
229:   \begin{center}
230:     \leavevmode
231: 	\psfig{figure=ks-test-mean.epsi,width=0.29\textwidth}(a)
232: %%    \psfig{figure=figures/gaussian-5000-weights-mean-vs-entropy.epsi,width=0.28\textwidth}(a)
233:     \psfig{figure=ks-test-entropy.epsi,width=0.29\textwidth}(b)
234:     \psfig{figure=entropy-histogram.epsi,width=0.29\textwidth}(c)
235: %%     \caption{{\em Mean Vs. entropy plot for the weights frequency
236: %%         histograms of the 400 data points of experiment {\rm A-{\tt
237: %%             Gaussians}}. Marked data points are those whose evolution
238: %%         and frequency histograms are reported in Fig.
239: %%         \ref{fig:weights-traces-and-histograms}. The vertical line
240: %%         shows the value of the initial weights. (b) $p$-values of the
241: %%         Kolmogorov-Smirnov test are plotted against entropy of
242: %%         frequency histograms. High values of the entropy indicate
243: %%         stability of frequency histograms. (c) Histogram of entropy
244: %%         values for the 400 data points of experiment {\rm A-{\tt
245: %%             Gaussians}}. Low entropy points are clearly separable from
246: %%         the others.}}
247:     \caption{{\em Separating easy from hard points. (a) $p$-values of
248:     the KS test Vs. mean values of frequency histograms. (b)
249:     $p$-values of the KS test Vs. entropy of frequency histograms. As
250:     in (a), the horizontal line marks the threshold value for the LDA
251:     classifier. (c) Histogram of entropy values for the 400 data
252:     points of experiment {\rm A-{\tt Gaussians}}.}}
253:     
254:     \label{fig:mean-vs-entropy-ks-test-and-histogram}
255:   \end{center}
256: \end{figure*}
257: 
258: Qualitatively, the relationship between entropy and $p$-values of the
259: KS test is similar to the one holding for the mean
260: (Fig. \ref{fig:mean-vs-entropy-ks-test-and-histogram}a-b). Quantitatively,
261: however, a difference is observed, since the LDA classifier trained on
262: these data performs much better in precision and slightly worse in
263: recall (respectively, 0.99 and 0.90, as compared to 0.79 and
264: 0.96). This implies that the class of easy points can be identified
265: with higher confidence by using the entropy in place of the mean value
266: of the distribution. Further support for the hypothesis of a bipartite
267: (easy/hard) nature of data points is gained by observing the frequency
268: histogram of entropies for the 400 points of experiment A-{\tt
269: Gaussians} (Fig. \ref{fig:mean-vs-entropy-ks-test-and-histogram}c),
270: from which two groups of data points emerge as clearly separated. The
271: first is the zero entropy group of easy points, and the second is the
272: group of hard points.
273: 
274: Do easy/hard points show any preference about where to concentrate?
275: In Fig. \ref{fig:using-entropy}a hard and easy points are shown as
276: determined for the experiment A-{\tt Sin} (see
277: Sec. \ref{subsec:appendix-data-a} for details). Hard points are mostly
278: found near the two-class boundary; yet, their density is much lower
279: along the straight segment of the boundary (where the boundary is
280: smoother), and they therefore appear to concentrate where the
281: classification uncertainty is highest. Easy points do the
282: opposite. Considering that easy points stay well clear of the boundary
283: (i.e., hard points typically interpose between them and the boundary),
284: what one may then question is whether they play any r{\^o}le in
285: building the \Ada model. The answer is no. In fact, the models built
286: disregarding the easy points are practically the same as the models
287: built on the complete data set. In the experiment of
288: Fig. \ref{fig:using-entropy} only $0.55\%$ of $10000$ test points
289: were classified differently by the two models, despite the
290: reduction of the training set from $400$ to only $111$ (hard)
291: points. 
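
In absolute terms, the figures quoted above amount to
$$
0.0055 \times 10000 = 55 \mbox{ test points classified differently} , \qquad
\frac{400-111}{400} = \frac{289}{400} \approx 72\% \mbox{ of training points removed} .
$$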
292: 
293: \end{subsection}
294: 
295: \begin{subsection}{Smoothing the Entropy}
296: \label{subsec:extending-entropy}
297: 
298: In the previous section, the entropy of the weight frequency histogram
299: was introduced as an indicator of the uncertainty of classifying the
300: associated data point as belonging to class $-1$ or $1$. By defining a
301: smooth approximation to the pointwise entropy values associated with data
302: points, we now extend the notion of classification uncertainty to the
303: whole domain of our binary classifier. For simplicity's sake, kernel
304: regression was employed -- i.e., the entropy values at data points are
305: convolved with a Gaussian kernel of fixed bandwidth \cite{Har90}. In
306: so doing, a scalar entropy function, $H = H({\bf x})$, is defined on
307: $X$. In Fig. \ref{fig:using-entropy}b, the grey levels encode the
308: values of $H$ (increasing from black to white) for the experiment {\rm
309: A-{\tt Sin}}.
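
A minimal sketch of such a smoother is given below, assuming the
normalized (Nadaraya--Watson) form of kernel regression with an
isotropic Gaussian kernel. The symbols $h_{i}$ (the entropy value at
data point ${\bf x}_{i}$) and $\sigma$ (the bandwidth) are introduced
here for illustration, and the exact normalization used in the
experiments is not specified in the text:
$$
H({\bf x}) = \frac{\sum_{i=1}^{N} K_{\sigma}({\bf x}-{\bf x}_{i})\, h_{i}}
                  {\sum_{i=1}^{N} K_{\sigma}({\bf x}-{\bf x}_{i})} , \qquad
K_{\sigma}({\bf u}) = \exp \left( -\frac{\| {\bf u} \|^{2}}{2 \sigma^{2}} \right) .
$$
In this form $H({\bf x})$ is a weighted average of the $h_{i}$, so it
always lies between the smallest and the largest pointwise entropy. A
small $\sigma$ follows the values at nearby data points closely, while
a large $\sigma$ flattens $H$ towards the overall mean of the $h_{i}$.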
310: 
311: \begin{figure*}[ht]
312:   \begin{center} \leavevmode
313:     \psfig{figure=sinusoidal-5000-leaving-out-easy-points.epsi,height=0.4\textwidth}(a)
314:     \psfig{figure=sinusoidal-convolution-0.5.epsi,height=0.4\textwidth}(b)
315:     \caption{{\em (a) Easy (white) and hard (black) data points of
316:     experiment A-{\tt Sin} obtained by thresholding the histogram of
317:     entropy. Squares and circles indicate the class. (b) Level-plot of
318:     the $H$ function. Grey levels encode $H$ values (see scale on the
319:     right).}}  
320: \label{fig:using-entropy} 
321: \end{center}
322: \end{figure*}
323: 
324: The method appears capable of highlighting regions where
325: classification turns out uncertain -- due to the distribution of data
326: points, the morphology of the class boundary or both. Of course,
327: function $H$ depends on the geometric properties specific to the base
328: model adopted, and its degree of smoothness depends on the size of the
329: convolution kernel. It should be noted, however, that the
330: bias/variance balance can be controlled by suitably tuning the
331: convolution parameters. Finally, more sophisticated local smoothing
332: techniques may be employed as well (e.g., Radial Basis Functions)
333: which may adapt to directionality, known morphology of the boundary or
334: local density of sample points.
335: 
336: \end{subsection}
337: 
338: \end{section}
339: 
340: \begin{section}{An Application to Optimal Sampling}
341: \label{sec:optimal-design}
342: 
343: To illustrate the applicability of the notions developed above to
344: practical cases, we refer to the framework of optimal sampling
345: \cite{Fed72}. In general, an optimal sampling problem is one in which
346: a {\em cost} is associated with the acquisition of data points, in such
347: a way that solving the problem consists not only in minimizing the
348: classification (or regression) error but also in keeping the sampling
349: cost as low as possible. A typical setting for this class of problems
350: is the one in which we start from an assigned set of (sparse) data
351: points, and we then incrementally add points to the training set on
352: the basis of certain information extracted from intermediate
353: results. 
354: 
355: %% Training points may already belong to some pre-assigned,
356: %% unlabelled totality, or may be chosen and labelled at run time.
357: 
358: \begin{figure*}[ht]
359:   \begin{center}
360:     \leavevmode
361:     \psfig{figure=sinusoidal-incremental-40-1000-x10-error.epsi,width=0.45\textwidth}(a)
362:     \psfig{figure=spiral-incremental-40-1000-x10-error.epsi,width=0.45\textwidth}(b)
363:     \caption{{\em Misclassification error as a function of the number
364:     of training points for the entropy based scheme is compared to
365:     the uniform random sampling and the margin sampling
366:     strategy. (a) Experiment {\rm B-{\tt Sin}}. (b) Experiment {\rm B-{\tt Spiral}}.}}
367:     \label{fig:optimal-sampling-errors} 
368:   \end{center} 
369: \end{figure*}
370: 
371: 
372: For the experiments reported below, which are based on the same
373: settings as {\tt Sin} and {\tt Spiral} of
374: Sec. \ref{subsec:appendix-data-a} (see also
375: Sec. \ref{subsec:experiment-b} for details), we started from a small
376: set of sparse two-dimensional binary classification
377: data. High-uncertainty areas are identified by means of the method
378: described in Sec. \ref{subsec:extending-entropy}, and additional
379: training points are chosen in these areas. Assuming a unit cost for
380: each new point, performance of the procedure is finally evaluated by
381: analyzing the sampling cost against the classification error.
382: 
383: In Fig. \ref{fig:optimal-sampling-errors}, two plots are reported of
384: the classification error as a function of the number of training
385: points. Comparison is made with a blind (uniform random) sampling
386: strategy, and with a specialization of the {\em uncertainty sampling
387: strategy} proposed in \cite{LewCat94}. The latter consists
388: in adding training points where the classifier is least certain of
389: class membership. In particular, the classifier was the \Ada model and
390: the uncertainty indicator was the margin of the prediction.
391: 
392: Results reported in Fig. \ref{fig:optimal-sampling-errors} show that
393: in both experiments the entropy sampling method holds a definite
394: advantage over the random sampling strategy. In the first experiment, an
395: initial advantage of entropy over the margin based sampling is also
396: observed, but the margin strategy takes over as the number of
397: samplings goes beyond 400. It should be noted, however, that the
398: margin sampling automatically adapts its spatial scale to the
399: increased density of sampling points, while our entropy method does
400: not (the size of the convolution kernel is fixed). In fact, in the
401: experiment {\rm B-{\tt Spiral}}
402: (Fig. \ref{fig:optimal-sampling-errors}b), where the boundary has a
403: more complex structure (and the convolution kernel is smaller),
404: 1000 samplings are not sufficient for the margin based method to
405: exhibit an advantage over the entropy method (but the latter loses the
406: initial advantage exhibited in the first experiment).
407: \end{section}
408: \begin{section}{Final Comments}
409: \label{sec:conclusions}
410: 
411: Among the many possible interpretations of learning by boosting, it
412: is promising to create diagnostic indicator functions alternative to
413: margins \cite{SchFreBarLee98} by tracing the dynamics of boosting
414: weights for individual points. We have used entropy (in the punctual
415: and then smoothed versions) as a descriptor of classification
416: uncertainty, identifying easy and hard points, and designing a
417: specific optimal sampling strategy. The strategy needs to be further
418: automated, e.g. by considering adaptive selection of smoothing parameters
419: as a function of spatial variability. A direct numerical relationship
420: with the weights of Support Vector expansions is also clearly needed.
421: On the other hand, it would also be interesting to associate the
422: main types of weight dynamics (or point hardness) with the
423: regularity of the boundary surface and of the noise structure.
424: 
425: \end{section}
426: 
427: \begin{thebibliography}{1}
428: 
429: \bibitem{Fed72}
430: V.~Fedorov.
431: \newblock {\em {Theory of Optimal Experiments}}.
432: \newblock Academic Press, New York, 1972.
433: 
434: \bibitem{FreSch97}
435: Y.~Freund and R.~E. Schapire.
436: \newblock {A Decision-theoretic Generalization of Online Learning and an
437:   Application to Boosting}.
438: \newblock {\em Journal of Computer and System Sciences}, 55(1):{119--139},
439:   {August} 1997.
440: 
441: \bibitem{FriHasTib00}
442: J.~Friedman, T.~Hastie, and R.~Tibshirani.
443: \newblock Additive logistic regression: a statistical view of boosting.
444: \newblock {\em The Annals of Statistics}, 2000.
445: 
446: \bibitem{LewCat94}
447: D.~D. Lewis and J.~Catlett.
448: \newblock {Heterogeneous Uncertainty Sampling for Supervised Learning}.
449: \newblock In Cohen and Hirsh, editors, {\em Eleventh International Conference
450:   on Machine Learning}, pages {148--156}, {San Francisco}, 1994. {Morgan
451:   Kaufmann}.
452: 
453: \bibitem{PreTeuVetFla92}
454: W.~H. Press, S.~A. Teukolsky, W.~T. Vetterling, and B.~P. Flannery.
455: \newblock {\em {Numerical Recipes in C -- The Art of Scientific Computing}}.
456: \newblock Cambridge University Press, second edition, 1992.
457: 
458: \bibitem{Qui96}
459: J.R. Quinlan.
460: \newblock {Bagging, Boosting, and C4.5}.
461: \newblock In {\em {Thirteenth National Conference on Artificial Intelligence}},
462:   pages {725--730}, {Cambridge}, 1996. AAAI Press/MIT Press.
463: 
464: \bibitem{RavInt99}
465: Y.~Raviv and N.~Intrator.
466: \newblock {Variance Reduction via Noise and Bias Constraints.}
467: \newblock In A.J.C. Sharkey, editor, {\em {Combining Artificial Neural Nets:
468:   Ensemble and Modular Multi-Net Systems}}, pages {163--175}, {London}, 1999.
469:   Springer-Verlag.
470: 
471: \bibitem{SchFreBarLee98}
472: R.~E. Schapire, Y.~Freund, P.~Bartlett, and W.~S. Lee.
473: \newblock {Boosting the Margin: A New Explanation for the Effectiveness of
474:   Voting Methods}.
475: \newblock {\em The Annals of Statistics}, 26(5):{1651--1686}, 1998.
476: 
477: \bibitem{Har90}
478: {W. H\"{a}rdle}.
479: \newblock {\em {Applied Nonparametric Regression}}, volume~{19} of {\em
480:   {Econometric Society Monographs}}.
481: \newblock {Cambridge University Press}, 1990.
482: 
483: \end{thebibliography}
484: 
485: \appendix
486: 
487: \begin{section}{Data}
488: \label{sec:appendix-data}
489: 
490: Details are given on the data employed in the experiments of
491: Secs. \ref{sec:dynamics} and \ref{sec:optimal-design}. Full details and
492: data are accessible at {\tt http://www.mpa.itc.it/nips-2001/data/}.
493: 
494: \begin{subsection}{Experiment A}
495: \label{subsec:appendix-data-a}
496: 
497: %% This group of data sets was generated for the analysis of the weights
498: %% dynamics. 
499: 
500: \begin{description}
501: 
502:         \item[{\tt Gaussians}:] 4 sets of points (100 points each) were
503: generated by sampling 4 two-dimensional Gaussian distributions,
504: respectively centered at $(-1.0,0.5)$, $(0.0,-0.5)$, $(0.0,0.5)$ and
505: $(1.0,-0.5)$. Covariance matrices were diagonal for all the 4
506: distributions; variance was constant and equal to 0.4. Points coming
507: from the sampling of the first two Gaussians were labelled with class
508: $-1$; the others with class $1$.
509: 
510: %% (see Fig. \ref{fig:experiment-a}a).
511: 
512:         \item[{\tt Sin}:] The box in $R^{2}$, $R \equiv
513: [-10,10]\times[-5,5]$, was partitioned into two class regions $R_{1}$
514: (upper) and $R_{-1}$ (lower) by means of the curve $C$ with parametric
515: equations:
516: 
517: $$ 
518: C \equiv \left\{
519:     \begin{array}{rcl}
520:       x(t) & = & t \\
521:       y(t) & = & 2 \sin(3 t) \mbox{ if } -10 \leq t \leq 0 ; \quad 0 \mbox{
522:     if } 0 \leq t \leq 10 .\\
523:     \end{array}
524:   \right. 
525: $$
526: 
527: \noindent
528: 400 two-dimensional data points were generated by randomly sampling region
529: $R$, and labelled with either $-1$ or $1$ according to whether they
530: belonged to $R_{-1}$ or $R_{1}$.
531: 
532:         \item[{\tt Spiral}:] As in the previous case, the idea was to
533: have a bipartition of a rectangular subset, $S$, of $R^{2}$ presenting
534: fairly complex boundaries ($S \equiv [-5,5]\times[-5,5]$). Taking
535: inspiration from \cite{RavInt99}, a spiral shaped boundary was
536: defined. 400 two-dimensional data points were then generated by randomly
537: sampling region $S$, and were labelled with either $-1$ or $1$
538: according to whether they belonged to one or the other of the two
539: class regions.
540: 
541: \end{description}
542: 
543: \end{subsection}
544: 
545: \begin{subsection}{Experiment B}
546: \label{subsec:experiment-b}
547: 
548: This group of data was generated in support of the optimal sampling
549: experiments described in Sec. \ref{sec:optimal-design}. More
550: specifically, two initial data sets, each containing 40 points, were
551: generated for both the {\tt Sin} and {\tt Spiral} settings by
552: employing the same procedures as above. At each round of the optimal
553: sampling procedure, 10 new data points were generated by uniformly
554: sampling a suitable, high entropy subregion of the domain. Data
555: points were then labelled according to their belonging to one or the
556: other of the two class regions.
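
One way to make the notion of a ``suitable, high entropy subregion''
concrete is sketched below. It is given only as an assumption, since
the selection rule is not specified above: the smoothed entropy
function $H$ of Sec. \ref{subsec:extending-entropy} is thresholded at
a level $\theta$ (a hypothetical parameter), and new points are drawn
uniformly from the resulting set,
$$
S_{\theta} = \{ {\bf x} : H({\bf x}) \geq \theta \} , \qquad
{\bf x}_{\rm new} \sim \mbox{Uniform}(S_{\theta}) ,
$$
with ${\bf x}$ ranging over the domain of the experiment. Starting
from 40 points and adding 10 per round, the training set contains $40
+ 10\,r$ points after $r$ rounds.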
557: 
558: \end{subsection}
559: 
560: \end{section}
561: 
562: \end{document}