cond-mat0610342/semi.tex
1: \documentclass[preprintnumbers,amsmath,amssymb]{revtex4}
2: 
3: \usepackage{epsfig}
4: \usepackage{rotating}
5: \usepackage{graphicx}% Include figure files
6: \usepackage{dcolumn}% Align table columns on decimal point
7: \usepackage{bm}% bold math
8: 
9: %\nofiles
10: 
11: \begin{document}
12: 
13: 
14: \title{
15: Semi-supervised learning by search of optimal target vector}
16: 
17: \author{Leonardo Angelini, Daniele Marinazzo, Mario Pellicoro and Sebastiano Stramaglia}
18: 
19: \affiliation{ TIRES-Center of Innovative Technologies for Signal Detection
20: and Processing, \\ Universit\`a di Bari, Italy \\
21: Dipartimento Interateneo di Fisica, Bari, Italy \\
22: Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Italy
23: }
24: 
32: 
33: \date{\today}
34: 
35: 
36: 
37: \begin{abstract}
38: We introduce a semi-supervised learning estimator which tends to the
39: first kernel principal component as the number of labeled points
40: vanishes. Our approach is based on the notion of optimal target
41: vector, which is defined as follows. Given an input data-set of
42: ${\bf x}$ values, the optimal target vector $\mathbf{y}$ is such
43: that treating it as the target and using kernel ridge regression to
44: model the dependency of $y$ on ${\bf x}$, the training error
45: achieves its minimum value. For an unlabeled data set, the first
46: kernel principal component is the optimal vector. In the case one is
47: given a partially labeled data set, still one may look for the
48: optimal target vector minimizing the training error. We use this new
49: estimator in two ways: first, as a substitute for kernel principal
50: component analysis when some labeled data are available, to
51: perform dimensionality reduction; second, to develop a
52: semi-supervised regression and classification algorithm for
53: transductive inference. We show applications of the proposed method
54: in both directions.
55: \end{abstract}
56: 
57: \maketitle
58: 
59: \section{Introduction}
60: \label{intro} The problem of effectively combining {\it unlabeled} data with {\it
61: labeled} data, semi-supervised learning, is of central importance in machine learning;
62: see, for example, \cite{zu,zhu,ch} and references therein. Semi-supervised learning
63: methods usually assume that adjacent points and/or points in the same structure (group,
64: cluster) should have similar labels; one may assume that data are situated on a low
65: dimensional manifold which can be approximated by a weighted discrete graph whose
66: vertices are identified with the empirical (labeled and unlabeled) data points. This can
67: be seen as a form of regularization \cite{smo}. A common feature of these methods, see
68: also \cite{arg}, is that, as the number of labeled points vanishes, the solution tends to
69: the constant vector. An interesting survey on semi-supervised learning literature may be
70: found on the web \cite{zz}. Improving regression with unlabeled data is the problem
71: considered in \cite{zhou}, where co-training is achieved using k-NN regressors. A
72: statistical physics approach, based on the Potts model, is described in \cite{getz}. An
73: issue closely related to semi-supervised learning is active learning: some attempts to
74: combine active learning and semi-supervised learning have been made \cite{zzz}.
75: 
76: 
77: The purpose of this work is to introduce a semi-supervised learning estimator which, as
78: the number of labeled points vanishes, tends to the first kernel principal component
79: \cite{kpca}; when a suitable number of labeled points is available, it may be used for
80: transductive inference \cite{vapnik}. Our approach is based on the following fact. Given
81: an unlabeled data set, its first kernel principal component is such that, treating it as
82: target vector, supervised kernel ridge regression provides the minimum training error.
83: Now, suppose that you are given a partially labeled data set: still one may look for the
84: target vector minimizing the training error. This optimal target vector may be seen as
85: the generalization of the first kernel principal component to the semi-supervised case.
86: 
87: The paper is organized as follows. In the next Section we describe our approach, while in
88: Section 3 the experiments we performed are described. Some conclusions are drawn in
89: Section 4.
90: 
91: \section{Methods} \label{methods}
92: \subsection{Kernel ridge regression}
93: We briefly recall the properties of kernel ridge regression (KRR), while referring the
94: reader to \cite{st} for further technical details. Let us consider a set of $\ell$
95: independent, identically distributed data $S=\{ ({\bf x}_i, y_i) \}_{i=1}^\ell$, where
96: ${\bf x}_i$ is the $n$-dimensional vector of input variables and $y_i$ is the scalar
97: output variable. Data are drawn from an unknown probability distribution; we assume that
98: both ${\bf x}$ and $y$ have been centered, i.e. they have been linearly transformed to
99: have zero mean. The regularized linear predictor is $y={\bf w}\cdot{\bf x}$, where ${\bf
100: w}$ minimizes the following functional:
101: \begin{equation}\label{lagrangian-function-without-b}
102: L(\mathbf{w}) =  \sum_{i=1}^\ell \left ( y_i - {\bf w}\cdot {\bf x}_i \right )^2 +
103: \lambda || {\mathbf w} ||^2.
104: \end{equation}
105: Here $|| \mathbf{w} || = \sqrt{\bf{w}\cdot\bf{w}} $ and $\lambda >0$ is the
106: regularization parameter. For $\lambda  =0$, predictor
107: (\ref{lagrangian-function-without-b}) is invariant when new variables, statistically
108: independent of input and target variables, are added to the set of input variables (IIV
109: property, \cite{as}). One may show that this invariance property holds, for
110: (\ref{lagrangian-function-without-b}), also at finite $\lambda
111: > 0$.
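
As a brief worked step, sketched here in notation that is not used elsewhere in the paper
($\mathbf{X}$ denotes the $\ell\times n$ matrix whose $i$-th row is ${\bf x}_i^\top$,
$\mathbf{y}=(y_1,\ldots,y_\ell)^\top$, and $\mathbf{I}$ stands for the identity matrix of the
appropriate size), setting the gradient of (\ref{lagrangian-function-without-b}) to zero gives
\[
\mathbf{w}=\left(\mathbf{X}^\top\mathbf{X}+\lambda\mathbf{I}\right)^{-1}\mathbf{X}^\top\mathbf{y}
=\mathbf{X}^\top\left(\mathbf{X}\mathbf{X}^\top+\lambda\mathbf{I}\right)^{-1}\mathbf{y},
\]
where the second equality follows from the identity
$\left(\mathbf{X}^\top\mathbf{X}+\lambda\mathbf{I}\right)\mathbf{X}^\top
=\mathbf{X}^\top\left(\mathbf{X}\mathbf{X}^\top+\lambda\mathbf{I}\right)$.
The fitted values $\mathbf{X}\mathbf{w}$ therefore depend on the inputs only through the inner
products ${\bf x}_i\cdot{\bf x}_j$, i.e. through the $\ell\times\ell$ matrix
$\mathbf{X}\mathbf{X}^\top$.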
112: 
113: KRR is the {\it kernel} version of the previous predictor. Calling $\bf{y}$ = $(y_1, y_2,
114: ..., y_\ell)^\top$ the vector formed by the $\ell$ values of the output variable and
115:  $K(\cdot,\cdot)$  being a positive definite symmetric function, the predictor has the
116: following form:
117: \begin{equation}\label{notlinear}
118: y = f({\bf x}) = \sum_{i=1}^\ell c_i K({\bf x}_i,{\bf x}),
119: \end{equation}
120: where coefficients $\{c_i\}$ are given by
121: \begin{equation}\label{w2}
122: \bf{c} =  \left (\bf{K} + \lambda \bf{I} \right)^{-1} \bf{y},
123: \end{equation}
124: $\bf{K}$ being the $\ell \times \ell$ matrix with elements $K({\bf x}_i,{\bf x}_j)$.
125: Equation (\ref{notlinear}) may be seen to correspond to a linear predictor in the feature
126: space $ \Phi({\bf x}) = ( \sqrt{\alpha_1}\psi_1 ({\bf x}), \sqrt{\alpha_2}\psi_2 ({\bf
127: x}), ...,\sqrt{\alpha_N} \psi_N ({\bf x}), ... ), $ where $\alpha_i$ and $\psi_i$  are
128: the eigenvalues and eigenfunctions of the integral operator with kernel $K$. One may
129:  show \cite{prep} that, for KRR predictors with nonlinear kernels, the IIV property does not
130: generically hold, even for those kernels, discussed in \cite{as}, for which the property
131: holds at $\lambda=0$. Regularization breaks the IIV invariance in those cases.
132: 
133: Due to (\ref{notlinear}) and (\ref{w2}), the predicted output vector $\bf{\bar{y}}$, in
134: correspondence of the {\it true} target vector $\bf{y}$, is given by
135: $\bf{\bar{y}}$$=\mathbf{G}\bf{y}$, where the symmetric matrix $\mathbf{G}$ is given by
136: \begin{equation}\label{G}
137: \mathbf{G}=\mathbf{K}\left(\mathbf{K}+\lambda \mathbf{I}\right)^{-1}.\end{equation} Note
138: that matrix $\mathbf{G}$ depends only on the distribution of $\{\mathbf{x}\}$ values:
139: $\mathbf{G}$ embodies information about the structures present in $\{\mathbf{x}\}$ data
140: set. Indeed, for $i\ne j$, the matrix element $G_{ij}$ quantifies how much the target
141: value of the $j$-th point influences the estimate of the target of point $i$. Let us now
142: consider the leave-one-out scheme; let data point $i$ be removed from the data set and
143: the model be trained using the remaining $\ell -1$ points. We denote $\tilde{y}_i$ the
144: target value thus predicted, in correspondence of $\bf{x_i}$. It is well known \cite{st}
145: that the leave-one-out-error $\tilde{y}_i -y_i$ and the training error obtained using the
146: whole data set $\bar{y}_i -y_i$ satisfy:
147: \begin{equation}
148: \label{loo} \tilde{y}_i -y_i={\bar{y}_i -y_i \over 1-G_{ii}}.
149: \end{equation}
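A short argument for this relation may be sketched as follows, assuming a fixed kernel function
and using a vector $\mathbf{y}^{(i)}$ that we introduce only for this purpose: let
$\mathbf{y}^{(i)}$ be obtained from $\mathbf{y}$ by replacing $y_i$ with $\tilde{y}_i$. The model
trained on the remaining $\ell -1$ points also minimizes the regularized functional on the whole
data set with targets $\mathbf{y}^{(i)}$, because it contributes zero error at point $i$. Its
prediction at point $i$ is then given by the $i$-th row of $\mathbf{G}$ applied to
$\mathbf{y}^{(i)}$:
\[
\tilde{y}_i=\sum_{j=1}^\ell G_{ij}\, y^{(i)}_j=\bar{y}_i-G_{ii}\,y_i+G_{ii}\,\tilde{y}_i ,
\]
that is, $(1-G_{ii})\,\tilde{y}_i=\bar{y}_i-G_{ii}\,y_i$. Subtracting $(1-G_{ii})\,y_i$ from both
sides and dividing by $1-G_{ii}$ gives (\ref{loo}).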
150: This formula shows that the closer $G_{ii}$ is to one, the farther the leave-one-out
151: predicted value is from the one obtained when point $i$ is also used in the training stage. Consider
152: a point $i$ in a dense region of the feature space:  one may expect that removing this
153: point from the data-set would not change much the estimate since it can be well predicted
154: on the basis of values of  neighboring points. Therefore points in low density regions of
155: the feature space are characterized by diagonal values $G_{ii}$ close to one, while
156: $G_{ii}$ is close to zero for points $\mathbf{x_i}$ in dense regions: the diagonal
157: elements of $\mathbf{G}$ thus convey information about the structure of points in the
158: feature space. It is worth stressing that, given a kernel function, the corresponding
159: features $\psi_\gamma ({\bf x})$ are not centered in general. One can show \cite{kpca}
160: that centering the features ($\psi_\gamma \to \psi_\gamma -\langle \psi_\gamma\rangle$,
161: for all $\gamma$) amounts to performing the following transformation on the kernel matrix:
162: $$\mathbf{K }\to \mathbf{\tilde{K}}=\mathbf{K}-\mathbf{I}_\ell \mathbf{K}-\mathbf{K}\mathbf{I}_\ell +\mathbf{I}_\ell \mathbf{K} \mathbf{I}_\ell,$$
163: where $\left( I_\ell\right)_{ij}=1/\ell$, and to work with the centered kernel
164: $\mathbf{\tilde{K}}$. In the following we will  assume that the kernel matrix
165: $\mathbf{K}$ has been centered.
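
As a quick check of the centering transformation, sketched with a shorthand
$\mathbf{C}=\mathbf{I}-\mathbf{I}_\ell$ that we introduce here only for convenience
($\mathbf{I}$ being the identity matrix), the centered kernel factorizes as
\[
\mathbf{\tilde{K}}=\mathbf{C}\,\mathbf{K}\,\mathbf{C}
=\mathbf{K}-\mathbf{I}_\ell \mathbf{K}-\mathbf{K}\mathbf{I}_\ell +\mathbf{I}_\ell \mathbf{K} \mathbf{I}_\ell .
\]
Denoting by $\mathbf{1}=(1,\ldots,1)^\top$ the constant vector, one has
$\mathbf{I}_\ell \mathbf{1}=\mathbf{1}$, hence $\mathbf{C}\mathbf{1}=\mathbf{0}$ and
\[
\mathbf{\tilde{K}}\,\mathbf{1}=\mathbf{0}.
\]
The constant vector is thus an eigenvector of the centered kernel matrix with zero eigenvalue,
and every eigenvector with nonzero eigenvalue is orthogonal to it, i.e. has zero mean.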
166: \subsection{Optimal target vector}
167: The training error of the KRR model  is proportional to
168: $(\bf{y}-\mathbf{G}\bf{y})^\top(\bf{y}-\mathbf{G}\bf{y})=\bf{y}^\top \mathbf{H}\bf{y},$
169: where $\mathbf{H}= \mathbf{I}-2\mathbf{G}+\mathbf{G}\mathbf{G}$ is a symmetric and
170: positive matrix. In the unsupervised case the data set is made of $\mathbf{x}$ points,
171: $\{ {\bf x}_i\}_{i=1}^\ell$, and the target vector $\mathbf{y}$ is missing. However, we may
172: pose the following  question: what is the vector $\mathbf{y}\in \mathbf{R}^\ell$ such
173: that treating it as the target vector leads to the best fit, i.e. the minimum training
174: error $\bf{y}^\top \mathbf{H}\bf{y}$? We expect that this {\it optimal} target vector
175: would bring information about the structures present in the data. To avoid the trivial
176: solution $\mathbf{y}=\mathbf{0}$, we constrain the target vector to have unit norm,
177: $\mathbf{y}^\top\mathbf{y}=1$; it follows that the optimal vector is the normalized
178: eigenvector of $\mathbf{H}$ with the smallest eigenvalue. On the other hand, matrix
179: $\mathbf{H}$ is a function of matrix $\mathbf{K}$: hence it has the same eigenvectors as
180: $\mathbf{K}$ while the corresponding eigenvalues $\mu_H$ and $\mu_K$ are related by the
181: following monotonically decreasing correspondence:
182: $$\mu_H=\left(1-{\mu_K\over \mu_K+\lambda}\right)^2.$$ Therefore,
183: independently of $\lambda$,  the smallest eigenvalue of $\mathbf{H}$ corresponds to the
184: largest eigenvalue of $\mathbf{K}$, and the optimal vector coincides with the first
185: kernel principal component. To conclude this subsection, we have shown that the method in
186: \cite{kpca} may also be motivated as the search for the optimal target vector.
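
The eigenvalue correspondence can be verified in a few lines; the following is a sketch that
assumes $\lambda>0$ and $\mu_K\ge 0$, as is the case for a positive semidefinite kernel matrix,
and uses a generic eigenvector $\mathbf{v}$ introduced here for illustration. If
$\mathbf{K}\mathbf{v}=\mu_K\mathbf{v}$, then by (\ref{G})
\[
\mathbf{G}\mathbf{v}={\mu_K\over \mu_K+\lambda}\,\mathbf{v},
\]
and, since $\mathbf{H}=\mathbf{I}-2\mathbf{G}+\mathbf{G}\mathbf{G}=(\mathbf{I}-\mathbf{G})^2$,
\[
\mathbf{H}\mathbf{v}=\left(1-{\mu_K\over \mu_K+\lambda}\right)^2\mathbf{v}
=\left({\lambda\over \mu_K+\lambda}\right)^2\mathbf{v}.
\]
The derivative of $\mu_H$ with respect to $\mu_K$ is
\[
{d\mu_H\over d\mu_K}=-{2\lambda^2\over (\mu_K+\lambda)^3}<0,
\]
which makes the decreasing character of the correspondence explicit.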
187: 
188: The notion of optimal target vector has been introduced in \cite{ang}, where a kernel
189: method for dichotomic clustering has been proposed, consisting in finding the ground
190: state of a class of Ising models.
191: 
192: \subsection{Semi-supervised learning}
193: Now we consider the case that we are given a set $S=\{ {\bf x}_i \}_{i=1}^\ell$ of data
194: points with unknown targets $\{t_i\}_{i=1}^\ell$, and a set $S'=\{ ({\bf x}_j, u_j)
195: \}_{j=\ell+1}^{N}$, where $N=\ell +m$, of input-output data. Without loss of generality
196: we assume that the labeled points belong to two classes, and take $u_j\in
197: \{-1/\sqrt{N},+1/\sqrt{N}\}$ for all $j$'s. The $N$ dimensional full vector of targets
198: $\mathbf{y}$ is obtained appending $\{t\}$ (unknown) and $\{u\}$ (known) values:
199: $$\mathbf{y}=(\mathbf{t}^\top \mathbf{u}^\top)^\top.$$
200: Keeping  the kernel and $\lambda$ fixed, we look for the unit norm target vector
201: $\mathbf{y}$ minimizing the training error $\mathbf{y}^\top \mathbf{H} \mathbf{y}$.  The
202: $N\times N$ matrix $\mathbf{H}$ has the block structure
203: \[ \mathbf{H} = \left( \begin{array}{cc}
204:               \mathbf{H_0} & \mathbf{H_1} \\
205:               \mathbf{H_1^\top}& \mathbf{H_2}
206:     \end{array}\right), \]
207: where $\mathbf{H_0}$ is an $\ell \times \ell$ matrix. Neglecting a constant term, the
208: optimal vector is determined by the vector $\mathbf{t}$ minimizing
209: \begin{equation}
210: \mathcal{E}(\bf{t})=\mathbf{t}^\top \mathbf{H_0} \mathbf{t} +2\mathbf{t}^\top
211: \mathbf{H_1} \mathbf{u} \label{eeee}
212: \end{equation}
213:  under the constraint $|| \mathbf{t} ||^2=1-|| \mathbf{u} ||^2$.
214: The first term of $\mathcal{E}$ favors projections of the $\ell$ points with great
215: variance, whereas the second term measures their consistency with  labeled points. Let us
216: denote $\{\Psi_{\alpha'}\}$ and $\{\mu_{\alpha'}\}$ the eigenvectors and eigenvalues of
217: $\mathbf{H_0}$, sorted into increasing $\mu_{\alpha'}$. We express
218: $\mathbf{t}=\sum_{\alpha' =1}^\ell \xi_{\alpha'} \Psi_{\alpha'}$. The coefficients
219: $\xi_{\alpha'}$ for the minimum are given by
220: $$\xi_{\alpha'} ={f_{\alpha'} \over \mu -\mu_{\alpha'}},$$
221: where $f_{\alpha'}= \Psi_{\alpha'}^\top\mathbf{H_1} \mathbf{u}$, and $\mu$ is a Lagrange
222: multiplier which must be tuned to satisfy:
223: \begin{equation} \label{csi} g(\mu)= \sum_{\alpha' =1}^\ell
224: \left({f_{\alpha'} \over \mu -\mu_{\alpha'}}\right)^2=1-|| \mathbf{u} ||^2.
225: \end{equation}
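
The expressions above follow from a standard Lagrange multiplier argument, which we sketch here
for convenience, assuming the eigenvectors $\{\Psi_{\alpha'}\}$ are chosen orthonormal. Expanding
the training error in blocks gives
\[
\mathbf{y}^\top \mathbf{H}\mathbf{y}=\mathbf{t}^\top \mathbf{H_0}\mathbf{t}
+2\,\mathbf{t}^\top \mathbf{H_1}\mathbf{u}+\mathbf{u}^\top \mathbf{H_2}\mathbf{u},
\]
where the last term does not depend on $\mathbf{t}$ and is the constant neglected in
(\ref{eeee}). Since each $u_j=\pm 1/\sqrt{N}$, one has $|| \mathbf{u} ||^2=m/N$, so that the
constraint reads $|| \mathbf{t} ||^2=\ell/N$. Requiring stationarity of
$\mathcal{E}(\mathbf{t})-\mu\left(|| \mathbf{t} ||^2-1+|| \mathbf{u} ||^2\right)$ with respect to
$\mathbf{t}$ yields
\[
\mathbf{H_0}\mathbf{t}+\mathbf{H_1}\mathbf{u}=\mu\,\mathbf{t}.
\]
Projecting onto $\Psi_{\alpha'}$ gives
$\mu_{\alpha'}\xi_{\alpha'}+f_{\alpha'}=\mu\,\xi_{\alpha'}$, which is the stated expression for
$\xi_{\alpha'}$; imposing the constraint then gives (\ref{csi}). In the same basis the energy
reads
\[
\mathcal{E}=\sum_{\alpha' =1}^\ell \left(\mu_{\alpha'}\,\xi_{\alpha'}^2+2 f_{\alpha'}\,\xi_{\alpha'}\right),
\]
a form that is convenient for comparing different solutions. Note also that, on the interval
$\mu<\mu_1$, no term of $g(\mu)$ decreases as $\mu$ grows: $g$ tends to zero as
$\mu\to -\infty$ and, provided $f_1\neq 0$, diverges as $\mu\to\mu_1$ from below.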
226: 
227: Equation (\ref{csi}) always has at least one solution with $\mu < \mu_1$, see figure 1,
228: and usually this is the one minimizing $\mathcal{E}$. However, all the solutions of
229: (\ref{csi}) must be compared according to their {\it energies} $\mathcal{E}$; the one
230: corresponding to the lowest $\mathcal{E}$, $\bf{y^\star}$, is then selected. Clearly, as
231: $m\to 0$ one recovers the first eigenvector of $\mathbf{H_0}$, i.e. the first kernel
232: principal component: $\bf{y^\star}$ thus constitutes a generalization of the latter to
233: the semi-supervised case. To construct the other generalized kernel principal components,
234: we make the following transformation on matrix $\mathbf{H}$:
235: $$\mathbf{\tilde{H}}=\mathbf{H}-\mathbf{P^\star}\mathbf{H}
236: -\mathbf{H}\mathbf{P^\star}+\mathbf{P^\star}\mathbf{H}\mathbf{P^\star},$$ where
237: $\mathbf{P^\star}=\bf{y^\star}\bf{y^\star}^\top$ is the projector on the linear subspace
238: spanned by $\bf{y^\star}$. The symmetric matrix $\mathbf{\tilde{H}}$ has   the lowest
239: eigenvalue equal to zero and corresponding to eigenvector $\bf{y^\star}$. The system of
240: eigenvectors of $\mathbf{\tilde{H}}$ constitutes a generalization of kernel principal
241: components to the semi-supervised case. \section{Experiments}
242: \subsection{Generalizing kernel principal components}
243: Now we present some simulations of the proposed method, focusing on the dimensionality
244: reduction issue and comparing  with  fully unsupervised kernel principal component
245: analysis. We consider three well known data sets: IRIS (100 points in a four-dimensional
246: space, second and third classes, versicolor and virginica); the colon cancer data set of
247: \cite{alon}, consisting of 40 tumor and 22 normal colon tissue samples, each sample
248: being described by the $100$ most discriminant genes; the leukemia data set of
249: \cite{golub}, consisting of bone marrow tissue samples, $47$ affected by
250: acute lymphoblastic leukemia (ALL) and $25$ by acute myeloid leukemia (AML), each sample
251: being described by the $500$ most discriminant genes. The following question is
252: addressed: is $\bf{y^\star}$ more correlated to the true labels than the fully
253: unsupervised first kernel principal component? Here we restrict our analysis to the
254: linear kernel.
255: 
256: We start with IRIS and proceed as follows. We randomly select $m=4$ points and, treating
257: them as labeled, we find the system of eigenvectors of $\mathbf{\tilde{H}}$. Then  we
258: evaluate the linear correlation $R$ between the eigenvectors and the true labels of the
259: whole data-set. The distributions of $R$ for the four eigenvectors are depicted in figure
260: 2. We observe that in most cases the vector $\bf{y^\star}$ is more correlated with the
261: true classes than the fully unsupervised principal component: the one-dimensional
262: projection of data onto $\bf{y^\star}$ is more informative than the first principal
263: component. However there are situations where use of labeled points leads to poor
264: results; a typical example is depicted in figure 3. In figure 4 a situation is depicted
265: where knowledge of labeled points leads to a relevant improvement.
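
For concreteness, and assuming that the linear correlation is the usual Pearson coefficient
(this is our reading; the symbols $v_i$ and $c_i$ below are introduced only for illustration),
the correlation between an eigenvector with components $v_i$ and the true labels $c_i=\pm 1$
would be computed as
\[
R={\sum_{i} (v_i-\bar{v})(c_i-\bar{c}) \over
\sqrt{\sum_{i} (v_i-\bar{v})^2}\;\sqrt{\sum_{i} (c_i-\bar{c})^2}},
\]
where $\bar{v}$ and $\bar{c}$ are the means of the two vectors and the sums run over all the
points of the data set. Since the fully unsupervised principal component is defined only up to a
sign, only the magnitude of its correlation with the labels is meaningful in such a comparison.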
266: 
267: In general, we denote $f$ the fraction of instances such that $\bf{y^\star}$ is more
268: correlated to the true labels than the first principal component. In figure 5 we depict
269: $f$ as a function of $\bar{m}=m/N$ for the three data sets here considered. At
270: $\bar{m}=0.16$  $f$ is already nearly one. The semi-supervised method here proposed
271: outperforms principal components almost always for large $\bar{m}$.
272: 
273: 
274: \subsection{Transductive inference}
275: In this subsection we demonstrate the effectiveness of the proposed approach for
276: estimating the values of a function at a set of test points, given a set of input-output
277: data points, without estimating (as an intermediate step) the regression function.
278: 
279: The Boston data set is a well-known problem where one is required to estimate house
280: prices according to various statistics based on $13$ locational, economic and structural
281: features from data collected by the U.S. Census Service in the Boston, Massachusetts area. For
282: $\ell =5,10,15,20,25$, we partition the data-set of $N=506$ observations randomly 100
283: times into a training set of $N-\ell$ observations and a testing set of $\ell$
284: observations. We use a Gaussian kernel with $\sigma =1$ and set $\lambda =1$; results are
285: stable against variations of these parameters. In Table 1 we report, for each value of
286: $\ell$, the mean squared error (MSE) on the test set, averaged over the 100 runs, that we
287: obtain using the optimal target vector $\bf{y^\star}$. In Table 1 we also report the MSE
288: obtained using classical KRR in the two-step procedure: (i) estimation of the
289: regression function using the training data-set; (ii) calculation of the regression
290: function at points of interest (test data-set). The improvement achieved using the
291: optimal target approach, over classical KRR, is clear.
292: \begin{table}
293: \caption{\label{tab:table1}The mean square error on the Boston data set obtained using
294: the optimal target (OT) approach and the classical kernel ridge regression (KRR) method.
295: The size of the test set is $\ell$. }
296: \begin{ruledtabular}
297: \begin{tabular}{lcr}
298: $\ell$&OT&KRR\\
299: \hline
300:     5&   2.3790   & 3.6312\\
301: 
302:    10&   2.7938  &  4.0111 \\
303: 
304:    15& 2.9460 &   4.1057 \\
305: 
306:    20&   3.1024 &   4.1802 \\
307: 
308:    25&   3.1569 &   4.1653 \\
309: \end{tabular}
310: \end{ruledtabular}
311: \end{table}
312: 
313: We also consider five well-known pattern recognition data sets from the UCI database: we
314: evaluate the optimal target vector, and points are then attributed to classes according to
315: the sign of $\bf{y^\star}$. We compare with the transductive linear discrimination (TLD)
316: approach developed in \cite{trans}; the performance of a classifier is measured by its
317: average error over 100 partitions of the data-sets into training and testing sets. We use
318: the linear kernel with $\lambda =1$; however, the results are stable to variations of
319: $\lambda$. Obviously, our approach and TLD are applied to the same partitions of
320: data-sets, so that the comparison is meaningful. The results are shown in Table 2: our
321: approach outperforms TLD.
322: \begin{table}
323: \caption{\label{tab:table2}The percentage test error of transductive linear
324: discrimination and optimal target approach, on five datasets from UCI database.}
325: \begin{ruledtabular}
326: \begin{tabular}{lcr}
327: &TLD&OT\\
328: \hline
329:     Diabetes&   23.3   & 11.98\\
330: 
331:    Titanic&   22.4  &  6.52 \\
332: 
333:    Breast Cancer& 25.7 &   16.7 \\
334: 
335:    Heart&   15.7 &   3.3 \\
336: 
337:    Thyroid&   4.0 &   4.0 \\
338: \end{tabular}
339: \end{ruledtabular}
340: \end{table}
341: 
342: It is worth stressing that our results are obtained without a fine-tuning of parameters.
343: In particular, note that our definition of optimal target vector fixes the relative
344: importance of the two terms in equation (\ref{eeee}).
345: \section{Conclusions}
346: \label{conc} We have presented a new  approach to semi-supervised learning based on the
347: notion of optimal target vector, the target vector such that KRR provides the minimum
348: training error over all the possible target vectors. The proposed algorithm is
349: characterized by the fact that the first kernel principal component is recovered as the
350: cardinality of labeled points vanishes; hence it may be seen as a semi-supervised
351: generalization of Kernel Principal Components Analysis. The effectiveness of the proposed
352: approach for transductive inference has also been demonstrated.
353: 
354: 
355: 
356: \vskip 0.4 cm\par\noindent{\bf Acknowledgements.} The authors thank Olivier Chapelle for a
357: valuable correspondence on the subject of this paper. Discussions on semi-supervised
358: learning with Eytan Domany and Noam Shental are warmly acknowledged.
359: 
360: \begin{thebibliography}{99}
361: \bibitem{zu} X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields
362: and harmonic functions. Proc. 20-th Int. Conf. Machine Learning 2003.
363: 
364: \bibitem{zhu} D. Zhou, O. Bousquet, T.N. Lal, J. Weston, and B. Sch\"{o}lkopf. Learning with local and
365: global consistency. {\it Advances in Neural Information Processing Systems}, 16, S. Thrun
366: et al. (Eds.), MIT Press, Cambridge, MA, 2004.
367: 
368: \bibitem{ch} O. Chapelle, A. Zien, Semi-supervised classification by low density separation.
369: International Workshop on Artificial Intelligence and Statistics, AI STATS 2005,
370: Barbados.
371: 
372: \bibitem{smo} A. Smola, R. Kondor, Kernels and regularizations on graphs, COLT/Kernel Workshop
373: 2003.
374: 
375: \bibitem{arg} A. Argyriou, M. Herbster, M. Pontil, Combining Graph Laplacians for Semi-Supervised
376: Learning, {\it Advances in Neural Information Processing Systems}, 18, Y. Weiss and B.
377: Sch\"{o}lkopf and J. Platt (Eds.), MIT Press, Cambridge, MA, 2006.
378: 
379: \bibitem{zz} Xiaojin Zhu, Semi-supervised Learning Literature Survey. Computer Sciences TR 1530,
380: University of Wisconsin - Madison.
381: 
382: \bibitem{zhou} Z.H. Zhou, M. Li, Semi-supervised regression with co-training. Proceedings
383: International Joint Conference on Artificial Intelligence (IJCAI) 2005.
384: 
385: \bibitem{getz} G. Getz, N. Shental, E. Domany, Semi-supervised learning - a statistical physics
386: approach. Proceedings of the 22nd ICML Workshop on Learning with Partially Classified
387: Training Data. Bonn, Germany, 2005.
388: 
389: \bibitem{zzz} X. Zhu, J. Lafferty, Z. Ghahramani, Combining active learning and semi-supervised
390: learning using Gaussian fields and harmonic functions. ICML 2003 workshop on The
391: continuum from labeled to unlabeled data in Machine Learning and Data mining.
392: 
393: \bibitem{kpca} B. Sch\"{o}lkopf, A. Smola, K.-R. M\"{u}ller, Nonlinear Component Analysis as a Kernel Eigenvalue Problem,
394: {\it Neural Computation} {\bf 10} 1299
395: (1998).
396: 
397: \bibitem{vapnik} V. Vapnik. Estimation of dependences based on empirical data.
398: Springer-Verlag, New York, 1982.
399: 
400: \bibitem{st} J. Shawe-Taylor and N. Cristianini, {\it Kernel Methods for Pattern Analysis}
401: Cambridge University Press, 2004.
402: 
403: \bibitem{as} N. Ancona and S. Stramaglia, An invariance property of predictors in kernel-induced
404: hypothesis spaces, Neural Comput. 18:749-759, 2006.
405: 
406: \bibitem{prep} N. Ancona and S. Stramaglia, unpublished.
407: 
408: \bibitem{ang}L. Angelini, D. Marinazzo, M. Pellicoro, S. Stramaglia, Kernel method for clustering
409: based on optimal target vector. Physics Letters A {\bf 357} 413 (2006).
410: 
411: \bibitem{alon} U. Alon et al., Broad Patterns of Gene Expression Revealed by Clustering Analysis of
412: Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays, {\it PNAS} {\bf 96} 6745
413: (1999).
414: 
415: \bibitem{golub}T.R. Golub et al., Molecular Classification of Cancer: Class Discovery and Class
416: Prediction by Gene Expression Monitoring, {\it Science} {\bf 286} 531 (1999).
417: 
418: \bibitem{trans} O. Chapelle and V. Vapnik and J. Weston, Transductive Inference for Estimating Values of Functions,
419: Advances in Neural Information Processing Systems, Vol. 12, 1999.
420: \end{thebibliography}
421: 
422: \begin{figure}[ht!]
423: \begin{center}
424: \epsfig{file=fig1.eps,height=8.5cm}
425: \end{center}
426: \caption{{\small  The solutions of equation (\ref{csi}) are depicted, for a typical
427: instance  of four labeled points in the IRIS data set. The star corresponds to the
428: solution with $\mu < \mu_1$, which has the smallest energy $\mathcal{E}$.\label{fig1}}}
429: \end{figure}
430: \begin{figure}[ht!]
431: \begin{center}
432: \epsfig{file=fig2.eps,height=8.5cm}
433: \end{center}
434: \caption{{\small  Concerning IRIS data set and $m=4$, we depict the distribution (over
435: 10000 random selections of labeled points) of the linear correlation $R$  between
436: eigenvectors of $\mathbf{\tilde{H}}$ and the true labels. From the left to the right and
437: the top to the bottom, we refer to the first, the second, the third and the fourth
438: eigenvector. Grey (black) histogram bars denote values of $R$ lower (greater) than those
439: of the corresponding fully unsupervised principal component. \label{fig2}}}
440: \end{figure}
441: 
442: \begin{figure}[ht!]
443: \begin{center}
444: \epsfig{file=fig3.eps,height=8.5cm}
445: \end{center}
446: \caption{{\small  (Top) The IRIS data set is depicted in the plane of the first two
447: principal components, $\star$ versicolor, $+$ virginica. The linear correlation of the
448: first principal component with the true labels is $R=0.732$. Four selected points are
449: surrounded by a circle. (Bottom) The data set is represented in the plane of the first
450: two eigenvectors of $\mathbf{\tilde{H}}$. The linear correlation between $\bf{y^\star}$
451: and  the true labels is $R=0.615$. (Note that two circles are almost overlapping and thus
452: difficult to distinguish). \label{fig3}}}
453: \end{figure}
454: 
455: \begin{figure}[ht!]
456: \begin{center}
457: \epsfig{file=fig4.eps,height=8.5cm}
458: \end{center}
459: \caption{{\small  (Top) The IRIS data set is depicted in the plane of the first two
460: principal components, $\star$ versicolor, $+$ virginica. Four selected points are
461: surrounded by a circle. (Bottom) The data set is represented in the plane of the first
462: two eigenvectors of $\mathbf{\tilde{H}}$. The linear correlation between $\bf{y^\star}$
463: and  the true labels is, in this case, $R=0.846$.\label{fig4}}}
464: \end{figure}
465: \begin{figure}[ht!]
466: \begin{center}
467: \epsfig{file=fig5.eps,height=8.5cm}
468: \end{center}
469: \caption{{\small The fraction $f$ (see the text) is depicted as a function of $\bar{m}$
470: for three data sets here considered. 10000 random selections of the labeled points are
471: considered for each value of $m$ and for each data-set. \label{fig5}}}
472: \end{figure}
473: \end{document}
474: