0610:cond-mat0610342/semi.tex

1: \documentclass[preprintnumbers,amsmath,amssymb]{revtex4}

2:

3: \usepackage{epsfig}

4: \usepackage{rotating}

5: \usepackage{graphicx}% Include figure files

6: \usepackage{dcolumn}% Align table columns on decimal point

7: \usepackage{bm}% bold math

8:

9: %\nofiles

10:

11: \begin{document}

12:

13:

14: \title{

15: Semi-supervised learning by search of optimal target vector}

16:

17: \author{Leonardo Angelini, Daniele Marinazzo, Mario Pellicoro and Sebastiano Stramaglia}

18:

19: \affiliation{ TIRES-Center of Innovative Technologies for Signal Detection

20: and Processing, \\ Universit\`a di Bari, Italy \\

21: Dipartimento Interateneo di Fisica, Bari, Italy \\

22: Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Italy

23: }

24:


32:

33: \date{\today}

34:

35:

36:

37: \begin{abstract}

38: We introduce a semi-supervised learning estimator which tends to the

39: first kernel principal component as the number of labeled points

40: vanishes. Our approach is based on the notion of optimal target

41: vector, which is defined as follows. Given an input data-set of

42: ${\bf x}$ values, the optimal target vector $\mathbf{y}$ is such

43: that treating it as the target and using kernel ridge regression to

44: model the dependency of $y$ on ${\bf x}$, the training error

45: achieves its minimum value. For an unlabeled data set, the first

46: kernel principal component is the optimal vector. In the case one is

47: given a partially labeled data set, still one may look for the

48: optimal target vector minimizing the training error. We use this new

49: estimator in two directions. First, as a substitute for kernel principal

50: component analysis when some labeled data are available, to

51: produce dimensionality reduction. Second, to develop a

52: semi-supervised regression and classification algorithm for

53: transductive inference. We show applications of the proposed method

54: in both directions.

55: \end{abstract}

56:

57: \maketitle

58:

59: \section{Introduction}

60: \label{intro} The problem of effectively combining {\it unlabeled} data with {\it

61: labeled} data, semi-supervised learning, is of central importance in machine learning;

62: see, for example, \cite{zu,zhu,ch} and references therein. Semi-supervised learning

63: methods usually assume that adjacent points and/or points in the same structure (group,

64: cluster) should have similar labels; one may assume that data are situated on a low

65: dimensional manifold which can be approximated by a weighted discrete graph whose

66: vertices are identified with the empirical (labeled and unlabeled) data points. This can

67: be seen as a form of regularization \cite{smo}. A common feature of these methods, see

68: also \cite{arg}, is that, as the number of labeled points vanishes, the solution tends to

69: the constant vector. An interesting survey on semi-supervised learning literature may be

70: found on the web \cite{zz}. Improving regression with unlabeled data is the problem

71: considered in \cite{zhou}, where co-training is achieved using k-NN regressors. A

72: statistical physics approach, based on the Potts model, is described in \cite{getz}. An

73: issue closely related to semi-supervised learning is active-learning: some attempts to

74: combine active learning and semi-supervised learning have been made \cite{zzz}.

75:

76:

77: The purpose of this work is to introduce a semi-supervised learning estimator which, as

78: the number of labeled points vanishes, tends to the first kernel principal component

79: \cite{kpca}; when a suitable number of labeled points is available, it may be used for

80: transductive inference \cite{vapnik}. Our approach is based on the following fact. Given

81: an unlabeled data set, its first kernel principal component is such that, treating it as

82: target vector, supervised kernel ridge regression provides the minimum training error.

83: Now, suppose that one is given a partially labeled data set: one may still look for the

84: target vector minimizing the training error. This optimal target vector may be seen as

85: the generalization of the first kernel principal component to the semi-supervised case.

86:

87: The paper is organized as follows. In the next Section we describe our approach, while in

88: Section 3 the experiments we performed are described. Some conclusions are drawn in

89: Section 4.

90:

91: \section{Methods} \label{methods}

92: \subsection{Kernel ridge regression}

93: We briefly recall the properties of kernel ridge regression (KRR), while referring the

94: reader to \cite{st} for further technical details. Let us consider a set of $\ell$

95: independent, identically distributed data $S=\{ ({\bf x}_i, y_i) \}_{i=1}^\ell$, where

96: ${\bf x}_i$ is the $n$-dimensional vector of input variables and $y_i$ is the scalar

97: output variable. Data are drawn from an unknown probability distribution; we assume that

98: both ${\bf x}$ and $y$ have been centered, i.e. they have been linearly transformed to

99: have zero mean. The regularized linear predictor is $y={\bf w}\cdot{\bf x}$, where ${\bf

100: w}$ minimizes the following functional:

101: \begin{equation}\label{lagrangian-function-without-b}

102: L(\mathbf{w}) =  \sum_{i=1}^\ell \left ( y_i - {\bf w}\cdot {\bf x}_i \right )^2 +

103: \lambda || {\mathbf w} ||^2.

104: \end{equation}

105: Here $|| \mathbf{w} || = \sqrt{\mathbf{w}\cdot\mathbf{w}} $ and $\lambda >0$ is the

106: regularization parameter. For $\lambda  =0$, predictor

107: (\ref{lagrangian-function-without-b}) is invariant when new variables, statistically

108: independent of input and target variables, are added to the set of input variables (IIV

109: property, \cite{as}). One may show that this invariance property holds, for

110: (\ref{lagrangian-function-without-b}), also at finite $\lambda

111: > 0$.

112:

113: KRR is the {\it kernel} version of the previous predictor. Calling $\mathbf{y} = (y_1, y_2,

114: \ldots, y_\ell)^\top$ the vector formed by the $\ell$ values of the output variable and

115:  $K(\cdot,\cdot)$  being a positive definite symmetric function, the predictor has the

116: following form:

117: \begin{equation}\label{notlinear}

118: y = f({\bf x}) = \sum_{i=1}^\ell c_i K({\bf x}_i,{\bf x}),

119: \end{equation}

120: where coefficients $\{c_i\}$ are given by

121: \begin{equation}\label{w2}

122: \mathbf{c} =  \left (\mathbf{K} + \lambda \mathbf{I} \right)^{-1} \mathbf{y},

123: \end{equation}

124: $\bf{K}$ being the $\ell \times \ell$ matrix with elements $K({\bf x}_i,{\bf x}_j)$.

125: Equation (\ref{notlinear}) may be seen to correspond to a linear predictor in the feature

126: space $ \Phi({\bf x}) = ( \sqrt{\alpha_1}\psi_1 ({\bf x}), \sqrt{\alpha_2}\psi_2 ({\bf

127: x}), ...,\sqrt{\alpha_N} \psi_N ({\bf x}), ... ), $ where $\alpha_i$ and $\psi_i$  are

128: the eigenvalues and eigenfunctions of the integral operator with kernel $K$. One may

129:  show \cite{prep} that, for KRR predictors with nonlinear kernels, the IIV property does not

130: generically hold, even for those kernels, discussed in \cite{as}, for which the property

131: holds at $\lambda=0$. Regularization breaks the IIV invariance in those cases.
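
As a consistency check on (\ref{w2}), one may verify that the linear kernel $K({\bf x},{\bf x}')={\bf x}\cdot{\bf x}'$ reproduces the minimizer of (\ref{lagrangian-function-without-b}). The following is only a sketch: $\mathbf{X}$ denotes the $\ell\times n$ matrix whose rows are the ${\bf x}_i$, a symbol introduced here for illustration and not used elsewhere in the paper. Setting the gradient of $L(\mathbf{w})$ to zero gives
\[
\mathbf{w}=\left(\mathbf{X}^\top\mathbf{X}+\lambda\mathbf{I}\right)^{-1}\mathbf{X}^\top\mathbf{y}
=\mathbf{X}^\top\left(\mathbf{X}\mathbf{X}^\top+\lambda\mathbf{I}\right)^{-1}\mathbf{y},
\]
where the identity matrix is $n\times n$ in the first expression and $\ell\times\ell$ in the second, and the second equality follows from $\mathbf{X}^\top\left(\mathbf{X}\mathbf{X}^\top+\lambda\mathbf{I}\right)=\left(\mathbf{X}^\top\mathbf{X}+\lambda\mathbf{I}\right)\mathbf{X}^\top$. Hence ${\bf w}\cdot{\bf x}=\sum_{i=1}^\ell c_i\,{\bf x}_i\cdot{\bf x}$ with $\mathbf{c}=\left(\mathbf{K}+\lambda\mathbf{I}\right)^{-1}\mathbf{y}$ and $\mathbf{K}=\mathbf{X}\mathbf{X}^\top$, which is (\ref{notlinear}) specialized to the linear kernel.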

132:

133: Due to (\ref{notlinear}) and (\ref{w2}), the predicted output vector $\mathbf{\bar{y}}$,

134: corresponding to the {\it true} target vector $\mathbf{y}$, is given by

135: $\mathbf{\bar{y}}=\mathbf{G}\mathbf{y}$, where the symmetric matrix $\mathbf{G}$ is given by

136: \begin{equation}\label{G}

137: \mathbf{G}=\mathbf{K}\left(\mathbf{K}+\lambda \mathbf{I}\right)^{-1}.\end{equation} Note

138: that matrix $\mathbf{G}$ depends only on the distribution of $\{\mathbf{x}\}$ values:

139: $\mathbf{G}$ embodies information about the structures present in $\{\mathbf{x}\}$ data

140: set. Indeed, for $i\ne j$, the matrix element $G_{ij}$ quantifies how much the target

141: value of the $j$-th point influences the estimate of the target of point $i$. Let us now

142: consider the leave-one-out scheme; let data point $i$ be removed from the data set and

143: the model be trained using the remaining $\ell -1$ points. We denote by $\tilde{y}_i$ the

144: target value thus predicted at ${\bf x}_i$. It is well known \cite{st}

145: that the leave-one-out-error $\tilde{y}_i -y_i$ and the training error obtained using the

146: whole data set $\bar{y}_i -y_i$ satisfy:

147: \begin{equation}

148: \label{loo} \tilde{y}_i -y_i={\bar{y}_i -y_i \over 1-G_{ii}}.

149: \end{equation}

150: This formula shows that the closer $G_{ii}$ is to one, the farther the leave-one-out

151: predicted value is from the one obtained when point $i$ is also used in the training stage. Consider

152: a point $i$ in a dense region of the feature space:  one may expect that removing this

153: point from the data-set would not change much the estimate since it can be well predicted

154: on the basis of values of  neighboring points. Therefore points in low density regions of

155: the feature space are characterized by diagonal values $G_{ii}$ close to one, while

156: $G_{ii}$ is close to zero for points $\mathbf{x_i}$ in dense regions: the diagonal

157: elements of $\mathbf{G}$ thus convey information about the structure of points in the

158: feature space. It is worth stressing that, given a kernel function, the corresponding

159: features $\psi_\gamma ({\bf x})$ are not centered in general. One can show \cite{kpca}

160: that centering the features ($\psi_\gamma \to \psi_\gamma -\langle \psi_\gamma\rangle$,

161: for all $\gamma$) amounts to performing the following transformation on the kernel matrix:

162: $$\mathbf{K }\to \mathbf{\tilde{K}}=\mathbf{K}-\mathbf{I}_\ell \mathbf{K}-\mathbf{K}\mathbf{I}_\ell +\mathbf{I}_\ell \mathbf{K} \mathbf{I}_\ell,$$

163: where $\left( I_\ell\right)_{ij}=1/\ell$, and to work with the centered kernel

164: $\mathbf{\tilde{K}}$. In the following we will  assume that the kernel matrix

165: $\mathbf{K}$ has been centered.
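
The centering transformation above can be written in factorized form, which makes one of its consequences explicit. This is only a rearrangement of the formula given in the text; the symbol $\mathbf{e}$, denoting the $\ell$-dimensional vector with all components equal to one, is introduced here for illustration.
\[
\mathbf{\tilde{K}}=\left(\mathbf{I}-\mathbf{I}_\ell\right)\mathbf{K}\left(\mathbf{I}-\mathbf{I}_\ell\right),
\qquad
\mathbf{I}_\ell={1\over \ell}\,\mathbf{e}\,\mathbf{e}^\top .
\]
Since $\mathbf{I}_\ell\mathbf{e}=\mathbf{e}$, one has $\left(\mathbf{I}-\mathbf{I}_\ell\right)\mathbf{e}=\mathbf{0}$ and therefore $\mathbf{\tilde{K}}\mathbf{e}=\mathbf{0}$: the constant vector is an eigenvector of the centered kernel matrix with eigenvalue zero, and every eigenvector with nonzero eigenvalue is orthogonal to it, i.e. has zero mean.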

166: \subsection{Optimal target vector}

167: The training error of the KRR model  is proportional to

168: $(\mathbf{y}-\mathbf{G}\mathbf{y})^\top(\mathbf{y}-\mathbf{G}\mathbf{y})=\mathbf{y}^\top \mathbf{H}\mathbf{y},$

169: where $\mathbf{H}= \mathbf{I}-2\mathbf{G}+\mathbf{G}\mathbf{G}$ is a symmetric and

170: positive matrix. In the unsupervised case the data set consists only of the points

171: $\{ {\bf x}_i\}_{i=1}^\ell$, and the target vector $\mathbf{y}$ is missing. However, we may

172: pose the following  question: what is the vector $\mathbf{y}\in \mathbf{R}^\ell$ such

173: that treating it as the target vector leads to the best fit, i.e. the minimum training

174: error $\bf{y}^\top \mathbf{H}\bf{y}$? We expect that this {\it optimal} target vector

175: would bring information about the structures present in the data. To avoid the trivial

176: solution $\mathbf{y}=\mathbf{0}$, we constrain the target vector to have unit norm,

177: $\mathbf{y}^\top\mathbf{y}=1$; it follows that the optimal vector is the normalized

178: eigenvector of $\mathbf{H}$ with the smallest eigenvalue. On the other hand, matrix

179: $\mathbf{H}$ is a function of matrix $\mathbf{K}$: hence it has the same eigenvectors as

180: $\mathbf{K}$ while the corresponding eigenvalues $\mu_H$ and $\mu_K$ are related by the

181: following monotonically decreasing correspondence:

182: $$\mu_H=\left(1-{\mu_K\over \mu_K+\lambda}\right)^2.$$ Therefore,

183: independently of $\lambda$,  the smallest eigenvalue of $\mathbf{H}$ corresponds to the

184: largest eigenvalue of $\mathbf{K}$, and the optimal vector coincides with the first

185: kernel principal component. To conclude this subsection, we have shown that the method in

186: \cite{kpca} may also be motivated as the search for the optimal target vector.
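
The eigenvalue correspondence quoted above may be made explicit with a short derivation from (\ref{G}); here $\mathbf{v}$ denotes a generic eigenvector of $\mathbf{K}$, a symbol introduced only for this sketch. If $\mathbf{K}\mathbf{v}=\mu_K\mathbf{v}$, then
\[
\mathbf{G}\mathbf{v}={\mu_K\over \mu_K+\lambda}\,\mathbf{v},
\qquad
\mathbf{H}\mathbf{v}=\left(\mathbf{I}-\mathbf{G}\right)^2\mathbf{v}
=\left({\lambda\over \mu_K+\lambda}\right)^2\mathbf{v},
\]
and
\[
{d\mu_H\over d\mu_K}=-{2\lambda^2\over \left(\mu_K+\lambda\right)^3}<0
\quad \mbox{for } \mu_K\ge 0,\ \lambda>0,
\]
so that $\mu_H$ is a decreasing function of $\mu_K$ on the spectrum of the positive semidefinite matrix $\mathbf{K}$.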

187:

188: The notion of optimal target vector has been introduced in \cite{ang}, where a kernel

189: method for dichotomic clustering has been proposed, consisting in finding the ground

190: state of a class of Ising models.

191:

192: \subsection{Semi-supervised learning}

193: Now we consider the case in which we are given a set $S=\{ {\bf x}_i \}_{i=1}^\ell$ of data

194: points with unknown targets $\{t_i\}_{i=1}^\ell$, and a set $S'=\{ ({\bf x}_j, u_j)

195: \}_{j=\ell+1}^{N}$, where $N=\ell +m$, of input-output data. Without loss of generality

196: we assume that the labeled points belong to two classes, and take $u_j\in

197: \{-1/\sqrt{N},+1/\sqrt{N}\}$ for all $j$'s. The $N$ dimensional full vector of targets

198: $\mathbf{y}$ is obtained appending $\{t\}$ (unknown) and $\{u\}$ (known) values:

199: $$\mathbf{y}=(\mathbf{t}^\top \mathbf{u}^\top)^\top.$$

200: Keeping  the kernel and $\lambda$ fixed, we look for the unit norm target vector

201: $\mathbf{y}$ minimizing the training error $\mathbf{y}^\top \mathbf{H} \mathbf{y}$.  The

202: $N\times N$ matrix $\mathbf{H}$ has the block structure

203: \[ \mathbf{H} = \left( \begin{array}{cc}

204:               \mathbf{H_0} & \mathbf{H_1} \\

205:               \mathbf{H_1^\top}& \mathbf{H_2}

206:     \end{array}\right), \]

207: where $\mathbf{H_0}$ is an $\ell \times \ell$ matrix. Neglecting a constant term, the

208: optimal vector is determined by the vector $\mathbf{t}$ minimizing

209: \begin{equation}

210: \mathcal{E}(\mathbf{t})=\mathbf{t}^\top \mathbf{H_0} \mathbf{t} +2\mathbf{t}^\top

211: \mathbf{H_1} \mathbf{u} \label{eeee}

212: \end{equation}

213:  under the constraint $|| \mathbf{t} ||^2=1-|| \mathbf{u} ||^2$.

214: The first term of $\mathcal{E}$ favors projections of the $\ell$ points with great

215: variance, whereas the second term measures their consistency with  labeled points. Let us

216: denote $\{\Psi_{\alpha'}\}$ and $\{\mu_{\alpha'}\}$ the eigenvectors and eigenvalues of

217: $\mathbf{H_0}$, sorted into increasing $\mu_{\alpha'}$. We express

218: $\mathbf{t}=\sum_{\alpha' =1}^\ell \xi_{\alpha'} \Psi_{\alpha'}$. The coefficients

219: $\xi_{\alpha'}$ for the minimum are given by

220: $$\xi_{\alpha'} ={f_{\alpha'} \over \mu -\mu_{\alpha'}},$$

221: where $f_{\alpha'}= \Psi_{\alpha'}^\top\mathbf{H_1} \mathbf{u}$, and $\mu$ is a Lagrange

222: multiplier which must be tuned to satisfy:

223: \begin{equation} \label{csi} g(\mu)= \sum_{\alpha' =1}^\ell

224: \left({f_{\alpha'} \over \mu -\mu_{\alpha'}}\right)^2=1-|| \mathbf{u} ||^2.

225: \end{equation}
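
The expression for $\xi_{\alpha'}$ can be recovered by a standard Lagrange multiplier argument, sketched here for completeness; the function $\mathcal{L}$ is notation introduced only for this sketch. Writing
\[
\mathcal{L}(\mathbf{t},\mu)=\mathbf{t}^\top \mathbf{H_0}\mathbf{t}+2\mathbf{t}^\top\mathbf{H_1}\mathbf{u}
-\mu\left(\mathbf{t}^\top\mathbf{t}-1+||\mathbf{u}||^2\right),
\]
stationarity with respect to $\mathbf{t}$ gives
\[
\left(\mu\mathbf{I}-\mathbf{H_0}\right)\mathbf{t}=\mathbf{H_1}\mathbf{u}
\quad\Longrightarrow\quad
\left(\mu-\mu_{\alpha'}\right)\xi_{\alpha'}=f_{\alpha'},
\]
where the second relation is obtained by projecting onto the orthonormal eigenvectors $\Psi_{\alpha'}$ of $\mathbf{H_0}$. In the same basis the energy (\ref{eeee}) reads
\[
\mathcal{E}=\sum_{\alpha'=1}^\ell\left(\mu_{\alpha'}\xi_{\alpha'}^2+2f_{\alpha'}\xi_{\alpha'}\right),
\]
which can be evaluated directly for each root $\mu$ of (\ref{csi}).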

226:

227: Equation (\ref{csi}) always has at least one solution with $\mu < \mu_1$, see figure 1,

228: and usually this is the one minimizing $\mathcal{E}$. However, all the solutions of

229: (\ref{csi}) must be compared according to their {\it energies} $\mathcal{E}$; the one

230: corresponding to the lowest $\mathcal{E}$, $\bf{y^\star}$, is then selected. Clearly as

231: $m\to 0$ one recovers the first eigenvector of $\mathbf{H_0}$, i.e. the first kernel

232: principal component: $\bf{y^\star}$ thus constitutes a generalization of the latter to

233: the semi-supervised case. To construct the other generalized kernel principal components,

234: we make the following transformation on matrix $\mathbf{H}$:

235: $$\mathbf{\tilde{H}}=\mathbf{H}-\mathbf{P^\star}\mathbf{H}

236: -\mathbf{H}\mathbf{P^\star}+\mathbf{P^\star}\mathbf{H}\mathbf{P^\star},$$ where

237: $\mathbf{P^\star}=\bf{y^\star}\bf{y^\star}^\top$ is the projector on the linear subspace

238: spanned by $\bf{y^\star}$. The symmetric matrix $\mathbf{\tilde{H}}$ has   the lowest

239: eigenvalue equal to zero and corresponding to eigenvector $\bf{y^\star}$. The system of

240: eigenvectors of $\mathbf{\tilde{H}}$ constitutes a generalization of kernel principal

241: components to the semi-supervised case. \section{Experiments}

242: \subsection{Generalizing kernel principal components}

243: Now we present some simulations of the proposed method, focusing on the dimensionality

244: reduction issue and comparing  with  fully unsupervised kernel principal component

245: analysis. We consider three well known data sets: IRIS (100 points in a four-dimensional

246: space, second and third classes, versicolor and virginica); the colon cancer data set of

247: \cite{alon}, consisting of 40 tumor and 22 normal colon tissue samples, each sample

248: being described by the $100$ most discriminant genes; the leukemia data set of

249: \cite{golub}, consisting of bone marrow tissue samples, $47$ affected by

250: acute lymphoblastic leukemia (ALL) and $25$ by acute myeloid leukemia (AML), each sample

251: being described by the $500$ most discriminant genes. The following question is

252: addressed: is $\bf{y^\star}$ more correlated to the true labels than the fully

253: unsupervised first kernel principal component? Here we restrict our analysis to the

254: linear kernel.

255:

256: We start with IRIS and proceed as follows. We randomly select $m=4$ points and, treating

257: them as labeled, we find the system of eigenvectors of $\mathbf{\tilde{H}}$. Then  we

258: evaluate the linear correlation $R$ between the eigenvectors and the true labels of the

259: whole data-set. The distributions of $R$ for the four eigenvectors are depicted in figure

260: 2. We observe that in most cases the vector $\bf{y^\star}$ is more correlated with the

261: true classes than the fully unsupervised principal component: the one-dimensional

262: projection of data onto $\bf{y^\star}$ is more informative than the first principal

263: component. However there are situations where use of labeled points leads to poor

264: results; a typical example is depicted in figure 3. In figure 4 a situation is depicted

265: where knowledge of labeled points leads to a relevant improvement.
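
The text does not spell out the definition of $R$. Assuming it is the usual Pearson coefficient between an eigenvector $\mathbf{v}$ and the vector $\mathbf{s}$ of true labels (both symbols are introduced here only for illustration), it would read
\[
R={\sum_{i=1}^N \left(v_i-\langle v\rangle\right)\left(s_i-\langle s\rangle\right)
\over
\sqrt{\sum_{i=1}^N \left(v_i-\langle v\rangle\right)^2}\,
\sqrt{\sum_{i=1}^N \left(s_i-\langle s\rangle\right)^2}},
\]
where $\langle\cdot\rangle$ denotes the average over the $N$ points. Under this assumption $R$ is unchanged when $\mathbf{v}$ is multiplied by a positive constant, so the unit-norm normalization of the target vector does not affect it.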

266:

267: In general, we denote $f$ the fraction of instances such that $\bf{y^\star}$ is more

268: correlated to the true labels than the first principal component. In figure 5 we depict

269: $f$ as a function of $\bar{m}=m/N$ for the three data sets here considered. At

270: $\bar{m}=0.16$  $f$ is already nearly one. The semi-supervised method here proposed

271: outperforms principal components almost always for large $\bar{m}$.

272:

273:

274: \subsection{Transductive inference}

275: In this subsection we demonstrate the effectiveness of the proposed approach for

276: estimating the values of a function at a set of test points, given a set of input-output

277: data points, without estimating (as an intermediate step) the regression function.

278:

279: The Boston data set is a well-known problem where one is required to estimate house

280: prices according to various statistics based on $13$ locational, economic and structural

281: features from data collected by the U.S. Census Service in the Boston, Massachusetts area. For

282: $\ell =5,10,15,20,25$, we partition the data-set of $N=506$ observations randomly 100

283: times into a training set of $N-\ell$ observations and a testing set of $\ell$

284: observations. We use a Gaussian kernel with $\sigma =1$ and set $\lambda =1$; results are

285: stable against variations of these parameters. In Table 1 we report, for each value of $\ell$, the mean squared

286: error (MSE) on the test set, averaged over the 100 runs, that we

287: obtain using the optimal target vector $\bf{y^\star}$. In Table 1 we also report the MSE

288: obtained using the classical KRR in the two step procedure: (i) estimation of the

289: regression function using the training data-set (ii) calculation of the regression

290: function at points of interest (test data-set). The improvement achieved using the

291: optimal target approach, over classical KRR, is clear.

292: \begin{table}

293: \caption{\label{tab:table1}The mean square error on the Boston data set obtained using

294: the optimal target (OT) approach and the classical kernel ridge regression (KRR) method.

295: The size of the test set is $\ell$. }

296: \begin{ruledtabular}

297: \begin{tabular}{lcr}

298: $\ell$&OT&KRR\\

299: \hline

300:     5&   2.3790   & 3.6312\\

301:

302:    10&   2.7938  &  4.0111 \\

303:

304:    15& 2.9460 &   4.1057 \\

305:

306:    20&   3.1024 &   4.1802 \\

307:

308:    25&   3.1569 &   4.1653 \\

309: \end{tabular}

310: \end{ruledtabular}

311: \end{table}

312:

313: We also consider five well known pattern recognition data sets from the UCI database: we

314: evaluate the optimal target vector, and points are then attributed to classes according to

315: the sign of $\bf{y^\star}$. We compare with the transductive linear discrimination (TLD)

316: approach developed in \cite{trans}; the performance of a classifier is measured by its

317: average error over 100 partitions of the data-sets into training and testing sets. We use

318: the linear kernel with $\lambda =1$; however, the results are stable to variations of

319: $\lambda$. Obviously, our approach and TLD are applied to the same partitions of

320: data-sets, so that the comparison is meaningful. The results are shown in Table 2: our

321: approach outperforms TLD.

322: \begin{table}

323: \caption{\label{tab:table2}The percentage test error of transductive linear

324: discrimination and optimal target approach, on five datasets from UCI database.}

325: \begin{ruledtabular}

326: \begin{tabular}{lcr}

327: &TLD&OT\\

328: \hline

329:     Diabetes&   23.3   & 11.98\\

330:

331:    Titanic&   22.4  &  6.52 \\

332:

333:    Breast Cancer& 25.7 &   16.7 \\

334:

335:    Heart&   15.7 &   3.3 \\

336:

337:    Thyroid&   4.0 &   4.0 \\

338: \end{tabular}

339: \end{ruledtabular}

340: \end{table}

341:

342: It is worth stressing that our results are obtained without a fine-tuning of parameters.

343: In particular, note that our definition of optimal target vector fixes the relative

344: importance of the two terms in equation (\ref{eeee}).

345: \section{Conclusions}

346: \label{conc} We have presented a new  approach to semi-supervised learning based on the

347: notion of optimal target vector, the target vector such that KRR provides the minimum

348: training error over all the possible target vectors. The proposed algorithm is

349: characterized by the fact that the first kernel principal component is recovered as the

350: cardinality of labeled points vanishes; hence it may be seen as a semi-supervised

351: generalization of Kernel Principal Components Analysis. The effectiveness of the proposed

352: approach for transductive inference has also been demonstrated.

353:

354:

355:

356: \vskip 0.4 cm\par\noindent{\bf Acknowledgements.} The authors thank Olivier Chapelle for a

357: valuable correspondence on the subject of this paper. Discussions on semi-supervised

358: learning with Eytan Domany and Noam Shental are warmly acknowledged.

359:

360: \begin{thebibliography}{99}

361: \bibitem{zu} X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields

362: and harmonic functions. Proc. 20-th Int. Conf. Machine Learning 2003.

363:

364: \bibitem{zhu} D. Zhou, O. Bousquet, T.N. Lal, J. Weston, and B. Sch\"{o}lkopf. Learning with local and

365: global consistency. {\it Advances in Neural Information Processing Systems}, 16, S. Thrun

366: et al. (Eds.), MIT press, Cambridge, MA, 2004.

367:

368: \bibitem{ch} O. Chapelle, A. Zien, Semi-supervised classification by low density separation.

369: International Workshop on Artificial Intelligence and Statistics, AI STATS 2005,

370: Barbados.

371:

372: \bibitem{smo} A. Smola, R. Kondor, Kernels and regularizations on graphs, COLT/Kernel Workshop

373: 2003.

374:

375: \bibitem{arg} A. Argyriou, M. Herbster, M. Pontil, Combining Graph Laplacians for Semi-Supervised

376: Learning, {\it Advances in Neural Information Processing Systems}, 18, Y. Weiss and B.

377: Sch\"{o}lkopf and J. Platt (Eds.), MIT press, Cambridge, MA, 2006.

378:

379: \bibitem{zz} Xiaojin Zhu, Semi-supervised Learning Literature Survey. Computer Sciences TR 1530,

380: University of Wisconsin - Madison.

381:

382: \bibitem{zhou} Z.H. Zhou, M. Li, Semi-supervised regression with co-training. Proceedings

383: International Joint Conference on Artificial Intelligence (IJCAI) 2005.

384:

385: \bibitem{getz} G. Getz, N. Shental, E. Domany, Semi-supervised learning - a statistical physics

386: approach. Proceedings of the 22nd ICML Workshop on Learning with Partially Classified

387: Training Data. Bonn, Germany 2005.

388:

389: \bibitem{zzz} X. Zhu, J. Lafferty, Z. Ghahramani, Combining active learning and semi-supervised

390: learning using Gaussian fields and harmonic functions. ICML 2003 workshop on The

391: continuum from labeled to unlabeled data in Machine Learning and Data mining.

392:

393: \bibitem{kpca} B. Sch\"{o}lkopf, A. Smola, K.-R. M\"{u}ller, Nonlinear Component Analysis as a Kernel Eigenvalue Problem,

394: {\it Neural Computation} {\bf 10} 1299

395: (1998).

396:

397: \bibitem{vapnik} V. Vapnik. Estimation of dependences based on empirical data.

398: Springer-Verlag, New York, 1982.

399:

400: \bibitem{st} J. Shawe-Taylor and N. Cristianini, {\it Kernel Methods for Pattern Analysis},

401: Cambridge University Press, 2004.

402:

403: \bibitem{as} N. Ancona and S. Stramaglia, An invariance property of predictors in kernel-induced

404: hypothesis spaces, Neural Comput. 18:749-759, 2006.

405:

406: \bibitem{prep} N. Ancona and S. Stramaglia, unpublished.

407:

408: \bibitem{ang}L. Angelini, D. Marinazzo, M. Pellicoro, S. Stramaglia, Kernel method for clustering

409: based on optimal target vector. Physics Letters A {\bf 357} 413 (2006).

410:

411: \bibitem{alon} U. Alon et al., Broad Patterns of Gene Expression Revealed by Clustering Analysis of

412: Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays, {\it PNAS} {\bf 96} 6745

413: (1999).

414:

415: \bibitem{golub}T.R. Golub et al., Molecular Classification of Cancer: Class Discovery and Class

416: Prediction by Gene Expression Monitoring, {\it Science} {\bf 286} 531 (1999).

417:

418: \bibitem{trans} O. Chapelle and V. Vapnik and J. Weston, Transductive Inference for Estimating Values of Functions,

419: Advances in Neural Information Processing Systems, Vol. 12, 1999.

420: \end{thebibliography}

421:

422: \begin{figure}[ht!]

423: \begin{center}

424: \epsfig{file=fig1.eps,height=8.5cm}

425: \end{center}

426: \caption{{\small  The solutions of equation (\ref{csi}) are depicted, for a typical

427: instance  of four labeled points in the IRIS data set. The star corresponds to the

428: solution with $\mu < \mu_1$, which has the smallest energy $\mathcal{E}$.\label{fig1}}}

429: \end{figure}

430: \begin{figure}[ht!]

431: \begin{center}

432: \epsfig{file=fig2.eps,height=8.5cm}

433: \end{center}

434: \caption{{\small  For the IRIS data set and $m=4$, we depict the distribution (over

435: 10000 random selections of labeled points) of the linear correlation $R$  between

436: eigenvectors of $\mathbf{\tilde{H}}$ and the true labels. From the left to the right and

437: the top to the bottom, we refer to the first, the second, the third and the fourth

438: eigenvector. Grey (black) histogram bars denote values of $R$ lower (greater) than those

439: of the corresponding fully unsupervised principal component. \label{fig2}}}

440: \end{figure}

441:

442: \begin{figure}[ht!]

443: \begin{center}

444: \epsfig{file=fig3.eps,height=8.5cm}

445: \end{center}

446: \caption{{\small  (Top) The IRIS data set is depicted in the plane of the first two

447: principal components, $\star$ versicolor, $+$ virginica. The linear correlation of the

448: first principal component with the true labels is $R=0.732$. Four selected points are

449: surrounded by a circle. (Bottom) The data set is represented in the plane of the first

450: two eigenvectors of $\mathbf{\tilde{H}}$. The linear correlation between $\bf{y^\star}$

451: and  the true labels is $R=0.615$. (Note that two circles are almost overlapping and thus

452: difficult to distinguish). \label{fig3}}}

453: \end{figure}

454:

455: \begin{figure}[ht!]

456: \begin{center}

457: \epsfig{file=fig4.eps,height=8.5cm}

458: \end{center}

459: \caption{{\small  (Top) The IRIS data set is depicted in the plane of the first two

460: principal components, $\star$ versicolor, $+$ virginica. Four selected points are

461: surrounded by a circle. (Bottom) The data set is represented in the plane of the first

462: two eigenvectors of $\mathbf{\tilde{H}}$. The linear correlation between $\bf{y^\star}$

463: and  the true labels is, in this case, $R=0.846$.\label{fig4}}}

464: \end{figure}

465: \begin{figure}[ht!]

466: \begin{center}

467: \epsfig{file=fig5.eps,height=8.5cm}

468: \end{center}

469: \caption{{\small The fraction $f$ (see the text) is depicted as a function of $\bar{m}$

470: for the three data sets considered here. 10000 random selections of the labeled points are

471: considered for each value of $m$ and for each data-set. \label{fig5}}}

472: \end{figure}

473: \end{document}

474: