0211:cs0211006/arxiv.tex

1: \documentclass[12pt]{article}

2: \usepackage{graphicx,latexsym}

3: \newcommand{\bi}[1]{\mbox{\boldmath $#1$}}

4:

5: \newtheorem{proposition}{Proposition}

6: \newtheorem{example}{Example}

7:

8: %\newcommand{\mycal}{\mathsf}

9: \newcommand{\mylag}{\alpha}

10: \newcommand{\mycal}{\mathcal}

11: \newcommand{\inner}[2]{#1\cdot#2}

12: \newcommand{\wt}{\omega}

13: \newcommand{\wtsvm}{\wt_{\mathrm{SVM}}}

14: \newcommand{\fsvm}{f_{\mathrm{SVM}}}

15: \newcommand{\myft}[1]{{#1}^*}

16: \newcommand{\myftx}{\myft{\bix}}

17: \newcommand{\svmft}[1]{{#1}^\dag}

18: \newcommand{\mycd}[1]{\hat{#1}}

19: \newcommand{\transpose}{^{\mathsf T}}

20: \newcommand{\zeroth}{^{(0)}}

21: \newcommand{\myraw}{^{\mathrm{raw}}}

22: \newcommand{\kth}{^{(k)}}

23: \newcommand{\kpth}{^{(k+1)}}

24: \newcommand{\knl}{\mathrm{k}}

25: \newcommand{\knlmat}{C}

26: \newcommand{\knlmatx}{D}

27: \newcommand{\knlvec}{\bi{c}}

28: \newcommand{\kx}{\mathbf{k}_x}

29: \newcommand{\kxy}{\mathrm{K}_{xy}}

30: \newcommand{\bia}{\bi{a}}

31: \newcommand{\bib}{\bi{b}}

32: \newcommand{\bid}{\bi{d}}

33: \newcommand{\bix}{\bi{x}}

34: \newcommand{\biy}{\bi{y}}

35: \newcommand{\bipsi}{\bi{\psi}}

36: \newcommand{\bieps}{\bi{\varepsilon}}

37: \newcommand{\mycdd}{\mycd{\bid}}

38: \newcommand{\mycdx}{\mycd{\bix}}

39: \newcommand{\mycdw}{\mycd{\wt}}

40: \newcommand{\mycda}{\mycd{a}}

41: \newcommand{\mycdb}{\mycd{\bib}}

42: \newcommand{\mycdg}{\mycd{g}}

43: \newcommand{\mycdeta}{\mycd{\eta}}

44: \newcommand{\mycdp}{\mycd{p}}

45: \newcommand{\mycdq}{\mycd{\bi{q}}}

46: \newcommand{\mycdr}{\mycd{r}}

47: \newcommand{\mycds}{\mycd{s}}

48: \newcommand{\mycdt}{\mycd{t}}

49: \newcommand{\mycdu}{\mycd{u}}

50: \newcommand{\mynew}{^{\mathrm{new}}}

51: \newcommand{\myold}{^{\mathrm{old}}}

52: \newcommand{\myprev}{^{[l]}}

53: \newcommand{\mynext}{^{[l+1]}}

54: \newcommand{\mycdf}{\mycd{f}}

55: \newcommand{\gnorm}{_{G_i}^2}

56: \newcommand{\sgnorm}{_{G_i}}

57: \newcommand{\inorm}{_{G_i^{-1}}^2}

58: \newcommand{\jnorm}{_{G_j^{-1}}^2}

59: \newcommand{\sinorm}{_{G_i^{-1}}}

60: \hyphenation{di-men-sion-al}

61: \title{Maximizing the Margin in the Input Space}

62:

63: \author{

64: Shotaro Akaho \\

65: AIST Neuroscience Research Institute\\

66: 1--1 Central 2, Umezono, Tsukuba 3058568 Japan \\

67: {\texttt{s.akaho@aist.go.jp}}}

68:

69: \begin{document}

70:

71: \maketitle

72:

73: \begin{abstract}

74:  We propose a novel criterion for support vector machine learning:

75:  maximizing the margin in the input space, not in the feature (Hilbert) space.

76:  This criterion is a discriminative version of the principal curve

77:  proposed by Hastie et al.

78:  The criterion is appropriate in particular when the input space is

79:  already a well-designed feature space with rather small dimensionality.

80:  The definition of the margin is generalized

81:  in order to represent prior knowledge.

82:  The derived algorithm consists of two alternating steps to estimate the

83:  dual parameters.

84:  Firstly, the parameters are initialized by the original SVM.

85:  Then one set of parameters is updated by a Newton-like procedure, and

86:  the other set is updated by solving a quadratic programming problem.

87:  The algorithm converges in a few steps to a local optimum under mild

88:  conditions and it preserves the sparsity of support vectors.

89:  Although the complexity of calculating temporary variables increases,

90:  the complexity to solve the quadratic programming problem for each step

91:  does not change.

92:  It is also shown that the original SVM can be seen as a special case.

93:  We further derive a simplified algorithm which enables us to use

94:  the existing code for the original SVM.

95: \end{abstract}

96:

97: \section{Introduction}

98: The support vector machine (SVM) is known as one of the state-of-the-art

99: methods especially for pattern recognition

100: \cite{cortes,mueller,vapnik}.

101: The original SVM maximizes the margin which is

102: defined by the minimum distance between samples

103: and a separating hyperplane in a Hilbert space $\mycal H$.

104: Even when the dimensionality of $\mycal H$ is very large,

105: it has been proved that the original SVM has

106: a bound for a generalization error

107: which is independent of the dimensionality.

108: In practice, however,

109: the original SVM sometimes gives a very small margin in the input

110: space, because the metric of the feature space is usually quite different from

111: that of the input space.

112: Such a situation is undesirable in particular when the input space

113: is already a well-designed feature space by using some prior

114: knowledge\cite{amari,decoste,jaakkola,simard,tsuda}.

115:

116: This paper gives a learning algorithm to maximize the

117: margin in the input space.

118: One difficulty is getting an explicit form of the

119: margin in the input space, because the classification boundary is curved and

120: the vertical projection from a sample point to the boundary is not

121: always unique. We solve this problem by linear approximation

122: techniques.  The derived algorithm basically consists of iterations

123: of the following two alternating stages:

124: one is to estimate the projection point and the other is

125: to solve a quadratic programming problem to find optimal parameter values.

126:

127: Such a dual structure appears in other frameworks, such as

128: the EM algorithm and variational Bayes.

129: More closely related work is the principal curve proposed by

130: Hastie et al.~\cite{hastie}. The principal curve finds a curve in the `center'

131: of the points in the input space.

132:

133: The derived algorithm is not a gradient-descent type but Newton-like;

134: hence we have to investigate its convergence property.

135: It is shown that the derived

136: algorithm does not always converges to the global optimum, but

137: it converges to a local optimum under mild conditions.

138: Some interesting relations to the original SVM are also shown:

139: the original SVM can be seen as a special case of the algorithm;

140: and the number of support vectors does not increase much compared with the

141: original SVM.

142: The algorithm is verified through simple simulations.

143:

144: \section{Generalized margin in the input space}

145:

146: We consider a binary classification problem.

147: The purpose of learning is to construct a map from an $m$-dimensional input

148: $\bix\in{\Re}^m$ to a corresponding output $y\in\{\pm1\}$ by using

149: a finite number of samples $(\bix_1,y_1),\ldots,(\bix_n,y_n)$.

150:

151: Let us consider a linear classifier,

152: $y=\mbox{sgn}[f(\bix)]$, where

153: $f(\bix) \equiv \inner{\wt}{\phi(\bix)} + f_0$;

154: $\phi(\bix)$ is a feature of an input $\bix$ in

155: a Hilbert space $\mycal H$,

156: $\wt\in \mycal H$ is a weight parameter

157: and $f_0\in \Re$ is a bias parameter.

158: Those parameters $\wt$ and $f_0$ define a separating hyperplane in the

159: feature space.

160: As a feature function $\phi(\bix)$, we only consider a differentiable

161: nonlinear map.

162:

163: A margin in the input space is defined by the minimum distance from sample

164: points to the classification boundary in the input space.

165: Since the classification boundary forms a complex curved surface,

166: the distance cannot be obtained in an explicit form, and more

167: significantly, a projection from a point to the boundary is not unique.

168:

169: Here, the metric in the input space need not be Euclidean.

170: Some Riemannian metric $G(\bix)$ may be defined, which

171: enables us to represent many kinds of prior knowledge.

172: For example, the invariance of patterns\cite{mueller,simard} can be implemented

173: in this form.

174: Another example is that

175: the Fisher information matrix is a natural metric

176: when the input space is a parameter space

177: of some probability distribution\cite{amari,jaakkola}.

178: Although it is theoretically preferable to measure the distance by

179: the length of a geodesic in the Riemannian space,

180: doing so is computationally difficult.

181: In our formulation, since we only need a distance from a sample point to

182: another point, we use a computationally feasible (nonsymmetric) distance

183: from a sample point $\bix_i$ to another point $\bix$ in the quadratic norm,

184: \[

185: \|\bix-\bix_i\|\gnorm =

186:   (\bix-\bix_i)\transpose G_i(\bix-\bix_i),

187: \]

188: where $G_i\equiv G(\bix_i)$.

189:

190: For simplicity, we mainly consider the hard margin case in which

191: sample points are separable by a hyperplane in the Hilbert space.

192: The soft margin case is discussed in section \ref{sec:soft}.

193:

194: Let $\myftx_i$ be the closest point on the boundary

195: surface from a sample point $\bix_i$, and

196: $\bid_i \equiv \myftx_i - \bix_i$.

197: Since $\bid_i$ is invariant under a scalar transformation of $(\wt,f_0)$,

198: we can assume that all points are separated and satisfy

199: \begin{equation}

200:   \label{eq:constraint}

201:   \|\bid_i\|\gnorm \ge {1/\inner{\wt}{\wt}},\quad i=1,\ldots,n.

202: \end{equation}

203: If we assume at least one of them is an equality,

204: the margin is given by $1/\sqrt{\inner{\wt}{\wt}}$.

205: Then we can find the optimal parameter by minimizing

206: a quadratic objective function $\inner{\wt}{\wt}$

207: with the constraints (\ref{eq:constraint}) and $y_i f(\bix_i) > 0$.

208:

209: In order to solve the optimization problem, we start from a solution

210: of the original SVM and update the solution iteratively.

211: By two kinds of linearization technique and a kernel trick

212: which are described in the next section, we obtain

213: a discriminant function at the $k$-th iteration step in the form of

214: \begin{equation}

215: \label{eq:f}

216:  f(\bi{x})=\sum_{i\in \mathrm{S.V.}} \{a_i\kth \knl(\mycdx_i\kth,\bix) +

217:   \bib_i\kth{}\transpose \kx(\mycdx_i\kth, \bix)\} + f_0\kth,

218: \end{equation}

219: where S.V. is a set of indices of support vectors,

220: $\knl(\bix,\biy)$ is a kernel function and $\kx(\bix,\biy)$ is its

221: derivative defined by $\kx(\bix,\biy)\equiv {\partial

222: \knl(\bix,\biy)/\partial\bix}$.

223: We have two groups of parameters here: One is of $a_i$, $\bib_i$ and $f_0$

224: which are parameters of linear coefficients, and the other is

225: of $\mycdx_i$ which is an estimate of

226: the projection point $\myftx_i$ and forms base functions.

227: $a_i$ and $f_0$ are initialized by the corresponding parameters in the

228: original SVM and the other parameters are initialized by

229: $\bib_i=\mathbf0$, $\mycdx_i=\bix_i$.

230:

231: \section{Iterative QP by linear approximations}

232: In this section, we overview the derivation of update rules of

233: those parameters. The resultant algorithm is summarized in sec.\ref{sec:overall}.

234:

235: \subsection{Linear approximation of the distance to the boundary}

236: \label{sec:d}

237: Given an estimated projection point $\mycdx_i$,

238: we can get an approximate distance $\|\bid_i\|\sgnorm$

239: by a linear approximation\cite{akaho}.

240: Taking the Taylor expansion of

241:  $f(\myftx_i)=0$ around $\mycdx_i$

242: up to the first order,

243: we obtain a constraint on $\bid_i$,

244: \[

245:  f(\mycdx_i) +

246:  \nabla f(\mycdx_i)\transpose (\bid_i - \mycdd_i) = 0,

247: \]

248: where $\mycdd_i = \mycdx_i-\bix_i$.

249: Minimizing $\|\bid_i\|\gnorm$ under this constraint,

250: we have

251: \begin{equation}

252: \label{eq:d}

253: \|\bid_i\|\gnorm = {(\inner{\wt}{\{\phi(\mycdx_i) -

254:  \bipsi(\mycdx_i)\transpose\mycdd_i \}}+f_0)^2\over

255: \|\inner{\wt}{\bipsi(\mycdx_i)}\|\inorm},

256: \end{equation}

257: where $\bipsi(\mycdx_i)\equiv

258: \nabla \phi(\mycdx_i)\in {\mycal H}^m$.

259: Note that this approximate value is unique, and it is invariant under a

260: scalar transformation of

261: $(\wt,f_0)$.

262: Moreover, the approximation is strictly correct when $\mycdx_i=\myftx_i$

263: and $\nabla f(\myftx_i)\ne 0$.
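
As a sketch of the step behind (\ref{eq:d}), the constrained minimization can be written out with a Lagrange multiplier.
The symbols $h_i$, $\bi{n}_i$ and $\mu$ are introduced here only for illustration, and we assume that $G_i$ is symmetric positive definite and $\bi{n}_i\ne\mathbf0$.
Write the linearized constraint as $h_i+\bi{n}_i\transpose\bid_i=0$ with
\[
 \bi{n}_i \equiv \nabla f(\mycdx_i) = \inner{\wt}{\bipsi(\mycdx_i)},\qquad
 h_i \equiv f(\mycdx_i)-\bi{n}_i\transpose\mycdd_i
  = \inner{\wt}{\{\phi(\mycdx_i)-\bipsi(\mycdx_i)\transpose\mycdd_i\}}+f_0.
\]
The stationarity condition of the Lagrangian
$\|\bid_i\|\gnorm - 2\mu(h_i+\bi{n}_i\transpose\bid_i)$
is $G_i\bid_i=\mu\bi{n}_i$, and substituting it into the constraint gives
$\mu=-h_i/\|\bi{n}_i\|\inorm$, so that
\[
 \bid_i = -{h_i\over\|\bi{n}_i\|\inorm}\,G_i^{-1}\bi{n}_i,\qquad
 \|\bid_i\|\gnorm = \mu^2\,\bi{n}_i\transpose G_i^{-1}\bi{n}_i
  = {h_i^2\over\|\bi{n}_i\|\inorm},
\]
which is the right hand side of (\ref{eq:d}).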

264:

265: \subsection{Linearization of the constraint}

266: \label{sec:qp}

267: Using the approximate value of the distance, we have a nonlinear

268: constraint,

269: \begin{equation}

270: \label{eq:NLconst}

271:  y_i\left[\inner{\wt}\{\phi(\mycdx_i) -

272:  \bipsi(\mycdx_i)\transpose\mycdd_i \}+f_0\right]

273:   \ge {\|\inner{\wt}{\bipsi(\mycdx_i)}\|\sinorm\over\sqrt{\inner{\wt}{\wt}}}.

274: \end{equation}

275: Since the constraint is nonlinear for $\wt$, we linearize it around

276: an approximate solution $\wt=\mycdw$ which is the solution at

277: a current step.

278: This linearization not only simplifies the problem, but

279: also enables us to derive a dual problem.
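
As a quick sanity check of (\ref{eq:NLconst}), consider the special case
(assumed here only for illustration) of the identity feature map
$\phi(\bix)=\bix$ with the Euclidean metric $G_i=I$.
Then $\bipsi(\bix)$ is the identity matrix, so that
$\inner{\wt}{\bipsi(\mycdx_i)}=\wt$ and
$\phi(\mycdx_i)-\bipsi(\mycdx_i)\transpose\mycdd_i=\mycdx_i-\mycdd_i=\bix_i$,
and the constraint becomes
\[
 y_i\left(\inner{\wt}{\bix_i}+f_0\right) \ge
  {\|\wt\|\over\sqrt{\inner{\wt}{\wt}}} = 1.
\]
In this case the constraint no longer depends on $\mycdx_i$ and coincides with
the hard margin constraint of the linear SVM.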

280:

281: Letting $g_i(\wt)$ be the right-hand side of (\ref{eq:NLconst}),

282: the first-order expansion is

283: \[

284:   g_i(\wt) \simeq g_i(\mycdw) +

285:    \inner{\left({\partial g_i(\mycdw)/\partial\wt}\right)}{(\wt-\mycdw)}.

286: \]

287: Now let $\mycdg_i \equiv g_i(\mycdw),

288:  \mycdeta_i \equiv {\partial g_i(\mycdw)/\partial\wt}$,

289: then we have a linear constraint for $\wt$,

290: \begin{equation}

291: \label{eq:constraint3}

292:  \inner{\wt}{[y_i\ \{\phi(\mycdx_i) -

293: \bipsi(\mycdx_i)\transpose\mycdd_i

294:  \}-\mycdeta_i]}\ge \mycdg_i- f_0 y_i,

295: \end{equation}

296: where we used the fact $\inner{\mycdw}{\mycdeta_i}=0$.

297: Suppose $\mycdq_i \equiv \inner{\mycdw}{\bipsi(\mycdx_i)}$ and

298: $\mycdr \equiv \inner{\mycdw}{\mycdw}$,

299: then $\mycdg_i$ and $\mycdeta_i$ are given by

300: \begin{eqnarray}

301: \label{eq:h}

302:  \mycdg_i &=& {1\over \sqrt{\mycdr}}\|\mycdq_i\|\sinorm,\nonumber\\

303:  \mycdeta_i

304:    &=& {1\over \mycdg_i \mycdr} \left\{\mycdq_i\transpose G_i^{-1}

305:     \bipsi(\mycdx_i) -{1\over\mycdr}\|\mycdq_i\|\inorm\mycdw\right\}.

306: \end{eqnarray}
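
The expressions in (\ref{eq:h}) can be checked by differentiating $g_i$ directly.
The following is only a sketch: it treats $\wt$ formally as a finite-dimensional
vector and assumes that $G_i$ is symmetric and $\mycdq_i\ne\mathbf0$.
Since
$g_i(\wt)=\|\inner{\wt}{\bipsi(\mycdx_i)}\|\sinorm\,(\inner{\wt}{\wt})^{-1/2}$,
the chain rule evaluated at $\wt=\mycdw$ gives
\[
 \mycdeta_i =
  {\mycdq_i\transpose G_i^{-1}\bipsi(\mycdx_i)\over
   \|\mycdq_i\|\sinorm\sqrt{\mycdr}}
  -{\|\mycdq_i\|\sinorm\over\mycdr^{3/2}}\,\mycdw,
\]
and substituting $\|\mycdq_i\|\sinorm\sqrt{\mycdr}=\mycdg_i\mycdr$
recovers the second line of (\ref{eq:h}).
Taking the inner product with $\mycdw$ then yields
\[
 \inner{\mycdw}{\mycdeta_i} =
  {1\over\mycdg_i\mycdr}\left\{\mycdq_i\transpose G_i^{-1}\mycdq_i
   -{1\over\mycdr}\|\mycdq_i\|\inorm\,\mycdr\right\} = 0,
\]
as expected from the fact that $g_i$ is invariant under a rescaling of $\wt$.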

307: By the above linearization, we can derive the dual problem

308: in a similar way to the original SVM,

309: \begin{eqnarray}

310: \lefteqn{W(\bi{\mylag}) = \sum_i \mycdg_i\mylag_i} \nonumber\\

311: && -{1\over2}

312:   \sum_{i,j}\mylag_i\mylag_j [y_i \{\phi(\mycdx_i) -

313: \bipsi(\mycdx_i)\transpose\mycdd_i

314:  \}-\mycdeta_i]\cdot[y_j \{\phi(\mycdx_j) -

315:  \bipsi(\mycdx_j)\transpose \mycdd_j

316:  \}-\mycdeta_j], \nonumber

317: \end{eqnarray}

318: which is maximized under the constraints $\mylag_i\ge0$

319: and $\sum_i\mylag_i y_i = 0$.

320: The solution $\wt$ is given by

321: \begin{equation}

322: \label{eq:wt}

323: \wt = \sum_i \mylag_i [y_i \{\phi(\mycdx_i) -

324: \bipsi(\mycdx_i)\transpose\mycdd_i

325:  \}-\mycdeta_i].

326: \end{equation}

327: Here we can see an apparent relation to the original SVM, i.e.,

328: by letting $\mycdx_i=\bix_i$, $\mycdeta_i=0$, and $\mycdg_i=1$,

329: we have exactly the same optimization problem as the original SVM.

330:

331: \subsection{Kernel trick}

332:

333: In order to avoid computing the mapping into a high-dimensional

334: Hilbert space, SVM applies a kernel trick, by which

335: an inner product is replaced by a symmetric positive definite

336: kernel function (Mercer kernel) that is easy to

337: calculate\cite{ramsey,cortes,mueller,vapnik}.

338: In our formulation,

339: $\inner{\phi(\bix)}{\phi(\biy)}$ is replaced by a Mercer kernel

340: $\knl(\bix,\biy)$.

341: We also have to calculate the inner product

342: related to $\bipsi$ (the derivative of $\phi$).

343: Let us assume that the kernel function $\knl$ is differentiable.

344: Then, $\inner{\bipsi(\bix)}{\phi(\biy)}$

345: is replaced by a vector

346: $\kx(\bix,\biy)\equiv {\partial \knl(\bix,\biy)/\partial\bix}$,

347: and $\inner{\bipsi(\bix)}{\bipsi(\biy)\transpose}$

348: is replaced by a matrix

349: $\kxy(\bix,\biy)

350: \equiv {\partial^2 \knl(\bix,\biy)/\partial\bix\partial\biy\transpose}$.
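
For concreteness, the derivative kernels can be written out for a spherical
Gaussian kernel such as the one used in section \ref{sec:simulation}.
The parameterization below, with the factor $1/(2\sigma^2)$ in the exponent,
is an assumption made for this illustration:
\begin{eqnarray*}
 \knl(\bix,\biy) &=& \exp\left(-{\|\bix-\biy\|^2\over2\sigma^2}\right),\\
 \kx(\bix,\biy) &=& -{1\over\sigma^2}(\bix-\biy)\,\knl(\bix,\biy),\\
 \kxy(\bix,\biy) &=& \left\{{1\over\sigma^2}I
   -{1\over\sigma^4}(\bix-\biy)(\bix-\biy)\transpose\right\}\knl(\bix,\biy).
\end{eqnarray*}
In particular, $\kx(\bix,\bix)=\mathbf0$ and $\kxy(\bix,\bix)=I/\sigma^2$
for this kernel.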

351:

352: Now we can derive the kernel version of the optimization problem.

353: In (\ref{eq:wt}), $\mycdeta_i\in \mycal H$ has bases related to

354: $\bipsi(\mycdx_i)$ and $\mycdw$,

355: and the solution $\wt$ has bases $\phi(\mycdx_i)$ additionally.

356: Although $\mycdw$ can have any kind of bases, we restrict it

357: to the following form to avoid increasing the number of bases:

358: \[

359:  \mycdw=\sum_i \{\mycda_i \phi(\mycdx_i) +

360:   \mycdb_i\transpose \bipsi(\mycdx_i)\}.

361: \]

362: Then we have

363: $\mycdq_i = \sum_j \{ \mycda_j

364:   \kx(\mycdx_i, \mycdx_j) +

365:   \kxy(\mycdx_i,\mycdx_j)\mycdb_j

366:   \}$.

367: Now let

368: \[

369:  \mycdp_i \equiv \inner{\mycdw}{\phi(\mycdx_i)} =

370:   \sum_j \{\mycda_j\knl(\mycdx_j,\mycdx_i) + \mycdb_j\transpose

371:   \kx(\mycdx_j,\mycdx_i)\},

372: \]

373: then $\mycdr$ is given by

374: $\mycdr = \sum_i (\mycda_i \mycdp_i + \mycdb_i\transpose\mycdq_i)$,

375: and $\mycdg_i$ by (\ref{eq:h}).

376: Further, let us define additional temporary variables

377: that represent several terms in the objective function,

378: \begin{eqnarray*}

379:  \mycds_{ij} &\equiv& \inner{\{\phi(\mycdx_i) -

380: \bipsi(\mycdx_i)\transpose\mycdd_i

381:  \}}{\{\phi(\mycdx_j) -

382:  \bipsi(\mycdx_j)\transpose \mycdd_j

383:  \}} \\

384:  &=& \knl(\mycdx_i,\mycdx_j)+\mycdd_i\transpose

385:   \kxy(\mycdx_i,\mycdx_j)\mycdd_j

386:   -\mycdd_i\transpose\kx(\mycdx_i,\mycdx_j)

387:   -\mycdd_j\transpose\kx(\mycdx_j,\mycdx_i), \\

388: \mycdt_{ij} &\equiv& \inner{\mycdeta_i}

389: {\{\phi(\mycdx_j) - \bipsi(\mycdx_j)\transpose\mycdd_j\}}

390: \\

391: &=&

392: {1\over \mycdg_i \mycdr}\bigg\{\mycdq_i\transpose G_i^{-1}

393:  \left(\kx(\mycdx_i,\mycdx_j) - \kxy(\mycdx_i,\mycdx_j)\mycdd_j

394:   \right)

395:   - {\|\mycdq_i\|\inorm\over \mycdr}(

396:   \mycdp_j - \mycdd_j\transpose\mycdq_j)

397:  \bigg\}, \\

398:  \mycdu_{ij} &=& \inner{\mycdeta_i}{\mycdeta_j}

399:  =

400:  {1\over \mycdg_i \mycdg_j \mycdr^2}(\mycdq_i\transpose G_i^{-1}\kxy(\mycdx_i,\mycdx_j) G_j^{-1}\mycdq_j

401:   -{\|\mycdq_i\|\inorm\|\mycdq_j\|\jnorm\over\mycdr}),

402: \end{eqnarray*}

403: then we have the objective function in a kernel form,

404: \begin{equation}

405: W(\bi{\mylag}) = \sum_i \mycdg_i\mylag_i

406:  -{1\over2}\sum_{i,j}\mylag_i\mylag_j (y_i y_j \mycds_{ij} - y_j \mycdt_{ij}-

407:  y_i \mycdt_{ji}+\mycdu_{ij}),

408: \label{eq:qp}

409: \end{equation}

410: which is maximized under constraints

411: \begin{equation}

412: \label{eq:constrainta}

413:  \mylag_i\ge0, \qquad \sum_i y_i\mylag_i = 0.

414: \end{equation}

415:

416: The new parameters can be determined from (\ref{eq:wt}) by

417: \begin{eqnarray}

418: \label{eq:newab}

419:  a_i\kpth &=& \mylag_i y_i + \beta \mycda_i,\nonumber\\

420:  \bib_i\kpth &=& -\mylag_i\left(y_i\mycdd_i+ {G_i^{-1}\mycdq_i\over

421: 			 \mycdg_i \mycdr}\right) +\beta

422:  \mycdb_i,

423: \end{eqnarray}

424: where

425: $ \beta = \sum_j{\mylag_j\|\mycdq_j\|\inorm/\mycdg_j\mycdr^2}$.
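
The update (\ref{eq:newab}) follows by collecting coefficients; the
bookkeeping can be sketched as follows, assuming that $G_i$ is symmetric.
Substituting $\mycdeta_i$ from (\ref{eq:h}) into (\ref{eq:wt}) gives
\[
 \wt = \sum_i \mylag_i y_i\,\phi(\mycdx_i)
  -\sum_i \mylag_i\left(y_i\mycdd_i
   +{G_i^{-1}\mycdq_i\over\mycdg_i\mycdr}\right)\transpose\bipsi(\mycdx_i)
  +\beta\,\mycdw .
\]
Inserting
$\mycdw=\sum_i\{\mycda_i\phi(\mycdx_i)+\mycdb_i\transpose\bipsi(\mycdx_i)\}$
and comparing the coefficients of $\phi(\mycdx_i)$ and $\bipsi(\mycdx_i)$
on both sides yields the two lines of (\ref{eq:newab}).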

426:

427: As for the bias term $f_0$, since the constraint

428: (\ref{eq:constraint3}) should be satisfied in equality

429: for $J=\{i\mid\mylag_i\ne0\}$ from

430: the Kuhn-Tucker condition, we have for any $i\in J$,

431: \begin{equation}

432: \label{eq:newf}

433:  f_0\kpth = y_i \mycdg_i -\sum_j \mylag_j

434:   (y_j \mycds_{ji} - \mycdt_{ji} - y_i y_j \mycdt_{ij} + y_i \mycdu_{ij})

435: \end{equation}

436:

437: From (\ref{eq:newab}), we can estimate the number of support vectors.

438: Let $J_k$ be the indices of nonzero $\mylag_i$'s at the $k$-th step, then

439: the number of support vectors is bounded from above by

440: $|J_0\cup J_1 \cup \cdots \cup J_k|$. Since $J_k$ does not

441: change much as long as the structure of classification boundary

442: is similar,

443: the number of support vectors is expected not to be much larger than

444: that of the original SVM.

445:

446: \subsection{Update of the approximate projection of the points}

447: To complete the algorithm, we have to consider the update of the approximate value

448: of the projection point $\mycdx_i$, which is initialized by $\bix_i$; otherwise the converged solution is not precisely

449: what we want.

450: If good approximations $\mycdw$ and $\mycdf_0$ of

451: the solution are given, we can refine $\mycdx_i$

452: iteratively in the same way as in sec. \ref{sec:d}:

453: Suppose $\mycdw=\sum_j \{\mycda_j \phi(\mycdx_j\myold) +

454: \mycdb_j\transpose \bipsi(\mycdx_j\myold)\}$,

455: the projection point $\mycdx_i$ can be estimated by iterating

456: the following steps for $l=0,1,2,3,\cdots$,

457: \begin{equation}

458: \label{eq:upmycdx}

459:  \mycdx_i\mynext

460:    = \bix_i -

461:    {G_i^{-1}\mycdq_i\myprev\over\|\mycdq_i\myprev\|\inorm}

462:    \left[\mycdp_i\myprev

463:     - (\mycdx_i\myprev{}-\bix_i)\transpose

464:     \mycdq_i\myprev  + \mycdf_0\right]

465: \end{equation}

466: where $\mycdx_i^{[0]}$ is initialized by $\mycdx_i\myold$;

467: $\mycdp_i\myprev$ and $\mycdq_i\myprev$ are defined in a similar way as

468: $\mycdp_i$ and $\mycdq_i$,

469: \begin{eqnarray}

470:  \mycdp_i\myprev &\equiv& \inner{\mycdw}{\phi(\mycdx_i\myprev)} \nonumber\\

471:  &=&

472:   \sum_j \{\mycda_j\knl(\mycdx_j\myold,\mycdx_i\myprev) + \mycdb_j\transpose

473:   \kx(\mycdx_j\myold,\mycdx_i\myprev)\}, \nonumber \\

474:  \mycdq_i\myprev &\equiv&

475:   \inner{\mycdw}{\bipsi(\mycdx_i\myprev)}\nonumber\\

476:  &=&\sum_j \{ \mycda_j

477:   \kx(\mycdx_i\myprev, \mycdx_j\myold) +

478:   \kxy(\mycdx_i\myprev,\mycdx_j\myold)\mycdb_j

479:   \}.\nonumber

480: \end{eqnarray}

481:

482: Note that locally maximum points and saddle

483: points of the distance are also equilibrium states

484: of (\ref{eq:upmycdx}). The following proposition guarantees

485: such a point is not stable.

486: \begin{proposition}

487: A point $\mycdx_i\in {\Re}^m$ is an equilibrium state of the

488:  iteration step (\ref{eq:upmycdx}), when and only when the point

489:  is a critical point of the distance from $\bix_i$ to the

490:  separating boundary, i.e.,

491:  a local minimum, a local maximum or a saddle point.

492:  The equilibrium state is not stable when the point is a

493:  local maximum or a saddle point.

494: \end{proposition}

495: \textit{Proof:}

496: It is straightforward to show that a point is

497: an equilibrium state of the iteration step (\ref{eq:upmycdx})

498: when and only when the point is a critical point of the distance

499: $\|\bid_i\|\gnorm$. Without loss of generality,

500: we can assume the uniform metric case $G_i=I$, because

501: update rule (\ref{eq:upmycdx}) is invariant under a metric transformation.

502: We consider the behavior around a critical point $\myftx_i$.

503: Let $\mycdx_i\myprev=\myftx_i+\bieps$,

504: for a sufficiently small vector $\bieps$.

505: One can show that $\mycdx_i\myprev$ is mapped into the separating

506: hypersurface $f(\bix)=\inner{\mycdw}{\phi(\bix)}+\mycdf_0=0$

507: for a small $\bieps$ after one step iteration.

508: Therefore, we only consider the

509: case $\mycdx_i\myprev$ is on the hypersurface.

510:

511: Since $\myftx_i$ is a critical point

512: of the distance, the normal vector $\nabla f(\myftx_i)$ is

513: collinear to the distance vector $\bid_i=\myftx_i-\bix_i$, i.e.,

514: for some constant $\lambda$, it holds

515: \begin{equation}

516:  \nabla f(\myftx_i) = \lambda \bid_i.

517: \end{equation}

518: Furthermore, if $\mycdx_i\myprev$ is a point on $f(\bix)=0$,

519: $\nabla f(\myftx_i)$ is nearly orthogonal to $\bieps$,

520: i.e.,

521: \begin{equation}

522:  \nabla f(\myftx_i)\transpose \bieps \simeq 0.

523: \end{equation}

524: By expanding (\ref{eq:upmycdx}) around $\myftx_i$, we have

525: a new estimation $\mycdx_i\mynext$ by

526: \begin{equation}

527: \label{eq:mycdx}

528:  \mycdx_i\mynext \simeq \myftx_i

529:  + {1\over\lambda}\nabla^2 f(\myftx_i)\bieps

530:   - {\bid_i\transpose\nabla^2 f(\myftx_i)\bieps\over\lambda\|\bid_i\|^2}\bid_i,

531: \end{equation}

532: where $\nabla^2 f$ is the Hessian matrix of $f(\bix)$.

533: Without loss of generality, we can take the coordinate of $\bix$ as

534: follows: the first coordinate is the direction of $\bid_i$, and

535: the second to the $m$-th coordinates are taken orthogonally such that

536: an $(m-1)\times(m-1)$ submatrix of $\nabla^2 f(\myftx_i)$

537: for those coordinates is diagonalized, i.e., $\nabla^2 f(\myftx_i)$

538: is in the form,

539: \begin{equation}

540:  \nabla^2 f(\myftx_i) = \left(

541: \begin{array}{cccc}

542: c_1 & & \bi{b}\transpose & \\

543:  & c_2 & & 0 \\

544: \bi{b} & & \ddots & \\

545:  & 0 & & c_m \\

546: \end{array} \right).

547: \end{equation}

548: Under this coordinate system,

549: since $\varepsilon_1$ is negligibly small,

550: the first element calculated from the second and third terms in (\ref{eq:mycdx})

551: vanishes and we have

552: \begin{equation}

553: \mycdx_i\mynext - \myftx_i \simeq {1\over\lambda}

554:  (0, c_2 \varepsilon_2,\ldots,c_m\varepsilon_m)\transpose.

555: \end{equation}

556: The iteration step is stable at $\myftx_i$ only when

557: $\|\mycdx_i\mynext-\myftx_i\|\le\|\bieps\|$ for all small $\bieps$, i.e.,

558: $|c_j|< |\lambda|$ for all $j=2,\ldots,m$. \hfill $\Box$
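
The expansion (\ref{eq:mycdx}) can be sketched as follows under the
assumptions of the proof ($G_i=I$, $f(\mycdx_i\myprev)=0$ and
$\bid_i\transpose\bieps\simeq0$), keeping only first-order terms in $\bieps$;
the shorthand $H\equiv\nabla^2 f(\myftx_i)$ is used only here.
Since $\mycdp_i\myprev+\mycdf_0=f(\mycdx_i\myprev)=0$ and
$\mycdx_i\myprev-\bix_i=\bid_i+\bieps$,
the update (\ref{eq:upmycdx}) reduces to
\[
 \mycdx_i\mynext = \bix_i +
  {(\bid_i+\bieps)\transpose\mycdq_i\myprev\over\|\mycdq_i\myprev\|^2}\,
  \mycdq_i\myprev,\qquad
 \mycdq_i\myprev \simeq \lambda\bid_i + H\bieps .
\]
The scalar factor expands as
\[
 {(\bid_i+\bieps)\transpose\mycdq_i\myprev\over\|\mycdq_i\myprev\|^2}
 \simeq {\lambda\|\bid_i\|^2+\bid_i\transpose H\bieps\over
   \lambda^2\|\bid_i\|^2+2\lambda\,\bid_i\transpose H\bieps}
 \simeq {1\over\lambda}\left(1-
   {\bid_i\transpose H\bieps\over\lambda\|\bid_i\|^2}\right),
\]
and multiplying by $\mycdq_i\myprev$ and using $\bix_i+\bid_i=\myftx_i$ gives
\[
 \mycdx_i\mynext \simeq \myftx_i + {1\over\lambda}H\bieps
  -{\bid_i\transpose H\bieps\over\lambda\|\bid_i\|^2}\,\bid_i .
\]
In the coordinate system of the proof,
$\bid_i=\|\bid_i\|(1,0,\ldots,0)\transpose$, so the last term cancels the first
component of $H\bieps/\lambda$, leaving the components
$c_j\varepsilon_j/\lambda$ that determine the stability condition
$|c_j|<|\lambda|$.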

559:

560: The condition in the 1-$j$ plane is shown in figure \ref{fig:stability}.

561:

562: \begin{figure}[tbhp]

563:   \begin{center}

564:    \includegraphics[width=.8\textwidth]{stab.eps}

565:     \caption{Stability of projection point update}

566:     \label{fig:stability}

567:   \end{center}

568: \end{figure}

569:

570: When the point is a local maximum or saddle, the hypersurface is in the unstable

571: region. However, even in the case of a local minimum, there exists an

572: unstable region when the hypersurface is strongly curved.

573: We can avoid the undesired behavior by slowing down.

574: For example, first $c_2,\ldots,c_m$ and $\lambda$ are estimated from

575: $\nabla f$ and $\nabla^2 f$ values at the current estimate,

576: and then, if $c_j < |\lambda|$

577: for all $j=2,\ldots,m$, the point is judged to be a local minimum, and

578: the movement $\mycdx_i\mynext-\mycdx_i\myprev$

579: along the axes for which $c_j<-|\lambda|$ should be

580: shrunk by multiplying it by some factor $0 < e_j < |\lambda|/|c_j|$.

581:

582: This computationally intensive

583: treatment would usually be necessary only

584: after several steps, because

585: the instability for local minima is considered to occur only in a region that is small

586: relative to the size of $\bid_i$.

587:

588: \subsection{Projection of the hyperplane}

589: \label{sec:proj}

590: The update of $\mycdx_i$ causes another problem:

591: We assumed in section \ref{sec:qp}

592: that $\wt$ and $\mycdw$ have the same bases.

593: However, $\mycdw$ has bases based on the old $\mycdx_i$, while

594: we need the new $\wt$ based on the new $\mycdx_i$.

595: To solve that problem, $\mycdw$ is projected into new bases, i.e.,

596: from the old one

597: $\mycdw\myold=\sum_{i\in \mathrm{S.V.}}\{\mycda\myold_i

598: \phi(\mycdx_i\myold) + \mycdb\myold_i{}\transpose\bipsi(\mycdx_i\myold)

599: \}$

600: to a new one,

601: $\mycdw\mynew=\sum_{i\in \mathrm{S.V.}}\{\mycda\mynew_i

602: \phi(\mycdx_i\mynew) + \mycdb\mynew_i{}\transpose\bipsi(\mycdx_i\mynew)\}$.

603: Although $\mycdw\mynew$ could have bases other than the support vectors,

604: we restrict the bases to support vectors to

605: preserve the sparsity of bases.

606:

607: There are several possible choices for the projection.

608: In this paper, we use the one which minimizes the cost function

609: \begin{equation}

610: \label{eq:E}

611:  {1\over2}\sum_{\bix\in T} \{\inner{\mycdw\mynew}{\phi(\bix)} + \mycdf_0\mynew -

612:   (\inner{\mycdw\myold}{\phi(\bix)} + \mycdf_0\myold)\}^2,

613: \end{equation}

614: where $T$ is a certain set of $\bix$, and we use $T=$ $\{\bix_i$,

615: $\mycdx_i\myold$, $\mycdx_i\mynew$; $i=1,\cdots,n\}$.

616:

617: Minimizing (\ref{eq:E}) leads to a simple least squares problem, which can

618: be solved by linear equations.

619: Another possibility of the cost function is

620: $\|\mycdw\mynew-\mycdw\myold\|^2$, which leads to another set of

621: linear equations.
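
To make the least squares step concrete, the following sketch writes
(\ref{eq:E}) in matrix form; the symbols $\bi{\theta}$, $Z$ and $\bi{t}$
are introduced here for illustration only.
Stack the unknowns as
$\bi{\theta}=(\mycda\mynew_1,\ldots,\ \mycdb\mynew_1{}\transpose,\ldots,\
\mycdf_0\mynew)\transpose$, with indices running over the support vectors,
and for each $\bix\in T$ let the corresponding row of $Z$ be
\[
 \left(\knl(\mycdx_1\mynew,\bix),\ldots,\
  \kx(\mycdx_1\mynew,\bix)\transpose,\ldots,\ 1\right),
\]
with the corresponding entry of $\bi{t}$ equal to the old discriminant value
$\inner{\mycdw\myold}{\phi(\bix)}+\mycdf_0\myold$.
Then (\ref{eq:E}) equals ${1\over2}\|Z\bi{\theta}-\bi{t}\|^2$, and its
minimizers satisfy the normal equations
\[
 Z\transpose Z\,\bi{\theta} = Z\transpose\bi{t}.
\]
If $Z\transpose Z$ is singular or ill-conditioned, a pseudo-inverse or a small
ridge term could be used; that choice is not specified in the text and is
mentioned here only as a possibility.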

622:

623: \subsection{Overall algorithm and the convergence property}

624: \label{sec:overall}

625:

626: Now let us summarize the algorithm below.

627: \par

628: \bigskip

629: \par

630: \noindent{\textbf{\strut Algorithm 1: Algorithm to maximize the margin

631: in the input space}}

632: \hrule

633: \strut Initialization step:

634:        Let the solution of the original SVM be

635:        $a_i\zeroth$ and $f_0\zeroth$;

636:        let $\bib_i\zeroth=\mathbf0$ and $\mycdx_i\zeroth=\bix_i$.

637: \par\noindent

638: For $k=0,1,2,\ldots$, repeat the following steps until convergence:

639: \begin{enumerate}

640:  \item Update of $\mycdx_i$:

641:        Calculate $\mycdx_i\kpth$ by

642:        applying (\ref{eq:upmycdx}) iteratively to $\mycdx_i\kth$.

643:  \item Projection of hyperplane:

644:        Calculate $\mycda_i$, $\mycdb_i$ and $\mycdf_0$ based on

645:        $\mycdx_i\kpth$ by

646:        a certain projection method from $a_i\kth$, $\bib_i\kth$ and $f_0\kth$

647:        based on $\mycdx_i\kth$ (sec.\ref{sec:proj}).

648:  \item QP step: Solve the QP problem (\ref{eq:qp})

649:        with respect to $\mylag_i$.

650:  \item Parameter update:

651:        Calculate $a_i\kpth$, $\bi{b}_i\kpth$ and $f_0\kpth$ by

652:        (\ref{eq:newab}) and (\ref{eq:newf}).

653: \end{enumerate}

654: The discriminant function at the $k$-th step is given by (\ref{eq:f}).

655: \par\smallskip

656: \hrule

657: \bigskip

658:

659: Although Algorithm 1 does not always converge to the global minimum,

660: we can prove the following proposition concerning the convergence

661: of the algorithm.

662: \begin{proposition}

663: Equilibrium points of Algorithm 1 are critical points of the margin in

664:  the input space.

665: The algorithm is stable, when the update rule of $\mycdx_i$ (\ref{eq:upmycdx})

666:  is stable for all $i$ (see also Proposition 1).

667: \end{proposition}

668: This proposition can be proved basically from Proposition 1 and the fact that

669: the linearization of QP is almost exact under a small

670: perturbation of $\wt$.

671: As in the case of (\ref{eq:upmycdx}), we can modify the algorithm by

672: slowing down in (\ref{eq:d}) and (\ref{eq:upmycdx}) so that

673: the equilibrium state is stable when and only when the margin

674: is locally optimal.

675: However, we do not use it in the simulation because an unstable

676: local minimum is expected to be rare.

677:

678: Another problem of Algorithm 1 is that each iteration step does not

679: always increase the margin monotonically.

680: Although it is usually faster than gradient type algorithms,

681: the algorithm sometimes does not improve the solution of the original

682: SVM at all.

683: Because the original SVM can be seen as a special case of the algorithm,

684: we can use some annealing technique, for example, updating temporal

685: variables and parameters more gradually from their initial values.

686: However, for simplicity, we use a crude method in the simulation

687: as follows: Repeat several

688: steps of the algorithm (5 steps in the simulation) and then choose

689: the best solution which gives the largest estimated value of the margin.

690:

691: As for the complexity of the algorithm, we need $O(m^2 n^2)$ space

692: and $O(m^3 n^2)$ time complexity to calculate temporary variables

693: if the computation of a kernel function is $O(m)$,

694: while the original SVM requires $O(n^2)$ space and $O(m n^2)$ time.

695: Those calculations can be parallelized easily.

696: The complexity is not so different when $m$ is comparatively small.

697: Once the variables are calculated, the complexity for QP is just the same.

698: Therefore, as long as the time to calculate the temporary variables

699: is comparable to the QP time,

700: the cost of the proposed algorithm is comparable to that of the original SVM.

701: If Algorithm 1 is too expensive because $m$ is large, we can use

702: a simplified algorithm as shown in section \ref{sec:simple}.

703:
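
To give a rough feeling for these orders, take the setting of
Section \ref{sec:simulation} as an illustration, assuming $m=2$ input
dimensions and $n=20$ training samples and ignoring all constant factors:
\[
 m^2n^2 = 1600 \ \hbox{vs.}\ n^2 = 400, \qquad
 m^3n^2 = 3200 \ \hbox{vs.}\ mn^2 = 800 .
\]
In both space and time the preparatory computation exceeds that of the
original SVM by a factor of $m^2$ (here $4$), so the overhead is modest for
small $m$ but grows quickly with the input dimension.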

As for the iteration of the QP, which is usually carried out for only a few steps,
the current solution is an estimate of the next solution,
so it may be possible to use it to reduce the complexity
of the QP at the next iteration step.

\section{Simulation results}
\label{sec:simulation}

In this section, we give simulation results for
artificial data sets in order to verify the proposed algorithm
and to examine its basic performance.
Twenty training samples and 1000 test samples are randomly drawn from
positive and negative distributions, each of which is a
Gaussian mixture of 3 components with
centers uniformly distributed in $[0,1)^2$ and a
fixed spherical variance $\sigma^2=0.2^2$.
The kernel function used here is a spherical Gaussian kernel with
$\sigma^2=1^2$.
The metric is taken to be Euclidean (i.e., $G_i$ is the unit matrix).
Figures \ref{fig:svm} and \ref{fig:alg1}
show an example of the results of the original SVM
(initial condition) and the proposed algorithm (after 5 steps).
In this case, the margin value increases from 0.040 to 0.096.
This simulation is repeated for 100 sets of samples generated with different random
numbers.
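
For concreteness, the setting can be written as follows.
This is a sketch under assumptions that the text does not state explicitly:
equal mixing weights for the three components and the usual parametrization
of the Gaussian kernel; the symbols $\bi{\mu}_c^{(y)}$ and $\sigma_{\knl}$
are introduced only here.
\[
 p(\bix\mid y) = {1\over 3}\sum_{c=1}^{3}
  N\bigl(\bix;\, \bi{\mu}_c^{(y)},\, 0.2^2 I\bigr), \qquad
 \bi{\mu}_c^{(y)} \sim U\bigl([0,1)^2\bigr),
\]
\[
 \knl(\bix,\bix') = \exp\Bigl(-{\|\bix-\bix'\|^2\over 2\sigma_{\knl}^2}\Bigr),
 \qquad \sigma_{\knl}=1 .
\]
In the example of Figures \ref{fig:svm} and \ref{fig:alg1}, the ratio of the
margins is $0.096/0.040 = 2.4$.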

The estimated margins
in the input space for the original and proposed
algorithms are shown in Figure \ref{fig:margin} (log-log scale).
With the crude method described in the
previous section, the margin of the original SVM is not improved
in 4 of the 100 runs. The ratios of the margins are distributed
from 1.00 (no improvement) to 27.9.

The misclassification errors
for the test samples are shown in Figure \ref{fig:error}.
The ratios of the errors are distributed between 0.40 (best) and 1.37 (worst).

These results indicate that enlarging the margin in the input space
is effective in improving the generalization performance on average, but
there are cases in which the generalization error is not reduced
even when the margin in the input space increases.

\begin{figure}[tbhp]
 \includegraphics[width=.8\textwidth]{origsvm-r.eps}
\caption{Result of the original SVM (margin 0.040).
Circles ($\circ$) and crosses ($\times$) are positive and negative
samples. Squares ($\Box$) represent estimates of the projection
of the points by applying (\ref{eq:upmycdx}) for 10 steps.}
\label{fig:svm}
%
\end{figure}

\begin{figure}[tbhp]
 \includegraphics[width=.8\textwidth]{5step-r.eps}
\caption{Result of Algorithm 1 (after 5 steps, margin 0.096)
 for the same data set as Fig.~\ref{fig:svm}.}
\label{fig:alg1}
%
\par\bigskip
\end{figure}

\begin{figure}[tbhp]
 \includegraphics[width=.8\textwidth]{mar-r2.eps}
\caption{Margin comparison with the original SVM for 100 runs
 (log-log scale)}
\label{fig:margin}
\end{figure}

\begin{figure}[tbhp]
 \includegraphics[width=.8\textwidth]{err-r2.eps}
\caption{Test error comparison with the original SVM for 100 runs}
%
\label{fig:error}
%
 \par\bigskip
\end{figure}

\section{Soft margin}
\label{sec:soft}

In noisy situations, the hard margin classifier often overfits
the samples.
There are several possibilities for incorporating a soft margin;
here we give a simple one.
The soft margin can be derived by introducing slack variables $z_i$
into the optimization problem.
We use a soft constraint of the form
\begin{equation}
\label{eq:constraint5}
 \inner{\wt}{[y_i\ \{\phi(\mycdx_i) -
\bipsi(\mycdx_i)\transpose\mycdd_i
 \}-\mycdeta_i]}\ge \mycdg_i-f_0 y_i - z_i,
\end{equation}
and add a penalty for the slack variables to the objective function,
\begin{equation}
 {1\over2}\inner{\wt}{\wt} + C\sum_i z_i.
\end{equation}

By this modification, only the constraint (\ref{eq:constrainta}) for
$\mylag_i$ is changed to
\begin{equation}
 0\le\mylag_i\le C, \qquad \sum_i y_i\mylag_i = 0,
\end{equation}
which is the same constraint as in the soft margin of the original SVM.
However, the geometrical meaning of (\ref{eq:constraint5}) in the input space
is not clear. Introducing a natural soft constraint
in the input space is left for future work.
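
The box constraint $0\le\mylag_i\le C$ can be checked by the standard
Lagrangian argument. The following is a sketch under the usual assumption
$z_i\ge 0$, which the text leaves implicit; the multipliers $\mu_i$ and the
shorthand $s_i$ are introduced only for this illustration.
Let $s_i$ denote the left-hand side of (\ref{eq:constraint5}) minus
$\mycdg_i - f_0 y_i$, so that the constraint reads $s_i + z_i \ge 0$. Then
\[
 L = {1\over2}\inner{\wt}{\wt} + C\sum_i z_i
   - \sum_i \mylag_i (s_i + z_i) - \sum_i \mu_i z_i,
 \qquad \mylag_i\ge 0,\ \mu_i\ge 0,
\]
\[
 {\partial L\over\partial z_i} = C - \mylag_i - \mu_i = 0
 \quad\Longrightarrow\quad 0\le \mylag_i = C-\mu_i \le C .
\]
Since $z_i$ enters $L$ only linearly, all terms in $z_i$ cancel at this
stationary point and the rest of the dual problem is unchanged.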

\section{Simplified algorithm for a high-dimensional case}
\label{sec:simple}

Although Algorithm 1 gives a precise solution, its computational
cost is high when the dimensionality of the inputs is large.
In this section, we give a simplified algorithm.

If we do not update $\mycdx_i$, the first and second steps of Algorithm 1
are no longer necessary. This makes Algorithm 1
a little simpler because all $\mycdd_i$ terms vanish.
However, let us consider a further simplification.

We have shown the relation to the original SVM:
the original SVM is obtained by setting $\mycdg_i=1$ and $\mycdeta_i=0$.
Since $\mycdeta_i$ requires many temporary variables,
we keep only $\mycdg_i$ and fix $\mycdeta_i=0$.
Then all the terms related to the $\mycdb_i$ vanish.

Consequently,
the above simplifications lead to an algorithm much like the original
SVM. In fact, existing code for the original SVM can be used as follows.

At each step, $\mycdg_i$ is first calculated as
\begin{equation}
 \mycdg_i = {\|\sum_j a_j\kth\kx(\bix_i,\bix_j)\|\sinorm\over
  \sqrt{\sum_{j,k}a_j\kth a_k\kth \knl(\bix_j,\bix_k)}}.
\end{equation}
Then, by letting the $(i,j)$ element of the kernel matrix be
$\knl(\bix_i,\bix_j) / (\mycdg_i\mycdg_j)$, the original SVM for this
kernel matrix gives the solution for each step of the simplified algorithm.
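
The rescaling of the kernel matrix can be read as a change of feature map,
which suggests why existing SVM code applies without modification.
This is a sketch assuming $\mycdg_i>0$ for all $i$; the notation
$\tilde\knl_{ij}$ and $\tilde\phi_i$ is introduced only here.
\[
 \tilde\knl_{ij} = {\knl(\bix_i,\bix_j)\over \mycdg_i\mycdg_j}
  = \inner{\tilde\phi_i}{\tilde\phi_j}, \qquad
 \tilde\phi_i = {\phi(\bix_i)\over\mycdg_i}.
\]
Hence the rescaled matrix is again a positive semidefinite Gram matrix, and
for $\mycdg_i=1$ it reduces to the kernel matrix of the original SVM.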

\section{Conclusion}
We have proposed a new learning algorithm to find a kernel-based
classifier that maximizes the margin in the input space.
The derived algorithm consists of an alternating optimization between
the foot of the perpendicular and the linear coefficient parameters.
Such a dual structure appears in other frameworks, such as
the EM algorithm, variational Bayes, and principal curves.

There are many issues to be studied about the algorithm, for example,
analyzing the generalization performance theoretically and
finding an efficient algorithm that reduces the complexity and
converges more stably.
It is also an interesting issue to extend our framework to
problems other than classification, such as regression~\cite{akaho,otsu,mueller}.

In this paper, we have assumed that the kernel function is given and fixed.
Recently, many techniques and criteria for choosing a kernel function
have been proposed. We expect that
those techniques and much other knowledge about the original SVM
can be incorporated into our framework.
Applying the algorithm to real-world data is also important.

\begin{thebibliography}{12}
 \bibitem{akaho}
 S. Akaho, Curve fitting that minimizes the mean square of
	 perpendicular distances from sample points, {\it SPIE Vision Geometry
	 II} (also found in {\it Selected SPIE Papers on CD-ROM},
	 8, 1999), pp. 237--244 (1993)

 \bibitem{amari}
 S. Amari, {\it Differential Geometrical Methods in
	 Statistics}, Springer-Verlag (1984)

 \bibitem{cortes}
 C. Cortes and V.N. Vapnik, Support-vector networks,
	 {\it Machine Learning}, 20, pp. 273--297 (1995)

 \bibitem{decoste}
 D. DeCoste and B. Sch\"olkopf, Training invariant
	 support vector machines, {\it Machine Learning}, 46(1),
	 pp. 161--190 (2002)

 \bibitem{hastie}
 T. Hastie and W. Stuetzle, Principal curves,
	 {\it Journal of the American Statistical Association}, 84(406),
	 pp. 502--516 (1989)

 \bibitem{jaakkola}
 T.S. Jaakkola and D. Haussler, Exploiting generative
	 models in discriminative classifiers, {\it NIPS 11},
	 pp. 487--493 (1998)

 \bibitem{mueller}
 K.R. M\"uller, S. Mika, G. R\"atsch, K. Tsuda,
	 B. Sch\"olkopf, An Introduction to Kernel-Based Learning
	 Algorithms, {\it IEEE Trans. on Neural Networks}, 12,
	 pp. 181--201 (2001)

 \bibitem{otsu}
 N. Otsu, Karhunen-Loeve line fitting and a linearity
	 measure, in {\it IEEE Proc. of ICPR'84}, pp. 486--489 (1984)

 \bibitem{ramsey}
 J.O. Ramsay and B.W. Silverman, {\it Functional Data Analysis},
	 Springer-Verlag (1997)

 \bibitem{simard}
 P.Y. Simard, Y.A. Le Cun, J.S. Denker, B. Victorri,
	 Transformation Invariance in Pattern Recognition -- Tangent
	 Distance and Tangent Propagation, in {\it Neural Networks:
	 Tricks of the Trade}, G. Orr and K.-R. M\"uller, eds.,
	 Springer-Verlag, vol. 1524, pp. 239--274 (1998)

 \bibitem{tsuda}
 K. Tsuda, M. Kawanabe, G. R\"atsch, S. Sonnenburg, K.R. M\"uller,
	 A New Discriminative Kernel from Probabilistic Models,
	 {\it NIPS 14} (2001)

 \bibitem{vapnik}
 V.N. Vapnik, {\it The Nature of Statistical
	 Learning Theory}, Springer-Verlag (1995)
\end{thebibliography}

\end{document}