cs0211006/arxiv.tex
1: \documentclass[12pt]{article} 
2: \usepackage{graphicx,latexsym}
3: \newcommand{\bi}[1]{\mbox{\boldmath $#1$}}
4: 
5: \newtheorem{proposition}{Proposition}
6: \newtheorem{example}{Example}
7: 
8: %\newcommand{\mycal}{\mathsf}
9: \newcommand{\mylag}{\alpha}
10: \newcommand{\mycal}{\mathcal}
11: \newcommand{\inner}[2]{#1\cdot#2}
12: \newcommand{\wt}{\omega}
13: \newcommand{\wtsvm}{\wt_{\mathrm{SVM}}}
14: \newcommand{\fsvm}{f_{\mathrm{SVM}}}
15: \newcommand{\myft}[1]{{#1}^*}
16: \newcommand{\myftx}{\myft{\bix}}
17: \newcommand{\svmft}[1]{{#1}^\dag}
18: \newcommand{\mycd}[1]{\hat{#1}}
19: \newcommand{\transpose}{^{\mathsf T}}
20: \newcommand{\zeroth}{^{(0)}}
21: \newcommand{\myraw}{^{\mathrm{raw}}}
22: \newcommand{\kth}{^{(k)}}
23: \newcommand{\kpth}{^{(k+1)}}
24: \newcommand{\knl}{\mathrm{k}}
25: \newcommand{\knlmat}{C}
26: \newcommand{\knlmatx}{D}
27: \newcommand{\knlvec}{\bi{c}}
28: \newcommand{\kx}{\mathbf{k}_x}
29: \newcommand{\kxy}{\mathrm{K}_{xy}}
30: \newcommand{\bia}{\bi{a}}
31: \newcommand{\bib}{\bi{b}}
32: \newcommand{\bid}{\bi{d}}
33: \newcommand{\bix}{\bi{x}}
34: \newcommand{\biy}{\bi{y}}
35: \newcommand{\bipsi}{\bi{\psi}}
36: \newcommand{\bieps}{\bi{\varepsilon}}
37: \newcommand{\mycdd}{\mycd{\bid}}
38: \newcommand{\mycdx}{\mycd{\bix}}
39: \newcommand{\mycdw}{\mycd{\wt}}
40: \newcommand{\mycda}{\mycd{a}}
41: \newcommand{\mycdb}{\mycd{\bib}}
42: \newcommand{\mycdg}{\mycd{g}}
43: \newcommand{\mycdeta}{\mycd{\eta}}
44: \newcommand{\mycdp}{\mycd{p}}
45: \newcommand{\mycdq}{\mycd{\bi{q}}}
46: \newcommand{\mycdr}{\mycd{r}}
47: \newcommand{\mycds}{\mycd{s}}
48: \newcommand{\mycdt}{\mycd{t}}
49: \newcommand{\mycdu}{\mycd{u}}
50: \newcommand{\mynew}{^{\mathrm{new}}}
51: \newcommand{\myold}{^{\mathrm{old}}}
52: \newcommand{\myprev}{^{[l]}}
53: \newcommand{\mynext}{^{[l+1]}}
54: \newcommand{\mycdf}{\mycd{f}}
55: \newcommand{\gnorm}{_{G_i}^2}
56: \newcommand{\sgnorm}{_{G_i}}
57: \newcommand{\inorm}{_{G_i^{-1}}^2}
58: \newcommand{\jnorm}{_{G_j^{-1}}^2}
59: \newcommand{\sinorm}{_{G_i^{-1}}}
60: \hyphenation{di-men-sion-al}
61: \title{Maximizing the Margin in the Input Space}
62: 
63: \author{
64: Shotaro Akaho \\
65: AIST Neuroscience Research Institute\\
66: 1--1 Central 2, Umezono, Tsukuba 3058568 Japan \\
67: {\texttt{s.akaho@aist.go.jp}}}
68: 
69: \begin{document}
70: 
71: \maketitle
72: 
73: \begin{abstract}
74:  We propose a novel criterion for support vector machine learning:
75:  maximizing the margin in the input space, not in the feature (Hilbert) space. 
76:  This criterion is a discriminative version of the principal curve
77:  proposed by Hastie et al.
78:  The criterion is appropriate in particular when the input space is
79:  already a well-designed feature space with rather small dimensionality.
80:  The definition of the margin is generalized
81:  in order to represent prior knowledge.
82:  The derived algorithm consists of two alternating steps to estimate the
83:  dual parameters.
84:  First, the parameters are initialized by the original SVM.
85:  Then one set of parameters is updated by a Newton-like procedure, and
86:  the other set is updated by solving a quadratic programming problem.
87:  The algorithm converges in a few steps to a local optimum under mild
88:  conditions, and it preserves the sparsity of support vectors.
89:  Although the cost of calculating temporary variables increases,
90:  the cost of solving the quadratic programming problem at each step
91:  does not change.
92:  It is also shown that the original SVM can be seen as a special case.
93:  We further derive a simplified algorithm which enables us to use
94:  the existing code for the original SVM.
95: \end{abstract}
96: 
97: \section{Introduction}
98: The support vector machine (SVM) is known as one of the state-of-the-art
99: methods, especially for pattern recognition
100: \cite{cortes,mueller,vapnik}.
101: The original SVM maximizes the margin which is 
102: defined by the minimum distance between samples 
103: and a separating hyperplane in a Hilbert space $\mycal H$. 
104: Even when the dimensionality of $\mycal H$ is very large,
105: it has been proved that the original SVM has
106: a bound for a generalization error
107: which is independent of the dimensionality.
108: In practice, however, 
109: the original SVM sometimes gives a very small margin in the input
110: space, because the metric of the feature space is usually quite different from
111: that of the input space.
112: Such a situation is undesirable, in particular when the input space
113: is already a well-designed feature space built using prior
114: knowledge~\cite{amari,decoste,jaakkola,simard,tsuda}.
115: 
116: This paper gives a learning algorithm to maximize the
117: margin in the input space.
118: One difficulty is obtaining an explicit form of the
119: margin in the input space, because the classification boundary is curved and
120: the orthogonal projection from a sample point to the boundary is not
121: always unique. We solve this problem by linear approximation
122: techniques.  The derived algorithm basically alternates between
123: two stages:
124: one estimates the projection points, and the other
125: solves a quadratic programming problem to find optimal parameter values.
126: 
127: Such a dual structure appears in other frameworks, such as
128: the EM algorithm and variational Bayes.
129: More closely related work is the principal curve proposed by
130: Hastie et al.~\cite{hastie}. The principal curve finds a curve in the `center'
131: of the points in the input space.
132: 
133: The derived algorithm is not of gradient-descent type but Newton-like;
134: hence we have to investigate its convergence property.
135: It is shown that the derived
136: algorithm does not always converge to the global optimum, but
137: it converges to a local optimum under mild conditions.
138: Some interesting relations to the original SVM are also shown:
139: the original SVM can be seen as a special case of the algorithm,
140: and the number of support vectors does not increase much compared with the
141: original SVM.
142: The algorithm is verified through simple simulations.
143: 
144: \section{Generalized margin in the input space}
145: 
146: We consider a binary classification problem.
147: The purpose of learning is to construct a map from an $m$-dimensional input
148: $\bix\in{\Re}^m$ to a corresponding output $y\in\{\pm1\}$ by using
149: a finite number of samples $(\bix_1,y_1),\ldots,(\bix_n,y_n)$.
150: 
151: Let us consider a linear classifier, 
152: $y=\mbox{sgn}[f(\bix)]$, where
153: $f(\bix) \equiv \inner{\wt}{\phi(\bix)} + f_0$; 
154: $\phi(\bix)$ is a feature of an input $\bix$ in 
155: a Hilbert space $\mycal H$,
156: $\wt\in \mycal H$ is a weight parameter
157: and $f_0\in \Re$ is a bias parameter.
158: Those parameters $\wt$ and $f_0$ define a separating hyperplane in the
159: feature space. 
160: As a feature function $\phi(\bix)$, we only consider a differentiable
161: nonlinear map.
162: 
163: A margin in the input space is defined by the minimum distance from sample
164: points to the classification boundary in the input space.
165: Since the classification boundary forms a complex curved surface,
166: the distance cannot be obtained in an explicit form, and more
167: significantly, a projection from a point to the boundary is not unique.
168: 
169: Here, the metric in the input space need not be Euclidean.
170: Some Riemannian metric $G(\bix)$ may be defined, which
171: enables us to represent many kinds of prior knowledge.
172: For example, the invariance of patterns\cite{mueller,simard} can be implemented
173: in this form.
174: Another example is that 
175: the Fisher information matrix is a natural metric
176: when the input space is a parameter space
177: of some probability distribution~\cite{amari,jaakkola}.
178: Although it is theoretically preferable to measure the distance by
179: the length of a geodesic in the Riemannian space,
180: doing so is computationally difficult.
181: In our formulation, since we only need a distance from a sample point to
182: another point, we use a computationally feasible (nonsymmetric) distance
183: from a sample point $\bix_i$ to another point $\bix$ in the quadratic norm,
184: \[
185: \|\bix-\bix_i\|\gnorm =
186:   (\bix-\bix_i)\transpose G_i(\bix-\bix_i),
187: \]
188: where $G_i\equiv G(\bix_i)$.
189: 
190: For simplicity, we mainly consider the hard margin case, in which
191: sample points are separable by a hyperplane in the Hilbert space.
192: The soft margin case is discussed in section \ref{sec:soft}.
193: 
194: Let $\myftx_i$ be the closest point on the boundary
195: surface from a sample point $\bix_i$, and
196: $\bid_i \equiv \myftx_i - \bix_i$.
197: Since $\bid_i$ is invariant under a rescaling of $(\wt,f_0)$,
198: we can assume that all points are separated and satisfy
199: \begin{equation}
200:   \label{eq:constraint}
201:   \|\bid_i\|\gnorm \ge {1/\inner{\wt}{\wt}},\quad i=1,\cdots,n.
202: \end{equation}
203: If we assume that at least one of these holds with equality,
204: the margin is given by $1/\sqrt{\inner{\wt}{\wt}}$.
205: Then we can find the optimal parameter by minimizing
206: the quadratic objective function $\inner{\wt}{\wt}$
207: under the constraints (\ref{eq:constraint}) and $y_i f(\bix_i) > 0$.
208: 
209: In order to solve the optimization problem, we start from a solution
210: of the original SVM and update the solution iteratively.
211: By two kinds of linearization techniques and a kernel trick,
212: which are described in the next section, we obtain
213: a discriminant function at the $k$-th iteration step in the form of
214: \begin{equation}
215: \label{eq:f}
216:  f(\bi{x})=\sum_{i\in \mathrm{S.V.}} \{a_i\kth \knl(\mycdx_i\kth,\bix) +
217:   \bib_i\kth{}\transpose \kx(\mycdx_i\kth, \bix)\} + f_0\kth,
218: \end{equation}
219: where S.V. is a set of indices of support vectors,
220: $\knl(\bix,\biy)$ is a kernel function and $\kx(\bix,\biy)$ is its
221: derivative defined by $\kx(\bix,\biy)\equiv {\partial
222: \knl(\bix,\biy)/\partial\bix}$.
223: We have two groups of parameters here: one consists of $a_i$, $\bib_i$ and $f_0$,
224: which are linear coefficients, and the other
225: consists of $\mycdx_i$, which are estimates of
226: the projection points $\myftx_i$ and define the basis functions.
227: The parameters $a_i$ and $f_0$ are initialized by the corresponding parameters of the 
228: original SVM, and the other parameters are initialized by
229: $\bib_i=\mathbf0$, $\mycdx_i=\bix_i$.
230: 
231: \section{Iterative QP by linear approximations}
232: In this section, we outline the derivation of the update rules for
233: those parameters. The resulting algorithm is summarized in sec.~\ref{sec:overall}.
234: 
235: \subsection{Linear approximation of the distance to the boundary}
236: \label{sec:d}
237: Given an estimated projection point $\mycdx_i$,
238: we can obtain an approximate distance $\|\bid_i\|\sgnorm$
239: by a linear approximation~\cite{akaho}.
240: Taking the Taylor expansion of
241: $f(\myftx_i)=0$ around $\mycdx_i$
242: up to the first order,
243: we obtain a constraint on $\bid_i$,
244: \[
245:  f(\mycdx_i) + 
246:  \nabla f(\mycdx_i)\transpose (\bid_i - \mycdd_i) = 0,
247: \]
248: where $\mycdd_i = \mycdx_i-\bix_i$.
249: Minimizing $\|\bid_i\|\gnorm$ under this constraint,
250: we have
251: \begin{equation}
252: \label{eq:d}
253: \|\bid_i\|\gnorm = {(\inner{\wt}{\{\phi(\mycdx_i) -
254:  \bipsi(\mycdx_i)\transpose\mycdd_i \}}+f_0)^2\over 
255: \|\inner{\wt}{\bipsi(\mycdx_i)}\|\inorm},
256: \end{equation}
257: where $\bipsi(\mycdx_i)\equiv
258: \nabla \phi(\mycdx_i)\in {\mycal H}^m$.
259: Note that this approximate value is unique, and it is invariant under a
260: rescaling of
261: $(\wt,f_0)$.
262: Moreover, the approximation is exact when $\mycdx_i=\myftx_i$
263: and $\nabla f(\myftx_i)\ne 0$.
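
As a brief sketch of the step behind (\ref{eq:d}) (the shorthand $\bi{v}_i$, $e_i$ and the multiplier $\mu$ are notation introduced here only for illustration), write the linearized constraint as $\bi{v}_i\transpose\bid_i = e_i$ with $\bi{v}_i\equiv\nabla f(\mycdx_i)$ and $e_i\equiv\bi{v}_i\transpose\mycdd_i - f(\mycdx_i)$. Setting the derivative of the Lagrangian to zero gives
\[
 {\partial\over\partial\bid_i}\left\{\|\bid_i\|\gnorm
  - 2\mu(\bi{v}_i\transpose\bid_i - e_i)\right\} = 0
 \quad\Longrightarrow\quad
 \bid_i = \mu\, G_i^{-1}\bi{v}_i,\qquad
 \mu = {e_i\over \|\bi{v}_i\|\inorm},
\]
so that $\|\bid_i\|\gnorm = e_i^2/\|\bi{v}_i\|\inorm$. Substituting $f(\mycdx_i)=\inner{\wt}{\phi(\mycdx_i)}+f_0$ and $\bi{v}_i=\inner{\wt}{\bipsi(\mycdx_i)}$ recovers (\ref{eq:d}). As a check, for a linear feature map $\phi(\bix)=\bix$ (an assumption made only for this example) the numerator reduces to $f(\bix_i)^2$ and the denominator to $\wt\transpose G_i^{-1}\wt$, which for $G_i=I$ is the familiar squared point-to-hyperplane distance $f(\bix_i)^2/\|\wt\|^2$.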
264: 
265: \subsection{Linearization of the constraint}
266: \label{sec:qp}
267: Using the approximate value of the distance, we have a nonlinear
268: constraint, 
269: \begin{equation}
270: \label{eq:NLconst}
271:  y_i\left[\inner{\wt}{\{\phi(\mycdx_i) -
272:  \bipsi(\mycdx_i)\transpose\mycdd_i \}}+f_0\right]
273:   \ge {\|\inner{\wt}{\bipsi(\mycdx_i)}\|\sinorm\over\sqrt{\inner{\wt}{\wt}}}.
274: \end{equation}
275: Since the constraint is nonlinear in $\wt$, we linearize it around
276: an approximate solution $\wt=\mycdw$, which is the solution at
277: the current step.
278: This linearization not only simplifies the problem, but
279: also enables us to derive a dual problem.
280: 
281: Let $g_i(\wt)$ be the right-hand side of (\ref{eq:NLconst});
282: its first-order expansion is 
283: \[
284:   g_i(\wt) \simeq g_i(\mycdw) +
285:    \inner{\left({\partial g_i(\mycdw)/\partial\wt}\right)}{(\wt-\mycdw)}.
286: \]
287: Now let $\mycdg_i \equiv g_i(\mycdw),
288:  \mycdeta_i \equiv {\partial g_i(\mycdw)/\partial\wt}$,
289: then we have a linear constraint for $\wt$,
290: \begin{equation}
291: \label{eq:constraint3}
292:  \inner{\wt}{[y_i\ \{\phi(\mycdx_i) -
293: \bipsi(\mycdx_i)\transpose\mycdd_i
294:  \}-\mycdeta_i]}\ge \mycdg_i- f_0 y_i,
295: \end{equation}
296: where we used the fact that $\inner{\mycdw}{\mycdeta_i}=0$.
297: Let $\mycdq_i \equiv \inner{\mycdw}{\bipsi(\mycdx_i)}$ and
298: $\mycdr \equiv \inner{\mycdw}{\mycdw}$;
299: then $\mycdg_i$ and $\mycdeta_i$ are given by
300: \begin{eqnarray}
301: \label{eq:h}
302:  \mycdg_i &=& {1\over \sqrt{\mycdr}}\|\mycdq_i\|\sinorm,\nonumber\\
303:  \mycdeta_i 
304:    &=& {1\over \mycdg_i \mycdr} \left\{\mycdq_i\transpose G_i^{-1}
305:     \bipsi(\mycdx_i) -{1\over\mycdr}\|\mycdq_i\|\inorm\mycdw\right\}.
306: \end{eqnarray}
307: By the above linearization, we can derive the dual problem
308: in a similar way to the original SVM,
309: \begin{eqnarray}
310: \lefteqn{W(\bi{\mylag}) = \sum_i \mycdg_i\mylag_i} \nonumber\\
311: && -{1\over2}
312:   \sum_{i,j}\mylag_i\mylag_j [y_i \{\phi(\mycdx_i) -
313: \bipsi(\mycdx_i)\transpose\mycdd_i
314:  \}-\mycdeta_i]\cdot[y_j \{\phi(\mycdx_j) -
315:  \bipsi(\mycdx_j)\transpose \mycdd_j
316:  \}-\mycdeta_j], \nonumber
317: \end{eqnarray}
318: which is maximized under the constraints $\mylag_i\ge0$
319: and $\sum_i\mylag_i y_i = 0$.
320: The solution $\wt$ is given by 
321: \begin{equation}
322: \label{eq:wt}
323: \wt = \sum_i \mylag_i [y_i \{\phi(\mycdx_i) -
324: \bipsi(\mycdx_i)\transpose\mycdd_i
325:  \}-\mycdeta_i].
326: \end{equation}
327: Here we can see an apparent relation to the original SVM, i.e.,
328: by letting $\mycdx_i=\bix_i$, $\mycdeta_i=0$, and $\mycdg_i=1$,
329: we have exactly the same optimization problem as the original SVM.
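
To make this reduction explicit (a short check that uses only the definitions above): with $\mycdx_i=\bix_i$ we have $\mycdd_i=\mathbf0$, and substituting $\mycdeta_i=0$ and $\mycdg_i=1$ into the dual objective gives
\[
 W(\bi{\mylag}) = \sum_i \mylag_i
  -{1\over2}\sum_{i,j}\mylag_i\mylag_j y_i y_j\,
   \inner{\phi(\bix_i)}{\phi(\bix_j)},
\]
while (\ref{eq:wt}) becomes $\wt=\sum_i \mylag_i y_i \phi(\bix_i)$. These are the usual hard-margin SVM dual and weight expansion.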
330: 
331: \subsection{Kernel trick}
332: 
333: In order to avoid the calculation of the mapping into a high-dimensional
334: Hilbert space, the SVM applies a kernel trick, by which
335: an inner product is replaced by a symmetric positive definite
336: kernel function (Mercer kernel) that is easy to
337: calculate~\cite{ramsey,cortes,mueller,vapnik}.
338: In our formulation, 
339: $\inner{\phi(\bix)}{\phi(\biy)}$ is replaced by a Mercer kernel
340: $\knl(\bix,\biy)$.
341: We also have to calculate the inner product
342: related to $\bipsi$ (the derivative of $\phi$).
343: Let us assume that the kernel function $\knl$ is differentiable.
344: Then, $\inner{\bipsi(\bix)}{\phi(\biy)}$
345: is replaced by a vector
346: $\kx(\bix,\biy)\equiv {\partial \knl(\bix,\biy)/\partial\bix}$,
347: and $\inner{\bipsi(\bix)}{\bipsi(\biy)\transpose}$
348: is replaced by a matrix
349: $\kxy(\bix,\biy)
350: \equiv {\partial^2 \knl(\bix,\biy)/\partial\bix\partial\biy\transpose}$.
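
As an illustration, consider the spherical Gaussian kernel $\knl(\bix,\biy)=\exp\{-\|\bix-\biy\|^2/(2\sigma^2)\}$, where the kernel width $\sigma$ is notation assumed for this example. Direct differentiation gives
\[
 \kx(\bix,\biy) = -{\bix-\biy\over\sigma^2}\,\knl(\bix,\biy),
 \qquad
 \kxy(\bix,\biy) = {1\over\sigma^2}
  \left\{I - {(\bix-\biy)(\bix-\biy)\transpose\over\sigma^2}\right\}
  \knl(\bix,\biy).
\]
Both derivatives reuse the kernel value $\knl(\bix,\biy)$, so they can be evaluated together with it.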
351: 
352: Now we can derive the kernel version of the optimization problem.
353: In (\ref{eq:wt}), $\mycdeta_i\in \mycal H$ has bases related to
354: $\bipsi(\mycdx_i)$ and $\mycdw$,
355: and the solution $\wt$ additionally has bases $\phi(\mycdx_i)$.
356: Although $\mycdw$ can have any kind of bases, we restrict it
357: to the following form to avoid increasing the number of bases:
358: \[
359:  \mycdw=\sum_i \{\mycda_i \phi(\mycdx_i) +
360:   \mycdb_i\transpose \bipsi(\mycdx_i)\}.
361: \]
362: Then we have
363: $\mycdq_i = \sum_j \{ \mycda_j
364:   \kx(\mycdx_i, \mycdx_j) +
365:   \kxy(\mycdx_i,\mycdx_j)\mycdb_j
366:   \}$.
367: Now let 
368: \[
369:  \mycdp_i \equiv \inner{\mycdw}{\phi(\mycdx_i)} =
370:   \sum_j \{\mycda_j\knl(\mycdx_j,\mycdx_i) + \mycdb_j\transpose
371:   \kx(\mycdx_j,\mycdx_i)\},
372: \]
373: then $\mycdr$ is given by
374: $\mycdr = \sum_i (\mycda_i \mycdp_i + \mycdb_i\transpose\mycdq_i)$,
375: and $\mycdg_i$ by (\ref{eq:h}).
376: Further, let us define additional temporary variables
377: that represent several terms in the objective function,
378: \begin{eqnarray*}
379:  \mycds_{ij} &\equiv& \inner{\{\phi(\mycdx_i) -
380: \bipsi(\mycdx_i)\transpose\mycdd_i
381:  \}}{\{\phi(\mycdx_j) -
382:  \bipsi(\mycdx_j)\transpose \mycdd_j
383:  \}} \\
384:  &=& \knl(\mycdx_i,\mycdx_j)+\mycdd_i\transpose
385:   \kxy(\mycdx_i,\mycdx_j)\mycdd_j
386:   -\mycdd_i\transpose\kx(\mycdx_i,\mycdx_j)
387:   -\mycdd_j\transpose\kx(\mycdx_j,\mycdx_i), \\
388: \mycdt_{ij} &\equiv& \inner{\mycdeta_i}
389: {\{\phi(\mycdx_j) - \bipsi(\mycdx_j)\transpose\mycdd_j\}}
390: \\
391: &=& 
392: {1\over \mycdg_i \mycdr}\bigg\{\mycdq_i\transpose G_i^{-1}
393:  \left(\kx(\mycdx_i,\mycdx_j) - \kxy(\mycdx_i,\mycdx_j)\mycdd_j
394:   \right) 
395:   - {\|\mycdq_i\|\inorm\over \mycdr}(
396:   \mycdp_j - \mycdd_j\transpose\mycdq_j)
397:  \bigg\}, \\
398:  \mycdu_{ij} &=& \inner{\mycdeta_i}{\mycdeta_j} 
399:  =
400:  {1\over \mycdg_i \mycdg_j \mycdr^2}(\mycdq_i\transpose G_i^{-1}\kxy(\mycdx_i,\mycdx_j) G_j^{-1}\mycdq_j 
401:   -{\|\mycdq_i\|\inorm\|\mycdq_j\|\jnorm\over\mycdr}),
402: \end{eqnarray*}
403: then we have the objective function in a kernel form,
404: \begin{equation}
405: W(\bi{\mylag}) = \sum_i \mycdg_i\mylag_i
406:  -{1\over2}\sum_{i,j}\mylag_i\mylag_j (y_i y_j \mycds_{ij} - y_j \mycdt_{ij}-
407:  y_i \mycdt_{ji}+\mycdu_{ij}),
408: \label{eq:qp}
409: \end{equation}
410: which is maximized under constraints
411: \begin{equation}
412: \label{eq:constrainta}
413:  \mylag_i\ge0, \qquad \sum_i y_i\mylag_i = 0.
414: \end{equation}
415: 
416: The new parameters can be determined from (\ref{eq:wt}) by
417: \begin{eqnarray}
418: \label{eq:newab}
419:  a_i\kpth &=& \mylag_i y_i + \beta \mycda_i,\nonumber\\
420:  \bib_i\kpth &=& -\mylag_i\left(y_i\mycdd_i+ {G_i^{-1}\mycdq_i\over
421: 			 \mycdg_i \mycdr}\right) +\beta
422:  \mycdb_i,
423: \end{eqnarray}
424: where
425: $ \beta = \sum_j{\mylag_j\|\mycdq_j\|\inorm/(\mycdg_j\mycdr^2)}$.
426: 
427: As for the bias term $f_0$, since the constraint
428: (\ref{eq:constraint3}) should be satisfied with equality
429: for $i\in J=\{i\mid\mylag_i\ne0\}$ by
430: the Kuhn-Tucker condition, we have for any $i\in J$,
431: \begin{equation}
432: \label{eq:newf}
433:  f_0\kpth = y_i \mycdg_i -\sum_j \mylag_j
434:   (y_j \mycds_{ji} - \mycdt_{ji} - y_i y_j \mycdt_{ij} + y_i \mycdu_{ij}).
435: \end{equation}
436: 
437: From (\ref{eq:newab}), we can estimate the number of support vectors.
438: Let $J_k$ be the set of indices of nonzero $\mylag_i$'s at the $k$-th step; then
439: the number of support vectors is bounded above by
440: $|J_0\cup J_1 \cup \cdots \cup J_k|$. Since $J_k$ does not
441: change much as long as the structure of the classification boundary
442: is similar,
443: the number of support vectors is expected not to be much larger than that of
444: the original SVM.
445: 
446: \subsection{Update of the approximate projection of the points}
447: To complete the algorithm, we have to consider the update of the approximate value
448: of the projection point $\mycdx_i$, which is initialized by $\bix_i$; otherwise the solution at convergence is not exactly
449: what we want.
450: If good approximations $\mycdw$ and $\mycdf_0$ of
451: the solution are given, we can refine $\mycdx_i$
452: iteratively in the same way as in sec.~\ref{sec:d}.
453: Given $\mycdw=\sum_j \{\mycda_j \phi(\mycdx_j\myold) +
454: \mycdb_j\transpose \bipsi(\mycdx_j\myold)\}$,
455: the projection point $\mycdx_i$ can be estimated by iterating
456: the following step for $l=0,1,2,3,\cdots$,
457: \begin{equation}
458: \label{eq:upmycdx}
459:  \mycdx_i\mynext
460:    = \bix_i -
461:    {\mycdq_i\myprev\over\|\mycdq_i\myprev\|\inorm}
462:    \left[\mycdp_i\myprev
463:     - (\mycdx_i\myprev{}-\bix_i)\transpose
464:     \mycdq_i\myprev  + \mycdf_0\right],
465: \end{equation}
466: where $\mycdx_i^{[0]}$ is initialized by $\mycdx_i\myold$;
467: $\mycdp_i\myprev$ and $\mycdq_i\myprev$ are defined in a similar way as
468: $\mycdp_i$ and $\mycdq_i$,
469: \begin{eqnarray}
470:  \mycdp_i\myprev &\equiv& \inner{\mycdw}{\phi(\mycdx_i\myprev)} \nonumber\\
471:  &=&
472:   \sum_j \{\mycda_j\knl(\mycdx_j\myold,\mycdx_i\myprev) + \mycdb_j\transpose
473:   \kx(\mycdx_j\myold,\mycdx_i\myprev)\}, \nonumber \\
474:  \mycdq_i\myprev &\equiv&
475:   \inner{\mycdw}{\bipsi(\mycdx_i\myprev)}\nonumber\\
476:  &=&\sum_j \{ \mycda_j
477:   \kx(\mycdx_i\myprev, \mycdx_j\myold) +
478:   \kxy(\mycdx_i\myprev,\mycdx_j\myold)\mycdb_j
479:   \}.\nonumber
480: \end{eqnarray}
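
As a sanity check of (\ref{eq:upmycdx}), consider a worked special case that assumes the Euclidean metric $G_i=I$ and the linear feature map $\phi(\bix)=\bix$; neither assumption is required by the algorithm. Then $\mycdp_i\myprev=\mycdw\transpose\mycdx_i\myprev$ and $\mycdq_i\myprev=\mycdw$, the terms containing $\mycdx_i\myprev$ cancel inside the bracket, and the update reduces to
\[
 \mycdx_i\mynext = \bix_i -
  {\mycdw\transpose\bix_i + \mycdf_0 \over \|\mycdw\|^2}\,\mycdw ,
\]
which is the orthogonal projection of $\bix_i$ onto the hyperplane $\mycdw\transpose\bix+\mycdf_0=0$, reached in a single step from any starting point. For a nonlinear $\phi$ the quantities $\mycdp_i\myprev$ and $\mycdq_i\myprev$ depend on the current estimate, which is why the step has to be iterated.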
481: 
482: Note that local maxima and saddle
483: points of the distance are also equilibrium states
484: of (\ref{eq:upmycdx}). The following proposition guarantees that
485: such points are not stable.
486: \begin{proposition}
487: A point $\mycdx_i\in {\Re}^m$ is an equilibrium state of the
488:  iteration step (\ref{eq:upmycdx}) if and only if the point
489:  is a critical point of the distance from $\bix_i$ to the
490:  separating boundary, i.e.,
491:  a local minimum, a local maximum or a saddle point.
492:  The equilibrium state is not stable when the point is a
493:  local maximum or a saddle point.
494: \end{proposition}
495: \textit{Proof:}
496: It is straightforward to show that a point is
497: an equilibrium state of the iteration step (\ref{eq:upmycdx})
498: only when the point is a critical point of the distance
499: $\|\bid_i\|\gnorm$. Without loss of generality,
500: we can assume the uniform metric case $G_i=I$, because
501: the update rule (\ref{eq:upmycdx}) is invariant under a metric transformation.
502: We consider the behavior around a critical point $\myftx_i$.
503: Let $\mycdx_i\myprev=\myftx_i+\bieps$
504: for a sufficiently small vector $\bieps$.
505: One can show that $\mycdx_i\myprev$ is mapped onto the separating
506: hypersurface $f(\bix)=\inner{\mycdw}{\phi(\bix)}+\mycdf_0=0$
507: for a small $\bieps$ after one iteration step.
508: Therefore, we only consider the
509: case where $\mycdx_i\myprev$ is on the hypersurface.
510: 
511: Since $\myftx_i$ is a critical point
512: of the distance, the normal vector $\nabla f(\myftx_i)$ is
513: collinear with the displacement vector $\bid_i=\myftx_i-\bix_i$, i.e.,
514: for some constant $\lambda$, we have
515: \begin{equation}
516:  \nabla f(\myftx_i) = \lambda \bid_i.
517: \end{equation}
518: Furthermore, if $\mycdx_i\myprev$ is a point on $f(\bix)=0$,
519: $\nabla f(\myftx_i)$ is nearly orthogonal to $\bieps$,
520: i.e.,
521: \begin{equation}
522:  \nabla f(\myftx_i)\transpose \bieps \simeq 0.
523: \end{equation}
524: By expanding (\ref{eq:upmycdx}) around $\myftx_i$, we obtain
525: the new estimate $\mycdx_i\mynext$ as
526: \begin{equation}
527: \label{eq:mycdx}
528:  \mycdx_i\mynext \simeq \myftx_i
529:  + {1\over\lambda}\nabla^2 f(\myftx_i)\bieps
530:   - {\bid_i\transpose\nabla^2 f(\myftx_i)\bieps\over\lambda\|\bid_i\|^2}\bid_i,
531: \end{equation}
532: where $\nabla^2 f$ is the Hessian matrix of $f(\bix)$.
533: Without loss of generality, we can choose the coordinates of $\bix$ as
534: follows: the first coordinate is along the direction of $\bid_i$, and
535: the second to the $m$-th coordinates are taken orthogonally such that
536: the $(m-1)\times(m-1)$ submatrix of $\nabla^2 f(\myftx_i)$
537: for those coordinates is diagonal, i.e., $\nabla^2 f(\myftx_i)$
538: is of the form
539: \begin{equation}
540:  \nabla^2 f(\myftx_i) = \left(
541: \begin{array}{cccc}
542: c_1 & & \bi{b}\transpose & \\
543:  & c_2 & & 0 \\
544: \bi{b} & & \ddots & \\
545:  & 0 & & c_m \\
546: \end{array} \right).
547: \end{equation}
548: Under this coordinate system,
549: since $\varepsilon_1$ is of higher order in $\bieps$,
550: the first element of the sum of the second and third terms in (\ref{eq:mycdx})
551: vanishes and we have
552: \begin{equation}
553: \mycdx_i\mynext - \myftx_i \simeq {1\over\lambda}
554:  (0, c_2 \varepsilon_2,\ldots,c_m\varepsilon_m)\transpose.
555: \end{equation}
556: The iteration step is stable at $\myftx_i$ only when
557: $\|\mycdx_i\mynext-\myftx_i\|<\|\bieps\|$ for all sufficiently small $\bieps$, i.e.,
558: $|c_j|< |\lambda|$ for all $j=2,\ldots,m$. \hfill $\Box$
559: 
560: The condition in the 1-$j$ plane is shown in figure \ref{fig:stability}.
561: 
562: \begin{figure}[tbhp]
563:   \begin{center}
564:    \includegraphics[width=.8\textwidth]{stab.eps}
565:     \caption{Stability of projection point update}
566:     \label{fig:stability}
567:   \end{center}
568: \end{figure}
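
As a simple illustration of the stability condition $|c_j|<|\lambda|$ (a toy example constructed here, not one of the simulations), take $m=2$, $G_i=I$, a circular boundary $f(\bix)=\|\bix\|^2-R^2=0$, and a sample $\bix_i=(\rho,0)\transpose$ with $0<\rho<R$. Then $\nabla^2 f=2I$, so $c_2=2$ at every point of the boundary. At the nearest point $\myftx_i=(R,0)\transpose$ we have $\bid_i=(R-\rho,0)\transpose$ and $\nabla f(\myftx_i)=(2R,0)\transpose$, hence
\[
 \lambda = {2R\over R-\rho} > 2 = c_2,
\]
and the equilibrium is stable. At the farthest point $\myftx_i=(-R,0)\transpose$ we have $\bid_i=(-(R+\rho),0)\transpose$ and $\nabla f(\myftx_i)=(-2R,0)\transpose$, hence
\[
 \lambda = {2R\over R+\rho} < 2 = c_2,
\]
and the equilibrium is unstable. Consistently, applying (\ref{eq:upmycdx}) to a point on the circle multiplies a small tangential deviation by $c_2/\lambda=(R-\rho)/R<1$ in the first case and by $(R+\rho)/R>1$ in the second, to first order.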
569: 
570: When the point is a local maximum or a saddle point, the hypersurface is in the unstable
571: region. However, even in the case of a local minimum, there exists an
572: unstable region when the hypersurface is strongly curved.
573: We can avoid the undesired behavior by slowing down the update.
574: For example, first $c_2,\ldots,c_m$ and $\lambda$ are estimated from
575: the values of $\nabla f$ and $\nabla^2 f$ at the current estimate;
576: then, if $c_j < |\lambda|$
577: for all $j=2,\ldots,m$, so that the point is judged to be a local minimum,
578: the movement $\mycdx_i\mynext-\mycdx_i\myprev$
579: along each axis for which $c_j<-|\lambda|$ should be
580: shrunk by multiplying it by some factor $0 < e_j < |\lambda|/|c_j|$.
581: 
582: This computationally intensive
583: treatment would usually be necessary only
584: after several steps, because
585: the instability for local minima is expected to occur only in a small region
586: relative to the size of $\bid_i$.
587: 
588: \subsection{Projection of the hyperplane}
589: \label{sec:proj}
590: The update of $\mycdx_i$ causes another problem:
591: We assumed in section \ref{sec:qp}
592: that $\wt$ and $\mycdw$ have the same bases.
593: However, $\mycdw$ has bases based on the old $\mycdx_i$, while
594: we need the new $\wt$ based on the new $\mycdx_i$.
595: To solve this problem, $\mycdw$ is projected onto the new bases, i.e.,
596: from the old one
597: $\mycdw\myold=\sum_{i\in \mathrm{S.V.}}\{\mycda\myold_i
598: \phi(\mycdx_i\myold) + \mycdb\myold_i{}\transpose\bipsi(\mycdx_i\myold)
599: \}$ 
600: to a new one,
601: $\mycdw\mynew=\sum_{i\in \mathrm{S.V.}}\{\mycda\mynew_i
602: \phi(\mycdx_i\mynew) + \mycdb\mynew_i{}\transpose\bipsi(\mycdx_i\mynew)\}$.
603: Although $\mycdw\mynew$ could have bases other than the support vectors,
604: we restrict the bases to support vectors to
605: preserve the sparsity of bases.
606: 
607: There are several possible choices for the projection.
608: In this paper, we use the one which minimizes the cost function
609: \begin{equation}
610: \label{eq:E}
611:  {1\over2}\sum_{\bix\in T} \{\inner{\mycdw\mynew}{\phi(\bix)} + \mycdf_0\mynew -
612:   (\inner{\mycdw\myold}{\phi(\bix)} + \mycdf_0\myold)\}^2, 
613: \end{equation}
614: where $T$ is a certain set of $\bix$, and we use $T=$ $\{\bix_i$,
615: $\mycdx_i\myold$, $\mycdx_i\mynew$; $i=1,\cdots,n\}$.
616: 
617: Minimizing (\ref{eq:E}) leads to a simple least-squares problem, which can
618: be solved via linear equations.
619: Another possible cost function is
620: $\|\mycdw\mynew-\mycdw\myold\|^2$, which leads to another set of
621: linear equations.
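
A sketch of the linear system obtained from (\ref{eq:E}) follows; the stacked vectors $\bi{\theta}$ and $\bi{h}(\bix)$ and the shorthand $f\myold(\bix)$ are notation introduced here only for illustration. By the kernel expansion, the new function is linear in its parameters,
\[
 \inner{\mycdw\mynew}{\phi(\bix)} + \mycdf_0\mynew
  = \bi{\theta}\transpose\bi{h}(\bix),
\]
where $\bi{\theta}$ stacks $\mycda\mynew_i$, $\mycdb\mynew_i$ ($i\in\mathrm{S.V.}$) and $\mycdf_0\mynew$, and $\bi{h}(\bix)$ stacks the corresponding $\knl(\mycdx_i\mynew,\bix)$, $\kx(\mycdx_i\mynew,\bix)$ and the constant $1$. Setting the gradient of (\ref{eq:E}) with respect to $\bi{\theta}$ to zero gives the normal equations
\[
 \left\{\sum_{\bix\in T}\bi{h}(\bix)\bi{h}(\bix)\transpose\right\}\bi{\theta}
  = \sum_{\bix\in T}\bi{h}(\bix)\, f\myold(\bix),
 \qquad
 f\myold(\bix)\equiv \inner{\mycdw\myold}{\phi(\bix)} + \mycdf_0\myold .
\]
If the matrix on the left is singular or ill-conditioned, a pseudo-inverse or a small ridge term could be used; that is a numerical choice the text does not specify.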
622: 
623: \subsection{Overall algorithm and the convergence property}
624: \label{sec:overall}
625: 
626: Now let us summarize the algorithm below.
627: \par
628: \bigskip
629: \par
630: \noindent{\textbf{\strut Algorithm 1: Algorithm to maximize the margin
631: in the input space}}
632: \hrule 
633: \strut Initialization step:
634:        Let the solution of the original SVM be
635:        $a_i\zeroth$ and $f_0\zeroth$; 
636:        let $\bib_i\zeroth=\mathbf0$ and $\mycdx_i\zeroth=\bix_i$.
637: \par\noindent
638: For $k=0,1,2,\ldots$, repeat the following steps until convergence:
639: \begin{enumerate}
640:  \item Update of $\mycdx_i$:
641:        Calculate $\mycdx_i\kpth$ by
642:        applying (\ref{eq:upmycdx}) iteratively to $\mycdx_i\kth$.
643:  \item Projection of hyperplane:
644:        Calculate $\mycda_i$, $\mycdb_i$ and $\mycdf_0$ based on
645:        $\mycdx_i\kpth$ by
646:        a certain projection method from $a_i\kth$, $\bib_i\kth$ and $f_0\kth$
647:        based on $\mycdx_i\kth$ (sec.\ref{sec:proj}).
648:  \item QP step: Solve the QP problem (\ref{eq:qp})
649:        with respect to $\mylag_i$.
650:  \item Parameter update:
651:        Calculate $a_i\kpth$, $\bi{b}_i\kpth$ and $f_0\kpth$ by
652:        (\ref{eq:newab}) and (\ref{eq:newf}).
653: \end{enumerate}
654: The discriminant function at the $k$-th step is given by (\ref{eq:f}).
655: \par\smallskip
656: \hrule
657: \bigskip
658: 
659: Although Algorithm 1 does not always converge to the global minimum,
660: we can prove the following proposition concerning the convergence
661: of the algorithm.
662: \begin{proposition}
663: Equilibrium points of Algorithm 1 are critical points of the margin in
664:  the input space.
665: The algorithm is stable, when the update rule of $\mycdx_i$ (\ref{eq:upmycdx})
666:  is stable for all $i$ (see also Proposition 1).
667: \end{proposition}
668: This proposition follows essentially from Proposition 1 and the fact that
669: the linearization in the QP is almost exact under a small
670: perturbation of $\wt$.
671: As in the case of (\ref{eq:upmycdx}), we can modify the algorithm by
672: slowing down in (\ref{eq:d}) and (\ref{eq:upmycdx}) so that
673: the equilibrium state is stable if and only if the margin
674: is locally optimal.
675: However, we do not use this modification in the simulation because the case
676: in which a local minimum is unstable is expected to be rare.
677: 
678: Another problem of Algorithm 1 is that each iteration step does not
679: always increase the margin monotonically.  
680: Although it is usually faster than gradient type algorithms,
681: the algorithm sometimes does not improve the solution of the original
682: SVM at all.
683: Because the original SVM can be seen as a special case of the algorithm,
684: we can use some annealing technique, for example, updating temporary
685: variables and parameters more gradually from their initial values.
686: However, for simplicity, we use a crude method in the simulation
687: as follows: Repeat several 
688: steps of the algorithm (5 steps in the simulation) and then choose
689: the best solution which gives the largest estimated value of the margin.
690: 
691: As for the complexity of the algorithm, we need $O(m^2 n^2)$ space
692: and $O(m^3 n^2)$ time to calculate the temporary variables
693: if the computation of a kernel function is $O(m)$,
694: while the original SVM requires $O(n^2)$ space and $O(m n^2)$ time.
695: These calculations can be parallelized easily.
696: The complexity is not so different when $m$ is comparatively small.
697: Once the variables are calculated, the complexity of the QP is just the same.
698: Therefore, as long as the calculation of the temporary variables
699: is comparable to the QP time,
700: the proposed algorithm is comparable to the original SVM.
701: If Algorithm 1 is too costly because $m$ is large, we can use
702: a simplified algorithm as shown in section \ref{sec:simple}.
703: 
704: As for the iteration of the QP, which is usually carried out for only a few steps,
705: since the current solution is an estimate of the solution,
706: it may be possible to reduce the complexity
707: of the QP at the next iteration step.
708: 
709: \section{Simulation results}
710: \label{sec:simulation}
711: 
712: In this section, we give simulation results for 
713: artificial data sets in order to verify the proposed algorithm
714:  and to examine its basic performance.
715: 20 training samples and 1000 test samples are randomly drawn from
716: positive and negative distributions, each of which is a
717: Gaussian mixture of 3 components with
718: centers uniformly distributed in $[0,1)^2$ and
719: fixed spherical variance $\sigma^2=0.2^2$.
720: The kernel function used here is a spherical Gaussian kernel with
721: $\sigma^2=1^2$.
722: The metric is taken to be Euclidean (i.e., $G_i$ is the unit matrix).
723: Figure \ref{fig:svm} and \ref{fig:alg1}
724: show an example of results by the original SVM
725: (initial condition) and the proposed algorithm (after 5 steps).
726: In this case, the margin value increases from 0.040 to 0.096.
727: Such a simulation is repeated for 100 sets of samples with different random
728: numbers.
729: 
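For concreteness, the setting above can be written as follows.
This is only a sketch: the equal mixing weights and the normalization of
the kernel width are assumptions made for illustration, and the symbols
$\bi{\mu}_{y,c}$ and $\sigma_{\knl}$ are introduced only here.
With centers $\bi{\mu}_{y,c}$ drawn uniformly from $[0,1)^2$ and $\sigma=0.2$,
each class-conditional density is
\[
 p(\bix\mid y) = {1\over 3}\sum_{c=1}^{3} {1\over 2\pi\sigma^2}
 \exp\left(-{\|\bix-\bi{\mu}_{y,c}\|^2\over 2\sigma^2}\right),
\]
and the kernel is
\[
 \knl(\bix,\bix') = \exp\left(-{\|\bix-\bix'\|^2\over 2\sigma_{\knl}^2}\right),
 \qquad \sigma_{\knl}=1.
\]
For the example quoted above, the ratio of the margins is $0.096/0.040=2.4$.
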
The estimated margins
in the input space for the original and proposed
algorithms are shown in Figure \ref{fig:margin} (log-log scale).
With the crude method described in the
previous section, there are 4 cases among the 100 runs in which the
margin of the original SVM is not improved. The ratios of the margins
are distributed from 1.00 (no improvement) to 27.9.
737: 
The misclassification errors
for the test samples are shown in Figure \ref{fig:error}.
The ratios of the errors range from 0.40 (best) to 1.37 (worst).
741: 
These results indicate that increasing the margin in the input space
is effective in improving the generalization performance on average, but
there are cases in which the generalization error is not reduced
even when the margin in the input space increases.
746: 
747: \begin{figure}[tbhp]
748:  \includegraphics[width=.8\textwidth]{origsvm-r.eps}
749: \caption{Result of the original SVM (margin .040).
750: Circles ($\circ$) and crosses ($\times$) are positive and negative
751: samples. Squares ($\Box$) represent estimates of the projection
752: of the points by applying (\ref{eq:upmycdx}) for 10 steps.}
753: \label{fig:svm}
754: %
755: \end{figure}
756: 
757: \begin{figure}[tbhp]
758:  \includegraphics[width=.8\textwidth]{5step-r.eps}
\caption{Result of Algorithm 1 (after 5 steps, margin .096)
 for the same data set as Fig.~\ref{fig:svm}.}
761: \label{fig:alg1}
762: %
763: \par\bigskip
764: \end{figure}
765: 
766: \begin{figure}[tbhp]
767:  \includegraphics[width=.8\textwidth]{mar-r2.eps}
768: \caption{Margin comparison with the original SVM for 100 runs
769:  (log-log scale)}
770: \label{fig:margin}
771: \end{figure}
772: 
773: \begin{figure}[tbhp]
774:  \includegraphics[width=.8\textwidth]{err-r2.eps}
775: \caption{Test error comparison with the original SVM for 100 runs}
776: %
777: \label{fig:error}
778: %
779:  \par\bigskip
780: \end{figure}
781: 
782: \section{Soft margin}
783: \label{sec:soft}
784: 
In noisy situations, the hard margin classifier often overfits
the samples.
There are several possibilities for incorporating a soft margin;
here we give a simple one.
789: The soft margin can be derived by introducing slack variables $z_i$
790: into the optimization problem.
We use a soft constraint of the form
\begin{equation}
\label{eq:constraint5}
 \inner{\wt}{[y_i\ \{\phi(\mycdx_i) -
\bipsi(\mycdx_i)\transpose\mycdd_i
 \}-\mycdeta_i]}\ge \mycdg_i-f_0 y_i - z_i,
\end{equation}
and add a penalty for the slack variables to the objective function,
\begin{equation}
 {1\over2}\inner{\wt}{\wt} + C\sum_i z_i.
\end{equation}
801: \end{equation}
802: 
803: By this modification, only the constraint (\ref{eq:constrainta}) for
804: $\mylag_i$ is changed to
805: \begin{equation}
806:  0\le\mylag_i\le C, \qquad \sum_i y_i\mylag_i = 0,
807: \end{equation}
which is the same constraint as in the soft margin of the original SVM.
However, the geometrical meaning of (\ref{eq:constraint5}) in the input space
is not clear. Introducing a natural soft constraint
in the input space is left for future work.
812: 
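The change of the constraint for $\mylag_i$ can be checked by a short calculation.
As a sketch, write $\bi{u}_i$ (a shorthand introduced only here) for the
vector in the brackets of (\ref{eq:constraint5}), assume $z_i\ge 0$ with
multipliers $\mu_i\ge 0$ (also introduced only here), and regard the
temporal variables as fixed within one step.
The Lagrangian is then
\[
 L = {1\over2}\inner{\wt}{\wt} + C\sum_i z_i
 - \sum_i \mylag_i\left(\inner{\wt}{\bi{u}_i} - \mycdg_i + f_0 y_i + z_i\right)
 - \sum_i \mu_i z_i, \qquad \mylag_i\ge 0,
\]
and the stationarity conditions with respect to $z_i$ and $f_0$ are
\[
 {\partial L\over\partial z_i} = C - \mylag_i - \mu_i = 0, \qquad
 {\partial L\over\partial f_0} = -\sum_i y_i\mylag_i = 0.
\]
Since $\mu_i\ge 0$, the first condition gives $\mylag_i = C-\mu_i\le C$,
and the second is the equality constraint.
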
813: \section{Simplified algorithm for a high dimensional case}
814: 
815: \label{sec:simple}
816: 
Although Algorithm 1 achieves the precise solution, the computational
cost is high when the dimensionality of the inputs is large.
In this section, we give a simplified algorithm.
820: 
If we do not update $\mycdx_i$, the first and second steps of Algorithm 1
are no longer necessary. This makes Algorithm 1
a little simpler because all the $\mycdd_i$ terms vanish.
However, let us consider a further simplification.
825: 
We have shown the relation to the original SVM:
the original SVM is obtained by setting $\mycdg_i=1$ and $\mycdeta_i=0$.
Since $\mycdeta_i$ requires many temporal variables,
we keep only $\mycdg_i$ and fix $\mycdeta_i=0$.
Then all the terms related to the $\mycdb_i$'s vanish.
831: 
Consequently,
the above simplifications lead to an algorithm much like the original
SVM. In fact, existing code for the original SVM can be used as follows.
835: 
In each step, $\mycdg_i$ is first calculated as
837: \begin{equation}
 \mycdg_i = {\|\sum_j a_j\kth\kx(\bix_i,\bix_j)\|\sinorm\over
  \sqrt{\sum_{j,k}a_j\kth a_k\kth \knl(\bix_j,\bix_k)}}.
840: \end{equation}
Then, by letting the $(i,j)$ element of the kernel matrix be
$\knl(\bix_i,\bix_j) / (\mycdg_i\mycdg_j)$, the original SVM for this
kernel matrix gives the solution for each step of the simplified algorithm.
844: 
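In matrix form, the rescaling of the kernel matrix can be sketched as follows,
where $K$, $\tilde K$, $\Gamma$ and $\bi{v}$ are symbols introduced only for
this illustration and $n$ is assumed to be the number of training samples.
Let $K$ be the $n\times n$ matrix with elements $\knl(\bix_i,\bix_j)$ and
$\Gamma=\mathrm{diag}(\mycdg_1,\ldots,\mycdg_n)$. Then
\[
 \tilde K = \Gamma^{-1} K\, \Gamma^{-1}, \qquad
 \tilde K_{ij} = {\knl(\bix_i,\bix_j)\over \mycdg_i\mycdg_j}.
\]
If all $\mycdg_i>0$, $\tilde K$ is symmetric and positive semidefinite
whenever $K$ is, because
$\bi{v}\transpose\tilde K\bi{v} = (\Gamma^{-1}\bi{v})\transpose K\,(\Gamma^{-1}\bi{v})\ge 0$
for any $\bi{v}$, so it is a valid input for an SVM solver that accepts
a kernel matrix.
Setting $\mycdg_i=1$ for all $i$ gives $\tilde K=K$, i.e., the original SVM.
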
845: \section{Conclusion}
We have proposed a new learning algorithm that finds a kernel-based
classifier maximizing the margin in the input space.
The derived algorithm alternates between optimizing
the feet of the perpendiculars and the linear coefficient parameters.
Such a dual structure appears in other frameworks, such as
the EM algorithm, variational Bayes, and principal curves.
852: 
Many issues about the algorithm remain to be studied, for example,
analyzing the generalization performance theoretically and
finding an efficient algorithm that reduces the complexity and
converges more stably.
It is also interesting to extend our framework to
problems other than classification, such as regression\cite{akaho,otsu,mueller}.
859: 
In this paper, we have assumed that the kernel function is given and fixed.
Recently, many techniques and criteria for choosing a kernel function
have been proposed. We expect that
those techniques and much other knowledge about the original SVM
can be incorporated into our framework.
Applying the algorithm to real-world data is also important.
866: 
867: \begin{thebibliography}{12}
868:  \bibitem{akaho} S. Akaho, Curve fitting that minimizes the mean square of 
869: perpendicular distances from sample points, {\it SPIE Vision Geometry
870: 	 II} (also found in {\it Selected SPIE Papers on CD-ROM}, 
871: 	 8, 1999), 237--244 (1993)
872: 
873:  \bibitem{amari}
874:  S. Amari, {\it Differential Geometrical Methods in
875: 	 Statistics}, Springer-Verlag (1984)
876: 
877:  \bibitem{cortes}
 C. Cortes and V.N. Vapnik, Support-vector networks,
879: 	 {\it Machine Learning}, 20, pp. 273--297 (1995)
880: 
881:  \bibitem{decoste}
882:  D. DeCoste and B. Sch\"olkopf, Training invariant
883: 	 support vector machines, {\it Machine Learning}, 46(1),
884: 	 pp. 161--190 (2002)
885: 
886:  \bibitem{hastie}
887: 	 T. Hastie and W. Stuetzle, Principal curves,
888: 	 {\it Journal of the American Statistical Association}, 84(406),
889: 	 pp. 502--516 (1989)
890: 
891:  \bibitem{jaakkola}
892:  T.S. Jaakkola and D. Haussler, Exploiting generative
893: 	 models in discriminative classifiers, {\it NIPS 11},
894: 	 pp. 487--493 (1998)
895: 
896:  \bibitem{mueller}
 K.R. M\"uller, S. Mika, G. R\"atsch, K. Tsuda,
	 B. Sch\"olkopf, An Introduction to Kernel-Based Learning
899: 	 Algorithms, {\it IEEE Trans. on Neural Networks}, 12,
900: 	 pp. 181--201 (2001)
901: 
902:  \bibitem{otsu}
 N. Otsu, Karhunen-Loeve line fitting and a linearity
	 measure. In {\it IEEE Proc. of ICPR'84}, pp. 486--489 (1984)
905: 
906:  \bibitem{ramsey}
 J.O. Ramsay, B.W. Silverman, {\it Functional Data Analysis},
908: 	 Springer-Verlag (1997)
909: 	 
910:  \bibitem{simard}
911:   P.Y. Simard, Y.A. Le Cun, J.S. Denker, B. Victorri,
912: 	 Transformation Invariance in Pattern Recognition -- Tangent
913: 	 Distance and Tangent Propagation, in {\it Neural Networks:
914: 	 Tricks of the Trade}, G. Orr and K.-R. M\"uller, eds.,
915: 	 Springer-Verlag, vol.1524, pp.239--274 (1998)
916: 
917:  \bibitem{tsuda}
918:   K. Tsuda, M. Kawanabe, G. R\"atsch, S. Sonnenburg, K.R. M\"uller,
919:          A New Discriminative Kernel from Probabilistic Models,
920: 	 {\it NIPS 14} (2001)
921: 
922:  \bibitem{vapnik}
923:   V.N. Vapnik, {\it The Nature of Statistical
924: 	 Learning Theory}, Springer-Verlag (1995)
925: \end{thebibliography}
926: 
927: \end{document}
928: 
929: