1: \documentclass[12pt]{article}
2: \usepackage{graphicx,latexsym}
3: \newcommand{\bi}[1]{\mbox{\boldmath $#1$}}
4:
5: \newtheorem{proposition}{Proposition}
6: \newtheorem{example}{Example}
7:
8: %\newcommand{\mycal}{\mathsf}
9: \newcommand{\mylag}{\alpha}
10: \newcommand{\mycal}{\mathcal}
11: \newcommand{\inner}[2]{#1\cdot#2}
12: \newcommand{\wt}{\omega}
13: \newcommand{\wtsvm}{\wt_{\mathrm{SVM}}}
14: \newcommand{\fsvm}{f_{\mathrm{SVM}}}
15: \newcommand{\myft}[1]{{#1}^*}
16: \newcommand{\myftx}{\myft{\bix}}
17: \newcommand{\svmft}[1]{{#1}^\dag}
18: \newcommand{\mycd}[1]{\hat{#1}}
19: \newcommand{\transpose}{^{\mathsf T}}
20: \newcommand{\zeroth}{^{(0)}}
21: \newcommand{\myraw}{^{\mathrm{raw}}}
22: \newcommand{\kth}{^{(k)}}
23: \newcommand{\kpth}{^{(k+1)}}
24: \newcommand{\knl}{\mathrm{k}}
25: \newcommand{\knlmat}{C}
26: \newcommand{\knlmatx}{D}
27: \newcommand{\knlvec}{\bi{c}}
28: \newcommand{\kx}{\mathbf{k}_x}
29: \newcommand{\kxy}{\mathrm{K}_{xy}}
30: \newcommand{\bia}{\bi{a}}
31: \newcommand{\bib}{\bi{b}}
32: \newcommand{\bid}{\bi{d}}
33: \newcommand{\bix}{\bi{x}}
34: \newcommand{\biy}{\bi{y}}
35: \newcommand{\bipsi}{\bi{\psi}}
36: \newcommand{\bieps}{\bi{\varepsilon}}
37: \newcommand{\mycdd}{\mycd{\bid}}
38: \newcommand{\mycdx}{\mycd{\bix}}
39: \newcommand{\mycdw}{\mycd{\wt}}
40: \newcommand{\mycda}{\mycd{a}}
41: \newcommand{\mycdb}{\mycd{\bib}}
42: \newcommand{\mycdg}{\mycd{g}}
43: \newcommand{\mycdeta}{\mycd{\eta}}
44: \newcommand{\mycdp}{\mycd{p}}
45: \newcommand{\mycdq}{\mycd{\bi{q}}}
46: \newcommand{\mycdr}{\mycd{r}}
47: \newcommand{\mycds}{\mycd{s}}
48: \newcommand{\mycdt}{\mycd{t}}
49: \newcommand{\mycdu}{\mycd{u}}
50: \newcommand{\mynew}{^{\mathrm{new}}}
51: \newcommand{\myold}{^{\mathrm{old}}}
52: \newcommand{\myprev}{^{[l]}}
53: \newcommand{\mynext}{^{[l+1]}}
54: \newcommand{\mycdf}{\mycd{f}}
55: \newcommand{\gnorm}{_{G_i}^2}
56: \newcommand{\sgnorm}{_{G_i}}
57: \newcommand{\inorm}{_{G_i^{-1}}^2}
58: \newcommand{\jnorm}{_{G_j^{-1}}^2}
59: \newcommand{\sinorm}{_{G_i^{-1}}}
60: \hyphenation{di-men-sion-al}
\title{Maximizing the Margin in the Input Space}
62:
63: \author{
64: Shotaro Akaho \\
65: AIST Neuroscience Research Institute\\
66: 1--1 Central 2, Umezono, Tsukuba 3058568 Japan \\
67: {\texttt{s.akaho@aist.go.jp}}}
68:
69: \begin{document}
70:
71: \maketitle
72:
73: \begin{abstract}
74: We propose a novel criterion for support vector machine learning:
75: maximizing the margin in the input space, not in the feature (Hilbert) space.
76: This criterion is a discriminative version of the principal curve
77: proposed by Hastie et al.
78: The criterion is appropriate in particular when the input space is
79: already a well-designed feature space with rather small dimensionality.
80: The definition of the margin is generalized
81: in order to represent prior knowledge.
82: The derived algorithm consists of two alternating steps to estimate the
83: dual parameters.
First, the parameters are initialized by the original SVM.
Then one set of parameters is updated by a Newton-like procedure, and
the other set is updated by solving a quadratic programming problem.
The algorithm converges in a few steps to a local optimum under mild
conditions, and it preserves the sparsity of support vectors.
Although the cost of calculating the temporary variables increases,
the cost of solving the quadratic programming problem at each step
does not change.
92: It is also shown that the original SVM can be seen as a special case.
93: We further derive a simplified algorithm which enables us to use
94: the existing code for the original SVM.
95: \end{abstract}
96:
97: \section{Introduction}
The support vector machine (SVM) is known as one of the state-of-the-art
methods, especially for pattern recognition
\cite{cortes,mueller,vapnik}.
101: The original SVM maximizes the margin which is
102: defined by the minimum distance between samples
103: and a separating hyperplane in a Hilbert space $\mycal H$.
104: Even when the dimensionality of $\mycal H$ is very large,
105: it has been proved that the original SVM has
106: a bound for a generalization error
107: which is independent of the dimensionality.
108: In practice, however,
109: the original SVM sometimes gives a very small margin in the input
110: space, because the metric of the feature space is usually quite different from
111: that of the input space.
Such a situation is undesirable, in particular when the input space
is already a well-designed feature space constructed by using some prior
knowledge~\cite{amari,decoste,jaakkola,simard,tsuda}.
115:
116: This paper gives a learning algorithm to maximize the
117: margin in the input space.
One difficulty is obtaining an explicit form of the
margin in the input space, because the classification boundary is curved and
the perpendicular projection from a sample point to the boundary is not
always unique. We solve this problem by linear approximation
techniques. The derived algorithm basically iterates
two alternating stages:
one estimates the projection points and the other
solves a quadratic programming problem to find the optimal parameter values.
126:
Such a dual structure appears in other frameworks, such as
the EM algorithm and variational Bayes.
More closely related work is the principal curve proposed by
Hastie et al.~\cite{hastie}. The principal curve finds a curve in the `center'
of the points in the input space.
132:
The derived algorithm is not of the gradient-descent type but Newton-like;
hence we have to investigate its convergence properties.
It is shown that the derived
algorithm does not always converge to the global optimum, but
it converges to a local optimum under mild conditions.
Some interesting relations to the original SVM are also shown:
the original SVM can be seen as a special case of the algorithm,
and the number of support vectors does not increase much relative to the
original SVM.
142: The algorithm is verified through simple simulations.
143:
144: \section{Generalized margin in the input space}
145:
146: We consider a binary classification problem.
147: The purpose of learning is to construct a map from an $m$-dimensional input
148: $\bix\in{\Re}^m$ to a corresponding output $y\in\{\pm1\}$ by using
149: a finite number of samples $(\bix_1,y_1),\ldots,(\bix_n,y_n)$.
150:
151: Let us consider a linear classifier,
152: $y=\mbox{sgn}[f(\bix)]$, where
153: $f(\bix) \equiv \inner{\wt}{\phi(\bix)} + f_0$;
154: $\phi(\bix)$ is a feature of an input $\bix$ in
155: a Hilbert space $\mycal H$,
156: $\wt\in \mycal H$ is a weight parameter
157: and $f_0\in \Re$ is a bias parameter.
158: Those parameters $\wt$ and $f_0$ define a separating hyperplane in the
159: feature space.
160: As a feature function $\phi(\bix)$, we only consider a differentiable
161: nonlinear map.
162:
163: A margin in the input space is defined by the minimum distance from sample
164: points to the classification boundary in the input space.
165: Since the classification boundary forms a complex curved surface,
166: the distance cannot be obtained in an explicit form, and more
167: significantly, a projection from a point to the boundary is not unique.
168:
Here, the metric in the input space need not be Euclidean.
Some Riemannian metric $G(\bix)$ may be defined, which
enables us to represent many kinds of prior knowledge.
For example, the invariance of patterns~\cite{mueller,simard} can be implemented
in this form.
Another example is the
Fisher information matrix, which is a natural metric
when the input space is a parameter space
of some probability distribution~\cite{amari,jaakkola}.
Although it is theoretically preferable to measure the distance by
the length of a geodesic in the Riemannian space,
this causes computational difficulty.
In our formulation, since we only need the distance from a sample point to
another point, we use a computationally feasible (nonsymmetric) distance
from a sample point $\bix_i$ to another point $\bix$ given by the quadratic norm,
184: \[
185: \|\bix-\bix_i\|\gnorm =
186: (\bix-\bix_i)\transpose G_i(\bix-\bix_i),
187: \]
188: where $G_i\equiv G(\bix_i)$.
189:
For simplicity, we mainly consider the hard margin case, in which
sample points are separable by a hyperplane in the Hilbert space.
The soft margin case is discussed in section~\ref{sec:soft}.
193:
194: Let $\myftx_i$ be the closest point on the boundary
195: surface from a sample point $\bix_i$, and
196: $\bid_i \equiv \myftx_i - \bix_i$.
Since $\bid_i$ is invariant under a rescaling of $(\wt,f_0)$,
we can assume that all points are separated and satisfy
\begin{equation}
\label{eq:constraint}
\|\bid_i\|\gnorm \ge {1/\inner{\wt}{\wt}},\quad i=1,\ldots,n.
\end{equation}
If we assume that at least one of these constraints holds with equality,
the margin is given by $1/\sqrt{\inner{\wt}{\wt}}$.
Then we can find the optimal parameters by minimizing
the quadratic objective function $\inner{\wt}{\wt}$
under the constraints (\ref{eq:constraint}) and $y_i f(\bix_i) > 0$.
208:
In order to solve the optimization problem, we start from a solution
of the original SVM and update the solution iteratively.
By two kinds of linearization techniques and the kernel trick,
which are described in the next section, we obtain
a discriminant function at the $k$-th iteration step in the form
214: \begin{equation}
215: \label{eq:f}
216: f(\bi{x})=\sum_{i\in \mathrm{S.V.}} \{a_i\kth \knl(\mycdx_i\kth,\bix) +
217: \bib_i\kth{}\transpose \kx(\mycdx_i\kth, \bix)\} + f_0\kth,
218: \end{equation}
219: where S.V. is a set of indices of support vectors,
220: $\knl(\bix,\biy)$ is a kernel function and $\kx(\bix,\biy)$ is its
221: derivative defined by $\kx(\bix,\biy)\equiv {\partial
222: \knl(\bix,\biy)/\partial\bix}$.
We have two groups of parameters here: one consists of $a_i$, $\bib_i$ and $f_0$,
which are linear coefficients, and the other consists
of $\mycdx_i$, which is an estimate of
the projection point $\myftx_i$ and determines the basis functions.
The parameters $a_i$ and $f_0$ are initialized by the corresponding parameters of the
original SVM, and the other parameters are initialized by
$\bib_i=\mathbf0$, $\mycdx_i=\bix_i$.
230:
231: \section{Iterative QP by linear approximations}
In this section, we outline the derivation of the update rules for
those parameters. The resulting algorithm is summarized in sec.~\ref{sec:overall}.
234:
235: \subsection{Linear approximation of the distance to the boundary}
236: \label{sec:d}
Given an estimated projection point $\mycdx_i$,
we can get an approximate distance $\|\bid_i\|\sgnorm$
by a linear approximation~\cite{akaho}.
Taking the Taylor expansion of
$f(\myftx_i)=0$ around $\mycdx_i$
up to the first order,
we obtain a constraint on $\bid_i$,
244: \[
245: f(\mycdx_i) +
246: \nabla f(\mycdx_i)\transpose (\bid_i - \mycdd_i) = 0,
247: \]
248: where $\mycdd_i = \mycdx_i-\bix_i$.
249: Minimizing $\|\bid_i\|\gnorm$ under this constraint,
250: we have
251: \begin{equation}
252: \label{eq:d}
253: \|\bid_i\|\gnorm = {(\inner{\wt}{\{\phi(\mycdx_i) -
254: \bipsi(\mycdx_i)\transpose\mycdd_i \}}+f_0)^2\over
255: \|\inner{\wt}{\bipsi(\mycdx_i)}\|\inorm},
256: \end{equation}
257: where $\bipsi(\mycdx_i)\equiv
258: \nabla \phi(\mycdx_i)\in {\mycal H}^m$.
Note that this approximate value is unique, and it is invariant under a
rescaling of
$(\wt,f_0)$.
Moreover, the approximation is exact when $\mycdx_i=\myftx_i$
and $\nabla f(\myftx_i)\ne 0$.
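\begin{example}
As an illustrative check (a sketch that uses only the definitions above and assumes $G_i$ is symmetric positive definite), the minimization leading to (\ref{eq:d}) can be carried out with a Lagrange multiplier. Write $\hat{c}_i \equiv f(\mycdx_i)-\nabla f(\mycdx_i)\transpose\mycdd_i$, so that the linearized constraint reads $\hat{c}_i+\nabla f(\mycdx_i)\transpose\bid_i=0$; the symbols $\hat{c}_i$ and $\mu$ are introduced only for this example. Setting to zero the gradient of
\[
\bid_i\transpose G_i\bid_i + 2\mu\{\hat{c}_i+\nabla f(\mycdx_i)\transpose\bid_i\}
\]
with respect to $\bid_i$ gives $\bid_i=-\mu G_i^{-1}\nabla f(\mycdx_i)$, and the constraint then fixes
$\mu=\hat{c}_i/\|\nabla f(\mycdx_i)\|\inorm$. Hence
\[
\|\bid_i\|\gnorm = \mu^2\,\|\nabla f(\mycdx_i)\|\inorm
={\hat{c}_i^2\over\|\nabla f(\mycdx_i)\|\inorm},
\]
which coincides with (\ref{eq:d}) because $f(\mycdx_i)=\inner{\wt}{\phi(\mycdx_i)}+f_0$ and
$\nabla f(\mycdx_i)=\inner{\wt}{\bipsi(\mycdx_i)}$.
\end{example}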
264:
265: \subsection{Linearization of the constraint}
266: \label{sec:qp}
267: Using the approximate value of the distance, we have a nonlinear
268: constraint,
269: \begin{equation}
270: \label{eq:NLconst}
271: y_i\left[\inner{\wt}\{\phi(\mycdx_i) -
272: \bipsi(\mycdx_i)\transpose\mycdd_i \}+f_0\right]
273: \ge {\|\inner{\wt}{\bipsi(\mycdx_i)}\|\sinorm\over\sqrt{\inner{\wt}{\wt}}}.
274: \end{equation}
Since the constraint is nonlinear in $\wt$, we linearize it around
an approximate solution $\wt=\mycdw$, which is the solution at
the current step.
This linearization not only simplifies the problem, but
also enables us to derive a dual problem.

Let $g_i(\wt)$ be the right-hand side of (\ref{eq:NLconst}).
Its first-order expansion is
\[
g_i(\wt) \simeq g_i(\mycdw) +
\inner{\left({\partial g_i(\mycdw)/\partial\wt}\right)}{(\wt-\mycdw)}.
\]
287: Now let $\mycdg_i \equiv g_i(\mycdw),
288: \mycdeta_i \equiv {\partial g_i(\mycdw)/\partial\wt}$,
289: then we have a linear constraint for $\wt$,
290: \begin{equation}
291: \label{eq:constraint3}
292: \inner{\wt}{[y_i\ \{\phi(\mycdx_i) -
293: \bipsi(\mycdx_i)\transpose\mycdd_i
294: \}-\mycdeta_i]}\ge \mycdg_i- f_0 y_i,
295: \end{equation}
296: where we used the fact $\inner{\mycdw}{\mycdeta_i}=0$.
297: Suppose $\mycdq_i \equiv \inner{\mycdw}{\bipsi(\mycdx_i)}$ and
298: $\mycdr \equiv \inner{\mycdw}{\mycdw}$,
299: then $\mycdg_i$ and $\mycdeta_i$ are given by
300: \begin{eqnarray}
301: \label{eq:h}
302: \mycdg_i &=& {1\over \sqrt{\mycdr}}\|\mycdq_i\|\sinorm,\nonumber\\
303: \mycdeta_i
304: &=& {1\over \mycdg_i \mycdr} \left\{\mycdq_i\transpose G_i^{-1}
305: \bipsi(\mycdx_i) -{1\over\mycdr}\|\mycdq_i\|\inorm\mycdw\right\}.
306: \end{eqnarray}
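\begin{example}
The expression for $\mycdeta_i$ can be checked by a short calculation, sketched here with the shorthand $N(\wt)\equiv\|\inner{\wt}{\bipsi(\mycdx_i)}\|\sinorm$ and $D(\wt)\equiv\sqrt{\inner{\wt}{\wt}}$ (both introduced only for this example), so that $g_i=N/D$. Since
\[
{\partial N\over\partial\wt}={(\inner{\wt}{\bipsi(\mycdx_i)})\transpose G_i^{-1}\bipsi(\mycdx_i)\over N},
\qquad
{\partial D\over\partial\wt}={\wt\over D},
\]
the quotient rule evaluated at $\wt=\mycdw$ gives
\[
\mycdeta_i={1\over ND}\,\mycdq_i\transpose G_i^{-1}\bipsi(\mycdx_i)-{N\over D^3}\,\mycdw ,
\]
which agrees with (\ref{eq:h}) because $ND=\mycdg_i\mycdr$ and $N^2=\|\mycdq_i\|\inorm$.
Taking the inner product with $\mycdw$ yields
\[
\inner{\mycdw}{\mycdeta_i}={1\over\mycdg_i\mycdr}\left\{\mycdq_i\transpose G_i^{-1}\mycdq_i-{1\over\mycdr}\|\mycdq_i\|\inorm\,\mycdr\right\}=0,
\]
which is the orthogonality used to obtain (\ref{eq:constraint3}).
\end{example}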
307: By the above linearization, we can derive the dual problem
308: in a similar way to the original SVM,
309: \begin{eqnarray}
310: \lefteqn{W(\bi{\mylag}) = \sum_i \mycdg_i\mylag_i} \nonumber\\
311: && -{1\over2}
312: \sum_{i,j}\mylag_i\mylag_j [y_i \{\phi(\mycdx_i) -
313: \bipsi(\mycdx_i)\transpose\mycdd_i
314: \}-\mycdeta_i]\cdot[y_j \{\phi(\mycdx_j) -
315: \bipsi(\mycdx_j)\transpose \mycdd_j
316: \}-\mycdeta_j], \nonumber
317: \end{eqnarray}
which is maximized under the constraints $\mylag_i\ge0$
and $\sum_i\mylag_i y_i = 0$.
320: The solution $\wt$ is given by
321: \begin{equation}
322: \label{eq:wt}
323: \wt = \sum_i \mylag_i [y_i \{\phi(\mycdx_i) -
324: \bipsi(\mycdx_i)\transpose\mycdd_i
325: \}-\mycdeta_i].
326: \end{equation}
Here we can see a clear relation to the original SVM:
by letting $\mycdx_i=\bix_i$, $\mycdeta_i=0$, and $\mycdg_i=1$,
we have exactly the same optimization problem as the original SVM.
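\begin{example}
For completeness, the dual can be derived as in the standard SVM; the following is a sketch in which $\bi{z}_i\equiv y_i\{\phi(\mycdx_i)-\bipsi(\mycdx_i)\transpose\mycdd_i\}-\mycdeta_i$ is a shorthand used only here and the conventional factor ${1\over2}$ is attached to the objective. Minimizing ${1\over2}\inner{\wt}{\wt}$ under (\ref{eq:constraint3}) gives the Lagrangian
\[
L(\wt,f_0,\bi{\mylag})={1\over2}\inner{\wt}{\wt}-\sum_i\mylag_i\left(\inner{\wt}{\bi{z}_i}+f_0y_i-\mycdg_i\right),\qquad \mylag_i\ge0 .
\]
Stationarity with respect to $\wt$ gives $\wt=\sum_i\mylag_i\bi{z}_i$, which is (\ref{eq:wt}), and stationarity with respect to $f_0$ gives $\sum_i\mylag_iy_i=0$. Substituting both back,
\[
W(\bi{\mylag})=\sum_i\mycdg_i\mylag_i-{1\over2}\sum_{i,j}\mylag_i\mylag_j\,\inner{\bi{z}_i}{\bi{z}_j},
\]
which is the dual objective stated above.
\end{example}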
330:
331: \subsection{Kernel trick}
332:
In order to avoid computing the mapping into a high-dimensional
Hilbert space, the SVM applies the kernel trick, by which
an inner product is replaced by a symmetric positive definite
kernel function (Mercer kernel) that is easy to
calculate~\cite{ramsey,cortes,mueller,vapnik}.
338: In our formulation,
339: $\inner{\phi(\bix)}{\phi(\biy)}$ is replaced by a Mercer kernel
340: $\knl(\bix,\biy)$.
341: We also have to calculate the inner product
342: related to $\bipsi$ (the derivative of $\phi$).
343: Let us assume that the kernel function $\knl$ is differentiable.
344: Then, $\inner{\bipsi(\bix)}{\phi(\biy)}$
345: is replaced by a vector
346: $\kx(\bix,\biy)\equiv {\partial \knl(\bix,\biy)/\partial\bix}$,
347: and $\inner{\bipsi(\bix)}{\bipsi(\biy)\transpose}$
348: is replaced by a matrix
349: $\kxy(\bix,\biy)
350: \equiv {\partial^2 \knl(\bix,\biy)/\partial\bix\partial\biy\transpose}$.
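\begin{example}
As a concrete illustration, consider the spherical Gaussian kernel $\knl(\bix,\biy)=\exp\{-\|\bix-\biy\|^2/(2\sigma^2)\}$, the type of kernel used in section~\ref{sec:simulation}; the exact parametrization by $\sigma$ written here is an assumption of this example. Direct differentiation gives
\[
\kx(\bix,\biy)=-{\bix-\biy\over\sigma^2}\,\knl(\bix,\biy),
\qquad
\kxy(\bix,\biy)=\left\{{I\over\sigma^2}-{(\bix-\biy)(\bix-\biy)\transpose\over\sigma^4}\right\}\knl(\bix,\biy).
\]
In particular $\kx(\bix,\bix)=\mathbf0$ and $\kxy(\bix,\bix)=I/\sigma^2$, and both derivatives are obtained from a single evaluation of $\knl$.
\end{example}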
351:
352: Now we can derive the kernel version of the optimization problem.
In (\ref{eq:wt}), $\mycdeta_i\in \mycal H$ is expanded in bases related to
$\bipsi(\mycdx_i)$ and $\mycdw$,
and the solution $\wt$ additionally has the bases $\phi(\mycdx_i)$.
Although $\mycdw$ could be expanded in any bases, we restrict it
to the following form to avoid increasing the number of bases:
358: \[
359: \mycdw=\sum_i \{\mycda_i \phi(\mycdx_i) +
360: \mycdb_i\transpose \bipsi(\mycdx_i)\}.
361: \]
362: Then we have
363: $\mycdq_i = \sum_j \{ \mycda_j
364: \kx(\mycdx_i, \mycdx_j) +
365: \kxy(\mycdx_i,\mycdx_j)\mycdb_j
366: \}$.
367: Now let
368: \[
369: \mycdp_i \equiv \inner{\mycdw}{\phi(\mycdx_i)} =
370: \sum_j \{\mycda_j\knl(\mycdx_j,\mycdx_i) + \mycdb_j\transpose
371: \kx(\mycdx_j,\mycdx_i)\},
372: \]
373: then $\mycdr$ is given by
374: $\mycdr = \sum_i (\mycda_i \mycdp_i + \mycdb_i\transpose\mycdq_i)$,
375: and $\mycdg_i$ by (\ref{eq:h}).
Further, let us define additional temporary variables
that represent several terms in the objective function,
378: \begin{eqnarray*}
379: \mycds_{ij} &\equiv& \inner{\{\phi(\mycdx_i) -
380: \bipsi(\mycdx_i)\transpose\mycdd_i
381: \}}{\{\phi(\mycdx_j) -
382: \bipsi(\mycdx_j)\transpose \mycdd_j
383: \}} \\
384: &=& \knl(\mycdx_i,\mycdx_j)+\mycdd_i\transpose
385: \kxy(\mycdx_i,\mycdx_j)\mycdd_j
386: -\mycdd_i\transpose\kx(\mycdx_i,\mycdx_j)
387: -\mycdd_j\transpose\kx(\mycdx_j,\mycdx_i), \\
388: \mycdt_{ij} &\equiv& \inner{\mycdeta_i}
389: {\{\phi(\mycdx_j) - \bipsi(\mycdx_j)\transpose\mycdd_j\}}
390: \\
391: &=&
392: {1\over \mycdg_i \mycdr}\bigg\{\mycdq_i\transpose G_i^{-1}
393: \left(\kx(\mycdx_i,\mycdx_j) - \kxy(\mycdx_i,\mycdx_j)\mycdd_j
394: \right)
395: - {\|\mycdq_i\|\inorm\over \mycdr}(
396: \mycdp_j - \mycdd_j\transpose\mycdq_j)
397: \bigg\}, \\
398: \mycdu_{ij} &=& \inner{\mycdeta_i}{\mycdeta_j}
399: =
400: {1\over \mycdg_i \mycdg_j \mycdr^2}(\mycdq_i\transpose G_i^{-1}\kxy(\mycdx_i,\mycdx_j) G_j^{-1}\mycdq_j
401: -{\|\mycdq_i\|\inorm\|\mycdq_j\|\jnorm\over\mycdr}),
402: \end{eqnarray*}
403: then we have the objective function in a kernel form,
404: \begin{equation}
405: W(\bi{\mylag}) = \sum_i \mycdg_i\mylag_i
406: -{1\over2}\sum_{i,j}\mylag_i\mylag_j (y_i y_j \mycds_{ij} - y_j \mycdt_{ij}-
407: y_i \mycdt_{ji}+\mycdu_{ij}),
408: \label{eq:qp}
409: \end{equation}
410: which is maximized under constraints
411: \begin{equation}
412: \label{eq:constrainta}
413: \mylag_i\ge0, \qquad \sum_i y_i\mylag_i = 0.
414: \end{equation}
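\begin{example}
In matrix notation, which is the form most QP solvers expect, (\ref{eq:qp}) can be written as follows; the symbols $Q$ and $\bi{g}$ are introduced here only for illustration:
\[
W(\bi{\mylag})=\bi{g}\transpose\bi{\mylag}-{1\over2}\bi{\mylag}\transpose Q\bi{\mylag},
\qquad
Q_{ij}=y_iy_j\mycds_{ij}-y_j\mycdt_{ij}-y_i\mycdt_{ji}+\mycdu_{ij},
\]
where $\bi{g}=(\mycdg_1,\ldots,\mycdg_n)\transpose$. Because $Q_{ij}$ is the inner product of the bracketed vectors in (\ref{eq:wt}) for indices $i$ and $j$, $Q$ is a Gram matrix and hence symmetric positive semidefinite. The problem therefore has the same structure as the dual of the original SVM, with the linear term $\sum_i\mylag_i$ replaced by $\sum_i\mycdg_i\mylag_i$ and the kernel matrix replaced by $Q$.
\end{example}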
415:
416: The new parameters can be determined from (\ref{eq:wt}) by
417: \begin{eqnarray}
418: \label{eq:newab}
419: a_i\kpth &=& \mylag_i y_i + \beta \mycda_i,\nonumber\\
420: \bib_i\kpth &=& -\mylag_i\left(y_i\mycdd_i+ {G_i^{-1}\mycdq_i\over
421: \mycdg_i \mycdr}\right) +\beta
422: \mycdb_i,
423: \end{eqnarray}
424: where
425: $ \beta = \sum_j{\mylag_j\|\mycdq_j\|\inorm/\mycdg_j\mycdr^2}$.
426:
As for the bias term $f_0$, since the constraint
(\ref{eq:constraint3}) should be satisfied with equality
for $i\in J=\{i\mid\mylag_i\ne0\}$ by
the Kuhn-Tucker condition, we have for any $i\in J$,
\begin{equation}
\label{eq:newf}
f_0\kpth = y_i \mycdg_i -\sum_j \mylag_j
(y_j \mycds_{ji} - \mycdt_{ji} - y_i y_j \mycdt_{ij} + y_i \mycdu_{ij}).
\end{equation}
436:
From (\ref{eq:newab}), we can estimate the number of support vectors.
Let $J_k$ be the set of indices of nonzero $\mylag_i$'s at the $k$-th step; then
the number of support vectors is bounded above by
$|J_0\cup J_1 \cup \cdots \cup J_k|$. Since $J_k$ does not
change much as long as the structure of the classification boundary
stays similar,
the number of support vectors is expected to be not much larger than
that of the original SVM.
445:
446: \subsection{Update of the approximate projection of the points}
To complete the algorithm, we have to consider the update of the approximate
projection point $\mycdx_i$, which is initialized by $\bix_i$; otherwise the solution at convergence is not exactly
what we want.
If good approximations $\mycdw$ and $\mycdf_0$ of
the solution are given, we can refine $\mycdx_i$
iteratively in the same way as in sec.~\ref{sec:d}.
Suppose $\mycdw=\sum_j \{\mycda_j \phi(\mycdx_j\myold) +
\mycdb_j\transpose \bipsi(\mycdx_j\myold)\}$;
then the projection point $\mycdx_i$ can be estimated by iterating
the following step for $l=0,1,2,3,\ldots$,
457: \begin{equation}
458: \label{eq:upmycdx}
459: \mycdx_i\mynext
460: = \bix_i -
461: {\mycdq_i\myprev\over\|\mycdq_i\myprev\|\inorm}
462: \left[\mycdp_i\myprev
463: - (\mycdx_i\myprev{}-\bix_i)\transpose
464: \mycdq_i\myprev + \mycdf_0\right]
465: \end{equation}
466: where $\mycdx_i^{[0]}$ is initialized by $\mycdx_i\myold$;
467: $\mycdp_i\myprev$ and $\mycdq_i\myprev$ are defined in a similar way as
468: $\mycdp_i$ and $\mycdq_i$,
469: \begin{eqnarray}
470: \mycdp_i\myprev &\equiv& \inner{\mycdw}{\phi(\mycdx_i\myprev)} \nonumber\\
471: &=&
472: \sum_j \{\mycda_j\knl(\mycdx_j\myold,\mycdx_i\myprev) + \mycdb_j\transpose
473: \kx(\mycdx_j\myold,\mycdx_i\myprev)\}, \nonumber \\
474: \mycdq_i\myprev &\equiv&
475: \inner{\mycdw}{\bipsi(\mycdx_i\myprev)}\nonumber\\
476: &=&\sum_j \{ \mycda_j
477: \kx(\mycdx_i\myprev, \mycdx_j\myold) +
478: \kxy(\mycdx_i\myprev,\mycdx_j\myold)\mycdb_j
479: \}.\nonumber
480: \end{eqnarray}
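\begin{example}
A simple sanity check of (\ref{eq:upmycdx}), sketched under the assumptions of a linear feature map $\phi(\bix)=\bix$ and the Euclidean metric $G_i=I$: then $\mycdw=\bi{w}\in\Re^m$ (the symbol $\bi{w}$ is used only in this example), $f(\bix)=\bi{w}\transpose\bix+\mycdf_0$, $\mycdp_i\myprev=\bi{w}\transpose\mycdx_i\myprev$ and $\mycdq_i\myprev=\bi{w}$. The terms containing $\mycdx_i\myprev$ cancel and the update becomes
\[
\mycdx_i\mynext=\bix_i-{\bi{w}\transpose\bix_i+\mycdf_0\over\|\bi{w}\|^2}\,\bi{w},
\]
which is the orthogonal projection of $\bix_i$ onto the hyperplane $f(\bix)=0$, reached in a single step from any starting point. For a curved boundary the linearization is only local, which is why the step has to be iterated.
\end{example}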
481:
Note that local maxima and saddle
points of the distance are also equilibrium states
of (\ref{eq:upmycdx}). The following proposition guarantees that
such points are not stable.
486: \begin{proposition}
487: A point $\mycdx_i\in {\Re}^m$ is an equilibrium state of the
488: iteration step (\ref{eq:upmycdx}), when and only when the point
489: is a critical point of the distance from $\bix_i$ to the
490: separating boundary, i.e.,
491: a local minimum, a local maximum or a saddle point.
492: The equilibrium state is not stable when the point is a
493: local maximum or a saddle point.
494: \end{proposition}
\textit{Proof:}
It is straightforward to show that a point is
an equilibrium state of the iteration step (\ref{eq:upmycdx})
only when the point is a critical point of the distance
$\|\bid_i\|\gnorm$. Without loss of generality,
we can assume the uniform metric case $G_i=I$, because the
update rule (\ref{eq:upmycdx}) is invariant under a metric transformation.
We consider the behavior around a critical point $\myftx_i$.
Let $\mycdx_i\myprev=\myftx_i+\bieps$
for a sufficiently small vector $\bieps$.
One can show that $\mycdx_i\myprev$ is mapped onto the separating
hypersurface $f(\bix)=\inner{\mycdw}{\phi(\bix)}+\mycdf_0=0$
for a small $\bieps$ after one iteration step.
Therefore, we only consider the
case in which $\mycdx_i\myprev$ is on the hypersurface.
510:
Since $\myftx_i$ is a critical point
of the distance, the normal vector $\nabla f(\myftx_i)$ is
collinear with the displacement vector $\bid_i=\myftx_i-\bix_i$, i.e.,
for some constant $\lambda$,
\begin{equation}
\nabla f(\myftx_i) = \lambda \bid_i.
\end{equation}
Furthermore, if $\mycdx_i\myprev$ lies on $f(\bix)=0$,
$\nabla f(\myftx_i)$ is nearly orthogonal to $\bieps$,
i.e.,
521: \begin{equation}
522: \nabla f(\myftx_i)\transpose \bieps \simeq 0.
523: \end{equation}
By expanding (\ref{eq:upmycdx}) around $\myftx_i$, we obtain
the new estimate $\mycdx_i\mynext$ as
\begin{equation}
\label{eq:mycdx}
\mycdx_i\mynext \simeq \myftx_i
+ {1\over\lambda}\nabla^2 f(\myftx_i)\bieps
- {\bid_i\transpose\nabla^2 f(\myftx_i)\bieps\over\lambda\|\bid_i\|^2}\bid_i,
\end{equation}
where $\nabla^2 f$ is the Hessian matrix of $f(\bix)$.
533: Without loss of generality, we can take the coordinate of $\bix$ as
534: follows: the first coordinate is the direction of $\bid_i$, and
535: the second to the $m$-th coordinates are taken orthogonally such that
536: an $(m-1)\times(m-1)$ submatrix of $\nabla^2 f(\myftx_i)$
537: for those coordinates is diagonalized, i.e., $\nabla^2 f(\myftx_i)$
538: is in the form,
539: \begin{equation}
540: \nabla^2 f(\myftx_i) = \left(
541: \begin{array}{cccc}
542: c_1 & & \bi{b}\transpose & \\
543: & c_2 & & 0 \\
544: \bi{b} & & \ddots & \\
545: & 0 & & c_m \\
546: \end{array} \right).
547: \end{equation}
Under this coordinate system,
since $\varepsilon_1$ is of higher order than the other components,
the first element calculated from the second and third terms in (\ref{eq:mycdx})
vanishes and we have
\begin{equation}
\mycdx_i\mynext - \myftx_i \simeq {1\over\lambda}
(0, c_2 \varepsilon_2,\ldots,c_m\varepsilon_m)\transpose.
\end{equation}
The iteration step is stable at $\myftx_i$ only when
$\|\mycdx_i\mynext-\myftx_i\|\le\|\bieps\|$ for every such $\bieps$, i.e.,
$|c_j|< |\lambda|$ for all $j=2,\ldots,m$. \hfill $\Box$
559:
The condition in the 1-$j$ plane is illustrated in figure~\ref{fig:stability}.
561:
562: \begin{figure}[tbhp]
563: \begin{center}
564: \includegraphics[width=.8\textwidth]{stab.eps}
565: \caption{Stability of projection point update}
566: \label{fig:stability}
567: \end{center}
568: \end{figure}
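\begin{example}
To illustrate the stability condition, consider (as an assumed toy case, not taken from the simulations) $m=2$, $G_i=I$, a circular boundary $f(\bix)=\|\bix\|^2-R^2$ and a sample $\bix_i=(s,0)\transpose$ with $0<s<R$. Here $\nabla f(\bix)=2\bix$ and $\nabla^2 f=2I$, so $c_2=2$. At the nearest point $\myftx_i=(R,0)\transpose$ we have $\bid_i=(R-s,0)\transpose$ and $\lambda=2R/(R-s)$, hence $|c_2|/|\lambda|=(R-s)/R<1$ and the equilibrium is stable. At the farthest point $\myftx_i=(-R,0)\transpose$ we have $\bid_i=(-R-s,0)\transpose$ and $\lambda=2R/(R+s)$, hence $|c_2|/|\lambda|=(R+s)/R>1$ and the equilibrium is unstable. The same factors follow directly from (\ref{eq:upmycdx}) with $\mycdp_i\myprev+\mycdf_0=f(\mycdx_i\myprev)$ and $\mycdq_i\myprev=\nabla f(\mycdx_i\myprev)$: for a point $\mycdx_i\myprev=R(\cos\theta,\sin\theta)\transpose$ on the circle the update gives
\[
\mycdx_i\mynext=\bix_i+\left(1-{s\over R}\cos\theta\right)\mycdx_i\myprev ,
\]
whose second coordinate is $R\sin\theta$ multiplied by $1-(s/R)\cos\theta$, i.e., approximately $(R-s)/R$ near $\theta=0$ and $(R+s)/R$ near $\theta=\pi$.
\end{example}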
569:
When the point is a local maximum or a saddle point, the hypersurface is in the unstable
region. However, even in the case of a local minimum, there exists an
unstable region when the hypersurface is strongly curved.
We can avoid this undesired behavior by slowing down the update.
For example, first $c_2,\ldots,c_m$ and $\lambda$ are estimated from
the values of $\nabla f$ and $\nabla^2 f$ at the current estimate;
then, if $c_j < |\lambda|$
for all $j=2,\ldots,m$, the point is regarded as a local minimum, and
the movement $\mycdx_i\mynext-\mycdx_i\myprev$
along each axis for which $c_j<-|\lambda|$ is
shrunk by multiplying it by some factor $0 < e_j < |\lambda|/|c_j|$.
581:
This computationally intensive
treatment is usually necessary only
after several steps, because
the instability for local minima is considered to occur only in a small region
relative to the size of $\bid_i$.
587:
588: \subsection{Projection of the hyperplane}
589: \label{sec:proj}
The update of $\mycdx_i$ causes another problem.
We assumed in section~\ref{sec:qp}
that $\wt$ and $\mycdw$ have the same bases.
However, $\mycdw$ has bases based on the old $\mycdx_i$, while
we need the new $\wt$ based on the new $\mycdx_i$.
To solve this problem, $\mycdw$ is projected onto the new bases, i.e.,
596: from the old one
597: $\mycdw\myold=\sum_{i\in \mathrm{S.V.}}\{\mycda\myold_i
598: \phi(\mycdx_i\myold) + \mycdb\myold_i{}\transpose\bipsi(\mycdx_i\myold)
599: \}$
600: to a new one,
601: $\mycdw\mynew=\sum_{i\in \mathrm{S.V.}}\{\mycda\mynew_i
602: \phi(\mycdx_i\mynew) + \mycdb\mynew_i{}\transpose\bipsi(\mycdx_i\mynew)\}$.
Although $\mycdw\mynew$ could have bases other than those of the support vectors,
we restrict the bases to support vectors to
preserve the sparsity of bases.

There are several possibilities for the projection.
In this paper, we use the one that minimizes the cost function
609: \begin{equation}
610: \label{eq:E}
611: {1\over2}\sum_{\bix\in T} \{\inner{\mycdw\mynew}{\phi(\bix)} + \mycdf_0\mynew -
612: (\inner{\mycdw\myold}{\phi(\bix)} + \mycdf_0\myold)\}^2,
613: \end{equation}
614: where $T$ is a certain set of $\bix$, and we use $T=$ $\{\bix_i$,
615: $\mycdx_i\myold$, $\mycdx_i\mynew$; $i=1,\cdots,n\}$.
616:
Minimizing (\ref{eq:E}) leads to a simple least-squares problem, which can
be solved by a set of linear equations.
Another possible cost function is
$\|\mycdw\mynew-\mycdw\myold\|^2$, which leads to another set of
linear equations.
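\begin{example}
One way to write the least-squares problem for (\ref{eq:E}) explicitly is sketched below; the stacked vectors $\bi{\theta}$ and $\bi{h}(\bix)$ and the function $f\myold$ are notation introduced for this example. Collect the unknowns as $\bi{\theta}=(\mycda\mynew_1,\mycdb\mynew_1{}\transpose,\ldots,\mycdf_0\mynew)\transpose$ and the corresponding basis values as
$\bi{h}(\bix)=(\knl(\mycdx_1\mynew,\bix),\kx(\mycdx_1\mynew,\bix)\transpose,\ldots,1)\transpose$, where the index runs over the support vectors, so that $\inner{\mycdw\mynew}{\phi(\bix)}+\mycdf_0\mynew=\bi{h}(\bix)\transpose\bi{\theta}$. Writing $f\myold(\bix)\equiv\inner{\mycdw\myold}{\phi(\bix)}+\mycdf_0\myold$, the minimizer of (\ref{eq:E}) satisfies the normal equations
\[
\Bigl\{\sum_{\bix\in T}\bi{h}(\bix)\bi{h}(\bix)\transpose\Bigr\}\bi{\theta}
=\sum_{\bix\in T}\bi{h}(\bix)\,f\myold(\bix).
\]
If the matrix on the left is singular or ill-conditioned, a pseudo-inverse or a small ridge term could be used; this is a practical suggestion rather than part of the method described here.
\end{example}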
622:
623: \subsection{Overall algorithm and the convergence property}
624: \label{sec:overall}
625:
626: Now let us summarize the algorithm below.
627: \par
628: \bigskip
629: \par
630: \noindent{\textbf{\strut Algorithm 1: Algorithm to maximize the margin
631: in the input space}}
632: \hrule
633: \strut Initialization step:
634: Let the solution of the original SVM be
635: $a_i\zeroth$ and $f_0\zeroth$;
636: let $\bib_i\zeroth=\mathbf0$ and $\mycdx_i\zeroth=\bix_i$.
637: \par\noindent
638: For $k=0,1,2,\ldots$, repeat the following steps until convergence:
639: \begin{enumerate}
640: \item Update of $\mycdx_i$:
641: Calculate $\mycdx_i\kpth$ by
642: applying (\ref{eq:upmycdx}) iteratively to $\mycdx_i\kth$.
643: \item Projection of hyperplane:
644: Calculate $\mycda_i$, $\mycdb_i$ and $\mycdf_0$ based on
645: $\mycdx_i\kpth$ by
646: a certain projection method from $a_i\kth$, $\bib_i\kth$ and $f_0\kth$
647: based on $\mycdx_i\kth$ (sec.\ref{sec:proj}).
648: \item QP step: Solve the QP problem (\ref{eq:qp})
649: with respect to $\mylag_i$.
650: \item Parameter update:
651: Calculate $a_i\kpth$, $\bi{b}_i\kpth$ and $f_0\kpth$ by
652: (\ref{eq:newab}) and (\ref{eq:newf}).
653: \end{enumerate}
654: The discriminant function at the $k$-th step is given by (\ref{eq:f}).
655: \par\smallskip
656: \hrule
657: \bigskip
658:
Although Algorithm 1 does not always converge to the global minimum,
we can prove the following proposition concerning the convergence
of the algorithm.
662: \begin{proposition}
663: Equilibrium points of Algorithm 1 are critical points of the margin in
664: the input space.
665: The algorithm is stable, when the update rule of $\mycdx_i$ (\ref{eq:upmycdx})
666: is stable for all $i$ (see also Proposition 1).
667: \end{proposition}
This proposition can be proved basically by Proposition 1 and the fact that
the linearization in the QP is almost exact under a small
perturbation of $\wt$.
As in the case of (\ref{eq:upmycdx}), we can modify the algorithm by
slowing down in (\ref{eq:d}) and (\ref{eq:upmycdx}) so that
the equilibrium state is stable when and only when the margin
is locally optimal.
However, we do not use this modification in the simulation because the case
in which a local minimum is unstable is expected to be rare.
677:
Another problem of Algorithm 1 is that each iteration step does not
always increase the margin monotonically.
Although it is usually faster than gradient-type algorithms,
the algorithm sometimes does not improve on the solution of the original
SVM at all.
Because the original SVM can be seen as a special case of the algorithm,
we can use some annealing technique, for example, updating the temporary
variables and parameters more gradually from their initial values.
However, for simplicity, we use a crude method in the simulation
as follows: repeat several
steps of the algorithm (5 steps in the simulation) and then choose
the solution that gives the largest estimated value of the margin.
690:
As for the complexity of the algorithm, we need $O(m^2 n^2)$ space
and $O(m^3 n^2)$ time to calculate the temporary variables
if the computation of a kernel function is $O(m)$,
while the original SVM requires $O(n^2)$ space and $O(m n^2)$ time.
These calculations can be parallelized easily.
The complexity is not so different when $m$ is comparatively small.
Once the variables are calculated, the complexity of the QP is just the same.
Therefore, as long as the time to calculate the temporary variables
is comparable to the QP time,
the proposed algorithm is comparable to the original SVM.
If Algorithm 1 is too costly because $m$ is large, we can use
a simplified algorithm as shown in section~\ref{sec:simple}.
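\begin{example}
For a rough sense of scale (ignoring the constant factors hidden by the $O$-notation), take the setting of section~\ref{sec:simulation}, $m=2$ and $n=20$. The matrices $\kxy(\mycdx_i,\mycdx_j)$ alone contain $m^2n^2=4\times400=1600$ entries, compared with $n^2=400$ kernel values for the original SVM, and the time estimates are $m^3n^2=3200$ versus $mn^2=800$. Both ratios equal $m^2=4$, so the overhead is modest in this low-dimensional case but grows quadratically with $m$.
\end{example}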
703:
As for the iteration of the QP, which is usually carried out for only a few steps,
since the current solution is an estimate of the next solution,
it may be possible to use it to reduce the complexity
of the QP at the next iteration step.
708:
709: \section{Simulation results}
710: \label{sec:simulation}
711:
In this section, we give simulation results for
artificial data sets in order to verify the proposed algorithm
and to examine its basic performance.
Twenty training samples and 1000 test samples are randomly drawn from
positive and negative distributions, each of which is a
Gaussian mixture of 3 components with
centers uniformly distributed in $[0,1)^2$ and
fixed spherical variance $\sigma^2=0.2^2$.
The kernel function used here is a spherical Gaussian kernel with
$\sigma^2=1^2$.
The metric is taken to be Euclidean (i.e., $G_i$ is the unit matrix).
Figures \ref{fig:svm} and \ref{fig:alg1}
show an example of the results of the original SVM
(initial condition) and the proposed algorithm (after 5 steps).
In this case, the margin value increases from 0.040 to 0.096.
Such a simulation is repeated for 100 sets of samples with different random
numbers.
729:
The estimated margins
in the input space for the original and proposed
algorithms are shown in Figure \ref{fig:margin} (log-log scale).
With the crude method described in the
previous section, 4 of the 100 runs fail to improve on the
margin of the original SVM. The ratio of the margins ranges
from 1.00 (no improvement) to 27.9.
737:
The misclassification errors
for the test samples are shown in Figure \ref{fig:error}.
The ratio of the errors ranges from 0.40 (best) to 1.37 (worst).
741:
These results indicate that enlarging the margin in the input space
is effective in improving the generalization performance on average, but
there are cases in which the generalization error is not reduced
even though the margin in the input space increases.
746:
747: \begin{figure}[tbhp]
748: \includegraphics[width=.8\textwidth]{origsvm-r.eps}
\caption{Result of the original SVM (margin 0.040).
750: Circles ($\circ$) and crosses ($\times$) are positive and negative
751: samples. Squares ($\Box$) represent estimates of the projection
752: of the points by applying (\ref{eq:upmycdx}) for 10 steps.}
753: \label{fig:svm}
754: %
755: \end{figure}
756:
757: \begin{figure}[tbhp]
758: \includegraphics[width=.8\textwidth]{5step-r.eps}
\caption{Result of Algorithm 1 (after 5 steps, margin 0.096)
for the same data set as in Figure \ref{fig:svm}.}
761: \label{fig:alg1}
762: %
763: \par\bigskip
764: \end{figure}
765:
766: \begin{figure}[tbhp]
767: \includegraphics[width=.8\textwidth]{mar-r2.eps}
768: \caption{Margin comparison with the original SVM for 100 runs
769: (log-log scale)}
770: \label{fig:margin}
771: \end{figure}
772:
773: \begin{figure}[tbhp]
774: \includegraphics[width=.8\textwidth]{err-r2.eps}
775: \caption{Test error comparison with the original SVM for 100 runs}
776: %
777: \label{fig:error}
778: %
779: \par\bigskip
780: \end{figure}
781:
782: \section{Soft margin}
783: \label{sec:soft}
784:
In noisy situations, the hard margin classifier often overfits the
samples.
There are several possible ways to incorporate a soft margin;
here we give a simple one.
The soft margin can be derived by introducing slack variables $z_i$
into the optimization problem.
We relax the constraint to the soft form
\begin{equation}
\label{eq:constraint5}
\inner{\wt}{[y_i\ \{\phi(\mycdx_i) -
\bipsi(\mycdx_i)\transpose\mycdd_i
\}-\mycdeta_i]}\ge \mycdg_i-f_0 y_i - z_i,
\end{equation}
and add a penalty on the slack variables to the objective function,
which becomes
\begin{equation}
{1\over2}\inner{\wt}{\wt} + C\sum_i z_i.
\end{equation}

With this modification, only the constraint (\ref{eq:constrainta}) on
$\mylag_i$ changes, to
805: \begin{equation}
806: 0\le\mylag_i\le C, \qquad \sum_i y_i\mylag_i = 0,
807: \end{equation}
808: which is the same constraint as the soft margin of the original SVM.
However, the geometrical meaning of (\ref{eq:constraint5}) in the input space
is not clear. Introducing a natural soft constraint
in the input space is left for future work.
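
The origin of the box constraint on $\mylag_i$ can be traced with a
short sketch. It assumes that the slack variables satisfy $z_i\ge0$ and
that $\mycdx_i$, $\mycdd_i$, $\mycdeta_i$ and $\mycdg_i$ are held fixed
while the QP is solved. The shorthand $\bi{h}_i$ for the bracketed
vector in (\ref{eq:constraint5}) and the multipliers $\mu_i\ge0$ for
the constraints $z_i\ge0$ are introduced here only for illustration.
The Lagrangian is then
\[
L = {1\over2}\inner{\wt}{\wt} + C\sum_i z_i
 - \sum_i \mylag_i \left(\inner{\wt}{\bi{h}_i} - \mycdg_i + f_0 y_i + z_i\right)
 - \sum_i \mu_i z_i ,
\]
with $\mylag_i\ge0$. Setting the derivatives with respect to $z_i$ and
$f_0$ to zero gives
\[
C - \mylag_i - \mu_i = 0, \qquad \sum_i y_i\mylag_i = 0,
\]
and $\mu_i\ge0$ then implies $0\le\mylag_i\le C$.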
812:
\section{Simplified algorithm for a high-dimensional case}
814:
815: \label{sec:simple}
816:
Although Algorithm 1 gives the precise solution, its computational
cost is high when the dimensionality of the inputs is large.
In this section, we give a simplified algorithm.
820:
If we do not update $\mycdx_i$, the first and second steps of Algorithm 1
are no longer necessary. This makes Algorithm 1
a little simpler because all the $\mycdd_i$ terms vanish.
However, let us consider a further simplification.
825:
We have shown the relation to the original SVM:
the original SVM is obtained by setting $\mycdg_i=1$ and $\mycdeta_i=0$.
Since $\mycdeta_i$ requires many temporary variables,
we keep only $\mycdg_i$.
Then all the terms related to the $\mycdb_i$ vanish.
831:
Consequently,
the above simplifications lead to an algorithm very similar to the original
SVM. In fact, existing code for the original SVM can be used as follows.
835:
At each step, $\mycdg_i$ is first calculated as
\begin{equation}
\mycdg_i = {\|\sum_j a_j\kth\kx(\bix_i,\bix_j)\|\sinorm\over
\sqrt{\sum_{j,k}a_j\kth a_k\kth \knl(\bix_j,\bix_k)}}.
\end{equation}
Then, by letting the $(i,j)$ element of the kernel matrix be
$\knl(\bix_i,\bix_j) / (\mycdg_i\mycdg_j)$, the original SVM applied to this
kernel matrix gives the solution for each step of the simplified algorithm.
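
One way to see why existing SVM code applies is sketched below, under
the assumption that every $\mycdg_i$ is positive. Write $K$ for the
kernel matrix with elements $\knl(\bix_i,\bix_j)$ and
$\Gamma={\rm diag}(\mycdg_1,\mycdg_2,\ldots)$; both symbols are
introduced here only for illustration. The rescaled matrix is then
$\tilde K=\Gamma^{-1}K\,\Gamma^{-1}$, and for any vector $\bi{v}$,
\[
\bi{v}\transpose \tilde K\, \bi{v}
 = (\Gamma^{-1}\bi{v})\transpose K\, (\Gamma^{-1}\bi{v}) \ge 0 ,
\]
so $\tilde K$ is positive semidefinite whenever $K$ is, and the QP
solved by the original SVM remains convex.
With $\mycdg_i=1$ for all $i$, $\tilde K=K$ and the original SVM is
recovered.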
844:
845: \section{Conclusion}
846: We have proposed a new learning algorithm to find a kernel-based
847: classifier that maximizes the margin in the input space.
The derived algorithm consists of an alternating optimization between
the feet of the perpendiculars and the linear coefficient parameters.
Such a dual structure appears in other frameworks, such as
the EM algorithm, variational Bayes, and principal curves.
852:
Many issues concerning the algorithm remain to be studied, for example,
analyzing the generalization performance theoretically and
finding an efficient algorithm that reduces the complexity and
converges more stably.
It is also interesting to extend our framework to
problems other than classification, such as regression \cite{akaho,otsu,mueller}.
859:
In this paper, we have assumed that the kernel function is given and fixed.
Recently, many techniques and criteria for choosing a kernel function
have been proposed. We expect that
those techniques and other knowledge about the original SVM
can be incorporated into our framework.
Applying the algorithm to real-world data is also important.
866:
867: \begin{thebibliography}{12}
868: \bibitem{akaho} S. Akaho, Curve fitting that minimizes the mean square of
869: perpendicular distances from sample points, {\it SPIE Vision Geometry
870: II} (also found in {\it Selected SPIE Papers on CD-ROM},
871: 8, 1999), 237--244 (1993)
872:
873: \bibitem{amari}
874: S. Amari, {\it Differential Geometrical Methods in
875: Statistics}, Springer-Verlag (1984)
876:
877: \bibitem{cortes}
C. Cortes and V.N. Vapnik, Support-vector networks,
879: {\it Machine Learning}, 20, pp. 273--297 (1995)
880:
881: \bibitem{decoste}
882: D. DeCoste and B. Sch\"olkopf, Training invariant
883: support vector machines, {\it Machine Learning}, 46(1),
884: pp. 161--190 (2002)
885:
886: \bibitem{hastie}
887: T. Hastie and W. Stuetzle, Principal curves,
888: {\it Journal of the American Statistical Association}, 84(406),
889: pp. 502--516 (1989)
890:
891: \bibitem{jaakkola}
892: T.S. Jaakkola and D. Haussler, Exploiting generative
893: models in discriminative classifiers, {\it NIPS 11},
894: pp. 487--493 (1998)
895:
896: \bibitem{mueller}
K.R. M\"uller, S. Mika, G. R\"atsch, K. Tsuda,
B. Sch\"olkopf, An Introduction to Kernel-Based Learning
899: Algorithms, {\it IEEE Trans. on Neural Networks}, 12,
900: pp. 181--201 (2001)
901:
902: \bibitem{otsu}
N. Otsu, Karhunen-Loeve line fitting and a linearity
measure, in {\it IEEE Proc. of ICPR'84}, pp. 486--489 (1984)
905:
906: \bibitem{ramsey}
J.O. Ramsay, B.W. Silverman, {\it Functional Data Analysis},
908: Springer-Verlag (1997)
909:
910: \bibitem{simard}
911: P.Y. Simard, Y.A. Le Cun, J.S. Denker, B. Victorri,
912: Transformation Invariance in Pattern Recognition -- Tangent
913: Distance and Tangent Propagation, in {\it Neural Networks:
914: Tricks of the Trade}, G. Orr and K.-R. M\"uller, eds.,
Springer-Verlag, vol. 1524, pp. 239--274 (1998)
916:
917: \bibitem{tsuda}
918: K. Tsuda, M. Kawanabe, G. R\"atsch, S. Sonnenburg, K.R. M\"uller,
919: A New Discriminative Kernel from Probabilistic Models,
920: {\it NIPS 14} (2001)
921:
922: \bibitem{vapnik}
923: V.N. Vapnik, {\it The Nature of Statistical
924: Learning Theory}, Springer-Verlag (1995)
925: \end{thebibliography}
926:
927: \end{document}
928:
929: