1: \documentclass[a4,12pt]{article}
2: \usepackage{latexsym}
3: \oddsidemargin=0cm
4: \evensidemargin=0cm
5: \textwidth=16cm
6: \paperwidth=21cm
7: \textwidth=18.6cm
8: %\textheight=24.7cm
9: \oddsidemargin=-0.5in
10: \evensidemargin=-0.5in
11: %\topmargin=-0.6in
12: \usepackage{amsmath,amstext,amsfonts}
13: \def\bm#1{\mbox{\boldmath $#1$}}
14: \def\teigi{\stackrel{\rm def}{=}}
15: \def\hatena{\stackrel{\boldmath ?}{=}}
16: %\bibliographystyle{mybstfeb96}
17: %\bibliographystyle{mybst1996}
18: %\bibliographystyle{bstforNEu}
19: %\bibliographystyle{BSTforNEU}
20: %\bibliographystyle{apalike}
21: %\bibliographystyle{apahack}
22: %%
23: \makeatletter
24: \renewcommand{\theequation}{%
25: \thesection.\arabic{equation}}
26: \@addtoreset{equation}{section}
27: \makeatother
28: \tolerance=6000
29:
30:
31:
32:
\title{Multiplicative Nonholonomic/Newton-like Algorithm}
34: \author{Toshinao {\sc
35: Akuzawa}\thanks{akuzawa@islab.brain.riken.go.jp}\vspace{0.3cm}\\
36: and\vspace{0.3cm}\\
37: Noboru {\sc Murata}
38: \vspace{0.5cm}\\
39: Brain Science Institute \\
40: {\it RIKEN}\\
41: %%{\small(The Institute of Physical and Chemical Research)}\\
42: {\small 2-1 Hirosawa, Wako-shi, Saitama 351-0198, Japan}}
43: \date{{\it October 19, 1999}}
44:
45: \begin{document}
46: \maketitle
47: \abstract{We construct new algorithms from scratch,
48: which use the fourth order cumulant of stochastic
49: variables for the cost function.
50: The
51: multiplicative updating rule
52: here constructed is natural from the
53: homogeneous nature of the Lie group
54: and has numerous merits for
55: the rigorous treatment of the
56: dynamics.
57: As one consequence, the second order convergence is shown.
For the cost function,
functions invariant under
componentwise scaling
are chosen.
62: By
63: identifying
64: points which can be transformed to each other by the scaling,
65: we assume that the dynamics is in a coset space.
66: In our method, a point can move toward any direction in this coset.
67: Thus,
68: no prewhitening is required.
69: }
70: \section{Introduction}
71: \label{intro}
72: Suppose that $N$-dimensional stochastic
73: variables $\{X_i|1\le i \le N\}$ are observed.
74: The independent component analysis (ICA) pursues a map
75: $X \mapsto Y$, where each component of $Y$ becomes mutually independent.
76: In this letter we restrict ourselves to
77: the linear independent component analysis.
78: There
79: we want to find a linear transformation $C:{\bf X}=(X_1,\cdots,X_N)'\mapsto
80: {\bf Y}=(Y_1,\cdots,Y_N)'=C{\bf X}$ which
81: minimizes some cost function that measures the independence.
Hereafter we denote the transposition by the superscript $\prime$ and
the complex conjugate by $\dagger$.
84:
85: There can be many candidates for the cost function.
86: For example
87: the Kullback-Leibler information
88: is a good measure for the independence.
89: In this case
90: the problem is translated to
91: the minimization of
92: $ -\sum_{i=1}^N\int dy_i P_i(y_i)\ln P_i(y_i)$, where
93: $P_i$ is the probability density function of the $i$-th component.
To find the optimal solution we must therefore evaluate the $P_i$'s.
Robust estimation
of the probability density functions is not an easy task
and, even when it is possible, it may be computationally expensive.
98:
99: An alternative idea is to make use of the cumulant of the fourth
100: order, or the kurtosis\cite{hyvarinen1}, which we will adopt in this letter.
The fourth order cumulant vanishes for
the normal distribution, so this cost function is robust against
Gaussian random noise.
104: We will construct algorithms where a matrix, which specifies the
105: linear transformation, is updated by the left-multiplication of a
106: matrix $D={\rm e}^{\Delta}$.
107: This expression implies that $D$ belongs to
108: $GL(N,{\boldmath R})$ (more accurately,
109: the component of $GL(N,{\boldmath R})$ connected to the unit element),
110: which ensures the
111: conservation of the rank.
112: The specification of $D$ by the coordinate $\Delta$
113: has many advantages
114: since it has a compatibility with the homogeneous nature of the Lie group.
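As a brief reminder of why the rank is conserved (a standard identity, stated here only for illustration), recall that for any real $N\times N$ matrix $\Delta$
\[
\det {\rm e}^{\Delta}={\rm e}^{{\rm tr}\,\Delta}>0~,
\]
so that ${\rm e}^{\Delta}$ is always invertible and $\det({\rm e}^{\Delta}C)$ has the same sign as $\det C$.
In particular, when ${\rm tr}\,\Delta=0$ the determinant itself is unchanged by the update.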
115:
The form of the cost function can vary.
Our definitions, given in the following two sections,
are chosen to be invariant under componentwise scaling.
119: This invariance is crucial for
120: a rigorous treatment of the convergence properties.
Moreover, this invariance allows us to
identify
points in $GL(N,{\boldmath R})$ which are transformed to each
other by the
scaling.
126: Then we can legitimately restrict the dynamics to a coset space
127: which is introduced by this identification.
128:
129: Under these settings, we determine $\Delta$ by using the Newton method
130: for the second order
131: expansion of the cost function with respect to $\{\Delta_{ij}\}$. It
132: is assumed
133: that the diagonal elements of $\Delta$ are zeros,
134: which does not impose any restrictions.
135: That is, a point can move toward any direction in this coset by a
136: left-multiplication of ${\rm e}^{\Delta}$.
Thus
it is not necessary for our method to prewhiten the data.
139: It is also not required
140: that the
141: optimal solution is the maximum or the minimum of the
142: cost function. Indeed, the sole requirement is that
143: the optimal point is a saddle point of the cost function
144: since our method
145: is in principle the Newton method.
146: These are great advantages of our method.
147:
148: %This property is unique to our method and
149: %that does not causes any serious problem if the starting point is
150: %close enough.
151:
152:
153:
154: Our strategy is as follows.
155: As an initial condition we set $C_0$.
156: For $t>0~(t\in{\bf N}^{+})$,
157: we introduce an $N\times N$ matrix $\Delta_t$ and
158: denote $C_{t}$ as $C_{t}={\rm e}^{\Delta_{t}}C_{t-1}$.
159: Next, we evaluate the cost function at $C_{t}$
160: by using the expansion around $C_{t-1}$
161: with respect to the elements of
162: $\Delta_{t}$ up to the second order.
Then $\Delta_t$ is chosen as a saddle point of
this second order
expansion.
166: We iteratively follow these procedures until we obtain a satisfactory
167: solution.
168:
169:
170: This letter is organized as follows.
171: In Section \ref{kurt1} the main part of our algorithm is constructed,
172: where the cost function is essentially identical to the sum of
173: kurtoses.
174: We adopt the square of the kurtoses for the cost function
175: in Section \ref{kurt2}.
176: Explicit expressions for the optimal
177: $\Delta$ (up to the second order)
178: are obtained both in Sections \ref{kurt1} and \ref{kurt2}.
179: Section \ref{iteration} is a short section where we show how
180: each updating step is combined to obtain the optimal $C$.
181: In Section \ref{secconv} the convergence property of our algorithm is
182: discussed. Section \ref{disc} contains conclusions and discussions.
183: \section{Multiplicative update algorithm}
184: \label{kurt1}
185: \subsection{Expansion of the cost function }
186: Let us start by defining the cost function:
187: \begin{eqnarray}
188: \label{eq:e1}
189: &&f(C,X)=\sum_i f_i(C,X)~,
190: \end{eqnarray}
191: where $f_i$'s are the fourth order moments
192: of components
193: divided by the square of their variances,
194: \begin{eqnarray}
195: \label{eq:e1.1}
196: && f_i(C,X)=\frac{E((CX)_i^4)}{E((CX)_i^2)^2}~.
197: \end{eqnarray}
198: In this letter we denote by $E(A)$ the expectation of
199: $A$.
Obviously
the cost function $f$ coincides with the sum of kurtoses of all the components
up to an additive constant.
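The invariance under componentwise scaling can be checked directly.
The following short verification is ours and uses an arbitrary diagonal matrix $\Lambda={\rm diag}(\lambda_1,\cdots,\lambda_N)$ with $\lambda_i\ne 0$, a symbol introduced only for this illustration:
\[
f_i(\Lambda C,X)
=\frac{E(\lambda_i^4(CX)_i^4)}{E(\lambda_i^2(CX)_i^2)^2}
=\frac{\lambda_i^4E((CX)_i^4)}{\lambda_i^4E((CX)_i^2)^2}
=f_i(C,X)~.
\]
If, in addition, the components are assumed to have zero mean, the kurtosis of the $i$-th component is $f_i(C,X)-3$, so that $f$ and the sum of kurtoses differ by the constant $3N$.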
203: We set $D={\rm e}^{\Delta}$ and
204: expand $f(D,Y)$ %(\ref{eq:e1})
205: in terms of the elements of $\Delta$.
206: %and $K={\rm e}^{-\Delta}-1$,
For example, the moments appearing in $f_i$ are expanded as follows:
208: \begin{eqnarray}
209: \label{eq:e2}
210: E((DY)_i^4)
211: &=&
212: E(Y_i^4)+4\sum_{p}(\Delta_{ip}+(\frac{\Delta^2}{2})_{ip})E(Y_i^3Y_p)
213: +6\sum_{p,q}\Delta_{ip}\Delta_{iq}E(Y_i^2Y_pY_q)+O(\Delta^3)~\nonumber\\
214: %\end{eqnarray}
215: %\begin{eqnarray}
216: % \label{eq:3}
217: E((DY)_i^2)
218: &=&
219: E(Y_i^2)+2\sum_{p}(\Delta_{ip}+(\frac{\Delta^2}{2})_{ip})E(Y_iY_p)
220: +\sum_{p,q}\Delta_{ip}\Delta_{iq}E(Y_pY_q)+O(\Delta^3)~.
221: \end{eqnarray}
Hereafter we denote by
$O(\Delta^k)$ polynomials of matrix elements of $\Delta$ which
do not contain terms of degree less than $k$.
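The origin of the combination $\Delta+\Delta^2/2$ in (\ref{eq:e2}) may be seen as follows (a sketch of the intermediate step, which the text leaves implicit).
Since ${\rm e}^{\Delta}=1+\Delta+\Delta^2/2+O(\Delta^3)$, where $1$ denotes the $N\times N$ unit matrix,
\[
(DY)_i=Y_i+\sum_p\Big(\Delta+\frac{\Delta^2}{2}\Big)_{ip}Y_p+O(\Delta^3)~,
\]
and squaring gives
\[
(DY)_i^2=Y_i^2+2\sum_p\Big(\Delta+\frac{\Delta^2}{2}\Big)_{ip}Y_iY_p
+\sum_{p,q}\Delta_{ip}\Delta_{iq}Y_pY_q+O(\Delta^3)~.
\]
Taking the expectation yields the expansion of $E((DY)_i^2)$.
The expansion of $E((DY)_i^4)$ follows in the same way from the fourth power, where the binomial coefficients $4$ and $6$ appear.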
225: For brevity's sake
226: we introduce the following notations:
227: \begin{eqnarray}
228: \label{eq:e3.1}
229: && \sigma_i^{(k)}=|E(Y_i^k)|^{1/k}~,\\
230: && R^{(k)}_{pi}=\frac{E(Y_i^k Y_p)}{(\sigma^{(2)}_i)^{k+1}}~,\\
231: && U^{(k,i)}_{pq}=\frac{E(Y_i^kY_p Y_q)}{(\sigma^{(2)}_i)^{k+2}}~,
232: \end{eqnarray}
233: and
234: \begin{eqnarray}
235: \label{eq:e3.2}
236: && \kappa_i={(\sigma^{(4)}_i)^4}/{(\sigma^{(2)}_i)^4}~.
237: \end{eqnarray}
238: Using the quantities defined above we can show that the
239: cost function is expanded as
240: \begin{eqnarray}
241: \label{eq:e4}
242: f_i(D,Y)
243: &=&\bigg[
244: \kappa_i+4\big[(\Delta+\frac{\Delta^2}{2})R^{(3)}\big]_{ii}
245: +6\big[
246: \Delta U^{(2,i)}\Delta'
247: \big]_{ii}
248: +O(\Delta^3)
249: \bigg]\nonumber\\
250: &&~~\times
251: \bigg[
252: 1-4\big[(\Delta+\frac{\Delta^2}{2})R^{(1)}\big]_{ii}
253: -2\big[
254: \Delta U^{(0,i)}\Delta'
255: \big]_{ii}
256: +12\big[
257: \Delta R^{(1)}
258: \big]_{ii}^2
259: +O(\Delta^3)
260: \bigg]\nonumber\\
261: &=&\kappa_i - 4\big[(\Delta+\frac{\Delta^2}{2})(\kappa_i
262: R^{(1)}-R^{(3)})\big]_{ii}
263: +2\big[
264: \Delta (3U^{(2,i)}-\kappa_i U^{(0,i)})\Delta'
265: \big]_{ii}\nonumber\\
266: &&~~
267: +12\kappa_i\big[
268: \Delta R^{(1)}
269: \big]_{ii}^2
270: -16\big[
271: \Delta R^{(1)}
272: \big]_{ii}\big[
273: \Delta R^{(3)}
274: \big]_{ii}+O(\Delta^3)~
275: \end{eqnarray}
276: by straightforward calculations.
277: Next, we evaluate partial derivatives of the cost function
278: by the matrix elements of $\Delta$.
279: %We need only terms up to $O(\Delta^2)$.
280: Partially differentiating (\ref{eq:e4}),
281: %It follows that the partial derivative of $f(C,Y)$ becomes
282: we get an expression,
283: \begin{eqnarray}
284: \label{eq:e5}
285: && \frac{\partial f({\rm e}^{\Delta},Y)}{\partial \Delta_{kl}}=
286: -4\big[K-R^{(3)}\big]_{lk}
287: -2\big[(K-R^{(3)})\Delta+\Delta(K-R^{(3)})\big]_{lk}\nonumber\\
288: &&+4\big[
289: (3U^{(2,k)}-\kappa_k U^{(0,k)})\Delta'
290: \big]_{lk}
291: +24K_{lk}\big[\Delta R^{(1)}
292: \big]_{kk}
293: -16R^{(1)}_{lk}\big[\Delta R^{(3)}
294: \big]_{kk}
295: -16 R^{(3)}_{lk}\big[\Delta R^{(1)}
296: \big]_{kk}\nonumber\\
297: &&+O(\Delta^2)~,
298: \end{eqnarray}
299: where $K$ is an $N\times N$ matrix defined by
300: \begin{eqnarray}
301: \label{eq:e5.9}
302: &&K_{pq}=\kappa_q R^{(1)}_{pq}~.
303: \end{eqnarray}
We want to determine $\Delta$ for which
the partial derivatives
of the cost function
with respect to $\Delta_{kl}~(k\ne l)$
vanish, under the condition that
$\Delta_{ii}=0$ for $1\le i \le N$.
311: We neglect $O(\Delta^3)$ terms in the cost function.
312: Thus the right-hand side of (\ref{eq:e5}) is
313: regarded as a polynomial of
314: % the elements of $\Delta$
315: $\{\Delta_{kl}\}$
of at most first order, and it is always possible
in principle to
determine $\Delta$ which satisfies the above condition.
319: % for which
320: % (\ref{eq:e5}) vanishes.
321: It is, at the same time, not easy to describe the problem
322: in a form which is valid
323: for
324: arbitrary $N$.
325: In the following subsection we will introduce a transparent and unified
326: method for handling the partial derivatives of $f$.
327: %Before this subsection by
We close this subsection by
329: introducing $N\times N$ matrices
330: \begin{eqnarray}
331: \label{eq:e6}
332: && V^{(i)}=3U^{(2,i)}-\kappa_i U^{(0,i)}~
333: \end{eqnarray}
334: and
335: \begin{eqnarray}
336: \label{eq:e6.1}
337: % && Q=R^{(1)}-R^{(3)}~.
338: && Q=K-R^{(3)}~
339: \end{eqnarray}
340: for later convenience.
341: \subsection{Expression by tensor product and determination of $\Delta$}
The expression (\ref{eq:e5}) is quite complicated and not
convenient for our purpose,
namely to determine $\Delta$ for which
all the partial derivatives vanish.
346: Fortunately by mapping the relations between elements of
347: $N\times N$ matrices to those of $N^2\times
348: N^2$ matrices, we can handle the problem transparently.
349: %, the problem can be rewritten in a general form.
350: Some preparations
351: are needed.
352: First, let us introduce a map $\rm cs$:
353: \begin{eqnarray}
354: \label{eq:a14}
355: {\rm Mat}(N,{\boldmath F}) &\rightarrow& {\boldmath F}^{N^2}\nonumber\\
356: A=\left(
357: \begin{array}{cccc}
358: A_{11}& A_{12}&\cdots &A_{1N}\\
359: A_{21} &\multicolumn{3}{c}{\dotfill}\\
360: \multicolumn{4}{c}{\dotfill}\\
361: A_{N1} &\multicolumn{2}{c}{\dotfill}&A_{NN}
362: \end{array}
363: \right) &\mapsto&
364: {\rm cs}(A)=
365: (A_{11}~ A_{21}~ \cdots~ A_{N1}~ A_{12}~ A_{22}~\cdots~ A_{NN})'~,\nonumber\\
366: \end{eqnarray}
367: where $\boldmath F$ is an unspecified field.
368: We also introduce
369: two useful operators $T$ and $P$.
370: The ``intertwiner'' $T$ is an $N^2\times N^2$ matrix
371: defined by
372: \begin{eqnarray}
373: \label{eq:a15}
374: {\rm cs}(A')=T{\rm cs}(A) ~\mbox{\rm for~} A\in {\rm Mat}(N,{\boldmath F})~.
375: \end{eqnarray}
376: The projection operator $P$,
377: \begin{eqnarray}
378: \label{eq:a18}
379: P&=&{\rm diag}(p_1,\cdots,p_{N^2})~,\nonumber\\
380: &&\left\{
381: \begin{array}{ll}
382: p_k=1 ~~~\mbox{\rm for}~~ k=N(i-1)+i,1\le i\le N~\\
383: p_k=0~~~~ \mbox{\rm otherwise}~,
384: \end{array}
385: \right.
386: \end{eqnarray}
387: is used to extract the ``diagonal''
388: elements of a matrix from its image by $\rm cs$.
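For concreteness, consider $N=2$ (an illustrative special case).
Then
\[
{\rm cs}(A)=(A_{11}~A_{21}~A_{12}~A_{22})'~,\quad
T=\left(\begin{array}{cccc}1&0&0&0\\0&0&1&0\\0&1&0&0\\0&0&0&1\end{array}\right)~,\quad
P={\rm diag}(1,0,0,1)~.
\]
Indeed $T{\rm cs}(A)=(A_{11}~A_{12}~A_{21}~A_{22})'={\rm cs}(A')$, and $P{\rm cs}(A)=(A_{11}~0~0~A_{22})'$ retains only the diagonal elements.
In general the element $A_{ij}$ occupies the $(i+N(j-1))$-th position of ${\rm cs}(A)$, and $T$ is a symmetric permutation matrix whose square is the unit matrix.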
389:
In this setting we can rewrite (\ref{eq:e5}) as
391: \begin{eqnarray}
392: \label{eq:e7}
393: \frac{\partial f({\rm e}^{\Delta},Y)}{\partial \Delta_{kl}}&=&
394: \bigg[ -4{\rm cs}(Q)
395: -2\big[I_N\otimes Q+T(I_N\otimes Q')T\big]{\rm cs}(\Delta)
396: +4
397: \big\{\bigoplus_{i=1}^N V^{(i)}\big\}
398: {\rm cs}(\Delta')
399: \nonumber\\&&
400: +
\bigg\{24(I_N \otimes K)P(I_N\otimes R^{(1)})'
-16 ( I_N \otimes R^{(1)})P(I_N\otimes R^{(3)})'\nonumber\\
&&-16 (I_N\otimes R^{(3)})P(I_N\otimes R^{(1)})'
404: \bigg\}
405: {\rm cs}(\Delta')
406: \bigg]_{l+N(k-1)}~,
407: \end{eqnarray}
408: where $I_N$ is the $N\times N$ unit matrix and
409: \begin{eqnarray}
410: \label{eq:tiu1}
411: \bigoplus_{i=1}^N V^{(i)}=
412: \left(
413: \begin{array}{lllll}
414: V^{(1)} & 0 & \multicolumn{2}{c}{\cdots\cdots} & 0\\
415: 0& V^{(2)} & 0 & \multicolumn{2}{c}{\cdots\cdots}\\
416: \multicolumn{5}{c}{\dotfill}\\
417: \multicolumn{5}{c}{\dotfill}\\
418: 0& \multicolumn{2}{c}{\cdots\cdots}& V^{(N-1)}& 0 \\
419: 0& 0& \multicolumn{2}{c}{\cdots\cdots}& V^{(N)} \\
420: \end{array}
421: \right)~.
422: \end{eqnarray}
423: %where $E_N$ is an $N\times N$ matrix of ones.
424: We make use of the following fact:\\
425: For $X\in {\rm Mat}(N,{\boldmath F})$
426: \begin{eqnarray}
427: \label{eq:e8f}
428: T(I_N\otimes X)T=X\otimes I_N~.
429: \end{eqnarray}
430: See Appendix \ref{app:prf} for the proof of (\ref{eq:e8f}).
431: Then (\ref{eq:e7}) becomes
432: \begin{eqnarray}
433: \label{eq:e77}
434: && \frac{\partial f({\rm e}^{\Delta},Y)}{\partial \Delta_{kl}}=
435: -4[{\rm cs}(Q)]_{l+N(k-1)}
436: +\big[
437: W
438: {\rm cs}(\Delta)
439: \big]_{l+N(k-1)}~,\nonumber\\
440: \end{eqnarray}
441: where
442: \begin{eqnarray}
443: \label{eq:e8}
444: W&=&
445: -2\big(I_N\otimes Q+Q'\otimes I_N\big)
446: +4
447: \big\{\bigoplus_{i=1}^N V^{(i)}\big\}
448: T
449: +
\bigg[24(I_N\otimes K)P(I_N\otimes R^{(1)})'
\nonumber\\&&
-16 (I_N \otimes R^{(1)})P(I_N\otimes R^{(3)})'
-16 (I_N \otimes R^{(3)})P(I_N\otimes R^{(1)})'
454: \bigg]
455: T~.\nonumber\\
456: \end{eqnarray}
457: Now let us determine $\Delta$.
Recall that we are following the spirit of the Newton method.
459: Thus we want to find $\Delta$ which satisfies
460: the conditions
461: \begin{eqnarray}
462: \label{eq:e10}
463: \frac{\partial f({\rm e}^{\Delta},Y)}{\partial
464: \Delta_{kl}}=0+O(\Delta^2)~~
465: \mbox{\rm for } 1\le k,l \le N,~k\ne l
466: \end{eqnarray}
467: and
468: \begin{eqnarray}
469: \label{eq:e11}
470: \Delta_{kk}=0 ~~\mbox{\rm for}~~ 1\le k\le N~.
471: \end{eqnarray}
The conditions (\ref{eq:e11}) make the problem rather complicated.
Fortunately,
by using $P$
we can combine
the conditions (\ref{eq:e10}) and (\ref{eq:e11}) into
a single matrix equation:
478: \begin{eqnarray}
479: \label{eq:e19}
480: \Big[(I_{N^2}-P)
481: W(I_{N^2}-P)
482: +P
483: \Big]
484: {\rm cs}(\Delta)-4(I_{N^2}-P){\rm cs}(Q)=0~.
485: \end{eqnarray}
486: Immediately it follows that
487: \begin{eqnarray}
488: \label{eq:e20}
489: {\rm cs}(\Delta)=4
490: \Big[(I_{N^2}-P)
491: W
492: (I_{N^2}-P)
493: +P
494: \Big]^{-1}
495: (I_{N^2}-P){\rm cs}(Q)~.
496: \end{eqnarray}
Thus we have obtained $\Delta$ which specifies a saddle point of
the expansion of
$f({\rm e}^{\Delta},Y)$ up to the second order.
Note that the quantities on the right-hand side of (\ref{eq:e20}) are easily
estimated
from the
observed data.
So, an update is determined by (\ref{eq:e20}) without any
ambiguity.
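
To illustrate the structure of (\ref{eq:e20}), consider the smallest case $N=2$ (a worked special case under the definitions above, not an additional result).
Then $P={\rm diag}(1,0,0,1)$ and ${\rm cs}(\Delta)=(\Delta_{11}~\Delta_{21}~\Delta_{12}~\Delta_{22})'$, and (\ref{eq:e19}) reduces to $\Delta_{11}=\Delta_{22}=0$ together with the $2\times 2$ linear system
\[
\left(\begin{array}{cc}W_{22}&W_{23}\\W_{32}&W_{33}\end{array}\right)
\left(\begin{array}{c}\Delta_{21}\\ \Delta_{12}\end{array}\right)
=4\left(\begin{array}{c}Q_{21}\\ Q_{12}\end{array}\right)~,
\]
where $W_{ab}$ denotes the $(a,b)$ element of the $4\times 4$ matrix $W$.
For general $N$ the same reduction leaves an $(N^2-N)\times(N^2-N)$ system for the off-diagonal elements of $\Delta$, which is solvable whenever the reduced matrix is nonsingular.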
506:
507: \section{Case $\rm I\!I$: square of kurtosis}
508: %~(kurtosis)${\bm{}^2}$}
509: \label{kurt2}
510: Obviously, points where kurtosis
511: vanishes do not play any special role for
512: the cost function $f$ in Section \ref{kurt1}. The optimal solution, however,
513: contains components with zero kurtoses
514: when the number of the sources is less than that of the observation channels.
Thus,
in this section we treat a slightly different
cost function, which is the sum,
519: \begin{eqnarray}
520: \label{eq:se1}
521: &&{\bm f}(C,X)=\sum_i {\bm f}_i(C,X)~,
522: \end{eqnarray}
523: of the square of the kurtoses,
524: \begin{eqnarray}
525: \label{eq:se1.1}
526: && {\bm f}_i(C,X)=\left[\frac{E((CX)_i^4)}{E((CX)_i^2)^2}-3\right]^2~.
527: \end{eqnarray}
528: %Computations needed for evaluating
529: As in the last section, we want to know the saddle point
530: $D={\rm e}^{\Delta}$ of
531: the expansion of ${\bm
532: f_i}(D,Y)$ in
533: terms of $\{\Delta_{ij}\}$ up to the second order.
We do not describe the details of the calculations in this section,
which are
carried out
in almost the same way as in Section \ref{kurt1}.
538: First, the expansion of ${\bm
539: f_i}(D,Y)$ is evaluated as
540: \begin{eqnarray}
541: \label{eq:se4}
542: {\bm f}_i(D,Y)
543: &=&(\kappa_i-3)^2 - 8\big[(\Delta+\frac{\Delta^2}{2})(
544: R^{(1)}\kappa_i-R^{(3)})\big]_{ii}(\kappa_i-3)\nonumber\\
545: &&+4\big[
546: \Delta (3U^{(2,i)}-\kappa_i U^{(0,i)})\Delta'
547: \big]_{ii}(\kappa_i-3)
548: +16\big[
549: \Delta (R^{(1)}\kappa_i-R^{(3)})
550: \big]_{ii}^2
551: \nonumber\\
552: &&
553: +24(\kappa_i-3)\kappa_i\big[
554: \Delta R^{(1)}
555: \big]_{ii}^2
556: -32(\kappa_i-3)\big[
557: \Delta R^{(1)}
558: \big]_{ii}\big[
559: \Delta R^{(3)}
560: \big]_{ii}+O(\Delta^3)~.
561: \end{eqnarray}
562: Next, we introduce $N\times N$ matrices $\bm K$, $\{{\bm
563: V}^{(i)}|1\le i\le N\}$,
564: $\bm S$, and $\bm Q$
565: defined respectively by
566: \begin{eqnarray}
567: \label{eq:se5.9}
568: &&{\bm K}_{pq}= 2R^{(1)}_{pq}(\kappa_q-3)\kappa_q~,
569: \end{eqnarray}
570: \begin{eqnarray}
571: \label{eq:se6}
&& {\bm V}^{(i)}=2(\kappa_i-3)(3U^{(2,i)}-\kappa_i U^{(0,i)})~,
573: \end{eqnarray}
574: \begin{eqnarray}
575: \label{eq:se6.001}
576: {\bm S}={\rm diag}(2(\kappa_i-3))~,
577: \end{eqnarray}
578: and
579: \begin{eqnarray}
580: \label{eq:se6.1}
581: && {\bm Q}_{pq}=2(\kappa_q-3)(R^{(1)}_{pq}\kappa_q-R^{(3)}_{pq})~.
582: \end{eqnarray}
We also rename $Q$ in (\ref{eq:e6.1}) as $\bm q$ in order to avoid confusion:
584: \begin{eqnarray}
585: \label{eq:se6.2}
586: && {\bm q}_{pq}=(R^{(1)}_{pq}\kappa_q-R^{(3)}_{pq})~.
587: \end{eqnarray}
588: Now
589: we proceed to the expression by using the tensor product.
590: We can show that the gradients of the cost function have the
591: following expression:
592: \begin{eqnarray}
593: \label{eq:se77}
594: && \frac{\partial {\bm f}({\rm e}^{\Delta},Y)}{\partial \Delta_{kl}}=
595: -4[{\rm cs}({\bm Q})]_{l+N(k-1)}
596: +\big[
597: {\bm W}
598: {\rm cs}(\Delta)
599: \big]_{l+N(k-1)}+O(\Delta^2)~,\nonumber\\
600: \end{eqnarray}
601: where
602: \begin{eqnarray}
603: \label{eq:se8}
604: {\bm W}&=&
605: -2\big(I_N\otimes {\bm Q}+{\bm Q'}\otimes I_N\big)
606: +4
607: \big\{\bigoplus_{i=1}^N {\bm V}^{(i)}\big\}
608: T
609: +
\bigg[24( I_N\otimes {\bm K})P(I_N\otimes R^{(1)})'
\nonumber\\&&
+32( I_N\otimes {\bm q})P(I_N\otimes {\bm q})'
-16 ( I_N\otimes R^{(1)}{\bm S})P(I_N\otimes R^{(3)})'
\nonumber\\&&
-16 ( I_N\otimes R^{(3)}{\bm S})P(I_N\otimes R^{(1)})'
616: \bigg]
617: T~.
618: \end{eqnarray}
619: This is a completely analogous expression to (\ref{eq:e77}).
620: Thus the coordinate $\Delta$ of the saddle point of the second order
621: expansion
622: is determined by
623: \begin{eqnarray}
624: \label{eq:se20}
625: {\rm cs}(\Delta)=4
626: \Big[(I_{N^2}-P)
627: {\bm W}
628: (I_{N^2}-P)
629: +P
630: \Big]^{-1}
631: (I_{N^2}-P){\rm cs}({\bm Q})~.
632: \end{eqnarray}
633: %In many cases we obtain almost the same results through the two
634: %cost functions in Section \ref{kurt1} and Section \ref{kurt2}.
635: %algorithms.
In many cases the results obtained through the two cost functions in
Section \ref{kurt1} and Section \ref{kurt2} are almost the same.
As implied at the beginning of this section,
the main difference between the two lies in the points where the kurtosis of
one of the components vanishes.
These points indeed constitute saddle points of
the cost function
${\bm f}$, while it is impossible to capture them by the
algorithm in Section \ref{kurt1}.
Thus, we must choose an appropriate method for each problem
with this difference in mind.
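The difference can be made explicit by comparing the first-order terms of (\ref{eq:e4}) and (\ref{eq:se4}) (a short remark derived from those expansions).
At $\Delta=0$,
\[
\left.\frac{\partial {\bm f}_i({\rm e}^{\Delta},Y)}{\partial \Delta_{kl}}\right|_{\Delta=0}
=2(\kappa_i-3)
\left.\frac{\partial f_i({\rm e}^{\Delta},Y)}{\partial \Delta_{kl}}\right|_{\Delta=0}~,
\]
so the contribution of the $i$-th component to the gradient of ${\bm f}$ vanishes whenever $\kappa_i=3$, i.e., whenever the kurtosis of that component is zero, irrespective of the gradient of $f_i$.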
647: %This will be
648: %revisited in Section
649: %{\ref{disc}}.
650:
651:
652: \section{Iteration of updating}
653: \label{iteration}
654: Now we have obtained the updating rules. It is not necessary to tune the
learning rate. At first sight, (\ref{eq:e19})
and (\ref{eq:se20})
look complicated.
They are, however, easily implemented with numerical tools such as MATLAB.
(The source code will be available from our Web site.)
Starting from $C_0$,
$C_i$ for positive $i$ is determined by the left multiplication by
${\rm e}^{\Delta_i}$, where
$\Delta_i$ is determined by setting $Y=C_{i-1}X$,
i.e.,
665: \begin{eqnarray}
666: \label{eq:b1}
667: C_t={\rm e}^{\Delta_{t}}{\rm e}^{\Delta_{t-1}}{\rm e}^{\Delta_{t-2}}\cdots{\rm e}^{\Delta_{1}}C_0~.
668: \end{eqnarray}
If $\Delta$ becomes sufficiently small, we can stop the iteration.
671:
672: \section{Second order convergence}
673: \label{secconv}
First, we carry over the notation of Section \ref{kurt1}.
675: The following discussion is, however, valid for the algorithm in Section
676: \ref{kurt2} if we substitute the quantities $f$, $W$, and so on by
677: their boldface counterparts.
678: Let us start this section by introducing some additional notations.
679: We set
680: \begin{eqnarray}
681: \label{eq:pr1}
682: G\in GL(N,{\boldmath R})
683: \end{eqnarray}
684: and
685: \begin{eqnarray}
686: \label{eq:prd2}
687: K\in GL(1,{\boldmath R})^{\oplus N}~.
688: \end{eqnarray}
689: We also define the coset space $K\backslash G$ by
690: introducing the equivalence relation
691: \begin{eqnarray}
692: \label{eq:pr3}
693: g' g^{-1}\in K
694: \Longleftrightarrow
695: g\sim g'
696: \end{eqnarray}
697: to $G$. That is, $K\backslash G\cong\{Kg|g\in G\}$.
698: Our method is
699: understood as
700: an orthodox adaptation of the Newton method to this
701: coset space $K\backslash G$.
702: Note that
703: the cost function $F(\cdot)\teigi f(\cdot,Y)$ on $G$
704: % defined by (\ref{eq:e1})
705: %and (\ref{eq:e1.1})
706: satisfies the relation
707: \begin{eqnarray}
708: \label{eq:pr4}
709: F(g)=F(Kg)~.
710: \end{eqnarray}
711: So $F$ is naturally considered as a function on $K\backslash G$.
This is the reason for our choice of the cost function.
Thus, the second-order convergence follows immediately if
the correction to the error arising from the change of coordinates
caused by the multiplicative update is properly evaluated.
716:
717: At time $t$, a point $g$ on $K\backslash G$ is specified by
718: the coordinate $X^{(t)}(g) \in{\frak m}$ such that
719: \begin{eqnarray}
720: \label{eq:prf101}
721: {\rm e}^{X^{(t)}(g)}C_t\sim g~,
722: \end{eqnarray}
723: where $\frak m$ is the set of $N\times N$ matrices whose diagonal
724: elements are zeros.
This statement is itself not self-evident; its proof
will be given
elsewhere.
728: Define $F_t$, the representation of the cost function at $t$, by
729: \begin{eqnarray}
730: \label{eq:prf102}
731: F_t(X)=F( {\rm e}^{X}C_t)~.
732: \end{eqnarray}
Here we introduce an $(N^2-N)\times N^2$ matrix $\tilde P$ by
removing the $(i+N(i-1))$-th rows from the $N^2\times N^2$
unit matrix, where $i=N,N-1,\cdots, 2,1$.
We will denote by ${\boldmath H}^{(t)}$ the Hessian,
737: \begin{eqnarray}
738: \label{eq:prf102.11}
739: {\boldmath H}^{(t)}_{kl}=\frac{\partial^2 F_t(X)}
740: {\partial ({\tilde P}{\rm cs}(X))_k\partial ({\tilde P}{\rm cs}(X))_l}
741: \end{eqnarray}
742: Note that if we set
743: \begin{eqnarray}
744: \label{eq:prf103}
745: h_t(X)=\left.
746: T\bigg((I_{N^2}-P)
747: W(I_{N^2}-P)
748: +P\bigg)\right|_{C={\rm e}^X C_t}~,
749: \end{eqnarray}
750: the Hessian is written as
751: \begin{eqnarray}
752: \label{eq:prf103.1}
753: {\boldmath H}^{(t)}={\tilde P}h_t{\tilde P}' ~.
754: \end{eqnarray}
Suppose that in some neighborhood of the optimal solution $g_*$,
${\boldmath H}^{(t)}(X)$
is Lipschitz continuous for some $t$:
758: \begin{eqnarray}
759: \label{eq:prf104}
760: ||{\boldmath H}^{(t)}(X)-{\boldmath H}^{(t)}(X')||\le L ||X-X'||~,
761: \end{eqnarray}
where $||A||$ is the norm of a matrix $A$ regarded as an element of a Euclidean space,
763: \begin{eqnarray}
764: \label{eq:norm1}
765: ||A||^2={\rm tr}(AA^{\dagger})~.
766: \end{eqnarray}
767: We set
768: \begin{eqnarray}
769: \label{eq:prf104.001}
770: \beta=||H^{(t)}(X^t(g_*))^{-1} || ~.
771: \end{eqnarray}
772: There exists a positive real number $r$,
773: for which
774: % neighborhood of $g_*$,
775: \begin{eqnarray}
776: \label{eq:prf104.002}
777: ||H^{(t)}(X^t(g))^{-1} || <2\beta~~\mbox{\rm for}~
778: \forall g\in B^{(t)}(g_*,r)\teigi\bigg\{g\bigg|r> ||X^t(g)-X^t(g_*)||~\bigg\}
779: \end{eqnarray}
780: is satisfied.
781: Then
782: it is known that
783: for all $g\in B(g_*,{\rm min}(r,(2\beta L)^{-1}))$,
784: \begin{eqnarray}
785: \label{eq:prf104.003}
786: ||X^t(C_{t+1})-X^t(g_*)||\le \beta L ||X^t(C_{t})-X^t(g_*)||^2
787: \end{eqnarray}
788: and
789: \begin{eqnarray}
790: \label{eq:prf104.004}
791: ||X^t(C_{t+1})-X^t(g_*)||\le \frac{1}{2} ||X^t(C_{t})-X^t(g_*)||
792: \end{eqnarray}
793: are fulfilled. Thus the second order convergence in this norm is shown.
794: Unfortunately, this norm is not invariant and is unnatural.
795: (A natural metric on $K\backslash G$
796: is one which is invariant under the parallel transformation,
797: %where the parallel transformation
798: which is induced by the action
799: of elements in $K\backslash G$
from the right-hand side.) In practice, however, it suffices.
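
To indicate what (\ref{eq:prf104.003}) means in practice, write $e_t$ for the error at step $t$ and suppose, as a simplifying assumption made only for this illustration, that the bound $e_{t+1}\le\beta L e_t^2$ holds at every step with the same constants $\beta$ and $L$, ignoring the change of coordinates between steps.
If $\beta L e_0\le 1/2$, then by induction
\[
\beta L e_t\le(\beta L e_0)^{2^t}\le 2^{-2^t}~,
\]
so that the bound on $\beta L e_t$ decreases as $1/2,~1/4,~1/16,~1/256,\cdots$ for $t=0,1,2,3,\cdots$.
Roughly speaking, the number of correct digits doubles at each step.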
801:
802:
803: \section{Discussions}
804: \label{disc}
805: \subsection{Nonholonomy?}
806: Our method is related to the nonholonomic method
807: by
Amari, Chen, and Cichocki\cite{amari-chen-cichocki1}.
809: In essence our method is a Newton
810: approach to the same problem, the optimization without prewhitening.
811: Let us set
812: \begin{eqnarray}
813: \label{eq:conc11}
814: {\rm e}^{z} = {\rm e}^{x}{\rm e}^{y}
815: \end{eqnarray}
816: for $x,y\in {\frak gl}(N,{\boldmath R})$.
Then it is obvious that $z$ does not necessarily belong to $\frak m$
even if $x,y\in {\frak m}$ (that is,
the $z_{ii}$'s do not always vanish
when $x_{ii}=y_{ii}=0$ for $1\le i\le N$).
821: This may be explained by using the concept of nonholonomy.
822: The degree of freedom in each step, however, equals the dimension
823: of the space $K\backslash G$ in our setting. The nonholonomic nature
824: emerges when we go back to $G=GL(N,{\bm R})$ again.
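A minimal example (ours, for $N=2$ and sufficiently small real numbers $a$ and $b$) makes this explicit.
By the Baker-Campbell-Hausdorff formula,
\[
z=x+y+\frac{1}{2}[x,y]+\cdots~,
\]
where $[x,y]=xy-yx$ and the dots denote higher-order commutators.
For
\[
x=\left(\begin{array}{cc}0&a\\0&0\end{array}\right)~,\quad
y=\left(\begin{array}{cc}0&0\\b&0\end{array}\right)
\]
we find
\[
[x,y]=\left(\begin{array}{cc}ab&0\\0&-ab\end{array}\right)~,
\]
so that $z_{11}=ab/2+\cdots$, which does not vanish in general although $x,y\in{\frak m}$.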
825:
826: There exist several
827: studies\cite{takeuchi1,helgason1,helgason2,helgason3,akuzawa5} which
828: deal with
829: cosets
830: like $K\backslash G$ or the right coset $G/K$
831: when $K$ is a maximal compact subgroup of $G$.
832: Unfortunately, what we are studying is the case where $K$ is not a
833: maximal compact subgroup of $G$.
834: So, for example
835: it is necessary to show
836: whether the coordinate (\ref{eq:prf101}) is justified or not.
837: As mentioned above, further studies including this justification
838: will appear elsewhere.
839:
840: \subsection{Global convergence}
841: % On the other hand,
We should treat the
first few steps carefully, since this method shares
the somewhat undesirable global convergence properties inherent in
the Newton method. Fortunately enough,
846: there exist methods which can
847: handle the earlier stage. For example, the nonholonomic gradient
848: method\cite{amari-chen-cichocki1}
849: may be applicable.
Another possibility is to construct a nonholonomic fixed-point
851: algorithm which uses the kernel method.
852: These methods are suitable for capturing the optimal point which
853: contains components with zero kurtoses. There
854: we must, of course, use the method in Section \ref{kurt2}.
855: If it is not necessary to worry about these zero kurtosis components,
856: there is little difference between the two methods described in
857: Section \ref{kurt1} and Section \ref{kurt2}.
858:
859: \subsection{Conclusions}
We have constructed a new algorithm for finding an optimal point in a
861: matrix space, where we have introduced a new
862: multiplicative updating method.
863: %does not
864: %
865: %requie
866: %prewhitening.
867: The algorithm is in essence the Newton method on a
868: coset.
869: So it converges quite rapidly and it can capture the saddle point.
870: Since it does not require prewhitening,
871: it is not necessary to worry about the error resulting from the
872: prewhitening.
873: Indeed, it is possible to increase
874: the kurtosis slightly for data preprocessed by
875: the FastICA\cite{fastica1}.
876:
877:
878: \begin{thebibliography}{8}
879:
880: \bibitem[A.Hyv\"arinen,1997]{hyvarinen1}
881: A.Hyv\"arinen (1997).
882: \newblock A Fast Fixed-Point Algorithm for Independent Component Analysis.
883: \newblock {\em Neural Computation\/}, {\em 9\/}, 1483--1492.
884:
885: \bibitem[Amari {\em et~al.\/},1997]{amari-chen-cichocki1}
886: Amari, S., Chen, T.-P., \& Cichocki, A. (1997).
887: \newblock Non-holonomic Constraints in Learning Algorithms for Blind Source
888: Separation.
889: \newblock {\em preprint\/}.
890:
891: \bibitem[Hurri {\em et~al.\/},1998]{fastica1}
Hurri, J., G\"avert, H., S\"arel\"a, J., \& Hyv\"arinen, A. (1998).
893: \newblock FastICA package for MATLAB.
894: \newblock http://www.cis.hut.fi/projects/ica/fastica/.
895:
896: \bibitem[M.Takeuchi,1994]{takeuchi1}
897: M.Takeuchi (1994).
898: \newblock {\em Modern Spherical Functions\/}.
899: \newblock Amer. Math. Soc.
900:
901: \bibitem[S.Helgason,1962]{helgason2}
902: S.Helgason (1962).
903: \newblock {\em Differential Geometry and Symmetric Spaces\/}.
904: \newblock Academic Press.
905:
906: \bibitem[S.Helgason,1978]{helgason1}
907: S.Helgason (1978).
908: \newblock {\em Differential Geometry, Lie Groups and Symmetric Spaces\/}.
909: \newblock New York: Academic Press.
910:
911: \bibitem[S.Helgason,1984]{helgason3}
912: S.Helgason (1984).
913: \newblock {\em Groups and Geometric Analysis\/}.
914: \newblock Academic Press.
915:
916: \bibitem[T.Akuzawa \& M.Wadati,1998]{akuzawa5}
917: T.Akuzawa \& M.Wadati (1998).
918: \newblock Diffusions on symmetric spaces of type A${\rm I\!I\!I}$ and random
919: matrix theories for rectangular matrices.
920: \newblock {\em J.Phys.A\/}, {\em 31\/}, 1713--1732.
921:
922: \end{thebibliography}
923:
924:
925: \appendix
\section*{Appendix}
\section{Proof of (\ref{eq:e8f})}
928: \label{app:prf}
929: \begin{quote}
930: %{Proof}:
931: For $B\in GL(N,{\boldmath F})$ and $1\le i,j\le N$,
\begin{eqnarray}
\label{eq:proof1}
[ T(X\otimes Y)T {\rm cs}(B)]_{i+N(j-1)}
&=&[ (X\otimes Y)T {\rm cs}(B)]_{j+N(i-1)}\nonumber\\
&=&\sum_{p,q}X_{ip}Y_{jq}(B')_{qp}
=(YB'X')_{ji}~.
\end{eqnarray}
On the other hand,
\begin{eqnarray}
\label{eq:proof12}
[ (Y\otimes X) {\rm cs}(B)]_{i+N(j-1)}
&=&\sum_{p,q}Y_{jp}X_{iq}B_{qp}
=(YB'X')_{ji}~.
\end{eqnarray}
946: This proves the statement since $\rm cs$ is bijective. $\Box$
947: \end{quote}
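
As a concrete check of (\ref{eq:e8f}) (an illustration for $N=2$ only), note that $T$ is then the $4\times 4$ permutation matrix which interchanges the second and third components of a vector, so that conjugation by $T$ interchanges the second and third rows and the second and third columns of a matrix.
Starting from
\[
I_2\otimes X=\left(\begin{array}{cccc}X_{11}&X_{12}&0&0\\X_{21}&X_{22}&0&0\\0&0&X_{11}&X_{12}\\0&0&X_{21}&X_{22}\end{array}\right)~,
\]
we obtain
\[
T(I_2\otimes X)T=\left(\begin{array}{cccc}X_{11}&0&X_{12}&0\\0&X_{11}&0&X_{12}\\X_{21}&0&X_{22}&0\\0&X_{21}&0&X_{22}\end{array}\right)=X\otimes I_2~,
\]
in agreement with (\ref{eq:e8f}).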
948:
949:
950:
951:
952:
953:
954:
955:
956: \end{document}
957:
958:
959:
960:
961:
962:
963:
964:
965: