1: \documentclass[a4,12pt]{article}
2: \usepackage{latexsym}
3: \oddsidemargin=0cm
4: \evensidemargin=0cm
5: \textwidth=16cm
6: \paperwidth=21cm
7: \textwidth=18.6cm
8: %\textheight=24.7cm
9: \oddsidemargin=-0.5in
10: \evensidemargin=-0.5in
11: %\topmargin=-0.6in
12: \usepackage{amsmath,amstext,amsfonts}
13: \def\bm#1{\mbox{\boldmath $#1$}}
14: \def\teigi{\stackrel{\rm def}{=}}
15: \def\hatena{\stackrel{\boldmath ?}{=}}
16: %\bibliographystyle{mybstfeb96}
17: %\bibliographystyle{mybst1996}
18: %\bibliographystyle{bstforNEu}
19: %\bibliographystyle{BSTforNEU}
20: %\bibliographystyle{apalike}
21: %\bibliographystyle{apahack}
22: %%
23: \makeatletter
24:   \renewcommand{\theequation}{%
25:      \thesection.\arabic{equation}}
26:   \@addtoreset{equation}{section}
27: \makeatother 
28: \tolerance=6000
29: 
30: 
31: 
32: 
\title{Multiplicative Nonholonomic/Newton-like Algorithm}
34: \author{Toshinao {\sc
35:     Akuzawa}\thanks{akuzawa@islab.brain.riken.go.jp}\vspace{0.3cm}\\
36: and\vspace{0.3cm}\\
37: Noboru {\sc Murata}
38: \vspace{0.5cm}\\
39: Brain Science Institute \\
40: {\it RIKEN}\\
41: %%{\small(The Institute of Physical and Chemical Research)}\\
42: {\small 2-1 Hirosawa, Wako-shi, Saitama 351-0198, Japan}}
43: \date{{\it October 19, 1999}}
44: 
45: \begin{document}
46: \maketitle
47: \abstract{We construct new algorithms  from scratch, 
48: which use the  fourth order cumulant  of stochastic 
49: variables for the cost function.  
50: The 
51: multiplicative updating rule 
52: here constructed  is natural from the
53: homogeneous nature of the Lie group
54: and has numerous merits for 
55: the   rigorous treatment of the 
56: dynamics.  
57: As one consequence, the second order convergence is shown.
58: For the cost function, 
59: functions invariant under 
60: the componentwise scaling 
are chosen.
62: By
63: identifying
64: points  which can be transformed to each other by the scaling, 
65:  we assume that the dynamics is in a coset space.   
66: In our method, a point can move toward  any direction in this coset.
67: Thus, 
68:  no  prewhitening is  required.   
69:  }
70: \section{Introduction}
71: \label{intro}
72: Suppose that $N$-dimensional stochastic
73: variables $\{X_i|1\le i \le N\}$ are observed.
Independent component analysis (ICA) seeks a map
 $X \mapsto Y$ such that the components of $Y$ are mutually independent.
76: In this letter  we restrict ourselves to 
77:  the linear independent component analysis. 
78: There  
79: we want to find a linear transformation $C:{\bf X}=(X_1,\cdots,X_N)'\mapsto
80: {\bf Y}=(Y_1,\cdots,Y_N)'=C{\bf X}$ which 
81:  minimizes some cost function that measures the independence. 
Hereafter we denote the transposition by the superscript $\prime$ and
the complex conjugate by $\dagger$.
84: 
85: There can be  many candidates for the cost function.  
86: For example
87: the Kullback-Leibler information 
88: is a good measure for the independence. 
89: In this case 
90: the problem is translated to 
91:   the minimization of 
92: $ -\sum_{i=1}^N\int dy_i P_i(y_i)\ln  P_i(y_i)$, where
93: $P_i$ is the probability density function of the $i$-th component. 
It is obvious that we must evaluate the $P_i$ to find the optimal
solution. Robust estimation
of the probability density functions is not an easy task
and, even when possible, may be computationally expensive.
98: 
99: An alternative idea is to make use of  the cumulant of the fourth
100: order, or the kurtosis\cite{hyvarinen1}, which we will adopt in this letter. 
101: The fourth order cumulant vanishes for 
the normal distribution, so this cost function is robust against
Gaussian random noise.
104: We will construct algorithms where a matrix, which specifies the
105: linear transformation, is updated by the left-multiplication of a 
106: matrix $D={\rm e}^{\Delta}$. 
This expression implies that $D$ belongs to
$GL(N,{\bm R})$ (more precisely, to
the component of $GL(N,{\bm R})$ connected to the unit element),
which ensures that the rank is conserved.
Specifying $D$ by the coordinate $\Delta$
has many advantages
since it is compatible with the homogeneous nature of the Lie group.
115: 
The cost function can take several forms. Our definitions, given in
the following two sections,
are chosen to be invariant under componentwise scaling.
This invariance is crucial for
a rigorous treatment of the convergence properties.
Moreover, it allows us to
identify
points in $GL(N,{\bm R})$ which are transformed to each
other by the
scaling.
We can then legitimately restrict the dynamics to the coset space
introduced by this identification.
128: 
129: Under these settings, we determine $\Delta$ by using the Newton method 
130: for the second order
131: expansion of the cost function with respect to $\{\Delta_{ij}\}$. It
132: is assumed 
133: that the diagonal elements of $\Delta$ are zeros, 
134: which does not impose any restrictions. 
135: That is, a point can move toward  any direction in this coset by a
136: left-multiplication of ${\rm e}^{\Delta}$. 
Thus
our method does not require prewhitening of the data.
139: It is also not required 
140: that the
141: optimal solution is  the maximum or the minimum of the
142: cost function. Indeed,  the sole requirement is that 
143: the optimal point  is a saddle point of the cost function 
144: since our method
145: is in principle the Newton method.  
146: These are great advantages of our method. 
147: 
148: %This property  is unique to our method  and
149: %that does not causes any serious problem if the starting point is
150: %close enough. 
151: 
152: 
153: 
154: Our strategy is as follows.
155: As an initial condition we set $C_0$. 
156: For  $t>0~(t\in{\bf N}^{+})$, 
157: we introduce an  $N\times N$ matrix $\Delta_t$ and 
158: denote $C_{t}$ as  $C_{t}={\rm e}^{\Delta_{t}}C_{t-1}$.
159: Next, we evaluate the cost function at $C_{t}$ 
160: by using the expansion around $C_{t-1}$ 
161: with respect to the elements of
162: $\Delta_{t}$ up to the second order. 
Then $\Delta_t$ is chosen as a saddle point of
this second order
expansion.
We iterate this procedure until we obtain a satisfactory
solution.
168: 
169: 
170: This letter is organized as follows.
171: In Section \ref{kurt1} the main part of our algorithm is  constructed, 
172: where the cost function  is essentially identical to the sum of
173: kurtoses. 
174: We adopt the square of the kurtoses for the cost function 
175: in Section \ref{kurt2}.
176: Explicit expressions for the optimal 
177: $\Delta$ (up to the second order)
178: are obtained both in Sections \ref{kurt1} and \ref{kurt2}. 
179:  Section \ref{iteration} is a short section  where  we show how
180: each updating step is combined to obtain the optimal $C$. 
181: In Section \ref{secconv} the convergence property of our algorithm is 
182: discussed. Section \ref{disc} contains conclusions and discussions. 
183: \section{Multiplicative update algorithm}
184: \label{kurt1}
185: \subsection{Expansion of the cost function }
186: Let us start by defining the cost function:
187: \begin{eqnarray}
188:   \label{eq:e1}
189: &&f(C,X)=\sum_i f_i(C,X)~,
190:   \end{eqnarray}
191: where $f_i$'s are the fourth order moments 
192: of components
193: divided by the square of their variances, 
194: \begin{eqnarray}
195:   \label{eq:e1.1}
196: &&  f_i(C,X)=\frac{E((CX)_i^4)}{E((CX)_i^2)^2}~.
197: \end{eqnarray}
198: In this letter we denote by $E(A)$ the expectation  of 
199: $A$. 
Obviously
the cost function $f$ coincides with the sum of the kurtoses of all the components
up to an additive constant.
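To make the constant explicit (a short check, assuming that each
component $(CX)_i$ has zero mean, which is not stated explicitly above):
with the usual definition of the kurtosis of the $i$-th component,
$E((CX)_i^4)/E((CX)_i^2)^2-3$, we have
\begin{eqnarray*}
&&f(C,X)=\sum_{i=1}^N\left[\frac{E((CX)_i^4)}{E((CX)_i^2)^2}-3\right]+3N~.
\end{eqnarray*}
For instance, $f_i=3$ for a Gaussian component, and $f_i=9/5$ for a
component $Y$ distributed uniformly on a symmetric interval $[-a,a]$, since then
$E(Y^2)=a^2/3$ and $E(Y^4)=a^4/5$.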
203: We set $D={\rm e}^{\Delta}$ and
204:  expand $f(D,Y)$  %(\ref{eq:e1})
205:  in terms of the elements of $\Delta$. 
206: %and $K={\rm e}^{-\Delta}-1$, 
For example, the term-by-term expansions are evaluated as follows:
208:  \begin{eqnarray}
209:   \label{eq:e2}
210: E((DY)_i^4)
211: &=&
212: E(Y_i^4)+4\sum_{p}(\Delta_{ip}+(\frac{\Delta^2}{2})_{ip})E(Y_i^3Y_p)
213: +6\sum_{p,q}\Delta_{ip}\Delta_{iq}E(Y_i^2Y_pY_q)+O(\Delta^3)~\nonumber\\
214: %\end{eqnarray}
215: %\begin{eqnarray}
216: %  \label{eq:3}
217: E((DY)_i^2)
218: &=&
219: E(Y_i^2)+2\sum_{p}(\Delta_{ip}+(\frac{\Delta^2}{2})_{ip})E(Y_iY_p)
220: +\sum_{p,q}\Delta_{ip}\Delta_{iq}E(Y_pY_q)+O(\Delta^3)~.
221: \end{eqnarray}
Hereafter we denote by
$O(\Delta^k)$ polynomials in the matrix elements of $\Delta$ which
do not contain terms of degree less than $k$.
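The intermediate step behind (\ref{eq:e2}) may be sketched as follows
(the symbol $\epsilon_i$ is introduced here only for illustration).
Since ${\rm e}^{\Delta}$ equals the unit matrix plus
$\Delta+\Delta^2/2+O(\Delta^3)$, we can write
$(DY)_i=Y_i+\epsilon_i$ with
\begin{eqnarray*}
\epsilon_i&=&\sum_p\Big(\Delta+\frac{\Delta^2}{2}\Big)_{ip}Y_p+O(\Delta^3)~,\\
(Y_i+\epsilon_i)^4&=&Y_i^4+4Y_i^3\epsilon_i+6Y_i^2\epsilon_i^2+O(\Delta^3)~,\\
(Y_i+\epsilon_i)^2&=&Y_i^2+2Y_i\epsilon_i+\epsilon_i^2~,
\end{eqnarray*}
where $\epsilon_i^2=\sum_{p,q}\Delta_{ip}\Delta_{iq}Y_pY_q+O(\Delta^3)$.
Taking expectations term by term gives the two expansions above.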
225: For  brevity's sake  
226: we introduce the following notations:
227: \begin{eqnarray}
228:   \label{eq:e3.1}
229: &&  \sigma_i^{(k)}=|E(Y_i^k)|^{1/k}~,\\
230: &&  R^{(k)}_{pi}=\frac{E(Y_i^k Y_p)}{(\sigma^{(2)}_i)^{k+1}}~,\\
231: &&  U^{(k,i)}_{pq}=\frac{E(Y_i^kY_p Y_q)}{(\sigma^{(2)}_i)^{k+2}}~,
232: \end{eqnarray}
233: and
234: \begin{eqnarray}
235:   \label{eq:e3.2}
236:   && \kappa_i={(\sigma^{(4)}_i)^4}/{(\sigma^{(2)}_i)^4}~.
237: \end{eqnarray}
238: Using the quantities defined above we can  show that  the
239: cost function is expanded as 
240: \begin{eqnarray}
241:   \label{eq:e4}
242:  f_i(D,Y)
243: &=&\bigg[
244: \kappa_i+4\big[(\Delta+\frac{\Delta^2}{2})R^{(3)}\big]_{ii}
245: +6\big[
246: \Delta U^{(2,i)}\Delta'
247: \big]_{ii}
248: +O(\Delta^3)
249: \bigg]\nonumber\\
250: &&~~\times
251: \bigg[
252: 1-4\big[(\Delta+\frac{\Delta^2}{2})R^{(1)}\big]_{ii}
253: -2\big[
254: \Delta U^{(0,i)}\Delta'
255: \big]_{ii}
256: +12\big[
257: \Delta R^{(1)}
258: \big]_{ii}^2
259: +O(\Delta^3)
260: \bigg]\nonumber\\
261: &=&\kappa_i - 4\big[(\Delta+\frac{\Delta^2}{2})(\kappa_i
262: R^{(1)}-R^{(3)})\big]_{ii}
263: +2\big[
264: \Delta (3U^{(2,i)}-\kappa_i  U^{(0,i)})\Delta'
265: \big]_{ii}\nonumber\\
266: &&~~
267: +12\kappa_i\big[
268: \Delta R^{(1)}
269: \big]_{ii}^2
270: -16\big[
271: \Delta R^{(1)}
272: \big]_{ii}\big[
273: \Delta R^{(3)}
274: \big]_{ii}+O(\Delta^3)~
275: \end{eqnarray}
276: by  straightforward calculations. 
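One way to organize this calculation (a sketch; the symbol $a_i$ is
introduced here only for illustration) is to write the second expansion in (\ref{eq:e2}) as
$E((DY)_i^2)=(\sigma^{(2)}_i)^2(1+a_i)$ with
\begin{eqnarray*}
a_i&=&2\big[(\Delta+\frac{\Delta^2}{2})R^{(1)}\big]_{ii}
+\big[\Delta U^{(0,i)}\Delta'\big]_{ii}+O(\Delta^3)~,\\
(1+a_i)^{-2}&=&1-2a_i+3a_i^2+O(a_i^3)~.
\end{eqnarray*}
Since $a_i^2=4\big[\Delta R^{(1)}\big]_{ii}^2+O(\Delta^3)$, the
$3a_i^2$ term yields the contribution $12\big[\Delta R^{(1)}\big]_{ii}^2$
in the second factor of (\ref{eq:e4}).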
Next, we evaluate the partial derivatives of the cost function
with respect to the matrix elements of $\Delta$.
279:  %We  need only terms up to $O(\Delta^2)$. 
280:  Partially differentiating  (\ref{eq:e4}), 
281: %It follows that the partial derivative of $f(C,Y)$ becomes
282: we get an expression, 
283: \begin{eqnarray}
284:   \label{eq:e5}
285: &&  \frac{\partial f({\rm e}^{\Delta},Y)}{\partial \Delta_{kl}}=
286: -4\big[K-R^{(3)}\big]_{lk}
287: -2\big[(K-R^{(3)})\Delta+\Delta(K-R^{(3)})\big]_{lk}\nonumber\\
288: &&+4\big[
289:  (3U^{(2,k)}-\kappa_k  U^{(0,k)})\Delta'
290: \big]_{lk}
291: +24K_{lk}\big[\Delta R^{(1)}
292: \big]_{kk}
293: -16R^{(1)}_{lk}\big[\Delta R^{(3)}
294: \big]_{kk}
295: -16 R^{(3)}_{lk}\big[\Delta R^{(1)}
296: \big]_{kk}\nonumber\\
297: &&+O(\Delta^2)~,
298: \end{eqnarray}
299: where $K$ is an $N\times N$ matrix defined by
300: \begin{eqnarray}
301:   \label{eq:e5.9}
302: &&K_{pq}=\kappa_q  R^{(1)}_{pq}~.  
303: \end{eqnarray}
We want to determine $\Delta$ for which
the partial derivatives of the cost function
with respect to $\Delta_{kl}~(k\ne
 l)$
vanish under the condition that
 $\Delta_{ii}=0$ for $1\le i \le N$.
We neglect $O(\Delta^3)$ terms in the cost function.
Thus the right-hand side of (\ref{eq:e5}) is
regarded as a polynomial in
% the elements of $\Delta$
 $\{\Delta_{kl}\}$
of at most first order, and it is always possible
in principle to
 determine $\Delta$ which satisfies the above condition.
319: % for which 
320: % (\ref{eq:e5}) vanishes. 
321: It is, at the same time, not easy  to  describe  the problem 
322: in a  form which is valid 
323: for
324: arbitrary $N$. 
325: In the following subsection we will introduce  a transparent and unified 
326: method for handling the partial derivatives of $f$. 
327: %Before  this subsection by 
328: We leave this subsection by
329: introducing $N\times N$ matrices 
330: \begin{eqnarray}
331:   \label{eq:e6}
332: &&  V^{(i)}=3U^{(2,i)}-\kappa_i  U^{(0,i)}~
333: \end{eqnarray}
334: and
335: \begin{eqnarray}
336:   \label{eq:e6.1}
337: % &&  Q=R^{(1)}-R^{(3)}~.
338:  &&  Q=K-R^{(3)}~
339: \end{eqnarray}
340: for later convenience.  
341: \subsection{Expression by tensor product and determination of $\Delta$}
The expression (\ref{eq:e5}) is quite complicated and not
convenient for our purpose, namely
``determine $\Delta$ such that
all the partial derivatives vanish''.
346: Fortunately by  mapping  the relations between elements of 
347: $N\times N$ matrices  to those of   $N^2\times
348: N^2$ matrices, we can handle the problem transparently. 
349: %,  the problem can be rewritten in a general form. 
350: Some preparations
351: are needed.
352: First, let us introduce a map $\rm cs$:
353: \begin{eqnarray}
354:   \label{eq:a14}
  {\rm Mat}(N,{\bm F}) &\rightarrow& {\bm F}^{N^2}\nonumber\\
356: A=\left(
357:   \begin{array}{cccc}
358:  A_{11}& A_{12}&\cdots &A_{1N}\\
359: A_{21} &\multicolumn{3}{c}{\dotfill}\\
360: \multicolumn{4}{c}{\dotfill}\\
361: A_{N1} &\multicolumn{2}{c}{\dotfill}&A_{NN}
362:   \end{array}
363: \right) &\mapsto& 
364: {\rm cs}(A)=
365: (A_{11}~ A_{21}~ \cdots~ A_{N1}~ A_{12}~ A_{22}~\cdots~ A_{NN})'~,\nonumber\\
366: \end{eqnarray}
where ${\bm F}$ is an unspecified field.
368: We also introduce 
369: two useful operators $T$ and $P$. 
370: The ``intertwiner'' $T$ is  an $N^2\times N^2$ matrix 
371: defined by 
372: \begin{eqnarray}
373:   \label{eq:a15}
  {\rm cs}(A')=T{\rm cs}(A) ~\mbox{\rm for~} A\in  {\rm Mat}(N,{\bm F})~.
375: \end{eqnarray}
376: The projection  operator $P$,
377: \begin{eqnarray}
378:   \label{eq:a18}
379: P&=&{\rm diag}(p_1,\cdots,p_{N^2})~,\nonumber\\
380: &&\left\{
381: \begin{array}{ll}
382:  p_k=1 ~~~\mbox{\rm for}~~ k=N(i-1)+i,1\le i\le N~\\
383:  p_k=0~~~~ \mbox{\rm otherwise}~,
384: \end{array}
385: \right.
386: \end{eqnarray}
387:  is used to extract the ``diagonal''
388: elements of a matrix from its image by $\rm cs$. 
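As a concrete illustration (our own example, for $N=2$), these
definitions give
\begin{eqnarray*}
&&{\rm cs}(A)=(A_{11}~A_{21}~A_{12}~A_{22})'~,~~~
T=\left(
\begin{array}{cccc}
1&0&0&0\\
0&0&1&0\\
0&1&0&0\\
0&0&0&1
\end{array}
\right)~,~~~
P={\rm diag}(1,0,0,1)~.
\end{eqnarray*}
Indeed, $T{\rm cs}(A)=(A_{11}~A_{12}~A_{21}~A_{22})'={\rm cs}(A')$, and
$P{\rm cs}(A)=(A_{11}~0~0~A_{22})'$ retains only the diagonal elements.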
389: 
With these preparations we can rewrite (\ref{eq:e5}) as
391: \begin{eqnarray}
392:   \label{eq:e7}
393:   \frac{\partial f({\rm e}^{\Delta},Y)}{\partial \Delta_{kl}}&=&
394: \bigg[ -4{\rm cs}(Q)
395: -2\big[I_N\otimes Q+T(I_N\otimes Q')T\big]{\rm cs}(\Delta)
396: +4
397: \big\{\bigoplus_{i=1}^N V^{(i)}\big\}
398: {\rm cs}(\Delta')
399: \nonumber\\&&
400: +
\bigg\{24(I_N \otimes K)P(I_N\otimes R^{(1)})'
-16 ( I_N \otimes R^{(1)})P(I_N\otimes R^{(3)})'\nonumber\\
&&-16 (I_N\otimes R^{(3)})P(I_N\otimes R^{(1)})'
404: \bigg\}
405: {\rm cs}(\Delta')
406: \bigg]_{l+N(k-1)}~,
407: \end{eqnarray}
408: where $I_N$ is the $N\times N$ unit matrix and
409: \begin{eqnarray}
410:   \label{eq:tiu1}
411: \bigoplus_{i=1}^N V^{(i)}=
412: \left(
413:   \begin{array}{lllll}
414: V^{(1)} & 0 & \multicolumn{2}{c}{\cdots\cdots} & 0\\ 
415: 0& V^{(2)} & 0 & \multicolumn{2}{c}{\cdots\cdots}\\
416:  \multicolumn{5}{c}{\dotfill}\\
417:  \multicolumn{5}{c}{\dotfill}\\
418:  0& \multicolumn{2}{c}{\cdots\cdots}& V^{(N-1)}& 0   \\
419: 0& 0& \multicolumn{2}{c}{\cdots\cdots}& V^{(N)}   \\
420:    \end{array}
421: \right)~.
422: \end{eqnarray}
423: %where $E_N$ is an $N\times N$ matrix of ones. 
We make use of the following fact:\\
For $X\in {\rm Mat}(N,{\bm F})$,
426: \begin{eqnarray}
427:   \label{eq:e8f}
428:   T(I_N\otimes X)T=X\otimes I_N~.
429: \end{eqnarray}
430: See  Appendix \ref{app:prf} for the proof of  (\ref{eq:e8f}). 
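For $N=2$ the identity can also be checked by hand (an illustration
only, not a substitute for the proof). In this case $T$ is the
permutation matrix which exchanges the second and third components, so
conjugation by $T$ exchanges the second and third rows and columns:
\begin{eqnarray*}
I_2\otimes X&=&\left(\begin{array}{cccc}
X_{11}&X_{12}&0&0\\
X_{21}&X_{22}&0&0\\
0&0&X_{11}&X_{12}\\
0&0&X_{21}&X_{22}
\end{array}\right)~,\\
T(I_2\otimes X)T&=&\left(\begin{array}{cccc}
X_{11}&0&X_{12}&0\\
0&X_{11}&0&X_{12}\\
X_{21}&0&X_{22}&0\\
0&X_{21}&0&X_{22}
\end{array}\right)=X\otimes I_2~.
\end{eqnarray*}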
431: Then (\ref{eq:e7}) becomes 
432: \begin{eqnarray}
433:   \label{eq:e77}
434: &&  \frac{\partial f({\rm e}^{\Delta},Y)}{\partial \Delta_{kl}}=
435:  -4[{\rm cs}(Q)]_{l+N(k-1)}
436: +\big[
437: W
438: {\rm cs}(\Delta)
439: \big]_{l+N(k-1)}~,\nonumber\\
440: \end{eqnarray}
441: where
442: \begin{eqnarray}
443:   \label{eq:e8}
444: W&=&
445: -2\big(I_N\otimes Q+Q'\otimes I_N\big)
446: +4
447: \big\{\bigoplus_{i=1}^N V^{(i)}\big\}
448: T
449: +
\bigg[24(I_N\otimes K)P(I_N\otimes R^{(1)})'
\nonumber\\&&
-16 (I_N \otimes R^{(1)})P(I_N\otimes R^{(3)})'
-16 (I_N \otimes R^{(3)})P(I_N\otimes R^{(1)})'
454: \bigg]
455: T~.\nonumber\\
456: \end{eqnarray}
457: Now let us determine $\Delta$. 
Recall that we are following the spirit of the Newton method.
459: Thus we want to find $\Delta$ which satisfies
460: the  conditions
461: \begin{eqnarray}
462:   \label{eq:e10}
463:     \frac{\partial f({\rm e}^{\Delta},Y)}{\partial
464:     \Delta_{kl}}=0+O(\Delta^2)~~
465: \mbox{\rm for } 1\le k,l \le N,~k\ne l
466: \end{eqnarray}
467: and 
468: \begin{eqnarray}
469:   \label{eq:e11}
470:   \Delta_{kk}=0 ~~\mbox{\rm for}~~ 1\le k\le N~.
471: \end{eqnarray}
The conditions (\ref{eq:e11}) make the problem rather complicated.
Fortunately,
by using $P$
we can combine %%%transform
the conditions  (\ref{eq:e10}) and  (\ref{eq:e11}) into
a single matrix equation:
478: \begin{eqnarray}
479:   \label{eq:e19}
480: \Big[(I_{N^2}-P)
481: W(I_{N^2}-P)
482: +P
483: \Big]
484: {\rm cs}(\Delta)-4(I_{N^2}-P){\rm cs}(Q)=0~.
485: \end{eqnarray}
Provided that the matrix in the brackets is invertible, it immediately follows that
487: \begin{eqnarray}
488:   \label{eq:e20}
489: {\rm cs}(\Delta)=4
490: \Big[(I_{N^2}-P)
491: W
492: (I_{N^2}-P)
493: +P
494: \Big]^{-1}
495: (I_{N^2}-P){\rm cs}(Q)~.
496: \end{eqnarray}
Thus we have obtained the $\Delta$ which specifies a saddle point of
the expansion of
$f({\rm e}^{\Delta},Y)$ up to the second order.
Note that the quantities on the right-hand side of (\ref{eq:e20}) are easily
estimated
from the
observed data.
So the update is determined by (\ref{eq:e20}) without any
ambiguity.
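To illustrate the structure of (\ref{eq:e20}) (a worked special case of
our own, for $N=2$): here ${\rm cs}(\Delta)=(0~\Delta_{21}~\Delta_{12}~0)'$
and $P={\rm diag}(1,0,0,1)$, so that (\ref{eq:e19}) reduces to the
$2\times 2$ linear system
\begin{eqnarray*}
&&\left(\begin{array}{cc}
W_{22}&W_{23}\\
W_{32}&W_{33}
\end{array}\right)
\left(\begin{array}{c}\Delta_{21}\\ \Delta_{12}\end{array}\right)
=4\left(\begin{array}{c}Q_{21}\\ Q_{12}\end{array}\right)~,
\end{eqnarray*}
where $W_{ab}$ denotes the $(a,b)$ element of the $4\times 4$ matrix $W$.
In general only the $N^2-N$ off-diagonal elements of $\Delta$ are unknowns.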
506: 
507: \section{Case $\rm I\!I$:  square of kurtosis}
508: %~(kurtosis)${\bm{}^2}$}
509: \label{kurt2}
Obviously, points where the kurtosis
vanishes do not play any special role for
the cost function  $f$ in Section \ref{kurt1}. The optimal solution, however,
contains components with zero kurtosis
when the number of sources is less than that of the observation channels.
Thus,
in this section we treat a slightly different
% algorithm, where
cost function, which is the sum,
519: \begin{eqnarray}
520:   \label{eq:se1}
521: &&{\bm f}(C,X)=\sum_i {\bm f}_i(C,X)~,
522:   \end{eqnarray}
523: of the square of the kurtoses, 
524: \begin{eqnarray}
525:   \label{eq:se1.1}
526: &&  {\bm f}_i(C,X)=\left[\frac{E((CX)_i^4)}{E((CX)_i^2)^2}-3\right]^2~.
527: \end{eqnarray}
528: %Computations needed for evaluating 
As in the last section, we want to find the saddle point
$D={\rm  e}^{\Delta}$ of
the expansion of ${\bm
  f}(D,Y)$ in
terms of $\{\Delta_{ij}\}$ up to the second order.
We do not describe the details of the calculations in this section;
they are
 carried out %accomplished
in almost the same way as in Section \ref{kurt1}.
First, the expansion of ${\bm
  f}_i(D,Y)$ is evaluated as
540: \begin{eqnarray}
541:   \label{eq:se4}
542:  {\bm f}_i(D,Y)
543: &=&(\kappa_i-3)^2 - 8\big[(\Delta+\frac{\Delta^2}{2})(
544: R^{(1)}\kappa_i-R^{(3)})\big]_{ii}(\kappa_i-3)\nonumber\\
545: &&+4\big[
546: \Delta (3U^{(2,i)}-\kappa_i  U^{(0,i)})\Delta'
547: \big]_{ii}(\kappa_i-3)
548: +16\big[
549: \Delta (R^{(1)}\kappa_i-R^{(3)})
550: \big]_{ii}^2
551: \nonumber\\
552: &&
553: +24(\kappa_i-3)\kappa_i\big[
554: \Delta R^{(1)}
555: \big]_{ii}^2
556: -32(\kappa_i-3)\big[
557: \Delta R^{(1)}
558: \big]_{ii}\big[
559: \Delta R^{(3)}
560: \big]_{ii}+O(\Delta^3)~.
561: \end{eqnarray}
562: Next, we introduce  $N\times N$ matrices $\bm K$, $\{{\bm
563:   V}^{(i)}|1\le i\le N\}$, 
564: $\bm S$, and $\bm Q$ 
565: defined respectively by
566: \begin{eqnarray}
567:   \label{eq:se5.9}
568: &&{\bm K}_{pq}=  2R^{(1)}_{pq}(\kappa_q-3)\kappa_q~,  
569: \end{eqnarray}
570: \begin{eqnarray}
571:   \label{eq:se6}
&&  {\bm V}^{(i)}=2(\kappa_i-3)(3U^{(2,i)}-\kappa_i  U^{(0,i)})~,
573: \end{eqnarray}
574: \begin{eqnarray}
575:   \label{eq:se6.001}
576:   {\bm S}={\rm diag}(2(\kappa_i-3))~,
577: \end{eqnarray}
578: and
579: \begin{eqnarray}
580:   \label{eq:se6.1}
581:  && {\bm Q}_{pq}=2(\kappa_q-3)(R^{(1)}_{pq}\kappa_q-R^{(3)}_{pq})~.
582: \end{eqnarray}
We also rename $Q$ of (\ref{eq:e6.1}) as $\bm q$ in order to avoid confusion:
584: \begin{eqnarray}
585:   \label{eq:se6.2}
586:  && {\bm q}_{pq}=(R^{(1)}_{pq}\kappa_q-R^{(3)}_{pq})~.
587: \end{eqnarray}
Now
we proceed to the expression in terms of the tensor product.
We can show that the gradient of the cost function has the
following expression:
592: \begin{eqnarray}
593:   \label{eq:se77}
594: &&  \frac{\partial {\bm f}({\rm e}^{\Delta},Y)}{\partial \Delta_{kl}}=
595:  -4[{\rm cs}({\bm Q})]_{l+N(k-1)}
596: +\big[
597: {\bm W}
598: {\rm cs}(\Delta)
599: \big]_{l+N(k-1)}+O(\Delta^2)~,\nonumber\\
600: \end{eqnarray}
601: where
602: \begin{eqnarray}
603:   \label{eq:se8}
604: {\bm W}&=&
-2\big(I_N\otimes {\bm Q}+{\bm Q}'\otimes I_N\big)
+4
\big\{\bigoplus_{i=1}^N {\bm V}^{(i)}\big\}
T
+
\bigg[24( I_N\otimes {\bm K})P(I_N\otimes R^{(1)})'
\nonumber\\&&
+32( I_N\otimes {\bm q})P(I_N\otimes {\bm q})'
-16 ( I_N\otimes R^{(1)}{\bm S})P(I_N\otimes R^{(3)})'
\nonumber\\&&
-16 ( I_N\otimes R^{(3)}{\bm S})P(I_N\otimes R^{(1)})'
616: \bigg]
617: T~.
618: \end{eqnarray}
This expression is completely analogous to (\ref{eq:e77}).
620: Thus  the coordinate $\Delta$ of the saddle point of the second order 
621: expansion  
622: is determined by
623: \begin{eqnarray}
624:   \label{eq:se20}
625: {\rm cs}(\Delta)=4
626: \Big[(I_{N^2}-P)
627: {\bm W}
628: (I_{N^2}-P)
629: +P
630: \Big]^{-1}
631: (I_{N^2}-P){\rm cs}({\bm Q})~.
632: \end{eqnarray}
633: %In many cases we obtain almost the same results through the two
634: %cost functions in Section \ref{kurt1} and Section \ref{kurt2}. 
635: %algorithms. 
In many cases the two cost functions in Section
\ref{kurt1} and Section \ref{kurt2} give almost the same results.
As implied at the beginning of this section,
the main difference between the two lies in the points where the kurtosis of
one of the components vanishes.
These points indeed constitute saddle points of
 the   cost function
${\bm f}$, while it is impossible to capture them by the
algorithm in Section \ref{kurt1}.
Thus, we must choose an appropriate method for each problem
with this difference in mind.
646: having this differnce in mind. 
647: %This  will be
648: %revisited  in Section
649: %{\ref{disc}}. 
650: 
651: 
652: \section{Iteration of updating}
653: \label{iteration}
Now we have obtained the updating rules. It is not necessary to tune the
learning rate. At first sight, (\ref{eq:e19})
and (\ref{eq:se20})
look complicated.
They are, however, easily implemented with numerical tools such as MATLAB.
(The source codes will be available from our Web-site. )
Starting from $C_0$,
$C_i$ for positive $i$ is determined by the left multiplication by
${\rm e}^{\Delta_i}$, where
$\Delta_i$ is determined by setting $Y=C_{i-1}X$,
i.e.,
664: i.e,
665: \begin{eqnarray}
666:   \label{eq:b1}
667:   C_t={\rm e}^{\Delta_{t}}{\rm e}^{\Delta_{t-1}}{\rm e}^{\Delta_{t-2}}\cdots{\rm e}^{\Delta_{1}}C_0~.
668: \end{eqnarray}
If $\Delta$ becomes sufficiently small, we can stop the iteration and exit the
process.
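A simple consequence of the multiplicative form (\ref{eq:b1}), stated
here as a remark of ours: since the diagonal elements of every
$\Delta_t$ vanish, ${\rm tr}\,\Delta_t=0$ and hence
\begin{eqnarray*}
&&\det C_t=\det({\rm e}^{\Delta_t})\det C_{t-1}
={\rm e}^{{\rm tr}\,\Delta_t}\det C_{t-1}=\det C_{t-1}~.
\end{eqnarray*}
Thus $\det C_t=\det C_0$ for all $t$ in exact arithmetic, so $C_t$ stays
nonsingular whenever $C_0$ is.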
671: 
672: \section{Second order convergence}
673: \label{secconv}
First, we carry over the notation of Section \ref{kurt1}.
The following discussion is, however, also valid for the algorithm in Section
\ref{kurt2} if we replace the quantities $f$, $W$, and so on by
their boldface counterparts.
678: Let us  start this section by introducing some additional notations. 
679: We set 
680: \begin{eqnarray}
681:   \label{eq:pr1}
  G= GL(N,{\bm R})
683: \end{eqnarray}
684: and 
685: \begin{eqnarray}
686:   \label{eq:prd2}
  K= GL(1,{\bm R})^{\oplus N}~.
688: \end{eqnarray}
689: We also define the coset space  $K\backslash G$ by
690: introducing  the equivalence relation 
691: \begin{eqnarray}
692:   \label{eq:pr3}
693: g' g^{-1}\in K
694: \Longleftrightarrow 
695:  g\sim g'
696: \end{eqnarray}
on $G$. That is, $K\backslash G\cong\{Kg|g\in G\}$.
698: Our method is 
699: understood as 
700: an orthodox adaptation of the Newton method to this 
701: coset space $K\backslash G$. 
702: Note that  
703: the cost function $F(\cdot)\teigi f(\cdot,Y)$ on $G$ 
704: % defined by (\ref{eq:e1})
705: %and (\ref{eq:e1.1})
706:  satisfies the relation 
707: \begin{eqnarray}
708:   \label{eq:pr4}
709: F(g)=F(Kg)~.
710: \end{eqnarray}
So $F$ is naturally regarded as a function on $K\backslash G$.
This is the reason for our choice of the cost function.
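This invariance can be verified directly from (\ref{eq:e1.1}) (a short
check; $\Lambda$ and $\lambda_i$ are notation introduced here). For
$\Lambda={\rm diag}(\lambda_1,\cdots,\lambda_N)$ with all $\lambda_i\ne 0$
we have $(\Lambda CX)_i=\lambda_i(CX)_i$, so that
\begin{eqnarray*}
&&f_i(\Lambda C,X)=\frac{\lambda_i^4E((CX)_i^4)}{\big(\lambda_i^2E((CX)_i^2)\big)^2}
=f_i(C,X)~.
\end{eqnarray*}
The same holds for the cost function of Section \ref{kurt2}, which depends
on $C$ only through these ratios.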
Thus, the second-order convergence immediately follows if
the correction to the error arising from the change of coordinates,
which results from the multiplicative nature of the update, is properly evaluated.
716: 
717: At time $t$, a point $g$ on $K\backslash G$ is specified by
718: the coordinate $X^{(t)}(g) \in{\frak m}$ such that 
719: \begin{eqnarray}
720:   \label{eq:prf101}
721:   {\rm e}^{X^{(t)}(g)}C_t\sim g~,
722: \end{eqnarray}
where $\frak m$ is the set of $N\times N$ matrices whose diagonal
elements are zero.
In fact, the existence of such a coordinate is not self-evident; the proof
will be given
elsewhere.
728: Define $F_t$, the representation of the cost function at $t$,   by
729: \begin{eqnarray}
730:   \label{eq:prf102}
731:   F_t(X)=F(  {\rm e}^{X}C_t)~.
732: \end{eqnarray}
Here we introduce an $(N^2-N)\times N^2$ matrix $\tilde P$ by
removing the $(i+N(i-1))$-th rows, $1\le i\le N$, from the $N^2\times N^2$
unit matrix.
We will denote by $H^{(t)}$ the Hessian,
\begin{eqnarray}
  \label{eq:prf102.11}
  H^{(t)}_{kl}=\frac{\partial^2 F_t(X)}
{\partial ({\tilde P}{\rm cs}(X))_k\partial ({\tilde P}{\rm cs}(X))_l}~.
\end{eqnarray}
742: Note that if we set
743: \begin{eqnarray}
744:   \label{eq:prf103}
745: h_t(X)=\left.
746: T\bigg((I_{N^2}-P)
747: W(I_{N^2}-P)
748: +P\bigg)\right|_{C={\rm e}^X C_t}~,
749: \end{eqnarray}
750: the Hessian is written as
751: \begin{eqnarray}
752:   \label{eq:prf103.1}
  H^{(t)}={\tilde P}h_t{\tilde P}' ~.
754: \end{eqnarray}
Suppose that in some neighborhood of the optimal solution $g_*$,
$H^{(t)}(X)$
is Lipschitz continuous for some $t$:
\begin{eqnarray}
  \label{eq:prf104}
  ||H^{(t)}(X)-H^{(t)}(X')||\le L ||X-X'||~,
\end{eqnarray}
where $||A||$ is the Euclidean (Frobenius) norm of a matrix $A$,
763: \begin{eqnarray}
764:   \label{eq:norm1}
765:   ||A||^2={\rm tr}(AA^{\dagger})~.
766: \end{eqnarray}
767: We set
\begin{eqnarray}
  \label{eq:prf104.001}
\beta=||H^{(t)}(X^{(t)}(g_*))^{-1} ||  ~.
\end{eqnarray}
There exists a positive real number $r$
 for which
% neighborhood of $g_*$,
\begin{eqnarray}
  \label{eq:prf104.002}
  ||H^{(t)}(X^{(t)}(g))^{-1} || <2\beta~~\mbox{\rm for}~
\forall g\in B^{(t)}(g_*,r)\teigi\bigg\{g\bigg|r> ||X^{(t)}(g)-X^{(t)}(g_*)||~\bigg\}
\end{eqnarray}
 is satisfied.
Then
it is known that,
if $C_t\in B^{(t)}(g_*,{\rm min}(r,(2\beta L)^{-1}))$,
\begin{eqnarray}
  \label{eq:prf104.003}
  ||X^{(t)}(C_{t+1})-X^{(t)}(g_*)||\le  \beta L ||X^{(t)}(C_{t})-X^{(t)}(g_*)||^2
\end{eqnarray}
and
\begin{eqnarray}
  \label{eq:prf104.004}
  ||X^{(t)}(C_{t+1})-X^{(t)}(g_*)||\le \frac{1}{2} ||X^{(t)}(C_{t})-X^{(t)}(g_*)||
\end{eqnarray}
are fulfilled. Thus the second order convergence in this norm is shown.
794: Unfortunately, this norm is not invariant and is  unnatural. 
795: (A natural   metric on $K\backslash G$
796:  is  one which is  invariant  under the parallel transformation,  
797: %where the parallel transformation 
798: which is induced by the action
799:  of  elements in $K\backslash G$
 from the right-hand side.) It nevertheless suffices in practice.
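
As a purely numerical illustration of the quadratic bound (\ref{eq:prf104.003}) (the values
are hypothetical and chosen only for convenience): if $\beta L=1$,
$C_t$ lies in the ball considered above, and
$||X^{(t)}(C_{t})-X^{(t)}(g_*)||=10^{-2}$, then
\begin{eqnarray*}
&&||X^{(t)}(C_{t+1})-X^{(t)}(g_*)||\le \beta L\,(10^{-2})^2=10^{-4}~,
\end{eqnarray*}
that is, the number of correct digits roughly doubles in a single step.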
801: 
802: 
803: \section{Discussions}
804: \label{disc}
805: \subsection{Nonholonomy?}
806: Our method is  related to the nonholonomic method 
807: by 
Amari, Chen, and Cichocki\cite{amari-chen-cichocki1}.
809: In essence our method is a Newton
810:  approach to the same problem, the optimization without prewhitening. 
811: Let us   set 
812: \begin{eqnarray}
813:   \label{eq:conc11}
814:   {\rm e}^{z} = {\rm e}^{x}{\rm e}^{y}
815: \end{eqnarray}
for $x,y\in {\frak gl}(N,{\bm R})$.
Then it is obvious that $z$ does not necessarily belong to $\frak m$
 even if $x,y\in {\frak m}$ (that is,
 the $z_{ii}$ do not always vanish
when $x_{ii}=y_{ii}=0$ for $1\le i\le N$).
821: This may be explained by using the concept of nonholonomy. 
822:  The degree of freedom in each step, however, equals the dimension
823: of the space $K\backslash G$ in our setting. The nonholonomic nature 
824: emerges when we go back to $G=GL(N,{\bm R})$ again. 
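The leading contribution to the diagonal of $z$ can be made explicit with the
Baker-Campbell-Hausdorff formula (a standard result, quoted here for
illustration), $z=x+y+\frac{1}{2}[x,y]+\cdots$ with $[x,y]=xy-yx$.
For example, for $N=2$ and real numbers $a$, $b$,
\begin{eqnarray*}
&&x=\left(\begin{array}{cc}0&a\\0&0\end{array}\right),~~
y=\left(\begin{array}{cc}0&0\\b&0\end{array}\right)
~~\Longrightarrow~~
[x,y]=\left(\begin{array}{cc}ab&0\\0&-ab\end{array}\right)~,
\end{eqnarray*}
so that $z_{11}=-z_{22}=ab/2$ up to higher-order terms, although
$x,y\in{\frak m}$.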
825: 
826: There exist several
827: studies\cite{takeuchi1,helgason1,helgason2,helgason3,akuzawa5} which
828: deal with
829: cosets
830: like $K\backslash G$  or the right coset $G/K$
831:  when $K$ is a maximal compact subgroup of $G$. 
832: Unfortunately, what we are studying is the case where $K$ is not a
833: maximal compact subgroup of $G$. 
834: So, for example
835: it is necessary to show 
836:  whether the  coordinate (\ref{eq:prf101})  is justified or not. 
837: As mentioned above, further studies including this justification 
838:  will  appear  elsewhere. 
839: 
840: \subsection{Global convergence}
841: % On the other hand, 
The first few steps must be treated carefully, since this method inherits
the somewhat undesirable global convergence property of
the Newton method. Fortunately,
there exist methods which can
handle the early stage. For example, the nonholonomic gradient
 method\cite{amari-chen-cichocki1}
may be applicable.
Another possibility is to construct a nonholonomic fixed-point
 algorithm which uses the kernel method.
852: These methods are suitable for  capturing the optimal point which
853:  contains components with zero kurtoses. There
854:  we must, of course,     use the method in Section \ref{kurt2}.  
855: If it is not necessary to worry about these zero kurtosis components, 
856: there is little difference between the two methods described in 
857: Section \ref{kurt1} and Section \ref{kurt2}. 
858: 
859: \subsection{Conclusions}
We have constructed a new algorithm for finding an optimal point in a
861: matrix space, where we have  introduced a new   
862: multiplicative updating method. 
863: %does not 
864: %
865: %requie 
866: %prewhitening. 
867: The algorithm is in essence the Newton method on a
868: coset. 
869: So it converges quite rapidly and it can capture the saddle point. 
870: Since it does not require prewhitening, 
871: it is not necessary to worry about the error resulting from the 
872: prewhitening.  
873: Indeed, it is  possible to increase
874: the kurtosis slightly  for data preprocessed by 
875: the FastICA\cite{fastica1}. 
876: 
877: 
878: \begin{thebibliography}{8}
879: 
880: \bibitem[A.Hyv\"arinen,1997]{hyvarinen1}
881: A.Hyv\"arinen (1997).
882: \newblock A Fast Fixed-Point Algorithm for Independent Component Analysis.
883: \newblock {\em Neural Computation\/}, {\em 9\/}, 1483--1492.
884: 
885: \bibitem[Amari {\em et~al.\/},1997]{amari-chen-cichocki1}
886: Amari, S., Chen, T.-P., \& Cichocki, A. (1997).
887: \newblock Non-holonomic Constraints in Learning Algorithms for Blind Source
888:   Separation.
889: \newblock {\em preprint\/}.
890: 
891: \bibitem[Hurri {\em et~al.\/},1998]{fastica1}
Hurri, J., G\"avert, H., S\"arel\"a, J., \& Hyv\"arinen, A. (1998).
893: \newblock FastICA package for MATLAB.
894: \newblock http://www.cis.hut.fi/projects/ica/fastica/.
895: 
896: \bibitem[M.Takeuchi,1994]{takeuchi1}
897: M.Takeuchi (1994).
898: \newblock {\em Modern Spherical Functions\/}.
899: \newblock Amer. Math. Soc.
900: 
901: \bibitem[S.Helgason,1962]{helgason2}
902: S.Helgason (1962).
903: \newblock {\em Differential Geometry and Symmetric Spaces\/}.
904: \newblock Academic Press.
905: 
906: \bibitem[S.Helgason,1978]{helgason1}
907: S.Helgason (1978).
908: \newblock {\em Differential Geometry, Lie Groups and Symmetric Spaces\/}.
909: \newblock New York: Academic Press.
910: 
911: \bibitem[S.Helgason,1984]{helgason3}
912: S.Helgason (1984).
913: \newblock {\em Groups and Geometric Analysis\/}.
914: \newblock Academic Press.
915: 
916: \bibitem[T.Akuzawa \& M.Wadati,1998]{akuzawa5}
917: T.Akuzawa \& M.Wadati (1998).
918: \newblock Diffusions on symmetric spaces of type A${\rm I\!I\!I}$ and random
919:   matrix theories for rectangular matrices.
920: \newblock {\em J.Phys.A\/}, {\em 31\/}, 1713--1732.
921: 
922: \end{thebibliography}
923: 
924: 
925: \appendix
\section*{Appendix}
\section{Proof of (\ref{eq:e8f})}
928: \label{app:prf}
929: \begin{quote}
930: %{Proof}:
For $B\in GL(N,{\bm F})$ and $1\le i,j\le N$, with summation over the
repeated indices $p$ and $q$ understood,
\begin{eqnarray}
  \label{eq:proof1}
[  T(X\otimes Y)T {\rm cs}(B)]_{i+N(j-1)}
&=&[  (X\otimes Y)T {\rm cs}(B)]_{j+N(i-1)}\nonumber\\
&=&X_{ip}Y_{jq}(B')_{qp}
=(YB'X')_{ji}~.
\end{eqnarray}
On the other hand
\begin{eqnarray}
  \label{eq:proof12}
[  (Y\otimes X) {\rm cs}(B)]_{i+N(j-1)}
&=&Y_{jp}X_{iq}B_{qp}
=(YB'X')_{ji}~.
\end{eqnarray}
946: This proves the statement since $\rm cs$  is bijective. $\Box$
947: \end{quote}
948: 
949: 
950: 
951: 
952: 
953: 
954: 
955: 
956: \end{document}