1: \documentclass[a4,12pt]{article}
2: \usepackage{latexsym}
3: %\usepackage[light]{draftcopy}
4: \usepackage{epsfig}
5: %\usepackage{bookman}
6: %\usepackage{graphicx}
7: \oddsidemargin=0cm
8: \evensidemargin=0cm
9: \textwidth=16cm
10: \paperwidth=21cm
11: \textwidth=18.6cm
12: %\textheight=24.7cm
13: \oddsidemargin=-0.5in
14: \evensidemargin=-0.5in
15: %\topmargin=-0.6in
16: \usepackage{amsmath,amstext,amsfonts}
17: \def\bm#1{\mbox{\boldmath $#1$}}
18: \def\teigi{\stackrel{\rm def}{=}}
19: \def\hatena{\stackrel{\boldmath ?}{=}}
20: %%
21: \makeatletter
22:   \renewcommand{\theequation}{%
23:      \thesection.\arabic{equation}}
24:   \@addtoreset{equation}{section}
25: \makeatother 
26: \tolerance=6000
27: 
28: 
29: 
30: 
\title{Multiplicative Algorithm for Orthogonal Groups \\and 
Independent Component Analysis }
33: \author{Toshinao {\sc
34:     Akuzawa}\thanks{akuzawa@brain.riken.go.jp}\vspace{0.3cm}\\
35: \vspace{0.5cm}\\
36: Brain Science Institute \\
37: {\it RIKEN}\\
38: %%{\small(The Institute of Physical and Chemical Research)}\\
39: {\small 2-1 Hirosawa, Wako, Saitama 351-0198, Japan}}
40: \date{{\it \today}}
41: 
42: \begin{document}
43: \maketitle
44: \abstract{
45: The multiplicative Newton-like method developed by the author\:{\it et\:al.}
46: is extended to
47: the situation  where the dynamics  is restricted  to
48:  the orthogonal group. 
49: A general framework is
50: constructed  without specifying the cost function. 
51: Though the restriction to the orthogonal groups   makes the problem 
52:  somewhat complicated, 
53: an explicit expression for the amount of  individual jumps is obtained. 
54: This algorithm is exactly second-order-convergent. 
55: The global instability inherent in the Newton method is remedied by
56:  a Levenberg-Marquardt-type variation.  
57: The method thus constructed  can readily be applied to the independent
58: component analysis. 
59: Its remarkable performance is illustrated by a
60: numerical simulation. 
61: 
62: 
63: % In the case of the independent component analysis  the restriction
64: % corresponds to the prewhitening of the  data. 
65:  }
66: \section{Overview}
67: \label{intro}
68: Many optimization problems take the form, 
69: ``Find an optimal matrix under the constraints (1).. (2).. {\it etc}."
70: Some of these can be considered as optimizations on Lie groups. 
71: For groups, the fundamental manipulation
72: is a multiplication whereas an addition is unnatural. 
73: %(Imagine the compound interest rate on your bank account.)
74: In consideration of this fact, 
75: we have constructed a multiplicative Newton-like algorithm 
76: for maximizing the kurtosis (a good barometer for the independence) in 
77: \cite{akuzawa8}.  There the dynamics takes place on the coset 
78: $GL(1,{\Bbb R})^{N}\backslash GL(N,{\Bbb R})$. 
79: We can apply the techniques
80: developed in \cite{akuzawa8} to many other optimization problems. 
81: The coset structure $GL(1,{\Bbb R})^{N}\backslash GL(N,{\Bbb R})$ is,
82: however,
characteristic of the independent component
analysis (ICA). This is understood 
from the fact that independence has nothing to do with scaling. 
86: The redundancy
87: resulting from the invariance of the model under the componentwise scaling 
88: must be eliminated for a rigorous discussion and this redundancy
89: corresponds 
90: to $GL(1,{\Bbb R})^{N}$. 
91: 
92: Another way to eliminate this redundancy is the
93: prewhitening. 
94: The prewhitening is a linear transformation of the observed data 
95: which  maps
96: the covariance matrix to  the unit matrix. 
If we deal with prewhitened data, we can legitimately restrict
the search to the orthogonal group. 
99: The aim of this letter is the construction of a multiplicative
100: algorithm
101: for the orthogonal groups. 
102: 
103: 
104: The framework is  as follows. 
105: %Suppose that 
106:  $N$-dimensional prewhitened random variables 
107:  $\{X_i|1\le i \le N\}$ are available
108: and it is anticipated that their origins  are 
109:  some unknown mutually independent components $\{Y_i^*|1\le i \le N\}$.
110: The goal of the ICA is the map
111:  $\{X_i\} \mapsto \{Y_i^*\}$. 
112: We restrict ourselves to 
113:  the linear independent component analysis. 
114: There  
115: we want to find a linear transformation $C^*:{X}=(X_1,\cdots,X_N)'\mapsto
116: { Y^*}=(Y_1^*,\cdots,Y_N^*)'=C^*{ X}$ which 
117:  minimizes some cost function that measures the independence. 
118: Since we are assuming that the data is already prewhitened, the
119: covariance matrix of $X$ is the $N\times N$ unit matrix.
120: If we do not take into account  errors in the prewhitening, 
121: the optimal  point $C^*$ must belong to $O(N)$.  
122: 
123: 
124: Giving up the analytical solution, 
125: we consider a sequence, 
126: \begin{eqnarray}
127:   \label{eq:intro1}
128:   C(0),~ C{(1)},~ C{(2)},~ C{(3)},~\cdots\cdots~, 
129: \end{eqnarray}
130:  which converges to the optimal solution $C^*$.  
131: The sequence  $\{C(t)\}$ 
132: % which specifies the
133: %linear transformation 
134: is generated by the left-multiplication of another sequence of
135: orthogonal  
136: matrices $\{D(t)\}$. 
137: Each $D(t)$  is specified by the coordinate 
138: $\Delta(t)$ which satisfies $D(t)={\rm e}^{\Delta(t)}$. 
139: We assume that $\Delta(t)$ is 
140: an $N\times N$  skew-symmetric 
141: matrix,  
142: which  implies that  $D(t)$ belongs to
143: the identity component of $O(N)$. 
144: In practice the procedure  is as follows.
145: As an initial condition we set $C(0)$. 
146: For  $t>0~(t\in{\Bbb N}^{+})$, 
we introduce %an  $N\times N$ matrix 
$\Delta(t)$ and 
define $C({t+1})$ by $C({t+1})={\rm e}^{\Delta({t})}C(t)$.
150: Under these settings, we determine $\Delta(t)$ by using the Newton method 
151: %for the second order
152: %expansion of the cost function 
153: with respect to 
154: the matrix elements of 
155: $\Delta(t)$. That is,
156:  we evaluate the cost function at $C({t+1})$ 
157: by  expanding it  around $C({t})$ 
158: in terms of  the elements of
159: $\Delta({t})$ up to the second order. 
 Then $\Delta(t)$ is chosen as the (unique) critical point of 
161: this second order
162: expansion.  
163: We iteratively follow these procedures until we obtain a satisfactory
164: solution. 
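
The multiplicative structure of this procedure can be sketched numerically. 
The following Python fragment is only an illustration under stated assumptions: 
the helper names \verb|expm_skew| and \verb|multiplicative_update| are hypothetical, 
and the skew-symmetric steps are random placeholders rather than Newton steps. 
It shows that the rule $C(t+1)={\rm e}^{\Delta(t)}C(t)$ with skew-symmetric
$\Delta(t)$ keeps $C(t)$ orthogonal with determinant one. 
\begin{verbatim}
```python
import numpy as np

def expm_skew(delta):
    """Matrix exponential of a real skew-symmetric matrix (illustrative helper)."""
    # -1j * delta is Hermitian, so eigh returns real eigenvalues w and a unitary V
    # with delta = V diag(1j * w) V^H, hence exp(delta) = V diag(exp(1j * w)) V^H.
    w, V = np.linalg.eigh(-1j * delta)
    return ((V * np.exp(1j * w)) @ V.conj().T).real

def multiplicative_update(C, delta):
    """One step C(t+1) = exp(Delta(t)) C(t)."""
    return expm_skew(delta) @ C

rng = np.random.default_rng(0)
N = 4
C = np.eye(N)                        # C(0): any orthogonal matrix would do
for t in range(5):
    B = 0.3 * rng.standard_normal((N, N))
    delta = B - B.T                  # placeholder skew-symmetric step (assumption)
    C = multiplicative_update(C, delta)

# Deviation of C from orthogonality; expected to be at rounding-error level.
orth_error = np.abs(C @ C.T - np.eye(N)).max()
print(orth_error)
```
\end{verbatim}
An additive update $C(t)+\Delta(t)$ would in general leave the orthogonal
group, which is the motivation for the multiplicative form. 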
165: 
166: This letter is organized as follows. 
167: In Section \ref{sec:mult} 
168: we will give a complete description of 
169: a new  multiplicative updating method for the orthogonal groups. 
170: This section  is the main part of this letter. Since our formulation 
171: does not depend on the details of the  cost function
172: the method can be useful for many problems other than the ICA. 
173: The performance of
174: our method including the second-order-convergence is discussed in
175: Section \ref{sec:per1}.   
176: Section \ref{sec:appl} is a survey of possible applications of our
177: method. 
178: The algorithm constructed in Section \ref{sec:mult} 
179: is considered as  a pure-Newton method on the orthogonal groups.
To achieve global convergence, we must modify the method. This is 
181: accomplished  in 
182: Section \ref{sec:practice}. Section 
183: \ref{sec:practice} also includes a numerical examination of 
184: the performance of our
185: method. Section \ref{sec:summ} is a summary. 
186: 
187: \section{Multiplicative updating on $O(N)$}
188: \label{sec:mult}
189: We assume that the  cost function $F$ takes the form, 
190: \begin{eqnarray}
191:   \label{eq:a1}
192:   F(Y)=\sum_{i=1}^NE(f_i(Y_i))~,
193: \end{eqnarray}
194: where each $f_i:{\Bbb R}\rightarrow{\Bbb R}$ is an unspecified function. 
Throughout this letter we denote by $E(\cdot)$ the expectation. 
196: We will determine 
197: the concrete procedures 
198: %amount of  each step 
199:  after  the Newton manner.  
First, we introduce
maps $R$ and $U_{i}$ $(1\le i\le N)$ from 
$N$-dimensional 
datasets to $N \times N$ matrices
by 
206: \begin{eqnarray}
207:   \label{eq:a2}
208:   [R(Y)]_{ki}=E\left(\frac{\partial f_i(Y_i)}{\partial Y_i}Y_k\right)
209: \end{eqnarray}
210: and
211: \begin{eqnarray}
212:   \label{eq:a3}
213: [ U_{i}(Y)]_{kl}=U_{ikl}(Y)= E\left(\frac{\partial^2 f_i(Y_i)}{\partial
214:   Y_i^2}Y_k Y_l\right)~.
215: \end{eqnarray}
216: The goal is  the construction of  a sequence
217: $\{Y(t)\}$  of the estimates of the independent components, which
218: converges to the optimal point $Y^*$.  
219: %We suppose that
220: Within the framework of the linear analysis, we consider that 
221:  this sequence is derived from another sequence 
222:  $\{C(t)\}$ of the linear transformation by the relation
223: $Y(t)=C(t)X$,  
224: where $X$ are the original data. Thus if we  restate the problem,
225:  the task is to 
226: determine 
227: a  sequence  $\{C(t)\}$. 
228: We assume that
229: for each $t\in {\Bbb N}^{+}$
 the estimates of the independent components at time $t$ 
and the estimates 
232: at time $t+1$ are related by 
233: \begin{eqnarray}
234:   \label{eq:a4}
235:   Y{(t+1)}=D{(t)}Y{(t)}~
236: \end{eqnarray}
237: or equivalently
238: \begin{eqnarray}
239:   \label{eq:a4bb}
240:   C{(t+1)}=D{(t)}C{(t)}~,
241: \end{eqnarray}
242: where $D{(t)}$  is  some orthogonal matrix to be fixed. 
243: Our method is characterized by this left-multiplicative updating rule. 
244: As mentioned in the previous section,
245: we  assume that 
246: each $D(t)$   always belongs to the identity component of the 
247: orthogonal group $O(N)$. 
248: This assumption is reasonable, for example, if the  original data $X$
249: are already prewhitened in the case of the ICA. 
250: % we suppose that the original data $X$ are already prewhitened. 
251: %In this case  we can legitimately
252: Anyway, under this restriction 
253:  $D{(t)}$ is specified by an $N\times N$ anti-symmetric matrix $\Delta{(t)}$,
254: which satisfies
255: \begin{eqnarray}
256:   \label{eq:a5}
257:   \exp(\Delta{(t)})=D{(t)}~.
258: \end{eqnarray}
259: For brevity's sake we will omit the argument $(t)$ and denote $Y(t+1)$ by $Z$. 
260: $F(Z)$ is expanded in terms of $\{\Delta_{ij}\}$ as 
261: \begin{eqnarray}
262:   \label{eq:a6}
263:   F(Z)=F(Y)+{\rm tr}(\Delta R(Y))+{\rm
264:   tr}\left(\frac{\Delta^2}{2}R(Y)\right)
265: +\frac{1}{2}\sum_{i,k,l}\Delta_{ik}\Delta_{il}U_{ikl}(Y)
266: +O(\Delta^3)~. 
267: \end{eqnarray}
268: %By partially differentiating (\ref{eq:a6}),  
Throughout this letter
 we denote by $O(\Delta^k)$ polynomials in the matrix elements of $\Delta$ 
which do not contain terms of degree less than $k$. This should not
 be confused with the symbol for the orthogonal groups such as $O(N)$. 
273: As in the usual Newton method, 
274: we truncate the expansion (\ref{eq:a6}) at the second order with
275: respect to $\{\Delta_{ij}\}$.
276:  Then $\Delta$ in this step is determined as the coordinate of  the
277: critical point of this truncated expansion. 
278: The partial derivative of (\ref{eq:a6}) is more convenient for the purpose. 
279: It reads
280: \begin{eqnarray}
281:   \label{eq:a7}
282:   \frac{\partial F(Z)}{\partial \Delta_{kl}}
283: =R_{lk}+\frac{1}{2}\left[\Delta R+R\Delta
284: \right]_{lk}+\sum_p \Delta_{kp}U_{klp}+O(\Delta^2)~, 
285: \end{eqnarray}
286: where we have omitted the argument $Y$ for $R$ and $U$. 
287: Now let us  introduce a map $\rm cs$ (the column string) as in the previous
288: article
289: \cite{akuzawa8}:
290: \begin{eqnarray}
291:   \label{eq:a14}
292:   {\rm Mat}(N,{\Bbb F}) &\rightarrow& {\Bbb F}^{N^2}\\
293: A=\left(
294:   \begin{array}{cccc}
295:  A_{11}& A_{12}&\cdots &A_{1N}\\
296: A_{21} &\multicolumn{3}{c}{\dotfill}\\
297: \multicolumn{4}{c}{\dotfill}\\
298: A_{N1} &\multicolumn{2}{c}{\dotfill}&A_{NN}
299:   \end{array}
300: \right) &\mapsto& 
301: {\rm cs}(A)=
302: (A_{11}~ A_{21}~ \cdots~ A_{N1}~~ A_{12}~ A_{22}~\cdots ~A_{NN})'~,\nonumber
303: \end{eqnarray}
304: where  ${\rm Mat}(N,{\Bbb F})$ is  
305:  $N\times N$ matrices on some unspecified field $\Bbb F$.
We denote the transposition by the superscript $\prime$ and
the complex conjugate by $\dagger$. 
308: For the orthogonal groups it is rather simple to move to
309: the framework of the column string as compared to the case of
310: $GL(1,{\Bbb R})^N\backslash GL(N,{\Bbb R})$: 
311: By neglecting  $O(\Delta^2)$ terms, 
312: the right-hand-side of (\ref{eq:a7}) is straightforwardly  rewritten as 
313: \begin{eqnarray}
314:   \label{eq:a8}
315: R_{lk}&+&\frac{1}{2}\left[\Delta R+R\Delta
316: \right]_{lk}+\sum_p \Delta_{kp}U_{klp}\nonumber\\
317: &&=\left[{\rm cs}(R)+\frac{1}{2}\left(
318: R'\otimes I_N+I_N\otimes R
319: \right) {\rm cs}(\Delta)+
320: \big(\bigoplus_k U_k\big) T{\rm cs}(\Delta)\right]_{l+(k-1)N}~,
321: \end{eqnarray}
322: where
323:  the symbol ``$\bigoplus$'' stands for the direct sum,
324: \begin{eqnarray}
325:   \label{eq:tiu1}
326: \bigoplus_{k=1}^N U_{k}=
327: \left(
328:   \begin{array}{lllll}
329: U_1 & 0 & \multicolumn{2}{c}{\cdots\cdots} & 0\\ 
330: 0& U_2 & 0 & \multicolumn{2}{c}{\cdots\cdots}\\
331:  \multicolumn{5}{c}{\dotfill}\\
332:  \multicolumn{5}{c}{\dotfill}\\
333:  0& \multicolumn{2}{c}{\cdots\cdots}& U_{N-1}& 0   \\
334: 0& 0& \multicolumn{2}{c}{\cdots\cdots}& U_{N}   \\
335:    \end{array}
336: \right)~,
337: \end{eqnarray}
338:  $T$ is  an $N^2\times N^2$ matrix 
339: defined by 
340: \begin{eqnarray}
341:   \label{eq:a15}
342:   {\rm cs}(A')=T{\rm cs}(A) ~\mbox{\rm for~} A\in  {\rm Mat}(N,{\Bbb F})~,
343: \end{eqnarray}
344: and $I_N$ is the $N\times N$ unit matrix. 
345: We denote  the tensor product by  $\otimes$  as usual. 
346:  The ``transposition'' $T$ is also considered as 
347: an intertwiner between  two equivalent representations: 
348: \begin{eqnarray}
349: %\nonumber  
350: T(A\otimes B)T=B\otimes A~.
351: \end{eqnarray}
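As a small worked illustration (added here, not part of the original
derivation), take $N=2$. Then
\begin{eqnarray*}
{\rm cs}(A)=(A_{11}~ A_{21}~ A_{12}~ A_{22})'~,\qquad
{\rm cs}(A')=(A_{11}~ A_{12}~ A_{21}~ A_{22})'~,
\end{eqnarray*}
so that $T$ merely exchanges the second and the third components:
\begin{eqnarray*}
T=\left(
  \begin{array}{cccc}
1&0&0&0\\
0&0&1&0\\
0&1&0&0\\
0&0&0&1
  \end{array}
\right)~.
\end{eqnarray*}
In this case one checks directly that $T'=T$ and $T^2=I_4$. 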
The orthogonal group $O(N)$ has fewer degrees of freedom than the
general linear group. 
The canonical basis of the Lie algebra, ${\frak o}(N)$, of $O(N)$ consists of
$N(N-1)/2$
anti-symmetric
357: matrices. We will introduce some operators which enable us to move to
358: the coordinates based on  
359: the canonical basis on ${\frak o}(N)$. 
360: In the first place, we introduce an $N^2\times N^2$ matrix  $H$ by
361: \begin{eqnarray}
362:   \label{eq:a9}
363: H=\sum_{i>j}H^{(i,j)}~,
364: \end{eqnarray}
365: where $H^{(i,j)}$ is a $\pi/4$  rotation between 
366:  the $j+N(i-1)$-th component and the $i+N(j-1)$-th component:
367: \begin{eqnarray}
368:   \label{eq:a10}
369: H^{(i,j)}_{kl}=  \left\{
370:   \begin{array}{ccl}
\frac{1}{\sqrt{2}}~~~~&\mbox{\rm for}&k=j+N(i-1),~~l=j+N(i-1)\\
-\frac{1}{\sqrt{2}}~~~~&\mbox{\rm for}&k=j+N(i-1),~~l=i+N(j-1)\\
\frac{1}{\sqrt{2}}~~~~&\mbox{\rm for}&k=i+N(j-1),~~l=j+N(i-1)\\
\frac{1}{\sqrt{2}}~~~~&\mbox{\rm for}&k=i+N(j-1),~~l=i+N(j-1)\\
375: 0~~~~&&\mbox{\rm otherwise. }
376:   \end{array}
377: \right. 
378: \end{eqnarray}
379: The projection  operator $P_D$,
380: \begin{eqnarray}
381:   \label{eq:a18}
382: P_D&=&{\rm diag}(p_1,\cdots,p_{N^2})~,\nonumber\\
383: &&\left\{
384: \begin{array}{ll}
385:  p_k=1 ~~~\mbox{\rm for}~~ k=N(i-1)+i,1\le i\le N~\\
386:  p_k=0~~~~ \mbox{\rm otherwise}~,
387: \end{array}
388: \right.
389: \end{eqnarray}
390:  is used to extract the diagonal
391: elements of a matrix from its image by $\rm cs$. 
392: Then the coordinate transformation 
393: is realized by a multiplication of
394: \begin{eqnarray}
395:   \label{eq:a12.1}
396:   H+P_D~
397: \end{eqnarray}
398: to   column string vectors. 
399: We need to introduce two more 
400: projection operators $P_S$ and $P_A$  defined by
401: \begin{eqnarray}
402:   \label{eq:a11}
403:   P_S&=&{\rm diag}(p_1,p_2,\cdots,p_{N^2})\\
404: P_A&=&{\rm diag}(1-p_1,1-p_2,\cdots,1-p_{N^2})~,%\nonumber
405: \end{eqnarray}
406: where
407: \begin{eqnarray}
408:   \label{eq:a12}
409:   p_k=\left\{
410:     \begin{array}{ccl}
411: 1&\mbox{\rm if}&{}^{\exists}(i,j);~~ j\le i~~ \mbox{\rm and}~~k=i+N(j-1)\\
412: 0&&\mbox{\rm otherwise}. 
413:     \end{array}
414: \right.
415: \end{eqnarray}
By the left-action of $P_S$ and $P_A$ on
417:  column string vectors rotated by $H+P_D$
418: we can extract,  
419: %The projection operators  $P_S$ and $P_A$  are used to extract,
420: respectively, 
421:  the symmetric components
422:  and the anti-symmetric components of the matrices.
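For illustration (a worked example added here), consider $N=2$. 
The only pair with $i>j$ is $(i,j)=(2,1)$, for which 
$j+N(i-1)=3$ and $i+N(j-1)=2$, and the diagonal elements sit at the first and
the fourth components. Hence
\begin{eqnarray*}
H+P_D=\left(
  \begin{array}{cccc}
1&0&0&0\\
0&\frac{1}{\sqrt{2}}&\frac{1}{\sqrt{2}}&0\\
0&-\frac{1}{\sqrt{2}}&\frac{1}{\sqrt{2}}&0\\
0&0&0&1
  \end{array}
\right)~,\qquad
(H+P_D){\rm cs}(\Delta)=
\left(\Delta_{11}~~ \frac{\Delta_{21}+\Delta_{12}}{\sqrt{2}}~~
\frac{\Delta_{12}-\Delta_{21}}{\sqrt{2}}~~ \Delta_{22}\right)'~,
\end{eqnarray*}
with $P_S={\rm diag}(1,1,0,1)$ and $P_A={\rm diag}(0,0,1,0)$. 
For an anti-symmetric $\Delta$ with $\Delta_{12}=-\Delta_{21}=\theta$ the
rotated vector is $(0~~ 0~~ \sqrt{2}\,\theta~~ 0)'$, so only the component
selected by $P_A$ survives. 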
423: Then the conditions 
424: for the critical point of the second-order-expansion,
425: which must be  satisfied by $\Delta$,  are 
426: translated into the following two conditions.
427: First, symmetric components of $\Delta$ must vanish.
428:  This condition is expressed as 
429: \begin{eqnarray}
430:   \label{eq:11.91}
431: \left[(H+P_D){\rm cs}(\Delta)\right]_{j+(i-1)N}=0
432: \qquad\mbox{\rm for}\quad i\le j
433: \quad\bigg(\Longleftrightarrow
434: P_S(H+P_D){\rm cs}(\Delta)  =0\bigg)~.
435: \end{eqnarray}
436: Secondly, for the anti-symmetric components 
437: the condition for the critical point is transformed to 
438: \begin{eqnarray}
439:   \label{eq:a11.9}
440:   \left[(H+P_D){\rm cs}(R)+(H+P_D)W
441:  {\rm cs}(\Delta)
442: \right]_{j+(i-1)N}~=0 \qquad\mbox{\rm for}\quad i>j~,
443: \end{eqnarray}
444: where we have  set 
445: \begin{eqnarray}
446:   \label{eq:a14b}
447:   W=\frac{1}{2}\left(
448: R'\otimes I_N+I_N\otimes R
449: \right) +
450: \big(\bigoplus_k U_k\big) T~.
451: \end{eqnarray}
452: %Now 
453: %one can see that 
454: % (\ref{eq:a8}) 
455: The conditions (\ref{eq:11.91}) and (\ref{eq:a11.9}) are 
456: combined into an equation, 
457: \begin{eqnarray}
458:   \label{eq:a13-1}
459: P_A(H+P_D){\rm cs}(R)+
460:  \bigg[P_A (H+P_D) W (H+P_D)' P_A +P_S
461: \bigg](H+P_D) {\rm cs}(\Delta)
462: =0~.
463: \end{eqnarray}
464: Note that 
465: \begin{eqnarray}
466:   \label{eq:a12.2}
467:   P_A(H+P_D)=P_AH~.
468: \end{eqnarray}
469: The optimal $\Delta$ is immediately obtained from (\ref{eq:a13-1}): 
470: \begin{eqnarray}
471:   \label{eq:a13}
472:   {\rm cs}(\Delta)&=&
473: -(H+P_D)'\bigg[P_A (H+P_D) W (H+P_D)' P_A +P_S
474: \bigg]^{-1}P_A(H+P_D){\rm cs}(R)~\nonumber\\
475: &=&
476: -H'\left(P_A H W H' P_A +P_S
477: \right)^{-1}P_AH{\rm cs}(R)~.
478: \end{eqnarray}
479: Thus we have obtained the explicit updating rule. 
480: By iterating the procedure in this section  from a  starting point 
481: sufficiently close
482: to the 
483: optimal one, 
484:  the sequences  $\{C(t)\}$ and $\{Y(t)\}$ converge to 
485:  the optimal solutions. 
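
The construction of this section can be checked numerically. 
The following Python sketch is a minimal illustration under stated assumptions, 
not a definitive implementation: the function names are hypothetical, 
expectations are replaced by sample means, indices are 0-based, and the cost is
taken to be $f_i(y)=-y^4$ for every $i$ purely as an example. 
The test data are uniform variates mixed by a plane rotation and are not
rescaled to unit variance, since only the algebra of (\ref{eq:a13}) is being illustrated. 
\begin{verbatim}
```python
import numpy as np

def cs(A):
    """Column string: stack the columns of A into one vector."""
    return A.reshape(-1, order="F")

def build_operators(N):
    """Return T, H, P_S, P_A of the text (0-based index k = i + N*j)."""
    n2 = N * N
    T, H, p_s = np.zeros((n2, n2)), np.zeros((n2, n2)), np.zeros(n2)
    s = 1.0 / np.sqrt(2.0)
    for i in range(N):
        for j in range(N):
            T[j + N * i, i + N * j] = 1.0      # cs(A') = T cs(A)
            if j <= i:
                p_s[i + N * j] = 1.0           # diagonal and lower triangle
            if i > j:                          # pi/4 rotation H^{(i,j)}
                a, b = j + N * i, i + N * j
                H[a, a], H[a, b], H[b, a], H[b, b] = s, -s, s, s
    P_S = np.diag(p_s)
    return T, H, P_S, np.eye(n2) - P_S

def r_and_w(Y, df, d2f):
    """Sample estimates of R and W for data Y of shape (N, n_samples)."""
    N, n = Y.shape
    T = build_operators(N)[0]
    R = Y @ df(Y).T / n                        # R[k, i] = E(f_i'(Y_i) Y_k)
    U_sum = np.zeros((N * N, N * N))           # direct sum of U_1, ..., U_N
    for i in range(N):
        U_sum[i*N:(i+1)*N, i*N:(i+1)*N] = (Y * d2f(Y[i])) @ Y.T / n
    I = np.eye(N)
    W = 0.5 * (np.kron(R.T, I) + np.kron(I, R)) + U_sum @ T
    return R, W

def newton_delta(Y, df, d2f):
    """Pure-Newton step: cs(Delta) = -H'(P_A H W H' P_A + P_S)^{-1} P_A H cs(R)."""
    N = Y.shape[0]
    _, H, P_S, P_A = build_operators(N)
    R, W = r_and_w(Y, df, d2f)
    A = P_A @ H @ W @ H.T @ P_A + P_S
    cs_delta = -H.T @ np.linalg.solve(A, P_A @ H @ cs(R))
    return cs_delta.reshape(N, N, order="F")

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(3, 2000))     # stand-in sources (assumption)
c, s_ = np.cos(0.2), np.sin(0.2)
G = np.array([[c, s_, 0.0], [-s_, c, 0.0], [0.0, 0.0, 1.0]])
Y = G @ X                                      # rotated data
df = lambda y: -4.0 * y**3                     # derivative of f(y) = -y^4
d2f = lambda y: -12.0 * y**2                   # second derivative
Delta = newton_delta(Y, df, d2f)
```
\end{verbatim}
In this sketch the returned matrix is anti-symmetric by construction, and the
matrix whose column string is ${\rm cs}(R)+W{\rm cs}(\Delta)$ comes out
symmetric, which is the critical-point condition restricted to the
anti-symmetric directions. For large $N$ a sparse solver acting only on the 
$N(N-1)/2$-dimensional block would be preferable to the dense solve used here. 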
486: 
487: \section{Performance (theoretical aspects)}\label{sec:per1}
488: %Our method has  very desirable convergence properties. 
489: The second-order-convergence is one of the main  advantages of this
490: method.
491: Indeed, this algorithm is rigorously  second-order-convergent. The
492: proof   can be  given 
493: almost in the same way as in \cite{akuzawa8}. So we omit the proof in
494: this letter.  
495: 
496: Sometimes we have to
497: deal with  large matrices  to apply  the technique here constructed. 
498: Let us examine the situation. 
499: The $N^2\times N^2$ matrix $P_A HW H' P_A +P_S$ is 
500: a direct sum of an $N(N-1)/2\times N(N-1)/2$ matrix
501: and an  $N(N+1)/2\times N(N+1)/2$ unit matrix. 
502: Within the   $N(N-1)/2\times N(N-1)/2$
503: block 
504: the number of non-zero off-diagonal elements   is 
505: no more than  ${N(N-1)(N-2)}$.  
506: So this is a very sparse matrix when $N$ becomes large. 
Of course if $N$ becomes extremely large, our method requires a large amount of
memory. But due to the sparseness, it remains a 
practical tool for problems with considerably large $N$. 
510: \begin{figure}[htbp]
511:   \begin{center}
512: \epsfig{file=sparse.eps, scale=0.3}
513: \caption{\small  $N=10$. The black dots denote non-zero elements of $P_A H W H' P_A +P_S$.  }
514:       \end{center}
515: \end{figure}
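
As a rough check of the sparseness (a short computation added here for
illustration), set $m=N(N-1)/2$. The block has $m(m-1)$ off-diagonal entries, and
\begin{eqnarray*}
m(m-1)=\frac{N(N-1)}{2}\cdot\frac{(N-2)(N+1)}{2}~,\qquad
\frac{N(N-1)(N-2)}{m(m-1)}=\frac{4}{N+1}~.
\end{eqnarray*}
Thus the fraction of non-zero off-diagonal elements is at most $4/(N+1)$. 
For $N=10$ the bound is $720/1980\approx 0.36$, while for $N=100$ it is about
$0.04$. 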
516: 
517: As  is often the case with the Newton method,  % \cite{akuzawa8}
518: the global convergence is not assured by this algorithm.
519: %So first few steps must be treated separately. 
Fortunately, this defect can be cured. 
We present a remedy for the global instability in
522: Section \ref{sec:practice}. 
523: 
524: %it is not assured that  this method 
525: %converges  globally.
526: 
527: 
528: \section{Applications to ICA}\label{sec:appl}
529: So far we have not specified the cost function beyond the assumption
530: that 
531: the cost function is a sum of the form (\ref{eq:a1}). 
532: Many of the cost functions  for the independent component analysis
533:   belong to this class. 
534: \subsection{Kullback-Leibler information}
535: The Kullback-Leibler information,
536:  \begin{eqnarray}
537:   \label{eq:ka9}
538: \int \prod_{i=1}^Ndy_i P(y)\bigg\{\ln  P(y)- \sum_{i=1}^N \ln
539: P_i(y_i)\bigg\}
540: ~,
541: \end{eqnarray}
542: is a good measure for the independence. 
543:  Here $P$ is the joint probability density function of $\{Y_i\}$ and 
544:  $P_i$ is the probability density function of the $i$-th component. 
We have already restricted ourselves to the case where the Jacobian of
546:  the transformation equals one. Then 
547: the minimization of the Kullback-Leibler information  is equivalent to 
548:   the minimization of 
549:   \begin{eqnarray}
550:     \label{eq:bb1.1}
551:  -\int \prod_i dY_i P(Y)\sum_{i=1}^N\ln  P_i(Y_i)
552: =\sum_{i=1}^N E(-\ln P_i(Y_i)) ~ .     
553:   \end{eqnarray}
554: Thus we can  legitimately  
555: transform 
556: the Kullback-Leibler information  
557: to a cost
558: function of the
559: form  (\ref{eq:a1}), where we
560:   should set $\{f_i\}$'s as 
561: \begin{eqnarray}
562:   \label{eq:bb1}
563:   f_i(\cdot)= -\ln  P_i(\cdot)~.
564: \end{eqnarray}
565: We must evaluate $\{P_i\}$'s,  their derivatives, and so on  to determine
566: the optimal
solution. Robust estimation 
of these quantities may not be an easy task~\cite{silverman1,cox1}. 
569: 
570: \subsection{Cumulant of fourth order}\label{subsec:cum}
571: The kurtosis of a random variable $A$ is defined by 
572:   \begin{eqnarray}
573:     {\kappa(A)}
574: =\frac{E(A^4)}{(E(A^2))^2}-3~.
575:   \end{eqnarray}
576: The kurtosis is related to the cumulant of the fourth order, 
577: \begin{eqnarray}
578: %  \nonumber
579: Cum^{(4)}(A)=E(A^4)-3(E(A^2))^2~,
580: \end{eqnarray}
581: by
582:   \begin{eqnarray}%\nonumber
583:     {\kappa(A)}=\frac{    Cum^{(4)}(A)}{(E(A^2))^2}~.
584:   \end{eqnarray}
585: For prewhitened data the kurtosis equals the cumulant of the fourth
586: order. 
As is well known~\cite{hyvarinen1,akuzawa8}, 
we can extract independent components in many cases
589: by seeking  the maximum of the absolute values of the kurtoses. Our method
590: is applicable 
591: by setting
592: \begin{eqnarray}
593:   \label{eq:kur1}
594:  f_i=-\kappa^2  
595: \end{eqnarray}
596: for all $i$. 
597: If it is  known a priori that all the sources $\{Y_i^*\}$ have positive
598: kurtoses, we may use the kurtosis itself and  set 
599: \begin{eqnarray}
600:   \label{eq:kur2}
601:  f_i=-\kappa~.  
602: \end{eqnarray}
603: For these cost functions, $R$, $\{U_i\}$, and other
604: quantities needed for determining each step are calculated easily
605: from the observed data.
Thus applying our method to these cost functions is a highly practical and 
reasonable choice. 
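
To make this concrete, here is one possible reading of (\ref{eq:kur2}), given
as a worked example under the assumption that the data are prewhitened and
the updates are orthogonal, so that $E(Y_i^2)=1$ is preserved. 
Then $\kappa(Y_i)=E(Y_i^4)-3$, and (\ref{eq:kur2}) amounts to taking
$f_i(y)=-(y^4-3)$ in (\ref{eq:a1}). 
Inserting this into (\ref{eq:a2}) and (\ref{eq:a3}) gives
\begin{eqnarray*}
[R(Y)]_{ki}=-4E\left(Y_i^3Y_k\right)~,\qquad
U_{ikl}(Y)=-12E\left(Y_i^2Y_kY_l\right)~,
\end{eqnarray*}
so that only sample moments of the fourth order are required. 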
608: \section{Levenberg-Marquardt-type variation and performance in practice}
609: \label{sec:practice}
610: The pure-Newton updating rule (\ref{eq:a13}) has a 
611: poor global convergence property.
612: This drawback is remedied  by
613:  the Levenberg-Marquardt-type variation\cite{numerical1}. 
First, we modify (\ref{eq:a13}) 
as 
616:  \begin{eqnarray}
617:   \label{eq:lev1}
618:   {\rm cs}(\Delta)&=&
619: -H'\left(P_A H W H' P_A +P_S+\lambda I_{N^2}
620: \right)^{-1}P_AH{\rm cs}(R)~.
621: \end{eqnarray}
622: The initial value $\lambda_0$ for $\lambda$ is fixed at some positive value.
623: We also fix a real number  $\alpha(>1)$. 
624: (In the following example we set $\lambda_0=50$ and $\alpha=10$.) 
625: Then the  procedure at time $t$ is as follows:
626: \renewcommand{\labelenumi}{\roman{enumi})}
627: \begin{enumerate}
628: \item 
629: Calculate $\Delta$ by  (\ref{eq:lev1}). 
630: \item 
631: If $F({\rm e}^{\Delta}Y(t))$ is larger than $F(Y(t))$,
632: multiply $\lambda$ 
633: by $\alpha$ and go back to i). 
634: \item 
635: Otherwise, 
636: multiply $\lambda$ by $1/\alpha$ and proceed to the next time step $t+1$. 
637: \end{enumerate}
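The control flow of steps i) to iii) can be sketched as follows. 
This Python fragment is an illustration only: the names \verb|lm_step|,
\verb|cost| and \verb|propose| are hypothetical, the retry limit is a safeguard
not present in the text, and a scalar stand-in problem 
$F(x)=\sqrt{1+x^2}$ replaces the matrix problem. 
The starting value $\lambda=0.01$ is chosen here so that rejections actually
occur; it is not the value $\lambda_0=50$ used in the example of this section. 
\begin{verbatim}
```python
import math

def lm_step(cost, propose, state, lam, alpha=10.0, max_tries=60):
    """One time step of the damping loop; returns (new_state, new_lam)."""
    f_old = cost(state)
    for _ in range(max_tries):
        candidate = propose(state, lam)   # step i): damped Newton proposal
        if cost(candidate) > f_old:       # step ii): reject, increase damping
            lam *= alpha
        else:                             # step iii): accept, relax damping
            return candidate, lam / alpha
    return state, lam                     # safeguard, not part of the text

# Scalar stand-in: F(x) = sqrt(1 + x^2); the undamped Newton map is x -> -x^3,
# which diverges for |x| > 1.
def cost(x):
    return math.sqrt(1.0 + x * x)

def propose(x, lam):
    grad = x / math.sqrt(1.0 + x * x)
    hess = (1.0 + x * x) ** -1.5
    return x - grad / (hess + lam)

x, lam = 2.0, 0.01
history = [cost(x)]
for t in range(20):
    x, lam = lm_step(cost, propose, x, lam)
    history.append(cost(x))
```
\end{verbatim}
Starting from $x=2$, where the undamped step would move away from the minimum,
the damped iteration never increases the cost and approaches $x=0$. 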
The other parts of the algorithm are exactly the same as in the
639: pure-Newton version in Section \ref{sec:mult}. 
640: 
641: 
642: Let us  examine the real performance of 
643: our method under this setting.
644: For the cost function we choose the kurtosis as in Subsection
645: \ref{subsec:cum}. 
646:  The source signals are three synthesizer-generated 
WAV files (Fig.~\ref{fig:2}). 
648: \begin{figure}[htbp]
649:   \begin{center}
650: %\epsfig{file=sample.eps,width=15cm,height=3cm}    
651: \epsfig{file=sample.eps,scale=0.4}    
652:     \caption{Sample  data generated by a synthesizer (by courtesy of 
653: N.Murata).}
654:     \label{fig:2}
655:   \end{center}
656: \end{figure}
657: Pseudo-observed data  are generated by mixing the source by 
658: a random  matrix,
659:  \begin{eqnarray}
660:    \label{eq:tr1}
661:    A=I_3+S,
662:  \end{eqnarray}
663: where each element of $S$ is distributed uniformly on $(-1/2,1/2)$. 
664:  The residual crosstalk of the  signals 
665: demixed by our method 
666: is 
$1.29\%$ on average.  It takes about $122$ seconds (CPU time) for one
 hundred iterations of the same problem on
669:  our workstation. 
670: For reference, we have also solved the same demixing problem
671: by the FastICA\cite{fastica1}. 
672: In this case the residual crosstalk 
673: is 
674: $1.36\%$ on average and  it takes about $156$ seconds for 
one hundred 
 iterations on
677:  the same workstation. 
678: Since the author's  knowledge about the FastICA package is limited, 
this comparison should be regarded as indicative only. 
It does suggest, however, 
 that our method also performs well in practice. 
682: 
683: \section{Summary}\label{sec:summ}
684: We have constructed a new  algorithm  for  finding a
685: critical  point 
686: of broad classes of cost functions  %defined 
687: on the orthogonal groups. This method is second-order-convergent  
688: since it  is in essence the Newton method.
689: The method here constructed  is an extension (or a restriction) of
690: the multiplicative updating method 
691:  developed in our 
692: previous work\cite{akuzawa8}. The constraint for $\Delta$ from the nature
693: of the orthogonal groups  makes the
694: problem a little complicated. We have, however, obtained a rigorous and 
695: explicit updating rule. 
696: We have also constructed 
 a Levenberg-Marquardt-type variation, which is suitable for
 practical purposes.  
699: The global instability inherent in the Newton method is remedied in
700: this version. 
701: %  the Kullback-Leibler information, the kurtosis, {\it etc.},
702: %  suitable for
703: %the
704: %purpose. 
705: Since our discussion does not depend on the
706: detail of the cost function, 
707: this method is applicable to many concrete problems.
708: The relatively  mild assumption (\ref{eq:a1}) on the form of the  cost
709: function, however,  implies that 
710:  our algorithm is especially
711: suitable for 
712:  the ICA. 
713: %we can choose arbitrary functions for
714: %$\{f_i\}$. 
715: % readily our method 
716: %by
717: %prewhitening data. 
718: %The potential of our method  
719: Its practical utility for the ICA
 has been illustrated here by a numerical simulation.  
721: 
722: 
723: %Let us conclude the a
724: To summarize, 
725: our algorithm  has  numerous theoretical virtues such as 
726: the  rigorous second order convergence, the explicit and strict formulation,
727:  and so on. 
728: %Moreover  
729:  It provides, 
730:  also in practice, 
731:   fast and powerful tools for the
732:  ICA and many other problems. 
733: 
734: 
735: 
736: %Since it does not require prewhitening, 
737: 
738: \section*{Acknowledgments}
739: The author would like to thank Noboru Murata and Shun-ichi Amari for 
740: valuable
741: discussions and comments. 
742: %\bibliography{mybib}
743: \begin{thebibliography}{6}
744: 
745: \bibitem[A.Hyv\"arinen,1997]{hyvarinen1}
746: A.Hyv\"arinen (1997).
747: \newblock A Fast Fixed-Point Algorithm for Independent Component Analysis.
748: \newblock {\em Neural Computation\/}, {\em 9\/}, 1483--1492.
749: 
\bibitem[B.W.Silverman,1986]{silverman1}
B.W.Silverman (1986).
752: \newblock {\em Density Estimation for Statistics and Data Analysis\/}.
753: \newblock London: Chapman \& Hall.
754: 
755: \bibitem[D.Cox,1985]{cox1}
D.Cox (1985).
757: \newblock A Penalty Method for Nonparametric Estimation of the Logarithmic
758:   Derivative of a Density Function.
759: \newblock {\em Ann.Inst.Statist.Math.\/}, {\em 37\/}, 271--288.
760: 
761: \bibitem[Hurri {\em et~al.\/},1998]{fastica1}
Hurri, J., G\"avert, H., S\"arel\"a, J., \& Hyv\"arinen, A. (1998).
763: \newblock FastICA package for MATLAB.
764: \newblock http://www.cis.hut.fi/projects/ica/fastica/.
765: 
766: \bibitem[T.Akuzawa \& N.Murata,1999]{akuzawa8}
767: T.Akuzawa \& N.Murata (1999).
\newblock Multiplicative Nonholonomic/Newton-like Algorithm.
769: \newblock {\em preprint \\(available from
770:   http://www.islab.brain.riken.go.jp/\~{}akuzawa/)\/}.
771: 
772: \bibitem[W.H.Press {\em et~al.\/},1988]{numerical1}
773: W.H.Press, B.P.Flannery, S.A.Teukolsky, \& W.T.Vetterling (1988).
774: \newblock {\em Numerical Recipes in C\/}.
775: \newblock Cambridge: Cambridge U.P.
776: 
777: \end{thebibliography}
778: 
779: \end{document}
780: