
1: \documentclass[a4,12pt]{article}

2: \usepackage{latexsym}

3: %\usepackage[light]{draftcopy}

4: \usepackage{epsfig}

5: %\usepackage{bookman}

6: %\usepackage{graphicx}

7: \oddsidemargin=0cm

8: \evensidemargin=0cm

9: \textwidth=16cm

10: \paperwidth=21cm

11: \textwidth=18.6cm

12: %\textheight=24.7cm

13: \oddsidemargin=-0.5in

14: \evensidemargin=-0.5in

15: %\topmargin=-0.6in

16: \usepackage{amsmath,amstext,amsfonts}

17: \def\bm#1{\mbox{\boldmath $#1$}}

18: \def\teigi{\stackrel{\rm def}{=}}

19: \def\hatena{\stackrel{\boldmath ?}{=}}

20: %%

21: \makeatletter

22:   \renewcommand{\theequation}{%

23:      \thesection.\arabic{equation}}

24:   \@addtoreset{equation}{section}

25: \makeatother

26: \tolerance=6000

27:

28:

29:

30:

\title{Multiplicative Algorithm for Orthogonal Groups \\and
Independent Component Analysis}

33: \author{Toshinao {\sc

34:     Akuzawa}\thanks{akuzawa@brain.riken.go.jp}\vspace{0.3cm}\\

35: \vspace{0.5cm}\\

36: Brain Science Institute \\

37: {\it RIKEN}\\

38: %%{\small(The Institute of Physical and Chemical Research)}\\

39: {\small 2-1 Hirosawa, Wako, Saitama 351-0198, Japan}}

40: \date{{\it \today}}

41:

42: \begin{document}

43: \maketitle

44: \abstract{

45: The multiplicative Newton-like method developed by the author\:{\it et\:al.}

46: is extended to

47: the situation  where the dynamics  is restricted  to

48:  the orthogonal group.

49: A general framework is

50: constructed  without specifying the cost function.

Though the restriction to the orthogonal group makes the problem
somewhat more complicated,
an explicit expression for the size of each individual step is obtained.

54: This algorithm is exactly second-order-convergent.

55: The global instability inherent in the Newton method is remedied by

56:  a Levenberg-Marquardt-type variation.

57: The method thus constructed  can readily be applied to the independent

58: component analysis.

59: Its remarkable performance is illustrated by a

60: numerical simulation.

61:

62:

63: % In the case of the independent component analysis  the restriction

64: % corresponds to the prewhitening of the  data.

65:  }

66: \section{Overview}

67: \label{intro}

68: Many optimization problems take the form,

69: ``Find an optimal matrix under the constraints (1).. (2).. {\it etc}."

70: Some of these can be considered as optimizations on Lie groups.

71: For groups, the fundamental manipulation

72: is a multiplication whereas an addition is unnatural.

73: %(Imagine the compound interest rate on your bank account.)

74: In consideration of this fact,

75: we have constructed a multiplicative Newton-like algorithm

76: for maximizing the kurtosis (a good barometer for the independence) in

77: \cite{akuzawa8}.  There the dynamics takes place on the coset

78: $GL(1,{\Bbb R})^{N}\backslash GL(N,{\Bbb R})$.

79: We can apply the techniques

80: developed in \cite{akuzawa8} to many other optimization problems.

The coset structure $GL(1,{\Bbb R})^{N}\backslash GL(N,{\Bbb R})$ is,
however,
characteristic of the independent component
analysis (ICA). This is because independence has nothing to do with scaling.
The redundancy
resulting from the invariance of the model under componentwise scaling
must be eliminated for a rigorous discussion, and this redundancy
corresponds
to $GL(1,{\Bbb R})^{N}$.

91:

Another way to eliminate this redundancy is
prewhitening, that is,
a linear transformation of the observed data
which maps
the covariance matrix to the unit matrix.
If we deal with prewhitened data, we can legitimately restrict
the search space to the orthogonal group.

99: The aim of this letter is the construction of a multiplicative

100: algorithm

101: for the orthogonal groups.
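To make this restriction explicit (a sketch under assumptions not spelled out in the original: the raw observations, here called $\tilde X$, have zero mean and a non-singular covariance matrix $\Sigma=E(\tilde X\tilde X')$, where the prime denotes transposition), one common choice of prewhitening is $X=\Sigma^{-1/2}\tilde X$, for which
\begin{eqnarray*}
E(XX')&=&\Sigma^{-1/2}\,E(\tilde X\tilde X')\,\Sigma^{-1/2}=I_N~,
\end{eqnarray*}
with $I_N$ the $N\times N$ unit matrix. If, in addition, the transformed variables $Y=CX$ are required to have unit covariance as well, then
\begin{eqnarray*}
I_N&=&E(YY')=C\,E(XX')\,C'=CC'~,
\end{eqnarray*}
so that $C$ is necessarily orthogonal.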

102:

103:

The framework is as follows.
Suppose that
$N$-dimensional prewhitened random variables
$\{X_i|1\le i \le N\}$ are available
and that they are assumed to originate from
some unknown mutually independent components $\{Y_i^*|1\le i \le N\}$.
The goal of the ICA is to find the map
$\{X_i\} \mapsto \{Y_i^*\}$.

112: We restrict ourselves to

113:  the linear independent component analysis.

114: There

115: we want to find a linear transformation $C^*:{X}=(X_1,\cdots,X_N)'\mapsto

116: { Y^*}=(Y_1^*,\cdots,Y_N^*)'=C^*{ X}$ which

117:  minimizes some cost function that measures the independence.

118: Since we are assuming that the data is already prewhitened, the

119: covariance matrix of $X$ is the $N\times N$ unit matrix.

120: If we do not take into account  errors in the prewhitening,

121: the optimal  point $C^*$ must belong to $O(N)$.

122:

123:

124: Giving up the analytical solution,

125: we consider a sequence,

126: \begin{eqnarray}

127:   \label{eq:intro1}

128:   C(0),~ C{(1)},~ C{(2)},~ C{(3)},~\cdots\cdots~,

129: \end{eqnarray}

130:  which converges to the optimal solution $C^*$.

131: The sequence  $\{C(t)\}$

132: % which specifies the

133: %linear transformation

134: is generated by the left-multiplication of another sequence of

135: orthogonal

136: matrices $\{D(t)\}$.

137: Each $D(t)$  is specified by the coordinate

138: $\Delta(t)$ which satisfies $D(t)={\rm e}^{\Delta(t)}$.

139: We assume that $\Delta(t)$ is

140: an $N\times N$  skew-symmetric

141: matrix,

142: which  implies that  $D(t)$ belongs to

143: the identity component of $O(N)$.
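For instance (a small illustration added here; the symbol $\theta$ is introduced only for this example), for $N=2$ a skew-symmetric matrix is specified by a single real parameter $\theta$, and its exponential is a plane rotation:
\begin{equation*}
\Delta=\left(\begin{array}{cc} 0 & -\theta\\ \theta & 0\end{array}\right)
\quad\Longrightarrow\quad
{\rm e}^{\Delta}=\left(\begin{array}{cc} \cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{array}\right)~,
\end{equation*}
which has unit determinant for every $\theta$.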

144: In practice the procedure  is as follows.

145: As an initial condition we set $C(0)$.

146: For  $t>0~(t\in{\Bbb N}^{+})$,

147: we introduce %an  $N\times N$ matrix

148: $\Delta(t)$ and

define $C({t+1})$ by $C({t+1})={\rm e}^{\Delta({t})}C(t)$.

150: Under these settings, we determine $\Delta(t)$ by using the Newton method

151: %for the second order

152: %expansion of the cost function

153: with respect to

154: the matrix elements of

155: $\Delta(t)$. That is,

156:  we evaluate the cost function at $C({t+1})$

157: by  expanding it  around $C({t})$

158: in terms of  the elements of

159: $\Delta({t})$ up to the second order.

Then $\Delta(t)$ is chosen as the (unique) critical point of
this second-order
expansion.

163: We iteratively follow these procedures until we obtain a satisfactory

164: solution.

165:

166: This letter is organized as follows.

167: In Section \ref{sec:mult}

168: we will give a complete description of

169: a new  multiplicative updating method for the orthogonal groups.

170: This section  is the main part of this letter. Since our formulation

171: does not depend on the details of the  cost function

172: the method can be useful for many problems other than the ICA.

173: The performance of

174: our method including the second-order-convergence is discussed in

175: Section \ref{sec:per1}.

176: Section \ref{sec:appl} is a survey of possible applications of our

177: method.

The algorithm constructed in Section \ref{sec:mult}
can be regarded as a pure Newton method on the orthogonal groups.
To achieve global convergence, we must modify the method. This is
accomplished in
Section \ref{sec:practice}. Section
\ref{sec:practice} also includes a numerical examination of
the performance of our
method. Section \ref{sec:summ} is a summary.

186:

187: \section{Multiplicative updating on $O(N)$}

188: \label{sec:mult}

189: We assume that the  cost function $F$ takes the form,

190: \begin{eqnarray}

191:   \label{eq:a1}

192:   F(Y)=\sum_{i=1}^NE(f_i(Y_i))~,

193: \end{eqnarray}

194: where each $f_i:{\Bbb R}\rightarrow{\Bbb R}$ is an unspecified function.

Throughout this letter we denote by $E(\cdot)$ the expectation.  % of $A$.
We will determine
the concrete procedure
%amount of  each step
in the manner of the Newton method.
First, we introduce
maps
$R$ and $\{U_{i}\}~(1\le i\le N)$ from
$N$-dimensional
datasets to $N \times N$ matrices
by

206: \begin{eqnarray}

207:   \label{eq:a2}

208:   [R(Y)]_{ki}=E\left(\frac{\partial f_i(Y_i)}{\partial Y_i}Y_k\right)

209: \end{eqnarray}

210: and

211: \begin{eqnarray}

212:   \label{eq:a3}

213: [ U_{i}(Y)]_{kl}=U_{ikl}(Y)= E\left(\frac{\partial^2 f_i(Y_i)}{\partial

214:   Y_i^2}Y_k Y_l\right)~.

215: \end{eqnarray}

216: The goal is  the construction of  a sequence

217: $\{Y(t)\}$  of the estimates of the independent components, which

218: converges to the optimal point $Y^*$.

219: %We suppose that

220: Within the framework of the linear analysis, we consider that

221:  this sequence is derived from another sequence

222:  $\{C(t)\}$ of the linear transformation by the relation

223: $Y(t)=C(t)X$,

224: where $X$ are the original data. Thus if we  restate the problem,

225:  the task is to

226: determine

227: a  sequence  $\{C(t)\}$.

228: We assume that

229: for each $t\in {\Bbb N}^{+}$

the estimates of the independent components at time $t$
and the estimates
at time $t+1$ are related by

233: \begin{eqnarray}

234:   \label{eq:a4}

235:   Y{(t+1)}=D{(t)}Y{(t)}~

236: \end{eqnarray}

237: or equivalently

238: \begin{eqnarray}

239:   \label{eq:a4bb}

240:   C{(t+1)}=D{(t)}C{(t)}~,

241: \end{eqnarray}

242: where $D{(t)}$  is  some orthogonal matrix to be fixed.

243: Our method is characterized by this left-multiplicative updating rule.

244: As mentioned in the previous section,

245: we  assume that

246: each $D(t)$   always belongs to the identity component of the

247: orthogonal group $O(N)$.

248: This assumption is reasonable, for example, if the  original data $X$

249: are already prewhitened in the case of the ICA.

250: % we suppose that the original data $X$ are already prewhitened.

251: %In this case  we can legitimately

252: Anyway, under this restriction

253:  $D{(t)}$ is specified by an $N\times N$ anti-symmetric matrix $\Delta{(t)}$,

254: which satisfies

255: \begin{eqnarray}

256:   \label{eq:a5}

257:   \exp(\Delta{(t)})=D{(t)}~.

258: \end{eqnarray}

259: For brevity's sake we will omit the argument $(t)$ and denote $Y(t+1)$ by $Z$.

260: $F(Z)$ is expanded in terms of $\{\Delta_{ij}\}$ as

261: \begin{eqnarray}

262:   \label{eq:a6}

263:   F(Z)=F(Y)+{\rm tr}(\Delta R(Y))+{\rm

264:   tr}\left(\frac{\Delta^2}{2}R(Y)\right)

265: +\frac{1}{2}\sum_{i,k,l}\Delta_{ik}\Delta_{il}U_{ikl}(Y)

266: +O(\Delta^3)~.

267: \end{eqnarray}

268: %By partially differentiating (\ref{eq:a6}),

Throughout the letter
we denote by $O(\Delta^k)$ polynomials in the matrix elements of $\Delta$
which do not contain terms of degree less than $k$. This should not be
confused with the symbol for the orthogonal groups such as $O(N)$.
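The origin of each term in (\ref{eq:a6}) can be traced as follows (a short derivation added for the reader's convenience). Since $Z={\rm e}^{\Delta}Y=Y+\Delta Y+\frac{1}{2}\Delta^2Y+\cdots$, a Taylor expansion of each $f_i$ up to the second order in $\Delta$ gives
\begin{eqnarray*}
f_i(Z_i)&\simeq&f_i(Y_i)+\frac{\partial f_i(Y_i)}{\partial Y_i}\Big[(\Delta Y)_i+\frac{1}{2}(\Delta^2Y)_i\Big]
+\frac{1}{2}\frac{\partial^2 f_i(Y_i)}{\partial Y_i^2}\big[(\Delta Y)_i\big]^2~.
\end{eqnarray*}
Taking the expectation, summing over $i$, and using the definitions (\ref{eq:a2}) and (\ref{eq:a3}), one finds
\begin{eqnarray*}
\sum_{i}E\Big(\frac{\partial f_i(Y_i)}{\partial Y_i}(\Delta Y)_i\Big)&=&\sum_{i,k}\Delta_{ik}R_{ki}={\rm tr}(\Delta R)~,\\
\frac{1}{2}\sum_{i}E\Big(\frac{\partial f_i(Y_i)}{\partial Y_i}(\Delta^2 Y)_i\Big)&=&{\rm tr}\Big(\frac{\Delta^2}{2}R\Big)~,\\
\frac{1}{2}\sum_{i}E\Big(\frac{\partial^2 f_i(Y_i)}{\partial Y_i^2}\big[(\Delta Y)_i\big]^2\Big)&=&\frac{1}{2}\sum_{i,k,l}\Delta_{ik}\Delta_{il}U_{ikl}~.
\end{eqnarray*}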

273: As in the usual Newton method,

274: we truncate the expansion (\ref{eq:a6}) at the second order with

275: respect to $\{\Delta_{ij}\}$.

276:  Then $\Delta$ in this step is determined as the coordinate of  the

277: critical point of this truncated expansion.

278: The partial derivative of (\ref{eq:a6}) is more convenient for the purpose.

279: It reads

280: \begin{eqnarray}

281:   \label{eq:a7}

282:   \frac{\partial F(Z)}{\partial \Delta_{kl}}

283: =R_{lk}+\frac{1}{2}\left[\Delta R+R\Delta

284: \right]_{lk}+\sum_p \Delta_{kp}U_{klp}+O(\Delta^2)~,

285: \end{eqnarray}

286: where we have omitted the argument $Y$ for $R$ and $U$.

287: Now let us  introduce a map $\rm cs$ (the column string) as in the previous

288: article

289: \cite{akuzawa8}:

290: \begin{eqnarray}

291:   \label{eq:a14}

292:   {\rm Mat}(N,{\Bbb F}) &\rightarrow& {\Bbb F}^{N^2}\\

293: A=\left(

294:   \begin{array}{cccc}

295:  A_{11}& A_{12}&\cdots &A_{1N}\\

296: A_{21} &\multicolumn{3}{c}{\dotfill}\\

297: \multicolumn{4}{c}{\dotfill}\\

298: A_{N1} &\multicolumn{2}{c}{\dotfill}&A_{NN}

299:   \end{array}

300: \right) &\mapsto&

301: {\rm cs}(A)=

302: (A_{11}~ A_{21}~ \cdots~ A_{N1}~~ A_{12}~ A_{22}~\cdots ~A_{NN})'~,\nonumber

303: \end{eqnarray}

where ${\rm Mat}(N,{\Bbb F})$ is the set of
$N\times N$ matrices over some unspecified field $\Bbb F$.
We denote the transposition by the superscript $\prime$ and
the complex conjugate by $\dagger$.

308: For the orthogonal groups it is rather simple to move to

309: the framework of the column string as compared to the case of

310: $GL(1,{\Bbb R})^N\backslash GL(N,{\Bbb R})$:

311: By neglecting  $O(\Delta^2)$ terms,

312: the right-hand-side of (\ref{eq:a7}) is straightforwardly  rewritten as

313: \begin{eqnarray}

314:   \label{eq:a8}

315: R_{lk}&+&\frac{1}{2}\left[\Delta R+R\Delta

316: \right]_{lk}+\sum_p \Delta_{kp}U_{klp}\nonumber\\

317: &&=\left[{\rm cs}(R)+\frac{1}{2}\left(

318: R'\otimes I_N+I_N\otimes R

319: \right) {\rm cs}(\Delta)+

320: \big(\bigoplus_k U_k\big) T{\rm cs}(\Delta)\right]_{l+(k-1)N}~,

321: \end{eqnarray}

322: where

323:  the symbol ``$\bigoplus$'' stands for the direct sum,

324: \begin{eqnarray}

325:   \label{eq:tiu1}

326: \bigoplus_{k=1}^N U_{k}=

327: \left(

328:   \begin{array}{lllll}

329: U_1 & 0 & \multicolumn{2}{c}{\cdots\cdots} & 0\\

330: 0& U_2 & 0 & \multicolumn{2}{c}{\cdots\cdots}\\

331:  \multicolumn{5}{c}{\dotfill}\\

332:  \multicolumn{5}{c}{\dotfill}\\

333:  0& \multicolumn{2}{c}{\cdots\cdots}& U_{N-1}& 0   \\

334: 0& 0& \multicolumn{2}{c}{\cdots\cdots}& U_{N}   \\

335:    \end{array}

336: \right)~,

337: \end{eqnarray}

338:  $T$ is  an $N^2\times N^2$ matrix

339: defined by

340: \begin{eqnarray}

341:   \label{eq:a15}

342:   {\rm cs}(A')=T{\rm cs}(A) ~\mbox{\rm for~} A\in  {\rm Mat}(N,{\Bbb F})~,

343: \end{eqnarray}

344: and $I_N$ is the $N\times N$ unit matrix.

345: We denote  the tensor product by  $\otimes$  as usual.

346:  The ``transposition'' $T$ is also considered as

347: an intertwiner between  two equivalent representations:

348: \begin{eqnarray}

349: %\nonumber

350: T(A\otimes B)T=B\otimes A~.

351: \end{eqnarray}
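As a concrete illustration (added here; the case $N=2$ is chosen only for brevity), the column string of a $2\times 2$ matrix and the corresponding $T$ read
\begin{eqnarray*}
{\rm cs}\left(\begin{array}{cc}A_{11}&A_{12}\\A_{21}&A_{22}\end{array}\right)
&=&(A_{11}~ A_{21}~ A_{12}~ A_{22})'~,\\
T&=&\left(\begin{array}{cccc}1&0&0&0\\0&0&1&0\\0&1&0&0\\0&0&0&1\end{array}\right)~,
\end{eqnarray*}
so that $T$ simply exchanges the entries $A_{21}$ and $A_{12}$ of the column string and satisfies $T^2=I_4$.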

The orthogonal group $O(N)$ has fewer degrees of freedom than the
general linear group.
The canonical basis of the Lie algebra, ${\frak o}(N)$, of $O(N)$ consists of
$N(N-1)/2$
anti-symmetric
matrices. We will introduce some operators which enable us to move to
the coordinates based on
the canonical basis of ${\frak o}(N)$.

360: In the first place, we introduce an $N^2\times N^2$ matrix  $H$ by

361: \begin{eqnarray}

362:   \label{eq:a9}

363: H=\sum_{i>j}H^{(i,j)}~,

364: \end{eqnarray}

365: where $H^{(i,j)}$ is a $\pi/4$  rotation between

366:  the $j+N(i-1)$-th component and the $i+N(j-1)$-th component:

367: \begin{eqnarray}

368:   \label{eq:a10}

369: H^{(i,j)}_{kl}=  \left\{

370:   \begin{array}{ccl}

\frac{1}{\sqrt{2}}~~~~&\mbox{\rm for}&k=j+N(i-1),~~l=j+N(i-1)\\
-\frac{1}{\sqrt{2}}~~~~&\mbox{\rm for}&k=j+N(i-1),~~l=i+N(j-1)\\
\frac{1}{\sqrt{2}}~~~~&\mbox{\rm for}&k=i+N(j-1),~~l=j+N(i-1)\\
\frac{1}{\sqrt{2}}~~~~&\mbox{\rm for}&k=i+N(j-1),~~l=i+N(j-1)\\

375: 0~~~~&&\mbox{\rm otherwise. }

376:   \end{array}

377: \right.

378: \end{eqnarray}

379: The projection  operator $P_D$,

380: \begin{eqnarray}

381:   \label{eq:a18}

382: P_D&=&{\rm diag}(p_1,\cdots,p_{N^2})~,\nonumber\\

383: &&\left\{

384: \begin{array}{ll}

385:  p_k=1 ~~~\mbox{\rm for}~~ k=N(i-1)+i,1\le i\le N~\\

386:  p_k=0~~~~ \mbox{\rm otherwise}~,

387: \end{array}

388: \right.

389: \end{eqnarray}

390:  is used to extract the diagonal

391: elements of a matrix from its image by $\rm cs$.

392: Then the coordinate transformation

393: is realized by a multiplication of

394: \begin{eqnarray}

395:   \label{eq:a12.1}

396:   H+P_D~

397: \end{eqnarray}

398: to   column string vectors.

399: We need to introduce two more

400: projection operators $P_S$ and $P_A$  defined by

401: \begin{eqnarray}

402:   \label{eq:a11}

403:   P_S&=&{\rm diag}(p_1,p_2,\cdots,p_{N^2})\\

404: P_A&=&{\rm diag}(1-p_1,1-p_2,\cdots,1-p_{N^2})~,%\nonumber

405: \end{eqnarray}

406: where

407: \begin{eqnarray}

408:   \label{eq:a12}

409:   p_k=\left\{

410:     \begin{array}{ccl}

411: 1&\mbox{\rm if}&{}^{\exists}(i,j);~~ j\le i~~ \mbox{\rm and}~~k=i+N(j-1)\\

412: 0&&\mbox{\rm otherwise}.

413:     \end{array}

414: \right.

415: \end{eqnarray}

By the left action of $P_S$ and $P_A$ on

417:  column string vectors rotated by $H+P_D$

418: we can extract,

419: %The projection operators  $P_S$ and $P_A$  are used to extract,

420: respectively,

421:  the symmetric components

422:  and the anti-symmetric components of the matrices.
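For example (an illustration added here for $N=2$), the only pair with $i>j$ is $(i,j)=(2,1)$, and the definitions of $H$, $P_D$, $P_S$ and $P_A$ give
\begin{eqnarray*}
H+P_D&=&\left(\begin{array}{cccc}
1&0&0&0\\
0&\frac{1}{\sqrt{2}}&\frac{1}{\sqrt{2}}&0\\
0&-\frac{1}{\sqrt{2}}&\frac{1}{\sqrt{2}}&0\\
0&0&0&1\end{array}\right)~,\\
P_S&=&{\rm diag}(1,1,0,1)~,\qquad P_A={\rm diag}(0,0,1,0)~,\\
(H+P_D)\,{\rm cs}(\Delta)&=&\Big(\Delta_{11}~~\frac{\Delta_{21}+\Delta_{12}}{\sqrt{2}}~~\frac{\Delta_{12}-\Delta_{21}}{\sqrt{2}}~~\Delta_{22}\Big)'~.
\end{eqnarray*}
The first, second, and fourth components are the symmetric ones picked out by $P_S$, whereas the third is the anti-symmetric one picked out by $P_A$.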

423: Then the conditions

424: for the critical point of the second-order-expansion,

425: which must be  satisfied by $\Delta$,  are

426: translated into the following two conditions.

427: First, symmetric components of $\Delta$ must vanish.

428:  This condition is expressed as

429: \begin{eqnarray}

430:   \label{eq:11.91}

431: \left[(H+P_D){\rm cs}(\Delta)\right]_{j+(i-1)N}=0

432: \qquad\mbox{\rm for}\quad i\le j

433: \quad\bigg(\Longleftrightarrow

434: P_S(H+P_D){\rm cs}(\Delta)  =0\bigg)~.

435: \end{eqnarray}

436: Secondly, for the anti-symmetric components

437: the condition for the critical point is transformed to

438: \begin{eqnarray}

439:   \label{eq:a11.9}

440:   \left[(H+P_D){\rm cs}(R)+(H+P_D)W

441:  {\rm cs}(\Delta)

442: \right]_{j+(i-1)N}~=0 \qquad\mbox{\rm for}\quad i>j~,

443: \end{eqnarray}

444: where we have  set

445: \begin{eqnarray}

446:   \label{eq:a14b}

447:   W=\frac{1}{2}\left(

448: R'\otimes I_N+I_N\otimes R

449: \right) +

450: \big(\bigoplus_k U_k\big) T~.

451: \end{eqnarray}

452: %Now

453: %one can see that

454: % (\ref{eq:a8})

455: The conditions (\ref{eq:11.91}) and (\ref{eq:a11.9}) are

456: combined into an equation,

457: \begin{eqnarray}

458:   \label{eq:a13-1}

459: P_A(H+P_D){\rm cs}(R)+

460:  \bigg[P_A (H+P_D) W (H+P_D)' P_A +P_S

461: \bigg](H+P_D) {\rm cs}(\Delta)

462: =0~.

463: \end{eqnarray}

464: Note that

465: \begin{eqnarray}

466:   \label{eq:a12.2}

467:   P_A(H+P_D)=P_AH~.

468: \end{eqnarray}

469: The optimal $\Delta$ is immediately obtained from (\ref{eq:a13-1}):

470: \begin{eqnarray}

471:   \label{eq:a13}

472:   {\rm cs}(\Delta)&=&

473: -(H+P_D)'\bigg[P_A (H+P_D) W (H+P_D)' P_A +P_S

474: \bigg]^{-1}P_A(H+P_D){\rm cs}(R)~\nonumber\\

475: &=&

476: -H'\left(P_A H W H' P_A +P_S

477: \right)^{-1}P_AH{\rm cs}(R)~.

478: \end{eqnarray}

479: Thus we have obtained the explicit updating rule.

480: By iterating the procedure in this section  from a  starting point

481: sufficiently close

482: to the

483: optimal one,

484:  the sequences  $\{C(t)\}$ and $\{Y(t)\}$ converge to

485:  the optimal solutions.
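
As a consistency check (a worked special case added here, not part of the original derivation), take $N=2$ and parametrize $\Delta$ by a single angle $\theta$, a symbol introduced only for this example, with $\Delta_{21}=-\Delta_{12}=\theta$ and $\Delta_{11}=\Delta_{22}=0$. Then $\Delta^2=-\theta^2I_2$ and the expansion (\ref{eq:a6}), truncated at the second order, becomes
\begin{eqnarray*}
F(Z)&\simeq&F(Y)+\theta\,(R_{12}-R_{21})
+\frac{\theta^2}{2}\left(U_{122}+U_{211}-R_{11}-R_{22}\right)~,
\end{eqnarray*}
whose critical point is
\begin{eqnarray*}
\theta&=&\frac{R_{12}-R_{21}}{R_{11}+R_{22}-U_{122}-U_{211}}~,
\end{eqnarray*}
provided that the denominator does not vanish. Evaluating (\ref{eq:a13}) for $N=2$ leads to the same value.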

486:

487: \section{Performance (theoretical aspects)}\label{sec:per1}

488: %Our method has  very desirable convergence properties.

489: The second-order-convergence is one of the main  advantages of this

490: method.

491: Indeed, this algorithm is rigorously  second-order-convergent. The

492: proof   can be  given

493: almost in the same way as in \cite{akuzawa8}. So we omit the proof in

494: this letter.

495:

Applying the technique constructed here sometimes requires
dealing with large matrices.

498: Let us examine the situation.

499: The $N^2\times N^2$ matrix $P_A HW H' P_A +P_S$ is

500: a direct sum of an $N(N-1)/2\times N(N-1)/2$ matrix

501: and an  $N(N+1)/2\times N(N+1)/2$ unit matrix.

502: Within the   $N(N-1)/2\times N(N-1)/2$

503: block

504: the number of non-zero off-diagonal elements   is

505: no more than  ${N(N-1)(N-2)}$.

506: So this is a very sparse matrix when $N$ becomes large.

Of course, if $N$ becomes extremely large, our method requires a large
amount of memory. But owing to the sparseness, it remains a
practical tool for problems with considerably large $N$.

510: \begin{figure}[htbp]

511:   \begin{center}

512: \epsfig{file=sparse.eps, scale=0.3}

513: \caption{\small  $N=10$. The black dots denote non-zero elements of $P_A H W H' P_A +P_S$.  }

514:       \end{center}

515: \end{figure}
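
The sparseness can be quantified as follows (a back-of-the-envelope estimate added here, taking the bound $N(N-1)(N-2)$ stated above at face value; the symbol $m$ is introduced only for this estimate). The $N(N-1)/2\times N(N-1)/2$ block has $m=N(N-1)/2$ rows and hence $m(m-1)=N(N-1)(N-2)(N+1)/4$ off-diagonal entries, so that the fraction of non-zero off-diagonal entries is bounded by
\begin{eqnarray*}
\frac{N(N-1)(N-2)}{N(N-1)(N-2)(N+1)/4}&=&\frac{4}{N+1}~.
\end{eqnarray*}
For $N=10$ this bound is $4/11\approx 36\%$, whereas for $N=100$ it drops to $4/101\approx 4\%$.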

516:

517: As  is often the case with the Newton method,  % \cite{akuzawa8}

518: the global convergence is not assured by this algorithm.

519: %So first few steps must be treated separately.

Fortunately, this fault can be cured.
A remedy for the global instability is given in
Section \ref{sec:practice}.

523:

524: %it is not assured that  this method

525: %converges  globally.

526:

527:

528: \section{Applications to ICA}\label{sec:appl}

529: So far we have not specified the cost function beyond the assumption

530: that

531: the cost function is a sum of the form (\ref{eq:a1}).

532: Many of the cost functions  for the independent component analysis

533:   belong to this class.

534: \subsection{Kullback-Leibler information}

535: The Kullback-Leibler information,

536:  \begin{eqnarray}

537:   \label{eq:ka9}

538: \int \prod_{i=1}^Ndy_i P(y)\bigg\{\ln  P(y)- \sum_{i=1}^N \ln

539: P_i(y_i)\bigg\}

540: ~,

541: \end{eqnarray}

542: is a good measure for the independence.

543:  Here $P$ is the joint probability density function of $\{Y_i\}$ and

544:  $P_i$ is the probability density function of the $i$-th component.

We have already restricted ourselves to the case where the Jacobian of
 the transformation equals one. Then

547: the minimization of the Kullback-Leibler information  is equivalent to

548:   the minimization of

549:   \begin{eqnarray}

550:     \label{eq:bb1.1}

 -\int \prod_{i=1}^N dy_i\, P(y)\sum_{i=1}^N\ln  P_i(y_i)

552: =\sum_{i=1}^N E(-\ln P_i(Y_i)) ~ .

553:   \end{eqnarray}
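The step from (\ref{eq:ka9}) to (\ref{eq:bb1.1}) can be spelled out as follows (a short filling-in added here; $P_X$ denotes the joint density of the prewhitened data, a symbol not used elsewhere in this letter). For $Y=CX$ with an invertible matrix $C$, the densities are related by $P(y)=P_X(C^{-1}y)/|\det C|$, and therefore
\begin{eqnarray*}
\int \prod_{i=1}^N dy_i\, P(y)\ln P(y)&=&\int \prod_{i=1}^N dx_i\, P_X(x)\ln P_X(x)-\ln|\det C|~.
\end{eqnarray*}
For $C\in O(N)$ the last term vanishes, so the first term in (\ref{eq:ka9}) does not depend on $C$ and only the second term needs to be minimized.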

554: Thus we can  legitimately

555: transform

556: the Kullback-Leibler information

557: to a cost

558: function of the

559: form  (\ref{eq:a1}), where we

560:   should set $\{f_i\}$'s as

561: \begin{eqnarray}

562:   \label{eq:bb1}

563:   f_i(\cdot)= -\ln  P_i(\cdot)~.

564: \end{eqnarray}

565: We must evaluate $\{P_i\}$'s,  their derivatives, and so on  to determine

566: the optimal

solution. Robust estimation
of these quantities may not be an easy task~\cite{silverman1,cox1}.

569:

570: \subsection{Cumulant of fourth order}\label{subsec:cum}

571: The kurtosis of a random variable $A$ is defined by

572:   \begin{eqnarray}

573:     {\kappa(A)}

574: =\frac{E(A^4)}{(E(A^2))^2}-3~.

575:   \end{eqnarray}

576: The kurtosis is related to the cumulant of the fourth order,

577: \begin{eqnarray}

578: %  \nonumber

579: Cum^{(4)}(A)=E(A^4)-3(E(A^2))^2~,

580: \end{eqnarray}

581: by

582:   \begin{eqnarray}%\nonumber

583:     {\kappa(A)}=\frac{    Cum^{(4)}(A)}{(E(A^2))^2}~.

584:   \end{eqnarray}

585: For prewhitened data the kurtosis equals the cumulant of the fourth

586: order.

As is well known~\cite{hyvarinen1,akuzawa8},
independent components can be recovered in many cases
by maximizing the absolute values of the kurtoses. Our method
is applicable
by setting

592: \begin{eqnarray}

593:   \label{eq:kur1}

594:  f_i=-\kappa^2

595: \end{eqnarray}

596: for all $i$.

597: If it is  known a priori that all the sources $\{Y_i^*\}$ have positive

598: kurtoses, we may use the kurtosis itself and  set

599: \begin{eqnarray}

600:   \label{eq:kur2}

601:  f_i=-\kappa~.

602: \end{eqnarray}

603: For these cost functions, $R$, $\{U_i\}$, and other

604: quantities needed for determining each step are calculated easily

605: from the observed data.
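For example (one possible reading of (\ref{eq:kur2}), added here as an illustration): for prewhitened data and orthogonal $C$ each $Y_i$ has unit variance, so that $\kappa(Y_i)=E(Y_i^4)-3$ and the cost function takes the form (\ref{eq:a1}) with $f_i(y)=3-y^4$. The definitions (\ref{eq:a2}) and (\ref{eq:a3}) then give
\begin{eqnarray*}
{[R(Y)]}_{ki}&=&-4\,E\left(Y_i^3Y_k\right)~,\\
U_{ikl}(Y)&=&-12\,E\left(Y_i^2Y_kY_l\right)~,
\end{eqnarray*}
which can be estimated by replacing the expectations with sample averages.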

Thus applying our method to these cost functions is a highly practical and
reasonable choice.

608: \section{Levenberg-Marquardt-type variation and performance in practice}

609: \label{sec:practice}

610: The pure-Newton updating rule (\ref{eq:a13}) has a

611: poor global convergence property.

612: This drawback is remedied  by

613:  the Levenberg-Marquardt-type variation\cite{numerical1}.

First, we modify  (\ref{eq:a13})

615: as

616:  \begin{eqnarray}

617:   \label{eq:lev1}

618:   {\rm cs}(\Delta)&=&

619: -H'\left(P_A H W H' P_A +P_S+\lambda I_{N^2}

620: \right)^{-1}P_AH{\rm cs}(R)~.

621: \end{eqnarray}

622: The initial value $\lambda_0$ for $\lambda$ is fixed at some positive value.

623: We also fix a real number  $\alpha(>1)$.

624: (In the following example we set $\lambda_0=50$ and $\alpha=10$.)

625: Then the  procedure at time $t$ is as follows:

626: \renewcommand{\labelenumi}{\roman{enumi})}

627: \begin{enumerate}

628: \item

629: Calculate $\Delta$ by  (\ref{eq:lev1}).

630: \item

631: If $F({\rm e}^{\Delta}Y(t))$ is larger than $F(Y(t))$,

632: multiply $\lambda$

633: by $\alpha$ and go back to i).

634: \item

635: Otherwise,

636: multiply $\lambda$ by $1/\alpha$ and proceed to the next time step $t+1$.

637: \end{enumerate}

The other parts of the algorithm are exactly the same as in the
pure-Newton version in Section \ref{sec:mult}.

640:

641:

642: Let us  examine the real performance of

643: our method under this setting.

644: For the cost function we choose the kurtosis as in Subsection

645: \ref{subsec:cum}.

The source signals are three synthesizer-generated
WAV files (Fig.~\ref{fig:2}).

648: \begin{figure}[htbp]

649:   \begin{center}

650: %\epsfig{file=sample.eps,width=15cm,height=3cm}

651: \epsfig{file=sample.eps,scale=0.4}

652:     \caption{Sample  data generated by a synthesizer (by courtesy of

653: N.Murata).}

654:     \label{fig:2}

655:   \end{center}

656: \end{figure}

657: Pseudo-observed data  are generated by mixing the source by

658: a random  matrix,

659:  \begin{eqnarray}

660:    \label{eq:tr1}

661:    A=I_3+S,

662:  \end{eqnarray}

663: where each element of $S$ is distributed uniformly on $(-1/2,1/2)$.

664:  The residual crosstalk of the  signals

665: demixed by our method

666: is

667: $1.29\%$ on average.  It takes about $122$ seconds (CPU time) for one

 hundred iterations of the same problem on

669:  our workstation.

670: For reference, we have also solved the same demixing problem

671: by the FastICA\cite{fastica1}.

672: In this case the residual crosstalk

673: is

$1.36\%$ on average and it takes about $156$ seconds for
one hundred
iterations on
the same workstation.
Since the author's knowledge of the FastICA package is limited,
this comparison should not be taken too seriously.
It can, however, be said
that our method performs well in practice, too.

682:

683: \section{Summary}\label{sec:summ}

684: We have constructed a new  algorithm  for  finding a

685: critical  point

686: of broad classes of cost functions  %defined

687: on the orthogonal groups. This method is second-order-convergent

688: since it  is in essence the Newton method.

689: The method here constructed  is an extension (or a restriction) of

690: the multiplicative updating method

691:  developed in our

692: previous work\cite{akuzawa8}. The constraint for $\Delta$ from the nature

693: of the orthogonal groups  makes the

694: problem a little complicated. We have, however, obtained a rigorous and

695: explicit updating rule.

696: We have also constructed

 a Levenberg-Marquardt-type variation, which is suitable for
 practical purposes.

699: The global instability inherent in the Newton method is remedied in

700: this version.

701: %  the Kullback-Leibler information, the kurtosis, {\it etc.},

702: %  suitable for

703: %the

704: %purpose.

705: Since our discussion does not depend on the

details of the cost function,

707: this method is applicable to many concrete problems.

708: The relatively  mild assumption (\ref{eq:a1}) on the form of the  cost

709: function, however,  implies that

710:  our algorithm is especially

711: suitable for

712:  the ICA.

713: %we can choose arbitrary functions for

714: %$\{f_i\}$.

715: % readily our method

716: %by

717: %prewhitening data.

718: %The potential of our method

719: Its practical utility for the ICA

 has been illustrated here by a numerical simulation.

721:

722:

723: %Let us conclude the a

724: To summarize,

725: our algorithm  has  numerous theoretical virtues such as

726: the  rigorous second order convergence, the explicit and strict formulation,

727:  and so on.

728: %Moreover

729:  It provides,

730:  also in practice,

731:   fast and powerful tools for the

732:  ICA and many other problems.

733:

734:

735:

736: %Since it does not require prewhitening,

737:

738: \section*{Acknowledgments}

739: The author would like to thank Noboru Murata and Shun-ichi Amari for

740: valuable

741: discussions and comments.

742: %\bibliography{mybib}

743: \begin{thebibliography}{6}

744:

745: \bibitem[A.Hyv\"arinen,1997]{hyvarinen1}

746: A.Hyv\"arinen (1997).

747: \newblock A Fast Fixed-Point Algorithm for Independent Component Analysis.

748: \newblock {\em Neural Computation\/}, {\em 9\/}, 1483--1492.

749:

\bibitem[B.W.Silverman,1986]{silverman1}
B.W.Silverman (1986).

752: \newblock {\em Density Estimation for Statistics and Data Analysis\/}.

753: \newblock London: Chapman \& Hall.

754:

755: \bibitem[D.Cox,1985]{cox1}

D.Cox (1985).

757: \newblock A Penalty Method for Nonparametric Estimation of the Logarithmic

758:   Derivative of a Density Function.

759: \newblock {\em Ann.Inst.Statist.Math.\/}, {\em 37\/}, 271--288.

760:

761: \bibitem[Hurri {\em et~al.\/},1998]{fastica1}

Hurri, J., G\"avert, H., S\"arel\"a, J., \& Hyv\"arinen, A. (1998).

763: \newblock FastICA package for MATLAB.

764: \newblock http://www.cis.hut.fi/projects/ica/fastica/.

765:

766: \bibitem[T.Akuzawa \& N.Murata,1999]{akuzawa8}

767: T.Akuzawa \& N.Murata (1999).

\newblock Multiplicative Nonholonomic/Newton-like Algorithm.

769: \newblock {\em preprint \\(available from

770:   http://www.islab.brain.riken.go.jp/\~{}akuzawa/)\/}.

771:

772: \bibitem[W.H.Press {\em et~al.\/},1988]{numerical1}

773: W.H.Press, B.P.Flannery, S.A.Teukolsky, \& W.T.Vetterling (1988).

774: \newblock {\em Numerical Recipes in C\/}.

775: \newblock Cambridge: Cambridge U.P.

776:

777: \end{thebibliography}

778:

779: \end{document}

780: