
1: \documentclass{epl}

2: \usepackage{amssymb,graphicx}

3:

4:

5: \newcommand{\R}{{\mathbb R}}

6: \newcommand{\sign}{ \mbox{\rm sign} }

7: \newcommand{\ext}{ \mbox{\rm extr} }

8: \newcommand{\exta}[1]

9:      {{\renewcommand{\arraystretch}{0.75} \begin{array}[t]{c}

10:        \ext \\ {\scriptstyle #1}

11:      \end{array}}}

12: \newcommand{\maxa}[1]

13:      {{\renewcommand{\arraystretch}{0.75} \begin{array}[t]{c}

14:        \max \\ {\scriptstyle #1}

15:      \end{array}}}

16: \newcommand{\eff}{{ \mbox{\rm e}} }

17: \newcommand{\smp}{ {\mbox{\rm\scriptsize smp}}}

18: \newcommand{\ens}{ {\mbox{\rm\scriptsize ens}}}

19: \newcommand{\di}{\mbox{\rm d}}

20: \newcommand{\half}{{\frac{1}{2}}}

21: \renewcommand{\#}{\displaystyle}

22: \newcommand{\halpha}{\hat{\alpha}}

23: \newcommand{\La}{\left\langle}

24: \newcommand{\Ra}{\right\rangle}

25: \newcommand{\sLa}{\langle}

26: \newcommand{\sRa}{\rangle}

27: \newcommand{\cut}[1]{}

28: \newcommand{\n}{{\scriptscriptstyle \! N}}

29: \newcommand{\hxi}{{\hat\xi}}

30:

31: \newcommand{\ppreprint}{

32: \textheight 1.2\textheight

33:   \textwidth  1.2\textwidth

34:   \oddsidemargin  -0.5cm

35:   \evensidemargin -0.5cm

36:   \topmargin -0.5cm

37:   \baselineskip 1.8\baselineskip}

38:

39: %\ppreprint

40:

41: \title{ Statistical physics of independent component analysis

42:       }

43:

44: \author{R. Urbanczik}

45:   \institute{

46:         Institut f\"ur theoretische Physik -

47:         Universit\"at W\"urzburg,

48:         Am Hubland,

49:         D-97074 W\"urzburg,

50:         Germany       }

51:

52: \pacs{89.75.Fb}{Structures and organization in complex systems}

53: \pacs{84.35.+i}{Neural networks}

54: \pacs{64.60.Cn}

55:      {Order disorder transitions; statistical mechanics of model systems}

56:

57: \begin{document}\maketitle

58:

59:

60: \begin{abstract}

61: Statistical physics is used to investigate independent component

62: analysis with polynomial contrast functions. While the replica method

63: fails,  an adapted cavity approach

64: yields valid results. The learning curves, obtained in a suitable

65: thermodynamic limit, display a

66: first order phase transition from poor to perfect generalization.

67: \end{abstract}

68:

69:

70:

71:

72:

73:

74: \newcommand{\D}{{\mathbb D}}

75:

76: During the last decade, independent component analysis (ICA) has emerged as

77: one of the most powerful unsupervised learning procedures for many

78: signal processing tasks \cite{Hyv01,Cic02}. It assumes that the observed,

79: often high dimensional signal, is a linear mixture of {\em independent}

80: source signals and aims to recover these sources just from

81: observing the mixed up signal. Hence, ICA is sometimes also

82: called blind signal deconvolution. An illustrative scenario is the

83: cocktail party problem where, to understand any single speaker, we first

84: need to identify her voice amidst the jumble of sounds reaching our

85: ears.

86:

87: The basic finding in ICA is that the distribution of the observed

88: signal will be similar to a Gaussian, especially when

89: many independent sources contribute to the linear mixture. The source

90: signals, however, will often be highly structured, and

91: non-Gaussian. ICA thus searches for a linear transformation of the

92: observations which maximizes non-Gaussianity by evaluating a suitable

93: contrast function. To detect non-Gaussianity, the

94: contrast function must compute a higher than quadratic statistic of the

95: transformed data.

96:

97: In a principled way, ICA can be derived by considering the mutual

98: information of the transformed data, which is a natural measure of statistical

99: dependence. To avoid the problem of density estimation, which

100: arises in a direct evaluation of the mutual information, one then uses

101: expansions (Edgeworth, Gram-Charlier) around Gaussianity to

102: approximate the mutual information \cite{Com94,Ama95}.

103: This leads to  contrast

104: functions which are related to the higher order cumulants of

105: the transformed data.

106:

107: This Letter provides a first analysis of ICA for

108: polynomial  contrast functions using the

109: statistical physics of disordered systems.

110: Surprisingly,

111: the replica method, one of the most powerful tools in analyzing

112: quenched disorder, fails since it cannot  control the contributions to

113: the contrast function in the large deviations regime. However, a

114: physically valid analysis is obtained by adapting the cavity

115: method, showing that the scale of the learning curve depends on the

116: degree of  the polynomial. Unusually, for a system with continuous couplings,

117: the curve itself is a step function, jumping from poor to perfect

118: generalization. But a badly generalizing state is always

119: metastable and it is remarkable that we can nevertheless find polynomial time

120: algorithms which generalize well.

121:

122: In formal terms, we assume that the

123: observable signal $\xi$ can be written as $\xi = M\hxi$, where

124: the source $\hxi$ is an $N$-dimensional  random variable with

125: independent components and $M$ is the $N$ by $N$ mixing matrix.

126: Learning is based on a training set $\D$ of $P$ independent

127: observations  $\xi^\mu$

128: of the signal $\xi$, obtained for a fixed, if unknown, mixing matrix $M$.

129:  The deconvolution problem (finding $\hxi$)

130: can be decomposed by first finding just one independent component,

131: subtracting it from the mixture, and reapplying the procedure to the

132: remaining $N-1$ dimensional task. Hence, I shall just deal with

133: finding the first  component $\hxi_1$ and assume that it is non-Gaussian

134: whereas all other components of $\hxi$ are Gaussian.

135:

136: Normally, the first step in ICA is to whiten the data, so that it has

137: zero mean and its covariance matrix is the identity. So, I shall

138: further assume that the source components have zero mean and unit

139: variance and that $M$ is orthogonal, $M^TM = \mathbf 1$. In short, the

140: ICA task now is to find, based on the training set $\D$, a vector $J$

141: such that $J^T\xi = \pm\hxi_1$. For this,

142: one picks a suitable non-quadratic contrast function $g$, computes the

143: empirical contrast

144: \begin{equation}

145: c_{\D}(J) = P^{-1} \sum_{\mu=1}^P g(J^T M\hxi^\mu), \label{contrast}

146: \end{equation}

147: and  chooses $J$ to maximize $c_{\D}(J)$ under the constraint $|J|=1$.

148: To analyze this problem,  one will

149: first  consider the Gibbs weight

150: $\exp(\beta N c_{\D}(J))$ at some finite inverse temperature $\beta$

151: and calculate the typical value of the logarithm of its partition function

152: $Z_\D =  \int {\rm d}J \exp(\beta N c_{\D}(J))$, where the integration

153: is over the uniform density on the unit sphere in $\R^N$. Since, via a

154: gauge, the  partition function is independent of the mixing matrix $M$,

155: we set $M= \mathbf 1$ for the analysis.
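
As a brief sketch of the gauge argument (the substitution variable $J'$ is introduced here only for illustration): since $M$ is orthogonal, one may change variables to $J' = M^T J$ in the integral defining $Z_\D$. The uniform density on the unit sphere is invariant under orthogonal maps and $|J'| = |J|$, so
\[
Z_\D = \int {\rm d}J\, \exp\Bigl(\beta \frac{N}{P} \sum_{\mu=1}^P g(J^T M\hxi^\mu)\Bigr)
     = \int {\rm d}J'\, \exp\Bigl(\beta \frac{N}{P} \sum_{\mu=1}^P g({J'}^T \hxi^\mu)\Bigr),
\]
which is the partition function one would write down for $M = \mathbf 1$.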

156:

157: I shall first consider the replica approach to this calculation and

158: for brevity assume that the contrast function is

159: $g(x) = x^3$. We are then immediately faced with the problem that

160: the moments $\La Z_\D^n \Ra_\D$ do not exist; indeed, $Z_\D$ does not

161: even have a mean

162: \footnote{In a sense, this problem already crops up for principal

163: component analysis where $g(x)=x^2$. Then $\La Z_\D^n \Ra_\D$

164: diverges, if $n$ or $\beta$ are large enough. So, using replicas, one

165: is in effect computing a continuation from small $\beta$ and large $n$

166: to large $\beta$ and small $n$.

167: }.

168: A second issue arises since $c_{\D}(J)$ is ${\cal

169: O}(N^{3/2}/P)$ for $J = \xi^\mu/|\xi^\mu|$. So, if we have just

170: $P = \alpha N$ examples, $\ln Z_\D$ is not an extensive quantity for

171: large $N$.
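
The size of this effect can be estimated in a rough sketch for $g(x)=x^3$ and $M=\mathbf 1$, assuming only that $|\xi^\mu|^2 \approx N$ for large $N$. Choosing $J = \xi^\mu/|\xi^\mu|$, the $\mu$-th term of the sum in Eq. (\ref{contrast}) alone contributes
\[
P^{-1} g(J^T\xi^\mu) = P^{-1} |\xi^\mu|^3 \approx N^{3/2}/P ,
\]
while the fields of the other patterns are typically of order one. With $P=\alpha N$ the exponent $\beta N c_\D(J)$ of the Gibbs weight is then of order $\beta N^{3/2}/\alpha$ for this particular $J$, which grows faster than $N$.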

172:

173:

174: \newcommand{\KN}{K_{\!\scriptscriptstyle N}}

175: \newcommand{\LN}{L_{\scriptscriptstyle N}}

176: \newcommand{\gN}{g_{\scriptscriptstyle N}}

177:

178: To address the first problem, we introduce a cutoff $\KN > 0$, replacing

179: $g(x) = x^3$ by $\gN(x) = \min\{x^3,\KN^3\}$ in Eq. (\ref{contrast}).

180: Since we want to

181: ultimately recover the $g(x) = x^3$ case, we assume that $\KN$

182: diverges with increasing $N$.

183: Nevertheless, due to

184: the cutoff, the moments of  $Z_\D$ now exist for any finite $N$.

185: Further, we assume that the training set has $P=\alpha \LN N$ and

186: not just $\alpha N$ patterns. Then, if $\LN$ diverges sufficiently quickly

187: w.r.t. $N$ and $\KN$,  $\ln Z_\D$ will be an extensive quantity.

188: Finally, we should find that for the purpose of calculating  $\ln

189: Z_\D$ for large $N$, choosing $K_N = \sqrt{N}$ is equivalent to not

190: cutting off at all. The reason for this quite simply is that

191: for $N\rightarrow\infty$

192: the fields $J^T \xi^\mu$ are bounded by $\sqrt{N}$ for

193: almost all training sets.

194:

195: In this setting, standard arguments yield the exact finite $N$ result

196: \begin{eqnarray*}

197: \La Z_\D^n \Ra_\D &=&

198: \lambda_{N,n}\!\! \int\!\! {\rm d}R{\rm d}Q

199:   \det(Q\!-\! R R^T)^{\frac{N-n+1}{2}}

200: {\cal G}_{\scriptscriptstyle N} (R,Q)^N \\

201: {\cal G}_{\scriptscriptstyle N}(R,Q) &=&

202: \La \prod_{a=1}^n

203:   \exp\left( \frac{\beta\min\{(R^a \xi_1 + X^a)^3,\KN^3\}}{\alpha L_N}

204:   %\gN(R^a \xi_1 + X^a)

205: \right)

206:   \Ra_{\xi_1,X}^{\alpha \LN}

207: \end{eqnarray*}

208: Here $R$ is an $n$-vector, $Q$ a symmetric $n$ by $n$ matrix with

209: $Q^{aa}=1$, and the domain of integration is such that the matrix

210: $Q - R R^T$ is positive definite.

211: The $X^a$ are zero mean Gaussian with covariances

212: $\La X^a X^b\Ra = Q^{ab} - R^a R^b$, and $\lambda_{N,n}$ is obtained using that

213: the moments equal $1$  for $\beta = 0$.

214: Now, given any sequence of cutoffs

215: $\KN$, we can certainly find $\LN$ so that

216: ${\cal G}_{\scriptscriptstyle N}(R,Q)$ stays

217: finite for large $N$. Then, we should be able to use Laplace's method

218: of the maximum point to find that in the large $N$ limit

219: \begin{equation}

220: \frac{1}{N}\ln\La Z_\D^n \Ra_\D \!=\! \sup_{R,Q}\,

221: \ln {\cal G}_N(R,Q) + \half \ln \det(Q\!-\! R R^T)\,. \label{lapl}

222: \end{equation}

223: But at this point, at the latest, it is clear that something is amiss.

224: The limiting value of the above RHS depends only on the

225: relative scalings of $K_N$ and $L_N$ and not on the relationship of

226: these scalings to the system size $N$.

227: So (\ref{lapl}) implies  that the scale of the learning curve can be

228: {\em arbitrarily} stretched by using cutoffs which diverge quickly

229: with $N$. This problem arises regardless of assumptions about replica

230: symmetry.

231:

232: We proceed anyway and, using the replica symmetric

233: parameterization of (\ref{lapl}), find for $N\rightarrow\infty$

234: \begin{eqnarray}

235: \frac{1}{N}\La\ln Z_{\D} \Ra_\D

236: &=&

237: \sup_r \inf_q\,\, G_r(q,R) + G_s(q,r) \nonumber \\

238: G_r(q,R) &=&

239: \alpha L_N \La \!\ln\!\La \exp\left(

240:     \frac{\beta}{\alpha L_N}\gN(r \xi_1 + \sqrt{q-r^2}y_0+\sqrt{1-q}y_1)

241:        \right)  \Ra_{\!\!y_1} \Ra_{\!\!\xi_1,y_0} \nonumber \\

242: G_s(q,r) &=& \half \frac{q-r^2}{1-q} + \half\ln(1-q) \label{rsZ}

243: \end{eqnarray}

244: where

245: $y_0,y_1$ are standard Gaussians, i.e. with zero mean and unit variance.

246: The extremal $r$

247: is just the typical value of the first component of a weight vector

248: picked from the Gibbs density and

249: measures the extent to which the structure in the data is recognized.

250: Using (\ref{rsZ}), we relate the scalings of

251: $\KN$ and $\LN$. For $\LN \gg \KN$ the energy term converges to

252: $G_r(q,R) = r^3 \La \xi_1^3 \Ra$. This is the limit of many

253: examples where $r=1$ for all $\alpha$. In contrast, for $\LN \ll \KN$

254: there are too few examples and  $G_r(q,R)$ diverges.

255:

256: So, the scale of the learning curve is given by setting $\LN = \KN$.

257: On this scale,

258: we find that  $G_r(q,R)$ converges to $r^3 \La \xi_1^3 \Ra$ as in the

259: limit of many examples if $q$ exceeds a critical value

260: $q_c(\alpha,\beta)$,  whereas $G_r(q,R)$ diverges for $q

261: <q_c(\alpha,\beta)$. Solving the extremal problem for $q$ by taking the

262: limit $q\rightarrow q_c(\alpha,\beta)$ from above, then taking the

263: $\beta\rightarrow\infty$ limit, we finally find the

264: simple result for the

265: ground state:

266: $

267: c(\alpha)= \sup_r

268: r^3 \La  \xi_1^3 \Ra_{\xi_1}+  (1-r^2)/\alpha. %\label{repfin}

269: $

270: Here $c(\alpha)$ is the typical value of the highest achievable

271: empirical contrast, $\max_{|J|=1} c_\D(J)$. The learning curve for $r$

272: thus obtained is a step function showing a first order

273: phase transition at $\alpha_c = 1/\La  \xi_1^3 \Ra_{\xi_1}$

274: from no learning ($r=0$) to perfect learning ($r=1$).

275: But the $r=0$ state is metastable for all values $\alpha >

276: \alpha_c$.
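
The step can be checked directly from the expression for $c(\alpha)$. The shorthand $\kappa = \La\xi_1^3\Ra_{\xi_1}$ and the function $f(r)$ are introduced here only for illustration, and $\kappa>0$ is assumed. With
\[
f(r) = \kappa r^3 + (1-r^2)/\alpha, \qquad
\frac{{\rm d}f}{{\rm d}r} = r\,(3\kappa r - 2/\alpha),
\]
the only stationary point besides $r=0$ is $r = 2/(3\kappa\alpha)$, where the second derivative equals $2/\alpha > 0$, so it is a minimum. The supremum over $0\leq r\leq 1$ therefore sits at an endpoint,
\[
f(0) = 1/\alpha, \qquad f(1) = \kappa ,
\]
and $r$ jumps from $0$ to $1$ once $\kappa > 1/\alpha$, that is at $\alpha_c = 1/\kappa$. At $r=0$ the first derivative vanishes and the second derivative is $-2/\alpha < 0$, so $r=0$ stays a local maximum for every $\alpha$, in line with the metastability noted above.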

277:

278:

279: \begin{figure}

280:    \begin{tabular}{l}

281:         \mbox{\begin{tabular}{l}

282:            \includegraphics[scale=0.8]{cfig1.eps}

283:               \end{tabular}}

284:    \end{tabular}

285:  \caption{

286:  Prediction of $\KN=\sqrt{N}$ replica theory (bold line) compared to

287:  simulation results. The non-Gaussian source is

288:   $\hat\xi_1 =(y^2-1)/\sqrt 2$, where $y$ is a standard Gaussian.

289:  The empty symbols show the results for the algorithm finding local

290:  maxima of the empirical contrast. The full symbols, denoting results

291:  for the iterated version of the procedure described in the main text,

292:  show that the agreement with the replica theory improves quickly with

293:  increasing system size $N$ for this algorithm.

294:  The  error bars estimate the standard deviation of the sample to sample

295:  fluctuations.

296: }

297: \end{figure}

298:

299: The replica theory predicts that for any divergent sequence of

300: cutoffs $\KN$, e.g. $\KN = e^N$,  we need $P > \alpha_c \KN N$ examples for

301: good generalization when $N$ is large.

302: While this is ridiculous, I have argued above

303: that choosing $\KN=\sqrt N$ is, for $N\rightarrow\infty$,

304:  equivalent to not cutting off at all. To

305: compare the replica result for this choice of $\KN$ to

306: numerical  simulations, let us consider

307: actually finding a weight vector maximizing $c_\D(J)$.

308: It turns out that a rather simple discrete dynamics can be used since

309: $g(x) = x^3$. Starting with a random

310: vector of unit length $J^0$, at the $k$-th time step we first compute the

311: matrix

312: $A(J^k) = \sum_{\mu=1}^P \xi^\mu ({J^k}^T \xi^\mu ) {\xi^\mu}^T$

313: and then choose $J^{k+1}$ to maximize

314: $|J^T A(J^k) J|$ under the constraint $|J|=1$.

315: So, $J^{k+1}$ is an

316: eigenvector to the eigenvalue of largest magnitude of $

317: A(J^k)$. Standard results on quadratic forms imply that

318: $|{J^{k+1}}^T A(J^k) J^{k+1}| \geq   |{J^{k}}^T A(J^{k-1}) J^{k}|$,

319: and the inequality is strict unless we are at a fixed point.

320: Hence, the iteration converges to a vector $J^\infty$ which is a local

321: maximum or minimum of  $c_\D(J)$. In the latter case, we just flip the

322: sign of $J^\infty$ to obtain a local maximum.
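
The monotonicity of the iteration can be spelled out as follows; the trilinear form $T$ is notation introduced here only for this sketch. Define
\[
T(a,b,c) = \sum_{\mu=1}^P (a^T\xi^\mu)(b^T\xi^\mu)(c^T\xi^\mu),
\]
which is symmetric in its three arguments, so that $a^T A(J^k)\, b = T(J^k,a,b)$ and, for $g(x)=x^3$, $T(J,J,J) = P\,c_\D(J)$. Because $A(J^k)$ is a symmetric matrix, the magnitude of its bilinear form on unit vectors is bounded by the largest eigenvalue magnitude, and $J^{k+1}$ attains this bound. Therefore
\[
|T(J^k,J^{k+1},J^{k+1})| \geq |T(J^k,J^{k-1},J^k)| = |T(J^{k-1},J^k,J^k)|,
\]
and the right hand side is $|{J^k}^T A(J^{k-1}) J^k|$. Applying the same bound with both test vectors set to $J^k$ gives $|T(J^k,J^{k+1},J^{k+1})| \geq P\,|c_\D(J^k)|$.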

323:

324: Simulation results for the procedure, compared to the $\KN =

325: \sqrt{N}$ replica theory in Fig. 1, show that the performance of

326: the algorithm is rather poor. This is in line with the

327: theoretical findings, since these predict that $r=0$ is

328: metastable, and the algorithm is only finding a local maximum. Figure 1

329: also shows results for an iterated version of the algorithm. There the

330: algorithm is rerun with $m=0.1N$ different random initial conditions,

331: and the weight vector maximizing $c_\D(J)$ among the $m$ outcomes is

332: chosen. These results are in good agreement with the $\KN =

333: \sqrt{N}$ replica theory, indicating that beyond the phase transition the

334: basin of attraction of the global maximum is quite large.

335:

336: Even if the simulations indicate

337: that the replica approach can be saved by

338: plugging in the correct scaling of the cutoff $\KN$ at the end,

339: the theoretical situation is highly unsatisfactory.

340: I shall next show that a physically

341: reasonable analysis can be provided by adapting the cavity method.

342: This is much simplified if we make some major

343: changes to the notation. From now on the non-Gaussian source will be

344: denoted by $\gamma$, whereas all of the $N$  components of $\xi$ are

345: assumed independent standard Gaussian. Our primary goal is to calculate

346: the typical value of $C_r = \max_{|J|=1} C_r(J)$ with

347: \begin{equation}

348: C_r(J) = \frac{1}{P}\sum_{\mu=1}^P g(r \gamma^\mu + \sqrt{1-r^2} J^T

349: \xi^\mu)

350: \label{orig}

351: \end{equation}

352: where $J$ is an $N$-dimensional vector. So $C_r$ is the maximal value of

353: the empirical contrast achievable on an $r$-shell. For generality, we

354: shall no longer assume that $g(x)$ must be cubic but consider any

355: super-quadratic function which does not diverge too quickly.

356: In particular, for some $k>0$,

357: $

358: \lim_{x\rightarrow\infty}{g(x)}/{x^{2+k}} = \psi

359: $

360: should exist and be positive. Without loss of generality, we may then

361: assume $\psi=1$.

362:

363: We still have $P=\alpha \LN N$

364: examples and consider the random variable $J_\D$ with the Gibbs

365: density

366: \begin{eqnarray}

367: p_\D(J) &=& \frac{1}{Z_\D(\beta)}

368:           \frac{e^{-\half |J|^2}}{(2 \pi)^{\half N}}

369:           \prod_{\mu=1}^P

370:             e^{\frac{\beta}{\LN}

371:                      g(\gamma^\mu,[J]^T\xi^\mu)}

372: \nonumber \\

373: g(\gamma^\mu,[J]^T\xi^\mu) &=&

374: g(r \gamma^\mu + \sqrt{1-r^2} [J]^T\xi^\mu)\,. \label{GD}

375: \end{eqnarray}

376: Here $[J] = J/|J|$ and $Z_\D(\beta)$ is given

377: by the normalization $\int \!{\rm d}J\, p_\D(J) =1$. Note, that we are now

378: using a factorizing Gaussian prior on $J$ and, to compensate for this, the

379: normalized vector $[J]$ is used to calculate the field in (\ref{GD}).

380:

381: A key task in the cavity approach is to obtain the field distribution by

382: calculating  the thermal average

383: $\La \phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D}$ for  any function

384: $\phi$. One finds

385: \begin{eqnarray}

386: \La\phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} &=&

387: \frac{Z_{\D/\mu}(\beta)}{Z_\D(\beta)}

388: \La e^{\frac{\beta}{\LN}

389:                      g(\gamma^\mu,[J_{\D/\mu}]^T\xi^\mu)}

390:     \phi(\gamma^\mu,[J_{\D/\mu}]^T\xi^\mu) \Ra_{J_{\D/\mu}},

391: \label{cav}

392: \end{eqnarray}

393: where $J_{\D/\mu}$ is the random variable with the Gibbs density obtained

394: when pattern $\mu$ is removed from the system, i.e.

395: omitting the $\mu$-th factor

396: of the product in (\ref{GD}) and adjusting the partition function to

397: $Z_{\D/\mu}(\beta)$.

398: The variance of the cavity field

399: $[J_{\D/\mu}]^T\xi^\mu$ is a self averaging quantity and it must then

400: equal $1-q$ for large $N$, where

401: $q = |\La [J_{\D/\mu}] \Ra_{J_{\D/\mu}}|^2$. Normally, one would further argue

402: that  $[J_{\D/\mu}]^T\xi^\mu$ becomes Gaussian in the thermodynamic limit.

403: But if we assume this,

404: the $J_{\D/\mu}$ average in (\ref{cav}) diverges even when

405: $\phi$ is a simple bounded function.

406: This highlights the fact that the cavity field is not Gaussian in the large

407: deviations regime because

408: $[J_{\D/\mu}]^T\xi^\mu$ cannot be larger than $|\xi^\mu|$.

409:

410:

411: Hence, I rephrase the cavity argument as follows: For the purpose of

412: calculating overlaps with a random vector such as $\xi^\mu$,

413: the unnormalized $J_{\D/\mu}$ can for large $N$ be treated as a

414: Gaussian (with covariance matrix $(1-q)\mathbf 1$).

415: Then, the fluctuations of the cavity field obtained using

416: the normalized $[J_{\D/\mu}]$,

417: \[

418: P_{N,q}(h) = \La

419: \delta\left(h -

420: \left([J_{\D/\mu}]^T-\La[ J_{\D/\mu}]^T\Ra_{J_{\D/\mu}}\right)\xi^\mu

421: \right) \Ra_{J_{\D/\mu}}

422: \]

423: can be explicitly calculated.

424: This yields the

425: important fact that there are just two relevant scales for the cavity

426: fluctuations.

427: For large $N$,

428: $P_{N,q}(h)$ converges to

429: $e^{-\half h^2/(1-q)}/\sqrt{2 \pi (1-q)}$

430: if  $h \ll \sqrt{N}$,  but in the large deviations regime, for

431: $h = d \sqrt{N}$,

432: \begin{equation}

433: \lim_{N\rightarrow\infty} N^{-1}\ln P_{N,q}(d \sqrt{N}) =

434: -\half \frac{ q d^2}{1-q} + \half\ln(1-d^2)

435: \label{ldev}

436: \end{equation}

437: if $0\leq d\leq1$.
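
As a consistency check (a sketch, not part of the derivation itself), the two regimes match for small $d$. Expanding $\ln(1-d^2) \approx -d^2$ in Eq. (\ref{ldev}) gives
\[
-\half \frac{q d^2}{1-q} - \half d^2 = -\half \frac{d^2}{1-q},
\]
which is exactly what the Gaussian form yields for $N^{-1}\ln P_{N,q}(h)$ at $h = d\sqrt N$, since then $-\half h^2/(1-q) = -\half N d^2/(1-q)$.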

438: Now, in terms of the functional

439: \[

440: {\cal L}^{q,\beta}_{y,\gamma}(\phi) =

441: \int_{-\sqrt{N}}^{\sqrt{N}}

442: {\rm d}h\, P_{N,q}(h)\,\phi(\gamma,\sqrt{q}y+h)\,

443: e^{\frac{\beta}{\LN} g(\gamma,\sqrt{q}y+h)}

444: \]

445: the average in Eq. (\ref{cav}) can in the limit of large $N$  be rewritten as

446: $\La\phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} =

447: {\cal L}^{q,\beta}_{y^\mu,\gamma^\mu}(\phi)/

448: {\cal L}^{q,\beta}_{y^\mu,\gamma^\mu}(1)$

449: with $y^\mu = q^{-\half}\La[ J_{\D/\mu}]\Ra_{J_{\D/\mu}}^T\xi^\mu$. So  the

450: quenched  averages are

451: \begin{eqnarray}

452: \La \La \phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} \Ra_\D

453: &=& \La

454: \frac{{\cal L}^{q,\beta}_{y,\gamma}(\phi)}

455:      {{\cal L}^{q,\beta}_{y,\gamma}(1)} \Ra_{y,\gamma}  \label{qav} \\

456: \La \ln Z_\D(\beta) - \ln Z_{\D/\mu}(\beta) \Ra_\D &=&

457: \La \ln {\cal L}^{q,\beta}_{y,\gamma}(1)  \Ra_{y,\gamma}

458: \label{qav1}

459: \end{eqnarray}

460: where $y$ is standard Gaussian. The last equation is

461: obtained by setting $\phi =1$ in (\ref{cav}).

462:

463: We can now consider whether the large deviations regime contributes to

464: the averages in (\ref{qav}) for a polynomially bounded

465: $\phi$. Using that for large arguments $g(x) \sim x^{2+k}$ and

466: referring to  Eq. (\ref{ldev}), we find that it

467: will contribute if the maximum of

468: \begin{equation}

469: u(d) =

470: \beta d^{k+2}\frac{N^{\half k}}{\LN}

471: - \half \frac{ q d^2}{1-q} +

472: \half\ln(1-d^2)

473: \label{reldev}

474: \end{equation}

475: is positive for large $N$. This won't happen if

476: $\LN \gg  N^{\half k}$ and

477: Eq. (\ref{qav}) then implies that

478: $\La \La \phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} \Ra_\D =

479: \La \phi(\gamma,y) \Ra_{y,\gamma}$. The empirical mean equals

480: the expectation value and so the learning curve is  trivial.

481: Henceforth, we focus on the relevant scale, setting

482: $\LN =  N^{\half k}$.
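
The origin of Eq. (\ref{reldev}) can be sketched in a few lines, assuming only $g(x)\approx x^{2+k}$ for large $x$ and the rate in Eq. (\ref{ldev}). As in Eq. (\ref{reldev}), the terms of order one in the argument of $g$ and the prefactor $\sqrt{1-r^2}$ from Eq. (\ref{GD}) are suppressed here. At $h = d\sqrt N$ the Gibbs factor has the exponent
\[
\frac{\beta}{\LN}\, g(d\sqrt N) \approx \frac{\beta}{\LN}\, d^{k+2} N^{\frac{k+2}{2}}
= N\, \beta d^{k+2} \frac{N^{\half k}}{\LN},
\]
while $\ln P_{N,q}(d\sqrt N) \approx N\bigl[-\half q d^2/(1-q) + \half \ln(1-d^2)\bigr]$. The sum of the two is $N u(d)$, so the large deviations regime carries exponentially large weight when $\max_d u(d) > 0$ and exponentially small weight when the maximum over $d>0$ is negative. For the cubic contrast, $k=1$, the relevant scale is thus $\LN = \sqrt N$.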

483:

484: Our next task is to calculate the response when a new coupling $J_0$ is

485: added to the system and each pattern $\xi^\mu$ is augmented by

486: a new component $\xi_0^\mu$. We denote the augmented training set by

487: $\hat\D$ and use (\ref{GD}) to define the partition function

488: $Z_{\hat\D}(\beta)$ of the $N+1$ dimensional system.

489: Due  to the $N$-dependence of the Gibbs weight

490: $e^{\frac{\beta}{\LN}g(\gamma^\mu,[J]^T\xi^\mu)}$, it is simplest

491: to assume a slightly different temperature

492: $\hat\beta_N = \beta L_{\scriptscriptstyle N+1}/\LN$

493: in the augmented system. Then,

494: when  considering the ratio $Z_{\hat\D}(\hat\beta_N)/Z_{\D}(\beta)$,

495: the two systems have the same Gibbs weight per pattern.

496: Standard arguments  \cite{Mez89} thus apply and yield that

497: $

498: \sLa \ln Z_{\hat\D}(\hat\beta_N)/Z_\D(\beta) \sRa_{\hat\D}

499: = G_s(q,0)

500: $ for large $N$.

501: Here $G_s(q,0)$ is the entropy term of the

502: replica theory (Eq. \ref{rsZ}), but evaluated at $r=0$ because we are

503: calculating the partition function for each $r$-shell individually.

504:

505: Having identified, via $\LN=

506: N^{\half k}$, the scale of the learning curve,

507: $N^{-1}\La \ln Z_\D(\beta) \Ra_\D$ will

508: converge to a finite quantity  $z(\alpha,\beta)$ in the thermodynamic limit.

509: We then  have

510: %

511: \newcommand{\pdev}[1]{ \frac{\partial\,\,}{\partial #1} }

512:

513: \begin{eqnarray*}

514:   \sLa \ln Z_{\hat\D}(\hat\beta_N)/Z_\D(\beta) \sRa_{\hat\D} &=&

515:  z(\alpha,\beta) -

516:  \alpha \frac{k+2}{2}\pdev{\alpha}{z(\alpha,\beta)} +

517:  \frac{\beta k}{2} \pdev{\beta}{z(\alpha,\beta)}.

518: \end{eqnarray*}

519: The derivative of $z$ with respect to $\alpha$ is obtained from

520: Eq. (\ref{qav1}), and the thermal derivative is found

521: from  (\ref{qav}) using $\phi =g$.
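
The identity above follows from a short expansion, sketched here under the assumption that $\La \ln Z_\D(\beta)\Ra_\D \approx N z(\alpha,\beta)$ with $\alpha = P/(\LN N) = P N^{-\frac{k+2}{2}}$. Going from $N$ to $N+1$ at fixed $P$ shifts the two arguments of $z$ by
\[
\delta\alpha \approx -\frac{k+2}{2}\,\frac{\alpha}{N}, \qquad
\delta\beta = \hat\beta_N - \beta
= \beta\left[\left(1+\frac{1}{N}\right)^{\half k} - 1\right]
\approx \frac{k}{2}\,\frac{\beta}{N},
\]
so that
\[
(N+1)\, z(\alpha+\delta\alpha,\beta+\delta\beta) - N z(\alpha,\beta)
\approx z + N\,\delta\alpha\, \pdev{\alpha} z + N\,\delta\beta\, \pdev{\beta} z ,
\]
which reproduces the three terms on the right hand side.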

522:

523: Putting things together, we finally find for large $N$

524: \begin{eqnarray}

525: z(\alpha,\beta) &=&

526: \alpha \La \frac{k+2}{2} N^{\half k}  \ln {\cal L}^{q,\beta}_{y,\gamma}(1)

527: - \frac{\beta k}{2} \frac{{\cal L}^{q,\beta}_{y,\gamma}(g)}

528:                           {{\cal L}^{q,\beta}_{y,\gamma}(1)}

529: \Ra_{y,\gamma}\!\!

530: + G_s(q,0)\,, \label{zfunc}

531: \end{eqnarray}

532: where the value of $q$ still has to be determined.

533:

534: For this, let us reconsider when the large deviations regime

535: contributes to the value of ${\cal L}^{q,\beta}_{y,\gamma}(1)$. Going back

536: to Eq. (\ref{reldev}), with $\LN =  N^{\half k}$,

537: we see that as in the replica theory this is governed by a critical

538: value $q_{\rm c}(\beta)$ of $q$.

539: For $q < q_{\rm c}(\beta)$, $\max_d u(d)$ is positive in the large $N$ limit,

540: so (\ref{zfunc}) diverges.

541: The possible range for $q$ is thus $q_{\rm c}(\beta) \leq q \leq 1$.

542: But,  if we assume $q > q_{\rm c}(\beta)$, the large $N$ limit yields the

543: very simple result

544: $

545: z(\alpha,\beta) =   G_s(q,0) + \alpha \beta \La g(\gamma,y) \Ra_{\gamma,y}

546: $.

547: Now, on one hand,  the empirical contrast is found by

548: differentiating $z(\alpha,\beta)$ w.r.t. $\beta$ and dividing by $\alpha$. This yields

549: $\La g(\gamma,y) \Ra_{\gamma,y} + \frac{1}{\alpha}G'_s(q)\pdev{\beta}q$.

550: But computing the same quantity using (\ref{qav}) yields

551: $\La g(\gamma,y) \Ra_{\gamma,y}$. So $q$ must stay  constant when $\beta$

552: varies, but this is impossible since $q_{\rm c}(\beta)\rightarrow 1$ for

553: $\beta\rightarrow\infty$.

554:

555: Hence, the only possible value for $q$ is $q_{\rm

556: c}(\beta)$.

557: Evaluating (\ref{zfunc}) by taking the limit $q\rightarrow q_{\rm

558: c}(\beta)$ from above, leads to the same result as in the $\KN =

559: \sqrt{N}$ replica theory. But, of course, this  has the same

560: inconsistencies as found for the $q > q_{\rm c}(\beta)$ assumption.

561: It also makes no physical sense to use (\ref{zfunc})

562: at the point of discontinuity since  the cavity  derivation neglects

563: fluctuations of $q$. Even if these vanish with increasing $N$, at the point

564: of discontinuity, $q=q_{\rm c}(\beta)$, the true result will

565: nevertheless  depend on the unknown fluctuations.

566:

567: But some conclusions can be drawn, knowing that $q$ has the

568: critical value. Let $d_\beta$ be the unique positive value such that

569: $u(d_\beta) =0$ for   $q=q_{\rm c}(\beta)$. Then arguments analogous

570: to the derivation of  (\ref{qav}) show that the probability of the

571: posterior field $[J_\D]^T\xi^\mu$ exceeding $d\sqrt{N}$ is {\em not}

572: exponentially small if $d$ is lower than $d_\beta$.

573: More precisely, one finds for

574: $N\rightarrow\infty$ and $d < d_\beta$

575: \begin{eqnarray*}

576: \La N^{-1}\ln\sLa \Theta([J_\D]^T\xi^\mu - d\sqrt{N}) \sRa_{J_\D} \Ra_\D

577: &=&  \\

578: \La N^{-1}\ln

579: {{\cal L}^{q,\beta}_{y,\gamma}(\Theta(h - d\sqrt{N}))}/

580:      {{\cal L}^{q,\beta}_{y,\gamma}(1)} \Ra_{y,\gamma} &=& 0\,.

581: \end{eqnarray*}

582: Further, $d_\beta$ approaches $1$ with increasing $\beta$. But this is

583: only possible if simply aligning the weight vector with the pattern $\xi^\mu$

584: maximizes the empirical contrast, at least up to sub-extensive corrections. So,

585: in the notation of Eq. (\ref{orig}), we have $C_r = C_r([\xi^\mu ])$

586: for large $N$, and thus finally

587: \begin{equation}

588: C_r  =

589: (1-r^2)^{\frac{2+k}{2}}/\alpha + \La g(r \gamma + \sqrt{1-r^2}\,y)

590: \Ra_{\gamma,y}\,.

591: \label{final}

592: \end{equation}

593: Maximizing this in $r$, the same learning curve is obtained for

594: the cubic case, $g(x)=x^3$, as in

595: the  $\KN=\sqrt N$ replica theory

596: %

597: \footnote{

598: For $g(x)=x^4$, the curve depends on whether $\sLa \gamma^4

599: \sRa_\gamma > 3$, since the fourth moment of a standard Gaussian is

600: $3$. If so, the value of $r$ jumps from $0$ to $1$ at

601: $\alpha_c = 1/(\sLa \gamma^4\sRa_\gamma - 3)$. The

602: $\sLa \gamma^4\sRa_\gamma < 3$ case, where one will use  $g(x)=-x^4$,

603: shall be described elsewhere. It

604: is much simpler since the large deviations regime does not contribute.}.

605: %

606: It is important to note that we have in essence just used the standard

607: weak correlation assumptions of the cavity method in deriving (\ref{final}).

608: In view of the good agreement with numerical simulations (Fig. 1),

609: this strongly suggests that the cavity result is indeed exact in the

610: thermodynamic limit.
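
For orientation, the first term of Eq. (\ref{final}) can be recovered by a rough estimate, assuming $|\xi^\mu|\approx\sqrt N$, $g(x)\approx x^{2+k}$ for large $x$, and $P = \alpha N^{\half k} N$. With $J = [\xi^\mu]$ the $\mu$-th term in Eq. (\ref{orig}) is
\[
\frac{1}{P}\, g\bigl(r\gamma^\mu + \sqrt{1-r^2}\,|\xi^\mu|\bigr)
\approx \frac{\bigl((1-r^2) N\bigr)^{\frac{2+k}{2}}}{\alpha N^{\frac{2+k}{2}}}
= \frac{(1-r^2)^{\frac{2+k}{2}}}{\alpha},
\]
whereas for the other $P-1$ patterns the field $[\xi^\mu]^T\xi^\nu$ is approximately a standard Gaussian, so their empirical mean approaches $\La g(r\gamma+\sqrt{1-r^2}\,y)\Ra_{\gamma,y}$. In the cubic case, $g(x)=x^3$, this second term equals $r^3\La\gamma^3\Ra_\gamma$, because the odd moments of $y$ and the mean of $\gamma$ vanish.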

611:

612: From an analytical point of view, it is intriguing that the present

613: problem reveals a difference in the scope of the replica and the

614: cavity method. The latter can be transparently adapted to take

615: into account that the cavity field is not Gaussian in the large

616: deviations regime. But, commuting the thermal average with the disorder

617: average, at the expense of considering moments, is part and parcel

618: of using replicas. As a consequence, all the relevant fields

619: become truly Gaussian. This points to implicit assumptions in the

620: replica method, which need to be taken care of in any program to put

621: the approach on a solid mathematical footing \cite{Par02}.

622:

623: \acknowledgements

624:

625: It is a pleasure to acknowledge many discussions with Manfred Opper.

626: This work was supported by the Deutsche Forschungsgemeinschaft.

627:

628: \bibliographystyle{unsrt}

629: \bibliography{/home/robert/tex/neural}

630:

631: \end{document}


642: