1: \documentclass{epl}
2: \usepackage{amssymb,graphicx}
3:
4:
5: \newcommand{\R}{{\mathbb R}}
6: \newcommand{\sign}{ \mbox{\rm sign} }
7: \newcommand{\ext}{ \mbox{\rm extr} }
8: \newcommand{\exta}[1]
9: {{\renewcommand{\arraystretch}{0.75} \begin{array}[t]{c}
10: \ext \\ {\scriptstyle #1}
11: \end{array}}}
12: \newcommand{\maxa}[1]
13: {{\renewcommand{\arraystretch}{0.75} \begin{array}[t]{c}
14: \max \\ {\scriptstyle #1}
15: \end{array}}}
16: \newcommand{\eff}{{ \mbox{\rm e}} }
17: \newcommand{\smp}{ {\mbox{\rm\scriptsize smp}}}
18: \newcommand{\ens}{ {\mbox{\rm\scriptsize ens}}}
19: \newcommand{\di}{\mbox{\rm d}}
20: \newcommand{\half}{{\frac{1}{2}}}
21: \renewcommand{\#}{\displaystyle}
22: \newcommand{\halpha}{\hat{\alpha}}
23: \newcommand{\La}{\left\langle}
24: \newcommand{\Ra}{\right\rangle}
25: \newcommand{\sLa}{\langle}
26: \newcommand{\sRa}{\rangle}
27: \newcommand{\cut}[1]{}
28: \newcommand{\n}{{\scriptscriptstyle \! N}}
29: \newcommand{\hxi}{{\hat\xi}}
30:
31: \newcommand{\ppreprint}{
32: \textheight 1.2\textheight
33: \textwidth 1.2\textwidth
34: \oddsidemargin -0.5cm
35: \evensidemargin -0.5cm
36: \topmargin -0.5cm
37: \baselineskip 1.8\baselineskip}
38:
39: %\ppreprint
40:
41: \title{ Statistical physics of independent component analysis
42: }
43:
44: \author{R. Urbanczik}
45: \institute{
46: Institut f\"ur theoretische Physik -
47: Universit\"at W\"urzburg,
48: Am Hubland,
49: D-97074 W\"urzburg,
50: Germany }
51:
52: \pacs{89.75.Fb}{Structures and organization in complex systems}
53: \pacs{84.35.+i}{Neural networks}
54: \pacs{64.60.Cn}
55: {Order disorder transitions; statistical mechanics of model systems}
56:
57: \begin{document}\maketitle
58:
59:
60: \begin{abstract}
61: Statistical physics is used to investigate independent component
62: analysis with polynomial contrast functions. While the replica method
63: fails, an adapted cavity approach
64: yields valid results. The learning curves, obtained in a suitable
65: thermodynamic limit, display a
66: first order phase transition from poor to perfect generalization.
67: \end{abstract}
68:
69:
70:
71:
72:
73:
74: \newcommand{\D}{{\mathbb D}}
75:
76: During the last decade, independent component analysis (ICA) has emerged as
one of the most powerful unsupervised learning procedures for many
signal processing tasks \cite{Hyv01,Cic02}. It assumes that the observed,
often high-dimensional, signal is a linear mixture of {\em independent}
source signals and aims to recover these sources just from
observing the mixed signal. Hence, ICA is sometimes also
82: called blind signal deconvolution. An illustrative scenario is the
83: cocktail party problem where, to understand any single speaker, we first
84: need to identify her voice amidst the jumble of sounds reaching our
85: ears.
86:
87: The basic finding in ICA is that the distribution of the observed
88: signal will be similar to a Gaussian, especially when
89: many independent sources contribute to the linear mixture. The source
90: signals, however, will often be highly structured, and
91: non-Gaussian. ICA thus searches for a linear transformation of the
92: observations which maximizes non-Gaussianity by evaluating a suitable
contrast function. To detect non-Gaussianity, the
contrast function must compute a statistic of higher than quadratic order of the
transformed data.
96:
97: In a principled way, ICA can be derived by considering the mutual
98: information of the transformed data, which is a natural measure of statistical
99: dependence. To avoid the problem of density estimation, which
100: arises in a direct evaluation of the mutual information, one then uses
101: expansions (Edgeworth, Gram-Charlier) around Gaussianity to
102: approximate the mutual information \cite{Com94,Ama95}.
103: This leads to contrast
104: functions which are related to the higher order cumulants of
105: the transformed data.
106:
107: This Letter provides a first analysis of ICA for
108: polynomial contrast functions using the
109: statistical physics of disordered systems.
110: Surprisingly,
111: the replica method, one of the most powerful tools in analyzing
112: quenched disorder, fails since it cannot control the contributions to
113: the contrast function in the large deviations regime. However, a
114: physically valid analysis is obtained by adapting the cavity
115: method, showing that the scale of the learning curve depends on the
116: degree of the polynomial. Unusually, for a system with continuous couplings,
117: the curve itself is a step function, jumping from poor to perfect
118: generalization. But a badly generalizing state is always
119: metastable and it is remarkable that we can nevertheless find polynomial time
120: algorithms which generalize well.
121:
122: In formal terms, we assume that the
123: observable signal $\xi$ can be written as $\xi = M\hxi$, where
124: the source $\hxi$ is an $N$-dimensional random variable with
125: independent components and $M$ is the $N$ by $N$ mixing matrix.
126: Learning is based on a training set $\D$ of $P$ independent
127: observations $\xi^\mu$
128: of the signal $\xi$, obtained for a fixed, if unknown, mixing matrix $M$.
129: The deconvolution problem (finding $\hxi$)
130: can be decomposed by first finding just one independent component,
131: subtracting it from the mixture, and reapplying the procedure to the
132: remaining $N-1$ dimensional task. Hence, I shall just deal with
133: finding the first component $\hxi_1$ and assume that it is non-Gaussian
134: whereas all other components of $\hxi$ are Gaussian.
135:
136: Normally, the first step in ICA is to whiten the data, so that it has
137: zero mean and its covariance matrix is the identity. So, I shall
138: further assume that the source components have zero mean and unit
139: variance and that $M$ is orthogonal, $M^TM = \mathbf 1$. In short, the
140: ICA task now is to find, based on the training set $\D$, a vector $J$
141: such that $J^T\xi = \pm\hxi_1$. For this,
142: one picks a suitable non-quadratic contrast function $g$, computes the
143: empirical contrast
144: \begin{equation}
145: c_{\D}(J) = P^{-1} \sum_{\mu=1}^P g(J^T M\hxi^\mu), \label{contrast}
146: \end{equation}
147: and chooses $J$ to maximize $c_{\D}(J)$ under the constraint $|J|=1$.
148: To analyze this problem, one will
149: first consider the Gibbs weight
150: $\exp(\beta N c_{\D}(J))$ at some finite inverse temperature $\beta$
151: and calculate the typical value of the logarithm of its partition function
152: $Z_\D = \int {\rm d}J \exp(\beta N c_{\D}(J))$, where the integration
153: is over the uniform density on the unit sphere in $\R^N$. Since, via a
154: gauge, the partition function is independent of the mixing matrix $M$,
155: we set $M= \mathbf 1$ for the analysis.
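
The gauge argument can be made explicit in one line; the following is
a sketch using only the orthogonality of $M$ and the rotation invariance
of the uniform measure on the sphere. Writing $J' = M^T J$, we have
$|J'| = |J|$ and
\[
c_{\D}(J) = P^{-1} \sum_{\mu=1}^P g\bigl((M^T J)^T \hxi^\mu\bigr)
          = P^{-1} \sum_{\mu=1}^P g\bigl({J'}^T \hxi^\mu\bigr),
\]
so that substituting $J \rightarrow J'$ in the integral for $Z_\D$
removes $M$ from the partition function.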
156:
157: I shall first consider the replica approach to this calculation and
158: for brevity assume that the contrast function is
159: $g(x) = x^3$. We are then immediately faced with the problem that
the moments $\La Z_\D^n \Ra_\D$ do not exist; indeed, $Z_\D$ does not
161: even have a mean
162: \footnote{In a sense, this problem already crops up for principal
163: component analysis where $g(x)=x^2$. Then $\La Z_\D^n \Ra_\D$
164: diverges, if $n$ or $\beta$ are large enough. So, using replicas, one
165: is in effect computing a continuation from small $\beta$ and large $n$
166: to large $\beta$ and small $n$.
167: }.
168: A second issue arises since $c_{\D}(J)$ is ${\cal
169: O}(N^{3/2}/P)$ for $J = \xi^\mu/|\xi^\mu|$. So, if we have just
170: $P = \alpha N$ examples, $\ln Z_\D$ is not an extensive quantity for
171: large $N$.
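
The estimate behind the second issue can be sketched as follows,
assuming only that the whitened patterns have $|\xi^\nu|^2 \approx N$
for large $N$. For $J = \xi^\nu/|\xi^\nu|$ the field of pattern $\nu$
is $J^T\xi^\nu = |\xi^\nu| \approx \sqrt{N}$, and for $g(x) = x^3$
this single pattern contributes
\[
\frac{1}{P}\, g(J^T\xi^\nu) \approx \frac{N^{3/2}}{P}
\quad\mbox{to } c_\D(J), \qquad\mbox{so}\qquad
\beta N c_{\D}(J) \approx \frac{\beta}{\alpha} N^{3/2}
\quad\mbox{for } P = \alpha N
\]
to leading order. The exponent of the Gibbs weight thus grows faster
than $N$.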
172:
173:
174: \newcommand{\KN}{K_{\!\scriptscriptstyle N}}
175: \newcommand{\LN}{L_{\scriptscriptstyle N}}
176: \newcommand{\gN}{g_{\scriptscriptstyle N}}
177:
178: To address the first problem, we introduce a cutoff $\KN > 0$, replacing
$g(x) = x^3$ by $\gN(x) = \min\{x^3,\KN^3\}$ in Eq. (\ref{contrast}).
180: Since we want to
181: ultimately recover the $g(x) = x^3$ case, we assume that $\KN$
182: diverges with increasing $N$.
183: Nevertheless, due to
184: the cutoff, the moments of $Z_\D$ now exist for any finite $N$.
185: Further, we assume that the training set has $P=\alpha \LN N$ and
186: not just $\alpha N$ patterns. Then, if $\LN$ diverges sufficiently quickly
187: w.r.t. $N$ and $\KN$, $\ln Z_\D$ will be an extensive quantity.
188: Finally, we should find that for the purpose of calculating $\ln
189: Z_\D$ for large $N$, choosing $K_N = \sqrt{N}$ is equivalent to not
190: cutting off at all. The reason for this quite simply is that
191: for $N\rightarrow\infty$
192: the fields $J^T \xi^\mu$ are bounded by $\sqrt{N}$ for
193: almost all training sets.
194:
195: In this setting, standard arguments yield the exact finite $N$ result
196: \begin{eqnarray*}
197: \La Z_\D^n \Ra_\D &=&
198: \lambda_{N,n}\!\! \int\!\! {\rm d}R{\rm d}Q
199: \det(Q\!-\! R R^T)^{\frac{N-n+1}{2}}
200: {\cal G}_{\scriptscriptstyle N} (R,Q)^N \\
201: {\cal G}_{\scriptscriptstyle N}(R,Q) &=&
202: \La \prod_{a=1}^n
\exp\left( \frac{\beta\min\{(R^a \xi_1 + X^a)^3,\KN^3\}}{\alpha L_N}
204: %\gN(R^a \xi_1 + X^a)
205: \right)
206: \Ra_{\xi_1,X}^{\alpha \LN}
207: \end{eqnarray*}
Here $R$ is an $n$-vector, $Q$ a symmetric $n$ by $n$ matrix with
209: $Q^{aa}=1$, and the domain of integration is such that the matrix
210: $Q - R R^T$ is positive definite.
211: The $X^a$ are zero mean Gaussian with covariances
212: $\La X^a X^b\Ra = Q^{ab} - R^a R^b$, and $\lambda_{N,n}$ is obtained using that
213: the moments equal $1$ for $\beta = 0$.
214: Now, given any sequence of cutoffs
215: $\KN$, we can certainly find $\LN$ so that
216: ${\cal G}_{\scriptscriptstyle N}(R,Q)$ stays
217: finite for large $N$. Then, we should be able to use Laplace's method
218: of the maximum point to find that in the large $N$ limit
219: \begin{equation}
220: \frac{1}{N}\ln\La Z_\D^n \Ra_\D \!=\! \sup_{R,Q}\,
221: \ln {\cal G}_N(R,Q) + \half \ln \det(Q\!-\! R R^T)\,. \label{lapl}
222: \end{equation}
223: But at this point, at the latest, it is clear that something is amiss.
224: The limiting value of the above RHS depends only on the
225: relative scalings of $K_N$ and $L_N$ and not on the relationship of
226: these scalings to the system size $N$.
So (\ref{lapl}) implies that the scale of the learning curve can be
228: {\em arbitrarily} stretched by using cutoffs which diverge quickly
229: with $N$. This problem arises regardless of assumptions about replica
230: symmetry.
231:
232: We proceed anyway and, using the replica symmetric
233: parameterization of (\ref{lapl}), find for $N\rightarrow\infty$
234: \begin{eqnarray}
235: \frac{1}{N}\La\ln Z_{\D} \Ra_\D
236: &=&
237: \sup_r \inf_q\,\, G_r(q,R) + G_s(q,r) \nonumber \\
238: G_r(q,R) &=&
239: \alpha L_N \La \!\ln\!\La \exp\left(
240: \frac{\beta}{\alpha L_N}\gN(r \xi_1 + \sqrt{q-r^2}y_0+\sqrt{1-q}y_1)
241: \right) \Ra_{\!\!y_1} \Ra_{\!\!\xi_1,y_0} \nonumber \\
242: G_s(q,r) &=& \half \frac{q-r^2}{1-q} + \half\ln(1-q) \label{rsZ}
243: \end{eqnarray}
244: where
$y_0,y_1$ are standard Gaussians, i.e. with zero mean and unit variance.
The extremal $r$
is just the typical value of the first component of a weight vector
picked from the Gibbs density and
measures the extent to which the structure in the data is recognized.
250: Using (\ref{rsZ}), we relate the scalings of
251: $\KN$ and $\LN$. For $\LN \gg \KN$ the energy term converges to
252: $G_r(q,R) = r^3 \La \xi_1^3 \Ra$. This is the limit of many
253: examples where $r=1$ for all $\alpha$. In contrast, for $\LN \ll \KN$
254: there are too few examples and $G_r(q,R)$ diverges.
255:
256: So, the scale of the learning curve is given by setting $\LN = \KN$.
257: On this scale,
258: we find that $G_r(q,R)$ converges to $r^3 \La \xi_1^3 \Ra$ as in the
259: limit of many examples if $q$ exceeds a critical value
260: $q_c(\alpha,\beta)$, whereas $G_r(q,R)$ diverges for $q
261: <q_c(\alpha,\beta)$. Solving the extremal problem for $q$ by taking the
262: limit $q\rightarrow q_c(\alpha,\beta)$ from above, then taking the
263: $\beta\rightarrow\infty$ limit, we finally find the
264: simple result for the
265: ground state:
266: $
267: c(\alpha)= \sup_r
268: r^3 \La \xi_1^3 \Ra_{\xi_1}+ (1-r^2)/\alpha. %\label{repfin}
269: $
270: Here $c(\alpha)$ is the typical value of the highest achievable
271: empirical contrast, $\max_{|J|=1} c_\D(J)$. The learning curve for $r$
272: thus obtained, is a step function showing a first order
273: phase transition at $\alpha_c = 1/\La \xi_1^3 \Ra_{\xi_1}$
274: from no learning ($r=0$) to perfect learning ($r=1$).
275: But the $r=0$ state is metastable for all values $\alpha >
276: \alpha_c$.
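
The step in the learning curve can be checked directly from the ground
state expression. The following sketch assumes
$\kappa = \La \xi_1^3 \Ra_{\xi_1} > 0$ and writes
$f(r) = r^3\kappa + (1-r^2)/\alpha$, where $\kappa$ and $f$ are
shorthand introduced only here. Then
\[
f'(r) = 3\kappa r^2 - \frac{2r}{\alpha} = 0
\quad\Longrightarrow\quad
r = 0 \;\;\mbox{or}\;\; r^* = \frac{2}{3\alpha\kappa}, \qquad
f''(0) = -\frac{2}{\alpha}, \quad f''(r^*) = +\frac{2}{\alpha}.
\]
So $r=0$ is a local maximum for every $\alpha$, the interior stationary
point $r^*$ is a minimum, and the supremum over $0\leq r\leq 1$
(negative $r$ only lowers $f$) is
$\max\{f(0),f(1)\} = \max\{1/\alpha,\kappa\}$, which switches from
$r=0$ to $r=1$ at $\alpha = 1/\kappa$. For the source used in Fig. 1,
$\hxi_1 = (y^2-1)/\sqrt 2$, the Gaussian moments $\La y^2\Ra = 1$,
$\La y^4\Ra = 3$, $\La y^6\Ra = 15$ give
$\kappa = (15 - 9 + 3 - 1)/(2\sqrt 2) = 2\sqrt 2$ and hence
$\alpha_c = 1/(2\sqrt 2) \approx 0.354$.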
277:
278:
279: \begin{figure}
280: \begin{tabular}{l}
281: \mbox{\begin{tabular}{l}
282: \includegraphics[scale=0.8]{cfig1.eps}
283: \end{tabular}}
284: \end{tabular}
285: \caption{
286: Prediction of $\KN=\sqrt{N}$ replica theory (bold line) compared to
simulation results. The non-Gaussian source is
288: $\hat\xi_1 =(y^2-1)/\sqrt 2$, where $y$ is a standard Gaussian.
289: The empty symbols show the results for the algorithm finding local
290: maxima of the empirical contrast. The full symbols, denoting results
291: for the iterated version of the procedure described in the main text,
292: show that the agreement with the replica theory improves quickly with
293: increasing system size $N$ for this algorithm.
294: The error bars estimate the standard deviation of the sample to sample
295: fluctuations.
296: }
297: \end{figure}
298:
299: The replica theory predicts that for any divergent sequence of
300: cutoffs $\KN$, e.g. $\KN = e^N$, we need $P > \alpha_c \KN N$ examples for
301: good generalization when $N$ is large.
302: While this is ridiculous, I have argued above
303: that choosing $\KN=\sqrt N$ is, for $N\rightarrow\infty$,
304: equivalent to not cutting off at all. To
305: compare the replica result for this choice of $\KN$ to
306: numerical simulations, let us consider
307: actually finding a weight vector maximizing $c_\D(J)$.
308: It turns out that a rather simple discrete dynamics can be used since
309: $g(x) = x^3$. Starting with a random
310: vector of unit length $J^0$, at the $k$-th time step we first compute the
311: matrix
312: $A(J^k) = \sum_{\mu=1}^P \xi^\mu ({J^k}^T \xi^\mu ) {\xi^\mu}^T$
313: and then choose $J^{k+1}$ to maximize
314: $|J^T A(J^k) J|$ under the constraint $|J|=1$.
315: So, $J^{k+1}$ is an
316: eigenvector to the eigenvalue of largest magnitude of $
317: A(J^k)$. Standard results on quadratic forms imply that
318: $|{J^{k+1}}^T A(J^k) J^{k+1}| \geq |{J^{k}}^T A(J^{k-1}) J^{k}|$,
319: and the inequality is strict unless we are at a fixed point.
320: Hence, the iteration converges to a vector $J^\infty$ which is a local
321: maximum or minimum of $c_\D(J)$. In the latter case, we just flip the
322: sign of $J^\infty$ to obtain a local maximum.
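
The link between the iteration and the contrast can be made explicit;
this is a sketch for $g(x) = x^3$ using only the definition of $A(J)$
above. One has
\[
J^T A(J) J = \sum_{\mu=1}^P (J^T\xi^\mu)^3 = P\, c_\D(J),
\qquad
\nabla_{\!J}\, c_\D(J) = \frac{3}{P}\sum_{\mu=1}^P (J^T\xi^\mu)^2\, \xi^\mu
= \frac{3}{P}\, A(J) J .
\]
A fixed point of the iteration satisfies $A(J)J = \lambda J$ for some
eigenvalue $\lambda$, so the gradient of $c_\D$ is parallel to $J$,
which is the stationarity condition for $c_\D$ on the sphere $|J|=1$.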
323:
324: Simulation results for the procedure, compared to the $\KN =
325: \sqrt{N}$ replica theory in Fig. 1, show that the performance of
326: the algorithm is rather poor. This is in line with the
327: theoretical findings, since these predict that $r=0$ is
328: metastable, and the algorithm is only finding a local maximum. Figure 1
also shows results for an iterated version of the algorithm. There the
algorithm is rerun with $m=0.1N$ different random initial conditions,
and the weight vector maximizing $c_\D(J)$ among the $m$ outcomes is
chosen. These results are in good agreement with the $\KN =
333: \sqrt{N}$ replica theory, indicating that beyond the phase transition the
334: basin of attraction of the global maximum is quite large.
335:
Even if the simulations indicate
that the replica approach can be rescued by
plugging in, at the end, the correct scaling of the cutoff $\KN$,
the theoretical situation is highly unsatisfactory.
I shall next show that a physically
reasonable analysis can be provided by adapting the cavity method.
This is much simplified if we make some major
343: changes to the notation. From now on the non-Gaussian source will be
344: denoted by $\gamma$, whereas all of the $N$ components of $\xi$ are
345: assumed independent standard Gaussian. Our primary goal is to calculate
346: the typical value of $C_r = \max_{|J|=1} C_r(J)$ with
347: \begin{equation}
348: C_r(J) = \frac{1}{P}\sum_{\mu=1}^P g(r \gamma^\mu + \sqrt{1-r^2} J^T
349: \xi^\mu)
350: \label{orig}
351: \end{equation}
where $J$ is an $N$-dimensional vector. So $C_r$ is the maximal value of
the empirical contrast achievable on an $r$-shell. For generality, we
shall no longer assume that $g(x)$ must be cubic but consider any
super-quadratic function which does not diverge too quickly.
356: In particular, for some $k>0$,
357: $
358: \lim_{x\rightarrow\infty}{g(x)}/{x^{2+k}} = \psi
359: $
360: should exist and be positive. Without loss of generality, we may then
361: assume $\psi=1$.
362:
363: We still have $P=\alpha \LN N$
364: examples and consider the random variable $J_\D$ with the Gibbs
365: density
366: \begin{eqnarray}
367: p_\D(J) &=& \frac{1}{Z_\D(\beta)}
368: \frac{e^{-\half |J|^2}}{(2 \pi)^{\half N}}
369: \prod_{\mu=1}^P
370: e^{\frac{\beta}{\LN}
371: g(\gamma^\mu,[J]^T\xi^\mu)}
372: \nonumber \\
373: g(\gamma^\mu,[J]^T\xi^\mu) &=&
374: g(r \gamma^\mu + \sqrt{1-r^2} [J]^T\xi^\mu)\,. \label{GD}
375: \end{eqnarray}
376: Here $[J] = J/|J|$ and $Z_\D(\beta)$ is given
377: by the normalization $\int \!{\rm d}J\, p_\D(J) =1$. Note, that we are now
378: using a factorizing Gaussian prior on $J$ and, to compensate for this, the
379: normalized vector $[J]$ is used to calculate the field in (\ref{GD}).
380:
A key task in the cavity approach is to obtain the field distribution by
calculating the thermal average
$\La \phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D}$ for any function
$\phi$. One finds
385: \begin{eqnarray}
\La\phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} &=&
387: \frac{Z_{\D/\mu}(\beta)}{Z_\D(\beta)}
388: \La e^{\frac{\beta}{\LN}
389: g(\gamma^\mu,[J_{\D/\mu}]^T\xi^\mu)}
390: \phi(\gamma^\mu,[J_{\D/\mu}]^T\xi^\mu) \Ra_{J_{\D/\mu}},
391: \label{cav}
392: \end{eqnarray}
393: where $J_{\D/\mu}$ is the random variable with the Gibbs density obtained
394: when pattern $\mu$ is removed from the system, i.e.
395: omitting the $\mu$-th factor
396: of the product in (\ref{GD}) and adjusting the partition function to
397: $Z_{\D/\mu}(\beta)$.
398: The variance of the cavity field
399: $[J_{\D/\mu}]^T\xi^\mu$ is a self averaging quantity and it must then
400: equal $1-q$ for large $N$, where
401: $q = |\La [J_{\D/\mu}] \Ra_{J_{\D/\mu}}|^2$. Normally, one would further argue
402: that $[J_{\D/\mu}]^T\xi^\mu$ becomes Gaussian in the thermodynamic limit.
403: But if we assume this,
404: the $J_{\D/\mu}$ average in (\ref{cav}) diverges even when
405: $\phi$ is a simple bounded function.
406: This highlights the fact that the cavity field is not Gaussian in the large
407: deviations regime because
408: $[J_{\D/\mu}]^T\xi^\mu$ cannot be larger than $|\xi^\mu|$.
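
The divergence is easy to see. As a schematic sketch, take $\phi = 1$,
ignore the dependence on $\gamma^\mu$ and on the mean of the field, and
use $g(x)\sim x^{2+k}$ for large $x$. A Gaussian cavity field $h$ of
variance $1-q$ would give
\[
\int\! {\rm d}h\,
\exp\left(-\frac{h^2}{2(1-q)} + \frac{\beta}{\LN}\, h^{2+k}\right) = \infty
\]
for any $k>0$ and any finite $\LN$, since the energy term eventually
outgrows the quadratic Gaussian exponent. Restricting the field to
$|h| \leq |\xi^\mu| \approx \sqrt{N}$ removes the divergence at finite $N$.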
409:
410:
411: Hence, I rephrase the cavity argument as follows: For the purpose of
412: calculating overlaps with a random vector such as $\xi^\mu$,
the unnormalized $J_{\D/\mu}$ can, for large $N$, be treated as
Gaussian (with covariance matrix $(1-q)\mathbf 1$).
415: Then, the fluctuations of the cavity field obtained using
416: the normalized $[J_{\D/\mu}]$,
417: \[
418: P_{N,q}(h) = \La
419: \delta\left(h -
420: \left([J_{\D/\mu}]^T-\La[ J_{\D/\mu}]^T\Ra_{J_{\D/\mu}}\right)\xi^\mu
421: \right) \Ra_{J_{\D/\mu}}
422: \]
423: can be explicitly calculated.
424: This yields the
425: important fact that there are just two relevant scales for the cavity
426: fluctuations.
427: For large $N$,
428: $P_{N,q}(h)$ converges to
429: $e^{-\half h^2/(1-q)}/\sqrt{2 \pi (1-q)}$
430: if $h \ll \sqrt{N}$, but in the large deviations regime, for
431: $h = d \sqrt{N}$,
432: \begin{equation}
\lim_{N\rightarrow\infty} N^{-1}\ln P_{N,q}(d \sqrt{N}) =
434: -\half \frac{ q d^2}{1-q} + \half\ln(1-d^2)
435: \label{ldev}
436: \end{equation}
437: if $0\leq d\leq1$.
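
As a consistency check (a sketch, not part of the derivation), the two
regimes match for small $d$. Expanding $\ln(1-d^2) \approx -d^2$ in
Eq. (\ref{ldev}) gives
\[
-\half \frac{q d^2}{1-q} - \half d^2 = -\half \frac{d^2}{1-q},
\]
which is the exponent $-\half h^2/(1-q)$ of the Gaussian regime,
divided by $N$ and evaluated at $h = d\sqrt{N}$.
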
438: Now, in terms of the functional
439: \[
440: {\cal L}^{q,\beta}_{y,\gamma}(\phi) =
441: \int_{-\sqrt{N}}^{\sqrt{N}}
442: {\rm d}h\, P_{N,q}(h)\,\phi(\gamma,\sqrt{q}y+h)\,
443: e^{\frac{\beta}{\LN} g(\gamma,\sqrt{q}y+h)}
444: \]
445: the average in Eq. (\ref{cav}) can in the limit of large $N$ be rewritten as
$\La\phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} =
447: {\cal L}^{q,\beta}_{y^\mu,\gamma^\mu}(\phi)/
448: {\cal L}^{q,\beta}_{y^\mu,\gamma^\mu}(1)$
449: with $y^\mu = q^{-\half}\La[ J_{\D/\mu}]\Ra_{J_{\D/\mu}}^T\xi^\mu$. So the
450: quenched averages are
451: \begin{eqnarray}
\La \La \phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} \Ra_\D
453: &=& \La
454: \frac{{\cal L}^{q,\beta}_{y,\gamma}(\phi)}
455: {{\cal L}^{q,\beta}_{y,\gamma}(1)} \Ra_{y,\gamma} \label{qav} \\
456: \La \ln Z_\D(\beta) - \ln Z_{\D/\mu}(\beta) \Ra_\D &=&
457: \La \ln {\cal L}^{q,\beta}_{y,\gamma}(1) \Ra_{y,\gamma}
458: \label{qav1}
459: \end{eqnarray}
460: where $y$ is standard Gaussian. The last equation is
461: obtained by setting $\phi =1$ in (\ref{cav}).
462:
463: We can now consider whether the large deviations regime contributes to
464: the averages in (\ref{qav}) for a polynomially bounded
465: $\phi$. Using that for large arguments $g(x) \sim x^{2+k}$ and
466: referring to Eq. (\ref{ldev}), we find that it
467: will contribute if the maximum of
468: \begin{equation}
469: u(d) =
470: \beta d^{k+2}\frac{N^{\half k}}{\LN}
471: - \half \frac{ q d^2}{1-q} +
472: \half\ln(1-d^2)
473: \label{reldev}
474: \end{equation}
is positive for large $N$. This does not happen if
$\LN \gg N^{\half k}$, and
Eq. (\ref{qav}) then implies that
$\La \La \phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} \Ra_\D =
479: \La \phi(\gamma,y) \Ra_{y,\gamma}$. The empirical mean equals
480: the expectation value and so the learning curve is trivial.
481: Henceforth, we focus on the relevant scale, setting
482: $\LN = N^{\half k}$.
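
The origin of this scale can be read off by counting powers of $N$; a
sketch, using $g(x)\sim x^{2+k}$ and suppressing, as
Eq. (\ref{reldev}) does, any $r$-dependent prefactor. At
$h = d\sqrt{N}$ the energy term in the exponent of
${\cal L}^{q,\beta}_{y,\gamma}$ is
\[
\frac{\beta}{\LN}\,(d\sqrt{N})^{2+k}
= N\,\beta d^{2+k}\,\frac{N^{\half k}}{\LN},
\]
which is of the same order $N$ as the entropic cost in
Eq. (\ref{ldev}) exactly when $\LN \propto N^{\half k}$. For the cubic
contrast, $k=1$, this gives $\LN = \sqrt{N}$ and
$P = \alpha N^{3/2}$ examples, the scale found above by setting
$\LN = \KN = \sqrt{N}$ in the replica calculation.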
483:
484: Our next task is to calculate the response when a new coupling $J_0$ is
485: added to the system and each pattern $\xi^\mu$ is augmented by
486: a new component $\xi_0^\mu$. We denote the augmented training set by
487: $\hat\D$ and use (\ref{GD}) to define the partition function
488: $Z_{\hat\D}(\beta)$ of the $N+1$ dimensional system.
489: Due to the $N$-dependence of the Gibbs weight
490: $e^{\frac{\beta}{\LN}g(\gamma^\mu,[J]^T\xi^\mu)}$, it is simplest
491: to assume a slightly different temperature
492: $\hat\beta_N = \beta L_{\scriptscriptstyle N+1}/\LN$
493: in the augmented system. Then,
494: when considering the ratio $Z_{\hat\D}(\hat\beta_N)/Z_{\D}(\beta)$,
495: the two systems have the same Gibbs weight per pattern.
496: Standard arguments \cite{Mez89} thus apply and yield that
497: $
498: \sLa \ln Z_{\hat\D}(\hat\beta_N)/Z_\D(\beta) \sRa_{\hat\D}
= G_s(q,0)
$ for large $N$.
Here $G_s(q,0)$ is the entropy term of the
replica theory, Eq. (\ref{rsZ}), but evaluated at $r=0$ because we are
503: calculating the partition function for each $r$-shell individually.
504:
505: Having identified, via $\LN=
506: N^{\half k}$, the scale of the learning curve,
$N^{-1}\La \ln Z_\D(\beta) \Ra_\D$ will
508: converge to a finite quantity $z(\alpha,\beta)$ in the thermodynamic limit.
509: We then have
510: %
511: \newcommand{\pdev}[1]{ \frac{\partial\,\,}{\partial #1} }
512:
513: \begin{eqnarray*}
514: \sLa \ln Z_{\hat\D}(\hat\beta_N)/Z_\D(\beta) \sRa_{\hat\D} &=&
515: z(\alpha,\beta) -
516: \alpha \frac{k+2}{2}\pdev{\alpha}{z(\alpha,\beta)} +
517: \frac{\beta k}{2} \pdev{\beta}{z(\alpha,\beta)}.
518: \end{eqnarray*}
519: The derivative of $z$ with respect to $\alpha$ is obtained from
520: Eq. (\ref{qav1}), and the thermal derivative is found
521: from (\ref{qav}) using $\phi =g$.
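
The identity can be checked by a first order expansion. The following
sketch assumes $\LN = N^{\half k}$, so that $P = \alpha N^{1+\half k}$,
and that $\La \ln Z_\D(\beta)\Ra \approx N z(\alpha,\beta)$. Keeping $P$
fixed while $N \rightarrow N+1$ changes the load and the temperature to
\[
\alpha' = \alpha \left(\frac{N}{N+1}\right)^{1+\half k}
\approx \alpha\left(1 - \frac{k+2}{2N}\right), \qquad
\hat\beta_N = \beta \left(\frac{N+1}{N}\right)^{\half k}
\approx \beta\left(1 + \frac{k}{2N}\right),
\]
and expanding $(N+1)\,z(\alpha',\hat\beta_N) - N z(\alpha,\beta)$ to
order $N^0$ reproduces the three terms on the right hand side.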
522:
523: Putting things together, we finally find for large $N$
524: \begin{eqnarray}
525: z(\alpha,\beta) &=&
526: \La \alpha\frac{k+2}{2} N^{\half k} \ln {\cal L}^{q,\beta}_{y,\gamma}(1)
- \frac{\alpha \beta k}{2} \frac{{\cal L}^{q,\beta}_{y,\gamma}(g)}
528: {{\cal L}^{q,\beta}_{y,\gamma}(1)}
529: \Ra_{y,\gamma}\!\!
530: + G_s(q,0)\,, \label{zfunc}
531: \end{eqnarray}
532: where the value of $q$ still has to be determined.
533:
534: For this, let us reconsider when the large deviations regime
535: contributes to the value of ${\cal L}^{q,\beta}_{y,\gamma}(1)$. Going back
536: to Eq. (\ref{reldev}), with $\LN = N^{\half k}$,
537: we see that as in the replica theory this is governed by a critical
538: value $q_{\rm c}(\beta)$ of $q$.
539: For $q < q_{\rm c}(\beta)$, $\max_d u(d)$ is positive in the large $N$ limit,
540: so (\ref{zfunc}) diverges.
541: The possible range for $q$ is thus $q_{\rm c}(\beta) \leq q \leq 1$.
542: But, if we assume $q > q_{\rm c}(\beta)$, the large $N$ limit yields the
543: very simple result
544: $
z(\alpha,\beta) = G_s(q,0) + \alpha \beta \La g(\gamma,y) \Ra_{\gamma,y}
$.
Now, on the one hand, the empirical contrast is found by
differentiating $z(\alpha,\beta)$ w.r.t. $\beta$ and dividing by $\alpha$. This yields
$\La g(\gamma,y) \Ra_{\gamma,y} + \frac{1}{\alpha}G'_s(q)\pdev{\beta}q$,
where $G'_s(q)$ is the $q$-derivative of $G_s(q,0)$.
550: But computing the same quantity using (\ref{qav}) yields
551: $\La g(\gamma,y) \Ra_{\gamma,y}$. So $q$ must stay constant when $\beta$
552: varies, but this is impossible since $q_{\rm c}(\beta)\rightarrow 1$ for
553: $\beta\rightarrow\infty$.
554:
555: Hence, the only possible value for $q$ is $q_{\rm
556: c}(\beta)$.
557: Evaluating (\ref{zfunc}) by taking the limit $q\rightarrow q_{\rm
558: c}(\beta)$ from above, leads to the same result as in the $\KN =
559: \sqrt{N}$ replica theory. But, of course, this has the same
560: inconsistencies as found for the $q > q_{\rm c}(\beta)$ assumption.
561: It also makes no physical sense to use (\ref{zfunc})
562: at the point of discontinuity since the cavity derivation neglects
563: fluctuations of $q$. Even if these vanish with increasing $N$, at the point
564: of discontinuity, $q=q_{\rm c}(\beta)$, the true result will
565: nevertheless depend on the unknown fluctuations.
566:
567: But some conclusions can be drawn, knowing that $q$ has the
568: critical value. Let $d_\beta$ be the unique positive value such that
569: $u(d_\beta) =0$ for $q=q_{\rm c}(\beta)$. Then arguments analogous
570: to the derivation of (\ref{qav}) show that the probability of the
posterior field $[J_\D]^T\xi^\mu$ exceeding $d\sqrt{N}$ is {\em not}
572: exponentially small if $d$ is lower than $d_\beta$.
573: More precisely, one finds for
574: $N\rightarrow\infty$ and $d < d_\beta$
575: \begin{eqnarray*}
\La N^{-1}\ln\sLa \Theta([J_\D]^T\xi^\mu - d\sqrt{N}) \sRa_{J_\D} \Ra_\D
577: &=& \\
578: \La N^{-1}\ln
579: {{\cal L}^{q,\beta}_{y,\gamma}(\Theta(h - d\sqrt{N}))}/
580: {{\cal L}^{q,\beta}_{y,\gamma}(1)} \Ra_{y,\gamma} &=& 0\,.
581: \end{eqnarray*}
582: Further, $d_\beta$ approaches $1$ with increasing $\beta$. But this is
583: only possible if simply aligning the weight vector with the pattern $\xi^\mu$
maximizes the empirical contrast, at least up to sub-extensive corrections. So,
in the notation of Eq. (\ref{orig}), we have $C_r = C_r([\xi^\mu ])$
586: for large $N$, and thus finally
587: \begin{equation}
588: C_r =
589: (1-r^2)^{\frac{2+k}{2}}/\alpha + \La g(r \gamma + \sqrt{1-r^2}\,y)
590: \Ra_{\gamma,y}\,.
591: \label{final}
592: \end{equation}
593: Maximizing this in $r$, the same learning curve is obtained for
594: the cubic case, $g(x)=x^3$, as in
595: the $\KN=\sqrt N$ replica theory
596: %
597: \footnote{
598: For $g(x)=x^4$, the curve depends on whether $\sLa \gamma^4
599: \sRa_\gamma > 3$, since the fourth moment of a standard Gaussian is
600: $3$. If so, the value of $r$ jumps from $0$ to $1$ at
601: $\alpha_c = 1/(\sLa \gamma^4\sRa_\gamma - 3)$. The
602: $\sLa \gamma^4\sRa_\gamma < 3$ case, where one will use $g(x)=-x^4$,
603: shall be described elsewhere. It
604: is much simpler since the large deviations regime does not contribute.}.
605: %
606: It is important to note that we have in essence just used the standard
607: weak correlation assumptions of the cavity method in deriving (\ref{final}).
608: In view of the good agreement with numerical simulations (Fig. 1),
609: this strongly suggests that the cavity result is indeed exact in the
610: thermodynamic limit.
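
For the two polynomial contrasts discussed, the transition point
follows from evaluating Eq. (\ref{final}) at the two ends of the $r$
interval; a sketch, relying on the statement above that $r$ jumps from
$0$ to $1$:
\[
C_0 = \frac{1}{\alpha} + \La g(y)\Ra_y, \qquad
C_1 = \La g(\gamma)\Ra_\gamma, \qquad
C_0 = C_1 \;\Longrightarrow\;
\alpha_c = \frac{1}{\La g(\gamma)\Ra_\gamma - \La g(y)\Ra_y}.
\]
With $\La y^3\Ra = 0$ and $\La y^4\Ra = 3$ this gives
$\alpha_c = 1/\sLa\gamma^3\sRa_\gamma$ for $g(x) = x^3$ and
$\alpha_c = 1/(\sLa\gamma^4\sRa_\gamma - 3)$ for $g(x) = x^4$, in
agreement with the values quoted in the text.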
611:
612: From an analytical point of view, it is intriguing that the present
613: problem reveals a difference in the scope of the replica and the
614: cavity method. The latter can be transparently adapted to take
615: into account that the cavity field is not Gaussian in the large
616: deviations regime. But, commuting the thermal average with the disorder
617: average, at the expense of considering moments, is part and parcel
618: of using replicas. As a consequence, all the relevant fields
619: become truly Gaussian. This points to implicit assumptions in the
620: replica method, which need to be taken care of in any program to put
621: the approach on a solid mathematical footing \cite{Par02}.
622:
623: \acknowledgements
624:
625: It is a pleasure to acknowledge many discussions with Manfred Opper.
626: This work was supported by the Deutsche Forschungsgemeinschaft.
627:
628: \bibliographystyle{unsrt}
629: \bibliography{/home/robert/tex/neural}
630:
631: \end{document}