cond-mat0309484/ica.tex
1: \documentclass{epl}
2: \usepackage{amssymb,graphicx}
3: 
4: 
5: \newcommand{\R}{{\mathbb R}}
6: \newcommand{\sign}{ \mbox{\rm sign} }
7: \newcommand{\ext}{ \mbox{\rm extr} }
8: \newcommand{\exta}[1]
9:      {{\renewcommand{\arraystretch}{0.75} \begin{array}[t]{c} 
10:        \ext \\ {\scriptstyle #1}
11:      \end{array}}}
12: \newcommand{\maxa}[1]
13:      {{\renewcommand{\arraystretch}{0.75} \begin{array}[t]{c} 
14:        \max \\ {\scriptstyle #1}
15:      \end{array}}}
16: \newcommand{\eff}{{ \mbox{\rm e}} }
17: \newcommand{\smp}{ {\mbox{\rm\scriptsize smp}}}
18: \newcommand{\ens}{ {\mbox{\rm\scriptsize ens}}}
19: \newcommand{\di}{\mbox{\rm d}}
20: \newcommand{\half}{{\frac{1}{2}}}
21: \renewcommand{\#}{\displaystyle}
22: \newcommand{\halpha}{\hat{\alpha}}
23: \newcommand{\La}{\left\langle}
24: \newcommand{\Ra}{\right\rangle}
25: \newcommand{\sLa}{\langle}
26: \newcommand{\sRa}{\rangle}
27: \newcommand{\cut}[1]{}
28: \newcommand{\n}{{\scriptscriptstyle \! N}}
29: \newcommand{\hxi}{{\hat\xi}}
30: 
31: \newcommand{\ppreprint}{
32: \textheight 1.2\textheight
33:   \textwidth  1.2\textwidth
34:   \oddsidemargin  -0.5cm
35:   \evensidemargin -0.5cm
36:   \topmargin -0.5cm
37:   \baselineskip 1.8\baselineskip}
38: 
39: %\ppreprint
40: 
41: \title{ Statistical physics of independent component analysis
42:       }
43: 
44: \author{R. Urbanczik} 
45:   \institute{
46:         Institut f\"ur theoretische Physik -
47:         Universit\"at W\"urzburg,
48:         Am Hubland,
49:         D-97074 W\"urzburg,
50:         Germany       }
51: 
52: \pacs{89.75.Fb}{Structures and organization in complex systems}
53: \pacs{84.35.+i}{Neural networks}
54: \pacs{64.60.Cn}
55:      {Order disorder transitions; statistical mechanics of model systems} 
56: 
57: \begin{document}\maketitle
58: 
59: 
60: \begin{abstract}
61: Statistical physics is used to investigate independent component
62: analysis with polynomial contrast functions. While the replica method
63: fails,  an adapted cavity approach
64: yields valid results. The learning curves, obtained in a suitable
65: thermodynamic limit, display a
66: first order phase transition from poor to perfect generalization.
67: \end{abstract}
68: 
69: 
70: 
71: 
72: 
73: 
74: \newcommand{\D}{{\mathbb D}}
75: 
76: During the last decade, independent component analysis (ICA) has emerged as
77: one of the most powerful unsupervised learning procedures for many
78: signal processing tasks \cite{Hyv01,Cic02}. It assumes that the observed, 
79: often high-dimensional, signal is a linear mixture of {\em independent}
80: source signals and aims to recover these sources just from
81: observing the mixed up signal. Hence, ICA is sometimes also
82: called blind signal deconvolution. An illustrative scenario is the
83: cocktail party problem where, to understand any single speaker, we first
84: need to identify her voice amidst the jumble of sounds reaching our
85: ears. 
86: 
87: The basic finding in ICA is that the distribution of the observed
88: signal will be similar to a Gaussian, especially when
89: many independent sources contribute to the linear mixture. The source
90: signals, however, will often be highly structured, and
91: non-Gaussian. ICA thus searches for a linear transformation of the 
92: observations which maximizes non-Gaussianity by evaluating a suitable
93: contrast function. To detect non-Gaussianity, the 
94: contrast function used must compute a higher-than-quadratic statistic of the 
95: transformed data.
96: 
97: In a principled way, ICA can be derived by considering the mutual
98: information of the transformed data, which is a natural measure of statistical 
99: dependence. To avoid the problem of density estimation, which 
100: arises in a direct evaluation of the mutual information, one then uses 
101: expansions (Edgeworth, Gram-Charlier) around Gaussianity to
102: approximate the mutual information \cite{Com94,Ama95}. 
103: This leads to  contrast
104: functions which are related to the higher order cumulants of
105: the transformed data.  
106: 
107: This Letter provides a first analysis of ICA for
108: polynomial  contrast functions using the
109: statistical physics of disordered systems.
110: Surprisingly,
111: the replica method, one of the most powerful tools in analyzing
112: quenched disorder, fails since it cannot  control the contributions to
113: the contrast function in the large deviations regime. However, a
114: physically valid analysis is obtained by adapting the cavity
115: method, showing that the scale of the learning curve depends on the
116: degree of  the polynomial. Unusually, for a system with continuous couplings,
117: the curve itself is a step function, jumping from poor to perfect 
118: generalization. But a badly generalizing state is always
119: metastable, and it is remarkable that we can nevertheless find polynomial-time
120: algorithms which generalize well.
121: 
122: In formal terms, we assume that the
123: observable signal $\xi$ can be written as $\xi = M\hxi$, where 
124: the source $\hxi$ is an $N$-dimensional  random variable with 
125: independent components and $M$ is the $N$ by $N$ mixing matrix.
126: Learning is based on a training set $\D$ of $P$ independent
127: observations  $\xi^\mu$
128: of the signal $\xi$, obtained for a fixed, if unknown, mixing matrix $M$.
129:  The deconvolution problem (finding $\hxi$)
130: can be decomposed by first finding just one independent component,
131: subtracting it from the mixture, and reapplying the procedure to the
132: remaining $N-1$ dimensional task. Hence, I shall just deal with
133: finding the first  component $\hxi_1$ and assume that it is non-Gaussian 
134: whereas all other components of $\hxi$ are Gaussian. 
135: 
136: Normally, the first step in ICA is to whiten the data, so that it has
137: zero mean and its covariance matrix is the identity. So, I shall
138: further assume that the source components have zero mean and unit
139: variance and that $M$ is orthogonal, $M^TM = \mathbf 1$. In short, the
140: ICA task now is to find, based on the training set $\D$, a vector $J$
141: such that $J^T\xi = \pm\hxi_1$. For this,
142: one picks a suitable non-quadratic contrast function $g$, computes the
143: empirical contrast
144: \begin{equation}
145: c_{\D}(J) = P^{-1} \sum_{\mu=1}^P g(J^T M\hxi^\mu), \label{contrast}
146: \end{equation}
147: and  chooses $J$ to maximize $c_{\D}(J)$ under the constraint $|J|=1$.
148: To analyze this problem,  one will
149: first  consider the Gibbs weight 
150: $\exp(\beta N c_{\D}(J))$ at some finite inverse temperature $\beta$
151: and calculate the typical value of the logarithm of its partition function 
152: $Z_\D =  \int {\rm d}J \exp(\beta N c_{\D}(J))$, where the integration
153: is over the uniform density on the unit sphere in $\R^N$. Since, via a
154: orthogonal gauge transformation of $J$, the partition function is independent of the mixing matrix $M$,
155: we set $M= \mathbf 1$ for the analysis. 
156:  
157: I shall first consider the replica approach to this calculation and
158: for brevity assume that the contrast function is 
159: $g(x) = x^3$. We are then immediately faced with the problem that
160: the moments $\La Z_\D^n \Ra_\D$ do not exist; indeed, $Z_\D$ does not
161: even have a mean 
162: \footnote{In a sense, this problem already crops up for principal
163: component analysis where $g(x)=x^2$. Then $\La Z_\D^n \Ra_\D$
164: diverges if $n$ or $\beta$ is large enough. So, using replicas, one
165: is in effect computing a continuation from small $\beta$ and large $n$
166: to large $\beta$ and small $n$.
167: }.
168: A second issue arises since $c_{\D}(J)$ is ${\cal
169: O}(N^{3/2}/P)$ for $J = \xi^\mu/|\xi^\mu|$. So, if we have just 
170: $P = \alpha N$ examples, $\ln Z_\D$ is not an extensive quantity for
171: large $N$.
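
As a rough check of this estimate, and assuming only that $|\xi^\mu|^2 \approx N$ for whitened data, aligning $J$ with a single pattern gives, to leading order,
\[
N c_{\D}\bigl(\xi^\mu/|\xi^\mu|\bigr) \approx \frac{N}{P}\,|\xi^\mu|^3 \approx \frac{N^{5/2}}{P}\,,
\]
since the remaining $P-1$ patterns contribute at most ${\cal O}(1)$ to $c_\D$. For $P=\alpha N$ the exponent of the Gibbs weight is thus of order $N^{3/2}/\alpha$, which grows faster than $N$; it would become extensive only if $P$ were of order $N^{3/2}$.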
172: 
173: 
174: \newcommand{\KN}{K_{\!\scriptscriptstyle N}}
175: \newcommand{\LN}{L_{\scriptscriptstyle N}}
176: \newcommand{\gN}{g_{\scriptscriptstyle N}}
177: 
178: To address the first problem, we introduce a cutoff $\KN > 0$, replacing 
179: $g(x) = x^3$ by $\gN(x) = \min\{x^3,\KN^3\}$ in Eq. (\ref{contrast}). 
180: Since we want to
181: ultimately recover the $g(x) = x^3$ case, we assume that $\KN$
182: diverges with increasing $N$. 
183: Nevertheless, due to
184: the cutoff, the moments of  $Z_\D$ now exist for any finite $N$.
185: Further, we assume that the training set has $P=\alpha \LN N$ and
186: not just $\alpha N$ patterns. Then, if $\LN$ diverges sufficiently quickly
187: w.r.t. $N$ and $\KN$,  $\ln Z_\D$ will be an extensive quantity.
188: Finally, we should find that for the purpose of calculating  $\ln
189: Z_\D$ for large $N$, choosing $K_N = \sqrt{N}$ is equivalent to not
190: cutting off at all. The reason for this quite simply is that 
191: for $N\rightarrow\infty$
192: the fields $J^T \xi^\mu$ are bounded by $\sqrt{N}$ for
193: almost all training sets.  
194: 
195: In this setting, standard arguments yield the exact finite $N$ result 
196: \begin{eqnarray*}
197: \La Z_\D^n \Ra_\D &=& 
198: \lambda_{N,n}\!\! \int\!\! {\rm d}R{\rm d}Q 
199:   \det(Q\!-\! R R^T)^{\frac{N-n+1}{2}}
200: {\cal G}_{\scriptscriptstyle N} (R,Q)^N \\
201: {\cal G}_{\scriptscriptstyle N}(R,Q) &=& 
202: \La \prod_{a=1}^n
203:   \exp\left( \frac{\beta\min\{(R^a \xi_1 + X^a)^3,\KN^3\}}{\alpha \LN} 
204:   %\gN(R^a \xi_1 + X^a)
205: \right)
206:   \Ra_{\xi_1,X}^{\alpha \LN}  
207: \end{eqnarray*}
208: Here $R$ is an $n$-vector, $Q$ a symmetric $n$ by $n$ matrix with 
209: $Q^{aa}=1$, and the domain of integration is such that the matrix 
210: $Q - R R^T$ is positive definite.
211: The $X^a$ are zero mean Gaussian with covariances 
212: $\La X^a X^b\Ra = Q^{ab} - R^a R^b$, and $\lambda_{N,n}$ is obtained using that
213: the moments equal $1$  for $\beta = 0$.
214: Now, given any sequence of cutoffs
215: $\KN$, we can certainly find $\LN$ so that 
216: ${\cal G}_{\scriptscriptstyle N}(R,Q)$ stays
217: finite for large $N$. Then, we should be able to use Laplace's method
218: of the maximum point to find that in the large $N$ limit
219: \begin{equation}
220: \frac{1}{N}\ln\La Z_\D^n \Ra_\D \!=\! \sup_{R,Q}\, 
221: \ln {\cal G}_N(R,Q) + \half \ln \det(Q\!-\! R R^T)\,. \label{lapl}
222: \end{equation}
223: But at this point, at the latest, it is clear that something is amiss.
224: The limiting value of the above RHS depends only on the
225: relative scalings of $K_N$ and $L_N$ and not on the relationship of
226: these scalings to the system size $N$. 
227: So (\ref{lapl}) implies that the scale of the learning curve can be
228: {\em arbitrarily} stretched by using cutoffs which diverge quickly
229: with $N$. This problem arises regardless of assumptions about replica
230: symmetry.
231: 
232: We proceed anyway and, using the replica symmetric
233: parameterization of (\ref{lapl}), find for $N\rightarrow\infty$
234: \begin{eqnarray}
235: \frac{1}{N}\La\ln Z_{\D} \Ra_\D
236: &=& 
237: \sup_r \inf_q\,\, G_r(q,R) + G_s(q,r) \nonumber \\
238: G_r(q,R) &=&
239: \alpha L_N \La \!\ln\!\La \exp\left(
240:     \frac{\beta}{\alpha L_N}\gN(r \xi_1 + \sqrt{q-r^2}y_0+\sqrt{1-q}y_1) 
241:        \right)  \Ra_{\!\!y_1} \Ra_{\!\!\xi_1,y_0} \nonumber \\  
242: G_s(q,r) &=& \half \frac{q-r^2}{1-q} + \half\ln(1-q) \label{rsZ} 
243: \end{eqnarray}
244: where 
245: $y_0,y_1$ are standard Gaussians, i.e. with zero mean and unit variance.
246: The extremal $r$ 
247: is just the typical value of the first component of a weight vector
248: picked from the Gibbs density and
249: measures the extent to which the structure in the data is recognized.
250: Using (\ref{rsZ}), we relate the scalings of
251: $\KN$ and $\LN$. For $\LN \gg \KN$ the energy term converges to 
252: $G_r(q,R) = r^3 \La \xi_1^3 \Ra$. This is the limit of many
253: examples where $r=1$ for all $\alpha$. In contrast, for $\LN \ll \KN$
254: there are too few examples and  $G_r(q,R)$ diverges.
255: 
256: So, the scale of the learning curve is given by setting $\LN = \KN$.
257: On this scale, 
258: we find that  $G_r(q,R)$ converges to $r^3 \La \xi_1^3 \Ra$ as in the
259: limit of many examples if $q$ exceeds a critical value
260: $q_c(\alpha,\beta)$,  whereas $G_r(q,R)$ diverges for $q
261: <q_c(\alpha,\beta)$. Solving the extremal problem for $q$ by taking the
262: limit $q\rightarrow q_c(\alpha,\beta)$ from above, then taking the 
263: $\beta\rightarrow\infty$ limit, we finally find the
264: simple result for the
265: ground state:
266: $
267: c(\alpha)= \sup_r
268: r^3 \La  \xi_1^3 \Ra_{\xi_1}+  (1-r^2)/\alpha. %\label{repfin}
269: $
270: Here $c(\alpha)$ is the typical value of the highest achievable
271: empirical contrast, $\max_{|J|=1} c_\D(J)$. The learning curve for $r$
272: thus obtained is a step function showing a first order
273: phase transition at $\alpha_c = 1/\La  \xi_1^3 \Ra_{\xi_1}$ 
274: from no learning ($r=0$) to perfect learning ($r=1$).
275: But the $r=0$ state is metastable for all values $\alpha >
276: \alpha_c$.
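
To make the step explicit, write $\kappa = \La \xi_1^3 \Ra_{\xi_1}$ (a shorthand introduced only for this remark), assume $\kappa > 0$, and set $f(r) = \kappa r^3 + (1-r^2)/\alpha$. Then
\[
f'(r) = r\,(3\kappa r - 2/\alpha), \qquad f''(r) = 6\kappa r - 2/\alpha ,
\]
so the only stationary point with $r>0$, $r^* = 2/(3\alpha\kappa)$, has $f''(r^*) = 2/\alpha > 0$ and is a minimum whenever it lies in the interval. The supremum over $0\leq r\leq 1$ is therefore attained at an end point, with $f(0) = 1/\alpha$ and $f(1) = \kappa$, which gives the jump at $\alpha_c = 1/\kappa$. Since $f'(r)<0$ for small positive $r$, the point $r=0$ remains a local maximum for every $\alpha$. For instance, for the source $\hxi_1 = (y^2-1)/\sqrt 2$ of Fig. 1, the Gaussian moments $\La y^2 \Ra = 1$, $\La y^4 \Ra = 3$, $\La y^6 \Ra = 15$ give $\kappa = (15-9+3-1)/(2\sqrt 2) = 2\sqrt 2$, i.e. $\alpha_c = 1/(2\sqrt 2) \approx 0.35$.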
277: 
278: 
279: \begin{figure}
280:    \begin{tabular}{l}
281:         \mbox{\begin{tabular}{l}
282:            \includegraphics[scale=0.8]{cfig1.eps}
283:               \end{tabular}}
284:    \end{tabular}   
285:  \caption{
286:  Prediction of $\KN=\sqrt{N}$ replica theory (bold line) compared to
287:  simulation results. The non-Gaussian source is 
288:   $\hat\xi_1 =(y^2-1)/\sqrt 2$, where $y$ is a standard Gaussian.
289:  The empty symbols show the results for the algorithm finding local
290:  maxima of the empirical contrast. The full symbols, denoting results
291:  for the iterated version of the procedure described in the main text,
292:  show that the agreement with the replica theory improves quickly with
293:  increasing system size $N$ for this algorithm.
294:  The  error bars estimate the standard deviation of the sample-to-sample 
295:  fluctuations.
296: }
297: \end{figure}
298: 
299: The replica theory predicts that for any divergent sequence of
300: cutoffs $\KN$, e.g. $\KN = e^N$,  we need $P > \alpha_c \KN N$ examples for
301: good generalization when $N$ is large. 
302: While this is ridiculous, I have argued above
303: that choosing $\KN=\sqrt N$ is, for $N\rightarrow\infty$, 
304:  equivalent to not cutting off at all. To
305: compare the replica result for this choice of $\KN$ to
306: numerical  simulations, let us consider 
307: actually finding a weight vector maximizing $c_\D(J)$. 
308: It turns out that a rather simple discrete dynamics can be used since
309: $g(x) = x^3$. Starting with a random
310: vector of unit length $J^0$, at the $k$-th time step we first compute the
311: matrix 
312: $A(J^k) = \sum_{\mu=1}^P \xi^\mu ({J^k}^T \xi^\mu ) {\xi^\mu}^T$
313: and then choose $J^{k+1}$ to maximize 
314: $|J^T A(J^k) J|$ under the constraint $|J|=1$. 
315: So, $J^{k+1}$ is an
316: eigenvector of $A(J^k)$ belonging to its eigenvalue of largest
317: magnitude. Standard results on quadratic forms imply that 
318: $|{J^{k+1}}^T A(J^k) J^{k+1}| \geq   |{J^{k}}^T A(J^{k-1}) J^{k}|$,
319: and the inequality is strict unless we are at a fixed point. 
320: Hence, the iteration converges to a vector $J^\infty$ which is a local 
321: maximum or minimum of  $c_\D(J)$. In the latter case, we just flip the
322: sign of $J^\infty$ to obtain a local maximum. 
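
The inequality can be seen as follows; this is only a sketch of the ``standard results'' alluded to, in a notation introduced here. Since $A(J)$ is linear in $J$, the trilinear form $T(a,b,c) = \sum_{\mu} (a^T\xi^\mu)(b^T\xi^\mu)(c^T\xi^\mu)$ is symmetric in its arguments and $a^T A(c)\, b = T(a,b,c)$. For a symmetric matrix, the largest magnitude of the quadratic form on the unit sphere equals the operator norm, which bounds the bilinear form for any pair of unit vectors. Hence
\[
|{J^{k+1}}^T A(J^k) J^{k+1}| = \max_{|a|=|b|=1} |a^T A(J^k)\, b|
\geq |T(J^{k-1},J^k,J^k)| = |{J^{k}}^T A(J^{k-1}) J^{k}|\,.
\]
At a fixed point one has $A(J)J = \lambda J$, and since for $g(x)=x^3$ the gradient of $P c_\D(J) = J^T A(J) J$ is $3A(J)J$, such a $J$ is a stationary point of $c_\D$ on the unit sphere.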
323: 
324: Simulation results for the procedure, compared to the $\KN =
325: \sqrt{N}$ replica theory in Fig. 1, show that the performance of
326: the algorithm is rather poor. This is in line with the
327: theoretical findings, since these predict that $r=0$ is
328: metastable, and the algorithm is only finding a local maximum. Figure 1
329: also shows results for an iterated version of the algorithm. There the
330: algorithm is rerun with $m=0.1N$ different random initial conditions,
331: and the weight vector maximizing $c_\D(J)$ among the $m$ outcomes is
332: chosen. These results are in good agreement with the $\KN =
333: \sqrt{N}$ replica theory, indicating that beyond the phase transition the
334: basin of attraction of the global maximum is quite large. 
335: 
336: Even if the simulations indicate
337: that the replica approach can be rescued by
338: plugging in the correct scaling of the cutoff $\KN$ at the end,
339: the theoretical situation is highly unsatisfactory. 
340: I shall next show that a physically
341: reasonable analysis can be provided by adapting the cavity method.
342: This is much simplified if we make some major
343: changes to the notation. From now on the non-Gaussian source will be
344: denoted by $\gamma$, whereas all of the $N$  components of $\xi$ are
345: assumed independent standard Gaussian. Our primary goal is to calculate
346: the typical value of $C_r = \max_{|J|=1} C_r(J)$ with
347: \begin{equation}
348: C_r(J) = \frac{1}{P}\sum_{\mu=1}^P g(r \gamma^\mu + \sqrt{1-r^2} J^T
349: \xi^\mu)
350: \label{orig}
351: \end{equation}
352: where $J$ is an $N$-dimensional vector. So $C_r$ is the maximal value of
353: the empirical contrast achievable on an $r$-shell. For generality, we
354: shall no longer assume that $g(x)$ must be cubic but consider any
355: super-quadratic function which does not diverge too quickly. 
356: In particular, for some $k>0$, 
357: $
358: \lim_{x\rightarrow\infty}{g(x)}/{x^{2+k}} = \psi 
359: $
360: should exist and be positive. Without loss of generality, we may then
361: assume $\psi=1$.
362: 
363: We still have $P=\alpha \LN N$
364: examples and consider the random variable $J_\D$ with the Gibbs
365: density    
366: \begin{eqnarray}
367: p_\D(J) &=& \frac{1}{Z_\D(\beta)} 
368:           \frac{e^{-\half |J|^2}}{(2 \pi)^{\half N}} 
369:           \prod_{\mu=1}^P 
370:             e^{\frac{\beta}{\LN} 
371:                      g(\gamma^\mu,[J]^T\xi^\mu)}
372: \nonumber \\
373: g(\gamma^\mu,[J]^T\xi^\mu) &=& 
374: g(r \gamma^\mu + \sqrt{1-r^2} [J]^T\xi^\mu)\,. \label{GD} 
375: \end{eqnarray}
376: Here $[J] = J/|J|$ and $Z_\D(\beta)$ is given
377: by the normalization $\int \!{\rm d}J\, p_\D(J) =1$. Note that we are now
378: using a factorizing Gaussian prior on $J$ and, to compensate for this, the
379: normalized vector $[J]$ is used to calculate the field in (\ref{GD}). 
380: 
381: A key task in the cavity approach is to obtain the field distribution by
382: calculating  the thermal average 
383: $\La \phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D}$ for  any function
384: $\phi$. One finds 
385: \begin{eqnarray}
386: \La\phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} &=& 
387: \frac{Z_{\D/\mu}(\beta)}{Z_\D(\beta)}
388: \La e^{\frac{\beta}{\LN} 
389:                      g(\gamma^\mu,[J_{\D/\mu}]^T\xi^\mu)}
390:     \phi(\gamma^\mu,[J_{\D/\mu}]^T\xi^\mu) \Ra_{J_{\D/\mu}},
391: \label{cav}
392: \end{eqnarray}
393: where $J_{\D/\mu}$ is the random variable with the Gibbs density obtained 
394: when pattern $\mu$ is removed from the system, i.e. 
395: omitting the $\mu$-th factor
396: of the product in (\ref{GD}) and adjusting the partition function to 
397: $Z_{\D/\mu}(\beta)$.
398: The variance of the cavity field 
399: $[J_{\D/\mu}]^T\xi^\mu$ is a self-averaging quantity and must then 
400: equal $1-q$ for large $N$, where 
401: $q = |\La [J_{\D/\mu}] \Ra_{J_{\D/\mu}}|^2$. Normally, one would further argue
402: that  $[J_{\D/\mu}]^T\xi^\mu$ becomes Gaussian in the thermodynamic limit.
403: But if we assume this, 
404: the $J_{\D/\mu}$ average in (\ref{cav}) diverges even when 
405: $\phi$ is a simple bounded function. 
406: This highlights the fact that the cavity field is not Gaussian in the large 
407: deviations regime because
408: $[J_{\D/\mu}]^T\xi^\mu$ cannot be larger than $|\xi^\mu|$. 
409: 
410: 
411: Hence, I rephrase the cavity argument as follows: For the purpose of 
412: calculating overlaps with a random vector such as $\xi^\mu$, 
413: the unnormalized $J_{\D/\mu}$ can for large $N$ be treated as a
414: Gaussian (with covariance matrix $(1-q)\mathbf 1$).
415: Then, the fluctuations of the cavity field obtained using
416: the normalized $[J_{\D/\mu}]$,
417: \[
418: P_{N,q}(h) = \La 
419: \delta\left(h - 
420: \left([J_{\D/\mu}]^T-\La[ J_{\D/\mu}]^T\Ra_{J_{\D/\mu}}\right)\xi^\mu
421: \right) \Ra_{J_{\D/\mu}}
422: \]
423: can be explicitly calculated.
424: This yields the 
425: important fact that there are just two relevant scales for the cavity 
426: fluctuations. 
427: For large $N$, 
428: $P_{N,q}(h)$ converges to 
429: $e^{-\half h^2/(1-q)}/\sqrt{2 \pi (1-q)}$ 
430: if  $h \ll \sqrt{N}$,  but in the large deviations regime, for
431: $h = d \sqrt{N}$,
432: \begin{equation}
433: \lim_{N\rightarrow\infty} N^{-1}\ln P_{N,q}(d \sqrt{N}) =
434: -\half \frac{ q d^2}{1-q} + \half\ln(1-d^2)
435: \label{ldev}
436: \end{equation}
437: if $0\leq d\leq1$.
438: Now, in terms of the functional
439: \[
440: {\cal L}^{q,\beta}_{y,\gamma}(\phi) = 
441: \int_{-\sqrt{N}}^{\sqrt{N}} 
442: {\rm d}h\, P_{N,q}(h)\,\phi(\gamma,\sqrt{q}y+h)\,
443: e^{\frac{\beta}{\LN} g(\gamma,\sqrt{q}y+h)}
444: \]
445: the average in Eq. (\ref{cav}) can in the limit of large $N$  be rewritten as 
446: $\La\phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} =
447: {\cal L}^{q,\beta}_{y^\mu,\gamma^\mu}(\phi)/
448: {\cal L}^{q,\beta}_{y^\mu,\gamma^\mu}(1)$ 
449: with $y^\mu = q^{-\half}\La[ J_{\D/\mu}]\Ra_{J_{\D/\mu}}^T\xi^\mu$. So  the 
450: quenched  averages are
451: \begin{eqnarray}
452: \La \La \phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} \Ra_\D
453: &=& \La
454: \frac{{\cal L}^{q,\beta}_{y,\gamma}(\phi)}
455:      {{\cal L}^{q,\beta}_{y,\gamma}(1)} \Ra_{y,\gamma}  \label{qav} \\
456: \La \ln Z_\D(\beta) - \ln Z_{\D/\mu}(\beta) \Ra_\D &=&
457: \La \ln {\cal L}^{q,\beta}_{y,\gamma}(1)  \Ra_{y,\gamma} 
458: \label{qav1}
459: \end{eqnarray}
460: where $y$ is standard Gaussian. The last equation is 
461: obtained by setting $\phi =1$ in (\ref{cav}).
462: 
463: We can now consider whether the large deviations regime contributes to
464: the averages in (\ref{qav}) for a polynomially bounded
465: $\phi$. Using that for large arguments $g(x) \sim x^{2+k}$ and
466: referring to  Eq. (\ref{ldev}), we find that it
467: will contribute if the maximum of 
468: \begin{equation}
469: u(d) = 
470: \beta d^{k+2}\frac{N^{\half k}}{\LN} 
471: - \half \frac{ q d^2}{1-q} + 
472: \half\ln(1-d^2)
473: \label{reldev}
474: \end{equation}
475: is positive for large $N$. This will not happen if
476: $\LN \gg  N^{\half k}$ and 
477: Eq. (\ref{qav}) then implies that
478: $\La \La \phi(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D} \Ra_\D =
479: \La \phi(\gamma,y) \Ra_{y,\gamma}$. The empirical mean equals
480: the expectation value and so the learning curve is  trivial. 
481: Henceforth, we focus on the relevant scale, setting 
482: $\LN =  N^{\half k}$. 
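
The prefactor of the first term in (\ref{reldev}) simply follows from inserting $h = d\sqrt N$ into the asymptotic form of $g$. As a sketch, ignoring the $\sqrt{q}y$ offset and the $r$-dependent factors, which do not affect the power of $N$,
\[
\frac{1}{N}\,\frac{\beta}{\LN}\, g(d\sqrt{N}) \approx
\frac{\beta}{N \LN}\,(d\sqrt N)^{2+k} = \beta d^{k+2}\, \frac{N^{\half k}}{\LN}\,.
\]
For the cubic contrast, $k=1$, the relevant scale is thus $\LN = \sqrt N$, i.e. $P = \alpha N^{3/2}$ examples, the same scale as obtained by setting $\LN = \KN = \sqrt{N}$ in the replica calculation.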
483: 
484: Our next task is to calculate the response when a new coupling $J_0$ is 
485: added to the system and each pattern $\xi^\mu$ is augmented by
486: a new component $\xi_0^\mu$. We denote the augmented training set by
487: $\hat\D$ and use (\ref{GD}) to define the partition function  
488: $Z_{\hat\D}(\beta)$ of the $N+1$ dimensional system. 
489: Due  to the $N$-dependence of the Gibbs weight 
490: $e^{\frac{\beta}{\LN}g(\gamma^\mu,[J]^T\xi^\mu)}$, it is simplest
491: to assume a slightly different temperature 
492: $\hat\beta_N = \beta L_{\scriptscriptstyle N+1}/\LN$ 
493: in the augmented system. Then,
494: when  considering the ratio $Z_{\hat\D}(\hat\beta_N)/Z_{\D}(\beta)$,
495: the two systems have the same Gibbs weight per pattern.
496: Standard arguments  \cite{Mez89} thus apply and yield that 
497: $
498: \sLa \ln Z_{\hat\D}(\hat\beta_N)/Z_\D(\beta) \sRa_{\hat\D}
499: = G_s(q,0)\, \label{entres}
500: $ for large $N$.
501: Here $G_s(q,0)$ is the entropy term of the
502: replica theory (Eq. \ref{rsZ}), but evaluated at $r=0$ because we are
503: calculating the partition function for each $r$-shell individually.
504: 
505: Having identified, via $\LN=
506: N^{\half k}$, the scale of the learning curve, 
507: $N^{-1}\La \ln Z_\D(\beta) \Ra_\D$ will 
508: converge to a finite quantity  $z(\alpha,\beta)$ in the thermodynamic limit.
509: We then  have
510: %
511: \newcommand{\pdev}[1]{ \frac{\partial\,\,}{\partial #1} } 
512: 
513: \begin{eqnarray*}
514:   \sLa \ln Z_{\hat\D}(\hat\beta_N)/Z_\D(\beta) \sRa_{\hat\D} &=&
515:  z(\alpha,\beta) - 
516:  \alpha \frac{k+2}{2}\pdev{\alpha}{z(\alpha,\beta)} +
517:  \frac{\beta k}{2} \pdev{\beta}{z(\alpha,\beta)}.                   
518: \end{eqnarray*}
519: The derivative of $z$ with respect to $\alpha$ is obtained from
520: Eq. (\ref{qav1}), and the thermal derivative is found
521: from  (\ref{qav}) using $\phi =g$. 
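
This relation can be checked by a short expansion. The following is a sketch which assumes $\ln Z_\D(\beta) \approx N z(\alpha,\beta)$ with $\alpha = P/N^{1+\half k}$ and keeps only the leading order in $1/N$; the symbol $\hat\alpha$ is introduced here for the load of the augmented system. At fixed $P$, passing from $N$ to $N+1$ changes the arguments of $z$ to
\[
\hat\alpha = \frac{P}{(N+1)^{1+\half k}} \approx \alpha\left(1 - \frac{k+2}{2N}\right), \qquad
\hat\beta_N = \beta\left(1+\frac{1}{N}\right)^{\half k} \approx \beta\left(1 + \frac{k}{2N}\right),
\]
so that
\[
(N+1)\,z(\hat\alpha,\hat\beta_N) - N z(\alpha,\beta) \approx
z - \alpha\frac{k+2}{2}\pdev{\alpha}z + \frac{\beta k}{2}\pdev{\beta} z\,.
\]
Under the same assumptions, adding one pattern changes $\alpha$ by $1/(\LN N)$, so Eq. (\ref{qav1}) gives $\pdev{\alpha}z = \LN \La \ln {\cal L}^{q,\beta}_{y,\gamma}(1) \Ra_{y,\gamma}$, while $\pdev{\beta} \ln Z_\D(\beta) = \LN^{-1}\sum_\mu \La g(\gamma^\mu,[J_\D]^T\xi^\mu) \Ra_{J_\D}$ together with (\ref{qav}) gives $\pdev{\beta}z = \alpha \La {\cal L}^{q,\beta}_{y,\gamma}(g)/{\cal L}^{q,\beta}_{y,\gamma}(1) \Ra_{y,\gamma}$.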
522: 
523: Putting things together, we finally find for large $N$
524: \begin{eqnarray}
525: z(\alpha,\beta) &=&   
526: \alpha \La \frac{k+2}{2} N^{\half k}  \ln {\cal L}^{q,\beta}_{y,\gamma}(1)
527: - \frac{\beta k}{2} \frac{{\cal L}^{q,\beta}_{y,\gamma}(g)}
528:                           {{\cal L}^{q,\beta}_{y,\gamma}(1)}
529: \Ra_{y,\gamma}\!\! 
530: + G_s(q,0)\,, \label{zfunc}
531: \end{eqnarray}
532: where the value of $q$ still has to be determined.
533: 
534: For this, let us reconsider when the large deviations regime
535: contributes to the value of ${\cal L}^{q,\beta}_{y,\gamma}(1)$. Going back
536: to Eq. (\ref{reldev}), with $\LN =  N^{\half k}$, 
537: we see that as in the replica theory this is governed by a critical
538: value $q_{\rm c}(\beta)$ of $q$. 
539: For $q < q_{\rm c}(\beta)$, $\max_d u(d)$ is positive in the large $N$ limit, 
540: so (\ref{zfunc}) diverges.
541: The possible range for $q$ is thus $q_{\rm c}(\beta) \leq q \leq 1$.
542: But,  if we assume $q > q_{\rm c}(\beta)$, the large $N$ limit yields the
543: very simple result
544: $
545: z(\alpha,\beta) =   G_s(q,0) + \alpha \beta \La g(\gamma,y) \Ra_{\gamma,y}
546: $. 
547: Now, on the one hand,  the empirical contrast is found by differentiating
548: $z(\alpha,\beta)$ with respect to $\beta$ and dividing by $\alpha$. This yields  
549: $\La g(\gamma,y) \Ra_{\gamma,y} + \frac{1}{\alpha}\pdev{q}G_s(q,0)\,\pdev{\beta}q$.
550: On the other hand, computing the same quantity using (\ref{qav}) yields
551: $\La g(\gamma,y) \Ra_{\gamma,y}$. So $q$ must stay  constant when $\beta$ 
552: varies, but this is impossible since $q_{\rm c}(\beta)\rightarrow 1$ for 
553: $\beta\rightarrow\infty$.
554: 
555: Hence, the only possible value for $q$ is $q_{\rm
556: c}(\beta)$.
557: Evaluating (\ref{zfunc}) by taking the limit $q\rightarrow q_{\rm
558: c}(\beta)$ from above, leads to the same result as in the $\KN =
559: \sqrt{N}$ replica theory. But, of course, this  has the same 
560: inconsistencies as found for the $q > q_{\rm c}(\beta)$ assumption.
561: It also makes no physical sense to use (\ref{zfunc})
562: at the point of discontinuity since  the cavity  derivation neglects 
563: fluctuations of $q$. Even if these vanish with increasing $N$, at the point
564: of discontinuity, $q=q_{\rm c}(\beta)$, the true result will 
565: nevertheless  depend on the unknown fluctuations.
566: 
567: But some conclusions can be drawn, knowing that $q$ has the
568: critical value. Let $d_\beta$ be the unique positive value such that 
569: $u(d_\beta) =0$ for   $q=q_{\rm c}(\beta)$. Then arguments analogous
570: to the derivation of  (\ref{qav}) show that the probability of the
571: posterior field $[J_\D]^T\xi^\mu$ exceeding $d\sqrt{N}$ is {\em not}
572: exponentially small if $d$ is smaller than $d_\beta$. 
573: More precisely, one finds for
574: $N\rightarrow\infty$ and $d < d_\beta$
575: \begin{eqnarray*}
576: \La N^{-1}\ln\sLa \Theta([J_\D]^T\xi^\mu - d\sqrt{N}) \sRa_{J_\D} \Ra_\D
577: &=&  \\ 
578: \La N^{-1}\ln 
579: {{\cal L}^{q,\beta}_{y,\gamma}(\Theta(h - d\sqrt{N}))}/
580:      {{\cal L}^{q,\beta}_{y,\gamma}(1)} \Ra_{y,\gamma} &=& 0\,.
581: \end{eqnarray*}
582: Further, $d_\beta$ approaches $1$ with increasing $\beta$. But this is
583: only possible if simply aligning the weight vector with the pattern $\xi^\mu$
584: maximizes the empirical contrast, at least up to sub-extensive corrections. So,
585: in the notation of Eq. (\ref{orig}), we have $C_r = C_r([\xi^\mu ])$
586: for large $N$, and thus finally
587: \begin{equation}
588: C_r  =
589: (1-r^2)^{\frac{2+k}{2}}/\alpha + \La g(r \gamma + \sqrt{1-r^2}\,y) 
590: \Ra_{\gamma,y}\,. 
591: \label{final}
592: \end{equation}
593: Maximizing this in $r$, the same learning curve is obtained for
594: the cubic case, $g(x)=x^3$, as in
595: the  $\KN=\sqrt N$ replica theory
596: %
597: \footnote{
598: For $g(x)=x^4$, the curve depends on whether $\sLa \gamma^4
599: \sRa_\gamma > 3$, since the fourth moment of a standard Gaussian is
600: $3$. If so, the value of $r$ jumps from $0$ to $1$ at 
601: $\alpha_c = 1/(\sLa \gamma^4\sRa_\gamma - 3)$. The 
602: $\sLa \gamma^4\sRa_\gamma < 3$ case, where one will use  $g(x)=-x^4$,
603: shall be described elsewhere. It 
604: is much simpler since the large deviations regime does not contribute.}.
605: %
606: It is important to note that we have in essence just used the standard
607: weak correlation assumptions of the cavity method in deriving (\ref{final}).
608: In view of the good agreement with numerical simulations (Fig. 1),
609: this strongly suggests that the cavity result is indeed exact in the 
610: thermodynamic limit.
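
The first term of (\ref{final}) can be traced to the single aligned pattern. As a sketch, assuming $|\xi^\mu|^2 \approx N$, $P = \alpha N^{1+\half k}$ and $g(x)\approx x^{2+k}$ for large $x$,
\[
\frac{1}{P}\, g\bigl(r\gamma^\mu + \sqrt{1-r^2}\,|\xi^\mu|\bigr) \approx
\frac{(1-r^2)^{\frac{2+k}{2}} N^{\frac{2+k}{2}}}{\alpha N^{1+\half k}} =
\frac{(1-r^2)^{\frac{2+k}{2}}}{\alpha}\,,
\]
while the remaining patterns, whose fields are of order one, contribute the average $\La g(r\gamma+\sqrt{1-r^2}\,y)\Ra_{\gamma,y}$. For $g(x) = x^3$, with $\La \gamma \Ra = 0$ and $\La y\Ra=\La y^3 \Ra = 0$, this average is $r^3\La\gamma^3\Ra_\gamma$, so
\[
C_r = (1-r^2)^{3/2}/\alpha + r^3 \La \gamma^3 \Ra_\gamma, \qquad
\pdev{r} C_r = 3r\left( r \La \gamma^3 \Ra_\gamma - \sqrt{1-r^2}/\alpha\right).
\]
Assuming $\La \gamma^3\Ra_\gamma > 0$, the derivative is negative for small $r$ and changes sign only once on $0<r<1$, so the maximum lies either at $r=0$ with $C_0 = 1/\alpha$ or at $r=1$ with $C_1 = \La \gamma^3\Ra_\gamma$, and the jump again occurs at $\alpha_c = 1/\La \gamma^3\Ra_\gamma$.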
611: 
612: From an analytical point of view, it is intriguing that the present
613: problem reveals a difference in the scope of the replica and the
614: cavity method. The latter can be transparently adapted to take
615: into account that the cavity field is not Gaussian in the large
616: deviations regime. But, commuting the thermal average with the disorder
617: average, at the expense of considering moments, is part and parcel
618: of using replicas. As a consequence, all the relevant fields
619: become truly Gaussian. This points to implicit assumptions in the
620: replica method, which need to be taken care of in any program to put
621: the approach on a solid mathematical footing \cite{Par02}. 
622: 
623: \acknowledgements  
624:  
625: It is a pleasure to acknowledge many discussions with Manfred Opper.
626: This work was supported by the Deutsche Forschungsgemeinschaft.
627: 
628: \bibliographystyle{unsrt}
629: \bibliography{/home/robert/tex/neural}
630: 
631: \end{document}