
1: \documentstyle[aps,prl,twocolumn,graphicx,amsmath]{revtex}

2:

3: \begin{document}

4: \newcommand{\J}{{\mathbf J}}

5: \newcommand{\du}{{\mathbf w}}

6: \newcommand{\T}{{\mathbf T}}

7: \newcommand{\B}{{\mathbf B}}

8: \newcommand{\s}{{\mathbf S}}

9: \newcommand{\ksi}{{\boldsymbol \xi}^{\mu}}

10: \newcommand{\bsxi}{\boldsymbol \xi}

11: \newcommand{\qh}{\hat{q}}

12: \newcommand{\Qh}{\hat{Q}}

13: \newcommand{\sgn}{\text{sgn}}

14: \newcommand{\eps}{\varepsilon}

15: \newcommand{\al}{\alpha}

16: \newcommand{\lan}{\langle\langle}

17: \newcommand{\ran}{\rangle\rangle}

18: \newcommand{\btau}{{\boldsymbol \tau}}

19:

20: \def\lim{\mathop{\rm lim}}

21: \def\extr{\mathop{\rm extr}}

22: \def\Tr{\mathop{\rm Tr}}

23:

24: \twocolumn[\hsize\textwidth\columnwidth\hsize\csname@twocolumnfalse\endcsname

25: \title

26: {Multilayer neural networks with extensively many hidden units}

27: \author

28: {Michal Rosen--Zvi$^1$, Andreas Engel$^2$, and Ido Kanter$^1$ }

29: \noindent

30: \address{$^1$ Minerva Center and Department of Physics, Bar-Ilan University,

31:    Ramat-Gan, 52900 Israel\\

32:  $^2$ Institut f\"ur Theoretische Physik, Otto-von-Guericke Universit\"at, \\

33:          PSF 4120, 39016 Magdeburg, Germany}

34:

35: \maketitle

36:

37: \begin{abstract}

38: The information processing abilities of a multilayer neural network with a

39: number of hidden units scaling as the input dimension are studied using

40: statistical mechanics methods. The mapping from the input

41: layer to the hidden units is performed by general symmetric Boolean functions

42: whereas the hidden layer is connected to the output by either discrete or

43: continuous couplings. Introducing an overlap in the space of Boolean

44: functions as order parameter the storage capacity if found to scale with the

45: logarithm of the number of implementable Boolean functions. The generalization

46: behaviour is smooth for continuous couplings and shows a discontinuous

47: transition to perfect generalization for discrete ones.

48: \end{abstract}

49: \pacs{} ]

50:

51:

52: %introduction

53:

54: Statistical mechanics investigations of artificial neural networks continue to

55: play a stimulating and integrating role in the scientific dialogue between

56: disciplines as diverse as neurophysiology, mathematical statistics, computer

57: science and information theory. In particular the study of feed-forward neural

58: networks pioneered by Gardner \cite{Gardner} has revealed a

59: variety of interesting results on how these systems may learn different tasks

60: of information processing from examples (for a review see \cite{EnvB}). Of

61: particular importance in this respect are multilayer networks (MLN) because of

62: their ability to implement any function between input and output \cite{Cyb}

63: which makes them attractive candidates for many practical

64: applications. It is well known that very many hidden

65: units are needed in order to realize this vast computational

66: complexity. However, statistical mechanics studies of MLN have so far been

67: mostly restricted to systems with very few hidden units as compared to the

68: number of inputs \cite{smMLN}. In the present letter we overcome this

69: limitation and study the storage and generalization abilities of a

70: tree MLN in which the size of the hidden layer scales in the same way as the

71: input dimension.

72:

73: % the model

74:

75: We consider a MLN with $N$ binary hidden units $\tau_i=\pm 1, i=1,...,N$

76: feeding a binary output $\sigma=\sgn(\sum_i J_i \tau_i)$ through a coupling

77: vector $\J=J_1,...,J_N$. The hidden units are determined via Boolean functions

78: $\tau_i=B_i(\s_i)$ by disjoint sets of inputs $\s_i=S_{i1},...,S_{iL}$

79: containing $L$ elements each. We are interested in the limit $N\to\infty$ with

80: $L$ remaining constant.

81:

82: In order to keep the connection with neural network architectures we restrict

83: ourselves to {\it symmetric} Boolean functions characterized by

84: $B_i(-\s_i)=-B_i(\s_i)$. There are $2^{2^{L-1}}$

85: such functions with $L$ inputs, with only few of them realizable by a coupling

86: vector $\du_i$ according to $B_i(\s_i)=\sgn(\sum_j w_{ij} S_{ij})$. For $L=3$

87: there are, e.g., 16 symmetric Boolean functions but only 14 of them are

88: linearly separable.
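
The count can be verified directly; the following is a sketch that uses only the symmetry condition stated above. The $2^L$ binary input configurations form $2^{L-1}$ pairs $(\s_i,-\s_i)$, and the value of $B_i$ may be chosen freely on one member of each pair, the other being fixed by symmetry:
\begin{equation}
  \#\{\text{symmetric } B_i\}=2^{2^{L-1}}, \qquad L=3:\; 2^{4}=16 . \nonumber
\end{equation}
For $L=3$ the 14 linearly separable functions are the 6 functions $\pm S_{ij}$ and the 8 functions $\sgn(\pm S_{i1}\pm S_{i2}\pm S_{i3})$. The remaining two are $\pm S_{i1}S_{i2}S_{i3}$, which are odd under $\s_i\to-\s_i$ but cannot be written as $\sgn(\sum_j w_{ij}S_{ij})$.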

89:

90: % statistical mechanics analysis and general expressions

91:

92: In order to investigate the storage and generalization properties of the

93: network we consider a set of $\al L N$ inputs $\ksi_i,\mu=1,...,\al LN$ the

94: components $\xi_{i1}^{\mu},...,\xi_{iL}^{\mu}$ of which are independent,

95: identically distributed

96: random variables with zero mean and unit variance. We then ask for the ability

97: of the network to map these inputs on outputs $\sigma^{\mu}=1$ for all $\mu$ by

98: adapting the Boolean functions $B_i$ and the couplings $J_i$ appropriately.

99:

100: The central quantity in the statistical mechanics analysis is the

101: {\it quenched entropy}

102: \begin{equation}\label{defs}

103:   s=\lim_{N\to\infty}\frac{1}{N} \lan \ln \int d\mu(\J) \Tr_{\{B_i\}}

104:     \prod_{\mu=1}^{\al L N}\theta(\sum_i J_i B_i(\ksi_i))

105:        \ran_{\{\ksi_i\}}

106: \end{equation}

107: where $d\mu(\J)$ is the proper measure in the space of couplings $\J$, the

108: trace denotes the sum over all Boolean functions, the product

109: is non-zero only if the arguments of all of the $\theta$-functions are positive

110: and the double angle stands for the average over the inputs. The

111: determination of $s$ can be performed using the replica trick and introducing

112: the overlap between two solutions in the combined space of couplings $\J$ and

113: Boolean functions $B_i$ of the form

114: \begin{equation}\label{defq}

115: q^{ab}=\frac{1}{N}\sum_i J_i^a J_i^b \;\lan B_i^a({\bsxi})

116:         B_i^b({\bsxi})\ran_{\bsxi}

117: \end{equation}

118: with the average being now over a single, $L$-component vector

119: $\bsxi$.

120: Exploiting the fact that this average involves a finite number of terms only

121: and assuming replica symmetry $q^{ab}=q$ for $a\neq b$ we can write $s$ using

122: standard techniques \cite{EnvB} in the form

123: \begin{equation}\label{ress}

124:   s=\extr_{q, \qh}\left[ G_C(q, \qh)+G_S(\qh)+\al L G_E(q)\right],

125: \end{equation}

126: with the explicit expressions for the functions $G_C, G_S$ and $G_E$ depending

127: on $L$, the constraints on $\J$ and on whether the storage or the

128: generalization problem is addressed.

129:

130: % the storage problem

131:

132: Let us begin with the storage problem by asking for the storage capacity

133: $\al_c$ defined as the maximal value of $\al$ for which the system can still

134: realize all desired input-output mappings with probability 1. Performing the

135: replica limit with the number of replicas tending to zero characteristic for

136: this problem we find

137: \begin{equation}\label{GEstor}

138:   G_E(q)=\int Dt \ln H(Q\;t)

139: \end{equation}

140: with the abbreviations $Dt=dt\, e^{-t^2/2}/\sqrt{2\pi}$,

141: $H(x)=\int_{x}^{\infty} Dt$, and $Q=\sqrt{q/(1-q)}$. The expressions for $G_C$

142: and $G_S$ depend on the constraints on the coupling vector $\J$.

143:

144: % the Ising case

145:

146: A particularly simple case is given by Ising couplings $J_i=\pm 1$. From the

147: symmetry of the Boolean functions considered it is clear that it

148: is sufficient to consider $J_i=1$ for all $i$. Consequently in this case all

149: flexibility of the network rests in the

150: choice of the Boolean functions between input and hidden layer and

151: $q$ is purely an overlap in the space of these Booleans. We find

152: $G_C=\qh(1-q)/2$ where $\qh$ denotes the conjugate order parameter to

153: $q$. Moreover, in the case where all $2^{2^{L-1}}$ symmetric

154: Boolean functions are admissible we use the identity

155: \begin{equation}

156:   \Tr_{\{B_i\}}\exp(\sqrt{\frac{\qh}{2^{L-1}}} \nonumber

157:        \sum_{\bsxi} z_{\bsxi} B_i({\bsxi}))

158:   =\prod_{\bsxi}(2\cosh(\sqrt{\frac{\qh}{2^{L-1}}}\; z_{\bsxi}))

159: \end{equation}

160: with the sums and products over ${\bsxi}$ running  over all $2^{L-1}$

161: configurations of $\bsxi$ with $\xi_1=1$ to find

162: \begin{equation}

163:   G_S=2^{L-1}\int Dz \ln \left[ 2\cosh(\sqrt{\frac{\qh}{2^{L-1}}}\;z)\right].

164: \end{equation}

165: Under the transformations $\qh\mapsto 2^{L-1}\qh$ and $\al\mapsto

166: 2^{L-1}\al/L$ the resulting expression for the entropy maps {\it exactly} on

167: the result for the Ising perceptron corresponding to $L=1$ and we may

168: therefore

169: use the well known results for this case \cite{KrMe}. Accordingly the storage

170: capacity is overestimated by the replica symmetric expression and the correct

171: result

172: \begin{equation}\label{h1}

173: \al_c(L)= \al_c(1)\;2^{L-1}/L \cong 0.83\; 2^{L-1}/L

174: \end{equation}

175: is given by the value of $\al$ at which the entropy $s(\al)$ turns negative.

176: The storage capacity is hence proportional to the {\it logarithm} of the

177: number of implementable Boolean functions. This result is in

178: accordance also with the rigorous upper bound $\al_c\leq2^{L-1}/L$ resulting

179: from the annealed entropy $s^{\text{ann}}=(2^{L-1}-\al L)\ln 2$.

180: As in the case of

181: the Ising perceptron this bound is related to information theory.

182: The full specification of the network with all $J_i=1$ requires

183: $N\,2^{L-1}$ bits of information necessary to pin down the $N$ Boolean

184: functions $B_i$. Therefore the machine cannot store more than $N\,2^{L-1}$

185: bits and $\al_c$ cannot exceed $2^{L-1}/L$.
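
The rescaling argument can be made explicit. The following is a sketch using only the expressions for $G_C$ and $G_S$ given above; the symbols $\qh'$ and $\al'$ are introduced here for illustration. Writing $\qh=2^{L-1}\qh'$ and $\al L=2^{L-1}\al'$, the bracket in (\ref{ress}) becomes
\begin{align}
 &\frac{\qh(1-q)}{2}+2^{L-1}\int Dz\,\ln\left[2\cosh\Big(\sqrt{\frac{\qh}{2^{L-1}}}\,z\Big)\right]
   +\al L\,G_E(q) \nonumber\\
 &=2^{L-1}\left[\frac{\qh'(1-q)}{2}+\int Dz\,\ln\left[2\cosh(\sqrt{\qh'}\,z)\right]
   +\al' G_E(q)\right], \nonumber
\end{align}
and the bracket on the right-hand side is the $L=1$ expression evaluated at the load $\al'$. Hence $s_L(\al)=2^{L-1}s_1(\al L/2^{L-1})$, so the entropy vanishes at $\al L/2^{L-1}=\al_c(1)$, which reproduces (\ref{h1}). The annealed entropy follows from a similar count, assuming that a randomly chosen network reproduces each random output with probability $1/2$: with $2^{N2^{L-1}}$ networks and $\al L N$ patterns one obtains $s^{\text{ann}}=\ln 2^{2^{L-1}}+\al L\ln\frac{1}{2}=(2^{L-1}-\al L)\ln 2$, which vanishes at $\al=2^{L-1}/L$, i.e. at $4/3$ for $L=3$.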

186:

187: Fig.\ref{Isingcap} compares the analytical result $\al_c(3)\cong 1.11$ for

188: $L=3$ with numerical simulations using exact enumerations. Even for the small

189: sizes accessible to this numerical technique we find a steepening of the

190: transition with increasing $N$ and a crossing point of the curves near the

191: theoretical prediction.

192:

193: If the trace over the Boolean functions in (\ref{defs}) is restricted to those

194: which can be realized by perceptrons with coupling vectors $\du_i$ the exact

195: mapping on the Ising perceptron no longer holds. Solving the

196: corresponding extremum conditions numerically for $L=3$ we find $\alpha_c

197: \cong 1.06$ for this case. The reduction of $\al_c$ compared to the

198: unrestricted case roughly matches the reduction in the logarithm of the number

199: of admissible Boolean functions: $1.06/1.11\cong\ln(14)/\ln(16)$.

200:

201: \begin{figure}[tb]\vspace*{-.75cm}

202: \hspace*{-1cm}\includegraphics[width=8cm,angle=270]{capacity16.ps}

203:    \caption{\label{Isingcap} Fraction $f$ of $3 \al N$ random input-output

204:           mappings implementable by

205:   a MLN with $3N$ inputs and $N$ hidden units as a function of $\al$ for $N=3$

206:   (squares) and $N=5$ (circles). The couplings between hidden units and output

207:   are fixed to $J_i=1$ for all $i$ and enumerations are performed over all

208:   combinations of symmetric Boolean functions $B_i$ between input and hidden

209:   layer. For every value of $\al$, 200 realizations of Gaussian inputs

210:   were averaged over. The solid line gives the analytical result describing

211:   the limit $N\to\infty$.}

212: \end{figure}

213:

214: % Finite synaptic depth

215:

216: It is possible to generalize the above analysis to the case of discrete

217: couplings with finite synaptic depth $l$ of the form

218: $J_i=\pm 1/l,\pm 2/l,..., \pm 1$ by building on the analysis of the analogous

219: case for the perceptron \cite{GuSt,Kanter}. In this case the additional order

220: parameter $\bar{q}=\sum_{i}(J_i^a)^2/N$, and its conjugate,

221: $\hat{\bar{q}}$ have to be introduced. For $G_E$ we then again find

222: (\ref{GEstor}) with now $Q=\sqrt{q/(\bar{q}-q)}$. Moreover

223: $G_C=-\hat{\bar{q}}\bar{q} + \qh q/2$ and, if all symmetric

224: Boolean functions are admissible,

225: \begin{align}

226: G_S&=\int\prod_{\bsxi} Dz_{\bsxi} \nonumber\\

227:    & \ln \Tr_J \exp(-(\frac{\qh}{2}-\hat{\bar{q}})J^2) \nonumber

228:    \prod_{\mathbf\xi} 2\cosh(J\,\sqrt{\frac{\qh}{2^{L-1}}}\; z_{\mathbf\xi}),

229: \end{align}

230: with $\Tr_J$ denoting the trace over the $2l$ possible values of the couplings

231: $J_i$.

232: Using these results we have numerically calculated the storage capacity

233: $\al_c(l)$ for the simplest case $L=3$ as a function of the synaptic depth

234: $l$. The results are shown in fig.\ref{CDigital} together with a fit to the

235: asymptotic behavior. The capacity increases from $\alpha_c\cong 1.11$ of

236: the Ising case, $l=1$, to roughly $1.7$ for large $l$. It is rather difficult

237: to compare these analytical findings with numerical simulations since the

238: effects of the finite synaptic depth do not show up at the small values of

239: $N$ accessible to exact enumerations \cite{PBGDK}.

240:

241: \begin{figure}[tb]\vspace*{-.75cm}

242: \hspace*{-1cm}\includegraphics[width=8cm,angle=270]{Cont.ps}

243:    \caption{\label{CDigital} Storage capacity of a MLN with $N\to\infty$

244:      hidden units and $3N$ inputs with couplings $J_i$ between hidden layer

245:      and output taking $2l$ discrete values. The inputs are mapped to the

246:      hidden layer by symmetric Boolean functions $B_i$. The solid line is the

247:      fit $\al_c\sim 1.70-0.91/l$

248:      to the asymptotic behavior, the dashed line gives the replica

249:      symmetric result for continuous couplings $J_i$.}

250: \end{figure}

251:

252: % continuous couplings

253:

254: To complete the analysis of the storage properties

255: we analyze the case of continuous couplings $\J$  between hidden

256: and output layer. It is convenient to eliminate the additional order parameter

257: $k$ necessary in this case to enforce the normalization $\J^2=N$ by

258: introducing $\Qh=\qh/(k+\qh)$. Within replica symmetry the quenched entropy

259: $s$ is then again of the form (\ref{ress}) with $G_C=0$, $G_E$ given by

260: (\ref{GEstor}), and the extremum taken now over $Q$ and $\Qh$. Moreover

261: \begin{align}

262:   G_S=&\frac{1}{2}\ln(1-\frac{\Qh}{1+Q^2})\nonumber\\

263:       &+\int\prod_ {\bsxi} Dz_{\bsxi}\ln \Tr_B\;\exp(\frac{\Qh}{2^L}\;

264:        (\sum_{\bsxi} z_{\bsxi} B({\bsxi}))^2).

265: \end{align}

266: The storage capacity $\al_c$ can be obtained from these expressions in the

267: limits $Q\to\infty, \Qh\to\infty$ corresponding to $q\to 1$. This limit

268: indicates that different solutions of the storage problem may at most

269: differ in a non-extensive number of components $J_i$ and Boolean functions

270: $B_i$. We then find $G_E\sim -Q^2/4$ and, if all

271: Boolean functions are admissible, $G_S\sim \Qh(1/2+(2^{L-1}-1)/\pi)$ giving

272: rise to

273: \begin{equation}\label{alspherrs}

274:   \al_c^{RS}=\frac{2+\frac{4}{\pi}(2^{L-1}-1)}{L}.

275: \end{equation}

276: For $L=3$ this yields $\al_c^{RS}=2/3+4/\pi\cong 1.94$. If only linearly

277: separable Boolean functions implementable by coupling vectors $\du_i$ are

278: considered the asymptotic behavior of $G_S$ is more difficult to obtain. For

279: the  case $L=3$ we find $\al_c\cong 1.85$. Again the relative reduction of

280: $\al_c$ when compared to the unrestricted case is roughly given by the ratio

281: of the logarithms of the number of available Boolean functions per hidden

282: unit.
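
As a quick numerical check of this statement, the following sketch uses only the values quoted above for $L=3$ (16 symmetric Boolean functions, 14 of them linearly separable):
\begin{align}
  \al_c^{RS}(L=3)&=\frac{2+\frac{4}{\pi}\,(2^{2}-1)}{3}=\frac{2}{3}+\frac{4}{\pi}\cong 1.94, \nonumber\\
  \frac{\ln 14}{\ln 16}&\cong\frac{2.639}{2.773}\cong 0.952, \nonumber\\
  \frac{1.85}{1.94}&\cong 0.954, \qquad \frac{1.06}{1.11}\cong 0.955 . \nonumber
\end{align}
The last two ratios refer to continuous and Ising couplings, respectively, and both agree with the ratio of logarithms to within the accuracy of the quoted capacities.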

283:

284: It is possible to derive an upper bound for $\al_c$ as has been done for MLN

285: with a finite number of hidden units \cite{MiDu} by using some exact

286: results for the perceptron \cite{Cover}. For $L=3$ we find

287: $\al_c(L=3)\leq 2.394$ and the replica symmetric result is therefore within

288: the bound. For large $L$ the bound is given by

289: $\al_c(L\to\infty)\lesssim 2^{L-1}/L+\ln 2$ and shows the same scaling with

290: $L$ as (\ref{alspherrs}).

291: Nevertheless the replica symmetric result (\ref{alspherrs}) is very likely

292: to overestimate the storage capacity as can be seen from fig.\ref{CDigital} in

293: which the result for $\al_c$ for $L=3$ is included as a horizontal dashed line.

294: Unlike the case of the perceptron \cite{GuSt} the values for $\alpha_c$

295: for finite synaptic depth seem not to approach the value for continuous

296: couplings when $l\to\infty$. It would hence be very interesting to

297: investigate the implications of replica symmetry breaking, both on the case

298: of continuous couplings and of couplings with finite synaptic depth

299: \cite{Robert}.

300:

301: % the generalization problem

302:

303: Let us finally elucidate the generalization problem, i.e. the ability of

304: the network to infer a rule from examples. To this end we consider as usual

305: two networks of the same type with the couplings and Boolean functions of one

306: of them (the ``teacher'') fixed at random. The other network (the ``student'')

307: receives a set of randomly chosen inputs $\ksi_i,\mu=1,...,\al L N$ together

308: with the corresponding outputs $\sigma^{\mu}_T$ generated by the teacher. The

309: task for the student is to imitate the teacher as well as possible. The

310: success in doing so is quantified by the generalization error $\eps$ defined

311: as the probability that a

312: {\it newly} chosen random input is classified differently by teacher and

313: student.

314:

315: As is well known the statistical mechanics analysis of the generalization

316: problem builds again on the expression (\ref{defs}) for the quenched entropy

317: with the number of replicas now tending to 1 rather than to 0

318: \cite{OpHa,EnvB}. A nice feature of this limit is that replica symmetry is

319: known to be stable. The order parameter $q$ defined in

320: (\ref{defq}) now gives the typical overlap between teacher and student and

321: determines the generalization error $\eps$ in a simple way. In the present

322: situation we have the standard relation $\eps=(\arccos q)/\pi$. Moreover

323: (\ref{GEstor}) is replaced by

324: \begin{equation}\label{GEgen}

325:     G_E(q)=2 \int Dt\; H(Q\;t)\;\ln H(Q\;t).

326: \end{equation}

327: The case using Ising couplings $J_i=\pm 1$ and all symmetric Boolean functions

328: can again be mapped exactly on the Ising perceptron. Correspondingly there is

329: a {\it discontinuous} transition to perfect learning, $\eps=0$ for $\al>\al_d$

330: \cite{Gyo} with $\al_d=1.24\;2^{L-1}/L$. This transition occurs when all

331: Boolean functions of the student ``lock'' onto the corresponding input-hidden

332: mappings of the teacher and is also expected to occur in the case where only

333: a restricted set of Booleans can be implemented.

334:

335: For continuous couplings we find $G_C=Q^2\Qh/((1+Q^2)(1-\Qh))$ and

336: \begin{equation}\label{GSgen}

337:   G_S=\frac{1}{2}\ln(1-\Qh)+\sqrt{1-\Qh}

338:       \int\prod_ {\bsxi} Dz_{\bsxi}\; g(z_{\bsxi})\,\ln g(z_{\bsxi})

339: \end{equation}

340: where

341: \begin{equation}\label{GSh}

342:  g(z_{\bsxi})=\Tr_B \exp(\frac{\Qh}{2^L}(\sum_{\bsxi}\,z_{\bsxi} B(\bsxi))^2).

343: \end{equation}

344: For small $\al$ this gives rise to $\eps\sim 1/2-\al L/(\pi^2 2^{L-1})$ which

345: coincides with the result for the perceptron for $L=1$ as it should. With

346: increasing $L$ the initial decay of the generalization error becomes slower

347: reflecting the increasing complexity and storage abilities of the network.

348: There is no retarded learning because of the non-zero

349: correlation between the hidden units and the output \cite{retlearn}.

350: For large $\al$ the generalization behaviour is dominated by the fine tuning

351: of the student couplings between hidden layer and output to the respective

352: couplings of the teacher resulting in the ubiquitous power law decay

353: $\eps\sim 0.625/(L\al)$.
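
For orientation, the asymptotic expressions quoted above can be evaluated for the simplest case $L=3$. This is a sketch that only inserts numbers into the formulas of the text:
\begin{align}
  \al\to 0:&\quad \eps\sim\frac{1}{2}-\frac{3\,\al}{4\pi^2}\cong\frac{1}{2}-0.076\,\al, \nonumber\\
  \al\to\infty:&\quad \eps\sim\frac{0.625}{3\,\al}\cong\frac{0.208}{\al}, \nonumber\\
  \text{Ising couplings}:&\quad \al_d=1.24\cdot\frac{2^{2}}{3}\cong 1.65 . \nonumber
\end{align}
The first two lines refer to continuous couplings and the last to Ising couplings with all symmetric Boolean functions admissible. For comparison, the perceptron values ($L=1$) are $\eps\sim 1/2-\al/\pi^2$, $\eps\sim 0.625/\al$ and $\al_d=1.24$.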

354:

355: In conclusion we have quantitatively characterized the storage and

356: generalization abilities of a multilayer neural network with a number of

357: hidden units scaling as the input dimension. If the mapping from the input

358: to the hidden layer is realized by symmetric Boolean functions with $L$

359: inputs the capacity is found to be proportional to the logarithm of the

360: number of these Boolean functions divided by $L$. The more conventional case

361: in which the hidden units are the outputs of perceptrons with couplings

362: $\du_i$ is more difficult to analyze. However, speculating that the above

363: scaling holds true also in this case and observing that the logarithm of

364: the number of Boolean functions which can be implemented by a perceptron with

365: $L$ inputs is $O(L^2)$ we arrive at the interesting result that the number

366: of stored input-output relations {\it per weight} of the network is

367: proportional to $L$. This implies that doubling the number of couplings

368: in the network would increase the storage capacity by a factor of 2 making

369: the proposed architecture superior to MLN with few ($K\ll N$) hidden

370: units in which the storage capacity is known to increase at most

371: logarithmically with the number of weights.

372:

373:

374:

375: \vspace*{.5cm}

376:

377:

378: {\bf Acknowledgment:} We have benefitted from discussions with Wolfgang

379: Kinzel, Robert Urbanczik, Peter Reimann, and Stephan Mertens.

380: We would like to thank the

381: Max-Planck-Institut f\"ur Physik komplexer Systeme in Dresden, where this

382: work was finished, for hospitality and the GIF for support.

383:

384:

385:

386: \begin{thebibliography}{99}

387: \vspace*{-1.5cm}

388: \bibitem{Gardner} E. Gardner, J. Phys. {\bf A21}, 257 (1988); E. Gardner and

389:   B. Derrida, J. Phys. {\bf A21}, 271 (1988).

390: \bibitem{EnvB} A. Engel and C. Van den Broeck, {\it Statistical Mechanics of

391:          Learning} (Cambridge University Press, Cambridge, 2001).

392: \bibitem{Cyb} G. Cybenko, Math. Contr. Sign. Syst. {\bf 2}, 303 (1989).

393: \bibitem{smMLN} E. Barkai, D. Hansel and I. Kanter, Phys. Rev. Lett. {\bf 65},

394:         2312 (1990); E. Barkai, D. Hansel, and H. Sompolinsky, Phys. Rev.

395:         {\bf A45}, 4146 (1992); A. Engel, H. M. K\"ohler, F. Tschepke,

396:         H. Vollmayr, and A. Zippelius, Phys. Rev. {\bf A45}, 5790 (1992);

397:         H. Schwarze and J. Hertz, Europhys. Lett. {\bf 20}, 375 (1992);

398:         R. Monasson and R. Zecchina, Phys. Rev. Lett.  {\bf 75}, 2432 (1995);

399:         R. Urbanczik, Europhys. Lett. {\bf 35}, 553 (1996); one of the rare

400:         exceptions is \cite{BKH}.

401: \bibitem{BKH} A. Bethge, R. K\"uhn, and H. Horner, J. Phys. {\bf A27}, 1929

402:          (1994).

403: \bibitem{KrMe} W. Krauth and M. M\'ezard, J. Phys. (Paris) {\bf 50}, 3057 (1989).

404: \bibitem{GuSt} H. Gutfreund and Y. Stein, J. Phys. {\bf A23}, 2613 (1990).

405: \bibitem{Kanter} I. Kanter, Europhys. Lett. {\bf 17}, 181 (1992).

406: \bibitem{PBGDK} A. Priel, M. Blatt, T. Grossman, E. Domany and I. Kanter,

407:  Phys. Rev. {\bf E50}, 577 (1994).

408: \bibitem{MiDu} G. J. Mitchison and R. M. Durbin, Biol. Cybern. {\bf 60}, 345

409:   (1989).

410: \bibitem{Cover} T. M. Cover, IEEE Trans. Electron. Comput. {\bf EC-14}, 326

411:   (1965).

412: \bibitem{Robert} R. Urbanczik, Europhys. Lett. {\bf 26}, 233 (1994).

413: \bibitem{OpHa} M. Opper and D. Haussler, Phys. Rev. Lett. {\bf 66}, 2677

414:         (1991).

415: \bibitem{Gyo} G. Gy\"orgyi, Phys. Rev. {\bf A41}, 7097 (1990).

416: \bibitem{retlearn} M. Biehl and A. Mietzner, J. Phys. {\bf A27}, 1885 (1994);

417:   B. Schottky, J. Phys. {\bf A28}, 4515 (1995); C. Van den Broeck and

418:   P. Reimann, Phys. Rev. Lett. {\bf 76}, 2188 (1996).

419: \end{thebibliography}

420:

421: \end{document}

422: