1: %------------------------------------------------------------------------------
2: % standard
3:
4: \newcommand{\stdpack}{
5: \usepackage{amssymb}
6: \usepackage{amsmath}
7: \usepackage{eucal}
8: % \usepackage{/home/mt/usr/tex/calrsfs}
9: % \usepackage{/home/mt/usr/tex/calligra}
10: \usepackage[final]{graphicx}
11: \usepackage{psfrag}
12: \usepackage{fancyhdr}
13: \renewcommand{\headrulewidth}{0pt}\lhead{}\cfoot{}\rfoot{\thepage}
14:
15: \newcommand{\draft}{\usepackage[light,first]{draftcopy}\draftcopyName{draft}{350}}
16: \newcommand{\labels}{\usepackage{showlabels}}
17: \newcommand{\maple}{\usepackage{maple2e}}
18: \newcommand{\makeidx}{\usepackage{makeidx}\makeindex}
19: }
20: \newcommand{\std}[1]{
21: \stdpack
22: \usepackage{a4,a4wide}
23: %\renewcommand{\labelenumi}{\textbf{(\roman{enumi})}}
24: %\renewcommand{\labelenumi}{${}^{\bf (\roman{enumi})}$}
25: %\renewcommand{\labelitemi}{\bf $\cdot$}
26: \newcommand{\blockindent}{3ex}
27: \renewcommand{\baselinestretch}{#1}
28: \renewcommand{\arraystretch}{1.2}
29: \hoffset -1cm \addtolength{\textwidth}{2cm}
30: \voffset -0cm \addtolength{\textheight}{0cm}
31:
32: \usepackage{./mt}
33: \columnsep 5ex
34: \parindent 3ex
35: \parskip 1ex
36: \macros
37: \pagestyle{fancy}
38: \bibliographystyle{./chicago}
39: \renewenvironment{abstract}{\paragraph{Abstract}\begin{rblock}\small}{\end{rblock}}
40: }
41:
42: %------------------------------------------------------------------------------
43: % styles
44:
45: \newcommand{\article}[2]{
46: \documentclass[#1pt,twoside,fleqn]{article}\usepackage{chicago}\std{#2} }
47: \newcommand{\nips}{
48: \documentclass{article} \usepackage{nips01e,times} \stdpack\macros }
49: \newcommand{\ijcnn}{
50: \documentclass[10pt,twocolumn]{/home/mt/usr/tex/ijcnn}
51: %\documentclass[10pt,twocolumn]{article}\usepackage{/home/mt/usr/tex/wcci}
52: \stdpack\macros
53: \bibliographystyle{abbrv}
54: }
55: \newcommand{\foga}{
56: \documentclass{article}
57: % \documentclass[10pt,twocolumn]{/home/mt/usr/tex/foga}
58: %\documentclass[10pt,twocolumn]{article}\usepackage{/home/mt/usr/tex/wcci}
59: \stdpack
60: \usepackage{/home/mt/usr/tex/foga}
61: \macros
62: }
63:
64: \newcommand{\book}[2]{
65: \documentclass[#1pt,twoside,fleqn]{book}\usepackage{chicago}\std{#2} }
66:
67: \newcommand{\foils}[1]{
68: \documentclass[12pt,fleqn]{article}
69: \std{#1}
70: %\renewcommand{\arraystretch}{1.5}
71: %\setlength{\voffset}{-3cm}%{-6cm}
72: %\setlength{\hoffset}{-4cm}
73: %\setlength{\textheight}{267mm}%{29cm}
74: %\setlength{\textwidth}{19cm}
75: %\hoffset -1cm \addtolength{\textwidth}{2cm}
76: \voffset -1cm \addtolength{\textheight}{2cm}
77: \renewcommand{\footskip}{2cm}
78: %\parindent 0ex
79: %\parskip 0ex %8ex
80: %\pagestyle{plain}
81: %\macros
82: \begin{document}
83: \large
84: }
85: \newcommand{\landfoils}[1]{
86: \documentclass[fleqn]{article}
87: \stdpack
88: \renewcommand{\baselinestretch}{#1}
89: \renewcommand{\arraystretch}{1}
90: \setlength{\hoffset}{-3.5cm}%{-5cm}
91: \setlength{\voffset}{-3.5cm}
92: \setlength{\textwidth}{27cm}%{29cm}
93: \setlength{\textheight}{19cm}
94: \parindent 0ex
95: \parskip 0ex %8ex
96: \pagestyle{empty}
97: \macros
98: \begin{document}
99: \huge
100: }
101: \newcommand{\landfolien}[1]{
102: \documentclass[fleqn]{article}
103: \usepackage{german}
104: \stdpack
105: \renewcommand{\baselinestretch}{#1}
106: \renewcommand{\arraystretch}{1.5}
107: \setlength{\hoffset}{-5cm}%{-6cm}
108: \setlength{\voffset}{-1.5cm}
109: \setlength{\textwidth}{27cm}%{29cm}
110: \setlength{\textheight}{19cm}
111: \parindent 0ex
112: \parskip 0ex %8ex
113: \pagestyle{plain}
114: \begin{document}
115: \huge
116: }
117:
118:
119: %------------------------------------------------------------------------------
120: % titles
121:
122: \newcommand{\addressCologne}{
123: Institute for Theoretical Physics\\
124: University of Cologne\\
125: 50923 K\"oln---Germany\\
126: {\tt mt@thp.uni-koeln.de}\\
127: {\tt www.thp.uni-koeln.de/\~{}mt/}
128: }
129:
130: \newcommand{\homepage}{{\tt www.neuroinformatik.ruhr-uni-bochum.de/PEOPLE/mt/}}
131: \newcommand{\email}{{\rm mt@neuroinformatik.ruhr-uni-bochum.de}}
132:
133: \newcommand{\address}{\small\it
134: Institut f\"ur Neuroinformatik,
135: Ruhr-Universit\"at Bochum, ND 04,
136: 44780 Bochum---Germany\\
137: \email
138: %\homepage
139: }
140:
141: \newcommand{\mytitle}{
142: \thispagestyle{empty}
143: \rhead{\it Marc Toussaint---\today}
144: \hrule height2pt
145: \begin{list}{}{\leftmargin2ex \rightmargin2ex \topsep2ex }\item[]
146: {\Large\bf \thetitle}
147: \end{list}
148: \begin{list}{}{\leftmargin7ex \rightmargin7ex \topsep0ex }\item[]
149: Marc Toussaint \quad\today
150:
151: \address
152: \end{list}
153: \vspace{2ex}
154: \hrule height1pt
155: \vspace{5ex}
156: }
157:
158: \newcommand{\contents}{{\small \parskip 0ex \tableofcontents \parskip 2ex }}
159:
160:
161: %------------------------------------------------------------------------------
162: % environments / commands
163:
164:
165: \newcommand{\sepline}{
166: \begin{center} \begin{picture}(200,0)
167: \line(1,0){200}
168: \end{picture}\end{center}
169: }
170:
171: \newcommand{\partsection}[1]{
172: \vspace{5ex}
173: \centerline{\sc\LARGE #1}
174: \addtocontents{toc}{\contentsline{section}{{\sc #1}}{}}
175: }
176:
177: \newcommand{\intro}[1]{\textbf{#1}\index{#1}}
178:
179: \newtheorem{definition}{Definition}
180: \newtheorem{statement}{Statement}
181: \newtheorem{theorem}{Theorem}
182: \newtheorem{hypothesis}{Hypothesis}
183: \newenvironment{remark}{\noindent\emph{Remark.}}{}
184: \newenvironment{example}[1][]{\begin{block}[Example {#1}]}{\end{block}~}
185:
186: \newcounter{parac}
187: \newcommand{\para}{\refstepcounter{parac}{\bf [{\roman{parac}}]}~~}
188: \newcommand{\Pref}[1]{[\emph{\ref{#1}}\,]}
189:
190: \newenvironment{block}[1][]{{\noindent\bf #1}
191: \begin{list}{}{\leftmargin\blockindent \topsep-\parskip}
192: \item[]
193: }{
194: \end{list}
195: }
196:
197: \newenvironment{rblock}{
198: \begin{list}{}{\leftmargin\blockindent \rightmargin\blockindent \topsep-\parskip}
199: \item[]
200: }{
201: \end{list}
202: }
203:
204: \newenvironment{keywords}{\paragraph{Keywords}\begin{rblock}\small}{\end{rblock}}
205:
206: \newenvironment{colpage}{
207: \addtolength{\columnwidth}{-3ex}
208: \begin{minipage}{\columnwidth}
209: \vspace{.5ex}
210: }{
211: \vspace{.5ex}
212: \end{minipage}
213: }
214:
215: \newenvironment{enum}{
216: \begin{list}{}{\leftmargin3ex \topsep0ex \itemsep0ex}
217: \item[\labelenumi]
218: }{
219: \end{list}
220: }
221:
222: \newenvironment{cramp}{
223: \begin{quote} \begin{picture}(0,0)
224: \put(-5,0){\line(1,0){20}}
225: \put(-5,0){\line(0,-1){20}}
226: \end{picture}
227: }{
228: \begin{picture}(0,0)
229: \put(-5,5){\line(1,0){20}}
230: \put(-5,5){\line(0,1){20}}
231: \end{picture} \end{quote}
232: }
233:
234:
235: \newcommand{\inputReduce}[1]{
236: {\sc\hspace{\fill} REDUCE file: #1}
237: % \input{#1.tex}
238: % \sepline
239: % \input{../tex/tridefs}
240: % \input{#1.out}
241: % \redefinemath
242: }
243:
244: \newcommand{\inputReduceInput}[1]{
245: {\sc\hspace{\fill} REDUCE input - file: #1}
246: \input{#1.tex}
247: }
248:
249: \newcommand{\inputReduceOutput}[1]{
250: {\sc\hspace{\fill} REDUCE output - file: #1}
251: % \input{../tex/tridefs}
252: % \input{#1.out}
253: % \redefinemath
254: }
255:
256: \newcommand{\todo}[1]{{\bf[#1]}}
257:
258: %------------------------------------------------------------------------------
259: % macros
260:
261: \newcommand{\macros}{
262: \newcommand{\0}{{\hat 0}}
263: \newcommand{\1}{{\hat 1}}
264: \newcommand{\2}{{\hat 2}}
265: \newcommand{\3}{{\hat 3}}
266: \newcommand{\5}{{\hat 5}}
267:
268: \renewcommand{\a}{\alpha}
269: \renewcommand{\b}{\beta}
270: \renewcommand{\c}{\gamma}
271: \renewcommand{\d}{\delta}
272: \newcommand{\D}{\Delta}
273: \newcommand{\e}{\epsilon}
274: \newcommand{\g}{\gamma}
275: \newcommand{\G}{\Gamma}
276: \renewcommand{\l}{\lambda}
277: \renewcommand{\L}{\Lambda}
278: \newcommand{\m}{\mu}
279: \newcommand{\n}{\nu}
280: \newcommand{\N}{\nabla}
281: \renewcommand{\k}{\kappa}
282: %\renewcommand{\o}{\omega}
283: \renewcommand{\O}{\Omega}
284: \newcommand{\p}{\phi}
285: \newcommand{\ph}{\varphi}
286: \renewcommand{\P}{\Phi}
287: \renewcommand{\r}{\varrho}
288: \newcommand{\s}{\sigma}
289: \newcommand{\Si}{\Sigma}
290: \renewcommand{\t}{\theta}
291: \newcommand{\T}{\Theta}
292: \renewcommand{\v}{\vartheta}
293: \newcommand{\X}{\Xi}
294: \newcommand{\Y}{\Upsilon}
295:
296: \renewcommand{\AA}{{\cal A}}
297: \newcommand{\GG}{{\cal G}}
298: \renewcommand{\SS}{{\cal S}}
299: \newcommand{\TT}{{\cal T}}
300: \newcommand{\EE}{{\cal E}}
301: \newcommand{\FF}{{\cal F}}
302: \newcommand{\HH}{{\cal H}}
303: \newcommand{\II}{{\cal I}}
304: \newcommand{\KK}{{\cal K}}
305: \newcommand{\LL}{{\cal L}}
306: \newcommand{\MM}{{\cal M}}
307: \newcommand{\NN}{{\cal N}}
308: \newcommand{\CC}{{\cal C}}
309: \newcommand{\PP}{{\cal P}}
310: \newcommand{\QQ}{{\cal Q}}
311: \newcommand{\RR}{{\cal R}}
312: \newcommand{\UU}{{\cal U}}
313: \newcommand{\YY}{{\cal Y}}
314: \newcommand{\SOSO}{{\cal SO}}
315: \newcommand{\GLGL}{{\cal GL}}
316:
317: \newcommand{\NNN}{{\mathbb{N}}}
318: \newcommand{\ZZZ}{{\mathbb{Z}}}
319: %\newcommand{\RRR}{{\mathrm{I\!R}}}
320: \newcommand{\RRR}{{\mathbb{R}}}
321: \newcommand{\CCC}{{\mathbb{C}}}
322: \newcommand{\one}{{{\bf 1}}}
323:
324: \newcommand{\<}{{\ensuremath\langle}}
325: \renewcommand{\>}{{\ensuremath\rangle}}
326: \newcommand{\Aut}{{\rm Aut}}
327: \newcommand{\cor}{{\rm cor}}
328: \newcommand{\corr}{{\rm corr}}
329: \newcommand{\cov}{{\rm cov}}
330: \newcommand{\sd}{{\rm sd}}
331: \newcommand{\tr}{{\rm tr}}
332: \newcommand{\lag}{\mathcal{L}}
333: \newcommand{\inn}{\rfloor}
334: \newcommand{\lie}{\pounds}
335: \newcommand{\longto}{\longrightarrow}
336: \newcommand{\speer}{\parbox{0.4ex}{\raisebox{0.8ex}{$\nearrow$}}}
337: \renewcommand{\dag}{ {}^\dagger }
338: \newcommand{\h}{{}^\star}
339: \newcommand{\w}{\wedge}
340: \newcommand{\too}{\longrightarrow}
341: \newcommand{\To}{\Rightarrow}
342: \newcommand{\Too}{\;\Longrightarrow\;}
343: \newcommand{\ow}{\stackrel{\circ}\wedge}
344: \newcommand{\feed}{\nonumber \\}
345: \newcommand{\comma}{\; , \quad}
346: \newcommand{\period}{\; . \quad}
347: \newcommand{\del}{\partial}
348: % \newcommand{\quabla}{\Delta}
349: \newcommand{\point}{$\bullet~~$}
350: \newcommand{\doubletilde}{
351: ~ \raisebox{0.3ex}{$\widetilde {}$} \raisebox{0.6ex}{$\widetilde {}$} \!\!
352: }
353: \newcommand{\topcirc}{\parbox{0ex}{~\raisebox{2.5ex}{${}^\circ$}}}
354: \newcommand{\sym}{\topcirc}
355:
356: \newcommand{\half}{\frac{1}{2}}
357: \newcommand{\third}{\frac{1}{3}}
358: \newcommand{\fourth}{\frac{1}{4}}
359:
360: \renewcommand{\_}{\underset}
361: \renewcommand{\^}{\overset}
362:
363: \renewcommand{\small}{\footnotesize}
364: }
365:
366: \newcommand{\argmax}[1]{\text{arg}\underset{#1}\max}
367: \newcommand{\argmin}[1]{\text{arg}\underset{#1}\min}
368: \newcommand{\kld}[2]{D\!\left(\,#1\,|\!|\,#2\,\right)}
369:
370: %\newcommand{\path}{\pathmt}
371: \newcommand{\pathmt}{./}
372: \newcommand{\basepath}{./}
373: \newcommand{\setpath}[1]{\renewcommand{\pathmt}{#1}\renewcommand{\basepath}{#1}}
374: \newcommand{\pathinput}[2]{
375: \renewcommand{\pathmt}{\basepath #1}
376: \input{\pathmt #2} \renewcommand{\pathmt}{\basepath}}
377:
378: \newcommand{\hide}[1]{[\small #1 \normalsize]}
379: \newcommand{\color}[2][1]{}
380: %\newcommand{\url}[1]{{\tt #1}}
381:
382: \article{10}{1}
383:
384: %\draft
385: %\labels
386:
387: %\ijcnn
388: %\renewcommand{\paragraph}[1]{{\bf #1}}
389: %\newcommand{\citeNP}[1]{\cite{#1}}
390: %\newcommand{\citeyear}[1]{\cite{#1}}
391: %\rfoot{}
392:
393:
394:
395: \newcommand{\df}{\d\!f}
396:
397: \title{\Large\textbf{On model selection and the disability of neural networks\\ to decompose tasks}}
398:
399: \author{\normalsize Marc Toussaint\\
400: \sizeix Institut f\"ur Neuroinformatik, Ruhr-Universit\"at Bochum\\
401: \sizeix 44780 Bochum, Germany\\
402: \textit{\sizeix Marc.Toussaint@neuroinformatik.ruhr-uni-bochum.de}}
403:
404: \date{}
405:
406:
407:
408:
409: \begin{document}
410:
411: %\pagestyle{empty}
412: %\maketitle\thispagestyle{fancy}
413:
414: \twocolumn[\mytitle]\thispagestyle{fancy}
415:
416: \rhead{\it Proceedings of the International Joint Conference on Neural
417: Networks (IJCNN 2002)}
418:
419:
420: \begin{abstract}%
421: A neural network with fixed topology can be regarded as a
422: parametrization of functions, which decides on the correlations
423: between functional variations when parameters are adapted. We
424: propose an analysis, based on a differential geometry point of view,
425: that allows to calculate these correlations. In practise, this
426: describes how one response is unlearned while another is trained.
427: Concerning conventional feed-forward neural networks we find that
428: they generically introduce strong correlations, are predisposed to
429: forgetting, and inappropriate for task decomposition. Perspectives
430: to solve these problems are discussed.
431:
432: % Consider the functional responses of a neural network on two
433: % different stimuli. When the weights are varied or adapted, we ask:
434: % Do both of the responses vary? Do they always vary in correlation?
435: % Or are their variations decorrelated, e.g., because they depend on
436: % different weights?
437:
438: % These questions actually address the way of parametrization of the
439: % function space and may thus be embedded in the model selection
440: % problem. We formalize these ideas and propose an analysis of the
441: % parametrization that is based on a differential geometry point of
442: % view and allows to predict the correlations in functional
443: % variations. In practise, this allows a prediction of how one
444: % response is unlearned while the other is trained.
445:
446: % Finally, an important result is that we can generically classify
447: % conventional feed-forward neural networks as introducing much
448: % correlations, being predisposed to forgetting, and inappropriate for
449: % task decomposition. Perspectives to solve these problems are
450: % discussed.
451: \end{abstract}
452:
453:
454:
455:
456: \section{Introduction}\label{intro}
457:
458: Following Kerns et al. \citeyear{kearns:95}, the problem of model
459: selection may be defined as follows: Given a finite set of data
460: points, find a function (or conditional probability distribution, also
461: called hypothesis) such that the expected generalization error is
462: minimized. Typically, the search space $\FF$ (the space of functions
463: or conditional probability distributions) is assumed to be organized
464: as a nested sequence of subspaces $\FF_1 \subseteq .. \subseteq \FF_d
465: \subseteq .. \subseteq \FF$ of increasing complexity. For instance,
466: the index $d$ may denote the number of parameters or the
467: Vapnik-Chervonenkis dimension \cite{vapnik:95}. Finding the function
468: with minimal generalization error then amounts to finding the
469: appropriate sub-search-space before applying ordinary optimization
470: schemes. Many approaches introduce a penalty term related to
471: complexity which has to be minimized together with the training error.
472: Penalty terms are, for example, the number of parameters of the model,
473: the number of \emph{effective} model parameters, the
474: Vapnik-Chervonenkis dimension, or the description length
475: \cite{akaike:74,amari:93,moody:91,rissanen:78,vapnik:95}. An
476: alternative based on geometric arguments is presented by
477: Schuurmans \citeyear{schuurmans:97}.
478:
479: The emphasis of our investigations is different to these classical
480: approaches. The choice of a specific model (e.g., a neural network)
481: to represent a function has \emph{two} implications: it defines the
482: space $\FF_d$ of representable functions, but it also defines a
483: \emph{parametrization} of this space, where parametrization is not
484: meant in the sense of `finding parameters' but in the sense of
485: introducing coordinates on that space, i.e., introducing a mapping
486: $\P:\, \RRR^m \to \FF_d$ from some coordinate space $\RRR^m$ onto the
487: sub-search-space. To omit confusion, we use the term \emph{model
488: class} for the sub-search-space $\FF_d$, and \emph{model
489: parametrization} for the parametrization $\P$ of this
490: sub-search-space. For example, an artificial neural network with $m$
491: free parameters, fixed topology, and fixed activation functions
492: defines a model class (the subspace of functions it can
493: realize---which, if the topology is appropriate, includes an
494: approximation of any function \cite{hornik:89}) but it also defines a
495: model parametrization (the mapping from its parameters to the
496: corresponding function).
497:
498: Our emphasis is on the implications of a specific model
499: parametrization instead of the choice of a certain model class. It is
500: important to have a closer look at this parametrization in order to
501: allow for an analytical description of the adaptation dynamics, rather
502: than just analyzing the complexity of a model class. In particular,
503: the precise relation between variations of parameters and functional
504: variations of the system is of fundamental interest because it
505: decides, e.g., on the way of ``extrapolation'', or on how the system
506: forgets previously learned data. This relation can be derived from the
507: model parametrization and our goal is to extract such features
508: analytically. We focus on forgetting as a specific character of
509: adaptation dynamics and develop an analysis of the model
510: parametrization that allows to approximate the rate of forgetting.
511: This analysis is based on a differential geometry point of view and is
512: related to a large pool of research, including the discussions of
513: \emph{cross-talk} \cite{jacobs:90} and \emph{catastrophic forgetting}
514: \cite{french:99}, the information geometry point of view on parameter
515: adaptation \cite{amari:00}, and perfectly analogous ideas in the
516: context of evolutionary adaptation \cite{toussaint:01}. Section
517: \ref{ana} includes a discussion of these relations.
518:
519: We apply our method of analyzing the model para\-metrization on the
520: class of standard feed-forward neural networks (FFNNs). We find that
521: the variety of FFNNs with arbitrary topology is actually not a great
522: variety with respect to certain characters of the model
523: parametrization. In particular, FFNNs gnerically introduce strong
524: correlations between functional variations and thereby are predisposed
525: to forget previously learned data. Hence, using FFNNs as a function
526: model means a limitation---not with respect to representable functions
527: but with respect to learning characteristics. A simple example
528: compares a standard FFNN with a network that includes competitive
529: interactions. The results validate our analytical predictions and
530: illustrate their implications. We conclude that a generalization of
531: the class of FFNNs is necessary and that the introduction of
532: competitive interactions between neurons is a promising approach to
533: solve these problems.
534:
535: Section \ref{def} will introduce to the formalism our investigations
536: are based on and, in section \ref{ana}, we describe the analysis of
537: the model parametrization. Section \ref{emp} presents the examples
538: and in section 5 we give an outlook concerning the evolutionary
539: perspective on model selection and discuss the relevance of the
540: limitedness of FFNN models. The conclusion follows up.
541:
542:
543:
544:
545: \section{Definitions}\label{def}
546:
547: \subsection{The functional point of view}
548:
549: Let $\FF$ be the search space. Here, $\FF$ shall be the space of all
550: functions mapping from a finite space $X$ to $Y \subseteq \RRR^n$.
551: However, all results can be transferred to the search space of
552: conditional probabilities, as we discuss below.
553:
554: The space of functions $f:\, X \to Y$ can be written as $Y^X$, which
555: is isomorphic to $\RRR^{n\cdot|X|}$. Thus, let a function $f \in
556: Y^X$ be represented by $n\!\cdot\!|X|$ components $f^a \in \RRR$,
557: where the index $a$ refers to a specific point in $X$ \emph{and} a
558: $Y$-dimension. (The components $f^a$ may be regarded as entries of a
559: lookup-table representation of $f$.) On this representation, we
560: describe an online adaptation step as a probabilistic transition to a
561: new function as follows: Assume that adaptation is initiated by the
562: observation of a target value $t^a$ for a functional component $f^a$.
563: A transition occurs as a variation $\df \in \RRR^{n|X|}$ with
564: probability $p(\df\, |\, f^a,t^a)$. The interesting point is that
565: functional components of which no target value has been observed may
566: vary as well. Let $a$ be a random variable and consider the density
567: $p(\df)=p(\df\, |\, f^a,t^a)\, p(a)$. We will refer to the respective
568: covariance between two variation components as the \emph{functional
569: covariance matrix}
570: \begin{align}
571: C^{bc} := \cov_{p(\df)}(\df^b,\df^c) \;.
572: \end{align}
573:
574: This matrix is a first order description of how the adaptation of the
575: observed functional component results in a \emph{coadaptation} of a
576: functional component which has not been observed. For example,
577: assuming a linear dependence between $\df^a$ and $\df^b$, we have
578: $\df^b \stackrel{\cdot}= \<\df^b\> + \frac{C^{ab}}{\s^2}\, \big( \df^a
579: - \<\df^a\>\big)$, where $\s^2$ is the variance of $\df^a$. Whether
580: this coadaptation is desirable or not depends on the problem.
581: Coadaptation is also an explicit description of the ``way of
582: generalization''\footnote{By ``way of generalization'' we do not refer
583: to the generalization error but to the way of extrapolation from
584: observed data to unobserved.}: unobserved functional components
585: (i.e., the functional response on stimuli that have not been observed)
586: are coadapted depending on the adaptation of observed functional
587: components. In general, one would like to choose from a variety of
588: different coadaptation schemes, i.e., one would like to select a model
589: from a variety of models with different kinds of coadaptation. We will
590: find that this refers to the selection of a model parametrization.
591:
592: When the set of functional components can be separated in two disjoint
593: subsets such that $C^{ab}$ vanishes for two components $f^a$ and $f^b$
594: of different subsets, then we speak of \emph{adaptation
595: decomposition}. During online learning, adaptation decomposition
596: means that the development of two such components during successive
597: adaptation is not correlated. In terms of homogeneous Markov
598: processes, successive adaptation is described by the transition
599: probability $p(\df\, |\, f^a,t^a)$ (assuming that the draw of $a$ from
600: $p(a)$ is independent at each time), and adaptation is decomposed if
601: $p(\df^a,\df^b)=p(\df^a)\, p(\df^b)$.
602:
603:
604:
605: \subsection{The parameter point of view}
606:
607: We now address the \emph{modeling} of functions. Let $\P$ be a
608: $m$-dimensional, differentiable parametrization of a subset $\P(W)$
609: of functions:
610: \begin{align}
611: & \P:\, W \to \FF \comma W\subseteq\RRR^m \;,\\
612: & \P(W) := \bigcup_{w \in W} \{\P(w)\} \quad \subseteq \FF \;.
613: \end{align}
614: We call $\P$ the \emph{model parametrization} and $\P(W)$ the
615: \emph{model class}. In terms of differential geometry, $\P$ is the
616: inverse of a coordinate map (or chart, or atlas) for $\P(W)$. Since
617: this map is differentiable, it induces a metric on $\P(W)$ if one on
618: $W$ is given and vice versa. We define the \emph{functional metric}
619: $g^{ab}(w)$ on $\P(W)$ as the lift of the Euclidean metric on $W$,
620: \begin{align}
621: g^{ab}(w) := \sum_i \frac{d \P(w)^a}{d w^i}\, \frac{d \P(w)^b}{d w^i} \;;
622: \label{funcMet}
623: \end{align}
624: and we define the \emph{parameter metric} $g_{ij}(w)$ on $W$ (actually
625: on the \emph{dual} tangent spaces of $W$) as the pull-back of the
626: Euclidean metric on $\P(W)$,
627: \begin{align}
628: g_{ij}(w) := \sum_a \frac{d \P(w)^a}{d w^i}\, \frac{d \P(w)^a}{d w^j} \;.
629: \label{parMet}
630: \end{align}
631: As usual in differential geometry, the metrics depend on the locality
632: given by $w$. These metrics describe the relation between parameter
633: variations and functional variations as we explore in more detail in
634: the next section.
635:
636:
637:
638:
639: \section{Analysis of the model para\-metrization}\label{ana}
640:
641: In the previous section we defined the correlation matrix $C^{ab}$ on
642: the functional level. Now we analyze what the choice of a model
643: parametrization $\P$ implies on this functional level. Given $\P$ and
644: parameters $w$, we write $f^a=\P(w)^a$. Assume that a target $t^a$ was
645: observed and adaptation of the parameters takes place by a gradient
646: descent,
647: \begin{align}
648: \d w^i = 2 \a\; \frac{d f^a}{d w^i}\, (t^a - f^a) \;,
649: \end{align}
650: which corresponds to the gradient of the squared error multiplied by
651: an adaptation rate $\a$. In first order approximation, this induces a
652: functional variation
653: \begin{align}
654: \df^b
655: = 2 \a\; \sum_i \frac{d f^b}{d w^i}\, \d w^i
656: = 2 \a\; g^{ab}\, (t^a - f^a) \;,
657: \label{deltaF}
658: \end{align}
659: using definition (\ref{funcMet}). Thus, the functional metric $g^{ab}$
660: describes the variation of a functional component $f^b$ when $t^a$ is
661: observed. This gives a first order description of coadaptation and of
662: how the model generalizes the experience of a target value $t^a$ in
663: order to adapt also functional components $f^b$. In this approximation
664: the functional covariance reads
665: \begin{align}\label{covari}
666: C^{bc} = 4 \a^2\; \sum_a p(a)\; g^{ba}\, g^{ca}\, (t^a - f^a)^2 -\<\df^a\>\<\df^b\>\;.
667: \end{align}
668: To discuss this expression, let us assume that the second term
669: vanishes, $\<\df^a\>\<\df^b\>=0$. Concerning the first term, the
670: product $g^{ba}\, g^{ca}$ vanishes for all $a$ if and only if the
671: functional metric is a block matrix and $b$ and $c$ refer to different
672: blocks:
673: \begin{align*}
674: g^{ab} = \left(\begin{array}{cc}
675: A \in \RRR^{\m\times\m} & 0\\
676: 0 & B\in \RRR^{\n\times\n}
677: \end{array}\right) \comma b\le\m \comma c>\m \;,
678: %\bigg(\!\begin{array}{c}\text{\input{blocks1.fig}}\end{array}\!\!\!\!\!\bigg)
679: \end{align*}
680: where $A$ and $B$ are arbitrary symmetric matrices and
681: $\m+\n=n\cdot|X|$. Thus, adaptation is decomposed into two subsets of
682: functional components exactly if the functional metric is a block
683: matrix and the functional component subsets correspond to these
684: blocks.\footnote{Note the relation to group theory: A group
685: representation is said to be reducible if all group generators can
686: be represented as a block matrix (such that all of them fit in the
687: same block template). On this basis, physics defines the notion of
688: an elementary particle as corresponding to an irreducible
689: representation, whereas physical systems that correspond to a
690: reducible representation (a block matrix) are considered as
691: \emph{composed} of particles. A system of which the adaptation
692: dynamics (instead of physical interactions) can be decomposed in the
693: sense of a block matrix can analogously be thought of as composed of
694: subsystems.
695:
696: More formally, the observation of a target $t^a$ can be identified
697: with an element of a group that applies on the functional
698: components. Adaptation dynamics is now interpreted as successive
699: application of group elements. The group representation (i.e., the
700: way the group elements apply on the functional components) is
701: determined by the model parametrization. If adaptation is
702: decomposed, this representation is reducible.}
703:
704:
705:
706:
707:
708: \subsection{Reference to related research}
709:
710:
711: \paragraph{Cross-talk.}
712: The inspiring work by Jacobs et al. \citeyear{jacobs:90} discusses the
713: implication of the choice of a multi-expert model on the learning
714: speed and generalization behavior. They formulate the idea of spatial
715: and temporal crosstalk, which denotes the statistical dependence
716: between the states of two different neurons or between the states of a
717: neuron at two different times. In our formalism, this crosstalk is
718: captured by the functional covariance---spatial for two indices $a$
719: and $b$ belonging to the same input $x \in X$, and temporal for two
720: indices of different input. They argue that such a crosstalk may be
721: undesirable and is avoided by explicitly separating neurons in
722: disjoint experts. As we will see below, selecting a multi-expert model
723: is a very intuitive way to explicitly declare an independence of
724: functional components and realize decomposed adaptation. In fact, the
725: separation into experts corresponds to a block matrix type functional
726: metric. (If the gating is also adaptive, the functional metric is
727: actually not a completely clean block matrix.)
728:
729: In the context of artificial neural networks, the term
730: \emph{catastrophic forgetting} has been used to describe negative
731: effects of coadaptation. See \cite{french:99} for a review.
732:
733: \paragraph{Information geometry.}
734: The methods applied in this paper are related to information geometry.
735: Let $Y=S_\n=[0,1]^{2^\n-1}$ be the $2^\n-1$ dimensional manifold of
736: probability distributions over $\{0,1\}^\n$ as defined by Arami
737: \citeyear{amari:00}. Then, the search space $\FF$ of mappings $X \to Y$
738: is the space of all conditional probabilities $p(y|x)$, $x\in X, y\in
739: Y$. Usually, one assumes the Fisher metric on $\FF$, not the
740: Euclidean. Thus, we would have to change the definition (\ref{parMet})
741: of the parameter metric into
742: \begin{align}
743: g_{ij}(w)=E\left[
744: \frac{\del\log p(x,y;w)}{\del w^i}\;
745: \frac{\del\log p(x,y;w)}{\del w^j}\right]\;,
746: \end{align}
747: where $E[.]$ denotes the expectation and $p(x,y;w)=p(y|x;w)\,p(x)$,
748: $p(y|x;w)=\P(w) \in \FF$. Arami \citeyear{amari:98} uses this metric
749: to define the natural gradient descent on the parameter space (which
750: actually is the covariant derivative instead of the contravariant).
751: The use of the natural gradient can also be motivated by a
752: spatio-temporal decorrelation \cite{choi:00}.
753:
754:
755: \paragraph{Evolutionary computation.}
756: It seems that in the field of evolutionary computation the discussion
757: of the covariance structure in the search space is much more
758: elaborated than in the field of neural computation (see
759: \citeNP{toussaint:01}). Roughly speaking, the goal of evolutionary
760: computation is to maximize the probability of good mutations during
761: evolutionary search. Eventually, fitness requires some phenotypic
762: traits to be mutated in correlation. Such correlations (coadaptation)
763: may be modeled explicitly in the search density of evolutionary
764: algorithms \cite{baluja:97,hansen:01,muehlenbein:99,pelikan:99}.
765: Alternatively, they may be induced implicitly by the choice of a good
766: parametrization of phenotypic traits---by a genotype-phenotype
767: mapping, which is in perfect analogy to the model parametrization
768: $\P$. Many research efforts focus on the choice or the understanding
769: of the genotype-phenotype mapping
770: \cite{stephens:99,toussaint:01,wagner:96}. In this view, functional
771: components $f^a$ may be compared to phenotypic traits, whereas
772: parameters relate to the genotype.
773:
774:
775:
776:
777:
778: \section{Example}\label{emp}
779:
780: Our test of the learning behavior is very simple: a regression of only
781: two patterns in $\{0,1\}^3$ has to be learned by mapping the first
782: pattern on $+1$ and the second on $-1$. However, we impose that these
783: patterns have to be learned \emph{online} where they alternate only
784: after they have been exposed for $100$ times in
785: succession.\footnote{This task is not meant as a performance test but
786: as an experimental setup to test our analytical methods. However,
787: similar effects of learning and unlearning occur in online learning
788: when a specific response is unlearned during the course of training
789: other responses for several time steps. In real world simulations it
790: is also plausible that stimuli remain unchanged for many time
791: steps.} We test two systems on this task: a standard feed-forward
792: neural network as described in detail in table \ref{FFN}, and a system
793: that involves a softmax layer as described in table \ref{monet}. The
794: parameters of both systems are initialized randomly by the normal
795: distribution $\NN(0,0.1)$ around zero with standard deviation $0.1$.
796: The two patterns were chosen as $110$ and $010$. Learning is realized
797: by a slow gradient descent with adaptation rate $2\cdot10^{-3}$ and
798: momentum $0.5$. The metric components are calculated from the
799: gradients.
800:
801: \begin{table}[t]
802: \fbox{\begin{colpage} The feed-forward neural network we investigate
803: here is 3-4-1-layered; layers are completely connected; the output
804: neurons are linear, the hidden ones implement the sigmoid
805: $\frac{1}{1+\exp(-10\, x)}$; only the hidden neurons have bias
806: terms.
807: \end{colpage}}
808: \caption{The Standard model}
809: \label{FFN}
810: \end{table}
811:
812: \begin{table}[t]
813: \fbox{\begin{colpage} The softmax model is the same as the standard
814: model with the exception that the four neurons in the hidden layer
815: \emph{compete} for activation: their output activations $y_i$ are
816: given by
817: \begin{align}
818: y_i=\frac{e^{30\, x_i}}{X} \comma
819: &x_i=\sum_{j \in\, \text{input}} w_{ij} y_j + w_i \;,\feed
820: &X=\sum_{i \in\, \text{hidden}} e^{30\, x_i} \;.
821: \label{gating}
822: \end{align}
823: Here, $w_{ij}$ and $w_i$ denote weight and bias parameters. The
824: exponent factor $30$ may be interpreted as rather low temperature,
825: i.e., high competition. The calculation of the gradient is a little
826: more involved than ordinary back-propagation but straightforward and
827: of same computational cost (see \cite{toussaint:02b}).
828: \end{colpage}}
829: \caption{The Softmax model}
830: \label{monet}
831: \end{table}
832:
833: \begin{figure}[t]\center
834: \psfrag{g_00}{\!\small $g^{00}$}
835: \psfrag{g_01}{\!\small $g^{01}$}
836: \psfrag{g_11}{\!\small $g^{11}$}
837: \psfrag{empirical}{\small measured}
838: \psfrag{estimated}{\small calculated}
839: \psfrag{trained}{\small trained}
840: \psfrag{untrained}{\small untrained}
841: %\includegraphics[scale=0.36]{data/forget3c.eps}
842: \includegraphics[width=\columnwidth]{forget3c.eps}
843: \caption{\emph{Test of the standard model.}\newline
844: \small For all four graphs the abscissa denotes the time
845: step.\newline \emph{Top:} The learning curves (errors) with respect
846: to both patterns are displayed. Only one of the patterns is
847: trained---alternating every 100 time steps. The error of the
848: untrained patterns increases.\newline \emph{Second:} The slope
849: (change of error per time step) of the untrained learning curve is
850: displayed. The dotted line refer to the measured slope of the upper
851: curve, the normal line is calculated according to equation
852: (\ref{deltaF}).\newline \emph{Third:} The slope (measured and
853: calculated) of the trained learning curve.\newline \emph{Bottom:}
854: The three components of the functional metric $g^{00}$, $g^{01}$,
855: $g^{11}$ are displayed in logarithmic scale. In particular the
856: cross-component $g^{01}$ is clearly non-vanishing.}
857: \label{curvesStd}
858: \end{figure}
859:
860:
861: \begin{figure}[t]\center
862: \psfrag{g_00}{\!\small $g^{00}$}
863: \psfrag{g_01}{\!\small $g^{01}$}
864: \psfrag{g_11}{\!\small $g^{11}$}
865: \psfrag{empirical}{\small measured}
866: \psfrag{estimated}{\small calculated}
867: \psfrag{trained}{\small trained}
868: \psfrag{untrained}{\small untrained}
869: %\includegraphics[scale=0.36]{data/forget2d.eps}
870: \includegraphics[width=\columnwidth]{forget2d.eps}
871: \caption{\emph{Test of the softmax model.} \newline
872: \small\emph{Top:} The learning curves (errors) with respect to both
873: patterns are displayed. The untrained patterns is scarcely
874: forgotten.\newline \emph{Second:} The slope (measured and
875: calculated) of the untrained learning curve nearly vanishes.\newline
876: \emph{Third:} The slope (measured and calculated) of the trained
877: learning curve.\newline \emph{Bottom:} The three components of the
878: functional metric $g^{00}$, $g^{01}$, $g^{11}$ (in logarithmic
879: scale). The cross-component $g^{01}$ is small, it decreases
880: significantly at time step 200.}
881: \label{curvesSoft}
882: \end{figure}
883:
884:
885: Please see Figures \ref{curvesStd} and \ref{curvesSoft} for the
886: results. For the standard neural model we observe some forgetting of
887: the untrained pattern during the training of the other. For the
888: softmax model, the error of the untrained pattern hardly increases.
889: The rate of forgetting, given by the slope of the error curve, is well
890: described by equation (\ref{deltaF}) and demonstrated by the graphs in
891: the middle. The bottom graphs display the functional metric components
892: and generally exhibit that the cross-component $g^{01}$, which is
893: responsible for coadaptation and forgetting, is quite large for the
894: standard model compared to the softmax model. Further, the softmax
895: model seems to learn the adaptation decomposition, as defined in
896: section \ref{def}, after the 200th time step. All these results reveal
897: that the standard model is not well-suited to solve the simple task
898: given and that the analysis of the model's functional metric provides
899: a formal way of understanding this phenomenon. Remarkably also, the
900: components $g^{00}$ and $g^{11}$ become significantly greater than $1$
901: during the training phase of the respective functional component. By
902: equation (\ref{deltaF}), this means that the ``effective'' adaptation
903: rate is larger than $2\cdot10^{-3}$.
904:
905:
906: One might object that the results given above rely on the random
907: initialization and on the specific task we chose. To analyze both
908: types of models in a more general way we perform another test. We
909: investigate the distribution of the functional metric components when
910: parameters are normally distributed by $\NN(0,0.1)$. Figure
911: \ref{distri} shows the distributions for both models. Clearly, the
912: standard model exhibits a Gauss-like distribution of the
913: cross-component $g^{01}$ with mean around $1.5$; a vanishing
914: cross-component $g^{01}$ is not very likely. On the other hand, the
915: softmax model exhibits two strong peaks at $g^{01}=0$ and $g^{01}=1$,
916: such that the probability for $g^{01}<0.1$ is larger than 10\%. These
917: distributions are generic properties of the two models.
918:
919:
920:
921:
922: \begin{figure}[t]\center
923: \psfrag{g00}{\!\small $g^{00}$}
924: \psfrag{g01}{\!\small $g^{01}$}
925: \psfrag{g11}{\!\small $g^{11}$}
926: \includegraphics[width=\columnwidth]{stat2.eps}
927: \caption{\emph{Distribution of metric components.}\newline
928: The distribution was calculated as a histogram of $1$ million
929: samples by using bins of size $\frac{1}{100}$. The ordinate is
930: scaled in ``percent of samples that fell into the bin''.\newline
931: \emph{Top:} The standard model. The probability of vanishing
932: cross-component $g^{01}$ is vary small.\newline \emph{Bottom:} The
933: softmax model. The inset graph is in logarithmic scale. The
934: probability of vanishing cross-component $g^{01}$ is fairly high.}
935: \label{distri}
936: \end{figure}
937:
938:
939:
940:
941:
942: \section{Toward evolutionary model selection}
943:
944: Finally, the question of how to select an appropriate model has not
945: yet been addressed. As discussed in the introduction, classical
946: approaches to model selection commonly introduce a penalty term in
947: order to reduce the model's complexity. Following this tradition we
948: could introduce a penalty term that reduces forgetting. Consider
949: \begin{align}
950: \sum_{ab} (g^{ab})^2 - \sum_a (g^{aa})^2 \;.
951: \end{align}
952: This is a measure of the cross-components in the functional metric.
953: Unfortunately, we cannot present any experiments with this model
954: selection criterion here. This approach is postponed to future
955: research.
956:
957: The original motivation for this work, though, was not to develop a
958: new model selection criterion as given by the above penalty term.
959: Instead we believe that the evolution of neural networks, as it
960: recently became an elaborated branch of research (see \cite{yao:99}
961: for a review), is actually a promising method of model selection.
962: However, most of these approaches focus on standard neural models,
963: i.e., the evolutionary search space is the space of ordinary
964: feed-forward neural networks (FFNNs) with arbitrary topology. The
965: belief is that the variety of topologies offers a variety of
966: functionally different models. The present paper is a critique of this
967: belief because it supports that the functional metric inherent of
968: FFNNs comprises significantly non-vanishing cross-components. This
969: implies that the variety of FFNNs with arbitrary topology is actually
970: not a great variety with respect to the functional metric. E.g., it
971: hardly includes models with vanishing cross-components and low rate of
972: forgetting. In conclusion, the search space has to be generalized to
973: contain also models with arbitrary functional metric in order to allow
974: for the selection of more optimal models. The presented softmax model
975: involving competitive interactions between neurons is a step in this
976: direction, but much motivation is left for future research toward the
977: generalization of the model search space and evolutionary methods to
978: select good models from this great variety. The model presented in
979: \cite{toussaint:02b} is one approach.
980:
981:
982:
983:
984: \section{Conclusion}
985:
986: We developed a new analytical approach to characterize a function
987: model and describe its learning properties. We focussed on functional
988: correlations in the adaptation process and derived the relation to the
989: functional metric of the model parametrization. The analysis can in
990: principal be applied on any kind of differentiable model (also
991: probabilistic, when formulated in terms of information geometry). Our
992: empirical studies illustrate the approach and demonstrate that
993: conventional neural network models are rather limited with respect to
994: their adaptation behavior: a task separation, i.e., decorrelated
995: adaptation to decorrelated data, is hardly possible. In contrast, a
996: model involving competitive interactions is more predisposed for task
997: decomposition. Thus, as we pointed out in the previous section, the
998: evolutionary approach to model selection should generalize the search
999: space to include not only standard feed-forward neural networks, but
1000: also models with arbitrary functional metrics, e.g., by allowing for
1001: competitive interactions.
1002:
1003:
1004: \subsection*{Acknowledgment}
1005:
1006: The author acknowledges support by the German Research Foundation DFG
1007: under grant \emph{SoleSys} SE 251/41-1.
1008:
1009: \small
1010: \bibliography{/home/mt/bibtex/bibs}
1011: \end{document}
1012: