hep-ph0108207/ANALYSIS/network.tex
1: \subsection{Pattern Recognition with Neural Networks \label{sec:net}}
2: We now turn to the more specialized part of the analysis where we shall seek
3: to extract signals for LLE, LQD, and MSSM SUSY scenarios 
4: separately using three neural networks trained to recognize the specific
5: event shapes associated with each scenario. The results of the analysis are
6: presented in section \ref{sec:results}. Here, we concentrate on the structure
7: and function of neural networks and their application to the present problem,
8: essentially one of pattern classification. Does this event ``look'' like a
9: Standard Model event, or does it look like a SUSY event?
10: 
11: \subsubsection{What Neural Networks Do} 
12: The first step in any analysis based on
13: cuts is to find variables that discriminate as well as possible 
14: to place cuts on; the second step is to find the
15: optimal \emph{placements} of the cuts. What neural networks do is 
16: to \emph{learn} 
17: which variables to use and where to place the cuts, based on a teaching 
18: sample of background and signal events. The simplest type of network consists
19: of a single neuron which computes a linear combination of the input variables
20: in the problem, in our case the discriminating variables just discussed. 
21: It then places a cut on this ``activation level'' and returns
22: 1 if the activation was above the cut value and 0 otherwise. In our case,
23: these two outputs would  correspond to the event having been classified as
24: either a signal or a background event. The learning
25: algorithm then serves to adjust the coefficients in the linear combination
26: and the placement of the cut according to the average error the network makes
27: over the learning sample such that next time it goes over the sample it
28: will classify more events correctly. The way this works is by a procedure
29: called ``gradient descent'' where the network calculates
30: the gradient of the error squared, or some other function that one
31: wishes to minimize, with respect to each network parameter. It then 
32: adjusts each parameter, taking a small step in parameter space 
33: in the minimizing direction each time it has processed an event or, to
34: decrease the effect of insignificant
35: fluctuations, it sums up the required changes over a number of processed
36: events before it applies them. This latter approach smooths out the
37: otherwise jittering movement of the network across parameter space, often 
38: allowing faster progress towards the minimum. 
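In symbols, the single-neuron classifier just described can be sketched as
follows. The notation ($x_i$ for the input variables, $c_i$ for the
coefficients, $c_0$ for the cut value, and $\Theta$ for the step function) is
introduced here purely for illustration and is not used elsewhere in this
work:
\begin{equation}
\mathcal{O}_{\mathrm{single}} = \Theta\!\left(\sum_i c_i x_i - c_0\right),
\qquad
\Theta(a) = \left\{\begin{array}{ll} 1 & \mbox{if } a > 0 \\ 0 &
\mbox{otherwise}\end{array}\right.
\end{equation}
Learning then amounts to adjusting the $c_i$ and $c_0$ so that as many events
as possible in the learning sample are classified correctly.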
39: 
40: For gradient descent to work, note that the neuron cannot be allowed to
41: compute a sharp cut on its activation level, since the step function 
42: is discontinuous and hence not
43: differentiable. Instead, one uses so-called sigmoidal functions which look
44: like smoothed out versions of the step function. We consider these
45: functions and the gradient descent algorithm in more detail below, yet let us
46: first extend our network beyond just a single neuron. 
47: 
48: In problems where the classification is
49: not quite so easy that it can be performed using a cut on just one linear
50: combination of the inputs, more neurons are needed, each one computing a 
51: sigmoidal function of its 
52: activation level, resulting in an output from each of these ``cut neurons'' 
53: between zero and one. These outputs 
54: then serve as inputs to the output neuron, which sums them up in
55: a new linear combination, the output neuron activation. In the present case,
56: this activation is used directly as the network output; alternatively, one may
57: let the output neuron compute a function of its input. The fact remains that the
58: computing power of the network lies in the cut neurons. Applying a function
59: to the output may or may not be convenient, but it cannot increase the amount of
60: information contained in the output value. It now also becomes apparent why
61: the cut neurons are customarily 
62: referred to as hidden neurons. The world outside the
63: network interacts with it by giving it input on the input neurons and by
64: reading the output from the output neuron. The cut neurons 
65: communicate only with other neurons. Henceforth, we refer to these internal
66: neurons as hidden neurons. 
67: 
68: The function of a neural network is thus nothing but 
69: a number of smoothed 
70: cuts on the same number of linear combinations of the inputs, with the
71: results of the cuts being used as variables in a last linear combination
72: defining the output of the network -- similar to what is being done
73: in an ordinary cut-based analysis. The benefit is that neural networks
74: automatically pick up correlations and anti-correlations between arbitrarily
75: many of the input variables. A hypothetical example of high-dimensional
76: correlations would be if we
77: imagine that many signal events have high jet multiplicities
78: when there is little \ET\ in the event, but that they have very few jets when
79: \ET\ is high. Furthermore, let us suppose that, at high \ET, a certain
80: fraction of signal
81: events with few jets have high lepton multiplicities, but that at low
82: \ET\ high lepton multiplicities would be a characteristic of background
83: events, unless there was also a high thrust in the event, or failing that at
84: least a high oblateness. These
85: correlations would of course follow from physical arguments related to the
86: processes involved in shaping the hypothetical 
87: background and signal processes in this example, and linear combinations of
88: the variables designed to make use of the correlations could be constructed and
89: optimized manually, yet this would be an extremely 
90: time-consuming task considering the more than 50 different scenarios
91: investigated in this work. Moreover, it is a task which neural networks are
92: ideally suited for by construction.
93: 
94: \subsubsection{Network Layout and Network Learning}
95: For each event to be processed, 
96: each of the discriminating variables defined above is presented as an input to
97: the network. Since the network is initialized with random weights between
98: 0 and 1, it is sensible to scale these inputs to typically 
99: lie in the range $[0,1]$ as well for faster learning. Otherwise, 
100: the input-to-hidden weights (the coefficients in the
101: linear combination mentioned above)
102: have to be corrected, possibly for a long time, until 
103: the right ball-park is found.
104: The input normalizations used here are listed in table \ref{tab:inputnorms}. 
105: \begin{table}[tb]
106: \begin{center}
107: \setlength{\extrarowheight}{0pt}
108: \begin{tabular}{cccccccc}\toprule
109: \boldmath$i$    & \bf1 & \bf2 & \bf3 & \bf4 & \bf5 & \bf6 & \bf7
110: \\
111: \boldmath$\mathrm{In}_{i}$&$\displaystyle\frac{\ETs}{200}$&$\displaystyle\frac{N_{\mathrm{jets}}}{15}$&
112: $\displaystyle\frac{N_{\mu}^{\mathrm{iso}}}{5}$&$\displaystyle\frac{N_e^{\mathrm{iso}}}{5}$& 
113: $\displaystyle\frac{P_{4C}}{500}$&\textup{Thrust}&\textup{Circularity}\\
114: \cmidrule{1-8}
115: \boldmath$i$&\bf8&\bf9&\bf10&\bf11&\bf12&\bf13&\bf14
116: \\\boldmath$\mathrm{In}_i$ & Oblateness
117: &$\displaystyle\frac{p_{T,\mathrm{jet}}^1}{100}$&$\displaystyle\frac{p_{T,\mathrm{jet}}^2}{100}$&$\displaystyle\frac{p_{T,\mathrm{jet}}^3}{100}$&$\displaystyle\frac{p_{T,\mathrm{jet}}^4}{100}$&$\displaystyle\frac{p_{T,\ell}^1}{100}$&$\displaystyle\frac{p_{T,\ell}^2}{100}$\\\bottomrule
118: \end{tabular}
119: \caption[\small Inputs to the neural net]{Inputs to the neural network and
120: their normalizations. In the text, $i$ is used as an index denoting 
121: input neurons and 
122: $\mathrm{In}_i$ the value of the input variable as given in this table. Note
123: that $\mathrm{In}_i$ is 
124: not necessarily identical to the
125: output of the input neuron which we denote by $I_i$. 
126: $P_{4C}$ is defined in section \ref{sec:lspdecsig},
127: $p_{T,\mathrm{jet}}^{1-4}$ are the transverse momenta of the four
128: hardest jets, and $p_{T,\ell}^{1-2}$ of the two hardest
129: leptons.\label{tab:inputnorms}} 
130: \end{center}
131: \vspace*{-\tfcapsep}\end{table} 
132: The hidden layer in applied networks 
133: normally has fewer neurons than the input layer, 
134: reflecting that some generalization can already be made at this stage: it
135: is not always necessary to form $N$ linear combinations of $N$ variables since
136: some mutual interdependence can usually be eliminated. In
137: the present analysis with 14 inputs, 
138: it is found that a network with 10 hidden neurons
139: performs with negligible loss of discriminating power compared to networks
140: with more hidden neurons. As described above, each hidden neuron computes a
141: sigmoidal function of its activation level, the name sigmoidal coming from the tilted
142: $S$ shape of these functions. The
143: particular sigmoidal used in this work is the logistic function (the most
144: commonly used). This function assigns an
145: output value for the $j$'th neuron in the hidden layer of:
146: \begin{equation}\vspace*{2mm}
147: H_j=\frac{1}{1+e^{-\left(\sum_{i=1}^{N_{\mathrm{in}}} 
148: (I_i w_{ij}) - \delta^H_j\right) }}\label{eq:logistic}\vspace*{1mm}
149: \end{equation}
150: where $I_i$ is the output of the $i$'th input neuron, 
151: $w_{ij}$ is the weight of the synapse
152: connecting input $i$ to hidden neuron $j$, $\delta^H_j$ is a bias term for the
153: hidden neuron, and $N_{\mathrm{in}}$ is the number of neurons in the input
154: layer. Henceforth, we follow the convention that subscript $i$ refers to the
155: input layer whereas subscript $j$ refers to the hidden layer. The slope of
156: the sigmoid is sometimes also adjusted by introducing a ``temperature'', $T$:
157: \begin{equation}
158: \vspace*{2mm}H_j=\frac{1}{1+e^{-\left(\sum_{i=1}^{N_{\mathrm{in}}} 
159: (I_i w_{ij}) - \delta^H_j\right)/T_j }}\vspace*{1mm}
160: \end{equation}
161: The effect of this modification is shown in figure \ref{fig:sigmoid}.
162: \begin{figure}[b]
163: \begin{center}
164: \includegraphics*[scale=0.6]{PLOTS/sigmoid.eps}
165: \caption[\small The logistic function]{The logistic (sigmoid) function for
166: $T=2$ (dashed), $T=1$ (solid), and $T=0.5$ (dot-dashed). \label{fig:sigmoid}}
167: \end{center}
168: \end{figure}
169: However, since introducing a temperature different from unity 
170: simply corresponds to rescaling all the weights connecting to $H_j$ and the
171: bias by a common factor $1/T_j$, there is nothing gained by introducing such a
172: parameter. Moreover, the network becomes slower and there is the risk that it
173: begins to oscillate between changing $T_j$ and rescaling the weights in the
174: learning procedure, and so we stick with eq.~(\ref{eq:logistic}). 
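To make this equivalence explicit (a short check, with primed symbols
introduced here only for illustration), define rescaled weights and a
rescaled bias:
\begin{equation}
w'_{ij} = \frac{w_{ij}}{T_j}, \qquad
\delta^{H\prime}_j = \frac{\delta^H_j}{T_j}
\quad\Longrightarrow\quad
\frac{1}{T_j}\left(\sum_{i=1}^{N_{\mathrm{in}}} I_i w_{ij} -
\delta^H_j\right) = \sum_{i=1}^{N_{\mathrm{in}}} I_i w'_{ij} -
\delta^{H\prime}_j
\end{equation}
A neuron with temperature $T_j$ therefore produces exactly the same output
$H_j$ as a neuron with unit temperature and the primed parameters.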
175: For the input neurons, only a bias is added to the value of the input variables:
176: \begin{equation}
177: I_i=\mathrm{In}_i - \delta^I_i
178: \end{equation}
179: Taking one more look at figure \ref{fig:sigmoid}, one also sees the truth of
180: the earlier comment that these functions look like smoothed out step
181: functions and so can be regarded as smooth versions of cuts. 
182: The complete network layout is depicted in figure
183: \ref{fig:netlayout}. 
184: \begin{figure}[h!]
185: \vspace*{5mm}
186: \begin{fmffile}{neural}
187: \begin{fmfgraph*}(320,150)
188: \fmfset{arrow_len}{2mm}
189: \fmftop{i1,i2,i3,i4,i5,i6,i7}
190: \fmfbottom{o}
191: \fmfforce{0.1w,0.43h}{v1}
192: \fmfforce{0.25w,0.43h}{v2}
193: \fmfforce{0.41w,0.43h}{v3}
194: \fmfforce{0.5w,0.43h}{v4}
195: \fmfforce{0.61w,0.43h}{v5}
196: \fmfforce{0.8w,0.43h}{v6}
197: \fmfforce{0.95w,0.7h}{wij}
198: \fmfforce{0.93w,0.43h}{tm}
199: \fmfforce{0.95w,0.1h}{oj}
200: \fmfforce{0.93w,0h}{tl}
201: \fmfforce{0.95w,1h}{tr}
202: \fmfv{label=$I_1$,d.sh=circ,d.siz=0.08w,d.fill=empty,foreground=(0.7,,0,,0),label.dist=0}{i1}
203: \fmfv{label=$I_2$,d.sh=circ,d.siz=0.08w,d.fill=empty,foreground=(0.8,,0.4,,0),label.dist=0}{i2}
204: \fmfv{label=\large\boldmath$...$}{i3}
205: \fmfv{label=$I_i$,d.sh=circ,d.siz=0.08w,d.fill=empty,foreground=(0.7,,0.7,,0),label.dist=0}{i4}
206: \fmfv{label=\large\boldmath$...$}{i5}
207: \fmfv{label=$I_{N_{\mathrm{in}}}$,d.sh=circ,d.siz=0.08w,d.fill=empty,foreground=(0.5,,0.7,,0),label.dist=0}{i6}
208: \fmfv{label=$H_1$,d.sh=square,d.siz=0.075w,d.fill=empty,foreground=(0,,0.75,,0),label.dist=0}{v1}
209: \fmfv{label=$H_2$,d.sh=square,d.siz=0.075w,d.fill=empty,foreground=(0,,0.7,,0.5),label.dist=0}{v2}
210: \fmfv{label=\large\boldmath$...$}{v3}
211: \fmfv{label=$H_j$,d.sh=square,d.siz=0.075w,label.ang=-30,d.fill=empty,foreground=(0,,0.5,,0.8),label.dist=0}{v4}
212: \fmfv{label=\large\boldmath$...$}{v5}
213: \fmfv{label=$H_{\!N_{\mathrm{hid}}}$,d.sh=square,d.siz=0.075w,d.fill=empty,foreground=(0,,0,,0.9),label.dist=0}{v6}
214: \fmfv{label=$\mathcal{O}$,d.sh=circ,d.siz=0.08w,d.fill=empty,foreground=black,label.dist=0}{o}
215: % I1
216: \fmf{fermion,right=0.3,foreground=(0.7,,0,,0)}{i1,v1}
217: \fmf{fermion,right=0.1,foreground=(0.7,,0,,0)}{i1,v2}
218: \fmf{fermion,foreground=(0.7,,0,,0)}{i1,v4}
219: \fmf{fermion,left=0.08,foreground=(0.7,,0,,0)}{i1,v6}
220: % I2
221: \fmf{fermion,right=0.3,foreground=(0.8,,0.4,,0)}{i2,v1}
222: \fmf{fermion,right=0.3,foreground=(0.8,,0.4,,0)}{i2,v2}
223: \fmf{fermion,left=0.05,foreground=(0.8,,0.4,,0)}{i2,v4}
224: \fmf{fermion,left=0.15,foreground=(0.8,,0.4,,0)}{i2,v6}
225: % Ii
226: \fmf{fermion,right=0.3,foreground=(0.7,,0.7,,0)}{i4,v1}
227: \fmf{fermion,right=0.3,foreground=(0.7,,0.7,,0)}{i4,v2}
228: \fmf{fermion,foreground=(0.7,,0.7,,0)}{i4,v4}
229: \fmf{fermion,left=0.2,foreground=(0.7,,0.7,,0)}{i4,v6}
230: % In
231: \fmf{fermion,right=0.2,foreground=(0.5,,0.7,,0)}{i6,v1}
232: \fmf{fermion,right=0.1,foreground=(0.5,,0.7,,0)}{i6,v2}
233: \fmf{fermion,left=0.05,foreground=(0.5,,0.7,,0)}{i6,v4}
234: \fmf{fermion,left=0.2,foreground=(0.5,,0.7,,0),label=$w_{ij}$,label.side=left}{i6,v6}
235: % H
236: \fmf{fermion,right=0.25,foreground=(0,,0.75,,0)}{v1,o}
237: \fmf{fermion,right=0.1,foreground=(0,,0.7,,0.5)}{v2,o}
238: \fmf{fermion,foreground=(0,,0.5,,0.8)}{v4,o}
239: \fmf{fermion,left=0.15,foreground=(0,,0,,0.9),label=$o_j$,label.side=left}{v6,o}
240: \fmfv{label=$I_i = \mathrm{In}_i - \delta^I_i$}{tr}
241: \fmfv{label=$H_j = \frac{1}{1+\exp\left(-\sum_i I_i w_{ij}+\delta^H_j\right)}$}{tm}
242: \fmfv{label=$\mathcal{O}=\frac{1}{N_{\mathrm{hid}}}\left(\sum_j H_jo_j - \delta^O\right)$}{tl}
243: \end{fmfgraph*}
244: \end{fmffile}
245: \vspace*{6mm}
246: \caption[\small Layout of the neural networks]{Layout of the neural
247: networks used in the final part of the analysis (before brain-damage -- see
248: below).  
249: \label{fig:netlayout}}
250: \end{figure}
251: \noindent 
252: 
253: \subsubsection{Network Training}
254: The neural networks used in this work all learn by adjusting their
255: synaptic weights and biases 
256: in the direction which minimizes the error squared of the network, the error
257: being defined on an event-by-event basis by:
258: \begin{equation}
259: e = t - \mathcal{O}
260: \end{equation} 
261: where $t$ is the target output (0 for background, 1 for signal), 
262: and $\mathcal{O}$ is the output that was actually obtained for the event.
263: 
264: As the name implies, the \emph{gradient descent} learning 
265: algorithm \cite{mcclelland89}
266: consists of a learning rule defined by taking a step in weight space in the
267: direction opposite to the gradient of the (squared) error:
268: \begin{equation}
269: p \to p + \Delta p = p - \alpha\frac{\partial e^2}{\partial p} = p + 2\alpha e
270: \frac{\partial \mathcal{O}}{\partial p}
271: \end{equation} 
272: where $p$ represents any of the adjustable parameters in the network, 
273: $e$ is the error on the output, and $\alpha$ is a learning
274: rate parameter specifying how large steps the network takes.
275: This rule does not necessarily have to be applied immediately after each
276: event in the learning sample has been processed. 
277: In order to wash out sharp, irregular 
278: changes called for by
279: extreme events in the distributions, possibly pulling in opposite directions,
280: it is better to accumulate the parameter changes over some processing
281: period before applying them. 
282: After the period, the \emph{accumulated} change to each
283: parameter is applied, and the accumulators are reset. 
284: In order to get as little statistical noise in
285: the parameter changes as possible, the period was here set to be the number of
286: events in the learning sample. Denoting event number by $n$,
287: the learning rule above becomes:
288: \begin{equation}
289: p \to p + \Delta p = p + 2\alpha \sum_ne(n)
290: \frac{\partial \mathcal{O}(n)}{\partial p}
291: \end{equation} 
292: One further improvement can be made on the learning rule. Seeing as two
293: successive changes to the parameters of the network 
294: often go in roughly the same direction in parameter
295: space, we add a mixture of the last parameter change to the current change, 
296: something which often increases the learning speed of the network:
297: \begin{equation}
298: p \to p + \Delta p = p + 2\alpha \sum_ne(n)
299: \frac{\partial \mathcal{O}(n)}{\partial p} + \beta \Delta p^{\mathrm{last}}
300: \label{eq:learnrule}
301: \end{equation} 
302: where $\Delta p^{\mathrm{last}}$ is the change which was made after the previous
303: period, and $\beta<1$ specifies the ``inertia'' of the system. 
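To see why the momentum term can speed up learning, consider the idealized
case (an illustrative assumption, not a property of the actual training) in
which the accumulated gradient term 
$g = 2\alpha \sum_n e(n)\,\partial \mathcal{O}(n)/\partial p$ 
stays constant over $k$ successive periods, starting from 
$\Delta p_0 = 0$. Eq.~(\ref{eq:learnrule}) then gives:
\begin{equation}
\Delta p_k = g + \beta \Delta p_{k-1} = g\sum_{m=0}^{k-1}\beta^m 
= g\,\frac{1-\beta^k}{1-\beta} 
\;\stackrel{k\to\infty}{\longrightarrow}\; \frac{g}{1-\beta}
\end{equation}
Along a persistent direction the effective step is thus enhanced by up to a
factor $1/(1-\beta)$, whereas contributions that change sign from period to
period partially cancel. The geometric series converges only for
$|\beta|<1$, which is why the inertia must be kept below unity.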
304: We are now ready to specify the learning rules for each of the parameters of
305: the network in fig.~\ref{fig:netlayout}. 
306: These can be easily derived (using the chain rule). 
307: \begin{equation}
308: \begin{array}{rclcrcl}\displaystyle
309: \frac{\partial \mathcal{O}}{\partial \delta^O} 
310: & = & \displaystyle-1/N_{\mathrm{hid}} & & \displaystyle\frac{\partial \mathcal{O}}{\partial o_j} 
311: & = & \displaystyle\frac{H_j}{N_{\mathrm{hid}}}
312:  \vspace*{3mm}\\ 
313: \displaystyle\frac{\partial \mathcal{O}}{\partial \delta^H_j} & = &
314: \displaystyle \frac{o_j}{N_{\mathrm{hid}}}H_j(H_j-1) & & \displaystyle\frac{\partial
315:   \mathcal{O}}{\partial w_{ij}} & = &
316: \displaystyle\frac{o_j}{N_{\mathrm{hid}}}  
317: H_j(1-H_j)I_i \vspace*{3mm}\\ \displaystyle
318: \frac{\partial \mathcal{O}}{\partial \delta_i^I} & = &\displaystyle
319: \frac{1}{N_{\mathrm{hid}}}\sum_j o_jH_j(H_j-1)w_{ij} 
320:  & \hspace*{5mm} & \\
321: \end{array}
322: \end{equation}
323: Substituting these quantities into eq.~(\ref{eq:learnrule}) directly 
324: yields the required learning rules. 
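As an illustration of how the derivatives above arise, consider the one with
respect to $w_{ij}$. Writing 
$a_j = \sum_i I_i w_{ij} - \delta^H_j$ for the activation of hidden neuron
$j$ (the symbol $a_j$ is introduced here only for this sketch), the logistic
function $H_j = 1/(1+e^{-a_j})$ obeys $dH_j/da_j = H_j(1-H_j)$. With
$\mathcal{O}$ as given in fig.~\ref{fig:netlayout}, the chain rule then
yields:
\begin{equation}
\frac{\partial \mathcal{O}}{\partial w_{ij}} 
= \frac{\partial \mathcal{O}}{\partial H_j}\,
\frac{dH_j}{da_j}\,
\frac{\partial a_j}{\partial w_{ij}}
= \frac{o_j}{N_{\mathrm{hid}}}\,H_j(1-H_j)\,I_i
\end{equation}
The derivatives with respect to the biases follow in the same way, the only
difference being the factor $\partial a_j/\partial p$.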
325: \subsubsection{Optimal Brain Damage}
326: As described above, the first derivatives of the squared error with respect to
327: the network parameters are used in training by gradient descent. 
328: It was shown by LeCun, Solla, and Denker 
329: \cite{lecun90} that the \emph{second} derivatives can be used to
330: trim the network by getting rid of the most redundant parameters. This is
331: desirable since
332: redundant parameters are the ones that will eventually cause the network to
333: overfit the learning sample. This can have a severe effect on the
334: generalization ability of the network, i.e.\ its performance on data
335: samples it has not been in contact with during the learning process. 
336: The idea
337: of Optimal Brain Damage
338: is to introduce a measure for how much the squared error will change as a
339: result of deleting each network parameter. The parameters whose deletion will
340: have the least effect can then be discarded if over-fitting is a problem. The
341: measure proposed in \cite{lecun90} is the \emph{saliency}, defined for the
342: network parameter $p$ as:
343: \begin{equation}
344: s_p = \frac{p^2}{2}\frac{\partial^2 e^2}{\partial p^2} 
345: \end{equation}
346: This definition is appropriate when the learning process is near its end and
347: the network is almost in the minimum (otherwise a more general formula
348: applies; see \cite{lecun90}). The procedure is quite simple. One begins with a
349: network that contains many parameters. One then trains it and deletes the
350: parameters with lowest saliency. 
351: The resulting, ``brain-damaged'', network is then retrained
352: until it converges. This procedure is repeated until a satisfactory 
353: compromise between the mean squared error and the generalization ability of
354: the network is found. As a side benefit, the finished 
355: network contains fewer parameters and is therefore faster to run. What
356: happens is typically that the smallest coefficients in the linear
357: combinations forming the hidden neuron activations get thrown away while the
358: larger coefficients are kept.
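The form of the saliency can be motivated by a Taylor expansion of the
squared error around the trained parameter values, following the argument of
\cite{lecun90}. In the sketch below, $E$ denotes the squared error over the
learning sample, $\delta p$ a perturbation of parameter $p$, and $g_p$,
$h_{pq}$ are shorthand (introduced here only for illustration) for the first
and second derivatives of $E$:
\begin{equation}
\delta E = \sum_p g_p\,\delta p + \frac{1}{2}\sum_p h_{pp}\,(\delta p)^2 
+ \frac{1}{2}\sum_{p\neq q} h_{pq}\,\delta p\,\delta q + \ldots
\end{equation}
Near a minimum the first-order terms vanish. If the off-diagonal and
higher-order terms are also neglected, deleting a single parameter, i.e.\
setting $\delta p = -p$, changes the error by approximately 
$\frac{1}{2}h_{pp}\,p^2$, which is the saliency $s_p$ defined above.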
359: The diagonal second
360: derivatives of the squared error with respect to the parameters of the present network are:
361: \begin{eqnarray}
362: \frac{\partial^2 e^2}{\partial (\delta^I_i)^2} & = & \frac{2}{N_{\mathrm{hid}}}
363:   \left[\frac{1}{N_{\mathrm{hid}}}\left(\sum_{j=1}^{N_{\mathrm{hid}}}\!w_{ij}H_jo_j(1\!-\!H_j)\right)^2\!\! - e
364:   \left(\sum_{j=1}^{N_{\mathrm{hid}}}\!w_{ij}^2o_jH_j(1\!-\!H_j(3\!-\!2H_j))\right)\right]\\
365: \frac{\partial^2 e^2}{\partial w_{ij}^2} & = &
366: \frac{2}{N_{\mathrm{hid}}}I_i^2H_jo_j\left(\frac{o_j}{N_{\mathrm{hid}}}H_j(1-H_j)^2-e(1-H_j(3-2H_j))\right)\\
367: \frac{\partial^2 e^2}{\partial (\delta^H_j)^2} & = & \frac{2}{N_{\mathrm{hid}}}H_jo_j\left(\frac{o_j}{N_{\mathrm{hid}}}H_j(1-H_j)^2-e(1-H_j(3-2H_j))\right)\\
368: \frac{\partial^2 e^2}{\partial o_j^2} & = & \frac{2H_j^2}{N_{\mathrm{hid}}^2}\\
369: \frac{\partial^2 e^2}{\partial (\delta^O)^2} & = & \frac{2}{N_{\mathrm{hid}}^2}
370: \end{eqnarray}
371: Each of the networks used thus started out with 14 inputs (= 14 biases), 
372: 10 hidden units (= 10 biases), 
373: 140 input-to-hidden synapses, 10 hidden-to-output synapses, and a bias on the
374: output for a total of 175 parameters. Approximately a quarter of 
375: the parameters can
376: be deleted with very little effect on the mean squared error of the networks
377: over the learning samples, but with an increase in convergence rate and
378: processing speed. 
379: An example of a finished, brain-damaged network, the one used for MSSM
380: recognition, is illustrated in figure \ref{fig:braindamage}. 
381: \begin{figure}[t]
382: \begin{center}
383: \input{APPENDICES/BD}
384: \caption[\small Sketch of the MSSM network]{Sketch of the MSSM
385: network after brain damage. 
386: Neurons with biases
387: still on are shown with the symbol $\delta$. The numbering of the inputs 
388: goes from left to right.\label{fig:braindamage}}
389: \end{center}
390: \end{figure}
391: 
392: To improve the convergence rate, two further refinements were made in the
393: design. Firstly, a fixed learning rate is not optimal since there may be
394: regions in ``error space'' which look like endless plateaus to one whose legs
395: are not long enough. The network was therefore equipped with the ability to
396: increase its learning rate if the relative change in the squared 
397: error (averaged over the learning sample) after a learning cycle 
398: is small. A little experimenting showed an error change of less than 
399: 5\% from cycle to cycle to be a good indicator of when a larger learning rate
400: was needed. 
401: Secondly, unlimited growth is not desirable. On the other side of
402: a plateau, a mountainous region may yet exist, and so a mechanism to decrease
403: the rate is also included. If the error 
404: change is greater than zero, meaning
405: that the error \emph{increases}, the network ``concludes'' that it has taken
406: too big a step, unlearns the direct effects of
407: the last learning step (``direct'' here meaning that the effects of the momentum
408: term are not unlearned) 
409: and goes forward again with a smaller learning rate. After some more
410: experimenting to reach a good balance between increase and decrease, this
411: technique proved highly efficient, typically improving the convergence rate
412: by factors of ten.
413: