0108:hep-ph0108207/ANALYSIS/network.tex

1: \subsection{Pattern Recognition with Neural Networks \label{sec:net}}

2: We now turn to the more specialized part of the analysis where we shall seek

3: to extract signals for LLE, LQD, and MSSM SUSY scenarios

4: separately using three neural networks trained to recognize the specific

5: event shapes associated with each scenario. The results of the analysis are

6: presented in section \ref{sec:results}. Here, we concentrate on the structure

7: and function of neural networks and their application to the present problem,

8: essentially one of pattern classification. Does this event ``look'' like a

9: Standard Model event, or does it look like a SUSY event?

10:

11: \subsubsection{What Neural Networks Do}

12: The first step in any analysis based on

13: cuts is to find the best possible variables

14: to place cuts on; the second step is to find the

15: optimal \emph{placements} of the cuts. What neural networks do is

16: to \emph{learn}

17: which variables to use and where to place the cuts, based on a teaching

18: sample of background and signal events. The simplest type of network consists

19: of a single neuron which computes a linear combination of the input variables

20: in the problem, in our case the discriminating variables just discussed.

21: It then places a cut on this ``activation level'' and returns

22: 1 if the activation was above the cut value and 0 otherwise. In our case,

23: these two outputs would  correspond to the event having been classified as

24: either a signal or a background event. The learning

25: algorithm then serves to adjust the coefficients in the linear combination

26: and the placement of the cut according to the average error the network makes

27: over the learning sample such that next time it goes over the sample it

28: will classify more events correctly. The way this works is by a procedure

29: called ``gradient descent'' where the network calculates

30: the gradient of the error squared, or some other function that one

31: wishes to minimize, with respect to each network parameter. It then

32: adjusts each parameter, taking a small step in parameter space

33: in the minimizing direction each time it has processed an event or, to

34: decrease the effect of insignificant

35: fluctuations, it sums up the required changes over a number of processed

36: events before it applies them. This latter approach smooths out the

37: otherwise jittering movement of the network across parameter space, often

38: allowing faster progress towards the minimum.
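In symbols, the single-neuron classifier described above can be sketched as follows. The names $x_i$ (input variables), $c_i$ (coefficients), $a_{\mathrm{cut}}$ (cut value), and $\Theta$ (unit step function) are introduced here purely for illustration and are not part of the notation used in the rest of this section:
\begin{equation}
a=\sum_i c_i x_i, \qquad \mathrm{output}=\Theta(a-a_{\mathrm{cut}})=
\left\{\begin{array}{ll} 1 & \mbox{if } a>a_{\mathrm{cut}} \\ 0 & \mbox{otherwise}\end{array}\right.
\end{equation}
Learning then amounts to adjusting the $c_i$ and $a_{\mathrm{cut}}$.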

39:

40: For gradient descent to work, note that the neuron cannot be allowed to

41: compute a sharp cut on its activation level, since the step function

42: is discontinuous and hence not

43: differentiable. Instead, one uses so-called sigmoidal functions which look

44: like smoothed out versions of the step function. We consider these

45: functions and the gradient descent algorithm in more detail below, yet let us

46: first extend our network beyond just a single neuron.

47:

48: In problems where the classification is

49: not quite so easy that it can be performed using a cut on just one linear

50: combination of the inputs, more neurons are needed, each one computing a

51: sigmoidal function of its

52: activation level, resulting in an output from each of these ``cut neurons''

53: between zero and one. These outputs

54: then serve as inputs to the output neuron, which sums them up in

55: a new linear combination, the output neuron activation. In the present case,

56: this activation is used directly as the network output; alternatively, one may

57: let the output neuron compute a function of its input. It remains that the

58: computing power of the network lies in the cut neurons. A function may or may

59: not be handy to apply to the output, but it will not increase the amount of

60: information there is in the output value. It now also becomes apparent why

61: the cut neurons are customarily

62: referred to as hidden neurons. The world outside the

63: network interacts with it by giving it input on the input neurons and by

64: reading the output from the output neuron. The cut neurons

65: communicate only with other neurons. Henceforth, we refer to these internal

66: neurons as hidden neurons.

67:

68: The function of a neural network is thus nothing but

69: a number of smoothed

70: cuts on the same number of linear combinations of the inputs, with the

71: results of the cuts being used as variables in a last linear combination

72: defining the output of the network -- similar to what is being done

73: in an ordinary cut-based analysis. The benefit is that neural networks

74: automatically pick up correlations and anti-correlations between arbitrarily

75: many of the input variables. A hypothetical example of high-dimensional

76: correlations would be if we

77: imagine that many signal events have high jet multiplicities

78: when there is little \ET\ in the event, but that they have very few jets when

79: \ET\ is high. Furthermore, let us suppose that, at high \ET, a certain

80: fraction of signal

81: events with few jets have high lepton multiplicities, but that at low

82: \ET\ high lepton multiplicities would be a characteristic of background

83: events, unless there was also a high thrust in the event, or failing that at

84: least a high oblateness. These

85: correlations would of course follow from physical arguments related to the

86: processes involved in shaping the hypothetical

87: background and signal processes in this example, and linear combinations of

88: the variables designed to make use of the correlations could be constructed and

89: optimized manually, yet this would be an extremely

90: time-consuming task considering the more than 50 different scenarios

91: investigated in this work. Moreover, it is a task which neural networks are

92: ideally suited for by construction.

93:

94: \subsubsection{Network Layout and Network Learning}

95: For each event to be processed,

96: each of the discriminating variables defined above is presented as an input to

97: the network. Since the network is initialized with random weights between

98: 0 and 1, it is sensible to scale these inputs to typically

99: lie in the range $[0,1]$ as well for faster learning. Otherwise,

100: the input-to-hidden weights (the coefficients in the

101: linear combination mentioned above)

102: have to be corrected, possibly for a long time, until

103: the right ball-park is found.

104: The input normalizations used here are listed in table \ref{tab:inputnorms}.

105: \begin{table}[tb]

106: \begin{center}

107: \setlength{\extrarowheight}{0pt}

108: \begin{tabular}{cccccccc}\toprule

109: \boldmath$i$    & \bf1 & \bf2 & \bf3 & \bf4 & \bf5 & \bf6 & \bf7

110: \\

111: \boldmath$\mathrm{In}_{i}$&$\displaystyle\frac{\ETs}{200}$&$\displaystyle\frac{N_{\mathrm{jets}}}{15}$&

112: $\displaystyle\frac{N_{\mu}^{\mathrm{iso}}}{5}$&$\displaystyle\frac{N_e^{\mathrm{iso}}}{5}$&

113: $\displaystyle\frac{P_{4C}}{500}$&\textup{Thrust}&\textup{Circularity}\\

114: \cmidrule{1-8}

115: \boldmath$i$&\bf8&\bf9&\bf10&\bf11&\bf12&\bf13&\bf14

116: \\\boldmath$\mathrm{In}_i$ & Oblateness

117: &$\displaystyle\frac{p_{T,\mathrm{jet}}^1}{100}$&$\displaystyle\frac{p_{T,\mathrm{jet}}^2}{100}$&$\displaystyle\frac{p_{T,\mathrm{jet}}^3}{100}$&$\displaystyle\frac{p_{T,\mathrm{jet}}^4}{100}$&$\displaystyle\frac{p_{T,\ell}^1}{100}$&$\displaystyle\frac{p_{T,\ell}^2}{100}$\\\bottomrule

118: \end{tabular}

119: \caption[\small Inputs to the neural net]{Inputs to the neural network and

120: their normalizations. In the text, $i$ is used as an index denoting

121: input neurons and

122: $\mathrm{In}_i$ the value of the input variable as given in this table. Note

123: that $\mathrm{In}_i$ is

124: not necessarily identical to the

125: output of the input neuron which we denote by $I_i$.

126: $P_{4C}$ is defined in section \ref{sec:lspdecsig},

127: $p_{T,\mathrm{jet}}^{1-4}$ are the transverse momenta of the four

128: hardest jets, and $p_{T,\ell}^{1-2}$ of the two hardest

129: leptons.\label{tab:inputnorms}}

130: \end{center}

131: \vspace*{-\tfcapsep}\end{table}
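As an illustration of the normalizations in table \ref{tab:inputnorms}, consider a hypothetical event (the numbers are invented for this example) with 6 jets, 2 isolated muons, and a hardest jet with transverse momentum 80 in the units of the table. The corresponding inputs would be
\begin{equation}
\mathrm{In}_2=\frac{6}{15}=0.4, \qquad \mathrm{In}_3=\frac{2}{5}=0.4, \qquad
\mathrm{In}_9=\frac{80}{100}=0.8,
\end{equation}
all of order unity. A sufficiently hard jet can still give an input above one, which is why the inputs lie only \emph{typically} in the range $[0,1]$.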

132: The hidden layer in most applied networks

133: normally has fewer neurons than the input layer,

134: reflecting the fact that some generalization can already be made at this stage: it

135: is not always necessary to form $N$ linear combinations of $N$ variables since

136: some mutual interdependence can usually be eliminated. In

137: the present analysis with 14 inputs,

138: it is found that a network with 10 hidden neurons

139: performs with negligible loss of discriminating power compared to networks

140: with more hidden neurons. As described above, each hidden neuron computes a

141: sigmoidal function of its activation level, the name sigmoidal coming from the tilted

142: $S$ shape of these functions. The

143: particular sigmoidal used in this work is the logistic function (the most

144: commonly used). This function assigns an

145: output value for the $j$'th neuron in the hidden layer of:

146: \begin{equation}\vspace*{2mm}

147: H_j=\frac{1}{1+e^{-\left(\sum_{i=1}^{N_{\mathrm{in}}}

148: (I_i w_{ij}) - \delta^H_j\right) }}\label{eq:logistic}\vspace*{1mm}

149: \end{equation}

150: where $I_i$ is the output of the $i$'th input neuron,

151: $w_{ij}$ is the weight of the synapse

152: connecting input $i$ to hidden neuron $j$, $\delta^H_j$ is a bias term for the

153: hidden neuron, and $N_{\mathrm{in}}$ is the number of neurons in the input

154: layer. Henceforth, we follow the convention that subscript $i$ refers to the

155: input layer whereas subscript $j$ refers to the hidden layer. The slope of

156: the sigmoid is sometimes also adjusted by introducing a ``temperature'', $T$:

157: \begin{equation}

158: \vspace*{2mm}H_j=\frac{1}{1+e^{-\left(\sum_{i=1}^{N_{\mathrm{in}}}

159: (I_i w_{ij}) - \delta^H_j\right)/T_j }}\vspace*{1mm}

160: \end{equation}

161: The effect of this modification is shown in figure \ref{fig:sigmoid}.

162: \begin{figure}[b]

163: \begin{center}

164: \includegraphics*[scale=0.6]{PLOTS/sigmoid.eps}

165: \caption[\small The logistic function]{The logistic (sigmoid) function for

166: $T=2$ (dashed), $T=1$ (solid), and $T=0.5$ (dot-dashed). \label{fig:sigmoid}}

167: \end{center}

168: \end{figure}

169: However, since introducing a temperature different from unity

170: simply corresponds to rescaling all the weights connecting to $H_j$ and the

171: bias by a common factor $1/T_j$, there is nothing gained by introducing such a

172: parameter. Moreover, the network becomes slower and there is the risk that it

173: begins to oscillate between changing $T_j$ and rescaling the weights in the

174: learning procedure, and so we stick with eq.~(\ref{eq:logistic}).
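The equivalence can be made explicit. Writing $\tilde{w}_{ij}$ and $\tilde{\delta}^H_j$ (symbols introduced here for illustration only) for the rescaled weights and bias, the exponent of the sigmoid with temperature becomes
\begin{equation}
\frac{1}{T_j}\left(\sum_{i=1}^{N_{\mathrm{in}}} I_i w_{ij} - \delta^H_j\right)
= \sum_{i=1}^{N_{\mathrm{in}}} I_i \tilde{w}_{ij} - \tilde{\delta}^H_j, \qquad
\tilde{w}_{ij}=\frac{w_{ij}}{T_j}, \quad \tilde{\delta}^H_j=\frac{\delta^H_j}{T_j},
\end{equation}
so a network with $T_j\neq 1$ produces exactly the same outputs as a network with $T_j=1$, weights $\tilde{w}_{ij}$, and bias $\tilde{\delta}^H_j$.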

175: For the input neurons, only a bias is subtracted from the value of the input variables:

176: \begin{equation}

177: I_i=\mathrm{In}_i - \delta^I_i

178: \end{equation}

179: Taking one more look at figure \ref{fig:sigmoid}, one also sees the validity of

180: the earlier comment that these functions look like smoothed out step

181: functions and so can be regarded as smooth versions of cuts.

182: The complete network layout looks as depicted in figure

183: \ref{fig:netlayout}.

184: \begin{figure}[h!]

185: \vspace*{5mm}

186: \begin{fmffile}{neural}

187: \begin{fmfgraph*}(320,150)

188: \fmfset{arrow_len}{2mm}

189: \fmftop{i1,i2,i3,i4,i5,i6,i7}

190: \fmfbottom{o}

191: \fmfforce{0.1w,0.43h}{v1}

192: \fmfforce{0.25w,0.43h}{v2}

193: \fmfforce{0.41w,0.43h}{v3}

194: \fmfforce{0.5w,0.43h}{v4}

195: \fmfforce{0.61w,0.43h}{v5}

196: \fmfforce{0.8w,0.43h}{v6}

197: \fmfforce{0.95w,0.7h}{wij}

198: \fmfforce{0.93w,0.43h}{tm}

199: \fmfforce{0.95w,0.1h}{oj}

200: \fmfforce{0.93w,0h}{tl}

201: \fmfforce{0.95w,1h}{tr}

202: \fmfv{label=$I_1$,d.sh=circ,d.siz=0.08w,d.fill=empty,foreground=(0.7,,0,,0),label.dist=0}{i1}

203: \fmfv{label=$I_2$,d.sh=circ,d.siz=0.08w,d.fill=empty,foreground=(0.8,,0.4,,0),label.dist=0}{i2}

204: \fmfv{label=\large\boldmath$...$}{i3}

205: \fmfv{label=$I_i$,d.sh=circ,d.siz=0.08w,d.fill=empty,foreground=(0.7,,0.7,,0),label.dist=0}{i4}

206: \fmfv{label=\large\boldmath$...$}{i5}

207: \fmfv{label=$I_{N_{\mathrm{in}}}$,d.sh=circ,d.siz=0.08w,d.fill=empty,foreground=(0.5,,0.7,,0),label.dist=0}{i6}

208: \fmfv{label=$H_1$,d.sh=square,d.siz=0.075w,d.fill=empty,foreground=(0,,0.75,,0),label.dist=0}{v1}

209: \fmfv{label=$H_2$,d.sh=square,d.siz=0.075w,d.fill=empty,foreground=(0,,0.7,,0.5),label.dist=0}{v2}

210: \fmfv{label=\large\boldmath$...$}{v3}

211: \fmfv{label=$H_j$,d.sh=square,d.siz=0.075w,label.ang=-30,d.fill=empty,foreground=(0,,0.5,,0.8),label.dist=0}{v4}

212: \fmfv{label=\large\boldmath$...$}{v5}

213: \fmfv{label=$H_{\!N_{\mathrm{hid}}}$,d.sh=square,d.siz=0.075w,d.fill=empty,foreground=(0,,0,,0.9),label.dist=0}{v6}

214: \fmfv{label=$\mathcal{O}$,d.sh=circ,d.siz=0.08w,d.fill=empty,foreground=black,label.dist=0}{o}

215: % I1

216: \fmf{fermion,right=0.3,foreground=(0.7,,0,,0)}{i1,v1}

217: \fmf{fermion,right=0.1,foreground=(0.7,,0,,0)}{i1,v2}

218: \fmf{fermion,foreground=(0.7,,0,,0)}{i1,v4}

219: \fmf{fermion,left=0.08,foreground=(0.7,,0,,0)}{i1,v6}

220: % I2

221: \fmf{fermion,right=0.3,foreground=(0.8,,0.4,,0)}{i2,v1}

222: \fmf{fermion,right=0.3,foreground=(0.8,,0.4,,0)}{i2,v2}

223: \fmf{fermion,left=0.05,foreground=(0.8,,0.4,,0)}{i2,v4}

224: \fmf{fermion,left=0.15,foreground=(0.8,,0.4,,0)}{i2,v6}

225: % Ii

226: \fmf{fermion,right=0.3,foreground=(0.7,,0.7,,0)}{i4,v1}

227: \fmf{fermion,right=0.3,foreground=(0.7,,0.7,,0)}{i4,v2}

228: \fmf{fermion,foreground=(0.7,,0.7,,0)}{i4,v4}

229: \fmf{fermion,left=0.2,foreground=(0.7,,0.7,,0)}{i4,v6}

230: % In

231: \fmf{fermion,right=0.2,foreground=(0.5,,0.7,,0)}{i6,v1}

232: \fmf{fermion,right=0.1,foreground=(0.5,,0.7,,0)}{i6,v2}

233: \fmf{fermion,left=0.05,foreground=(0.5,,0.7,,0)}{i6,v4}

234: \fmf{fermion,left=0.2,foreground=(0.5,,0.7,,0),label=$w_{ij}$,label.side=left}{i6,v6}

235: % H

236: \fmf{fermion,right=0.25,foreground=(0,,0.75,,0)}{v1,o}

237: \fmf{fermion,right=0.1,foreground=(0,,0.7,,0.5)}{v2,o}

238: \fmf{fermion,foreground=(0,,0.5,,0.8)}{v4,o}

239: \fmf{fermion,left=0.15,foreground=(0,,0,,0.9),label=$o_j$,label.side=left}{v6,o}

240: \fmfv{label=$I_i = \mathrm{In}_i - \delta^I_i$}{tr}

241: \fmfv{label=$H_j = \frac{1}{1+\exp(-(\sum_i I_i w_{ij}-\delta^H_j))}$}{tm}

242: \fmfv{label=$\mathcal{O}=\frac{1}{N_{\mathrm{hid}}}\left(\sum_j H_jo_j - \delta^O\right)$}{tl}

243: \end{fmfgraph*}

244: \end{fmffile}

245: \vspace*{6mm}

246: \caption[\small Layout of the neural networks]{Layout of the neural

247: networks used in the final part of the analysis (before brain-damage -- see

248: below).

249: \label{fig:netlayout}}

250: \end{figure}

251: \noindent

252:

253: \subsubsection{Network Training}

254: The neural networks used in this work all learn by adjusting their

255: synaptic weights and biases

256: in the direction which minimizes the error squared of the network, the error

257: being defined on an event-by-event basis by:

258: \begin{equation}

259: e = t - \mathcal{O}

260: \end{equation}

261: where $t$ is the target output (0 for background, 1 for signal),

262: and $\mathcal{O}$ is the output that was actually obtained for the event.

263:

264: As the name implies, the \emph{gradient descent} learning

265: algorithm \cite{mcclelland89}

266: consists of a learning rule defined by taking a step in weight space in the

267: direction opposite to the gradient of the (squared) error:

268: \begin{equation}

269: p \to p + \Delta p = p - \alpha\frac{\partial e^2}{\partial p} = p + 2\alpha e

270: \frac{\partial \mathcal{O}}{\partial p}

271: \end{equation}

272: where $p$ represents any of the adjustable parameters in the network,

273: $e$ is the error on the output, and $\alpha$ is a learning

274: rate parameter specifying how large steps the network takes.

275: This rule does not necessarily have to be applied immediately after each

276: event in the learning sample has been processed.

277: In order to wash out sharp, irregular

278: changes called for by

279: extreme events in the distributions, possibly pulling in opposite directions,

280: we do better to accumulate the parameter changes over some processing

281: period before invoking them.

282: After the period, the \emph{accumulated} change to each

283: parameter is invoked, and the accumulated error is reset.

284: In order to get as little statistical noise in

285: the parameter changes as possible, the period was here set to be the number of

286: events in the learning sample. Denoting event number by $n$,

287: the learning rule above becomes:

288: \begin{equation}

289: p \to p + \Delta p = p + 2\alpha \sum_ne(n)

290: \frac{\partial \mathcal{O}(n)}{\partial p}

291: \end{equation}

292: One further improvement can be made on the learning rule. Seeing as two

293: successive changes to the parameters of the network

294: often go in roughly the same direction in parameter

295: space, we add a mixture of the last parameter change to the current change,

296: something which often increases the learning speed of the network:

297: \begin{equation}

298: p \to p + \Delta p = p + 2\alpha \sum_ne(n)

299: \frac{\partial \mathcal{O}(n)}{\partial p} + \beta \Delta p^{\mathrm{last}}

300: \label{eq:learnrule}

301: \end{equation}

302: where $\Delta p^{\mathrm{last}}$ is the change which was made after the previous

303: period, and $\beta<1$ specifies the ``inertia'' of the system.
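To see what the inertia term does, consider the idealized case (an assumption made only for this illustration) in which $0\le\beta<1$ and the accumulated gradient term, denoted $g$ here, is the same in every period. Starting from $\Delta p_0=0$, the change applied after the $k$'th period is then
\begin{equation}
\Delta p_k = g + \beta\,\Delta p_{k-1}
= g\left(1+\beta+\beta^2+\dots+\beta^{k-1}\right)
\;\to\; \frac{g}{1-\beta} \quad (k\to\infty).
\end{equation}
With $\beta=0.9$, for example, the effective step along a persistent direction grows by up to a factor of ten. If the gradient term instead alternates strictly between $+g$ and $-g$, the same recursion settles at a magnitude of $g/(1+\beta)$, so that oscillating contributions are damped.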

304: We are now ready to specify the learning rules for each of the parameters of

305: the network in fig.~\ref{fig:netlayout}.

306: These can be easily derived (using the chain rule).

307: \begin{equation}

308: \begin{array}{rclcrcl}\displaystyle

309: \frac{\partial \mathcal{O}}{\partial \delta^O}

310: & = & \displaystyle-1/N_{\mathrm{hid}} & & \displaystyle\frac{\partial \mathcal{O}}{\partial o_j}

311: & = & \displaystyle\frac{H_j}{N_{\mathrm{hid}}}

312:  \vspace*{3mm}\\

313: \displaystyle\frac{\partial \mathcal{O}}{\partial \delta^H_j} & = &

314: \displaystyle\frac{o_j}{N_{\mathrm{hid}}}H_j(H_j-1) & & \displaystyle\frac{\partial

315:   \mathcal{O}}{\partial w_{ij}} & = &

316: \displaystyle\frac{o_j}{N_{\mathrm{hid}}}

317: H_j(1-H_j)I_i \vspace*{3mm}\\ \displaystyle

318: \frac{\partial \mathcal{O}}{\partial \delta_i^I} & = &\displaystyle

319: \frac{1}{N_{\mathrm{hid}}}\sum_j o_jH_j(H_j-1)w_{ij}

320:  & \hspace*{5mm} & \\

321: \end{array}

322: \end{equation}

323: Substituting these quantities back into eq.~(\ref{eq:learnrule}) directly

324: yields the required learning rules.
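As an example of how these derivatives arise, take $w_{ij}$. Denote the activation of hidden neuron $j$ by $a_j=\sum_i I_i w_{ij}-\delta^H_j$ (a shorthand introduced here), so that $H_j=1/(1+e^{-a_j})$ and $dH_j/da_j=H_j(1-H_j)$. With the network output $\mathcal{O}$ as given in fig.~\ref{fig:netlayout}, the chain rule gives
\begin{equation}
\frac{\partial\mathcal{O}}{\partial w_{ij}}
= \frac{\partial\mathcal{O}}{\partial H_j}\,\frac{dH_j}{da_j}\,
\frac{\partial a_j}{\partial w_{ij}}
= \frac{o_j}{N_{\mathrm{hid}}}\,H_j(1-H_j)\,I_i ,
\end{equation}
and the corresponding learning rule from eq.~(\ref{eq:learnrule}) reads
\begin{equation}
\Delta w_{ij} = \frac{2\alpha}{N_{\mathrm{hid}}}\sum_n e(n)\,o_j\,H_j(n)
\left(1-H_j(n)\right)I_i(n) + \beta\,\Delta w_{ij}^{\mathrm{last}} .
\end{equation}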

325: \subsubsection{Optimal Brain Damage}

326: As described above, the first derivatives of the squared error with respect to

327: the network parameters are used in training by gradient descent.

328: It was shown by LeCun, Solla, and Denker

329: \cite{lecun90} that the \emph{second} derivatives can be used to

330: trim the network by getting rid of the most redundant parameters. This is

331: desirable since

332: redundant parameters are the ones that will eventually cause the network to

333: overfit the sample space. This can have a severe effect on the

334: generalization ability of the network, i.e.\ its performance on data

335: samples it has not been in contact with during the learning process.

336: The idea

337: of Optimal Brain Damage

338: is to introduce a measure for how much the squared error will change as a

339: result of deleting each network parameter. The parameters whose deletion will

340: have the least effect can then be discarded if over-fitting is a problem. The

341: measure proposed in \cite{lecun90} is the \emph{saliency}, defined for the

342: network parameter $p$ as:

343: \begin{equation}

344: s_p = \frac{p^2}{2}\frac{\partial^2 e^2}{\partial p^2}

345: \end{equation}

346: This definition is appropriate when the learning process is near its end and

347: the network is almost in the minimum (otherwise a more general formula would

348: apply; see \cite{lecun90}). The procedure is quite simple. One begins with a

349: network that contains many parameters. One then trains it and deletes the

350: parameters with lowest saliency.

351: The resulting, ``brain-damaged'', network is then retrained

352: until it converges. This procedure is repeated until a satisfactory

353: compromise between the mean squared error and the generalization ability of

354: the network is found. As a side benefit, the finished

355: network contains fewer parameters and is therefore faster to run. What

356: happens is typically that the smallest coefficients in the linear

357: combinations forming the hidden neuron activations get thrown away while the

358: larger coefficients are kept.

359: The diagonal second

360: derivatives of the parameters for the present network are:

361: \begin{eqnarray}

362: \frac{\partial^2 e^2}{\partial (\delta^I_i)^2} & = & \frac{2}{N_{\mathrm{hid}}}

363:   \left[\frac{1}{N_{\mathrm{hid}}}\left(\sum_{j=1}^{N_{\mathrm{hid}}}\!w_{ij}H_jo_j(1\!-\!H_j)\right)^2\!\! - e

364:   \left(\sum_{j=1}^{N_{\mathrm{hid}}}\!w_{ij}^2o_jH_j(1\!-\!H_j(3\!-\!2H_j))\right)\right]\\

365: \frac{\partial^2 e^2}{\partial w_{ij}^2} & = &

366: \frac{2}{N_{\mathrm{hid}}}I_i^2H_jo_j\left(\frac{o_jH_j(1-H_j)^2}{N_{\mathrm{hid}}}-e(1-H_j(3-2H_j))\right)\\

367: \frac{\partial^2 e^2}{\partial (\delta^H_j)^2} & = & \frac{2}{N_{\mathrm{hid}}}H_jo_j\left(\frac{o_jH_j(1-H_j)^2}{N_{\mathrm{hid}}}-e(1-H_j(3-2H_j))\right)\\

368: \frac{\partial^2 e^2}{\partial o_j^2} & = & \frac{2H_j^2}{N_{\mathrm{hid}}^2}\\

369: \frac{\partial^2 e^2}{\partial (\delta^O)^2} & = & \frac{2}{N_{\mathrm{hid}}^2}

370: \end{eqnarray}
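Expressions of this kind can be checked against a general identity. Since $e=t-\mathcal{O}$ and the target $t$ does not depend on the network parameters, differentiating $e^2$ twice with respect to any parameter $p$ gives
\begin{equation}
\frac{\partial^2 e^2}{\partial p^2}
= 2\left(\frac{\partial\mathcal{O}}{\partial p}\right)^2
- 2e\,\frac{\partial^2\mathcal{O}}{\partial p^2} .
\end{equation}
For parameters on which $\mathcal{O}$ depends linearly, here $o_j$ and $\delta^O$, the second term vanishes. For the logistic function, the second derivative with respect to the activation is $H_j(1-H_j)(1-2H_j)$, which is the origin of the factor $1-H_j(3-2H_j)=(1-H_j)(1-2H_j)$.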

371: Each of the networks used thus started out with 14 inputs (= 14 biases),

372: 10 hidden units (= 10 biases),

373: 140 input-to-hidden synapses, 10 hidden-to-output synapses, and a bias on the

374: output for a total of 175 parameters. Approximately a quarter of

375: the parameters can

376: be deleted with very little effect on the mean squared error of the networks

377: over the learning samples, but with an increase in convergence rate and

378: processing speed.
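The parameter count can be verified directly. For a fully connected network of this layout with $N_{\mathrm{in}}$ inputs and $N_{\mathrm{hid}}$ hidden neurons, and with $N_{\mathrm{par}}$ a symbol introduced here for the total,
\begin{equation}
N_{\mathrm{par}} = \underbrace{N_{\mathrm{in}}}_{\delta^I_i}
+ \underbrace{N_{\mathrm{hid}}}_{\delta^H_j}
+ \underbrace{N_{\mathrm{in}}N_{\mathrm{hid}}}_{w_{ij}}
+ \underbrace{N_{\mathrm{hid}}}_{o_j}
+ \underbrace{1}_{\delta^O}
= 14+10+140+10+1 = 175 .
\end{equation}
Deleting roughly a quarter of these leaves on the order of 130 parameters.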

379: An example of a finished, brain-damaged network, the one used for MSSM

380: recognition, is illustrated in figure \ref{fig:braindamage}.

381: \begin{figure}[t]

382: \begin{center}

383: \input{APPENDICES/BD}

384: \caption[\small Sketch of the MSSM network]{Sketch of the MSSM

385: network after brain damage.

386: Neurons with biases

387: still on are shown with the symbol $\delta$. The numbering of the inputs

388: goes from left to right.\label{fig:braindamage}}

389: \end{center}

390: \end{figure}

391:

392: To improve the convergence rate, two further improvements were made in the

393: design. Firstly, a fixed learning rate is not optimal since there may be

394: regions in ``error space'' which look like endless plateaus to one whose legs

395: are not long enough. The network was therefore equipped with the ability to

396: increase its learning rate if the relative change in the squared

397: error (averaged over the learning sample) after a learning cycle

398: is small. A little experimenting showed an error change of less than

399: 5\% from cycle to cycle to be a good indicator of when a larger learning rate

400: was needed.

401: Secondly, unlimited growth is not desirable. At the other side of

402: a plateau, a mountainous region may yet exist, and so a mechanism to decrease

403: the rate is also included. If the error

404: change is greater than zero, meaning

405: that the error \emph{increases}, the network ``concludes'' that it has taken

406: too big a step, unlearns the direct effects of

407: the last learning step (``direct'' here meaning that the effects of the momentum

408: term are not unlearned)

409: and goes forward again with a smaller learning rate. After some more

410: experimenting to reach a good balance between increase and decrease, this

411: technique proved highly efficient, typically improving the convergence rate

412: by factors of ten.
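Schematically, the adaptive learning rate can be summarized as follows. The symbols $E_k$ (squared error averaged over the learning sample after cycle $k$), $\kappa_\uparrow>1$, and $\kappa_\downarrow<1$ (growth and reduction factors) are introduced here for illustration. The values of the factors are not specified in the text, and the precise form of the 5\% condition is one possible reading of the description above:
\begin{equation}
\alpha \to \left\{\begin{array}{ll}
\kappa_\downarrow\,\alpha \ \ \mbox{(after undoing the last direct step)} &
\mbox{if } E_k > E_{k-1} \\[1mm]
\kappa_\uparrow\,\alpha &
\mbox{if } 0 \le (E_{k-1}-E_k)/E_{k-1} < 0.05 \\[1mm]
\alpha & \mbox{otherwise}
\end{array}\right.
\end{equation}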

413: