1: \begin{abstract}%
3: Stochastic gradient-based optimization is crucial for optimizing neural networks.
4: While popular approaches heuristically adapt the step size and direction by rescaling gradients, a more principled approach to improving optimizers requires second-order information.
5: Such methods precondition the gradient using the objective's Hessian.
6: Yet computing the Hessian is usually expensive, and using second-order information effectively in the stochastic gradient setting is non-trivial.
7: We propose using Information-Theoretic Trust Region Optimization (\ittr) for improved updates with uncertain second-order information.
8: By modeling the network parameters as a Gaussian distribution and using a Kullback-Leibler divergence-based trust region, our approach takes bounded steps accounting for the objective's curvature and uncertainty in the parameters.
9: Before each update, it solves the trust region problem for an optimal step size, resulting in a more stable and faster optimization process.
10: We approximate the diagonal elements of the Hessian from stochastic gradients using a simple recursive least squares approach, constructing a model of the expected Hessian over time using only first-order information.
11: We show that \ittr{} combines the fast convergence of adaptive moment-based optimization with the generalization capabilities of SGD.
12: \end{abstract}
13: