\begin{abstract}
Although artificial neural networks have shown great promise in applications
including computer vision and speech recognition, there remains considerable
practical and theoretical difficulty in optimizing their parameters.
The seemingly unreasonable success of gradient descent methods in
minimizing these non-convex functions is still poorly understood. In
this work we offer some theoretical guarantees for networks with piecewise
affine activation functions, which have in recent years become the
norm. We prove three main results. First, the network is piecewise
convex as a function of the input data. Second, the network,
considered as a function of the parameters in a single layer with all
others held constant, is again piecewise convex. Third, the
network as a function of all its parameters is piecewise multi-convex,
a generalization of biconvexity. From here we characterize the local
minima and stationary points of the training objective, showing that
they minimize the objective over certain subsets of the parameter space.
We then analyze
the performance of two optimization algorithms on multi-convex problems:
gradient descent, and a method which repeatedly solves a number of
convex sub-problems. We prove necessary convergence conditions for
the first algorithm and both necessary and sufficient conditions for
the second, after introducing regularization to the objective. Lastly,
we remark on the remaining difficulty of the global optimization problem.
Under the squared error objective, we show that by varying the training
data, a single rectifier neuron admits local minima arbitrarily far
apart, both in objective value and in parameter space.
\end{abstract}