\begin{abstract}
Although artificial neural networks have shown great promise in applications
including computer vision and speech recognition, there remains considerable
practical and theoretical difficulty in optimizing their parameters.
The seemingly unreasonable success of gradient descent methods in
minimizing these non-convex functions remains poorly understood. In
this work we offer some theoretical guarantees for networks with piecewise
affine activation functions, which have in recent years become the
norm. We prove three main results. First, the network is piecewise
convex as a function of the input data. Second, the network,
considered as a function of the parameters in a single layer with all
others held constant, is again piecewise convex. Third, the
network as a function of all its parameters is piecewise multi-convex,
a generalization of biconvexity. From here we characterize the local
minima and stationary points of the training objective, showing that
they minimize the objective over certain subsets of the parameter
space. We then analyze
the performance of two optimization algorithms on multi-convex problems:
gradient descent, and a method that repeatedly solves a number of
convex sub-problems. We prove necessary convergence conditions for
the first algorithm and both necessary and sufficient conditions for
the second, after introducing regularization to the objective. Finally,
we remark on the remaining difficulty of the global optimization problem.
Under the squared error objective, we show that by varying the training
data, a single rectifier neuron admits local minima arbitrarily far
apart, both in objective value and in parameter space.
\end{abstract}
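
As a minimal illustration of the piecewise convexity described in the
abstract, consider a single rectifier neuron
$f(x; w) = \max(0, w^{T} x)$ trained under the squared error objective.
The notation used here ($w$, $x_i$, $y_i$, $S$, $C_S$) is assumed for
this sketch and is not taken from the paper's own definitions.
\[
  L(w) = \sum_{i=1}^{n} \bigl( \max(0, w^{T} x_i) - y_i \bigr)^{2} .
\]
Fix a subset $S \subseteq \{1, \dots, n\}$ of examples on which the
neuron is active, and let
\[
  C_S = \bigl\{ w : w^{T} x_i \ge 0 \text{ for } i \in S,\;
                    w^{T} x_i \le 0 \text{ for } i \notin S \bigr\} .
\]
Each $C_S$ is an intersection of half-spaces and is therefore convex.
On $C_S$ the objective reduces to
\[
  L(w) = \sum_{i \in S} \bigl( w^{T} x_i - y_i \bigr)^{2}
       + \sum_{i \notin S} y_i^{2} ,
\]
which is a convex quadratic in $w$. The sets $C_S$ cover the parameter
space, so $L$ is convex on each piece even though it need not be convex
globally.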