\begin{abstract}
  This paper revisits a special type of neural network known under
  two names. In the statistics and machine learning community it is
  known as a multi-class logistic regression neural network. In the
  neural network community, it is simply the soft-max layer. Its
  importance is underscored by its role in deep learning: it serves as
  the last layer, whose output is the classification of the input
  patterns, such as images. Our exposition focuses on a mathematically
  rigorous derivation of the key equation expressing the gradient. A
  fringe benefit of our approach is a fully vectorized expression,
  which is the basis of an efficient implementation. The second result
  of this paper is the positivity of the second derivative of the
  cross-entropy loss function as a function of the weights. This result
  proves that optimization methods based on convexity may be used to
  train this network. As a corollary, we demonstrate that no
  $L^2$-regularizer is needed to guarantee convergence of gradient
  descent, provided that a global minimum of the loss function exists.
  We also provide an effective bound on the rate of convergence for
  two classes.
\end{abstract}
