\begin{abstract}
We study whether a depth-two neural network can learn another
depth-two network using gradient descent.
% We study the problem of learning a depth two neural network with
% another
% randomly initialized
 % depth two network using gradient descent.
  % We study the problem of learning a function using a neural network
  % of a certain depth and width assuming it can be represented using
  % such a network.
Assuming a linear output node,
% the output node of the network
% is linear,
we show that
% We show that for networks of depth two with certain
%   simplifying assumptions
the question of whether gradient descent converges to the
target function is equivalent to the following question in
electrodynamics:
Given $k$ fixed protons in $\rea^d$ and $k$ electrons,
% initialized at random positions
% with the electrons moving due to
% under the influence of the
%electrical
each moving due to the attractive force from the protons and the repulsive
force from the remaining electrons,
%. The question of convergence, then, is
will all the electrons be matched up with
%to all the
the protons, up to a permutation, at equilibrium?
Under the standard electrical
force, this follows from Earnshaw's classic theorem. In our setting,
the force is
% If the force function between a pair of
% charges is not given by the standard electrical force of $1/r^2$
% (where $r$ is the distance between unit charges), but by another
% function that is
determined by the activation function and the
input distribution.
Building on this equivalence, we prove the
existence of an activation function such that
% the corresponding
gradient descent learns
% dynamics
% result in learning
at least one of the
hidden nodes in the target network.
Iterating, we show that gradient
descent can be used to learn the entire network one node at a time.
\end{abstract}
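
To make the stated equivalence concrete, the following is a minimal sketch under simplifying assumptions that are ours and are not part of the statement above: the target is $f(x)=\sum_{j=1}^{k}\sigma(x,w_j)$, the learner is $\hat f(x)=\sum_{i=1}^{k}\sigma(x,\theta_i)$ with all output weights fixed to $1$, and the loss is the expected squared error over the input distribution. The symbols $\sigma$, $w_j$, $\theta_i$, $L$, and $\Phi$ are illustrative notation. Writing $\Phi(u,v)=\mathrm{E}_x[\sigma(x,u)\,\sigma(x,v)]$ and expanding the square gives
\[
L(\theta)=\mathrm{E}_x\bigl[(\hat f(x)-f(x))^2\bigr]
=\sum_{i,i'}\Phi(\theta_i,\theta_{i'})
-2\sum_{i,j}\Phi(\theta_i,w_j)
+\sum_{j,j'}\Phi(w_j,w_{j'}).
\]
Since $\Phi$ is symmetric, the direction in which gradient descent moves $\theta_i$ is
\[
-\nabla_{\theta_i}L
=2\sum_{j}\nabla_{\theta_i}\Phi(\theta_i,w_j)
-2\sum_{i'\neq i}\nabla_{\theta_i}\Phi(\theta_i,\theta_{i'})
-\nabla_{\theta}\Phi(\theta,\theta)\big|_{\theta=\theta_i}.
\]
Under this reading, $\Phi$ acts as a pairwise potential determined by the activation function and the input distribution: the first sum pulls each $\theta_i$ (an electron) toward the fixed $w_j$ (the protons), and the second pushes it away from the other $\theta_{i'}$. The last, self-interaction term vanishes whenever $\Phi(\theta,\theta)$ is constant, for instance if $\Phi(u,v)$ depends only on $u-v$. The loss is zero when the $\theta_i$ coincide with the $w_j$ up to a permutation.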