1: \begin{abstract}

2: We study whether a depth-two neural network can learn another

3: depth-two network using gradient descent.

4: % We study the problem of learning a depth two neural network with

5: % another

6: % randomly initialized

7:  % depth two network using gradient descent.

8:   % We study the problem of learning a function using a neural network

9:   % of a certain depth and width assuming it can be represented using

10:   % such a network.

11: Assuming a linear output node,

12: % the output node of the network

13: % is linear,

14: we show that

15: % We show that for networks of depth two with certain

16: %   simplifying assumptions

17: the question of whether gradient descent converges to the

18: target function is equivalent to the following question in

19: electrodynamics:

20: Given $k$ fixed protons in $\rea^d$ and $k$ electrons,

21: % initialized at random positions

22: % with the electrons moving due to

23: % under the influence of the

24: %electrical

25: each moving due to the attractive force from the protons and repulsive

26: force from the remaining electrons,

27: %. The question of convergence, then, is

28: whether at equilibrium all the electrons will be matched up with

29: %to all the

30: the protons, up to a permutation.

31: Under the standard electrical

32: force, this follows from the classic Earnshaw's theorem. In our setting,

33: the force is

34: % If the force function between a pair of

35: % charges is not given by the standard electrical force of $1/r^2$

36: % (where $r$ is the distance between unit charges), but by another

37: % function that is

38: determined by the activation function and the

39: input distribution.

40: Building on this equivalence, we prove the

41: existence of an activation function such that

42: % the corresponding

43: gradient descent learns

44: % dynamics

45: % result in learning

46: at least one of the

47: hidden nodes in the target network.

48: Iterating, we show that gradient

49: descent can be used to learn the entire network one node at a time.

50: \end{abstract}