
\begin{abstract}

We study the training of one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime, where the networks' biases are initialized to some constant rather than zero.

We prove that under such initialization, the neural network has sparse activation throughout the entire training process, which enables fast training procedures via sophisticated computational methods.
With such initialization, we show that the neural networks possess a different limiting kernel, which we call the \textit{bias-generalized NTK}, and we study various properties of the neural networks with this new kernel.

We first characterize the gradient descent dynamics.

In particular, we show that the network in this case can converge as fast as the dense network, in contrast to previous work suggesting that sparse networks converge more slowly.

In addition, our result improves on the width previously required to ensure convergence.

Second, we study the networks' generalization: we show a width-sparsity dependence, which yields a sparsity-dependent Rademacher complexity and generalization bound.

To our knowledge, this is the first sparsity-dependent generalization result via Rademacher complexity.

Lastly, we study the smallest eigenvalue of this new kernel.

We identify a data-dependent region where we can derive a much sharper lower bound on the NTK's smallest eigenvalue than the previously known worst-case bound. This can lead to an improvement in the generalization bound.

\end{abstract}
