\begin{abstract}
For large-scale learning problems, it is desirable to obtain the
optimal model parameters in a single pass over the data.
\cite{Polyak92} showed that, asymptotically, the test performance
of the simple average of the parameters obtained by stochastic
gradient descent (SGD) is as good as that of the parameters that
minimize the empirical cost. However, to our knowledge, despite
its optimal asymptotic convergence rate, averaged SGD (ASGD) has
received little attention in recent research on large-scale
learning. One possible reason is that, for most real problems,
ASGD may need a prohibitively large number of training samples to
reach its asymptotic region. In this paper, we present a
finite-sample analysis of the method of \cite{Polyak92}. Our
analysis shows that, with an improperly chosen learning rate, ASGD
indeed usually needs a huge number of samples to reach its
asymptotic region. More importantly, based on this analysis, we
propose a simple way to set the learning rate so that ASGD reaches
its asymptotic region with a reasonable amount of data. We compare
ASGD using the proposed learning rate with other well-known
algorithms for training large-scale linear classifiers. The
experiments clearly show the superiority of ASGD.
\end{abstract}