\begin{abstract}
For large-scale learning problems, it is desirable to obtain the
optimal model parameters in a single pass through the data.
\cite{Polyak92} showed that, asymptotically, the test performance of
the simple average of the parameters obtained by stochastic gradient
descent (SGD) is as good as that of the parameters that minimize the
empirical cost. However, to our knowledge, despite its optimal
asymptotic convergence rate, averaged SGD (ASGD) has received little
attention in recent research on large-scale learning. One possible
reason is that, for most real problems, ASGD may need a prohibitively
large number of training samples to reach its asymptotic region. In
this paper, we present a finite-sample analysis of the method of
\cite{Polyak92}. Our analysis shows that ASGD does indeed usually need
a huge number of samples to reach its asymptotic region when the
learning rate is improperly chosen. More importantly, based on our
analysis, we propose a simple way to set the learning rate so that
ASGD reaches its asymptotic region with a reasonable amount of data.
We compare ASGD using our proposed learning rate with other well-known
algorithms for training large-scale linear classifiers. The
experiments clearly show the superiority of ASGD.
\end{abstract}