\begin{abstract}
Training a classifier over a large number of classes, known as ``extreme classification'',
has become a topic of major interest with applications
in technology, science, and e-commerce. Traditional softmax regression induces a gradient cost
proportional to the number of classes~$C$, which is often prohibitively expensive.
A popular scalable softmax approximation relies on uniform negative
sampling, which suffers from slow convergence due to a poor signal-to-noise ratio.
In this paper, we propose a simple training method that drastically enhances the gradient signal
by drawing negative samples from an adversarial model that mimics the data distribution.
Our contributions are threefold: (i)~an adversarial sampling mechanism that
produces negative samples at a cost only logarithmic in~$C$, thus still resulting in cheap gradient updates;
(ii)~a mathematical proof that this adversarial sampling minimizes the gradient
variance while any bias due to non-uniform sampling can be removed;
(iii)~experimental results on large-scale data sets that
show a reduction of the training time by an order of magnitude relative to several competitive baselines.
\end{abstract}