\begin{abstract}
Training a classifier over a large number of classes, known as `extreme classification',
has become a topic of major interest with applications
in technology, science, and e-commerce. Traditional softmax regression incurs a gradient cost
proportional to the number of classes~$C$, which is often prohibitively expensive.
A popular scalable softmax approximation relies on uniform negative
sampling, which suffers from slow convergence due to a poor signal-to-noise ratio.
In this paper, we propose a simple training method for drastically enhancing the gradient signal
by drawing negative samples from an adversarial model that mimics the data distribution.
Our contributions are three-fold: (i)~an adversarial sampling mechanism that
produces negative samples at a cost only logarithmic in~$C$, thus still resulting in cheap gradient updates;
(ii)~a mathematical proof that this adversarial sampling minimizes the gradient
variance while any bias due to non-uniform sampling can be removed;
(iii)~experimental results on large-scale data sets that
show a reduction in training time by an order of magnitude relative to several competitive baselines.
\end{abstract}