
\begin{abstract}

    This paper presents a simple unsupervised visual representation learning method with a pretext task of discriminating all images in a dataset using a parametric, instance-level classifier.

    The overall framework is a replica of a supervised classification model, where \textit{semantic classes} (e.g., \textit{dog, bird,} and \textit{ship}) are replaced by \textit{instance IDs}.

    However, scaling up the classification task from thousands of \textit{semantic labels} to millions of \textit{instance labels} brings specific challenges, including 1) the large-scale softmax computation; 2) slow convergence, because each instance sample is visited infrequently; and 3) a massive number of negative classes that can be noisy.

    This work presents several novel techniques to handle these difficulties.

    First, we introduce a hybrid parallel training framework to make large-scale training feasible.

    Second, we present a raw-feature initialization mechanism for classification weights, which we assume offers a contrastive prior for instance discrimination and which clearly speeds up convergence in our experiments.

    Finally, we propose smoothing the labels of the few hardest classes to avoid optimizing over very similar negative pairs.

    While conceptually simple, our framework achieves competitive or superior performance compared to state-of-the-art unsupervised approaches, i.e., SimCLR, MoCoV2, and PIC, under the ImageNet linear evaluation protocol and on several downstream visual tasks, verifying that full instance classification is a strong pretraining technique for many semantic visual tasks.

\end{abstract}
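
% Illustrative sketch only: the notation below is assumed for exposition and is not taken from the paper.
As a minimal sketch of the objective summarized in the abstract, assume a standard softmax cross-entropy formulation. The symbols $N$ (number of images), $f_i$ (feature of image $i$), $w_j$ (classifier weight of instance class $j$), $\tau$ (temperature), $\alpha$ (smoothing mass), $K$, and $\mathcal{H}_i$ are illustrative notation introduced here, not the authors' own. Under these assumptions, the probability that image $i$ is assigned to instance class $j$ could be written as
\begin{equation}
    p_{i,j} = \frac{\exp\left(w_j^{\top} f_i / \tau\right)}{\sum_{k=1}^{N} \exp\left(w_k^{\top} f_i / \tau\right)},
\end{equation}
and the training loss as
\begin{equation}
    \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} y_{i,j} \log p_{i,j}.
\end{equation}
One plausible reading of smoothing the labels of the hardest classes is to move a small mass $\alpha$ from the true instance onto the $K$ negative classes with the largest logits $w_j^{\top} f_i$, collected in a set $\mathcal{H}_i$:
\begin{equation}
    y_{i,j} = \left\{
    \begin{array}{ll}
        1 - \alpha & \mbox{if } j = i, \\
        \alpha / K & \mbox{if } j \in \mathcal{H}_i, \\
        0 & \mbox{otherwise.}
    \end{array}
    \right.
\end{equation}
The targets sum to one for every $i$, since $(1 - \alpha) + K \cdot \alpha / K = 1$. In the same assumed notation, raw-feature initialization would correspond to setting each $w_i$ from a feature of image $i$ computed before training begins; the exact procedure is not specified in the abstract.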
