\begin{abstract}
  Deep Convolutional Neural Networks (DCNNs) commonly use generic
  `max-pooling' (MP) layers to extract deformation-invariant features, but we
  argue in favor of a more refined treatment. First, we introduce {\em
    epitomic convolution} as a building-block alternative to the common
  convolution-MP cascade of DCNNs; at a complexity identical to that of MP,
  epitomic convolution allows for parameter sharing across different filters,
  resulting in faster convergence and better generalization. Second, we
  introduce a Multiple Instance Learning approach to explicitly accommodate
  global translation and scaling when training a DCNN exclusively with class
  labels. For this we rely on a {\em `patchwork'} data structure that
  efficiently lays out all image scales and positions as candidates for a
  DCNN. Factoring global and local deformations allows a DCNN to `focus its
  resources' on the treatment of non-rigid deformations and yields a
  substantial improvement in classification accuracy. Third, further pursuing
  this idea, we develop an efficient DCNN sliding-window object detector that
  employs explicit search over position, scale, and aspect ratio. We
  provide competitive image classification and localization results on the
  ImageNet dataset and object detection results on the PASCAL VOC 2007
  benchmark.
\end{abstract}