1: \begin{abstract}
2:
3: Active vision is inherently attention-driven: the agent selects views of observation to best
4: approach the vision task while improving its internal representation of the scene being observed.
5: Inspired by the recent success of attention-based models in 2D vision tasks based on single RGB images,
6: we propose to address multi-view, depth-based active object recognition with an attention mechanism by developing
7: an end-to-end recurrent 3D attentional network.
8: The architecture comprises a recurrent neural network (RNN), which stores and updates an internal representation,
9: and two levels of spatial transformer units, which guide two-level attention.
10: Our model, trained with a 3D shape database, iteratively attends to the best views of an object
11: of interest in order to recognize it, and focuses on the object within each view to remove background clutter.
12: %Therefore, we method is able achieving two levels of attention.
13: To realize 3D view selection, we derive a 3D spatial transformer network which is differentiable
14: for training with back-propagation, achieving much faster convergence than the reinforcement learning
15: employed by most existing attention-based models.
16: Experiments show that our method outperforms state-of-the-art methods in cluttered scenes.
17: \end{abstract}
18: