
1: \begin{abstract}

2:

3: Active vision is inherently attention-driven: The agent selects views of observation to best

4: approach the vision task while improving its internal representation of the scene being observed.

5: Inspired by the recent success of attention-based models in 2D vision tasks based on single RGB images,

6: we propose to address multi-view, depth-based active object recognition with an attention mechanism by developing

7: an end-to-end recurrent 3D attentional network.

8: The architecture comprises a recurrent neural network (RNN), which stores and updates an internal representation,

9: and two levels of spatial transformer units, which guide the two levels of attention.

10: Our model, trained with a 3D shape database, is able to iteratively attend to the best views of an object

11: of interest in order to recognize it, and to focus on the object in each view in order to remove background clutter.

12: %Therefore, we method is able achieving two levels of attention.

13: To realize 3D view selection, we derive a 3D spatial transformer network which is differentiable

14: for training with back-propagation, achieving much faster convergence than the reinforcement learning

15: employed by most existing attention-based models.

16: Experiments show that our method outperforms state-of-the-art methods in cluttered scenes.

17: \end{abstract}

18: