585301f3a5b469bd.tex
1: \begin{abstract}
2: We introduce a method for following high-level navigation instructions by mapping directly from images, instructions and pose estimates to continuous low-level velocity commands for real-time control.
3: The Grounded Semantic Mapping Network ($\modelname$) is a fully-differentiable neural network architecture that builds an explicit semantic map in the world reference frame by incorporating a pinhole camera projection model within the network. The information stored in the map is learned from experience, while the local-to-world transformation is computed explicitly.
4: We train the model using $\daggerfm$, a modified variant of $\vanilladagger$ that trades tabular convergence guarantees for improved training speed and memory use.
5: We test $\modelname$ in  virtual environments on a realistic quadcopter simulator and show that incorporating an explicit mapping and grounding modules allows $\modelname$ to outperform strong neural baselines and almost reach an expert policy performance. Finally, we analyze the learned map representations and show that using an explicit map leads to an interpretable instruction-following model.
6: \end{abstract}
7: