\begin{abstract}
Radiance field methods have achieved photorealistic novel view synthesis and geometry reconstruction.
However, they are mostly applied in per-scene optimization or small-baseline settings.
While several recent works investigate feed-forward reconstruction with large baselines using transformers, they all rely on a standard global attention mechanism and hence ignore the local nature of 3D reconstruction.
%
We propose a method that unifies local and global reasoning in transformer layers, resulting in improved quality and faster convergence. Our model represents scenes as Gaussian Volumes and combines this representation with an image encoder and Group Attention Layers for efficient feed-forward reconstruction. Experiments show that our model, trained for two days on four GPUs, reconstructs 360$^{\circ}$ radiance fields with high fidelity and is robust to zero-shot and out-of-domain testing.

% Our key hypothesis is that unifying local and global reasoning in transformer layers boosts reconstruction quality and speeds up convergence.
% We represent scenes as Gaussian Volumes, allowing us to efficiently render high-resolution images and extract meshes.
% Combining this scene representation with an image encoder and Group Attention Layers yields an efficient feed-forward reconstruction model.
% Our experimental results show that our model, trained for 2 days on 4 GPUs, is already capable of reconstructing both appearance and geometry significantly better than previous work, and that it is robust to zero-shot and out-of-domain testing.

\keywords{3D Reconstruction \and 3D Transformer \and Radiance Fields}
\end{abstract}