\begin{abstract}

% Motivation
The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following.
In particular, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations.
However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots across diverse scenarios are still missing.
To fill this gap, this work focuses on the tabletop manipulation task and releases a simulation benchmark, \textit{LoHoRavens}, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetic, and reference.
Furthermore, long-horizon manipulation with LLMs faces a key modality bridging problem that prior work has studied little: how to incorporate observation feedback during robot execution into the LLM's closed-loop planning.
We investigate two methods of bridging the modality gap: caption generation and a learnable interface, which incorporate explicit and implicit observation feedback into the LLM, respectively.
These methods serve as the two baselines for our proposed benchmark.
Experiments show that both methods struggle to solve some tasks, indicating that long-horizon manipulation tasks remain challenging for current popular models.
We expect that the proposed public benchmark and baselines will help the community develop better models for long-horizon tabletop manipulation tasks.\footnote{The video and code of LoHoRavens are available at~\url{https://cisnlp.github.io/lohoravens-webpage/}.}

\end{abstract}
