abstract:f86957bf5a79adc2.tex

1: \begin{abstract}

2:

3: % Motivation

4: The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following.

5: In particular, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations.

6: However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots in various scenarios are still missing.

7: To fill this gap, this work focuses on the tabletop

8: manipulation task and releases a simulation benchmark,

9: \textit{LoHoRavens}, which covers various long-horizon

10: reasoning aspects spanning color, size, space, arithmetic

11: and reference.

12: Furthermore, there is a key modality-bridging problem for

13: long-horizon manipulation tasks with LLMs: how to

14: incorporate the observation feedback during robot execution

15: into the LLM's closed-loop planning, which remains understudied in prior work.

16: We investigate two methods of bridging the modality gap: caption generation and a learnable interface, which incorporate explicit and implicit observation feedback into the LLM, respectively.

17: These methods serve as the two baselines for our proposed benchmark.

18: Experiments show that both methods struggle to solve some tasks, indicating that long-horizon manipulation tasks remain challenging for current popular models.

19: We expect that the proposed public benchmark and baselines will help the community develop better models for long-horizon tabletop manipulation tasks.\footnote{The video and code of LoHoRavens are available at~\url{https://cisnlp.github.io/lohoravens-webpage/}.}

20:

21:

22:

23: \end{abstract}

24: