Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation

AAAI 2025

1The University of Hong Kong, 2Sun Yat-sen University, 3Meituan

Abstract

LLM-based agents have demonstrated impressive zero-shot performance in the vision-language navigation (VLN) task. However, existing LLM-based methods often focus only on high-level task planning, selecting nodes in predefined navigation graphs for movement while overlooking low-level control in navigation scenarios. To bridge this gap, we propose AO-Planner, a novel Affordances-Oriented Planner for the continuous VLN task. AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making, both performed in a zero-shot setting. Specifically, we employ a Visual Affordances Prompting (VAP) approach in which the visible ground is segmented by SAM to provide navigational affordances, based on which the LLM selects potential candidate waypoints and plans low-level paths towards the selected waypoints. We further propose a high-level PathAgent that marks the planned paths in the image input and reasons about the most probable path by comprehending all environmental information. Finally, we convert the selected path into 3D coordinates using camera intrinsic parameters and depth information, avoiding challenging 3D predictions for LLMs. Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance (an 8.8% improvement in SPL). Our method can also serve as a data annotator to obtain pseudo-labels, distilling its waypoint prediction ability into a learning-based predictor. This new predictor does not require any waypoint data from the simulator and achieves 47% SR, competitive with supervised methods. We establish an effective connection between LLMs and the 3D world, presenting novel prospects for employing foundation models in low-level motion control.


Introduction

[Figure: introduction]

In discrete VLN, LLMs only need to perform high-level planning by selecting a view as the forward direction (left). For continuous environments, previous agents rely on collecting simulator data to train low-level policies. In this paper, we utilize multimodal foundation models and propose visual affordances prompting to predict low-level candidate waypoints and paths in a zero-shot setting (right).


Method

[Figure: pipeline]

Our proposed low-level affordances-oriented planning framework with visual affordances prompting. First, we utilize Grounded SAM to segment the visible ground as affordances. We then introduce Visual Affordances Prompting (VAP), where we uniformly scatter points with numeric labels within the affordances. After querying the LLM with the annotated image together with the task definition, instruction, waypoint definition, and output requirements, we obtain potential waypoints and paths in this view.
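Below is a minimal sketch of the VAP idea described in this caption. It assumes a binary ground mask (e.g., from Grounded SAM) and an RGB view; the grid spacing, drawing parameters, and prompt wording are illustrative assumptions rather than the authors' exact implementation.

# Sketch: scatter numbered points on the segmented ground and build the VAP prompt.
import cv2
import numpy as np

def scatter_affordance_points(image, ground_mask, step=40):
    """Uniformly scatter numeric labels on pixels that fall inside the ground mask."""
    vis = image.copy()
    points = []
    h, w = ground_mask.shape
    for v in range(step // 2, h, step):
        for u in range(step // 2, w, step):
            if ground_mask[v, u]:  # keep only navigable (ground) pixels
                idx = len(points)
                points.append((u, v))
                cv2.circle(vis, (u, v), 4, (0, 255, 0), -1)
                cv2.putText(vis, str(idx), (u + 5, v - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.4, (255, 255, 255), 1)
    return vis, points

def build_vap_prompt(instruction, num_points):
    """Combine task definition, instruction, waypoint definition, and output requirements."""
    return (
        "Task: you are navigating a continuous indoor environment.\n"
        f"Instruction: {instruction}\n"
        f"The image shows {num_points} numbered points on the visible ground; "
        "each point is a potential waypoint you can walk to.\n"
        "Output: list the indices of promising candidate waypoints and, for each, "
        "a sequence of point indices forming a low-level path towards it."
    )

The annotated image and prompt would then be sent to a multimodal LLM, which replies with candidate waypoints and pixel-level paths in this view.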

[Figure: pipeline]

Our proposed high-level PathAgent. Different from previous zero-shot VLN agents, we utilize visual prompting by marking candidate waypoints and their corresponding paths (i.e., Path 0-5) in all four observation directions. This allows the PathAgent to make action decisions in the RGB space it handles well, and then map the pixel-based paths to 3D coordinates using depth information and camera intrinsic parameters.
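The mapping from a selected pixel path to 3D coordinates is standard pinhole unprojection with depth and intrinsics, as noted in the caption. The sketch below illustrates that step; the variable names and the path format (a list of (u, v) pixels) are assumptions for illustration, not the paper's exact code.

# Sketch: back-project a pixel path into 3D camera-frame coordinates.
import numpy as np

def unproject_path(pixel_path, depth, fx, fy, cx, cy):
    """Convert a list of (u, v) pixels into 3D points in the camera frame.

    depth: H x W array of metric depth values (meters).
    fx, fy, cx, cy: camera intrinsic parameters.
    """
    points_3d = []
    for u, v in pixel_path:
        z = depth[v, u]            # depth along the camera's optical axis
        x = (u - cx) * z / fx      # pinhole model back-projection
        y = (v - cy) * z / fy
        points_3d.append((x, y, z))
    return np.array(points_3d)

# Example usage (hypothetical values): the path chosen by the PathAgent becomes
# 3D waypoints that a low-level controller can follow.
# path_xyz = unproject_path(selected_pixels, depth_map, fx=320.0, fy=320.0, cx=320.0, cy=240.0)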


Results

[Tables 1-4 and Figure 4]

BibTeX


AO-Planner
@inproceedings{chen2024affordances,
  title={Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation},
  author={Chen, Jiaqi and Lin, Bingqian and Liu, Xinmin and Ma, Lin and Liang, Xiaodan and Wong, Kwan-Yee~K.},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2025}
}

MapGPT (a zero-shot agent we proposed for discrete VLN)
@inproceedings{chen2024mapgpt,
  title={MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation},
  author={Chen, Jiaqi and Lin, Bingqian and Xu, Ran and Chai, Zhenhua and Liang, Xiaodan and Wong, Kwan-Yee~K.},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
  year={2024}
}