VAMOS

A Hierarchical Vision-Language-Action Model for
Capability-Modulated and Steerable Navigation

VAMOS system architecture diagram

VAMOS is a hierarchical vision-language-action model that decouples semantic planning from embodiment grounding, enabling robust cross-embodiment navigation with natural language steerability.

A single high-level planner can be deployed across physically distinct wheeled and legged robots by using an embodiment-specific affordance model to reject physically infeasible plans.

Hierarchical Navigation with Embodiment Grounding

A fundamental tension in robot navigation lies in learning policies that generalize across diverse environments while still conforming to the unique physical constraints of each embodiment. Quadrupeds can walk up stairs; wheeled robots cannot.

VAMOS resolves this tension through a carefully designed hierarchical architecture that separates concerns:

  • High-level VLM Planner: Learns from diverse, open-world data to understand semantic navigation
  • Per-embodiment Affordance Model: Learns the robot's physical constraints safely in simulation (see the control-flow sketch below the figure)
VAMOS system architecture showing hierarchical design with VLM planner and affordance model
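To make the division of labor concrete, here is a minimal sketch of the control flow. The `planner.propose` and `affordance_model.traversability` interfaces are hypothetical stand-ins for the two components, not the released API.

```python
# Minimal sketch of the hierarchical control flow. The planner and affordance
# model interfaces below are hypothetical stand-ins, not the released API.
from typing import List, Tuple

Path = List[Tuple[float, float]]  # 2D waypoints in pixel space


def plan_step(image, elevation_map, instruction: str,
              planner, affordance_model,
              num_candidates: int = 8, threshold: float = 0.5) -> Path:
    """Propose semantic paths with the shared planner, keep the best feasible one."""
    # High level: semantic path proposals, shared across embodiments.
    candidates: List[Path] = planner.propose(image, instruction, n=num_candidates)

    # Low level: embodiment-specific traversability scores in [0, 1].
    scored = [(affordance_model.traversability(elevation_map, p), p) for p in candidates]

    # Reject physically infeasible proposals before committing to one.
    feasible = [(score, path) for score, path in scored if score >= threshold]
    if not feasible:
        raise RuntimeError("No feasible path for this embodiment; re-query the planner.")
    return max(feasible, key=lambda sp: sp[0])[1]
```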

Cross-Embodiment Transfer

Cross-embodiment navigation comparison: a Boston Dynamics Spot climbs the stairs while a wheeled robot takes the ramp.

The same high-level planner works across different robot embodiments by simply swapping lightweight, specialized affordance models.

In our experiments, we demonstrate successful navigation on both:

  • Boston Dynamics Spot (Legged): Can traverse stairs, ramps, and complex terrain
  • UW Hound Robot (Wheeled): Limited to ramps and flat surfaces

The affordance model automatically filters the planner's proposals according to each robot's capabilities, enabling the same planner to achieve high performance on both platforms.
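As a toy illustration of the swap (hypothetical classes, not the trained affordance models, which operate on elevation maps, positions, and headings as described in the Method section), the same stair-crossing proposal is accepted by a legged affordance model and rejected by a wheeled one:

```python
# Toy illustration (hypothetical classes, not the trained affordance models):
# the same stair path is feasible for a legged robot but not for a wheeled one.
from typing import List


class SpotAffordance:
    """Stand-in for a legged-robot affordance model."""
    TRAVERSABLE = {"flat", "ramp", "stairs", "grass"}

    def traversability(self, terrain_along_path: List[str]) -> float:
        return 1.0 if set(terrain_along_path) <= self.TRAVERSABLE else 0.0


class WheeledAffordance(SpotAffordance):
    """Stand-in for a wheeled-robot affordance model: no stairs."""
    TRAVERSABLE = {"flat", "ramp"}


stair_path = ["flat", "stairs", "flat"]
print(SpotAffordance().traversability(stair_path))     # 1.0 -> Spot keeps the stair route
print(WheeledAffordance().traversability(stair_path))  # 0.0 -> wheeled robot replans via the ramp
```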

Natural Language Steerability

VAMOS enables intuitive control through natural language commands, allowing users to specify navigation preferences without complex programming.

Steerability demonstrations:

  • Robot following a "take the ramp" command
  • Robot navigating around a tree
  • Robot crossing a grass area
  • Robot navigating around a U-pole obstacle

Experimental Results

Evaluation environments:

  • Hallways: indoor hallway navigation
  • Atrium: navigation in low light
  • Lab: cluttered lab with obstacles
  • Campus: outdoor navigation with stairs
  • Forest: navigation through vegetation
  • Down Ramp: descending a ramp

Results aggregated across all environments:

Method          Success Rate   Avg. Interventions   Number of Timeouts
Modular Stack        53%              0                    -
ViPlanner            67%              0                    -
NoMaD                27%              1.3                  -
NaViLA               10%              0.7                  -
VAMOS (Ours)         90%              0.25                 -

Method: Hierarchical VLA Architecture

VAMOS operationalizes the insight that navigation can be decomposed: high-level heuristics are generalizable across embodiments, while low-level traversability depends on physical capabilities.

High-Level VLM Planner

  • Built on the PaliGemma 2 3B model
  • Trained on 29.8 hours of diverse navigation data
  • Predicts 2D paths in pixel space
  • Enables natural language steerability (prompting sketched below)
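The snippet below sketches how a PaliGemma 2 backbone can be queried for a pixel-space path with a language preference, using the public base checkpoint via Hugging Face transformers. The checkpoint name, prompt wording, and the "x,y x,y ..." output format are illustrative assumptions; VAMOS fine-tunes its own planner weights.

```python
# Sketch of querying a PaliGemma 2 backbone for a pixel-space path.
# ASSUMPTIONS: the public base checkpoint is used here (VAMOS fine-tunes its own
# weights), and the prompt and "x,y" output format are illustrative only.
import re
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"  # base checkpoint, not the fine-tuned planner
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("frontal_camera.png")  # current RGB observation (placeholder path)
prompt = "plan a path to the goal; preference: take the ramp"  # language steering

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]

# Decoded text includes the prompt; only the generated coordinates contain digits.
waypoints = [(int(x), int(y)) for x, y in re.findall(r"(\d+),(\d+)", text)]
print(waypoints)  # 2D path in pixel space
```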

Affordance Model

  • Lightweight MLP trained in simulation
  • Evaluates path feasibility for specific embodiment
  • Maps elevation + position + heading → traversability (see the sketch below)
  • Enables safe deployment across robot types
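A minimal sketch of what such an affordance head could look like, assuming a local elevation patch plus the robot's relative position and heading are concatenated into the MLP input; the feature layout and layer sizes are assumptions for illustration, not the paper's exact design.

```python
# Sketch of a lightweight affordance MLP: (elevation patch, position, heading) -> traversability.
# Input layout and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn


class AffordanceMLP(nn.Module):
    def __init__(self, patch_size: int = 16, hidden: int = 128):
        super().__init__()
        in_dim = patch_size * patch_size + 2 + 1  # elevation patch + (x, y) position + heading
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, elevation_patch, position, heading):
        # elevation_patch: (B, patch, patch), position: (B, 2), heading: (B, 1)
        x = torch.cat([elevation_patch.flatten(1), position, heading], dim=-1)
        return torch.sigmoid(self.net(x))  # traversability probability in [0, 1]


# Example: score one waypoint of a candidate path for a specific embodiment.
model = AffordanceMLP()
score = model(torch.zeros(1, 16, 16), torch.zeros(1, 2), torch.zeros(1, 1))
print(score.item())
```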

Training Data Sources

  • SCAND
  • TartanDrive
  • CODa
  • In-domain Spot dataset

Our Team

  • Mateo Guaman Castro*, University of Washington
  • Sidharth Rajagopal, University of Washington
  • Daniel Gorbatov, University of Washington
  • Matt Schmittle, University of Washington
  • Rohan Baijal, University of Washington
  • Octi Zhang, University of Washington
  • Rosario Scalise, University of Washington
  • Sidharth Talia, University of Washington
  • Emma Romig, University of Washington
  • Celso de Melo, Army Research Laboratory
  • Byron Boots, University of Washington
  • Abhishek Gupta, University of Washington

Citation

@article{guamancastro2025vamos,
  title={VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation},
  author={Mateo Guaman Castro and Sidharth Rajagopal and Daniel Gorbatov and Matt Schmittle and Rohan Baijal and Octi Zhang and Rosario Scalise and Sidharth Talia and Emma Romig and Celso de Melo and Byron Boots and Abhishek Gupta},
  journal={Under Review},
  year={2025}
}