VAMOS

A Hierarchical Vision-Language-Action Model for
Capability-Modulated and Steerable Navigation

VAMOS system architecture diagram

VAMOS is a hierarchical vision-language-action model that decouples semantic planning from embodiment grounding, enabling robust cross-embodiment navigation with natural language steerability.

A single high-level planner can be deployed across physically distinct wheeled and legged robots by using an embodiment-specific affordance model to reject physically infeasible plans.

Hierarchical Navigation with Embodiment Grounding

A fundamental tension in robot navigation lies in learning policies that generalize across diverse environments while still conforming to the unique physical constraints of each embodiment. Quadrupeds can walk up stairs; wheeled robots cannot.

VAMOS resolves this tension through a carefully designed hierarchical architecture that separates concerns:

  • High-level VLM Planner: Learns from diverse, open-world data to understand semantic navigation
  • Per-embodiment Affordance Model: Learns the robot's physical constraints safely in simulation (see the control-flow sketch below the figure)
VAMOS system architecture showing hierarchical design with VLM planner and affordance model
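To make the division of labor concrete, here is a minimal sketch of the control flow. The `planner.propose` and `affordance_model.traversability` interfaces are hypothetical stand-ins for the two components, not the released API.

```python
# Minimal sketch of the hierarchical control flow. The planner and affordance
# model interfaces below are hypothetical stand-ins, not the released API.
from typing import List, Tuple

Path = List[Tuple[float, float]]  # 2D waypoints in pixel space


def plan_step(image, elevation_map, instruction: str,
              planner, affordance_model,
              num_candidates: int = 8, threshold: float = 0.5) -> Path:
    """Propose semantic paths with the shared planner, keep the best feasible one."""
    # High level: semantic path proposals, shared across embodiments.
    candidates: List[Path] = planner.propose(image, instruction, n=num_candidates)

    # Low level: embodiment-specific traversability scores in [0, 1].
    scored = [(affordance_model.traversability(elevation_map, p), p) for p in candidates]

    # Reject physically infeasible proposals before committing to one.
    feasible = [(score, path) for score, path in scored if score >= threshold]
    if not feasible:
        raise RuntimeError("No feasible path for this embodiment; re-query the planner.")
    return max(feasible, key=lambda sp: sp[0])[1]
```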

Cross-Embodiment Transfer

Cross-embodiment navigation comparison: a Boston Dynamics Spot climbs the stairs while a wheeled robot takes the ramp.

The same high-level planner works across different robot embodiments by simply swapping lightweight, specialized affordance models.

In our experiments, we demonstrate successful navigation on both:

  • Boston Dynamics Spot (Legged): Can traverse stairs, ramps, and complex terrain
  • UW Hound Robot (Wheeled): Limited to ramps and flat surfaces

The affordance model automatically filters the planner's proposals according to each robot's capabilities, enabling the same planner to achieve high performance on both platforms.
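As a toy illustration of the swap (hypothetical classes, not the trained affordance models, which operate on elevation maps, positions, and headings as described in the Method section), the same stair-crossing proposal is accepted by a legged affordance model and rejected by a wheeled one:

```python
# Toy illustration (hypothetical classes, not the trained affordance models):
# the same stair path is feasible for a legged robot but not for a wheeled one.
from typing import List


class SpotAffordance:
    """Stand-in for a legged-robot affordance model."""
    TRAVERSABLE = {"flat", "ramp", "stairs", "grass"}

    def traversability(self, terrain_along_path: List[str]) -> float:
        return 1.0 if set(terrain_along_path) <= self.TRAVERSABLE else 0.0


class WheeledAffordance(SpotAffordance):
    """Stand-in for a wheeled-robot affordance model: no stairs."""
    TRAVERSABLE = {"flat", "ramp"}


stair_path = ["flat", "stairs", "flat"]
print(SpotAffordance().traversability(stair_path))     # 1.0 -> Spot keeps the stair route
print(WheeledAffordance().traversability(stair_path))  # 0.0 -> wheeled robot replans via the ramp
```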

Natural Language Steerability

VAMOS enables intuitive control through natural language commands, allowing users to specify navigation preferences without complex programming.

Steerability demonstrations:

  • Robot following a "take the ramp" command
  • Robot navigating around a tree
  • Robot crossing a grass area
  • Robot navigating around a U-pole obstacle

Experimental Results

Evaluation environments:

  • Hallways: indoor hallway navigation
  • Atrium: navigation in low light
  • Lab: cluttered lab with obstacles
  • Campus: outdoor navigation with stairs
  • Forest: navigation through vegetation
  • Down Ramp: descending a ramp

Results aggregated across all environments:

Method          Success Rate   Avg. Interventions   Number of Timeouts
Modular Stack        53%              0                    -
ViPlanner            67%              0                    -
NoMaD                27%              1.3                  -
NaViLA               10%              0.7                  -
VAMOS (Ours)         90%              0.25                 -

Method: Hierarchical VLA Architecture

VAMOS operationalizes the insight that navigation can be decomposed: high-level heuristics are generalizable across embodiments, while low-level traversability depends on physical capabilities.

High-Level VLM Planner

  • Built on the PaliGemma 2 3B model
  • Trained on 29.8 hours of diverse navigation data
  • Predicts 2D paths in pixel space
  • Enables natural language steerability (prompting sketched below)
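The snippet below sketches how a PaliGemma 2 backbone can be queried for a pixel-space path with a language preference, using the public base checkpoint via Hugging Face transformers. The checkpoint name, prompt wording, and the "x,y x,y ..." output format are illustrative assumptions; VAMOS fine-tunes its own planner weights.

```python
# Sketch of querying a PaliGemma 2 backbone for a pixel-space path.
# ASSUMPTIONS: the public base checkpoint is used here (VAMOS fine-tunes its own
# weights), and the prompt and "x,y" output format are illustrative only.
import re
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"  # base checkpoint, not the fine-tuned planner
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("frontal_camera.png")  # current RGB observation (placeholder path)
prompt = "plan a path to the goal; preference: take the ramp"  # language steering

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]

# Decoded text includes the prompt; only the generated coordinates contain digits.
waypoints = [(int(x), int(y)) for x, y in re.findall(r"(\d+),(\d+)", text)]
print(waypoints)  # 2D path in pixel space
```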

Affordance Model

  • Lightweight MLP trained in simulation
  • Evaluates path feasibility for specific embodiment
  • Maps elevation + position + heading → traversability (see the sketch below)
  • Enables safe deployment across robot types
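A minimal sketch of what such an affordance head could look like, assuming a local elevation patch plus the robot's relative position and heading are concatenated into the MLP input; the feature layout and layer sizes are assumptions for illustration, not the paper's exact design.

```python
# Sketch of a lightweight affordance MLP: (elevation patch, position, heading) -> traversability.
# Input layout and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn


class AffordanceMLP(nn.Module):
    def __init__(self, patch_size: int = 16, hidden: int = 128):
        super().__init__()
        in_dim = patch_size * patch_size + 2 + 1  # elevation patch + (x, y) position + heading
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, elevation_patch, position, heading):
        # elevation_patch: (B, patch, patch), position: (B, 2), heading: (B, 1)
        x = torch.cat([elevation_patch.flatten(1), position, heading], dim=-1)
        return torch.sigmoid(self.net(x))  # traversability probability in [0, 1]


# Example: score one waypoint of a candidate path for a specific embodiment.
model = AffordanceMLP()
score = model(torch.zeros(1, 16, 16), torch.zeros(1, 2), torch.zeros(1, 1))
print(score.item())
```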

Training Data Sources

  • SCAND
  • TartanDrive
  • CODa
  • In-domain Spot dataset

Our Team

  • Mateo Guaman Castro*, University of Washington
  • Sidharth Rajagopal, University of Washington
  • Daniel Gorbatov, University of Washington
  • Matt Schmittle, University of Washington
  • Rohan Baijal, University of Washington
  • Octi Zhang, University of Washington
  • Rosario Scalise, University of Washington
  • Sidharth Talia, University of Washington
  • Emma Romig, University of Washington
  • Celso de Melo, Army Research Laboratory
  • Byron Boots, University of Washington
  • Abhishek Gupta, University of Washington

Citation

@article{guamancastro2025vamos,
  title={VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation},
  author={Mateo Guaman Castro and Sidharth Rajagopal and Daniel Gorbatov and Matt Schmittle and Rohan Baijal and Octi Zhang and Rosario Scalise and Sidharth Talia and Emma Romig and Celso de Melo and Byron Boots and Abhishek Gupta},
  journal={Under Review},
  year={2025}
}