Recent vision-language models (VLMs) have demonstrated significant potential in robotic planning. However, they typically function as semantic reasoners, lacking an intrinsic understanding of the specific robot's physical capabilities. This limitation is particularly critical in interactive navigation, where robots must actively modify cluttered environments to create traversable paths. Existing VLM-based navigators are predominantly confined to passive obstacle avoidance and fail to reason about when and how to interact with objects to clear blocked paths. To bridge this gap, we propose Counterfactual Interactive Navigation with Skill-aware VLM (CoINS), a hierarchical framework that integrates skill-aware reasoning with robust low-level execution. Specifically, we fine-tune a VLM, named InterNav-VLM, which incorporates skill affordances and concrete constraint parameters into its input context and grounds them in a metric-scale environmental representation. By internalizing the logic of counterfactual reasoning through fine-tuning on the proposed InterNav dataset, the model learns to implicitly evaluate the causal effect of object removal on navigation connectivity, thereby determining both interaction necessity and target selection. To execute the generated high-level plans, we develop a comprehensive skill library through reinforcement learning, introducing traversability-oriented strategies to manipulate diverse objects for path clearance. We further propose a systematic benchmark in Isaac Sim to evaluate both the reasoning and execution aspects of interactive navigation. Extensive simulations and real-world experiments demonstrate that CoINS outperforms existing baselines, improving the success rate by 17% overall and by over 80% in complex long-horizon scenarios, while exhibiting robust generalization across diverse object categories and robot embodiments.
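The counterfactual logic described above can be made concrete with a small sketch: an object is a valid interaction target exactly when removing it flips the start-goal connectivity of the scene from blocked to traversable. The grid representation and the helpers `path_exists` and `counterfactual_targets` below are illustrative assumptions of ours, not the paper's implementation; the paper's point is that InterNav-VLM internalizes this check implicitly through fine-tuning rather than running an explicit planner at inference time.

```python
# Hypothetical sketch of counterfactual connectivity checking, assuming a
# 2D boolean occupancy grid (True = blocked). Not the released CoINS code.
from collections import deque
import numpy as np

def path_exists(occupancy: np.ndarray, start, goal) -> bool:
    """BFS over a 2D occupancy grid; returns True if start reaches goal."""
    h, w = occupancy.shape
    if occupancy[start] or occupancy[goal]:
        return False
    seen, frontier = {start}, deque([start])
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < h and 0 <= nc < w
                    and not occupancy[nr, nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return False

def counterfactual_targets(occupancy, object_masks, start, goal):
    """Objects whose removal flips the scene from blocked to traversable.

    object_masks: dict mapping object id -> boolean mask of its grid cells.
    """
    if path_exists(occupancy, start, goal):
        return []  # path already exists: passive navigation suffices
    targets = []
    for obj_id, mask in object_masks.items():
        hypothetical = occupancy & ~mask  # counterfactual: object removed
        if path_exists(hypothetical, start, goal):
            targets.append(obj_id)
    return targets
```

During dataset construction, labels of this kind could supervise both decisions the abstract names: interaction necessity (is the target list empty?) and target selection (which object to remove), leaving the fine-tuned VLM to approximate the check from egocentric observations alone.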
We introduce CoINS, a hierarchical interactive navigation framework that couples skill-aware VLM reasoning with an RL-based library of diverse interaction skills. Unlike traditional methods that rely on passive obstacle avoidance, CoINS enables the robot to actively modify the environment to clear paths in cluttered scenarios.
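To make the hierarchy concrete, here is a minimal sketch of the plan/execute loop, assuming hypothetical interfaces (`vlm.plan`, `skills.execute`, `robot.egocentric_rgb`) that stand in for the actual reasoning and skill-execution modules; none of these names come from the paper.

```python
# Hypothetical plan/execute loop for a hierarchical interactive navigator:
# a VLM produces high-level decisions from egocentric RGB plus embodiment
# constraints, and RL-trained skills execute them. Interfaces are assumed.
from dataclasses import dataclass

@dataclass
class Decision:
    skill: str           # e.g. "navigate", "push", "pick_and_place", "done"
    target: str | None   # object to interact with, if any
    params: dict         # skill-specific constraint parameters

def run_episode(vlm, skills, robot, goal, max_steps=50):
    """Alternate VLM reasoning with closed-loop RL skill execution."""
    for _ in range(max_steps):
        rgb = robot.egocentric_rgb()
        decision: Decision = vlm.plan(rgb, goal, robot.embodiment_constraints)
        if decision.skill == "done":
            return True
        # Each skill is a closed-loop RL policy trained for path clearance.
        skills.execute(decision.skill, decision.target, decision.params, robot)
    return False
```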
The CoINS framework. The VLM reasoning module takes the robot's egocentric RGB observations and embodiment constraints as input to produce high-level interaction and navigation decisions, which are then translated by the skill execution module into precise motion controls for diverse interaction primitives.
The video demonstrates the interactive navigation capabilities of CoINS in cluttered indoor scenarios across both simulation and real-world experiments, as well as its ability to interact with diverse objects to reconfigure the environment and clear traversable paths.
CoINS (Counterfactual Interactive Navigation with Skill-aware VLM) explores how vision-language models can reason about robot skills and counterfactual scenarios to guide interactive navigation. Project details, demos, and paper links will be posted here soon.