ECCV 2026 · Vision-Language-Action Models

Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

Ninghao Zhang · Bin Zhu · Shijie Zhou · Jingjing Chen

IGAR exposes and mitigates linguistic blindness: VLA policies can keep executing visually plausible actions even when the language instruction contradicts the scene.

Paper · Code (coming soon) · Video

Abstract

When language contradicts the scene, should the robot still act?

We study a reliability failure in modern Vision-Language-Action models: under out-of-distribution contradictory instructions, policies often ignore instruction semantics and execute actions supported by visual priors. We introduce ICBench, a controlled diagnostic benchmark built from LIBERO, and propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time intervention that shifts attention back toward instruction tokens without retraining or architecture changes.

01

Linguistic Blindness

Contradictory commands reveal policies that follow visual shortcuts instead of language.

02

ICBench

Controlled instruction contradictions isolate language-action coupling across 30 LIBERO tasks.

03

Train-Free IGAR

Attention recalibration runs during inference and plugs into existing transformer-based VLAs.

04

Real Robot Check

Franka experiments show IGAR prevents manipulation triggered by inconsistent instructions.

Normal instructions succeed while contradictory instructions produce visually plausible but instruction-inconsistent executions.
VLA policies can complete the same visually plausible trajectory even when the modified instruction is logically incompatible with the scene.

Method

Instruction-Guided Attention Recalibration

IGAR detects sink tokens through hidden-state spikes, selects grounding heads with cross-modal imbalance, and redistributes attention toward non-sink instruction tokens. The intervention operates entirely inside the forward pass.

Overview of the IGAR framework with sink token detection, grounding head selection, and attention redistribution.
IGAR is a plug-and-play inference-time module for restoring language influence during action generation.
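
The three steps named above (sink detection, grounding-head selection, attention redistribution) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the thresholds `spike_ratio` and `imbalance_ratio`, and the reading of "text-sink decay" as a multiplicative factor and "visual-sink bound" as a cap are all assumptions made for illustration. It operates on one action query's attention row per head and keeps each row's total mass unchanged.

```python
import numpy as np

def detect_sink_tokens(hidden, spike_ratio=20.0):
    """Flag tokens whose peak hidden-state magnitude spikes far above the median token.

    hidden: (T, D) hidden states of T context tokens. spike_ratio is a placeholder value.
    """
    peak = np.abs(hidden).max(axis=-1)
    return peak > spike_ratio * np.median(peak)

def select_grounding_heads(attn, text_mask, imbalance_ratio=2.0):
    """Assumed criterion: pick heads whose visual mass dwarfs their instruction mass.

    attn: (H, T) attention of one action query over the context, one row per head.
    """
    text_mass = attn[:, text_mask].sum(axis=-1)
    visual_mass = attn[:, ~text_mask].sum(axis=-1)
    return visual_mass > imbalance_ratio * text_mass

def recalibrate(attn, text_mask, sink_mask, head_mask,
                text_decay=0.5, visual_bound=0.2):
    """Shrink sink attention in selected heads and hand the freed mass
    to non-sink instruction tokens, in proportion to their current weights."""
    out = attn.copy()
    target = text_mask & ~sink_mask      # non-sink instruction tokens
    text_sink = text_mask & sink_mask
    visual_sink = ~text_mask & sink_mask
    if not target.any():
        return out
    for h in np.flatnonzero(head_mask):
        row = out[h]                     # view into out, edited in place
        before = row.sum()
        row[text_sink] *= text_decay                               # decay text sinks
        row[visual_sink] = np.minimum(row[visual_sink], visual_bound)  # cap visual sinks
        freed = before - row.sum()
        w = row[target]
        share = w / w.sum() if w.sum() > 0 else np.full(w.size, 1.0 / w.size)
        row[target] += freed * share
    return out

# Toy example: tokens 0-2 are visual, tokens 3-5 are instruction; tokens 0 and 3 act as sinks.
text_mask = np.array([False, False, False, True, True, True])
hidden = np.ones((6, 4))
hidden[[0, 3], 0] = 100.0
attn = np.array([[0.6, 0.1, 0.1, 0.1, 0.05, 0.05],   # vision-dominated head
                 [0.1, 0.1, 0.1, 0.1, 0.30, 0.30]])  # already language-heavy head
sink_mask = detect_sink_tokens(hidden)
head_mask = select_grounding_heads(attn, text_mask)
out = recalibrate(attn, text_mask, sink_mask, head_mask)
```

In this toy case only the first head is selected: its two non-sink instruction tokens rise from 0.05 to 0.275 each, while the second head is left untouched. In a real VLA the same operation would have to be applied inside the attention computation of the chosen layers, which this sketch does not attempt.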

Video

Supplementary video

The supplementary video summarizes the ICBench setting, simulated rollouts, and real-world behavior under normal and contradictory instructions.

Results

Recalibration restores language-sensitive behavior

Attention visualizations, sensitivity analyses, and real-world evaluations show that IGAR reduces spurious execution under contradictory instructions while preserving normal task behavior.
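
To make the comparison concrete, the sketch below tallies task-completion rates for the four conditions (normal or contradictory instruction, with or without IGAR). Under a contradictory instruction, a completed task counts as spurious execution, so a lower rate is better there, while the rate under normal instructions should stay high. The record format, the condition labels, and the outcomes are made up for illustration and are not results from the paper.

```python
from collections import defaultdict

def completion_rates(rollouts):
    """Return the fraction of completed rollouts per (instruction_type, policy).

    rollouts: iterable of (instruction_type, policy, completed) tuples,
    a hypothetical record format chosen for this example.
    """
    totals = defaultdict(int)
    completed = defaultdict(int)
    for instruction_type, policy, done in rollouts:
        key = (instruction_type, policy)
        totals[key] += 1
        completed[key] += int(done)
    return {key: completed[key] / totals[key] for key in totals}

# Made-up outcomes, NOT results from the paper.
toy_rollouts = [
    ("normal", "baseline", True), ("normal", "baseline", True),
    ("normal", "igar", True), ("normal", "igar", True),
    ("contradictory", "baseline", True), ("contradictory", "baseline", False),
    ("contradictory", "igar", False), ("contradictory", "igar", False),
]
rates = completion_rates(toy_rollouts)
```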

Attention heatmaps with and without IGAR under normal and contradictory instructions.
IGAR redirects attention toward instruction-relevant objects and spatial regions.
Hyperparameter sensitivity curves for text-sink decay factor, visual-sink bound, and intervened layers.
Hyperparameter sweeps identify stable settings for linguistic grounding.
Real-world Franka robot experiments comparing baseline Pi0 and Pi0 with IGAR.
On a real Franka arm, IGAR turns spurious successes under contradictory instructions into the failures those instructions warrant.
Supplementary heatmap comparison across normal and contradictory instruction cases.
Supplementary visualizations show the same grounding trend across more scenes.

Demos

ICBench rollout gallery

Each task compares normal and contradictory instructions, with and without IGAR. Videos are loaded from metadata first so the page remains usable on GitHub Pages.

Citation

BibTeX

@misc{zhang2026restoring,
  title={Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration},
  author={Ninghao Zhang and Bin Zhu and Shijie Zhou and Jingjing Chen},
  year={2026},
  eprint={2603.06001},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2603.06001}
}