* equal contribution, † corresponding author, ‡ equal supervision
Be careful: your 3D-LLM may not actually understand 3D; it might just be guessing without seeing.
We find that plain LLMs with zero 3D input can match full 3D-LLMs on existing benchmarks, exposing a fundamental flaw in how we evaluate 3D spatial reasoning.
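As a minimal sketch of this kind of check (not the paper's actual evaluation code), the snippet below scores a text-only LLM on a QA benchmark while withholding every 3D input; the `ask_llm` callable and the benchmark item format are hypothetical stand-ins.

```python
# Minimal sketch of a "blind" language-only baseline (all names hypothetical).
# Each benchmark question is sent to a text-only LLM with no point cloud or
# images attached, and accuracy is measured by exact string match.
from typing import Callable

def blind_llm_accuracy(benchmark: list[dict], ask_llm: Callable[[str], str]) -> float:
    """benchmark: list of {"question": str, "answer": str} items; no 3D data is passed."""
    correct = 0
    for item in benchmark:
        prompt = (
            "Answer the question about an indoor scene as concisely as possible.\n"
            f"Question: {item['question']}\nAnswer:"
        )
        pred = ask_llm(prompt).strip().lower()
        correct += int(pred == item["answer"].strip().lower())
    return correct / max(len(benchmark), 1)
```

If such a blind baseline lands close to a full 3D-LLM, the benchmark is largely answerable from language priors alone.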
Our findings point to several important open directions for the 3D vision-language community.
True 3D comprehension must be situated: grounded in the observer's egocentric perspective, not just a scene-level overview. Future benchmarks should evaluate understanding from specific viewpoints and positions within the scene.
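One ingredient of such situated evaluation is re-expressing the scene in the observer's frame. Below is a minimal sketch, assuming the observer's position and yaw are given; the function name and conventions (forward = +y, gravity = z) are illustrative, not from the paper.

```python
# Minimal sketch: re-express a world-frame scene in an observer's egocentric
# frame (x = right, y = forward, z = up). heading_rad is the observer's yaw
# about the gravity (z) axis, with 0 meaning the observer faces world +y.
import numpy as np

def to_egocentric(points_world: np.ndarray, observer_pos: np.ndarray, heading_rad: float) -> np.ndarray:
    """points_world: (N, 3) xyz in the scene frame; returns (N, 3) xyz in the observer frame."""
    c, s = np.cos(heading_rad), np.sin(heading_rad)
    # Inverse yaw rotation: maps the observer's forward direction onto +y.
    rot = np.array([[  c,   s, 0.0],
                    [ -s,   c, 0.0],
                    [0.0, 0.0, 1.0]])
    return (points_world - observer_pos) @ rot.T
```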
Current model architectures and training pipelines include no explicit mechanism for rotation robustness. Introducing rotation-aware representations, data augmentation, or equivariant architectures is an important step toward genuinely rotation-invariant 3D reasoning.
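As one concrete example of the data-augmentation route, the sketch below applies a random yaw rotation about the gravity axis to the scene point cloud during training; the function name and conventions are illustrative, not from any specific codebase.

```python
# Minimal sketch of yaw augmentation: randomly rotate the scene about the
# gravity (z) axis so the model cannot rely on a canonical orientation.
# Any coordinates referenced in the supervision (e.g., object positions in
# the answer) would need to be rotated by the same yaw to stay consistent.
import numpy as np

def random_yaw_augment(points: np.ndarray, rng: np.random.Generator) -> tuple[np.ndarray, float]:
    """points: (N, 3) xyz. Returns the rotated cloud and the sampled yaw (radians)."""
    yaw = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[  c,  -s, 0.0],
                    [  s,   c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T, yaw
```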
Similar shortcut learning and spatial inconsistency issues exist in 3D grounding and captioning. However, current benchmarks in these areas lack situated (egocentric) descriptions. Building situation-aware evaluation for these tasks is a promising and necessary direction.
Language plays a dual role: it can be a shortcut that inflates performance, but it also provides valuable world priors and common-sense knowledge. How to properly leverage linguistic knowledge while preventing it from bypassing genuine 3D reasoning remains a key open challenge.
@inproceedings{ma2026real3dqa,
title={Do 3D Large Language Models Really Understand 3D Spatial Relationships?},
author={Xianzheng Ma and Tao Sun and Shuai Chen and Yash Bhalgat and Jindong Gu and Angel X Chang and Iro Armeni and Iro Laina and Songyou Peng and Victor Adrian Prisacariu},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=3vlMiJwo8b}
}