* equal contribution, † corresponding author, ‡ equal supervision
Be careful: your 3D-LLM may not actually understand 3D; it might just be guessing without seeing.
We find that plain LLMs with zero 3D input can match full 3D-LLMs on existing benchmarks, exposing a fundamental flaw in how we evaluate 3D spatial reasoning.
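As a minimal sketch of this kind of check (not the paper's actual evaluation code), the snippet below scores a text-only LLM on a QA benchmark while withholding every 3D input; the `ask_llm` callable and the benchmark item format are hypothetical stand-ins.

```python
# Minimal sketch of a "blind" language-only baseline (all names hypothetical).
# Each benchmark question is sent to a text-only LLM with no point cloud or
# images attached, and accuracy is measured by exact string match.
from typing import Callable

def blind_llm_accuracy(benchmark: list[dict], ask_llm: Callable[[str], str]) -> float:
    """benchmark: list of {"question": str, "answer": str} items; no 3D data is passed."""
    correct = 0
    for item in benchmark:
        prompt = (
            "Answer the question about an indoor scene as concisely as possible.\n"
            f"Question: {item['question']}\nAnswer:"
        )
        pred = ask_llm(prompt).strip().lower()
        correct += int(pred == item["answer"].strip().lower())
    return correct / max(len(benchmark), 1)
```

If such a blind baseline lands close to a full 3D-LLM, the benchmark is largely answerable from language priors alone.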
Our findings point to several important open directions for the 3D vision-language community.
True 3D comprehension must be situated: grounded in the observer's egocentric perspective, not just a scene-level overview. Future benchmarks should evaluate understanding from specific viewpoints and positions within the scene.
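One ingredient of such situated evaluation is re-expressing the scene in the observer's frame. Below is a minimal sketch, assuming the observer's position and yaw are given; the function name and conventions (forward = +y, gravity = z) are illustrative, not from the paper.

```python
# Minimal sketch: re-express a world-frame scene in an observer's egocentric
# frame (x = right, y = forward, z = up). heading_rad is the observer's yaw
# about the gravity (z) axis, with 0 meaning the observer faces world +y.
import numpy as np

def to_egocentric(points_world: np.ndarray, observer_pos: np.ndarray, heading_rad: float) -> np.ndarray:
    """points_world: (N, 3) xyz in the scene frame; returns (N, 3) xyz in the observer frame."""
    c, s = np.cos(heading_rad), np.sin(heading_rad)
    # Inverse yaw rotation: maps the observer's forward direction onto +y.
    rot = np.array([[  c,   s, 0.0],
                    [ -s,   c, 0.0],
                    [0.0, 0.0, 1.0]])
    return (points_world - observer_pos) @ rot.T
```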
Current model architectures and training pipelines include no explicit mechanism for rotation robustness. Introducing rotation-aware representations, data augmentation, or equivariant architectures is an important step toward genuinely rotation-invariant 3D reasoning.
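As one concrete example of the data-augmentation route, the sketch below applies a random yaw rotation about the gravity axis to the scene point cloud during training; the function name and conventions are illustrative, not from any specific codebase.

```python
# Minimal sketch of yaw augmentation: randomly rotate the scene about the
# gravity (z) axis so the model cannot rely on a canonical orientation.
# Any coordinates referenced in the supervision (e.g., object positions in
# the answer) would need to be rotated by the same yaw to stay consistent.
import numpy as np

def random_yaw_augment(points: np.ndarray, rng: np.random.Generator) -> tuple[np.ndarray, float]:
    """points: (N, 3) xyz. Returns the rotated cloud and the sampled yaw (radians)."""
    yaw = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[  c,  -s, 0.0],
                    [  s,   c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T, yaw
```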
Similar shortcut learning and spatial inconsistency issues exist in 3D grounding and captioning. However, current benchmarks in these areas lack situated (egocentric) descriptions. Building situation-aware evaluation for these tasks is a promising and necessary direction.
Language plays a dual role: it can be a shortcut that inflates performance, but it also provides valuable world priors and common-sense knowledge. How to properly leverage linguistic knowledge while preventing it from bypassing genuine 3D reasoning remains a key open challenge.
@inproceedings{ma2026real3dqa,
title={Do 3D Large Language Models Really Understand 3D Spatial Relationships?},
author={Xianzheng Ma and Tao Sun and Shuai Chen and Yash Bhalgat and Jindong Gu and Angel X Chang and Iro Armeni and Iro Laina and Songyou Peng and Victor Adrian Prisacariu},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=3vlMiJwo8b}
}