Let Language Constrain Geometry:
Vision–Language Models as Semantic and Spatial Critics for 3D Generation

1Peking University   2Xiaohongshu Inc.   3MMLab, CUHK   4BAAI, Beijing
Corresponding Author

Method


Abstract

Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they struggle with coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM's "Yes/No" log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) as a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks; (2) as a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling process of state-of-the-art native 3D models to correct severe spatial errors. VLM3D establishes a principled and generalizable path to inject the VLM's rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
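The critic signal itself is simple to express in code. Below is a minimal sketch of the dual-query "Yes/No" log-odds reward described above; the query wording, the equal weighting, and the stub VLM (a linear head standing in for a real vision-language model so the example runs end to end) are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def yes_no_log_odds(next_token_logits: torch.Tensor,
                    yes_id: int, no_id: int) -> torch.Tensor:
    # Differentiable critic score from the VLM's next-token logits:
    # log p("Yes") - log p("No"); positive means the VLM leans "Yes".
    logp = F.log_softmax(next_token_logits, dim=-1)
    return logp[..., yes_id] - logp[..., no_id]

def dual_query_reward(vlm_logits_fn, views: torch.Tensor, prompt: str,
                      yes_id: int, no_id: int, w_geo: float = 1.0) -> torch.Tensor:
    # One query probes semantic fidelity to the prompt, the other probes
    # geometric coherence of the rendered object (wording is illustrative).
    sem_q = f'Do these renders faithfully depict: "{prompt}"? Answer Yes or No.'
    geo_q = ('Is the 3D geometry coherent across views, with no duplicated '
             'faces, floating parts, or fractured surfaces? Answer Yes or No.')
    r_sem = yes_no_log_odds(vlm_logits_fn(views, sem_q), yes_id, no_id)
    r_geo = yes_no_log_odds(vlm_logits_fn(views, geo_q), yes_id, no_id)
    return (r_sem + w_geo * r_geo).mean()

# Toy stand-in for a VLM so the sketch runs: a linear map from pooled
# pixels to a 4-token vocabulary; gradients flow back into `views`.
vocab, YES, NO = 4, 0, 1
head = torch.nn.Linear(3, vocab)
stub = lambda views, q: head(views.mean(dim=(-2, -1)))   # (B, vocab) logits

views = torch.rand(4, 3, 64, 64, requires_grad=True)     # four multi-view renders
reward = dual_query_reward(stub, views, "a spotted ladybug on a leaf", YES, NO)
reward.backward()                                         # d(reward)/d(pixels) exists
```

Because the score is an ordinary scalar built from logits, it can back-propagate through a differentiable renderer into 3D parameters (paradigm 1) or into a sampler's intermediate state (paradigm 2).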

Video Results

Sailor Animation

A Navy sailor in a dark uniform—cap, jacket, and trousers—leaning forward at the waist to press his lips against a nurse in a crisp white dress; his left hand cradles the back of her head and his right arm wraps around her waist, while she arches backward in a dramatic dip with one leg lifted and her arms outstretched for balance.

An ancient, weathered statue, now covered in a blanket of moss

An ice cream scoop that serves up scoops of cloud fluff instead of ice cream

Spotted ladybug crawling on a green leaf

A bedside table

A carry-on bag

An electric grill

Comparison with Baselines on the GPTEval3D Benchmark

VLM3D outperforms baseline methods in geometric consistency, 3D plausibility, texture richness, and text alignment.

Comparison with Baselines Based on MVDiffusion and Reward Models

VLM3D outperforms these reward-guided methods in semantic fidelity while retaining high perceptual quality.

Comparison with Feed-Forward Baselines

VLM3D outperforms feed-forward baselines in semantic fidelity while retaining high perceptual quality.
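In the feed-forward setting, the abstract describes steering the sampler at test time rather than fine-tuning the generator. Below is a schematic of one plausible guidance step, in the spirit of classifier guidance; `denoise_step`, `render`, and `critic` are hypothetical placeholders for the native 3D model's update, a differentiable renderer, and the Yes/No log-odds critic, not the released API.

```python
import torch

def guided_sampling_step(x, t, denoise_step, render, critic, scale=1.0):
    # One sampler iteration with critic guidance: apply the model's usual
    # denoising update, then nudge the intermediate state uphill on the
    # differentiable VLM critic score of the current renders.
    x = denoise_step(x, t).detach().requires_grad_(True)
    score = critic(render(x))               # scalar Yes/No log-odds reward
    grad, = torch.autograd.grad(score, x)
    return (x + scale * grad).detach()
```

Repeating this nudge at every denoising step lets the critic correct spatial errors, such as misassembled parts, without retraining the generator.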

Sensitivity Analysis to Text Perturbations

VLM3D accurately changes clothing color (first row) and updates spatial relations (second row), demonstrating stronger semantic understanding than the baselines.

Ablation of Geometric Query and Multi-View Input

We assess the impact of (a) removing the explicit geometry-consistency query from the VLM prompt and (b) using a single view instead of multi-view images. Omitting either component degrades 3D quality—leading to Janus-face artifacts, floating parts, and fractured surfaces.

Each row uses a different diffusion backbone: the top employs Stable Diffusion 2.1, while the bottom uses MVDream.
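Concretely, the two ablation settings correspond to two toggles on the critic's inputs. The sketch below uses hypothetical names and illustrative query wording, mirroring the reward sketch above.

```python
def build_critic_inputs(prompt, views, use_geo_query=True, multi_view=True):
    # (a) with/without the explicit geometry-consistency query;
    # (b) multi-view renders vs. a single view. Disabling either removes
    # the cross-view spatial evidence the critic needs to flag Janus
    # faces, floating parts, and fractured surfaces.
    queries = [f'Do these renders faithfully depict: "{prompt}"? Answer Yes or No.']
    if use_geo_query:
        queries.append('Is the 3D geometry consistent across views, with no '
                       'duplicated faces or floating parts? Answer Yes or No.')
    if not multi_view:
        views = views[:1]                    # keep only a single render
    return queries, views
```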

Quantitative Results

Quantitative Results on 110 Prompts from the GPTEval3D Benchmark. We compute all six GPTEval3D metrics—text alignment, 3D plausibility, texture–geometry coherence, geometry details, texture details, and overall score—to comprehensively evaluate 3D generation quality. VLM3D achieves the highest score on every metric, demonstrating its superior performance.

Method              Alignment   Plausibility   T-G Coherency   Geo Details   Tex Details   Overall
DreamFusion         1000.0      1000.0         1000.0          1000.0        1000.0        1000.0
DreamGaussian       1100.6       953.6         1158.6          1126.2        1130.8         951.4
Fantasia3D          1067.9       891.9         1006.0          1109.3        1027.5         933.5
Instant3D           1200.0      1087.6         1152.7          1152.0        1181.3        1097.8
Latent-NeRF         1222.3      1144.8         1156.7          1180.5        1160.8        1178.7
Magic3D             1152.3      1000.8         1084.4          1178.1        1084.6         961.7
ProlificDreamer     1261.8      1058.7         1152.0          1246.4        1180.6        1012.5
SyncDreamer         1041.2       968.8         1083.1          1064.2        1045.7         963.5
MVDream             1270.5      1147.5         1250.6          1324.9        1255.5        1097.7
DreamReward¹        1287.5      1195.0         1254.4          1295.5        1261.6        1193.3
DreamDPO            1298.9      1171.9         1276.4          1373.25       1296.9        1203.1
VLM3D (Ours)        1365.5      1293.7         1365.4          1419.0        1368.7        1268.6

¹ Our metrics differ from those reported in the original DreamReward paper because GPT-4V, the judge originally used by GPTEval3D, has been deprecated; we use GPT-4o-mini instead.

Quantitative Results for feed-forward pipelines on 24 prompts. VLM3D boosts both semantic and geometric quality.

Method          CLIP-D ↓   FID ↓   CLIP-FID ↓   Geo. ↓
CLAY            0.22       310.3   53.11        0.63
Hunyuan3D       0.23       338.6   54.01        0.58
VLM3D (Ours)    0.19       274.9   45.79        0.49

Additional VLM3D Results

Additional results generated by our optimization-based VLM3D.

Additional results generated by our feed-forward-based VLM3D.

BibTeX

@misc{bai2025letlanguageconstraingeometry,
      title={Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation},
      author={Weimin Bai and Yubo Li and Weijian Luo and Zeqiang Lai and Yequan Wang and Wenzheng Chen and He Sun},
      year={2025},
      eprint={2511.14271},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.14271},
}