Vision–Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation

1Peking University, 2Xiaohongshu Inc.
* Corresponding Author

Abstract

Score Distillation Sampling (SDS) enables high-quality text-to-3D generation by supervising 3D models through the denoising of multi-view 2D renderings, using a pretrained text-to-image diffusion model to align the result with the input prompt and encourage 3D consistency. However, existing SDS-based methods face two fundamental limitations: (1) their reliance on CLIP-style text encoders yields only coarse semantic alignment and struggles with fine-grained prompts; and (2) 2D diffusion priors lack explicit 3D spatial constraints, resulting in geometric inconsistencies and inaccurate object relationships in multi-object scenes. To address these challenges, we propose VLM3D, a novel text-to-3D generation framework that integrates large vision–language models (VLMs) into the SDS pipeline as differentiable semantic and spatial priors. Unlike standard text-to-image diffusion priors, VLMs offer rich language-grounded supervision that enables fine-grained prompt alignment. Moreover, their joint vision–language modeling provides strong spatial understanding, which significantly enhances 3D consistency for single-object generation and improves relational reasoning in multi-object scenes. We instantiate VLM3D on the open-source Qwen2.5-VL model and evaluate it on the GPTEval3D benchmark. Experiments across diverse objects and complex scenes show that VLM3D significantly outperforms prior SDS-based methods in semantic fidelity, geometric coherence, and spatial correctness.

Method
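The full pipeline is described in the paper. As a rough illustration of the idea sketched in the abstract, the snippet below shows a minimal, hypothetical SDS-style update in which a differentiable VLM score on rendered views is added alongside the standard diffusion-guided term. All names (DummyRenderer, sds_grad, vlm_reward, lambda_vlm) are placeholder stand-ins, not the released implementation; a real system would plug in a NeRF/3D Gaussian renderer, a frozen 2D diffusion prior, and a frozen VLM such as Qwen2.5-VL.

```python
# Minimal, hypothetical sketch of an SDS update augmented with a differentiable
# VLM reward on rendered views. Every module below is a placeholder.
import torch
import torch.nn as nn


class DummyRenderer(nn.Module):
    """Stand-in for a differentiable 3D representation + renderer (e.g. NeRF / 3DGS)."""
    def __init__(self, n_views=4, res=64):
        super().__init__()
        # Learnable "scene" parameters; a real renderer would ray-march a radiance field.
        self.scene = nn.Parameter(torch.randn(n_views, 3, res, res) * 0.01)

    def forward(self):
        return torch.sigmoid(self.scene)  # (V, 3, H, W) renderings in [0, 1]


def sds_grad(images, prompt_embedding):
    """Placeholder for the standard SDS gradient from a frozen 2D diffusion prior."""
    # A real implementation adds noise, queries the denoiser, and returns
    # w(t) * (eps_pred - eps) as a gradient on the rendered images.
    return torch.zeros_like(images)  # zero stub keeps the sketch runnable


def vlm_reward(images, prompt_embedding):
    """Placeholder differentiable VLM score: higher = better semantic/spatial match."""
    # A real implementation would score renderings against the prompt with a
    # frozen VLM and backpropagate through that score.
    return -(images.mean(dim=(1, 2, 3)) - prompt_embedding.mean()).pow(2).sum()


renderer = DummyRenderer()
optimizer = torch.optim.Adam(renderer.parameters(), lr=1e-2)
prompt_embedding = torch.randn(128)  # stand-in for an encoded prompt
lambda_vlm = 0.1                     # assumed reward weight

for step in range(100):
    images = renderer()

    # (1) Standard SDS: inject the diffusion-prior gradient directly into the renderings.
    images.backward(gradient=sds_grad(images, prompt_embedding), retain_graph=True)

    # (2) VLM reward: maximize semantic/spatial alignment via ordinary backprop.
    loss_vlm = -lambda_vlm * vlm_reward(images, prompt_embedding)
    loss_vlm.backward()

    optimizer.step()
    optimizer.zero_grad()
```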


Video Results

Several large, solid, cube-shaped parcels

Orange monarch butterfly

A cat pondering the mysteries

Floating bonsai tree

Multi-layered wedding cake

Carved wooden bear

Sequence of street lamps

Golden retriever plush toy

A torn hat

Comparison with Baselines on the GPTEval3D Benchmark

VLM3D outperforms baseline methods in geometric consistency, 3D plausibility, texture richness, and text alignment.

Comparison with Baselines Based on MVDiffusion and Reward Models

VLM3D outperforms these reward-based methods in semantic fidelity while retaining high perceptual quality.

Sensitivity Analysis to Text Perturbations

VLM3D accurately adds or removes objects (first row), changes clothing color (second row), and updates spatial relations (third row), demonstrating stronger semantic understanding than the baselines.

Quantitative Results

Quantitative Results on 110 Prompts from the GPTEval3D Benchmark. We compute all six GPTEval3D metrics—text alignment, 3D plausibility, texture–geometry coherence, geometry details, texture details, and overall score—to comprehensively evaluate 3D generation quality. VLM3D achieves the highest score on every metric, demonstrating its superior performance.

Method            Alignment   Plausibility   T-G Coherency   Geo Details   Tex Details   Overall
DreamFusion          1000.0         1000.0          1000.0        1000.0        1000.0    1000.0
DreamGaussian        1100.6          953.6          1158.6        1126.2        1130.8     951.4
Fantasia3D           1067.9          891.9          1006.0        1109.3        1027.5     933.5
Instant3D            1200.0         1087.6          1152.7        1152.0        1181.3    1097.8
Latent-NeRF          1222.3         1144.8          1156.7        1180.5        1160.8    1178.7
Magic3D              1152.3         1000.8          1084.4        1178.1        1084.6     961.7
ProlificDreamer      1261.8         1058.7          1152.0        1246.4        1180.6    1012.5
SyncDreamer          1041.2          968.8          1083.1        1064.2        1045.7     963.5
MVDream              1270.5         1147.5          1250.6        1324.9        1255.5    1097.7
DreamReward [1]      1287.5         1195.0          1254.4        1295.5        1261.6    1193.3
DreamDPO             1298.9         1171.9          1276.4        1373.25       1296.9    1203.1
VLM3D (Ours)         1365.5         1293.7          1365.4        1419.0        1368.7    1268.6

[1] Our metrics differ from those reported in the original DreamReward paper because GPT-4V, which GPTEval3D originally relied on, has been deprecated; we use GPT-4o-mini instead.
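For context on how such scores are produced: GPTEval3D reports Elo-style ratings fit from pairwise comparisons judged by a GPT model, with DreamFusion serving as the 1000-point anchor (as the first row of the table reflects). The snippet below is a minimal, hypothetical sketch of that kind of Elo fitting; the constants (K = 32, a 400-point scale), the number of passes, and the function names are conventional illustrative choices, not the benchmark's exact implementation.

```python
# Hedged sketch of Elo-style rating fitting from pairwise judgments,
# anchored so a reference method stays at 1000.
def expected_score(r_a, r_b, scale=400.0):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def fit_elo(ratings, comparisons, k=32.0, anchor="DreamFusion",
            anchor_rating=1000.0, n_passes=50):
    """Fit ratings from (method_a, method_b, outcome) triples, outcome in {1, 0.5, 0}."""
    for _ in range(n_passes):
        for a, b, outcome in comparisons:
            e_a = expected_score(ratings[a], ratings[b])
            ratings[a] += k * (outcome - e_a)
            ratings[b] += k * ((1 - outcome) - (1 - e_a))
        # Re-anchor so the reference method remains at the anchor rating.
        shift = anchor_rating - ratings[anchor]
        for m in ratings:
            ratings[m] += shift
    return ratings


# Toy usage with hypothetical judgments (1 = first method wins the comparison).
ratings = {"DreamFusion": 1000.0, "MVDream": 1000.0, "VLM3D": 1000.0}
judgments = [("VLM3D", "DreamFusion", 1), ("VLM3D", "MVDream", 1), ("MVDream", "DreamFusion", 1)]
print(fit_elo(ratings, judgments))
```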

BibTeX

@article{weimin2025vlm3d,
  author    = {Bai, Weimin and Li, Yubo and Luo, Weijian and Chen, Wenzheng and Sun, He},
  title     = {Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation},
  journal   = {arXiv preprint arXiv:2509.15772},
  year      = {2025},
}