[Gallery of text-to-3D results for the prompts "Several large, solid, cube-shaped parcels", "Orange monarch butterfly", "A cat pondering the mysteries", "Floating bonsai tree", "Multi-layered wedding cake", "Carved wooden bear", "Sequence of street lamps", "Golden retriever plush toy", and "A torn hat".]
Score Distillation Sampling (SDS) enables high-quality text-to-3D generation by supervising 3D models through the denoising of multi-view 2D renderings, using a pretrained text-to-image diffusion model to align with the input prompt and ensure 3D consistency. However, existing SDS-based methods face two fundamental limitations: (1) their reliance on CLIP-style text encoders leads to coarse semantic alignment and poor handling of fine-grained prompts; and (2) 2D diffusion priors lack explicit 3D spatial constraints, resulting in geometric inconsistencies and inaccurate object relationships in multi-object scenes. To address these challenges, we propose VLM3D, a novel text-to-3D generation framework that integrates large vision–language models (VLMs) into the SDS pipeline as differentiable semantic and spatial priors. Unlike standard text-to-image diffusion priors, VLMs leverage rich language-grounded supervision that enables fine-grained prompt alignment. Moreover, their inherent vision–language modeling provides strong spatial understanding, which significantly enhances 3D consistency for single-object generation and improves relational reasoning in multi-object scenes. We instantiate VLM3D on the open-source Qwen2.5-VL model and evaluate it on the GPTEval3D benchmark. Experiments across diverse objects and complex scenes show that VLM3D significantly outperforms prior SDS-based methods in semantic fidelity, geometric coherence, and spatial correctness.
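To make the idea concrete, here is a minimal, hypothetical sketch of an SDS objective augmented with a differentiable VLM reward, in the spirit described above. The names `render_fn`, `diffusion_eps`, and `vlm_reward` (e.g., the log-probability that a Qwen2.5-VL-style model answers "yes" to "Does this image match the prompt?") are stand-in assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of an SDS loss plus a differentiable
# VLM reward. `render_fn`, `diffusion_eps`, and `vlm_reward` are hypothetical
# stand-in callables.
import torch

def sds_plus_vlm_loss(render_fn, diffusion_eps, vlm_reward,
                      text_emb, prompt, camera, alphas_cumprod, lam=0.1):
    """Loss for one rendered view of the 3D model being optimized.

    render_fn     : differentiable renderer, camera -> (3, H, W) image in [0, 1]
    diffusion_eps : frozen pretrained noise predictor eps(x_t, t, text_emb)
    vlm_reward    : frozen, differentiable VLM score of image/prompt alignment
    """
    img = render_fn(camera)                          # gradients flow to 3D params
    x0 = img.unsqueeze(0) * 2.0 - 1.0                # rescale to [-1, 1]

    # --- standard SDS term ---------------------------------------------------
    t = torch.randint(20, 980, (1,), device=x0.device)
    a_t = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * noise
    with torch.no_grad():                            # SDS skips the U-Net Jacobian
        eps_hat = diffusion_eps(x_t, t, text_emb)
    w = 1.0 - a_t                                    # a common weighting choice
    grad = w * (eps_hat - noise)
    sds_loss = (grad.detach() * x0).sum()            # surrogate: d(loss)/dx0 = grad

    # --- differentiable VLM reward term ---------------------------------------
    # Unlike the SDS term, gradients here pass through the VLM itself and
    # back into the rendered pixels.
    reward = vlm_reward(img, prompt)
    return sds_loss - lam * reward                   # maximize the reward
```

Because the reward is computed by a full VLM rather than a CLIP-style text encoder, it can penalize fine-grained mismatches (object counts, colors, spatial relations) that the diffusion prior alone tends to miss.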

VLM3D outperforms baseline methods in geometric consistency, 3D plausibility, texture richness, and text alignment.
VLM3D outperforms these reward-based methods in semantic fidelity while retaining high perceptual quality.
VLM3D accurately adds or removes objects (first row), changes clothing color (second row), and updates spatial relations (third row), demonstrating stronger semantic understanding than the baselines.
Quantitative Results on 110 Prompts from the GPTEval3D Benchmark. We compute all six GPTEval3D metrics—text alignment, 3D plausibility, texture–geometry coherence, geometry details, texture details, and overall score—to comprehensively evaluate 3D generation quality. VLM3D achieves the highest score on every metric, demonstrating its superior performance.
| Method | Alignment | Plausibility | T-G Coherency | Geo Details | Tex Details | Overall |
|---|---|---|---|---|---|---|
| DreamFusion | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| DreamGaussian | 1100.6 | 953.6 | 1158.6 | 1126.2 | 1130.8 | 951.4 |
| Fantasia3D | 1067.9 | 891.9 | 1006.0 | 1109.3 | 1027.5 | 933.5 |
| Instant3D | 1200.0 | 1087.6 | 1152.7 | 1152.0 | 1181.3 | 1097.8 |
| Latent-NeRF | 1222.3 | 1144.8 | 1156.7 | 1180.5 | 1160.8 | 1178.7 |
| Magic3D | 1152.3 | 1000.8 | 1084.4 | 1178.1 | 1084.6 | 961.7 |
| ProlificDreamer | 1261.8 | 1058.7 | 1152.0 | 1246.4 | 1180.6 | 1012.5 |
| SyncDreamer | 1041.2 | 968.8 | 1083.1 | 1064.2 | 1045.7 | 963.5 |
| MVDream | 1270.5 | 1147.5 | 1250.6 | 1324.9 | 1255.5 | 1097.7 |
| DreamReward¹ | 1287.5 | 1195.0 | 1254.4 | 1295.5 | 1261.6 | 1193.3 |
| DreamDPO | 1298.9 | 1171.9 | 1276.4 | 1373.2 | 1296.9 | 1203.1 |
| VLM3D (Ours) | 1365.5 | 1293.7 | 1365.4 | 1419.0 | 1368.7 | 1268.6 |
¹ Our metrics differ from those reported in the original DreamReward paper because GPT-4V, the judge originally used by GPTEval3D, has been deprecated; we use GPT-4o-mini instead.
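For context on the numbers above: each score is an Elo-style rating derived from pairwise comparisons of generated 3D assets, judged per criterion by a GPT model, with DreamFusion serving as the 1000-point anchor. The snippet below is a simplified, hypothetical illustration (with toy outcomes) of how such ratings arise from pairwise wins and losses; it is not GPTEval3D's actual estimator, which fits ratings to all judged pairs jointly.

```python
# Simplified illustration of how Elo-style ratings like those in the table
# emerge from pairwise comparisons. GPTEval3D fits ratings to all judged
# pairs jointly; this sequential update is only a sketch with toy data.
from collections import defaultdict

def elo_update(r_a, r_b, score_a, k=32.0):
    """score_a: 1.0 if A wins the judged comparison, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

ratings = defaultdict(lambda: 1000.0)      # every method starts at the anchor
comparisons = [                            # (method_a, method_b, outcome) -- toy data
    ("VLM3D", "DreamFusion", 1.0),
    ("MVDream", "DreamFusion", 1.0),
    ("VLM3D", "MVDream", 1.0),
]
for a, b, s in comparisons:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], s)
print(dict(ratings))
```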
@article{weimin2025vlm3d,
  author  = {Bai, Weimin and Li, Yubo and Luo, Weijian and Chen, Wenzheng and Sun, He},
  title   = {Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation},
  journal = {arXiv preprint arXiv:2509.15772},
  year    = {2025},
}