Vision–Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation

1Peking University, 2Xiaohongshu Inc.
* Corresponding Author

Abstract

Score Distillation Sampling (SDS) enables high-quality text-to-3D generation by supervising 3D models through the denoising of multi-view 2D renderings, using a pretrained text-to-image diffusion model to align the result with the input prompt and encourage 3D consistency. However, existing SDS-based methods face two fundamental limitations: (1) their reliance on CLIP-style text encoders yields only coarse semantic alignment and struggles with fine-grained prompts; and (2) 2D diffusion priors lack explicit 3D spatial constraints, resulting in geometric inconsistencies and inaccurate object relationships in multi-object scenes. To address these challenges, we propose VLM3D, a novel text-to-3D generation framework that integrates large vision–language models (VLMs) into the SDS pipeline as differentiable semantic and spatial priors. Unlike standard text-to-image diffusion priors, VLMs offer rich language-grounded supervision that enables fine-grained prompt alignment. Moreover, their joint vision–language modeling provides strong spatial understanding, which significantly enhances 3D consistency for single-object generation and improves relational reasoning in multi-object scenes. We instantiate VLM3D on the open-source Qwen2.5-VL model and evaluate it on the GPTEval3D benchmark. Experiments across diverse objects and complex scenes show that VLM3D significantly outperforms prior SDS-based methods in semantic fidelity, geometric coherence, and spatial correctness.

Method
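The full pipeline is described in the paper. As a rough illustration of the idea sketched in the abstract, the snippet below shows a minimal, hypothetical SDS-style update in which a differentiable VLM score on rendered views is added alongside the standard diffusion-guided term. All names (DummyRenderer, sds_grad, vlm_reward, lambda_vlm) are placeholder stand-ins, not the released implementation; a real system would plug in a NeRF/3D Gaussian renderer, a frozen 2D diffusion prior, and a frozen VLM such as Qwen2.5-VL.

```python
# Minimal, hypothetical sketch of an SDS update augmented with a differentiable
# VLM reward on rendered views. Every module below is a placeholder.
import torch
import torch.nn as nn


class DummyRenderer(nn.Module):
    """Stand-in for a differentiable 3D representation + renderer (e.g. NeRF / 3DGS)."""
    def __init__(self, n_views=4, res=64):
        super().__init__()
        # Learnable "scene" parameters; a real renderer would ray-march a radiance field.
        self.scene = nn.Parameter(torch.randn(n_views, 3, res, res) * 0.01)

    def forward(self):
        return torch.sigmoid(self.scene)  # (V, 3, H, W) renderings in [0, 1]


def sds_grad(images, prompt_embedding):
    """Placeholder for the standard SDS gradient from a frozen 2D diffusion prior."""
    # A real implementation adds noise, queries the denoiser, and returns
    # w(t) * (eps_pred - eps) as a gradient on the rendered images.
    return torch.zeros_like(images)  # zero stub keeps the sketch runnable


def vlm_reward(images, prompt_embedding):
    """Placeholder differentiable VLM score: higher = better semantic/spatial match."""
    # A real implementation would score renderings against the prompt with a
    # frozen VLM and backpropagate through that score.
    return -(images.mean(dim=(1, 2, 3)) - prompt_embedding.mean()).pow(2).sum()


renderer = DummyRenderer()
optimizer = torch.optim.Adam(renderer.parameters(), lr=1e-2)
prompt_embedding = torch.randn(128)  # stand-in for an encoded prompt
lambda_vlm = 0.1                     # assumed reward weight

for step in range(100):
    images = renderer()

    # (1) Standard SDS: inject the diffusion-prior gradient directly into the renderings.
    images.backward(gradient=sds_grad(images, prompt_embedding), retain_graph=True)

    # (2) VLM reward: maximize semantic/spatial alignment via ordinary backprop.
    loss_vlm = -lambda_vlm * vlm_reward(images, prompt_embedding)
    loss_vlm.backward()

    optimizer.step()
    optimizer.zero_grad()
```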


Video Results

Several large, solid, cube-shaped parcels

Orange monarch butterfly

A cat pondering the mysteries

Floating bonsai tree

Multi-layered wedding cake

Carved wooden bear

Sequence of street lamps

Golden retriever plush toy

A torn hat

Comparison with Baselines on the GPTEval3D Benchmark

VLM3D outperforms baseline methods in geometric consistency, 3D plausibility, texture richness, and text alignment.

Comparison with Baselines Based on MVDiffusion and Reward Models

VLM3D outperforms these reward-based methods in semantic fidelity while retaining high perceptual quality.

Sensitivity Analysis to Text Perturbations

VLM3D accurately adds or removes objects (first row), changes clothing color (second row), and updates spatial relations (third row), demonstrating stronger semantic understanding than the baselines.

Quantitative Results

Quantitative Results on 110 Prompts from the GPTEval3D Benchmark. We compute all six GPTEval3D metrics—text alignment, 3D plausibility, texture–geometry coherence, geometry details, texture details, and overall score—to comprehensively evaluate 3D generation quality. VLM3D achieves the highest score on every metric, demonstrating its superior performance.

Method            Alignment   Plausibility   T-G Coherency   Geo Details   Tex Details   Overall
DreamFusion          1000.0         1000.0          1000.0        1000.0        1000.0    1000.0
DreamGaussian        1100.6          953.6          1158.6        1126.2        1130.8     951.4
Fantasia3D           1067.9          891.9          1006.0        1109.3        1027.5     933.5
Instant3D            1200.0         1087.6          1152.7        1152.0        1181.3    1097.8
Latent-NeRF          1222.3         1144.8          1156.7        1180.5        1160.8    1178.7
Magic3D              1152.3         1000.8          1084.4        1178.1        1084.6     961.7
ProlificDreamer      1261.8         1058.7          1152.0        1246.4        1180.6    1012.5
SyncDreamer          1041.2          968.8          1083.1        1064.2        1045.7     963.5
MVDream              1270.5         1147.5          1250.6        1324.9        1255.5    1097.7
DreamReward [1]      1287.5         1195.0          1254.4        1295.5        1261.6    1193.3
DreamDPO             1298.9         1171.9          1276.4        1373.25       1296.9    1203.1
VLM3D (Ours)         1365.5         1293.7          1365.4        1419.0        1368.7    1268.6

[1] Our metrics differ from those reported in the original DreamReward paper because GPT-4V, which GPTEval3D originally relied on, has been deprecated; we use GPT-4o-mini instead.
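For context on how such scores are produced: GPTEval3D reports Elo-style ratings fit from pairwise comparisons judged by a GPT model, with DreamFusion serving as the 1000-point anchor (as the first row of the table reflects). The snippet below is a minimal, hypothetical sketch of that kind of Elo fitting; the constants (K = 32, a 400-point scale), the number of passes, and the function names are conventional illustrative choices, not the benchmark's exact implementation.

```python
# Hedged sketch of Elo-style rating fitting from pairwise judgments,
# anchored so a reference method stays at 1000.
def expected_score(r_a, r_b, scale=400.0):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def fit_elo(ratings, comparisons, k=32.0, anchor="DreamFusion",
            anchor_rating=1000.0, n_passes=50):
    """Fit ratings from (method_a, method_b, outcome) triples, outcome in {1, 0.5, 0}."""
    for _ in range(n_passes):
        for a, b, outcome in comparisons:
            e_a = expected_score(ratings[a], ratings[b])
            ratings[a] += k * (outcome - e_a)
            ratings[b] += k * ((1 - outcome) - (1 - e_a))
        # Re-anchor so the reference method remains at the anchor rating.
        shift = anchor_rating - ratings[anchor]
        for m in ratings:
            ratings[m] += shift
    return ratings


# Toy usage with hypothetical judgments (1 = first method wins the comparison).
ratings = {"DreamFusion": 1000.0, "MVDream": 1000.0, "VLM3D": 1000.0}
judgments = [("VLM3D", "DreamFusion", 1), ("VLM3D", "MVDream", 1), ("MVDream", "DreamFusion", 1)]
print(fit_elo(ratings, judgments))
```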

BibTeX

@article{weimin2025vlm3d,
  author    = {Bai, Weimin and Li, Yubo and Luo, Weijian and Chen, Wenzheng and Sun, He},
  title     = {Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation},
  journal   = {arXiv preprint arXiv:2509.15772},
  year      = {2025},
}