Several large, solid, cube-shaped parcels
Orange monarch butterfly
A cat pondering the mysteries
Floating bonsai tree
Multi-layered wedding cake
Carved wooden bear
Sequence of street lamps
Golden retriever plush toy
A torn hat

Distilling pre-trained 2D diffusion models into 3D assets has driven remarkable advances in text-to-3D synthesis. However, existing methods typically rely on Score Distillation Sampling (SDS) loss, which involves an asymmetric KL divergence—a formulation that inherently favors mode-seeking behavior and limits generation diversity. In this paper, we introduce Dive3D, a novel text-to-3D generation framework that replaces KL-based objectives with Score Implicit Matching (SIM) loss, a score-based objective that effectively mitigates mode collapse. Furthermore, Dive3D integrates both diffusion distillation and reward-guided optimization under a unified divergence perspective. This reformulation, together with SIM loss, yields significantly more diverse 3D outputs while improving text alignment, human preference, and overall visual fidelity. We validate Dive3D on a broad range of text-to-3D prompts and find that it consistently outperforms prior methods in qualitative assessments, including diversity, photorealism, and aesthetic appeal. We further evaluate its performance on the GPTEval3D benchmark, comparing against nine state-of-the-art baselines. Dive3D achieves strong results across quantitative metrics, including text–asset alignment, 3D plausibility, text–geometry consistency, texture quality, and geometric detail.
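To make the distinction the abstract draws more concrete, here is a brief sketch of the two kinds of objectives. The notation is generic (not copied from the paper): $g(\theta, c)$ renders the 3D representation $\theta$ from camera $c$, $\epsilon_\phi$ is the pre-trained diffusion model's noise prediction, and $s_p$, $s_{q_\theta}$ denote the scores of the target and rendered distributions.

```latex
% SDS-style gradient: equivalent to descending a (reverse) KL divergence,
% which is mode-seeking and tends to collapse diversity
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon,c}\!\left[ w(t)\,
      \big(\epsilon_\phi(x_t; y, t) - \epsilon\big)\,
      \tfrac{\partial x}{\partial \theta} \right],
  \qquad x = g(\theta, c),\quad x_t = \alpha_t x + \sigma_t \epsilon .

% Score-based divergence (SIM-style): penalizes the score gap directly
% through a distance function d, e.g. d(a) = \|a\|^2, rather than a KL term
\mathcal{D}(q_\theta \,\|\, p)
  = \mathbb{E}_{t,\; x_t \sim q_{\theta,t}}\!\left[
      d\big( s_{q_\theta}(x_t, t) - s_p(x_t, t) \big) \right].
```

Because the score-based divergence vanishes only when the two score fields match everywhere the rendered distribution has mass, it does not share the reverse KL's incentive to concentrate on a single high-density mode.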
Dive3D exhibits higher quality, richer texture details, and superior alignment with human preferences, such as accurate clothing styles and vivid fur textures.
Dive3D exhibits more detailed and realistic 3D generation, capturing fine-grained structures such as accurate guitar geometry and transparent glass materials.
Score-based divergence vs. KL divergence in 2D space sampling. The proposed score-based divergence significantly enhances the diversity of generated 2D samples, yielding more varied backgrounds and clothing in "game character" generation, as well as a broader range of environments, lighting conditions, and architectural features in "Japanese building" generation.
Quantitative Results on 110 Prompts from the GPTEval3D Benchmark. We compute all six GPTEval3D metrics—text alignment, 3D plausibility, texture–geometry coherence, geometry details, texture details, and overall score—to comprehensively evaluate 3D generation quality. Dive3D achieves the highest score on every metric, demonstrating its superior performance.
| Method | Alignment | Plausibility | T-G Coherency | Geo Details | Tex Details | Overall |
|---|---|---|---|---|---|---|
| DreamFusion | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| DreamGaussian | 1100.6 | 953.6 | 1158.6 | 1126.2 | 1130.8 | 951.4 |
| Fantasia3D | 1067.9 | 891.9 | 1006.0 | 1109.3 | 1027.5 | 933.5 |
| Instant3D | 1200.0 | 1087.6 | 1152.7 | 1152.0 | 1181.3 | 1097.8 |
| Latent-NeRF | 1222.3 | 1144.8 | 1156.7 | 1180.5 | 1160.8 | 1178.7 |
| Magic3D | 1152.3 | 1000.8 | 1084.4 | 1178.1 | 1084.6 | 961.7 |
| ProlificDreamer | 1261.8 | 1058.7 | 1152.0 | 1246.4 | 1180.6 | 1012.5 |
| SyncDreamer | 1041.2 | 968.8 | 1083.1 | 1064.2 | 1045.7 | 963.5 |
| MVDream | 1270.5 | 1147.5 | 1250.6 | 1324.9 | 1255.5 | 1097.7 |
| DreamReward¹ | 1287.5 | 1195.0 | 1254.4 | 1295.5 | 1261.6 | 1193.3 |
| Dive3D (Ours) | 1341.0 | 1249.0 | 1322.6 | 1360.2 | 1329.1 | 1243.3 |
¹ Our metrics differ from those reported in the original DreamReward paper because GPT-4V has been deprecated in GPTEval3D, so we instead use GPT-4o-mini.
Quantitative comparison of diversity and quality. Arrows indicate whether higher ($\uparrow$) or lower ($\downarrow$) values are better.
| Method | CLIP-D $\uparrow$ | LPIPS-D $\uparrow$ | Chamfer-D $\uparrow$ | FID $\downarrow$ |
|---|---|---|---|---|
| ProlificDreamer (KL-based) | 0.3908 | 0.485 | 0.072 | 207.34 |
| Dive3D (Ours, Score-based) | 0.4483 | 0.551 | 0.115 | 168.59 |
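As a rough illustration of what a pairwise diversity score such as CLIP-D measures, the sketch below computes the mean pairwise cosine distance over a set of feature vectors (e.g. embeddings of renders generated from different seeds for one prompt). The function name and the choice of cosine distance are illustrative assumptions, not the authors' exact evaluation protocol.

```python
import numpy as np

def pairwise_cosine_diversity(feats: np.ndarray) -> float:
    """Mean pairwise cosine distance between row vectors.

    Higher values mean the feature vectors (e.g. embeddings of
    generations from different seeds) are more spread out, i.e.
    the generations are more diverse.
    """
    # L2-normalize each feature vector so dot products are cosine similarities
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                 # (n, n) cosine-similarity matrix
    iu = np.triu_indices(f.shape[0], k=1)  # unique unordered pairs
    return float(np.mean(1.0 - sim[iu]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # stand-in for embeddings of 8 generations of a single prompt
    feats = rng.normal(size=(8, 512))
    print(round(pairwise_cosine_diversity(feats), 4))
```

Under this reading, the table says Dive3D's generations are farther apart in feature space (higher CLIP-D/LPIPS-D) while also matching the real-image statistics better (lower FID).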
We analyze the runtime and memory overhead of Dive3D in the table below. The cost depends on the diffusion backbone. When using MVDiffusion, our method is comparable to other reward-guided methods such as DreamReward. When using Stable Diffusion, our method is slightly slower than ProlificDreamer (NeRF) and requires more memory due to the reward guidance. In both scenarios, the computational cost is justified by the significant gains in generation quality and diversity. All experiments were run on a single NVIDIA A100 GPU.
| Diffusion Backbone | Method | Generation Time | Peak Memory (GB) |
|---|---|---|---|
| MVDiffusion | MVDream | 0.7 hours | ∼14 |
| | DreamReward | 1.0 hours | ∼22 |
| | Dive3D (Ours) | 1.0 hours | ∼22 |
| Stable Diffusion | ProlificDreamer | 7.9 hours | ∼27 |
| | Dive3D (Ours) | 9.0 hours | ∼35 |
@article{bai2025dive3d,
  title   = {Dive3D: Diverse Distillation-based Text-to-3D Generation via Score Implicit Matching},
  author  = {Bai, Weimin and Li, Yubo and Chen, Wenzheng and Luo, Weijian and Sun, He},
  journal = {Transactions on Machine Learning Research},
  issn    = {2835-8856},
  year    = {2025},
  month   = {12},
  url     = {https://openreview.net/forum?id=OUYMueHLMf}
}