Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations.
First, their semantic alignment remains coarse: they often
fail to capture fine-grained prompt details. Second, they
lack robust 3D spatial understanding, leading to geometric
inconsistencies and catastrophic failures in part assembly
and spatial relationships. To address these challenges, we
propose VLM3D, a general framework that repurposes large
vision-language models (VLMs) as powerful, differentiable
semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM's "Yes/No" log-odds, which assesses both semantic fidelity and geometric
coherence. We demonstrate the generality of this guidance
signal across two distinct paradigms: (1) As a reward objective for optimization-based pipelines, VLM3D significantly
outperforms existing methods on standard benchmarks. (2)
As a test-time guidance module for feed-forward pipelines,
it actively steers the iterative sampling process of state-of-the-art native 3D models to correct severe spatial errors. VLM3D
establishes a principled and generalizable path to inject
the VLM’s rich, language-grounded understanding of both
semantics and space into diverse 3D generative pipelines.
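A minimal sketch of this critic signal, under assumed notation not taken from the abstract (rendered view $x$, a semantic yes/no query $q_{\mathrm{sem}}$ and a spatial yes/no query $q_{\mathrm{spa}}$ built from the prompt, VLM next-token distribution $p_\theta$, mixing weight $\lambda$): each query asks the VLM whether the render satisfies the prompt, and the guidance signal is the Yes-over-No log-odds,
\[
r(x, q) \;=\; \log p_\theta(\texttt{Yes} \mid x, q) \;-\; \log p_\theta(\texttt{No} \mid x, q),
\qquad
R(x) \;=\; r(x, q_{\mathrm{sem}}) \;+\; \lambda\, r(x, q_{\mathrm{spa}}),
\]
which is differentiable with respect to the rendered image and can therefore serve either as a reward objective for optimization-based pipelines or as a test-time guidance term for feed-forward samplers.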