We propose DMV3D, a novel 3D generation approach that uses a transformer-based 3D large
reconstruction model to denoise multi-view diffusion.
Our reconstruction model incorporates a triplane NeRF representation and can denoise noisy multi-view
images via NeRF reconstruction and rendering, achieving single-stage 3D generation in ~30s
on a single A100 GPU.
We train DMV3D on large-scale multi-view image datasets of highly diverse objects using
only image reconstruction losses, without accessing 3D assets.
We demonstrate state-of-the-art results for the single-image reconstruction problem, where probabilistic
modeling of unseen object parts is required for generating diverse reconstructions with sharp textures. We
also show high-quality text-to-3D generation results outperforming previous 3D diffusion models.
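To make the training setup concrete, here is a minimal PyTorch-style sketch of how a reconstruction model can act as an x0-style multi-view denoiser supervised purely with image reconstruction losses. The names (`denoiser`, `render_triplane`, `noise_schedule`) are illustrative assumptions, not the released DMV3D API.

```python
import torch
import torch.nn.functional as F

# Hypothetical components (assumptions, not the released implementation):
#   denoiser(noisy_views, rays, t)      -> triplane features
#   render_triplane(triplane, rays)     -> rendered images at the given rays

def x0_denoising_step(denoiser, render_triplane, noisy_views, input_rays, t):
    """Predict clean images by reconstructing a triplane NeRF and rendering it."""
    triplane = denoiser(noisy_views, input_rays, t)          # e.g. (B, 3, C, Ht, Wt)
    denoised_views = render_triplane(triplane, input_rays)   # e.g. (B, V, 3, H, W)
    return triplane, denoised_views

def reconstruction_loss(denoiser, render_triplane, clean_views, input_rays,
                        novel_views, novel_rays, noise_schedule, t):
    """Training sketch: add noise to the input views, denoise them via NeRF
    reconstruction, and supervise renderings at input AND novel viewpoints
    with a simple image loss -- no 3D ground truth required.
    t is a scalar timestep; noise_schedule is a 1-D tensor of cumulative
    signal levels (an assumed DDPM-style schedule)."""
    alpha_bar = noise_schedule[t]
    noise = torch.randn_like(clean_views)
    noisy_views = alpha_bar.sqrt() * clean_views + (1 - alpha_bar).sqrt() * noise

    triplane, denoised_views = x0_denoising_step(
        denoiser, render_triplane, noisy_views, input_rays, t)
    novel_renders = render_triplane(triplane, novel_rays)

    return F.mse_loss(denoised_views, clean_views) + F.mse_loss(novel_renders, novel_views)
```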
Framework. We denoise multiple views (three shown in the figure; four used in experiments) for 3D generation. Our multi-view denoiser is a large transformer model that reconstructs a noise-free triplane NeRF from noisy input images with camera poses (parameterized by Plücker rays). During training, we supervise the triplane NeRF with a rendering loss at input and novel viewpoints. During inference, we render denoised images at the input viewpoints and combine them with noise to obtain a less noisy input for the next denoising step. Once the multi-view images are fully denoised, our model produces a clean triplane NeRF, enabling 3D generation.
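The caption above summarizes inference; the sketch below is our own illustrative code (not the released implementation) showing how Plücker-ray camera conditioning and the iterative "render, then re-noise" loop might be wired together. The linear noise schedule and DDIM-style mixing are assumptions for illustration.

```python
import torch

def plucker_rays(origins, directions):
    """Per-pixel Plücker parameterization of camera rays: (d, o x d), 6 channels.
    origins, directions: (..., 3) tensors, directions assumed unit-norm."""
    moment = torch.cross(origins, directions, dim=-1)
    return torch.cat([directions, moment], dim=-1)

@torch.no_grad()
def sample_3d(denoiser, render_triplane, input_rays, num_views=4,
              image_size=256, timesteps=None):
    """Sampling sketch: start from pure noise at the input viewpoints, repeatedly
    denoise via triplane reconstruction + rendering, then re-noise the renders
    to the next (lower) noise level. Returns the final clean triplane NeRF."""
    if timesteps is None:
        timesteps = torch.linspace(0.999, 0.0, 50)   # assumed noise-level schedule
    views = torch.randn(1, num_views, 3, image_size, image_size)

    triplane, denoised = None, None
    for i, t in enumerate(timesteps):
        triplane = denoiser(views, input_rays, t)
        denoised = render_triplane(triplane, input_rays)   # x0 prediction at input views
        if i + 1 < len(timesteps):
            alpha_next = 1.0 - timesteps[i + 1]             # toy signal level (assumption)
            noise = torch.randn_like(denoised)
            # Re-noise the rendered images to the next noise level (DDPM/DDIM-style mix).
            views = alpha_next.sqrt() * denoised + (1 - alpha_next).sqrt() * noise
    return triplane, denoised
```

Because the denoiser predicts clean images by rendering an actual triplane NeRF, the final denoising step directly yields the 3D representation, which is why no separate reconstruction stage is needed.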
@misc{xu2023dmv3d,
      title={DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model},
      author={Yinghao Xu and Hao Tan and Fujun Luan and Sai Bi and Peng Wang and Jiahao Li and Zifan Shi and Kalyan Sunkavalli and Gordon Wetzstein and Zexiang Xu and Kai Zhang},
      year={2023},
      eprint={2311.09217},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}