DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model

Adobe Research    Stanford    HKU    TTIC    HKUST
* denotes equal advising



A single-stage approach for high-quality text-to-3D generation and single-image reconstruction in about 30 seconds.

3D generation and composition: 3D assets are generated by our Text-to-3D or Image-to-3D model.



SAM + DMV3D: We use SAM to segment arbitrary objects and reconstruct their 3D shape and appearance with DMV3D.
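A rough sketch of that pipeline, assuming the public segment-anything API; the input image path, the checkpoint name, the white-background compositing, and the image_to_3d placeholder standing in for DMV3D are our own assumptions, not the released code.

import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load an RGB image and run SAM's automatic mask generator over it.
image = np.array(Image.open("scene.jpg").convert("RGB"))             # hypothetical input photo
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
masks = SamAutomaticMaskGenerator(sam).generate(image)

for m in masks:
    seg = m["segmentation"]                     # H x W boolean mask for one object
    x, y, w, h = map(int, m["bbox"])            # XYWH bounding box
    # Composite the object onto a white canvas and crop it; a plain background
    # is an assumption about what the image-to-3D model expects.
    canvas = np.full_like(image, 255)
    canvas[seg] = image[seg]
    crop = canvas[y:y + h, x:x + w]
    triplane_nerf = image_to_3d(crop)           # placeholder for the DMV3D image-to-3D model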



Probabilistic single-image-to-3D with different samples

Abstract

We propose DMV3D, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion. Our reconstruction model incorporates a triplane NeRF representation and denoises noisy multi-view images via NeRF reconstruction and rendering, achieving single-stage 3D generation in about 30 seconds on a single A100 GPU. We train DMV3D on large-scale multi-view image datasets of highly diverse objects using only image reconstruction losses, without access to 3D assets. We demonstrate state-of-the-art results on the single-image reconstruction problem, where probabilistic modeling of unseen object parts is required to generate diverse reconstructions with sharp textures. We also show high-quality text-to-3D generation results that outperform previous 3D diffusion models.
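As a concrete (if simplified) picture of this training recipe, here is a minimal sketch of a single training step; the denoiser and render functions, the tensor shapes, the noise-schedule handling, and the plain L2 loss are our own assumptions rather than the actual implementation.

import torch
import torch.nn.functional as F

def training_step(denoiser, render, images, rays, novel_images, novel_rays, alphas_cumprod):
    """One illustrative step: noise the multi-view images, reconstruct a triplane
    NeRF with the transformer denoiser, and supervise only with rendering losses."""
    B = images.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))                  # random diffusion timestep per sample
    a_t = alphas_cumprod[t].view(B, 1, 1, 1, 1)
    noise = torch.randn_like(images)                                 # images: (B, V, 3, H, W)
    noisy = a_t.sqrt() * images + (1 - a_t).sqrt() * noise           # forward diffusion

    triplane = denoiser(noisy, rays, t)                              # noisy views + camera rays -> triplane NeRF
    loss = F.mse_loss(render(triplane, rays), images) \
         + F.mse_loss(render(triplane, novel_rays), novel_images)    # image losses only, no 3D supervision
    return loss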


Framework. We denoise multiple views (three shown in the figure; four used in experiments) for 3D generation. Our multi-view denoiser is a large transformer model that reconstructs a noise-free triplane NeRF from noisy input images with camera poses (parameterized by Plücker rays). During training, we supervise the triplane NeRF with a rendering loss at input and novel viewpoints. During inference, we render denoised images at the input viewpoints and combine them with noise to obtain a less noisy input for the next denoising step. Once the multi-view images are fully denoised, our model yields a clean triplane NeRF, completing the 3D generation.
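For concreteness, below is a minimal sketch of the inference loop described above, written as deterministic DDIM-style x0-prediction sampling; the denoiser and render functions, the Plücker-ray tensor rays, the 50-step schedule, and the image resolution are our own placeholder assumptions, not the released implementation.

import torch

@torch.no_grad()
def sample(denoiser, render, rays, alphas_cumprod, num_views=4, res=256):
    """Illustrative sampling loop: start from pure noise at num_views viewpoints;
    at each step the reconstruction model predicts a clean triplane NeRF, we render
    x0 at the input viewpoints, and re-noise it for the next step."""
    x_t = torch.randn(1, num_views, 3, res, res)                    # fully noisy multi-view images
    steps = torch.linspace(len(alphas_cumprod) - 1, 0, 50).long()   # 50-step schedule (assumed)

    triplane = None
    for i, t in enumerate(steps):
        a_t = alphas_cumprod[t]
        triplane = denoiser(x_t, rays, t)                           # noise-free triplane NeRF
        x0 = render(triplane, rays)                                 # denoised images at input viewpoints
        if i + 1 < len(steps):
            a_prev = alphas_cumprod[steps[i + 1]]
            # Recover the implied noise, then step to the next (lower) noise level.
            eps = (x_t - a_t.sqrt() * x0) / (1 - a_t).sqrt()
            x_t = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return triplane                                                 # clean triplane NeRF for rendering or meshing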

Image-to-3D


Text-to-3D

a bowl of vegetables
a rusty old car
a voxelized dog
a birthday cupcake
a schoolbus
a race car
a donut with pink icing
a spaceship


Text-to-Image by SD-XL, Image-to-3D by DMV3D



Visualization of how the denoised estimate x0 evolves during the denoising process

Input View                   Novel View 1                   Novel View 2                   Novel View 3


BibTeX

@misc{xu2023dmv3d,
      title={DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model}, 
      author={Yinghao Xu and Hao Tan and Fujun Luan and Sai Bi and Peng Wang and Jiahao Li and Zifan Shi and Kalyan Sunkavalli and Gordon Wetzstein and Zexiang Xu and Kai Zhang},
      year={2023},
      eprint={2311.09217},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}