Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model

1Tsinghua University, 2Tencent

Scene Splatter: a momentum-based paradigm for video diffusion to generate generic scenes from a single image.

Abstract

In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from original features as momentum to enhance video details and maintain scene consistency. However, for latent features whose receptive field spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in unknown regions. Therefore, we further introduce the aforementioned consistent video as pixel-level momentum into a video generated directly without momentum, to better recover unseen regions. This cascaded momentum enables video diffusion models to generate novel views that are both high-fidelity and consistent. We further finetune the global Gaussian representations with the enhanced frames and render new frames for the momentum update in the next step. In this manner, we can iteratively recover a 3D scene, avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.
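The two momentum levels described above can be sketched as simple blending operations. The snippet below is a toy illustration, not the released implementation: the noise schedule, the functions `add_noise`, `latent_momentum_step`, and `pixel_momentum`, and the exact blending form are all our assumptions, chosen only to show the idea of anchoring denoising to the rendered video in latent space while letting a per-pixel scale map keep generative freedom in unseen regions.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(latent, t, noise):
    """Forward-diffuse a latent to timestep t (toy DDPM-style schedule;
    the real schedule is an assumption here)."""
    alpha = 1.0 - t  # hypothetical schedule, alpha in (0, 1)
    return np.sqrt(alpha) * latent + np.sqrt(1.0 - alpha) * noise

def latent_momentum_step(z_denoised, z_render, t, lam, noise):
    """Latent-level momentum: blend the model's denoised latent with a
    re-noised latent of the rendered (scene-consistent) video."""
    z_anchor = add_noise(z_render, t, noise)
    return lam * z_anchor + (1.0 - lam) * z_denoised

def pixel_momentum(frame_consistent, frame_vanilla, scale_map):
    """Pixel-level momentum: scale_map ~ 1 in well-observed regions keeps
    the consistent video; scale_map ~ 0 in unseen regions keeps the vanilla
    generation, restoring generative ability there."""
    return scale_map * frame_consistent + (1.0 - scale_map) * frame_vanilla
```

With `scale_map = 1` everywhere the blend reduces to the momentum-enhanced video, and with `scale_map = 0` it reduces to the vanilla generation, which matches the limiting cases the abstract describes.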

Rendering results of a single-view reconstruction scene

Results of our method on more camera trajectories


Results of our method compared with baselines


Method Overview


The pipeline of Scene Splatter. We initialize the Gaussian representations from the input image \(I_{0}\) with a Gaussian Predictor. In each iteration, we first render the video \(\mathcal{I}\) from the 3D Gaussians \(\mathcal{G}\). Then, we generate the enhanced video \(\Phi_{\lambda}(\mathcal{I})\) with latent-level momentum and \(\Phi_{0}(\mathcal{I})\) directly from the vanilla diffusion model, where \(\Phi_{\lambda}\) and \(\Phi_{0}\) share the weights of the denoising network. We then render scale maps as pixel-level momentum coefficients to further enhance the generated frames. We use the final results to supervise the optimization of the Gaussian representations. We repeat this process along the camera trajectory to iteratively recover 3D scenes.

BibTeX


@inproceedings{zhang2025scenesplatter,
        title     = {Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model},
        author    = {Zhang, Shengjun and Li, Jinzhao and Fei, Xin and Liu, Hao and Duan, Yueqi},
        booktitle = {IEEE / CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        year      = {2025},
}