Videos on this page are subsampled and rescaled to meet the supplementary file-size limits; higher-quality versions are available on our online project page.
TL;DR: A camera-controllable video generation framework that requires no camera annotations, using geometry-aware tokenization to achieve state-of-the-art consistency and extensibility.
Camera-controlled Video Generation
Zoom Out
Zoom Out
Truck Right
Tilt Up
Down-right
Roll Clockwise
Roll Clockwise
Orbit Right
Orbit Right
Orbit Right
The CETCam framework synthesizes dynamic, geometry-consistent scenes across diverse domains.
Camera trajectories are described here in natural language for easier understanding, while the CETCam model itself takes raw camera trajectories as input.
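As a rough illustration of what a raw trajectory looks like, the sketch below builds a toy "zoom out" as a sequence of per-frame 4x4 extrinsic matrices; the function name, the world-to-camera convention, and the step size are our own assumptions rather than the CETCam input specification.

```python
import numpy as np

def zoom_out_trajectory(num_frames: int = 16, step: float = 0.05):
    """Toy raw camera trajectory: one 4x4 extrinsic matrix per frame.

    Assumes a world-to-camera convention with the camera retreating
    along its optical axis; CETCam's actual convention may differ.
    """
    poses = []
    for t in range(num_frames):
        T = np.eye(4)
        T[2, 3] = t * step  # hypothetical: translate backward along z each frame
        poses.append(T)
    return np.stack(poses)  # shape (num_frames, 4, 4)
```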
Diverse-Scene Results
CETCam produces high-quality videos with precise camera control across diverse scenes.
Same Camera, Different Source Images
Same Source Image, Different Cameras
Dolly out and pan left
Dolly in and pan right
Swing
Dolly out and pan left
Orbit Right
Swing
CETCam has a natural ability to generate videos of the same scene under different camera trajectories.
Extensibility Results
Comparison between Uni3C and ours
Comparison with the closest concurrent work, Uni3C. Top: Uni3C's renderings fail to accurately follow the intended camera motion and exhibit geometric distortions and outliers caused by inconsistent 3D estimation. Bottom: these rendering inaccuracies lead to spatial misalignment and visible artifacts in Uni3C's generated videos, whereas ours show no such artifacts.
Method
Overview of the CETCam I2V generation framework. (a) CETCam Tokenizer. Given an in-the-wild training video or a test-time input frame, the frames are processed by VGGT to predict depth maps; during training, camera poses are estimated as well. The predicted depths and camera poses drive a point-cloud reprojection that produces renderings of the first frame along with the corresponding masks. These renderings, masks, and camera poses are embedded and fused into CETCam tokens. (b) Token-Based Controlled Video Generation. We leverage several token types with complementary roles: CETCam tokens, noisy latents, VCU tokens, and VACE tokens. CETCam tokens are consumed by learnable CETCam context blocks, which are connected to the pre-trained Wan DiT blocks through zero-initialized linear ("zero-linear") layers and additive fusion; the remaining tokens pass through VACE context blocks and Wan DiT blocks. Finally, the output tokens are decoded by a 3D VAE into a video.
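To make the tokenizer's point-cloud reprojection step concrete, here is a minimal NumPy sketch that unprojects the first frame with its predicted depth, moves the points into a target camera, and z-buffer splats them back to pixels to obtain a rendering and a mask. The function name `reproject_first_frame`, the shared pinhole intrinsics, and the splatting scheme are illustrative assumptions, not the released CETCam implementation.

```python
import numpy as np

def reproject_first_frame(rgb, depth, K, T_src_to_tgt):
    """Hypothetical sketch: warp frame 0 into a target camera via its depth map.

    rgb:          (H, W, 3) first frame
    depth:        (H, W)    VGGT-style per-pixel depth for frame 0
    K:            (3, 3)    pinhole intrinsics (assumed shared across views)
    T_src_to_tgt: (4, 4)    relative camera pose, source -> target
    Returns a rendering and a validity mask for the target view.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)

    # Unproject pixels to 3D points in the source camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)

    # Move the points into the target camera frame.
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    pts_tgt = (T_src_to_tgt @ pts_h)[:3]

    # Project back to pixels; keep only points in front of the camera.
    proj = K @ pts_tgt
    z = proj[2]
    valid = z > 1e-6
    uv = np.round(proj[:2, valid] / z[valid]).astype(int)
    inside = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
    uv, src_idx = uv[:, inside], np.flatnonzero(valid)[inside]

    # Z-buffer splat: the nearest point wins at each target pixel.
    render = np.zeros_like(rgb)
    mask = np.zeros((H, W), dtype=bool)
    zbuf = np.full((H, W), np.inf)
    src_rgb = rgb.reshape(-1, 3)
    for (x, y), i, d in zip(uv.T, src_idx, z[valid][inside]):
        if d < zbuf[y, x]:
            zbuf[y, x] = d
            render[y, x] = src_rgb[i]
            mask[y, x] = True
    return render, mask
```

The returned mask marks target pixels that received at least one projected point; in the framework, the renderings, masks, and camera poses are then embedded and fused into CETCam tokens.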
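Likewise, here is a minimal PyTorch sketch of the zero-linear connection between a learnable CETCam context block and the frozen Wan DiT stream, assuming the camera tokens are aligned one-to-one with the video latent tokens; the class name `CETCamContextBlock` and the block internals are hypothetical.

```python
import torch
import torch.nn as nn

class CETCamContextBlock(nn.Module):
    """Hypothetical context block: refines CETCam tokens, then injects them
    into the main Wan DiT stream through a zero-initialized linear layer."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.zero_linear = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_linear.weight)  # zero init: no effect at step 0
        nn.init.zeros_(self.zero_linear.bias)

    def forward(self, cam_tokens: torch.Tensor, dit_hidden: torch.Tensor):
        # Refine the camera-control tokens with self-attention.
        x = self.norm(cam_tokens)
        cam_tokens = cam_tokens + self.attn(x, x, x, need_weights=False)[0]
        # Additive fusion into the (frozen) DiT hidden states; assumes both
        # token sequences share the same length and hidden width.
        dit_hidden = dit_hidden + self.zero_linear(cam_tokens)
        return cam_tokens, dit_hidden
```

Zero-initializing the fusion layer makes the control branch a no-op at the start of training, so the pre-trained Wan backbone's behavior is preserved until the camera branch learns a useful signal.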