CETCam

Camera-Controllable Video Generation
via Consistent and Extensible Tokenization

Anonymous Submission

Videos on this page are subsampled and rescaled to meet the supplementary file-size limit.
Higher-quality versions will be available on our online project page.

TL;DR: A camera-controllable video generation framework that requires no camera annotations, using geometry-aware tokenization to achieve state-of-the-art consistency and extensibility.

Camera-Controlled Video Generation

Row 1: Zoom Out | Zoom Out | Truck Right | Tilt Up | Down-Right
Row 2: Roll Clockwise | Roll Clockwise | Orbit Right | Orbit Right | Orbit Right
The CETCam framework synthesizes dynamic, geometry-consistent scenes across diverse domains.
Camera trajectories are described in natural language for readability,
while the CETCam model takes raw camera trajectories as input.

Diverse Scene Results

Right-upward Orbiting
CETCam produces high-quality videos with precise camera control across diverse scenes.

Same Camera, Different Source Images

Same Camera Trajectory I
Same Camera Trajectory I, More Samples
Same Camera Trajectory II

Same Source Image, Different Cameras

Row 1: Dolly Out and Pan Left | Dolly In and Pan Right | Swing
Row 2: Dolly Out and Pan Left | Orbit Right | Swing
CETCam can naturally generate videos of the same scene under different camera trajectories.

Extensibility Results

From left to right, we show camera trajectory, source image, VACE control input, and the generated video.
CETCam can perform various control tasks while achieving precise camera control.

Comparison between Uni3C and Ours

Uni3C Camera Renderings
Our Camera Renderings
Uni3C Generated Videos
Our Generated Videos
Comparison with the closest concurrent work, Uni3C.
Top: Uni3C renderings fail to accurately follow the intended camera motion and exhibit geometric distortions and outliers due to inconsistent 3D estimation.
Bottom: These rendering inaccuracies lead to spatial misalignment and visible artifacts in Uni3C's generated videos, whereas our generated videos are free of such artifacts.

Method

Overview of the CETCam I2V generation framework.
(a) CETCam Tokenizer. Given an in-the-wild training video or a test-time frame input, the frames are processed by VGGT to predict depth maps; during training, camera poses are also estimated. The predicted depths and camera poses are used for point-cloud reprojection to generate renderings of the first frame and the corresponding masks. These renderings, masks, and camera poses are embedded and fused to produce CETCam tokens.
(b) Token-Based Controlled Video Generation. We leverage several token types with distinct functions: CETCam tokens, noisy latents, VCU tokens, and VACE tokens. CETCam tokens are consumed by learnable CETCam context blocks, which are connected to the pre-trained Wan DiT blocks via zero-linear layers and addition. The other tokens are processed through VACE context blocks and Wan DiT blocks. Finally, the output tokens are decoded by a 3D VAE to produce the video.
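To make the token flow concrete, the following is a minimal PyTorch sketch of the two mechanisms described in (a) and (b): fusing rendering, mask, and pose embeddings into CETCam tokens, and injecting a context block's output into a frozen DiT stream through a zero-initialized linear layer plus addition. All class names, tensor shapes, and layer choices (patch-embedding convolutions, a single cross-attention per context block) are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn


class CETCamTokenizerSketch(nn.Module):
    """Hypothetical sketch: embeds first-frame point-cloud renderings,
    their validity masks, and camera poses, then fuses them into
    CETCam tokens. Shapes and layers are assumptions for illustration."""

    def __init__(self, dim=1024, patch=16):
        super().__init__()
        # Patch-embed the warped renderings (3 channels) and masks (1 channel).
        self.render_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.mask_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # Embed flattened 4x4 camera extrinsics (16 values) per frame.
        self.pose_embed = nn.Linear(16, dim)
        self.fuse = nn.Linear(dim, dim)

    def forward(self, renderings, masks, poses):
        # renderings: (B*T, 3, H, W), masks: (B*T, 1, H, W), poses: (B*T, 16)
        r = self.render_embed(renderings).flatten(2).transpose(1, 2)  # (B*T, N, dim)
        m = self.mask_embed(masks).flatten(2).transpose(1, 2)         # (B*T, N, dim)
        p = self.pose_embed(poses).unsqueeze(1)                       # (B*T, 1, dim)
        return self.fuse(r + m + p)                                   # CETCam tokens


class CETCamContextBlockSketch(nn.Module):
    """Hypothetical sketch of a learnable context block whose output is
    injected into a frozen DiT block through a zero-initialized linear
    layer and residual addition, so the pretrained behavior is unchanged
    at initialization."""

    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # "Zero-linear": weights and bias start at zero.
        self.zero_linear = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_linear.weight)
        nn.init.zeros_(self.zero_linear.bias)

    def forward(self, dit_hidden, cetcam_tokens):
        # Cross-attend from the DiT hidden states to the CETCam tokens,
        # then add the zero-initialized projection back into the DiT stream.
        ctx, _ = self.attn(self.norm(dit_hidden), cetcam_tokens, cetcam_tokens)
        return dit_hidden + self.zero_linear(ctx)
```

Because the injection path is zero-initialized, the frozen Wan DiT backbone initially produces its original outputs, and camera conditioning is learned gradually during fine-tuning of the context blocks.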