CETCam

Camera-Controllable Video Generation
via Consistent and Extensible Tokenization

Anonymous Submission

Videos on this page are subsampled and rescaled to meet the supplementary file-size limit.
Higher-quality versions will be available on our online project page.

TL;DR: A camera-controllable video generation framework that requires no camera annotations, using geometry-aware tokenization to achieve state-of-the-art consistency and extensibility.

Camera-Controlled Video Generation

Row 1: Zoom Out | Zoom Out | Truck Right | Tilt Up | Down-Right
Row 2: Roll Clockwise | Roll Clockwise | Orbit Right | Orbit Right | Orbit Right
The CETCam framework synthesizes dynamic, geometry-consistent scenes across diverse domains.
Camera trajectories are described in natural language for readability,
while the CETCam model takes raw camera trajectories as input.

Diverse Scene Results

Right-upward Orbiting
CETCam produces high-quality videos with precise camera control across diverse scenes.

Same Camera, Different Source Images

Same Camera Trajectory I
Same Camera Trajectory I, More Samples
Same Camera Trajectory II

Same Source Image, Different Cameras

Row 1: Dolly Out and Pan Left | Dolly In and Pan Right | Swing
Row 2: Dolly Out and Pan Left | Orbit Right | Swing
CETCam can naturally generate videos of the same scene under different camera trajectories.

Extensibility Results

From left to right, we show camera trajectory, source image, VACE control input, and the generated video.
CETCam can perform various control tasks while achieving precise camera control.

Comparison between Uni3C and Ours

Uni3C Camera Renderings
Our Camera Renderings
Uni3C Generated Videos
Our Generated Videos
Comparison with the closest concurrent work, Uni3C.
Top: Uni3C renderings fail to accurately follow the intended camera motion and exhibit geometric distortions and outliers due to inconsistent 3D estimation.
Bottom: These rendering inaccuracies lead to spatial misalignment and visible artifacts in Uni3C's generated videos, whereas our generated videos are free of such artifacts.

Method

Overview of the CETCam I2V generation framework.
(a) CETCam Tokenizer. Given an in-the-wild training video or a test-time frame input, the frames are processed by VGGT to predict depth maps; during training, camera poses are also estimated. The predicted depths and camera poses are used for point-cloud reprojection to generate renderings of the first frame and the corresponding masks. These renderings, masks, and camera poses are embedded and fused to produce CETCam tokens.
(b) Token-Based Controlled Video Generation. We leverage several token types with distinct functions: CETCam tokens, noisy latents, VCU tokens, and VACE tokens. CETCam tokens are consumed by learnable CETCam context blocks, which are connected to the pre-trained Wan DiT blocks via zero-linear layers and addition. The other tokens are processed through VACE context blocks and Wan DiT blocks. Finally, the output tokens are decoded by a 3D VAE to produce the video.
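To make the token flow concrete, the following is a minimal PyTorch sketch of the two mechanisms described in (a) and (b): fusing rendering, mask, and pose embeddings into CETCam tokens, and injecting a context block's output into a frozen DiT stream through a zero-initialized linear layer plus addition. All class names, tensor shapes, and layer choices (patch-embedding convolutions, a single cross-attention per context block) are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn


class CETCamTokenizerSketch(nn.Module):
    """Hypothetical sketch: embeds first-frame point-cloud renderings,
    their validity masks, and camera poses, then fuses them into
    CETCam tokens. Shapes and layers are assumptions for illustration."""

    def __init__(self, dim=1024, patch=16):
        super().__init__()
        # Patch-embed the warped renderings (3 channels) and masks (1 channel).
        self.render_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.mask_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # Embed flattened 4x4 camera extrinsics (16 values) per frame.
        self.pose_embed = nn.Linear(16, dim)
        self.fuse = nn.Linear(dim, dim)

    def forward(self, renderings, masks, poses):
        # renderings: (B*T, 3, H, W), masks: (B*T, 1, H, W), poses: (B*T, 16)
        r = self.render_embed(renderings).flatten(2).transpose(1, 2)  # (B*T, N, dim)
        m = self.mask_embed(masks).flatten(2).transpose(1, 2)         # (B*T, N, dim)
        p = self.pose_embed(poses).unsqueeze(1)                       # (B*T, 1, dim)
        return self.fuse(r + m + p)                                   # CETCam tokens


class CETCamContextBlockSketch(nn.Module):
    """Hypothetical sketch of a learnable context block whose output is
    injected into a frozen DiT block through a zero-initialized linear
    layer and residual addition, so the pretrained behavior is unchanged
    at initialization."""

    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # "Zero-linear": weights and bias start at zero.
        self.zero_linear = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_linear.weight)
        nn.init.zeros_(self.zero_linear.bias)

    def forward(self, dit_hidden, cetcam_tokens):
        # Cross-attend from the DiT hidden states to the CETCam tokens,
        # then add the zero-initialized projection back into the DiT stream.
        ctx, _ = self.attn(self.norm(dit_hidden), cetcam_tokens, cetcam_tokens)
        return dit_hidden + self.zero_linear(ctx)
```

Because the injection path is zero-initialized, the frozen Wan DiT backbone initially produces its original outputs, and camera conditioning is learned gradually during fine-tuning of the context blocks.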