Overview of Plan-X. Plan-X comprises two key components: an MLLM, the Semantic Planner, for high-level semantic reasoning and planning, and a DiT for high-fidelity video synthesis. The Semantic Planner receives multimodal inputs: a descriptive text prompt for static scene content (T_img), a motion description (T_motion), an instructional system prompt (T_sys) that specifies the number of target frames T, and, optionally, the semantically encoded first frame I_0. It then autoregressively generates discrete, text-aligned semantic tokens that encode spatio-temporal semantic structure in the form of keyframes. To complement global text conditioning, we introduce a dedicated semantic guidance branch that instructs a pretrained DiT model to translate these structured semantics, augmented with 3D spatio-temporal RoPE, into high-fidelity, temporally coherent videos.
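To make the "3D spatio-temporal RoPE" step more concrete, the following is a minimal pure-Python sketch of one common way to extend rotary position embeddings to a (frame, row, column) index. It is an illustration under stated assumptions, not the Plan-X implementation: the function names `rope_1d`, `rope_3d`, and `dot`, the even three-way channel split, and the frequency base of 10000 are all assumptions that the text above does not specify.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles that scale with `pos`."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("channel count must be even")
    out = []
    for i in range(0, d, 2):
        # Standard RoPE frequency schedule: lower pair indices rotate faster.
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_3d(vec, t, h, w):
    """Apply 1D RoPE per axis on three equal channel groups.

    Assumption: channels are split evenly into (time, height, width) groups;
    the actual split used by Plan-X is not stated in the text.
    """
    d = len(vec)
    if d % 6 != 0:
        raise ValueError("channel count must be divisible by 6 for an even 3-way split")
    g = d // 3
    return (
        rope_1d(vec[:g], t)
        + rope_1d(vec[g:2 * g], h)
        + rope_1d(vec[2 * g:], w)
    )


def dot(a, b):
    """Plain inner product, used to compare attention scores."""
    return sum(x * y for x, y in zip(a, b))
```

In this sketch, the inner product between a rotated query and a rotated key depends only on the offsets between their (t, h, w) positions, not on the absolute positions. That relative-position property is the usual reason for using such an encoding when semantic tokens must be aligned with video tokens in both space and time.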
Comparison with baseline methods
Fixed shot on a tabletop. The left hand reaches in to pick up the pink powder comb from the green book and sets it down on the table, while the right hand moves in and lifts the white iPhone from its spot, placing it to rest on top of the green book.
Videos compared for this prompt: HunyuanVideo, SkyReels V2 14B, Wan 2.2 5B, Seedance 1.0, Plan-X-Wan, Plan-X-Seedance.
The camera captures a fixed overhead view of a cluttered table covered with a yellow and white checkered cloth. A hand reaches into the white container, retrieves the red makeup remover, and places it beside the red lipstick on the table.
Videos compared for this prompt: HunyuanVideo, SkyReels V2 14B, Wan 2.2 5B, Seedance 1.0, Plan-X-Wan, Plan-X-Seedance.
The camera focuses on a hand holding a small pot with two cacti, then slowly shifts focus to the woman holding it. She smiles at the camera while standing in a busy outdoor area with blurred lights and people in the background.
Videos compared for this prompt: Wan 2.2 5B, Wan + PE, Wan + SFT, Wan + Query, Plan-X-Wan (1.5B M), Plan-X-Wan w/o 3D RoPE, Plan-X-Wan w/o Text, Plan-X-Wan w/o e2e, Plan-X-Wan.
A hand reaches in and picks up a pair of black chopsticks from a white medium seasoning dish. The chopsticks are then placed on the countertop next to a white small seasoning dish.
Videos compared for this prompt: Wan 2.2 5B, Wan + PE, Wan + SFT, Wan + Query, Plan-X-Wan (1.5B M), Plan-X-Wan w/o 3D RoPE, Plan-X-Wan w/o Text, Plan-X-Wan w/o e2e, Plan-X-Wan.