Overview of Plan-X. Plan-X comprises two key components: an MLLM, the Semantic Planner, for high-level semantic reasoning and planning, and a DiT for high-fidelity video synthesis. The Semantic Planner receives multimodal inputs: a descriptive text prompt for static scene content (T_img), a motion description (T_motion), an instructional system prompt (T_sys) specifying the number of target frames T, and optionally the semantically encoded first frame I_0. It then autoregressively generates discrete, text-aligned semantic tokens that encode spatio-temporal semantic structure in the form of keyframes. To complement global text conditioning, we introduce a dedicated semantic guidance branch that instructs a pretrained DiT to translate these structured semantics, augmented with 3D spatio-temporal RoPE, into high-fidelity, temporally coherent video.
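The caption does not specify how the 3D spatio-temporal RoPE is applied to the keyframe semantics. As a rough illustration only, the sketch below shows one common factorized form of 3D rotary embeddings: the channels of a token are split into three groups, and each group is rotated by angles derived from the token's frame index, row, and column. The channel split (4, 2, 2), the base of 10000, and every function name here are assumptions made for illustration, not details taken from Plan-X.

```python
import math

def axis_angles(pos, dim, base=10000.0):
    """Rotation angles for one axis: dim // 2 frequencies scaled by the position."""
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

def rope_3d_angles(t, h, w, dims=(4, 2, 2)):
    """Concatenate per-axis angles for a token at (frame t, row h, column w).

    dims is an assumed split of the channel dimension across the three axes.
    """
    d_t, d_h, d_w = dims
    return axis_angles(t, d_t) + axis_angles(h, d_h) + axis_angles(w, d_w)

def apply_rope(vec, angles):
    """Rotate consecutive channel pairs (x0, x1), (x2, x3), ... by the given angles."""
    assert len(vec) == 2 * len(angles)
    out = []
    for k, theta in enumerate(angles):
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))
```

Because each pair of channels is rotated by an angle that is linear in position, the dot product between a rotated query and a rotated key depends only on their relative offset along time, height, and width. This is the property that makes rotary embeddings a natural fit for tokens laid out on a spatio-temporal grid.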
Comparison with baseline methods
The woman reaches for the white mug with her right hand, takes a sip, and places the mug back down; then, her left hand reaches for the chocolate croissant and lifts it up slightly.
Videos compared: HunyuanVideo, SkyReels V2 14B, Wan 2.2 5B, Seedance 1.0, Plan-X-Wan, Plan-X-Seedance.
Fixed shot on a tabletop. The left hand reaches in to pick up the pink powder comb from the green book and sets it down on the table, while the right hand moves in and lifts the white iPhone from its spot, placing it to rest on top of the green book.
Videos compared: HunyuanVideo, SkyReels V2 14B, Wan 2.2 5B, Seedance 1.0, Plan-X-Wan, Plan-X-Seedance.