Plan-X: Instruct Video Generation via Semantic Planning

The dog lowers its head towards the mushroom, gently grips the mushroom stem with its mouth, carefully pulls the mushroom from the log, then swallows it.

The boy extends his left hand towards the green cube, picks it up from the mat, and carefully places the green cube on top of the red fire truck, where it balances precariously.

The woman picks up the silver watering can, tips the can, pouring water into the geranium pot, then places the watering can back on the table.

Woman slowly raises her head, looks up towards the bronze owl statue, reaches her right hand to the shelf, picks up the owl statue, and places it carefully on the open book.

Method Overview

Overview of Plan-X. Plan-X comprises two key components: an MLLM, the Semantic Planner, for high-level semantic reasoning and planning, and a DiT for high-fidelity video synthesis. The Semantic Planner receives multimodal inputs: a descriptive text prompt for the static scene content (T_img), a motion description (T_motion), an instructional system prompt (T_sys) that specifies the number of target frames T, and, optionally, the semantically encoded first frame I₀. It then autoregressively generates discrete, text-aligned semantic tokens that encode spatio-temporal semantic structure in the form of keyframes. To complement global text conditioning, we introduce a dedicated semantic guidance branch that instructs a pretrained DiT model to translate these structured semantics, augmented with 3D spatio-temporal RoPE, into high-fidelity, temporally coherent video.
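To make the "3D spatio-temporal RoPE" step more concrete, the following is a minimal NumPy sketch of how rotary position embeddings can be applied along time, height, and width to keyframe semantic tokens. It is an illustration under stated assumptions, not the Plan-X implementation: the function names, the per-axis channel split `dims=(16, 24, 24)`, the frequency base, and the choice to use each keyframe's frame index as its temporal position are all hypothetical.

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    """Rotation angles for 1D RoPE, shape (len(pos), dim // 2)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(pos, freqs)

def rope_3d(x, t_idx, h_idx, w_idx, dims=(16, 24, 24)):
    """Rotate token features x of shape (N, sum(dims)) by their (t, h, w) positions.

    The channel split in `dims` is an assumption made for this sketch.
    """
    angles = np.concatenate(
        [rope_angles(p, d) for p, d in zip((t_idx, h_idx, w_idx), dims)], axis=1
    )  # shape (N, sum(dims) // 2): one angle per channel pair
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def keyframe_positions(frame_ids, height, width):
    """Flattened (t, h, w) indices for a grid of tokens on each keyframe."""
    t, h, w = np.meshgrid(
        frame_ids, np.arange(height), np.arange(width), indexing="ij"
    )
    return t.ravel(), h.ravel(), w.ravel()

# Hypothetical example: three keyframes at frames 0, 8, 16, each a 2x2 token grid.
t_idx, h_idx, w_idx = keyframe_positions([0, 8, 16], 2, 2)
tokens = np.random.default_rng(0).normal(size=(12, 64))
rotated = rope_3d(tokens, t_idx, h_idx, w_idx)
```

Because each channel pair is only rotated, token norms are preserved, and the token at position (0, 0, 0) is left unchanged. Giving keyframe tokens the temporal index of the frame they describe would let a guidance branch align sparse keyframe semantics with the dense video latent grid, but whether Plan-X indexes keyframes this way is an assumption of this sketch.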

Base DiT vs Plan-X Comparison

Panels: Seedance 1.0 (Base) vs. Plan-X-Seedance (Ours)

Fixed camera. The woman slowly stands up from the ground, using both hands on the floor to push herself up. She instinctively glances around her, then raises her arms and stretches upward, her body leaning slightly backward.

Fixed camera. A hand reaches out to the passport on the table and places it on the record.

The woman's right hand retracts from the purple 'Cosmic Bites' bag. Her hand moves downwards and to the left. She grasps an orange 'Cosmic Bites' bag from the shelf below, holding it out slightly.

The person's right gloved hand picks up the stirring rod, carefully places it vertically inside the beaker, letting it rest against the side, then withdraws the hand.

The man's right hand extends, fingers gently grip the brass stem of the green-shaded desk lamp, and slowly pivots the lamp head downwards, causing the green shade to angle lower towards the book, intensifying the light on the table surface.

The man's hand reaches for the smartphone, lifts it slightly, and his thumb briefly taps the screen, causing it to illuminate and display content.

Panels: Wan 2.2 (Base) vs. Plan-X-Wan (Ours)

fixed shot, close-up of a purple lotus flower blooming slowly. the petals gradually open outward, revealing the inner structure and stamen. wisps of smoke drift around the flower, enhancing its ethereal appearance.

Fixed shot. A hand reaches to the plate with cucumbers, move and place it next to the transparent plate.

Baseline Comparison

Comparison with baseline methods

The woman reaches for the white mug with her right hand, takes a sip, and places the mug back down; then, her left hand reaches for the chocolate croissant and lifts it up slightly.

Methods compared: HunyuanVideo, SkyReels V2 14B, Wan 2.2 5B, Seedance 1.0, Plan-X-Wan, Plan-X-Seedance

Fixed shot on a tabletop. The left hand reaches in to pick up the pink powder comb from the green book and sets it down on the table, while the right hand moves in and lifts the white iPhone from its spot, placing it to rest on top of the green book.

Methods compared: HunyuanVideo, SkyReels V2 14B, Wan 2.2 5B, Seedance 1.0, Plan-X-Wan, Plan-X-Seedance

Single Image - Multiple Actions

The man's right hand smoothly grasps the white coffee cup, lifts it to his lips for a brief sip, and then gently returns the cup to its saucer on the table.

The man's right hand gently pushes the white coffee cup and its saucer a few inches to the left across the polished wooden tabletop.

camera tilts down, the person's hands move to the base of the vase, lifting it slightly off the wheel, the vase tilts precariously, then settles back down.

camera slowly pans left, the person's head turns to look at a finished pot on a shelf, their hand pauses on the vase, then returns to shaping.

The child kneels slightly, uses both hands to gently grasp the potted plant, and then carefully lifts it from the soil, moving it to the right and placing it onto the grassy path next to the watering can.

The child's right hand lowers the silver shovel, placing it flat on the soil near their boot. The child then bends slightly, reaching to grasp the handle of the green watering can, lifting it off the grass.

Fixed shot. The vivid blue bird flutters its wings, takes off from the bird bath rim, and flies upward and out of frame.

Fixed shot. The vivid blue bird flies from the bird bath to the red apple, landing briefly on its top, causing the apple to tilt and then roll slightly to the side before the bird flies off screen.

Text-To-Video (T2V) Generation

Image description: "Medium close-up shot of a middle-aged white woman standing outdoors in the snow, holding a plate of pancakes with both hands and smiling at the camera. She is wearing a light-colored headscarf with a floral pattern and a dark green coat. Her face has visible wrinkles, deep-set eyes, a wide nose, distinct nasolabial folds, straight teeth, reddish skin, and a short, round chin. She has brown hair and looks towards the lens. The plate in front of her holds many pancakes stacked together. The background is blurred, featuring snow-covered trees and some red berries." Video prompt: "On a snowy day, a woman holds a plate of pancakes. The woman first squints her eyes while looking at the camera and speaking happily, then she laughs loudly while shaking her body up and down, followed by speaking happily again."

Image description: "In a backyard with gentle sunshine, a father wearing a green T-shirt is bending over watering a newly planted small tree with a hose. Standing beside him is a little girl wearing a yellow dress and a denim jacket, quietly watching her father's actions. Behind them are a wooden fence and a gray house wall; the ground is covered with newly turned wet soil, sunlight spills down through tree shadows, and the scene is warm and tranquil." Video prompt: "Fixed shot, the father uses a hose to water the roots of the tree. The little girl smiles and watches, reaching out to touch the tree leaves. Very excited, the little girl then raises both hands and cheers."
