X-UniMotion features an end-to-end training framework that jointly learns an implicit latent representation of full-body human motion and synthesizes lifelike videos with a DiT network. At its core, we employ an image encoder $\mathcal{E}$ to extract a 1D latent motion descriptor $z$ from the driving image $I_D$, capturing full-body articulation. This global motion code is complemented by decoupled local descriptors ($z_{lh}$ and $z_{rh}$ for the left and right hands, and $z_f$ for facial expressions) extracted from local patches around the hands and face by $\mathcal{E}_h$ and $\mathcal{E}_f$, respectively. To make the motion representation identity-agnostic, we apply spatial and color augmentations that disentangle identity cues from the motion latents. These motion tokens are retargeted to the body structure of the reference subject in $I_S$ via a ViT decoder $\mathcal{D}$, which outputs identity-aligned spatial motion guidance. This guidance is concatenated with the noised video latents and, together with the reference image latents, provided as input to the DiT model. The facial motion latent $z_f$ is further injected into the DiT network through cross-attention layers to control expressions. To supervise motion encoding, two auxiliary decoders $\mathcal{D}_h$ and $\mathcal{D}_n$ are applied to the intermediate motion features, predicting joint heatmaps and hand normal maps, respectively. During inference, latent motion codes are extracted directly from each frame of the driving video, enabling expressive and photorealistic animations that maintain strong identity resemblance to the reference image.
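To make the conditioning pathway above concrete, the following is a minimal PyTorch-style sketch of the data flow: motion latents are extracted from the driving frame and its local crops, fused into identity-aligned spatial guidance, concatenated channel-wise with the noised and reference latents, and the facial latent is injected via cross-attention. Every module name here (MotionEncoder, RetargetingDecoder, DiTBlockWithFaceXAttn), along with all dimensions and the CNN/attention stubs, is an illustrative assumption rather than the actual X-UniMotion architecture; a single frame stands in for the video latent sequence, and the augmentations and auxiliary heatmap/normal-map supervision are omitted.

```python
import torch
import torch.nn as nn


class MotionEncoder(nn.Module):
    """Maps an image (or a hand/face crop) to a 1D latent motion descriptor."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:  # (B, 3, H, W) -> (B, D)
        return self.backbone(img)


class RetargetingDecoder(nn.Module):
    """Stand-in for the ViT decoder D: fuses the motion latents with the source
    image I_S and emits identity-aligned spatial motion guidance."""

    def __init__(self, latent_dim: int = 256, guidance_ch: int = 16, size: int = 32):
        super().__init__()
        self.guidance_ch, self.size = guidance_ch, size
        self.src_enc = nn.Sequential(  # crude source-image feature, not a real ViT
            nn.Conv2d(3, 32, kernel_size=8, stride=8), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, latent_dim),
        )
        self.to_map = nn.Linear(5 * latent_dim, guidance_ch * size * size)

    def forward(self, I_S, z, z_lh, z_rh, z_f):
        fused = torch.cat([self.src_enc(I_S), z, z_lh, z_rh, z_f], dim=-1)
        return self.to_map(fused).view(-1, self.guidance_ch, self.size, self.size)


class DiTBlockWithFaceXAttn(nn.Module):
    """One transformer block of the DiT; the facial latent z_f is injected
    through an extra cross-attention layer."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, z_f_tokens):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norm2(x), z_f_tokens, z_f_tokens)[0]
        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    B, D, width = 2, 256, 512
    enc, enc_h, enc_f = MotionEncoder(D), MotionEncoder(D), MotionEncoder(D)
    dec, block = RetargetingDecoder(D), DiTBlockWithFaceXAttn(width)

    I_D = torch.randn(B, 3, 256, 256)                        # driving frame
    I_S = torch.randn(B, 3, 256, 256)                        # reference (source) image
    lh, rh, face = (torch.randn(B, 3, 64, 64) for _ in range(3))  # local crops

    z, z_lh, z_rh, z_f = enc(I_D), enc_h(lh), enc_h(rh), enc_f(face)
    guidance = dec(I_S, z, z_lh, z_rh, z_f)                  # (B, 16, 32, 32)

    noised_latents = torch.randn(B, 16, 32, 32)              # noised video latents (one frame)
    ref_latents = torch.randn(B, 16, 32, 32)                 # reference image latents
    x = torch.cat([noised_latents, guidance, ref_latents], dim=1)  # channel-wise concat

    tokens = nn.Linear(x.shape[1], width)(x.flatten(2).transpose(1, 2))  # (B, 1024, 512)
    z_f_tok = nn.Linear(D, width)(z_f).unsqueeze(1)          # (B, 1, 512) cross-attn context
    print(block(tokens, z_f_tok).shape)                      # torch.Size([2, 1024, 512])
```

In this sketch, the spatial guidance conditions the DiT densely through channel concatenation, while the low-dimensional facial code acts as a global cross-attention context, mirroring the two injection paths described above.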
Results
Results Gallery
We provide more results of X-UniMotion below. (Double-click a video to zoom in for better quality.)
Appearance and Motion Disentanglement (the first column shows the driving video)
X-UniMotion extracts compact, unified, expressive, and depth-aware latent representations of whole-body human motion that are disentangled from identity, capturing complex body and hand articulations as well as fine-grained facial expressions.
Diversity
We show that X-UniMotion can generate diverse motions for a variety of reference images.
Comparison to State-of-the-Art Methods
Ethics
Our research presents advanced generative AI capabilities for human video synthesis. We firmly oppose the misuse of our technology for generating manipulated content of real individuals. While our model enables the creation and editing of photorealistic digital humans, we strongly condemn any application aimed at spreading misinformation, damaging reputations, or creating deceptive content. We acknowledge the ethical considerations surrounding this technology and are committed to responsible development and deployment that prioritizes transparency and prevents harmful applications. The images and videos used in these demos are from public sources or generated by models, and are used solely to demonstrate the capabilities of this research work. If there are any concerns, please contact us (guoxiansong@bytedance.com) and we will remove the content promptly.