X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

ByteDance Inc.

Please turn on your speaker.
    Source portraits: www.pexels.com, www.midjourney.com, www.deviantart.com
    Driving video: www.reddit.com


Given a single reference portrait (left column), X-Portrait synthesizes compelling and expressive animations (right columns), capturing the wide-ranging head poses and facial expressions of a driving sequence at both dynamic and nuanced scales. Our method is effective across a diverse range of facial portraits and driving motions, precisely retaining identity features while faithfully transferring intricate motion details.

Abstract

We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as the appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. At its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieving fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module learns to interpret the dynamics directly from the original driving RGB inputs. Motion accuracy is further enhanced with a patch-based local control module that effectively guides the motion attention to small-scale nuances such as eyeball positions. Notably, to mitigate identity leakage from the driving signals, we train our motion control modules with scale-augmented cross-identity images, ensuring full disentanglement from the appearance reference modules. Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with well-preserved identity characteristics.
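
To make the control pathway concrete, below is a minimal PyTorch sketch (not the authors' released code) of two ideas the abstract describes: a ControlNet-style motion module that reads dynamics directly from driving RGB frames through a zero-initialized projection, and a scale augmentation that could be applied to cross-identity driving frames so the motion encoder cannot key on identity-revealing face scale. All module names, layer widths, and shapes here are illustrative assumptions.

# Minimal sketch of ControlNet-style motion control from raw driving RGB,
# plus scale augmentation for cross-identity training. Illustrative only:
# names, widths, and shapes are assumptions, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionControlNet(nn.Module):
    """Encodes a driving RGB frame into residual features to be added to
    the hidden states of a frozen diffusion UNet. Unlike landmark-based
    control, the motion is read directly from pixels, so subtle cues
    (e.g., eyeball positions) remain available to the model."""

    def __init__(self, in_ch: int = 3, widths=(64, 128, 256)):
        super().__init__()
        layers, ch = [], in_ch
        for w in widths:
            layers += [nn.Conv2d(ch, w, 3, stride=2, padding=1), nn.SiLU()]
            ch = w
        self.encoder = nn.Sequential(*layers)
        # Zero-initialized output projection (the ControlNet trick):
        # at step 0 the control branch contributes nothing, so training
        # starts from the frozen backbone's unmodified behavior.
        self.zero_proj = nn.Conv2d(ch, ch, 1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, driving_rgb: torch.Tensor) -> torch.Tensor:
        return self.zero_proj(self.encoder(driving_rgb))

def scale_augment(frame: torch.Tensor, min_s: float = 0.8,
                  max_s: float = 1.2) -> torch.Tensor:
    """Randomly rescales a (B, C, H, W) driving frame and crops/pads it
    back to its original size -- one plausible form of the abstract's
    'scale-augmented cross-identity' training, discouraging the motion
    encoder from inferring identity through absolute face scale."""
    h, w = frame.shape[-2:]
    s = torch.empty(1).uniform_(min_s, max_s).item()
    scaled = F.interpolate(frame, scale_factor=s, mode="bilinear",
                           align_corners=False)
    sh, sw = scaled.shape[-2:]
    if sh >= h and sw >= w:  # zoomed in: center-crop back to (h, w)
        top, left = (sh - h) // 2, (sw - w) // 2
        return scaled[..., top:top + h, left:left + w]
    pad_h, pad_w = h - sh, w - sw  # zoomed out: zero-pad back to (h, w)
    return F.pad(scaled, (pad_w // 2, pad_w - pad_w // 2,
                          pad_h // 2, pad_h - pad_h // 2))

if __name__ == "__main__":
    drive = torch.randn(1, 3, 256, 256)          # a driving RGB frame
    control = MotionControlNet()(scale_augment(drive))
    print(control.shape)                         # (1, 256, 32, 32)

In a full pipeline, the control features would be added to the corresponding UNet encoder activations at matching resolutions, with the backbone weights kept frozen, following the standard ControlNet recipe.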

Presentation


Single source portrait + Multiple driving motions


Single driving motion + Multiple source portraits


Comparison
