X-Streamer: Unified Human World Modeling with Audiovisual Interaction

You Xie Tianpei Gu Zenan Li Chenxu Zhang Guoxian Song Xiaochen Zhao Chao Liang

Jianwen Jiang Hongyi Xu Linjie Luo

ByteDance Intelligent Creation

TL;DR

X-Streamer is an end-to-end multimodal human world modeling framework that builds an infinitely streamable digital human from a single portrait and generates intelligent, real-time, multi-turn responses across text, speech, and video. X-Streamer paves the way toward unified world modeling of interactive digital humans.

Infinite Streamable Generation

X-Streamer sustains interactions of unbounded length across text, speech, and video within a single unified architecture.

Long Conversational Context and Intelligent Interaction

X-Streamer accommodates up to 8K tokens of conversational context, facilitating advanced reasoning and long-term memory throughout multi-turn interactions.
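
As a rough illustration of what a bounded conversational context can look like in a streaming setting, the sketch below keeps a rolling window of turns under a fixed token budget. It is not X-Streamer's actual mechanism: the oldest-turn eviction policy, the `Turn` and `ConversationContext` names, and the reading of "8K" as 8192 tokens are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

# Assumption: "8K" is read as 8192 tokens; the source does not give the exact value.
MAX_CONTEXT_TOKENS = 8192


@dataclass
class Turn:
    """One conversational turn (hypothetical structure)."""
    speaker: str
    tokens: list  # token IDs for this turn, whatever the modality


@dataclass
class ConversationContext:
    """Rolling context that evicts the oldest turns once over budget.

    Oldest-first eviction is an assumed policy, not the authors' method.
    """
    max_tokens: int = MAX_CONTEXT_TOKENS
    turns: deque = field(default_factory=deque)
    total_tokens: int = 0

    def append(self, turn: Turn) -> None:
        self.turns.append(turn)
        self.total_tokens += len(turn.tokens)
        # Drop whole turns from the front until the budget is met,
        # always keeping the most recent turn.
        while self.total_tokens > self.max_tokens and len(self.turns) > 1:
            dropped = self.turns.popleft()
            self.total_tokens -= len(dropped.tokens)

    def flatten(self) -> list:
        """Concatenate the retained turns into one token sequence."""
        return [tok for turn in self.turns for tok in turn.tokens]
```

With a small budget such as `ConversationContext(max_tokens=10)`, appending turns of 6, 3, and 4 tokens evicts the first turn and leaves 7 tokens in context. Evicting whole turns rather than individual tokens keeps each remaining turn intact, which is one plausible design choice among several.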

More Examples

X-Streamer generalizes to diverse scenarios without retraining.

Visual Perception Extension

Visual perception can be readily integrated into the existing Thinker–Actor architecture.