X-Streamer: Unified Human World Modeling with Audiovisual Interaction

X-Streamer Teaser

TL;DR

X-Streamer is an end-to-end multimodal human world modeling framework that builds an infinitely streamable digital human from a single portrait and generates intelligent, real-time, multi-turn responses across text, speech, and video. It paves the way toward unified world modeling of interactive digital humans.

X-Streamer Architecture

Infinite Streamable Generation

X-Streamer sustains open-ended interaction across text, speech, and video within a single unified architecture, with no fixed limit on session length.

Long Conversational Context and Intelligent Interaction

X-Streamer accommodates up to 8K tokens of conversational context, which supports reasoning over earlier turns and long-term memory throughout multi-turn interactions.
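
To make the idea of a bounded conversational context concrete, here is a minimal sketch of a turn buffer that stays within a fixed token budget. It is an illustration only: the page does not say how X-Streamer manages its context when the limit is reached, so the oldest-turn eviction policy, the reading of "8K" as 8192, and the names `Turn` and `ContextWindow` are all assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass, field

# Assumption: "8K tokens" is read as 8192; the page does not give an exact figure.
CONTEXT_BUDGET = 8192


@dataclass
class Turn:
    """One conversational turn (hypothetical structure, not X-Streamer's)."""
    speaker: str
    tokens: list  # token ids for this turn, whatever the modality


@dataclass
class ContextWindow:
    """Keeps the most recent turns whose total length fits the budget."""
    budget: int = CONTEXT_BUDGET
    turns: deque = field(default_factory=deque)
    used: int = 0

    def append(self, turn: Turn) -> None:
        if len(turn.tokens) > self.budget:
            raise ValueError("a single turn exceeds the context budget")
        self.turns.append(turn)
        self.used += len(turn.tokens)
        # Assumed policy: drop the oldest turns until the total fits again.
        while self.used > self.budget:
            dropped = self.turns.popleft()
            self.used -= len(dropped.tokens)
```

With a budget of 10 tokens, appending three 4-token turns leaves only the two most recent turns in the window. Whole-turn eviction is chosen here simply because it keeps every retained turn intact; other policies, such as summarizing older turns, would fit the same interface.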

More Examples

X-Streamer generalizes to diverse scenarios without retraining.

Visual Perception Extension

Visual perception can be readily integrated into the existing Thinker–Actor architecture.