UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking

1Shanda AI Research Tokyo, 2The University of Tokyo, 3Institute of Science Tokyo
*Equal contribution. Corresponding author.

Abstract

Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interplay of speaking and listening. Modeling the listener, however, is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker's motion is strongly driven by speech audio, whereas the listener's motion primarily follows an internal motion prior and is only loosely guided by external speech. As a result, most existing methods focus on speak-only generation. The only prior attempt at joint generation relies on the speaker's motion as an additional input to produce the listener, a design that is not end-to-end and therefore hinders real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions driven only by dual-track audio. Our method introduces a novel two-stage training paradigm. Stage 1 learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues. Extensive evaluations show that UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to a 44.1% improvement on listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.
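For readers who prefer code, the sketch below shows one way the two-stage paradigm described above could be organized. It is a minimal, illustrative example: the module names, dimensions, GRU backbone, and MSE reconstruction loss are all assumptions for exposition, not the released UniLS implementation.

import torch
import torch.nn as nn


class MotionGenerator(nn.Module):
    """Autoregressive generator over per-frame facial motion parameters (illustrative)."""

    def __init__(self, motion_dim=106, audio_dim=512, hidden_dim=512):
        # motion_dim / audio_dim are placeholder sizes, not the paper's values.
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)   # unused in Stage 1
        self.backbone = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out_proj = nn.Linear(hidden_dim, motion_dim)

    def forward(self, prev_motion, audio_feat=None):
        # prev_motion: (B, T, motion_dim); audio_feat: (B, T, audio_dim) or None
        x = self.motion_proj(prev_motion)
        if audio_feat is not None:
            # Stage 2: external speech cues modulate the learned motion prior.
            x = x + self.audio_proj(audio_feat)
        h, _ = self.backbone(x)
        return self.out_proj(h)               # next-frame motion prediction


def train_stage1(generator, loader, optimizer):
    """Stage 1: audio-free training that learns the internal motion prior."""
    for motion in loader:                      # motion: (B, T, motion_dim)
        pred = generator(motion[:, :-1])       # predict frame t from frames < t
        loss = nn.functional.mse_loss(pred, motion[:, 1:])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def train_stage2(generator, loader, optimizer):
    """Stage 2: fine-tune the same generator with dual-track audio conditioning."""
    for motion, self_audio, partner_audio in loader:
        # Concatenate the avatar's own track and the partner's track.
        audio = torch.cat([self_audio, partner_audio], dim=-1)
        pred = generator(motion[:, :-1], audio[:, 1:])
        loss = nn.functional.mse_loss(pred, motion[:, 1:])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()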

UniLS-Talk Dataset

To enable research on unified speaking and listening avatar generation, we construct the UniLS-Talk Dataset, a large-scale collection of high-quality 3D facial motion data. We apply a carefully designed tracking pipeline to extract per-frame FLAME parameters, including expression coefficients, eye gaze, jaw pose, and head pose annotations. The dataset comprises two complementary parts:

  • Paired conversational data sourced from the Seamless Interaction dataset, providing synchronized dual-speaker videos with natural turn-taking dynamics between speaking and listening.
  • Unpaired multi-scenario data aggregated from CelebV, TalkingHead-1KH, TEDTalk, VFHQ, and other in-the-wild videos, covering diverse facial behaviors across identities and environments (news broadcasts, interviews, casual talking, etc.).

Category                  Source                                                         Hours      Audio   Motion
Paired Conversational     Seamless Interaction Dataset                                   657.5 h    ✓       ✓
Unpaired Multi-Scenario   Diverse identities and environments from in-the-wild videos    546.5 h    ✓       ✓
Total                                                                                    1,204 h

The paired conversational data is split into 622.5 hours for training, 4.8 hours for validation, and 30.2 hours for testing. All data includes FLAME expression parameters, jaw and head pose, and eye-gaze annotations at 25 fps. The dataset is publicly available on HuggingFace.
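As an illustration of how a tracked clip in UniLS-Talk might be consumed, the following sketch shows one plausible per-clip record layout at 25 fps. The field names and dimensions are assumptions for exposition, not the released schema.

from dataclasses import dataclass
import numpy as np

FPS = 25  # annotation rate stated above


@dataclass
class FlameClip:
    expression: np.ndarray  # (T, n_expr)  FLAME expression coefficients
    jaw_pose: np.ndarray    # (T, 3)       jaw rotation
    head_pose: np.ndarray   # (T, 3)       global head rotation
    eye_gaze: np.ndarray    # (T, 2)       gaze direction (example layout)
    audio_path: str         # corresponding audio track

    def frame_at(self, seconds: float) -> int:
        """Map a timestamp in seconds to a frame index at 25 fps."""
        return int(round(seconds * FPS))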

Demo Gallery

Unified Speaking & Listening

UniLS generates smooth, real-time transitions between speaking and listening for two interlocutors driven solely by dual-track audio. Each video shows a pair of avatars: notice the natural head movements, blinks, and micro-expressions during both speaking and listening phases.
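The sketch below illustrates one way such a dual-track, frame-by-frame rollout for two interlocutors could look. The generator interface, tensor shapes, and single-frame feedback are simplifying assumptions rather than the actual UniLS inference code; a full implementation would also carry the generator's recurrent state or a longer motion history.

import torch


@torch.no_grad()
def generate_dyad(generator, audio_a, audio_b, init_motion_a, init_motion_b):
    """Autoregressively roll out both avatars from their own and the partner's audio."""
    motions_a, motions_b = [init_motion_a], [init_motion_b]   # each: (1, 1, motion_dim)
    num_frames = audio_a.shape[1]                             # audio tracks: (1, T, audio_dim)
    for t in range(1, num_frames):
        # Each avatar is conditioned on its own track plus the partner's track,
        # so it speaks when its own track is active and listens otherwise.
        cond_a = torch.cat([audio_a[:, t:t + 1], audio_b[:, t:t + 1]], dim=-1)
        cond_b = torch.cat([audio_b[:, t:t + 1], audio_a[:, t:t + 1]], dim=-1)
        motions_a.append(generator(motions_a[-1], cond_a))
        motions_b.append(generator(motions_b[-1], cond_b))
    return torch.cat(motions_a, dim=1), torch.cat(motions_b, dim=1)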

Coming Soon

Speaking & Listening with GAGAvatar Rendering

UniLS generates unified speaking and listening expressions, rendered with GAGAvatar for realistic avatar visualization.

Coming Soon

Speaking Only

Audio-driven speaking generation across different identities.

Coming Soon

BibTeX

@inproceedings{chu2025unils,
    title     = {{UniLS}: End-to-End Audio-Driven Avatars for Unified Listening and Speaking},
    author    = {Chu, Xuangeng and Liu, Ruicong and Huang, Yifei and Liu, Yun and Peng, Yichen and Zheng, Bo},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year      = {2026}
}