ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model

¹The University of Tokyo, ²RIKEN AIP

ARTalk generates realistic 3D head motions in ⚡real time⚡ from input audio, including accurate lip sync, natural facial expressions, eye blinks, and head poses.

Abstract

Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from audio clips. Although existing diffusion-based methods are capable of producing natural motions, their slow generation speed limits their application potential. In this paper, we introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks by learning a mapping from speech to a multi-scale motion codebook. Furthermore, our model can adapt to unseen speaking styles using sample motion sequences, enabling the creation of 3D talking avatars with unique personal styles beyond the identities seen during training. Extensive evaluations and user studies demonstrate that our method outperforms existing approaches in lip synchronization and perceived quality.
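The abstract outlines the core mechanism: an autoregressive model consumes speech features and emits discrete tokens from a multi-scale motion codebook, which are then decoded into head motion. Below is a minimal PyTorch sketch of that reading; the class name, feature dimensions, number of scales, and the choice of a transformer decoder are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class ARMotionGenerator(nn.Module):
    """Autoregressive speech-to-motion-token sketch (illustrative, not the released model)."""

    def __init__(self, audio_dim=768, d_model=512, codebook_size=1024,
                 num_scales=3, n_layers=6, n_heads=8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)        # project speech features to decoder width
        self.token_emb = nn.Embedding(codebook_size, d_model)  # embeddings for motion-codebook indices
        self.scale_emb = nn.Embedding(num_scales, d_model)     # marks the temporal scale of each token
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, codebook_size)          # logits over the next motion token

    def forward(self, audio_feats, prev_tokens, prev_scales):
        # audio_feats: (B, T_a, audio_dim) features from a pretrained speech encoder
        # prev_tokens: (B, T_m) previously generated codebook indices
        # prev_scales: (B, T_m) scale index of each previous token
        memory = self.audio_proj(audio_feats)
        x = self.token_emb(prev_tokens) + self.scale_emb(prev_scales)
        L = x.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)           # attend to speech, condition on past tokens
        return self.head(h)                                    # (B, T_m, codebook_size)

# Toy rollout step: feed audio features plus the tokens generated so far,
# then take the argmax at the last position as the next motion token.
model = ARMotionGenerator()
logits = model(torch.randn(1, 50, 768),
               torch.zeros(1, 12, dtype=torch.long),
               torch.zeros(1, 12, dtype=torch.long))
next_token = logits[:, -1].argmax(-1)

Under this reading, the real-time behavior comes from single-pass autoregressive decoding rather than iterative diffusion sampling, with the predicted indices mapped back to head motion by the codebook's decoder.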

Driving Mesh with ARTalk

Driving with Text-to-Speech Audio

Driving GAGAvatar with ARTalk

BibTeX

@misc{chu2025artalk,
    title={ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model}, 
    author={Xuangeng Chu and Nabarun Goswami and Ziteng Cui and Hanqin Wang and Tatsuya Harada},
    year={2025},
    eprint={2502.20323},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2502.20323}, 
}