AMUSE: Emotional Speech-driven 3D Body Animation
via Disentangled Latent Diffusion
CVPR 2024
Kiran Chhatre1, Radek Daněček2, Nikos Athanasiou2,
Giorgio Becherini2, Christopher Peters1, Michael J. Black2, Timo Bolkart2
1KTH Royal Institute of Technology, Sweden, 2Max Planck Institute for Intelligent Systems, Germany
AMUSE generates realistic emotional 3D body gestures directly from a speech sequence (top). It gives the user control over the generated emotion by combining the driving speech with the emotion and style of a different audio sequence (bottom).
Abstract
Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech.
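To make the control described in the abstract concrete, below is a minimal, hypothetical inference sketch in PyTorch. The class and module names (AmuseLikeInference, content_enc, denoiser, motion_dec, etc.) and the sampling loop are illustrative assumptions, not AMUSE's released code; they only trace the flow of taking content from the driving speech and emotion/style from a second audio clip before denoising and decoding a gesture sequence.

# Hypothetical inference-flow sketch based on the abstract; all module and
# method names are placeholders, not AMUSE's actual API.
import torch

class AmuseLikeInference:
    def __init__(self, content_enc, emotion_enc, style_enc, denoiser, motion_dec):
        self.content_enc = content_enc    # audio -> content latent
        self.emotion_enc = emotion_enc    # audio -> emotion latent
        self.style_enc = style_enc        # audio -> personal-style latent
        self.denoiser = denoiser          # latent diffusion denoiser
        self.motion_dec = motion_dec      # motion-prior decoder

    @torch.no_grad()
    def generate(self, driving_audio, emotion_audio, num_steps=50):
        # Content comes from the driving speech ...
        z_c = self.content_enc(driving_audio)
        # ... while emotion and style can be taken from a different audio clip.
        z_e = self.emotion_enc(emotion_audio)
        z_s = self.style_enc(emotion_audio)

        # Start from random noise in the motion latent space; resampling this
        # noise yields gesture variations with the same emotional expressivity.
        # (Assumes, for simplicity, that the motion latent shares z_c's shape.)
        z = torch.randn_like(z_c)
        for t in reversed(range(num_steps)):
            z = self.denoiser(z, t, cond=(z_c, z_e, z_s))

        # Decode the denoised motion latent into a 3D pose sequence.
        return self.motion_dec(z)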
Intro video
Method overview
Gesture generation model. We train the motion prior (\(\mathcal{P}_{E}, \mathcal{P}_{D}\)) and the latent denoiser \(\Delta\) jointly, while keeping the audio encoding networks frozen. In the forward pass, we take an input audio \(a^{1:T}\) and pose sequence \(m^{1:T}\). First, we pass \(m^{1:T}\) through \(\mathcal{P}_{E}\) and \(\mathcal{P}_{D}\) and compute \(\mathcal{L}_{rec}\), \(\mathcal{L}_{Vrec}\), and \(\mathcal{L}_{KL}\). Then, we apply the diffusion process to the gradient-detached \(\textup{sg}\left[{z_m}\right]\), obtaining the noisy \(z_m^{(D)}\), which is denoised with \(\Delta\) to compute \(\mathcal{L}_{LD}\). Finally, we use \(\Delta\) to fully denoise \(z_n\) into the gradient-detached \(\textup{sg}\left[{z_{\tilde{m}}}\right]\), decode it into \(\tilde{m}^{1:T}\) with \(\mathcal{P}_{D}\), and compute \(\mathcal{L}_{align}\) and \(\mathcal{L}_{Valign}\).
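The caption above can be read as one training step. The following PyTorch-style sketch is a rough illustration under several assumptions: the module interfaces (\(\mathcal{P}_{E}\) returning a latent with its mean and log-variance, a denoiser predicting noise), the noise schedule, the crude deterministic sampling loop, and the unweighted loss sum are all placeholders rather than the paper's implementation.

# Schematic training step mirroring the caption; interfaces, schedule,
# sampling loop, and loss weights are simplifying assumptions.
import torch
import torch.nn.functional as F

def velocity(x):
    # Frame-to-frame differences used by the velocity losses.
    return x[:, 1:] - x[:, :-1]

def training_step(audio, motion, P_E, P_D, denoiser, audio_encoders,
                  alphas_cumprod, num_steps=50):
    # Frozen, pre-trained audio encoders supply the conditioning latents.
    with torch.no_grad():
        cond = audio_encoders(audio)          # disentangled content/emotion/style

    # 1) Motion prior: encode poses, decode, reconstruction + velocity + KL losses.
    z_m, mu, logvar = P_E(motion)
    motion_rec = P_D(z_m)
    L_rec  = F.l1_loss(motion_rec, motion)
    L_Vrec = F.l1_loss(velocity(motion_rec), velocity(motion))
    L_KL   = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # 2) Latent diffusion on the gradient-detached latent sg[z_m].
    B = motion.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=motion.device)
    a_t = alphas_cumprod[t].view(B, *([1] * (z_m.dim() - 1)))
    noise = torch.randn_like(z_m)
    z_m_noisy = a_t.sqrt() * z_m.detach() + (1 - a_t).sqrt() * noise
    L_LD = F.mse_loss(denoiser(z_m_noisy, t, cond), noise)

    # 3) Fully denoise random noise z_n (gradients stopped), decode it with
    #    P_D, and align the result with the ground-truth motion.
    with torch.no_grad():
        z = torch.randn_like(z_m)             # z_n
        for step in reversed(range(num_steps)):
            ts = torch.full((B,), step, device=motion.device)
            z = z - denoiser(z, ts, cond)     # crude deterministic update (placeholder)
    motion_gen = P_D(z)                       # gradients flow into P_D only
    L_align  = F.l1_loss(motion_gen, motion)
    L_Valign = F.l1_loss(velocity(motion_gen), velocity(motion))

    # Unweighted sum for brevity; the real objective weights each term.
    return L_rec + L_Vrec + L_KL + L_LD + L_align + L_Valign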
Speech disentanglement model. An input filterbank is passed through three transformer encoders, producing disentangled content, emotion, and style latents, which are decoded back into a reconstructed filterbank. For self-reconstruction (self), the concatenated triplet of latent vectors is decoded back into the original filterbank. To enforce content disentanglement, we swap content latent vectors (cross-content) between audio pairs from different subjects with the same utterance. To enforce style and emotion disentanglement, we swap style (cross-style) and emotion (cross-emotion) latent vectors between same-subject audio pairs with the same categorical emotion label. We repeat this procedure for a quadruple of audio inputs \( \{ a^{\ast}, a^{\star}, a^{\circ}, a^{\bullet} \} \) in each forward pass.
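The latent-swapping objective above can be sketched as follows. The pairing of the four inputs (two subjects with the same utterance, one subject with two clips of the same emotion) follows the caption, while the L1 reconstruction loss, the dictionary keys, and the encoder/decoder interfaces are illustrative assumptions.

# Sketch of the latent-swapping reconstructions described above; pairing
# keys, loss choice, and interfaces are illustrative assumptions.
import torch
import torch.nn.functional as F

def disentangle_step(fbanks, enc_c, enc_e, enc_s, dec):
    # fbanks: dict with the four filterbanks of one training quadruple.
    #   'a1', 'a2': different subjects, same utterance  -> content swap
    #   'b1', 'b2': same subject, same emotion label    -> emotion/style swap
    lat = {k: (enc_c(x), enc_e(x), enc_s(x)) for k, x in fbanks.items()}

    def rec(content, emotion, style, target):
        # Decode the concatenated triplet back into a filterbank.
        pred = dec(torch.cat([content, emotion, style], dim=-1))
        return F.l1_loss(pred, target)

    # Self-reconstruction: each audio from its own triplet.
    L_self = sum(rec(c, e, s, fbanks[k]) for k, (c, e, s) in lat.items())

    # Cross-content: swap content latents across subjects saying the same words.
    (c1, e1, s1), (c2, e2, s2) = lat['a1'], lat['a2']
    L_content = rec(c2, e1, s1, fbanks['a1']) + rec(c1, e2, s2, fbanks['a2'])

    # Cross-emotion / cross-style: swap within the same subject and emotion label.
    (c3, e3, s3), (c4, e4, s4) = lat['b1'], lat['b2']
    L_emotion = rec(c3, e4, s3, fbanks['b1']) + rec(c4, e3, s4, fbanks['b2'])
    L_style   = rec(c3, e3, s4, fbanks['b1']) + rec(c4, e4, s3, fbanks['b2'])

    return L_self + L_content + L_emotion + L_style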
More results
Acknowledgments & Disclosure
We thank the authors of the BEAT dataset for providing the raw motion capture data. We thank Alpar Cseke, Taylor McConnell, and Tsvetelina Alexiadis for their help with the design and deployment of the perceptual study. We also thank Benjamin Pellkofer, Joan Piles-Contreras, and Eugen Fritzler for cluster computing and IT support at MPI, and EECS IT support at KTH. We express our gratitude to Peter Kulits for proof-reading and valuable feedback. Finally, we thank Mathis Petrovich, Surabhi Kokane, and Sahba Zojaji for their insightful discussions and advice. This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 860768 (CLIPE project). Michael Black has received research gift funds from Adobe, Intel, Nvidia, Meta/Facebook, and Amazon. Michael Black has financial interests in Amazon, Datagen Technologies, and Meshcapade GmbH. While Michael Black is a consultant for Meshcapade and Timo Bolkart a full-time employee of Google, their research was performed solely at, and funded solely by, the Max Planck Society.
BibTeX
If you find the Model & Software, the BVH2SMPLX conversion tool, or the SMPLX Blender addon-based visualization software useful in your research, we kindly ask that you cite our work:
@InProceedings{Chhatre_2024_CVPR,
author = {Chhatre, Kiran and Daněček, Radek and Athanasiou, Nikos and Becherini, Giorgio and Peters, Christopher and Black, Michael J. and Bolkart, Timo},
title = {{AMUSE}: Emotional Speech-driven {3D} Body Animation via Disentangled Latent Diffusion},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {1942--1953},
url = {https://amuse.is.tue.mpg.de},
}