HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching

A semantic-aware framework for holistic co-speech gesture generation that grounds motion in speech content through composite latent alignment and contrastive flow matching.

Max Planck Institute for Psycholinguistics  ·  Donders Institute for Brain, Cognition and Behaviour  ·  Utrecht University
{lanmiao.liu, esam.ghaleb, asli.ozyurek}@mpi.nl    z.yumak@uu.nl

Method Overview

A brief visual summary of the proposed semantic-aware contrastive flow matching framework for holistic gesture generation.

Overview video of HolisticSemGes.

Highlights

Key ideas and contributions of HolisticSemGes.

Semantic Grounding Beyond Rhythm

Existing co-speech gesture generators often capture rhythmic beat gestures well but struggle to produce sparse, semantically meaningful motions such as iconic and metaphoric gestures.

Semantics-Aware Composite Module

We introduce SACM, which aligns text, audio, and holistic motion within a shared composite latent space to improve cross-modal consistency and cross-articulator coherence; a minimal sketch of this alignment objective appears after these highlights.

Contrastive Flow Matching

We propose a contrastive flow-matching framework that uses mismatched audio-text conditions as negatives, encouraging the learned velocity field to follow semantically correct motion trajectories while diverging from incongruent ones.

Strong Results on Two Benchmarks

Extensive experiments and user studies on BEAT2 and SHOW demonstrate improved motion realism, speech-motion synchronization, diversity, and perceptual quality over recent state-of-the-art baselines.

Our approach provides a unified end-to-end solution for semantically grounded holistic co-speech gesture generation without relying on external semantic retrieval stages.
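
As a concrete illustration of the SACM alignment referenced above, here is a minimal PyTorch sketch combining a sequence-level cosine objective with a CLIP-style symmetric contrastive loss. The pooled per-sequence embeddings, the equal-weight fusion of text and audio, and the name sacm_alignment_loss are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn.functional as F

def sacm_alignment_loss(text_emb, audio_emb, motion_emb, temperature=0.07):
    """text_emb, audio_emb, motion_emb: (B, D) pooled sequence embeddings,
    assumed already projected into a shared latent space."""
    # Composite speech embedding from text and audio (equal-weight fusion
    # is an illustrative choice).
    speech = F.normalize(0.5 * (text_emb + audio_emb), dim=-1)
    motion = F.normalize(motion_emb, dim=-1)

    # Sequence-level cosine alignment: pull each matched speech-motion
    # pair together.
    loss_cos = (1.0 - (speech * motion).sum(dim=-1)).mean()

    # CLIP-style symmetric contrastive loss over the batch: matched pairs
    # on the diagonal are positives, all other pairings are negatives.
    logits = speech @ motion.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_clip = 0.5 * (F.cross_entropy(logits, labels)
                       + F.cross_entropy(logits.t(), labels))
    return loss_cos + loss_clip

Trained jointly with the three encoders, a loss of this form pushes matched speech and motion sequences toward nearby regions of the composite latent space while separating mismatched ones.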

Abstract

While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, which limit their generalization because of their dependence on predefined linguistic rules. Flow-matching-based methods produce promising results; however, their networks are optimized using only semantically congruent samples, without exposure to negative examples, which biases them toward rhythmic gestures rather than sparse, semantically meaningful motions such as iconic and metaphoric gestures. Furthermore, most methods model body parts in isolation and therefore fail to maintain cross-modal consistency. We introduce a contrastive flow-matching co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion trajectory while repelling semantically incongruent trajectories. Our model ensures cross-modal coherence by embedding text, audio, and holistic motion into a composite latent space via cosine and contrastive objectives. Extensive experiments and a user study demonstrate that our approach outperforms state-of-the-art methods on two datasets, BEAT2 and SHOW.

Framework Overview

HolisticSemGes consists of two synergistic modules for semantic-aware holistic gesture generation.

HolisticSemGes Framework Overview
The Semantics-Aware Composite Module (SACM) aligns text, audio, and holistic motion in a shared semantic latent space using sequence-level cosine alignment and CLIP-style contrastive objectives. The Multimodal Conditioning Module then learns a conditional velocity field that transports latent noise toward the target motion manifold, while Contrastive Flow Matching introduces mismatched audio-text conditions as negatives to improve semantic grounding and directional stability during generation.
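
A minimal sketch of the contrastive flow-matching objective follows, assuming a linear (rectified-flow) interpolation path with target velocity x1 - x0 and in-batch rolling to form mismatched audio-text negatives. The function name contrastive_fm_loss, the repulsion weight lam, and the negative-sampling scheme are assumptions for illustration rather than the paper's exact formulation.

import torch

def contrastive_fm_loss(v_theta, x0, x1, cond, lam=0.05):
    """v_theta: network mapping (x_t, t, cond) -> predicted velocity.
    x0: latent noise; x1: target motion latents; both (B, T, D).
    cond: matched audio-text condition embeddings, batch-first."""
    B = x0.shape[0]
    t = torch.rand(B, 1, 1, device=x0.device)  # uniform time samples
    x_t = (1.0 - t) * x0 + t * x1              # linear interpolation path
    v_target = x1 - x0                         # ground-truth velocity

    v_pred = v_theta(x_t, t, cond)

    # Attraction: regress onto the velocity of the matched trajectory.
    loss_pos = (v_pred - v_target).pow(2).mean()

    # Repulsion: roll the batch so each sample is paired with the target
    # velocity of a mismatched audio-text condition, then push the
    # prediction away from that incongruent trajectory.
    v_target_neg = torch.roll(v_target, shifts=1, dims=0)
    loss_neg = (v_pred - v_target_neg).pow(2).mean()

    return loss_pos - lam * loss_neg

At inference time, gestures are generated by integrating the learned velocity field from latent noise toward the motion manifold, for example with a small number of Euler steps.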

Main Results

We evaluate HolisticSemGes on BEAT2 and SHOW using Fréchet Gesture Distance (FGD), Beat Consistency (BC), and Diversity. Our method achieves the best overall balance across motion realism, speech-motion synchronization, and gesture variability.
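
FGD follows the Fréchet (FID-style) recipe: fit Gaussians to gesture features of real and generated motion, typically extracted by a pretrained gesture autoencoder, and measure the distance between the two distributions. A minimal sketch, where feats_real and feats_gen are assumed precomputed (N, D) feature arrays:

import numpy as np
from scipy import linalg

def frechet_gesture_distance(feats_real, feats_gen):
    """feats_real, feats_gen: (N, D) gesture feature arrays, e.g. from a
    pretrained gesture autoencoder."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; tiny imaginary
    # components from numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

Lower FGD indicates generated gestures whose feature statistics more closely match real motion.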

Quantitative results on BEAT2 and SHOW.
On BEAT2, our method achieves 2.247 FGD, 0.780 BC, and 120 Diversity, outperforming recent baselines including SHOW, EMAGE, RAGGesture, GestureLSM, and SemTalk. On SHOW, our method achieves 18.92 FGD, 0.831 BC, and 112 Diversity, again yielding the strongest overall performance. These results indicate improved realism, temporal coordination, and diversity across datasets and speaker identities.

User Study

We conducted a controlled user study with 30 native English speakers from the UK and US using 35-second clips from the BEAT2 test set across six narrated topics. Participants evaluated randomized videos on naturalness, diversity, and alignment with speech content and timing.

User study results.
The user study shows that HolisticSemGes significantly outperforms SemTalk and EMAGE in perceptual quality. In particular, our model achieves higher scores in naturalness and alignment with speech content and timing, while also surpassing competing methods in perceived diversity without sacrificing realism.

Demo Videos

Example gestures generated by our model under different narrative settings.

Demo 1: semantically grounded holistic gesture generation.
Demo 2: expressive gesture synthesis aligned with speech semantics.
Demo 3.
Demo 4.