A Holistic Cascade System, Benchmark, and Human Evaluation Protocol
for Expressive Speech-to-Speech Translation

Wen-Chin Huang1†‡, Benjamin Peloquin2‡, Justine Kao2, Changhan Wang2,
Hongyu Gong2, Elizabeth Salesky3†, Yossi Adi2, Ann Lee2, Peng-Jen Chen2

1Nagoya University, 2Meta AI, 3Johns Hopkins University
(† = Work done while interning at Meta AI; ‡ = Equal contribution.)

We propose a holistic cascade system for expressive speech-to-speech translation (S2ST), combining multiple prosody transfer techniques previously considered only in isolation. We curate a benchmark expressivity test set in the TV series domain (Heroes) and explore a second dataset in the audiobook domain (Mined audiobook). Finally, we present a human evaluation protocol to assess multiple expressive dimensions across speech pairs. Experimental results indicate that bilingual annotators can assess the quality of expressive preservation in S2ST systems, and that the holistic modeling approach outperforms single-aspect systems.

On this page, we present synthesized examples from both the Heroes and Mined audiobook benchmark datasets, covering different expressive dimensions.

Demo

Results on the Mined audiobook benchmark
Synthesize speech-to-text output
Ground Truth: Source (Spanish) | Target (English)
Predictions: Vanilla TTS | Holistic Cascade (Global transfer + local transfer)

Example 1, input text:
  Source (Spanish): próxima a cometer una mala acción contemplando el sueño de un justo
  Target (English): which is on the point of committing a bad action contemplating the sleep of a
  Vanilla TTS: He is about to commit a bad action, contemplating the dream of a just man.
  Holistic Cascade: He is about to commit a bad action, contemplating the dream of a just man.

Example 2, input text:
  Source (Spanish): entonces el escribió una carta a su madre
  Target (English): and writes a letter to his mother
  Vanilla TTS: Then he wrote a letter to his mother.
  Holistic Cascade: Then he wrote a letter to his mother.

Example 3, input text:
  Source (Spanish): le he destinado un sitio de honor habéis conquistado a mi abuelo
  Target (English): I have fixed upon a corner of Honor for that you have conquered my grandfather you suit him
  Vanilla TTS: I have assigned him a place of honor, you have conquered my grandfather.
  Holistic Cascade: I have assigned him a place of honor, you have conquered my grandfather.

Results on the Heroes benchmark
Synthesize speech-to-text system output
Predictions: Vanilla TTS | Holistic Cascade (Global transfer + local transfer) | Ablation (Global transfer only) | Ablation (Local transfer only)
Input text (speech-to-text output): It's like a Greek tragedy or something.
Input text (speech-to-text output): Abby Collins, “National Security.”
Input text (speech-to-text output): You weren’t going to find out what the powers were.
Synthesize ground truth text
Predictions: Vanilla TTS | Holistic Cascade (Global transfer + local transfer)
Input text (ground truth): It started with their father. Delusions of grandeur, paranoia.
Input text (ground truth): I have been thinking about you and wondering how you've been since...
Input text (ground truth): Only someone with Peter's abilities could get where the virus is stored.