Wen-Chin Huang1†‡, Benjamin Peloquin2‡, Justine Kao2, Changhan Wang2, Hongyu Gong2, Elizabeth Salesky3†, Yossi Adi2, Ann Lee2, Peng-Jen Chen2
1Nagoya University, 2Meta AI, 3Johns Hopkins University († = work done while interning at Meta AI; ‡ = equal contribution)
We propose a holistic cascade system for expressive speech-to-speech translation (S2ST), combining multiple prosody transfer techniques previously considered only in isolation. We curate a benchmark expressivity test set in the TV series domain (Heroes) and explore a second dataset in the audiobook domain (Mined audiobook). Finally, we present a human evaluation protocol to assess multiple expressive dimensions across speech pairs. Experimental results indicate that bilingual annotators can assess the quality of expressive preservation in S2ST systems, and that the holistic modeling approach outperforms single-aspect systems.
On this page, we present synthesized examples from both the Heroes and Mined audiobook benchmark datasets, covering different expressive dimensions.
Synthesize speech-to-text output

| | Ground Truth: Source (Spanish) | Ground Truth: Target (English) | Prediction: Vanilla TTS | Prediction: Holistic Cascade (Global transfer + local transfer) |
|---|---|---|---|---|
| Input text: | próxima a cometer una mala acción contemplando el sueño de un justo | which is on the point of committing a bad action contemplating the sleep of a | He is about to commit a bad action, contemplating the dream of a just man. | He is about to commit a bad action, contemplating the dream of a just man. |
| Input text: | entonces el escribió una carta a su madre | and writes a letter to his mother | Then he wrote a letter to his mother. | Then he wrote a letter to his mother. |
| Input text: | le he destinado un sitio de honor habéis conquistado a mi abuelo | I have fixed upon a corner of Honor for that you have conquered my grandfather you suit him | I have assigned him a place of honor, you have conquered my grandfather. | I have assigned him a place of honor, you have conquered my grandfather. |
Synthesize speech-to-text system output

| Input text (speech-to-text output) | Prediction: Vanilla TTS | Prediction: Holistic Cascade (Global transfer + local transfer) | Prediction: Ablation (Global transfer only) | Prediction: Ablation (Local transfer only) |
|---|---|---|---|---|
| It's like a Greek tragedy or something. | | | | |
| Abby Collins, “National Security.” | | | | |
| You weren’t going to find out what the powers were. | | | | |
Synthesize ground truth text

| Input text (ground truth) | Prediction: Vanilla TTS | Prediction: Holistic Cascade (Global transfer + local transfer) |
|---|---|---|
| It started with their father. Delusions of grandeur, paranoia. | | |
| I have been thinking about you and wondering how you've been since... | | |
| Only someone with Peter's abilities could get where the virus is stored. | | |