Direct Speech-to-Speech Translation With Discrete Units

Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma,
Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, Wei-Ning Hsu


We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual-modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields an improvement of 6.7 BLEU over a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performs comparably to models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages.
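As a rough illustration of what "discrete representations of the target speech" can look like, the sketch below maps each feature frame to the index of its nearest codebook entry. The codebook, the 2-D toy features, and the helper names `quantize` and `collapse_repeats` are assumptions made for this example, not the encoder described above; merging consecutive duplicate units is an optional step that this page does not specify.

```python
from itertools import groupby


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
        for frame in frames
    ]


def collapse_repeats(units):
    """Merge runs of identical consecutive units into a single unit."""
    return [unit for unit, _ in groupby(units)]


# Toy 2-D "features" and a 3-entry codebook (both hypothetical); a real
# system would use learned encoder features and a much larger codebook.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2], [3.8, 0.1], [4.1, -0.2]]

units = quantize(frames, centroids)
print(units)                    # [0, 0, 1, 2, 2]
print(collapse_repeats(units))  # [0, 1, 2]
```

Under these assumptions, the unit sequence plays the role of the prediction target for a sequence-to-sequence model, in place of spectrogram frames.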

Written Language Setup
Compare systems that use both source and target text transcripts during training

We provide ground-truth source and target audio with the corresponding reference text, as well as audio samples from three systems and the corresponding ASR text:
(1) S2UT+CTC: the proposed direct speech-to-unit translation system with joint speech and text training,
(2) Transformer Translatotron: a baseline direct speech-to-spectrogram translation model,
(3) S2T+TTS: a baseline cascaded system with a speech-to-text translation model and a text-to-speech synthesis model.
Both (1) and (2) are trained with source and target text as auxiliary task targets. For (1) and (3), we also provide the systems' text output.
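The ASR rows transcribe each system's generated audio, so a mismatch between a system's ASR text and its own text output points to a synthesis or recognition error rather than a translation error. One simple way to quantify that mismatch is a word-level edit distance. This is a minimal sketch assuming whitespace tokenization; the function name `word_edit_distance` is hypothetical, and the example strings are the S2UT+CTC entries of sample 4 in the written language setup.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over whitespace-separated words."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hyp_word in enumerate(h, start=1):
        curr = [i]
        for j, ref_word in enumerate(r, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,         # drop a word from hyp
                curr[j - 1] + 1,     # add a word to hyp
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


asr_text = "port orekin but i grew in the united states"
text_output = "puerto rican but i grew in the united states"
print(word_edit_distance(asr_text, text_output))  # 2
```

Here the two substitutions come from the ASR rendering of "puerto rican"; the rest of the sentence is identical.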

Ground truth: Source (Spanish) | Target (English)
Predictions: (1) S2UT+CTC | (2) Transformer Translatotron | (3) S2T+TTS
sample 1: S2UT+CTC performs the best
Reference: y, y le voy a preguntar si me toca reemplazarlo. | and, and I am going to ask if it will be on me to replace it.
ASR: (1) and i'm going to ask if i have to replace it | (2) and alaskan and i have to relac im | (3) and i'm going to ask you and i have to replace it
Text output: (1) and i'm going to ask if i have to replace it | (3) and i'm going to ask you and i have to replace it
sample 2: ASR errors for S2UT+CTC and Transformer Translatotron
Reference: sin embargo te digo, no sé, de repente aquí. ¿Tu estás en, en Pennsylvania, me dijistes? | anyways I tell you, I don't know, all of a sudden here. You were in Pennsylvania you say?
ASR: (1) however i tell you i don't know suddenly here you are in pennsylvania you told me | (2) however i'm telling you i don't know somley here in pennsylvania | (3) nevertheless i tell you i don't know suddenly here you are in in pennsylvania and you
Text output: (1) however i tell you i don't know suddenly here you are in in pennsylvania you told me | (3) nevertheless i tell you i don't know suddenly here you are in in pennsylvania and you
sample 3: S2UT+CTC performs the best
Reference: mucha energía así es | lots of energy, indeed
ASR: (1) a lot of energy that's right | (2) a lot of energy this is | (3) a lot of energy so
Text output: (1) a lot of energy that's right | (3) a lot of energy so
sample 4: ASR errors vs. correct text output from S2UT+CTC and S2T+TTS
Reference: Puertoriqueña. ¿Pero creció en Estados Unidos? | Puerto Rican. But were you born in the United States?
ASR: (1) port orekin but i grew in the united states | (2) porto rekin but he grew up in the united states | (3) porto recan but i grew up in the united states
Text output: (1) puerto rican but i grew in the united states | (3) puerto rican but i grew up in the united states
sample 5: S2UT+CTC and S2T+TTS perform similarly
Reference: no no hace tanto, hace poco. | not it hasn't been that long, it's been a short time
ASR: (1) no not that long ago | (2) no no i don't know that much yes | (3) no not so long ago
Text output: (1) no not that long ago | (3) no not so long ago
sample 6: S2UT+CTC and S2T+TTS perform similarly
Reference: No yo soy casada, yo tengo tres años de casada y la verdad yo sí tengo, un niño, tiene veinte meses | No, I'm married, I've been married for three years and i do have a kid, he's 20 months old
ASR: (1) no i'm married i've been married for three years and honestly i do have only i have a child he's twenty months old | (2) no i am married i have been married and really wing and the truth is i'd only have one is twenty months old | (3) no i'm married i've been married for three years and the truth i do have only a child is twenty months old
Text output: (1) no i'm married i've been married for three years and honestly i do have only i have a child he's twenty months old | (3) no i'm married i've been married for three years and the truth i do have only a child is twenty months old
sample 7: wrong translations for name
Reference: Bueno mi nombre es Claudia Ivette ¿Con quién tengo el gusto? | Well, my name is Claudia Ivette. With whom do I have the pleasure of speaking?
ASR: (1) well my name is claudia but tracy with whom do i have the pleasure of speaking | (2) well my name is claria and beatricia with whom have the pleasure to pleasure | (3) well my name is claudia who am i talking to
Text output: (1) well my name is claudia buttraycy with whom do i have the pleasure of speaking | (3) well my name is claudia who am i talking to

Unwritten Language Setup
Compare systems that do NOT use any text transcripts during training

We provide ground-truth source and target audio with the corresponding reference text, as well as audio samples from two systems and the corresponding ASR text:
(1) S2UT w/ source unit task: the proposed direct speech-to-unit translation system trained with source discrete units as the auxiliary task target,
(2) S2UT w/o auxiliary task: the proposed direct speech-to-unit translation system trained without multitask learning.

Ground truth: Source (Spanish) | Target (English)
Predictions: (1) S2UT w/ source unit task | (2) S2UT w/o auxiliary task
sample 1: S2UT w/ source unit task performs better
Reference: Hola Fernanda, ¿cómo estás? mi nombre es Claudia. | Hi Fernanda, how are you?, my name is Claudia
ASR: (1) hello fernanda how are you my name is claudia | (2) hello how are you how are you my name is gloria
sample 2: S2UT w/ source unit task performs better
Reference: No dijo que no para la familia no más, porque ya le van a comprar algo mejor para el niño, y. | No, she said that family only, because they will buy something better for the kid, so
ASR: (1) no yes not that for the family because they are going to buy something better for the kid end | (2) no yes i don't know for me that they call the contrary for the kids for the kids
sample 3: S2UT w/o auxiliary task failed to translate
Reference: El computador es lo más avanzado de la civilización ahora. | The computer is the most advanced of civilization now.
ASR: (1) the computer is the most advanced of the civilization now | (2) that they were talking about the religion and all that
sample 4: ASR errors
Reference: ajá, ¿y tu tienes, prácticas alguna religión en particulas? | aha, and do you practice any religion in particular?
ASR: (1) wright and you have do you practise some religion in particular | (2) right and you have to participate religion
sample 5: S2UT w/ source unit task performs better
Reference: entonces los servicios son más baratos, porque venden a más personas | then services are cheaper, because they sell to more people
ASR: (1) and then the services are cheaper because they sell more people | (2) and then they say that they say because they give you more person
sample 6: S2UT w/ source unit task performs better
Reference: Ah, yo hallé, tengo una amiga que se llama Norma, yo creo que ella vive en Georgia también. | Ah, I, have a friend whose name is Norma, I think she lives in Georgia also.
ASR: (1) oh i have a friend that's called norma i think that she lives in georgia | (2) of course i'm from my friends that called me but i think that they also taken to
sample 7: S2UT w/o auxiliary task failed to translate
Reference: Tienen hasta clases que, que los niños pueden tomar o los estudiantes que pueden tomar en la computadora | They have classes for kids that can take and solve them on the computer
ASR: (1) they have classes that the kids can take the students that they can take computers | (2) you have to be careful that they can call the computers

Written (source) to Unwritten (target) Language Setup
Compare systems that use source text transcripts but NOT target text transcripts during training

We provide ground-truth source and target audio with the corresponding reference text, as well as audio samples from two systems and the corresponding ASR text:
(1) S2UT: the proposed direct speech-to-unit translation system trained with source text as the auxiliary task target,
(2) ASR+T2UT: a cascaded system with an automatic speech recognition model and a text-to-unit translation model.

Ground truth: Source (Spanish) | Target (English)
Predictions: (1) S2UT | (2) ASR+T2UT
sample 1: both systems perform well
Reference: ¿Qué, qué estudias tu? | What are you studying?
ASR: (1) what do you study | (2) what do you study
sample 2: both systems perform well
Reference: ¿Y hace cuánta ya que vive acá? | And how long did you live there?
ASR: (1) and how long have you been living here | (2) and how long have you been here
sample 3: ASR error for S2UT
Reference: sí. Yo crecí, y me crié en Chile, y cuando terminé la secundaria de High School , vine a estudiar a la universidad, a | Yes. I was born and raised in Chile, and when I finished High School, I came to study at the University,
ASR: (1) yes i was raised in shile when i finished the secondary of high school i came to study in the university | (2) yes i grew up i was raised in chile when i finished high school i came to study to university ah
sample 4: ASR+T2UT produces natural speech
Reference: bueno, realmente no sé, qué, qué tanto conocimiento tienes tu pero este, por ejemplo, al situación que estamos viviendo en Venezuela es una situación muy especial | well, I really don't know, how, how much knowledge you have but ehm, for example, the situation that we are living in Venezuela is a very special situation
ASR: (1) well i really don't know how much knowledge you have but for example the the situation that we are living in venezuela is a very special situation | (2) well i really don't know how much knowledge you have but for example sure situation that we are living in venezuela it's a very special situation
sample 5: repeating translation for S2UT
Reference: hoy en día pues comprar un computador por doscientos o trescientos dólares | now you can buy a computer for two hundred or three hundred dollars,
ASR: (1) nowadays you can buy a computer for two hundred or three hundred or three hundred dollars | (2) today you can buy a computer for two hundred or three hundred dollars
sample 6: S2UT performs better
Reference: sin embargo te digo, no sé, de repente aquí. ¿Tu estás en, en Pennsylvania, me dijistes? | anyways I tell you, I don't know, all of a sudden here. You were in Pennsylvania you say?
ASR: (1) however i tell you i don't know suddenly here are you in pensylvania right you saw | (2) however i tell you i don't know maybe you are here in pensylvania
sample 7: ASR+T2UT produces natural speech
Reference: no sé qué pasa, pero no tienen o sea, en, en general no tienen esa como esa necesidad de protestar ante ciertas cosas | I don't know what happens, but they don't have in general they don't have that necessity to protest under certain things
ASR: (1) i don't know what happens but they don't have i mean in general they don't have a need to protestant certain things | (2) i don't know what happens but they don't have i mean in general they don't have that like that need to protest some some things
Template based on Textless NLP and HiFi-GAN pages.