Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma,
Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, Wei-Ning Hsu
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder to the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual-modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields an improvement of 6.7 BLEU over a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model's performance is comparable to that of models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages.
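To make the idea of "discrete representations of the target speech" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each speech frame has already been mapped to a feature vector by some encoder and that a small codebook of centroids is available; both are made up here for illustration. Each frame is assigned the index of its nearest centroid, giving a sequence of discrete units that a speech-to-unit model could be trained to predict. Collapsing consecutive repeated units is shown as an optional post-processing step and is likewise an assumption.

```python
from itertools import groupby


def nearest_unit(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(k):
        return sum((f - c) ** 2 for f, c in zip(frame, centroids[k]))
    return min(range(len(centroids)), key=sq_dist)


def frames_to_units(frames, centroids):
    """Map every frame-level feature vector to a discrete unit index."""
    return [nearest_unit(frame, centroids) for frame in frames]


def collapse_repeats(units):
    """Merge runs of identical consecutive units (optional, assumed step)."""
    return [unit for unit, _ in groupby(units)]


# Hypothetical 2-dimensional features and a 3-entry codebook, for illustration only.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, 0.0], [0.2, -0.1], [0.9, 1.1], [0.1, 1.9], [0.0, 2.2]]

units = frames_to_units(frames, centroids)
reduced_units = collapse_repeats(units)
```

In this toy setup the unit sequence plays the role of the prediction target; real features would be high-dimensional and the codebook far larger.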
We provide ground-truth source and target audio with the corresponding reference text,
as well as audio samples from three systems and the corresponding ASR text:
(1) S2UT+CTC: the proposed direct speech-to-unit translation system with joint speech and text training,
(2) Transformer Translatotron: a baseline direct speech-to-spectrogram translation model,
(3) S2T+TTS: a baseline cascaded system with a speech-to-text translation model and a text-to-speech synthesis model.
Both (1) and (2) are trained with source and target text as auxiliary task targets. For (1) and (3), we also provide the systems' text output.
Ground truth | Predictions | ||||
---|---|---|---|---|---|
Source (Spanish) | Target (English) | S2UT+CTC | Transformer Translatotron | S2T+TTS | |
sample 1: S2UT+CTC performs the best | |||||
Reference: | y, y le voy a preguntar si me toca reemplazarlo. | and, and I am going to ask if it will be on me to replace it. | |||
ASR: | and i'm going to ask if i have to replace it | and alaskan and i have to relac im | and i'm going to ask you and i have to replace it | ||
Text output: | and i'm going to ask if i have to replace it | and i'm going to ask you and i have to replace it | |||
sample 2: ASR errors for S2UT+CTC and Transformer Translatotron | |||||
Reference: | sin embargo te digo, no sé, de repente aquí. ¿Tu estás en, en Pennsylvania, me dijistes? | anyways I tell you, I don't know, all of a sudden here. You were in Pennsylvania you say? | |||
ASR: | however i tell you i don't know suddenly here you are in pennsylvania you told me | however i'm telling you i don't know somley here in pennsylvania | nevertheless i tell you i don't know suddenly here you are in in pennsylvania and you | ||
Text output: | however i tell you i don't know suddenly here you are in in pennsylvania you told me | nevertheless i tell you i don't know suddenly here you are in in pennsylvania and you | |||
sample 3: S2UT+CTC performs the best | |||||
Reference: | mucha energía así es | lots of energy, indeed | |||
ASR: | a lot of energy that's right | a lot of energy this is | a lot of energy so | ||
Text output: | a lot of energy that's right | a lot of energy so | |||
sample 4: ASR errors vs. correct text output from S2UT+CTC and S2T+TTS | |||||
Reference: | Puertoriqueña. ¿Pero creció en Estados Unidos? | Puerto Rican. But were you born in the United States? | |||
ASR: | port orekin but i grew in the united states | porto rekin but he grew up in the united states | porto recan but i grew up in the united states | ||
Text output: | puerto rican but i grew in the united states | puerto rican but i grew up in the united states | |||
sample 5: S2UT+CTC and S2T+TTS perform similarly | |||||
Reference: | no no hace tanto, hace poco. | not it hasn't been that long, it's been a short time | |||
ASR: | no not that long ago | no no i don't know that much yes | no not so long ago | ||
Text output: | no not that long ago | no not so long ago | |||
sample 6: S2UT+CTC and S2T+TTS perform similarly | |||||
Reference: | No yo soy casada, yo tengo tres años de casada y la verdad yo sí tengo, un niño, tiene veinte meses | No, I'm married, I've been married for three years and i do have a kid, he's 20 months old | |||
ASR: | no i'm married i've been married for three years and honestly i do have only i have a child he's twenty months old | no i am married i have been married and really wing and the truth is i'd only have one is twenty months old | no i'm married i've been married for three years and the truth i do have only a child is twenty months old | ||
Text output: | no i'm married i've been married for three years and honestly i do have only i have a child he's twenty months old | no i'm married i've been married for three years and the truth i do have only a child is twenty months old | |||
sample 7: wrong translations for name | |||||
Reference: | Bueno mi nombre es Claudia Ivette ¿Con quién tengo el gusto? | Well, my name is Claudia Ivette. With whom do I have the pleasure of speaking? | |||
ASR: | well my name is claudia but tracy with whom do i have the pleasure of speaking | well my name is claria and beatricia with whom have the pleasure to pleasure | well my name is claudia who am i talking to | ||
Text output: | well my name is claudia buttraycy with whom do i have the pleasure of speaking | well my name is claudia who am i talking to |
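The ASR rows in the table above are lowercase and unpunctuated, while the reference rows keep casing and punctuation, so the two need to be normalized before they can be compared. The sketch below shows one simple way to do that and to count word-level differences with an edit distance. It is only an illustration: the function names are hypothetical, and this is not the BLEU metric reported in the abstract.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation (keeping apostrophes), and split into words."""
    cleaned = re.sub(r"[^\w\s']", " ", text.lower())
    return cleaned.split()


def word_edit_distance(reference, hypothesis):
    """Word-level Levenshtein distance between two normalized word lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Reference and ASR strings taken from sample 3 of the table above.
reference = normalize("lots of energy, indeed")
asr_text = normalize("a lot of energy that's right")
distance = word_edit_distance(reference, asr_text)
```

A lower distance means the ASR transcript of the generated speech is closer to the reference wording; it says nothing about whether a differently worded translation is still acceptable.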
We provide ground-truth source and target audio with the corresponding reference text,
as well as audio samples from two systems and the corresponding ASR text:
(1) S2UT w/ source unit task: the proposed direct speech-to-unit translation system trained with source discrete units as the auxiliary task target,
(2) S2UT w/o auxiliary task: the proposed direct speech-to-unit translation system trained without multitask learning.
Ground truth | Predictions | ||||
---|---|---|---|---|---|
Source (Spanish) | Target (English) | S2UT w/ source unit task | S2UT w/o auxiliary task | ||
sample 1: S2UT w/ source unit task performs better | |||||
Reference: | Hola Fernanda, ¿cómo estás? mi nombre es Claudia. | Hi Fernanda, how are you?, my name is Claudia | |||
ASR: | hello fernanda how are you my name is claudia | hello how are you how are you my name is gloria | |||
sample 2: S2UT w/ source unit task performs better | |||||
Reference: | No dijo que no para la familia no más, porque ya le van a comprar algo mejor para el niño, y. | No, she said that family only, because they will buy something better for the kid, so | |||
ASR: | no yes not that for the family because they are going to buy something better for the kid end | no yes i don't know for me that they call the contrary for the kids for the kids | |||
sample 3: S2UT w/o auxiliary task failed to translate | |||||
Reference: | El computador es lo más avanzado de la civilización ahora. | The computer is the most advanced of civilization now. | |||
ASR: | the computer is the most advanced of the civilization now | that they were talking about the religion and all that | |||
sample 4: ASR errors | |||||
Reference: | ajá, ¿y tu tienes, prácticas alguna religión en particulas? | aha, and do you practice any religion in particular? | |||
ASR: | wright and you have do you practise some religion in particular | right and you have to participate religion | |||
sample 5: S2UT w/ source unit task performs better | |||||
Reference: | entonces los servicios son más baratos, porque venden a más personas | then services are cheaper, because they sell to more people | |||
ASR: | and then the services are cheaper because they sell more people | and then they say that they say because they give you more person | |||
sample 6: S2UT w/ source unit task performs better | |||||
Reference: | Ah, yo hallé, tengo una amiga que se llama Norma, yo creo que ella vive en Georgia también. | Ah, I, have a friend whose name is Norma, I think she lives in Georgia also. | |||
ASR: | oh i have a friend that's called norma i think that she lives in georgia | of course i'm from my friends that called me but i think that they also taken to | |||
sample 7: S2UT w/o auxiliary task failed to translate | |||||
Reference: | Tienen hasta clases que, que los niños pueden tomar o los estudiantes que pueden tomar en la computadora | They have classes for kids that can take and solve them on the computer | |||
ASR: | they have classes that the kids can take the students that they can take computers | you have to be careful that they can call the computers |
We provide ground-truth source and target audio with the corresponding reference text,
as well as audio samples from two systems and the corresponding ASR text:
(1) S2UT: the proposed direct speech-to-unit translation system trained with source text as the auxiliary task target,
(2) ASR+T2UT: a cascaded system with an automatic speech recognition model and a text-to-unit translation model.
Ground truth | Predictions | ||||
---|---|---|---|---|---|
Source (Spanish) | Target (English) | S2UT | ASR+T2UT | ||
sample 1: both systems perform well | |||||
Reference: | ¿Qué, qué estudias tu? | What are you studying? | |||
ASR: | what do you study | what do you study | |||
sample 2: both systems perform well | |||||
Reference: | ¿Y hace cuánta ya que vive acá? | And how long did you live there? | |||
ASR: | and how long have you been living here | and how long have you been here | |||
sample 3: ASR error for S2UT | |||||
Reference: | sí. Yo crecí, y me crié en Chile, y cuando terminé la secundaria de High School , vine a estudiar a la universidad, a | Yes. I was born and raised in Chile, and when I finished High School, I came to study at the University, | |||
ASR: | yes i was raised in shile when i finished the secondary of high school i came to study in the university | yes i grew up i was raised in chile when i finished high school i came to study to university ah | |||
sample 4: ASR+T2UT produces natural speech | |||||
Reference: | bueno, realmente no sé, qué, qué tanto conocimiento tienes tu pero este, por ejemplo, al situación que estamos viviendo en Venezuela es una situación muy especial | well, I really don't know, how, how much knowledge you have but ehm, for example, the situation that we are living in Venezuela is a very special situation | |||
ASR: | well i really don't know how much knowledge you have but for example the the situation that we are living in venezuela is a very special situation | well i really don't know how much knowledge you have but for example sure situation that we are living in venezuela it's a very special situation | |||
sample 5: repeating translation for S2UT | |||||
Reference: | hoy en día pues comprar un computador por doscientos o trescientos dólares | now you can buy a computer for two hundred or three hundred dollars, | |||
ASR: | nowadays you can buy a computer for two hundred or three hundred or three hundred dollars | today you can buy a computer for two hundred or three hundred dollars | |||
sample 6: S2UT performs better | |||||
Reference: | sin embargo te digo, no sé, de repente aquí. ¿Tu estás en, en Pennsylvania, me dijistes? | anyways I tell you, I don't know, all of a sudden here. You were in Pennsylvania you say? | |||
ASR: | however i tell you i don't know suddenly here are you in pensylvania right you saw | however i tell you i don't know maybe you are here in pensylvania | |||
sample 7: ASR+T2UT produces natural speech | |||||
Reference: | no sé qué pasa, pero no tienen o sea, en, en general no tienen esa como esa necesidad de protestar ante ciertas cosas | I don't know what happens, but they don't have in general they don't have that necessity to protest under certain things | |||
ASR: | i don't know what happens but they don't have i mean in general they don't have a need to protestant certain things | i don't know what happens but they don't have i mean in general they don't have that like that need to protest some some things |