Direct speech-to-speech translation with discrete units

We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder to the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual-modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields an improvement of 6.7 BLEU over a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performs comparably to models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages.
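As a rough illustration of what "discrete representations of the target speech" can look like, the sketch below maps frame-level feature vectors to unit indices by nearest-centroid lookup against a small codebook. This is an assumption made for illustration only: the text above does not say how the encoder discretizes speech, and the `quantize` function, the codebook, and the 2-D feature values are all hypothetical stand-ins for real encoder outputs.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def quantize(frames, centroids):
    """Map each feature vector to the index of its nearest centroid.

    Hypothetical helper: nearest-centroid assignment is one common way to
    turn continuous features into discrete units, not necessarily the
    method used by the system described on this page.
    """
    return [
        min(range(len(centroids)), key=lambda k: dist(frame, centroids[k]))
        for frame in frames
    ]


# Toy 2-D values standing in for self-supervised encoder features (made up).
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (0.2, 1.9)]

units = quantize(frames, centroids)
print(units)  # one unit index per input frame
```

Under this reading, the S2UT model would be trained to predict a sequence such as `units` from the source speech, and a separate vocoder would turn the predicted units back into a waveform.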

Written language setup
Unwritten language setup
Written (source) to unwritten (target) language setup

Written Language Setup

Compare systems that use both source and target text transcripts during training

We provide ground truth source and target audio with the corresponding reference text, as well as audio samples from three systems and the corresponding ASR text:
(1) S2UT+CTC: the proposed direct speech-to-unit translation system with joint speech and text training,
(2) Transformer Translatotron: a baseline direct speech-to-spectrogram translation model,
(3) S2T+TTS: a baseline cascaded system with a speech-to-text translation model and a text-to-speech synthesis model.
Both (1) and (2) are trained with source and target text as auxiliary task targets. For (1) and (3), we also provide the systems' text output.

	Ground truth		Predictions
	Source (Spanish)	Target (English)	S2UT+CTC	Transformer Translatotron	S2T+TTS
sample 1: S2UT+CTC performs the best

Reference:	y, y le voy a preguntar si me toca reemplazarlo.	and, and I am going to ask if it will be on me to replace it.
ASR:			and i'm going to ask if i have to replace it	and alaskan and i have to relac im	and i'm going to ask you and i have to replace it
Text output:			and i'm going to ask if i have to replace it		and i'm going to ask you and i have to replace it
sample 2: ASR errors for S2UT+CTC and Transformer Translatotron

Reference:	sin embargo te digo, no sé, de repente aquí. ¿Tu estás en, en Pennsylvania, me dijistes?	anyways I tell you, I don't know, all of a sudden here. You were in Pennsylvania you say?
ASR:			however i tell you i don't know suddenly here you are in pennsylvania you told me	however i'm telling you i don't know somley here in pennsylvania	nevertheless i tell you i don't know suddenly here you are in in pennsylvania and you
Text output:			however i tell you i don't know suddenly here you are in in pennsylvania you told me		nevertheless i tell you i don't know suddenly here you are in in pennsylvania and you
sample 3: S2UT+CTC performs the best

Reference:	mucha energía así es	lots of energy, indeed
ASR:			a lot of energy that's right	a lot of energy this is	a lot of energy so
Text output:			a lot of energy that's right		a lot of energy so
sample 4: ASR errors vs. correct text output from S2UT+CTC and S2T+TTS

Reference:	Puertoriqueña. ¿Pero creció en Estados Unidos?	Puerto Rican. But were you born in the United States?
ASR:			port orekin but i grew in the united states	porto rekin but he grew up in the united states	porto recan but i grew up in the united states
Text output:			puerto rican but i grew in the united states		puerto rican but i grew up in the united states
sample 5: S2UT+CTC and S2T+TTS perform similarly

Reference:	no no hace tanto, hace poco.	not it hasn't been that long, it's been a short time
ASR:			no not that long ago	no no i don't know that much yes	no not so long ago
Text output:			no not that long ago		no not so long ago
sample 6: S2UT+CTC and S2T+TTS perform similarly

Reference:	No yo soy casada, yo tengo tres años de casada y la verdad yo sí tengo, un niño, tiene veinte meses	No, I'm married, I've been married for three years and i do have a kid, he's 20 months old
ASR:			no i'm married i've been married for three years and honestly i do have only i have a child he's twenty months old	no i am married i have been married and really wing and the truth is i'd only have one is twenty months old	no i'm married i've been married for three years and the truth i do have only a child is twenty months old
Text output:			no i'm married i've been married for three years and honestly i do have only i have a child he's twenty months old		no i'm married i've been married for three years and the truth i do have only a child is twenty months old
sample 7: wrong translations of the name

Reference:	Bueno mi nombre es Claudia Ivette ¿Con quién tengo el gusto?	Well, my name is Claudia Ivette. With whom do I have the pleasure of speaking?
ASR:			well my name is claudia but tracy with whom do i have the pleasure of speaking	well my name is claria and beatricia with whom have the pleasure to pleasure	well my name is claudia who am i talking to
Text output:			well my name is claudia buttraycy with whom do i have the pleasure of speaking		well my name is claudia who am i talking to

Unwritten Language Setup

Compare systems that do NOT use any text transcripts during training

We provide ground truth source and target audio with the corresponding reference text, as well as audio samples from two systems and the corresponding ASR text:
(1) S2UT w/ source unit task: the proposed direct speech-to-unit translation system trained with source discrete units as the auxiliary task target,
(2) S2UT w/o auxiliary task: the proposed direct speech-to-unit translation system trained without multitask learning.

	Ground truth		Predictions
	Source (Spanish)	Target (English)	S2UT w/ source unit task	S2UT w/o auxiliary task
sample 1: S2UT w/ source unit task performs better

Reference:	Hola Fernanda, ¿cómo estás? mi nombre es Claudia.	Hi Fernanda, how are you?, my name is Claudia
ASR:			hello fernanda how are you my name is claudia	hello how are you how are you my name is gloria
sample 2: S2UT w/ source unit task performs better

Reference:	No dijo que no para la familia no más, porque ya le van a comprar algo mejor para el niño, y.	No, she said that family only, because they will buy something better for the kid, so
ASR:			no yes not that for the family because they are going to buy something better for the kid end	no yes i don't know for me that they call the contrary for the kids for the kids
sample 3: S2UT w/o auxiliary task failed to translate

Reference:	El computador es lo más avanzado de la civilización ahora.	The computer is the most advanced of civilization now.
ASR:			the computer is the most advanced of the civilization now	that they were talking about the religion and all that
sample 4: ASR errors

Reference:	ajá, ¿y tu tienes, prácticas alguna religión en particulas?	aha, and do you practice any religion in particular?
ASR:			wright and you have do you practise some religion in particular	right and you have to participate religion
sample 5: S2UT w/ source unit task performs better

Reference:	entonces los servicios son más baratos, porque venden a más personas	then services are cheaper, because they sell to more people
ASR:			and then the services are cheaper because they sell more people	and then they say that they say because they give you more person
sample 6: S2UT w/ source unit task performs better

Reference:	Ah, yo hallé, tengo una amiga que se llama Norma, yo creo que ella vive en Georgia también.	Ah, I, have a friend whose name is Norma, I think she lives in Georgia also.
ASR:			oh i have a friend that's called norma i think that she lives in georgia	of course i'm from my friends that called me but i think that they also taken to
sample 7: S2UT w/o auxiliary task failed to translate

Reference:	Tienen hasta clases que, que los niños pueden tomar o los estudiantes que pueden tomar en la computadora	They have classes for kids that can take and solve them on the computer
ASR:			they have classes that the kids can take the students that they can take computers	you have to be careful that they can call the computers

Written (source) to Unwritten (target) Language Setup

Compare systems that use source text transcripts but NOT target text transcripts during training

We provide ground truth source and target audio with the corresponding reference text, as well as audio samples from two systems and the corresponding ASR text:
(1) S2UT: the proposed direct speech-to-unit translation system trained with source text as the auxiliary task target,
(2) ASR+T2UT: a cascaded system with an automatic speech recognition model and a text-to-unit translation model.

	Ground truth		Predictions
	Source (Spanish)	Target (English)	S2UT	ASR+T2UT
sample 1: both systems perform well

Reference:	¿Qué, qué estudias tu?	What are you studying?
ASR:			what do you study	what do you study
sample 2: both systems perform well

Reference:	¿Y hace cuánta ya que vive acá?	And how long did you live there?
ASR:			and how long have you been living here	and how long have you been here
sample 3: ASR error for S2UT

Reference:	sí. Yo crecí, y me crié en Chile, y cuando terminé la secundaria de High School , vine a estudiar a la universidad, a	Yes. I was born and raised in Chile, and when I finished High School, I came to study at the University,
ASR:			yes i was raised in shile when i finished the secondary of high school i came to study in the university	yes i grew up i was raised in chile when i finished high school i came to study to university ah
sample 4: ASR+T2UT produces natural speech

Reference:	bueno, realmente no sé, qué, qué tanto conocimiento tienes tu pero este, por ejemplo, al situación que estamos viviendo en Venezuela es una situación muy especial	well, I really don't know, how, how much knowledge you have but ehm, for example, the situation that we are living in Venezuela is a very special situation
ASR:			well i really don't know how much knowledge you have but for example the the situation that we are living in venezuela is a very special situation	well i really don't know how much knowledge you have but for example sure situation that we are living in venezuela it's a very special situation
sample 5: S2UT repeats part of the translation

Reference:	hoy en día pues comprar un computador por doscientos o trescientos dólares	now you can buy a computer for two hundred or three hundred dollars,
ASR:			nowadays you can buy a computer for two hundred or three hundred or three hundred dollars	today you can buy a computer for two hundred or three hundred dollars
sample 6: S2UT performs better

Reference:	sin embargo te digo, no sé, de repente aquí. ¿Tu estás en, en Pennsylvania, me dijistes?	anyways I tell you, I don't know, all of a sudden here. You were in Pennsylvania you say?
ASR:			however i tell you i don't know suddenly here are you in pensylvania right you saw	however i tell you i don't know maybe you are here in pensylvania
sample 7: ASR+T2UT produces natural speech

Reference:	no sé qué pasa, pero no tienen o sea, en, en general no tienen esa como esa necesidad de protestar ante ciertas cosas	I don't know what happens, but they don't have in general they don't have that necessity to protest under certain things
ASR:			i don't know what happens but they don't have i mean in general they don't have a need to protestant certain things	i don't know what happens but they don't have i mean in general they don't have that like that need to protest some some things

Template based on Textless NLP and HiFi-GAN pages.