Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation

Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee
(☆ = Equal contribution and † = Equal supervision)

[paper]

We explore self-supervised pre-training with unlabeled speech data and data augmentation to improve direct speech-to-speech model training. We take advantage of a recently proposed speech-to-unit translation (S2UT) framework that encodes target speech into discrete representations, and study both speech encoder and discrete unit decoder pre-training as well as efficient partial finetuning methods. We conduct experiments under various data setups and show that self-supervised pre-training consistently improves model performance compared with multitask learning and is complementary to data augmentation techniques that apply ASR and MT models to create weakly supervised training data.
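To make "discrete representations" concrete, the sketch below shows one common way to turn frame-level speech features into a unit sequence: assign each frame to its nearest cluster centroid, then merge consecutive repeats. This page does not say how the units are actually produced, so the nearest-centroid assignment, the repeat merging, and all names and toy values here are assumptions for illustration only.

```python
from itertools import groupby
from typing import List, Sequence

Vector = Sequence[float]


def nearest_centroid(frame: Vector, centroids: Sequence[Vector]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))


def frames_to_units(frames: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Map every feature frame to a discrete unit id."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive units into a single unit."""
    return [unit for unit, _ in groupby(units)]


# Toy 2-D "features" and centroids (hypothetical values, not real speech features).
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.2), (1.1, 0.8), (3.8, 0.1), (0.0, 0.1)]

units = frames_to_units(frames, centroids)
reduced = collapse_repeats(units)
print(units)    # [0, 0, 1, 1, 2, 0]
print(reduced)  # [0, 1, 2, 0]
```

A unit decoder would then be trained to predict sequences like `reduced`, and a vocoder would map predicted units back to a waveform.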

Spanish To English
Comparison with Baselines

We provide ground truth source and target audio with the corresponding reference text, as well as audio samples from three systems:
(1) S2UT+LNA-D: the proposed direct speech-to-unit translation system, initialized with a wav2vec 2.0 encoder and a unit mBART decoder and finetuned using the LNA-D strategy.
(2) Supervised S2UT: a baseline direct speech-to-unit translation system trained with source and target text as auxiliary task targets.
(3) S2T+TTS: a baseline cascaded system consisting of a speech-to-text translation model, initialized with a wav2vec 2.0 encoder and a randomly initialized decoder, followed by a text-to-speech synthesis model.
Both (1) and (2) use an open-source HiFi-GAN vocoder to convert units to waveforms.
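The LNA-D strategy named in the system descriptions above is a form of partial finetuning. This page does not define it, so the sketch below rests on an assumption: that LNA-D updates the whole encoder but, in the decoder, only the LayerNorm and attention parameters. The parameter names are hypothetical and only mimic the naming style of typical Transformer implementations.

```python
from typing import Iterable, List

# Substrings that mark decoder parameters as trainable (an assumption of this sketch).
DECODER_TRAINABLE_KEYWORDS = ("layer_norm", "self_attn", "encoder_attn")


def is_trainable(param_name: str) -> bool:
    """Decide whether a parameter is updated under the assumed LNA-D rule."""
    if param_name.startswith("encoder."):
        return True  # assumed: the encoder is fully finetuned
    if param_name.startswith("decoder."):
        return any(key in param_name for key in DECODER_TRAINABLE_KEYWORDS)
    return False


def select_trainable(param_names: Iterable[str]) -> List[str]:
    """Keep only the parameter names that would be updated."""
    return [name for name in param_names if is_trainable(name)]


# Hypothetical parameter names for a tiny encoder-decoder model.
names = [
    "encoder.layers.0.fc1.weight",
    "decoder.layers.0.self_attn.q_proj.weight",
    "decoder.layers.0.encoder_attn.k_proj.weight",
    "decoder.layers.0.self_attn_layer_norm.weight",
    "decoder.layers.0.fc1.weight",
    "decoder.embed_tokens.weight",
]

trainable = select_trainable(names)
frozen = [name for name in names if name not in trainable]
print(frozen)  # ['decoder.layers.0.fc1.weight', 'decoder.embed_tokens.weight']
```

In a framework such as PyTorch, the same predicate could be used to set `requires_grad` on each named parameter before building the optimizer.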

Ground truth: Source (Spanish), Target (English)
Predictions: S2UT+LNA-D, Supervised S2UT, S2T+TTS

Sample 1: S2UT+LNA-D performs best
Reference:
  Source (Spanish): autobuses adicionales normalmente proporcionados por go south coast van desde bristol al festival
  Target (English): ADDITIONAL BUSES USUALLY PROVIDED BY GO SOUTH COAST GO FROM BRISTOL TO THE FESTIVAL
ASR:
  S2UT+LNA-D: ADDITIONAL BUSES USUALLY PROVIDED BY GO SOUTH COAST GO FROM BRISTOL TO THE FESTIVAL
  Supervised S2UT: ADDITIONAL UP TO BORSES NORMALLY PROVIDED BY COAST SO CAST BANDS OF BRISTOL ALL FESTIVAL
  S2T+TTS: ADDITIONAL BUSES USUALLY PROVIDED BY GO SOUTH COAST GO FROM BRUCE TO THE FESTIVAL

Sample 2: S2UT+LNA-D performs best
Reference:
  Source (Spanish): encontró un país con dos gobiernos en la capital maximiliano era el emperador
  Target (English): HE FOUND A COUNTRY WITH TWO GOVERNMENTS IN THE CAPITAL MAXIMILIAN WAS THE EMPEROR
ASR:
  S2UT+LNA-D: HE FOUND A COUNTRY WITH TWO GOVERNMENTS IN THE CAPITAL MAXIMILIAN WAS THE EMPEROR
  Supervised S2UT: HE FOUND A COUNTRY WITH TWO GOVERNMENTS IN THE CAPITAL THE MOST SIMILIAN CAPITAL WAS THE EMPEROR
  S2T+TTS: HE FOUND A COUNTRY WITH TWO GOVERNMENTS AND THE CAPITAL MAXIMILIAN WAS AN EMPEROR

Sample 3: S2T+TTS performs best
Reference:
  Source (Spanish): otro aspecto más institucional es el equilibrio de fuerzas entre el parlamento y el consejo
  Target (English): ANOTHER MORE INSTITUTIONAL ASPECT IS THE BALANCE OF POWER BETWEEN PARLIAMENT AND THE COUNCIL
ASR:
  S2UT+LNA-D: ANOTHER MORE INSTITUTIONAL ASPECT IS THE BALANCE OF FORCES BETWEEN PARLIAMENT AND THE COUNCIL
  Supervised S2UT: ANOTHER MORE INSTITUTIONAL ASPECT IS THE BALANCE OF FORCES BETWEEN PARLIAMENT AND THE COUNCIL
  S2T+TTS: ANOTHER MORE INSTITUTIONAL ASPECT IS THE BALANCE OF POWER BETWEEN PARLIAMENT AND THE COUNCIL

Sample 4: All systems make errors
Reference:
  Source (Spanish): además su capacidad de regeneración es muy limitada
  Target (English): MOREOVER THEIR CAPACITY FOR REGENERATION IS VERY LIMITED
ASR:
  S2UT+LNA-D: MOREOVER ITS CAPACITY FOR REGENERATION IS VERY LIMITED
  Supervised S2UT: IN ADDITION HIS REGENERATION CAPACITY IS VERY LIMITED
  S2T+TTS: MOREOVER ITS RECOVERY IS VERY LIMITED
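The labels on each sample ("performs best", "all systems make errors") can be checked roughly by comparing each ASR transcript with the reference target text. The sketch below does this with a word-level edit distance and word error rate on Sample 4 above. This is an illustrative stand-in, not necessarily the metric used to evaluate the systems, and the function names are hypothetical.

```python
from typing import Dict


def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over whitespace-separated words."""
    ref = reference.split()
    hyp = hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(reference, hypothesis) / n_ref


reference = "MOREOVER THEIR CAPACITY FOR REGENERATION IS VERY LIMITED"
asr_outputs: Dict[str, str] = {
    "S2UT+LNA-D": "MOREOVER ITS CAPACITY FOR REGENERATION IS VERY LIMITED",
    "Supervised S2UT": "IN ADDITION HIS REGENERATION CAPACITY IS VERY LIMITED",
    "S2T+TTS": "MOREOVER ITS RECOVERY IS VERY LIMITED",
}

scores = {name: word_error_rate(reference, text) for name, text in asr_outputs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
# S2UT+LNA-D: 0.125
# Supervised S2UT: 0.625
# S2T+TTS: 0.500
```

All three scores are above zero, which matches the "All systems make errors" label. Word-level matching is strict, so a valid paraphrase such as the Supervised S2UT output is penalized heavily.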
Spanish To English
Different Data Setups

We provide ground truth source and target audio with the corresponding reference text, as well as audio samples from three systems. All three models are initialized with a wav2vec 2.0 encoder and a unit mBART decoder and finetuned using the LNA-D strategy, but they use different datasets for finetuning:
(1) S2UT_Base: finetuned on the combination of the CoVoST2, Europarl-ST, and mTEDx datasets.
(2) S2UT_LR: finetuned in a low-resource setup with 50 hours of data sampled from the combination of the CoVoST2, Europarl-ST, and mTEDx datasets.
(3) S2UT_Aug: finetuned on the combination of the CoVoST2, Europarl-ST, and mTEDx datasets plus the ASR data.
All models use an open-source HiFi-GAN vocoder to convert units to waveforms.
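The low-resource setup samples a fixed number of hours from the combined datasets. A minimal sketch of one way to do such sampling follows, assuming each utterance comes with a known duration. The actual sampling procedure is not described on this page, and the utterance ids, durations, and function name are made up for illustration.

```python
import random
from typing import List, Sequence, Tuple

Utterance = Tuple[str, float]  # (utterance id, duration in seconds)


def sample_by_duration(utterances: Sequence[Utterance],
                       budget_hours: float,
                       seed: int = 0) -> List[Utterance]:
    """Randomly pick utterances without exceeding a total duration budget."""
    budget_seconds = budget_hours * 3600.0
    shuffled = list(utterances)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    subset: List[Utterance] = []
    total = 0.0
    for utt_id, duration in shuffled:
        if total + duration > budget_seconds:
            continue  # skip utterances that would overshoot the budget
        subset.append((utt_id, duration))
        total += duration
    return subset


# Hypothetical combined pool with made-up ids and durations.
pool: List[Utterance] = (
    [(f"covost2_{i}", 5.0) for i in range(1000)]
    + [(f"europarl_st_{i}", 12.0) for i in range(1000)]
    + [(f"mtedx_{i}", 8.0) for i in range(1000)]
)

subset = sample_by_duration(pool, budget_hours=1.0, seed=13)
total_seconds = sum(duration for _, duration in subset)
print(f"{len(subset)} utterances, {total_seconds / 3600.0:.3f} hours")
```

Calling the function with `budget_hours=50` on a real manifest would produce a subset comparable in size to the S2UT_LR setup. Sampling uniformly over utterances keeps the dataset mix roughly proportional to the pool.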

Ground truth: Source (Spanish), Target (English)
Predictions: S2UT_Base, S2UT_LR, S2UT_Aug

Sample 1: All systems do well
Reference:
  Source (Spanish): cada uno de ellos es un derecho exclusivo sujeto a ciertas limitaciones y excepciones
  Target (English): EACH ONE OF THEM IS AN EXCLUSIVE RIGHT SUBJECT TO CERTAIN LIMITATIONS AND EXCEPTIONS
ASR:
  S2UT_Base: EACH OF THEM IS AN EXCLUSIVE RIGHT SUBJECT TO CERTAIN LIMITATIONS AND EXCEPTIONS
  S2UT_LR: EACH ONE OF THEM IS AN EXCLUSIVE RIGHT SUBJECT TO CERTAIN LIMITATIONS AND EXCEPTIONS
  S2UT_Aug: EACH OF THEM IS AN EXCLUSIVE RIGHT SUBJECT TO CERTAIN LIMITATIONS AND EXCEPTIONS

Sample 2: S2UT_LR performs best
Reference:
  Source (Spanish): esta experiencia representa un paso trascendental en la historia espacial del país
  Target (English): THIS EXPERIENCE REPRESENTS A TRANSCENDENTAL STEP IN THE SPATIAL HISTORY OF THE COUNTRY
ASR:
  S2UT_Base: THIS EXPERIENCE REPRESENTS A TRANSCENDENT STEP IN THE SPACE HISTORY OF THE COUNTRY
  S2UT_LR: THIS EXPERIENCE REPRESENTS A TRANSCENDENTAL STEP IN THE SPATIAL HISTORY OF THE COUNTRY
  S2UT_Aug: THIS EXPERIENCE REPRESENTS A MOVEMENT STEP IN THE SPACE HISTORY OF THE COUNTRY

Sample 3: S2UT_Aug performs best
Reference:
  Source (Spanish): desde la perspectiva del balance físico químico y biológico está en una posición clave
  Target (English): THE PERSPECTIVE OF PHYSICAL CHEMICAL AND BIOLOGICAL BALANCE IT IS IN A KEY POSITION
ASR:
  S2UT_Base: FROM A PHYSICAL CHEMICAL AND BIOLOGICAL BALANCE HE IS IN A KEY POSITION
  S2UT_LR: FROM A PHYSICAL PERSPECTIVE OF PHYSICAL CHEMICAL AND BIOLOGICAL POSITION
  S2UT_Aug: FROM THE PERSPECTIVE OF PHYSICAL CHEMICAL AND BIOLOGICAL BALANCE IT IS IN A KEY POSITION

Sample 4: S2UT_Aug performs best
Reference:
  Source (Spanish): desde un punto de vista presupuestario no parece adecuada la propuesta de financiación procedente de la comisión de desarrollo ya que este dinero no existe al
  Target (English): IN ANY CASE GIVEN THAT THE FINANCING OF THIS NEW COOPERATION INSTRUMENT MUST BE COMPATIBLE WITH THE TWO THOUSAND SEVEN TWENTY THIRTEEN FINANCIAL FRAMEWORK IT IS WORTH
ASR:
  S2UT_Base: IN ANY CASE GIVEN THAT THE FUNDING OF THIS NEW CORPORATION INSTRUMENT MUST BE COMPATIBLE WITH THE TWO THOUSAND SEVEN TWENTY THIRTEEN FINANCIAL FRAMEWORK IT IS IMPORTANT
  S2UT_LR: IN ANY CASE SINCE THE FINANCING OF THIS NEW INSTRUMENT OF CORPORATION MUST COMPATIBLE WITH THE FINANCIAL FRAMEWORK FOR TWENTY THIRTEEN
  S2UT_Aug: IN ANY CASE GIVEN THAT THE FINANCING OF THIS NEW CORPORATION INSTRUMENT MUST BE COMPATIBLE WITH THE TWO THOUSAND SEVEN TWENTY THIRTEEN FINANCIAL FRAMEWORK
English To Spanish
Comparison with Baselines

We provide ground truth source and target audio with the corresponding reference text, as well as audio samples from three systems:
(1) S2UT+LNA-D: the proposed direct speech-to-unit translation system, initialized with a wav2vec 2.0 encoder and a unit mBART decoder and finetuned using the LNA-D strategy.
(2) Supervised S2UT: a baseline direct speech-to-unit translation system trained with source and target text as auxiliary task targets.
(3) S2T+TTS: a baseline cascaded system consisting of a speech-to-text translation model, initialized with a wav2vec 2.0 encoder and a randomly initialized decoder, followed by a text-to-speech synthesis model.
Both (1) and (2) use an open-source HiFi-GAN vocoder to convert units to waveforms.

Ground truth: Source (English), Target (Spanish)
Predictions: S2UT+LNA-D, Supervised S2UT, S2T+TTS

Sample 1: S2UT+LNA-D performs best
Reference:
  Source (English): this should also be an important part of our approach to the twenty twelve budget
  Target (Spanish): esto también debería ser una parte importante de nuestro enfoque del presupuesto dos mil doce
ASR:
  S2UT+LNA-D: esto también debería ser una parte importante de nuestro enfoque al presupuesto dos mil doce
  Supervised S2UT: también debería ser una parte importante de nuestro enfoque al presupuesto dos mil doce
  S2T+TTS: esto también debería ser una parte importante de nuestro enfoque al presupuesto de dos mildos mil dos mil doce

Sample 2: S2UT+LNA-D performs best
Reference:
  Source (English): information encourages citizens interest in public matters and their participation
  Target (Spanish): la información fomenta el interés de los ciudadanos por los asuntos públicos y su participación
ASR:
  S2UT+LNA-D: la información fomenta el interés de los ciudadanos en asuntos públicos y su participación
  Supervised S2UT: la información y el interés de los ciudadanos alientan los intereses de las cuestiones públicas y su participación
  S2T+TTS: la información alienta el interés de los ciudadanos en asuntos públicos y en su participación

Sample 3: S2UT+LNA-D performs best
Reference:
  Source (English): his family who are my constituents are convinced of his innocence
  Target (Spanish): su familia que son mis electores está convencida de su inocencia
ASR:
  S2UT+LNA-D: su familia que son mis electores está convencida de su inocencia
  Supervised S2UT: su familia que son mí circunscripciones están convencidas de estos inocentes
  S2T+TTS: su familia que son mis electores están convencidos de su inocencia

Sample 4: All systems make errors
Reference:
  Source (English): of the directive on all taxes including social security contributions the automatic exchange of information and improved cooperation between the member states in matters of taxation
  Target (Spanish): de la directiva a todos los impuestos incluidas las contribuciones a la seguridad social el intercambio automático de información y la mejora de la cooperación fiscal entre los estados miembros
ASR:
  S2UT+LNA-D: de la directiva a todos los impuestos incluidas las contribuciones a la seguridad social el intercambio automático de información y la mejor cooperación entre los estados miembros en las cuestiones de impuestos
  Supervised S2UT: de la directiva a todos los impuestos impluyendo las contribuciones de seguridad social el intercambio automático de la información y mejorar la cooperación entre los estados miembros y las cuestiones de impuestos
  S2T+TTS: de la directiva para todos los impuestos incluidos las contribuciones de seguridad social el intercambio automático de información y la mejor cooperación entre los estados miembros en la cuestión de la fiscalidad
English To Spanish
Different Data Setups

We provide ground truth source and target audio with the corresponding reference text, as well as audio samples from three systems. All three models are initialized with a wav2vec 2.0 encoder and a unit mBART decoder and finetuned using the LNA-D strategy, but they use different datasets for finetuning:
(1) S2UT_Base: finetuned on the combination of the Europarl-ST and MuST-C datasets.
(2) S2UT_LR: finetuned in a low-resource setup with 50 hours of data sampled from the combination of the Europarl-ST and MuST-C datasets.
(3) S2UT_Aug: finetuned on the combination of the Europarl-ST and MuST-C datasets plus the ASR data.
All models use an open-source HiFi-GAN vocoder to convert units to waveforms.

Ground truth: Source (English), Target (Spanish)
Predictions: S2UT_Base, S2UT_LR, S2UT_Aug

Sample 1: All systems do well
Reference:
  Source (English): we want to see energy poverty as a part of this debate
  Target (Spanish): queremos ver la pobreza energética como parte de este debate
ASR:
  S2UT_Base: queremos ver la pobreza energética como parte de este deate
  S2UT_LR: queremos ver la pobreza energética como parte de este date
  S2UT_Aug: queremos ver la pobreza energética como parte de este deate

Sample 2: S2UT_LR has errors but S2UT_Base and S2UT_Aug got it right
Reference:
  Source (English): in my view one of the most important elements is the follow up of legislative initiative requests from parliament
  Target (Spanish): en mi opinión uno de los elementos más importantes es el seguimiento de las solicitudes de iniciativa legislativa del parlamento
ASR:
  S2UT_Base: n mi opinión uno de los elementos más importantes es el seguimiento de las peticiones de la iniciativa legislativa por parte del pagamento
  S2UT_LR: en mi opinión uno de los elementos más importantes es el seguimiento de las emiendas de iniciativas legislativas de ley
  S2UT_Aug: en mi opinión uno de los elementos más importantes es el seguimiento de las solicitudes de iniciativa legislativa del pagamento

Sample 3: S2UT_Aug performs best
Reference:
  Source (English): we must find an open and constructive procedure on the next financial framework
  Target (Spanish): debemos encontrar un procedimiento abierto y constructivo en el próximo marco financiero
ASR:
  S2UT_Base: debemos encontrar un procedimiento abierto y constructivo sobre el próximo marco financiero
  S2UT_LR: debemos encontrar un procedimiento abierto y constructivo en el sistema financiero financiero financiero financiero
  S2UT_Aug: debemos encontrar un procedimiento abierto y constructivo en el próximo marco financiero

Sample 4: All systems make errors
Reference:
  Source (English): of the directive on all taxes including social security contributions the automatic exchange of information and improved cooperation between the member states in matters of taxation
  Target (Spanish): de la directiva a todos los impuestos incluidas las contribuciones a la seguridad social el intercambio automático de información y la mejora de la cooperación fiscal entre los estados miembros
ASR:
  S2UT_Base: de la directiva a todos los impuestos incluidas las contribuciones a la seguridad social el intercambio automático de información y la mejor cooperación entre los estados miembros en las cuestiones de impuestos
  S2UT_LR: la directiva sobre el impuesto de todos los contribuyentes inpluyendo las contribuciones sociales la introducción automática y mejorada de los estados miembros y mejorar la cooperación entre los estados miembros
  S2UT_Aug: de la directiva a todos los impuestos incluidas las contribuciones a la seguridad social el intercambio automático de información y la mejor cooperación entre los estados miembros en materia de impuestos
Template based on Textless NLP and HiFi-GAN pages.