Sravya Popuri☆, Peng-Jen Chen☆, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu†, Ann Lee†
(☆ = Equal contribution and † = Equal supervision)
[paper]
We explore self-supervised pre-training with unlabeled speech data and data augmentation to improve direct speech-to-speech model training. We take advantage of a recently proposed speech-to-unit translation (S2UT) framework that encodes target speech into discrete representations, and study both speech encoder and discrete unit decoder pre-training as well as efficient partial finetuning methods. We conduct experiments under various data setups and show that self-supervised pre-training consistently improves model performance compared with multitask learning and is complementary to data augmentation techniques that apply ASR and MT models to create weakly supervised training data.
We provide ground truth source and target audio with the corresponding reference text,
as well as audio samples from three systems:
(1) S2UT+LNA-D: the proposed direct speech-to-unit translation
system, initialized with a wav2vec 2.0 encoder and a unit mBART decoder and finetuned using the LNA-D strategy.
(2) Supervised S2UT: a baseline direct speech-to-unit translation system trained with
source and target text as auxiliary task targets.
(3) S2T+TTS: a baseline cascaded system with a speech-to-text translation model initialized
with a wav2vec 2.0 encoder and a randomly initialized decoder, followed by a text-to-speech synthesis model.
Both (1) and (2) use an open-source HiFi-GAN vocoder to convert units to waveforms.
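As a rough illustration of what a partial finetuning strategy such as LNA-D involves, the sketch below selects trainable parameters by name. It assumes that LNA-D updates the whole speech encoder but only the LayerNorm and attention parameters of the decoder; this page does not spell out the rule, so treat that as an assumption. The parameter names and the helper `is_trainable_lna_d` are hypothetical and only mimic a common Transformer naming scheme.

```python
def is_trainable_lna_d(param_name):
    """Return True if a parameter would be updated under the assumed LNA-D rule."""
    if param_name.startswith("encoder."):
        # Assumption: the whole speech encoder is finetuned.
        return True
    if param_name.startswith("decoder."):
        # Assumption: in the decoder, only LayerNorm and attention
        # (self-attention and encoder attention) parameters are finetuned.
        return "layer_norm" in param_name or "attn" in param_name
    return False


# Hypothetical parameter names, for illustration only.
names = [
    "encoder.layers.0.fc1.weight",
    "decoder.layers.0.self_attn.k_proj.weight",
    "decoder.layers.0.encoder_attn.v_proj.weight",
    "decoder.layers.0.final_layer_norm.weight",
    "decoder.layers.0.fc1.weight",
    "decoder.embed_tokens.weight",
]
trainable = [n for n in names if is_trainable_lna_d(n)]
frozen = [n for n in names if not is_trainable_lna_d(n)]
print("trainable:", trainable)
print("frozen:", frozen)
```

With a PyTorch model, the same predicate could be applied while iterating over `model.named_parameters()` and setting `requires_grad` on each parameter accordingly.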
| | Ground truth: Source (Spanish) | Ground truth: Target (English) | Prediction: S2UT+LNA-D | Prediction: Supervised S2UT | Prediction: S2T+TTS |
|---|---|---|---|---|---|
| Sample 1: S2UT+LNA-D performs best | | | | | |
| Reference: | autobuses adicionales normalmente proporcionados por go south coast van desde bristol al festival | ADDITIONAL BUSES USUALLY PROVIDED BY GO SOUTH COAST GO FROM BRISTOL TO THE FESTIVAL | | | |
| ASR: | | | ADDITIONAL BUSES USUALLY PROVIDED BY GO SOUTH COAST GO FROM BRISTOL TO THE FESTIVAL | ADDITIONAL UP TO BORSES NORMALLY PROVIDED BY COAST SO CAST BANDS OF BRISTOL ALL FESTIVAL | ADDITIONAL BUSES USUALLY PROVIDED BY GO SOUTH COAST GO FROM BRUCE TO THE FESTIVAL |
| Sample 2: S2UT+LNA-D performs best | | | | | |
| Reference: | encontró un país con dos gobiernos en la capital maximiliano era el emperador | HE FOUND A COUNTRY WITH TWO GOVERNMENTS IN THE CAPITAL MAXIMILIAN WAS THE EMPEROR | | | |
| ASR: | | | HE FOUND A COUNTRY WITH TWO GOVERNMENTS IN THE CAPITAL MAXIMILIAN WAS THE EMPEROR | HE FOUND A COUNTRY WITH TWO GOVERNMENTS IN THE CAPITAL THE MOST SIMILIAN CAPITAL WAS THE EMPEROR | HE FOUND A COUNTRY WITH TWO GOVERNMENTS AND THE CAPITAL MAXIMILIAN WAS AN EMPEROR |
| Sample 3: S2T+TTS performs best | | | | | |
| Reference: | otro aspecto más institucional es el equilibrio de fuerzas entre el parlamento y el consejo | ANOTHER MORE INSTITUTIONAL ASPECT IS THE BALANCE OF POWER BETWEEN PARLIAMENT AND THE COUNCIL | | | |
| ASR: | | | ANOTHER MORE INSTITUTIONAL ASPECT IS THE BALANCE OF FORCES BETWEEN PARLIAMENT AND THE COUNCIL | ANOTHER MORE INSTITUTIONAL ASPECT IS THE BALANCE OF FORCES BETWEEN PARLIAMENT AND THE COUNCIL | ANOTHER MORE INSTITUTIONAL ASPECT IS THE BALANCE OF POWER BETWEEN PARLIAMENT AND THE COUNCIL |
| Sample 4: All systems make errors | | | | | |
| Reference: | además su capacidad de regeneración es muy limitada | MOREOVER THEIR CAPACITY FOR REGENERATION IS VERY LIMITED | | | |
| ASR: | | | MOREOVER ITS CAPACITY FOR REGENERATION IS VERY LIMITED | IN ADDITION HIS REGENERATION CAPACITY IS VERY LIMITED | MOREOVER ITS RECOVERY IS VERY LIMITED |
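The ASR rows make it possible to compare each system's output against the reference text. As a minimal sketch, the snippet below computes a word error rate (word-level edit distance divided by the reference length) for two of the Sample 1 transcripts. This metric is chosen here only for illustration; the page itself does not state how the samples were scored, and `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Transcripts copied from Sample 1 of the table above.
reference = ("ADDITIONAL BUSES USUALLY PROVIDED BY GO SOUTH COAST "
             "GO FROM BRISTOL TO THE FESTIVAL")
s2ut_lna_d = ("ADDITIONAL BUSES USUALLY PROVIDED BY GO SOUTH COAST "
              "GO FROM BRISTOL TO THE FESTIVAL")
s2t_tts = ("ADDITIONAL BUSES USUALLY PROVIDED BY GO SOUTH COAST "
           "GO FROM BRUCE TO THE FESTIVAL")

wer_lna_d = word_error_rate(reference, s2ut_lna_d)  # identical transcript
wer_s2t_tts = word_error_rate(reference, s2t_tts)   # one substituted word
print(wer_lna_d, wer_s2t_tts)
```

Because the ASR transcripts themselves may contain recognition errors, such a score reflects both translation quality and ASR quality.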
We provide ground truth source and target audio with the corresponding reference text,
as well as audio samples from three systems. All three models are initialized with a wav2vec 2.0 encoder
and a unit mBART decoder and finetuned using the LNA-D strategy, but use different datasets for finetuning:
(1) S2UT_Base: finetuned on the combination of the CoVoST2, Europarl-ST and mTEDx datasets.
(2) S2UT_LR: finetuned in a low-resource setup with 50 hours of data sampled from the
combination of the CoVoST2, Europarl-ST and mTEDx datasets.
(3) S2UT_Aug: finetuned on the combination of the CoVoST2, Europarl-ST and mTEDx
datasets plus the ASR data.
All models use an open-source HiFi-GAN vocoder to convert units to waveforms.
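The low-resource setup draws a fixed number of hours from a larger pool. The page does not describe the sampling procedure, so the following is only a sketch of one plausible approach: shuffle the utterances with a fixed seed and keep those that fit within the hour budget. The function `sample_subset`, the utterance IDs, and the durations are all hypothetical.

```python
import random


def sample_subset(utterances, budget_hours, seed=0):
    """Randomly pick utterances whose total duration stays within the budget.

    `utterances` is a list of (utterance_id, duration_in_seconds) pairs.
    Returns the chosen IDs and their total duration in hours.
    """
    budget_seconds = budget_hours * 3600.0
    shuffled = list(utterances)
    random.Random(seed).shuffle(shuffled)

    chosen, total = [], 0.0
    for utt_id, duration in shuffled:
        if total + duration > budget_seconds:
            continue  # this clip would overshoot; a shorter one may still fit
        chosen.append(utt_id)
        total += duration
    return chosen, total / 3600.0


# Toy pool: 1000 hypothetical clips of 6 minutes each (100 hours in total).
pool = [(f"utt{i}", 360.0) for i in range(1000)]
subset, hours = sample_subset(pool, budget_hours=50)
print(len(subset), hours)
```

Fixing the seed keeps the subset reproducible across runs, which matters when several models are compared on the same low-resource split.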
| | Ground truth: Source (Spanish) | Ground truth: Target (English) | Prediction: S2UT_Base | Prediction: S2UT_LR | Prediction: S2UT_Aug |
|---|---|---|---|---|---|
| Sample 1: All systems do well | | | | | |
| Reference: | cada uno de ellos es un derecho exclusivo sujeto a ciertas limitaciones y excepciones | EACH ONE OF THEM IS AN EXCLUSIVE RIGHT SUBJECT TO CERTAIN LIMITATIONS AND EXCEPTIONS | | | |
| ASR: | | | EACH OF THEM IS AN EXCLUSIVE RIGHT SUBJECT TO CERTAIN LIMITATIONS AND EXCEPTIONS | EACH ONE OF THEM IS AN EXCLUSIVE RIGHT SUBJECT TO CERTAIN LIMITATIONS AND EXCEPTIONS | EACH OF THEM IS AN EXCLUSIVE RIGHT SUBJECT TO CERTAIN LIMITATIONS AND EXCEPTIONS |
| Sample 2: S2UT_LR performs best | | | | | |
| Reference: | esta experiencia representa un paso trascendental en la historia espacial del país | THIS EXPERIENCE REPRESENTS A TRANSCENDENTAL STEP IN THE SPATIAL HISTORY OF THE COUNTRY | | | |
| ASR: | | | THIS EXPERIENCE REPRESENTS A TRANSCENDENT STEP IN THE SPACE HISTORY OF THE COUNTRY | THIS EXPERIENCE REPRESENTS A TRANSCENDENTAL STEP IN THE SPATIAL HISTORY OF THE COUNTRY | THIS EXPERIENCE REPRESENTS A MOVEMENT STEP IN THE SPACE HISTORY OF THE COUNTRY |
| Sample 3: S2UT_Aug performs best | | | | | |
| Reference: | desde la perspectiva del balance físico químico y biológico está en una posición clave | THE PERSPECTIVE OF PHYSICAL CHEMICAL AND BIOLOGICAL BALANCE IT IS IN A KEY POSITION | | | |
| ASR: | | | FROM A PHYSICAL CHEMICAL AND BIOLOGICAL BALANCE HE IS IN A KEY POSITION | FROM A PHYSICAL PERSPECTIVE OF PHYSICAL CHEMICAL AND BIOLOGICAL POSITION | FROM THE PERSPECTIVE OF PHYSICAL CHEMICAL AND BIOLOGICAL BALANCE IT IS IN A KEY POSITION |
| Sample 4: S2UT_Aug performs best | | | | | |
| Reference: | desde un punto de vista presupuestario no parece adecuada la propuesta de financiación procedente de la comisión de desarrollo ya que este dinero no existe al | IN ANY CASE GIVEN THAT THE FINANCING OF THIS NEW COOPERATION INSTRUMENT MUST BE COMPATIBLE WITH THE TWO THOUSAND SEVEN TWENTY THIRTEEN FINANCIAL FRAMEWORK IT IS WORTH | | | |
| ASR: | | | IN ANY CASE GIVEN THAT THE FUNDING OF THIS NEW CORPORATION INSTRUMENT MUST BE COMPATIBLE WITH THE TWO THOUSAND SEVEN TWENTY THIRTEEN FINANCIAL FRAMEWORK IT IS IMPORTANT | IN ANY CASE SINCE THE FINANCING OF THIS NEW INSTRUMENT OF CORPORATION MUST COMPATIBLE WITH THE FINANCIAL FRAMEWORK FOR TWENTY THIRTEEN | IN ANY CASE GIVEN THAT THE FINANCING OF THIS NEW CORPORATION INSTRUMENT MUST BE COMPATIBLE WITH THE TWO THOUSAND SEVEN TWENTY THIRTEEN FINANCIAL FRAMEWORK |
We provide ground truth source and target audio with the corresponding reference text,
as well as audio samples from three systems:
(1) S2UT+LNA-D: the proposed direct speech-to-unit translation
system, initialized with a wav2vec 2.0 encoder and a unit mBART decoder and finetuned using the LNA-D strategy.
(2) Supervised S2UT: a baseline direct speech-to-unit translation system trained with
source and target text as auxiliary task targets.
(3) S2T+TTS: a baseline cascaded system with a speech-to-text translation model initialized
with a wav2vec 2.0 encoder and a randomly initialized decoder, followed by a text-to-speech synthesis model.
Both (1) and (2) use an open-source HiFi-GAN vocoder to convert units to waveforms.
| | Ground truth: Source (English) | Ground truth: Target (Spanish) | Prediction: S2UT+LNA-D | Prediction: Supervised S2UT | Prediction: S2T+TTS |
|---|---|---|---|---|---|
| Sample 1: S2UT+LNA-D performs the best. | | | | | |
| Reference: | this should also be an important part of our approach to the twenty twelve budget | esto también debería ser una parte importante de nuestro enfoque del presupuesto dos mil doce | | | |
| ASR: | | | esto también debería ser una parte importante de nuestro enfoque al presupuesto dos mil doce | también debería ser una parte importante de nuestro enfoque al presupuesto dos mil doce | esto también debería ser una parte importante de nuestro enfoque al presupuesto de dos mildos mil dos mil doce |
| Sample 2: S2UT+LNA-D performs the best. | | | | | |
| Reference: | information encourages citizens interest in public matters and their participation | la información fomenta el interés de los ciudadanos por los asuntos públicos y su participación | | | |
| ASR: | | | la información fomenta el interés de los ciudadanos en asuntos públicos y su participación | la información y el interés de los ciudadanos alientan los intereses de las cuestiones públicas y su participación | la información alienta el interés de los ciudadanos en asuntos públicos y en su participación |
| Sample 3: S2UT+LNA-D performs the best. | | | | | |
| Reference: | his family who are my constituents are convinced of his innocence | su familia que son mis electores está convencida de su inocencia | | | |
| ASR: | | | su familia que son mis electores está convencida de su inocencia | su familia que son mí circunscripciones están convencidas de estos inocentes | su familia que son mis electores están convencidos de su inocencia |
| Sample 4: All systems make errors. | | | | | |
| Reference: | of the directive on all taxes including social security contributions the automatic exchange of information and improved cooperation between the member states in matters of taxation | de la directiva a todos los impuestos incluidas las contribuciones a la seguridad social el intercambio automático de información y la mejora de la cooperación fiscal entre los estados miembros | | | |
| ASR: | | | de la directiva a todos los impuestos incluidas las contribuciones a la seguridad social el intercambio automático de información y la mejor cooperación entre los estados miembros en las cuestiones de impuestos | de la directiva a todos los impuestos impluyendo las contribuciones de seguridad social el intercambio automático de la información y mejorar la cooperación entre los estados miembros y las cuestiones de impuestos | de la directiva para todos los impuestos incluidos las contribuciones de seguridad social el intercambio automático de información y la mejor cooperación entre los estados miembros en la cuestión de la fiscalidad |
We provide ground truth source and target audio with the corresponding reference text,
as well as audio samples from three systems. All three models are initialized with a wav2vec 2.0 encoder
and a unit mBART decoder and finetuned using the LNA-D strategy, but use different datasets for finetuning:
(1) S2UT_Base: finetuned on the combination of the Europarl-ST and MuST-C datasets.
(2) S2UT_LR: finetuned in a low-resource setup with 50 hours of data sampled from the combination
of the Europarl-ST and MuST-C datasets.
(3) S2UT_Aug: finetuned on the combination of the Europarl-ST and MuST-C datasets plus the ASR
data.
All models use an open-source HiFi-GAN vocoder to convert units to waveforms.
| | Ground truth: Source (English) | Ground truth: Target (Spanish) | Prediction: S2UT_Base | Prediction: S2UT_LR | Prediction: S2UT_Aug |
|---|---|---|---|---|---|
| Sample 1: All systems do well. | | | | | |
| Reference: | we want to see energy poverty as a part of this debate | queremos ver la pobreza energética como parte de este debate | | | |
| ASR: | | | queremos ver la pobreza energética como parte de este deate | queremos ver la pobreza energética como parte de este date | queremos ver la pobreza energética como parte de este deate |
| Sample 2: S2UT_LR has errors but S2UT_Base and S2UT_Aug got it right. | | | | | |
| Reference: | in my view one of the most important elements is the follow up of legislative initiative requests from parliament | en mi opinión uno de los elementos más importantes es el seguimiento de las solicitudes de iniciativa legislativa del parlamento | | | |
| ASR: | | | n mi opinión uno de los elementos más importantes es el seguimiento de las peticiones de la iniciativa legislativa por parte del pagamento | en mi opinión uno de los elementos más importantes es el seguimiento de las emiendas de iniciativas legislativas de ley | en mi opinión uno de los elementos más importantes es el seguimiento de las solicitudes de iniciativa legislativa del pagamento |
| Sample 3: S2UT_Aug performs the best | | | | | |
| Reference: | we must find an open and constructive procedure on the next financial framework | debemos encontrar un procedimiento abierto y constructivo en el próximo marco financiero | | | |
| ASR: | | | debemos encontrar un procedimiento abierto y constructivo sobre el próximo marco financiero | debemos encontrar un procedimiento abierto y constructivo en el sistema financiero financiero financiero financiero | debemos encontrar un procedimiento abierto y constructivo en el próximo marco financiero |
| Sample 4: All systems make errors | | | | | |
| Reference: | of the directive on all taxes including social security contributions the automatic exchange of information and improved cooperation between the member states in matters of taxation | de la directiva a todos los impuestos incluidas las contribuciones a la seguridad social el intercambio automático de información y la mejora de la cooperación fiscal entre los estados miembros | | | |
| ASR: | | | de la directiva a todos los impuestos incluidas las contribuciones a la seguridad social el intercambio automático de información y la mejor cooperación entre los estados miembros en las cuestiones de impuestos | la directiva sobre el impuesto de todos los contribuyentes inpluyendo las contribuciones sociales la introducción automática y mejorada de los estados miembros y mejorar la cooperación entre los estados miembros | de la directiva a todos los impuestos incluidas las contribuciones a la seguridad social el intercambio automático de información y la mejor cooperación entre los estados miembros en materia de impuestos |