Until now, AI translation has mainly focused on written languages. Yet nearly half of the world’s 7,000+ living languages are primarily oral and do not have a standard or widely used writing system. This makes it impossible to build machine translation tools using standard techniques, which require large amounts of written text in order to train an AI model. To address this challenge, we've built the first AI-powered translation system for a primarily oral language, Hokkien. Hokkien is widely spoken within the Chinese diaspora but lacks a standard written form. Our technology allows Hokkien speakers to hold conversations with English speakers.
The open-sourced translation system is part of Meta’s Universal Speech Translator (UST) project, which is developing new AI methods that we hope will eventually allow real-time speech-to-speech translation across all extant languages, even primarily spoken ones. We believe spoken communication can help break down barriers and bring people together wherever they are, even in the metaverse.
To develop this new speech-only translation system, Meta’s AI researchers had to overcome many of the challenges that traditional machine translation systems face, including data gathering, model design, and evaluation. We have much work ahead to extend UST to more languages. But the ability to speak effortlessly to people in any language is a long-sought dream, and we’re pleased to be one step closer to achieving it. We’re open-sourcing not just our Hokkien translation models but also the evaluation datasets and research papers, so that others can reproduce and build on our work.
Collecting sufficient data was a significant obstacle we faced when setting out to build a Hokkien translation system. Hokkien is what’s known as a low-resource language, which means there isn’t an ample supply of training data readily available for the language, compared with, say, Spanish or English. In addition, there are relatively few human English-to-Hokkien translators, making it difficult to collect and annotate data to train the model.
We leveraged Mandarin as an intermediate language to create both pseudo-labels and human translations: we first translated English (or Hokkien) speech to Mandarin text, then translated that text to Hokkien (or English) and added the result to the training data. This method greatly improved model performance by leveraging data from a similar high-resource language.
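To make the pivoting concrete, here is a minimal sketch of how pseudo-labeled pairs could be produced. The two translator callables (`en_speech_to_zh_text`, `zh_text_to_hokkien_speech`) are hypothetical stand-ins for trained models, not APIs from Meta's released code.

```python
# Minimal sketch of pivot-based pseudo-labeling through Mandarin.
# The two translator callables are hypothetical stand-ins, not Meta's released APIs.

def pseudo_label(english_clips, en_speech_to_zh_text, zh_text_to_hokkien_speech):
    """Turn unlabeled English speech into (English speech, Hokkien speech) training pairs."""
    pairs = []
    for clip in english_clips:
        mandarin_text = en_speech_to_zh_text(clip)                 # pivot step 1: English speech -> Mandarin text
        hokkien_speech = zh_text_to_hokkien_speech(mandarin_text)  # pivot step 2: Mandarin text -> Hokkien speech
        pairs.append((clip, hokkien_speech))                       # pseudo-labeled pair added to training data
    return pairs
```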
Speech mining is another approach to generating training data. Using a pretrained speech encoder, we were able to encode Hokkien speech into embeddings that share a semantic space with other languages, without requiring Hokkien to have a written form. Hokkien speech can then be aligned with English speech and text whose semantic embeddings are similar. Finally, we synthesized English speech from the matched texts, yielding parallel Hokkien and English speech.
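The alignment step can be illustrated with a small sketch: given Hokkien speech embeddings and English embeddings produced in a shared semantic space, candidate pairs are kept when their cosine similarity is high. The threshold and the simple nearest-neighbor matching below are illustrative assumptions; the actual mining pipeline is more elaborate.

```python
import numpy as np

def mine_parallel_pairs(hokkien_embs, english_embs, threshold=0.8):
    """Pair Hokkien clips with the English candidates whose embeddings are most similar.

    hokkien_embs: (N, d) embeddings of Hokkien speech from a pretrained speech encoder.
    english_embs: (M, d) embeddings of English speech or text in the same semantic space.
    Returns (hokkien_index, english_index, similarity) triples above the threshold.
    """
    h = hokkien_embs / np.linalg.norm(hokkien_embs, axis=1, keepdims=True)  # L2-normalize
    e = english_embs / np.linalg.norm(english_embs, axis=1, keepdims=True)
    sims = h @ e.T                                   # (N, M) cosine similarities
    best = sims.argmax(axis=1)                       # nearest English candidate per Hokkien clip
    return [(i, j, float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= threshold]
```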
Many speech translation systems rely on transcriptions or are speech-to-text systems. However, since primarily oral languages do not have standard written forms, producing transcribed text as the translation output doesn’t work. Thus, we focused on speech-to-speech translation.
We used speech-to-unit translation (S2UT) to translate input speech directly into a sequence of acoustic units, following a path Meta previously pioneered, and then generated waveforms from the units. In addition, we adopted UnitY for a two-pass decoding mechanism, in which the first-pass decoder generates text in a related language (Mandarin) and the second-pass decoder creates the units.
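The two-pass flow can be summarized in a short sketch. The encoder, decoders, and vocoder below are hypothetical callables standing in for the trained model components, not the released implementation.

```python
# Sketch of UnitY-style two-pass decoding for speech-to-speech translation.
# All components are hypothetical callables, not APIs from Meta's released code.

def translate_speech_to_speech(source_audio, encoder, text_decoder, unit_decoder, vocoder):
    encoder_states = encoder(source_audio)               # encode source speech
    mandarin_text = text_decoder(encoder_states)         # first pass: text in a related language
    units = unit_decoder(encoder_states, mandarin_text)  # second pass: discrete acoustic units
    return vocoder(units)                                # unit-based vocoder -> target waveform
```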
Speech translation systems are usually evaluated using a metric called ASR-BLEU, which involves first transcribing the translated speech into text using automatic speech recognition (ASR), and then computing BLEU scores (a standard machine translation metric) by comparing the transcribed text with a human-translated text. However, one of the challenges of evaluating speech translations for an oral language such as Hokkien is that there is no standard writing system. In order to enable automatic evaluation, we developed a system that transcribes Hokkien speech into a standardized phonetic notation called Tâi-lô. This technique enabled us to compute a BLEU score at the syllable level and easily compare the translation quality of different approaches.
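As an illustration, syllable-level BLEU over Tâi-lô transcriptions can be computed by splitting each transcription into syllables (Tâi-lô joins the syllables of a word with hyphens) and scoring the result with a standard BLEU implementation such as sacrebleu. The ASR step that produces the transcriptions is assumed to run upstream; this is a minimal sketch, not our exact evaluation script.

```python
import sacrebleu

def syllable_level_bleu(hyp_tailo, ref_tailo):
    """BLEU over Tâi-lô transcriptions, scored at the syllable level.

    Each input is a list of Tâi-lô strings; splitting on hyphens and whitespace
    yields one token per syllable, so corpus BLEU becomes syllable-level BLEU.
    """
    def syllables(line):
        return " ".join(line.replace("-", " ").split())

    hyps = [syllables(h) for h in hyp_tailo]
    refs = [syllables(r) for r in ref_tailo]
    return sacrebleu.corpus_bleu(hyps, [refs], tokenize="none").score
```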
In addition to developing a method for evaluating Hokkien-English speech translations, we also created the first Hokkien-English bidirectional speech-to-speech translation benchmark dataset based on a Hokkien speech corpus called Taiwanese Across Taiwan. This benchmark dataset will be open-sourced to encourage other researchers to work on Hokkien speech translation and together make further progress in the field.
In its current phase, our approach allows someone who speaks Hokkien to converse with someone who speaks English. While the model is still a work in progress and can translate only one full sentence at a time, it’s a step toward a future where simultaneous translation between languages is possible.
The techniques we pioneered with Hokkien can be extended to many other written and unwritten languages. To that end, we are releasing SpeechMatrix, a large corpus of speech-to-speech translations mined with LASER, Meta’s innovative data mining technique. SpeechMatrix will enable researchers to create their own speech-to-speech translation (S2ST) systems and build on our work.
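For readers who want to experiment with mined corpora like SpeechMatrix, the scoring commonly used with LASER-style embeddings is the margin criterion, sketched below. This is a generic illustration of margin-based scoring, not the exact SpeechMatrix mining code.

```python
import numpy as np

def margin_scores(src_embs, tgt_embs, k=4):
    """Ratio-margin scores between two sets of LASER-style embeddings.

    Raw cosine similarity tends to over-select "hub" sentences; dividing by the
    average similarity to each side's k nearest neighbors corrects for that.
    """
    s = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    t = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = s @ t.T                                         # (N, M) cosine similarities
    src_avg = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # avg sim to k nearest targets, per source
    tgt_avg = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # avg sim to k nearest sources, per target
    return sims / ((src_avg[:, None] + tgt_avg[None, :]) / 2.0)  # higher = better candidate pair
```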
Meta’s recent advances in unsupervised speech recognition (wav2vec-U) and unsupervised machine translation (mBART) will inform future work in translating more spoken languages. Our progress in unsupervised learning demonstrates the feasibility of building high-quality speech-to-speech translation models without any human annotations, which significantly lowers the barrier to expanding coverage to low-resource languages, many of which have no labeled data at all.
AI research is helping break down language barriers in both the real world and the metaverse, with the goal of encouraging connection and mutual understanding. We’re excited to expand our research and bring this technology to more people in the future.
This work is being undertaken by a multidisciplinary team that includes Al Youngblood, Ana Paula Kirschner Mofarrej, Andy Chung, Angela Fan, Ann Lee, Benjamin Peloquin, Benoît Sagot, Brian Bui, Brian O’Horo, Carleigh Wood, Changhan Wang, Chloe Meyere, Chris Summers, Christopher Johnson, David Wu, Diana Otero, Eric Kaplan, Ethan Ye, Gopika Jhala, Gustavo Gandia Rivera, Hirofumi Inaguma, Holger Schwenk, Hongyu Gong, Ilia Kulikov, Iska Saric, Janice Lam, Jeff Wang, Jingfei Du, Juan Pino, Julia Vargas, Justine Kao, Karla Caraballo-Torres, Kevin Tran, Koklioong Loa, Lachlan Mackenzie, Michael Auli, Michael Friedrichs, Natalie Hereth, Ning Dong, Oliver Libaw, Orialis Valentin, Paden Tomasello, Paul-Ambroise Duquenne, Peng-Jen Chen, Pengwei Li, Robert Lee, Safiyyah Saleem, Sascha Brodsky, Semarley Jarrett, Sravya Popuri, TJ Krusinski, Vedanuj Goswami, Wei-Ning Hsu, Xutai Ma, Yilin Yang, and Yun Tang.