Most of the translation tools in common use today focus on written, high-resource languages, for which plenty of translation data is available. This approach, however, does not allow for the accurate translation of low-resource languages (those with little to no data available) or of oral languages, for which no corpora of written text exist, for example because they lack a standardized writing system.
While it might not affect us directly, it is still important to work on these problems, since around 20 percent of the world’s population does not speak any of the languages covered by the usual translation models. This means, for instance, that people from these low-resource communities cannot easily access information online. Take Wikipedia: Lingala, spoken by over 20 million people in Africa, has around 4,000 articles, far fewer than the 2.5 million available in Swedish, a language spoken by around 10 million people. An effective translation tool for Lingala would let its speakers read as much information as Swedish speakers can. Building an accurate and inclusive translator is therefore very important.
Already back in February, Meta announced that it wanted to tackle this problem through two new projects: the first is NLLB (No Language Left Behind), whose goal is to build an AI translator that also covers languages with limited translation data; the second is UST (Universal Speech Translator), which aims to translate between languages speech-to-speech in real time.
NLLB (No Language Left Behind)
Let’s focus on the first project. The first big achievement came in July 2022, when Meta announced a single model able to translate between 200 languages, many of them low-resource.
The main problem they had to face was, as expected, the lack of data. In particular, they relied on three types of data:
- Parallel translation data already in circulation, which includes sources such as Bible translations.
- A human-curated dataset they built themselves, called NLLB-Seed.
- Monolingual data used for bitext mining, that is, for identifying sentences in different languages that are likely translations of each other.
To make the best use of this data they used LASER3, a toolkit that identifies sentences with similar representations across languages and estimates how likely they are to share the same meaning. In particular, the tool embeds sentences from all languages into a single multilingual representation space; by computing the distance between a source sentence and a candidate translation in that space, you can then check whether the pairing is plausible.
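To get a feel for how this distance check works, here is a minimal bitext-mining sketch using a multilingual sentence encoder. It uses LaBSE from the sentence-transformers library as a stand-in encoder (Meta's pipeline relies on LASER3, distributed with their LASER repository), and the sentences and threshold are invented for illustration.

```python
# Minimal bitext-mining sketch: embed sentences from two languages into one
# multilingual space and keep pairs whose embeddings are close enough.
# LaBSE is used as a stand-in encoder here; Meta's pipeline relies on LASER3.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")

english = ["The weather is nice today.", "Where is the train station?"]
italian = ["Dove si trova la stazione dei treni?", "Oggi il tempo è bello."]

emb_en = encoder.encode(english, convert_to_tensor=True, normalize_embeddings=True)
emb_it = encoder.encode(italian, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between every English/Italian pair: the higher the score,
# the more likely the two sentences are mutual translations.
scores = util.cos_sim(emb_en, emb_it)

threshold = 0.7  # arbitrary cut-off for this sketch
for i, src in enumerate(english):
    j = int(scores[i].argmax())
    if scores[i][j] >= threshold:
        print(f"{src!r}  <->  {italian[j]!r}  (score={scores[i][j].item():.2f})")
```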
Moreover, with this toolkit it is relatively easy to transfer what was learned on some languages to others, which means that a language can be added to the multilingual space even when little data is available for it.
The other important challenge was fitting 200 languages into a single model without letting it overfit to the high-resource ones, so they tried to ensure that low-resource languages received a fair share of the model’s capacity despite their much smaller amount of training data.
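One standard way to achieve this kind of balancing is temperature-based sampling of the training data. The sketch below only illustrates the general idea, with invented corpus sizes and temperature; it is not necessarily the exact recipe used for NLLB.

```python
# Toy illustration of temperature-based sampling: rarer language pairs are
# sampled more often than their raw share of the data would suggest.
# The corpus sizes and the temperature value are invented.
import numpy as np

sentence_counts = {
    "eng-swe": 2_500_000,  # high-resource pair
    "eng-hau": 80_000,
    "eng-lin": 15_000,     # low-resource pair (Lingala)
}

T = 3.0  # T = 1 reproduces the raw proportions; larger T flattens them
counts = np.array(list(sentence_counts.values()), dtype=float)
probs = counts ** (1.0 / T)
probs /= probs.sum()

for pair, raw, p in zip(sentence_counts, counts / counts.sum(), probs):
    print(f"{pair}: raw share {raw:.3f} -> sampling probability {p:.3f}")
```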
The results were very good and the translations were of high quality. Compared with other translation tools, quality improved by 44 percent on average across all languages, with gains of over 70 percent for some African and Indian languages. They also expanded the resources available for evaluating translation tools by developing FLORES-200, a benchmark dataset that covers more than 200 languages.
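To give an idea of what working with FLORES-200 looks like, here is a small sketch that reads a pair of its dev files, assuming they have been downloaded locally as plain text with one sentence per line; the directory and file names are placeholders.

```python
# Sketch of reading a FLORES-200 style dev set from local plain-text files.
# The directory and file names below are placeholders for this illustration.
from pathlib import Path

data_dir = Path("flores200_dataset/dev")  # assumed local path
eng = (data_dir / "eng_Latn.dev").read_text(encoding="utf-8").splitlines()
lin = (data_dir / "lin_Latn.dev").read_text(encoding="utf-8").splitlines()

# FLORES-200 is n-way parallel: line i in every file is the same sentence.
for en, ln in list(zip(eng, lin))[:3]:
    print(f"EN:  {en}")
    print(f"LIN: {ln}\n")
```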
UST (Universal Speech Translator)
As for the second project, last week Meta announced a new model that translates a primarily oral language, Hokkien, into English. Hokkien is a variety of Chinese widely spoken in southeastern China (around 45 million people speak it) and is one of many oral languages without an official writing system.
Typical models cannot be used to translate Hokkien, because they rely on transcription: they first convert speech into text, then translate that text into the target language, and finally convert the result back into speech. Since Hokkien has no standard written form, producing transcripts does not work, so you have to focus directly on speech-to-speech translation.
As with NLLB, the first problem they had to face was the lack of data. To solve it, they used Mandarin as an intermediate language: they first translated English (or Hokkien) speech into Mandarin text, and then translated that into Hokkien (or English). In doing so, they exploited the far larger amount of resources and translations available for Mandarin.
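Conceptually, the pivoting step composes like the sketch below. The helper functions are hypothetical placeholders, not a real API; they stand in for the separate recognition, translation, and synthesis models trained on Mandarin resources.

```python
# Conceptual sketch of the Mandarin pivot used to create training data.
# The helper functions below are hypothetical placeholders, not a real API.

def english_speech_to_mandarin_text(audio: bytes) -> str:
    """Hypothetical step: recognize English speech and translate it into Mandarin text."""
    raise NotImplementedError

def mandarin_text_to_hokkien_speech(text: str) -> bytes:
    """Hypothetical step: translate Mandarin text into Hokkien and synthesize speech."""
    raise NotImplementedError

def make_training_pair(english_audio: bytes) -> tuple[bytes, bytes]:
    """Build an (English speech, Hokkien speech) pair by pivoting through Mandarin."""
    mandarin_text = english_speech_to_mandarin_text(english_audio)
    hokkien_audio = mandarin_text_to_hokkien_speech(mandarin_text)
    return english_audio, hokkien_audio
```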
They also used audio mining: Hokkien speech was embedded into the same representation space as the other languages, even without a written form, so that spoken Hokkien could be matched with similar text and speech content in other languages.
For the translation itself, they used several methods, including a recently developed technique called S2UT (speech-to-unit translation), which converts input speech into a sequence of discrete acoustic units; from these units, the waveforms corresponding to the translated speech are then generated.
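To make the idea of "discrete acoustic units" concrete, the following sketch derives units from speech by quantizing self-supervised HuBERT features with k-means, which is the general recipe behind unit-based speech translation. The audio file name and the number of clusters are assumptions for this illustration, not Meta's released pipeline.

```python
# Sketch of "discrete acoustic units": encode speech with a self-supervised
# model (HuBERT) and quantize the frame-level features with k-means, so an
# utterance becomes a sequence of integer units that a translation model can
# predict and a vocoder can turn back into a waveform.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")  # assumed input file
waveform = waveform.mean(dim=0, keepdim=True)    # force mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    features, _ = model.extract_features(waveform)
frames = features[-1].squeeze(0)  # (num_frames, feature_dim)

# Quantize each frame into one of 100 units. Real systems fit the k-means
# codebook once on a large corpus and then reuse it for every utterance.
units = KMeans(n_clusters=100, n_init="auto").fit_predict(frames.numpy())
print(units[:20])  # e.g. [37 37 12 12 84 ...]: the unit sequence
```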
To evaluate the results, they transcribed the audio into Tâi-lô, a standardized phonetic notation for Hokkien, which allowed them to assess translation quality. Moreover, they created the first dataset for speech-to-speech translation from Hokkien to English.
For now, the model can only translate one sentence at a time, but it is still an impressive achievement.
What are the effects of such technologies?
We have to be careful when using translation tools, especially if they are not high quality, since we might produce content that could be harmful. The most widely used metric for assessing a model’s performance is BLEU, which compares the model’s translation against a set of good-quality reference translations. However, this method is flawed in several ways:
- A correctly translated sentence can still receive a low score, depending on the reference data used for the assessment.
- It doesn’t take into account how severe an error is. For instance, if the reference contains the word “cat” but the model outputs “kitten”, the error is penalized exactly as if the model had output “computer” (see the sketch after this list). So BLEU is not an absolute measure of a model’s correctness.
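Here is a small sketch of that behaviour using the sacrebleu package; the reference and the two hypotheses are invented, and both wrong words end up with exactly the same score.

```python
# BLEU treats any token mismatch the same way: "kitten" and "computer" are
# equally wrong against a reference containing "cat". Scores are on the
# usual 0-100 scale.
import sacrebleu

references = [["The cat is sleeping on the sofa."]]

for hypothesis in ["The kitten is sleeping on the sofa.",
                   "The computer is sleeping on the sofa."]:
    score = sacrebleu.corpus_bleu([hypothesis], references)
    print(f"{hypothesis!r} -> BLEU = {score.score:.1f}")
# Both hypotheses receive the same score, even though one error is far less severe.
```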
Considering the possible errors of translations, and their consequences, is very important when these tools are used on platforms such as Facebook and Instagram. There have already been cases where inaccurate translations led to unwanted consequences. For instance, in 2017 a Palestinian man was arrested by Israeli police after Facebook’s translation software mistranslated “good morning” as “hurt them/attack them” in a post he shared. To mitigate this problem, for the NLLB model Meta created toxicity lists for all 200 languages, which are used to filter out unwanted toxic content that appears during translation.
Still, the benefits of such technologies are enormous. Better machine translation could mean more people interacting with each other and sharing ideas and knowledge, breaking language barriers both in the real world and on the Internet. For instance, people who only speak a dialect could access the same amount of information as those who speak English.
These models are already starting to be used. For instance, a partnership with the Wikimedia Foundation has put NLLB in the hands of the editors who help translate the millions of articles on Wikipedia.
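If you want to try NLLB-200 yourself, Meta has published checkpoints on the Hugging Face Hub. A minimal sketch, assuming the transformers library and the distilled 600M checkpoint, might look like this; the example sentence and the Lingala target code are just for illustration.

```python
# Translating a sentence with an NLLB-200 checkpoint from the Hugging Face Hub.
# Language codes follow the FLORES-200 convention, e.g. "eng_Latn", "lin_Latn".
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Machine translation breaks down language barriers.",
                   return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("lin_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```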