
The State of Multilingual AI

This post takes a closer look at the state of multilingual AI. How multilingual are current models in NLP, computer vision, and speech? What are the main recent contributions in this area? What challenges remain and how can we address them?
Sebastian Ruder
14 Nov 2022
• 36 min read
Models that allow interaction via natural language have become ubiquitous. Research models such as BERT and T5 have become much more accessible while the latest generation of language and multi-modal models are demonstrating increasingly powerful capabilities. At the same time, a wave of NLP startups has started to put this technology to practical use.
While such language technology may be hugely impactful, recent models have mostly focused on English and a handful of other languages with large amounts of resources. Developing models that work for more languages is important in order to offset the existing language divide and to ensure that speakers of non-English languages are not left behind, among many other reasons.
This post takes a closer look at how the AI community is faring in this endeavour. I will be focusing on topics related to natural language processing (NLP) and African languages as these are the domains I am most familiar with. I've tried to cover as many contributions as possible but undoubtedly missed relevant work. Feel free to leave a comment or reach out with a pointer to anything I missed.
This post is partially based on a keynote I gave at the Deep Learning Indaba 2022. It covers the following high-level topics:
Status Quo
Recent Progress
Challenges and Opportunities
Status Quo
There are around 7,000 languages spoken around the world. Around 400 languages have more than 1M speakers and around 1,200 languages have more than 100k [1] . Bender [2] highlighted the need for language independence in 2011. Reviewing papers published at ACL 2008, she found that 63% of all papers focused on English. For a recent study [3] , we similarly reviewed papers from ACL 2021 and found that almost 70% of papers only evaluate on English. 10 years on, little thus seems to have changed.
Many languages in Africa, Asia, and the Americas that are spoken by tens of millions of people have received little research attention [1:1] [4] . Continents such as Africa with around 2,000 languages or individual countries such as Indonesia with around 700 languages are incredibly linguistically diverse and at the same time dramatically underserved by current research and technology.
Beyond individual languages, researchers with affiliations in countries where such languages are spoken are similarly under-represented in both ML and NLP communities. For instance, while we can observe a slight upward trend in the number of authors affiliated with African universities publishing at top machine learning (ML) and NLP venues, this increase pales compared to the thousands of authors from other regions publishing in such venues every year.
Representation of African NLP Researchers in top ML and NLP venues. *: does not consider African authors working abroad. Data is based on: ml_nlp_paper_data by Marek Rei. NLP venues: ACL, CL, COLING, CoNLL, EACL, EMNLP, NAACL, TACL; ML venues: AAAI, ICLR, ICML, NeurIPS.
Current state-of-the-art models in many ML domains are mainly based on two ingredients: 1) large, scalable architectures (often based on the Transformer [5] ) and 2) transfer learning [6] . Given the general nature of these models, they can be applied to various types of data including images [7] , video [8] , and audio [9] . Some of the most successful models in recent NLP are BERT [10] , RoBERTa [11] , BART [12] , T5 [13] , and DeBERTa [14] , which have been trained on billions of tokens of online text using variants of masked language modeling in English. In speech, wav2vec 2.0 [15] has been pre-trained on large amounts of unlabeled speech.
Multilingual models   These models have multilingual analogues—in NLP, models such as mBERT , RemBERT [16] , XLM-RoBERTa [17] , mBART [18] , mT5 [19] , and mDeBERTa [14:1] —that were trained in a similar fashion, predicting randomly masked tokens on data of around 100 languages. Compared to their monolingual counterparts, these multilingual models require a much larger vocabulary to represent tokens in many languages.
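To make this concrete, a single multilingual checkpoint can fill in masked tokens across many languages. Below is a minimal sketch using XLM-RoBERTa through the Hugging Face transformers library; the example sentences are my own placeholders, not from any of the cited papers.

```python
# Minimal sketch: masked-token prediction with a multilingual masked LM.
# Assumes the Hugging Face `transformers` library; the example sentences
# are illustrative placeholders.
from transformers import pipeline

# XLM-RoBERTa was pre-trained with masked language modeling on ~100 languages.
unmasker = pipeline("fill-mask", model="xlm-roberta-base")

# The same model handles masked tokens in different languages.
for sentence in [
    "The capital of France is <mask>.",   # English
    "Mji mkuu wa Kenya ni <mask>.",       # Swahili
    "Ibu kota Indonesia adalah <mask>.",  # Indonesian
]:
    predictions = unmasker(sentence, top_k=3)
    print([p["token_str"] for p in predictions])
```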
A number of factors have been found to be important for learning robust multilingual representations, including shared tokens [20] , subword fertility [21] , and word embedding alignment [22] . In speech, models such as XSLR [23] and UniSpeech [24] are pre-trained on large amounts of unlabeled data in 53 and 60 languages respectively.
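Of the factors above, subword fertility (the average number of subword pieces per word) is particularly easy to estimate for a given tokenizer. Here is a rough sketch with illustrative sentences rather than a proper corpus; in practice, fertility would be averaged over much larger samples per language.

```python
# Sketch: estimating subword fertility (subword tokens per word) for a
# multilingual tokenizer. Higher fertility means words are split into more
# pieces, which tends to hurt under-represented languages.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Illustrative sentences; in practice fertility is averaged over a corpus.
samples = {
    "en": "The children are playing outside in the garden .",
    "fi": "Lapset leikkivät ulkona puutarhassa .",
    "sw": "Watoto wanacheza nje katika bustani .",
}

for lang, sentence in samples.items():
    words = sentence.split()
    n_subwords = sum(len(tokenizer.tokenize(w)) for w in words)
    print(f"{lang}: fertility = {n_subwords / len(words):.2f}")
```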
The curse of multilinguality   Why do these models only cover up to 100 languages? One reason is the 'curse of multilinguality' [17:1] . Similar to models that are trained on many tasks, the more languages a model is pre-trained on, the less model capacity is available to learn representations for each language. Increasing the size of a model ameliorates this to some extent, enabling the model to dedicate more capacity to each language [25] .
Lack of pre-training data   Another limiting factor is the availability of text data on the web, which is skewed towards languages spoken in Western countries and with a large online footprint. Pre-training corpora are thus dominated by the languages with the most online text, leaving languages with few resources under-represented. This is concerning as prior studies have shown that the amount of pre-training data in a language correlates with downstream performance for some tasks [26] [27] [28] . In particular, languages and scripts that were never seen during pre-training often lead to poor performance [29] [30] .
Amount of data in GiB (log-scale) for the 88 languages that appear in both Wikipedia and CommonCrawl ( Conneau et al., 2020 ).
Quality issues in existing multilingual resources   Even for languages where data is available, past work has shown that some commonly used multilingual resources have severe quality issues. Entity names in Wikidata are not in the native script for many under-represented languages while entity spans in WikiAnn [31] , a weakly supervised multilingual named entity recognition dataset based on Wikipedia, are often erroneous [32] .
Similarly, several automatically mined resources and automatically aligned corpora used for machine translation are problematic [33] . For instance, 44/65 audited languages in WikiMatrix [34] and 19/20 audited languages in CCAligned [35] contain less than 50% correct sentences. Overall, however, performance seems to be mostly constrained by the quantity rather than quality of data in under-represented languages [36] .
Multilingual evaluation results   We can get a better picture of the state of the art by looking at the performance of recent models on a representative multilingual benchmark such as XTREME [26:1] —a multilingual counterpart to GLUE [37] and SuperGLUE [38] —which aggregates performance across 9 tasks and 40 languages. Starting with the first multilingual pre-trained models two and a half years ago, performance has improved steadily and is slowly approaching human-level performance on the benchmark.
Performance of models on the XTREME leaderboard on 9 tasks and 40 languages.
However, looking at the average performance on such a benchmark obscures which languages a model was actually evaluated on. Beyond a few datasets with large language coverage—Universal Dependencies [39] , WikiAnn [31:1] , and Tatoeba [40] —other tasks only cover a few languages, again skewed towards languages with more resources. Most current benchmarks thus only provide a distorted view of the overall progress towards multilingual AI for the world's languages.
For a more accurate impression, we can look at the normalized state-of-the-art performance on different language technology tasks averaged across the world's languages either based on their speaker population (demographic utility) or equally (linguistic utility) [41] .
Linguistic and demographic global utility metrics for a number of language technology tasks. The red curve corresponds to the sequence where first the language with the largest number of users is set to utility 1, then the second, and so on (Blasi et al., 2022) .
Most NLP tasks fare better when we average based on speaker population. Overall, however, we observe very low linguistic utility numbers, showing an unequal performance distribution across the world's languages. This conclusion, however, may be somewhat overly pessimistic as it only considers languages for which evaluation data is currently available. Cross-lingual performance prediction [42] could be used to estimate performance for a broader set of languages.
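As a toy illustration of the two metrics, the snippet below contrasts an unweighted average over languages with a speaker-weighted one; the scores and speaker counts are made-up numbers, not results from the paper.

```python
# Sketch: linguistic vs. demographic utility, in the spirit of Blasi et al. (2022).
# Scores are normalized task performance in [0, 1]; languages with no system
# count as 0. All numbers below are made up for illustration.
def linguistic_utility(scores, all_languages):
    # Each language counts equally, whether or not a system exists for it.
    langs = list(all_languages)
    return sum(scores.get(lang, 0.0) for lang in langs) / len(langs)

def demographic_utility(scores, speakers):
    # Each language is weighted by its number of speakers.
    total = sum(speakers.values())
    return sum(speakers[lang] * scores.get(lang, 0.0) for lang in speakers) / total

speakers = {"en": 1_450e6, "hi": 600e6, "sw": 80e6, "yo": 45e6}  # illustrative
scores   = {"en": 0.92, "hi": 0.71, "sw": 0.35}                  # no system for "yo"

print(linguistic_utility(scores, speakers.keys()))   # treats all languages equally
print(demographic_utility(scores, speakers))         # dominated by large languages
```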
Multilingual vs English-centric models   Let us now take a step back and look at recent large language models in NLP in general. We can plot recent models based on the fraction of non-English data they are pre-trained on. Based on this characterization, we can observe two distinct streams of research: 1) multilingual models that are trained on multilingual data in many languages and 2) English-centric models that are trained on mostly English data.
The largest recent models are not becoming significantly more multilingual. Figure adapted from Noah Constant.
The latter form the foundation for the mainstream of NLP research and while these models have been getting larger, they have not been getting much more multilingual. An exception is BLOOM [43] , the largest multilingual open-source model to date. Some of these large models have demonstrated surprising multilingual capabilities. For instance, GPT-3 [44] and PaLM [45] can translate text between languages with large amounts of data. While such models have been shown to be capable of multilingual few-shot learning [46] [47] [48] , they perform best when prompts or input data are translated to English. They also perform poorly when translating between non-English language pairs or into languages with limited data. While PaLM is able to summarize non-English text into English, it struggles when generating text in other languages.
Similarly, recent speech models such as HuBERT [49] and WavLM [50] and recent large vision models that generate text based on an image such as Flamingo [51] or an image based on text such as DALL-E 2 [52] , Imagen [53] , and Parti [54] are English-centric. Exceptions are Whisper [55] and PaLI [56] , which are pre-trained on large amounts of weakly supervised data from the web for ASR and image captioning in 96 and 109 languages respectively. However, overall, for the latest generation of large models, multilinguality remains a side-effect rather than a key design criterion.
User-facing technologies   With regard to user-facing technologies, keyboards and spell checkers such as Gboard support more than 900 languages, but many languages still lack support or speakers may be unaware that a keyboard for their language is available [57] . Other user-facing technologies with broad language coverage are machine translation and automatic speech recognition (ASR). Google Translate and speech-to-text, for instance, support 133 and more than 125 languages respectively as of the publishing of this post.
Recent Progress
Recent progress in this area falls into two categories: 1) new groups, communities, support structures, and initiatives that have enabled broader work; and 2) high-level research contributions such as new datasets and models that allow others to build on them.
Research communities   There are many languages with active existing research communities dedicated to them. These include languages with large speaker populations such as Japanese, Mandarin, Turkish, and Hindi as well as languages with fewer speakers such as Gaelic or Basque [1:2] . There have also been concerted efforts in the past to collect data for specific under-represented languages such as Inuktitut [58] [59] .
In the last few years, various new communities have emerged specializing in under-represented languages or language families. These include groups focusing on linguistic regions such as Masakhane for African languages, AmericasNLP for native American languages, IndoNLP for Indonesian languages, GhanaNLP and HausaNLP , among others. Events such as the Deep Learning Indaba , IndabaX , Khipu , EEML , SEAMLS , and ALPS , among many others, and workshops such as AfricaNLP have enabled these communities to come together, complementing longer-running events such as the Arabic NLP , ComputEL , and SIGTYP workshops .
The Deep Learning Indaba 2022 in Tunisia.
At the same time, there are communities with broader focus areas such as ML Collective that have contributed to this space. One of the largest community-driven efforts in multilingual AI is BigScience , which has released BLOOM [43:1] . In many cases, projects in these communities have been participatory and highly collaborative [60] [61] , lowering the barrier to doing research and involving members of the community at every stage of the process.
Other communities such as Zindi or Data Science Nigeria have focused on hosting competitions and providing training courses while new programs such as the African Master's in Machine Intelligence seek to educate the next generation of AI researchers.
Initiatives   The Association for Computational Linguistics (ACL) has emphasized the importance of language diversity, with a special theme track at the main ACL 2022 conference on this topic. The ACL has also launched the 60-60 initiative , which aims to make scientific content more accessible by creating a) translations of the entire ACL anthology into 60 languages; b) cross-lingual subtitling and dubbing for all plenary talks in 10 languages; and c) a comprehensive standardized scientific and NLP terminology list in 60 languages. The latter resource and glossaries for African languages could help to facilitate the discussion of language technology in local languages.
Datasets   On the research side, there has been a flurry of new datasets covering a host of applications, from unlabeled speech and text corpora [62] , to language identification [63] , text classification [64] , sentiment analysis [65] , ASR , named entity recognition [61:1] , question answering [66] , and summarization [67] in a range of under-represented languages. New benchmarks seek to assess models on a broad set of tasks in Romanian [68] , Korean [69] , and Turkish [70] , in geographically related languages such as Indonesian [71] [72] or Indian languages [73] , and in different modalities such as speech [74] and image-grounded text [75] . The development of these datasets has been enabled by new funding structures and initiatives such as the Lacuna Fund and FAIR Forward that have incentivized work in this area.
Named entity annotations in African languages in MasakhaNER. PER, LOC, and DATE entities are in purple, orange, and green respectively (Adelani et al., 2021)
Other existing corpora have grown in their language coverage with community involvement: The Common Voice speech corpus [76] now covers 100 languages while the latest release of Universal Dependencies [39:1] includes 130 languages. Given the number of new datasets, there have been efforts to catalogue available datasets in African and Indonesian languages , Arabic , and a diverse set of languages [77] .
Models   New models developed in this area focus specifically on under-represented languages. There are text-based language models that focus on African languages such as AfriBERTa [78] , AfroXLM-R [79] , and KinyaBERT [80] and models for Indonesian languages such as IndoBERT [71:1] and IndoGPT [72:1] . For Indian languages, there are text-based models such as IndicBERT [73:1] and MuRIL [81] and speech models such as CLSRIL [82] and IndicWav2Vec [83] . Many of these approaches train a model on several related languages and are thus able to leverage positive transfer while being much more efficient than larger multilingual models. See [84] and [85] for surveys of recent multilingual models in NLP and speech.
Industry   In industry, startups have been developing new technology to serve local languages such as InstaDeep developing a model for Tunisian Arabic [86] , Nokwary enabling financial inclusion in Ghanaian languages, BarefootLaw employing NLP technology to provide legal help in Uganda, and NeuralSpace building speech and text APIs for a geographically diverse set of languages, among many others.
Similarly, large tech companies have expanded their ASR and machine translation offerings . Both Google [87] and Meta [88] have described efforts on how to scale machine translation technology to the next thousand languages. At the heart of these efforts are a) mining high-quality monolingual data from the web based on improved language identification and filtering; b) training massively multilingual models on monolingual and parallel data; and c) extensive evaluation on newly collected datasets. These components are similarly important for building better ASR systems for under-represented languages [89] .
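As a simplified illustration of step a), the snippet below filters web text with an off-the-shelf language identification model (fastText's publicly available lid.176.bin); the production systems described in these papers use their own, more accurate classifiers, and the thresholds here are arbitrary.

```python
# Sketch: language-identification-based filtering of web text, one ingredient
# of the data-mining pipelines described above. fastText's lid.176.bin is a
# stand-in for the stronger classifiers used in the cited work; thresholds
# and example sentences are illustrative.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")  # download from fasttext.cc

def keep_line(line, target_lang="sw", min_confidence=0.7, min_words=4):
    """Keep a line only if it is confidently identified as the target language."""
    line = line.strip().replace("\n", " ")
    if len(line.split()) < min_words:
        return False
    labels, probs = lid_model.predict(line)
    lang = labels[0].replace("__label__", "")
    return lang == target_lang and probs[0] >= min_confidence

corpus = ["Habari za asubuhi, karibu sana.", "Buy cheap watches now!!!"]
filtered = [line for line in corpus if keep_line(line)]
print(filtered)
```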
Challenges and Opportunities
Given this recent progress, what are the remaining challenges and opportunities in this area?
Challenge #1: Limited Data
Arguably the biggest challenge in multilingual research is the limited amount of data available for most of the world's languages. Joshi et al. [90] categorized the languages of the world into six different categories based on the amount of labeled and unlabeled data available in them.
The distribution of resources in the world's languages. Labeled data is based on the LDC Catalog and ELRA Map . Unlabeled data is based on Wikipedia. The size of the gradient circle represents the number of languages in the class. The color spectrum represents the total speaker population size from low to high ( Joshi et al., 2020 ).
88% of the world's languages are in resource group 0 with virtually no text data available to them while 5% of languages are in resource group 1 where there is very limited text data available.
Opportunity #1: Real-world Data
How can we overcome this enormous discrepancy in the resource distribution across the world's languages? The creation of new data, particularly in languages with few annotators, is expensive. For this reason, many existing multilingual datasets such as XNLI [91] , XQuAD [92] , and XCOPA [93] are based on translations of established English datasets.
Such translation-based data, however, are problematic. Translated text in a language can be considered a dialect of that language, known as 'translationese', which differs from natural language [94] . Translation-based test sets may thus over-estimate the performance of models trained on similar data, which have learned to exploit translation artifacts [95] .
Over-representation of Western concepts   Beyond these issues, translating an existing dataset inherits the biases of the original data. In particular, translated data differs from data that is naturally created by speakers of different languages. As existing datasets were mostly created by crowdworkers or researchers based in Western countries, they mostly reflect Western-centric concepts. For example, ImageNet [96] , one of the most influential datasets in ML, is based on English WordNet. As a result, it captures concepts that are overly English-specific and unknown in other cultures [97] . Similarly, Flickr30k [98] contains depictions of concepts that are mainly familiar to people from certain Western regions such as tailgating in the US [99] .
An image in Flickr30k (Young et al., 2014) . Two American annotators but neither Dutch nor German workers identified the Denver Broncos jersey. Three out of five American annotators described the activity in the image as tailgating , a North-American pastime where people gather to enjoy an informal (often barbecue) meal on the parking lot outside a sports stadium (van Miltenburg et al., 2017) .
The commonsense reasoning dataset COPA [100] contains many referents that have no language-specific terms in some languages, e.g., bowling ball, hamburger, and lottery [93:1] . Most questions in current QA datasets ask about US or UK nationals [101] while many other datasets, particularly those based on Wikipedia, contain mainly entities from Europe, the US, and the Middle East [102] .
Practical data   For new datasets, it is thus ever more important to create data that is informed by real-world usage. On the one hand, data should reflect the background of the speakers speaking the language. For example, MaRVL [103] is a multi-modal reasoning dataset that covers concepts representative of different cultures and languages.
A Swahili example in MaRVL depicting the concept leso ("handkerchief"). Caption: Picha moja ina watu kadhaa waliovaa leso na picha nyingine ina leso bila watu. ("One picture contains several people wearing handkerchiefs and another picture has a handkerchief without people."). Label: FALSE (Liu et al., 2021) .
Given the increasing maturity of language technology, it is important to collect data that is relevant for real-world applications and that may have a positive impact on speakers of under-represented languages. Such applications include the development of assistive language technology for humanitarian crises, health, education, law, and finance. Languages that may benefit from such technology are standardised languages and contact languages, including creoles and regional language varieties [104] .
Creating real-world datasets has the potential to ground research and enables it to have a larger impact. It also reduces the distribution shift between research and practical scenarios and makes it more likely that models developed on academic datasets will be useful in production.
Beyond the creation of the training or evaluation data, the development of a language model requires the involvement of a large number of stakeholders, many of whom are often not explicitly acknowledged. Many of the components in this process under-perform or are not available at all in many languages.
The development cycle of a language model. Model creation relies on data created by multiple stakeholders. (Credit: Clara Rivera; adapted from ∀ et al., 2020 ).
This starts at the beginning of data creation: online platforms and keyboards may not support certain languages [57:1] , dictionaries may not cover them, and language ID may not perform well for them [105] . In many languages, the connections between different stakeholders are also missing and it is difficult to find original content or to identify qualified annotators. The fact that text on the web is difficult to find for some languages does not mean, however, that these languages are resource-poor or that data for these languages does not exist.
Multi-modal data   Many languages around the world are more commonly spoken than written. We can reduce the reliance on (and lack of) text data by focusing on information from multi-modal data sources such as radio broadcasts and online videos as well as by combining information from multiple modalities. Recent speech-and-text models [106] [107] achieve strong improvements on speech tasks such as ASR, speech translation, and text-to-speech. They still perform more poorly, however, on text-only tasks due to a lack of capacity [108] . There is a lot of potential to leverage multi-modal data as well as to investigate the linguistic characteristics of different languages and their interplay in text and speech [109] .
Multilingual speech-text pre-training in mSLAM. A model is jointly pre-trained on unlabeled and labeled text and speech datasets using a set of different modality-specific losses (Bapna et al., 2022) .
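Off-the-shelf multilingual ASR models already make some of this spoken data accessible. The snippet below is a minimal sketch of transcribing an audio file with Whisper via the Hugging Face pipeline; the checkpoint name and file path are placeholders rather than a recommendation from this post.

```python
# Sketch: transcribing spoken-language data with a multilingual ASR model,
# here OpenAI's Whisper via the Hugging Face `transformers` pipeline.
# The checkpoint and audio path are illustrative placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Whisper was trained on weakly supervised audio in ~100 languages, so the
# same model can transcribe, e.g., a Swahili radio recording.
result = asr("radio_broadcast.wav")
print(result["text"])
```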
Beyond multi-modal information, data may also be available in formats that are locked to current models such as in handwritten documents and non-digitized books, among others. Technologies such as optical character recognition (OCR) [110] and new datasets such as the Bloom Library [111] will help us make such untapped data sources more accessible. There are also resources that have so far been used relatively little despite their large language coverage such as the Bible, which covers around 1,600 languages [112] and lexicons, which cover around 5,700 languages [113] . Other data sources may be readily available but have so far gone unused or unnoticed. Recent examples of such 'fortuitous data' [114] include HTML and web page structure [115] [116] , among others.
Given the generalization ability of pre-trained language models, benchmarks have been increasingly moving towards evaluation in low-resource settings. When creating new datasets, large test sets with sufficient statistical power [117] should thus be prioritized. In addition, languages for annotation can be prioritized based on the expected gain in utility [41:1] and reduction in inequality [118] .
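To make the statistical power argument concrete, the snippet below bootstraps a confidence interval around accuracy for different test-set sizes; with only a few hundred examples, the interval is often too wide to separate two systems that differ by a couple of points. The accuracy value and sizes are illustrative.

```python
# Sketch: why test-set size matters for statistical power. We bootstrap a
# 95% confidence interval around accuracy for small vs. larger test sets.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(correct, n_resamples=2_000, alpha=0.05):
    """95% bootstrap confidence interval for accuracy, given 0/1 outcomes."""
    n = len(correct)
    accs = [rng.choice(correct, size=n, replace=True).mean() for _ in range(n_resamples)]
    return np.quantile(accs, [alpha / 2, 1 - alpha / 2])

for test_size in (100, 1_000, 10_000):
    outcomes = rng.binomial(1, p=0.80, size=test_size)  # a system with ~80% accuracy
    low, high = bootstrap_ci(outcomes)
    print(f"n={test_size:>6}: accuracy CI ≈ [{low:.3f}, {high:.3f}]")
```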
Finally, there are challenges for responsible AI when collecting data and developing technology for under-represented languages, including data governance, safety, privacy, and participation. Addressing these challenges requires answering questions such as: How are appropriate usage and ownership of the data and technology guaranteed [119] ? Are there methods in place to detect and filter sensitive and biased data and detect bias in models? How is privacy preserved during data collection and usage? How can the data and technology development be made participatory [120] ?
Challenge #2: Limited Compute
Under-represented language applications face constraints that go beyond the lack of data. Mobile data, compute, and other resources are often expensive or unavailable. GPU servers, for instance, are scarce even in top universities in many countries [4:1] while the cost of mobile data is higher in countries where under-represented languages are spoken [121] .
Cost of mobile data by country for the resource groups by Joshi et al. (2020) (Ahia et al., 2021) .
Opportunity #2: Efficiency
In order to make better use of limited compute, we must develop methods that are more efficient. For an overview of efficient Transformer architectures and efficient NLP methods in general refer to [122] and [123] . As pre-trained models are widely available, a promising direction is the adaptation of such models via parameter-efficient methods, which have been shown to be more effective than in-context learning [124] .
A common method is adapters [125] [126] , small bottleneck layers that are inserted between the layers of a pre-trained model. These parameter-efficient methods can be used to overcome the curse of multilinguality by enabling the allocation of additional language-specific capacity. They also enable the adaptation of a pre-trained multilingual model to languages that it has not been exposed to during pre-training [127] [128] . As such adapter layers are separate from the remaining parameters of the model, they allow learning modular interactions between tasks and languages [129] .
Language-specific adapter layers learned via masked language modeling (MLM) on data of each language while the remaining parameters of the model are frozen (Pfeiffer et al., 2020) .
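To make the idea concrete, here is a minimal PyTorch sketch of a bottleneck adapter. It only illustrates the core mechanism—a small trainable bottleneck with a residual connection around the frozen model—and does not reproduce any specific published configuration (Houlsby, Pfeiffer, etc. differ in placement and details).

```python
# Minimal sketch of a bottleneck adapter layer in PyTorch. Not the exact
# configuration of any cited paper; it only shows the core idea: a small,
# trainable bottleneck with a residual connection, while the pre-trained
# weights stay frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # down-projection
        self.up = nn.Linear(bottleneck_size, hidden_size)    # up-projection
        self.activation = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen model's representation intact.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))

# During language adaptation, only adapter (and possibly embedding) parameters
# are trained with the MLM objective; everything else is frozen, e.g.:
#   for p in pretrained_model.parameters():
#       p.requires_grad = False
#   adapter = Adapter(hidden_size=768)
```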
Adapters have been shown to improve robustness [130] [131] , lead to increased sample efficiency compared to fine-tuning [132] , and outperform alternative parameter-efficient methods [133] [134] . They allow for extensions such as incorporating hierarchical structure [135] or conditioning via hyper-networks [136] [137] .
Cross-lingual parameter-efficient transfer learning is not restricted to adapters but can take other forms [138] such as sparse sub-networks [139] . Such methods have been applied to a diverse set of applications and domains, from machine translation [140] [141] to ASR [142] and speech translation [143] .
Challenge #3: Language Typology
If we plot the typological features of the world's languages based on the World Atlas of Language Structures (WALS) and project them into two dimensions using PCA, we get a density plot such as the one below. Marking the languages present in Universal Dependencies [39:2] , one of the most multilingual resources, with red stars, we can observe that the languages for which data is available lie mostly in low-density regions of this plot. The distribution of languages in existing datasets is thus heavily skewed compared to the real-world distribution of languages, and languages with available data are unrepresentative of most of the world's languages.
Density of WALS typological features of the world's languages. Red stars are languages in Universal Dependencies (Ponti et al., 2021) .
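The projection itself is straightforward. A sketch with scikit-learn is shown below; loading and binarizing the actual WALS data is omitted, so the feature matrix is a random placeholder with one row per language and one column per (binarized) typological feature.

```python
# Sketch: projecting WALS-style typological feature vectors into two
# dimensions with PCA, as in the density plot above. `features` is a
# placeholder matrix standing in for the binarized WALS features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(2000, 192)).astype(float)  # placeholder

coords = PCA(n_components=2).fit_transform(features)
print(coords.shape)  # (n_languages, 2); such 2-D points underlie the density plot
```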
Under-represented languages have many linguistic features that are not present in Western languages. A common linguistic feature is tone, which is present in around 80% of African languages [109:1] and can be lexical or grammatical. In Yorùbá, lexical tone distinguishes meaning, for instance, in the following words: igbá ("calabash", "basket"), igba ("200"), ìgbà ("time"), ìgbá ("garden egg"), and igbà ("rope"). In Akan, grammatical tone distinguishes habitual and stative verbs, as in Ama dá ha ("Ama sleeps here") and Ama dà ha ("Ama is sleeping here"). Tone is relatively unexplored in speech and NLP applications.
While the typological features of languages around the world are diverse, languages within a region often share linguistic features. For instance, African languages mainly belong to a few major language families.
Map of African language families (Credit: Wikipedia ).
Opportunity #3: Specialization
Rich Sutton highlights a bitter lesson for the field of AI research:
"The great power of general purpose methods [...] that continue to scale with increased computation [...]. The two methods that seem to scale arbitrarily: search and learning."
For most under-represented languages, computation and data, however, are limited. It is thus reasonable to incorporate (some amount of) knowledge into our language models to make them more useful for such languages.
This can take the form of biasing the tokenization process, which often produces poor segmentations for languages with a rich morphology or limited data. We can modify the algorithm to prefer tokens that are shared across many languages [144] , preserve tokens’ morphological structure [145] , or make the tokenization algorithm more robust to deal with erroneous segmentations [146] .
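As a point of reference, the snippet below trains a standard SentencePiece unigram tokenizer on a multilingual corpus and inspects its segmentations. It does not implement any of the modifications above; the file names, settings, and example word are placeholders.

```python
# Baseline sketch: training a standard SentencePiece unigram tokenizer on a
# multilingual corpus and inspecting its segmentations. The modifications
# discussed above (preferring cross-lingually shared tokens, preserving
# morphological structure, robustness to bad segmentations) are NOT
# implemented here; file names and settings are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",  # one sentence per line, mixed languages
    model_prefix="multi_unigram",
    vocab_size=32_000,
    model_type="unigram",
    character_coverage=0.9995,        # keep rare characters from small languages
)

sp = spm.SentencePieceProcessor(model_file="multi_unigram.model")
# Inspect how a morphologically rich word is segmented.
print(sp.encode("ninakupenda", out_type=str))  # Swahili: "I love you"
```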
We can also exploit the fact that many under-represented languages belong to groups of similar languages. Models focusing on such groups can thus more easily share information across languages. While recent models focus mainly on related languages [73:2] [81:1] [82:1] , future models may also include language variants and dialects, which can benefit from positive transfer from related languages.
While principled variants of masking such as whole word masking [147] and PMI-masking [148] have been found useful in the past, new pre-training objectives that take linguistic characteristics such as rich morphology or tone into account may lead to more sample-efficient learning. Finally, the architecture of models can be adapted to incorporate information about morphology such as in the KinyaBERT model for Kinyarwanda [80:1] .
The KinyaBERT model for Kinyarwanda. The morphological analyzer produces morphemes for every word. The model uses different embeddings for POS tags, stems, and affixes (Nzeyimana & Rubungo, 2022) .
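Returning to whole word masking mentioned above: the idea is simple to sketch—sample words rather than subword tokens and mask all of a sampled word's pieces together. The snippet below is a minimal illustration, not the exact procedure of the cited work; the checkpoint and example sentence are placeholders.

```python
# Minimal sketch of whole word masking: instead of masking individual subword
# tokens, all subwords of a sampled word are masked together. Checkpoint and
# example text are illustrative placeholders.
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def whole_word_mask(text: str, mask_prob: float = 0.15, seed: int = 0):
    enc = tokenizer(text)
    word_ids = enc.word_ids()                       # maps each token to its word
    words = {w for w in word_ids if w is not None}  # ignore special tokens
    rng = random.Random(seed)
    masked_words = {w for w in words if rng.random() < mask_prob}

    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
    return [
        tokenizer.mask_token if w in masked_words else tok
        for tok, w in zip(tokens, word_ids)
    ]

print(whole_word_mask("Watoto wanacheza mpira uwanjani leo jioni."))
```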
Conclusion
While there has been a tremendous amount of progress in recent multilingual AI, there is still a lot more to do. Most importantly, we should focus on creating data that reflects the real-world circumstances of language speakers and on developing language technology that serves the needs of speakers around the world. While there is momentum and increasing awareness that such work is important, it takes a village to develop equitable language technology for the world's languages. Masakhane ("let us build together" in isiZulu)!
