Natural language processing (NLP) research predominantly focuses on developing methods that work well for English despite the many positive benefits of working on other languages. These benefits range from an outsized societal impact to modelling a wealth of linguistic features to avoiding overfitting as well as interesting challenges for machine learning (ML).

There are around 7,000 languages spoken around the world. The map above (see the interactive version at Langscape) gives an overview of languages spoken around the world, with each green circle representing a native language. Most of the world's languages are spoken in Asia, Africa, the Pacific region and the Americas.

While we have seen exciting progress across many tasks in natural language processing (see Papers with Code and NLP Progress for an overview) over the last years, most such results have been achieved in English and a small set of other high-resource languages.

In a previous overview of our ACL 2019 tutorial on Unsupervised Cross-lingual Representation Learning, I've defined a resource hierarchy based on the availability of unlabelled data and labelled data online. In a recent ACL 2020 paper, Joshi et al. define a taxonomy similarly based on data availability, which you can see below.

Language resource distribution of Joshi et al. (2020)
Language resource distribution of Joshi et al. (2020). The size and colour of a circle represent the number of languages and speakers respectively in each category. Colours (on the VIBGYOR spectrum; Violet–Indigo–Blue–Green–Yellow–Orange–Red) represent the total speaker population size from low (violet) to high (red).

Languages in categories 5 and 4 that lie at a sweet spot of having both large amounts of labelled and unlabelled data available to them are well-studied in the NLP literature. On the other hand, languages in the other groups have largely been neglected.

In this post, I will argue why you should work on languages other than English. Specifically, I will highlight reasons from a societal, linguistic, machine learning, cultural and normative, and cognitive perspective.

The societal perspective

Technology cannot be accessible if it is only available for English speakers with a standard accent.

What language you speak determines your access to information, education, and even human connections. Even though we think of the Internet as open to everyone, there is a digital language divide between dominant languages (mostly from the Western world) and others. Only a few hundred languages are represented on the web and speakers of minority languages are severely limited in the information available to them.

As many more languages are being written in informal contexts in chat apps and on social media, this divide extends to all levels of technology: At the most basic language technology level, low-resource languages lack keyboard support and spell checking (Soria et al., 2018)—and keyboard support is even rarer for languages without a widespread written tradition (Paterson, 2015). At a higher level, algorithms are biased and discriminate against speakers of non-English languages or simply with different accents.

The latter is a problem because much existing work treats a high-resource language such as English as homogeneous. Our models consequently underperform on the plethora of related linguistic subcommunities, dialects, and accents (Blodgett et al., 2016). In reality, the boundaries between language varieties are much blurrier than we make them out to be and language identification of similar languages and dialects is still a challenging problem (Jauhiainen et al., 2018). For instance, even though Italian is the official language in Italy, there are around 34 regional languages and dialects spoken throughout the country.

A continuing lack of technological inclusion will not only exacerbate the language divide but it may also drive speakers of unsupported languages and dialects to high-resource languages with better technological support, further endangering such language varieties. To ensure that non-English language speakers are not left behind and at the same time to offset the existing imbalance, to lower language and literacy barriers, we need to apply our models to non-English languages.

The linguistic perspective

Even though we claim to be interested in developing general language understanding methods, our methods are generally only applied to a single language, English.

English and the small set of other high-resource languages are in many ways not representative of the world's other languages. Many resource-rich languages belong to the Indo-European language family, are spoken mostly in the Western world, and are morphologically poor, i.e. information is mostly expressed syntactically, e.g. via a fixed word order and using multiple separate words rather than through variation at the word level.

For a more holistic view, we can take a look at the typological features of different languages. The World Atlas of Language Structure catalogues 192 typological features, i.e. structural and semantic properties of a language. For instance, one typological feature describes the typical order of subject, object, and verb in a language. Each feature has 5.93 categories on average. 48% of all feature categories exist only in the low-resource languages of groups 0–2 above and cannot be found in languages of groups 3–5 (Joshi et al., 2020). Ignoring such a large subset of typological features means that our NLP models are potentially missing out on valuable information that can be useful for generalisation.

Working on languages beyond English may also help us gain new knowledge about the relationships between the languages of the world (Artetxe et al., 2020). Conversely, it can help us reveal what linguistic features our models are able to capture. Specifically, you could use your knowledge of a particular language to probe aspects that differ from English such as the use of diacritics, extensive compounding, inflection, derivation, reduplication, agglutination, fusion, etc.

The ML perspective

We encode assumptions into the architectures of our models that are based on the data we intend to apply them. Even though we intend our models to be general, many of their inductive biases are specific to English and languages similar to it.

The lack of any explicitly encoded information in a model does not mean that it is truly language agnostic. A classic example are n-gram language models, which perform significantly worse for languages with elaborate morphology and relatively free word order (Bender, 2011).

Similarly, neural models often overlook the complexities of morphologically rich languages (Tsarfaty et al., 2020): Subword tokenization performs poorly on languages with reduplication (Vania and Lopez, 2017), byte pair encoding does not align well with morphology (Bostrom and Durrett, 2020), and languages with larger vocabularies are more difficult for language models (Mielke et al., 2019). Differences in grammar, word order, and syntax also cause problems for neural models (Ravfogel et al., 2018; Ahmad et al., 2019; Hu et al., 2020). In addition, we generally assume that pre-trained embeddings readily encode all relevant information, which may not be the case for all languages (Tsarfaty et al., 2020).

The above problems pose unique challenges for modelling structure—both on the word and the sentence level—, dealing with sparsity, few-shot learning, encoding relevant information in pre-trained representations, and transferring between related languages, among many other interesting directions. These challenges are not addressed by current methods and thus call for a new set of language-aware approaches.

Recent models have repeatedly matched human-level performance on increasingly difficult benchmarks—that is, in English using labelled datasets with thousands and unlabelled data with millions of examples. In the process, as a community we have overfit to the characteristics and conditions of English-language data. In particular, by focusing on high-resource languages, we have prioritised methods that work well only when large amounts of labelled and unlabelled data are available.

In contrast, most current methods break down when applied to the data-scarce conditions that are common for most of the world's languages. Even recent advances in pre-training language models that dramatically reduce the sample complexity for downstream tasks (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019; Clark et al., 2020) require massive amounts of clean, unlabelled data, which is not available for most of the world's languages (Artetxe et al., 2020). Doing well with few data is thus an ideal setting to test the limitations of current models—and evaluation on low-resource languages constitutes arguably its most impactful real-world application.

The cultural and normative perspective

The data our models are trained on reveals not only the characteristics of the specific language but also sheds light on cultural norms and common sense knowledge.

However, such common sense knowledge may be different for different cultures. For instance, the notion of 'free' and 'non-free' varies cross-culturally where 'free' goods are ones that anyone can use without seeking permission, such as salt in a restaurant. Taboo topics are also different in different cultures. Furthermore, cultures vary in their assessment of relative power and social distance, among many other things (Thomas, 1983). In addition, many real-world situations such as ones included in the COPA dataset (Roemmele et al., 2011) do not match the direct experience of many and equally do not reflect key situations that are obvious background knowledge for many people in the world (Ponti et al., 2020).

Consequently, an agent that was only exposed to English data originating mainly in the Western world may be able to have a reasonable conversation with speakers from Western countries, but conversing with someone from a different culture may lead to pragmatic failures.

Beyond cultural norms and common sense knowledge, the data we train a model on also reflects the values of the underlying society. As an NLP researcher or practitioner, we have to ask ourselves whether we want our NLP system to exclusively share the values of a specific country or language community.

While this decision might be less important for current systems that mostly deal with simple tasks such as text classification, it will become more important as systems become more intelligent and need to deal with complex decision-making tasks.

The cognitive perspective

Human children can acquire any natural language and their language understanding ability is remarkably consistent across all kinds of languages. In order to achieve human-level language understanding, our models should be able to show the same level of consistency across languages from different language families and typologies.

Our models should ultimately be able to learn abstractions that are not specific to the structure of any language but that can generalise to languages with different properties.

What you can do

Datasets  If you create a new dataset, reserve half of your annotation budget for creating the same size dataset in another language.

Evaluation  If you are interested in a particular task, consider evaluating your model on the same task in a different language. For an overview of some tasks, see NLP Progress or our XTREME benchmark.

Bender Rule  State the language you are working on.

Assumptions  Be explicit about the signals your model uses and the assumptions it makes. Consider which are specific to the language you are studying and which might be more general.

Language diversity  Estimate the language diversity of the sample of languages you are studying (Ponti et al., 2020).

Research  Work on methods that address challenges of low-resource languages. In the next post, I will outline interesting research directions and opportunities in multilingual NLP.

Citation

For attribution in academic contexts, please cite this work as:

@misc{ruder2020,
  author = {Ruder, Sebastian},
  title = {{Why You Should Do NLP Beyond English}},
  year = {2020},
  howpublished = {\url{http://ruder.io/nlp-beyond-english}},
}

Thanks to Aida Nematzadeh, Laura Rimell, and Adhi Kuncoro for valuable feedback on drafts of this post.