<![CDATA[Sebastian Ruder]]>http://ruder.io/http://ruder.io/favicon.pngSebastian Ruderhttp://ruder.io/Ghost 2.16Mon, 04 Mar 2019 20:39:52 GMT60<![CDATA[AAAI 2019 Highlights: Dialogue, reproducibility, and more]]>http://ruder.io/aaai-2019-highlights/5c76ff7c1b9b0d18555b9ebbThu, 07 Feb 2019 17:00:00 GMT

This post discusses highlights of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19).

I attended AAAI 2019 in Honolulu, Hawaii last week. Overall, I was particularly surprised by the interest in natural language processing at the conference. There were 15 sessions on NLP (most standing-room only) with ≈10 papers each (oral and spotlight presentations), so around 150 NLP papers (out of 1,150 accepted papers overall). I also really enjoyed the diversity of invited speakers who discussed topics from AI for social good, to adversarial learning and imperfect-information games (videos of all invited talks are available here). Another cool thing was the Oxford style debate, which required debaters to take controversial positions. This was a nice change of pace from panel discussions, which tend to converge to a uniform opinion.

# Dialogue

In his talk at the Reasoning and Learning for Human-Machine Dialogues workshop, Phil Cohen argued that chatbots are an attempt to avoid solving the hard problems of dialogue. They provide the illusion of having a dialogue but in fact do not have a clue what we are saying or meaning. What we should rather do is recognize intents via semantic parsing. We should then reason about the speech acts, infer a user's plan, and help them to succeed. You can find more information about his views in this position paper.

During the panel discussion, Imed Zitouni highlighted that the limitations of current dialogue models affect user behaviour. 75-80% of the time users only employ 4 skills: "play music", "set a timer", "set a reminder", and "what is the weather". Phil argued that we should not have to learn how to talk, how to make an offer, etc. all over again for each domain. We can often build simple dialogue agents for new domains "overnight".

# Reproducibility

At the Workshop on Reproducible AI, Joel Grus argued that Jupyter notebooks are bad for reproducibility. As an alternative, he recommended to adopt higher-level abstractions and declarative configurations. Another good resource for reproducibility is the ML reproducibility checklist by Joelle Pineau, which provides a list of items for algorithms, theory, and empirical results to enforce reproducibility.

A team from Facebook reported on their experiments reproducing AlphaZero in their ELF framework, training a model using 2,000 GPUs in 2 weeks. Reproducing an on-policy, distributed RL system such as AlphaZero is particularly challenging as it does not have a fixed dataset and optimization is dependent on the distributed environment. Training smaller versions and scaling up is key. For reproducibility, the random seed, the git commit number, and the logs should be stored.

During the panel discussion, Odd Eric Gunderson argued that reproducibility should be defined as the ability of an independent research team to produce the same results using the same AI method based on the documentation by the original authors. Degrees of reproducibility can be measured based on the availability of different types of documentation, such as the method description, data, and code.

Pascal van Hentenryck argued that reproducibility could be made part of the peer review process, such as in the Mathematical Programming Computation journal where each submission requires an executable file (which does not need to be public). He also pointed out that—empirically—papers with supplementary materials are more likely to be accepted.

At the Reasoning and Complex QA Workshop, Ken Forbus discussed an analogical training method for QA that adapts a general-purpose semantic parser to a new domain with few examples. At the end of his talk, Ken argued that the train/test method in ML is holding us back. Our learning systems should use rich relational representations, gather their own data, and evaluate progress.

Ashish Sabharwal discussed the OpenBookQA dataset presented at EMNLP 2018 during his talk. The open book setting is situated between reading comprehension and open-ended QA on the textual QA spectrum (see below).

It is designed to probe a deeper understanding rather than memorization skills and requires applying core principles to new situations. He also argued that while entailment is recognized as a core NLP task with many applications, it is still lacking a convincing application to an end-task. This is mainly due to multi-sentence entailment being a lot harder, as irrelevant sentences often have significant textual overlap.

Furthermore, he discussed the design of leaderboards, which have to make tradeoffs along multiple competing axes with respect to the host, the submitters, and the community. A particular deficit of current leaderboards is that they make it difficult to share and build upon successful techniques. For an extensive discussion of the pros and cons of leaderboards, check out this recent NLP Highlights podcast.

The first part of the final panel discussion focused on important outstanding technical challenges for question answering. Michael Witbrock emphasized techniques to create datasets that cannot easily be exploited by neural networks, such as the adversarial filtering in SWAG. Ken argued that models should come up with answers and explanations rather than performing multiple choice question answering, while Ashish noted that such explanations need to be automatically validated.

Eduard Hovy suggested that one way towards a system that can perform more complex QA could consist of the following steps:

1. Build a symbolic numerical reasoner that leverages relations from an existing KB, such as Geobase, which contains geography facts.
2. Look at the subset of questions in existing natural language datasets, which require reasoning that is possible with the reasoner.
3. Annotate these questions with semantic parses and train a semantic parsing model to convert the questions to logical forms. These can then be provided to the reasoner to produce an answer.
4. Augment the reasoner with another reasoning component and repeat steps 2-3.

The panel members noted that such reasoners exist, but lack a common API.

Finally, here are a few papers on question answering that I enjoyed:

# AI for social good

During his invited talk, Milind Tambe looked back on 10 years of research in AI and multiagent systems for social good (video available here; slides available here). Milind discussed his research on using game theory to optimize security resources such as patrols at airports, air marshal assignments on flights, coast guard patrols, and ranger patrols in African national parks to protect against poachers. Overall, his talk was a striking reminder of the positive effects AI can have if it is employed for social good.

# Debate

The Oxford style debate focused on the proposition “The AI community today should continue to focus mostly on ML methods” (video available here). It pitted Jennifer Neville and Peter Stone on the 'pro' side against Michael Littman and Oren Etzioni on the 'against' side, with Kevin Leyton-Brown as moderator. Overall, the debate was entertaining and engaging to watch.

Here are some representative remarks from each of the debaters that stuck with me:

"The unique strength of the AI community is that we focus on the problems that need to be solved." – Jennifer Neville
"We are in the middle of one of the most amazing paradigm shifts in all of science, certainly computer science." – Oren Etzioni
"If you want to have an impact, don’t follow the bandwagon. Keep alive other areas." – Peter Stone
"Scientists in the natural sciences are actually very excited about ML as much of their research relies on expensive computations, which can be approximated with neural networks." – Michael Littman

There were some important observations and ultimately a general consensus that ML alone is not enough and we need to integrate other methods with ML. Yonatan Belinkov also live tweeted, while I tweeted some remarks that elicited laughs.

During his invited talk (video available here), Ian Goodfellow discussed a multiplicity of areas to which adversarial learning has been applied. Among many advances, Ian mentioned that he was impressed by the performance and flexibility of attention masks for GANs, particularly that they are not restricted to circular masks.

He discussed adversarial examples, which are a consequence of moving away from i.i.d. data: attackers are able to confuse the model by showing unusual data from a different distribution such as graffiti on stop signs. He also argued—contrary to the prevalent opinion—that deep models that are more robust are more interpretable than linear models. The main reason is that the latent space of a linear model is totally unintuitive, while a more robust model is more inspectable (as can be seen below).

Semi-supervised learning with GANs can allow models to be more sample-efficient. What is interesting about such applications is that they focus on the discriminator (which is normally discarded) rather than the generator where the discriminator is extended to classify n+1 classes. Regarding leveraging GANs for NLP, Ian conceded that we currently have not found a good way to deal with the large action space required to generate sentences with RL.

# Imperfect-information games

In his invited talk (video available here), Tuomas Sandholm—whose AI Libratus was the first AI to beat top Heads-Up No-Limit Texas Hold'em professionals in January 2017—discussed new results for solving imperfect-information games. He stressed that only game-theoretically sound techniques yield strategies that are robust against all opponents in imperfect-information games. Other advantages of a game-theoretic approach are a) that even if humans have access to the entire history of plays of the AI, they still can't find holes in its strategy; and b) it requires no data, just the rules of the game.

For solving such games, the quality of the solution depends on the quality of the abstraction. Developing better abstractions is thus important, which also applies to modelling such games. In imperfect-information games, planning is important. In real-time planning, we must consider how the opponent can adapt to changes in the policy. In contrast to perfect-information games, states do not have well-defined values.

# Inductive biases

There were several papers that incorporated different inductive biases into existing models:

# Transfer learning

Papers on transfer learning ranged from multi-task learning and semi-supervised learning to sequential and zero-shot transfer:

# Word embeddings

Naturally there were also a number of papers that provided new methods for learning word embeddings:

# Miscellaneous

Finally, here are some papers that I enjoyed that do not fit into any of the above categories:

Cover image: AAAI-19 Opening Reception

]]>
<![CDATA[The 4 Biggest Open Problems in NLP]]>http://ruder.io/4-biggest-open-problems-in-nlp/5c76ff7c1b9b0d18555b9eb0Tue, 15 Jan 2019 15:28:00 GMT

This post discusses 4 major open problems in NLP based on an expert survey and a panel discussion at the Deep Learning Indaba.

This is the second blog post in a two-part series. The series expands on the Frontiers of Natural Language Processing session organized by Herman Kamper, Stephan Gouws, and me at the Deep Learning Indaba 2018. Slides of the entire session can be found here. The first post discussed major recent advances in NLP focusing on neural network-based methods. This post discusses major open problems in NLP. You can find a recording of the panel discussion this post is based on here.

In the weeks leading up to the Indaba, we asked NLP experts a number of simple but big questions. Based on the responses, we identified the four problems that were mentioned most often:

We discussed these problems during a panel discussion. This article is mostly based on the responses from our experts (which are well worth reading) and thoughts of my fellow panel members Jade Abbott, Stephan Gouws, Omoju Miller, and Bernardt Duvenhage. I will aim to provide context around some of the arguments, for anyone interested in learning more.

# Natural language understanding

I think the biggest open problems are all related to natural language
understanding. [...] we should develop systems that read and
understand text the way a person does
, by forming a representation of
the world of the text, with the agents, objects, settings, and the
relationships, goals, desires, and beliefs of the agents, and everything else
that humans create to understand a piece of text. Until we can do that, all of our progress is in improving our systems’ ability to do pattern matching.
– Kevin Gimpel

Many experts in our survey argued that the problem of natural language understanding (NLU) is central as it is a prerequisite for many tasks such as natural language generation (NLG). The consensus was that none of our current models exhibit 'real' understanding of natural language.

Innate biases vs. learning from scratch   A key question is what biases and structure should we build explicitly into our models to get closer to NLU. Similar ideas were discussed at the Generalization workshop at NAACL 2018, which Ana Marasovic reviewed for The Gradient and I reviewed here. Many responses in our survey mentioned that models should incorporate common sense. In addition, dialogue systems (and chat bots) were mentioned several times.

On the other hand, for reinforcement learning, David Silver argued that you would ultimately want the model to learn everything by itself, including the algorithm, features, and predictions. Many of our experts took the opposite view, arguing that you should actually build in some understanding in your model. What should be learned and what should be hard-wired into the model was also explored in the debate between Yann LeCun and Christopher Manning in February 2018.

Program synthesis   Omoju argued that incorporating understanding is difficult as long as we do not understand the mechanisms that actually underly NLU and how to evaluate them. She argued that we might want to take ideas from program synthesis and automatically learn programs based on high-level specifications instead. Ideas like this are related to neural module networks and neural programmer-interpreters.

She also suggested we should look back to approaches and frameworks that were originally developed in the 80s and 90s, such as FrameNet and merge these with statistical approaches. This should help us infer common sense-properties of objects, such as whether a car is a vehicle, has handles, etc. Inferring such common sense knowledge has also been a focus of recent datasets in NLP.

Embodied learning   Stephan argued that we should use the information in available structured sources and knowledge bases such as Wikidata. He noted that humans learn language through experience and interaction, by being embodied in an environment. One could argue that there exists a single learning algorithm that if used with an agent embedded in a sufficiently rich environment, with an appropriate reward structure, could learn NLU from the ground up. However, the compute for such an environment would be tremendous. For comparison, AlphaGo required a huge infrastructure to solve a well-defined board game. The creation of a general-purpose algorithm that can continue to learn is related to lifelong learning and to general problem solvers.

While many people think that we are headed in the direction of embodied learning, we should thus not underestimate the infrastructure and compute that would be required for a full embodied agent. In light of this, waiting for a full-fledged embodied agent to learn language seems ill-advised. However, we can take steps that will bring us closer to this extreme, such as grounded language learning in simulated environments, incorporating interaction, or leveraging multimodal data.

Emotion   Towards the end of the session, Omoju argued that it will be very difficult to incorporate a human element relating to emotion into embodied agents. Emotion, however, is very relevant to a deeper understanding of language. On the other hand, we might not need agents that actually possess human emotions. Stephan stated that the Turing test, after all, is defined as mimicry and sociopaths—while having no emotions—can fool people into thinking they do. We should thus be able to find solutions that do not need to be embodied and do not have emotions, but understand the emotions of people and help us solve our problems. Indeed, sensor-based emotion recognition systems have continuously improved—and we have also seen improvements in textual emotion detection systems.

Cognitive and neuroscience   An audience member asked how much knowledge of neuroscience and cognitive science are we leveraging and building into our models. Knowledge of neuroscience and cognitive science can be great for inspiration and used as a guideline to shape your thinking. As an example, several models have sought to imitate humans' ability to think fast and slow. AI and neuroscience are complementary in many directions, as Surya Ganguli illustrates in this post.

Omoju recommended to take inspiration from theories of cognitive science, such as the cognitive development theories by Piaget and Vygotsky. She also urged everyone to pursue interdisciplinary work. This sentiment was echoed by other experts. For instance, Felix Hill recommended to go to cognitive science conferences.

# NLP for low-resource scenarios

Dealing with low-data settings (low-resource languages, dialects (including social media text "dialects"), domains, etc.).  This is not a completely "open" problem in that there are already a lot of promising ideas out there; but we still don't have a universal solution to this universal problem.
– Karen Livescu

The second topic we explored was generalisation beyond the training data in low-resource scenarios. Given the setting of the Indaba, a natural focus was low-resource languages. The first question focused on whether it is necessary to develop specialised NLP tools for specific languages, or it is enough to work on general NLP.

Universal language model   Bernardt argued that there are universal commonalities between languages that could be exploited by a universal language model. The challenge then is to obtain enough data and compute to train such a language model. This is closely related to recent efforts to train a cross-lingual Transformer language model and cross-lingual sentence embeddings.

Cross-lingual representations   Stephan remarked that not enough people are working on low-resource languages. There are 1,250-2,100 languages in Africa alone, most of which have received scarce attention from the NLP community. The question of specialized tools also depends on the NLP task that is being tackled. The main issue with current models is sample efficiency. Cross-lingual word embeddings are sample-efficient as they only require word translation pairs or even only monolingual data. They align word embedding spaces sufficiently well to do coarse-grained tasks like topic classification, but don't allow for more fine-grained tasks such as machine translation. Recent efforts nevertheless show that these embeddings form an important building lock for unsupervised machine translation.

More complex models for higher-level tasks such as question answering on the other hand require thousands of training examples for learning. Transferring tasks that require actual natural language understanding from high-resource to low-resource languages is still very challenging. With the development of cross-lingual datasets for such tasks, such as XNLI, the development of strong cross-lingual models for more reasoning tasks should hopefully become easier.

Benefits and impact   Another question enquired—given that there is inherently only small amounts of text available for under-resourced languages—whether the benefits of NLP in such settings will also be limited. Stephan vehemently disagreed, reminding us that as ML and NLP practitioners, we typically tend to view problems in an information theoretic way, e.g. as maximizing the likelihood of our data or improving a benchmark. Taking a step back, the actual reason we work on NLP problems is to build systems that break down barriers. We want to build models that enable people to read news that was not written in their language, ask questions about their health when they don't have access to a doctor, etc.

Given the potential impact, building systems for low-resource languages is in fact one of the most important areas to work on. While one low-resource language may not have a lot of data, there is a long tail of low-resource languages; most people on this planet in fact speak a language that is in the low-resource regime. We thus really need to find a way to get our systems to work in this setting.

Jade opined that it is almost ironic that as a community we have been focusing on languages with a lot of data as these are the languages that are well taught around the world. The languages we should really focus on are the low-resource languages where not much data is available. The great thing about the Indaba is that people are working and making progress on such low-resource languages. Given the scarcity of data, even simple systems such as bag-of-words will have a large real-world impact. Etienne Barnard, one of the audience members, noted that he observed a different effect in real-world speech processing: Users were often more motivated to use a system in English if it works for their dialect compared to using a system in their own language.

Incentives and skills   Another audience member remarked that people are incentivized to work on highly visible benchmarks, such as English-to-German machine translation, but incentives are missing for working on low-resource languages. Stephan suggested that incentives exist in the form of unsolved problems. However, skills are not available in the right demographics to address these problems. What we should focus on is to teach skills like machine translation in order to empower people to solve these problems. Academic progress unfortunately doesn't necessarily relate to low-resource languages. However, if cross-lingual benchmarks become more pervasive, then this should also lead to more progress on low-resource languages.

Data availability   Jade finally argued that a big issue is that there are no datasets available for low-resource languages, such as languages spoken in Africa. If we create datasets and make them easily available, such as hosting them on openAFRICA, that would incentivize people and lower the barrier to entry. It is often sufficient to make available test data in multiple languages, as this will allow us to evaluate cross-lingual models and track progress. Another data source is the South African Centre for Digital Language Resources (SADiLaR), which provides resources for many of the languages spoken in South Africa.

# Reasoning about large or multiple documents

Representing large contexts efficiently. Our current models are mostly based on recurrent neural networks, which cannot represent longer contexts well. [...] The stream of work on graph-inspired RNNs is potentially promising, though has only seen modest improvements and has not been widely adopted due to them being much less straight-forward to train than a vanilla RNN.
– Isabelle Augenstein

Another big open problem is reasoning about large or multiple documents. The recent NarrativeQA dataset is a good example of a benchmark for this setting. Reasoning with large contexts is closely related to NLU and requires scaling up our current systems dramatically, until they can read entire books and movie scripts. A key question here—that we did not have time to discuss during the session—is whether we need better models or just train on more data.

Endeavours such as OpenAI Five show that current models can do a lot if they are scaled up to work with a lot more data and a lot more compute. With sufficient amounts of data, our current models might similarly do better with larger contexts. The problem is that supervision with large documents is scarce and expensive to obtain. Similar to language modelling and skip-thoughts, we could imagine a document-level unsupervised task that requires predicting the next paragraph or chapter of a book or deciding which chapter comes next. However, this objective is likely too sample-inefficient to enable learning of useful representations.

A more useful direction thus seems to be to develop methods that can represent context more effectively and are better able to keep track of relevant information while reading a document. Multi-document summarization and multi-document question answering are steps in this direction. Similarly, we can build on language models with improved memory and lifelong learning capabilities.

# Datasets, problems, and evaluation

Perhaps the biggest problem is to properly define the problems themselves. And by properly defining a problem, I mean building datasets and evaluation procedures that are appropriate to measure our progress towards concrete goals. Things would be easier if we could reduce everything to Kaggle style competitions!
– Mikel Artetxe

We did not have much time to discuss problems with our current benchmarks and evaluation settings but you will find many relevant responses in our survey. The final question asked what the most important NLP problems are that should be tackled for societies in Africa. Jade replied that the most important issue is to solve the low-resource problem. Particularly being able to use translation in education to enable people to access whatever they want to know in their own language is tremendously important.

The session concluded with general advice from our experts on other questions that we had asked them, such as "What, if anything, has led the field in the wrong direction?" and "What advice would you give a postgraduate student in NLP starting their project now?" You can find responses to all questions in the survey.

## Deep Learning Indaba 2019

If you are interested in working on low-resource languages, consider attending the Deep Learning Indaba 2019, which takes place in Nairobi, Kenya from 25-31 August 2019.

Credit: Title image text is from the NarrativeQA dataset. The image is from the slides of the NLP session.

]]>
<![CDATA[10 Exciting Ideas of 2018 in NLP]]>http://ruder.io/10-exciting-ideas-of-2018-in-nlp/5c76ff7c1b9b0d18555b9eb7Wed, 19 Dec 2018 19:28:46 GMT

This post gathers 10 ideas that I found exciting and impactful this year—and that we'll likely see more of in the future.

For each idea, I will highlight 1-2 papers that execute them well. I tried to keep the list succinct, so apologies if I did not cover all relevant work. The list is necessarily subjective and covers ideas mainly related to transfer learning and generalization. Most of these (with some exceptions) are not trends (but I suspect that some might become more 'trendy' in 2019). Finally, I would love to read about your highlights in the comments or see highlights posts about other areas.

## 1) Unsupervised MT

There were two unsupervised MT papers at ICLR 2018. They were surprising in that they worked at all, but results were still low compared to supervised systems. At EMNLP 2018, unsupervised MT hit its stride with two papers from the same two groups that significantly improve upon their previous methods. My highlight:

• Phrase-Based & Neural Unsupervised Machine Translation (EMNLP 2018):  The paper does a nice job in distilling the three key requirements for unsupervised MT: a good initialization, language modelling, and modelling the inverse task (via back-translation). All three are also beneficial in other unsupervised scenarios, as we will see below. Modelling the inverse task enforces cyclical consistency, which has been employed in different approaches—most prominently in CycleGAN. The paper performs extensive experiments and evaluates even on two low-resource language pairs, English-Urdu and English-Romanian. We will hopefully see more work on low-resource languages in the future.

## 2) Pretrained language models

Using pretrained language models is probably the most significant NLP trend this year, so I won't spend much time on it here. There have been a slew of memorable approaches: ELMo, ULMFiT, OpenAI Transformer, and BERT. My highlight:

• Deep contextualized word representations (NAACL-HLT 2018): The paper that introduced ELMo has been much lauded. Besides the impressive empirical results, where it shines is the careful analysis section that teases out the impact of various factors and analyses the information captured in the representations. The word sense disambiguation (WSD) analysis by itself (below on the left) is well executed. Both demonstrate that a LM on its own provides WSD and POS tagging performance close to the state-of-the-art.

## 3) Common sense inference datasets

Incorporating common sense into our models is one of the most important directions moving forward. However, creating good datasets is not easy and even popular ones show large biases. This year, there have been some well-executed datasets that seek to teach models some common sense such as Event2Mind and SWAG, both from the University of Washington. SWAG was solved unexpectedly quickly. My highlight:

• Visual Commonsense Reasoning (arXiv 2018): This is the first visual QA dataset that includes a rationale (an explantation) with each answer. In addition, questions require complex reasoning. The creators go to great lengths to address possible bias by ensuring that every answer's prior probability of being correct is 25% (every answer appears 4 times in the entire dataset, 3 times as an incorrect answer and 1 time as the correct answer); this requires solving a constrained optimization problem using models that compute relevance and similarity. Hopefully preventing possible bias will become a common component when creating datasets. Finally, just look at the gorgeous presentation of the data 👇.

## 4) Meta-learning

Meta-learning has seen much use in few-shot learning, reinforcement learning, and robotics—the most prominent example: model-agnostic meta-learning (MAML)—but successful applications in NLP have been rare. Meta-learning is most useful for problems with a limited number of training examples. My highlight:

• Meta-Learning for Low-Resource Neural Machine Translation (EMNLP 2018): The authors use MAML to learn a good initialization for translation, treating each language pair as a separate meta-task. Adapting to low-resource languages is probably the most useful setting for meta-learning in NLP. In particular, combining multilingual transfer learning (such as multilingual BERT), unsupervised learning, and meta-learning is a promising direction.

## 5) Robust unsupervised methods

This year, we and others have observed that unsupervised cross-lingual word embedding methods break down when languages are dissimilar. This is a common phenomenon in transfer learning where a discrepancy between source and target settings (e.g. domains in domain adaptation, tasks in continual learning and multi-task learning) leads to deterioration or failure of the model. Making models more robust to such changes is thus important. My highlight:

## 6) Understanding representations

There have been a lot of efforts in better understanding representations. In particular, 'diagnostic classifiers' (tasks that aim to measure if learned representations can predict certain attributes) have become quite common. My highlight:

In many settings, we have seen an increasing usage of multi-task learning with carefully chosen auxiliary tasks. For a good auxiliary task, data must be easily accessible. One of the most prominent examples is BERT, which uses next-sentence prediction (that has been used in Skip-thoughts and more recently in Quick-thoughts) to great effect. My highlights:

• Syntactic Scaffolds for Semantic Structures (EMNLP 2018): This paper proposes an auxiliary task that pretrains span representations by predicting for each span the corresponding syntactic constituent type. Despite being conceptually simple, the auxiliary task leads to large improvements on span-level prediction tasks such as semantic role labelling and coreference resolution. This papers shows that specialised representations learned at the level required by the target task (here: spans) are immensely beneficial.
• pair2vec: Compositional Word-Pair Embeddings for Cross-Sentence Inference (arXiv 2018): In a similar vein, this paper pretrains word pair representations by maximizing the pointwise mutual information of pairs of words with their context. This encourages the model to learn more meaningful representations of word pairs than with more general objectives, such as language modelling. The pretrained representations are effective in tasks such as SQuAD and MultiNLI that require cross-sentence inference. We can expect to see more pretraining tasks that capture properties particularly suited to certain downstream tasks and are complementary to more general-purpose tasks like language modelling.

## 8) Combining semi-supervised learning with transfer learning

With the recent advances in transfer learning, we should not forget more explicit ways of using target task-specific data. In fact, pretrained representations are complementary with many forms of semi-supervised learning. We have explored self-labelling approaches, a particular category of semi-supervised learning. My highlight:

• Semi-Supervised Sequence Modeling with Cross-View Training (EMNLP 2018): This paper shows that a conceptually very simple idea, making sure that the predictions on different views of the input agree with the prediction of the main model, can lead to gains on a diverse set of tasks. The idea is similar to word dropout but allows leveraging unlabelled data to make the model more robust. Compared to other self-ensembling models such as mean teacher, it is specifically designed for particular NLP tasks. With much work on implicit semi-supervised learning, we will hopefully see more work that explicitly tries to model the target predictions going forward.

## 9) QA and reasoning with large documents

There have been a lot of developments in question answering (QA), with an array of new QA datasets. Besides conversational QA and performing multi-step reasoning, the most challenging aspect of QA is to synthesize narratives and large bodies of information. My highlight:

• The NarrativeQA Reading Comprehension Challenge (TACL 2018): This paper proposes a challenging new QA dataset based on answering questions about entire movie scripts and books. While this task is still out of reach for current methods, models are provided the option of using a summary (rather than the entire book) as context, of selecting the answer (rather than generate it), and of using the output from an IR model. These variants make the task more feasible and enable models to gradually scale up to the full setting. We need more datasets like this that present ambitious problems, but still manage to make them accessible.

## 10) Inductive bias

Inductive biases such as convolutions in a CNN, regularization, dropout, and other mechanisms are core parts of neural network models that act as a regularizer and make models more sample-efficient. However, coming up with a broadly useful inductive bias and incorporating it into a model is challenging. My highlights:

• Sequence classification with human attention (CoNLL 2018): This paper proposes to use human attention from eye-tracking corpora to regularize attention in RNNs. Given that many current models such as Transformers use attention, finding ways to train it more efficiently is an important direction. It is also great to see another example that human language learning can help improve our computational models.
• Linguistically-Informed Self-Attention for Semantic Role Labeling (EMNLP 2018): This paper has a lot to like: a Transformer trained jointly on both syntactic and semantic tasks; the ability to inject high-quality parses at test time; and out-of-domain evaluation. It also regularizes the Transformer's multi-head attention to be more sensitive to syntax by training one attention head to attend to the syntactic parents of each token. We will likely see more examples of Transformer attention heads used as auxiliary predictors focusing on particular aspects of the input.
]]>
<![CDATA[EMNLP 2018 Highlights: Inductive bias, cross-lingual learning, and more]]>http://ruder.io/emnlp-2018-highlights/5c76ff7c1b9b0d18555b9eb4Tue, 06 Nov 2018 10:12:40 GMT

The post discusses highlights of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018).

This post originally appeared at the AYLIEN blog.

You can find past highlights of conferences here. You can find all 549 accepted papers in the EMNLP proceedings. In this review, I will focus on papers that relate to the following topics:

## Inductive bias

The inductive bias of a machine learning algorithm is the set of assumptions that the model makes in order to generalize to new inputs. For instance, the inductive bias obtained through multi-task learning encourages the model to prefer hypotheses (sets of parameters) that explain more than one task.

• Inductive bias was the main theme during Max Welling's keynote at CoNLL 2018. The two key takeaways from his talk are:
Lesson 1: If there is symmetry in the input space, exploit it.

The most canonical example for exploiting such symmetry are convolutional neural networks, which are translation invariant. Invariance in general means that an object is recognized as an object even if its appearance varies in some way. Group equivariant convolutional networks and Steerable CNNs similarly are rotation invariant (see below). Given the success of CNNs in computer vision, it is a compelling research direction to think of what types of invariance are possible in language and how these can be implemented in neural networks.

Lesson 2: When you know the generative process, you should exploit it.

For many problems the generative process is known but the inverse process of reconstructing the original input is not. Examples of such inverse problems are MRI, image denoising and super-resolution, but also audio-to-speech decoding and machine translation. The Recurrent Inference Machine (RIM) uses an RNN to iteratively generate an incremental update to the input until a sufficiently good estimate of the true signal has been reached, which can be seen for MRI below. This can be seen as similar to producing text via editing, rewriting, and iterative refining.

• A popular way to obtain certain types of invariance in current NLP approaches is via adversarial examples. To this end, Alzantot et al. use a black-box population-based optimization algorithm to generate semantic and syntactic adversarial examples.
• Minervini and Riedel propose to incorporate logic to generate adversarial examples. In particular, they use a combination of combinatorial optimization and a language model to generate examples that maximally violate such logic constraints for natural language inference.
• Another form of inductive bias can be induced via regularization. In particular, Barrett et al. received the special paper award at CoNLL for showing that human attention provides a good inductive bias for attention in neural networks. The human attention is derived from eye-tracking corpora, which—importantly—can be disjoint from the training data.
• For another beneficial inductive bias for attention, in one of the best papers of the conference, Strubell et al. encourage one attention head to attend to the syntactic parents for each token in multi-head attention. They additionally use multi-task learning and allow the injection of a syntactic parse at test time.
• Many NLP tasks such as entailment and semantic similarity compute some sort of alignment between two sequences, but this alignment is either at the word or sentence level. Liu et al. propose to incorporate a structural bias by using structured alignments, which match spans in both sequences to each other.
• Tree-based have been popular in NLP and encode the bias that knowledge of syntax is beneficial. Shi et al. analyze a phenomenon that runs counter to this, which is that trivial trees with no syntactic information often achieve better results than syntactic trees. Their key insight is that in well-performing trees, crucial words are closer to the final representation, which helps in mitigating RNNs' sequential recency bias.
• For aspect-based sentiment analysis, sentence representations are typically computed separately from aspect representations. Huang and Carley propose a nice way to condition the sentence representation on the aspect by using the aspect representation as the parameters of the filters or gates in a CNN. Allowing encoded representations to directly parameterize other parts of a neural network might be useful for other applications, too.

## Cross-lingual learning

There are roughly 6,500 languages spoken around the world. Despite this, the predominant focus of research is on English. This seems to change perceptibly as more papers are investigating cross-lingual settings.

• In her CoNLL keynote, Asifa Majid gives an insightful overview of how culture and language can shape our internal representation of concepts. A common example of this is Scottish having 421 terms for snow. This phenomenon not only applies to our environment, but also to how we talk about ourselves and our bodies.
Languages vary surprisingly in the parts of the body they single out for naming. Variations in part of the lexicon can have knock-on effects for other parts.

If you ask speakers of different languages to color in different body parts in a picture, the body parts that are associated with each term depend on the language. In Dutch, the hand is often considered to be part of the term 'arm', whereas in Japanese, the arm is more clearly delimited. Indonesian, lacking an everyday term that corresponds to 'hand', associates 'arm' with both the hand and the arm as can be seen below.

The representations we obtain from language influence every form of perception. Hans Henning claimed that "olfactory (i.e. related to smell) abstraction is impossible". Most languages lack terms describing specific scents and odours. In contrast, the Jahai, a people of hunter-gatherers in Malaysia, have half a dozen terms for different qualities of smell, which allow them to identify smells much more precisely (Majid et al., 2018).

There was a surprising amount of work on cross-lingual word embeddings at the conference. Taking insights from Asifa's talk, it will be interesting to incorporate insights from psycholinguistics in how we model words across languages and different cultures, as cross-lingual embeddings have mostly focused on word-to-word alignment and so far did not even consider polysemy.

• For cross-lingual word embeddings, Kementchedjhieva et al. show that mapping the languages onto a third, latent space (the mean of the monolingual embedding spaces) rather than directly onto each other, makes it easier to learn an alignment. This approach also naturally enables the integration of supporting languages in low resource scenarios. (Note: I'm a co-author on this paper.)
• With a similar goal in mind, Doval et al. propose to move each word vector towards the mean between its current representation and the representation of the translation in a separate refinement step.
• Similar to using multilingual support, Chen and Cardie propose to jointly learn cross-lingual embeddings between multiple languages by modeling the relations between all language pairs.
• Hartmann et al. analyze an observation of our ACL 2018 paper: Aligning embedding spaces induced with different algorithms does not work. They show, however, that a linear transformation still exists and hypothesize that the optimization problem of learning this transformation might be complicated by the algorithms' different inductive biases.
• Not only word embedding spaces induced by different algorithms, but also word embedding spaces in different languages have different structures, especially for distant languages. Nakashole proposes to learn a transformation that is sensitive to the local neighborhood, which is particularly beneficial for distant languages.
• For the same problem, Hoshen and Wolf propose to first align the second moment of the word distributions and then iteratively refine the alignment.
• Alvarez-Melis and Jaakkola offer a different perspective on word-to-word translation by viewing it as an optimal transport problem. They use the Gromov-Wasserstein distance to measure similarities between pairs of words across languages.
• Xu et al. instead propose to minimize the Sinkhorn distance between the source and target distributions.
• Huang et al. go beyond word alignment with their approach. They introduce multiple cluster-level alignments and additionally enforce the clusters to be consistently distributed across multiple languages.
• In one of the best papers of the conference, Lample et al. proposes an unsupervised phrase-based machine translation model, which works particularly well for low-resource languages. On Urdu-English, it outperforms a supervised phrase-based model trained on 800,000 noisy and out-of-domain parallel sentences.
• Artetxe et al. propose a similar phrase-based approach to unsupervised machine translation.

## Word embeddings

Besides cross-lingual word embeddings, there was naturally also work investigating and improving word embeddings, but this seemed to be a lot less pervasive than in past years.

• Zhuang et al. propose to use second-order co-occurrence relations to train word embeddings via a newly designed metric.
• Zhao et al. propose to learn word embeddings for out-of-vocabulary words by viewing words as bags of character n-grams.
• Bosc and Vincent learn word embeddings by reconstructing dictionary definitions.
• Zhao et al. learn gender-neutral word embeddings rather than removing the bias from trained embeddings. Their approach allocates certain dimensions of the embedding to gender information, while it keeps the remaining dimensions gender-neutral.

## Latent variable models

Latent variable models are slowly emerging as a useful tool to express a structural inductive bias and to model the linguistic structure of words and sentences.

• Kim et al. provided an excellent tutorial of deep latent variable models. The slides can be found here.
• In his talk at the BlackBox NLP workshop, Graham Neubig highlighted latent variables as a way to model the latent linguistic structure of text with neural network. In particular, he discussed multi-space variational encoder-decoders and tree-structured variational auto-encoders, two semi-supervised learning models that leverage latent variables to take advantage of unlabeled data.
• In our paper, we showed how cross-lingual embedding methods can be seen as latent variable models. We can use this insight to derive an EM algorithm and learn a better alignment between words.
• Dou et al. similarly propose a latent variable model based on a variational auto-encoder for unsupervised bilingual lexicon induction.
• In the model by Zhang et al., sentences are viewed as latent variables for summarization. Sentences with activated variables are extracted and directly used to infer gold summaries.
• There were also papers that proposed methods for more general applications. Xu and Durrett propose to use a different distribution in variational auto-encoders that mitigates the common failure mode of a collapsing KL divergence.
• Niculae et al. propose a new approach to build dynamic computation graphs with latent structure through sparsity.

## Language models

Language models are becoming more commonplace in NLP and many papers investigated different architectures and properties of such models.

• In an insightful paper, Peters et al. show that LSTMs, CNNs, and self-attention language models all learn high-quality representations. They additionally show that the representations vary with network depth: morphological information is encoded at the word embedding layer; local syntax is captured at lower layers and longer-range semantics are encoded at the upper layers.
• Tran et al. show that LSTMs generalize hierarchical structure better than self-attention. This hints at possible limitations of the Transformer architecture and suggests that we might need different encoding architectures for different tasks.
• Tang et al. find that the Transformer and CNNs are not better than RNNs at modeling long-distance agreement. However, models relying on self-attention excel at word sense disambiguation.
• Other papers looks at different properties of language models. Amrami and Goldberg show that language models can achieve state-of-the-art for unsupervised word sense induction. Importantly, rather than just providing the left and right context to the word, they find that appending "and" provides more natural and better results. It will be interesting to see what other clever uses we will find for LMs.
• Krishna et al. show that ELMo performs better than logic rules on sentiment analysis tasks. They also demonstrate that language models can implicitly learn logic rules.
• In the best paper at the BlackBoxNLP workshop, Giulianelli et al. use diagnostic classifiers to keep track and improve number agreement in language models.
• In another BlackBoxNLP paper, Wilcox et al. show that RNN language models can represent filler-gap dependencies and learn a particular subset of restrictions known as island constraints.

## Datasets

Many new tasks and datasets were presented at the conference, many of which propose more challenging settings.

• Grounded common sense inference: SWAG contains 113k multiple choice questions about a rich spectrum of grounded situations.
• Coreference resolution: PreCo contains 38k documents and 12.5M words, which are mostly from the vocabulary of English-speaking preschoolers.
• Document grounded dialogue: The dataset by Zhou et al. contains 4112 conversations with an average of 21.43 turns per conversation.
• Automatic story generation from videos: VideoStory contains 20k social media videos amounting to 396 hours of video with 123k sentences, temporally aligned to the video.
• Sequential open-domain question answering: QBLink contains 18k question sequences, with each sequence consisting of three naturally occurring human-authored questions.
• Multimodal reading comprehension: RecipeQA consists of 20k instructional recipes with multiple modalities such as titles, descriptions and aligned set of images and 36k automatically generated question-answer pairs.
• Word similarity: CARD-660 consists of 660 manually selected rare words with manually selected paired words and expert annotations.
• Cloze style question answering: CLOTH consists of 7,131 passages and 99,433 questions used in middle-school and high-school language exams.
• Open book question answering: OpenBookQA consists of 6,000 questions and 1,326 elementary level science facts.
• Semantic parsing and text-to-SQL: Spider contains 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains.
• Few-shot relation classification: FewRel consists of 70k sentences on 100 relations derived from Wikipedia.
• Natural language inference: MedNLI consists of 14k sentence pairs in the clinical domain.
• Multilingual natural language inference: XNLI extends the MultiNLI dataset to 15 languages.
• Task-oriented dialogue modeling: MultiWOZ, which won the best resource paper award, is a Wizard-of-Oz style dataset consisting of 10k human-human written conversations spanning over multiple domains and topics.

Papers also continued the trend of ACL 2018 of analyzing the limitations of existing datasets and metrics.

• Text simplification: Sulem et al. show that BLEU is not a good evaluation metric for sentence splitting, the most common operation in text simplification.
• Text-to-SQL: Yavuz et al. show what it takes to achieve 100% accuracy on the WikiSQL benchmark.
• Reading comprehension: Kaushik and Lipton show in the best short paper that models that only rely on the passage or the last sentence for prediction do well on many reading comprehension tasks.

## Miscellaneous

These are papers that provide a refreshing take or tackle an unusual problem, but do not fit any of the above categories.

• Stanovsky and Hopkins propose a novel way to test word representations. Their approach uses Odd-Man-Out puzzles, which consists of 5 (or more) words and require the system to choose the word that does not belong with the others. They show that such a simple setup can reveal various properties of different representations.
• A similarly playful way to test the associative properties of word embeddings is proposed by Shen et al. They use a simplified version of the popular game Codenames. In their setting, a speaker has to select an adjective to refer to two out of three nouns, which then need to be identified by the listener.
• Causal understanding is important for many real-world application, but causal inference has so far not found much adoption in NLP. Wood-Doughty et al. demonstrate how causal analyses can be conducted for text classification and discuss opportunities and challenges and for future work.
• Gender bias and equal opportunity are big issues in STEM. Schluter argues that a glass ceiling exists in NLP, which prevents high achieving women and minorities from obtaining equal access to opportunities. While the field of NLP has been consistently ~33% female, Schluter analyzes 52 years of NLP publication data consisting of 15.6k gender-labeled authors and observes that the growth level of female seniority standing (indicated by last-authorship on a paper) falls significantly below that of the male population, with a gap that is widening.
• Shillingford and Jones tackle both an interesting problem and utilize a refreshing approach. They seek to recover missing characters for long vowels and glottal stops in Old Hawaiian Writing, which are important for reading comprehension and pronunciation. They propose to compose a finite-state transducer—which incorporates domain information—with a neural network. Importantly, their approach only requires modern Hawaiian texts for training.

## Other reviews

You might also find these other reviews of EMNLP 2018 helpful:

]]>
<![CDATA[HackerNoon Interview]]>http://ruder.io/hackernoon-interview/5c76ff7c1b9b0d18555b9eafTue, 02 Oct 2018 10:40:18 GMT

This post is an interview by fast.ai fellow Sanyam Bhutani with me.

This post originally appeared at HackerNoon with a different introduction.

I had the honour to be interviewed by Sanyam Bhutani, a Deep Learning and Computer Vision practitioner and fast.ai fellow who's been doing a series interviewing people that inspire him. To be honest, it feels surreal to be the one being interviewed. I hope my answers may be interesting or useful to some of you.

Sanyam: Hello Sebastian, Thank you for taking the time to do this.

Sebastian: Thanks for having me.

Sanyam: You’re working as a research scientist today at AYLIEN, and you’re a Ph.D. student at Insight Research Centre for Data Analytics. Could you tell the readers about how you got started? What got you interested in NLP and Deep Learning?

Sebastian: I was really into maths and languages when I was in high school and took part in competitions. For my studies, I wanted to combine the logic of maths with the creativity of language somehow but didn’t know if such a field existed. That’s when I came across Computational Linguistics, which seemed to be a perfect fit at the intersection of computer science and linguistics. I then did my Bachelor’s in Computational Linguistics at the University of Heidelberg in Germany, one of my favourite places in Europe. During my Bachelor’s, I got most excited by machine learning, so I tried to get as much as exposure to ML as possible via internships and online courses. I only heard about word2vec as I was finishing my undergrad in 2015; as I learned more about Deep Learning at the start of my Ph.D. later that year, it seemed to be most exciting direction, so I decided to focus on it.

Sebastian: After graduating, I was planning to get some industry experience first by working in a startup. A PhD was always something I had dreamed of, but I hadn’t seriously considered it at that point. When I discussed working with the Dublin-based NLP startup Aylien, they told me about the Employment-based Postgraduate Programme, a PhD programme that is hosted jointly by a university and a company, which seemed like the perfect fit for me. Combining research and industry work can be challenging at times, but has been rewarding for me overall. Most importantly, there should be a fit with the company.

Sanyam: You’ve been working as a researcher for 3 years now. What has been your favorite project during these years?

Sebastian: In terms of learning, delving into a new area where I don’t know much, reading papers, and getting to collaborate with great people. In this vein, my project working on multi-task learning at the University of Copenhagen was a great and very stimulating experience. In terms of impact, being able to work with Jeremy, interacting with the fastai community, and seeing that people find our work on language models useful.

Sanyam: Natural Language Processing has arguably lagged behind Computer Vision. What are your thoughts about the current scenario? Is it a good time to get started as an NLP Practitioner?

Sebastian: I think now is a great time to get started with NLP. Compared to a couple of years ago, we’re at a point of maturity where you’re not limited to just using word embeddings or off-the-shelf models, but you can compose your model from a wide array of components, such as different layers, pretrained representations, auxiliary losses, etc. There also seems to be a growing feeling in the community that many of the canonical problems (POS tagging and dependency parsing on the Penn Treebank, sentiment analysis on movie reviews, etc.) are close to being solved, so we really want to make progress on more challenging problems, such as “real” natural language understanding and creating models that truly generalize. For these problems, I think we can really benefit from people with new perspectives and ideas. In addition, as we can now train models for many useful tasks such as classification or sequence labelling with good accuracy, there are a lot of opportunities for applying and adapting these models to other languages. If you’re a speaker of another language, you can make a big difference by creating datasets others can use for evaluation and training models for that language.

Sanyam: For the readers and the beginners who are interested in working on Natural Language Processing, what would be your best advice?

Sebastian: Find a task you’re interested in for instance by browsing the tasks on NLP-progress. If you’re interested in doing research, try to choose a particular subproblem not everyone is working on. For instance, for sentiment analysis, don’t work on movie reviews but conversations. For summarization, summarize biomedical papers rather than news articles. Read papers related to the task and try to understand what the state-of-the-art does. Prefer tasks that have open-source implementations available that you can run. Once you have a good handle of how something works, for research, reflect if you were surprised by any choices in the paper. Try to understand what kind of errors the model makes and if you can think of any information that could be used to mitigate them. Doing error and ablation analyses or using synthetic tasks that gauge if a model captures a certain kind of information are great ways to do this.

If you have an idea how to make the task more challenging or realistic, try to create a dataset and apply the existing model to that task. Try to recreate the dataset in your language and see if the model performs equally well.

Sanyam: Many job boards (For DL/ML) require the applicants to be post-grads or have research experience. For the readers who want to take up Machine Learning as a Career path, do you feel having research experience is a necessity?

Sebastian: I think research experience can be a good indicator that you’re proficient with certain models and creative, innovative to come up with new solutions. You don’t need to do a Ph.D. or a research fellowship to learn these skills, though. Being proactive, learning about and working on a problem that you’re excited about, trying to improve the model, and writing about your experience is a good way to get started and demonstrate similar skills. In most applied ML settings, you won’t be required to come up with totally new ways to solve a task. Doing ML and data science competitions can thus similarly help you demonstrate that you know how to apply ML models in practice.

Sanyam: Given the explosive growth rates in research, How do you stay up to date with the cutting edge?

Sebastian: I’ve been going through the arXiv daily update, adding relevant papers to my reading list, and reading them in batches. Jeff Dean recently said during a talk at the Deep Learning Indaba that he thinks it’s better to read ten abstracts than one paper in-depth as you can always go back and read one of the papers in-depth. I agree with him. I think you want to read widely about as many ideas as possible, which you can catalogue and use for inspiration later. Having a good paper management system is key. I’ve been using Mendeley. Lately, I’ve been relying more on Arxiv Sanity Preserver to surface relevant papers.

Sanyam: You also maintain a great blog, which I’m a great fan of. Could you share some tips on effectively writing technical articles?

Sebastian: I’ve had the best experience writing a blog when I started out writing it for myself to understand a particular topic better. If you ever find yourself having to put in a lot of work to build intuition or do a lot of research to grasp a subject, consider writing a post about it so you can accelerate everyone else’s learning in the future. In research papers, there’s usually not enough space to properly contextualize a work, highlight motivations, and intuitions, etc. Blog posts are a great way to make technical content more accessible and approachable.

The great thing about a blog is that it doesn’t need to be perfect. You can use it to improve your communication skills as well as obtain feedback on your ideas and things you might have missed. In terms of writing, I think the most important thing I have learned is to be biased towards clarity. Try to be as unambiguous as possible. Remove sentences that don’t add much value. Remove vague adjectives. Write only about what the data shows and if you speculate, clearly say so.
Get feedback on your draft from your friends and colleagues. Don’t try to make something 100% perfect, but get it to a point where you’re happy with it. Feeling anxiety when clicking that ‘Publish’ button is totally normal and doesn’t go away. Publishing something will always be worth it in the long-term.

Sanyam: Do you feel Machine Learning has been overhyped?

Sebastian: No.

Sanyam: Before we conclude, any tips for the beginners who are afraid to get started because of the idea that Deep Learning is an advanced field?

Sebastian: Don’t let anyone tell you that you can’t do this. Do online courses to build your understanding. Once you’re comfortable with the fundamentals, read papers for inspiration when you have time. Choose something you’re excited about, choose a library, and work on it. Don’t think you need massive compute to work on meaningful problems. Particularly in NLP, there are lot of problems with a small number of labelled examples. Write about what you’re doing and learning. Reach out to people with similar interests and focus areas. Engage with the community, e.g. the fastai community is awesome. Get on Twitter. Twitter has a great ML community and you can often get replies from top experts in the field way faster than via email. Find a mentor. If you write to someone for advice, be mindful of their time. Be respectful and try to help others. Be generous with praise and cautious with criticism.

Sanyam: Thank you so much for doing this interview.

The cover image for this post was generated based on the content of the post using wordcouds.

]]>
<![CDATA[A Review of the Neural History of Natural Language Processing]]>http://ruder.io/a-review-of-the-recent-history-of-nlp/5c76ff7c1b9b0d18555b9eaeMon, 01 Oct 2018 12:50:00 GMT

This post discusses major recent advances in NLP focusing on neural network-based methods.

This post originally appeared at the AYLIEN blog.

This is the first blog post in a two-part series. The series expands on the Frontiers of Natural Language Processing session organized by Herman Kamper and me at the Deep Learning Indaba 2018. Slides of the entire session can be found here. This post discusses major recent advances in NLP focusing on neural network-based methods. The second post discusses open problems in NLP. You can find a recording of the talk this post is based on here.

Disclaimer   This post tries to condense ~15 years' worth of work into eight milestones that are the most relevant today and thus omits many relevant and important developments. In particular, it is heavily skewed towards current neural approaches, which may give the false impression that no other methods were influential during this period. More importantly, many of the neural network models presented in this post build on non-neural milestones of the same era. In the final section of this post, we highlight such influential work that laid the foundations for later methods.

# 2001 - Neural language models

Language modelling is the task of predicting the next word in a text given the previous words. It is probably the simplest language processing task with concrete practical applications such as intelligent keyboards, email response suggestion (Kannan et al., 2016), spelling autocorrection, etc. Unsurprisingly, language modelling has a rich history. Classic approaches are based on n-grams and employ smoothing to deal with unseen n-grams (Kneser & Ney, 1995).
The first neural language model, a feed-forward neural network was proposed in 2001 by Bengio et al., shown in Figure 1 below.

This model takes as input vector representations of the $$n$$ previous words, which are looked up in a table $$C$$. Nowadays, such vectors are known as word embeddings. These word embeddings are concatenated and fed into a hidden layer, whose output is then provided to a softmax layer. For more information about the model, have a look at this post.
More recently, feed-forward neural networks have been replaced with recurrent neural networks (RNNs; Mikolov et al., 2010) and long short-term memory networks (LSTMs; Graves, 2013) for language modelling. Many new language models that extend the classic LSTM have been proposed in recent years (have a look at this page for an overview). Despite these developments, the classic LSTM remains a strong baseline (Melis et al., 2018). Even Bengio et al.'s classic feed-forward neural network is in some settings competitive with more sophisticated models as these typically only learn to consider the most recent words (Daniluk et al., 2017). Understanding better what information such language models capture consequently is an active research area (Kuncoro et al., 2018; Blevins et al., 2018).

Language modelling is typically the training ground of choice when applying RNNs and has succeeded at capturing the imagination, with many getting their first exposure via Andrej's blog post. Language modelling is a form of unsupervised learning, which Yann LeCun also calls predictive learning and cites as a prerequisite to acquiring common sense (see here for his Cake slide from NIPS 2016). Probably the most remarkable aspect about language modelling is that despite its simplicity, it is core to many of the later advances discussed in this post:

• Word embeddings: The objective of word2vec is a simplification of language modelling.
• Sequence-to-sequence models: Such models generate an output sequence by predicting one word at a time.
• Pretrained language models: These methods use representations from language models for transfer learning.

This conversely means that many of the most important recent advances in NLP reduce to a form of language modelling. In order to do "real" natural language understanding, just learning from the raw form of text likely will not be enough and we will need new methods and models.

Multi-task learning is a general method for sharing parameters between models that are trained on multiple tasks. In neural networks, this can be done easily by tying the weights of different layers. The idea of multi-task learning was first proposed in 1993 by Rich Caruana and was applied to road-following and pneumonia prediction (Caruana, 1998). Intuitively, multi-task learning encourages the models to learn representations that are useful for many tasks. This is particularly useful for learning general, low-level representations, to focus a model's attention or in settings with limited amounts of training data. For a more comprehensive overview of multi-task learning, have a look at this post.

Multi-task learning was first applied to neural networks for NLP in 2008 by Collobert and Weston. In their model, the look-up tables (or word embedding matrices) are shared between two models trained on different tasks, as depicted in Figure 2 below.

Sharing the word embeddings enables the models to collaborate and share general low-level information in the word embedding matrix, which typically makes up the largest number of parameters in a model. The 2008 paper by Collobert and Weston proved influential beyond its use of multi-task learning. It spearheaded ideas such as pretraining word embeddings and using convolutional neural networks (CNNs) for text that have only been widely adopted in the last years. It won the test-of-time award at ICML 2018 (see the test-of-time award talk contextualizing the paper here).

Multi-task learning is now used across a wide range of NLP tasks and leveraging existing or "artificial" tasks has become a useful tool in the NLP repertoire. For an overview of different auxiliary tasks, have a look at this post. While the sharing of parameters is typically predefined, different sharing patterns can also be learned during the optimization process (Ruder et al., 2017). As models are being increasingly evaluated on multiple tasks to gauge their generalization ability, multi-task learning is gaining in importance and dedicated benchmarks for multi-task learning have been proposed recently (Wang et al., 2018; McCann et al., 2018).

# 2013 - Word embeddings

Sparse vector representations of text, the so-called bag-of-words model have a long history in NLP. Dense vector representations of words or word embeddings have been used as early as 2001 as we have seen above. The main innovation that was proposed in 2013 by Mikolov et al. was to make the training of these word embeddings more efficient by removing the hidden layer and approximating the objective. While these changes were simple in nature, they enabled---together with the efficient word2vec implementation---large-scale training of word embeddings.

Word2vec comes in two flavours that can be seen in Figure 3 below: continuous bag-of-words (CBOW) and skip-gram. They differ in their objective: one predicts the centre word based based on the surrounding words, while the other does the opposite.

While these embeddings are no different conceptually than the ones learned with a feed-forward neural network, training on a very large corpus enables them to approximate certain relations between words such as gender, verb tense, and country-capital relations, which can be seen in Figure 4 below.

These relations and the meaning behind them sparked initial interest in word embeddings and many studies have investigated the origin of these linear relationships (Arora et al., 2016; Mimno & Thompson, 2017; Antoniak & Mimno, 2018; Wendlandt et al., 2018). However, later studies showed that the learned relations are not without bias (Bolukbasi et al., 2016). What cemented word embeddings as a mainstay in current NLP was that using pretrained embeddings as initialization was shown to improve performance across a wide range of downstream tasks.

While the relations word2vec captured had an intuitive and almost magical quality to them, later studies showed that there is nothing inherently special about word2vec: Word embeddings can also be learned via matrix factorization (Pennington et al, 2014; Levy & Goldberg, 2014) and with proper tuning, classic matrix factorization approaches like SVD and LSA achieve similar results (Levy et al., 2015).

Since then, a lot of work has gone into exploring different facets of word embeddings (as indicated by the staggering number of citations of the original paper). Have a look at this post for some trends and future directions. Despite many developments, word2vec is still a popular choice and widely used today. Word2vec's reach has even extended beyond the word level: skip-gram with negative sampling, a convenient objective for learning embeddings based on local context, has been applied to learn representations for sentences (Mikolov & Le, 2014; Kiros et al., 2015)---and even going beyond NLP---to networks (Grover & Leskovec, 2016) and biological sequences (Asgari & Mofrad, 2015), among others.

One particularly exciting direction is to project word embeddings of different languages into the same space to enable (zero-shot) cross-lingual transfer. It is becoming increasingly possible to learn a good projection in a completely unsupervised way (at least for similar languages) (Conneau et al., 2018; Artetxe et al., 2018; Søgaard et al., 2018), which opens applications for low-resource languages and unsupervised machine translation (Lample et al., 2018; Artetxe et al., 2018). Have a look at (Ruder et al., 2018) for an overview.

# 2013 - Neural networks for NLP

2013 and 2014 marked the time when neural network models started to get adopted in NLP. Three main types of neural networks became the most widely used: recurrent neural networks, convolutional neural networks, and recursive neural networks.

Recurrent neural networks   Recurrent neural networks (RNNs) are an obvious choice to deal with the dynamic input sequences ubiquitous in NLP. Vanilla RNNs (Elman, 1990) were quickly replaced with the classic long short-term memory networks (Hochreiter & Schmidhuber, 1997), which proved more resilient to the vanishing and exploding gradient problem. Before 2013, RNNs were still thought to be difficult to train; Ilya Sutskever's PhD thesis was a key example on the way to changing this reputation. A visualization of an LSTM cell can be seen in Figure 5 below. A bidirectional LSTM (Graves et al., 2013) is typically used to deal with both left and right context.

Convolutional neural networks   With convolutional neural networks (CNNs) being widely used in computer vision, they also started to get applied to language (Kalchbrenner et al., 2014; Kim et al., 2014). A convolutional neural network for text only operates in two dimensions, with the filters only needing to be moved along the temporal dimension. Figure 6 below shows a typical CNN as used in NLP.

An advantage of convolutional neural networks is that they are more parallelizable than RNNs, as the state at every timestep only depends on the local context (via the convolution operation) rather than all past states as in the RNN. CNNs can be extended with wider receptive fields using dilated convolutions to capture a wider context (Kalchbrenner et al., 2016). CNNs and LSTMs can also be combined and stacked (Wang et al., 2016) and convolutions can be used to speed up an LSTM (Bradbury et al., 2017).

Recursive neural networks   RNNs and CNNs both treat the language as a sequence. From a linguistic perspective, however, language is inherently hierarchical: Words are composed into higher-order phrases and clauses, which can themselves be recursively combined according to a set of production rules. The linguistically inspired idea of treating sentences as trees rather than as a sequence gives rise to recursive neural networks (Socher et al., 2013), which can be seen in Figure 7 below.

Recursive neural networks build the representation of a sequence from the bottom up in contrast to RNNs who process the sentence left-to-right or right-to-left. At every node of the tree, a new representation is computed by composing the representations of the child nodes. As a tree can also be seen as imposing a different processing order on an RNN, LSTMs have naturally been extended to trees (Tai et al., 2015).

Not only RNNs and LSTMs can be extended to work with hierarchical structures. Word embeddings can be learned based not only on local but on grammatical context (Levy & Goldberg, 2014); language models can generate words based on a syntactic stack (Dyer et al., 2016); and graph-convolutional neural networks can operate over a tree (Bastings et al., 2017).

# 2014 - Sequence-to-sequence models

In 2014, Sutskever et al. proposed sequence-to-sequence learning, a general framework for mapping one sequence to another one using a neural network. In the framework, an encoder neural network processes a sentence symbol by symbol and compresses it into a vector representation; a decoder neural network then predicts the output symbol by symbol based on the encoder state, taking as input at every step the previously predicted symbol as can be seen in Figure 8 below.

Machine translation turned out to be the killer application of this framework. In 2016, Google announced that it was starting to replace its monolithic phrase-based MT models with neural MT models (Wu et al., 2016). According to Jeff Dean, this meant replacing 500,000 lines of phrase-based MT code with a 500-line neural network model.

This framework due to its flexibility is now the go-to framework for natural language generation tasks, with different models taking on the role of the encoder and the decoder. Importantly, the decoder model can not only be conditioned on a sequence, but on arbitrary representations. This enables for instance generating a caption based on an image (Vinyals et al., 2015) (as can be seen in Figure 9 below), text based on a table (Lebret et al., 2016), and a description based on source code changes (Loyola et al., 2017), among many other applications.

Sequence-to-sequence learning can even be applied to structured prediction tasks common in NLP where the output has a particular structure. For simplicity, the output is linearized as can be seen for constituency parsing in Figure 10 below. Neural networks have demonstrated the ability to directly learn to produce such a linearized output given sufficient amount of training data for constituency parsing (Vinyals et al, 2015), and named entity recognition (Gillick et al., 2016), among others.

Encoders for sequences and decoders are typically based on RNNs but other model types can be used. New architectures mainly emerge from work in MT, which acts as a Petri dish for sequence-to-sequence architectures. Recent models are deep LSTMs (Wu et al., 2016), convolutional encoders (Kalchbrenner et al., 2016; Gehring et al., 2017), the Transformer (Vaswani et al., 2017), which will be discussed in the next section, and a combination of an LSTM and a Transformer (Chen et al., 2018).

# 2015 - Attention

Attention (Bahdanau et al., 2015) is one of the core innovations in neural MT (NMT) and the key idea that enabled NMT models to outperform classic phrase-based MT systems. The main bottleneck of sequence-to-sequence learning is that it requires to compress the entire content of the source sequence into a fixed-size vector. Attention alleviates this by allowing the decoder to look back at the source sequence hidden states, which are then provided as a weighted average as additional input to the decoder as can be seen in Figure 11 below.

Different forms of attention are available (Luong et al., 2015). Have a look here for a brief overview. Attention is widely applicable and potentially useful for any task that requires making decisions based on certain parts of the input. It has been applied to consituency parsing (Vinyals et al., 2015), reading comprehension (Hermann et al., 2015), and one-shot learning (Vinyals et al., 2016), among many others. The input does not even need to be a sequence, but can consist of other representations as in the case of image captioning (Xu et al., 2015), which can be seen in Figure 12 below. A useful side-effect of attention is that it provides a rare---if only superficial---glimpse into the inner workings of the model by inspecting which parts of the input are relevant for a particular output based on the attention weights.

Attention is also not restricted to just looking at the input sequence; self-attention can be used to look at the surrounding words in a sentence or document to obtain more contextually sensitive word representations. Multiple layers of self-attention are at the core of the Transformer architecture (Vaswani et al., 2017), the current state-of-the-art model for NMT.

# 2015 - Memory-based networks

Attention can be seen as a form of fuzzy memory where the memory consists of the past hidden states of the model, with the model choosing what to retrieve from memory. For a more detailed overview of attention and its connection to memory, have a look at this post. Many models with a more explicit memory have been proposed. They come in different variants such as Neural Turing Machines (Graves et al., 2014), Memory Networks (Weston et al., 2015) and End-to-end Memory Networks (Sukhbaatar et al., 2015), Dynamic Memory Networks (Kumar et al., 2015), the Neural Differentiable Computer (Graves et al., 2016), and the Recurrent Entity Network (Henaff et al., 2017).

Memory is often accessed based on similarity to the current state similar to attention and can typically be written to and read from. Models differ in how they implement and leverage the memory. For instance, End-to-end Memory Networks process the input multiple times and update the memory to enable multiple steps of inference. Neural Turing Machines also have a location-based addressing, which allows them to learn simple computer programs like sorting. Memory-based models are typically applied to tasks, where retaining information over longer time spans should be useful such as language modelling and reading comprehension. The concept of a memory is very versatile: A knowledge base or table can function as a memory, while a memory can also be populated based on the entire input or particular parts of it.

# 2018 - Pretrained language models

Pretrained word embeddings are context-agnostic and only used to initialize the first layer in our models. In recent months, a range of supervised tasks has been used to pretrain neural networks (Conneau et al., 2017; McCann et al., 2017; Subramanian et al., 2018). In contrast, language models only require unlabelled text; training can thus scale to billions of tokens, new domains, and new languages. Pretrained language models were first proposed in 2015 (Dai & Le, 2015); only recently were they shown to be beneficial across a diverse range of tasks. Language model embeddings can be used as features in a target model (Peters et al., 2018) or a language model can be fine-tuned on target task data (Ramachandran et al., 2017; Howard & Ruder, 2018). Adding language model embeddings gives a large improvement over the state-of-the-art across many different tasks as can be seen in Figure 13 below.

Pretrained language models have been shown enable learning with significantly less data. As language models only require unlabelled data, they are particularly beneficial for low-resource languages where labelled data is scarce. For more information about the potential of pretrained language models, refer to this article.

# Other milestones

Some other developments are less pervasive than the ones mentioned above, but still have wide-ranging impact.

Character-based representations   Using a CNN or an LSTM over characters to obtain a character-based word representation is now fairly common, particularly for morphologically rich languages and tasks where morphological information is important or that have many unknown words. Character-based representations were first used for part-of-speech tagging and language modeling (Ling et al., 2015) and dependency parsing (Ballesteros et al., 2015). They later became a core component of models for sequence labeling (Lample et al., 2016; Plank et al., 2016) and language modeling (Kim et al., 2016). Character-based representations alleviate the need of having to deal with a fixed vocabulary at increased computational cost and enable applications such as fully character-based NMT (Ling et al., 2016; Lee et al., 2017).

Adversarial learning   Adversarial methods have taken the field of ML by storm and have also been used in different forms in NLP. Adversarial examples are becoming increasingly widely used not only as a tool to probe models and understand their failure cases, but also to make them more robust (Jia & Liang, 2017). (Virtual) adversarial training, that is, worst-case perturbations (Miyato et al., 2017; Yasunaga et al., 2018) and domain-adversarial losses (Ganin et al., 2016; Kim et al., 2017) are useful forms of regularization that can equally make models more robust. Generative adversarial networks (GANs) are not yet too effective for natural language generation (Semeniuta et al., 2018), but are useful for instance when matching distributions (Conneau et al., 2018).

Reinforcement learning   Reinforcement learning has been shown to be useful for tasks with a temporal dependency such as selecting data during training (Fang et al., 2017; Wu et al., 2018) and modelling dialogue (Liu et al., 2018). RL is also effective for directly optimizing a non-differentiable end metric such as ROUGE or BLEU instead of optimizing a surrogate loss such as cross-entropy in summarization (Paulus et al, 2018; Celikyilmaz et al., 2018) and machine translation (Ranzato et al., 2016). Similarly, inverse reinforcement learning can be useful in settings where the reward is too complex to be specified such as visual storytelling (Wang et al., 2018).

# Non-neural milestones

In 1998 and over the following years, the FrameNet project was introduced (Baker et al., 1998), which led to the task of semantic role labelling, a form of shallow semantic parsing that is still actively researched today. In the early 2000s, the shared tasks organized together with the Conference on Natural Language Learning (CoNLL) catalyzed research in core NLP tasks such as chunking (Tjong Kim Sang et al., 2000), named entity recognition (Tjong Kim Sang et al., 2003), and dependency parsing (Buchholz et al., 2006), among others. Many of the CoNLL shared task datasets are still the standard for evaluation today.

In 2001, conditional random fields (CRF; Lafferty et al., 2001), one of the most influential classes of sequence labelling methods were introduced, which won the Test-of-time award at ICML 2011. A CRF layer is a core part of current state-of-the-art models for sequence labelling problems with label interdependencies such as named entity recognition (Lample et al., 2016).

In 2002, the bilingual evaluation understudy (BLEU; Papineni et al., 2002) metric was proposed, which enabled MT systems to scale up and is still the standard metric for MT evaluation these days. In the same year, the structured preceptron (Collins, 2002) was introduced, which laid the foundation for work in structured perception. At the same conference, sentiment analysis, one of the most popular and widely studied NLP tasks, was introduced (Pang et al., 2002). All three papers won the Test-of-time award at NAACL 2018. In addition, the linguistic resource PropBank (Kingsbury & Palmer, 2002) was introduced in the same year. PropBank is similar to FrameNet, but focuses on verbs. It is frequently used in semantic role labelling.

2003 saw the introduction of latent dirichlet allocation (LDA; Blei et al., 2003), one of the most widely used techniques in machine learning, which is still the standard way to do topic modelling. In 2004, novel max-margin models were proposed that are better suited for capturing correlations in structured data than SVMs (Taskar et al., 2004a; 2004b).

In 2006, OntoNotes (Hovy et al., 2006), a large multilingual corpus with multiple annotations and high interannotator agreement was introduced. OntoNotes has been used for the training and evaluation of a variety of tasks such as dependency parsing and coreference resolution. Milne and Witten (2008) described in 2008 how Wikipedia can be used to enrich machine learning methods. To this date, Wikipedia is one of the most useful resources for training ML methods, whether for entity linking and disambiguation, language modelling, as a knowledge base, or a variety of other tasks.

In 2009, the idea of distant supervision (Mintz et al., 2009) was proposed. Distant supervision leverages information from heuristics or existing knowledge bases to generate noisy patterns that can be used to automatically extract examples from large corpora. Distant supervision has been used extensively and is a common technique in relation extraction, information extraction, and sentiment analysis, among other tasks.

In 2016, Universal Dependencies v1 (Nivre et al., 2016), a multilingual collection of treebanks, was introduced. The Universal Dependencies project is an open community effort that aims to create consistent dependency-based annotations across many languages. As of January 2019, Universal Dependencies v2 comprises more than 100 treebanks in over 70 languages.

]]>
<![CDATA[ACL 2018 Highlights: Understanding Representations and Evaluation in More Challenging Settings]]>http://ruder.io/acl-2018-highlights/5c76ff7c1b9b0d18555b9eadThu, 26 Jul 2018 11:00:00 GMT

This post discusses highlights of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018).

This post originally appeared at the AYLIEN blog.

I attended the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018) in Melbourne, Australia from July 15-20, 2018 and presented three papers . It is foolhardy to try to condense an entire conference into one topic; however, in retrospect, certain themes appear particularly pronounced. In 2015 and 2016, NLP conferences were dominated by word embeddings and some people were musing that Embedding Methods in Natural Language Processing was a more appropriate name for the Conference on Empirical Methods in Natural Language Processing, one of the top conferences in the field.

According to Chris Manning, 2017 was the year of the BiLSTM with attention. While BiLSTMs optionally with attention are still ubiquitous, the main themes of this conference for me were to gain a better understanding what the representations of such models capture and to expose them to more challenging settings. In my review, I will mainly focus on contributions that touch on these themes but will also discuss other themes that I found of interest.

# Understanding Representations

## Probing models

It was very refreshing to see that rather than introducing ever shinier new models, many papers methodically investigated existing models and what they capture. This was most commonly done by automatically creating a dataset that focuses on one particular aspect of the generalization behaviour and evaluating different trained models on this dataset:

• Conneau et al. for instance evaluate different sentence embedding methods on ten datasets designed to capture certain linguistic features, such as predicting the length of a sentence, recovering word content, sensitivity to bigram shift, etc. They find that different encoder architectures can result in embeddings with different characteristics and that bag-of-embeddings is surprisingly good at capturing sentence-level properties, among other results.
• Zhu et al. evaluate sentence embeddings by observing the change in similarity of generated triplets of sentences that differ in a certain semantic or syntactic aspect. They find---among other things---that SkipThought and InferSent can distinguish negation from synonymy, while InferSent is better at identifying semantic equivalence and dealing with quantifiers.
• Pezzelle et al. focus specifically on quantifiers and test different CNN and LSTM models on their ability to predict quantifiers in single-sentence and multi-sentence contexts. They find that in single-sentence context, models outperform humans, while humans are slightly better in multi-sentence contexts.
• Kuncoro et al. evaluate LSTMs on modeling subject-verb agreement. They find that with enough capacity, LSTMs can model subject-verb agreement, but that more syntax-sensitive models such as recurrent neural network grammars do even better.
• Blevins et al. evaluate models pretrained on different tasks whether they capture a hierarchical notion of syntax. Specifically, they train the models to predict POS tags as well as constituent labels at different depths of a parse tree. They find that all models indeed encode a significant amount of syntax and---in particular---that language models learn some syntax.
• Khandelwal et al. show that LSTM language models use about 200 tokens of context on average. Word order is only relevant within the most recent sentence.
• Another interesting result regarding the generalization ability of language models is due to Lau et al. who find that a language model trained on a sonnet corpus captures meter implicitly at human-level performance.
• Language models, however, also have their limitations. Spithourakis and Riedel observe that language models are bad at modelling numerals and propose several strategies to improve them.
• Liu et al. show that LSTMs trained on natural language data are able to recall tokens from much longer sequence than models trained on non-language data at the Repl4NLP workshop.

In particular, I think better understanding what information LSTMs and language models will become more important, as they seem to be a key driver of progress in NLP going forward, as evidenced by our ACL paper on language model fine-tuning and related approaches.

## Understanding state-of-the-art models

While the above studies try to understand a specific aspect of the generalization ability of a particular model class, several papers focus on better understanding state-of-the-art models for a particular task:

• Glockner et al. focused on the task of natural language inference. They created a dataset with sentences that differ by at most one word from sentences in the training data in order to probe if models can deal with simple lexical inferences. They find that current state-of-the-art models fail to capture many simple inferences.
• Mudrakarta et al. analyse state-of-the-art QA models across different modalities and find that the models often ignore key question terms. They then perturb questions to craft adversarial examples that significantly lower models' accuracy.
I found many of the papers probing different aspects of models stimulating. I hope that the generation of such probing datasets will become a standard tool in the toolkit of every NLP researchers so that we will not only see more of such papers in the future but that such an analysis may also become part of the standard model evaluation, besides error and ablation analyses.

## Analyzing the inductive bias

Another way to gain a better understanding of a model is to analyze its inductive bias. The Workshop on Relevance of Linguistic Structure in Neural Architectures for NLP (RELNLP) sought to explore how useful it is to incorporate linguistic structure into our models. One of the key points of Chris Dyer's talk during the workshop was whether RNNs have a useful inductive bias for NLP. In particular, he argued that there are several pieces of evidence indicating that RNNs prefer sequential recency, namely:

1. Gradients become attenuated across time. LSTMs or GRUs may help with this, but they also forget.
2. People have used training regimes like reversing the input sequence for machine translation.
3. People have used enhancements like attention to have direct connections back in time.
4. For modeling subject-verb agreement, the error rate increases with the number of attractors.

According to Chomsky, sequential recency is not the right bias for learning human language. RNNs thus don't seem to have the right bias for modeling language, which in practice can lead to statistical inefficiency and poor generalization behaviour. Recurrent neural network grammars, a class of models that generates both a tree and a sequence sequentially by compressing a sentence into its constituents, instead have a bias for syntactic (rather than sequential) recency.

However, it can often be hard to identify whether a model has a useful inductive bias. For identifying subject-verb agreement, Chris hypothesizes that LSTM language models learn a non-structural "first noun" heuristic that relies on matching the verb to the first noun in the sentence. In general, perplexity (and other aggregate metrics) are correlated with syntactic/structural competence, but are not particularly sensitive at distinguishing structurally sensitive models from models that use a simpler heuristic.

## Using Deep Learning to understand language

In his talk at the workshop, Mark Johnson opined that while Deep Learning has revolutionized NLP, its primary benefit is economic: complex component pipelines have been replaced with end-to-end models and target accuracy can often be achieved more quickly and cheaply. Deep Learning has not changed our understanding of language. Its main contribution in this regard is to demonstrate that a neural network aka a computational model can perform certain NLP tasks, which shows that these tasks are not indicators of intelligence. While DL methods can pattern match and perform perceptual tasks really well, they struggle with tasks relying on deliberate reflection and conscious thought.

## Incorporating linguistic structure

Jason Eisner questioned in his talk whether linguistic structures and categories actually exist or whether "scientist just like to organize data into piles" given that a linguistics-free approach works surprisingly well for MT. He finds that even "arbitrarily defined" categories such as the difference between the /b/ and /p/ phonemes can become hardened and accrue meaning. However, neural models are pretty good sponges to soak up whatever isn't modeled explicitly.

He outlines four common ways to introduce linguistic information into models: a) via a pipeline-based approach, where linguistic categories are used as features; b) via data augmentation, where the data is augmented with linguistic categories; c) via multi-task learning; d) via structured modeling such as using a transition-based parser, a recurrent neural network grammar, or even classes that depend on each other such as BIO notation.

In her talk at the workshop, Emily Bender questioned the premise of linguistics-free learning altogether: Even if you had a huge corpus in a language that you knew nothing about, without any other priors, e.g. what function words are, you would not be able to learn sentence structure or meaning. She also pointedly called out many ML papers that describe their approach as similar to how babies learn, without citing any actual developmental psychology or language acquisition literature. Babies in fact learn in situated, joint, emotional context, which carries a lot of signal and meaning.

## Understanding the failure modes of LSTMs

Better understanding representations was also a theme at the Representation Learning for NLP workshop. During his talk, Yoav Goldberg detailed some of the efforts of his group to better understand representations of RNNs. In particular, he discussed recent work on extracting a finite state automaton from an RNN in order to better understand what the model has learned. He also reminded the audience that LSTM representations, even though they have been trained on one task, are not task-specific. They are often predictive of unintended aspects such as demographics in the data. Even when a model has been trained using a domain-adversarial loss to produce representations that are invariant of a certain aspect, the representations will be still slightly predictive of said attribute. It can thus be a challenge to completely remove unwanted information from encoded language data and even seemingly perfect LSTM models may have hidden failure modes.

On the topic of failure modes of LSTMs, a statement that also fits well in this theme was uttered by this year's recipient of the ACL lifetime achievement award, Mark Steedman. He asked "LSTMs work in practice, but can they work in theory?".

# Evaluation in more challenging settings

A theme that is closely interlinked with gaining a better understanding of the limitations of state-of-the-art models is to propose ways how they can be improved. In particular, similar to adversarial example paper mentioned above, several papers tried to make models more robust to adversarial examples:

• Cheng et al. propose to make both the encoder and decoder in NMT models more robust against input perturbations.
• Ebrahimi et al. propose white-box adversarial examples to trick a character-level neural classifier by swapping few tokens.
• Ribeiro et al. improve upon the previous method with semantic-preserving perturbations that induce changes in the model's predictions, which they generalize to rules that induce adversaries on many instances.
• Bose et al. incorporate adversarial examples into noise contrastive estimation using an adversarially learned sampler. The sampler finds harder negative examples, which forces the model to learn better representations.

## Learning robust and fair representations

Tim Baldwin discussed different ways to make models more robust to a domain shift during his talk at the RepL4NLP workshop. The slides can be found here. For using a single source domain, he discussed a method to linguistically perturb training instances based on different types of syntactic and semantic noise. In the setting with multiple source domains, he proposed to train an adversarial model on the source domains. Finally, he discussed a method that allows to learn robust and privacy-preserving text representations.

Margaret Mitchell focused on fair and privacy-preserving representations during her talk at the workshop. In particular, she highlighted the difference between a descriptive and a normative view of the world. ML models learn representations that reflect a descriptive view of the data they're trained on. The data represents "the world as people talk about it". Research in fairness conversely seeks to create representations that reflect a normative view of the world, which captures our values and seeks to instill them in the representations.

## Improving evaluation methodology

Besides making models more robust, several papers sought to improve the way we evaluate our models:

• Finegan-Dollak et al. identify limitations and propose improvements to current evaluations of text-to-SQL systems. They show that the current train-test split and practice of anonymization of variables are flawed and release standardized and improved versions of seven datasets to mitigate these.
• Dror et al. focus on a practice that should be commonplace, but is often not done or done poorly: statistical significance testing. In particular, they survey recent empirical papers in ACL and TACL 2017 finding that statistical significance testing is often ignored or misused and propose a simple protocol for statistical significance test selection for NLP tasks.
• Chaganty et al. investigate the bias of automatic metrics such as BLEU and ROUGE and find that even an unbiased estimator only achieves a comparatively low error reduction. This highlights the need to improve both the correlation of automatic metric as well as reduce the variance of human annotation.

## Strong baselines

Another way to improve model evaluation is to compare new models against stronger baselines, in order to make sure that improvements are actually significant. Some papers focused on this line of research:

• Shen et al. systematically compare simple word embedding-based methods with pooling to more complex models such as LSTMs and CNNs. They find that for most datasets, word embedding-based methods exhibit competitive or even superior performance.
• Ethayarajh proposes a strong baseline for sentence embedding models at the RepL4NLP workshop.
• In a similar vein, Ruder and Plank find that classic bootstrapping algorithms such as tri-training make for strong baselines for semi-supervised learning and even outperform recent state-of-the-art methods.
In the above paper, we also emphasize the importance of evaluating in more challenging settings, such as on out-of-distribution data and on different tasks. Our findings would have been different if we had just focused on a single task or only on in-domain data. We need to test our models under such adverse conditions to get a better sense of their robustness and how well they can actually generalize.

## Creating harder datasets

In order to evaluate under such settings, more challenging datasets need to be created. Yejin Choi argued during the RepL4NLP panel discussion (a summary can be found here) that the community pays a lot of attention to easier tasks such as SQuAD or bAbI, which are close to solved. Yoav Goldberg even went so far as to say that "SQuAD is the MNIST of NLP". Instead, we should focus on solving harder tasks and develop more datasets with increasing levels of difficulty. If a dataset is too hard, people don't work on it. In particular, the community should not work on datasets for too long as datasets are getting solved very fast these days; creating novel and more challenging datasets is thus even more important. Two datasets that seek to go beyond SQuAD for reading comprehension were presented at the conference:

• QAngaroo focuses on reading comprehension that requires to gather several pieces of information via multiple steps of inference.

Richard Socher also stressed the importance of training and evaluating a model across multiple tasks during his talk during the Machine Reading for Question Answering workshop. In particular, he argues that NLP requires many types of reasoning, e.g. logical, linguistic, emotional, etc., which cannot all be satisfied by a single task.

## Evaluation on multiple and low-resource languages

Another facet of this is to evaluate our models on multiple languages. Emily Bender surveyed 50 NAACL 2018 papers in her talk mentioned above and found that 42 papers evaluate on an unnamed mystery language (i.e. English). She emphasizes that it is important to name the language you work on as languages have different linguistic structures; not mentioning the language obfuscates this fact.

If our methods are designed to be cross-lingual, then we should additionally evaluate them on the more challenging setting of low-resource languages. For instance, both of the following two papers observe that current methods for unsupervised bilingual dictionary methods fail if the target language is dissimilar to language such as with Estonian or Finnish:

• Søgaard et al. probe the limitations of current methods further and highlight that such methods also fail when embeddings are trained on different domains or using different algorithms. They finally propose a metric to quantify the potential of such methods.
• Artetxe et al. propose a new unsupervised self-training method that employs a better initialization to steer the optimization process and is particularly robust for dissimilar language pairs.

Several other papers also evaluate their approaches on low-resource languages:

• Dror et al. propose to use orthographic features for bilingual lexicon induction. Though these mostly help for related languages, they also evaluate on the dissimilar language pair English-Finnish.
• Ren et al. finally propose to leverage another rich language for translation into a resource-poor language . They find that their model significantly improves the translation quality of rare languages.
• Currey and Heafield propose an unsupervised tree-to-sequence model for NMT by adapting the Gumbel tree-LSTM. Their model proves particularly useful for low-resource languages.

# Progress in NLP

Another theme during the conference for me was that the field is visibly making progress. Marti Hearst, president of the ACL, echoed this sentiment during her presidential address. She used to demonstrate what our models can and can't do using the example of Stanley Kubrick's HAL 9000 (seen below). In recent years, this has become a less useful exercise as our models have learned to perform tasks that seemed previously decades away such as recognizing and producing human speech or lipreading[1]. Naturally, we are still far away from tasks that require deep language understanding and reasoning such as having an argument; nevertheless, this progress is remarkable.

Marti also paraphrased NLP and IR pioneer Karen Spärck Jones saying that research is not going around in circles, but climbing a spiral or---maybe more fittingly---different staircases that are not necessarily linked but go in the same direction. She also expressed a sentiment that seems to resonate with a lot of people: In the 1980s and 90s, with only a few papers to read, it was definitely easier to keep track of the state of the art. To make this easier, I have recently created a document to collect the state of the art across different NLP tasks.

With the community growing, she encouraged people to participate and volunteer and announced an ACL Distinguished Service Award for the most dedicated members. ACL 2018 also saw the launch (after EACL in 1982 and NAACL in 2000) of its third chapter, AACL, the Asia-Pacific Chapter of the Association for Computational Linguistics.

The business meeting during the conference focused on measures to address a particular challenge of the growing community: the escalating number of submissions and the need for more reviewers. We can expect to see new efforts to deal with the large number of submissions at the conferences next year.

# Reinforcement learning

Back in 2016, it seemed as though reinforcement learning (RL) was finding its footing in NLP and being applied to more and more tasks. These days, it seems that the dynamic nature of RL makes it most useful for tasks that intrinsically have some temporal dependency such as selecting data during training[1][1] and modelling dialogue, while supervised learning seems to be better suited for most other tasks. Another important application of RL is to optimize the end metric such as ROUGE or BLEU directly instead of optimizing a surrogate loss such as cross-entropy. Successful applications of this are summarization[1][1] and machine translation[1].

Inverse reinforcement learning can be valuable in settings where the reward is too complex to be specified. A successful application of this is visual storytelling[1]. RL is particularly promising for sequential decision making problems in NLP such as playing text-based games, navigating webpages, and completing tasks. The Deep Reinforcement Learning for NLP tutorial provided a comprehensive overview of the space.

# Tutorials

There were other great tutorials as well. I particularly enjoyed the Variational Inference and Deep Generative Models tutorial. The tutorials on Semantic Parsing and about "100 things you always wanted to know about semantics & pragmatics" also seemed really worthwhile. A complete list of the tutorials can be found here.

Cover image: View from the conference venue.

Thanks to Isabelle Augenstein for some paper suggestions.

1. Fang, M., Li, Y., & Cohn, T. (2017). Learning how to Active Learn: A Deep Reinforcement Learning Approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Retrieved from https://arxiv.org/pdf/1708.02383v1.pdf

2. Wu, J., Li, L., & Wang, W. Y. (2018). Reinforced Co-Training. In Proceedings of NAACL-HLT 2018.

3. Paulus, R., Xiong, C., & Socher, R. (2018). A deep reinforced model for abstractive summarization. In Proceedings of ICLR 2018.

4. Celikyilmaz, A., Bosselut, A., He, X., & Choi, Y. (2018). Deep communicating agents for abstractive summarization. In Proceedings of NAACL-HLT 2018.

5. Ranzato, M. A., Chopra, S., Auli, M., & Zaremba, W. (2016). Sequence level training with recurrent neural networks. In Proceedings of ICLR 2016.

6. Wang, X., Chen, W., Wang, Y.-F., & Wang, W. Y. (2018). No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1804.09160

7. Chung, J. S., & Zisserman, A. (2016, November). Lip reading in the wild. In Asian Conference on Computer Vision (pp. 87-103). Springer, Cham.

8. Glockner, M., Shwartz, V., & Goldberg, Y. (2018). Breaking NLI Systems with Sentences that Require Simple Lexical Inferences. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1805.02266.

9. Zhu, X., Li, T., & De Melo, G. (2018). Exploring Semantic Properties of Sentence Embeddings. In Proceedings of ACL 2018 (pp. 1–6). Retrieved from http://aclweb.org/anthology/P18-2100

10. Pezzelle, S., Steinert-Threlkeld, S., Bernardi, R., & Szymanik, J. (2018). Some of Them Can be Guessed! Exploring the Effect of Linguistic Context in Predicting Quantifiers. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1806.00354

11. Conneau, A., Kruszewski, G., Lample, G., Barrault, L., & Baroni, M. (2018). What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1805.01070

12. Kuncoro, A., Dyer, C., Hale, J., Yogatama, D., Clark, S., & Blunsom, P. (2018). LSTMs Can Learn Syntax-Sensitive Dependencies Well, But Modeling Structure Makes Them Better. In Proceedings of ACL 2018 (pp. 1–11). Retrieved from http://aclweb.org/anthology/P18-1132

13. Ribeiro, M. T., Singh, S., & Guestrin, C. (2018). Semantically Equivalent Adversarial Rules for Debugging NLP Models. In Proceedings of ACL 2018.

14. Dror, R., Baumer, G., Shlomov, S., & Reichart, R. (2018). The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing. In Proceedings of ACL 2018. Retrieved from https://ie.technion.ac.il/~roiri/papers/ACL-2018-sig-cr.pdf

15. Dror, R., Baumer, G., Shlomov, S., & Reichart, R. (2018). Riley, P., & Gildea, D. (2018). Orthographic Features for Bilingual Lexicon Induction. In Proceedings of ACL 2018.

16. Lau, J. H., Cohn, T., Baldwin, T., Brooke, J., & Hammond, A. (2018). Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1807.03491

17. Chaganty, A. T., Mussman, S., & Liang, P. (2018). The price of debiasing automatic metrics in natural language evaluation. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1807.02202

18. Finegan-Dollak, C., Kummerfeld, J. K., Zhang, L., Ramanathan, K., Sadasivam, S., Zhang, R., & Radev, D. (2018). Improving Text-to-SQL Evaluation Methodology. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1806.09029

19. Shen, D., Wang, G., Wang, W., Min, M. R., Su, Q., Zhang, Y., … Carin, L. (2018). Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. In Proceedings of ACL 2018.

20. Ren, S., Chen, W., Liu, S., Li, M., Zhou, M., & Ma, S. (2018). Triangular Architecture for Rare Language Translation. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1805.04813

21. Cheng, Y., Tu, Z., Meng, F., Zhai, J., & Liu, Y. (2018). Towards Robust Neural Machine Translation. In Proceedings of ACL 2018.

22. Mudrakarta, P. K., Taly, A., Brain, G., Sundararajan, M., & Google, K. D. (2018). Did the Model Understand the Question? In Proceedings of ACL 2018. Retrieved from https://arxiv.org/pdf/1805.05492.pdf

23. Spithourakis, G. P., & Riedel, S. (2018). Numeracy for Language Models: Evaluating and Improving their Ability to Predict Numbers. In Proceedings of ACL 2018.

24. Blevins, T., Levy, O., & Zettlemoyer, L. (2018). Deep RNNs Encode Soft Hierarchical Syntax. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1805.04218

25. Ruder, S., & Plank, B. (2018). Strong Baselines for Neural Semi-supervised Learning under Domain Shift. In Proceedings of ACL 2018.

26. Søgaard, A., Ruder, S., & Vulić, I. (2018). On the Limitations of Unsupervised Bilingual Dictionary Induction. In Proceedings of ACL 2018.

27. Artetxe, M., Labaka, G., & Agirre, E. (2018). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of ACL 2018.

28. Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2018). HotFlip: White-Box Adversarial Examples for Text Classification. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1712.06751

29. Linzen, T., Dupoux, E., & Goldberg, Y. (2016). Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies. Proceedings of TACL 2016. Retrieved from http://arxiv.org/abs/1611.01368

30. Dyer, C., Kuncoro, A., Ballesteros, M., & Smith, N. A. (2016). Recurrent Neural Network Grammars. In NAACL. Retrieved from http://arxiv.org/abs/1602.07776

31. Currey, A., & Heafield, K. (2018). Unsupervised Source Hierarchies for Low-Resource Neural Machine Translation. In Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP (pp. 1–7).

32. Li, Y., Cohn, T., & Baldwin, T. (2017). Robust Training under Linguistic Adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Vol. 2, pp. 21–27).

33. Li, Y., Baldwin, T., & Cohn, T. (2018). Learning Domain-Robust Text Representations using Adversarial Training. In Proceedings of NAACL-HLT 2018.

34. Li, Y., Baldwin, T., & Cohn, T. (2018). Towards Robust and Privacy-preserving Text Representations. In Proceedings of ACL 2018.

35. Weiss, G., Goldberg, Y., & Yahav, E. (2018). Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples. In Proceedings of ICML 2018. Retrieved from http://arxiv.org/abs/1711.09576

36. Ethayarajh, K. (2018). Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline. In Proceedings of the 3rd Workshop on Representation Learning for NLP. Retrieved from http://www.aclweb.org/anthology/W18-3012

37. Liu, N. F., Levy, O., Schwartz, R., Tan, C., & Smith, N. A. (2018). LSTMs Exploit Linguistic Attributes of Data. In Proceedings of the 3rd Workshop on Representation Learning for NLP,. Retrieved from http://arxiv.org/abs/1805.11653

38. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1801.06146

39. Welbl, J., Stenetorp, P., & Riedel, S. (2018). Constructing Datasets for Multi-hop Reading Comprehension Across Documents. In Transactions of the Association for Computational Linguistics. Retrieved from http://arxiv.org/abs/1710.06481

40. Kočiský, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., & Grefenstette, E. (2018). The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics. Retrieved from http://arxiv.org/abs/1712.07040

41. Bose, A., Ling, H., & Cao, Y. (2018). Adversarial Contrastive Estimation. In Proceedings of ACL 2018.

42. Khandelwal, U., He, H., Qi, P., & Jurafsky, D. (2018). Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context. In Proceedings of ACL 2018.
]]>
<![CDATA[NLP's ImageNet moment has arrived]]>http://ruder.io/nlp-imagenet/5c76ff7c1b9b0d18555b9eacThu, 12 Jul 2018 08:27:00 GMT

This post discusses pretrained language models, one of the most exciting directions in contemporary NLP.

This post originally appeared at TheGradient and was edited by Andrey Kurenkov, Eric Wang, and Aditya Ganesh.

Big changes are underway in the world of Natural Language Processing (NLP). The long reign of word vectors as NLP’s core representation technique has seen an exciting new line of challengers emerge: ELMo[1], ULMFiT[2], and the OpenAI transformer[3]. These works made headlines by demonstrating that pretrained language models can be used to achieve state-of-the-art results on a wide range of NLP tasks. Such methods herald a watershed moment: they may have the same wide-ranging impact on NLP as pretrained ImageNet models had on computer vision.

## From Shallow to Deep Pre-Training

Pretrained word vectors have brought NLP a long way. Proposed in 2013 as an approximation to language modeling, word2vec[4] found adoption through its efficiency and ease of use in a time when hardware was a lot slower and deep learning models were not widely supported. Since then, the standard way of conducting NLP projects has largely remained unchanged: word embeddings pretrained on large amounts of unlabeled data via algorithms such as word2vec and GloVe[5] are used to initialize the first layer of a neural network, the rest of which is then trained on data of a particular task. On most tasks with limited amounts of training data, this led to a boost of two to three percentage points[6]. Though these pretrained word embeddings have been immensely influential, they have a major limitation: they only incorporate previous knowledge in the first layer of the model---the rest of the network still needs to be trained from scratch.

Word2vec and related methods are shallow approaches that trade expressivity for efficiency. Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful. A model initialized with word embeddings needs to learn from scratch not only to disambiguate words, but also to derive meaning from a sequence of words. This is the core aspect of language understanding, and it requires modeling complex language phenomena such as compositionality, polysemy, anaphora, long-term dependencies, agreement, negation, and many more. It should thus come as no surprise that NLP models initialized with these shallow representations still require a huge number of examples to achieve good performance.

At the core of the recent advances of ULMFiT, ELMo, and the OpenAI transformer is one key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations. If learning word vectors is like only learning edges, these approaches are like learning the full hierarchy of features, from edges to shapes to high-level semantic concepts.

Interestingly, pretraining entire models to learn both low and high level features has been practiced for years by the computer vision (CV) community. Most often, this is done by learning to classify images on the large ImageNet dataset. ULMFiT, ELMo, and the OpenAI transformer have now brought the NLP community close to having an "ImageNet for language"---that is, a task that enables models to learn higher-level nuances of language, similarly to how ImageNet has enabled training of CV models that learn general-purpose features of images. In the rest of this piece, we’ll unpack just why these approaches seem so promising by extending and building on this analogy to ImageNet.

## ImageNet

ImageNet’s impact on the course of machine learning research can hardly be overstated. The dataset was originally published in 2009 and quickly evolved into the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In 2012, the deep neural network[7] submitted by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton performed 41% better than the next best competitor, demonstrating that deep learning was a viable strategy for machine learning and arguably triggering the explosion of deep learning in ML research.

The success of ImageNet highlighted that in the era of deep learning, data was at least as important as algorithms. Not only did the ImageNet dataset enable that very important 2012 demonstration of the power of deep learning, but it also allowed a breakthrough of similar importance in transfer learning: researchers soon realized that the weights learned in state of the art models for ImageNet could be used to initialize models for completely other datasets and improve performance significantly. This "fine-tuning" approach allowed achieving good performance with as little as one positive example per category (Donahue et al., 2014)[8].

Pretrained ImageNet models have been used to achieve state-of-the-art results in tasks such as object detection[9], semantic segmentation[10], human pose estimation[11], and video recognition[12]. At the same time, they have enabled the application of CV to domains where the number of training examples is small and annotation is expensive. Transfer learning via pretraining on ImageNet is in fact so effective in CV that not using it is now considered foolhardy (Mahajan et al., 2018)[13].

## What’s in an ImageNet?

In order to determine what an ImageNet for language might look like, we first have to identify what makes ImageNet good for transfer learning. Previous studies[14] have only shed partial light on this question: reducing the number of examples per class or the number of classes only results in a small performance drop, while fine-grained classes and more data are not always better.

Rather than looking at the data directly, it is more prudent to probe what the models trained on the data learn. It is common knowledge that features of deep neural networks trained on ImageNet transition from general to task-specific from the first to the last layer[15]: lower layers learn to model low-level features such as edges, while higher layers model higher-level concepts such as patterns and entire parts or objects[16] as can be seen in the figure below. Importantly, knowledge of edges, structures, and the visual composition of objects is relevant for many CV tasks, which sheds light on why these layers are transferred. A key property of an ImageNet-like dataset is thus to encourage a model to learn features that will likely generalize to new tasks in the problem domain.

Beyond this, it is difficult to make further generalizations about why transfer from ImageNet works quite so well. For instance, another possible advantage of the ImageNet dataset is the quality of the data. ImageNet’s creators went to great lengths to ensure reliable and consistent annotations. However, work in distant supervision serves as a counterpoint, indicating that large amounts of weakly labelled data might often be sufficient. In fact, recently researchers at Facebook showed that they could pre-train a model by predicting hashtags on billions of social media images to state-of-the-art accuracy on ImageNet.

Without any more concrete insights, we are left with two key desiderata:

1. An ImageNet-like dataset should be sufficiently large, i.e. on the order of millions of training examples.

2. It should be representative of the problem space of the discipline.

## An ImageNet for language

In NLP, models are typically a lot shallower than their CV counterparts. Analysis of features has thus mostly focused on the first embedding layer, and little work has investigated the properties of higher layers for transfer learning. Let us consider the datasets that are large enough, fulfilling desideratum #1. Given the current state of NLP, there are several contenders[17].

Natural language inference is the task of identifying the relation (entailment, contradiction, and neutral) that holds between a piece of text and a hypothesis. The most popular dataset for this task, the Stanford Natural Language Inference (SNLI) Corpus[19], contains 570k human-written English sentence pairs. Examples of the dataset can be seen below.

Machine translation, translating text in one language to text in another language, is one of the most studied tasks in NLP, and---over the years---has accumulated vast amounts of training data for popular language pairs, e.g. 40M English-French sentence pairs in WMT 2014. See below for two example translation pairs.

Constituency parsing seeks to extract the syntactic structure of a sentence in the form of a (linearized) constituency parse tree as can be seen below. In the past, millions of weakly labelled parses[20] have been used for training sequence-to-sequence models for this task.

Language modeling (LM) aims to predict the next word given its previous word. Existing benchmark datasets consist of up to 1B words, but as the task is unsupervised, any number of words can be used for training. See below for examples from the popular WikiText-2 dataset consisting of Wikipedia articles.

All of these tasks provide---or would allow the collection of---a sufficient number of examples for training. Indeed, the[21] above[22] tasks[23] (and many others such as sentiment analysis[24], constituency parsing [20:1], skip-thoughts[25], and autoencoding[26]) have been used to pretrain representations in recent months.

While any data contains some bias[27], human annotators may inadvertently introduce additional signals that a model can exploit. Recent[28] studies[29] reveal that state-of-the-art models for tasks such as reading comprehension and natural language inference do not in fact exhibit deep natural language understanding but pick up on such cues to perform superficial pattern matching. For instance, Gururangan et al. (2018)[29:1] show that annotators tend to produce entailment examples simply by removing gender or number information and generate contradictions by introducing negations. A model that simply exploits these cues can correctly classify the hypothesis without looking at the premise in about 67% of the SNLI dataset.

The more difficult question thus is: Which task is most representative of the space of NLP problems? In other words, which task allows us to learn most of the knowledge or relations required for understanding natural language?

## The case for language modelling

In order to predict the most probable next word in a sentence, a model is required not only to be able to express syntax (the grammatical form of the predicted word must match its modifier or verb) but also model semantics. Even more, the most accurate models must incorporate what could be considered world knowledge or common sense. Consider the incomplete sentence "The service was poor, but the food was". In order to predict the succeeding word such as “yummy” or “delicious”, the model must not only memorize what attributes are used to describe food, but also be able to identify that the conjunction “but” introduces a contrast, so that the new attribute has the opposing sentiment of “poor”.

Language modelling, the last approach mentioned, has been shown to capture many facets of language relevant for downstream tasks, such as long-term dependencies[30], hierarchical relations[31], and sentiment[32]. Compared to related unsupervised tasks such as skip-thoughts and autoencoding, language modelling performs better on syntactic tasks even with less training data.

Among the biggest benefits of language modelling is that training data comes for free with any text corpus and that potentially unlimited amounts of training data are available. This is particularly significant, as NLP deals not only with the English language. More than 4,500 languages are spoken around the world by more than 1,000 speakers. Language modeling as a pretraining task opens the door to developing models for previously underserved languages. For very low-resource languages where even unlabeled data is scarce, multilingual language models may be trained on multiple related languages at once, analogous to work on cross-lingual embeddings[33].

So far, our argument for language modeling as a pretraining task has been purely conceptual. Pretraining a language model was first proposed in 2015 [26:1], but it remained unclear whether a single pretrained language model was useful for many tasks. In recent months, we finally obtained overwhelming empirical proof: Embeddings from Language Models (ELMo), Universal Language Model Fine-tuning (ULMFiT), and the OpenAI Transformer have empirically demonstrated how language modeling can be used for pretraining, as shown by the above figure from ULMFiT. All three methods employed pretrained language models to achieve state-of-the-art on a diverse range of tasks in Natural Language Processing, including text classification, question answering, natural language inference, coreference resolution, sequence labeling, and many others.

In many cases such as with ELMo in the figure below, these improvements ranged between 10-20% better than the state-of-the-art on widely studied benchmarks, all with the single core method of leveraging a pretrained language model. ELMo furthermore won the best paper award at NAACL-HLT 2018, one of the top conferences in the field. Finally, these models have been shown to be extremely sample-efficient, achieving good performance with only hundreds of examples and are even able to perform zero-shot learning.

In light of this step change, it is very likely that in a year’s time NLP practitioners will download pretrained language models rather than pretrained word embeddings for use in their own models, similarly to how pre-trained ImageNet models are the starting point for most CV projects nowadays.

However, similar to word2vec, the task of language modeling naturally has its own limitations: It is only a proxy to true language understanding, and a single monolithic model is ill-equipped to capture the required information for certain downstream tasks. For instance, in order to answer questions about or follow the trajectory of characters in a story, a model needs to learn to perform anaphora or coreference resolution. In addition, language models can only capture what they have seen. Certain types of information, such as most common sense knowledge, are difficult to learn from text alone[34] and require incorporating external information.

One outstanding question is how to transfer the information from a pre-trained language model to a downstream task. The two main paradigms for this are whether to use the pre-trained language model as a fixed feature extractor and incorporate its representation as features into a randomly initialized model as used in ELMo, or whether to fine-tune the entire language model as done by ULMFiT. The latter fine-tuning approach is what is typically done in CV where either the top-most or several of the top layers are fine-tuned. While NLP models are typically more shallow and thus require different fine-tuning techniques than their vision counterparts, recent pretrained models are getting deeper. The next months will show the impact of each of the core components of transfer learning for NLP: an expressive language model encoder such as a deep BiLSTM or the Transformer, the amount and nature of the data used for pretraining, and the method used to fine-tune the pretrained model.

## But where’s the theory?

Our analysis thus far has been mostly conceptual and empirical, as it is still poorly understood why models trained on ImageNet---and consequently on language modeling---transfer so well. One way to think about the generalization behaviour of pretrained models more formally is under a model of bias learning (Baxter, 2000)[35]. Assume our problem domain covers all permutations of tasks in a particular discipline, e.g. computer vision, which forms our environment. We are provided with a number of datasets that allow us to induce a family of hypothesis spaces $\mathrm{H} = {\mathcal{H}}$. Our goal in bias learning is to find a bias, i.e. a hypothesis space $\mathcal{H} \in \mathrm{H}$ that maximizes performance on the entire (potentially infinite) environment.

Empirical and theoretical results in multi-task learning (Caruana, 1997; Baxter, 2000)[36][35:1] indicate that a bias that is learned on sufficiently many tasks is likely to generalize to unseen tasks drawn from the same environment. Viewed through the lens of multi-task learning, a model trained on ImageNet learns a large number of binary classification tasks (one for each class). These tasks, all drawn from the space of natural, real-world images, are likely to be representative of many other CV tasks. In the same vein, a language model---by learning a large number of classification tasks, one for each word---induces representations that are likely helpful for many other tasks in the realm of natural language. Still, much more research is necessary to gain a better theoretical understanding why language modeling seems to work so well for transfer learning.

## The ImageNet moment

The time is ripe for practical transfer learning to make inroads into NLP. In light of the impressive empirical results of ELMo, ULMFiT, and OpenAI it only seems to be a question of time until pretrained word embeddings will be dethroned and replaced by pretrained language models in the toolbox of every NLP practitioner. This will likely open many new applications for NLP in settings with limited amounts of labeled data. The king is dead, long live the king!

For further reading, check out the discussion on HackerNews.

Cover image due to Matthew Peters.

1. Peters, Matthew E., et al. "Deep contextualized word representations." Proceedings of NAACL-HLT (2018). ↩︎

2. Howard, Jeremy, and Sebastian Ruder. "Fine-tuned Language Models for Text Classification." Proceedings of ACL (2018). ↩︎

3. Radford, Alec, et al. "Improving Language Understanding by Generative Pre-Training." ↩︎

4. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013. ↩︎

5. Pennington, Jeffrey, Richard Socher, and Christopher Manning. "Glove: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014. ↩︎

6. Kim, Yoon. "Convolutional neural networks for sentence classification." Proceedings of EMNLP (2014). ↩︎

7. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012. ↩︎

8. Donahue, Jeff, et al. "Decaf: A deep convolutional activation feature for generic visual recognition." International conference on machine learning. 2014. ↩︎

9. He, Kaiming, et al. "Mask r-cnn." Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017. ↩︎

10. Zhao, Hengshuang, et al. "Pyramid scene parsing network." IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2017. ↩︎

11. Papandreou, George, et al. "Towards accurate multi-person pose estimation in the wild." CVPR. Vol. 3. No. 4. 2017. ↩︎

12. Carreira, Joao, and Andrew Zisserman. "Quo vadis, action recognition? a new model and the kinetics dataset." Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017. ↩︎

13. Mahajan, Dhruv, et al. "Exploring the Limits of Weakly Supervised Pretraining." arXiv preprint arXiv:1805.00932 (2018). ↩︎

14. Huh, Minyoung, Pulkit Agrawal, and Alexei A. Efros. "What makes ImageNet good for transfer learning?." arXiv preprint arXiv:1608.08614 (2016). ↩︎

15. Yosinski, Jason, et al. "How transferable are features in deep neural networks?." Advances in neural information processing systems. 2014. ↩︎

16. Olah, Chris, Alexander Mordvintsev, and Ludwig Schubert. "Feature visualization." Distill 2.11 (2017): e7. ↩︎

17. For a comprehensive overview of progress in NLP tasks, you can refer to this GitHub repository. ↩︎

18. Rajpurkar, Pranav, et al. "Squad: 100,000+ questions for machine comprehension of text." Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP, 2016). ↩︎

19. Bowman, Samuel R., et al. "A large annotated corpus for learning natural language inference." Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP, 2015). ↩︎

20. Vinyals, Oriol, et al. "Grammar as a foreign language." Advances in Neural Information Processing Systems. 2015. ↩︎ ↩︎

21. Conneau, Alexis, et al. "Supervised learning of universal sentence representations from natural language inference data." Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP, 2017). ↩︎

22. Subramanian, Sandeep, et al. "Learning general purpose distributed sentence representations via large scale multi-task learning." Proceedings of the International Conference on Learning Representations, ICLR (2018). ↩︎

23. McCann, Bryan, et al. "Learned in translation: Contextualized word vectors." Advances in Neural Information Processing Systems. 2017. ↩︎

24. Felbo, Bjarke, et al. "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm." Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP, 2017). ↩︎

25. Kiros, Ryan, et al. "Skip-thought vectors." Advances in neural information processing systems. 2015. ↩︎

26. Dai, Andrew M., and Quoc V. Le. "Semi-supervised sequence learning." Advances in neural information processing systems. 2015. ↩︎

27. Bolukbasi, Tolga, et al. "Man is to computer programmer as woman is to homemaker? debiasing word embeddings." Advances in Neural Information Processing Systems. 2016. ↩︎

28. Chen, Danqi, Jason Bolton, and Christopher D. Manning. "A thorough examination of the cnn/daily mail reading comprehension task." Proceedings of the Meeting of the Association for Computational Linguistics (2016). ↩︎

29. Gururangan, Suchin, et al. "Annotation artifacts in natural language inference data." Proceedings of NAACL-HLT (2018). ↩︎ ↩︎

30. Linzen, T., Dupoux, E., & Goldberg, Y. (2016). Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies. Proceedings of TACL 2016. ↩︎

31. Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., & Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. In Proceedings of NAACL-HLT 2018. ↩︎

32. Radford, A., Jozefowicz, R., & Sutskever, I. (2017). Learning to Generate Reviews and Discovering Sentiment. arXiv preprint arXiv:1704.01444. ↩︎

33. Ruder, Sebastian, Ivan Vulic, and Anders Søgaard. "A Survey of Cross-lingual Word Embedding Models." Journal of Artificial Intelligence Research (2018). ↩︎

34. Lucy, Li, and Jon Gauthier. "Are distributional representations ready for the real world? Evaluating word vectors for grounded perceptual meaning." Proceedings of the First Workshop on Language Grounding for Robotics (2017). ↩︎

35. Jonathan Baxter. 2000. A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research 12:149–198. ↩︎ ↩︎

36. Rich Caruana. 1993. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning. ↩︎

]]>
<![CDATA[Tracking the Progress in Natural Language Processing]]>http://ruder.io/tracking-progress-nlp/5c76ff7c1b9b0d18555b9eaaFri, 22 Jun 2018 17:31:55 GMT

This post introduces a resource to track the progress and state-of-the-art across many tasks in NLP.

Research in Machine Learning and in Natural Language Processing (NLP) is moving so fast these days, it is hard to keep up. This is an issue for people in the field, but it is an even bigger obstacle for people wanting to get into NLP and those seeking to make the leap from tutorials to reproducing papers and conducting their own research. Without expert guidance and prior knowledge, it can be a painstaking process to identify the most common datasets and the current state-of-the-art for your task of interest.

A number of resources exist that could help with this process, but each has deficits: The Association of Computation Linguistics (ACL) has a wiki page tracking the state-of-the-art, but the page is not maintained and contributing is not straightforward. The Electronic Frontier Foundation and the AI Index try to do something similar for all of AI but only cover a few language tasks. The Language Resources and Evaluation (LRE) Map collects language resources presented at LREC and other conferences, but does not allow to break them out by tasks or popularity. Similarly, the International Workshop on Semantic Evaluation (SemEval) hosts a small number of tasks each year, which provide new datasets that typically have not been widely studied before. There are also resources that focus on computer vision and speech recognition as well as this repo, which focuses on all of ML.

As an alternative, I have created a GitHub repository that keeps track of the datasets and the current state-of-the-art for the most common tasks in NLP. The repository is kept as simple as possible to make maintenance and contribution easy. If I missed your favourite task or dataset or your new state-of-the-art result or if I made any error, you can simply submit a pull request.

The aim is to have a comprehensive and up-to-date resource where everyone can see at a glance the state-of-the-art for the tasks they care about. Datasets, which already do a great job at tracking this such as SQuAD or SNLI using a public leaderboard will simply be referenced instead.

My hope is that such a resource will give a broader sense of progress in the field than results in individual papers. It might also make it easier to identify tasks or areas where progress has been lacking. Another benefit is that such a resource may encourage serendipity: chancing upon an interesting new task or method. Finally, a positive by-product of having the state-of-the-art for each task easily accessible may be that it will be harder to justify (accidentally) comparing to weak baselines. For instance, the perplexity of the best baseline on the Penn Treebank varied dramatically across 10 language modeling papers submitted to ICLR 2018 (see below).

Credit for the cover image is due to the Electronic Frontier Foundation.

]]>
<![CDATA[Highlights of NAACL-HLT 2018: Generalization, Test-of-time, and Dialogue Systems]]>http://ruder.io/highlights-naacl-2018/5c76ff7c1b9b0d18555b9eabTue, 12 Jun 2018 13:42:00 GMT

This post discusses highlights of NAACL-HLT 2018.

This post originally appeared at the AYLIEN blog.

I attended NAACL-HLT 2018 in New Orleans last week. I didn’t manage to catch as many talks and posters this time around (there were just too many inspiring people to talk to!), so my highlights and the trends I observed mainly focus on invited talks and workshops.

Specifically, my highlights concentrate on three topics, which were prominent throughout the conference: Generalization, the Test-of-Time awards, and Dialogue Systems. For more information about other topics, you can check out the conference handbook and the proceedings.

First of all, there were four quotes from the conference that particularly resonated with me (some of them are paraphrased):

People worked on MT before the BLEU score. - Kevin Knight

It’s natural to work on tasks where evaluation is easy. Instead, we should encourage more people to tackle hard problems that are not easy to evaluate. These are often the most rewarding to work on.

BLEU is an understudy. It was never meant to replace human judgement and never expected to last this long. - Kishore Papineni, co-creator of BLEU, the most commonly used metric for machine translation.

No approach is perfect. Even the authors of landmark papers were aware their methods had flaws. The best we can do is provide a fair evaluation of our technique.

We decided to sample an equal number of positive and negative reviews---was that a good idea? - Bo Pang, first author of one of the first papers on sentiment analysis (7k+ citations).

In addition to being aware of the flaws of our method, we should be explicit about the assumptions we make so that future work can either build upon them or discard them if they prove unhelpful or turn out to be false.

I pose the following challenge to the community: we should evaluate on out-of-distribution data or on a new task. - Percy Liang.

We never know how well our model truly generalizes if we just test it on data of the same distribution. In order to develop models that can be applied to the real world, we need to evaluate on out-of-distribution data or on a new task. Percy Liang’s quote ties into one of the topics that received increasing attention at the conference: how can we train models that are less brittle and that generalize better?

## Generalization

Over the last years, much of the research within the NLP community focused on improving LSTM-based models on benchmark tasks. At NAACL, it seemed that increasingly people were thinking about how to get models to generalize beyond the conditions during training, reflecting a similar sentiment in the Deep Learning community in general (a paper on generalization won the best paper award at ICLR 2017).

One aspect is generalizing from few examples, which is difficult with the current generation of neural network-based methods. Charles Yang, professor of Linguistics, Computer Science and Psychology at the University of Pennsylvania put this in a cognitive science context.

Machine learning and NLP researchers in the neural network era frequently like to motivate their work by referencing the remarkable generalization ability of young children. One piece of information that often is eluded, however, is that generalization in young children is also not without its errors, because it requires learning a rule and accounting for exceptions. For instance, when learning to count, young children still frequently make mistakes, as they have to balance rule-based generalization (for regular numbers such as sixteen, seventeen, eighteen, etc.) with memorizing exceptions (numbers such as fifteen, twenty, thirty, etc.).

However, once a child can flawlessly count to 72, it can generalize to any new numbers. This magic number, 72, is given by the so-called tolerance principle, which prescribes that in order for a generalization rule to be productive, there can be at most N/ln(N) exceptions, where N is the total number of examples as can be seen in the Figure below. For counting, 72/ln(72) ≈ 17, which is exactly the number of exceptions until 72.

Hitchhiker’s Guide to the Galaxy fans, however, need not be disappointed: in Chinese, the magic number is 42. Chinese only has 11 exceptions. As 42/ln(42) ≈ 11, Chinese children typically only need to learn to count up to 42 in order to generalize, which explains why Chinese children usually learn to count faster.

It is also interesting to note that even though young children can count up to a certain number, they can’t tell you, for example, which number is higher than 24. Only after they’ve learned the rule can they actually apply it productively.

The tolerance principle implies that if the total number of examples covered by the rule is smaller, it is easier to incorporate comparatively more exceptions. While children can productively learn language from few examples, this indicates that for few-shot learning (at least in the cognitive process of language acquisition), big data may actually be harmful. Insights from cognitive science may thus help us in developing models that generalize better.

The choice of the best paper of the conference, Deep contextualized word representations, also demonstrates an increasing interest in generalization. Embeddings from Language Models (ELMo) showed significant improvements over the state-of-the-art on a wide range of tasks as can be seen below. This---together with better ways to fine-tune pre-trained language models---reveals the potential of transfer learning for NLP.

Generalization was also the topic of the Workshop on New Forms of Generalization in Deep Learning and Natural Language Processing, which sought to analyze the failings of brittle models and propose new ways to evaluate and new models that would enable better generalization. Throughout the workshop, the room (seen below) was packed most of the time, which is both a testament to the prestige of the speakers and the community interest in the topic.

In the first talk of the workshop, Yejin Choi argued that natural language understanding (NLU) does not generalize to natural language generation (NLG): while pre-Deep Learning NLG models often started with NLU output, post-DL NLG seems less dependent on NLU. However, current neural NLG heavily depends on language models and neural NLG can be brittle; in many cases, baselines based on templates can actually work better.

Despite advances in NLG, generating a coherent paragraph still does not work well and models end up generating generic, contentless, bland text full of repetitions and contradictions. But if we feed our models natural language input, why do they produce unnatural language output?

Yejin identified two limitations of language models in this context. Language models are passive learners: in the real world, one can’t learn to write just by reading; and similarly, she argued, even RNNs need to “practice” writing. Secondly, language models are surface learners: they need “world” models and must be sensitive to the “latent process” behind language. In reality, people don’t write to maximize the probability of the next token, but rather seek to fulfill certain communicative goals.

To address this, Yejin and collaborators proposed in an upcoming ACL 2018 paper to augment the generator with multiple discriminative models that grade the output along different dimensions inspired by Grice’s maxims.

Yejin also sought to explain why there is a significant performance gaps between different NLU tasks such as machine translation and dialogue. For Type 1 or shallow NLU tasks, there is a strong alignment between input and output and models can often match surface patterns. For Type 2 or deep NLU tasks, the alignment between input and output is weaker; in order to perform well, a model needs to be able to abstract and reason, and requires certain types of knowledge, especially common sense knowledge. In particular, commonsense knowledge has somewhat fallen out of favour; past approaches, which were mostly proposed in the 80s, did not have access to a lot of computing power and were mostly done by non-NLP people. Overall, NLU traditionally focuses on understanding only “natural” language, while NLG also requires understanding machine language, which may be unnatural.

Devi Parikh discussed generalization “opportunities” in visual question answering (VQA) and illustrated successes and failures of VQA models. In particular, VQA models are not very good at generalizing to novel instances; the distance of a test image from the k-nearest neighbours seen during training can predict the success or failure of a model with about 67% accuracy.

Devi also showed that in many cases, VQA models do not even consider the entire question: in 50% of cases, only half the question is read. In particular, certain prefixes demonstrate the power of language priors: if the question begins with “Is there a clock…?”, the answer is “yes” 98% of the time; if the question begins with “Is the man wearing glasses…?”, the answer is “yes” 94% of the time. In order to counteract these biases, Devi and her collaborators introduced a new dataset of complimentary scenes, which are very similar but differ in their answers. They also proposed a new setting for VQA where for every question type, train and test sets have different prior distributions of answers.

The final discussion with senior panelists (seen below) was arguably the highlight of the workshop; a summary can be found here. The main takeaways are that we need to develop models with inductive biases and that we need to do a better job of educating people on how to design experiments and identify biases in datasets.

## Test-of-time awards

Another highlight of the conference was the test-of-time awards session, which highlighted persons and papers that had a huge impact on the field. At the beginning of the session, Aravind Yoshi (see below), NLP pioneer and professor of Computer Science at the University of Pennsylvania, who passed away on December 31, 2017 was honored in touching epitaphs by close friends and people who knew him. The commemoration was a powerful reminder that research is about more than the papers we publish, but about the people we help and the lives we touch.

Afterwards, three papers (all published in 2002) were honored with test-of-time awards. For each paper, one of the original authors presented the paper and reflected on its impact. The first paper presented was BLEU: a Method for Automatic Evaluation of Machine Translation, which introduced the original BLEU metric, now commonplace in machine translation (MT). Kishore Papineni recounted that the name was inspired by Candide, an experimental MT system at IBM in the early 1990s and by IBM’s nickname, Big Blue, as all authors were at IBM at that time.

Before BLEU, machine translation evaluation was cumbersome, expensive, and thought to be as difficult as training an MT model. Despite its huge impact, BLEU’s creators were apprehensive before its initial publication. Once published, BLEU seemed to split the community in two camps: those who loved it, and those who hated it; the authors hadn’t expected such a strong reaction.

BLEU is still criticized today. It was meant as a corpus-level metric; individual sentence errors should be averaged out across the entire corpus. Kishore conceded that in hindsight, they made a few mistakes: they should have included smoothing and statistical significance testing; an initial version was also case insensitive, which caused confusion.

In summary, BLEU has many known limitations and inspired many colorful variants. On the whole, however, it is an understudy (as the acronym BiLingual Evaluation Understudy implies): it was never meant to replace human judgement and---notably---was never expected to last this long.

The second honored paper was Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms by Michael Collins, which introduced the Structured Perceptron, one of the foundational and easiest to understand algorithms for general structured prediction.

Lastly, Bo Pang looked back on her paper Thumbs up? Sentiment Classification using Machine Learning Techniques, which was the first paper of her PhD and one of the first papers on sentiment analysis, now an active research area in the NLP community. Prior to the paper, people had worked on classifying the subjectivity of sentences and the semantic orientation (polarity) of adjectives; sentiment classification was thus a natural progression.

Over the years, the paper has accumulated more than 7,000 citations. One reason why the paper was so impactful was that the authors decided to release a dataset with it. Bo was critical of the sampling choices they made that “messed” with the natural distribution of the data: they capped the number of reviews of prolific authors, which was probably a good idea. However, they sampled an equal number of positive and negative reviews, which set the standard that many approaches later followed and is still the norm for sentiment analysis today. A better idea might have been to stay true to the natural distribution of the data.

I found the test-of-time award session both insightful and humbling: we can derive many insights from traditional approaches and combining traditional with more or recent approaches is often a useful direction; at the same time, even the authors of landmark papers are critical of their own work and aware of its own flaws.

## Dialogue systems

Another pervasive topic at the conference was dialogue systems. On the first day, researchers from PolyAI gave an excellent tutorial on Conversational AI. On the second day, Mari Ostendorf, professor of Electrical Engineering at the University of Washington and faculty advisor to the Sounding Board team, which won the inaugural Alexa Prize competition, shared some of the secrets to their winning socialbot. A socialbot in this context is a bot with which you can have a conversation, in contrast to a personal assistant that is designed to accomplish user-specified goals.

A good socialbot should have the same qualities as someone you enjoy talking to at a cocktail party: it should have something interesting to say and show interest in the conversation partner. To illustrate this, an example conversation with the Sounding Board bot can be seen below.

With regard to saying something interesting, the team found that users react positively to learning something new but negatively to old or unpleasant news; a challenge here is to filter what to present to people. In addition, users lose interest when they receive too much content that they do not care about.

Regarding expressing interest, users appreciate an acknowledgement of their reactions and requests. While some users need encouragement to express opinions, some prompts can be annoying (“The article mentioned Google. Have you heard of Google?”).

They furthermore found that modeling prosody is important. Prosody deals with modeling the patterns of stress and intonation in speech. Instead of sounding monotonous, a bot that incorporates prosody seems more engaging, can better articulate intent, communicate sentiment or sarcasm, express empathy or enthusiasm, or change the topic. In some cases, prosody is also essential for avoiding---often hilarious---misunderstandings: for instance, the Alexa’s default voice pattern for ‘Sounding Board’ sounds like ‘Sounding Bored’.

Using a knowledge graph, the bot can have deeper conversations by staying “sort of on topic without totally staying on topic”. Mari also shared four key lessons that they learned working with 10M conversations:

### Lesson #1: ASR is imperfect

While automatic speech recognition (ASR) has reached lower and lower word error rates in recent years, ASR is far from being solved. In particular, ASR in dialogue agents is tuned for short commands, not conversational speech. Whereas it can accurately identify for instance many obscure music groups, it struggles with more conversational or informal input. In addition, current development platforms provide developers with an impoverished representation of speech that does not contain segmentation or prosody and often misses certain intents and affects. Problems caused by this missing information can be seen below.

In practice, the Sounding Board team found it helpful for the socialbot to behave similarly to attendees at a cocktail party: if the intent of the entire utterance is not completely clear, the bot responds to whatever part it understood, e.g. to a somewhat unintelligible “cause does that you’re gonna state that’s cool” it might respond “I’m happy you like that.” They also found that often, asking the conversation partner to repeat an utterance will not yield a much better result; instead, it is often better just to change the topic.

### Lesson #2: Users vary

The team discovered something what might seem obvious but has wide-ranging consequences in practice: users vary a lot across different dimensions. Users have different interests, different opinions on issues, and a different sense of humor. Interaction styles, similarly, can range from terse to verbose (seen below), from polite to rude. Users can also interact with the bot in pursuit of widely different goals: they may seek information, intend to share opinions, try to get to know the bot, or seek to explore the limitations of the bot. Modeling the user involves both determining what to say as well as listening to what the user says.

### Lesson #3: It’s a wild world

There is a lot of problematic content and many issues exist that need to be navigated. Problematic content can consist of offensive or controversial material or sensitive and depressing topics. In addition, users might act adversarially and deliberately try to get the bot to produce such content. Examples of such behaviour can be seen below. Other users (e.g. such as those suffering from a mental illness) are in turn risky to deal with. Overall, filtering content is a hard problem. As one example, Mari mentioned that early in the project, the bot made the following witty observation / joke: “You know what I realized the other day? Santa Claus is the most elaborate lie ever told”.

### Lesson #4: Shallow conversations

As the goal of the Alexa Prize was to maintain a conversation of 20 minutes, in light of the current limited understanding and generation capabilities, the team focused on a dialog strategy of shallow conversations. Even for shallow conversations, however, switching to related topics can still be fragile due to word sense ambiguities. It is notable that among the top 3 Alexa Prize teams, Deep Learning was only used by one team and only for reranking.

In total, a competition such as the Alexa Prize that brings academia and industry together is useful as it allows researchers to access data from real users at a large scale, which impacts the problems they choose to solve and the resulting solutions. It also teaches students about complete problems and real-world challenges and provides funding support for students. The team found it in particular beneficial someone from industry available to support the partnership, to provide advice on tools, and feedback on progress.

On the other hand, privacy-preserving access to user data, such as prosody info for spoken language and speaker/author demographics for text and speech still needs work. For spoken dialog systems, richer speech interfaces are furthermore necessary. Finally, while competitions are great kickstarters, they nevertheless require a substantial engineering effort.

Finally, in her keynote address on dialogue models, Dilek Hakkani-Tür, Research Scientist at Google Research, argued that over the recent years, chitchat systems and task-oriented dialogue systems have been converging. However, current production systems are essentially a walled garden of domains and only allow directed dialogue and limited personalization. Models have to be learned from developers using a limited set of APIs and tools and are hard to scale. At the same time, conversation is a skill even for humans. It is thus important to learn from real users, not from developers.

In order to learn about users, we can leverage personal knowledge graphs learned from user assertions, such as “show me directions to my daughter’s school”. Semantic recall, i.e. remembering entities from previous user interactions, e.g. “Do you remember the restaurant we ordered Asian food from?” is important. Personalized natural language understanding can also leverage data from user’s device (in a privacy-preserving manner) and employ user modeling for dialogue generation.

For learning from users, actions can be learned from user demonstration and/or explanation or from experience and feedback (mostly using RL for dialogue systems). In both cases, transcription and annotation are bottlenecks. A user can’t be expected to transcribe or annotate data; on the other hand, it is easier to give feedback after the system repeats an utterance.

Generally, task-oriented dialogue can be treated as a game between two parties (see below): the seeker has a goal, which is fixed or flexible, while the provider has access to APIs to perform the task. The dialogue policy of the seeker decides the next seeker action. This is typically determined using “user simulators”, which are often sequence-to-sequence models. Most recently, hierarchical sequence-to-sequence models have been used with a focus on following the user goals and generating diverse responses.

The dialogue policy of the provider similarly determines the next provider action, which is realized either via supervised or reinforcement learning (RL). For RL, reward estimation and policy shaping are important. Recent approaches jointly learn seeker and provider policies. End-to-end dialogue models with deep RL are critical for learning from user feedback, while component-wise training benefits from additional data for each component.

In practice, a combination of supervised and reinforcement learning is best and outperforms both purely supervised learning and supervised learning with policy-only RL as can be seen below.

Overall, the conference was a great opportunity to see fantastic research and meet great people. See you all at ACL 2018!

]]>
<![CDATA[An overview of proxy-label approaches for semi-supervised learning]]>http://ruder.io/semi-supervised/5c76ff7c1b9b0d18555b9ea7Thu, 26 Apr 2018 07:00:00 GMT

This post discusses semi-supervised learning algorithms that learn from proxy labels assigned to unlabelled data.

Note: Parts of this post are based on my ACL 2018 paper Strong Baselines for Neural Semi-supervised Learning under Domain Shift with Barbara Plank.

• Self-training
• Multi-view training
• Co-training
• Democratic Co-learning
• Tri-training
• Tri-training with disagreement
• Asymmetric tri-training
• Self-ensembling
• $$\Pi$$ model
• Temporal Ensembling
• Mean Teacher
• Related methods and areas
• Distillation
• Learning from weak supervision
• Learning with noisy labels
• Data augmentation
• Ensembling a single model

Unsupervised learning constitutes one of the main challenges for current machine learning models and one of the key elements that is missing for general artificial intelligence. While unsupervised learning on its own is still elusive, researchers have a made a lot of progress in combining unsupervised learning with supervised learning. This branch of machine learning research is called semi-supervised learning.

Semi-supervised learning has a long history. For a (slightly outdated) overview, refer to Zhu (2005) [1] and Chapelle et al. (2006) [2]. Particularly recently, semi-supervised learning has seen some success, considerably reducing the error rate on important benchmarks. Semi-supervised learning also makes an appearance in Amazon's annual letter to shareholders where it is credited with reducing the amount of labelled data needed to achieve the same accuracy improvement by $$40\times$$.

In this blog post, I will focus on a particular class of semi-supervised learning algorithms that produce proxy labels on unlabelled data, which are used as targets together with the labelled data. These proxy labels are produced by the model itself or variants of it without any additional supervision; they thus do not reflect the ground truth but might still provide some signal for learning. In a sense, these labels can be considered noisy or weak. I will highlight the connection to learning from noisy labels, weak supervision as well as other related topics in the end of this post.

This class of models is of particular interest in my opinion, as a) deep neural networks have been shown to be good at dealing with noisy labels and b) these models have achieved state-of-the-art in semi-supervised learning for computer vision. Note that many of these ideas are not new and many related methods have been developed in the past. In one half of this post, I will thus cover classic methods and discuss their relevance for current approaches; in the other half, I will discuss techniques that have recently achieved state-of-the-art performance. Some of the following approaches have been referred to as self-teaching or bootstrapping algorithms; I am not aware of a term that captures all of them, so I will simply refer to them as proxy-label methods.

I will divide these methods in three groups, which I will discuss in the following: 1) self-training, which uses a model's own predictions as proxy labels; 2) multi-view learning, which uses the predictions of models trained with different views of the data; and 3) self-ensembling, which ensembles variations of a model's own predictions and uses these as feedback for learning. I will show pseudo-code for the most important algorithms. You can find the LaTeX source here.

There are many interesting and equally important directions for semi-supervised learning that I will not cover in this post, e.g. graph-convolutional neural networks [3].

## Self-training

Self-training (Yarowsky, 1995; McClosky et al., 2006) [4] [5] is one of the earliest and simplest approaches to semi-supervised learning and the most straightforward example of how a model's own predictions can be incorporated into training. As the name implies, self-training leverages a model's own predictions on unlabelled data in order to obtain additional information that can be used during training. Typically the most confident predictions are taken at face value, as detailed next.

Formally, self-training trains a model $$m$$ on a labeled training set $$L$$ and an unlabeled data set $$U$$. At each iteration, the model provides predictions $$m(x)$$ in the form of a probability distribution over the $$C$$ classes for all unlabeled examples $$x$$ in $$U$$. If the probability assigned to the most likely class is higher than a predetermined threshold $$\tau$$, $$x$$ is added to the labeled examples with $$\DeclareMathOperator*{\argmax}{argmax} p(x) = \argmax m(x)$$ as pseudo-label. This process is generally repeated for a fixed number of iterations or until no more predictions on unlabelled examples are confident. This instantiation is the most widely used and shown in Algorithm 1.

Classic self-training has shown mixed success. In parsing it proved successful with small datasets (Reichart, and Rappoport, 2007; Huang and Harper, 2009) [6] [7] or when a generative component is used together with a reranker when more data is available (McClosky et al., 2006; Suzuki and Isozaki , 2008) [8]. Some success was achieved with careful task-specific data selection (Petrov and McDonald, 2012) [9], while others report limited success on a variety of NLP tasks (He and Zhou, 2011; Plank, 2011; Van Asch and Daelemans, 2016; van der Goot et al., 2017) [10] [11] [12] [13].

The main downside of self-training is that the model is unable to correct its own mistakes. If the model's predictions on unlabelled data are confident but wrong, the erroneous data is nevertheless incorporated into training and the model's errors are amplified. This effect is exacerbated if the domain of the unlabelled data is different from that of the labelled data; in this case, the model's confidence will be a poor predictor of its performance.

## Multi-view training

Multi-view training aims to train different models with different views of the data. Ideally, these views complement each other and the models can collaborate in improving each other's performance. These views can differ in different ways such as in the features they use, in the architectures of the models, or in the data on which the models are trained.

Co-training   Co-training (Blum and Mitchell, 1998) [14] is a classic multi-view training method, which makes comparatively strong assumptions. It requires that the data $$L$$ can be represented using two conditionally independent feature sets $$L^1$$ and $$L^2$$ and that each feature set is sufficient to train a good model. After the initial models $$m_1$$ and $$m_2$$ are trained on their respective feature sets, at each iteration, only inputs that are confident (i.e. have a probability higher than a threshold $$\tau$$) according to exactly one of the two models are moved to the training set of the other model. One model thus provides the labels to the inputs on which the other model is uncertain. Co-training can be seen in Algorithm 2.

In the original co-training paper (Blum and Mitchell, 1998), co-training is used to classify web pages using the text on the page as one view and the anchor text of hyperlinks on other pages pointing to the page as the other view. As two conditionally independent views are not always available, Chen et al. (2011) [15] propose pseudo-multiview regularization (Chen et al., 2011) in order to split the features into two mutually exclusive views so that co-training is effective. To this end, pseudo-multiview regularization constrains the models so that at least one of them has a zero weight for each feature. This is similar to the orthogonality constraint recently used in domain adaptation to encourage shared and private spaces (Bousmalis et al., 2016) [16]. A second constraint requires the models to be confident on different subsets of $$U$$. Chen et al. (2011) [17] use pseudo-multiview regularization to adapt co-training to domain adaptation.

Democratic Co-learning Rather than treating different feature sets as views, democratic co-learning (Zhou and Goldman, 2004) [18] employs models with different inductive biases. These can be different network architectures in the case of neural networks or completely different learning algorithms. Democratic co-learning first trains each model separately on the complete labelled data $$L$$. The models then make predictions on the unlabelled data $$U$$. If a majority of models confidently agree on the label of an example, the example is added to the labelled dataset. Confidence is measured in the original formulation by measuring if the sum of the mean confidence intervals $$w$$ of the models, which agreed on the label is larger than the sum of the models that disagreed. This process is repeated until no more examples are added. The final prediction is made with a majority vote weighted with the confidence intervals of the models. The full algorithm can be seen below. $$M$$ is the set of all models that predict the same label $$j$$ for an example $$x$$.

Tri-training   Tri-training (Zhou and Li, 2005) [19] is one of the best known multi-view training methods. It can be seen as an instantiation of democratic co-learning, which leverages the agreement of three independently trained models to reduce the bias of predictions on unlabeled data. The main requirement for tri-training is that the initial models are diverse. This can be achieved using different model architectures as in democratic co-learning. The most common way to obtain diversity for tri-training, however, is to obtain different variations $$S_i$$ of the original training data $$L$$ using bootstrap sampling. The three models $$m_1$$, $$m_2$$, and $$m_3$$ are then trained on these bootstrap samples, as depicted in Algorithm 4. An unlabeled data point is added to the training set of a model $$m_i$$ if the other two models $$m_j$$ and $$m_k$$ agree on its label. Training stops when the classifiers do not change anymore.

Despite having been proposed more than 10 years ago, before the advent of Deep Learning, we found in a recent paper (Ruder and Plank, 2018) [20] that classic tri-training is a strong baseline for neural semi-supervised with and without domain shift for NLP and that it outperforms even recent state-of-the-art methods.

Tri-training with disagreement   Tri-training with disagreement (Søgaard, 2010) [21] is based on the intuition that a model should only be strengthened in its weak points and that the labeled data should not be skewed by easy data points. In order to achieve this, it adds a simple modification to the original algorithm (altering line 8 in Algorithm 2), requiring that for an unlabeled data point on which $$m_j$$ and $$m_k$$ agree, the other model $$m_i$$ disagrees on the prediction. Tri-training with disagreement is more data-efficient than tri-training and has achieved competitive results on part-of-speech tagging (Søgaard, 2010).

Asymmetric tri-training   Asymmetic tri-training (Saito et al., 2017) [22] is a recently proposed extension of tri-training that achieved state-of-the-art results for unsupervised domain adaptation in computer vision. For unsupervised domain adaptation, the test data and unlabeled data are from a different domain than the labelled examples. To adapt tri-training to this shift, asymmetric tri-training learns one of the models only on proxy labels and not on labelled examples (a change to line 10 in Algorithm 4) and uses only this model to classify target domain examples at test time. In addition, all three models share the same feature extractor.

Multi-task tri-training   Tri-training typically relies on training separate models on bootstrap samples of a potentially large amount of training data, which is expensive. Multi-task tri-training (MT-Tri) (Ruder and Plank, 2018) aims to reduce both the time and space complexity of tri-training by leveraging insights from multi-task learning (MTL) (Caruana, 1993) [23] to share knowledge across models and accelerate training. Rather than storing and training each model separately, MT-Tri shares the parameters of the models and trains them jointly using MTL. Note that the model does only pseudo MTL as all three models effectively perform the same task.

The output softmax layers are model-specific and are only updated for the input of the respective model. As the models leverage a joint representation, diversity is even more crucial. We need to ensure that the features used for prediction in the softmax layers of the different models are as diverse as possible, so that the models can still learn from each other's predictions. In contrast, if the parameters in all output softmax layers were the same, the method would degenerate to self-training. Similar to pseudo-view regularization, we thus use an orthogonality constraint (Bousmalis et al., 2016) on two of the three softmax output layers as an additional loss term.

The pseudo-code can be seen below. In contrast to classic tri-training, we can train the multi-task model with its three model-specific outputs jointly and without bootstrap sampling on the labeled source domain data until convergence, as the orthogonality constraint enforces different representations between models $$m_1$$ and $$m_2$$. From this point, we can leverage the pair-wise agreement of two output layers to add pseudo-labeled examples as training data to the third model. We train the third output layer $$m_3$$ only on pseudo-labeled target instances in order to make tri-training more robust to a domain shift. For the final prediction, we use majority voting of all three output layers. For more information about multi-task tri-training, self-training, other tri-training variants, you can refer to our recent ACL 2018 paper.

## Self-ensembling

Self-ensembling methods are very similar to multi-view learning approaches in that they combine different variants of a model. Multi-task tri-training, for instance, can also be seen as a self-ensembling method where different variations of a model are used to create a stronger ensemble prediction. In contrast to multi-view learning, diversity is not a key concern. Self-ensembling approaches mostly use a single model under different configurations in order to make the model's predictions more robust. Most of the following methods are very recent and several have achieved state-of-the-art results in computer vision.

Ladder networks   The $$\Gamma$$ (gamma) version of Ladder Networks (Rasmus et al., 2015) [24] aims to make a model more robust to noise. For each unlabelled example, it uses the model's prediction on the clean example as a proxy label for prediction on a perturbed version of the example. This way, the model learns to develop features that are invariant to noise and predictive of the labels on the labelled training data. Ladder networks have been mostly used in computer vision where many forms of perturbation and data augmentation are available.

Virtual Adversarial Training   If perturbing the original sample is not possible or desired, we can instead perturb the example in feature space. Rather than randomly perturbing it by e.g. adding dropout, we can apply the worst possible perturbation for the model, which transforms the input into an adversarial sample. While adversarial training requires access to the labels to perform these perturbations, virtual adversarial training (Miyato et al., 2017) [25] requires no labels and is thus suitable for semi-supervised learning. Virtual adversarial training effectively seeks to make the model robust to perturbations in directions to which it is most sensitive and has achieved good results on text classification datasets.

$$\Pi$$ model   Rather than treating clean predictions as proxy labels, the $$\Pi$$ (pi) model (Laine and Aila, 2017) [26] ensembles the predictions of the model under two different perturbations of the input data and two different dropout conditions $$z$$ and $$\tilde{z}$$. The full pseudo-code can be seen in Algorithm 6 below. $$g(x)$$ is the stochastic input augmentation function. The first loss term encourages the predictions under the two different noise settings to be consistent, with $$\lambda$$ determining the contribution, while the second loss term is the standard cross-entropy loss $$H$$ with respect to the label $$y$$. In contrast to the models we encountered before, we apply the unsupervised loss component to both unlabelled and labelled examples.

Temporal Ensembling   Instead of ensembling over the same model under different noise configurations, we can ensemble over different models. As training separate models is expensive, we can instead ensemble the predictions of a model at different timesteps. We can save the ensembled proxy labels $$Z$$ as an exponential moving average of the model's past predictions on all examples as depicted below in order to save space. As we initialize the proxy labels as a zero vector, they are biased towards $$0$$. We can correct this bias similar to Adam (Kingma and Ba, 2015) [27] based on the current epoch $$t$$ to obtain bias-corrected target vectors $$\tilde{z}$$. We then update the model similar to the $$\Pi$$ model.

Mean Teacher   Finally, instead of averaging the predictions of our model over training time, we can average the model weights. Mean teacher (Tarvainen and Valpola, 2017) [28] stores an exponential moving average of the model parameters. For every example, this mean teacher model is then used to obtain proxy labels $$\tilde{z}$$. The consistency loss and supervised loss are computed as in temporal ensembling.

Mean teacher has achieved state-of-the-art results for semi-supervised learning for computer vision. For reference, on ImageNet with 10% of the labels, it achieves an error rate of $$9.11$$, compared to an error rate of $$3.79$$ using all labels with the state-of-the-art. For more information about self-ensembling methods, have a look at this intuitive blog post by the Curious AI company. We have run experiments with temporal ensembling for NLP tasks, but did not manage to obtain consistent results. My assumption is that the unsupervised consistency loss is more suitable for continuous inputs. Mean teacher might work better, as averaging weights aka Polyak averaging (Polyak and Juditsky, 1992) [29] is a tried method for accelerating optimization.

Very recently, Oliver et al. (2018) [30] raise some questions regarding the true applicability of these methods: They find that the performance difference to a properly tuned supervised baseline is smaller than typically reported, that transfer learning from a labelled dataset (e.g. ImageNet) outperforms the presented methods, and that performance degrades severely under a domain shift. In order to deal with the latter, algorithms such as asymmetric or multi-task tri-training learn different representations for the target distribution. It remains to be seen if these insights translate to other domains; a combination of transfer learning and semi-supervised adaptation to the target domain seems particularly promising.

## Related methods and areas

Distillation   Proxy-label approaches can be seen as different forms of distillation (Hinton et al., 2015) [31]. Distillation was originally conceived as a method to compress the information of a large model or an ensemble in a smaller model. In the standard setup, a typically large and fully trained teacher model provides proxy targets for a student model, which is generally smaller and faster. Self-learning is akin to distillation without a teacher, where the student is left to learn by themselves and with no-one to correct its mistakes. For multi-view learning, different models work together to teach each other, alternately acting as both teachers and students. Self-ensembling, finally, has one model assuming the dual role of teacher and student: As a teacher, it generates new targets, which are then incorporated by itself as a student for learning.

Learning from weak supervision   Learning from weak supervision, as the name implies, can be seen as a weaker form of supervised learning or alternatively as a stronger form of semi-supervised learning: While supervised learning provides us with labels that we know to be correct and semi-supervised learning only provides us with a small set of labelled examples, weak supervision allows us to obtain labels that we know to be noisy for the unlabelled data as a further signal for learning. Typically, the weak annotator is an unsupervised method that is very different from the model we use for learning the task. For sentiment analysis, this could be a simple lexicon-based method [32]. Many of the presented methods could be extended to the weak supervision setting by incorporating the weak labels as feedback. Self-ensembling methods, for instance, might employ another teacher model that gauges the quality of weakly annotated examples similar to Deghani et al. (2018) [33]. For an overview of weak supervision, have a look at this blog post by Stanford's Hazy Research group.

Learning with noisy labels   Learning with noisy labels is similar to learning from weak supervision. In both cases, labels are available that cannot be completely trusted. For learning with noisy labels, labels are typically assumed to be permuted with a fixed random permutation. While proxy-label approaches supply the noisy labels themselves, when learning with noisy labels, the labels are part of the data. Similar to learning from weak supervision, we can try to model the noise to assess the quality of the labels (Sukhbaatar et al., 2015) [34]. Similar to self-ensembling methods, we can enforce consistency between the model's preditions and the proxy labels (Reed et al., 2015) [35].

Data augmentation   Several self-ensembling methods employ data augmentation to enforce consistency between model predictions under different noise settings. Data augmentation is mostly used in computer vision, but noise in the form of different dropout masks can also be applied to the model parameters as in the $$\Pi$$ model and has also been used in LSTMs (Zolna et al., 2018) [36]. While regularization in the form of dropout, batch normalization, etc. can be used when labels are available in order to make predictions more robust, a consistency loss is required in the case without labels. For supervised learning, adversarial training can be employed to obtain adversarial examples and has been used successfully e.g. for part-of-speech tagging (Yasunaga et al., 2018) [37].

Ensembling a single model   The discussed self-ensembling methods all employ ensemble predictions not just to make predictions more robust, but as feedback to improve the model itself during training in a self-reinforcing loop. In the supervised setting, this feedback might not be necessary; ensembling a single model is still useful, however, to save time compared to training multiple models. Two methods that have been proposed to ensemble a model from a single training run are checkpoint ensembles and snapshot ensembles. Checkpoint ensembles (Sennrich et al., 2016) [38] ensemble the last $$n$$ checkpoints of a single training run and have been used to achieve state-of-the-art in machine translation. Snapshot ensembles (Huang et al., 2017) [39] ensemble models converged to different minima during a training run and have been used to achieve state-of-the-art in object recognition.

## Conclusion

I hope this post was able to give you an insight into a part of the semi-supervised learning landscape that seems to be particularly useful to improve the performance of current models. While learning completely without labelled data is unrealistic at this point, semi-supervised learning enables us to augment our small labelled datasets with large amounts of available unlabelled data. Most of the discussed methods are promising in that they treat the model as a black box and can thus be used with any existing supervised learning model. As always, if you have any questions or noticed any mistakes, feel free to write a comment in the comments section below.

1. Zhu, X. (2005). Semi-Supervised Learning Literature Survey. ↩︎

2. Chapelle, O., Schölkopf, B., & Zien, A. (2006). Semi-Supervised Learning. Interdisciplinary sciences computational life sciences (Vol. 1). http://doi.org/10.1007/s12539-009-0016-2 ↩︎

3. Kipf, T. N., & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. Proceedings of ICLR 2017. ↩︎

4. Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics (pp. 189-196). Association for Computational Linguistics. ↩︎

5. McClosky, D., Charniak, E., & Johnson, M. (2006). Effective self-training for parsing. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, 152–159. ↩︎

6. Reichart, R., & Rappoport, A. (2007). Self-training for enhancement and domain adaptation of statistical parsers trained on small datasets. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 616-623) ↩︎

7. Huang, Z., & Harper, M. (2009). Self-training PCFG grammars with latent annotations across languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2 (pp. 832-841). Association for Computational Linguistics. ↩︎

8. Suzuki, J., & Isozaki, H. (2008). Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. Proceedings of ACL-08: HLT, 665-673. ↩︎

9. Petrov, S., & McDonald, R. (2012). Overview of the 2012 shared task on parsing the web. In Notes of the first workshop on syntactic analysis of non-canonical language (sancl) (Vol. 59). ↩︎

10. He, Y., & Zhou, D. (2011). Self-training from labeled features for sentiment analysis. Information Processing & Management, 47(4), 606-616. ↩︎

11. Plank, B. (2011). Domain adaptation for parsing. University Library Groniongen][Host]. ↩︎

12. Van Asch, V., & Daelemans, W. (2016). Predicting the Effectiveness of Self-Training: Application to Sentiment Classification. arXiv preprint arXiv:1601.03288. ↩︎

13. van der Goot, R., Plank, B., & Nissim, M. (2017). To normalize, or not to normalize: The impact of normalization on part-of-speech tagging. arXiv preprint arXiv:1707.05116. ↩︎

14. Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory (pp. 92-100). ACM. ↩︎

15. Chen, M., Weinberger, K. Q., & Chen, Y. (2011). Automatic Feature Decomposition for Single View Co-training. Proceedings of the 28th International Conference on Machine Learning (ICML-11), 953–960. ↩︎

16. Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., & Erhan, D. (2016). Domain Separation Networks. In Advances in Neural Information Processing Systems. ↩︎

17. Chen, M., Weinberger, K. Q., & Blitzer, J. C. (2011). Co-Training for Domain Adaptation. In Advances in Neural Information Processing Systems. ↩︎

18. Zhou, Y., & Goldman, S. (2004). Democratic Co-Learning. In 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2004. ↩︎

19. Zhou, Z.-H., & Li, M. (2005). Tri-Training: Exploiting Unlabled Data Using Three Classifiers. IEEE Trans.Data Eng., 17(11), 1529–1541. http://doi.org/10.1109/TKDE.2005.186 ↩︎

20. Ruder, S., & Plank, B. (2018). Strong Baselines for Neural Semi-supervised Learning under Domain Shift. In Proceedings of ACL 2018. ↩︎

21. Søgaard, A. (2010). Simple semi-supervised training of part-of-speech taggers. Proceedings of the ACL 2010 Conference Short Papers. ↩︎

23. Caruana, R. (1993). Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning. ↩︎

24. Rasmus, A., Valpola, H., Honkala, M., Berglund, M., & Raiko, T. (2015). Semi-Supervised Learning with Ladder Network. arXiv Preprint arXiv:1507.02672. Retrieved from http://arxiv.org/abs/1507.02672 ↩︎

25. Miyato, T., Dai, A. M., & Goodfellow, I. (2017). Adversarial Training Methods for Semi-supervised Text Classification. In Proceedings of ICLR 2017. ↩︎

26. Laine, S., & Aila, T. (2017). Temporal Ensembling for Semi-Supervised Learning. In Proceedings of ICLR 2017. ↩︎

27. Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations. ↩︎

28. Tarvainen, A., & Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems. Retrieved from http://arxiv.org/abs/1703.01780 ↩︎

29. Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838-855. ↩︎

30. Oliver, A., Odena, A., Raffel, C., Cubuk, E. D., & Goodfellow, I. J. (2018). Realistic Evaluation of Semi-Supervised Learning Algorithms. arXiv preprint arXiv:1804.09170. ↩︎

31. Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv Preprint arXiv:1503.02531. https://doi.org/10.1063/1.4931082 ↩︎

32. Kiritchenko, S., Zhu, X., & Mohammad, S. M. (2014). Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50, 723-762. ↩︎

33. Dehghani, M., Mehrjou, A., Gouws, S., Kamps, J., & Schölkopf, B. (2018). Fidelity-Weighted Learning. In Proceedings of ICLR 2018. Retrieved from http://arxiv.org/abs/1711.02799 ↩︎

34. Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L., & Fergus, R. (2015). Training Convolutional Networks with Noisy Labels. Workshop Track - ICLR 2015. Retrieved from http://arxiv.org/abs/1406.2080 ↩︎

35. Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., & Rabinovich, A. (2015). Training Deep Neural Networks on Noisy Labels with Bootstrapping. ICLR 2015 Workshop Track. Retrieved from http://arxiv.org/abs/1412.6596 ↩︎

36. Zolna, K., Arpit, D., Suhubdy, D., & Bengio, Y. (2018). Fraternal Dropout. In Proceedings of ICLR 2018. Retrieved from http://arxiv.org/abs/1711.00066 ↩︎

37. Yasunaga, M., Kasai, J., & Radev, D. (2018). Robust Multilingual Part-of-Speech Tagging via Adversarial Training. In Proceedings of NAACL 2018. Retrieved from http://arxiv.org/abs/1711.04903 ↩︎

38. Sennrich, R., Haddow, B., & Birch, A. (2016). Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891. ↩︎

39. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). Snapshot Ensembles: Train 1, get M for free. In Proceedings of ICLR 2017. ↩︎

]]>
<![CDATA[Text Classification with TensorFlow Estimators]]>http://ruder.io/text-classification-tensorflow-estimators/5c76ff7c1b9b0d18555b9ea9Mon, 16 Apr 2018 13:24:29 GMT

This post is a tutorial on how to use TensorFlow Estimators for text classification.

Note: This post was written together with the awesome Julian Eisenschlos and was originally published on the TensorFlow blog.

Hello there! Throughout this post we will show you how to classify text using Estimators in TensorFlow. Here’s the outline of what we’ll cover:

• Building baselines using pre-canned estimators.
• Using word embeddings.
• Building custom estimators with convolution and LSTM layers.
• Evaluating and comparing models using TensorBoard.

Welcome to Part 4 of a blog series that introduces TensorFlow Datasets and Estimators. You don’t need to read all of the previous material, but take a look if you want to refresh any of the following concepts. Part 1 focused on pre-made Estimators, Part 2 discussed feature columns, and Part 3 how to create custom Estimators.

Here in Part 4, we will build on top of all the above to tackle a different family of problems in Natural Language Processing (NLP). In particular, this article demonstrates how to solve a text classification task using custom TensorFlow estimators, embeddings, and the tf.layers module. Along the way, we’ll learn about word2vec and transfer learning as a technique to bootstrap model performance when labeled data is a scarce resource.

We will show you relevant code snippets. Here’s the complete Jupyter Notebook that you can run locally or on Google Colaboratory. The plain .py source file is also available here. Note that the code was written to demonstrate how Estimators work functionally and was not optimized for maximum performance.

The dataset we will be using is the IMDB Large Movie Review Dataset, which consists of $25,000$ highly polar movie reviews for training, and $25,000$ for testing. We will use this dataset to train a binary classification model, able to predict whether a review is positive or negative.

For illustration, here’s a piece of a negative review (with $2$ stars) in the dataset:

Now, I LOVE Italian horror films. The cheesier they are, the better. However, this is not cheesy Italian. This is week-old spaghetti sauce with rotting meatballs. It is amateur hour on every level. There is no suspense, no horror, with just a few drops of blood scattered around to remind you that you are in fact watching a horror film.

Keras provides a convenient handler for importing the dataset which is also available as a serialized numpy array .npz file to download here. For text classification, it is standard to limit the size of the vocabulary to prevent the dataset from becoming too sparse and high dimensional, causing potential overfitting. For this reason, each review consists of a series of word indexes that go from $4$ (the most frequent word in the dataset the) to $4999$, which corresponds to orange. Index $1$ represents the beginning of the sentence and the index $2$ is assigned to all unknown (also known as out-of-vocabulary or OOV) tokens. These indexes have been obtained by pre-processing the text data in a pipeline that cleans, normalizes and tokenizes each sentence first and then builds a dictionary indexing each of the tokens by frequency.

After we’ve loaded the data in memory we pad each of the sentences with $0$ so that we have two $25000 \times 200$ arrays for training and testing respectively.

vocab_size = 5000
sentence_size = 200
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size)
x_train_variable,
maxlen=sentence_size,
value=0)
x_test_variable,
maxlen=sentence_size,
value=0)


### Input Functions

The Estimator framework uses input functions to split the data pipeline from the model itself. Several helper methods are available to create them, whether your data is in a .csv file, or in a pandas.DataFrame, whether it fits in memory or not. In our case, we can use Dataset.from_tensor_slices for both the train and test sets.

x_len_train = np.array([min(len(x), sentence_size) for x in x_train_variable])
x_len_test = np.array([min(len(x), sentence_size) for x in x_test_variable])

def parser(x, length, y):
features = {"x": x, "len": length}
return features, y

def train_input_fn():
dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
dataset = dataset.shuffle(buffer_size=len(x_train_variable))
dataset = dataset.batch(100)
dataset = dataset.map(parser)
dataset = dataset.repeat()
iterator = dataset.make_one_shot_iterator()
return iterator.get_next()

def eval_input_fn():
dataset = tf.data.Dataset.from_tensor_slices((x_test, x_len_test, y_test))
dataset = dataset.batch(100)
dataset = dataset.map(parser)
iterator = dataset.make_one_shot_iterator()
return iterator.get_next()


We shuffle the training data and do not predefine the number of epochs we want to train, while we only need one epoch of the test data for evaluation. We also add an additional "len" key that captures the length of the original, unpadded sequence, which we will use later.

### Building a baseline

It’s good practice to start any machine learning project trying basic baselines. The simpler the better as having a simple and robust baseline is key to understanding exactly how much we are gaining in terms of performance by adding extra complexity. It may very well be the case that a simple solution is good enough for our requirements.

With that in mind, let us start by trying out one of the simplest models for text classification. That would be a sparse linear model that gives a weight to each token and adds up all of the results, regardless of the order. As this model does not care about the order of words in a sentence, we normally refer to it as a Bag-of-Words approach. Let’s see how we can implement this model using an Estimator.

We start out by defining the feature column that is used as input to our classifier. As we have seen in Part 2, categorical_column_with_identity is the right choice for this pre-processed text input. If we were feeding raw text tokens other feature_columns could do a lot of the pre-processing for us. We can now use the pre-made LinearClassifier.

column = tf.feature_column.categorical_column_with_identity('x', vocab_size)
classifier = tf.estimator.LinearClassifier(
feature_columns=[column],
model_dir=os.path.join(model_dir, 'bow_sparse'))


Finally, we create a simple function that trains the classifier and additionally creates a precision-recall curve. As we do not aim to maximize performance in this blog post, we only train our models for 25,000 steps.

def train_and_evaluate(classifier):
classifier.train(input_fn=train_input_fn, steps=25000)
eval_results = classifier.evaluate(input_fn=eval_input_fn)
predictions = np.array([p['logistic'][0] for p in classifier.predict(input_fn=eval_input_fn)])
tf.reset_default_graph()
# Add a PR summary in addition to the summaries that the classifier writes
pr = summary_lib.pr_curve('precision_recall', predictions=predictions, labels=y_test.astype(bool), num_thresholds=21)
with tf.Session() as sess:
writer = tf.summary.FileWriter(os.path.join(classifier.model_dir, 'eval'), sess.graph)
writer.close()
train_and_evaluate(classifier)


One of the benefits of choosing a simple model is that it is much more interpretable. The more complex a model, the harder it is to inspect and the more it tends to work like a black box. In this example, we can load the weights from our model’s last checkpoint and take a look at what tokens correspond to the biggest weights in absolute value. The results look like what we would expect.

# Load the tensor with the model weights
weights = classifier.get_variable_value('linear/linear_model/x/weights').flatten()
# Find biggest weights in absolute value
extremes = np.concatenate((sorted_indexes[-8:], sorted_indexes[:8]))
# word_inverted_index is a dictionary that maps from indexes back to tokens
extreme_weights = sorted(
[(weights[i], word_inverted_index[i - index_offset]) for i in extremes])
# Create plot
y_pos = np.arange(len(extreme_weights))
plt.bar(y_pos, [pair[0] for pair in extreme_weights], align='center', alpha=0.5)
plt.xticks(y_pos, [pair[1] for pair in extreme_weights], rotation=45, ha='right')
plt.ylabel('Weight')
plt.title('Most significant tokens')
plt.show()


As we can see, tokens with the most positive weight such as ‘refreshing’ are clearly associated with positive sentiment, while tokens that have a large negative weight unarguably evoke negative emotions. A simple but powerful modification that one can do to improve this model is weighting the tokens by their tf-idf scores.

### Embeddings

The next step of complexity we can add are word embeddings. Embeddings are a dense low-dimensional representation of sparse high-dimensional data. This allows our model to learn a more meaningful representation of each token, rather than just an index. While an individual dimension is not meaningful, the low-dimensional space—when learned from a large enough corpus—has been shown to capture relations such as tense, plural, gender, thematic relatedness, and many more. We can add word embeddings by converting our existing feature column into an embedding_column. The representation seen by the model is the mean of the embeddings for each token (see the combiner argument in the docs). We can plug in the embedded features into a pre-canned DNNClassifier.

A note for the keen observer: an embedding_column is just an efficient way of applying a fully connected layer to the sparse binary feature vector of tokens, which is multiplied by a constant depending of the chosen combiner. A direct consequence of this is that it wouldn’t make sense to use an embedding_column directly in a LinearClassifier because two consecutive linear layers without non-linearities in between add no prediction power to the model, unless of course the embeddings are pre-trained.

embedding_size = 50
word_embedding_column = tf.feature_column.embedding_column(
column, dimension=embedding_size)
classifier = tf.estimator.DNNClassifier(
hidden_units=[100],
feature_columns=[word_embedding_column],
model_dir=os.path.join(model_dir, 'bow_embeddings'))
train_and_evaluate(classifier)


We can use TensorBoard to visualize our $50$ dimensional word vectors projected into $\mathbb{R}^3$ using t-SNE. We expect similar words to be close to each other. This can be a useful way to inspect our model weights and find unexpected behaviours.

### Convolutions

At this point one possible approach would be to go deeper, further adding more fully connected layers and playing around with layer sizes and training functions. However, by doing that we would add extra complexity and ignore important structure in our sentences. Words do not live in a vacuum and meaning is compositional, formed by words and its neighbors.

Convolutions are one way to take advantage of this structure, similar to how we can model salient clusters of pixels for image classification. The intuition is that certain sequences of words, or n-grams, usually have the same meaning regardless of their overall position in the sentence. Introducing a structural prior via the convolution operation allows us to model the interaction between neighboring words and consequently gives us a better way to represent such meaning.

The following image shows how a filter matrix $F \in \mathbb{R}^{d\times m}$ tri-gram window of tokens to build a new feature map. Afterwards a pooling layer is usually applied to combine adjacent results.

Source: Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks by Severyn et al. [2015]

Let us look at the full model architecture. The use of dropout layers is a regularization technique that makes the model less likely to overfit.

### Creating a custom estimator

As seen in previous blog posts, the tf.estimator framework provides a high-level API for training machine learning models, defining train(), evaluate() and predict() operations, handling checkpointing, loading, initializing, serving, building the graph and the session out of the box. There is a small family of pre-made estimators, like the ones we used earlier, but it’s most likely that you will need to build your own.

Writing a custom estimator means writing a model_fn(features, labels, mode, params) that returns an EstimatorSpec. The first step will be mapping the features into our embedding layer:

input_layer = tf.contrib.layers.embed_sequence(
features['x'],
vocab_size,
embedding_size,
initializer=params['embedding_initializer'])


Then we use tf.layers to process each output sequentially.

training = (mode == tf.estimator.ModeKeys.TRAIN)
dropout_emb = tf.layers.dropout(inputs=input_layer,
rate=0.2,
training=training)
conv = tf.layers.conv1d(
inputs=dropout_emb,
filters=32,
kernel_size=3,
activation=tf.nn.relu)
pool = tf.reduce_max(input_tensor=conv, axis=1)
hidden = tf.layers.dense(inputs=pool, units=250, activation=tf.nn.relu)
dropout = tf.layers.dropout(inputs=hidden, rate=0.2, training=training)
logits = tf.layers.dense(inputs=dropout_hidden, units=1)


Finally, we will use a Head to simplify the writing of our last part of the model_fn. The head already knows how to compute predictions, loss, train_op, metrics and export outputs, and can be reused across models. This is also used in the pre-made estimators and provides us with the benefit of a uniform evaluation function across all of our models. We will use binary_classification_head, which is a head for single label binary classification that uses sigmoid_cross_entropy_with_logits as the loss function under the hood.

head = tf.contrib.estimator.binary_classification_head()
def _train_op_fn(loss):
tf.summary.scalar('loss', loss)
return optimizer.minimize(
loss=loss,
global_step=tf.train.get_global_step())

features=features,
labels=labels,
mode=mode,
logits=logits,
train_op_fn=_train_op_fn)


Running this model is just as easy as before:

initializer = tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
params = {'embedding_initializer': initializer}
cnn_classifier = tf.estimator.Estimator(model_fn=model_fn,
model_dir=os.path.join(model_dir, 'cnn'),
params=params)
train_and_evaluate(cnn_classifier)


### LSTM Networks

Using the Estimator API and the same model head, we can also create a classifier that uses a Long Short-Term Memory (LSTM) cell instead of convolutions. Recurrent models such as this are some of the most successful building blocks for NLP applications. An LSTM processes the entire document sequentially, recursing over the sequence with its cell while storing the current state of the sequence in its memory.

One of the drawbacks of recurrent models compared to CNNs is that, because of the nature of recursion, models turn out deeper and more complex, which usually produces slower training time and worse convergence. LSTMs (and RNNs in general) can suffer convergence issues like vanishing or exploding gradients, that said, with sufficient tuning they can obtain state-of-the-art results for many problems. As a rule of thumb CNNs are good at feature extraction, while RNNs excel at tasks that depend on the meaning of the whole sentence, like question answering or machine translation.

Each cell processes one token embedding at a time updating its internal state based on a differentiable computation that depends on both the embedding vector $x_t$ and the previous state $h_{t-1}$. In order to get a better understanding of how LSTMs work, you can refer to Chris Olah’s blog post.

Source: Understanding LSTM Networks by Chris Olah

The complete LSTM model can be expressed by the following simple flowchart:

In the beginning of this post, we padded all documents up to $200$ tokens, which is necessary to build a proper tensor. However, when a document contains fewer than $200$ words, we don’t want the LSTM to continue processing padding tokens as it does not add information and degrades performance. For this reason, we additionally want to provide our network with the length of the original sequence before it was padded. Internally, the model then copies the last state through to the sequence’s end. We can do this by using the "len" feature in our input functions. We can now use the same logic as above and simply replace the convolutional, pooling, and flatten layers with our LSTM cell.

lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(100)
_, final_states = tf.nn.dynamic_rnn(
lstm_cell, inputs, sequence_length=features['len'], dtype=tf.float32)
logits = tf.layers.dense(inputs=final_states.h, units=1)


### Pre-trained vectors

Most of the models that we have shown before rely on word embeddings as a first layer. So far, we have initialized this embedding layer randomly. However, much previous work has shown that using embeddings pre-trained on a large unlabeled corpus as initialization is beneficial, particularly when training on only a small number of labeled examples. The most popular pre-trained embedding is word2vec. Leveraging knowledge from unlabeled data via pre-trained embeddings is an instance of transfer learning.

To this end, we will show you how to use them in an Estimator. We will use the pre-trained vectors from another popular model, GloVe.

embeddings = {}
with open('glove.6B.50d.txt', 'r', encoding='utf-8') as f:
for line in f:
values = line.strip().split()
w = values[0]
vectors = np.asarray(values[1:], dtype='float32')
embeddings[w] = vectors


After loading the vectors into memory from a file we store them as a numpy.array using the same indexes as our vocabulary. The created array is of shape (5000, 50). At every row index, it contains the 50-dimensional vector representing the word at the same index in our vocabulary.

embedding_matrix = np.random.uniform(-1, 1, size=(vocab_size, embedding_size))
for w, i in word_index.items():
v = embeddings.get(w)
if v is not None and i < vocab_size:
embedding_matrix[i] = v


Finally, we can use a custom initializer function and pass it in the params object to our cnn_model_fn , without any modifications.

def my_initializer(shape=None, dtype=tf.float32, partition_info=None):
assert dtype is tf.float32
return embedding_matrix
params = {'embedding_initializer': my_initializer}
cnn_pretrained_classifier = tf.estimator.Estimator(
model_fn=cnn_model_fn,
model_dir=os.path.join(model_dir, 'cnn_pretrained'),
params=params)
train_and_evaluate(cnn_pretrained_classifier)


### Running TensorBoard

Now we can launch TensorBoard and see how the different models we’ve trained compare against each other in terms of training time and performance.

In a terminal, we run

> tensorboard --logdir={model_dir}


We can visualize many metrics collected while training and testing, including the loss function values of each model at each training step, and the precision-recall curves. This is of course most useful to select which model works best for our use-case as well as how to choose classification thresholds.

### Getting Predictions

To obtain predictions on new sentences we can use the predict method in the Estimator instances, which will load the latest checkpoint for each model and evaluate on the unseen examples. But before passing the data into the model we have to clean up, tokenize and map each token to the corresponding index as we see below.

def text_to_index(sentence):
# Remove punctuation characters except for the apostrophe
translator = str.maketrans('', '', string.punctuation.replace("'", ''))
tokens = sentence.translate(translator).lower().split()
return np.array([1] + [word_index[t] + index_offset if t in word_index else 2 for t in tokens])

def print_predictions(sentences, classifier):
indexes = [text_to_index(sentence) for sentence in sentences]
maxlen=sentence_size,
value=-1)
length = np.array([min(len(x), sentence_size) for x in indexes])
predict_input_fn = tf.estimator.inputs.numpy_input_fn(x={"x": x, "len": length}, shuffle=False)
predictions = [p['logistic'][0] for p in classifier.predict(input_fn=predict_input_fn)]
print(predictions)


It is worth noting that the checkpoint itself is not sufficient to make predictions; the actual code used to build the estimator is necessary as well in order to map the saved weights to the corresponding tensors. It’s a good practice to associate saved checkpoints with the branch of code with which they were created.

If you are interested in exporting the models to disk in a fully recoverable way, you might want to look into the SavedModel class, which is especially useful for serving your model through an API using TensorFlow Serving.

### Summary

In this blog post, we explored how to use estimators for text classification, in particular for the IMDB Reviews Dataset. We trained and visualized our own embeddings, as well as loaded pre-trained ones. We started from a simple baseline and made our way to convolutional neural networks and LSTMs.

For more details, be sure to check out:

]]>
<![CDATA[Requests for Research]]>http://ruder.io/requests-for-research/5c76ff7c1b9b0d18555b9ea8Sun, 04 Mar 2018 15:00:00 GMT

This post aims to provide inspiration and ideas for research directions to junior researchers and those trying to get into research.

It can be hard to find compelling topics to work on and know what questions are interesting to ask when you are just starting as a researcher in a new field. Machine learning research in particular moves so fast these days that it is difficult to find an opening.

This post aims to provide inspiration and ideas for research directions to junior researchers and those trying to get into research. It gathers a collection of research topics that are interesting to me, with a focus on NLP and transfer learning. As such, they might obviously not be of interest to everyone. If you are interested in Reinforcement Learning, OpenAI provides a selection of interesting RL-focused research topics. In case you'd like to collaborate with others or are interested in a broader range of topics, have a look at the Artificial Intelligence Open Network.

Most of these topics are not thoroughly thought out yet; in many cases, the general description is quite vague and subjective and many directions are possible. In addition, most of these are not low-hanging fruit, so serious effort is necessary to come up with a solution. I am happy to provide feedback with regard to any of these, but will not have time to provide more detailed guidance unless you have a working proof-of-concept. I will update this post periodically with new research directions and advances in already listed ones. Note that this collection does not attempt to review the extensive literature but only aims to give a glimpse of a topic; consequently, the references won't be comprehensive.

I hope that this collection will pique your interest and serve as inspiration for your own research agenda.

## Task-independent data augmentation for NLP

Data augmentation aims to create additional training data by producing variations of existing training examples through transformations, which can mirror those encountered in the real world. In Computer Vision (CV), common augmentation techniques are mirroring, random cropping, shearing, etc. Data augmentation is super useful in CV. For instance, it has been used to great effect in AlexNet (Krizhevsky et al., 2012) [1] to combat overfitting and in most state-of-the-art models since. In addition, data augmentation makes intuitive sense as it makes the training data more diverse and should thus increase a model's generalization ability.

However, in NLP, data augmentation is not widely used. In my mind, this is for two reasons:

1. Data in NLP is discrete. This prevents us from applying simple transformations directly to the input data. Most recently proposed augmentation methods in CV focus on such transformations, e.g. domain randomization (Tobin et al., 2017) [2].
2. Small perturbations may change the meaning. Deleting a negation may change a sentence's sentiment, while modifying a word in a paragraph might inadvertently change the answer to a question about that paragraph. This is not the case in CV where perturbing individual pixels does not change whether an image is a cat or dog and even stark changes such as interpolation of different images can be useful (Zhang et al., 2017) [3].

Existing approaches that I am aware of are either rule-based (Li et al., 2017) [4] or task-specific, e.g. for parsing (Wang and Eisner, 2016) [5] or zero-pronoun resolution (Liu et al., 2017) [6]. Xie et al. (2017) [7] replace words with samples from different distributions for language modelling and Machine Translation. Recent work focuses on creating adversarial examples either by replacing words or characters (Samanta and Mehta, 2017; Ebrahimi et al., 2017) [8] [9], concatenation (Jia and Liang, 2017) [10], or adding adversarial perturbations (Yasunaga et al., 2017) [11]. An adversarial setup is also used by Li et al. (2017) [12] who train a system to produce sequences that are indistinguishable from human-generated dialogue utterances.

Back-translation (Sennrich et al., 2015; Sennrich et al., 2016) [13] [14] is a common data augmentation method in Machine Translation (MT) that allows us to incorporate monolingual training data. For instance, when training a EN$$\rightarrow$$FR system, monolingual French text is translated to English using an FR$$\rightarrow$$EN system; the synthetic parallel data can then be used for training. Back-translation can also be used for paraphrasing (Mallinson et al., 2017) [15]. Paraphrasing has been used for data augmentation for QA (Dong et al., 2017) [16], but I am not aware of its use for other tasks.

Another method that is close to paraphrasing is generating sentences from a continuous space using a variational autoencoder (Bowman et al., 2016; Guu et al., 2017) [17] [18]. If the representations are disentangled as in (Hu et al., 2017) [19], then we are also not too far from style transfer (Shen et al., 2017) [20].

There are a few research directions that would be interesting to pursue:

1. Evaluation study: Evaluate a range of existing data augmentation methods as well as techniques that have not been widely used for augmentation such as paraphrasing and style transfer on a diverse range of tasks including text classification and sequence labelling. Identify what types of data augmentation are robust across task and which are task-specific. This could be packaged as a software library to make future benchmarking easier (think CleverHans for NLP).
2. Data augmentation with style transfer: Investigate if style transfer can be used to modify various attributes of training examples for more robust learning.
3. Learn the augmentation: Similar to Dong et al. (2017) we could learn either to paraphrase or to generate transformations for a particular task.
4. Learn a word embedding space for data augmentation: A typical word embedding space clusters synonyms and antonyms together; using nearest neighbours in this space for replacement is thus infeasible. Inspired by recent work (Mrkšić et al., 2017) [21], we could specialize the word embedding space to make it more suitable for data augmentation.
5. Adversarial data augmentation: Related to recent work in interpretability (Ribeiro et al., 2016) [22], we could change the most salient words in an example, i.e. those that a model depends on for a prediction. This still requires a semantics-preserving replacement method, however.

## Few-shot learning for NLP

Zero-shot, one-shot and few-shot learning are one of the most interesting recent research directions IMO. Following the key insight from Vinyals et al. (2016) [23] that a few-shot learning model should be explicitly trained to perform few-shot learning, we have seen several recent advances (Ravi and Larochelle, 2017; Snell et al., 2017) [24] [25].

Learning from few labeled samples is one of the hardest problems IMO and one of the core capabilities that separates the current generation of ML models from more generally applicable systems. Zero-shot learning has only been investigated in the context of learning word embeddings for unknown words AFAIK. Dataless classification (Song and Roth, 2014; Song et al., 2016) [26] [27] is an interesting related direction that embeds labels and documents in a joint space, but requires interpretable labels with good descriptions.

Potential research directions are the following:

1. Standardized benchmarks: Create standardized benchmarks for few-shot learning for NLP. Vinyals et al. (2016) introduce a one-shot language modelling task for the Penn Treebank. The task, while useful, is dwarfed by the extensive evaluation on CV benchmarks and has not seen much use AFAIK. A few-shot learning benchmark for NLP should contain a large number of classes and provide a standardized split for reproducibility. Good candidate tasks would be topic classification or fine-grained entity recognition.
2. Evaluation study: After creating such a benchmark, the next step would be to evaluate how well existing few-shot learning models from CV perform for NLP.
3. Novel methods for NLP: Given a dataset for benchmarking and an empirical evaluation study, we could then start developing novel methods that can perform few-shot learning for NLP.

## Transfer learning for NLP

Transfer learning has had a large impact on computer vision (CV) and has greatly lowered the entry threshold for people wanting to apply CV algorithms to their own problems. CV practicioners are no longer required to perform extensive feature-engineering for every new task, but can simply fine-tune a model pretrained on a large dataset with a small number of examples.

In NLP, however, we have so far only been pretraining the first layer of our models via pretrained embeddings. Recent approaches (Peters et al., 2017, 2018) [28] [29] add pretrained language model embedddings, but these still require custom architectures for every task. In my opinion, in order to unlock the true potential of transfer learning for NLP, we need to pretrain the entire model and fine-tune it on the target task, akin to fine-tuning ImageNet models. Language modelling, for instance, is a great task for pretraining and could be to NLP what ImageNet classification is to CV (Howard and Ruder, 2018) [30].

Here are some potential research directions in this context:

1. Identify useful pretraining tasks: The choice of the pretraining task is very important as even fine-tuning a model on a related task might only provide limited success (Mou et al., 2016) [31]. Other tasks such as those explored in recent work on learning general-purpose sentence embeddings (Conneau et al., 2017; Subramanian et al., 2018; Nie et al., 2017) [32] [33] [34] might be complementary to language model pretraining or suitable for other target tasks.
2. Fine-tuning of complex architectures: Pretraining is most useful when a model can be applied to many target tasks. However, it is still unclear how to pretrain more complex architectures, such as those used for pairwise classification tasks (Augenstein et al., 2018) or reasoning tasks such as QA or reading comprehension.

Multi-task learning (MTL) has become more commonly used in NLP. See here for a general overview of multi-task learning and here for MTL objectives for NLP. However, there is still much we don't understand about multi-task learning in general.

The main questions regarding MTL give rise to many interesting research directions:

1. Identify effective auxiliary tasks: One of the main questions is which tasks are useful for multi-task learning. Label entropy has been shown to be a predictor of MTL success (Alonso and Plank, 2017) [35], but this does not tell the whole story. In recent work (Augenstein et al., 2018) [36], we have found that auxiliary tasks with more data and more fine-grained labels are more useful. It would be useful if future MTL papers would not only propose a new model or auxiliary task, but also try to understand why a certain auxiliary task might be better than another closely related one.
2. Alternatives to hard parameter sharing: Hard parameter sharing is still the default modus operandi for MTL, but places a strong constraint on the model to compress knowledge pertaining to different tasks with the same parameters, which often makes learning difficult. We need better ways of doing MTL that are easy to use and work reliably across many tasks. Recently proposed methods such as cross-stitch units (Misra et al., 2017; Ruder et al., 2017) [37] [38] and a label embedding layer (Augenstein et al., 2018) are promising steps in this direction.
3. Artificial auxiliary tasks: The best auxiliary tasks are those, which are tailored to the target task and do not require any additional data. I have outlined a list of potential artificial auxiliary tasks here. However, it is not clear which of these work reliably across a number of diverse tasks or what variations or task-specific modifications are useful.

## Cross-lingual learning

Creating models that perform well across languages and that can transfer knowledge from resource-rich to resource-poor languages is one of the most important research directions IMO. There has been much progress in learning cross-lingual representations that project different languages into a shared embedding space. Refer to Ruder et al. (2017) [39] for a survey.

Cross-lingual representations are commonly evaluated either intrinsically on similarity benchmarks or extrinsically on downstream tasks, such as text classification. While recent methods have advanced the state-of-the-art for many of these settings, we do not have a good understanding of the tasks or languages for which these methods fail and how to mitigate these failures in a task-independent manner, e.g. by injecting task-specific constraints (Mrkšić et al., 2017).

Novel architectures that outperform the current state-of-the-art and are tailored to specific tasks are regularly introduced, superseding the previous architecture. I have outlined best practices for different NLP tasks before, but without comparing such architectures on different tasks, it is often hard to gain insights from specialized architectures and tell which components would also be useful in other settings.

A particularly promising recent model is the Transformer (Vaswani et al., 2017) [40]. While the complete model might not be appropriate for every task, components such as multi-head attention or position-based encoding could be building blocks that are generally useful for many NLP tasks.

## Conclusion

I hope you've found this collection of research directions useful. If you have suggestions on how to tackle some of these problems or ideas for related research topics, feel free to comment below.

Cover image is from Tobin et al. (2017).

1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105). ↩︎

2. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. arXiv Preprint arXiv:1703.06907. Retrieved from http://arxiv.org/abs/1703.06907 ↩︎

3. Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond Empirical Risk Minimization, 1–11. Retrieved from http://arxiv.org/abs/1710.09412 ↩︎

4. Li, Y., Cohn, T., & Baldwin, T. (2017). Robust Training under Linguistic Adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Vol. 2, pp. 21–27). ↩︎

5. Wang, D., & Eisner, J. (2016). The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages. Tacl, 4, 491–505. Retrieved from https://www.transacl.org/ojs/index.php/tacl/article/viewFile/917/212 https://transacl.org/ojs/index.php/tacl/article/view/917 ↩︎

6. Liu, T., Cui, Y., Yin, Q., Zhang, W., Wang, S., & Hu, G. (2017). Generating and Exploiting Large-scale Pseudo Training Data for Zero Pronoun Resolution. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp. 102–111). ↩︎

7. Xie, Z., Wang, S. I., Li, J., Levy, D., Nie, A., Jurafsky, D., & Ng, A. Y. (2017). Data Noising as Smoothing in Neural Network Language Models. In Proceedings of ICLR 2017. ↩︎

8. Samanta, S., & Mehta, S. (2017). Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812. ↩︎

9. Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2017). HotFlip: White-Box Adversarial Examples for NLP. Retrieved from http://arxiv.org/abs/1712.06751 ↩︎

10. Jia, R., & Liang, P. (2017). Adversarial Examples for Evaluating Reading Comprehension Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. ↩︎

11. Yasunaga, M., Kasai, J., & Radev, D. (2017). Robust Multilingual Part-of-Speech Tagging via Adversarial Training. In Proceedings of NAACL 2018. Retrieved from http://arxiv.org/abs/1711.04903 ↩︎

12. Li, J., Monroe, W., Shi, T., Ritter, A., & Jurafsky, D. (2017). Adversarial Learning for Neural Dialogue Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Retrieved from http://arxiv.org/abs/1701.06547 ↩︎

13. Sennrich, R., Haddow, B., & Birch, A. (2015). Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709. ↩︎

14. Sennrich, R., Haddow, B., & Birch, A. (2016). Edinburgh neural machine translation systems for wmt 16. arXiv preprint arXiv:1606.02891. ↩︎

15. Mallinson, J., Sennrich, R., & Lapata, M. (2017). Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (Vol. 1, pp. 881-893). ↩︎

16. Dong, L., Mallinson, J., Reddy, S., & Lapata, M. (2017). Learning to Paraphrase for Question Answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. ↩︎

17. Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., & Bengio, S. (2016). Generating Sentences from a Continuous Space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL). Retrieved from http://arxiv.org/abs/1511.06349 ↩︎

18. Guu, K., Hashimoto, T. B., Oren, Y., & Liang, P. (2017). Generating Sentences by Editing Prototypes. ↩︎

19. Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., & Xing, E. P. (2017). Toward Controlled Generation of Text. In Proceedings of the 34th International Conference on Machine Learning. ↩︎

20. Shen, T., Lei, T., Barzilay, R., & Jaakkola, T. (2017). Style Transfer from Non-Parallel Text by Cross-Alignment. In Advances in Neural Information Processing Systems. Retrieved from http://arxiv.org/abs/1705.09655 ↩︎

21. Mrkšić, N., Vulić, I., Séaghdha, D. Ó., Leviant, I., Reichart, R., Gašić, M., … Young, S. (2017). Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. TACL. Retrieved from http://arxiv.org/abs/1706.00374 ↩︎

22. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144). ACM. ↩︎

23. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching Networks for One Shot Learning. NIPS 2016. Retrieved from http://arxiv.org/abs/1606.04080 ↩︎

24. Ravi, S., & Larochelle, H. (2017). Optimization as a Model for Few-Shot Learning. In ICLR 2017. ↩︎

25. Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems. ↩︎

26. Song, Y., & Roth, D. (2014). On dataless hierarchical text classification. Proceedings of AAAI, 1579–1585. Retrieved from http://cogcomp.cs.illinois.edu/papers/SongSoRo14.pdf ↩︎

27. Song, Y., Upadhyay, S., Peng, H., & Roth, D. (2016). Cross-Lingual Dataless Classification for Many Languages. Ijcai, 2901–2907. ↩︎

28. Peters, M. E., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017). ↩︎

29. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of NAACL. ↩︎

30. Howard, J., & Ruder, S. (2018). Fine-tuned Language Models for Text Classification. arXiv preprint arXiv:1801.06146. ↩︎

31. Mou, L., Meng, Z., Yan, R., Li, G., Xu, Y., Zhang, L., & Jin, Z. (2016). How Transferable are Neural Networks in NLP Applications? Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. ↩︎

32. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. ↩︎

33. Subramanian, S., Trischler, A., Bengio, Y., & Pal, C. J. (2018). Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. In Proceedings of ICLR 2018. ↩︎

34. Nie, A., Bennett, E. D., & Goodman, N. D. (2017). DisSent: Sentence Representation Learning from Explicit Discourse Relations. arXiv Preprint arXiv:1710.04334. Retrieved from http://arxiv.org/abs/1710.04334 ↩︎

35. Alonso, H. M., & Plank, B. (2017). When is multitask learning effective? Multitask learning for semantic sequence prediction under varying data conditions. In EACL. Retrieved from http://arxiv.org/abs/1612.02251 ↩︎

36. Augenstein, I., Ruder, S., & Søgaard, A. (2018). Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces. In Proceedings of NAACL 2018. ↩︎

37. Misra, I., Shrivastava, A., Gupta, A., & Hebert, M. (2016). Cross-stitch Networks for Multi-task Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. http://doi.org/10.1109/CVPR.2016.433 ↩︎

38. Ruder, S., Bingel, J., Augenstein, I., & Søgaard, A. (2017). Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142. ↩︎

39. Ruder, S., Vulić, I., & Søgaard, A. (2017). A Survey of Cross-lingual Word Embedding Models. arXiv Preprint arXiv:1706.04902. Retrieved from http://arxiv.org/abs/1706.04902 ↩︎

40. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems. ↩︎

]]>
<![CDATA[Optimization for Deep Learning Highlights in 2017]]>http://ruder.io/deep-learning-optimization-2017/5c76ff7c1b9b0d18555b9e9fSun, 03 Dec 2017 15:36:00 GMT

This post discusses the most exciting highlights and most promising directions in optimization for Deep Learning.

Deep Learning ultimately is about finding a minimum that generalizes well -- with bonus points for finding one fast and reliably. Our workhorse, stochastic gradient descent (SGD), is a 60-year old algorithm (Robbins and Monro, 1951) [1], that is as essential to the current generation of Deep Learning algorithms as back-propagation.

Different optimization algorithms have been proposed in recent years, which use different equations to update a model's parameters. Adam (Kingma and Ba, 2015) [2] was introduced in 2015 and is arguably today still the most commonly used one of these algorithms. This indicates that from the Machine Learning practitioner's perspective, best practices for optimization for Deep Learning have largely remained the same.

New ideas, however, have been developed over the course of this year, which may shape the way we will optimize our models in the future. In this blog post, I will touch on the most exciting highlights and most promising directions in optimization for Deep Learning in my opinion. Note that this blog post assumes a familiarity with SGD and with adaptive learning rate methods such as Adam. To get up to speed, refer to this blog post for an overview of existing gradient descent optimization algorithms.

Despite the apparent supremacy of adaptive learning rate methods such as Adam, state-of-the-art results for many tasks in computer vision and NLP such as object recognition (Huang et al., 2017) [3] or machine translation (Wu et al., 2016) [4] have still been achieved by plain old SGD with momentum. Recent theory (Wilson et al., 2017) [5] provides some justification for this, suggesting that adaptive learning rate methods converge to different (and less optimal) minima than SGD with momentum. It is empirically shown that the minima found by adaptive learning rate methods perform generally worse compared to those found by SGD with momentum on object recognition, character-level language modeling, and constituency parsing. This seems counter-intuitive given that Adam comes with nice convergence guarantees and that its adaptive learning rate should give it an edge over the regular SGD. However, Adam and other adaptive learning rate methods are not without their own flaws.

### Decoupling weight decay

One factor that partially accounts for Adam's poor generalization ability compared with SGD with momentum on some datasets is weight decay. Weight decay is most commonly used in image classification problems and decays the weights $$\theta_t$$ after every parameter update by multiplying them by a decay rate $$w_t$$ that is slightly less than $$1$$:

$$\theta_{t+1} = w_t : \theta_t$$

This prevents the weights from growing too large. As such, weight decay can also be understood as an $$\ell_2$$ regularization term that depends on the weight decay rate $$w_t$$ added to the loss:

$$\mathcal{L}_\text{reg} = \dfrac{w_t}{2} |\theta_t |^2_2$$

Weight decay is commonly implemented in many neural network libraries either as the above regularization term or directly to modify the gradient. As the gradient is modified in both the momentum and Adam update equations (via multiplication with other decay terms), weight decay no longer equals $$\ell_2$$ regularization. Loshchilov and Hutter (2017) [6] thus propose to decouple weight decay from the gradient update by adding it after the parameter update as in the original definition.
The SGD with momentum and weight decay (SGDW) update then looks like the following:

\begin{align} \begin{split} v_t &= \gamma v_{t-1} + \eta g_t \\ \theta_{t+1} &= \theta_t - v_t - \eta w_t \theta_t \end{split} \end{align}

where $$\eta$$ is the learning rate and the third term in the second equation is the decoupled weight decay. Similarly, for Adam with weight decay (AdamW) we obtain:

\begin{align} \begin{split} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ \hat{m}_t &= \dfrac{m_t}{1 - \beta^t_1} \\ \hat{v}_t &= \dfrac{v_t}{1 - \beta^t_2} \\ \theta_{t+1} &= \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t - \eta w_t \theta_t \end{split} \end{align}

where $$m_t$$ and $$\hat{m}_t$$ and $$v_t$$ and $$\hat{v}_t$$ are the biased and bias-corrected estimates of the first and second moments respectively and $$\beta_1$$ and $$\beta_2$$ are their decay rates, with the same weight decay term added to it. The authors show that this substantially improves Adam’s generalization performance and allows it to compete with SGD with momentum on image classification datasets.

In addition, it decouples the choice of the learning rate from the choice of the weight decay, which enables better hyperparameter optimization as the hyperparameters no longer depend on each other. It also separates the implementation of the optimizer from the implementation of the weight decay, which contributes to cleaner and more reusable code (see e.g. the fast.ai AdamW/SGDW implementation).

### Fixing the exponential moving average

Several recent papers (Dozat and Manning, 2017; Laine and Aila, 2017) [7],[8] empirically find that a lower $$\beta_2$$ value, which controls the contribution of the exponential moving average of past squared gradients in Adam, e.g. $$0.99$$ or $$0.9$$ vs. the default $$0.999$$ worked better in their respective applications, indicating that there might be an issue with the exponential moving average.

An ICLR 2018 submission formalizes this issue and pinpoints the exponential moving average of past squared gradients as another reason for the poor generalization behaviour of adaptive learning rate methods. Updating the parameters via an exponential moving average of past squared gradients is at the heart of adaptive learning rate methods such as Adadelta, RMSprop, and Adam. The contribution of the exponential average is well-motivated: It should prevent the learning rates to become infinitesimally small as training progresses, the key flaw of the Adagrad algorithm. However, this short-term memory of the gradients becomes an obstacle in other scenarios.

In settings where Adam converges to a suboptimal solution, it has been observed that some minibatches provide large and informative gradients, but as these minibatches only occur rarely, exponential averaging diminishes their influence, which leads to poor convergence. The authors provide an example for a simple convex optimization problem where the same behaviour can be observed for Adam.

To fix this behaviour, the authors propose a new algorithm, AMSGrad that uses the maximum of past squared gradients rather than the exponential average to update the parameters. The full AMSGrad update without bias-corrected estimates can be seen below:

\begin{align} \begin{split} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ \hat{v}_t &= \text{max}(\hat{v}_{t-1}, v_t) \\ \theta_{t+1} &= \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} m_t \end{split} \end{align}

The authors observe improved performance compared to Adam on small datasets and on CIFAR-10.

## Tuning the learning rate

In many cases, it is not our models that require improvement and tuning, but our hyperparameters. Recent examples for language modelling demonstrate that tuning LSTM parameters (Melis et al., 2017) [9] and regularization parameters (Merity et al., 2017) [10] can yield state-of-the-art results compared to more complex models.

An important hyperparameter for optimization in Deep Learning is the learning rate $$\eta$$. In fact, SGD has been shown to require a learning rate annealing schedule to converge to a good minimum in the first place. It is often thought that adaptive learning rate methods such as Adam are more robust to different learning rates, as they update the learning rate themselves. Even for these methods, however, there can be a large difference between a good and the optimal learning rate (psst... it's $$3e-4$$).

Zhang et al. (2017) [11] show that SGD with a tuned learning rate annealing schedule and momentum parameter is not only competitive with Adam, but also converges faster. On the other hand, while we might think that the adaptivity of Adam's learning rates might mimic learning rate annealing, an explicit annealing schedule can still be beneficial: If we add SGD-style learning rate annealing to Adam, it converges faster and outperforms SGD on Machine Translation (Denkowski and Neubig, 2017) [12].

In fact, learning rate annealing schedule engineering seems to be the new feature engineering as we can often find highly-tuned learning rate annealing schedules that improve the final convergence behaviour of our model. An interesting example of this is Vaswani et al. (2017) [13]. While it is usual to see a model's hyperparameters being subjected to large-scale hyperparameter optimization, it is interesting to see a learning rate annealing schedule as the focus of the same attention to detail: The authors use Adam with $$\beta_1=0.9$$, a non-default $$\beta_2=0.98$$, $$\epsilon = 10^{-9}$$, and arguably one of the most elaborate annealing schedules for the learning rate $$\eta$$:

$$\eta = d_\text{model}^{-0.5} \cdot \min(step\text{_}num^{-0.5}, step\text{_}num \cdot warmup\text{_}steps^{-1.5})$$

where $$d_\text{model}$$ is the number of parameters of the model and $$warmup\text{_}steps = 4000$$.

Another recent paper by Smith et al. (2017) [14] demonstrates an interesting connection between the learning rate and the batch size, two hyperparameters that are typically thought to be independent of each other: They show that decaying the learning rate is equivalent to increasing the batch size, while the latter allows for increased parallelism. Conversely, we can reduce the number of model updates and thus speed up training by increasing the learning rate and scaling the batch size. This has ramifications for large-scale Deep Learning, which can now repurpose existing training schedules with no hyperparameter tuning.

## Warm restarts

### SGD with restarts

Another effective recent development is SGDR (Loshchilov and Hutter, 2017) [15], an SGD alternative that uses warm restarts instead of learning rate annealing. In each restart, the learning rate is initialized to some value and is scheduled to decrease. Importantly, the restart is warm as the optimization does not start from scratch but from the parameters to which the model converged during the last step. The key factor is that the learning rate is decreased with an aggressive cosine annealing schedule, which rapidly lowers the learning rate and looks like the following:

$$\eta_t = \eta_{min}^i + \dfrac{1}{2}(\eta_{max}^i - \eta_{min}^i)(1 + \text{cos}(\dfrac{T_{cur}}{T_i}\pi))$$

where $$\eta_{min}^i$$ and $$\eta_{max}^i$$ are ranges for the learning rate during the $$i$$-th run, $$T_{cur}$$ indicates how many epochs passed since the last restart, and $$T_i$$ specifies the epoch of the next restart. The warm restart schedules for $$T_i=50$$, $$T_i=100$$, and $$T_i=200$$ compared with regular learning rate annealing are shown in Figure 1.

The high initial learning rate after a restart is used to essentially catapult the parameters out of the minimum to which they previously converged and to a different area of the loss surface. The aggressive annealing then enables the model to rapidly converge to a new and better solution. The authors empirically find that SGD with warm restarts requires 2 to 4 times fewer epochs than learning rate annealing and achieves comparable or better performance.

Learning rate annealing with warm restarts is also known as cyclical learning rates and has been originally proposed by Smith (2017) [16]. Two more articles by students of fast.ai (which has recently started to teach this method) that discuss warm restarts and cyclical learning rates can be found here and here.

### Snapshot ensembles

Snapshot ensembles (Huang et al., 2017) [17] are a clever, recent technique that uses warm restarts to assemble an ensemble essentially for free when training a single model. The method trains a single model until convergence with the cosine annealing schedule that we have seen above. It then saves the model parameters, performs a warm restart, and then repeats these steps $$M$$ times. In the end, all saved model snapshots are ensembled. The common SGD optimization behaviour on an error surface compared to the behaviour of snapshot ensembling can be seen in Figure 2.

The success of ensembling in general relies on the diversity of the individual models in the ensemble. Snapshot ensembling thus relies on the cosine annealing schedule's ability to enable the model to converge to a different local optimum after every restart. The authors demonstrate that this holds in practice, achieving state-of-the-art results on CIFAR-10, CIFAR-100, and SVHN.

Warm restarts did not work originally with Adam due to its dysfunctional weight decay, which we have seen before. After fixing weight decay, Loshchilov and Hutter (2017) similarly extend Adam to work with warm restarts. They set $$\eta_{min}^i=0$$ and $$\eta_{max}^i=1$$, which yields:

$$\eta_t = 0.5 + 0.5 : \text{cos}(\dfrac{T_{cur}}{T_i}\pi))$$

They recommend to start with an initially small $$T_i$$ (between $$1%$$ and $$10%$$ of the total number of epochs) and multiply it by a factor of $$T_{mult}$$ (e.g. $$T_{mult}=2$$) at every restart.

## Learning to optimize

One of the most interesting papers of last year (and reddit's "Best paper name of 2016" winner) was a paper by Andrychowicz et al. (2016) [18] where they train an LSTM optimizer to provide the updates to the main model during training. Unfortunately, learning a separate LSTM optimizer or even using a pre-trained LSTM optimizer for optimization greatly increases the complexity of model training.

Another very influential learning-to-learn paper from this year uses an LSTM to generate model architectures in a domain-specific language (Zoph and Quoc, 2017) [19]. While the search process requires vast amounts of resources, the discovered architectures can be used as-is to replace their existing counterparts. This search process has proved effective and found architectures that achieve state-of-the-art results on language modeling and results competitive with the state-of-the-art on CIFAR-10.

The same search principle can be applied to any other domain where key processes have been previously defined by hand. One such domain are optimization algorithms for Deep Learning. As we have seen before, optimization algorithms are more similar than they seem: All of them use a combination of an exponential moving average of past gradients (as in momentum) and of an exponential moving average of past squared gradients (as in Adadelta, RMSprop, and Adam) (Ruder, 2016) [20].

Bello et al. (2017) [21] define a domain-specific language that consists of primitives useful for optimization such as these exponential moving averages. They then sample an update rule from the space of possible update rules, use this update rule to train a model, and update the RNN controller based on the performance of the trained model on the test set. The full procedure can be seen in Figure 3.

In particular, they discover two update equations, PowerSign and AddSign. The update equation for PowerSign is the following:

$$\theta_{t+1} = \theta_{t} - \alpha^{f(t)* \text{sign}(g_t)*\text{sign}(m_t)}*g_t$$

where $$\alpha$$ is a hyperparameter that is often set to $$e$$ or $$2$$, $$f(t)$$ is either $$1$$ or a decay function that performs linear, cyclical or decay with restarts based on time step $$t$$, and $$m_t$$ is the moving average of past gradients. The common configuration uses $$\alpha=e$$ and no decay. We can observe that the update scales the gradient by $$\alpha^{f(t)}$$ or $$1/\alpha^{f(t)}$$ depending on whether the direction of the gradient and its moving average agree. This indicates that this momentum-like agreement between past gradients and the current one is a key piece of information for optimizing Deep Learning models.

$$\theta_{t+1} = \theta_{t} - \alpha + f(t) * \text{sign}(g_t) * \text{sign}(m_t)) * g_t$$

with $$\alpha$$ often set to $$1$$ or $$2$$. Similar to the above, this time the update scales $$\alpha + f(t)$$ or $$\alpha - f(t)$$ again depending on the agreement of the direction of the gradients. The authors show that PowerSign and AddSign outperform Adam, RMSprop, and SGD with momentum on CIFAR-10 and transfer well to other tasks such as ImageNet classification and machine translation.

## Understanding generalization

Optimization is closely tied to generalization as the minimum to which a model converges defines how well the model generalizes. Advances in optimization are thus closely correlated with theoretical advances in understanding the generalization behaviour of such minima and more generally of gaining a deeper understanding of generalization in Deep Learning.

However, our understanding of the generalization behaviour of deep neural networks is still very shallow. Recent work showed that the number of possible local minima grows exponentially with the number of parameters (Kawaguchi, 2016) [22]. Given the huge number of parameters of current Deep Learning architectures, it still seems almost magical that such models converge to solutions that generalize well, in particular given that they can completely memorize random inputs (Zhang et al., 2017) [23].

Keskar et al. (2017) [24] identify the sharpness of a minimum as a source for poor generalization: In particular, they show that sharp minima found by batch gradient descent have high generalization error. This makes intuitive sense, as we generally would like our functions to be smooth and a sharp minima indicates a high irregularity in the corresponding error surface. However, more recent work suggests that sharpness may not be such a good indicator after all by showing that local minima that generalize well can be made arbitrarily sharp (Dinh et al., 2017) [25]. A Quora answer by Eric Jang also discusses these articles.

An ICLR 2018 submission demonstrates through a series of ablation analyses that a model's reliance on single directions in activation space, i.e. the activation of single units or feature maps is a good predictor of its generalization performance. They show that this holds across models trained on different datasets and for different degrees of label corruption. They find that dropout does not help to resolve this, while batch normalization discourages single direction reliance.

While these findings indicate that there is still much we do not know in terms of Optimization for Deep Learning, it is important to remember that convergence guarantees and a large body of work exists for convex optimization and that existing ideas and insights can also be applied to non-convex optimization to some extent. The large-scale optimization tutorial at NIPS 2016 provides an excellent overview of more theoretical work in this area (see the slides part 1, part 2, and the video).

## Conclusion

I hope that I was able to provide an impression of some of the compelling developments in optimization for Deep Learning over the past year. I've undoubtedly failed to mention many other approaches that are equally important and noteworthy. Please let me know in the comments below what I missed, where I made a mistake or misrepresented a method, or which aspect of optimization for Deep Learning you find particularly exciting or underexplored.

## Hacker News

You can find the discussion of this post on HN here.

1. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, 400-407. ↩︎

2. Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations. ↩︎

3. Huang, G., Liu, Z., Weinberger, K. Q., & van der Maaten, L. (2017). Densely Connected Convolutional Networks. In Proceedings of CVPR 2017. ↩︎

4. Wu, Y., Schuster, M., Chen, Z., Le, Q. V, Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv Preprint arXiv:1609.08144. ↩︎

5. Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., & Recht, B. (2017). The Marginal Value of Adaptive Gradient Methods in Machine Learning. arXiv Preprint arXiv:1705.08292. Retrieved from http://arxiv.org/abs/1705.08292 ↩︎

6. Loshchilov, I., & Hutter, F. (2017). Fixing Weight Decay Regularization in Adam. arXiv Preprint arXi1711.05101. Retrieved from http://arxiv.org/abs/1711.05101 ↩︎

7. Dozat, T., & Manning, C. D. (2017). Deep Biaffine Attention for Neural Dependency Parsing. In ICLR 2017. Retrieved from http://arxiv.org/abs/1611.01734 ↩︎

8. Laine, S., & Aila, T. (2017). Temporal Ensembling for Semi-Supervised Learning. In Proceedings of ICLR 2017. ↩︎

9. Melis, G., Dyer, C., & Blunsom, P. (2017). On the State of the Art of Evaluation in Neural Language Models. In arXiv preprint arXiv:1707.05589. ↩︎

10. Merity, S., Shirish Keskar, N., & Socher, R. (2017). Regularizing and Optimizing LSTM Language Models. arXiv Preprint arXiv:1708.02182. Retrieved from https://arxiv.org/pdf/1708.02182.pdf ↩︎

11. Zhang, J., Mitliagkas, I., & Ré, C. (2017). YellowFin and the Art of Momentum Tuning. In arXiv preprint arXiv:1706.03471. ↩︎

12. Denkowski, M., & Neubig, G. (2017). Stronger Baselines for Trustable Results in Neural Machine Translation. In Workshop on Neural Machine Translation (WNMT). Retrieved from https://arxiv.org/abs/1706.09733 ↩︎

13. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems. ↩︎

14. Smith, S. L., Kindermans, P.-J., & Le, Q. V. (2017). Don’t Decay the Learning Rate, Increase the Batch Size. In arXiv preprint arXiv:1711.00489. Retrieved from http://arxiv.org/abs/1711.00489 ↩︎

15. Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of ICLR 2017. https://doi.org/10.1002/fut ↩︎

16. Smith, Leslie N. "Cyclical learning rates for training neural networks." In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pp. 464-472. IEEE, 2017. ↩︎

17. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). Snapshot Ensembles: Train 1, get M for free. In Proceedings of ICLR 2017. ↩︎

18. Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., & de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems. Retrieved from http://arxiv.org/abs/1606.04474 ↩︎

19. Zoph, B., & Le, Q. V. (2017). Neural Architecture Search with Reinforcement Learning. In ICLR 2017. ↩︎

20. Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv Preprint arXiv:1609.04747. ↩︎

21. Bello, I., Zoph, B., Vasudevan, V., & Le, Q. V. (2017). Neural Optimizer Search with Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning. ↩︎

22. Kawaguchi, K. (2016). Deep Learning without Poor Local Minima. In Advances in Neural Information Processing Systems 29 (NIPS 2016). Retrieved from http://arxiv.org/abs/1605.07110 ↩︎

23. Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In Proceedings of ICLR 2017. ↩︎

24. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In Proceedings of ICLR 2017. Retrieved from http://arxiv.org/abs/1609.04836 ↩︎

25. Dinh, L., Pascanu, R., Bengio, S., & Bengio, Y. (2017). Sharp Minima Can Generalize For Deep Nets. In Proceedings of the 34th International Conference on Machine Learning. ↩︎

]]>
<![CDATA[Word embeddings in 2017: Trends and future directions]]>http://ruder.io/word-embeddings-2017/5c76ff7c1b9b0d18555b9ea3Sat, 21 Oct 2017 09:00:00 GMT

This post discusses the deficiencies of word embeddings and how recent approaches have tried to resolve them.

The word2vec method based on skip-gram with negative sampling (Mikolov et al., 2013) [1] was published in 2013 and had a large impact on the field, mainly through its accompanying software package, which enabled efficient training of dense word representations and a straightforward integration into downstream models. In some respects, we have come far since then: Word embeddings have established themselves as an integral part of Natural Language Processing (NLP) models. In other aspects, we might as well be in 2013 as we have not found ways to pre-train word embeddings that have managed to supersede the original word2vec.

This post will focus on the deficiencies of word embeddings and how recent approaches have tried to resolve them. If not otherwise stated, this post discusses pre-trained word embeddings, i.e. word representations that have been learned on a large corpus using word2vec and its variants. Pre-trained word embeddings are most effective if not millions of training examples are available (and thus transferring knowledge from a large unlabelled corpus is useful), which is true for most tasks in NLP. For an introduction to word embeddings, refer to this blog post.

## Subword-level embeddings

Word embeddings have been augmented with subword-level information for many applications such as named entity recognition (Lample et al., 2016) [2], part-of-speech tagging (Plank et al., 2016) [3], dependency parsing (Ballesteros et al., 2015; Yu & Vu, 2017) [4], [5], and language modelling (Kim et al., 2016) [6]. Most of these models employ a CNN or a BiLSTM that takes as input the characters of a word and outputs a character-based word representation.

For incorporating character information into pre-trained embeddings, however, character n-grams features have been shown to be more powerful than composition functions over individual characters (Wieting et al., 2016; Bojanowski et al., 2017) [7], [8]. Character n-grams -- by far not a novel feature for text categorization (Cavnar et al., 1994) [9] -- are particularly efficient and also form the basis of Facebook's fastText classifier (Joulin et al., 2016) [10]. Embeddings learned using fastText are available in 294 languages.

Subword units based on byte-pair encoding have been found to be particularly useful for machine translation (Sennrich et al., 2016) [11] where they have replaced words as the standard input units. They are also useful for tasks with many unknown words such as entity typing (Heinzerling & Strube, 2017) [12], but have not been shown to be helpful yet for standard NLP tasks, where this is not a major concern. While they can be learned easily, it is difficult to see their advantage over character-based representations for most tasks (Vania & Lopez, 2017) [13].

Another choice for using pre-trained embeddings that integrate character information is to leverage a state-of-the-art language model (Jozefowicz et al., 2016) [14] trained on a large in-domain corpus, e.g. the 1 Billion Word Benchmark (a pre-trained Tensorflow model can be found here). While language modelling has been found to be useful for different tasks as auxiliary objective (Rei, 2017) [15], pre-trained language model embeddings have also been used to augment word embeddings (Peters et al., 2017) [16]. As we start to better understand how to pre-train and initialize our models, pre-trained language model embeddings are poised to become more effective. They might even supersede word2vec as the go-to choice for initializing word embeddings by virtue of having become more expressive and easier to train due to better frameworks and more computational resources over the last years.

## OOV handling

One of the main problems of using pre-trained word embeddings is that they are unable to deal with out-of-vocabulary (OOV) words, i.e. words that have not been seen during training. Typically, such words are set to the UNK token and are assigned the same vector, which is an ineffective choice if the number of OOV words is large. Subword-level embeddings as discussed in the last section are one way to mitigate this issue. Another way, which is effective for reading comprehension (Dhingra et al., 2017) [17] is to assign OOV words their pre-trained word embedding, if one is available.

Recently, different approaches have been proposed for generating embeddings for OOV words on-the-fly. Herbelot and Baroni (2017) [18] initialize the embedding of OOV words as the sum of their context words and then rapidly refine only the OOV embedding with a high learning rate. Their approach is successful for a dataset that explicitly requires to model nonce words, but it is unclear if it can be scaled up to work reliably for more typical NLP tasks. Another interesting approach for generating OOV word embeddings is to train a character-based model to explicitly re-create pre-trained embeddings (Pinter et al., 2017) [19]. This is particularly useful in low-resource scenarios, where a large corpus is inaccessible and only pre-trained embeddings are available.

## Evaluation

Evaluation of pre-trained embeddings has been a contentious issue since their inception as the commonly used evaluation via word similarity or analogy datasets has been shown to only correlate weakly with downstream performance (Tsvetkov et al., 2015) [20]. The RepEval Workshop at ACL 2016 exclusively focused on better ways to evaluate pre-trained embeddings. As it stands, the consensus seems to be that -- while pre-trained embeddings can be evaluated on intrinsic tasks such as word similarity for comparison against previous approaches -- the best way to evaluate them is extrinsic evaluation on downstream tasks.

## Multi-sense embeddings

A commonly cited criticism of word embeddings is that they are unable to capture polysemy. A tutorial at ACL 2016 outlined the work in recent years that focused on learning separate embeddings for multiple senses of a word (Neelakantan et al., 2014; Iacobacci et al., 2015; Pilehvar & Collier, 2016) [21], [22], [23]. However, most existing approaches for learning multi-sense embeddings solely evaluate on word similarity. Pilehvar et al. (2017) [24] are one of the first to show results on topic categorization as a downstream task; while multi-sense embeddings outperform randomly initialized word embeddings in their experiments, they are outperformed by pre-trained word embeddings.

Given the stellar results Neural Machine Translation systems using word embeddings have achieved in recent years (Johnson et al., 2016) [25], it seems that the current generation of models is expressive enough to contextualize and disambiguate words in context without having to rely on a dedicated disambiguation pipeline or multi-sense embeddings. However, we still need better ways to understand whether our models are actually able to sufficiently disambiguate words and how to improve this disambiguation behaviour if necessary.

## Beyond words as points

While we might not need separate embeddings for every sense of each word for good downstream performance, reducing each word to a point in a vector space is unarguably overly simplistic and causes us to miss out on nuances that might be useful for downstream tasks. An interesting direction is thus to employ other representations that are better able to capture these facets. Vilnis & McCallum (2015) [26] propose to model each word as a probability distribution rather than a point vector, which allows us to represent probability mass and uncertainty across certain dimensions. Athiwaratkun & Wilson (2017) [27] extend this approach to a multimodal distribution that allows to deal with polysemy, entailment, uncertainty, and enhances interpretability.

Rather than altering the representation, the embedding space can also be changed to better represent certain features. Nickel and Kiela (2017) [28], for instance, embed words in a hyperbolic space, to learn hierarchical representations. Finding other ways to represent words that incorporate linguistic assumptions or better deal with the characteristics of downstream tasks is a compelling research direction.

## Phrases and multi-word expressions

In addition to not being able to capture multiple senses of words, word embeddings also fail to capture the meanings of phrases and multi-word expressions, which can be a function of the meaning of their constituent words, or have an entirely new meaning. Phrase embeddings have been proposed already in the original word2vec paper (Mikolov et al., 2013) [29] and there has been consistent work on learning better compositional and non-compositional phrase embeddings (Yu & Dredze, 2015; Hashimoto & Tsuruoka, 2016) [30], [31]. However, similar to multi-sense embeddings, explicitly modelling phrases has so far not shown significant improvements on downstream tasks that would justify the additional complexity. Analogously, a better understanding of how phrases are modelled in neural networks would pave the way to methods that augment the capabilities of our models to capture compositionality and non-compositionality of expressions.

## Bias

Bias in our models is becoming a larger issue and we are only starting to understand its implications for training and evaluating our models. Even word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent (Bolukbasi et al., 2016) [32]. Understanding what other biases word embeddings capture and finding better ways to remove theses biases will be key to developing fair algorithms for natural language processing.

## Temporal dimension

Words are a mirror of the zeitgeist and their meanings are subject to continuous change; current representations of words might differ substantially from the way these words where used in the past and will be used in the future. An interesting direction is thus to take into account the temporal dimension and the diachronic nature of words. This can allows us to reveal laws of semantic change (Hamilton et al., 2016; Bamler & Mandt, 2017; Dubossarsky et al., 2017) [33], [34], [35], to model temporal word analogy or relatedness (Szymanski, 2017; Rosin et al., 2017) [36], [37], or to capture the dynamics of semantic relations (Kutuzov et al., 2017) [37:1].

## Lack of theoretical understanding

Besides the insight that word2vec with skip-gram negative sampling implicitly factorizes a PMI matrix (Levy & Goldberg, 2014) [38], there has been comparatively little work on gaining a better theoretical understanding of the word embedding space and its properties, e.g. that summation captures analogy relations. Arora et al. (2016) [39] propose a new generative model for word embeddings, which treats corpus generation as a random walk of a discourse vector and establishes some theoretical motivations regarding the analogy behaviour. Gittens et al. (2017) [40] provide a more thorough theoretical justification of additive compositionality and show that skip-gram word vectors are optimal in an information-theoretic sense. Mimno & Thompson (2017) [41] furthermore reveal an interesting relation between word embeddings and the embeddings of context words, i.e. that they are not evenly dispersed across the vector space, but occupy a narrow cone that is diametrically opposite to the context word embeddings. Despite these additional insights, our understanding regarding the location and properties of word embeddings is still lacking and more theoretical work is necessary.

One of the major downsides of using pre-trained embeddings is that the news data used for training them is often very different from the data on which we would like to use them. In most cases, however, we do not have access to millions of unlabelled documents in our target domain that would allow for pre-training good embeddings from scratch. We would thus like to be able to adapt embeddings pre-trained on large news corpora, so that they capture the characteristics of our target domain, but still retain all relevant existing knowledge. Lu & Zheng (2017) [42] proposed a regularized skip-gram model for learning such cross-domain embeddings. In the future, we will need even better ways to adapt pre-trained embeddings to new domains or to incorporate the knowledge from multiple relevant domains.

Rather than adapting to a new domain, we can also use existing knowledge encoded in semantic lexicons to augment pre-trained embeddings with information that is relevant for our task. An effective way to inject such relations into the embedding space is retro-fitting (Faruqui et al., 2015) [43], which has been expanded to other resources such as ConceptNet (Speer et al., 2017) [44] and extended with an intelligent selection of positive and negative examples (Mrkšić et al., 2017) [45]. Injecting additional prior knowledge into word embeddings such as monotonicity (You et al., 2017) [46], word similarity (Niebler et al., 2017) [47], task-related grading or intensity, or logical relations is an important research direction that will allow to make our models more robust.

Word embeddings are useful for a wide variety of applications beyond NLP such as information retrieval, recommendation, and link prediction in knowledge bases, which all have their own task-specific approaches. Wu et al. (2017) [48] propose a general-purpose model that is compatible with many of these applications and can serve as a strong baseline.

## Transfer learning

Rather than adapting word embeddings to any particular task, recent work has sought to create contextualized word vectors by augmenting word embeddings with embeddings based on the hidden states of models pre-trained for certain tasks, such as machine translation (McCann et al., 2017) [49] or language modelling (Peters et al., 2018) [50]. Together with fine-tuning pre-trained models (Howard and Ruder, 2018) [51], this is one of the most promising research directions.

## Embeddings for multiple languages

As NLP models are being increasingly employed and evaluated on multiple languages, creating multilingual word embeddings is becoming a more important issue and has received increased interest over recent years. A promising direction is to develop methods that learn cross-lingual representations with as few parallel data as possible, so that they can be easily applied to learn representations even for low-resource languages. For a recent survey in this area, refer to Ruder et al. (2017) [52].

## Embeddings based on other contexts

Word embeddings are typically learned only based on the window of surrounding context words. Levy & Goldberg (2014) [53] have shown that dependency structures can be used as context to capture more syntactic word relations; Köhn (2015) [54] finds that such dependency-based embeddings perform best for a particular multilingual evaluation method that clusters embeddings along different syntactic features.

Melamud et al. (2016) [55] observe that different context types work well for different downstream tasks and that simple concatenation of word embeddings learned with different context types can yield further performance gains. Given the recent success of incorporating graph structures into neural models for different tasks as -- for instance -- exhibited by graph-convolutional neural networks (Bastings et al., 2017; Marcheggiani & Titov, 2017) [56] [57], we can conjecture that incorporating such structures for learning embeddings for downstream tasks may also be beneficial.

Besides selecting context words differently, additional context may also be used in other ways: Tissier et al. (2017) [58] incorporate co-occurrence information from dictionary definitions into the negative sampling process to move related works closer together and prevent them from being used as negative samples. We can think of topical or relatedness information derived from other contexts such as article headlines or Wikipedia intro paragraphs that could similarly be used to make the representations more applicable to a particular downstream task.

## Conclusion

It is nice to see that as a community we are progressing from applying word embeddings to every possible problem to gaining a more principled, nuanced, and practical understanding of them. This post was meant to highlight some of the current trends and future directions for learning word embeddings that I found most compelling. I've undoubtedly failed to mention many other areas that are equally important and noteworthy. Please let me know in the comments below what I missed, where I made a mistake or misrepresented a method, or just which aspect of word embeddings you find particularly exciting or unexplored.

# Hacker News

Refer to the discussion on Hacker News for some more insights on word embeddings.

## Other blog posts on word embeddings

If you want to learn more about word embeddings, these other blog posts on word embeddings are also available:

## References

Cover image credit: Hamilton et al. (2016)

1. Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12. ↩︎

2. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural Architectures for Named Entity Recognition. In NAACL-HLT 2016. ↩︎

3. Plank, B., Søgaard, A., & Goldberg, Y. (2016). Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. ↩︎

4. Ballesteros, M., Dyer, C., & Smith, N. A. (2015). Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs. In Proceedings of EMNLP 2015. https://doi.org/10.18653/v1/D15-1041 ↩︎

5. Yu, X., & Vu, N. T. (2017). Character Composition Model with Convolutional Neural Networks for Dependency Parsing on Morphologically Rich Languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp. 672–678). ↩︎

6. Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-Aware Neural Language Models. AAAI. Retrieved from http://arxiv.org/abs/1508.06615 ↩︎

7. Wieting, J., Bansal, M., Gimpel, K., & Livescu, K. (2016). Charagram: Embedding Words and Sentences via Character n-grams. Retrieved from http://arxiv.org/abs/1607.02789 ↩︎

8. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics. Retrieved from http://arxiv.org/abs/1607.04606 ↩︎

9. Cavnar, W. B., Trenkle, J. M., & Mi, A. A. (1994). N-Gram-Based Text Categorization. Ann Arbor MI 48113.2, 161–175. https://doi.org/10.1.1.53.9367 ↩︎

10. Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of Tricks for Efficient Text Classification. arXiv Preprint arXiv:1607.01759. Retrieved from http://arxiv.org/abs/1607.01759 ↩︎

11. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Retrieved from http://arxiv.org/abs/1508.07909 ↩︎

12. Heinzerling, B., & Strube, M. (2017). BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. Retrieved from http://arxiv.org/abs/1710.02187 ↩︎

13. Vania, C., & Lopez, A. (2017). From Characters to Words to in Between: Do We Capture Morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp. 2016–2027). ↩︎

14. Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. arXiv Preprint arXiv:1602.02410. Retrieved from http://arxiv.org/abs/1602.02410 ↩︎

15. Rei, M. (2017). Semi-supervised Multitask Learning for Sequence Labeling. In Proceedings of ACL 2017. ↩︎

16. Peters, M. E., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp. 1756–1765). ↩︎

17. Dhingra, B., Liu, H., Salakhutdinov, R., & Cohen, W. W. (2017). A Comparative Study of Word Embeddings for Reading Comprehension. arXiv preprint arXiv:1703.00993. ↩︎

18. Herbelot, A., & Baroni, M. (2017). High-risk learning: acquiring new word vectors from tiny data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. ↩︎

19. Pinter, Y., Guthrie, R., & Eisenstein, J. (2017). Mimicking Word Embeddings using Subword RNNs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Retrieved from http://arxiv.org/abs/1707.06961 ↩︎

20. Tsvetkov, Y., Faruqui, M., Ling, W., Lample, G., & Dyer, C. (2015). Evaluation of Word Vector Representations by Subspace Alignment. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17-21 September 2015, 2049–2054. ↩︎

21. Neelakantan, A., Shankar, J., Passos, A., & Mccallum, A. (2014). Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space. In Proceedings fo (pp. 1059–1069). ↩︎

22. Iacobacci, I., Pilehvar, M. T., & Navigli, R. (2015). SensEmbed: Learning Sense Embeddings for Word and Relational Similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (pp. 95–105). ↩︎

23. Pilehvar, M. T., & Collier, N. (2016). De-Conflated Semantic Representations. In Proceedings of EMNLP. ↩︎

24. Pilehvar, M. T., Camacho-Collados, J., Navigli, R., & Collier, N. (2017). Towards a Seamless Integration of Word Senses into Downstream NLP Applications. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1857–1869). https://doi.org/10.18653/v1/P17-1170 ↩︎

25. Johnson, M., Schuster, M., Le, Q. V, Krikun, M., Wu, Y., Chen, Z., … Dean, J. (2016). Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. arXiv Preprint arXiv:1611.0455. ↩︎

26. Vilnis, L., & McCallum, A. (2015). Word Representations via Gaussian Embedding. ICLR. Retrieved from http://arxiv.org/abs/1412.6623 ↩︎

27. Athiwaratkun, B., & Wilson, A. G. (2017). Multimodal Word Distributions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017). ↩︎

28. Nickel, M., & Kiela, D. (2017). Poincaré Embeddings for Learning Hierarchical Representations. arXiv Preprint arXiv:1705.08039. Retrieved from http://arxiv.org/abs/1705.08039 ↩︎

29. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS. ↩︎

30. Yu, M., & Dredze, M. (2015). Learning Composition Models for Phrase Embeddings. Transactions of the ACL, 3, 227–242. ↩︎

31. Hashimoto, K., & Tsuruoka, Y. (2016). Adaptive Joint Learning of Compositional and Non-Compositional Phrase Embeddings. ACL, 205–215. Retrieved from http://arxiv.org/abs/1603.06067 ↩︎

32. Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In 30th Conference on Neural Information Processing Systems (NIPS 2016). Retrieved from http://arxiv.org/abs/1607.06520 ↩︎

33. Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (pp. 1489–1501). ↩︎

34. Bamler, R., & Mandt, S. (2017). Dynamic Word Embeddings via Skip-Gram Filtering. In Proceedings of ICML 2017. Retrieved from http://arxiv.org/abs/1702.08359 ↩︎

35. Dubossarsky, H., Grossman, E., & Weinshall, D. (2017). Outta Control: Laws of Semantic Change and Inherent Biases in Word Representation Models. In Conference on Empirical Methods in Natural Language Processing (pp. 1147–1156). Retrieved from http://aclweb.org/anthology/D17-1119 ↩︎

36. Szymanski, T. (2017). Temporal Word Analogies : Identifying Lexical Replacement with Diachronic Word Embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp. 448–453). ↩︎

37. Rosin, G., Radinsky, K., & Adar, E. (2017). Learning Word Relatedness over Time. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Retrieved from https://arxiv.org/pdf/1707.08081.pdf ↩︎ ↩︎

38. Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. Advances in Neural Information Processing Systems (NIPS), 2177–2185. Retrieved from http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization ↩︎

39. Arora, S., Li, Y., Liang, Y., Ma, T., & Risteski, A. (2016). A Latent Variable Model Approach to PMI-based Word Embeddings. TACL, 4, 385–399. Retrieved from https://transacl.org/ojs/index.php/tacl/article/viewFile/742/204 ↩︎

40. Gittens, A., Achlioptas, D., & Mahoney, M. W. (2017). Skip-Gram – Zipf + Uniform = Vector Additivity. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp. 69–76). https://doi.org/10.18653/v1/P17-1007 ↩︎

41. Mimno, D., & Thompson, L. (2017). The strange geometry of skip-gram with negative sampling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2863–2868). ↩︎

42. Lu, W., & Zheng, V. W. (2017). A Simple Regularization-based Algorithm for Learning Cross-Domain Word Embeddings. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2888–2894). ↩︎

43. Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., & Smith, N. A. (2015). Retrofitting Word Vectors to Semantic Lexicons. In NAACL 2015. ↩︎

44. Speer, R., Chin, J., & Havasi, C. (2017). ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In AAAI 31 (pp. 4444–4451). Retrieved from http://arxiv.org/abs/1612.03975 ↩︎

45. Mrkšić, N., Vulić, I., Séaghdha, D. Ó., Leviant, I., Reichart, R., Gašić, M., … Young, S. (2017). Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. TACL. Retrieved from http://arxiv.org/abs/1706.00374 ↩︎

46. You, S., Ding, D., Canini, K., Pfeifer, J., & Gupta, M. (2017). Deep Lattice Networks and Partial Monotonic Functions. In 31st Conference on Neural Information Processing Systems (NIPS 2017). Retrieved from http://arxiv.org/abs/1709.06680 ↩︎

47. Niebler, T., Becker, M., Pölitz, C., & Hotho, A. (2017). Learning Semantic Relatedness From Human Feedback Using Metric Learning. In Proceedings of ISWC 2017. Retrieved from http://arxiv.org/abs/1705.07425 ↩︎

48. Wu, L., Fisch, A., Chopra, S., Adams, K., Bordes, A., & Weston, J. (2017). StarSpace: Embed All The Things! arXiv preprint arXiv:1709.03856. ↩︎

49. Mccann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in Translation: Contextualized Word Vectors. In Advances in Neural Information Processing Systems. ↩︎

50. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of NAACL-HLT 2018. ↩︎

51. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. In Proceedings of ACL 2018. Retrieved from http://arxiv.org/abs/1801.06146 ↩︎

52. Ruder, S., Vulić, I., & Søgaard, A. (2017). A Survey of Cross-lingual Word Embedding Models Sebastian. arXiv preprint arXiv:1706.04902. Retrieved from http://arxiv.org/abs/1706.04902 ↩︎

53. Levy, O., & Goldberg, Y. (2014). Dependency-Based Word Embeddings. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), 302–308. https://doi.org/10.3115/v1/P14-2050 ↩︎

54. Köhn, A. (2015). What’s in an Embedding? Analyzing Word Embeddings through Multilingual Evaluation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17-21 September 2015, (2014), 2067–2073. ↩︎

55. Melamud, O., McClosky, D., Patwardhan, S., & Bansal, M. (2016). The Role of Context Types and Dimensionality in Learning Word Embeddings. In Proceedings of NAACL-HLT 2016 (pp. 1030–1040). Retrieved from http://arxiv.org/abs/1601.00893 ↩︎

56. Bastings, J., Titov, I., Aziz, W., Marcheggiani, D., & Sima’an, K. (2017). Graph Convolutional Encoders for Syntax-aware Neural Machine Translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. ↩︎

57. Marcheggiani, D., & Titov, I. (2017). Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. ↩︎

58. Tissier, J., Gravier, C., & Habrard, A. (2017). Dict2Vec : Learning Word Embeddings using Lexical Dictionaries. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Retrieved from http://aclweb.org/anthology/D17-1024 ↩︎

]]>