Table of contents:

This post is a collection of best practices for using neural networks in Natural Language Processing. It will be updated periodically as new insights become available and in order to keep track of our evolving understanding of Deep Learning for NLP.

There has been a running joke in the NLP community that an LSTM with attention will yield state-of-the-art performance on any task. While this has been true over the course of the last two years, the NLP community is slowly moving away from this now standard baseline and towards more interesting models.

However, we as a community do not want to spend the next two years independently (re-)discovering the *next* LSTM with attention. We do not want to reinvent tricks or methods that have already been shown to work. While many existing Deep Learning libraries already encode best practices for working with neural networks in general, such as initialization schemes, many other details, particularly task or domain-specific considerations, are left to the practitioner.

This post is not meant to keep track of the state-of-the-art, but rather to collect best practices that are relevant for a wide range of tasks. In other words, rather than describing one particular architecture, this post aims to collect the features that underly successful architectures. While many of these features will be most useful for pushing the state-of-the-art, I hope that wider knowledge of them will lead to stronger evaluations, more meaningful comparison to baselines, and inspiration by shaping our intuition of what works.

I assume you are familiar with neural networks as applied to NLP (if not, I recommend Yoav Goldberg's excellent primer [^{43}]) and are interested in NLP in general or in a particular task. The main goal of this article is to get you up to speed with the relevant best practices so you can make meaningful contributions as soon as possible.

I will first give an overview of best practices that are relevant for most tasks. I will then outline practices that are relevant for the most common tasks, in particular classification, sequence labelling, natural language generation, and neural machine translation.

**Disclaimer:** Treating something as *best practice* is notoriously difficult: Best according to what? What if there are better alternatives? This post is based on my (necessarily incomplete) understanding and experience. In the following, I will only discuss practices that have been reported to be beneficial independently by *at least* two different groups. I will try to give at least two references for each best practice.

Word embeddings are arguably the most widely known best practice in the recent history of NLP. It is well-known that using pre-trained embeddings helps (Kim, 2014) [^{12}]. The optimal dimensionality of word embeddings is mostly task-dependent: a smaller dimensionality works better for more syntactic tasks such as named entity recognition (Melamud et al., 2016) [^{44}] or part-of-speech (POS) tagging (Plank et al., 2016) [^{32}], while a larger dimensionality is more useful for more semantic tasks such as sentiment analysis (Ruder et al., 2016) [^{45}].

While we will not reach the depths of computer vision for a while, neural networks in NLP have become progressively deeper. State-of-the-art approaches now regularly use deep Bi-LSTMs, typically consisting of 3-4 layers, e.g. for POS tagging (Plank et al., 2016) and semantic role labelling (He et al., 2017) [^{33}]. Models for some tasks can be even deeper, cf. Google's NMT model with 8 encoder and 8 decoder layers (Wu et al., 2016) [^{20}]. In most cases, however, performance improvements of making the model deeper than 2 layers are minimal (Reimers & Gurevych, 2017) [^{46}].

These observations hold for most sequence tagging and structured prediction problems. For classification, deep or very deep models perform well only with character-level input and shallow word-level models are still the state-of-the-art (Zhang et al., 2015; Conneau et al., 2016; Le et al., 2017) [^{28}, ^{29}, ^{30}].

For training deep neural networks, some tricks are essential to avoid the vanishing gradient problem. Different layers and connections have been proposed. Here, we will discuss three: i) highway layers, ii) residual connections, and iii) dense residual connections.

**Highway layers** Highway layers (Srivastava et al., 2015) [^{1}] are inspired by the gates of an LSTM. First let us assume a one-layer MLP, which applies an affine transformation followed by a non-linearity \(g\) to its input \(\mathbf{x}\):

\(\mathbf{h} = g(\mathbf{W}\mathbf{x} + b)\)

A highway layer then computes the following function instead:

\(\mathbf{h} = \mathbf{t} \odot g(\mathbf{W} \mathbf{x} + \mathbf{b}) + (1-\mathbf{t}) \odot \mathbf{x} \)

where \(\odot\) is elementwise multiplication, \(\mathbf{t} = \sigma(\mathbf{W}_T \mathbf{x} + \mathbf{b}_T)\) is called the *transform* gate, and \((1-\mathbf{t})\) is called the *carry* gate. As we can see, highway layers are similar to the gates of an LSTM in that they adaptively *carry* some dimensions of the input directly to the output.

Highway layers have been used pre-dominantly to achieve state-of-the-art results for language modelling (Kim et al., 2016; Jozefowicz et al., 2016; Zilly et al., 2017) [^{2}, ^{3}, ^{4}], but have also been used for other tasks such as speech recognition (Zhang et al., 2016) [^{5}]. Sristava's page contains more information and code regarding highway layers.

**Residual connections** Residual connections (He et al., 2016) [^{6}] have been first proposed for computer vision and were the main factor for winning ImageNet 2016. Residual connections are even more straightforward than highway layers. Let us augment our previous layer output \(h\) with indices \(l\) indicating the current layer. Residual connections then learn the following function:

\(\mathbf{h}^l = g(\mathbf{W}\mathbf{x}^l + \mathbf{b}) + \mathbf{x}^{l-1}\)

which simply adds the input of the previous layer to the current layer via a short-cut connection. This simple modification mitigates the vanishing gradient problem, as the model can default to using the identity function if the layer is not beneficial.

**Dense residual connections** Rather than just adding layers from each layer to the next, dense residual connections (Huang et al., 2017) [^{7}] (best paper award at CVPR 2017) add connections from each layer to all subsequent layers:

\(\mathbf{h}^l = g(\mathbf{W}\mathbf{x}^l + \mathbf{b}) + \sum\limits^l_{j=0} \mathbf{x}^j\)

Dense residual connections have been successfully used in computer vision, but have also been shown to consistently outperform residual connections for neural machine translation (Britz et al., 2017) [^{27}].

While batch normalisation in computer vision has made other regularizers obsolete in most applications, dropout (Srivasta et al., 2014) [^{8}] is still the go-to regularizer for deep neural networks in NLP. A dropout rate of 0.5 has been shown to be effective in most scenarios (Kim, 2014). In recent years, variations of dropout such as adaptive (Ba & Frey, 2013) [^{9}] and evolutional dropout (Li et al., 2016) [^{10}] have been proposed, but none of these have found wide adoption in the community. The main problem hindering dropout in NLP has been that it could not be applied to recurrent connections, as the aggregating dropout masks would effectively zero out embeddings over time.

**Recurrent dropout** Recurrent dropout (Gal & Ghahramani, 2016) [^{11}] addresses this issue by applying the same dropout mask across timesteps at layer \(l\). This avoids amplifying the dropout noise along the sequence and leads to effective regularization for sequence models. Recurrent dropout has been used for instance to achieve state-of-the-art results in semantic role labelling (He et al., 2017) and language modelling (Melis et al., 2017) [^{34}].

If additional data is available, multi-task learning (MTL) can often be used to improve performance on the target task. Have a look this blog post for more information on MTL.

**Auxiliary objectives** We can often find auxiliary objectives that are useful for the task we care about (Ruder, 2017) [^{13}]. While we can already predict surrounding words in order to pre-train word embeddings (Mikolov et al., 2013), we can also use this as an auxiliary objective during training (Rei, 2017) [^{35}]. A similar objective has also been used by (Ramachandran et al., 2016) [^{36}] for sequence-to-sequence models.

**Task-specific layers** While the standard approach to MTL for NLP is hard parameter sharing, it is beneficial to allow the model to learn task-specific layers. This can be done by placing the output layer of one task at a lower level (Søgaard & Goldberg, 2016) [^{47}]. Another way is to induce private and shared subspaces (Liu et al., 2017; Ruder et al., 2017) [^{48}, ^{49}].

Attention is most commonly used in sequence-to-sequence models to attend to encoder states, but can also be used in any sequence model to look back at past states. Using attention, we obtain a context vector \(\mathbf{c}_i\) based on hidden states \(\mathbf{s}_1, \ldots, \mathbf{s}_m\) that can be used together with the current hidden state \(\mathbf{h}_i\) for prediction. The context vector \(\mathbf{c}_i\) at position is calculated as an average of the previous states weighted with the attention scores \(\mathbf{a}_i\):

\(\begin{align}\begin{split} \mathbf{c}_i &= \sum\limits_j a_{ij}\mathbf{s}_j\\ \mathbf{a}_i &= \text{softmax}(f_{att}(\mathbf{h}_i, \mathbf{s}_j)) \end{split}\end{align}\)

The attention function \(f_{att}(\mathbf{h}_i, \mathbf{s}_j)\) calculates an unnormalized alignment score between the current hidden state \(\mathbf{h}_i\) and the previous hidden state \(\mathbf{s}_j\). In the following, we will discuss four attention variants: i) additive attention, ii) multiplicative attention, iii) self-attention, and iv) key-value attention.

**Additive attention** The original attention mechanism (Bahdanau et al., 2015) [^{15}] uses a one-hidden layer feed-forward network to calculate the attention alignment:

\(f_{att}(\mathbf{h}_i, \mathbf{s}_j) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_a[\mathbf{h}_i; \mathbf{s}_j]) \)

where \(\mathbf{v}_a\) and \(\mathbf{W}_a\) are learned attention parameters and \([\cdot; \cdot]\) represents concatenation. Analogously, we can also use matrices \(\mathbf{W}_1\) and \(\mathbf{W}_2\) to learn separate transformations for \(\mathbf{h}_i\) and \(\mathbf{s}_j\) respectively, which are then summed:

\(f_{att}(\mathbf{h}_i, \mathbf{s}_j) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}_j) \)

**Multiplicative attention** Multiplicative attention (Luong et al., 2015) [^{16}] simplifies the attention operation by calculating the following function:

\(f_{att}(h_i, s_j) = h_i^\top \mathbf{W}_a s_j \)

Additive and multiplicative attention are similar in complexity, although multiplicative attention is faster and more space-efficient in practice as it can be implemented more efficiently using matrix multiplication. Both variants perform similar for small dimensionality \(d_h\) of the decoder states, but additive attention performs better for larger dimensions. One way to mitigate this is to scale \(f_{att}(\mathbf{h}_i, \mathbf{s}_j)\) by \(1 / \sqrt{d_h}\) (Vaswani et al., 2017) [^{17}].

Attention cannot only be used to attend to encoder or previous hidden states, but also to obtain a distribution over other features, such as the word embeddings of a text as used for reading comprehension (Kadlec et al., 2017) [^{37}]. However, attention is not directly applicable to classification tasks that do not require additional information, such as sentiment analysis. In such models, the final hidden state of an LSTM or an aggregation function such as max pooling or averaging is often used to obtain a sentence representation.

**Self-attention** Without any additional information, however, we can still extract relevant aspects from the sentence by allowing it to attend to itself using self-attention (Lin et al., 2017) [^{18}]. Self-attention, also called intra-attention has been used successfully in a variety of tasks including reading comprehension (Cheng et al., 2016) [^{38}], textual entailment (Parikh et al., 2016) [^{39}], and abstractive summarization (Paulus et al., 2017) [^{40}].

We can simplify additive attention to compute the unnormalized alignment score for each hidden state \(\mathbf{h}_i\):

\(f_{att}(\mathbf{h}_i) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_a \mathbf{h}_i) \)

In matrix form, for hidden states \(\mathbf{H} = \mathbf{h}_1, \ldots, \mathbf{h}_n\) we can calculate the attention vector \(\mathbf{a}\) and the final sentence representation \(\mathbf{c}\) as follows:

\(\begin{align}\begin{split} \mathbf{a} &= \text{softmax}(\mathbf{v}_a \text{tanh}(\mathbf{W}_a \mathbf{H}^\top))\\ \mathbf{c} & = \mathbf{H} \mathbf{a}^\top \end{split}\end{align}\)

Rather than only extracting one vector, we can perform several hops of attention by using a matrix \(\mathbf{V}_a\) instead of \(\mathbf{v}_a\), which allows us to extract an attention matrix \(\mathbf{A}\):

\(\begin{align}\begin{split} \mathbf{A} &= \text{softmax}(\mathbf{V}_a \text{tanh}(\mathbf{W}_a \mathbf{H}^\top))\\ \mathbf{C} & = \mathbf{A} \mathbf{H} \end{split}\end{align}\)

In practice, we enforce the following orthogonality constraint to penalize redundancy and encourage diversity in the attention vectors in the form of the squared Frobenius norm:

\(\Omega = \|(\mathbf{A}\mathbf{A}^\top - \mathbf{I} \|^2_F \)

A similar multi-head attention is also used by Vaswani et al. (2017).

**Key-value attention** Finally, key-value attention (Daniluk et al., 2017) [^{19}] is a recent attention variant that separates form from function by keeping separate vectors for the attention calculation. It has also been found useful for different document modelling tasks (Liu & Lapata, 2017) [^{41}]. Specifically, key-value attention splits each hidden vector \(\mathbf{h}_i\) into a key \(\mathbf{k}_i\) and a value \(\mathbf{v}_i\): \([\mathbf{k}_i; \mathbf{v}_i] = \mathbf{h}_i\). The keys are used for calculating the attention distribution \(\mathbf{a}_i\) using additive attention:

\(\mathbf{a}_i = \text{softmax}(\mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_1 [\mathbf{k}_{i-L}; \ldots; \mathbf{k}_{i-1}] + (\mathbf{W}_2 \mathbf{k}_i)\mathbf{1}^\top)) \)

where \(L\) is the length of the attention window and \(\mathbf{1}\) is a vector of ones. The values are then used to obtain the context representation \(\mathbf{c}_i\):

\(\mathbf{c}_i = [\mathbf{v}_{i-L}; \ldots; \mathbf{v}_{i-1}] \mathbf{a}^\top\)

The context \(\mathbf{c}_i\) is used together with the current value \(\mathbf{v}_i\) for prediction.

The optimization algorithm and scheme is often one of the parts of the model that is used as-is and treated as a black-box. Sometimes, even slight changes to the algorithm, e.g. reducing the \(\beta_2\) value in Adam (Dozat & Manning, 2017) [^{50}] can make a large difference to the optimization behaviour.

**Optimization algorithm** Adam (Kingma & Ba, 2015) [^{21}] is one of the most popular and widely used optimization algorithms and often the go-to optimizer for NLP researchers. It is often thought that Adam clearly outperforms vanilla stochastic gradient descent (SGD). However, while it converges much faster than SGD, it has been observed that SGD with learning rate annealing slightly outperforms Adam (Wu et al., 2016). Recent work furthermore shows that SGD with properly tuned momentum outperforms Adam (Zhang et al., 2017) [^{42}].

**Optimization scheme** While Adam internally tunes the learning rate for every parameter (Ruder, 2016) [^{22}], we can explicitly use SGD-style annealing with Adam. In particular, we can perform learning rate annealing with restarts: We set a learning rate and train the model until convergence. We then halve the learning rate and restart by loading the previous best model. In Adam's case, this causes the optimizer to forget its per-parameter learning rates and start fresh. Denkowski & Neubig (2017) [^{23}] show that Adam with 2 restarts and learning rate annealing is faster and performs better than SGD with annealing.

Combining multiple models into an ensemble by averaging their predictions is a proven strategy to improve model performance. While predicting with an ensemble is expensive at test time, recent advances in distillation allow us to compress an expensive ensemble into a much smaller model (Hinton et al., 2015; Kuncoro et al., 2016; Kim & Rush, 2016) [^{24}, ^{25}, ^{26}].

Ensembling is an important way to ensure that results are still reliable if the diversity of the evaluated models increases (Denkowski & Neubig, 2017). While ensembling different checkpoints of a model has been shown to be effective (Jean et al., 2015; Sennrich et al., 2016) [^{51}, ^{52}], it comes at the cost of model diversity. Cyclical learning rates can help to mitigate this effect (Huang et al., 2017) [^{53}]. However, if resources are available, we prefer to ensemble multiple independently trained models to maximize model diversity.

Rather than pre-defining or using off-the-shelf hyperparameters, simply tuning the hyperparameters of our model can yield significant improvements over baselines. Recent advances in Bayesian Optimization have made it an ideal tool for the black-box optimization of hyperparameters in neural networks (Snoek et al., 2012) [^{56}] and far more efficient than the widely used grid search. Automatic tuning of hyperparameters of an LSTM has led to state-of-the-art results in language modeling, outperforming models that are far more complex (Melis et al., 2017).

**Learning the initial state** We generally initialize the initial LSTM states with a \(0\) vector. Instead of fixing the initial state, we can learn it like any other parameter, which can improve performance and is also recommended by Hinton. Refer to this blog post for a Tensorflow implementation.

**Tying input and output embeddings** Input and output embeddings account for the largest number of parameters in the LSTM model. If the LSTM predicts words as in language modelling, input and output parameters can be shared (Inan et al., 2016; Press & Wolf, 2017) [^{54}, ^{55}]. This is particularly useful on small datasets that do not allow to learn a large number of parameters.

**Gradient norm clipping** One way to decrease the risk of exploding gradients is to clip their maximum value (Mikolov, 2012) [^{57}]. This, however, does not improve performance consistently (Reimers & Gurevych, 2017). Rather than clipping each gradient independently, clipping the global norm of the gradient (Pascanu et al., 2013) [^{58}] yields more significant improvements (a Tensorflow implementation can be found here).

**Down-projection** To reduce the number of output parameters further, the hidden state of the LSTM can be projected to a smaller size. This is useful particularly for tasks with a large number of outputs, such as language modelling (Melis et al., 2017).

In the following, we will discuss task-specific best practices. Most of these perform best for a particular type of task. Some of them might still be applied to other tasks, but should be validated before. We will discuss the following tasks: classification, sequence labelling, natural language generation (NLG), and -- as a special case of NLG -- neural machine translation.

More so than for sequence tasks, where CNNs have only recently found application due to more efficient convolutional operations, CNNs have been popular for classification tasks in NLP. The following best practices relate to CNNs and capture some of their optimal hyperparameter choices.

**CNN filters** Combining filter sizes near the optimal filter size, e.g. (3,4,5) performs best (Kim, 2014; Kim et al., 2016). The optimal number of feature maps is in the range of 50-600 (Zhang & Wallace, 2015) [^{59}].

**Aggregation function** 1-max-pooling outperforms average-pooling and \(k\)-max pooling (Zhang & Wallace, 2015).

Sequence labelling is ubiquitous in NLP. While many of the existing best practices are with regard to a particular part of the model architecture, the following guidelines discuss choices for the model's output and prediction stage.

**Tagging scheme** For some tasks, which can assign labels to segments of texts, different tagging schemes are possible. These are: *BIO*, which marks the first token in a segment with a *B-* tag, all remaining tokens in the span with an *I-*tag, and tokens outside of segments with an *O-* tag; *IOB*, which is similar to BIO, but only uses *B-* if the previous token is of the same class but not part of the segment; and *IOBES*, which in addition distinguishes between single-token entities (*S-*) and the last token in a segment (*E-*). Using IOBES and BIO yield similar performance (Lample et al., 2017)

**CRF output layer** If there are any dependencies between outputs, such as in named entity recognition the final softmax layer can be replaced with a linear-chain conditional random field (CRF). This has been shown to yield consistent improvements for tasks that require the modelling of constraints (Huang et al., 2015; Max & Hovy, 2016; Lample et al., 2016) [^{60}, ^{61}, ^{62}].

**Constrained decoding** Rather than using a CRF output layer, constrained decoding can be used as an alternative approach to reject erroneous sequences, i.e. such that do not produce valid BIO transitions (He et al., 2017). Constrained decoding has the advantage that arbitrary constraints can be enforced this way, e.g. task-specific or syntactic constraints.

Most of the existing best practices can be applied to natural language generation (NLG). In fact, many of the tips presented so far stem from advances in language modelling, the most prototypical NLP task.

**Modelling coverage** Repetition is a big problem in many NLG tasks as current models do not have a good way of remembering what outputs they already produced. Modelling coverage explicitly in the model is a good way of addressing this issue. A checklist can be used if it is known in advances, which entities should be mentioned in the output, e.g. ingredients in recipes (Kiddon et al., 2016) [^{63}]. If attention is used, we can keep track of a coverage vector \(\mathbf{c}_i\), which is the sum of attention distributions \(\mathbf{a}_t\) over previous time steps (Tu et al., 2016; See et al., 2017) [^{64}, ^{65}]:

\(\mathbf{c}_i = \sum\limits^{i-1}_{t=1} \mathbf{a}_t \)

This vector captures how much attention we have paid to all words in the source. We can now condition additive attention additionally on this coverage vector in order to encourage our model not to attend to the same words repeatedly:

\(f_{att}(\mathbf{h}_i,\mathbf{s}_j,\mathbf{c}_i) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}_j + \mathbf{W}_3 \mathbf{c}_i )\)

In addition, we can add an auxiliary loss that captures the task-specific attention behaviour that we would like to elicit: For NMT, we would like to have a roughly one-to-one alignment; we thus penalize the model if the final coverage vector is more or less than one at every index (Tu et al., 2016). For summarization, we only want to penalize the model if it repeatedly attends to the same location (See et al., 2017).

While neural machine translation (NMT) is an instance of NLG, NMT receives so much attention that many methods have been developed specifically for the task. Similarly, many best practices or hyperparameter choices apply exclusively to it.

**Embedding dimensionality** 2048-dimensional embeddings yield the best performance, but only do so by a small margin. Even 128-dimensional embeddings perform surprisingly well and converge almost twice as quickly (Britz et al., 2017).

**Encoder and decoder depth** The encoder does not need to be deeper than \(2-4\) layers. Deeper models outperform shallower ones, but more than \(4\) layers is not necessary for the decoder (Britz et al., 2017).

**Directionality** Bidirectional encoders outperform unidirectional ones by a small margin.
Sutskever et al., (2014) [^{67}] proposed to reverse the source sequence to reduce the number of long-term dependencies. Reversing the source sequence in unidirectional encoders outperforms its non-reversed counter-part (Britz et al., 2017).

**Beam search strategy** Medium beam sizes around \(10\) with length normalization penalty of \(1.0\) (Wu et al., 2016) yield the best performance (Britz et al., 2017).

**Sub-word translation** Senrich et al. (2016) [^{66}] proposed to split words into sub-words based on a byte-pair encoding (BPE). BPE iteratively merges frequent symbol pairs, which eventually results in frequent character n-grams being merged into a single symbol, thereby effectively eliminating out-of-vocabulary-words. While it was originally meant to handle rare words, a model with sub-word units outperforms full-word systems across the board, with 32,000 being an effective vocabulary size for sub-word units (Denkowski & Neubig, 2017).

I hope this post was helpful in kick-starting your learning of a new NLP task. Even if you have already been familiar with most of these, I hope that you still learnt something new or refreshed your knowledge of useful tips.

I am sure that I have forgotten many best practices that deserve to be on this list. Similarly, there are many tasks such as parsing, information extraction, etc., which I do not know enough about to give recommendations. If you have a best practice that should be on this list, do let me know in the comments below. Please provide at least one reference and your handle for attribution. If this gets very collaborative, I might open a GitHub repository rather than collecting feedback here (I won't be able to accept PRs submitted directly to the generated HTML source of this article).

Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training Very Deep Networks. In Advances in Neural Information Processing Systems. ↩

Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-Aware Neural Language Models. AAAI. Retrieved from http://arxiv.org/abs/1508.06615 ↩

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. arXiv Preprint arXiv:1602.02410. Retrieved from http://arxiv.org/abs/1602.02410 ↩

Zilly, J. G., Srivastava, R. K., Koutnik, J., & Schmidhuber, J. (2017). Recurrent Highway Networks. In International Conference on Machine Learning (ICML 2017). ↩

Zhang, Y., Chen, G., Yu, D., Yao, K., Kudanpur, S., & Glass, J. (2016). Highway Long Short-Term Memory RNNS for Distant Speech Recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). ↩

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In CVPR. ↩

Huang, G., Weinberger, K. Q., & Maaten, L. Van Der. (2016). Densely Connected Convolutional Networks. CVPR 2017. ↩

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15, 1929–1958. https://doi.org/10.1214/12-AOS1000 ↩

Ba, J., & Frey, B. (2013). Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems. Retrieved from file:///Files/A5/A51D0755-5CEF-4772-942D-C5B8157FBE5E.pdf ↩

Li, Z., Gong, B., & Yang, T. (2016). Improved Dropout for Shallow and Deep Learning. In Advances in Neural Information Processing Systems 29 (NIPS 2016). Retrieved from http://arxiv.org/abs/1602.02220 ↩

Gal, Y., & Ghahramani, Z. (2016). A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Advances in Neural Information Processing Systems. Retrieved from http://arxiv.org/abs/1512.05287 ↩

Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1746–1751. Retrieved from http://arxiv.org/abs/1408.5882 ↩

Ruder, S. (2017). An Overview of Multi-Task Learning in Deep Neural Networks. In arXiv preprint arXiv:1706.05098. ↩

Semi-supervised Multitask Learning for Sequence Labeling. In Proceedings of ACL 2017. ↩

Bahdanau, D., Cho, K., & Bengio, Y.. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015. https://doi.org/10.1146/annurev.neuro.26.041002.131047 ↩

Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015. Retrieved from http://arxiv.org/abs/1508.04025 ↩

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. arXiv Preprint arXiv:1706.03762. ↩

Lin, Z., Feng, M., Santos, C. N. dos, Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017). A Structured Self-Attentive Sentence Embedding. In ICLR 2017. ↩

Daniluk, M., Rockt, T., Welbl, J., & Riedel, S. (2017). Frustratingly Short Attention Spans in Neural Language Modeling. In ICLR 2017. ↩

Wu, Y., Schuster, M., Chen, Z., Le, Q. V, Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv Preprint arXiv:1609.08144. ↩

Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations. ↩

Ruder, S. (2016). An overview of gradient descent optimization. arXiv Preprint arXiv:1609.04747. ↩

Denkowski, M., & Neubig, G. (2017). Stronger Baselines for Trustable Results in Neural Machine Translation. ↩

Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv Preprint arXiv:1503.02531. https://doi.org/10.1063/1.4931082 ↩

Kuncoro, A., Ballesteros, M., Kong, L., Dyer, C., & Smith, N. A. (2016). Distilling an Ensemble of Greedy Dependency Parsers into One MST Parser. Empirical Methods in Natural Language Processing. ↩

Kim, Y., & Rush, A. M. (2016). Sequence-Level Knowledge Distillation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16). ↩

Britz, D., Goldie, A., Luong, T., & Le, Q. (2017). Massive Exploration of Neural Machine Translation Architectures. In arXiv preprint arXiv:1703.03906. ↩

Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems, 649–657. Retrieved from http://arxiv.org/abs/1509.01626 ↩

Conneau, A., Schwenk, H., Barrault, L., & Lecun, Y. (2016). Very Deep Convolutional Networks for Natural Language Processing. arXiv Preprint arXiv:1606.01781. Retrieved from http://arxiv.org/abs/1606.01781 ↩

Le, H. T., Cerisara, C., & Denis, A. (2017). Do Convolutional Networks need to be Deep for Text Classification ? In arXiv preprint arXiv:1707.04108. ↩

Wu, Y., Schuster, M., Chen, Z., Le, Q. V, Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv Preprint arXiv:1609.08144. ↩

Plank, B., Søgaard, A., & Goldberg, Y. (2016). Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. ↩

He, L., Lee, K., Lewis, M., & Zettlemoyer, L. (2017). Deep Semantic Role Labeling: What Works and What’s Next. ACL. ↩

Melis, G., Dyer, C., & Blunsom, P. (2017). On the State of the Art of Evaluation in Neural Language Models. ↩

Rei, M. (2017). Semi-supervised Multitask Learning for Sequence Labeling. In Proceedings of ACL 2017. ↩

Ramachandran, P., Liu, P. J., & Le, Q. V. (2016). Unsupervised Pretrainig for Sequence to Sequence Learning. arXiv Preprint arXiv:1611.02683. ↩

Kadlec, R., Schmid, M., Bajgar, O., & Kleindienst, J. (2016). Text Understanding with the Attention Sum Reader Network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. ↩

Cheng, J., Dong, L., & Lapata, M. (2016). Long Short-Term Memory-Networks for Machine Reading. arXiv Preprint arXiv:1601.06733. Retrieved from http://arxiv.org/abs/1601.06733 ↩

Parikh, A. P., Täckström, O., Das, D., & Uszkoreit, J. (2016). A Decomposable Attention Model for Natural Language Inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Retrieved from http://arxiv.org/abs/1606.01933 ↩

Paulus, R., Xiong, C., & Socher, R. (2017). A Deep Reinforced Model for Abstractive Summarization. In arXiv preprint arXiv:1705.04304,. Retrieved from http://arxiv.org/abs/1705.04304 ↩

Liu, Y., & Lapata, M. (2017). Learning Structured Text Representations. In arXiv preprint arXiv:1705.09207. Retrieved from http://arxiv.org/abs/1705.09207 ↩

Zhang, J., Mitliagkas, I., & Ré, C. (2017). YellowFin and the Art of Momentum Tuning. arXiv preprint arXiv:1706.03471. ↩

Goldberg, Y. (2016). A Primer on Neural Network Models for Natural Language Processing. Journal of Artificial Intelligence Research, 57, 345–420. https://doi.org/10.1613/jair.4992 ↩

Melamud, O., McClosky, D., Patwardhan, S., & Bansal, M. (2016). The Role of Context Types and Dimensionality in Learning Word Embeddings. In Proceedings of NAACL-HLT 2016 (pp. 1030–1040). Retrieved from http://arxiv.org/abs/1601.00893 ↩

Ruder, S., Ghaffari, P., & Breslin, J. G. (2016). A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16), 999–1005. Retrieved from http://arxiv.org/abs/1609.02745 ↩

Reimers, N., & Gurevych, I. (2017). Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks. In arXiv preprint arXiv:1707.06799: Retrieved from https://arxiv.org/pdf/1707.06799.pdf ↩

Søgaard, A., & Goldberg, Y. (2016). Deep multi-task learning with low level tasks supervised at lower layers. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 231–235. ↩

Liu, P., Qiu, X., & Huang, X. (2017). Adversarial Multi-task Learning for Text Classification. In ACL 2017. Retrieved from http://arxiv.org/abs/1704.05742 ↩

Ruder, S., Bingel, J., Augenstein, I., & Søgaard, A. (2017). Sluice networks: Learning what to share between loosely related tasks. arXiv Preprint arXiv:1705.08142. Retrieved from http://arxiv.org/abs/1705.08142 ↩

Dozat, T., & Manning, C. D. (2017). Deep Biaffine Attention for Neural Dependency Parsing. In ICLR 2017. Retrieved from http://arxiv.org/abs/1611.01734 ↩

Jean, S., Cho, K., Memisevic, R., & Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1–10. Retrieved from http://www.aclweb.org/anthology/P15-1001 ↩

Sennrich, R., Haddow, B., & Birch, A. (2016). Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation (WMT 2016). Retrieved from http://arxiv.org/abs/1606.02891 ↩

Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). Snapshot Ensembles: Train 1, get M for free. In ICLR 2017. ↩

Inan, H., Khosravi, K., & Socher, R. (2016). Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. arXiv Preprint arXiv:1611.01462. ↩

Press, O., & Wolf, L. (2017). Using the Output Embedding to Improve Language Models. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2, 157--163. ↩

Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. Neural Information Processing Systems Conference (NIPS 2012). https://doi.org/2012arXiv1206.2944S ↩

Mikolov, T. (2012). Statistical language models based on neural networks (Doctoral dissertation, PhD thesis, Brno University of Technology). ↩

Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. International Conference on Machine Learning, (2), 1310–1318. https://doi.org/10.1109/72.279181 ↩

Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. arXiv Preprint arXiv:1510.03820, (1). Retrieved from http://arxiv.org/abs/1510.03820 ↩

Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv preprint arXiv:1508.01991. ↩

Ma, X., & Hovy, E. (2016). End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. arXiv Preprint arXiv:1603.01354. ↩

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural Architectures for Named Entity Recognition. NAACL-HLT 2016. ↩

Kiddon, C., Zettlemoyer, L., & Choi, Y. (2016). Globally Coherent Text Generation with Neural Checklist Models. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP2016), 329–339. ↩

Tu, Z., Lu, Z., Liu, Y., Liu, X., & Li, H. (2016). Modeling Coverage for Neural Machine Translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.1145/2856767.2856776 ↩

See, A., Liu, P. J., & Manning, C. D. (2017). Get To The Point: Summarization with Pointer-Generator Networks. In ACL 2017. ↩

Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Retrieved from http://arxiv.org/abs/1508.07909 ↩

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 9. Retrieved from http://arxiv.org/abs/1409.3215%5Cnhttp://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks ↩

Credit for the cover image goes to Bahdanau et al. (2015).

]]>Note: If you are looking for a review paper, this blog post is also available as an article on arXiv.

Table of contents:

- Introduction
- Motivation
- Two MTL methods for Deep Learning
- Why does MTL work?
- MTL in non-neural models
- Recent work on MTL for Deep Learning
- Auxiliary tasks
- Conclusion

In Machine Learning (ML), we typically care about optimizing for a particular metric, whether this is a score on a certain benchmark or a business KPI. In order to do this, we generally train a single model or an ensemble of models to perform our desired task. We then fine-tune and tweak these models until their performance no longer increases. While we can generally achieve acceptable performance this way, by being laser-focused on our single task, we ignore information that might help us do even better on the metric we care about. Specifically, this information comes from the training signals of related tasks. By sharing representations between related tasks, we can enable our model to generalize better on our original task. This approach is called Multi-Task Learning (MTL) and will be the topic of this blog post.

Multi-task learning has been used successfully across all applications of machine learning, from natural language processing [^{1}] and speech recognition [^{2}] to computer vision [^{3}] and drug discovery [^{4}]. MTL comes in many guises: joint learning, learning to learn, and learning with auxiliary tasks are only some names that have been used to refer to it. Generally, as soon as you find yourself optimizing more than one loss function, you are effectively doing multi-task learning (in contrast to single-task learning). In those scenarios, it helps to think about what you are trying to do explicitly in terms of MTL and to draw insights from it.

Even if you're only optimizing one loss as is the typical case, chances are there is an auxiliary task that will help you improve upon your main task. Rich Caruana [^{5}] summarizes the goal of MTL succinctly: "MTL improves generalization by leveraging the domain-specific information contained in the training signals of related tasks".

Over the course of this blog post, I will try to give a general overview of the current state of multi-task learning, in particular when it comes to MTL with deep neural networks. I will first motivate MTL from different perspectives. I will then introduce the two most frequently employed methods for MTL in Deep Learning. Subsequently, I will describe mechanisms that together illustrate why MTL works in practice. Before looking at more advanced neural network-based MTL methods, I will provide some context by discussing the literature in MTL. I will then introduce some more powerful recently proposed methods for MTL in deep neural networks. Finally, I will talk about commonly used types of auxiliary tasks and discuss what makes a good auxiliary task for MTL.

We can motivate multi-task learning in different ways: Biologically, we can see multi-task learning as being inspired by human learning. For learning new tasks, we often apply the knowledge we have acquired by learning related tasks. For instance, a baby first learns to recognize faces and can then apply this knowledge to recognize other objects.

From a pedagogical perspective, we often learn tasks first that provide us with the necessary skills to master more complex techniques. This is true for learning the proper way of falling in martial arts, e.g. Judo as much as learning to program.

Taking an example out of pop culture, we can also consider *The Karate Kid* (1984) (thanks to Margaret Mitchell and Adrian Benton for the inspiration). In the movie, *sensei* Mr Miyagi teaches the karate kid seemingly unrelated tasks such as sanding the floor and waxing a car. In hindsight, these, however, turn out to equip him with invaluable skills that are relevant for learning karate.

Finally, we can motivate multi-task learning from a machine learning point of view: We can view multi-task learning as a form of inductive transfer. Inductive transfer can help improve a model by introducing an inductive bias, which causes a model to prefer some hypotheses over others. For instance, a common form of inductive bias is \(\ell_1\) regularization, which leads to a preference for sparse solutions. In the case of MTL, the inductive bias is provided by the auxiliary tasks, which cause the model to prefer hypotheses that explain more than one task. As we will see shortly, this generally leads to solutions that generalize better.

So far, we have focused on theoretical motivations for MTL. To make the ideas of MTL more concrete, we will now look at the two most commonly used ways to perform multi-task learning in deep neural networks. In the context of Deep Learning, multi-task learning is typically done with either *hard* or *soft parameter sharing* of hidden layers.

Hard parameter sharing is the most commonly used approach to MTL in neural networks and goes back to [^{6}]. It is generally applied by sharing the hidden layers between all tasks, while keeping several task-specific output layers.

Hard parameter sharing greatly reduces the risk of overfitting. In fact, [^{7}] showed that the risk of overfitting the shared parameters is an order N -- where N is the number of tasks -- smaller than overfitting the task-specific parameters, i.e. the output layers. This makes sense intuitively: The more tasks we are learning simultaneously, the more our model has to find a representation that captures all of the tasks and the less is our chance of overfitting on our original task.

In soft parameter sharing on the other hand, each task has its own model with its own parameters. The distance between the parameters of the model is then regularized in order to encourage the parameters to be similar. [^{8}] for instance use the \(\ell_2\) norm for regularization, while [^{9}] use the trace norm.

The constraints used for soft parameter sharing in deep neural networks have been greatly inspired by regularization techniques for MTL that have been developed for other models, which we will soon discuss.

Even though an inductive bias obtained through multi-task learning seems intuitively plausible, in order to understand MTL better, we need to look at the mechanisms that underlie it. Most of these have first been proposed by Caruana (1998). For all examples, we will assume that we have two related tasks \(A\) and \(B\), which rely on a common hidden layer representation \(F\).

MTL effectively increases the sample size that we are using for training our model. As all tasks are at least somewhat noisy, when training a model on some task \(A\), our aim is to learn a good representation for task \(A\) that ideally ignores the data-dependent noise and generalizes well. As different tasks have different noise patterns, a model that learns two tasks simultaneously is able to learn a more general representation. Learning just task \(A\) bears the risk of overfitting to task \(A\), while learning \(A\) and \(B\) jointly enables the model to obtain a better representation \(F\) through averaging the noise patterns.

If a task is very noisy or data is limited and high-dimensional, it can be difficult for a model to differentiate between relevant and irrelevant features. MTL can help the model focus its attention on those features that actually matter as other tasks will provide additional evidence for the relevance or irrelevance of those features.

Some features \(G\) are easy to learn for some task \(B\), while being difficult to learn for another task \(A\). This might either be because \(A\) interacts with the features in a more complex way or because other features are impeding the model's ability to learn \(G\). Through MTL, we can allow the model to *eavesdrop*, i.e. learn \(G\) through task \(B\). The easiest way to do this is through *hints* [^{10}], i.e. directly training the model to predict the most important features.

MTL biases the model to prefer representations that other tasks also prefer. This will also help the model to generalize to new tasks in the future as a hypothesis space that performs well for a sufficiently large number of training tasks will also perform well for learning novel tasks as long as they are from the same environment [^{11}].

Finally, MTL acts as a regularizer by introducing an inductive bias. As such, it reduces the risk of overfitting as well as the Rademacher complexity of the model, i.e. its ability to fit random noise.

In order to better understand MTL in deep neural networks, we will now look to the existing literature on MTL for linear models, kernel methods, and Bayesian algorithms. In particular, we will discuss two main ideas that have been pervasive throughout the history of multi-task learning: enforcing sparsity across tasks through norm regularization; and modelling the relationships between tasks.

Note that many approaches to MTL in the literature deal with a homogenous setting: They assume that all tasks are associated with a single output, e.g. the multi-class MNIST dataset is typically cast as 10 binary classification tasks. More recent approaches deal with a more realistic, heterogeneous setting where each task corresponds to a unique set of outputs.

In order to better connect the following approaches, let us first introduce some notation. We have \(T\) tasks. For each task \(t\), we have a model \(m_t\) with parameters \(a_t\) of dimensionality \(d\). We can write the parameters as a column vector \(a_t = \begin{bmatrix}a_{1, t} \ \ldots \ a_{d, t} \end{bmatrix}^\top \). We now stack these column vectors \(a_1, \ldots, a_T\) column by column to form a matrix \(A \in

\mathbb{R}^{d \times T}\). The \(i\)-th row of \(A\) then contains the parameter \(a_{i, \cdot}\) corresponding to the \(i\)-th feature of the model for every task, while the \(j\)-th column of \(A\) contains the parameters \(a_{\cdot,j}\) corresponding to the \(j\)-th model.

Many existing methods make some sparsity assumption with regard to the parameters of our models. [^{12}] assume that all models share a small set of features. In terms of our task parameter matrix \(A\), this means that all but a few rows are \(0\), which corresponds to only a few features being used across *all* tasks. In order to enforce this, they generalize the \(\ell_1\) norm to the MTL setting. Recall that the \(\ell_1\) norm is a constraint on the sum of the parameters, which forces all but a few parameters to be exactly \(0\). It is also known as lasso (**l**east **a**bsolute **s**hrinkage and **s**election **o**perator).

While in the single-task setting, the \(\ell_1\) norm is computed based on the parameter vector \(a_t\) of the respective task \(t\), for MTL we compute it over our task parameter matrix \(A\). In order to do this, we first compute an \(\ell_q\) norm across each row \(a_i\) containing the parameter corresponding to the \(i\)-th feature across all tasks, which yields a vector \(b = \begin{bmatrix}\|a_1\|_q \ldots \|a_d\|_q \end{bmatrix} \in \mathbb{R}^d\). We then compute the \(\ell_1\) norm of this vector, which forces all but a few entries of \(b\), i.e. rows in \(A\) to be \(0\).

As we can see, depending on what constraint we would like to place on each row, we can use a different \(\ell_q\). In general, we refer to these mixed-norm constraints as \(\ell_1/\ell_q\) norms. They are also known as block-sparse regularization, as they lead to entire rows of \(A\) being set to \(0\). [^{13}] use \(\ell_1/\ell_\infty\) regularization, while Argyriou et al. (2007) use a mixed \(\ell_1/\ell_2\) norm. The latter is also known as group lasso and was first proposed by [^{14}].

Argyriou et al. (2007) also show that the problem of optimizing the non-convex group lasso can be made convex by penalizing the trace norm of \(A\), which forces \(A\) to be low-rank and thereby constrains the column parameter vectors \(a_{\cdot, 1}, \ldots, a_{\cdot, t}\) to live in a low-dimensional subspace. [^{15}] furthermore establish upper bounds for using the group lasso in multi-task learning.

As much as this block-sparse regularization is intuitively plausible, it is very dependent on the extent to which the features are shared across tasks. [^{16}] show that if features do not overlap by much, \(\ell_1/\ell_q\) regularization might actually be worse than element-wise \(\ell_1\) regularization.

For this reason, [^{17}] improve upon block-sparse models by proposing a method that combines block-sparse and element-wise sparse regularization. They decompose the task parameter matrix \(A\) into two matrices \(B\) and \(S\) where \(A = B + S\). \(B\) is then enforced to be block-sparse using \(\ell_1/\ell_\infty\) regularization, while \(S\) is made element-wise sparse using lasso. Recently, [^{18}] propose a distributed version of group-sparse regularization.

While the group-sparsity constraint forces our model to only consider a few features, these features are largely used across all tasks. All of the previous approaches thus assume that the tasks used in multi-task learning are closely related. However, each task might not be closely related to all of the available tasks. In those cases, sharing information with an unrelated task might actually hurt performance, a phenomenon known as negative transfer.

Rather than sparsity, we would thus like to leverage prior knowledge indicating that some tasks are related while others are not. In this scenario, a constraint that enforces a clustering of tasks might be more appropriate. [^{19}] suggest to impose a clustering constraint by penalizing both the norms of our task column vectors \(a_{\cdot, 1}, \ldots, a_{\cdot, t}\) as well as their variance with the following constraint:

\(\Omega = \|\bar{a}\|^2 + \dfrac{\lambda}{T} \sum^T_{t=1} \| a_{\cdot, t} - \bar{a} \|^2 \)

where \(\bar{a} = (\sum^T_{t=1} a_{\cdot, t})/T \) is the mean parameter vector. This penalty enforces a clustering of the task parameter vectors \(a_{\cdot, 1}, \ldots, a_{\cdot, t}\) towards their mean that is controlled by \(\lambda\). They apply this constraint to kernel methods, but it is equally applicable to linear models.

A similar constraint for SVMs was also proposed by [^{20}]. Their constraint is inspired by Bayesian methods and seeks to make all models close to some mean model. In SVMs, the loss thus trades off having a large margin for each SVM with being close to the mean model.

[^{21}] make the assumptions underlying cluster regularization more explicit by formalizing a cluster constraint on \(A\) under the assumption that the number of clusters \(C\) is known in advance. They then decompose the penalty into three separate norms:

A global penalty which measures how large our column parameter vectors are on average: \(\Omega_{mean}(A) = \|\bar{a}\|^2 \).

A measure of between-cluster variance that measures how close to each other the clusters are: \(\Omega_{between}(A) = \sum^C_{c=1} T_c \| \bar{a}_c - \bar{a} \|^2 \) where \(T_c\) is the number of tasks in the \(c\)-th cluster and \(\bar{a}_c\) is the mean vector of the task parameter vectors in the \(c\)-th cluster.

A measure of within-cluster variance that gauges how compact each cluster is: \(\Omega_{within} = \sum^C_{c=1} \sum_{t \in J(c)} \| a_{\cdot, t} - \bar{a}_c \| \) where \(J(c)\) is the set of tasks in the \(c\)-th cluster.

The final constraint then is the weighted sum of the three norms:

\(\Omega(A) = \lambda_1 \Omega_{mean}(A) + \lambda_2 \Omega_{between}(A) + \lambda_3 \Omega_{within}(A)\).

As this constraint assumes clusters are known in advance, they introduce a convex relaxation of the above penalty that allows to learn the clusters at the same time.

In another scenario, tasks might not occur in clusters but have an inherent structure. [^{22}] extend the group lasso to deal with tasks that occur in a tree structure, while [^{23}] apply it to tasks with graph structures.

While the previous approaches to modelling the relationship between tasks employ norm regularization, other approaches do so without regularization: [^{24}] were the first ones who presented a task clustering algorithm using k-nearest neighbour, while [^{25}] learn a common structure from multiple related tasks with an application to semi-supervised learning.

Much other work on learning task relationships for multi-task learning uses Bayesian methods:

[^{26}] propose a Bayesian neural network for multi-task learning by placing a prior on the model parameters to encourage similar parameters across tasks. [^{27}] extend Gaussian processes (GP) to MTL by inferring parameters for a shared covariance matrix. As this is computationally very expensive, they adopt a sparse approximation scheme that greedily selects the most informative examples. [^{28}] also use GP for MTL by assuming that all models are sampled from a common prior.

[^{29}] place a Gaussian as a prior distribution on each task-specific layer. In order to encourage similarity between different tasks, they propose to make the mean task-dependent and introduce a clustering of the tasks using a mixture distribution. Importantly, they require task characteristics that define the clusters and the number of mixtures to be specified in advance.

Building on this, [^{30}] draw the distribution from a Dirichlet process and enable the model to learn the similarity between tasks as well as the number of clusters. They then share the same model among all tasks in the same cluster. [^{31}] propose a hierarchical Bayesian model, which learns a latent task hierarchy, while [^{32}] use a GP-based regularization for MTL and extend a previous GP-based approach to be more computationally feasible in larger settings.

Other approaches focus on the online multi-task learning setting: [^{33}] adapt some existing methods such as the approach by Evgeniou et al. (2005) to the online setting. They also propose a MTL extension of the regularized Perceptron, which encodes task relatedness in a matrix. They use different forms of regularization to bias this task relatedness matrix, e.g. the closeness of the task characteristic vectors or the dimension of the spanned subspace. Importantly, similar to some earlier approaches, they require the task characteristics that make up this matrix to be provided in advance. [^{34}] then extend the previous approach by learning the task relationship matrix.

[^{35}] assume that tasks form disjoint groups and that the tasks within each group lie in a low-dimensional subspace. Within each group, tasks share the same feature representation whose parameters are learned jointly together with the group assignment matrix using an alternating minimization scheme. However, a total disjointness between groups might not be the ideal way, as the tasks might still share some features that are helpful for prediction.

[^{36}] in turn allow two tasks from different groups to overlap by assuming that there exist a small number of latent basis tasks. They then model the parameter vector \(a_t\) of every actual task \(t\) as a linear combination of these: \(a_t = Ls_t\) where \(L\ \in \mathbb{R}^{k \times d}\) is a matrix containing the parameter vectors of \(k\) latent tasks, while \(s_t\ \in \mathbb{R}^k\) is a vector containing the coefficients of the linear combination. In addition, they constrain the linear combination to be sparse in the latent tasks; the overlap in the sparsity patterns between two tasks then controls the amount of sharing between these. Finally, [^{37}] learn a small pool of shared hypotheses and then map each task to a single hypothesis.

While many recent Deep Learning approaches have used multi-task learning -- either explicitly or implicitly -- as part of their model (prominent examples will be featured in the next section), they all employ the two approaches we introduced earlier, hard and soft parameter sharing. In contrast, only a few papers have looked at developing better mechanisms for MTL in deep neural networks.

In MTL for computer vision, approaches often share the convolutional layers, while learning task-specific fully-connected layers. [^{38}] improve upon these models by proposing Deep Relationship Networks. In addition to the structure of shared and task-specific layers, which can be seen in Figure 3, they place matrix priors on the fully connected layers, which allow the model to learn the relationship between tasks, similar to some of the Bayesian models we have looked at before. This approach, however, still relies on a pre-defined structure for sharing, which may be adequate for well-studied computer vision problems, but prove error-prone for novel tasks.

Starting at the other extreme, [^{39}] propose a bottom-up approach that starts with a thin network and dynamically widens it greedily during training using a criterion that promotes grouping of similar tasks. The widening procedure, which dynamically creates branches can be seen in Figure 4. However, the greedy method might not be able to discover a model that is globally optimal, while assigning each branch to exactly one task does not allow the model to learn more complex interactions between tasks.

[^{40}] start out with two separate model architectures just as in soft parameter sharing. They then use what they refer to as cross-stitch units to allow the model to determine in what way the task-specific networks leverage the knowledge of the other task by learning a linear combination of the output of the previous layers. Their architecture can be seen in Figure 5, in which they only place cross-stitch units after pooling and fully-connected layers.

In contrast, in natural language processing (NLP), recent work focused on finding better task hierarchies for multi-task learning: [^{41}] show that low-level tasks, i.e. NLP tasks typically used for preprocessing such as part-of-speech tagging and named entity recognition, should be supervised at lower layers when used as auxiliary task.

Building on this finding, [^{42}] pre-define a hierarchical architecture consisting of several NLP tasks, which can be seen in Figure 6, as a joint model for multi-task learning.

Instead of learning the structure of sharing, [^{43}] take a orthogonal approach by considering the uncertainty of each task. They then adjust each task's relative weight in the cost function by deriving a multi-task loss function based on maximizing the Gaussian likelihood with task-dependant uncertainty. Their architecture for per-pixel depth regression, semantic and instance segmentation can be seen in Figure 7.

More recent work seeks to generalize existing approaches to MTL to Deep Learning: [^{44}] generalize some of the previously discussed matrix factorisation approaches using tensor factorisation to split the model parameters into shared and task-specific parameters for every layer.

Finally, we propose Sluice Networks [^{45}], a model that generalizes Deep Learning-based MTL approaches such as hard parameter sharing and cross-stitch networks, block-sparse regularization approaches, as well as recent NLP approaches that create a task hierarchy. The model, which can be seen in Figure 8, allows to learn what layers and subspaces should be shared, as well as at what layers the network has learned the best representations of the input sequences.

Having surveyed these recent approaches, let us now briefly summarize and draw a conclusion on what to share in our deep MTL models. Most approaches in the history of MTL have focused on the scenario where tasks are drawn from the same distribution (Baxter, 1997). While this scenario is beneficial for sharing, it does not always hold. In order to develop robust models for MTL, we thus have to be able to deal with unrelated or only loosely related tasks.

While early work in MTL for Deep Learning has pre-specified which layers to share for each task pairing, this strategy does not scale and heavily biases MTL architectures. Hard parameter sharing, a technique that was originally proposed by Caruana (1996), is still the norm 20 years later. While useful in many scenarios, hard parameter sharing quickly breaks down if tasks are not closely related or require reasoning on different levels. Recent approaches have thus looked towards *learning* what to share and generally outperform hard parameter sharing. In addition, giving our models the capacity to learn a task hierarchy is helpful, particularly in cases that require different granularities.

As mentioned initially, we are doing MTL as soon as we are optimizing more than one loss function. Rather than constraining our model to compress the knowledge of all tasks into the same parameter space, it is thus helpful to draw on the advances in MTL that we have discussed and enable our model to learn how the tasks should interact with each other.

MTL is a natural fit in situations where we are interested in obtaining predictions for multiple tasks at once. Such scenarios are common for instance in finance or economics forecasting, where we might want to predict the value of many possibly related indicators, or in bioinformatics where we might want to predict symptoms for multiple diseases simultaneously. In scenarios such as drug discovery, where tens or hundreds of active compounds should be predicted, MTL accuracy increases continuously with the number of tasks (Ramsundar et al., 2015).

In most situations, however, we only care about performance on one task. In this section, we will thus look at how we can find a suitable auxiliary task in order to still reap the benefits of multi-task learning.

Using a related task as an auxiliary task for MTL is the classical choice. To get an idea what a related task can be, we will present some prominent examples. Caruana (1998) uses tasks that predict different characteristics of the road as auxiliary tasks for predicting the steering direction in a self-driving car; [^{46}] use head pose estimation and facial attribute inference as auxiliary tasks for facial landmark detection; [^{47}] jointly learn query classification and web search; Girshick (2015) jointly predicts the class and the coordinates of an object in an image; finally, [^{48}] jointly predict the phoneme duration and frequency profile for text-to-speech.

Often, labeled data for a related task is unavailable. In some circumstances, however, we have access to a task that is *opposite* of what we want to achieve. This data can be leveraged using an adversarial loss, which does not seek to minimize but maximize the training error using a gradient reversal layer. This setup has found recent success in domain adaptation [^{49}]. The adversarial task in this case is predicting the domain of the input; by reversing the gradient of the adversarial task, the adversarial task loss is maximized, which is beneficial for the main task as it forces the model to learn representations that cannot distinguish between domains.

As mentioned before, MTL can be used to learn features that might not be easy to learn just using the original task. An effective way to achieve this is to use hints, i.e. predicting the features as an auxiliary task. Recent examples of this strategy in the context of natural language processing are [^{50}] who predict whether an input sentence contains a positive or negative sentiment word as auxiliary tasks for sentiment analysis and [^{51}] who predict whether a name is present in a sentence as auxiliary task for name error detection.

Similarly, the auxiliary task can be used to focus attention on parts of the image that a network might normally ignore. For instance, for learning to steer (Caruana, 1998) a single-task model might typically ignore lane markings as these make up only a small part of the image and are not always present. Predicting lane markings as auxiliary task, however, forces the model to learn to represent them; this knowledge can then also be used for the main task. Analogously, for facial recognition, one might learn to predict the location of facial landmarks as auxiliary tasks, since these are often distinctive.

For many tasks, the training objective is quantized, i.e. while a continuous scale might be more plausible, labels are available as a discrete set. This is the case in many scenarios that require human assessment for data gathering, such as predicting the risk of a disease (e.g. low/medium/high) or sentiment analysis (positive/neutral/negative). Using less quantized auxiliary tasks might help in these cases, as they might be learned more easily due to their objective being smoother.

In some scenarios, it is impractical to use some features as inputs as they are unhelpful for predicting the desired objective. However, they might still be able to guide the learning of the task. In those cases, the features can be used as outputs rather than inputs. [^{52}] present several problems where this is applicable.

In many situations, some features only become available *after* the predictions are supposed to be made. For instance, for self-driving cars, more accurate measurements of obstacles and lane markings can be made once the car is passing them. Caruana (1998) also gives the example of pneumonia prediction, after which the results of additional medical trials will be available. For these examples, the additional data cannot be used as features as it will not be available as input at runtime. However, it can be used as an auxiliary task to impart additional knowledge to the model during training.

The goal of an auxiliary task in MTL is to enable the model to learn representations that are shared or helpful for the main task. All auxiliary tasks discussed so far do this implicitly: They are closely related to the main task, so that learning them likely allows the model to learn beneficial representations. A more explicit modelling is possible, for instance by employing a task that is known to enable a model to learn transferable representations. The language modelling objective as employed by Cheng et al. (2015) and [^{53}] fulfils this role. In a similar vein, an autoencoder objective can also be used as an auxiliary task.

In this section, we have discussed different auxiliary tasks that can be used to leverage MTL even if we only care about one task. We still do not know, though, what auxiliary task will be useful in practice. Finding an auxiliary task is largely based on the assumption that the auxiliary task should be related to the main task in some way and that it should be helpful for predicting the main task.

However, we still do not have a good notion of when two tasks should be considered similar or related. Caruana (1998) defines two tasks to be similar if they use the same features to make a decision. Baxter (2000) argues only theoretically that related tasks share a common optimal hypothesis class, i.e. have the same inductive bias. [^{54}] propose that two tasks are \(\mathcal{F}\)-related if the data for both tasks can be generated from a fixed probability distribution using a set of transformations \(\mathcal{F}\). While this allows to reason over tasks where different sensors collect data for the same classification problem, e.g. object recognition with data from cameras with different angles and lighting conditions, it is not applicable to tasks that do not deal with the same problem. Xue et al. (2007) finally argue that two tasks are similar if their classification boundaries, i.e. parameter vectors are close.

In spite of these early theoretical advances in understanding task relatedness, we have not made much recent progress towards this goal. Task similarity is not binary, but resides on a spectrum. More similar tasks should help more in MTL, while less similar tasks should help less. Allowing our models to learn what to share with each task might allow us to temporarily circumvent the lack of theory and make better use even of only loosely related tasks. However, we also need to develop a more principled notion of task similarity with regard to multi-task learning in order to know which tasks we should prefer.

Recent work [^{55}] have found auxiliary tasks with compact and uniform label distributions to be preferable for sequence tagging problems in NLP, which we have confirmed in experiments (Ruder et al., 2017). In addition, gains have been found to be more likely for main tasks that quickly plateau with non-plateauing auxiliary tasks [^{56}].

These experiments, however, have so far been limited in scope and recent findings only provide the first clues towards a deeper understanding of multi-task learning in neural networks.

In this overview, I have reviewed both the history of literature in multi-task learning as well as more recent work on MTL for Deep Learning. While MTL is being more frequently used, the 20-year old hard parameter sharing paradigm is still pervasive for neural-network based MTL. Recent advances on learning what to share, however, are promising. At the same time, our understanding of tasks -- their similarity, relationship, hierarchy, and benefit for MTL -- is still limited and we need to learn more about them to gain a better understanding of the generalization capabilities of MTL with regard to deep neural networks.

I hope you found this overview helpful. If I made any error, missed a reference, or misrepresented some aspect, or if you would just like to share your thoughts, please leave a comment below.

This blog post is also available as an article on arXiv, in case you want to refer to it later.

In case you found it helpful, consider citing the corresponding arXiv article as:

*Sebastian Ruder (2017). An Overview of Multi-Task Learning in Deep Neural Networks. arXiv preprint arXiv:1706.05098.*

Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing. Proceedings of the 25th International Conference on Machine Learning - ICML ’08, 20(1), 160–167. https://doi.org/10.1145/1390156.1390177 ↩

Deng, L., Hinton, G. E., & Kingsbury, B. (2013). New types of deep neural network learning for speech recognition and related applications: An overview. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 8599–8603. https://doi.org/10.1109/ICASSP.2013.6639344 ↩

Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448). https://doi.org/10.1109/iccv.2015.169 ↩

Ramsundar, B., Kearnes, S., Riley, P., Webster, D., Konerding, D., & Pande, V. (2015). Massively Multitask Networks for Drug Discovery. https://doi.org/https://arxiv.org/abs/1502.02072 ↩

Caruana, R. (1998). Multitask Learning. Autonomous Agents and Multi-Agent Systems, 27(1), 95–133. https://doi.org/10.1016/j.csl.2009.08.003 ↩

Caruana, R. "Multitask learning: A knowledge-based source of inductive bias." Proceedings of the Tenth International Conference on Machine Learning. 1993. ↩

Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28, 7–39. Retrieved from http://link.springer.com/article/10.1023/A:1007327622663 ↩

Duong, L., Cohn, T., Bird, S., & Cook, P. (2015). Low Resource Dependency Parsing: Cross-lingual Parameter Sharing in a Neural Network Parser. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), 845–850. ↩

Yang, Y., & Hospedales, T. M. (2017). Trace Norm Regularised Deep Multi-Task Learning. In Workshop track - ICLR 2017. Retrieved from http://arxiv.org/abs/1606.04038 ↩

Abu-Mostafa, Y. S. (1990). Learning from hints in neural networks. Journal of Complexity, 6(2), 192–198. https://doi.org/10.1016/0885-064X(90)90006-Y ↩

Baxter, J. (2000). A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research, 12, 149–198. ↩

Argyriou, A., & Pontil, M. (2007). Multi-Task Feature Learning. In Advances in Neural Information Processing Systems. http://doi.org/10.1007/s10994-007-5040-8 ↩

C.Zhang and J.Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567–1594, 2008 ↩

Yuan, Ming, and Yi Lin. "Model selection and estimation in regression with grouped variables." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.1 (2006): 49-67. ↩

Lounici, K., Pontil, M., Tsybakov, A. B., & van de Geer, S. (2009). Taking Advantage of Sparsity in Multi-Task Learning. Stat, (1). Retrieved from http://arxiv.org/pdf/0903.1468 ↩

Negahban, S., & Wainwright, M. J. (2008). Joint support recovery under high-dimensional scaling : Benefits and perils of \(\ell_{1,\infty}\)-regularization. Advances in Neural Information Processing Systems, 1161–1168. ↩

Jalali, A., Ravikumar, P., Sanghavi, S., & Ruan, C. (2010). A Dirty Model for Multi-task Learning. Advances in Neural Information Processing Systems. Retrieved from https://papers.nips.cc/paper/4125-a-dirty-model-for-multi-task-learning.pdf ↩

Liu, S., Pan, S. J., & Ho, Q. (2016). Distributed Multi-task Relationship Learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) (pp. 751–760). Retrieved from http://arxiv.org/abs/1612.04022 ↩

Evgeniou, T., Micchelli, C., & Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6, 615–637. Retrieved from http://discovery.ucl.ac.uk/13423/ ↩

Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. International Conference on Knowledge Discovery and Data Mining, 109. https://doi.org/10.1145/1014052.1014067 ↩

Jacob, L., Vert, J., Bach, F. R., & Vert, J. (2009). Clustered Multi-Task Learning: A Convex Formulation. Advances in Neural Information Processing Systems 21, 745–752. Retrieved from http://eprints.pascal-network.org/archive/00004705/%5Cnhttp://papers.nips.cc/paper/3499-clustered-multi-task-learning-a-convex-formulation.pdf ↩

Kim, S., & Xing, E. P. (2010). Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity. 27th International Conference on Machine Learning, 1–14. https://doi.org/10.1214/12-AOAS549 ↩

Chen, X., Kim, S., Lin, Q., Carbonell, J. G., & Xing, E. P. (2010). Graph-Structured Multi-task Regression and an Efficient Optimization Method for General Fused Lasso, 1–21. https://doi.org/10.1146/annurev.arplant.56.032604.144204 ↩

Thrun, S., & O’Sullivan, J. (1996). Discovering Structure in Multiple Learning Tasks: The TC Algorithm. Proceedings of the Thirteenth International Conference on Machine Learning, 28(1), 5–5. Retrieved from http://scholar.google.com/scholar?cluster=956054018507723832&hl=en ↩

Ando, R. K., & Tong, Z. (2005). A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, 6, 1817–1853. ↩

Heskes, T. (2000). Empirical Bayes for Learning to Learn. Proceedings of the Seventeenth International Conference on Machine Learning, 367–364. ↩

Lawrence, N. D., & Platt, J. C. (2004). Learning to learn with the informative vector machine. Twenty-First International Conference on Machine Learning - ICML ’04, 65. https://doi.org/10.1145/1015330.1015382 ↩

Yu, K., Tresp, V., & Schwaighofer, A. (2005). Learning Gaussian processes from multiple tasks. Proceedings of the International Conference on Machine Learning (ICML), 22, 1012–1019. https://doi.org/10.1145/1102351.1102479 ↩

Bakker, B., & Heskes, T. (2003). Task Clustering and Gating for Bayesian Multitask Learning. Journal of Machine Learning Research, 1(1), 83–99. https://doi.org/10.1162/153244304322765658 ↩

Xue, Y., Liao, X., Carin, L., & Krishnapuram, B. (2007). Multi-Task Learning for Classification with Dirichlet Process Priors. Journal of Machine Learning Research, 8, 35–63. ↩

Daumé III, H. (2009). Bayesian multitask learning with latent hierarchies, 135–142. Retrieved from http://dl.acm.org.sci-hub.io/citation.cfm?id=1795131 ↩

Zhang, Y., & Yeung, D. (2010). A Convex Formulation for Learning Task Relationships in Multi-Task Learning. Uai, 733–442. ↩

Cavallanti, G., Cesa-Bianchi, N., & Gentile, C. (2010). Linear Algorithms for Online Multitask Classification. Journal of Machine Learning Research, 11, 2901–2934. ↩

Saha, A., Rai, P., Daumé, H., & Venkatasubramanian, S. (2011). Online learning of multiple tasks and their relationships. Journal of Machine Learning Research, 15, 643–651. Retrieved from http://www.scopus.com/inward/record.url?eid=2-s2.0-84862275213&partnerID=tZOtx3y1 ↩

Kang, Z., Grauman, K., & Sha, F. (2011). Learning with whom to share in multi-task feature learning. Proceedings of the 28th International Conference on Machine Learning, (4), 4–5. Retrieved from http://machinelearning.wustl.edu/mlpapers/paper

*files/ICML2011Kang*344.pdf ↩Kumar, A., & Daumé III, H. (2012). Learning Task Grouping and Overlap in Multi-task Learning. Proceedings of the 29th International Conference on Machine Learning, 1383–1390. ↩

Crammer, K., & Mansour, Y. (2012). Learning Multiple Tasks Using Shared Hypotheses. Neural Information Processing Systems (NIPS), 1484–1492 ↩

Long, M., & Wang, J. (2015). Learning Multiple Tasks with Deep Relationship Networks. arXiv Preprint arXiv:1506.02117. Retrieved from http://arxiv.org/abs/1506.02117 ↩

Lu, Y., Kumar, A., Zhai, S., Cheng, Y., Javidi, T., & Feris, R. (2016). Fully-adaptive Feature Sharing in Multi-Task Networks with Applications in Person Attribute Classification. Retrieved from http://arxiv.org/abs/1611.05377 ↩

Misra, I., Shrivastava, A., Gupta, A., & Hebert, M. (2016). Cross-stitch Networks for Multi-task Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2016.433 ↩

Søgaard, A., & Goldberg, Y. (2016). Deep multi-task learning with low level tasks supervised at lower layers. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 231–235. ↩

Hashimoto, K., Xiong, C., Tsuruoka, Y., & Socher, R. (2016). A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks. arXiv Preprint arXiv:1611.01587. Retrieved from http://arxiv.org/abs/1611.01587 ↩

Kendall, A., Gal, Y., & Cipolla, R. (2017). Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. Retrieved from http://arxiv.org/abs/1705.07115 ↩

Yang, Y., & Hospedales, T. (2017). Deep Multi-task Representation Learning: A Tensor Factorisation Approach. In ICLR 2017. https://doi.org/10.1002/joe.20070 ↩

Ruder, S., Bingel, J., Augenstein, I., & Søgaard, A. (2017). Sluice networks: Learning what to share between loosely related tasks. Retrieved from http://arxiv.org/abs/1705.08142 ↩

Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2014). Facial Landmark Detection by Deep Multi-task Learning. In European Conference on Computer Vision (pp. 94–108). https://doi.org/10.1007/978-3-319-10599-4_7 ↩

Liu, X., Gao, J., He, X., Deng, L., Duh, K., & Wang, Y.-Y. (2015). Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval. Naacl-2015, 912–921. ↩

Arık, S. Ö., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., … Shoeybi, M. (2017). Deep Voice: Real-time Neural Text-to-Speech. In ICML 2017. ↩

Ganin, Y., & Lempitsky, V. (2015). Unsupervised Domain Adaptation by Backpropagation. In Proceedings of the 32nd International Conference on Machine Learning. (Vol. 37). ↩

Yu, J., & Jiang, J. (2016). Learning Sentence Embeddings with Auxiliary Tasks for Cross-Domain Sentiment Classification. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP2016), 236–246. Retrieved from http://www.aclweb.org/anthology/D/D16/D16-1023.pdf ↩

Cheng, H., Fang, H., & Ostendorf, M. (2015). Open-Domain Name Error Detection using a Multi-Task RNN. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 737–746). ↩

Caruana, R., & Sa, V. R. de. (1997). Promoting poor features to supervisors: Some inputs work better as outputs. Advances in Neural Information Processing Systems 9: Proceedings of The 1996 Conference, 9, 389. Retrieved from http://scholar.google.com/scholar?start=20&q=author:%22Rich+Caruana%22&hl=en#6 ↩

Rei, M. (2017). Semi-supervised Multitask Learning for Sequence Labeling. In ACL 2017. ↩

Ben-David, S., & Schuller, R. (2003). Exploiting task relatedness for multiple task learning. Learning Theory and Kernel Machines, 567–580. https://doi.org/10.1007/978-3-540-45167-9_41 ↩

Alonso, H. M., & Plank, B. (2017). When is multitask learning effective? Multitask learning for semantic sequence prediction under varying data conditions. In EACL. Retrieved from http://arxiv.org/abs/1612.02251 ↩

Bingel, J., & Søgaard, A. (2017). Identifying beneficial task relations for multi-task learning in deep neural networks. In EACL. Retrieved from http://arxiv.org/abs/1702.08303 ↩

Table of contents:

- What is Transfer Learning?
- Why Transfer Learning Now?
- A Definition of Transfer Learning
- Transfer Learning Scenarios
- Applications of Transfer Learning
- Transfer Learning Methods
- Related Research Areas
- Conclusion

In recent years, we have become increasingly good at training deep neural networks to learn a very accurate mapping from inputs to outputs, whether they are images, sentences, label predictions, etc. from large amounts of labeled data.

What our models still frightfully lack is the ability to generalize to conditions that are different from the ones encountered during training. When is this necessary? Every time you apply your model not to a carefully constructed dataset but to the real world. The real world is messy and contains an infinite number of novel scenarios, many of which your model has not encountered during training and for which it is in turn ill-prepared to make predictions. The ability to transfer knowledge to new conditions is generally known as transfer learning and is what we will discuss in the rest of this post.

Over the course of this blog post, I will first contrast transfer learning with machine learning's most pervasive and successful paradigm, supervised learning. I will then outline reasons why transfer learning warrants our attention. Subsequently, I will give a more technical definition and detail different transfer learning scenarios. I will then provide examples of applications of transfer learning before delving into practical methods that can be used to transfer knowledge. Finally, I will give an overview of related directions and provide an outlook into the future.

In the classic supervised learning scenario of machine learning, if we intend to train a model for some task and domain \(A\), we assume that we are provided with labeled data for the same task and domain. We can see this clearly in Figure 1, where the task and domain of the training and test data of our model \(A\) is the same. We will later define in more detail what exactly a task and a domain are). For the moment, let us assume that a task is the objective our model aims to perform, e.g. recognize objects in images, and a domain is where our data is coming from, e.g. images taken in San Francisco coffee shops.

We can now train a model \(A\) on this dataset and expect it to perform well on unseen data of the same task and domain. On another occasion, when given data for some other task or domain \(B\), we require again labeled data of the same task or domain that we can use to train a new model \(B\) so that we can expect it to perform well on this data.

The traditional supervised learning paradigm breaks down when we do not have sufficient labeled data for the task or domain we care about to train a reliable model.

If we want to train a model to detect pedestrians on night-time images, we could apply a model that has been trained on a similar domain, e.g. on day-time images. In practice, however, we often experience a deterioration or collapse in performance as the model has inherited the bias of its training data and does not know how to generalize to the new domain.

If we want to train a model to perform a new task, such as detecting bicyclists, we cannot even reuse an existing model, as the labels between the tasks differ.

Transfer learning allows us to deal with these scenarios by leveraging the already existing labeled data of some related task or domain. We try to store this knowledge gained in solving the source task in the source domain and apply it to our problem of interest as can be seen in Figure 2.

In practice, we seek to transfer as much knowledge as we can from the source setting to our target task or domain. This knowledge can take on various forms depending on the data: it can pertain to how objects are composed to allow us to more easily identify novel objects; it can be with regard to the general words people use to express their opinions, etc.

Andrew Ng, chief scientist at Baidu and professor at Stanford, said during his widely popular NIPS 2016 tutorial that transfer learning will be -- after supervised learning -- the next driver of ML commercial success.

In particular, he sketched out a chart on a whiteboard that I've sought to replicate as faithfully as possible in Figure 4 below (sorry about the unlabelled axes). According to Andrew Ng, transfer learning will become a key driver of Machine Learning success in industry.

It is indisputable that ML use and success in industry has so far been mostly driven by supervised learning. Fuelled by advances in Deep Learning, more capable computing utilities, and large labeled datasets, supervised learning has been largely responsible for the wave of renewed interest in AI, funding rounds and acquisitions, and in particular the applications of machine learning that we have seen in recent years and that have become part of our daily lives. If we disregard naysayers and heralds of another AI winter and instead trust the prescience of Andrew Ng, this success will likely continue.

It is less clear, however, why transfer learning which has been around for decades and is currently little utilized in industry, will see the explosive growth predicted by Ng. Even more so as transfer learning currently receives relatively little visibility compared to other areas of machine learning such as unsupervised learning and reinforcement learning, which have come to enjoy increasing popularity: Unsupervised learning -- the key ingredient on the quest to General AI according to Yann LeCun as can be seen in Figure 5 -- has seen a resurgence of interest, driven in particular by Generative Adversarial Networks. Reinforcement learning, in turn, spear-headed by Google DeepMind has led to advances in game-playing AI exemplified by the success of AlphaGo and has already seen success in the real world, e.g. by reducing Google's data center cooling bill by 40%. Both of these areas, while promising, will likely only have a comparatively small commercial impact in the foreseeable future and mostly remain within the confines of cutting-edge research papers as they still face many challenges.

What makes transfer learning different? In the following, we will look at the factors that -- in our opinion -- motivate Ng's prognosis and outline the reasons why just now is the time to pay attention to transfer learning.

The current use of machine learning in industry is characterised by a dichotomy:

On the one hand, over the course of the last years, we have obtained the ability to train more and more accurate models. We are now at the stage that for many tasks, state-of-the-art models have reached a level where their performance is so good that it is no longer a hindrance for users. How good? The newest residual networks [^{1}] on ImageNet achieve superhuman performance at recognising objects; Google's Smart Reply [^{2}] automatically handles 10% of all mobile responses; speech recognition error has consistently dropped and is more accurate than typing [^{3}]; we can automatically identify skin cancer as well as dermatologists; Google's NMT system [^{4}] is used in production for more than 10 language pairs; Baidu can generate realistic sounding speech in real-time; the list goes on and on. This level of maturity has allowed the large-scale deployment of these models to millions of users and has enabled widespread adoption.

On the other hand, these successful models are immensely data-hungry and rely on huge amounts of labeled data to achieve their performance. For some tasks and domains, this data is available as it has been painstakingly gathered over many years. In a few cases, it is public, e.g. ImageNet [^{5}], but large amounts of labeled data are usually proprietary or expensive to obtain, as in the case of many speech or MT datasets, as they provide an edge over the competition.

At the same time, when applying a machine learning model in the wild, it is faced with a myriad of conditions which the model has never seen before and does not know how to deal with; each client and every user has their own preferences, possesses or generates data that is different than the data used for training; a model is asked to perform many tasks that are related to but not the same as the task it was trained for. In all of these situations, our current state-of-the-art models, despite exhibiting human-level or even super-human performance on the task and domain they were trained on, suffer a significant loss in performance or even break down completely.

Transfer learning can help us deal with these novel scenarios and is necessary for production-scale use of machine learning that goes beyond tasks and domains were labeled data is plentiful. So far, we have applied our models to the tasks and domains that -- while impactful -- are the low-hanging fruits in terms of data availability. To also serve the long tail of the distribution, we must learn to transfer the knowledge we have acquired to new tasks and domains.

To be able to do this, we need to understand the concepts that transfer learning involves. For this reason, we will give a more technical definition in the following section.

For this definition, we will closely follow the excellent survey by Pan and Yang (2010) [^{6}] with binary document classification as a running example.

Transfer learning involves the concepts of a domain and a task. A domain \(\mathcal{D}\) consists of a feature space \(\mathcal{X}\) and a marginal probability distribution \(P(X)\) over the feature space, where \(X = {x_1, \cdots, x_n} \in \mathcal{X}\). For document classification with a bag-of-words representation, \(\mathcal{X}\) is the space of all document representations, \(x_i\) is the binary feature of the \(i\)-th word and \(X\) is a particular document.

Given a domain, \(\mathcal{D} = \{\mathcal{X},P(X)\}\), a task \(\mathcal{T}\) consists of a label space \(\mathcal{Y}\) and a conditional probability distribution \(P(Y|X)\) that is typically learned from the training data consisting of pairs \(x_i \in X\) and \(y_i \in \mathcal{Y}\). In our document classification example, \(\mathcal{Y}\) is the set of all labels, i.e. *True*, *False* and \(y_i\) is either *True* or *False*.

Given a source domain \(\mathcal{D}_S\), a corresponding source task \(\mathcal{T}_S\), as well as a target domain \(\mathcal{D}_T\) and a target task \(\mathcal{T}_T\), the objective of transfer learning now is to enable us to learn the target conditional probability distribution \(P(Y_T|X_T)\) in \(\mathcal{D}_T\) with the information gained from \(\mathcal{D}_S\) and \(\mathcal{T}_S\) where \(\mathcal{D}_S \neq \mathcal{D}_T\) or \(\mathcal{T}_S \neq \mathcal{T}_T\). In most cases, a limited number of labeled target examples, which is exponentially smaller than the number of labeled source examples are assumed to be available.

As both the domain \(\mathcal{D}\) and the task \(\mathcal{T}\) are defined as tuples, these inequalities give rise to four transfer learning scenarios, which we will discus below.

Given source and target domains \(\mathcal{D}_S\) and \(\mathcal{D}_T\) where \(\mathcal{D} = \{\mathcal{X},P(X)\}\) and source and target tasks \(\mathcal{T}_S\) and \(\mathcal{T}_T\) where \(\mathcal{T} = \{\mathcal{Y}, P(Y|X)\}\) source and target conditions can vary in four ways, which we will illustrate in the following again using our document classification example:

- \(\mathcal{X}_S \neq \mathcal{X}_T\). The feature spaces of the source and target domain are different, e.g. the documents are written in two different languages. In the context of natural language processing, this is generally referred to as cross-lingual adaptation.
- \(P(X_S) \neq P(X_T)\). The marginal probability distributions of source and target domain are different, e.g. the documents discuss different topics. This scenario is generally known as domain adaptation.
- \(\mathcal{Y}_S \neq \mathcal{Y}_T\). The label spaces between the two tasks are different, e.g. documents need to be assigned different labels in the target task. In practice, this scenario usually occurs with scenario 4, as it is extremely rare for two different tasks to have different label spaces, but exactly the same conditional probability distributions.
- \(P(Y_S|X_S) \neq P(Y_T|X_T)\). The conditional probability distributions of the source and target tasks are different, e.g. source and target documents are unbalanced with regard to their classes. This scenario is quite common in practice and approaches such as over-sampling, under-sampling, or SMOTE [
^{7}] are widely used.

After we are now aware of the concepts relevant for transfer learning and the scenarios in which it is applied, we will look to different applications of transfer learning that illustrate some of its potential.

One particular application of transfer learning that I'm very excited about and that I assume we'll see more of in the future is learning from simulations. For many machine learning applications that rely on hardware for interaction, gathering data and training a model in the real world is either expensive, time-consuming, or simply too dangerous. It is thus advisable to gather data in some other, less risky way.

Simulation is the preferred tool for this and is used towards enabling many advanced ML systems in the real world. Learning from a simulation and applying the acquired knowledge to the real world is an instance of transfer learning scenario 2, as the feature spaces between source and target domain are the same (both generally rely on pixels), but the marginal probability distributions between simulation and reality are different, i.e. objects in the simulation and the source *look* different, although this difference diminishes as simulations get more realistic. At the same time, the conditional probability distributions between simulation and real wold might be different as the simulation is not able to fully replicate all reactions in the real world, e.g. a physics engine can not completely mimic the complex interactions of real-world objects.

Learning from simulations has the benefit of making data gathering easy as objects can be easily bounded and analyzed, while simultaneously enabling fast training, as learning can be parallelized across multiple instances. Consequently, it is a prerequisite for large-scale machine learning projects that need to interact with the real world, such as self-driving cars (Figure 6). According to Zhaoyin Jia, Google's self-driving car tech lead, "Simulation is essential if you really want to do a self-driving car". Udacity has open-sourced the simulator it uses for teaching its self-driving car engineer nanodegree, which can be seen in Figure 7 and OpenAI's Universe will potentially allows to train a self-driving car using GTA 5 or other video games.

Another area where learning from simulations is key is robotics: Training models on a real robot is too slow and robots are expensive to train. Learning from a simulation and transferring the knowledge to real-world robot alleviates this problem and has recently been garnering additional interest [^{8}]. An example of a data manipulation task in the real world and in a simulation can be seen in Figure 8.

Finally, another direction where simulation will be an integral part is on the path towards general AI. Training an agent to achieve general artificial intelligence directly in the real world is too costly and hinders learning initially through unnecessary complexity. Rather, learning may be more successful if it is based on a simulated environment such as CommAI-env [^{9}] that is visible in Figure 9.

While learning from simulations is a particular instance of domain adaptation, it is worth outlining some other examples of domain adaptation.

Domain adaptation is a common requirement in vision as often the data where labeled information is easily accessible and the data that we actually care about are different, whether this pertains to identifying bikes as in Figure 10 or some other objects in the wild. Even if the training and the the test data *look* the same, the training data may still contain a bias that is imperceptible to humans but which the model will exploit to overfit on the training data [^{10}].

Another common domain adaptation scenario pertains to adapting to different text types: Standard NLP tools such as part-of-speech taggers or parsers are typically trained on news data such as the Wall Street Journal, which has historically been used to evaluate these models. Models trained on news data, however, have difficulty coping with more novel text forms such as social media messages and the challenges they present.

Even within one domain such as product reviews, people employ different words and phrases to express the same opinion. A model trained on one type of review should thus be able to disentangle the general and domain-specific opinion words that people use in order not to be confused by the shift in domain.

Finally, while the above challenges deal with general text or image types, problems are amplified if we look at domains that pertain to individual or groups of users: Consider the case of automatic speech recognition (ASR). Speech is poised to become the next big platform, with 50% of all our searches predicted to be performed by voice by 2020. Most ASR systems are evaluated traditionally on the Switchboard dataset, which comprises 500 speakers. Most people with a standard accent are thus fortunate, while immigrants, people with non-standard accents, people with a speech impediment, or children have trouble being understood. Now more than ever do we need systems that are able to adapt to individual users and minorities to ensure that everyone's voice is heard.

Finally, learning from one language and applying our knowledge to another language is -- in my opinion -- another killer application of transfer learning, which I have written about before here in the context of cross-lingual embedding models. Reliable cross-lingual adaptation methods would allow us to leverage the vast amounts of labeled data we have in English and apply them to any language, particularly underserved and truly low-resource languages. Given the current state-of-the-art, this still seems utopian, but recent advances such as zero-shot translation [^{11}] promise rapid progress in this area.

While we have so far considered particular applications of transfer learning, we will now look at practical methods and directions in the literature that are used to solve some of the presented challenges.

Transfer learning has a long history of research and techniques exist to tackle each of the four transfer learning scenarios described above. The advent of Deep Learning has led to a range of new transfer learning approaches, some of which we will review in the following. For a survey of earlier methods, refer to [^{6}].

In order to motivate the most common way of transfer learning is currently applied, we must understand what accounts for the outstanding success of large convolutional neural networks on ImageNet [^{12}].

While many details of how these models work still remain a mystery, we are by now aware that lower convolutional layers capture low-level image features, e.g. edges (see Figure 14), while higher convolutional layers capture more and more complex details, such as body parts, faces, and other compositional features.

The final fully-connected layers are generally assumed to capture information that is relevant for solving the respective task, e.g. AlexNet's fully-connected layers would indicate which features are relevant to classify an image into one of 1000 object categories.

However, while knowing that a cat has whiskers, paws, fur, etc. is necessary for identifying an animal as a cat (for an example, see Figure 15), it does not help us with identifying new objects or to solve other common vision tasks such as scene recognition, fine grained recognition, attribute detection and image retrieval.

What can help us, however, are representations that capture general information of how an image is composed and what combinations of edges and shapes it contains. This information is contained in one of the final convolutional layers or early fully-connected layers in large convolutional neural networks trained on ImageNet as we have described above.

For a new task, we can thus simply use the off-the-shelf features of a state-of-the-art CNN pre-trained on ImageNet and train a new model on these extracted features. In practice, we either keep the pre-trained parameters fixed or tune them with a small learning rate in order to ensure that we do not unlearn the previously acquired knowledge. This simple approach has been shown to achieve astounding results on an array of vision tasks [^{13}] as well as tasks that rely on visual input such as image captioning. A model trained on ImageNet seems to capture details about the way animals and objects are structured and composed that is generally relevant when dealing with images. As such, the ImageNet task seems to be a good proxy for general computer vision problems, as the same knowledge that is required to excel in it is also relevant for many other tasks.

A similar assumption is used to motivate generative models: When training generative models, we assume that the ability to generate realistic images requires an understanding of the underlying structure of images, which in turn can be applied to many other tasks. This assumption itself relies on the premise that all images lie on a low-dimensional manifold, i.e. that there is some underlying structure to images that can be extracted by a model. Recent advances in generating photorealistic images with Generative Adversarial Networks [^{14}] indicate that such a structure might indeed exist, as evidenced by the model's ability to show realistic transitions between points in the bedroom image space in Figure 16.

Off-the-shelf CNN features have seen unparalleled results in vision, but the question remains if this success can be replicated in other disciplines using other types of data, such as languages. Currently, there are no off-the-shelf features that achieve results for natural language processing that are as astounding as their vision equivalent. Why is that? Do such features exist at all or -- if not -- why is vision more conducive to this form of transfer than language?

The output of lower-level tasks such as part-of-speech tagging or chunking can be likened as off-the-shelf features, but these do not capture more fine-grained rules of language use beyond syntax and are not helpful for all tasks. As we have seen, the existence of generalizable off-the-shelf features seems to be intertwined with the existence of a task that can be seen as a prototype for many tasks in the field. In vision, object recognition occupies such a role. In language, the closest analogue might be language modelling: In order to predict the next word or sentence given a sequence of words, a model needs to possess knowledge of how language is structured, needs to understand what words likely are related to and likely follow each other, needs to model long-term dependencies, etc.

While state-of-the-art language models increasingly approach human levels [^{15}], their features are only of limited use. At the same time, advances in language modelling have led to positive results for other tasks: Pre-training a model with a language model objective improves performance [^{16}]. In addition, word embeddings pre-trained on a large unlabelled corpus with an approximated language modelling objective have become pervasive [^{17}]. While they are not as effective as off-the-shelf features in vision, they still provide sizeable gains [^{18}] and can be seen a simple form of transfer of general domain knowledge derived from a large unlabelled corpus.

While a general proxy task seems currently out of reach in natural language processing, auxiliary tasks can take the form of local proxies. Whether through multi-task objectives [^{19}] or synthetic task objectives [^{20}, ^{21}], they can be used to inject additional relevant knowledge into the model.

Using pre-trained features is currently the most straightforward and most commonly used way to perform transfer learning. However, it is by far not the only one.

Pre-trained features are in practice mostly used for adaptation scenario 3 where we want to adapt to a new task. For the other scenarios, another way to transfer knowledge enabled by Deep Learning is to learn representations that do not change based on our domain. This approach is conceptually very similar to the way we have been thinking about using pre-trained CNN features: Both encode general knowledge about our domain. However, creating representations that do not change based on the domain is a lot less expensive and more feasible for non-vision tasks than generating representations that are useful for all tasks. ImageNet has taken years and thousands of hours to create, while we typically only need unlabelled data of each domain for creating domain-invariant representations. These representations are generally learned using stacked denoising autoencoders and have seen success in natural language processing [^{22}, ^{23}] as well as in vision [^{24}].

In order to improve the transferability of the learned representations from the source to the target domain, we would like the representations between the two domains to be as similar as possible so that the model does not take into account domain-specific characteristics that may hinder transfer but the commonalities between the domains.

Rather than just letting our autoencoder learn some representation, we can thus actively encourage the representations of both domains to be more similar to each other. We can apply this as a pre-processing step directly to the representations of our data [^{25}, ^{26}] and can then use the new representations for training. We can also encourage the representations of the domains in our model to be more similar to each other [^{27}, ^{28}].

Another way to ensure similarity between the representations of both domains that has recently become more popular is to add another objective to an existing model that encourages it to confuse the two domains [^{29}, ^{30}]. This domain confusion loss is a regular classification loss where the model tries to predict the domain of the input example. The difference to a regular loss, however, is that gradients that flow from the loss to the rest of the network are reversed, as can be seen in Figure 17.

Instead of learning to minimize the error of the domain classification loss, the gradient reversal layer causes the model to maximize the error. In practice, this means that the model learns representations that allow it to minimize its original objective, while not allowing it to differentiate between the two domains, which is beneficial for knowledge transfer. While a model trained only with the regular objective is shown in Figure 18 to be clearly able to separate domains based on its learned representation, a model whose objective has been augmented with the domain confusion term is unable to do so.

While this post is about transfer learning, transfer learning is by far not the only area of machine learning that seeks to leverage limited amounts of data, use learned knowledge for new endeavours, and enable models to generalize better to new settings. In the following, we will thus introduce other directions that are related or complementary to the goals of transfer learning.

Transfer learning seeks to leverage unlabelled data in the target task or domain to the most effect. This is also the maxim of semi-supervised learning, which follows the classical machine learning setup but assumes only a limited amount of labeled samples for training. Insofar, semi-supervised domain adaptation is essentially semi-supervised learning under domain shift. Many lessons and insights from semi-supervised learning are thus equally applicable and relevant for transfer learning. Refer to [^{31}] for a great survey on semi-supervised learning.

Another direction that is related to transfer learning and semi-supervised learning is to enable models to work better with limited amounts of data.

This can be done in several ways: One can leverage unsupervised or semi-supervised learning to extract information from unlabelled data thereby reducing the reliance on labelled samples; one can give the model access to other features inherent in the data while reducing its tendency to overfit via regularization; finally, one can leverage data that so far remains neglected or rests in non-obvious places.

Such fortuitous data [^{32}] may be created as a side effect of user-generated content, such as hyperlinks that can be used to improve named entity and part-of-speech taggers; it may come as a by-product of annotation, e.g. annotator disagreement that may improve tagging or parsing; or it may be derived from user behaviour such as eye tracking or keystroke dynamics, which can inform NLP tasks. While such data has only been exploited in limited ways, such examples encourage us to look for data in unexpected places and to investigate new ways of retrieving data.

Related to this is also the direction of making models generalize better. In order to achieve this, we must first better understand the behaviour and intricacies of large neural networks and investigate why and how they generalize. Recent work has taken promising steps towards this end [^{33}], but many questions are still left unanswered.

While improving our models' generalization ability goes a long way, we might generalize well to similar instances but still fail catastrophically on unexpected or atypical inputs. Therefore, a key complementary objective is to make our models more robust. This direction has seen increasing interest recently fuelled by advances in adversarial learning and recent approaches have investigated many ways of how models can be made more robust to worst-case or adversarial examples in different settings [^{34}, ^{35}].

In transfer learning, we mainly care about doing well on our target task or domain. In multi-task learning, in contrast, the objective is to do well on all available tasks. Alternatively, we can also use the knowledge acquired by learning from related tasks to do well on a target. Crucially, in contrast to transfer learning, some labeled data is usually assumed for each task. In addition, models are trained jointly on source and target task data, which is not the case for all transfer learning scenarios. However, even if target data is not available during training, insights about tasks that are beneficial for multi-task learning [^{19}] can still inform transfer learning decisions.

For a more thorough overview of multi-task learning, particularly as applied to deep neural networks, have a look at my other blog post here.

While multi-task learning allows us to retain the knowledge across many tasks without suffering a performance penalty on our source tasks, this is only possible if all tasks are present at training time. For each new task, we would generally need to retrain our model on all tasks again.

In the real world, however, we would like an agent to be able to deal with tasks that gradually become more complex by leveraging its past experience. To this end, we need to enable a model to learn continuously without forgetting. This area of machine learning is known as learning to learn [^{36}], meta-learning, life-long learning, or continuous learning. It has seen some recent developments in the context of RL [^{37}, ^{38}, ^{39}] most notably by Google DeepMind on their quest towards general learning agents and is also being applied to sequence-to-sequence models [^{40}].

Finally, if we take transfer learning to the extreme and aim to learn from only a few, one or even zero instances of a class, we arrive at few-shot, one-shot, and zero-shot learning respectively. Enabling models to perform one-shot and zero-shot learning is admittedly among the hardest problems in machine learning. At the same time, it is something that comes naturally to us humans: Toddlers only need to be told once what a dog is in order to be able to identify any other dog, while adults can understand the essence of an object just by reading about it in context, without ever having encountered it before.

Recent advances in one-shot learning have leveraged the insight that models need to be trained explicitly to perform one-shot learning in order to achieve good performance at test time [^{41}, ^{42}], while the more realistic generalized zero-shot learning setting where training classes are present at test time has garnered attention lately [^{43}].

In summary, there are many exciting research directions that transfer learning offers and -- in particular -- many applications that are in need of models that can transfer knowledge to new tasks and adapt to new domains. I hope that I was able to provide you with an overview of transfer learning in this blog post and was able to pique your interest.

Some of the statements in this blog post are deliberately phrased *slightly* controversial. Let me know your thoughts about any contentious issues and any errors that I undoubtedly made in writing this post in the comments below.

Note: Title image is credited to [^{44}].

Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. (2016). Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv preprint arXiv:1602.07261. ↩

Kannan, A., Kurach, K., Ravi, S., Kaufmann, T., Tomkins, A., Miklos, B., … Ramavajjala, V. (2016). Smart Reply: Automated Response Suggestion for Email. In KDD 2016. http://doi.org/10.475/123 ↩

Ruan, S., Wobbrock, J. O., Liou, K., Ng, A., & Landay, J. (2016). Speech Is 3x Faster than Typing for English and Mandarin Text Entry on Mobile Devices. arXiv preprint arXiv:1608.07323. ↩

Wu, Y., Schuster, M., Chen, Z., Le, Q. V, Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144. ↩

Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Fei-fei, L. (2009). ImageNet : A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition. ↩

Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359. ↩

Chawla, N. V, Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE : Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. ↩

Rusu, A. A., Vecerik, M., Rothörl, T., Heess, N., Pascanu, R., & Hadsell, R. (2016). Sim-to-Real Robot Learning from Pixels with Progressive Nets. arXiv Preprint arXiv:1610.04286. Retrieved from http://arxiv.org/abs/1610.04286 ↩

Mikolov, T., Joulin, A., & Baroni, M. (2015). A Roadmap towards Machine Intelligence. arXiv Preprint arXiv:1511.08130. Retrieved from http://arxiv.org/abs/1511.08130 ↩

Torralba, A., & Efros, A. A. (2011). Unbiased Look at Dataset Bias. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). ↩

Johnson, M., Schuster, M., Le, Q. V, Krikun, M., Wu, Y., Chen, Z., … Dean, J. (2016). Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. arXiv Preprint arXiv:1611.0455. ↩

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances In Neural Information Processing Systems, 1–9. ↩

Razavian, A. S., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 512–519. ↩

Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR. Retrieved from http://arxiv.org/abs/1511.06434 ↩

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. arXiv Preprint arXiv:1602.02410. Retrieved from http://arxiv.org/abs/1602.02410 ↩

Ramachandran, P., Liu, P. J., & Le, Q. V. (2016). Unsupervised Pretrainig for Sequence to Sequence Learning. arXiv Preprint arXiv:1611.02683. ↩

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS. ↩

Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1746–1751. Retrieved from http://arxiv.org/abs/1408.5882 ↩

Bingel, J., & Søgaard, A. (2017). Identifying beneficial task relations for multi-task learning in deep neural networks. In EACL. Retrieved from http://arxiv.org/abs/1702.08303 ↩

Plank, B., Søgaard, A., & Goldberg, Y. (2016). Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. ↩

Yu, J., & Jiang, J. (2016). Learning Sentence Embeddings with Auxiliary Tasks for Cross-Domain Sentiment Classification. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP2016), 236–246. Retrieved from http://www.aclweb.org/anthology/D/D16/D16-1023.pdf ↩

Glorot, X., Bordes, A., & Bengio, Y. (2011). Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach. Proceedings of the 28th International Conference on Machine Learning, 513–520. Retrieved from http://www.icml-2011.org/papers/342_icmlpaper.pdf ↩

Chen, M., Xu, Z., Weinberger, K. Q., & Sha, F. (2012). Marginalized Denoising Autoencoders for Domain Adaptation. Proceedings of the 29th International Conference on Machine Learning (ICML-12), 767--774. http://doi.org/10.1007/s11222-007-9033-z ↩

Zhuang, F., Cheng, X., Luo, P., Pan, S. J., & He, Q. (2015). Supervised Representation Learning: Transfer Learning with Deep Autoencoders. IJCAI International Joint Conference on Artificial Intelligence, 4119–4125. ↩

Daumé III, H. (2007). Frustratingly Easy Domain Adaptation. Association for Computational Linguistic (ACL), (June), 256–263. http://doi.org/10.1.1.110.2062 ↩

Sun, B., Feng, J., & Saenko, K. (2016). Return of Frustratingly Easy Domain Adaptation. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16). Retrieved from http://arxiv.org/abs/1511.05547 ↩

Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., & Erhan, D. (2016). Domain Separation Networks. NIPS. ↩

Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., & Darrell, T. (2014). Deep Domain Confusion: Maximizing for Domain Invariance. CoRR. Retrieved from https://arxiv.org/pdf/1412.3474.pdf ↩

Ganin, Y., & Lempitsky, V. (2015). Unsupervised Domain Adaptation by Backpropagation. In Proceedings of the 32nd International Conference on Machine Learning. (Vol. 37). ↩

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., … Lempitsky, V. (2016). Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research, 17, 1–35. http://www.jmlr.org/papers/volume17/15-239/source/15-239.pdf ↩

Zhu, X. (2005). Semi-Supervised Learning Literature Survey. ↩

Plank, B. (2016). What to do about non-standard (or non-canonical) language in NLP. KONVENS 2016. Retrieved from https://arxiv.org/pdf/1608.07836.pdf ↩

Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. ICLR 2017. ↩

Kurakin, A., Goodfellow, I., & Bengio, S. (2017). Adversarial examples in the physical world. In ICLR 2017. Retrieved from http://arxiv.org/abs/1607.02533 ↩

Huang, S., Papernot, N., Goodfellow, I., Duan, Y., & Abbeel, P. (2017). Adversarial Attacks on Neural Network Policies. In Workshop Track - ICLR 2017. ↩

Thrun, S., & Pratt, L. (1998). Learning to learn. Springer Science & Business Media. ↩

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., … Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. PNAS. ↩

Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., ... Deepmind, G. (2016). Progressive Neural Networks. arXiv preprint arXiv:1606.04671. ↩

Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A. A., ... Wierstra, D. (2017). PathNet: Evolution Channels Gradient Descent in Super Neural Networks. In arXiv preprint arXiv:1701.08734. ↩

Kaiser, Ł., Nachum, O., Roy, A., & Bengio, S. (2017). Learning to Remember Rare Events. In ICLR 2017. ↩

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching Networks for One Shot Learning. NIPS 2016. Retrieved from http://arxiv.org/abs/1606.04080 ↩

Ravi, S., & Larochelle, H. (2017). Optimization as a Model for Few-Shot Learning. In ICLR 2017. ↩

Xian, Y., Schiele, B., Akata, Z., Campus, S. I., & Machine, A. (2017). Zero-Shot Learning - The Good, the Bad and the Ugly. In CVPR 2017. ↩

Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial Discriminative Domain Adaptation. ↩

I attended NIPS 2016 in Barcelona from Monday, December 5 to Saturday, December 10. The full conference program is available here. In the following, I will share some of my highlights.

The Conference on Neural Information Processing Systems (NIPS) is (besides

]]>*This post originally appeared at the AYLIEN blog.*

I attended NIPS 2016 in Barcelona from Monday, December 5 to Saturday, December 10. The full conference program is available here. In the following, I will share some of my highlights.

The Conference on Neural Information Processing Systems (NIPS) is (besides ICML) one of the two top conferences in machine learning. It took place for the first time in 1987 and is held every December, historically in close proximity to a ski resort. This year, in slight juxtaposition, it took place in sunny Barcelona.

Machine Learning seems to become more pervasive every month. However, it is still sometimes hard to keep track of the actual extent of this development. One of the most accurate barometers for this evolution is the growth of NIPS itself. The number of attendees skyrocketed at this year’s conference growing by over 50% year-over-year.

Unsurprisingly, Deep Learning (DL) was by far the most popular research topic, with about every fourth of more than 2,500 submitted papers (and 568 accepted papers) dealing with deep neural networks.

On the other hand, the distribution of research paper topics has quite a long tail and reflects the diversity of topics at the conference that span everything from theory to applications, from robotics to neuroscience, and from healthcare to self-driving cars.

One of the hottest developments within Deep Learning was Generative Adversarial Networks (GANs). The minimax game playing networks have by now won the favor of many luminaries in the field. Yann LeCun hails them as the most exciting development in ML in recent years. The organizers and attendees of NIPS seem to side with him: NIPS featured a tutorial by Ian Goodfellow about his brainchild, which led to a packed main conference hall.

Though a fairly recent development, there are many cool extensions of GANs among the conference papers:

- Reed et al. propose a model that allows you to specify not only what you want to draw (e.g. a bird) but also where to put it in an image.
- Chen et al. disentangle factors of variation in GANs by representing them with latent codes. The resulting models allow you to adjust e.g. the type of a digit, its breadth and width, etc.

In spite of their popularity, we know alarmingly little about what makes GANs so capable of generating realistic-looking images. In addition, making them work in practice is an arduous endeavour and a lot of (undocumented) hacks are necessary to achieve the best performance. Soumith Chintala presents a collection of these hacks in his "How to train your GAN" talk at the Adversarial Training workshop.

Yann LeCun muses in his keynote that the development of GANs parallels the history of neural networks themselves: They were poorly understood and hard to get to work in the beginning and only took off once researchers figured out the right tricks and learned how to make them work. At this point, it seems unlikely that GANs will experience a winter anytime soon; the research community is still at the beginning in learning how to make the best use of them and it will be exciting to see what progress we can make in the coming years.

On the other hand, the success of GANs so far has been limited mostly to Computer Vision due to their difficulty in modelling discrete rather than continuous data. The Adversarial Training workshop showcased some promising work in this direction (see e.g. my colleague John Glover’s paper on modeling documents, this and this paper on generating text, and this paper on adversarial evaluation of dialogue models). We will see if 2017 will be the year in which GANs break through in NLP.

Andrew Ng gave one of the best tutorials of the conference with his take on building AI applications using Deep Learning. Drawing from his experience of managing the 1,300 people AI team at Baidu and hundreds of applied AI projects and equipped solely with two whiteboards, he shared many insights about how to build and deploy AI applications in production.

Besides better hardware, Ng attributes the success of Deep Learning to two factors: In contrast to traditional methods, deep NNs are able to learn more effectively from large amounts of data. Secondly, end-to-end (supervised) Deep Learning allows us to learn to map from inputs directly to outputs.

While this approach to training chatbots or self-driving cars is sufficient to write innovative research papers, Ng emphasized end-to-end DL is often not production-ready: A chatbot that maps from text directly to a response is not able to have a coherent conversation or fulfill a request, while mapping from an image directly to a steering command might have literally fatal side effects if the model has not encountered the corresponding part of the input space before. Rather, for a production model, we still want to have intermediate steps: For a chatbot, we prefer to have an inference engine that generates a response, while in a self-driving car, DL is used to identify obstacles, while the steering is performed by a traditional planning algorithm.

Ng also shared that the most common mistake he sees in project teams is that they track the wrong metrics: In an applied machine learning project, the only relevant metrics are the training error, the development error, and the test error. These metrics alone enable the project team to know what steps to take, as he demonstrated in the diagram below:

A key facilitator of the recent success of ML have been the advances in hardware that allowed faster computation and storage. Given that Moore's Law will reach its limits sooner or later, one might reason that also the rise of ML might plateau. Ng, however, argued that the commitment by leading hardware manufacturers such as NVIDIA and Intel and the ensuing performance improvements to ML hardware would fuel further growth.

Among ML research areas, supervised learning is the undisputed driver of the recent success of ML and will likely continue to drive it for the foreseeable future. In second place, Ng saw neither unsupervised learning nor reinforcement learning, but transfer learning. We at AYLIEN are bullish on transfer learning for NLP and think that it has massive potential.

The conference also featured a symposium dedicated to Recurrent Neural Networks (RNNs). The symposium coincided with the 20 year anniversary of LSTM...

...being rejected from NIPS 1996. The fact that papers that *do not* use LSTMs have been rare in the most recent NLP conferences (see my EMNLP 2016 blog post) is a testament to the perseverance of the authors of the original paper, Sepp Hochreiter and Jürgen Schmidhuber.

At NIPS, we had several papers that sought to improve RNNs in different ways:

- Ba et al. and Neil et al. enable RNNs to handle different time scales using slow weights and a phased variant of the LSTM respectively.
- Fraccaro et al. model uncertainty.

Other improvements apply to Deep Learning in general:

- Salimans and Kingma propose Weight Normalisation to accelerate training that can be applied in two lines of Python code.
- Li et al. propose a multinomial variant of dropout that sets neurons to zero depending on the data distribution.

The Neural Abstract Machines & Program Induction (NAMPI) workshop also featured several speakers talking about RNNs:

- Alex Graves focused on his recent work on Adaptive Computation Time (ACT) for RNNs that allows to decouple the processing time from the sequence length. He showed that a word-level language model with ACT could reach state-of-the-art with fewer computations.
- Edward Grefenstette outlined several limitations and potential future research directions in the context of RNNs in his talk.

While Deep Learning is a fairly recent development, the conference featured also several improvements to algorithms that have been around for decades:

- Ge et al. show in their best paper that the non-convex objective for matrix completion has no spurious local minima, i.e. every local minimum is a global minimum.
- Bachem et al. present a method that guarantees accurate and fast seedings for large-scale k-means++ clustering. The presentation was one of the most polished ones of the conference and the code is open-source and can be installed via pip.
- Ashtiani et al. show that we can make NP-hard k-means clustering problems solvable by allowing the model to pose queries for a few examples to a domain expert.

Reinforcement Learning (RL) was another much-discussed topic at NIPS with an excellent tutorial by Pieter Abbeel and John Schulman. John Schulman also gave some practical advice for getting started with RL.

One of the best papers of the conference introduces Value Iteration Networks, which learn to plan by providing a differentiable approximation to a classic planning algorithm via a CNN. This paper was another cool example of one of the major benefits of deep neural networks: They allow us to learn increasingly complex behaviour as long as we can represent it in a differentiable way. I hope to see more approaches in the future that integrate classic algorithms to enhance the capabilities of a neural network.

During the week of the conference, several research environments for RL were simultaneously released, among them OpenAI’s Universe, Deep Mind Lab, and FAIR’s Torchcraft. These will likely be a key driver in future RL research and should open up new research opportunities.

Another topic that came up in several discussions over the course of the conference was Learning-to-learn or Meta-learning:

- Andrychowicz et al. learn an optimizer in a paper with the ingenious title "Learning to learn by gradient descent by gradient descent".
- Vinyals et al. learn how to one shot-learn in a paper that frames one-shot learning in the sequence-to-sequence framework and has inspired new approaches for one-shot learning.

Most of the existing papers on meta-learning demonstrate that wherever you are doing something that gives you gradients, you can optimize them using another algorithm via gradient descent. Prepare for a surge of “Meta-learning for X” and “(Meta-)+learning” papers in 2017. It’s LSTMs all the way down!

Meta-learning was also one of the key talking points at the RNN symposium. Jürgen Schmidhuber argued that a true meta-learner would be able to learn in the space of all programs and would have the ability to modify itself and elaborated on these ideas at his talk at the NAMPI workshop. Ilya Sutskever remarked that we currently have no good meta-learning models. However, there is hope as the plethora of new research environments should also bring progress in this area.

Learning how to learn also plays a role in the pursuit of the elusive goal of attaining General Artificial Intelligence, which was a topic in several keynotes. Yann LeCun argued that in order to achieve General AI, machines need to learn common sense. While common sense is often vaguely mentioned in research papers, Yann LeCun gave a succinct explanation of what common sense is: "Predicting any part of the past, present or future percepts from whatever information is available." He called this predictive learning, but notes that this is really unsupervised learning.

His talk also marked the appearance of a controversial and often tongue-in-cheek copied image of a cake, which he used to demonstrate that unsupervised learning is the most challenging task where we should concentrate our efforts, while RL is only the cherry on the icing of the cake.

Drew Purves focused on the bilateral relationship between the environment and AI in what was probably the most aesthetically pleasing keynote of the conference (just look at those illustrations!).

He emphasized that while simulations of ecological tasks in naturalistic environments could be an important test bed for General AI, General AI is needed to maintain the biosphere in a state that will allow the continued existence of our civilization.

While it is frequently — and incorrectly — claimed that neural networks work so well because they emulate the brain’s behaviour, Saket Navlakha argued during his keynote that we can still learn a great deal from the engineering principles of the brain. For instance, rather than pre-allocating a large number of neurons, the brain generates 1000s of synapses per minutes until its second year. Afterwards, until adolescence, the number of synapses is pruned and decreases by ~50%.

It will be interesting to see how neuroscience can help us to advance our field further.

In the context of the Machine Intelligence workshop, another environment was introduced in the form of FAIR’s CommAI-env that allows to train agents through interaction with a teacher. During the panel discussion, the ability to learn hierarchical representations and to identify patterns was emphasized. However, although the field is making rapid progress on standard tasks such as object recognition, it is unclear if the focus on such specific tasks brings us indeed closer to General AI.

While NLP is more of a niche topic at NIPS, there were a few papers with improvements relevant to NLP:

- He et al. propose a dual learning framework for MT that has two agents translating in opposite directions teaching each other via reinforcement learning.
- Sokolov et al. explore how to use structured prediction under bandit feedback.
- Huang et al. extend Word Mover’s Distance, an unsupervised document similarity metric to the supervised setting.
- Lee et al. model the helpfulness of reviews by taking into account position and presentation biases.

Finally, a workshop on learning methods for dialogue explored how end-to-end DL, linguistics and ML methods can be used to create dialogue agents.

Jürgen Schmidhuber, the father of the LSTM was not only present on several panels, but did his best to remind everyone that whatever your idea, he had had a similar idea two decades ago and you should better cite him lest he interrupt your tutorial.

NIPS2016 Day 1: Poor @Goodfellow_Ian gets Schmidhuber'ed during educational GAN Tutorial session. pic.twitter.com/IeQzKcJYiv

— hardmaru (@hardmaru) 5. Dezember 2016

Boston Robotics’ Spot proved that — even though everyone is excited by learning and learning-to-learn — traditional planning algorithms are enough to win the admiration of a hall full of learning enthusiasts.

Apple, one of the most secretive companies in the world, has decided to be more open, to publish, and to engage with academia. This can only be good for the community. I'm particularly looking forward to more apple research papers.

Uber announced their acquisition of Cambridge-based AI startup Geometric Intelligence and threw one of the most popular parties of NIPS.

Talking about startups, the "launch" of Rocket AI and their patented Temporally Recurrent Optimal Learning had some people fooled (note the acronyms in the below tweets). Riva-Melissa Tez finally cleared up the confusion.

#rocketai just drove me home. the team is just mind-blowing. so excited about Temporally Recurrent Optimal Learning, the next GAN!

— Soumith Chintala (@soumithchintala) 11. Dezember 2016

#rocketai definitely has the most popular Jacobian-Optimized Kernel Expansion of NIPS 2016

— Ian Goodfellow (@goodfellow_ian) 10. Dezember 2016

These were my impressions from NIPS 2016. I had a blast and hope to be back in 2017!

]]>In past blog posts, we discussed different models, objective functions, and hyperparameter choices that allow us to learn accurate word embeddings. However, these models are generally restricted to capture representations

]]>Note: If you are looking for a survey paper, this blog post is also available as an article on arXiv.

In past blog posts, we discussed different models, objective functions, and hyperparameter choices that allow us to learn accurate word embeddings. However, these models are generally restricted to capture representations of words in the language they were trained on. The availability of resources, training data, and benchmarks in English leads to a disproportionate focus on the English language and a negligence of the plethora of other languages that are spoken around the world.

In our globalised society, where national borders increasingly blur, where the Internet gives everyone equal access to information, it is thus imperative that we do not only seek to eliminate bias pertaining to gender or race inherent in our representations, but also aim to address our bias towards *language*.

To remedy this and level the linguistic playing field, we would like to leverage our existing knowledge in English to equip our models with the capability to process other languages.

Perfect machine translation (MT) would allow this. However, we do not need to actually translate examples, as long as we are able to project examples into a common subspace such as the one in Figure 1.

Ultimately, our goal is to learn a shared embedding space between words in *all* languages. Equipped with such a vector space, we are able to train our models on data in any language. By projecting examples available in one language into this space, our model simultaneously obtains the capability to perform predictions in all other languages (we are glossing over some considerations here; for these, refer to this section). This is the promise of cross-lingual embeddings.

Over the course of this blog post, I will give an overview of models and algorithms that have been used to come closer to this elusive goal of capturing the relations between words in multiple languages in a common embedding space.

Note: While neural MT approaches *implicitly* learn a shared cross-lingual embedding space by optimizing for the MT objective, we will focus on models that *explicitly* learn cross-lingual word representations throughout this blog post. These methods generally do so at a much lower cost than MT and can be considered to be to MT what word embedding models (word2vec, GloVe, etc.) are to language modelling.

In recent years, various models for learning cross-lingual representations have been proposed. In the following, we will order them by the type of approach that they employ.

Note that while the nature of the parallel data used is equally discriminatory and has been shown to account for inter-model performance differences [^{1}], we consider the type of approach more conducive to understanding the assumptions a model makes and -- consequently -- its advantages and deficiencies.

Cross-lingual embedding models generally use four different approaches:

**Monolingual mapping**: These models initially train monolingual word embeddings on large monolingual corpora. They then learn a linear mapping between monolingual representations in different languages to enable them to map unknown words from the source language to the target language.**Pseudo-cross-lingual**: These approaches create a pseudo-cross-lingual corpus by mixing contexts of different languages. They then train an off-the-shelf word embedding model on the created corpus. The intuition is that the cross-lingual contexts allow the learned representations to capture cross-lingual relations.**Cross-lingual training**: These models train their embeddings on a parallel corpus and optimize a cross-lingual constraint between embeddings of different languages that encourages embeddings of similar words to be close to each other in a shared vector space.**Joint optimization**: These approaches train their models on parallel (and optionally monolingual data). They jointly optimise a combination of monolingual and cross-lingual losses.

In terms of parallel data, methods may use different supervision signals that depend on the type of data used. These are, from most to least expensive:

**Word-aligned data**: A parallel corpus with word alignments that is commonly used for machine translation; this is the most expensive type of parallel data to use.**Sentence-aligned data**: A parallel corpus without word alignments. If not otherwise specified, the model uses the Europarl corpus consisting of sentence-aligned text from the proceedings of the European parliament that is generally used for training Statistical Machine Translation models.**Document-aligned data**: A corpus containing documents in different languages. The documents can be topic-aligned (e.g. Wikipedia) or label/class-aligned (e.g. sentiment analysis and multi-class classification datasets).**Lexicon**: A bilingual or cross-lingual dictionary with pairs of translations between words in different languages.**No parallel data**: No parallel data whatsoever. Learning cross-lingual representations from only monolingual resources would enable zero-shot learning across languages.

To make the distinctions clearer, we provide the following table, which serves equally as the table of contents and a springboard to delve deeper into the different cross-lingual models:

After the discussion of cross-lingual embedding models, we will additionally look into how to incorporate visual information into word representations, discuss the challenges that still remain in learning cross-lingual representations, and finally summarize which models perform best and how to evaluate them.

Methods that employ monolingual mapping train monolingual word representations independently on large monolingual corpora. They then seek to learn a transformation matrix that maps representations in one language to the representations of the other language. They usually employ a set of source word-target word pairs that are translations of each other, which are used as anchor words for learning the mapping.

Note that all of the following methods presuppose that monolingual embedding spaces have already been trained. If not stated otherwise, these embedding spaces have been learned using the word2vec variants, skip-gram with negative sampling (SGNS) or continuous bag-of-words (CBOW) on large monolingual corpora.

Mikolov et al. have popularised the notion that vector spaces can encode meaningful relations between words. In addition, they notice that the geometric relations that hold between words are similar across languages [^{2}], e.g. numbers and animals in English show a similar geometric constellation as their Spanish counterparts in Figure 2.

This suggests that it might be possible to transform one language's vector space into the space of another simply by utilising a linear projection with a transformation matrix \(W\).

In order to achieve this, they translate the 5,000 most frequent words from the source language and use these 5,000 translations pairs as bilingual dictionary. They then learn \(W\) using stochastic gradient descent by minimising the distance between the previously learned monolingual representations of the source word \(w_i\) that is transformed using \(W\) and its translation \(z_i\) in the bilingual dictionary:

\(\min\limits_W \sum\limits^n_{i=1} \|Wx_i - z_i\|^2 \).

Faruqui and Dyer [^{3}] propose to use another technique to learn the linear mapping. They use canonical correlation analysis (CCA) to project words from two languages into a shared embedding space. Different to linear projection, CCA learns a transformation matrix for every language, as can be seen in Figure 3, where the transformation matrix \(V\) is used to project word representations from the embedding space \(\Sigma\) to a new space \(\Sigma^\ast\), while \(W\) transforms words from \(\Omega\) to \(\Omega^\ast\). Note that \(\Sigma^\ast\) and \(\Omega^\ast\) can be seen as the same shared embedding space.

Similar to linear projection, CCA also requires a number of translation pairs in \(\Sigma'\) and \(\Omega'\) whose correlation can be maximised. Faruqui and Dyer obtain these pairs by selecting for each source word the target word to which it has been aligned most often in a parallel corpus. Alternatively, they could have also used a bilingual dictionary.

As CCA sorts the correlation vectors in \(V\) and \(W\) in descending order, Faruqui and Dyer perform experiments using only the top \(k\) correlated projection vectors and find that using the \(80\) % projection vectors with the highest correlation generally yields the highest performance.

Interestingly, they find that using multilingual projection helps to separate synonyms and antonyms in the source language, as can be seen in Figure 4, where the unprotected antonyms of "beautiful" are in two clusters in the top, whereas the CCA-projected vectors of the synonyms and antonyms form two distinct clusters in the bottom.

Xing et al. [^{33}] notice inconsistencies in the linear projection method by Mikolov et al. (2013), which they set out to resolve. Recall that Mikolov et al. initially learn monolingual word embeddings. For this, they use the skip-gram objective, which is the following:

\(\dfrac{1}{N} \sum\limits_{i=1}^N \sum\limits_{-C \leq j \leq C, j \neq 0} \text{log} P(w_{i+j} \:|\: w_i) \)

where \(C\) is the context length and \(P(w_{i+j} \:|\: w_i)\) is computed using the softmax:

\(P(w_{i+j} \:|\: w_i) = \dfrac{\text{exp}(c_{w_{i+j}}^T c_{w_i})}{\sum_w \text{exp}(c_w^T c_{w_i})}\).

They then learn a linear transformation between the two monolingual vector spaces with:

\(\text{min} \sum\limits_i \|Wx_i - z_i\|^2 \)

where \(W\) is the projection matrix that should be learned and \(x_i\) and \(z_i\) are word vectors in the source and target language respectively that are similar in meaning.

Xing et al. argue that there is a mismatch between the objective function used to learn word representations (maximum likelihood based on inner product), the distance measure for word vectors (cosine similarity), and the objective function used to learn the linear transformation (mean squared error), which may lead to degradation in performance.

They subsequently propose a method to resolve each of these inconsistencies: In order to fix the mismatch between the inner product similarity measure \(c_w^T c_{w'}\) during training and the cosine similarity measure \(\dfrac{c_w^T c_w'}{\|c_w\| \|c_{w'}\|}\) for testing, the inner product could also be used for testing. Cosine similarity, however, is used conventionally as an evaluation measure in NLP and generally performs better than the inner product. For this reason, they propose to normalise the word vectors to be unit length during training, which makes the inner product the same as cosine similarity and places all word vectors on a hypersphere as a side-effect, as can be seen in Figure 5.

They resolve the inconsistency between the cosine similarity measure now used in training and the mean squared error employed for learning the transformation by replacing the mean squared error with cosine similarity for learning the mapping, which yields:

\(\max\limits_W \sum\limits_i (Wx_i)^T z_i \).

Finally, in order to also normalise the projected vector \(Wx_i\) to be unit length, they constrain \(W\) to be an orthogonal matrix by solving a separate optimisation problem.

Lazaridou et al. [^{28}] identify another issue with the linear transformation objective of Mikolov et al. (2013): They discover that using least-squares as objective for learning a projection matrix leads to *hubness*, i.e. some words tend to appear as nearest neighbours of many other words. To resolve this, they use a margin-based (max-margin) ranking loss (Collobert et al. [^{34}]) to train the model to rank the correct translation vector \(y_i\) of a source word \(x_i\) that is projected to \(\hat{y_i}\) higher than any other target words \(y_j\):

\(\sum\limits^k_{j\neq i} \max \{ 0, \gamma + cos(\hat{y_i}, y_i) - cos(\hat{y_i}, y_j) \} \)

where \(k\) is the number of negative examples and \(\gamma\) is the margin.

They show that selecting max-margin over the least-squares loss consistently improves performance and reduces hubness. In addition, the choice of the negative examples, i.e. the target words compared to which the model should rank the correct translation higher, is important. They hypothesise that an informative negative example is an *intruder* ("truck" in the example), i.e. it is near the current projected vector \(\hat{y_i}\) but far from the actual translation vector \(y_i\) ("cat") as depicted in Figure 6.

These intruders should help the model identify cases where it is failing considerably to approximate the target function and should thus allow it to correct its behaviour. At every step of gradient descent, they compute \(s_j = cos(\hat{y_i}, y_j) - cos(y_i, y_j) \) for all vectors \(y_t\) in the target embedding space with \(j \neq i\) and choose the vector with the largest \(s_j\) as negative example for \(x_i\). Using intruders instead of random negative examples yields a small improvement of 2 percentage points on their comparison task.

Guo et al. [^{4}] propose another projection method that solely relies on word alignments. They count the number of times each word in the source language is aligned with each word in the target language in a parallel corpus and store these counts in an alignment matrix \(\mathcal{A}\).

In order to project a word \(w_i\) from its source representation \(v(w_i^S)\) to its representation in the target embedding space \(v(w_i)^T\) in the target embedding space, they simply take the average of the embeddings of its translations \(v(w_j)^T\) weighted by their alignment probability with the source word:

\(v(w_i)^T = \sum\limits_{i, j \in \mathcal{A}} \dfrac{c_{i, j}}{\sum_j c_{i,j}} \cdot v(w_j)^T\)

where \(c_{i,j}\) is the number of times the \(i^{th}\) source word has been aligned to the \(j^{th}\) target word.

The problem with this method is that it only assigns embeddings for words that are aligned in the reference parallel corpus. Gou et al. thus propagate alignments from in-vocabulary to OOV words by using edit distance as a metric for morphological similarity. They set the projected vector of an OOV source word \(v(w_{OOV}^T)\) as the average of the projected vectors of source words that are similar to it in edit distance:

\(v(w_{OOV}^T) = Avg(v(w_T))\)

where \(C = \{ w \:|\: EditDist(w_{OOV}^T, w) \leq \tau \} \). They set the threshold \(\tau\) empirically to \(1\).

Even though this approach seems simplistic, they actually observe significant improvements over projection via CCA in their experiments.

Ammar et al. [^{5}] extend the bilingual CCA projection method of Faruqui and Dyer (2014) to the multi-lingual setting using the English embedding space as the foundation for their multilingual embedding space.

They learn the two projection matrices for every other language with English. The transformation from each target language space \(\Omega\) to the English embedding space \(\Sigma\) can then be obtained by projecting the vectors in \(\Omega\) into the CCA space \(\Omega^\ast\) using the transformation matrix \(W\) as in Figure 3. As \(\Omega^\ast\) and \(\Sigma^\ast\) lie in the same space, vectors in \(\Sigma^\ast\) can be projected into the English embedding space \(\Sigma\) using the inverse of \(V\).

The previous mapping approaches used a bilingual dictionary as inherent component of their model, but did not pay much attention to the quality of the dictionary entries, using either automatic translations of frequent words or word alignments of all words.

Vulić and Korhonen [^{6}] in turn emphasise the role of the seed lexicon that is used for learning the projection matrix. They propose a hybrid model that initially learns a first shared bilingual embedding space based on an existing cross-lingual embedding model. They then use this initial vector space to obtain translations for a list of frequent source words by projecting them into the space and using the nearest neighbour in the target language as translation. With these translation pairs as seed words, they learn a projection matrix analogously to Mikolov et al. (2013).

In addition, they propose a symmetry constraint, which enforces that words are only included if their projections are neighbours of each other in the first embedding space. Additionally, one can retain pairs whose second nearest neighbours are less similar than the first nearest neighbours up to some threshold.

They run experiments showing that their model with the symmetry constraint outperforms comparison models and that a small threshold of \(0.01\) or \(0.025\) leads to slightly improved performance.

The previous approaches have introduced models that imposed different constraints for mapping monolingual representations of different languages to each other. The relation between these methods and constraints, however, is not clear.

Artetxe et al. [^{32}] thus propose to generalise previous work on learning a linear transformation between monolingual vector spaces: Starting with the basic optimisation objective, they propose several constraints that should intuitively help to improve the quality of the learned cross-lingual representations. Recall that the linear transformation learned by Mikolov et al. (2013) aims to find a parameter matrix \(W\) that satisfies:

\(\DeclareMathOperator*{\argmin}{argmin} \argmin\limits_W \sum\limits_i \|Wx_i - z_i\|^2 \)

where \(x_i\) and \(z_i\) are similar words in the source and target language respectively.

If the performance of the embeddings on a monolingual evaluation task should not be degraded, the dot products need to be preserved after the mapping. This can be guaranteed by requiring \(W\) to be an orthogonal matrix.

Secondly, in order to ensure that all embeddings contribute equally to the objective, embeddings in both languages can be normalised to be unit vectors:

\(\argmin\limits_W \sum\limits_i \| W \dfrac{x_i}{\|x_i\|} - \dfrac{z_i}{\|z_i\|}\|^2 \).

As the norm of an orthogonal matrix is \(1\), if \(W\) is orthogonal, we can add it to the denominator and move \(W\) to the numerator:

\(\argmin\limits_W \sum\limits_i \| \dfrac{Wx_i}{\|Wx_i\|} - \dfrac{z_i}{\|z_i\|}\|^2 \).

Through expansion of the above binomial, we obtain:

\(\argmin\limits_W \sum\limits_i \|\dfrac{Wx_i}{\|Wx_i\|}\|^2 + \|\dfrac{z_i}{\|z_i||}\|^2 - 2 \dfrac{Wx_i}{\|Wx_i\|}^T \dfrac{z_i}{\|z_i\|} \).

As the norm of a unit vector is \(1\) the first two terms reduce to \(1\), which leaves us with the following:

\(\argmin\limits_W \sum\limits_i 2 - 2 \dfrac{Wx_i}{\|Wx_i\|}^T \dfrac{z_i}{\|z_i\|} ) \).

The latter term now is just the cosine similarity of \(Wx_i\) and \(z_i\):

\(\argmin\limits_W \sum\limits_i 2 - 2 \: \text{cos}(Wx_i, z_i) \).

As we are interested in finding parameters \(W\) that minimise our objective, we can remove the constants above:

\(\argmin\limits_W \sum\limits_i - \: \text{cos}(Wx_i, z_i) \).

Minimising the sum of negative cosine similarities is then equal to maximising the sum of cosine similarities, which gives us the following:

\(\DeclareMathOperator*{\argmax}{argmax} \argmax\limits_W \sum\limits_i \text{cos}(Wx_i, z_i) \).

This is equal to the objective by Xing et al. (2015), although they motivated it via an inconsistency of the objectives.

Finally, Artetxe et al. argue that two randomly selected words are generally expected not to be similar. For this reason, the cosine of their embeddings in any dimension -- as well as their cosine similarity -- should be zero. They capture this intuition by performing dimension-wise mean centering with a centering matrix \(C_m\):

\(\argmin\limits_W \sum\limits_i ||C_mWx_i - C_mz_i||^2 \).

This reduces to maximizing the sum of dimension-wise covariance as long as \(W\) is orthogonal similar as above:

\(\argmax\limits_W \sum\limits_i \text{cov}(Wx_i, z_i) \).

Interestingly, the method by Faruqui and Dyer (2014) is similar to this objective, as CCA maximizes the dimension-wise covariance of both projections. This is equivalent to the single projection here, as it is constrained to be orthogonal. The only difference is that, while CCA changes the monolingual embeddings so that different dimensions have the same variance and are uncorrelated -- which might degrade performance -- Artetxe et al. enforce monolingual invariance.

All previous approaches to learning a transformation matrix between monolingual representations in different languages require either a dictionary or word alignments as a source of parallel data.

Barone [^{7}], in contrast, seeks to get closer to the elusive goal of creating cross-lingual representations without parallel data. He proposes to use an adversarial auto-encoder to transform source embeddings into the target embedding space. The auto-encoder is then trained to reconstruct the source embeddings, while the discriminator is trained to differentiate the projected source embeddings from the actual target embeddings as in Figure 7.

While intriguing, learning a transformation between languages without any parallel data at all seems unfeasible at this point. However, future approaches that aim to learn a mapping with fewer and fewer parallel data may bring us closer to this goal.

More generally, however, it remains unclear if a projection can reliably transform the embedding space of one language into the embedding space of another language. Additionally, the reliance on lexicon data or word alignment information is expensive.

The second type of cross-lingual models seeks to construct a pseudo-cross-lingual corpus that captures interactions between the words in different languages. Most approaches aim to identify words that can be translated to each other in monolingual corpora of different languages and replace these with placeholders to ensure that translations of the same word have the same vector representation.

Xiao and Guo [^{8}] propose the first pseudo-cross-lingual method that leverages translation pairs: They first translate all words that appear in the source language corpus into the target language using Wiktionary. As these translation pairs are still very noisy, they filter them by removing polysemous words in the source and target language and translations that do not appear in the target language corpus. From this bilingual dictionary, they now create a joint vocabulary, in which each translation pair has the same vector representation.

For training, they use the margin-based ranking loss of Collobert et al. (2008) to rank correct word windows higher than corrupted ones, where the middle word is replaced by an arbitrary word.

In contrast to the subsequent methods, they do not construct a pseudo-cross-lingual corpus explicitly. Instead, they feed windows of both the source and target corpus into the model during training, thereby essentially interpolating source and target language.

It is thus most likely that, for ease of training, the authors replace translation pairs in source and target corpus with a placeholder to ensure a common vector representation, similar to the procedure of subsequent models.

Gouws and Søgaard [^{9}] in turn explicitly create a pseudo-cross-lingual corpus: They leverage translation pairs of words in the source and in the target language obtained via Google Translate. They concatenate the source and target corpus and replace each word that is part of a translation pair with its translation equivalent with a probability of 50%. They then train CBOW on this corpus.

It is interesting to note that they also experiment with replacing words not based on translation but part-of-speech equivalence, i.e. words with the same part-of-speech in different languages will be replaced with one another. While replacement based on part-of-speech leads to small improvements for cross-lingual part-of-speech tagging, replacement based on translation equivalences yields even better performance for the task.

Duong et al. [^{10}] propose a similar approach to Gouws and Søgaard (2015). They also use CBOW, which predicts the centre word in a window given the surrounding words. Instead of randomly replacing every word in the corpus with its translation during pre-processing, they replace each centre word with a translation on-the-fly during training.

In addition to past approaches, they also seek to handle polysemy explicitly by proposing an EM-inspired method that chooses as replacement the translation \(\bar{w_i}\) whose representation is most similar to the combination of the representations of the source word \(v_{w_i}\) and the context vector \(h_i\):

\(\bar{w_i} = \text{argmax}_{w \: \in \: \text{dict}(w_i)} \: \text{cos}(v_{w_i} + h_i, v_w) \)

where \(\text{dict}(w_i)\) contains the translations of \(w_i\).

They then jointly learn to predict both the words and their appropriate translations. They use PanLex as bilingual dictionary, which covers around 1,300 language with about 12 million expressions. Consequently, translations are high coverage but often noisy.

Ammar et al. (2016) propose another approach that is similar to the previous method by Gouws and Søgaard (2015): They use bilingual dictionaries to find clusters of synonymous words in different languages. They then concatenate the monolingual corpora of different languages and replace tokens in the same cluster with the cluster ID. They then train SGNS on the concatenated corpus.

The previous methods all use a bilingual dictionary or a translation tool as a source of translation pairs that can be used for replacement.

Vulić and Moens [^{11}] present a model that does without translation pairs and learns cross-lingual embeddings only from document-aligned data. In contrast to the previous methods, the authors propose not to merge two monolingual corpora but two aligned documents of different languages into a pseudo-bilingual document.

They concatenate the documents and then shuffle them by randomly permutating the words. The intuition is that as most methods rely on learning word embeddings based on their context, shuffling the documents would lead to bilingual contexts for each word that will enable the creation of a robust embedding space. As shuffling is necessarily random, however, it might lead to sub-optimal configurations.

For this reason, they propose another merging strategy that assumes that the structures of the document are similar: They then alternatingly insert words from each language into the pseudo-bilingual document in the order in which they appear in their monolingual document and based on the mono-lingual documents' length ratio.

While pseudo-cross-lingual approaches are attractive due to their simplicity and ease of implementation, relying on naive replacement and permutation does not allow them to capture more sophisticated facets of cross-lingual relations.

Cross-lingual training approaches focus exclusively on optimising the cross-lingual objective. These approaches typically rely on sentence alignments rather than a bilingual lexicon and require a parallel corpus for training.

The first approach that optimizes only a cross-lingual objective is the bilingual compositional sentence model by Hermann and Blunsom [^{12}]. They train two models to produce sentence representations of aligned sentences in two languages and use the distance between the two sentence representations as objective. They minimise the following loss:

\(E_{dist}(a,b) = \|a_{\text{root}} - b_{\text{root}} \|^2 \)

where \(a_{\text{root}}\) and \(b_{\text{root}}\) are the representations of two aligned sentences from different languages. They compose \(a_{\text{root}}\) and \(b_{\text{root}}\) simply as the sum of the embeddings of the words in the corresponding sentence. The full model is depicted in Figure 8.

They train the model then to output a higher score for correct translations than for randomly sampled incorrect translations using the max-margin hinge loss of Collobert et al. (2008).

Instead of minimising the distance between two sentence representations in different languages, Lauly et al. [^{13}] aim to reconstruct the target sentence from the original source sentence. They start with a monolingual autoencoder that encodes an input sentence as a sum of its word embeddings and tries to reconstruct the original source sentence. For efficient reconstruction, they opt for a tree-based decoder that is similar to a hierarchical softmax. They then augment this autoencoder with a second decoder that reconstructs the aligned target sentence from the representation of the source sentence as in Figure 9.

Encoders and decoders have language-specific parameters. For an aligned sentence pair, they then train the model with four reconstruction losses: for each of the two sentences, they reconstruct from the sentence to itself and to its equivalent in the other language.

While the previous approaches required word alignments as a prerequisite for learning cross-lingual embeddings, Kočiský et al. [^{15}] simultaneously learn word embeddings and alignments. Their model, Distributed Word Alignment, combines a distributed version of FastAlign (Dyer et al. [^{35}]) with a language model. Similar to other bilingual approaches, they use the word in the source language sentence of an aligned sentence pair to predict the word in the target language sentence.

They replace the standard multinomial translation probability of FastAlign with an energy function that tries to bring the representation of a target word \(f\) close to the sum of the context words around the word \(e_i\) in the source sentence:

\(E(f, e_i) = - ( \sum\limits_{s=-k}^k r^T_{e_{i+s}} T_s) r_f - b_r^T r_f - b_f \)

where \(r_{e_{i+s}}\) and \(r_f\) are vector representations for source and target words, \(T_s\) is a projection matrix, and \(b_r\) and \(b_f\) are representation and target biases respectively. For calculating the translation probability \(p(f|e_i)\), we then simply need to apply the softmax to the translation probabilities between the source word and all words in the target language.

In addition, the authors speed up training by using a class factorisation strategy similar to the hierarchical softmax and predict frequency-based class representations instead of word representations. For training, they also use EM but fix the alignment counts learned by FastAlign that was initially trained for 5 epochs during the E-step and optimise the translation probabilities in the M-step only.

Hermann and Blunsom [^{16}] extend their approach (Hermann and Blunsom, 2013) to documents, by applying their composition and objective function recursively to compose sentences into documents. First, sentence representations are computed as before. These sentence representations are then fed into a document-level compositional vector model, which integrates the sentence representations in the same way as can be seen in Figure 10.

The advantage of this method is that weaker supervision in the form of document-level alignment can be used instead of or in conjunction with sentence-level alignment. The authors run experiments both on Europarl as well as on a newly created corpus of multilingual aligned TED talk transcriptions and find that the document signal helps considerably.

In addition, they propose another composition function that -- instead of summing the representations -- applies a non-linearity to bigram pairs:

\(f(x) = \sum\limits_{i=1}^n \text{tanh}(x_{i-1} + x_i)\)

They find that this composition slightly outperforms addition, but underperforms it on smaller training datasets.

Chandar et al. [^{17}] extend the approach by Lauly et al. (2013) in two ways: Instead of using a tree-based decoder for calculating the reconstruction loss, they reconstruct a sparse binary vector of word occurrences as in Figure 11. Due to the high-dimensionality of the binary bag-of-words vector, reconstruction is slower. As they perform training using mini-batch gradient descent, where each mini-batch consists of adjacent sentences, they propose to merge the bags-of-words of the mini-batch into a single bag-of-words and to perform updates based on the merged bag-of-words. They find that this yields good performance and even outperforms the tree-based decoder.

Secondly, they propose to add a term \(cor(a(x), a(y))\) to the objective function that encourages correlation between the representations \(a(x)\) , \(a(y)\) of the source and target language respectively by summing the scalar correlations between all dimensions of the two vectors.

Similar to the previous methods, Pham et al. [^{18}] learn sentence representations as a means for learning cross-lingual word embeddings. They extend paragraph vectors (Mikolov et al. [^{36}]) to the multilingual setting by forcing aligned sentences of different languages to share the same vector representation as in Figure 12 where \(sent\) is the shared sentence representation. The shared sentence representation is concatenated with the sum of the previous \(N\) words in the sentence and the model is trained to predict the next word in the sentence.

The authors use a hierarchical softmax to speed-up training. As the model only learns representations for the sentences it has seen during training, at test time for an unknown sentence, the sentence representation is randomly initialised and the model is trained to predict only the words in the sentence. Only the sentence vector is updated, while the other model parameters are frozen.

Besides word embedding models such as skip-gram, matrix factorisation approaches have historically been used successfully to learn representations of words. One of the most popular methods is LSA, which Gardner et al. [^{19}] extend as translation-invariant LSA to to learn cross-lingual word embeddings. They factorise a multilingual co-occurrence matrix with the restriction that it should be invariant to translation, i.e. it should stay the same if multiplied with the respective word or context dictionary.

All previous approaches to learn cross-lingual representations have been based on some form of language model or matrix factorisation. In contrast, Søgaard et al. [^{20}] propose an approach that does without any of these methods, but instead relies on the structure of the multilingual knowledge base Wikipedia, which they exploit by inverted indexing. Their method is based on the intuition that similar words will be used to describe the same concepts across different languages.

In Wikipedia, articles in multiple languages deal with the same concept. We would typically represent every concept with the terms that are used to describe it across different languages. To learn cross-lingual word representations, we can now simply invert the index and instead represent a word by the Wikipedia concepts it is used to describe. This way, we are directly provided with cross-lingual representations of words without performing any optimisation whatsoever. As a post-processing step, we can perform dimensionality reduction on the produced word representations.

While the previous methods are able to make effective use of parallel sentence and documents to learn cross-lingual word representations, they neglect the monolingual quality of the learned representations. Ultimately, we do not only want to embed languages into a shared embedding space, but also want the monolingual representations do well on the task at hand.

Models that use joint optimisation aim to do exactly this: They not only consider a cross-lingual constraint, but jointly optimize mono-lingual and cross-lingual objectives.

In practice, for two languages \(l_1\) and \(l_2\), these models optimize a monolingual loss \(\mathcal{M}\) for each language and one or multiple terms \(\Omega\) that regularize the transfer from language \(l_1\) to \(l_2\) (and vice versa):

\(\mathcal{M}_{l_1} + \mathcal{M}_{l_2} + \lambda (\Omega_{l_1 \rightarrow l_2} + \Omega_{l_2 \rightarrow l_1}) \)

where \(\lambda\) is an interpolation parameter that adjusts the impact of the cross-lingual regularization.

The first jointly optimised model for learning cross-lingual representations was created by Klementiev et al. [^{23}]. They train a neural language model for each language and jointly optimise the monolingual maximum likelihood objective of each language model with a word-alignment based MT regularization term as the cross-lingual objective. The monolingual objective is thus to maximise the probability of the current word \(w_t\) given its \(n\) surrounding words:

\(\mathcal{M} = \text{log} \: P(w_t^ \: | \: w_{t-n+1:t-1}) \).

This is optimised using the classic language model of Bengio et al. [^{37}]. The cross-lingual regularisation term in turn encourages the representations of words that are often aligned to each other to be similar:

\(\Omega = \dfrac{1}{2} c^T (A \otimes I) c\)

where \(A\) is the matrix capturing alignment scores, \(I\) is the identity matrix, \(\otimes\) is the Kronecker product, and \(c\) is the representation of word \(w_t\).

Zou et al. [^{14}] use a matrix factorisation approach in the spirit of GloVe (Pennington et al. [^{38}]) to learn cross-lingual word representations for English and Chinese. They create two alignment matrices \(A_{en \rightarrow zh}\) and \(A_{zh \rightarrow en}\) using alignment counts automatically learned from the Chinese Gigaword corpus. In \(A_{en \rightarrow zh}\), each element \(a_{ij}\) contains the number of times the \(i\)-th Chinese word was aligned with the \(j\)-th English word, with each row normalised to sum to \(1\).

Intuitively, if a word in the source language is only aligned with one word in the target language, then those words should have the same representation. If the target word is aligned with more than one source word, then its representation should be a combination of the representations of its aligned words. Consequently, the authors represent the embeddings in the target language as the product of the source embeddings \(V_{en}\) and their corresponding alignment counts \(A_{en \rightarrow zh}\). They then minimise the squared difference between these two terms:

\(\Omega_{en\rightarrow zh} = || V_{zh} - A_{en \rightarrow zh} V_{en}||^2 \)

\(\Omega_{zh\rightarrow en} = || V_{en} - A_{zh \rightarrow en} V_{zh}||^2 \)

where \(V_{en}\) and \(V_{zh}\) are the embedding matrices of the English and Chinese word embeddings respectively.

They employ the max-margin hinge loss objective by Collobert et al. (2008) as monolingual objective \(\mathcal{M}\) and train the English and Chinese word embeddings to minimise the corresponding objective above together with a monolingual objective. For instance, for English, the training objective is:

\(\mathcal{M}_{en} + \lambda \Omega_{zh\rightarrow en} \).

It is interesting to observe that the authors learn embeddings using a curriculum, training different frequency bands of the vocabulary at a time. The entire training process takes 19 days.

Luong et al. [^{24}] in turn extend skip-gram to the cross-lingual setting and use the skip-gram objectives as monolingual and cross-lingual objectives. Rather than just predicting the surrounding words in the source language, they use the words in the source language to additionally predict their aligned words in the target language as in Figure 13.

For this, they require word alignment information. They propose two ways to predict aligned words: For their first method, they automatically learn alignment information; if a word is unaligned, the alignments of its neighbours are used for prediction. In their second method, they assume that words in the source and target sentence are monotonically aligned, with each source word at position \(i\) being aligned to the target word at position \(i \cdot T/S\) where \(S\) and \(T\) are the source and target sentence lengths. They find that a simple monotonic alignment is comparable to the unsupervisedly learned alignment in performance.

Gouws et al. [^{25}] propose a Bilingual Bag-of-Words without Word Alignments (BilBOWA) that leverages additional monolingual data. They use the skip-gram objective as a monolingual objective and a novel sampled \(l_2\) loss as cross-lingual regularizer as in Figure 14.

More precisely, instead of relying on expensive word alignments, they simply assume that each word in a source sentence is aligned with *every* word in the target sentence under a uniform alignment model. Thus, instead of minimising the distance between words that were aligned to each other, they minimise the distance between the means of the word representations in the aligned sentences, which is shown in Figure 15, where \(s^e\) and \(s^f\) are the sentences in source and target language respectively.

The cross-lingual objective in the BilBOWA model is thus:

\(\Omega = \|\dfrac{1}{m} \sum\limits_{w_i \in s^{l_1}}^m r_i^{l_1} - \dfrac{1}{n} \sum\limits_{w_j \in s^{l_2}}^n r_j^{l_2}

\|^2 \)

where \(r_i\) and \(r_j\) are the word embeddings of word \(w_i\) and \(w_j\) in each sentence \(s^{l_1}\) and \(s^{l_2}\) of length \(m\) and \(n\) in languages \(l_1\) and \(l_2\) respectively.

Another extension of skip-gram to learning cross-lingual representations is proposed by Coulmance et al. [^{26}]. They also use the regular skip-gram objective as monolingual objective. For the cross-lingual objective, they make a similar assumption as Gouws et al. (2015) by supposing that every word in the source sentence is uniformly aligned to every word in the target sentence.

Under the skip-gram formulation, they treat every word in the target sentence as context of every word in the source sentence and thus train their model to predict all words in the target sentence with the following skip-gram objective:

\(\Omega_{e,f} = \sum\limits_{(s_{l_1}, s_{l_2}) \in C_{l_1, l_2}} \sum\limits_{w_{l_1} \in s_{l_1}} \sum\limits_{c_{l_2} \in s_{l_2}} - \text{log} \: \sigma(w_{l_1}, c_{l_2}) \)

where \(s\) is the sentence in the respective language, \(C\) is the sentence-aligned corpus, \(w\) are word and \(c\) are context representations respectively, and \( - \text{log} \: \sigma(\centerdot)\) is the standard skip-gram loss function.

As the cross-lingual objective is asymmetric, they use one cross-lingual objective for the source-to-target and another one for the target-to-source direction. The complete Trans-gram objective including two monolingual and two cross-lingual skip-gram objectives is displayed in Figure 16.

Shi et al. [^{27}] use a joint matrix factorisation model to learn cross-lingual representations. In contrast to Zou et al. (2013), they also take into account additional monolingual data. Similar to the former, they also use the GloVe objective (Pennington et al., 2014) as monolingual objective:

\(\mathcal{M}_{l_i} = \sum\limits_{j,k} f(X_{jk}^{l_i})(w_j^{l_i} \cdot c_k^{l_i} + b_{w_j}^{l_i} + b_{c_k}^{l_i} + b^{l_i} - M_{jk}^{l_{i}}) \)

where \(w_j^{l_i}\) and \(c_k^{l_i}\) are the embeddings and \(M_{jk}^{l_{i}}\) the PMI value of a word-context pair \((j,k) \) in language \(l_{i}\), while \( b_{w_j}^{l_i}\) and \(b_{c_k}^{l_i}\) and \(b^{l_i}\) are the word-specific and language-specific bias terms respectively.

They then place cross-lingual constraints on the monolingual representations as can be seen in Figure 17. The authors propose two cross-lingual regularisation objectives: The first one is based on calculating cross-lingual co-occurrence counts. These co-occurrences can be calculated without alignment information using a uniform alignment model as in Gouws et al. (2015). Alternatively, co-occurrence counts can also be calculated by leveraging automatically learned word alignments. The co-occurrence counts are then stored in a matrix \(X^{\text{bi}}\) where every entry \(X_{jk}^{\text{bi}}\) contains the number of times the source word \(j\) occurred with the target word \(k\) in an aligned sentence pair in the parallel corpus.

For optimisation, a PMI matrix \(M^{\text{bi}}_{jk}\) can be calculated based on the co-occurrence counts in \(X^{\text{bi}}\). This matrix can again be factorised as in the GloVe objective, where now the context word representation \(c_k^{l_i}\) is replaced with the representation of the word in the target language \(w_k^{l_2}\):

\(\Omega = \sum\limits_{j \in V^{l_1}, k \in V^{l_2}} f(X_{jk}^{l_1})(w_j^{l_1} \cdot w_k^{l_2} + b_{w_j}^{l_1} + b_{w_k}^{l_2} + b^{\text{bi}} - M_{jk}^{\text{bi}}) \).

The second cross-lingual regularisation term they propose leverages the translation probabilities produced by a machine translation system and involves minimising the distances of the representations of related words in the two languages weighted by their similarities:

\(\Omega = \sum\limits_{j \in V^{l_1}, k \in V^{l_2}} sim(j,k) \cdot ||w_j^{l_1} - w_k^{l_2}||^2\)

where \(j\) and \(k\) are words in the source and target language respectively and \(sim(j,k)\) is their translation probability.

Vyas and Carpuat [^{21}] propose another method based on matrix factorisation that -- in contrast to previous approaches -- allows learning sparse cross-lingual representations. They first independently train two monolingual word representations \(X_e\) and \(X_f\) in two different languages using GloVe (Pennington et al., 2014) on two large monolingual corpora.

They then learn monolingual sparse representations from these dense representations by decomposing \(X\) into two matrices \(A\) and \(D\) such that the \(l_2\) reconstruction error is minimised, with an additional constraint on \(A\) for sparsity:

\(\mathcal{M}_{l_i} = \sum\limits_{i=1}^{v_{l_i}} \|A_{l_ii}D_{l_i}^T - X_{l_ii}\| + \lambda_{l_i} \|A_{l_ii}\|_1 \)

where \(v_{l_i}\) is the number of dense word representations in language \(l_i\).

The above equation, however, only creates sparse monolingual embeddings. To learn bilingual embeddings, they add another constraint based on automatically learned word alignment that minimises the \(l_2\) reconstruction error between words that were strongly aligned to each other:

\(\Omega = \sum\limits_{i=1}^{v_{l_1}} \sum\limits_{j=1}^{v_{l_2}} \dfrac{1}{2} \lambda_x S_{ij} \|A_{l_1i} - A_{l_2j}\|_2^2 \)

where \(S\) is the alignment matrix where each entry \(S_{ij}\) contains the alignment score of source word \(X_{l_1i}\) with target word \(X_{l_2j}\).

The complete objective function is thus the following:

\(\mathcal{M}_{l_1} + \mathcal{M}_{l_2} + \Omega\).

Mogadala and Rettinger [^{22}] use an approach similar to Pham et al. (2015), but extend it to also work without parallel data. They use the paragraph vectors objective as monolingual objective \(\mathcal{M}\). They jointly optimise this objective together with a cross-lingual regularization function \(\Omega\) that encourages the representations of words in languages \(l_1\) and \(l_2\) to be close to each other.

Their main innovation is that the cross-lingual regularizer \(\Omega\) is adjusted based on the nature of the training corpus. In addition to regularising the mean of word vectors in a sentence to be close to the mean of word vectors in the aligned sentence similar to Gouws et al. (2015) (the second term in the below equation), they also regularise the paragraph vectors \(SP^{l_1}\) and \(SP^{l_2}\) of aligned sentences in languages \(l_1\) and \(l_2\) to be close to each other. The complete cross-lingual objective then uses elastic net regularization to combine both terms:

\(\Omega = \alpha ||SP^{l_1}_j - SP^{l_2}_j||^2 + (1-\alpha) \dfrac{1}{m} \sum\limits_{w_i \in s_j^{l_1}}^m W_i^{l_1} - \dfrac{1}{n} \sum\limits_{w_k \in s_j^{l_2}}^n W_k^{l_2} \)

where \(W_i^{l_1}\) and \(W_k^{l_2}\) are the word embeddings of word \(w_i\) and \(w_k\) in each sentence \(s_j\) of length \(m\) and \(n\) in languages \(l_1\) and \(l_2\) respectively.

To leverage data that is not sentence-aligned, but where an alignment is still present on the document level, they propose a two-step approach: They use Procrustes analysis, a method for statistical shape analysis, to find for each document in language \(l_1\) the most similar document in language \(l_2\). This is done by first learning monolingual representations of the documents in each language using paragraph vectors on each corpus. Subsequently, Procrustes analysis aims to learn a transformation between the two vector spaces by translating, rotating, and scaling the embeddings in the first space until they most closely align to the document representations in the second space.

In the second step, they then simply use the previously described method to learn cross-lingual word representations from the alignment documents, this time treating the entire documents as paragraphs.

A recent branch of research proposes to incorporate visual information to improve the performance of monolingual [^{29}] or cross-lingual [^{30}] representations. These methods show good performance on comparison tasks. They additionally demonstrate application for zero-shot learning and might thus ultimately be helpful in learning cross-lingual representations without (linguistic) parallel data.

Models for learning cross-linguistic representations share weaknesses with other vector space models of language: While they are very good at modelling the conceptual aspect of meaning evaluated in word similarity tasks, they fail to properly model the functional aspect of meaning, e.g. to distinguish whether one remarks "Give me *a* pencil" or "Give me *that* pencil".

Secondly, due to the reliance on bag-of-words representations, current models for learning cross-lingual word embeddings completely ignore word order. Models that are oblivious to word order, for instance, assign to the following sentence pair (Landauer & Dumais [^{39}]) the exact same representation as they contain the same set of words, even though they are completely different in meaning:

- "That day the office manager, who was drinking, hit the problem sales worker with a bottle, but it was not serious."
- "It was not the sales manager, who hit the bottle that day, but the office worker with a serious drinking problem".

Most approaches for learning cross-lingual representations focus on word representations. These approaches are not able to easily compose word representations to form representations of sentences and documents. Even approaches that learn jointly learn word and sentence representations do so by via simple summation of words in the sentence. In the future, it will be interesting to see if LSTMs or CNNs that can form more composable sentence representations can be applied efficiently to learn cross-lingual representations.

While conflating multiple senses of a word is already problematic for learning mono-lingual word representations, this issue is amplified in a cross-lingual embedding space: Monosemous words in one language might align with polysemous words in another language and thus fail to capture the entirety of the cross-lingual relations. There has already been promising work on learning monolingual multi-sense embeddings. We hypothesize that learning cross-lingual multi-sense embeddings will become increasingly relevant, as it enables us to capture more fine-grained cross-lingual meaning.

The final challenge pertains to the feasibility of the venture of learning cross-lingual embeddings itself: Languages are incredibly complex, human artefacts. Learning a monolingual embedding space is already difficult; sharing such a vector space between two languages and expecting that inter-language and intra-language relations are reliably reflected then seems utopian.

Additionally, some languages show linguistic features, which other languages lack. The ease of constructing a shared embedding space between languages and consequently the success of cross-lingual transfer is intuitively proportional to the similarity of the languages: An embedding space shared between Spanish and Portuguese tends to capture more linguistic nuances of meaning than an embedding space populated with English and Chinese representations. Furthermore, if two languages are too dissimilar, cross-linguistic transfer might not be possible at all -- similar to the negative transfer that occurs in domain adaptation between very dissimilar domains.

Having surveyed models to learn cross-lingual word representations, we would now like to know which is the best method to use for the task we care about. Cross-lingual representation models have been evaluated on a wide range of tasks such as cross-lingual document classification (CLDC), Machine Translation (MT), word similarity, as well as cross-lingual variations of the following tasks: named entity recognition, part-of-speech tagging, super sense tagging, dependency parsing, and dictionary induction.

In the context of the CLDC evaluation setup by Klementiev et al. (2012) \(40\)-dimensional cross-lingual word embeddings are learned to classify documents in one language and evaluated on the documents of another language. As CLDC is among the most widely used, we show below exemplarily the evaluation table of Mogadala and Rettinger (2016) for this task:

Method | en -> de | de -> en | en -> fr | fr -> en | en -> es | es -> en |
---|---|---|---|---|---|---|

Majority class | 46.8 | 46.8 | 22.5 | 25.0 | 15.3 | 22.2 |

MT | 68.1 | 67.4 | 76.3 | 71.1 | 52.0 | 58.4 |

Multi-task language model (Klementiev et al., 2012) | 77.6 | 71.1 | 74.5 | 61.9 | 31.3 | 63.0 |

Bag-of-words autoencoder with correlation (Chandar et al., 2014) | 91.8 | 74.2 | 84.6 | 74.2 | 49.0 | 64.4 |

Bilingual compositional document model (Hermann and Blunsom, 2014) | 86.4 | 74.7 | - | - | - | - |

Distributed word alignment (Kočiský et al., 2014) | 83.1 | 75.4 | - | - | - | - |

Bilingual bag-of-words without word alignments (Gouws et al., 2015) | 86.5 | 75.0 | - | - | - | - |

Bilingual skip-gram (Luong et al., 2015) | 87.6 | 77.8 | - | - | - | - |

Bilingual skip-gram without word alignments (Coulmance et al., 2015) | 87.8 | 78.7 | - | - | - | - |

Bilingual paragraph vectors (without parallel data) (Mogadala and Rettinger, 2016) | 88.1 | 78.9 | 79.2 | 77.8 | 56.9 | 67.6 |

These results, however, should not be considered as representative of the general performance of cross-lingual embedding models as different methods tend to well on different tasks depending on the type of approach and the type of data used.

Upadhyay et al. [^{31}] evaluate cross-lingual embedding models that require different forms of supervision on various tasks. They find that on word similarity datasets, models that require cheaper forms of supervision (sentence-aligned and document-aligned data) are almost as good as models with more expensive supervision in the form of word alignments. For cross-lingual classification and dictionary induction, more informative supervision is better. Finally, for parsing, models with word-level alignment are able to capture syntax more accurately and thus perform better overall.

The findings by Upadhyay et al. are further proof for the intuition that the choice of the data is important. Levy et al. (2016) go even further than this in comparing models for learning cross-lingual word representations to traditional alignment models on dictionary induction and word alignment tasks. They argue that whether or not an algorithm uses a particular feature set is more important than the choice of the algorithm. In their experiments, using sentence ids, i.e. creating a sentence's language-independent representation (for instance with doc2vec) achieves better results than just using the source and target words.

Finally, to facilitate evaluation of cross-lingual word embeddings, Ammar et al. (2016) make a website available where learned representations can be uploaded and automatically evaluated on a wide range of tasks.

Models that allow us to learn cross-lingual representations have already been useful in a variety of tasks such as Machine Translation (decoding and evaluation), automated bilingual dictionary generation, cross-lingual information retrieval, parallel corpus extraction and generation, as well as cross-language plagiarism detection. It will be interesting to see what further progress the future will bring.

**Let me know your thoughts about this post and about any errors you found in the comments below.**

This blog post is also available as an article on arXiv, in case you want to refer to it later.

In case you found it helpful, consider citing the corresponding arXiv article as:

*Sebastian Ruder (2017). A survey of cross-lingual embedding models. arXiv preprint arXiv:1706.04902.*

If you want to learn more about word embeddings, these other blog posts on word embeddings are also available:

- On word embeddings - Part 1
- On word embeddings - Part 2: Approximating the softmax
- On word embeddings - Part 3: The secret ingredients of word2vec

Levy, O., Søgaard, A., & Goldberg, Y. (2016). Reconsidering Cross-lingual Word Embeddings. arXiv Preprint arXiv:1608.05426. Retrieved from http://arxiv.org/abs/1608.05426 ↩

Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation. Retrieved from http://arxiv.org/abs/1309.4168 ↩

Faruqui, M., & Dyer, C. (2014). Improving Vector Space Word Representations Using Multilingual Correlation. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 462 – 471. Retrieved from http://repository.cmu.edu/lti/31 ↩

Guo, J., Che, W., Yarowsky, D., Wang, H., & Liu, T. (2015). Cross-lingual Dependency Parsing Based on Distributed Representations. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1234–1244. Retrieved from http://www.aclweb.org/anthology/P15-1119 ↩

Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C., & Smith, N. A. (2016). Massively Multilingual Word Embeddings. Retrieved from http://arxiv.org/abs/1602.01925 ↩

Vulic, I., & Korhonen, A. (2016). On the Role of Seed Lexicons in Learning Bilingual Word Embeddings. Proceedings of ACL, 247–257. ↩

Barone, A. V. M. (2016). Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. Proceedings of the 1st Workshop on Representation Learning for NLP, 121–126. Retrieved from http://arxiv.org/pdf/1608.02996.pdf ↩

Xiao, M., & Guo, Y. (2014). Distributed Word Representation Learning for Cross-Lingual Dependency Parsing. CoNLL. ↩

Gouws, S., & Søgaard, A. (2015). Simple task-specific bilingual word embeddings. NAACL, 1302–1306. ↩

Duong, L., Kanayama, H., Ma, T., Bird, S., & Cohn, T. (2016). Learning Crosslingual Word Embeddings without Bilingual Corpora. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16). ↩

Vulic, I., & Moens, M.-F. (2016). Bilingual Distributed Word Representations from Document-Aligned Comparable Data. Journal of Artificial Intelligence Research, 55, 953–994. Retrieved from http://arxiv.org/abs/1509.07308 ↩

Hermann, K. M., & Blunsom, P. (2013). Multilingual Distributed Representations without Word Alignment. arXiv Preprint arXiv:1312.6173. ↩

Lauly, S., Boulanger, A., & Larochelle, H. (2013). Learning Multilingual Word Representations using a Bag-of-Words Autoencoder. NIPS WS on Deep Learning, 1–8. Retrieved from http://arxiv.org/abs/1401.1803 ↩

Zou, W. Y., Socher, R., Cer, D., & Manning, C. D. (2013). Bilingual Word Embeddings for Phrase-Based Machine Translation. EMNLP. ↩

Kočiský, T., Hermann, K. M., & Blunsom, P. (2014). Learning Bilingual Word Representations by Marginalizing Alignments. Retrieved from http://arxiv.org/abs/1405.0947 ↩

Hermann, K. M., & Blunsom, P. (2014). Multilingual Models for Compositional Distributed Semantics. Acl, 58–68. ↩

Chandar, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V., & Saha, A. (2014). An Autoencoder Approach to Learning Bilingual Word Representations. Advances in Neural Information Processing Systems. Retrieved from http://arxiv.org/abs/1402.1454 ↩

Pham, H., Luong, M.-T., & Manning, C. D. (2015). Learning Distributed Representations for Multilingual Text Sequences. Workshop on Vector Modeling for NLP, 88–94. ↩

Gardner, M., Huang, K., Paplexakis, E., Fu, X., Talukdar, P., Faloutsos, C., … Sidiropoulos, N. (2015). Translation Invariant Word Embeddings. EMNLP. ↩

Søgaard, A., Agic, Z., Alonso, H. M., Plank, B., Bohnet, B., & Johannsen, A. (2015). Inverted indexing for cross-lingual NLP. The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), 1713–1722. ↩

Vyas, Y., & Carpuat, M. (2016). Sparse Bilingual Word Representations for Cross-lingual Lexical Entailment. NAACL, 1187–1197. ↩

Mogadala, A., & Rettinger, A. (2016). Bilingual Word Embeddings from Parallel and Non-parallel Corpora for Cross-Language Text Classification. NAACL, 692–702. Retrieved from http://www.aifb.kit.edu/images/b/b4/NAACL-HLT-2016-Camera-Ready.pdf ↩

Klementiev, A., Titov, I., & Bhattarai, B. (2012). Inducing Crosslingual Distributed Representations of Words. ↩

Luong, M.-T., Pham, H., & Manning, C. D. (2015). Bilingual Word Representations with Monolingual Quality in Mind. Workshop on Vector Modeling for NLP, 151–159. ↩

Gouws, S., Bengio, Y., & Corrado, G. (2015). BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. Proceedings of The 32nd International Conference on Machine Learning, 748–756. Retrieved from http://jmlr.org/proceedings/papers/v37/gouws15.html ↩

Coulmance, J., Marty, J.-M., Wenzek, G., & Benhalloum, A. (2015). Trans-gram, Fast Cross-lingual Word-embeddings. EMNLP 2015, (September), 1109–1113. ↩

Shi, T., Liu, Z., Liu, Y., & Sun, M. (2015). Learning Cross-lingual Word Embeddings via Matrix Co-factorization. Annual Meeting of the Association for Computational Linguistics, 567–572. ↩

Lazaridou, A., Dinu, G., & Baroni, M. (2015). Hubness and Pollution: Delving into Cross-Space Mapping for Zero-Shot Learning. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 270–280. ↩

Lazaridou, A., Nghia, T. P., & Baroni, M. (2015). Combining Language and Vision with a Multimodal Skip-gram Model. Proceedings of Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, Denver, Colorado, May 31 – June 5, 2015, 153–163. ↩

Vulić, I., Kiela, D., Clark, S., & Moens, M.-F. (2016). Multi-Modal Representations for Improved Bilingual Lexicon Learning. ACL. ↩

Upadhyay, S., Faruqui, M., Dyer, C., & Roth, D. (2016). Cross-lingual Models of Word Embeddings: An Empirical Comparison. Retrieved from http://arxiv.org/abs/1604.00425 ↩

Artetxe, M., Labaka, G., & Agirre, E. (2016). Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16), 2289–2294. ↩

Xing, C., Liu, C., Wang, D., & Lin, Y. (2015). Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. NAACL-2015, 1005–1010. ↩

Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing. Proceedings of the 25th International Conference on Machine Learning - ICML ’08, 20(1), 160–167. http://doi.org/10.1145/1390156.1390177 ↩

Dyer, C., Victor Ch., & Smith, N. A. (2013). A simple, fast, and effective reparameterization of ibm model 2. Association for Computational Linguistics. ↩

Le, Q. V., & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. International Conference on Machine Learning - ICML 2014, 32, 1188–1196. Retrieved from http://arxiv.org/abs/1405.4053 ↩

Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3, 1137–1155. http://doi.org/10.1162/153244303322533223 ↩

Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543. http://doi.org/10.3115/v1/D14-1162 ↩

Landauer, T. K. & Dumais, S. T. (1997). A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction and representation of knowledge, Psychological Review, 104(2), 211-240. ↩

Cover image courtesy of Zou et al. (2013)

]]>I spent the past week in Austin, Texas at EMNLP 2016, the Conference on Empirical Methods in Natural Language Processing.

There were a lot of papers at the conference (179 long papers, 87 short papers, and 9 TACL papers all in all)

]]>*This post originally appeared on the AYLIEN blog.*

I spent the past week in Austin, Texas at EMNLP 2016, the Conference on Empirical Methods in Natural Language Processing.

There were a lot of papers at the conference (179 long papers, 87 short papers, and 9 TACL papers all in all) -- too many to read each single one. The entire program can be found here. In the following, I will highlight some trends and papers that caught my eye:

**Reinforcement learning**: One thing that stood out was that RL seems to be slowly finding its footing in NLP, with more and more people using it to solve complex problems:

- Hahn and Keller model human reading;
- Miao and Blunsom summarise sentences;
- Li et al. generate dialogues;
- He et al. predict reddit threads;
- Clark and Manning perform coreference resolution;
- Narasimhan et al. query the web for additional evidence to improve information extraction in one of the two best papers.

**Dialogue**: Dialogue was a focus of the conference with all of the three keynote speakers dealing with different aspects of dialogue: Christopher Potts talked about pragmatics and how to reason about the intentions of the conversation partner; Stefanie Tellex concentrated on how to use dialogue for human-robot collaboration; finally, Andreas Stolcke focused on the problem of addressee detection in his talk. Among the papers, a few that dealt with dialogue stood out:

- Andreas and Klein model pragmatics in dialogue with neural speakers and listeners;
- Liu et al. show how
*not*to evaluate your dialogue system; - Ouchi and Tsuboi select addressees and responses in multi-party conversations;
- Wen et al. study diverse architectures for dialogue modelling.

**Sequence-to-sequence**: Seq2seq models were again front and center. It is not common for a method to have its own session two years after its introduction (Sutskever et al., 2014). While in the past years, many papers employed seq2seq e.g. for Neural Machine Translation, some papers this year focused on improving the seq2seq framework:

- Wiseman and Rush extend seq2seq to learn global sequence scores;
- Yu et al. perform online seq2seq learning;
- Kim and Rush extend Knowledge Distillation (Hinton et al., 2015) to seq2seq learning;
- Kikuchi et al. propose a method to control the output length in seq2seq models.

**Semantic parsing**: Since seq2seq's use for dialogue modelling was popularised by Vinyals and Le, it is harder to get it to work with goal-oriented tasks that require an intermediate representation on which to act. Semantic parsing is used to convert a message into a more meaningful representation that can be used by another component of the system. As this technique is useful for sophisticated dialogue systems, it is great to see progress in this area:

- Yaruz et al. infer answer types for semantic parsing;
- Krishnamurthy et al. parse to probabilistic programs;
- Kocisky et al. perform semantic parsing with semi-supervised sequential auto encoders.

**X-to-text (or natural language generation)**: While mapping from text-to-text with the seq2seq paradigm is still prevalent, EMNLP featured some cool papers on natural language generation from other inputs:

- Kiddon et al. map from a recipe name and ingredients to a recipe by ticking of items off a checklist;
- Ghazvininejad et al. generate a poem based on a topic;
- Monroe et al. map from a color to its name;
- Koncel-Kedziorski et al. re-write the theme of an algebra problem (think: boring physics book to Star Wars);
- Lebret et al. generate biographical sentences from Wikipedia fact tables.

**Parsing**: Parsing and syntax are a mainstay of every NLP conference and the community seems to particularly appreciate innovative models that push the state-of-the-art in parsing: The ACL '16 outstanding paper by Andor et al. introduced a globally normalized model for parsing, while the best EMNLP ‘16 paper by Lee et al. combines a global parsing model with a local search over subtrees.

**Word embeddings**: There were still papers on word embeddings, but it felt less overwhelming than at the past EMNLP or ACL, with most methods trying to fix a particular flaw rather than training embeddings for embeddings' sake. Pilevhar and Collier de-conflate senses in word embeddings, while Wieting et al. achieve state-of-the-art results for character-based embeddings.

**Sentiment analysis**: Sentiment analysis has been popular in recent years (as attested by the introductions of many recent papers on sentiment analysis). Sadly, many of the conference papers on sentiment analysis reduce to leveraging the latest deep neural network for the task to beat the previous state-of-the-art without providing additional insights. There are, however, some that break the mold: Teng et al. find an effective way to incorporate sentiment lexicons into a neural network, while Hu et al. incorporate structured knowledge into their sentiment analysis model.

**Deep Learning**: By now, it is clear to everyone: Deep Learning is here to stay. In fact, *deep learning* and *neural networks* claimed the two top spots of keywords that were used to describe the submitted papers. The majority of papers used at least an LSTM; using no neural network seems almost contrarian now and is something that needs to be justified. However, there are still many things that need to be improved -- which leads us to...

**Uphill Battles**: While making incremental progress is important to secure grants and publish papers, we should not lose track of the long-term goals. In this spirit, one of the best workshops that I've attended was the *Uphill Battles in Language Processing* workshop, which featured 12 talks and not one, but *four* all-star panels on text understanding, natural language generation, dialogue and speech, and grounded language. Summaries of the panel discussions should be available soon at the workshop website.

This was my brief review of some of the trends of EMNLP 2016. I hope it was helpful.

Cover image credit: Jackie Cheung

]]>Excuse the rather clickbait-y title. This is a blog post that I meant to write for a while. In this post, I want to highlight the factors, i.e. the secret ingredients that account for the success

]]>Table of Contents:

Excuse the rather clickbait-y title. This is a blog post that I meant to write for a while. In this post, I want to highlight the factors, i.e. the secret ingredients that account for the success of word2vec.

In particular, I want to focus on the connection between word embeddings trained via neural models and those produced by traditional distributional semantics models (DSMs). By showing how these ingredients can be transferred to DSMs, I will demonstrate that distributional methods are in no way inferior to the popular word embedding methods.

Even though this is no new insight, I feel that traditional methods are frequently overshadowed amid the deep learning craze and their relevancy consequently deserves to be mentioned more often.

To this effect, the paper on which this blog post is based is *Improving Distributional Similarity with Lessons Learned from Word Embeddings* [^{1}] by Levy et al. (2015). If you haven't read it, I recommend you to check it out.

Over the course of this blog post, I will first introduce GloVe, a popular word embedding model. I will then highlight the connection between word embedding models and distributional semantic methods. Subsequently, I will introduce the four models that will be used to measure the impact of the different factors. I will then give an overview of all additional factors that play a role in learning word representations, besides the choice of the algorithm. I will finally present the results by Levy et al., their takeaways and recommendations.

In a previous blog post, we have given an overview of popular word embedding models. One model that we have omitted so far is GloVe [^{2}].

Briefly, GloVe seeks to make explicit what SGNS does implicitly: Encoding meaning as vector offsets in an embedding space -- seemingly only a serendipitous by-product of word2vec -- is the specified goal of GloVe.

Specifically, the authors of Glove show that the ratio of the co-occurrence probabilities of two words (rather than their co-occurrence probabilities themselves) is what contains information and aim to encode this information as vector differences.

To achieve this, they propose a weighted least squares objective \(J\) that directly aims to minimise the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:

\(J = \sum\limits_{i, j=1}^V f(X_{ij}) \: (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \text{log} \: X_{ij})^2 \)

where \(w_i\) and \(b_i\) are the word vector and bias respectively of word \(i\), \(\tilde{w}_j\) and \(b_j\) are the context word vector and bias respectively of word \(j\), \(X_{ij}\) is the number of times word \(i\) occurs in the context of word \(j\), and \(f\) is a weighting function that assigns relatively lower weight to rare and frequent co-occurrences.

As co-occurrence counts can be directly encoded in a word-context co-occurrence matrix, GloVe takes such a matrix rather than the entire corpus as input.

If you want to know more about GloVe, the best reference is likely the paper and the accompanying website. Besides that, you can find some additional intuitions on GloVe and its difference to word2vec by the author of gensim here, in this Quora thread, and in this blog post.

The reason why word embedding models, particularly word2vec and GloVe, became so popular is that they seemed to continuously and significantly outperform DSMs. Many attributed this to the neural architecture of word2vec or the fact that it predicts words, which seemed to have a natural edge over solely relying on co-occurrence counts.

We can view DSMs as *count* models as they "count" co-occurrences among words by operating on co-occurrence matrices. In contrast, neural word embedding models can be seen as *predict* models, as they try to predict surrounding words.

In 2014, Baroni et al. [^{3}] showed that predict models consistently outperform count models in almost all tasks, thus providing a clear verification for the apparent superiority of word embedding models. Is this the end? No.

Already with GloVe we've seen that the differences are not as clear-cut: While GloVe is considered a predict model by Levy et al. (2015), it is clearly factorizing a word-context co-occurrence matrix, which brings it close to traditional methods such as PCA and LSA. Even more, Levy et al. [^{4}] demonstrate that word2vec implicitly factorizes a word-context PMI matrix.

Consequently, while on the surface DSMs and word embedding models use different algorithms to learn word representations -- the first count, the latter predict -- fundamentally, both types of models act on the same underlying statistics of the data, i.e. the co-occurrence counts between words.

Thus, the question that still remains and which we will dedicate the rest of this blog post to answering is the following:

*Why do word embedding models still perform better than DSM with almost the same information?*

Following Levy et al. (2015), we will isolate and identify the factors that account for the success of neural word embedding models and show how these can be transferred to traditional methods by comparing the following four models:

**Positive Pointwise Mutual Information (PPMI)**: PMI is a common measure for the strength of association between two words. It is defined as the log ratio between the joint probability of two words \(w\) and \(c\) and the product of their marginal probabilities: \(PMI(w,c) = \text{log} \: \dfrac{P(w,c)}{P(w)\:P(c)} \). As \( PMI(w,c) = \text{log} \: 0 = - \infty \) for pairs \( (w,c) \) that were never observed, PMI is in practice often replaced with*positive*PMI (PPMI), which replaces negative values with \(0\), yielding \(PPMI(w,c) = \text{max}(PMI(w,c),0)\).**Singular Value Decomposition (SVD):**SVD is one of the most popular methods for dimensionality reduction and found its into NLP originally via latent semantic analysis (LSA). SVD factories the word-context co-occurrence matrix into the product of three matrices \(U \cdot \Sigma \times V^T \) where \(U\) and \(V\) are orthonormal matrices (i.e. square matrices whose rows and columns are orthogonal unit vectors) and \(\Sigma\) is a diagonal matrix of eigenvalues in decreasing order. In practice, SVD is often used to factorize the matrix produced by PPMI. Generally, only the top \(d\) elements of \(\Sigma\) are kept, yielding \(W^{SVD} = U_d \cdot \Sigma_d\) and \(C^{SVD} = V_d\), which are typically used as the word and context representations respectively.**Skip-gram with Negative Sampling (SGNS)**aka word2vec: To learn more about the skip-gram architecture and the negative sampling refer to my previous blog posts here and here respectively.**Global Vectors (GloVe)**as presented in the previous section.

We will look at the following hyper-parameters:

Word2vec introduces three ways of pre-processing a corpus, which can be easily applied to DSMs.

In DSMs traditionally, the context window is unweighted and of a constant size. Both SGNS and GloVe, however, use a scheme that assigns more weight to closer words, as closer words are generally considered to be more important to a word's meaning. Additionally, in SGNS, the window size is not fixed, but the actual window size is dynamic and sampled uniformly between \(1\) and the maximum window size during training.

SGNS dilutes very frequent words by randomly removing words whose frequency \(f\) is higher than some threshold \(t\) with a probability \(p = 1 - \sqrt{\dfrac{t}{f}}\). As this subsampling is done *before* actually creating the windows, the context windows used by SGNS in practice are larger than indicated by the context window size.

In the pre-processing of SGNS, rare words are also deleted *before* creating the context windows, which increases the actual size of the context windows further. Levy et al. (2015) find this not to have a significant performance impact, though.

PMI has been shown to be an effective metric for measuring the association between words. Since Levy and Goldberg (2014) have shown SGNS to implicitly factorize a PMI matrix, two variations stemming from this formulation can be introduced to regular PMI.

In SGNS, the higher the number of negative samples \(k\), the more data is being used and the better should be the estimation of the parameters. \(k\) affects the shift of the PMI matrix that is implicitly factorized by word2vec, i.e. \(k\) k shifts the PMI values by log \(k\).

If we transfer this to regular PMI, we obtain Shifted PPMI (SPPMI): \(SPPMI(w,c) = \text{max}(PMI(w,c) - \text{log} \: k,0)\).

In SGNS, the negative samples are sampled according to a *smoothed* unigram distribution, i.e. an unigram distribution raised to the power of \(\alpha\), which is empirically set to \(\dfrac{3}{4}\). This leads to frequent words being sampled relatively less often than their frequency would indicate.

We can transfer this to PMI by equally raising the frequency of the context words \(f(c)\) to the power of \(\alpha\):

\(PMI(w, c) = \text{log} \dfrac{p(w,c)}{p(w)p_\alpha(c)}\) where \(p_\alpha(c) = \dfrac{f(c)^\alpha}{\sum_c f(c)^\alpha}\) and \(f(x)\) is the frequency of word \(x\).

Similar as in pre-processing, three methods can be used to modify the word vectors produced by an algorithm.

The authors of GloVe propose to add word vectors and context vectors to create the final output vectors, e.g. \(\vec{v}_{\text{cat}} = \vec{w}_{\text{cat}} + \vec{c}_{\text{cat}}\). This adds first-order similarity terms, i.e \(w \cdot v\). However, this method cannot be applied to PMI, as the vectors produced by PMI are sparse.

SVD produces the following matrices: \(W^{SVD} = U_d \cdot \Sigma_d \) and \(C^{SVD} = V_d\). These matrices, however, have different properties: \(C^{SVD}\) is orthonormal, while \(W^{SVD}\) is not.

SGNS, in contrast, is more symmetric. We can thus weight the eigenvalue matrix \(\Sigma_d\) with an additional parameter \(p\), which can be tuned, to yield the following:

\(W^{SVD} = U_d \cdot \Sigma_d^p\).

Finally, we can also normalise all vectors to unit length.

Levy et al. (2015) train all models on a dump of the English wikipedia and evaluate them on the commonly used word similarity and analogy datasets. You can read more about the experimental setup and training details in their paper. We summarise the most important results and takeaways below.

Levy et al. find that SVD -- and not one of the word embedding algorithms -- performs best on similarity tasks, while SGNS performs best on analogy datasets. They furthermore shed light on the importance of hyperparameters compared to other choices:

- Hyperparameters vs. algorithms:

Hyperparameter settings are often more important than algorithm choice.

No single algorithm consistently outperforms the other methods. - Hyperparameters vs. more data:

Training on a larger corpus helps for some tasks.

In 3 out of 6 cases, tuning hyperparameters is more beneficial.

Equipped with these insights, we can now debunk some generally held claims:

- Are embeddings superior to distributional methods?

With the right hyperparameters, no approach has a consistent advantage over another. - Is GloVe superior to SGNS?

SGNS outperforms GloVe on all tasks. - Is CBOW a good word2vec configuration?

CBOW does not outperform SGNS on any task.

Finally -- and one of the things I like most about the paper -- we can give concrete practical recommendations:

**DON'T**use shifted PPMI with SVD.**DON'T**use SVD "correctly", i.e. without eigenvector weighting (performance drops 15 points compared to with eigenvalue weighting with \(p = 0.5\)).**DO**use PPMI and SVD with short contexts (window size of \(2\)).**DO**use many negative samples with SGNS.**DO**always use context distribution smoothing (raise unigram distribution to the power of \(\alpha = 0.75\)) for all methods.**DO**use SGNS as a baseline (robust, fast and cheap to train).**DO**try adding context vectors in SGNS and GloVe.

These results run counter to what is generally assumed, namely that word embeddings are superior to traditional methods and indicate that it generally makes *no difference whatsoever* whether you use word embeddings or distributional methods -- what matters is that you tune your hyperparameters and employ the appropriate pre-processing and post-processing steps.

Recent papers from Jurafsky's group [^{5}, ^{6}] echo these findings and show that SVD -- not SGNS -- is often the preferred choice when you care about accurate word representations.

I hope this blog post was useful in highlighting cool research that sheds light on the link between traditional distributional semantic and in-vogue embedding models. As we've seen, knowledge of distributional semantics allows us to improve upon our current methods and develop entirely new variations of existing ones. For this reason, I hope that the next time you train word embeddings, you will consider adding distributional methods to your toolbox or lean on them for inspiration.

As always, feel free to ask questions and point out the mistakes I made in this blog post in the comments below.

If you want to learn more about word embeddings, these other blog posts on word embeddings are also available:

- On word embeddings - Part 1
- On word embeddings - Part 2: Approximating the softmax
- Unofficial Part 4: A survey of cross-lingual embedding models

Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225. Retrieved from https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/570 ↩

Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543. http://doi.org/10.3115/v1/D14-1162 ↩

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL, 238–247. http://doi.org/10.3115/v1/P14-1023 ↩

Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. Advances in Neural Information Processing Systems (NIPS), 2177–2185. Retrieved from http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization ↩

Hamilton, W. L., Clark, K., Leskovec, J., & Jurafsky, D. (2016). Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Retrieved from http://arxiv.org/abs/1606.02820 ↩

Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. arXiv Preprint arXiv:1605.09096. ↩

Cover images are courtesy of Stanford.

]]>Table of contents:

From July 20th to July 28th 2016, I had the opportunity of attending the 6th Lisbon Machine Learning School. The Lisbon Machine Learning School (LxMLS) is an annual event that brings together researchers and graduate students in the fields of NLP and Computational Linguistics, computer scientists with an interest in statistics and ML, and industry practitioners with a desire for a more in-depth understanding.

Participants had a chance to join workshops and labs, where they got hands-on experience with building and exploring state-of-the-art deep learning models, as well as to attend talks and speeches by prominent deep learning and NLP researchers from a variety of academic and industrial organisations. You can find the entire programme here.

In this blog post, I am going to share some of the highlights, key insights and takeaways of the summer school. I am going to skip the lectures of the first and second day as they introduce basic Python, Linear Algebra, and Probability Theory concepts and focus on the later lectures and talks. First, we are going to talk about sequence models. We will then turn to structured prediction, a type of supervised ML common to NLP. We will then summarize the lecture on Syntax and Parsing and finally provide insights with regard to Deep Learning. The accompanying slides can be found here.

**Disclaimer**: This blog post is not meant to give a comprehensive introduction of each of the topics discussed; it should rather give you an overview of the week-long event and provide you with pointers if you want to delve deeper into any of the topics.

**Note**: This blog post also appeared at the AYLIEN blog here.

Noah Smith of the University Washington kicked off the third day of the summer school with a compelling lecture about sequence models. To test your understanding of sequence models, try to answer -- without reading further -- the following question: What is the most basic sequence model depicted in Figure 1?

Correct! It is the bag-of-words (notice which words have "fallen" out of the bag-of-words). The bag-of-words model makes the strongest independence assumption of all sequence models: It supposes that each word is entirely independent of its predecessors. It is obvious why models that rely on this assumption do only a poor job at modelling language: every word naturally depends on the words that have preceded it.

Somewhat more sophisticated models thus relax this naive assumption to reduce the entropy: A 1st Order Markov model makes each word dependent on the word that immediately preceded it. This way, it is already able to capture some context of the context that can help to disambiguate a new word. More generally, \(m^{\text{th}}\) Order Markov Models make each word depend on its previous \(m\) words.

In mathematical terms, in \(m^{\text{th}}\) Order Markov Models, the probability of a text sequence (we assume here that such a sequence is delimited by start and stop symbols) can be calculated using the chain rule as the product of the probabilities of the individual words:

\(p(\text{start}, w_1, w_2, ..., w_n, \text{stop}) = \prod\limits_{i=1}^{n+1} \gamma (w_i \: | \: w_{i-m}, ..., w_{i-1}) \)

where \(\gamma\) is the probability of the current word \(w_i\) given its \(m\) previous words, i.e. the probability to transition from the previous words to the current word.

We can view bag-of-words and \(m^{\text{th}}\) Order Markov Models as occupying the following spectrum:

As we go right in Figure 2, we make weaker independence assumption and in exchange gain richer expressive power, while requiring more parameters -- until we eventually obtain the most expressive -- and most rigid -- model, a history-based model where each word depends on its entire history, i.e. all preceding words.

As a side-note, state-of-the-art sequence modelling models such as recurrent neural networks and LSTMs can be thought of as being located on the right side of this spectrum, as they don't require an explicit specification of context words but are -- theoretically -- able to take into account the entire history.

In many cases, we would not only like to model just the observed sequence of symbols, but take some additional information into account. Hidden Markov Models (HMMs) allow us to associate with each symbol \(w_i\) some missing information, its "state" \(s_i\). The probability of a word sequence in an HMM then not only depends on the transition probability \(\gamma\) but also on the so-called emission probability \(\eta\):

\(p(\text{start}, w_1, w_2, ..., w_n, \text{stop}) = \prod\limits_{i=1}^{n+1} \eta (w_i \: | \: s_i) \: \gamma (s_i \: | \: s_{i-1}) \)

Consequently, the HMM is a joint model over observable symbols and hidden/latent/unknown classes. HMMs have traditionally been used in part-of-speech tagging or named entity recognition where the hidden states are POS and NER tags respectively.

If we want to determine the most probable sequence of hidden states, we face a space of potential sequences that grows exponentially with the sequence length. The classic dynamic algorithm to cope with this problem is the Viterbi algorithm, which is used in HMMs, CRFs, and other sequence models to calculate the most probable sequence of hidden states: It lays out the symbol sequence and all possible states in a grid and proceeds left-to-right to compute the maximum probability to transition in every new state given the previous states. The most probable sequence can then be found by back-tracking as in Figure 3.

A close relative is the forward-backward algorithm, which is used to calculate the probability of a word sequence and the probabilities of each word's states, e.g. for language modelling. Indeed, the only difference between Viterbi and the forward-backward algorithm is that Viterbi takes the maximum of the probabilities of the previous state, while forward-backward takes the sum. In this sense, they correspond to the same abstract algorithm, which is instantiated in two different semirings where a semiring informally is a set of values and some operations that obey certain properties.

Finally, if we want to learn HMMs in an unsupervised way, we use the well-known Expectation Maximisation (EM) algorithm, which consists of two steps: During the E step, we calculate the probability of each possible transition and emission at every position with forward-backward (or Viterbi for "hard" EM); for the M step, we re-estimate the parameters with MLE.

On the evening of the third day, Philipp Koehn, one of the pioneers of MT and inventor of phrase-based machine translation gave a talk on Machine Translation as Sequence Modelling, including a detailed review of different MT and alignment approaches. If you are interested in a comprehensive history of MT that takes you from IBM Model 1 all the way to phrase-based, syntax-based and eventually neural MT, while delving into the details of alignment, translation, and decoding, definitely check out the slides here.

HMMs can model sequences, but as their weights are tied to the generative process, strong independence assumptions need to be made to make their computation tractable. We will now turn to a category of models that are more expressive and can be used to predict more complex structures: Structured prediction -- which was introduced by Xavier Carreras of Xerox Research on the morning of the fourth day -- is used to refer to ML algorithms that don’t just predict scalar or real values, but more complex structures. As complex structures are common in language, so is structured prediction; example tasks of structured prediction in NLP include POS tagging, named entity recognition, machine translation, parsing, and many others.

A successful category of structured prediction models are log-linear models, which are so-called because they model log-probabilities using a linear predictor. Such models try to estimate the parameters \(w\) by calculating the following probability:

\(\text{log} \: \text{Pr}(\mathbf{y} \: | \: \mathbf{x}; \mathbf{w}) = \text{log} \: \dfrac{\text{exp}\{\mathbf{w} \cdot \mathbf{f}(\mathbf{x},\mathbf{y})\}}{Z(\mathbf{x};\mathbf{w})}\)

where \(\mathbf{x} = x_1, x_2, ..., x_n \in \mathcal{X}\) is the sequence of symbols, \(\mathbf{y} = y_1, y_2, ..., y_n \in \mathcal{Y}\) is the corresponding sequence of labels, \(\mathbf{f}(\mathbf{x},\mathbf{y})\) is a feature representation of \(\mathbf{x}\) and \(\mathbf{y}\), and \(Z(\mathbf{x};\mathbf{w}) = \sum\limits_{\mathbf{y}' \in \mathcal{Y}} \text{exp}(\mathbf{w} \cdot \mathbf{f}(\mathbf{x},\mathbf{y}')) \) is also referred to as the partition function.

Two approaches that can be used to estimate the model parameters \(w\) are:

- Maximum Entropy Markov Models (MEMMs), which assume that \(\text{Pr}(\mathbf{y} \: | \: \mathbf{x}; \mathbf{w})\) decomposes, i.e. that we can express it as a product of the individual label probabilities that only depend on the previous label (similar to HMMs).
- Conditional Random Fields (CRFs), which make a weaker assumption by only assuming that \(\mathbf{f}(\mathbf{x},\mathbf{y})\) decomposes.

In MEMMs, we assume -- similarly to Markov Models -- that the label \(y_i\) at does not depend on *all* past labels, but only on the previous label \(y_{i-1}\). In contrast to Markov Models, MEMMs allow us to condition the label \(y_i\) on the entire symbol sequence \(x_{1:n}\). Both assumptions combined lead to the following probability of label \(y_i\) in MEMMs:

\(\text{Pr}(y_i \: | \: x_{1:n}, y_{1:i-1}) = \text{Pr}(y_i \: | \: x_{1:n}, y_{i-1})\)

By this formulation, the objective of MEMMs reduces sequence modelling to multi-class logistic regression.

In CRFs, we factorize on label bigrams. Instead of greedily predicting the most probable label \(y_i\) at every position \(i\), we instead aim to find the sequence of labels with the maximum probability:

\(\underset{y \in \mathcal{Y}}{\text{argmax}} \sum_i \mathbf{w} \cdot \mathbf{f}(\mathbf{x}, i, y_{i-1}, y_i)\)

We then estimate the parameters \(w\) of our model using gradient-based methods where we can use forward-backward to compute the gradient.

By choosing between MEMMs and CRFs, we make the choice between local and global normalisation. While MEMMs aim to predict the most probable label at every position, CRFs aim to find the most probable label sequence. This, however, leads to the so-called "Label Bias Problem" in MEMMs: As MEMMs choose the label with the highest probability, the model is biased towards more frequent labels, often irrespective of the local context.

As MEMMs reduce to multi-class classification, they are cheaper to train. On the other hand, CRFs are more flexible and thus easier to extend to more complex structures.

This distinction between local and global normalisation has been a recurring topic in sequence modelling and a key criterion when choosing an algorithm. For text generation tasks, global normalisation is still too expensive, however. Many state-of-the-art approaches thus employ beam search as a compromise between local and global normalisation. In most sequence modelling tasks, local normalisation is very popular due to its ease of use, but might fall out of favour as more advanced models and implementations for global normalisation become available. To this effect, a recent outstanding paper at ACL (Andor et al., 2016) shows that globally normalised models are strictly more expressive than locally normalised ones.

Another distinction that is worth investigating is the difference between generative and discriminative models: HMMs are generative models, while CRFs are discriminative. HMMs only take into account the previous word as its features are tied to the generative process. In contrast, CRF features are very flexible. They can look at the whole input \(x\) paired with a label bigram \((y_i , y_{i+1})\). In practice, for prediction tasks, such “good” discriminative features can improve accuracy a lot.

Regarding the parameter estimation, the distinction between generative and discriminative becomes apparent: HMMs focus on explaining the data, both \(x\) and \(y\), while CRFs focus on the mapping from \(x\) to \(y\). Which model is more appropriate depends on the task: CRFs are commonly used in tasks such as POS tagging and NER, while HMMs have traditionally lain at the heart of speech recognition.

Andreas Vlachos of the University of Sheffield gave a talk on using imitation learning for structured prediction in NLP, which followed the same distinction between local normalisation (aka incremental modelling), i.e. greedily predicting one label at a time and global normalisation (aka joint modelling), i.e. scoring the complete outputs with a CRF that we discussed above. Andreas talked about how imitation learning can be used to improve incremental modelling as it allows to a) explore the search space, b) to address error-propagation, and c) to train with regard to the task-level loss function.

There are many popular imitation learning algorithms in the literature such as SEARN (Daumé III et al., 2009), Dagger (Ross et al. 2011), or V-DAgger (Vlachos and Clark, 2014). Recently, MIXER (Ranzato et al., 2016) has been proposed to directly optimise metrics for text generation, such as BLEU or ROUGE.

An interesting perspective is that imitation learning can be seen as inverse reinforcement learning: Whereas we want to learn the best policy in reinforcement learning, we know the optimal policy in imitation learning, i.e. the labels in the training data; we then infer the per-action reward function and learn a policy, i.e. a classifier that can generalise to unseen data.

In the evening of the fourth day, we presented -- along with other NLP companies and research labs -- Aylien at the LxMLS Demo Day.

We presented an overview of our research directions at Aylien, as well as a 1D generative adversarial network demo and visualization.

Having looked at generic models that are able to cope with sequences and more complex structures, we now briefly mention some of the techniques that are commonly used to deal with one of language’s unique characteristics: syntax. To this end, Slav Petrov of Google Research gave an in-depth lecture about syntax and parsing on the fifth day of the summer school, which discussed, among others, successful parsers such as the Charniak and the Berkeley parser, context-free grammars and phrase-based parsing, projective and non-projective dependency parsing, as well as more recent transition and graph-based parsers.

To tie this to what we've already discussed, Figure 5 demonstrates how the distinction between generative and discriminative models applies to parsers.

In the evening of the fifth day, André Martins of Unbabel gave talk a on an ACL 2015 paper of his, in which he shows that constituent parsing can be reduced to dependency parsing to get the best of both worlds: the informativeness of constituent parser output and the speed of dependency parsers.

Their approach works for any out-of-the-box dependency parser, is competitive for English and morphologically rich languages, and achieves results above the state of the art for discontinuous parsing (where edges are allowed to intersect).

Finally, the two last days were dedicated to Deep Learning and featured prolific researchers from academia and industry labs as speakers. On the morning of the sixth day, Wang Ling of Google DeepMind gave one of the most gentle, family-friendly intro to Deep Learning I've seen -- titled *Deep Neural Networks Are Our Friends* with a theme inspired by the Muppets.

The evening talk by Oriol Vinyals of Google DeepMind detailed some of his lessons learned when working on sequence-to-sequence models at Google and gave glimpses of interesting future challenges, among them, one-shot learning for NLP (Vinyals et al., 2016) and enabling neural networks to ponder decisions (Graves, 2016).

For the lecture on the last day, Chris Dyer of CMU and Google DeepMind discussed modelling sequential data with recurrent neural networks (RNNs) and shared some insights and intuitions with regard to working with RNNs and LSTMs.

If you've worked with RNNs before, then you're most likely familiar with the exploding/vanishing gradients problem: As the length of the sequence increases, computations of the model get amplified, which leads to either exploding or vanishing gradients and thus renders the model incapable to learn. The intuition why advanced models such as LSTMs and GRUs mitigate this problem is that they use summations instead of multiplications (which lead to exponential growth).

Deep or stacked LSTMs are by now a very common sight in the literature and state-of-the-art for many sequence modelling problems. Still, descriptions of implementations often omit details, which might be perceived as self-evident. This, however, means that it is not always clear how a model looks like exactly or how it differs from similar architectures. The same applies to Deep LSTMs. The most standard convention feeds the input not only to the first but (via skip connections) also to subsequent layers as in Figure 6. Additionally, dropout is generally applied only between layers and not on the recurrent connections as this would drop out more and more value over time.

Generally, depth helps. However, in comparison to other applications such as audio/visual processing, depth plays a less significant role in NLP. Hypotheses for this observation are: a) More transformation is required for speech processing, image recognition etc. than for common text applications; b) Less effort has been made to find good architectures (RNNs are expensive to train; have been widely used for less long); c) Backpropagation through time and depth is hard and we need better optimisers.

Generally, 2-8 layers are standard across text applications. Input skip connections are used often but by no means universally.

Only recently have also very deep architectures been proposed for NLP (Conneau et al., 2016).

Mini-batching is generally necessary to make use of optimised matrix-matrix multiplication. In practice, however, this usually requires bucketing training instances based on similar lengths and padding with \(0\)'s, which can be a nuisance. Because of this, this is -- according to Chris Dyer -- "the era of assembly language programming for neural networks. Make the future an easier place to program!"

Character-based models have gained more popularity recently and for some tasks such as language modelling, using character-based LSTMs blows the results of word-based models out of the water, achieving a significantly lower perplexity with a lot fewer parameters particularly for morphologically rich languages.

Finally, no overview of recurrent neural networks is complete without the mention of attention, one of the most influential, recently proposed notions with regard to LSTMs. Attention is closely related to “pooling” operations in convolutional neural networks (and other architectures) as it also allows to selectively focus on particular elements of the input.

The most popular attention architecture pioneered by Bahdanau et al. (2015) seems to only care about “content” in that it relies on computing the dot product, i.e. the cosine similarity between vectors. It contains no obvious bias in favor of diagonals, short jumps, fertility, or other structures that might guide actual attention from a psycho-linguistic perspective.

Some work has begun to add other “structural” biases (Luong et al., 2015; Cohn et al., 2016), but there are many more opportunities for research.

Attention is similar to alignment, but there are important differences: a) alignment makes stochastic but hard decisions. Even if the alignment probability distribution is “flat”, the model picks one word or phrase at a time; b)

in contrast, attention is “soft” (all words are interpolated based on their attention weights). Finally, there is a big difference between “flat” and “peaked” attention weights.

Antoine Bordes of Facebook AI Research gave the last talk of the summer school, in which he discussed Facebook AI Research's two main research directions:

On the one hand, they are working on (Key-Value) Memory Networks, which can be used to jointly model symbolic and continuous systems. They can be trained end-to-end through back propagation and with SGD and provide great flexibility on how to design memories.

On the other hand, they are working on new tools for developing learning algorithms. They have created several datasets of reasonable sizes, such as bAbI, CBT, and MovieQA that are designed to ease interpretation of a model's capabilities and to foster research.

That was the Lisbon Machine Learning Summer School 2016! I had a blast and hope to be back next year!

- Sequence Models (Noah Smith)
- Machine Translation as Sequence Modelling (Philipp Koehn)
- Learning Structured Predictors (Xavier Carreras)
- Structured Prediction in NLP with Imitation Learning (Andreas Vlachos)
- Syntax and Parsing - I, II (Slav Petrov)
- Turbo Parser Redux: From Dependencies to Constituents (André Martins)
- Deep Neural Networks Are Our Friends (Wang Ling)
- Modeling Sequential Data with Recurrent Networks (Chris Dyer)
- Memory Networks for Language Understanding (Antoine Bordes)

Table of contents:

This is the second post in a series on word embeddings and representation learning. In the previous post, we gave an overview of word embedding models and introduced the classic neural language model by Bengio et al. (2003), the C&W model by Collobert and Weston (2008), and the word2vec model by Mikolov et al. (2013). We observed that mitigating the complexity of computing the final softmax layer has been one of the main challenges in devising better word embedding models, a commonality with machine translation (MT) (Jean et al. [^{10}]) and language modelling (Jozefowicz et al. [^{6}]).

In this post, we will thus focus on giving an overview of various approximations to the softmax layer that have been proposed over the last years, some of which have so far only been employed in the context of language modelling or MT. We will postpone the discussion of additional hyperparameters to the subsequent post.

Let us know partially re-introduce the previous post's notation both for consistency and to facilitate comparison as well as introduce some new notation: We assume a training corpus containing a sequence of \(T\) training words \(w_1, w_2, w_3, \cdots, w_T\) that belong to a vocabulary \(V\) whose size is \(|V|\). Our models generally consider a context \(c\) of \( n \) words. We associate every word with an input embedding \( v_w \) (the eponymous word embedding in the Embedding Layer) with \(d\) dimensions and an output embedding \( v'_w \) (the representation of the word in the weight matrix of the softmax layer). We finally optimize an objective function \(J_\theta\) with regard to our model parameters \(\theta\).

Recall that the softmax calculates the probability of a word \(w\) given its context \(c\) and can be computed using the following equation:

\(p(w \: | \: c) = \dfrac{\text{exp}({h^\top v'_w})}{\sum_{w_i \in V} \text{exp}({h^\top v'_{w_i}})} \)

where \(h\) is the output vector of the penultimate network layer. Note that we use \(c\) for the context as mentioned above and drop the index \(t\) of the target word \(w_t\) for simplicity. Computing the softmax is expensive as the inner product between \(h\) and the output embedding of every word \(w_i\) in the vocabulary \(V\) needs to be computed as part of the sum in the denominator in order to obtain the normalized probability of the target word \(w\) given its context \(c\).

In the following we will discuss different strategies that have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact, but modify its architecture to improve its efficiency. Sampling-based approaches on the other hand completely do away with the softmax layer and instead optimise some other loss function that approximates the softmax.

Hierarchical softmax (H-Softmax) is an approximation inspired by binary trees that was proposed by Morin and Bengio [^{3}]. H-Softmax essentially replaces the flat softmax layer with a hierarchical layer that has the words as leaves, as can be seen in Figure 1.

This allows us to decompose calculating the probability of one word into a sequence of probability calculations, which saves us from having to calculate the expensive normalization over all words. Replacing a softmax layer with H-Softmax can yield speedups for word prediction tasks of at least \(50 \times\) and is thus critical for low-latency tasks such as real-time communication in Google's new messenger app Allo.

We can think of the regular softmax as a tree of depth \(1\), with each word in \(V\) as a leaf node. Computing the softmax probability of one word then requires normalizing over the probabilities of all \(|V|\) leaves. If we instead structure the softmax as a binary tree, with the words as leaf nodes, then we only need to follow the path to the leaf node of that word, without having to consider any of the other nodes.

Since a balanced binary tree has a depth of \(\text{log}_2 (\:|V|\:)\), we only need to evaluate at most \(\text{log}_2 (\:|V|\:)\) nodes to obtain the final probability of a word. Note that this probability is already normalized, as the probabilities of all leaves in a binary tree sum to \( 1 \) and thus form a probability distribution. To informally verify this, we can reason that at a tree's root node (Node 0) in Figure 1), the probabilities of branching decisions must sum to \(1\). At each subsequent node, the probability mass is then split among its children, until it eventually ends up at the leaf nodes, i.e. the words. Since no probability is lost along the way and since all words are leaves, the probabilities of all words must necessarily sum to \(1\) and hence the hierarchical softmax defines a normalized probability distribution over all words in \(V\).

To get a bit more concrete, as we go through the tree, we have to be able to calculate the probability of taking the left or right branch at every junction. For this reason, we assign a representation to every node. In contrast to the regular softmax, we thus no longer have output embeddings \(v'_w\) for every word \(w\) -- instead, we have embeddings \(v'_n\) for every node \(n\). As we have \(|V|-1\) nodes and each one possesses a unique representation, the number of parameters of H-Softmax is almost the same as for the regular softmax. We can now calculate the probability of going right (or left) at a given node \(n\) given the context \(c\) the following way:

\( p(\: \text{right}\: | \: n, c) = \sigma (h^\top v'_{n})\).

This is almost the same as the computations in the regular softmax; now instead of computing the dot product between \(h\) and the output word embedding \(v'_{w}\), we compute the dot product between \(h\) and the embedding \(v'_{w}\) of each node in the tree; additionally, instead of computing a probability distribution over the entire vocabulary words, we output just one probability, the probability of going right at node \(n\) in this case, with the sigmoid function. Conversely, the probability of turning left is simply \( 1 - p(\:\text{right}\: | \: n,c)\).

The probability of a word \(w\) given its context \(c\) is then simply the product of the probabilities of taking right and left turns respectively that lead to its leaf node. To illustrate this, given the context "the", "dog", "and", "the", the probability of the word "cat" in Figure 2 can be computed as the product of the probability of turning left at node 1, turning right at node 2, and turning right at node 5. Hugo Lachorelle gives a more detailed account in his excellent lecture video. Rong [^{7}] also does a good job of explaining these concepts and also derives the derivatives of H-Softmax.

Obviously, the structure of the tree is of significance. Intuitively, we should be able to achieve better performance, if we make it easier for the model to learn the binary predictors at every node, e.g. by enabling it to assign similar probabilities to similar paths. Based on this idea, Morin and Bengio use the synsets in WordNet as clusters for the tree. However, they still report inferior performance to the regular softmax. Mnih and Hinton [^{8}] learn the tree structure with a clustering algorithm that recursively partitions the words in two clusters and allows them to achieve the same performance as the regular softmax at a fraction of the computation.

Notably, we are only able to obtain this speed-up during training, when we know the word we want to predict (and consequently its path) in advance. During testing, when we need to find the most likely prediction, we still need to calculate the probability of all words, although narrowing down the choices in advance helps here.

In practice, instead of using "right" and "left" in order to designate nodes, we can index every node with a bit vector that corresponds to the path it takes to reach that node. In Figure 2, if we assume a `0`

bit for turning left and a `1`

bit for turning right, we can thus represent the path to "cat" as `011`

.

Recall that the path length in a balanced binary tree is \(\text{log}_2 |V|\). If we set \(|V| = 10000\), this amounts to an average path length of about \( 13.3 \). Analogously, we can represent every word by the bit vector of its path that is on average \(13.3\) bits long. In information theory, this is referred to as an information content of \(13.3\) bits per word.

Recall that the information content \(I(w)\) of a word \(w\) is the negative logarithm of its probability \(p(w)\):

\(I(w) = -\log_2 p(w) \).

The entropy \( H \) of all words in a corpus is then the expectation of the information content of all words in the vocabulary:

\( H = \sum_{i\in V} p(w_i) \: I(w_i) \).

We can also conceive of the entropy of a data source as the average number of bits needed to encode it. For a fair coin flip, we need \(1\) bit per flip, whereas we need \(0\) bits for a data source that always emits the same symbol. For a balanced binary tree, where we treat every word equally, the word entropy \(H\) equals the information content \(I(w)\) of every word \(w\), as each word has the same probability. The average word entropy \(H\) in a balanced binary tree with \(|V| = 10000\) thus coincides with its average path length:

\( H = - \sum_{i\in V} \dfrac{1}{10000} \log_2 \dfrac{1}{10000} = 13.3\).

We saw before that the structure of the tree is important. Notably, we can leverage the tree structure not only to gain better performance, but also to speed up computation: If we manage to encode more information into the tree, we can get away with taking shorter paths for less informative words. Morin and Bengio point out that leveraging word probabilities should work even better; as some words are more likely to occur than others, they can be encoded using less information. They note that the word entropy of their corpus (with \(|V| = 10,000\)) is about \( 9.16 \).

Thus, by taking into account frequencies, we can reduce the average number of bits per word in the corpus from \( 13.3 \) to \( 9.16 \) in this case, which amounts to a speed-up of 31%. A Huffman tree, which is used by Mikolov et al. [^{1}] for their hierarchical softmax, generates such a coding by assigning fewer bits to more common symbols. For instance, "the", the most common word in the English language, would be assigned the shortest bit code in the tree, the second most frequent word would be assigned the second-shortest bit code, and so on. While we still need the same number of codes to designate all words, when we predict the words in a corpus, short codes appear now a lot more often, and we consequently need fewer bits to represent each word on average.

A coding such as Huffman coding is also known as entropy encoding, as the length of each codeword is approximately proportional to the entropy of each symbol as we have observed. Shannon [^{5}] establishes in his experiments that the lower bound on the information rate in English is between \(0.6\) to \(1.3\) bits per character; given an average word length of \(4.5\), this amounts to \(2.7\) - \(5.85\) bits per word.

To tie this back to language modelling (which we already talked about in the previous post): perplexity, the evaluation measure of language modelling, is \(2^{H}\) where \(H\) is the entropy. A unigram entropy of \( 9.16 \) thus entails a still very high perplexity of \( 2^{9.16} = 572.0\). We can render this value more tangible by observing that a model with a perplexity of \(572\) is as confused by the data as if it had to choose among \(572\) possibilities for each word uniformly and independently.

To put this into context: The state-of-the-art language model by Jozefowicz et al. (2016) achieves a perplexity of \(24.2\) per word on the 1B Word Benchmark. Such a model would thus require an average of around \(4.60\) bits to encode each word, as \( 2^{4.60} = 24.2 \), which is incredibly close to the experimental lower bounds documented by Shannon. If and how we could use such a model to construct a better hierarchical softmax layer is still left to be explored.

Chen et al. [^{9}] introduce a variation on the traditional softmax layer, the Differentiated Softmax (D-Softmax). D-Softmax is based on the intuition that not all words require the same number of parameters: Many occurrences of frequent words allow us to fit many parameters to them, while extremely rare words might only allow to fit a few.

In order to do this, instead of the dense matrix of the regular softmax layer of size \(d \: \times \: |V| \) containing the output word embeddings \( v'_w \in \mathbb{R}^d \), they use a sparse matrix. They then arrange \( v'_w\) in blocks sorted by frequency, with the embeddings in each block being of a certain dimensionality \(d_k \). The number of blocks and their embedding sizes are hyperparameters that can be tuned.

In Figure 3, embeddings in partition \(A\) are of dimensionality \(d_A\) (these are embeddings of frequent words, as they are allocated more parameters), while embeddings in partitions \(B\) and \(C\) have \(d_B\) and \(d_C\) dimensions respectively. Note that all areas not part of any partition, i.e. the non-shaded areas in Figure 1, are set to \(0\).

The output of the previous hidden layer \(h\) is treated as a concatenation of features corresponding to each partition of the dimensionality of that partition, e.g. \(h\) in Figure 3 is made up of partitions of size \(d_A\), \(d_B\), and \(d_B\) respectively. Instead of computing the matrix-vector product between the entire output embedding matrix and \(h\) as in the regular softmax, D-Softmax then computes the product of each partition and its corresponding section in \(h\).

As many words will only require comparatively few parameters, the complexity of computing the softmax is reduced, which speeds up training. In contrast to H-Softmax, this speed-up persists during testing. Chen et al. (2015) observe that D-Softmax is the fastest method when testing, while being one of the most accurate. However, as it assigns fewer parameters to rare words, D-Softmax does a worse job at modelling them.

Another modification to the traditional softmax layer is inspired by recent work by Kim et al. [^{13}] who produce input word embeddings \(v_w\) via a character-level CNN. Jozefowicz et al. (2016) in turn suggest to do the same thing for the output word embeddings \(v'_w\) via a character-level CNN -- and refer to this as CNN-Softmax. Note that if we have a CNN at the input and at the output as in Figure 4, the CNN generating the output word embeddings \(v'_w\) is necessarily different from the CNN generating the input word embeddings \(v_w\), just as the input and output word embedding matrices would be different.

While this still requires computing the regular softmax normalization, this approach drastically reduces the number of parameters of the model: Instead of storing an embedding matrix of \(d \: \times \: |V| \), we now only need to keep track of the parameters of the CNN. During testing, the output word embeddings \(v'_w\) can be pre-computed, so that there is no loss in performance.

However, as characters are represented in a continuous space and as the resulting model tends to learn a smooth function mapping characters to a word embedding, character-based models often find it difficult to differentiate between similarly spelled words with different meanings. To mitigate this, the authors add a correction factor that is learned per word, which significantly reduces the performance gap between regular and CNN-softmax. By adjusting the dimensionality of the correction term, the authors are able to trade-off model size versus performance.

The authors also note that instead of using a CNN-softmax, the output of the previous layer \(h\) can be fed to a character-level LSTM, which predicts the output word one character at a time. Instead of a softmax over words, a softmax outputting a probability distribution over characters would thus be used at every time step. They, however, fail to achieve competitive performance with this layer. Ling et al. [^{14}] use a similar layer for machine translation and achieve competitive results.

While the approaches discussed so far still maintain the overall structure of the softmax, sampling-based approaches on the other hand completely do away with the softmax layer. They do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute. However, sampling-based approaches are only useful at training time -- during inference, the full softmax still needs to be computed to obtain a normalised probability.

In order to gain some intuitions about the softmax denominator's impact on the loss, we will derive the gradient of our loss function \(J_\theta\) w.r.t. the parameters of our model \(\theta\).

During training, we aim to minimize the cross-entropy loss of our model for every word \(w\) in the training set. This is simply the negative logarithm of the output of our softmax. If you are unsure of this connection, have a look at Karpathy's explanation to gain some more intuitions about the connection between softmax and cross-entropy. The loss of our model is then the following:

\(J_\theta = - \: \text{log} \: \dfrac{\text{exp}({h^\top v'_{w}})}{\sum_{w_i \in V} \text{exp}({h^\top v'_{w_i}})} \).

Note that in practice \(J_\theta\) would be the average of all negative log-probabilities over the whole corpus. To facilitate the derivation, we decompose \(J_\theta \) into a sum as \(\text{log} \: \dfrac{x}{y} = \text{log} \: x - \text{log} \: y \):

\(J_\theta = - \: h^\top v'_{w} + \text{log} \sum_{w_i \in V} \text{exp}(h^\top v'_{w_i}) \)

For brevity and to conform with the notation of Bengio and Senécal [^{4}, ^{15}] (note that in the first paper, they compute the gradient of the *positive* logarithm), we replace the dot product \( h^\top v'_{w} \) with \( - \mathcal{E}(w) \). Our loss then looks like the following:

\(J_\theta = \: \mathcal{E}(w) + \text{log} \sum_{w_i \in V} \text{exp}( - \mathcal{E}(w_i)) \)

For back-propagation, we can now compute the gradient \(\nabla \) of \(J_\theta \) w.r.t. our model's parameters \(\theta\):

\(\nabla_\theta J_\theta = \: \nabla_\theta \mathcal{E}(w) + \nabla_\theta \text{log} \sum_{w_i \in V} \text{exp}(- \mathcal{E}(w_i)) \)

As the gradient of \( \text{log} \: x \) is \(\dfrac{1}{x} \), an application of the chain rule yields:

\(\nabla_\theta J_\theta = \: \nabla_\theta \mathcal{E}(w) + \dfrac{1}{\sum_{w_i \in V} \text{exp}(- \mathcal{E}(w_i))} \nabla_\theta \sum_{w_i \in V} \text{exp}(- \mathcal{E}(w_i) \)

We can now move the gradient inside the sum:

\(\nabla_\theta J_\theta = \: \nabla_\theta \mathcal{E}(w) + \dfrac{1}{\sum_{w_i \in V} \text{exp}(- \mathcal{E}(w_i))} \sum_{w_i \in V} \nabla_\theta \: \text{exp}(- \mathcal{E}(w_i)) \)

As the gradient of \(\text{exp}(x)\) is just \(\text{exp}(x)\), another application of the chain rule yields:

\(\nabla_\theta J_\theta = \: \nabla_\theta \mathcal{E}(w) + \dfrac{1}{\sum_{w_i \in V} \text{exp}(- \mathcal{E}(w_i))} \sum_{w_i \in V} \text{exp}(- \mathcal{E}(w_i)) \nabla_\theta (- \mathcal{E}(w_i)) \)

We can rewrite this as:

\(\nabla_\theta J_\theta = \: \nabla_\theta \mathcal{E}(w) + \sum_{w_i \in V} \dfrac{\text{exp}(- \mathcal{E}(w_i))}{\sum_{w_i \in V} \text{exp}(- \mathcal{E}(w_i))} \nabla_\theta (- \mathcal{E}(w_i)) \)

Note that \( \dfrac{\text{exp}(- \mathcal{E}(w_i))}{\sum_{w_i \in V} \text{exp}(- \mathcal{E}(w_i))} \) is just the softmax probability \(P(w_i) \) of \(w_i\) (we omit the dependence on the context \(c\) here for brevity). Replacing it yields:

\(\nabla_\theta J_\theta = \: \nabla_\theta \mathcal{E}(w) + \sum_{w_i \in V} P(w_i) \nabla_\theta (- \mathcal{E}(w_i)) \)

Finally, repositioning the negative coefficient in front of the sum yields:

\(\nabla_\theta J_\theta = \: \nabla_\theta \mathcal{E}(w) - \sum_{w_i \in V} P(w_i) \nabla_\theta \mathcal{E}(w_i) \)

Bengio and Senécal (2003) note that the gradient essentially has two parts: a positive reinforcement for the target word \(w\) (the first term in the above equation) and a negative reinforcement for all other words \(w_i\), which is weighted by their probability (the second term). As we can see, this negative reinforcement is just the expectation \(\mathbb{E}_{w_i \sim P}\) of the gradient of \(\mathcal{E} \) for all words \(w_i\) in \(V\):

\(\sum_{w_i \in V} P(w_i) \nabla_\theta \mathcal{E}(w_i) = \mathbb{E}_{w_i \sim P}[\nabla_\theta \mathcal{E}(w_i)]\).

The crux of most sampling-based approach now is to approximate this negative reinforcement in some way to make it easier to compute, since we don't want to sum over the probabilities for all words in \(V\).

We can approximate the expected value \(\mathbb{E}\) of any probability distribution using the Monte Carlo method, i.e. by taking the mean of random samples of the probability distribution. If we knew the network's distribution, i.e. \(P(w)\), we could thus directly sample \(m\) words \( w_1 , \cdots , w_m \) from it and approximate the above expectation with:

\( \mathbb{E}_{w_i \sim P}[\nabla_\theta \mathcal{E}(w_i)] \approx \dfrac{1}{m} \sum\limits^m_{i=1} \nabla_\theta \mathcal{E}(w_i) \).

However, in order to sample from the probability distribution \( P \), we need to compute \( P \), which is just what we wanted to avoid in the first place. We therefore have find some other distribution \( Q \) (we call this the proposal distribution), from which it is cheap to sample and which can be used as the basis of Monte-Carlo sampling. Preferably, \(Q\) should also be similar to \(P\), since we want our approximated expectation to be as accurate as possible. A straightforward choice in the case of language modelling is to simply use the unigram distribution of the training set for \( Q \).

This is essentially what classical Importance Sampling (IS) does: It uses Monte-Carlo sampling to approximate a target distribution \(P\) via a proposal distribution \(Q\). However, this still requires computing \(P(w)\) for every word \(w\) that is sampled. To avoid this, Bengio and Senécal (2003) use a biased estimator that was first proposed by Liu [^{16}]. This estimator can be used when \( P(w) \) is computed as a product, which is the case here, since every division can be transformed into a multiplication.

Essentially, instead of weighting the gradient \(\nabla_\theta \mathcal{E}(w_i)\) with the expensive to compute probability \(P_{w_i}\), we weight it with a factor that leverages the proposal distribution \(Q\). For biased IS, this factor is \(\dfrac{1}{R}r(w_i)\) where \( r(w) = \dfrac{\text{exp}(- \mathcal{E}(w))}{Q(w)} \) and \( R = \sum^m_{j=1} r(w_j) \).

Note that we use \( r \) and \( R \) instead of \( w\) and \(W\) as in Bengio and Senécal (2003, 2008) to avoid name clashes. As we can see, we still compute the numerator of the softmax, but replace the normalisation in the denominator with the proposal distribution \(Q\). Our biased estimator that approximates the expectation thus looks like the following:

\( \mathbb{E}_{w_i \sim P}[\nabla_\theta \mathcal{E}(w_i)] \approx \dfrac{1}{R} \sum\limits^m_{i=1} r(w_i) \nabla_\theta \: \mathcal{E}(w_i)\)

Note that the fewer samples we use, the worse is our approximation. We additionally need to adjust our sample size during training, as the network's distribution \(P\) might diverge from the unigram distribution \(Q \) during training, which leads to divergence of the model, if the sample size that is used is too small. Consequently, Bengio and Senécal introduce a measure to calculate the effective sample size in order to protect against possible divergence. Finally, the authors report a speed-up factor of \(19 \) over the regular softmax for this method.

Bengio and Senécal (2008) note that for Importance Sampling, substituting more complex distributions, e.g. bigram and trigram distributions, later in training to combat the divergence of the unigram distribution \(Q\) from the model's true distribution \(P\) does not help, as n-gram distributions seem to be quite different from the distribution of trained neural language models. As an alternative, they propose an n-gram distribution that is adapted during training to follow the target distribution \(P\) more closely. To this end, they interpolate a bigram distribution and a unigram distribution according to some mixture function, whose parameters they train with SGD for different frequency bins to minimize the Kullback-Leibler divergence between the distribution \( Q \) and the target distribution \( P\). For experiments, they report a speed-up factor of about \(100\).

Jean et al. (2015) propose to use Adaptive Importance Sampling for machine translation. In order to make the method more suitable for processing on a GPU with limited memory, they limit the number of target words that need to be sampled from. They do this by partitioning the training set and including only a fixed number of sample words in every partition, which form a subset \(V'\) of the vocabulary.

This essentially means that a separate proposal distribution \(Q_i \) can be used for every partition \(i\) of the training set, which assigns equal probability to all words included in the vocabulary subset \(V'_i\) and zero probability to all other words.

Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen) [^{17}] is proposed by Mnih and Teh [^{18}] as a more stable sampling method than Importance Sampling (IS), as we have seen that IS poses the risk of having the proposal distribution \(Q\) diverge from the distribution \(P\) that should be optimized. In contrast to the former, NCE does not try to estimate the probability of a word directly. Instead, it uses an auxiliary loss that also optimises the goal of maximizing the probability of correct words.

Recall the pairwise-ranking criterion of Collobert and Weston (2008) that ranks positive windows higher than "corrupted" windows, which we discussed in the previous post. NCE does a similar thing: We train a model to differentiate the target word from noise. We can thus reduce the problem of predicting the correct word to a binary classification task, where the model tries to distinguish positive, genuine data from noise samples, as can be seen in Figure 4 below.

For every word \(w_i\) given its context \(c_i \) of \(n\) previous words \(w_{t-1} , \cdots , w_{t-n+1}\) in the training set, we thus generate \(k\) noise samples \(\tilde{w}_{ik}\) from a noise distribution \(Q\). As in IS, we can sample from the unigram distribution of the training set. As we need labels to perform our binary classification task, we designate all correct words \(w_i\) given their context \(c_i\) as true (\(y=1\)) and all noise samples \(\tilde{w}_{ik}\) as false (\(y=0\)).

We can now use logistic regression to minimize the negative log-likelihood, i.e. cross-entropy of our training examples against the noise (conversely, we could also maximize the *positive* log-likelihood as some papers do):

\( J_\theta = - \sum_{w_i \in V} [ \text{log} \: P(y=1\:|\:w_i,c_i) + k \: \mathbb{E}_{\tilde{w}_{ik} \sim Q} [ \:\text{log} \: P(y=0 \:|\:\tilde{w}_{ij},c_i)]] \).

Instead of computing the expectation \(\mathbb{E}_{\tilde{w}_{ik} \sim Q}\) of our noise samples, which would still require summing over all words in \(V\) to predict the normalised probability of a negative label, we can again take the mean with the Monte Carlo approximation:

\( J_\theta = - \sum_{w_i \in V} [ \text{log} \: P(y=1\:|\:w_i,c_i) + k \: \sum_{j=1}^k \dfrac{1}{k} \:\text{log} \: P(y=0 \:|\:\tilde{w}_{ij},c_i)] \),

which reduces to:

\( J_\theta = - \sum_{w_i \in V} [ \text{log} \: P(y=1\:|\:w_i,c_i) + \: \sum_{j=1}^k \:\text{log} \: P(y=0 \:|\:\tilde{w}_{ij},c_i)] \),

By generating \(k\) noise samples for every genuine word \(w_i\) given its context \(c\), we are effectively sampling words from two different distributions: Correct words are sampled from the empirical distribution of the training set \(P_{\text{train}}\) and depend on their context \(c\), whereas noise samples come from the noise distribution \(Q\). We can thus represent the probability of sampling either a positive or a noise sample as a mixture of those two distributions, which are weighted based on the number of samples that come from each:

\(P(y, w \: | \: c) = \dfrac{1}{k+1} P_{\text{train}}(w \: | \: c)+ \dfrac{k}{k+1}Q(w) \).

Given this mixture, we can now calculate the probability that a sample came from the training \(P_{\text{train}}\) distribution as a conditional probability of \(y\) given \(w\) and \(c\):

\( P(y=1\:|\:w,c)= \dfrac{\dfrac{1}{k+1} P_{\text{train}}(w \: | \: c)}{\dfrac{1}{k+1} P_{\text{train}}(w \: | \: c)+ \dfrac{k}{k+1}Q(w)} \),

which can be simplified to:

\( P(y=1\:|\:w,c)= \dfrac{P_{\text{train}}(w \: | \: c)}{P_{\text{train}}(w \: | \: c) + k \: Q(w)} \).

As we don't know \(P_{\text{train}}\) (which is what we would like to calculate), we replace \(P_{\text{train}}\) with the probability of our model \(P\):

\( P(y=1\:|\:w,c)= \dfrac{P(w \: | \: c)}{P(w \: | \: c) + k \: Q(w)} \).

The probability of predicting a noise sample (\(y=0\)) is then simply \(P(y=0\:|\:w,c) = 1 - P(y=1\:|\:w,c)\). Note that computing \( P(w \: | \: c) \), i.e. the probability of a word \(w\) given its context \(c\) is essentially the definition of our softmax:

\(P(w \: | \: c) = \dfrac{\text{exp}({h^\top v'_{w}})}{\sum_{w_i \in V} \text{exp}({h^\top v'_{w_i}})} \).

For notational brevity and unambiguity, let us designate the denominator of the softmax with \(Z(c)\), since the denominator only depends on \(h\), which is generated from \(c\) (assuming a fixed \(V\)). The softmax then looks like this:

\(P(w \: | \: c) = \dfrac{\text{exp}({h^\top v'_{w}})}{Z(c)} \).

Having to compute \(P(w \: | \: c)\) means that -- again -- we need to compute \(Z(c)\), which requires us to sum over the probabilities of all words in \(V\). In the case of NCE, there exists a neat trick to circumvent this issue: We can treat the normalisation denominator \(Z(c)\) as a parameter that the model can learn.

Mnih and Teh (2012) and Vaswani et al. [^{20}] actually keep \(Z(c)\) fixed at \(1\), which they report does not affect the model's performance. This assumption has the nice side-effect of reducing the model's parameters, while ensuring that the model self-normalises by not depending on the explicit normalisation in \(Z(c)\). Indeed, Zoph et al. [^{19}] find that even when learned, \(Z(c)\) is close to \(1\) and has low variance.

If we thus set \(Z(c)\) to \(1\) in the above softmax equation, we are left with the following probability of word \(w\) given a context \(c\):

\(P(w \: | \: c) = \text{exp}({h^\top v'_{w}})\).

We can now insert this term in the above equation to compute \(P(y=1\:|\:w,c)\):

\( P(y=1\:|\:w,c)= \dfrac{\text{exp}({h^\top v'_{w}})}{\text{exp}({h^\top v'_{w}}) + k \: Q(w)} \).

Inserting this term in turn in our logistic regression objective finally yields the full NCE loss:

\( J_\theta = - \sum_{w_i \in V} [ \text{log} \: \dfrac{\text{exp}({h^\top v'_{w_i}})}{\text{exp}({h^\top v'_{w_i}}) + k \: Q(w_i)} + \sum_{j=1}^k \:\text{log} \: (1 - \dfrac{\text{exp}({h^\top v'_{\tilde{w}_{ij}}})}{\text{exp}({h^\top v'_{\tilde{w}_{ij}}}) + k \: Q(\tilde{w}_{ij})})] \).

Note that NCE has nice theoretical guarantees: It can be shown that as we increase the number of noise samples \(k\), the NCE derivative tends towards the gradient of the softmax function. Mnih and Teh (2012) report that \(25\) noise samples are sufficient to match the performance of the regular softmax, with an expected speed-up factor of about \(45\). For more information on NCE, Chris Dyer has published some excellent notes [^{21}].

One caveat of NCE is that as typically different noise samples are sampled for every training word \(w\), the noise samples and their gradients cannot be stored in dense matrices, which reduces the benefit of using NCE with GPUs, as it cannot benefit from fast dense matrix multiplications. Jozefowicz et al. (2016) and Zoph et al. (2016) independently propose to share noise samples across all training words in a mini-batch, so that NCE gradients can be computed with dense matrix operations, which are more efficient on GPUs.

Jozefowicz et al. (2016) show that NCE and IS are not only similar as both are sampling-based approaches, but are strongly connected. While NCE uses a binary classification task, they show that IS can be described similarly using a surrogate loss function: Instead of performing binary classification with a logistic loss function like NCE, IS then optimises a multi-class classification problem with a softmax and cross-entropy loss function. They observe that as IS performs multi-class classification, it may be a better choice for language modelling, as the loss leads to tied updates between the data and noise samples rather than independent updates as with NCE. Indeed, Jozefowicz et al. (2016) use IS for language modelling and obtain state-of-the-art performance (as mentioned above) on the 1B Word benchmark.

Negative Sampling (NEG), the objective that has been popularised by Mikolov et al. (2013), can be seen as an approximation to NCE. As we have mentioned above, NCE can be shown to approximate the loss of the softmax as the number of samples \(k\) increases. NEG simplifies NCE and does away with this guarantee, as the objective of NEG is to learn high-quality word representations rather than achieving low perplexity on a test set, as is the goal in language modelling.

NEG also uses a logistic loss function to minimise the negative log-likelihood of words in the training set. Recall that NCE calculated the probability that a word \(w\) comes from the empirical training distribution \(P_{\text{train}}\) given a context \(c\) as follows:

\( P(y=1\:|\:w,c)= \dfrac{\text{exp}({h^\top v'_{w}})}{\text{exp}({h^\top v'_{w}}) + k \: Q(w)} \).

The key difference to NCE is that NEG only approximates this probability by making it as easy to compute as possible. For this reason, it sets the most expensive term, \(k \: Q(w)\) to \(1\), which leaves us with:

\( P(y=1\:|\:w,c)= \dfrac{\text{exp}({h^\top v'_{w}})}{\text{exp}({h^\top v'_{w}}) + 1} \).

\(k \: Q(w) = 1\) is exactly then true, when \(k=\: |V|\) and \(Q\) is a uniform distribution. In this case, NEG is equivalent to NCE. The reason we set \(k \: Q(w) = 1\) and not to some other constant can be seen by rewriting the equation, as \(P(y=1\:|\:w,c)\) can be transformed into the sigmoid function:

\( P(y=1\:|\:w,c)= \dfrac{1}{1 + \text{exp}({-h^\top v'_{w}})} \).

If we now insert this back into the logistic regression loss from before, we get:

\( J_\theta = - \sum_{w_i \in V} [ \text{log} \: \dfrac{1}{1 + \text{exp}({-h^\top v'_{w_i}})} + \: \sum_{j=1}^k \:\text{log} \: (1 - \dfrac{1}{1 + \text{exp}({-h^\top v'_{\tilde{w}_{ij}}})}] \).

By simplifying slightly, we obtain:

\( J_\theta = - \sum_{w_i \in V} [ \text{log} \: \dfrac{1}{1 + \text{exp}({-h^\top v'_{w_i}})} + \: \sum_{j=1}^k \:\text{log} \: (\dfrac{1}{1 + \text{exp}({h^\top v'_{\tilde{w}_{ij}}})}] \).

Setting \(\sigma(x) = \dfrac{1}{1 + \text{exp}({-x})}\) finally yields the NEG loss:

\( J_\theta = - \sum_{w_i \in V} [ \text{log} \: \sigma(h^\top v'_{w_i}) + \: \sum_{j=1}^k \:\text{log} \: \sigma(-h^\top v'_{\tilde{w}_{ij}})] \).

To conform with the notation of Mikolov et al. (2013), \(h\) must be replaced with \(v_{w_I}\), \(v'_{w_i}\) with \(v'_{w_O}\) and \(v_{\tilde{w}_{ij}}\) with \(v'_{w_i}\). Also, in contrast to Mikolov's NEG objective, we a) optimise the objective over the whole corpus, b) minimise negative log-likelihood instead of maximising positive log-likelihood (as mentioned before), and c) have already replaced the expectation \(\mathbb{E}_{\tilde{w}_{ik} \sim Q}\) with its Monte Carlo approximation. For more insights on the derivation of NEG, have a look at Goldberg and Levy's notes [^{22}].

We have seen that NEG is only equivalent to NCE when \(k=\: |V|\) and \(Q\) is uniform. In all other cases, NEG only approximates NCE, which means that it will not directly optimise the likelihood of correct words, which is key for language modelling. While NEG may thus be useful for learning word embeddings, its lack of asymptotic consistency guarantees makes it inappropriate for language modelling.

Even though the self-normalisation technique proposed by Devlin et al. ^{23} is not a sampling-based approach, it provides further intuitions on self-normalisation of language models, which we briefly touched upon. We previously mentioned in passing that by setting the denominator \(Z(c)\) of the NCE loss to \(1\), the model essentially self-normalises. This is a useful property as it allows us to skip computing the expensive normalisation in \(Z(c)\).

Recall that our loss function \(J_\theta\) minimises the negative log-likelihood of all words \(w_i\) in our training data:

\(J_\theta = - \sum\limits_i [\text{log} \: \dfrac{\text{exp}({h^\top v'_{w_i}})}{Z(c)}] \).

We can decompose the softmax into a sum as we did before:

\(J_\theta \: P(w \: | \: c) = - \sum\limits_i [h^\top v'_{w_i} + \text{log} \: Z(c)] \).

If we are able to constrain our model so that it sets \(Z(c) = 1\) or similarly \(\text{log} \: Z(c) = 0\), then we can avoid computing the normalisation in \(Z(c)\) altogether. Devlin et al. (2014) thus propose to add a squared error penalty term to the loss function that encourages the model to keep \(\text{log} \: Z(c)\) as close as possible to \(0\):

\(J_\theta = - \sum\limits_i [h^\top v'_{w_i} + \text{log} \: Z(c) - \alpha \:(\text{log}(Z(c)) - 0)^2] \),

which can be rewritten as:

\(J_\theta = - \sum\limits_i [h^\top v'_{w_i} + \text{log} \: Z(c) - \alpha \: \text{log}^2 Z(c)] \)

where \(\alpha\) allows us to trade-off between model accuracy and mean self-normalisation. By doing this, we can essentially guarantee that \(Z(c)\) will be as close to \(1\) as we want. At decoding time in their MT system, Devlin et al. (2014) then set the denominator of the softmax to \(1\) and only use the numerator for computing \(P(w \: | \: c)\) together with their penalty term:

\(J_\theta = - \sum\limits_i [h^\top v'_{w_i} - \alpha \: \text{log}^2 Z(c)] \)

They report that self-normalisation achieves a speed-up factor of about \(15\), while only resulting in a small degradation of BLEU scores compared to a regular non-self-normalizing neural language model.

Andreas and Klein [^{11}] suggest that it should even be sufficient to only normalise a fraction of the training examples and still obtain approximate self-normalising behaviour. They thus propose Infrequent Normalisation (IN), which down-samples the penalty term, making this a sampling-based approach.

Let us first decompose the sum of the previous loss \(J_\theta\) into two separate sums:

\(J_\theta = - \sum\limits_i h^\top v'_{w_i} + \alpha \sum\limits_i \text{log}^2 Z(c) \).

We can now down-sample the second term by only computing the normalisation for a subset \(C\) of words \(w_j\) and thus of contexts \(c_j\) (as \(Z(c)\) only depends on the context \(c\)) in the training data:

\(J_\theta = - \sum\limits_i h^\top v'_{w_i} + \dfrac{\alpha}{\gamma} \sum\limits_{c_j \in C} \text{log}^2 Z(c_j) \)

where \(\gamma\) controls the size of the subset \(C\). Andreas and Klein (2015) suggest that IF combines the strengths of NCE and self-normalisation as it does not require computing the normalisation for all training examples (which NCE avoids entirely), but like self-normalisation allows trading-off between the accuracy of the model and how well normalisation is approximated. They observe a speed-up factor of \(10\) when normalising only a tenth of the training set, with no noticeable performance penalty.

So far, we have focused exclusively on approximating or even entirely avoiding the computation of the softmax denominator \(Z(c)\), as it is the most expensive term in the computation. We have thus not paid particular attention to \(h^\top v'_{w}\), i.e. the dot-product between the penultimate layer representation \(h\) and output word embedding \(v'_{w}\). Vijayanarasimhan et al. [^{12}] propose fast locality-sensitive hashing to approximate \(h^\top v'_{w}\). However, while this technique accelerates the model at test time, during training, these speed-ups virtually vanish as embeddings must be re-indexed and the batch size increases.

Having reviewed the most popular softmax-based and sampling-based approaches, we have shown that there are plenty of alternatives to the good ol' softmax and almost all of them promise a significant speed-up and equivalent or at most marginally deteriorated performance. This naturally poses the question which approach is the best for a particular task.

Approach | Speed-up factor |
During training? |
During testing? |
Performance (small vocab) |
Performance (large vocab) |
Proportion of parameters |
---|---|---|---|---|---|---|

Softmax | 1x | - | - | very good | very poor | 100% |

Hierarchical Softmax | 25x (50-100x) | X | - | very poor | very good | 100% |

Differentiated Softmax | 2x | X | X | very good | very good | < 100% |

CNN-Softmax | - | X | - | - | bad - good | 30% |

Importance Sampling | (19x) | X | - | - | - | 100% |

Adaptive Importance Sampling |
(100x) | X | - | - | - | 100% |

Target Sampling | 2x | X | - | good | bad | 100% |

Noise Contrastive Estimation |
8x (45x) | X | - | very bad | very bad | 100% |

Negative Sampling | (50-100x) | X | - | - | - | 100% |

Self-Normalisation | (15x) | X | - | - | - | 100% |

Infrequent Normalisation |
6x (10x) | X | - | very good | good | 100% |

We compare the performance of the approaches we discussed in this post for language modelling in Table 1. Speed-up factors and performance are based on the experiments by Chen et al. (2015), while we show speed-up factors reported by the authors of the original papers in brackets. The third and fourth columns indicate if the speed-up is achieved during training and testing respectively. Note that divergence of speed-up factors might be due to unoptimised implementations or the fact that the original authors might not have had access to GPUs, which benefit the regular softmax more than some of the other approaches. Performance for approaches where no comparison is available should largely be analogous to similar approaches, i.e. Self-Normalisation should achieve similar performance as Infrequent Normalisation and Importance Sampling and Adaptive Importance Sampling should achieve similar performance as Target Sampling. The performance of CNN-Softmax is as reported by Jozefowicz et al. (2016) and ranges from bad to good depending on the size of the correction. Of all approaches, only CNN-Softmax achieves a substantial reduction in parameters as the other approaches still require storing output embeddings. Differentiated Softmax reduces parameters by being able to store a sparse weight matrix.

As it always is, there is no clear winner that beats all other approaches on all datasets or tasks. For language modelling, the regular softmax still achieves very good performance on small vocabulary datasets, such as the Penn Treebank, and even performs well on medium datasets, such as Gigaword, but does very poorly on large vocabulary datasets, e.g. the 1B Word Benchmark. Target Sampling, Hierarchical Softmax, and Infrequent Normalisation in turn do better with large vocabularies.

Differentiated Softmax generally does well for both small and large vocabularies and is the only approach that ensures a speed-up at test time. Interestingly, Hierarchical Softmax (HS) performs very poorly with small vocabularies. However, of all methods, HS is the fastest and processes most training examples in a given time frame. While NCE performs well with large vocabularies, it is generally worse than the other methods. Negative Sampling does not work well for language modelling, but it is generally superior for learning word representations, as attested by word2vec's success. Note that all results should be taken with a grain of salt: Chen et al. (2015) report having difficulties using Noise Contrastive Estimation in practice; Kim et al. (2016) use Hierarchical Softmax to achieve state-of-the-art with a small vocabulary, while Importance Sampling is used by the state-of-the-art language model by Jozefowicz et al. (2016) on a dataset with a large vocabulary.

Finally, if you are looking to actually use the described methods, TensorFlow has implementations for a few sampling-based approaches and also explains the differences between some of them here.

This overview of different methods to approximate the softmax attempted to provide you with intuitions that can not only be applied to improve and speed-up learning word representations, but are also relevant for language modelling and machine translation. As we have seen, most of these approaches are closely related and are driven by one uniting factor: the necessity to approximate the expensive normalisation in the denominator of the softmax. With these approaches in mind, I hope you feel now better equipped to train and understand your models and that you might even feel ready to work on learning better word representations yourself.

As we have seen, learning word representations is a vast field and many factors are relevant for success. In the previous blog post, we looked at the architectures of popular models and in this blog post, we investigated more closely a key component, the softmax layer. In the next one, we will introduce GloVe, a method that relies on matrix factorisation rather than language modelling, and turn our attention to other hyperparameters that are essential for successfully learning word embeddings.

**As always, let me know about any mistakes I made and approaches I missed in the comments below.**

If you found this blog post helpful, please consider citing it as:

*Sebastian Ruder. On word embeddings - Part 2: Approximating the Softmax. http://ruder.io/word-embeddings-softmax, 2016.*

If you want to learn more about word embeddings, these other blog posts on word embeddings are also available:

- On word embeddings - Part 1
- On word embeddings - Part 3: The secret ingredients of word2vec
- Unofficial Part 4: A survey of cross-lingual embedding models

This blog post has been translated into the following languages:

- Chinese

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS, 1–9. ↩

Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12. ↩

Morin, F., & Bengio, Y. (2005). Hierarchical Probabilistic Neural Network Language Model. Aistats, 5. ↩

Bengio, Y., & Senécal, J.-S. (2003). Quick Training of Probabilistic Neural Nets by Importance Sampling. AISTATS. http://www.iro.umontreal.ca/~lisa/pointeurs/submit_aistats2003.pdf ↩

Shannon, C. E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal, 30(1), 50–64. http://doi.org/10.1002/j.1538-7305.1951.tb01366.x ↩

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. Retrieved from http://arxiv.org/abs/1602.02410 ↩

Rong, X. (2014). word2vec Parameter Learning Explained. arXiv:1411.2738, 1–19. Retrieved from http://arxiv.org/abs/1411.2738 ↩

Mnih, A., & Hinton, G. E. (2008). A Scalable Hierarchical Distributed Language Model. Advances in Neural Information Processing Systems, 1–8. Retrieved from http://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model.pdf ↩

Chen, W., Grangier, D., & Auli, M. (2015). Strategies for Training Large Vocabulary Neural Language Models. Retrieved from http://arxiv.org/abs/1512.04906 ↩

Jean, S., Cho, K., Memisevic, R., & Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1–10. Retrieved from http://www.aclweb.org/anthology/P15-1001 ↩

Andreas, J., & Klein, D. (2015). When and why are log-linear models self-normalizing? Naacl-2015, 244–249. ↩

Vijayanarasimhan, S., Shlens, J., Monga, R., & Yagnik, J. (2015). Deep Networks With Large Output Spaces. Iclr, 1–9. Retrieved from http://arxiv.org/abs/1412.7479 ↩

Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-Aware Neural Language Models. AAAI. Retrieved from http://arxiv.org/abs/1508.06615 ↩

Ling, W., Trancoso, I., Dyer, C., & Black, A. W. (2016). Character-based Neural Machine Translation. ICLR, 1–11. Retrieved from http://arxiv.org/abs/1511.04586 ↩

Bengio, Y., & Senécal, J.-S. (2008). Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4), 713–722. http://doi.org/10.1109/TNN.2007.912312 ↩

Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer. http://doi.org/10.1017/CBO9781107415324.004 ↩

Gutmann, M., & Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. International Conference on Artificial Intelligence and Statistics, 1–8. Retrieved from http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann10AISTATS.pdf ↩

Mnih, A., & Teh, Y. W. (2012). A Fast and Simple Algorithm for Training Neural Probabilistic Language Models. Proceedings of the 29th International Conference on Machine Learning (ICML’12), 1751–1758. ↩

Zoph, B., Vaswani, A., May, J., & Knight, K. (2016). Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies. NAACL. ↩

Vaswani, A., Zhao, Y., Fossum, V., & Chiang, D. (2013). Decoding with Large-Scale Neural Language Models Improves Translation. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), (October), 1387–1392. ↩

Dyer, C. (2014). Notes on Noise Contrastive Estimation and Negative Sampling. Arxiv preprint. Retrieved from http://arxiv.org/abs/1410.8251 ↩

Goldberg, Y., & Levy, O. (2014). word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method. arXiv Preprint arXiv:1402.3722, (2), 1–5. Retrieved from http://arxiv.org/abs/1402.3722 ↩

Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., & Makhoul, J. (2014). Fast and robust neural network joint models for statistical machine translation. Proc. ACL’2014, 1370–1380. ↩

Gouws, S. (2016). Training neural word embeddings for transfer learning and translation (Doctoral dissertation, Stellenbosch: Stellenbosch University). ↩

Credit for the cover image goes to Stephan Gouws who included the image in his PhD dissertation and in the Tensorflow word2vec tutorial.

]]>Table of contents:

Unsupervisedly learned word embeddings have been exceptionally successful in many NLP tasks and are frequently seen as something akin to a *silver bullet*. In fact, in many NLP architectures, they have almost completely replaced traditional distributional features such as Brown clusters and LSA features.

Proceedings of last year's ACL and EMNLP conferences have been dominated by word embeddings, with some people musing that *Embedding Methods in Natural Language Processing* was a more fitting name for EMNLP. This year's ACL features not one but two workshops on word embeddings.

Semantic relations between word embeddings seem nothing short of magical to the uninitiated and Deep Learning NLP talks frequently prelude with the notorious \(king - man + woman \approx queen \) slide, while a recent article in *Communications of the ACM* hails word embeddings as the primary reason for NLP's breakout.

This post will be the first in a series that aims to give an extensive overview of word embeddings showcasing why this hype may or may not be warranted. In the course of this review, we will try to connect the disperse literature on word embedding models, highlighting many models, applications and interesting features of word embeddings, with a focus on multilingual embedding models and word embedding evaluation tasks in later posts.

This first post lays the foundations by presenting current word embeddings based on language modelling. While many of these models have been discussed at length, we hope that investigating and discussing their merits in the context of past and current research will provide new insights.

A brief note on nomenclature: In the following we will use the currently prevalent term *word embeddings* to refer to dense representations of words in a low-dimensional vector space. Interchangeable terms are *word vectors* and *distributed representations*. We will particularly focus on *neural word embeddings*, i.e. word embeddings learned by a neural network.

Since the 1990s, vector space models have been used in distributional semantics. During this time, many models for estimating continuous representations of words have been developed, including Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Have a look at this blog post for a more detailed overview of distributional semantics history in the context of word embeddings.

Bengio et al. coin the term word embeddings in 2003 and train them in a neural language model jointly with the model's parameters. First to show the utility of pre-trained word embeddings were arguably Collobert and Weston in 2008. Their landmark paper *A unified architecture for natural language processing* not only establishes word embeddings as a useful tool for downstream tasks, but also introduces a neural network architecture that forms the foundation for many current approaches. However, the eventual popularization of word embeddings can be attributed to Mikolov et al. in 2013 who created word2vec, a toolkit that allows the seamless training and use of pre-trained embeddings. In 2014, Pennington et al. released GloVe, a competitive set of pre-trained word embeddings, signalling that word embeddings had reached the main stream.

Word embeddings are one of the few currently successful applications of unsupervised learning. Their main benefit arguably is that they don't require expensive annotation, but can be derived from large unannotated corpora that are readily available. Pre-trained embeddings can then be used in downstream tasks that use small amounts of labeled data.

Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as *Embedding Layer*.

The main difference between such a network that produces word embeddings as a by-product and a method such as word2vec whose explicit goal is the generation of word embeddings is its computational complexity. Generating word embeddings with a very deep architecture is simply too computationally expensive for a large vocabulary. This is the main reason why it took until 2013 for word embeddings to explode onto the NLP stage; computational complexity is a key trade-off for word embedding models and will be a recurring theme in our review.

Another difference is the training objective: word2vec and GloVe are geared towards producing word embeddings that encode general semantic relationships, which are beneficial to many downstream tasks; notably, word embeddings trained this way won't be helpful in tasks that do not rely on these kind of relationships. In contrast, regular neural networks typically produce task-specific embeddings that are only of limited use elsewhere. Note that a task that relies on semantically coherent representations such as language modelling will produce similar embeddings to word embedding models, which we will investigate in the next chapter.

As a side-note, word2vec and Glove might be said to be to NLP what VGGNet is to vision, i.e. a common weight initialisation that provides generally helpful features without the need for lengthy training.

To facilitate comparison between models, we assume the following notational standards: We assume a training corpus containing a sequence of \(T\) training words \(w_1, w_2, w_3, \cdots, w_T\) that belong to a vocabulary \(V\) whose size is \(|V|\). Our models generally consider a context of \( n \) words. We associate every word with an input embedding \( v_w \) (the eponymous word embedding in the Embedding Layer) with \(d\) dimensions and an output embedding \( v'_w \) (another word representation whose role will soon become clearer). We finally optimize an objective function \(J_\theta\) with regard to our model parameters \(\theta\) and our model outputs some score \(f_\theta(x)\) for every input \( x \).

Word embedding models are quite closely intertwined with language models. The quality of language models is measured based on their ability to learn a probability distribution over words in \( V \). In fact, many state-of-the-art word embedding models try to predict the next word in a sequence to some extent. Additionally, word embedding models are often evaluated using perplexity, a cross-entropy based measure borrowed from language modelling.

Before we get into the gritty details of word embedding models, let us briefly talk about some language modelling fundamentals.

Language models generally try to compute the probability of a word \(w_t\) given its \(n - 1\) previous words, i.e. \(p(w_t \: | \: w_{t-1} , \cdots w_{t-n+1})\). By applying the chain rule together with the Markov assumption, we can approximate the product of a whole sentence or document by the product of the probabilities of each word given its \(n\) previous words:

\(p(w_1 , \cdots , w_T) = \prod\limits_i p(w_i \: | \: w_{i-1} , \cdots , w_{i-n+1}) \).

In n-gram based language models, we can calculate a word's probability based on the frequencies of its constituent n-grams:

\( p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1}) = \dfrac{count(w_{t-n+1}, \cdots , w_{t-1},w_t)}{count({w_{t-n+1}, \cdots , w_{t-1}})}\).

Setting \(n = 2\) yields bigram probabilities, while \(n = 5\) together with Kneser-Ney smoothing leads to smoothed 5-gram models that have been found to be a strong baseline for language modelling. For more details, you can refer to these slides from Stanford.

In neural networks, we achieve the same objective using the well-known softmax layer:

\(p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1}) = \dfrac{\text{exp}({h^\top v'_{w_t}})}{\sum_{w_i \in V} \text{exp}({h^\top v'_{w_i}})} \).

The inner product \( h^\top v'_{w_t} \) computes the (unnormalized) log-probability of word \( w_t \), which we normalize by the sum of the log-probabilities of all words in \( V \). \(h\) is the output vector of the penultimate network layer (the hidden layer in the feed-forward network in Figure 1), while \(v'_w\) is the output embedding of word \(w \), i.e. its representation in the weight matrix of the softmax layer. Note that even though \(v'_w\) represents the word \(w\), it is learned separately from the input word embedding \(v_w\), as the multiplications both vectors are involved in differ (\(v_w\) is multiplied with an index vector, \(v'_w\) with \(h\)).

Note that we need to calculate the probability of every word \( w \) at the output layer of the neural network. To do this efficiently, we perform a matrix multiplication between \(h\) and a weight matrix whose rows consist of \(v'_w\) of all words \(w\) in \(V\). We then feed the resulting vector, which is often referred to as a logit, i.e. the output of a previous layer that is not a probability, with \(d = |V|\) into the softmax, while the softmax layer "squashes" the vector to a probability distribution over the words in \(V\).

Note that the softmax layer (in contrast to the previous n-gram calculations) only implicitly takes into account \(n\) previous words: LSTMs, which are typically used for neural language models, encode these in their state \(h\), while Bengio's neural language model, which we will see in the next chapter, feeds the previous \(n\) words through a feed-forward layer.

Keep this softmax layer in mind, as many of the subsequent word embedding models will use it in some fashion.

Using this softmax layer, the model tries to maximize the probability of predicting the correct word at every timestep \( t \). The whole model thus tries to maximize the averaged log probability of the whole corpus:

\(J_\theta = \frac{1}{T} \text{log} \space p(w_1 , \cdots , w_T)\).

Analogously, through application of the chain rule, it is usually trained to maximize the average of the log probabilities of all words in the corpus given their previous \( n \) words:

\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1})\).

To sample words from the language model at test time, we can either greedily choose the word with the highest probability \(p(w_t \: | \: w_{t-1} \cdots w_{t-n+1})\) at every time step \( t \) or use beam search. We can do this for instance to generate arbitrary text sequences as in Karpathy's Char-RNN or as part of a sequence prediction task, where an LSTM is used as the decoder.

The classic neural language model proposed by Bengio et al. [^{1}] in 2003 consists of a one-hidden layer feed-forward neural network that predicts the next word in a sequence as in Figure 2.

Their model maximizes what we've described above as the prototypical neural language model objective (we omit the regularization term for simplicity):

\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space f(w_t , w_{t-1} , \cdots , w_{t-n+1})\).

\( f(w_t , w_{t-1} , \cdots , w_{t-n+1}) \) is the output of the model, i.e. the probability \( p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1}) \) as computed by the softmax, where \(n \) is the number of previous words fed into the model.

Bengio et al. are one of the first to introduce what we now refer to as a word embedding, a real-valued word feature vector in \(\mathbb{R}\). Their architecture forms very much the prototype upon which current approaches have gradually improved. The general building blocks of their model, however, are still found in all current neural language and word embedding models. These are:

**Embedding Layer**: a layer that generates word embeddings by multiplying an index vector with a word embedding matrix;**Intermediate Layer(s)**: one or more layers that produce an intermediate representation of the input, e.g. a fully-connected layer that applies a non-linearity to the concatenation of word embeddings of \(n\) previous words;**Softmax Layer**: the final layer that produces a probability distribution over words in \(V\).

Additionally, Bengio et al. identify two issues that lie at the heart of current state-of-the-art-models:

- They remark that
**2.**can be replaced by an LSTM, which is used by state-of-the-art neural language models [^{6},^{7}]. - They identify the final softmax layer (more precisely: the normalization term) as the network's main bottleneck, as the cost of computing the softmax is proportional to the number of words in \(V\), which is typically on the order of hundreds of thousands or millions.

Finding ways to mitigate the computational cost associated with computing the softmax over a large vocabulary [^{9}] is thus one of the key challenges both in neural language models as well as in word embedding models.

After Bengio et al.'s first steps in neural language models, research in word embeddings stagnated as computing power and algorithms did not yet allow the training of a large vocabulary.

Collobert and Weston [^{4}] (thus C&W) showcase in 2008 that word embeddings trained on a sufficiently large dataset carry syntactic and semantic meaning and improve performance on downstream tasks. They elaborate upon this in their 2011 paper [^{8}].

Their solution to avoid computing the expensive softmax is to use a different objective function: Instead of the cross-entropy criterion of Bengio et al., which maximizes the probability of the next word given the previous words, Collobert and Weston train a network to output a higher score \(f_\theta\) for a correct word sequence (a probable word sequence in Bengio's model) than for an incorrect one. For this purpose, they use a pairwise ranking criterion, which looks like this:

\(J_\theta\ = \sum\limits_{x \in X} \sum\limits_{w \in V} \text{max} \lbrace 0, 1 - f_\theta(x) + f_\theta(x^{(w)}) \rbrace \).

They sample correct windows \(x\) containing \(n\) words from the set of all possible windows \(X\) in their corpus. For each window \(x\), they then produce a corrupted, incorrect version \(x^{(w)}\) by replacing \(x\)'s centre word with another word \(w\) from \(V\). Their objective now maximises the distance between the scores output by the model for the correct and the incorrect window with a margin of \(1\). Their model architecture, depicted in Figure 3 without the ranking objective, is analogous to Bengio et al.'s model.

The resulting language model produces embeddings that already possess many of the relations word embeddings have become known for, e.g. countries are clustered close together and syntactically similar words occupy similar locations in the vector space. While their ranking objective eliminates the complexity of the softmax, they keep the intermediate fully-connected hidden layer (**2.**) of Bengio et al. around (the **HardTanh** layer in Figure 3), which constitutes another source of expensive computation. Partially due to this, their full model trains for seven weeks in total with \(|V| = 130000\).

Let us now introduce arguably the most popular word embedding model, the model that launched a thousand word embedding papers: word2vec, the subject of two papers by Mikolov et al. in 2013. As word embeddings are a key building block of deep learning models for NLP, word2vec is often assumed to belong to the same group. Technically however, word2vec is not be considered to be part of deep learning, as its architecture is neither deep nor uses non-linearities (in contrast to Bengio's model and the C&W model).

In their first paper [^{2}], Mikolov et al. propose two architectures for learning word embeddings that are computationally less expensive than previous models. In their second paper [^{3}], they improve upon these models by employing additional strategies to enhance training speed and accuracy.

These architectures offer two main benefits over the C&W model and Bengio's language model:

- They do away with the expensive hidden layer.
- They enable the language model to take additional context into account.

As we will later show, the success of their model is not only due to these changes, but especially due to certain training strategies.

In the following, we will look at both of these architectures:

While a language model is only able to look at the past words for its predictions, as it is evaluated on its ability to predict each next word in the corpus, a model that just aims to generate accurate word embeddings does not suffer from this restriction. Mikolov et al. thus use both the \(n\) words before and after the target word \( w_t \) to predict it as depicted in Figure 4. They call this continuous bag-of-words (CBOW), as it uses continuous representations whose order is of no importance.

The objective function of CBOW in turn is only slightly different than the language model one:

\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space p(w_t \: | \: w_{t-n} , \cdots , w_{t-1}, w_{t+1}, \cdots , w_{t+n})\).

Instead of feeding \( n \) previous words into the model, the model receives a window of \( n \) words around the target word \( w_t \) at each time step \( t \).

While CBOW can be seen as a precognitive language model, skip-gram turns the language model objective on its head: Instead of using the surrounding words to predict the centre word as with CBOW, skip-gram uses the centre word to predict the surrounding words as can be seen in Figure 5.

The skip-gram objective thus sums the log probabilities of the surrounding \( n \) words to the left and to the right of the target word \( w_t \) to produce the following objective:

\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \sum\limits_{-n \leq j \leq n , \neq 0} \text{log} \space p(w_{t+j} \: | \: w_t)\).

To gain a better intuition of how the skip-gram model computes \( p(w_{t+j} \: | \: w_t) \), let's recall the definition of our softmax:

\(p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1}) = \dfrac{\text{exp}({h^\top v'_{w_t}})}{\sum_{w_i \in V} \text{exp}({h^\top v'_{w_i}})} \).

Instead of computing the probability of the target word \( w_t \) given its previous words, we calculate the probability of the surrounding word \( w_{t+j} \) given \( w_t \). We can thus simply replace these variables in the equation:

\(p(w_{t+j} \: | \: w_t ) = \dfrac{\text{exp}({h^\top v'_{w_{t+j}}})}{\sum_{w_i \in V} \text{exp}({h^\top v'_{w_i}})} \).

As the skip-gram architecture does not contain a hidden layer that produces an intermediate state vector \(h\), \(h\) is simply the word embedding \(v_{w_t}\) of the input word \(w_t\). This also makes it clearer why we want to have different representations for input embeddings \(v_w\) and output embeddings \(v'_w\), as we would otherwise multiply the word embedding by itself. Replacing \(h \) with \(v_{w_t}\) yields:

\(p(w_{t+j} \: | \: w_t ) = \dfrac{\text{exp}({v^\top_{w_t} v'_{w_{t+j}}})}{\sum_{w_i \in V} \text{exp}({v^\top_{w_t} v'_{w_i}})} \).

Note that the notation in Mikolov's paper differs slightly from ours, as they denote the centre word with \( w_I \) and the surrounding words with \( w_O \). If we replace \( w_t \) with \( w_I \), \( w_{t+j} \) with \( w_O \), and swap the vectors in the inner product due to its commutativity, we arrive at the softmax notation in their paper:

\(p(w_O|w_I) = \dfrac{\text{exp}(v'^\top_{w_O} v_{w_I})}{\sum^V_{w=1}\text{exp}(v'^\top_{w} v_{w_I})}\).

In the next post, we will discuss different ways to approximate the expensive softmax as well as key training decisions that account for much of skip-gram's success. We will also introduce GloVe [^{5}], a word embedding model based on matrix factorisation and discuss the link between word embeddings and methods from distributional semantics.

Did I miss anything? **Let me know in the comments below.**

- On word embeddings - Part 2: Approximating the softmax
- On word embeddings - Part 3: The secret ingredients of word2vec
- Unofficial Part 4: A survey of cross-lingual embedding models

This blog post has been translated into the following languages:

- Chinese

Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3, 1137–1155. http://doi.org/10.1162/153244303322533223 ↩

Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12. ↩

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS, 1–9. ↩

Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing. Proceedings of the 25th International Conference on Machine Learning - ICML ’08, 20(1), 160–167. http://doi.org/10.1145/1390156.1390177 ↩

Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543. http://doi.org/10.3115/v1/D14-1162 ↩

Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-Aware Neural Language Models. AAAI. Retrieved from http://arxiv.org/abs/1508.06615 ↩

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. Retrieved from http://arxiv.org/abs/1602.02410 ↩

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12 (Aug), 2493–2537. Retrieved from http://arxiv.org/abs/1103.0398 ↩

Chen, W., Grangier, D., & Auli, M. (2015). Strategies for Training Large Vocabulary Neural Language Models, 12. Retrieved from http://arxiv.org/abs/1512.04906 ↩

Credit for the post image goes to Christopher Olah.

]]>Note: If you are looking for a review paper, this blog post is also available as an article on arXiv.

Update 15.06.2017: Added derivations of AdaMax and Nadam.

Table of contents:

- Gradient descent variants
- Challenges
- Gradient descent optimization algorithms
- Parallelizing and distributing SGD
- Additional strategies for optimizing SGD
- Conclusion
- References

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne's, caffe's, and keras' documentation). These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.

This blog post aims at providing you with intuitions towards the behaviour of different algorithms for optimizing gradient descent that will help you put them to use. We are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training. Subsequently, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules. We will also take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting. Finally, we will consider additional strategies that are helpful for optimizing gradient descent.

Gradient descent is a way to minimize an objective function \(J(\theta)\) parameterized by a model's parameters \(\theta \in \mathbb{R}^d \) by updating the parameters in the opposite direction of the gradient of the objective function \(\nabla_\theta J(\theta)\) w.r.t. to the parameters. The learning rate \(\eta\) determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley. If you are unfamiliar with gradient descent, you can find a good introduction on optimizing neural networks here.

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. to the parameters \(\theta\) for the entire training dataset:

\(\theta = \theta - \eta \cdot \nabla_\theta J( \theta)\).

As we need to calculate the gradients for the whole dataset to perform just *one* update, batch gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model *online*, i.e. with new examples on-the-fly.

In code, batch gradient descent looks something like this:

```
for i in range(nb_epochs):
params_grad = evaluate_gradient(loss_function, data, params)
params = params - learning_rate * params_grad
```

For a pre-defined number of epochs, we first compute the gradient vector `params_grad`

of the loss function for the whole dataset w.r.t. our parameter vector `params`

. Note that state-of-the-art deep learning libraries provide automatic differentiation that efficiently computes the gradient w.r.t. some parameters. If you derive the gradients yourself, then gradient checking is a good idea. (See here for some great tips on how to check gradients properly.)

We then update our parameters in the direction of the gradients with the learning rate determining how big of an update we perform. Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

Stochastic gradient descent (SGD) in contrast performs a parameter update for *each* training example \(x^{(i)}\) and label \(y^{(i)}\):

\(\theta = \theta - \eta \cdot \nabla_\theta J( \theta; x^{(i)}; y^{(i)})\).

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online.

SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily as in Image 1.

While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.

Its code fragment simply adds a loop over the training examples and evaluates the gradient w.r.t. each example. Note that we shuffle the training data at every epoch as explained in this section.

```
for i in range(nb_epochs):
np.random.shuffle(data)
for example in data:
params_grad = evaluate_gradient(loss_function, example, params)
params = params - learning_rate * params_grad
```

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of \(n\) training examples:

\(\theta = \theta - \eta \cdot \nabla_\theta J( \theta; x^{(i:i+n)}; y^{(i:i+n)})\).

This way, it *a)* reduces the variance of the parameter updates, which can lead to more stable convergence; and *b)* can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient. Common mini-batch sizes range between 50 and 256, but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network and the term SGD usually is employed also when mini-batches are used. Note: In modifications of SGD in the rest of this post, we leave out the parameters \(x^{(i:i+n)}; y^{(i:i+n)}\) for simplicity.

In code, instead of iterating over examples, we now iterate over mini-batches of size 50:

```
for i in range(nb_epochs):
np.random.shuffle(data)
for batch in get_batches(data, batch_size=50):
params_grad = evaluate_gradient(loss_function, batch, params)
params = params - learning_rate * params_grad
```

Vanilla mini-batch gradient descent, however, does not guarantee good convergence, but offers a few challenges that need to be addressed:

Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.

Learning rate schedules [

^{11}] try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics [^{10}].Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.

Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [

^{19}] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

In the following, we will outline some algorithms that are widely used by the deep learning community to deal with the aforementioned challenges. We will not discuss algorithms that are infeasible to compute in practice for high-dimensional data sets, e.g. second-order methods such as Newton's method.

SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another [^{1}], which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum as in Image 2.

Momentum [^{2}] is a method that helps accelerate SGD in the relevant direction and dampens oscillations as can be seen in Image 3. It does this by adding a fraction \(\gamma\) of the update vector of the past time step to the current update vector:

\(
\begin{align}
\begin{split}
v_t &= \gamma v_{t-1} + \eta \nabla_\theta J( \theta) \\

\theta &= \theta - v_t
\end{split}
\end{align}
\)

Note: Some implementations exchange the signs in the equations. The momentum term \(\gamma\) is usually set to 0.9 or a similar value.

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. \(\gamma < 1\)). The same thing happens to our parameter updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.

However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We'd like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.

Nesterov accelerated gradient (NAG) [^{7}] is a way to give our momentum term this kind of prescience. We know that we will use our momentum term \(\gamma v_{t-1}\) to move the parameters \(\theta\). Computing \( \theta - \gamma v_{t-1} \) thus gives us an approximation of the next position of the parameters (the gradient is missing for the full update), a rough idea where our parameters are going to be. We can now effectively look ahead by calculating the gradient not w.r.t. to our current parameters \(\theta\) but w.r.t. the approximate future position of our parameters:

\(
\begin{align}
\begin{split}
v_t &= \gamma v_{t-1} + \eta \nabla_\theta J( \theta - \gamma v_{t-1} ) \\

\theta &= \theta - v_t
\end{split}
\end{align}
\)

Again, we set the momentum term \(\gamma\) to a value of around 0.9. While Momentum first computes the current gradient (small blue vector in Image 4) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector), NAG first makes a big jump in the direction of the previous accumulated gradient (brown vector), measures the gradient and then makes a correction (red vector), which results in the complete NAG update (green vector). This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks [^{8}].

Refer to here for another explanation about the intuitions behind NAG, while Ilya Sutskever gives a more detailed overview in his PhD thesis [^{9}].

Now that we are able to adapt our updates to the slope of our error function and speed up SGD in turn, we would also like to adapt our updates to each individual parameter to perform larger or smaller updates depending on their importance.

Adagrad [^{3}] is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data. Dean et al. [^{4}] have found that Adagrad greatly improved the robustness of SGD and used it for training large-scale neural nets at Google, which -- among other things -- learned to recognize cats in Youtube videos. Moreover, Pennington et al. [^{5}] used Adagrad to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.

Previously, we performed an update for all parameters \(\theta\) at once as every parameter \(\theta_i\) used the same learning rate \(\eta\). As Adagrad uses a different learning rate for every parameter \(\theta_i\) at every time step \(t\), we first show Adagrad's per-parameter update, which we then vectorize. For brevity, we set \(g_{t, i}\) to be the gradient of the objective function w.r.t. to the parameter \(\theta_i\) at time step \(t\):

\(g_{t, i} = \nabla_\theta J( \theta_i )\).

The SGD update for every parameter \(\theta_i\) at each time step \(t\) then becomes:

\(\theta_{t+1, i} = \theta_{t, i} - \eta \cdot g_{t, i}\).

In its update rule, Adagrad modifies the general learning rate \(\eta\) at each time step \(t\) for every parameter \(\theta_i\) based on the past gradients that have been computed for \(\theta_i\):

\(\theta_{t+1, i} = \theta_{t, i} - \dfrac{\eta}{\sqrt{G_{t, ii} + \epsilon}} \cdot g_{t, i}\).

\(G_{t} \in \mathbb{R}^{d \times d} \) here is a diagonal matrix where each diagonal element \(i, i\) is the sum of the squares of the gradients w.r.t. \(\theta_i\) up to time step \(t\) [^{25}], while \(\epsilon\) is a smoothing term that avoids division by zero (usually on the order of \(1e-8\)). Interestingly, without the square root operation, the algorithm performs much worse.

As \(G_{t}\) contains the sum of the squares of the past gradients w.r.t. to all parameters \(\theta\) along its diagonal, we can now vectorize our implementation by performing an element-wise matrix-vector multiplication \(\odot\) between \(G_{t}\) and \(g_{t}\):

\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{G_{t} + \epsilon}} \odot g_{t}\).

One of Adagrad's main benefits is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 and leave it at that.

Adagrad's main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. The following algorithms aim to resolve this flaw.

Adadelta [^{6}] is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size \(w\).

Instead of inefficiently storing \(w\) previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average \(E[g^2]_t\) at time step \(t\) then depends (as a fraction \(\gamma \) similarly to the Momentum term) only on the previous average and the current gradient:

\(E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g^2_t \).

We set \(\gamma\) to a similar value as the momentum term, around 0.9. For clarity, we now rewrite our vanilla SGD update in terms of the parameter update vector \( \Delta \theta_t \):

\( \begin{align} \begin{split} \Delta \theta_t &= - \eta \cdot g_{t, i} \\ \theta_{t+1} &= \theta_t + \Delta \theta_t \end{split} \end{align} \)

The parameter update vector of Adagrad that we derived previously thus takes the form:

\( \Delta \theta_t = - \dfrac{\eta}{\sqrt{G_{t} + \epsilon}} \odot g_{t}\).

We now simply replace the diagonal matrix \(G_{t}\) with the decaying average over past squared gradients \(E[g^2]_t\):

\( \Delta \theta_t = - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t}\).

As the denominator is just the root mean squared (RMS) error criterion of the gradient, we can replace it with the criterion short-hand:

\( \Delta \theta_t = - \dfrac{\eta}{RMS[g]_{t}} g_t\).

The authors note that the units in this update (as well as in SGD, Momentum, or Adagrad) do not match, i.e. the update should have the same hypothetical units as the parameter. To realize this, they first define another exponentially decaying average, this time not of squared gradients but of squared parameter updates:

\(E[\Delta \theta^2]_t = \gamma E[\Delta \theta^2]_{t-1} + (1 - \gamma) \Delta \theta^2_t \).

The root mean squared error of parameter updates is thus:

\(RMS[\Delta \theta]_{t} = \sqrt{E[\Delta \theta^2]_t + \epsilon} \).

Since \(RMS[\Delta \theta]_{t}\) is unknown, we approximate it with the RMS of parameter updates until the previous time step. Replacing the learning rate \(\eta \) in the previous update rule with \(RMS[\Delta \theta]_{t-1}\) finally yields the Adadelta update rule:

\( \begin{align} \begin{split} \Delta \theta_t &= - \dfrac{RMS[\Delta \theta]_{t-1}}{RMS[g]_{t}} g_{t} \\ \theta_{t+1} &= \theta_t + \Delta \theta_t \end{split} \end{align} \)

With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera Class.

RMSprop and Adadelta have both been developed independently around the same time stemming from the need to resolve Adagrad's radically diminishing learning rates. RMSprop in fact is identical to the first update vector of Adadelta that we derived above:

\(
\begin{align}
\begin{split}
E[g^2]_t &= 0.9 E[g^2]_{t-1} + 0.1 g^2_t \\

\theta_{t+1} &= \theta_{t} - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t}
\end{split}
\end{align}
\)

RMSprop as well divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests \(\gamma\) to be set to 0.9, while a good default value for the learning rate \(\eta\) is 0.001.

Adaptive Moment Estimation (Adam) [^{15}] is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients \(v_t\) like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients \(m_t\), similar to momentum:

\(
\begin{align}
\begin{split}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\

v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

\end{split}
\end{align}
\)

\(m_t\) and \(v_t\) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As \(m_t\) and \(v_t\) are initialized as vectors of 0's, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. \(\beta_1\) and \(\beta_2\) are close to 1).

They counteract these biases by computing bias-corrected first and second moment estimates:

\( \begin{align} \begin{split} \hat{m}_t &= \dfrac{m_t}{1 - \beta^t_1} \\ \hat{v}_t &= \dfrac{v_t}{1 - \beta^t_2} \end{split} \end{align} \)

They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule:

\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\).

The authors propose default values of 0.9 for \(\beta_1\), 0.999 for \(\beta_2\), and \(10^{-8}\) for \(\epsilon\). They show empirically that Adam works well in practice and compares favorably to other adaptive learning-method algorithms.

The \(v_t\) factor in the Adam update rule scales the gradient inversely proportionally to the \(\ell_2\) norm of the past gradients (via the \(v_{t-1}\) term) and current gradient \(|g_t|^2\):

\(v_t = \beta_2 v_{t-1} + (1 - \beta_2) |g_t|^2\)

We can generalize this update to the \(\ell_p\) norm. Note that Kingma and Ba also parameterize \(\beta_2\) as \(\beta^p_2\):

\(v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p\)

Norms for large \(p\) values generally become numerically unstable, which is why \(\ell_1\) and \(\ell_2\) norms are most common in practice. However, \(\ell_\infty\) also generally exhibits stable behavior. For this reason, the authors propose AdaMax (Kingma and Ba, 2015) and show that \(v_t\) with \(\ell_\infty\) converges to the following more stable value. To avoid confusion with Adam, we use \(u_t\) to denote the infinity norm-constrained \(v_t\):

\(
\begin{align}
\begin{split}
u_t &= \beta_2^\infty v_{t-1} + (1 - \beta_2^\infty) |g_t|^\infty\\

& = \max(\beta_2 \cdot v_{t-1}, |g_t|)
\end{split}
\end{align}
\)

We can now plug this into the Adam update equation by replacing \(\sqrt{\hat{v}_t} + \epsilon\) with \(u_t\) to obtain the AdaMax update rule:

\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{u_t} \hat{m}_t\)

Note that as \(u_t\) relies on the \(\max\) operation, it is not as suggestible to bias towards zero as \(m_t\) and \(v_t\) in Adam, which is why we do not need to compute a bias correction for \(u_t\). Good default values are again \(\eta = 0.002\), \(\beta_1 = 0.9\), and \(\beta_2 = 0.999\).

As we have seen before, Adam can be viewed as a combination of RMSprop and momentum: RMSprop contributes the exponentially decaying average of past squared gradients \(v_t\), while momentum accounts for the exponentially decaying average of past gradients \(m_t\). We have also seen that Nesterov accelerated gradient (NAG) is superior to vanilla momentum.

Nadam (Nesterov-accelerated Adaptive Moment Estimation) [^{24}] thus combines Adam and NAG. In order to incorporate NAG into Adam, we need to modify its momentum term \(m_t\).

First, let us recall the momentum update rule using our current notation :

\(
\begin{align}
\begin{split}
g_t &= \nabla_{\theta_t}J(\theta_t)\\

m_t &= \gamma m_{t-1} + \eta g_t\\

\theta_{t+1} &= \theta_t - m_t
\end{split}
\end{align}
\)

where \(J\) is our objective function, \(\gamma\) is the momentum decay term, and \(\eta\) is our step size. Expanding the third equation above yields:

\(\theta_{t+1} = \theta_t - ( \gamma m_{t-1} + \eta g_t)\)

This demonstrates again that momentum involves taking a step in the direction of the previous momentum vector and a step in the direction of the current gradient.

NAG then allows us to perform a more accurate step in the gradient direction by updating the parameters with the momentum step *before* computing the gradient. We thus only need to modify the gradient \(g_t\) to arrive at NAG:

\(
\begin{align}
\begin{split}
g_t &= \nabla_{\theta_t}J(\theta_t - \gamma m_{t-1})\\

m_t &= \gamma m_{t-1} + \eta g_t\\

\theta_{t+1} &= \theta_t - m_t
\end{split}
\end{align}
\)

Dozat proposes to modify NAG the following way: Rather than applying the momentum step twice -- one time for updating the gradient \(g_t\) and a second time for updating the parameters \(\theta_{t+1}\) -- we now apply the look-ahead momentum vector directly to update the current parameters:

\(
\begin{align}
\begin{split}
g_t &= \nabla_{\theta_t}J(\theta_t)\\

m_t &= \gamma m_{t-1} + \eta g_t\\

\theta_{t+1} &= \theta_t - (\gamma m_t + \eta g_t)
\end{split}
\end{align}
\)

Notice that rather than utilizing the previous momentum vector \(m_{t-1}\) as in the equation of the expanded momentum update rule above, we now use the current momentum vector \(m_t\) to look ahead. In order to add Nesterov momentum to Adam, we can thus similarly replace the previous momentum vector with the current momentum vector. First, recall that the Adam update rule is the following (note that we do not need to modify \(\hat{v}_t\)):

\(
\begin{align}
\begin{split}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t\\

\hat{m}_t & = \frac{m_t}{1 - \beta^t_1}\\
\theta_{t+1} &= \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
\end{split}
\end{align}
\)

Expanding the second equation with the definitions of \(\hat{m}_t\) and \(m_t\) in turn gives us:

\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} (\dfrac{\beta_1 m_{t-1}}{1 - \beta^t_1} + \dfrac{(1 - \beta_1) g_t}{1 - \beta^t_1})\)

Note that \(\dfrac{\beta_1 m_{t-1}}{1 - \beta^t_1}\) is just the bias-corrected estimate of the momentum vector of the previous time step. We can thus replace it with \(\hat{m}_{t-1}\):

\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} (\beta_1 \hat{m}_{t-1} + \dfrac{(1 - \beta_1) g_t}{1 - \beta^t_1})\)

This equation again looks very similar to our expanded momentum update rule above. We can now add Nesterov momentum just as we did previously by simply replacing this bias-corrected estimate of the momentum vector of the previous time step \(\hat{m}_{t-1}\) with the bias-corrected estimate of the current momentum vector \(\hat{m}_t\), which gives us the Nadam update rule:

\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} (\beta_1 \hat{m}_t + \dfrac{(1 - \beta_1) g_t}{1 - \beta^t_1})\)

The following two animations (Image credit: Alec Radford) provide some intuitions towards the optimization behaviour of the presented optimization algorithms. Also have a look here for a description of the same images by Karpathy and another concise overview of the algorithms discussed.

In Image 5, we see their behaviour on the contours of a loss surface (the Beale function) over time. Note that Adagrad, Adadelta, and RMSprop almost immediately head off in the right direction and converge similarly fast, while Momentum and NAG are led off-track, evoking the image of a ball rolling down the hill. NAG, however, is quickly able to correct its course due to its increased responsiveness by looking ahead and heads to the minimum.

Image 6 shows the behaviour of the algorithms at a saddle point, i.e. a point where one dimension has a positive slope, while the other dimension has a negative slope, which pose a difficulty for SGD as we mentioned before. Notice here that SGD, Momentum, and NAG find it difficulty to break symmetry, although the two latter eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope.

As we can see, the adaptive learning-rate methods, i.e. Adagrad, Adadelta, RMSprop, and Adam are most suitable and provide the best convergence for these scenarios.

So, which optimizer should you now use? If your input data is sparse, then you likely achieve the best results using one of the adaptive learning-rate methods. An additional benefit is that you won't need to tune the learning rate but likely achieve the best results with the default value.

In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numinator update rule. Adam, finally, adds bias-correction and momentum to RMSprop. Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Kingma et al. [^{15}] show that its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.

Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually achieves to find a minimum, but it might take significantly longer than with some of the optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning rate methods.

Given the ubiquity of large-scale data solutions and the availability of low-commodity clusters, distributing SGD to speed it up further is an obvious choice.

SGD by itself is inherently sequential: Step-by-step, we progress further towards the minimum. Running it provides good convergence but can be slow particularly on large datasets. In contrast, running SGD asynchronously is faster, but suboptimal communication between workers can lead to poor convergence. Additionally, we can also parallelize SGD on one machine without the need for a large computing cluster. The following are algorithms and architectures that have been proposed to optimize parallelized and distributed SGD.

Niu et al. [^{23}] introduce an update scheme called Hogwild! that allows performing SGD updates in parallel on CPUs. Processors are allowed to access shared memory without locking the parameters. This only works if the input data is sparse, as each update will only modify a fraction of all parameters. They show that in this case, the update scheme achieves almost an optimal rate of convergence, as it is unlikely that processors will overwrite useful information.

Downpour SGD is an asynchronous variant of SGD that was used by Dean et al. [^{4}] in their DistBelief framework (predecessor to TensorFlow) at Google. It runs multiple replicas of a model in parallel on subsets of the training data. These models send their updates to a parameter server, which is split across many machines. Each machine is responsible for storing and updating a fraction of the model's parameters. However, as replicas don't communicate with each other e.g. by sharing weights or updates, their parameters are continuously at risk of diverging, hindering convergence.

McMahan and Streeter [^{12}] extend AdaGrad to the parallel setting by developing delay-tolerant algorithms that not only adapt to past gradients, but also to the update delays. This has been shown to work well in practice.

TensorFlow [^{13}] is Google's recently open-sourced framework for the implementation and deployment of large-scale machine learning models. It is based on their experience with DistBelief and is already used internally to perform computations on a large range of mobile devices as well as on large-scale distributed systems. For distributed execution, a computation graph is split into a subgraph for every device and communication takes place using Send/Receive node pairs. However, the open source version of TensorFlow currently does not support distributed functionality (see here).
Update 13.04.16: A distributed version of TensorFlow has been released.

Zhang et al. [^{14}] propose Elastic Averaging SGD (EASGD), which links the parameters of the workers of asynchronous SGD with an elastic force, i.e. a center variable stored by the parameter server. This allows the local variables to fluctuate further from the center variable, which in theory allows for more exploration of the parameter space. They show empirically that this increased capacity for exploration leads to improved performance by finding new local optima.

Finally, we introduce additional strategies that can be used alongside any of the previously mentioned algorithms to further improve the performance of SGD. For a great overview of some other common tricks, refer to [^{22}].

Generally, we want to avoid providing the training examples in a meaningful order to our model as this may bias the optimization algorithm. Consequently, it is often a good idea to shuffle the training data after every epoch.

On the other hand, for some cases where we aim to solve progressively harder problems, supplying the training examples in a meaningful order may actually lead to improved performance and better convergence. The method for establishing this meaningful order is called Curriculum Learning [^{16}].

Zaremba and Sutskever [^{17}] were only able to train LSTMs to evaluate simple programs using Curriculum Learning and show that a combined or mixed strategy is better than the naive one, which sorts examples by increasing difficulty.

To facilitate learning, we typically normalize the initial values of our parameters by initializing them with zero mean and unit variance. As training progresses and we update parameters to different extents, we lose this normalization, which slows down training and amplifies changes as the network becomes deeper.

Batch normalization [^{18}] reestablishes these normalizations for every mini-batch and changes are back-propagated through the operation as well. By making normalization part of the model architecture, we are able to use higher learning rates and pay less attention to the initialization parameters. Batch normalization additionally acts as a regularizer, reducing (and sometimes even eliminating) the need for Dropout.

According to Geoff Hinton: "*Early stopping (is) beautiful free lunch*" (NIPS 2015 Tutorial slides, slide 63). You should thus always monitor error on a validation set during training and stop (with some patience) if your validation error does not improve enough.

Neelakantan et al. [^{21}] add noise that follows a Gaussian distribution \(N(0, \sigma^2_t)\) to each gradient update:

\(g_{t, i} = g_{t, i} + N(0, \sigma^2_t)\).

They anneal the variance according to the following schedule:

\( \sigma^2_t = \dfrac{\eta}{(1 + t)^\gamma} \).

They show that adding this noise makes networks more robust to poor initialization and helps training particularly deep and complex networks. They suspect that the added noise gives the model more chances to escape and find new local minima, which are more frequent for deeper models.

In this blog post, we have initially looked at the three variants of gradient descent, among which mini-batch gradient descent is the most popular. We have then investigated algorithms that are most commonly used for optimizing SGD: Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, as well as different algorithms to optimize asynchronous SGD. Finally, we've considered other strategies to improve SGD such as shuffling and curriculum learning, batch normalization, and early stopping.

I hope that this blog post was able to provide you with some intuitions towards the motivation and the behaviour of the different optimization algorithms. Are there any obvious algorithms to improve SGD that I've missed? What tricks are you using yourself to facilitate training with SGD? **Let me know in the comments below.**

Thanks to Denny Britz and Cesar Salgado for reading drafts of this post and providing suggestions.

This blog post is also available as an article on arXiv, in case you want to refer to it later.

In case you found it helpful, consider citing the corresponding arXiv article as:

*Sebastian Ruder (2016). An overview of gradient descent optimisation algorithms. arXiv preprint arXiv:1609.04747.*

This blog post has been translated into the following languages:

Update 21.06.16: This post was posted to Hacker News. The discussion provides some interesting pointers to related work and other techniques.

Sutton, R. S. (1986). Two problems with backpropagation and other steepest-descent learning procedures for networks. Proc. 8th Annual Conf. Cognitive Science Society. ↩

Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks : The Official Journal of the International Neural Network Society, 12(1), 145–151. http://doi.org/10.1016/S0893-6080(98)00116-6 ↩

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159. Retrieved from http://jmlr.org/papers/v12/duchi11a.html ↩

Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V, … Ng, A. Y. (2012). Large Scale Distributed Deep Networks. NIPS 2012: Neural Information Processing Systems, 1–11. http://doi.org/10.1109/ICDAR.2011.95 ↩

Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. Retrieved from http://arxiv.org/abs/1212.5701 ↩

Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence o(1/k2). Doklady ANSSSR (translated as Soviet.Math.Docl.), vol. 269, pp. 543– 547. ↩

Bengio, Y., Boulanger-Lewandowski, N., & Pascanu, R. (2012). Advances in Optimizing Recurrent Networks. Retrieved from http://arxiv.org/abs/1212.0901 ↩

Sutskever, I. (2013). Training Recurrent neural Networks. PhD Thesis. ↩

Darken, C., Chang, J., & Moody, J. (1992). Learning rate schedules for faster stochastic gradient search. Neural Networks for Signal Processing II Proceedings of the 1992 IEEE Workshop, (September), 1–11. http://doi.org/10.1109/NNSP.1992.253713 ↩

H. Robinds and S. Monro, “A stochastic approximation method,” Annals of Mathematical Statistics, vol. 22, pp. 400–407, 1951. ↩

Mcmahan, H. B., & Streeter, M. (2014). Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning. Advances in Neural Information Processing Systems (Proceedings of NIPS), 1–9. Retrieved from http://papers.nips.cc/paper/5242-delay-tolerant-algorithms-for-asynchronous-distributed-online-learning.pdf ↩

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … Zheng, X. (2015). TensorFlow : Large-Scale Machine Learning on Heterogeneous Distributed Systems. ↩

Zhang, S., Choromanska, A., & LeCun, Y. (2015). Deep learning with Elastic Averaging SGD. Neural Information Processing Systems Conference (NIPS 2015), 1–24. Retrieved from http://arxiv.org/abs/1412.6651 ↩

Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13. ↩

Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. Proceedings of the 26th Annual International Conference on Machine Learning, 41–48. http://doi.org/10.1145/1553374.1553380 ↩

Zaremba, W., & Sutskever, I. (2014). Learning to Execute, 1–25. Retrieved from http://arxiv.org/abs/1410.4615 ↩

Ioffe, S., & Szegedy, C. (2015). Batch Normalization : Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv Preprint arXiv:1502.03167v3. ↩

Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, 1–14. Retrieved from http://arxiv.org/abs/1406.2572 ↩

Sutskever, I., & Martens, J. (2013). On the importance of initialization and momentum in deep learning. http://doi.org/10.1109/ICASSP.2013.6639346 ↩

Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., & Martens, J. (2015). Adding Gradient Noise Improves Learning for Very Deep Networks, 1–11. Retrieved from http://arxiv.org/abs/1511.06807 ↩

LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). Efficient BackProp. Neural Networks: Tricks of the Trade, 1524, 9–50. http://doi.org/10.1007/3-540-49430-8_2 ↩

Niu, F., Recht, B., Christopher, R., & Wright, S. J. (2011). Hogwild! : A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 1–22. ↩

Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop, (1), 2013–2016. ↩

Duchi et al. [3] give this matrix as an alternative to the

*full*matrix containing the outer products of all previous gradients, as the computation of the matrix square root is infeasible even for a moderate number of parameters \(d\). ↩

Image credit for cover photo: Karpathy's beautiful loss functions tumblr

]]>