
Highlights of NIPS 2016: Adversarial learning, Meta-learning, and more

The Conference on Neural Information Processing Systems (NIPS) is one of the top ML conferences. This post discusses highlights of NIPS 2016 including GANs, the nuts and bolts of ML, RNNs, improvements to classic algorithms, RL, Meta-learning, and Yann LeCun's infamous cake.

Sebastian Ruder

Dec 21, 2016 • 12 min read

This post discusses highlights of the 2016 Conference on Neural Information Processing Systems (NIPS 2016).

This post originally appeared at the AYLIEN blog.

I attended NIPS 2016 in Barcelona from Monday, December 5 to Saturday, December 10. The full conference program is available here. In the following, I will share some of my highlights.

NIPS

The Conference on Neural Information Processing Systems (NIPS) is, alongside ICML, one of the two top conferences in machine learning. It was first held in 1987 and takes place every December, historically in close proximity to a ski resort. This year, in slight contrast, it took place in sunny Barcelona.

Machine Learning seems to become more pervasive every month. However, it is still sometimes hard to keep track of the actual extent of this development. One of the most accurate barometers for this evolution is the growth of NIPS itself. The number of attendees skyrocketed at this year’s conference, growing by over 50% year-over-year.

Figure 1: The growth of the number of attendees at NIPS follows (the newly coined) Terry’s Law (named after Terrence Sejnowski, the president of the NIPS foundation; faster growth than Moore's Law)

Unsurprisingly, Deep Learning (DL) was by far the most popular research topic, with roughly one in four of the more than 2,500 submitted papers (568 of which were accepted) dealing with deep neural networks.

Figure 2: Distribution of topics across all submitted papers (Source: The review process for NIPS 2016)

On the other hand, the distribution of research paper topics has quite a long tail and reflects the diversity of topics at the conference that span everything from theory to applications, from robotics to neuroscience, and from healthcare to self-driving cars.

Generative Adversarial Networks

One of the hottest developments within Deep Learning was Generative Adversarial Networks (GANs). These networks, which are trained by playing a minimax game against each other, have by now won the favor of many luminaries in the field. Yann LeCun hails them as the most exciting development in ML in recent years. The organizers and attendees of NIPS seem to side with him: NIPS featured a tutorial by Ian Goodfellow about his brainchild, which led to a packed main conference hall.

Figure 3: A full conference hall at the GAN tutorial
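
For readers who have not seen it, the minimax game is usually written as the value function below, following the original GAN formulation. The symbols are notation not used elsewhere in this post: $D$ is the discriminator, $G$ the generator, $p_{\text{data}}$ the data distribution, and $p_z$ the noise prior.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to maximize $V$ by telling real samples from generated ones, while the generator tries to minimize it by producing samples the discriminator accepts as real.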

Though a fairly recent development, there are many cool extensions of GANs among the conference papers:

Reed et al. propose a model that allows you to specify not only what you want to draw (e.g. a bird) but also where to put it in an image.
Chen et al. disentangle factors of variation in GANs by representing them with latent codes. The resulting models allow you to adjust e.g. the type of a digit, its rotation and width, etc.

In spite of their popularity, we know alarmingly little about what makes GANs so capable of generating realistic-looking images. In addition, making them work in practice is an arduous endeavour and a lot of (undocumented) hacks are necessary to achieve the best performance. Soumith Chintala presents a collection of these hacks in his "How to train your GAN" talk at the Adversarial Training workshop.

Figure 4: How to train your GAN (Source: Soumith Chintala)

Yann LeCun muses in his keynote that the development of GANs parallels the history of neural networks themselves: They were poorly understood and hard to get to work in the beginning and only took off once researchers figured out the right tricks and learned how to make them work. At this point, it seems unlikely that GANs will experience a winter anytime soon; the research community is still at the beginning in learning how to make the best use of them and it will be exciting to see what progress we can make in the coming years.

On the other hand, the success of GANs so far has been limited mostly to Computer Vision due to their difficulty in modelling discrete rather than continuous data. The Adversarial Training workshop showcased some promising work in this direction (see e.g. my colleague John Glover’s paper on modeling documents, this and this paper on generating text, and this paper on adversarial evaluation of dialogue models). We will see if 2017 will be the year in which GANs break through in NLP.

The Nuts and Bolts of Machine Learning

Andrew Ng gave one of the best tutorials of the conference with his take on building AI applications using Deep Learning. Drawing from his experience of managing the 1,300-person AI team at Baidu and hundreds of applied AI projects, and equipped solely with two whiteboards, he shared many insights about how to build and deploy AI applications in production.

Besides better hardware, Ng attributes the success of Deep Learning to two factors: First, in contrast to traditional methods, deep NNs are able to learn more effectively from large amounts of data. Second, end-to-end (supervised) Deep Learning allows us to learn to map from inputs directly to outputs.

While this approach to training chatbots or self-driving cars is sufficient for writing innovative research papers, Ng emphasized that end-to-end DL is often not production-ready: A chatbot that maps from text directly to a response is not able to have a coherent conversation or fulfill a request, and mapping from an image directly to a steering command might have literally fatal side effects if the model has not encountered the corresponding part of the input space before. Rather, for a production model, we still want to have intermediate steps: For a chatbot, we prefer to have an inference engine that generates a response; in a self-driving car, DL is used to identify obstacles, while the steering is performed by a traditional planning algorithm.

Figure 5: Andrew Ng on end-to-end DL (right: end-to-end DL chatbot and chatbot with inference engine; bottom left: end-to-end DL self-driving car and self-driving car with intermediate steps)

Ng also shared that the most common mistake he sees in project teams is that they track the wrong metrics: In an applied machine learning project, the only relevant metrics are the training error, the development error, and the test error. These metrics alone enable the project team to know what steps to take, as he demonstrated in the diagram below:

Figure 6: Andrew Ng’s flowchart for applied ML projects
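
The idea that three error rates are enough to pick the next step can be sketched in a few lines. This is a rough illustration rather than a transcription of the flowchart: the function name `next_step`, the thresholds `target_err` and `gap`, and the wording of the suggestions are all assumptions that follow the usual bias/variance recipe.

```python
def next_step(train_err, dev_err, test_err, target_err=0.05, gap=0.02):
    """Suggest what to work on next from the three error rates.

    target_err and gap are hypothetical thresholds chosen for illustration.
    """
    if train_err > target_err:
        # The model does not even fit the training set (high bias).
        return "high training error: bigger model, train longer, or new architecture"
    if dev_err - train_err > gap:
        # The model fits the training set but does not generalize (high variance).
        return "high dev error: more data, regularization, or new architecture"
    if test_err - dev_err > gap:
        # The model has been tuned too closely to the development set.
        return "high test error: get more dev data"
    return "done"


print(next_step(0.20, 0.22, 0.23))
print(next_step(0.02, 0.10, 0.11))
print(next_step(0.02, 0.03, 0.10))
print(next_step(0.02, 0.03, 0.04))
```

The checks run in order, so a team only worries about the gap between training and development error once the training error itself is acceptable.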

A key facilitator of the recent success of ML has been the advances in hardware that allowed faster computation and storage. Given that Moore's Law will reach its limits sooner or later, one might reason that the rise of ML might plateau as well. Ng, however, argued that the commitment by leading hardware manufacturers such as NVIDIA and Intel and the ensuing performance improvements to ML hardware would fuel further growth.

Among ML research areas, supervised learning is the undisputed driver of the recent success of ML and will likely continue to drive it for the foreseeable future. In second place, Ng saw neither unsupervised learning nor reinforcement learning, but transfer learning. We at AYLIEN are bullish on transfer learning for NLP and think that it has massive potential.

Recurrent Neural Networks

The conference also featured a symposium dedicated to Recurrent Neural Networks (RNNs). The symposium coincided with the 20 year anniversary of LSTM...

Figure 7: Jürgen Schmidhuber kicking off the RNN symposium

...being rejected from NIPS 1996. The fact that papers that do not use LSTMs have been rare in the most recent NLP conferences (see my EMNLP 2016 blog post) is a testament to the perseverance of the authors of the original paper, Sepp Hochreiter and Jürgen Schmidhuber.

At NIPS, we had several papers that sought to improve RNNs in different ways:

Ba et al. and Neil et al. enable RNNs to handle different time scales using fast weights and a phased variant of the LSTM, respectively.
Fraccaro et al. model uncertainty.

Other improvements apply to Deep Learning in general:

Salimans and Kingma propose Weight Normalisation, a reparameterisation of the weights that accelerates training and can be applied in two lines of Python code.
Li et al. propose a multinomial variant of dropout that sets neurons to zero depending on the data distribution.
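
To make the Weight Normalisation item above concrete: the idea is to express each weight vector as a direction times a learned scale, w = g * v / ||v||. The snippet below is a minimal NumPy sketch of that reparameterisation only, not the authors' code; the names `weight_norm`, `v`, and `g` are chosen here for illustration.

```python
import numpy as np


def weight_norm(v, g):
    """Reparameterise a weight vector as w = g * v / ||v||."""
    return g * v / np.linalg.norm(v)


v = np.array([3.0, 4.0])  # direction parameter (its length is normalised away)
g = 2.0                   # scale parameter (becomes the length of w)
w = weight_norm(v, g)

print(w, np.linalg.norm(w))
```

Because the length of `w` is always `g`, the optimizer can adjust the scale and the direction of the weights separately.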

The Neural Abstract Machines & Program Induction (NAMPI) workshop also featured several speakers talking about RNNs:

Alex Graves focused on his recent work on Adaptive Computation Time (ACT) for RNNs, which decouples the processing time from the sequence length. He showed that a word-level language model with ACT could reach state-of-the-art performance with fewer computations.
Edward Grefenstette outlined several limitations and potential future research directions in the context of RNNs in his talk.
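
The halting mechanism behind ACT can be illustrated roughly as follows. This is a simplified sketch under stated assumptions, not the implementation from the talk: the intermediate states and halting probabilities are given as fixed arrays (in a real model the network would produce them), and `act_mix` and `eps` are names chosen here. The network keeps "pondering" on one input until the accumulated halting probability gets close to 1, and the final state is a weighted mix of the intermediate states.

```python
import numpy as np


def act_mix(states, halting_probs, eps=0.01):
    """Mix intermediate states using ACT-style halting weights.

    states: array of shape (max_steps, d), hypothetical intermediate states
    halting_probs: array of shape (max_steps,), values in (0, 1)
    """
    weights = []
    cumulative = 0.0
    for n, h in enumerate(halting_probs):
        last_allowed = n == len(halting_probs) - 1
        if cumulative + h >= 1.0 - eps or last_allowed:
            # Final step: it receives the remaining probability mass.
            weights.append(1.0 - cumulative)
            break
        weights.append(h)
        cumulative += h
    weights = np.array(weights)
    n_steps = len(weights)
    # The output state is the weighted average of the states computed so far.
    mixed = weights @ states[:n_steps]
    return mixed, n_steps, weights


states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])
halting_probs = np.array([0.1, 0.3, 0.7, 0.9])
mixed, n_steps, weights = act_mix(states, halting_probs)
print(n_steps, weights, mixed)
```

In this toy run the fourth state is never used because the halting probabilities already add up to more than 1 after three steps, which is how the number of computations can differ from input to input.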

Improving classic algorithms

While Deep Learning is a fairly recent development, the conference also featured several improvements to algorithms that have been around for decades:

Ge et al. show in their best paper that the non-convex objective for matrix completion has no spurious local minima, i.e. every local minimum is a global minimum.
Bachem et al. present a method that guarantees accurate and fast seedings for large-scale k-means++ clustering. The presentation was one of the most polished ones of the conference and the code is open-source and can be installed via pip.
Ashtiani et al. show that we can make NP-hard k-means clustering problems solvable by allowing the model to pose queries for a few examples to a domain expert.
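
For context on the seeding item above, here is a sketch of the standard k-means++ seeding step that such work builds on. Note that this is the classic sampling scheme, not the accelerated method from the paper, and the function name `kmeanspp_seeding` and the toy data are assumptions made for illustration. Each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far, which requires a full pass over the data per center.

```python
import numpy as np


def kmeanspp_seeding(X, k, rng):
    """Pick k initial centers from the rows of X with classic k-means++ sampling."""
    n = X.shape[0]
    # First center: chosen uniformly at random.
    centers = [X[rng.integers(n)]]
    # Squared distance from every point to its nearest chosen center.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        # Far-away points are more likely to become the next center.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)


rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0], [10.0, 0.1]])
centers = kmeanspp_seeding(X, k=3, rng=rng)
print(centers)
```

The repeated passes over the data are what make this step expensive on large datasets, which is the cost that faster seeding methods aim to reduce.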

Reinforcement Learning

Reinforcement Learning (RL) was another much-discussed topic at NIPS with an excellent tutorial by Pieter Abbeel and John Schulman. John Schulman also gave some practical advice for getting started with RL.

One of the best papers of the conference introduces Value Iteration Networks, which learn to plan by providing a differentiable approximation to a classic planning algorithm via a CNN. This paper was another cool example of one of the major benefits of deep neural networks: They allow us to learn increasingly complex behaviour as long as we can represent it in a differentiable way. I hope to see more approaches in the future that integrate classic algorithms to enhance the capabilities of a neural network.
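
The classic planning algorithm in question is value iteration. The following is a minimal tabular sketch on a hypothetical three-state chain, not the network from the paper; the array layout of `P` and `R` and the toy problem are assumptions made for illustration. In a Value Iteration Network, the sum over next states is replaced by a convolution and the maximum over actions by a maximum over channels, which makes the whole update differentiable.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, n_iters=100):
    """Tabular value iteration.

    P: (n_actions, n_states, n_states) transition probabilities
    R: (n_states, n_actions) immediate rewards
    """
    V = np.zeros(R.shape[0])
    for _ in range(n_iters):
        # Q[s, a] = R[s, a] + gamma * sum over s' of P[a, s, s'] * V[s']
        Q = R + gamma * (P @ V).T
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)


# Hypothetical chain 0 - 1 - 2 with actions 0 = left and 1 = right.
# State 2 is an absorbing goal; stepping into it from state 1 pays reward 1.
P = np.zeros((2, 3, 3))
P[0, 0, 0] = P[0, 1, 0] = P[0, 2, 2] = 1.0  # left
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = 1.0  # right
R = np.zeros((3, 2))
R[1, 1] = 1.0

V, policy = value_iteration(P, R)
print(V, policy)
```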

During the week of the conference, several research environments for RL were simultaneously released, among them OpenAI’s Universe, DeepMind Lab, and FAIR’s TorchCraft. These will likely be a key driver in future RL research and should open up new research opportunities.

Learning-to-learn / Meta-learning

Another topic that came up in several discussions over the course of the conference was Learning-to-learn or Meta-learning:

Andrychowicz et al. learn an optimizer in a paper with the ingenious title "Learning to learn by gradient descent by gradient descent".
Vinyals et al. learn how to one-shot learn in a paper that frames one-shot learning in the sequence-to-sequence framework and has inspired new approaches for one-shot learning.

Most of the existing papers on meta-learning demonstrate that wherever you are doing something that gives you gradients, you can optimize that procedure itself with another algorithm trained via gradient descent. Prepare for a surge of “Meta-learning for X” and “(Meta-)+learning” papers in 2017. It’s LSTMs all the way down!
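
A toy version of this "gradients through the learning procedure" idea is sketched below. It is deliberately much simpler than the learned LSTM optimizer of Andrychowicz et al.: here the only thing being learned is a single learning rate `alpha`, the inner problem is a hypothetical quadratic, and all names are chosen for illustration. The inner loop runs a few steps of gradient descent, and the outer loop improves `alpha` by differentiating the final loss through those unrolled steps.

```python
def grad_f(theta, a):
    """Gradient of the toy inner objective f(theta) = 0.5 * a * theta**2."""
    return a * theta


def unrolled_loss_and_meta_grad(alpha, theta0, a, n_steps):
    """Run n_steps of gradient descent and differentiate the final loss w.r.t. alpha."""
    theta, dtheta_dalpha = theta0, 0.0
    for _ in range(n_steps):
        g = grad_f(theta, a)
        # Chain rule through the update theta_next = theta - alpha * g,
        # using the fact that the derivative of g w.r.t. theta is a.
        dtheta_dalpha = dtheta_dalpha - g - alpha * a * dtheta_dalpha
        theta = theta - alpha * g
    loss = 0.5 * a * theta ** 2
    return loss, a * theta * dtheta_dalpha


a, theta0, n_steps = 2.0, 1.0, 3
alpha = 0.1
initial_loss, _ = unrolled_loss_and_meta_grad(alpha, theta0, a, n_steps)

# Outer loop: gradient descent on the learning rate itself.
for _ in range(200):
    _, meta_grad = unrolled_loss_and_meta_grad(alpha, theta0, a, n_steps)
    alpha -= 0.01 * meta_grad

final_loss, _ = unrolled_loss_and_meta_grad(alpha, theta0, a, n_steps)
print(alpha, initial_loss, final_loss)
```

For this quadratic the best learning rate is 1 / a, so the outer loop should push `alpha` from 0.1 towards 0.5 while lowering the loss reached after the three inner steps.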

Meta-learning was also one of the key talking points at the RNN symposium. Jürgen Schmidhuber argued that a true meta-learner would be able to learn in the space of all programs and would have the ability to modify itself. He elaborated on these ideas in his talk at the NAMPI workshop. Ilya Sutskever remarked that we currently have no good meta-learning models. However, there is hope as the plethora of new research environments should also bring progress in this area.

General Artificial Intelligence

Learning how to learn also plays a role in the pursuit of the elusive goal of attaining General Artificial Intelligence, which was a topic in several keynotes. Yann LeCun argued that in order to achieve General AI, machines need to learn common sense. While common sense is often vaguely mentioned in research papers, Yann LeCun gave a succinct explanation of what common sense is: "Predicting any part of the past, present or future percepts from whatever information is available." He called this predictive learning, but noted that this is really unsupervised learning.

His talk also marked the appearance of a controversial and often tongue-in-cheek copied image of a cake, which he used to demonstrate that unsupervised learning is the most challenging task where we should concentrate our efforts, while RL is only the cherry on the icing of the cake.

Figure 8: The Cake slide of Yann LeCun's keynote

Drew Purves focused on the bilateral relationship between the environment and AI in what was probably the most aesthetically pleasing keynote of the conference (just look at those illustrations!).

Figure 9: Illustrations by Max Cant of Drew Purves' keynote (Source: Drew Purves)

He emphasized that while simulations of ecological tasks in naturalistic environments could be an important test bed for General AI, General AI is needed to maintain the biosphere in a state that will allow the continued existence of our civilization.

Figure 10: Nature needs AI and AI needs Nature from Drew Purves' keynote

While it is frequently (and incorrectly) claimed that neural networks work so well because they emulate the brain’s behaviour, Saket Navlakha argued during his keynote that we can still learn a great deal from the engineering principles of the brain. For instance, rather than pre-allocating a large number of neurons, the brain generates thousands of synapses per minute until its second year. Afterwards, until adolescence, synapses are pruned and their number decreases by ~50%.

Figure 11: Saket Navlakha’s keynote

It will be interesting to see how neuroscience can help us to advance our field further.

In the context of the Machine Intelligence workshop, another environment was introduced in the form of FAIR’s CommAI-env, which allows agents to be trained through interaction with a teacher. During the panel discussion, the ability to learn hierarchical representations and to identify patterns was emphasized. However, although the field is making rapid progress on standard tasks such as object recognition, it is unclear whether the focus on such specific tasks actually brings us closer to General AI.

Natural Language Processing

While NLP is more of a niche topic at NIPS, there were a few papers with improvements relevant to NLP:

He et al. propose a dual learning framework for MT that has two agents translating in opposite directions teaching each other via reinforcement learning.
Sokolov et al. explore how to use structured prediction under bandit feedback.
Huang et al. extend Word Mover’s Distance, an unsupervised document similarity metric, to the supervised setting.
Lee et al. model the helpfulness of reviews by taking into account position and presentation biases.

Finally, a workshop on learning methods for dialogue explored how end-to-end DL, linguistics and ML methods can be used to create dialogue agents.

Miscellaneous

Schmidhuber

Jürgen Schmidhuber, the father of the LSTM, was not only present on several panels, but also did his best to remind everyone that whatever your idea, he had had a similar idea two decades ago and you had better cite him lest he interrupt your tutorial.

NIPS2016 Day 1: Poor @Goodfellow_Ian gets Schmidhuber'ed during educational GAN Tutorial session. pic.twitter.com/IeQzKcJYiv
hardmaru (@hardmaru), December 5, 2016

Robotics

Boston Dynamics’ Spot proved that, even though everyone is excited by learning and learning-to-learn, traditional planning algorithms are enough to win the admiration of a hall full of learning enthusiasts.

Figure 12: Boston Dynamics’ Spot amid a crowd of fascinated onlookers

Apple

Apple, one of the most secretive companies in the world, has decided to be more open, to publish, and to engage with academia. This can only be good for the community. I'm particularly looking forward to more Apple research papers.

Uber

Uber announced their acquisition of Cambridge-based AI startup Geometric Intelligence and threw one of the most popular parties of NIPS.

Rocket AI

Speaking of startups, the "launch" of Rocket AI and their patented Temporally Recurrent Optimal Learning had some people fooled (note the acronyms in the tweets below). Riva-Melissa Tez finally cleared up the confusion.

#rocketai just drove me home. the team is just mind-blowing. so excited about Temporally Recurrent Optimal Learning, the next GAN!
Soumith Chintala (@soumithchintala), December 11, 2016

#rocketai definitely has the most popular Jacobian-Optimized Kernel Expansion of NIPS 2016
Ian Goodfellow (@goodfellow_ian), December 10, 2016

These were my impressions from NIPS 2016. I had a blast and hope to be back in 2017!