Blog

  • NeuroNuggets: ACL in Review II

    NeuroNuggets: ACL in Review II

    Neuromation researchers attended ACL 2019, the Annual Meeting of the Association for Computational Linguistics, which is the world’s leading conference in the field of natural language processing. With this second part of our ACL in Review series (see the first part here), I continue the experiment of writing draft notes as the ACL sections progress.

    This time, it’s the Monday evening session called “Vision, Robotics, Multimodal, Grounding and Speech”; this means that in this section, we get some nice pictures along with the text. Again, I will provide ACL Anthology links for the papers, and all images in this post are taken from the corresponding papers unless specified otherwise. The paper I want to discuss in detail was not the first in its section, but I still decided to keep the order from the conference to make it as authentic as possible.

    On to the papers!

    Visually Grounded Neural Syntax Acquisition

    How do we understand syntax? When we were children, we had to derive it from data, from the language stream directed at us. But what really helped was that language was often paired with imagery: by hearing sentences like “A cat is sleeping outside”, “A cat is staring at you”, or “There’s a cat playing with the ball” and matching them with what we saw, we could extract the notion of “a cat”.

    Haoyue Shi et al. (ACL Anthology) ask how to implement this technique for deep learning models: can we generate a linguistically plausible structure for the text given a set of parallel image-text data (say, the MS COCO dataset)? They use the notion of “concreteness”: concrete spans in the parse tree such as “a cat” are more likely to correspond to objects on a picture. This notion can be captured by a part of the network that estimates concreteness based on the interrelation between captions and images. The entire network learns a joint embedding space that unites images and constituents in the same vector space with a hinge-based triplet loss; abstractness and concreteness are defined in the same embedding space. In general, the structure looks like this:

    With this approach, the authors get a model that jointly learns parse trees and visually grounded textual representations. They show state of the art parsing results with much less data than needed for state of the art text-only models.
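
    The hinge-based triplet loss that ties images and constituents together in the joint embedding space can be sketched roughly as follows. This is a minimal PyTorch sketch of the general idea, not the authors' code; the margin value and the assumption that the i-th image matches the i-th constituent in a batch are mine.

    ```python
    import torch
    import torch.nn.functional as F

    def triplet_hinge_loss(img_emb, txt_emb, margin=0.2):
        """img_emb, txt_emb: (batch, dim) L2-normalized embeddings; row i of each forms a matching pair."""
        scores = img_emb @ txt_emb.t()                  # pairwise similarities, (batch, batch)
        pos = scores.diag().unsqueeze(1)                # similarity of each matching pair
        # hinge: every mismatched pair should score at least `margin` below the matching one
        cost_txt = (margin + scores - pos).clamp(min=0)      # rank captions for each image
        cost_img = (margin + scores - pos.t()).clamp(min=0)  # rank images for each caption
        mask = torch.eye(scores.size(0), dtype=torch.bool)
        return cost_txt.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

    loss = triplet_hinge_loss(F.normalize(torch.randn(8, 512), dim=1),
                              F.normalize(torch.randn(8, 512), dim=1))
    ```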

    Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation

    This work by Google researchers Vihan Jain et al. (ACL Anthology) deals with a rapidly growing field of vision-and-language navigation (VLN): how can we give instructions to agents in natural language and have the agents plan their actions, navigate and respond to the changes in their visual field? A characteristic example here would be the Room-to-Room (R2R) dataset that contains images of real indoor environments, where the agent is asked to follow instructions such as, e.g., “Make a left down at the narrow hall beside the office and walk straight to the exit door. Go out the door and wait.” On a map this might look something like this:

    The authors move from R2R to R4R, where instructions are more detailed, the paths are longer, and the agent is supposed to follow specific navigation instructions rather than just find the shortest path from point to point. For instance, the agent that found the red path in the right part of the image above would be penalized if the actual instruction was to go along the blue path; the agent using the orange path in the image does a better job even though it fails to actually reach the target.

    The models, all based on the reinforced cross-modal matching (RCM) model by Wang et al., now also come in two variations with different rewards: goal-oriented agents just want to reach the goal, while fidelity-oriented agents have an objective function that rewards following the reference path. It is no wonder that the latter does a better job with R4R. Generally, the work argues that path fidelity is a better objective if our goal is to better understand natural language instructions — the problem being to understand the instruction in its entirety rather than just extract the endpoint.
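
    To make the distinction concrete, here is a toy sketch (pure Python, my own simplification rather than the paper's actual fidelity metric) contrasting the two reward styles: a goal-oriented reward only checks the endpoint, while a fidelity-oriented reward also checks how well the reference path was covered.

    ```python
    def goal_reward(agent_path, reference_path, radius=1.0, dist=lambda a, b: abs(a - b)):
        # 1 if the agent ends up close enough to the goal, regardless of the route taken
        return 1.0 if dist(agent_path[-1], reference_path[-1]) <= radius else 0.0

    def fidelity_reward(agent_path, reference_path, radius=1.0, dist=lambda a, b: abs(a - b)):
        # crude coverage proxy: fraction of reference points the agent passed close to
        covered = sum(1 for r in reference_path if any(dist(a, r) <= radius for a in agent_path))
        return covered / len(reference_path)

    # With 1-D "positions" for brevity: a shortcut reaches the goal but ignores the route.
    reference = [0, 1, 2, 3, 4, 5]
    shortcut = [0, 5]
    print(goal_reward(shortcut, reference), fidelity_reward(shortcut, reference))  # 1.0 vs ~0.67
    ```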

    Expressing Visual Relationships via Language

    And here comes our highlight for the section on visually grounded NLP. In this work, Adobe researchers Hao Tan et al. (ACL Anthology) move from the already classic problem of image captioning, i.e., describing with natural language what an image contains, to image editing, i.e., describing what we want to do with the image. We want a model that can get a request such as “Add a sword and a cloak to the squirrel” and do something like this:

    So the first question is how you collect a dataset of this kind. The supervised dataset should consist of triples: the original image, an editing request, and the modified image. First, the authors crawled a collaborative image editing community called Zhopped (note: I will be hugely surprised if there are no Russian speakers behind this website) and Reddit; specifically, there is a subreddit called r/PhotoshopRequest where you can ask people to help you with image editing. This yielded the pairs of original and edited images. Although Reddit and Zhopped both contain the original editing requests from users, these are usually very noisy and often conversational, so the authors opted to re-do all the requests manually through crowdsourcing.

    This procedure yielded the image editing dataset. The authors also used the Spot-the-Diff dataset from (Jhamtani, Berg-Kirkpatrick, 2018) that focuses on finding changes between two images. The problem is now to generate text from a pair of images, like this:

    The third dataset with image-image-text triples is the NLVR2 dataset (Suhr et al., 2018) that emphasizes the relationship between the two images. Given two images and a statement, you are supposed to classify whether the statement is true or false; for the purposes of this paper, the authors simply used the correct statements and converted this into a captioning problem for a pair of images:

    Now that we have the data, what about the models? To be clear, let’s concentrate on the task of generating a sentence that describes the relationship between a pair of images. There are four different models used in the paper, with a natural succession between them. Let’s look at the flowchart and then discuss it:

    This is quite a lot to parse, but actually this is a careful build-up of well-known ideas in the field. The first model (a) is an adaptation of the encoder-decoder model with attention, very similar to the ones used by Xu et al. and Jhamtani and Berg-Kirkpatrick. It constructs features from input images, concatenates them, and then uses this as context to predict the next word with a recurrent architecture.

    The basic model, however, does not even differentiate between the two input images. To fix this, model (b) moves on to multi-head attention, an idea made very popular in NLP by Transformer and its follow-up models. In model (b), attention is applied sequentially, so that when the model is attending to the target image it can have context from the source image already available, and it can immediately know where to look for differences.
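
    Here is a minimal sketch of the sequential attention idea in model (b), in PyTorch: the decoder state first attends to the source image features, and the result conditions the query when attending to the target image. The dimensions and the way the two contexts are combined are my assumptions for illustration, not the authors' exact architecture.

    ```python
    import torch
    import torch.nn as nn

    dim, heads = 256, 8
    attend_src = nn.MultiheadAttention(dim, heads, batch_first=True)
    attend_tgt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def sequential_attention(hidden, src_feats, tgt_feats):
        """hidden: (B, 1, dim) decoder state; src_feats, tgt_feats: (B, N, dim) image features."""
        src_ctx, _ = attend_src(hidden, src_feats, src_feats)            # look at the source image first
        tgt_ctx, _ = attend_tgt(hidden + src_ctx, tgt_feats, tgt_feats)  # then at the target, given the source
        return torch.cat([src_ctx, tgt_ctx], dim=-1)                     # context for predicting the next word

    ctx = sequential_attention(torch.randn(2, 1, dim), torch.randn(2, 36, dim), torch.randn(2, 36, dim))
    print(ctx.shape)  # torch.Size([2, 1, 512])
    ```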

    Models (c) and (d) introduce the concept of relational attention. This means that they can compute relational scores between the source and target images and vice versa (as you can see, there are two attention modules there). In the static model (c), the scores are then compressed into two feature sequences, possibly losing some information along the way, while the dynamic model (d) computes them dynamically during decoding and has access to the full scores.

    Naturally, this progression means that quality metrics improve as we move from model (a) to model (d). Here are some sample results from the paper, both positive and negative:

    As you can see, sometimes state-of-the-art models are actually pretty good at understanding what is going on with the images, but sometimes they are lost and definitely don’t understand what they’re talking about.

    Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

    Traditional video grounding is the problem of localizing a spatial region in certain video frames that corresponds to a specific part of a natural language query (say, find the part of the “spatio-temporal tube”, i.e., the video tensor, that corresponds to a coffee cup). This, however, requires dense fine-grained regional annotations in the video, which are very hard to obtain, so this work by Zhenfang Chen et al. (ACL Anthology) considers weakly supervised video grounding. Moreover, they move to the general problem of localizing a spatio-temporal tube that corresponds to a given sentence as a whole, not to a specific noun. They call this problem weakly-supervised spatio-temporally grounding sentence in video (WSSTG); like this:

    To solve the problem, the authors use a pipeline with a standard Faster R-CNN object detector to generate bounding box proposals (Instance Generator below), an “Attentive Interactor” module that unites RNN-produced representations for the text and the proposed regions, and then the whole thing is trained with a ranking loss and a diversity loss with multiple instance learning. Here is the whole pipeline:
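
    The ranking part of this objective can be sketched roughly as follows; this is my own simplified illustration of multiple-instance ranking (not the authors' implementation, which also includes a diversity term): each video yields a bag of tube proposals, the sentence-video score is that of the best-matching proposal, and a hinge loss pushes matched sentence-video pairs above mismatched ones.

    ```python
    import torch

    def mil_ranking_loss(tube_emb, sent_emb, margin=0.5):
        """tube_emb: (B, P, dim) proposal embeddings per video; sent_emb: (B, dim), row i matches video i."""
        # best-matching proposal defines the video-sentence score (multiple instance learning)
        scores = torch.einsum('bpd,cd->bcp', tube_emb, sent_emb).max(dim=-1).values  # (videos, sentences)
        pos = scores.diag().unsqueeze(1)
        hinge = (margin + scores - pos).clamp(min=0)
        hinge = hinge.masked_fill(torch.eye(scores.size(0), dtype=torch.bool), 0)
        return hinge.mean()

    loss = mil_ranking_loss(torch.randn(4, 20, 256), torch.randn(4, 256))
    ```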

    The authors also collect and present a new dataset designed for this problem, with videos where some target regions are annotated with sentences. There is nothing too sensational in this approach, but, as often happens with modern deep learning models, it is a big surprise that it actually does work! The resulting model can correctly understand quite complicated queries. Here is a sample comparison from the paper:

    The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue

    And the last paper of the day was by Janosch Haber et al. (ACL Anthology). This dataset concentrates not on captioning, but on visually-grounded dialogue, i.e., conversations grounded in images. The authors argue that one important reason why dialogue is hard is that the participants rely on their shared knowledge and shared linguistic experience that they build during the conversation, and current dialogue models still cannot really capture this “common ground”. So their proposal is to create a dataset where the grounding is not only conversational but also visual. This is implemented as a game: two participants see six photos each, and they need to find out in dialogue which of the three highlighted photos they have in common (some of them are the same and some are different). They do it in natural dialogue, and the visual domain is controlled too, so the images are sufficiently similar to require an in-depth description from the participants. Here is an example:

    Moreover, the game consists of five rounds (hence the “Page 1 of 5” in top left), and in subsequent rounds some images can reappear again, which motivates the participants to make mutually referring descriptions of images. Therefore, this large-scale crowdsourced data collection allows the authors to not only have a good dataset for training but also make some conclusions on how people talk about images. Especially interesting conclusions deal with how the game changes over the five rounds: as the game progresses, utterances become much shorter, the fraction of nouns and content words increases significantly, but these words also begin to repeat themselves a lot, so there are fewer new nouns introduced in later rounds. This is exactly the “common ground” that is hard to capture for the conversational model.

    Then the authors present two baseline models for visual grounding, one that has no dialogue history and one that receives (in processed form) the references from previous rounds of the conversation. Naturally, the latter model is more successful on later stages of the game; in the example below, both models do just fine in the left example but only the history-based model can manage the example on the right (and no wonder!):

    But both models are still far from perfect, and, of course, the authors hope that this dataset will serve as a common ground (pardon the pun) for further research in the field.

    With this, we finish the “Vision, Robotics, Multimodal, Grounding and Speech” section. We are often bombarded by stories about sensational AI achievements from the media. Usually the journalists are not trying to lie to us, but it’s often hard to say whether they are showing a best-case cherry-picked example or a solution ready to go to production. Thus, it was very enlightening to see what exactly the state of the art really is in these things. For most models that we’ve seen today, my conclusion is: sometimes they work, and you can find really cool examples if you look, but very often they still get lost. On the other hand, a lot of this research sounds very promising for real world applications as well. We should stay tuned to this research, but it’s clear that true deep understanding of images from the real world and a genuine ability to put images into words or vice versa are still quite far away.

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • NeuroNuggets: ACL in Review I

    NeuroNuggets: ACL in Review I

    Neuromation researchers are currently attending ACL 2019, the Annual Meeting of the Association for Computational Linguistics, which is the world’s leading conference in the field of natural language processing. While we do have a paper here, “Large-Scale Transfer Learning for Natural Language Generation”, in this series we will be concentrating on some of the most interesting works we are seeing from others.

    Since I am now attending this multi-day conference in person, I’d like to switch gears in my reviews a little. This time, let’s follow the conference as we experience it. Research conferences are usually organized into tracks: you go to a specific room where you listen to several talks united by some common topic. Thus, I will organize this “ACL in Review” NeuroNugget series into ACL tracks that I personally visited. I typed in the rough notes for this piece as I was sitting in on the talks at ACL and later just edited them lightly for readability. For every track, I will choose one paper to highlight in detail and provide very brief summaries of the rest.

    So, without further ado, let’s get started! Following is part 1 of my personal experience of the ACL 2019 Conference.

    The first track I visited was called “Machine Learning 1”, which was chaired by Chris Dyer, a researcher at DeepMind who is famous for his work in deep learning for NLP. The track collected papers that have ideas of interest for natural language processing and even beyond to machine learning in general. I will begin with the highlight paper and then proceed to describe the rest.

    In line with other top NLP conferences, ACL has the laudable tradition of publishing all accepted papers online in the ACL Anthology. So, I will provide ACL Anthology links for the papers we discuss here, rather than arXiv or some other repository. All pictures are taken from the corresponding papers unless specified otherwise.

    Augmenting Neural Networks with First Order Logic

    As we know, neural networks are great but still suffer from a number of issues. First of all, they are basically end-to-end differentiable black boxes, and although they have a lot of hidden variables, these variables are hard to interpret and match with something that we could add external knowledge to. In this work, Tao Li and Vivek Srikumar (ACL Anthology) present a framework for adding domain knowledge to neural networks.

    Consider this reading comprehension question answering example from the paper:

    In this example, an attention-based model will try to learn an alignment between words in the question and in the paragraph. If you have a huge dataset and a large model, then, of course, mapping “author” to “writing” can occur naturally. If not, however, it is much easier if the network has some way to, e.g., align words that are labelled as similar in some knowledge base. Such knowledge bases, of course, exist and are readily available for most languages.

    This is an example where we have domain knowledge about the problem in the form of rules: if the two words are considered related (a binary predicate holds) then align them. There are a lot of things you can encode with such rules if you allow the rules to be expressed in first-order logic, i.e., in the form of logical formulas that can contain predefined predicates. This way to define the rules is very expressive, easy for experts to state and for non-experts to understand, so it is a very natural idea to try to add them to neural networks.

    First idea: let’s have named neurons! That is, suppose that some neurons in the network are associated with an externally-defined meaning. For example, you have a neuron, say a, and another neuron, say b, and you want the result on the next layer, a neuron c, to be a logical conjunction of these two: c=a&b. Sounds like a very easy function to add to the computational graph… but it’s definitely not a differentiable function! Backpropagation will break on a logical “AND” as much as it would break on, say, the argmax function. So how can we insert the predicates and first-order rules into a neural network, where differentiability is paramount?

    We need to soften the logical operations, replacing them with differentiable functions in such a way that the result reflects how much we believe the predicate to be true. Li and Srikumar propose to do this with distance functions: given a neuron y=g(Wx) for some activation function g, inputs x, and weights W, and assuming we want to express some conditional statement Z→Y, where Z is some formula over variables z and Y is the variable associated with y, we define a constrained neural layer as y=g(Wx+ρd(z)), where d(z) is the distance function corresponding to the statement. The authors introduce such functions based on the Lukasiewicz t-norm; say, (NOT a) corresponds to (1-a) and (a&b) corresponds to max(0,a+b-1). Then, provided that your logical formulas do not introduce cycles in the computational graph, you can map them inside the neural network.
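
    A minimal sketch of such a constrained layer, assuming a single soft rule Z→Y whose premise truth values are given as an extra input; this is an illustration of the idea rather than Li and Srikumar's code, and the choice d(z) = z is my simplification of the soft-logic distance.

    ```python
    import torch
    import torch.nn as nn

    class ConstrainedLinear(nn.Module):
        """y = g(Wx + rho * d(z)): an ordinary layer whose pre-activation is nudged by the rule."""
        def __init__(self, in_dim, out_dim, rho=1.0):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)
            self.rho = rho

        def forward(self, x, z):
            # x: (B, in_dim) ordinary inputs; z: (B, out_dim) soft truth values of the rule's premise.
            # For the implication z -> y we take d(z) = z, so a premise close to 1 pushes y up;
            # everything stays differentiable, so backpropagation works as usual.
            return torch.sigmoid(self.linear(x) + self.rho * z)

    layer = ConstrainedLinear(16, 4)
    y = layer(torch.randn(2, 16), torch.rand(2, 4))
    ```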

    The authors augmented several models in this way: a decomposable attention model (Parikh et al., 2016) for natural language inference, BiDAF (Seo et al., 2016) for machine comprehension, and others. For example, in BiDAF the model contains encoded representations of paragraph and query, attention vectors, and the outputs. The logical rules go like this: if two words in the paragraph and query, p and q, are related in the ConceptNet knowledge base, Related(p, q), then align them in the attention layer. As for natural language inference, here the rule is that if Related(p, h) and the unconstrained network strongly believes that the words should align, then align them.

    The authors show that the results improve across all experiments, but especially significant improvements result when you have relatively little data: the authors experimented by taking only 1%, 5%, 10% and so on of the same dataset for training. When using 100% of the dataset, there is even a slight deterioration in the results in the case of natural language inference, which probably means that the rules are a little noisy, and the network is already better off learning the rules from data.

    Thus, the conclusion is simple: if you have a huge dataset, just believe the data. If not, you may be better off adding external domain knowledge, and this paper gives you one way to add it to our beloved neural black boxes. I hope to see more of this connection between domain knowledge and neural networks in the future: it has always sounded absurd to me to just throw away all the knowledge we have accumulated (although sometimes, like in chess, this is exactly the way to go… oh well, nobody said it would be easy).

    Self-Regulated Interactive Sequence-to-Sequence Learning

    This work by Julia Kreutzer and Stefan Riezler (ACL Anthology) presents an interesting way to combine various different supervision types. As you know, machine learning techniques are distinguished by supervision types: supervised, semi-supervised, unsupervised, reinforcement. This work tries to find the right balance between different supervision types and to teach a model to integrate multiple different types of supervision. For example, for sequence-to-sequence learning in machine translation, suppose you have a human “teacher” who can mark and/or correct wrong parts of a translated sentence. The solution is a self-regulation approach based on reinforcement learning: there is a regulator model that chooses the supervision type, which serves as the action for the RL agent. The authors successfully evaluated this approach on the personalization task (online domain adaptation from news to TED talks and from English to German) and on machine translation, where the supervision was simulated with reference translations. This looks a lot like active learning, so they also compared it (favorably) with traditional uncertainty-based active learning techniques.

    Neural Sequence-to-Sequence Models from Weak Feedback with Bipolar Ramp Loss

    In this work, Laura Jehl et al. (ACL Anthology) consider the often-encountered case when golden ground truth supervision is not available but there is still some weak supervision. For example, in semantic parsing for question answering, it’s much easier to collect question-answer pairs than question-parse pairs. Usually, people solve this problem by using metric-augmented objectives where the external metric can be computed with the “golden” target and can help guide the learning process, but where the metric is far removed from the actual structure produced by the model and may be unreliable. The intuition is to use this metric to both reward “good” structures and punish “bad” structures according to this weak feedback.

    The authors consider several different objectives: in minimum risk training (MRT), the external metric assigns rewards to model outputs. In the ramp loss objective, the function encourages answers that have high probability and a high feedback score (hope outputs) and discourages answers that have a low feedback score but still a high probability of appearing (fear outputs). In the new objective presented in this work, token-level ramp loss, the ramp loss idea is pushed down to individual tokens, encouraging tokens that occur only in the positive example, discouraging tokens from the negative example, and leaving untouched the tokens that appear in both.
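
    Here is a schematic sketch of the bipolar ramp loss idea (my own simplification at the sequence level, not the authors' exact objective): pick a "hope" output that is both probable and well-scored by the weak feedback, a "fear" output that is probable but poorly scored, and push their log-probabilities apart.

    ```python
    import torch

    def ramp_loss(log_probs, feedback):
        """log_probs: (K,) model log-probabilities of K candidate outputs;
        feedback: (K,) weak-feedback scores for the same candidates."""
        hope = torch.argmax(log_probs + feedback)   # probable and encouraged by weak feedback
        fear = torch.argmax(log_probs - feedback)   # probable but discouraged by weak feedback
        return -(log_probs[hope] - log_probs[fear])

    loss = ramp_loss(torch.log_softmax(torch.randn(5), dim=0), torch.rand(5))
    ```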

    The authors apply these objectives to an encoder-decoder model with attention pretrained with maximum likelihood estimation and report improvements for semantic parsing and weakly supervised machine translation for ramp loss over MRT and for token-level ramp loss over the regular ramp loss.

    You Only Need Attention to Traverse Trees

    Models based on self-attention have completely transformed the field of natural language processing over the last year. Transformer, BERT, OpenAI GPT and their successors have been instrumental in most of the recent famous NLP advances. In this work (ACL Anthology), Mahtab Ahmed et al. present an extension to trees of the self-attention framework that worked so well for sequences, proposing a Tree Transformer model that can handle phrase-level syntax from constituency trees. Basically, they use the attention module as the composition function in a recursive tree-based structure. I won’t go into much detail but here is their main illustration:

    They evaluated their results on the Stanford Sentiment Treebank for sentiment analysis, the Sentences Involving Compositional Knowledge (SICK) dataset for semantic relatedness, and other tasks, getting results close to state of the art tree-structured models and significantly better than the regular Transformer.

    The Referential Reader: A Recurrent Entity Network for Anaphora Resolution

    In this work from Facebook Research, Fei Liu et al. (ACL Anthology) tackle the hard and interesting problem of anaphora resolution, that is, which concept or previously mentioned word/phrase does a given pronoun refer to? It (please resolve my anaphora here) is an interesting problem, and actually even the very best models are still far from perfect at this task — getting anaphora resolution right in corner cases would require some deep knowledge about the world. The authors present the so-called Referential Reader model: it identifies entity references and stores them in a fixed-length memory, with the update and overwrite operations for the memory available for the model. The memory is controlled with a GRU-based language model, with special saliency weights showing which memory cells are still important and which are safe to overwrite. So, in the example below, we first store “Ishmael” in the first cell and increase its saliency, then overwrite the second cell (with low saliency) with “Captain Ahab” (first “captain”, then updated with “Ahab”), and then resolve the “he” reference by choosing the first memory cell contents:

    All of this is controlled with recurrent neural nets (I won’t go into the details of the control mechanism, check the paper for that), and the results significantly improve over current state of the art. Moreover, this approach can benefit from unsupervised pre-training via better language models, and the best results are achieved with state of the art BERT language models.

    Adaptive Attention Span in Transformers

    And now back to Transformers. The original Transformer is based on a multi-head attention system where each head receives query-key-value triples produced from the input vectors and then combines them into attention values. Sainbayar Sukhbaatar et al. (ACL Anthology) attempt to make self-attention more efficient. The problem they are trying to fix is that networks based on self-attention take into account a very long context (up to 3800 tokens for Transformer-XL), which leads to very high memory requirements for the models since you need to compute self-attention with every context word in every attention head. The authors note that some heads do use the long context (it has been shown that long context in general provides a lot of benefits here) but some do not, using much shorter “attention spans” of only a few tokens. Therefore, the authors introduce adaptive attention spans, adding a special masking function to limit the context length and learning its parameters, thus forcing the model to use shorter spans with special regularization.
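
    The masking function itself is simple; here is a small sketch of the soft span mask following the paper's description (the ramp width and the span value below are placeholders): tokens farther away than the learnable span z are smoothly pushed towards zero weight, so z can be trained by gradient descent and regularized to stay small.

    ```python
    import torch

    def span_mask(distances, z, ramp=32.0):
        """distances: (T,) distance of each context token from the current position;
        z: learnable span parameter (scalar tensor). Returns a factor in [0, 1] per token."""
        return torch.clamp((ramp + z - distances) / ramp, min=0.0, max=1.0)

    mask = span_mask(torch.arange(0, 512, dtype=torch.float), z=torch.tensor(100.0))
    # multiply the attention weights by `mask` (and renormalize) before mixing the values
    ```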

    The authors compare their models with regular Transformer-based networks and show that they can achieve the same or even slightly better results on held-out test sets with significantly smaller models. Another interesting observation is that in the resulting models, the early layers almost universally use very short attention spans, and only the later layers (say, layers 8–12 in a 12-layer model) can make good use of long contexts. Here are two examples: on the left, we see head B in a regular Transformer caring about much shorter context than head A; on the right, we see the actual attention spans across various layers:

    With this, the first track was over; all of these papers were presented as 15–20 min talks, with less than two hours for all of the above, including questions and technical breaks for changing the speakers. So yes, if you try to actively understand everything that’s going on in your session of a major conference, it is a rather taxing exercise — but a rewarding one too! See you next time, when we continue this exercise with other, no less interesting tracks from ACL 2019.

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • Deep Learning Topics in Computer Vision at Harbour Space

    Deep Learning Topics in Computer Vision at Harbour Space

    In May and June, Neuromation Chief Research Officer Sergey Nikolenko and Neuromation Researcher Alex Davydow taught a course at the Harbour Space University in Barcelona. The course was called “Neural Networks and Computer Vision” and was devoted to deep learning topics in computer vision, such as convolutional neural networks, object detection, segmentation and GANs for image generation. You can find the course program at the Harbour Space website and Sergey has made the slides available on his own website.

    Harbour Space is a very interesting university in that they teach a serious and quite comprehensive program that includes computer science and data science as well as a number of other disciplines. In order to accomplish all this, they came up with the idea of teaching in three-week units where the students work on a single topic every day, basically giving them the equivalent of a semester-long course taught once a week. The upside of this system is that Harbour Space is able to invite professors who wouldn’t be able to come to Barcelona to teach for half a year, and they get top names in every field to teach.

    While Sergey had taught at Harbour Space in the past, this was the first time the course was supported by Neuromation and the first time the practical work of the course was conducted by students exclusively on the Neuromation Platform.

    We were extremely pleased with the quality of the students at Harbour Space and look forward to working with them and seeing great things from them in the future. Thank you Harbour Space!

  • NeuroNuggets: Neural Programming, or ICLR in Review II

    NeuroNuggets: Neural Programming, or ICLR in Review II

    In the previous NeuroNugget, I talked about Dmitry Vetrov’s three papers that were accepted at ICLR 2019 as an excuse to discuss Bayesian methods in deep learning. Today, we will look at another topic that featured quite prominently at ICLR: an exciting subfield of deep learning where neural networks are learning to program. ICLR 2019 has accepted one paper on program induction and three on program synthesis, and it turns out that in essence, all of these methods are trained on synthetic datasets, which is one of the main focuses of Neuromation. In fact, one of the ICLR papers in question presents an improved way to generate this data.

    So, will software developers be out of a job soon? Let’s find out. We will begin by reviewing the most interesting ideas and works in the field and then proceed to more recent developments.

    Image source

    Program Induction and Program Synthesis

    Artificial intelligence researchers have wanted to teach computers to program for a long time. Back in 1970, two Stanford researchers, Zohar Manna and Richard Waldinger, published a paper called “Towards Automatic Program Synthesis”. They did it, in the fashion of the times, by extracting the program from a formal proof of a theorem derived from the program’s formal specifications. In this way, the theorem was supposed to be proven automatically. Waldinger (in collaboration with Richard Lee) even implemented this approach, writing a program called PROW (for “program writing”) that used a resolution-based theorem prover and could output programs in LISP along with a formal proof of their correctness.

    This was only the beginning, and such deductive methods for automated program synthesis became an entire field of study.

    But today, we will be talking about a different problem setting sometimes called Programming by Examples (PBE), whereby a program is derived from a set of sample results, i.e., input-output pairs. The deductive approaches pioneered by Manna and Waldinger are alive and well in PBE: for instance, the field was largely initiated by Sumit Gulwani’s paper called “Automating String Processing in Spreadsheets Using Input-Output Examples”. In this paper, Gulwani defined a formal language for string manipulation that is, on the one hand, expressive enough to be useful, and on the other hand simple enough to be amenable to PBE.

    The basic idea of teaching a machine learning model to program comes in two forms:

    • program induction aims to train an end-to-end model to capture an algorithm;
    • program synthesis tries to teach a model the semantics of some (usually pretty restrictive) programming language so that the model is able to generate programs.

    Basically, in program induction, the network is the program, while in program synthesis the network writes the program.

    Intuitively, program synthesis seems preferable for real use, at least because it is far easier to test. In program induction, you have to keep testing the black box extensively, and even then you can never be sure it’s correct; whereas in program synthesis, you can rely upon a wide variety of program verification methods (that come from mathematical logic), and for the kind of programs we are talking about in modern state of the art, verification is almost always trivial (the problems are seldom more complicated than simple arithmetic). On the other hand, program synthesis is much harder: writing out the “vague understanding” that a black box might have sounds like a very hard problem in itself.

    Naturally, both problems require large datasets of programs together with their input-output pairs. At present, no such large datasets exist, although some effort has been made to collect real programs by extracting snippets from programming competitions and the like, see (Zavershynskyi et al., 2018). But, thankfully, generating synthetic programs and running them is relatively easy, arguably even easier than generating synthetic data for computer vision. So, all modern approaches use synthetic data (randomly generated programs) to train their “neural computers”.

    Program Induction: from Turing Machines to Passing High School Math

    A working mechanical Turing machine made of wood; see this link for a video and more details.

    It all started with a paper by Alex Graves et al., with the intriguing title Neural Turing Machines. In case you are new to computer science (welcome!), a Turing machine is the most popular and most widely used elementary formalization of a computer. It’s got an infinite memory tape and a head that can move around the tape and read and write symbols (see more, e.g., on Wikipedia, and don’t miss the visualizations). The programs are transition rules in the set of the machine’s configurations; a transition can be conditional on what symbol is under the head at any given moment, but that’s it.

    Alan Turing found that even this elementary construction is expressive enough to implement basically any possible algorithm (this idea is known as the Church-Turing thesis). By now, Turing machines are the most fundamental construction in computer science: we define an algorithm as a Turing machine, we define complexity classes based on the time Turing machines take to solve problems, and so on.

    The idea of Graves et al. was to use neural networks to operate with external memory, which, similarly to Turing machines, is formalized as a “tape” or a set of memory locations. The neural network serves as a controller, telling the basic architecture what to read from memory, what to write there, and what to output to the environment.

    Image source

    The memory matrix Mt at time t consists of N memory locations, each containing a vector of size M, and the neural controller, based on memory location weighting wt (kind of an attention mechanism), produces two control vectors:

    • e_t, the erase vector, which defines what to delete from memory: M’_t(i) := M_{t-1}(i)(1 - w_t(i) e_t);
    • a_t, the add vector, which defines what to add to memory: M_t(i) := M’_t(i) + w_t(i) a_t (a small numeric sketch of this update follows below).
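
    Here is the promised sketch of the erase/add update (an illustration only, with random controller outputs standing in for the real thing):

    ```python
    import torch

    N, M = 8, 4                                  # memory: N locations, each a vector of size M
    memory = torch.zeros(N, M)
    w = torch.softmax(torch.randn(N), dim=0)     # location weighting produced by the controller
    e = torch.sigmoid(torch.randn(M))            # erase vector, entries in [0, 1]
    a = torch.randn(M)                           # add vector

    memory = memory * (1 - w.unsqueeze(1) * e)   # erase: M'_t(i) = M_{t-1}(i) * (1 - w_t(i) e_t)
    memory = memory + w.unsqueeze(1) * a         # add:   M_t(i)  = M'_t(i) + w_t(i) a_t
    ```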

    The weighting wt is produced by a separate part of the controller (the most interesting part of the architecture actually, but let’s not dwell on that now).

    For training, the network receives an input “tape” (sequence of symbols) and an output tape and has to be able to produce the output from the input:

    The Neural Turing machine structure. Image source.

    Graves et al. showed that the NTM architectures, especially with recurrent (LSTM-based) constructions, are able to generalize well, learning the ideas of programs from examples. For instance, an NTM trained to copy a vector from input to output (after a delimiter is seen, so you have to store the sequence in memory before writing it out) was able to generalize from examples of length 20 and then copy sequences of length up to 120. And if you look into what the NTM has learned to do, it indeed looks like a simple algorithm for manipulating arrays: first read the input cell by cell into memory, then read the memory cells one by one, copying them to the output tape. Here is a map of memory use for reading and writing:

    Image source

    The next development by Graves et al. came in 2016, when they developed NTMs into a differentiable neural computer (DNC), an architecture featured in Nature (Graves et al., 2016). They added temporal links that are able to tell which memory cells have been written to earlier or later and memory usage information that can inform the controller about memory state in a more direct and simple way. The resulting architecture (shown below) is able to learn simple programs even better, although training it is even harder.

    Differentiable neural computer. Source

    Another successful take on program induction is the construction of Neural GPUs (Kaiser and Sutskever, 2015). They note that neural Turing machines are hard to train due to their inherently sequential nature: NTMs try to mimic Turing machines that are regular sequential computers with very limited computation done on each step. But programs with a lot of steps (i.e., any realistic ones) are very hard to train in this way. Neural models are much more suitable for GPU-like computation, a relatively short sequence of highly parallelized and powerful computations. The architecture is based on applying a sequence of convolutional GRUs, a radically simpler idea that can be compared to regular or stacked RNNs rather than involved architectures such as the neural Turing machine. See below:

    Image source

    Despite their simplicity, Neural GPUs were able to train surprisingly well, learning to perform both binary addition and even binary multiplication — a very hard problem! Multiplication is a highly nontrivial superlinear algorithm, and previous models had absolutely zero success with it.

    There have been more developments in program induction, of course, but let’s skip forward to the ICLR 2019 paper by DeepMind researchers Saxton et al. (2019). In a work called “Analysing Mathematical Reasoning Abilities of Neural Models”, they add another level of indirection to the program induction problem, asking models to answer mathematical questions formulated in natural language. That is, the inputs can look like this:

    • Solve 2(x-5)+5=10-x for x
    • Factorize 2x²-x-1.
    • Calculate -841880142.544 + 411127.
    • Three letters picked without replacement from qqqkkklkqkkk. Give prob of sequence qql.
    • Let u(n) = -n**3 - n**2. Let e(c) = -2*c**3 + c. Let l(j) = -118*e(j) + 54*u(j). What is the derivative of l(a)?
    • Is 106033 prime?

    Sounds pretty complicated, right? E.g., primality testing requires a really complicated algorithm, and one cannot really expect a recurrent neural network to pick it up from examples. Still, Saxton et al. show that the standard state of the art attention-based architecture, the Transformer (first introduced for machine translation in “Attention is all you need” and widely used all over natural language processing since), performs pretty well, giving plausible answers even for very hard questions!

    The most interesting test for the Transformer came in the form of a standard math exam administered to 16-year-old British schoolchildren. Saxton et al. excluded problems with plots or tables and came up with 40 problems that cover a similar range of problems as they had trained on. The trained Transformer-based model got 14/40 questions correct, equivalent to an E grade student! So, while we are not quite there yet, pretty soon some collection of deep learning models may be able to graduate high school, at least in math.

    Neural Program Synthesis

    Okay, that was program induction, but what if we want models to actually write the program? Usually program synthesis models learn an algorithm from examples in the form of an abstract syntax tree (AST). ASTs are a natural internal representation of algorithms in the form of an execution tree that is produced by programming language parsers. For example, here is the Wikipedia example, an AST for Euclid’s algorithm that finds the greatest common divisor:

    Image source

    The first attempts to solve the problem with deep learning used improvements of the well-known Flash Fill algorithm (Gulwani, 2011) designed to infer a string transformation algorithm in the form of an AST. Flash Fill was developed by Microsoft researcher Sumit Gulwani, and is more popular than you can imagine: it is installed into every copy of Microsoft Excel! Flash Fill is the algorithm that fills in Excel cells based on an inferred pattern (see, e.g., here).

    Parisotto et al. (2016) presented what they called a Neuro-Symbolic program synthesis method based on Recursive-Reverse-Recursive Neural Networks (R3NNs). The problem, again, is to produce an AST for string manipulation. Here is an example where we need to transform examples on the left into the program on the right that transforms a full name into last name with the first initial:

    Image source

    The model gradually constructs an AST consistent with input examples by building partial trees and expanding their leaves one by one, until all leaves represent terminals. To do that, the R3NN does two passes along the tree, first computing internal representations (embeddings) for every node from leaves to root and then computing the probabilities of valid expansions during the backward pass. We won’t go into too much detail, but here is the general idea:

    Image source

    Balog et al. (2016) present the so-called DeepCoder model. The idea here is that actually, the program synthesis problem can be solved with search-based techniques: one can simply try to enumerate all possible ASTs, discarding those that are inconsistent with available examples. Naturally, even for restricted domain-specific languages this kind of search is very computationally intensive, let alone for general-purpose programming languages, and can be done only for very short programs. But then, it’s not like end-to-end techniques can find long programs either, and there is a place for machine learning in search-based methods: models can learn what kind of nodes (functions) or tree structures are more likely given the available examples and thus guide the search. Specifically, DeepCoder predicts the probability that a given function will appear in the code, given input-output examples:

    Image source

    Balog et al. report that the resulting guided search-based system speeds up regular search by several orders of magnitude and can even solve some problems from real-life programming competitions (hence DeepCoder, which is reminiscent of TopCoder).
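
    The predictor at the heart of this scheme can be sketched roughly as follows (a toy illustration, not DeepCoder itself; the encoding of the examples and the number of DSL functions are placeholders): encode the input-output examples, pool over the example set, and output a per-function probability that is then used to prioritize the search.

    ```python
    import torch
    import torch.nn as nn

    num_dsl_functions = 34                       # placeholder for the size of the DSL
    encoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
    head = nn.Linear(256, num_dsl_functions)

    def predict_functions(io_examples):
        """io_examples: (num_examples, 64) flat encodings of input-output pairs for one task."""
        pooled = encoder(io_examples).mean(dim=0)    # pool over the example set
        return torch.sigmoid(head(pooled))           # probability that each DSL function appears

    probs = predict_functions(torch.randn(5, 64))
    ```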

    The search-based direction was continued in (Vijayakumar et al., 2018), another publication affiliated with Microsoft Research. There, the authors go back to the string transformation domain and use neural networks to specifically predict how to expand the current node, with an LSTM-based model like this:

    Image source

    ICLR 2019 has three papers related to neural program synthesis. Chen et al. (2019) from UC Berkeley propose the idea of execution-guided synthesis: by encoding the current program state, we can execute partial programs and condition the next step of the synthesizer on the intermediate state representation. On the other hand, Si et al. (2019) consider so-called syntax-guided synthesis, where the space of possible programs is constrained by a grammar and a logical formula (specification). Formally speaking, this is a generalization of traditional program synthesis because you can write down the dataset of input-output pairs as a logical formula, but in reality this is a completely different problem. Si et al. use reinforcement learning to train the encoder and the policy network together.

    There is also a completely different but quite interesting direction of study where synthesized programs are instructions for a robot, and the synthesis is based on some visual cues and/or instructions given in a natural language. This is beyond the scope of this post, but maybe someday we will come back to this field, as synthetic data plays a huge role there as well.

    Nevertheless, it is clear that neural programming is still at its very inception. Existing systems can “write” only very simple programs. In my opinion, a necessary step to improving the current models will be to improve the datasets they train on. So let us conclude with how synthetic data is used for neural programming, and how it can be improved further.

    Synthetic Data for Program Induction and Synthesis

    Both program induction and program synthesis models usually train on synthetic data. It is readily available and is fairly easy to work with: any model problem currently used for program induction/synthesis has fast “regular” algorithms, so one can simply randomize the input data and produce a never-ending stream of synthetic training examples.

    In one of the works we considered above, Saxton et al. (2019) had to produce a dataset of natural language mathematical problems. DeepMind researchers summarize their options as follows:

    There are two choices for obtaining mathematical questions: either crowd-sourced, or synthetically generated. While crowd-sourcing has the advantage of introducing linguistic diversity, as well as a diversity of problem types, it is difficult to collect and validate such data at scale. In contrast, procedural generation is sufficient for our purposes in most respects: it (1) easily provides a larger number of training examples, with (2) precise controls over difficulty levels, permitting (3) analysis of performance by question type, and (4) better guarantees on question correctness, with (5) potential for more efficient model training by varying the time spent on each module, and (6) ease of testing generalization (since one can precisely vary different axes of difficulty in different question types).

    I couldn’t put it better myself.

    As for program synthesis, this is finally the place to review one more paper from ICLR 2019. Previously, synthetic data for neural programming had been generated straightforwardly and more or less at random. But UC Berkeley and Google Brain researchers Shin et al. (2019) show that common synthetic data generation algorithms, including the one from the popular tensor2tensor library, have biases that fail to cover important parts of the program space and thus deteriorate the final result. Shin et al. present a novel methodology for sampling random programs and show significant improvements for two important domains: Calculator (which computes the results of arithmetic expressions) and Karel (which achieves a given objective with a robot moving in a two-dimensional grid with walls and markers).

    We are excited to see new papers that study the generation of synthetic data as an interesting problem in its own right. It’s even more exciting to see them accepted to major venues such as ICLR. But there are many, many more interesting ideas in papers accepted to ICLR 2019, so stay tuned for our next installments!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • Neuromation at the LDV Vision Summit

    Neuromation at the LDV Vision Summit

    On May 22–23, Neuromation had the honor of taking part in one of the most interesting interdisciplinary gatherings in computer vision and graphics: the LDV Vision Summit, hosted by the LDV Capital foundation. Members of our team in attendance included CTO Artyom Astafurov, CRO Sergey Nikolenko, Head of Product Matthew Moore, and VP Digital Economy Arthur McCallum.

    Neuromation’s Chief Research Officer Sergey Nikolenko participated in a panel titled “Synthetic Media Will Disrupt or Empower Media & Technology Giants” about the role of synthetic data in artificial intelligence. The panel discussed key applications of synthetic data and achieved consensus regarding the growing importance of synthetic data in AI as well as the increasing role that synthetic media will play in the future.

    One of the key features of the event was that it was truly cross-disciplinary: our team had the pleasure of meeting professionals from AI startups specializing in computer vision and computer graphics, media companies, news outlets, and investors who saw many promising startups at the event. Neuromation is always happy to visit and support important interdisciplinary events. After all, we can only make the future of AI together. Big thanks to the organizers, in particular Evan Nisselson, LDV General Partner, and Abigail Hunter-Syed, LDV VP of Operations; it was their conference more than anyone else’s.

    Interestingly, panel discussions at the LDV Vision Summit were accompanied by live illustrations: a hired artist was doodling along as the participants spoke. It looked very cool in the process, and here is what the result looked like for the synthetic data panel.

  • NeuroNuggets: Dmitry Vetrov, or ICLR in Review I

    NeuroNuggets: Dmitry Vetrov, or ICLR in Review I

    The list of accepted papers for ICLR 2019 (International Conference on Learning Representations) is already available online, and there are a number of very interesting papers there waiting to be reviewed. So, with the next several posts I thought we could dive into the best of them and discuss some related and very relevant areas.

    First off, the work of the renowned AI researcher Dmitry Vetrov. Dmitry has always been, so to speak, a “big brother” in AI for me. I remember how we first met: back in 2009, we made at least two laps around the Moscow State University campus (quite a walk!), talking about machine learning, deep learning, and the future of AI. Or, better to say, Dmitry was talking and I was listening. By now, Dmitry Vetrov is a laboratory leader at the Samsung AI Center Moscow, a research professor and laboratory head at the Higher School of Economics in Moscow, founder and head of the Bayesian Methods Research Group, one of the strongest ML research groups in Russia, and generally one of the most famous ML researchers in Russia. He has always advocated bringing Bayesian methods to deep learning, and many of his works are devoted to exactly this.

    ICLR 2019 has become a very successful conference for Dmitry: he (naturally, not alone but with the researchers and students of his labs) has co-authored three papers accepted to ICLR! This is an outstanding achievement, so I decided to take the first NeuroNugget of the “ICLR in Review” series to say thank you to Dmitry Vetrov for all the advice, guidance, and, most importantly, simply a good example he has been setting for all of us through the years. Thank you Dima, and let’s review your latest and greatest!

    I shall be telling this with a sigh
    Somewhere ages and ages hence:
    Two roads diverged in a wood, and I —
    I took the one less traveled by,
    And that has made all the difference.

    Robert Frost

    Variance Networks: Unexpectedly, Expectation Is Not Required

    The paper “Variance Networks: When Expectation Does Not Meet Your Expectations” by Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov presents a very surprising construction. So surprising that we need to begin with a little context.

    All of our readers probably know that randomization plays a central role in the training of modern neural networks. There is no way (and no need) to replace stochastic gradient descent with full-batch gradient descent: training on randomized mini-batches is orders of magnitude more efficient. Random initialization of the network weights has entirely replaced the unsupervised pre-training methods that jumpstarted the deep learning revolution back in 2006–2007, and new improvements in random initialization methods keep popping up (for example, Xavier initialization for symmetric activation functions, He initialization for ReLU, Le-Jaitly-Hinton for recurrent networks of ReLUs, and many more).

    What is a bit less commonly known is that not only is the training randomized, but the networks themselves often are too, as in stochastic neural network constructions where the weights of the network are randomized not just during initialization but throughout training and even during inference as well.

    Can you name the most common example of such a construction?..

    …[I’ll just give you a little bit of space to think]…

    …that’s right, dropout!

    A dropout layer makes any network into a stochastic one. Each weight with value w is accessed through a probability distribution, albeit a very simple one. With probability p it’s w, and with probability (1-p) it’s 0. Dropout is very common, so now you see that stochastic neural networks are actually all over the place.

    Dropout also showcases another important trick done with stochastic neural networks: weight scaling. In order to do inference with (that is, apply) a stochastic neural network, it might look like you need to run it several times and then average the results, getting an estimate for the resulting expectation which is usually intractable in any other way; this is known as test-time averaging.

    But running a network 20 times to get a single answer is not too fast either! Weight scaling is the technique of approximating this process by replacing each weight with its expected value. In the case of dropout, this means that instead of running the network many times for every test example, we replace each weight w subject to dropout with its expected value pw.
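
    Here is a quick sketch contrasting the two (toy sizes, and dropout written out by hand on the weights to match the description above, rather than via the library's dropout layer): for a linear layer, averaging over many sampled masks converges to the single weight-scaled pass.

    ```python
    import torch

    torch.manual_seed(0)
    w = torch.randn(100, 50)                     # weights of a linear layer
    x = torch.randn(1, 100)                      # one input example
    p_keep = 0.5                                 # each weight is kept with probability p

    # test-time averaging: sample many dropout masks and average the outputs
    samples = [(x @ (w * (torch.rand_like(w) < p_keep).float())).squeeze(0) for _ in range(1000)]
    averaged = torch.stack(samples).mean(dim=0)

    # weight scaling: replace each weight by its expectation p*w and run the network once
    scaled = (x @ (p_keep * w)).squeeze(0)

    print((averaged - scaled).abs().max())       # small: the Monte Carlo average approaches the scaled pass
    ```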

    Weight scaling is not really formally correct, but is used very widely. As Neklyudov et al. emphasize, its “success… implies that a lot of learned information is concentrated in the expected value of the weights”. But the main point of their paper is to introduce the so-called variance layers: stochastic layers where the expected values carry exactly zero information because the expectations are always set to zero, and only variances of the weights are trained!
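
    A minimal sketch of a variance layer (an illustration of the idea, not the authors' code, which uses more careful parametrizations and variational training): the weight means are pinned to zero and only the log-variances are learned, so every forward pass samples fresh zero-mean weights.

    ```python
    import torch
    import torch.nn as nn

    class VarianceLinear(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.log_sigma = nn.Parameter(torch.full((out_dim, in_dim), -3.0))

        def forward(self, x):
            eps = torch.randn_like(self.log_sigma)
            w = torch.exp(self.log_sigma) * eps    # zero-mean weights: all information lives in sigma
            return x @ w.t()

    layer = VarianceLinear(784, 300)
    out = layer(torch.randn(32, 784))              # a different weight sample on every call
    ```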

    This counterintuitive construction proves to be an interesting one. It turns out that variance layers can and do train and encode interesting features. Here is an illustration for a toy problem:

    Image source

    The two pictures above show two different ways that neurons at a variance layer can interact. On the left, the two neurons encode the four classes together, in almost exactly the same way: class 4 is encoded by the lowest variance (the red core at the origin), class 3 has higher variance, and so on. By using two neurons in the same way, the network can have a whole “sample” of the same underlying distribution, which makes for a more reliable estimate of the variance.

    On the right, the two neurons decided to learn complementary things: the first neuron (Y-axis) has high variance for classes 1 and 2, and the second neuron (X-axis) has high variance for classes 1 and 3.

    Neklyudov et al. show that variance networks can perform well on standard tasks, they prove to be more robust against adversarial attacks and can improve exploration in reinforcement learning. But most importantly, it turns out that expectations of the weights are not as crucial as people thought. This is a fascinating proof of concept. Who knows, maybe this will lead to a completely new genre of deep learning; we’ll have to wait and see.

    The Deep Weight Prior

    The second paper co-authored by Dmitry Vetrov at ICLR 2019, “The Deep Weight Prior” by Andrei Atanov, Arsenii Ashukha, Kirill Struminsky, Dmitry Vetrov, and Max Welling, considers another key element of the Bayesian framework: prior distributions.

    Let’s begin with the main formula of all machine learning — Bayes’ rule:
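
    In the notation used below, the rule reads

    p(θ|D) = p(D|θ) p(θ) / p(D),

    i.e., the posterior p(θ|D) is the prior p(θ) multiplied by the likelihood p(D|θ) and normalized by the evidence p(D).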

    In machine learning, θ usually represents the model parameters, and D represents the data. The formula shows how we change our beliefs about the parameters θ after getting experimental results D: we update our prior belief p(θ) by essentially multiplying it by the likelihood p(D|θ). Recalculating the prior p(θ) into the posterior p(θ|D) is the essence of Bayesian inference, and the essence of many machine learning problems and techniques.

    But if this is the core of all machine learning, why don’t we hear more about prior distributions in deep learning? The simple answer is that nobody knows how to make nontrivial priors for complex neural networks. Naturally, an L2 regularizer can be thought of as a zero-mean Gaussian prior… but what if we are looking for something more meaningful?
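    To spell out the L2 example: a zero-mean Gaussian prior contributes its log-density to the negative log-posterior, and that contribution is exactly the familiar weight decay term:

    ```latex
    \log p(\theta) = \log \mathcal{N}(\theta \mid 0, \sigma^2 I)
                   = -\frac{1}{2\sigma^2}\,\|\theta\|^2 + \mathrm{const},
    \qquad
    -\log p(\theta \mid D) = -\log p(D \mid \theta) + \frac{1}{2\sigma^2}\,\|\theta\|^2 + \mathrm{const}.
    ```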

    Atanov et al. propose a novel and interesting way to tackle this problem. Specifically, they consider the case of convolutional neural networks, where:

    • it is very plausible that convolutional layers, especially early ones, learn nearly the same features on all datasets from an entire domain — after all, most modern computer vision networks start out by pretraining on ImageNet even if their goal is not to recognize its classes;
    • on the other hand, it makes little sense to assume that the prior distribution can decompose into a product over individual weights since the weight matrix of a convolution should represent a single object for this distribution.

    Based on these remarks, Atanov et al. rightly assume that they can learn the prior distribution for a convolutional network on other datasets from the same domain (e.g., on smaller datasets of real photographs or handwritten digits), and that this prior distribution, while it factorizes over the layers and channels, will not factorize over the spatial dimensions of the filters, i.e., it will be a distribution over the weight matrices of convolutional filters.
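    Roughly, the assumed structure of the prior is the following (the notation here is mine, not the paper's: l runs over layers, c over channel pairs, and w_{l,c} is the whole kernel slice for one input-output channel pair):

    ```latex
    p(W) = \prod_{l} \prod_{c} p_l\!\left(w_{l,c}\right),
    \qquad w_{l,c} \in \mathbb{R}^{k_l \times k_l},
    ```

    where each factor p_l is a learned distribution over entire k×k kernels and does not factorize further over the spatial entries.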

    So, how do you train a distribution like that? That’s where variational autoencoders (VAE) come in. We have already discussed VAEs in a previous post. Essentially, a VAE learns to transform an input x and some random bits with a fixed distribution into a distribution q(z|x) on the latent embeddings that should approximate the true posterior p(z|x); the encoder outputs the parameters of q(z|x), and sampling from it is done with the reparametrization trick. The resulting distribution on the latent space z can then be pushed through the decoder to produce new samples from p(x):

    Image source

    So what do Atanov et al. do? They opt to take some (smaller and simpler) datasets and train a number of convolutional nets (“source networks”) to get a dataset of convolutions. Then they train a VAE on this dataset to produce the prior distribution on the convolutional kernels! This VAE is now a model that can sample from the distribution of convolutional kernels and that defines the deep weight prior:

    Image source

    Then they show that variational inference on the convolutional weights with the deep weight prior works much better than with any other previously known kind of prior. But, again, the most important point here is not a specific experimental result but the idea that it can open up wonderful new possibilities for, among other things, transfer learning and domain adaptation.

    Variational Autoencoder with Arbitrary Conditioning

    And speaking of variational autoencoders… In the third ICLR paper by Vetrov’s group, “Variational Autoencoder with Arbitrary Conditioning”, Oleg Ivanov, Michael Figurnov, and Dmitry Vetrov introduce a very interesting extension for the VAE framework.

    Once you have a generative model (of any kind) and can sample objects from a distribution, the natural next step is to try to construct a conditional version of the same model. This will let you create objects that are subject to some kind of condition; e.g., generate the face of a person with a given age and gender, or with a smile and sunglasses. Traditional conditional VAEs are well-known, of course; to add a label to the VAE construction, you can simply input it to both encoder and decoder, like this:

    Image source
    Image source

    Now the model is free to learn completely different distributions in the latent space for different labels, and this produces all sorts of wonderful effects, has implications for style transfer, and so on.

    In this paper, Ivanov et al. take conditioning up to eleven: they consider a situation where any subset of the input might be unknown and subject to generation conditioned on the available part. That is:

    • along with the input x, we are given a binary mask b that shows which components (pixels) of x are known (where b_i = 0) and which are not (where b_i = 1);
    • the goal is to construct a model for the conditional distribution p(x_b | x_{1-b}, b), i.e., the distribution of the unknown part given the known part and the mask;
    • as part of the problem setting, we are also given a prior distribution on the masks p(b) that shows which masks are more likely to appear and which, therefore, the model should concentrate on (a short sketch of how the mask enters the model follows this list).
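    To give a feeling for how such a mask enters the model, here is a tiny PyTorch sketch of preparing the conditioning input; the real VAEAC architecture is more involved, so treat the shapes and the concatenation trick below as an illustration rather than the paper's exact recipe:

    ```python
    import torch

    def make_conditioning_input(x, b):
        """
        Build the network input for arbitrary conditioning.

        x: batch of images, shape (N, C, H, W)
        b: binary masks of the same spatial shape, 1 = unknown pixel, 0 = known pixel
        """
        observed = x * (1.0 - b)            # zero out the unknown pixels
        return torch.cat([observed, b], 1)  # let the model see which pixels are missing

    x = torch.rand(8, 1, 28, 28)                     # e.g. MNIST-sized images
    b = (torch.rand(8, 1, 28, 28) < 0.5).float()     # a random mask drawn from p(b)
    inp = make_conditioning_input(x, b)              # shape (8, 2, 28, 28)
    # inp is what an encoder/decoder pair would consume to model p(x_b | x_{1-b}, b)
    ```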

    To solve this problem, Ivanov et al. propose a model they call Variational Autoencoder with Arbitrary Conditioning (VAEAC). A full consideration of VAEAC, its training and inference goes, alas, far beyond the scope of a NeuroNugget. But I do want to note the uses of the resulting model. The arbitrary conditioning setting is designed for problems such as:

    • missing features imputation, i.e., reconstructing missing features in a dataset; in the wild, many datasets are rather dirty, but we cannot afford to simply discard all the incomplete data;
    • image inpainting, i.e., reconstructing a hidden part of an image; this is an important problem for image manipulation, e.g., if you want to delete a random passerby from your photo, you can segment the person and cut them out with a segmentation model, but then you still need to inpaint the background in a natural way in place of the person.

    VAEAC solves both of these problems nicely. It especially shines on imputation, but that use-case is not flashy enough, so here are some inpainting results. The rightmost columns in the following images show the ground truth. Top to bottom: MNIST and Omniglot, CelebA, and CelebA with more interesting masks:

    Image source
    Image source
    Image source

    Stochastic Weight Averaging: A Simple Trick to Rule Your Networks

    But wait, there is more! This is not an ICLR paper yet, but it’s a very recent work from Dmitry’s lab that might prove applicable to nearly everyone in the world of deep learning. In “Averaging Weights Leads to Wider Optima and Better Generalization”, Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson present a simple idea that can be embedded into many networks at virtually no cost… but quite possibly with a hefty profit!

    It is common knowledge in machine learning that putting together a number of different models, in a procedure known as ensembling, helps improve performance. However, in deep learning ensembling is hard: training even one model is difficult, and getting a hundred meaningful networks together would be quite a computational feat.

    There have been some works on ensembling in deep learning, though. Inspired by the ideas of cyclical learning rates, Huang et al. in “Snapshot Ensembles: Train 1, get M for free” (2017) train a single neural network, but along the optimization path make sure to converge to several local minima and save the weights at each of them. Then they ensemble these snapshots, averaging their predictions and getting basically an ensemble of multiple local minima from a single training pass. Like this:

    Image source

    The next step came in an earlier paper from Vetrov’s lab. In “Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs”, Garipov et al. showed that for a wide range of architectures, their local optima are actually connected by rather simple curves along which the loss remains near-constant! Here is an illustration:

    Garipov et al. show that if you train three different networks you’ll get a landscape like the one shown on the left, as you would expect. But if you then use their proposed mode connecting procedure, you can find curves of near-constant loss function that connect the dots. On the plots, the X-axis is always the same but Y-axes are different, and the procedure invariably finds a valley that connects the two optima.

    Based on this observation, Garipov et al. proposed a new ensembling procedure, Fast Geometric Ensembling (FGE), which basically amounts to collecting networks along such a mode connecting path and averaging their predictions.

    Still, this was rather computationally intensive: you had to find several local optima, then connect them with a separate computational procedure. And, ultimately, it’s still an ensemble: you need to keep multiple models, run them all at inference time, and then average their answers. Stochastic Weight Averaging (SWA), proposed by Izmailov et al. in the paper in question, does away with most of these extra costs!

    This time, the fundamental observation is that stochastic gradient descent with cyclical and constant learning rates actually traverses the regions in the weight space corresponding to high-performing networks, but it does not reach the central points of these regions. That is, three networks collected by FGE would most likely look something like this:

    Image source

    So instead of averaging their predictions, why not average the weights directly, getting to the wonderful wSWA point shown above?! That’s exactly what stochastic weight averaging does:

    • start with a (somewhat) pretrained model;
    • keep training with a cyclical or constant learning rate schedule;
    • at the end of every learning rate cycle (or every epoch, for a constant schedule), add the current weight vector to a separately accumulated running average, as in the sketch below.
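    Here is a minimal PyTorch sketch of that running average; the tiny stand-in model, the fake data, and the learning rate are all illustrative, and in practice you would also want to recompute batch normalization statistics for the averaged model afterwards:

    ```python
    import copy
    import torch

    def update_swa(swa_model, model, n_averaged):
        """Add the current weights to the running average (one SWA step)."""
        for swa_p, p in zip(swa_model.parameters(), model.parameters()):
            swa_p.data += (p.data - swa_p.data) / (n_averaged + 1)
        return n_averaged + 1

    model = torch.nn.Linear(10, 2)        # stand-in for a real (somewhat pretrained) network
    swa_model = copy.deepcopy(model)      # the running average starts at the current weights
    n_averaged = 1
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)   # constant learning rate

    for epoch in range(3):
        for x, y in [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(10)]:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
        n_averaged = update_swa(swa_model, model, n_averaged)   # average once per epoch / cycle
    ```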

    And that’s it! The results are nothing short of amazing: with no additional cost at inference time and very little cost at training time, SWA actually manages to improve virtually all architectures across a wide variety of datasets and settings. Here is a sample of experimental results:

    Image source

    SWA is a straightforward drop-in replacement for stochastic gradient descent, and, naturally, it has an official github repository released by the authors. Try it out and share your results in the comments!

    Conclusion

    This has been a rather challenging NeuroNugget, I’m afraid, and we have only scratched the surface of several recent papers by Dmitry Vetrov and his group.

    I have tried to outline the main ideas that each paper brings to deep learning, but the most important thing I want to emphasize is the common thread: each paper we have considered today presents a novel idea that looks plausible, works well (at least on toy examples, sometimes more experiments are needed before it can be used in large-scale networks), and opens up a whole new direction of study, usually somewhere near the field of Bayesian deep learning.

    That’s how Dmitry Vetrov has always done research: he has a vision and a knack for finding these “paths less traveled” in the forest of deep learning, a forest that by now might seem rather well-trodden. Still, he did it in 2009, he is doing it in 2019, and I’m sure he will still be finding completely novel ideas and uncovering new directions in 2049. Thank you Dmitry — and good luck!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • NeuroNuggets: Deep Anime

    NeuroNuggets: Deep Anime

    Last time, we had some very serious stuff to discuss. Let’s touch upon a much lighter topic today: anime! It turns out that many architectures we’ve discussed on this very blog, or plan to discuss in more detail in the future, have already been applied to Japanese-style comics and animation.

    Let me start by giving a shout out to the owner of this fantastic github repository. It is the most comprehensive resource for all things anime in deep learning. Thanks for putting this together and maintaining it, whoever you are!

    We will be mostly talking about generating anime characters, but the last part will be a brief overview of some other anime-related problems.

    Do everything by hand, even when using the computer.

    Hayao Miyazaki

    Drawing Anime Characters with GANs

    Guess who drew the characters you saw above? You guessed right: no manga artist thought them up; they were drawn automatically with a generative model.

    The paper by Jin et al. (2017) presents an architecture based on generative adversarial models trained to generate anime characters. We have spoken about GANs several times on this blog (see, e.g., here or here), and this sounds like a relatively straightforward application. But attempts at direct applications of basic GAN architectures such as DCGAN for this problem, even a relatively successful attempt called (unsurprisingly) AnimeGAN, produced only low-resolution, blurry and generally unsatisfactory images, e.g.:

    Image source: https://github.com/jayleicn/animeGAN

    How did Jin et al. bridge the gap between this and what we saw above?

    First, let’s talk about the data. This work is a good example of a general trend: dataset collection, and especially labeling, is increasingly becoming an automated or at least semi-automated process, where models that we believe to work reliably are used to label datasets for more complex models.

    To get a big collection of anime character faces, Jin et al. scraped the Getchu website that showcases thousands of Japanese games, including unified presentations of their characters, in good quality and on neutral background:

    Image source: https://arxiv.org/pdf/1708.05509.pdf

    On these pictures, they ran a face detection model called lbpcascade, specifically trained to do face detection for anime/manga, and then enlarged the resulting bounding box (shown in red above) by 1.5x to add some context (shown in blue above). To add the “semi-” to “semi-automated”, the authors also checked the resulting 42000 images by hand and removed about 4% of false positives. They don’t show a comparison, but I’m sure this was an important step for data preparation.

    But that’s not all. Jin et al. wanted to have conditional generation, where you would be able to get a blonde anime girl with a ponytail or a brown-eyed red-haired one with glasses. To do that, they ran a pretrained model called Illustration2Vec which is designed to predict a large number of predefined tags from an anime/manga image. Here is a sample:

    Image source: https://github.com/rezoo/illustration2vec

    Jin et al. chose suitable thresholds for the classifiers in Illustration2Vec, but basically they used this pretrained model as is, relying on its accuracy to create the training set for the GAN. This is an interesting illustration of how you can bootstrap training sets from pretrained models: it won’t always work, but when it does, it can produce large training sets very efficiently. As a result, they now had a large dataset of images labeled with various tags, with a feature vector associated with every image. Here is a part of this dataset in a t-SNE visualization of the feature vectors:

    Image source: https://arxiv.org/pdf/1708.05509.pdf

    The next step would be to choose the GAN architecture. Jin et al. went with DRAGAN (Deep Regret Analytic Generative Adversarial Networks), an additional loss function suggested by Kodali et al. (2017) to alleviate the mode collapse problem. We will not go into further details on DRAGAN here. Suffice it to say that the final architecture is basically a standard GAN with a generator and a discriminator, and with some additional loss functions to account for the DRAGAN gradient penalty and for the correct assignment of class labels to make it conditional. The architectures for both generator and discriminator are based on SRResNet, pretty standard convolutional architectures with residual connections.
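    For the curious, here is roughly what a DRAGAN-style gradient penalty looks like in PyTorch; the noise scale and penalty weight below are common defaults in open implementations, not necessarily the exact values used by Jin et al.:

    ```python
    import torch

    def dragan_gradient_penalty(D, x_real, lambda_gp=10.0):
        """Keep the discriminator's gradient norm close to 1 around perturbed real samples."""
        # perturb real samples with noise scaled by their standard deviation
        x_hat = x_real + 0.5 * x_real.std() * torch.rand_like(x_real)
        x_hat = x_hat.detach().requires_grad_(True)
        d_out = D(x_hat)
        grads = torch.autograd.grad(outputs=d_out.sum(), inputs=x_hat,
                                    create_graph=True)[0]
        grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
        return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

    # toy usage: a tiny "discriminator" on 64x64 RGB images
    D = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))
    penalty = dragan_gradient_penalty(D, torch.rand(8, 3, 64, 64))
    penalty.backward()   # this term is added to the usual discriminator loss during training
    ```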

    So now we have both the data and the architecture. Then we train for a while, and then we generate!

    Image source: https://arxiv.org/pdf/1708.05509.pdf

    Those were the results of unconditional generation, but we can also set up some attributes as conditions. Below, on the left we have the “aqua hair, long hair, drill hair, open mouth, glasses, aqua eyes” tags and on the right we have “orange hair, ponytail, hat, glasses, red eyes, orange eyes”:

    Image source: https://arxiv.org/pdf/1708.05509.pdf

    And, even better, you can play with this generative model yourself. Jin et al. made the models available through a public frontend at this website; you can specify certain characteristic features and generate new anime characters automatically. Try it!

    Next Step: Full-Body Generation with Pose Conditions

    Hamada et al. (2018) take the next step in generating anime characters with GANs: instead of just doing a head shot like Jin et al. above, they generate a full-body image with a predefined pose. They are using the basic idea of progressively growing GANs from Karras et al. (2018), a paper that we have actually already discussed in detail on this very blog:

    • begin with training a GAN to generate extremely small images, like 4×4 pixels;
    • use the result as a condition to train a GAN that scales it up to 8×8 pixels, a process similar to superresolution;
    • use the result as a condition to train a GAN that scales it up to 16×16 pixels…
    • …and so on until you get to 1024×1024 or something like that.

    The novel idea by Hamada et al. is that you can also use the pose as a condition, first expressing it in the form of a pixel mask and then scaling it down to 4×4 pixels, then 8×8, and so on:

    Image source: https://arxiv.org/pdf/1809.01890v1.pdf
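    A minimal sketch of this conditioning trick: take a multi-channel pose mask and average-pool it down to the resolution of each progressive stage (the number of keypoint channels and the stage sizes below are assumptions for illustration):

    ```python
    import torch
    import torch.nn.functional as F

    def pose_condition_pyramid(pose_mask, sizes=(4, 8, 16, 32, 64)):
        """
        pose_mask: (N, K, H, W) tensor, one channel per keypoint/body part.
        Returns a list of downscaled masks, one per progressive-growing stage.
        """
        return [F.adaptive_avg_pool2d(pose_mask, (s, s)) for s in sizes]

    mask = torch.zeros(1, 17, 256, 256)     # e.g. 17 keypoint channels, all illustrative
    mask[0, 0, 100:110, 120:130] = 1.0      # mark one body part
    pyramid = pose_condition_pyramid(mask)  # conditions for the 4x4, 8x8, ... generators
    ```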

    Then they created a dataset of full-body high-resolution anime characters based on the Unity 3D models for various poses:

    Image source: https://arxiv.org/pdf/1809.01890v1.pdf

    And as a result, the progressive structure-conditional GAN is able to generate nice pictures with predefined poses. As usual with GANs you can interpolate between characters while keeping the pose fixed, and you can produce different poses of the same character, which makes this paper a big step towards developing a tool that would actually help artists and animators. Here is a sample output:

    Image source: https://arxiv.org/pdf/1809.01890v1.pdf

    Even Better: StyleGAN for Anime

    Have you seen thispersondoesnotexist.com? It shows fake people generated by the latest and greatest GAN-based architecture for face generation, the StyleGAN, and it’s been all over the Web for a while.

    Well, turns out there is an anime equivalent! thiswaifudoesnotexist.net generates random anime characters with the StyleGAN architecture and even adds a randomly generated plot summary! Like this:

    Image source: https://www.thiswaifudoesnotexist.net/

    Looks even better! But wait, what is this StyleGAN we speak of?

    StyleGAN is an architecture by NVIDIA researchers Karras et al. (2018), the same group who had previously made progressively growing GANs. This time, they kept the stack of progressive superresolution but changed the architecture of the basic convolutional model, making the generator similar to style transfer networks. Essentially, instead of simply putting a latent code through a convolutional network, like traditional GANs do, StyleGAN first recovers an intermediate code vector and then uses it several times to inform the synthesis network, with external noise coming in at every level. Here is a picture from the paper, with a traditional generator architecture on the left and StyleGAN on the right:

    Image source: https://arxiv.org/pdf/1812.04948
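    To make the two key ideas a bit more tangible, here is a toy sketch of a mapping network and one "styled" synthesis block with per-level noise; this is a deliberately simplified stand-in, not the actual StyleGAN architecture (which uses an 8-layer mapping MLP and AdaIN, among other things):

    ```python
    import torch
    import torch.nn as nn

    class MappingNetwork(nn.Module):
        """Maps a latent code z to an intermediate code w (a shallow stand-in for the real MLP)."""
        def __init__(self, dim=512):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(0.2),
                                     nn.Linear(dim, dim), nn.LeakyReLU(0.2))

        def forward(self, z):
            return self.net(z)

    class StyledBlock(nn.Module):
        """One synthesis block: w modulates the feature maps, fresh noise is added at this level."""
        def __init__(self, channels, w_dim=512):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, 3, padding=1)
            self.to_scale = nn.Linear(w_dim, channels)   # per-channel "style" derived from w
            self.noise_weight = nn.Parameter(torch.zeros(1, channels, 1, 1))

        def forward(self, x, w):
            x = self.conv(x)
            x = x + self.noise_weight * torch.randn_like(x)        # external noise at this level
            scale = self.to_scale(w).unsqueeze(-1).unsqueeze(-1)    # w informs every block
            return torch.relu(x * (1 + scale))

    mapping = MappingNetwork()
    block = StyledBlock(channels=64)
    z = torch.randn(2, 512)
    w = mapping(z)                                   # intermediate code
    features = block(torch.randn(2, 64, 8, 8), w)    # one level of the synthesis network
    ```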

    We won’t go into more detail on this here, as the StyleGAN would deserve a dedicated NeuroNugget to explain fully (and maybe it’ll get it). Safe to say that the final result now looks even better. StyleGAN defines a new gold standard for face generation, as shown on thispersondoesnotexist.com and now, as we can see, on thiswaifudoesnotexist.net. As for the text generation part, this is a completely different can of worms, awaiting its own NeuroNuggets, quite possibly in the near future…

    Brief Overviews

    Let us close with a few more papers that solve interesting anime/manga-related problems.

    Style transfer for anime sketches. We’ve spoken of GANs that use ideas similar to style transfer, but what about style transfer itself? Zhang et al. (2017) present a style transfer network based on U-Net and auxiliary classifier GAN (AC-GAN) that can fill in sketches with color schemes derived from separate (and completely different) style images. This solves a very practical problem for anime artists: being able to draw a character in full color once and then simply apply the style to new sketches would save a huge amount of effort. We are not quite there yet, but look at the results; in the three examples below, the sketch shown in the top left is combined with a style image shown in the bottom left to get the final image:

    Image source: https://arxiv.org/pdf/1706.03319v2.pdf

    Interactive segmentation. Ito et al. (2016) propose an interactive segmentation method intended for manga. An important problem for manga illustrators is to have automated or semi-automated segmentation tools, so they can cut out individual characters or parts of the scene from existing drawings. That’s exactly what Ito et al. do (without any deep learning, by the way, by improving classical segmentation techniques):

    Image source: https://projet.liris.cnrs.fr/imagine/pub/proceedings/ICPR-2016/media/files/0660.pdf

    Anime superresolution. We have already mentioned superresolution as a stepping stone in progressively growing GANs, but one can also use it directly to transform small and/or low-res images into high-quality anime. The waifu2x model is based on SRCNN (single-image superresolution based on convolutional neural networks), slightly modified and extensively trained to handle anime. The results are actually pretty impressive — here is how waifu2x works:

    Image source: https://github.com/nagadomi/waifu2x
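    Since the original SRCNN is tiny, it fits in a few lines; the sketch below follows the classic 9-1-5 layer configuration of SRCNN, while waifu2x itself uses a modified and much more heavily trained variant:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SRCNN(nn.Module):
        """Three-layer superresolution CNN: feature extraction, nonlinear mapping, reconstruction."""
        def __init__(self, channels=3):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, 64, kernel_size=9, padding=4)
            self.conv2 = nn.Conv2d(64, 32, kernel_size=1)
            self.conv3 = nn.Conv2d(32, channels, kernel_size=5, padding=2)

        def forward(self, x_lowres, scale=2):
            # naive bicubic upscaling first, then the network restores the details
            x = F.interpolate(x_lowres, scale_factor=scale, mode='bicubic', align_corners=False)
            x = torch.relu(self.conv1(x))
            x = torch.relu(self.conv2(x))
            return self.conv3(x)

    model = SRCNN()
    small = torch.rand(1, 3, 64, 64)
    large = model(small)     # shape (1, 3, 128, 128)
    ```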

    Conclusion

    Cartoons in general and anime in particular represent a very nice domain for computer vision:

    • images in anime style are much simpler than real-life photos: the edges are pronounced, the contours are mostly perfectly closed, many shapes have a distinct style that makes them easy to recognize, and so on;
    • there is a natural demand for the tools to manipulate anime images from anime artists, animators, and enthusiasts;
    • as we have seen even in this post, there exist large databases of categorized and tagged images that can be scraped for datasets.

    So no wonder people have been able to make a lot of things work well in the domain of anime. Actually, I would expect that anime might become an important frontier for image manipulation models, a sandbox where the models work well and do cool things before they can “graduate” to realistic photographic imagery.

    But to do that, the world needs large-scale open datasets and a clear formulation of the main problems in the field. Hopefully, the anime community can help with that, and then I have no doubt researchers all over the world will jump in… not only because it might be easier than real photos, but also simply because it’s so cool. Have fun!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • NeuroNuggets: Logic Comes to Machine Learning

    NeuroNuggets: Logic Comes to Machine Learning

    New Year celebrations are just behind us, but things are already happening in 2019. One very exciting development for machine learning researchers all around the world is the new journal from the venerable Nature family: Nature Machine Intelligence. Its volume 1 is dated January 2019, and it’s already out (all papers are in open access, you can read them right there). Machine learning results have already made it into Nature — for example, I’ve always wondered how a paper about a piece of software playing a boardgame is about Nature. But now we have a new top venue specifically devoted to machine learning.

    Nature Machine Intelligence begins on a high note. We might go back to its first volume in later installments, but today I want to discuss one especially unexpected and exciting result: a paper by Ben-David et al. called Learnability Can Be Undecidable. It brings into our humble and so far very practical field of science something quite unexpected: Gödel’s incompleteness theorem.

    Independence Results: Things We Cannot Know

    They might or they might not. You never can tell with bees.

    A.A. Milne

    Mathematical logic is a new subject for our series, and indeed it doesn’t appear too often in the context of machine learning. So let’s start with a brief introduction.

    Gödel’s incompleteness theorem establishes that in any (sufficiently complex) formal system, there are things we can neither prove nor disprove. A formal system is basically a set of symbols and axioms that define relations between these symbols. For example, you can have two functions, + and *, and constants 0 and 1, with the usual axioms for addition and multiplication that define a field. Then you can have models of this formal system, i.e., interpretations of the symbols such that all axioms hold. As an example, the set of real numbers with standard interpretations of 0,1,+,* is one model of the theory of fields, and the set of rational numbers is another.

    The original constructions given by Gödel are relatively involved and not easy to grasp without a logical background. They have been quite beautifully explained for the layperson in Douglas Hofstadter’s famous book Gödel, Escher, Bach, but it does take a few dozen pages, so we won’t go into that here.

    How can you prove that a certain statement is unprovable? Sounds like an oxymoron, but the basic idea of many such proofs is straightforward: you construct two models of the formal system such that in one of them the statement is true and in the other it’s not.

    For example, consider a very simple formal system with only one function s(x), which we interpret as “taking the next element”, and one constant 0. We can construct formulas (terms, to be precise) like s(0), s(s(0)), s(s(s(0))), etc. We can think of them as natural numbers: 1:=s(0), 2:=s(1)=s(s(0)), and so on. But do negative numbers also exist? Formally, is there an x such that s(x)=0?

    The question makes sense (it’s easy to write as a logical formula: ∃x s(x)=0) but has no answer. First, the set of natural numbers 0,1,2,… is a valid model for this formal system, with the function s defined as s(x)=x+1. And in this model, the answer is no: there is no number preceding zero. But the set of integers …,-2,-1,0,1,2,… is also a valid model, with the same interpretation s(x)=x+1! And now, we clearly have s(-1)=0. This means that the original formal system does not know whether negative numbers exist.

    Of course, this was a very, very simple formal system, and nobody really expected it to have answers to complicated questions. But the same kind of reasoning can be applied to much more complex systems. For example, the axioms of a field in mathematics do not have an answer to whether irrational numbers exist; e.g., ∃x(x*x=2) is true in the real numbers but false in the rational numbers, and both are fields. Gödel’s incompleteness theorem says that we can find such statements for any reasonably powerful formal system, including, for example, Zermelo–Fraenkel set theory (ZFC), which is basically what we usually mean by mathematics. Logicians have constructed statements that are independent of the ZFC axioms.

    One such statement is the famous continuum hypothesis. Modern mathematical logic was in many ways initiated by Georg Cantor, who was the first to try to systematically develop the foundations of mathematics, specifically formal and consistent set theory. Cantor was the first to understand that there are different kinds of infinities: the set of natural numbers is smaller than the set of reals because you cannot enumerate all real numbers. The cardinality (size) of the set of natural numbers, denoted ℵ₀ (“aleph-null”), is the smallest infinite number (smallest infinite cardinal, as they are called in mathematical logic), and the set of reals is said to have the cardinality of the continuum, denoted 2^ℵ₀.

    There is no doubt that 2^ℵ₀ > ℵ₀, but is there anything in between the natural numbers and the reals? The continuum hypothesis says that there isn’t: it states that 2^ℵ₀ equals ℵ₁ (“aleph-one”), the smallest infinite cardinal larger than ℵ₀. And it turns out to be independent of ZFC: you can construct a model of mathematics where there is an intermediate cardinality, and you can construct a model where there isn’t. There is really no point in asking which model we live in: it’s unclear whether there is anything truly infinite in our world at all.

    Undecidability in Machine Learning

    Some problems are so complex that you have to be highly intelligent and well informed just to be undecided about them.

    Laurence J. Peter

    Okay, so what does all of this have to do with machine learning? In our field, we usually talk about finite datasets that define optimization problems for the weights. How can we find obscure statements about the existence of various infinities within our practical and usually well-defined field?

    Ben-David et al. speak about the “estimating the maximum” problem (EMX):

    Given a family of subsets of some domain X, find a set F from this family whose measure with respect to an unknown probability distribution P is close to maximal, based on a finite sample generated independently from P.

    Sounds complicated, but it’s really just a general formulation of many machine learning problems. Ben-David et al. give the following example: suppose you are placing ads on a website. The domain X is the set of visitors for the website, every ad A has its target audience Fᴬ, and P is the distribution of visitors for the site. Then the problem of finding the best ad to show is exactly the problem of finding a set Fᴬ that has the largest measure with respect to P, i.e., it will most probably resonate with a random visitor.

    In fact, EMX is a very general problem, and its relation to machine learning is much deeper than this example shows. You can think of a set F as a function from the domain X to {0, 1}: F(x)=1 if x belongs to F and F(x)=0 if it doesn’t. And the EMX problem asks to find a function F from a given family that tries to maximize the expectation E_P[F(x)] with respect to the distribution P.
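    In symbols, writing the family of functions as 𝓕, the EMX problem reads as follows (the ε here just makes “close to maximal” precise; the exact probabilistic guarantees are spelled out in the paper):

    ```latex
    \text{given } x_1, \dots, x_n \sim P \text{ i.i.d.,}\quad
    \text{find } \hat{F} \in \mathcal{F} \text{ such that }
    \mathbb{E}_{x \sim P}\big[\hat{F}(x)\big]
    \;\ge\; \sup_{F \in \mathcal{F}} \mathbb{E}_{x \sim P}\big[F(x)\big] - \varepsilon .
    ```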

    Let us now think of samples from the distribution P as data samples, and treat the functions as classifiers. Now the setting begins to make a lot of sense for machine learning: it means that we know the labels of all data samples and, given a finite sample of the data, need to find a classifier from a given family that will have low error with respect to the underlying data distribution. Sounds very much like a standard machine learning problem, right? For more details on this setting, check out an earlier paper by Ben-David (two Ben-Davids, actually).

    Ben-David et al. consider a rather simple special case of the EMX problem, where X is the interval [0,1] and the family of subsets consists of all finite subsets of X, that is, finite collections of real numbers from [0,1]. They prove that the problem of EMX learnability with probability 2/3 (that is, given some i.i.d. samples from a distribution P, finding a finite subset of [0,1] that has probability at least 2/3) is independent of ZFC! That is, our regular mathematics cannot say whether you can find a good classifier in this setting. They do it by constructing a (rather intricate) reduction of the continuum hypothesis to this case of EMX learnability.

    So What’s the Takeaway?

    A conclusion is the place where you got tired thinking.

    Martin H. Fischer

    The results of Ben-David et al. are really beautiful. They connect a lot of dots: unprovability and independence, machine learning, compression schemes (used in the proof), and computational learning theory. One important corollary of the paper’s main result is that there can be no general notion of dimension for EMX learnability, like the VC (Vapnik-Chervonenkis) dimension is for PAC learnability. I have no doubt these ideas will blossom into a whole new direction of research.

    Still, as it sadly often happens with mathematical logic, this result can leave you a bit underwhelmed. It only makes sense in the context of uncountable sets, which you can hardly find in real life. Ben-David et al. themselves admit in the conclusion that the proof hinges on the fact that EMX asks to find a function over an infinite domain rather than, say, an algorithm, which would be a much simpler object (in theoretical computer science, algorithms are defined as Turing machines, basically finite sets of instructions for a very simple formalized “computer”, and there are only countably many finite sets of instructions while there are, obviously, a continuum of finite subsets of [0,1] and hence functions).

    Nevertheless, it is really exciting to see different fields of mathematics connected in such unexpected and beautiful ways. I hope that more results like this will follow, and I hope that in the future, modern mathematics will play a more important role in machine learning than it does now. Thank you for reading!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • Creating Molecules from Scratch II: AAE, VAE, and the Wave Transform

    Creating Molecules from Scratch II: AAE, VAE, and the Wave Transform

    It’s been quite a while, but the time has finally come to return to the story of deep learning for drug discovery, a story we began in April. Back then, I presented to you the first paper that had an official Neuromation affiliation, “3D Molecular Representations Based on the Wave Transform for Convolutional Neural Networks”, published in a top biomedical journal Molecular Pharmaceutics. By now, researchers from Neuromation have published more than ten papers, we have told you about some of them in our Neuromation Research blog posts, and, most importantly, we have already released our next big project in collaboration with Insilico, the MOSES dataset and benchmarking suite. But today, we finally come back to that first paper. Once again, many thanks to the CEO of Insilico Medicine Alex Zhavoronkov and CEO of Insilico Taiwan Artur Kadurin who have been the main researchers on this topic. I am very grateful for the opportunity to work alongside them in this project.

    A Quick Recap: GANs for Drug Discovery

    For a more detailed introduction to the topic, you are welcome to go back and re-read the first part; but to keep this one self-consistent, let me begin with a quick reminder.

    Drug discovery is organized like a highly selective funnel: at the first stage, you have doctors coming up with the properties that a molecule should have to be a good drug (binding with a given protein, dissolving in water and so on) and then with plausible candidates for molecules that might have these properties. Then these lead molecules are sent to the lab, and if they survive pre-clinical studies, they go to the official process of clinical trials and, finally, approval of the FDA or similar bodies in other countries.

    Only a tiny part of the lead molecules will ever get FDA approval, and the whole process is extremely expensive (developing a new drug takes about 10 years and costs $2.6 billion on average), so one of the main problems of modern medicine is to try and make the funnel as efficient as possible on every stage. Deep learning for drug discovery aims to improve the very first part, generating lead molecules. We try to develop generative models that will produce plausible candidates with useful properties.

    We have already talked about GANs many times. Some of our latest posts have been almost exclusively devoted to GANs (e.g., this CVPR in Review post), so I will not repeat the basic structure. But let me repeat the main figure of the paper titled “The Cornucopia of Meaningful Leads: Applying Deep Adversarial Autoencoders for New Molecule Development in Oncology”, whose lead author Artur Kadurin is the current CEO of Insilico Taiwan, my Ph.D. student, and a co-author of the Deep Learning book we released about a year ago. Here is the architecture:

    This is, in essence, a so-called conditional adversarial autoencoder:

    • an autoencoder receives as input a SMILES fingerprint (basically a bit string that represents a molecule and makes a lot of chemical and biological sense) and the drug concentration; it learns to produce a latent representation (embedding) on the middle layer and then decode it back to obtain the original fingerprint;
    • the condition (GI on the bottom) encodes the properties of the molecule; the conditional autoencoder trains on molecules with known properties, and then can potentially generate molecules with desired combinations of properties by supplying them to the middle layer;
    • and, finally, the discriminator (on top) tries to tell apart the distribution of latent representations (embeddings) and some known distribution, e.g., a standard Gaussian; this is the main idea of AAE, which is supposed to turn an autoencoder into a generative model: if we can make the distribution of embeddings indistinguishable from a known distribution, we can sample from the known distribution and decode these samples to get reasonable objects (a minimal sketch of this adversarial step follows right after this list).
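    Here is a minimal PyTorch sketch of that adversarial step on the latent space; the fingerprint length, layer sizes, and prior choice are illustrative, not the paper's exact configuration:

    ```python
    import torch
    import torch.nn as nn

    latent_dim = 32
    encoder = nn.Sequential(nn.Linear(166, 128), nn.ReLU(), nn.Linear(128, latent_dim))
    discriminator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    fingerprints = (torch.rand(256, 166) > 0.5).float()   # stand-in for a batch of binary fingerprints

    z_fake = encoder(fingerprints)            # embeddings produced by the encoder
    z_real = torch.randn(256, latent_dim)     # samples from the chosen prior, here N(0, I)

    bce = nn.BCEWithLogitsLoss()
    # discriminator: tell prior samples (label 1) from encoder outputs (label 0)
    d_loss = bce(discriminator(z_real), torch.ones(256, 1)) + \
             bce(discriminator(z_fake.detach()), torch.zeros(256, 1))
    # encoder ("generator" side): fool the discriminator into accepting its embeddings
    g_loss = bce(discriminator(z_fake), torch.ones(256, 1))
    ```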

    Again, we have been through this in the first part, so I refer there for more details. But today, we go further.

    druGAN: AAE or VAE?

    Our next paper on generative models for drug discovery had a laconic title of “druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico”, and it appeared in Molecular Pharmaceutics in 2017. The Cornucopia paper that we reviewed above actually solved a relatively small and limited problem: the conditional AAE was trained on a dataset with only 6252 available compounds profiled on a single cell line (MCF-7). This limited scope, naturally, could not satisfy the ambitious team of Insilico Medicine. And it only considered one type of generative models, GANs… wait, what? There’s more?

    Well yes, there is! There exists a wide variety of generative models even if you concentrate only on deep learning, i.e., models that have neural networks somewhere. I recommend the well-known tutorial by Ian Goodfellow: a lot has happened in GANs since that tutorial but the taxonomy of generative models is still very relevant.

    One of the main classes of generative models in deep learning today are variational autoencoders (VAE). The idea of VAE is exactly the same as in AAE: we want to make the distribution of latent embeddings z similar to some known distribution (say, a Gaussian) so that we can sample embeddings directly and then decode to get sample objects. But VAE implements this idea in a completely different way.

    VAE makes the assumption that the embeddings are indeed normally distributed, z ~ N(μ, Σ), where μ is the mean and Σ is the covariance matrix. The job of the encoder now is to produce the parameters of this normal distribution given an object, that is, the encoder outputs μ(x) and Σ(x) for the input object x; Σ is usually assumed to be diagonal, so the encoder essentially outputs a vector of dimension 2d, where d is the dimension of z. VAE also places a standard normal prior on the latent code z, pulling N(μ(x), Σ(x)) towards it. Then VAE samples a vector z from the distribution N(μ(x), Σ(x)), decodes it back to the original space of objects and, as a good autoencoder should, tries to make the reconstruction accurate. Here is how it all comes together in the druGAN paper:

    Notice how z is not sampled directly from N(μ(x), Σ(x)) but rather comes from a standard normal distribution which is then linearly transformed by μ(x) and Σ(x). This is known as the reparametrization trick, and it was one of the key ideas that made VAEs possible.
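    In code, the reparametrization trick is essentially one line; here is a self-contained PyTorch sketch (the diagonal Gaussian parametrized by μ and log-variance is the standard choice):

    ```python
    import torch

    def reparametrize(mu, log_var):
        """Sample z ~ N(mu, diag(exp(log_var))) while keeping gradients flowing to mu and log_var."""
        eps = torch.randn_like(mu)            # noise from a standard normal, no parameters involved
        return mu + torch.exp(0.5 * log_var) * eps

    mu = torch.zeros(4, 8, requires_grad=True)       # what the encoder would output
    log_var = torch.zeros(4, 8, requires_grad=True)
    z = reparametrize(mu, log_var)
    z.sum().backward()                        # gradients reach the encoder outputs mu and log_var
    ```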

    I’m not being entirely honest here: there is some beautiful mathematics behind all this, and it is needed to make this work, but, unfortunately, it goes way outside of the format of a popular article. Still, I recommend explanations such as this one, and maybe one day we will have a detailed NeuroNugget about it.

    In the druGAN paper, Kadurin et al. compared this VAE with an AAE-based architecture, an improved modification of the one proposed in the Cornucopia paper. Here is the architecture; comparing it with the picture above, you can see the difference between AAE and VAE:

    We trained several versions of both VAE and AAE on a set of MACCS fingerprints produced from the PubChem database of substances that contains more than 72 million different molecules, quite a step up from the six thousand used in Cornucopia. The results were promising: we were able to sample quite varied molecules and also trained a simple linear regression that predicted solubility from the features extracted by the autoencoders. Generally, the best AAE models outperformed the best VAE models, although the latter had some advantages in certain settings.

    The most meaningful conclusion, however, was that we always had a tradeoff between the two most important metrics: quality of reconstruction (measured by the reconstruction error) and variability of the molecules sampled from the trained model (measured by various diversity metrics). Without the former, you don’t get good molecules; without the latter, you don’t get new molecules. This tradeoff lies at the heart of modern research on generative models, and it is still very hard to keep the results both reasonable and diverse.

    Molecules in 3D: the Wave Transform Representation

    And with that, we finally come to the paper that made us write these posts: the joint work between Insilico Medicine and Neuromation (okay, mostly Insilico) titled “3D Molecular Representations Based on the Wave Transform for Convolutional Neural Networks”; it also appeared in Molecular Pharmaceutics.

    This work had a slightly different emphasis: instead of devising and testing new architectures, we tried to look at the descriptions of molecules that are fed as input to these architectures. One motivation for this was that the entire framework of deep learning for drug discovery that we had seen in both Cornucopia and druGAN presupposes that we will screen predicted fingerprints against a database of existing molecules. Not all fingerprints are viable, so you cannot take an arbitrary MACCS fingerprint and reconstruct a real molecule: you have to screen against actually existing fingerprints and find the best matches among them. If we could use a more informative molecular representation, we might not have to choose the proposed molecules from a known database, leading to the holy grail of drug discovery: de novo generation of molecular structures.

    So how can we encode molecular structure? People have tried a lot of things: a comprehensive reference by Todeschini et al. (2009) lists several thousand molecular descriptors, and this list has grown even further over the last decade. They can be broken down into string encodings, such as MACCS itself, graph encodings that capture the molecular graph (there are some very interesting works on how to make convolutions on graphs, e.g., Kearns et al., 2016; Liu et al., 2018), and 3D representations that also capture the bond lengths and mutual orientation of atoms in space.

    In molecular biology and chemistry, the 3D structure of a molecule is called a conformation; a given molecule can have many different conformations, and it may turn out that it’s important to choose the right one. For example, if the part of the molecule that is supposed to bind with a protein gets hidden inside the rest of the molecule, the drug will simply not work. So it sounds like a good idea to feed our models with 3D structures of the molecules in question: after all, it’s basically a picture in 3D, and there are plenty of successful CNNs with 3D input.

    But it proves to be not that easy. Let’s look at the main picture from the paper:

    Part (a) shows how the original molecule looks in 3D space: it’s a 3D structure composed of different atoms shown in different colors on the picture. How do we represent this structure to feed it to convolutional networks? The most straightforward answer would be to discretize the space into voxels (fun fact: the linear size of a voxel here is 0.5 Å; an angstrom is 0.1 nanometers!) and represent each atom as a one-hot vector in its voxel; the result is shown in part (b).
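    A minimal NumPy sketch of this one-hot voxelization (the grid size, number of atom types, and the toy "molecule" are all illustrative; only the 0.5 Å resolution comes from the paper):

    ```python
    import numpy as np

    def voxelize(coords, atom_types, num_types, grid=64, resolution=0.5):
        """
        Build a one-hot voxel grid for a molecule.

        coords: (N, 3) atom coordinates in angstroms, roughly centered at zero
        atom_types: (N,) integer channel index for each atom
        resolution: linear voxel size in angstroms (0.5 A as in the paper)
        """
        volume = np.zeros((num_types, grid, grid, grid), dtype=np.float32)
        idx = np.round(coords / resolution).astype(int) + grid // 2   # shift to the grid center
        for (i, j, k), t in zip(idx, atom_types):
            if 0 <= i < grid and 0 <= j < grid and 0 <= k < grid:
                volume[t, i, j, k] = 1.0
        return volume

    # a fake three-atom "molecule": two carbons (type 0) and one oxygen (type 1)
    coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.4, 0.0]])
    vol = voxelize(coords, np.array([0, 0, 1]), num_types=2)
    print(vol.sum())   # 3 occupied voxels out of 2 * 64**3 -- very sparse indeed
    ```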

    But this representation is far from perfect. First, it’s very sparse: less than 0.1% of the voxels contain atoms. Second, due to this sparsity, interactions between atoms are also hard to capture: yes, some atoms are near each other and some are farther away, but there is a lot of empty space around atoms, the data does not have enough redundancy, and CNNs just don’t work too well with this kind of data. Sparse voxels lead to sparse gradients, and the whole thing underfits.

    Therefore, it is better to smooth out the 3D representation in some way. Parts (c) and (d) of the picture above show Gaussian smoothing: we take each atom and “blur” it out with a Gaussian kernel, getting an exponentially decaying “ball” around each atom. The kernel in (d) has a much higher variance than in (c), so the result is more “blurry”. This also introduces the necessary redundancy, and the resulting representation is also more robust to errors:

    In the paper, we proposed a different kind of “blurring” based on the wave transform; its kernel is a Gaussian multiplied by a cosine function of the distance to center, so the “ball” still decays exponentially but now spreads out in waves. The result is shown in part (e) above. In the paper, we show that this transform has better theoretical properties, deriving an analytical inverse operation (deconvolution) for the wave transform.
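    Roughly, the two kinds of smoothing kernels can be sketched as follows; the width and frequency values below are illustrative, not the exact parameters from the paper:

    ```python
    import numpy as np
    from scipy.ndimage import convolve

    def gaussian_kernel(size=9, sigma=1.5):
        ax = np.arange(size) - size // 2
        xx, yy, zz = np.meshgrid(ax, ax, ax, indexing='ij')
        k = np.exp(-(xx**2 + yy**2 + zz**2) / (2 * sigma**2))
        return k / k.sum()

    def wave_kernel(size=9, sigma=1.5, freq=1.0):
        # a Gaussian envelope multiplied by a cosine of the distance to the kernel center:
        # the "ball" still decays exponentially but spreads out in waves
        ax = np.arange(size) - size // 2
        xx, yy, zz = np.meshgrid(ax, ax, ax, indexing='ij')
        r = np.sqrt(xx**2 + yy**2 + zz**2)
        return np.exp(-r**2 / (2 * sigma**2)) * np.cos(freq * r)

    # a single "atom" in the middle of an otherwise empty grid
    vol = np.zeros((32, 32, 32), dtype=np.float32)
    vol[16, 16, 16] = 1.0

    blurred = convolve(vol, gaussian_kernel())   # parts (c)-(d): Gaussian smoothing
    waved = convolve(vol, wave_kernel())         # part (e): wave transform smoothing
    ```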

    This translates into practical advantages, too. In the paper, we trained a simple autoencoder based on the Xception network, but even with this experiment you can see how the wave transform representation performs better. The picture below shows reconstruction results from the autoencoder at different stages of training:

    We can see that the voxel-based representation never allowed the model to reconstruct anything except carbon (and even that quite poorly), and Gaussian blur added nitrogen; the wave transform, however, has also been able to reconstruct oxygen atoms, and the general structure looks much better as well. Our experiments have also shown that the wave transform representation outperforms others in classification problems, e.g., in reconstructing the bits of MACCS fingerprints.

    Conclusion

    In this post, we have seen how different generative models compare for generating molecules that might become plausible candidates for new drugs. Insilico Medicine is already testing some of these molecules in the lab. Unfortunately, it’s a lengthy process, and nothing is guaranteed; but I hope we will soon see some of the automatically generated lead molecules confirmed by real experiments, and this may completely change medicine as we know it. Best of luck to our friends and collaborators from Insilico Medicine, and I’m sure we will meet them again in future NeuroNuggets. Stay tuned!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • MOSES: A 40-Week Journey to the Promised Land of Molecular Generation

    MOSES: A 40-Week Journey to the Promised Land of Molecular Generation

    Our long-term collaboration with Insilico Medicine, a company that focuses on artificial intelligence for drug discovery and longevity research, has borne some very important fruit. We have released a benchmarking platform for generative models, which we named MOSES (MOlecular SEtS). You can find the paper currently released on arXiv, the github repository, and the press release by research partner Insilico. Congratulations to the whole team! Before we dive into a little bit of detail, here is the Neuromation staff together with our collaborators from Insilico at the NIPS (NeurIPS, as they call it now) conference currently held in Montreal:

    Neuromation and Insilico Medicine at NIPS 2018. Left to right: Alex Zhavoronkov (Insilico/Buck Institute for Research on Aging), Elena Tutubalina (Neuromation), Daniil Polykovsky, Polina Mamoshina (Insilico), Rauf Kurbanov (Neuromation).

    So what is MOSES, why do we care, and what do Neuromation and Insilico have in common here?

    MOSES is a benchmarking platform for generative models that aim to generate molecular structures. We covered generative models such as generative adversarial networks (GANs) before (e.g., here or here). In molecular biology and biochemistry, generative models are used to produce candidate compounds that might have desired qualities. We have already published a post about our previous joint project with Insilico which provides more details about such models.

    One common thread in generative models is that they are really difficult to evaluate and compare. You have a black box that produces, say, images of human faces. Scratch that, you have twenty black boxes. Which one is best? You could try to get closer to the true answer by asking real people to evaluate the faces, but this surely won’t scale.

    So researchers have been developing metrics to compare generative models. I will save a more detailed explanation of the metrics for an in-depth Neuromation Research post which is going to follow soon. For now, let me just say that there are plenty of different metrics. It’s really hard to collect them all from very different implementations, and even harder to claim that your numbers are really comparable with the numbers in other papers. The whole field could sure use some standardization and streamlining.

    That’s where MOSES comes in. In this project, we:

    • prepared a large dataset of approximately 2 million molecules based on specially designed chemical filters;
    • implemented the most popular metrics for the evaluation of generative models;
    • most importantly, implemented several state of the art models and provided a large and unified experimental comparison between them.

    The MOSES pipeline

    Building MOSES was a big project. We have been working on it for the better part of this year; the 40 weeks in the title are no exaggeration. In a project like this, you need a deep and well integrated collaboration between chemists, medical researchers, and machine learning gurus. And that is exactly what we had between Neuromation, Insilico Medicine, Harvard University, and the University of Toronto.

    The result is a benchmarking dataset, an evaluation pipeline, and a large-scale experimental comparison that provides the stable footing so badly needed for this field. Now, new works can build upon this foundation, compare new models with baselines from our paper, and make direct quantitative comparisons in terms of various evaluation metrics. We hope that researchers all over the world will benefit from our joint effort.

    My congratulations to our team and to our dear friends at Insilico Medicine! There are lots of contributors, but let me highlight and thank the following individuals: big thanks to Daniil Polykovsky, Alexander Zhebrak, Vladimir Aladinskiy, Mark Veselov, Artur Kadurin, and Alex Zhavoronkov from Insilico, to Benjamin Sanchez-Lengeling from Harvard, Alán Aspuru-Guzik from the University of Toronto/Vector Institute, and to Neuromation researchers Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, and Aleksey Artamonov. Thanks guys!

    Thank you for reading and stay tuned for the next updates from our Neuromation Research blog!

    Sergey Nikolenko
    Chief Research Officer, Neuromation