Author: snikolenko

  • NeuroNuggets: Cut-and-Paste in Deep Learning

    NeuroNuggets: Cut-and-Paste in Deep Learning

    …Many people think that authors
    just cut and paste from real life into books.
    It doesn’t work quite that way.
    ― Paul Fleischman

    As the CVPR in Review posts (there were five: GANs for computer vision, pose estimation and tracking for humans, synthetic data, domain adaptation, and face synthesis) have finally dried up, we again turn to our usual stuff. In the NeuroNugget series, we usually talk about specific ideas in deep learning and try to bring you up to speed on each. We have had some pretty general and all-encompassing posts here, but it is often both fun and instructive to dive deeper into something very specific. So we will devote some NeuroNuggets to reviewing a few recent papers that share a common thread.

    And today, this thread is… cut-and-paste! And not the kind we all do from other people’s GitHub repositories. In computer vision, this idea is often directly related to synthetic data, as cutting and pasting sometimes proves to be a fertile middle ground between real data and going fully synthetic. But let’s not get ahead of ourselves…

    Naive Cut-and-Paste as Data Augmentation

    We have talked in great detail about object detection and segmentation, two of the main problems of computer vision. To solve them, models need training data, the more the merrier. In modern computer vision, training data is always in short supply, so researchers always use various data augmentation techniques to enlarge the dataset.

    The point of data augmentation is to introduce various modifications of the original image that do not change the ground truth labels you have or change them in predictable ways. Common augmentation techniques include, for instance, moving and rotating the picture and changing its color histogram in predictable ways:

    Image source

    Or changing the lighting conditions and image parameters that basically reduce to applying various Instagram filters:

    Image source

    Notice how in terms of individual pixels, the pictures change completely, but we still have a very predictable and controllable transformation of what the result should be. If you know where the cat was in the original image, you know exactly where it is in the rotated-and-cropped one; and Instagram filters usually don’t change the labels at all.

    Data augmentation is essential to reduce overfitting and effectively extend the dataset for free; it is usually silently understood in all modern computer vision applications and implemented in standard deep learning libraries (see, e.g., keras.preprocessing.image).
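    For instance, a minimal augmentation sketch with the Keras tools mentioned above might look like this (the parameter values are arbitrary, chosen only for illustration):

    ```python
    # Random label-preserving transformations with keras.preprocessing.image.
    import numpy as np
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        rotation_range=15,            # small random rotations
        width_shift_range=0.1,        # horizontal shifts up to 10% of width
        height_shift_range=0.1,       # vertical shifts up to 10% of height
        horizontal_flip=True,         # random mirroring
        brightness_range=(0.7, 1.3),  # "Instagram filter"-style brightness jitter
    )

    batch = np.random.rand(8, 224, 224, 3)  # stand-in for a real image batch
    augmented = next(datagen.flow(batch, batch_size=8, shuffle=False))
    print(augmented.shape)  # (8, 224, 224, 3)
    ```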

    Cutting and pasting sounds like a wonderful idea in this regard: why not cut out objects from images and paste them onto different backgrounds? The problem, of course, is that it is hard to cut and paste an object in a natural way; we will return to this problem later in this post. However, last year (2017) saw a few papers claiming that you don’t really have to be terribly realistic to make the augmentation work.

    The easiest and most straightforward approach was taken by Rao and Zhang in their paper “Cut and Paste: Generate Artificial Labels for Object Detection” (which appeared at ICVIP 2017). They simply took object detection datasets (VOC07 and VOC12), cut out objects according to their ground truth labels, and pasted them onto images with different backgrounds. Like this:

    Source: (Rao, Zhang, 2017)
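    To make the idea concrete, here is a minimal sketch of such naive cut-and-paste (our own illustration, not the authors’ code), assuming ground truth boxes in (x_min, y_min, x_max, y_max) pixel coordinates:

    ```python
    # Cut an object out of a source image by its bounding box and paste it onto
    # a different background, producing a new training image and a new box label.
    from PIL import Image

    def cut_and_paste(src_img, src_box, background_img, paste_xy):
        x0, y0, x1, y1 = src_box
        crop = src_img.crop((x0, y0, x1, y1))
        out = background_img.copy()
        px, py = paste_xy
        out.paste(crop, (px, py))
        new_box = (px, py, px + (x1 - x0), py + (y1 - y0))
        return out, new_box

    # Usage with hypothetical file names:
    # obj = Image.open("voc_image.jpg")
    # bg = Image.open("background.jpg")
    # new_img, new_label = cut_and_paste(obj, (48, 30, 210, 180), bg, (100, 60))
    ```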

    Then they trained on these images, using cut-and-paste just like any other augmentation. Even with this very naive approach, they claimed to noticeably improve the results of standard object detection networks like YOLO and SSD. More importantly, they claimed to reduce common error modes of YOLO and SSD. The picture below shows the results after training on the left; and indeed, wrong labels decrease and bounding boxes significantly improve in many cases:

    Source: (Rao, Zhang, 2017)

    A similar but slightly less naive approach to cutting and pasting was introduced, also in 2017, by researchers from Carnegie Mellon University. In “Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection” (ICCV 2017), Dwibedi et al. use the same basic idea, but instead of just placing whole bounding boxes they go for segmentation masks. Here is a graphical overview of their approach:

    Source: (Dwibedi et al., 2017)

    Basically, they take a set of images of the objects they want to recognize, collect a set of background scenes, and then paste objects into the scene. Interestingly, they are recognizing grocery items in indoor environments, just like we did in our first big project on synthetic data.

    Dwibedi et al. claim that it is not really important to place objects in globally realistic ways, but it is important to achieve local realism. That is, modern object detectors do not care that much whether a Coke bottle stands on the counter or on the floor; however, it is important to blend the object as realistically as possible into the local background. For this purpose, Dwibedi et al. consider several different blending algorithms for pasting objects:

    Source: (Dwibedi et al., 2017)
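    For the curious, here is a rough sketch of two such blending modes, a Gaussian-blurred alpha blend and Poisson (“seamless”) cloning via OpenCV; the array shapes and the mask convention (255 inside the object) are our assumptions, not the authors’ exact setup:

    ```python
    import cv2
    import numpy as np

    def blend_gaussian(obj, mask, background, top_left, sigma=3):
        """Alpha-blend obj (h, w, 3) into background using a blurred binary mask (h, w)."""
        h, w = mask.shape
        y, x = top_left
        alpha = cv2.GaussianBlur(mask.astype(np.float32) / 255.0, (0, 0), sigma)
        alpha = alpha[..., None]                       # broadcast over color channels
        roi = background[y:y + h, x:x + w].astype(np.float32)
        blended = alpha * obj.astype(np.float32) + (1 - alpha) * roi
        background[y:y + h, x:x + w] = blended.astype(np.uint8)
        return background

    def blend_poisson(obj, mask, background, center):
        """Poisson blending: the gradients of obj are blended into the background."""
        return cv2.seamlessClone(obj, background, mask, center, cv2.NORMAL_CLONE)
    ```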

    They then make blending another dimension of data augmentation, another factor of variability in order to make the detector robust against boundary artifacts. Together with other data augmentation techniques, it proves highly effective; “All Blend” in the table below means that all versions of blending for the same image are included in the training set:

    Source: (Dwibedi et al., 2017)

    This also serves as evidence for the point about the importance of local realism. Here are some sample synthetic images Dwibedi et al. come up with:

    Source: (Dwibedi et al., 2017)

    As you can see, there is indeed little global realism here: objects are floating in the air with no regard to the underlying scene. However, here is how the accuracy improves when you go from real data to real+synthetic:

    Source: (Dwibedi et al., 2017)

    Note that all of these improvements have been achieved in a completely automated way. The only thing Dwibedi et al. need to build their synthetic dataset is a set of object images that would be easy to segment (in their case, photos of objects on a plain background). Then it is all in the hands of neural networks and algorithms: a convolutional network predicts segmentation masks, an algorithm does augmentation for the objects, and then blending algorithms make local patches more believable, so the entire pipeline is fully automated. Here is a general overview of the algorithms that constitute this pipeline:

    Source: (Dwibedi et al., 2017)

    Smarter Augmentation: Pasting with Regard to Geometry

    We have seen that even very naive pasting of objects can help improve object detection by making what is essentially synthetic data. The next step in this direction would be to actually try to make the pasted objects consistent with the geometry and other properties of the scene.

    Here we begin with a special case: text localization, i.e., object detection specifically for text appearing on an image. That is, you want to take a picture with some text on it and output bounding boxes for the text instances regardless of their form, font, and color, like this:

    Image source

    This is a well-known problem that has been studied for decades, but here we won’t go into too many details on how to solve it. The point is, in 2016 (the oldest paper in this post, actually) researchers from the University of Oxford proposed an approach to blending synthetic text into real images in a way coherent with the geometry of the scene. In “Synthetic Data for Text Localisation in Natural Images”, Gupta et al. use a novel modification of a fully convolutional regression network (FCRN) to predict bounding boxes, but the main novelty lies in synthetic data generation.

    They first sample text and a background image (scraped from Google Image Search, actually). Then the image goes through several steps:

    • first, through a contour detection algorithm called gPb-UCM; proposed in (Arbelaez, Fowlkes, 2011), it does not contain any neural networks and is based on classical computer vision techniques (oriented gradient of histograms, multiscale cue combination, watershed transform etc.), so it is very fast to apply but still produces results that are sufficiently good for this application;
    • out of the resulting regions, Gupta et al. choose those that are sufficiently large and have sufficiently uniform textures: they are suitable for text placement;
    • to understand how to rotate the text, they estimate a depth map (with a state-of-the-art CNN), fit a planar facet to the region in question (with the RANSAC algorithm), and then add the text, blending it in with Poisson editing.
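    To illustrate the plane-fitting step, here is a tiny RANSAC sketch: given 3D points back-projected from the depth map of a candidate region, it finds the planar facet that tells us how to orient the text (the thresholds and iteration counts below are arbitrary):

    ```python
    import numpy as np

    def ransac_plane(points, n_iters=200, threshold=0.02, seed=0):
        """points: (N, 3) array of 3D points; returns (normal, d) of the plane n·x + d = 0."""
        rng = np.random.default_rng(seed)
        best_inliers, best_plane = 0, None
        for _ in range(n_iters):
            sample = points[rng.choice(len(points), 3, replace=False)]
            normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
            norm = np.linalg.norm(normal)
            if norm < 1e-8:            # degenerate (collinear) sample, skip it
                continue
            normal /= norm
            d = -normal.dot(sample[0])
            inliers = np.sum(np.abs(points @ normal + d) < threshold)
            if inliers > best_inliers:
                best_inliers, best_plane = inliers, (normal, d)
        return best_plane
    ```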

    Here is a graphical overview of these steps, with sample generated images on the bottom:

    Source: (Gupta et al., 2016)

    As a result, Gupta et al. manage to produce very good text placement that blends in with the background scene; their images are unrealistic only in the sense that we might not expect text to appear in these places at all, and otherwise they are perfectly fine:

    Source: (Gupta et al., 2016)

    With this synthetic dataset, Gupta et al. report significantly improved results in text localization.

    In “Synthesizing Training Data for Object Detection in Indoor Scenes”, Georgakis et al. from George Mason University and the University of North Carolina at Chapel Hill applied similar ideas to pasting objects into scenes rather than just text. Their emphasis is on blending the objects into scenes in a way consistent with the scene geometry and meaning. To do this, Georgakis et al.:

    • use the BigBIRD dataset (Big Berkeley Instance Recognition Dataset) that contains 600 different views for every object in the dataset; this lets the authors blend real images of various objects rather than do the 3D modeling required for a purely synthetic approach;
    • use an approach by Taylor & Cowley (2012) to parse the scene, which again uses the above-mentioned RANSAC algorithm (at some point, we really should start a NonNeuroNuggets series to explain some classical computer vision ideas — they are and will remain a very useful tool for a long time) to extract the planar surfaces from the indoor scene: counters, tables, floors and so on;
    • combine this extraction of supporting surfaces with a convolutional network by Mousavian et al. (2012) that combines semantic segmentation and depth estimation; semantic segmentation lets the model understand which surfaces are indeed supporting surfaces where objects can be placed;
    • then depth estimation and positioning of the extracted facets are combined to understand the proper scale and position of the objects on a given surface.
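    The scaling step itself reduces to simple pinhole-camera arithmetic; here is a back-of-the-envelope sketch (the example numbers are made up for illustration):

    ```python
    def object_height_in_pixels(real_height_m, depth_m, focal_length_px):
        """Approximate on-image height of an object of known real height placed at a given depth."""
        return focal_length_px * real_height_m / depth_m

    # e.g., a 0.25 m tall cereal box, 2 m from a camera with focal length 525 px:
    print(object_height_in_pixels(0.25, 2.0, 525.0))  # ~65.6 pixels
    ```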

    Here is an illustration of this process, which the authors call selective positioning:

    Source: (Georgakis et al., 2017)

    Here (a) and (e) show the original scene and its depth map, (b) and (c) show semantic segmentation results with predictions for counters and tables highlighted on (c), (f) is the result of plane extraction, and (g) shows the estimated supporting surfaces; they all combine to find the regions for object placement shown on (d), and then the object is properly scaled and blended on (h) to obtain the final result (i). Here are some more examples to show that the approach indeed works quite well:

    Source: (Georgakis et al., 2017)

    Georgakis et al. train and compare Faster R-CNN and SSD with their synthetic dataset. Here is one of the final tables:

    Source: (Georgakis et al., 2017)

    We won’t go into the full details, but it basically shows that, as always, you can get excellent results on synthetic data by training on synthetic data (which is useless), while training purely on this kind of synthetic data does not give good results on real data. But if you throw real and synthetic data together, then yes, there is a noticeable improvement compared to using just the real dataset. Since this is still just a form of augmentation and thus is basically free (provided that you have a dataset of different views of your objects), why not?

    Cutting and Pasting for Segmentation… with GANs

    Finally, the last paper in our review is a quite different animal. In this paper recently released by Google, Remez et al. (2018) are actually solving the instance segmentation problem with cut-and-paste, but they are not trying to prepare a synthetic dataset to train a standard segmentation model. Rather, they are using cut-and-paste as an internal quality metric for segmentations: a good segmentation mask will produce a good image with a pasted object. In the image below, a bad mask (a) leads to an unconvincing image (b), and a good mask (c) produces a much better image (d), although the ground truth (e) is better still:

    Source: (Remez et al., 2018)

    How does the model decide which images are “convincing”? With an adversarial architecture, of course! In the model pipeline shown below, the generator is actually doing the segmentation, and the discriminator judges how good the pasted image is by trying to distinguish it from real images:

    Source: (Remez et al., 2018)

    The idea is simple and brilliant: only a very good segmentation mask will result in a convincing fake, hence the generator learns to produce good masks… even without any labeled training data for segmentation! The whole pipeline only requires the bounding boxes for objects to cut out.
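    The compositing step at the heart of this idea is easy to write down; in the minimal sketch below, the generator’s soft mask decides which pixels come from the cut-out object and which from the new background (the shapes and value ranges are assumptions):

    ```python
    import numpy as np

    def composite(object_crop, mask, background):
        """object_crop, background: (H, W, 3) floats in [0, 1]; mask: (H, W) floats in [0, 1]."""
        m = mask[..., None]                       # broadcast the mask over color channels
        return m * object_crop + (1.0 - m) * background

    # The discriminator then judges whether composite(...) looks like a real image;
    # only an accurate mask makes the paste convincing.
    ```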

    But you still have to paste objects intelligently. There are several important features required to make this idea work. Let’s go through them one by one.

    1. Where do we paste? One can either paste at uniformly random points of the image or try to take the scene geometry into account and be smart about it, like in the papers above. Here, Remez et al. find that yes, pasting objects at a proper scale and place in the scene does help. And no wonder; in the picture below, first look at the left and see how long it takes you to spot the pasted objects. Then look at the right, where they have been pasted uniformly at random. Where will the discriminator’s job be easier?

    Source: (Remez et al., 2018)

    2. There are a couple of degenerate corner cases that formally represent a very good solution but are actually useless. For example, the generator could learn to “cut out” all or none of the pixels in the image and thus make the result indistinguishable from real… because it is real! To discourage the generator from choosing all pixels, the discriminator simply receives a larger view of the scene, seeing, so to speak, the bigger picture, so this strategy ceases to work. To discourage it from choosing no pixels, the authors introduce an additional classification network that attempts to classify the object of interest, together with a corresponding loss function. Now, if the object has not been cut out, classification will certainly fail, incurring a large penalty.

    3. Sometimes, cutting only a part of the segmentation mask still results in a plausible object. This is characteristic for modular structures like buildings; for example, in these satellite images some of the masks are obviously incomplete but the resulting cutouts will serve just fine:

    Source: (Remez et al., 2018)

    To fix this, the authors set up another adversarial game, this time trying to distinguish the background resulting from cutting out the object from the background resulting from the same cut made elsewhere in the scene. This is basically yet another term in the loss function; modern GANs tend to grow pretty complicated loss functions, and maybe someday we will explore them in more detail.
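    Putting the pieces together, the generator’s objective might look roughly like the sketch below; the loss weights and the network interfaces are our assumptions for illustration, not the authors’ exact formulation:

    ```python
    import torch
    import torch.nn.functional as F

    def generator_loss(d_fake_logits, cls_logits, true_class, bg_d_fake_logits,
                       w_cls=1.0, w_bg=1.0):
        # standard adversarial term: the composited image should look real to D
        adv = F.binary_cross_entropy_with_logits(
            d_fake_logits, torch.ones_like(d_fake_logits))
        # classification term: if nothing was cut out, the object classifier fails
        cls = F.cross_entropy(cls_logits, true_class)
        # second adversarial game played on the background left after the cut
        bg = F.binary_cross_entropy_with_logits(
            bg_d_fake_logits, torch.ones_like(bg_d_fake_logits))
        return adv + w_cls * cls + w_bg * bg
    ```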

    The authors compare their resulting strategy with some other pretrained baselines; while they, of course, lose to fully supervised methods (with access to ground truth segmentation masks in the training set), they come out ahead of the baselines. It is actually pretty cool that you can get segmentation masks like this with no segmentation labeling effort at all:

    Source: (Remez et al., 2018)

    There are failure cases too, of course. Usually they happen when the result is still realistic enough even with the incorrect mask. Here are some characteristic examples:

    Source: (Remez et al., 2018)

    This work is a very interesting example of a growing trend towards data-independent methods in deep learning. More and more often, researchers find ways around the need to label huge datasets, and deep learning gradually learns to do away with the hardships of data labeling. We are not quite there yet but I hope that someday we will be. Until next time!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • What’s In a Face (CVPR in Review V)

    What’s In a Face (CVPR in Review V)

    I have said that she had no face; but that meant she had a thousand faces…

    ― C.S. Lewis, Till We Have Faces

    Today we present to you another installment where we dive into the details of a few papers from the CVPR 2018 (Computer Vision and Pattern Recognition) conference. We’ve had four already: about GANs for computer vision, about pose estimation and tracking for humans, about synthetic data, and, finally, about domain adaptation. In particular, in the fourth part we presented three papers on the same topic that actually had numerically comparable results.

    Today, we turn to a different problem that also warrants a detailed comparison. We will talk about face generation, that is, about synthesizing a realistic picture of a human face, either from scratch or by changing some features of a real photo. Actually, we already touched upon this problem a while ago, in our first post about GANs. But since then, generative adversarial networks (GANs) have been one of the very hottest topics in machine learning, and it is no wonder that new advances await us today. And again, it is my great pleasure to introduce Anastasia Gaydashenko with whom we have co-authored this text.

    GANs for Face Synthesis and the Importance of Loss Functions

    We have already spoken many times about how important a model’s architecture and a good dataset are for deep learning. In this post, one recurrent theme will be the meaning and importance of loss functions, that is, the objective functions that a neural network is actually trained to optimize. One could argue that the loss function is a part of the architecture, but in practice we usually think about them separately; e.g., the same basic architecture could serve a wide variety of loss functions with only minor changes, and that is something we will see today.

    We chose these particular papers because we liked them best, but also because they are all using GANs, and all of them use GANs to modify pictures of faces while preserving the person’s identity. This is a well-established application of GANs; classical papers have used it to predict how a person changes with age or how he or she would look with a different gender. The papers that we consider today take this line of research one step further, parceling out certain parts of a person’s appearance (e.g., makeup or emotions) in such a way that they become subject to manipulation.

    Thus, in a way all of today’s papers are also solving the same problem and might be comparable with each other. The problem, though, is that the true evaluation of a model’s results can basically be done only by a human: you need to judge how realistic the new picture looks. And in our case, the specific tasks and datasets are somewhat different too, so we will not have a direct comparison of the results; instead, we will extract and compare new interesting ideas.

    On to the papers!

    Towards Open-Set Identity Preserving Face Synthesis

    The authors of the first paper, a joint work of researchers from the University of Science and Technology of China and Microsoft Research (full pdf), aim to disentangle identity and attributes from a single face image. The idea is to decompose a face’s representation into “identity” and “attributes” in such a way that identity corresponds to the person, and attributes correspond to basically everything that could be modified while still preserving identity. Then, using this extracted identity, we can add attributes extracted from a different face. Like this:

    Fascinating, right? Let’s investigate how they do it. There are quite a few novel interesting tricks in the paper, but the main contribution of this work is a new GAN-based architecture:

    Here the network takes as input two pictures: the identity picture and the attributes picture that will serve as the source for everything except the person’s identity: pose, emotion, illumination, and even the background.

    The main components of this architecture include:

    • identity encoder I that produces a latent representation (embedding) of the identity input xˢ;
    • attributes encoder A that does the same for the attributes input xᵃ;
    • mixed picture generator G that takes as input both embeddings (concatenated) and produces the picture x’ that is supposed to mix the identity of xˢ and the attributes of xᵃ;
    • identity classifier C that checks whether the person in the generated picture x’ is indeed the same as in xˢ;
    • discriminator D that tries to distinguish real and generated examples to improve generator performance, in the usual GAN fashion.

    This is the structure of the model used for training; when all components have been trained, for generation itself it suffices to use only the part inside the dotted line, so the networks C and D are only included in the training phase.

    The main problem, of course, is how to disentangle identity from attributes. How can we tell the network what it should take from xˢ and what from xᵃ? The architecture outlined above does not answer this question by itself, the main work here is done by a careful selection of loss functions. There are quite a few of them; let us review them one by one. The NeuroNugget format does not allow for too many formulas, so we will try to capture the meaning of each part of the loss function:

    • the most straightforward part is the softmax classification loss Lᵢ that trains identity encoder I to recognize the identity of people shown on the photos; basically, we train I to serve as a person classifier and then use the last layer of this network as features fᵢ(xˢ);
    • the reconstruction loss Lᵣ is more interesting; we would like the result x’ to reconstruct the original image xᵃ anyway, but there are two distinct cases here:
    • if the person on image xᵃ is the same as on the identity image xˢ, there is no question what we should do: we should reconstruct xᵃ as exactly as possible;
    • and if xᵃ and xˢ show two different people (we know all identities during the supervised training phase), we also want to reconstruct xᵃ but with a lower penalty for “errors” (10 times lower in the authors’ experiments); we don’t actually want to reconstruct xᵃ exactly now but still want x’ to be similar to xᵃ;
    • the KL divergence loss Lᵏˡ is intended to help the attributes encoder A concentrate on attributes and “lose” the identity as much as possible; it serves as a regularizer to make the attributes vector distribution similar to a predefined prior (standard Gaussian);
    • the discriminator loss Lᵈ is standard GAN business: it shows how well D can discriminate between real and fake images; however, there is a twist here as well: instead of just including the discriminator loss Lᵈ, the network starts by using Lᵍᵈ, a feature matching loss that measures how similar the features extracted by D on some intermediate level from x’ and xᵃ are; this is because we cannot expect to fool D right away: the discriminator will be nearly perfect at the beginning of training, so we have to settle for a weaker loss function first (see the CVAE-GAN paper for more details);
    • and, again, the same trick works for the identity classifier C; we use the basic classification loss Lᶜ but also augment it with the distance Lᵍᶜ between feature representations of x’ and xˢ on some intermediate layer of C.
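    To see how these terms might fit together in a single training step, here is a schematic sketch; the encoder and classifier interfaces (I.classify, A returning a mean and log-variance, D.features, C.features) and the loss weights are placeholders we made up for illustration, not the paper’s exact code:

    ```python
    import torch

    def training_step(I, A, G, C, D, x_s, x_a, identity_labels, same_person,
                      w_r=1.0, w_kl=0.01, w_gd=1.0, w_gc=1.0):
        f_i = I(x_s)                               # identity embedding of xˢ
        f_a, mu, logvar = A(x_a)                   # attribute embedding of xᵃ
        x_prime = G(torch.cat([f_i, f_a], dim=1))  # generated face x'

        # softmax identity loss: I doubles as a person classifier
        L_i = torch.nn.functional.cross_entropy(I.classify(x_s), identity_labels)
        # reconstruction loss, down-weighted when xˢ and xᵃ show different people
        weight = 0.1 + 0.9 * same_person.float()
        L_r = (weight * (x_prime - x_a).abs().mean(dim=(1, 2, 3))).mean()
        # KL term pushing the attribute distribution towards a standard Gaussian
        L_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        # feature-matching surrogates for the discriminator and identity classifier
        L_gd = (D.features(x_prime) - D.features(x_a)).pow(2).mean()
        L_gc = (C.features(x_prime) - C.features(x_s)).pow(2).mean()

        return L_i + w_r * L_r + w_kl * L_kl + w_gd * L_gd + w_gc * L_gc
    ```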

    (Disclaimer: I apologize for slightly messing up notation from the pictures but Medium actually does not support sub/superscripts so I had to make do with existing Unicode symbols.)

    That was quite a lot to take in, wasn’t it? Well, this is how modern GAN-based architectures usually work: their final loss function is usually a sum of many different terms, each with its own motivation and meaning. But the resulting architecture works out very nicely; we can now train it in several different ways:

    • first, networks I and C are doing basically the same thing, identifying people; therefore, they can share both the architecture and the weights (which simplifies training), and we can even use a standard pretrained person identification network as a very good initialization for I and C;
    • next, we train the whole thing on a dataset of images of people with known identities; as we have already mentioned, we can pick pairs of xˢ and xᵃ as different images of the same person and have the network try to reconstruct xᵃ exactly, or pick xˢ and xᵃ with different people and train with a lower weight of the reconstruction loss;
    • but even that is not all; publicly available labeled datasets of people are not diverse enough to train the whole architecture end-to-end, but, fortunately, the architecture also allows for unsupervised training: if we don’t know the identity, we can’t train I and C, so we have to ignore their loss functions, but we can still train the rest! And we have already seen that I and C are the easiest to train, so we can assume they have been trained well enough on the supervised part. Thus, we can simply grab some random faces from the Web and add them to the training set without knowing the identities.

    Thanks to the conscious and precise choice of the architecture, loss functions, and the training process, the results are fantastic! Here are two selections from the paper. In the first, we see transformations of faces randomly chosen from the training set with random faces for attributes:

    And in the second, the identities never appeared in the training set! These are people completely unknown to the network (“zero-shot identities”, as the paper calls them)… and it still works just fine:

    PairedCycleGAN: Asymmetric Style Transfer for Applying and Removing Makeup

    This collaboration of researchers from Princeton, Berkeley, and Adobe (full pdf) works in the same vein as the previous paper but tackles a much more specific problem: can we add or modify the makeup on a photograph, rather than changing all attributes at once, while keeping the face as recognizable as possible? A major problem here is, as often happens in machine learning, with the data: a relatively direct approach would be quite possible if we had a large dataset of aligned photographs of faces with and without makeup… but of course we don’t. So how do we solve this?

    The network still gets two images as an input: the source image from which we take the face and the reference image from which we take the makeup style. The model then produces the corresponding output; here are some sample results, and they are very impressive:

    This unsupervised learning framework relies on a new model of a cycle-consistent generative adversarial network; it consists of two asymmetric functions: the forward function encodes example-based style transfer, whereas the backward function removes the style. Here is how it works:

    The picture shows two coupled networks designed to implement these functions: one that transfers makeup style (G) and another that can remove makeup (F); the idea is to make the output of their successive application to an input photo match the input.

    Let us talk about losses again because they define the approach and capture the main new ideas in this work as well. The only notation we need for that is that X is the “no makeup” domain and Y is the domain of images with makeup. Now:

    • the discriminator DY tries to discriminate between real samples from domain Y (with makeup) and generated samples, and the generator G aims to fool it; so here we use an adversarial loss to constrain the results of G to look similar to makeup faces from domain Y;
    • the same loss function is used for F for the same reason: to encourage it to generate images indistinguishable from no-makeup faces sampled from domain X;
    • but these loss functions are not enough; they would simply let the generator reproduce the same picture as the reference without any constraints imposed by the source; to prevent this, we use the identity loss for the composition of G and F: if we apply makeup to a face x from X and then immediately remove it, we should get back the input image x exactly;
    • now we have made the output of G to belong to Y (faces with makeup) and preserve the identity, but we still are not really using the reference makeup style in any way; to transfer the style, we use two different style losses:
    • style reconstruction loss Ls says that if we transfer makeup from a face y to a face x with G(x,y), then remove makeup from y with F(y), and then apply the style from G(x,y) back to F(y), we should get y back, i.e., G(F(y), G(x,y)) should be similar to y;
    • and then on top of all this, we add another discriminator DS that decides whether a given pair of faces have the same makeup; its style discriminator loss LP is the final element of the objective function.
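    In code, the two reconstruction-style constraints are short; here is a minimal sketch with G (apply the reference makeup) and F_net (remove makeup, F in the paper) as placeholder networks whose interfaces we are assuming:

    ```python
    def identity_loss(G, F_net, x, y):
        """Apply the makeup of y to x, then remove it: we should get x back."""
        return (F_net(G(x, y)) - x).abs().mean()

    def style_reconstruction_loss(G, F_net, x, y):
        """G(F(y), G(x, y)) should reconstruct y: re-applying the transferred style
        to the de-made-up reference face recovers the reference."""
        transferred = G(x, y)
        return (G(F_net(y), transferred) - y).abs().mean()
    ```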

    There is more to the paper than just loss functions. For example, another problem was how to acquire a dataset of photos for the training set. The authors found an interesting solution: use beauty-bloggers from YouTube! They collected a dataset from makeup tutorial videos (verified manually on Amazon Mechanical Turk), thus ensuring that it would contain a large variety of makeup styles in high resolution.

    The results are, again, pretty impressive:

    The results become especially impressive if you compare them with previous state of the art models for makeup transfer:

    We have a feeling that the next Prisma might very well be lurking somewhere nearby…

    Facial Expression Recognition by De-expression Residue Learning

    With the last paper for today (full pdf), we turn from makeup to a different kind of very specific facial features: emotions. How can we disentangle identity and emotions?

    In this work, the proposed architecture contains two learning processes: the first is learning to generate standard neutral faces by conditional GANs (cGAN), and the second is learning from the intermediate layers of the resulting generator. To train the cGANs, we use pairs of face images that show some expression (input), and a neutral face image of the same subject (output):

    The cGAN is trained as usual: the generator reconstructs the output based on the input image, and then the tuples (input, target, yes) and (input, output, no) are given to the discriminator. The discriminator tries to distinguish generated samples from the ground truth, while the generator tries not only to confuse the discriminator but also to generate an image as close to the target image as possible (composite loss functions again, but this time relatively simple).

    The paper calls this process de-expression (removing expression from a face), and the idea is that during de-expression, information related to the actual emotions is still recorded as an expressive component in the intermediate layers of the generator. Thus, for the second learning process we fix the parameters of the generator, and the outputs of intermediate layers are combined and used as input for deep models that do facial expression classification. The overall architecture looks like this:

    After neutral face generation, the expression information can be analyzed by comparing the neutral face and the query expression face at the pixel level or feature level. However, pixel-level difference is unreliable due to the variation between images (i.e., rotation, translation, or lighting). This can cause a large pixel-level difference even without any changes in the expression. The feature-level difference is also unstable, as the expression information may vary according to the identity information. Since the difference between the query image and the neutral image is recorded in the intermediate layers, the authors exploit the expressive component from the intermediate layers directly.
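    In practice, reading the expressive component off a frozen generator can be done with forward hooks; here is a sketch (the layer names and the classifier that consumes the captured features are assumptions):

    ```python
    import torch

    def collect_intermediate_features(generator, layer_names, query_image):
        """Run the frozen de-expression generator and capture chosen intermediate activations."""
        features, handles = {}, []
        for name, module in generator.named_modules():
            if name in layer_names:
                handles.append(module.register_forward_hook(
                    lambda mod, inp, out, name=name: features.update({name: out.detach()})))
        with torch.no_grad():
            generator(query_image)      # we only need the hooks' side effects
        for h in handles:
            h.remove()
        return features                  # fed to the expression classifiers
    ```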

    The following figure illustrates some samples of the de-expression residue, i.e., the expressive components for anger, disgust, fear, happiness, sadness, and surprise, respectively; the picture shows the corresponding histogram for each expressive component. As we can see, both the expressive components and the corresponding histograms are quite distinguishable:

    And here are some sample results on different datasets. In all pictures, the first column is the input image, the third column is the ground-truth neutral face image of the same subject, and the middle is the output of the generative model:

    As a result, the authors both get a nice network for de-expression, i.e., removing emotion from a face, and improve state of the art results for emotion recognition by training the emotion classifier on rich features captured by the de-expression network.

    Final words

    Thank you for reading! With this, we are finally done with CVPR 2018. It is hard to do justice to a conference this large; naturally, there were hundreds of very interesting papers that we have not been able to cover. But still, we hope it has been an interesting and useful selection. We will see you again soon in the next NeuroNugget installments. Good luck!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

    Anastasia Gaydashenko
    former Research Intern at Neuromation, currently Machine Learning Intern at Cisco

  • State of the Art in Domain Adaptation (CVPR in Review IV)

    State of the Art in Domain Adaptation (CVPR in Review IV)

    We have already had three installments about the CVPR 2018 (Computer Vision and Pattern Recognition) conference: the first part was devoted to GANs for computer vision, the second part dealt with papers about recognizing human beings (pose estimation and tracking), and the third part tackled synthetic data. Today we dive deeper into the details of one field of deep learning that has been on the rise lately: domain adaptation. For this NeuroNugget, I’m happy to present to you my co-author Anastasia Gaydashenko, who has already left Neuromation and gone on to join Cisco… but her texts live on, and this is one of them.

    What is Domain Adaptation?

    There are a couple of specific directions in research that are trending lately (including CVPR 2018), and one of them is domain adaptation. As this field is closely related to synthetic data, it is of great interest for us here at Neuromation, but the topic is also increasingly popular and important in and by itself.

    Let’s start at the beginning. We have already discussed the most common tasks that constitute the basis for modern computer vision: image classification, object and pose detection, instance and semantic segmentation, object tracking, and so on. These problems are solved quite successfully due to deep convolutional neural architectures and large amounts of labeled data.

    But, as we discussed in the last installment, a big challenge always remains: for supervised learning, you always need to find or create labeled datasets. Almost any paper you read about some fancy state of the art model will mention some problems with the dataset, unless they use one of the few standard “vanilla” datasets that everybody usually compares on. Thus, collecting labeled data has become as important as designing the networks themselves. These datasets should be reliable and diverse enough so researchers would be able to use them to develop and evaluate novel architectures.

    We have already talked many times about how manual data collection is both expensive and time-consuming, often exceedingly so. Sometimes it is even flat out impossible to label the data manually (for example, how do you label for depth estimation, the problem of evaluating the distances from points on the image to the camera?). Of course, many standard problems already have large labeled datasets that are freely or easily available. But first, this readily labeled data can (and does) bias research towards the specific field where it is available, and second, your own problem will never be exactly the same, and standard datasets will often simply not fit your demands: they will contain different classes, will be biased in different ways, and so on.

    The main problem with using existing datasets, or even synthetic data generators that were not done specifically for your particular problem, is that when the data is generated and already labeled we are still facing the problem of domain transfer: how do we use one kind of data to prepare the networks to cope with different kinds? This problem also looms large for the entire field of synthetic data: however realistic you make your data, it still cannot be completely indistinguishable from real world photographs. The major underlying challenge here is known as domain shift: basically, the distribution of data in the target domain (say, real images) is different than in the source domain (say, synthetic images). Devising models that can cope with this shift is exactly the problem called domain adaptation.

    Let us see how people are handling this problem now, considering a few papers from CVPR 2018 in slightly more detail than we used to in previous “CVPR in Review” installments.

    Unsupervised Domain Adaptation with Similarity Learning

    This work by Pedro Pinheiro (see pdf here) comes from ElementAI, a Montreal company co-founded in 2016 by none other than Yoshua Bengio. It deals with an approach to domain adaptation based on adversarial networks, the kind we touched upon a little bit before (see also this post, the second part for which is coming really soon… it is, it is, I promise!).

    The simplest adversarial approach to unsupervised domain adaptation is a network that tries to extract features that remain the same across the domains. To achieve this, the network tries to make them indistinguishable for a separate part of the network, a discriminator (“disc” in the figure below). But at the same time, these features should be representative for the source domain so the network will be able to classify objects:

    In this way, the network has to extract features that would achieve two objectives at once: (1) be informative enough that the “class” network (usually very simple) can classify, and (2) be independent of the domain so that the “disc” network (usually as complex as the feature extractor itself, or more) cannot really distinguish. Note that we don’t have to have any labels for the target domain, only for the source domain, where it is usually much easier (again, think synthetic data for the source domain).

    In Pinheiro’s paper, this approach is improved by replacing the classifier part with a similarity-based one. The discriminative part remains the same, and the classification part now compares the embedding of an image with a set of prototypes; all these representations are learned jointly and in an end-to-end fashion:

    Basically, we are asking one network, g, to extract features from a labeled source domain and another network, f, to extract features from an unlabeled target domain, with a similar but different data distribution. The difference is that now f and g are different (we had the same f in the picture above), and the classification is now different: instead of training a classifier, we train the model to discriminate the target prototype from all other prototypes. And to label the image from the target domain, we compare the embedding of an image with embeddings of prototype images from the source domain, assigning the label of its nearest neighbors:

    The paper shows that the proposed similarity-based classification approach is more robust to the domain shift between the two datasets.
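    The labeling rule itself is just a nearest-prototype lookup in the embedding space; here is a small sketch (cosine similarity is our choice for illustration, and the embeddings are assumed to come from the trained f and g networks):

    ```python
    import numpy as np

    def label_by_prototype(target_embedding, prototype_embeddings, prototype_labels):
        """target_embedding: (d,); prototype_embeddings: (K, d); prototype_labels: (K,)."""
        t = target_embedding / np.linalg.norm(target_embedding)
        p = prototype_embeddings / np.linalg.norm(prototype_embeddings, axis=1, keepdims=True)
        similarities = p @ t                       # cosine similarity to each prototype
        return prototype_labels[np.argmax(similarities)]
    ```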

    Image to Image Translation for Domain Adaptation

    In this work by Murez et al. (full pdf), coming from UCSD and HRL Laboratories, the main idea is actually rather simple, but the implementation is novel and interesting. The work deals with a more complex task than classification, namely image segmentation (see, e.g., our previous post), which is widely used in autonomous driving, medical imaging, and many other domains. So what is this “image translation” thing they are talking about?

    Let us begin with regular translation. Imagine that we have two large text corpora in different languages, say English and French, and we don’t know which phrases correspond to which. They may even be slightly different and may lack the corresponding translations in the other language corpus. Just like the pictures from synthetic and real domains. Now, to get a machine translation model, we translate a phrase from English to French and try to distinguish the embedding of the resulting phrase from embeddings of phrases from the original French corpus. And then the way to check that we haven’t lost much is to try to translate this phrase back to English; now, even if the original corpora were completely unaligned, we know what we’re looking for: the answer is just the original sentence!

    Now let us look at image-to-image translation, which is actually pretty similar. Basically, domain adaptation techniques aim to address the domain shift problem by finding a mapping from the source data distribution to the target distribution. Alternatively, both domains X and Y could be mapped into a shared domain Z where the distributions are aligned; this is the approach used in this paper. This embedding must be domain-agnostic (independent of the domain), hence we want to maximize the similarity between the distributions of embedded source and target images.

    For example, suppose that X is the domain of driving scenes on a sunny day and Y is the domain of driving scenes on a rainy day. While “sunny” and “rainy” are characteristics of the source and target domains, they are in fact variations that mean next to nothing for the annotation task (e.g., semantic segmentation of the road), and they should not affect the annotations. Treating such characteristics as structured noise, we would like to find a latent space Z that would be invariant to such variations. In other words, domain Z should not contain domain-specific characteristics, that is, be domain-agnostic.

    In this case, we also want to restore annotations for an image from the target domain. Therefore, we also need to add a mapping from the shared embedding space to the labels. It may be image-level labels such as classes in a classification problem or pixel-level labels such as semantic segmentation:

    Basically, that’s the whole idea! Now, to obtain the annotation for an image from the target domain we just need to get its embedding in the shared space Z and restore its annotation from C. This is the basic idea of the approach, but it can be further improved by the ideas proposed in this paper.

    Specifically, there are three main tools needed to achieve successful unsupervised domain adaptation:

    • domain-agnostic feature extraction, which means that distributions of features extracted from both domains should be indistinguishable as judged by an adversarial discriminator network,
    • domain-specific reconstruction, which means that we should be able to decode embeddings back to the source and target domains, that is, we should be able to learn functions gX and gY like shown here:
    • cycle consistency to ensure that the mappings are learned correctly, that is, we should be able to get back where we started in cycles like this:

    The whole point of the framework proposed in this work is to ensure these properties with loss functions and adversarial constructions. We will not go into the gritty details of the architectures since they may change for other domains and problems.

    But let’s have a look at the results! At the end of the post, we will make a detailed comparison between three papers on domain adaptation, but now let’s just have a look at a single example. The paper used two datasets: a synthetic dataset from Grand Theft Auto 5 and a real-world Cityscapes dataset with pictures of cities. Here are two sample pictures:

    And here are the segmentation results for the real-world image (B above):

    On this picture, E is the ground truth segmentation, C is the result produced without domain adaptation, simply by training on the synthetic GTA5 dataset, and D is the result with domain adaptation. It does look better, and the numbers (intersection-over-union metric) do bear this out.
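    For reference, the intersection-over-union numbers quoted throughout this section boil down to a very simple computation over predicted and ground truth label maps:

    ```python
    import numpy as np

    def per_class_iou(prediction, ground_truth, num_classes):
        """prediction, ground_truth: integer label maps of the same shape."""
        ious = []
        for c in range(num_classes):
            pred_c, gt_c = prediction == c, ground_truth == c
            union = np.logical_or(pred_c, gt_c).sum()
            intersection = np.logical_and(pred_c, gt_c).sum()
            ious.append(intersection / union if union > 0 else float("nan"))
        return ious   # the reported mean IoU is the average over classes
    ```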

    Conditional Generative Adversarial Network for Structured Domain Adaptation

    This paper by Hong et al. (full pdf) proposes another modification of a standard discriminator-segmentator architecture. From the first look at the architecture, we may not even notice any difference:

    But actually this architecture does something very interesting: it integrates a GAN into a fully convolutional network (FCN). We have discussed FCNs in a previous NeuroNugget post; it is the network architecture used for the segmentation problem, which returns labels for each pixel in the picture by feeding the features through deconvolution layers.

    In this model, a GAN is used to mitigate the gap between source and target domains. For example, the previous paper aligns two domains via an intermediate feature space and thereby implicitly assumes the same decision function for both domains. This approach relaxes that assumption: here we learn the residual between feature maps from both domains, because the generator learns to produce features like the ones from a real image in order to fool the discriminator; afterwards, the FCN parameters are updated to accommodate the changes the GAN has made.

    Again, we will show a numerical comparison of the result below but here are some examples from the dataset:

    Remarkably, in this work the authors have also provided something very similar to what we are doing in our studies into the efficiency of synthetic data: they have measured the accuracy of the results (again measured with intersection-over-union) depending on the portion of synthetic images in the dataset:

    Learning from Synthetic Data: Addressing Domain Shift for Semantic Segmentation

    This work by Sankaranarayanan et al. (full pdf) presents another modification of the basic approach based on GANs that brings the embeddings closer in the learned feature space. This time, let us begin with the picture and then explain it:

    The base network, whose architecture is similar to a pre-trained model such as VGG-16, is split into two parts: the embedding denoted by F and the pixel-wise classifier denoted by C. The output of C is a map of labels upsampled to the same size as the input of F. The generator network G takes as input the learned embedding and reconstructs the RGB image. The discriminator network D performs two different tasks given an input: it classifies the input as real or fake in a domain-consistent manner and also performs a pixel-wise labeling task similar to the network C (this is applied only to source data since target data does not come with any labels during training).

    So the main contribution of this work is a technique that employs generative models to align the source and target distributions in the feature space. For this purpose, the authors first project intermediate feature representations obtained using a CNN to the image space by training a reconstruction part of the network and then impose the domain alignment constraint by forcing the network to learn features such that source features produce target-like images when passed to the reconstruction module and vice versa.

    Sounds complicated, doesn’t it? Well, let’s see how all of these methods actually compare.

    A Numerical Comparison of the Results

    We have chosen these three papers for an in-depth look because their results are actually comparable! All three papers used domain adaptation with GTA5 as the source (synthetic) dataset and Cityscapes as the target dataset, so we can literally just compare the numbers.

    The Cityscapes dataset contains 19 classes characteristic for city outdoor scenes such as “road”, “wall”, “person”, “car”, etc. And all three papers actually contain tables with results broken down with respect to the classes.

    Murez et al., image-to-image translation:

    Hong et al., conditional GAN:

    Sankaranarayanan et al., GAN in an FCN:

    The mean results are 31.8, 44.5, 37.1 respectively, so it appears that the image-to-image approach is the least successful and Conditional GAN is the winner. For clarity, let us also compare the top-3 most and least distinguishable classes (i.e., with best and worst results) for every approach.

    Most distinguishable, in the same order of models:

    • road (85.3), car (76.7), veg (72.0)
    • road (89.2), veg (77.9), car (77.8)
    • road (88.0), car (80.4), veg (78.7)

    This is not too interesting, obviously roads and cars are always the best. But with the worst classes the situation is different:

    • train (0.3), bike (0.6), rider (3.3)
    • train (0.0), fence (10.9), wall (13.5)
    • train (0.9), t sign (11.6), pole (16.7)

    Again, the “train” class seems to pose some kind of an insurmountable challenge (probably there’re just not so many trains in the training set, pardon the pun), but the others are all different. So let us compare all models based on the “bike”, “rider”, “fence”, “wall”, “t sign”, and “pole” classes. Now their scores will be very distinct:

    You can draw different conclusions from these results. But the main result that we personally find truly exciting is that with many different approaches that could be proposed for such a complex task, results in different papers at the same conference (so the authors could not follow one another, these results appeared independently) are perfectly comparable with each other, and researchers do not hesitate to publish these comparable numbers instead of some comfortable self-developed metrics that would prove their unquestionable supremacy. Way to go, modern machine learning!

    And finally, let us finish on a lighter note, with one more fun paper about synthetic data.

    Free supervision from video games

    In this work, Philipp Krähenbühl (full pdf) created a wrapper for the ever-popular Microsoft DirectX rendering API and added specialized code into the game as it is running. This enables the DirectX engine to produce ground truth labels for instance segmentation, semantic labeling, depth estimation, optical flow, intrinsic image decomposition, and instance tracking in real time! Which sounds super cool because now, instead of labeling data manually or creating special-purpose synthetic data engines, a researcher can just play video games all day long! All you need to do is find a suitable 3D game:

    And with that, we finish the fourth installment on CVPR 2018. Thank you for your attention — and stay tuned!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

    Anastasia Gaydashenko
    former Research Intern at Neuromation,

    currently Machine Learning Intern at Cisco

  • NeuroNuggets: CVPR 2018 in Review, Part III

    NeuroNuggets: CVPR 2018 in Review, Part III

    The CVPR 2018 (Computer Vision and Pattern Recognition) conference is long over, but we can’t stop reviewing its wonderful papers; today, Part III is upon us! In the first part, we briefly reviewed the most interesting papers on GANs for computer vision from CVPR 2018; in the second part, we added a human touch and talked about pose estimation and tracking for humans. Today, we turn to one of the main focal points of our own internal research here at Neuromation: synthetic data. As usual, the papers are in no particular order, and our reviews are very brief, so we definitely recommend reading the papers in full.

    Synthetic data: imitate to learn

    Synthetic data means data that has been generated artificially, either through 3D modeling and rendering (as is usual for computer vision) or by other means, and then used to train machine learning models. Synthetic data is a surprising topic in machine learning, and the most surprising thing is how long it was mostly neglected. Some works on synthetic data can be traced back to the 2000s, but before 2016 it attracted basically no interest at all. The only field where it had been widely used was training self-driving cars, where the need for simulated environments and the impossibility of collecting real datasets come together and make it the perfect setting for synthetic datasets.

    Now the interest is rapidly growing: we now have the SUNCG dataset of simulated indoor environments, outdoor environments for driving and navigation, the SURREAL dataset of synthetic humans to learn pose estimation and tracking, and even recent works that apply GANs to generate and refine synthetic data (we hope to get back to this and explain how it works later). So let us see what CVPR 2018 authors have to say about synthetic data. Since this is our main focus, we will consider the works on synthetic data in slightly more detail than usual.

    Generating Synthetic Data from GANs: Augmentation and Adaptation in Feature Space

    R. Volpi et al., Adversarial Feature Augmentation for Unsupervised Domain Adaptation
    S. Sankaranarayanan et al., Generate To Adapt: Aligning Domains using Generative Adversarial Networks

    There is a very interesting and promising field of using GANs to produce synthetic datasets to train other models. On the surface it makes little sense: if you have enough data to train a GAN, why not just use it to train the model? Or even better, if you have a trained GAN why don’t you just take the discriminator and use it for your problem?

    But this idea becomes much more interesting in the domain adaptation setting. Suppose you have a large source dataset and a small target dataset, and you need to use a model trained on the source dataset for the target, which might be completely unlabeled. Here adversarial domain adaptation techniques train two networks, a generator and a discriminator, and use them to ensure that the network cannot distinguish between the data distributions of the source and target datasets. This field was started in the ICML 2015 paper by Ganin and Lempitsky, where the discriminator is used to ensure that the features stay domain-invariant:
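    The gradient reversal trick from that paper is compact enough to sketch here (a standard PyTorch rendition of the idea, not the authors’ original code): the forward pass is the identity, but the gradients coming back from the domain discriminator are flipped, which pushes the extracted features towards domain invariance.

    ```python
    import torch

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lambda_):
            ctx.lambda_ = lambda_
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # flip (and scale) the gradient on the way back; no gradient for lambda_
            return -ctx.lambda_ * grad_output, None

    def grad_reverse(x, lambda_=1.0):
        return GradReverse.apply(x, lambda_)

    # Usage sketch:
    # features = feature_extractor(images)
    # domain_logits = domain_discriminator(grad_reverse(features))
    ```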

    And here is a schematic depiction of how this idea was slightly generalized in the Adversarial Discriminative Domain Adaptation paper from 2017:

    In the CVPR 2018 paper by Volpi et al., researchers from Italy and Stanford made the adversarial training work not on the original images but rather in the feature space itself. The GAN operated on features extracted by a pretrained network, which makes it possible to achieve better domain invariance and ultimately improve the quality of domain adaptation. Here is the overall training procedure as it was adapted by Volpi et al.:

    Another approach in the same vein was presented in CVPR 2018 by Sankaranarayanan et al., researchers from the University of Maryland. They use GANs to leverage unsupervised data to bring the source and target distributions closer to each other in the feature space. Basically, the idea is to use the discriminator to control that images generated from an embedding remain realistic images for the source distribution even when the embedding was taken from a sample from the target distribution. Here is how it works, and, again, the authors report improved domain adaptation results:

    How Well Should You Label? A Study of Label Quality

    A. Zlateski et al., On the Importance of Label Quality for Semantic Segmentation

    One of the main selling points of synthetic data has always been the pixel-perfect quality of labeling that you can easily achieve with synthetic data. A synthetic scene always comes with perfect segmentation — but just how important is it? The authors of this work studied how finely (or how coarsely) you have to label your training set to get good segmentation quality from modern convolutional architectures… and, of course, what better tool to perform this study than synthetic scenes.

    The authors used their specially developed Auto City dataset:

    And in their experiments, the authors showed that the final segmentation quality, unsurprisingly, is indeed strongly correlated with the amount of time spent to produce the labels… but not so much with the quality of each individual label. This suggests that it is better to produce lots of coarse labels (say, with crowdsourcing) than to perform strict quality control for every label.

    Soccer on Your Tabletop

    K. Rematas et al., Soccer on Your Tabletop

    Here at Neuromation, we love soccer (yes, the World Cup in Russia cost us a lot of work hours), and this research is just soooooooo cool. The authors present a system that can take a video stream of a soccer game and transform it… into a moving 3D reconstruction that can be projected onto your tabletop and viewed with an augmented reality device!

    The system extracts bounding boxes of the players, analyzes the human figures with pose and depth estimation models and produces a quite accurate 3D scene reconstruction. Note how training a model specifically for the soccer domain really improves the results:

    It additionally warms our hearts that they actually trained on synthetic data extracted from FIFA games! And the results are simply very cool all around:

    But wait, there is more…

    Thank you for your attention! Next time we might take an even more detailed look at some of the CVPR 2018 papers regarding synthetic data and domain adaptation. Until then!

    Sergey Nikolenko
    Chief Research Officer, 
    Neuromation

    Aleksey Artamonov
    Senior Researcher, 
    Neuromation

  • NeuroNuggets: CVPR 2018 in Review, Part II

    NeuroNuggets: CVPR 2018 in Review, Part II

    Today, we continue our series on the recent CVPR (Computer Vision and Pattern Recognition) conference, one of the world's top conferences in computer vision. Neuromation successfully participated in the DeepGlobe workshop there, and now we are taking a look at the papers from the main conference. In the first part of our CVPR review, we briefly reviewed the most interesting papers devoted to generative adversarial networks (GAN) for computer vision. This time, we delve into the works that apply computer vision to us, humans: tracking human bodies and other objects in videos, estimating poses and even full 3D body shapes, and so on. Again, the papers are in no particular order, and our reviews are very brief, so we definitely recommend reading the papers in full.

    The human touch: person identification, tracking, and pose estimation

    Humans are very good at recognizing and identifying other humans, much more so than at recognizing other objects. In particular, there is a special part of the brain, called the fusiform gyrus, which is believed to contain the neurons responsible for face recognition, and those neurons are believed to do their jobs a bit differently from the neurons that recognize other things. This is where those illusions about upside-down faces (the Thatcher effect) come from, and there is even a special cognitive disorder, prosopagnosia, where a person loses the ability to recognize human faces… but still perfectly well recognizes tables, chairs, cats or English letters. It’s not all that well understood, of course, and there are probably no specific “individual face neurons”, but faces are definitely different. And humans in general (their shapes, silhouettes, body parts) also have a very special place in our hearts and brains: “basic shapes” for our brain probably include triangles, circles, rectangles… and human silhouettes.

    Recognizing humans is a central problem for humans, and so it has been for computer vision. Back in 2014 (a very long time ago in deep learning), Facebook claimed to reach superhuman performance on face recognition, and regardless of contemporary criticism, by now we can basically assume that face recognition is indeed solved very well. However, plenty of tasks still remain; e.g., we have already posted about age and gender estimation and pose estimation for humans. At CVPR 2018, most human-related papers were either about finding poses in 3D or about tracking humans in video streams, and this is exactly what we concentrate on today. For good measure, we also review a couple of papers on object tracking that are not directly related to humans (but where humans are definitely one of the most interesting subjects).

    Detect-and-Track: Two-Step Tracking with Pose Estimation

    R. Girdhar et al., Detect-and-Track: Efficient Pose Estimation in Videos

    We have already written about segmentation with Mask R-CNN, one of the most promising approaches to segmentation that appeared in 2017. Over the last year, several extensions and modifications of the basic Mask R-CNN appeared, and this collaboration between Carnegie Mellon, Facebook, and Dartmouth presents another: the authors propose a 3D Mask R-CNN architecture that uses spatiotemporal convolutions to extract features and recognize poses directly on short clips. Then they proceed to show that a two-step algorithm with 3D Mask R-CNN as the first step (and bipartite matching to link keypoint predictions as the second) beats state of the art methods in pose estimation and human tracking. Here is the 3D Mask R-CNN architecture which is sure to find more applications in the future:
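    The linking step of such two-stage trackers is often plain bipartite matching; here is a rough sketch (ours, with a simple keypoint-distance cost, not the exact cost used in the paper) using the Hungarian algorithm from SciPy:

    ```python
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def link_frames(prev_poses, next_poses):
        """Match pose predictions between two consecutive frames.

        Each pose is an array of shape (num_keypoints, 2); the matching cost here
        is simply the mean Euclidean distance between corresponding keypoints.
        """
        cost = np.zeros((len(prev_poses), len(next_poses)))
        for i, p in enumerate(prev_poses):
            for j, q in enumerate(next_poses):
                cost[i, j] = np.linalg.norm(p - q, axis=1).mean()
        rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm (bipartite matching)
        return list(zip(rows, cols))              # pairs (index in prev frame, index in next frame)
    ```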

    Pose-Sensitive Embeddings for Person Re-Identification

    M. Saquib Sarfraz et al., A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking

    Person re-identification is a challenging problem in computer vision: as the examples above show, changes in the camera view and pose can make the two pictures not alike at all (although we humans still immediately identify that this is the same person). This problem is usually solved with retrieval-based methods that derive proximity measures between the query image and stored images from some embedding space. This work by German researchers proposes a novel way to incorporate information about the pose directly into the embedding, improving re-identification results. Here is a brief overview picture, but we suggest reading the paper in full to understand how exactly the pose is added to the embedding:
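    As a reminder of how the retrieval step works in general (this is the generic ranking procedure, not the paper's pose-sensitive model), one embeds the query and all gallery images and ranks the gallery by similarity to the query:

    ```python
    import numpy as np

    def rank_gallery(query_emb, gallery_embs):
        """Rank gallery images by cosine similarity to the query embedding."""
        q = query_emb / np.linalg.norm(query_emb)
        g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
        similarities = g @ q             # cosine similarity to every gallery image
        return np.argsort(-similarities) # indices of the most similar images first
    ```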

    3D Poses from a Single Image: Constructing a 3D Mesh from 2D Pose and 2D Silhouette

    G. Pavlakos et al., Learning to Estimate 3D Human Pose and Shape from a Single Color Image

    Pose estimation is a well-understood problem; we have written about it before and already mentioned it in this post. Making a full 3D shape of a human body is quite another matter, however. This work presents a very promising and quite surprising result: the authors generate the 3D mesh of a human body directly through an end-to-end convolutional architecture that combines pose estimation, segmentation of human silhouettes, and mesh generation (see picture above). The key insight here is based on using SMPL, a statistical body shape model that provides a good prior for the human body shape. As a result, this approach manages to construct a 3D mesh of a human body from a single color image! Here are some illustrative results, including some very challenging cases from the standard UP-3D dataset:

    FlowTrack: Looking at Video with Attention for Correlation Tracking

    Z. Zhu et al., End-to-end Flow Correlation Tracking with Spatial-temporal Attention

    Discriminative correlation filters (DCF) are a state of the art learning technique for object tracking. The idea is to learn a filter — that is, a transformation of an image window, usually simply a convolution — which would correspond to the object you want to track and then apply it to all frames in the video. As often happens with neural networks, DCFs are far from a new idea, dating back to a seminal 1980 paper, but they had been nearly forgotten until 2010; the MOSSE tracker started a revival, and now DCFs are all the rage. However, classical DCFs do not make use of the actual video stream and process each frame separately. In this work, Chinese researchers present an architecture that incorporates a spatial-temporal attention mechanism able to attend across different time frames; they report much improved results. Here is the general flow of their model:
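    To illustrate the classical DCF idea itself (not the paper's flow and attention mechanisms), here is a tiny MOSSE-style correlation filter in NumPy: the filter is computed in closed form in the Fourier domain from a target patch and a desired Gaussian response, and tracking reduces to finding the peak of the correlation response in the next frame's search window.

    ```python
    import numpy as np

    def train_filter(patch, sigma=2.0, lam=1e-3):
        """Learn a correlation filter for a grayscale target patch (closed form, Fourier domain)."""
        h, w = patch.shape
        ys, xs = np.mgrid[0:h, 0:w]
        # desired response: a Gaussian peak centered on the target
        g = np.exp(-((xs - w // 2) ** 2 + (ys - h // 2) ** 2) / (2 * sigma ** 2))
        F, G = np.fft.fft2(patch), np.fft.fft2(g)
        return (G * np.conj(F)) / (F * np.conj(F) + lam)   # MOSSE-style solution

    def track(filter_fft, search_patch):
        """Correlate the filter with a search window and return the peak position."""
        response = np.real(np.fft.ifft2(np.fft.fft2(search_patch) * filter_fft))
        return np.unravel_index(np.argmax(response), response.shape)
    ```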

    Back to the Classics: Correlation Tracking

    C. Sun et al., Correlation Tracking via Joint Discrimination and Reliability Learning

    This paper, just like the previous one, is devoted to tracking objects in videos (a very hot topic right now), and just like the previous one, it uses correlation filters for tracking. But, in stark contrast to the previous one, this paper does not use deep neural networks at all! The basic idea here is to explicitly include in the model reliability information, i.e., add a term to the objective function that models how reliable the learned filter is. The authors report significantly improved tracking and also show learned reliability maps that often look very plausible:

    That’s all folks!

    Thank you for your attention! Join us next time — there are many more interesting papers from CVPR 2018… and, just as a sneak peek, the ICLR 2019 deadline has passed, its submitted papers are online, and although we won’t know which are accepted for a few more months we are already looking at them!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

    Aleksey Artamonov
    Senior Researcher, Neuromation

  • Neuromation Team at the Basel Life

    Neuromation Team at the Basel Life

    Neuromation’s Chief Research Officer Sergey Nikolenko and Head of New Initiatives Maxim Prasolov recently took part in BASEL LIFE, Europe’s leading life sciences conference in Basel, Switzerland.

    Organised by the European Association for Life Sciences and the European Molecular Biology Organisation (EMBO), BASEL LIFE brought together researchers and scientists showcasing the latest advances and research on topics such as: aging, drug discovery, antibiotics research, biotherapeutics, genomics, microfluidics, peptide therapeutics, and artificial intelligence.

    In a panel discussion on “AI in healthcare”, Sergey Nikolenko participated alongside Chris Schilling of Juvenescence, Pascal Bouquet of Novartis, Verner De Biasi of GSK, and Neuromation partners Alex Zhavoronkov and Poly Mamoshina of InSilico Medicine.

    Neuromation’s Sergey Nikolenko also delivered a presentation titled Deep Learning and Synthetic Data for Healthcare as part of the Innovation Forum on Artificial Intelligence and Blockchain in Healthcare.

    In his presentation, Dr. Nikolenko discussed many of the major obstacles to AI adoption in healthcare, including the lack of sufficiently large datasets, the lack of labeled training data, the difficulty of explaining results, and the risk of systematic bias. He then demonstrated some key advantages of Neuromation’s synthetic data approach in this field. For example, in computer vision applications Neuromation can create perfectly labeled data with pixel-perfect annotations, which is very hard or impossible to do by hand. Furthermore, Neuromation’s approach increases the speed of automation by orders of magnitude and is several times cheaper than hand labeling.

    As an example of Neuromation’s work in the area of drug discovery, Dr. Nikolenko presented a project in which our researchers are trying to introduce novel augmentation or data transfer techniques using GANs and other generative models. Neuromation is currently collaborating with Insilico Medicine to generate fingerprints of molecules likely to have desired properties using conditional adversarial autoencoders.

    Also discussed were papers published in leading scientific journals by Neuromation data scientists on topics such as breast cancer histology image analysis, pediatric bone age assessment, and diagnosis of diabetic retinopathy. Specific mention was also made of Neuromation’s collaboration with EMBL (European Molecular Biology Lab) on processing spatially structured multidimensional data originating from imaging mass-spectrometry in order to study the cell cycle via metabolomics.

    While many of the above examples served to highlight the case for synthetic data in healthcare, other problems facing the field were also discussed. One of these was the lack of trust that healthcare data providers have in public clouds given the sensitivity of healthcare data. Nonetheless, we believe these providers still need the vast computational power and ease of use provided by public clouds to train their models; one possible solution would be to bring this to the data providers by developing private cloud solutions.

    Another problem is the shortage of AI talent. There are only approximately 22,000 deep learning experts in the world right now, most concentrated in only a few geographies.

    These two problems are currently being addressed by the Neuromation Platform, which is now in development. The toolsets provided by the Neuromation Platform will enable a far larger cohort of software developers to undertake meaningful AI research and development, while the platform’s cloud-agnostic distributed compute service for training of AI models will allow for access to public cloud compute power while maintaining data security and independence. Both of these are of crucial importance to the healthcare industry and could contribute meaningfully to the progress of AI development.

    Neuromation looks forward to connecting with the many world-class scientists and researchers we met at the BASEL LIFE conference in the future. We would also like to personally thank Alex Zhavoronkov of Insilico Medicine for his efforts in organizing Neuromation’s attendance and his invaluable assistance throughout.

  • NeuroNuggets: CVPR 2018 in Review, Part I

    NeuroNuggets: CVPR 2018 in Review, Part I

    Here at Neuromation, we are always on the lookout for new interesting ideas that could help our research. And what better place to look for them than top conferences! We have already written about our success at the DeepGlobe workshop for the CVPR (Computer Vision and Pattern Recognition) conference. This time we will take a closer look at some of the most interesting papers from CVPR itself. These days, top conferences are very large affairs, so prepare for a multi-part post. The papers are in no particular order and were chosen not only for standing out among the crowd but also for their relevance to our own work at Neuromation. This time, Aleksey Artamonov (whom you have met before) prepared the list, and I tried to supply some text around it. In this series, we will be very brief, trying to extract at most one interesting point from each paper, so in this format we cannot really do these works justice, and we wholeheartedly recommend reading the papers in full.

    GANs and Computer Vision

    In the first part, we concentrate on generative models, that is, machine learning models that can not only tell cats and dogs apart in a photo but can also produce new images of cats and dogs. For computer vision, the most successful class of generative models is generative adversarial networks (GANs), where a separate discriminator network learns to distinguish between generated objects and real objects, and the generator learns to fool the discriminator. We have already written about GANs several times (e.g., here and here), so let’s jump right into it!
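    If you prefer code to words, here is a bare-bones sketch of the adversarial game (toy fully connected networks on vector data, with all sizes and hyperparameters made up for illustration):

    ```python
    import torch
    from torch import nn

    latent_dim, data_dim = 16, 64
    G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
    D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    def gan_step(real_batch):
        batch = real_batch.size(0)
        ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

        # 1) discriminator: real samples are labeled 1, generated ones 0
        fake = G(torch.randn(batch, latent_dim)).detach()
        d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # 2) generator: tries to make the discriminator output 1 on fakes
        fake = G(torch.randn(batch, latent_dim))
        g_loss = bce(D(fake), ones)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()
    ```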

    Finding Tiny Faces in the Wild

    Y. Bai et al., Finding Tiny Faces in the Wild with Generative Adversarial Network

    In this collaboration between Saudi and Chinese researchers, the authors use a GAN to detect and upscale very small faces in photographs of large crowds. Even just detecting small faces is an interesting problem that regular face detectors (like those that appear, e.g., in our previous post) usually fail to solve. And here the authors propose an end-to-end pipeline that extracts faces and then applies a generative model to upscale them by up to 4x (a process known as superresolution). Here is the pipeline overview from the paper:

    PairedCycleGAN for Makeup

    H. Chang et al., PairedCycleGAN: Asymmetric Style Transfer for Applying and Removing Makeup

    Conditional GANs are already widely used for image manipulation; we have mentioned superresolution, but GANs also succeed at style transfer. With GANs, one can learn salient features that correspond to specific image elements — and then change them! In this work, researchers from Princeton, Berkeley and Adobe present a framework for makeup modification on photos. One interesting part of this work is that the authors train separate generators for different facial components (eyes, lips, skin) and apply them separately, extracting facial components with a different network:

    GANerated Hands

    F. Mueller et al., GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB

    We have already written about pose estimation in the past. One very important subset of pose estimation, which usually requires separate models, is hand tracking. The sci-fi staple of manipulating computers by waving your hands is yet to be fully realized and still requires specialized hardware such as Kinect. As usual, one of the main problems is data: where can you find real video streams of hands labeled in 3D?.. In this work, the authors present a conditional GAN architecture that is able to convert synthetic 3D models of hands to photorealistic images that are then used to train the hand tracking network. This work is very close to our heart as synthetic data is the main emphasis of our studies at Neuromation, so we will likely consider it in more detail later. Meanwhile, here is the “synthetic-to-real” GAN architecture:

    Person Transfer GAN

    L. Wei et al., Person Transfer GAN to Bridge Domain Gap for Person Re-Identification

    Person re-identification (ReID) is the problem of finding the same person on different photos taken in varying conditions and under varying circumstances. This problem has, naturally, been the subject of many studies, and it is relatively well understood by now, but the domain gap problem still remains: different datasets with images of people still have very different conditions (lighting, background etc.), and networks trained on one dataset lose a lot in the transfer to another dataset (and also to, say, a real world application). The picture above shows what different datasets look like. To solve this problem, this work proposes a GAN architecture able to transfer images from one “dataset style” to another, again using GANs to augment real data with complex transformations. It works like this:

    Eye Image Synthesis with Generative Models

    K. Wang et al., A Hierarchical Generative Model for Eye Image Synthesis and Eye Gaze Estimation

    This work from the Rensselaer Polytechnic Institute attacks a very specific problem: generating images of human eyes. This is important not only to make beautiful eyes in generated images but also, again, to use generated eyes to work backwards and solve the gaze estimation problem: what is a person looking at? This would pave the way to truly sci-fi interfaces… but that’s still in the future, and at present even synthetic eye generation is a very hard problem. The authors present a complex probabilistic model of eye shape synthesis and propose a GAN architecture to generate eyes according to this model — with great success!

    Image Inpainting: Fill in the Blanks

    J. Yu et al., Generative Image Inpainting with Contextual Attention

    This work from Adobe Research and University of Illinois at Urbana-Champaign is devoted to the very challenging problem of filling in the blanks on an image (see examples above). Usually, inpainting requires understanding of the underlying scene: in the top right on the picture above, you have to know what a face looks like and what kind of face is likely given the hair and neck that we see. In this work, the authors propose a GAN-based approach that can explicitly make use of the features from the surrounding image to improve generation. The architecture consists of two parts, first generating a coarse result and then refining it with another network. And the results are, again, very good:

    Well, that’s it for today. This is only part one, and we will certainly continue the CVPR 2018 review in our next installments. See you around!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

    Aleksey Artamonov
    Senior Researcher, Neuromation

  • Neuromation Research: Pediatric Bone Age Assessment with Convolutional Neural Networks

    Neuromation Research: Pediatric Bone Age Assessment with Convolutional Neural Networks

    Over time, the NeuroNuggets and Neuromation Research series will serve to introduce all the AI researchers we have gathered in our wonderful research team. Today, we are presenting our very own Kaggle master, Alexander Rakhlin! Alexander is a deep learning guru specializing in problems related to medical imaging, which usually means segmentation, object detection, and, generally speaking, convolutional neural networks, although medical images are often in 3D and are not necessarily RGB images, as we have seen when we discussed imaging mass-spectrometry.

    You may have already met Alexander Rakhlin here in our research blog: he has authored a recent post with a general survey of AI applications for healthcare. But today we have great news: Alexander’s paper, Pediatric Bone Age Assessment Using Deep Convolutional Neural Networks (a joint work with Vladimir Iglovikov, Alexander Kalinin, and Alexey Shvets), has been accepted for publication at the 4th Workshop on Deep Learning in Medical Image Analysis (DLMIA 2018)! This is already not the first paper on medical imaging under the Neuromation banner, and this is a great occasion to dive into some details of this work. Similar to our previous post on medical concept normalization, this will be a serious and rather involved affair, so get some coffee and join us!

    You Are as Old as Your Bones: Bone Age Assessment

    Skeletal age, or bone age, is basically how old your bones look. As a child develops, the bones in his or her skeleton grow and mature; this means that by looking at a child’s bones, you can estimate the average age at which a child should have this kind of skeleton, and hence how old the child is. At this point you’re probably wondering whether this will be a post about archaeology: it’s not often that a living child can get an X-ray while nobody knows when he or she was born.

    Well, yes and no. If the child is developing normally, bone age should indeed be roughly within 10% of the chronological age. But there can be exceptions. Some exceptions are harmless but still good to know about: e.g., your kid’s growth spurt in adolescence is related to bone age. So if bone age is a couple of years ahead of the chronological age, the kid will stop growing earlier, and if the bones are a couple of years “younger”, you can expect a delayed growth spurt. Moreover, given the current height and bone age you can predict the final adult height of a child rather accurately, which can also come in handy: if your kid loves basketball, you might be interested in whether he’ll grow to be a 7-footer.

    Other exceptions are more serious: a significant mismatch between bone age and chronological age can signal all kinds of problems, including growth disorders and endocrine problems. A single reading of skeletal age informs the clinician of the relative maturity of a patient at a particular time, and, integrated with other clinical findings, separates the normal from the relatively advanced or retarded. Successive skeletal age readings indicate the direction of the child’s development and/or show his or her progress under treatment. By assessing skeletal age, a pediatrician can diagnose endocrine and metabolic disorders in child development such as bone dysplasia, or growth deficiency related to nutritional, metabolic, or unknown factors that impair epiphyseal or osseous maturation. In this form of growth retardation, skeletal age and height may be delayed to nearly the same degree, but, with treatment, the potential exists for reaching normal adult height.

    Due to all of the above, it is very common for pediatricians to order an X-ray of a child’s hand to estimate his or her bone age… so naturally it’s a great problem to try to automate.

    Palm Reading: Assessing Bone Age from the Hand and Wrist

    Skeletal maturity is mainly assessed by the degree of development and ossification of secondary ossification centers in the epiphysis. For decades, bone maturity has usually been determined by visual evaluation of the skeletal development of the hand and wrist. Here is what a radiologist looks for when she examines an X-ray of your hand:

    The two most common techniques for estimating skeletal age today are the Greulich and Pyle (GP) and Tanner-Whitehouse (TW2) methods. Both use radiographs of the left hand and wrist to assess skeletal maturity based on recognizing maturity indicators, i.e., changes in the radiographic appearance of the epiphyses of tubular bones from the earliest stages of ossification until they fuse with the diaphysis, or changes in flat bones until they reach adult shape… don’t worry, we hadn’t heard these words before either. Let’s show them on a picture:

    Conventional techniques for assessing skeletal maturity, such as GP or TW2, are tedious, time consuming, and to a certain extent subjective, and even senior radiologists don’t always agree on the results. Therefore, it is very tempting to use computer-aided diagnostic systems to improve the accuracy of bone age assessment and to increase the reproducibility and efficiency of clinicians’ work.

    Recently, approaches based on deep learning have demonstrated performance improvements over conventional machine learning methods for many problems in biomedicine. In the domain of medical imaging, convolutional neural networks (CNN) have been successfully used for diabetic retinopathy screening, breast cancer histology image analysis, bone disease prediction, and many other problems; see our previous post for a survey of these and other applications.

    So naturally we tried to apply modern deep neural architectures to bone age assessment as well. Below we describe a fully automated deep learning approach to the problem of bone age assessment using the data from the Pediatric Bone Age Challenge organized by the Radiological Society of North America (RSNA). While achieving as high accuracy as possible is a primary goal, our system was also designed to stay robust against insufficient quality and diversity of radiographs produced on different hardware by various medical centers.

    Data

    The dataset was made available by the Radiological Society of North America (RSNA), who organized the Pediatric Bone Age Challenge 2017. The radiographs have been obtained from Stanford Children’s Hospital and Colorado Children’s Hospital; they have been taken on different hardware at different times and under different conditions. These images had been interpreted by professional pediatric radiologists who documented skeletal age in the radiology report based on a visual comparison to Greulich and Pyle’s Radiographic Atlas of Skeletal Development of the Hand and Wrist. Bone age designations were extracted by the organizing committee automatically from radiology reports and were used as the ground truth for training the model.

    Radiographs vary in scale, orientation, exposure, and often feature specific markings. The entire RSNA dataset contained 12,611 training, 1,425 validation, and 200 test images. Since the test dataset is obviously too small, and its labels were unknown at development time, we tested the model on 1000 radiographs from the training set which we withheld from training.

    The training data contained 5,778 female and 6,833 male radiographs. The age varied from 1 to 228 months, the subjects were mostly children from 5 to 15 years old:

    Preprocessing I: Segmentation and Contrast

    One of the key contributions of our work is a rigorous preprocessing pipeline. To prevent the model from learning false associations with image artifacts, we first remove the background by segmenting the hand.

    For image segmentation we use the U-Net deep architecture. Since its development in 2015, U-Net has become a staple of segmentation tasks. It consists of a contracting path to capture context and a symmetric expanding path that enables precise localization; since this is not the main topic of this post, we will just show the architecture and refer to the original paper for details:
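    For the curious, here is a heavily trimmed two-level U-Net-style sketch in tf.keras (our illustration of the contracting path, expanding path, and skip connections, not the exact architecture we used):

    ```python
    import tensorflow as tf
    from tensorflow.keras import layers

    def conv_block(x, filters):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    def tiny_unet(input_shape=(256, 256, 1)):
        inp = layers.Input(input_shape)
        c1 = conv_block(inp, 32)                         # contracting path
        c2 = conv_block(layers.MaxPooling2D()(c1), 64)
        b = conv_block(layers.MaxPooling2D()(c2), 128)   # bottleneck
        u2 = layers.Concatenate()([layers.UpSampling2D()(b), c2])  # expanding path
        c3 = conv_block(u2, 64)                                    # with skip connections
        u1 = layers.Concatenate()([layers.UpSampling2D()(c3), c1])
        c4 = conv_block(u1, 32)
        out = layers.Conv2D(1, 1, activation="sigmoid")(c4)  # per-pixel hand/background mask
        return tf.keras.Model(inp, out)
    ```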

    We also used batch normalization to improve convergence during training. In our algorithms, we use the generalized loss function

    $$L = H - \log J,$$

    where $H$ is the standard binary cross-entropy loss function

    $$H = -\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right),$$

    $y_i$ is the true value of the $i$-th pixel, $\hat{y}_i$ is the predicted probability for that pixel, and $J$ is a differentiable generalization of the Jaccard index:

    $$J = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i \hat{y}_i}{y_i + \hat{y}_i - y_i \hat{y}_i}.$$
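    In code, this loss is just a few lines; here is a minimal tf.keras sketch (ours, not the code from the paper):

    ```python
    from tensorflow.keras import backend as K

    def bce_minus_log_jaccard(y_true, y_pred, eps=1e-7):
        """L = H - log J for binary segmentation masks."""
        h = K.mean(K.binary_crossentropy(y_true, y_pred))
        # differentiable per-pixel generalization of the Jaccard index, averaged over pixels
        j = K.mean((y_true * y_pred) / (y_true + y_pred - y_true * y_pred + eps))
        return h - K.log(j + eps)
    ```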

    We finalize the segmentation step by removing small extraneous connected components and equalizing the contrast. Here is how our preprocessing pipeline works:

    As you can see, the quality and contrast of the radiograph does indeed improve significantly. One could stop here and train a standard convolutional neural network for classification/regression, augmenting the training set with our preprocessing and standard techniques such as scaling and rotations. We gave this approach a try, and the result, although not as accurate as our final model, was quite satisfactory.

    However, the original GP and TW methods focus on specific hand bones, including phalanges, metacarpal and carpal bones. We decided to use this information and train separate models on several specific regions in high resolution in order to numerically evaluate and compare their performance. To correctly locate these regions, we have to transform all images to the same size and position, i.e., to bring them all to the same coordinate space, a process known as image registration.

    Preprocessing II: Image Registration with Key Points

    Our plan for image registration is simple: we need to detect the coordinates of several characteristic points in the hand. Then we will be able to compute affine transformation parameters (zoom, rotation, translation, and mirroring) to fit the image into the standard position.

    To create a training set for the key points model, we manually labelled 800 radiographs using VGG Image Annotator (VIA). We chose three characteristic points: the tip of the distal phalanx of the third finger, tip of the distal phalanx of the thumb, and center of the capitate. Pixel coordinates of key points serve as training targets for our regression model.

    The key points model is, again, implemented as a deep convolutional neural network, inspired by the popular VGG family of models but with regression output. The VGG module consists of two convolutional layers with Exponential Linear Unit (ELU) activation, batch normalization, and max pooling. Here is the architecture:

    The model is trained with Mean Squared Error loss (MSE) and Adam optimizer:

    To improve generalization, we applied standard augmentations to the input, including rotation, translation, and zoom. The model outputs 6 coordinates, 2 for each of the 3 key points.
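    Here is a rough tf.keras sketch of such a key point regressor (the layer sizes and input resolution here are illustrative, not the exact ones we used):

    ```python
    import tensorflow as tf
    from tensorflow.keras import layers

    def vgg_block(x, filters):
        # two convolutions with ELU activations, batch norm, and max pooling
        x = layers.Conv2D(filters, 3, padding="same", activation="elu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="elu")(x)
        x = layers.BatchNormalization()(x)
        return layers.MaxPooling2D()(x)

    inp = layers.Input((128, 128, 1))
    x = inp
    for filters in (32, 64, 128):
        x = vgg_block(x, filters)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="elu")(x)
    out = layers.Dense(6)(x)            # (x, y) coordinates for each of the 3 key points
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")  # MSE loss, Adam optimizer
    ```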

    Having found the key points, we calculate affine transformations (zoom, rotation, translation) for all radiographs. Our goal is to keep the aspect ratio of an image but fit it into a uniform position such that for every radiograph:

    1. the tip of the middle finger is aligned horizontally and positioned approximately 100 pixels below the top edge of the image;
    2. the capitate is aligned horizontally and positioned approximately 480 pixels above the bottom edge of the image.
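    Given the predicted key points, the registration step itself can be done with an off-the-shelf similarity transform; here is a rough scikit-image sketch (the canonical destination coordinates below are made up for illustration and only roughly follow the constraints above):

    ```python
    import numpy as np
    from skimage import transform

    def register_hand(image, key_points):
        """Warp a radiograph so that its key points land on fixed canonical positions.

        key_points: array of shape (3, 2) with (x, y) for the middle finger tip,
        the thumb tip, and the capitate center, in that order.
        """
        h, w = image.shape[:2]
        # canonical positions (illustrative): finger tip 100 px below the top edge,
        # capitate 480 px above the bottom edge, thumb off to the side
        dst = np.array([[w / 2, 100], [w * 0.75, h - 700], [w / 2, h - 480]])
        tform = transform.estimate_transform("similarity", src=key_points, dst=dst)
        return transform.warp(image, inverse_map=tform.inverse, output_shape=(h, w))
    ```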

    By convention, bone age assessment uses radiographs of the left hand, but sometimes images in the dataset get mirrored. To detect these images and adjust them appropriately, we used the key point for the thumb.

    Let’s see a sample of how our image registration model works. As you can see, the hand has been successfully rotated into our preferred standard position:

    And here are some more examples of the entire preprocessing pipeline. Results of segmentation, normalization and registration are shown in the fourth row:

    Bone age assessment models

    Following Gilsanz and Ratib’s Hand Bone Age: a Digital Atlas of Skeletal Maturity, we have selected three specific regions from registered radiographs and trained an individual model for each region:

    1. whole hand;
    2. carpal bones;
    3. metacarpals and proximal phalanges.

    Here are the regions and some sample corresponding segments of real radiographs:

    Convolutional neural networks are typically used for classification tasks, but bone age assessment is a regression problem by nature: we have to predict age, a continuous variable. Therefore, we wanted to compare two settings of the CNN architecture, regression and classification, so we implemented both. The models share similar parameters and training protocols, and only differ in the two final layers.

    Our first model is a custom VGG-style architecture with regression output. The network consists of a stack of six convolutional blocks with 32, 64, 128, 128, 256, 384 filters followed by two fully connected layers of 2048 neurons each and a single output (we will show the picture below). The input size varies depending on the considered region of an image. For better generalization, we apply dropout layers before the fully connected layers. We rescale the regression target, i.e., bone age, to the range [−1, 1]. To avoid overfitting, we use train-time augmentation with zoom, rotation, and shift. The network is trained with the Adam optimizer by minimizing the Mean Absolute Error (MAE):

    The second model, for classification, is very similar to the regression one except for the two final layers. One major difference is that a distinct class is assigned to each bone age. In the dataset, bone age is expressed in months, so we considered all 240 classes, and the penultimate layer becomes a softmax layer with 240 outputs. This layer outputs a vector of probabilities, where the probability of each class is a real value in the range [0, 1]. In the final layer, the probability vector is multiplied by a vector of the distinct bone ages [1, …, 239, 240]. Thus, the model outputs a single expected value of the bone age. We train this model using the same protocol as the regression model.
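    Here is a minimal tf.keras sketch of the two final-layer variants (our illustration; the shared convolutional trunk and exact sizes are omitted): the regression head outputs a single rescaled age, while the classification head outputs 240 class probabilities that are turned into an expected age.

    ```python
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    features = layers.Input((2048,))   # output of the shared convolutional trunk

    # Regression head: a single output; the target bone age is rescaled to [-1, 1]
    reg_out = layers.Dense(1)(features)
    reg_model = tf.keras.Model(features, reg_out)
    reg_model.compile(optimizer="adam", loss="mae")

    # Classification head: 240 classes (months), expected value over the softmax
    ages = tf.constant(np.arange(1, 241, dtype=np.float32))
    probs = layers.Dense(240, activation="softmax")(features)
    expected_age = layers.Lambda(
        lambda p: tf.reduce_sum(p * ages, axis=-1, keepdims=True))(probs)
    cls_model = tf.keras.Model(features, expected_age)
    cls_model.compile(optimizer="adam", loss="mae")
    ```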

    Here is the model architecture for classification; the regression model is the same except for the lack of softmax and binning layers:

    Results

    We evaluated the models on a validation set of 1000 radiographs withheld from training. Following GP and TW methods that account for sex, for each spatial zone we trained gender-specific models separately for females and males, and compared them to a gender-agnostic model trained on the entire population. Here is a summary of our results which we will then discuss:

    It turns out that adding gender to the input significantly improves accuracy, by 1.4 months on average. The leftmost column represents the performance of a regression model for both genders. The region of metacarpals and proximal phalanges (region C) has Mean Absolute Error (MAE) 8.42 months, while MAE of the whole hand (region A) is 8.08 months. A linear ensemble of the three zones improves overall accuracy to 7.52 months (bottom row in the table).

    Gender-specific regression models (second and third columns) improved MAE to 6.30 months for males and to 6.49 months for females. Note that for the female cohort, the region of metacarpals and proximal phalanges (region C) has an MAE of 6.79 months, even more accurate than the whole hand, which gets an MAE of 7.12 months!

    Gender-specific classification models (fourth and fifth columns) perform slightly better than regression models and demonstrate MAEs of 6.16 and 6.39 months, respectively (bottom row).

    Finally, in the sixth column we show an ensemble of all gender-specific models (classification and regression). On the validation dataset it achieved state of the art accuracy of 6.10 months, which is a great result both in terms of the bone age assessment challenge and from the point of view of real applications.

    Conclusion

    Let’s wrap up: in this post, we have shown how to develop an automated bone age assessment system that can assess skeletal maturity with remarkable accuracy, similar to or better than an expert radiologist. We have numerically evaluated different zones of a hand and found that bone age assessment could be done just for metacarpals and proximal phalanges without significant loss of accuracy. To overcome the widely ranging quality and diversity of the radiographs, we introduced rigorous cleaning and standardization procedures that significantly increased robustness and accuracy of the model.

    Our model has a great potential for deployment in clinical settings to help clinicians in making bone age assessment decisions accurately and in real time, even in hard-to-reach areas. This would ensure timely diagnosis and treatment of growth disorders in their little patients. And this is, again, just one example of what the Neuromation team is capable of. Join us later for more installments of Neuromation Research!

    Alexander Rakhlin
    Researcher, Neuromation

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • Neuromation team at ICML 2018

    Neuromation team at ICML 2018

    Neuromation researchers are attending ICML 2018, one of the two largest and most important conferences in machine learning (the other one is NIPS, and we hope to get there as well). Here is a part of our team together with our long-term friends and collaborators from Insilico Medicine next to their booth:

    Left to right: Kyryl Truskovsky (Lead Researcher, Neuromation), Rauf Kurbanov (Lead Researcher, Neuromation), Alexander Aliper (President of EMEA, Insilico Medicine), Alex Zhavoronkov (CEO, Insilico Medicine), Denys Popov (CIO, Neuromation), Ira Opanasiuk (HR Director, Neuromation).

    Neuromation and Insilico Medicine are collaborating in many areas of high-performance computing and deep learning; see, e.g., this previous post on one topic of our collaboration. In the area of blockchain technology, Neuromation has partnered with Longenesis, itself a partnership between Insilico Medicine and the BitFury Group.

    Both our teams share a passion for using the latest advances in artificial intelligence, high-performance computing, and blockchain for healthcare. We are happy to be a part of the vibrant ecosystem of companies in this space, which resembles the early days of the Internet. And we are building the Internet of Health.

    We are looking forward to further collaboration with Insilico and to many other collaborations that ICML can bring. Deep learning galore!

  • Neuromation Research: Medical Concept Normalization in Social Media Posts

    Neuromation Research: Medical Concept Normalization in Social Media Posts

    Although computer vision is our main focus, here at Neuromation we are pursuing all aspects of deep learning. Today, it is my great pleasure to introduce Elena Tutubalina, Ph.D., our researcher from Kazan who specializes in natural language processing. She has joined Neuromation part-time to work on a very interesting project related to sentiment analysis and named entity recognition… but this is a story for another day.

    Today, together with Elena we are presenting our recent paper, Medical Concept Normalization in Social Media Posts with Recurrent Neural Networks. This paper has been published in a top journal, the Journal of Biomedical Informatics; Elena and I co-authored it with Zulfat Miftakhutdinov and Valentin Malykh. This is already the second post devoted to Neuromation research papers; the first one was a recent NeuroNugget devoted to our DeepGlobe participation, and many more are, of course, to come.

    Our paper was presented at a NAACL workshop before the journal version appeared; here is Elena’s photo from the NAACL 2018 social event:

    The Adverse Effects of Social Networks

    Nowadays it is hard to find a person who does not have an account on at least one social network, and usually more than one. And it’s virtually impossible to find a person who has never heard of one. This unprecedented popularity of social networks, and the huge amount of stuff people put on their pages, means that there is an enormous quantity of data available in social networks on almost any topic. This data, of course, is nothing like a top quality research report, but there are tons of opinions of real people on all kinds of subjects, and it would be strange to forgo this wisdom of the crowds.

    To explain what exactly we will be looking for, let us take a break from social media (exactly as the doctors order, by the way) and look a little bit back on history. One of the most important topics in human history has always been human health. It was important in ancient Egypt, Greece, or China, in Napoleon’s France or modern Britain. Medicine invariably comes together with civilization, and with medicine come the drugs, starting from a shaman’s herbs and all the way to contemporary medicaments.

    Unfortunately, with drugs come side effects. Cocaine, for example, was famously introduced as a cough suppressant, and back in the good old days cocaine was given to kids (no kidding), and Coca-Cola stopped using fresh coca leaves with significant amounts of cocaine only in 1903. Modern medications can also have side effects (including sleep eating, gambling urges, or males growing boobs), but these days we at least try to test for side effects and warn about them.

    To reveal the side effects, drug companies conduct long and costly clinical trials. It takes literally years for a drug to be accepted as safe, and while in principle it’s a good thing to test thoroughly, in reality it means that many people die from potentially curable diseases while the drugs are still under testing. But even this often overly lengthy process does not catch all possible side effects, or, as they are usually called in the scientific literature, adverse drug reactions (ADR): people are too diverse to assemble a representative group covering all possible patient conditions and drug interactions. And this is where social media can help.

    Once the drug is released, and people are actually using it, they (unfortunately) can experience side effects, including unpredictable ones like a weird combination of three different drugs that no one could have tested for. But once it happens, people are likely to rant about it on social media, and we can collect that data and use it. By the way, it would be an oversimplification to think that side effects can only be negative. Somewhat surprisingly, it is not that rare for a drug initially targeted at one disease to turn out to be a cure for a completely unrelated condition; kind of like cocaine proved to be so much more than a cough syrup. So social media data is actually a treasure trove of information ready to be scraped.

    And this is exactly what our paper is about: looking for adverse drug effects in social media. Let’s dive into the details…

    The Data and the Problems

    To be more precise, the specific dataset that we have used in the paper comes from Twitter. In natural language processing, it is really common to scrape Twitter since it is open, popular, and the texts are so short that we can assume that each tweet stays on a single topic. All of these characteristics are important, by the way: the problems of handling personal data are by now a subject of real importance, especially in such a delicate sphere as healthcare, and we don’t want to break someone’s privacy.

    At this point, it might seem that once we have the data it is a simple matter of keyword search to find the drug names and the corresponding side effects: if the same tweet mentions both “cocaine” and “muscle spasm” it is quite likely that muscle spasms are a side effect of cocaine. Unfortunately, it’s not that simple: we can’t expect a random guy snorting cocaine on Twitter to use formal medical language to describe his or her symptoms. People on Twitter (and more broadly in social media) do not use medical terminology. To be honest, we can consider ourselves lucky if they use the actual name of the drug at all; we all know how tricky these drug names can be.

    Thus, in the context of mining social media we have to translate a text written in “social media language” (e.g., “I can’t fall asleep all night” or “head spinning a little”) to “formal medical language” (e.g., “insomnia” and “dizziness” respectively). Sometimes the examples are even less obvious:

    And so on, and so forth. You can see how this goes beyond simple matching of natural language expressions and vocabulary elements: string matching approaches cannot link social media language to medical concepts since the words often do not overlap at all. We call the task of mapping everyday language to medical terminology medical concept normalization. If we solve this task, we can bridge the gap between the language of Twitter and medical professionals.

    Natural Languages and Recurrent Neural Networks

    OK, suppose we do have the data in the form of a nicely scraped and parsed set of tweets. Now what? Now comes the most important part: we need to process this data, mining it for anything that could sound like an adverse drug effect. So how on Earth can a model guess that “I can’t fall asleep all night” is actually about “insomnia”? There is not a single syllable in common between these two phrases.

    The answer, as usual in our series, comes from neural networks. Modern state of the art natural language processing often uses neural networks, more precisely a special kind of network called recurrent neural networks (RNNs). An RNN can work with sequential data, keeping some intermediate information inside, in its hidden state, to “remember” previous parts of the sequence. Language is a perfect example of sequential data: it is a string of… well, something; some models work with words, some go down to the level of characters, some combine words into bigrams, but in any case the input is a discrete sequence.

    We will not go into the details of recurrent neural networks; maybe in a future post. Let us just show the network architecture that we used in this paper:

    In the upper left part of the figure you can see a recurrent neural network. It receives as input a sequence of words (previously processed into embeddings, another interesting idea that we will explain some other time). The network receives a word and outputs a vector, but at the same time it sends some information to its “future self”, to the next timestep. This piece of information is called a hidden state, denoted in the figure as h, and formally it is also simply a vector of numbers. Another interesting part is that the sequence is actually handled in two directions: from start to end and vice versa; such a setup is called a bidirectional RNN.

    On the right side of the figure you can see a bubble labeled “Softmax”. This is a standard final layer for classification: it turns a vector of extracted features into probabilities of discrete classes. Basically, every neural network that solves a classification problem has a softmax layer in the end, which means that the entire network serves as a feature extractor, and the features are then fed into a logistic regression. In this case, softmax outputs the probabilities of medical terms from a specific vocabulary.

    This is all very standard stuff for modern neural networks. The interesting part of the figure is at the bottom. There, we extract additional semantic similarity features that are fed into the softmax layer separately. These features result from analysing UMLS, the largest medical terminological system, which links terms and codes between your doctor, your pharmacy, and your insurance company. This system integrates a wide range of terminology across multiple domains: more than 11 million terms from over 133 English source vocabularies are mapped into 3.6 million medical concepts. Besides English, UMLS also contains source vocabularies in 24 other languages.
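    A rough tf.keras sketch of this kind of architecture (ours, with made-up sizes, not the exact configuration from the paper) looks like this: a bidirectional recurrent encoder over word embeddings whose output is concatenated with the additional similarity features right before the final softmax.

    ```python
    import tensorflow as tf
    from tensorflow.keras import layers

    vocab_size, emb_dim, max_len = 20000, 200, 40
    num_concepts, num_sim_features = 1000, 30

    words = layers.Input((max_len,), dtype="int32")   # token ids of the social media phrase
    sim_features = layers.Input((num_sim_features,))  # UMLS-based semantic similarity features

    x = layers.Embedding(vocab_size, emb_dim)(words)          # word embeddings
    x = layers.Bidirectional(layers.GRU(128))(x)              # bidirectional recurrent encoder
    x = layers.Concatenate()([x, sim_features])               # inject the extra features
    out = layers.Dense(num_concepts, activation="softmax")(x) # probabilities of medical concepts

    model = tf.keras.Model([words, sim_features], out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    ```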

    So do these features help? What do the results look like, anyway? Let’s find out.

    Our Results

    Here is an example of how our system actually works in practice:

    The model takes a post from social media (a tweet, like on the picture, or any other text) as input and maps it to a number of standard medical terms. As you can see, some of the concepts are relatively straightforward (“lousy sleeping” produced “difficulty sleeping”) but some, like “asthenia”, do not share any words with the original.

    We evaluated our model with 5-fold cross-validation on the publicly available CADEC dataset, a CSIRO adverse drug event corpus consisting of posts from the AskAPatient forum annotated by volunteers, together with gold-standard mappings of these social media messages to medical concepts. Since the volunteers did not have to have any medical training, and they could be inaccurate in some cases (even after detailed instructions), their answers were proofread by experts in the field, including a pharmacist. The dataset contains adverse drug reactions (ADRs) for 12 well-known drugs, like Diclofenac.

    We’ll let the numbers speak for themselves:

    Colored bars always look convincing; but what do they stand for? We compare our system with three standard architectures. The RNN and CNN labels should be familiar to our readers: we have briefly touched upon RNNs in this post and have explained CNNs in quite a few posts in the past (see, e.g., here). We will not go into the details of the exact convolutional architectures we used for comparison; let’s just say that one-dimensional convolutions are also a very common tool in natural language processing, and we used the architectures shown in a 2016 paper on this subject by researchers from Oxford.

    DNorm is the previous best result for this task, the so-called state of the art from the era before the deep learning revolution. This model comes from a 2013 paper by researchers from the National Center for Biotechnology Information, and it illustrates very well just how dramatic the deep learning revolution has been. This result is only 5 years old, and it required the best tricks in the business, yet it is already hopelessly outmatched even by relatively straightforward neural network architectures, which we improve further in our work: we achieve an error rate of 14.5% compared to their 26.5%, almost halving the error!

    Let us summarize. Improvements in social media mining provided by deep learning can help push this field (dubbed pharmacovigilance, a buzzword on the rise) from experiments to real-life applications. That’s what these numbers are for: you can’t solve a problem like this perfectly without strong AI, but when you have an error rate of 25% it doesn’t work at all, and when you push it down to 15%, then 10%, then 5%… at some point the benefits begin to outweigh the costs. By analyzing people’s input on the drugs they use faster and more accurately, we hope to eventually help pharmaceutical companies reduce the side effects of the drugs they produce. This is yet another example of how neural networks are changing our lives for the better, and we are happy to be a part of this process.

    Elena Tutubalina
    Researcher, Neuromation

    Sergey Nikolenko
    Chief Research Officer, Neuromation