Blog

  • Tracking Cows with Mask R-CNN and SORT

    Tracking Cows with Mask R-CNN and SORT

    Deep learning is hot. There are lots of projects on the cutting edge of deep learning appearing every month, lots of research papers on deep learning coming out every week, and lots of very interesting models for all possible applications being developed and trained every day.

    With this neverending stream of constant advances, it is common that when you are just about to start solving some computer vision (or natural language processing, or any other) problem, you naturally begin by googling possible solutions, and there is always a bunch of open repositories with ready-made models that promise to solve all your problems. If you are lucky enough, you will even find some pre-trained weights for these neural network models, and even maybe a handy API for them. Basically, for most problems you can usually find a model, or two, or a dozen, and all of them seem to work fine.

    So if that is the case, what exactly are we doing here? Are deep learning experts just running existing ready-made models (when they are not doing the state-of-the-art research where these models actually come from)? Why the big bucks? Well, actually applying even ready-to-use products is not always easy. In this post, we will see a specific example of why it is hard, how to detect the problem, and what to do when you notice it. We have written this post together with our St. Petersburg researcher Anastasia Gaydashenko, whom we have already presented in a previous post on segmentation; she has also prepared the models that we use in this post.

    And we will be talking about cows.

    Problem description

    We begin with the problem. It is quite easy to formulate: we would like to learn to track objects from flying drones. We have already talked about very similar problems: object detection, segmentation, pose estimation, and so on. Tracking is basically object detection but for videos rather than still images. Performance is also very important because you probably want tracking to be done in real time: if you spend more time processing the video than recording it, you cut off most possible applications that require raising alarms or round-the-clock tracking. And today, we will consider tracking with a slightly unusual but very interesting example.

    We at Neuromation believe that artificial intelligence is the future of agriculture. We have written about it extensively, and we hope to bring the latest advances of computer vision and deep learning in general to agricultural applications. We are already applying computer vision models to help grow healthy pigs by tracking them in the Piglet’s Big Brother project. So as the testing grounds for the models, we chose this beautiful video, Herding Cattle with a Drone by Trey Lyda for La Escalera Ranch:

    We believe that adding high-quality real-time tracking from drones that herd cows opens up even more opportunities: maybe some cows didn’t pay attention to the drone and were left behind, maybe some of them got sick and can’t move as fast or at all… the first step to fixing these problems would be to detect them. And it appears that there are plenty of already developed solutions for tracking that should work for this problem. Let’s see how they do…

    Simple Online and Realtime Tracking

    The most popular and one of the simplest algorithms for tracking is SORT (Simple Online and Realtime Tracking). It can track multiple objects in real time but the algorithm merely associates already detected objects across different frames based on the coordinates of detection results, like this:

    The idea is to use some off-the-shelf model for object detection (we already did a survey of those here) and then plug the results into the SORT algorithm that matches detected objects across frames.

    This approach obviously yields a multi-purpose algorithm: SORT doesn’t need to know which type of object we track. It doesn’t even need to learn anything: to perform the associations, SORT uses mathematical heuristics such as maximizing the IOU (intersection-over-union) metric between bounding boxes in neighboring frames. Each box is labeled with a number (object id), and if there is no relevant box in the next frame, the algorithm assumes that the object has left the frame.
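
    To make this concrete, here is a minimal sketch of the IoU-based association idea (the actual SORT implementation also uses a Kalman filter to predict box positions and the Hungarian algorithm for optimal matching; the threshold below is illustrative):

    def iou(a, b):
        # boxes are (x1, y1, x2, y2)
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def associate(tracks, detections, iou_thr=0.3):
        """Greedily match tracked boxes (object id -> box) to new detections (list of boxes)."""
        matches, used = {}, set()
        for obj_id, tracked_box in tracks.items():
            best_j, best_iou = None, iou_thr
            for j, det in enumerate(detections):
                if j not in used and iou(tracked_box, det) > best_iou:
                    best_j, best_iou = j, iou(tracked_box, det)
            if best_j is not None:
                matches[obj_id] = best_j
                used.add(best_j)
        # unmatched detections get fresh ids; ids unmatched for too long are dropped
        return matches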

    The quality of such an approach naturally depends a lot on the quality of the underlying object detection. The whole point of the original SORT paper was to show that object detection algorithms have advanced so much that you don’t have to do anything too fancy about tracking and can achieve state-of-the-art results with straightforward heuristics. Since then, improvements have appeared, in particular the next generation of the SORT algorithm, Deep SORT (deep learning is really fast: SORT came out in 2016, and Deep SORT already in 2017). It was designed specifically to reduce the number of identity switches, ensuring that the tracking is more stable.

    First results

    To use SORT for tracking, we need to plug in some model for the detection step. In our case, it could be any object detection model pretrained to recognize cows. We used this open repository, which includes a SORT implementation based on the YOLO (actually, YOLOv2) detection model; it also has an implementation of Deep SORT.

    Since YOLO is pretrained on the standard COCO dataset that has “cow” as one of its classes, we can simply launch the detection and tracking. The results are quite poor:

    Note that we haven’t made any bad decisions along the way. Frankly, we haven’t really made any decisions at all: we are using a standard pretrained implementation of SORT with a standard YOLO model for object detection that usually works quite well. But the video clearly shows that the results are poor because of the first step, detection. In almost all frames the model does not detect any cows, only sometimes finding a couple of them. So we need to go deeper…

    You Only Look Once

    To understand the issue and decide how to deal with it, let’s take a closer look at the YOLO architecture.

    The pipeline itself is pretty straightforward: unlike many popular detection models which perform detection on many region proposals (RoIs, regions of interest), YOLO passes the image through the neural network only once (this is where the title comes from: You Only Look Once) and returns bounding boxes and class probabilities for predictions. Like this:

    To do that, YOLO breaks up the image into a grid, and for each cell in the grid considers a number of possible bounding boxes; neural networks are used to estimate the confidence that each of those boxes contains an object and find class probabilities for this object:

    The network architecture is pretty simple too; it contains 24 convolutional layers followed by two fully connected layers, reminiscent of AlexNet and even earlier convolutional architectures:

    Since the original image is divided into cells, detection happens if the center of an object falls into a cell. But since each grid cell only predicts two boxes, the model struggles with small objects that appear in groups, such as a flock of birds… or a herd of cows (or is it a kine? a flink? it’s all pure rubbish, of course). It is even explicitly pointed out by the authors in the section on the limitations of their approach.
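
    To make the numbers concrete, here is a back-of-the-envelope sketch of the output layout in the original YOLO paper (YOLOv2, which we actually used, differs in details such as anchor boxes, but the grid idea is the same):

    S = 7    # the image is divided into an S x S grid (7 x 7 in the original paper)
    B = 2    # bounding boxes predicted per grid cell
    C = 20   # number of classes (20 for PASCAL VOC in the original paper)

    # each box is described by x, y, w, h and a confidence score, i.e., 5 numbers,
    # and each cell additionally predicts C class probabilities
    outputs_per_cell = B * 5 + C
    print(S, "x", S, "x", outputs_per_cell, "=", S * S * outputs_per_cell)   # 7 x 7 x 30 = 1470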

    Okay, so by now we have tried a straightforward approach that seemed very plausible but utterly failed. Time to pivot.

    Pivoting to a different model

    As we have seen, even if you can find open repositories that seem tailor-made for your specific problem, the models you find, even if they are perfectly good models in general, may not be the best option for your particular problem.

    To get the best performance (or some reasonable performance, at least), you usually have to try several different approaches. So as the next step, we changed the model to Mask R-CNN that we have talked about in detail in one of our previous posts. Due to a totally different approach to detection, it should be able to recognize cows better, and it really did:

    The basic network that you can download from the repositories was also trained on the COCO dataset.

    But to achieve even better results, we decided to get rid of all extra classes and train the model only on the classes for cows and sheep. We left sheep in because, first, we wanted to reproduce the results on sheep as well, and second, they look pretty similar from afar, so a similar but different class could be useful for detection.
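
    For reference, here is roughly how one can carve the cow and sheep subset out of the COCO annotations with pycocotools; this is only a sketch of the data preparation step (the annotation path below is an example), the actual retraining follows the data loading conventions of the Mask R-CNN repository we used:

    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_train2017.json")   # example path to COCO annotations

    cat_ids = coco.getCatIds(catNms=["cow", "sheep"])

    # union of all images that contain at least one of the two classes
    img_ids = set()
    for cat_id in cat_ids:
        img_ids.update(coco.getImgIds(catIds=[cat_id]))

    # annotations restricted to the two classes only
    ann_ids = coco.getAnnIds(imgIds=list(img_ids), catIds=cat_ids, iscrowd=None)
    anns = coco.loadAnns(ann_ids)
    print(len(img_ids), "images,", len(anns), "cow/sheep annotations")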

    There is a pretty easy way to upload new training data for the model in the Mask R-CNN repository that we used. So we retrained the network to detect only these two classes. After that, all we needed to do was to embed the new detection model into the tracking algorithm. And here we go, the results are now much better:

    We can again compare all three detection versions on a sample frame from the video.

    YOLO did not detect anything:

    Vanilla Mask R-CNN did much better but it’s still not something you can call a good result:

    And our version of Mask R-CNN is better yet:

    All the code for our experiments can be found here, in the Neuromation fork of the “Tracking with darkflow” repository.

    As you can see, even almost without any new code, by fiddling with existing repositories you can often go from a completely unworkable model to a reasonably good one. Still, even in the last picture one can notice a few missing detections that really should be there, and the tracking based on this detector is also far from perfect yet. Our simple illustration ends here, but the real work of artificial intelligence experts only begins: now we have to push the model from “reasonable” to “truly useful”. And that’s a completely different flink of cows altogether…

    Sergey Nikolenko
    Chief Research Officer, Neuromation

    Anastasia Gaydashenko
    Junior Researcher, Neuromation

  • DeepGlobe Challenge: Three Papers from Neuromation Accepted!

    DeepGlobe Challenge: Three Papers from Neuromation Accepted!

    We have great news: we’ve got not one, not two, but three papers accepted to the DeepGlobe workshop at the top computer vision conference, CVPR 2018! This is a big result for us: it shows that our team is able to compete with the very best and get to the main conferences in our field.

    Today, we present one of the solutions that got accepted, and it is my great pleasure to present the driving force behind this solution, one of our deep learning researchers in St. Petersburg, with whom we have co-authored this post. Please meet Sergey Golovanov, an experienced competitive data scientist whose skills have been instrumental for the DeepGlobe challenge:

    Sergey Golovanov
    Researcher, Neuromation

    The DeepGlobe Challenge

    Satellite imagery is a treasure trove of data that can yield many exciting new applications in the near future. Over the last decades, dozens of satellites from various agencies such as NASA, ESA, or DigitalGlobe (which sponsored this competition) have collected terabytes upon terabytes of data.

    At the same time, satellite imagery has not yet become the target of much research in computer vision and deep learning. There are few large-scale publicly available datasets, and data labeling is always a bottleneck for segmentation tasks. The DeepGlobe Challenge is designed to bridge this gap, bringing high-quality and at the same time labeled satellite imagery to everyone; see this paper by DeepGlobe organizers for a more detailed description of the dataset.

    With data science competitions, organizers try to draw the attention of AI researchers and practitioners to specific problems or whole problem domains and thereby spur the development of new models and algorithms in the field. To attract the deep learning community to analyzing satellite imagery, DeepGlobe presented three tracks with different tasks highly relevant for satellite image processing: road extraction, building detection, and land cover classification. Here are three samples of labeled data for these tasks:

    Image source

    All of these tasks are formulated as semantic segmentation problems; we have already written about segmentation problems before but will also include a reminder below.

    A part of our team from the Neuromation Labs at St. Petersburg, Russia, took part in two of the three tracks: building detection and land cover classification. We took the third place in building detection and fourth and fifth places in land cover classification (see the leaderboard) and got three papers accepted for the workshop! In this post, we will explain in detail the solution that we prepared for the building detection track.

    Semantic segmentation

    Before delving into the details of our solution, let us first discuss the problem setting itself. Semantic segmentation of an image is a partitioning of the image into separate groups of pixels, areas corresponding to certain objects, combined with classifying the type of object in every area (see also our previous post on this subject). That is, the problem looks something like this:

    Image source

    Deep Learning for Segmentation: Basic Architecture

    We have spoken about convolutional neural networks and the characteristics of convolutions many times in our previous posts (for example, see here, here, or here), so we will not discuss them in too much detail and will go straight to the architectures.

    The most popular and one of the most effective neural network architectures for semantic segmentation is U-Net and its extensions (there are plenty of modifications, which is always a good sign for the basic architecture as well). The architecture itself consists of an encoder and a decoder; as you can see in the figure below, U-Net is very aptly named:

    Image source

    The encoder creates compressed image representations (feature maps), extracting multiscale features and thereby implicitly taking into account local context information within certain neighborhoods of the image. Real-life problems often do not come with much labeled data, so encoders in U-Net and similar architectures usually rely on transfer learning, that is, use a classifier pre-trained on ImageNet without the last classification layer (i.e., only with convolutional layers) as the encoder. Actually, this is useful even if data is plentiful: it usually increases the rate of convergence of the model and improves segmentation results, even in domains that do not overlap with ImageNet such as segmentation of cell nuclei or, in our case, satellite imagery.

    The decoder takes as input the feature maps obtained from the encoder and constructs a segmentation map, gradually increasing the resolution for more accurate localization of object boundaries. The main novel feature of U-Net, which gives the architecture its form and name, is the skip-connections that let the decoder “peek” into intermediate, higher-resolution representations from the encoder, combining them with the outputs of the corresponding level of the decoder. Otherwise, a decoder in U-Net is usually just a few layers of convolutions and deconvolutions.
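
    Here is a toy PyTorch sketch of this encoder-decoder structure with a single skip-connection; it is only meant to illustrate the idea, real U-Nets have several levels and many more channels:

    import torch
    import torch.nn as nn

    class TinyUNet(nn.Module):
        """A toy two-level U-Net: one downsampling step, one upsampling step."""
        def __init__(self, in_ch=3, num_classes=2):
            super().__init__()
            self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
            self.down = nn.MaxPool2d(2)
            self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
            self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
            # the decoder sees upsampled features concatenated with the skip-connection
            self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, num_classes, 1))

        def forward(self, x):
            skip = self.enc1(x)                    # high-resolution features
            bottom = self.enc2(self.down(skip))    # compressed representation
            up = self.up(bottom)                   # back to high resolution
            return self.dec(torch.cat([up, skip], dim=1))

    out = TinyUNet()(torch.randn(1, 3, 64, 64))    # -> shape (1, 2, 64, 64)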

    Deep Learning for Segmentation: Loss Functions

    There is one more interesting remark about segmentation as a machine learning problem. On the surface, it looks like a classification problem: you have to set a class for every pixel in the image. However, if you treat it simply as a classification problem (i.e., start using per-pixel cross-entropy as the objective function to optimize the model) it won’t work too well.

    The problem with this approach is that it does not capture the spatial connections between the pixels: classification problems are independent for every pixel, and the objective function has no idea that a single pixel in the middle of the sea cannot really be a lawn even if it turns out to be green. This will lead to small holes appearing on segmentation results and very complicated boundaries between different classes.

    To solve this problem, we have to balance the standard cross-entropy with some other loss function developed specifically for segmentation. We won’t go into the mathematical details here, but the 2017 approach has been to add the average DICE loss, which allows one to optimize the value of IoU (Intersection over Union). However, in order to strike the right balance between clear boundaries and absence of holes, one has to choose the coefficients between these two losses very carefully.

    The 2018 approach, rapidly growing in popularity, is the Lovász-Softmax loss (and a similar Lovász-Hinge loss), which serves as a differentiable surrogate for intersection-over-union. We have also used the Lovász-Softmax loss in another DeepGlobe paper of ours, devoted to the land cover classification challenge, so we will concentrate on the architectures here and perhaps return to discuss the loss functions in a later post. In any case, our experiments on this challenge have also shown that both DICE and Lovász-Softmax losses give tangible increases in segmentation quality.
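
    As an illustration, here is what a combined per-pixel cross-entropy and soft Dice loss might look like in PyTorch for binary segmentation (a sketch; the weight between the two terms is exactly the coefficient that needs careful tuning):

    import torch
    import torch.nn.functional as F

    def soft_dice_loss(logits, targets, eps=1.0):
        """Soft Dice loss for binary segmentation; logits and targets are (N, H, W) floats."""
        probs = torch.sigmoid(logits)
        inter = (probs * targets).sum(dim=(1, 2))
        union = probs.sum(dim=(1, 2)) + targets.sum(dim=(1, 2))
        return (1 - (2 * inter + eps) / (union + eps)).mean()

    def combined_loss(logits, targets, dice_weight=0.5):
        """Weighted sum of per-pixel BCE and soft Dice; the weight has to be tuned."""
        bce = F.binary_cross_entropy_with_logits(logits, targets)
        return (1 - dice_weight) * bce + dice_weight * soft_dice_loss(logits, targets)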

    Deep Learning for Segmentation: Beyond the Basics

    There are plenty of architectures based on the principles described above; see, e.g., this repository of links on semantic segmentation. But we would like to pay special attention to two very interesting ideas that have already proven to be very effective in practice: Atrous Spatial Pyramid Pooling with Image Pooling and ResNeXt-FPN.

    The basic idea of the first approach is to use several parallel atrous convolutions (dilated convolutions) and image pooling at the end of the encoder, which are eventually combined through a 1×1 convolution. An atrous convolution is a convolution where there is some distance between the elements of the kernel, called rate, like here:

    Image source

    Image pooling simply means averaging over the entire feature map. This architecture effectively extracts multiscale features and then uses their combination to actually produce the segmentation. In this case, the encoder is made shallow, with a final feature map size 8 or 16 times smaller than the input image size:

    This leads to feature maps of higher resolution. A good rule of thumb here is to use as simple a decoder as possible since each layer of the decoder is only required to increase the image resolution by a factor of two, and thanks to the U-Net architecture and spatial pyramid pooling it has quite a lot of input information to do it.
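
    A rough PyTorch sketch of the Atrous Spatial Pyramid Pooling idea is shown below; the dilation rates and channel counts here are illustrative, not the ones from any specific paper:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyASPP(nn.Module):
        """Parallel atrous convolutions plus image pooling, fused by a 1x1 convolution."""
        def __init__(self, in_ch=256, out_ch=256, rates=(6, 12, 18)):
            super().__init__()
            self.branches = nn.ModuleList(
                [nn.Conv2d(in_ch, out_ch, 1)] +                        # plain 1x1 branch
                [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)    # atrous branches
                 for r in rates])
            self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),   # global average
                                            nn.Conv2d(in_ch, out_ch, 1))
            self.fuse = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

        def forward(self, x):
            h, w = x.shape[2:]
            feats = [branch(x) for branch in self.branches]
            pooled = F.interpolate(self.image_pool(x), size=(h, w),
                                   mode="bilinear", align_corners=False)
            return self.fuse(torch.cat(feats + [pooled], dim=1))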

    ResNeXt-FPN is basically the Feature Pyramid Network (FPN) model, commonly used for object detection (for example, in Faster R-CNN), with a ResNeXt backbone, adapted here for segmentation. Again, the architecture consists of an encoder and a decoder. However, now for segmentation we use feature maps from each decoder level, not just from the last layer with the highest resolution.

    Since these maps have different sizes, they are resized (to match the largest) and then combined together:

    Image source

    This architecture has long been known to work very well for segmentation, taking first place in the COCO 2017 Stuff Segmentation Task.

    DeepGlobe Building Detection Challenge

    The task in the Building Detection Challenge is, rather surprisingly, to detect buildings. At this point an astute reader might wonder why we need to solve this problem at all. After all, there are a lot of cartographic services where these buildings are already labeled. In the world where you can go to Google Maps and find your city mapped out to the last detail, how is building detection a challenge?

    Well, the astute reader would be wrong. Yes, such services exist, but usually they label buildings manually. First of all, this means that labeling costs a lot of money, and cartographic services run into the main bottleneck of modern AI: they need lots of real data that can be produced only by manual labeling.

    Moreover, this is not a one-time cost: you cannot label the map of the world once and forget about it. New buildings are constructed all the time, and old ones are demolished. Satellites keep circling the Earth and producing their images pretty automatically, but who will keep re-labeling them? A high-quality detector would help solve these problems.

    But cartography is not the only place where such a detector would be useful. Analysis of urbanization in an area, which relies on the locations of buildings, could be useful for real estate, construction, and insurance companies and, in fact, for ordinary people. Another striking application of building detection, one of the most important in our opinion, is disaster relief: when one needs to find and evaluate destroyed buildings as fast as possible to save lives, any kind of automation is priceless.

    Let us now go back to the challenge. Satellite images were selected from the SpaceNet dataset provided by DigitalGlobe. Images have 30cm per pixel resolution (which is actually a pretty high resolution when you think about it) and have been gathered by the WorldView-3 satellite. The organizers chose several cities — Las Vegas, Paris, Shanghai, and Khartoum — for the challenge. Apart from color photographs (which means three channels for the standard RGB color model), SpaceNet also contains eight additional spectral channels which we will not go into; suffice it to say that satellite imagery contains much, much more than meets the eye.

    How does one evaluate the quality of a detector? For evaluation, the organizers proposed to use the classical F1-score:

    Here TP (true positive) is the number of correctly detected polygons of buildings, N is the number of existing real building polygons (unique for every building), and M is the number of buildings in the solution. A building proposal (a polygon produced by the model) is considered to be a true positive if the real building polygon that has the largest IoU (Intersection over Union) with the proposal has IoU greater than 0.5; otherwise the proposal is a false positive.
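
    With these definitions, precision is TP/M, recall is TP/N, and their harmonic mean gives the F1-score:

        \[F_1 = \frac{2\,\mathrm{TP}}{M + N}.\]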

    In our solution for the segmentation of buildings, we used an architecture similar to U-Net. As the encoder, we used the SE-ResNeXt-50 pretrained on ImageNet. We chose it because this classifier is of high enough quality and does not require too much memory, which is important to maintain a large batch size during training.

    To the encoder, we added Atrous Spatial Pyramid Pooling with Image Pooling. The decoder also contains four blocks, each of which is a sequence of convolution, deconvolution, and another convolution. Besides, following the U-Net idea we added skip-connections from the encoder at each level of the decoder. The full architecture is shown in the figure below.

    With this model, we did our first experiments and looked at the results… only to find that a good architecture is not enough to produce a good result. The main problem was that in many situations, buildings were clustered very close together. This dense placement made it very difficult for the model to distinguish individual buildings: it usually decided to simply lump them all together into one uber-building. And this was very bad for the competition score (and for the model’s results in general) because the score was based on identifying specific buildings rather than classifying pixels.

    To fix this problem, we needed to do instance segmentation, i.e., learn to separate instances of the same class in the segmentation map. After trying several ways of separating the buildings, we decided on a simple but quite effective solution: the watershed algorithm. Since the watershed algorithm needs an initial approximation of the instances (in the form of markers), in addition to the binary segmentation mask our neural network also predicted the normalized distance of each building pixel to the building boundary (“energy”). The markers were obtained by binarizing this energy with a threshold. In addition, we increased the input size of images by a factor of two, which allowed us to construct more precise segmentations.
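
    Roughly, this post-processing step can be sketched with scipy and scikit-image as follows (the thresholds here are illustrative, not the values we actually tuned):

    from scipy import ndimage
    from skimage.segmentation import watershed

    def split_instances(mask_prob, energy, mask_thr=0.5, marker_thr=0.7):
        """Turn a binary building mask and the predicted 'energy' into separate instances."""
        mask = mask_prob > mask_thr
        markers, _ = ndimage.label(energy > marker_thr)   # seeds: the "cores" of buildings
        # flood from the seeds over the inverted energy, but only inside the mask
        labels = watershed(-energy, markers, mask=mask)
        return labels   # 0 is background, 1..K are separate buildings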

    As we explained above, we used the sum of binary cross-entropy and the Lovász-Hinge loss function as the objective for the binary mask and the mean squared error for energy. The model was trained on 256×256 crops of input RGB images. We used standard augmentation methods: rotations by multiples of 90 degrees, flips, random scaling, changes in brightness and contrast. We sampled images in the batch based on the value of the loss that was observed on them, so that images with larger error would appear in a training batch more often.
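
    The loss-based sampling can be as simple as drawing image indices with probabilities proportional to the last observed per-image loss; a minimal sketch:

    import numpy as np

    def sample_batch_indices(per_image_loss, batch_size=16):
        """Sample training images with probability proportional to their last observed loss."""
        per_image_loss = np.asarray(per_image_loss, dtype=np.float64)
        p = per_image_loss / per_image_loss.sum()
        return np.random.choice(len(per_image_loss), size=batch_size, p=p)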

    Results and conclusion

    Et voila! In the end, our solution produced building detection of quite good quality:

    These animated GIFs show binary masks of the buildings on the left and predicted energy on the right. Naturally, even state-of-the-art detection models such as this one are still not perfect, and we still have a lot to do, but it appears that this quality is already quite sufficient for industrial use.

    Let’s wrap up: we have discussed in detail one of our papers accepted for the DeepGlobe CVPR workshop. There are two more for land cover classification, i.e., for segmenting satellite images into classes like “forest”, “water”, or “rangeland”; maybe we will return to them in a later post. Congratulations to everyone who has worked on these models and papers: Alexander Rakhlin, Alex Davydow, Rauf Kurbanov, and Aleksey Artamonov! We have a great team here at Neuromation Labs, and we are sure there are many more papers, competitions, and industrial solutions to come. We’ll keep you posted.

    Sergey Golovanov
    Researcher, Neuromation

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • NeuroNuggets: An Overview of Deep Learning Frameworks

    NeuroNuggets: An Overview of Deep Learning Frameworks

    Today we continue the NeuroNuggets series with a new installment. This is the first time that a post written by one of our deep learning researchers was so long that we had to break it up into two parts. In the first part, we discussed the notion of a computational graph and what functionality a deep learning framework should have; we found out that deep learning frameworks are basically automated differentiation libraries and understood the distinction between static and dynamic computational graphs. Today, meet our junior researcher in St. Petersburg, Oktai Tatanov, again; he will be presenting a brief survey of different deep learning frameworks, highlighting their differences and explaining our choice:

    Comparative popularity

    Last time, we finished with this graph published by the famous deep learning researcher Andrej Karpathy; it shows comparative popularity of deep learning frameworks in the academic community (mentions in research papers):

    Unique mentions of deep learning frameworks in arXiv papers (full text) over time, based on 43K ML papers over last 6 years. Source

    We see that the top 4 general-purpose deep learning frameworks right now are TensorFlow, Caffe, Keras, and PyTorch. Today, we will discuss the similarities and differences between them and help you make the right choice of a framework.

    Tensorflow

    TensorFlow is probably the most famous deep learning framework; it is being developed and maintained by Google. It is written in C++/Python and provides Python, Java, Go, and JavaScript APIs. TensorFlow uses static computational graphs, although the recently released TensorFlow Fold library has added support for dynamic graphs as well. Moreover, since version 1.7 TensorFlow has taken another step towards dynamic execution and implemented eager execution, which evaluates operations immediately, without building graphs.

    At present, TensorFlow has gathered the largest deep learning community around it, so there are a lot of videos, online courses, tutorials, and so on. It offers support for running models on multiple GPUs and can even split a single computational graph over multiple machines in a computational cluster.

    Apart from purely computational features, TensorFlow provides an awesome extension called TensorBoard that can visualize the computational graph, plot quantitative metrics about the execution of model training or inference, and basically provide all sorts of information necessary to debug and fine-tune a deep neural network in an easier way.

    Plenty of data scientists consider TensorFlow to be the primary software tool of deep learning, but there are also some problems. Despite the big community, TensorFlow is still difficult for beginners to learn, and many experts agree that other mainstream frameworks are faster than TensorFlow.

    As an example of implementing a simple neural network in TensorFlow, look at the following:

    import numpy as np
    import tensorflow as tf
    
    data_size = 10
    
    input_size = 28 * 28
    hidden1_output = 200
    output_size = 1
    
    data = tf.placeholder(tf.float32, shape=(data_size, input_size))
    target = tf.placeholder(tf.float32, shape=(data_size, output_size))
    
    h1_w1 = tf.Variable(tf.random_uniform((input_size, hidden1_output)))
    h2_w1 = tf.Variable(tf.random_uniform((hidden1_output, output_size)))
    
    hidden1_out = tf.maximum(tf.matmul(data, h1_w1), 0)
    target_ = tf.matmul(hidden1_out, h2_w1)
    loss = tf.losses.mean_squared_error(target_, target)
    
    opt = tf.train.GradientDescentOptimizer(1e-3)
    upd = opt.minimize(loss)
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
    
        feed = {data: np.random.randn(data_size, input_size), target: np.random.randn(data_size, output_size)}
    
        for step in range(100):
            loss_val, _ = sess.run([loss, upd], feed_dict=feed)

    It’s not so elementary for beginners, but it shows the main concepts in TensorFlow, so let us focus on the code structure first. We begin by defining the computational graph: placeholders, variables, operations (maximum, matmul), and the loss function at the end. Then we assign an optimizer that defines what and how we want to optimize. And finally, we train our graph over and over in a special execution environment called a session.
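
    Returning to TensorBoard for a moment: in the TF 1.x style of the snippet above, a few extra lines are enough to log the loss and the graph (a sketch that assumes the loss, upd, sess, and feed variables defined above; the training loop is shown again with logging added):

    loss_summary = tf.summary.scalar("loss", loss)
    merged = tf.summary.merge_all()
    writer = tf.summary.FileWriter("./logs", sess.graph)   # also stores the graph for visualization

    for step in range(100):
        summary, loss_val, _ = sess.run([merged, loss, upd], feed_dict=feed)
        writer.add_summary(summary, global_step=step)
    # then run: tensorboard --logdir=./logs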

    Unfortunately, if you want to improve the network’s architecture with conditionals or loops (this is especially useful, even essential, for recurrent neural networks), you cannot simply use Python keywords. As you already know, a static graph is constructed and compiled once, so to add nodes to the graph you should use special control flow or higher-order operations.

    For instance, to add a simple conditional to our previous example, we have to modify the previous code like this:

    # update for https://gist.github.com/Oktai15/4b6617b916c0fa4feecab35be09c1bd6
    a = tf.constant(10)   # even the condition value has to be a node in the static graph
    
    data = tf.placeholder(tf.float32, shape=(data_size, input_size))
    h1_w1 = tf.placeholder(tf.float32, shape=(input_size, hidden1_output))
    h2_w1 = tf.placeholder(tf.float32, shape=(input_size, hidden1_output))
    
    # both branches must be wrapped into functions that build graph nodes
    def first(): return tf.matmul(data, h1_w1)
    def second(): return tf.matmul(data, h2_w1)
    
    # tf.cond adds a conditional node: "if a > 0 then first() else second()"
    hidden1_out = tf.cond(tf.greater(a, 0), first, second)

    Caffe

    The Caffe library was originally developed at UC Berkeley; it was written in C++ with a Python interface. An important distinctive feature of Caffe is that one can train and deploy models without writing any code! To define a model, you just edit configuration files or use pre-trained models from the Caffe Model Zoo, where you can find most established state-of-the-art architectures. Then, to train a model you just run a simple script. Easy!

    To show how it works (at least approximately), check out the following code:

    name: "SimpleCaffeNet"
    layer {
      name: "data"
      type: "Input"
      top: "data"
      input_param { shape: { dim: 10 dim: 1 dim: 28 dim: 28 } }
    }
    layer {
      name: "fc1"
      type: "InnerProduct"
      bottom: "data"
      top: "fc1"
      inner_product_param {
        num_output: 784
      }
    }
    layer {
      name: "relu"
      type: "ReLU"
      bottom: "fc1"
      top: "fc1"
    }
    layer {
      name: "fc2"
      type: "InnerProduct"
      bottom: "fc1"
      top: "fc2"
      inner_product_param {
        num_output: 10
      }
    }
    layer {
      name: "prob"
      type: "Softmax"
      bottom: "fc2"
      top: "prob"
    }

    We define the neural network as a set of blocks that correspond to layers. At first, we see a data layer where we specify the input shape, then two fully connected layers with ReLU activations. At the end, we have a softmax layer where we get the probability for every class in the data, e.g., 10 classes for the MNIST dataset of handwritten digits.

    In reality, Caffe is rarely used for research but is quite often used in production. However, its popularity is waning because there is a great new alternative, Caffe2, which we will touch upon a little when we talk about PyTorch.

    Keras

    Keras is a high-level neural network library written in Python by Francois Chollet, currently a member of the Google Brain team. It works as a wrapper over one of the low-level libraries such as TensorFlow, Microsoft Cognitive Toolkit, Theano or MXNet. Actually, for quite some time Keras has been shipped as a part of TensorFlow.

    Keras is pretty simple, easy to learn and to use. Thanks to brilliant documentation, its community is big and very active, so beginners in deep learning like it. If you do not plan to do complicated research and develop new extravagant neural networks that Keras might not cover, then we heartily advise you to consider Keras as your primary tool.

    However, you should understand that Keras is being developed with an eye towards fast prototyping. It is not flexible enough for complicated models, and sometimes its error messages are not easy to debug. We implemented in Keras the same neural network that we wrote in TensorFlow above. Look:

    
    import numpy as np
    import tensorflow as tf
    
    data_size = 10
    
    input_size = 28 * 28
    hidden1_output = 200
    output_size = 1
    
    data = np.random.randn(data_size, input_size)
    target = np.random.randn(data_size, output_size)
    
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(hidden1_output, input_shape=(input_size,), activation=tf.nn.relu))
    model.add(tf.keras.layers.Dense(output_size))
    
    model.compile(loss=tf.keras.losses.mean_squared_error,
                  optimizer=tf.keras.optimizers.SGD(lr=1e-3))
    
    model.fit(data, target, epochs=100, batch_size=data_size)

    What immediately jumps out in this example is that our code has been reduced a lot! No placeholders, no sessions, we only write concise informative constructions, but, of course, we lose some extensibility due to extra layers of abstraction.

    PyTorch

    PyTorch is a Python framework released by Facebook’s artificial intelligence research group, based on Torch (Facebook’s previous framework, written for Lua). It is the main representative of the dynamic graph approach.

    PyTorch is pythonic and very developer-friendly. Memory usage in PyTorch is very efficient for all kinds of neural networks, and it is also said to be a bit faster than TensorFlow.

    It has a responsive forum where you can ask any question and extensive documentation with a lot of official tutorials and examples; however, the community is still quite a bit smaller than that of TensorFlow. Sometimes you cannot find a PyTorch implementation of a recent model, while it is easy to find two or three in TensorFlow. Still, this framework is often considered the best choice for research.

    Quite surprisingly, in May 2018 the PyTorch project was merged with Caffe2, the successor of Caffe that Facebook has been actively developing specifically for production. For supporters of these frameworks, this means that the bottleneck between researchers and developers should vanish.

    Now look at the code below, which shows a simple way to get a taste of PyTorch:

    import torch
    import torch.nn as nn
    import torch.nn.functional as fun
    
    data_size = 10
    
    input_size = 28 * 28
    hidden1_output = 200
    output_size = 1
    
    data = torch.randn(data_size, input_size)
    target = torch.randn(data_size, output_size)
    
    model = nn.Sequential(
        nn.Linear(input_size, hidden1_output),
        nn.ReLU(),
        nn.Linear(hidden1_output, output_size)
    )
    
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    
    for step in range(100):
        target_ = model(data)
        loss = fun.mse_loss(target_, target)
        loss.backward()
        opt.step()
        opt.zero_grad()

    Here we randomly initialize our trial data and target, then define the model and the optimizer. The last block runs the training loop: on every step it computes the model’s output and the loss, and then changes the weights with SGD. It reads almost like Keras: easy to follow, but we do not lose the ability to write complicated neural networks.

    Thanks to its dynamic graphs, PyTorch is integrated into Python much more tightly than TensorFlow, so you can write conditionals and loops just as in an ordinary Python program.

    You can see this when you try to implement, for example, a simple recurrent block that we represent as h_i = h_{i-1} · x_i:

    import torch
    
    h0 = torch.randn(10)
    x = torch.randn(5, 10)
    h = [h0]
    
    for i in range(5):
        h_i = h[-1] * x[i]
        h.append(h_i)

    The Neuromation choice

    Our Research Lab at St. Petersburg mostly prefers PyTorch. For instance, we have used it for computer vision models that we applied to price tag segmentation. Here is a sample result:

    But sometimes, especially in cases when PyTorch does not have a ready solution for something yet, we create our models in TensorFlow. The main idea of Neuromation is to train neural networks on synthetic data. We are convinced that a great result on real data can be obtained with transfer learning from perfectly labeled synthetic datasets. Have a look at some of our results for the segmentation of retail items based on synthetic data:

    Conclusion

    There are several deep learning frameworks, and we could go into a lot more detail about which to prefer. But, of course, frameworks are just tools to help you develop neural networks, and while the differences are important they are, of course, secondary. The primary tool in developing modern machine learning solutions is the neural network in your brain: the more you know, the more you think about machine learning solutions from different angles, the better you get. Knowing several deep learning frameworks can also help broaden your horizons, especially when the top contenders are as different as Theano and PyTorch. So it pays to learn them all even if your primary tool has already been chosen for you (e.g., your team uses a specific library). Good luck with your networks!

    Oktai Tatanov
    Junior Researcher, Neuromation

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • NeuroNuggets: What Do Deep Learning Frameworks Do, Exactly?

    NeuroNuggets: What Do Deep Learning Frameworks Do, Exactly?

    Our sixth installment of the NeuroNuggets series is slightly different from previous ones. Today we touch upon an essential and, at the same time, rapidly developing area — deep learning frameworks, software libraries that modern AI researchers and practitioners use to train all these models that we have been discussing in our first five installments. In today’s post, we will discuss what a deep learning framework should be able to do and see the most important algorithm that all of them must implement.

    We have quite a few very talented junior researchers in our team. Presenting this post on neural networks’ master algorithm is Oktai Tatanov, our junior researcher in St. Petersburg:

    What a Deep Learning Framework Must Do

    A good AI model begins with an objective function. We also begin this essay by explaining the main purpose of deep learning frameworks. What does it mean to define a model (say, a large-scale convolutional network like the ones we discussed earlier), and what should a software library actually do to convert this definition into code that trains and/or applies the model?

    Actually, every modern deep learning framework should be able to do everything on the following checklist:

    • build and operate with large computational graphs;
    • perform inference (forward propagation) and automatic differentiation (backpropagation) on computational graphs;
    • be able to place the computational graph and perform the above operations on a GPU;
    • provide a suite of standard neural network layers and other widely used primitives that the computational graph might consist of.

    As you can see, every point is somehow about computational graphs… but what are those? How do they relate to neural networks? Let us explain.

    Computational Graphs: What

    Artificial neural networks are called neural networks for a reason: they model, in a very abstract and imprecise way, processes that happen in our brains. In particular, neural networks consist of a lot of artificial neurons (perceptrons, units); outputs of some of the neurons serve as inputs for others, and outputs of the last neurons are the outputs of the network as a whole. Mathematically speaking, a neural network is a very large and complicated composition of very simple functions.

    Computational graphs reflect the structure of this composition. A computational graph is a directed graph where every node represents a mathematical operation or a variable, and edges connect these operations with their inputs. As usual with graphs, a picture is worth a thousand words — here is a computational graph for the function F(x, y, z) = (x+y)z:

    The whole idea of neural networks is based on connectionism: huge compositions of very simple functions can give rise to very complicated behaviour. This has been proven mathematically many times, and modern deep learning techniques show how to actually implement these ideas in practice.

    But why are the graphs themselves useful? What problem are we trying to solve with them, and what exactly are deep learning frameworks supposed to do?

    Computational Graphs: Why

    The main goal of deep learning is to train a neural network in such a way that it best describes the data we have. Most often, this problem is reduced to the problem of minimizing some kind of loss function or maximizing the likelihood or posterior distribution of a model, i.e., we either want to minimize how much our model gets wrong or want to maximize how much it gets right. The frameworks are supposed to help with these optimization problems.

    Modern neural networks are very complicated and non-convex, so basically the only optimization method we have for large neural networks is the simplest and most universal optimization approach: gradient descent. In gradient descent, we basically compute the derivatives of the objective function (the gradient is the vector consisting of all partial derivatives) and then go into the direction where the objective function increases or decreases, as needed. Like this:
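
    In code, the basic update rule is just a loop; here is a toy one-dimensional example (a minimal sketch, not a framework in any sense):

    # minimize f(w) = (w - 3)^2, whose derivative is f'(w) = 2 * (w - 3)
    w, learning_rate = 0.0, 0.1
    for step in range(100):
        grad = 2 * (w - 3)
        w -= learning_rate * grad   # step against the gradient to decrease f
    print(w)                        # very close to 3, the minimizer of f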

    There are, of course, many interesting improvements and modifications to this simple idea: Nesterov’s momentum, adaptive gradient descent algorithms that change the learning rate separately for every weight… Perhaps one day we will return to this discussion in NeuroNuggets. But how do we compute the gradient if we have the neural network as model? That’s where computational graphs help…

    Computational Graphs: How

    To compute the gradient, deep learning frameworks use an algorithm called backpropagation (bprop); it basically amounts to using the chain rule sequentially across the computational graph. Let us walk through an application of backpropagation to our previous example. We begin by computing partial derivatives of every node of the graph with respect to each of its inputs; we assume that it is easy to do, and neural networks do indeed consist of simple units for which it is easy. Like in our example:

    Now we need to combine these derivatives with the chain rule. In backpropagation, we do it sequentially from the graph’s output, where the objective function is computed. There we always have

        \[\frac{\partial f}{\partial f} = 1.\]

    Next, for example, we can get

        \[\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\frac{\partial a}{\partial x}= z\cdot 1 = z,\]

    since we already know both factors in this formula. Backpropagation means that we go through the graph from right to left, computing partial derivatives of f with respect to every node, including the weights that we are interested in. Here is the final result for our example:

    This very simple procedure allows us to set up training for any deep neural network. This is exactly what any deep learning framework is supposed to do; they are in reality automatic differentiation libraries more than anything else. The main function of any framework is to compute and take derivatives of huge compositions of functions. Note, by the way, that to compute the function itself you also need to traverse the computational graph, but this time from left to right, from variables to the outputs; this process is called forward propagation (fprop).
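
    The same computation can be written out by hand in a few lines of Python for our example F(x, y, z) = (x + y)z; this is exactly what a framework automates for graphs with millions of nodes:

    def fprop(x, y, z):
        a = x + y            # intermediate node of the graph
        f = a * z            # output node
        return a, f

    def bprop(x, y, z, a):
        df_df = 1.0              # derivative of the output with respect to itself
        df_da = z * df_df        # f = a * z  =>  df/da = z
        df_dz = a * df_df        # f = a * z  =>  df/dz = a = x + y
        df_dx = 1.0 * df_da      # a = x + y  =>  da/dx = 1, chain rule gives df/dx = z
        df_dy = 1.0 * df_da      # a = x + y  =>  da/dy = 1, chain rule gives df/dy = z
        return df_dx, df_dy, df_dz

    a, f = fprop(1.0, 2.0, 4.0)      # f = (1 + 2) * 4 = 12
    print(bprop(1.0, 2.0, 4.0, a))   # (4.0, 4.0, 3.0)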

    Parallelizing the Computations

    Once you have the basic functionality of fprop and bprop in your library, you want to make it as efficient as possible. Efficiency gains mostly come from parallelization: note that operations in one part of the graph are completely independent from what happens in other parts. This means, for instance, that if you have a layer in your neural network, i.e., a set of nodes that do not feed into each other but all receive inputs from previous layers, you can compute them all in parallel during both forward propagation and backpropagation.

    This is exactly the insight that to a large extent fueled the deep learning revolution: this kind of parallelization can be done across hundreds or even thousands of computational cores. What kind of hardware has thousands of cores? Why, the GPUs, of course! In 2009–2010, it turned out that regular off-the-shelf GPUs designed for gamers can provide a 10x-50x speedup in training neural networks. This was the final push for many deep learning models and applications into the territory of what is actually computationally feasible. We will stop here for the moment but hopefully will discuss parallelization in deep learning in much greater detail at some future post.

    There is one more interesting complication. Deep learning frameworks come with two different forms of computational graphs, static and dynamic. Let us find out what this means.

    Static and Dynamic Computational Graphs

    The main idea of a static computational graph is to separate the process of building the graph and executing backpropagation and forward propagation (i.e., computing the function). Your graph is immutable, i.e., you can’t add or remove nodes at runtime.

    In a dynamic graph, though, you can change the structure of the graph at runtime: you can add or remove nodes, dynamically changing its structure.

    Both approaches have their advantages and disadvantages. For static graphs:

    • you can build a graph once and reuse it again and again;
    • the framework can optimize the graph before it is executed;
    • once a computational graph is built, it can be serialized and executed without the code that built the graph.

    For dynamic graphs:

    • each forward pass basically defines a new graph;
    • debugging is easier;
    • constructing conditionals and loops is easy, which makes building recurrent neural networks much easier than with static graphs.

    We will see live examples of code that makes use of dynamic computational graphs in the next installment, where we will consider several deep learning frameworks in detail. And now let us finish with an overview.

    Deep Learning Frameworks: An Overview

    On March 10, Andrej Karpathy (Director of AI at Tesla) published a tweet with very interesting statistics about machine learning trends. Here is the graph of unique mentions of deep learning frameworks over the last four years:

    Unique mentions of deep learning frameworks in arXiv papers (full text) over time, based on 43K ML papers over last 6 years. Source: https://twitter.com/karpathy/status/972295865187512320

    The graph shows that the top 4 general-purpose deep learning frameworks right now are TensorFlow, Caffe, Keras, and PyTorch, while, for example, historically the first widely used framework, Theano, has basically lost traction.

    The frameworks have interesting relations between them, and it is worthwhile to consider them all, get a feeling of what the code looks like for each, and discuss their pros and cons. This post, however, is already growing long; we will come back to this discussion in the second part.

    Oktai Tatanov
    Junior Researcher, Neuromation

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • NeuroNuggets: Understanding Human Poses in Real-Time

    NeuroNuggets: Understanding Human Poses in Real-Time

    This week, we continue the NeuroNuggets series with the fifth installment on another important computer vision problem: pose estimation. We have already talked about segmentation; applied to humans, segmentation would mean to draw silhouettes around pictures of people. But what about the skeleton? We need pose estimation, in particular, to understand what a person is doing: running, standing, reading NeuroNuggets?

    Today, we present a pose estimation model based on the so-called Part Affinity Fields (PAF), a model from this paper that we have uploaded on the NeuroPlatform as a demo. And presenting this model today is Arseny Poezzhaev, our data scientist and computer vision aficionado who has moved from Kazan to St. Petersburg to join Neuromation! We are excited to see Arseny join and welcome him to the team (actually, he joined from the start, more than a month ago, but the NeuroNuggets duty caught up only now). Welcome:

    Introduction

    Pose estimation is one of the long-standing problems of computer vision. It has interested researchers over the last several decades because not only is pose estimation an important class of problems itself, but it also serves as a preprocessing step for many even more interesting problems. If we know the pose of a human, we can further train machine learning models to automatically infer relative positions of the limbs and generate a pose model that can be used to perform smart surveillance with abnormal behaviour detection, analyze pathologies in medical practices, control 3D model motion in realistic animations, and a lot more.

    Moreover, not only humans can have limbs or a pose! Basically, pose estimation can deal with any composition of rigidly moving parts connected to each other at certain joints, and the problem is to recover a representative layout of body parts from image features. We at Neuromation, for example, have been doing pose estimation for synthetic images of pigs (cf. our Piglet’s Big Brother project):

    Traditionally, pose estimation used to be done by retrieving motion patterns from optical markers attached to the limbs. Naturally, pose estimation would work much better if we could afford to put special markers on every human in the picture; alas, our problem is a bit harder. The next point of distinction between different approaches is the hardware one can use: can we use multiple cameras? 3D cameras that estimate depth? Infrared? Kinect? Is there a video stream available or only still images? Again, each additional source of data can only make the problem easier, but in this post we concentrate on a single standard monocular camera. Basically, we want to be able to recognize poses on any old photo.

    Top-Down and Bottom-Up

    Pose estimation from a single image is a very under-constrained problem precisely due to the lack of hints from other channels, different viewpoints from multiple cameras, or motion patterns from video. The same pose can produce quite different appearances from different viewpoints and, even worse, human body has many degrees of freedom, which means that the solution space has high dimension (always a bad thing, trust me). Occlusions are another big problem: partially occluded limbs cannot be reliably recognized, and it’s hard to teach a model to realize that a hand is simply nowhere to be seen. Nevertheless, single person pose estimation methods show quite good results nowadays.

    When you move from a single person to multiple people, pose estimation becomes even harder: humans occlude and interact with other humans. In this case, it is common practice to use a so-called top-down approach: apply a separately trained human detector (based on object detection techniques such as the ones we discussed before), find each person, and then run pose estimation on every detection. It sounds reasonable but actually the difficulties are almost insurmountable: if the detector fails to detect a person, or if limbs from several people appear in a single bounding box (which is almost guaranteed to happen in case of close interactions or crowded scenes), the whole algorithm will fail. Moreover, the computation time needed for this approach grows linearly with the number of people on the image, and that can be a big problem for real-time analysis of groups of people.

    In contrast, bottom-up approaches recognize human poses from pixel-level image evidence directly. They can solve both problems above: when you have information from the entire picture you can distinguish between the people, and you can also decouple the runtime from the number of people on the frame… at least theoretically. However, you are still supposed to be able to analyze a crowded scene with lots of people, assigning body parts to different people, and even this task by itself could be NP-hard in the worst case.

    Still, it can work; let us show which pose estimation model we chose for the Neuromation platform.

    Part Affinity Fields

    In the demo, we use the method based on the “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields” paper by researchers from The Robotics Institute at Carnegie Mellon University (Cao et al., 2017). Here it is in live action:

    https://www.youtube.com/watch?v=pW6nZXeWlGM

    It is a bottom-up approach, and it uses the so-called Part Affinity Fields (PAFs) together with estimation of body-part confidence maps. PAFs are the main new idea we introduce today, so let us discuss them in a bit more detail. A PAF is a set of 2D vector fields that encode the location and orientation of the limbs. Vector fields? Sounds mathy… but wait, it’s not that bad.

    Suppose you have already detected all body parts (hands, elbows, feet, ankles, etc.); how do you now generate poses from them? First, you must find out how to connect two points to form a limb. For each body part, there are several candidates to form a limb: there are multiple people on the image, and there also can be lots of false positives. We need some confidence measure for the association between body part detections. Cao et al. propose a novel feature representation called Part Affinity Fields that contains information about location as well as orientation across the region of support of the limb.

    In essence, a PAF is a set of vectors that encodes the direction from one part of the limb to the other; each limb is considered as an affinity field between body parts. Here is a forearm:

    Figure 1. Affinity field visualization for the right forearm. The color encodes the limb’s orientation.

    If a point lies on the limb, then its value in the PAF is a unit vector pointing from the starting joint of this limb to its ending joint; the value is zero if the point is outside the limb. Thus, a PAF is a vector field that contains information about one specific limb for all the people on the image, and the entire set of PAFs encodes all the limbs for every person. So how do PAFs help us with pose estimation?
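
    As a toy illustration of this definition, here is how one could construct the ground-truth PAF for a single limb in numpy, given the two joint locations (the limb width below is an illustrative parameter):

    import numpy as np

    def limb_paf(p_start, p_end, height, width, limb_width=4.0):
        """2D vector field: the unit limb direction for pixels near the segment, zero elsewhere."""
        paf = np.zeros((height, width, 2))
        v = np.array(p_end, dtype=float) - np.array(p_start, dtype=float)
        length = np.linalg.norm(v) + 1e-9
        v /= length                                    # unit vector along the limb
        ys, xs = np.mgrid[0:height, 0:width]
        d = np.stack([xs - p_start[0], ys - p_start[1]], axis=-1)   # offsets from the start joint
        along = d @ v                                  # projection onto the limb direction
        across = np.abs(d @ np.array([-v[1], v[0]]))   # distance from the limb axis
        on_limb = (along >= 0) & (along <= length) & (across <= limb_width)
        paf[on_limb] = v
        return paf

    field = limb_paf(p_start=(10, 20), p_end=(40, 20), height=64, width=64)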

    Method overview

    First, let us go through the overall pipeline of the algorithm.

    Figure 2. Overall pipeline. The method of (Cao et al., 2017) takes an input image (a) and simultaneously infers body part confidence maps (b) and PAF predictions (c). Then it parses body part candidates and runs a special bipartite matching algorithm to associate them (d); finally, it assembles the body parts into full body poses (e).

    Figure 2 above illustrates all the steps from an input image (Fig. 2a) to anatomical keypoints as the output (Fig. 2e). First, a feedforward neural network predicts a set of body part locations in the image (Fig. 2b) in the form of confidence maps and a set of PAFs that encode the degree of association between these body parts (Fig. 2c). Thus, the algorithm gets all the information necessary for further matching of limbs and people (all of this stuff does sound a bit bloodthirsty, doesn't it?). Next, the confidence maps and affinity fields are parsed together (Fig. 2d) to output the final positions of limbs for all the people in the picture.

    All of this sounds very reasonable: we now have a plan. But so far this is only a plan: we don’t know how to do any of these steps above. So now let us consider every step in detail.

    Architecture

    One of the core ideas of (Cao et al., 2017) is to simultaneously predict the detection confidence maps and the affinity fields. The method uses a special feedforward network as a feature extractor. The network looks like this:

    Figure 3. Architecture of the two-branch multistage CNN. Each stage in the top branch (beige) predicts a confidence map S, and each stage in the bottom branch (blue) predicts a PAF L. After every stage, predictions from both branches are concatenated with image features (which come from a VGG-based architecture) and used as input for the next stage. Each branch performs multiple inferences, one per body part.

    As you can see, it is split into two branches: the top branch predicts the detection confidence maps, and the bottom branch predicts the affinity fields. Both branches are organized as an iterative prediction architecture that refines the predictions over successive stages, with intermediate supervision at each stage guiding the refinement. Here is how it might look on a real image:

    Figure 4. Inference on a real image with the two-branch network.

    Before passing the input to this two-branch network, the method uses an auxiliary CNN (the first 10 layers of VGG-19) to extract an input feature map F. This feature map is processed by both branches, and their predictions, concatenated with the initial F, are used as input (features) for the next stage.

    This process is repeated on every stage, and you can see the refinement process across stages on Figure 4 above.
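
    To make the structure a bit more tangible, here is a minimal, purely illustrative PyTorch sketch of such a two-branch, multi-stage head; the stand-in backbone, layer sizes, and the numbers of parts, limbs, and stages are placeholders, not the exact configuration from (Cao et al., 2017):

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One refinement stage: maps the concatenated features to either
    confidence maps (S branch) or part affinity fields (L branch)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(128, out_channels, kernel_size=1),
        )

    def forward(self, x):
        return self.net(x)

class TwoBranchPoseNet(nn.Module):
    """Two branches (confidence maps and PAFs) refined over several stages;
    after each stage, predictions are concatenated with the image features F."""
    def __init__(self, feat_channels=128, n_parts=19, n_limbs=19, n_stages=3):
        super().__init__()
        # Stand-in for the truncated VGG-19 feature extractor.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
        )
        s_out, l_out = n_parts, 2 * n_limbs  # each PAF stores an (x, y) vector per limb
        self.s_stages, self.l_stages = nn.ModuleList(), nn.ModuleList()
        for t in range(n_stages):
            in_ch = feat_channels if t == 0 else feat_channels + s_out + l_out
            self.s_stages.append(Stage(in_ch, s_out))
            self.l_stages.append(Stage(in_ch, l_out))

    def forward(self, img):
        feats = self.backbone(img)
        x, outputs = feats, []
        for s_stage, l_stage in zip(self.s_stages, self.l_stages):
            S, L = s_stage(x), l_stage(x)
            outputs.append((S, L))               # intermediate supervision uses every stage
            x = torch.cat([feats, S, L], dim=1)  # predictions + image features for the next stage
        return outputs

# Quick shape check on a dummy image:
# outs = TwoBranchPoseNet()(torch.randn(1, 3, 368, 368))
```

    The real network uses deeper stage blocks and more stages, but the data flow, with predictions fed back alongside the image features, is the same.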

    From Body Parts to Limbs

    Take a look at Figure 5 below, which again illustrates the above-mentioned refinement process:

    Figure 5. Confidence maps of right wrist (first row) and PAFs of right forearm (second row) across stages. We can see that despite confusion on the first stage, the method can fix its mistakes on later stages.

    At the end of each stage, the corresponding loss function is applied for each branch to guide the network.

    Now consider the top branch; each confidence map is a 2D representation of our confidence that a given pixel belongs to a particular body part (recall that “body parts” here are “points” such as wrists and elbows, while, say, forearms are referred to as “limbs” rather than “body parts”). To get body part candidate regions, we aggregate the confidence maps for different people. After that, the algorithm performs non-maximum suppression to obtain a discrete set of part locations:

    Figure 6. How the algorithm forms limbs from detections.

    During inference, the algorithm computes line integrals over the PAFs along the line segments between pairs of detected body parts. If the candidate limb formed by connecting a certain pair of points is well aligned with the corresponding PAF, it is considered a true limb. This is exactly what the bottom branch does.
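
    As a sketch of this scoring step (not the authors' implementation), the association score for a candidate limb can be approximated by sampling the PAF along the segment between the two detections and averaging its projection onto the segment's direction:

```python
import numpy as np

def paf_association_score(paf, p1, p2, n_samples=10):
    """Approximate the line integral of one limb's PAF along the segment p1 -> p2.

    paf : array of shape (H, W, 2) holding the (x, y) PAF vectors for this limb type
    p1, p2 : (x, y) coordinates of two candidate body part detections
    Returns the average projection of the field onto the unit vector p1 -> p2;
    a high value means the candidate limb is well aligned with the PAF.
    """
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    direction = p2 - p1
    norm = np.linalg.norm(direction)
    if norm < 1e-8:
        return 0.0
    direction /= norm
    score = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = p1 + t * (p2 - p1)
        xi = min(max(int(round(x)), 0), paf.shape[1] - 1)  # nearest-neighbour lookup
        yi = min(max(int(round(y)), 0), paf.shape[0] - 1)
        score += float(np.dot(paf[yi, xi], direction))
    return score / n_samples
```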

    From Limbs to Full Body Models

    We now see how the algorithm can find limbs on the image between two points. But we still cannot estimate poses because we need the full body model! We need to somehow connect all these limbs into people. Formally speaking, the algorithm has found body part candidates and has scored pairs of these parts (integrating over PAFs), but the final goal is to find the optimal assignment for the set of all possible connections.

    Formally speaking, this problem can be viewed as a k-partite graph matching problem, where nodes of the graph are body part detections, and edges are all possible connections between them (possible limbs). Here k-partite matching means that the vertices can be partitioned into k groups of nodes with no connections inside each group (i.e., vertices corresponding to the same body part). Edges of the graph are weighted with part affinities. Like this:

    Figure 7. Simplifying the graph matching problem: (a) original image with part detections; (b) k-partite graph; (c) tree structure that implicitly encodes the human body model; (d) a set of bipartite graphs.

    A direct solution of this problem may be computationally infeasible (the general problem is NP-hard), so (Cao et al., 2017) propose a relaxation in which the initial k-partite graph is decomposed into a set of bipartite graphs (Fig. 7d), where the matching task is much easier to solve. The decomposition is based on the problem domain: we know how body parts can connect, and a hip cannot be connected to a foot directly, so we can first connect the hip to the knee and then the knee to the foot.
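
    Each bipartite subproblem is a maximum weight bipartite matching; as a small illustration (not necessarily the authors' exact procedure), here is how one could match, say, elbow candidates to wrist candidates with SciPy's Hungarian algorithm, given a matrix of PAF association scores:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_parts(score_matrix):
    """Bipartite matching between two sets of part candidates (e.g. elbows vs. wrists).

    score_matrix[i, j] is the PAF association score for connecting candidate i
    of the first part type with candidate j of the second; returns index pairs.
    """
    rows, cols = linear_sum_assignment(score_matrix, maximize=True)
    # Keep only connections whose score is positive, i.e. aligned with the PAF.
    return [(int(i), int(j)) for i, j in zip(rows, cols) if score_matrix[i, j] > 0]

# Example: two people, hence two elbow and two wrist candidates.
scores = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
print(match_parts(scores))  # [(0, 0), (1, 1)]
```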

    That’s all, folks! We have considered all the steps in the algorithm that can retrieve poses from a single raw image. Let us now see how it works on our platform.

    Pose Estimation Demo in Action

    There are, as always, a few simple steps to run this algorithm on our image of interest:

    1. Login at https://mvp.neuromation.io
    2. Go to “AI models”
    3. Click “Add more” and “Buy on market”:

    4. Select and buy the OPENPOSE 2D demo model:

    5. Launch it with “New Task” button:

    6. Choose the Estimate People On Image Demo:

    7. Try the demo! You can upload your own photo for pose estimation. We chose this image from the Mannequin Challenge:

    8. And here you go! One can see stick models representing poses of people on the image:

    And here is a picture of Neuromation leadership in Singapore:

    The results are, again, pretty good:

    Sergey Nikolenko
    Chief Research Officer, Neuromation

    Arseny Poezzhaev
    Researcher, Neuromation

  • Creating Molecules from Scratch I: Drug Discovery with Generative Adversarial Networks

    Creating Molecules from Scratch I: Drug Discovery with Generative Adversarial Networks

    We’ve got great news: the very first paper with official Neuromation affiliation has appeared! This work, “3D Molecular Representations Based on the Wave Transform for Convolutional Neural Networks”, has recently appeared in a top biomedical journal, Molecular Pharmaceutics. This paper describes the work done by our friends and partners Insilico Medicine in close collaboration with Neuromation. We are starting to work together with Insilico on this and other exciting projects in the biomedical domain to both significantly accelerate drug discovery and improve the outcomes of clinical trials; by the way, I thank CEO of Insilico Medicine Alex Zhavoronkov and CEO of Insilico Taiwan Artur Kadurin for important additions to this post. Collaborations between top AI companies are becoming more and more common in the healthcare space. But wait — the American Chemical Society’s Molecular Pharmaceutics?! Doesn’t sound like a machine learning journal at all, does it? Read on…

    Lead Molecules: Educating the Guess

    Getting a new drug to the market is a long and tedious process; it can take many years or even decades. There are all sorts of experiments, clinical studies, and clinical trials that you have to go through. And about 90% of all clinical trials in humans fail even after the molecules have been successfully tested in animals.

    But to a first approximation, the process is as follows:

    • the doctors study medical literature, in particular associations between drugs, diseases, and proteins published in other papers and clinical studies, and find out what the target for the drug should be, i.e., which protein it should bind with;
    • after that, they can formulate what kind of properties they want from the drug: how soluble it should be, which specific structures it should have to bind with this protein, should it treat this or that kind of cancer…
    • then they sit down and think about which molecules might have these properties; there is a lot to choose from at this stage: e.g., one standard database lists 72 million molecules, complete with their formulas, some properties, and everything; unfortunately, it doesn't always say whether a given molecule cures cancer, so that part we have to find out for ourselves;
    • then their ideas, called lead molecules, or leads, are actually sent to the lab for experimental validation;
    • if the lab says that the substance works, the whole clinical trial procedure can be initiated; it is still very long and tedious, and only a small percentage of drugs actually go all the way through the funnel and reach the market, but at least there is hope.
    Image source

    So where is the place of AI in this process? Naturally, we can't hope to replace the lab or, God forbid, clinical trials: we wouldn't want to sell a drug unless we are certain that it's safe and confident that it is effective in a large number of patients. This certainty can only come from actual live experiments. In the future, we may well be able to go from in silico (in a computer) studies to patients immediately with AI-driven drug discovery pipelines, but today we still need to do the experiments.

    Note, however, the initial stage of identifying the lead molecules. At this stage, we cannot be sure of anything, but live experiments in the lab are still very slow and expensive, so we would like to find lead molecules as accurately as we can. After all, even if the goal is to treat cancer, there is no hope of checking the endless variety of small molecules in the lab (“small” molecules are those that can easily get through a cell membrane, which means basically everything smaller than a nucleic acid). 72 million is just the size of one specific database; the total number of small molecules is estimated to be between 10⁶⁰ and 10²⁰⁰, and synthesizing and testing a single new molecule in the lab may cost thousands or tens of thousands of dollars. Obviously, the early guessing stage is really, really important.

    By now you can see how it might be beneficial to apply the latest AI techniques to drug discovery. We can use machine learning models to try and choose the molecules that are most likely to have the desired properties.

    But when you have 72 million of something, “choosing” ceases to look like classification and gets more into the “generation” part of the spectrum. We have to basically generate a molecule from scratch, and not just some molecule, but a promising candidate for a drug. With modern generative models, we can stop searching for a needle in a haystack and design perfect needles instead:

    How do we generate something from scratch? Deep learning does have a few answers when it comes to generative models; in this case, the answer turned out to be…

    Generative Adversarial Networks

    We have already briefly talked about generative adversarial networks (GANs) in a previous post, but I'm sure a reminder is in order here. GANs are a class of neural networks that aim to learn to generate objects from a certain class. Previously, GANs had mostly been used to generate images: human faces as in (Karras et al., 2017), photos of birds and flowers as in StackGAN, or, somewhat surprisingly, bedroom interiors, a very popular choice for GAN papers because bedrooms form a commonly used part of the standard LSUN scene understanding dataset. Generation in GANs is based on a very interesting and rather commonsense idea. They have two parts that are in competition with each other:

    • the objective of the generator is to generate new objects that are supposed to pass for “true” data points;
    • while the discriminator has to decipher the tricks played by the generator and distinguish between real data points and the ones produced by the generator.

    Here is how the general scheme looks:

    Image source

    In other words, the discriminator learns to spot the generator’s counterfeit images, while the generator learns to fool the discriminator. I refer to, e.g., this post for a simple and fun introduction to GANs.
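
    For readers who prefer code to diagrams, here is a minimal, purely illustrative GAN training step in PyTorch for vector-shaped data; the data dimension, network sizes, and hyperparameters are arbitrary placeholders rather than anything used in the works discussed here:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 166  # 166 is a nod to fingerprint-sized vectors (placeholder)

generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, data_dim), nn.Sigmoid())
discriminator = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(),
                              nn.Linear(256, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_training_step(real_batch):
    n = real_batch.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # 1) Discriminator: real samples should score 1, generated samples 0.
    fake = generator(torch.randn(n, latent_dim)).detach()
    d_loss = bce(discriminator(real_batch), ones) + bce(discriminator(fake), zeros)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Generator: try to make the discriminator say 1 on generated samples.
    fake = generator(torch.randn(n, latent_dim))
    g_loss = bce(discriminator(fake), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

# gan_training_step(torch.rand(32, data_dim))  # one step on a random "real" batch
```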

    We at Neuromation are following GAN research with great interest due to many possible exciting applications. For example, conditional GANs have been used for image transformations with the explicit purpose of enhancing images; see, e.g., image de-raining recently implemented with GANs in this work. This ties in perfectly with our own ideas of using synthetic data for computer vision: with a proper conditional GAN for image enhancement, we might be able to improve synthetic (3D-rendered) images and make them more like real photos, especially in small details. In the post I referred to, we saw how NVIDIA researchers introduced a nice way to learn GANs progressively, from small images to large ones.

    But wait. All of this so far makes a lot of sense for images. Maybe it also makes sense for some other relatively “continuous” kinds of data. But molecules? The atomic structure is totally not continuous, and GANs are notoriously hard to train for discrete structures. Still, GANs did prove to work for generating molecules as well. Let’s find out how.

    Adversarial Autoencoders

    Our recent paper on molecular representations is actually part of a long line of research done by our good friends and partners, Insilico Medicine. It began with Insilico's paper “The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology”, whose lead author Artur Kadurin is a world-class expert on deep learning, a member of Insilico Medicine's Pharma.AI team working on deep learning for molecular biology, recently appointed CEO of Insilico Taiwan… and my Ph.D. student.

    In this work, Kadurin et al. presented an architecture for generating lead molecules based on a variation of the GAN idea called Adversarial Autoencoders (AAE). In AAE, the idea is to learn to generate objects from their latent representations. Generally speaking, autoencoders are neural architectures that take an object as input… and try to return the same object as output. Doesn't sound too hard, but the catch is that in the middle of the architecture, the input must pass through a bottleneck layer that learns a latent representation, i.e., a set of features that succinctly encode the input in such a way that the subsequent layers can decode the object back:

    Image source

    Either the middle layer is simply smaller (has lower dimension) than input and output, or the autoencoder uses special regularization techniques, but in any case it’s impossible to simply copy the input through all layers, and the autoencoder has to extract the really important stuff.
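
    As a tiny illustration of the bottleneck idea (the sizes are arbitrary, and this is not the architecture from the paper), an autoencoder in PyTorch might look like this:

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """A toy autoencoder: the low-dimensional bottleneck forces the network
    to learn a compact latent representation of its input."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)           # latent code
        return self.decoder(z), z     # reconstruction, trained with e.g. an MSE loss
```

    In an adversarial autoencoder, a discriminator is additionally trained on the latent code z to push its distribution towards a chosen prior; that is exactly the extra ingredient discussed next.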

    So what did Kadurin et al. do? They took a conditional adversarial autoencoder and trained it to generate fingerprints of molecules, using desired properties as conditions. Here is the general model architecture from (Kadurin et al., 2017):

    Image source: (Kadurin et al., 2017)

    Looks just like the autoencoder above, but with two important additions in the middle:

    • on top, there is a discriminator that tries to distinguish the distribution of latent representations from some standard distribution, e.g., a Gaussian; this is the main idea of AAE: if you can make the distribution of latent codes indistinguishable from some standard distribution, it means that you can then sample from this distribution and generate reasonable samples through the decoder;
    • on the bottom, there is a condition that in this case encodes desired properties of the molecule; we train on the molecules with known properties, and the problem is then to generate molecules with desired (perhaps even never before seen) combinations of properties.

    There is still that nagging question about the representations, though. How do we generate discrete structures like molecules? We will discuss molecular representations in much greater detail in the next post; here let me simply mention that this work used a standard representation of a molecule as a MACCS fingerprint, a set of binary characteristics of the molecule such as “how many oxygens it has” or “does it have a ring of size 4”.

    Basically, the problem becomes to “translate” the condition, i.e., desired properties of a molecule, into more “low-level” properties of the molecular structure encoded into their MACCS fingerprints. Then a simple screening of the database can find molecules with the fingerprints most similar to generated ones.
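
    As an illustration of this screening step (assuming RDKit is available; the actual pipeline used in the paper may differ), one can compute MACCS fingerprints and rank database molecules by Tanimoto similarity to a generated fingerprint:

```python
# Requires RDKit (e.g. `pip install rdkit`); the molecules and the "generated"
# fingerprint below are toy stand-ins for illustration only.
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

smiles_db = ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1"]     # toy "database"
fps = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)) for s in smiles_db]

target = MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles("CCN"))   # stand-in for a generated fingerprint
ranked = sorted(zip(smiles_db, fps),
                key=lambda pair: DataStructs.TanimotoSimilarity(target, pair[1]),
                reverse=True)
print(ranked[0][0])  # the closest match in the toy database
```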

    At the time that was the first peer-reviewed paper showing that GANs can generate novel molecules. The submission was made in June 2016 and it was accepted in December 2016. In 2017 the community started to notice:

    It turned out that the resulting molecules do look interesting…

    Conclusion

    This post is getting a bit long; let’s take it one step at a time. We will get to our recent paper in the next installment, and now let us summarize what we’ve seen so far.

    Since the deep learning revolution, deep neural networks have been revolutionizing one field after another. In this post, we have seen how modern deep learning techniques are transforming molecular biology and drug discovery. Constructions such as adversarial autoencoders are designed to generate high-quality objects of various nature, and it’s not a huge wonder that such approaches work for molecular biology as well. I have no doubt that in the future, these or similar techniques will bring us closer to truly personalized medicine.

    So what next? Insilico has already generated several very promising candidates for really useful drugs. Right now they are undergoing experimental validation in the lab. Who knows, perhaps in the next few years we will see new drugs identified by deep learning models. Fingers crossed.

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • NeuroNuggets: Style Transfer

    NeuroNuggets: Style Transfer

    In the fourth installment of the NeuroNuggets series, we continue our study of basic problems in computer vision. We remind that in the NeuroNuggets series, we discuss the demos available on the recently released NeuroPlatform, concentrating not so much on the demos themselves but rather on the ideas behind each deep learning model.

    In the first installment, we talked about the age and gender detection model, which is basically image classification. In the second, we presented object detection, a more complex computer vision problem where you also have to find where an object is located. In the third, we continued this with segmentation and Mask R-CNN model. Today, we turn to something even more exciting. We will consider a popular and very beautiful application of deep learning: style transfer. We will see how to make a model that can draw your portrait in the style of Monet — or Yves Klein if you’re not careful with training! I am also very happy to present the co-author of this post, one of our lead researchers Kyryl Truskovskyi:

    Style Transfer: the Problem

    Imagine that you could create a true artwork by yourself, turning your own photo or a familiar landscape into a painting done the way Van Gogh or Picasso would do it. Sounds like a pipe dream? With the help of deep neural networks, this dream has now become reality. Neural style transfer has become a trending topic both in academic literature and in industrial applications; we all know and use popular mobile apps for style transfer and image enhancement. Starting from 2014, these style transfer apps have become a huge PR point for deep neural networks, and today almost all smartphone users have tried some style transfer app for photos and/or video. By now, all of these methods work in real time and run on mobile devices, and anyone can create artwork with a simple app, stylizing their own images and videos… but how do these apps work their magic?

    From the deep learning perspective, ideas for style transfer stem from attempts to interpret the features that a deep neural network learns and understand how exactly it works. Recall that a convolutional neural network for image processing gradually learns more and more convoluted features (see, e.g., our first post in the NeuroNuggets series), starting from basic local filters and getting all the way to semantic features like “dog” or “house”… or, quite possibly, “Monet style”! The basic idea of style transfer is to try to disentangle these features, pull apart semantic features of “what is on the picture” from “how the picture looks”. Once you do it, you can try to replace the style while leaving the contents in place, and the style can come from a completely different painting.

    For the reader interested in a more detailed view, we recommend “Neural Style Transfer: A Review”, a detailed survey of neural style transfer methods. E.g., here is an example from that paper where a photo of the Great Wall of China has been redone in classical Chinese painting style:

    And here is an example from DeepArt, one of the first style transfer services:

    You can even do it with videos:

    But let’s see how it works!

    Neural Style Transfer by Optimizing the Image

    There are two main approaches to neural style transfer: optimization and feedforward network. Let us begin with optimization.

    In this approach, we optimize not the network but the resulting image. This is a very simple but powerful idea: the image itself is also just a matrix (tensor) of numbers, just like the weights of the network. This means that we can take derivatives with respect to these numbers, extending backpropagation to them too and optimizing an image for a network rather than the other way around; we have already discussed this idea in “Can a Neural Network Read Your Mind?”.

    For style transfer, it works as follows: you have a content image, a style image, and a trained neural network (trained on ImageNet, for example). You create a new generated image, initializing it completely at random. Then the content image and the style image are passed through the early and intermediate layers of the network to compute two types of loss functions: the style loss and the content loss (see below). Next, we minimize these losses by changing the generated image, and after a few iterations we have a beautiful stylized image. The structure is a bit intimidating:

    But do not worry, the losses are not hard to understand. After we pass an image through the network, we get feature maps from its intermediate layers; these feature maps capture the semantic representation of the image. We definitely want the new image to be similar to the content image, so for content image feature maps C and generated image feature maps P we get the following content loss:

    The style loss is slightly different: for the style loss, we compute Gram matrices of the intermediate representations of the generated image and the style image:

    The style loss is the Euclidean distance between Gram matrices

    By directly minimizing the sum of these losses by gradient descent on our generated image, we make it more and more similar to the style image in terms of style while still keeping the content due to the content loss. We refer to the original paper, “A Neural Algorithm of Artistic Style”, for details. This approach works great, but its main disadvantage is that it takes a lot of computational effort.
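
    Here is a small PyTorch sketch of the two losses as described above; the choice of layers and the relative weighting of the terms, which matter a lot in practice, are omitted:

```python
import torch

def gram_matrix(features):
    """Gram matrix of a feature map of shape (channels, height, width)."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.t() / (c * h * w)

def content_loss(gen_features, content_features):
    # Mean squared difference between feature maps of the generated
    # image and the content image at some chosen layer.
    return torch.mean((gen_features - content_features) ** 2)

def style_loss(gen_features, style_features):
    # Squared distance between the Gram matrices of the two feature maps.
    return torch.sum((gram_matrix(gen_features) - gram_matrix(style_features)) ** 2)
```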

    Neural Style Transfer Via Image Transformation Networks

    The basic idea of the next approach is to use feedforward networks for image transformation tasks. Basically, we want to create an image transformation network that would directly produce beautiful stylized images, with no complicated “image training” process.

    The algorithm is simple. Suppose that, again, you have a content image and a style image. You feed the content image through the image transformation network and get a new generated image. After that, loss networks are used to compute the style and content losses, as in the optimization method above, but this time we optimize not the image itself but the image transformation network. This way, we get a trained image transformation network for style transfer and can then stylize images with a single forward pass, without any optimization at all.

    Here is how the entire process looks with the image transformation network and the loss network:

    The original paper on this approach, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, led to the first truly popular and the first real-time implementation of style transfer and superresolution algorithms; you can find a sample code for this approach here. The image transformation network works fast, but its main disadvantage is that we need to train a completely separate new network for every style image. This is not a problem for preset instagram-like filters but does not solve the general problem.

    To fix this, the authors of “A Learned Representation for Artistic Style” introduced an approach called conditional instance normalization. The basic idea is to train one network for several different styles. It turns out that normalization plays a huge role in how style networks model a style, and it is sufficient to specialize the scaling and shifting parameters applied after normalization to each specific style. In simpler words, it is enough to tune the parameters of a simple scale-and-shift transformation after normalization separately for each style, while the rest of the image transformation network is shared across styles.
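
    A minimal PyTorch sketch of conditional instance normalization could look as follows; the per-style scale and shift are the only style-specific parameters, everything else in the network is shared (sizes and initialization are illustrative):

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm2d(nn.Module):
    """Instance normalization with a separate (gamma, beta) pair per style."""
    def __init__(self, num_features, n_styles):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.gamma = nn.Embedding(n_styles, num_features)
        self.beta = nn.Embedding(n_styles, num_features)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, style_id):
        # style_id: LongTensor of shape (batch,) selecting the style for each sample
        out = self.norm(x)
        g = self.gamma(style_id).unsqueeze(-1).unsqueeze(-1)
        b = self.beta(style_id).unsqueeze(-1).unsqueeze(-1)
        return g * out + b

# cin = ConditionalInstanceNorm2d(num_features=64, n_styles=8)
# y = cin(torch.randn(2, 64, 32, 32), torch.tensor([0, 3]))
```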

    Here is a picture of how it works; for the mathematical details, we refer to the original paper:

    You can find an example code for this model here.

    In the platform demo, we use the Python framework PyTorch for training the neural networks, Flask + Gunicorn to serve them on our platform, and Docker + Mesos for deployment and scaling.

    Try it out on the platform

    Now that we know about style transfer, it is even better to see the result for ourselves. We follow the usual steps to get the model working.

    1. Login at https://mvp.neuromation.io
    2. Go to “AI Models”:

    3. Click “Add more” and “Buy on market”:

    4. Select and buy the Image Style Transfer demo model:

    5. Launch it with the “New Task” button:

    6. Try the demo! You can upload your own photo for style transfer. We chose this image:

    Neuromation Team in Singapore

    7. And here you go!

    Stylized and enhanced Neuromation Team in Singapore

    And here we are:

    Sergey Nikolenko
    Chief Research Officer, Neuromation

    Kyryl Truskovskyi
    Lead Researcher, Neuromation

  • Neuromation Team in Singapore

    Neuromation Team in Singapore

    This week we start our Asian Road Show, with the first stop in Singapore! Yesterday, the Neuromation team held an AI & Blockchain Meetup at the Carlton Hotel.

    We met AI enthusiasts, researchers, developers, and innovators, telling them more about what AI, machine learning, deep learning, and neural networks are, and how AI is being applied to improve our world, our businesses, and our lives.

    Our team members Dr. Sergey Nikolenko, Mr. Maksym Prasolov, Mr. Evan Katz, Mr. Arthur McCallum, and Mr. Daniel Liu talked about the Neuromation case and the future of AI.

    The Neuromation Road Show continues; come and meet us at the AI Expo at Tokyo Big Sight, April 4th–6th, at booth 4–7!

  • NeuroNuggets: Segmentation with Mask R-CNN

    NeuroNuggets: Segmentation with Mask R-CNN

    In the third post of the NeuroNuggets series, we continue our study of basic problems in computer vision. I remind the reader that in this series we discuss the demos available on the recently released NeuroPlatform, concentrating not so much on the demos themselves but rather on the ideas behind each deep learning model. This series is also a great chance to meet the new Neuromation deep learning team that has started working at our new office in St. Petersburg, Russia.

    In the first installment, we talked about the age and gender detection model, which is basically image classification. In the second, we presented object detection, a more complex computer vision problem where you also have to find where an object is located. Today, we continue with segmentation, the most detailed problem of this kind, and consider the latest addition to the family of R-CNN models, Mask R-CNN. I am also very happy to present the co-author of this post, Anastasia Gaydashenko:

    Segmentation: Problem Setting

    Segmentation is a logical next step of the object detection problem that we talked about in our previous post. It still stems from the same classical computer vision conundrum: even with great feature extraction, simple classification is not enough for computer vision, you also have to understand where to extract these features. Given a photo of your friends, a landscape scene, or basically any other image, can you automatically locate and separate all the objects in the picture? In object detection, we looked for bounding boxes, i.e., rectangles that enclose the objects. But what if we require more detail and label the exact silhouettes of the objects, excluding background? This problem is called segmentation. Generally speaking, we want to go from pictures in the first row to pictures in the second row on this picture:

    Formally speaking, we want to label each pixel of an image with a certain class (tree, road, sky, etc.) as shown in the image. The first question, of course, is why? What's wrong with regular object detection? Actually, segmentation is applied widely: in medical imaging, content-based image retrieval, and so on. It avoids the big problem of regular object detection: overlapping bounding boxes for different objects. If you see three heavily overlapping bounding boxes, are these three different hypotheses for the same object (in which case you should choose one) or three different objects that happen to occupy the same rectangle (in which case you should keep all three)? Regular object detection models can't really decide.

    And if the shape of the object is far from rectangular, segmentation provides much better information (this is very important for medical imaging). For instance, Google used semantic image segmentation to create the Portrait Mode in its signature Pixel 2 phone.

    These pictures also illustrate another important point: the different flavours of segmentation. What you see above is called semantic segmentation; it’s the simpler version, when we simply want to classify all pixels to categories such as “person”, “airplane”, or “background”. You can see in the picture above that all people are labeled as “person”, and the silhouettes blend together into a big “cuddle pile”, to borrow the expression used in certain circles.

    This leads to another, more detailed type of segmentation: instance segmentation. In this case, we would want to separate the people in the previous photo and label them as “person 1”, “person 2”, etc., as shown below:

    Here all people on the photo are marked in different colors, which mean different instances. Note also that they are labeled with probabilities that reflect the model’s confidence in a particular class label; the confidence is very high in this case, but generally speaking it is also a desirable property of any AI model to know when it is not sure.

    Convolutions and Convolutional Neural Networks

    So how can a computer vision system parse these images in such an accurate and humanlike way? Let's find out. The first thing we need to study is how convolution works and why we use it. And yes, we return to CNNs in each and every post — because they are really important and we keep finding new things to say about them. But before we get to the new things, let's briefly go through the old, concentrating on the convolutions themselves this time.

    Initially, the idea of convolution came from biology or, more specifically, from studies of the visual cortex. David Hubel and Torsten Wiesel studied the lower layers of the visual cortex in cats. Cat lovers, please don't watch this:

    https://www.youtube.com/watch?v=Yoo4GWiAx94

    When Hubel and Wiesel moved a bright line across a cat’s retina, they noticed an interesting effect: the activity of neurons in the brain changed depending on the orientation of the line, and some of the neurons fired only when the line was moving in a particular direction. In simple terms, that means that different regions of the brain react to different simple shapes. To model this behaviour, researchers had to invent detectors for simple shapes, elementary components of an image. That’s how convolutions appeared.

    Formally speaking, convolution is just a scalar product of two matrices taken as vectors: we multiply them componentwise and sum up the results. The first matrix is called the “kernel”; it represents a simple shape, like this one:

    The second matrix is just some patch from the picture where we want to find the pattern shown in the kernel. If the convolution result is large, we decide that the target shape is indeed present in this patch. Then we can simply run the convolution through all patches in the image and detect where the pattern occurs.
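
    Before we look at a picture example, here is a small NumPy sketch of exactly this procedure on a toy kernel and image (the names and sizes are ours, chosen purely for illustration):

```python
import numpy as np

def convolve_response(image, kernel):
    """Slide the kernel over the image and record the dot product at each
    position; high values mean the local patch looks like the kernel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)  # elementwise product, then sum
    return out

# A tiny curve-shaped kernel and an image containing that curve:
kernel = np.array([[0, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0]], dtype=float)
image = np.zeros((6, 6))
image[2:5, 1:4] = kernel  # plant the pattern somewhere in the image
print(convolve_response(image, kernel).argmax())  # peaks where the pattern sits
```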

    For example, let us try to find the filter above in a picture of a mouse. Here you can see the part where we would expect the filter to light up:

    And if we multiply the corresponding submatrix with the kernel and sum all the values, we indeed get a pretty big number:

    On the other hand, a random part of the picture does not produce anything at all, which means that it is totally different from the filter’s pattern:

    Filters like this let us detect some simple shapes and patterns. But how do we know which of these forms we want to detect? And how can we recognize a plane or an umbrella from these simple lines and curves?

    This is exactly where the training of the neural network comes in. The shapes defined by the kernels (the filters) are actually what the CNN learns from the training set. On the first layers of the network these are usually simple lines and gradients, but then, with each layer of the model, these shapes are combined with one another into more and more recognizable patterns; the result of applying a kernel across the image is called a (feature) map:

    The other operation necessary for CNNs is pooling. It is mostly used to reduce computational costs and suppress noise. The main idea is to cover a matrix with small submatrices and leave only one number in each, thus reducing the dimension; usually, the result is just a maximum or average of the values in the small submatrix. Here is a simple example:

    This kind of operation is also sometimes called downsampling, as it reduces the amount of information we store.
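
    A toy NumPy version of 2×2 max pooling makes this concrete:

```python
import numpy as np

def max_pool(image, size=2):
    """2x2 max pooling: keep only the largest value in each block."""
    h, w = image.shape
    h, w = h - h % size, w - w % size          # crop to a multiple of the block size
    blocks = image[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [5, 1, 2, 2],
              [0, 1, 3, 4]], dtype=float)
print(max_pool(x))
# [[4. 2.]
#  [5. 4.]]
```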

    The R-CNN models and Mask R-CNN

    Above, we have seen a simplified explanation of techniques used to create convolutional neural networks. We have not touched upon learning at all, but that part is pretty standard. With these simple tools, we can teach a computer model to recognize all sorts of shapes; moreover, CNNs operate on small patches so they are perfectly parallelizable and well suited for GPUs. We recommend our previous two posts for more details on CNNs, but let us now move on to the more high-level things, that is, segmentation.

    In this post, we do not discuss all modern segmentation models (there are quite a few) but go straight to the model used in the segmentation demo on the NeuroPlatform. This model is called Mask R-CNN, and it is based on the general architecture of R-CNN models that we discussed in the previous post about object detection; hence, a brief reminder is again in order.

    It all begins with the R-CNN model, where R stands for “region-based”. The pipeline is pretty simple: we take a picture and apply an external algorithm (called selective search) to it, searching for all kinds of objects. Selective search is a heuristic method that extracts regions based on connectivity, color gradients, and coherence of pixels. Next, we classify all extracted regions with some neural network:

    Due to the high number of proposals, R-CNN was extremely slow. In Fast R-CNN, the RoI (region of interest) projection layer was added to the neural network: instead of putting each region proposal through the whole network, Fast R-CNN passes the whole image through the network once, finds the neurons corresponding to a particular region in the network's feature map, and then applies the remaining part of the network to each such set of neurons. Like here:

    The next step was to invent the Region Proposal Network that could replace selective search; the Faster R-CNN model is now a complete end-to-end neural network.

    You can read about all these steps in more detail in our object detection post. But we're here to learn how Faster R-CNN can be converted to solve the segmentation problem! Actually, the change is extremely simple but nevertheless efficient and functional. The authors just added a parallel branch that predicts an object mask to the original Faster R-CNN model. Here is what the resulting Mask R-CNN model looks like:

    The top branch in the picture predicts the class of some region and the bottom branch tries to label each pixel of the region to construct a binary mask (i.e., object vs. background). It only remains to understand where this binary mask comes from.

    Fully Convolutional Networks

    Let us take a closer look at the segmentation part. It is based on a popular architecture called Fully Convolutional Network (FCN):

    The FCN model can be used for both image detection and segmentation. The idea is pretty straightforward, and the network is actually even simpler than usual, but it’s still a deep and interesting idea.

    In standard deep CNNs for image classification, the last layer is usually a vector whose size equals the number of classes; it shows the “scores” of the different classes, which can then be normalized to give class probabilities. This is what happens in the “class box” in the picture above.

    But what if we stop at some intermediate layer of the CNN and, instead of producing vectors, do some more convolutions, so that the last convolutional layer has the same number of feature maps as there are classes? Then, after proper training, we get “class scores” at every pixel of the last layer, a kind of “heatmap” for every class! Here is how it works: regular classification on top and the fully convolutional approach on the bottom:

    For segmentation via this network, we will use the inverses of convolution and pooling. Meet… deconvolution and unpooling!

    In deconvolution, we basically do convolution but the matrix is transposed, and now the output is a window rather than a number. Here are two popular ways to do deconvolution (white squares are zero paddings), animated for your viewing convenience:

    To understand unpooling, recall the pooling concept that we discussed above. To do max-pooling, we take the maximum value from some submatrix. Now we also want to remember the coordinates of the cell from which we took it and then use them to “invert” max-pooling. We create a matrix with the same shape as the initial one and put the maxima into the corresponding cells, reconstructing the other cell values with approximations based on the known ones. Some information is lost, of course, but usually upsampling works pretty well:

    Through the use of deconvolution and unpooling, we can construct pixel-wise predictions for each class, that is, segmentation masks for the image!
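
    As a small PyTorch illustration (a toy round trip, not the exact upsampling path of FCN or Mask R-CNN), here are pooling with remembered indices, unpooling, and a transposed convolution side by side:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)  # a dummy single-channel "feature map"

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
deconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2)  # transposed convolution

pooled, indices = pool(x)            # 8x8 -> 4x4, remembering argmax positions
restored = unpool(pooled, indices)   # 4x4 -> 8x8, maxima back in place, zeros elsewhere
upsampled = deconv(pooled)           # learned upsampling back to 8x8
print(restored.shape, upsampled.shape)
```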

    Segmentation demo in action

    We have seen how Mask R-CNN does segmentation, but it is even better to see the result for yourself. We follow the usual steps to get the model working.

    1. Login at https://mvp.neuromation.io
    2. Go to “AI Models”:

    3. Click “Add more” and “Buy on market”:

    4. Select and buy the Object Segmentation demo model:

    5. Launch it with the “New Task” button:

    6. Try the demo! You can upload your own photo for segmentation. We chose this image:

    7. And here you go! The model shows bounding boxes as well, but now it gives much more than just bounding boxes:

    Sergey Nikolenko
    Chief Research Officer, Neuromation

    Anastasia Gaydashenko
    Junior Researcher, Neuromation

  • Neuromation Team at the Future of AI!

    Neuromation Team at the Future of AI!

    Neuromation’s global team gathered in Tel-Aviv at the Future of AI conference to present our concept of Knowledge Mining to the international AI market. Israel has taken a promising position in AI, creating nearly 700 AI jobs, of which only 300 were filled.

    This is an example of the growing demand for AI talent and products that Neuromation is filling through its distributed marketplace Platform, and its custom turn-key solutions provided by Neuromation Labs.

    Stay with us: more reports from the event, including the recording of the keynote speech by Maxim Prasolov, CEO, and Sergey Nikolenko, CRO, are coming soon.