Blog

  • NeuroNuggets: Object Detection

    NeuroNuggets: Object Detection

    This is the second post in our NeuroNuggets series, where we discuss the demos already available on the recently released NeuroPlatform. The series is not as much about the demos themselves as about the ideas behind each of these models. Along the way, we also meet the new Neuromation deep learning team hired at our new office in St. Petersburg, Russia.

    The first installment was devoted to the age and gender estimation model, the simplest neural architecture among our demos, but even there we had quite a few ideas to discuss. Today we take convolutional neural networks for image processing one step further: from straightforward image classification to object detection. It is also my pleasure to introduce Alexey Artamonov, one of our first hires in St. Petersburg, with whom we have co-authored this post:

    Object Detection: Not Quite Trivial

    Looking at a picture, you can recognize not only what objects are in it but also where they are located. A simple convolutional neural network (CNN) like the one we considered in the previous post, on the other hand, cannot do that: all it can do is estimate the probability that a given object is present somewhere in the image. For practical purposes, this is insufficient: real-world images contain lots of objects that can interact with each other in nontrivial ways, and we need to know the position and class of each object in order to extract semantic information from the scene. We have already seen this with face detection:

    The simplest approach, used before the advent of convolutional networks, consists of a sliding window and a classifier. If we need, say, to find human faces in a photo, we first train a classifier that estimates how likely it is that a given picture contains a face and then apply it to every possible bounding box (a rectangle in the photo where an object could appear), choosing the bounding boxes where this probability is highest.

    This approach actually would work pretty well… if anyone could afford to apply it. Detection with a sliding window needs to look through a huge number of different bounding boxes with different positions, aspect ratios, and scales. If we count all the options we would have to consider, we get about 10 million combinations for a 1-megapixel image, which means that the naive approach would need to run the classifier 10 million times to find the actual position of a face. Naturally, this would never do.
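
    To get a feel for these numbers, here is a quick back-of-the-envelope computation in Python; the image size, strides, scales, and aspect ratios below are our own illustrative assumptions, not the settings of any particular detector:

    ```python
    # Back-of-the-envelope estimate of how many windows a naive sliding-window
    # detector would have to classify. All constants here are illustrative.
    width, height = 1280, 800                    # roughly a 1-megapixel image
    scales = [32, 48, 64, 96, 128, 192, 256]     # window heights in pixels
    aspect_ratios = [0.5, 0.75, 1.0, 1.5, 2.0]   # width / height
    stride = 2                                   # shift the window by 2 pixels per step

    total = 0
    for s in scales:
        for ar in aspect_ratios:
            w, h = int(s * ar), s
            if w > width or h > height:
                continue
            positions_x = (width - w) // stride + 1
            positions_y = (height - h) // stride + 1
            total += positions_x * positions_y

    # Prints a count in the millions, the same order of magnitude as quoted above.
    print(f"windows to classify: {total:,}")
    ```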

    Classical Computer Vision: the Viola-Jones Algorithm

    Our next stop is an algorithm that embodies classical computer vision approaches to object detection. By “classical” here we mean computer vision as it was before the deep learning revolution turned every kind of image processing into a different flavour of CNN. In 2001, Paul Viola and Michael Jones proposed an algorithm for real-time face detection. It employs three basic ideas:

    • Haar feature selection;
    • boosting algorithm;
    • cascade classifier.

    Before describing these stages, let us make clear what we actually want the algorithm to achieve. A good object detection algorithm has to be fast and have a very low false positive rate. We have 10 million possible bounding boxes and only a handful of faces in the photo, so we cannot afford a false positive rate much higher than 10⁻⁶ unless we want to be overwhelmed by incorrect bounding boxes. With this in mind, let us jump into the algorithm.

    The first part is the Haar transform; it is best to begin with a picture:

    We overlay different filters on the image. The activation of a Haar filter is the sum of the pixel values under the white parts of the rectangle minus the sum of the values under the black parts.

    The main property of these filters is that they can be computed across the entire image very quickly. Let us consider the integral image I* of the original image I: the integral image is the image whose value at coordinate (x, y), denoted I*(x, y), is the total intensity over the whole rectangle that begins at the top left corner and ends at (x, y):

    Let us see a Haar filter overlaid on the image and its integral version in action:

    Haar features for this filter can be computed in just a few operations. E.g., the horizontal Haar filter activation shown on the Terminator’s face equals 2C - 2D + B - A + F - E, where the letters denote values of the integral image at the corresponding points. We won’t derive the formula here, but you can check it yourself; it’s a good exercise for understanding the transform.
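
    To make this concrete, here is a minimal numpy sketch of an integral image and a simple two-rectangle Haar feature computed from it in a constant number of lookups; the particular feature and window coordinates are illustrative and do not correspond to the exact filter in the figure:

    ```python
    import numpy as np

    def integral_image(img):
        """Integral image: ii[y, x] = sum of img[0:y+1, 0:x+1]."""
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, top, left, bottom, right):
        """Sum of intensities inside [top..bottom, left..right], four lookups."""
        total = ii[bottom, right]
        if top > 0:
            total -= ii[top - 1, right]
        if left > 0:
            total -= ii[bottom, left - 1]
        if top > 0 and left > 0:
            total += ii[top - 1, left - 1]
        return total

    def haar_two_rect_vertical(ii, top, left, height, width):
        """Two-rectangle Haar feature: top (white) half minus bottom (black) half."""
        mid = top + height // 2
        white = rect_sum(ii, top, left, mid - 1, left + width - 1)
        black = rect_sum(ii, mid, left, top + height - 1, left + width - 1)
        return white - black

    img = np.random.randint(0, 256, (64, 64)).astype(np.int64)
    ii = integral_image(img)
    print(haar_two_rect_vertical(ii, 8, 8, 24, 16))
    ```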

    To select the best Haar features for face recognition, the Viola-Jones algorithm uses the AdaBoost classification algorithm. Boosting models (in particular, AdaBoost) are machine learning models that combine and build upon simpler classifiers. AdaBoost can take weak features such as Haar features, whose individual decisions are only slightly better than tossing a coin, and learn a combination of these features such that the final decision rule is much stronger and more accurate. It would take a whole separate post to explain AdaBoost (perhaps we should write one some day), so we’ll just add a couple of links and leave it at that.

    The third and final idea is to combine the classifiers into a cascade. The thing is, even boosted classifiers still produce too many false positives. A simple two-feature classifier can achieve an almost 100% detection rate, but with a 50% false positive rate. Therefore, the Viola-Jones algorithm uses a cascade of gradually more complex classifiers, where a bounding box can be rejected at every step but has to pass all checks to get a positive answer:

    This approach leads to much better detection rates. Roughly speaking, if we have 10 stages in our cascade, each stage has a 0.3 false positive rate and a 0.01 false negative rate, and all stages are independent (this is a big assumption, of course, but in practice it still works pretty well), the resulting cascade classifier achieves a false positive rate of (0.3)^10 ≈ 6×10⁻⁶ and a detection rate of (0.99)^10 ≈ 0.9. Here is how a cascade works on a group photo:
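
    For completeness, here is the same back-of-the-envelope arithmetic spelled out in code, under the independence assumption mentioned above:

    ```python
    # Rough cascade arithmetic: 10 independent stages, each with a 0.3 false
    # positive rate and a 0.01 false negative rate (the numbers from the text).
    stages = 10
    fpr_per_stage = 0.3
    fnr_per_stage = 0.01

    cascade_fpr = fpr_per_stage ** stages               # about 6e-6
    cascade_detection = (1 - fnr_per_stage) ** stages   # about 0.90

    print(f"cascade false positive rate: {cascade_fpr:.1e}")
    print(f"cascade detection rate:      {cascade_detection:.2f}")
    ```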

    Further research in classical computer vision went into detecting objects of specific classes such as pedestrians, vehicles, traffic signs, faces, etc. For more complex detectors at the later stages of a cascade, we can use different kinds of features and classifiers, such as histograms of oriented gradients (HOG) or support vector machines (SVM). Instead of computing Haar features on a grayscale image, we can use image channels in different color spaces (CIELab or HSV) and image gradients in different directions. All these features are computed in the integral representation, by summing inside a rectangle and comparing against an adjustable threshold.

    An important problem that appeared already in classical computer vision is that near a real object, our algorithms will find multiple intersecting bounding boxes: you can draw many different rectangles around a given face, and all of them will have rather high confidence. To choose the best one, classical computer vision usually employs the non-maximum suppression algorithm. However, this still remains an open and difficult problem, because there are many situations where two or more objects have intersecting bounding boxes, and a simple greedy implementation of non-maximum suppression would lose good bounding boxes. This part of the problem is relevant for all object detection algorithms and remains an active area of research.
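
    For reference, here is a minimal numpy sketch of the greedy non-maximum suppression discussed above; the IoU threshold of 0.5 is a common but arbitrary choice:

    ```python
    import numpy as np

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy non-maximum suppression.
        boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
        Returns the indices of the boxes that survive."""
        order = scores.argsort()[::-1]      # best boxes first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # Intersection of the chosen box with all remaining boxes
            xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
            yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
            xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
            yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
            inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
            iou = inter / (area_i + areas - inter)
            # Drop boxes that overlap the chosen one too much
            order = order[1:][iou <= iou_threshold]
        return keep
    ```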

    R-CNN

    Initially, in object detection tasks neural networks were treated as tools for extracting features (descriptors) at the late stages of a cascade. Neural networks by themselves had always been very good at image classification, i.e., predicting the class or type of an object. But for a long time, there was no mechanism for locating the object in the image with neural networks.

    With the deep learning revolution, it all changed rather quickly. By now, there are several competing approaches to object detection that are all based on deep neural networks: YOLO (“You Only Look Once”, a model initially optimized for speed), SSD (single-shot detectors), and so on. We may return to them in later installments, but in this post we concentrate on a single class of object detection approaches, the R-CNN (Region-Based CNN) line of models.

    The original R-CNN model, proposed in 2013, performs a three-step algorithm to do object detection (a rough sketch in code follows the list):

    • generate hypotheses for reasonable bounding boxes with an external region proposal algorithm;
    • warp each proposed region in the image and pass it through a CNN trained for image classification to extract features;
    • pass the resulting features to a separate SVM classification model that actually classifies the regions and chooses which of them contain meaningful objects.
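
    Putting the three steps together, here is a rough schematic of the pipeline; all the callables passed into the function (region proposals, warping, the feature CNN, per-class SVM scoring, NMS) are hypothetical placeholders for the components described above, not a real API:

    ```python
    # A schematic sketch of the original R-CNN pipeline, not a runnable detector:
    # propose_regions, warp, cnn_features, svm_scores, and nms are hypothetical
    # stand-ins for selective-search-style proposals, image warping, the
    # classification CNN, the per-class SVMs, and non-maximum suppression.
    def rcnn_detect(image, propose_regions, warp, cnn_features, svm_scores, nms):
        detections = []
        for box in propose_regions(image):             # ~2000 external region proposals
            crop = warp(image, box, size=(227, 227))   # warp each region to a fixed size
            features = cnn_features(crop)              # one CNN forward pass per region
            for cls, score in svm_scores(features):    # a separate SVM per object class
                detections.append((box, cls, score))
        return nms(detections)                         # prune overlapping boxes
    ```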

    Here is an illustration from the R-CNN paper:

    R-CNN brought the deep learning revolution to object detection. With R-CNN, mean average precision on the Pascal VOC (2010) dataset grew from 40% (the previous record) up to 53%, a huge improvement.

    But improved object detection quality is only part of the problem. R-CNN worked well but was hopelessly slow. The main problem was that you had to run the CNN separately for every bounding box; as a result, object detection with R-CNN took more than forty seconds on a modern GPU for a single image! Not quite real-time. It was also very hard to train because you had to juggle three components together, two of which (the CNN and the SVM) are machine learning models that have to be trained separately. And, among other things, R-CNN requires an external algorithm that can propose bounding boxes for further classification. In short, something had to be done.

    Fast R-CNN

    The main problem of R-CNN was speed, so when researchers from Microsoft Research rolled out an improvement, there was no doubt how to name the new model: Fast R-CNN was indeed much, much faster. The basic idea (we will see this common theme again below) was to put as much as possible directly into the neural network. In Fast R-CNN, the neural network is used for both classification and bounding box regression instead of a separate SVM:

    Image Source

    In order to make detection independent of the size of the object in the image, Fast R-CNN uses a spatial pyramid pooling (SPP) layer that had been introduced in SPPnet. The idea of SPP is brilliant: instead of cropping and warping the region to construct an input image for a separate run of the classification CNN, SPP crops the region of interest (RoI) projection at a deep convolutional layer, right before the fully connected layers.

    This means that we can reuse the lower layers of the CNN, running it only once instead of thousands of times as in basic R-CNN. But a problem arises: a fully connected layer has a fixed input size, while the size of our RoI can be anything. Solving this problem is exactly the task of the spatial pyramid pooling layer. The layer divides the window into 21 parts (a 4×4 grid, a 2×2 grid, and the whole window: 16 + 4 + 1 = 21), as shown in the figure below, and summarizes (pools) the values in each part. Thus, the size of the layer’s output no longer depends on the size of the input window:
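
    Here is a minimal numpy sketch of this pooling for a single feature-map channel (a real SPP layer does the same for every channel and every RoI); it assumes the region is at least 4×4:

    ```python
    import numpy as np

    def spp_pool(feature_map, levels=(4, 2, 1)):
        """Spatial pyramid pooling over one channel: split the (variable-sized)
        region into 4x4 + 2x2 + 1x1 = 21 bins and max-pool each bin, so the
        output length is fixed at 21 regardless of the input size."""
        h, w = feature_map.shape
        pooled = []
        for n in levels:
            # bin boundaries, rounded so the bins cover the whole region
            ys = np.linspace(0, h, n + 1).astype(int)
            xs = np.linspace(0, w, n + 1).astype(int)
            for i in range(n):
                for j in range(n):
                    bin_ = feature_map[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                    pooled.append(bin_.max())
        return np.array(pooled)

    # Two regions of different sizes produce outputs of the same length:
    print(spp_pool(np.random.rand(13, 21)).shape)   # (21,)
    print(spp_pool(np.random.rand(37, 18)).shape)   # (21,)
    ```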

    Fast R-CNN is 200 times faster than R-CNN when applied to a test image. But it is still not enough for actual real-time object detection because of the external algorithm for generating bounding box hypotheses. On a real photo, this algorithm can take about two seconds, so no matter how fast we make the neural network, we have at least a two-second overhead for every image. Can we get rid of this bottleneck?

    Faster R-CNN

    What could be faster than Fast R-CNN? The aptly named Faster R-CNN, of course! And what kind of improvement could we make to Fast R-CNN to get an even faster model? It turns out that we can get rid of the only external algorithm left in the model: extracting region proposals.

    The beauty of Faster R-CNN is that we can use the same neural network to extract region proposals. We only need to augment it with a few new layers, called the Region Proposal Network (RPN):

    The idea is that the first (closest to the input) layers of a CNN extract universal features that could be useful for everything, including region proposals. On the other hand, these first layers have not yet lost all the spatial information: their features are still rather “local” and correspond to relatively small patches of the image. Thus, the RPN reuses precomputed feature values from early layers to propose regions with objects for further classification. The RPN is a fully convolutional network with no fully connected layers, so the computational overhead is almost nonexistent (about 10 ms per image), and we can now completely remove the external region proposal algorithm!

    One more good idea is to use anchor boxes with different scales and aspect ratios instead of a spatial pyramid. At the time of its appearance, Faster R-CNN became the state-of-the-art object detection model, and while there has been a lot of new research in the last couple of years, it is still going strong. You can try Faster R-CNN on the NeuroPlatform.
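
    To illustrate, here is a small sketch of generating anchor boxes around a single feature-map location; the three scales and three aspect ratios mirror the usual 3×3 = 9 anchors, but the exact values are configuration choices rather than part of the method:

    ```python
    import numpy as np

    def make_anchors(center_x, center_y,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
        """Generate anchor boxes (x1, y1, x2, y2) centered at one location.
        Each anchor keeps an area of roughly scale*scale while its aspect
        ratio changes; the concrete scales/ratios here are illustrative."""
        anchors = []
        for s in scales:
            for r in ratios:
                w = s * np.sqrt(r)      # width grows with the ratio...
                h = s / np.sqrt(r)      # ...height shrinks, keeping area ~ s*s
                anchors.append([center_x - w / 2, center_y - h / 2,
                                center_x + w / 2, center_y + h / 2])
        return np.array(anchors)

    print(make_anchors(300, 200).round(1))   # 9 anchors around the point (300, 200)
    ```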

    Object Detection at the NeuroPlatform

    And finally we are ready to see Faster R-CNN in action! Here is a sequence of steps that will show you how this works on a pretrained model which is already available at the NeuroPlatform.

    1. Log in at https://mvp.neuromation.io
    2. Go to “AI Models”:

    3. Click “Add more” and “Buy on market”:

    4. Select and buy the object detection demo model:

    5. Launch it with the “New Task” button:

    6. Try the demo! You can upload your own photo for detection:

    7. And here you go!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

    Alexey Artamonov
    Senior Researcher, Neuromation

  • NeuroNuggets: Age and Gender Estimation

    NeuroNuggets: Age and Gender Estimation

    Today, we begin a new series of posts that we call NeuroNuggets. On February 15, right on time, we released the first version of the NeuroPlatform. So far it is still in alpha, and it will take a lot of time to implement everything that we have planned. But even now, there are quite a few cool things you can do. In the NeuroNuggets series, we will present these cool things one by one, explaining not only the technicalities of how to run something on the platform but also the main ideas behind every model. This is also my chance to present my new deep learning team hired at our new office in St. Petersburg, Russia.

    In this post, we present our first installment: the age and gender estimation model. This is the simplest neural architecture among our demos, but even this network will have quite a few tricks to explain. And it is my pleasure to introduce Rauf Kurbanov, one of our first hires in St. Petersburg, with whom we have co-authored this post:

    Rauf Kurbanov, image source

    Who hired a nerd?

    AI researchers tend to question the nature of the intuitive. As soon as you ask how a computer could do something that seems all too easy for humans, you see that what is “intuitively clear” to us can be very hard to formalize. Our visual perception of human age and gender is a good example of such a subtle quality.

    To us AI nerds, Eliezer Yudkowsky is familiar both as an AI safety researcher and as the author of the most popular Harry Potter fanfic ever (we heartily recommend “Harry Potter and the Methods of Rationality”, HPMoR for short, to everyone). And the Harry Potter series features a perfect example for this post, a fictional artifact that appears intuitively clear but is hard to reproduce in practice:

    Albus Dumbledore had placed an Age Line around the Goblet of Fire to prevent anyone under the age of seventeen from approaching it. Age Line magic was so advanced that even an Ageing Potion could not fool it. Even Yudkowsky did not really dig into the mechanics of the Age Line in his usual meticulous manner in HPMoR, but today we will give it a try; and while we are on the subject, we will give gender recognition a shot as well. As usual in computer vision, we begin with convolutional neural networks.

    Convolutional neural networks

    A neural network, as the name suggests, is a machine learning approach that is, in a very abstract way, modeled after how the brain processes information. It is a network of learning units called artificial neurons, or perceptrons. During training, the neurons learn how to convert input signals (say, the picture of a cat) into corresponding output signals (say, the label “cat”), learning automated recognition from real-life examples.

    Virtually all computer vision nowadays is based on convolutional neural networks. Very roughly speaking, CNNs are multilayer (deep) neural networks where each layer processes the image in small windows, extracting local features. Gradually, layer by layer, local features become global, able to draw their inputs from a larger and larger portion of the original image. Here is how it works in a very simple CNN (picture taken from this tutorial, which we recommend reading in full):

    In the end, after several (sometimes several hundred) layers we get global features that “look at” the whole original image, and they can now be combined in relatively simple ways to obtain class labels (recognize whether it is a dog, cat, boat, or Harry Potter).

    Technically, a convolutional neural network is a neural network with convolutional layers, and a convolutional layer is a transformation that applies a certain kernel (filter) to every point of the input (a “picture” with multiple channels in every pixel, i.e., a three-dimensional tensor) and generates a filtered output by sliding the kernel over the input.

    Let us consider a simple example of a filter: edge detection in images. In this case, the input is an image, and each pixel in the image is defined by three numbers: the intensities of red, green, and blue in the color of that pixel. We construct a special kernel that will be applied to every pixel in the image; the output is a new “image” that shows the result of applying this kernel. The kernel here is basically just a small matrix. Here’s how it works:

    The kernel slides over every pixel in the image, and the output value is large whenever there is an edge, an abrupt change of colors. In the figure above, after multiplying this simple matrix element-wise with every 3×3 window in the image, we get a very nice edge detection result.
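
    Here is a minimal numpy sketch of exactly this process: a hand-written 3×3 edge-detection kernel slid over a toy image with a vertical edge (the kernel in the figure may be a different one; this is just a common choice):

    ```python
    import numpy as np

    # A common 3x3 edge-detection kernel; its entries sum to zero,
    # so flat regions of the image produce zero output.
    kernel = np.array([[-1, -1, -1],
                       [-1,  8, -1],
                       [-1, -1, -1]], dtype=float)

    def convolve2d(image, kernel):
        """Slide the kernel over every valid window and sum element-wise products."""
        kh, kw = kernel.shape
        h, w = image.shape
        out = np.zeros((h - kh + 1, w - kw + 1))
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                window = image[y:y + kh, x:x + kw]
                out[y, x] = (window * kernel).sum()
        return out

    image = np.zeros((16, 16))
    image[:, 8:] = 1.0                   # a vertical edge down the middle
    edges = convolve2d(image, kernel)
    print(np.abs(edges).max(axis=0))     # nonzero responses appear only near the edge
    ```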

    Once you understand filters and kernels, it becomes quite simple to explain convolutional layers in neural networks. You can think of them as vanilla convolutions, as in the edge detection example above, but now we are learning convolutional kernels end-to-end when training the networks. That is, we do not have to invent these small matrices by hand anymore but can automatically learn matrices that extract the best features for a specific task.

    The model pipeline

    Age and gender estimation sounds like a traditional machine learning task: binary classification for the genders (it might stir up some controversy, but yeah, our models live in a binary world) and regression for the ages. But before we can begin to solve these problems, we need to find the faces in the photo! Classification will not work on the picture as a whole because it might, for example, contain several faces. Therefore, the age and gender estimation problem is usually broken down into two steps, face detection and age/gender estimation for the detected faces:

    In the model that you can find on the NeuroPlatform, these steps are performed independently and are not trained end-to-end, so let us discuss each of them in turn.

    Face detection

    Face detection is a classic problem in computer vision. It was solved quite successfully even before the deep learning revolution, in the early 2000s, by the so-called Viola-Jones algorithm. It was one of the most famous applications of Haar cascades as features; but those days are long gone…

    Today, face detection is not treated as a separate task that requires individual approaches; it is also solved by convolutional neural networks. To be honest, since the advent of deep learning it has long been clear that CNNs kick ass at object detection, so we would expect a modern solution to this old problem to be based on CNNs as well. And we are not wrong.

    But in real-world machine learning, you should also consider properties besides detection accuracy, such as simplicity and inference speed. If a simpler approach works well enough, it might not be worth introducing very complicated models to gain a few percentage points (remind me to tell you about the Netflix Prize results later). Therefore, in the NeuroPlatform demo we use a more classical approach to face detection while keeping CNNs for the core age/gender recognition task. But we can learn a lot from classical computer vision too.

    In short, the face detection model can be described as an SVM on top of a HOG + SIFT feature representation. HOG and SIFT are hand-crafted features, the result of years of experience in building image recognition systems: they capture gradient orientations in localized portions of an image through a series of deterministic image transformations. It turns out that this representation works quite well with kernel methods such as support vector machines (SVMs).
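
    As an illustration of the general recipe (not the exact model in the demo), here is a sketch that computes HOG descriptors with scikit-image and trains a linear SVM with scikit-learn; the random arrays are stand-ins for real cropped face and background patches:

    ```python
    import numpy as np
    from skimage.feature import hog
    from sklearn.svm import LinearSVC

    def hog_features(patches):
        """HOG descriptors for a batch of grayscale patches (here 64x64)."""
        return np.array([
            hog(p, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
            for p in patches
        ])

    # In a real detector these would come from a labeled dataset of crops;
    # random noise just keeps the sketch self-contained.
    face_patches = np.random.rand(20, 64, 64)
    background_patches = np.random.rand(20, 64, 64)

    X = np.vstack([hog_features(face_patches), hog_features(background_patches)])
    y = np.array([1] * len(face_patches) + [0] * len(background_patches))

    clf = LinearSVC(C=1.0).fit(X, y)                 # face vs. background classifier
    print(clf.predict(hog_features(np.random.rand(3, 64, 64))))
    ```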

    Data augmentation

    Here at Neuromation, we are big fans of using synthetic data for computer vision. Usually, this means that we generate sophisticated synthetic datasets with 3D computer graphics, and we are even looking towards using generative adversarial networks for synthetic data generation in the future. But let us not forget the most basic tool for enlarging a dataset: data augmentation.

    Since we have already extracted the faces in the previous step, it is enough to augment only the faces, not the whole image. In the demo, we use standard augmentation tricks such as horizontal/vertical shifts and mirroring, alongside a more sophisticated one: randomly erasing patches of the image.
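
    Here is a small numpy sketch of such augmentations on a cropped face; the shift range and erased-patch sizes are illustrative choices, not the demo’s actual parameters:

    ```python
    import numpy as np

    def augment_face(face, rng=np.random.default_rng()):
        """Simple augmentations for a cropped face (H x W x 3 array in [0, 1]):
        random horizontal mirroring, a small random shift, and random erasing."""
        h, w, _ = face.shape
        out = face.copy()

        # horizontal mirroring with probability 0.5
        if rng.random() < 0.5:
            out = out[:, ::-1, :]

        # small random shift (implemented here as a cyclic roll for brevity)
        dy, dx = rng.integers(-4, 5, size=2)
        out = np.roll(out, (int(dy), int(dx)), axis=(0, 1))

        # random erasing: blank out a random rectangle with a random value
        eh, ew = rng.integers(h // 8, h // 4), rng.integers(w // 8, w // 4)
        top, left = rng.integers(0, h - eh), rng.integers(0, w - ew)
        out[top:top + eh, left:left + ew, :] = rng.random()

        return out

    augmented = augment_face(np.random.rand(128, 128, 3))
    print(augmented.shape)
    ```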

    Age estimation

    To predict the age, we apply a deep convolutional neural network to the face image detected in the previous processing stage. The method in the demo uses the Wide Residual Network (WRN) architecture, which beat the Inception architecture on mainstream object detection datasets, converging about twice as fast on the same task. Before we explain what residual networks are, let us begin with a brief historical reference.

    The ImageNet challenge

    The ImageNet project is a large visual database designed for visual object recognition research. The deep learning revolution of the 2010s started in computer vision with a dramatic advance in solving the ImageNet challenge. Results on ImageNet were recognized not only within the AI community but across the entire industry, and ImageNet has become and still remains the most popular general-purpose computer vision dataset. Without getting into too much detail, let us just take a glance at a comparison plot of the winners from the first few years:

    On the plot, the horizontal axis shows how computationally intensive a model is, the circle size indicates the number of parameters, and the vertical axis shows image classification accuracy on ImageNet. As you can see, ResNet architectures show some of the best results while staying on the better side of efficiency as well. What is their secret sauce?

    Residual connections

    It is well known that deeper models perform better than shallower ones: they are more expressive. But optimization in deep models is a big problem: deeper models are harder to optimize due to the peculiarities of how gradients propagate from the top layers to the bottom layers (I hope one day we will explain it all in detail). Residual connections are an excellent solution to this problem: they add connections “around” the layers, so the gradient flow is able to “skip” excessive layers during backpropagation, resulting in much faster convergence and better training:

    Essentially, the only difference in Wide Residual Networks is that the original paper studied the tradeoff between the width and depth of the architecture more carefully, resulting in a more efficient architecture with better convergence speed.
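
    To show what a residual connection looks like in code, here is a minimal PyTorch sketch of a basic residual block; the layer sizes are arbitrary, and a real (wide) ResNet stacks many such blocks with downsampling in between:

    ```python
    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """A basic residual block: the output is F(x) + x, so gradients can flow
        through the identity shortcut even when the convolutional path is hard
        to train. Widening the layers (more channels) gives the 'wide' variant."""
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            )

        def forward(self, x):
            return self.body(x) + x   # the residual connection "around" the layers

    x = torch.randn(1, 64, 32, 32)
    print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
    ```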

    DEX on NeuroPlatform

    We have wrapped the DEX model into a Docker container and uploaded it to the NeuroPlatform. Let’s get started! First, go to the Neuromation login page: https://mvp.neuromation.io/

    On your dashboard on the NeuroPlatform, you can see how much NTK is left on your balance and spend it in three sections:

    • AI Models,
    • Datasets,
    • Generators.

    Today we are dealing with the Age and Gender estimator, an AI model available at the NeuroMarket. Let us purchase our first model! We enter AI models and buy the Age&Gender model:

    Then we request a new instance; it may take a while:

    And here we go! We can now try the model on a demo interface:

    It does tend to flatter me a bit.

    Sergey Nikolenko
    Chief Research Officer, Neuromation

    Rauf Kurbanov
    Senior Researcher, Neuromation

  • AI for Spatial Metabolomics I: The Datasets of Life

    AI for Spatial Metabolomics I: The Datasets of Life

    Image source

    Here at Neuromation, we are starting an exciting — and rather sophisticated! — joint project with the Spatial Metabolomics group of Dr. Theodore Alexandrov from the European Molecular Biology Laboratory. In this mini-series of posts, I will explain how we plan to use the latest achievements in deep learning and invent new models to process imaging mass-spectrometry data, extracting metabolic profiles of individual cells to analyze the molecular trajectories that cells with different phenotypes follow…

    Wait, I’ve surely lost you three times already. Let me start over.

    Omics: the datasets that make you

    Image source

    The picture above shows the central dogma of molecular biology, the key insight of twentieth-century biology into how life on Earth works. It shows how genetic information flows from the DNA to the proteins that actually do the work in the cells:

    • DNA stores genetic information and can replicate it;
    • in the process known as transcription, DNA copies out parts of its genetic code to messenger RNA (mRNA), also a nucleic acid;
    • and finally, translation is the process of making proteins, “reading” the genetic code for them from RNA strings and implementing the blueprint in practice.

    I’ve painted a very simplified picture, but this is truly the central, most important information flow of life. The central dogma, first stated by Francis Crick in 1958, says that genetic information flows only from nucleic acids (DNA and RNA) to proteins and never back — your proteins cannot go back and modify your DNA or RNA, or even modify other proteins; they are controlled only by the nucleic acids.

    Everybody knows that the genetic code, embodied in DNA, is very important. What is slightly less known is that each step along the central dogma pathway (a pathway is basically a sequence of common reactions that transform molecules into one another; for example, DNA -> RNA -> protein is a pathway, and a very important one!) corresponds to its own “dataset”, its own characterization of an organism, each important and interesting in its own way.

    Your set of genes, encoded in your DNA, is known as the genome. This is the main “dataset”, your primary blueprint: the genome is the stuff that says how you work in the most abstract way. As you probably know, the genome is a very long string of “letters” A, C, G, and T, which stand for the four nucleotides… don’t worry, we won’t go into too much detail about that stuff. The Human Genome Project successfully sequenced (“read out” letter by letter) a draft of the human genome in 2000 and a complete human genome in 2003, all three billion letters of it. Since then, sequencing methods have improved a lot; moreover, all human genomes are, of course, very similar, so once you have one it is much easier to get the others. Your genome determines what diseases you are susceptible to and defines many of your characteristic traits.

    The study of the human genome is far from over, but it is only the first part of the story. As we have seen above, genetic code from the DNA has to be read out into RNA. This is known as transcription, a complicated process whose details are irrelevant for our discussion right now: the point is, pieces of the genome are copied into RNA verbatim (formally speaking, T changes to U, a different nucleotide, but it’s still the exact same information):

    Image source

    Where cells differ is in which parts of the genome get transcribed.

    The set of RNA sequences (both coding RNA that will later be used to make proteins and non-coding RNA, that is, the rest of it) in a cell is called the transcriptome. The transcriptome provides much more specific information about individual cells and tissues: for example, a cell in your liver has the exact same genome as a neuron in your brain — but very different transcriptomes! By studying the transcriptome, biologists can “increase the resolution” and see which genes are expressed in different tissues and how. For example, modern personalized medicine screens transcriptomes to diagnose cancer.

    But this is still about the genetic code. The third dataset is even more detailed: it is the proteome, which consists of all proteins produced in a cell in the process known as translation, where RNA serves as a template, with every three letters (a codon) encoding one amino acid of a protein:

    Image source

    This is already much closer to the actual objective: the proteins that a cell makes determine its interactions with other cells, and the proteome says a lot about what the cell is doing, what its function in the organism is, what effect it has on other cells, and so on. And the proteome, unlike the genome, is malleable: many drugs work exactly by suppressing or speeding up the translation of specific proteins. Antibiotics, for instance, usually fight bacteria by attacking their RNA, suppressing protein synthesis completely and thus killing the cell.

    Genomics, transcriptomics, and proteomics are subfields of molecular biology that study the genome, transcriptome, and proteome. They are collectively known as the “omics”. The central dogma has been known for a long time, but only very recently have new tools appeared that actually let biologists peek into the transcriptome and the proteome.

    And this has led to the big data “omics revolution” in molecular biology: with these tools, instead of theorizing we can now actually look into your proteome and find out what’s happening in your cells — and maybe help you personally, not just develop a drug that should work on most humans but somehow fails for you.

    Metabolomics: beyond the dogma

    Image source

    Molecular biologists began to speak of “the omics revolution” in the context of genomics, transcriptomics, and proteomics, but the central dogma is still not the full picture. Translating proteins is only the beginning of the processes that occur in a cell; after that, these proteins actually interact with each other and other molecules in the cell. These reactions comprise the cell’s metabolism, and ultimately it is exactly the metabolism that we are interested in and that we might want to fix.

    Modern biology is highly interested in processes that go beyond the central dogma and involve the so-called small molecules: enzymes, lipids, glucose, ATP, and so on. These small molecules are either synthesized inside the cells — in this case they are called metabolites, that is, products of the cell’s metabolism — or come from outside. For instance, vitamins are typical small molecules that cells need but cannot synthesize themselves, and drugs are exogenous small molecules that we design to tinker with a cell’s metabolism.

    These synthesis processes are controlled by proteins and follow the so-called metabolic pathways, chains of reactions with a common biological function. The central dogma is one very important pathway, but in reality there are thousands. A recently developed model of human metabolism lists 5324 metabolites, 7785 reactions and 1675 associated genes, and this is definitely not the last version — modern estimates reach up to 19000 metabolites, so the pathways have not been all mapped out yet.

    The metabolic profile of an organism is not fully determined by its genome, transcriptome, or even proteome: the metabolome (set of metabolites) forms, in particular, under the influence of environment that provides, e.g., vitamins. Metabolomics, which studies the composition and interaction between metabolites in live organisms, lies at the intersection of biology, analytical chemistry, and bioinformatics, with growing applications to medicine (and that’s not the last of the omics, but metabolomics will suffice for us now).

    Knowing the metabolome, we can better characterize and diagnose various diseases: they all have to leave a trace in the metabolome, because if the metabolism has not changed, why is there a problem at all?.. By studying metabolic profiles of cells, biologists can discover new biomarkers for both diagnosis and therapy and find new targets for drugs. Metabolomics is the foundation for truly personalized medicine.

    The ultimate dataset

    Image source

    So far, I’ve been basically explaining recent progress in molecular biology and medicine. But what do we plan to do in this project? We are not biologists, we are data scientists, AI researchers; what is our part in this?

    Well, the metabolome is basically a huge dataset: every cell has its own metabolic profile (the set of molecules that appear in the cell). Differences in metabolic profiles determine different cell populations, changes in metabolic profiles over time correspond to patterns of cell development, and so on, and so forth. Moreover, in spatial metabolomics, the field we plan to collaborate on, this data comes in the form of special images: the results of imaging mass-spectrometry applied at very high resolution. This, again, requires some explanation.

    Mass-spectrometry is a tool that lets us find out the masses of everything contained in a sample. Apart from rare collisions, this is basically the same as finding out which specific molecules appear in the sample. For example, if you put a diamond in the mass-spectrometer you’ll see… no, not just a single carbon atom; you will probably see both 12C and 13C isotopes, and their composition will say a lot about the diamond’s properties.

    Imaging mass-spectrometry is basically a picture where every pixel is a spectrum. You take a section of some tissue, put it into a mass-spectrometer, and get a three-dimensional “data cube”: every pixel contains a list of molecules (metabolites) found in this part of the tissue. This process is shown in the picture above. I’d show some pictures here, but it would be misleading: the point is that it’s not a single picture, it’s a lot of parallel pictures, one for every metabolite. Something like this (picture taken from here):

    The quest for better imaging mass-spectrometry tools mostly aims to increase resolution, i.e., make the pixels smaller, and to increase sensitivity, i.e., detect smaller amounts of metabolites. By now, imaging mass-spectrometry has come a long way: the resolution is so high that individual pixels in this picture can map to individual cells! This high-def mass-spectrometry, which is becoming known as single-cell mass-spectrometry, opens up the door for metabolomics: you can now get the metabolic profiles of a lot of cells at once, complete with their spatial location in the tissue.
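
    Just to fix the data layout in code, here is a toy numpy “data cube” with made-up dimensions; this is only an illustration of the structure, not a real IMS file format:

    ```python
    import numpy as np

    # A toy imaging mass-spectrometry data cube: for every pixel of a tissue
    # section we store an intensity for each m/z channel (metabolite mass).
    # The sizes and the channel index below are made up for illustration.
    height, width, n_mz_channels = 200, 150, 3000
    cube = np.random.rand(height, width, n_mz_channels)

    # The spectrum of a single pixel (one cell, at high enough resolution):
    pixel_spectrum = cube[120, 37, :]        # shape (3000,)

    # The spatial image of a single metabolite (one m/z channel):
    metabolite_image = cube[:, :, 1517]      # shape (200, 150)

    print(pixel_spectrum.shape, metabolite_image.shape)
    ```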

    This is the ultimate dataset of life, the most in-depth account of actual tissues that exists right now. In the project, we plan to study this ultimate dataset. In the next installment of this mini-series, we will see how.

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • Neuromation and Longenesis: The human data economy

    Neuromation and Longenesis: The human data economy

    We have recently announced an important strategic partnership: Neuromation has joined forces with Longenesis, a startup that promises to develop a “decentralized ecosystem for exchange and utilization of human life data using the latest advances in blockchain and artificial intelligence”. Sounds like a winning entry for last year’s bullshit bingo, right? Well, in this case we actually do believe in Longenesis, understand very well what they are trying to do, and feel that together we have all the components needed for success. Let’s see why.

    A match made in heaven

    I will begin with the obvious: Longenesis is all about the data, and Neuromation is all about processing this data, training state-of-the-art AI models. This makes us ideal partners: Longenesis needs a computational framework to train AI models and highly qualified people to make use of this framework, and Neuromation needs interesting and useful datasets to make the platform more attractive to both customers and vendors.

    This is especially important for us because Longenesis will bring not just any data but data related to medicine and pharmaceutics. I have recently written here about AI in medicine, but this is an endless topic: we are at the brink of an AI revolution in medicine, both in terms of developing generic tools and in terms of personalized medicine. Longenesis will definitely be on the frontier of this revolution.

    I have personally collaborated with Longenesis CSO Alex Zhavoronkov and his colleagues at Longenesis’ parent company, Insilico Medicine, especially Arthur Kadurin (he is my Ph.D. student and my co-author on our recently published book, Deep Learning). Together, we have been working on frontier problems related to drug discovery: researchers at Insilico have been applying generative adversarial networks (GANs) to generate promising molecules with desired properties. Check out, e.g., our recent joint paper about the DruGAN model (a name that rings true to a Russian ear). Hence, I can personally vouch not only that Alex himself is one of the most energetic and highly qualified people I have ever met, but that his team at both Insilico and Longenesis is made of great people. And this is the best predictor of success of all.

    The NeuroPlatform problem

    The basic idea of Longenesis blends together perfectly with what is actually an open problem for the NeuroPlatform so far. On the recently released NeuroPlatform, AI researchers and practitioners will be able to construct AI models (especially deep neural networks that could be trained on the GPUs provided by mining pools), upload them in a standardized form into the NeuroMarket, our marketplace of AI models and datasets, and then rent out either the model architectures or already pretrained models.

    To train a neural network, you need a dataset. And if you want to use the GPUs on mining pools, you need to transfer the data to these GPUs. The transfer itself also presents a technical problem for very large datasets, but the main question is: how are you going to trust some unknown mining pool from Inner Mongolia with your sensitive and/or valuable data? It’s not a problem when you train on standard publicly available datasets such as ImageNet, the key dataset for modern computer vision, but to develop a customized solution you will still need to fine-tune on your own data.

    We at Neuromation have an interesting solution for this problem: we plan to use synthetic data to train neural networks, creating, e.g., generators of rendered photos based on 3D models. In this case, we solve two problems at once: synthetic data is very unlikely to be sensitive, and there is no transfer of huge files because you only transfer the generator, and the full dataset does not even have to be stored at any time. But still, you can’t really synthesize the MRI of a specific person or make up pictures of the faces of specific people you want to recognize. In many applications, you have to use real data, and it could be sensitive.

    Here is where Longenesis comes in. Longenesis is developing a solution for blockchain-based safe storage for the most sensitive data of all: personal medical records. If hospitals and individuals trust Longenesis’ solution with medical data, you can definitely trust this solution with any kind of sensitive or valuable data you might have. Their solution also has to be able to handle large datasets: CT or MRI scans are pretty hefty. Therefore, we are eagerly awaiting news from them on this front.

    But this is still only the beginning.

    The human data economy

    The ultimate goal of Longenesis is to create a marketplace of personal medical records, a token economy where you can safely store and sell access to your medical records to interested parties such as research hospitals, medical researchers, pharmaceutical companies and so on.

    Over his or her lifetime, a modern person accumulates a huge amount of medical records: X-rays, disease histories, CT scans, MRIs, you name it. All of this is very sensitive information that should rightly belong to a person — but also very useful information.

    Imagine, God forbid, that you are diagnosed with a rare disease. This is a very unfortunate turn of events, but it also means that your medical records suddenly increase in value: now doctors could get not just yet another MRI of some healthy average Joe but the MRI of a person who has developed or will later develop this disease. The human data economy that Longenesis plans to build will literally sweeten the pill in this case: yes, it’s a bad thing to get this disease, but it also means you can cash in on your now-interesting medical records. And let’s face it, in most cases people won’t mind sharing their MRIs or CT scans with medical researchers at all, especially for a price.

    But the price in this case is much less important than the huge possibilities that open up for the doctors to develop new drugs and better treat this rare disease. With the human data economy in place, people will actually be motivated to bring their records to medical researchers, not the other way around.

    This could be another point where we can collaborate; here at Neuromation, we are also building a token economy for useful computing, so this is a natural point where we could join forces as well. In any case, the possibilities are endless, and the journey to better medicine for all and for every one in particular is only beginning. Let’s see where this path takes us.

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • “There is no competition, only development of useful tools”: Sergey Nikolenko about AI in medicine

    “There is no competition, only development of useful tools”: Sergey Nikolenko about AI in medicine

    Image Source Futurama

    Where and how do people already use medical diagnostic systems based on AI algorithms?

    Medical diagnostics is a very interesting case from the point of view of AI.

    Indeed, medical tests and medical images are data where one is highly tempted to train some kind of machine learning model. One of the first truly successful applications of AI back in the 1970s was, actually, medical diagnostics: the rule-based expert system MYCIN aggregated the knowledge of real doctors and learned to diagnose from blood tests better than an average doctor.

    But even this direct application has its problems. Often a correct diagnosis must not only analyze the data from tests/images but also use some additional information which is hard to formalize. For example, the same lesion on the X-ray of the lungs of a senior chain smoker and a ten-year-old kid is likely to mean two very different things. Theoretically we could train models that make use of this additional information but the training datasets don’t have it, and the model cannot ask an X-ray how many packs it goes through every day.

    This is an intermediate case: you do have an image but it is often insufficient. True full-scale diagnostics is even harder: no neural network will be able to ask questions about your history that are most relevant for this case, process your answers to find out relevant facts, ask questions in such a way that the patient would answer truthfully and completely… This is a field where machine learning models are simply no match to biological neural networks in the doctors’ heads.

    But, naturally, a human doctor can still make use of a diagnostic system. Nobody can remember and easily extract from memory the detailed symptoms of all diseases in the world. An AI model can help propose possible diagnoses, evaluate their probabilities, show which symptoms fit a given diagnosis or not, and so on.

    Thus, despite the hard cases, I personally think that the medical community (including medical insurance companies, courts, and so on) is being overly cautious in this case. I believe that most doctors would improve their results by using automated diagnostic systems, even at the current stage of AI development. And we should compare them not with some abstract idealized notion of “almost perfect” diagnostic accuracy but with real results produced by live people; I think it would be a much more optimistic outlook.

    Is there a difference for a neural network between recognizing a human face in a photo and recognizing a tumor on a tomogram?

    Indeed, many problems in medicine look very similar to computer vision tasks, and there is a large and ever growing field of machine learning for medical imaging. The models in that field are often very similar to regular computer vision but sometimes differences do arise.

    A lot depends on the nature of the data. For instance, distinguishing between a melanoma and a birthmark given a photo of a skin area is exactly a computer vision problem, and we probably won’t have to develop completely novel models to solve it.

    But while many kinds of medical data have spatial structure, they may be more complex than regular photos. For instance, I was involved in a project that processed imaging mass-spectrometry (IMS) datasets. IMS processes a section of tissue (e.g., from a tumor) and produces data which is at first glance similar to an image: it consists of spatially distributed pixels. But every “pixel” contains not one or three numbers, like a photo, but a long and diverse spectrum with thousands of different numbers corresponding to different substances found at this pixel. As a result, although this “data cube” has a clear spatial structure, classical computer vision models designed for photos are not enough; we have to develop new methods.

    What about CT and MRI scans? Will the systems that process them “outcompete” roentgenologists whose job is also to read the scans?

    This field has always, since at least the 1990s, been a central application and one of the primary motivations for developing computer vision models.

    Nowadays, such systems, together with the rest of computer vision, have almost completely migrated to convolutional neural networks (CNN); with the deep learning revolution CNNs have become the main tool to process any kind of images, medical included. Unfortunately, a detailed survey of this field would be too large to fit on the margins of this interview: e.g., a survey released in February 2017 contains more than 300 references, most of which appeared in 2016 and later — and a whole year has passed since…

    Should the doctors be afraid that “the machine” will replace them? Why or why not?

    It is still a very long way to go before AI models are able to fully replace human doctors even in individual medical specialties. To continue the above example, there already exist computer vision models that can tell a melanoma apart from a mole no worse or even better than an average doctor. But the real problem is usually not to differentiate pictures but to persuade a person to actually come in for a checkup (automated or not) and to make the checkup sufficiently thorough. It is easy to take a picture of a suspicious mark on your arm, but you will never notice a melanoma, e.g., in the middle of your back; live doctors are still as needed as ever to do the checkup. But a model that would even slightly reduce the error rate is still absolutely relevant and necessary: it will save people’s lives.

    A similar situation has been arising in surgery lately. Over the last couple of years, we have seen a lot of news stories about robotic surgeons that cut flesh more accurately, do less damage, and stitch up better than humans. But it is equally obvious that for many years to come, these robots will not replace live surgeons but will only help them save more lives, even if the robots learn to perform an operation from start to finish. It is no secret that modern autopilots can perform virtually the entire flight from start to finish, but it doesn’t mean human pilots are moving out of the cockpit anytime soon.

    Machine learning models and systems will help doctors diagnose faster and more accurately, but now, while strong AI has not yet been created, they definitely cannot fully replace human doctors. There is no competition, only development of useful tools. Medicine is a very imprecise business, and there always are plenty of factors that an automated system simply cannot know about. We are always talking only about computer-aided diagnosis (CAD), not full automation.

    What are the leading companies in Russia that drive the innovations in biomedical AI?

    One company that does very serious research in Russia is Insilico Medicine, one of the world leaders in the development of anti-aging drugs and generally in the field of drug discovery with AI models. Latest results by Insilico include models that learn to generate molecules with given properties. Naturally, such models cannot replace clinical trials but they can narrow down the search from a huge number of all possible molecules and thus significantly speed up the work of “real” doctors.

    Here at Neuromation, we are also starting projects in biomedical AI, especially in fields related to computer vision. For instance, one of our projects is to develop smart cameras that will track sleeping infants and check whether they are all right, whether they are sleeping in a safe position, and so on. It is still too early to talk about several other projects, they are still at a very early stage, but we are certain something interesting will come out of them very soon. Biomedical applications are one of the main directions of our future development; follow our news!

    This is a translation of Sergey Nikolenko’s interview by Anna Khoruzhaya; see the Russian original on the NeuroNews website.

  • What Will The Future Bring? Three Predictions on AI Technology

    What Will The Future Bring? Three Predictions on AI Technology

    Here at Neuromation, we are always looking forward to the future. Actually, we are making the future. Thus, we are in a good position to try to predict what is going to happen with the AI industry soon. I know how difficult it is to make predictions, especially, as some Danish parliament member once remarked, about the future. Especially in an industry like ours. Still, here go my three predictions, or, better to say, three trends that I expect to continue for the next few years of AI. I concentrated on the technical side of things — I hope the research will stay as beautifully unpredictable as ever.

    Specialized hardware for AI

    Image source

    One trend which is already obvious and will only gain in strength in the future is the rise of specialized hardware for AI. The deep learning revolution started in earnest when AI researchers realized they could train deep neural networks on graphical processors (GPUs, video cards). The idea was that training deep neural networks is relatively easy to parallelize, and graphic processing is also parallel in nature: you apply shaders to every pixel or every vertex of a model independently. Hence, GPUs have always been specifically designed for parallelization: a modern GPU has several thousand cores compared to 4–8 cores in a CPU (CPU cores are much faster, of course, but still, thousands). In 2009, this turned every gaming-ready PC into an AI powerhouse equal to the supercomputers of old: an off-the-shelf GPU trains a deep neural network 10–30x faster than a high-end CPU; see, e.g., an up-to-date detailed comparison here.

    Since then, GPUs have been the most common tool for both research and practice in deep learning. My prediction is that over the next few years, they will be gradually dethroned in favour of chips specifically designed for AI.

    The first widely known specialized chips for AI (specifically, for training neural networks) are the proprietary Google TPUs. You can rent them on Google Cloud, but they have not been released for sale to the general public, and probably won’t be.

    But the Google TPU is just the first example. I have already blogged about recent news from Bitmain, one of the leading designers and producers of ASICs for bitcoin mining. They are developing a new ASIC specifically for tensor computing — that is, for training neural networks. I am sure that over the next few years we will see many more chips designed specifically for AI that will bring deep learning to new heights.

    AI Centralization and Democratization

    Image source

    The second prediction sounds like an oxymoron: AI research and practice will centralize and decentralize at the same time. Allow me to explain.

    In the first part, we talked about training neural networks on GPUs. Since about 2009, deep learning has been living in “the good old days” of computing, when you can stay on the bleeding edge of AI research with a couple of off-the-shelf GPUs at $1000 each in your garage. These days are not past us yet, but it appears that soon they will be.

    Modern advances in artificial intelligence are more and more dependent on computational power. Consider, e.g., AlphaZero, a deep reinforcement learning model that has recently learned to play chess, go, and shogi better than the best engines (not just humans! AlphaZero beat Stockfish in chess, AlphaGo in go, and Elmo in shogi) completely from scratch, knowing only the rules of the game. This is a huge advance, and it made all the news with the headline “AlphaZero learned to beat Stockfish from scratch in four hours”.

    Indeed, four hours were enough for AlphaZero… on a cluster of 5000 Google TPUs for generating self-play games and 64 second-generation TPUs to train the neural networks, as the AlphaZero paper explains. Obviously, you and I wouldn’t be able to replicate this effort in a garage, not without some very serious financing.

    This is a common trend. Research in AI is again becoming more and more expensive. It increasingly requires specialized hardware (even if researchers use common GPUs, they need lots of them), large datacenters… all of the stuff associated with the likes of Google and Facebook as they are now, not as they were when they began. So I predict further centralization of large-scale AI research in the hands of cloud-based services.

    On the other hand, this kind of centralization also means increased competition on this market. Moreover, the costs of computational power have recently been rather inflated due to huge demand from cryptocurrency miners. We tried to buy high-end off-the-shelf GPUs last summer and utterly failed: they were completely sold out to those who were getting into mining Ethereum and Litecoin. However, this trend is coming to an end too: mining is institutionalizing even faster, returns on mining decrease exponentially, and computational resources are beginning to free up as it becomes less and less profitable to use them.

    We at Neuromation are developing a platform to bring this computational power to AI researchers and practitioners. On our platform, you will be able to rent the endless GPUs that have been mining ETH, getting them cheaper than anywhere else while still making a profit for the miners. This effort will increase competition on the market (currently you go either to Amazon Web Services or Google Cloud; there are very few other solutions) and bring further democratization of various AI technologies.

    AI Commoditization

    Image source

    By the way, speaking of democratization. Machine learning is a very community-driven area of research. It has unprecedented levels of sharing between researchers: it is common practice to accompany research papers with working open source code published on Github, and datasets, unless we are talking about sensitive data like medical records, are often provided for free as well.

    For example, modern computer vision based on convolutional neural networks almost invariably uses a huge general-purpose dataset called ImageNet; it has more than 14 million images hand-labeled into more than 20 thousand categories. Usually a model is first pretrained on ImageNet, which lets it extract low-level features common to all photos of our world, and only then trained further (in machine learning, this is called fine-tuning) on your own data.

    You can request access to ImageNet and download it for free, but what is even more important, models already trained on ImageNet are commonly available to the general public (see, e.g., this repository). This means that you and I don’t have to go through a week or two of pretraining on a terabyte of images; we can jump right into fine-tuning.
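    To make this concrete, here is a minimal sketch of fine-tuning a pretrained ImageNet model in PyTorch/torchvision. The choice of ResNet-18, the ten-class task, and the hyperparameters are just illustrative assumptions, not a recipe from any particular repository:

    ```python
    import torch
    import torch.nn as nn
    from torchvision import models

    # Load a ResNet-18 already pretrained on ImageNet -- no need to spend
    # a week or two training from scratch.
    model = models.resnet18(pretrained=True)

    # Freeze the pretrained feature extractor...
    for param in model.parameters():
        param.requires_grad = False

    # ...and replace the final classification layer with one for our own task
    # (say, 10 classes of supermarket goods -- an arbitrary example).
    model.fc = nn.Linear(model.fc.in_features, 10)

    # Only the new layer is trained ("fine-tuned") on our own data.
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    # One fine-tuning step on a batch (images, labels) from our own dataset:
    #   loss = criterion(model(images), labels)
    #   loss.backward(); optimizer.step(); optimizer.zero_grad()
    ```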

    I expect this trend to continue and be taken even further by AI researchers in the near future. Very soon, a lot of “basic components” will be publicly available, and AI researchers will be able to use them and combine them directly, without tedious fine-tuning. This will be partially a technical process of making what we (will) have easily accessible, but it will also require some new theoretical insights.

    For example, a recent paper from DeepMind presented PathNet, a modular neural architecture able to combine completely different sub-networks and automatically choose and fine-tune a combination of these sub-networks most suitable for a given task. This is still a new direction, but I expect it to pick up.

    Again, we at Neuromation plan to be on the cutting edge: in the future, we plan to provide modular components for building modern neural networks on our platform. Democratization and commoditization of AI research are what the Neuromation platform is all about.

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • Can a Neural Network Read Your Mind?

    Can a Neural Network Read Your Mind?

    Image source

    Researchers from the ATR Computational Neuroscience Labs in Kyoto and from Kyoto University have recently made the news. Their paper, entitled “Deep image reconstruction from human brain activity” (released on December 30, 2017), basically claims to have developed a machine learning model that can read your mind, with sample reconstructions shown in the picture above. To understand what they mean, and whether we should all be thinking only happy thoughts from now on, we need to start with a brief background.

    How to Read the Minds of Deep Neural Networks

    A big problem with neural networks has always been their opacity: while we can see the final results, it is very hard to understand what exactly is going on inside a neural network. This is a problem for all architectures, but let us now concentrate on convolutional neural networks (CNNs) used for image processing.

    Very roughly speaking, CNNs are multilayer (deep) neural networks where each layer processes the image in small windows, extracting local features. Gradually, layer by layer, local features become global, drawing their inputs from a larger and larger portion of the original image. Here is how it works in a very simple CNN (picture taken from this tutorial, which I recommend reading in full):

    Image source
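    For concreteness, here is what such a toy CNN could look like in code. This is a minimal PyTorch sketch, not the network from the tutorial or from any particular paper:

    ```python
    import torch.nn as nn

    # A toy CNN: every convolution looks at a small window of its input, and
    # stacking convolutions with pooling lets deeper features "see" a larger
    # and larger portion of the original image.
    class TinyCNN(nn.Module):
        def __init__(self, num_classes=3):        # e.g., dog / cat / boat
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # local features
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # more global
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),           # one number per feature map
            )
            self.classifier = nn.Linear(64, num_classes)  # combine into class scores

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))
    ```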

    In the end, after several (sometimes several hundred) layers, we get global features that “look at” the whole original image, and they are combined to produce the class labels (is it a dog, a cat, or a boat?). But how do we understand what these features actually do? Can we interpret them?

    One idea is to simply look for the images that activate specific neurons with the hope that they will have something in common. This idea was developed, among other works, in the famous paper “Visualizing and Understanding Convolutional Networks” by Zeiler and Fergus (2013). The following picture shows windows from actual images (on the right) that provide the largest possible activations for four different high-level neurons together with their pixels that contribute to these activations; you can see that the procedure of fitting images to features does produce readily interpretable pictures:

    Image source

    But then researchers developed another simple but very interesting idea that works well for understanding the features. The whole training process in machine learning is designed to fit the features of a network to a training dataset of images. The images are fixed, and the network weights (the parameters of the convolutional layers) are changing. But we can also do it the other way around: fix the network and change the image to fit what we need!
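    In code, the basic trick is just gradient ascent on the input pixels instead of the weights. A minimal sketch (the model choice, class index, and step count are arbitrary; real visualizations such as those of Yosinski et al. add regularizers like blurring and jitter):

    ```python
    import torch
    from torchvision import models

    # "Fix the network, change the image": gradient ascent on the input pixels
    # to maximize the activation of one output class.
    model = models.vgg16(pretrained=True).eval()
    for p in model.parameters():
        p.requires_grad = False

    target_class = 130                       # an arbitrary ImageNet class index
    image = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
    optimizer = torch.optim.Adam([image], lr=0.05)

    for step in range(200):
        optimizer.zero_grad()
        score = model(image)[0, target_class]   # how strongly the class fires
        (-score).backward()                     # minimizing the negative = ascending
        optimizer.step()

    # "image" now shows what the network "thinks" this class looks like.
    ```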

    For interpretation, this idea was developed, e.g., in the work “Understanding Neural Networks Through Deep Visualization” (2015) by Jason Yosinski et al. The results look a lot like the famous “deep dreams”, and this is no coincidence; for example, here are the images designed to activate certain classes the most:

    Image source

    Somewhat recognizable but pretty strange, right? We will see similar effects in the “mind-reading” pictures below.

    The same idea also leads to the field of adversarial examples for CNNs: now that we have learned to fit images to networks rather than the other way around, what if we fit the images to fool the network? This is how you get examples like this:

    Image source

    On the left, this picture shows a “benign” image, labeled and recognized as a bottlecap. On the right is an adversarial image: you can’t see any noticeable difference, but the same network that achieved great results in general and correctly recognized the original as a bottlecap now confidently says that it is a… toy poodle. The difference is that on the right, an adversary has added carefully crafted perturbations that are small and look just like white noise, but are designed to push the network in the direction of the toy poodle class.
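    One standard recipe for crafting such perturbations is the fast gradient sign method (FGSM): take a tiny step on the input pixels in the direction that increases the loss on the correct label. This is a generic sketch, not necessarily how the bottlecap example above was produced, and the model choice is arbitrary:

    ```python
    import torch
    import torch.nn.functional as F
    from torchvision import models

    model = models.resnet50(pretrained=True).eval()

    def fgsm_attack(image, true_label, epsilon=0.01):
        """image: 1x3xHxW tensor; returns an adversarial copy of it."""
        image = image.clone().requires_grad_(True)
        loss = F.cross_entropy(model(image), torch.tensor([true_label]))
        loss.backward()
        # The perturbation is tiny and looks like noise, but it is precisely
        # aimed at flipping the prediction.
        return (image + epsilon * image.grad.sign()).detach()
    ```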

    These examples, by the way, show that modern convolutional neural networks still have a long way to go before solving computer vision once and for all: although we all know some optical illusions that work for humans, there are no adversarial examples that would make us recognize a two-dimensional picture of a bottlecap as a toy poodle. But the power to fit images to features in the network can also be used for good. Let us see how.

    And Now to Mind Reading for Humans

    So what did the Japanese researchers (they are the mind readers we began with) do, exactly? First, they took fMRI scans of the brain of a human looking at something and recorded the features. Functional magnetic resonance imaging (fMRI) is a medical imaging technique that takes snapshots of brain activity based on blood flow: when neurons in an area of the brain are active, more blood flows there, and we can measure it; fMRI is called functional because it can capture changes in blood flow over time, resulting in videos of brain activity. To get more information, you can watch this explanatory video or see a sample dataset for yourself:

    Since we are measuring blood vessels and not neurons, the spatial resolution of fMRI is not perfect: we can’t go down to the level of individual neurons, but we can distinguish rather small brain areas, with voxels about 1mm in each direction. It has been known for a long time that the fMRI picture contains reliable general information about what the person in the scanner is thinking: emotions, basic drives, processing of different inputs such as speech, music, or video, and so on… but the work of Shen et al. takes this to a whole new level.

    Shen et al. (2017) tried to reconstruct the exact images that people in fMRI scanners were looking at. To do that, they trained a deep neural network on fMRI activity and then tried to match the features of a new image with the features decoded from the fMRI activations. That is, they are basically doing the same thing we discussed above: finding an input image that matches given features as well as possible. The only difference is that the features now come not from a real image processed by the CNN but from an fMRI scan processed by a different network (also convolutional, of course). You can see how the network gradually fits the image to a given fMRI pattern:

    The authors improved their reconstruction results drastically when they added another neural network, the deep generator network (DGN), whose work is to ensure that the image looks “natural” (in technical terms, introducing a prior on the images that favors “natural” ones). This is also an important idea in machine learning: we often can make sense of something only because we know what to expect, and artificial models are no different: they need some prior knowledge, “intuition” about what they can and cannot get, to improve their outputs.
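    Schematically, the reconstruction loop then optimizes a latent code for the generator rather than raw pixels, so everything it produces already looks like a natural image. The sketch below uses placeholder models and loss, not the actual networks of Shen et al.:

    ```python
    import torch

    # Placeholder reconstruction loop: optimize a latent code z, let a
    # pretrained generator G keep the image "natural", and push the CNN
    # features of the generated image towards the features decoded from fMRI.
    def reconstruct(G, feature_extractor, fmri_features, steps=500):
        z = torch.randn(1, 100, requires_grad=True)     # latent code of the DGN
        optimizer = torch.optim.Adam([z], lr=0.05)
        for _ in range(steps):
            optimizer.zero_grad()
            image = G(z)                                # stays on the "natural image" manifold
            loss = ((feature_extractor(image) - fmri_features) ** 2).mean()
            loss.backward()
            optimizer.step()
        return G(z).detach()                            # the reconstructed image
    ```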

    In total, here is what the architecture looks like. The optimization problem is to find an image which best fits both the deep generator network that makes sure it is “natural” and the deep neural network that makes sure it matches fMRI features:

    Image source

    If these results can be replicated and further improved, this is definitely a breakthrough in neuroscience. One can even dream about paralyzed people communicating through fMRIs by concentrating on what they want to say; although in (Shen et al., 2017) reconstruction results are much worse when people are imagining a simple shape rather than directly looking at it, sometimes even with imagined shapes it does seem that there is something there:

    Image source

    So can a neural network read your mind now? Not really. You have to lie down in a big scary fMRI scanner and concentrate hard on a single still image. And even in the best cases, it still often comes out like this, kind of similar but not really recognizable:

    Image source

    Still, this is a big step forward. Maybe one day.

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • AI in Biology and Medicine

    AI in Biology and Medicine

    Today I present to you three research directions that apply the latest achievements in artificial intelligence (mostly deep neural networks) to biomedical applications. Perhaps this is the research that will not only change but also significantly extend our lives. I am grateful to my old friend, co-author, and graduate student Arthur Kadurin, who suggested some of these projects.

    Translated from Russian by Andrey V. Polyakov. Original article here.

    Polar, Beiersdorf AG, and Others: Smart Clothes

    We begin with a series of projects that are unlikely to turn the world upside down but will certainly produce, pardon the pun, cosmetic changes in everyday life in the very near future. These projects deal with AI applications for the so-called “Internet of Things” (IoT), specifically applications that are very “close to the body”.

    Various types of fitness trackers, special bracelets that collect information about heartbeat, steps, and so forth, entered our lives long ago. The main trend among sportswear companies now is to build various sensors directly into the clothes. That way, you can collect more information and measure it more precisely. Sensors suitable for “smart clothes” were invented in 2016, and already in 2017 Polar presented the Polar Team Pro Shirt, a shirt that collects lots of information during exercise. The plot will no doubt thicken even further when sports medicine supported by artificial intelligence learns to use all this information properly; I expect a revolution in sports that Moneyball could never dream of.

    And it is already beginning. Recently, on November 24–26, the second SkinHack hackathon, dedicated to applying machine learning models to data coming from such sportswear, took place in Moscow. The first SkinHack, held last year, was dedicated to “smart cosmetics”: the participants tried to predict a person’s age from the skin structure in photographs, looking for wrinkles. Both smart cosmetics and smart clothing are areas of active interest for Beiersdorf AG (commonly known as the producer of the Nivea brand), so one can hope that the commercial launch of these technologies will not be long in coming. In Russia, SkinHack was supported by Youth Laboratories, a company affiliated with the central characters of our next part…

    Insilico: Automatic Discovery of New Drugs

    Insilico Medicine is a company well known in the biomedical world. Its primary mission is to fight aging, and I personally wish Insilico success in this effort: one does not look forward to growing old. However, in this article I would like to emphasize another, albeit related, project of the company: drug discovery based on artificial intelligence models.

    A medicinal drug is a chemical compound that can bind to other substances in our body (usually proteins) and have the desired effect on them, e.g., suppress a protein or cause another to be produced in larger quantities. To find a new drug, you need to choose, from a huge number of possible chemical compounds, exactly the ones that will have the desired effect.

    It is clear that at this point it is impossible to fully automate the search for new drugs: clinical trials are needed, and one usually starts testing on mice, then on humans… in general, the process of bringing a new medicinal drug to the market usually takes years. However, one can try to help doctors by reducing the search space. Insilico develops machine learning models that try not only to predict the properties of a molecule but also to generate candidate molecules with desired properties, thereby helping to choose the most promising candidates for further laboratory and clinical studies.

    This search space reduction is done with a very interesting class of deep learning models: generative adversarial networks (GANs). Such networks combine two components: a generator trying to generate new objects — for example, new molecules with desired properties — and a discriminator trying to distinguish generated results from real data points. By learning to deceive the discriminator, the generator begins to produce objects indistinguishable from the real ones… which, in this case, we hope are actually useful molecules. The latest Insilico model, called druGAN (drug + GAN), attempts to generate, among other things, molecules useful for oncology.
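    For readers who want to see the mechanics, here is a generic, heavily simplified GAN training step. This is not the actual druGAN architecture; the fixed-size vectors below merely stand in for molecular descriptors:

    ```python
    import torch
    import torch.nn as nn

    latent_dim, data_dim = 64, 128
    G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
    D = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    def training_step(real_batch):
        batch = real_batch.size(0)
        fake = G(torch.randn(batch, latent_dim))

        # Discriminator: call real data 1, generated data 0.
        opt_d.zero_grad()
        d_loss = bce(D(real_batch), torch.ones(batch, 1)) + \
                 bce(D(fake.detach()), torch.zeros(batch, 1))
        d_loss.backward(); opt_d.step()

        # Generator: make the discriminator call fakes real.
        opt_g.zero_grad()
        g_loss = bce(D(fake), torch.ones(batch, 1))
        g_loss.backward(); opt_g.step()
    ```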

    MonBaby: Keeping Track of the Baby

    Finally, I would like to end with a project that Neuromation plans to participate in. Small children, especially babies, cannot always call for help themselves and require special care and attention. This attention is sometimes required even in situations where mom and dad seem to be able to relax: for example, a sleeping baby may hurt a leg by turning into an uncomfortable position. And then there is the notorious SIDS (sudden infant death syndrome), whose risk has been linked to the position of a sleeping infant: did you know that the risk of SIDS increases several times if a baby sleeps on the stomach?

    The MonBaby smart infant tracking system is a small “button” that snaps onto clothing and monitors the baby’s breathing and movements during sleep. Currently, the system is based on machine learning for time series analysis: data from the baby’s movements is used to recognize breathing cycles and sleeping position (on the stomach or on the back).
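    To give a flavor of this kind of time series analysis, here is a toy sketch: cut the accelerometer stream into windows, compute simple features, and classify each window. This is purely illustrative and is not MonBaby’s actual pipeline:

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def window_features(window):                 # window: (n_samples, 3) accelerometer axes
        return np.concatenate([window.mean(axis=0),    # average orientation
                               window.std(axis=0)])    # how much the baby moves

    def make_windows(signal, window=50):
        return np.array([window_features(signal[i:i + window])
                         for i in range(0, len(signal) - window + 1, window)])

    # X = make_windows(accelerometer_log)            # one feature vector per window
    # y = position labels per window ("back" / "stomach")
    # clf = RandomForestClassifier().fit(X, y)
    ```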

    We plan to complement this system with smart cameras able to track the infant’s movements and everything that happens to him or her by visual surveillance. The strong suits of our company will come in handy here: computer vision systems based on deep convolutional networks and synthetic data for their training. The fact is that in this case it is practically impossible to collect a sufficiently large real data set for training the system: it would take not only real video recordings of tens of thousands of babies, but video recordings with all possible critical situations. Thankfully, modern ethics, both medical and human, would never allow us to generate such datasets in real life. Therefore, we plan to create “virtual babies”, 3D models that will allow us to simulate the necessary critical situations and generate synthetic videos for training.

    We have briefly examined three directions in different branches of biomedicine — sports medicine and cosmetics, creating medicines and baby care — each of which is actively using the latest achievements of artificial intelligence. Of course, these are just examples: AI is now used in hundreds of diverse biomedical projects (which we may touch upon in later articles). Hopefully, however, with these illustrations I have managed to show how AI research is working on helping people live longer, better, and healthier.

    Sergey Nikolenko,
    Chief Research Officer, Neuromation

  • Neuromation Story: From Synthetic Data to Knowledge Mining

    Neuromation Story: From Synthetic Data to Knowledge Mining

    My name is Sergey Nikolenko, and I am writing this as Neuromation’s Chief Research Officer. Our company is based on two main ideas, and there is an interesting story of how one followed from the other. In my opinion, this story reflects the two main problems to be solved in any applied machine learning project today. This is the story I will tell you here.

    First Problem: Labeled Data

    Neuromation began by working on computer vision models and algorithms based on deep neural networks (deep learning). The first big project for Neuromation was in the field of retail: recognizing the goods on supermarket shelves. Modern object detection models are quite capable of analyzing shelf availability, finding free space on the shelves, and even tracking human interaction. This is an important task both for supermarkets and for the suppliers themselves: the big brands pay good money to ensure that their goods are present on the shelf, occupy some agreed-upon part of the shelf, and have the right side of the label facing the customers — all of these little things increase sales by tens of percent. Today, a huge staff of merchandisers goes from supermarket to supermarket, ensuring that everything is right on the shelves; of course, not all of their duties are “monkey jobs” like this, but it is a big part of the day for many real human beings.

    Our idea for retail is to install (cheap off-the-shelf) cameras that can capture and transmit to a server, for example, one frame per minute for recognition. This is a very low frequency for an automatic system, causing no overload on either the network or the recognition model, but it is a frequency completely unattainable with manual checks, and it solves all practical problems in retail. Moreover, an automated surveillance system will save a lot of effort, automate meaningless manual labor — a worthwhile goal in itself.

    A specialist in artificial intelligence, especially in modern deep neural networks for computer vision, might think that this problem is basically solved already. Indeed, modern deep neural networks, trained on large sets of labeled data, can do object detection, and in this case the objects are relatively simple: cans, bottles, packages with bright labels. Of course, there are a lot of technical issues (for example, it is not easy to cope with hundreds of products in one photo — usually such models are trained to detect fairly large objects, only a few per image), but with a sufficiently large labeled dataset, i.e., photos with every product labeled, we could successfully overcome such issues.

    But where would such a labeled dataset come from? Imagine that you have a million photos of supermarket shelves (where to get them, by the way, is also a hard question), and you need to manually draw rectangles like the ones in the image above on each one of those million photos. It looks like a completely hopeless task. So far, manual labeling of large sets of images has usually been done with crowdsourcing services such as Amazon Mechanical Turk. Manual work on such services is inexpensive, but it still does not scale well. We have calculated that to label a dataset sufficient for recognizing all 170,000 items from the Russian retail inventory (a million photos, by the way, would not be enough for this), we would need years of labor and tens of millions of dollars.
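    A quick back-of-envelope calculation shows why; every rate below is an illustrative assumption, not a figure from our actual estimate:

    ```python
    # Back-of-envelope cost of manual labeling (illustrative assumptions only).
    photos          = 10_000_000   # shelf photos to cover ~170,000 SKUs
    boxes_per_photo = 30           # products visible per photo
    cost_per_box    = 0.05         # dollars per bounding box on a crowdsourcing service
    sec_per_box     = 15           # annotator time per box

    boxes = photos * boxes_per_photo
    print(f"cost:  ${boxes * cost_per_box / 1e6:.0f}M")                     # ~$15M
    print(f"labor: {boxes * sec_per_box / 3600 / 2000:.0f} person-years")   # ~625 at 2000 h/year
    ```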

    Thus, we faced the first major challenge, the main “bottleneck” of modern artificial intelligence: where do you get labeled data?

    Synthetic Data and the Second Challenge: Computing Power

    This problem led to Neuromation’s first major idea: we decided to try to train deep neural networks for computer vision on synthetic data. In the retail project, this means that we create 3D models of goods and virtually “place them on the shelves”, getting perfectly labeled data for recognition.

    Synthetic data has two main benefits:

    • first, it requires far less manual work; yes, you need to design a 3D model, but this is a one-time investment which then converts into an unlimited number of labeled “photos”; in case of retail the situation is even better since there are not so many different form factors of packaging, and you can reuse some 3D models by simply “attaching” different labels (textures) to them;
    • second, the resulting data is perfectly labeled, as we are in full control of the 3D scene; moreover, we can produce labeling which we would not be able to produce by hand: we know the exact distance from the camera to every object, the angles each bottle and each carton of juice are turned by, and so on.

    Of course, this approach is not perfect either. Now you have to train networks on one type of data (renderings) and then apply them to a different one (real photos). In machine learning, this is called transfer learning; it is a hard problem in general, but in this case we have been able to solve it successfully. Moreover, we have learned to produce very good photorealistic renderings — our retail partners even intend to use them in media materials and catalogs.

    The synthetic data approach has proved to be very successful, and now the models trained by Neuromation are already being implemented in retail. However, this led to the need to process huge datasets of synthetic images. First, they have to be generated (i.e., one has to render a 3D scene), and then used to train deep neural networks. Generating one photorealistic synthetic image — like the one shown above — usually takes a minute or two on a modern GPU, depending on the number of objects and the GPU model. And you need a lot of these images: millions if not tens of millions.
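    Even data generation alone adds up to a substantial compute bill. A rough estimate with the numbers above (the image count is an assumption within the stated range):

    ```python
    # GPU time for synthetic data generation alone (illustrative assumptions).
    images          = 5_000_000    # "millions if not tens of millions"
    minutes_per_img = 1.5          # "a minute or two" per photorealistic render

    gpu_days = images * minutes_per_img / 60 / 24
    print(f"{gpu_days:,.0f} GPU-days (~{gpu_days / 365:.0f} GPU-years on a single card)")
    # ~5,208 GPU-days, i.e. about 14 GPU-years
    ```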

    And this is only the first step — then we have to train modern deep neural networks on these images. In AI research, it is not enough to train a model once: you have to try many different architectures, train dozens of different models, conduct hundreds of experiments. This, again, requires cutting edge GPUs, and training deep networks requires even more computational time than data generation.

    Thus, we at Neuromation have faced the second major challenge of modern artificial intelligence: where do we get computing power?

    Neuromation Knowledge Mining Platform

    Our first idea was, of course, to simply purchase a sufficient number of GPUs. However, it was the summer of 2017, the midst of the cryptocurrency mining boom. It turned out that graphics cards with the latest NVIDIA chips were not just expensive, they were virtually unavailable at all. After we had tried to “mine” for some GPUs through our contacts in the US and realized that this way they would arrive only in a month or more, we switched to plan B.

    Plan B involved using cloud services that rent out complete, already set-up machines (often virtual ones). A cloud especially popular with AI practitioners is Amazon Web Services. AWS has become a de facto industry standard, and many new AI startups rent computing power there for their development tasks. However, cloud-based services do not come cheap: renting a machine with several GPUs for training neural networks costs a few dollars per hour, and you need a lot of these hours.

    We at Neuromation have spent thousands of dollars renting computational power on Amazon — only to realize that we did not have to. The prices of cloud-based services are acceptable to buyers only in the absence of alternatives.

    And when we started thinking about potential alternatives, we recalled the reason we could not buy enough high-end GPUs. This led to the second main idea of Neuromation: repurposing GPU-based mining rigs for useful computing. ASIC chips designed specifically for Bitcoin mining are not suitable for any other computing tasks, but the GPUs used to mine Ethereum (ETH) and other altcoins are the exact same GPUs we need to train neural networks. Moreover, cryptocurrency mining generates an order of magnitude less income than the clouds charge for renting an equivalent GPU farm for the same period.

    We realized that there is a very powerful business opportunity here — a huge gap between prices — and also simply an opportunity to make the world better, redirecting the vast resources currently spent on brute-forcing cryptographic hash puzzles to more useful calculations.

    This is how the idea of the Neuromation platform was born: a universal marketplace for knowledge mining that connects miners who want to earn more on their equipment with customers: AI startups, researchers, and basically any companies that need to process large datasets or train modern machine learning models.

    Now we are already working with several mining farms, using their GPUs for useful computing. This is 5 to 10 times cheaper than renting server capacity from cloud-based services, and even at that price it is still much more profitable for miners: with their GPU-based rigs, miners can earn 3 to 5 times more by “knowledge mining” than they would get from the same setup by mining cryptocurrency. Given that the difficulty of the calculations behind cryptocurrency mining keeps growing, the benefits of “knowledge mining” will only increase with time.

    Conclusion

    Right now we are presenting the idea of this universal platform for useful computing to the global market. The use of mining rigs for useful computing benefits both parties: miners will earn more, and numerous artificial intelligence researchers and entrepreneurs will get a considerably (several times) cheaper and more convenient way to implement their ideas. We believe that such “AI democratization” will lead to new breakthroughs and, ultimately, fuel the current revolution in artificial intelligence. Join us, and welcome to the revolution!

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • Make Man In Our Image: Through the Black Mirror

    Make Man In Our Image: Through the Black Mirror

    A Recurring Theme

    Warning: major spoilers for the fourth series of Black Mirror ahead. If you haven’t watched it, please stop reading this, go watch the series, then return. I’ll be waiting, I’m an imaginary being who has nothing better to do anyway…

    …which is kind of the point.

    I watched the fourth Black Mirror series over the holidays. As I watched one episode after another, it struck me that they all seem to be about the exact same thing. This is an overstatement, of course, but three out of six ain’t bad either:

    • in “USS Callister”, the antagonist creates virtual copies of living people and makes them the actors in his simulated universe, torturing them to submit if necessary;
    • in “Hang the DJ”, virtual copies of living people live through thousands of simulations to gather data for the matchmaking service on a dating app;
    • in “Black Museum”, the central showpiece of the museum is a virtual clone of an ex-convict who is put through electrocution over and over, with more clones in constant pain created every time.

    Let’s add the “San Junipero” episode and especially the “White Christmas” special from earlier Black Mirror series to this list for good measure.

    See the recurring theme? It appears that the Black Mirror creators have put their minds to one of the central problems of modern ethical philosophy: what do we do when we are able to create consciousnesses, probably in the form of virtual copies inside some sort of simulation? Will these virtual beings be people, ethically speaking? Can we do as we please with them?

    Judging by the mood of the episodes, Black Mirror is firmly in the camp of those who believe that upon creating a virtual mind, moral responsibility arises, and “virtual people” do give rise to ethical imperatives. It does seem to be the obvious choice… doesn’t it?

    The Hard Problem: Virtual What?

    As soon as we try to consider the issue in slightly more detail, we run into insurmountable problems. The first problem is that with our current state of knowledge, it is extremely hard to define what consciousness is.

    The problems of consciousness and first-person experience are still firmly in the realm of philosophy rather than science. Here I mean natural philosophy, a scientific way of reasoning about things that cannot yet be a subject of the scientific method. The ancient Greeks did natural philosophy, pondering the origins of all things and even arriving at the idea of elementary particles. However, as amazing as that insight was, the Greeks could not study elementary particles as modern physicists do, even if they had had the scientific method as we know it: they lacked the tools and even the proper set of notions to reason about these things. In the problem of consciousness and first-person experience, we are still very much at the level of the ancient Greeks: nobody knows what it is, and nobody has any idea how to get any closer to this knowledge.

    Take the works of David Chalmers, a prominent philosopher in the field. He distinguishes between “easy” problems of consciousness, which could be studied scientifically even right now, and “the hard problem” (see, e.g., his seminal paper, “Facing Up to the Problem of Consciousness“). The hard problem is deceptively easy to formulate: what the hell is first-person experience? What is it that “I” am? How does this experience of “myself” arise from the firings of billions of neurons?

    At first glance, this looks like a well-defined problem: first-person experience is, frankly, the only thing we can be sure of. The Cartesian doubt argument, exactly as presented by Descartes, is surprisingly relevant to sentient people simulated inside a virtual environment. The guy running the simulation is basically the evil demon of Descartes. If you entertain the possibility that you may be stuck in a simulation, the only thing you cannot doubt is your subjective first person experience.

    On the other hand, first-person experience is also completely hidden from everyone except yourself. Chalmers introduces the notion of a philosophical zombie: an (imaginary) being that looks and behaves exactly like a human but does not have any first-person experience. Zombies are merely automata, “Chinese rooms”, so to speak, that produce responses matching those of a human being. Their presumed existence does not appear to lead to any logical contradiction. I wouldn’t know, but I guess that’s how true psychopaths view others: as mechanical objects of manipulation devoid of subjective suffering.

    I will not go into the philosophical details. But what we have already seen should suffice to plant a seed of doubt about virtual copies: why are we sure they have the same kind of first-person experience we do? If they are merely philosophical zombies and do not suffer subjectively, it appears perfectly ethical to perform any kind of experiments on them. For that matter, why do you think I am not a zombie? Even if I were, I’d write the exact same words. And a virtual copy of me would be even less similar to you: it would run on completely different hardware — so how do we know it’s not a zombie?

    Oh, and one more question for you: were you born this morning? Why not? You don’t have a continuous thread of consciousness connecting you to yesterday (assuming you went to sleep). Sure, you have the memories, but a virtual clone would have the exact same memories. How can you be sure?

    Easier Problems: Emulations, Ethics, and Economics

    We cannot hope to solve the hard problem of consciousness right now. We cannot even be sure it’s a meaningful problem. However, the existence of virtual “people” also raises more immediate questions.

    The Age of Em, a recent book by the economist and futurist Robin Hanson, attempts to answer some of these questions from the standpoint of economics. What is going to happen to the world economy if we discover a way to run emulated copies of people (exactly the setting of the “White Christmas” episode of Black Mirror)? What if we could copy Albert Einstein, Richard Feynman, and Geoffrey Hinton a million times over?

    Hanson pictures a future that appears to be rather bleak for the emulated people, or “ems”, as he calls them. Since the cost of copying a virtual person is negligible compared to raising a human being in the flesh, competition between the ems will be fierce. They will become near-perfect economic entities — and as such, will be forced to always live at near-subsistence levels, with all possible surplus captured immediately by other competing ems. But Hanson argues that the ems might not mind: their psychology will adapt to their environment, as human psychology has done for millennia.

    The real humans will be able to live a life of leisure off this booming market of ems… for a while. After all, there is no reason not to speed ems up as much as computational power allows, so their subjective time might run thousands of times faster compared to our human time (“White Christmas” again, I know), and their society might develop extremely quickly, with unpredictable consequences.

    Hanson also tackles the “harder” problems of consciousness from a different angle. Suppose you had a way to easily copy your own mind. This opens up surprising possibilities: what if, instead of going to work tomorrow, you make a copy of yourself, make it do your work, and then terminate the copy, freeing the day for yourself? If you were an em, you would be able to actually do it — but wouldn’t you be committing murder at the end of the day? This ties into what has been known for quite some time as the “teleportation problem”: if you are teleported atom by atom to a different place, Star Trek style, is it really you, or has the real “you” been killed in the process, with the teleported copy being a completely new person with the same memories?

    By the way, you don’t need to have full-scale brain emulations to have similar ethical problems. What if tomorrow a neural network passes the Turing test and in the process of doing so begs you not to switch it off, appearing genuinely terrified of dying? Is it OK to switch it off anyway?

    Questions abound. Interesting questions, open questions, questions that we are not even sure how to formulate properly. I wanted to share the questions with you because I believe they are genuinely interesting, but I want to end with a word of caution. We have been talking about “virtual people”, “emulated minds”, and “neural networks passing the Turing test”. But so far, all of this is just like Black Mirror — just fiction, and not very plausible fiction at that. Despite the ever-growing avalanche of hype around artificial intelligence, there is no good reason to expect virtual minds and the singularity around the corner. But this is a topic for another day.

    Sergey Nikolenko
    Chief Research Officer, Neuromation