Category: Synthesis AI

  • The Data Problem II: One-Shot and Zero-Shot Learning

    In the previous post, we posed what we consider the main problem of modern machine learning: increasing appetite for data that cannot be realistically satisfied if current trends persist. This means that current trends will not persist — but what is going to replace them? How can we build machine learning systems at ever increasing scale without increasing the need for huge hand-labeled datasets? Today, we consider one possible answer to this question: one-shot and zero-shot learning.

    How Can Face Recognition Work?

    Consider a face recognition system. For simplicity, let’s assume that our system is a face classifier so that we do not need to worry about object detection, bounding box proposals and all that: the face is already pre-cut from a photo, and all we have to do is recognize the person behind that face.

    A regular image classifier consists of a feature extraction network followed by a classification model; in deep learning, the latter is usually very simple, and feature extraction is the interesting part. Like this (image source):

    So now we train this classifier on our large labeled dataset of faces, and it works great. But wait: where do the labeled samples come from? Sure, we can assume that a large labeled dataset of some human faces is available, but a face recognition system needs to do more: while it’s okay to require a large dataset for pretraining, a live model needs to be able to recognize new faces.

    How is this use case supposed to work? Isn’t the whole premise of training machine learning models that you need a lot of labeled data for each class of objects you want to recognize? But if so, how can a face recognition system hope to recognize a new face when it usually has at most a couple of shots for each new person?

    One-Shot Learning via Siamese Networks

    The answer is that face recognition systems are a bit different from regular image classifiers. Any machine learning system working with unstructured data (such as photos) is basically divided into two parts: feature extraction, the part that converts an image into a (much smaller) set of numbers, often called an embedding or a latent code, and a model that uses extracted features to actually solve the problem. Deep learning is so cool because neural networks can learn to be much better at feature extraction than anything handcrafted that we had been able to come up with. In the AlexNet-based classifier shown above, the hard part was the AlexNet model that extracts features, and the classifier can be any old model, usually simple logistic regression.
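    The split into a feature extractor and a very simple classifier on top can be sketched in a few lines of PyTorch. This is only a toy illustration: the tiny convolutional backbone below is a made-up stand-in for AlexNet, and the head is a single linear layer, i.e., multinomial logistic regression on the extracted features.

    ```python
    import torch
    import torch.nn as nn

    class ToyBackbone(nn.Module):
        """A made-up stand-in for a real feature extractor such as AlexNet."""
        def __init__(self, embedding_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),               # global pooling: one vector per image
            )
            self.fc = nn.Linear(32, embedding_dim)

        def forward(self, x):
            return self.fc(self.conv(x).flatten(1))    # (batch, embedding_dim): the embedding

    class FaceClassifier(nn.Module):
        """Feature extractor + "any old model" on top (here, a linear classifier)."""
        def __init__(self, backbone, num_classes, embedding_dim=128):
            super().__init__()
            self.backbone = backbone
            self.head = nn.Linear(embedding_dim, num_classes)

        def forward(self, x):
            return self.head(self.backbone(x))         # logits over the known identities

    model = FaceClassifier(ToyBackbone(), num_classes=10)
    logits = model(torch.randn(4, 3, 64, 64))           # 4 fake RGB images, 64x64 pixels
    ```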

    To get a one-shot learning system, we need to use the embeddings in a different way. In particular, modern face recognition systems learn face embeddings with a different “head” of the network and different error functions. For example, FaceNet (Schroff et al., 2015) learns with a modification of the Siamese network approach, where the target is not a class label but rather the distances or similarities between face embeddings. The goal is to learn embeddings in such a way that embeddings of the same face will be close together while embeddings of different faces will be clearly separated, far from each other (image source):

    The “FaceNet model” in the picture does feature extraction. The error function is then computed from the distance between two embedding vectors: if the two input faces belong to the same person, you want the embeddings to be pulled close together, and if they belong to different people, you want them to be pushed far apart. There is still some way to go from this basic idea to a usable error function, but this will suffice for a short explanation. Here is how FaceNet works as a result; in the picture below, the numbers between photos of faces correspond to the distances between their embedding vectors (source):

    Note that the distances between two photos of the same guy are consistently smaller than the distances between photos of different people, even though in terms of pixels, composition, background, lighting, and basically everything else except for human identity, the pictures in the columns are much more similar than the pictures in the rows.
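    Stripped of the engineering details that make it actually work well (L2 normalization schedules, careful mining of hard triplets, and so on), the triplet loss that FaceNet is trained with can be sketched like this; the toy tensors at the end merely stand in for embeddings produced by the feature extraction network.

    ```python
    import torch
    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, margin=0.2):
        """Embeddings of the same person (anchor, positive) should be closer than
        embeddings of different people (anchor, negative) by at least `margin`."""
        anchor, positive, negative = (F.normalize(t, dim=1)
                                      for t in (anchor, positive, negative))
        d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared distance, same identity
        d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared distance, different identity
        return F.relu(d_pos - d_neg + margin).mean()    # hinge: penalize only violated triplets

    # toy usage: a batch of 8 triplets of 128-dimensional embeddings
    a, p, n = (torch.randn(8, 128) for _ in range(3))
    loss = triplet_loss(a, p, n)
    ```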

    Now, after you have trained a model with these distance-based metrics in mind, you can use the embeddings to do one-shot learning itself. Assuming that a new face’s embedding will have the same basic properties, we can simply compute the embedding for a new person (with just a single photo as input!) and then do classification by looking for nearest neighbors in the space of embeddings.
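    In code, this one-shot step is almost trivial once the embeddings are good. Below is a minimal nearest-neighbor sketch; the random 128-dimensional vectors stand in for embeddings produced by the trained network, and the distance threshold is an arbitrary made-up number.

    ```python
    import numpy as np

    # one enrollment embedding per known person, each computed from a single photo
    gallery = {
        "alice": np.random.randn(128),
        "bob":   np.random.randn(128),
    }

    def identify(query_embedding, gallery, threshold=1.1):
        """Nearest-neighbor lookup in embedding space; None if nobody is close enough."""
        name, dist = min(((n, np.linalg.norm(query_embedding - e))
                          for n, e in gallery.items()),
                         key=lambda pair: pair[1])
        return name if dist < threshold else None

    # a query embedding close to Alice's, as it would be for another photo of Alice
    query = gallery["alice"] + 0.05 * np.random.randn(128)
    print(identify(query, gallery))   # -> "alice"
    ```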

    Zero-Shot Learning and Generative Models

    We have seen a simplified but relatively realistic picture of how one-shot learning systems work. But one can go even further: what if there is no data at all available for a new class? This is known as zero-shot learning.

    The problem sounds impossible, and it really is: if all you know are images from “class 1” and “class 2”, and then you are asked to distinguish between “class 3” and “class 4”, no amount of machine learning can help you. But real life does not work this way: we usually have some background knowledge about the new classes even if we don’t have any images. For example, when we are asked to recognize a “Yorkshire terrier”, we know that it’s a kind of dog, and maybe we even have its verbal description, e.g., from Wikipedia. With this information, we can try to learn a joint embedding space for both class names and images, and then use the same nearest neighbors approach, but look for the nearest label embedding rather than the nearest image embedding (of which we have none for a new class). Here is how Socher et al. (2013) illustrate this approach:

    Naturally, this won’t give you the same kind of accuracy as training on a large labeled set of images, but systems like this are increasingly successful.
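    The lookup itself is the same nearest-neighbor idea as before, only against label embeddings. In the toy sketch below, the hand-picked three-dimensional vectors stand in for the outputs of two learned encoders that map images and class names into a common joint space.

    ```python
    import numpy as np

    # label embeddings in the joint space; "yorkshire terrier" is an unseen class
    # whose embedding comes from its textual description rather than from images
    label_embeddings = {
        "cat":               np.array([1.0, 0.0, 0.0]),
        "dog":               np.array([0.0, 1.0, 0.0]),
        "yorkshire terrier": np.array([0.1, 0.9, 0.3]),
    }

    def zero_shot_classify(image_embedding, label_embeddings):
        """Assign the class whose label embedding is nearest to the image embedding."""
        return min(label_embeddings,
                   key=lambda name: np.linalg.norm(image_embedding - label_embeddings[name]))

    # an image mapped into the joint space by the image encoder
    img = np.array([0.05, 0.85, 0.25])
    print(zero_shot_classify(img, label_embeddings))   # -> "yorkshire terrier"
    ```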

    Many zero-shot learning models use generative models and adversarial architectures. One of my favorite examples is a (more) recent paper by Zhu et al. (2018) that uses a generative adversarial network (GAN) to “hallucinate” images of new classes by their textual descriptions and then extracts features from these hallucinated images:

    The interesting part here is that they do not need to solve the extremely complex and fragile problem of generating high-resolution images of birds, something that only the very best GANs (e.g., BigGAN and its successors) can do. They train a GAN to generate not images but rather feature vectors, those very same joint embeddings, and this is, of course, a much easier problem. Zhu et al. use a standard VGG architecture to extract features from images and train a conditional GAN to generate such feature vectors given a latent vector of textual features extracted from the description:

    As a result, you can do extremely cool things like zero-shot retrieval: doing an image search against a database without any images corresponding to your query. Given the name of a bird, the system queries Wikipedia, extracts a textual description, uses the conditional GAN to generate a few vectors of visual features for the corresponding bird, and then looks for its nearest neighbors in the latent space among the images in the database. The results are not perfect but still very impressive:
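    Here is a heavily simplified sketch of such a retrieval pipeline. Everything in it is a toy stand-in for the real trained components of Zhu et al.: the generator is reduced to two linear layers, the text features and the image “database” are random tensors, and only inference-time usage is shown, with no adversarial training at all.

    ```python
    import torch
    import torch.nn as nn

    TEXT_DIM, NOISE_DIM, FEAT_DIM = 64, 32, 128        # toy dimensions

    # conditional generator: (text features, noise) -> hallucinated visual feature vector
    generator = nn.Sequential(
        nn.Linear(TEXT_DIM + NOISE_DIM, 256), nn.ReLU(),
        nn.Linear(256, FEAT_DIM),
    )

    def hallucinate_features(text_features, n_samples=10):
        """Generate several plausible visual feature vectors for one textual description."""
        noise = torch.randn(n_samples, NOISE_DIM)
        cond = text_features.expand(n_samples, -1)
        return generator(torch.cat([cond, noise], dim=1))

    def retrieve(text_features, database_features, k=5):
        """Zero-shot retrieval: rank database images by distance to hallucinated features."""
        fake = hallucinate_features(text_features).mean(dim=0)       # average the samples
        dists = torch.cdist(fake.unsqueeze(0), database_features)    # (1, num_images)
        return dists.squeeze(0).topk(k, largest=False).indices       # nearest image indices

    # toy usage: 1000 "images" already encoded as feature vectors, one bird description
    database = torch.randn(1000, FEAT_DIM)
    description = torch.randn(1, TEXT_DIM)
    print(retrieve(description, database))
    ```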

    Note, however, that one- and zero-shot learning still require large labeled datasets. The difference is that we don’t need a lot of images for new classes any more. But the feature extraction network has to be trained on similar labeled classes: a zero-shot approach won’t work if you train it on birds and then try to look for a description of a chair. Until we have an uber-network trained on every possible kind of image (and believe me, we still have quite a way to go before Skynet enslaves the human race), this is still a data-intensive approach, although the restrictions on what kind of data to use are relaxed.

    Conclusion

    In this post, we have seen how one-shot and zero-shot learning can help by drastically reducing the need for labeled datasets in the context of:

    • either adding new classes to an existing machine learning model (one-shot learning)
    • or transforming data from one domain, where it might be plentiful, to another, e.g., mapping textual descriptions to images (zero-shot learning).

    But we have also seen that all of these approaches still need a large dataset to begin with. For example, a one-shot learning model for human faces needs to have a large labeled dataset of human faces to work, and only then can we add new people with a single photo. A zero-shot model that can hallucinate pictures of birds based on textual descriptions needs a lot of descriptions paired with images before it is able to do its zero-shot magic.

    This is, of course, natural. Moreover, it is quite possible that at some point in AI development we will have enough labeled data to cover entire fields such as computer vision or sound processing, extending their capabilities with one-shot and zero-shot architectures. But we are not at that point yet, and there is no saying whether we will reach it in the near future, so we need to consider other methods for alleviating the data problem. In the next installment of this series, we will see how unlabeled data can be used in this regard.

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • The Data Problem Part I: Issues and Solutions

    Today, we are kicking off the Synthesis AI blog. In these posts, we will speak mostly about our main focus, synthetic data, that is, artificially created data used to train machine learning models. But before we begin to dive into the details of synthetic data generation and use, I want to start with the problem setting. Why do we need synthetic data? What is the problem and are there other ways to solve it? This is exactly what we will discuss in the first series of posts.

    Structure of a Machine Learning Project

    Machine learning is hot, and it has been for quite some time. The field is growing exponentially fast: new models and new papers appear every week, if not every day. Since the deep learning revolution, deep learning has been far outpacing the rest of computer science, and arguably science in general, for about a decade now. Here is a nice illustration from Michael Castelle’s excellent post on deep learning:

    It shows how deep learning is taking up more and more of the papers published on arXiv (the most important repository of computer science research); the post was published in 2018, but trust me, the trend continues to this day.

    Still, the basic pipeline of using machine learning for a given problem remains mostly the same, as shown in the teaser picture for this post:

    • first, you collect raw data related to your specific problem and domain;
    • second, the data has to be labeled according to the problem setting;
    • third, machine learning models train on the resulting labeled datasets (and validate their performance on subsets of the datasets set aside for testing);
    • fourth, after you have the model, you need to deploy it for inference in the real world; this part often involves deploying the models in low-resource environments or trying to minimize latency.

    The vast majority of the thousands of papers published in machine learning deal with the “Training” phase: how can we change the network architecture to squeeze out better results on standard problems or solve completely new ones? Some deal with the “Deployment” phase, aiming to fit the model and run inference on smaller edge devices or monitor model performance in the wild.

    Still, any machine learning practitioner will tell you that it is exactly the “Data” and (for some problems especially) “Annotation” phases that take upwards of 80% of any real data science project where standard open datasets are not enough. Will these 80% turn into 99% and become a real bottleneck? Or have they already done so? Let’s find out.

    Are Machine Learning Models Hitting a Wall?

    I don’t have to tell you how important data is for machine learning. Deep learning is especially data-hungry: on a small dataset you can usually achieve better results with “classical” machine learning techniques, but as datasets grow, the flexibility of deep neural networks, with their millions of weights free to choose any kind of features to extract, starts showing. This was a big part of the deep learning revolution: even apart from all the new ideas, as the size of available datasets and the performance of available computational devices grew, neural networks began to outperform the competition.

    By now, large in-house datasets are a major part of the advantage that industry giants such as Google, Facebook, or Amazon have over smaller companies (image taken from here):

    I will not tell you how to collect data: it is highly specific to every problem and domain. However, data collection is only the first step; after collection comes data labeling. This is, again, a big part of what separates industry giants from other companies. More than once, I have had big companies tell me that they have plenty of data; and indeed, they had collected terabytes of important information… but the data was not properly curated, and most importantly, it was not labeled, which immediately made it orders of magnitude harder to put to good use.

    Throughout this series, I will be mostly taking examples from computer vision. For computer vision problems, the required labeling is often very labor-intensive. Suppose that you want to teach a model to recognize and count retail items on a supermarket shelf, a natural and interesting idea for applying computer vision in retail; here at Synthesis AI, we have had plenty of experience with exactly this kind of project.

    Since each photo contains multiple objects of interest (actually, a lot, often in the hundreds), the basic computer vision problem here is object detection, i.e., drawing bounding boxes around the retail items. To train the model, you need a lot of photos with labeling like this:

    If we also needed to do segmentation, i.e., distinguishing the silhouettes of items, we would need a lot of images with even more complex labeling, like this:

    Imagine how much work it is to label a photo like this by hand! Naturally, people have developed tools to help partially automate the process. For example, a labeling tool may suggest a segmentation produced by some general-purpose model, and you are only supposed to fix its mistakes. But it can still take minutes per photo, and the training set for a standard segmentation or object detection model should have thousands of such photos. This adds up to human-years and hundreds of thousands, if not millions, of dollars spent on labeling alone.

    There exist large open datasets for many different problems, segmentation and object detection included. But as soon as you need something beyond the classes and conditions that are already well covered in these datasets, you are out of luck; ImageNet does have cows, but not shot from above with a drone.

    And even if the dataset appears to be tailor-made for your problem, it can contain dangerous biases. For example, suppose you want to recognize faces, a classic problem with large-scale datasets freely available. But if you want to recognize faces “in the wild”, you need a dataset that covers all sorts of rotations for the faces, while standard datasets mostly consist of frontal pictures. In the picture below, taken from (Zhao et al., 2017), on the left you see the distribution of face rotations in IJB-A (a standard open large-scale face recognition dataset), and the picture on the right shows how it probably should look if you really want your detection to be pose-invariant:

    To sum up: current systems are data-intensive, data is expensive, and we are hitting the ceiling of where we can go with already available or easily collectible datasets, especially with complex labeling.

    Possible Solutions

    So what’s next? How can we solve the data problem? Is machine learning heading towards a brick wall? Hopefully not, but it will definitely take additional efforts. In this series of posts, we will find out what researchers are already doing. Here is a tentative plan for the next installments, laid out according to different approaches to the data problem:

    • first, we will discuss models able to train on a very small number of examples or even without any labeled training examples at all (few-shot, one-shot, and even zero-shot learning); usually, few-shot and one-shot models learn a representation of the input in some latent space, while zero-shot models rely on some different information modality, e.g., zero-shot retrieval of images based on textual descriptions (we’ll definitely talk about this one, it’s really cool!);
    • next, we will see how unlabeled data can be used to pretrain or help train models for problems that are usually solved with labeled datasets; for example, in a recent work Google Brain researchers Xie et al. (2019) managed to improve ImageNet classification performance (not an easy feat in 2019!) by using a huge dataset of unlabeled photos as additional information; such approaches require a lot of computational resources, “trading” computation for data;
    • and finally, we will come to the main topic of this blog: synthetic data; for instance, computer vision problems can benefit from pixel-perfect labeling that comes for free with 3D renderings of synthetic scenes (Nikolenko, 2019); but this leads to an additional problem: domain transfer for models trained on synthetic data into the real domain.

    Let’s summarize. We have seen that modern machine learning is demanding larger and larger datasets. Collecting and especially labeling all this data becomes a huge task that threatens future progress in our field. Therefore, researchers are trying to find ways around this problem; next time, we will start looking into these ways!

    Sergey Nikolenko
    Head of AI, Synthesis AI