Blog

  • AI Interviews: Victor Lempitsky


    Meet our distinguished guest for the third interview: Professor Victor Lempitsky. Prof. Lempitsky is one of the top researchers in machine learning, ranking especially highly in computer vision (here is his Google Scholar account). Currently Victor leads the Computer Vision Group at Skoltech (Skolkovo Institute of Science and Technology) and is the VR project leader at Yandex.

    Foreword. Before we begin, I have to say that this interview was composed before February 24, 2022. In fact, it was finalized on February 22, so by now it is almost half a year old. This is the reason why Q6 may look a little strange these days—we were not dancing around the elephant in the room, it simply had not entered yet. By now, Victor has left both positions mentioned in the preamble and is currently working on a new startup in the AR/VR field.

    Q1. Hello Victor, and welcome to our interview! Computer vision is your major focus, so let me start off immediately with the obligatory question for our blog: what is your general view on synthetic data for computer vision? Do you agree that synthetic data, understood as artificially generated labeled data used to train machine learning models, can be a feasible way out of the data problem for computer vision? Or do you place more faith in other possible approaches that we’ve previously discussed on this blog: augmentations, mixup and self-adversarial training, few- and zero-shot learning, adding unlabeled data, and others?

    I do believe in synthetic data, and several recent projects I was involved with have seen clear benefits from using synthetic data. However, most useful synthetic data are modeled from the real world. Such modeling can benefit strongly from unsupervised learning. So, in the end, there is no dichotomy: I believe in the usefulness of synthetic data, which is enriched/created from real unlabeled data. Augmentations, mixup, and adversarial training can all be used as ways to generate useful synthetic data from real data, even though people do not always think about augmentations in this way.

    Q2. Much of your most recent work is devoted to image generation. You have created GANs that work without convolutions or self-attention, neural renderers that can dress 3D avatars and generate semi-transparent objects, GANs that generate timelapse videos of landscapes, and much more. In particular, you often work on 3D generation—generating meshes, textures, point clouds—which is the obvious next step after learning to generate flat images. 3D generation is only starting to work well enough for practical applications, but still, the rate of progress in this field is spectacular. I usually show this picture in my lectures on GANs:

    Do you expect 3D generation to undergo similarly explosive growth in the near future? Or are there conceptual difficulties that need to be resolved before we get the virtual reality Metaverse generated on the fly with GANs?

    The picture you show is indeed very telling, and it reflects and conflates several trends: improvements in algorithms, improvements in computational resources, and improvements in datasets. 

    Given how many bright people are now working on 3D data synthesis, I believe that fast progress in algorithms is inevitable. Neural renderers such as PyTorch3D or nvdiffrast are certainly one piece of the puzzle. Computational resources are trickier, and a lot of progress will be bottlenecked by them, so I naturally expect that the main breakthroughs will come from the “big four” of NVidia, Google/DeepMind, Meta, and Microsoft (all four have brilliant researchers but also huge computational resources). This was to a large degree true even for 2D image generation, and will likely remain even more true for 3D. Note that I am not saying that everybody else should either join those corporations or work on something else. Just like StyleGAN(s) from NVidia created a whole vibrant ecosystem of researchers from different institutes building on top of it, the same will likely happen with 3D.

    The main bottleneck for progress in 3D data synthesis, however, is (and will be) datasets. Here things are very different from 2D. With 2D, once algorithms and resources were ready, finding good enough datasets for learning was relatively easy. Note that here I am talking about 2D static image generation; good datasets of HD videos are much harder to get: say, YouTube is largely not HD quality, and it is quite a challenge to scrape video datasets of objects or people in high resolution from YouTube. Getting good and large 3D datasets is much harder, especially if we are talking about “full 3D” and not just 2.5D (i.e., color + depth) or toyish 3D models. Currently, quite a few researchers are trying to bypass this lack of datasets and to learn 3D synthesis by matching the 2D images. To this end, they insert 2D projections into their generation learning pipelines. This is surely interesting and could be fruitful, but is inevitably much harder. Just imagine someone trying to learn StyleGAN-like image synthesis while only having access to a dataset of 1D projections such as row sums or one-pixel slices.

    To sum up, I think that the rate of progress in 3D data synthesis will be limited and conditioned on the quality of 3D datasets. Hence, it will be a harder and longer story than with 2D (but no less interesting!).

    Q3. Let us continue from the last question, taking generative models yet further into the realm of speculation. I have always viewed image and 3D generation as an inherently finite task. It has not been easy to scale GANs up, but it seems like progress is inevitable. And human eyes have a finite resolution after all (be it 8K, 32K, or 256K), so the models will sooner or later reach this resolution with photorealistic quality, and there will be no point in moving any further.

    Do you agree with this view, and if yes, when do you expect image and 3D scene generation to hit this ceiling and provide a perfectly immersive experience? (Let’s limit this question to vision, I understand that full immersion will require other senses as well.)

    Let me start by noting that the story with 2D image generation is far from over, even if one can generate very realistic human faces. First of all, GANs still have limited diversity and mode coverage (otherwise we would not have dozens of interesting papers on StyleGAN inversion, and very simple approaches would do the job). Diffusion models are better than GANs at covering the whole distribution but are still extremely slow. Furthermore, even though GAN samples for faces are realistic, GAN samples for full body human images or, say, for full body cats are either significantly less realistic or significantly less diverse (or both). Finally, for 2D video synthesis, we as a community are very far from truly realistic results (at least in the unconditional setting).

    Regarding 3D, the situation is even harder for the reasons I discussed in the answer to the previous question, so I do not expect perfect photorealism there for quite a few years.

    Q4. Now let me ask a (slightly more) technical question that I’ve been interested in for a long time. Your two most cited papers according to Google Scholar are “Unsupervised domain adaptation by backpropagation” (joint work with Yaroslav Ganin) and its continuation and extension, “Domain-adversarial training of neural networks” (with a lot of people including, e.g., Hugo Larochelle). They are also, in my opinion, some of the most relevant for synthetic data because they present a simple and ingenious domain adaptation method.

    We have just discussed the basic idea of Ganin and Lempitsky (2015) on this blog, so I’ll be very brief in explaining it. The idea goes as follows: suppose you want to have a model that works for both synthetic and real data (or any two domains, really). You want to train a feature extractor that will extract features independently of the domain, so that, say, a synthetic face will have the same features extracted as its real counterpart, and models trained with these features on synthetic data can be applied to real data. To achieve this, you add a domain classifier that predicts whether the input was a synthetic or a real image based on the extracted features. You want that classifier to fail, just like you want the discriminator to fail in GANs. So you train it as another head of your network, but the gradients of its classification error are reversed on the way back, so the feature extractor is optimized in the opposite direction. In the illustration below (taken from your papers), the classifier wants to minimize its loss Ld, but by the time the gradient reaches the feature extractor, its sign is flipped, and the extractor is actually maximizing the loss.

    My question here is two-fold. First, I explained your idea in terms of synthetic and real images, and the actual papers also present examples of synthetic-to-real transfer, but only for small images. Have there been attempts to apply this to larger-scale domain adaptation, especially synthetic-to-real, and how successful have they been?

    Second, domain-adversarial training sounds like a very general idea that could be applicable more widely than just to domain adaptation. One cannot say this idea is not widely known: both papers have thousands of citations, including foundational works on GANs. But why haven’t GANs switched to gradient reversal instead of alternating training between the generator and discriminator? Are there some hidden problems here that are not evident in the basic idea?

    On your first question, indeed the approach has become popular, and there has been a lot of follow-up work including applications to large images. Just as with small images, the approach works somewhat there but without miracles: it usually beats the no-adaptation baseline quite confidently, but, of course, does not solve the domain gap problem completely. For the second question, indeed almost all GANs separate the steps for the generator and the discriminator updates and do not reuse the gradient. The main reason, I believe, is that most modern GANs use slightly different functionals as objectives for the generator and the discriminator. In particular, it turns out that to get the best GAN performance, it is useful to have some form of the so-called non-saturating objective for the discriminator, and also to regularize the discriminator quite strongly with a proper regularizer (and details of such regularization matter a lot). So, when your generator and discriminator are trying to optimize slightly different functionals, gradient reuse becomes highly non-trivial and is therefore not used.

    Just to clarify, for me the difference between gradient reversal and GANs is not a big deal. Actually, we learned about the GAN arXiv report halfway through the project, and by that time we had settled on the idea and the language of “gradient reversal”. This is why we explained our approach in a slightly different way in our paper, and perhaps connected it to GANs in a less clear way than we should have (but back in early 2015 it was way less obvious that GANs would become such a dominant idea).

    Q5. Another recent work of yours introduces Cloud Transformers, special architectures for processing point clouds that use ideas similar to self-attention blocks, with excellent results in point cloud segmentation, inpainting, and reconstruction tasks.

    Since their inception in 2017, Transformers have taken deep learning by storm. They started by basically replacing all other embeddings in natural language processing and serving as the basis for the very best language models, but now they are all over computer vision as well, ever expanding their reach, as your own work suggests. It looks a bit like how deep learning gradually took over every field in the early 2010s.

    Do you have an explanation for this success? I understand how a Transformer works mathematically, but is there any explanation why self-attention proves to be such a good idea in practice?

    Or maybe it’s just an umbrella term for a specific useful trick, and otherwise modern Transformers are very different from each other? In your paper, you keep using words such as “variant” or “reminiscent”, and the architecture indeed doesn’t look much like Vaswani’s original. What is that core idea that makes an architecture a Transformer, and again, why, in your opinion, does it work so well?

    Well, it is hard to argue against the claim that Transformers are the most exciting and impactful thing that has happened in deep learning in recent years. What is most exciting about Transformers is their universality. True, we are still witnessing the competition between vision transformer variants and ConvNet architectures for the title of “the king of ImageNet”. But what is remarkable and makes many people excited is that very similar Transformer architectures can solve very different tasks across very different modalities (images, audio, text, action planning, etc.) with near state-of-the-art quality. Certainly, it feels like the right thing, as our brains also have remarkable plasticity and can repurpose different parts between modalities.

    Our cloud transformers paper will obviously be far less impactful compared to the original transformers, but I still like it very much. Our architecture is similar to “classical” transformers in some ways. E.g. it treats individual points as elements within an unordered set, and our key layer uses multiple processing heads. There are also differences (our equivalent of attention is sparse, and we use convolutions). Still, what I liked about our results is that essentially the same architecture is able to solve very different point cloud processing tasks. This is again reminiscent of the general transformer idea. 

    Q6. And finally a (slightly) more personal question. Anyone who knows you personally or at least follows you knows you feel strongly about the ethical use of AI. There is a growing concern in the computer vision community about the ethical usage of CV technologies. For instance, the creator of the YOLO object detectors, Joseph Redmon, quit computer vision in early 2020 and famously explained his decision as follows: “I stopped doing CV research because I saw the impact my work was having. I loved the work but the military applications and privacy concerns eventually became impossible to ignore.”

    What is your view on the ethical concerns that arise in modern computer vision? Are researchers responsible for potentially unethical uses of their results? I suppose there is no way to stop progress, but do you think there may be ways to ensure that progress works for the benefit of humanity and not against it? What would you advise to work on if one wanted to achieve this goal?

    I had a small project on person re-identification (mostly from surveillance cameras) with my PhD student back in 2016, and after one year or so we stopped. I do not think we pushed state-of-the-art in video surveillance that much, and the reviewers for the submissions we made on the subject concurred with that :). It is the only example where, in retrospect, I sleep slightly better because my work did not make an impact. 

    Having said that, some of the good and well-meaning people that I know still work on face recognition and camera-based surveillance, and I do not want to judge them. After all, the camera-based surveillance technology is double-edged. It will most likely benefit strong democratic societies by making life there safer and more convenient, but it will make life in authoritarian and totalitarian societies considerably worse, which we are already starting to witness in Russia and other countries. The same actually goes for AI and automation issues. The net effect will be strongly positive, people will live more meaningful and productive lives with more interesting occupations, but the dystopian scenarios will also materialize in some societies. 

    As always, stopping progress is impossible, even if many strong researchers, including Joe Redmon, quit the area. Progress in AI-based surveillance and automation “simply” calls for better and stronger political institutions. And the faster the progress, the more urgent the call. I know this all sounds like I am trying to push the responsibility from AI researchers onto others (civil society and politicians), but I am just being honest and realistic. The best thing that we (researchers) can and must do is to inform the general public about the current state of the art and reasonable projections for the future.

    Victor, thank you very much for your answers! And you, dear reader, stay tuned for our next interviews!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • CVPR ‘22, Part II: New Use Cases for Synthetic Data


    Last time, we started a new series of posts: an overview of papers from CVPR 2022 that are related to synthetic data. This year’s CVPR has over 2000 accepted papers, and many of them touch upon our main topic on this blog. In today’s installment, we look at papers that make use of synthetic data to advance a number of different use cases in computer vision, along with a couple of very interesting and novel ideas that extend the applicability of synthetic data in new directions. We will even see some fractals as synthetic data! (image source)

    Introduction and the Plan

    In the first post of this series, we talked about new synthetic datasets in computer vision. This post is only superficially different from the first one: here we will consider papers that apply synthetic data to various practical use cases, concentrating more on the downstream task than on synthetic data generation. However, the generation part here is also often interesting, and we will definitely discuss it.

    I will also take this opportunity to discuss two very interesting developments related to synthetic data. First, we will see that synthetic images do not have to be realistic at all to be helpful for training even state-of-the-art visual Transformers, and it turns out that this has a lot to do with fractals. In the last part, we will see how synthetic data helps to automatically fill in the gaps and provide missing data for few-shot learning. But before that, we will see several use cases where synthetic data has helped solve practical computer vision problems. Among these use cases, today we do not consider papers that help generate synthetic data and papers that deal with generating or modifying virtual humans—these will be the topics for later posts.

    Just like last time, I remind you that we have launched OpenSynthetics, a new public database of all things related to synthetic data. In this post, I will again give links to the corresponding OpenSynthetics pages.

    Eyeglass removal

    In “Portrait Eyeglasses and Shadow Removal by Leveraging 3D Synthetic Data” (OpenSynthetics), Lyu et al. consider an interesting image manipulation problem: removing glasses from a human face. While solving this problem is desirable for applications such as face verification or emotion recognition, eyeglasses are very tricky objects for computer vision: they are mostly transparent but can cast shadows and introduce other complex effects in the image. The model constructed in this work consists of two stages: a cross-domain segmentation network predicts segmentation masks of the glasses and the shadows they cast (this part is trained adversarially so that the features extracted from real and synthetic data are indistinguishable), and then “de-shadow” and “de-glass” networks remove both:

    The whole thing is trained on a mixture of synthetic data and the CelebA dataset (real data), and the authors report much improved results for eyeglass removal:
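
    Schematically, inference in such a two-stage model might look like the sketch below; the module names and the mask handling are my own illustration rather than the authors' actual code, which is more elaborate.

    ```python
    import torch

    def remove_eyeglasses(portrait, seg_net, deshadow_net, deglass_net):
        """portrait: [1, 3, H, W] image tensor; the three networks are assumed pretrained.

        Stage 1 predicts masks for the glasses and the shadow they cast; stage 2 first
        removes the shadow and then the glasses, each network conditioned on its mask.
        """
        glass_mask, shadow_mask = seg_net(portrait)
        deshadowed = deshadow_net(torch.cat([portrait, shadow_mask], dim=1))
        return deglass_net(torch.cat([deshadowed, glass_mask], dim=1))
    ```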

    This system is the main point of the paper, but for me, it was also interesting to read about their synthetic data generation pipeline. Starting from 3D models of eyeglasses and 3D face models, they manually label four nodes where the glasses attach to the face: two fixed nodes on the temples and two floating points on the nose, “floating” meaning that these two points can drift to produce different positions of glasses on the nose. With these four nodes specified, the system can solve for the pose of the glasses and combine them with the face; the authors then proceed to standard rendering in Blender, also generating the masks for the glasses and their shadows to train the segmentation model:

    And the results are really impressive. Here are some real examples (perhaps cherry-picked, but who cares?..) from the paper:

    Crowd counting

    The work “Leveraging Self-Supervision for Cross-Domain Crowd Counting” by Liu et al. (OpenSynthetics) deals with a very straightforward application of synthetic data. Crowd counting is a natural use case: it is very hard to label every person on a crowd photo, and using real images raises privacy issues since it is usually impossible to get the consent of everybody in a real-world crowd.
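
    For context, crowd counting models are usually trained to regress a density map built from point annotations of heads; its integral over the image gives the count. Here is a minimal sketch of how such ground truth is typically constructed (a generic recipe, not the exact one used for GCC or by Liu et al.):

    ```python
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def density_map(head_points, height, width, sigma=4.0):
        """Place a unit impulse at every annotated head and smooth with a Gaussian.

        The sum of the resulting map equals the number of annotated people, which is
        what a crowd counting network learns to regress.
        """
        impulses = np.zeros((height, width), dtype=np.float32)
        for x, y in head_points:
            if 0 <= int(y) < height and 0 <= int(x) < width:
                impulses[int(y), int(x)] += 1.0
        return gaussian_filter(impulses, sigma=sigma)

    # density = density_map([(120, 80), (130, 85)], height=480, width=640)
    # density.sum() is (approximately) the crowd count.
    ```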

    Indeed, there already exists a large synthetic dataset for crowd counting called GCC (Wang et al., 2019) with over 7.6 million people labeled on over 15K synthetic images. This dataset was produced with the Grand Theft Auto V engine, that is, the Rockstar Advanced Game Engine (RAGE), together with the Script Hook V library that allows extracting labels from RAGE. Here are two sample images from the paper, a real crowd on the left and a synthetic one on the right:

    Liu et al. use GCC for training and supplement it with unlabeled real images to cope with the domain shift, with a couple of new tricks designed to improve crowd density estimation (such as accounting for perspective, since crowd density appears higher at the top of an image like the one above than at the bottom). They obtain significantly improved results compared to other domain adaptation approaches; here are a couple of samples (the ground truth crowd density map is in the middle, and the estimated density map is on the right, together with the estimated number of people):

    This is an interesting use case for us since it can be read as reaching largely the same conclusions as we did in our recent white paper: if done right, relatively simple combinations of synthetic and real data can work wonders. It is encouraging to see such approaches appear at top venues such as CVPR: I guess synthetic data does just work.

    Formula-driven supervised learning for pretraining visual Transformers

    And now we proceed from state-of-the-art but still quite straightforward applications to something much stranger and, in my opinion, more interesting. First, a very unusual application of synthetic data that requires a little bit of context. In 2020, Kataoka et al. presented a completely new approach to training convolutional networks called Formula-Driven Supervised Learning (FDSL). They automatically generate image patterns by associating image classes with analytically defined fractal categories. How to do that is a separate and quite difficult problem, but the important thing is that after this transformation, you get a family of fractals for each image category. Here is an illustration from Kataoka et al.:

    As you can see, synthetic fractal images are far from realistic, but they capture some of the patterns characteristic of a given class and hence can be used to pretrain deep learning models; as usual with synthetic data, one can generate an endless stream of new samples from these fractal families. This pretraining does not make training on real images unnecessary but can improve the final results.
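
    The fractal categories here are defined by iterated function systems (IFS): a category is a set of affine maps, and individual samples are rendered with the chaos game. Below is a toy sketch of this idea, with arbitrary parameter choices of my own; it is not the authors' generation code.

    ```python
    import numpy as np

    def sample_ifs(num_maps=4, rng=np.random.default_rng(0)):
        """A 'category' is a set of random affine maps (an iterated function system)."""
        maps = []
        for _ in range(num_maps):
            A = rng.uniform(-1, 1, size=(2, 2))
            A *= 0.8 / max(np.linalg.norm(A, 2), 1e-6)  # keep each map contractive
            maps.append((A, rng.uniform(-1, 1, size=2)))
        return maps

    def render_fractal(ifs, num_points=100_000, size=256, rng=np.random.default_rng(1)):
        """Chaos game: repeatedly apply a randomly chosen map and rasterize the points."""
        x = np.zeros(2)
        img = np.zeros((size, size), dtype=np.uint8)
        for _ in range(num_points):
            A, b = ifs[rng.integers(len(ifs))]
            x = A @ x + b
            px = ((x + 8.0) / 16.0 * (size - 1)).astype(int)  # map roughly [-8, 8]^2 to pixels
            if 0 <= px[0] < size and 0 <= px[1] < size:
                img[px[1], px[0]] = 255
        return img

    # Each IFS plays the role of a class; perturbing its parameters yields new "samples".
    image = render_fractal(sample_ifs())
    ```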

    Well, in 2022 Kataoka et al. made the next step (OpenSynthetics), moving from CNNs to visual Transformers. They developed new techniques for synthetic image generation, including a new dataset of image families focused on contours. It turned out that visual Transformers pay most attention to the contours anyway, so even a textureless image is helpful for pretraining:

    And visual Transformers perform better when they are pretrained on images like this one instead of real photos! For example, the authors report that ViT-Base pre-trained on ImageNet-21k showed 81.8% top-1 accuracy after fine-tuning on ImageNet-1k, while the same model with FDSL shows 82.7% top-1 accuracy when pre-trained under the same conditions.

    In my opinion, this is a very interesting direction of study. Apart from its direct achievements, it also shows that synthetic-to-real domain shift is not necessarily a bad thing, and if the data is generated in the right way, trying to achieve photorealism may not be the right way to go.

    Synthetic Representative Samples for Few-Shot Learning

    This last paper for today is a little bit of a stretch to call synthetic data, but it’s another interesting idea that may have applications for synthetic data generation as well. Last time, we discussed BigDatasetGAN, a generative model able to create images already labeled for semantic segmentation. This may be one of the first steps towards solving the labeling problem for generated data: until the works on DatasetGANs, nobody could generate labeled data, so nobody could use generative models to directly generate useful synthetic images.

    If we are talking about classification rather than segmentation, it looks much easier to sidestep this issue: ever since BigGAN, generative models could produce realistic-looking images in many different categories. But this raises another question: to train a generative model we need a dataset in this category, so why don’t we just take this dataset to train on instead of generating new samples?

    The work “Generating Representative Samples for Few-Shot Classification” (OpenSynthetics) by Xu and Le, a collaboration between Stony Brook University and Amazon, finds a new use case where this kind of conditional generation can be useful. The basic idea is as follows: in few-shot learning, say for image classification, one usually trains a feature extractor on a dataset with plenty of labeled data (but the wrong classes) and then adapts it to new classes by estimating a prototype sample. Then this sample can be used for classification; here is an illustration for few-shot and zero-shot classification via prototypes from a classical paper by Snell et al. that started this field:

    This illustration works in the latent space of features produced by some kind of encoder.

    But this prototype-based idea has a drawback: it is hard to find a representative prototype if all you have are a few samples. Even if you have a perfect encoder that produces smooth and wonderfully separated Gaussians for every class, these Gaussians have a core of central representative samples and also non-representative samples that are further from the center:

    And if we base a classifier on a single prototype that turns out to be non-representative, the results can be far from perfect. Here is an illustration from an ICLR 2021 paper by Yang et al.:

    But how do we achieve this kind of calibration? Xu and Le propose—and this is where the relation to synthetic data comes into play—to generate representative samples from a variational autoencoder. It is common to use conditional VAEs to learn to extract representative features from images, but this time the cVAE is restricted to produce only representative, central examples of a class (feature vectors close to the center of a Gaussian) via sample selection:

    Note the semantic embedding a: this is where the new samples will come from. For a new class, the authors take its semantic embedding, plug it into this VAE’s decoder, and generate representative samples for the new class. Then the resulting generated prototype is either mixed with actual samples (in few-shot classification) or not (in zero-shot classification), with improved results on miniImageNet and tieredImageNet.
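
    To make the prototype idea concrete, here is a minimal sketch of nearest-prototype classification and of the calibration step: blending the few-shot prototype with generated "representative" features. The function names and the simple averaging scheme are illustrative assumptions, not the authors' exact implementation.

    ```python
    import torch

    def class_prototypes(support_feats, support_labels, num_classes):
        """Prototype = mean feature vector of the support samples of each class."""
        return torch.stack([
            support_feats[support_labels == c].mean(dim=0) for c in range(num_classes)
        ])

    def calibrate_with_generated(prototypes, generated_feats, alpha=0.5):
        """Blend few-shot prototypes with the mean of generated representative features.

        generated_feats: [num_classes, num_generated, dim] feature vectors, e.g. decoded
        from a cVAE conditioned on each class's semantic embedding.
        """
        return alpha * prototypes + (1 - alpha) * generated_feats.mean(dim=1)

    def classify(query_feats, prototypes):
        """Nearest-prototype classification by Euclidean distance in feature space."""
        dists = torch.cdist(query_feats, prototypes)  # [num_queries, num_classes]
        return dists.argmin(dim=1)
    ```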

    This is definitely a non-representative example of a paper on synthetic data: the “data” is actually in feature space, and the problem is image classification rather than anything with complicated labeling. But this direction, dating back at least to 2018 (Verma et al., CVPR 2018), is an interesting tangent to our space, and just like DatasetGAN, it goes to show a way in which generative models may prove useful for synthetic data generation.

    Conclusion

    In this post, the second in the CVPR ‘22 series, we have discussed several use cases of synthetic data that have been advanced at the conference, starting from straightforward applications such as eyeglass removal and crowd counting and progressing to less obvious ideas of how deep generative models and even regular mathematical models such as fractals can help produce synthetic data useful for machine learning. Next time, we will discuss a more specific use case related to synthetic humans; stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • CVPR ‘22, Part I: New Synthetic Datasets


    CVPR 2022, the largest and most prestigious conference in computer vision and one of the most important ML venues in general, has just finished in New Orleans. With over 2000 accepted papers, reviewing the contributions of this year’s CVPR appears to be a truly gargantuan task. Over the next series of blog posts, we will attempt to go over the most interesting papers directly related to our main topic: synthetic data. Today, I present the first but definitely not the last installment devoted to papers from CVPR 2022.

    New Synthetic Datasets: Beyond Images

    As always, CVPR is large, and it contains multitudes, but this year one of the main topics is neural radiance fields (NeRF). These models seem to be the new GANs today, or, better to say, the new visual Transformers, which were in turn the new GANs a couple of years ago. We view image synthesis, especially controlled synthesis with 3D information, as a key idea that can propel synthetic data forward, so I plan to devote several upcoming posts to recent NeRF advancements.

    But in this series, let me begin with more straightforward applications of synthetic data that have found their way into the CVPR program this year. On the list today we have several new synthetic datasets, usually related to specific use cases of synthetic data; many of them touch upon problems that we have already discussed on this blog but some introduce entirely new avenues for research.

    Synthetic data is a well-established field, and this blog has already documented many of its achievements. By now, it is not enough to just generate a new synthetic dataset to get to a top conference like CVPR (to be honest, it was never enough): you need some twist on the tried-and-true formula of “make or obtain 3D CG models, render images, train CV models, profit”. In this section, let us see what new twists CVPR 2022 has brought.

    And one more thing before we begin: we have recently made public a new database that will gradually collect all things related to synthetic data. It is called OpenSynthetics, and it already has quite a lot of content on synthetic datasets, papers, and code repositories related to synthetic data. So in these review posts, I will also give links to the corresponding OpenSynthetics pages.

    BigDatasetGAN: Generating ImageNet1K with Labels

    It had always been common wisdom that GANs, despite their excellent image generation quality and usefulness for synthetic-to-real refinement, cannot really help with data generation from scratch: there was no way to generate labeled data and no easy way to label generated images. Basically, ever since ProGAN and BigGAN (OpenSynthetics; both released in 2018) you could use GANs to generate new realistic images with sufficient quality, but you would still have to label them afterward as if they were just new images. And this has always meant that GANs are useless for synthetic data generation: we have never lacked new images of ImageNet categories; the bottleneck has always been in the labeling.

    Well, it looks like there is a way to generate labeled data now! This research direction, driven by NVIDIA researchers, bore its first fruit last year when Zhang et al. presented DatasetGAN at CVPR 2021. Their pipeline works as follows: use StyleGAN to generate several images (say, cars), hand-annotate a few of them for your task (say, segmentation of various car parts), and train a very small model (a style interpreter) to produce similar segmentation masks from StyleGAN features. At the cost of labeling a few images (literally, a few: DatasetGAN required 16 labeled heads or about 1000 polygons), you can use StyleGAN to generate as many labeled images as you wish, with the usual excellent StyleGAN quality:
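
    Here is a rough sketch of the style interpreter idea: collect feature maps hooked from several generator layers, upsample and concatenate them, and train a tiny per-pixel classifier to predict segmentation labels. Shapes and names are illustrative assumptions; this is not NVIDIA's implementation.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StyleInterpreter(nn.Module):
        """Tiny per-pixel MLP on top of upsampled, concatenated generator features."""

        def __init__(self, feature_dim, num_classes):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(feature_dim, 128), nn.ReLU(),
                nn.Linear(128, num_classes),
            )

        def forward(self, feature_maps, out_size):
            # feature_maps: list of [1, C_i, H_i, W_i] tensors hooked from generator layers
            upsampled = [F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
                         for f in feature_maps]
            pixels = torch.cat(upsampled, dim=1)         # [1, sum(C_i), H, W]
            pixels = pixels.squeeze(0).permute(1, 2, 0)  # [H, W, sum(C_i)]
            logits = self.mlp(pixels)                    # [H, W, num_classes]
            return logits.permute(2, 0, 1).unsqueeze(0)  # [1, num_classes, H, W]

    # Training: generate a handful of images, hand-label their masks, and fit the
    # interpreter with per-pixel cross-entropy; the generator itself stays frozen.
    ```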

    At this year’s CVPR, Li et al. continued this line of research and introduced BigDatasetGAN, based on BigGAN instead of StyleGAN. The difference is that BigGAN is better suited for generating a wide variety of different image categories, so now you can hand-label 8000 images, 8 for each category, and have a single model able to produce images in all 1000 ImageNet1K categories, pre-labeled for segmentation:

    The authors report results improved over supervised pretraining for standard segmentation models.

    Does this mean that synthetic data is soon to be absorbed into deep generative models? Time will tell, but I am not sure: generative models are still hard to train, and this approach requires an operational large-scale GAN with the desired categories before we go into labeling. Moreover, DatasetGANs deal only with segmentation so far, and I have my reservations about more complex labeling such as depth. Still, this is an exciting development that shows the power of modern generative models, and its results provide a set of completely new tools for the arsenal of synthetic data generation.

    ABO: Real-World 3D Object Understanding

    ABO stands for Amazon Berkeley Objects (OpenSynthetics), a new indoor environment and object dataset presented in the work by Collins et al., who are, you guessed it, researchers from UC Berkeley and Amazon. ABO answers the same need as the classical but sadly unavailable SunCG dataset, ShapeNet, or Facebook AI Habitat: it provides a large-scale catalogue of 3D models of indoor household objects—chairs, shoes, coat hangers, rugs, tables, and so on—that can be placed in a variety of indoor environments with available renderings.

    Since Amazon is… well, Amazon, ABO is based on product listings: the dataset contains nearly 150K listings of 576 product types with hi-res photos and over 8000 turntable “360° view” images. It also includes nearly 8000 handmade high-quality 3D models of various objects. Moreover, and this is unique to ABO, the objects come with attributes that identify their material, which is useful for physically-based rendering:

    The authors show that training on ABO leads to better results than training on ShapeNet for state-of-the-art 3D reconstruction models. They also introduce a new task that has been enabled by their work, material estimation, and present novel network architectures for this task. In general, this is an impressive effort, and I hope that it will enable many new works in 3D scene understanding, indoor navigation, and related fields:

    ObjectFolder 2.0: A Multisensory Object Dataset

    While ABO provides some information about the material of the object, it is far from exhaustive. Stanford researchers Gao et al. attempt a far more ambitious task in their new ObjectFolder 2.0 dataset (OpenSynthetics): they aim to model complete multisensory profiles of real objects. This means that they aim to capture not only the 3D shape and material of an object (and therefore its texture) but also other sensory modalities, including audio (how a cup clinks when you touch it with a spoon) and how the object feels to the touch. This information can later be used for problems such as contact localization (where exactly have I touched this object?) that are both difficult and important in robotics:

    Since all of these modalities are location-dependent, they cannot all be explicitly stored in the dataset. The authors use implicit neural representations, that is, each object is defined by a few neural networks (multilayer perceptrons) that are trained to convert coordinates into whatever is necessary; VisionNet models the neural scattering function, AudioNet models the location-specific part of the audio response from applying a unit force to this location, while TouchNet predicts the deformation map and tactile image (geometry of the contact surface):

    ObjectFolder 2.0 contains these representations for 1000 household objects such as cups, chairs, pans, vases, and so on.
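
    At the core of such implicit representations is nothing more than a coordinate MLP: a small network that maps a query location to the quantity of interest. Below is a toy sketch of this idea; the real ObjectFolder networks have extra conditioning and structure, and the dimensions here are made up.

    ```python
    import torch
    import torch.nn as nn

    class CoordinateMLP(nn.Module):
        """Maps a 3D query point on the object to a modality-specific output vector.

        For example, out_dim could parameterize the audio response at that point
        (AudioNet-like) or a local deformation/tactile descriptor (TouchNet-like).
        """

        def __init__(self, in_dim=3, hidden=256, out_dim=64, depth=4):
            super().__init__()
            layers, dim = [], in_dim
            for _ in range(depth):
                layers += [nn.Linear(dim, hidden), nn.ReLU()]
                dim = hidden
            layers.append(nn.Linear(dim, out_dim))
            self.net = nn.Sequential(*layers)

        def forward(self, xyz):
            return self.net(xyz)

    touch_net = CoordinateMLP(out_dim=32)      # illustrative output size
    response = touch_net(torch.rand(128, 3))   # query 128 surface points
    ```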

    Gao et al. test their dataset with three downstream tasks that require multimodal sim2real object transfer: object scale estimation based on vision and audio, contact localization based on audio and tactile response, and shape reconstruction based on visual and tactile data. They report improved performance across all tasks, and this dataset indeed looks like a possible next step for object manipulation in robotics.

    Articulated 3D Hand-Object Pose Estimation

    Pose estimation is a classical computer vision problem; as in all problems related to the understanding of the 3D world from 2D images, synthetic data comes to mind naturally: it is impossible to do exact manual labeling for pose estimation, and even inexact human labeling is very laborious. This goes double for more detailed tasks such as hand pose estimation, so it is no wonder that there exist synthetic datasets for this problem; in particular, here at Synthesis AI we have a variety of hand gestures as part of our HumanAPI.

    In “ArtiBoost: Boosting Articulated 3D Hand-Object Pose Estimation via Online Exploration and Synthesis” (OpenSynthetics), Li et al. make the next step: they consider not just hand gestures but hands holding various objects in different positions. The authors consider the “composited hand-object configuration and viewpoint space” (CCV space) where you can vary object types, composite hand-object poses, and camera viewpoints:

    Then they apply a newly developed grasp synthesis method (that I will not go into), obtain renderings of a synthetic hand grasping the object, and use these images for training.

    What is most interesting for me in this work is that it is an example of the “closing the loop” idea that we have been proposing for quite some time here at Synthesis AI; in particular, pardon the self-promotion, I discussed it as an important idea for the future of synthetic data in Chapter 12 of my book.

    In this case, Li et al. do not merely sample the CCV space and create a randomly generated dataset of synthetic hands with objects. They assign weights to different objects, poses, and viewpoints, and update these weights with feedback obtained from the trained model, trying to skew the sampling towards hard examples, a technique known in other contexts as “hard negative mining”. It is great to see that “closing the loop” is gaining traction, and I am certain it can help in other problems as well.
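
    A bare-bones version of this feedback loop looks like the sketch below: sample synthetic configurations according to weights and increase the weights of configurations on which the current model performs poorly. The moving-average update rule is my own simplification; the actual ArtiBoost scheme is more involved.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    num_configs = 1000                 # e.g., discretized (object, pose, viewpoint) cells
    weights = np.ones(num_configs)

    def sample_batch(batch_size=32):
        probs = weights / weights.sum()
        return rng.choice(num_configs, size=batch_size, p=probs)

    def update_weights(config_ids, per_sample_losses, lr=0.1):
        """Skew future sampling towards configurations the model currently finds hard."""
        for cid, loss in zip(config_ids, per_sample_losses):
            weights[cid] = (1 - lr) * weights[cid] + lr * loss

    # Schematic training loop (render() and training_step() are hypothetical):
    # ids = sample_batch(); images, labels = render(ids)
    # losses = training_step(model, images, labels)
    # update_weights(ids, losses)
    ```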

    SHIFT: Synthetic Driving via Multi-Task Domain Adaptation

    And now let us, pardon the pun, shift to data about the outdoors. We begin with autonomous driving. The work “SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation” (OpenSynthetics), coming from ETH Zurich researchers Sun et al., is a pure dataset paper presenting SHIFT, a synthetic driving dataset—but SHIFT is far from a “regular” synthetic dataset with some labeled images! The problem that Sun et al. recognize here is that autonomous driving requires the system to adapt to constantly changing conditions: if you are driving and it starts raining, the view around you changes significantly and perhaps quite quickly, and the computer vision system has to keep working fine.

    To help cope with that, SHIFT contains explicit “domain shifts” across several different domains such as weather conditions, time of day, surroundings, and so on:

    So far this is quite standard fare for autonomous driving simulators. What’s more, SHIFT provides continuous shifts across domains whenever possible. You can have day gradually turning into night or rain starting on a sunny day:

    Naturally, each frame is annotated in the usual modalities, with object bounding boxes, segmentation maps, depth maps, optical flow, and LiDAR point clouds.

    Based on SHIFT, the authors investigate how various object detection and segmentation models cope with these domain shifts. They demonstrate that conclusions about robustness to domain shift that can be made on synthetic data also transfer to real datasets. I think that’s an important validation for synthetic data in general: it turns out that synthetic data can help evaluate machine learning models in ways that real data may fail to provide.

    TOPO-DataGen: Multimodal Synthetic Data Generation for Aerial Scenes

    In another classical synthetic data paper, EPFL researchers Yan et al. present TOPO-DataGen (OpenSynthetics), an automated synthetic data generation system that utilizes available geographic data such as LiDAR point clouds, orthophotographs, or digital terrain models to create synthetic scenes of various parts of the world, complete with the usual synthetic modalities such as depth maps, normals, segmentation maps, and so on:

    Generated images look very impressive and highly realistic, which is made slightly easier by the fact that they are aerial images taken from far away. Based on TOPO-DataGen, Yan et al. develop a new CrossLoc model for absolute localization (i.e., estimating the 6D camera pose in space) that works with several input modalities. They also show some impressive demos of trajectory reconstruction from aerial images based on CrossLoc. In general, while synthetic satellite and aerial images have already been generated, I believe this is the first attempt to bring together the different modalities that are actually often available in current practice.

    LiDAR snowfall simulation

    Finally, a very specific but fun use case: simulating snowfall. Autonomous driving should work under all realistic weather conditions, including heavy snow. But snow presents two problems that are especially bad for LiDARs: first, the ground becomes wet, which changes its reflective properties, and second, the particles of snow in the air also interact with the laser beam, leading to absorption and backscattering that attenuate the LiDAR signal and introduce a lot of noise into it.
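
    The attenuation half of such a simulation can be sketched with a simple Beer-Lambert-style model, where return intensities decay exponentially with the two-way path through the scattering medium. This is a toy model with a made-up extinction coefficient, not the physically based one from the paper:

    ```python
    import numpy as np

    def attenuate_returns(ranges, intensities, alpha=0.02):
        """Toy snowfall attenuation: two-way exponential decay along the beam path.

        ranges: distances to LiDAR returns in meters; alpha: extinction coefficient (1/m).
        """
        return intensities * np.exp(-2.0 * alpha * np.asarray(ranges))

    # Backscattering from snow particles would additionally inject spurious close-range
    # returns with randomized intensities, which is the noisier half of the problem.
    ```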

    Hahner et al. present a snowfall simulation system able to augment LiDAR datasets (in this case, STF by Bijelic et al., which itself introduced a fog simulation system) with special models for wet ground reflection and the influence of scattering particles. As a result, 3D object detection models trained with this augmentation perform much better; in the illustration below, note that the rightmost results contain no spurious objects, and the predicted bounding boxes (black) match the ground truth (green) very well:

    Conclusion

    Today, we have begun our long journey through CVPR 2022. We have looked at papers that introduce new synthetic datasets, usually going far beyond simple generation of labeled images and sometimes defining completely new tasks. Next time, we will talk about papers that present specific use cases for synthetic data, that is, validate the use of synthetic data in practical computer vision tasks. Admittedly, it’s a blurry line with this first installment, but this post is getting quite long as it is. Until next time, stay tuned to the Synthesis AI blog, and check out OpenSynthetics!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Driving Model Performance with Synthetic Data VII: Model-Based Domain Adaptation


    After a long hiatus, we return from interviews to long forms, continuing (and hopefully finishing) our series on how synthetic data is used in machine learning and how machine learning models can adapt to using synthetic data. This is our seventh installment in the series (part 1, part 2, part 3, part 4, part 5, part 6), but, as usual, this post is (I hope!) sufficiently self-contained. We will discuss how one can make a model work well on synthetic data without explicitly making the data more realistic, doing the domain adaptation work at the level of features or the model itself.

    Intro and weight sharing

    In previous installments, we have considered models that perform refinement, that is, domain adaptation at the data level. This means that somewhere in the model, there is a learned transformation that takes data points from the source domain (in our case, synthetic images) and transforms them to make them more like the target domain (real images). 

    But it sounds like a lot of unnecessary extra work! Our final goal is very rarely to generate more realistic synthetic images. On the contrary, we want to use synthetic images to help train better models; the data itself is not important, it is just a stepping stone to models that work better. So maybe we don’t need to learn transformations on the level of images and can instead work in the space of features or model weights, never going back to change the actual data?

    One simple and direct approach to doing that would be to share the weights among networks operating on different domains. This way, when you train on both domains, the network has to learn to do well on both with the same weights – exactly what you need for domain adaptation. This was the idea of the earliest approaches to domain adaptation in deep learning, but weight sharing and similar ideas remain relevant to this day. For instance, Rozantsev et al. (2019) do domain adaptation with a two-stream architecture; the weights for processing the two domains are not shared but the architectures are the same, and there are special regularizers on all layers that bring their weights together:
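
    In code, the regularization part of such a two-stream setup might look roughly like this. It is a minimal sketch with a plain L2 penalty between corresponding parameters, not the exact regularizer from Rozantsev et al.:

    ```python
    import torch.nn as nn

    def make_stream():
        return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())

    source_stream = make_stream()   # processes synthetic images
    target_stream = make_stream()   # processes real images

    def weight_similarity_loss(net_a, net_b, weight=1e-3):
        """Penalize the distance between corresponding parameters of the two streams."""
        loss = 0.0
        for p_a, p_b in zip(net_a.parameters(), net_b.parameters()):
            loss = loss + (p_a - p_b).pow(2).sum()
        return weight * loss

    # total_loss = task_loss_on_source + weight_similarity_loss(source_stream, target_stream) + ...
    ```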

    Another approach to model-level domain adaptation is to mine relatively strong priors from real data that can then inform a model trained on synthetic data, helping fix problematic cases or incongruities between synthetic and real data. This also brings us to curriculum learning: it is often helpful to start with the easy cases to get a network rolling, and then fine-tune it in harder and harder situations.

    For example, Zhang et al. (2017) present a curriculum learning approach to domain adaptation for semantic segmentation of urban scenes. They train a segmentation network on synthetic data (specifically on the GTA dataset) but with a special component in the loss function related to the overall label distribution in real images, intended to bring together the distributions of labels in real and synthetic datasets. The problem is that this label distribution is not directly available for (unlabeled) real data, and this is where curriculum learning comes in: the authors first train a simpler model on synthetic data to estimate the label distribution from image features and then use it to inform the segmentation model:
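
    The label distribution term can be sketched as follows: average the per-pixel class probabilities predicted on a real image and penalize their divergence from the distribution estimated by the simpler model. The exact form used by Zhang et al. differs; this only shows the general shape of the idea.

    ```python
    import torch.nn.functional as F

    def label_distribution_loss(seg_logits, target_distribution):
        """seg_logits: [batch, num_classes, H, W]; target_distribution: [batch, num_classes]."""
        probs = F.softmax(seg_logits, dim=1)   # per-pixel class probabilities
        predicted = probs.mean(dim=(2, 3))     # fraction of the image predicted for each class
        return F.kl_div(predicted.clamp_min(1e-8).log(), target_distribution,
                        reduction="batchmean")
    ```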

    But there are much more interesting ideas in model-based domain adaptation than just training the same network on both domains with some regularizers. Let’s get to them!

    Reversing the Gradients

    One of the main directions in model-level domain adaptation was initiated by Ganin and Lempitsky (2015) who presented a generic framework for unsupervised domain adaptation. Their basic approach goes as follows:

    Let’s unpack what we see in this picture:

    • the feature extractor, true to its name, extracts features from input data; this is actually the network that we want to make domain-independent; after extraction, the features go two separate ways;
    • the label predictor actually does what the network is supposed to do, in this case probably classification but it could be segmentation or any other kind of computer vision problem;
    • the domain classifier is the core of this idea; it takes extracted features as input and attempts to classify which domain the original input belonged to.

    The idea is to train the label predictor to perform as well as possible and at the same time make the domain classifier perform as badly as possible. This is actually very similar to GANs (which we have discussed before). The difference, however, is that Ganin and Lempitsky devised an ingenious method for training that doesn’t require solving any minimax problems or iteratively alternating between networks. 

    The method is called gradient reversal: multiplying the gradients by a negative constant as they pass from the domain classifier to the feature extractor. In this way, the domain classifier learns to maximize its error, and the label predictor minimizes it, all at the same time and within the same loss function. Like this:

    In a subsequent work, Ganin et al. (2016) generalized this domain adaptation approach to arbitrary architectures and experimented with domain adaptation in different domains, including image classification, person re-identification, and sentiment analysis. 
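
    Gradient reversal is also very easy to implement in modern frameworks. Here is a minimal PyTorch sketch of a gradient reversal layer and of how it plugs into the feature extractor / label predictor / domain classifier setup; the toy networks and the equal loss weighting are my own illustration, not the original code.

    ```python
    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        """Identity on the forward pass; multiplies gradients by -lambda on the way back."""

        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # Reverse (and scale) the gradient flowing into the feature extractor.
            return -ctx.lambd * grad_output, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)

    # Toy networks; in a real model these would be a CNN backbone and task-specific heads.
    feature_extractor = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
    label_predictor = nn.Linear(64, 10)    # e.g., 10-class classification
    domain_classifier = nn.Linear(64, 2)   # synthetic vs. real

    def training_step(x_source, y_source, x_target, lambd=0.1):
        ce = nn.CrossEntropyLoss()
        feats_src = feature_extractor(x_source)
        feats_tgt = feature_extractor(x_target)

        # Task loss: only labeled (synthetic) data contributes.
        task_loss = ce(label_predictor(feats_src), y_source)

        # Domain loss: the classifier tries to tell the domains apart, but the reversed
        # gradient pushes the feature extractor to make them indistinguishable.
        feats_all = torch.cat([grad_reverse(feats_src, lambd),
                               grad_reverse(feats_tgt, lambd)])
        domains = torch.cat([torch.zeros(len(x_source)),
                             torch.ones(len(x_target))]).long()
        domain_loss = ce(domain_classifier(feats_all), domains)

        return task_loss + domain_loss
    ```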

    Disentanglement: Domain Separation Networks and beyond

    Domain separation networks by Bousmalis et al. (2016) represent a different take on the same problem. They attempt to solve domain adaptation via disentanglement, a very important notion in deep learning. Disentanglement is the process of separating different features extracted by a machine learning model so that the separate parts have different recognizable meanings. For example, many style transfer models (we discussed them in Part IV of this series) try to explicitly disentangle style from content, and then swap the style part of the features before decoding back in order to get the same image in a different style.

    In domain adaptation, disentanglement amounts to separating domain-specific features from domain-independent ones, and trying to make sure that the latter will suffice to solve the actual problem. Domain separation networks explicitly separate the shared and private components of both source and target domains, extracting them with a shared encoder and two private encoders, one for the source domain and one for the target domain:

    The overall objective function for a domain separation network consists of four parts (let’s not do the formulas, it is, after all, almost Christmas):

    • supervised task loss in the source domain, e.g., classification loss;
    • reconstruction loss that compares original samples (both real and synthetic) and the results of a shared decoder that tries to reconstruct the images from a combination of shared and private representations;
    • difference loss that encourages the hidden shared representations of instances from the source and target domains to be orthogonal to their corresponding private representations;
    • similarity loss that encourages the hidden shared representations from the source and target domains to be similar to each other; again, “similar” here means that they should be indistinguishable by a domain classifier trained through the gradient reversal layer, as above.
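
    As a rough sketch, the difference loss can be written as an orthogonality penalty between the shared and private codes of the same batch, while the similarity loss is just a domain classifier applied to the shared codes through the gradient reversal layer from the previous section. The normalization below is my own simplification of the original formulation.

    ```python
    def difference_loss(shared, private):
        """Encourage shared and private codes of the same samples to be orthogonal.

        shared, private: [batch, dim] hidden representations from the two encoders.
        """
        shared = shared - shared.mean(dim=0)
        private = private - private.mean(dim=0)
        correlation = shared.t() @ private        # [dim, dim]
        return (correlation ** 2).sum() / shared.shape[0] ** 2
    ```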

    Bousmalis et al. evaluate their model on several synthetic-to-real scenarios, e.g., on synthetic traffic signs and synthetic objects from the LineMod dataset.

    Domain separation networks became one of the first major examples in domain adaptation with disentanglement, where the hidden representations are domain-invariant and some of the features can be changed to transition from one domain to another. Further developments include:

    • FCNs in the Wild by Hoffman et al., in which feature-based DA for semantic segmentation is done with fully convolutional networks (FCNs), with ground truth available for the source domain (synthetic data) but unavailable for the target domain (real data); they also used domain adversarial training;
    • Xu et al. (2019) used adversarial domain adaptation to transfer object detection models—single-shot multi-box detector (SSD) and multi-scale deep CNN (MSCNN)—from synthetic samples to real videos in the smoke detection problem;
    • Chen et al. (2017) construct the Cross City Adaptation model that brings together features from different domains, with semantic segmentation of outdoor scenes in mind; they adapt segmentation across different cities around the globe and show that their joint training approach with domain adaptation improves the results significantly;
    • and many more…

    The last paper I want to highlight here is by Hong et al. (2018) who provide one of the most direct and most promising applications of feature-level synthetic-to-real domain adaptation. In their Structural Adaptation Network, the conditional generator takes as input the features from a low-level layer of the feature extractor (i.e., features with fine-grained details) and random noise and produces transformed feature maps that should be similar to feature maps extracted from real images:

    To achieve this, the conditional generator produces a noise map and then adds it to high-level features. Hong et al. compared the Structural Adaptation Network with other state-of-the-art approaches, including FCNs in the Wild and Cross-City Adaptation, with source domain datasets SYNTHIA and GTA and target domain dataset Cityscapes; they conclude that this adaptation significantly improves the results for semantic segmentation of urban scenes. Here is a sample of their results:

    Conclusion

    Feature-level domain adaptation provides interesting opportunities for synthetic-to-real adaptation. Many of these methods still represent work in progress, but the field is maturing rapidly. In our experience, feature- and model-level DA is usually a simpler and more robust approach, easier to get to work, so we expect exciting new developments in this direction and recommend trying this family of methods for synthetic-to-real DA (unless you actually need the refined images themselves).

    With this, I am concluding this long series on different facets of using synthetic data in machine learning. Most importantly, synthetic data is a source of virtually limitless perfectly labeled data. It has been explored in many problems, but we believe that many more potential use cases still remain. Maybe we will get a chance to explore them together in 2022.

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Synthetic Data and the Metaverse


    Today, we are talking about the Metaverse, a bold vision for the next iteration of the Internet consisting of interconnected virtual spaces. The Metaverse is a buzzword that had sounded entirely fantastical for a very long time. But lately, it looks like technology is catching up, and we may live to see the Metaverse in the near future. In this post, we discuss how modern artificial intelligence, especially computer vision, is enabling the Metaverse, and how synthetic data is enabling the relevant parts of computer vision.

    What is the Metaverse

    The Metaverse is far from a new idea. Anyone familiar with the cyberpunk genre will immediately recognize the concept of a virtual reality that characters of William Gibson’s Neuromancer (1984) inhabit. The term itself was coined in Neal Stephenson’s novel Snow Crash (1992), and this virtual reality-based Internet 2.0 has seen many fictionalized adaptations ever since, including The Matrix, Ready Player One, a recent Amazon series Upload, and many more.

    While the Metaverse has long been the subject of sci-fi, by now many visionaries believe that developments in VR, AR, and related fields may soon enable similar experiences in real life… I mean, in virtual life, but real virtual life… you know what I mean. One of the sources that got me thinking about the Metaverse recently was a long interview with Mark Zuckerberg. He talks about “the successor to the mobile internet… an embodied internet, where instead of just viewing content — you are in it… present with other people as if you were in other places”. It sounds like Facebook believes in the VR and AR technology and sees the clunkiness of current generation devices as the main obstacle: right now hardly anybody would want to do their jobs in a VR helmet. As soon as wearable technology becomes miniature and light enough, the Metaverse will be upon us.

    Mark Zuckerberg motivates this vision, in particular, with mobile workstations: “…you can walk into a Starbucks… and kind of wave your hands and you can have basically as many monitors as you want, all set up, whatever size you want them to be… and you can just bring that with you wherever you want.” Facebook calls this idea the “infinite office.” But in my opinion, it is almost inevitable that entertainment will be the main driving force behind the Metaverse: imagine that you don’t need large screens to have an immersive cinematic experience, imagine your friends on social networks (well, maybe one social network in particular) streaming their experiences through AR glasses, imagine immersive 3D games that enable real human-to-human personal interaction… Well, I’m sure you’ve heard pitches for VR technology many times, but this time it sounds like it really has a chance of coming through and becoming the next big thing. Others are beginning to build their own visions for the Metaverse, including Epic Games, Roblox, Unity, and more.

    How the Metaverse is enabled by computer vision

    But we need more than just smaller VR helmets and AR glasses to build the Metaverse. This hardware has to be supported by software that makes the transition between the real and virtual worlds seamless—and this would be impossible without state of the art computer vision. Let me give just a few examples.

    First, the obvious: VR helmets and controllers need to be positioned in space very accurately, and this tracking is usually done with visual information from cameras, either installed separately in base stations or embedded into the helmet itself. This is the classical computer vision problem of simultaneous localization and mapping (SLAM). VR helmet technology has recently undergone an important shift: earlier models tended to require base stations (“outside-in” tracking), while the latest helmets can localize controllers accurately with embedded cameras (“inside-out” tracking), so you don’t need any special setup in the room (image source):

    This is a result of progress in computer vision: the cameras themselves have not improved that much.

    This problem becomes harder if we are talking about augmented reality: AR software also needs to understand its position in the world, but it needs a far more detailed and accurate 3D map of the environment in order to be able to augment it for the user. Check out our latest AI interview with Andrew Rabinovich, who was the Director of Deep Learning at Magic Leap, the startup that tried to do exactly this.

    Second, we have already talked many times about gaze estimation, i.e., finding out where a person is looking from a picture of their face and eyes. This is also a crucial problem for AR and VR. In particular, current VR relies upon foveated rendering, a technique where the image in the center of our field of view is rendered in high resolution and high detail, becoming progressively coarser towards the periphery; for an overview see, e.g., Patney et al. (2016). This is, by the way, exactly how we ourselves see things: we see only a very small portion of the field of view clearly and in full detail, and peripheral vision is increasingly blurry (illustration by Rooney et al., 2017):

    Foveated rendering is important for VR because VR has an order of magnitude larger field of view than flat screens and requires high resolution to support the illusion of immersive virtual reality, so rendering the entire field of view at full resolution would be far beyond consumer hardware.
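
    To give a feel for the effect (this is an image-space approximation of the perceptual result, not the actual rendering optimization, which happens inside the graphics pipeline), here is a small sketch that blends a sharp frame with a blurred copy, with blur growing away from a given gaze point; all parameter values are made up for illustration.

    ```python
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def foveate(image, gaze_xy, fovea_radius=80, max_sigma=8.0):
        """Blend a sharp image with a blurred copy, with blur growing away from the gaze point.

        image: HxWx3 float array in [0, 1]; gaze_xy: (x, y) pixel coordinates of the gaze.
        """
        h, w = image.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w]
        dist = np.sqrt((xs - gaze_xy[0]) ** 2 + (ys - gaze_xy[1]) ** 2)

        # 0 inside the fovea, ramping up to 1 in the far periphery
        weight = np.clip((dist - fovea_radius) / (0.5 * max(h, w)), 0.0, 1.0)[..., None]

        blurred = np.stack(
            [gaussian_filter(image[..., c], sigma=max_sigma) for c in range(3)], axis=-1
        )
        return (1.0 - weight) * image + weight * blurred

    # usage: a random "frame" with the gaze fixed at the center
    frame = np.random.rand(480, 640, 3)
    out = foveate(frame, gaze_xy=(320, 240))
    ```

    In a real headset, of course, the point is to *avoid* rendering the periphery at full resolution in the first place, which is exactly why accurate gaze estimation is needed.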

    Third, when you enter virtual reality, you need an avatar to represent you; current VR applications usually provide stock avatars or forgo them entirely (many VR games represent the player as just a head and a pair of hands), but an immersive virtual social experience would need photorealistic virtual avatars that represent real people and can capture their poses. Constructing such an avatar is a very hard computer vision problem, but people are making good progress on it. For instance, a recent work by Victor Lempitsky’s team introduced textured full-body avatars able to capture poses in real time from visual data streamed from several cameras:

    We are still not quite there, especially when it comes to faces and emotions, but we are getting better, and the Metaverse will definitely make use of this technology.

    These are only a few of the computer vision problems that arise along the way to the Metaverse; for a more, pardon the pun, immersive experience just look at the list of talks at the recent IEEE VR Conference, where you will see all of these topics and much more.

    Synthetic data and the Metaverse

    Our long-time readers have no doubt already recognized where this blog post is going. Indeed, as we have discussed many times before (e.g., here or here), modern computer vision requires increasingly large datasets, and manual labeling simply stops working at some point. At Synthesis AI, we are proposing a solution to this problem in the form of synthetic data: artificially generated images and/or 3D scenes that can be used to train machine learning models.

    I chose the three examples above because they each illustrate different uses of synthetic data in machine learning. Let us go over them again.

    First, SLAM is an example where synthetic data can be used in a straightforward way: construct a 3D scene and use it to render training set images with pixel-perfect labels of any kind you would like, including segmentation, depth maps, and more. We have talked about simulated environments on this blog before, and SLAM is a practical problem where segmentation and depth estimation arise as important parts. Modern synthetic datasets provide a wide range of cameras and modalities; for example, here is an overview of a recently released dataset intended specifically for SLAM (Wang et al. 2020):
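
    As a small illustration of how these “free” labels come about, here is a minimal sketch using the pyrender and trimesh libraries: a toy scene with a single object rendered into an RGB frame, a pixel-perfect depth map, and a segmentation mask. The scene, camera pose, and lighting values are placeholders for real assets and randomization logic, and offscreen rendering assumes an available OpenGL context.

    ```python
    import numpy as np
    import trimesh
    import pyrender

    # a stand-in "object of interest": a unit cube; in practice this would be a detailed 3D asset
    mesh = pyrender.Mesh.from_trimesh(trimesh.creation.box(extents=(1.0, 1.0, 1.0)))
    scene = pyrender.Scene(bg_color=(0.2, 0.2, 0.2, 1.0))
    scene.add(mesh)

    # randomized camera distance: the kind of "knob" that comes for free with synthetic data
    cam_pose = np.eye(4)
    cam_pose[2, 3] = 3.0 + np.random.rand()
    camera = pyrender.PerspectiveCamera(yfov=np.pi / 3.0)
    scene.add(camera, pose=cam_pose)
    scene.add(pyrender.DirectionalLight(intensity=3.0), pose=cam_pose)

    renderer = pyrender.OffscreenRenderer(640, 480)
    color, depth = renderer.render(scene)   # RGB frame + pixel-perfect depth map
    mask = depth > 0                        # a free segmentation mask for the single object
    renderer.delete()
    ```

    With more objects, the same scene graph yields per-instance masks, surface normals, optical flow between frames, and any other modality the renderer can produce.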

    Second, gaze estimation is an interesting problem where real data may be hard to come by, and synthetic data comes to the rescue. I have already used gaze estimation on this blog as a go-to example for domain adaptation, i.e., the process of modifying the training data and/or machine learning models so that the model can work on data from a different domain. Gaze estimation works with relatively small input images, so this was an early success for GANs for synthetic-to-real refinement, where synthetic images were made more realistic with specially trained generative models. Recent developments include a large real dataset, MagicEyes, that was created specifically for augmented reality applications (Wu et al., 2020); in fact, it was released by Magic Leap, and we discussed it with Andrew last time:

    Third, virtual avatars touch upon synthetic data from the opposite direction: now the question is about using machine learning to generate synthetic data. We talked about capturing the pose and/or emotions from a real human model, but there is actually a rising trend in machine learning models that are able to create realistic avatars from scratch. Instagram is experiencing a new phenomenon: virtual influencers, accounts that have a personality but do not have a human actually realizing this personality. Here is Lil Miquela, one of the most popular virtual influencers:

    From a research perspective, this requires state of the art generative models that are supplemented with synthetic data in the classical sense: you need to create a highly realistic 3D environment, place a high-quality human model inside, and then use a generative model (usually a style transfer model) to make the resulting image even more realistic. In this direction, there is still a long way to go before we have fully photorealistic 3D avatars ready for the Metaverse, but the field is developing very rapidly, and this long way may be traversed in much less time than we expect.

    The Metaverse is an ambitious vision straight out of science fiction, but it looks like it is becoming increasingly realistic. It is quite possible that you and I will live to see an actual Metaverse, be it the social-centric Facebook 2.0 envisioned by Mark Zuckerberg, the massively multiplayer OASIS from Ready Player One, or, God forbid, the all-encompassing Matrix. But before we get there, there are still many research problems to be solved. Most of them lie in the field of computer vision, and this is exactly where synthetic data is especially effective for machine learning. Join us next time for another installment on synthetic data!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • AI Interviews: Andrew Rabinovich

    AI Interviews: Andrew Rabinovich

    Today, I am proud to present our guest for the second interview, Dr. Andrew Rabinovich. Currently, Andrew is the CTO and co-founder of Headroom Inc., a startup devoted to producing AI-based solutions for online business meetings (taking notes, detecting and attracting attention, summarization, and so on). Dr. Rabinovich has produced many important advances in the field of computer vision (here is his Google Scholar account), but he is probably best known for his work as the Director of Deep Learning at Magic Leap, an augmented reality startup that raised more than $3B in investments.

    Q1. Hello Andrew, and welcome! Let me begin with a general question that I will also expand upon later. You have a lot of experience in academia, with numerous papers published at top conferences and receiving hundreds of citations. At the same time, some of your top accomplishments are related to more “industrial” research work at startups such as Magic Leap.

    What kind of work has been more fulfilling for you? And what, in your view, are the main differences in the process and/or results? On the surface, research work in both industry and academia is supposed to produce novel solutions that work well for the problem at hand; are there important differences here?

    Hello Sergey, I am glad to be here and thank you for the invitation. What you guys at Synthesis are doing is extremely important for the computer vision field, and I am grateful that with these efforts the state of the art in Computer Vision, and AI in general, will improve for many years to come.

    This is a very interesting question that dates back to my undergraduate days when I worked on medical image analysis and was interested in building image cytometers — automated microscopes with machine learning inference skills. While developing the cytometer, it quickly became apparent that the state of the art in computer vision (it was called image processing then) wasn’t quite up to par to solve the practical problems I was facing. This realization made me turn to more theoretical work and focus on developing core vision algorithms. A similar situation happened at Google, where I was really excited to work on algorithms for Google Goggles, the first AR app for Android and iPhone. Then existing, pre-deep learning approaches, weren’t satisfactory to develop product features we were interested in. Again, I turned to more academic research and was very fortunate to work on the development of modern deep networks, including the Inception architecture, which in turn we applied to visual search in Google Photos. You can probably guess where this is going, the same story repeated itself at Magic Leap. I quickly realized that to develop the vision of Mixed Reality, and to close the perceptual gap between real and virtual content, a lot of new fundamental research in computer vision and AI had to be done.

    Overall, academic and applied research aren’t really separable in my mind. Computer vision and machine learning are not fundamental science disciplines, they don’t describe nature. These are engineering challenges that need to be addressed in the context of practical problems. Industrial research provides that context. If the context is chosen correctly, then solutions to specific engineering challenges generalize to other tasks. 

    Q2. Our blog is devoted to synthetic data, so here is the most expected question. During your work in Headroom, Magic Leap, and other startups, have you used synthetic data to solve computer vision problems? In what ways, and how much did it help (if you’re allowed to divulge this kind of information, of course)? Did it help for the augmented reality applications at Magic Leap?

    I have been a proponent of synthetic data since my days at Google, where we heavily relied on data augmentation (synthetic data 0.1) to train deep models. At Magic Leap, we created a whole synthetic data group, with render farms and custom pipelines. At that time, synthetic data companies were quite rare, so we had to do most of it. The benefits of synthetic data ranged from hand and eye-tracking to 3D reconstruction and segmentation. At Headroom, we are collaborating with synthetic data providers across a number of problems. 

    Generally, there are really two fundamental issues with data for learning. First, obtaining data and labeling it can be quite expensive and laborious, whether it involves humans in the loop or not. Many companies today have established an efficient pipeline for ingesting data and providing annotations for it. The second problem, however, is far more critical. Relying on the human ability to annotate certain types of data is misleading. People can only provide relative and qualitative labels, such as drawing bounding boxes around objects or qualifying relative distances. If the task is much more specific, e.g., describing the illumination in the room or how far away the person is from the car (in centimeters), humans cannot answer these questions with the required precision, and in the absence of specific sensors, synthetic data is the only path forward.

    By construction, machine-generated data is auto labeled. The main drawback of synthetic data is that it may be sampled from a distribution that doesn’t represent the real world. Fortunately, that gap is quickly closing with realistic synthesis and domain adaptation approaches in AI.

    Q3. One of your latest papers, “DELTAS: Depth Estimation by Learning Triangulation and Densification of Sparse Points”, seems to be making a very interesting point beyond its immediate results. It reconstructs 3D meshes of scenes from RGB images with an end-to-end network, never producing an intermediate depth map, like most other methods do:

    This sounds very human-like to me: I can navigate complex 3D environments, and I have a pretty good grasp on relative depth (which object is closer than another), but I definitely cannot produce an accurate depth map for my room. Moreover, this is in line with a general trend in deep learning that seems evident to me over at least the last decade: neural networks increasingly perform end-to-end training and learn to do various tasks directly, without predefined intermediate representations or side results. The tradeoff here is that end-to-end training for complex tasks usually requires far more data than more specialized training when you have, e.g., ground truth labeled depth maps.

    Do you agree that this trend exists and if yes, where do you think it will take us in the near future, especially in the field of computer vision? Are there other important problems that can be overcome with such end-to-end architectures, and do we have enough data to do that? To make the question more open-ended, what other trends in computer vision do you see that you expect to carry over for the next couple of years (I think in deep learning it doesn’t make sense to predict beyond a couple of years anyway)?

    End-to-end learning is a very attractive, almost romantic notion. The formulations are usually very elegant and simple. However, as you correctly point out, it requires a significantly larger amount of training data to account for all variations. That is why most problems aren’t solved end-to-end, as we aim to provide supervision along the way. With regards to 3D reconstruction, intermediate supervision with depth maps is problematic as well. Obtaining a large amount of depth data is not trivial. 

    As for the trends, I am not a big follower of them, as they are mostly set by the availability of datasets or funding. Over the last few years, I have focused on multi-task learning and believe that focus on this area of AI will lead to significant advances due to generalization during training and inductive bias during inference.  

    Looking forward, I believe developing AI approaches one modality at a time, when applied to the multimodal tasks that surround us, artificially complicates the problem. For example, the classical problem of video understanding is typically solved by isolating video from everything else. However, presence of text, available in the movie scripts or live transcription, and audio sources, make the problem much more tractable. Multimodal multitask learning is one of the areas in AI I am most excited about today.

    Q4. Interestingly, another recent paper of yours, “MagicEyes: A Large Scale Eye Gaze Estimation Dataset for Mixed Reality”, goes in precisely the opposite direction. It makes the case that for eye gaze estimation, better results can be achieved by thinking about the 3D properties of the eye (position of the cornea center and pupil center in 3D) and including them in a multi-task architecture:

    Eye gaze estimation is one of my favorite examples for synthetic data because it has everything: a “pure synthetic” solution based (literally!) on nearest neighbors, GANs for synthetic-to-real refinement that improve the results, new synthetic datasets such as NVGaze… For the readers, here is our recent post about gaze estimation. But it looks like I will have to update my usual story: MagicEyes, which you presented in this paper, is a large-scale dataset with human-labeled real data, and it allows for better results.

    Obviously, collecting this dataset took a lot of money and effort. This leads to two questions. Specifically, do you believe that synthetic data can still help improve eye gaze estimation further? The paper does not show experiments with training EyeNet on mixed real+synthetic datasets: do you think it would be worthwhile to try? And generally, in what other computer vision problems do you expect even larger manually labeled real datasets to appear in the near future, and how do you think it will affect applications of synthetic data in computer vision?

    Eye-tracking is a very interesting example of a computer vision problem. There are decades of research from human vision and neuroscience about the function and anatomy of how we see. MagicEyes datasets aim to collect a variable set of data from a broad population of subjects to capture this natural variability. The learned representations from this data form a foundation of the distribution that we want to learn for a number of different tasks, ranging from blink detection to 3D gaze estimation. If MagicEyes was infinitely large, we’d be done. Labeling this kind of data is possible, even though slow and expensive. By supplementing MagicEyes with synthetic data, we get an opportunity to significantly reduce time and cost, and to increase the training data set size and heterogeneity of seen examples. 

    As for other vision problems, manual datasets for autonomous navigation, satellite imagery, and human interactions are being collected and annotated at scale. Solving these tasks with additional synthetic data will be extremely useful. In fact, we are starting to see synthetic data expertise (specific companies pick and choose their domains of excellence) being compartmentalized to indoor and outdoor environments, and to human vs. man-made objects. 

    Q5. And now let me go back to the industry-vs-academia question, from a different point of view. While preparing the previous two questions, I opened your Google Scholar profile and sorted the publications chronologically. Naturally, you never stopped producing top-notch academic output, but it turned out that it’s far easier to look for your recent papers at your DBLP profile because your Google Scholar profile has recently been literally dominated by patent applications. You’ve had dozens of those in the last couple of years!

    Is that just a formal consequence of your work at Magic Leap and other startups, or does it reflect a deeper position on how practical your work can soon become? Generally speaking, how ready do you think we (humanity) are for solving the basic high-level computer vision problems: 3D scene understanding, visual navigation in the real world, producing seamless augmented reality, and so on? Are we there yet, and if not quite, how long do you think it will take in each case?

    Writing patents is standard practice in industrial research. I was fortunate enough to complement patent filings with the corresponding peer-reviewed publications. As we discussed earlier, I do believe that academic research in computer vision and machine learning precedes its applications. The current AI spring, which started in 2012, has opened a number of industrial research avenues that build upon theoretical results and will lead to innovative products for the next decade.

    With regards to solving complex vision and learning tasks, I think we are still quite a bit away. Machines have become excellent at pattern matching. There are a large number of practical applications that are coming online: from autonomous driving to augmented reality. The limiting factors here are not just the algorithms, however, but rather sensors and data. In augmented reality, for example, the AI components are available, but the computation power, batteries, and displays are not there to deliver a compelling product. 

    Q6. Apart from your research work in academia and industry, you are also helping LDV Capital, one of the top VC funds for AI-related startups, as their Expert in Residence. This may sound like a stock question, but it would be very interesting to hear your personal take on this: how do you evaluate startups that come for your review? What are you looking for the most, and what are the most common mistakes startups make, in your personal experience? Maybe you can share some advice specific for vision-related startups, since it is your personal area of expertise, and LDV Capital seems to have this as an important focus area as well.

    Traditional VC funding happens by following trends. A trend-setting VC firm invests in a particular sector, and the rest of the funds follow. A growing fear of missing out results in large amounts of capital being deployed. Once a new trend emerges, most VC firms happily switch context or diversify. When I look at start-up projects, whether my own or others’, I always look for an end goal thesis, and decide if I agree with it. For example, say a company X makes LiDAR sensors; LiDARs are a hot topic these days. To me, company X is interesting because I believe that without LiDAR, certain long-term goals aren’t possible to achieve, self-driving being one of them. If company X fits into the global scheme of things, it is meaningful and fundamental to market development; if it is a one-off (create filters for your Instagram account), not so much.

    Then, there is the team. Regardless of prior focus, having pedigree, whether academic research, product development, or executive management, is a must. It is fairly simple to tell experts from dreamers.

    Finally, there are many aspiring entrepreneurs who want to start companies for the sake of starting companies or because they have access to interesting technology. In that situation, product definition doesn’t come from a real need to improve an existing approach, but rather from an opportunistic perspective of “let’s invent a solution for a problem that doesn’t exist”. I think this is the curse of most tech startups.

    Thank you very much for your answers, Andrew! We will come back with the next interview soon—stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • AI Interviews: Serge Belongie

    AI Interviews: Serge Belongie

    Hi all! Today we begin a new series of posts here in the Synthesis AI blog. We will talk to the best researchers and practitioners in the field of machine learning, discussing different topics but, obviously, trying to circle back to our main focus of synthetic data every once in a while.

    Today we have our first guest, Professor Serge Belongie. He is a Professor of Computer Science at the University of Copenhagen (DIKU) and the Director of the Pioneer Centre for Artificial Intelligence. Previously he was the Andrew H. and Ann R. Tisch Professor at Cornell Tech and in the Computer Science Department at Cornell University, and an Associate Dean at Cornell Tech.

    Over his distinguished career, Prof. Belongie has been greatly successful in both academia and business. He co-founded several successful startups, including Digital Persona, Inc. that first brought a fingerprint identification device to the mass market and two computer vision startups, Anchovi Labs and Orpix. The MIT Technology Review included him on their list of Innovators under 35 for 2004, and in 2015, he was the recipient of the ICCV Helmholtz Prize. Google Scholar assigns to Prof. Belongie a spectacular h-index of 96, which includes dozens of papers that have become fundamental for computer vision and other fields, with hundreds of citations each. And, to be honest, I got most of this off Prof. Belongie’s Wikipedia page, which means that this is just barely scratching the surface of his achievements.

    Q1. Hello Professor, and welcome to our interview! Your list of achievements is so impressive that we definitely cannot do it justice in this format. But let’s try to add at least one little bit to this Wikipedia dump above. What is the one thing, maybe the one new idea that you are most proud of in your career? You know, the idea that makes you feel the warmest and fuzziest once you remember how you had it?

    Prof. Belongie: Thank you for inviting me! I’m excited about Synthesis AI’s vision, so I’m happy to help get out the word to the CV/ML community. 

    This is a timely question, since I recently started a “Throwback Thursday” series on my lab’s Twitter account. Each week over this past summer, my former students and I had a fun time looking back on the journey behind our publications since I became a professor a couple decades ago. The ideas for which I feel most proud rarely have appeared in highly cited papers. One example is the grid based comparisons in our 2015 paper “Cost-Effective HITs for Relative Similarity Comparisons.” As my students from that time will recall, I was captivated by the idea of triplet based comparisons for measuring perceptual similarity (“is a more similar to b than to c?”), but the cubic complexity of such approaches limited their practical adoption. Then it occurred to us that humans have excellent parallel visual processing abilities, which means we could fill a screen with 4×4 or 5×5 grids of images, and through some simple UI trickery, we could harvest large batches of triplet constraints in one shot, using a HIT (human intelligence task) that was both less expensive to run and more entertaining to complete for the participants. While this approach and the related SNaCK approach we published the following year have not gotten much traction in the literature, I’m convinced that this concept will eventually get its day in the sun.

    Q2. Now for the obligatory question: what is your view on the importance of synthetic data for modern computer vision? Here at Synthesis AI, we believe that synthetic data can become one of the solutions to the data problem; do you agree? What other solutions do you see and how, in your opinion, does synthetic data fit into the landscape of computer vision of the future?

    Prof. Belongie: I am in complete agreement with this view. When pilots learn to fly, they must log thousands of hours of flight time in simulated and real flight environments. That is an industry that, over several decades, has found the right balance of real vs. synthetic for the best instructional outcome. Our field is now confronting an analogous problem, with the key difference that the student is a machine. With that difference in mind, we will again need to find the right balance. As my PhD advisor [Jitendra Malik] used to tell us in the late 90s, nature has a way of detecting a hack, so we must be careful about overstating what’s possible with purely synthetic environments. But when you think about the cartesian product of all the environmental factors that can influence, say, the appearance of city streets in the context of autonomous driving, it seems foolish not to build upon our troves of real data with clever synthesis and augmentation approaches to give our machines a gigantic head start before tackling the real thing.

    Q3. Among all your influential papers with hundreds of citations, the one that looks to me most directly relevant to synthetic data is the paper where Xun Huang and yourself introduced adaptive instance normalization (AdaIN), a very simple style transfer approach that still works wonders. We recently talked about AdaIN on this blog, and in our experiments we have never seen a more complex synthetic-to-real refinement pipeline, even based on your own later work, MUNIT, outperform the basic AdaIN. What has worked best for synthetic-to-real style transfer for you? Do you maybe have more style transfer techniques in store for us, to appear in the near future?

    Prof. Belongie: Good ol’ AdaIN indeed works surprisingly well in a wide variety of cases. The situation gets more nuanced, however, in fine grained settings such as the iNat challenges or NeWT downstream tasks. In these cases, even well intentioned style transfer methods can trample over the subtle differences that distinguish tightly related species; as the saying goes, “one person’s signal is another person’s noise.” In this context, we’ve been reflecting on the emerging practice of augmentation engineering. Ever since deep learning burst onto the scene around 2011, it hasn’t been socially acceptable to fiddle with feature design manually, but no one complains if you fiddle with augmentation functions. The latter can be thought of as a roundabout way to scratch the same itch. It’s likely that in fine grained domains, e.g., plant pathology, we’ll need to return to the old – and in my opinion, good – practices of working closely with domain experts to cultivate domain-appropriate geometric and photometric transformations.

    In terms of what’s coming next in style transfer, I’m excited about our recent work in the optical see-through (OST) augmented reality setting. In conventional style transfer, you have total control over the values of every pixel. In the OST setting, however, you can only add light; you can’t subtract it. So what can be done about this? We tackle this question in our recent Stay Positive work, focusing on the nonnegative image synthesis problem, and leveraging quirks of the human visual system’s processing of brightness and contrast.

    Q4. Continuing from the last question, one of the latest papers to come out of your group is titled “Single Image Texture Translation for Data Augmentation”. In it, you propose a new data augmentation technique that translates textures between objects from single images (as a brief reminder for the readers, we have talked about what data augmentation is previously on this blog). The paper also includes a nice graphical overview of modern data augmentation methods that I can’t but quote here:

    Looking at this picture makes me excited. What is your opinion on the limits of data augmentation? Combined with neural style transfer and all other techniques shown here, how far do you think this can take us? How do you see these techniques potentially complementing synthetic data approaches (in the sense of making 3D models and rendering images), and are there, in your opinion, unique advantages of synthetic data that augmentation of real data cannot provide?

    Prof. Belongie: When it comes to generic, coarse-grained settings, I would say the sky’s the limit in terms of what data augmentation can accomplish. Here I’m referring to supplying modern machine learning pipelines with sufficiently realistic augmentations, such as adding rain to a street or stubble to a face. The bar is, of course, somewhat higher if the goal is to cross the uncanny valley for human observers. And as I hinted earlier, fine grained visual categorization (FGVC) also presents some tough challenges for the data augmentation movement. FGVC problems are characterized by the need for specialized domain knowledge, the kind that is possessed by very few human experts. In that sense, knowing how to tackle the data augmentation problem for FGVC is tantamount to bottling that knowledge in the form of a family of image manipulations. That strikes me as a daunting task.

    Q5. A slightly personal question here. Your group at UCSD used to be called SO(3) in honor of the group of three-dimensional rotations, and your group at Cornell now is called SE(3), after the special Euclidean group in three dimensions. This brings back memories of how I used to work in algebra a little bit back when I was an undergrad. I realize the group’s title probably doesn’t mean much but still: do you see a way for modern algebra and/or geometry to influence machine learning? What is your opinion of current efforts in geometric deep learning: would you advise current math undergrads to go there?

    Prof. Belongie: Geometric deep learning provides an interesting framework for incorporating prior knowledge into traditional deep learning settings. Personally, I find it exciting because a new generation of students is talking about topics like graph Laplacians again. I don’t know if I’d point industry-focused ML engineers at geometric deep learning, but I do think it’s a rich landscape for research-oriented undergrads to explore, with an inspiring synthesis of old and new ideas.

    Q6. And, if you don’t mind, let us finish with another personal question. Turns out SO3 is not just your computer vision research group’s title but also your band name! I learned about it from this profile article about you that lists quite a few cool things you’ve done, including a teaching gig in Brazil “inspired by Richard Feynman”.

    So I guess it’s safe to say that Richard Feynman has been one of your heroes. Who else has been an influence? How did you turn to computer science? And are there maybe some other biographies or popular books that you can recommend for our readers who are choosing their path right now?

    Prof. Belongie: Ah, I see you’ve done your research! The primary influences in my career have been my undergrad and grad school advisors, Pietro Perona and Jitendra Malik, who are both towering figures in the field. From them I gained a deep appreciation of ideas outside of computer science and engineering, including human vision, experimental psychology, art history, and neuroscience. I find myself quoting, paraphrasing, or channeling them on a regular basis when meeting with my students. In terms of turning to computer science, that was a matter of practicality. I started out in electrical engineering, focusing on digital signal processing, and as my interests coalesced around image recognition, I naturally gravitated to where the action was circa the late 90s, i.e., computer science.

    As far as what I’d recommend now, that’s a tough question. My usual diet is based on the firehose of arXiv preprints that match my group’s keywords du jour. But this can be draining and even demoralizing, since you’ll start to feel like it’s all been done. So if you want something to inspire you, read an old paper by Don Geman, like this one about searching for mental pictures. Or better yet, after you’re done with your week’s quota of @ak92501-recommended papers, go for a long drive or walk and listen to a Rick Beato “What Makes this Song Great” playlist. It doesn’t matter if you know music theory, or if some of the genres he covers aren’t your thing. His passion for music – diving into it, explaining it, making the complex simple – is infectious, and he will inspire you to do great things in whatever domain you’ve chosen as your focus. 

    Dear Professor, thank you very much for your answers! And thank you, the reader, for your attention! Next time, we will return with an interview with another important figure in machine learning. Stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Synthetic Data for Safe Driving

    Synthetic Data for Safe Driving

    The role of synthetic data in developing solutions for autonomous driving is hard to overstate. In a recent post, I already touched upon virtual outdoor environments for training autonomous driving agents, and this is a huge topic that we will no doubt return to later. But today, I want to talk about a much more specialized topic in the same field: driver safety monitoring. It turns out that synthetic data can help here as well—and today we will understand how. This is a companion post for our recent press release.

    What Is Driver Safety Monitoring and How Manufacturers Are Forced to Care

    Car-related accidents remain a major source of fatalities and trauma all around the world. The United States, for instance, has about 35,000 motor vehicle fatalities and over 2 million injuries per year, which may pale in comparison to the COVID pandemic or cancer but still sounds like a lot of unnecessary suffering.

    In fact, significant progress has been achieved in reducing these deaths and injuries in recent years. Here are the statistics of road traffic fatalities in Germany over the last few years:

    And here is the same plot for France (they both stop at 2019 because it would be really unfair to make road traffic comparisons in the times of overwhelming lockdowns):

    Obviously, the European Union is doing something right in its regulation of road traffic. A large part of it is the new safety measures that are gradually being made mandatory in the EU. And the immediate occasion for this post is the new regulations regarding driver safety monitoring.

    Starting from 2022, it will be mandatory for European Union car manufacturers to install the following safety features: “warning of driver drowsiness and distraction (e.g. smartphone use while driving), intelligent speed assistance, reversing safety with camera or sensors, […] lane-keeping assistance, advanced emergency braking, and crash-test improved safety belts”. With these regulations, the European Commission plans to “save over 25,000 lives and avoid at least 140,000 serious injuries by 2038”.

    On paper, this sounds marvelous: why not have a system that wakes you up if you fall asleep behind the wheel and helps you stay in your lane when you’re distracted? But how can systems like this work? And where’s the place of synthetic data in this? Let’s find out.

    Driver Drowsiness Detection with Deep Learning

    We cannot cover everything, so let’s dive into details for one specific aspect of safety monitoring: drowsiness detection. This is both a key part of the new regulations and a major factor in actual car accidents: falling asleep at the wheel is very common. You don’t even have to be completely asleep: 5-10 seconds of what is called a microsleep episode will be more than enough for an accident to occur. So how can a smart car notice that you are about to fall asleep and warn you in time?

    The gold standard of recognizing brain states such as sleep is, of course, electroencephalography (EEG), that is, measuring the electrical activity of your brain. Recent research has applied deep learning to analyzing EEG data, and it appears that even relatively simple solutions based on convolutional and recurrent networks are enough to recognize sleep and drowsiness with high certainty. For instance, a recent work by Zurich researchers Malafeev et al. (2020) shows excellent results in the detection of microsleep episodes with a simple architecture like this:
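
    As a rough illustration of what such a convolutional-recurrent classifier might look like (this is a generic sketch, not the architecture from the paper, and all dimensions are invented), consider the following PyTorch snippet:

    ```python
    import torch
    import torch.nn as nn

    class MicrosleepNet(nn.Module):
        """Toy convolutional-recurrent classifier for 1-D EEG windows (awake vs. microsleep)."""
        def __init__(self, n_channels=2, hidden=64):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(n_channels, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
                nn.Conv1d(32, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            )
            self.rnn = nn.LSTM(64, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 2)

        def forward(self, x):              # x: (batch, channels, time)
            feats = self.conv(x)           # (batch, 64, time')
            feats = feats.transpose(1, 2)  # (batch, time', 64) for the LSTM
            _, (h, _) = self.rnn(feats)
            return self.head(h[-1])        # logits: awake vs. microsleep

    # a hypothetical 4-second window of 2-channel EEG sampled at 200 Hz
    window = torch.randn(8, 2, 800)
    logits = MicrosleepNet()(window)
    ```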

    But short of requiring all drivers to wear a headpiece with EEG electrodes, this kind of data will not be available in a real car. EEG is commonly used to collect and label real datasets in this field, but we need some other signal for actual drowsiness detection.

    There are two actual signals that are both important here. First, steering patterns: a simple sensor can track the steering angle and velocity, and then you can develop a system that recognizes troubling patterns in the driver’s steering. For example, if a driver is barely steering at all for some time, and then returns the car on track with a quick jerking motion, that’s probably a sign that the driver is getting sleepy or distracted. Leading manufacturers such as Volvo, Bosch, and others are already presenting solutions based on steering patterns.
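
    Commercial systems are, of course, far more sophisticated, but a toy version of this idea fits in a few lines: flag a long stretch of near-zero steering activity followed by a sharp correction. The sampling rate, units, and thresholds below are made up for illustration.

    ```python
    import numpy as np

    def drowsy_steering_flag(angles, fs=10, quiet_thresh=0.5, quiet_secs=4.0, jerk_thresh=15.0):
        """Flag the 'no steering for a while, then a sharp correction' pattern.

        angles: steering wheel angle in degrees, sampled at fs Hz (hypothetical units/thresholds).
        """
        angles = np.asarray(angles, dtype=float)
        rate = np.abs(np.diff(angles)) * fs          # steering velocity, deg/s
        quiet_n = int(quiet_secs * fs)

        for t in range(quiet_n, len(rate)):
            was_quiet = np.all(rate[t - quiet_n:t] < quiet_thresh)
            sharp_jerk = rate[t] > jerk_thresh
            if was_quiet and sharp_jerk:
                return True                           # pattern found: possible drowsiness
        return False

    # usage: flat steering for 6 s, then an abrupt 20-degree correction
    signal = np.concatenate([np.zeros(60), [20.0], np.zeros(10)])
    print(drowsy_steering_flag(signal))               # True
    ```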

    Steering patterns, however, are just one possible signal, and a quite indirect one. Moreover, once you have in place another component of the very same EU regulations, automatic lane-keeping assistance, steering becomes largely automated and these patterns stop working. A much more direct idea would be to use computer vision to detect the signs of drowsiness on the driver’s face.

    When Volvo introduced their steering-based system in 2007, their representative said: “We often get questions about why we have chosen this concept instead of monitoring the driver’s eyes. The answer is that we don’t think that the technology of monitoring the driver’s eyes is mature enough yet.” By 2021, computer vision has progressed a lot, and recent works on the subject show excellent results.

    The most telling sign would be, of course, detecting that the driver’s eyes are closing. There is an entire field of study devoted to detecting closed eyes and blinking (blinks get longer and more frequent when you’re drowsy). In 2014, Song et al. presented the now-standard Closed Eyes in the Wild (CEW) dataset, modeled after the classical Labeled Faces in the Wild (LFW) dataset but with eyes closed; here is a sample of CEW (top row) and LFW (bottom row):

    Since then, eye closedness and blink detection have steadily improved, usually with various convolutional pipelines, and by now they are definitely ready to become an important component of car safety systems.
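
    Assuming some per-frame eye-state classifier (say, a small CNN trained on CEW-style crops) already outputs an “eyes open” probability for every video frame, a simple post-processing step can turn those probabilities into blink counts and durations, which is exactly the signal that grows with drowsiness. A minimal sketch:

    ```python
    import numpy as np

    def blink_stats(open_prob, fps=30, closed_thresh=0.5):
        """Blink count and durations (in seconds) from per-frame 'eyes open' probabilities."""
        closed = np.asarray(open_prob) < closed_thresh
        durations, run = [], 0
        for c in closed:
            if c:
                run += 1
            elif run:
                durations.append(run / fps)
                run = 0
        if run:
            durations.append(run / fps)
        return len(durations), durations

    # usage: mostly open eyes with one ~0.3 s blink and one suspiciously long ~1 s closure
    probs = [0.9] * 30 + [0.1] * 9 + [0.9] * 30 + [0.05] * 30 + [0.9] * 10
    n, durs = blink_stats(probs)
    print(n, durs)   # 2 [0.3, 1.0]
    ```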

    We don’t have to restrict ourselves only to the eyes, of course. The entire facial expression can provide important clues (did you yawn while reading this?). For example, Shen et al. (2020) recently proposed a multi-featured pipeline that has separate convolutional processing streams for the driver’s head, eyes, and mouth:

    Another important recent work comes from Affectiva, a company we have recently collaborated with on eye gaze estimation. Joshi et al. (2020) classify drowsiness based on facial expressions as captured in a 10-second video that might have the driver progress between different states of drowsiness. Their pipeline is based on features extracted by their own SDK for recognizing facial expressions:

    All of these systems are not perfect, of course, but it is clear by now that computer vision can provide important clues to detect and evaluate the driver’s state and trigger warnings that can help avoid road traffic accidents and ultimately save lives. So where does synthetic data come into this picture?

    Synthetic Data for Drowsiness Detection

    On this blog, we have discussed many times (e.g., recently and very recently) the conditions under which synthetic data especially shines in computer vision. These conditions include situations where existing real datasets may be biased, environmental features that are not covered in real data (different cameras, lighting conditions, etc.), and generally situations that call for extensive variability and randomization, which is much easier to achieve in synthetic data than in real datasets.

    Guess what: driver safety is definitely one of those situations! First, cameras that can be installed in real cars shoot from positions that are far from standard for usual datasets. Here are some frames from a sample video that Joshi et al. processed in the paper we referenced above:

    Compare this with, say, standard frontal photographs characteristic of Labeled Faces in the Wild, which we also showed above; obviously, some domain transfer is needed between these two situations, while a synthetic 3D model of a head can be shot from any angle.

    Second, where will real data come from? We could collect real datasets and label them semi-automatically with the help of EEG monitoring, but that would be far from perfect for computer vision model training because real drivers will not be wearing an EEG device. Also, real datasets of this kind will inevitably be very small: it is obviously very difficult and expensive to collect even thousands of samples of people falling asleep at the wheel, let alone millions.

    Third, you are most likely to fall asleep when you’re driving at night, and night driving means your face is probably illuminated very poorly. You can use NIR (near-infrared) or ToF NIR (time-of-flight near-infrared) cameras to “see in the dark”. But pupils (well, retinas) act differently in the NIR modality, and this effect can vary across ethnicities. This combination of different camera modalities and challenging lighting is, again, something that is relatively easy to achieve in synthetic datasets but hard to find in real ones. For example, available NIR datasets such as NVGaze or the MRL Eye Dataset were made for AR/VR, not from an in-car camera perspective.

    That is why here at Synthesis AI we are moving into this field (see our recent press release), and we hope to make important contributions that will make road traffic safer for all of us. We are already collaborating with automobile and autonomous vehicle manufacturers and Tier 1 suppliers in this market.

    To make this work, we will need to make an additional effort to model car interiors, cameras used by car manufacturers, and other environmental features, but the heart of this project remains in the FaceAPI that we have already developed. This easy-to-use API can produce millions of unique 3D models that have different combinations of identities, clothing, accessories, and, importantly for this project, facial expressions. FaceAPI is already able to produce a wide variety of emotions, including, of course, closed eyes and drowsiness, but we plan to further expand this feature set.

    Here is an example of our automatically generated synthetic data from an in-car perspective, complete with depth and normal maps:

    Synthetic Data for Driver Attention

    But you don’t have to literally fall asleep to cause a traffic accident. Unfortunately, it often suffices to get momentarily distracted, look at your phone, take your hands off the wheel for a second to adjust your coffee cup… all with the same, sometimes tragic, consequences. Thus, another, no less important application of computer vision for driver safety is monitoring driver attention and possible distractions. This becomes all the more important as driverless cars become increasingly common, and autopilots take up more and more of the total time at the wheel: it is much easier to get distracted when you are not actually driving the car.

    First, there is the monitoring of large-scale motions such as taking your hands off the wheel. This falls into the classical field of scene understanding (see, e.g., Xiao et al. (2018)): “are the driver’s hands on the wheel” is a typical scene understanding question that goes beyond simple object detection of both hands and the wheel. Answering these questions, however, usually relies upon classical computer vision problems such as instance segmentation.
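
    As a toy illustration of how such a scene understanding question can be reduced to instance segmentation, suppose that a segmentation model has already produced binary masks for the relevant classes (a hypothetical model fine-tuned for in-cabin classes such as “hand” and “steering wheel”; these are not part of standard datasets such as COCO). A simple overlap test then answers the question:

    ```python
    import numpy as np

    def hands_on_wheel(hand_masks, wheel_mask, min_overlap=0.05):
        """Decide whether at least one detected hand overlaps the steering wheel.

        hand_masks: list of HxW boolean masks (one per detected hand);
        wheel_mask: HxW boolean mask of the steering wheel.
        Masks are assumed to come from a hypothetical in-cabin segmentation model.
        """
        wheel_area = max(wheel_mask.sum(), 1)
        for hand in hand_masks:
            overlap = np.logical_and(hand, wheel_mask).sum()
            if overlap / min(max(hand.sum(), 1), wheel_area) > min_overlap:
                return True
        return False

    # usage with toy masks
    wheel = np.zeros((100, 100), dtype=bool); wheel[60:90, 20:80] = True
    hand = np.zeros((100, 100), dtype=bool);  hand[55:70, 30:45] = True
    print(hands_on_wheel([hand], wheel))   # True: the hand overlaps the wheel region
    ```

    The hard part, of course, is the segmentation model itself, and that is exactly where synthetic in-cabin data with pixel-perfect masks comes in.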

    Second, it is no less important to track such small-scale motions as eye gaze. Eye gaze estimation is an important computer vision problem that has its own applications but is also obviously useful for driver safety. We have already discussed applications of synthetic data to eye gaze estimation on this blog, with a special focus on domain adaptation.

    Obviously, all of these problems belong to the field of computer vision, and all standard arguments for the use of synthetic data apply in this case as well. Thus, we expect that synthetic data produced by our engines will be extremely useful for driver attention monitoring. In the next example, also produced by FaceAPI, we can compare a regular RGB image and the corresponding near-infrared image for two drivers who may be distracted. Note that eye gaze is also clearly seen in our synthetic pictures, as well as larger features:

    There’s even more that can be varied parametrically. Here are some examples with head turn, yawning, eye closure, and accessories like face masks and glasses.

    All in all, we strongly believe that high-quality synthetic data for computer vision systems can help advance safety systems for car manufacturers and help reduce road traffic accidents not only in the European Union but all over the world. Here at Synthesis AI, we are devoted to removing the obstacles to further advances of machine learning—especially for such a great cause!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Synthetic Data-Centric AI

    Synthetic Data-Centric AI

    In a recent series of talks and related articles, one of the most prominent AI researchers, Andrew Ng, pointed to the elephant in the room of artificial intelligence: the data. It is a common saying in AI that “machine learning is 80% data and 20% models”, but in practice, the vast majority of effort from both researchers and practitioners concentrates on the model part rather than the data part of AI/ML. In this article, we consider this 80/20 split in slightly more detail and discuss one possible way to advance data-centric AI research.

    The life cycle of a machine learning project

    The basic life cycle of a machine learning project for some supervised learning problems (for instance, image segmentation) looks like this:

    First, one has to collect data, then it has to be labeled according to the problem at hand, then a model is trained on the resulting dataset, and finally the best models have to be fitted into edge devices where they will be deployed. In my personal opinion, these four parts are about equally important in most real life projects; but if you look at the research papers from any top AI conference, you will see that most of them are about the “Training” phase, with a little bit of “Deployment” (model distillation and similar techniques that make models fit into restricted hardware) and an even smaller part devoted to the “Data” and “Annotation” parts (mostly about data augmentation).

    This is not due to simple narrow-mindedness: everybody understands that data is key for any AI/ML project. But usually the model is the sexy part of research, where new ideas flourish and intermingle, and data is the “necessary but boring” part. This is a shame because, as Andrew Ng demonstrated in his talks, improvements in the data department are often much lower-hanging fruit than improvements in state of the art AI models.

    Data labeling and data cascades: the real elephants in the room

    On the other hand, collecting and especially annotating the data is increasingly becoming a problem, if not a hard constraint on AI research and development. The required labeling is often very labor-intensive. Suppose that you want to teach a model to count the cows grazing in a field, a natural and potentially lucrative idea for applying deep learning in agriculture. The basic computer vision problem here is either object detection, i.e., drawing bounding boxes around cows, or instance segmentation, i.e., distinguishing the silhouettes of cows. To train the model, you need a lot of photos with labeling such as this one:

    Imagine how much work it would take to label tens of thousands of such photographs! Naturally, in a real project you would use an existing, weaker model to pre-label the data and use manual labor only to correct its mistakes, but it still might take thousands of man-hours.
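
    For instance, here is a hedged sketch of such a pre-labeling step with an off-the-shelf COCO-pretrained detector from torchvision (the “cow” class does exist in COCO, but in practice the detections would only be a starting point for manual correction; the image path is hypothetical):

    ```python
    import torch
    from PIL import Image
    from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights

    weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
    model = fasterrcnn_resnet50_fpn(weights=weights).eval()
    preprocess = weights.transforms()
    cow_label = weights.meta["categories"].index("cow")

    def count_cows(image_path, score_thresh=0.7):
        """Count COCO 'cow' detections in an image; the boxes can seed a manual labeling pass."""
        img = Image.open(image_path).convert("RGB")
        with torch.no_grad():
            pred = model([preprocess(img)])[0]
        keep = (pred["labels"] == cow_label) & (pred["scores"] > score_thresh)
        return int(keep.sum()), pred["boxes"][keep]

    # n, boxes = count_cows("pasture.jpg")  # hypothetical image path
    ```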

    Another important problem is dataset bias. Even in applications where real labeled data abounds, existing datasets often do not cover cases relevant for new applications. Take face recognition, for instance; there exist datasets with millions of labeled faces. But, first, many such datasets suffer from the racial and ethnic biases that often plague major datasets. And second, there are plenty of use cases in slightly modified conditions: for example, a face recognition system might need to recognize users from any angle, but existing datasets are heavily skewed towards frontal and profile photos.

    These and other problems have been recently combined under the label of data cascades, as introduced in this Google AI post. Data cascades include dataset bias, real world noise that is absent in clean training sets, model drifts where the targets change over time, and many other problems, up to poor dataset documentation.

    There exist several possible solutions to basic data-related problems, all increasingly explored in modern AI:

    • few-shot, one-shot, and even zero-shot learning try to reduce data requirements by pretraining models and then fine-tuning them to new problems with very small datasets; this is a great solution when it works, but success stories are still relatively limited;
    • semi-supervised and weakly supervised learning make use of unlabeled data that is often plentiful (e.g., it is usually far cheaper to obtain unlabeled images of the objects in question than label them).

    But these solutions are far from universal: if existing data (used for pretraining) has no or very few examples of the objects and relations we are looking for, these approaches will not be able to “invent” them. Fortunately, there is another approach that can do just that.

    Synthetic data: a possible solution

    I am talking about synthetic data: artificially created and labeled data used to train AI models. In computer vision this would mean that dataset developers create a 3D environment with models of the objects that need to be recognized and their surroundings. In a synthetic environment, you know and control the precise position of every object, which gives you pixel-perfect labeling for free. Moreover, you have total control over many knobs and handles that can be adapted to your specific use case (a small code sketch of such a parametric setup follows the list below):

    • environments: backgrounds and locations for the objects;
    • lighting parameters: you can set your own light sources;
    • camera parameters: camera type (if you need to recognize images from an infrared camera, standard datasets are unlikely to help), placement etc.;
    • highly variable objects: with real data, you are limited to what you have, and with synthetic data you can mix and match everything you have created in limitless combinations.
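
    In code, such a set of knobs might look like a simple configuration object plus a sampler that implements domain randomization; every field and value range below is purely illustrative:

    ```python
    import random
    from dataclasses import dataclass

    @dataclass
    class SceneConfig:
        """Hypothetical set of 'knobs' describing one synthetic training image."""
        environment: str        # background / location asset
        light_intensity: float  # arbitrary units for the main light source
        light_azimuth: float    # degrees
        camera: str             # camera model / modality
        camera_height: float    # meters
        objects: list           # which object assets to place in the scene

    def sample_config(rng=random):
        """Domain randomization: draw one random combination of scene parameters."""
        return SceneConfig(
            environment=rng.choice(["office", "street", "warehouse"]),
            light_intensity=rng.uniform(0.2, 3.0),
            light_azimuth=rng.uniform(0.0, 360.0),
            camera=rng.choice(["rgb", "nir", "fisheye"]),
            camera_height=rng.uniform(0.5, 2.5),
            objects=rng.sample(["person_a", "person_b", "chair", "cart"], k=2),
        )

    # e.g., generate configs for a batch of 10 000 images and hand them to the renderer
    configs = [sample_config() for _ in range(10_000)]
    ```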

    For instance, synthetic human faces can have any facial features, ethnicities, ages, hairstyles, accessories, emotions, and much more. Here are a few examples from an existing synthetic dataset of faces:

    Synthetic data presents its own problems, the most important being the domain shift problem that arises because synthetic data is, well, not real. You need to train a model on one domain (synthetic data) and apply it on a different domain (real data), which leads to a whole field of AI called domain adaptation.

    In my opinion, the free labeling, high variability, and sheer boundless quantity of synthetic data (as soon as you have the models, you can generate any number of labeled images at the low cost of rendering) far outweigh this drawback. Recent research is already showing that even very straightforward applications of synthetic data can bring significant improvements in real world problems.

    Automatic generation and closing the feedback loop

    But wait, there is more. The “dataset” we referred to above is more than just a dataset—it is an entire API (FaceAPI, to be precise) that allows a user to set all of these knobs and handles, generating new synthetic data samples at scale and in a fully automated fashion, with parameters defined for API calls.

    This opens up new, even more exciting possibilities. When synthetic data generation becomes fully automated, it means that producing synthetic data is now a parametric process, and the values of parameters may influence the final quality of AI models trained on this synthetic data… you see where this is going, right? 

    Yes, we can treat data generation as part of the entire machine learning pipeline, closing the feedback loop between data generation and testing the final model on real test sets. Naturally, it is hard to expect gradients to flow through the process of rendering 3D scenes (although recent research on differentiable rendering may suggest otherwise), so learning the synthetic data generation parameters can be done, e.g., with reinforcement learning, which has methods specifically designed for these conditions. This is an early approach taken by VADRA (Visual Adversarial Domain Randomization and Augmentation):

    A similar but distinct approach is to design more direct loss functions, either by collecting data on model performance and learning from it or by finding other objectives. Here, one important example is the Meta-Sim model, which learns the parameters of scene graphs, a natural representation of 3D scene structure, so as to minimize the distribution gap between synthetic and real scenes while also accounting for downstream performance.
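
    To make the idea of a closed feedback loop concrete, here is a minimal sketch in which plain random search stands in for the reinforcement learning or learned objectives used by VADRA and Meta-Sim; the renderer, training routine, and real-data evaluation are passed in as hypothetical placeholder callables rather than real APIs.

    ```python
    # A sketch of a closed feedback loop over generation parameters. Random
    # search stands in for the RL / learned-objective methods discussed above.
    import random

    def sample_params():
        # Candidate generation parameters ("knobs and handles" of the renderer).
        return {
            "light_intensity": random.uniform(0.2, 1.0),
            "camera_jitter_deg": random.uniform(0.0, 30.0),
            "texture_randomization": random.uniform(0.0, 1.0),
        }

    def optimize_generation(generate_synthetic_dataset, train_model, evaluate_on_real,
                            budget: int = 20):
        # The three callables are hypothetical placeholders for your renderer,
        # training code, and evaluation on a real validation set.
        best_params, best_score = None, float("-inf")
        for _ in range(budget):
            params = sample_params()
            data = generate_synthetic_dataset(params, n_images=5000)  # render a dataset
            model = train_model(data)                                 # train on synthetic data
            score = evaluate_on_real(model)                           # e.g., mIoU on real data
            if score > best_score:
                best_params, best_score = params, score
        return best_params
    ```

    Since every iteration involves a full training run, such loops are expensive in practice, which is exactly why sample-efficient optimizers and learned surrogates of downstream performance are attractive here.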

    These ideas are increasingly applied in studies of synthetic data, and I believe that adaptive generation will see much wider use in the near future, bringing synthetic data to a new level of usefulness for AI/ML. I hope that the progress of modern AI will not stop at the current data problem, and I believe that synthetic data, especially with automatic generation and a closed feedback loop, is one of the key tools for overcoming it.

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Synthetic Data Case Studies: It Just Works

    Synthetic Data Case Studies: It Just Works

    In this (very) long post, we present an entire whitepaper on synthetic data, showing that synthetic data works even without complicated domain adaptation techniques in a wide variety of practical applications. We consider three specific problems, all related to human faces, show that synthetic data works for all three, and draw some other interesting and important conclusions.

    Introduction

    Synthetic data is an invaluable tool for many machine learning problems, especially in computer vision, which we will concentrate on below. In particular, many important computer vision problems, including segmentation, depth estimation, optical flow estimation, facial landmark detection, background matting, and many more, are prohibitively expensive to label manually.

    Synthetic data provides a way to have unlimited perfectly labeled data at a fraction of the cost of manually labeled data. With the 3D models of objects and environments in question, you can create an endless stream of data with any kind of labeling under different (randomized) conditions such as composition and placement of objects, background, lighting, camera placement and parameters, and so on. For a more detailed overview of synthetic data, see (Nikolenko, 2019).

    Naturally, artificially produced synthetic data cannot be perfectly photorealistic. There always exists a domain gap between real and synthetic datasets, stemming both from this lack of photorealism and from different approaches to labeling: for example, manually produced segmentation masks are generally correct but usually rough and far from pixel-perfect.

    Therefore, most works on synthetic data center around the problem of domain adaptation: how can we close this gap? There exist approaches that improve the realism, called synthetic-to-real refinement, and approaches that impose constraints on the models—their feature space, training process, or both—in order to make them operate similarly on both real and synthetic data. This is the main direction of research in synthetic data right now, and much of recent research is devoted to suggesting new approaches to domain adaptation.

    However, CGI-based synthetic data keeps getting better with time, and some works also suggest that domain randomization, i.e., simply making the synthetic data distribution sufficiently varied to ensure model robustness, may work out of the box. Moreover, recent advances in related problems such as style transfer (synthetic-to-real refinement is basically style transfer between the domains of synthetic and real images) suggest that refinement-style domain adaptation may be done with very simple techniques; even if these techniques are insufficient for photorealistic style transfer on high-resolution photographs, they may well be enough to make synthetic data useful for computer vision models.

    Still, it turns out that synthetic data can provide significant improvements even without complicated domain adaptation approaches. In this whitepaper, we consider three specific use cases where we have found synthetic data to work well under either very simple or no domain adaptation at all. We are also actively pursuing research on domain adaptation, and ideas coming from modern style transfer approaches may bring new interesting results here as well; but in this document, we concentrate on very straightforward applications of synthetic data and show that it can just work out of the box. In one of the case studies, we also compare the two main techniques for using synthetic data in this simple way—training on hybrid datasets and fine-tuning on real data after pretraining on synthetic—with interesting results.

    Here at Synthesis AI, we have developed the Face API for mass generation of high-quality synthetic 3D models of human heads, so all three cases have to do with human faces: face segmentation, background matting for human faces, and facial landmark detection. Note that all three use cases also feature some very complex labeling: while facial landmarks are merely very expensive to label by hand, manually labeled datasets for background matting are virtually impossible to obtain.

    Before we proceed to the use cases, let us describe what they all have in common.

    Data Generation and Domain Adaptation

    In this section, we describe the data generation process and the synthetic-to-real domain adaptation approach that we used throughout all three use cases.

    Face API by Synthesis AI

    The Face API, developed by Synthesis AI, can generate millions of images comprising unique people, with expressions and accessories, in a wide array of environments, with unique camera settings. Below we show some representative examples of various Face API capabilities.

    The Face API has tens of thousands of unique identities spanning genders, age groups, and ethnicities/skin tones, and new identities are added continuously.

    It also allows for modifications to the face, including expressions and emotions, eye gaze, head turn, head & facial hair, and more:

    Furthermore, the Face API allows users to adorn the subjects with accessories, including clear glasses, sunglasses, hats, other headwear, headphones, and face masks.

    Finally, it allows for indoor & outdoor environments with accurate lighting, as well as additional directional/spot lighting to further vary the conditions and emulate reality.

    The output includes:

    • RGB Images
    • Pupil Coordinates
    • Facial Landmarks (iBug 68-like)
    • Camera Settings
    • Eye Gaze
    • Segmentation Images & Values
    • Depth from Camera
    • Surface Normals
    • Alpha / Transparency

    Full documentation can be found at the Synthesis AI website. For the purposes of this whitepaper, let us just say that Face API is a more than sufficient source of synthetic human faces with any kind of labeling that a computer vision practitioner might desire.
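
    To give a feel for how such labels are consumed in practice, here is a minimal sketch of loading one rendered sample; the directory layout, file names, and JSON keys below are assumptions made for this sketch, not the actual Face API output schema (see the official documentation for the real format).

    ```python
    # Illustrative only: loading one rendered sample and its labels.
    import json
    import numpy as np
    from PIL import Image

    sample_dir = "face_api_output/sample_000"                     # hypothetical layout
    rgb = np.array(Image.open(f"{sample_dir}/rgb.png"))           # H x W x 3 image
    seg = np.array(Image.open(f"{sample_dir}/segmentation.png"))  # per-pixel class ids
    depth = np.array(Image.open(f"{sample_dir}/depth.png"))       # depth from camera

    with open(f"{sample_dir}/labels.json") as f:
        labels = json.load(f)

    landmarks = np.array(labels["facial_landmarks"])  # e.g., 68 x 2 iBug-like points
    gaze = labels["eye_gaze"]                         # e.g., yaw/pitch per eye
    camera = labels["camera"]                         # camera settings used for the render
    ```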

    Synthetic-to-Real Refinement by Instance Normalization

    Below, we consider three computer vision problems where we have experimented with using synthetic data to train more or less standard deep learning models. Although we could have applied complex domain adaptation techniques, we instead chose a simple idea inspired by style transfer models: a practical and far less computationally intensive approach that, as we show below, can work well too.

    Most recent works on style transfer, including MUNIT (Huang et al., 2018), StyleGAN (Karras et al., 2018), StyleGAN2 (Karras et al., 2019), and others, make use of the idea of adaptive instance normalization (AdaIN) proposed by Huang and Belongie (2017).

    The basic idea of AdaIN is to substitute the statistics of the style image in place of the batch normalization (BN) parameters for the corresponding BN layers during the processing of the content image:
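
    In equation form, AdaIN(x, y) = σ(y)·(x − μ(x))/σ(x) + μ(y), where μ and σ are per-channel statistics of the content features x and the style features y. A minimal PyTorch sketch of this operation (following Huang and Belongie, 2017) could look as follows.

    ```python
    # A minimal sketch of adaptive instance normalization (AdaIN): each channel of
    # the content features is normalized and then re-scaled and re-shifted with
    # the per-channel statistics of the style features.
    import torch

    def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
        # content, style: feature maps of shape (N, C, H, W)
        c_mean = content.mean(dim=(2, 3), keepdim=True)
        c_std = content.std(dim=(2, 3), keepdim=True) + eps
        s_mean = style.mean(dim=(2, 3), keepdim=True)
        s_std = style.std(dim=(2, 3), keepdim=True) + eps
        return s_std * (content - c_mean) / c_std + s_mean
    ```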

    This is a natural extension of the earlier idea of conditional instance normalization (Dumoulin et al., 2016), where normalization parameters were learned separately for each style. Both conditional and adaptive instance normalization can be useful for style transfer, but AdaIN is better suited for common style transfer tasks because it only needs a single style image to compute the statistics and does not require pretraining or, generally speaking, any information about future styles in advance.

    In style transfer architectures such as MUNIT or StyleGAN, AdaIN layers are a key component of a complex, involved architecture that usually also employs CycleGAN (Zhu et al., 2017) and/or ProGAN (Karras et al., 2017) ideas. As a result, these architectures are hard to train and, more importantly, require a lot of computational resources to use. This makes state-of-the-art style transfer architectures unsuitable for synthetic-to-real refinement, since we would need to apply them to every image in the training set.

    However, style transfer results in the original work on AdaIN (Huang and Belongie, 2017) already look quite good, and it is possible to use AdaIN in a much simpler architecture than state-of-the-art style transfer. Therefore, in our experiments we use a similar approach for synthetic-to-real refinement, replacing BN statistics for synthetic images with statistics extracted from real images.

    This approach has been shown to work several times in the literature, in several variations (Li et al., 2016; Chang et al., 2019). We follow either Chang et al. (2019), which is a simpler and more direct version, or the approach introduced by Seo et al. (2020), called domain-specific optimized normalization (DSON), where for each domain we maintain batch normalization statistics and mixture weights learned on that domain:
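
    To make the batchnorm-based adaptation concrete, here is a minimal sketch of the simplest variant: recomputing BN statistics on the target (real) domain, in the spirit of Li et al. (2016). DSON additionally maintains per-domain statistics and learned mixture weights, which this sketch omits.

    ```python
    # A minimal sketch of the simplest batchnorm-statistics variant: after training
    # (e.g., on synthetic or mixed data), re-estimate the running BatchNorm
    # statistics on unlabeled real images before evaluating on the real domain.
    import torch
    import torch.nn as nn

    def adapt_bn_statistics(model: nn.Module, real_loader, device: str = "cuda") -> nn.Module:
        for m in model.modules():
            if isinstance(m, nn.modules.batchnorm._BatchNorm):
                m.reset_running_stats()   # forget the synthetic-domain statistics
                m.momentum = None         # use a cumulative moving average instead
        model.train()                     # BN layers update running stats only in train mode
        with torch.no_grad():
            for batch in real_loader:
                images = batch[0] if isinstance(batch, (list, tuple)) else batch
                model(images.to(device))
        model.eval()
        return model
    ```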

    Thus, we have described our general approach to synthetic-to-real domain adaptation; we used it in some of our experiments, but note that in many cases we did not do any domain adaptation at all (these cases will be made clear below). With that, we are ready to proceed to specific computer vision problems.

    Face Segmentation with Synthetic Data: Syn-to-Real Transfer As Good As Real-to-Real Transfer

    Our first use case deals with the segmentation problem. Since we are talking about applications of our Face API, this will be the segmentation of human faces. What’s more, we do not simply cut out the mask of a human face from a photo but want to segment different parts of the face.

    We have used two real datasets in this study:

    • LaPa (Landmark guided face Parsing dataset), presented by Liu et al. (2020), contains more than 22,000 images with variations in pose and facial expression and some occlusions; it provides facial landmark labels (which we do not use) and faces segmented into 11 classes, as in the following example:

    • CelebAMask-HQ, presented by Lee et al. (2019), contains 30,000 high-resolution celebrity face images with 19 segmentation classes, including various hairstyles and a number of accessories such as glasses or hats:

    For the purposes of this study, we reduced both datasets to 9 classes (same as LaPa but without the eyebrows). As synthetic data, we used 200K diverse images produced by our Face API; no domain adaptation was applied in this case study (we tried it and found no improvement, and even a small deterioration in some performance metrics).

    As the basic segmentation model, we chose the DeepLabv3+ model (Chen et al., 2018) with the DRN-56 backbone, an encoder-decoder model with spatial pyramid pooling and atrous convolutions. DeepLabv3+ produces good results, often serves as a baseline in works on semantic segmentation, and, importantly, is relatively lightweight and easy to train. In particular, due to this choice, all images were resized down to 256p resolution.
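
    As a concrete illustration of how such mixed training sets can be assembled, here is a minimal PyTorch sketch; real_ds and synth_ds stand for hypothetical Dataset wrappers around CelebAMask-HQ and the Face API renders that return (image, mask) pairs, and the exact mixing scheme we used may differ.

    ```python
    # A sketch of building a hybrid training set with a given share of real data.
    import random
    from torch.utils.data import ConcatDataset, DataLoader, Subset

    def hybrid_dataset(real_ds, synth_ds, real_fraction: float):
        # Keep a random real_fraction of the real dataset and all synthetic images.
        n_real = int(len(real_ds) * real_fraction)
        real_idx = random.sample(range(len(real_ds)), n_real)
        return ConcatDataset([Subset(real_ds, real_idx), synth_ds])

    # Example usage (real_ds and synth_ds are your own Dataset objects):
    # loader = DataLoader(hybrid_dataset(real_ds, synth_ds, real_fraction=0.5),
    #                     batch_size=16, shuffle=True, num_workers=4)
    ```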

    The results, summarized in the table below, confirmed our initial hypothesis and even exceeded our expectations in some respects. The table shows the mIoU (mean intersection-over-union) scores on the CelebAMask-HQ and LaPa test sets for DeepLabv3+ trained on real data (from CelebAMask-HQ only) mixed with synthetic data in various proportions.

    In the table below, we show the results for different proportions of real data (CelebAMask-HQ) in the training set, tested on two different test sets.

    First of all, as expected, we found that training on a hybrid dataset with both real and synthetic data is undoubtedly beneficial. When both training and testing on CelebAMask-HQ (first two rows in the table), we obtain noticeable improvements across all proportions of real and synthetic data in the training set. The same holds for the two bottom rows in the table that show the results of DeepLabv3+ trained on CelebAMask-HQ and tested on LaPa.

    But the most interesting and, in our opinion, most important result is that in this context, domain transfer across two (quite similar) real datasets produces virtually the same results as domain transfer from synthetic to real data: results on LaPa with 100% real data are almost identical to the results on LaPa with 0% real data, i.e., with only synthetic data. Let us look at the plots below and then discuss what conclusions we can draw:

    Most importantly, note that the domain gap on the CelebA test set amounts to a 9.6% performance drop for the Syn-to-CelebA domain shift and 9.2% for the LaPa-to-CelebA domain shift. This very small difference suggests that while domain shift is a problem, it is not a problem specifically for the synthetic domain, which performs basically the same as a different domain of real data. The results on the LaPa test set tell a similar story: 14.4% performance drop for CelebA-to-LaPa and 16.8% for Syn-to-LaPa.

    Second, note the “fine-tune” bars that exhibit quite significant improvements over other models trained on a different domain. This is another effect we have noted in our experiments: it appears that fine-tuning on real data after pretraining on a synthetic dataset often works better than just training on a hybrid syn-plus-real dataset.

    Below, we take a more detailed look at where the errors are:

    During cross-domain testing, synthetic data is competitive with real data and even outperforms it on difficult classes such as eyes and lips. Synthetic data seems to perform worse on the nose and hair, but that can be explained by differences in how these two classes are labeled in the real and synthetic datasets.

    Thus, in this very straightforward use case we have seen that even in a direct application, without any syn-to-real refinement, synthetic data generated by our Face API works basically at the same level as training on a different real dataset.

    This is already very promising, but let us proceed to even more interesting use cases!

    Background Matting: Synthetic Data for Very Complex Labeling

    The primary use case for background matting, useful to keep in mind throughout this section, is cutting out a person from a “green screen” image/video or, even more interesting, from any background. This is, of course, a key computer vision problem in the current era of online videoconferencing.

    Formally, background matting is a task very similar to face/person segmentation, but with two important differences. First, we are looking to predict not only the binary segmentation mask but also the alpha (opacity) value, so the result is a “soft” segmentation mask with values in the [0, 1] range. This is very important to improve blending into new backgrounds.

    Second, the specific variation of background matting that we are experimenting with here takes two images as input: a pure background photo and a photo with the object (person). In other words, the matting problem here is to subtract the background from the foreground. Here is a sample image from the demo provided by Lin et al. (2020), the work that we take as the basic model for this study:

    The purpose of this work was to speed up high-quality background matting for high-resolution images so that it could run in real time; indeed, Lin et al. have also developed a Zoom plugin that works well in real videoconferencing.

    We will not dwell on the model itself for too long. Basically, Lin et al. propose a pipeline that first produces a coarse output with atrous spatial pyramid pooling similar to DeepLabv3 (Chen et al., 2017) and then recovers high-resolution matting details with a refinement network (not to be confused with syn-to-real refinement!). Here is the pipeline as illustrated in the paper:

    For our experiments, we have used the MobileNetV2 backbone (Sandler et al., 2018). For training we used virtually all parameters as provided by Lin et al. (2020) except augmentation, which we have made more robust.

    One obvious problem with background matting is that it is extremely difficult to obtain real training data. Lin et al. describe the PhotoMatte13K dataset of 13,665 2304×3456 images with manually corrected mattes that they acquired, but they release only the test set (85 images). Therefore, for real training we used the AISegment.com Human Matting Dataset (released on Kaggle) for the foreground part, refining its mattes a little with open-source matting software (more on this below). The AISegment.com dataset contains ~30,000 600×800 images—note the huge difference in resolution compared to PhotoMatte13K.

    Note that this dataset does not contain the corresponding background images, so for background images we used our own high-quality HDRI panoramas. In general, our pipeline for producing the real training set was as follows:

    • cut out the object from an AISegment.com image according to the ground truth matte;
    • take a background image, apply several relighting/distortion augmentations, and paste the object onto the resulting background image.

    This is a standard way to obtain training sets for this problem. The currently largest academic dataset for this problem, Deep Image Matting by Xu et al. (2017), uses the same kind of procedure.
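
    For illustration, here is a minimal sketch of that compositing step; the file paths are placeholders, and the relighting/distortion augmentations of the background are assumed to have been applied beforehand.

    ```python
    # A minimal sketch of the compositing step: cut the person out of an
    # AISegment.com image using its ground-truth alpha matte and paste it over
    # a background image of matching size.
    import numpy as np
    from PIL import Image

    def composite(foreground_path: str, alpha_path: str, background_path: str):
        fg_img = Image.open(foreground_path).convert("RGB")
        bg_img = Image.open(background_path).convert("RGB").resize(fg_img.size)
        alpha = np.asarray(Image.open(alpha_path).convert("L"), dtype=np.float32) / 255.0
        fg = np.asarray(fg_img, dtype=np.float32) / 255.0
        bg = np.asarray(bg_img, dtype=np.float32) / 255.0
        a = alpha[..., None]                           # H x W x 1 for broadcasting
        comp = a * fg + (1.0 - a) * bg                 # alpha-blend foreground over background
        return comp, alpha                             # new training image and its matte
    ```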

    For the synthetic part, we used our Face API engine to generate a dataset of ~10K 1024×1024 images, using the same high-quality HDRI panoramas for the background. Naturally, the synthetic dataset has very accurate alpha channels, something that could hardly be achieved in manual labeling of real photographs. In the example below, note how hard it would be to label the hair for matting:

    Before we proceed to the results, a couple of words about the quality metrics. We used slightly modified metrics from the original paper, described in more detail and motivated by Rhemann et al. (2009); a sketch of how such boundary-based metrics can be computed follows the list:

    • mse: mean squared error for the alpha channel and foreground computed along the object boundary;
    • mae: mean absolute error for the alpha channel and foreground computed along the object boundary;
    • grad: spatial-gradient metric that measures the difference between the gradients of the computed alpha matte and the ground truth computed along the object boundary;
    • conn: connectivity metric that measures average degrees of connectivity for individual pixels in the computed alpha matte and the ground truth computed along the object boundary;
    • IOU: standard intersection over union metric for the “person” class segmentation obtained from the alpha matte by thresholding.
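
    Here is a minimal sketch of two of these metrics; the width of the boundary band and the exact way the band is extracted are assumptions of this sketch rather than the precise protocol of Rhemann et al. (2009).

    ```python
    # A sketch of boundary-restricted MSE/MAE and of IoU from a thresholded matte.
    import numpy as np
    from scipy import ndimage

    def boundary_band(gt_alpha: np.ndarray, width: int = 10) -> np.ndarray:
        fg = gt_alpha > 0.5
        boundary = fg ^ ndimage.binary_erosion(fg)            # one-pixel object boundary
        return ndimage.binary_dilation(boundary, iterations=width)

    def mse_mae_on_boundary(pred_alpha, gt_alpha, width: int = 10):
        band = boundary_band(gt_alpha, width)
        diff = pred_alpha[band] - gt_alpha[band]
        return float(np.mean(diff ** 2)), float(np.mean(np.abs(diff)))

    def iou_from_alpha(pred_alpha, gt_alpha, thr: float = 0.5) -> float:
        pred, gt = pred_alpha > thr, gt_alpha > thr
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        return float(inter) / max(float(union), 1.0)
    ```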

    We have trained four models:

    • Real model trained only on the real dataset;
    • Synthetic model trained only on the synthetic dataset;
    • Mixed model trained on a hybrid syn+real dataset;
    • DA model, also trained on the hybrid syn+real dataset but with batchnorm-based domain adaptation as shown above, updating batchnorm statistics only on the real training set.

    The plots below show the quality metrics on the test set PhotoMatte85 (85 test images) where we used our HDRI panoramas as background images:

    And here are the same metrics on the PhotoMatte85 test set with 4K images downloaded from the Web as background images:

    It is hard to give specific examples where the difference would be striking, but as you can see, adding even a small high-quality synthetic dataset (our synthetic set was ~3x smaller than the real dataset) brings tangible improvements in quality. Moreover, for some metrics related to visual quality (conn and grad in particular), the model trained only on synthetic data shows better performance than the model trained on real data. The Mixed and DA models are better still and show improvements across all metrics, again demonstrating the power of mixed syn+real datasets.

    Above, we mentioned the automatic refinement of the AISegment.com dataset with open-source matting software that we applied before training. To confirm that these refinements indeed make the dataset better, we compared the performance on the refined and original AISegment.com datasets. The results clearly show that our refinement techniques bring important improvements:

    Overall, in this case study we have seen how synthetic data helps in cases when real labeled data is very hard to come by. The next study is also related to human faces but switches from variants of segmentation to a slightly different problem.

    Facial Landmark Detection: Fine-Tuning Beats Domain Adaptation

    For many facial analysis tasks, including face recognition, face frontalization, and face 3D modeling, one of the key steps is facial landmark detection, which aims to locate some predefined keypoints on facial components. In particular, in this case study we used 51 out of 68 IBUG facial landmarks. Note that there are several different standards of facial landmarks, as illustrated below (Sagonas et al., 2016):

    While this is a classic computer vision task with a long history, it still presents many challenges in practice. In particular, many existing approaches struggle to cope with occlusions, extreme poses, difficult lighting conditions, and other problems. Occlusion is probably the most important obstacle to locating facial landmarks accurately.

    As the basic model for recognizing facial landmarks, we use the stacked hourglass networks introduced by Newell et al. (2016). The architecture consists of multiple hourglass modules, each representing a fully convolutional encoder-decoder architecture with skip connections:

    Again, we do not go into full detail regarding the architecture and training process because we have not changed the basic model; our emphasis is on its performance across different training sets.

    The test set in this study consists of real images and comes from the 300 Faces In-the-Wild (300W) Challenge (Sagonas et al., 2016). It consists of 300 indoor and 300 outdoor images of varying sizes that have ground truth manual labels. Here is a sample:

    The real training set is a combination of several real datasets, semi-automatically unified to conform to the IBUG format. In total, we use ~3000 real images of varying sizes in the real training set.

    For the synthetic training set, since the real train and test sets mostly contain frontal or near-frontal good-quality images, we generated a relatively restricted dataset with the Face API, without images in extreme conditions but with some added racial diversity, mild variety in camera angles, and accessories. The main features of our synthetic training set are listed below (a hypothetical configuration sketch follows the list):

    • 10,000 synthetic images with 1024×1024 resolution, with randomized facial attributes and uniformly represented ethnicities;
    • 10% of the images contain (clear) glasses;
    • 60% of the faces have a random expression (emotion) with intensity [0.7, 1.0], and 20% of the faces have a random expression with intensity [0.1, 0.3];
    • the maximum angle of face to camera is 45 degrees; camera position and face angles are selected accordingly.
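
    The specification above can be thought of as a small set of generation parameters; the sketch below is purely illustrative, and its keys and structure are assumptions rather than the actual Face API request format.

    ```python
    # A hypothetical configuration mirroring the dataset specification above.
    landmark_trainset_spec = {
        "num_images": 10_000,
        "resolution": (1024, 1024),
        "ethnicities": "uniform",
        "glasses": {"type": "clear", "fraction": 0.10},
        "expressions": [
            {"fraction": 0.60, "intensity": (0.7, 1.0)},   # strong expressions
            {"fraction": 0.20, "intensity": (0.1, 0.3)},   # mild expressions
        ],
        "max_face_to_camera_angle_deg": 45,
    }
    ```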

    Here are some sample images from our synthetic training set:

    Next, we present our key results, achieved with the synthetic training data in several different attempts at closing the domain gap. We compare five different setups:

    • trained on real data only;
    • trained on synthetic data only;
    • trained on a mixture of synthetic and real datasets;
    • trained on a mixture of synthetic and real datasets with domain adaptation based on batchnorm statistics;
    • pretrained on the synthetic dataset and fine-tuned on real data.

    We measure two standard metrics on the 300W test set: normalized mean error (NME), the normalized average Euclidean distance between true and predicted landmarks (smaller is better), and probability of correct keypoint (PCK), the percentage of detections that fall into a predefined range of normalized deviations (larger is better).
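
    As a reference, here is a minimal sketch of how these two metrics can be computed for a single face; the normalization term (here, the inter-ocular distance taken from two assumed eye-corner indices) and the PCK threshold vary between papers, so treat them as assumptions of the sketch.

    ```python
    # A sketch of NME and PCK for a single face; pred and gt are
    # (num_landmarks, 2) arrays of predicted and ground-truth coordinates.
    import numpy as np

    def nme(pred: np.ndarray, gt: np.ndarray, left_eye: int, right_eye: int) -> float:
        norm = np.linalg.norm(gt[left_eye] - gt[right_eye])        # inter-ocular distance
        return float(np.mean(np.linalg.norm(pred - gt, axis=1)) / norm)

    def pck(pred: np.ndarray, gt: np.ndarray, left_eye: int, right_eye: int,
            thr: float = 0.08) -> float:
        norm = np.linalg.norm(gt[left_eye] - gt[right_eye])
        errors = np.linalg.norm(pred - gt, axis=1) / norm
        return float(np.mean(errors < thr))                        # fraction within threshold
    ```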

    The results clearly show that while it is quite hard to outperform the real-only benchmark (the real training set is large and labeled well, and the models are well-tuned to this kind of data), facial landmark detection can still benefit significantly from a proper introduction of synthetic data.

    Even more interestingly, we see that the improvement comes not from simply training on a hybrid dataset but from pretraining on a synthetic dataset and fine-tuning on real data.
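
    A minimal sketch of this two-stage schedule is shown below; the model, the data loaders, and the train_one_epoch routine are hypothetical placeholders, and the learning rates are illustrative.

    ```python
    # Pretrain on synthetic data, then fine-tune on the smaller real set,
    # typically with a lower learning rate.
    import torch

    def pretrain_then_finetune(model, synth_loader, real_loader, train_one_epoch,
                               pretrain_epochs: int = 30, finetune_epochs: int = 10):
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(pretrain_epochs):                      # stage 1: synthetic only
            train_one_epoch(model, synth_loader, opt)
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)   # lower LR for fine-tuning
        for _ in range(finetune_epochs):                      # stage 2: real only
            train_one_epoch(model, real_loader, opt)
        return model
    ```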

    To further investigate this effect, we have tested the fine-tuning approach across a variety of real dataset subsets. The plots below show that as the size of the real dataset used for fine-tuning decreases, the results also deteriorate (this is natural and expected):

    This fine-tuning approach is a training schedule that we have not often seen in the literature, but here it proves to be a crucial component of success. Note that in the previous case study (background matting), we also tested this approach, but it did not yield noticeable improvements.

    Conclusion

    In this whitepaper, we have considered simple ways to bring synthetic data into your computer vision projects. We have conducted three case studies for three different computer vision tasks related to human faces, using the power of Synthesis AI’s Face API to produce perfectly labeled and highly varied synthetic datasets. Let us draw some general conclusions from our results.

    First of all, as the title suggests, it just works! In all case studies, we have been able to achieve significant improvements or results on par with real data by using synthetically generated datasets, without complex domain adaptation models. Our results suggest that synthetic data is a simple but very efficient way to improve computer vision models, especially in tasks with complex labeling.

    Second, we have seen that the synthetic-to-real domain gap can be the same as the real-to-real domain gap. This is an interesting result because it suggests that while domain transfer obviously remains a problem, it is not specific to synthetic data, which proves to be on par with real data if you train and test in different conditions. We have supported this with our face segmentation study.

    Third, even a small amount of synthetic data can help a lot. This is a somewhat counterintuitive conclusion: traditionally, synthetic datasets have been all about quantity and diversity over quality. However, we have found that in problems where labels are very hard to come by and are often imprecise, such as background matting in one of our case studies, even a relatively small synthetic dataset can go a long way towards getting the labels correct for the model.

    Fourth, fine-tuning on real data after pretraining on a synthetic dataset seems to work better than training on a hybrid dataset. We do not claim that this will always be the case, but a common theme in our case studies is that these approaches may indeed yield different results, and it might pay to investigate both (especially since they are very straightforward to implement and compare).

    We believe that synthetic data may become one of the main driving forces for computer vision in the near future, as real datasets reach saturation and/or become hopelessly expensive. In this whitepaper, we have seen that it does not have to be hard to incorporate synthetic data into your models. Try it, it might just work!

    Sergey Nikolenko
    Head of AI, Synthesis AI