Category: Synthesis AI

  • Synthetic Data Case Studies: It Just Works

    Synthetic Data Case Studies: It Just Works

    In this (very) long post, we present an entire whitepaper on synthetic data, proving that synthetic data works even without complicated domain adaptation techniques in a wide variety of practical applications. We consider three specific problems, all related to human faces, show that synthetic data works for all three, and draw some other interesting and important conclusions.

    Introduction

    Synthetic data is an invaluable tool for many machine learning problems, especially in computer vision, which we will concentrate on below. In particular, many important computer vision problems, including segmentation, depth estimation, optical flow estimation, facial landmark detection, background matting, and many more, are prohibitively expensive to label manually.

    Synthetic data provides a way to have unlimited perfectly labeled data at a fraction of the cost of manually labeled data. With the 3D models of objects and environments in question, you can create an endless stream of data with any kind of labeling under different (randomized) conditions such as composition and placement of objects, background, lighting, camera placement and parameters, and so on. For a more detailed overview of synthetic data, see (Nikolenko, 2019).

    Naturally, artificially produced synthetic data cannot be perfectly photorealistic. There always exists a domain gap between real and synthetic datasets, stemming both from this lack of photorealism and also in part from different approaches to labeling: for example, manually produced segmentation masks are generally correct but usually rough and far from pixel-perfect.

    Therefore, most works on synthetic data center around the problem of domain adaptation: how can we close this gap? There exist approaches that improve the realism, called synthetic-to-real refinement, and approaches that impose constraints on the models—their feature space, training process, or both—in order to make them operate similarly on both real and synthetic data. This is the main direction of research in synthetic data right now, and much of recent research is devoted to suggesting new approaches to domain adaptation.

    However, CGI-based synthetic data becomes better and better with time, and some works also suggest that domain randomization, i.e., simply making the synthetic data distribution sufficiently varied to ensure model robustness, may work out of the box. On the other hand, recent advances in related problems such as style transfer (synthetic-to-real refinement is basically style transfer between the domains of synthetic and real images) suggest that refinement-style domain adaptation may be done with very simple techniques; it might happen that while these techniques are insufficient for photorealistic style transfer for high-resolution photographs they are quite enough to make synthetic data useful for computer vision models.

    Still, it turns out that synthetic data can provide significant improvements even without complicated domain adaptation approaches. In this whitepaper, we consider three specific use cases where we have found synthetic data to work well under either very simple or no domain adaptation at all. We are also actively pursuing research on domain adaptation, and ideas coming from modern style transfer approaches may prove to bring new interesting results here as well; but in this document, we concentrate on very straightforward applications of synthetic data and show that synthetic data can just work out of the box. In one of the case studies, we compare two main techniques to using synthetic data in this simple way—training on hybrid datasets and fine-tuning on real data after pretraining on synthetic—also with interesting results.

    Here at Synthesis AI, we have developed the Face API for mass generation of high-quality synthetic 3D models of human heads, so all three cases have to do with human faces: face segmentation, background matting for human faces, and facial landmark detection. Note that all three use cases also feature some very complex labeling: while facial landmarks are merely very expensive to label by hand, manually labeled datasets for background matting are virtually impossible to obtain.

    Before we proceed to the use cases, let us describe what they all have in common.

    Data Generation and Domain Adaptation

    In this section, we describe the data generation process and the synthetic-to-real domain adaptation approach that we used throughout all three use cases.

    Face API by Synthesis AI

    The Face API, developed by Synthesis AI, can generate millions of images comprising unique people, with expressions and accessories, in a wide array of environments, with unique camera settings. Below we show some representative examples of various Face API capabilities.

The Face API has tens of thousands of unique identities that span genders, age groups, and ethnicities/skin tones, and new identities are added continuously.

    It also allows for modifications to the face, including expressions and emotions, eye gaze, head turn, head & facial hair, and more:

Furthermore, the Face API allows subjects to be adorned with accessories, including clear glasses, sunglasses, hats, other headwear, headphones, and face masks.

    Finally, it allows for indoor & outdoor environments with accurate lighting, as well as additional directional/spot lighting to further vary the conditions and emulate reality.

    The output includes:

    • RGB Images
    • Pupil Coordinates
    • Facial Landmarks (iBug 68-like)
    • Camera Settings
    • Eye Gaze
    • Segmentation Images & Values
    • Depth from Camera
    • Surface Normals
    • Alpha / Transparency

    Full documentation can be found at the Synthesis AI website. For the purposes of this whitepaper, let us just say that Face API is a more than sufficient source of synthetic human faces with any kind of labeling that a computer vision practitioner might desire.

    Synthetic-to-Real Refinement by Instance Normalization

Below, we consider three computer vision problems where we have experimented with using synthetic data to train more or less standard deep learning models. Instead of applying complex domain adaptation techniques, we chose a single simple idea inspired by style transfer models, showing that practical, less computationally intensive approaches can work well too.

Most recent works on style transfer, including MUNIT (Huang et al., 2018), StyleGAN (Karras et al., 2018), StyleGAN2 (Karras et al., 2019), and others, make use of the idea of adaptive instance normalization (AdaIN) proposed by Huang and Belongie (2017).

    The basic idea of AdaIN is to substitute the statistics of the style image in place of the batch normalization (BN) parameters for the corresponding BN layers during the processing of the content image:
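In the standard notation of Huang and Belongie (2017), with x the content (synthetic) feature map and y the style (real) feature map, the AdaIN transform is

    \[\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y),\]

where μ(·) and σ(·) denote per-channel means and standard deviations computed over spatial locations.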

This is a natural extension of the earlier idea of conditional instance normalization (Dumoulin et al., 2016), where BN parameters were learned separately for each style. Both conditional and adaptive instance normalization can be useful for style transfer, but AdaIN is better suited for common style transfer tasks because it only needs a single style image to compute the statistics and does not require pretraining or, generally speaking, any advance information about future styles.

In style transfer architectures such as MUNIT or StyleGAN, AdaIN layers are a key component of complex, involved architectures that usually also employ CycleGAN (Zhu et al., 2017) and/or ProGAN (Karras et al., 2017) ideas. As a result, these architectures are hard to train and, even more importantly, require a lot of computational resources to use. This makes state-of-the-art style transfer architectures unsuitable for synthetic-to-real refinement, since we need to apply them to every image in the training set.

However, style transfer results in the original work on AdaIN (Huang and Belongie, 2017) already look quite good, and it is possible to use AdaIN in a much simpler architecture than state-of-the-art style transfer models. Therefore, in our experiments we use a similar approach for synthetic-to-real refinement, replacing BN statistics for synthetic images with statistics extracted from real images.

This approach has been shown to work several times in the literature, in several variations (Li et al., 2016; Chang et al., 2019). We follow either Chang et al. (2019), which is a simpler and more direct version, or the approach introduced by Seo et al. (2020), called domain-specific optimized normalization (DSON), where for each domain we maintain batch normalization statistics and mixture weights learned on the corresponding domain:
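Schematically, and in our own notation rather than that of the original papers, the DSON-style normalization of an activation x in domain d mixes batch and instance normalization statistics with learned weights w_d and applies domain-specific affine parameters γ_d, β_d:

    \[\hat{x}_d = \gamma_d\,\frac{x - \mu_d}{\sqrt{\sigma^2_d + \epsilon}} + \beta_d,\qquad \mu_d = w_d\,\mu_{\mathrm{BN}} + (1 - w_d)\,\mu_{\mathrm{IN}},\quad \sigma^2_d = w_d\,\sigma^2_{\mathrm{BN}} + (1 - w_d)\,\sigma^2_{\mathrm{IN}}.\]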

Thus, we have described our general approach to synthetic-to-real domain adaptation; we used it in some of our experiments, but note that in many cases we did not do any domain adaptation at all (these cases will be made clear below). With that, we are ready to proceed to specific computer vision problems.

    Face Segmentation with Synthetic Data: Syn-to-Real Transfer As Good As Real-to-Real Transfer

    Our first use case deals with the segmentation problem. Since we are talking about applications of our Face API, this will be the segmentation of human faces. What’s more, we do not simply cut out the mask of a human face from a photo but want to segment different parts of the face.

    We have used two real datasets in this study:

    • LaPa (Landmark guided face Parsing dataset), presented by Liu et al. (2020), contains more than 22,000 images, with variations in pose and facial expression and also with some occlusions among the images; it contains facial landmark labels (which we do not use) and faces segmented into 11 classes, as in the following example:

    • CelebAMask-HQ, presented by Lee et al. (2019), contains 30,000 high-resolution celebrity face images with 19 segmentation classes, including various hairstyles and a number of accessories such as glasses or hats:

For the purposes of this study, we reduced both datasets to 9 classes (same as LaPa but without the eyebrows). As synthetic data, we used 200K diverse images produced by our Face API; no domain adaptation was applied in this case study (we tried it and found no improvement, and even a small deterioration in some performance metrics).

    As the basic segmentation model, we have chosen the DeepLabv3+ model (Chen et al., 2018) with the DRN-56 backbone, an encoder-decoder model with spatial pyramid pooling and atrous convolutions. DeepLabv3+ produces good results, often serves as a baseline in works on semantic segmentation, and, importantly, is relatively lightweight and easy to train. In particular, due to this choice all images were resized down to 256p.
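Before turning to the results, here is a minimal PyTorch-style sketch of how a hybrid real+synthetic training set can be assembled in a given proportion. The dataset constructors and the proportion value are hypothetical placeholders rather than our actual training code; the sketch only illustrates the weighted-sampling idea.

import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

# Hypothetical dataset objects standing in for CelebAMask-HQ and Face API renders;
# these constructors are placeholders, not our actual data loading code.
real_ds = RealFaceSegmentationDataset(...)      # placeholder
syn_ds = SyntheticFaceSegmentationDataset(...)  # placeholder

real_fraction = 0.5   # desired share of real images per batch (illustrative value)
hybrid = ConcatDataset([real_ds, syn_ds])

# Per-sample weights so that real and synthetic images are drawn in the chosen proportion.
weights = torch.cat([
    torch.full((len(real_ds),), real_fraction / len(real_ds)),
    torch.full((len(syn_ds),), (1.0 - real_fraction) / len(syn_ds)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(hybrid), replacement=True)
loader = DataLoader(hybrid, batch_size=16, sampler=sampler)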

The results, summarized in the table below, confirmed our initial hypothesis and even exceeded our expectations in some respects. The table shows mIoU (mean intersection-over-union) scores for DeepLabv3+ trained on different proportions of real data (CelebAMask-HQ) mixed with synthetic data, tested on two different test sets: CelebAMask-HQ and LaPa.

    First of all, as expected, we found that training on a hybrid dataset with both real and synthetic data is undoubtedly beneficial. When both training and testing on CelebAMask-HQ (first two rows in the table), we obtain noticeable improvements across all proportions of real and synthetic data in the training set. The same holds for the two bottom rows in the table that show the results of DeepLabv3+ trained on CelebAMask-HQ and tested on LaPa.

But the most interesting and, in our opinion, most important result is that in this context, domain transfer across two (quite similar) real datasets produces virtually the same results as domain transfer from synthetic to real data: results on LaPa when training on 100% real data are almost identical to the results when training on synthetic data only. Let us look at the plots below and then discuss what conclusions we can draw:

    Most importantly, note that the domain gap on the CelebA test set amounts to a 9.6% performance drop for the Syn-to-CelebA domain shift and 9.2% for the LaPa-to-CelebA domain shift. This very small difference suggests that while domain shift is a problem, it is not a problem specifically for the synthetic domain, which performs basically the same as a different domain of real data. The results on the LaPa test set tell a similar story: 14.4% performance drop for CelebA-to-LaPa and 16.8% for Syn-to-LaPa.

Second, note the “fine-tune” bars that exhibit (quite significant) improvements over other models trained on a different domain. This is another effect we have noted in our experiments: it appears that fine-tuning on real data after pretraining on a synthetic dataset often works better than just training on a hybrid syn-plus-real dataset.

    Below, we show a more detailed look into where the errors are:  

In cross-domain testing, synthetic data is competitive with real data and even outperforms it on difficult classes such as eyes and lips. Synthetic data seems to perform worse on nose and hair, but that can be explained by differences in how these two classes are labeled in the real and synthetic datasets.

Thus, in this first use case we have seen that even in a very direct application, without any syn-to-real refinement, synthetic data generated by our Face API works basically at the same level as training on a different real dataset.

    This is already very promising, but let us proceed to even more interesting use cases!

    Background Matting: Synthetic Data for Very Complex Labeling

    The primary use case for background matting, useful to keep in mind throughout this section, is cutting out a person from a “green screen” image/video or, even more interesting, from any background. This is, of course, a key computer vision problem in the current era of online videoconferencing.

    Formally, background matting is a task very similar to face/person segmentation, but with two important differences. First, we are looking to predict not only the binary segmentation mask but also the alpha (opacity) value, so the result is a “soft” segmentation mask with values in the [0, 1] range. This is very important to improve blending into new backgrounds.

    Second, the specific variation of background matting that we are experimenting with here takes two images as input: a pure background photo and a photo with the object (person). In other words, the matting problem here is to subtract the background from the foreground. Here is a sample image from the demo provided by Lin et al. (2020), the work that we take as the basic model for this study:

The purpose of this work was to speed up high-quality background matting for high-resolution images so that it could work in real time; indeed, Lin et al. have also developed a Zoom plugin that works well in real videoconferencing.

We will not dwell on the model itself for too long. Basically, Lin et al. propose a pipeline that first produces a coarse output with atrous spatial pyramid pooling similar to DeepLabv3 (Chen et al., 2017) and then recovers high-resolution matting details with a refinement network (not to be confused with syn-to-real refinement!). Here is the pipeline as illustrated in the paper:

For our experiments, we have used the MobileNetV2 backbone (Sandler et al., 2018). For training, we used virtually the same parameters as provided by Lin et al. (2020), except for augmentation, which we have made more robust.

One obvious problem with background matting is that it is extremely difficult to obtain real training data. Lin et al. describe the PhotoMatte13K dataset of 13,665 2304×3456 images with manually corrected mattes that they acquired, but they release only the test set (85 images). Therefore, for real training we used the AISegment.com Human Matting Dataset (released on Kaggle) for the foreground part, refining its mattes slightly with open-source matting software (more on this below). The AISegment.com dataset contains ~30,000 600×800 images—note the huge difference in resolution compared to PhotoMatte13K.

    Note that this dataset does not contain the corresponding background images, so for background images we used our own high-quality HDRI panoramas. In general, our pipeline for producing the real training set was as follows:

    • cut out the object from an AISegment.com image according to the ground truth matte;
• take a background image, apply several relighting/distortion augmentations, and paste the object onto the resulting background image (see the compositing sketch after this list).
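To make the compositing step concrete, here is a minimal numpy sketch of the standard alpha compositing formula used to paste a cut-out foreground onto a new background; the augmentation step is only hinted at in a comment, and the function name is our own illustrative choice.

import numpy as np

def composite(fg_rgb, alpha, bg_rgb):
    # fg_rgb, bg_rgb: float arrays in [0, 1] with shape (H, W, 3); alpha: (H, W) in [0, 1].
    # In the actual pipeline, bg_rgb would first go through relighting/distortion augmentations.
    a = alpha[..., None]                    # broadcast the matte over RGB channels
    return a * fg_rgb + (1.0 - a) * bg_rgb  # standard alpha compositing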

    This is a standard way to obtain training sets for this problem. The currently largest academic dataset for this problem, Deep Image Matting by Xu et al. (2017), uses the same kind of procedure.

    For the synthetic part, we used our Face API engine to generate a dataset of ~10K 1024×1024 images, using the same high-quality HDRI panoramas for the background. Naturally, the synthetic dataset has very accurate alpha channels, something that could hardly be achieved in manual labeling of real photographs. In the example below, note how hard it would be to label the hair for matting:

Before we proceed to the results, a couple of words about the quality metrics. We used slightly modified metrics from the original paper, described in more detail and motivated by Rhemann et al. (2009); a small computation sketch follows the list:

    • mse: mean squared error for the alpha channel and foreground computed along the object boundary;
    • mae: mean absolute error for the alpha channel and foreground computed along the object boundary;
    • grad: spatial-gradient metric that measures the difference between the gradients of the computed alpha matte and the ground truth computed along the object boundary;
    • conn: connectivity metric that measures average degrees of connectivity for individual pixels in the computed alpha matte and the ground truth computed along the object boundary;
    • IOU: standard intersection over union metric for the “person” class segmentation obtained from the alpha matte by thresholding.
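As a rough illustration of how the boundary-restricted metrics can be computed, here is a numpy/scipy sketch for mse, mae, and IOU; the band width and thresholds are arbitrary illustrative choices rather than the exact values we used, and grad and conn (see Rhemann et al., 2009) are omitted for brevity.

import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_band(gt_alpha, width=10):
    # Band of pixels around the object boundary of the ground-truth matte.
    mask = gt_alpha > 0.5
    return binary_dilation(mask, iterations=width) & ~binary_erosion(mask, iterations=width)

def matting_metrics(pred_alpha, gt_alpha):
    band = boundary_band(gt_alpha)
    diff = pred_alpha[band] - gt_alpha[band]
    pred_mask, gt_mask = pred_alpha > 0.5, gt_alpha > 0.5
    iou = (pred_mask & gt_mask).sum() / max((pred_mask | gt_mask).sum(), 1)
    return {"mse": float((diff ** 2).mean()),
            "mae": float(np.abs(diff).mean()),
            "IOU": float(iou)}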

    We have trained four models:

    • Real model trained only on the real dataset;
    • Synthetic model trained only on the synthetic dataset;
    • Mixed model trained on a hybrid syn+real dataset;
• DA model, also trained on the hybrid syn+real dataset but with the batchnorm-based domain adaptation described above, updating batch normalization statistics only on the real training set (a minimal sketch of this update step follows the list).
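One simple way to implement this kind of batchnorm adaptation is to freeze all weights and run forward passes over real images in training mode, so that the BatchNorm layers refresh their running statistics; the following PyTorch sketch illustrates the idea but is not our exact training code.

import torch

def adapt_batchnorm_stats(model, real_loader, device="cuda"):
    # Re-estimate BatchNorm running statistics on real images only, keeping all weights fixed:
    # in train mode, BN layers refresh running mean/variance during the forward pass,
    # and torch.no_grad() ensures that no parameters are updated.
    model.train()
    with torch.no_grad():
        for images, _ in real_loader:
            model(images.to(device))
    model.eval()
    return model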

    The plots below show the quality metrics on the test set PhotoMatte85 (85 test images) where we used our HDRI panoramas as background images:

    And here are the same metrics on the PhotoMatte85 test set with 4K images downloaded from the Web as background images:

    It is hard to give specific examples where the difference would be striking, but as you can see, adding even a small high-quality synthetic dataset (our synthetic set was ~3x smaller than the real dataset) brings tangible improvements in the quality. Moreover, for some metrics related to visual quality (conn and grad in particular) the model trained only on synthetic data shows better performance than the model trained on real data. The Mixed and DA models are better yet, and show improvements across all metrics, again demonstrating the power of mixed syn+real datasets.

Above, we have mentioned the automatic refinement of the AISegment.com dataset with open-source matting software that we applied before training. To confirm that this refinement indeed makes the dataset better, we have compared performance on the refined and original AISegment.com datasets. The results clearly show that our refinement brings important improvements:

    Overall, in this case study we have seen how synthetic data helps in cases when real labeled data is very hard to come by. The next study is also related to human faces but switches from variants of segmentation to a slightly different problem.

    Facial Landmark Detection: Fine-Tuning Beats Domain Adaptation

    For many facial analysis tasks, including face recognition, face frontalization, and face 3D modeling, one of the key steps is facial landmark detection, which aims to locate some predefined keypoints on facial components. In particular, in this case study we used 51 out of 68 IBUG facial landmarks. Note that there are several different standards of facial landmarks, as illustrated below (Sagonas et al., 2016):

    While this is a classic computer vision task with a long history, it unfortunately still suffers from many challenges in reality. In particular, many existing approaches struggle to cope with occlusions, extreme poses, difficult lighting conditions, and other problems. The occlusion problem is probably the most important obstacle to locating facial landmarks accurately.

    As the basic model for recognizing facial landmarks, we use the stacked hourglass networks introduced by Newell et al. (2016). The architecture consists of multiple hourglass modules, each representing a fully convolutional encoder-decoder architecture with skip connections:

Again, we do not go into full detail regarding the architecture and training process because we have not changed the basic model; our emphasis is on its performance across different training sets.

    The test set in this study consists of real images and comes from the 300 Faces In-the-Wild (300W) Challenge (Sagonas et al., 2016). It consists of 300 indoor and 300 outdoor images of varying sizes that have ground truth manual labels. Here is a sample:

    The real training set is a combination of several real datasets, semi-automatically unified to conform to the IBUG format. In total, we use ~3000 real images of varying sizes in the real training set.

    For the synthetic training set, since real train and test sets mostly contain frontal or near-frontal good quality images, we generated a relatively restricted dataset with the Face API, without images in extreme conditions but with some added racial diversity, mild variety in camera angles, and accessories. The main features of our synthetic training set are:

    • 10,000 synthetic images with 1024×1024 resolution, with randomized facial attributes and uniformly represented ethnicities;
    • 10% of the images contain (clear) glasses;
    • 60% of the faces have a random expression (emotion) with intensity [0.7, 1.0], and 20% of the faces have a random expression with intensity [0.1, 0.3];
    • the maximum angle of face to camera is 45 degrees; camera position and face angles are selected accordingly.

    Here are some sample images from our synthetic training set:

Next, we present the key results achieved with synthetic training data in several different attempts at closing the domain gap. We compare five different setups:

    • trained on real data only;
    • trained on synthetic data only;
    • trained on a mixture of synthetic and real datasets;
    • trained on a mixture of synthetic and real datasets with domain adaptation based on batchnorm statistics;
    • pretrained on the synthetic dataset and fine-tuned on real data.

    We measure two standard metrics on the 300W test set: normalized mean error (NME), the normalized average Euclidean distance between true and predicted landmarks (smaller is better), and probability of correct keypoint (PCK), the percentage of detections that fall into a predefined range of normalized deviations (larger is better).
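For reference, here is a minimal numpy sketch of both metrics; the normalization distance and the PCK threshold are benchmark conventions, and the specific threshold value below is illustrative rather than the exact one we report.

import numpy as np

def nme(pred, gt, norm_dist):
    # Normalized mean error: pred and gt have shape (num_landmarks, 2);
    # norm_dist is the benchmark's normalization distance (e.g., inter-ocular distance).
    return float(np.linalg.norm(pred - gt, axis=1).mean() / norm_dist)

def pck(pred, gt, norm_dist, threshold=0.1):
    # Probability of correct keypoint: fraction of landmarks whose normalized deviation
    # falls below the threshold (the threshold value here is illustrative).
    deviations = np.linalg.norm(pred - gt, axis=1) / norm_dist
    return float((deviations < threshold).mean())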

    The results clearly show that while it is quite hard to outperform the real-only benchmark (the real training set is large and labeled well, and the models are well-tuned to this kind of data), facial landmark detection can still benefit significantly from a proper introduction of synthetic data.

    Even more interestingly, we see that the improvement comes not from simply training on a hybrid dataset but from pretraining on a synthetic dataset and fine-tuning on real data.

    To further investigate this effect, we have tested the fine-tuning approach across a variety of real dataset subsets. The plots below show that as the size of the real dataset used for fine-tuning decreases, the results also deteriorate (this is natural and expected):

This fine-tuning approach is a training schedule that we have not often seen in the literature, but here it proves to be a crucial component of success. Note that in the previous case study (background matting), we also tested this approach, but it did not yield noticeable improvements.
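Schematically, the schedule simply runs two consecutive training stages, first on synthetic and then on real data. In the PyTorch-style sketch below, the model, data loaders, loss helper, and hyperparameters (epochs, learning rates) are all hypothetical placeholders; in practice, the fine-tuning stage usually uses a lower learning rate.

import torch

def train_stage(model, loader, epochs, lr, device="cuda"):
    # One generic supervised training stage; compute_landmark_loss is a hypothetical
    # helper standing in for the model-specific heatmap loss.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss = compute_landmark_loss(model(images.to(device)), targets.to(device))  # placeholder
            loss.backward()
            optimizer.step()
    return model

# Stage 1: pretrain on the synthetic dataset.
model = train_stage(model, synthetic_loader, epochs=50, lr=1e-3)
# Stage 2: fine-tune on real data, typically with a lower learning rate.
model = train_stage(model, real_loader, epochs=20, lr=1e-4)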

    Conclusion

    In this whitepaper, we have considered simple ways to bring synthetic data into your computer vision projects. We have conducted three case studies for three different computer vision tasks related to human faces, using the power of Synthesis AI’s Face API to produce perfectly labeled and highly varied synthetic datasets. Let us draw some general conclusions from our results.

    First of all, as the title suggests, it just works! In all case studies, we have been able to achieve significant improvements or results on par with real data by using synthetically generated datasets, without complex domain adaptation models. Our results suggest that synthetic data is a simple but very efficient way to improve computer vision models, especially in tasks with complex labeling.

Second, we have seen that the synthetic-to-real domain gap can be the same as the real-to-real domain gap. This is an interesting result because it suggests that while domain transfer still, obviously, remains a problem, it is not specific to synthetic data, which proves to be on par with real data if you train and test in different conditions. We have supported this with our face segmentation study.

    Third, even a small amount of synthetic data can help a lot. This is a somewhat counterintuitive conclusion: traditionally, synthetic datasets have been all about quantity and diversity over quality. However, we have found that in problems where labels are very hard to come by and are often imprecise, such as background matting in one of our case studies, even a relatively small synthetic dataset can go a long way towards getting the labels correct for the model.

    Fourth, fine-tuning on real data after pretraining on a synthetic dataset seems to work better than training on a hybrid dataset. We do not claim that this will always be the case, but a common theme in our case studies is that these approaches may indeed yield different results, and it might pay to investigate both (especially since they are very straightforward to implement and compare).

    We believe that synthetic data may become one of the main driving forces for computer vision in the near future, as real datasets reach saturation and/or become hopelessly expensive. In this whitepaper, we have seen that it does not have to be hard to incorporate synthetic data into your models. Try it, it might just work!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Top 5 Applications of Synthetic Data

    Top 5 Applications of Synthetic Data

It’s been a while since we last met on this blog. Today, we are having a brief interlude in the long series of posts on how to make machine learning models better with synthetic data (that’s a long and still unfinished series: Part I, Part II, Part III, Part IV, Part V, Part VI). I will give a brief overview of five primary fields where synthetic data can shine. You will see that most of them are related to computer vision, which is natural for synthetic data based on 3D models. Still, it makes sense to clarify where exactly synthetic data is already working well and where we expect it to shine in the near future.

    The Plan

    In this post, we will review five major fields of synthetic data applications (links go to the corresponding sections of this rather long post):

    • human faces, where 3D models of human heads are developed in order to produce datasets for face recognition, verification, and similar applications;
    • indoor environments, where 3D models and/or simulated environments of home interiors can be used for navigation in home robotics, for AI assistants, for AR/VR applications and more;
    • outdoor environments, where synthetic data producers build entire virtual cities in order to train computer vision systems for self-driving cars, security systems, retail logistics and so on;
    • industrial simulated environments, used to train industrial robots for manufacturing, logistics on industrial plants, and other applications related to manufacturing;
    • synthetic documents and media that can be used to train text recognition systems, systems for recognizing and modifying multimodal media such as advertising, and the like.

    Each deserves a separate post, and some will definitely get it soon, but today we are here for a quick overview.

    Human Faces

    There exist numerous applications of computer vision related to human subjects and specifically human faces. For example:

• face identification, i.e., automatic recognition of people by their faces; it is important not only for security cameras at top-secret facilities — the very same technology works in your smartphone or laptop when it tries to recognize you and unlock the device without a cumbersome password;
• gaze estimation means tracking where exactly on the screen you are looking; right now, gaze estimation is important for focus group studies intended to design better interfaces and more efficient advertising, but I expect that in the near future it will be a key technology for a wide range of user-facing AR/VR applications;
    • emotion recognition has obvious applications in finding out user reactions to interactions with automated systems and less obvious but perhaps even more important applications such as recognizing whether a car driver is about to fall asleep behind the wheel;
    • all of these problems may include and rely upon facial keypoint recognition, that is, finding specific points of interest on a human face, and so on, and so forth.

Synthetic models and images of people (both faces and full bodies) are an especially interesting subject for synthetic data. On the one hand, while large-scale real datasets of photographs of humans definitely exist, they are even harder to collect than usual. First, there are privacy issues involved in the collection of real human faces; I will postpone this discussion until a separate post on synthetic faces, so let me just link to a recent study by Raji and Fried (2021), who point out a lot of problems in this regard in existing large-scale datasets.

Second, the labeling for some basic computer vision problems is especially complex: while pose estimation is doable, facial keypoint detection (a key element of face recognition and image manipulation for faces) may require specifying several dozen landmarks on a human face, which becomes very hard for human labelers.

    Third, even if a large dataset is available, it often contains biases in its composition of genders, races, or other parameters of human subjects, sometimes famously so; again, for now let me just link to a paper by Kortylewski et al. (2017) that we will discuss in detail in a later post.

These are the primary advantages of using synthetic datasets for human faces (beyond the usual reasons such as limitless, perfectly labeled data). On the other hand, there are complications as well, chief among them being that synthetic 3D models of people, and especially synthetic faces, are much harder to create than models of basic objects, especially if sufficient fidelity is required. This creates a tension between the quality of available synthetic faces and the improvements in face recognition and other related tasks that they can provide.

    Here at Synthesis AI, we have already developed a system for producing large-scale synthetic datasets of human faces, including very detailed labeling of facial landmarks, possible occlusions such as glasses or medical masks, varying lighting conditions, and much more. Again, we will talk about this in much more detail later, so for now let me show a few examples of what our APIs are capable of and go on to the next application.

    Indoor Environments

    Human faces seldom require simulation environments because most applications, including all listed above, deal with static pictures and will not benefit much from a video. But in our next set of examples, it is often crucial to have an interactive environment. Static images will hardly be sufficient to train, e.g., an autonomous vehicle such as a self-driving car or a drone, or an industrial robot. Learning to control a vehicle or robot often requires reinforcement learning, where an agent has to learn from interacting with the environment, and real world experiments are usually entirely impractical for training. Fortunately, this is another field where synthetic data shines: once one has a fully developed 3D environment that can produce datasets for computer vision or other sensory readings, it is only one more step to begin active interaction or at least movement within this environment.

    We begin with indoor navigation, an important field where synthetic datasets are required. The main problems here are, as usual for such environments, SLAM (simultaneous localization and mapping, i.e., understanding where the agent is located inside the environment) and navigation. Potential applications here lie in the field of home robotics, industrial robots, and embodied AI, but for our purposes you can simply think of a robotic vacuum cleaner that has to navigate your house based on sensor readings. There exist large-scale efforts to create real annotated datasets of indoor scenes (Chang et al., 2017; Song et al., 2015; Xia et al., 2018), but synthetic data has always been extremely important.

Historically, the main synthetic dataset for indoor navigation was SUNCG, presented by Song et al. (2017).

It contained over 45,000 different scenes (floors of private houses) with manually created realistic room layouts, 3D models of the furniture, realistic textures, and so on. All scenes were semantically annotated at the object level, and the dataset provides synthetic depth maps and volumetric ground truth data for the scenes. The original paper presented state-of-the-art results in semantic scene completion, but, naturally, SUNCG has been used for many different tasks related to depth estimation, indoor navigation, SLAM, and others; see Qi et al. (2017), Abbasi et al. (2018), and Chen et al. (2019), to name just a few. It often served as the basis for scene understanding competitions, e.g., the SUMO workshop on 360° Indoor Scene Understanding and Modeling at CVPR 2019.

    Why the past tense, though? Interestingly, even indoor datasets without any humans in them can be murky in terms of legality. Right now, the SUNCG paper website is up but the dataset website is down and SUNCG itself is unavailable due to a legal controversy over the data: the Planner5D company claims that Facebook used their software and data to produce SUNCG and made it publicly available without consent from Planner5D; see more details here.

In any case, by now we have larger and more detailed synthetic indoor environments. A detailed survey will have to wait for a separate post, but I want to highlight AI Habitat, released by the very same Facebook; see also the paper by Savva, Kadian et al. (2019). It is a simulated environment explicitly intended to train embodied AI agents, and it includes a high-performance full 3D simulator with high-fidelity images. AI Habitat environments might look something like this:

Simulated environments of this kind can be used to train home robots, security cameras, and AI assistants, develop AR/VR applications, and much more. It is an exciting field where Synthesis AI is also making advances.

    Outdoor Environments

Now we come to one of the most important and historically best developed application directions for synthetic data: outdoor simulated environments intended to improve the motion of autonomous robots. Possible applications include SLAM, motion planning, and motion control for self-driving cars (urban navigation), unmanned aerial vehicles, and much more (Fossen et al., 2017; Milz et al., 2018; Paden et al., 2016); see also general surveys of computer vision for mobile robot navigation (Bonin-Font et al., 2008; Desouza, Kak, 2002) and perception and control for autonomous driving (Pendleton et al., 2017).

    There are two main directions here:

    • training vision systems for segmentation, SLAM, and other similar problems; in this case, synthetic data can be represented by a dataset of static images;
    • fully training autonomous driving agents, usually with reinforcement learning, in a virtual environment; in this case, you have to be able to render the environment in real time (preferably much faster since reinforcement learning usually needs millions of episodes), and the system also has to include a realistic simulation of the controls for the agent.

    I cannot hope to give this topic justice in a single section, so there will definitely be separate posts on this, and for now let me just scatter a few examples of modern synthetic datasets for autonomous driving.

    First, let me remind you that one of the very first applications of synthetic data to training neural networks, the ALVINN network that we already discussed on this blog, was actually an autonomous driving system. Trained on synthetic 30×32 videos supplemented with 8×32 range finder data, ALVINN was one of the first successful applications of neural networks in autonomous driving.

    By now, resolutions have grown. Around 2015-2016, researchers realized that modern interactive 3D projects (that is, games) had progressed so much that one can use their results as high-fidelity synthetic data. Therefore, several important autonomous driving datasets produced at that time used modern 3D engines such as Unreal Engine or even specific game engines such as Grand Theft Auto V to generate their data. Here is a sample from the GTAV dataset by Richter et al. (2016):

    And here is a sample from the VEIS dataset by Saleh et al. (2018) who used Unity 3D:

For the purposes of synthetic data, researchers often had to modify CGI and graphics engines to suit machine learning requirements, especially to implement various kinds of labeling. For example, the work based on Grand Theft Auto V was recently continued by Hurl et al. (2019), who developed a precise LIDAR simulator within the GTA V engine and published the PreSIL (Precise Synthetic Image and LIDAR) dataset with over 50,000 frames with depth information, point clouds, semantic segmentation, and detailed annotations. Here is a sample of just a few of its modalities:

    Another interesting direction was pursued by Li et al. (2019) who developed the Augmented Autonomous Driving Simulation (AADS) environment. This synthetic data generator is able to insert synthetic traffic on real-life RGB images in a realistic way. With this approach, a single real image can be reused many times in different synthetic traffic situations. Here is a sample frame from the AADS introductory video that emphasizes that the cars are synthetic — otherwise you might well miss it!

    Fully synthetic datasets are also rapidly approaching photorealism. In particular, the Synscapes dataset by Wrenninge and Unger (2018) is quite easy to take for real at first glance; it simulates motion blur and many other properties of real photographs:

    Naturally, such photorealistic datasets cannot be rendered in full real time yet, so they usually represent collections of static images. Let us hope that cryptocurrency mining will leave at least a few modern GPUs to advance this research, and let us proceed to our next item.

    Industrial Simulated Environments

Imagine a robotic arm manipulating items on an assembly line. Controlling this arm certainly looks like a machine learning problem… but where will the dataset come from? There is no way to label millions of training data instances, especially once you realize that the function we are actually learning maps activations of the robot’s drives and controls to reactions of the environment.

    Consider, for instance, the Dactyl robot that OpenAI researchers recently taught to solve a Rubik’s cube in real life (OpenAI, 2018). It was trained by reinforcement learning, which means that the hand had to make millions, if not billions, of attempts at interacting with the environment. Naturally, this could be made possible only by a synthetic simulation environment, in this case ORRB (OpenAI Remote Rendering Backend) also developed by OpenAI (Chociej et al., 2019). For a robotic hand like Dactyl, ORRB can very efficiently render views of randomized environments that look something like this:

    And that’s not all: to be able to provide interactions that can serve as fodder for reinforcement learning, the environment also has to implement a reasonably realistic physics simulator. Robotic simulators are a venerable and well-established field; the two most famous and most popular engines are Gazebo, originally presented by Koenig and Howard (2004) and currently being developed by Open Source Robotics Foundation (OSRF), and MuJoCo (Multi-Joint Dynamics with Contact) developed by Todorov et al. (2012); for example, ORRB mentioned above can interface with MuJoCo to provide an all-around simulation.

This does not mean that new engines cannot arise for robotics. For example, Xie et al. (2019) presented VRGym, a virtual reality testbed for physical and interactive AI agents. Its main difference from previous work is the support for human input via VR hardware integration. The rendering and physics engine are based on Unreal Engine 4, and additional multi-sensor hardware is capable of full body sensing and integration of human subjects into virtual environments. Moreover, a special bridge allows easy communication with robotic hardware, providing support for ROS (Robot Operating System), a standard set of libraries for robotic control. Visually, it looks something like this:

    Here at Synthesis AI, we are already working in this direction. In a recent collaboration with Google Robotics called ClearGrasp (we covered it on the blog a year ago), we developed a large-scale dataset of transparent objects that present special challenges for computer vision: transparent objects are notoriously hard for object detection, depth estimation, and basically any computer vision task you can think of. With the help of this dataset, in the ClearGrasp project we developed machine learning models capable of estimating accurate 3D data of transparent objects from RGB-D images. The dataset looks something like this, and for more details I refer to our earlier post and to Google’s web page on the ClearGrasp project:

    There is already no doubt that end-to-end training of industrial robots is virtually impossible without synthetic environments. Still, I believe there is a lot to be done here, both specifically for robotics and generally for the best adaptation and usage of synthetic environments.

    Synthetic Documents and Media

    Finally, we come to the last item in the post: synthetic documents and media. Let us begin with optical character recognition (OCR), which mostly means reading the text written on a photo, although there are several different tasks related to text recognition: OCR itself, text detection, layout analysis and text line segmentation for document digitization, and others.

    The basic idea for synthetic data in OCR is simple: it is very easy to produce synthetic text, so why don’t we superimpose synthetic text on real images, or simply on varied randomized backgrounds (recall our discussion of domain randomization), and train on that? Virtually all modern OCR systems have been trained on data produced by some variation of this idea.
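As a toy illustration of this idea, here is a short Pillow sketch that draws random text on a background image and returns the corresponding ground-truth label for free. Real generators such as SynthText do much more (realistic fonts, rotation, perspective, depth-aware placement); the function and parameter choices below are purely illustrative, assuming a recent Pillow version with textbbox.

from PIL import Image, ImageDraw, ImageFont
import random

def render_synthetic_ocr_sample(background, text):
    # Superimpose text on a background image; the bounding box label comes for free.
    # Font, color, and position are randomized very naively here; real generators also
    # sample realistic fonts and sizes and place text on suitable image regions.
    img = background.copy()
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    x = random.randint(0, max(img.width - 200, 1))
    y = random.randint(0, max(img.height - 40, 1))
    color = tuple(random.randint(0, 255) for _ in range(3))
    draw.text((x, y), text, fill=color, font=font)
    bbox = draw.textbbox((x, y), text, font=font)
    return img, {"text": text, "bbox": bbox}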

    I will highlight the works where text is being superimposed in a “smarter”, more realistic way. For instance, Gupta et al. (2016) in their SynthText in the Wild dataset use depth estimation and segmentation models to find regions (planes) of a natural image suitable for placing synthetic text, and even find the correct rotation of text for a given plane. This process is illustrated in the top row of the sample image, and the bottom row shows sample text inserted onto suitable regions:

    Another interesting direction where such systems might go (but have not gone yet) is the generation of synthetic media that includes text but is not limited to it. Currently, advertising and rich media in general are at the frontier of multimodal approaches that blend together computer vision and natural language processing. Think about the rich symbolism that we see in many modern advertisements; sometimes understanding an ad borders on a lateral thinking puzzle. For instance, what does this image advertise?

    The answer is explicitly stated in the slogan that I’ve cut off: “Removes fast food stains fast”. You might have got it instantaneously, or you might have had to think for a little bit, but how well do you think automated computer vision models can pick up on this? This is a sample image from the dataset collected by researchers from the University of Pittsburgh. Understanding this kind of symbolism is very hard, and I can’t imagine generating synthetic data for this problem at this point.

    But perhaps easier questions would include at least detecting advertisements in our surroundings, detecting logos and brand names in the advertising, reading the slogans, and so on. Here, a synthetic dataset is not hard to imagine; at the very least, we could cut off ads from real photos and paste other ads in their place. But even solving this restricted problem would be a huge step forward for AR systems! If you are an advertiser, imagine that you can insert ads in augmented reality on the fly; and if you are just a regular human being imagine that you can block off real life ads or replace them with pleasing landscape photos as you go down the street. This kind of future might be just around the corner, and it might be helped immensely by synthetic data generation.

    Conclusion

    In this (rather long) post, I’ve tried to give a brief overview of the main directions where we envision the present and future of synthetic data for machine learning. Here at Synthesis AI, we are working to bring this vision of the future to reality. In the next post of this series, we will go into more detail about one of these use cases and showcase some of our work.

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Real-to-Synthetic Data: Driving Model Performance with Synthetic Data VI

    Real-to-Synthetic Data: Driving Model Performance with Synthetic Data VI

Today we continue the series on using synthetic data to improve machine learning models. This is the sixth part of the series (Part I, Part II, Part III, Part IV, Part V). In this (relatively) short interlude, I will discuss an interesting variation of GAN-based refinement: making synthetic data from real. Why would we ever want to do that if the final goal is always to make the model work on real data rather than synthetic? In this post, we will see two examples from different domains that show both why and how.

    It Turns Your Head: Synthetic Faces from Real Photos

    In this post, we discuss several works that generate synthetic data from real data by learning to transform real data with conditional GANs. The first application of this idea is to start from real data and produce other realistic images that have been artificially changed in some respects. This approach could either simply serve as a “smart augmentation” to extend the dataset (recall Part II of this series) or, more interestingly, could “fill in the holes” in the data distribution, obtaining synthetic data for situations that are lacking in the original dataset.

As the first example, let us consider Zhao et al. (2018a, 2018b), who concentrated on applying this idea to face recognition in the wild, with faces in various poses rather than only frontal images. They continued the work of Tran et al. (2017) (we do not review it here in detail) and Huang et al. (2017), who presented the TP-GAN (two-pathway GAN) architecture for frontal view synthesis: given a picture of a face, generate a frontal view picture.

    TP-GAN’s generator G has two pathways: a global network that rotates the entire face and four local patch networks that process local textures around four facial landmarks (eyes, nose, and mouth). Both pathways have encoder-decoder architectures with skip connections for multi-scale feature fusion:

    The discriminator D in TP-GAN, naturally, learns to distinguish real frontal face images from synthesized images. The synthesis loss function in TP-GAN is a sum of four loss functions:

    • pixel-wise L1-loss between the ground truth frontal image and rotated image;
    • symmetry loss intended to preserve the symmetry of human faces;
    • adversarial loss that goes through the discriminator, as usual;
    • and identity preserving loss based on perceptual losses, a popular idea in GANs designed to preserve high-level features when doing low-level transformations; in this case, it serves to preserve the person’s identity when rotating the face and brings together the features from several layers of a fixed face recognition network applied to the original and rotated images.

    I won’t bore you with formulas for TP-GAN, but the results were pretty good. Here they are compared to competitors existing in 2017 (the leftmost column shows the profile face to be rotated, second from left are the results of TP-GAN, the rightmost column are actual frontal images, and the rest are various other approaches):

    Zhao et al. (2018a) propose the DA-GAN (Dual-Agent GAN) model that also works with faces but in the opposite scenario: while TP-GAN rotates every face into the frontal view, DA-GAN rotates frontal faces to arbitrary angles. The motivation for this brings us to synthetic data: Zhao et al. noticed that existing datasets of human faces contain mostly frontal images, with a few standard angles also appearing in the distribution but not much in between. This could skew the results of model training and make face recognition models perform worse.

    Therefore, DA-GAN aims to create synthetic faces to fill in the “holes” in the real data distribution, rotating real faces so that the distribution of angles becomes more uniform. The idea is to go from the data distribution shown on the left (actual distribution from the IJB-A dataset) to something like shown on the right:

    They begin with a 3D morphable model, extracting 68 facial landmarks with the Recurrent Attentive-Refinement (RAR) model (Xiao et al., 2016) and estimating the transformation matrix with 3D-MM (Blanz et al., 2007). However, Zhao et al. report that simulation quality dramatically decreases for large yaw angles, which means that further improvement is needed. This is exactly where the DA-GAN framework comes in.

Again, DA-GAN’s generator maps a synthesized image to a refined one. It is trained on a linear combination of three loss functions:

    • adversarial loss that follows the BEGAN architecture (I won’t go into details here and recommend this post, this post, and the original paper for explanations of BEGAN and other standard adversarial loss functions);
    • the identity preservation loss similar to TP-GAN above; the idea is, again, to put both real and synthetic images through the same (relatively simple) face recognition network and bring features extracted from both images together;
    • pixel-wise loss intended to make sure that the pose (angle of inclination for the head) remains the same after refinement.

    Here is the general idea of DA-GAN:

Apart from experiments done by the authors, DA-GAN was verified in a large-scale NIST IJB-A competition, where a model based on DA-GAN won the face verification and face identification tracks. This example shows the general premise of using synthetic data for this kind of smart augmentation: augmenting the dataset and balancing out the training data distribution with synthetic images proved highly beneficial in this case.

    VR Goggles for Robots: Real-to-Sim Transfer in Robotics

    In the previous section, we have used real-to-synthetic transfer basically as a smart augmentation, enriching the real dataset that might be skewed with new synthetic data produced from real data points.

    The second example deals with a completely different idea of using the same transfer direction. Why do we need domain adaptation at all? Because we want models that have been trained on synthetic data to transfer to real inputs. The idea is to reverse this logic: let us transfer real data into the synthetic domain, where the model is already working great!

    In the context of robotics, this kind of real-to-sim approach was implemented by Zhang et al. (2018) in a very interesting approach called “VR-Goggles for Robots”. It is based on the CycleGAN ideas, a general approach to style transfer (and consequently domain adaptation) that we introduced in Part IV of this series.

    More precisely, Zhang et al. use the general framework of CyCADA (Hoffman et al., 2017), a popular domain adaptation model. CyCADA adds semantic consistency, feature-based adversarial, and task losses to the basic CycleGAN. The general idea looks as follows:

Note that CyCADA has two different GAN losses (adversarial losses), pixel-wise and feature-based (semantic), and a task loss that reflects the downstream task we are solving (semantic segmentation in this case). In the original work, CyCADA was applied to standard synthetic-to-real domain adaptation, although the model does not really distinguish between the two opposing transfer directions because it is a CycleGAN. Here are some CyCADA examples on standard synthetic and real outdoor datasets:

    Similar to CyCADA, the VR-Goggles model has two generators, real-to-synthetic and synthetic-to-real, and two discriminators, one for the synthetic domain and one for the real domain. The overall loss function consists of:

    • a standard adversarial GAN loss function for each discriminator;
• the semantic loss as introduced in CyCADA; the idea is that if we have ground truth labels for the synthetic domain (in this case, we are doing semantic segmentation), we can train a network on synthetic data and then use it to generate pseudolabels for the real domain where ground truth is not available; the semantic loss now makes sure that the results (segmentation maps) remain the same after image translation;
    • the shift loss that makes the image translation result invariant to shifts.

    Here is the general picture, which is actually very close to the CyCADA pipeline shown above:

    Zhang et al. report significantly improved results in robotic navigation tasks. For instance, in the image below a navigation policy was trained (in a synthetic simulation environment) to move towards chairs:

    The “No-Goggles” row shows results in the real world without domain adaptation, where the policy fails miserably. The “CycleGAN” row shows improved results after doing real-to-sim transfer with a basic CycleGAN model. Finally the “VR-Goggles” row shows successful navigation with the proposed real-to-sim transfer model.

    Why might this inverse real-to-sim direction be a good fit for robotics specifically, and why haven’t we seen this approach in other domains in the previous posts? The reason is that in the real-to-sim approach, we need to use domain adaptation models during inference, as part of using the trained model. In most regular computer vision applications, this would be a great hindrance. In computer vision, if some kind of preprocessing is only part of the training process it is usually assumed to be free (you can find some pretty complicated examples in Part II of this series), and inference time is very precious.

    Robotics is a very different setting: robots and controllers are often trained in simulation environments with reinforcement learning, which implies a lot of computational resources needed for training. The simulation environment needs to be responsive and cheap to support, and if every frame of the training needs to be translated via a GAN-based model it may add up to a huge cost that would make RL-based training infeasible. Adding an extra model during inference, on the other hand, may be okay: yes, we reduce the number of processed frames per second, but if it stays high enough for the robot to react in real time, that’s fine.

    Conclusion

    In this post, we have discussed the inverse direction of data refinement: instead of making synthetic images more realistic, we have seen approaches that make real images look like synthetic ones. We have seen two situations where this is useful: first, for “extremely smart” augmentations that fill in the holes in the real data distribution, and second, for robotics, where training is very computationally intensive, and it may actually be easier to modify the input at inference time.

    With this, I conclude the part on refinement, i.e., domain adaptation techniques that operate on data and modify the data from one domain to another. Next time, we will begin discussing model-based domain adaptation, that is, approaches that change the model itself and leave the data in place. Stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Synthetic-to-Real Refinement: Driving Model Performance with Synthetic Data V

    Synthetic-to-Real Refinement: Driving Model Performance with Synthetic Data V

    We continue the series on synthetic data as it is used in machine learning today. This is the fifth part of an already pretty long series (part 1, part 2, part 3, part 4), and it’s far from over, but I try to keep each post more or less self-contained. Today, however, we pick up from last time, so if you have not read Part 4 yet I suggest going through it first. In that post, we discussed synthetic-to-real refinement for gaze estimation, which suddenly taught us a lot about modern GAN-based architectures. But eye gaze still remains a relatively small and not very variable problem, so let’s see how well synthetic data does in other computer vision applications. Again, expect a lot of GANs and at least a few formulas for the loss functions.

    PixelDA: An Early Work in Refinement

    First of all, I have to constrain this post too. There are whole domains of applications where synthetic data is very often used for computer vision, such as, e.g., outdoor scene segmentation for autonomous driving. But this would require a separate discussion, one that I hope to get to in the future. Today I will show a few examples where refinement techniques work on standard “unconstrained” computer vision problems such as object detection and segmentation for common objects. Although, as we will see, most of these problems in fact turn out to be quite constrained.

    We begin with an early work in refinement, parallel to (Shrivastava et al., 2017), which was done by Google researchers Bousmalis et al. (2017). They train a GAN-based architecture for pixel-level domain adaptation, which they call PixelDA. In essence, PixelDA is a basic style transfer GAN, i.e., they train the model by alternating optimization steps

        \[\min_{\theta_G,\theta_T}\;\max_{\phi}\;\lambda_{1}\,\mathcal{L}_{\text{dom}}^{\text{PIX}}(D^{\text{PIX}},G^{\text{PIX}})+\lambda_{2}\,\mathcal{L}_{\text{task}}^{\text{PIX}}(G^{\text{PIX}},T^{\text{PIX}})+\lambda_{3}\,\mathcal{L}_{\text{cont}}^{\text{PIX}}(G^{\text{PIX}}),\]

    where:

    • the first term is the domain loss,

          \begin{align*}\mathcal{L}_{\text{dom}}^{\text{PIX}}(D^{\text{PIX}},G^{\text{PIX}}) =&\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}\!\bigl[\log\!\bigl(1-D^{\text{PIX}}\!\bigl(G^{\text{PIX}}(\mathbf{x}_{S};\theta_{G});\phi\bigr)\bigr)\bigr]\\ +&\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}\!\bigl[\log D^{\text{PIX}}(\mathbf{x}_{T};\phi)\bigr];\end{align*}

    • the second term is the task-specific loss, which in (Bousmalis et al., 2017) was the image classification cross-entropy loss provided by a classifier T that was also trained as part of the model:

          \begin{align*}\mathcal{L}_{\text{task}}^{\text{PIX}}&(G^{\text{PIX}},T^{\text{PIX}}) = \\ &\mathbb{E}_{\mathbf{x}_{S},\mathbf{y}_{S}\sim p_{\text{syn}}} \!\bigl[-\mathbf{y}_{S}^{\top}\!\log T^{\text{PIX}}\!\bigl(G^{\text{PIX}}(\mathbf{x}_{S};\theta_{G});\theta_{T}\bigr) -\mathbf{y}_{S}^{\top}\!\log T^{\text{PIX}}(\mathbf{x}_{S};\theta_{T})\bigr],\end{align*}

    • and the third term is the content similarity loss, intended to make the generator preserve the foreground objects (that would later need to be classified) with a mean squared error applied to their masks:

          \begin{align*}\mathcal{L}_{\text{cont}}^{\text{PIX}}(G^{\text{PIX}}) =\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}\!\Bigl[& \frac{1}{k}\,\lVert(\mathbf{x}_{S}-G^{\text{PIX}}(\mathbf{x}_{S};\theta_{G}))\odot m(\mathbf{x})\rVert_{2}^{2}\\ &-\frac{1}{k^{2}}\bigl((\mathbf{x}_{S}-G^{\text{PIX}}(\mathbf{x}_{S};\theta_{G}))^{\top}m(\mathbf{x})\bigr)^{2} \Bigr],\end{align*}

      where m(\mathbf{x}) is a segmentation mask for the foreground object extracted from the synthetic data renderer and k is the number of pixels; note that this loss does not “insist” on preserving pixel values in the object but rather encourages the model to change object pixels in a consistent way, preserving their pairwise differences.
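    As a concrete illustration, here is a minimal PyTorch-style sketch of this masked, mean-invariant content loss under my reading of the formula above; the tensor shapes and names are my own assumptions rather than the authors’ code:

    ```python
    import torch

    def content_similarity_loss(x_syn, x_refined, mask):
        """Sketch of a PixelDA-style content loss: changes to foreground pixels
        are penalized, but a constant shift of all foreground pixels is free.

        x_syn, x_refined -- (B, C, H, W) synthetic input and refined output
        mask             -- (B, 1, H, W) binary foreground mask from the renderer
        """
        diff = (x_syn - x_refined) * mask                       # masked difference
        # number of masked values per image (mask broadcast over channels)
        k = mask.expand_as(diff).sum(dim=(1, 2, 3)).clamp(min=1.0)
        sq_term = diff.pow(2).sum(dim=(1, 2, 3)) / k            # (1/k) ||d||^2
        mean_term = diff.sum(dim=(1, 2, 3)).pow(2) / k.pow(2)   # (1/k^2) (1^T d)^2
        return (sq_term - mean_term).mean()
    ```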

    Bousmalis et al. applied this GAN to the Synthetic Cropped LineMod dataset, a synthetic version of a small object classification dataset, doing both classification and pose estimation for the objects. The images in this dataset are quite cluttered and complex, but small in terms of pixel size:

    The generator accepts as input a synthetic image of a 3D model of a corresponding object in a random pose together with the corresponding depth map and tries to output a realistic image in a cluttered environment while leaving the object itself in place. Note that the segmentation mask for the central object is also given by the synthetic model. The discriminator also looks at the depth map when it distinguishes between reals and fakes:

    Here are some sample results of PixelDA for the same object classes as above but in different poses and with different depth maps:

    Hardly the images that you would sell to a stock photo site, but that’s not the point. The point is to improve the classification and object pose estimation quality after training on refined synthetic images. And indeed, Bousmalis et al. reported improved results in both metrics compared to both training on purely synthetic data (for many tasks, this version fails entirely) and a number of previous approaches to domain adaptation.

    But these are still rather small images. Can we make synthetic-to-real refinement work on a larger scale? Let’s find out.

    CycleGAN for Synthetic-to-Real Refinement: GeneSIS-RT

    In the previous post, we discussed the general CycleGAN idea and structure: if you want to do something like style transfer, but don’t have a paired dataset where the same content is depicted in two different styles, you can close the loop by training two generators at once. This is a very natural setting for synthetic-to-real domain adaptation, so many modern approaches to synthetic data refinement include the ideas of CycleGAN.

    Probably the most direct application is the GeneSIS-RT framework by Stein and Roy (2017) that refines synthetic data directly with a CycleGAN trained on unpaired datasets of synthetic and real images. Their basic pipeline, shown in the picture below, sums up the straightforward approach to synthetic-to-real refinement perfectly:

    Their results are pretty typical for the basic CycleGAN: some of the straight lines can become wiggly, and the textures have artifacts, but generally the images definitely become closer to reality:

    But, again, picture quality is not the main point here. Stein and Roy show that a training set produced by image-to-image translation learned by CycleGAN improves the results of training machine learning systems for real-world tasks such as obstacle avoidance and semantic segmentation.

    Here are some sample segmentation results that compare a DeepLab-v2 segmentation network trained on synthetic data and on the same synthetic data refined by GeneSIS-RT; the improvement is quite clear:

    CycleGAN Evolved: T2Net

    As an example of a more involved application, let’s consider T2Net by Zheng et al. (2018) who apply synthetic-to-real refinement to improve depth estimation from a single image. By the way, if you google their paper in 2021, as I just did, do not confuse it with T2-Net by Zhang et al. (2020) (yes, a one symbol difference in both first author and model names!), a completely different deep learning model for turbulence forecasting…

    T2Net also uses the general ideas of CycleGAN with a translation network (generator) that makes images more realistic. The new idea here is that T2Net asks the synthetic-to-real generator G not only to translate one specific domain (synthetic data) to another (real data) but also to work across a number of different input domains, making the input image “more realistic” in every case. Here is the general architecture:

    In essence, this means that G aims to learn the minimal transformation necessary to make an image realistic. In particular, it should not change real images much. In total, T2Net has the following loss function for the generator (hope you are getting used to these):

        \[\mathcal{L}^{T2}=\mathcal{L}_{\text{GAN}}^{T2}(G_{S}^{T2},D_{T}^{T2})+\lambda_{1}\mathcal{L}_{\text{GAN}_f}^{T2}(f_{\text{task}}^{T2},D_{f}^{T2})+\lambda_{2}\mathcal{L}_{r}^{T2}(G_{S}^{T2})+\lambda_{3}\mathcal{L}_{t}^{T2}(f_{\text{task}}^{T2})+\lambda_{4}\mathcal{L}_{s}^{T2}(f_{\text{task}}^{T2}),\]

    where

    • the first term is the usual GAN loss for synthetic-to-real transfer with a discriminator DT:

          \begin{align*}\mathcal{L}_{\text{GAN}}^{T2}(G_{S}^{T2},D_{T}^{T2}) =&\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}   \bigl[\log(1-D_{T}^{T2}(G_{S}^{T2}(\mathbf{x}_{S})))\bigr]\\ +&\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}   \bigl[\log D_{T}^{T2}(\mathbf{x}_{T})\bigr];\end{align*}

    • the second term is the feature-level GAN loss for features extracted from translated and real images, with a different discriminator Df:

          \begin{align*}\mathcal{L}_{\text{GAN}_f}^{T2}(f_{\text{task}}^{T2},D_{f}^{T2}) &=\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}   \bigl[\log D_{f}^{T2}(f_{\text{task}}^{T2}(G_{S}^{T2}(\mathbf{x}_{S})))\bigr]\\ &+\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}   \bigl[\log(1-D_{f}^{T2}(f_{\text{task}}^{T2}(\mathbf{x}_{T})))\bigr]; \end{align*}

    • the third term L_r is the reconstruction loss for real images, simply an L_1 norm that says that T2Net is not supposed to change real images at all;
    • the fourth term L_t is the task loss for depth estimation on synthetic images, namely the L_1-norm of the difference between the predicted depth map for a translated synthetic image and the original ground truth synthetic depth map; this loss ensures that translation does not change the depth map;
    • finally, the fifth term L_s is the task loss for depth estimation on real images, where ground truth depth is not available:

          \[\mathcal{L}_{s}^{T2}(f_{\text{task}}^{T2}) = \bigl|\partial_{x} f_{\text{task}}^{T2}(\mathbf{x}_{T})\bigr|\,e^{-|\partial_{x}\mathbf{x}_{T}|}   + \bigl|\partial_{y} f_{\text{task}}^{T2}(\mathbf{x}_{T})\bigr|\,e^{-|\partial_{y}\mathbf{x}_{T}|},\]

      that is, a sum of depth map gradients that are weighted down where the input image itself has strong gradients; since ground truth depth maps are not available here, this regularizer is a locally smooth loss intended to respect object boundaries, a common tool in depth estimation models that we won’t go into too much detail about.
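    A minimal PyTorch-style sketch of this kind of edge-aware smoothness regularizer might look as follows; this is the standard formulation used in unsupervised depth estimation, and the names are mine rather than T2Net’s:

    ```python
    import torch

    def smoothness_loss(depth, image):
        """Edge-aware smoothness: penalize depth gradients, but less so where the
        image itself has strong gradients (likely object boundaries).

        depth -- (B, 1, H, W) predicted depth map
        image -- (B, 3, H, W) corresponding input image
        """
        d_dx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
        d_dy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
        i_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
        i_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
        return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
    ```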

    Zheng et al. show that T2Net can produce realistic images from synthetic ones, even for quite varied domains such as house interiors from the SUNCG synthetic dataset:

    But again, the most important conclusions deal with the depth estimation task. Zheng et al. conclude that end-to-end training of the translation network and depth estimation network is preferable to training them separately. They show that T2Net can achieve good results for depth estimation with no access to real paired data, even outperforming some (but not all) supervised approaches.

    Synthetic Data Refinement for Vending Machines

    This is already getting to be quite a long read, so let me wrap up with just one more example that will bring us to 2019. Wang et al. (2019) consider synthetic data generation and domain adaptation for object detection in smart vending machines. Actually, we have already discussed their problem setting and synthetic data in a previous post, titled “What’s in the Fridge?”. So please see that post for a refresher on their synthetic data generation pipeline, and today we will concentrate specifically on their domain adaptation approach.

    Wang et al. refine rendered images with virtual-to-real style transfer done by a CycleGAN-based architecture. The novelty here is that Wang et al. separate foreground and background losses, arguing that style transfer needed for foreground objects is very different from (much stronger than) the style transfer for backgrounds. So their overall architecture is even more involved than in previous examples; here is what it looks like:

    The overall generator loss function is also a bit different:

        \begin{multline*}\mathcal{L}^{OD}=\mathcal{L}_{\text{GAN}}^{OD}(G^{OD},D_{T}^{OD},\mathbf{x}_{S},\mathbf{x}_{T})+\mathcal{L}_{\text{GAN}}^{OD}(F^{OD},D_{S}^{OD},\mathbf{x}_{T},\mathbf{x}_{S})+\\ +\lambda_{1}\mathcal{L}_{\text{cyc}}^{OD}(G^{OD},F^{OD})+\lambda_{2}\mathcal{L}_{\text{bg}}^{OD}+\lambda_{3}\mathcal{L}_{\text{fg}}^{OD},\end{multline*}

    where:

    • LGAN(G, D, X, Y) is the standard adversarial loss for generator G mapping from domain X to domain Y and discriminator D distinguishing real images from fake ones in domain Y;
    • Lcyc(G, F) is the cycle consistency loss as used in CycleGAN and as we have already discussed several times;
    • Lbg is the background loss, which is the cycle consistency loss computed only for the background part of the images as defined by the mask mbg:

          \begin{align*}\mathcal{L}_{\text{bg}}^{OD} =&\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}   \Bigl\|     \bigl(G^{OD}\!\bigl(F^{OD}(\mathbf{x}_{T})\bigr)-\mathbf{x}_{T}\bigr)     \odot m_{\text{bg}}(\mathbf{x}_{T})   \Bigr\|_{2}\\ +&\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}   \Bigl\|     \bigl(F^{OD}\!\bigl(G^{OD}(\mathbf{x}_{S})\bigr)-\mathbf{x}_{S}\bigr)     \odot m_{\text{bg}}(\mathbf{x}_{S})   \Bigr\|_{2};\end{align*}

    • Lfg is the foreground loss, similar to Lbg but computed only for the hue channel in the HSV color space (the authors argue that color and profile are the most critical for recognition and thus need to be preserved the most):

          \begin{align*}\mathcal{L}_{\text{fg}}^{OD} =&\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}   \Bigl\|     \bigl(G^{OD}\!\bigl(F^{OD}(\mathbf{x}_{T})\bigr)^{H}-\mathbf{x}_{T}^{H}\bigr)     \odot m_{\text{fg}}(\mathbf{x}_{T})   \Bigr\|_{2}\\ +&\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}   \Bigl\|     \bigl(F^{OD}\!\bigl(G^{OD}(\mathbf{x}_{S})\bigr)^{H}-\mathbf{x}_{S}^{H}\bigr)     \odot m_{\text{fg}}(\mathbf{x}_{S})   \Bigr\|_{2}. \end{align*}

    Segmentation into foreground and background is done automatically in synthetic data and is made easy in this case for real data since the camera position is fixed, and the authors can collect a dataset of real background templates from the vending machines they used in the experiments and then simply subtract the backgrounds to get the foreground part.
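    As a rough sketch, the background part of this loss could be implemented as a masked cycle-consistency term like the one below; the foreground loss would be the same computation applied to the hue channel (after an RGB-to-HSV conversion) with the foreground mask instead. All names here are mine, and the full architecture has more components:

    ```python
    import torch

    def masked_cycle_loss(x, x_cycled, mask):
        """Cycle-consistency restricted to the region given by a binary mask.

        x        -- (B, C, H, W) original image (real or synthetic)
        x_cycled -- (B, C, H, W) the same image after a full cycle, e.g. F(G(x))
        mask     -- (B, 1, H, W) background (or foreground) mask
        """
        diff = (x_cycled - x) * mask
        return diff.pow(2).sum(dim=(1, 2, 3)).sqrt().mean()  # L2 norm per image

    def background_loss(G, F, x_syn, x_real, m_bg_syn, m_bg_real):
        """Background loss over both cycle directions, as in the formula above.

        G: synthetic-to-real generator, F: real-to-synthetic generator.
        """
        return (masked_cycle_loss(x_real, G(F(x_real)), m_bg_real)
                + masked_cycle_loss(x_syn, F(G(x_syn)), m_bg_syn))
    ```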

    Here are some sample results of their domain adaptation architecture, with original synthetic images on the left and refined results on the right:

    As a result, Wang et al. report significantly improved results when using hybrid datasets of real and synthetic data for all three tested object detection architectures: PVANET, SSD, and YOLOv3. Even more importantly, they report a comparison between basic and refined synthetic data with clear gains achieved by refinement across all architectures.

    Conclusion

    By this point, you are probably already pretty tired of CycleGAN variations. Naturally, there are plenty more examples of this kind of synthetic-to-real style transfer in the literature; I just picked a few to illustrate the general ideas and show how they can be applied to specific use cases.

    I hope the last couple of posts have managed to convince you that synthetic-to-real refinement is a valid approach that can improve performance on the end task even if the actual refined images do not look all that realistic to humans: some of the examples above look pretty bad, but training on them still helps downstream models.

    Next time, we will discuss an interesting variation of this idea: what if we reverse the process and try to translate real data into synthetic? And why would anyone want to do such a thing if we are always interested in solving downstream tasks on real data rather than synthetic?.. The answers to these questions will have to wait until the next post. See you!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Gaze Estimation and GANs: Driving Model Performance with Synthetic Data IV

    Gaze Estimation and GANs: Driving Model Performance with Synthetic Data IV

    With the Christmas and New Year holidays behind us, let’s continue our series on how to improve the performance of machine learning models with synthetic data. Last time, I gave a brief introduction into domain adaptation, distinguishing between its two main variations: refinement, where synthetic images are themselves changed before they are fed into model training, and model-based domain adaptation, where the training process changes to adapt to training on different domains. Today, we begin with refinement for the same special case of eye gaze estimation that kickstarted synthetic data refinement a few years ago and still remains an important success story for this approach, but then continue and extend the story of refinement to other computer vision problems. Today’s post will be more in-depth than before, so buckle up and get ready for some GANs!

    SimGAN in Detail

    Let us begin with a quick reminder. As we discussed last time, Shrivastava et al. (2017) was one of the first approaches that successfully improved a real life model by refining synthetic images and feeding them to a relatively straightforward deep learning model. They took a large-scale synthetic dataset of human eyes created by Wood et al. (2016) and learned a transformation from synthetic images (“Source domain” in the illustration below) to real data (“Target domain” in the illustration below). To do that, they utilized a relatively straightforward SimGAN architecture; we saw it last time in a high-level description, but today, let’s dive a little deeper into the details.

    Let me first draw a slightly more detailed picture for you:

    In the figure above, black arrows denote the data flow and green arrows show the gradient flow between SimGAN’s components. SimGAN consists of a generator (refiner) that translates source domain data into “fake” target domain data and a discriminator that tries to distinguish between “fake” and real target domain images.

    The figure also introduces some notation: it shows that the overall loss function for the generator (refiner) in SimGAN consists of two components:

    • the realism loss (green block on the right), usually called the adversarial loss in GANs, makes the generator fool the discriminator into thinking that the refined image is realistic; in a basic GAN such as SimGAN, this was simply the exact opposite of the binary cross-entropy loss for classification that the discriminator trains on;
    • the regularization loss (green block on the bottom) makes the refinement process care for the actual target variable; it is supposed to make refinement preserve the eye gaze by preserving the result of some function ψ.

    I will indulge myself with a little bit of formulas to make the above discussion more specific (trust me, you are very lucky that I can’t install a LaTeX plugin on this blog and have to insert formulas as pictures — otherwise this blog would be teeming with them). Here is the resulting loss function for SimGAN’s generator:

        \begin{align*}\mathcal{L}^{\text{REF}}_{G}(\theta)  &= \mathbb{E}_{S}\Bigl[ \mathcal{L}^{\text{REF}}_{\text{real}}\!\bigl(\theta;\mathbf{x}_{S}\bigr)        + \lambda\,      \mathcal{L}^{\text{REF}}_{\text{reg}}\!\bigl(\theta;\mathbf{x}_{S}\bigr)     \Bigr],  \quad\text{where} \\[6pt]\mathcal{L}^{\text{REF}}_{\text{real}}\!\bigl(\theta;\mathbf{x}_{S}\bigr)  &= -\,\log\!\Bigl(1 - D^{\text{REF}}_{\phi}\bigl(G^{\text{REF}}_{\theta}(\mathbf{x}_{S})\bigr)\Bigr), \\[6pt]\mathcal{L}^{\text{REF}}_{\text{reg}}\!\bigl(\theta;\mathbf{x}_{S}\bigr)  &= \bigl\lVert        \psi\!\bigl(G^{\text{REF}}_{\theta}(\mathbf{x}_{S})\bigr)        - \psi(\mathbf{x}_{S})     \bigr\rVert_{1}\;.\end{align*}

    where \psi is a mapping into some kind of feature space. The feature space can contain the image itself, image derivatives, statistics of color channels, or features produced by a fixed extractor such as a pretrained CNN. But in the case of SimGAN it was… in most cases, just an identity map. That is, the regularization loss simply told the generator to change as little as possible while still making the image realistic.
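    With ψ taken to be the identity map, the refiner’s objective boils down to something like the following PyTorch-style sketch; the module names, the probability convention for the discriminator, and the weight value are my own assumptions:

    ```python
    import torch

    def simgan_refiner_loss(refiner, discriminator, x_syn, lam=0.01):
        """Sketch of a SimGAN-style refiner loss: fool the discriminator while
        changing the synthetic image as little as possible (psi = identity).

        refiner       -- maps synthetic images to refined images
        discriminator -- outputs the probability that an image is refined (fake)
        x_syn         -- (B, C, H, W) batch of synthetic eye images
        lam           -- weight of the self-regularization term (illustrative)
        """
        refined = refiner(x_syn)
        p_fake = discriminator(refined)
        realism_loss = -torch.log(1.0 - p_fake + 1e-8).mean()  # -log(1 - D(G(x)))
        reg_loss = (refined - x_syn).abs().mean()               # L1, psi = identity
        return realism_loss + lam * reg_loss
    ```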

    SimGAN significantly improved gaze estimation over the state of the art. But it was a rather straightforward and simplistic GAN even for 2017. Since their inception, generative adversarial networks have evolved quite a bit, with several interesting ideas defining modern adversarial architectures. Fortunately, not only have they been applied to synthetic-to-real refinement, but we don’t even have to deviate from the gaze estimation example to see quite a few of them!

    GazeGAN, Part I: Synthetic-to-Real Transfer with CycleGAN

    Meet GazeGAN, an architecture also developed in 2017, a few months later, by Sela et al. (2017). It also does synthetic-to-real refinement for gaze estimation, just like SimGAN. But the architecture, once you lay it out in a figure, looks much more daunting:

    Let’s take it one step at a time.

    First of all, the overall structure. As you can see, GazeGAN has two different refiners, F and G, and two discriminators, one for the source domain and one for the target domain. What’s going on here?

    In this structure, GazeGAN implements the idea of CycleGAN introduced by Zhu et al. (2017). The problem CycleGAN was solving was unpaired style transfer. In general, synthetic-to-real refinement is a special case of style transfer: we need to translate images from one domain to another. In a more general context, similar problems could include artistic style transfer (draw a Monet landscape from a photo), drawing maps from satellite images, coloring an old photo, and many more.

    In GAN-based style transfer, first introduced by the pix2pix model (Isola et al., 2016), you can have a very straightforward architecture where the generator does the transfer and the discriminator tries to tell apart fake pictures from real pictures. The main problem is how to capture the fact that the translated picture should be similar to the one the generator received as input. Formally speaking, it is perfectly legal for the generator to just memorize a few Monet paintings and output them for every input unless we do something about it. SimGAN fixed this via a regularization loss that simply told the generator to “change as little as possible”, but this is not quite what’s needed and doesn’t usually work.

    In the pix2pix model, style transfer is done with a conditional GAN. This means that both the generator and discriminator see the input picture from the source domain, and the discriminator checks both realism and the fact that the target domain image matches the source domain one. Here is an illustration (Isola et al., 2016):

    This approach actually can be made to work very well; here is a sample result from a later version of this model called pix2pixhd (Wang et al., 2018), where the model is synthesizing a realistic photo from a segmentation map (not the other way around!):

    But the pix2pix approach does not always apply. For training, this approach requires a paired dataset for style transfer, where images from the source and target domain match each other. It’s not a problem for segmentation maps, but, e.g., for artistic style transfer it would be impossible: Monet only painted some specific landscapes, and we can’t make a perfectly matching photo today.

    Enter CycleGAN, a model that solves this problem with a very interesting idea. We need a paired dataset because we don’t know how to capture the idea that the translated image should inherit content from the original. What should be the loss function that says that this image shows the same landscape as a given Monet painting but in photographic form?..

    But imagine that we also have an inverse transformation. Then we would be able to make a photo out of a Monet painting, and then make it back into a Monet — which means that now it has to match the original exactly, and we can use some kind of simple pixel-wise loss to make them match! This is precisely the cycle that CycleGAN refers to. Here is an illustration from (Zhu et al., 2017):

    Now the cycle consistency loss that ensures that G(F(\mathbf{x}))=\mathbf{x} can be a simple L_2 or L_1 pixel-wise loss.

    If you have a paired dataset, it would virtually always be better to use an architecture such as pix2pix that makes use of this data, but CycleGAN works quite well for unpaired cases. Here are some examples for the Monet-to-photo direction from the original CycleGAN:

    In GazeGAN, the cycle is, as usual, implemented by two generators and two discriminators, one for the source domain and one for the target domain. But the cycle consistency loss consists of two parts:

    • first, a simple pixel-wise L1 loss, just like we discussed above:

          \begin{align*}\mathcal{L}_{\text{Cyc}}^{Gz}\!\bigl(G^{Gz},F^{Gz}\bigr)  =& \mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}      \!\Bigl[\bigl\|F^{Gz}\!\bigl(G^{Gz}(\mathbf{x}_{S})\bigr)-\mathbf{x}_{S}\bigr\|_{1}\Bigr]\\    +&\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}      \!\Bigl[\bigl\|G^{Gz}\!\bigl(F^{Gz}(\mathbf{x}_{T})\bigr)-\mathbf{x}_{T}\bigr\|_{1}\Bigr];\end{align*}

    • but second, we also need a special gaze cycle consistency loss to preserve the gaze direction (so that the target variable can be transferred with no change); for this, Sela et al. train a separate gaze estimation network E designed to overfit to the source domain as much as possible and predict the gaze very accurately on synthetic data; the loss makes sure E still works after going through a cycle:

        \[\mathcal{L}_{\text{GazeCyc}}^{Gz}\bigl(G^{Gz},F^{Gz}\bigr) = \mathbb{E}_{\mathbf{x}_S\sim p_{\text{syn}}}\left[\left\| E^{Gz}\left(F^{Gz}(G^{Gz}(\mathbf{x}_S))\right) - E^{Gz}(\mathbf{x}_S)\right\|_2^2\right].\]

    GazeGAN, Part II: LSGAN, Label Smoothing, and More

    Apart from the overall CycleGAN structure, the GazeGAN model also used quite a few novelties that had been absent in SimGAN but had already become instrumental in GAN-based architectures by the end of 2017. Let’s discuss at least a couple of those.

    First, the adversarial loss function. As I mentioned above, SimGAN used the most basic adversarial loss: binary cross-entropy, which is the natural loss function for classification. However, this loss function has quite a few undesirable properties that make it hard to train GANs with. Since the original GANs were introduced in 2014, a lot of different adversarial losses have been developed, but probably the two most prominent and most often used are Wasserstein GANs (Arjovsky et al., 2017) and LSGAN (Least Squares GAN; Mao et al., 2016).

    I hope I will have a reason to discuss Wasserstein GANs in the future — it’s a very interesting idea that sheds a lot of light on the training of GANs and machine learning in general. But GazeGAN used the LSGAN loss function. As the title suggests, LSGAN uses the least squares loss function instead of binary cross-entropy. In the general case, it looks like

        \begin{align*}\min_{D}\; V_{\text{LSGAN}}(D)&= \tfrac12\,\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\bigl[(D(\mathbf{x}) - b)^{2}\bigr]+ \tfrac12\,\mathbb{E}_{\mathbf{z}\sim p_{z}}\bigl[(D(G(\mathbf{z})) - a)^{2}\bigr], \\\min_{G}\; V_{\text{LSGAN}}(G)&= \tfrac12\,\mathbb{E}_{\mathbf{z}\sim p_{z}}\bigl[(D(G(\mathbf{z})) - c)^{2}\bigr],\end{align*}

    which means that the discriminator is trying to output some constant b on real images and some other constant a on fake images, while the generator is trying to convince the discriminator to output c on fake images (it has no control over the real ones). Naturally, usually one takes a=0 and b=c=1, although there is an interesting theoretical result about the case when b-c=1 and b-a=2, that is, when the generator is trying to make the discriminator maximally unsure about the fake images.

    In general, trying to learn a classifier with the least squares loss is about as wrong as you can be in machine learning: this loss function becomes larger as the classifier becomes more sure in the correct answer! But for GANs, the saturation of the logistic sigmoid in binary cross-entropy is a much more serious problem. Even more than that, GazeGAN uses label smoothing on top of the LSGAN loss: while the discriminator aims to output 1 on real examples and 0 on refined synthetic images, the generator smoothes its target to 0.9, getting the loss function

        \[\mathcal{L}_{\text{LSGAN}}^{Gz}(G, D, S, R) = \mathbb{E}_{\mathbf{x}_S \sim p_{\text{syn}}}\Bigl[\,\bigl(D\bigl(G(\mathbf{x}_S)\bigr) - 0.9\bigr)^{2}\Bigr] + \mathbb{E}_{\mathbf{x}_T \sim p_{\text{real}}}\Bigl[\,D(\mathbf{x}_T)^{2}\Bigr];\]

    this loss is applied in both CycleGAN directions, synthetic-to-real and real-to-synthetic. Label smoothing helps the generator to avoid overfitting to intermediate versions of the discriminator (recall that GANs are trained by alternating between training the generator and the discriminator, and there is no way to train them separately because they need each other for training data).
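    As a sketch, here is how a label-smoothed least-squares objective is typically split between the discriminator and generator updates; the 0.9 target follows the description above, while the exact way GazeGAN combines the terms differs slightly, and all names here are mine:

    ```python
    import torch

    def lsgan_discriminator_loss(D, real, fake):
        """Discriminator: output 1 on real images, 0 on refined synthetic ones."""
        return ((D(real) - 1.0).pow(2).mean()
                + D(fake.detach()).pow(2).mean())

    def lsgan_generator_loss(D, fake, smooth_target=0.9):
        """Generator: push the discriminator's output on refined images towards
        a label-smoothed target of 0.9 instead of 1."""
        return (D(fake) - smooth_target).pow(2).mean()
    ```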

    And that’s it! With these loss functions, GazeGAN is able to create highly realistic images of eyes from synthetic images rendered by a Unity-based 3D modeling engine. Here are some samples by Sela et al. (2017):

    Note how this model works not only for the eye itself but also for the surrounding facial features, “filling in” even those parts of the synthetic image that were not there.

    Conclusion

    Today, we have discussed the gaze estimation problem in detail, taking this opportunity to talk about several important ideas in generative adversarial networks. Synthetic-to-real refinement has proven its worth with this example. But, as I already mentioned in the previous post, gaze estimation is also a relatively easy example: the synthetic images of eyes that needed refining for Shrivastava et al. were only 30×60 pixels in size!

    GazeGAN takes the next step: it operates not on the 30×60 grayscale images but on 128×128 color images, and GazeGAN actually refines not only the eye itself but parts of the image (e.g., nose and hair) that were not part of the 3D model of the eye.

    But these are still relatively small images and a relatively simple task, at least one with low variability in the data. Next time, we will see how well synthetic-to-real refinement works for other applications.

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Domain Adaptation Overview: Driving Model Performance with Synthetic Data III

    Domain Adaptation Overview: Driving Model Performance with Synthetic Data III

    Today, I continue the series about different ways of improving model performance with synthetic data. We have already discussed simple augmentations in the first post and “smart” augmentations that make more complex transformations of the input in the second. Today we go on to the next sub-topic: domain adaptation. We will stay with domain adaptation for a while, and in the first post on this topic I would like to present a general overview of the field and introduce the most basic approaches to domain adaptation.

    Refinement and Model-Based Domain Adaptation

    In previous posts, we have discussed augmentations, transformations that can be used to extend the training set. In the context of synthetic data (we are in the Synthesis AI blog, after all), this means that synthetic data can be used to augment real datasets of insufficient size, and an important part of using synthetic data would be to augment the heck out of it so that the model would generalize as well as possible. This is the idea of domain randomization, a very important part of using synthetic data for machine learning and a part that we will definitely return to in future posts.

    But the use of synthetic data can be made much more efficient than just training on it. Domain adaptation is a set of techniques designed to make a model trained on one domain of data, the source domain, work well on a different, target domain. The problem we are trying to solve is called transfer learning, i.e., transferring the knowledge learned on source tasks into an improvement in performance on a different target task, as shown in this illustration from (Pan, Yang, 2009):

    This is a natural fit for synthetic data: in almost all applications, we would like to train the model in the source domain of synthetic data but then apply the results in the target domain of real data. Here is an illustration of the three approaches to this kind of transfer in the context of robotics (source):

    Here we see the difference:

    • system identification simply hopes that simulation is well-calibrated and matches reality sufficiently well,
    • domain randomization tries to make the synthetic distribution so wide that a trained model will be robust enough to generalize to real data, and
    • domain adaptation makes changes, either in the model or in the datasets.

    In this series of posts, we will give a survey of domain adaptation approaches that have been used for such synthetic-to-real adaptation. We broadly divide the methods outlined in this series into two groups. Approaches from the first group operate on the data level, producing “refined” synthetic data that works better for training models to be applied to real data, while approaches from the second group operate directly on the model, its feature space or training procedure, leaving the data itself unchanged.

    Let us now discuss these two options in more detail.

    Synthetic Data Refinement

    The first group of approaches for synthetic-to-real domain adaptation work with the data itself. In this approach, we try to develop models that can take a synthetic image as input and “refine” it, making it better for subsequent model training. Note that while in most works we discuss here the objective is basically to make synthetic data more realistic (for example, in GAN-based models it is the direct objective: the discriminators should not be able to distinguish refined synthetic data from real samples), this does not necessarily have to be the case. Some early works on synthetic data even concluded that synthetic images may work better if they are less realistic, resulting in better generalization of the models. But generally speaking, realism is the goal.

    Today I will use as an example the very first work that kickstarted synthetic-to-real refinement back in 2016, and we will discuss what’s happened since then later, in a separate post. This example, one of the first successful models with straightforward synthetic-to-real refinement, was given by Apple researchers Shrivastava et al. (2017).

    The underlying problem here is gaze estimation: recognizing the direction where a human eye is looking. Gaze estimation methods are usually divided into model-based, which model the geometric structure of the eye and adjacent regions, and appearance-based, which use the eye image directly as input; naturally, synthetic data is made and refined for appearance-based gaze estimation.

    Before Shrivastava et al., this problem had already been tackled with synthetic data. In particular, Wood et al. (2016) presented a large dataset of realistic renderings of human eyes and showed improvements on real test sets over previous work done with the MPIIgaze dataset of real labeled images. Here is what their synthetic dataset looks like:

    The usual increase in scale (synthetic images are almost free after the initial investment is made) is manifested here as an increase in variability: MPIIgaze contains about 214K images, and the synthetic training set was only about 1M images, but all images in MPIIgaze come from the same 15 participants of the experiment, while the UnityEyes system developed by Wood et al. can render every image in a different randomized environment, which makes the model much more robust.

    The refinement here is to make these synthetic images even more realistic. Shrivastava et al. present a GAN-based system trained to improve synthesized images of the eyes. They call this idea Simulated+Unsupervised learning:

    They learn a transformation implemented with a Refiner network with the SimGAN adversarial architecture:

    SimGAN is a relatively straightforward GAN. It consists of a generator (refiner) and a discriminator, as shown above. The discriminator learns to distinguish between real and refined images with a standard binary classification loss function. The generator, in turn, is trained with a combination of the adversarial loss that makes it learn to fool the discriminator and regularization loss that captures the similarity between the refined image and the original one in order to preserve the target variable (gaze direction).

    As a result, Shrivastava et al. were able to significantly improve upon previous results. But the gaze estimation problem is in many ways a natural and simple candidate for such an approach. It is especially telling that the images they generate and refine are merely 30×60 pixels in size: even the GANs that existed in 2017 were able to work quite well on this kind of output dimension. In a later post, we will see how image-based refinement works today and in other applications.

    Model-Based Domain Adaptation

    We have seen models that perform domain adaptation at the data level, i.e., one can extract a part of the model that takes as input a data point from the source domain (a synthetic image) and map it to the target domain (a real image).

    However, it is hard to find applications where this is actually necessary. The final goal of AI model development usually is not to get a realistic synthetic image of a human eye; this is just a stepping stone to producing models that work better with the actual task, e.g., gaze estimation in this case.

    Therefore, to make better use of synthetic data it makes sense to also consider feature-level or model-level domain adaptation, that is, methods that work in the space of latent features or model weights and never go back to change the actual data.

    The simplest approach to such model-based domain adaptation would be to share the weights between networks operating on different domains or learn an explicit mapping between them. Here is an early paper by Glorot, Bordes, and Bengio (2011) that does exactly that; this relatively simple approach remains relevant to this day, and we will probably see more examples of it in later installments.

    But for this introductory post I would like to show probably the most important work in this direction, the one that started model-based domain adaptation in earnest. I’m talking, of course, about “Unsupervised Domain Adaptation by Backpropagation” by two Russian researchers, Yaroslav Ganin and Victor Lempitsky. Here is the main illustration from their paper:

    This is basically a generic framework for unsupervised domain adaptation that consists of:

    • a feature extractor,
    • a label predictor that performs the necessary task (e.g., classification) on extracted features, and
    • a domain classifier that takes the same features and attempts to classify which domain the original input belonged to.

    The idea is to train the label predictor to perform as well as possible and at the same time train the domain classifier to perform as badly as possible. This is achieved with the gradient reversal layer: the gradients are multiplied by a negative constant as they pass from the domain classifier to the feature extractor.
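    The gradient reversal layer itself takes only a few lines in a modern framework; here is a minimal PyTorch sketch (the scaling constant lambda_ corresponds to the negative multiplier mentioned above, and the surrounding networks are assumed to exist):

    ```python
    import torch

    class GradReverse(torch.autograd.Function):
        """Identity on the forward pass; multiplies the gradient by -lambda_ on
        the backward pass, so the feature extractor learns to *hurt* the domain
        classifier while the domain classifier itself is trained normally."""

        @staticmethod
        def forward(ctx, x, lambda_):
            ctx.lambda_ = lambda_
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambda_ * grad_output, None

    def grad_reverse(x, lambda_=1.0):
        return GradReverse.apply(x, lambda_)

    # usage sketch:
    # features = feature_extractor(batch)
    # class_logits = label_predictor(features)                    # task loss
    # domain_logits = domain_classifier(grad_reverse(features))   # domain loss
    ```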

    This is basically the idea of GANs but without the tediousness of iteratively training a generator and a discriminator separately, and with far fewer convergence problems than early GANs suffered from. The original work by Ganin and Lempitsky applied this idea to examples that would be considered toy datasets by today’s standards, but since 2015 this field has also had a lot of interesting discoveries that we will definitely discuss later.

    Conclusion

    In this post, we have started to discuss domain adaptation, probably the most important topic in machine learning research dealing with synthetic data. Generally speaking, all we do is domain adaptation: we need to use synthetic data for training, but the final goal is always to transfer to the domain of the real world.
    Next time, we will discuss GAN-based refinement in more detail. Stay tuned, there will be GANs aplenty, including some very interesting models! Until next time!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Smart Augmentations: Driving Model Performance with Synthetic Data II

    Smart Augmentations: Driving Model Performance with Synthetic Data II

    Last time, I started a new series of posts, devoted to different ways of improving model performance with synthetic data. In the first post of the series, we discussed probably the simplest and most widely used way to generate synthetic data: geometric and color data augmentation applied to real training data. Today, we take the idea of data augmentation much further. We will discuss several different ways to construct “smart augmentations” that make much more involved transformations of the input but still change the labeling only in predictable ways.

    Automating Augmentations: Finding the Best Strategy

    Last time, we discussed the various ways in which modern data augmentation libraries such as Albumentations can transform an unsuspecting input image. Let me recall one example from last time:

    Here, the resulting image and segmentation mask are the result of the following chain of transformations:

    • take a random crop from a predefined range of sizes;
    • shift, scale, and rotate the crop to match the original image dimension;
    • apply a (randomized) color shift;
    • add blur;
    • add Gaussian noise;
    • add a randomized elastic transformation for the image;
    • perform mask dropout, removing a part of the segmentation masks and replacing them with black cutouts on the image (a rough code sketch of such a chain follows this list).
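    Roughly, such a chain could be put together with Albumentations along the following lines; the parameter values below are illustrative rather than the exact ones used for the picture above, and exact class names or arguments may differ slightly between library versions:

    ```python
    import numpy as np
    import albumentations as A

    # dummy inputs just to make the example self-contained
    image = np.zeros((512, 512, 3), dtype=np.uint8)
    mask = np.zeros((512, 512), dtype=np.uint8)

    transform = A.Compose([
        A.RandomSizedCrop(min_max_height=(256, 480), height=512, width=512),
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=30),
        A.HueSaturationValue(),        # randomized color shift
        A.Blur(blur_limit=3),
        A.GaussNoise(),
        A.ElasticTransform(),
        A.MaskDropout(max_objects=2),  # drop some mask instances, cut them out of the image
    ])

    augmented = transform(image=image, mask=mask)
    aug_image, aug_mask = augmented["image"], augmented["mask"]
    ```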

    That’s quite a few operations! But how do we know that this is the best way to approach data augmentation for this particular problem? Can we find the best way to augment, maybe via some automated meta-strategy that would take into account the specific problem setting?

    As far as I know, this natural idea first bore fruit in the paper titled “Learning to Compose Domain-Specific Transformations for Data Augmentation” by Stanford researchers Ratner et al. (2017). They viewed the problem as a sequence generation task, training a recurrent generator to produce sequences of transformation functions:

    The next step was taken in the work called “AutoAugment: Learning Augmentation Strategies from Data” by Cubuk et al. (2019). This is a work from Google Brain, from the group led by Quoc V. Le that has been working wonders with neural architecture search (NAS), a technique that automatically searches for the best architectures in a class that can be represented by computational graphs. With NAS, this group has already improved over the state of the art in basic convolutional architectures with NASNet and the EfficientNet family, in object detection architectures with the EfficientDet family, and even in such a basic field as activation functions for individual neural units: the Swish activation functions were found with NAS.

    So what did they do with augmentation techniques? As usual, they frame this problem as a reinforcement learning task where the agent (controller) has to find a good augmentation strategy based on the rewards obtained by training a child network with this strategy:

    The controller is trained by proximal policy optimization, a rather involved reinforcement learning algorithm that I’d rather not get into (Schulman et al., 2017). The point is, they successfully learned augmentation strategies that significantly outperform other, “naive” strategies. They were even able to achieve improvements over state of the art in classical problems on classical datasets:

    Here is a sample augmentation policy for ImageNet found by Cubuk et al.:

    The natural question is, of course: why do we care? How can it help us when we are not Google Brain and cannot run this pipeline (it does take a lot of computation)? Cubuk et al. note that the resulting augmentation strategies can indeed transfer across a wide variety of datasets and network architectures; on the other hand, this transferability is far from perfect, so I have not seen the results of AutoAugment pop up in other works as often as the authors would probably like.

    Still, these works prove the basic point: augmentations can be composed together for better effect. The natural next step would be to have some even “smarter” functions. And that is exactly what we will see next.

    Smart Augmentation: Blending Input Images

    In the previous section, we discussed how to chain together standard transformations of input images in the best possible ways. But what if we go one step further and allow augmentations to produce more complex combinations of input data points?

    In 2017, this idea was put forward in the work titled “Smart Augmentation: Learning an Optimal Data Augmentation Strategy” by Irish researchers Lemley et al. Their basic idea is to have two networks, “Network A” that implements an augmentation strategy and “Network B” that actually trains on the resulting augmented data and solves the end task:


    The difference here is that Network A does not simply choose from a predefined set of strategies but operates as a generative network that can, for instance, blend two different training set examples into one in a smart way:

    In particular, Lemley et al. tested their approach on datasets of human faces (a topic quite close to our hearts here at Synthesis AI, but I guess I will talk about it in more detail later). So their Network A was able to, e.g., compose two different images of the same person into a blended combination (on the left):

    Note that this is not simply a blend of two images but a more involved combination that makes good use of facial features. Here is an even better example:

    This kind of “smart augmentation” borders on synthetic data generation: the resulting images are nothing like the originals. But before we turn to actual synthetic data (in subsequent posts), there are other interesting ideas one could apply even at the level of augmentation.

    Mixup: Blending the Labels

    In smart augmentations, the input data is produced as a combination of several images with the same label: two different images of the same person can be “interpolated” in a way that respects the facial features and expands the data distribution.

    Mixup, a technique introduced by MIT and FAIR researchers Zhang et al. (2018), looks at the problem from the opposite side: what if we mix the labels together with the training samples? This is implemented in a very straightforward way: for two labeled input data points, Zhang et al. construct a convex combination of both the inputs and the labels:

        \begin{align*}\tilde{x} &= \lambda x_i + (1 - \lambda) x_j,&\text{where } x_i, x_j \text{ are raw input vectors,} \\[4pt]\tilde{y} &= \lambda y_i + (1 - \lambda) y_j,&\text{where } y_i, y_j \text{ are one-hot label encodings.}\end{align*}
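    In code, mixup amounts to just a few lines; here is a minimal PyTorch-style sketch for a batch of images with one-hot labels, sampling λ from a Beta distribution as in the paper (the value of the parameter alpha is illustrative):

    ```python
    import torch

    def mixup_batch(x, y_onehot, alpha=0.2):
        """Return a convex combination of a batch with a shuffled copy of itself.

        x        -- (B, C, H, W) input images
        y_onehot -- (B, num_classes) one-hot labels
        alpha    -- Beta distribution parameter; lambda ~ Beta(alpha, alpha)
        """
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        perm = torch.randperm(x.size(0))
        x_mix = lam * x + (1 - lam) * x[perm]
        y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
        return x_mix, y_mix
    ```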

    The blended label does not change either the network architecture or the training process: binary cross-entropy trivially generalizes to target discrete distributions instead of target one-hot vectors. To borrow an illustration from Ferenc Huszar’s blog post, here is what mixup does to a single data point, constructing convex combinations with other points in the dataset:

    And here is what happens when we label a lot of points uniformly in the data:

    As you can see, the resulting labeled data covers a much more robust and continuous distribution, and this helps the generalization power. Zhang et al. report especially significant improvements in training GANs:

    By now, the idea of mixup has become an important part of the deep learning toolbox: you can often see it as an augmentation strategy, especially in the training of modern GAN architectures.

    Self-Adversarial Training

    To get to the last idea of today’s post, I will use YOLOv4, a recently presented object detection architecture (Bochkovskiy et al., 2020). YOLOv4 is a direct successor to the famous YOLO family of object detectors, improving significantly over the previous YOLOv3. We are, by the way, witnessing an interesting controversy in object detection because YOLOv5 followed less than two months later, from a completely different group of researchers, and without a paper to explain the new ideas (but with the code so it is not a question of reproducing the results)…

    Very interesting stuff, but discussing it would take us very far from the topic at hand, so let’s get back to YOLOv4. It boasts impressive performance, with the same detection quality as the above-mentioned EfficientDet at half the cost:

    In the YOLO family, new releases usually obtain a much better object detection quality by combining a lot of small improvements, bringing together everything that researchers in the field have found to work well since the previous YOLO version. YOLOv4 is no exception, and it outlines several different ways to add new tricks to the pipeline.

    What we are interested in now is their “Bag of Freebies”, the set of tricks that do not change the performance of the object detection framework during inference, adding complexity only at the training stage. It is very characteristic that most items in this bag turn out to be various kinds of data augmentation. In particular, Bochkovskiy et al. introduce a new “mosaic” geometric augmentation that works well for object detection:

    But the most interesting part comes next. YOLOv4 is trained with self-adversarial training (SAT), an augmentation technique that actually incorporates adversarial examples into the training process. Remember this famous picture?

    It turns out that for most existing artificial neural architectures, one can modify input images with small amounts of noise in such a way that the result looks to us humans completely indistinguishable from the originals but the network is very confident that it is something completely different; see, e.g., this OpenAI blog post for more information.

    In the simplest case, such adversarial examples are produced by the following procedure:

    • you have a network and an input x that you want to make adversarial; suppose you want to turn a panda into a gibbon;
    • formally, it means that you want to increase the “gibbon” component of the network’s output vector (at the expense of the “panda” component);
    • so you fix the weights of the network and start regular gradient ascent, but with respect to x rather than the weights! This is the key idea for finding adversarial examples; it does not explain why they exist (it’s not an easy question) but if they do, it’s really not so hard to find them.

    So how do you turn this idea into an augmentation technique? Given an input instance, you make it into an adversarial example by following this procedure for the current network that you are training. Then you train the network on this example. This may make the network more resistant to adversarial examples, but the important outcome is that it generally makes the network more stable and robust: now we are explicitly asking the network to work robustly in a small neighborhood of every input image. Note that the basic idea can again be described as “make the input data distribution cover more ground”, but by now we have come quite a long way since horizontal reflections and random crops…
    Note that unlike basic geometric augmentations, this may turn out to be quite a costly procedure. But the cost is entirely borne during training: yes, you might have to train the final model for two weeks instead of one, but the resulting model will, of course, run at exactly the same inference cost, because the model architecture does not change, only the training process does.
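    Here is a minimal sketch of this kind of adversarial perturbation step, in the spirit of the fast gradient sign method; the step size eps is illustrative, and the actual self-adversarial training procedure in YOLOv4 is more involved:

    ```python
    import torch

    def adversarial_example(model, loss_fn, x, y, eps=0.01):
        """One gradient step on the *input* to make the current model do worse.

        model   -- the network being trained (its weights are held fixed here)
        loss_fn -- task loss, e.g. cross-entropy
        x, y    -- input batch and labels
        eps     -- perturbation step size (illustrative)
        """
        x_adv = x.clone().detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        loss.backward()                               # gradient w.r.t. the input
        with torch.no_grad():
            x_adv = x_adv + eps * x_adv.grad.sign()   # step that increases the loss
        model.zero_grad()                             # discard weight gradients from this pass
        return x_adv.detach()

    # self-adversarial training step (sketch): perturb the batch, then train on it
    # x_adv = adversarial_example(model, loss_fn, x, y)
    # optimizer.zero_grad(); loss_fn(model(x_adv), y).backward(); optimizer.step()
    ```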

    Bochkovskiy et al. report this technique as one of the main new ideas and main sources of improvement in YOLOv4. They also use the other augmentation ideas that we discussed, of course:

    YOLOv4 is an important example for us: it represents a significant improvement in the state of the art in object detection in 2020… and much of the improvement comes directly from better and more complex augmentation techniques! This makes us even more optimistic about taking data augmentation further, to the realm of synthetic data.

    Conclusion

    In the second post in this new series, we have seen how more involved augmentations take the basic idea of covering a wider variety of input data much further than simple geometric or color transformations ever could. With these new techniques, data augmentation almost blends together with synthetic data as we usually understand it (see, e.g., my previous posts on this blog: one, two, three, four, five).

    Smart augmentations such as those presented in Lemley et al. border on straight-up automated synthetic data generation. It is already hard to draw a clear separating line between them and, say, synthetic data generation with GANs as presented by Shrivastava et al. (2017) for gaze estimation. The latter, however, is a classical example of domain adaptation by synthetic data refinement. Next time, we will begin to speak about this model and similar techniques for domain adaptation intended to make synthetic data work even better. Until then!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Driving Model Performance with Synthetic Data I: Augmentations in Computer Vision

    Driving Model Performance with Synthetic Data I: Augmentations in Computer Vision

    Welcome back, everybody! It’s been a while since I finished the last series on object detection with synthetic data (here is the series in case you missed it: part 1, part 2, part 3, part 4, part 5). So it is high time to start a new series. Over the next several posts, we will discuss how synthetic data and similar techniques can drive model performance and improve the results. We will mostly be talking about computer vision tasks. We begin this series with an explanation of data augmentation in computer vision; today we will talk about simple “classical” augmentations, and next time we will turn to some of the more interesting stuff.

    (header image source; Photo by Guy Bell/REX (8327276c))


    Data Augmentation in Computer Vision: The Beginnings

    Let me begin by taking you back to 2012, when the original AlexNet by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (paper link from NIPS 2012) was taking the world of computer vision by storm. AlexNet was not the first successful deep neural network; in computer vision, that honor probably goes to Dan Ciresan from Jurgen Schmidhuber’s group and their MC-DNN (Ciresan et al., 2012). But it was the network that made the deep learning revolution happen in computer vision: in the famous ILSVRC competition, AlexNet had about 16% top-5 error, compared to about 26% for the second best competitor, and that in a competition usually decided by fractions of a percentage point!

    Let’s have a look at the famous figure depicting the AlexNet architecture in the original paper by Krizhevsky et al.; you have probably seen it a thousand times:

    I want to note one little thing about it: the input image dimensions in this picture are 224×224 pixels, while ImageNet actually consists of 256×256 images. What’s the deal with this?

    The deal is that AlexNet, already in 2012, had to augment the input dataset in order to avoid overfitting. Augmentations are transformations that change the input data point (image, in this case) but do not change the label (output) or change it in predictable ways so that one can still train the network on augmented inputs. AlexNet used two kinds of augmentations:

    • horizontal reflections (a vertical reflection would often fail to produce a plausible photo) and
    • image translations; that’s exactly why they used a smaller input size: the 224×224 image is a random crop from the larger 256×256 image.

    With both transformations, we can safely assume that the classification label will not change. Even if we were talking about, say, object detection, it would be trivial to shift, crop, and/or reflect the bounding boxes together with the inputs; that’s exactly what I meant by “changing in predictable ways”. The resulting images are, of course, highly interdependent, but they still cover a wider variety of inputs than just the original dataset, reducing overfitting. In training AlexNet, Krizhevsky et al. estimated that they could produce 2048 different images from a single input training image (32×32 possible crop positions times 2 for the horizontal reflection).
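
    Just to make this concrete, here is roughly what this recipe looks like with a modern library such as torchvision; this is only an illustrative sketch, of course, since the original AlexNet implementation predates these tools:

    import torchvision.transforms as T

    alexnet_style = T.Compose([
        T.RandomCrop(224),              # random 224x224 patch out of a 256x256 input
        T.RandomHorizontalFlip(p=0.5),  # horizontal reflection; the class label stays the same
    ])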

    What is interesting here is that although ImageNet is very large (AlexNet trained on a subset with 1.2 million training images labeled with 1000 classes), modern neural networks are even larger: AlexNet alone has 60 million parameters, far more than the number of training images. Accordingly, Krizhevsky et al. have the following to say about their augmentations: “Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks.”

    AlexNet was not even the first to use this idea. The above-mentioned MC-DNN also used similar augmentations even though it was indeed a much smaller network trained to recognize much smaller images (traffic signs). One can also find much earlier applications of similar ideas: for instance, Simard et al. (2003) use distortions to augment the MNIST training set, and I am far from certain that this is the earliest reference.

    Simple Augmentations Today

    In the previous section, we have seen that as soon as neural networks transformed the field of computer vision, augmentations had to be used to expand the dataset and make the training set cover a wider data distribution. By now, this has become a staple in computer vision: while approaches may differ, it is hard to find a setting where data augmentation would not make sense at all.

    To review what kind of augmentations are commonplace in computer vision, I will use the example of the Albumentations library developed by Buslaev et al. (2020); although the paper was only released this year, the library itself had been around for several years and by now has become the industry standard.

    The obvious candidates are color transformations. Changing the color saturation or converting to grayscale definitely does not change bounding boxes or segmentation masks:

    The next obvious category are simple geometric transformations. Again, there is no question about what to do with segmentation masks when the image is rotated or cropped; you simply repeat the same transformation with the labeling:

    There are more interesting transformations, however. Take, for instance, grid distortion: we can slice the image up into patches and apply different distortions to different patches, taking care to preserve the continuity. Again, the labeling simply changes in the same way, and the result looks like this:

    The same ideas can apply to other types of labeling. Take keypoints, for instance; they can be treated as a special case of segmentation and also changed together with the input image:
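
    As a small illustration, here is a minimal sketch of keypoint-aware augmentations in Albumentations; the specific transforms, dummy image, and coordinates are chosen just for the example:

    import numpy as np
    import albumentations as A

    keypoint_aug = A.Compose(
        [A.ShiftScaleRotate(p=1.0), A.HorizontalFlip(p=0.5)],
        keypoint_params=A.KeypointParams(format="xy"),  # keypoints are given as (x, y) pixel coordinates
    )

    image = np.zeros((256, 256, 3), dtype=np.uint8)     # dummy image standing in for a real photo
    result = keypoint_aug(image=image, keypoints=[(120, 85), (200, 140)])
    aug_image, aug_keypoints = result["image"], result["keypoints"]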

    For some problems, it also helps to do transformations that take into account the labeling. In the image below, the main transformation is the so-called mask dropout: remove a part of the labeled objects from the image and from the labeling. But it also incorporates random rotation with resizing, blur, and a little bit of an elastic transform; as a result, it may be hard to even recognize that images on the right actually come from the images on the left:

    With such a wide set of augmentations, you can expand a dataset very significantly, covering a much wider variety of data and making the trained model much more robust. Note that it does not really hinder training in any way and does not introduce any complications in the development. With modern tools such as the Albumentations library, data augmentation is simply a matter of chaining together several transformations, and then the library will apply them with randomized parameters to every input image. For example, the images above were generated with the following chain of transformations:

    import albumentations as A

    light = A.Compose([
        A.RandomSizedCrop((512-100, 512+100), 512, 512),  # crop of random height between 412 and 612 px, resized to 512x512
        A.ShiftScaleRotate(),         # random shift, scale, and rotation
        A.RGBShift(),                 # random per-channel color shift
        A.Blur(),                     # blur with a random kernel size
        A.GaussNoise(),               # additive Gaussian noise
        A.ElasticTransform(),         # elastic distortion
        A.MaskDropout((10,15), p=1),  # remove some labeled objects from both image and mask
        A.Cutout(p=1)                 # cut out small random rectangles
    ], p=1)
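
    To apply the pipeline, you call it on an image together with its mask, and the library returns a dictionary with augmented versions of both; here is a minimal usage sketch with dummy arrays standing in for real data:

    import numpy as np

    image = np.zeros((512, 512, 3), dtype=np.uint8)  # dummy image standing in for a real photo
    mask = np.zeros((512, 512), dtype=np.uint8)      # dummy segmentation mask...
    mask[100:200, 100:200] = 1                       # ...with one fake labeled object
    augmented = light(image=image, mask=mask)        # apply the randomized chain defined above
    aug_image, aug_mask = augmented["image"], augmented["mask"]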

    Not too hard to program, right?

    Conclusion

    Today, we have begun a new series of posts. I am starting a little bit further back than usual: in this post we have discussed data augmentations, a classical approach to using labeled datasets in computer vision.

    Connecting back to the main topic of this blog, data augmentation is basically the simplest possible synthetic data generation. In augmentations, you start with a real world image dataset and create new images that incorporate knowledge from this dataset but at the same time add some new kind of variety to the inputs. Synthetic data works in much the same way, only the path from real-world information to synthetic training examples is usually much longer and more convoluted. So in a (rather tenuous) way, all modern computer vision models are training on synthetic data.

    But this is only the beginning. There are more ways to generate new data from existing training sets that come much closer to synthetic data generation. So close, in fact, that it is hard to draw the boundary between “smart augmentations” and “true” synthetic data. Next time we will look through a few of them and see how smarter augmentations can improve your model performance even further.

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Object Detection with Synthetic Data V: Where Do We Stand Now?

    Object Detection with Synthetic Data V: Where Do We Stand Now?

    This is the last post in my mini-series on object detection with synthetic data. Over the first four posts, we introduced the problem, discussed some classical synthetic datasets for object detection, talked about some early works whose conclusions are still relevant, and continued with a case study on retail and food object detection. Today we consider two papers from 2019 that still represent the state of the art in object detection with synthetic data and are often used as generic references for the main tradeoffs inherent in using synthetic data. We will see and discuss those tradeoffs too. Is synthetic data ready for production, and how does it compare with real data in object detection? Let’s find out. (header image source)

    An Annotation Saved is an Annotation Earned

    The first paper saved me the trouble of thinking of a pithy title. Aptly named “An Annotation Saved is an Annotation Earned: Using Fully Synthetic Training for Object Instance Detections”, this work by Hinterstoisser et al. comes from the Google Cloud AI team. Similar to our last post, Hinterstoisser et al. consider detection of multiple small common objects, most of which are packs of food items and medicine. Here is a sample of their synthetic objects:

    But the interesting thing about this paper is that they claim to achieve excellent results without any real data at all, by training on a purely synthetic dataset. Here are some sample results on real evaluation data for a Faster R-CNN model with an Inception ResNet backbone (a bog-standard and very common two-stage object detector) trained on a purely synthetic training set:

    Looks great, right? So how did Hinterstoisser et al. achieve such wonderful results?

    Their first contribution is an interesting take on domain randomization for background images. Recall that domain randomization is the process of applying “smart augmentations” to synthetic images so that they are as random as possible, in the hope of covering as much of the data distribution as possible. Generally, the more diverse and cluttered the backgrounds are, the better. So Hinterstoisser et al. try to turn the clutter up to eleven with the following procedure (a rough pseudocode sketch follows the list):

    • take a separate dataset of distractor 3D models that are not the objects we are looking for (in the paper, they had 15 thousand such distractor models);
    • render these objects on the background in random poses and with scales roughly corresponding to the scale of the foreground objects (so they are comparable in size) while randomly varying the hues of the background object colors (this is standard domain randomization with distractor objects);
    • choose and place new background objects until you have covered every pixel of the background (this is the interesting part);
    • then place the foreground objects on top (we’ll discuss it in more detail below).
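
    Here is a rough pseudocode sketch of the background layer generation as I read it from the paper; all the helpers (uncovered_pixels, random_pose, render_distractor) are hypothetical placeholders rather than the authors' code:

    import random

    def fill_background(canvas, distractor_models, foreground_scale, hue_range=(0.0, 1.0)):
        """Keep rendering random distractors until every background pixel is covered."""
        while uncovered_pixels(canvas):                           # hypothetical: any background left uncovered?
            model = random.choice(distractor_models)              # pick one of the ~15k distractor models
            scale = foreground_scale * random.uniform(0.9, 1.1)   # keep sizes comparable to the foreground
            hue = random.uniform(*hue_range)                      # randomly vary the distractor's hue
            render_distractor(canvas, model, random_pose(), scale, hue)  # hypothetical rendering call
        return canvas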

    As a result of this approach, Hinterstoisser et al. don’t have to have any background images or scenes at all: the background is fully composed of distractor objects. And they indeed get pretty cluttered images; here is the pipeline together with a couple of samples:

    But that’s only part of it. Another part is how to generate the foreground layer, with objects that you actually want to recognize. Here, the contribution of Hinterstoisser et al. is that instead of placing 3D models in random poses or in poses corresponding to the background surfaces, as researchers had done before, they introduce a deterministic curriculum (schedule) for introducing foreground objects; a rough sketch of this schedule follows the list:

    • iterate over scales from largest to smallest, so that the network starts off with the easier job of recognizing large objects and then proceeds to learn to find their smaller versions;
    • for every scale, iterate over all possible rotations;
    • and then for every scale and rotation iterate through all available objects, placing them with possible overlaps and cropping at the boundaries; there is also a separate procedure to allow background distractor objects to partially occlude the foreground.
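
    Here is that rough sketch; place_object and the argument lists are hypothetical, the point is only to show the order of the nested loops:

    def foreground_curriculum(objects, scales, rotations):
        """Deterministic schedule: largest objects first, then sweep rotations and objects."""
        for scale in sorted(scales, reverse=True):      # from the largest scale down to the smallest
            for rotation in rotations:                  # every possible rotation at the current scale
                for obj in objects:                     # every available foreground object...
                    place_object(obj, scale, rotation)  # ...placed with overlaps and boundary crops allowed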

    Here is a sample illustration:

    As a result, this purely synthetic approach outperforms a 2000-image real training set. Hinterstoisser et al. even estimate the costs: they report that it had taken them about 200 hours to acquire and label the real training set. This should be compared with… a mere 5 hours needed for 3D scanning of the objects: once you have the pipeline ready, that is all you need to do to add new objects or retrain in a different setting. Here are the main results:

    But even more interesting are the ablation studies that the authors provide. They analyze which of their ideas contributed the most to their results. Interestingly (and a bit surprisingly), the largest effect is achieved by their curriculum strategy. Here it is compared to purely random pose sampling for foreground objects:

    Another interesting conclusion is that the purely synthetic cluttered background actually performs much better than a seemingly more realistic alternative strategy: take real world background images and augment them with synthetic distractor objects (there is no doubt that distractor objects are useful anyway). Surprisingly, the purely synthetic background composed entirely of objects wins quite convincingly:

    With these results, Hinterstoisser et al. have the potential to redefine how we see and use synthetic data for object detection; the conclusions most probably also extend to segmentation and possibly other computer vision problems. In essence, they show that synthetic data can be much better than real for object detection if done right. And by “done right” I mean virtually every single element of their pipeline; here is the ablation study:

    There are more plots like this in the paper, but it is time to get to our second course.

    How Much Real Data Do We Actually Need?

    Guess I got lucky with the titles today. The last paper in our object detection series, “How much real data do we actually need: Analyzing object detection performance using synthetic and real data” by Nowruzi et al., concentrates on a different problem, recognizing objects in urban outdoor environments with an obvious intent towards autonomous driving. However, the conclusions it draws appear to be applicable well beyond this specific case, and this paper has become the go-to source among experts in synthetic data.

    The difference of this work from other sources is that instead of investigating different approaches to dataset generation within a single general framework, it considers various existing synthetic and real datasets, puts them in comparable conditions, and draws conclusions regarding how best to use synthetic data for object detection.

    Here are the sample pictures from the datasets used in the paper:

    Nowruzi et al. consider three real datasets:

    • Berkeley Deep Drive (BDD) (Yu et al., 2018), a large-scale real dataset (100K images) with segmentation and object detection labeling (image (a) above);
    • Kitti-CityScapes (KC), a combination of visually similar classical urban driving datasets KITTI (Geiger et al., 2012) and CityScapes (Cordts et al., 2016) (b);
    • NuScenes (NS) (Caesar et al., 2019), a dataset I mentioned in the first post of the series, with 1000 labeled video scenes, each 20 seconds long (c);

    and three synthetic:

    • Synscapes (7D) (Wrenninge & Unger, 2018), a synthetic dataset designed to mimic the properties of Cityscapes (d);
    • Playing for Benchmark (P4B) (Richter et al., 2017), a synthetic dataset with video sequences obtained from the Grand Theft Auto V game engine (e);
    • CARLA (Dosovitskiy et al., 2017), a full-scale driving simulator that can also be used to generate labeled computer vision datasets (f).

    To put all datasets on equal footing, the authors use only 15000 images from each (since the smallest dataset has 15K images), resize all images to 640×370 pixels, and remove annotations for objects that become too small under these conditions (less than 4% of the image height). The object detection model is also very standard: it is an SSD detector with a MobileNet backbone, probably chosen for the computational efficiency of both training and evaluation. The interesting part, of course, is the results.
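
    For instance, the annotation filtering step could look roughly like this (an illustrative sketch with an assumed box format, not the authors' code):

    def filter_small_boxes(boxes, image_height=370, min_fraction=0.04):
        """Drop boxes whose height is below 4% of the (resized) image height.
        boxes: list of (x_min, y_min, x_max, y_max) tuples in pixels."""
        min_height = min_fraction * image_height
        return [b for b in boxes if (b[3] - b[1]) >= min_height]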

    First, as you would expect, adding more data helps. Training on smaller portions of each dataset significantly impedes the results, as the plot below shows. Note that Nowruzi et al. use both color and shape of the markers to signify two different dimensions of the parameters, and the axes of the picture are performance indicators (average precision and recall), so top right is the best corner and bottom left is the worst; this will be used throughout all plots below:

    The next set of results is about transfer learning: how well can object detection models perform on one dataset when trained on another? Let’s see the plot and then discuss:

    There are several effects to be seen here:

    • naturally, the best results (top right corner) are had when you train and test on the same dataset; this is true for both synthetic and real datasets, but synthetic data significantly outshines real data in this comparison; this is a general theme throughout all synthetic data in computer vision: results on synthetic datasets are always better, sometimes too much so, signifying overfitting (but hopefully not in this case);
    • the worst dataset, pretty much an outlier here, is CARLA: while training and testing on CARLA gives the very best results, any attempt at transfer from CARLA to anything else fails miserably;
    • but other than that, synthetic datasets fare pretty well, with transfer results clustering together with transfer from real datasets.

    The real datasets are still a little better (see, e.g., how well BDD transfers to everything except NuScenes). But note that Nowruzi et al. have removed one of the main advantages of synthetic data by equalizing the size of real and synthetic datasets, so I would say that synthetic data performs quite well here.

    But the real positive results come later. Nowruzi et al. compare two different approaches to using hybrid datasets, where synthetic data is combined with real.

    First, synthetic-real data mixing, where a small(er) amount of real data is added to a full-scale synthetic dataset. A sample plot below shows the effect for training on the BDD dataset; the dashed line repeats the plot for training on purely real data that we have already seen above:

    You can see that training on synthetic data indeed helps save on annotations substantially: e.g., using only 2.5% of the real BDD dataset and a synthetic P4B dataset yields virtually the same results as using 10% of the real BDD while using 4 times less real data. Naturally, 100% of real data is still better, and probably always will be.

    But the really interesting stuff begins with the second approach: fine-tuning on real data. The difference is that now we fully train on a synthetic dataset and then fine-tune on (small portions of) real datasets, so training on synthetic and real data is fully separated. This is actually more convenient in practice: you can have a huge synthetic dataset and train on it once, and then adapt the resulting model to various real conditions by fine-tuning which is computationally much easier. Here are the results in the same setting (subsets of BDD):

    The dashed line is exactly the same as above, but note how every other result has improved! And this is not an isolated result; here is a comparison on the NuScenes dataset:

    The paper has more plots, but the conclusion is already unmistakable: fine-tuning on real data performs much better than just mixing in data. This is the main result of Nowruzi et al., and in my opinion it also fits well with the previous paper, so let’s finish with a common conclusion.
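
    To make the distinction between the two regimes concrete, here is a minimal sketch of the workflow; train_fn and finetune_fn stand in for whatever training loop you use, so this is an illustration rather than the paper's code:

    from torch.utils.data import ConcatDataset

    def train_mixed(model, synthetic_ds, real_subset, train_fn):
        """Regime 1 (mixing): a single training run on the combined synthetic + real dataset."""
        mixed = ConcatDataset([synthetic_ds, real_subset])  # e.g., full synthetic + 2.5% of the real data
        return train_fn(model, mixed)

    def pretrain_then_finetune(model, synthetic_ds, real_subset, train_fn, finetune_fn):
        """Regime 2 (fine-tuning): train on synthetic once (expensive), then fine-tune on real data (cheap)."""
        model = train_fn(model, synthetic_ds)   # done once; can be reused for many real domains
        return finetune_fn(model, real_subset)  # typically a shorter run with a smaller learning rate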

    Conclusion

    Today, we have seen two influential recent papers that both try to improve object detection with the help of synthetic data. There is a common theme that I see in both papers: they show how important the exact training curricula are, and how much the smallest details of how synthetic data is generated and presented to the network matter. Before reading these papers, I would never have guessed that simply changing the strategy of how to randomize the poses and scales of synthetic objects can improve the results by 0.2-0.3 in mean average precision (that’s a huge difference!).

    All this suggests that there is still much left to learn in the field of synthetic data, even for a relatively straightforward problem such as object detection. Using synthetic data is not quite as simple as throwing as much random stuff at the network as possible. This is a good thing, of course: harder problems with uncertain results also mean greater opportunities for research and for deeper understanding of how neural networks work and how computer vision can be ultimately solved. Here at Synthesis AI, we work to achieve better understanding of synthetic data, not only for object detection but for many other deep learning applications as well. And the results we have discussed today suggest that while synthetic data is already working well for us, there is still a fascinating and fruitful road ahead.

    With this, I conclude the mini-series on object detection. After a short break, we will return with something completely different. Stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Object Detection with Synthetic Data IV: What’s in the Fridge?

    Object Detection with Synthetic Data IV: What’s in the Fridge?

    We continue the series on synthetic data for object detection. Last time, we stopped in 2016, with some early works on synthetic data for deep learning that still have implications relevant today. This time, we look at a couple of more recent papers devoted to multiple object detection for food and small vendor items. As we will see today, such objects are a natural application for synthetic data, and we’ll see how this application has evolved in the last few years.

    Why the Fridge?

    Before I proceed to the papers, let me briefly explain why this specific application—recognizing multiple objects on supermarket shelves or in a fridge—sounds like such a perfect fit for synthetic data. There are several reasons, and each of them is quite general and might apply to your own application as well.

    First, the backgrounds and scene compositions are quite standardized (the insides of a fridge, a supermarket shelf) so it shouldn’t take too much effort to simulate them realistically. If you look at the datasets for such applications, you will see that they often get by with really simplistic backgrounds. Here are some samples from the dataset from our first paper today, available from Param Rajpura’s github repository:

    A couple of surface textures, maybe a glossy surface for the glass shelves, and off you go. This has changed a lot since 2017, and we’ll talk about it below, but it’s still not as hard as making realistic humans.

    Second, while simple, the scenes and backgrounds are definitely not what you see in ImageNet and other standard datasets. You can find a lot of pics of people enjoying outdoor picnics and 120 different breeds of dogs in ImageNet but not so many insides of a refrigerator or supermarket shelves with labeled objects. Thus, we cannot reuse pretrained models that easily.

    Third, guess why such scenes are not very popular in standard object detection datasets? Because they are obscenely hard to label by hand! A supermarket shelf can have hundreds of objects that are densely packed, often overlap, and thus would require full minutes of tedious work per image. Here are some sample images from a 2019 paper by Goldman et al. that presents a real dataset of such images called SKU-110K (we won’t consider it in detail because it has nothing to do with synthetic data):

    Fourth, aren’t we done now that we have a large-scale real dataset? Not really because new objects arrive very often. A system for a supermarket (or the fridge, it’s the same kind of objects) has to easily support the introduction of new object classes because new products or, even more often, new packaging for old products are introduced continuously. Thousands of new objects appear in a supermarket near you over a year, sometimes hundreds of new objects at once (think Christmas packaging). When you have a real dataset, adding new images takes a lot of work: it is not enough to just have a few photos of the new object, you also need to have it on the shelves, surrounded by old and new objects, in different combinations… this gets really hard really quick. In a synthetic dataset, you just add a new 3D model and then you are free to create any number of scenes in any combinations you like.

    Finally, while you need a lot of objects in this application and a lot of 3D models for the synthetic dataset, most objects are relatively easy to model. They are Tetra Pak cartons, standardized bottles, paper boxes… Among the thousands of items in a supermarket, there are relatively few different packages, most of them are standard items with different labels. So once you have a 3D model for, say, a pint bottle, most beers will be covered by swapping a couple of textures, and the bottle itself is far from a hard object to model (compare with, say, a human face or a car).

    With all that said, object detection for small retail items does sound like a perfect fit for synthetic data. Let’s find out what people have been doing in this direction.

    Multiple Object Detection in Constrained Spaces

    Our first paper today, the earliest I could find on deep learning with synthetic data for this application, is “Object Detection Using Deep CNNs Trained on Synthetic Images” by Rajpura et al. (2017). They concentrate on recognizing objects inside a refrigerator, and we have already seen some samples of their synthetic data above. They actually didn’t even bother with 3D modeling and just took standard bottles and packs from the ShapeNet repository that we discussed earlier.

    They used Blender (often the tool of choice for synthetic data since it’s quite standard and free to use) to create simple scenes of the inside of a fridge and placed objects with different textures there:

    As for their approach to object detection, we are still not quite in state of the art territory so I won’t dwell on it too much. In short, Rajpura et al. used a fully convolutional version of GoogLeNet that generates a coverage map and a separate bbox predictor trained on its results:

    What were the results and conclusions? Well, first of all, Rajpura et al. saw significantly improved performance for hybrid datasets. Here is a plot from their paper that shows how 10% of real data and 90% of synthetic far outperformed “pure” datasets:

    This result, however, should be taken with a grain of salt because, first, they only had 400 real images (remember how hard it is to label such images manually), and second, the scale of synthetic data was also not so large (3600 synthetic images).

    Another interesting conclusion, however, is that adding more synthetic images can actually hurt. Here is a plot that shows how performance begins to decline after 4000 synthetic images:

    This is probably due to overfitting to synthetic data, and it remains an important problem even today. If you add a lot of synthetic images, the networks may begin to overfit to peculiarities of specifically synthetic images. More generally, synthetic data is different from real, and hence there is always an inherent domain transfer problem involved when you try to apply networks trained on synthetic data to real test sets (which you always ultimately want to do). This is a huge subject, though, and we will definitely come back to domain adaptation for synthetic-to-real transfer later on this blog. For now, let us press on with the fridges.

    Smart Synthetic Data for Smart Vending Machines

    Or, actually, vending machines. Let us make a jump to 2019 and consider the work by Wang et al. titled “Synthetic Data Generation and Adaption for Object Detection in Smart Vending Machines”. The premise looks very similar: vending machines have small food items placed there, and the system needs to find out which items are still there judging by a camera located inside the vending machine. Here is the general pipeline as outlined in the paper:

    On the surface it’s exactly the same thing as Rajpura et al. in terms of computer vision, but there are several interesting points that highlight how synthetic data had progressed over these two years. Let’s take them in order.

    First, data generation. In 2017, researchers took ready-made simple ShapeNet objects. In 2019, 3D shapes of the vending machine objects are being scanned from real objects by high-quality commercial 3D scanners, in this case one from Shining 3D. What’s more, 3D scanners still have a really hard time with specular or transparent materials. For specular materials, Wang et al. use a whole other complex neural architecture (an adversarial one, actually) to transform the specular material into a diffuse one based on multiple RGB images and then restore the material during rendering (they use Unity3D for that). The specular-to-diffuse translation is based on a paper by Wu et al. (2018); here is an illustration of its possible input and output:

    As for transparent materials, even in 2019 Wang et al. give up, saying that “although this could be alleviated by introducing some manual works, it is beyond the scope of this paper” and simply avoiding transparent objects in their work. This is, by the way, where Synthesis AI could step up: check out ClearGrasp, a result of our collaboration with Google Robotics.

    Second, Wang et al. introduce and apply a separate model for the deformation of resulting object meshes. Cans and packs may warp or bulge in a vending machine, and their synthetic data generation pipeline adds random deformations, complete with a (more or less) realistic energy-based model with rigidity parameters based on a previous work by Wang et al. (2012). The results look quite convincing:

    Third, the camera. Due to physical constraints, vending machines use fisheye cameras to be able to cover the entire area where objects are located. Here is the vending machine from Wang et al. and sample images from the cameras on every shelf:

    3D rendering engines usually support only the pinhole camera model, so, again, Wang et al. use a separate state of the art camera model by Kannala and Brandt, calibrating it on a real fisheye camera and then introducing some random variation and noise.
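
    To give a flavor of what such a camera model looks like, here is a minimal sketch of a Kannala-Brandt-style polynomial fisheye projection; the coefficients and intrinsics are illustrative placeholders, not calibrated values from the paper:

    import numpy as np

    def fisheye_project(point_3d, k=(1.0, 0.03, 0.001, 0.0), fx=300.0, fy=300.0, cx=320.0, cy=185.0):
        """Project a 3D point (camera frame) with a radial polynomial r(theta), Kannala-Brandt style."""
        x, y, z = point_3d
        theta = np.arctan2(np.hypot(x, y), z)   # angle between the ray and the optical axis
        phi = np.arctan2(y, x)                  # azimuth of the ray around the axis
        r = sum(ki * theta ** (2 * i + 1) for i, ki in enumerate(k))  # r = k1*theta + k2*theta^3 + ...
        return fx * r * np.cos(phi) + cx, fy * r * np.sin(phi) + cy   # image-plane pixel coordinates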

    Fourth, the synthetic-to-real image transfer, i.e., improving the resulting synthetic images so that they look more realistic. Wang et al. use a variation of style transfer based on CycleGAN. I will not go into the details here because this direction of research definitely deserves a separate blog post (or ten), and we will cover it later. For now, let me just say that it does help in this case; below, original synthetic images are on the left and the results of transfer are on the right:

    Fifth, the object detection pipeline. Wang et al. compare several state of the art object detection methods, including PVANET by Kim et al. (2016), SSD by Liu et al. (2016), and YOLOv3 by Redmon and Farhadi (2018). Unlike all the works we have seen above, these are architectures that remain quite relevant up to this day (with some new versions released, as usual), and, again, each of them would warrant a whole separate post, so for now I will just skip to the results.

    Interestingly, while the absolute numbers and quality of the results have increased substantially since 2017, the general takeaways remain the same. It still helps to have a hybrid dataset with both real and synthetic data (note also that the dataset is again rather small; this time it’s because the models are good enough to reach saturation in this constrained setting with this amount of data, so more synthetic data probably would not help):

    The results on a real test set are also quite convincing. Here are some samples for PVANET:

    SSD:

    and YOLOv3:

    Interestingly, PVANET yields the best results, which is contrary to many other object detection applications (YOLOv3 should be best overall in this comparison):

    This leads to our last takeaway point for today: in a specific application, it is best to redo the comparisons at least among the current state of the art architectures. It doesn’t add all that much to the cost of the project: in this case, Wang et al. definitely spent much, much more time preparing and adapting synthetic data than testing two additional architectures. But it can yield somewhat unexpected results (one can explain why PVANET has won in this case, but honestly, this would be a post-hoc explanation, you really just don’t know a priori who’s going to win) and let you choose what’s best for your own project.

    Conclusion

    Today, we have considered a sample application of synthetic data for object detection: recognizing multiple objects in small constrained spaces such as a refrigerator or a vending machine. We have seen why this is a perfect fit for synthetic data, and have used it as an example to showcase some of the progress that synthetic data has enjoyed over the past couple of years. But that’s not all: in the next post, we will consider some very recent works that study synthetic data for object detection in depth, uncovering the tradeoffs inherent in the process. Until next time!

    Sergey Nikolenko
    Head of AI, Synthesis AI