AI Interviews: Serge Belongie

    Hi all! Today we begin a new series of posts here in the Synthesis AI blog. We will talk to the best researchers and practitioners in the field of machine learning, discussing different topics but, obviously, trying to circle back to our main focus of synthetic data every once in a while.

    Today we have our first guest, Professor Serge Belongie. He is a Professor of Computer Science at the University of Copenhagen (DIKU) and the Director of the Pioneer Centre for Artificial Intelligence. Previously he was the Andrew H. and Ann R. Tisch Professor at Cornell Tech and in the Computer Science Department at Cornell University, and an Associate Dean at Cornell Tech.

Over his distinguished career, Prof. Belongie has been highly successful in both academia and business. He co-founded several successful startups, including Digital Persona, Inc., which first brought a fingerprint identification device to the mass market, and two computer vision startups, Anchovi Labs and Orpix. The MIT Technology Review included him on its list of Innovators Under 35 for 2004, and in 2015 he received the ICCV Helmholtz Prize. Google Scholar assigns Prof. Belongie a spectacular h-index of 96, built on dozens of papers that have become fundamental for computer vision and other fields, each with hundreds of citations. And, to be honest, I got most of this off Prof. Belongie’s Wikipedia page, which means that this barely scratches the surface of his achievements.

    Q1. Hello Professor, and welcome to our interview! Your list of achievements is so impressive that we definitely cannot do it justice in this format. But let’s try to add at least one little bit to this Wikipedia dump above. What is the one thing, maybe the one new idea that you are most proud of in your career? You know, the idea that makes you feel the warmest and fuzziest once you remember how you had it?

    Prof. Belongie: Thank you for inviting me! I’m excited about Synthesis AI’s vision, so I’m happy to help get out the word to the CV/ML community. 

This is a timely question, since I recently started a “Throwback Thursday” series on my lab’s Twitter account. Each week over this past summer, my former students and I had a fun time looking back on the journey behind our publications since I became a professor a couple of decades ago. The ideas of which I am most proud have rarely appeared in highly cited papers. One example is the grid-based comparisons in our 2015 paper “Cost-Effective HITs for Relative Similarity Comparisons.” As my students from that time will recall, I was captivated by the idea of triplet-based comparisons for measuring perceptual similarity (“is a more similar to b than to c?”), but the cubic complexity of such approaches limited their practical adoption. Then it occurred to us that humans have excellent parallel visual processing abilities, which means we could fill a screen with 4×4 or 5×5 grids of images, and through some simple UI trickery, we could harvest large batches of triplet constraints in one shot, using a HIT (human intelligence task) that was both less expensive to run and more entertaining for the participants to complete. While this approach and the related SNaCK approach we published the following year have not gotten much traction in the literature, I’m convinced that this concept will eventually get its day in the sun.

    Q2. Now for the obligatory question: what is your view on the importance of synthetic data for modern computer vision? Here at Synthesis AI, we believe that synthetic data can become one of the solutions to the data problem; do you agree? What other solutions do you see and how, in your opinion, does synthetic data fit into the landscape of computer vision of the future?

Prof. Belongie: I am in complete agreement with this view. When pilots learn to fly, they must log thousands of hours of flight time in simulated and real flight environments. That is an industry that, over several decades, has found the right balance of real vs. synthetic for the best instructional outcome. Our field is now confronting an analogous problem, with the key difference that the student is a machine. With that difference in mind, we will again need to find the right balance. As my PhD advisor [Jitendra Malik] used to tell us in the late 90s, nature has a way of detecting a hack, so we must be careful about overstating what’s possible with purely synthetic environments. But when you think about the Cartesian product of all the environmental factors that can influence, say, the appearance of city streets in the context of autonomous driving, it seems foolish not to build upon our troves of real data with clever synthesis and augmentation approaches to give our machines a gigantic head start before tackling the real thing.

    Q3. Among all your influential papers with hundreds of citations, the one that looks to me most directly relevant to synthetic data is the paper where Xun Huang and yourself introduced adaptive instance normalization (AdaIN), a very simple style transfer approach that still works wonders. We recently talked about AdaIN on this blog, and in our experiments we have never seen a more complex synthetic-to-real refinement pipeline, even based on your own later work, MUNIT, outperform the basic AdaIN. What has worked best for synthetic-to-real style transfer for you? Do you maybe have more style transfer techniques in store for us, to appear in the near future?

Prof. Belongie: Good ol’ AdaIN indeed works surprisingly well in a wide variety of cases. The situation gets more nuanced, however, in fine-grained settings such as the iNat challenges or NeWT downstream tasks. In these cases, even well-intentioned style transfer methods can trample over the subtle differences that distinguish closely related species; as the saying goes, “one person’s signal is another person’s noise.” In this context, we’ve been reflecting on the emerging practice of augmentation engineering. Ever since deep learning burst onto the scene around 2011, it hasn’t been socially acceptable to fiddle with feature design manually, but no one complains if you fiddle with augmentation functions. The latter can be thought of as a roundabout way to scratch the same itch. It’s likely that in fine-grained domains, e.g., plant pathology, we’ll need to return to the old – and in my opinion, good – practices of working closely with domain experts to cultivate domain-appropriate geometric and photometric transformations.

    In terms of what’s coming next in style transfer, I’m excited about our recent work in the optical see-through (OST) augmented reality setting. In conventional style transfer, you have total control over the values of every pixel. In the OST setting, however, you can only add light; you can’t subtract it. So what can be done about this? We tackle this question in our recent Stay Positive work, focusing on the nonnegative image synthesis problem, and leveraging quirks of the human visual system’s processing of brightness and contrast.

Q4. Continuing from the last question, one of the latest papers to come out of your group is titled “Single Image Texture Translation for Data Augmentation”. In it, you propose a new data augmentation technique that translates textures between objects from single images (as a brief reminder for the readers, we have talked about what data augmentation is previously on this blog). The paper also includes a nice graphical overview of modern data augmentation methods that I can’t help but quote here:

    Looking at this picture makes me excited. What is your opinion on the limits of data augmentation? Combined with neural style transfer and all other techniques shown here, how far do you think this can take us? How do you see these techniques potentially complementing synthetic data approaches (in the sense of making 3D models and rendering images), and are there, in your opinion, unique advantages of synthetic data that augmentation of real data cannot provide?

Prof. Belongie: When it comes to generic, coarse-grained settings, I would say the sky’s the limit in terms of what data augmentation can accomplish. Here I’m referring to supplying modern machine learning pipelines with sufficiently realistic augmentations, such as adding rain to a street or stubble to a face. The bar is, of course, somewhat higher if the goal is to cross the uncanny valley for human observers. And as I hinted earlier, fine-grained visual categorization (FGVC) also presents some tough challenges for the data augmentation movement. FGVC problems are characterized by the need for specialized domain knowledge, the kind that is possessed by very few human experts. In that sense, knowing how to tackle the data augmentation problem for FGVC is tantamount to bottling that knowledge in the form of a family of image manipulations. That strikes me as a daunting task.

Q5. A slightly personal question here. Your group at UCSD used to be called SO(3) in honor of the group of three-dimensional rotations, and your group at Cornell is now called SE(3), after the special Euclidean group in three dimensions. This brings back memories of how I used to work in algebra a little bit back when I was an undergrad. I realize the group’s title probably doesn’t mean much, but still: do you see a way for modern algebra and/or geometry to influence machine learning? What is your opinion of current efforts in geometric deep learning: would you advise current math undergrads to go there?

    Prof. Belongie: Geometric deep learning provides an interesting framework for incorporating prior knowledge into traditional deep learning settings. Personally, I find it exciting because a new generation of students is talking about topics like graph Laplacians again. I don’t know if I’d point industry-focused ML engineers at geometric deep learning, but I do think it’s a rich landscape for research-oriented undergrads to explore, with an inspiring synthesis of old and new ideas.

    Q6. And, if you don’t mind, let us finish with another personal question. Turns out SO3 is not just your computer vision research group’s title but also your band name! I learned about it from this profile article about you that lists quite a few cool things you’ve done, including a teaching gig in Brazil “inspired by Richard Feynman”.

    So I guess it’s safe to say that Richard Feynman has been one of your heroes. Who else has been an influence? How did you turn to computer science? And are there maybe some other biographies or popular books that you can recommend for our readers who are choosing their path right now?

    Prof. Belongie: Ah, I see you’ve done your research! The primary influences in my career have been my undergrad and grad school advisors, Pietro Perona and Jitendra Malik, who are both towering figures in the field. From them I gained a deep appreciation of ideas outside of computer science and engineering, including human vision, experimental psychology, art history, and neuroscience. I find myself quoting, paraphrasing, or channeling them on a regular basis when meeting with my students. In terms of turning to computer science, that was a matter of practicality. I started out in electrical engineering, focusing on digital signal processing, and as my interests coalesced around image recognition, I naturally gravitated to where the action was circa the late 90s, i.e., computer science.

    As far as what I’d recommend now, that’s a tough question. My usual diet is based on the firehose of arXiv preprints that match my group’s keywords du jour. But this can be draining and even demoralizing, since you’ll start to feel like it’s all been done. So if you want something to inspire you, read an old paper by Don Geman, like this one about searching for mental pictures. Or better yet, after you’re done with your week’s quota of @ak92501-recommended papers, go for a long drive or walk and listen to a Rick Beato “What Makes this Song Great” playlist. It doesn’t matter if you know music theory, or if some of the genres he covers aren’t your thing. His passion for music – diving into it, explaining it, making the complex simple – is infectious, and he will inspire you to do great things in whatever domain you’ve chosen as your focus. 

    Dear Professor, thank you very much for your answers! And thank you, the reader, for your attention! Next time, we will return with an interview with another important figure in machine learning. Stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

Synthetic Data for Safe Driving

The role of synthetic data in developing solutions for autonomous driving is hard to overstate. In a recent post, I already touched upon virtual outdoor environments for training autonomous driving agents, and this is a huge topic that we will no doubt return to later. But today, I want to talk about a much more specialized topic in the same field: driver safety monitoring. It turns out that synthetic data can help here as well—and today we will understand how. This is a companion post for our recent press release.

What Is Driver Safety Monitoring, and Why Are Manufacturers Forced to Care?

Car-related accidents remain a major source of fatalities and trauma all around the world. The United States, for instance, has about 35,000 motor vehicle fatalities and over 2 million injuries per year, which may pale in comparison to the COVID pandemic or cancer but still amounts to a lot of unnecessary suffering.

In fact, significant progress has been achieved in reducing these deaths and injuries in recent years. Here are the statistics of road traffic fatalities in Germany over the last few years:

    And here is the same plot for France (they both stop at 2019 because it would be really unfair to make road traffic comparisons in the times of overwhelming lockdowns):

Obviously, the European Union is doing something right in its regulation of road traffic. A large part of this is the new safety measures that are gradually being made mandatory in the EU. And the immediate occasion for this post is the new regulations regarding driver safety monitoring.

Starting in 2022, it will be mandatory for car manufacturers in the European Union to install the following safety features: “warning of driver drowsiness and distraction (e.g. smartphone use while driving), intelligent speed assistance, reversing safety with camera or sensors, […] lane-keeping assistance, advanced emergency braking, and crash-test improved safety belts”. With these regulations, the European Commission plans to “save over 25,000 lives and avoid at least 140,000 serious injuries by 2038”.

On paper, this sounds marvelous: why not have a system that wakes you up if you fall asleep behind the wheel and helps you stay in your lane when you’re distracted? But how can systems like this work? And where does synthetic data fit in? Let’s find out.

    Driver Drowsiness Detection with Deep Learning

We cannot cover everything, so let’s dive into the details of one specific aspect of safety monitoring: drowsiness detection. It is central both to the new regulations and to actual car accidents: falling asleep at the wheel is very common. You don’t even have to be completely asleep: 5–10 seconds of what is called a microsleep episode will be more than enough for an accident to occur. So how can a smart car notice that you are about to fall asleep and warn you in time?

    The gold standard of recognizing brain states such as sleep is, of course, electroencephalography (EEG), that is, measuring the electrical activity of your brain. Recent research has applied deep learning to analyzing EEG data, and it appears that even relatively simple solutions based on convolutional and recurrent networks are enough to recognize sleep and drowsiness with high certainty. For instance, a recent work by Zurich researchers Malafeev et al. (2020) shows excellent results in the detection of microsleep episodes with a simple architecture like this:
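To give a flavor of what “relatively simple” means here, below is a minimal CNN+LSTM sketch for classifying short EEG windows as wake vs. microsleep. This is only an illustration of the general idea, not the Malafeev et al. (2020) architecture, and the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class EEGDrowsinessNet(nn.Module):
    """Toy CNN+LSTM classifier for single-channel EEG windows (illustrative only)."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.conv = nn.Sequential(                     # local temporal feature extraction
            nn.Conv1d(1, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, 64, batch_first=True)  # longer-range temporal context
        self.head = nn.Linear(64, n_classes)           # wake vs. microsleep logits

    def forward(self, x):                              # x: (batch, 1, samples)
        h = self.conv(x).transpose(1, 2)               # -> (batch, time, channels)
        _, (h_n, _) = self.lstm(h)
        return self.head(h_n[-1])
```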

But short of requiring all drivers to wear a headpiece with EEG electrodes, this kind of data will not be available in a real car. EEG is commonly used to collect and label real datasets in this field, but for actual drowsiness detection we need some other signal.

There are two such signals, both important here. First, steering patterns: a simple sensor can track the steering angle and velocity, and then you can develop a system that recognizes troubling patterns in the driver’s steering. For example, if a driver is barely steering at all for some time and then returns the car on track with a quick jerking motion, that’s probably a sign that the driver is getting sleepy or distracted. Leading manufacturers and suppliers such as Volvo and Bosch are already offering solutions based on steering patterns.
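As a toy illustration (and emphatically not how any production system works), a steering-based heuristic might flag exactly the pattern described above: a long “quiet” stretch followed by a sudden large correction. All thresholds below are made-up assumptions.

```python
import numpy as np

def drowsy_steering_flag(angles_deg: np.ndarray, hz: float = 10.0,
                         quiet_std: float = 0.5, jerk_deg_per_s: float = 30.0) -> bool:
    """angles_deg: steering angles sampled at `hz` over a sliding window.
    Returns True if the window looks like 'no steering, then a sudden jerk'."""
    quiet = angles_deg[:-int(hz)]                              # everything but the last second
    recent_rate = np.abs(np.diff(angles_deg[-int(hz):])) * hz  # deg/s over the last second
    return quiet.std() < quiet_std and recent_rate.max() > jerk_deg_per_s
```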

    Steering patterns, however, are just one possible signal, and a quite indirect one. Moreover, once you have in place another component of the very same EU regulations, automatic lane-keeping assistance, steering becomes largely automated and these patterns stop working. A much more direct idea would be to use computer vision to detect the signs of drowsiness on the driver’s face.

    When Volvo introduced their steering-based system in 2007, their representative said: “We often get questions about why we have chosen this concept instead of monitoring the driver’s eyes. The answer is that we don’t think that the technology of monitoring the driver’s eyes is mature enough yet.” By 2021, computer vision has progressed a lot, and recent works on the subject show excellent results.

The most telling sign would be, of course, detecting that the driver’s eyes are closing. There is an entire field of study devoted to detecting closed eyes and blinking (blinks get longer and more frequent when you’re drowsy). In 2014, Song et al. presented the now-standard Closed Eyes in the Wild (CEW) dataset, modeled after the classical Labeled Faces in the Wild (LFW) dataset but with eyes closed; here is a sample of CEW (top row) and LFW (bottom row):

Since then, eye closedness and blink detection have steadily improved, usually with various convolutional pipelines, and by now they are definitely ready to become an important component of car safety.
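Once you have a per-frame eye-state classifier, turning its output into drowsiness cues is straightforward. Here is a small illustrative sketch that computes a PERCLOS-style closure ratio together with blink count and duration; the feature set and frame rate are our own assumptions.

```python
import numpy as np

def blink_stats(eye_closed: np.ndarray, fps: float = 30.0) -> dict:
    """eye_closed: per-frame boolean predictions from an eye-state classifier.
    Returns closure ratio, blink count, and mean blink duration for the clip."""
    closed = eye_closed.astype(int)
    # pad with zeros so every closed run has a well-defined start and end
    transitions = np.diff(np.concatenate(([0], closed, [0])))
    starts, ends = np.where(transitions == 1)[0], np.where(transitions == -1)[0]
    durations = (ends - starts) / fps
    return {
        "closure_ratio": float(closed.mean()),                 # PERCLOS-like feature
        "blink_count": int(len(durations)),
        "mean_blink_s": float(durations.mean()) if len(durations) else 0.0,
    }
```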

    We don’t have to restrict ourselves only to the eyes, of course. The entire facial expression can provide important clues (did you yawn while reading this?). For example, Shen et al. (2020) recently proposed a multi-featured pipeline that has separate convolutional processing streams for the driver’s head, eyes, and mouth:

    Another important recent work comes from Affectiva, a company we have recently collaborated with on eye gaze estimation. Joshi et al. (2020) classify drowsiness based on facial expressions as captured in a 10-second video that might have the driver progress between different states of drowsiness. Their pipeline is based on features extracted by their own SDK for recognizing facial expressions:

None of these systems is perfect, of course, but it is clear by now that computer vision can provide important clues to detect and evaluate the driver’s state and trigger warnings that can help avoid road traffic accidents and ultimately save lives. So where does synthetic data come into this picture?

    Synthetic Data for Drowsiness Detection

On this blog, we have discussed many times (e.g., recently and very recently) the conditions under which synthetic data especially shines in computer vision. These conditions include situations where existing real datasets may be biased, environmental features that are not covered in real data (different cameras, lighting conditions, etc.), and generally situations that call for extensive variability and randomization, which is much easier to achieve in synthetic data than in real datasets.

    Guess what: driver safety is definitely one of those situations! First, cameras that can be installed in real cars shoot from positions that are far from standard for usual datasets. Here are some frames from a sample video that Joshi et al. processed in the paper we referenced above:

Compare this with, say, the standard frontal photographs characteristic of Labeled Faces in the Wild that we also showed above; obviously, some domain transfer is needed between these two situations, whereas a synthetic 3D model of a head can be shot from any angle.

    Second, where will real data come from? We could collect real datasets and label them semi-automatically with the help of EEG monitoring, but that would be far from perfect for computer vision model training because real drivers will not be wearing an EEG device. Also, real datasets of this kind will inevitably be very small: it is obviously very difficult and expensive to collect even thousands of samples of people falling asleep at the wheel, let alone millions.

Third, you are most likely to fall asleep when you’re driving at night, and night driving means your face is probably illuminated very poorly. You can use NIR (near-infrared) or ToF NIR (time-of-flight near-infrared) cameras to “see in the dark”. But pupils (well, retinas) act differently in the NIR modality, and this effect can differ across ethnicities. Such camera modalities and challenging lighting conditions are, again, something that is relatively easy to achieve in synthetic datasets but hard to find in real ones. For example, available NIR datasets such as NVGaze or the MRL Eye Dataset were created for AR/VR, not from an in-car camera perspective.

That is why here at Synthesis AI we are moving into this area (see our recent press release), and we hope to make important contributions that will make road traffic safer for all of us. We are already collaborating with automobile and autonomous vehicle manufacturers and Tier 1 suppliers in this market.

    To make this work, we will need to make an additional effort to model car interiors, cameras used by car manufacturers, and other environmental features, but the heart of this project remains in the FaceAPI that we have already developed. This easy-to-use API can produce millions of unique 3D models that have different combinations of identities, clothing, accessories, and, importantly for this project, facial expressions. FaceAPI is already able to produce a wide variety of emotions, including, of course, closed eyes and drowsiness, but we plan to further expand this feature set.

    Here is an example of our automatically generated synthetic data from an in-car perspective, complete with depth and normal maps:

    Synthetic Data for Driver Attention

    But you don’t have to literally fall asleep to cause a traffic accident. Unfortunately, it often suffices to get momentarily distracted, look at your phone, take your hands off the wheel for a second to adjust your coffee cup… all with the same, sometimes tragic, consequences. Thus, another, no less important application of computer vision for driver safety is monitoring driver attention and possible distractions. This becomes all the more important as driverless cars become increasingly common, and autopilots take up more and more of the total time at the wheel: it is much easier to get distracted when you are not actually driving the car.

    First, there is the monitoring of large-scale motions such as taking your hands off the wheel. This falls into the classical field of scene understanding (see, e.g., Xiao et al. (2018)): “are the driver’s hands on the wheel” is a typical scene understanding question that goes beyond simple object detection of both hands and the wheel. Answering these questions, however, usually relies upon classical computer vision problems such as instance segmentation.
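As a toy example of answering such a question on top of instance segmentation output, one could simply check the overlap between predicted hand masks and the wheel region; this sketch is purely hypothetical and ignores all the hard parts (tracking, occlusion, 3D reasoning).

```python
import numpy as np

def hands_on_wheel(hand_masks: list, wheel_mask: np.ndarray, min_overlap: float = 0.05) -> bool:
    """hand_masks: list of boolean (H, W) instance masks for detected hands;
    wheel_mask: boolean (H, W) mask of the steering wheel."""
    for hand in hand_masks:
        overlap = np.logical_and(hand, wheel_mask).sum() / max(hand.sum(), 1)
        if overlap > min_overlap:          # at least one hand overlaps the wheel region
            return True
    return False
```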

    Second, it is no less important to track such small-scale motions as eye gaze. Eye gaze estimation is an important computer vision problem that has its own applications but is also obviously useful for driver safety. We have already discussed applications of synthetic data to eye gaze estimation on this blog, with a special focus on domain adaptation.

Obviously, all of these problems belong to the field of computer vision, and all standard arguments for the use of synthetic data apply in this case as well. Thus, we expect that synthetic data produced by our engines will be extremely useful for driver attention monitoring. In the next example, also produced by FaceAPI, we can compare a regular RGB image and the corresponding near-infrared image for two drivers who may be distracted. Note that eye gaze is clearly visible in our synthetic pictures, as well as larger features:

There’s even more that can be varied parametrically. Here are some examples with head turns, yawning, eye closure, and accessories like face masks and glasses.

All in all, we strongly believe that high-quality synthetic data for computer vision systems can help advance safety systems for car manufacturers and help reduce road traffic accidents not only in the European Union but all over the world. Here at Synthesis AI, we are devoted to removing the obstacles to further advances of machine learning—especially for such a great cause!

    Sergey Nikolenko
    Head of AI, Synthesis AI

Synthetic Data-Centric AI

In a recent series of talks and related articles, one of the most prominent AI researchers, Andrew Ng, pointed to the elephant in the room of artificial intelligence: the data. It is a common saying in AI that “machine learning is 80% data and 20% models”, but in practice, the vast majority of effort from both researchers and practitioners concentrates on the model part rather than the data part of AI/ML. In this article, we consider this 80/20 split in slightly more detail and discuss one possible way to advance data-centric AI research.

    The life cycle of a machine learning project

    The basic life cycle of a machine learning project for some supervised learning problems (for instance, image segmentation) looks like this:

    First, one has to collect data, then it has to be labeled according to the problem at hand, then a model is trained on the resulting dataset, and finally the best models have to be fitted into edge devices where they will be deployed. In my personal opinion, these four parts are about equally important in most real life projects; but if you look at the research papers from any top AI conference, you will see that most of them are about the “Training” phase, with a little bit of “Deployment” (model distillation and similar techniques that make models fit into restricted hardware) and an even smaller part devoted to the “Data” and “Annotation” parts (mostly about data augmentation).

This is not due to simple narrow-mindedness: everybody understands that data is key for any AI/ML project. But usually the model is the sexy part of research, where new ideas flourish and intermingle, and data is the “necessary but boring” part. Which is a shame because, as Andrew Ng demonstrated in his talks, improvements in the data department are often much lower-hanging fruit than improvements in state-of-the-art AI models.

    Data labeling and data cascades: the real elephants in the room

    On the other hand, collecting and especially annotating the data is increasingly becoming a problem, if not a hard constraint on AI research and development. The required labeling is often very labor-intensive. Suppose that you want to teach a model to count the cows grazing in a field, a natural and potentially lucrative idea for applying deep learning in agriculture. The basic computer vision problem here is either object detection, i.e., drawing bounding boxes around cows, or instance segmentation, i.e., distinguishing the silhouettes of cows. To train the model, you need a lot of photos with labeling such as this one:

Imagine how much work it would take to label tens of thousands of such photographs! Naturally, in a real project you would use an existing, weaker model to pre-label the images and use manual labor only to correct its mistakes, but that still might take thousands of person-hours.
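As an illustration of that pre-labeling step, one could run an off-the-shelf COCO detector and keep only confident “cow” detections as candidate boxes for human correction. This is just a sketch assuming a reasonably recent torchvision; the COCO label index for “cow” (21 here) should be verified against your label map.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
COW_LABEL = 21  # assumption: COCO category id for "cow" in this label map

@torch.no_grad()
def propose_cow_boxes(image: torch.Tensor, score_thresh: float = 0.6) -> torch.Tensor:
    """image: float tensor (3, H, W) in [0, 1]. Returns candidate boxes for human review."""
    out = model([image])[0]
    keep = (out["scores"] > score_thresh) & (out["labels"] == COW_LABEL)
    return out["boxes"][keep]
```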

Another important problem is dataset bias. Even in applications where real labeled data abounds, existing datasets often do not cover cases relevant for new applications. Take face recognition, for instance; there exist datasets with millions of labeled faces. But, first, many of them suffer from the racial and ethnic bias that often plagues major datasets. And second, there are plenty of use cases in slightly modified conditions: for example, a face recognition system might need to recognize users from any angle, but existing datasets are heavily skewed towards frontal and profile photos.

    These and other problems have been recently combined under the label of data cascades, as introduced in this Google AI post. Data cascades include dataset bias, real world noise that is absent in clean training sets, model drifts where the targets change over time, and many other problems, up to poor dataset documentation.

    There exist several possible solutions to basic data-related problems, all increasingly explored in modern AI:

• few-shot, one-shot, and even zero-shot learning try to reduce data requirements by pretraining models and then fine-tuning them to new problems with very small datasets; this is a great solution when it works, but success stories are still relatively limited;
    • semi-supervised and weakly supervised learning make use of unlabeled data that is often plentiful (e.g., it is usually far cheaper to obtain unlabeled images of the objects in question than label them).

    But these solutions are far from universal: if existing data (used for pretraining) has no or very few examples of the objects and relations we are looking for, these approaches will not be able to “invent” them. Fortunately, there is another approach that can do just that.

    Synthetic data: a possible solution

    I am talking about synthetic data: artificially created and labeled data used to train AI models. In computer vision this would mean that dataset developers create a 3D environment with models of the objects that need to be recognized and their surroundings. In a synthetic environment, you know and control the precise position of every object, which gives you pixel-perfect labeling for free. Moreover, you have total control over many knobs and handles that can be adapted to your specific use case:

    • environments: backgrounds and locations for the objects;
    • lighting parameters: you can set your own light sources;
    • camera parameters: camera type (if you need to recognize images from an infrared camera, standard datasets are unlikely to help), placement etc.;
    • highly variable objects: with real data, you are limited to what you have, and with synthetic data you can mix and match everything you have created in limitless combinations.

    For instance, synthetic human faces can have any facial features, ethnicities, ages, hairstyles, accessories, emotions, and much more. Here are a few examples from an existing synthetic dataset of faces:

    Synthetic data presents its own problems, the most important being the domain shift problem that arises because synthetic data is, well, not real. You need to train a model on one domain (synthetic data) and apply it on a different domain (real data), which leads to a whole field of AI called domain adaptation.

In my opinion, the free labeling, high variability, and sheer boundless quantity of synthetic data (as soon as you have the models, you can generate any number of labeled images at the low cost of rendering) far outweigh this drawback. Recent research is already showing that even very straightforward applications of synthetic data can bring significant improvements in real world problems.

    Automatic generation and closing the feedback loop

    But wait, there is more. The “dataset” we referred to above is more than just a dataset—it is an entire API (FaceAPI, to be precise) that allows a user to set all of these knobs and handles, generating new synthetic data samples at scale and in a fully automated fashion, with parameters defined for API calls.

    This opens up new, even more exciting possibilities. When synthetic data generation becomes fully automated, it means that producing synthetic data is now a parametric process, and the values of parameters may influence the final quality of AI models trained on this synthetic data… you see where this is going, right? 

    Yes, we can treat data generation as part of the entire machine learning pipeline, closing the feedback loop between data generation and testing the final model on real test sets. Naturally, it is hard to expect gradients to flow naturally across the process of rendering 3D scenes (although recent research may suggest otherwise), so learning the synthetic data generation parameters can be done, e.g., with reinforcement learning that has methods specifically designed to work in these conditions. This is an early approach taken by VADRA (Visual Adversarial Domain Randomization and Augmentation):

A similar but different approach would be to design more direct loss functions, either by collecting data on model performance and learning from it or by optimizing other objectives. Here, one important example would be the Meta-Sim model, which learns to minimize the distribution gap between synthetic 3D scenes and real scenes, together with downstream performance, by learning the parameters of scene graphs, a natural representation of 3D scene structure.
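To make the closed loop concrete, here is a deliberately simplified sketch that replaces reinforcement learning with plain random search over a few hypothetical generation parameters. The functions generate, train, and evaluate are stand-ins for your rendering engine, training routine, and evaluation on a real validation set.

```python
import random

def sample_generation_params() -> dict:
    """Hypothetical knobs of a synthetic data generator (illustrative only)."""
    return {
        "camera_pitch_deg": random.uniform(-30, 30),
        "light_intensity": random.uniform(0.3, 1.5),
        "accessory_prob": random.uniform(0.0, 0.5),
    }

def closed_loop_search(generate, train, evaluate, n_trials: int = 20):
    """generate(params) -> synthetic dataset; train(dataset) -> model;
    evaluate(model) -> score on a *real* validation set (higher is better)."""
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_generation_params()
        score = evaluate(train(generate(params)))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```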

These ideas are increasingly being applied in studies of synthetic data, and I believe that adaptive generation will see wide use in the near future, bringing synthetic data to a new level of usefulness for AI/ML. I hope that the progress of modern AI will not stop at the current data problem, and I believe that synthetic data, especially automatic generation with a closed feedback loop, is one of the key tools to overcome it.

    Sergey Nikolenko
    Head of AI, Synthesis AI

Synthetic Data Case Studies: It Just Works

    In this (very) long post, we present an entire whitepaper on synthetic data, proving that synthetic data works even without complicated domain adaptation techniques in a wide variety of practical applications. We consider three specific problems, all related to human faces, show that synthetic data works for all three, and draw some other interesting and important conclusions.

    Introduction

    Synthetic data is an invaluable tool for many machine learning problems, especially in computer vision, which we will concentrate on below. In particular, many important computer vision problems, including segmentation, depth estimation, optical flow estimation, facial landmark detection, background matting, and many more, are prohibitively expensive to label manually.

    Synthetic data provides a way to have unlimited perfectly labeled data at a fraction of the cost of manually labeled data. With the 3D models of objects and environments in question, you can create an endless stream of data with any kind of labeling under different (randomized) conditions such as composition and placement of objects, background, lighting, camera placement and parameters, and so on. For a more detailed overview of synthetic data, see (Nikolenko, 2019).

    Naturally, artificially produced synthetic data cannot be perfectly photorealistic. There always exists a domain gap between real and synthetic datasets, stemming both from this lack of photorealism and also in part from different approaches to labeling: for example, manually produced segmentation masks are generally correct but usually rough and far from pixel-perfect.

    Therefore, most works on synthetic data center around the problem of domain adaptation: how can we close this gap? There exist approaches that improve the realism, called synthetic-to-real refinement, and approaches that impose constraints on the models—their feature space, training process, or both—in order to make them operate similarly on both real and synthetic data. This is the main direction of research in synthetic data right now, and much of recent research is devoted to suggesting new approaches to domain adaptation.

    However, CGI-based synthetic data becomes better and better with time, and some works also suggest that domain randomization, i.e., simply making the synthetic data distribution sufficiently varied to ensure model robustness, may work out of the box. On the other hand, recent advances in related problems such as style transfer (synthetic-to-real refinement is basically style transfer between the domains of synthetic and real images) suggest that refinement-style domain adaptation may be done with very simple techniques; it might happen that while these techniques are insufficient for photorealistic style transfer for high-resolution photographs they are quite enough to make synthetic data useful for computer vision models.

Still, it turns out that synthetic data can provide significant improvements even without complicated domain adaptation approaches. In this whitepaper, we consider three specific use cases where we have found synthetic data to work well under either very simple or no domain adaptation at all. We are also actively pursuing research on domain adaptation, and ideas coming from modern style transfer approaches may prove to bring new interesting results here as well; but in this document, we concentrate on very straightforward applications of synthetic data and show that synthetic data can just work out of the box. In one of the case studies, we also compare the two main techniques for using synthetic data in this simple way—training on hybrid datasets and fine-tuning on real data after pretraining on synthetic—with interesting results.

    Here at Synthesis AI, we have developed the Face API for mass generation of high-quality synthetic 3D models of human heads, so all three cases have to do with human faces: face segmentation, background matting for human faces, and facial landmark detection. Note that all three use cases also feature some very complex labeling: while facial landmarks are merely very expensive to label by hand, manually labeled datasets for background matting are virtually impossible to obtain.

    Before we proceed to the use cases, let us describe what they all have in common.

    Data Generation and Domain Adaptation

    In this section, we describe the data generation process and the synthetic-to-real domain adaptation approach that we used throughout all three use cases.

    Face API by Synthesis AI

The Face API, developed by Synthesis AI, can generate millions of images of unique people, with expressions and accessories, in a wide array of environments, with unique camera settings. Below we show some representative examples of various Face API capabilities.

The Face API has tens of thousands of unique identities that span genders, age groups, and ethnicities/skin tones, and new identities are added continuously.

    It also allows for modifications to the face, including expressions and emotions, eye gaze, head turn, head & facial hair, and more:

Furthermore, the Face API makes it possible to adorn the subjects with accessories, including clear glasses, sunglasses, hats, other headwear, headphones, and face masks.

    Finally, it allows for indoor & outdoor environments with accurate lighting, as well as additional directional/spot lighting to further vary the conditions and emulate reality.

    The output includes:

    • RGB Images
    • Pupil Coordinates
    • Facial Landmarks (iBug 68-like)
    • Camera Settings
    • Eye Gaze
    • Segmentation Images & Values
    • Depth from Camera
    • Surface Normals
    • Alpha / Transparency

    Full documentation can be found at the Synthesis AI website. For the purposes of this whitepaper, let us just say that Face API is a more than sufficient source of synthetic human faces with any kind of labeling that a computer vision practitioner might desire.

    Synthetic-to-Real Refinement by Instance Normalization

Below, we consider three computer vision problems where we have experimented with using synthetic data to train more or less standard deep learning models. Although we could have applied complex domain adaptation techniques, we instead chose one simple idea inspired by style transfer models, showing that practical and less computationally intensive approaches can work well too.

Most recent works on style transfer, including MUNIT (Huang et al., 2018), StyleGAN (Karras et al., 2018), StyleGAN2 (Karras et al., 2019), and others, make use of the idea of adaptive instance normalization (AdaIN) proposed by Huang and Belongie (2017).

    The basic idea of AdaIN is to substitute the statistics of the style image in place of the batch normalization (BN) parameters for the corresponding BN layers during the processing of the content image:
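Concretely, AdaIN normalizes each channel of the content features with its own mean and standard deviation and then rescales it with the style features’ statistics. A minimal PyTorch sketch of the operation itself (not of a full style transfer network):

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization: align per-channel statistics of the content
    features (N, C, H, W) to those of the style features (N, C, H, W)."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean
```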

This is a natural extension of an earlier idea, conditional instance normalization (Dumoulin et al., 2016), where BN parameters were learned separately for each style. Both conditional and adaptive instance normalization can be useful for style transfer, but AdaIN is better suited for common style transfer tasks because it only needs a single style image to compute the statistics and does not require pretraining or, generally speaking, any information regarding future styles in advance.

In style transfer architectures such as MUNIT or StyleGAN, AdaIN layers are a key component of a complex, involved architecture that usually also employs CycleGAN (Zhu et al., 2017) and/or ProGAN (Karras et al., 2017) ideas. As a result, these architectures are hard to train and, what is even more important, require a lot of computational resources to use. This makes state-of-the-art style transfer architectures unsuitable for synthetic-to-real refinement, since we need to apply them to every image in the training set.

However, style transfer results in the original AdaIN work (Huang and Belongie, 2017) already look quite good, and it is possible to use AdaIN in a much simpler architecture than state-of-the-art style transfer models. Therefore, in our experiments we use a similar approach for synthetic-to-real refinement, replacing BN statistics for synthetic images with statistics extracted from real images.

This approach has been shown to work several times in the literature, in several variations (Li et al., 2016; Chang et al., 2019). We follow either Chang et al. (2019), which is a simpler and more direct version, or the approach introduced by Seo et al. (2020), called domain-specific optimized normalization (DSON), where for each domain we maintain batch normalization statistics and mixture weights learned on the corresponding domain:
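In its simplest form, close in spirit to AdaBN (Li et al., 2016), this amounts to re-estimating BatchNorm running statistics on real images only, while keeping all learned weights fixed. A minimal PyTorch sketch, assuming a data loader over real images:

```python
import torch

@torch.no_grad()
def recalibrate_bn_stats(model: torch.nn.Module, real_loader, device: str = "cuda") -> None:
    """Re-estimate BN running statistics on the real domain; weights stay untouched."""
    model.to(device)
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None          # None -> cumulative moving average over all batches
    model.train()                      # BN layers update running stats only in train mode
    for images, _ in real_loader:
        model(images.to(device))
    model.eval()
```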

Thus, we have described our general approach to synthetic-to-real domain adaptation; we used it in some of our experiments, but note that in many cases we did not do any domain adaptation at all (these cases will be made clear below). With that, we are ready to proceed to specific computer vision problems.

    Face Segmentation with Synthetic Data: Syn-to-Real Transfer As Good As Real-to-Real Transfer

    Our first use case deals with the segmentation problem. Since we are talking about applications of our Face API, this will be the segmentation of human faces. What’s more, we do not simply cut out the mask of a human face from a photo but want to segment different parts of the face.

    We have used two real datasets in this study:

    • LaPa (Landmark guided face Parsing dataset), presented by Liu et al. (2020), contains more than 22,000 images, with variations in pose and facial expression and also with some occlusions among the images; it contains facial landmark labels (which we do not use) and faces segmented into 11 classes, as in the following example:

    • CelebAMask-HQ, presented by Lee et al. (2019), contains 30,000 high-resolution celebrity face images with 19 segmentation classes, including various hairstyles and a number of accessories such as glasses or hats:

    For the purposes of this study, we reduced both datasets to 9 classes (same as LaPa but without the eyebrows). As synthetic data, we used 200K diverse images produced by our Face API; no domain adaptation was applied in this case study (we have tried it and found no improvement or even some small deterioration in performance metrics).

    As the basic segmentation model, we have chosen the DeepLabv3+ model (Chen et al., 2018) with the DRN-56 backbone, an encoder-decoder model with spatial pyramid pooling and atrous convolutions. DeepLabv3+ produces good results, often serves as a baseline in works on semantic segmentation, and, importantly, is relatively lightweight and easy to train. In particular, due to this choice all images were resized down to 256p.
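To make the setup concrete, here is a minimal training sketch for a hybrid real+synthetic run. It uses torchvision’s off-the-shelf DeepLabv3 (ResNet-50) as a stand-in for our DeepLabv3+/DRN-56 configuration, and random tensors as placeholders for CelebAMask-HQ crops and Face API renders; it assumes a reasonably recent torchvision.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset
from torchvision.models.segmentation import deeplabv3_resnet50

# Placeholder datasets: (image, 9-class mask) pairs at 256x256, purely illustrative.
real_ds = TensorDataset(torch.rand(64, 3, 256, 256), torch.randint(0, 9, (64, 256, 256)))
synthetic_ds = TensorDataset(torch.rand(192, 3, 256, 256), torch.randint(0, 9, (192, 256, 256)))

hybrid_ds = ConcatDataset([real_ds, synthetic_ds])     # simple real+synthetic mixture
loader = DataLoader(hybrid_ds, batch_size=8, shuffle=True)

model = deeplabv3_resnet50(weights=None, weights_backbone=None, num_classes=9)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for images, masks in loader:
    logits = model(images)["out"]                      # (N, 9, 256, 256)
    loss = criterion(logits, masks)                    # masks: (N, 256, 256) class indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```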

The results, summarized in the table below, confirmed our initial hypothesis and even exceeded our expectations in some respects. The table shows the mIoU (mean intersection-over-union) scores on the CelebAMask-HQ and LaPa test sets for DeepLabv3+ trained on real data from CelebAMask-HQ mixed with synthetic data in various proportions.

    In the table below, we show the results for different proportions of real data (CelebAMask-HQ) in the training set, tested on two different test sets.

    First of all, as expected, we found that training on a hybrid dataset with both real and synthetic data is undoubtedly beneficial. When both training and testing on CelebAMask-HQ (first two rows in the table), we obtain noticeable improvements across all proportions of real and synthetic data in the training set. The same holds for the two bottom rows in the table that show the results of DeepLabv3+ trained on CelebAMask-HQ and tested on LaPa.

But the most interesting and, in our opinion, important result is that in this context, domain transfer across two (quite similar) real datasets produces virtually the same results as domain transfer from synthetic to real data: results on LaPa for a model trained on 100% real data are almost identical to results on LaPa for a model trained on synthetic data only. Let us look at the plots below and then discuss what conclusions we can draw:

    Most importantly, note that the domain gap on the CelebA test set amounts to a 9.6% performance drop for the Syn-to-CelebA domain shift and 9.2% for the LaPa-to-CelebA domain shift. This very small difference suggests that while domain shift is a problem, it is not a problem specifically for the synthetic domain, which performs basically the same as a different domain of real data. The results on the LaPa test set tell a similar story: 14.4% performance drop for CelebA-to-LaPa and 16.8% for Syn-to-LaPa.

Second, note the “fine-tune” bars that exhibit (quite significant) improvements over other models trained on a different domain. This is another effect we have noted in our experiments: it appears that fine-tuning on real data after pretraining on a synthetic dataset often works better than just training on a hybrid syn-plus-real dataset.

Below, we take a more detailed look at where the errors are:

During cross-domain testing, synthetic data is competitive with real data and even outperforms it on difficult classes such as eyes and lips. Synthetic data seems to perform worse on nose and hair, but that can be explained by differences in the labeling of these two classes across the real and synthetic datasets.

Thus, in this use case we have seen that even in a very direct application, without any syn-to-real refinement, synthetic data generated by our Face API works basically at the same level as training on a different real dataset.

    This is already very promising, but let us proceed to even more interesting use cases!

    Background Matting: Synthetic Data for Very Complex Labeling

    The primary use case for background matting, useful to keep in mind throughout this section, is cutting out a person from a “green screen” image/video or, even more interesting, from any background. This is, of course, a key computer vision problem in the current era of online videoconferencing.

    Formally, background matting is a task very similar to face/person segmentation, but with two important differences. First, we are looking to predict not only the binary segmentation mask but also the alpha (opacity) value, so the result is a “soft” segmentation mask with values in the [0, 1] range. This is very important to improve blending into new backgrounds.
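For intuition, the alpha matte is exactly what you need for the standard compositing equation I = alpha * F + (1 - alpha) * B, which is how a cut-out person gets blended onto a new background; a tiny sketch:

```python
import numpy as np

def composite(foreground: np.ndarray, background: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Alpha compositing: I = alpha * F + (1 - alpha) * B.
    foreground, background: (H, W, 3) floats in [0, 1]; alpha: (H, W, 1) floats in [0, 1]."""
    return alpha * foreground + (1.0 - alpha) * background
```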

    Second, the specific variation of background matting that we are experimenting with here takes two images as input: a pure background photo and a photo with the object (person). In other words, the matting problem here is to subtract the background from the foreground. Here is a sample image from the demo provided by Lin et al. (2020), the work that we take as the basic model for this study:

The purpose of this work was to speed up high-quality background matting for high-resolution images so that it could run in real time; indeed, Lin et al. have also developed a Zoom plugin that works well in real videoconferencing.

We will not dwell on the model itself for too long. Basically, Lin et al. propose a pipeline that first produces a coarse output with atrous spatial pyramid pooling similar to DeepLabv3 (Chen et al., 2017) and then recovers high-resolution matting details with a refinement network (not to be confused with syn-to-real refinement!). Here is the pipeline as illustrated in the paper:

    For our experiments, we have used the MobileNetV2 backbone (Sandler et al., 2018). For training we used virtually all parameters as provided by Lin et al. (2020) except augmentation, which we have made more robust.

One obvious problem with background matting is that it is extremely difficult to obtain real training data. Lin et al. describe the PhotoMatte13K dataset of 13,665 2304×3456 images with manually corrected mattes that they acquired, but release only the test set (85 images). Therefore, for real training we used the AISegment.com Human Matting Dataset (released on Kaggle) for the foreground part, refining its mattes a little with open-source matting software (more on this below). The AISegment.com dataset contains ~30,000 600×800 images—note the huge difference in resolution with PhotoMatte13K.

    Note that this dataset does not contain the corresponding background images, so for background images we used our own high-quality HDRI panoramas. In general, our pipeline for producing the real training set was as follows:

    • cut out the object from an AISegment.com image according to the ground truth matte;
    • take a background image, apply several relighting/distortion augmentations, and paste the object onto the resulting background image.

This is a standard way to obtain training sets for this problem; the currently largest academic dataset for matting, Deep Image Matting by Xu et al. (2017), was built with the same kind of procedure.

    For the synthetic part, we used our Face API engine to generate a dataset of ~10K 1024×1024 images, using the same high-quality HDRI panoramas for the background. Naturally, the synthetic dataset has very accurate alpha channels, something that could hardly be achieved in manual labeling of real photographs. In the example below, note how hard it would be to label the hair for matting:

Before we proceed to the results, a couple of words about the quality metrics. We used slightly modified metrics from the original paper, described in more detail and motivated by Rhemann et al. (2009); a small code sketch of the simpler ones follows the list:

    • mse: mean squared error for the alpha channel and foreground computed along the object boundary;
    • mae: mean absolute error for the alpha channel and foreground computed along the object boundary;
    • grad: spatial-gradient metric that measures the difference between the gradients of the computed alpha matte and the ground truth computed along the object boundary;
    • conn: connectivity metric that measures average degrees of connectivity for individual pixels in the computed alpha matte and the ground truth computed along the object boundary;
    • IOU: standard intersection over union metric for the “person” class segmentation obtained from the alpha matte by thresholding.
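As a rough illustration, here is how the simpler of these metrics might be computed in a band around the object boundary; the band definition and width are our own choices, and grad and conn are omitted for brevity.

```python
import numpy as np
import cv2

def boundary_band(alpha_gt: np.ndarray, width: int = 10) -> np.ndarray:
    """Boolean band of pixels around the object boundary, where matting errors matter most."""
    mask = (alpha_gt > 0.5).astype(np.uint8)
    kernel = np.ones((width, width), np.uint8)
    return cv2.dilate(mask, kernel).astype(bool) & ~cv2.erode(mask, kernel).astype(bool)

def matting_metrics(alpha_pred: np.ndarray, alpha_gt: np.ndarray) -> dict:
    """alpha_pred, alpha_gt: (H, W) float mattes in [0, 1]."""
    band = boundary_band(alpha_gt)
    diff = alpha_pred[band] - alpha_gt[band]
    intersection = np.logical_and(alpha_pred > 0.5, alpha_gt > 0.5).sum()
    union = max(np.logical_or(alpha_pred > 0.5, alpha_gt > 0.5).sum(), 1)
    return {"mse": float(np.mean(diff ** 2)),
            "mae": float(np.mean(np.abs(diff))),
            "iou": float(intersection / union)}
```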

    We have trained four models:

    • Real model trained only on the real dataset;
    • Synthetic model trained only on the synthetic dataset;
    • Mixed model trained on a hybrid syn+real dataset;
• DA model, also trained on the hybrid syn+real dataset but with batchnorm-based domain adaptation as shown above, updating batchnorm statistics only on the real training set.

    The plots below show the quality metrics on the test set PhotoMatte85 (85 test images) where we used our HDRI panoramas as background images:

    And here are the same metrics on the PhotoMatte85 test set with 4K images downloaded from the Web as background images:

    It is hard to give specific examples where the difference would be striking, but as you can see, adding even a small high-quality synthetic dataset (our synthetic set was ~3x smaller than the real dataset) brings tangible improvements in the quality. Moreover, for some metrics related to visual quality (conn and grad in particular) the model trained only on synthetic data shows better performance than the model trained on real data. The Mixed and DA models are better yet, and show improvements across all metrics, again demonstrating the power of mixed syn+real datasets.

Above, we have mentioned the automatic refinement of the AISegment.com dataset with open-source matting software that we applied before training. To confirm that these refinements indeed make the dataset better, we have compared performance on the refined and the original AISegment.com datasets. The results clearly show that our refinement techniques bring important improvements:

    Overall, in this case study we have seen how synthetic data helps in cases when real labeled data is very hard to come by. The next study is also related to human faces but switches from variants of segmentation to a slightly different problem.

    Facial Landmark Detection: Fine-Tuning Beats Domain Adaptation

    For many facial analysis tasks, including face recognition, face frontalization, and face 3D modeling, one of the key steps is facial landmark detection, which aims to locate some predefined keypoints on facial components. In particular, in this case study we used 51 out of 68 IBUG facial landmarks. Note that there are several different standards of facial landmarks, as illustrated below (Sagonas et al., 2016):

While this is a classic computer vision task with a long history, it still presents many challenges in practice. In particular, many existing approaches struggle to cope with occlusions, extreme poses, difficult lighting conditions, and other problems. The occlusion problem is probably the most important obstacle to locating facial landmarks accurately.

    As the basic model for recognizing facial landmarks, we use the stacked hourglass networks introduced by Newell et al. (2016). The architecture consists of multiple hourglass modules, each representing a fully convolutional encoder-decoder architecture with skip connections:

Again, we do not go into full detail regarding the architecture and training process because we have not changed the basic model; our emphasis is on its performance across different training sets.

The test set in this study consists of real images and comes from the 300 Faces In-the-Wild (300W) Challenge (Sagonas et al., 2016). It comprises 300 indoor and 300 outdoor images of varying sizes that have ground truth manual labels. Here is a sample:

    The real training set is a combination of several real datasets, semi-automatically unified to conform to the IBUG format. In total, we use ~3000 real images of varying sizes in the real training set.

    For the synthetic training set, since real train and test sets mostly contain frontal or near-frontal good quality images, we generated a relatively restricted dataset with the Face API, without images in extreme conditions but with some added racial diversity, mild variety in camera angles, and accessories. The main features of our synthetic training set are:

    • 10,000 synthetic images with 1024×1024 resolution, with randomized facial attributes and uniformly represented ethnicities;
    • 10% of the images contain (clear) glasses;
    • 60% of the faces have a random expression (emotion) with intensity in [0.7, 1.0], and 20% of the faces have a random expression with intensity in [0.1, 0.3];
    • the maximum angle of face to camera is 45 degrees; camera position and face angles are selected accordingly.

    Here are some sample images from our synthetic training set:

    Next, we present the key results achieved with the synthetic training data in several different attempts at closing the domain gap. We compare five different setups:

    • trained on real data only;
    • trained on synthetic data only;
    • trained on a mixture of synthetic and real datasets;
    • trained on a mixture of synthetic and real datasets with domain adaptation based on batchnorm statistics (see the sketch after this list);
    • pretrained on the synthetic dataset and fine-tuned on real data.
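
    For readers who want to experiment with the batchnorm-based variant mentioned above, here is a minimal PyTorch sketch of one common way to implement it (AdaBN-style): keep all learned weights frozen and simply re-estimate the BatchNorm running statistics on data from the target domain. This illustrates the general idea under my own assumptions rather than reproducing our exact setup; the model and dataloader are placeholders.

        import torch

        def adapt_batchnorm_stats(model, target_loader, device="cuda"):
            """Re-estimate BatchNorm running mean/var on the target domain (AdaBN-style)."""
            for m in model.modules():
                if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
                    m.reset_running_stats()
                    m.momentum = None  # None = use a cumulative moving average

            model.train()              # BN layers update running stats only in train mode
            with torch.no_grad():      # no gradient steps: the learned weights stay frozen
                for images, _ in target_loader:
                    model(images.to(device))
            model.eval()
            return model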

    We measure two standard metrics on the 300W test set: normalized mean error (NME), the normalized average Euclidean distance between true and predicted landmarks (smaller is better), and percentage of correct keypoints (PCK), the fraction of detections that fall within a predefined range of normalized deviations (larger is better).
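
    To make these metrics concrete, here is a short NumPy sketch of how NME and PCK can be computed; the normalization factor (e.g., inter-ocular distance or bounding box diagonal) and the PCK threshold vary between papers, so treat those arguments as assumptions.

        import numpy as np

        def nme(pred, gt, norm):
            # pred, gt: (N, K, 2) predicted / ground-truth landmark coordinates
            # norm: (N,) per-image normalization factor, e.g., inter-ocular distance
            dists = np.linalg.norm(pred - gt, axis=-1)      # (N, K) Euclidean errors
            return (dists / norm[:, None]).mean()           # smaller is better

        def pck(pred, gt, norm, threshold=0.08):
            dists = np.linalg.norm(pred - gt, axis=-1) / norm[:, None]
            return (dists < threshold).mean()               # larger is better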

    The results clearly show that while it is quite hard to outperform the real-only benchmark (the real training set is large and labeled well, and the models are well-tuned to this kind of data), facial landmark detection can still benefit significantly from a proper introduction of synthetic data.

    Even more interestingly, we see that the improvement comes not from simply training on a hybrid dataset but from pretraining on a synthetic dataset and fine-tuning on real data.

    To further investigate this effect, we have tested the fine-tuning approach across a variety of real dataset subsets. The plots below show that as the size of the real dataset used for fine-tuning decreases, the results also deteriorate (this is natural and expected):

    This fine-tuning approach is a training schedule that we have not often seen in the literature, but here it proves to be a crucial component for success. Note that in the previous case study (background matting), we also tested this approach, but it did not yield noticeable improvements.
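
    Since the schedule itself is very simple, here is a hedged PyTorch-style sketch of it; the training loop, loaders, epoch counts, and learning rates are placeholders rather than our exact settings.

        import torch

        def pretrain_then_finetune(model, synthetic_loader, real_loader, train_epoch,
                                   syn_epochs=50, real_epochs=20):
            # Stage 1: pretrain on synthetic data only.
            opt = torch.optim.Adam(model.parameters(), lr=1e-3)
            for _ in range(syn_epochs):
                train_epoch(model, synthetic_loader, opt)

            # Stage 2: fine-tune on real data, typically with a smaller learning rate.
            opt = torch.optim.Adam(model.parameters(), lr=1e-4)
            for _ in range(real_epochs):
                train_epoch(model, real_loader, opt)
            return model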

    Conclusion

    In this whitepaper, we have considered simple ways to bring synthetic data into your computer vision projects. We have conducted three case studies for three different computer vision tasks related to human faces, using the power of Synthesis AI’s Face API to produce perfectly labeled and highly varied synthetic datasets. Let us draw some general conclusions from our results.

    First of all, as the title suggests, it just works! In all case studies, we have been able to achieve significant improvements or results on par with real data by using synthetically generated datasets, without complex domain adaptation models. Our results suggest that synthetic data is a simple but very efficient way to improve computer vision models, especially in tasks with complex labeling.

    Second, we have seen that the synthetic-to-real domain gap can be the same as the real-to-real domain gap. This is an interesting result because it suggests that while domain transfer still, obviously, remains a problem, it is not specific to synthetic data, which proves to be on par with real data if you train and test in different conditions. We have supported this with our face segmentation study.

    Third, even a small amount of synthetic data can help a lot. This is a somewhat counterintuitive conclusion: traditionally, synthetic datasets have been all about quantity and diversity over quality. However, we have found that in problems where labels are very hard to come by and are often imprecise, such as background matting in one of our case studies, even a relatively small synthetic dataset can go a long way towards getting the labels correct for the model.

    Fourth, fine-tuning on real data after pretraining on a synthetic dataset seems to work better than training on a hybrid dataset. We do not claim that this will always be the case, but a common theme in our case studies is that these approaches may indeed yield different results, and it might pay to investigate both (especially since they are very straightforward to implement and compare).

    We believe that synthetic data may become one of the main driving forces for computer vision in the near future, as real datasets reach saturation and/or become hopelessly expensive. In this whitepaper, we have seen that it does not have to be hard to incorporate synthetic data into your models. Try it, it might just work!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Top 5 Applications of Synthetic Data

    Top 5 Applications of Synthetic Data

    It’s been a while since we last met on this blog. Today, we are having a brief interlude in the long series of posts on how to make machine learning models better with synthetic data (that’s a long and still unfinished series: Part I, Part II, Part III, Part IV, Part V, Part VI). I will give a brief overview of five primary fields where synthetic data can shine. You will see that most of them are related to computer vision, which is natural for synthetic data based on 3D models. Still, it makes sense to clarify where exactly synthetic data is already working well and where we expect it to shine in the near future.

    The Plan

    In this post, we will review five major fields of synthetic data applications (links go to the corresponding sections of this rather long post):

    • human faces, where 3D models of human heads are developed in order to produce datasets for face recognition, verification, and similar applications;
    • indoor environments, where 3D models and/or simulated environments of home interiors can be used for navigation in home robotics, for AI assistants, for AR/VR applications and more;
    • outdoor environments, where synthetic data producers build entire virtual cities in order to train computer vision systems for self-driving cars, security systems, retail logistics and so on;
    • industrial simulated environments, used to train industrial robots for manufacturing, logistics on industrial plants, and other applications related to manufacturing;
    • synthetic documents and media that can be used to train text recognition systems, systems for recognizing and modifying multimodal media such as advertising, and the like.

    Each deserves a separate post, and some will definitely get it soon, but today we are here for a quick overview.

    Human Faces

    There exist numerous applications of computer vision related to human subjects and specifically human faces. For example:

    • face identification, i.e., automatic recognition of people by their faces; it is important not only for security cameras on top secret objects — the very same technology is working in your smartphone or laptop when it tries to recognize you and unlock the device without a cumbersome password;
    • gaze estimation means tracking where exactly on the screen you are looking; right now gaze estimation is important for focus group studies intended to design better interfaces and more efficient advertising, but I expect that in the near future it will be a key technology for a wide range of user-facing AR/VR applications;
    • emotion recognition has obvious applications in finding out user reactions to interactions with automated systems and less obvious but perhaps even more important applications such as recognizing whether a car driver is about to fall asleep behind the wheel;
    • all of these problems may include and rely upon facial keypoint recognition, that is, finding specific points of interest on a human face, and so on, and so forth.

    Synthetic models and images of people (both faces and full bodies) are an especially interesting subject for synthetic data. On the one hand, while large-scale real datasets of human photographs definitely exist, they are even harder to collect than usual. First, there are privacy issues involved in the collection of real human faces; I will postpone this discussion until a separate post on synthetic faces, so let me just link to a recent study by Raji and Fried (2021) who point out a lot of problems in this regard within existing large-scale datasets.

    Second, the labeling for some basic computer vision problems is especially complex: while pose estimation is doable, facial keypoint detection (a key element for facial recognition and image manipulation for faces) may require specifying several dozen landmarks on a human face, which is very hard to do consistently for human labelers.

    Third, even if a large dataset is available, it often contains biases in its composition of genders, races, or other parameters of human subjects, sometimes famously so; again, for now let me just link to a paper by Kortylewski et al. (2017) that we will discuss in detail in a later post.

    These are the primary advantages of using synthetic datasets for human faces (beyond the usual reasons such as limitless perfectly labeled data). On the other hand, there are complications as well, chief of them being that synthetic 3D models of people and especially synthetic faces are much harder to create than models of basic objects, especially if sufficient fidelity is required. This creates a tension between the quality of available synthetic faces and improvements in face recognition and other related tasks that they can provide.

    Here at Synthesis AI, we have already developed a system for producing large-scale synthetic datasets of human faces, including very detailed labeling of facial landmarks, possible occlusions such as glasses or medical masks, varying lighting conditions, and much more. Again, we will talk about this in much more detail later, so for now let me show a few examples of what our APIs are capable of and go on to the next application.

    Indoor Environments

    Human faces seldom require simulation environments because most applications, including all listed above, deal with static pictures and will not benefit much from a video. But in our next set of examples, it is often crucial to have an interactive environment. Static images will hardly be sufficient to train, e.g., an autonomous vehicle such as a self-driving car or a drone, or an industrial robot. Learning to control a vehicle or robot often requires reinforcement learning, where an agent has to learn from interacting with the environment, and real world experiments are usually entirely impractical for training. Fortunately, this is another field where synthetic data shines: once one has a fully developed 3D environment that can produce datasets for computer vision or other sensory readings, it is only one more step to begin active interaction or at least movement within this environment.

    We begin with indoor navigation, an important field where synthetic datasets are required. The main problems here are, as usual for such environments, SLAM (simultaneous localization and mapping, i.e., understanding where the agent is located inside the environment) and navigation. Potential applications here lie in the field of home robotics, industrial robots, and embodied AI, but for our purposes you can simply think of a robotic vacuum cleaner that has to navigate your house based on sensor readings. There exist large-scale efforts to create real annotated datasets of indoor scenes (Chang et al., 2017; Song et al., 2015; Xia et al., 2018), but synthetic data has always been extremely important.

    Historically, the main synthetic dataset for indoor navigation was SUNCG, presented by Song et al. (2017).

    It contained over 45,000 different scenes (floors of private houses) with manually created realistic room layouts, 3D models of the furniture, realistic textures, and so on. All scenes were semantically annotated at the object level, and the dataset provides synthetic depth maps and volumetric ground truth data for the scenes. The original paper presented state of the art results in semantic scene completion, but, naturally, SUNCG has been used for many different tasks related to depth estimation, indoor navigation, SLAM, and others; see Qi et al. (2017), Abbasi et al. (2018), and Chen et al. (2019), to name just a few. It often served as the basis for scene understanding competitions, e.g., the SUMO workshop on 360° Indoor Scene Understanding and Modeling at CVPR 2019.

    Why the past tense, though? Interestingly, even indoor datasets without any humans in them can be murky in terms of legality. Right now, the SUNCG paper website is up but the dataset website is down and SUNCG itself is unavailable due to a legal controversy over the data: the Planner5D company claims that Facebook used their software and data to produce SUNCG and made it publicly available without consent from Planner5D; see more details here.

    In any case, by now we have larger and more detailed synthetic indoor environments. A detailed survey will have to wait for a separate post, but I want to highlight AI Habitat, released by the very same Facebook; see also the paper by Savva, Kadian et al. (2019). It is a simulated environment explicitly intended for training embodied AI agents, and it includes a high-performance full 3D simulator with high-fidelity images. AI Habitat environments might look something like this:

    Simulated environments of this kind can be used to train home robots, security cameras, and AI assistants, develop AR/VR applications, and much more. It is an exciting field where Synthesis AI is also making advances.

    Outdoor Environments

    Now we come to one of the most important and historically best developed directions of applications for synthetic data: outdoor simulated environments intended to improve the motion of autonomous robots. Possible applications include SLAM, motion planning, and motion control for self-driving cars (urban navigation), unmanned aerial vehicles, and much more (Fossen et al., 2017; Milz et al., 2018; Paden et al., 2016); see also general surveys of computer vision for mobile robot navigation (Bonin-Font et al., 2008; Desouza, Kak, 2002) and perception and control for autonomous driving (Pendleton et al., 2017).

    There are two main directions here:

    • training vision systems for segmentation, SLAM, and other similar problems; in this case, synthetic data can be represented by a dataset of static images;
    • fully training autonomous driving agents, usually with reinforcement learning, in a virtual environment; in this case, you have to be able to render the environment in real time (preferably much faster since reinforcement learning usually needs millions of episodes), and the system also has to include a realistic simulation of the controls for the agent.
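
    To see why rendering speed matters so much for the second direction, recall the basic interaction loop of reinforcement learning: every call to the environment has to produce new observations, and training may need millions of such calls. Here is a gymnasium-style sketch of that loop; CartPole stands in for a driving simulator purely for illustration, and in a vision-based setup each step would also have to render a camera frame.

        import gymnasium as gym

        env = gym.make("CartPole-v1")           # stand-in for a driving simulator
        obs, info = env.reset(seed=0)

        for step in range(1_000):               # real training runs need millions of steps
            action = env.action_space.sample()  # a trained policy would choose the action here
            obs, reward, terminated, truncated, info = env.step(action)
            if terminated or truncated:
                obs, info = env.reset()
        env.close()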

    I cannot hope to give this topic justice in a single section, so there will definitely be separate posts on this, and for now let me just scatter a few examples of modern synthetic datasets for autonomous driving.

    First, let me remind you that one of the very first applications of synthetic data to training neural networks, the ALVINN network that we already discussed on this blog, was actually an autonomous driving system. Trained on synthetic 30×32 videos supplemented with 8×32 range finder data, ALVINN was one of the first successful applications of neural networks in autonomous driving.

    By now, resolutions have grown. Around 2015-2016, researchers realized that modern interactive 3D projects (that is, games) had progressed so much that one can use their results as high-fidelity synthetic data. Therefore, several important autonomous driving datasets produced at that time used modern 3D engines such as Unreal Engine or even specific games such as Grand Theft Auto V to generate their data. Here is a sample from the GTAV dataset by Richter et al. (2016):

    And here is a sample from the VEIS dataset by Saleh et al. (2018) who used Unity 3D:

    For the purposes of synthetic data, researchers often had to modify CGI and graphics engines to suit machine learning requirements, especially to implement various kinds of labeling. For example, the work based on Grand Theft Auto V was recently continued by Hurl et al. (2019) who developed a precise LIDAR simulator within the GTA V engine and published the PreSIL (Precise Synthetic Image and LIDAR) dataset with over 50000 frames with depth information, point clouds, semantic segmentation, and detailed annotations. Here is a sample of just a few of its modalities:

    Another interesting direction was pursued by Li et al. (2019) who developed the Augmented Autonomous Driving Simulation (AADS) environment. This synthetic data generator is able to insert synthetic traffic on real-life RGB images in a realistic way. With this approach, a single real image can be reused many times in different synthetic traffic situations. Here is a sample frame from the AADS introductory video that emphasizes that the cars are synthetic — otherwise you might well miss it!

    Fully synthetic datasets are also rapidly approaching photorealism. In particular, the Synscapes dataset by Wrenninge and Unger (2018) is quite easy to take for real at first glance; it simulates motion blur and many other properties of real photographs:

    Naturally, such photorealistic datasets cannot be rendered in full real time yet, so they usually represent collections of static images. Let us hope that cryptocurrency mining will leave at least a few modern GPUs to advance this research, and let us proceed to our next item.

    Industrial Simulated Environments

    Imagine a robotic arm manipulating items on an assembly line. Controlling this arm certainly looks like a machine learning problem… but where will the dataset come from? There is no way to label millions of training data instances, especially once you realize that the function that we are actually learning is mapping activations of the robot’s drives and controls into reactions of the environment.

    Consider, for instance, the Dactyl robot that OpenAI researchers recently taught to solve a Rubik’s cube in real life (OpenAI, 2018). It was trained by reinforcement learning, which means that the hand had to make millions, if not billions, of attempts at interacting with the environment. Naturally, this could be made possible only by a synthetic simulation environment, in this case ORRB (OpenAI Remote Rendering Backend) also developed by OpenAI (Chociej et al., 2019). For a robotic hand like Dactyl, ORRB can very efficiently render views of randomized environments that look something like this:

    And that’s not all: to be able to provide interactions that can serve as fodder for reinforcement learning, the environment also has to implement a reasonably realistic physics simulator. Robotic simulators are a venerable and well-established field; the two most famous and most popular engines are Gazebo, originally presented by Koenig and Howard (2004) and currently being developed by Open Source Robotics Foundation (OSRF), and MuJoCo (Multi-Joint Dynamics with Contact) developed by Todorov et al. (2012); for example, ORRB mentioned above can interface with MuJoCo to provide an all-around simulation.
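
    To give a feeling for what such a physics engine looks like from the user's side, here is a minimal example with the open-source MuJoCo Python bindings (a generic illustration of stepping a simulation, not the ORRB or Dactyl setup): a box is dropped onto a plane and the simulation is advanced step by step.

        import mujoco

        XML = """
        <mujoco>
          <worldbody>
            <geom type="plane" size="1 1 0.1"/>
            <body name="box" pos="0 0 1">
              <freejoint/>
              <geom type="box" size="0.1 0.1 0.1"/>
            </body>
          </worldbody>
        </mujoco>
        """

        model = mujoco.MjModel.from_xml_string(XML)
        data = mujoco.MjData(model)

        for _ in range(1000):
            mujoco.mj_step(model, data)   # advance the physics by one timestep

        print(data.qpos)                  # final pose of the falling box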

    This does not mean that new engines cannot arise for robotics. For example, Xie et al. (2019) presented VRGym, a virtual reality testbed for physical and interactive AI agents. Their main difference from previous work is the support of human input via VR hardware integration. The rendering and physics engine are based on Unreal Engine 4, and additional multi-sensor hardware is capable of full body sensing and integration of human subjects into virtual environments. Moreover, a special bridge allows easy communication with robotic hardware, providing support for ROS (Robot Operating System), a standard set of libraries for robotic control. Visually it looks something like this:

    Here at Synthesis AI, we are already working in this direction. In a recent collaboration with Google Robotics called ClearGrasp (we covered it on the blog a year ago), we developed a large-scale dataset of transparent objects, which present special challenges for computer vision: they are notoriously hard for object detection, depth estimation, and basically any computer vision task you can think of. With the help of this dataset, in the ClearGrasp project we developed machine learning models capable of estimating accurate 3D data of transparent objects from RGB-D images. The dataset looks something like this, and for more details I refer to our earlier post and to Google’s web page on the ClearGrasp project:

    There is already no doubt that end-to-end training of industrial robots is virtually impossible without synthetic environments. Still, I believe there is a lot to be done here, both specifically for robotics and generally for the best adaptation and usage of synthetic environments.

    Synthetic Documents and Media

    Finally, we come to the last item in the post: synthetic documents and media. Let us begin with optical character recognition (OCR), which mostly means reading the text written on a photo, although there are several different tasks related to text recognition: OCR itself, text detection, layout analysis and text line segmentation for document digitization, and others.

    The basic idea for synthetic data in OCR is simple: it is very easy to produce synthetic text, so why don’t we superimpose synthetic text on real images, or simply on varied randomized backgrounds (recall our discussion of domain randomization), and train on that? Virtually all modern OCR systems have been trained on data produced by some variation of this idea.
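
    In its most naive form, this idea fits in a dozen lines; here is a hedged Pillow-based sketch that draws random text on a random background and gets the bounding box label for free (a real pipeline would add realistic fonts, geometric distortions, blending, and noise, or paste the text onto natural images).

        import random
        import string
        from PIL import Image, ImageDraw, ImageFont

        def make_ocr_sample(width=320, height=240):
            # Random solid-color background; a real generator would use natural images.
            bg = tuple(random.randint(0, 255) for _ in range(3))
            img = Image.new("RGB", (width, height), bg)
            draw = ImageDraw.Draw(img)

            text = "".join(random.choices(string.ascii_uppercase + string.digits, k=8))
            font = ImageFont.load_default()
            x, y = random.randint(0, width // 2), random.randint(0, height // 2)
            draw.text((x, y), text, fill=(255 - bg[0], 255 - bg[1], 255 - bg[2]), font=font)

            bbox = draw.textbbox((x, y), text, font=font)  # ground-truth box comes for free
            return img, text, bbox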

    I will highlight the works where text is being superimposed in a “smarter”, more realistic way. For instance, Gupta et al. (2016) in their SynthText in the Wild dataset use depth estimation and segmentation models to find regions (planes) of a natural image suitable for placing synthetic text, and even find the correct rotation of text for a given plane. This process is illustrated in the top row of the sample image, and the bottom row shows sample text inserted onto suitable regions:

    Another interesting direction where such systems might go (but have not gone yet) is the generation of synthetic media that includes text but is not limited to it. Currently, advertising and rich media in general are at the frontier of multimodal approaches that blend together computer vision and natural language processing. Think about the rich symbolism that we see in many modern advertisements; sometimes understanding an ad borders on a lateral thinking puzzle. For instance, what does this image advertise?

    The answer is explicitly stated in the slogan that I’ve cut off: “Removes fast food stains fast”. You might have got it instantaneously, or you might have had to think for a little bit, but how well do you think automated computer vision models can pick up on this? This is a sample image from the dataset collected by researchers from the University of Pittsburgh. Understanding this kind of symbolism is very hard, and I can’t imagine generating synthetic data for this problem at this point.

    But perhaps easier questions would include at least detecting advertisements in our surroundings, detecting logos and brand names in the advertising, reading the slogans, and so on. Here, a synthetic dataset is not hard to imagine; at the very least, we could cut off ads from real photos and paste other ads in their place. But even solving this restricted problem would be a huge step forward for AR systems! If you are an advertiser, imagine that you can insert ads in augmented reality on the fly; and if you are just a regular human being, imagine that you can block off real-life ads or replace them with pleasing landscape photos as you go down the street. This kind of future might be just around the corner, and it might be helped immensely by synthetic data generation.

    Conclusion

    In this (rather long) post, I’ve tried to give a brief overview of the main directions where we envision the present and future of synthetic data for machine learning. Here at Synthesis AI, we are working to bring this vision of the future to reality. In the next post of this series, we will go into more detail about one of these use cases and showcase some of our work.

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Real-to-Synthetic Data: Driving Model Performance with Synthetic Data VI

    Real-to-Synthetic Data: Driving Model Performance with Synthetic Data VI

    Today we continue the series on using synthetic data to improve machine learning models. This is the sixth part of the series (Part I, Part II, Part III, Part IV, Part V). In this (relatively) short interlude I will discuss an interesting variation of GAN-based refinement: making synthetic data from real. Why would we ever want to do that if the final goal is always to make the model work on real data rather than synthetic? In this post, we will see two examples from different domains that show both why and how.

    It Turns Your Head: Synthetic Faces from Real Photos

    In this post, we discuss several works that generate synthetic data from real data by learning to transform real data with conditional GANs. The first application of this idea is to start from real data and produce other realistic images that have been artificially changed in some respects. This approach could either simply serve as a “smart augmentation” to extend the dataset (recall Part II of this series) or, more interestingly, could “fill in the holes” in the data distribution, obtaining synthetic data for situations that are lacking in the original dataset.

    As the first example, let us consider Zhao et al. (2018a, 2018b), who concentrated on applying this idea to face recognition in the wild, i.e., with faces in arbitrary poses rather than only frontal images. They continued the work of Tran et al. (2017) (we do not review it here in detail) and Huang et al. (2017), who presented the TP-GAN (two-pathway GAN) architecture for frontal view synthesis: given a picture of a face, generate a frontal view picture.

    TP-GAN’s generator G has two pathways: a global network that rotates the entire face and four local patch networks that process local textures around four facial landmarks (eyes, nose, and mouth). Both pathways have encoder-decoder architectures with skip connections for multi-scale feature fusion:

    The discriminator D in TP-GAN, naturally, learns to distinguish real frontal face images from synthesized images. The synthesis loss function in TP-GAN is a sum of four loss functions:

    • pixel-wise L1-loss between the ground truth frontal image and rotated image;
    • symmetry loss intended to preserve the symmetry of human faces;
    • adversarial loss that goes through the discriminator, as usual;
    • and identity preserving loss based on perceptual losses, a popular idea in GANs designed to preserve high-level features when doing low-level transformations; in this case, it serves to preserve the person’s identity when rotating the face and brings together the features from several layers of a fixed face recognition network applied to the original and rotated images.

    I won’t bore you with formulas for TP-GAN, but the results were pretty good. Here they are compared to competitors existing in 2017 (the leftmost column shows the profile face to be rotated, second from left are the results of TP-GAN, the rightmost column are actual frontal images, and the rest are various other approaches):

    Zhao et al. (2018a) propose the DA-GAN (Dual-Agent GAN) model that also works with faces but in the opposite scenario: while TP-GAN rotates every face into the frontal view, DA-GAN rotates frontal faces to arbitrary angles. The motivation for this brings us to synthetic data: Zhao et al. noticed that existing datasets of human faces contain mostly frontal images, with a few standard angles also appearing in the distribution but not much in between. This could skew the results of model training and make face recognition models perform worse.

    Therefore, DA-GAN aims to create synthetic faces to fill in the “holes” in the real data distribution, rotating real faces so that the distribution of angles becomes more uniform. The idea is to go from the data distribution shown on the left (actual distribution from the IJB-A dataset) to something like shown on the right:

    They begin with a 3D morphable model, extracting 68 facial landmarks with the Recurrent Attentive-Refinement (RAR) model (Xiao et al., 2016) and estimating the transformation matrix with 3D-MM (Blanz et al., 2007). However, Zhao et al. report that simulation quality dramatically decreases for large yaw angles, which means that further improvement is needed. This is exactly where the DA-GAN framework comes in.

    Again, DA-GAN’s generator maps a synthesized image to a refined one. It is trained on a linear combination of three loss functions:

    • adversarial loss that follows the BEGAN architecture (I won’t go into details here and recommend this post, this post, and the original paper for explanations of BEGAN and other standard adversarial loss functions);
    • the identity preservation loss similar to TP-GAN above; the idea is, again, to put both real and synthetic images through the same (relatively simple) face recognition network and bring features extracted from both images together;
    • pixel-wise loss intended to make sure that the pose (angle of inclination for the head) remains the same after refinement.

    Here is the general idea of DA-GAN:

    Apart from experiments done by the authors, DA-GAN was verified in a large-scale NIST IJB-A competition where a model based on DA-GAN won the face verification and face identification tracks. This example shows the general premise of using synthetic data for this kind of smart augmentations: augmenting the dataset and balancing out the training data distribution with synthetic images proved highly beneficial in this case.

    VR Goggles for Robots: Real-to-Sim Transfer in Robotics

    In the previous section, we have used real-to-synthetic transfer basically as a smart augmentation, enriching the real dataset that might be skewed with new synthetic data produced from real data points.

    The second example deals with a completely different idea of using the same transfer direction. Why do we need domain adaptation at all? Because we want models that have been trained on synthetic data to transfer to real inputs. The idea is to reverse this logic: let us transfer real data into the synthetic domain, where the model is already working great!

    In the context of robotics, this kind of real-to-sim approach was implemented by Zhang et al. (2018) in a very interesting approach called “VR-Goggles for Robots”. It is based on the CycleGAN ideas, a general approach to style transfer (and consequently domain adaptation) that we introduced in Part IV of this series.

    More precisely, Zhang et al. use the general framework of CyCADA (Hoffman et al., 2017), a popular domain adaptation model. CyCADA adds semantic consistency, feature-based adversarial, and task losses to the basic CycleGAN. The general idea looks as follows:

    Note that CyCADA has two different GAN losses (adversarial losses), pixel-wise and feature-based (semantic), and a task loss that reflects the downstream task that we are solving (semantic segmentation in this case). In the original work, CyCADA was applied to standard synthetic-to-real domain adaptation, although the model really does not distinguish much between the two opposing transfer directions because it’s a CycleGAN. Here are some CyCADA examples on standard synthetic and real outdoor datasets:

    Similar to CyCADA, the VR-Goggles model has two generators, real-to-synthetic and synthetic-to-real, and two discriminators, one for the synthetic domain and one for the real domain. The overall loss function consists of:

    • a standard adversarial GAN loss function for each discriminator;
    • the semantic loss as introduced in CyCADA; the idea is that if we have ground truth labels for the synthetic domain (in this case, we are doing semantic segmentation), we can train a network on synthetic data and then use it to generate pseudolabels for the real domain where ground truth is not available; the semantic loss now makes sure that the results (segmentation maps) remain the same after image translation (sketched in code right after this list);
    • the shift loss that makes the image translation result invariant to shifts.
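
    Here is the promised sketch of the semantic loss; it is a simplified PyTorch rendition of the CyCADA-style idea rather than the exact VR-Goggles implementation: a segmentation network trained on synthetic data produces pseudolabels for a real image, and the translated version of that image is required to produce the same segmentation.

        import torch
        import torch.nn.functional as F

        def semantic_consistency_loss(seg_net, real_img, translated_img):
            # seg_net: segmentation network pretrained on the synthetic domain (kept frozen here)
            with torch.no_grad():
                pseudo = seg_net(real_img).argmax(dim=1)   # pseudolabels for the real image
            logits = seg_net(translated_img)               # predictions after real-to-sim translation
            return F.cross_entropy(logits, pseudo)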

    Here is the general picture, which is actually very close to the CyCADA pipeline shown above:

    Zhang et al. report significantly improved results in robotic navigation tasks. For instance, in the image below a navigation policy was trained (in a synthetic simulation environment) to move towards chairs:

    The “No-Goggles” row shows results in the real world without domain adaptation, where the policy fails miserably. The “CycleGAN” row shows improved results after doing real-to-sim transfer with a basic CycleGAN model. Finally, the “VR-Goggles” row shows successful navigation with the proposed real-to-sim transfer model.

    Why might this inverse real-to-sim direction be a good fit for robotics specifically, and why haven’t we seen this approach in other domains in the previous posts? The reason is that in the real-to-sim approach, we need to use domain adaptation models during inference, as part of using the trained model. In most regular computer vision applications, this would be a great hindrance. In computer vision, if some kind of preprocessing is only part of the training process, it is usually assumed to be free (you can find some pretty complicated examples in Part II of this series), but inference time is very precious.

    Robotics is a very different setting: robots and controllers are often trained in simulation environments with reinforcement learning, which implies a lot of computational resources needed for training. The simulation environment needs to be responsive and cheap to support, and if every frame of the training needs to be translated via a GAN-based model it may add up to a huge cost that would make RL-based training infeasible. Adding an extra model during inference, on the other hand, may be okay: yes, we reduce the number of processed frames per second, but if it stays high enough for the robot to react in real time, that’s fine.

    Conclusion

    In this post, we have discussed the inverse direction of data refinement: instead of making synthetic images more realistic, we have seen approaches that make real images look like synthetic ones. We have seen two situations where this is useful: first, for “extremely smart” augmentations that fill in the holes in the real data distribution, and second, for robotics where training is very computationally intensive, and it may be actually easier to modify the input on inference.

    With this, I conclude the part on refinement, i.e., domain adaptation techniques that operate on data and modify the data from one domain to another. Next time, we will begin discussing model-based domain adaptation, that is, approaches that change the model itself and leave the data in place. Stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Synthetic-to-Real Refinement: Driving Model Performance with Synthetic Data V

    Synthetic-to-Real Refinement: Driving Model Performance with Synthetic Data V

    We continue the series on synthetic data as it is used in machine learning today. This is the fifth part of an already pretty long series (part 1, part 2, part 3, part 4), and it’s far from over, but I try to keep each post more or less self-contained. Today, however, we pick up from last time, so if you have not read Part 4 yet, I suggest going through it first. In that post, we discussed synthetic-to-real refinement for gaze estimation, which suddenly taught us a lot about modern GAN-based architectures. But eye gaze still remains a relatively small and not very variable problem, so let’s see how well synthetic data does in other computer vision applications. Again, expect a lot of GANs and at least a few formulas for the loss functions.

    PixelDA: An Early Work in Refinement

    First of all, I have to constrain this post too. There are whole domains of applications where synthetic data is very often used for computer vision, such as, e.g., outdoor scene segmentation for autonomous driving. But this would require a separate discussion, one that I hope to get to in the future. Today I will show a few examples where refinement techniques work on standard “unconstrained” computer vision problems such as object detection and segmentation for common objects. Although, as we will see, most of these problems in fact turn out to be quite constrained.

    We begin with an early work in refinement, parallel to (Shrivastava et al., 2017), which was done by Google researchers Bousmalis et al. (2017). They train a GAN-based architecture for pixel-level domain adaptation, which they call PixelDA. In essence, PixelDA is a basic style transfer GAN, i.e., they train the model by alternating optimization steps

        \[\min_{\theta_G,\theta_T}\;\max_{\phi}\;\lambda_{1}\,\mathcal{L}_{\text{dom}}^{\text{PIX}}(D^{\text{PIX}},G^{\text{PIX}})+\lambda_{2}\,\mathcal{L}_{\text{task}}^{\text{PIX}}(G^{\text{PIX}},T^{\text{PIX}})+\lambda_{3}\,\mathcal{L}_{\text{cont}}^{\text{PIX}}(G^{\text{PIX}}),\]

    where:

    • the first term is the domain loss,

          \begin{align*}\mathcal{L}_{\text{dom}}^{\text{PIX}}(D^{\text{PIX}},G^{\text{PIX}}) =&\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}\!\bigl[\log\!\bigl(1-D^{\text{PIX}}\!\bigl(G^{\text{PIX}}(\mathbf{x}_{S};\theta_{G});\phi\bigr)\bigr)\bigr]\\ +&\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}\!\bigl[\log D^{\text{PIX}}(\mathbf{x}_{T};\phi)\bigr];\end{align*}

    • the second term is the task-specific loss, which in (Bousmalis et al., 2017) was the image classification cross-entropy loss provided by a classifier T that was also trained as part of the model:

          \begin{align*}\mathcal{L}_{\text{task}}^{\text{PIX}}&(G^{\text{PIX}},T^{\text{PIX}}) = \\ &\mathbb{E}_{\mathbf{x}_{S},\mathbf{y}_{S}\sim p_{\text{syn}}} \!\bigl[-\mathbf{y}_{S}^{\top}\!\log T^{\text{PIX}}\!\bigl(G^{\text{PIX}}(\mathbf{x}_{S};\theta_{G});\theta_{T}\bigr) -\mathbf{y}_{S}^{\top}\!\log T^{\text{PIX}}(\mathbf{x}_{S};\theta_{T})\bigr],\end{align*}

    • and the third term is the content similarity loss, intended to make the generator preserve the foreground objects (that would later need to be classified) with a mean squared error applied to their masks:

          \begin{align*}\mathcal{L}_{\text{cont}}^{\text{PIX}}(G^{\text{PIX}}) =\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}\!\Bigl[& \frac{1}{k}\,\lVert(\mathbf{x}_{S}-G^{\text{PIX}}(\mathbf{x}_{S};\theta_{G}))\odot m(\mathbf{x})\rVert_{2}^{2}\\ &-\frac{1}{k^{2}}\bigl((\mathbf{x}_{S}-G^{\text{PIX}}(\mathbf{x}_{S};\theta_{G}))^{\top}m(\mathbf{x})\bigr)^{2} \Bigr],\end{align*}

      where m(\mathbf{x}) is a segmentation mask for the foreground object extracted from the synthetic data renderer; note that this loss does not “insist” on preserving pixel values in the object but rather encourages the model to change object pixels in a consistent way, preserving their pairwise differences (see the sketch right below).
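
    Since this masked loss is a little unusual, here is a minimal PyTorch sketch of the content similarity term; the batching details and the exact definition of the normalization constant k are my assumptions, not the authors' code.

        import torch

        def content_similarity_loss(x_syn, x_refined, mask):
            # x_syn, x_refined: (B, C, H, W) source image and generator output
            # mask: (B, 1, H, W) binary foreground mask provided by the renderer
            diff = (x_syn - x_refined) * mask
            k = mask.expand_as(diff).sum(dim=(1, 2, 3)).clamp(min=1.0)   # foreground pixel count
            sq_term = diff.pow(2).sum(dim=(1, 2, 3)) / k                 # (1/k) * ||diff * m||^2
            mean_term = diff.sum(dim=(1, 2, 3)).pow(2) / k.pow(2)        # (1/k^2) * (diff^T m)^2
            return (sq_term - mean_term).mean()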

    Bousmalis et al. applied this GAN to the Synthetic Cropped LineMod dataset, a synthetic version of a small object classification dataset, doing both classification and pose estimation for the objects. The images in this dataset are quite cluttered and complex, but small in terms of pixel size:

    The generator accepts as input a synthetic image of a 3D model of a corresponding object in a random pose together with the corresponding depth map and tries to output a realistic image in a cluttered environment while leaving the object itself in place. Note that the segmentation mask for the central object is also given by the synthetic model. The discriminator also looks at the depth map when it distinguishes between real and fake images:

    Here are some sample results of PixelDA for the same object classes as above but in different poses and with different depth maps:

    Hardly the images that you would sell to a stock photo site, but that’s not the point. The point is to improve the classification and object pose estimation quality after training on refined synthetic images. And indeed, Bousmalis et al. reported improved results in both metrics compared to both training on purely synthetic data (for many tasks, this version fails entirely) and a number of previous approaches to domain adaptation.

    But these are still rather small images. Can we make synthetic-to-real refinement work on a larger scale? Let’s find out.

    CycleGAN for Synthetic-to-Real Refinement: GeneSIS-RT

    In the previous post, we discussed the general CycleGAN idea and structure: if you want to do something like style transfer, but don’t have a paired dataset where the same content is depicted in two different styles, you can close the loop by training two generators at once. This is a very natural setting for synthetic-to-real domain adaptation, so many modern approaches to synthetic data refinement include the ideas of CycleGAN.

    Probably the most direct application is the GeneSIS-RT framework by Stein and Roy (2017) that refines synthetic data directly with the CycleGAN trained on unpaired datasets of synthetic and real images. Their basic pipeline, summarized in the picture below, sums up the straightforward approach to synthetic-to-real refinement perfectly:

    Their results are pretty typical for the basic CycleGAN: some of the straight lines can become wiggly, and the textures have artifacts, but generally the images definitely become closer to reality:

    But, again, picture quality is not the main point here. Stein and Roy show that a training set produced by image-to-image translation learned by CycleGAN improves the results of training machine learning systems for real-world tasks such as obstacle avoidance and semantic segmentation.

    Here are some sample segmentation results that compare a DeepLab-v2 segmentation network trained on synthetic data and on the same synthetic data refined by GeneSIS-RT; the improvement is quite clear:

    CycleGAN Evolved: T2Net

    As an example of a more involved application, let’s consider T2Net by Zheng et al. (2018) who apply synthetic-to-real refinement to improve depth estimation from a single image. By the way, if you google their paper in 2021, as I just did, do not confuse it with T2-Net by Zhang et al. (2020) (yes, a one symbol difference in both first author and model names!), a completely different deep learning model for turbulence forecasting…

    T2Net also uses the general ideas of CycleGAN with a translation network (generator) that makes images more realistic. The new idea here is that T2Net asks the synthetic-to-real generator G not only to translate one specific domain (synthetic data) to another (real data) but also to work across a number of different input domains, making the input image “more realistic” in every case. Here is the general architecture:

    In essence, this means that G aims to learn the minimal transformation necessary to make an image realistic. In particular, it should not change real images much. In total, T2Net has the following loss function for the generator (hope you are getting used to these):

        \[\mathcal{L}^{T2}=\mathcal{L}_{\text{GAN}}^{T2}(G_{S}^{T2},D_{T}^{T2})+\lambda_{1}\mathcal{L}_{\text{GAN}_f}^{T2}(f_{\text{task}}^{T2},D_{f}^{T2})+\lambda_{2}\mathcal{L}_{r}^{T2}(G_{S}^{T2})+\lambda_{3}\mathcal{L}_{t}^{T2}(f_{\text{task}}^{T2})+\lambda_{4}\mathcal{L}_{s}^{T2}(f_{\text{task}}^{T2}),\]

    where

    • the first term is the usual GAN loss for synthetic-to-real transfer with a discriminator DT:

          \begin{align*}\mathcal{L}_{\text{GAN}}^{T2}(G_{S}^{T2},D_{T}^{T2}) =&\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}   \bigl[\log(1-D_{T}^{T2}(G_{S}^{T2}(\mathbf{x}_{S})))\bigr]\\ +&\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}   \bigl[\log D_{T}^{T2}(\mathbf{x}_{T})\bigr];\end{align*}

    • the second term is the feature-level GAN loss for features extracted from translated and real images, with a different discriminator Df:

          \begin{align*}\mathcal{L}_{\text{GAN}_f}^{T2}(f_{\text{task}}^{T2},D_{f}^{T2}) &=\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}   \bigl[\log D_{f}^{T2}(f_{\text{task}}^{T2}(G_{S}^{T2}(\mathbf{x}_{S})))\bigr]\\ &+\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}   \bigl[\log(1-D_{f}^{T2}(f_{\text{task}}^{T2}(\mathbf{x}_{T})))\bigr]; \end{align*}

    • the third term L_r is the reconstruction loss for real images, simply an L_1 norm that says that T2Net is not supposed to change real images at all;
    • the fourth term L_t is the task loss for depth estimation on synthetic images, namely the L_1-norm of the difference between the predicted depth map for a translated synthetic image and the original ground truth synthetic depth map; this loss ensures that translation does not change the depth map;
    • finally, the fifth term L_s is the task loss for depth estimation on real images:

          \[\mathcal{L}_{s}^{T2}(f_{\text{task}}^{T2}) = \bigl|\partial_{x} f_{\text{task}}^{T2}(\mathbf{x}_{T})\bigr|\,e^{-|\partial_{x}\mathbf{x}_{T}|}   + \bigl|\partial_{y} f_{\text{task}}^{T2}(\mathbf{x}_{T})\bigr|\,e^{-|\partial_{y}\mathbf{x}_{T}|},\]

      that is, depth gradients weighted by negative exponents of the image gradients; since ground truth depth maps are not available here, this regularizer is an edge-aware smoothness loss that encourages locally smooth depth predictions while respecting object boundaries, a common tool in depth estimation models that we won’t go into in detail (a minimal code sketch follows this list).
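
    Here is the sketch mentioned above: the standard edge-aware smoothness term in PyTorch, written as it commonly appears in monocular depth estimation papers; treat it as an illustration rather than the exact T2Net code.

        import torch

        def smoothness_loss(depth, image):
            # depth: (B, 1, H, W) predicted depth map; image: (B, 3, H, W) input RGB image
            dd_x = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
            dd_y = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
            di_x = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
            di_y = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
            # depth gradients are penalized less where the image itself has strong edges
            return (dd_x * torch.exp(-di_x)).mean() + (dd_y * torch.exp(-di_y)).mean()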

    Zheng et al. show that T2Net can produce realistic images from synthetic ones, even for quite varied domains such as house interiors from the SUNCG synthetic dataset:

    But again, the most important conclusions deal with the depth estimation task. Zheng et al. conclude that end-to-end training of the translation network and depth estimation network is preferable over separated training. They show that T2Net can achieve good results for depth estimation with no access to real paired data, even outperforming some (but not all) supervised approaches.

    Synthetic Data Refinement for Vending Machines

    This is already getting to be quite a long read, so let me wrap up with just one more example that will bring us to 2019. Wang et al. (2019) consider synthetic data generation and domain adaptation for object detection in smart vending machines. Actually, we have already discussed their problem setting and synthetic data in a previous post, titled “What’s in the Fridge?”. So please see that post for a refresher on their synthetic data generation pipeline; today we will concentrate specifically on their domain adaptation approach.

    Wang et al. refine rendered images with virtual-to-real style transfer done by a CycleGAN-based architecture. The novelty here is that Wang et al. separate foreground and background losses, arguing that style transfer needed for foreground objects is very different from (much stronger than) the style transfer for backgrounds. So their overall architecture is even more involved than in previous examples; here is what it looks like:

    The overall generator loss function is also a bit different:

        \begin{multline*}\mathcal{L}^{OD}=\mathcal{L}_{\text{GAN}}^{OD}(G^{OD},D_{T}^{OD},\mathbf{x}_{S},\mathbf{x}_{T})+\mathcal{L}_{\text{GAN}}^{OD}(F^{OD},D_{S}^{OD},\mathbf{x}_{T},\mathbf{x}_{S})+\\ +\lambda_{1}\mathcal{L}_{\text{cyc}}^{OD}(G^{OD},F^{OD})+\lambda_{2}\mathcal{L}_{\text{bg}}^{OD}+\lambda_{3}\mathcal{L}_{\text{fg}}^{OD},\end{multline*}

    where:

    • L_GAN(G, D, X, Y) is the standard adversarial loss for a generator G mapping from domain X to domain Y and a discriminator D distinguishing real images from fake ones in domain Y;
    • L_cyc(G, F) is the cycle consistency loss as used in CycleGAN and as we have already discussed several times;
    • L_bg is the background loss, which is the cycle consistency loss computed only for the background part of the images as defined by the mask m_bg:

          \begin{align*}\mathcal{L}_{\text{bg}}^{OD} =&\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}   \Bigl\|     \bigl(G^{OD}\!\bigl(F^{OD}(\mathbf{x}_{T})\bigr)-\mathbf{x}_{T}\bigr)     \odot m_{\text{bg}}(\mathbf{x}_{T})   \Bigr\|_{2}\\ +&\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}   \Bigl\|     \bigl(F^{OD}\!\bigl(G^{OD}(\mathbf{x}_{S})\bigr)-\mathbf{x}_{S}\bigr)     \odot m_{\text{bg}}(\mathbf{x}_{S})   \Bigr\|_{2};\end{align*}

    • L_fg is the foreground loss, similar to L_bg but computed only for the hue channel in the HSV color space (the authors argue that color and profile are the most critical for recognition and thus need to be preserved the most):

          \begin{align*}\mathcal{L}_{\text{fg}}^{OD} =&\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}   \Bigl\|     \bigl(G^{OD}\!\bigl(F^{OD}(\mathbf{x}_{T})\bigr)^{H}-\mathbf{x}_{T}^{H}\bigr)     \odot m_{\text{fg}}(\mathbf{x}_{T})   \Bigr\|_{2}\\ +&\mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}   \Bigl\|     \bigl(F^{OD}\!\bigl(G^{OD}(\mathbf{x}_{S})\bigr)^{H}-\mathbf{x}_{S}^{H}\bigr)     \odot m_{\text{fg}}(\mathbf{x}_{S})   \Bigr\|_{2}. \end{align*}

    Segmentation into foreground and background is done automatically in synthetic data and is made easy in this case for real data since the camera position is fixed, and the authors can collect a dataset of real background templates from the vending machines they used in the experiments and then simply subtract the backgrounds to get the foreground part.
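
    With a fixed camera and a stored background template, mask extraction can be as simple as the following NumPy sketch; the threshold value and the lack of any morphological cleanup are simplifications on my part.

        import numpy as np

        def foreground_mask(frame, background, threshold=25):
            # frame, background: (H, W, 3) uint8 images from the same fixed camera
            diff = np.abs(frame.astype(np.int16) - background.astype(np.int16)).sum(axis=2)
            return (diff > threshold).astype(np.uint8)   # 1 = foreground object, 0 = background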

    Here are some sample results of their domain adaptation architecture, with original synthetic images on the left and refined results on the right:

    As a result, Wang et al. report significantly improved results when using hybrid datasets of real and synthetic data for all three tested object detection architectures: PVANET, SSD, and YOLOv3. Even more importantly, they report a comparison between basic and refined synthetic data with clear gains achieved by refinement across all architectures.

    Conclusion

    By this point, you are probably already pretty tired of CycleGAN variations. Naturally, there are plenty more examples of this kind of synthetic-to-real style transfer in the literature; I just picked a few to illustrate the general ideas and show how they can be applied to specific use cases.

    I hope the last couple of posts have managed to convince you that synthetic-to-real refinement is a valid approach that can improve the performance at the end task even if the actual refined images do not look all that realistic to humans: some of the examples above look pretty bad, but training on them still improves performance for downstream tasks.

    Next time, we will discuss an interesting variation of this idea: what if we reverse the process and try to translate real data into synthetic? And why would anyone want to do such a thing if we are always interested in solving downstream tasks on real data rather than synthetic?.. The answers to these questions will have to wait until the next post. See you!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Gaze Estimation and GANs: Driving Model Performance with Synthetic Data IV

    Gaze Estimation and GANs: Driving Model Performance with Synthetic Data IV

    With the Christmas and New Year holidays behind us, let’s continue our series on how to improve the performance of machine learning models with synthetic data. Last time, I gave a brief introduction into domain adaptation, distinguishing between its two main variations: refinement, where synthetic images are themselves changed before they are fed into model training, and model-based domain adaptation, where the training process changes to adapt to training on different domains. Today, we begin with refinement for the same special case of eye gaze estimation that kickstarted synthetic data refinement a few years ago and still remains an important success story for this approach, but then continue and extend the story of refinement to other computer vision problems. Today’s post will be more in-depth than before, so buckle up and get ready for some GANs!

    SimGAN in Detail

    Let us begin with a quick reminder. As we discussed last time, Shrivastava et al. (2017) was one of the first approaches that successfully improved a real life model by refining synthetic images and feeding them to a relatively straightforward deep learning model. They took a large-scale synthetic dataset of human eyes created by Wood et al. (2016) and learned a transformation from synthetic images (“Source domain” in the illustration below) to real data (“Target domain” in the illustration below). To do that, they utilized a relatively straightforward SimGAN architecture; we saw it last time in a high-level description, but today, let’s dive a little deeper into the details.

    Let me first draw a slightly more detailed picture for you:

    In the figure above, black arrows denote the data flow and green arrows show the gradient flow between SimGAN’s components. SimGAN consists of a generator (refiner) that translates source domain data into “fake” target domain data and a discriminator that tries to distinguish between “fake” and real target domain images.

    The figure also introduces some notation: it shows that the overall loss function for the generator (refiner) in SimGAN consists of two components:

    • the realism loss (green block on the right), usually called the adversarial loss in GANs, makes the generator fool the discriminator into thinking that the refined image is realistic; in a basic GAN such as SimGAN, this was simply the exact opposite of the binary cross-entropy loss for classification that the discriminator trains on;
    • the regularization loss (green block on the bottom) makes the refinement process care for the actual target variable; it is supposed to make refinement preserve the eye gaze by preserving the result of some function ψ.

    I will indulge myself with a little bit of formulas to make the above discussion more specific (trust me, you are very lucky that I can’t install a LaTeX plugin on this blog and have to insert formulas as pictures — otherwise this blog would be teeming with them). Here is the resulting loss function for SimGAN’s generator:

        \begin{align*}\mathcal{L}^{\text{REF}}_{G}(\theta)  &= \mathbb{E}_{S}\Bigl[ \mathcal{L}^{\text{REF}}_{\text{real}}\!\bigl(\theta;\mathbf{x}_{S}\bigr)        + \lambda\,      \mathcal{L}^{\text{REF}}_{\text{reg}}\!\bigl(\theta;\mathbf{x}_{S}\bigr)     \Bigr],  \quad\text{where} \\[6pt]\mathcal{L}^{\text{REF}}_{\text{real}}\!\bigl(\theta;\mathbf{x}_{S}\bigr)  &= -\,\log\!\Bigl(1 - D^{\text{REF}}_{\phi}\bigl(G^{\text{REF}}_{\theta}(\mathbf{x}_{S})\bigr)\Bigr), \\[6pt]\mathcal{L}^{\text{REF}}_{\text{reg}}\!\bigl(\theta;\mathbf{x}_{S}\bigr)  &= \bigl\lVert        \psi\!\bigl(G^{\text{REF}}_{\theta}(\mathbf{x}_{S})\bigr)        - \psi(\mathbf{x}_{S})     \bigr\rVert_{1}\;.\end{align*}

    where \psi is a mapping into some kind of feature space. The feature space can contain the image itself, image derivatives, statistics of color channels, or features produced by a fixed extractor such as a pretrained CNN. But in the case of SimGAN it was… in most cases, just an identity map. That is, the regularization loss simply told the generator to change as little as possible while still making the image realistic.
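
    In code, the whole refiner objective with \psi taken as the identity map boils down to a few lines. Here is a hedged PyTorch sketch in which the discriminator is assumed to output the logit of “this image is refined/synthetic”, matching the formula above; the variable names are mine, not the paper’s.

        import torch
        import torch.nn.functional as F

        def refiner_loss(refined, synthetic, d_logits_on_refined, lam=0.5):
            # Realism (adversarial) loss: push the discriminator output towards label 0,
            # i.e., towards classifying the refined image as real.
            real_loss = F.binary_cross_entropy_with_logits(
                d_logits_on_refined, torch.zeros_like(d_logits_on_refined))
            # Regularization loss with psi = identity: change the image as little as possible.
            reg_loss = (refined - synthetic).abs().mean()
            return real_loss + lam * reg_loss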

    SimGAN significantly improved gaze estimation over the previous state of the art. But it was a rather straightforward and simplistic GAN even for 2017. Since their inception, generative adversarial networks have evolved quite a bit, with several interesting ideas defining modern adversarial architectures. Fortunately, not only have these ideas been applied to synthetic-to-real refinement, but we don’t even have to deviate from the gaze estimation example to see quite a few of them!

    GazeGAN, Part I: Synthetic-to-Real Transfer with CycleGAN

    Meet GazeGAN, an architecture also developed in 2017, a few months later, by Sela et al. (2017). It also does synthetic-to-real refinement for gaze estimation, just like SimGAN. But the architecture, once you lay it out in a figure, looks much more daunting:

    Let’s take it one step at a time.

    First of all, the overall structure. As you can see, GazeGAN has two different refiners, F and G, and two discriminators, one for the source domain and one for the target domain. What’s going on here?

    In this structure, GazeGAN implements the idea of CycleGAN introduced by Zhu et al. (2017). The problem CycleGAN was solving was unpaired style transfer. In general, synthetic-to-real refinement is a special case of style transfer: we need to translate images from one domain to another. In a more general context, similar problems could include artistic style transfer (draw a Monet landscape from a photo), drawing maps from satellite images, coloring an old photo, and many more.

    In GAN-based style transfer, first introduced by the pix2pix model (Isola et al., 2016), you can have a very straightforward architecture where the generator does the transfer and the discriminator tries to tell fake pictures apart from real ones. The main problem is how to capture the fact that the translated picture should be similar to the one the generator received as input. Formally speaking, it is perfectly legal for the generator to just memorize a few Monet paintings and output them for every input unless we do something about it. SimGAN fixed this via a regularization loss that simply told the generator to “change as little as possible”, but this is not quite what’s needed and does not usually work in the general case.

    In the pix2pix model, style transfer is done with a conditional GAN. This means that both the generator and discriminator see the input picture from the source domain, and the discriminator checks both realism and the fact that the target domain image matches the source domain one. Here is an illustration (Isola et al., 2016):

    This approach actually can be made to work very well; here is a sample result from a later version of this model called pix2pixHD (Wang et al., 2018), where the model is synthesizing a realistic photo from a segmentation map (not the other way around!):

    But the pix2pix approach does not always apply. For training, this approach requires a paired dataset for style transfer, where images from the source and target domain match each other. It’s not a problem for segmentation maps, but, e.g., for artistic style transfer it would be impossible: Monet only painted some specific landscapes, and we can’t make a perfectly matching photo today.

    Enter CycleGAN, a model that solves this problem with a very interesting idea. We need a paired dataset because we don’t know how to capture the idea that the translated image should inherit content from the original. What should be the loss function that says that this image shows the same landscape as a given Monet painting but in photographic form?..

    But imagine that we also have an inverse transformation. Then we would be able to make a photo out of a Monet painting, and then make it back into a Monet — which means that now it has to match the original exactly, and we can use some kind of simple pixel-wise loss to make them match! This is precisely the cycle that CycleGAN refers to. Here is an illustration from (Zhu et al., 2017):

    Now the cycle consistency loss that ensures that G(F(\mathbf{x}))=\mathbf{x} can be a simple L_2 or L_1 pixel-wise loss.
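    In code, the cycle consistency check takes just a couple of lines. Here is a rough sketch with two generators, G mapping the source domain to the target domain and F mapping it back (placeholder names, not tied to any particular implementation):

        import torch

        def cycle_consistency_loss(G, F, x_source, x_target):
            # source -> target -> source should reproduce the original (L1 pixel-wise loss)
            loss_src = torch.abs(F(G(x_source)) - x_source).mean()
            # and the same in the other direction
            loss_tgt = torch.abs(G(F(x_target)) - x_target).mean()
            return loss_src + loss_tgt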

    If you have a paired dataset, it would virtually always be better to use an architecture such as pix2pix that makes use of this data, but CycleGAN works quite well for unpaired cases. Here are some examples for the Monet-to-photo direction from the original CycleGAN:

    In GazeGAN, the cycle is, as usual, implemented by two generators and two discriminators, one for the source domain and one for the target domain. But the cycle consistency loss consists of two parts:

    • first, a simple pixel-wise L1 loss, just like we discussed above:

          \begin{align*}\mathcal{L}_{\text{Cyc}}^{Gz}\!\bigl(G^{Gz},F^{Gz}\bigr)  =& \mathbb{E}_{\mathbf{x}_{S}\sim p_{\text{syn}}}      \!\Bigl[\bigl\|F^{Gz}\!\bigl(G^{Gz}(\mathbf{x}_{S})\bigr)-\mathbf{x}_{S}\bigr\|_{1}\Bigr]\\    +&\mathbb{E}_{\mathbf{x}_{T}\sim p_{\text{real}}}      \!\Bigl[\bigl\|G^{Gz}\!\bigl(F^{Gz}(\mathbf{x}_{T})\bigr)-\mathbf{x}_{T}\bigr\|_{1}\Bigr];\end{align*}

    • but second, we also need a special gaze cycle consistency loss to preserve the gaze direction (so that the target variable can be transferred with no change); for this, Sela et al. train a separate gaze estimation network E that is deliberately overfit to the source domain so that it predicts gaze very accurately on synthetic data; the loss then makes sure that E’s predictions survive a trip through the cycle (see the sketch after this list):

        \[\mathcal{L}_{\text{GazeCyc}}^{Gz}\bigl(G^{Gz},F^{Gz}\bigr) = \mathbb{E}_{\mathbf{x}_S\sim p_{\text{syn}}}\left[\left\| E^{Gz}\left(F^{Gz}(G^{Gz}(\mathbf{x}_S))\right) - E^{Gz}(\mathbf{x}_S)\right\|_2^2\right].\]
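    Continuing the sketch from above, the gaze cycle consistency term only needs a frozen gaze estimator on top of the same cycle; here E, G, and F are again placeholder networks, with E pretrained (and deliberately overfit) on synthetic data:

        import torch

        def gaze_cycle_consistency_loss(G, F, E, x_syn):
            # E was trained to predict gaze very accurately on synthetic images;
            # its predictions should survive the synthetic -> real -> synthetic round trip
            with torch.no_grad():
                gaze_original = E(x_syn)
            gaze_after_cycle = E(F(G(x_syn)))
            # squared L2 distance between gaze predictions (assuming E outputs a gaze
            # vector per image), averaged over the batch
            return ((gaze_after_cycle - gaze_original) ** 2).sum(dim=1).mean()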

    GazeGAN, Part II: LSGAN, Label Smoothing, and More

    Apart from the overall CycleGAN structure, the GazeGAN model also used quite a few novelties that had been absent in SimGAN but had already become instrumental in GAN-based architectures by the end of 2017. Let’s discuss at least a couple of those.

    First, the adversarial loss function. As I mentioned above, SimGAN used the most basic adversarial loss: binary cross-entropy, the natural loss function for classification. However, this loss function has quite a few undesirable properties that make it hard to train GANs with. Since the original GANs were introduced in 2014, many different adversarial losses have been developed; the two probably most prominent and most often used are Wasserstein GANs (Arjovsky et al., 2017) and LSGAN (Least Squares GAN; Mao et al., 2016).

    I hope I will have a reason to discuss Wasserstein GANs in the future — it’s a very interesting idea that sheds a lot of light on the training of GANs and machine learning in general. But GazeGAN used the LSGAN loss function. As the title suggests, LSGAN uses the least squares loss function instead of binary cross-entropy. In the general case, it looks like

        \begin{align*}\min_{D}\; V_{\text{LSGAN}}(D)&= \tfrac12\,\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\bigl[(D(\mathbf{x}) - b)^{2}\bigr]+ \tfrac12\,\mathbb{E}_{\mathbf{z}\sim p_{z}}\bigl[(D(G(\mathbf{z})) - a)^{2}\bigr], \\\min_{G}\; V_{\text{LSGAN}}(G)&= \tfrac12\,\mathbb{E}_{\mathbf{z}\sim p_{z}}\bigl[(D(G(\mathbf{z})) - c)^{2}\bigr],\end{align*}

    which means that the discriminator is trying to output some constant b on real images and some other constant a on fake images, while the generator is trying to convince the discriminator to output c on fake images (it has no control over the real ones). Naturally, usually one takes a=0 and b=c=1, although there is an interesting theoretical result about the case when b-c=1 and b-a=2, that is, when the generator is trying to make the discriminator maximally unsure about the fake images.

    In general, trying to learn a classifier with the least squares loss is about as wrong as you can be in machine learning: once the output overshoots its target, this loss function grows larger even as the classifier becomes more confident in the correct answer! But for GANs, the saturation of the logistic sigmoid in binary cross-entropy is a much more serious problem. On top of the LSGAN loss, GazeGAN also uses label smoothing: while the discriminator aims to output 1 on real examples and 0 on refined synthetic images, the generator smoothes its target to 0.9, resulting in the loss function

        \[\mathcal{L}_{\text{LSGAN}}^{Gz}(G, D, S, R) = \mathbb{E}_{\mathbf{x}_S \sim p_{\text{syn}}}\Bigl[\,\bigl(D\bigl(G(\mathbf{x}_S)\bigr) - 0.9\bigr)^{2}\Bigr] + \mathbb{E}_{\mathbf{x}_T \sim p_{\text{real}}}\Bigl[\,D(\mathbf{x}_T)^{2}\Bigr];\]

    this loss is applied in both CycleGAN directions, synthetic-to-real and real-to-synthetic. Label smoothing helps the generator to avoid overfitting to intermediate versions of the discriminator (recall that GANs are trained by alternating between training the generator and the discriminator, and there is no way to train them separately because they need each other for training data).
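    As a small illustration, here is what least squares adversarial losses with a smoothed generator target might look like in code (my own sketch in the a=0, b=1 convention from above, with the generator target smoothed to c=0.9; GazeGAN applies such losses in both translation directions):

        import torch

        def lsgan_d_loss(D, G, x_syn, x_real, a=0.0, b=1.0):
            # discriminator: least squares targets b on real images and a on refined ones
            loss_real = ((D(x_real) - b) ** 2).mean()
            loss_fake = ((D(G(x_syn).detach()) - a) ** 2).mean()
            return 0.5 * (loss_real + loss_fake)

        def lsgan_g_loss(D, G, x_syn, c=0.9):
            # generator: least squares target c on refined images, smoothed from 1 to 0.9
            return 0.5 * ((D(G(x_syn)) - c) ** 2).mean()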

    And that’s it! With these loss functions, GazeGAN is able to create highly realistic images of eyes from synthetic eyes rendered by a Unity-based 3D modeling engine. Here are some samples by Sela et al. (2017):

    Note how this model works not only for the eye itself but also for the surrounding facial features, “filling in” even those parts of the synthetic image that were not there.

    Conclusion

    Today, we have discussed the gaze estimation problem in detail, taking this opportunity to talk about several important ideas in generative adversarial networks. Synthetic-to-real refinement has proven its worth with this example. But, as I already mentioned in the previous post, gaze estimation is also a relatively easy example: the synthetic images of eyes that needed refining for Shrivastava et al. were only 30×60 pixels in size!

    GazeGAN takes the next step: it operates not on the 30×60 grayscale images but on 128×128 color images, and GazeGAN actually refines not only the eye itself but parts of the image (e.g., nose and hair) that were not part of the 3D model of the eye.

    But these are still relatively small images and a relatively simple task, at least one with low variability in the data. Next time, we will see how well synthetic-to-real refinement works for other applications.

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Domain Adaptation Overview: Driving Model Performance with Synthetic Data III

    Domain Adaptation Overview: Driving Model Performance with Synthetic Data III

    Today, I continue the series about different ways of improving model performance with synthetic data. We have already discussed simple augmentations in the first post and “smart” augmentations that make more complex transformations of the input in the second. Today we go on to the next sub-topic: domain adaptation. We will stay with domain adaptation for a while, and in the first post on this topic I would like to present a general overview of the field and introduce the most basic approaches to domain adaptation.

    Refinement and Model-Based Domain Adaptation

    In previous posts, we have discussed augmentations, transformations that can be used to extend the training set. In the context of synthetic data (we are in the Synthesis AI blog, after all), this means that synthetic data can be used to augment real datasets of insufficient size, and an important part of using synthetic data would be to augment the heck out of it so that the model would generalize as well as possible. This is the idea of domain randomization, a very important part of using synthetic data for machine learning and a part that we will definitely return to in future posts.

    But the use of synthetic data can be made much more efficient than just training on it. Domain adaptation is a set of techniques designed to make a model trained on one domain of data, the source domain, work well on a different, target domain. The problem we are trying to solve is called transfer learning, i.e., transferring the knowledge learned on source tasks into an improvement in performance on a different target task, as shown in this illustration from (Pan, Yang, 2009):

    This is a natural fit for synthetic data: in almost all applications, we would like to train the model in the source domain of synthetic data but then apply the results in the target domain of real data. Here is an illustration of the three approaches to this kind of transfer in the context of robotics (source):

    Here we see the difference:

    • system identification simply hopes that simulation is well-calibrated and matches reality sufficiently well,
    • domain randomization tries to make the synthetic distribution so wide that a trained model will be robust enough to generalize to real data, and
    • domain adaptation makes changes, either in the model or in the datasets.

    In this series of posts, we will give a survey of domain adaptation approaches that have been used for such synthetic-to-real adaptation. We broadly divide these methods into two groups. Approaches from the first group operate on the data level, producing “refined” synthetic data that works better on real-world tasks, while approaches from the second group operate directly on the model, its feature space, or its training procedure, leaving the data itself unchanged.

    Let us now discuss these two options in more detail.

    Synthetic Data Refinement

    The first group of approaches to synthetic-to-real domain adaptation works with the data itself. In this approach, we try to develop models that take a synthetic image as input and “refine” it, making it better for subsequent model training. Note that while in most works we discuss here the objective is basically to make synthetic data more realistic (for example, in GAN-based models it is the direct objective: the discriminators should not be able to distinguish refined synthetic data from real samples), this does not necessarily have to be the case. Some early works on synthetic data even concluded that synthetic images may work better if they are less realistic, because this results in better generalization of the models. But generally speaking, realism is the goal.

    Today I will present the very first work that kickstarted synthetic-to-real refinement back in 2016; we will discuss what has happened since then later, in a separate post. This example, one of the first successful models with straightforward synthetic-to-real refinement, was given by Apple researchers Shrivastava et al. (2017).

    The underlying problem here is gaze estimation: recognizing the direction where a human eye is looking. Gaze estimation methods are usually divided into model-based, which model the geometric structure of the eye and adjacent regions, and appearance-based, which use the eye image directly as input; naturally, synthetic data is made and refined for appearance-based gaze estimation.

    Before Shrivastava et al., this problem had already been tackled with synthetic data. In particular, Wood et al. (2016) presented a large dataset of realistic renderings of human eyes and showed improvements on real test sets over previous work done with the MPIIgaze dataset of real labeled images. Here is what their synthetic dataset looks like:

    The usual increase in scale (synthetic images are almost free once the initial investment is made) manifests here mostly as an increase in variability: MPIIgaze contains about 214K images, and the synthetic training set was only about 1M images, but all images in MPIIgaze come from the same 15 participants of the experiment, while the UnityEyes system developed by Wood et al. can render every image in a different randomized environment, which makes the resulting model much more robust.

    The refinement here is to make these synthetic images even more realistic. Shrivastava et al. present a GAN-based system trained to improve synthesized images of the eyes. They call this idea Simulated+Unsupervised learning:

    They learn this transformation with a refiner network trained in the SimGAN adversarial architecture:

    SimGAN is a relatively straightforward GAN. It consists of a generator (refiner) and a discriminator, as shown above. The discriminator learns to distinguish between real and refined images with a standard binary classification loss function. The generator, in turn, is trained with a combination of the adversarial loss that makes it learn to fool the discriminator and regularization loss that captures the similarity between the refined image and the original one in order to preserve the target variable (gaze direction).

    As a result, Shrivastava et al. were able to significantly improve upon previous results. But the gaze estimation problem is in many ways a natural and simple candidate for such an approach. It is especially telling that the images they generate and refine are merely 30×60 pixels in size: even the GANs that existed in 2017 were able to work quite well on this kind of output dimension. In a later post, we will see how image-based refinement works today and in other applications.

    Model-Based Domain Adaptation

    We have seen models that perform domain adaptation at the data level, i.e., models from which one can extract a part that takes as input a data point from the source domain (a synthetic image) and maps it to the target domain (a real image).

    However, it is hard to find applications where this is actually necessary. The final goal of AI model development usually is not to get a realistic synthetic image of a human eye; this is just a stepping stone to producing models that work better with the actual task, e.g., gaze estimation in this case.

    Therefore, to make better use of synthetic data it makes sense to also consider feature-level or model-level domain adaptation, that is, methods that work in the space of latent features or model weights and never go back to change the actual data.

    The simplest approach to such model-based domain adaptation would be to share the weights between networks operating on different domains or to learn an explicit mapping between them. Here is an early paper by Glorot, Bordes, and Bengio (2011) that does exactly that; this relatively simple approach remains relevant to this day, and we will probably see more examples of it in later installments.

    But for this introductory post I would like to show probably the most important work in this direction, the one that started model-based domain adaptation in earnest. I’m talking, of course, about “Unsupervised Domain Adaptation by Backpropagation” by two Russian researchers, Yaroslav Ganin and Victor Lempitsky. Here is the main illustration from their paper:

    This is basically a generic framework for unsupervised domain adaptation that consists of:

    • a feature extractor,
    • a label predictor that performs the necessary task (e.g., classification) on extracted features, and
    • a domain classifier that takes the same features and attempts to classify which domain the original input belonged to.

    The idea is to train the label predictor to perform as well as possible while training the feature extractor to produce features on which the domain classifier performs as badly as possible (the domain classifier itself is trained to tell the domains apart as well as it can). This adversarial game is implemented with the gradient reversal layer: the gradients are multiplied by a negative constant as they pass from the domain classifier back into the feature extractor.
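    The gradient reversal layer itself takes only a few lines in a modern framework; here is a minimal PyTorch-style sketch (my own illustration, not the authors’ original code):

        import torch

        class GradReverse(torch.autograd.Function):
            @staticmethod
            def forward(ctx, x, lam):
                ctx.lam = lam
                return x.view_as(x)      # identity on the forward pass

            @staticmethod
            def backward(ctx, grad_output):
                # multiply the gradient by a negative constant on the way back
                return -ctx.lam * grad_output, None

        def grad_reverse(x, lam=1.0):
            return GradReverse.apply(x, lam)

        # usage:
        #   features      = feature_extractor(x)
        #   class_logits  = label_predictor(features)
        #   domain_logits = domain_classifier(grad_reverse(features, lam))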

    This is basically the idea of GANs, but without the tedious alternating training of a generator and discriminator and with far fewer convergence problems than early GANs suffered from. The original work by Ganin and Lempitsky applied this idea to examples that would be considered toy datasets by today’s standards, but since 2015 this field has also seen a lot of interesting discoveries that we will definitely discuss later.

    Conclusion

    In this post, we have started to discuss domain adaptation, probably the most important topic in machine learning research dealing with synthetic data. Generally speaking, all we do is domain adaptation: we need to use synthetic data for training, but the final goal is always to transfer to the domain of the real world.
    Next time, we will discuss GAN-based refinement in more detail. Stay tuned, there will be GANs aplenty, including some very interesting models! Until next time!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Smart Augmentations: Driving Model Performance with Synthetic Data II

    Smart Augmentations: Driving Model Performance with Synthetic Data II

    Last time, I started a new series of posts, devoted to different ways of improving model performance with synthetic data. In the first post of the series, we discussed probably the simplest and most widely used way to generate synthetic data: geometric and color data augmentation applied to real training data. Today, we take the idea of data augmentation much further. We will discuss several different ways to construct “smart augmentations” that make much more involved transformations of the input but still change the labeling only in predictable ways.

    Automating Augmentations: Finding the Best Strategy

    Last time, we discussed the various ways in which modern data augmentation libraries such as Albumentations can transform an unsuspecting input image. Let me recall one example from last time:

    Here, the resulting image and segmentation mask are the result of the following chain of transformations (a rough code sketch follows the list):

    • take a random crop from a predefined range of sizes;
    • shift, scale, and rotate the crop to match the original image dimension;
    • apply a (randomized) color shift;
    • add blur;
    • add Gaussian noise;
    • add a randomized elastic transformation for the image;
    • perform mask dropout, removing a part of the segmentation masks and replacing them with black cutouts on the image.
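    In Albumentations, such a chain could look roughly like the sketch below (the specific transform parameters here are my own assumptions, not the ones from the illustration, and exact signatures vary a bit between library versions):

        import albumentations as A
        import numpy as np

        image = np.zeros((256, 256, 3), dtype=np.uint8)   # placeholder image
        mask = np.zeros((256, 256), dtype=np.uint8)       # placeholder segmentation mask

        transform = A.Compose([
            A.RandomSizedCrop(min_max_height=(128, 256), height=256, width=256),
            A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=30),
            A.RGBShift(r_shift_limit=20, g_shift_limit=20, b_shift_limit=20),
            A.Blur(blur_limit=3),
            A.GaussNoise(),
            A.ElasticTransform(),
            A.MaskDropout(max_objects=2),
        ])

        augmented = transform(image=image, mask=mask)
        aug_image, aug_mask = augmented["image"], augmented["mask"]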

    That’s quite a few operations! But how do we know that this is the best way to approach data augmentation for this particular problem? Can we find the best way to augment, maybe via some automated meta-strategy that would take into account the specific problem setting?

    As far as I know, this natural idea first bore fruit in the paper titled “Learning to Compose Domain-Specific Transformations for Data Augmentation” by Stanford researchers Ratner et al. (2017). They viewed the problem as a sequence generation task, training a recurrent generator to produce sequences of transformation functions:

    The next step was taken in the work called “AutoAugment: Learning Augmentation Strategies from Data” by Cubuk et al. (2019). This is a work from Google Brain, from the group led by Quoc V. Le that has been working wonders with neural architecture search (NAS), a technique that automatically searches for the best architectures in a class that can be represented by computational graphs. With NAS, this group has already improved over the state of the art in basic convolutional architectures with NASNet and the EfficientNet family, in object detection architectures with the EfficientDet family, and even in such a basic field as activation functions for individual neural units: the Swish activation functions were found with NAS.

    So what did they do with augmentation techniques? As usual, they frame this problem as a reinforcement learning task where the agent (controller) has to find a good augmentation strategy based on the rewards obtained by training a child network with this strategy:

    The controller is trained by proximal policy optimization, a rather involved reinforcement learning algorithm that I’d rather not get into (Schulman et al., 2017). The point is, they successfully learned augmentation strategies that significantly outperform other, “naive” strategies. They were even able to achieve improvements over state of the art in classical problems on classical datasets:

    Here is a sample augmentation policy for ImageNet found by Cubuk et al.:

    The natural question is, of course: why do we care? How can it help us when we are not Google Brain and cannot run this pipeline (it does take a lot of computation)? Cubuk et al. note that the resulting augmentation strategies can indeed transfer across a wide variety of datasets and network architectures; on the other hand, this transferability is far from perfect, so I have not seen the results of AutoAugment pop up in other works as often as the authors would probably like.
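    On the practical side, the learned policies have since made their way into standard libraries; for example, in a sufficiently recent torchvision the ImageNet policy can be applied roughly like this (a sketch, assuming a torchvision version that includes the AutoAugment transform):

        import torchvision.transforms as T

        train_transform = T.Compose([
            T.RandomResizedCrop(224),
            T.AutoAugment(policy=T.AutoAugmentPolicy.IMAGENET),
            T.ToTensor(),
        ])
        # train_transform can now be passed to a torchvision dataset as usual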

    Still, these works prove the basic point: augmentations can be composed together for better effect. The natural next step would be to have some even “smarter” functions. And that is exactly what we will see next.

    Smart Augmentation: Blending Input Images

    In the previous section, we discussed how to chain together standard transformations of input images in the best possible ways. But what if we go one step further and allow augmentations to produce more complex combinations of input data points?

    In 2017, this idea was put forward in the work titled “Smart Augmentation: Learning an Optimal Data Augmentation Strategy” by Irish researchers Lemley et al. Their basic idea is to have two networks, “Network A” that implements an augmentation strategy and “Network B” that actually trains on the resulting augmented data and solves the end task:


    The difference here is that Network A does not simply choose from a predefined set of strategies but operates as a generative network that can, for instance, blend two different training set examples into one in a smart way:

    In particular, Lemley et al. tested their approach on datasets of human faces (a topic quite close to our hearts here at Synthesis AI, but I guess I will talk about it in more detail later). So their Network A was able to, e.g., compose two different images of the same person into a blended combination (on the left):

    Note that this is not simply a blend of two images but a more involved combination that makes good use of facial features. Here is an even better example:

    This kind of “smart augmentation” borders on synthetic data generation: the resulting images are nothing like the originals. But before we turn to actual synthetic data (in subsequent posts), there are other interesting ideas one could apply even at the level of augmentation.

    Mixup: Blending the Labels

    In smart augmentations, the input data is produced as a combination of several images with the same label: two different images of the same person can be “interpolated” in a way that respects the facial features and expands the data distribution.

    Mixup, a technique introduced by MIT and FAIR researchers Zhang et al. (2018), looks at the problem from the opposite side: what if we mix the labels together with the training samples? This is implemented in a very straightforward way: for two labeled input data points, Zhang et al. construct a convex combination of both the inputs and the labels:

        \begin{align*}\tilde{x} &= \lambda x_i + (1 - \lambda) x_j,&\text{where } x_i, x_j \text{ are raw input vectors,} \\[4pt]\tilde{y} &= \lambda y_i + (1 - \lambda) y_j,&\text{where } y_i, y_j \text{ are one-hot label encodings.}\end{align*}
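    In code, mixup amounts to just a few lines per batch; here is a minimal sketch (the Beta distribution parameter alpha is a hyperparameter, and the value below is only an example):

        import numpy as np
        import torch

        def mixup_batch(x, y_onehot, alpha=0.2):
            # x: [batch, ...] inputs; y_onehot: [batch, num_classes] one-hot labels
            lam = np.random.beta(alpha, alpha)
            perm = torch.randperm(x.size(0))
            x_mixed = lam * x + (1.0 - lam) * x[perm]
            y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
            return x_mixed, y_mixed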

    The blended label does not change either the network architecture or the training process: the cross-entropy loss trivially generalizes from one-hot target vectors to target discrete distributions. To borrow an illustration from Ferenc Huszar’s blog post, here is what mixup does to a single data point, constructing convex combinations with other points in the dataset:

    And here is what happens when we label a lot of points uniformly in the data:

    As you can see, the resulting labeled data covers a much wider and more continuous region of the input space, and this helps generalization. Zhang et al. report especially significant improvements in training GANs:

    By now, the idea of mixup has become an important part of the deep learning toolbox: you can often see it as an augmentation strategy, especially in the training of modern GAN architectures.

    Self-Adversarial Training

    To get to the last idea of today’s post, I will use YOLOv4, a recently presented object detection architecture (Bochkovskiy et al., 2020). YOLOv4 is a direct successor to the famous YOLO family of object detectors, improving significantly over the previous YOLOv3. We are, by the way, witnessing an interesting controversy in object detection because YOLOv5 followed less than two months later, from a completely different group of researchers, and without a paper to explain the new ideas (but with the code so it is not a question of reproducing the results)…

    Very interesting stuff, but discussing it would take us very far from the topic at hand, so let’s get back to YOLOv4. It boasts impressive performance, with the same detection quality as the above-mentioned EfficientDet at half the cost:

    In the YOLO family, new releases usually obtain a much better object detection quality by combining a lot of small improvements, bringing together everything that researchers in the field have found to work well since the previous YOLO version. YOLOv4 is no exception, and it outlines several different ways to add new tricks to the pipeline.

    What we are interested in now is their “Bag of Freebies”, the set of tricks that do not change the performance of the object detection framework during inference, adding complexity only at the training stage. It is very characteristic that most items in this bag turn out to be various kinds of data augmentation. In particular, Bochkovskiy et al. introduce a new “mosaic” geometric augmentation that works well for object detection:

    But the most interesting part comes next. YOLOv4 is trained with self-adversarial training (SAT), an augmentation technique that actually incorporates adversarial examples into the training process. Remember this famous picture?

    It turns out that for most existing artificial neural architectures, one can modify input images with small amounts of noise in such a way that the result looks to us humans completely indistinguishable from the originals but the network is very confident that it is something completely different; see, e.g., this OpenAI blog post for more information.

    In the simplest case, such adversarial examples are produced by the following procedure:

    • you have a network and an input x that you want to make adversarial; suppose you want to turn a panda into a gibbon;
    • formally, it means that you want to increase the “gibbon” component of the network’s output vector (at the expense of the “panda” component);
    • so you fix the weights of the network and start regular gradient ascent, but with respect to x rather than the weights! This is the key idea for finding adversarial examples; it does not explain why they exist (it’s not an easy question) but if they do, it’s really not so hard to find them.

    So how do you turn this idea into an augmentation technique? Given an input instance, you make it into an adversarial example by following this procedure for the current network that you are training. Then you train the network on this example. This may make the network more resistant to adversarial examples, but the important outcome is that it generally makes the network more stable and robust: now we are explicitly asking the network to work robustly in a small neighborhood of every input image. Note that the basic idea can again be described as “make the input data distribution cover more ground”, but by now we have come quite a long way since horizontal reflections and random crops…
    Note that unlike basic geometric augmentations, this may turn out to be a quite costly procedure. But the cost is borne entirely during training: yes, you might have to train the final model for two weeks instead of one, but the resulting model will, of course, run at exactly the same inference cost: the model architecture does not change, only the training process does.
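    Here is a rough sketch of the simplest untargeted version of this recipe, an FGSM-style perturbation computed against the current model and then fed back as a training example (my own illustration of the general idea, not the exact SAT procedure used in YOLOv4; epsilon is an assumed perturbation size):

        import torch

        def adversarial_augment(model, loss_fn, x, y, epsilon=0.03):
            # fix the network weights and take a gradient step with respect to the input
            x_adv = x.clone().detach().requires_grad_(True)
            loss = loss_fn(model(x_adv), y)
            loss.backward()
            # move the input in the direction that increases the loss
            # (in real training you would also zero the model's own gradients afterwards)
            x_adv = x_adv + epsilon * x_adv.grad.sign()
            return x_adv.detach().clamp(0.0, 1.0)   # assuming inputs scaled to [0, 1]

        # ...and then train on (x_adv, y) in addition to (or instead of) (x, y)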

    Bochkovskiy et al. report this technique as one of the main new ideas and main sources of improvement in YOLOv4. They also use the other augmentation ideas that we discussed, of course:

    YOLOv4 is an important example for us: it represents a significant improvement in the state of the art in object detection in 2020… and much of the improvement comes directly from better and more complex augmentation techniques! This makes us even more optimistic about taking data augmentation further, to the realm of synthetic data.

    Conclusion

    In the second post in this new series, we have seen how more involved augmentations take the basic idea of covering a wider variety of input data much further than simple geometric or color transformations ever could. With these new techniques, data augmentation almost blends together with synthetic data as we usually understand it (see, e.g., my previous posts on this blog: one, two, three, four, five).

    Smart augmentations such as the one presented by Lemley et al. border on straight-up automated synthetic data generation. It is already hard to draw a clear separating line between them and, say, synthetic data generation with GANs as presented by Shrivastava et al. (2017) for gaze estimation. The latter, however, is a classical example of domain adaptation by synthetic data refinement. Next time, we will begin to speak about this model and similar techniques for domain adaptation intended to make synthetic data work even better. Until then!

    Sergey Nikolenko
    Head of AI, Synthesis AI