Blog

  • Generative AI Models in Image Generation: Overview

    Generative AI Models in Image Generation: Overview

    Some of the most widely publicized results in machine learning in recent years have been related to image generation. You heard about DALL-E a year ago, and by now you have also heard about DALL-E 2, Midjourney, and Stable Diffusion, right? With this post, I’m starting a new series where I will explain the inner workings of these models, how they differ, and how they fit into the general scheme of deep generative models. Today, we begin with a general overview.

    Taxonomy and a Brief History of Image Generation

    Generative AI models are a staple of machine learning. One of the first functional machine learning models, the naive Bayes classifier developed in the 1960s, was an early form of generative AI: you can write new text with a trained naive Bayes model, but it won’t make any sense (since naive Bayes makes the bag-of-words assumption, the text will be just random words sampled in a way consistent with the desired topic).

    Generating images, however, is more difficult than generating text. Just like text, an image is a high-dimensional object: a 1Mpix color photo is defined by about 3 million numbers! Unlike text, however, images do not admit extremely strong simplifying assumptions such as the bag-of-words model. In the world of images, “words” are pixels, and while naive Bayes is a pretty good text classifier, individual pixels are too simple to be useful even for classification, let alone generation.

    The first generative models that worked for images were autoregressive: you generate the next pixel conditioned on the already generated previous pixels. PixelCNN and PixelRNN were state-of-the-art models for their time (2016), and it might be that with modern architectures, such models could produce state-of-the-art results even today. The problem, however, is that you would have to run the model a million times to get an image with a million pixels, and there is no way to parallelize this process because you need to know the value of pixel number k-1 before you can generate pixel number k. This would be way too slow for high-definition images, so we will not return to purely autoregressive models in this survey.

    Next, we need to distinguish between pure image generation and conditional generation: is it enough to just get a “person who does not exist”, or do you want to control the scene with some kind of a description? Significant progress on the former problem was made in 2017-2018 by NVIDIA researchers who specialized in generative adversarial networks (GANs); their ProGAN (progressively growing GAN) model was the first to do high-definition generation (drawing human faces with up to 1024×1024 pixels) with few of the artifacts that had previously plagued generative models. Later, the same team switched to conditional generation and started working on the StyleGAN family of models, where you can mix and match different levels of features from different images, e.g., take coarse features such as the shape of a face from one person and fine features such as skin texture from another.

    However, it would be even more interesting—and more difficult—if you could just write a prompt for the model and immediately get a picture of the result. This requires multimodal modeling: you have to somehow transform both text and images into the same space, or at least learn how to translate them into one another.

    The first model to claim it had achieved this holy grail with sufficiently good quality was DALL-E from OpenAI. It featured a variational autoencoder with a discrete latent space (like a “language” with discrete “words” that the decoder can turn into images) and a Transformer that made it possible to encode text prompts into this latent space. Later, new models were developed that surpassed DALL-E, including DALL-E 2, Midjourney, and Stable Diffusion. In the next sections, we will discuss these ideas in more detail, although I will reserve the technical discussions for later posts.

    Variational Autoencoder + Transformer = DALL-E

    One of the most important ideas in deep learning is the autoencoder, an encoder-decoder architecture that is tasked to reconstruct the original image:

    The idea here is that the latent code is usually much smaller than the input and output: the task is to compress millions of pixels down to several hundred or a couple of thousand numbers in such a way that decompression is possible.
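
    To make this concrete, here is a minimal sketch of a convolutional autoencoder in PyTorch; this is an illustrative toy, not the architecture of any specific model discussed here. The encoder compresses a 64×64 RGB image into a short latent vector, and the decoder tries to reconstruct the original from that vector, with a simple reconstruction loss tying the two together.

    ```python
    import torch
    import torch.nn as nn

    class ToyAutoencoder(nn.Module):
        """Toy convolutional autoencoder: 64x64 RGB image -> 128-dim latent code -> reconstruction."""
        def __init__(self, latent_dim=128):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64x64 -> 32x32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
                nn.Flatten(),
                nn.Linear(128 * 8 * 8, latent_dim),                     # the bottleneck: the latent code
            )
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 128 * 8 * 8), nn.ReLU(),
                nn.Unflatten(1, (128, 8, 8)),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8x8 -> 16x16
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 16x16 -> 32x32
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32x32 -> 64x64
            )

        def forward(self, x):
            z = self.encoder(x)     # compress the image into a short latent code
            return self.decoder(z)  # reconstruct the image from the code

    model = ToyAutoencoder()
    x = torch.rand(8, 3, 64, 64)                # a batch of random "images"
    loss = nn.functional.mse_loss(model(x), x)  # reconstruction loss used for training
    ```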

    It is very tempting to transform an autoencoder into a generative model: it looks like we can sample the latent codes and get new “reconstructions” that could look like new images. Unfortunately, that’s not quite as easy as it seems: even in the lower-dimensional space, the latent codes of “real images” still occupy a rather complicated subset (submanifold), and it would be very difficult to sample from it directly.

    There are several different ways to go about this problem and turn an autoencoder into a proper generative model. One approach is the adversarial autoencoder: let’s turn this into a GAN by adding a discriminator that distinguishes between “real” latent codes sampled from some standard distribution and “fake” latent codes generated by the encoder from actual images:

    Another approach is taken by variational autoencoders (VAE): let’s make the encoder generate not a single latent code but a whole distribution of latent codes. That is, the encoder produces parameters of this distribution, then a latent code is sampled from it, and then the decoder has to reconstruct the original from any sampled latent code, not only from the exact point the encoder has produced:

    This is just the basic idea; it needs a lot of mathematical machinery to actually work, and I hope to explain this machinery in one of the upcoming posts. But if we do make it work, it helps create a nice generative model without the hassle of adversarial training. Variational autoencoders are an important class of generative models, and DALL-E uses one of them to generate images. To be more precise, it uses a variation of VAE that has discrete latent codes, but this explanation definitely can wait until next time.
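
    As a small illustration of that basic idea (again, a toy sketch rather than DALL-E’s actual discrete VAE), here is what a standard Gaussian VAE looks like in PyTorch: the encoder outputs a mean and a log-variance, a latent code is sampled with the reparameterization trick, and the loss combines reconstruction quality with a KL term that keeps the latent distribution close to a standard Gaussian.

    ```python
    import torch
    import torch.nn as nn

    class ToyVAE(nn.Module):
        """Minimal Gaussian VAE on flattened inputs; illustrates the sampling step, not a real model."""
        def __init__(self, input_dim=784, latent_dim=16):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
            self.to_mu = nn.Linear(256, latent_dim)      # mean of q(z|x)
            self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of q(z|x)
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                         nn.Linear(256, input_dim), nn.Sigmoid())

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # the reparameterization trick
            return self.decoder(z), mu, logvar

    def vae_loss(x, x_hat, mu, logvar):
        recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
        # KL divergence between q(z|x) = N(mu, sigma^2) and the standard Gaussian prior N(0, I)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl
    ```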

    The next step in getting a text-to-image model is to add text into the mix. This is exactly what Transformers are great at, and what we need is to train one to generate these discrete latent codes. So the original DALL-E worked as a (discrete) variational autoencoder with a Transformer generating codes for it:

    After training, you can use the Transformer to generate new latent codes and get new pictures via the decoder:
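
    Schematically, generation in a DALL-E-style model looks roughly like the sketch below; all function and object names here are hypothetical placeholders, and the real pipeline has many more details. The Transformer autoregressively samples a grid of discrete latent tokens conditioned on the text prompt, and the dVAE decoder turns the token grid into pixels.

    ```python
    import torch

    @torch.no_grad()
    def generate_image(prompt, text_tokenizer, transformer, dvae_decoder, num_image_tokens=32 * 32):
        """Hypothetical sketch of DALL-E-style sampling: text tokens -> image tokens -> pixels."""
        tokens = text_tokenizer(prompt)                       # 1D tensor of text token ids (assumed)
        for _ in range(num_image_tokens):                     # sample image tokens one at a time
            logits = transformer(tokens)[-1]                  # distribution over the next token
            next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            tokens = torch.cat([tokens, next_token])
        image_tokens = tokens[-num_image_tokens:]             # keep only the image part of the sequence
        return dvae_decoder(image_tokens)                     # the dVAE decoder turns codes into an image
    ```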

    There are a lot more tricks the authors of DALL-E had to invent to train a huge model able to produce 256×256 images from detailed text prompts, but this is the basic idea.

    Diffusion-based models: inverted degradation

    Another important idea that modern generative models have learned to use very well comes from diffusion-based models. Diffusion is the process of adding noise to something, for instance to an image. If you start with a crisp image and keep adding simple noise, say Gaussian, after a while you will have nothing like the original, and if you continue the process long enough you will get something that’s basically indistinguishable from random noise:

    The idea of diffusion-based models is to try and invert this process. Adding noise is very easy, and the conditional distributions on every step are simple. Inverting it, i.e., gradual denoising of the image, is a much more difficult task, but it turns out that we can approximate the inverse, learning a conditional denoising distribution that is close to the true one:
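
    The forward (noising) process is indeed simple; here is a sketch of the standard closed-form expression used in DDPM-style models, which lets you jump to any noise level t directly instead of simulating every intermediate step (notation and schedule choices vary between specific models):

    ```python
    import torch

    def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
        """Linear variance schedule beta_t and cumulative products alpha_bar_t."""
        betas = torch.linspace(beta_start, beta_end, T)
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)
        return betas, alpha_bars

    def add_noise(x0, t, alpha_bars):
        """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I) in one shot."""
        noise = torch.randn_like(x0)
        abar = alpha_bars[t].view(-1, 1, 1, 1)     # broadcast over the image dimensions
        xt = abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise
        return xt, noise                           # the noise is the usual regression target
    ```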

    Then we can string together this chain of approximations and, hopefully, get a model that is able to regenerate crisp images from random noise:
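
    And here, very roughly, is what stringing the approximations together looks like at generation time: start from pure noise and repeatedly apply a trained denoising network. This is the basic ancestral sampler from DDPM-style models, shown only as a sketch; real text-to-image systems add conditioning and many sampling tricks on top.

    ```python
    import torch

    @torch.no_grad()
    def sample(model, shape, betas, alpha_bars):
        """Basic ancestral sampling: start from x_T ~ N(0, I) and denoise step by step down to x_0."""
        x = torch.randn(shape)                                 # pure noise
        for t in reversed(range(len(betas))):
            eps = model(x, torch.full((shape[0],), t))         # the network predicts the noise in x_t
            alpha, abar = 1.0 - betas[t], alpha_bars[t]
            # posterior mean of x_{t-1} given x_t and the predicted noise
            x = (x - betas[t] / (1.0 - abar).sqrt() * eps) / alpha.sqrt()
            if t > 0:
                x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject the right amount of noise
        return x
    ```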

    Again, a full description of what is going on here is quite involved, and I hope to get a chance to explain it in more detail later. But at this point, the only thing that remains is to be able to convert text descriptions into these “random noise” vectors.

    This conversion can be done with an encoder-decoder architecture (recall the previous section) that projects both texts and images into the same latent space. One of the best such models, CLIP, was developed by OpenAI in 2021, and was used as the basis for DALL-E 2; I will not go into detail about its internal structure in this post and leave it for later.

    So overall, we have the following structure:

    • a multimodal text-image model, in this case CLIP, produces a joint latent space where it can project both images and text prompts;
    • a diffusion-based decoder can produce nice-looking images from its own latent space;
    • but at this point, the decoder’s latent space is not connected to CLIP’s latent space, so there is a third model (either autoregressive or diffusion-based too) that converts CLIP latents into the decoder’s latents.

    Here is this structure illustrated by DALL-E 2 authors Ramesh et al. (2022):
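
    In pseudocode, this three-part structure looks roughly as follows; all names here are hypothetical placeholders rather than the actual DALL-E 2 interfaces.

    ```python
    def text_to_image(prompt, clip_text_encoder, prior, diffusion_decoder):
        """Hypothetical sketch of the DALL-E 2 structure: CLIP text latent -> image latent -> image."""
        text_latent = clip_text_encoder(prompt)  # project the prompt into CLIP's joint latent space
        image_latent = prior(text_latent)        # the third model maps CLIP latents to decoder latents
        return diffusion_decoder(image_latent)   # the diffusion-based decoder renders the image
    ```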

    Another large-scale diffusion-based model, Stable Diffusion, was developed by Rombach et al. (2022). It is a different variation of the same idea: it first trains an autoencoder that maps the pixel space into a latent space where imperceptible details are abstracted away and the image is compressed down to a smaller vector, and then performs conditional diffusion in this latent space to account for the text prompt and other conditions.

    I will not go into further detail right now, but here is a general illustration of the approach by Rombach et al.; it mostly concentrates on what’s happening in the latent space because the autoencoder is almost standard by now; note that the conditions are accounted for with Transformer-like encoder-decoder attention modules:

    Unlike DALL-E 2 and Midjourney (there is not even a paper written about Midjourney, let alone source code, so I cannot go into detail about how it works), Stable Diffusion comes with a GitHub repository where you can get the code and, most importantly, trained model weights to use. You can set it up on your home desktop PC (you don’t even need a high-end GPU, although you do need a reasonable one). All generated images used in this post have been produced with Stable Diffusion, and I’m very grateful to its authors for making this great tool available to everybody.

    Generative AI and synthetic data

    So where does this leave us? Does it mean that you can now generate synthetic data at will with very little cost by simply writing a good text prompt, so synthetic data as we understand it, produced by rendering 3D scenes, is useless?

    Far from it. Images generated even by state-of-the-art models do not come with perfect labeling, and generative models for 3D objects are still very far from production quality. If anything, synthetic data is in more demand now because researchers need more and more data to train these large-scale models, and at the same time they are developing new ways to do domain adaptation and make synthetic data increasingly useful for this process.

    However, this does not mean that state-of-the-art generative models cannot play an important role for synthetic data. One problem where we believe more research is needed is texture generation: while we cannot generate high-definition realistic 3D models, we can probably generate 2D textures for them, but this requires a separate model and training set because textures look nothing like photos or renders. Another idea would be to adapt generative models to modify synthetic images, either making them look more realistic (synthetic-to-real refinement) or simply making more involved augmentation-like transformations.

    In any case, we are living in exciting times with regard to generative models in machine learning. We will discuss these ideas in more detail in subsequent posts, and let’s see what else the near future will bring!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Facial Landmark Detection with Synthetic Data: Case Study

    Facial Landmark Detection with Synthetic Data: Case Study

    Today we have something very special for you: fresh results of our very own machine learning researchers! We discuss a case study that would be impossible without synthetic data: learning to recognize facial landmarks (keypoints on a human face) in unprecedented numbers and with unprecedented accuracy. We will begin by discussing why facial landmarks are important, show why synthetic data is inevitable here, and then proceed to our recent results.

    Why Facial Landmarks?

    Facial landmarks are certain key points on a human face that define the main facial features: nose, eyes, lips, jawline, and so on. Detecting such key points on in-the-wild photographs is a basic computer vision problem that could help considerably for a number of face-related applications. For example:

    • head pose estimation, that is, finding out where a person is looking and where the head is turned right now;
    • gaze estimation, a problem important for mobile and wearable devices that we discussed recently;
    • recognizing emotions (that are reflected in moving landmarks) and other conditions; in particular, systems recognizing driver fatigue often rely on facial landmarks in preprocessing.

    There are several different approaches to how to define facial landmarks; here is an illustration of no less than eight approaches from Sagonas et al. (2016) who introduced yet another standard:

    Their standard became one of the most widely used in industry. Named iBug68, it consists of 68 facial landmarks defined as follows (the left part shows the definitions, and the right part shows the variance of landmark points as captured by human annotators):

    The iBug68 standard was introduced together with the “300 Faces in the Wild” dataset; true to its name, it contains 300 faces with landmarks labeled by agreement of several human annotators. The authors also released a semi-automated annotation tool that was supposed to help researchers label other datasets—and it does quite a good job.

    All this happened back in 2013-2014, and numerous deep learning models have been developed for facial landmark detection since then. So what’s the problem? Can we assume that facial landmarks are either solved or, at least, are not suffering from the problems that synthetic data would alleviate?

    Synthetic Landmarks: Number does Matter

    Not quite. As it often happens in machine learning, the problem is more quantitative than qualitative: existing datasets of landmarks can be insufficient for certain tasks. 68 landmarks are enough to get good head pose estimation, but definitely not enough to, say, obtain a full 3D reconstruction of a human head and face, a problem that we discussed very recently and deemed very important for 3D avatars, the Metaverse, and other related problems. 

    For such problems, it would be very helpful to move from datasets of several dozen landmarks to datasets of at least several hundred landmarks that would outline the entire face oval and densely cover the most important lines on a human face. Here is a sample face with 68 iBug landmarks on the left and 243 landmarks on the right:

    And we don’t have to stop there, we can move on to 460 points (left) or even 1001 points (right):

    The more the merrier! Being able to detect hundreds of keypoints on a face would significantly improve the accuracy of 3D face reconstruction and many other computer vision problems.

    However, by now you probably already realize the main problem of these extended landmark standards: there are no datasets, and there is little hope of ever getting them. It was hard enough to label 68 points by hand, and the original dataset had only 300 photos; labeling several hundred points on a scale sufficient to train models to recognize them would certainly be prohibitive.

    This sounds like a case for synthetic data, right? Indeed, the face shown above is not real, it is a synthetic data point produced by our very own Human API. When you have a synthetic 3D face in a 3D scene that you control, it is absolutely no problem to have as many landmarks as you wish. What’s even more important, you can easily play with the positions of these landmarks and choose which set of points gives you better results in downstream tasks—imagine how hard it would be if you had to get updated human annotations every time you changed landmark locations!

    So at this point, we have a source of unlimited synthetic facial landmark datasets. It only remains to find out whether they can indeed help train better models.

    Training on Synthetic Landmarks

    We have several sets of synthetic landmarks, and we want to see how well we are able to predict them. As the backbone for our deep learning model we used HourglassNet, a venerable convolutional architecture that has been used for pose estimation and similar problems since it was introduced by Newell et al. (2016):

    The input here is an image, and the output is a tensor that specifies all landmarks.
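
    Hourglass-style networks usually output one heatmap per landmark; assuming that common formulation (it is a standard convention, not necessarily the exact output format of our model), landmark coordinates can be decoded from the heatmaps with a soft-argmax, roughly like this:

    ```python
    import torch

    def heatmaps_to_coords(heatmaps):
        """Soft-argmax decoding: (B, K, H, W) landmark heatmaps -> (B, K, 2) landmark coordinates."""
        b, k, h, w = heatmaps.shape
        probs = torch.softmax(heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
        ys = torch.arange(h, dtype=probs.dtype)
        xs = torch.arange(w, dtype=probs.dtype)
        y = (probs.sum(dim=3) * ys).sum(dim=2)  # expected row index for each landmark
        x = (probs.sum(dim=2) * xs).sum(dim=2)  # expected column index for each landmark
        return torch.stack([x, y], dim=-1)
    ```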

    To train on synthetic data, we couple this backbone with the discriminator-free adversarial learning (DALN) approach introduced very recently by Chen et al. (2022); this is actually another paper from CVPR 2022 so you can consider this part a continuation of our CVPR ‘22 series.

    Usually, unsupervised domain adaptation (UDA) works in an adversarial way by

    • either training a discriminator to distinguish between features extracted from source domain inputs (synthetic data) and target domain inputs (real data), training the model to make this discriminator fail,
    • or learning a source domain classifier and a target domain classifier at the same time, training them to perform identically on the source domain and as differently as possible on the target domain, while training the model to keep the classification results similar on the target domain.

    DALN suggested a third option for adversarial UDA: it trains only one classifier, with no additional discriminators, and reuses the classifier as a discriminator. The resulting loss function is a combination of a regular classification loss on the source domain and a special adversarial loss on the target domain that is minimized by the model and maximized by the classifier’s weights:
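
    Here is a minimal sketch of that min-max structure; this is not the exact DALN objective (which defines its own discriminator-free adversarial term), and `adversarial_loss` below is a placeholder for whichever adversarial criterion is used. The backbone minimizes both terms, while the classifier’s weights maximize the adversarial one, so the classifier doubles as a discriminator.

    ```python
    import torch
    import torch.nn as nn

    def adversarial_uda_step(backbone, classifier, opt_backbone, opt_classifier,
                             x_src, y_src, x_tgt, adversarial_loss):
        """Sketch of the min-max structure: the backbone minimizes both losses, while the
        classifier's weights maximize the adversarial term, acting as the discriminator."""
        loss_cls = nn.functional.cross_entropy(classifier(backbone(x_src)), y_src)  # source domain
        loss_adv = adversarial_loss(classifier(backbone(x_tgt)))                    # target domain

        # gradients for the backbone: minimize classification + adversarial losses
        grads_b = torch.autograd.grad(loss_cls + loss_adv, list(backbone.parameters()),
                                      retain_graph=True)
        # gradients for the classifier: minimize classification, maximize the adversarial loss
        grads_c = torch.autograd.grad(loss_cls - loss_adv, list(classifier.parameters()))

        for p, g in zip(backbone.parameters(), grads_b):
            p.grad = g
        for p, g in zip(classifier.parameters(), grads_c):
            p.grad = g
        opt_backbone.step()
        opt_classifier.step()
        return loss_cls.item(), loss_adv.item()
    ```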

    We have found that this approach works very well, but we had an additional complication. Our synthetic datasets have more landmarks than iBug68. This means that we cannot use real data to help train the models in a regular “mix-and-match” fashion, simply adding it in some proportion together with the synthetic data. We could pose our problem as pure unsupervised domain adaptation, but that would mean we were throwing away perfectly good real labelings, which also does not sound like a good idea.  

    To use the available real data, we introduced the idea of label transfer on top of DALN: our model outputs a tensor of landmarks as they appear in synthetic data, and then an additional small network is trained to convert this tensor into iBug68 landmarks. As a result, we get the best of both worlds: most of our training comes from synthetic data, but we can also fine-tune the model with real iBug68 datasets through this additional label transfer network.
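
    The label transfer network itself can be quite small; here is a hypothetical minimal version that maps a dense set of predicted landmark coordinates to the 68 iBug points. As discussed below, the label adaptation network we actually settled on is a small U-Net-like model, so treat this MLP purely as an illustration.

    ```python
    import torch.nn as nn

    class LabelTransfer(nn.Module):
        """Hypothetical label transfer head: dense predicted landmarks -> 68 iBug landmarks."""
        def __init__(self, num_dense=490, num_ibug=68, hidden=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(num_dense * 2, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_ibug * 2),
            )

        def forward(self, dense_landmarks):       # (B, num_dense, 2) predicted dense landmarks
            b = dense_landmarks.shape[0]
            out = self.net(dense_landmarks.reshape(b, -1))
            return out.reshape(b, -1, 2)          # (B, 68, 2) iBug-style landmarks
    ```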

    Finally, another question arises: okay, we know how to train on real data via an auxiliary network, but how do we test the model? We don’t have any labeled real data with our newly designed extra dense landmarks, and labeling even a test set by hand is very problematic. There is no perfect answer here, but we found two good ones: we either test only on the points that should exactly coincide with iBug landmarks (if such points exist) or train a small auxiliary network to predict iBug landmarks, fix it, and evaluate through it. Both approaches show that the resulting model is able to predict dense landmarks, and both synthetic and real data are useful for the model even though the real part only has iBug landmarks.

    Quantitative results

    At this point, we understand the basic qualitative ideas that are behind our case study. It’s time to show the money, that is, the numbers!

    First of all, we need to set the metric for evaluation. We are comparing sets of points that have a known 1-to-1 correspondence, so the most straightforward way would be to calculate the average distance between corresponding points. Since we want to be able to measure quality on real datasets, we need to use iBug landmarks in the metric, not extended synthetic sets of landmarks. And, finally, different images will have faces shown at different scales, so it would be a bad idea to measure the distances directly in pixels or fractions of image size. This brings us to the evaluation metric computed as

    $$\mathrm{err} = \frac{100}{n}\sum_{i=1}^{n}\frac{\left\|y_i - f(x_i)\right\|}{\left\|\mathrm{pupil}_{\mathrm{left}} - \mathrm{pupil}_{\mathrm{right}}\right\|},$$

    where n is the number of landmarks, y_i are the ground truth landmark positions, f(x_i) are the landmark positions predicted by the model, and pupil_left and pupil_right are the positions of the left and right pupils of the face in question, so the interpupillary distance serves as the normalization coefficient. The coefficient 100 is introduced simply to make the numbers easier to read.

    With this, we can finally show the evaluation tables! The full results will have to wait for an official research paper, but here are some of the best results we have now.

    Both tables below show evaluations on the iBug dataset. It is split into four parts: the common training set (you only pay attention to this metric to ensure that you avoid overfitting), the common test set (main test benchmark), a specially crafted subset of challenging examples, and a private test set from the associated competition. We will show the synthetic test set, common test set from iBug, and their challenging test set as a proxy for generalizing to a different use case.

    In the table below, we show all four iBug subsets; to get the predictions, in this table we predict all synthetic landmarks (490 keypoints in this case!) and then choose a subset of them that most closely corresponds to iBug landmarks and evaluate on it.

    The table above shows only a select sample of our results, but it already compares quite a few variations of our basic model described above:

    • the model trained on purely synthetic data; as you can see, this variation loses significantly to all other ways to train, so using real data definitely helps;
    • the model trained on a mix of labeled real and synthetic data with the help of label adaptation as we have described above; we have investigated several different variations of label adaptation networks and finally settled on a small U-Net-like architecture;
    • the model trained adversarially in an unsupervised way; “unsupervised” here means that the model never sees any labels on real data, it uses labeled synthetic data and unlabeled real data with an extra discriminator that ensures that the same features are extracted on both domains; again, we have considered several different ways to organize unsupervised domain adaptation and show only the best one here.

    But wait, what’s that bottom line in the table and how come it shows by far the best results? This is the most straightforward approach: train the model on real data from the iBug dataset (common train) and don’t use synthetic data at all. While the model shows some signs of overfitting, it still outperforms every other model very significantly.

    One possible way to sweep this model under the rug would be to say that this model doesn’t count because it is not able to show us any landmarks other than iBug’s, so it can’t provide the 490 or 1001 landmarks that other models do. But still — why does it win so convincingly? How can it be that adding extra (synthetic) data hurts performance in all scenarios and variations?

    The main reason here is that iBug landmarks are not quite the same as the landmarks that we predict, so even the nearest corresponding points introduce some bias that shows in all rows of the table. Therefore, we have also introduced another evaluation setting: let’s predict synthetic landmarks and then use a separate small model (a multilayer perceptron) to convert the predicted landmarks into iBug landmarks, in a procedure very similar to label adaptation that we have used to train the models. We have trained this MLP on the same common train set.

    The table below shows the new results.

    As you can see, this test-time label adaptation has improved the results across the board, and very significantly! However, they still don’t quite match the supervised model, so some further research into better label adaptation is still in order. The relative order of the models has remained more or less the same, and the mixed syn+real model with label adaptation done with a small U-Net-like architecture wins again, quite convincingly, although with a smaller margin than before.

    Conclusion

    We have obtained significant improvements in facial landmark detection, but most importantly, we have been able to train models to detect dense collections of hundreds of landmarks that have never been labeled before. And all this has been made possible with synthetic data: manual labeling would never allow us to have a large dataset with so many landmarks. This short post is just a summary: we hope to prepare a full-scale paper about this research soon.

    Kudos to our ML team, especially Alex Davydow and Daniil Gulevskiy, for making this possible! And see you next time!

    Sergey Nikolenko
    Head of AI, Synthesis AI

    P.S. Have you noticed the cover images today? They were produced by the recently released Stable Diffusion model, with prompts related to facial landmarks. Consider it a teaser for a new series of posts to come…

  • CVPR ‘22, Part IV: Synthetic Data Generation

    CVPR ‘22, Part IV: Synthetic Data Generation

    We continue the long series of reviews for CVPR 2022 papers related to synthetic data. We’ve had three installments so far, devoted to new datasets, use cases for synthetic data, and a very special use case: digital humans. Today, we will discuss papers that can help with generating synthetic data, so expect a lot of 3D model reconstruction, new generative models, especially in 3D, and generally a lot of CGI-related goodness (image generated by DALL-E-Mini by craiyon.com with the prompt “robot designer making a 3D mesh”).

    Introduction

    This is the fourth part of our CVPR in Review series (part I, part II, part III). Similar to previous posts, we have added today’s papers to the OpenSynthetics database, a public database of all things related to synthetic data that we have launched recently.

    The bulk of our discussion in this part is devoted to machine learning models that learn 3D objects (meshes, morphable models, surface normals) from photographs. The relation to synthetic data is obvious: one of the main bottlenecks in creating synthetic data is the manual labor that has to go into creating 3D models. After we have a collection of 3D scenes and 3D object models, the rest is more or less automatic, and we can easily produce a wide variety of datasets under highly varying conditions (object placement, lighting, camera position, weather effects, and so on) with perfect labeling, with all the usual benefits of synthetic data. But before we can have all these nice things, we need to somehow get the 3D models; any progress towards constructing them automatically, say from real world photographs, promises significant simplifications and improvements in synthetic data generation pipelines.

    We will also touch upon two different subjects: camera noise modeling and controlled 2D image generation. Let’s start with the last one.

    Modeling Image Composition for Complex Scene Generation

    Before we proceed to 3D meshes and point clouds, let me begin with a paper on 2D generation, but quite accurately controlled 2D generation. The work “Modeling Image Composition for Complex Scene Generation” (OpenSynthetics) by Yang et al. presents an interesting variation on DALL-E-type models: a Transformer with focal attention (TwFA) that can generate realistic images based on layouts, i.e., from object detection labeling. Like this:

    The architecture in this work has many similarities to DALL-E but is actually quite different. First, the basic VQ-VAE structure with a discrete codebook is there, but the codebook tokens do not come from text; they come from ground truth images (during training) and layouts:

    Second, as you can see above, there is a Transformer for generating the tokens, but it is not a text-based Transformer; it is a variation on the visual Transformer. Its job is to serve as a “token model” that produces a sequence of VQ-VAE tokens in an autoregressive fashion based on the layout. On inference, the model runs this Transformer to generate tokens, and then the VQ-VAE decoder produces an image based on the sequence of tokens:

    But the most important novelty, the one that actually lets this model use layouts in a very interesting way, is a new approach to the attention mechanism in the Transformer. In the classical Transformer, self-attention layers have every token attend to every other token; if we generate the sequence autoregressively, this means every previous token. Focal attention proposed in this work uses masks to force each token to attend only to the patches and objects that actually relate to it in the scene. Without going into too much detail, here is an illustration of what the masks look like in different variations of this idea:

    This is exactly what lets the model generate images that reflect input layouts very well. And it doesn’t even need a huge dataset with object detection labeling for this: the authors used the classical COCO-Stuff and Visual Genome datasets. A comparison with other models that tried to tackle the same task is telling:

    Naturally, the paper is devoted to generation and does not try to use generated images as synthetic datasets. But I view it as an interesting step in the direction of controlled generation; we’ve seen before that even very artificial-looking images can be helpful for training large computer vision models, so it would be interesting to check if images generated in such a controlled way could be helpful too.

    It would be an especially interesting case of bootstrapping if one could use images generated by a VQ-VAE and a Transformer to improve the pretraining of these same models—I’m not claiming it’s going to help, I haven’t made or seen any experiments, but it’s an enticing thought to check out.

    Photorealistic Facial Expression Transfer for VR

    In the main part today, we will proceed from more specialized models to more general ones. The first two works are devoted to perhaps the most interesting and one of the most complex single objects in 3D modeling: heads and faces. Evolution has made us humans very good at reading facial expressions and recognizing faces; we usually like to look at other people and have had a lot of practice. This is why it’s very easy to get it wrong: “uncanny valley” examples usually feature human heads and faces.

    In “Robust Egocentric Photo-realistic Facial Expression Transfer for Virtual Reality” (OpenSynthetics), Facebook researchers Jourabloo et al. consider the problem of generating virtual 3D avatars that would reflect our facial expressions in VR. This makes perfect sense for Facebook as their Oculus Quest 2 VR headset has become one of the most successful models to date.

    There are many works on capturing and animating avatars, but compared to other settings, a VR headset is different: we need to model the facial expression of a person who is… well, wearing a VR headset! This sounds very challenging but, on the other hand, we have three cameras that show the two eyes (covered by the headset) and the lower part of the face. Here is what the input looks like in this system, with camera locations shown on the right:

    Jourabloo et al. propose a multi-identity architecture that takes as input three images like the ones above and a neutral 3D mesh of a person, and produces the modified mesh and the texture to go with it. There are several parts of the architecture: one for shape modification, another for texture generation, and yet another for putting them together and rasterizing:

    By the way, a major factor in improving the results, as it often happens in computer vision, was augmentations: the authors propose to model 3D augmentations (such as slightly moving or rotating the camera) by 3D rotation and translation of the face shape in the training set, that is, by changing the premade 3D shape—looks like another win for synthetic data to me!

    The results are quite impressive; here is a comparison with ground truth 3D models:

    And here are sample final results of the whole pipeline, in comparison with a previous work and again with the ground truth on the right:

    VR technology is constantly evolving, and these examples already look perfect. Naturally, the hardest part here is not about getting some excellent cherry-picked results, but about bridging the gap between research and technology: it would be very interesting to see how well models like this one perform in real world settings, e.g., on my own Oculus Quest 2. Still, I believe that this gap is already not too wide, and we will be able to try photorealistic virtual worlds very soon.

    JIFF: Jointly-aligned Implicit Face Function for Single-View Reconstruction

    In comparison with VR avatars, this work, also devoted to reconstructing 3D faces, clearly shows the difference between possible problem settings. In “JIFF: Jointly-aligned Implicit Face Function for High Quality Single View Clothed Human Reconstruction” (OpenSynthetics), Cao et al. set out to reconstruct the 3D mesh from a single photograph of a clothed human. Here are the results of the best previous model (PIFu from ICCV 2019, which we discussed in a previous post) and the proposed approach:

    The improvement is obvious, but the end result is still far from photorealistic. As we discussed last time, the PIFu family of models uses implicit functions, modeling a surface with a parameterized function so that its zero level set is that surface. Another important class of approaches includes 3D morphable models (e.g., the original 3DMM or the recently developed DECA) that capture general knowledge about what a human face looks like and represent individualized shapes and textures in some low-dimensional space.

    So a natural idea would be to join the two approaches, using 3DMM as a prior for PiFU-like reconstruction. This is exactly what JIFF does, incorporating 3DMM as a 3D face prior for the implicit function representation:

    You’ve seen the results above, so let’s keep this section short. The conclusion here is that to get good high-resolution 3D models at this point you need some very good inputs. Maybe there already exists some combination of approaches that could take a single photo, learn the shape like this one, and then somehow upscale and improve the textures to more or less photorealistic results, but I’ve yet to see it. And this is a good thing—there is still a lot of room for research and new ideas!

    BANMo: Building Animatable 3D Neural Models from Many Casual Videos

    Human heads are a very important object in synthetic data, but let’s move on to a model, presented in “BANMo: Building Animatable 3D Neural Models from Many Casual Videos” (OpenSynthetics) by Yang et al., that promises to capture a whole object in 3D. And not just capture it but provide an animatable 3D model able to learn possible deformations of the object in motion. Naturally, to get this last part it requires much more than a single image, namely a collection of “casual videos” that contain the object in question. Oh, and I almost forgot the best part: the “object in question” could be a cat!

    So how does it work? We again come back to implicit functions. A 3D point in BANMo (a Builder of Animatable 3D Neural Models) has three properties: color, density, and a low-dimensional embedding, and all three are modeled implicitly by trainable multilayer perceptrons. This is very similar to neural radiance fields (NeRF), a very hot topic in this year’s CVPR and one that deserves a separate discussion. Deformations are modeled with warping functions that map a canonical location in 3D to the camera space location and back. Pose estimation in BANMo is based on DensePose-CSE, which actually limits the model to humans and quadrupeds (thankfully, cats are covered). And to get from the 3D deformed result to 2D pixels, BANMo uses a differentiable rendering framework, which is yet another can of worms that I don’t want to open right now.
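
    To give a sense of what “modeled implicitly by trainable multilayer perceptrons” means, here is a generic sketch of an implicit neural field in the spirit of NeRF; it is not BANMo’s actual architecture, just an illustration of a small MLP mapping a 3D point to color, density, and an embedding.

    ```python
    import torch.nn as nn

    class ImplicitField(nn.Module):
        """Generic implicit neural field: 3D point -> (RGB color, density, feature embedding)."""
        def __init__(self, hidden=256, embed_dim=16):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.color = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())     # RGB in [0, 1]
            self.density = nn.Sequential(nn.Linear(hidden, 1), nn.Softplus())  # non-negative density
            self.embedding = nn.Linear(hidden, embed_dim)  # low-dimensional per-point embedding

        def forward(self, xyz):  # xyz: (N, 3) points in the canonical space
            h = self.trunk(xyz)
            return self.color(h), self.density(h), self.embedding(h)
    ```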

    Overall, it’s a pretty complicated framework with a lot of moving parts:

    But, as it often happens in successful deep learning applications, by carefully selecting the losses the authors are able to optimize all the parts jointly, in an end-to-end fashion.

    Here is a comparison with previous work on similar problems:

    As you can see, the results are not quite perfect but already very good. BANMo requires a lot of input data: the authors speak of thousands of frames in casual videos. However, collecting this kind of data is far easier than trying to get a 3D model via a hardware solution or recording videos in a multi-camera setup. If you have a cat you probably already have enough data for BANMo. The implications for synthetic data are obvious: if you can get a new model complete with movements and deformations automatically from a collection of videos, this may reduce the production cost for 3D models by a lot.

    High-Fidelity Rendering of Dynamic Humans from a Single Camera

    Clothing is hard. When a girl dances in a dress, the cloth undergoes very complicated transformations that are driven by the dance but would be extremely difficult to capture from a monocular video. In fact, clothes are the hardest part of moving from heads and faces (where clothing is usually limited to rigid accessories) to full-body 3D reconstruction of humans.

    In “Learning Motion-Dependent Appearance for High-Fidelity Rendering of Dynamic Humans from a Single Camera” (OpenSynthetics), Adobe researchers Yoon et al. try to tackle the problem of adding clothes to human 3D models in a realistic way that would be consistent with motion. The problem would be to take a 3D body model as input and output a clothed 3D body model and the corresponding rendering:

    The main challenge here is the lack of data: it would be probably possible to learn the dynamics of secondary motion (e.g., clothing) from videos but that would require a very large labeled dataset that covers all possible poses. In realistic scenarios, this dataset is nonexistent: we usually have only a short YouTube or TikTok video of the moving person.

    This means that we need to have strong priors about what’s going on in secondary motion. Yoon et al. propose the equivariance property: let’s assume that per-pixel features learned by the model’s encoder are transformed in the same way as the original body pose is transformed. The encoder produces 3D motion descriptors for every point, and the decoder’s job is to actually model the secondary motion and produce surface normals and appearance of the final result:

    The results are very impressive; in the figure below the rightmost column is the ground truth, the second on the right is the proposed model, and the rest are baselines taken from prior art:

    Moreover, in the experiments the poses and surface normals (inputs to the model) are not assumed to be known but are also captured from the input video with a specially modified 3D tracking pipeline. This makes it possible to apply the model to standalone videos and also enables a number of other applications:

    Overall, this year’s CVPR shows a lot of significant improvements in 3D reconstruction from 2D input, be it a single image, a short clip, or a whole collection of videos. This is yet another excellent example of such work.

    High-Fidelity Garment mesh Reconstruction from Single Images

    Let’s also briefly mention another work that deals with a very similar problem: reconstructing the 3D mesh of clothes from a single image. The ReEF model (registering explicit to implicit), proposed in “Registering Explicit to Implicit: Towards High-Fidelity Garment mesh Reconstruction from Single Images” (OpenSynthetics) by Zhu et al., tries to extract high-quality meshes of clothing items from a single photograph.

    The main idea is to use an explicitly given 3D mesh of an item of clothing and learn from the image a function of how to deform it to match the appearance on the image. This is achieved by segmenting the input into clothing items and their boundaries (in 3D!) and then fitting a standard (T-pose) 3D mesh to the results:

    I will not go into details here, but the results are quite impressive. The resulting meshes can then be applied to other 3D meshes, re-deforming them to match:

    That is, you can automatically fit a virtual character with your clothes! When this technology is further improved (I don’t think it’s quite there yet), this may be a very important piece of the puzzle for large-scale synthetic data generation. If we can get items of clothing in bulk from stock photos of clothed humans and eliminate the need to model these items by hand, it will significantly increase variation in synthetic data without adding much to the cost.

    PhotoScene: Photorealistic Material and Lighting Transfer for Indoor Scenes

    Our last 2D-to-3D paper today is “PhotoScene: Photorealistic Material and Lighting Transfer for Indoor Scenes” (OpenSynthetics) by UCSD and Adobe researchers Yeh et al. On the one hand, it’s the natural culmination of our small-to-large progression: PhotoScene deals with photographs of whole interiors.

    On the other hand, PhotoScene tackles a problem that can be crucial for synthetic data generation. Most works that deal with 2D-to-3D reconstruction (including the ones above) concentrate on trying to get the 3D model itself (in the form of a mesh, surface normals, or some other representation) exactly right. While this is very important, it’s only one piece of the puzzle. PhotoScene presents a solution for the next step: given a preexisting coarse 3D model of the interior (either automatically generated or produced by hand), it tries to capture the materials, textures, and scene illumination to match the photo.

    An interesting part of the pipeline that we have never discussed before is material priors, that is, low-dimensional parametric representations of various materials that are resolution-independent and can produce textures of numerous variations of materials by varying their parameters.

    The authors use MATch, a recently developed framework that defines 88 differentiable procedural graphs capable of capturing a wide range of materials with a lot of detail and variation, producing textures at unlimited resolution that are ready to be applied to 3D models, relit, and otherwise transformed:

    Using this work as a basis, PhotoScene learns to align parts of the input photo with elements of the coarse 3D scene, extracts the most suitable procedural MATch graphs, and learns their parameters. As a result, it produces a scene with high-quality materials and refined lighting, ready to render in new views or with new lighting:

    PhotoScene and, more generally speaking, material priors in the form of procedural graphs also represent a very important novelty for synthetic data generation. With models such as this one, we are able to obtain 3D representations of whole scenes (interiors in this case) that can then be used as the basis for synthetic data. It is not hard to throw together a coarse 3D model for a home interior: naturally, there already exist a lot of 3D models for furniture, decor, covers, and other home items, so it’s more a matter of choosing the most suitable ones. The hard part of 3D modeling here is to get the materials right and set up the lighting in a realistic way—exactly what PhotoScene can help with. Moreover, instead of rasterized textures it produces material graphs that can be used in any resolution and under any magnification—another very important advantage that can let us improve the quality and variety of the synthetic output.

    Modeling sRGB Camera Noise with Normalizing Flows

    In conclusion, as usual, let’s go for something completely different. In “Modeling sRGB Camera Noise with Normalizing Flows” (OpenSynthetics), Samsung researchers Kousha et al. present a new approach to modeling the noise introduced by real world cameras.

    This is a very specialized problem, but a very important one for quite a few applications, not just image denoising. In particular, I personally have worked in the past on superresolution, the problem of increasing resolution and generally improving low-quality photographs. Somewhat surprisingly, noise modeling proved to be at the heart of superresolution: modern approaches such as KernelGAN or RealSR introduce a parametric noise model, either to learn it on the given photo and then upscale it while reducing this noise or to inject this noise for degradation during their single-image training. It turned out that the final result very much depends on how well the noise model was learned and how expressive it was.

    For synthetic data, this means that for many problems, to have really useful synthetic data we would need to have a good noise model as well. If superresolution models were trained on clean synthetic images with noise introduced artificially with some kind of a simple model (probably Gaussian noise), they would simply learn this model and produce atrocious results on real life photos whose noise is generated very differently.

    Technically, Kousha et al. use normalizing flows, a very interesting and increasingly important family of generative models that model the density p(x) as a composition of several relatively simple invertible transformations with a tractable (triangular) Jacobian whose determinant is easy to compute; this structure ensures either efficient density estimation (masked autoregressive flows, MAF) or efficient sampling (inverse autoregressive flows, IAF). It would take a separate post to fully introduce normalizing flows (I hope I’ll write it someday), so let’s just say that here Kousha et al. introduce a conditional linear flow that is conditioned on the camera and its gain setting and use it in a whole pipeline designed to model several different sources of noise present in real world cameras:
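
    For a feel of the general principle, here is a tiny self-contained sketch of a normalizing flow built from RealNVP-style affine coupling layers; this is a simpler flow than the conditional flows used in the paper, but it shows how an invertible transformation with an easy-to-compute Jacobian determinant yields an exact log-density via the change of variables formula.

    ```python
    import math
    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        """One RealNVP-style coupling layer: invertible, with a triangular Jacobian whose
        log-determinant is just the sum of the predicted log-scales."""
        def __init__(self, dim, hidden=64):
            super().__init__()
            self.half = dim // 2
            self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * (dim - self.half)))

        def forward(self, x):
            x1, x2 = x[:, :self.half], x[:, self.half:]
            log_s, t = self.net(x1).chunk(2, dim=-1)
            log_s = torch.tanh(log_s)             # keep the scales well-behaved
            y2 = x2 * log_s.exp() + t             # transform one half conditioned on the other
            log_det = log_s.sum(dim=-1)           # log |det J| of this transformation
            return torch.cat([x1, y2], dim=-1), log_det

    def flow_log_prob(x, flows):
        """Exact log-density via change of variables, with a standard normal base distribution."""
        log_det_total = torch.zeros(x.shape[0])
        for flow in flows:                        # map the data towards the base distribution
            x, log_det = flow(x)
            log_det_total = log_det_total + log_det
        log_base = -0.5 * (x ** 2 + math.log(2 * math.pi)).sum(dim=-1)
        return log_base + log_det_total
    ```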

    As a result, they learn a noise distribution that is much closer to the real noise distribution (in terms of KL divergence) than previous efforts. There are no beautiful pictures to show here; it’s all about small variations in noisy patches viewed under high magnification, but trust me, as I’ve tried to explain above, this line of research may prove to be very important for synthetic data generation.

    Conclusion

    Phew, that was a long one! Today we have discussed several papers from CVPR 2022 that can help with synthetic data generation; I have tried to be thorough but definitely do not claim this post to be an exhaustive list. We have run the gamut from low-level computer vision (camera noise) to reconstructing animatable models from video collections. Next time, we will survey what CVPR 2022 has brought to the table of domain adaptation, another hugely important topic in synthetic data. See you then!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • CVPR ‘22, Part III: Digital Humans

    CVPR ‘22, Part III: Digital Humans

    Last time, we talked about new use cases for synthetic data, from crowd counting to fractal-based synthetic images for pretraining large models. But there is a large set of use cases that we did not talk about, united by their relation to digital humans: human avatars, virtual try-on for clothes, machine learning for improving animations in synthetic humans, and much more. Today, we talk about the human side of CVPR 2022, considering two primary applications: conditional generation for applications such as virtual try-on and learning 3D avatars from 2D images (image generated by DALL-E-Mini by craiyon.com with the prompt “virtual human in the metaverse”).

    Introduction and the Plan

    In the first post of this series, we talked about new synthetic datasets presented at CVPR ‘22. The second post was devoted to various practical use cases where synthetic data has been successfully used. Today, we dive deeper into a single specific field of application related to digital humans, i.e., models that deal with generating either new images of humans or 3D models (virtual avatars) that can be later animated or put into a metaverse for virtual interaction.

    Just like in the previous posts, papers will be accompanied by links to OpenSynthetics, a public database of all things related to synthetic data that we have launched recently. We have two important directions in today’s post: conditional generation with different features (usually for virtual try-on applications) and trying to learn synthetic human avatars from photographs. Let me begin with a paper that, in a way, combines the two.

    BodyGAN

    We begin with “BodyGAN: General-purpose Controllable Neural Human Body Generation” (OpenSynthetics), where Yang et al. continue a long line of work devoted to generating images of humans with GANs. Throughout the history of GAN development, humans always showcased the progress, from the earliest attempts that couldn’t capture human faces at all to the intricate modifications allowed by the StyleGAN family. I’ve been showing this famous picture by Ian Goodfellow in my lectures since 2018:

    And current results by, say, StyleGAN 3 are much more diverse and interesting:

    The classical line of improvements, however, only dealt with the faces, mostly inspired by the CelebA dataset of celebrity photos. Generating full-scale humans with different poses and clothing is a much harder task, especially if you wish to control these parameters separately. There has been previous work, including StyleRig, which tried to add 3D rigging control to StyleGAN-generated images, and StylePoseGAN, which added explicit control over pose, and these works are exactly what BodyGAN promises to improve upon.

    Let’s briefly go through the main components of BodyGAN:

    It has three main components: 

    • the pose encoding branch that includes three subnetworks for body parts segmentation, 3D surface mapping, and key point estimation;
    • the appearance encoding branch that produces encodings (condition maps) separately for different body parts;
    • and the generator that is supposed to produce realistic images based on these conditions.

    Training utilizes two discriminators, one for the pose branch and another for the appearance branch, and during inference one can substitute new shape, pose, and appearance encodings (e.g., change the skin color) to obtain new realistic images:

    So overall it is a relatively straightforward architecture that hinges on explicit disentanglement between different features, and the network architectures are also quite standard (e.g., discriminators are taken from pix2pixHD). Interestingly, it works better than previous results; here are some characteristic samples for the main application in the paper, virtual try-on (with conditions shown as small images in the corners):

    We will see more results about virtual try-on below; it was a hot topic at CVPR ‘22. However, this work shows that even a relatively straightforward but well-executed take on the problem can produce very good results. Overall, it looks like we are almost there in regard to these kinds of conditional generation and style transfer applications for images; I would expect truly photorealistic results quite soon.

    Dressing in the Wild

    But that conclusion was only about images; producing convincing photorealistic videos is much harder, and we still have some way to go here. The work “Dressing in the Wild by Watching Dance Videos” (OpenSynthetics) by ByteDance researchers (ByteDance is the parent company of TikTok) takes a middle ground: they do show some results on videos but primarily use videos to perform better garment transfer on still images with challenging poses.

    First, they present a dataset of 50000 real life single-person dance videos, Dance50k, with a lot of different garments and poses (at the time of writing, Dance50k was not yet fully available but it’s supposed to be released at the project page). The videos do look diverse enough to get a wide variety of different poses in the wild:

    The model itself is interesting and stands out among the usual GAN-based conditional generation. It is called wFlow but the word “flow” is not about flow-based generative models that have become increasingly popular over the last few years. This time, it is about optical flow estimation: the model has a component estimating where each pixel in the source image should go in the target image.

    Let us go through the pipeline. The input includes a source image of a person where the garment comes from and a query pose image where the pose comes from. The wFlow model works as follows:

    • first (this is not shown in the image above), the authors apply OpenPose to estimate the positions of 18 body joints, a pretrained person segmentation model to obtain segmentation maps for source and query images, and a pretrained mesh extraction model to obtain a dense representation of the 3D mesh extracted from the images;
    • the conditional segmentation network (CSN) takes as input a person source segmentation map, its dense pose representation, and body joints, and produces the target segmentation mask and layout of different body parts;
    • the pixel flow network (PFN) takes the same inputs plus the segmentation mask produced by CSN and predicts the pixel flow, i.e., locations at the target frame where the source frame pixels should map to;
    • an entirely novel part of wFlow is the next step, where the predicted 2D pixel flow is improved with dense pose representations from extracted meshes, fusing the 3D vertex flow with 2D pixel flow;
    • then the resulting flow guides three UNet-based generators; two of them are needed to complete the cycles during training, and the third will actually be used on inference for garment transfer.

    There are some more interesting tricks in the paper, but let’s skip those and get to sample results. First, you can see why videos are hard; video results do have some inconsistencies and flicker across frames, and the lighting is hard to get right:

    But as for still images, the model produces excellent results that already look quite sufficient for the virtual try-on application:

    IMavatar: Human Head Avatars from Video

    We now move on from garment transfer to learning digital avatars. The main difference here is that you have to construct a 3D avatar from 2D images, and then perhaps teach that avatar a few tricks. The first paper of this batch, “I M Avatar: Implicit Morphable Head Avatars from Videos” (OpenSynthetics), concentrates on models of human heads.

    In synthetic data and generally computer generated graphics, human heads are often represented with 3D morphable face models (3DMMs); it is a huge field starting from relatively simple parametric models in the late 1990s and continuing these days into much more detailed and nonlinear neural parametric face models. The idea is to model the appearance and facial geometry in a lower-dimensional representation, together with a decoder to produce the actual models (meshes). This idea underlies, in particular, our very own HumanAPI, and here at Synthesis AI we are also investigating new ideas for human head generation based on 3DMMs. This field is also closely related to neural volumetric modeling, e.g., neural radiance fields (NeRF) that are rapidly gaining traction; I hope to devote a later post to the recent developments of NeRFs at CVPR ‘22.

    In this work, Zheng et al. base their approach on the FLAME model that parameterizes shape, pose, and expression components. Basically, the 3DMM here consists of three networks (neural implicit fields): one predicts the occupancy values for each 3D point, another one predicts deformations, i.e., transformations of canonical points (points from the original model) to new locations based on facial expressions, and the third provides textures by mapping each location to an RGB color value.

    The crux of the paper lies in how to train these three networks. The main approach here is known as implicit differentiable rendering (IDR), an idea that certainly deserves a separate post. In essence, the neural rendering model produces RGB values for a given camera position (also learnable) and image pixel, and the whole thing is trained to reproduce the actual pixel values:

    As a result, this network is able to generate (render) new views from previously unseen angles. Zheng et al. adapt this approach to their 3DMM; this requires some new tricks to deal with the iterative nature of finding the correspondences between points (it’s hard to propagate gradients through an iterative process). I will not go into these details, but here is an illustration of the resulting pipeline, where all three networks can be trained jointly in an end-to-end fashion:

    As a result, the model produces an implicit representation of a given human head, which means that you can generate new views, new expressions and other modifications from this model. Here is how it works on synthetic data:

    And here are some real examples:

    Looks pretty good to me!

    FaceVerse: Coarse-to-Fine Human Head Avatars

    In this collaboration (OpenSynthetics) between Tsinghua University and Ant Group (a company affiliated with Alibaba Group), Wang et al. also deal with learning 3D morphable models of human faces. In this case, the emphasis is on the data: not synthetic data, unfortunately for this blog, but data nevertheless.

    As in many other fields, 3D face datasets come in two varieties: large but coarse, or high-quality but small. It’s easy to get a rough dataset with the ToF cameras built into many modern smartphones, but to get a high-definition 3D scan you need expensive hardware that only exists in specialized labs. Wang et al. do both, collecting a large coarse dataset (on the left below) and a small high-quality dataset (on the right):

    The FaceVerse model then proceeds in the same coarse-to-fine fashion: the authors first fit a classical PCA-based 3D morphable model on the coarse dataset and then refine it with a detailed StyleGAN-like model, using the smaller high-quality dataset to fine-tune the detail refinement part:

    These steps are then reproduced at inference, yielding a model that gradually refines the 3D model of a face and produces very high quality results at the end:

    Overall, it appears that while there is still some way to go, monocular face reconstruction may soon become basically solved. State-of-the-art models are already doing an excellent job; give it a few more years, and while the results may still fall short of movie-ready photorealistic quality, they will be more than enough to cover our needs for realistic avatars in 3D metaspaces.

    PHORHUM: Monocular 3D Reconstruction of Clothed Humans

    The previous two papers were all about heads and faces, but what about the rest of us? In “Photorealistic Monocular 3D Reconstruction of Humans Wearing Clothing” (OpenSynthetics), Google researchers Alldieck et al. present a deep learning model that can take a photo and create a full-body 3D model, complete with clothing.

    This is far from a new problem; previous approaches include, e.g., PIFu from USC and Waseda University, Geo-PIFu from UCLA and Adobe, and PIFuHD from Facebook. This line of models already produced very good results, extracting voxel features from a single image and filling in the occluded details. An important drawback of these works, however, was how they handled surface color: usually the resulting model had its color taken directly from the photo, with shading effects baked in and hard to disentangle from geometry. This made it difficult to use the resulting model in any way except copy-and-paste: even changing the lighting could produce rather bad results.

    In essence, PHORHUM continues the line of PIFu (pixel-aligned implicit function) models: the idea is to represent a 3D surface as a level set of a function f, e.g., the set of points x such that f(x)=0. In this way, you don’t need to store the actual voxels and are free to parameterize the function f in any way you choose—obviously, these days you would choose to parameterize it as a neural network.

    In PIFu models, the image is encoded with an hourglass network to obtain point-specific features, and then the surface is defined by a multilayer perceptron that takes as input the features of the current point (its projection onto the image) and its depth. The original PIFu had two different functions, one to encode the surface itself and another to predict the RGB values at the current point:
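    Here is a hedged sketch of the pixel-aligned idea; the image encoder, feature dimensions, and the two heads below are placeholders rather than the actual PIFu networks:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PixelAlignedImplicitFunction(nn.Module):
        """Toy pixel-aligned implicit function: occupancy + color from pixel-aligned image features."""
        def __init__(self, feat_dim=64):
            super().__init__()
            # Stand-in for the hourglass image encoder (keeps spatial resolution).
            self.encoder = nn.Sequential(
                nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            )
            # Two MLP heads, as in the original PIFu: one for the surface, one for RGB.
            self.surface_head = nn.Sequential(nn.Linear(feat_dim + 1, 128), nn.ReLU(), nn.Linear(128, 1))
            self.color_head = nn.Sequential(nn.Linear(feat_dim + 1, 128), nn.ReLU(), nn.Linear(128, 3))

        def forward(self, image, xy, z):
            # image: (B, 3, H, W); xy: (B, N, 2) projected 2D coordinates in [-1, 1]; z: (B, N, 1) depth.
            feat_map = self.encoder(image)                              # (B, C, H, W)
            grid = xy.unsqueeze(2)                                      # (B, N, 1, 2)
            feats = F.grid_sample(feat_map, grid, align_corners=True)   # (B, C, N, 1)
            feats = feats.squeeze(-1).permute(0, 2, 1)                  # (B, N, C)
            inp = torch.cat([feats, z], dim=-1)
            occupancy = torch.sigmoid(self.surface_head(inp))           # inside/outside probability
            rgb = torch.sigmoid(self.color_head(inp))
            return occupancy, rgb

    # Usage: query 4096 3D points projected onto a single 512x512 image.
    model = PixelAlignedImplicitFunction()
    occ, rgb = model(torch.randn(1, 3, 512, 512), torch.rand(1, 4096, 2) * 2 - 1, torch.randn(1, 4096, 1))
    ```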

    To cope with the problem of colors and shading, PHORHUM tries to explicitly disentangle the unshaded color of every point on the surface from lighting effects. This means that the function is trained to output not the actual color of a pixel but the albedo color, that is, the base color of the surface; PHORHUM then has a separate shading network that modifies it according to lighting conditions:
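    A minimal sketch of this albedo/shading decomposition; the illumination representation and network sizes are my assumptions for illustration, not PHORHUM’s actual design:

    ```python
    import torch
    import torch.nn as nn

    class AlbedoShadingModel(nn.Module):
        """Predict unshaded albedo per surface point; a separate network predicts shading."""
        def __init__(self, feat_dim=64, light_dim=16):
            super().__init__()
            self.albedo_net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 3))
            self.shading_net = nn.Sequential(
                nn.Linear(3 + light_dim, 128), nn.ReLU(), nn.Linear(128, 3)  # surface normal + lighting code
            )

        def forward(self, point_features, normals, light_code):
            albedo = torch.sigmoid(self.albedo_net(point_features))      # base color, lighting-free
            shading = torch.relu(self.shading_net(torch.cat([normals, light_code], dim=-1)))
            return albedo * shading, albedo                               # shaded color and raw albedo

    # Relighting amounts to keeping the albedo and swapping in a new light_code.
    model = AlbedoShadingModel()
    shaded, albedo = model(torch.randn(4096, 64), torch.randn(4096, 3), torch.randn(4096, 16))
    ```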

    As a result, you obtain the albedo colors of skin and clothing, and then it becomes much easier to automatically animate the resulting models, adapting to new lighting as needed:

    To get this kind of quality, you are supposed to have a good shot of the person, but it doesn’t have to be a lab shot with a white or green background; almost anything will do:

    Overall, between the previous papers and this one it seems that we will soon have perfectly acceptable virtual avatars walking around various metaverses. I have my doubts about whether this will usher in a new era of remote workplaces and entirely new forms of entertainment—at the very least, we’d first need something less cumbersome than a VR helmet to navigate these metaverses. But it looks like the computer vision part is almost there already.

    Speech-Driven Tongue Animation

    Finally, let me conclude today’s post with something completely different. Have you ever wondered how animated movies match the characters’ speech with their mouths and tongues? Currently, there are two answers: either poorly or very, very laboriously. In computer games and low-budget animation, character models usually have several motions for different vowels and consonants and try to segue from one to another in a more or less fluid way. In high-budget animation (think Pixar), skilled animators have to painstakingly match the movements of the palate and tongue to speech, a process that is both very difficult and very expensive.

    In “Speech Driven Tongue Animation” (OpenSynthetics), Medina et al. from Carnegie Mellon University and Epic Games present a model for automatically generating tongue movements that match the speech. To get the data, you need to do tongue motion capture; I would never have thought it was possible, but apparently people have been doing it for medical purposes for a long time:

    After that, you need an encoder to convert speech into features and a decoder that takes these features and produces the tongue animation. The authors tried several different encoders and decoders, choosing the best results among both classical and recently introduced feature extractors:
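    As a rough illustration of this encoder-decoder structure, here is a toy sketch; the audio features, the GRU encoder, and the number of landmarks are placeholders, not the configuration chosen in the paper:

    ```python
    import torch
    import torch.nn as nn

    class SpeechToTongueLandmarks(nn.Module):
        """Toy speech-to-animation model: audio features in, per-frame landmark coordinates out."""
        def __init__(self, audio_dim=80, hidden=256, num_landmarks=10):
            super().__init__()
            # Encoder: any audio feature sequence would do here (mel spectrograms, wav2vec features, etc.).
            self.encoder = nn.GRU(audio_dim, hidden, num_layers=2, batch_first=True, bidirectional=True)
            # Decoder: maps the encoded sequence to 3D landmark positions for every frame.
            self.decoder = nn.Sequential(
                nn.Linear(2 * hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_landmarks * 3),
            )

        def forward(self, audio_features):
            # audio_features: (batch, time, audio_dim), e.g., a mel spectrogram.
            encoded, _ = self.encoder(audio_features)           # (batch, time, 2 * hidden)
            landmarks = self.decoder(encoded)                   # (batch, time, num_landmarks * 3)
            return landmarks.view(*landmarks.shape[:2], -1, 3)  # (batch, time, num_landmarks, 3)

    model = SpeechToTongueLandmarks()
    out = model(torch.randn(2, 200, 80))   # 2 clips, 200 audio frames each
    ```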

    Landmark locations are then postprocessed to get the actual animation. This paper won the Best Demo Award at CVPR ‘22, so be sure to check out their website and in particular their video with examples and descriptions.

    The paper is affiliated with Epic Games, so I would expect this feature to make its way into Unreal Engine 6 or something, but this paper got me thinking about another possible application. I am not a native English speaker, and although I usually watch movies in English, my 11-year-old daughter, naturally, requests Russian voices when we watch Pixar/Disney movies. The modern dubbing industry is quite advanced and goes to great lengths to make speech in a different language more or less fit the mouths animated for English… sometimes at the cost of meaning. It would be enormously expensive to re-animate movies for different languages by hand, but thanks to advances like this one, maybe one day animated movies will be distributed in several different languages with different lip animations produced automatically. And judging by the other results we have discussed in this post, maybe one day live-action movies will too…

    Conclusion

    In this third post about the results of CVPR ‘22, we have discussed several papers on virtual humans, a topic that has remained important at CVPR for many years. In particular, we discussed two important use cases: conditional generation, usually in the form of virtual try-on, and the production of 3D avatars from 2D images, both for heads/faces and for full-body avatars. Both problems are key areas of application for synthetic data, as we have seen today and as we have been working on here at Synthesis AI.

    Our next topic will be similar but not directly related to humans anymore: we will discuss generation of synthetic data (or any new photo and video material) based on 3D reconstruction and similar approaches. Stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • AI Interviews: Victor Lempitsky

    AI Interviews: Victor Lempitsky

    Meet our distinguished guest for the third interview: Professor Victor Lempitsky. Prof. Lempitsky is among the best researchers in machine learning, with an especially strong record in computer vision (here is his Google Scholar account). Currently, Victor is leading the Computer Vision Group at Skoltech (Skolkovo Institute of Science and Technology) and is the VR project leader at Yandex.

    Foreword. Before we begin, I have to say that this interview was composed before February 24, 2022. In fact, it was finalized on February 22, so by now it is almost half a year old. This is the reason why Q6 may look a little strange these days: we were not dancing around the elephant in the room; it simply had not entered yet. By now, Victor has left both positions mentioned in the preamble and is currently working on a new startup in the AR/VR field.

    Q1. Hello Victor, and welcome to our interview! Computer vision is your major focus, so let me start off immediately with the obligatory question for our blog: what is your general view on synthetic data for computer vision? Do you agree that synthetic data, understood as artificially generated labeled data used to train machine learning models, can be a feasible way out of the data problem for computer vision? Or do you place more faith in other possible approaches that we’ve previously discussed on this blog: augmentations, mixup and self-adversarial training, few- and zero-shot learning, adding unlabeled data, and others?

    I do believe in synthetic data, and several recent projects I was involved with have seen clear benefits from using synthetic data. However, most useful synthetic data is modeled from the real world. Such modeling can benefit strongly from unsupervised learning. So, in the end, there is no dichotomy: I believe in the usefulness of synthetic data that is enriched/created from real unlabeled data. Augmentations, mixups, and adversarial training can all be used as ways to generate useful synthetic data from real data, even though people do not always think about augmentations in this way.

    Q2. Much of your most recent work is devoted to image generation. You have created GANs that work without convolutions or self-attention, neural renderers that can dress 3D avatars and generate semi-transparent objects, GANs that generate timelapse videos of landscapes, and much more. In particular, you often work on 3D generation—generating meshes, textures, point clouds—which is the obvious next step after learning to generate flat images. 3D generation is only starting to work well enough for practical applications, but still, the rate of progress in this field is spectacular. I usually show this picture in my lectures on GANs:

    Do you expect 3D generation to undergo similarly explosive growth in the near future? Or are there conceptual difficulties that need to be resolved before we get the virtual reality Metaverse generated on the fly with GANs?

    The picture you show is indeed very telling, and it reflects and conflates several trends: improvements in algorithms, improvements in computational resources, and improvements in datasets. 

    Given how many bright people are now working on 3D data synthesis, I believe that fast progress in algorithms is inevitable. Neural renderers such as PyTorch3D or nvdiffrast are certainly one piece of the puzzle. Computational resources are trickier, and a lot of progress will be bottlenecked by them, so I naturally expect that the main breakthroughs will come from the “big four” of NVidia, Google/DeepMind, Meta, and Microsoft (all four have brilliant researchers but also huge computational resources). This was to a large degree true even for 2D image generation, and it will likely be even more true for 3D. Note that I am not saying that everybody else should either join those corporations or work on something else. Just like StyleGAN(s) from NVidia created a whole vibrant ecosystem of researchers from different institutes building on top of it, the same will likely happen with 3D.

    The main bottleneck for progress in 3D data synthesis, however, is (and will be) datasets. Here things are very different from 2D. With 2D, once algorithms and resources were ready, finding good enough datasets for learning was relatively easy. Note that here I am talking about 2D static image generation; good datasets of HD videos are much harder to get: say, YouTube is largely not HD quality, and it is quite a challenge to scrape video datasets of objects or people in high resolution from YouTube. Getting good and large 3D datasets is much harder, especially if we are talking about “full 3D” and not just 2.5D (i.e., color + depth) or toy-like 3D models. Currently, quite a few researchers are trying to bypass this lack of datasets and to learn 3D synthesis by matching 2D images. To this end, they insert 2D projections into their generation learning pipelines. This is surely interesting and could be fruitful, but it is inevitably much harder. Just imagine someone trying to learn StyleGAN-like image synthesis while only having access to a dataset of 1D projections such as row sums or one-pixel slices.

    To sum up, I think that the rate of progress in 3D data synthesis will be limited and conditioned on the quality of 3D datasets. Hence, it will be a harder and longer story than with 2D (but no less interesting!)

    Q3. Let us continue from the last question, taking generative models yet further into the realm of speculation. I have always viewed image and 3D generation as an inherently finite task. It has not been easy to scale GANs up, but it seems like progress is inevitable. And human eyes have finite resolution after all (be it 8K, 32K, or 256K), so the models will sooner or later reach this resolution with photorealistic quality, and there will be no point in going any further.

    Do you agree with this view, and if yes, when do you expect image and 3D scene generation to hit this ceiling and provide a perfectly immersive experience? (Let’s limit this question to vision, I understand that full immersion will require other senses as well.)

    Let me start by noting that the story with 2D image generation is far from over, even if one can generate very realistic human faces. First of all, GANs still have limited diversity and mode coverage (otherwise we would not have dozens of interesting papers on StyleGAN inversion, and very simple approaches would do the job). Diffusion models are better than GANs at covering the whole distribution but are still extremely slow. Furthermore, even though GAN samples for faces are realistic, GAN samples for full-body human images or, say, for full-body cats are either significantly less realistic or significantly less diverse (or both). Finally, for 2D video synthesis, we as a community are very far from truly realistic results (at least in the unconditional setting).

    Regarding 3D, the situation is even harder for the reasons I discussed in the answer to the previous question, so I do not expect perfect photorealism there for quite a few years.

    Q4. Now let me ask a (slightly more) technical question that I’ve been interested in for a long time. Your two most cited papers according to Google Scholar are “Unsupervised domain adaptation by backpropagation” (joint work with Yaroslav Ganin) and its continuation and extension, “Domain-adversarial training of neural networks” (with a lot of people including, e.g., Hugo Larochelle). They are also, in my opinion, some of the most relevant for synthetic data because they present a simple and ingenious domain adaptation method.

    We have just discussed the basic idea of Ganin and Lempitsky (2015) on this blog, so I’ll be very brief in explaining it. The idea goes as follows: suppose you want to have a model that works for both synthetic and real data (or any two domains, really). You want to train a feature extractor that will extract features independently of the domain, so that, say, a synthetic face will have the same features extracted as its real counterpart, and models trained with these features on synthetic data can be applied to real data. To achieve this, you add a domain classifier that predicts whether it was a synthetic or a real image based on the features extracted. You want that classifier to fail, just like you want the discriminator to fail in GANs. So you train it as another head of your network, but the gradients for the classification error function are reversed, optimizing it in the opposite direction. In the illustration below (taken from your papers), the classifier wants to minimize its loss Ld, but by the time it gets to the feature extractor, the loss is inverted, and the extractor is actually maximizing it.

    My question here is two-fold. First, I explained your idea in terms of synthetic and real images, and the actual papers also present examples of synthetic-to-real transfer, but only for small images. Have there been attempts to apply this to larger-scale domain adaptation, especially synthetic-to-real, and how successful have they been?

    Second, domain-adversarial training sounds like a very general idea that could actually be applicable wider than just domain adaptation. One cannot say this idea is not widely known: both papers have thousands of citations, including foundational works on GANs. But why haven’t GANs switched to gradient reversal instead of alternating training between the generator and discriminator? Are there some hidden problems here that are not evident in the basic idea?

    On your first question: indeed, the approach has become popular, and there has been a lot of follow-up work, including applications to large images. Just as with small images, the approach works somewhat but without miracles. That is, it usually beats the no-adaptation baseline quite confidently but, of course, does not solve the domain gap problem completely. For the second question: indeed, almost all GANs separate the steps for the generator and discriminator updates and do not reuse the gradient. The main reason, I believe, is that most modern GANs use slightly different functionals as objectives for the generator and the discriminator. In particular, it turns out that to get the best GAN performance, it is useful to have some form of the so-called non-saturating objective for the discriminator, and also to regularize the discriminator quite strongly with a proper regularizer (and the details of such regularization matter a lot). So, when your generator and discriminator are trying to optimize slightly different functionals, gradient reuse becomes highly non-trivial and is therefore not used.

    Just to clarify, for me the difference between gradient reversal and GANs is not a big deal. Actually, we learned about the GAN arXiv report halfway through the project, and by that time we had settled on the idea and the language of “gradient reversal”. This is why we explained our approach in a slightly different way in our paper, and perhaps connected it to GANs less clearly than we should have (but back in early 2015 it was far less obvious that GANs would become such a dominant idea).

    Q5. Another recent work of yours introduces Cloud Transformers, special architectures for processing point clouds that use ideas similar to self-attention blocks, with excellent results in point cloud segmentation, inpainting, and reconstruction tasks.

    Since their inception in 2017, Transformers have taken deep learning by storm. They started by basically replacing all other embeddings in natural language processing and serving as the basis for the very best language models, but now they are all over computer vision as well, ever expanding their reach as your own work suggests. It looks a bit like deep learning gradually taking over every field in the early 2010s.

    Do you have an explanation for this success? I understand how a Transformer works mathematically, but is there any explanation why self-attention proves to be such a good idea in practice?

    Or maybe it’s just an umbrella term for a specific useful trick, and otherwise modern Transformers are very different from each other? In your paper, you keep using words such as “variant” or “reminiscent”, and the architecture indeed doesn’t look much like Vaswani’s original. What is that core idea that makes an architecture a Transformer, and again, why, in your opinion, does it work so well?

    Well, it is hard to deny that transformers are the most exciting and impactful thing that has happened in deep learning in recent years. What is most exciting about transformers is their universality. True, we are still witnessing the competition between vision transformer variants and ConvNet architectures for the title of “the king of ImageNet”. But what is remarkable, and what makes many people excited, is that very similar Transformer architectures can solve very different tasks across very different modalities (images, audio, text, action planning, etc.) with near state-of-the-art quality. Certainly, it feels like the right thing, as our brains also have remarkable plasticity and can repurpose different parts between modalities.

    Our cloud transformers paper will obviously be far less impactful compared to the original transformers, but I still like it very much. Our architecture is similar to “classical” transformers in some ways. E.g. it treats individual points as elements within an unordered set, and our key layer uses multiple processing heads. There are also differences (our equivalent of attention is sparse, and we use convolutions). Still, what I liked about our results is that essentially the same architecture is able to solve very different point cloud processing tasks. This is again reminiscent of the general transformer idea. 

    Q6. And finally, a (slightly) more personal question. Anyone who knows you personally or at least follows you knows that you feel strongly about the ethical use of AI. There is a growing discussion in the computer vision community about the ethical use of CV technologies. For instance, the creator of the YOLO object detectors, Joseph Redmon, quit computer vision in early 2020 and famously explained his decision as follows: “I stopped doing CV research because I saw the impact my work was having. I loved the work but the military applications and privacy concerns eventually became impossible to ignore.”

    What is your view on the ethical concerns that arise in modern computer vision? Are researchers responsible for potentially unethical uses of their results? I suppose there is no way to stop progress, but do you think there may be ways to ensure that progress works for the benefit of humanity and not against it? What would you advise to work on if one wanted to achieve this goal?

    I had a small project on person re-identification (mostly from surveillance cameras) with my PhD student back in 2016, and after one year or so we stopped. I do not think we pushed state-of-the-art in video surveillance that much, and the reviewers for the submissions we made on the subject concurred with that :). It is the only example where, in retrospect, I sleep slightly better because my work did not make an impact. 

    Having said that, some of the good and well-meaning people that I know still work on face recognition and camera-based surveillance, and I do not want to judge them. After all, the camera-based surveillance technology is double-edged. It will most likely benefit strong democratic societies by making life there safer and more convenient, but it will make life in authoritarian and totalitarian societies considerably worse, which we are already starting to witness in Russia and other countries. The same actually goes for AI and automation issues. The net effect will be strongly positive, people will live more meaningful and productive lives with more interesting occupations, but the dystopian scenarios will also materialize in some societies. 

    As always, stopping progress is impossible, even if many strong researchers, including Joe Redmon, quit the area. Progress in AI-based surveillance and automation “simply” calls for better and stronger political institutions. And the faster the progress, the more urgent the call. I know this all sounds like I am trying to shift the responsibility from AI researchers to others (civil society and politicians), but I am just being honest and realistic. The best thing that we researchers can and must do is to inform the general public about the current state of the art and reasonable projections for the future.

    Victor, thank you very much for your answers! And you, dear reader, stay tuned for our next interviews!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • CVPR ‘22, Part II: New Use Cases for Synthetic Data

    CVPR ‘22, Part II: New Use Cases for Synthetic Data

    Last time, we started a new series of posts: an overview of papers from CVPR 2022 that are related to synthetic data. This year’s CVPR has over 2000 accepted papers, and many of them touch upon our main topic on this blog. In today’s installment, we look at papers that make use of synthetic data to advance a number of different use cases in computer vision, along with a couple of very interesting and novel ideas that extend the applicability of synthetic data in new directions. We will even see some fractals as synthetic data! (image source)

    Introduction and the Plan

    In the first post of this series, we talked about new synthetic datasets in computer vision. This post is only superficially different from the first one: here we will consider papers that apply synthetic data to various practical use cases, concentrating more on the downstream task than on synthetic data generation. However, the generation part here is also often interesting, and we will definitely discuss it.

    I will also take this opportunity to discuss two very interesting developments related to synthetic data. First, we will see that synthetic images do not have to be realistic at all to be helpful for training even state-of-the-art visual Transformers, and it turns out that this has a lot to do with fractals. In the last part, we will see how synthetic data helps to automatically fill in the gaps and provide missing data for few-shot learning. But before that, we will see several use cases where synthetic data has helped solve practical computer vision problems. Among these use cases, today we do not consider papers that help generate synthetic data and papers that deal with generating or modifying virtual humans—these will be the topics for later posts.

    Just like last time, I remind you that we have launched OpenSynthetics, a new public database of all things related to synthetic data. In this post, I will again give links to the corresponding OpenSynthetics pages.

    Eyeglass removal

    In “Portrait Eyeglasses and Shadow Removal by Leveraging 3D Synthetic Data” (OpenSynthetics), Lyu et al. consider an interesting image manipulation problem: removing glasses from a human face. While solving this problem is desirable for applications such as face verification or emotion recognition, eyeglasses are very tricky objects for computer vision: they are mostly transparent but can cast shadows and introduce other complex effects in the image. The model constructed in this work consists of two stages: a cross-domain segmentation network predicts segmentation masks of the glasses and the shadows they cast (this part is trained adversarially so that the features it extracts are indistinguishable between real and synthetic data), and then “de-shadow” and “de-glass” networks remove both:

    The whole thing is trained on a mixture of synthetic data and the CelebA dataset (real data), and the authors report much improved results for eyeglass removal:

    This system is the main point of the paper, but for me, it was also interesting to read about their synthetic data generation pipeline. Starting from 3D models of eyeglasses and 3D face models, they manually label four nodes where the glasses attach to the face: two fixed nodes on the temples and two floating points on the nose, “floating” meaning that these two points can drift to produce different positions of the glasses on the nose. Given these four nodes, the system can work out the pose of the glasses and combine them with the face; the authors then proceed to standard rendering in Blender, also generating the masks for the glasses and their shadows to train the segmentation model:

    And the results are really impressive. Here are some real examples (perhaps cherry-picked, but who cares?..) from the paper:

    Crowd counting

    The work “Leveraging Self-Supervision for Cross-Domain Crowd Counting” by Liu et al. (OpenSynthetics) deals with a very straightforward application of synthetic data. Crowd counting is a natural use case: it is very hard to label every person in a crowd photo, and using real images raises privacy issues since it is usually impossible to get the consent of everybody in a real-world crowd.

    Indeed, there already exists a large synthetic dataset for crowd counting called GCC (Wang et al., 2019), with over 7.6 million people labeled on over 15K synthetic images. This dataset was produced with the Grand Theft Auto V engine, that is, the Rockstar Advanced Game Engine (RAGE), together with the Script Hook V library that allows extracting labels from RAGE. Here are two sample images from the paper, a real crowd on the left and a synthetic one on the right:

    Liu et al. use GCC for training and supplement it with unlabeled real images to cope with the domain shift, adding a couple of new tricks designed to improve crowd density estimation (such as accounting for perspective, since crowd density appears higher near the top of an image like the one above than near the bottom). They obtain significantly improved results compared to other domain adaptation approaches; here are a couple of samples (the ground truth crowd density map is in the middle, and the estimated density map is on the right, together with the estimated number of people):

    This is an interesting use case for us since it can be read as reaching largely the same conclusions as we did in our recent white paper: if done right, relatively simple combinations of synthetic and real data can work wonders. It is encouraging to see such approaches appear at top venues such as CVPR: I guess synthetic data does just work.

    Formula-driven supervised learning for pretraining visual Transformers

    And now we proceed from state-of-the-art but still quite straightforward applications to something much stranger and, in my opinion, more interesting. First, a very unusual application of synthetic data that requires a little bit of context. In 2020, Kataoka et al. presented a completely new approach to pretraining convolutional networks called Formula-Driven Supervised Learning (FDSL). They automatically generate image patterns, with image classes defined by analytically specified fractal categories. How to do that well is a separate and quite difficult problem, but the important thing is that in the end you get a family of fractals for each image category. Here is an illustration from Kataoka et al.:

    As you can see, synthetic fractal images are far from realistic, but they capture some of the patterns characteristic of a given class and hence can be used to pretrain deep learning models; as usual with synthetic data, one can generate an endless stream of new samples from these fractal families. This pretraining does not make training on real images unnecessary, but it can improve the final results.
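    To give a feel for what such a fractal “category” is, here is a toy NumPy sketch of an iterated function system (IFS): a category is defined by a fixed set of random affine maps, and images are rendered by iterating them. This is only a simplified illustration of the FractalDB-style idea, not the authors’ generation code; in the real dataset, both the category definitions and the intra-category variations are controlled much more carefully.

    ```python
    import numpy as np

    def random_ifs_category(n_maps=4, rng=None):
        """A 'category' is a fixed set of contractive affine maps (A, b); their parameters define the class."""
        rng = rng if rng is not None else np.random.default_rng(0)
        maps = []
        for _ in range(n_maps):
            A = rng.uniform(-1.0, 1.0, size=(2, 2))
            A = 0.8 * A / max(1.0, np.linalg.norm(A, 2))  # keep each map contractive so points stay bounded
            b = rng.uniform(-0.5, 0.5, size=2)
            maps.append((A, b))
        return maps

    def render_fractal(ifs, n_points=50_000, size=256, rng=None):
        """Iterate randomly chosen maps from the IFS and rasterize the visited points into an image."""
        rng = rng if rng is not None else np.random.default_rng(1)
        x = rng.uniform(-1.0, 1.0, size=2)
        img = np.zeros((size, size), dtype=np.uint8)
        for _ in range(n_points):
            A, b = ifs[rng.integers(len(ifs))]
            x = A @ x + b
            px = ((np.clip(x, -4.0, 4.0) + 4.0) / 8.0 * (size - 1)).astype(int)
            img[px[1], px[0]] = 255
        return img

    # One category = one set of maps; different seeds in render_fractal give different samples of that category.
    category = random_ifs_category()
    sample = render_fractal(category)
    ```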

    Well, in 2022 Kataoka et al. took the next step (OpenSynthetics), moving from CNNs to visual Transformers. They developed new techniques for synthetic image generation, including a new dataset of families focused on image contours. It turned out that visual Transformers pay most of their attention to contours anyway, so even a textureless image is helpful for pretraining:

    And visual Transformers perform better when they are pretrained on images like this one instead of real photos! For example, the authors report that ViT-Base pretrained on ImageNet-21k showed 81.8% top-1 accuracy after fine-tuning on ImageNet-1k, while the same model pretrained with FDSL under the same conditions showed 82.7% top-1 accuracy.

    In my opinion, this is a very interesting direction of study. Apart from its direct achievements, it also shows that synthetic-to-real domain shift is not necessarily a bad thing, and if the data is generated in the right way, trying to achieve photorealism may not be the right way to go.

    Synthetic Representative Samples for Few-Shot Learning

    This last paper for today is a little bit of a stretch to call synthetic data, but it’s another interesting idea that may have applications for synthetic data generation as well. Last time, we discussed BigDatasetGAN, a generative model able to create images already labeled for semantic segmentation. This may be one of the first steps towards solving the problem of synthetic data: until the works on DatasetGANs, nobody could generate labeled data so nobody could use generative models to directly generate useful synthetic images.

    If we are talking about classification rather than segmentation, it looks much easier to sidestep this issue: ever since BigGAN, generative models could produce realistic-looking images in many different categories. But this raises another question: to train a generative model we need a dataset in this category, so why don’t we just take this dataset to train on instead of generating new samples?

    The work “Generating Representative Samples for Few-Shot Classification” (OpenSynthetics) by Xu and Le, a collaboration between Stony Brook University and Amazon, finds a new use case where this kind of conditional generation can be useful. The basic idea is as follows: in few-shot learning, say for image classification, one usually trains a feature extractor on a dataset with plenty of labeled data (but the wrong classes) and then adapts it to new classes by estimating a prototype sample for each of them. This sample can then be used for classification; here is an illustration of few-shot and zero-shot classification via prototypes from a classical paper by Snell et al. that started this field:

    This illustration works in the latent space of features produced by some kind of encoder.
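    To make this concrete, here is a minimal sketch of prototype-based classification in such a feature space; this is the generic prototypical networks recipe, not the specific model from the paper:

    ```python
    import torch

    def prototypes(support_features, support_labels, n_classes):
        """Class prototype = mean feature vector of the support examples of that class."""
        return torch.stack([support_features[support_labels == c].mean(dim=0) for c in range(n_classes)])

    def classify(query_features, protos):
        """Assign each query to the nearest prototype (negative squared Euclidean distance as logits)."""
        dists = torch.cdist(query_features, protos) ** 2
        return (-dists).argmax(dim=-1)

    # A 5-way 5-shot episode in a 512-dimensional feature space (features come from a pretrained encoder).
    feats = torch.randn(25, 512)
    labels = torch.arange(5).repeat_interleave(5)
    protos = prototypes(feats, labels, n_classes=5)
    pred = classify(torch.randn(10, 512), protos)
    ```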

    But this prototype-based idea has a drawback: it is hard to find a representative prototype if all you have are a few samples. Even if you have a perfect encoder that produces smooth and wonderfully separated Gaussians for every class, these Gaussians have a core of central representative samples and also non-representative samples that are further from the center:

    And if we base a classifier on a single prototype that turns out to be non-representative, the results can be far from perfect. Here is an illustration from an ICLR 2021 paper by Yang et al.:

    But how do we achieve this kind of calibration? Xu and Le propose—and this is where the relation to synthetic data comes into play—to generate representative samples from a variational autoencoder. It is common to use conditional VAEs to learn to extract representative features from images, but this time the cVAE is restricted to produce only representative, central examples of a class (feature vectors close to the center of a Gaussian) via sample selection:

    Note the semantic embedding a: this is where the new samples will come from. For a new class, the authors take its semantic embedding, plug it into this VAE’s decoder, and generate representative samples for the new class. Then the resulting generated prototype is either mixed with actual samples (in few-shot classification) or not (in zero-shot classification), with improved results on miniImageNet and tieredImageNet.
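    Here is a rough sketch of how generated representative samples could be combined with the few available support examples; the decoder interface, dimensions, and the way the prototype is mixed are my simplifications for illustration, not the authors’ exact model:

    ```python
    import torch
    import torch.nn as nn

    class RepresentativeSampleDecoder(nn.Module):
        """Toy cVAE decoder: semantic class embedding + noise -> representative feature vectors."""
        def __init__(self, semantic_dim=300, latent_dim=64, feat_dim=512):
            super().__init__()
            self.latent_dim = latent_dim
            self.net = nn.Sequential(
                nn.Linear(semantic_dim + latent_dim, 512), nn.ReLU(),
                nn.Linear(512, feat_dim),
            )

        def forward(self, semantic_emb, n_samples=10):
            # semantic_emb: (semantic_dim,) for one novel class.
            z = torch.randn(n_samples, self.latent_dim)
            cond = semantic_emb.unsqueeze(0).expand(n_samples, -1)
            return self.net(torch.cat([cond, z], dim=-1))        # (n_samples, feat_dim)

    def class_prototype(decoder, semantic_emb, support_features=None):
        """Average generated representative samples; mix in real support features if available."""
        generated = decoder(semantic_emb)
        if support_features is not None:                          # few-shot: combine with real samples
            generated = torch.cat([generated, support_features], dim=0)
        return generated.mean(dim=0)                              # prototype in feature space

    decoder = RepresentativeSampleDecoder()
    proto = class_prototype(decoder, torch.randn(300), support_features=torch.randn(5, 512))
    ```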

    This is definitely a non-representative example of a paper on synthetic data: the “data” is actually in feature space, and the problem is image classification rather than anything with complicated labeling. But this direction, dating back at least to 2018 (Verma et al., CVPR 2018), is an interesting tangent to our space, and just like DatasetGAN, it goes to show a way in which generative models may prove useful for synthetic data generation.

    Conclusion

    In this post, the second in the CVPR ‘22 series, we have discussed several use cases of synthetic data that have been advanced at the conference, starting from straightforward applications such as eyeglass removal and crowd counting and progressing to less obvious ideas of how deep generative models and even regular mathematical models such as fractals can help produce synthetic data useful for machine learning. Next time, we will discuss a more specific use case related to synthetic humans; stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • CVPR ‘22, Part I: New Synthetic Datasets

    CVPR ‘22, Part I: New Synthetic Datasets

    CVPR 2022, the largest and most prestigious conference in computer vision and one of the most important ML venues in general, has just finished in New Orleans. With over 2000 accepted papers, reviewing the contributions of this year’s CVPR appears to be a truly gargantuan task. Over the next series of blog posts, we will attempt to go over the most interesting papers directly related to our main topic: synthetic data. Today, I present the first but definitely not the last installment devoted to papers from CVPR 2022.

    New Synthetic Datasets: Beyond Images

    As always, CVPR is large, and it contains multitudes, but this year one of the main topics is neural radiance fields (NeRF). These models seem to be the new GANs today or, rather, the new visual Transformers, which were in turn the new GANs a couple of years ago. We view image synthesis, especially controlled synthesis with 3D information, as a key idea that can propel synthetic data forward, so I plan to devote several upcoming posts to recent NeRF advancements.

    But in this series, let me begin with more straightforward applications of synthetic data that have found their way into the CVPR program this year. On the list today we have several new synthetic datasets, usually related to specific use cases of synthetic data; many of them touch upon problems that we have already discussed on this blog but some introduce entirely new avenues for research.

    Synthetic data is a well-established field, and this blog has already documented many of its achievements. By now, it is not enough to just generate a new synthetic dataset to get to a top conference like CVPR (to be honest, it was never enough): you need some twist on the tried-and-true formula of “make or obtain 3D CG models, render images, train CV models, profit”. In this section, let us see what new twists CVPR 2022 has brought.

    And one more thing before we begin: we have recently made public a new database that will gradually collect all things related to synthetic data. It is called OpenSynthetics, and it already has quite a lot of content on synthetic datasets, papers, and code repositories related to synthetic data. So in these review posts, I will also give links to the corresponding OpenSynthetics pages.

    BigDatasetGAN: Generating ImageNet1K with Labels

    It had always been common wisdom that GANs, despite their excellent image generation quality and usefulness for synthetic-to-real refinement, could not really help with data generation from scratch: there was no way to generate labeled data and no easy way to label generated images. Basically, ever since ProGAN and BigGAN (OpenSynthetics; both released in 2018), you could use GANs to generate new realistic images of sufficient quality, but you would still have to label them afterwards as if they were just new images. And this has always meant that GANs were useless for synthetic data generation: we have never lacked new images of ImageNet categories; the bottleneck has always been the labeling.

    Well, it looks like there is a way to generate labeled data now! This research direction, driven by NVIDIA researchers, bore its first fruit last year when Zhang et al. presented DatasetGAN at CVPR 2021. Their pipeline works as follows: use StyleGAN to generate several images (say, cars), hand-annotate a few of them for your task (say, segmentation of various car parts), and train a very small model (a style interpreter) to produce similar segmentation masks from StyleGAN features. At the cost of labeling a few images (literally a few: DatasetGAN required 16 labeled heads or about 1000 polygons), you can use StyleGAN to generate as many labeled images as you wish, with the usual excellent StyleGAN quality:

    At this year’s CVPR, Li et al. continued this line of research and introduced BigDatasetGAN, based on BigGAN instead of StyleGAN. The difference is that BigGAN is better suited for generating a wide variety of different image categories, so now you can hand-label 8000 images, 8 for each category, and get a single model able to produce all 1000 ImageNet1K categories pre-labeled for segmentation:

    The authors report results improved over supervised pretraining for standard segmentation models.

    Does this mean that synthetic data is soon to be absorbed into deep generative models? Time will tell, but I am not sure: generative models are still hard to train, and this approach requires an operational large-scale GAN with the desired categories before we go into labeling. Moreover, DatasetGANs deal only with segmentation so far, and I have my reservations about more complex labeling such as depth. Still, this is an exciting development that shows the power of modern generative models, and its results provide a set of completely new tools for the arsenal of synthetic data generation.

    ABO: Real-World 3D Object Understanding

    ABO stands for Amazon Berkeley Objects (OpenSynthetics), a new indoor environment and object dataset presented in the work by Collins et al., who are, you guessed it, researchers from UC Berkeley and Amazon. ABO answers the same need as the classical but sadly unavailable SunCG dataset, ShapeNet, or Facebook AI Habitat: it provides a large-scale catalogue of 3D models of indoor household objects—chairs, shoes, coat hangers, rugs, tables, and so on—that can be placed in a variety of indoor environments with available renderings.

    Since Amazon is… well, Amazon, ABO is based on product listings: the dataset contains nearly 150K listings of 576 product types with hi-res photos and over 8000 turntable “360° view” images. It also includes nearly 8000 handmade high-quality 3D models of various objects. Moreover, and this is unique to ABO, the objects come with attributes that identify their material, which is useful for physically-based rendering:

    The authors show that training on ABO leads to better results than training on ShapeNet for state-of-the-art 3D reconstruction models. They also introduce a new task that has been enabled by their work, material estimation, and present novel network architectures for this task. In general, this is an impressive effort, and I hope that it will enable many new works in 3D scene understanding, indoor navigation, and related fields:

    ObjectFolder 2.0: A Multisensory Object Dataset

    While ABO provides some information about the material of each object, it is far from exhaustive. Stanford researchers Gao et al. attempt a far more ambitious task with their new ObjectFolder 2.0 dataset (OpenSynthetics): they aim to model complete multisensory profiles of real objects. This means capturing not only the 3D shape and material of an object (and therefore its texture) but also other sensory modalities, including audio (how a cup clinks when you touch it with a spoon) and how it feels to the touch. This information can later be used for problems such as contact localization (where exactly have I touched this object?) that are both difficult and important in robotics:

    Since all of these modalities are location-dependent, they cannot all be explicitly stored in the dataset. The authors use implicit neural representations, that is, each object is defined by a few neural networks (multilayer perceptrons) that are trained to convert coordinates into whatever is necessary; VisionNet models the neural scattering function, AudioNet models the location-specific part of the audio response from applying a unit force to this location, while TouchNet predicts the deformation map and tactile image (geometry of the contact surface):

    ObjectFolder 2.0 contains these representations for 1000 household objects such as cups, chairs, pans, vases, and so on.

    Gao et al. test their dataset with three downstream tasks that require multimodal sim2real object transfer: object scale estimation based on vision and audio, contact localization based on audio and tactile response, and shape reconstruction based on visual and tactile data. They report improved performance across all tasks, and this dataset indeed looks like a possible next step for object manipulation in robotics.

    Articulated 3D Hand-Object Pose Estimation

    Pose estimation is a classical computer vision problem; as in all problems related to understanding the 3D world from 2D images, synthetic data comes to mind naturally: it is impossible to do exact manual labeling for pose estimation, and even inexact human labeling is very laborious. This goes double for more detailed tasks such as hand pose estimation, so it is no wonder that there exist synthetic datasets for this problem; in particular, here at Synthesis AI we have a variety of hand gestures as part of our HumanAPI.

    In “ArtiBoost: Boosting Articulated 3D Hand-Object Pose Estimation via Online Exploration and Synthesis” (OpenSynthetics), Li et al. make the next step: they consider not just hand gestures but hands holding various objects in different positions. The authors consider the “composited hand-object configuration and viewpoint space” (CCV space) where you can vary object types, composite hand-object poses, and camera viewpoints:

    Then they apply a newly developed grasp synthesis method (that I will not go into), obtain renderings of a synthetic hand grasping the object, and use these images for training.

    What is most interesting for me in this work is that it is an example of the “closing the loop” idea that we have been proposing for quite some time here at Synthesis AI; in particular, pardon the self-promotion, I discussed it as an important idea for the future of synthetic data in Chapter 12 of my book.

    In this case, Li et al. do not merely sample the CCV space and create a randomly generated dataset of synthetic hands with objects. They assign weights to different objects, poses, and viewpoints, and update these weights with feedback obtained from the trained model, trying to skew the sampling towards hard examples, a technique known in other contexts as “hard negative mining”. It is great to see that “closing the loop” is gaining traction, and I am certain it can help in other problems as well.
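    The general flavor of this feedback loop can be illustrated with a simple loss-driven sampler; this is a generic sketch of the idea, not ArtiBoost’s actual exploration strategy:

    ```python
    import numpy as np

    class FeedbackSampler:
        """Sample synthetic configurations with probabilities skewed towards hard examples."""
        def __init__(self, n_configs, rng=None):
            self.weights = np.ones(n_configs)
            self.rng = rng if rng is not None else np.random.default_rng(0)

        def sample(self, batch_size):
            p = self.weights / self.weights.sum()
            return self.rng.choice(len(self.weights), size=batch_size, p=p)

        def update(self, config_ids, losses, momentum=0.9):
            # Configurations where the trained model does poorly get higher sampling weight.
            for cid, loss in zip(config_ids, losses):
                self.weights[cid] = momentum * self.weights[cid] + (1 - momentum) * (1.0 + loss)

    # Usage: sample (object, pose, viewpoint) configurations, render them, train, feed losses back.
    sampler = FeedbackSampler(n_configs=10_000)
    ids = sampler.sample(batch_size=64)
    fake_losses = np.random.rand(64)        # stand-in for per-sample losses from the downstream model
    sampler.update(ids, fake_losses)
    ```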

    SHIFT: Synthetic Driving via Multi-Task Domain Adaptation

    And now let us, pardon the pun, shift to data about the outdoors. We begin with autonomous driving. The work “SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation” (OpenSynthetics), coming from ETH Zurich researchers Sun et al., presents SHIFT, a synthetic driving dataset; but SHIFT is far from a “regular” synthetic dataset with some labeled images! The problem that Sun et al. recognize here is that autonomous driving requires the system to adapt to constantly changing conditions: if you are driving and it starts raining, the view around you changes significantly, perhaps quite quickly, and the computer vision system has to keep working well.

    To help cope with that, SHIFT contains explicit “domain shifts” across several different domains such as weather conditions, time of day, surroundings, and so on:

    So far this is quite standard fare for autonomous driving simulators. What’s more, SHIFT provides continuous shifts across domains whenever possible. You can have day gradually turning into night or rain starting on a sunny day:

    Naturally, each frame is annotated in the usual modalities, with object bounding boxes, segmentation maps, depth maps, optical flow, and LiDAR point clouds.

    Based on SHIFT, the authors investigate how various object detection and segmentation models cope with these domain shifts. They demonstrate that conclusions about robustness to domain shift that can be made on synthetic data also transfer to real datasets. I think that’s an important validation for synthetic data in general: it turns out that synthetic data can help evaluate machine learning models in ways that real data may fail to provide.

    TOPO-DataGen: Multimodal Synthetic Data Generation for Aerial Scenes

    In another classical synthetic data paper, EPFL researchers Yan et al. present TOPO-DataGen (OpenSynthetics), an automated synthetic data generation system that utilizes available geographic data such as LiDAR point clouds, orthophotographs, or digital terrain models to create synthetic scenes of various parts of the world, complete with the usual synthetic modalities such as depth maps, normals, segmentation maps, and so on:

    Generated images look very impressive and highly realistic, which is made slightly easier by the fact that they are aerial images taken from far away. Based on TOPO-DataGen, Yan et al. develop a new CrossLoc model for absolute localization (i.e., estimating the 6D camera pose in space) that works with several input modalities. They also show some impressive demos of trajectory reconstruction from aerial images based on CrossLoc. In general, while synthetic satellite and aerial images have already been generated, I believe this is the first attempt to bring together the different modalities that are actually often available in current practice.

    LiDAR snowfall simulation

    Finally, a very specific but fun use case: simulating snowfall. Autonomous driving should work under all realistic weather conditions, including heavy snow. But snow presents two problems that are especially bad for LiDARs: first, the ground becomes wet, which changes its reflective properties, and second, the particles of snow in the air interact with the laser beam, leading to absorption and backscattering that attenuate the LiDAR signal and introduce a lot of noise.

    Hahner et al. present a snowfall simulation system able to augment existing LiDAR datasets (in this case, STF by Bijelic et al., which itself introduced a fog simulation system) with special models for wet ground reflection and for the influence of scattering particles. As a result, 3D object detection models trained with this augmentation perform much better; in the illustration below, note that the rightmost results contain no spurious objects, and the predicted bounding boxes (black) match the ground truth (green) very well:

    Conclusion

    Today, we have begun our long journey through CVPR 2022. We have looked at papers that introduce new synthetic datasets, usually going far beyond simple generation of labeled images and sometimes defining completely new tasks. Next time, we will talk about papers that present specific use cases for synthetic data, that is, validate the use of synthetic data in practical computer vision tasks. Admittedly, it’s a blurry line with this first installment, but this post is getting quite long as it is. Until next time, stay tuned to the Synthesis AI blog, and check out OpenSynthetics!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Driving Model Performance with Synthetic Data VII: Model-Based Domain Adaptation

    Driving Model Performance with Synthetic Data VII: Model-Based Domain Adaptation

    After a long hiatus, we return from interviews to long forms, continuing (and hopefully finishing) our series on how synthetic data is used in machine learning and how machine learning models can adapt to using synthetic data. This is the seventh installment in the series (part 1, part 2, part 3, part 4, part 5, part 6), but, as usual, this post is (I hope!) sufficiently self-contained. We will discuss how one can get a model that works well when trained on synthetic data without explicitly making the data more realistic, doing the domain adaptation work at the level of features or the model itself.

    Intro and weight sharing

    In previous installments, we have considered models that perform refinement, that is, domain adaptation at the data level. This means that somewhere in the model, there is a learned transformation that takes data points from the source domain (in our case, synthetic images) and transforms them to make them more like the target domain (real images). 

    But it sounds like a lot of unnecessary extra work! Our final goal is very rarely to generate more realistic synthetic images. On the contrary, we want to use synthetic images to help train better models; the data itself is not important, it is just a stepping stone to models that work better. So maybe we don’t need to learn transformations at the level of images and can work in the space of features or model weights, never going back to change the actual data?

    One simple and direct approach would be to share the weights among networks operating on different domains. This way, when you train on both domains, the network has to learn to do well on both with the same weights, which is exactly what you need for domain adaptation. This was the idea behind the earliest approaches to domain adaptation in deep learning, but weight sharing and similar ideas remain relevant to this day. For instance, Rozantsev et al. (2019) do domain adaptation with a two-stream architecture; the weights for processing the two domains are not shared, but the architectures are the same, and special regularizers on all layers bring their weights together:
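    Here is a minimal sketch of the underlying idea, a penalty on the distance between corresponding weights of two same-architecture streams; the regularizer in Rozantsev et al. is more elaborate than this plain L2 distance, so treat this as an illustration only:

    ```python
    import torch
    import torch.nn as nn

    def make_stream():
        # Two streams with identical architectures, one per domain (weights not shared).
        return nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())

    source_stream, target_stream = make_stream(), make_stream()

    def weight_distance_penalty(stream_a, stream_b):
        """L2 distance between corresponding parameters; keeps the two streams close to each other."""
        return sum(((p_a - p_b) ** 2).sum()
                   for p_a, p_b in zip(stream_a.parameters(), stream_b.parameters()))

    # Total loss = task loss on source + unsupervised alignment losses + lambda * weight penalty.
    lam = 1e-3
    reg = lam * weight_distance_penalty(source_stream, target_stream)
    ```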

    Another approach to model-level domain adaptation is to mine relatively strong priors from real data that can then inform a model trained on synthetic data, helping fix problematic cases or incongruities between synthetic and real data. This also brings us to curriculum learning: it is often helpful to start with easy cases to get a network rolling and then fine-tune it on harder and harder situations.

    For example, Zhang et al. (2017) present a curriculum learning approach to domain adaptation for semantic segmentation of urban scenes. They train a segmentation network on synthetic data (specifically, on the GTA dataset) but with a special component in the loss function related to the overall label distribution in real images, intended to bring the label distributions of real and synthetic datasets closer together. The problem is that this distribution is not directly available for real data, and this is where curriculum learning comes in: the authors first train a simpler model on synthetic data to estimate the label distribution from image features and then use it to inform the segmentation model:
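    To make the label distribution component a bit more concrete, here is a generic sketch of a loss that matches the predicted label histogram of a segmentation network to an estimated target distribution; the exact loss and the way the distribution is estimated in Zhang et al. differ, so this is only an illustration:

    ```python
    import torch
    import torch.nn.functional as F

    def label_distribution_loss(logits, target_distribution, eps=1e-8):
        """Penalize the gap between the predicted label histogram and an estimated global one.

        logits: (B, C, H, W) segmentation outputs; target_distribution: (B, C) estimated
        per-image label frequencies (e.g., produced by a simpler model trained on synthetic data).
        """
        probs = F.softmax(logits, dim=1)
        predicted_distribution = probs.mean(dim=(2, 3))          # fraction of pixels per class
        # KL divergence between the estimated target frequencies and the predicted frequencies.
        return (target_distribution * (torch.log(target_distribution + eps)
                                       - torch.log(predicted_distribution + eps))).sum(dim=1).mean()

    loss = label_distribution_loss(torch.randn(2, 19, 64, 128), torch.softmax(torch.randn(2, 19), dim=1))
    ```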

    But there are much more interesting ideas in model-based domain adaptation than just training the same network on both domains with some regularizers. Let’s get to them!

    Reversing the Gradients

    One of the main directions in model-level domain adaptation was initiated by Ganin and Lempitsky (2015) who presented a generic framework for unsupervised domain adaptation. Their basic approach goes as follows:

    Let’s unpack what we see in this picture:

    • the feature extractor, true to its name, extracts features from input data; this is actually the network that we want to make domain-independent; after extraction, the features go two separate ways;
    • the label predictor does what the network is actually supposed to do, in this case classification, but it could be segmentation or any other computer vision task;
    • the domain classifier is the core of this idea; it takes extracted features as input and attempts to classify which domain the original input belonged to.

    The idea is to train the label predictor to perform as well as possible while pushing the feature extractor towards features on which the domain classifier performs as badly as possible, i.e., features that do not give away which domain an input came from. This is very similar to GANs (which we have discussed before). The difference, however, is that Ganin and Lempitsky devised an ingenious training method that doesn’t require solving any minimax problems or iteratively alternating between networks.

    The method is called gradient reversal: the gradients are multiplied by a negative constant as they pass from the domain classifier to the feature extractor. In this way, the domain classifier itself learns to tell the domains apart as well as it can, the feature extractor learns features that maximize the domain classifier’s error, and the label predictor learns to solve the main task, all at the same time and within a single backward pass. Like this:
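
    In code, gradient reversal is usually implemented as a tiny custom layer that acts as the identity on the forward pass and flips the sign of the gradient on the backward pass. Here is a minimal PyTorch sketch of this common pattern (not the authors’ original code):

    ```python
    import torch

    class GradReverse(torch.autograd.Function):
        """Identity on the forward pass; multiplies gradients by -lambda on the
        backward pass, so the feature extractor is pushed to *increase* the
        domain classifier's loss while the classifier itself decreases it."""
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)

    # usage sketch (feature_extractor, label_predictor, domain_classifier are any networks):
    # features = feature_extractor(x)
    # label_logits = label_predictor(features)
    # domain_logits = domain_classifier(grad_reverse(features, lambd))
    # loss = label_loss(label_logits, y) + domain_loss(domain_logits, domain_y)
    # loss.backward()   # one backward pass trains all three components
    ```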

    In a subsequent work, Ganin et al. (2016) generalized this domain adaptation approach to arbitrary architectures and experimented with it across a variety of problems, including image classification, person re-identification, and sentiment analysis.

    Disentanglement: Domain Separation Networks and beyond

    Domain separation networks by Bousmalis et al. (2016) represent a different take on the same problem. They attempt to solve domain adaptation via disentanglement, a very important notion in deep learning. Disentanglement is the process of separating the features extracted by a machine learning model so that the separate parts have distinct, recognizable meanings. For example, many style transfer models (we discussed them in Part IV of this series) explicitly disentangle style from content and then swap the style part of the features before decoding, thus obtaining the same image in a different style.

    In domain adaptation, disentanglement amounts to separating domain-specific features from domain-independent ones, and trying to make sure that the latter will suffice to solve the actual problem. Domain separation networks explicitly separate the shared and private components of both source and target domains, extracting them with a shared encoder and two private encoders, one for the source domain and one for the target domain:

    The overall objective function for a domain separation network consists of four parts (let’s not do the formulas, it is, after all, almost Christmas; a schematic code sketch follows the list):

    • supervised task loss in the source domain, e.g., classification loss;
    • reconstruction loss that compares original samples (both real and synthetic) and the results of a shared decoder that tries to reconstruct the images from a combination of shared and private representations;
    • difference loss that encourages the hidden shared representations of instances from the source and target domains to be orthogonal to their corresponding private representations;
    • similarity loss that encourages the hidden shared representations from the source and target domains to be similar to each other; again, “similar” here means that they should be indistinguishable by a domain classifier trained through the gradient reversal layer, as above.
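
    Here is a schematic PyTorch sketch of how the four terms could fit together. It is only an illustration under simplifying assumptions: plain MSE instead of the scale-invariant reconstruction loss used in the paper, made-up names for the encoder outputs, and hand-tuned loss weights; it is not the authors’ implementation.

    ```python
    import torch
    import torch.nn.functional as F

    def difference_loss(shared, private):
        # encourage shared and private representations (batch x dim) to be orthogonal:
        # squared Frobenius norm of shared^T @ private
        shared = F.normalize(shared, dim=1)
        private = F.normalize(private, dim=1)
        return (shared.t() @ private).pow(2).sum()

    def dsn_loss(task_loss, src, tgt, recon_src, recon_tgt, enc, weights):
        # enc is assumed to hold encoder outputs and domain-classifier results:
        # "shared_src", "private_src", "shared_tgt", "private_tgt",
        # "domain_logits" (computed through a gradient reversal layer), "domain_labels"
        l_recon = F.mse_loss(recon_src, src) + F.mse_loss(recon_tgt, tgt)
        l_diff = (difference_loss(enc["shared_src"], enc["private_src"])
                  + difference_loss(enc["shared_tgt"], enc["private_tgt"]))
        l_sim = F.cross_entropy(enc["domain_logits"], enc["domain_labels"])
        return (task_loss + weights["recon"] * l_recon
                + weights["diff"] * l_diff + weights["sim"] * l_sim)
    ```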

    Bousmalis et al. evaluate their model on several synthetic-to-real scenarios, e.g., on synthetic traffic signs and synthetic objects from the LineMod dataset.

    Domain separation networks became one of the first major examples in domain adaptation with disentanglement, where the hidden representations are domain-invariant and some of the features can be changed to transition from one domain to another. Further developments include:

    • FCNs in the Wild by Hoffman et al., where feature-based DA for semantic segmentation is done with fully convolutional networks (FCNs) in a setting where ground truth is available for the source domain (synthetic data) but unavailable for the target domain (real data); they also used domain adversarial training;
    • Xu et al. (2019) used adversarial domain adaptation to transfer object detection models—single-shot multi-box detector (SSD) and multi-scale deep CNN (MSCNN)—from synthetic samples to real videos in the smoke detection problem;
    • Chen et al. (2017) construct the Cross City Adaptation model that brings together features from different domains, with semantic segmentation of outdoor scenes in mind; they adapt segmentation across different cities around the globe and show that their joint training approach with domain adaptation improves the results significantly;
    • and many more…

    The last paper I want to highlight here is by Hong et al. (2018) who provide one of the most direct and most promising applications of feature-level synthetic-to-real domain adaptation. In their Structural Adaptation Network, the conditional generator takes as input the features from a low-level layer of the feature extractor (i.e., features with fine-grained details) and random noise and produces transformed feature maps that should be similar to feature maps extracted from real images:

    To achieve this, the conditional generator produces a noise map that is then added to the high-level features. Hong et al. compared the Structural Adaptation Network with other state-of-the-art approaches, including FCNs in the Wild and Cross City Adaptation, using SYNTHIA and GTA as source domain datasets and Cityscapes as the target domain dataset; they conclude that this adaptation significantly improves the results for semantic segmentation of urban scenes. Here is a sample of their results:
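
    As a very rough schematic of the generator part described above (my own sketch with placeholder channel sizes, assuming for simplicity that the low- and high-level feature maps share the same spatial resolution; this is not the actual architecture of Hong et al.):

    ```python
    import torch
    import torch.nn as nn

    class FeatureNoiseGenerator(nn.Module):
        """Schematic conditional generator: takes low-level feature maps and random
        noise and produces a residual added to the high-level features; the adapted
        features can then be pushed, e.g. adversarially, towards the statistics of
        features extracted from real images."""
        def __init__(self, low_ch=256, high_ch=512, noise_ch=16):
            super().__init__()
            self.noise_ch = noise_ch
            self.net = nn.Sequential(
                nn.Conv2d(low_ch + noise_ch, 256, 3, padding=1), nn.ReLU(),
                nn.Conv2d(256, high_ch, 3, padding=1),
            )

        def forward(self, low_feats, high_feats):
            b, _, h, w = low_feats.shape
            noise = torch.randn(b, self.noise_ch, h, w, device=low_feats.device)
            residual = self.net(torch.cat([low_feats, noise], dim=1))
            return high_feats + residual   # transformed (adapted) feature maps
    ```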

    Conclusion

    Feature-level domain adaptation provides interesting opportunities for synthetic-to-real adaptation. Many of these methods are still works in progress, but the field is maturing rapidly. In our experience, feature- and model-level DA is usually simpler, more robust, and easier to get to work than data-level refinement, so we expect exciting new developments in this direction and recommend trying this family of methods for synthetic-to-real DA (unless actual refined images are required).

    With this, I am concluding this long series on different facets of using synthetic data in machine learning. Most importantly, synthetic data is a source of virtually limitless perfectly labeled data. It has been explored in many problems, but we believe that many more potential use cases still remain. Maybe we will get a chance to explore them together in 2022.

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Synthetic Data and the Metaverse

    Synthetic Data and the Metaverse

    Today, we are talking about the Metaverse, a bold vision for the next iteration of the Internet consisting of interconnected virtual spaces. The Metaverse is a buzzword that sounded entirely fantastical for a very long time. But lately, it looks like technology is catching up, and we may live to see the Metaverse in the near future. In this post, we discuss how modern artificial intelligence, especially computer vision, is enabling the Metaverse, and how synthetic data is enabling the relevant parts of computer vision.

    What is the Metaverse

    The Metaverse is far from a new idea. Anyone familiar with the cyberpunk genre will immediately recognize the concept of a virtual reality that characters of William Gibson’s Neuromancer (1984) inhabit. The term itself was coined in Neal Stephenson’s novel Snow Crash (1992), and this virtual reality-based Internet 2.0 has seen many fictionalized adaptations ever since, including The Matrix, Ready Player One, the recent Amazon series Upload, and many more.

    While the Metaverse has long been the subject of sci-fi, by now many visionaries believe that developments in VR, AR, and related fields may soon enable similar experiences in real life… I mean, in virtual life, but real virtual life… you know what I mean. One of the sources that got me thinking about the Metaverse recently was a long interview with Mark Zuckerberg. He talks about “the successor to the mobile internet… an embodied internet, where instead of just viewing content — you are in it… present with other people as if you were in other places”. It sounds like Facebook believes in VR and AR technology and sees the clunkiness of current-generation devices as the main obstacle: right now, hardly anybody would want to do their job in a VR helmet. As soon as wearable technology becomes miniature and light enough, the Metaverse will be upon us.

    Mark Zuckerberg motivates this vision, in particular, with mobile workstations: “…you can walk into a Starbucks… and kind of wave your hands and you can have basically as many monitors as you want, all set up, whatever size you want them to be… and you can just bring that with you wherever you want.” Facebook calls this idea the “infinite office.” But in my opinion, it is almost inevitable that entertainment will be the main driving force behind the Metaverse: imagine that you don’t need large screens to have an immersive cinematic experience, imagine your friends on social networks (well, maybe one social network in particular) streaming their experiences through AR glasses, imagine immersive 3D games that enable real human-to-human personal interaction… Well, I’m sure you’ve heard pitches for VR technology many times, but this time it sounds like it really has a chance of coming through and becoming the next big thing. Others, including Epic Games, Roblox, and Unity, are beginning to build their own visions of the Metaverse.

    How the Metaverse is enabled by computer vision

    But we need more than just smaller VR helmets and AR glasses to build the Metaverse. This hardware has to be supported by software that makes the transition between the real and virtual worlds seamless—and this would be impossible without state-of-the-art computer vision. Let me give just a few examples.

    First, the obvious: VR helmets and controllers need to be positioned in space very accurately, and this tracking is usually done with visual information from cameras, either installed separately in base stations or embedded into the helmet itself. This is the classic computer vision problem of simultaneous localization and mapping (SLAM). VR helmet technology has recently undergone an important shift: earlier models tended to require base stations (“outside-in” tracking), while the latest helmets can localize controllers accurately with embedded cameras (“inside-out” tracking), so you don’t need any special setup in the room:

    This is a result of progress in computer vision; the cameras themselves have not improved that much.

    This problem becomes harder if we are talking about augmented reality: AR software also needs to understand its position in the world, but it needs a far more detailed and accurate 3D map of the environment in order to be able to augment it for the user. Check out our latest AI interview with Andrew Rabinovich, who was the Director of Deep Learning at Magic Leap, the startup that tried to do exactly this.

    Second, we have already talked many times about gaze estimation, i.e., finding out where a person is looking from a picture of their face and eyes. This is also a crucial problem for AR and VR. In particular, current VR relies upon foveated rendering, a technique where the image in the center of the user’s field of view is rendered in high resolution and high detail, and the rendering becomes progressively coarser towards the periphery; for an overview see, e.g., Patney et al. (2016). This is, by the way, exactly how we ourselves see things: we see only a very small portion of the field of view clearly and in full detail, and peripheral vision is increasingly blurry (illustration by Rooney et al., 2017):

    Foveated rendering is important for VR because VR has an order of magnitude larger field of view than flat screens and requires high resolution to support the illusion of immersive virtual reality, so rendering all of it at full resolution would be far beyond consumer hardware.
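
    As a toy illustration of the idea, here is a small NumPy/SciPy sketch that keeps full detail near a given gaze point and blends towards a blurred image in the periphery; real VR pipelines do this at render time (e.g., with variable shading rates) rather than as a post-processing filter, and the image is assumed to be an H×W×3 array.

    ```python
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def foveate(image, gaze_xy, fovea_radius=80.0):
        """Keep full resolution near gaze_xy = (x, y) and blur the periphery."""
        h, w = image.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w]
        dist = np.sqrt((xs - gaze_xy[0]) ** 2 + (ys - gaze_xy[1]) ** 2)
        # weight is 1 inside the fovea and falls smoothly to 0 in the periphery
        weight = np.clip(1.0 - (dist - fovea_radius) / (2 * fovea_radius), 0.0, 1.0)
        blurred = gaussian_filter(image.astype(float), sigma=(6, 6, 0))
        out = weight[..., None] * image + (1.0 - weight[..., None]) * blurred
        return out.astype(image.dtype)
    ```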

    Third, when you enter virtual reality, you need an avatar to represent you; current VR applications usually provide stock avatars or forgo them entirely (many VR games represent the player as just a head and a pair of hands), but an immersive virtual social experience would need photorealistic virtual avatars that represent real people and can capture their poses. Constructing such an avatar is a very hard computer vision problem, but people are making good progress on it. For instance, a recent work by Victor Lempitsky’s team introduced textured full-body avatars able to capture poses in real time from visual data streamed by several cameras:

    We are still not quite there, especially when it comes to faces and emotions, but we are getting better, and the Metaverse will definitely make use of this technology.

    These are only a few of the computer vision problems that arise along the way to the Metaverse; for a more, pardon the pun, immersive experience, just look at the list of talks at the recent IEEE VR Conference, where you will see all of these topics and much more.

    Synthetic data and the Metaverse

    Our long-time readers have no doubt already recognized where this blog post is going. Indeed, as we have discussed many times before (e.g., here or here), modern computer vision requires increasingly large datasets, and manual labeling simply stops working at some point. At Synthesis AI, we propose a solution to this problem in the form of synthetic data: artificially generated images and/or 3D scenes that can be used to train machine learning models.

    I chose the three examples above because they each illustrate different uses of synthetic data in machine learning. Let us go over them again.

    First, SLAM is an example where synthetic data can be used in a straightforward way: construct a 3D scene and use it to render training set images with pixel-perfect labels of any kind you would like, including segmentation, depth maps, and more. We have talked about simulated environments on this blog before, and SLAM is a practical problem where segmentation and depth estimation arise as important parts. Modern synthetic datasets provide a wide range of cameras and modalities; for example, here is an overview of a recently released dataset intended specifically for SLAM (Wang et al. 2020):

    Second, gaze estimation is an interesting problem where real data may be hard to come by, and synthetic data comes to the rescue. I have already used gaze estimation on this blog as a go-to example for domain adaptation, i.e., the process of modifying the training data and/or machine learning models so that the model can work on data from a different domain. Gaze estimation works with relatively small input images, so this was an early success for GANs for synthetic-to-real refinement, where synthetic images were made more realistic with specially trained generative models. Recent developments include a large real dataset, MagicEyes, that was created specifically for augmented reality applications (Wu et al., 2020); in fact, it was released by Magic Leap, and we discussed it with Andrew last time:

    Third, virtual avatars touch upon synthetic data from the opposite direction: now the question is about using machine learning to generate synthetic data. We talked about capturing the pose and/or emotions from a real human model, but there is actually a rising trend in machine learning models that are able to create realistic avatars from scratch. Instagram is experiencing a new phenomenon: virtual influencers, accounts that have a personality but do not have a human actually realizing this personality. Here is Lil Miquela, one of the most popular virtual influencers:

    From a research perspective, this requires state-of-the-art generative models supplemented with synthetic data in the classical sense: you need to create a highly realistic 3D environment, place a high-quality human model inside, and then use a generative model (usually a style transfer model) to make the resulting image even more realistic. There is still a long way to go in this direction before we can have fully photorealistic 3D avatars ready for the Metaverse, but the field is developing very rapidly, and this long way may be traversed in much less time than we expect.

    The Metaverse is an ambitious vision straight out of science fiction, but it looks like it is becoming increasingly realistic. It is quite possible that you and I will live to see an actual Metaverse, be it the social-centric Facebook 2.0 envisioned by Mark Zuckerberg, the massively multiplayer OASIS out of Ready Player One, or, God forbid, the all-encompassing Matrix. But before we get there, there are still many research problems to be solved. Most of them lie in the field of computer vision, and this is exactly where synthetic data is especially effective for machine learning. Join us next time for another installment on synthetic data!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • AI Interviews: Andrew Rabinovich

    AI Interviews: Andrew Rabinovich

    Today, I am proud to present our guest for the second interview, Dr. Andrew Rabinovich. Currently, Andrew is the CTO and co-founder of Headroom Inc., a startup devoted to producing AI-based solutions for online business meetings (taking notes, detecting and attracting attention, summarization, and so on). Dr. Rabinovich has produced many important advances in the field of computer vision (here is his Google Scholar account), but he is probably best known for his work as the Director of Deep Learning at Magic Leap, an augmented reality startup that raised more than $3B in investments.

    Q1. Hello Andrew, and welcome! Let me begin with a general question that I will also expand upon later. You have a lot of experience in academia, with numerous papers published at top conferences and receiving hundreds of citations. At the same time, some of your top accomplishments are related to more “industrial” research work at startups such as Magic Leap.

    What kind of work has been more fulfilling for you? And what, in your view, are the main differences in the process and/or results? On the surface, research work in both industry and academia is supposed to produce novel solutions that work well for the problem at hand; are there important differences here?

    Hello Sergey, I am glad to be here, and thank you for the invitation. What you guys at Synthesis are doing is extremely important for the computer vision field, and I am grateful that with these efforts the state of the art in computer vision, and AI in general, will improve for many years to come.

    This is a very interesting question that dates back to my undergraduate days when I worked on medical image analysis and was interested in building image cytometers — automated microscopes with machine learning inference skills. While developing the cytometer, it quickly became apparent that the state of the art in computer vision (it was called image processing then) wasn’t quite up to par to solve the practical problems I was facing. This realization made me turn to more theoretical work and focus on developing core vision algorithms. A similar situation happened at Google, where I was really excited to work on algorithms for Google Goggles, the first AR app for Android and iPhone. The then-existing, pre-deep-learning approaches weren’t sufficient to develop the product features we were interested in. Again, I turned to more academic research and was very fortunate to work on the development of modern deep networks, including the Inception architecture, which in turn we applied to visual search in Google Photos. You can probably guess where this is going: the same story repeated itself at Magic Leap. I quickly realized that to develop the vision of Mixed Reality, and to close the perceptual gap between real and virtual content, a lot of new fundamental research in computer vision and AI had to be done.

    Overall, academic and applied research aren’t really separable in my mind. Computer vision and machine learning are not fundamental science disciplines, they don’t describe nature. These are engineering challenges that need to be addressed in the context of practical problems. Industrial research provides that context. If the context is chosen correctly, then solutions to specific engineering challenges generalize to other tasks. 

    Q2. Our blog is devoted to synthetic data, so here is the most expected question. During your work in Headroom, Magic Leap, and other startups, have you used synthetic data to solve computer vision problems? In what ways, and how much did it help (if you’re allowed to divulge this kind of information, of course)? Did it help for the augmented reality applications at Magic Leap?

    I have been a proponent of synthetic data since my days at Google, where we heavily relied on data augmentation (synthetic data 0.1) to train deep models. At Magic Leap, we created a whole synthetic data group, with render farms and custom pipelines. At that time, synthetic data companies were quite rare, so we had to do most of it. The benefits of synthetic data ranged from hand and eye-tracking to 3D reconstruction and segmentation. At Headroom, we are collaborating with synthetic data providers across a number of problems. 

    Generally, there are really two fundamental issues with data for learning. First, obtaining data and labeling it can be quite expensive and laborious, whether it involves humans in the loop or not. Many companies today have established an efficient pipeline for ingesting data and providing annotations for it. The second problem, however, is far more critical. Relying on the human ability to annotate certain types of data is misleading. People can only provide relative and qualitative labels, such as drawing bounding boxes around objects or qualifying relative distances. If the task is much more specific, e.g., describing the illumination in the room or saying how far away a person is from a car (in centimeters), humans cannot answer these questions with the required precision, and in the absence of specific sensors, synthetic data is the only path forward.

    By construction, machine-generated data is auto labeled. The main drawback of synthetic data is that it may be sampled from a distribution that doesn’t represent the real world. Fortunately, that gap is quickly closing with realistic synthesis and domain adaptation approaches in AI.

    Q3. One of your latest papers, “DELTAS: Depth Estimation by Learning Triangulation and Densification of Sparse Points”, seems to be making a very interesting point beyond its immediate results. It reconstructs 3D meshes of scenes from RGB images with an end-to-end network, never producing an intermediate depth map, like most other methods do:

    This sounds very human-like to me: I can navigate complex 3D environments, and I have a pretty good grasp on relative depth (which object is closer than the other), but I definitely cannot produce an accurate depth map for my room. Moreover, this is in line with a general trend in deep learning that has seemed evident to me over at least the last decade: we increasingly train neural networks end to end, letting them learn various tasks directly, without predefined intermediate representations or side results. The tradeoff is that end-to-end training for complex tasks usually requires far more data than more specialized training where you have, e.g., ground-truth labeled depth maps.

    Do you agree that this trend exists and if yes, where do you think it will take us in the near future, especially in the field of computer vision? Are there other important problems that can be overcome with such end-to-end architectures, and do we have enough data to do that? To make the question more open-ended, what other trends in computer vision do you see that you expect to carry over for the next couple of years (I think in deep learning it doesn’t make sense to predict beyond a couple of years anyway)?

    End-to-end learning is a very attractive, almost romantic notion. The formulations are usually very elegant and simple. However, as you correctly point out, it requires a significantly larger amount of training data to account for all variations. That is why most problems aren’t solved end-to-end, as we aim to provide supervision along the way. With regards to 3D reconstruction, intermediate supervision with depth maps is problematic as well. Obtaining a large amount of depth data is not trivial. 

    As for the trends, I am not a big follower of them, as they are mostly set by the availability of datasets or funding. Over the last few years, I have focused on multi-task learning and believe that focus on this area of AI will lead to significant advances due to generalization during training and inductive bias during inference.  

    Looking forward, I believe developing AI approaches one modality at a time, when applied to the multimodal tasks that surround us, artificially complicates the problem. For example, the classical problem of video understanding is typically solved by isolating video from everything else. However, the presence of text, available in movie scripts or live transcription, together with audio sources, makes the problem much more tractable. Multimodal multitask learning is one of the areas in AI I am most excited about today.

    Q4. Interestingly, another recent paper of yours, “MagicEyes: A Large Scale Eye Gaze Estimation Dataset for Mixed Reality”, goes in precisely the opposite direction. It makes the case that for eye gaze estimation, better results can be achieved by thinking about the 3D properties of the eye (position of the cornea center and pupil center in 3D) and including them in a multi-task architecture:

    Eye gaze estimation is one of my favorite examples for synthetic data because it has everything: a “pure synthetic” solution based (literally!) on nearest neighbors, GANs for synthetic-to-real refinement that improve the results, new synthetic datasets such as NVGaze… For the readers, here is our recent post about gaze estimation. But it looks like I will have to update my usual story: MagicEyes, the dataset you presented in this paper, is a large-scale dataset of human-labeled real data, and it allows for better results.

    Obviously, collecting this dataset took a lot of money and effort. This leads to two questions. Specifically, do you believe that synthetic data can still help improve eye gaze estimation further? The paper does not show experiments with training EyeNet on mixed real+synthetic datasets: do you think it would be worthwhile to try? And generally, in what other computer vision problems do you expect even larger manually labeled real datasets to appear in the near future, and how do you think it will affect applications of synthetic data in computer vision?

    Eye-tracking is a very interesting example of a computer vision problem. There are decades of research from human vision and neuroscience about the function and anatomy of how we see. MagicEyes datasets aim to collect a variable set of data from a broad population of subjects to capture this natural variability. The learned representations from this data form a foundation of the distribution that we want to learn for a number of different tasks, ranging from blink detection to 3D gaze estimation. If MagicEyes was infinitely large, we’d be done. Labeling this kind of data is possible, even though slow and expensive. By supplementing MagicEyes with synthetic data, we get an opportunity to significantly reduce time and cost, and to increase the training data set size and heterogeneity of seen examples. 

    As for other vision problems, manual datasets for autonomous navigation, satellite imagery, and human interactions are being collected and annotated at scale. Solving these tasks with additional synthetic data will be extremely useful. In fact, we are starting to see synthetic data expertise (specific companies pick and choose their domains of excellence) being compartmentalized to indoor and outdoor environments, and to human vs. man-made objects. 

    Q5. And now let me go back to the industry-vs-academia question, from a different point of view. While preparing the previous two questions, I opened your Google Scholar profile and sorted the publications chronologically. Naturally, you never stopped producing top-notch academic output, but it turned out that it’s far easier to look for your recent papers at your DBLP profile because your Google Scholar profile has recently been literally dominated by patent applications. You’ve had dozens of those in the last couple of years!

    Is that just a formal consequence of your work at Magic Leap and other startups, or does it reflect a deeper position on how practical your work can soon become? Generally speaking, how ready do you think we (humanity) are for solving the basic high-level computer vision problems: 3D scene understanding, visual navigation in the real world, producing seamless augmented reality, and so on? Are we there yet, and if not quite, how long do you think it will take in each case?

    Writing patents is standard practice in industrial research. I was fortunate enough to complement patent filings with the corresponding peer-reviewed publications. As we discussed earlier, I do believe that academic research in computer vision and machine learning precedes its applications. The current AI spring, which started in 2012, has opened a number of industrial research avenues that build upon theoretical results and will lead to innovative products for the next decade.

    With regards to solving complex vision and learning tasks, I think we are still quite a bit away. Machines have become excellent at pattern matching. There are a large number of practical applications that are coming online: from autonomous driving to augmented reality. The limiting factors here are not just the algorithms, however, but rather sensors and data. In augmented reality, for example, the AI components are available, but the computation power, batteries, and displays are not there to deliver a compelling product. 

    Q6. Apart from your research work in academia and industry, you are also helping LDV Capital, one of the top VC funds for AI-related startups, as their Expert in Residence. This may sound like a stock question, but it would be very interesting to hear your personal take on this: how do you evaluate startups that come for your review? What are you looking for the most, and what are the most common mistakes startups make, in your personal experience? Maybe you can share some advice specific for vision-related startups, since it is your personal area of expertise, and LDV Capital seems to have this as an important focus area as well.

    Traditional VC funding happens by following trends. A trend-setting VC firm invests in a particular sector, and the rest of the funds follow. A growing fear of missing out results in large amounts of capital being deployed. Once a new trend emerges, most VC firms happily switch context or diversify. When I look at start-up projects, whether my own or others’, I always look for an end-goal thesis and decide if I agree with it. For example, say a company X makes LiDAR sensors; LiDARs are a hot topic these days. To me, company X is interesting because I believe that without LiDAR, certain long-term goals aren’t possible to achieve, self-driving being one of them. If company X fits into the global scheme of things, it is meaningful and fundamental to market development; if it is a one-off, like creating filters for your Instagram account, not so much.

    Then, there is the team. Regardless of prior focus, having a pedigree, whether in academic research, product development, or executive management, is a must. It is fairly simple to tell experts from dreamers.

    Finally, there are many aspiring entrepreneurs who want to start companies for the sake of starting companies or because they have access to interesting technology. In that situation, product definition doesn’t come from a real need to improve an existing approach, but rather from an opportunistic perspective of “let’s invent a solution for a problem that doesn’t exist”. I think this is the curse of most tech startups.

    Thank you very much for your answers, Andrew! We will come back with the next interview soon—stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI