Category: Synthesis AI

  • CVPR ‘22, Part III: Digital Humans

    CVPR ‘22, Part III: Digital Humans

    Last time, we talked about new use cases for synthetic data, from crowd counting to fractal-based synthetic images for pretraining large models. But there is a large set of use cases that we did not talk about, united by their relation to digital humans: human avatars, virtual try-on for clothes, machine learning for improving animations of synthetic humans, and much more. Today, we talk about the human side of CVPR 2022, considering two primary directions: conditional generation for applications such as virtual try-on, and learning 3D avatars from 2D images (image generated by DALL-E-Mini by craiyon.com with the prompt “virtual human in the metaverse”).

    Introduction and the Plan

    In the first post of this series, we talked about new synthetic datasets presented at CVPR ‘22. The second post was devoted to various practical use cases where synthetic data has been successfully used. Today, we dive deeper into a single specific field of application related to digital humans, i.e., models that deal with generating either new images of humans or 3D models (virtual avatars) that can be later animated or put into a metaverse for virtual interaction.

    Just like in the previous posts, papers will be accompanied by links to OpenSynthetics, a public database of all things related to synthetic data that we have launched recently. We have two important directions in today’s post: conditional generation with different features (usually for virtual try-on applications) and trying to learn synthetic human avatars from photographs. Let me begin with a paper that, in a way, combines the two.

    BodyGAN

    We begin with “BodyGAN: General-purpose Controllable Neural Human Body Generation” (OpenSynthetics), where Yang et al. continue a long line of work devoted to generating images of humans with GANs. Throughout the history of GAN development, human faces have always showcased the progress, from the earliest attempts that couldn’t capture faces at all to the intricate modifications allowed by the StyleGAN family. I’ve been showing this famous picture by Ian Goodfellow in my lectures since 2018:

    And current results by, say, StyleGAN 3 are much more diverse and interesting:

    The classical line of improvements, however, only dealt with faces, mostly inspired by the CelebA dataset of celebrity photos. Generating full-scale humans with different poses and clothing is a much harder task, especially if you wish to control these parameters separately. There has been previous work, including StyleRig, which tried to add 3D rigging controls to StyleGAN-generated images, and StylePoseGAN, which added explicit control over pose; these are exactly the works that BodyGAN promises to improve upon.

    Let’s briefly go through the main components of BodyGAN:

    It has three main components: 

    • the pose encoding branch that includes three subnetworks for body parts segmentation, 3D surface mapping, and key point estimation;
    • the appearance encoding branch that produces encodings (condition maps) separately for different body parts;
    • and the generator that is supposed to produce realistic images based on these conditions (a rough sketch of how these conditions feed the generator follows below).
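
    Since the implementation details are not the point here, let me illustrate the general idea with a hedged PyTorch sketch: disentangled condition maps (pose, surface, keypoints, appearance) are simply stacked along the channel dimension and fed to a conditional generator. All channel counts and the toy generator below are my own assumptions for illustration, not the actual BodyGAN architecture.

    ```python
    import torch
    import torch.nn as nn

    # Hypothetical channel counts; the real BodyGAN defines its own encodings.
    N_PARTS, N_KEYPOINTS, APPEARANCE_DIM = 8, 18, 16
    H, W = 256, 256

    class ToyGenerator(nn.Module):
        """Stand-in for the generator: maps stacked condition maps to an RGB image."""
        def __init__(self, in_channels):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
            )
        def forward(self, cond):
            return self.net(cond)

    # Pose branch outputs: part segmentation, 3D surface map, keypoint heatmaps.
    part_seg = torch.rand(1, N_PARTS, H, W)
    surface_map = torch.rand(1, 3, H, W)
    keypoints = torch.rand(1, N_KEYPOINTS, H, W)
    # Appearance branch output: per-part appearance condition maps.
    appearance = torch.rand(1, APPEARANCE_DIM, H, W)

    conditions = torch.cat([part_seg, surface_map, keypoints, appearance], dim=1)
    generator = ToyGenerator(conditions.shape[1])
    fake_image = generator(conditions)   # (1, 3, 256, 256)
    # Swapping, say, `appearance` for another person's condition maps at inference
    # is what makes virtual try-on possible in this kind of disentangled setup.
    ```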

    Training utilizes two discriminators, one for the pose branch and another for the appearance branch, and during inference one can substitute new shape, pose, and appearance encodings (e.g., change the skin color) to obtain new realistic images:

    So overall it is a relatively straightforward architecture that hinges on explicit disentanglement between different features, and the network architectures are also quite standard (e.g., discriminators are taken from pix2pixHD). Interestingly, it works better than previous approaches; here are some characteristic samples for the main application in the paper, virtual try-on (with conditions shown as small images in the corners):

    We will see more virtual try-on results below; it was a hot topic at CVPR ‘22. However, this work shows that even a relatively straightforward but well-executed take on the problem can produce very good results. Overall, it looks like we are almost there with these kinds of conditional generation and style transfer applications for images; I would expect truly photorealistic results quite soon.

    Dressing in the Wild

    But that conclusion was only about images; producing convincing photorealistic videos is much harder, and we still have some way to go here. The work “Dressing in the Wild by Watching Dance Videos” (OpenSynthetics) by ByteDance researchers (ByteDance is the parent company of TikTok) takes a middle ground: they do show some results on videos but primarily use videos to perform better garment transfer on still images with challenging poses.

    First, they present a dataset of 50,000 real-life single-person dance videos, Dance50k, with a lot of different garments and poses (at the time of writing, Dance50k was not yet fully available but it’s supposed to be released at the project page). The videos do look diverse enough to get a wide variety of different poses in the wild:

    The model itself is interesting and stands out among the usual GAN-based conditional generation. It is called wFlow but the word “flow” is not about flow-based generative models that have become increasingly popular over the last few years. This time, it is about optical flow estimation: the model has a component estimating where each pixel in the source image should go in the target image.

    Let us go through the pipeline. The input includes a source image of a person where the garment comes from and a query pose image where the pose comes from. The wFlow model works as follows:

    • first (this is not shown in the image above), the authors apply OpenPose to estimate the positions of 18 body joints, a pretrained person segmentation model to obtain segmentation maps for source and query images, and a pretrained mesh extraction model to obtain a dense representation of the 3D mesh extracted from the images;
    • the conditional segmentation network (CSN) takes as input a person source segmentation map, its dense pose representation, and body joints, and produces the target segmentation mask and layout of different body parts;
    • the pixel flow network (PFN) takes the same inputs plus the segmentation mask produced by CSN and predicts the pixel flow, i.e., locations at the target frame where the source frame pixels should map to (see the warping sketch after this list);
    • an entirely novel part of wFlow is the next step, where the predicted 2D pixel flow is improved with dense pose representations from extracted meshes, fusing the 3D vertex flow with 2D pixel flow;
    • then the resulting flow guides three UNet-based generators; two of them are needed to complete the cycles during training, and the third will actually be used on inference for garment transfer.
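
    To make the optical flow part concrete, here is a minimal sketch of the warping step that any flow-guided garment transfer pipeline needs: given a predicted pixel flow field, warp the source image towards the target pose with standard grid sampling. This is not the authors’ code; the sketch uses the backward-warping convention (for every target pixel, where to sample in the source), and wFlow additionally fuses this 2D flow with a 3D vertex flow.

    ```python
    import torch
    import torch.nn.functional as F

    def warp_with_flow(source, flow):
        """Warp `source` (B, C, H, W) with a pixel flow field `flow` (B, 2, H, W).
        flow[:, 0] and flow[:, 1] are x- and y-displacements in pixels: for every
        target-frame location, they say where to sample from in the source frame."""
        b, _, h, w = source.shape
        # Base sampling grid: identity mapping in pixel coordinates.
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack([xs, ys], dim=0).float().unsqueeze(0).expand(b, -1, -1, -1)
        coords = base + flow                  # where each target pixel looks in the source
        # Normalize to [-1, 1] as required by grid_sample, reorder to (B, H, W, 2).
        grid = torch.stack([2.0 * coords[:, 0] / (w - 1) - 1.0,
                            2.0 * coords[:, 1] / (h - 1) - 1.0], dim=-1)
        return F.grid_sample(source, grid, align_corners=True)

    source = torch.rand(1, 3, 256, 192)   # person image the garment comes from
    flow = torch.zeros(1, 2, 256, 192)    # in wFlow, predicted by the pixel flow network
    warped = warp_with_flow(source, flow) # with zero flow, warped == source
    ```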

    There are some more interesting tricks in the paper, but let’s skip those and get to sample results. First, you can see why videos are hard; video results do have some inconsistencies and flicker across frames, and the lighting is hard to get right:

    But as for still images, the model produces excellent results that already look quite sufficient for the virtual try-on application:

    IMavatar: Human Head Avatars from Video

    We now move on from garment transfer to learning digital avatars. The main difference here is that you have to construct a 3D avatar from 2D images, and then perhaps teach that avatar a few tricks. The first paper of this batch, “I M Avatar: Implicit Morphable Head Avatars from Videos” (OpenSynthetics), concentrates on models of human heads.

    In synthetic data and computer graphics in general, human heads are often represented with 3D morphable face models (3DMMs); this is a huge field that started with relatively simple parametric models in the late 1990s and continues today with much more detailed, nonlinear neural parametric face models. The idea is to model the appearance and facial geometry in a lower-dimensional representation, together with a decoder to produce the actual models (meshes). This idea underlies, in particular, our very own HumanAPI, and here at Synthesis AI we are also investigating new ideas for human head generation based on 3DMMs. This field is also closely related to neural volumetric modeling, e.g., neural radiance fields (NeRFs), which are rapidly gaining traction; I hope to devote a later post to the recent developments of NeRFs at CVPR ‘22.

    In this work, Zheng et al. base their approach on the FLAME model that parameterizes shape, pose, and expression components. Basically, the 3DMM here consists of three networks (neural implicit fields): one predicts the occupancy values for each 3D point, another one predicts deformations, i.e., transformations of canonical points (points from the original model) to new locations based on facial expressions, and the third provides textures by mapping each location to an RGB color value.
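
    To make the “three networks” structure concrete, here is a minimal PyTorch sketch of what such neural implicit fields could look like. The real IMavatar networks are conditioned on FLAME expression and pose parameters and implement the deformation field in a more structured way (via learned blendshapes and skinning), so treat this only as an illustration of the interfaces, with made-up dimensions.

    ```python
    import torch
    import torch.nn as nn

    def mlp(in_dim, out_dim, hidden=128):
        return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                             nn.Linear(hidden, hidden), nn.ReLU(),
                             nn.Linear(hidden, out_dim))

    EXPR_DIM = 50  # assumed size of the expression/pose code

    occupancy_net = mlp(3, 1)               # canonical 3D point -> occupancy value
    deformation_net = mlp(3 + EXPR_DIM, 3)  # canonical point + expression -> offset to deformed location
    texture_net = mlp(3, 3)                 # canonical 3D point -> RGB color

    points = torch.rand(1024, 3)            # canonical points
    expression = torch.rand(1, EXPR_DIM).expand(1024, -1)

    occupancy = torch.sigmoid(occupancy_net(points))                              # (1024, 1)
    deformed = points + deformation_net(torch.cat([points, expression], dim=-1))  # (1024, 3)
    color = torch.sigmoid(texture_net(points))                                    # (1024, 3)
    ```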

    The crux of the paper lies in how to train these three networks. The main approach here is known as implicit differentiable rendering (IDR), an idea that certainly deserves a separate post. In essence, the neural rendering model produces RGB values for a given camera position (also learnable) and image pixel, and the whole thing is trained to represent actual pixel values:

    As a result, this network is able to generate (render) new views from previously unseen angles. Zheng et al. adapt this approach to their 3DMM; this requires some new tricks to deal with the iterative nature of finding the correspondences between points (it’s hard to propagate gradients through an iterative process). I will not go into these details, but here is an illustration of the resulting pipeline, where all three networks can be trained jointly in an end-to-end fashion:

    As a result, the model produces an implicit representation of a given human head, which means that you can generate new views, new expressions and other modifications from this model. Here is how it works on synthetic data:

    And here are some real examples:

    Looks pretty good to me!

    FaceVerse: Coarse-to-Fine Human Head Avatars

    In this collaboration (OpenSynthetics) between Tsinghua University and Ant Group (a company affiliated with Alibaba Group), Wang et al. also deal with learning 3D morphable models of human faces. In this case, the emphasis is on the data—not synthetic data, unfortunately for this blog, but on data nevertheless.

    As in many other fields, 3D face datasets come in two varieties: coarse or small. It’s easy to get a rough dataset with the ToF cameras built into many modern smartphones, but to get a high-definition 3D scan you need expensive hardware that only exists in special labs. Wang et al. do both, collecting a large coarse dataset (on the left below) and a small high-quality dataset (on the right):

    The FaceVerse model then proceeds in the same coarse-to-fine fashion: first the authors fit a classical PCA-based 3D morphable model on the coarse dataset, and then refine it with a detailed model similar to StyleGAN, using the smaller high-quality dataset to fine-tune the detail refinement part:
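
    For reference, the “classical PCA-based 3D morphable model” part boils down to a simple linear model: a mean mesh plus weighted sums of principal components for identity and expression. Here is a NumPy sketch with made-up dimensions and random stand-ins for the basis; FaceVerse fits its own basis to the coarse dataset.

    ```python
    import numpy as np

    N_VERTS, N_ID, N_EXPR = 5000, 80, 30   # made-up sizes

    # A fitted PCA 3DMM stores a mean shape and two linear bases (random stand-ins here).
    mean_shape = np.random.randn(N_VERTS * 3)
    id_basis = np.random.randn(N_VERTS * 3, N_ID)
    expr_basis = np.random.randn(N_VERTS * 3, N_EXPR)

    def morphable_shape(id_coeffs, expr_coeffs):
        """Reconstruct a face mesh as mean shape + identity and expression offsets."""
        verts = mean_shape + id_basis @ id_coeffs + expr_basis @ expr_coeffs
        return verts.reshape(N_VERTS, 3)

    # Fitting the coarse model means finding coefficients (plus pose and camera) that
    # best explain the observation; the StyleGAN-like refiner then adds fine detail.
    mesh = morphable_shape(np.zeros(N_ID), np.zeros(N_EXPR))   # the mean face
    ```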

    These steps are then reproduced on inference, providing a model that gradually refines the 3D model of a face to obtain very high quality results at the end:

    Overall, it appears that while there is still some way to go, monocular face reconstruction may soon become basically solved. State-of-the-art models are already doing an excellent job; give it a few more years, and while the results may still fall short of movie-ready photorealism, they will be more than enough to cover our needs for realistic avatars in 3D metaspace.

    PHORHUM: Monocular 3D Reconstruction of Clothed Humans

    The previous two papers were all about heads and faces, but what about the rest of us? In “Photorealistic Monocular 3D Reconstruction of Humans Wearing Clothing” (OpenSynthetics), Google researchers Alldieck et al. present a deep learning model that can take a photo and create a full-body 3D model, complete with clothing.

    This is far from a new problem; previous approaches include, e.g., PIFu from USC and Waseda University, Geo-PIFu from UCLA and Adobe, and PIFuHD from Facebook. This line of models produced very good results already, extracting features from a single image and filling in the occluded details. An important drawback of these works, however, was how they handled surface color: usually the resulting model had its color taken directly from the photo, with shading effects baked in and hard to disentangle from geometry. This made it difficult to use the resulting model in any way except copy-and-paste; even changing the lighting could produce rather bad results.

    In essence, PHORHUM continues the line of PIFu (pixel-aligned implicit function) models: the idea is to represent a 3D surface as a level set of a function f, e.g., the set of points x such that f(x)=0. In this way, you don’t need to store the actual voxels and are free to parameterize the function f in any way you choose—obviously, these days you would choose to parameterize it as a neural network.

    In PIFu models, the image is encoded with an hourglass network to obtain point-specific features, and then the surface is defined via a multilayer perceptron that takes as input the features of the current point (its projection on the image) and its depth. The original PIFu had two different functions, one to encode the surface itself and another to predict the RGB values at the current point:

    To cope with the problem of colors and shading, PHORHUM tries to explicitly disentangle unshaded colors of every point on the surface and the lighting effects. This means that the function is trained to output not the actual color of a pixel but the albedo color, that is, the base color of the surface, and then PHORHUM has a separate shading network to modify it according to lighting conditions:
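
    Here is a rough sketch of the pixel-aligned implicit function idea together with PHORHUM-style albedo/shading separation: project a 3D point onto the image, look up image features at that pixel, let one MLP predict the surface value and albedo, and let a separate small network predict shading from the normal and an illumination code. All shapes, network sizes, and the orthographic projection are invented for illustration and are not the paper’s actual architecture.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    feat_net = nn.Conv2d(3, 32, 3, padding=1)                 # stand-in for the image encoder
    surface_net = nn.Sequential(nn.Linear(32 + 1, 128), nn.ReLU(),
                                nn.Linear(128, 1 + 3))        # -> (surface value, albedo RGB)
    shading_net = nn.Sequential(nn.Linear(3 + 3, 64), nn.ReLU(),
                                nn.Linear(64, 3))             # (normal, light code) -> shading

    image = torch.rand(1, 3, 256, 256)
    features = feat_net(image)                                # (1, 32, 256, 256)

    points = torch.rand(1, 500, 3) * 2 - 1                    # 3D query points in [-1, 1]^3
    # Pixel-aligned lookup: sample features at each point's (x, y) image projection.
    grid = points[..., :2].unsqueeze(2)                       # (1, 500, 1, 2); orthographic projection assumed
    pix_feats = F.grid_sample(features, grid, align_corners=True)[..., 0].permute(0, 2, 1)

    depth = points[..., 2:]                                   # depth of each point along the camera ray
    out = surface_net(torch.cat([pix_feats, depth], dim=-1))
    surface_value, albedo = out[..., :1], torch.sigmoid(out[..., 1:])

    normals = F.normalize(torch.rand(1, 500, 3), dim=-1)      # in PHORHUM, derived from the surface field
    light = torch.rand(1, 500, 3)                             # a global illumination code
    shaded_color = albedo * torch.sigmoid(shading_net(torch.cat([normals, light], dim=-1)))
    ```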

    As a result, you obtain the albedo colors of skin and clothing, and then it becomes much easier to automatically animate the resulting models, adapting to new lighting as needed:

    To get this kind of quality, you are supposed to have a good shot of the person, but it doesn’t have to be a lab shot with a white or green background; almost anything will do:

    Overall, between the previous papers and this one it seems that we will soon have perfectly acceptable virtual avatars walking around various metaverses. I have my doubts about whether this will usher in a new era of remote workplaces and entirely new forms of entertainment—at the very least, we’d first need something less cumbersome than a VR helmet to navigate these metaverses. But it looks like the computer vision part is almost there already.

    Speech-Driven Tongue Animation

    Finally, let me conclude today’s post with something completely different. Have you ever wondered how animated movies match the characters’ speech with their mouths and tongues? Currently, there are two answers: either poorly or very, very laboriously. In computer games and low-budget animation, character models usually have several motions for different vowels and consonants and try to segue from one to another in a more or less fluid way. In high-budget animation (think Pixar), skilled animators have to painstakingly match the movements of the palate and tongue to speech, a process that is both very difficult and very expensive.

    In “Speech Driven Tongue Animation” (OpenSynthetics), Medina et al. from Carnegie Mellon University and Epic Games present a model for automatically generating tongue movements that match the speech. To get the data, you need to do tongue motion capture—I’d never have thought it was possible, but apparently people have been doing it for medical purposes for a long time:

    After that, you need to have an encoder to convert speech into features and a decoder that will take these features and get you the tongue animation. The authors have tried several different encoders and decoders, choosing the best results among both classical and recently introduced feature extractors:
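
    To give a rough idea of the overall shape of such a model (this is not the authors’ architecture; the paper compares several encoder and decoder choices), a speech-to-landmark mapping can be as simple as a recurrent encoder over per-frame audio features followed by a per-frame regressor of 3D landmark coordinates:

    ```python
    import torch
    import torch.nn as nn

    N_AUDIO_FEATS, N_LANDMARKS = 40, 10   # assumed: 40 audio features per frame, 10 tongue landmarks

    class SpeechToTongue(nn.Module):
        def __init__(self, hidden=128):
            super().__init__()
            self.encoder = nn.GRU(N_AUDIO_FEATS, hidden, batch_first=True)  # audio features -> hidden states
            self.decoder = nn.Linear(hidden, N_LANDMARKS * 3)               # hidden state -> 3D landmarks

        def forward(self, audio_feats):                # (batch, time, N_AUDIO_FEATS)
            hidden, _ = self.encoder(audio_feats)
            out = self.decoder(hidden)
            return out.view(audio_feats.shape[0], -1, N_LANDMARKS, 3)

    model = SpeechToTongue()
    clips = torch.rand(2, 100, N_AUDIO_FEATS)          # 2 clips, 100 frames of audio features each
    landmarks = model(clips)                           # (2, 100, 10, 3), then postprocessed into animation
    ```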

    Landmark locations are then postprocessed to get the actual animation. This paper won the Best Demo Award at CVPR ‘22, so be sure to check out their website and in particular their video with examples and descriptions.

    The paper is affiliated with Epic Games, so I would expect this feature to make its way into Unreal Engine 6 or something, but this paper got me thinking about another possible application. I am not a native English speaker, and although I usually watch movies in English, my 11-year-old daughter, naturally, requests Russian voices when we watch Pixar/Disney movies. The modern dubbing industry is quite advanced and goes to great lengths to make speech in a different language more or less fit the mouths animated for English… sometimes at the cost of meaning. It would be enormously expensive to re-animate movies for different languages by hand, but thanks to advances like this one, maybe one day animated movies will be distributed in several different languages with different lip animations produced automatically. And judging by the other results we have discussed in this post, maybe one day live-action movies will too…

    Conclusion

    In this third post about the results of CVPR ‘22, we have discussed several papers on virtual humans, a topic that has remained important at CVPR for many years. In particular, we discussed two important use cases: conditional generation, usually in the form of virtual try-on, and production of 3D avatars from 2D images, both for heads/faces and for full-body avatars. Both problems are key areas of application for synthetic data, as we have seen today and as we have been working on here at Synthesis AI.

    Our next topic will be similar but not directly related to humans anymore: we will discuss generation of synthetic data (or any new photo and video material) based on 3D reconstruction and similar approaches. Stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • AI Interviews: Victor Lempitsky

    AI Interviews: Victor Lempitsky

    Meet our distinguished guest for the third interview: Professor Victor Lempitsky. Prof. Lempitsky is among the best researchers in machine learning, placing especially highly in the field of computer vision (here is his Google Scholar account). Currently Victor is leading the Computer Vision Group at Skoltech (Skolkovo Institute of Science and Technology) and is the VR project leader at Yandex.

    Foreword. Before we begin, I have to say that this interview was composed before February 24, 2022. In fact, it was finalized on February 22, so by now it is almost half a year old. This is the reason why Q6 may look a little strange these days—we were not dancing around the elephant in the room; it simply had not entered yet. By now, Victor has left both positions mentioned in the preamble and is currently working on a new startup in the AR/VR field.

    Q1. Hello Victor, and welcome to our interview! Computer vision is your major focus, so let me start off immediately with the obligatory question for our blog: what is your general view on synthetic data for computer vision? Do you agree that synthetic data, understood as artificially generated labeled data used to train machine learning models, can be a feasible way out of the data problem for computer vision? Or do you place more faith in other possible approaches that we’ve previously discussed on this blog: augmentations, mixup and self-adversarial training, few- and zero-shot learning, adding unlabeled data, and others?

    I do believe in synthetic data, and several recent projects I was involved with have seen clear benefits from using synthetic data. However, most useful synthetic data are modeled from the real world. Such modeling can benefit strongly from unsupervised learning. So, in the end, there is no dichotomy: I believe in the usefulness of synthetic data, which is enriched/created from real unlabeled data. Augmentations, mixups, and adversarial training can all be used as ways to generate useful synthetic data from real data, even though people do not always think about augmentations in this way.

    Q2. Much of your most recent work is devoted to image generation. You have created GANs that work without convolutions or self-attention, neural renderers that can dress 3D avatars and generate semi-transparent objects, GANs that generate timelapse videos of landscapes, and much more. In particular, you often work on 3D generation—generating meshes, textures, point clouds—which is the obvious next step after learning to generate flat images. 3D generation is only starting to work well enough for practical applications, but still, the rate of progress in this field is spectacular. I usually show this picture in my lectures on GANs:

    Do you expect 3D generation to undergo similarly explosive growth in the near future? Or are there conceptual difficulties that need to be resolved before we get the virtual reality Metaverse generated on the fly with GANs?

    The picture you show is indeed very telling, and it reflects and conflates several trends: improvements in algorithms, improvements in computational resources, and improvements in datasets. 

    Given how many bright people are now working on 3D data synthesis, I believe that fast progress in algorithms is inevitable. Neural renderers such as PyTorch3D or nvdiffrast are certainly one piece of the puzzle. Computational resources are trickier and a lot of progress will be bottlenecked on them, so I naturally expect that main breakthroughs will come from the “big four” of NVidia, Google/DeepMind, Meta, and Microsoft (all four have brilliant researchers but also huge computational resources). This was to a large degree true even for 2D image generation, and will likely remain even more true for 3D. Note that I am not saying that everybody else should either join those corporations or work on something else. Just like StyleGAN(s) from NVidia created a whole vibrant ecosystem of researchers from different institutes building on top of it, the same will likely happen with 3D.

    The main bottleneck for progress in 3D data synthesis, however, is (and will be) datasets. Here things are very different from 2D. With 2D, once algorithms and resources were ready, finding good enough datasets for learning was relatively easy. Note that here I am talking about 2D static image generation; good datasets of HD videos are much harder to get: say, YouTube is largely not HD quality, and it is quite a challenge to scrape video datasets of objects or people in high resolution from YouTube. Getting good and large 3D datasets is much harder, especially if we are talking about “full 3D” and not just 2.5D (i.e., color + depth) or toyish 3D models. Currently, quite a few researchers are trying to bypass this lack of datasets and to learn 3D synthesis by matching the 2D images. To this end, they insert 2D projections into their generation learning pipelines. This is surely interesting and could be fruitful, but is inevitably much harder. Just imagine someone trying to learn StyleGAN-like image synthesis while only having access to a dataset of 1D projections such as row sums or one-pixel slices.

    To sum up, I think that the rate of progress in 3D data synthesis will be limited and conditioned on the quality of 3D datasets. Hence, it will be a harder and longer story than with 2D (but no less interesting!)

    Q3. Let us continue from the last question, taking generative models yet further into the realm of speculation. I have always viewed image and 3D generation as an inherently finite task. It has not been easy to scale GANs up, but it seems like progress is inevitable. And human eyes have a finite resolution after all (be it 8K, 32K, or 256K), so the models will sooner or later reach this resolution with photorealistic quality, and there will be no point in moving any further.

    Do you agree with this view, and if yes, when do you expect image and 3D scene generation to hit this ceiling and provide a perfectly immersive experience? (Let’s limit this question to vision, I understand that full immersion will require other senses as well.)

    Let me start by noting that the story with 2D image generation is far from over, even if one can generate very realistic human faces. First of all, GANs still have limited diversity and mode coverage (otherwise we would not have dozens of interesting papers on StyleGAN inversion, and very simple approaches would do the job). Diffusion models are better than GANs in covering the whole distribution but are still extremely slow. Furthermore, even though GAN samples for faces are realistic, GAN samples for full-body human images or, say, for full-body cats are either significantly less realistic or significantly less diverse (or both). Finally, for 2D video synthesis, we as a community are very far from truly realistic results (at least in the unconditional setting).

    Regarding 3D, the situation is even harder for the reasons I discussed in the answer to the previous question, so I do not expect perfect photorealism there for quite a few years.

    Q4. Now let me ask a (slightly more) technical question that I’ve been interested in for a long time. Your two most cited papers according to Google Scholar are “Unsupervised domain adaptation by backpropagation” (joint work with Yaroslav Ganin) and its continuation and extension, “Domain-adversarial training of neural networks” (with a lot of people including, e.g., Hugo Larochelle). They are also, in my opinion, some of the most relevant for synthetic data because they present a simple and ingenious domain adaptation method.

    We have just discussed the basic idea of Ganin and Lempitsky (2015) on this blog, so I’ll be very brief in explaining it. The idea goes as follows: suppose you want to have a model that works for both synthetic and real data (or any two domains, really). You want to train a feature extractor that will extract features independently of the domain, so that, say, a synthetic face will have the same features extracted as its real counterpart, and models trained with these features on synthetic data can be applied to real data. To achieve this, you add a domain classifier that predicts whether it was a synthetic or a real image based on the features extracted. You want that classifier to fail, just like you want the discriminator to fail in GANs. So you train it as another head of your network, but the gradients for the classification error function are reversed, optimizing it in the opposite direction. In the illustration below (taken from your papers), the classifier wants to minimize its loss Ld, but by the time it gets to the feature extractor, the loss is inverted, and the extractor is actually maximizing it.
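
    For readers who have not seen it in code: gradient reversal is literally a few lines in a modern autograd framework, an identity on the forward pass that negates (and optionally scales) the gradient on the backward pass. Here is a minimal PyTorch sketch with a toy feature extractor and two heads; the architecture and hyperparameters are arbitrary.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GradReverse(torch.autograd.Function):
        """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)

    # Toy setup: the domain head sees reversed gradients, so minimizing its loss
    # pushes the feature extractor towards domain-indistinguishable features.
    features = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
    label_head = nn.Linear(32, 2)    # the actual task (e.g., classification)
    domain_head = nn.Linear(32, 2)   # synthetic vs. real

    x, y, d = torch.rand(8, 10), torch.randint(0, 2, (8,)), torch.randint(0, 2, (8,))
    f = features(x)
    loss = F.cross_entropy(label_head(f), y) + F.cross_entropy(domain_head(grad_reverse(f)), d)
    loss.backward()   # the feature extractor receives a negated gradient from the domain loss
    ```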

    My question here is two-fold. First, I explained your idea in terms of synthetic and real images, and the actual papers also present examples of synthetic-to-real transfer, but only for small images. Have there been attempts to apply this to larger-scale domain adaptation, especially synthetic-to-real, and how successful have they been?

    Second, domain-adversarial training sounds like a very general idea that could actually be applicable wider than just domain adaptation. One cannot say this idea is not widely known: both papers have thousands of citations, including foundational works on GANs. But why haven’t GANs switched to gradient reversal instead of alternating training between the generator and discriminator? Are there some hidden problems here that are not evident in the basic idea?

    On your first question, indeed the approach has become popular, and there has been a lot of follow-up work including applications to large images. Just as with small images, the approach there works somewhat but without miracles. I.e., it usually beats the no-adaptation baseline quite confidently, but, of course, does not solve the domain gap problem completely. For the second question, indeed almost all GANs separate the steps for the generator and the discriminator updates and do not reuse the gradient. The main reason, I believe, is that most modern GANs use slightly different functionals as objectives for the generator and the discriminator. In particular, it turns out that to get the best GAN performance, it is useful to have some form of the so-called non-saturating objective for the discriminator, and also to regularize the discriminator quite strongly with a proper regularizer (and details of such regularization matter a lot). So, when your generator and discriminator are trying to optimize slightly different functionals, gradient reuse becomes highly non-trivial and is therefore not used. 

    Just to clarify, for me the difference between gradient reversal and GANs is not a big deal. Actually, we learned about the GAN arxiv report halfway through the project, and by that time we had settled on the idea and the language of “gradient reversal”. This is why we explained our approach in a slightly different way in our paper, and perhaps connected it to GANs in a less clear way than we should have done (but back in early 2015 it was way less obvious that GANs would become such a dominating idea).

    Q5. Another recent work of yours introduces Cloud Transformers, special architectures for processing point clouds that use ideas similar to self-attention blocks, with excellent results in point cloud segmentation, inpainting, and reconstruction tasks.

    Since their inception in 2017, Transformers have taken deep learning by storm. They started by basically replacing all other embeddings in natural language processing and serving as the basis for the very best language models, but now they are all over computer vision as well, ever expanding their reach as your own work suggests. It looks a bit like deep learning gradually taking over every field in the early 2010s.

    Do you have an explanation for this success? I understand how a Transformer works mathematically, but is there any explanation why self-attention proves to be such a good idea in practice?

    Or maybe it’s just an umbrella term for a specific useful trick, and otherwise modern Transformers are very different from each other? In your paper, you keep using words such as “variant” or “reminiscent”, and the architecture indeed doesn’t look much like Vaswani’s original. What is that core idea that makes an architecture a Transformer, and again, why, in your opinion, does it work so well?

    Well, it is hard to deny that transformers are the most exciting and impactful thing that has happened in deep learning in recent years. What is most exciting about transformers is their universality. True, we are still witnessing the competition between vision transformer variants and ConvNet architectures for the title of “the king of ImageNet”. But what is remarkable and makes many people excited is that very similar Transformer architectures can solve very different tasks across very different modalities (images, audio, text, action planning, etc.) with near state-of-the-art quality. Certainly, it feels like the right thing, as our brains also have remarkable plasticity and can repurpose different parts between modalities.

    Our cloud transformers paper will obviously be far less impactful compared to the original transformers, but I still like it very much. Our architecture is similar to “classical” transformers in some ways. E.g. it treats individual points as elements within an unordered set, and our key layer uses multiple processing heads. There are also differences (our equivalent of attention is sparse, and we use convolutions). Still, what I liked about our results is that essentially the same architecture is able to solve very different point cloud processing tasks. This is again reminiscent of the general transformer idea. 

    Q6. And finally a (slightly) more personal question. Anyone who knows you personally or at least follows you knows that you feel strongly about the ethical use of AI. There is a broader trend in the computer vision community toward discussing the ethical use of CV technologies. For instance, the creator of YOLO object detectors Joseph Redmon quit computer vision in early 2020 and famously explained his decision as follows: “I stopped doing CV research because I saw the impact my work was having. I loved the work but the military applications and privacy concerns eventually became impossible to ignore.”

    What is your view on the ethical concerns that arise in modern computer vision? Are researchers responsible for potentially unethical uses of their results? I suppose there is no way to stop progress, but do you think there may be ways to ensure that progress works for the benefit of humanity and not against it? What would you advise to work on if one wanted to achieve this goal?

    I had a small project on person re-identification (mostly from surveillance cameras) with my PhD student back in 2016, and after one year or so we stopped. I do not think we pushed state-of-the-art in video surveillance that much, and the reviewers for the submissions we made on the subject concurred with that :). It is the only example where, in retrospect, I sleep slightly better because my work did not make an impact. 

    Having said that, some of the good and well-meaning people that I know still work on face recognition and camera-based surveillance, and I do not want to judge them. After all, the camera-based surveillance technology is double-edged. It will most likely benefit strong democratic societies by making life there safer and more convenient, but it will make life in authoritarian and totalitarian societies considerably worse, which we are already starting to witness in Russia and other countries. The same actually goes for AI and automation issues. The net effect will be strongly positive, people will live more meaningful and productive lives with more interesting occupations, but the dystopian scenarios will also materialize in some societies. 

    Like always, stopping the progress is impossible, even if many strong researchers including Joe Redmon quit the area. Progress in AI-based surveillance and automation “simply” calls for better and stronger political institutions. And the faster the progress, the more urgent the call. I know this all sounds like I am trying to push the responsibility from AI researchers to others (civil society and politicians), but I am just being honest and realistic. The best thing that we (researchers) can and must do is to inform the general public about the current state-of-the-art and reasonable projections for the future.

    Victor, thank you very much for your answers! And you, dear reader, stay tuned for our next interviews!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • CVPR ‘22, Part II: New Use Cases for Synthetic Data

    CVPR ‘22, Part II: New Use Cases for Synthetic Data

    Last time, we started a new series of posts: an overview of papers from CVPR 2022 that are related to synthetic data. This year’s CVPR has over 2000 accepted papers, and many of them touch upon our main topic on this blog. In today’s installment, we look at papers that make use of synthetic data to advance a number of different use cases in computer vision, along with a couple of very interesting and novel ideas that extend the applicability of synthetic data in new directions. We will even see some fractals as synthetic data! (image source)

    Introduction and the Plan

    In the first post of this series, we talked about new synthetic datasets in computer vision. This post is only superficially different from the first one: here we will consider papers that apply synthetic data to various practical use cases, concentrating more on the downstream task than on synthetic data generation. However, the generation part here is also often interesting, and we will definitely discuss it.

    I will also take this opportunity to discuss two very interesting developments related to synthetic data. First, we will see that synthetic images do not have to be realistic at all to be helpful for training even state-of-the-art visual Transformers, and it turns out that this has a lot to do with fractals. In the last part, we will see how synthetic data helps to automatically fill in the gaps and provide missing data for few-shot learning. But before that, we will see several use cases where synthetic data has helped solve practical computer vision problems. Among these use cases, today we do not consider papers that help generate synthetic data and papers that deal with generating or modifying virtual humans—these will be the topics for later posts.

    Just like last time, I remind you that we have launched OpenSynthetics, a new public database of all things related to synthetic data. In this post, I will again give links to the corresponding OpenSynthetics pages.

    Eyeglass removal

    In “Portrait Eyeglasses and Shadow Removal by Leveraging 3D Synthetic Data” (OpenSynthetics), Lyu et al. consider an interesting image manipulation problem: removing glasses from a human face. While solving this problem is desirable for applications such as face verification or emotion recognition, eyeglasses are very tricky objects for computer vision: they are mostly transparent but can cast shadows and introduce other complex effects in the image. The model constructed in this work consists of two stages: a cross-domain segmentation network predicts segmentation masks of the glasses and shadows cast by them (this part is trained adversarially in order to extract indistinguishable features from real and synthetic data), and then “de-shadow” and “de-glass” networks remove both:

    The whole thing is trained on a mixture of synthetic data and the CelebA dataset (real data), and the authors report much improved results for eyeglass removal:

    This system is the main point of the paper, but for me, it was also interesting to read about their synthetic data generation pipeline. Starting from 3D models of eyeglasses and 3D face models, they manually label four nodes where the glasses attach to the face: two fixed nodes on the temples and two floating points on the nose, “floating” meaning that these two points can drift to produce different positions of glasses on the nose. With these four nodes specified, the system is able to find the pose for the glasses, combine them with the face, and then the authors proceed to standard rendering in Blender, also generating the masks for glasses and their shadows to train the segmentation model:

    And the results are really impressive. Here are some real examples (perhaps cherry-picked, but who cares?..) from the paper:

    Crowd counting

    The work “Leveraging Self-Supervision for Cross-Domain Crowd Counting” by Liu et al. (OpenSynthetics) deals with a very straightforward application of synthetic data. Crowd counting is a natural use case: it is very hard to label every person on a crowd photo, and using real images raises privacy issues since it is usually impossible to get the consent of everybody in a real-world crowd.

    Indeed, there already exists a large synthetic dataset for crowd counting called GCC (Wang et al., 2019) with over 7.6 million people labeled on over 15K synthetic images. This dataset was produced by the Grand Theft Auto V engine, that is, Rockstar Advanced Game Engine (RAGE), together with the Script Hook V library that allows extracting labels from RAGE. Here are two sample images from the paper, a real crowd on the left and a synthetic one on the right:

    Liu et al. use GCC for training and supplement it with unlabeled real images to cope with the domain shift, with a couple of new tricks designed to improve crowd density estimation (such as accounting for perspective, since crowd density appears higher at the top of an image such as the one above than at the bottom). They obtain significantly improved results compared to other domain adaptation approaches; here are a couple of samples (the ground truth crowd density map is in the middle, and the estimated density map is on the right, together with the estimated number of people):
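
    For readers new to crowd counting: models in this line of work regress a density map whose integral equals the number of people, and the ground truth density maps are usually built by placing a small Gaussian at every annotated head position. Here is a minimal version of that preprocessing step (a generic sketch, not the authors’ code; real pipelines often adapt the Gaussian width to perspective or local density):

    ```python
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def density_map(points, height, width, sigma=4.0):
        """Turn head annotations [(x, y), ...] into a density map that sums to len(points)."""
        density = np.zeros((height, width), dtype=np.float32)
        for x, y in points:
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < width and 0 <= yi < height:
                density[yi, xi] += 1.0
        # Smoothing preserves the total count (up to boundary effects).
        return gaussian_filter(density, sigma)

    heads = [(120.3, 40.7), (130.1, 42.2), (300.0, 200.5)]
    dmap = density_map(heads, height=480, width=640)
    print(dmap.sum())   # ~3.0, the number of annotated people
    ```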

    This is an interesting use case for us since it can be read as reaching largely the same conclusions as we did in our recent white paper: if done right, relatively simple combinations of synthetic and real data can work wonders. It is encouraging to see such approaches appear at top venues such as CVPR: I guess synthetic data does just work.

    Formula-driven supervised learning for pretraining visual Transformers

    And now we proceed from state-of-the-art but still quite straightforward applications to something much stranger and, in my opinion, more interesting. First, a very unusual application of synthetic data that requires a little bit of context. In 2020, Kataoka et al. presented a completely new approach to pretraining convolutional networks called Formula-Driven Supervised Learning (FDSL). They automatically generate image patterns, defining image classes via analytically specified fractal categories. How to do that well is a separate and quite difficult problem, but the important thing is that after this transformation, you get a family of fractals for each image category. Here is an illustration from Kataoka et al.:

    As you can see, synthetic fractal images are far from realistic, but they capture some of the patterns characteristic of a given class and hence can be used to pretrain deep learning models; as usual with synthetic data, one can generate an endless stream of new samples from these fractal families. This pretraining does not make training on real images unnecessary but can improve the final results.
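
    To give a flavor of what “analytically defined fractal categories” means: in FractalDB-style FDSL, a category is a set of affine maps (an iterated function system), and images are rendered with the chaos game, i.e., by repeatedly applying randomly chosen maps to a point. Below is a tiny NumPy sketch of this idea; the actual papers add careful category search, rendering, and augmentation on top, so all parameters here are arbitrary.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def random_ifs(n_maps=4):
        """One 'category' = a set of contractive affine maps (A, b)."""
        maps = []
        for _ in range(n_maps):
            A = rng.uniform(-1.0, 1.0, (2, 2))
            A *= 0.8 / max(np.linalg.norm(A, 2), 1e-6)   # force contraction so the orbit stays bounded
            maps.append((A, rng.uniform(-1.0, 1.0, 2)))
        return maps

    def render_fractal(ifs, n_points=20000, size=128):
        point = np.zeros(2)
        image = np.zeros((size, size), dtype=np.uint8)
        for i in range(n_points):
            A, b = ifs[rng.integers(len(ifs))]   # chaos game: pick a random map each step
            point = A @ point + b
            if i > 20:                           # skip a short burn-in before the orbit settles
                x, y = ((point + 8.0) / 16.0 * (size - 1)).astype(int)
                if 0 <= x < size and 0 <= y < size:
                    image[y, x] = 255
        return image

    category = random_ifs()             # fixing the maps fixes the "class"
    sample = render_fractal(category)   # FractalDB perturbs the maps slightly for intra-class variety
    ```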

    Well, in 2022 Kataoka et al. made the next step (OpenSynthetics), moving from CNNs to visual Transformers. They developed new techniques for their synthetic generation, including a new dataset of synthetic image families focused on contours. It turned out that visual Transformers pay most attention to the contours anyway, so even a textureless image is helpful for pretraining:

    And visual Transformers perform better when they are pretrained on images like this one instead of real photos! For example, the authors report that ViT-Base pre-trained on ImageNet-21k showed 81.8% top-1 accuracy after fine-tuning on ImageNet-1k, while the same model pre-trained with FDSL under the same conditions showed 82.7% top-1 accuracy.

    In my opinion, this is a very interesting direction of study. Apart from its direct achievements, it also shows that synthetic-to-real domain shift is not necessarily a bad thing, and if the data is generated in the right way, trying to achieve photorealism may not be the right way to go.

    Synthetic Representative Samples for Few-Shot Learning

    It is a bit of a stretch to call this last paper for today synthetic data, but it presents another interesting idea that may have applications for synthetic data generation as well. Last time, we discussed BigDatasetGAN, a generative model able to create images already labeled for semantic segmentation. This may be one of the first steps towards solving the main problem of generative models as a source of synthetic data: until the works on DatasetGANs, nobody could generate labeled data, so nobody could use generative models to directly generate useful synthetic images.

    If we are talking about classification rather than segmentation, it looks much easier to sidestep this issue: ever since BigGAN, generative models could produce realistic-looking images in many different categories. But this raises another question: to train a generative model we need a dataset in this category, so why don’t we just take this dataset to train on instead of generating new samples?

    The work “Generating Representative Samples for Few-Shot Classification” (OpenSynthetics) by Xu and Le, a collaboration between Stony Brook University and Amazon, finds a new use case where this kind of conditional generation can be useful. The basic idea is as follows: in few-shot learning, say for image classification, one usually trains a feature extractor on a dataset with plenty of labeled data (but the wrong classes) and then adapts it to new classes by estimating a prototype sample. Then this sample can be used for classification; here is an illustration for few-shot and zero-shot classification via prototypes from a classical paper by Snell et al. that started this field:

    This illustration works in the latent space of features produced by some kind of encoder.
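
    For concreteness, here is the prototype idea from Snell et al. in a few lines: prototypes are class means in the encoder’s feature space, and query samples are assigned to the nearest prototype. This is the generic prototypical-network step (with random features standing in for a real encoder), not the code of the CVPR ‘22 paper.

    ```python
    import torch

    def prototype_classify(support_feats, support_labels, query_feats, n_classes):
        """support_feats: (N, D) encoded support samples; query_feats: (M, D) encoded queries."""
        prototypes = torch.stack([
            support_feats[support_labels == c].mean(dim=0) for c in range(n_classes)
        ])                                              # (n_classes, D): one mean per class
        dists = torch.cdist(query_feats, prototypes)    # (M, n_classes) Euclidean distances
        return dists.argmin(dim=1)                      # nearest prototype wins

    # 5-way 5-shot toy example in a 64-dimensional feature space.
    support = torch.randn(25, 64)
    labels = torch.arange(5).repeat_interleave(5)
    queries = torch.randn(10, 64)
    predictions = prototype_classify(support, labels, queries, n_classes=5)
    ```

    The problem discussed below is that with only a handful of support samples, such a mean can easily land far from the true class center; the generated representative samples of Xu and Le are meant to pull it back.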

    But this prototype-based idea has a drawback: it is hard to find a representative prototype if all you have are a few samples. Even if you have a perfect encoder that produces smooth and wonderfully separated Gaussians for every class, these Gaussians have a core of central representative samples and also non-representative samples that are further from the center:

    And if we base a classifier on a single prototype that turns out to be non-representative, the results can be far from perfect. Here is an illustration from an ICLR 2021 paper by Yang et al.:

    But how do we achieve this kind of calibration? Xu and Le propose—and this is where the relation to synthetic data comes into play—to generate representative samples from a variational autoencoder. It is common to use conditional VAEs to learn to extract representative features from images, but this time the cVAE is restricted to produce only representative, central examples of a class (feature vectors close to the center of a Gaussian) via sample selection:

    Note the semantic embedding a: this is where the new samples will come from. For a new class, the authors take its semantic embedding, plug it into this VAE’s decoder, and generate representative samples for the new class. Then the resulting generated prototype is either mixed with actual samples (in few-shot classification) or not (in zero-shot classification), with improved results on miniImageNet and tieredImageNet.

    This is definitely a non-representative example of a paper on synthetic data: the “data” is actually in feature space, and the problem is image classification rather than anything with complicated labeling. But this direction, dating back at least to 2018 (Verma et al., CVPR 2018), is an interesting tangent to our space, and just like DatasetGAN, it goes to show a way in which generative models may prove useful for synthetic data generation.

    Conclusion

    In this post, the second in the CVPR ‘22 series, we have discussed several use cases of synthetic data that have been advanced at the conference, starting from straightforward applications such as eyeglass removal and crowd counting and progressing to less obvious ideas of how deep generative models and even regular mathematical models such as fractals can help produce synthetic data useful for machine learning. Next time, we will discuss a more specific use case related to synthetic humans; stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • CVPR ‘22, Part I: New Synthetic Datasets

    CVPR ‘22, Part I: New Synthetic Datasets

    CVPR 2022, the largest and most prestigious conference in computer vision and one of the most important ML venues in general, has just finished in New Orleans. With over 2000 accepted papers, reviewing the contributions of this year’s CVPR appears to be a truly gargantuan task. Over the next series of blog posts, we will attempt to go over the most interesting papers directly related to our main topic: synthetic data. Today, I present the first but definitely not the last installment devoted to papers from CVPR 2022.

    New Synthetic Datasets: Beyond Images

    As always, CVPR is large, and it contains multitudes, but this year one of the main topics is neural radiance fields (NeRF). These models seem to be the new GANs today, or, better to say, the new visual Transformers, which were in turn the new GANs a couple of years ago. We view image synthesis, especially controlled synthesis with 3D information, as a key idea that can propel synthetic data forward, so I plan to devote several upcoming posts to recent NeRF advancements.

    But in this series, let me begin with more straightforward applications of synthetic data that have found their way into the CVPR program this year. On the list today we have several new synthetic datasets, usually related to specific use cases of synthetic data; many of them touch upon problems that we have already discussed on this blog but some introduce entirely new avenues for research.

    Synthetic data is a well-established field, and this blog has already documented many of its achievements. By now, it is not enough to just generate a new synthetic dataset to get to a top conference like CVPR (to be honest, it was never enough): you need some twist on the tried-and-true formula of “make or obtain 3D CG models, render images, train CV models, profit”. In this section, let us see what new twists CVPR 2022 has brought.

    And one more thing before we begin: we have recently made public a new database that will gradually collect all things related to synthetic data. It is called OpenSynthetics, and it already has quite a lot of content on synthetic datasets, papers, and code repositories related to synthetic data. So in these review posts, I will also give links to the corresponding OpenSynthetics pages.

    BigDatasetGAN: Generating ImageNet1K with Labels

    It had always been common wisdom that GANs, despite their excellent image generation quality and usefulness for synthetic-to-real refinement, cannot really help with data generation from scratch: there was no way to generate labeled data and no easy way to label generated images. Basically, ever since ProGAN and BigGAN (OpenSynthetics; both released in 2018) you could use GANs to generate new realistic images with sufficient quality, but you would still have to label them afterward as if they were just new images. And this has always meant that GANs are useless for synthetic data generation: we have never lacked new images of ImageNet categories; the bottleneck has always been the labeling.

    Well, it looks like there is a way to generate labeled data now! This research direction, driven by NVIDIA researchers, bore its first fruit last year when Zhang et al. presented DatasetGAN at CVPR 2021. Their pipeline works as follows: use StyleGAN to generate several images (say, cars), hand-annotate a few of them for your task (say, segmentation of various car parts), and train a very small model (style interpreter) to produce similar segmentation masks from StyleGAN features. At the cost of labeling a few images (literally, a few: DatasetGAN required 16 labeled heads or about 1000 polygons), you can use StyleGAN to generate as many labeled images as you wish, with the usual excellent StyleGAN quality:
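
    Conceptually, the style interpreter is just a tiny classifier on top of per-pixel generator features: upsample feature maps from several StyleGAN layers to image resolution, concatenate them pixel-wise, and predict a segmentation label for every pixel. Here is a hedged sketch with random tensors standing in for actual StyleGAN features; the real DatasetGAN uses an ensemble of such MLPs and its own feature dimensions.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    H = W = 128                 # output (image) resolution
    N_CLASSES = 8               # e.g., car or face parts

    # Stand-ins for feature maps taken from several generator layers.
    feature_maps = [torch.rand(1, 512, 16, 16),
                    torch.rand(1, 256, 64, 64),
                    torch.rand(1, 128, 128, 128)]

    # Upsample everything to image resolution and build a feature vector per pixel.
    upsampled = [F.interpolate(f, size=(H, W), mode="bilinear", align_corners=False) for f in feature_maps]
    pixel_feats = torch.cat(upsampled, dim=1)                        # (1, 896, H, W)
    pixel_feats = pixel_feats.permute(0, 2, 3, 1).reshape(-1, 896)   # (H*W, 896)

    style_interpreter = nn.Sequential(nn.Linear(896, 128), nn.ReLU(), nn.Linear(128, N_CLASSES))
    logits = style_interpreter(pixel_feats).reshape(1, H, W, N_CLASSES)   # a label for every generated pixel
    # Trained on a handful of hand-annotated generations, this head labels every future sample for free.
    ```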

    At this year’s CVPR, Li et al. continued this line of research and introduced BigDatasetGAN based on BigGAN instead of StyleGAN. The difference is that BigGAN is better suited for generating a wide variety of different image categories, so now you can hand-label 8000 images, 8 for each category, and have a single model able to produce 1000 ImageNet1K categories that come pre-labeled for segmentation:

    The authors report results improved over supervised pretraining for standard segmentation models.

    Does this mean that synthetic data is soon to be absorbed into deep generative models? Time will tell, but I am not sure: generative models are still hard to train, and this approach requires an operational large-scale GAN with the desired categories before we go into labeling. Moreover, DatasetGANs deal only with segmentation so far, and I have my reservations about more complex labeling such as depth. Still, this is an exciting development that shows the power of modern generative models, and its results provide a set of completely new tools for the arsenal of synthetic data generation.

    ABO: Real-World 3D Object Understanding

    ABO stands for Amazon Berkeley Objects (OpenSynthetics), a new indoor environment and object dataset presented in the work by Collins et al., who are, you guessed it, researchers from UC Berkeley and Amazon. ABO answers the same need as the classical but sadly unavailable SunCG dataset, ShapeNet, or Facebook AI Habitat: it provides a large-scale catalogue of 3D models of indoor household objects—chairs, shoes, coat hangers, rugs, tables, and so on—that can be placed in a variety of indoor environments with available renderings.

    Since Amazon is… well, Amazon, ABO is based on product listings: the dataset contains nearly 150K listings of 576 product types with hi-res photos and over 8000 turntable “360° view” images. It also includes nearly 8000 handmade high-quality 3D models of various objects. Moreover, and this is unique to ABO, the objects come with attributes that identify their material, which is useful for physically-based rendering:

    The authors show that training on ABO leads to better results than training on ShapeNet for state-of-the-art 3D reconstruction models. They also introduce a new task that has been enabled by their work, material estimation, and present novel network architectures for this task. In general, this is an impressive effort, and I hope that it will enable many new works in 3D scene understanding, indoor navigation, and related fields:

    ObjectFolder 2.0: A Multisensory Object Dataset

    While ABO provides some information about the material of the object, it is far from exhaustive. Stanford researchers Gao et al. attempt a far more ambitious task in their new ObjectFolder 2.0 dataset (OpenSynthetics): they aim to model complete multisensory profiles of real objects. This means that they aim to capture not only the 3D shape and material of an object (and therefore its texture) but also other sensory modalities, including audio (how a cup clinks when you touch it with a spoon) and how the object feels to the touch. This information can later be used for problems such as contact localization (where exactly have I touched this object?) that are both difficult and important in robotics:

    Since all of these modalities are location-dependent, they cannot all be explicitly stored in the dataset. The authors use implicit neural representations, that is, each object is defined by a few neural networks (multilayer perceptrons) that are trained to convert coordinates into whatever is necessary; VisionNet models the neural scattering function, AudioNet models the location-specific part of the audio response from applying a unit force to this location, while TouchNet predicts the deformation map and tactile image (geometry of the contact surface):

    ObjectFolder 2.0 contains these representations for 1000 household objects such as cups, chairs, pans, vases, and so on.

    Gao et al. test their dataset with three downstream tasks that require multimodal sim2real object transfer: object scale estimation based on vision and audio, contact localization based on audio and tactile response, and shape reconstruction based on visual and tactile data. They report improved performance across all tasks, and this dataset indeed looks like a possible next step for object manipulation in robotics.

    Articulated 3D Hand-Object Pose Estimation

    Pose estimation is a classical computer vision problem; as in all problems related to the understanding of the 3D world from 2D images, synthetic data comes to mind naturally: it is impossible to do exact manual labeling for pose estimation, and even inexact human labeling is very laborious. This goes double for more detailed tasks such as hand pose estimation, so it is no wonder that there exist synthetic datasets for this problem; in particular, here at Synthesis AI we have a variety of hand gestures as part of our HumanAPI.

    In “ArtiBoost: Boosting Articulated 3D Hand-Object Pose Estimation via Online Exploration and Synthesis” (OpenSynthetics), Li et al. make the next step: they consider not just hand gestures but hands holding various objects in different positions. The authors consider the “composited hand-object configuration and viewpoint space” (CCV space) where you can vary object types, composite hand-object poses, and camera viewpoints:

    Then they apply a newly developed grasp synthesis method (that I will not go into), obtain renderings of a synthetic hand grasping the object, and use these images for training.

    What is most interesting for me in this work is that it is an example of the "closing the loop" idea that we have been proposing for quite some time now here at Synthesis AI; in particular, pardon the self-promotion, I discussed it as an important idea for the future of synthetic data in Chapter 12 of my book.

    In this case, Li et al. do not merely sample the CCV space and create a randomly generated dataset of synthetic hands with objects. They assign weights to different objects, poses, and viewpoints, and update these weights with feedback obtained from the trained model, trying to skew the sampling towards hard examples, a technique known in other contexts as “hard negative mining”. It is great to see that “closing the loop” is gaining traction, and I am certain it can help in other problems as well.
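    Here is a minimal sketch of the general idea (not the authors' actual implementation; the renderer and trainer names below are hypothetical): keep a weight for every cell of the discretized sampling space, sample cells in proportion to their weights, and increase the weights of cells where the current model still makes large errors:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical discretized CCV space: each cell is an (object, pose, viewpoint) triple.
    num_cells = 1000
    weights = np.ones(num_cells)

    def sample_batch(batch_size=32):
        """Sample cells proportionally to their current weights."""
        probs = weights / weights.sum()
        return rng.choice(num_cells, size=batch_size, p=probs)

    def update_weights(cell_ids, per_sample_loss, lr=0.5):
        """Skew future sampling towards cells where the current model still struggles."""
        for cell, loss in zip(cell_ids, per_sample_loss):
            weights[cell] = (1 - lr) * weights[cell] + lr * (1.0 + loss)

    # Inside the training loop (renderer and trainer are hypothetical):
    # cells = sample_batch()
    # images, labels = render_hand_object_scenes(cells)
    # losses = training_step(model, images, labels)   # per-sample losses
    # update_weights(cells, losses)
    ```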

    SHIFT: Synthetic Driving via Multi-Task Domain Adaptation

    And now let us, pardon the pun, shift to data about the outdoors. We begin with autonomous driving. The work "SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation" (OpenSynthetics), coming from ETH Zurich researchers Sun et al., is a pure dataset presentation for SHIFT, a synthetic driving dataset—but SHIFT is far from a "regular" collection of labeled synthetic images! The problem that Sun et al. recognize here is that autonomous driving requires the system to adapt to constantly changing conditions: if you are driving and it starts raining, the view around you changes significantly and maybe quite quickly, and the computer vision system has to keep working reliably.

    To help cope with that, SHIFT contains explicit “domain shifts” across several different domains such as weather conditions, time of day, surroundings, and so on:

    So far, this is quite standard fare for autonomous driving simulators. But SHIFT goes further: it provides continuous shifts across domains whenever possible. You can have day gradually turning into night or rain starting on a sunny day:

    Naturally, each frame is annotated in the usual modalities, with object bounding boxes, segmentation maps, depth maps, optical flow, and LiDAR point clouds.

    Based on SHIFT, the authors investigate how various object detection and segmentation models cope with these domain shifts. They demonstrate that conclusions about robustness to domain shift that can be made on synthetic data also transfer to real datasets. I think that’s an important validation for synthetic data in general: it turns out that synthetic data can help evaluate machine learning models in ways that real data may fail to provide.

    TOPO-DataGen: Multimodal Synthetic Data Generation for Aerial Scenes

    In another classical synthetic data paper, EPFL researchers Yan et al. present TOPO-DataGen (OpenSynthetics), an automated synthetic data generation system that utilizes available geographic data such as LiDAR point clouds, orthophotographs, or digital terrain models to create synthetic scenes of various parts of the world, complete with the usual synthetic modalities such as depth maps, normals, segmentation maps, and so on:

    The generated images look very impressive and highly realistic, which is made slightly easier by the fact that they are aerial images taken from far away. Based on TOPO-DataGen, Yan et al. develop a new CrossLoc model for absolute localization (i.e., estimating the 6D camera pose in space) that works with several input modalities. They also show some impressive demos of trajectory reconstruction from aerial images based on CrossLoc. In general, while synthetic satellite and aerial images have been generated before, I believe this is the first attempt to bring together the different modalities that are actually often available in current practice.

    LiDAR snowfall simulation

    Finally, a very specific but fun use case: simulating snowfall. Autonomous driving should work under all realistic weather conditions, including heavy snow. But snow presents two problems that are especially bad for LiDARs: first, the ground becomes wet, which changes its reflective properties, and second, the particles of snow in the air also interact with the laser beam, leading to absorption and backscattering that attenuate the LiDAR signal and introduce a lot of noise into it.
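    Just to give a feeling for the kind of transformation involved, here is a deliberately crude toy augmentation (nothing like the physically based model discussed next): attenuate return intensities with range and inject a few spurious low-intensity points near the sensor to imitate backscatter:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def toy_snowfall_augment(points, attenuation=0.02, scatter_rate=0.01, max_scatter_range=10.0):
        """points: (N, 4) array of x, y, z, intensity.
        Crude stand-in for snowfall effects: exponential intensity attenuation with range
        plus random low-intensity backscatter points close to the LiDAR."""
        xyz, intensity = points[:, :3], points[:, 3]
        ranges = np.linalg.norm(xyz, axis=1)
        intensity = intensity * np.exp(-attenuation * ranges)

        n_scatter = int(scatter_rate * len(points))
        directions = rng.normal(size=(n_scatter, 3))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        scatter_xyz = directions * rng.uniform(0.5, max_scatter_range, size=(n_scatter, 1))
        scatter = np.hstack([scatter_xyz, rng.uniform(0.0, 0.1, size=(n_scatter, 1))])

        return np.vstack([np.hstack([xyz, intensity[:, None]]), scatter])
    ```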

    Hahner et al. present a snowfall simulation system able to augment LiDAR datasets (in this case, STF by Bijelic et al., a work that itself introduced a fog simulation system) with special models for wet ground reflection and the influence of scattering particles. As a result, 3D object detection models trained with this augmentation perform much better; in the illustration below, note that the rightmost results contain no spurious objects, and predicted bounding boxes (black) match the ground truth (green) very well:

    Conclusion

    Today, we have begun our long journey through CVPR 2022. We have looked at papers that introduce new synthetic datasets, usually going far beyond simple generation of labeled images and sometimes defining completely new tasks. Next time, we will talk about papers that present specific use cases for synthetic data, that is, validate the use of synthetic data in practical computer vision tasks. Admittedly, it’s a blurry line with this first installment, but this post is getting quite long as it is. Until next time, stay tuned to the Synthesis AI blog, and check out OpenSynthetics!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Driving Model Performance with Synthetic Data VII: Model-Based Domain Adaptation

    Driving Model Performance with Synthetic Data VII: Model-Based Domain Adaptation

    After a long hiatus, we return from interviews to long forms, continuing (and hopefully finishing) our series on how synthetic data is used in machine learning and how machine learning models can adapt to using synthetic data. This is our seventh installment in the series (part 1, part 2, part 3, part 4, part 5, part 6), but, as usual, this post is (I hope!) sufficiently self-contained. We will discuss how one can get a model to work well when trained on synthetic data without explicitly making the data more realistic, doing the domain adaptation work at the level of features or the model itself.

    Intro and weight sharing

    In previous installments, we have considered models that perform refinement, that is, domain adaptation at the data level. This means that somewhere in the model, there is a learned transformation that takes data points from the source domain (in our case, synthetic images) and transforms them to make them more like the target domain (real images). 

    But it sounds like a lot of unnecessary extra work! Our final goal is very rarely to generate more realistic synthetic images. On the contrary, we want to use synthetic images to help train better models; the data itself is not important, it is just a stepping stone to models that work better. So maybe we don’t need to learn transformations on the level of images and can work in the space of features or model weights, never going back to change the actual data?

    One simple and direct approach to doing that would be to share the weights among networks operating on different domains. This way, when you train on both domains, the network has to learn to do well on both with the same weights – exactly what you need for domain adaptation. This was the idea of the earliest approaches to domain adaptation in deep learning, but weight sharing and similar ideas remain relevant to this day. For instance, Rozantsev et al. (2019) do domain adaptation with a two-stream architecture; the weights for processing the two domains are not shared but the architectures are the same, and there are special regularizers on all layers that bring their weights together:
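    For illustration, here is a minimal sketch of one way to implement such a weight-tying regularizer between two streams with identical architectures (my own simplification; the actual regularizer in Rozantsev et al. is more elaborate):

    ```python
    import torch
    import torch.nn as nn

    def weight_divergence(source_net: nn.Module, target_net: nn.Module) -> torch.Tensor:
        """Sum of squared differences between corresponding parameters
        of two networks with identical architectures."""
        penalty = torch.zeros(())
        for p_src, p_tgt in zip(source_net.parameters(), target_net.parameters()):
            penalty = penalty + ((p_src - p_tgt) ** 2).sum()
        return penalty

    # total_loss = task_loss_source + task_loss_target + lambda_w * weight_divergence(src_net, tgt_net)
    ```

    The regularizer keeps the two streams close to each other while still letting each of them specialize a little for its own domain.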

    Another approach to model-level domain adaptation is to mine relatively strong priors from real data that can then inform a model trained on synthetic data, helping fix problematic cases or incongruencies between synthetic and real data. This also brings us to curriculum learning: it is often helpful to start with the easy cases and get a network rolling, and then fine-tune it in harder and harder situations.

    For example, Zhang et al. (2017) present a curriculum learning approach to domain adaptation for semantic segmentation of urban scenes. They train a segmentation network on synthetic data (specifically on the GTA dataset) but with a special component in the loss function related to the general label distribution in real images, intended to bring together the distributions of labels in real and synthetic datasets. The problem here is that this distribution is not available for real data (real images come without labels), so this is where curriculum learning comes in: the authors first train a simpler model on synthetic data to estimate the label distribution from image features and then use it to inform the segmentation model:
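    A rough sketch of how such a label-distribution term could look (this is a simplification, not the authors' exact loss): compare the label distribution implied by the segmentation outputs on a real image with the distribution predicted by the simpler "easy task" model:

    ```python
    import torch
    import torch.nn.functional as F

    def label_distribution_loss(seg_logits, target_distribution, eps=1e-8):
        """seg_logits: (B, C, H, W) segmentation outputs on real images.
        target_distribution: (B, C) per-image label distributions estimated
        by a simpler model trained on synthetic data (the curriculum 'easy task').
        Returns a KL-style penalty between the two distributions."""
        probs = F.softmax(seg_logits, dim=1)
        predicted_distribution = probs.mean(dim=(2, 3))    # fraction of each class in the image
        return F.kl_div((predicted_distribution + eps).log(),
                        target_distribution, reduction="batchmean")

    # total_loss = ce_loss_on_synthetic + lambda_cl * label_distribution_loss(real_logits, est_dist)
    ```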

    But there are much more interesting ideas in model-based domain adaptation than just training the same network on both domains with some regularizers. Let’s get to them!

    Reversing the Gradients

    One of the main directions in model-level domain adaptation was initiated by Ganin and Lempitsky (2015) who presented a generic framework for unsupervised domain adaptation. Their basic approach goes as follows:

    Let’s unpack what we see in this picture:

    • the feature extractor, true to its name, extracts features from input data; this is actually the network that we want to make domain-independent; after extraction, the features go two separate ways;
    • the label predictor actually does what the network is supposed to do, in this case probably classification but it could be segmentation or any other kind of computer vision problem;
    • the domain classifier is the core of this idea; it takes extracted features as input and attempts to classify which domain the original input belonged to.

    The idea is to train the label predictor to perform as well as possible and at the same time make the domain classifier perform as badly as possible. This is actually very similar to GANs (which we have discussed before). The difference, however, is that Ganin and Lempitsky devised an ingenious method for training that doesn’t require solving any minimax problems or iteratively alternating between networks. 

    The method is called gradient reversal: the gradients are multiplied by a negative constant as they pass from the domain classifier back to the feature extractor. In this way, the domain classifier itself is trained to minimize its classification error, but the feature extractor, receiving reversed gradients, learns features that maximize this error, i.e., features that the domain classifier cannot tell apart; the label predictor is trained as usual, all at the same time and within a single backpropagation pass. Like this:
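    The gradient reversal layer itself takes only a few lines of code; here is a minimal PyTorch sketch of how it is usually implemented (an illustration in the spirit of the paper, not the authors' code):

    ```python
    import torch

    class GradReverse(torch.autograd.Function):
        """Identity on the forward pass; multiplies gradients by -lambda on the backward pass."""
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)

    # features = feature_extractor(images)
    # class_logits = label_predictor(features)                    # trained as usual
    # domain_logits = domain_classifier(grad_reverse(features))   # learns to confuse the domains
    # loss = task_loss(class_logits, labels) + domain_loss(domain_logits, domain_ids)
    ```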

    In a subsequent work, Ganin et al. (2016) generalized this domain adaptation approach to arbitrary architectures and experimented with domain adaptation in different domains, including image classification, person re-identification, and sentiment analysis. 

    Disentanglement: Domain Separation Networks and beyond

    Domain separation networks by Bousmalis et al. (2016) represent a different take on the same problem. They attempt to solve domain adaptation via disentanglement, a very important notion in deep learning. Disentanglement is the process of separating different features extracted by a machine learning model so that the separate parts have distinct, recognizable meanings. For example, many style transfer models (we discussed style transfer in Part IV of this series) try to explicitly disentangle style from content, and then swap the style part of the features before decoding back in order to get the same image in a different style.

    In domain adaptation, disentanglement amounts to separating domain-specific features from domain-independent ones, and trying to make sure that the latter will suffice to solve the actual problem. Domain separation networks explicitly separate the shared and private components of both source and target domains, extracting them with a shared encoder and two private encoders, one for the source domain and one for the target domain:

    The overall objective function for a domain separation network consists of four parts (let's not do the formulas, it is, after all, almost Christmas, but see the rough code sketch after the list):

    • supervised task loss in the source domain, e.g., classification loss;
    • reconstruction loss that compares original samples (both real and synthetic) and the results of a shared decoder that tries to reconstruct the images from a combination of shared and private representations;
    • difference loss that encourages the hidden shared representations of instances from the source and target domains to be orthogonal to their corresponding private representations;
    • similarity loss that encourages the hidden shared representations from the source and target domains to be similar to each other; again, “similar” here means that they should be indistinguishable by a domain classifier trained through the gradient reversal layer, as above.
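    And here is a rough sketch of how these four terms fit together; only the difference (orthogonality) loss is spelled out, the rest is shown schematically with hypothetical module names:

    ```python
    import torch

    def difference_loss(shared: torch.Tensor, private: torch.Tensor) -> torch.Tensor:
        """Soft orthogonality between shared and private codes of the same batch:
        the squared Frobenius norm of their correlation matrix (inputs are (batch, dim))."""
        shared = shared - shared.mean(dim=0, keepdim=True)
        private = private - private.mean(dim=0, keepdim=True)
        return (shared.t() @ private).pow(2).sum()

    # Putting the four terms together (hypothetical module names, shown schematically):
    # loss = task_loss(classifier(h_shared_src), y_src) \
    #      + alpha * (recon_loss(decoder(h_shared_src + h_private_src), x_src)
    #                 + recon_loss(decoder(h_shared_tgt + h_private_tgt), x_tgt)) \
    #      + beta * (difference_loss(h_shared_src, h_private_src)
    #                + difference_loss(h_shared_tgt, h_private_tgt)) \
    #      + gamma * domain_confusion_loss(h_shared_src, h_shared_tgt)  # e.g., via gradient reversal as above
    ```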

    Bousmalis et al. evaluate their model on several synthetic-to-real scenarios, e.g., on synthetic traffic signs and synthetic objects from the LineMod dataset.

    Domain separation networks became one of the first major examples in domain adaptation with disentanglement, where the hidden representations are domain-invariant and some of the features can be changed to transition from one domain to another. Further developments include:

    • FCNs in the Wild by Hoffman et al., where feature-based DA for semantic segmentation is done with fully convolutional networks (FCN) in a setting where ground truth is available for the source domain (synthetic data) but unavailable for the target domain (real data); they also used domain adversarial training;
    • Xu et al. (2019) used adversarial domain adaptation to transfer object detection models—single-shot multi-box detector (SSD) and multi-scale deep CNN (MSCNN)—from synthetic samples to real videos in the smoke detection problem;
    • Chen et al. (2017) construct the Cross City Adaptation model that brings together features from different domains, with semantic segmentation of outdoor scenes in mind; they adapt segmentation across different cities around the globe and show that their joint training approach with domain adaptation improves the results significantly;
    • and many more…

    The last paper I want to highlight here is by Hong et al. (2018) who provide one of the most direct and most promising applications of feature-level synthetic-to-real domain adaptation. In their Structural Adaptation Network, the conditional generator takes as input the features from a low-level layer of the feature extractor (i.e., features with fine-grained details) and random noise and produces transformed feature maps that should be similar to feature maps extracted from real images:

    To achieve this, the conditional generator produces a noise map and then adds it to high-level features. Hong et al. compared the Structural Adaptation Network with other state of the art approaches, including FCNs in the Wild and Cross-City Adaptation, with source domain datasets SYNTHIA and GTA and target domain dataset Cityscapes; they conclude that this adaptation significantly improves the results for semantic segmentation of urban scenes. Here is a sample of their results:

    Conclusion

    Feature-level domain adaptation provides interesting opportunities for synthetic-to-real adaptation. Many of these methods still represent work in progress, but the field is maturing rapidly, and in our experience, feature- and model-level DA is usually a simpler and more robust approach that is easier to get working, so we expect new exciting developments in this direction and recommend trying this family of methods for synthetic-to-real DA (unless actual refined images are required).

    With this, I am concluding this long series on different facets of using synthetic data in machine learning. Most importantly, synthetic data is a source of virtually limitless perfectly labeled data. It has been explored in many problems, but we believe that many more potential use cases still remain. Maybe we will get a chance to explore them together in 2022.

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Synthetic Data and the Metaverse

    Synthetic Data and the Metaverse

    Today, we are talking about the Metaverse, a bold vision for the next iteration of the Internet consisting of interconnected virtual spaces. The Metaverse is a buzzword that sounded entirely fantastical for a very long time. But lately, it looks like technology is catching up, and we may live to see the Metaverse in the near future. In this post, we discuss how modern artificial intelligence, especially computer vision, is enabling the Metaverse, and how synthetic data is enabling the relevant parts of computer vision.

    What is the Metaverse

    The Metaverse is far from a new idea. Anyone familiar with the cyberpunk genre will immediately recognize the concept of a virtual reality that characters of William Gibson’s Neuromancer (1984) inhabit. The term itself was coined in Neal Stephenson’s novel Snow Crash (1992), and this virtual reality-based Internet 2.0 has seen many fictionalized adaptations ever since, including The Matrix, Ready Player One, the recent Amazon series Upload, and many more.

    While the Metaverse has long been the subject of sci-fi, by now many visionaries believe that developments in VR, AR, and related fields may soon enable similar experiences in real life… I mean, in virtual life, but real virtual life… you know what I mean. One of the sources that got me thinking about the Metaverse recently was a long interview with Mark Zuckerberg. He talks about “the successor to the mobile internet… an embodied internet, where instead of just viewing content — you are in it… present with other people as if you were in other places”. It sounds like Facebook believes in the VR and AR technology and sees the clunkiness of current generation devices as the main obstacle: right now hardly anybody would want to do their jobs in a VR helmet. As soon as wearable technology becomes miniature and light enough, the Metaverse will be upon us.

    Mark Zuckerberg motivates this vision, in particular, with mobile workstations: “…you can walk into a Starbucks… and kind of wave your hands and you can have basically as many monitors as you want, all set up, whatever size you want them to be… and you can just bring that with you wherever you want.” Facebook calls this idea the “infinite office.” But in my opinion, it is almost inevitable that entertainment will be the main driving force behind the Metaverse: imagine that you don’t need large screens to have an immersive cinematic experience, imagine your friends on social networks (well, maybe one social network in particular) streaming their experiences through AR glasses, imagine immersive 3D games that enable real human-to-human personal interaction… Well, I’m sure you’ve heard pitches for the VR technology many times, but this time it sounds like it really has a chance of coming through and becoming the next big thing. Others are beginning to build their own vision for the Metaverse including Epic Games, Roblox, Unity, and more. 

    How the Metaverse is enabled by computer vision

    But we need more than just smaller VR helmets and AR glasses to build the Metaverse. This hardware has to be supported by software that makes the transition between the real and virtual worlds seamless—and this would be impossible without state of the art computer vision. Let me give just a few examples.

    First, the obvious: VR helmets and controllers need to be positioned in space very accurately, and this tracking is usually done with visual information from cameras, either installed separately in base stations or embedded into the helmet itself. This is the basic computer vision problem of simultaneous localization and mapping (SLAM). VR helmet technology has recently undergone an important shift: earlier models tended to require base stations (“outside-in” tracking), while the latest helmets can localize controllers accurately with embedded cameras (“inside-out” tracking), so you don’t need any special setup in the room (image source):

    This is a result of progress in computer vision; the cameras themselves have not improved that much.

    This problem becomes harder if we are talking about augmented reality: AR software also needs to understand its position in the world, but it needs a far more detailed and accurate 3D map of the environment in order to be able to augment it for the user. Check out our latest AI interview with Andrew Rabinovich, who was the Director of Deep Learning at Magic Leap, the startup that tried to do exactly this.

    Second, we have already talked many times about gaze estimation, i.e., finding out where a person is looking by the picture of their face and eyes. This is also a crucial problem for AR and VR. In particular, current VR relies upon foveated rendering, a technique where the image in the center of our field of view is rendered in high resolution and high detail, and it becomes progressively worse on the periphery; for an overview see, e.g., Patney et al. (2016). This is, by the way, exactly how we ourselves see things; we see only a very small portion of the field of view clearly and in full detail, and peripheral vision is increasingly blurry (illustration by Rooney et al., 2017):

    Foveated rendering is important for VR because VR has an order of magnitude larger field of view than flat screens, and requires a high resolution to support the illusion of immersive virtual reality, so rendering it all in this resolution would be far beyond consumer hardware.
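    For intuition, here is a tiny sketch of the core idea (a toy illustration, not actual VR rendering code): compute a per-pixel detail weight that is high near the estimated gaze point and falls off towards the periphery, and use it to blend a fully detailed render with a cheaper, coarser one:

    ```python
    import numpy as np

    def foveation_weights(height, width, gaze_xy, inner=0.1, outer=0.4):
        """Per-pixel blend weight: 1.0 (full detail) near the gaze point,
        decaying to 0.0 (coarse detail) towards the periphery.
        inner/outer radii are given as fractions of the image diagonal."""
        ys, xs = np.mgrid[0:height, 0:width]
        gx, gy = gaze_xy
        dist = np.hypot(xs - gx, ys - gy) / np.hypot(height, width)
        return np.clip((outer - dist) / (outer - inner), 0.0, 1.0)

    # w = foveation_weights(1080, 1200, gaze_xy=(600, 540))
    # frame = w[..., None] * sharp_render + (1 - w[..., None]) * blurred_render
    ```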

    Third, when you enter virtual reality, you need an avatar to represent you; current VR applications usually provide stock avatars or forgo them entirely (many VR games represent the player as just a head and a pair of hands), but an immersive virtual social experience would need photorealistic virtual avatars that represent real people and can capture their poses. Constructing such an avatar is a very hard computer vision problem, but people are making good progress on it. For instance, a recent work by Victor Lempitsky’s team introduced textured full-body avatars able to capture poses in real time from visual data streaming from several cameras:

    We are still not quite there, especially when it comes to faces and emotions, but we are getting better, and the Metaverse will definitely make use of this technology.

    These are only a few of the computer vision problems that arise along the way to the Metaverse; for a more, pardon the pun, immersive experience, just look at the list of talks at the recent IEEE VR Conference, where you will see all of these topics and much more.

    Synthetic data and the Metaverse

    Our long-time readers have no doubt already recognized where this blog post is going. Indeed, as we have discussed many times before (e.g., here or here), modern computer vision requires increasingly large datasets, and manual labeling simply stops working at some point. At Synthesis AI, we are proposing a solution to this problem in the form of synthetic data: artificially generated images and/or 3D scenes that can be used to train machine learning models.

    I chose the three examples above because they each illustrate different uses of synthetic data in machine learning. Let us go over them again.

    First, SLAM is an example where synthetic data can be used in a straightforward way: construct a 3D scene and use it to render training set images with pixel-perfect labels of any kind you would like, including segmentation, depth maps, and more. We have talked about simulated environments on this blog before, and SLAM is a practical problem where segmentation and depth estimation arise as important parts. Modern synthetic datasets provide a wide range of cameras and modalities; for example, here is an overview of a recently released dataset intended specifically for SLAM (Wang et al. 2020):

    Second, gaze estimation is an interesting problem where real data may be hard to come by, and synthetic data comes to the rescue. I have already used gaze estimation on this blog as a go-to example for domain adaptation, i.e., the process of modifying the training data and/or machine learning models so that the model can work on data from a different domain. Gaze estimation works with relatively small input images, so this was an early success for GANs for synthetic-to-real refinement, where synthetic images were made more realistic with specially trained generative models. Recent developments include a large real dataset, MagicEyes, that was created specifically for augmented reality applications (Wu et al., 2020); in fact, it was released by Magic Leap, and we discussed it with Andrew last time:

    Third, virtual avatars touch upon synthetic data from the opposite direction: now the question is about using machine learning to generate synthetic data. We talked about capturing the pose and/or emotions from a real human model, but there is actually a rising trend in machine learning models that are able to create realistic avatars from scratch. Instagram is experiencing a new phenomenon: virtual influencers, accounts that have a personality but do not have a human actually realizing this personality. Here is Lil Miquela, one of the most popular virtual influencers:

    From a research perspective, this requires state of the art generative models that are supplemented with synthetic data in the classical sense: you need to create a highly realistic 3D environment, place a high-quality human model inside, and then use a generative model (usually a style transfer model) to make the resulting image even more realistic. In this direction, there is still a long way to go before we can have fully photorealistic 3D avatars ready for the Metaverse, but the field is developing very rapidly, and this long way may be traversed in much less time than we would expect.

    The Metaverse is an ambitious vision straight out of science fiction, but it looks like it is becoming increasingly realistic. It is quite possible that you and I will live to see an actual Metaverse, be it a social-centric Facebook 2.0 envisioned by Mark Zuckerberg, the massively multiplayer OASIS out of Ready Player One, or, God forbid, the all-encompassing Matrix. But before we get there, there are still many research problems to be solved. Most of them lie in the field of computer vision, and this is exactly where synthetic data is especially effective for machine learning. Join us next time for another installment on synthetic data!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • AI Interviews: Andrew Rabinovich

    AI Interviews: Andrew Rabinovich

    Today, I am proud to present our guest for the second interview, Dr. Andrew Rabinovich. Currently, Andrew is the CTO and co-founder of Headroom Inc., a startup devoted to producing AI-based solutions for online business meetings (taking notes, detecting and attracting attention, summarization, and so on). Dr. Rabinovich has produced many important advances in the field of computer vision (here is his Google Scholar account), but he is probably best known for his work as the Director of Deep Learning at Magic Leap, an augmented reality startup that raised more than $3B in investments.

    Q1. Hello Andrew, and welcome! Let me begin with a general question that I will also expand upon later. You have a lot of experience in academia, with numerous papers published at top conferences and receiving hundreds of citations. At the same time, some of your top accomplishments are related to more “industrial” research work at startups such as Magic Leap.

    What kind of work has been more fulfilling for you? And what, in your view, are the main differences in the process and/or results? On the surface, research work in both industry and academia is supposed to produce novel solutions that work well for the problem at hand; are there important differences here?

    Hello Sergey, I am glad to be here and thank you for the invitation. What you guys, at Synthesis, are doing is extremely important for the computer vision field, and I am grateful that with these efforts the state of the art in Computer Vision, and AI in general, will improve for many years to come. 

    This is a very interesting question that dates back to my undergraduate days when I worked on medical image analysis and was interested in building image cytometers — automated microscopes with machine learning inference skills. While developing the cytometer, it quickly became apparent that the state of the art in computer vision (it was called image processing then) wasn’t quite up to par to solve the practical problems I was facing. This realization made me turn to more theoretical work and focus on developing core vision algorithms. A similar situation happened at Google, where I was really excited to work on algorithms for Google Goggles, the first AR app for Android and iPhone. Then existing, pre-deep learning approaches, weren’t satisfactory to develop product features we were interested in. Again, I turned to more academic research and was very fortunate to work on the development of modern deep networks, including the Inception architecture, which in turn we applied to visual search in Google Photos. You can probably guess where this is going, the same story repeated itself at Magic Leap. I quickly realized that to develop the vision of Mixed Reality, and to close the perceptual gap between real and virtual content, a lot of new fundamental research in computer vision and AI had to be done.

    Overall, academic and applied research aren’t really separable in my mind. Computer vision and machine learning are not fundamental science disciplines, they don’t describe nature. These are engineering challenges that need to be addressed in the context of practical problems. Industrial research provides that context. If the context is chosen correctly, then solutions to specific engineering challenges generalize to other tasks. 

    Q2. Our blog is devoted to synthetic data, so here is the most expected question. During your work in Headroom, Magic Leap, and other startups, have you used synthetic data to solve computer vision problems? In what ways, and how much did it help (if you’re allowed to divulge this kind of information, of course)? Did it help for the augmented reality applications at Magic Leap?

    I have been a proponent of synthetic data since my days at Google, where we heavily relied on data augmentation (synthetic data 0.1) to train deep models. At Magic Leap, we created a whole synthetic data group, with render farms and custom pipelines. At that time, synthetic data companies were quite rare, so we had to do most of it. The benefits of synthetic data ranged from hand and eye-tracking to 3D reconstruction and segmentation. At Headroom, we are collaborating with synthetic data providers across a number of problems. 

    Generally, there are really two fundamental issues with data for learning. First, obtaining data and labeling it can be quite expensive and laborious, whether it involves humans in the loop or not. Many companies today have established an efficient pipeline for ingesting data and providing annotations for it. The second problem, however, is far more critical. Relying on the human ability to annotate certain types of data is misleading. People can only provide relative and qualitative labels, such as drawing bounding boxes around objects or qualifying relative distances. If the task is much more specific, e.g., describing the illumination in the room or how far away the person is from the car (in centimeters), humans cannot answer these questions with the required precision, and in the absence of specific sensors, synthetic data is the only path forward.

    By construction, machine-generated data is auto labeled. The main drawback of synthetic data is that it may be sampled from a distribution that doesn’t represent the real world. Fortunately, that gap is quickly closing with realistic synthesis and domain adaptation approaches in AI.

    Q3. One of your latest papers, “DELTAS: Depth Estimation by Learning Triangulation and Densification of Sparse Points”, seems to be making a very interesting point beyond its immediate results. It reconstructs 3D meshes of scenes from RGB images with an end-to-end network, never producing an intermediate depth map, like most other methods do:

    This sounds very human-like to me: I can navigate complex 3D environments, and I have a pretty good grasp on relative depth (which object is closer than the other), but I definitely cannot produce an accurate depth map for my room. Moreover, this is in line with the general trend of deep learning that seems to me evident over at least the last decade: we have neural networks increasingly perform end-to-end training and learn to do various tasks directly, without predefined intermediate representations or side results. The tradeoff here is that usually end-to-end training for complex tasks requires far more data than more specialized training when you have, e.g., ground truth labeled depth maps.

    Do you agree that this trend exists and if yes, where do you think it will take us in the near future, especially in the field of computer vision? Are there other important problems that can be overcome with such end-to-end architectures, and do we have enough data to do that? To make the question more open-ended, what other trends in computer vision do you see that you expect to carry over for the next couple of years (I think in deep learning it doesn’t make sense to predict beyond a couple of years anyway)?

    End-to-end learning is a very attractive, almost romantic notion. The formulations are usually very elegant and simple. However, as you correctly point out, it requires a significantly larger amount of training data to account for all variations. That is why most problems aren’t solved end-to-end, as we aim to provide supervision along the way. With regards to 3D reconstruction, intermediate supervision with depth maps is problematic as well. Obtaining a large amount of depth data is not trivial. 

    As for the trends, I am not a big follower of them, as they are mostly set by the availability of datasets or funding. Over the last few years, I have focused on multi-task learning and believe that focus on this area of AI will lead to significant advances due to generalization during training and inductive bias during inference.  

    Looking forward, I believe developing AI approaches one modality at a time, when applied to the multimodal tasks that surround us, artificially complicates the problem. For example, the classical problem of video understanding is typically solved by isolating video from everything else. However, presence of text, available in the movie scripts or live transcription, and audio sources, make the problem much more tractable. Multimodal multitask learning is one of the areas in AI I am most excited about today.

    Q4. Interestingly, another recent paper of yours, “MagicEyes: A Large Scale Eye Gaze Estimation Dataset for Mixed Reality”, goes in precisely the opposite direction. It makes the case that for eye gaze estimation, better results can be achieved by thinking about the 3D properties of the eye (position of the cornea center and pupil center in 3D) and including them in a multi-task architecture:

    Eye gaze estimation is one of my favorite examples for synthetic data because it has everything: a “pure synthetic” solution based (literally!) on nearest neighbors, GANs for synthetic-to-real refinement that improve the results, new synthetic datasets such as NVGaze… For the readers, here is our recent post about gaze estimation. But it looks like I will have to update my usual story: MagicEyes, which you presented in this paper, is a large-scale dataset with human-labeled real data, and it allows for better results.

    Obviously, collecting this dataset took a lot of money and effort. This leads to two questions. Specifically, do you believe that synthetic data can still help improve eye gaze estimation further? The paper does not show experiments with training EyeNet on mixed real+synthetic datasets: do you think it would be worthwhile to try? And generally, in what other computer vision problems do you expect even larger manually labeled real datasets to appear in the near future, and how do you think it will affect applications of synthetic data in computer vision?

    Eye-tracking is a very interesting example of a computer vision problem. There are decades of research from human vision and neuroscience about the function and anatomy of how we see. MagicEyes datasets aim to collect a variable set of data from a broad population of subjects to capture this natural variability. The learned representations from this data form a foundation of the distribution that we want to learn for a number of different tasks, ranging from blink detection to 3D gaze estimation. If MagicEyes was infinitely large, we’d be done. Labeling this kind of data is possible, even though slow and expensive. By supplementing MagicEyes with synthetic data, we get an opportunity to significantly reduce time and cost, and to increase the training data set size and heterogeneity of seen examples. 

    As for other vision problems, manual datasets for autonomous navigation, satellite imagery, and human interactions are being collected and annotated at scale. Solving these tasks with additional synthetic data will be extremely useful. In fact, we are starting to see synthetic data expertise (specific companies pick and choose their domains of excellence) being compartmentalized to indoor and outdoor environments, and to human vs. man-made objects. 

    Q5. And now let me go back to the industry-vs-academia question, from a different point of view. While preparing the previous two questions, I opened your Google Scholar profile and sorted the publications chronologically. Naturally, you never stopped producing top-notch academic output, but it turned out that it’s far easier to look for your recent papers at your DBLP profile because your Google Scholar profile has recently been literally dominated by patent applications. You’ve had dozens of those in the last couple of years!

    Is that just a formal consequence of your work at MagicLeap and other startups or does it reflect a deeper position on how practical your work can soon become? Generally speaking, how ready do you think we (humanity) are for solving the basic high-level computer vision problems: 3D scene understanding, visual navigation in the real world, producing seamless augmented reality, and so on? Are we there yet, and if not quite, how long do you think it will take in each case?

    Writing patents is standard practice in industrial research. I was fortunate enough to complement patent filings with the corresponding peer-reviewed publications. As we discussed earlier, I do believe that academic research in computer vision and machine learning precedes its applications. The current AI spring, which started in 2012, has opened a number of industrial research avenues that build upon theoretical results and will lead to innovative products for the next decade. 

    With regards to solving complex vision and learning tasks, I think we are still quite a bit away. Machines have become excellent at pattern matching. There are a large number of practical applications that are coming online: from autonomous driving to augmented reality. The limiting factors here are not just the algorithms, however, but rather sensors and data. In augmented reality, for example, the AI components are available, but the computation power, batteries, and displays are not there to deliver a compelling product. 

    Q6. Apart from your research work in academia and industry, you are also helping LDV Capital, one of the top VC funds for AI-related startups, as their Expert in Residence. This may sound like a stock question, but it would be very interesting to hear your personal take on this: how do you evaluate startups that come for your review? What are you looking for the most, and what are the most common mistakes startups make, in your personal experience? Maybe you can share some advice specific for vision-related startups, since it is your personal area of expertise, and LDV Capital seems to have this as an important focus area as well.

    Traditional VC funding happens by following trends. A trend-setting VC firm invests in a particular sector, and the rest of the funds follow. A growing fear of missing out results in large amounts of capital being deployed. Once a new trend emerges, most VC firms happily switch context or diversify. When I look at start-up projects, whether my own or others, I always look for an end goal thesis, and decide if I agree with it. For example, a company X makes LiDAR sensors, LiDARs are a hot topic these days. To me, company X is interesting because I believe that without LiDAR, certain long-term goals aren’t possible to achieve, self-driving being one of them. If company X fits into the global scheme of things, it is meaningful and fundamental to market development; if it is a one-off (create filters for your Instagram account), not so much. 

    Then, there is the team. Regardless of prior focus, having pedigree, whether academic research, product development, or executive management, is a must. It is fairly simple to identify experts from dreamers. 

    Finally, there are many aspiring entrepreneurs who want to start companies for the sake of starting companies or because they have access to interesting technology. In that situation, product definition doesn’t come from a real need to improve an existing approach, but rather from an opportunistic perspective of “let’s invent a solution for a problem that doesn’t exist”. I think this is the curse of most tech startups.

    Thank you very much for your answers, Andrew! We will come back with the next interview soon—stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • AI Interviews: Serge Belongie

    AI Interviews: Serge Belongie

    Hi all! Today we begin a new series of posts here in the Synthesis AI blog. We will talk to the best researchers and practitioners in the field of machine learning, discussing different topics but, obviously, trying to circle back to our main focus of synthetic data every once in a while.

    Today we have our first guest, Professor Serge Belongie. He is a Professor of Computer Science at the University of Copenhagen (DIKU) and the Director of the Pioneer Centre for Artificial Intelligence. Previously he was the Andrew H. and Ann R. Tisch Professor at Cornell Tech and in the Computer Science Department at Cornell University, and an Associate Dean at Cornell Tech.

    Over his distinguished career, Prof. Belongie has been greatly successful in both academia and business. He co-founded several successful startups, including Digital Persona, Inc. that first brought a fingerprint identification device to the mass market and two computer vision startups, Anchovi Labs and Orpix. The MIT Technology Review included him on their list of Innovators under 35 for 2004, and in 2015, he was the recipient of the ICCV Helmholtz Prize. Google Scholar assigns to Prof. Belongie a spectacular h-index of 96, which includes dozens of papers that have become fundamental for computer vision and other fields, with hundreds of citations each. And, to be honest, I got most of this off Prof. Belongie’s Wikipedia page, which means that this is just barely scratching the surface of his achievements.

    Q1. Hello Professor, and welcome to our interview! Your list of achievements is so impressive that we definitely cannot do it justice in this format. But let’s try to add at least one little bit to this Wikipedia dump above. What is the one thing, maybe the one new idea that you are most proud of in your career? You know, the idea that makes you feel the warmest and fuzziest once you remember how you had it?

    Prof. Belongie: Thank you for inviting me! I’m excited about Synthesis AI’s vision, so I’m happy to help get out the word to the CV/ML community. 

    This is a timely question, since I recently started a “Throwback Thursday” series on my lab’s Twitter account. Each week over this past summer, my former students and I had a fun time looking back on the journey behind our publications since I became a professor a couple decades ago. The ideas for which I feel most proud rarely have appeared in highly cited papers. One example is the grid based comparisons in our 2015 paper “Cost-Effective HITs for Relative Similarity Comparisons.” As my students from that time will recall, I was captivated by the idea of triplet based comparisons for measuring perceptual similarity (“is a more similar to b than to c?”), but the cubic complexity of such approaches limited their practical adoption. Then it occurred to us that humans have excellent parallel visual processing abilities, which means we could fill a screen with 4×4 or 5×5 grids of images, and through some simple UI trickery, we could harvest large batches of triplet constraints in one shot, using a HIT (human intelligence task) that was both less expensive to run and more entertaining to complete for the participants. While this approach and the related SNaCK approach we published the following year have not gotten much traction in the literature, I’m convinced that this concept will eventually get its day in the sun.

    Q2. Now for the obligatory question: what is your view on the importance of synthetic data for modern computer vision? Here at Synthesis AI, we believe that synthetic data can become one of the solutions to the data problem; do you agree? What other solutions do you see and how, in your opinion, does synthetic data fit into the landscape of computer vision of the future?

    Prof. Belongie: I am in complete agreement with this view. When pilots learn to fly, they must log thousands of hours of flight time in simulated and real flight environments. That is an industry that, over several decades, has found the right balance of real vs. synthetic for the best instructional outcome. Our field is now confronting an analogous problem, with the key difference that the student is a machine. With that difference in mind, we will again need to find the right balance. As my PhD advisor [Jitendra Malik] used to tell us in the late 90s, nature has a way of detecting a hack, so we must be careful about overstating what’s possible with purely synthetic environments. But when you think about the cartesian product of all the environmental factors that can influence, say, the appearance of city streets in the context of autonomous driving, it seems foolish not to build upon our troves of real data with clever synthesis and augmentation approaches to give our machines a gigantic head start before tackling the real thing.

    Q3. Among all your influential papers with hundreds of citations, the one that looks to me most directly relevant to synthetic data is the paper where Xun Huang and yourself introduced adaptive instance normalization (AdaIN), a very simple style transfer approach that still works wonders. We recently talked about AdaIN on this blog, and in our experiments we have never seen a more complex synthetic-to-real refinement pipeline, even based on your own later work, MUNIT, outperform the basic AdaIN. What has worked best for synthetic-to-real style transfer for you? Do you maybe have more style transfer techniques in store for us, to appear in the near future?

    Prof. Belongie: Good ol’ AdaIN indeed works surprisingly well in a wide variety of cases. The situation gets more nuanced, however, in fine grained settings such as the iNat challenges or NeWT downstream tasks. In these cases, even well intentioned style transfer methods can trample over the subtle differences that distinguish tightly related species; as the saying goes, “one person’s signal is another person’s noise.” In this context, we’ve been reflecting on the emerging practice of augmentation engineering. Ever since deep learning burst onto the scene around 2011, it hasn’t been socially acceptable to fiddle with feature design manually, but no one complains if you fiddle with augmentation functions. The latter can be thought of as a roundabout way to scratch the same itch. It’s likely that in fine grained domains, e.g., plant pathology, we’ll need to return to the old – and in my opinion, good – practices of working closely with domain experts to cultivate domain-appropriate geometric and photometric transformations.

    In terms of what’s coming next in style transfer, I’m excited about our recent work in the optical see-through (OST) augmented reality setting. In conventional style transfer, you have total control over the values of every pixel. In the OST setting, however, you can only add light; you can’t subtract it. So what can be done about this? We tackle this question in our recent Stay Positive work, focusing on the nonnegative image synthesis problem, and leveraging quirks of the human visual system’s processing of brightness and contrast.

    Q4. Continuing from the last question, one of the latest papers to come out of your group is titled “Single Image Texture Translation for Data Augmentation”. In it, you propose a new data augmentation technique that translates textures between objects from single images (as a brief reminder for the readers, we have talked about what data augmentation is previously on this blog). The paper also includes a nice graphical overview of modern data augmentation methods that I can’t but quote here:

    Looking at this picture makes me excited. What is your opinion on the limits of data augmentation? Combined with neural style transfer and all other techniques shown here, how far do you think this can take us? How do you see these techniques potentially complementing synthetic data approaches (in the sense of making 3D models and rendering images), and are there, in your opinion, unique advantages of synthetic data that augmentation of real data cannot provide?

    Prof. Belongie: When it comes to generic, coarse-grained settings, I would say the sky’s the limit in terms of what data augmentation can accomplish. Here I’m referring to supplying modern machine learning pipelines with sufficiently realistic augmentations, such as adding rain to a street or stubble to a face. The bar is, of course, somewhat higher if the goal is to cross the uncanny valley for human observers. And as I hinted earlier, fine grained visual categorization (FGVC) also presents some tough challenges for the data augmentation movement. FGVC problems are characterized by the need for specialized domain knowledge, the kind that is possessed by very few human experts. In that sense, knowing how to tackle the data augmentation problem for FGVC is tantamount to bottling that knowledge in the form of a family of image manipulations. That strikes me as a daunting task.

    Q5. A slightly personal question here. Your group at UCSD used to be called SO(3) in honor of the group of three-dimensional rotations, and your group at Cornell now is called SE(3), after the special Euclidean group in three dimensions. This brings back memories of how I used to work in algebra a little bit back when I was an undergrad. I realize the group’s title probably doesn’t mean much but still: do you see a way for modern algebra and/or geometry to influence machine learning? What is your opinion of current efforts in geometric deep learning: would you advise current math undergrads to go there?

    Prof. Belongie: Geometric deep learning provides an interesting framework for incorporating prior knowledge into traditional deep learning settings. Personally, I find it exciting because a new generation of students is talking about topics like graph Laplacians again. I don’t know if I’d point industry-focused ML engineers at geometric deep learning, but I do think it’s a rich landscape for research-oriented undergrads to explore, with an inspiring synthesis of old and new ideas.

    Q6. And, if you don’t mind, let us finish with another personal question. Turns out SO3 is not just your computer vision research group’s title but also your band name! I learned about it from this profile article about you that lists quite a few cool things you’ve done, including a teaching gig in Brazil “inspired by Richard Feynman”.

    So I guess it’s safe to say that Richard Feynman has been one of your heroes. Who else has been an influence? How did you turn to computer science? And are there maybe some other biographies or popular books that you can recommend for our readers who are choosing their path right now?

    Prof. Belongie: Ah, I see you’ve done your research! The primary influences in my career have been my undergrad and grad school advisors, Pietro Perona and Jitendra Malik, who are both towering figures in the field. From them I gained a deep appreciation of ideas outside of computer science and engineering, including human vision, experimental psychology, art history, and neuroscience. I find myself quoting, paraphrasing, or channeling them on a regular basis when meeting with my students. In terms of turning to computer science, that was a matter of practicality. I started out in electrical engineering, focusing on digital signal processing, and as my interests coalesced around image recognition, I naturally gravitated to where the action was circa the late 90s, i.e., computer science.

    As far as what I’d recommend now, that’s a tough question. My usual diet is based on the firehose of arXiv preprints that match my group’s keywords du jour. But this can be draining and even demoralizing, since you’ll start to feel like it’s all been done. So if you want something to inspire you, read an old paper by Don Geman, like this one about searching for mental pictures. Or better yet, after you’re done with your week’s quota of @ak92501-recommended papers, go for a long drive or walk and listen to a Rick Beato “What Makes this Song Great” playlist. It doesn’t matter if you know music theory, or if some of the genres he covers aren’t your thing. His passion for music – diving into it, explaining it, making the complex simple – is infectious, and he will inspire you to do great things in whatever domain you’ve chosen as your focus. 

    Dear Professor, thank you very much for your answers! And thank you, the reader, for your attention! Next time, we will return with an interview with another important figure in machine learning. Stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Synthetic Data for Safe Driving

    Synthetic Data for Safe Driving

    The role of synthetic data in developing solutions for autonomous driving is hard to overstate. In a recent post, I already touched upon virtual outdoor environments for training autonomous driving agents, and this is a huge topic that we will no doubt return to later. But today, I want to talk about a much more specialized topic in the same field: driver safety monitoring. It turns out that synthetic data can help here as well—and today we will understand how. This is a companion post for our recent press release.

    What Is Driver Safety Monitoring and Why Manufacturers Are Forced to Care

    Car-related accidents remain a major source of fatalities and trauma all around the world. The United States, for instance, has about 35,000 motor vehicle fatalities and over 2 million injuries per year, which may pale in comparison to the COVID pandemic or cancer but still sounds like a lot of unnecessary suffering.

    In fact, significant progress has already been achieved in reducing these deaths and injuries. Here are the statistics of road traffic fatalities in Germany over the last few years:

    And here is the same plot for France (they both stop at 2019 because it would be really unfair to make road traffic comparisons in the times of overwhelming lockdowns):

    Obviously, the European Union is doing something right in its regulation of road traffic. A large part of it is the new safety measures that are gradually being made mandatory in the EU. And the immediate occasion for this post is the new regulations regarding driver safety monitoring.

    Starting from 2022, it will be mandatory for the European Union car manufacturers to install the following safety features: “warning of driver drowsiness and distraction (e.g. smartphone use while driving), intelligent speed assistance, reversing safety with camera or sensors, […] lane-keeping assistance, advanced emergency braking, and crash-test improved safety belts”. With these regulations, the European Commission plans to “save over 25,000 lives and avoid at least 140,000 serious injuries by 2038”.

    On paper, this sounds marvelous: why not have a system that wakes you up if you start falling asleep behind the wheel and helps you stay in your lane when you’re distracted? But how can systems like this work? And where is the place of synthetic data in all of this? Let’s find out.

    Driver Drowsiness Detection with Deep Learning

    We cannot cover everything, so let’s dive into the details of one specific aspect of safety monitoring: drowsiness detection. It features prominently both in the new regulations and in actual accident statistics: falling asleep at the wheel is very common. You don’t even have to be completely asleep: 5-10 seconds of what is called a microsleep episode will be more than enough for an accident to occur. So how can a smart car notice that you are about to fall asleep and warn you in time?

    The gold standard of recognizing brain states such as sleep is, of course, electroencephalography (EEG), that is, measuring the electrical activity of the brain. Recent research has applied deep learning to EEG data, and it appears that even relatively simple solutions based on convolutional and recurrent networks are enough to recognize sleep and drowsiness with high accuracy. For instance, a recent work by Zurich researchers Malafeev et al. (2020) shows excellent results in the detection of microsleep episodes with a simple architecture like this:
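    As a rough illustration of this kind of model (not Malafeev et al.’s exact architecture; the channel counts, window length, and layer sizes below are assumptions), here is a minimal PyTorch sketch of a convolutional-recurrent classifier for short EEG windows:

    ```python
    # Minimal sketch of a conv+recurrent EEG classifier (illustrative only).
    import torch
    import torch.nn as nn

    class EEGDrowsinessNet(nn.Module):
        def __init__(self, n_channels=2, n_classes=2):
            super().__init__()
            # 1D convolutions extract local waveform features from raw EEG.
            self.conv = nn.Sequential(
                nn.Conv1d(n_channels, 32, kernel_size=7, padding=3),
                nn.ReLU(),
                nn.MaxPool1d(4),
                nn.Conv1d(32, 64, kernel_size=7, padding=3),
                nn.ReLU(),
                nn.MaxPool1d(4),
            )
            # The LSTM aggregates features over time within the window.
            self.lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
            self.head = nn.Linear(64, n_classes)

        def forward(self, x):               # x: (batch, channels, time)
            feats = self.conv(x)            # (batch, 64, time // 16)
            feats = feats.transpose(1, 2)   # (batch, time // 16, 64)
            _, (h, _) = self.lstm(feats)
            return self.head(h[-1])         # logits: awake vs. microsleep

    # Example: a batch of 4-second windows of 2-channel EEG sampled at 256 Hz.
    logits = EEGDrowsinessNet()(torch.randn(8, 2, 4 * 256))
    ```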

    But short of requiring all drivers to wear a headpiece with EEG electrodes, this kind of data will not be available in a real car. EEG is commonly used to collect and label real datasets in this field, but we need some other signal for actual drowsiness detection at the wheel.

    Two signals are important here. First, steering patterns: a simple sensor can track the steering angle and velocity, and a system can then recognize troubling patterns in the driver’s steering. For example, if a driver barely steers at all for some time and then brings the car back on track with a quick jerking motion, that is probably a sign that the driver is getting sleepy or distracted. Leading manufacturers such as Volvo and Bosch are already offering solutions based on steering patterns.
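    To make the steering-pattern idea concrete, here is a toy heuristic, not any manufacturer’s actual algorithm, that flags a long “quiet” stretch of steering followed by a sudden correction; the window size and thresholds are purely illustrative:

    ```python
    import numpy as np

    def steering_drowsiness_flag(angles, window=200, quiet_std=0.5, jerk_deg=5.0):
        """Toy heuristic: flag a quiet steering period followed by a sudden jerk.

        angles: steering wheel angles in degrees, sampled at a fixed rate.
        All thresholds here are illustrative, not calibrated values.
        """
        angles = np.asarray(angles, dtype=float)
        for start in range(len(angles) - window - 1):
            segment = angles[start:start + window]
            next_change = abs(angles[start + window] - segment[-1])
            # Barely any steering for a while, then an abrupt correction.
            if segment.std() < quiet_std and next_change > jerk_deg:
                return True
        return False
    ```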

    Steering patterns, however, are just one possible signal, and quite an indirect one. Moreover, once another component of the very same EU regulations, automatic lane-keeping assistance, is in place, steering becomes largely automated, and these patterns stop working. A much more direct idea is to use computer vision to detect the signs of drowsiness on the driver’s face.

    When Volvo introduced their steering-based system in 2007, their representative said: “We often get questions about why we have chosen this concept instead of monitoring the driver’s eyes. The answer is that we don’t think that the technology of monitoring the driver’s eyes is mature enough yet.” By 2021, computer vision has progressed a lot, and recent works on the subject show excellent results.

    The most telling sign would be, of course, detecting that the driver’s eyes are closing. There is an entire field of study devoted to detecting closed eyes and blinking (blinks get longer and more frequent when you’re drowsy). In 2014, Song et al. presented the now-standard Closed Eyes in the Wild (CEW) dataset, modeled after the classical Labeled Faces in the Wild (LFW) dataset but with eyes closed; here is a sample of CEW (top row) and LFW (bottom row):

    Since then, eye closedness and blink detection has steadily improved, usually with various convolutional pipelines, and by now it is definitely ready to become an important component of car safety systems.
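    As a small illustration of how per-frame eye-state predictions can be turned into a drowsiness signal, here is a sketch of a PERCLOS-style score, i.e., the fraction of recent frames in which the eyes are judged closed; the per-frame classifier and all thresholds are assumed, not taken from any specific paper:

    ```python
    import numpy as np

    def perclos(closed_probs, threshold=0.5, fps=30, window_sec=60):
        """PERCLOS-style score: fraction of frames with eyes judged closed
        over a sliding window. closed_probs come from any per-frame
        eye-state classifier; the thresholds here are illustrative."""
        closed = (np.asarray(closed_probs) > threshold).astype(float)
        window = int(fps * window_sec)
        if len(closed) < window:
            return float(closed.mean())
        # Rolling mean over the last `window` frames; report the latest value.
        kernel = np.ones(window) / window
        return float(np.convolve(closed, kernel, mode="valid")[-1])

    # A score above some application-dependent threshold (say, 0.15 over a
    # minute) could then trigger a drowsiness warning.
    ```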

    We don’t have to restrict ourselves only to the eyes, of course. The entire facial expression can provide important clues (did you yawn while reading this?). For example, Shen et al. (2020) recently proposed a multi-feature pipeline that has separate convolutional processing streams for the driver’s head, eyes, and mouth:
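    Below is a schematic PyTorch sketch of the general multi-stream idea (separate convolutional streams for head, eye, and mouth crops, fused by a small classifier), not Shen et al.’s exact architecture; the backbone and crop sizes are assumptions:

    ```python
    import torch
    import torch.nn as nn

    def small_cnn(out_dim=128):
        # A tiny backbone template; each stream gets its own copy.
        return nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, out_dim), nn.ReLU(),
        )

    class MultiStreamDrowsinessNet(nn.Module):
        """Separate convolutional streams for head, eye, and mouth crops,
        fused with a small MLP classifier (illustrative sketch only)."""
        def __init__(self, n_classes=2):
            super().__init__()
            self.head_stream = small_cnn()
            self.eye_stream = small_cnn()
            self.mouth_stream = small_cnn()
            self.classifier = nn.Sequential(
                nn.Linear(3 * 128, 64), nn.ReLU(), nn.Linear(64, n_classes))

        def forward(self, head, eyes, mouth):
            fused = torch.cat([self.head_stream(head),
                               self.eye_stream(eyes),
                               self.mouth_stream(mouth)], dim=1)
            return self.classifier(fused)

    # Example: all three crops resized to 64x64.
    net = MultiStreamDrowsinessNet()
    out = net(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64),
              torch.randn(4, 3, 64, 64))
    ```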

    Another important recent work comes from Affectiva, a company we have recently collaborated with on eye gaze estimation. Joshi et al. (2020) classify drowsiness based on facial expressions captured in a 10-second video, over which the driver may progress between different states of drowsiness. Their pipeline is based on features extracted by their own SDK for recognizing facial expressions:

    None of these systems is perfect, of course, but it is clear by now that computer vision can provide important cues for detecting and evaluating the driver’s state and for triggering warnings that help avoid road traffic accidents and ultimately save lives. So where does synthetic data come into this picture?

    Synthetic Data for Drowsiness Detection

    On this blog, we have discussed many times (e.g., recently and very recently) the conditions under which synthetic data especially shines in computer vision. These include situations where existing real datasets are biased, environmental features that are not covered in real data (different cameras, lighting conditions, etc.), and generally situations that call for extensive variability and randomization, which is much easier to achieve with synthetic data than with real datasets.

    Guess what: driver safety is definitely one of those situations! First, cameras that can be installed in real cars shoot from positions that are far from standard for usual datasets. Here are some frames from a sample video that Joshi et al. processed in the paper we referenced above:

    Compare this with, say, the standard frontal photographs characteristic of Labeled Faces in the Wild that we also showed above: obviously, some domain transfer is needed between these two settings, while a synthetic 3D model of a head can be rendered from any angle.

    Second, where will real data come from? We could collect real datasets and label them semi-automatically with the help of EEG monitoring, but that would be far from perfect for computer vision model training because real drivers will not be wearing an EEG device. Also, real datasets of this kind will inevitably be very small: it is obviously very difficult and expensive to collect even thousands of samples of people falling asleep at the wheel, let alone millions.

    Third, you are most likely to fall asleep when you’re driving at night, and night driving means your face is probably illuminated very poorly. You can use NIR (near-infrared) or ToF NIR (time-of-flight near-infrared) cameras to “see in the dark”. But pupils (well, retinas) act differently in the NIR modality, and this effect can differ across ethnicities. Such camera modalities and challenging lighting conditions are, again, relatively easy to achieve in synthetic datasets but hard to find in real ones. For example, available NIR datasets such as NVGaze or the MRL Eye Dataset were collected for AR/VR applications, not from an in-car camera perspective.

    That is why we at Synthesis AI are moving into this field (see our recent press release), and we hope to make important contributions that will make road traffic safer for all of us. We are already collaborating with automobile and autonomous vehicle manufacturers and Tier 1 suppliers in this market.

    To make this work, we will need additional effort to model car interiors, the cameras used by car manufacturers, and other environmental features, but the heart of this project remains the FaceAPI that we have already developed. This easy-to-use API can produce millions of unique 3D models with different combinations of identities, clothing, accessories, and, importantly for this project, facial expressions. FaceAPI can already produce a wide variety of expressions, including, of course, closed eyes and drowsy faces, but we plan to further expand this feature set.

    Here is an example of our automatically generated synthetic data from an in-car perspective, complete with depth and normal maps:

    Synthetic Data for Driver Attention

    But you don’t have to literally fall asleep to cause a traffic accident. Unfortunately, it often suffices to get momentarily distracted, look at your phone, take your hands off the wheel for a second to adjust your coffee cup… all with the same, sometimes tragic, consequences. Thus, another, no less important application of computer vision for driver safety is monitoring driver attention and possible distractions. This becomes all the more important as driverless cars become increasingly common, and autopilots take up more and more of the total time at the wheel: it is much easier to get distracted when you are not actually driving the car.

    First, there is the monitoring of large-scale motions such as taking your hands off the wheel. This falls into the classical field of scene understanding (see, e.g., Xiao et al. (2018)): "are the driver's hands on the wheel?" is a typical scene understanding question that goes beyond simply detecting the hands and the wheel. Answering such questions, however, usually relies on classical computer vision problems such as instance segmentation.
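    As a toy illustration of how such a scene understanding question can sit on top of an ordinary object detector, here is a sketch that checks whether any detected hand box sufficiently overlaps the wheel box; the detector itself and the overlap threshold are assumptions:

    ```python
    def hands_on_wheel(hand_boxes, wheel_box, min_overlap=0.3):
        """Toy check: does at least one detected hand box overlap the wheel box
        by a given fraction of the hand's area? Boxes are (x1, y1, x2, y2)."""
        def overlap_frac(a, b):
            ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
            ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
            area_a = (a[2] - a[0]) * (a[3] - a[1])
            return inter / area_a if area_a > 0 else 0.0
        return any(overlap_frac(hand, wheel_box) >= min_overlap
                   for hand in hand_boxes)
    ```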

    Second, it is no less important to track such small-scale motions as eye gaze. Eye gaze estimation is an important computer vision problem that has its own applications but is also obviously useful for driver safety. We have already discussed applications of synthetic data to eye gaze estimation on this blog, with a special focus on domain adaptation.

    Obviously, all of these problems belong to the field of computer vision, and all standard arguments for the use of synthetic data apply here as well. Thus, we expect that synthetic data produced by our engines will be extremely useful for driver attention monitoring. In the next example, also produced by FaceAPI, we compare a regular RGB image and the corresponding near-infrared image for two drivers who may be distracted. Note that eye gaze is clearly visible in our synthetic pictures, as are larger features:

    There’s even more that can be varied parametrically. Here are some examples with head turns, yawning, eye closure, and accessories like face masks and glasses.

    All in all, we strongly believe that high-quality synthetic data for computer vision systems can help car manufacturers advance their safety systems and reduce road traffic accidents not only in the European Union but all over the world. Here at Synthesis AI, we are devoted to removing the obstacles to further advances in machine learning, especially for such a great cause!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Synthetic Data-Centric AI

    Synthetic Data-Centric AI

    In a recent series of talks and related articles, Andrew Ng, one of the most prominent AI researchers, pointed to the elephant in the room of artificial intelligence: data. It is a common saying in AI that “machine learning is 80% data and 20% models”, but in practice, the vast majority of effort from both researchers and practitioners goes into the model part rather than the data part of AI/ML. In this article, we consider this 80/20 split in slightly more detail and discuss one possible way to advance data-centric AI research.

    The life cycle of a machine learning project

    The basic life cycle of a machine learning project for some supervised learning problems (for instance, image segmentation) looks like this:

    First, one has to collect the data, then it has to be labeled according to the problem at hand, then a model is trained on the resulting dataset, and finally the best models have to be fitted onto the edge devices where they will be deployed. In my personal opinion, these four parts are about equally important in most real-life projects; but if you look at the research papers from any top AI conference, you will see that most of them are about the “Training” phase, with a little bit of “Deployment” (model distillation and similar techniques that make models fit into restricted hardware) and an even smaller part devoted to the “Data” and “Annotation” phases (mostly data augmentation).

    This is not due to simple narrow-mindedness: everybody understands that data is key for any AI/ML project. But the model is usually the sexy part of research, where new ideas flourish and intermingle, while data is the “necessary but boring” part. This is a shame because, as Andrew Ng demonstrated in his talks, improvements on the data side are often much lower-hanging fruit than improvements to state-of-the-art AI models.

    Data labeling and data cascades: the real elephants in the room

    On the other hand, collecting and especially annotating the data is increasingly becoming a problem, if not a hard constraint on AI research and development. The required labeling is often very labor-intensive. Suppose that you want to teach a model to count the cows grazing in a field, a natural and potentially lucrative idea for applying deep learning in agriculture. The basic computer vision problem here is either object detection, i.e., drawing bounding boxes around cows, or instance segmentation, i.e., distinguishing the silhouettes of cows. To train the model, you need a lot of photos with labeling such as this one:

    Imagine how much work it would take to label tens of thousands of such photographs! Naturally, in a real project you would use an existing (weaker) model and apply manual labor only to correct its mistakes, but that still might take thousands of person-hours.
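    To get a feeling for how much information a single labeled instance carries, here is a hypothetical annotation for one cow in one photo, written in the widely used COCO style (all numbers are made up); a real dataset needs tens of thousands of these, drawn or at least corrected by hand:

    ```python
    # Hypothetical single-instance annotation in COCO style (values made up).
    cow_annotation = {
        "id": 1,
        "image_id": 42,
        "category_id": 1,                        # "cow"
        "bbox": [312.0, 145.0, 210.0, 160.0],    # x, y, width, height in pixels
        "segmentation": [[330.0, 150.0, 510.0, 160.0, 505.0, 300.0,
                          420.0, 305.0, 325.0, 290.0]],  # polygon outline
        "area": 27350.0,
        "iscrowd": 0,
    }
    ```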

    Another important problem is dataset bias. Even in applications where real labeled data abounds, existing datasets often do not cover the cases relevant for new applications. Take face recognition, for instance; there exist datasets with millions of labeled faces. But, first, many such datasets suffer from the racial and ethnic biases that often plague major datasets. And second, there are plenty of use cases in slightly modified conditions: for example, a face recognition system might need to recognize users from any angle, but existing datasets are heavily skewed towards frontal and profile photos.

    These and other problems have been recently combined under the label of data cascades, as introduced in this Google AI post. Data cascades include dataset bias, real world noise that is absent in clean training sets, model drifts where the targets change over time, and many other problems, up to poor dataset documentation.

    There exist several possible solutions to basic data-related problems, all increasingly explored in modern AI:

    • few-shot, one-shot, and even zero-shot learning try to reduce data requirements by pretraining models and then fine-tuning them to new problems with very small datasets; this is a great solution when it works, but success stories are still relatively limited;
    • semi-supervised and weakly supervised learning make use of unlabeled data that is often plentiful (e.g., it is usually far cheaper to obtain unlabeled images of the objects in question than to label them).

    But these solutions are far from universal: if existing data (used for pretraining) has no or very few examples of the objects and relations we are looking for, these approaches will not be able to “invent” them. Fortunately, there is another approach that can do just that.

    Synthetic data: a possible solution

    I am talking about synthetic data: artificially created and labeled data used to train AI models. In computer vision, this means that dataset developers create a 3D environment with models of the objects that need to be recognized and their surroundings. In a synthetic environment, you know and control the precise position of every object, which gives you pixel-perfect labeling for free. Moreover, you have total control over many knobs and handles that can be adapted to your specific use case (see the schematic sketch after the list below):

    • environments: backgrounds and locations for the objects;
    • lighting parameters: you can set your own light sources;
    • camera parameters: camera type (if you need to recognize images from an infrared camera, standard datasets are unlikely to help), placement etc.;
    • highly variable objects: with real data, you are limited to what you have, and with synthetic data you can mix and match everything you have created in limitless combinations.
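
    Here is a schematic illustration of what such a parametric configuration could look like; the parameter names and ranges below are hypothetical and do not correspond to any particular engine’s API:

    ```python
    from dataclasses import dataclass, field
    import random

    @dataclass
    class SceneConfig:
        """Hypothetical set of 'knobs and handles' for one synthetic scene."""
        environment: str = "office"           # background / location
        light_intensity: float = 1.0          # arbitrary units
        light_temperature_k: int = 5500       # color temperature of light sources
        camera_type: str = "rgb"              # e.g. "rgb", "nir", "depth"
        camera_yaw_deg: float = 0.0           # camera placement around the subject
        accessories: list = field(default_factory=list)  # glasses, masks, ...

    def random_scene():
        # Domain randomization in one line per knob: sample every parameter.
        return SceneConfig(
            environment=random.choice(["office", "car_interior", "street"]),
            light_intensity=random.uniform(0.2, 2.0),
            light_temperature_k=random.randint(3000, 7000),
            camera_type=random.choice(["rgb", "nir"]),
            camera_yaw_deg=random.uniform(-60, 60),
            accessories=random.sample(["glasses", "mask", "hat"],
                                      k=random.randint(0, 2)),
        )
    ```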

    For instance, synthetic human faces can have any facial features, ethnicities, ages, hairstyles, accessories, emotions, and much more. Here are a few examples from an existing synthetic dataset of faces:

    Synthetic data presents its own problems, the most important being the domain shift problem that arises because synthetic data is, well, not real. You need to train a model on one domain (synthetic data) and apply it on a different domain (real data), which leads to a whole field of AI called domain adaptation.

    In my opinion, the free labeling, high variability, and sheer boundless quantity of synthetic data (as soon as you have the models, you can generate any number of labeled images at the low cost of rendering) far outweigh this drawback. Recent research is already showing that even very straightforward applications of synthetic data can bring significant improvements in real-world problems.

    Automatic generation and closing the feedback loop

    But wait, there is more. The “dataset” we referred to above is more than just a dataset: it is an entire API (FaceAPI, to be precise) that allows a user to set all of these knobs and handles, generating new synthetic data samples at scale and in a fully automated fashion, with parameters specified in API calls.

    This opens up new, even more exciting possibilities. When synthetic data generation becomes fully automated, it means that producing synthetic data is now a parametric process, and the values of parameters may influence the final quality of AI models trained on this synthetic data… you see where this is going, right? 

    Yes, we can treat data generation as part of the entire machine learning pipeline, closing the feedback loop between data generation and testing the final model on real test sets. Naturally, it is hard to expect gradients to flow through the process of rendering 3D scenes (although recent research may suggest otherwise), so learning the synthetic data generation parameters can be done, e.g., with reinforcement learning, which has methods specifically designed for such settings. This is an early approach taken by VADRA (Visual Adversarial Domain Randomization and Augmentation):

    A related but different approach is to design more direct loss functions, either by collecting data on model performance and learning from it or by finding other objectives to optimize. Here, one important example is the Meta-Sim model, which learns the parameters of scene graphs, a natural representation of 3D scene structure, in order to minimize the distribution gap between synthetic and real scenes together with downstream performance.
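    To make the feedback loop concrete, here is a schematic sketch in which plain black-box search over generation parameters stands in for the reinforcement learning and scene-graph learning used in VADRA and Meta-Sim; all four callables (sample_params, generate_dataset, train_model, evaluate_on_real) are placeholders to be supplied by the user:

    ```python
    def optimize_generation_params(sample_params, generate_dataset,
                                   train_model, evaluate_on_real, n_trials=20):
        """Schematic feedback loop: sample generation parameters, render a
        synthetic dataset, train a model, score it on a real validation set,
        and keep the best parameters found so far."""
        best_params, best_score = None, float("-inf")
        for _ in range(n_trials):
            params = sample_params()               # e.g. random scene configs
            synthetic_data = generate_dataset(params)
            model = train_model(synthetic_data)
            score = evaluate_on_real(model)        # e.g. accuracy on real data
            if score > best_score:
                best_params, best_score = params, score
        return best_params, best_score
    ```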

    These ideas are increasingly applied in studies of synthetic data, and I believe that adaptive generation will bring synthetic data to a new level of usefulness for AI/ML in the near future. I hope that the progress of modern AI will not stall on the current data problem, and synthetic data, especially automatically generated data with a closed feedback loop, is one of the key tools to overcome it.

    Sergey Nikolenko
    Head of AI, Synthesis AI