Category: Synthesis AI

  • Generative AI VII: The AI Spring of 2023 

    Generative AI VII: The AI Spring of 2023 

    Last time, we finished all intended mathematical content, so it is time for us to wrap up the generative AI series. We will do it over two installments. Today, we discuss and summarize the (lots of) news that has been happening in the AI space over the last half-year. It all conveniently falls into the generative AI space, with expanding capabilities leading to both extreme excitement and serious security concerns. So how are current AI models different from older ones, and when are we actually going to have AGI? It all started with GPT-3.5…

    Large Language Models: I Heard You Like Hype Waves

    Artificial intelligence has a history of ups and downs. The initial wave of excitement after the 1956 Dartmouth seminar ended with the “AI winter” that spanned the 1970s and early 1980s. Then people realized that they could train deep neural networks, and hopes were again high, but it again turned out to be a false start mostly due to insufficient computing power (image source):

    Finally, in the mid-2000s deep learning started to work in earnest, and we have been living on another hype wave of artificial intelligence ever since. The first transformative real world application was in speech recognition and processing (voice assistants were made possible by early deep neural networks), then AlexNet revolutionized image processing, then deep neural networks came into natural language processing, and so on, and so forth.

    But you can have hype waves inside hype waves. And this is exactly what has been happening with large language models over the last year or so, especially last spring. By now, researchers are seriously considering the possibility that we can reach AGI (artificial general intelligence, usually taken to mean human-level or stronger) with our current basic approach, maybe just by scaling it up and thinking of a few more nice tricks for training it.

    How did that happen? Let’s first understand what we are talking about.

    A language model is a machine learning model that predicts the next token in a sequence of language tokens; it’s easier to think of tokens as words, although in reality models usually break words down into smaller chunks. The machine learning problem here is basically classification: what’s the next token going to be?

    By continuously predicting the next token, a language model can write text, and the better the language model, the more coherent the resulting text can be:

    Note that that’s the only thing a language model can do: predict the next token, over and over.
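    To make “predict the next token, over and over” concrete, here is a minimal sketch of the generation loop in PyTorch-style Python; model and tokenizer are hypothetical stand-ins for any trained language model and its tokenizer:

    ```python
    import torch

    def generate(model, tokenizer, prompt, max_new_tokens=50, temperature=1.0):
        tokens = tokenizer.encode(prompt)                        # prompt as a list of token ids
        for _ in range(max_new_tokens):
            x = torch.tensor(tokens).unsqueeze(0)                # shape (1, sequence_length)
            logits = model(x)[0, -1]                             # scores for the *next* token only
            probs = torch.softmax(logits / temperature, dim=-1)  # distribution over the vocabulary
            next_token = torch.multinomial(probs, num_samples=1).item()
            tokens.append(next_token)                            # append and repeat
        return tokenizer.decode(tokens)
    ```

    Everything a model such as ChatGPT produces comes out of exactly this kind of loop; all the differences lie in how good the distribution over the next token is.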

    Language models appeared a very long time ago. In fact, one of the first practical examples of a Markov chain, given by Andrei Markov himself in 1913, was a simple language model that learned how likely vowels and consonants are to follow each other in the Russian language.

    Up until quite recently, language models were Markovian in nature, but the deep learning revolution changed that: recurrent networks were able to hold a consistent latent state and pass it through to the next time slot, which could improve token predictions. But the real game changer came with Transformers, attention-based architectures that started another hype wave on top of deep learning itself.

    Just like deep neural networks were used to achieve state-of-the-art results in virtually every field of machine learning in 2005-2020, after 2017-2018 Transformers did the same thing inside neural networks: the Transformer was invented as an encoder-decoder architecture for machine translation but soon branched into language modeling, general text understanding, and later image understanding, speech recognition, and many, many other fields, becoming a ubiquitous tool.

    Still, there is another hype wave inside the Transformers that we are interested in today. So now we are talking about a wave on top of a wave on top of a wave… well, this is the best I could do with Stable Diffusion:

    This latest craze started when OpenAI updated its GPT-3 model with fine-tuning techniques that used human feedback. Introduced in InstructGPT in the spring of 2022, these techniques made it possible to turn a pure token prediction machine into something more useful for human-initiated tasks by fine-tuning it on human assessments. An assessor labels how useful and/or harmless the model’s reply was, and the model learns to be more useful and less harmful (more on that later). In this way, a model can learn, for example, to answer human questions with answers that it “considers” to be correct, rather than just continue the conversation by changing the subject or asking a question itself (which would be a quite plausible continuation if we are just predicting the next token).

    The fine-tuned models are known as the GPT-3.5 series, and the fine-tuning process itself was finally developed into reinforcement learning from human feedback (RLHF). With RLHF, GPT-3.5 turned into ChatGPT, the model you have definitely heard about. Starting from GPT-3, such models have become collectively known as large language models (LLM), a term basically meaning “large enough to be interesting in practice”. “Large enough” indeed proves to be quite large: GPT-3 (and hence ChatGPT) has about 175 billion trainable parameters.
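    The exact training recipe for ChatGPT is only partially public, but the reward-modeling step of RLHF is usually described as pairwise preference learning: the assessor picks the better of two model replies, and a separate reward model is trained to score the chosen reply above the rejected one. Here is a minimal sketch of that loss, assuming a hypothetical reward_model that maps a prompt-reply pair to a scalar score:

    ```python
    import torch.nn.functional as F

    def preference_loss(reward_model, prompt, chosen_reply, rejected_reply):
        # Bradley-Terry-style pairwise loss: push the score of the reply the human
        # assessor preferred above the score of the reply they rejected.
        r_chosen = reward_model(prompt, chosen_reply)      # scalar score for the preferred reply
        r_rejected = reward_model(prompt, rejected_reply)  # scalar score for the rejected reply
        return -F.logsigmoid(r_chosen - r_rejected).mean()
    ```

    The trained reward model then provides the reward signal for fine-tuning the language model itself with a reinforcement learning algorithm (PPO in the InstructGPT setup).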

    Still, note that ChatGPT in essence is still a language model, that is, just a machine for predicting the next token in a text, trained on enormous datasets that encompass the whole available Internet. Interestingly enough, that proves to be sufficient for many different applications.

    The AI Spring of 2023

    After ChatGPT was released in November 2022, it became the fastest growing app ever, and the user base grew exponentially. It took a record 5 days to get to 1 million users, and by now ChatGPT has over 100 million users, a number that has probably already more or less saturated but shows few signs of dropping.

    We entered 2023 with ChatGPT, but it turned out to be only the beginning. Here is a timeline of just the main developments in this latest spring of AI:

    Let’s walk through some of them.

    On February 7, Microsoft announced its own answer to ChatGPT, a language model that was supposed to help Bing search. This release proved to be a little premature: the model was quickly jailbroken by the users, revealed its internal name Sydney, and made some interesting comments about it (I did a rather sloppy censoring job below):

    In a way, this is the first time it got into the public consciousness that even the current crop of LLMs may be somewhat dangerous. And yes, I’m still not quite used to this:

    On February 24, Facebook AI Research (FAIR) released the LLaMA model (Large Language Model Meta AI). It’s questionable that LLaMA by itself is any better than GPT-3.5 but LLaMA is important because it is open source. Anyone can download the pretrained model weights, which opened up large language models for a huge community of enthusiasts: you cannot train a GPT-3 sized model at home but you sure can experiment with it, do prompt engineering, maybe even fine-tune it. LLaMA has already led to many new developments from independent researchers, and the recently released LLaMA 2 (July 2023) is sure to follow suit.

    March 14 became the most important single day in this spring of AI. On the same day:

    • Google announced that it would integrate large language models into its ecosystem (that is, Google Docs etc.),
    • Anthropic, a startup that branched off from OpenAI with a special interest in AI safety, released its first LLM called Claude,
    • but OpenAI stole the day from those two by announcing its next level GPT-4 model.

    GPT-4 is supposed to be multimodal, that is, it is able to process both text and images at the same time. At the time of writing, its multimodal capabilities are not yet available to the general public, but existing illustrations from the papers are quite convincing. Here is an example from the original work:

    But to get a better grasp on GPT-4 capabilities, I really recommend reading the paper called “Sparks of Artificial General Intelligence: Early experiments with GPT-4”, also released in March. Their experimental results are eerily good even if cherry-picked:

    Around the same time, OpenAI released a plugin mechanism that allowed people to build upon ChatGPT via prompt engineering, and in April serious projects of this kind started to appear. One of the most interesting such projects was AutoGPT, a plugin that tries (sometimes quite successfully) to make a language model into an agent that acts independently and intelligently to achieve its goals. AutoGPT was advertised as a personal AI assistant, able to achieve the goals set by a user via planning, setting and fulfilling subgoals, and analyzing data found on the Web and on the user’s computer. Right now AutoGPT does not look very successful, but it is an interesting attempt to make language models agentic (more on that later).

    In May, Google released Bard, which proved to be much more successful than Sydney, and support for LLMs in the Google ecosystem actually started to happen. Research-wise, late April and May saw several interesting results aimed at extending the context window for LLMs, that is, how many tokens they can take in and process at a time. Here I will highlight the paper “Scaling Transformer to 1M tokens and beyond with RMT” and Anthropic expanding Claude’s context window to 100K tokens. This is already hundreds of pages that a language model can process together, summarize, try to derive new insights from, and so on.

    In the last couple of months, this torrent of new AI capabilities has slowed down somewhat. But what does it actually mean? Are we going to have AGI soon? Will AI take our jobs? What’s the plan? Let’s see where we stand right now and what the projections are.

    When AGI?

    ChatGPT and GPT-4 can be transformative in their own right, but what about actual strong artificial intelligence (artificial general intelligence, AGI)? When are we going to have actual human-level intellect in AI models?

    AI has a history of overly optimistic forecasts. For a long time, AI optimists have been predicting that true AGI is about 30 years away from whenever the survey was held. That’s understandable: an AI guru would predict that he or she would live to see true AGI, but in some distant future, not right now. Still, let’s see what the forecasters say now.

    There are approaches to making AGI forecasts by continuing trend lines—naturally, the problem is which trend line to choose. For example, Ajeya Cotra (2020) tried to anchor AGI development in biological analogies. There are several ways to use biology as a measure of how much computation we need to get to human level:

    • there are about 10^15 parameters (synapses) in a human brain;
    • to learn the weights for these synapses, we make about 10^24 computations during our lifetimes;
    • but to get to the human brain, evolution required about 10^52 computations to create our genome (yes, you can have a ballpark estimate even for that).

    The first estimate is clearly too low, the last one is clearly too high, so the truth must be somewhere in between… but where exactly, and why are we supposing that the human brain has any relevance at all? We were abstractly motivated by birds in our desire to fly but inventing the airplane had no relation to avian evolutionary development.

    For a different example, Davidson (2021) constructed a model that can make predictions on AI development via what they call semi-informative priors. But if you look inside the model, all you see is a Markov chain of events like “we tried to develop AGI and failed, next year we tried twice”…

    In my opinion, all we really have are still expert surveys. In August 2022, Grace, Weinstein-Raun, and Stein-Perlman conducted a survey of 738 AI experts (defined as people who had authored papers at NeurIPS and ICML). Their median estimate was that we have a 50% chance of developing human-level intelligence in 37 years, by 2059; this is a very close match with the previous survey, conducted in 2016, which placed AGI in 2061.

    Still, these are just medians of some very wide distributions. Wynroe et al. (2023) attempted a meta-review of various transformative AI timelines. Here are the cumulative distribution functions they had:

    And if you prefer numbers, here is a summary table of various percentiles:

    As you can see, experts believe there is a significant chance (more than half) of achieving AGI by 2050, and that we are about 90% certain to get there by 2100. Model-based estimates are much more modest, but they average over evolutionary bio-anchors and whole brain emulation estimates that are hard to believe necessary. Still, all of these estimates have extremely wide margins: nobody knows if the path to AGI is already open (and it’s just a matter of scale and lots of compute) or if it requires more conceptual breakthroughs.

    Finally, these days there are people who put their money where their mouths are. The Metaculus prediction market has a popular question that reads as follows: “When will the first general AI system be devised, tested, and publicly announced?” At present (Sep 18, 2023), the forecasting community has a median prediction of 2032:

    Interestingly, last time I checked (in July) their median was November 2032, so it’s slowly creeping up. However, since Metaculus handles real bets, they have to have specific resolution criteria for general AI. In this case, the criteria are:

    • a two-hour adversarial Turing test,
    • general robotic capabilities, 
    • and human-level or superhuman results on several general-purpose datasets (see the question page for details).

    While this is as good a take on an instrumental definition of AGI as any, I can definitely foresee a model that does all that but is not considered “general AI”, just like many previous benchmarks have been overcome in the past.

    So in summary, I would say that current forecasts are not that different from earlier AI history: we hope to see AGI during our lifetimes, we are far from sure we will, and it’s still hard to define what it is, even as we may be on the very brink of it.

    How AGI? Slow vs. fast takeoff

    Another interesting discussion is not about when AGI comes but about how it is going to happen. Back in 1965, Irving J. Good, a British mathematician and Turing’s coworker at Bletchley Park, suggested the idea of an “intelligence explosion”: a machine with superhuman intelligence will be able to design new intelligent machines faster than humans can, those machines will work faster yet, and ultimately the process will converge to physical limits, and progress will be faster than humans can even notice. This point, when progress becomes “infinitely” fast, is known as the technological singularity.

    Technological singularity due to AGI sounds plausible to most thinkers, but opinions differ on how we get there. The current debate is between “slow takeoff” and “fast takeoff” models: how fast is AGI going to happen and are we going to get any warning about it?

    In the slow takeoff model, AI has a lot of impact on the world, this impact is very noticeable, and, for instance, world GDP grows by an order of magnitude due to AI before we get the true AGI that could be dangerous. In this model, AI and even AGI fall into the regular trends of technological progress, serving as an important but ultimately just another technological revolution that allows these trends to continue further. AI can speed up progress, but it is just human progress continuing along its exponential trend lines.

    In the fast takeoff scenario, AI can and will have an effect in line with “regular” technological progress, but that happens right until the singularity, and then it snowballs very quickly, too quickly for humans to do anything about it. The central scenario for fast takeoff goes as follows: after a certain threshold of capabilities, we get an AI that is simultaneously agentic (which in particular means that it wants power—we’ll get to it in the next post) and able to improve itself. After that, we don’t get any further warnings: the AI improves itself very quickly, quietly obtains sufficient resources, and then simply takes over.

    There have been interesting debates about this that are worth reading. The main proponent of the fast takeoff scenario is Eliezer Yudkowsky, who has been warning us about potential AGI dangers for over a decade; we will consider his work in much more detail in the next post.

    It’s worth keeping in mind that slow takeoff is “slow” only in the sense that we are going to notice it: even the slow takeoff scenario predicts exponential growth! It assumes only a couple of years or maybe even several months between AI starting to visibly transform society and the arrival of real superhuman AGI. Fast takeoff says it might take seconds… but, to be honest, a year also does not sound like enough time to prepare unless we start now.

    All of this means that we better be ready to face AGI in our lifetimes, perhaps unexpectedly and almost certainly with a very short development timescale. Are we?..

    Conclusion

    This is the last question still left for us in this series: are we ready for AGI?

    Next time, we will discuss the dangers that potentially superhuman AI can pose for humanity. This includes the “mundane” dangers such as the loss of jobs due to this next round of the industrial revolution. But it also includes the potential existential risk of having an AGI that’s smarter than us, more powerful than us, but does not share our values and does not care about humans at all. We will see why it is reasonable to be afraid, what the hard problems are, and how we are currently trying to tackle them. In any case, we sure live in some very interesting times—let’s see what the future brings!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Generative AI VI: Stable Diffusion, DALL-E 2, and Midjourney

    Generative AI VI: Stable Diffusion, DALL-E 2, and Midjourney

    Congratulations, my friends, we have finally come to the end of the series! Although… well, not quite (see below), but we have definitely reached the end of what I had planned originally. Last time, we discussed diffusion-based models, mentioning, if not fully going through, all their mathematical glory. This time, we are going to put diffusion-based models together with multimodal latent spaces and variational autoencoders with discrete latent codes, getting to Stable Diffusion and DALL-E 2, and then will discuss Midjourney and associated controversies. Not much new math today: we have all the Lego blocks, and it only remains to fit them all together.

    Diffusion models + VQ-GAN = Stable Diffusion

    We already know how diffusion-based models work: starting from random noise, they gradually refine the image. The state of the art in 2021 in this direction was DDIMs, models that learn to do sampling faster, in larger steps, but have generally the same final quality.

    Then Stable Diffusion happened. Developed by LMU Munich researchers Robin Rombach et al., it was released in August 2022, with the paper published at CVPR 2022 (see also arXiv). Their idea was simple:

    • diffusion models are very good at generation but relatively slow and hard to scale up to huge dimensions of real images;
    • VAEs are very good at compressing images to a latent space (perhaps continuous, perhaps discrete);
    • so let’s use a diffusion model to generate the latent code and then decode it with a VAE!

    This is basically it: you can train an excellent diffusion model in the low-dimensional latent space, and we have seen that VAE-based models are very good at compressing and decompressing images to/from this latent space. The autoencoder here is the VQ-GAN model that we have discussed earlier.
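    Schematically, generation in such a latent diffusion model looks like the sketch below (jumping slightly ahead to the text conditioning described in the next paragraph). All four callables are hypothetical stand-ins for the trained components, and the details of the noise schedule are hidden inside denoising_step:

    ```python
    import torch

    @torch.no_grad()
    def generate(denoiser, text_encoder, vae_decoder, denoising_step, prompt,
                 steps=50, latent_shape=(1, 4, 64, 64)):
        cond = text_encoder(prompt)        # text condition, injected via cross-attention
        z = torch.randn(latent_shape)      # start from pure noise in the *latent* space
        for t in reversed(range(steps)):
            eps = denoiser(z, t, cond)     # predicted noise at step t, given the condition
            z = denoising_step(z, eps, t)  # one DDPM/DDIM-style update towards the clean latent
        return vae_decoder(z)              # decode the final latent back into pixel space
    ```

    The point is that the expensive iterative loop runs entirely in the small latent space, and pixels appear only once, in the final decoding step.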

    Another novelty of the Stable Diffusion model was a conditioning mechanism. The authors used a U-Net as the backbone for the diffusion part, but they augmented it with Transformer-like cross-attention to allow for arbitrary conditions to be imposed in the latent space. As a result, the condition is introduced on every layer of the diffusion decoder and on every step of the denoising. Naturally, the main application of this is to use a text prompt encoded by a Transformer as the condition:

    Stable Diffusion was released by LMU Munich researchers but soon became the poster child of the Stability AI startup, which recently led to a conflict. The controversy with Stability AI is currently unfolding, and until it is fully resolved I will refrain from commenting; here is a link, but let’s not go there now.

    Whatever the history of its creation, Stable Diffusion has become one of the most important models for image generation because it is both good and free to use: it has been released in open source, incorporated into HuggingFace repositories, and several free GUIs have been developed to make it easier to use.

    I will not give specific examples of Stable Diffusion outputs because this entire series of posts has been one such example: all cat images I have used to illustrate these posts have been created with Stable Diffusion. In particular, the prompt shown above is entirely real (augmented with a negative prompt, but making prompts for Stable Diffusion is a separate art in itself).

    Diffusion models + CLIP = DALL-E 2

    Stable Diffusion uses a diffusion model in the latent space of a VQ-GAN image-to-latent autoencoder, with text serving as a condition for the diffusion denoising model. But we already know that there are options for a joint latent space of text and images, such as CLIP (see Part IV of this series). So maybe we can decode latents obtained directly from text?

    DALL-E 2, also known as unCLIP, does exactly that (Ramesh et al., 2022). On the surface, it is an even simpler idea than Stable Diffusion: let’s just use the CLIP latent space! But in reality, they still do need a diffusion model inside: it turns out that text and image embeddings are not quite the same (this makes sense even in a multimodal latent space!), and you need a separate generative model to turn a text embedding into possible matching image embeddings.

    So the diffusion-based model still operates on the latent codes, but now the text is not a condition, it’s also embedded in the same joint latent space. Otherwise it’s the exact same multimodal CLIP embeddings that we discussed in an earlier post. The generation process now involves a diffusion model, which the authors of DALL-E 2 call a diffusion prior, to convert the text embedding into an image embedding:

    (This time, the prompt is a fake, it’s a Stable Diffusion image again.)
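    In pseudocode, the unCLIP generation pipeline might look roughly as follows; all three components are hypothetical stand-ins for the models described in the paper (the CLIP text encoder, the diffusion prior, and the diffusion-based image decoder), with their sampling loops hidden inside:

    ```python
    import torch

    @torch.no_grad()
    def unclip_generate(clip_text_encoder, diffusion_prior, image_decoder, prompt):
        text_emb = clip_text_encoder(prompt)          # embed the prompt into the joint CLIP space
        image_emb = diffusion_prior.sample(text_emb)  # turn it into a plausible *image* embedding
        return image_decoder.sample(image_emb)        # generate pixels conditioned on that embedding
    ```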

    DALL-E 2 reports excellent generation results; here are some samples from the paper:

    The combination of CLIP embeddings and a diffusion model in the latent space allows DALL-E 2 to do interesting stuff in the latent space. This includes highly semantic interpolations such as this one:

    What’s even more interesting, DALL-E 2 can do text-guided image manipulation, changing the image embedding according to the difference between vectors of the original and modified text captions:

    DALL-E 2 is not open sourced like Stable Diffusion, and you can only try it via the OpenAI interface. However, at least we have a paper that describes what DALL-E 2 does and how it has been trained (although the paper does appear to gloss over some important details). In the next section, we will not have even that.

    Midjourney and controversies over AI-generated art

    So what about the elephant in the room? Over the last year, the default models for text-image generation have been neither Stable Diffusion nor DALL-E 2; the lion’s share of the market has been occupied by Midjourney. Unfortunately, there is little I can add to the story above: Midjourney is definitely a diffusion-based model but the team has not published any papers or code, so technical details remain a secret.

    The best I could find was this Reddit comment. It claims that the original releases of Midjourney used a diffusion model augmented with progressive distillation (Salimans, Ho, 2022), a process that gradually combines short sampling steps into new, larger sampling steps by learning new samplers:

    This approach can significantly speed up the sampling process in diffusion models, which we noted as an important problem in the previous post. However, this is just a Reddit comment, and its author admits that available information only relates to the original beta releases, so by now Midjourney models may be entirely different. Thus, in this section let us review the public milestones of Midjourney and the controversies that keep arising over AI-generated art.

    One of Midjourney’s first claims to fame was an image called “Théâtre d’Opéra Spatial” (“Space Opera Theatre”) produced by Jason Allen:

    This image won first place in a digital art competition (2022 Colorado State Fair, to be precise). Allen signed this work as “Jason M. Allen via Midjourney” and insisted that he did not break any rules of the competition, but the judges were unaware that the image had been AI-generated, so some controversy still ensued.

    Later in 2022, Midjourney was used to illustrate a children’s book called “Alice and Sparkle”, very appropriately devoted to a girl who creates a self-aware artificial intelligence but somehow manages to solve the AI alignment problem, so in the book Alice and Sparkle live happily ever after:

    The text of the book was written with heavy help from ChatGPT, and the entire book went from idea to Amazon in 72 hours. It sparked one of the first serious controversies over the legal status of AI-generated art. “Alice and Sparkle” received many 5-star reviews and no fewer 1-star reviews, was temporarily suspended on Amazon (but then returned, here it is), and while there is no legal reason to take down “Alice and Sparkle” right now, the controversy still has not been resolved.

    Legal reasons may appear, however. After “Alice and Sparkle”, human artists realized that the models trained on their collective output can seriously put them out of their jobs. They claimed that AI-generated art should be considered derivative, and authors of the art comprising the training set should be compensated. On January 13, 2023, three artists filed a lawsuit against Stability AI, Midjourney, and DeviantArt, claiming that training the models on original work without consent of its authors constitutes copyright infringement. The lawsuit is proceeding as lawsuits generally do, that is, very slowly. In April, Stability AI moved to dismiss the case since the plaintiffs failed to identify “a single act of direct infringement, let alone any output that is substantially similar to the plaintiffs’ artwork”. On July 23, Judge William Orrick ruled that the plaintiffs did not present sufficient evidence but allowed them to present additional facts to amend their complaint. We will see how the case unfolds, but I have no doubt that this is just the first of many similar cases, and the legal and copyright system will have to adapt to the new reality of generative AI.

    In general, over 2023 Midjourney has remained the leader in the AI-generated art space, with several new versions released to wide acclaim. This acclaim, however, has also been controversial: users are often divided over whether new versions of image generation models are actually improvements.

    Lately, generated images tend to make the news not as art objects but as fake photographs. AI-generated art has become good enough to pass for real photos, and people have been using it to various effects. In March, Midjourney generated a viral image of Donald Trump being forcefully arrested. On May 22, a Twitter account made to look like a verified Bloomberg feed published a fake image of an explosion near the Pentagon in Washington D.C. The result exceeded expectations: trading bots and/or real traders took the fake news at face value, resulting in a $500B market cap swing:

    While this kind of news keeps attracting attention to generative AI, to be honest I do not really see a big new issue behind these “deepfakes.” Realistic fake photos have been possible to produce for decades, with the tools steadily improving even regardless of machine learning progress. A Photoshop expert could probably make the Pentagon explosion “photo” in a couple of hours; I am not even sure that fiddling with the prompts to get an interesting and realistic result takes significantly less time (but yes, it does not require an experienced artist). While generative models can scale this activity up, it is by no means a new problem.

    Professional artists, on the other hand, face a genuine challenge. I have been illustrating this series of posts with (a rather old version of) Stable Diffusion. In this case, it would not make sense to hire a professional illustrator to make pictures for this blog anyway, so having access to a generative model has been a strict improvement for this series. As long as you are not too scrupulous about the little details, the cats just draw themselves:

    But what if I had to illustrate a whole book? Right now, the choice is between spending money to get better quality human-made illustrations and using generative AI to get (somewhat) worse illustrations for free or for a small fee for a Midjourney subscription. For me (the author), the work involved is virtually the same since I would have to explain what I need to a human illustrator as well, and would probably have to make a few iterations. For the publisher, hiring a human freelancer is a lot of extra work and expense. Even at present, I already see both myself and publishing houses choosing the cheaper and easier option. Guess what happens when this option ceases to be worse in any noticeable way…

    Conclusion

    With this, we are done with the original plan for the “Generative AI” series. Over these seven posts, we have seen a general overview of modern approaches to image generation, starting from the original construction of variational autoencoders and proceeding all the way to the latest and greatest diffusion-based models.

    However, a lot has happened in the generative AI space even as I have been writing this series! In my lectures, I call 2023 “the spring of artificial intelligence”: starting from the growing popularity of ChatGPT and the release of LLaMA that put large language models in the hands of the public, important advances have been made virtually every week. So next time, I will attempt to review what has been happening this year in AI; it will not be technical at all but the developments seem to be too important to miss. See you then!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Generative AI V: Diffusion-based models

    Generative AI V: Diffusion-based models

    By this time, we have discussed nearly all components of modern generative AI: variational autoencoders, discrete latent spaces, how they combine with Transformers in DALL-E, and how to learn a joint latent space for images and text. There is only one component left—diffusion-based models—but it’s a big one! Today, we discuss the main idea of diffusion-based models and go over the basic diffusion models such as DDPM and DDIM. Expect a lot of math, but it will all pay off at the end.

    Diffusion-based models

    We have already discussed the main idea behind diffusion in machine learning in the very first, introductory post of this series. As a quick reminder, the idea is to train a model to denoise images or other objects so well that in the end, you can give it (what looks like) random noise as input and after several rounds of denoising get a realistic object.

    In the space of images, it would look something like this. Suppose you have some kind of noise in mind, most probably Gaussian. This defines a probability distribution q(\mathbf{x}_{k+1}| \mathbf{x}_{k}), where \mathbf{x}_{k} is the input image and \mathbf{x}_{k+1} is the image with added noise. Applying this distribution repeatedly, we get a Markov chain called forward diffusion that gradually adds noise until the image is completely unrecognizable:

    But on every step of this transformation, you add only a little bit of noise, and it is reasonable to expect that a denoising model would learn to almost perfectly get rid of it. If you get such a denoising model, again in the form of a distribution p_{\boldsymbol{\theta}}(\mathbf{x}_{k}| \mathbf{x}_{k+1}) with model parameters \boldsymbol{\theta} that should be a good approximation for the inverted q(\mathbf{x}_{k}| \mathbf{x}_{k+1}), you can presumably run it backwards and get the images back from basically random noise. This process is known as reverse diffusion:

    However, as Woody Allen put it, “right now it’s only a notion, but I think I can get money to make it into a concept, and later turn it into an idea”. Training a denoising model straightforwardly, by using pairs of images produced by q(\mathbf{x}_{k+1}| \mathbf{x}_{k}) as supervision, will not get us too far: the model needs to understand the entire dynamics and make its backwards steps smarter.

    Therefore, we use approximate inference to get from \mathbf{x}_{n} to \mathbf{x}_{0}. Since we already know variational autoencoders, I will mention that one good way to think about diffusion models is to treat them as hierarchical VAEs that chain together several feature-extracting encoders, but with additional restrictions on the encoders and decoders.

    But this is where it gets mathy. The next section is not for the faint of heart, but I still include it for those of you who really want to understand how this stuff works. I will not refer to the derivation details later, so if the next section is a bit too much, feel free to skip it.

    Probabilistic diffusion models: idea and derivation

    Probabilistic diffusion models were introduced by Sohl-Dickstein et al. in “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” (2015). As you can see from the title, it was a novel idea that went in an unexplored direction, and it had taken five years since 2015 to make it work reasonably efficiently, and a couple more years to turn it into the latent diffusion type models that we enjoy now.

    Still, the basic concept remains the same. The forward diffusion process adds Gaussian noise, and the reverse diffusion model learns to restore the original image. Let’s dive into the details!

    First, if we consider the noise to be Gaussian then we can get a result very similar to the reparametrization tricks we have seen earlier for VAE and dVAE: we can “compress” the whole chain into a single Gaussian. Formally, assume that q(\mathbf{x}_{t}| \mathbf{x}_{t-1}) is a Gaussian with variance \beta_t and mean that reduces \mathbf{x}_{t-1} by a factor of the square root of \alpha_t=1-\beta_t (this is necessary to make the process variance preserving, so that \mathbf{x}_{t} would not explode or vanish), and the entire process takes T steps:

        \[q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t | \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t\mathbf{I}\right),\qquad q\left(\mathbf{x}_{1:T} | \mathbf{x}_0\right) = \prod_{t=1}^T q\left(\mathbf{x}_t | \mathbf{x}_{t-1}\right).\]

    Then we can write

        \begin{align*} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\boldsymbol{\epsilon} \\ & = \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\boldsymbol{\epsilon}\right) + \sqrt{1-\alpha_t}\boldsymbol{\epsilon} \\ & =\sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\boldsymbol{\epsilon} = \ldots \\ & = \sqrt{A_t}\mathbf{x}_0 + \sqrt{1 - A_t}\boldsymbol{\epsilon},\quad\text{where}\quad A_t = \alpha_1\alpha_2\ldots\alpha_t. \end{align*}

    This means that the compressed distribution q\left(\mathbf{x}_{t} | \mathbf{x}_0\right) is also a Gaussian, and we know its parameters:

        \[q\left(\mathbf{x}_{t} | \mathbf{x}_0\right) = \mathcal{N}\left(\mathbf{x}_{t} | \sqrt{A_t}\mathbf{x}_0, \left(1-A_t\right)\mathbf{I}\right).\]

    This makes the forward diffusion process very efficient: we can sample from q\left(\mathbf{x}_{T} | \mathbf{x}_0\right) directly, in closed form, without having to go through any intermediate steps.
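    In code, forward diffusion is therefore a one-liner; here is a minimal PyTorch sketch, where the linear schedule for \beta_t is just one common choice:

    ```python
    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)  # an example variance schedule beta_1, ..., beta_T
    alphas = 1.0 - betas
    A = torch.cumprod(alphas, dim=0)       # cumulative products A_t = alpha_1 * ... * alpha_t (0-indexed in code)

    def q_sample(x0, t, noise=None):
        # Sample x_t ~ q(x_t | x_0) = N(sqrt(A_t) x_0, (1 - A_t) I) in a single step;
        # t is a 0-indexed scalar timestep here for simplicity.
        if noise is None:
            noise = torch.randn_like(x0)
        return A[t].sqrt() * x0 + (1.0 - A[t]).sqrt() * noise
    ```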

    It might seem that inverting Gaussians should be just as easy as stringing them together. And indeed, if our problem was to invert the Gaussian part of the process for a given \mathbf{x}_0, it would be easy! Let’s use the Bayes formula and substitute distributions that we already know:

        \begin{align*}q\left(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0\right) &= \frac{q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}, \mathbf{x}_0\right)q\left(\mathbf{x}_{t-1} |\mathbf{x}_0\right)}{q\left(\mathbf{x}_t | \mathbf{x}_0\right)} \\&= \frac{\mathcal{N}\left(\mathbf{x}_t|\sqrt{\alpha_t}\mathbf{x}_{t-1}, \left(1-\alpha_t\right)\mathbf{I}\right)\mathcal{N}\left(\mathbf{x}_{t-1}|\sqrt{A_{t-1}}\mathbf{x}_0, \left(1-A_{t-1}\right)\mathbf{I}\right)}{\mathcal{N}\left(\mathbf{x}_{t}|\sqrt{A_{t}}\mathbf{x}_0, \left(1-A_{t}\right)\mathbf{I}\right)} \\  & = \mathrm{Const} \cdot e^{-\frac12\left(\frac{\left(\mathbf{x}_t-\sqrt{\alpha_t}\mathbf{x}_{t-1}\right)^2}{1-\alpha_t} + \frac{\left(\mathbf{x}_{t-1}-\sqrt{A_{t-1}}\mathbf{x}_{0}\right)^2}{1 - A_{t-1}} - \frac{\left(\mathbf{x}_{t}-\sqrt{A_{t}}\mathbf{x}_{0}\right)^2}{1 - A_{t}}\right)}. \end{align*}

    It is already clear that the new distribution is a Gaussian as well, since its density has a quadratic function of \mathbf{x}_{t-1} in the exponent. I will skip the gory details of completing the square in this exponent, but the result is, again, a nice and clean Gaussian whose parameters we know and that we could easily sample from:

        \begin{align*}q\left(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0\right) &= \mathcal{N}\left(\mathbf{x}_{t-1}| {\tilde{\boldsymbol{\mu}}}\left(\mathbf{x}_t,\mathbf{x}_0\right), {\tilde{\beta}}_t\mathbf{I}\right),\quad\text{where} \quad{\tilde{\beta}}_t = \frac{1 - A_{t-1}}{1 - A_t}\cdot\beta_t,\\{\tilde{\boldsymbol{\mu}}}\left(\mathbf{x}_t,\mathbf{x}_0\right) &= \frac{\sqrt{\alpha_t}\left(1 - A_{t-1}\right)}{1 - A_t}\mathbf{x}_t + \frac{\sqrt{A_{t-1}}\beta_t}{1 - A_t}\mathbf{x}_0= \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-A_t}}\boldsymbol{\epsilon}\right).\end{align*}

    Are we done? Of course not, we are just getting started! This simple distribution is conditioned on \mathbf{x}_0… but it is exactly q(\mathbf{x}_0) that represents the impossibly messy distribution of, say, real life images. Ultimately we want our reverse diffusion process to reconstruct q(\mathbf{x}_0) from a standard input at \mathbf{x}_n; something like this:

    The whole problem of training a generative model, as we have discussed many times on this blog, is to find a good representation for q(\mathbf{x}_0), and our process so far treats it as a known quantity.

    What do we do? As usual in Bayesian inference, we approximate. On every step, we want the model p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_{t}) to be a good approximation of the posterior q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_0), but with no conditioning on the unknown \mathbf{x}_0:

    To get this approximation, we need a variational lower bound pretty similar to the one used in variational autoencoders and DALL-E. We will start with a bound for the global distribution q(\mathbf{x}_{1:T}|\mathbf{x}_{0})=q(\mathbf{x}_{1},\ldots,\mathbf{x}_{T}|\mathbf{x}_{0}):

    And then it will turn out that it decomposes into bounds for individual steps of the diffusion process.

    Since we’re doing a lot of math here anyway, let us derive the variational lower bound from first principles, just like we did in the post on VAEs. We start from the obvious equality

        \[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0) = \log p_{\boldsymbol{\theta}}(\mathbf{x}_0,\mathbf{x}_1,\ldots,\mathbf{x}_T) - \log p_{\boldsymbol{\theta}}(\mathbf{x}_1,\ldots,\mathbf{x}_T|\mathbf{x}_0),\]

    take the expectation with respect to q(\mathbf{x}_{0:T})=q(\mathbf{x}_{0},\ldots,\mathbf{x}_{T}), and then add and subtract \log q(\mathbf{x}_{1:T}|\mathbf{x}_{0})= \log q(\mathbf{x}_{1},\ldots,\mathbf{x}_{T}|\mathbf{x}_{0}) on the right-hand side:

        \begin{align*}\mathbb{E}_{q(\mathbf{x}_{0})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0)\right] &= \mathbb{E}_{q(\mathbf{x}_{0:T})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}_{0:T})\right] - \mathbb{E}_{q(\mathbf{x}_{0:T})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}_{1:T}|\mathbf{x}_0)\right] \\ & = \mathbb{E}_{q(\mathbf{x}_{0:T})}\left[\log\frac{p_{\boldsymbol{\theta}}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\right] + \mathbb{E}_{q(\mathbf{x}_{0:T})}\left[\log\frac{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\right].\end{align*}

    At this point, we note that the second term on the right is the KL divergence between q(\mathbf{x}_{1:T}|\mathbf{x}_{0}) and p_{\boldsymbol{\theta}}(\mathbf{x}_{1:T}|\mathbf{x}_{0}), so that’s what we want to minimize in the approximation. Since the left-hand side does not depend on \mathbf{x}_{1:T}, minimizing this KL divergence is equivalent to maximizing the first term on the right-hand side, which is our variational lower bound.

    It will be more convenient to think of it as a loss function, so let’s add a minus sign in front, that is, let’s invert the fraction inside the logarithm. Then we can note that the bound decomposes nicely into the sum of individual steps; this is the last long derivation in this post (phew!):

        \begin{align*}\mathcal{L} =& \mathbb{E}_{q}\left[\log\frac{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{0:T})}\right] = \mathbb{E}_{q}\left[\log\frac{\prod_{t=1}^Tq(\mathbf{x}_{t}|\mathbf{x}_{t-1})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{T})\prod_{t=1}^Tp_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}\right] \\=& \mathbb{E}_{q}\left[-\log p_{\boldsymbol{\theta}}(\mathbf{x}_{T}) + \sum_{t=1}^T\log\frac{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}\right] \\=& \mathbb{E}_{q}\left[-\log p_{\boldsymbol{\theta}}(\mathbf{x}_{T}) + \sum_{t=2}^T\log\frac{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_{t})} + \log\frac{q(\mathbf{x}_{1}|\mathbf{x}_{0})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{0}|\mathbf{x}_{1})}\right] \\=& \mathbb{E}_{q}\left[-\log p_{\boldsymbol{\theta}}(\mathbf{x}_{T}) + \sum_{t=2}^T\log\left(\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}\frac{q(\mathbf{x}_{t}|\mathbf{x}_{0})}{q(\mathbf{x}_{t-1}|\mathbf{x}_{0})}\right) + \log\frac{q(\mathbf{x}_{1}|\mathbf{x}_{0})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{0}|\mathbf{x}_{1})}\right] \\=& \mathbb{E}_{q}\left[-\log p_{\boldsymbol{\theta}}(\mathbf{x}_{T}) + \sum_{t=2}^T\log\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_{t})} + \log\frac{q(\mathbf{x}_{T}|\mathbf{x}_{0})}{q(\mathbf{x}_{1}|\mathbf{x}_{0})} + \log\frac{q(\mathbf{x}_{1}|\mathbf{x}_{0})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{0}|\mathbf{x}_{1})}\right] \\=& \mathbb{E}_{q}\left[\log\frac{q(\mathbf{x}_{T}|\mathbf{x}_{0})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{T})} + \sum_{t=2}^T\log\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})}{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_{t})} - \log p_{\boldsymbol{\theta}}(\mathbf{x}_{0}|\mathbf{x}_{1})\right].\end{align*}

    Now we see that the loss function decomposes nicely into a sum of T+1 components, and almost all of them are actually KL divergences between Gaussians:

        \begin{align*}L =& L_T + L_{T-1} + \ldots + L_0,\qquad\text{where}\\L_T =& \mathrm{KL}\left(q(\mathbf{x}_{T}|\mathbf{x}_{0})\|p_{\boldsymbol{\theta}}(\mathbf{x}_T)\right),\\L_t =& \mathrm{KL}\left(q(\mathbf{x}_{t}|\mathbf{x}_{t+1},\mathbf{x}_{0})\|p_{\boldsymbol{\theta}}(\mathbf{x}_t|\mathbf{x}_{t+1})\right),\quad t=1,\ldots,T-1,\\L_0 =& -\log p_{\boldsymbol{\theta}}(\mathbf{x}_0 | \mathbf{x}_1).\end{align*}

    All of these components are now relatively straightforward to compute; for example, in L_t we are using the Gaussian parametrization

        \[p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1}| \boldsymbol{\mu}_{\boldsymbol{\theta}}\left(\mathbf{x}_t,t\right),\Sigma_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right)\]

    and trying to match its parameters with q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0}). For the mean, for instance, we get

        \[\boldsymbol{\mu}_{\boldsymbol{\theta}}\left(\mathbf{x}_t,t\right) \approx {\tilde{\boldsymbol{\mu}}}_t\left(\mathbf{x}_t,\mathbf{x}_0\right) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-A_t}}\boldsymbol{\epsilon}_t\right),\]

    and since we know \mathbf{x}_t during training, we can actually parametrize the noise directly rather than the mean:

        \[p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1}\,\middle|\, \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-A_t}}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\mathbf{x}_t,t\right)\right),\Sigma_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right).\]
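    To see what this boils down to in code: with the noise parametrization above, matching the means of p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1} | \mathbf{x}_t) and q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) reduces to matching the predicted noise to the noise that was actually mixed in during forward diffusion. Here is a sketch of one training step that drops the time-dependent weighting constants (a common simplification) and reuses T and A from the earlier snippet; model is a hypothetical noise-prediction network \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{x}_t, t):

    ```python
    import torch
    import torch.nn.functional as F

    def diffusion_loss(model, x0):
        # x0: a batch of images of shape (batch, channels, height, width).
        t = torch.randint(0, T, (x0.shape[0],))          # a random timestep for every image
        noise = torch.randn_like(x0)
        a = A[t].view(-1, 1, 1, 1)                       # broadcast A_t over image dimensions
        x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise   # forward diffusion in closed form
        predicted_noise = model(x_t, t)                  # epsilon_theta(x_t, t)
        return F.mse_loss(predicted_noise, noise)        # train the network to recover the noise
    ```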

    I will stop the calculations here but I hope you are convinced now that this whole reverse diffusion Markov chain comes down to a closed form loss function that you can program in PyTorch and minimize. This was the main idea of the original paper by Sohl-Dickstein et al. (2015). Let us see where it has gone since then.

    Denoising diffusion probabilistic models

    In 2015, the original diffusion model could only be run on datasets that by now sound more like toy examples. For instance, Sohl-Dickstein et al. give examples of how their generative model fares on CIFAR-10. In the image below, (a) shows some original hold-out images from CIFAR-10, in (b) they are corrupted with Gaussian noise, (c) shows how the diffusion model can denoise the images from (b), using them as starting points for the reverse diffusion chain, and finally (d) shows new samples generated by the diffusion model:

    That looked somewhat promising, but perhaps not promising enough to warrant a concerted effort to develop this approach. At the time, people were just getting excited about GANs: the original work by Goodfellow et al. was published in 2014, ProGAN (thispersondoesnotexist) would be released in 2017, and GANs would define the state of the art in image generation for the next several years, until they arguably ran out of steam somewhere around StyleGAN 3.

    Therefore, the next stop on our way happened only five years later, in 2020, in the work “Denoising Diffusion Probabilistic Models” (DDPM) by Ho et al. They used the same basic idea and arrived at the same basic structure of the loss function; I reproduce it here in a general form since I suspect many readers have not followed through all the derivations in the previous section:

    There are three different components in this loss function, two of them appearing at the ends of the chain and one that is responsible for every intermediate step. Ho et al. make the following observations and simplifications:

    • they assume all forward diffusion variances \beta_t to be constant hyperparameters and do not train them, so there is nothing to train at all in the forward diffusion distributions q; since p_{\boldsymbol{\theta}}(\mathbf{x}_{T}) is a fixed distribution that we want to sample from, this means that L_T is a constant and can be ignored;
    • for the intermediate steps, they do not train the variances in p_{\boldsymbol{\theta}}(\mathbf{x}_{t} | \mathbf{x}_{t+1}) either, setting them to \sigma^2\mathbf{I} for some constant \sigma; they also develop the noise reparametrization mentioned above somewhat further, obtaining a simple closed form for L_t;
    • finally and most importantly, they substitute a separate discrete decoder for L_0; namely, they assume that the data consists of integers from 0 to 255 scaled linearly to [-1, 1], which is a natural representation for images, and model

          \[p_{\boldsymbol{\theta}}(\mathbf{x}_{0} | \mathbf{x}_1) = \prod_{i=1}^D\int_{\delta_-\left({x}_0,i\right)}^{\delta_+\left({x}_0,i\right)} \mathcal{N}\left({x}\middle|{\mu}_{\boldsymbol{\theta},i}\left(\mathbf{x}_1\right), \sigma_1^2\right)\mathrm{d} x,\]


      where i goes over the pixels (components of \mathbf{x}), {\mu}_{\boldsymbol{\theta},i}\left(\mathbf{x}_1\right) is the independent decoder model, and the integration limits define an interval of length 1/255 on every side of x_{0,i}, which is a standard trick to make everything smooth and continuous.
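    To make the last item concrete, here is one way such a discretized Gaussian decoder could be computed; this is a sketch under the assumptions above (pixel values scaled to [-1, 1], a fixed \sigma_1, and open-ended bins at the extreme values):

    ```python
    import torch

    def discretized_gaussian_log_likelihood(x0, mu, sigma):
        # x0, mu: tensors of shape (batch, channels, height, width) with values in [-1, 1];
        # each pixel gets the probability mass of N(mu_i, sigma^2) over a bin that extends
        # 1/255 on each side of the observed value x0_i.
        std_normal = torch.distributions.Normal(0.0, 1.0)
        cdf_plus = std_normal.cdf((x0 + 1.0 / 255 - mu) / sigma)
        cdf_minus = std_normal.cdf((x0 - 1.0 / 255 - mu) / sigma)
        cdf_plus = torch.where(x0 > 0.999, torch.ones_like(cdf_plus), cdf_plus)      # bin extends to +inf at 1
        cdf_minus = torch.where(x0 < -0.999, torch.zeros_like(cdf_minus), cdf_minus)  # bin extends to -inf at -1
        log_probs = torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))
        return log_probs.flatten(1).sum(dim=1)  # sum over pixels: log p(x_0 | x_1) for each image
    ```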

    As a result, you can substitute a different model at the last step and use a noiseless {\mu}_{\boldsymbol{\theta},i}\left(\mathbf{x}_1\right) during test-time sampling, which extends the capabilities of the whole diffusion model significantly.

    With these modifications, DDPM was able to achieve state of the art generation, comparable with the best GANs of the time. Here is a sample:

    Still, that’s not quite the end of the story even for basic diffusion-based models.

    Denoising Diffusion Implicit Models

    The next step came very quickly after DDPMs, in the work called “Denoising Diffusion Implicit Models” (DDIM) by Song et al. (2020). They aim at the same model as DDPM, but address an important drawback of all diffusion models we have discussed so far: they are extremely slow. The generation process mirrors every step of the diffusion process, so to generate a new sample you have to go through thousands of steps (literally!), on every step we have to apply a neural network, and the steps are consecutive and cannot be parallelized. This is especially bad in contrast to the usual deep learning paradigm where it might take you a very long time to train a model but applying it is usually pretty fast: Song et al. mention that sampling from a trained GAN is over 1000x faster than sampling from a DDPM trained for the same image size.

    How can we speed up this construction, which at first glance looks inherently incremental? Song et al. do it by generalizing diffusion models and DDPMs specifically. They note that the loss function we discussed above does not depend directly on the joint distribution q\left(\mathbf{x}_{1:T} | \mathbf{x}_0\right)=q\left(\mathbf{x}_{1},\ldots,\mathbf{x}_{T} | \mathbf{x}_0\right) but only on the marginal distributions q\left(\mathbf{x}_{t} | \mathbf{x}_0\right). This means that we can reuse the exact same learning objective for a different joint distribution as long as it has the same marginals.

    Song et al. define their diffusion process in terms of its reverse form:

        \[q_{\sigma}\left(\mathbf{x}_{1:T} | \mathbf{x}_0\right)= q_{\sigma}\left(\mathbf{x}_{T} | \mathbf{x}_0\right)\prod_{t=2}^T q_{\sigma}\left(\mathbf{x}_{t-1} | \mathbf{x}_{t},\mathbf{x}_0\right).\]

    Now we can express the forward diffusion distributions via the Bayes theorem:

        \[q_{\sigma}\left(\mathbf{x}_{t} | \mathbf{x}_{t-1},\mathbf{x}_0\right)=\frac{q_{\sigma}\left(\mathbf{x}_{t-1} | \mathbf{x}_{t},\mathbf{x}_0\right)q_{\sigma}\left(\mathbf{x}_{t} | \mathbf{x}_0\right)}{q_{\sigma}\left(\mathbf{x}_{t-1} | \mathbf{x}_0\right)}.\]

    Song et al. show (I promised to contain the complicated math in the first section, so I’ll skip the derivation here) that the resulting process has the same marginals, and the reverse diffusion can be trained with the same loss function and will represent an actual Markov chain:

    So far it does not sound very helpful: we have extended the class of forward diffusion distributions, but sampling a new image still requires going through all the reverse diffusion steps. However, the key observation here is that instead of approximating the random noise \boldsymbol{\epsilon}_t that gets us from \mathbf{x}_{t} to \mathbf{x}_{t+1}, we are now approximating the random noise \boldsymbol{\epsilon}_t that is mixed with \mathbf{x}_{0} to obtain \mathbf{x}_{t+1}.

    This process, in essence, means that when we are going in the reverse direction, we are approximating the direction not to \mathbf{x}_{t}, but directly to \mathbf{x}_{0}, and make a step in that direction. Here is an illustration for the difference:

    A DDPM model is trying to approximate the step from \mathbf{x}_{t+1} to \mathbf{x}_{t}, failing somewhat and getting a worse image. A DDIM model is trying to approximate the direction all the way from \mathbf{x}_{t+1} to \mathbf{x}_{0}; naturally, it fails a little and if it tried to go all the way to \mathbf{x}_{0} it would miss by a lot so it makes a small step in the approximate direction. It is hard to say which method is doing a better job at the approximation itself, but there is an important benefit to the DDIM scheme in terms of performance.

    Since now \boldsymbol{\epsilon}_{t} and the dependence on \mathbf{x}_{0} are disentangled, \boldsymbol{\epsilon}_{t} is just Gaussian noise, and we can jump over several steps in the process, getting from \mathbf{x}_{t} to \mathbf{x}_{t+k} in a single step with a correspondingly increased noise variance! One can train a model with a large number of steps T but visit only a few of them during generation, which speeds things up very significantly. Naturally, the variance will increase, and the approximations will get worse, but with careful tuning this effect can be contained.
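    Under these assumptions, the sampling loop might look like the following sketch; it reuses T and A from the earlier snippets, model is again a hypothetical noise-prediction network, and the variance is set to zero (the deterministic case discussed below):

    ```python
    import torch

    @torch.no_grad()
    def ddim_sample(model, shape, num_steps=50):
        # Visit only a short subsequence of the T training steps, e.g. 50 out of 1000.
        tau = torch.linspace(T - 1, 0, num_steps).long()
        x = torch.randn(shape)                           # start from pure noise
        for i, t in enumerate(tau):
            eps = model(x, t)                            # predicted noise at step t
            x0_pred = (x - (1 - A[t]).sqrt() * eps) / A[t].sqrt()  # estimate of x_0
            if i + 1 == len(tau):
                x = x0_pred                              # last step: output the x_0 estimate
            else:
                t_prev = tau[i + 1]
                # deterministic (zero-variance) update: step towards x_0, then re-add
                # the *predicted* noise at the smaller noise level t_prev
                x = A[t_prev].sqrt() * x0_pred + (1 - A[t_prev]).sqrt() * eps
        return x
    ```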

    Song et al. achieve 10x to 100x speedups compared to DDPM, with insignificant loss in quality:

    Moreover, DDIMs also have a generation process that does not need to be stochastic! Song et al. suggest setting the variance hyperparameter in the reverse diffusion chain to zero during generation. This means that a latent code in the space of \mathbf{x}_T corresponds to exactly one image, and now we can expect DDIMs to behave in the same way as other models that train latent representations (compare, e.g., our previous post), including, for instance, interpolations in the latent space:

    Note that DDPMs could not do interpolations because a latent code \mathbf{x}_T would have a huge amount of noise added to it during the reverse diffusion process; it wasn’t really a “code” for anything, just a starting point for the Markov chain.

    Conclusion

    Today, we have introduced the basics of diffusion models in machine learning. This field started in 2015, and its basic idea of learning gradual denoising transformations has been preserved in later developments: DDPMs made several improvements that made it possible to scale diffusion models up, and DDIMs sped up the generation process and made it deterministic, which opened up a number of new possibilities.

    There is basically only one step left before we get to the cutting edge models such as Stable Diffusion and DALL-E 2. Next time, we will take this step; stay tuned!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • CLIP and multimodal retrieval: Generative AI IV

    CLIP and multimodal retrieval: Generative AI IV

    Last time, we discussed DALL-E, a model that brings together text Transformers and a discrete VAE for images. While DALL-E was a huge step forward and generated a lot of buzz for generative AI back in 2021, modern generative models such as DALL-E 2 consist of different components. One of them is usually a multimodal encoder that maps different modalities (e.g., text and images) into the same latent space. Today, we discuss such encoders and then make an example of a specific practical problem where they have become instrumental over the last couple years: text-video retrieval, that is, searching for video content by text queries.

    CLIP: Contrastive Language-Image Pretraining

    A model that has proven to be one of the most important for multimodal retrieval is CLIP, introduced by OpenAI in 2021. The basic motivation behind CLIP was to use the data freely available on the Web: text paired with images, i.e., captions of the form like “a black and white cat” or “Pepper the aussie pup” used in OpenAI’s illustrations (see below).

    The question, of course, is how to make use of this huge amount of data. The authors of CLIP reported that their first instinct was to train an image CNN and a text Transformer to predict the caption of an image. Transformers are famously good at generating text, but it turned out that the resulting recognition quality on ImageNet classes was no higher than that of a bag-of-words baseline. The reason was that predicting the exact caption is very hard (basically hopeless), and it’s not really what we need from this model—we just need good multimodal embeddings.

    Therefore, CLIP switched to the titular idea of contrastive pretraining: we want both text descriptions and the images themselves to map to the same latent space, so let’s use a loss function that brings positive pairs (correct descriptions) closer together and negative pairs (incorrect descriptions) further apart.

    In the picture below, I show the “attractive and repulsive forces” (green and red arrows respectively) that should appear between two image-description pairs, with each pair used as negative samples for the other:

    CLIP takes this idea and runs with it, constructing a whole matrix of similarities (dot products in the latent space) between the embeddings of N images and N corresponding textual descriptions. As a result, we get an NxN matrix where the diagonal corresponds to positive pairs (so diagonal elements should be made larger) and all other elements correspond to negative pairs (so off-diagonal elements should be made smaller). 
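    In code, this symmetric contrastive objective is remarkably compact. Here is a sketch (not OpenAI’s exact implementation), assuming the two encoders have already produced a batch of matching image and text embeddings:

        import torch
        import torch.nn.functional as F

        def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
            """Symmetric contrastive loss over a batch of N matching (image, text) pairs.
            image_emb, text_emb: [N, d] outputs of the image and text encoders."""
            image_emb = F.normalize(image_emb, dim=-1)
            text_emb = F.normalize(text_emb, dim=-1)
            logits = image_emb @ text_emb.t() / temperature             # N x N similarity matrix
            targets = torch.arange(len(logits), device=logits.device)  # positives on the diagonal
            loss_images = F.cross_entropy(logits, targets)              # match each image to its text
            loss_texts = F.cross_entropy(logits.t(), targets)           # match each text to its image
            return (loss_images + loss_texts) / 2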

    Here is the main illustration for this idea from the CLIP paper:

    The encoders, of course, are Transformer-based architectures; for images, CLIP uses the Vision Transformer (ViT) that breaks an input image into patches and treats embeddings of patches as tokens for the Transformer architecture. The margins of this blog post are too narrow to explain Vision Transformers; maybe one day we will have a series about Transformers, what with them being the most important architecture of recent years and all. For now, let’s just assume that ViTs are good at converting images into embeddings, and ViT itself has been a key component of many multimodal architectures; see the original paper by Dosovitskiy et al. for details.

    The original work shows that CLIP is very capable of zero-shot classification: you can turn a class label into a rudimentary query (e.g., “cat” becomes “a photo of a cat”) and get a reasonable classifier by finding nearest neighbors in the joint latent space (image by OpenAI):
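    In code, this zero-shot setup might look like the following sketch, where encode_image and encode_text are hypothetical stand-ins for CLIP’s encoders:

        import torch.nn.functional as F

        def zero_shot_classify(image, class_names, encode_image, encode_text):
            """Zero-shot classification: nearest class prompt in the joint latent space."""
            prompts = [f"a photo of a {name}" for name in class_names]
            text_emb = F.normalize(encode_text(prompts), dim=-1)     # [C, d]
            image_emb = F.normalize(encode_image(image), dim=-1)     # [1, d]
            similarities = image_emb @ text_emb.t()                  # cosine similarities
            return class_names[similarities.argmax().item()]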

    But the main use of CLIP has been for enabling text-image retrieval and generative AI models. Its multimodal latent space proves to be an excellent tool both for finding existing objects and generating new ones (provided you train a decoder for it, of course—the original CLIP has none). In the rest of this post, I will expand on the retrieval part, and we will leave the generative part for next installments. But first, let’s consider an interesting extension of CLIP that, a little unexpectedly, uses our local specialty: synthetic data.

    BLIP: Bootstrapping CLIP with Synthetic Data

    There has been no lack of models that further extended and improved CLIP, although basic CLIP itself is still very relevant. As a representative model that takes a few steps forward from CLIP, let us consider BLIP (the similar acronym is no accident, of course), developed in 2022 by Li et al.

    One of the central novelties in BLIP is… synthetic data. Yes, you heard that right: the large datasets of photos and their captions that one can download off the Web turn out not to be enough, not because they are too small (the Web is huge) but because they are too noisy. In many cases, even a properly downloaded caption is simply not informative about the image.

    Therefore, BLIP authors used an automatic captioning model to generate synthetic captions. But you don’t want to just throw away all of your human annotations! Moreover, sometimes the synthetic caption is clearly better, but sometimes the human annotation is much more specific; in the illustration below, T_w is the original human annotation and T_s is the synthetic one:

    Thus, BLIP trains a separate filtering model to distinguish between good and bad captions. Here is how it might work:

    Overall, the filtering process leads to a dataset with several parts, some of them human-annotated and some synthetic, with filtering used to choose the best version in every case:

    Apart from the data, BLIP also extends the model itself. CLIP had a text encoder and an image encoder, and used contrastive pretraining only. BLIP expands on this with multitask pretraining over four components:

    • a ViT for images, similar to CLIP, with its output used in three different ways;
    • a Transformer-based text encoder trained with the same image-text contrastive loss (ITC in the image below) with the ViT image embeddings;
    • an image-grounded text encoder, implemented as a Transformer encoder with cross-attention layers that receive image embeddings as input; here the objective is again to classify correct vs. incorrect text-image pairs, but as a regular classification rather than a contrastive loss;
    • finally, a separate Transformer text decoder is trained to generate text captions for images, with a language modeling loss that teaches it to produce correct captions for images whose embeddings are fed into its cross-attention layers.

    Here is an illustration from the BLIP paper:

    Retrieval in the Latent Space: Text-Video Retrieval

    So how do we use all of these models for retrieval? The basic idea is very simple: once you have good multimodal embeddings, you can map the query to the same space and find nearest neighbors. Something like this:
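    In its simplest form, this is just cosine similarity plus a top-k search over precomputed item embeddings; here is a sketch, with encode_text again a hypothetical stand-in for a CLIP- or BLIP-style text encoder:

        import torch.nn.functional as F

        def retrieve(query, item_embeddings, encode_text, top_k=5):
            """Text-to-X retrieval: return indices of the top_k items (images, video clips, ...)
            closest to the query in the shared latent space. item_embeddings: [N, d], precomputed."""
            q = F.normalize(encode_text([query]), dim=-1)            # [1, d]
            items = F.normalize(item_embeddings, dim=-1)             # [N, d]
            scores = (q @ items.t()).squeeze(0)                      # cosine similarity to every item
            return scores.topk(top_k).indices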

    But this is only the very first inkling of an idea, and it needs a lot of fleshing out to become real. In this post, I cannot hope to review the entire field of multimodal retrieval, so to give a relevant and practical example let us work through some of the models for text-video retrieval, i.e., searching for videos by text prompts.

    As a first example, now probably only of historical interest, let’s consider the S2VT model, originally developed for video captioning (producing text descriptions for videos) but also usable for retrieval; this is a common trend for many models that simply map everything into a common latent space. Here is what S2VT looks like:

    This is the archetypal “early deep learning” approach, similar to, e.g., “Show and Tell”: you have a recurrent network for text and a convolutional network for video frames, they extract features and map everything into a common latent space.

    Another important trend that started quite early is considering hierarchical representations for both text and video. Both modalities are hierarchical in nature: a (detailed) text caption can be broken down into paragraphs, and the latter into sentences, while a video naturally consists of scenes and/or frames, and one can find objects on these frames.

    An early example of this approach was shown by Zhang et al. (2018). Their hierarchical sequence embedding (HSE) model includes separate intermediate loss functions that align sentence-level embeddings for text with clip-level embeddings for videos:

    But the whole field changed when Transformers were introduced, or, to be more precise, when Transformers were applied to images in Vision Transformers. Let’s see how!

    Transformer-Based Text-Video Retrieval

    How can Transformers help retrieval? First, there is the direct approach: we have CLIP that maps text and images into the same space; let’s just extend CLIP to videos by representing them as a sequence of frames. This simple idea has been implemented in one of the first but already quite strong modern baselines for video retrieval, the CLIP4Clip model (Luo et al., 2022).

    The only question here is how to break a video down into frames. Naturally, we don’t need all frames; in fact, CLIP4Clip and similar models usually sample just a few, say 4 or 8, and almost none of them try anything fancy to find representative frames; it is usually just uniform sampling (in my opinion, a natural place for potential improvement). After sampling, we still have a sequence of frame embeddings (albeit a short one), and one can unite these embeddings in different ways. CLIP4Clip studies several such possibilities:
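    The simplest of these possibilities is parameter-free mean pooling over frame embeddings; here is a sketch of that pipeline, with encode_image standing in for a CLIP-style image encoder (a hypothetical name):

        import torch

        def video_embedding(frames, encode_image, num_samples=8):
            """CLIP4Clip-flavored sketch: sample a few frames uniformly, embed each one
            with the image encoder, and aggregate by mean pooling."""
            idx = torch.linspace(0, len(frames) - 1, num_samples).long().tolist()   # uniform sampling
            frame_emb = torch.stack([encode_image(frames[i]) for i in idx])         # [num_samples, d]
            return frame_emb.mean(dim=0)                                            # pooled video embedding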

    And with that, we are basically at the state of the art level. It only remains to combine all of the ideas we have already discussed.

    LiteVL (Chen et al., 2022) does approximately the same thing but replaces CLIP with BLIP, which we also discussed above. The main novel idea here is to use additional temporal attention modules and text-dependent pooling that make it possible to adapt a pretrained image-text BLIP to video-language tasks. As a result, it has more loss functions:

    DRLTVR (Wang et al., 2022), where DRL stands for “disentangled representation learning”, is notable for its very detailed study of different forms of cross-modal interaction in text-video retrieval. The authors consider six different ways to combine text and video representations into a relevance score for retrieval and propose two important new ideas. First, a more fine-grained cross-modal interaction mechanism based on (possibly weighted) token-wise interactions, i.e., basically a cross-attention matrix between sentence tokens and video frame tokens:

    Second, a channel decorrelation regularization mechanism that minimizes the redundancy in learned representation vectors and thus helps to learn a hierarchical representation:

    And finally, everything comes together in the very recently published Tencent Text-Video Retrieval (Jiang et al., 2022). It has a hierarchical representation structure with frame-word, clip-phrase, and video-sentence alignments:

    Combined with a few more tricks related to adaptive label denoising and marginal sample enhancement (choosing the hardest text sample for a video), this has allowed Tencent Text-Video Retrieval to produce state of the art results.

    I also want to note that this improvement in text-video state of the art is far from squeezing the last 0.1% out of beaten datasets. For example, let us consider the Recall@1 metric on the classical MSRVTT-7K dataset, that is, how often in its test set the model retrieves a correct result at the very top:

    • a very simple zero-shot baseline in ActBERT yields a mere 8.6%;
    • good classical models such as HTM achieve about 15% Recall@1;
    • CLIP4Clip jumps up to over 40%, with its best version reaching 42.1%;
    • the best variation of LiteVL achieves 48.9%;
    • the best variation of DRLTVR has Recall@1 of 53.3%;
    • and finally, Tencent Text-Video Retrieval sits at the top with 62.9%.

    Even the most recent improvements are huge, and there is still a lot of room for improvement!

    Conclusion

    Today, our main intention has been to discuss multimodal encoders such as CLIP and BLIP that map different objects—mostly text and images—into the same latent space. We then took a detour into text-video retrieval as a very practical sample task where such models are used almost directly: applying CLIP with a few reasonable tricks has led to huge improvements.

    Next time, we will consider another key component of modern generative AI: diffusion-based models. We will discuss their main ideas and some of the underlying math (but definitely not the whole thing!), and then it will be just one more step to Stable Diffusion and its kin.

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • How DALL-E Creates AI-Generated Art: Generative AI III

    How DALL-E Creates AI-Generated Art: Generative AI III

    Today, we continue our discussion of generative AI, a direction that keeps transforming many different industries. Last time, we reviewed the difference between continuous and discrete latent spaces, and how the VQ-VAE architecture (based on variational autoencoders that we discussed before) manages to learn a discrete latent space, a codebook in the continuous latent space. Today, we will put this idea into further practice with our first real text-to-image model, OpenAI’s DALL-E.

    General Structure of DALL-E

    In previous posts, we have discussed the main ideas that, taken together, have led to OpenAI’s DALL-E, the first text-image model that actually impressed everyone not only in the AI community but in the wider world. DALL-E put image generation by text prompts on the map of the world’s media, and I would say that the current hype wave of generative AI models for images started in earnest with DALL-E (although current models are, of course, much better than the original DALL-E). But what is it, exactly, and how does it work?

    Let us begin with the general structure of DALL-E. We almost know all of the components from previous posts, so we can start with the big picture:

    The main idea is to train a Transformer-based model to generate tokens that comprise the latent code of a discrete VAE such as the one we discussed in the previous post. Discrete latent spaces converge here with the Transformers’ main forte: learning to continue sequences of tokens. 

    We only need to train a GPT-like model to generate the latent code as a sequence of special tokens that would continue a text description: “Cat playing chess in the British countryside [IMG] #c100 #c089 #c004 …”. Then we can run it with a text query followed by the special token “[IMG]”, and supposedly it will produce a sequence of latent codes for the discrete VAE. Naturally, this will require us to retrain (or fine-tune) a text Transformer on (image, text) pairs encoded in this way.
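    Here is a sketch of how such a combined sequence could be assembled for training; the tokenizer, the discrete VAE encoder, and the separator id are all hypothetical stand-ins rather than the actual DALL-E code:

        def make_training_sequence(text, image, text_tokenizer, dvae_encoder, img_sep_id=50257):
            """Turn a (text, image) pair into one token sequence for an autoregressive Transformer:
            BPE text tokens, a special [IMG] separator id, then the discrete VAE codes of the image."""
            text_ids = text_tokenizer.encode(text)              # BPE ids for the caption
            image_ids = dvae_encoder.encode(image)              # e.g., a 32x32 grid of codebook ids
            return text_ids + [img_sep_id] + image_ids.flatten().tolist()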

    Formally speaking, DALL-E is a generative model that needs to learn the joint distribution

        \[p_{\mathbf{\theta},\mathbf{\psi}}(\mathbf{x},\mathbf{y},\mathbf{z})= p_{\mathbf{\theta}}(\mathbf{x} | \mathbf{y},\mathbf{z})p_{\mathbf{\psi}}(\mathbf{y},\mathbf{z}),\]

    where \mathbf{x} is an image, \mathbf{y} is the corresponding text description, and \mathbf{z} is the image’s latent code. The Transformer learns to generate \mathbf{z} from \mathbf{y} (actually, learns the entire p(\mathbf{y},\mathbf{z}) since it inevitably becomes a generative model for text as well), and the result is used by the discrete VAE, so actually we assume that p_{\mathbf{\theta}}(\mathbf{x} | \mathbf{y},\mathbf{z})=p_{\mathbf{\theta}}(\mathbf{x} | \mathbf{z}).

    From the mathematical point of view, DALL-E actually optimizes a huge variational lower bound

        \[\log p_{\mathbf{\theta},\mathbf{\psi}}(\mathbf{x},\mathbf{y}) \ge \mathbb{E}_{\mathbf{z}\sim q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x})}\left[ \log p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{y},\mathbf{z}) - \beta\mathrm{KL}\left(q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x})\| p_{\mathbf{\psi}}(\mathbf{y},\mathbf{z})\right)\right],\]

    where the distributions in this formula correspond to different parts of the model:

    • q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x}) is the distribution of latent codes produced by the discrete VAE’s encoder from an image \mathbf{x}; \mathbf{\phi} here denotes the parameters of the discrete VAE’s encoder;
    • p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{y},\mathbf{z}) is the distribution of images generated by the discrete VAE’s decoder from a latent code \mathbf{z} and text description \mathbf{y}; again, here we assume that p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{y},\mathbf{z})=p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{z}); \mathbf{\theta} stands for the parameters of the discrete VAE’s decoder;
    • p_{\mathbf{\psi}}(\mathbf{y},\mathbf{z}) is the joint distribution of texts and latent codes modeled by the Transformer; here \mathbf{\psi} denotes the Transformer’s parameters.

    We will not go into the details of variational inference and explain the inequality shown above in full; this is a very important topic in machine learning but we do not have the space to do it justice here. After the derivation, though, it all boils down to a very understandable iterative process:

    • first, we maximize the bound with respect to \mathbf{\phi} and \mathbf{\theta}, that is, train the discrete VAE with a dataset of images; the texts are not used here, we assume that p_{\mathbf{\psi}}(\mathbf{y},\mathbf{z}) is uniform and relax q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x}) via the Gumbel-Softmax trick;
    • then we fix \mathbf{\phi} and \mathbf{\theta} (fix the discrete VAE) and learn \mathbf{\psi}, i.e., train the Transformer to jointly model both text (in BPE encoding) and image codes \mathbf{z}.

    At this point, we are done with the general structure of DALL-E. But, alas, to get the full picture we need to return to discrete variational autoencoders because DALL-E uses a slightly different breed of those than VQ-VAE and VQ-GAN we previously discussed.

    Discrete VAE with the Gumbel-Softmax trick

    We have seen two different discrete VAEs in the previous post: VQ-VAE that introduced the idea of discrete latent codes and VQ-GAN that added a patch-based discriminator to further improve things. But both of them had a middle part that feels pretty ad hoc to me, and hopefully to you as well by now: to move gradients through the discrete latent space they had to go around the discrete part with a copy-gradient operation.

    Discrete VAE used in DALL-E takes the next step: instead of outputting a latent vector that is then “rounded” to a codebook vector, it outputs a whole probability distribution over the codebook, probabilities for a “die” that then can be rolled to determine the actual vector:

    This is exactly parallel to the idea used in VAEs: we output a distribution in the latent space and thus obtain additional regularization and make the resulting model better.

    So now instead of the VQ-VAE problem—how do we put gradients through taking nearest neighbors—we have a different problem: how do we put gradients through rolling a die? Fortunately, we already have a hint: we solved the exact same problem for Gaussians with the reparametrization trick in regular VAEs! The trick was to generate a random sample from a standard Gaussian distribution and then apply a deterministic (and differentiable) linear transformation to change it into a sample from the needed Gaussian.

    The distribution is different now, but the trick is the same. We need to first sample from a fixed distribution and then apply a transformation to get the die roll with given probabilities. The fixed distribution in question is actually quite interesting: it is the Gumbel distribution whose density and cumulative distribution function are defined as

        \[p(g_i) = e^{-\left(g_i + e^{-g_i}\right)},\qquad F(g_i) = e^{-e^{-g_i}}.\]

    In statistics, the Gumbel distribution appears as the distribution of the maximum (or minimum) of several samples, but, to be honest, I have never encountered the Gumbel distribution in any practical context other than this reparametrization trick.

    Anyway, the important part is that once you have sampled g_i from the Gumbel distribution defined above, you can get a sample from a discrete distribution with probabilities \pi_i (the result of a die roll) as

        \[z = \arg\,\max_i\left(g_i + \log \pi_i\right).\]

    The proof of this fact, known as the Gumbel-Max trick, is a straightforward but somewhat tedious calculation, so I’ll skip it or, to put it in a slightly more stylish way, leave it as an exercise for the reader.

    Once we have the Gumbel-Max trick, though, we are not quite done. We have gone from sampling to argmax, but it’s still not quite what we need. The argmax operation is also not good for passing gradients since it is piecewise constant; in fact, in VQ-VAE we had exactly the same problem, with argmin for nearest neighbors, and had to resort to copying the gradients.

    This time, though, we don’t have to. Since the argmax here corresponds to die rolling, it makes perfect sense to relax it to softmax:

        \[y_i = \mathrm{softmax}\left(\frac{1}{\tau}\left(g_i + \log\pi_i\right)\right) = \frac{e^{\frac{1}{\tau}\left(g_i + \log\pi_i\right)}}{\sum_je^{\frac{1}{\tau}\left(g_j + \log\pi_j\right)}}.\]

    For \tau\to 0 this tends to a discrete distribution with probabilities \pi_i, and during training we can gradually reduce the temperature τ. Note that now the result is not a single codebook vector but a linear combination of codebook vectors with weights y_i.
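    Here is a sketch of this sampling procedure; with hard=True it reduces to the (non-differentiable) Gumbel-Max trick, while hard=False gives the differentiable relaxation used during training:

        import torch
        import torch.nn.functional as F

        def gumbel_softmax_sample(logits, tau=1.0, hard=False):
            """Relaxed 'die roll' over codebook entries; logits play the role of log pi_i."""
            u = torch.rand_like(logits)
            gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)   # samples from the Gumbel distribution
            y = F.softmax((logits + gumbel) / tau, dim=-1)       # weights y_i over the codebook
            if hard:
                index = y.argmax(dim=-1, keepdim=True)           # Gumbel-Max: a one-hot die roll
                return torch.zeros_like(y).scatter_(-1, index, 1.0)
            return y

        # the latent "vector" is then a combination of codebook vectors: z = y @ codebook, codebook: [K, d]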

    Overall, we have the following scheme in our discrete VAE:

    And with that, we are done with DALL-E! It remains to see how well it works.

    DALL-E: Results and Reception

    DALL-E debuted at the very beginning of 2021. This was a perfect New Year’s present for all kinds of AI-related media because DALL-E was indeed a big step forward from what researchers had been able to do before. Images from the DALL-E OpenAI post and paper were hyped all across the Web; images like this one:

    Or this one:

    It already looked like these images could be useful in practice, and discussions about “replacing the illustrators” began. DALL-E was also able to use image prompts (parts of the resulting image that should be preserved) that could define the style and overall feel of the result.

    DALL-E seemed to have a rather deep understanding of our reality that it could put into pictures. For example, the next illustration shows several image prompts and a text prompt that asks DALL-E to show how telephones looked at different stages of their development:

    Although the quality of the images themselves may be underwhelming for those who have already seen Stable Diffusion and Midjourney, it was really head and shoulders above anything other available solutions could produce, and it was quite a shocking piece of news for many AI researchers, including yours truly.

    It was clear that it would be only a matter of time before DALL-E would be scaled up to high-definition images (the original DALL-E produced 256×256 results) and made even more “understanding” of reality with larger Transformer-based text models. That is indeed what happened, and the world we live in today is being increasingly transformed by both large language models and large image generation models.

    Still, many new ideas have appeared along the way, and we cannot say that DALL-E 2 is just “DALL-E with more layers”. That’s why our series of posts is far from over, and modern generative AI has a lot more to teach us.

    Conclusion

    Today, we have discussed DALL-E, a model released in January 2021. A mere two years have passed, but it looks like DALL-E is already hopelessly outdated. New models that visibly advance the state of the art in image generation appear every few months, and the pace of this advancement shows no signs of slowing down. Don’t worry though: the ideas behind DALL-E are still sound and useful, and explaining the ideas, the how rather than the what, has been my primary ambition in this series.

    However, to get to the current state of the art we need more ideas. So next time, we will take a brief detour from generation and talk about models that produce multimodal latent spaces, such as OpenAI’s CLIP (Contrastive Language-Image Pre-Training) and its successors. They are extremely useful for, e.g., multimodal retrieval (searching for images and videos), but they also serve as the basis for further generative models. Until next time!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Discrete Latent Spaces: Generative AI II

    Discrete Latent Spaces: Generative AI II

    Last time, we discussed one of the models that have made modern generative AI possible: variational autoencoders (VAE). We reviewed the structure and basic assumptions of a VAE, and by now we understand how a VAE makes the latent space more regular by using distributions instead of single points. However, the variations of VAE most often used in modern generative models are a little different: they use discrete latent spaces with a fixed vocabulary of vectors. Let’s see what that means and how it can help generation!

    Continuous and Discrete Latent Spaces

    We have already discussed latent spaces in both the introductory post and the post on variational autoencoders, but this time we have a slightly different spin. In general, an autoencoder compresses the high-dimensional input (say, an image) into a low-dimensional representation, i.e., into a relatively short vector of numbers whose dimension is in the hundreds rather than millions:

    If the autoencoder is designed well, this may result in a latent space where certain directions correspond to specific properties of the image. For instance, if we are compressing cat images then one axis may correspond to the cat’s color and another to the overall style of the image:

    Naturally, in reality these directions would not necessarily coincide with coordinate axes and may be hard to find. There’s no preference for a regular autoencoder architecture (say, a VAE) to find a latent space with well-defined directions. In fact, it is easy to see that the latent space may undergo rather complicated transformations with no change in the model complexity: there is usually no difference between learning an encoder Enc and decoder Dec and learning an encoder f∘Enc and decoder Dec∘f^{-1} for some invertible transformation f.

    This is an appealing picture, but it’s not as easy to obtain, and, moreover, it’s not really how we think about styles and picture descriptions. We are verbal creatures, and when I want to get a picture of a black cat I don’t have a real number associated with its “blackness”, I just want a cat with a discrete “black” modifier, just like I might want a “white” modifier. A black and white cat for me is not a number that reflects the percentage of white hair but most probably just a separate “black and white” modifier that turns out to be in a rather complex relationship with the “black” and “white” modifiers.

    Can we try to reflect this intuition in an autoencoder latent space? We could imagine a latent space that has a vocabulary of “words” and decodes combinations of these words into images. Something like this:

    This looks much more “human-like”, and the last few years of generative AI have indeed proven this approach to be significantly more fruitful. Its best feature is the ability to use autoregressive generative models for discrete latent representations. For example, the famous Transformers, in particular the GPT family, can be applied to produce latent space “words” just as well as they produce real words in their natural language applications, but they would be much harder to adapt to components of continuous latent vectors.

    But the discrete latent space approach comes with its own set of problems, both technical and conceptual. In the rest of this post, we will go through two models that successfully solved these problems and thus became foundational for modern generative AI.

    VQ-VAE: Vector Quantized VAE

    The first model that successfully managed to construct a discrete latent space at a scale sufficient for general-purpose images was Vector Quantized VAE (VQ-VAE) introduced back in 2017 by DeepMind researchers (van den Oord et al., 2017). Its basic idea is exactly as we have discussed: VQ-VAE finds a finite vocabulary (codebook) and encodes images as fixed sets (tensors) of discrete codes:

    It turns out that it’s not a good idea to make the encoder do actual classification over the codebook vectors. Therefore, here’s what we want to happen:

    • the encoder, as usual, takes an image x as input and produces a set of latent vectors Enc(x); a slight difference with our previous settings is that now the encoder produces a whole set of vectors (usually formalized as a three-dimensional tensor, i.e., a matrix of vectors), but mathematically it’s equivalent to slicing a single output vector;
    • for every latent vector, we find the nearest codebook vector in the latent space and replace it with this codebook vector; the resulting code consists only of codebook vectors;
    • the decoder receives as input the tensor of vectors, with the same dimensions as the encoder had produced, but actually the latent code is now discrete: while each component of the latent code is still a continuous vector of real numbers, there’s now only a finite number of possibilities for each of the vectors.

    Here’s an illustration (I only show how one vector is chosen but the procedure is the same for each of them):

    At this point, some readers might be wondering: there’s now only a finite number of latent codes in total! There is none of the boundless generation characteristic of natural languages, where we can have texts as long as we like. Won’t that severely limit the models? Well, a realistic size of the latent code tensor is something like 32×32 with, say, 8192 codebook vectors (the numbers are taken from the original DALL-E model). There are two ways to look at these numbers. On the one hand, this amounts to 8192^(32×32) = 2^13312 possibilities, while the number of atoms in the Universe is less than 2^300, so it looks like we are covered. On the other hand, this is equivalent to compressing every possible image of size 256×256 (the dimensions of the original DALL-E) into 13312 bits, i.e., a bit more than one and a half kilobytes of data, which means that we will need quite a compressing tool. Both views are valid: modern autoencoder-based models are indeed very impressive in their ability to compress images into latent representations, but the diversity of their outputs does not bound our imagination too much.
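    For the record, here is the back-of-the-envelope arithmetic behind these numbers:

        import math

        tokens = 32 * 32                   # size of the latent code grid
        vocab = 8192                       # codebook size; 8192 = 2**13
        codes = vocab ** tokens            # number of distinct latent codes, 2**13312
        bits = tokens * math.log2(vocab)   # 13312 bits per image...
        print(bits / 8 / 1024)             # ...about 1.6 kilobytes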

    There are two questions remaining. First, how do we train the encoder and decoder networks? It looks like we have the same problem as VAE had: just like VAE had sampling in the middle, VQ-VAE has a piecewise constant operation (taking the nearest neighbor), and the gradients cannot flow back through this operation. And second, how do we learn the codebook vectors?

    At this point, I will show a picture from the original VQ-VAE paper; it always comes up in these discussions, and we need its notation to discuss the VQ-VAE objective, so you need to see it too:

    This picture mostly illustrates the idea of a discrete latent space with codebook vectors that we have already discussed. But it also shows the solution to the first problem: VQ-VAE simply copies the gradients (red line) from the decoder to the encoder, that is, the gradient of the loss function with respect to the tensor of codebook vectors is assumed to be its gradient with respect to Enc(x). This is an approximation, of course, and one can remove it with a more involved construction (a discrete VAE with the Gumbel-Softmax trick that we will explain in a later post on DALL-E), but for now it will have to do.

    As for the second problem, it brings us to the VQ-VAE training objective. Here is the loss function as defined by van den Oord et al. (2017):

        \[\mathcal{L} = -\log p(\mathbf{x}|\mathbf{z}_q(\mathbf{x})) + \left\|\mathrm{sg}\left[\mathbf{z}_e(\mathbf{x})\right]-\mathbf{e}\right\|_2^2 + \beta\left\|\mathbf{z}_e(\mathbf{x}) - \mathrm{sg}\left[\mathbf{e}\right]\right\|_2^2.\]

    This formula sure begs for a detailed explanation. Let’s first go through all the notation step by step and then summarize:

    • \mathbf{z}_e(\mathbf{x}) and \mathbf{z}_q(\mathbf{x}) are two latent representations for an image \mathbf{x} produced by the encoder: \mathbf{z}_e is the raw output of the encoder and \mathbf{z}_q is the codebook representation after replacing each vector with its nearest codebook neighbor (this notation is illustrated in the image above); the first term is responsible for training the decoder network;
    • p(\cdot|\mathbf{z}) is the distribution of reconstructed images after the decoder given the latent code; we want the reconstruction to be good so we maximize the likelihood of the original image \mathbf{x} given its latent code \mathbf{z}_q(\mathbf{x}) that serves as input for the decoder;
    • sg[·] is the stopgradient operator; it is defined as the identity during the forward pass (when we compute the objective function \mathcal{L}) and zero during the backward pass (when we compute the gradient \nabla_{\mathbf{w}}\mathcal{L});
    • therefore, the second term means that we want to bring each codebook vector \mathbf{e} closer to the latent codes \mathbf{z}_e(\mathbf{x}) that choose it as their nearest neighbor; this term is responsible for training the codebook;
    • the third term is the opposite: it brings \mathbf{z}_e(\mathbf{x}) closer to their corresponding codebook vectors; in effect, the second and third terms together do a kind of clustering of latent codes \mathbf{z}_e(\mathbf{x}) around their corresponding codebook vectors; the hyperparameter \beta balances the two terms, although the authors say that the results don’t change much for \beta anywhere from 0.1 to 2.0;
    • finally, the encoder network is trained with the first and third terms where it occurs in the form of \mathbf{z}_e(\mathbf{x}).

    In the illustration below, I show the components of the objective function and what their contributions are in the latent space (on top) and on learning the weights of the encoder and decoder networks (at the bottom):

    Interestingly, the original paper has a typo in its main formula repeated in countless blog post explanations: the authors forgot the minus sign in front of the likelihood so in their training objective the first term should be maximized and the other two minimized. Naturally, it’s just a typo, and all working VQ-VAE implementations get it right, but it’s funny how these things can get propagated.
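    To make the objective concrete, here is a minimal PyTorch-style sketch of the quantization step and the (sign-corrected) loss; encoder, decoder, and codebook are hypothetical stand-ins, and mean squared error stands in for the negative log-likelihood term:

        import torch
        import torch.nn.functional as F

        def vq_vae_loss(x, encoder, decoder, codebook, beta=0.25):
            """codebook: [K, d] learnable matrix; encoder output treated as a batch of d-dim vectors."""
            z_e = encoder(x)                                   # continuous encoder outputs, [B, d]
            idx = torch.cdist(z_e, codebook).argmin(dim=-1)    # nearest codebook vector for each z_e
            z_q = codebook[idx]                                # quantized latent code
            z_q_st = z_e + (z_q - z_e).detach()                # copy-gradient (straight-through) trick
            x_rec = decoder(z_q_st)
            rec_loss = F.mse_loss(x_rec, x)                    # stands in for -log p(x | z_q(x))
            codebook_loss = F.mse_loss(z_q, z_e.detach())      # ||sg[z_e(x)] - e||^2
            commitment_loss = F.mse_loss(z_e, z_q.detach())    # beta * ||z_e(x) - sg[e]||^2
            return rec_loss + codebook_loss + beta * commitment_loss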

    That’s it for VQ-VAE. The original model predated Transformers, so it could not use them for latent code generation; instead, it used a different autoregressive model which was state of the art at the time: PixelCNN (van den Oord et al., 2016; Salimans et al., 2017). PixelCNN itself originated as a model for generating pictures, but generating a high-resolution image autoregressively, pixel by pixel, is just way too slow (see also my first post in this series). But it’s just fine for generating a 32×32 grid of codebook tokens! The original VQ-VAE, trained on ImageNet with a separate PixelCNN trained to generate latent codes, produced impressive results by 2017 standards:

    The next step was VQ-VAE 2 that still used PixelCNN for latent codes but moved to a hierarchical structure, generating a small top-level representation and then a more detailed bottom-level representation conditioned on the top level result:

    VQ-VAE 2 produced excellent results. When it came out in 2019, in the wake of ProGAN (you may have heard of it as “This person does not exist”), everybody was comparing generation abilities on a dataset of high-resolution celebrity photos, and VQ-VAE 2 did not disappoint:

    But we still have some way to go before DALL-E 2 and Stable Diffusion, even in terms of the underlying autoencoders. The next step for VQ-VAE was to turn it into a GAN…

    VQ-GAN: Add a Discriminator to the Mix

    VQ-VAE and VQ-VAE 2 left us with some very good generation via discrete latent codes but the codes were still produced by a PixelCNN model. Naturally, we’d like to generate these codes with a Transformer-based architecture, at least because it’s much better at handling global dependencies: a Transformer does not even have the notion of a “long” or “short” dependency, it always attends to every previously generated token.

    It was only natural that the next step would be to use a Transformer to generate the codes. So in the autoencoder part, we would have something similar to VQ-VAE, and then the Transformer would serve as the autoregressive model to generate the codes:

    So in this approach, an image becomes a sequence of codebook vectors, and the Transformer does what it does best: learns to generate sequences. 

    One of the problems here is that we need to learn a very rich and expressive codebook. So instead of using just a straightforward pixel-wise reconstruction loss, VQ-GAN (Esser et al., 2020) adds a patch-based discriminator that aims to distinguish between (small patches of) real and reconstructed images, and the reconstruction loss becomes a perceptual loss, i.e., a difference between features extracted by some standard convolutional network (Zhang et al., 2018). This means that the discriminator now takes care of the local structure of the generated image, and the perceptual loss deals with the actual content.

    In total, the losses for our autoencoder might look something like this:

    And with this, we are ready to see the overview of the whole architecture as it is shown in the original VQ-GAN paper (Esser et al., 2020):

    Just like a regular VQ-VAE, an image is represented with a sequence of discrete codebook vectors, but now the reconstruction is ensured by a combination of perceptual and adversarial losses, and the codes are produced by a Transformer.

    VQ-GAN could produce better images on standard ImageNet—here are some first-rate goldfinches compared to other approaches:

    But a major point about VQ-GAN was that it could scale to far higher resolutions. Here is a sample landscape (originally 1280×832 pixels) generated by the VQ-GAN from a semantic layout, i.e., from a rough segmentation map showing where the sky, land, mountains, and grass should be:

    As a result, VQ-GAN, like virtually every method we discuss, defined the state of the art for image generation when it was introduced. We have to stop here for now, but our story is far from over…

    Conclusion

    In this post, we have discussed the notion of a discrete latent space, where images are compressed to sequences of tokens (“words”) instead of continuous vectors. This makes it far easier to train a good generative model since generating sequences is the bread and butter of many autoregressive models. The original VQ-VAE family used PixelCNN as this intermediate autoregressive model, but as soon as Transformers appeared it became clear that they are a great fit for this task, and VQ-GAN managed to make it work.

    At this point, we are ready to put several things together and discuss not just an image generation/reconstruction model but a real text-to-image model, where (spoiler alert) the Transformer will generate a sequence of discrete latent space tokens starting with a natural language text prompt. So next time, get ready for DALL-E!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Variational Autoencoders (VAEs): Generative AI I

    Variational Autoencoders (VAEs): Generative AI I

    It might seem like generative models are going through new phases every couple of years: we heard about Transformers, then flow-based models were all the rage, then diffusion-based models… But in fact, new ideas build on top of older ones. Following our overview post, today we start an in-depth dive into generative AI. We consider the variational autoencoder (VAE), an idea introduced in 2013, if not earlier, but still very relevant and still underlying state of the art generative models such as Stable Diffusion. We will not consider all the gory mathematical details but I hope to explain the necessary intuition.

    VAE Intuition: Handwaving in the Latent Space

    We already covered this basic idea in the overview post but let me reintroduce the problem and move on to a more detailed discussion of the VAE intuition. We discussed that a basic autoencoder can learn to compress and decompress images pretty well with a simple high-level architecture like this:

    However, this is not enough to get you a generative model because the structure of latent codes will still be too complicated. You have a very complex manifold of images in a huuuuuge space with dimensions in the millions, and while the latent space has a much smaller dimension, probably in the hundreds or low thousands, the latent codes will still have a complicated structure in that space:

    So if you try to sample latent codes from a simple distribution you will almost certainly fail, that is, your samples will fall outside the manifold of latent codes, and the decoder will fail to produce anything meaningful, let alone beautiful:

    A variational autoencoder tries to fix this problem by making each point “wider”. Instead of a single latent vector z, now each input x is mapped to a whole distribution:

    The intuition is that by making the decoder work with z’s sampled from whole distributions, we force it to be robust to small changes in z. Ultimately, we want to cover a whole chunk of the latent space with points that have “reasonable” decodings, so that afterwards we can sample from a simple distribution and still get good results:

    This idea, however, meets with two difficulties. First, when we begin to train an autoencoder, it will be beneficial for it to make the intermediate distributions as “small” (with low variance) as possible: if you are always very close to the central point the decoder’s job becomes easier, and reconstructions probably improve. In a similar vein, the distributions may begin to drift off from each other in the latent space, again making the decoder’s job easier as it now has more slack in distinguishing between different inputs. So unless we do something about it, the training process will look something like this, tending to a regular autoencoder that we know to be of little value for us:

    To alleviate this problem, we need to impose some kind of a constraint on what’s happening with the intermediate distributions. In machine learning, hard constraints rarely appear, they usually take the form of regularizers, i.e., additions to the loss function that express what we want. In this case, we want to keep the distributions for each input x relatively “large” and we want to keep them together in relatively close proximity, so we probably will kill two birds with one stone if we make the distribution p_x(z | μ_x, σ_x) closer to a standard Gaussian. Our overall loss function now becomes a sum of the reconstruction loss and this regularizer:

    Still, the question remains: how do we regularize? We want p_x(z | μ_x, σ_x) to be close to a standard Gaussian distribution, but there are several plausible ways to do that: the Kullback-Leibler divergence can cut both ways, either KL(p||q) or KL(q||p), and then there are combinations like the Jensen-Shannon divergence… What would be the best and conceptually correct way to define L_reg?

    The second problem is more technical: the picture above has the latent code z sampled from a distribution p_x(z | μ_x, σ_x). This is fine during inference, when we want to apply the already trained encoder and decoder. But how do we train? Gradients cannot go through a “sampling layer”.

    Let us begin with the first problem; solving it will also give us a nice probabilistic interpretation of what’s going on in VAE and explain why it is called a variational autoencoder.

    VAE Intuition: Probabilistic Handwaving

    Let us consider a different way to look at the same structure that leads to different insights and ultimately will help us understand the mathematical ideas behind variational autoencoders. We will start from scratch: suppose that we want to train an encoder to produce latent codes z from images x and a decoder to go back from z to x.

    We begin with a very simple formula; I promised as little math as possible but, to be honest, there will be a little more than that below:

        \[p(\mathbf{z})p(\mathbf{x}|\mathbf{z})=p(\mathbf{x},\mathbf{z})=p(\mathbf{x})p(\mathbf{z}|\mathbf{x})\]

    This is basically the Bayes formula in its simplest form: it says that the joint distribution of images and their latent codes can be decomposed in two different ways, starting either with p(x) or with p(z) and multiplying it by the corresponding conditional distribution.

    We already understand, at least generally, all parts of this formula: p(x) is the distribution of images, p(z) is the distribution of latent codes, i.e., the simple distribution we want to be able to sample from (most likely a standard Gaussian), and the other two distributions are what we need to find, the encoder distribution p(z|x) and the decoder distribution p(x|z):

    If we want to get a generative model, our main goal is to learn both p(x|z) and p(z|x). But here is the thing: in a generative model p(z) is by design simple since we need to be able to sample from it, while p(x) is, in any model, unimaginably complex since this is the distribution of real objects (images). So we cannot have both the encoder distribution p(z|x) and the decoder distribution p(x|z) be simple: if, say, they both were Gaussian we’d have a Gaussian on the left but definitely something much more complicated on the right-hand side of the equation.

    We need to pick one:

    • either assume that p(x|z) is simple and then try to find a complex p(z|x);
    • or vice versa, assume that p(z|x) is simple and find a complex p(x|z).

    Variational autoencoders take the first option: we will assume that p(x|z) = N(x | f(z), cI) is a Gaussian distribution with mean f(z) = Decoder(z) and covariance matrix cI, i.e., the same constant variance c along every axis. Thus, on the left we have a simple Gaussian p(x|z) times a simple Gaussian p(z) = N(z | 0, I), that is, another Gaussian.

    What do we do on the right-hand side? We need to find a very complex distribution p(z | x). There are several different ways to do that, and variational autoencoders take the road of approximation: the encoder produces a simple distribution, actually again a Gaussian N(z | μ_x, σ_x), but this time we cannot say that this Gaussian is the real p(z | x), we have to say that it’s an approximation:

    The only thing left is how to find such an approximation. This is where the variational part comes in: variational approximations are how probability distributions are usually approximated in machine learning. 

    Variational approximations and the loss function in VAE

    I promised not to have too much math; I lied. But you already have the basic intuition so now you can safely skip to the very end of this section and still understand everything that goes afterwards. With that said, if you are not afraid to get your hands a little dirty let us still go through the inference.

    The idea of variational approximations is shown in the sequence of equations below. We start with an obvious identity, take the expectation over q(z) on both parts, and then do some transformations to break down the right-hand part into two terms, while the left-hand side does not depend on z, so the expectation simply disappears:

        \[\begin{aligned} p(\mathbf{x},\mathbf{z}) &= p(\mathbf{x})p(\mathbf{z}|\mathbf{x}),\\ \log p(\mathbf{x},\mathbf{z}) &= \log p(\mathbf{x}) + \log p(\mathbf{z}|\mathbf{x}),\\ \log p(\mathbf{x}) &= \log p(\mathbf{x},\mathbf{z}) - \log p(\mathbf{z}|\mathbf{x}),\\ \log p(\mathbf{x}) &= \mathbb{E}_{q(\mathbf{z})}\left[\log p(\mathbf{x},\mathbf{z}) - \log q(\mathbf{z}) + \log q(\mathbf{z}) - \log p(\mathbf{z}|\mathbf{x})\right],\\ \log p(\mathbf{x}) &= \int q(\mathbf{z})\log \frac{p(\mathbf{x},\mathbf{z})}{q(\mathbf{z})}\,\mathrm{d}\mathbf{z} + \int q(\mathbf{z})\log \frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}\,\mathrm{d}\mathbf{z}. \end{aligned}\]

    As a result, we have a constant (that is, something independent of q(z)) on the left and the sum of L(q) and the Kullback-Leibler divergence between q(z) and p(z|x) on the right, that is, a measure of how close these distributions are to each other:

    This means that we can approximate p(z|x) with q(z), i.e., minimize the divergence between them, by maximizing the first term L(q). But this first term is probably much easier to handle since it contains the joint distribution p(xz) and not the conditional distribution p(z|x). In particular, we can now decompose it in the other way:

        \[\begin{aligned}\mathcal{L}(q) &= \int q(\mathbf{z})\log \frac{p(\mathbf{x},\mathbf{z})}{q(\mathbf{z})}\,\mathrm{d}\mathbf{z} = \int q(\mathbf{z})\log \frac{p(\mathbf{z})p(\mathbf{x}|\mathbf{z})}{q(\mathbf{z})}\,\mathrm{d}\mathbf{z} \\&= \int q(\mathbf{z})\log p(\mathbf{x}|\mathbf{z})\,\mathrm{d}\mathbf{z} + \int q(\mathbf{z})\log \frac{p(\mathbf{z})}{q(\mathbf{z})}\,\mathrm{d}\mathbf{z} \\&= \int q(\mathbf{z})\log \mathcal{N}(\mathbf{x}| f(\mathbf{z}),c\mathbf{I})\,\mathrm{d}\mathbf{z} - \mathrm{KL}\left(q(\mathbf{z})\|p(\mathbf{z})\right) \\&= -\frac{1}{2c}\mathbb{E}_{q(\mathbf{z})}\left[ \left\|\mathbf{x} - f(\mathbf{z})\right\|^2\right] - \mathrm{KL}\left(q(\mathbf{z})\|p(\mathbf{z})\right).\end{aligned}\]

    And now we have arrived at exactly the two terms that we considered in the first “intuitive” part! We need to maximize L(q), so the first term wants to make f(z) as close as possible to x, and the second term wants to make q(z) as close as possible to p(z), that is, to the standard Gaussian. Overall, we minimize exactly the sum of two terms that we had at the end of the first section:

        \[\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda\mathcal{L}_{\mathrm{reg}}.\]

    Why did we need all that math if we arrived at the exact same conclusion? Mostly because we were not sure what the reconstruction loss and the regularizer should look like. Our intuition told us that we want q(z) to be “large” but how do we formalize it exactly? And which reconstruction loss should we use? Variational approximations answer all these questions in a conceptually sound way. Moreover, they even explain the meaning of the regularization coefficient λ: it turns out to be determined by the variance c of the decoder distribution (the reconstruction term comes with a factor of 1/2c). Not that it helps all that much, since we still need to choose c ourselves just like we needed to choose λ, but it’s always better to understand what’s going on.

    By now we are almost done. I will skip the exact calculation of the regularization term: it’s tedious but straightforward and does not contain new interesting ideas; basically, you can get a rather simple exact formula in terms of μ_x and σ_x.
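    For reference, here is that exact formula, a standard result for the KL divergence between a diagonal Gaussian and the standard one (stated here without derivation):

        \[\mathrm{KL}\left(\mathcal{N}(\mathbf{\mu}_{\mathbf{x}}, \mathrm{diag}(\mathbf{\sigma}_{\mathbf{x}}^2))\,\|\,\mathcal{N}(\mathbf{0},\mathbf{I})\right) = \frac{1}{2}\sum_{i}\left(\mu_{x,i}^2 + \sigma_{x,i}^2 - 1 - \log \sigma_{x,i}^2\right).\]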

    The only thing left is to handle the second problem: how do we train a model that has sampling in the middle?

    Reparametrization trick and the overall algorithm

    By now, we understand the nature of the loss function in the variational autoencoder and can go back to the sampling problem:

    Indeed, it is impossible to send the gradients back through the sampling process. Fortunately, we don’t need to.

    The reparametrization trick comes to the rescue. The idea of this trick (we will see other versions of it in subsequent posts) is to sample a random number first from some standard distribution and then transform it into the desired distribution. In the case of Gaussians the reparametrization trick is very simple: to get a vector z from N(z | μ_x, σ_x) with a diagonal covariance matrix we can first get u from N(u | 0, I), then multiply it componentwise by σ_x, and then add μ_x to the result. The picture above in this case looks like this:

    Now we can sample a mini-batch of vectors u for an input mini-batch of images and use them for training, never needing to run gradients through the sampling process.
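    In code, the trick is a one-liner; here is a sketch, with the surrounding training step shown only as hypothetical pseudocode in the comments:

        import torch

        def reparametrized_sample(mu, log_var):
            """Sample from N(mu, diag(sigma^2)) by transforming u ~ N(0, I);
            gradients flow through mu and log_var, not through the sampling itself."""
            u = torch.randn_like(mu)
            return mu + torch.exp(0.5 * log_var) * u

        # a typical VAE training step (all names hypothetical):
        # mu, log_var = encoder(x)
        # z = reparametrized_sample(mu, log_var)
        # loss = reconstruction_loss(decoder(z), x) + lam * kl_to_standard_normal(mu, log_var)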

    And that’s it! Now we have the complete picture of how a variational autoencoder works, what loss function it minimizes and why, and how this loss function is related to the basic intuition of VAEs.

    Conclusion: How Is VAE Still Relevant?

    In this post, we have discussed the idea and implementation of VAE, a model first introduced in 2013. But these days, you don’t hear much about VAEs in the news. It’s a nice idea but is it still relevant for generative AI today?

    As it turns out, VAEs are not only relevant but actually still represent one of the pillars on which the entire modern generative AI stands. Consider, for instance, the basic structure of the Stable Diffusion model (which has produced all cat images in this post):

    As you can see, the picture concentrates on the diffusion and denoising parts—as well it should since these are the novelties that differentiate this work from prior art. But note that all these novelties take place in the latent space of some kind of autoencoder for images, with an encoder E and decoder D mapping the codes produced by diffusion-based models into the pixel space. Where do these E and D come from? You guessed it, it’s a variational autoencoder!

    But it is not the default vanilla VAE that we have discussed today. These days, it is actually either the quantized version of VAE with a discrete latent space, VQ-VAE, or its further modification with an additional discriminator, VQ-GAN. We will discuss these models in the next installment; until then!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Generative AI Models in Image Generation: Overview

    Generative AI Models in Image Generation: Overview

    Some of the most widely publicized results in machine learning in recent years have been related to image generation. You’ve heard of DALL-E a year ago, and now you’ve heard of DALL-E 2, Midjourney, and Stable Diffusion, right? With this post, I’m starting a new series where I will explain the inner workings of these models, what their differences are and how they fit into the general scheme of deep generative models. Today, we begin with a general overview.

    Taxonomy and a Brief History of Image Generation

    Generative AI models are a staple of machine learning. One of the first functional machine learning models, the naive Bayes classifier developed in the 1960s, was an early form of generative AI: you can write new text with a trained naive Bayes model, it just won’t make any sense (since naive Bayes makes the bag-of-words assumption, the text will be just random words sampled in a way consistent with the desired topic).

    Generating images, however, is more difficult than generating text. Just like text, an image is a high-dimensional object: a 1Mpix color photo is defined by about 3 million numbers! Unlike text, however, making extremely strong assumptions such as the bag-of-words model doesn’t make a lot of sense. In the world of images, “words” are pixels, and while naive Bayes is a pretty good text classifier, individual pixels are too simple to be useful even for classification, let alone generation.

    The first generative models that worked for images were autoregressive: you generate the next pixel conditioned on the already generated previous pixels. PixelCNN and PixelRNN were state-of-the-art models for their time (2016), and it might be that with modern architectures, such models could produce state-of-the-art results even today. The problem, however, is that you would have to run the model a million times to get an image with a million pixels, and there is no way to parallelize this process because you need to know the value of pixel number k-1 before you can generate pixel number k. This would be way too slow for high-definition images, so we will not return to purely autoregressive models in this survey.

    Next, we need to distinguish between pure image generation and conditional generation: is it enough to just get a “person who does not exist” or do you want to control the scene with some kind of a description? Significant progress on the former problem was made in 2017-2018 by NVIDIA researchers who specialized in generative adversarial networks (GANs); their ProGAN (progressively growing GAN) model was the first to do high-definition generation (drawing human faces at up to 1024×1024 pixels) with few of the artifacts that had previously plagued generative models. Later, the same team switched to conditional generation and started working on the StyleGAN family of models, where you can mix and match different levels of features from different images, e.g., take coarse features such as the shape of a face from one person and fine features such as skin texture from another.

    However, it would be even more interesting—and more difficult—if you could just write a prompt for the model and immediately get a picture of the result. This requires multimodal modeling: you have to somehow transform both text and images into the same space, or at least learn how to translate them into one another.

    The first model to claim it has achieved this holy grail with sufficiently good quality was DALL-E from OpenAI. It featured a variational autoencoder with a discrete latent space (like a “language” with discrete “words” that the decoder can turn into images) and a Transformer that made it possible to encode text prompts into this latent space. Later, however, new models have been developed that surpassed DALL-E, including DALL-E 2, Midjourney, and Stable Diffusion. In the next sections, we will discuss these ideas in more detail, although I will reserve the technical discussions for later posts.

    Variational Autoencoder + Transformer = DALL-E

    One of the most important ideas in deep learning is the autoencoder, an encoder-decoder architecture that is tasked to reconstruct the original image:

    The idea here is that the latent code is usually much smaller than the input and output: the task is to compress millions of pixels down to several hundred or a couple of thousand numbers in such a way that decompression is possible.

    It is very tempting to transform an autoencoder into a generative model: it looks like we can sample the latent codes and get new “reconstructions” that could look like new images. Unfortunately, that’s not quite as easy as it seems: even in the lower-dimensional space, the latent codes of “real images” still occupy a rather complicated subset (submanifold), and it would be very difficult to sample from it directly.

    There are several different ways to go about this problem and turn an autoencoder into a proper generative model. One approach is the adversarial autoencoder: let’s turn this into a GAN by adding a discriminator that distinguishes between “real” latent codes sampled from some standard distribution and “fake” latent codes generated by the encoder from actual images:

    Another approach is taken by variational autoencoders (VAE): let’s make the encoder generate not a single latent code but a whole distribution of latent codes. That is, the encoder produces parameters of this distribution, then a latent code is sampled from it, and then the decoder has to reconstruct the original from any sampled latent code, not only from the exact point the encoder has produced:

    This is just the basic idea; it needs a lot of mathematical machinery to actually work, and I hope to explain this machinery in one of the upcoming posts. But if we do make it work, it helps create a nice generative model without the hassle of adversarial training. Variational autoencoders are an important class of generative models, and DALL-E uses one of them to generate images. To be more precise, it uses a variant of the VAE with discrete latent codes, but this explanation definitely can wait until next time.
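    For readers who prefer code to diagrams, here is a toy sketch of the (continuous) VAE idea; it is an illustration of the reparameterization trick and the loss structure, not DALL-E’s discrete VAE, and the layer sizes are arbitrary assumptions:

    ```python
    import torch
    import torch.nn as nn

    class TinyVAE(nn.Module):
        """The encoder outputs a distribution (mean and log-variance), a latent
        code is sampled from it, and the decoder reconstructs the input from
        that sample."""
        def __init__(self, input_dim=784, latent_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                         nn.Linear(256, 2 * latent_dim))
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                         nn.Linear(256, input_dim))

        def forward(self, x):
            mu, logvar = self.encoder(x).chunk(2, dim=-1)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
            recon = self.decoder(z)
            # KL divergence of the encoder's Gaussian to the standard normal prior;
            # the training loss combines it with a reconstruction term on recon vs. x
            kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
            return recon, kl
    ```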

    The next step in getting a text-to-image model is to add text into the mix. This is exactly what Transformers are great at, and what we need is to train one to generate these discrete latent codes. So the original DALL-E worked as a (discrete) variational autoencoder with a Transformer generating codes for it:

    After training, you can use the Transformer to generate new latent codes and get new pictures via the decoder:

    There are a lot more tricks the authors of DALL-E had to invent to train a huge model able to produce 512×512 images from detailed text prompts, but this is the basic idea.
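    To make the overall flow concrete, here is a schematic sketch of DALL-E-style sampling; all names and interfaces below are my assumptions for illustration, not the actual OpenAI code:

    ```python
    import torch

    @torch.no_grad()
    def generate_image(text_tokens, transformer, dvae_decoder, num_image_tokens=1024):
        """The Transformer autoregressively continues the text tokens with
        discrete image tokens, and the (discrete) VAE decoder then turns the
        grid of image tokens into pixels."""
        tokens = text_tokens                                  # (batch, text_len)
        for _ in range(num_image_tokens):
            logits = transformer(tokens)                      # predict the next token
            probs = torch.softmax(logits[:, -1], dim=-1)
            next_token = torch.multinomial(probs, 1)
            tokens = torch.cat([tokens, next_token], dim=1)
        image_tokens = tokens[:, text_tokens.shape[1]:]       # keep only the image part
        return dvae_decoder(image_tokens)                     # discrete codes -> pixels
    ```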

    Diffusion-based models: inverted degradation

    Another important idea that modern generative models have learned to use very well comes from diffusion-based models. Diffusion is the process of adding noise to something, for instance to an image. If you start with a crisp image and keep adding simple noise, say Gaussian, after a while you will have nothing like the original, and if you continue the process long enough you will get something that’s basically indistinguishable from random noise:

    The idea of diffusion-based models is to try and invert this process. Adding noise is very easy, and the conditional distributions on every step are simple. Inverting it, i.e., gradual denoising of the image, is a much more difficult task, but it turns out that we can approximate the inverse, learning a conditional denoising distribution that is close to the true one:

    Then we can string together this chain of approximations and, hopefully, get a model that is able to regenerate crisp images from random noise:
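    To give a flavor of how this is trained in practice, here is a minimal sketch of the standard DDPM-style training step; the noise schedule and the model interface are assumptions for illustration:

    ```python
    import torch
    import torch.nn.functional as F

    def diffusion_training_step(model, x0, alphas_cumprod):
        """The forward process noises the clean image x0 to a random timestep t;
        the model is trained to predict the added noise, which is what lets us
        approximate the reverse, denoising process."""
        batch = x0.shape[0]
        t = torch.randint(0, len(alphas_cumprod), (batch,), device=x0.device)
        a_bar = alphas_cumprod[t].view(batch, 1, 1, 1)
        noise = torch.randn_like(x0)
        x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * noise  # q(x_t | x_0)
        predicted_noise = model(x_t, t)
        return F.mse_loss(predicted_noise, noise)
    ```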

    Again, a full description of what is going on here is quite involved, and I hope to get a chance to explain it in more detail later. But at this point, the only thing that remains is to be able to convert text descriptions into these “random noise” vectors.

    This conversion can be done with an encoder-decoder architecture (recall the previous section) that projects both texts and images into the same latent space. One of the best such models, CLIP, was developed by OpenAI in 2021, and was used as the basis for DALL-E 2; I will not go into detail about its internal structure in this post and leave it for later.

    So overall, we have the following structure:

    • a multimodal text-image model, in this case CLIP, produces a joint latent space where it can project both images and text prompts;
    • a diffusion-based decoder can produce nice-looking images from its own latent space;
    • but at this point, the decoder’s latent space is not connected to CLIP’s latent space, so there is a third model (either autoregressive or diffusion-based too) that converts CLIP latents into the decoder’s latents.

    Here is this structure illustrated by DALL-E 2 authors Ramesh et al. (2022):
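    In code-shaped pseudocode, the same three-component structure looks roughly like this; every name and interface here is my assumption for illustration, not the actual DALL-E 2 code:

    ```python
    import torch

    @torch.no_grad()
    def dalle2_style_generate(prompt, clip_text_encoder, prior, decoder):
        """Text prompt -> CLIP text latent -> prior maps it to an image latent
        -> diffusion decoder produces the pixels."""
        text_latent = clip_text_encoder(prompt)     # CLIP latent for the prompt
        image_latent = prior(text_latent)           # prior: text latent -> image latent
        return decoder(image_latent)                # diffusion decoder -> image
    ```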

    Another large-scale diffusion-based model, Stable Diffusion, was developed by Rombach et al. (2022). It is a different variation of the same idea: it first trains an autoencoder that maps the pixel space into a latent space where imperceptible details are abstracted away and the image is compressed into a much smaller representation, and then performs conditional diffusion in this latent space to account for the text prompt and other conditions.

    I will not go into further detail right now, but here is a general illustration of the approach by Rombach et al.; it mostly concentrates on what’s happening in the latent space because the autoencoder is almost standard by now; note that the conditions are accounted for with Transformer-like encoder-decoder attention modules:

    Unlike DALL-E 2 and Midjourney (there is not even a paper written about Midjourney, let alone source code, so I cannot go into detail about how it works), Stable Diffusion comes with a GitHub repository where you can get the code and, most importantly, trained model weights to use. You can set it up on your home desktop PC (you don’t even need a high-end GPU, although you do need a reasonable one). All generated images used in this post have been produced with Stable Diffusion, and I’m very grateful to its authors for making this great tool available to everybody.
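    If you want to try it yourself, one convenient option is the Hugging Face diffusers library rather than the original repository; the model identifier and API details below may change over time, and you may need to accept the model license on the Hub first:

    ```python
    # A minimal usage sketch via the Hugging Face diffusers library; the original
    # GitHub repository provides its own txt2img script as an alternative.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe("a photo of an astronaut riding a horse on the moon").images[0]
    image.save("astronaut.png")
    ```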

    Generative AI and synthetic data

    So where does this leave us? Does it mean that you can now generate synthetic data at will with very little cost by simply writing a good text prompt, so synthetic data as we understand it, produced by rendering 3D scenes, is useless?

    Far from it. Images generated even by state-of-the-art models do not come with perfect labeling, and generative models for 3D objects are still very far from production quality. If anything, synthetic data is in even higher demand now because researchers need more and more data to train these large-scale models, and at the same time they are developing new ways to do domain adaptation and make synthetic data increasingly useful for this process.

    However, this does not mean that state-of-the-art generative models cannot play an important role for synthetic data. One problem where we believe more research is needed is texture generation: while we cannot generate high-definition realistic 3D models, we can probably generate 2D textures for them, but this requires a separate model and training set because textures look nothing like photos or renders. Another idea would be to adapt generative models to modify synthetic images, either making them look more realistic (synthetic-to-real refinement) or simply making more involved augmentation-like transformations.

    In any case, we are living in exciting times with regard to generative models in machine learning. We will discuss these ideas in more detail in subsequent posts, and let’s see what else the nearest future will bring!

    Sergey Nikolenko
    Head of AI, Synthesis AI

  • Facial Landmark Detection with Synthetic Data: Case Study

    Facial Landmark Detection with Synthetic Data: Case Study

    Today we have something very special for you: fresh results of our very own machine learning researchers! We discuss a case study that would be impossible without synthetic data: learning to recognize facial landmarks (keypoints on a human face) in unprecedented numbers and with unprecedented accuracy. We will begin by discussing why facial landmarks are important, show why synthetic data is inevitable here, and then proceed to our recent results.

    Why Facial Landmarks?

    Facial landmarks are certain key points on a human face that define the main facial features: nose, eyes, lips, jawline, and so on. Detecting such key points on in-the-wild photographs is a basic computer vision problem that could help considerably with a number of face-related applications. For example:

    • head pose estimation, that is, finding out where a person is looking and where the head is turned right now;
    • gaze estimation, a problem important for mobile and wearable devices that we discussed recently;
    • recognizing emotions (that are reflected in moving landmarks) and other conditions; in particular, systems recognizing driver fatigue often rely on facial landmarks in preprocessing.

    There are several different approaches to how to define facial landmarks; here is an illustration of no less than eight approaches from Sagonas et al. (2016) who introduced yet another standard:

    Their standard became one of the most widely used in industry. Named iBug68, it consists of 68 facial landmarks defined as follows (the left part shows the definitions, and the right part shows the variance of landmark points as captured by human annotators):

    The iBug68 standard was introduced together with the “300 Faces in the Wild” dataset; true to its name, it contains 300 faces with landmarks labeled by agreement of several human annotators. The authors also released a semi-automated annotation tool that was supposed to help researchers label other datasets—and it does quite a good job.

    All this happened back in 2013-2014, and numerous deep learning models have been developed for facial landmark detection since then. So what’s the problem? Can we assume that facial landmarks are either solved or, at least, not suffering from the problems that synthetic data would alleviate?

    Synthetic Landmarks: Number Does Matter

    Not quite. As it often happens in machine learning, the problem is more quantitative than qualitative: existing datasets of landmarks can be insufficient for certain tasks. 68 landmarks are enough to get good head pose estimation, but definitely not enough to, say, obtain a full 3D reconstruction of a human head and face, a problem that we discussed very recently and deemed very important for 3D avatars, the Metaverse, and other related problems. 

    For such problems, it would be very helpful to move from datasets of several dozen landmarks to datasets of at least several hundred landmarks that would outline the entire face oval and densely cover the most important lines on a human face. Here is a sample face with 68 iBug landmarks on the left and 243 landmarks on the right:

    And we don’t have to stop there, we can move on to 460 points (left) or even 1001 points (right):

    The more the merrier! If we were able to detect hundreds of keypoints on a face, it would significantly improve the accuracy of 3D face reconstruction and many other computer vision tasks.

    However, by now you probably already see the main problem with these extended landmark standards: there are no datasets, and there is little hope of ever getting them. It was hard enough to label 68 points by hand (the original dataset had only 300 photos); labeling several hundred points at a scale sufficient to train models to recognize them would certainly be prohibitive.

    This sounds like a case for synthetic data, right? Indeed, the face shown above is not real, it is a synthetic data point produced by our very own Human API. When you have a synthetic 3D face in a 3D scene that you control, it is absolutely no problem to have as many landmarks as you wish. What’s even more important, you can easily play with the positions of these landmarks and choose which set of points gives you better results in downstream tasks—imagine how hard it would be if you had to get updated human annotations every time you changed landmark locations!

    So at this point, we have a source of unlimited synthetic facial landmark datasets. It only remains to find out whether they can indeed help train better models.

    Training on Synthetic Landmarks

    We have several sets of synthetic landmarks, and we want to see how well we are able to predict them. As the backbone for our deep learning model we used HourglassNet, a venerable convolutional architecture that has been used for pose estimation and similar problems since it was introduced by Newell et al. (2016):

    The input here is an image, and the output is a tensor that specifies all landmarks.
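    The exact output format is not essential for this summary, but a common choice for hourglass-style models is one heatmap per landmark; here is a sketch of decoding such heatmaps into coordinates with a soft-argmax, as an illustration of the general approach rather than the exact head we use:

    ```python
    import torch

    def soft_argmax(heatmaps):
        """Decode a (batch, num_landmarks, H, W) tensor of heatmaps into
        (batch, num_landmarks, 2) landmark coordinates: one common way to read
        out hourglass-style predictions."""
        b, n, h, w = heatmaps.shape
        probs = torch.softmax(heatmaps.view(b, n, -1), dim=-1).view(b, n, h, w)
        ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
        xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
        y = (probs.sum(dim=3) * ys).sum(dim=2)   # expected row index
        x = (probs.sum(dim=2) * xs).sum(dim=2)   # expected column index
        return torch.stack([x, y], dim=-1)
    ```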

    To train on synthetic data, we couple this backbone with the discriminator-free adversarial learning (DALN) approach introduced very recently by Chen et al. (2022); this is actually another paper from CVPR 2022 so you can consider this part a continuation of our CVPR ‘22 series.

    Usually, unsupervised domain adaptation (UDA) works in an adversarial way by

    • either training a discriminator to distinguish between features extracted from source domain inputs (synthetic data) and target domain inputs (real data), training the model to make this discriminator fail,
    • or learning two classifiers (one oriented toward the source domain, one toward the target domain) at the same time, training them to perform the same on the source domain and as differently as possible on the target domain, while training the model to keep their predictions similar on the target domain.

    DALN suggested a third option for adversarial UDA: it trains only one classifier, with no additional discriminators, and reuses the classifier itself as the discriminator. The resulting loss function is a combination of a regular classification loss on the source domain and a special adversarial loss on the target domain that is minimized with respect to the feature extractor and maximized with respect to the classifier’s weights:
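    To show this min-max structure in code, here is a hedged sketch of a discriminator-free adversarial UDA step in the spirit of DALN; the actual paper uses a nuclear-norm Wasserstein discrepancy as the adversarial term, while the prediction-entropy stand-in below only illustrates the structure, and the module and optimizer interfaces are assumptions:

    ```python
    import torch.nn.functional as F

    def adversarial_uda_step(features, classifier, opt_features, opt_classifier,
                             x_src, y_src, x_tgt, adv_weight=0.1):
        """Supervised classification on the (synthetic) source domain plus an
        adversarial term on the (real) target domain, minimized by the feature
        extractor and maximized by the classifier."""

        def adv_term(logits):
            # a simple measure of how "confident" target-domain predictions are
            probs = logits.softmax(dim=-1)
            return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

        # feature extractor: classify source data correctly
        # and *minimize* the adversarial term on target data
        opt_features.zero_grad()
        loss = F.cross_entropy(classifier(features(x_src)), y_src) \
               + adv_weight * adv_term(classifier(features(x_tgt)))
        loss.backward()
        opt_features.step()

        # classifier (playing the discriminator): classify source data correctly
        # and *maximize* the adversarial term on target data
        opt_classifier.zero_grad()
        loss = F.cross_entropy(classifier(features(x_src)), y_src) \
               - adv_weight * adv_term(classifier(features(x_tgt)))
        loss.backward()
        opt_classifier.step()
    ```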

    We have found that this approach works very well, but we had an additional complication. Our synthetic datasets have more landmarks than iBug68. This means that we cannot use real data to help train the models in a regular “mix-and-match” fashion, simply adding it in some proportion together with the synthetic data. We could pose our problem as pure unsupervised domain adaptation, but that would mean we were throwing away perfectly good real labelings, which also does not sound like a good idea.  

    To use the available real data, we introduced the idea of label transfer on top of DALN: our model outputs a tensor of landmarks as they appear in synthetic data, and then an additional small network is trained to convert this tensor into iBug68 landmarks. As a result, we get the best of both worlds: most of our training comes from synthetic data, but we can also fine-tune the model with real iBug68 datasets through this additional label transfer network.

    Finally, another question arises: okay, we know how to train on real data via an auxiliary network, but how do we test the model? We don’t have any labeled real data with our newly designed extra dense landmarks, and labeling even a test set by hand is very problematic. There is no perfect answer here, but we found two good ones: we either test only on the points that should exactly coincide with iBug landmarks (when such points exist), or train a small auxiliary network to predict iBug landmarks from our dense predictions, fix it, and test through it. Both approaches show that the resulting model is able to predict dense landmarks, and that both synthetic and real data are useful for the model even though the real part only has iBug landmarks.

    Quantitative Results

    At this point, we understand the basic qualitative ideas that are behind our case study. It’s time to show the money, that is, the numbers!

    First of all, we need to set the metric for evaluation. We are comparing sets of points that have a known 1-to-1 correspondence, so the most straightforward way would be to calculate the average distance between corresponding points. Since we want to be able to measure quality on real datasets, we need to use iBug landmarks in the metric, not extended synthetic sets of landmarks. And, finally, different images will have faces shown at different scales, so it would be a bad idea to measure the distances directly in pixels or fractions of image size. This brings us to the following evaluation metric:
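    (Reconstructed here from the description below; d_pupils denotes the distance between the pupils, and the exact form used internally may differ in minor details.)

    \[ \mathrm{score} \;=\; \frac{100}{n \cdot d_{\mathrm{pupils}}} \sum_{i=1}^{n} \bigl\lVert f(x_i) - y_i \bigr\rVert \]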

    where n is the number of landmarks, y_i are the ground truth landmark positions, f(x_i) are the landmark positions predicted by the model, and the distance d_pupils between the left and right pupils of the face in question serves as the normalization coefficient. The factor of 100 is introduced simply to make the numbers easier to read.

    With this, we can finally show the evaluation tables! The full results will have to wait for an official research paper, but here are some of the best results we have now.

    Both tables below show evaluations on the iBug dataset. It is split into four parts: the common training set (you only pay attention to this metric to ensure that you avoid overfitting), the common test set (main test benchmark), a specially crafted subset of challenging examples, and a private test set from the associated competition. We will show the synthetic test set, common test set from iBug, and their challenging test set as a proxy for generalizing to a different use case.

    In the table below, we show all four iBug subsets; to get the predictions, in this table we predict all synthetic landmarks (490 keypoints in this case!) and then choose a subset of them that most closely corresponds to iBug landmarks and evaluate on it.

    The table above shows only a select sample of our results, but it already compares quite a few variations of our basic model described above:

    • the model trained on purely synthetic data; as you can see, this variation loses significantly to all other ways to train, so using real data definitely helps;
    • the model trained on a mix of labeled real and synthetic data with the help of label adaptation as we have described above; we have investigated several different variations of label adaptation networks and finally settled on a small U-Net-like architecture;
    • the model trained adversarially in an unsupervised way; “unsupervised” here means that the model never sees any labels on real data, it uses labeled synthetic data and unlabeled real data with an extra discriminator that ensures that the same features are extracted on both domains; again, we have considered several different ways to organize unsupervised domain adaptation and show only the best one here.

    But wait, what’s that bottom line in the table and how come it shows by far the best results? This is the most straightforward approach: train the model on real data from the iBug dataset (common train) and don’t use synthetic data at all. While the model shows some signs of overfitting, it still outperforms every other model very significantly.

    One possible way to sweep this model under the rug would be to say that this model doesn’t count because it is not able to show us any landmarks other than iBug’s, so it can’t provide the 490 or 1001 landmarks that other models do. But still — why does it win so convincingly? How can it be that adding extra (synthetic) data hurts performance in all scenarios and variations?

    The main reason here is that iBug landmarks are not quite the same as the landmarks that we predict, so even the nearest corresponding points introduce some bias that shows in all rows of the table. Therefore, we have also introduced another evaluation setting: let’s predict synthetic landmarks and then use a separate small model (a multilayer perceptron) to convert the predicted landmarks into iBug landmarks, in a procedure very similar to label adaptation that we have used to train the models. We have trained this MLP on the same common train set.
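    For concreteness, here is a sketch of what such a label adaptation network could look like; the layer sizes and the exact architecture are illustrative assumptions:

    ```python
    import torch
    import torch.nn as nn

    class LabelAdapter(nn.Module):
        """A small MLP that maps dense predicted landmarks (e.g., 490 points)
        to the 68 iBug landmarks."""
        def __init__(self, num_dense=490, num_ibug=68, hidden=512):
            super().__init__()
            self.num_ibug = num_ibug
            self.net = nn.Sequential(
                nn.Linear(num_dense * 2, hidden), nn.ReLU(),
                nn.Linear(hidden, num_ibug * 2),
            )

        def forward(self, dense_landmarks):               # (batch, num_dense, 2)
            flat = dense_landmarks.flatten(start_dim=1)
            return self.net(flat).view(-1, self.num_ibug, 2)

    # Trained on iBug's common train set by minimizing the distance between
    # adapter(predicted_dense_landmarks) and the ground-truth iBug landmarks.
    ```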

    The table below shows the new results.

    As you can see, this test-time label adaptation has improved the results across the board, and very significantly! However, they still don’t quite match the supervised model, so some further research into better label adaptation is still in order. The relative order of the models has remained more or less the same, and the mixed syn+real model with label adaptation done by a small U-Net-like architecture wins again, quite convincingly, although with a smaller margin than before.

    Conclusion

    We have obtained significant improvements in facial landmark detection, but most importantly, we have been able to train models to detect dense collections of hundreds of landmarks that have never been labeled before. And all this has been made possible with synthetic data: manual labeling would never allow us to have a large dataset with so many landmarks. This short post is just a summary: we hope to prepare a full-scale paper about this research soon.

    Kudos to our ML team, especially Alex Davydow and Daniil Gulevskiy, for making this possible! And see you next time!

    Sergey Nikolenko
    Head of AI, Synthesis AI

    P.S. Have you noticed the cover images today? They were produced by the recently released Stable Diffusion model, with prompts related to facial landmarks. Consider it a teaser for a new series of posts to come…

  • CVPR ‘22, Part IV: Synthetic Data Generation

    CVPR ‘22, Part IV: Synthetic Data Generation

    We continue the long series of reviews for CVPR 2022 papers related to synthetic data. We’ve had three installments so far, devoted to new datasets, use cases for synthetic data, and a very special use case: digital humans. Today, we will discuss papers that can help with generating synthetic data, so expect a lot of 3D model reconstruction, new generative models, especially in 3D, and generally a lot of CGI-related goodness (image generated by DALL-E-Mini by craiyon.com with the prompt “robot designer making a 3D mesh”).

    Introduction

    This is the fourth part of our CVPR in Review series (part I, part II, part III). Similar to previous posts, we have added today’s papers to the OpenSynthetics database, a public database of all things related to synthetic data that we have launched recently.

    The bulk of our discussion in this part is devoted to machine learning models that learn 3D objects (meshes, morphable models, surface normals) from photographs. The relation to synthetic data is obvious: one of the main bottlenecks in creating synthetic data is the manual labor that has to go into creating 3D models. After we have a collection of 3D scenes and 3D object models, the rest is more or less automatic, and we can easily produce a wide variety of datasets under highly varying conditions (object placement, lighting, camera position, weather effects, and so on) with perfect labeling, with all the usual benefits of synthetic data. But before we can have all these nice things, we need to somehow get the 3D models; any progress towards constructing them automatically, say from real world photographs, promises significant simplifications and improvements in synthetic data generation pipelines.

    We will also touch upon two different subjects: camera noise modeling and controlled 2D image generation. Let’s start with the last one.

    Modeling Image Composition for Complex Scene Generation

    Before we proceed to 3D meshes and point clouds, let me begin with a paper on 2D generation, but quite accurately controlled 2D generation. The work “Modeling Image Composition for Complex Scene Generation” (OpenSynthetics) by Yang et al. presents an interesting variation on DALL-E-type models: a Transformer with focal attention (TwFA) that can generate realistic images based on layouts, i.e., from object detection labeling. Like this:

    The architecture in this work has many similarities to DALL-E but is actually quite different. First, the basic VQ-VAE structure with a discrete codebook is there, but the codebook tokens do not come from text; they come from ground truth images (during training) and layouts:

    Second, as you can see above, there is a Transformer for generating the tokens, but it’s not a text-based Transformer; it’s a variation on the visual Transformer. Its job is to serve as a “token model” that produces a sequence of VQ-VAE tokens in an autoregressive fashion based on the layout. At inference time, this Transformer generates the tokens, and then the VQ-VAE decoder produces an image based on the resulting sequence:

    But the most important novelty, the one that actually lets this model use layouts in a very interesting way, is a new approach to the attention mechanism in the Transformer. In the classical Transformer, self-attention layers have every token attend to every other token; if we generate the sequence autoregressively, this means every previous token. The focal attention proposed in this work uses masks to constrain each token to attend only to the patches and objects that actually relate to it in the scene. Without going into too much detail, here is an illustration of what the masks look like in different variations of this idea:
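    In generic terms, masked attention looks like the sketch below; the actual focal attention masks of Yang et al. are constructed from patch and object relations in the layout, which is not shown here:

    ```python
    import torch
    import torch.nn.functional as F

    def masked_self_attention(q, k, v, allowed):
        """Generic masked self-attention: `allowed` is a boolean (seq, seq)
        matrix saying which positions each query may attend to, on top of the
        usual causal (autoregressive) constraint."""
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        scores = scores.masked_fill(~allowed, float("-inf"))
        return F.softmax(scores, dim=-1) @ v
    ```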

    This is exactly what lets the model generate images that reflect input layouts very well. And it doesn’t even need a huge dataset with object detection labeling for this: the authors used the classical COCO-Stuff and Visual Genome datasets. A comparison with other models that tried to tackle the same task is telling:

    Naturally, the paper is devoted to generation and does not try to use generated images as synthetic datasets. But I view it as an interesting step in the direction of controlled generation; we’ve seen before that even very artificial-looking images can be helpful for training large computer vision models, so it would be interesting to check if images generated in such a controlled way could be helpful too.

    It would be an especially interesting case of bootstrapping if one could use images generated by a VQ-VAE and a Transformer to improve the pretraining of these same models—I’m not claiming it’s going to help, I haven’t made or seen any experiments, but it’s an enticing thought to check out.

    Photorealistic Facial Expression Transfer for VR

    In the main part today, we will proceed from more specialized models to more general ones. The first two works are devoted to perhaps the most interesting and one of the most complex single objects in 3D modeling: heads and faces. Evolution has made us humans very good at reading facial expressions and recognizing faces; we usually like to look at other people and have had a lot of practice. This is why it’s very easy to get it wrong: “uncanny valley” examples usually feature human heads and faces.

    In “Robust Egocentric Photo-realistic Facial Expression Transfer for Virtual Reality” (OpenSynthetics), Facebook researchers Jourabloo et al. consider the problem of generating virtual 3D avatars that would reflect our facial expressions in VR. This makes perfect sense for Facebook as their Oculus Quest 2 VR headset has become one of the most successful models to date.

    There are many works on capturing and animating avatars, but compared to other settings, a VR headset is different: we need to model the facial expression of a person who is… well, wearing a VR headset! This sounds very challenging but, on the other hand, we have three cameras that show you the two eyes (covered by the headset) and the bottom of the face. Here is what the input looks like in this system, with camera locations shown on the right:

    Jourabloo et al. propose a multi-identity architecture that takes as input three images like the ones above and a neutral 3D mesh of a person and produces the modified mesh and a texture to go with it. The architecture has several parts: one for shape modification, another for texture generation, and yet another for putting them together and rasterizing:

    By the way, a major factor in improving the results, as it often happens in computer vision, were augmentations: the authors propose to model 3D augmentations (such as slightly moving or rotating the camera) by 3D rotation and translation of the face shape in the training set, that is, changing the premade 3D shape—looks like another win for synthetic data to me!

    The results are quite impressive; here is a comparison with ground truth 3D models:

    And here are sample final results of the whole pipeline, in comparison with a previous work and again with the ground truth on the right:

    VR technology is constantly evolving, and these examples already look perfect. Naturally, the hardest part here is not about getting some excellent cherry-picked results, but about bridging the gap between research and technology: it would be very interesting to see how well models like this one perform in real world settings, e.g., on my own Oculus Quest 2. Still, I believe that this gap is already not too wide, and we will be able to try photorealistic virtual worlds very soon.

    JIFF: Jointly-aligned Implicit Face Function for Single-View Reconstruction

    In comparison with VR avatars, this work, also devoted to reconstructing 3D faces, clearly shows the difference between possible problem settings. In “JIFF: Jointly-aligned Implicit Face Function for High Quality Single View Clothed Human Reconstruction” (OpenSynthetics), Cao et al. set out to reconstruct the 3D mesh from a single photograph of a clothed human. Here are the results of the best previous model (PiFU from ICCV 2019, which we discussed in a previous post) and the proposed approach:

    The improvement is obvious, but the end result is still far from photorealistic. As we discussed last time, the PiFU family of models uses implicit functions, modeling a surface with a parameterized function so that its zero level is that surface. Another important class of approaches includes 3D morphable models (e.g., the original 3DMM or the recently developed DECA) that capture general knowledge about what a human face looks like and represent individualized shapes and textures in some low-dimensional space.

    So a natural idea would be to join the two approaches, using 3DMM as a prior for PiFU-like reconstruction. This is exactly what JIFF does, incorporating 3DMM as a 3D face prior for the implicit function representation:

    You’ve seen the results above, so let’s keep this section short. The conclusion here is that to get good high-resolution 3D models at this point you need some very good inputs. Maybe there already exists some combination of approaches that could take a single photo, learn the shape like this one, and then somehow upscale and improve the textures to more or less photorealistic results, but I’ve yet to see it. And this is a good thing—there is still a lot of room for research and new ideas!

    BANMo: Building Animatable 3D Neural Models from Many Casual Videos

    Human heads are a very important object in synthetic data, but let’s move on to a model, presented in “BANMo: Building Animatable 3D Neural Models from Many Casual Videos” (OpenSynthetics) by Yang et al., that promises to capture a whole object in 3D. And not just capture but to provide an animatable 3D model able to learn possible deformations of the object in motion. Naturally, to get this last part it requires much more than a single image, namely a collection of “casual videos” that contain the object in question. Oh, and I almost forgot the best part: the “object in question” could be a cat!

    So how does it work? We again come back to implicit functions. A 3D point in BANMo (a Builder of Animatable 3D Neural Models) has three properties: color, density, and a low-dimensional embedding, and all three are modeled implicitly by trainable multilayer perceptrons. This is very similar to neural radiance fields (NeRF), a very hot topic in this year’s CVPR and one that deserves a separate discussion. Deformations are modeled with warping functions that map a canonical location in 3D to the camera space location and back. Pose estimation in BANMo is based on DensePose-CSE, which actually limits the model to humans and quadrupeds (thankfully, cats are covered). And to get from the 3D deformed result to 2D pixels, BANMo uses a differentiable rendering framework, which is yet another can of worms that I don’t want to open right now.
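    Here is a minimal sketch of what such an implicit point representation could look like; the layer sizes and exact outputs are my assumptions, not BANMo’s actual code:

    ```python
    import torch
    import torch.nn as nn

    class ImplicitPoint(nn.Module):
        """An MLP maps a canonical 3D coordinate to color, density, and a
        low-dimensional embedding, NeRF-style."""
        def __init__(self, embed_dim=16, hidden=256):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.color = nn.Linear(hidden, 3)         # RGB
            self.density = nn.Linear(hidden, 1)       # volumetric density
            self.embedding = nn.Linear(hidden, embed_dim)

        def forward(self, xyz):                       # (num_points, 3)
            h = self.trunk(xyz)
            return (torch.sigmoid(self.color(h)),
                    torch.relu(self.density(h)),
                    self.embedding(h))
    ```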

    Overall, it’s a pretty complicated framework with a lot of moving parts:

    But, as it often happens in successful deep learning applications, by carefully selecting the losses the authors are able to optimize all the parts jointly, in an end-to-end fashion.

    Here is a comparison with previous work on similar problems:

    As you can see, the results are not quite perfect but already very good. BANMo requires a lot of input data: the authors speak of thousands of frames in casual videos. However, collecting this kind of data is far easier than trying to get a 3D model via a hardware solution or recording videos in a multi-camera setup. If you have a cat you probably already have enough data for BANMo. The implications for synthetic data are obvious: if you can get a new model complete with movements and deformations automatically from a collection of videos, this may reduce the production cost for 3D models by a lot.

    High-Fidelity Rendering of Dynamic Humans from a Single Camera

    Clothing is hard. When a girl dances in a dress, the cloth undergoes very complicated transformations that are driven by the dance but would be extremely difficult to capture from a monocular video. In fact, clothes are the hardest part of moving from heads and faces (where clothing is usually limited to rigid accessories) to full-body 3D reconstruction of humans.

    In “Learning Motion-Dependent Appearance for High-Fidelity Rendering of Dynamic Humans from a Single Camera” (OpenSynthetics), Adobe researchers Yoon et al. try to tackle the problem of adding clothes to human 3D models in a realistic way that would be consistent with motion. The problem is to take a 3D body model as input and output a clothed 3D body model together with the corresponding rendering:

    The main challenge here is the lack of data: it would be probably possible to learn the dynamics of secondary motion (e.g., clothing) from videos but that would require a very large labeled dataset that covers all possible poses. In realistic scenarios, this dataset is nonexistent: we usually have only a short YouTube or TikTok video of the moving person.

    This means that we need to have strong priors about what’s going on in secondary motion. Yoon et al. propose the equivariance property: let’s assume that per-pixel features learned by the model’s encoder are transformed in the same way as the original body pose is transformed. The encoder produces 3D motion descriptors for every point, and the decoder’s job is to actually model the secondary motion and produce surface normals and appearance of the final result:

    The results are very impressive; in the figure below the rightmost column is the ground truth, the second on the right is the proposed model, and the rest are baselines taken from prior art:

    Moreover, in the experiments the poses and surface normals (inputs to the model) are not assumed to be known but are also captured from the input video with a specially modified 3D tracking pipeline. This makes it possible to use the model on standalone videos and also enables a number of other applications:

    Overall, this year’s CVPR shows a lot of significant improvements in 3D reconstruction from 2D input, be it either a single image, a short clip, or a whole collection of videos. This is yet another excellent example of such a work.

    High-Fidelity Garment mesh Reconstruction from Single Images

    Let’s also briefly mention another work that deals with a very similar problem: reconstructing the 3D mesh of clothes from a single image. The ReEF model (registering explicit to implicit), proposed in “Registering Explicit to Implicit: Towards High-Fidelity Garment mesh Reconstruction from Single Images” (OpenSynthetics) by Zhu et al., tries to extract high-quality meshes of clothing items from a single photograph.

    The main idea is to use an explicitly given 3D mesh of an item of clothing and learn from the image a function of how to deform it to match the appearance on the image. This is achieved by segmenting the input into clothing items and their boundaries (in 3D!) and then fitting a standard (T-pose) 3D mesh to the results:

    I will not go into details here, but the results are quite impressive. The resulting meshes can then be applied to other 3D meshes, re-deforming them to match:

    That is, you can automatically fit a virtual character with your clothes! When this technology is further improved (I don’t think it’s quite there yet), this may be a very important piece of the puzzle for large-scale synthetic data generation. If we can get items of clothing in bulk from stock photos of clothed humans and eliminate the need to model these items by hand, it will significantly increase variation in synthetic data without adding much to the cost.

    PhotoScene: Photorealistic Material and Lighting Transfer for Indoor Scenes

    Our last 2D-to-3D paper today is “PhotoScene: Photorealistic Material and Lighting Transfer for Indoor Scenes” (OpenSynthetics) by UCSD and Adobe researchers Yeh et al. On one hand, it’s the natural culmination of our small-to-large progression: PhotoScene deals with photographs of whole interiors. 

    On the other hand, PhotoScene tackles a problem that can be crucial for synthetic data generation. Most works that deal with 2D-to-3D reconstruction (including the ones above) concentrate on trying to get the 3D model itself (in the form of a mesh, surface normals, or some other representation) exactly right. While this is very important, it’s only one piece of the puzzle. PhotoScene presents a solution for the next step: given a preexisting coarse 3D model of the interior (either automatically generated or produced by hand), it tries to capture the materials, textures, and scene illumination to match the photo.

    An interesting part of the pipeline that we have never discussed before is material priors, that is, low-dimensional parametric representations of various materials that are resolution-independent and can produce textures for numerous variations of materials by varying their parameters.

    The authors use MATch, a recently developed framework that defines 88 differentiable procedural graphs that can capture a wide range of materials with a lot of detail and variation, as textures in unlimited resolution ready for applying to 3D models, relighting, and other transformations:

    Using this work as a basis, PhotoScene learns to align parts of the input photo with elements of the coarse 3D scene, extracts the most suitable procedural MATch graphs, and learns their parameters. As a result, it produces a scene with high-quality materials and refined lighting, ready to render in new views or with new lighting:

    PhotoScene and, more generally speaking, material priors in the form of procedural graphs also represent a very important novelty for synthetic data generation. With models such as this one, we are able to obtain 3D representations of whole scenes (interiors in this case) that can then be used as the basis for synthetic data. It is not hard to throw together a coarse 3D model for a home interior: naturally, there already exist a lot of 3D models for furniture, decor, covers, and other home items, so it’s more a matter of choosing the most suitable ones. The hard part of 3D modeling here is to get the materials right and set up the lighting in a realistic way—exactly what PhotoScene can help with. Moreover, instead of rasterized textures it produces material graphs that can be used in any resolution and under any magnification—another very important advantage that can let us improve the quality and variety of the synthetic output.

    Modeling sRGB Camera Noise with Normalizing Flows

    In conclusion, as usual, let’s go for something completely different. In “Modeling sRGB Camera Noise with Normalizing Flows” (OpenSynthetics), Samsung researchers Kousha et al. present a new approach to modeling the noise introduced by real world cameras.

    This is a very specialized problem, but a very important one for quite a few applications, not just image denoising. In particular, I personally have worked in the past on superresolution, the problem of increasing resolution and generally improving low-quality photographs. Somewhat surprisingly, noise modeling proved to be at the heart of superresolution: modern approaches such as KernelGAN or RealSR introduce a parametric noise model, either to learn it on the given photo and then upscale it while reducing this noise or to inject this noise for degradation during their single-image training. It turned out that the final result very much depends on how well the noise model was learned and how expressive it was.

    For synthetic data, this means that for many problems, to have really useful synthetic data we would need to have a good noise model as well. If superresolution models were trained on clean synthetic images with noise introduced artificially with some kind of a simple model (probably Gaussian noise), they would simply learn this model and produce atrocious results on real life photos whose noise is generated very differently.

    Technically, Kousha et al. use normalizing flows, a very interesting and increasingly important family of generative models that represent the density p(x) as a composition of several relatively simple invertible transformations with tractable (in the autoregressive case, triangular) Jacobians, which ensures either efficient density estimation (masked autoregressive flows, MAF) or efficient sampling (inverse autoregressive flows, IAF). It would take a separate post to fully introduce normalizing flows (I hope I’ll write it someday), so let’s just say that here Kousha et al. introduce a conditional linear flow that is conditioned on the camera and its gain setting and use it in a whole pipeline designed to model several different sources of noise present in real world cameras:
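    To give a flavor of the machinery, here is a minimal sketch of a single (much simpler, unconditional) flow step and the change-of-variables computation that keeps exact log-densities tractable; this illustrates the general idea only, not the conditional flows of Kousha et al.:

    ```python
    import torch
    import torch.nn as nn

    class AffineFlowStep(nn.Module):
        """A toy flow step: an invertible elementwise map z = exp(s) * x + t
        whose log-Jacobian-determinant is just sum(s)."""
        def __init__(self, dim):
            super().__init__()
            self.s = nn.Parameter(torch.zeros(dim))
            self.t = nn.Parameter(torch.zeros(dim))

        def forward(self, x):
            z = torch.exp(self.s) * x + self.t
            log_det = self.s.sum()                    # log |det dz/dx|
            return z, log_det

    def log_prob(x, flow, base_dist):
        """Change of variables: log p(x) = log p_base(f(x)) + log |det df/dx|,
        e.g., with base_dist = torch.distributions.Normal(0.0, 1.0)."""
        z, log_det = flow(x)
        return base_dist.log_prob(z).sum(dim=-1) + log_det
    ```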

    As a result, they learn a noise distribution that is much closer to the real noise distribution (in terms of KL divergence) than previous efforts. There are no beautiful pictures to show here, it’s all about small variations in noisy patches viewed under high magnification, but trust me, as I’ve tried to explain above, this line of research may prove to be very important for synthetic data generation.

    Conclusion

    Phew, that was a long one! Today we have discussed several papers from CVPR 2022 that can help with synthetic data generation; I have tried to be thorough but definitely do not claim this post to be an exhaustive list. We have run the gamut from low-level computer vision (camera noise) to reconstructing animatable models from video collections. Next time, we will survey what CVPR 2022 has brought to the table of domain adaptation, another hugely important topic in synthetic data. See you then!

    Sergey Nikolenko
    Head of AI, Synthesis AI