DALL-E 2

How it works and how it doesn't.

Posted by Alexander Meinke on September 06, 2022 · 25 mins read

Introduction

In case you haven't heard: we are living in the future. In the world of AI we can currently see this in the form of powerful generative models that can create almost any image from just a short text description of the contents. These generated images can range from photo-realistic to highly stylized and artistic. One of the most famous of these AI models is OpenAI's DALL-E 2, and tools like it are set to revolutionize the way we create content, from graphic design to stock photography. Also, they are just insanely fun to use. Let's dive in and see how they work, and what they can and cannot do.

Under the hood

The basic idea of what DALL-E 2 does is quite straightforward, but the technical details of how OpenAI got it to work so well involve combining several different ingredients, so strap in, because this might get a little complicated. If you are just here for pretty pictures, it's also okay to simply skip this section.

CLIP

The task they set out to solve was to find a good mapping from text descriptions of images to the corresponding images (as previously with their DALL-E and GLIDE models). The first step is to get training data consisting of pairs of images and descriptions - in fact, hundreds of millions of such text-image pairs. Then, instead of attempting to directly map text inputs into images, DALL-E 2 makes use of a pre-trained text encoder from another of OpenAI's recent models: CLIP. A text encoder takes some text and maps it into a so-called latent space. A latent space is a somewhat technical concept, but generally you can imagine that each input gets mapped to a point (also called an embedding) in some abstract high-dimensional space, and we usually want semantically similar inputs to end up close to each other in latent space and dissimilar inputs to be far apart.

The CLIP model put a really interesting spin on this idea by jointly training both a text encoder and an image encoder on text-image pairs. During training, we tell the two encoders that the text and image within each pair should map to nearby points in latent space. However, this way the models could simply map every single input to the same point, so we also need to tell the models that texts and images that do not correspond to one another should map to different locations in latent space. This push-pull approach to training latent spaces is known as contrastive learning and has been very popular in recent years.

how CLIP training works

How CLIP is trained. Text and image embeddings from the same pair attract, others repel. (Image taken from OpenAI's blog post).
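To make the contrastive idea concrete, here is a minimal sketch of a CLIP-style loss in PyTorch. This is my own simplified illustration, not OpenAI's actual training code; the two encoders are assumed to exist elsewhere and just hand us a batch of embeddings.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb, image_emb, temperature=0.07):
    # text_emb, image_emb: (batch, dim) outputs of the two encoders,
    # where row i of each tensor comes from the same text-image pair
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # similarity of every text in the batch with every image in the batch
    logits = text_emb @ image_emb.T / temperature  # shape (batch, batch)

    # the matching partner for text i is image i, i.e. the diagonal
    targets = torch.arange(logits.shape[0], device=logits.device)

    # pull matching pairs together, push all other combinations apart
    loss_texts = F.cross_entropy(logits, targets)     # texts -> images
    loss_images = F.cross_entropy(logits.T, targets)  # images -> texts
    return (loss_texts + loss_images) / 2
```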

unCLIP

Next we could of course try to generate images directly from the CLIP text embeddings. However, we have to keep in mind that, despite CLIP's training objective, the text encoder might not map a caption to exactly the same point that the image encoder would map the matching image to. More importantly, many images could correspond to the same text description, so we should actually have a model that generates possible image embeddings from a given text embedding. The team at OpenAI decided to do exactly that. For slightly mathy reasons, this model is called the "prior". This prior is trained to map CLIP text embeddings into CLIP image embeddings.

Now with the image embeddings in hand, we need to train one more model that takes the embedding and maps it into an image. Since this model does exactly the opposite of the CLIP image encoder, it is called the decoder.

Finally, we can combine all these pieces into the final architecture that OpenAI calls "unCLIP", as in the figure below. The text encoder deterministically generates a text embedding, the prior randomly produces one of the many possible image embeddings that could correspond to this text embedding, and then the decoder generates a random image that could correspond to this image embedding. (Of course there are many more details, for example that the decoder actually generates a low-resolution image that then gets upsampled by yet more AI models, but that's the basic idea).

how unCLIP generates images

How unCLIP generates images. The prior maps text embeddings into image embeddings, the decoder generates the image. (Image adapted from OpenAI's paper).
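Put differently, the whole unCLIP pipeline boils down to three calls. The sketch below only shows the data flow; text_encoder, prior and decoder are stand-ins for the trained models, not real API objects.

```python
def generate_image(caption, text_encoder, prior, decoder):
    # 1. deterministic: map the caption to a CLIP text embedding
    text_emb = text_encoder(caption)

    # 2. stochastic: sample one of the many plausible CLIP image
    #    embeddings for this text embedding
    image_emb = prior.sample(text_emb)

    # 3. stochastic: sample an image that matches this image embedding
    #    (the real decoder also sees the caption, and its low-resolution
    #    output is upsampled by separate models)
    return decoder.sample(image_emb, caption)
```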

But of course, this doesn't answer the question of how exactly the embedding can magically turn into an image. The secret sauce here lies in so-called diffusion models.

Diffusion

After Generative Adversarial Networks (GANs) had been in the spotlight for several years, diffusion models have recently improved so drastically that they have now taken the hotly contested throne of generative image models. The way these diffusion models work is basically by starting from a pure-noise image and then gradually removing noise step-by-step until an image comes out. But how does the model know how to do that? Well, during training we basically just take our training images and add different amounts of noise to them. Then we train the model to always make the image look less noisy and thus closer to the original.

this good boy gets generated from noise

During training of a diffusion model we add noise to this good boy and ask the model to remove it step-by-step.
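In code, one training step of such a diffusion model can be sketched roughly as follows. This assumes a precomputed cumulative noise schedule alpha_bar and a model that predicts the added noise; the actual DALL-E 2 decoder is of course far more elaborate than this.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, alpha_bar):
    # x0: batch of clean training images, shape (batch, C, H, W)
    # alpha_bar: 1-D tensor with the cumulative noise schedule, one entry per step
    batch = x0.shape[0]
    num_steps = alpha_bar.shape[0]

    # pick a random noise level for every image in the batch
    t = torch.randint(0, num_steps, (batch,), device=x0.device)
    noise = torch.randn_like(x0)

    # mix image and noise according to the schedule (the "add noise" part)
    a = alpha_bar[t].view(batch, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise

    # the model is trained to predict exactly the noise that was added
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)
```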

Then, to actually generate an image from scratch, we start from pure random noise and let the model iteratively remove noise until the final image appears. During this process, the decoder has access to the image embedding and even the original caption, so that the final image really matches the desired prompt.
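The generation loop then runs this denoising step backwards through time, starting from pure noise. Again, this is a rough DDPM-style sketch under the same assumptions as above; the conditioning arguments (image embedding, caption) and the model's signature are placeholders, not DALL-E 2's real interface.

```python
import torch

@torch.no_grad()
def sample(model, shape, alpha, alpha_bar, sigma, image_emb, caption):
    # start from pure Gaussian noise
    x = torch.randn(shape)
    for t in reversed(range(alpha.shape[0])):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)

        # the decoder's noise prediction is conditioned on the CLIP image
        # embedding (and, in DALL-E 2, also on the caption itself)
        eps = model(x, t_batch, image_emb, caption)

        # remove a little of the predicted noise (standard DDPM update)
        x = (x - (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()
        if t > 0:
            # re-inject a small amount of noise, except at the very last step
            x = x + sigma[t] * torch.randn_like(x)
    return x
```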

There actually is one more important detail that helps make diffusion models as good as they are - a step that is known as "classifier-free guidance". Unfortunately, it is also quite hard to explain without getting into the math. However, the idea is very roughly that many generative models have to accept a trade-off between sample diversity and sample quality. Diffusion models are no different, and crafty AI researchers found a way of directly controlling this trade-off in these models. If you want to learn more about how this is done, check out this excellent article.
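For those who like to see it spelled out anyway, the core trick fits into a few lines: the model is queried once with and once without the conditioning, and the two noise predictions are extrapolated apart. The function below is my own illustrative sketch with a placeholder signature.

```python
def guided_noise_prediction(model, x_t, t, conditioning, guidance_scale=3.0):
    # run the same model twice: with the conditioning and with it dropped
    eps_cond = model(x_t, t, conditioning)
    eps_uncond = model(x_t, t, None)  # trained to also work without conditioning

    # guidance_scale = 1 recovers the plain conditional prediction;
    # larger values trade sample diversity for closer adherence to the prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```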

how DALL-E 2 thinks it thinks

It is fun to ask DALL-E 2 how it thinks it thinks. I used the text from the caption of the original paper's figure as the prompt ("high-level overview of unCLIP. Above the dotted line, we depict the CLIP training process, through which we learn a joint representation space").

The Good, the Bad and the Ugly

With the technical details out of the way, it's time to look at some pictures. DALL-E 2 is great at a lot of things, for example landscapes, animals and paintings. It manages to combine concepts in novel ways when prompted, so it's easy to get lost in the fun of generating absolutely absurd images in seconds.

Much more interesting is the question of what DALL-E 2 cannot yet do. A very obvious thing is that DALL-E is basically dyslexic. It can generate very simple words like "Stop" or "Open" that appear often enough in the dataset that they can effectively be viewed as images rather than text. However, with more complex text, one gets the impression that DALL-E "tries" to produce strings of letters that resemble a desired word but doesn't quite make it.

Similarly, the model cannot accurately count. While it can still do "3 apples" or "4 apples" correctly most of the time, it quickly starts to be off in every single case once we ask for higher numbers. Interestingly, I found that it also could not figure out that 2+2 apples should be 4 apples, despite being able to reliably draw 4 apples.

Additionally, DALL-E 2 is not all that great at correctly attributing features to the right objects. For example, as noted in the original paper, prompting for "a red cube on top of a blue cube" rarely puts the objects in the correct geometric relationship to one another.

Where the results occasionally go from bad to outright nightmarish is when generating faces in a scene. While DALL-E 2 is able to generate great faces without a problem if they take up a significant portion of the image, they often land squarely in the uncanny valley when they are not the central element. I suspect that the faces are not actually any worse than any other details within a complex scene, but since humans are so highly attuned to seeing faces, any small discrepancy seems highly significant to us.

Another thing that greatly limits what DALL-E 2 can do is actually not a technical limitation at all, but rather the strict content filters that OpenAI has put in place to prevent their model from causing harm. There are filters on both the prompt and the final image that ensure that no images come out that could be seen as pornographic, hateful, violent, offensive or otherwise harmful. Additionally, the model does not generate images with the faces of real people in them, and when using the inpainting or outpainting features, no realistic faces can be present in the image.

Prompt engineering

Of course, as is the case with large language models, people have quickly figured out that one can greatly improve results by knowing how to formulate a prompt. The new arcane art of effectively communicating with AI models is called prompt engineering, and in the few short months since DALL-E 2's release, someone has even collected these tricks into a handy reference book.

One of the most popular techniques is to simply add "award-winning" or "8k" to the prompt.

Amazingly, DALL-E 2 can also be tasked with capturing photographs using very specific (though fictitious) settings - like specifying the focal length of the hypothetical camera or even the shutter speed. In the image below, notice how changing the shutter speed from 1/10 second to 1/1000 second removes all motion blur.
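To make this concrete, here is roughly what such prompts end up looking like as plain strings. The exact wording is my own example, not taken from the images in this post.

```python
base = "a lighthouse on a rocky coast at dusk"

# quality modifiers that tend to nudge the model toward polished results
polished = base + ", award-winning photograph, 8k"

# fictitious camera settings: a fast shutter speed suggests a crisp,
# frozen moment, while a slow one invites motion blur
frozen = base + ", photo taken with a 35mm lens at 1/1000 second shutter speed"
blurred = base + ", photo taken with a 35mm lens at 1/10 second shutter speed"

print(polished)
print(frozen)
print(blurred)
```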

And of course, one can simply ask for something to be in a specific style, by a specific artist or even from a movie or TV series.

DALL-E's secret vocabulary

One of the most interesting phenomena that have been discovered since the release of DALL-E 2 is that it has seemingly developed its own hidden vocabulary. Some very striking examples of this were given by researchers at UT Austin. Specifically, they tried the following: they asked the model to generate a scene with subtitles and, unsurprisingly, these subtitles were nonsensical gibberish. However, when they then used the generated text as a prompt, they would occasionally get images of things the characters in the scene might really talk about. Their examples are shown in the figure below.

illustration of hidden vocabulary in DALL-E 2

It almost seems as if the farmers are talking about birds interfering with their harvest. (Figure adapted from here).

Notably, "Apoploe vesrreaitais" is not a name of a real bird but is in fact completely made-up. However, both "Apodidae" and "Ploceidae" are the Latin names of bird families, so a researcher at Columbia University hypothesized that these made-up words are simply the combination of different words with similar meaning. He showed that one could use this fact to reliably produce specific visual concepts from non-sensical words by combining different words with similar meaning. This even works when we combine parts of words that come from different languages.

a creepoky person

I agree that this is what a "creepoky person" should look like. (Figure from here).

illustration of hidden vocabulary in DALL-E 2

"avflugzereo" = "avion" (French) + "Flugzeug" (German) + "aereo" (Italian). (Figure from here).

Conclusion

In the past few months, many, many competitors to OpenAI's DALL-E 2 have popped up, for example Midjourney and Google's Imagen. Midjourney has already made headlines by generating art that took first prize at an art contest. To me the most exciting competitor is Stability AI's Stable Diffusion, because it is completely open-source. This empowers developers everywhere to create amazing applications that leverage these powerful AI models. Already there is a private beta for a Photoshop plugin that opens up unprecedented ways for human-AI interaction. The world of realistic generative models is here to stay, and it will likely be transformative across several industries.

I will leave you with my favorite DALL-E 2 prompt: "dragon and car". As you can see, the dragon always has the same geometric relationship with the car, indicating that the model learned this relationship from the training data. If you don't know why this is funny, I suggest you google for "dragon and car". Maybe not at work though...