In computer vision, image synthesis is one of the most spectacular recent developments, but also among those with the greatest computational demands. High-resolution synthesis of complex natural scenes is currently dominated by scaled-up likelihood-based models. The encouraging results of GANs, on the other hand, have been shown to be mostly confined to data with comparatively limited variability, because their adversarial training procedure does not easily scale to modelling complex, multi-modal distributions. Recently, diffusion models, which are built from a hierarchy of denoising autoencoders, have been shown to achieve impressive results in image synthesis. This article explains how a diffusion model is used for image synthesis. The following topics are covered.
Table of contents
- About diffusion models
- The Latent Diffusion
- Generating images from text using Stable Diffusion
About diffusion models
Diffusion models are generative models, meaning they create data similar to the data on which they were trained. Fundamentally, they work by corrupting training data through the successive addition of Gaussian noise and then learning to recover the data by reversing this noising process.
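As a rough illustration of the noising half of this process, the sketch below corrupts a toy four-pixel "image" with Gaussian noise. The linear noise schedule (from 1e-4 to 0.02 over 1000 steps) is an assumption borrowed from common DDPM setups, not something this article specifies, and the helper names make_alpha_bars and add_noise are hypothetical.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal kept after each step (assumed linear beta schedule)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    return x_t, eps

alpha_bars = make_alpha_bars()
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]                        # toy "image" with four pixels
x_early, _ = add_noise(x0, 10, alpha_bars, rng)   # still close to x0
x_late, _ = add_noise(x0, 999, alpha_bars, rng)   # almost pure noise
```

A trained diffusion model learns the opposite direction: given x_t and t, it estimates the noise eps so that the corruption can be undone step by step.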
Diffusion models are probabilistic models that learn a data distribution by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed-length Markov chain. The most successful image synthesis models rely on a reweighted variant of the variational lower bound on the data likelihood, which mirrors denoising score matching. These models can be thought of as a sequence of denoising autoencoders, each trained to predict a denoised version of its input.
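Written out, the reweighted objective is usually given in the following form. This is a sketch in the notation of the DDPM literature, which this article does not introduce itself: x_0 is a training image, epsilon is the Gaussian noise that was added, alpha-bar_t is the fraction of signal left at step t, and epsilon_theta is the denoising network.

```latex
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

L_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, t}
\left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2 \right]
```

In words, the network is shown a noised sample and the step index, and is penalised by the squared error between the noise it predicts and the noise that was actually added.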
The Latent Diffusion
Latent diffusion models are diffusion models that are trained in a latent space. A latent space is a multidimensional abstract space that holds a meaningful internal representation of observed data. Samples that are similar in the external world are positioned close to each other in the latent space. Its primary purpose is to convert raw data, such as image pixel values, into a suitable internal representation or feature vector from which a learning subsystem, often a classifier, can detect or classify patterns in the input.
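One practical consequence of working in a latent space is that the denoising network handles far fewer values than the raw pixels. The small calculation below puts a number on that. It assumes the configuration commonly reported for Stable Diffusion v1, in which the autoencoder shrinks each image side by a factor of 8 and produces 4 latent channels; these two numbers are not stated in this article and should be checked against the model card. The function name latent_shape is hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent tensor for an RGB image (assumed factor 8, 4 channels)."""
    return (latent_channels, height // downsample, width // downsample)

pixel_values = 3 * 512 * 512          # values in one 512x512 RGB image
c, h, w = latent_shape(512, 512)
latent_values = c * h * w             # values the diffusion model works on
compression = pixel_values / latent_values

print(latent_shape(512, 512), latent_values, compression)
# (4, 64, 64) 16384 48.0
```

Under these assumptions, each denoising step operates on 48 times fewer values than it would in pixel space.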
The autoencoder in latent diffusion models builds on the variational autoencoder (VAE), which is trained by maximising the ELBO (Evidence Lower Bound). Directly computing and maximising the likelihood of the observed data is difficult, since it requires either integrating out all latent variables, which is intractable for large models, or access to a ground-truth latent encoder. The ELBO, on the other hand, is a lower bound on the evidence, where the evidence is the log probability of the observed data. Maximising the ELBO therefore becomes a proxy objective for optimising a latent variable model; in the best case, when the variational distribution is flexibly parameterised and perfectly optimised, the ELBO becomes exactly equal to the evidence.
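The lower-bound property can be checked numerically on a toy model in which the evidence is known in closed form. The model below is an assumption chosen purely for illustration and has nothing to do with Stable Diffusion: the latent z follows N(0, 1) and the observation x given z follows N(z, 1), so the evidence is p(x) = N(0, 2) and the true posterior is N(x/2, 1/2). The function names are hypothetical.

```python
import math

def log_evidence(x):
    """log p(x) for z ~ N(0, 1), x | z ~ N(z, 1), which gives p(x) = N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0

def elbo(x, mean, var):
    """ELBO for the Gaussian guess q(z | x) = N(mean, var)."""
    # Expected log-likelihood of x under the decoder, averaged over q.
    reconstruction = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - mean) ** 2 + var)
    # KL divergence from q to the standard normal prior.
    kl_to_prior = 0.5 * (mean ** 2 + var - 1.0 - math.log(var))
    return reconstruction - kl_to_prior

x = 1.0
loose = elbo(x, mean=0.0, var=1.0)       # q ignores the data
tight = elbo(x, mean=x / 2, var=0.5)     # q equals the true posterior
gap_loose = log_evidence(x) - loose      # strictly positive
gap_tight = log_evidence(x) - tight      # zero up to rounding
```

The gap between the evidence and the ELBO is the KL divergence between the guess q and the true posterior, which is why it vanishes only when the two coincide.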
The technique is called variational because the encoder searches for the best member of a family of candidate posterior distributions defined by its parameters. It is called an autoencoder because it resembles a standard autoencoder model, in which the network is trained to reconstruct its input after passing it through an intermediate bottleneck representation.
The ELBO of the variational autoencoder consists of two terms. The first term measures the reconstruction likelihood of the decoder under the variational distribution. This ensures that the learnt distribution models effective latents from which the original data can be regenerated. The second term measures how far the learnt variational distribution is from a prior belief held over the latent variables. Minimising this term encourages the encoder to learn an actual distribution rather than collapse into a Dirac delta function. Maximising the ELBO is thus equivalent to maximising its first term while minimising its second term.
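The two terms described above are usually written as follows. The notation is an assumption, since the article does not fix one: q_phi(z | x) is the encoder (the variational distribution), p_theta(x | z) is the decoder, and p(z) is the prior over latents, typically a standard normal.

```latex
\log p(x) \;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction term}}
\;-\;
\underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{prior matching term}}
```

The minus sign in front of the KL divergence is what turns "maximise the ELBO" into "maximise the first term and minimise the second".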
Generating images from text using Stable Diffusion
In this article, we will use the pretrained Stable Diffusion v1 model to generate images from text descriptions. Stable Diffusion is a text-to-image latent diffusion model developed by researchers and engineers from CompVis, Stability AI, and LAION. It was trained on 512×512 images from a subset of the LAION-5B database. To condition the model on text prompts, it uses a frozen CLIP ViT-L/14 text encoder. The model is relatively lightweight, with an 860M-parameter UNet and a 123M-parameter text encoder, and it runs on a GPU with at least 10GB of VRAM.
While image-generation models have impressive capabilities, they can also reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of LAION-2B(en), which consists of images paired primarily with English descriptions. Texts and images from communities and cultures that use other languages are likely to be underrepresented. This affects the model's overall output, because white and western cultures are often set as the default. Furthermore, the model's ability to generate content from non-English prompts is noticeably worse than from English-language prompts.
Installing the dependencies
!pip install diffusers==0.3.0
!pip install transformers scipy ftfy
!pip install ipywidgets==7.7.2
The user also needs to accept the model license before downloading or using the weights. To gain access to the pre-trained model, the user visits the model card, reads the license, and ticks the checkbox if they agree. The user must be registered on the Hugging Face Hub and must supply an access token for the code to work.
from huggingface_hub import notebook_login

notebook_login()
Importing the dependencies
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline
from PIL import Image
Creating the pipeline
Before creating the pipeline, make sure that a GPU is attached to the notebook. In a Colab notebook, this means selecting a GPU runtime.
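One way to confirm that a GPU is visible is to query the NVIDIA driver tool. The sketch below assumes an NVIDIA GPU and that nvidia-smi is on the PATH when one is attached; the function name gpu_summary is hypothetical. In a notebook cell, running !nvidia-smi directly gives the same information.

```python
import shutil
import subprocess

def gpu_summary():
    """Return one line per visible NVIDIA GPU, or None if none can be queried."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # no NVIDIA driver tools on this machine
    try:
        result = subprocess.run(
            [exe, "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()

summary = gpu_summary()
print(summary if summary else "No GPU detected; switch the notebook to a GPU runtime.")
```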
StableDiffusionPipeline is a complete inference pipeline that can be used to produce pictures from text using only a few lines of code.
experimental_pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True,
)
First, we load the pre-trained weights for all of the model's components. In addition to the model id "CompVis/stable-diffusion-v1-4", we pass a specific revision, torch_dtype, and use_auth_token to the from_pretrained method. use_auth_token is required to confirm that you have accepted the model's licence. To make sure that Stable Diffusion can run on a free Google Colab instance, we load the weights from the half-precision branch "fp16" and tell diffusers to expect float16 weights by passing torch_dtype=torch.float16. For faster inference, move the pipeline to the GPU.
experimental_pipe = experimental_pipe.to("cuda")
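The effect of loading half-precision weights can be estimated with simple arithmetic. The sketch below uses the parameter counts quoted earlier in this article (860M for the UNet, 123M for the text encoder). It deliberately ignores the autoencoder weights and the activations created during inference, so the figures are a lower bound on real memory use rather than a measurement. The function name weight_gigabytes is hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_gigabytes(num_params, dtype):
    """Memory needed just to hold the weights, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

params = 860e6 + 123e6      # UNet + text encoder
fp32_gb = weight_gigabytes(params, "float32")
fp16_gb = weight_gigabytes(params, "float16")

print(f"float32: {fp32_gb:.2f} GB, float16: {fp16_gb:.2f} GB")
# float32: 3.93 GB, float16: 1.97 GB
```

Halving the bytes per weight halves this part of the footprint, which leaves more GPU memory for the intermediate tensors produced at each denoising step.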
Generating an image
description_1 = "a photograph of an horse on moon"

with autocast("cuda"):
    # .images is a list of PIL images; take the first (and only) one
    image_1 = experimental_pipe(description_1).images[0]

image_1
As we can see, the model did a pretty good job of generating the image. The horse is on the moon, the Earth is visible from the moon, and details such as highlights, blacks, and exposure also look fine.
Let’s try a complex description with more details about the image.
description_2 = "dog sitting in a field of autumn leaves"

with autocast("cuda"):
    # .images is a list of PIL images; take the first (and only) one
    image_2 = experimental_pipe(description_2).images[0]

image_2
Let’s form a grid of 3 columns and 1 row to display more images.
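The snippet that follows calls a function named grids, which is not defined anywhere in this article and is not part of diffusers. A minimal sketch of such a helper is given here, assuming that every image is a PIL image of the same size; the implementation is an illustration, not necessarily the one used to produce the original results.

```python
from PIL import Image

def grids(imgs, rows, cols):
    """Paste equally sized PIL images into a single rows x cols sheet."""
    if len(imgs) != rows * cols:
        raise ValueError("expected rows * cols images")
    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))
    for i, img in enumerate(imgs):
        # Column index advances fastest; a new row starts every `cols` images.
        grid.paste(img, box=((i % cols) * w, (i // cols) * h))
    return grid

# Quick check with solid-colour placeholders instead of generated images.
tiles = [Image.new("RGB", (64, 64), color) for color in ("red", "green", "blue")]
sheet = grids(tiles, rows=1, cols=3)
```

The helper has to be run before the cell that builds the grid.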
num_images = 3
description = [description_2] * num_images

with autocast("cuda"):
    experiment_image = experimental_pipe(description).images

# grids is a helper that is not part of diffusers and must already be defined;
# it is expected to arrange the PIL images into a rows x cols sheet.
grid = grids(experiment_image, rows=1, cols=3)
grid
Image synthesis is one of the most striking areas of computer vision, and advances in autoencoders and probabilistic methods are producing impressive results. Stable Diffusion is a diffusion model that can generate a well-defined image from a text description. In this article, we have covered the latent diffusion model and how to use it in practice.