High-Resolution Image Synthesis with Latent Diffusion Models
Stable Diffusion 2.0
This repository contains Stable Diffusion models trained from scratch and will be continuously updated with new checkpoints. The following list provides an overview of all currently available models. More coming soon.
News
November 2022
New stable diffusion model (Stable Diffusion 2.0-v) at 768x768 resolution. Same number of parameters in the U-Net as 1.5, but uses OpenCLIP-ViT/H as the text encoder and is trained from scratch. SD 2.0-v is a so-called v-prediction model.
The above model is finetuned from SD 2.0-base, which was trained as a standard noise-prediction model on 512x512 images and is also made available.
A text-guided inpainting model, finetuned from SD 2.0-base.
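The v-prediction parameterization mentioned above can be sketched as follows. Instead of predicting the noise, the network predicts a velocity-like target v that mixes signal and noise (a minimal illustrative sketch following the progressive-distillation formulation; the helper names are not from this repository):

```python
import numpy as np

def to_v(x0, eps, alpha_bar):
    # v-target: v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0
    return np.sqrt(alpha_bar) * eps - np.sqrt(1.0 - alpha_bar) * x0

def x0_from_v(x_t, v, alpha_bar):
    # Recover the clean sample from the noisy sample and a predicted v.
    return np.sqrt(alpha_bar) * x_t - np.sqrt(1.0 - alpha_bar) * v

rng = np.random.default_rng(0)
x0, eps, alpha_bar = rng.normal(size=4), rng.normal(size=4), 0.7
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
v = to_v(x0, eps, alpha_bar)
assert np.allclose(x0_from_v(x_t, v, alpha_bar), x0)
```

At any noise level, a correct v prediction lets the sampler recover the clean sample exactly, which is what makes the parameterization convenient at high resolutions.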
We follow the original repository and provide basic inference scripts to sample from the models.
The original Stable Diffusion model was created in a collaboration between CompVis and RunwayML and builds upon the work High-Resolution Image Synthesis with Latent Diffusion Models.
Stable Diffusion is a latent text-to-image diffusion model.
Requirements
You can update an existing latent diffusion environment by running
```
conda install pytorch==1.12.1 torchvision==0.13.1 -c pytorch
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .
```
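A quick way to confirm the pinned versions were picked up is to query installed package metadata (a small stdlib-only helper, not part of this repository; the expected version strings are the ones from the install command above):

```python
# Sanity-check the pinned versions from the environment update above.
from importlib import metadata

def check(pkg, expected_prefix):
    try:
        version = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return f"{pkg}: not installed"
    ok = version.startswith(expected_prefix)
    return f"{pkg}: {version} ({'ok' if ok else 'unexpected'})"

for pkg, want in [("torch", "1.12.1"), ("transformers", "4.19.2")]:
    print(check(pkg, want))
```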
xformers efficient attention
For more efficiency and speed on GPUs, we highly recommend installing the xformers library.
Tested on A100 with CUDA 11.4. Installation needs a somewhat recent version of nvcc and gcc/g++, obtain those, e.g., via
```
export CUDA_HOME=/usr/local/cuda-11.4
conda install -c nvidia/label/cuda-11.4.0 cuda-nvcc
conda install -c conda-forge gcc
conda install -c conda-forge gxx_linux-64=9.5.0
```
Then, run the following (compiling takes up to 30 min).
```
cd ..
git clone https://github.com/facebookresearch/xformers.git
cd xformers
git submodule update --init --recursive
pip install -r requirements.txt
pip install -e .
cd ../stable-diffusion
```
Upon successful installation, the code will automatically default to memory efficient attention for the self- and cross-attention layers in the U-Net and autoencoder.
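The core idea behind memory-efficient attention is to process keys and values in chunks with running softmax statistics, so the full (queries x keys) attention matrix is never materialized. A simplified numpy sketch of that technique (illustrative only, not the actual xformers kernel):

```python
import numpy as np

def chunked_attention(q, k, v, chunk=64):
    # Online-softmax attention over key/value chunks: memory stays
    # proportional to the chunk size, not the full sequence length.
    scale = 1.0 / np.sqrt(q.shape[-1])
    run_max = np.full(q.shape[0], -np.inf)  # running max per query
    denom = np.zeros(q.shape[0])            # running softmax normalizer
    out = np.zeros_like(q)
    for s in range(0, k.shape[0], chunk):
        scores = (q @ k[s:s + chunk].T) * scale
        new_max = np.maximum(run_max, scores.max(axis=-1))
        p = np.exp(scores - new_max[:, None])
        correction = np.exp(run_max - new_max)
        denom = denom * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ v[s:s + chunk]
        run_max = new_max
    return out / denom[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
# Reference: naive attention with the full score matrix.
scores = (q @ k.T) / np.sqrt(8)
ref = np.exp(scores - scores.max(-1, keepdims=True))
ref = (ref / ref.sum(-1, keepdims=True)) @ v
assert np.allclose(chunked_attention(q, k, v, chunk=5), ref)
```

The chunked result matches naive attention exactly; the savings come purely from never holding the full score matrix in memory.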
General Disclaimer
Stable Diffusion models are general text-to-image diffusion models and therefore mirror biases and (mis-)conceptions that are present in their training data. Although efforts were made to reduce the inclusion of explicit pornographic material, we do not recommend using the provided weights for services or products without additional safety mechanisms and considerations. The weights are research artifacts and should be treated as such. Details on the training procedure and data, as well as the intended use of the model, can be found in the corresponding model card. The weights are available via the StabilityAI organization at Hugging Face under the CreativeML Open RAIL++-M License.
Stable Diffusion v2.0
Stable Diffusion v2.0 refers to a specific configuration of the model architecture that uses a downsampling-factor 8 autoencoder with an 865M UNet and OpenCLIP ViT-H/14 text encoder for the diffusion model. The SD 2.0-v model produces 768x768 px outputs.
Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:
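Classifier-free guidance combines an unconditional and a text-conditional noise prediction, with the scale controlling how strongly the output is pushed toward the prompt. A minimal sketch of the combination rule (variable names illustrative):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the text-conditional one.
    return eps_uncond + scale * (eps_cond - eps_uncond)

e_u, e_c = np.array([0.1, -0.2]), np.array([0.5, 0.3])
assert np.allclose(cfg(e_u, e_c, 1.0), e_c)  # scale 1.0: purely conditional
print(cfg(e_u, e_c, 7.0))                    # larger scales over-emphasize the prompt
```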
Text-to-Image
Stable Diffusion 2.0 is a latent diffusion model conditioned on the penultimate text embeddings of a CLIP ViT-H/14 text encoder. We provide a reference script for sampling.
Reference Sampling Script
This script incorporates an invisible watermarking of the outputs to help viewers identify the images as machine-generated. We provide the configs for the SD2.0-v (768px) and SD2.0-base (512px) models.
First, download the weights for SD2.0-v and SD2.0-base.
To sample from the SD2.0-v model, run the following:
```
python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/model.ckpt> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
```
or try out the Web Demo.
To sample from the base model, use
```
python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/model.ckpt> --config <path/to/config.yaml>
```
By default, this uses the DDIM sampler and renders images of size 768x768 (which it was trained on) in 50 steps. Empirically, the v-models can be sampled with higher guidance scales.
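A single deterministic DDIM update can be sketched as follows: predict the clean sample from the current noisy one, then re-noise it to the previous timestep's noise level (illustrative numpy sketch in terms of cumulative alpha products, not this repository's sampler code):

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    # Deterministic DDIM update (eta = 0): predict x0, then re-noise
    # to the previous step's noise level ab_prev.
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

rng = np.random.default_rng(0)
x0, eps = rng.normal(size=4), rng.normal(size=4)
ab_t, ab_prev = 0.5, 0.8
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps
# With a perfect eps prediction, one step lands exactly on the
# less-noisy sample at ab_prev.
expected = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps
assert np.allclose(ddim_step(x_t, eps, ab_t, ab_prev), expected)
```

Because each update is deterministic, the sampler can take large jumps between timesteps, which is why 50 steps suffice here.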
Note: The inference config for all model versions is designed to be used with EMA-only checkpoints. For this reason, use_ema=False is set in the configuration; otherwise, the code will try to switch from non-EMA to EMA weights.
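For context, EMA checkpoints store an exponential moving average of the training weights rather than the raw final weights. A minimal sketch of the averaging rule (illustrative, not this repository's implementation; a small decay is used here so the effect is visible):

```python
def ema_update(ema, params, decay=0.9999):
    # Exponential moving average of model parameters:
    # each update moves the average slightly toward the current weights.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0], decay=0.5)
print(ema)  # 0.0 -> 0.5 -> 0.75 -> 0.875, approaching the live weight 1.0
```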
Image Modification with Stable Diffusion
Depth-Conditional Stable Diffusion
To augment the well-established img2img functionality of Stable Diffusion, we provide a shape-preserving stable diffusion model.
Note that the original method for image modification introduces significant semantic changes w.r.t. the initial image. If that is not desired, download our depth-conditional stable diffusion model and the dpt_hybrid MiDaS model weights, place the latter in a folder midas_models, and sample via
```
streamlit run scripts/streamlit/depth2img.py configs/stable-diffusion/v2-midas-inference.yaml
```
This method can be used on the samples of the base model itself. For example, take this sample generated by an anonymous discord user. Using the streamlit script depth2img.py, the MiDaS model first infers a monocular depth estimate given this input, and the diffusion model is then conditioned on the (relative) depth output.
depth2image
This model is particularly useful for a photorealistic style; see the examples. For a maximum strength of 1.0, the model removes all pixel-based information and only relies on the text prompt and the inferred monocular depth estimate.
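The strength parameter can be thought of as truncating the denoising schedule: the input image is noised up to a fraction of the total steps and denoised from there. A simplified sketch of that bookkeeping (the helper name is illustrative, not from this repository):

```python
def img2img_schedule(strength, num_steps=50):
    # Run only the last `strength` fraction of the denoising schedule:
    # the input image is noised to step t_start, then denoised down to 0.
    t_start = min(int(strength * num_steps), num_steps)
    return list(range(t_start, 0, -1))

assert len(img2img_schedule(0.8)) == 40  # strength 0.8: 80% of 50 steps
assert len(img2img_schedule(1.0)) == 50  # strength 1.0: full noise, input destroyed
```

At strength 1.0 the schedule starts from pure noise, which is why only the prompt and the depth estimate survive.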
Classic Img2Img
For running the "classic" img2img, use
```
python scripts/img2img.py --prompt <prompt> --init-img <path/to/input.jpg> --strength 0.8 --ckpt <path/to/model.ckpt>
```