It is fair to assume everybody has heard of Stable Diffusion or DALL-E at this point. The huge craze around text-to-image models has taken over the entire AI domain in the last couple of months, and we have seen some really impressive results.
Large-scale language-image (LLI) models have shown impressive performance in image generation and semantic understanding. They are trained on extremely large datasets (that is where the “large-scale” comes from, not the model size) and rely on advanced generative methods such as autoregressive or diffusion models.
These models can generate impressive-looking images or even videos. All you need to do is pass the prompt you want to see, let’s say “a squirrel having a coffee with Pikachu”, to the model and wait for the results. You will get a beautiful image to enjoy.
But let’s say you liked the squirrel and Pikachu in the image but were not happy with the coffee part. You want to change it to, let’s say, a cup of tea. Can LLI models do that for you? Well, yes and no. You can edit your prompt and replace the coffee with a cup of tea, but even that small change will regenerate the entire image, not just the coffee. So, unfortunately, you cannot really use the model to edit just one part of the image.
There have been some attempts to use these models for image editing before. Some methods require the user to manually mask a portion of the picture to be inpainted and then force the modified image to change only in the masked region. This works fine, but the manual masking operation is both cumbersome and time-consuming. Masking the picture also removes critical structural information that is then lost during the inpainting process. As a result, some capabilities, such as altering the texture of a given object, are beyond the reach of inpainting.
Well, since we already work with text-to-image models, can we utilize them for a better and easier editing method? This was the question the authors of this paper asked, and they have a nice answer to it.
Their answer is an intuitive and effective textual editing approach for semantically modifying images generated by pre-trained text-conditioned diffusion models, using only Prompt-to-Prompt manipulations. That was the fancy naming.
But how does it work? How can you force a text-to-image model to edit an image just by altering the prompt?
The key to this problem is hidden in the cross-attention layers. They contain a hidden gem that can help us solve the editing problem: the internal cross-attention maps, the high-dimensional tensors that bind the tokens extracted from the prompt to the pixels of the output image. These maps contain rich semantic relations that shape the generated image. Therefore, accessing and altering them is the way to go for image editing.
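To make the role of these maps concrete, here is a minimal sketch of how cross-attention between pixel features and prompt tokens is typically computed inside a text-conditioned U-Net. The function name, tensor shapes, and projection matrices are illustrative assumptions, not code from the paper.

```python
import torch

def cross_attention(pixel_features, text_embeddings, W_q, W_k, W_v):
    # pixel_features: (batch, num_pixels, dim)       -- spatial features of a U-Net layer
    # text_embeddings: (batch, num_tokens, dim_text) -- output of the text encoder
    Q = pixel_features @ W_q       # queries come from the image pixels
    K = text_embeddings @ W_k      # keys come from the prompt tokens
    V = text_embeddings @ W_v      # values come from the prompt tokens

    d = Q.shape[-1]
    # attn[b, i, j] = how strongly pixel i attends to prompt token j
    attn = torch.softmax(Q @ K.transpose(-1, -2) / d**0.5, dim=-1)

    # attn is the cross-attention map that Prompt-to-Prompt reads and edits
    return attn @ V, attn
```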
The essential idea is that the output image can be altered by injecting cross-attention maps throughout the diffusion process, thereby controlling which pixels attend to which text tokens during denoising. The authors demonstrate this idea with several ways of controlling the cross-attention maps.
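Conceptually, the injection can be pictured as two denoising runs that start from the same noise: one for the original prompt and one for the edited prompt, where the edited run reuses the source run’s attention maps for the early steps. The sketch below illustrates this idea under assumed placeholder names (`model.denoise_step`, `attention_override`), which are not part of any real diffusion library.

```python
import torch

def generate_with_injection(model, source_prompt, target_prompt,
                            num_steps=50, inject_steps=40, seed=0):
    torch.manual_seed(seed)
    latent_src = torch.randn(1, 4, 64, 64)   # both runs start from the same noise
    latent_tgt = latent_src.clone()

    for t in range(num_steps):
        # denoise with the source prompt and record its cross-attention maps
        latent_src, attn_src = model.denoise_step(latent_src, source_prompt, t)

        if t < inject_steps:
            # early steps: override the edited run's attention with the source maps,
            # which preserves the original layout and composition
            latent_tgt, _ = model.denoise_step(
                latent_tgt, target_prompt, t, attention_override=attn_src)
        else:
            # late steps: let the new prompt's own attention shape the fine details
            latent_tgt, _ = model.denoise_step(latent_tgt, target_prompt, t)

    return latent_tgt
```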
First, the cross-attention maps are fixed while a single token in the prompt is changed. This preserves the scene composition of the output image while swapping the edited content. The second method adds new words to the prompt while freezing the attention on the previous tokens; doing so lets attention flow to the new tokens, enabling global editing or modification of a specific object. Finally, they re-weight the attention of a certain word, which amplifies (or attenuates) the corresponding feature in the generated image, such as making a teddy bear more fluffy.
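Here is a hedged sketch of those three attention-map edits, treating a cross-attention map as a `(num_pixels, num_tokens)` tensor. The token indices, the assumption that shared tokens keep the same positions in both prompts, and the scaling factor are all simplifications for illustration.

```python
import torch

def word_swap(attn_src, attn_tgt):
    # Single-token swap: reuse the source maps so the scene composition stays
    # intact while the changed token alters the content. (In practice the
    # injection is applied only for a fraction of the diffusion steps; this is
    # the per-step version.)
    return attn_src

def prompt_refinement(attn_src, attn_tgt, common_token_ids):
    # Added words: freeze the attention of tokens shared with the original
    # prompt and let attention flow freely to the newly added tokens.
    # Assumes, for simplicity, that shared tokens keep the same indices.
    attn = attn_tgt.clone()
    attn[:, common_token_ids] = attn_src[:, common_token_ids]
    return attn

def attention_reweight(attn_tgt, token_id, scale=2.0):
    # Re-weighting: scale the attention of one word to amplify (scale > 1)
    # or attenuate (scale < 1) its effect, e.g. making a teddy bear fluffier.
    attn = attn_tgt.clone()
    attn[:, token_id] = attn[:, token_id] * scale
    return attn

# Toy usage with random maps: 4096 pixels (a 64x64 layer), 77 prompt tokens
attn_src = torch.rand(4096, 77)
attn_tgt = torch.rand(4096, 77)
edited = prompt_refinement(attn_src, attn_tgt, common_token_ids=[0, 1, 2, 5])
```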
The proposed Prompt-to-Prompt method enables intuitive image editing by modifying only the textual prompt. It does not require any fine-tuning or optimization; it works directly on an existing pre-trained model.
This was a brief summary of the Prompt-to-Prompt method. You can find more information at the links below if you are interested in learning more.