SDXL Prompt-to-Prompt

An implementation of Prompt-to-Prompt for the SDXL architecture, adapted from the original repo.

Prompt-to-Prompt is an image editing framework that leverages the self-attention and cross-attention mechanisms of the diffusion process without requiring external tools for edits. It concurrently generates an original image and a modified version based on prompt changes, such as transitioning from “a pink bear” to “a pink dragon”. During diffusion, the technique blends the attentions from “bear” to “dragon”, maintaining the original image’s style while substituting “bear” with “dragon”.
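The attention injection described above can be sketched as a simple per-step rule. This is a toy illustration that follows the word-swap rule from the original paper, not this repository's code: the function name `injected_attention` and the parameter `replace_fraction` are hypothetical, attention maps are stood in for by plain lists, and the truncating cutoff is an assumption.

```python
def injected_attention(source_attn, edited_attn, step, num_steps, replace_fraction):
    """Pick the attention map used by the edited branch at one denoising step.

    For the first `replace_fraction` of the steps, the edited branch reuses the
    attention map computed for the original prompt, which preserves layout and
    style. After that, it uses its own attention map so the new token can
    shape the details.
    """
    # Assumption: the cutoff is the truncated product of steps and fraction.
    cutoff = int(num_steps * replace_fraction)
    return source_attn if step < cutoff else edited_attn


# Toy "attention maps" for the token being swapped ("bear" -> "dragon").
bear_attn = [0.7, 0.2, 0.1]
dragon_attn = [0.1, 0.3, 0.6]

# Record which branch's attention is used at each of 10 steps.
schedule = [
    "source" if injected_attention(bear_attn, dragon_attn, s, 10, 0.8) is bear_attn else "edited"
    for s in range(10)
]
```

With these toy values, the source attention is injected for the early steps and the edited prompt's own attention takes over for the remaining ones.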

There are 3 types of editing:

- Replacement: The user swaps tokens of the original prompt for others, e.g., editing the prompt “A painting of a squirrel eating a burger” to “A painting of a squirrel eating a lasagna” or “A painting of a lion eating a burger”.
- Refinement: The user adds new tokens to the prompt, e.g., editing the prompt “A painting of a squirrel eating a burger” to “A watercolor painting of a squirrel eating a burger”.
- Re-weight: The user changes the weight of certain tokens in the prompt, e.g., for the prompt “A photo of a poppy field at night”, strengthening or weakening the extent to which the word “night” affects the resulting image.
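The difference between the three edit types can be illustrated by comparing the two prompts word by word. The helper below is a hypothetical sketch and not part of this repository: it splits on whitespace, whereas the actual model works on tokenizer output, so treat it only as a way to see how the prompt pairs differ.

```python
def classify_edit(original: str, edited: str) -> str:
    """Guess the edit type from two prompts using whitespace-separated words."""
    src, dst = original.split(), edited.split()
    if src == dst:
        # Same prompt: only the weights of existing tokens can change.
        return "Re-weight"
    if len(src) == len(dst):
        # Same length with differing words: tokens were swapped.
        return "Replacement"
    # Refinement: every original word still appears, in order, in the edit.
    remaining = iter(dst)
    if all(word in remaining for word in src):
        return "Refinement"
    return "Unknown"


base = "A painting of a squirrel eating a burger"
swap = classify_edit(base, "A painting of a squirrel eating a lasagna")
refine = classify_edit(base, "A watercolor painting of a squirrel eating a burger")
```

Replacement keeps a one-to-one word alignment between the prompts, while Refinement keeps the original words as an ordered subset of the edited prompt.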

See the original paper, project page and repository for more details.

How to use the API

To edit images with SDXL Prompt-to-Prompt, provide input parameters that define the editing instructions. The “original_prompt” and “prompt_edit_type” parameters are always required. The “edited_prompt” parameter is also required unless “prompt_edit_type” is “Re-weight”. The API input arguments are as follows:

- image: Optional input image. If provided, DDIM inversion is performed to retrieve initial latents for image generation.
- original_prompt: The prompt used to generate an image with SDXL. This is the starting point for any image editing operation.
- prompt_edit_type: Specifies the type of prompt editing to be applied. Options include Replacement, Refinement, or Re-weight. This choice determines how the edited prompt influences the original SDXL output.
- edited_prompt: The prompt used for editing the original SDXL output image. This parameter is relevant for Replacement and Refinement edit types. For Re-weight, this can be left empty.
- guidance_scale: Text guidance scale. Use higher values for closer alignment with the input prompt.
- local_edit: Indicates specific areas to be edited, represented by comma-separated words. If left as None, the entire image is subject to change.
- cross_replace_steps: The fraction of diffusion steps during which cross-attention is replaced. This is a value between 0 and 1.0.
- self_replace_steps: The fraction of diffusion steps during which self-attention is replaced. Like cross_replace_steps, this is a value between 0 and 1.0.
- equalizer_words: Words to be re-weighted (enhanced or diminished) during the editing process, as a comma-separated list. Required for the Re-weight edit type; leave empty otherwise.
- equalizer_strengths: Strengths associated with the words to be re-weighted. These can be positive (for enhancement) or negative (for diminishment). Provide the values as a comma-separated list in the same order as equalizer_words. Required for the Re-weight edit type; leave empty otherwise.
- num_inversion_steps: Number of diffusion denoising steps for inversion.
- num_inference_steps: Number of diffusion denoising steps for image generation.
- seed: A random seed for generating the original output. Leaving this blank randomizes the seed.
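The requirements above can be checked before a request is sent. The following is a minimal sketch of such a check, assuming the inputs are passed as a plain dictionary keyed by the parameter names listed above. The function `build_input` is a hypothetical helper, not part of the API, and the exact option strings accepted by `prompt_edit_type` are taken from the list above.

```python
EDIT_TYPES = ("Replacement", "Refinement", "Re-weight")


def build_input(original_prompt, prompt_edit_type, edited_prompt=None,
                equalizer_words=None, equalizer_strengths=None,
                cross_replace_steps=None, self_replace_steps=None, **extra):
    """Assemble an input dictionary, enforcing the documented requirements."""
    if not original_prompt:
        raise ValueError("original_prompt is required")
    if prompt_edit_type not in EDIT_TYPES:
        raise ValueError(f"prompt_edit_type must be one of {EDIT_TYPES}")

    payload = {"original_prompt": original_prompt,
               "prompt_edit_type": prompt_edit_type}

    if prompt_edit_type == "Re-weight":
        # Both lists are required and must pair up one-to-one.
        words = [w.strip() for w in (equalizer_words or "").split(",") if w.strip()]
        strengths = [float(s) for s in (equalizer_strengths or "").split(",") if s.strip()]
        if not words or len(words) != len(strengths):
            raise ValueError("Re-weight needs one strength per equalizer word")
        payload["equalizer_words"] = equalizer_words
        payload["equalizer_strengths"] = equalizer_strengths
    else:
        if not edited_prompt:
            raise ValueError("edited_prompt is required for this edit type")
        payload["edited_prompt"] = edited_prompt

    # The replace-step parameters are fractions of the diffusion process.
    for name, value in (("cross_replace_steps", cross_replace_steps),
                        ("self_replace_steps", self_replace_steps)):
        if value is not None:
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1.0")
            payload[name] = value

    # Pass through any other documented parameters (seed, guidance_scale, ...).
    payload.update(extra)
    return payload


reweight = build_input("A photo of a poppy field at night", "Re-weight",
                       equalizer_words="night", equalizer_strengths="2.0")
```

The resulting dictionary can then be handed to whichever client is used to call the model. The strength value `2.0` here is only an illustrative choice.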

Citation

@article{hertz2022prompt,
  title={Prompt-to-prompt image editing with cross attention control},
  author={Hertz, Amir and Mokady, Ron and Tenenbaum, Jay and Aberman, Kfir and Pritch, Yael and Cohen-Or, Daniel},
  journal={arXiv preprint arXiv:2208.01626},
  year={2022}
}