
Step1X-Edit: Advanced Image Editing ✨ (Cog Implementation)

This Replicate model runs Step1X-Edit, an advanced image editing model developed by StepFun AI. It lets you edit images based on a reference image and a text instruction.

  • Original Project Page: step1x-edit.github.io
  • Technical Report (arXiv): arxiv.org/abs/2504.17761
  • Original Model (Hugging Face): stepfun-ai/Step1X-Edit
  • Online Demo: Step1X-Edit Space

About the Step1X-Edit Model

Step1X-Edit is designed for general-purpose image editing based on user instructions. It takes a reference image and a text prompt describing the desired change (e.g., “remove the person”, “change the background to a beach”, “make it look like a watercolor painting”). The model uses a multimodal language model (Qwen-VL) to understand the image and the instruction, and then guides a diffusion process to generate the edited image while preserving relevant parts of the original.

Key Features & Capabilities ✨

  • Instruction-Based Editing 📝: Modifies images according to natural language prompts.
  • Reference Image Guided 🖼️: Uses the input image as a base, aiming to maintain consistency where edits aren’t requested.
  • Versatile Edits: Capable of various edits like object removal/addition, background replacement, style changes, and more.
  • Variable Resolution 📐: Supports different internal processing resolutions (size_level) to balance speed and detail capture.
  • Reproducibility 🌱: Allows setting a random seed for consistent results with the same inputs.
  • Format Control 💾: Lets you choose the output format (webp, jpg, png) and quality for lossy formats.
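
A minimal sketch of calling this model from Python with the Replicate client is shown below. The input names mirror the parameters described on this page; the concrete values (and the bare model slug without a version hash) are illustrative assumptions.

# Hedged example using the Replicate Python client (pip install replicate).
# Input names follow this README; the values shown are only examples.
import replicate

output = replicate.run(
    "zsxkib/step1x-edit",
    input={
        "image": open("photo.jpg", "rb"),  # reference image to edit
        "prompt": "remove the person",     # natural-language edit instruction
        "seed": 42,                        # fixed seed for reproducible results
        "size_level": 1024,                # internal processing resolution (assumed value)
        "output_format": "webp",           # webp, jpg, or png
        "output_quality": 90,              # quality for lossy formats
    },
)
print(output)  # URL or file handle for the edited image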

Replicate Implementation Details ⚙️

This Cog container packages the Step1X-Edit model and its dependencies for use on Replicate.
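
For orientation, here is a hedged sketch of what the Cog predictor interface for this container could look like. The parameter names match those listed under Workflow below; the descriptions, defaults, and value ranges are assumptions rather than the actual predict.py.

# Hypothetical Cog predictor skeleton; only the parameter names come from this
# README, everything else (defaults, descriptions, ranges) is assumed.
from cog import BasePredictor, Input, Path


class Predictor(BasePredictor):
    def setup(self) -> None:
        # Download weights, then load the VAE, DiT, and Qwen encoder onto the GPU.
        ...

    def predict(
        self,
        image: Path = Input(description="Reference image to edit"),
        prompt: str = Input(description="Editing instruction, e.g. 'remove the person'"),
        size_level: int = Input(description="Internal processing resolution", default=1024),
        seed: int = Input(description="Random seed; leave empty to randomize", default=None),
        output_format: str = Input(choices=["webp", "jpg", "png"], default="webp"),
        output_quality: int = Input(description="Quality for lossy formats", ge=1, le=100, default=90),
    ) -> Path:
        # Preprocess, condition, denoise, decode, then save the edited image.
        ...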

  • Core Models: Uses the Step1X-Edit diffusion model (step1x-edit-i1258.safetensors), a VAE (vae.safetensors), and the Qwen 2.5 VL 7B Instruct model (Qwen/Qwen2.5-VL-7B-Instruct) for multimodal conditioning.
  • Dependencies: Runs on PyTorch (torch, torchvision) and requires libraries like einops, numpy, Pillow, safetensors, tqdm, and Hugging Face transformers. Relies on custom modules (modules/autoencoder.py, modules/conditioner.py, modules/model_edit.py, sampling.py) from the original implementation. Assumes a CUDA-enabled environment.
  • Weight Handling: The main Step1X-Edit weights (DiT and VAE, packaged in Step1X-Edit.tar) are downloaded efficiently using pget during container setup from a Replicate cache (https://weights.replicate.delivery/default/step1x-edit/model_cache/). The Qwen model weights are downloaded via the Hugging Face Hub library. All weights are stored in the model_cache directory, and relevant environment variables (HF_HOME, TORCH_HOME, etc.) are set to use this cache.
  • Workflow (predict.py; a simplified sketch of the setup and image resizing steps follows this list):
    1. Setup: Downloads and extracts the main Step1X-Edit weights using pget if not cached. Instantiates the VAE, DiT (Step1X-Edit), and Qwen encoder models, loading weights and moving them to the GPU.
    2. Input: Receives the input image (Path), prompt (string), and parameters like size_level, seed, output_format, output_quality.
    3. Preprocessing: Loads the input image with Pillow, processes it to the target size_level while maintaining aspect ratio (input_process_image). Encodes the reference image using the VAE (ae.encode).
    4. Conditioning: Uses the Qwen encoder (llm_encoder) to process the prompt and reference image, generating text embeddings and masks. Prepares the input dictionary for the diffusion model (prepare), combining image latents, reference latents, and conditioning.
    5. Denoising: Sets the random seed. Generates initial noise (torch.randn). Runs the denoising loop (denoise) for a fixed 28 steps using a predefined schedule (sampling.get_schedule), applying classifier-free guidance (CFG fixed at 6.0).
    6. Decoding: Decodes the final latent representation back into pixel space using the VAE (ae.decode).
    7. Postprocessing: Converts the output tensor to a PIL image, resizes it back to the original input dimensions (output_process_image), and saves it to a temporary file in the specified output_format with the chosen output_quality.
    8. Output: Returns the Path to the saved edited image file.
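
The sketch below expands on steps 1, 3, and 7 above in simplified form: downloading and caching the weights, fitting the reference image to size_level, and resizing the result back afterwards. The names input_process_image and output_process_image come from predict.py, but the function bodies, the exact tarball URL, the pget flags, and the 16-pixel alignment are assumptions.

# Simplified sketch of setup and image pre/post-processing (workflow steps 1, 3, 7).
# WEIGHTS_TAR_URL, the pget flags, and the 16-pixel alignment are assumptions; the
# file names, cache directory, and environment variables follow the description above.
import os
import subprocess

from PIL import Image

MODEL_CACHE = "model_cache"
WEIGHTS_TAR_URL = "https://weights.replicate.delivery/default/step1x-edit/model_cache/Step1X-Edit.tar"

# Point framework caches at the local model_cache directory.
os.environ["HF_HOME"] = MODEL_CACHE
os.environ["TORCH_HOME"] = MODEL_CACHE


def download_weights() -> None:
    # Fetch and extract the packaged DiT + VAE weights if they are not cached yet.
    if not os.path.exists(os.path.join(MODEL_CACHE, "step1x-edit-i1258.safetensors")):
        os.makedirs(MODEL_CACHE, exist_ok=True)
        # pget -x downloads a tarball and extracts it in one pass.
        subprocess.check_call(["pget", "-x", WEIGHTS_TAR_URL, MODEL_CACHE])


def input_process_image(img: Image.Image, size_level: int) -> Image.Image:
    # Fit the reference image inside roughly size_level pixels, keeping aspect ratio
    # and snapping dimensions to multiples of 16 (a common VAE/latent constraint).
    w, h = img.size
    scale = size_level / max(w, h)
    new_w = max(16, int(w * scale) // 16 * 16)
    new_h = max(16, int(h * scale) // 16 * 16)
    return img.resize((new_w, new_h), Image.LANCZOS)


def output_process_image(edited: Image.Image, original_size: tuple[int, int]) -> Image.Image:
    # Resize the edited output back to the original input dimensions.
    return edited.resize(original_size, Image.LANCZOS)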

Underlying Technologies & Concepts 🔬

  • Latent Diffusion Models: Operates in the latent space compressed by a VAE, enabling more efficient computation compared to pixel-space diffusion.
  • Diffusion Transformers (DiT): The core Step1X-Edit network is a diffusion transformer, the architecture family behind recent high-performance diffusion models.
  • Multimodal Large Language Models (MLLMs): Leverages Qwen-VL to understand the relationship between the input image and the text instruction, providing crucial conditioning for the edit.
  • Classifier-Free Guidance (CFG): Uses CFG during denoising to improve adherence to the prompt (though the negative prompt is fixed as empty and guidance scale is fixed at 6.0 in this implementation).
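
As a quick illustration of that guidance step (not the exact code in sampling.py), the conditional and unconditional predictions are blended with the fixed scale of 6.0:

# Illustrative classifier-free guidance combine; tensor shapes are arbitrary.
import torch

def cfg_combine(pred_cond: torch.Tensor, pred_uncond: torch.Tensor, scale: float = 6.0) -> torch.Tensor:
    # Push the prediction away from the unconditional output, toward the prompt.
    return pred_uncond + scale * (pred_cond - pred_uncond)

# Toy tensors standing in for the model's conditional / unconditional outputs.
cond, uncond = torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64)
guided = cfg_combine(cond, uncond, scale=6.0)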

Use Cases 💡

  • Removing unwanted objects or people from photos.
  • Adding new elements into an image based on text descriptions.
  • Changing the background of a picture.
  • Altering the style or mood of an image (e.g., “make it look like a sketch”, “make it night time”).
  • Modifying attributes of objects (e.g., “change the color of the car to red”).
  • Creative content generation and photorealistic manipulation.

Limitations ⚠️

  • GPU Requirements: Needs a powerful NVIDIA GPU with significant VRAM; the original authors suggest more than 40 GB for running at full precision.
  • Fixed Parameters: This specific implementation uses fixed values for the number of inference steps (28) and CFG scale (6.0), which might not be optimal for all possible edits. The negative prompt is also implicitly empty.
  • Instruction Following: Complex or ambiguous instructions might not always be interpreted as intended. The quality of the edit depends heavily on the model’s understanding of the prompt relative to the image.
  • Artifacts: Like many generative models, results might occasionally contain visual artifacts or imperfections.
  • Resolution vs. Detail: Lower size_level values run faster but might miss fine details or struggle with complex edits compared to higher resolutions.

License & Disclaimer 📜

The original Step1X-Edit model and code are licensed under the Apache License 2.0. See the Hugging Face repository for details. The Cog packaging code in this repository is MIT licensed.

Disclaimer: This model generates images based on user inputs. Users are responsible for the content they generate and must adhere to ethical guidelines and the original model’s license terms. Do not use this model for creating harmful, misleading, or infringing content. The maintainer of this Replicate packaging is not responsible for user-generated outputs.

Citation 📚

If you use Step1X-Edit in your work, please cite their technical report:

@article{liu2025step1x-edit,
      title={Step1X-Edit: A Practical Framework for General Image Editing},
      author={Shiyu Liu and Yucheng Han and Peng Xing and Fukun Yin and Rui Wang and Wei Cheng and Jiaqi Liao and Yingming Wang and Honghao Fu and Chunrui Han and Guopeng Li and Yuang Peng and Quan Sun and Jingwei Wu and Yan Cai and Zheng Ge and Ranchen Ming and Lei Xia and Xianfang Zeng and Yibo Zhu and Binxing Jiao and Xiangyu Zhang and Gang Yu and Daxin Jiang},
      journal={arXiv preprint arXiv:2504.17761},
      year={2025}
}

Cog implementation managed by zsxkib.

Star the Cog repo on GitHub! ⭐

Follow me on Twitter/X