ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

The base model of the demo is stabilityai/stable-diffusion-xl-base-1.0

Input: “A beautiful girl on a boat”; Resolution: 2048 x 1152.

Input: “Miniature house with plants in the potted area, hyper realism, dramatic ambient lighting, high detail”; Resolution: 4096 x 4096.

Arbitrary higher-resolution generation based on SD 2.1.

🤗 TL; DR

ScaleCrafter is capable of generating images with a resolution of 4096 x 4096 and videos with a resolution of 2048 x 1152 based on pre-trained diffusion models on a lower resolution. Notably, our approach needs no extra training/optimization.

🔆 Abstract

In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.

😉 Citation

@article{he2023scalecrafter,
      title={ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models}, 
      author={Yingqing He and Shaoshu Yang and Haoxin Chen and Xiaodong Cun and Menghan Xia and Yong Zhang and Xintao Wang and Ran He and Qifeng Chen and Ying Shan},
      year={2023},
      eprint={2310.07702},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📭 Contact

If your have any comments or questions, feel free to contact Yingqing He or Shaoshu Yang.

Model created over 1 year ago