stability-ai/sdxl
A text-to-image generative AI model that creates beautiful images
Train stability-ai/sdxl
You can train SDXL on a particular object or style to create a new model that generates images of that object or style. Training requires only a few images and takes about 10–15 minutes. You can also download your fine-tuned LoRA weights to use elsewhere.
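If you start trainings from code rather than the web UI, the call looks roughly like the sketch below. It assumes the Replicate Python client; the version hash, zip URL, and destination model name are placeholders you would replace with your own values.

```python
import replicate

# Minimal sketch of starting an SDXL fine-tune with the Replicate Python
# client. The version hash, zip URL, and destination are placeholders.
training = replicate.trainings.create(
    version="stability-ai/sdxl:<version-hash>",
    input={
        "input_images": "https://example.com/my-training-images.zip",
        "token_string": "TOK",
        "caption_prefix": "a photo of TOK",
    },
    destination="your-username/your-new-model",
)

print(training.status)  # e.g. "starting"
```

When the training finishes, you get a new model version under the destination you specified, along with the downloadable LoRA weights mentioned above.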
Trainings for this model run on Nvidia L40S GPU hardware, which costs $0.000975 per second.
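At that rate, a typical 10–15 minute training run works out to roughly $0.59–$0.88 (600–900 seconds × $0.000975 per second).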
Before fine-tuning starts, the input images are preprocessed using SwinIR for upscaling, BLIP for captioning, and CLIPSeg for removing regions of the images that are not interesting or helpful for training.
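To illustrate what the prompt-based masking step does conceptually, here is a minimal sketch using the publicly available CLIPSeg checkpoint from Hugging Face transformers. This is not the model's actual preprocessing code; the checkpoint name, image file, prompt, and temperature value are assumptions for the example.

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# Illustrative sketch of prompt-based masking, roughly analogous to the
# CLIPSeg preprocessing step (not the actual pipeline code).
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("dog.jpg").convert("RGB")   # placeholder image
prompt = "photo of a dog"                      # plays the role of mask_target_prompts

inputs = processor(text=[prompt], images=[image], return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # low-resolution relevance map

# A temperature-like divisor controls how soft or sharp the mask is,
# similar in spirit to clipseg_temperature.
temperature = 1.0
mask = torch.sigmoid(logits / temperature)
```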
Below is a list of all fine-tuning parameters.
Training inputs
input_images (required): A .zip or .tar file containing the image files that will be used for fine-tuning.
seed: Random seed integer for reproducible training. Leave empty to use a random seed.
resolution: Square pixel resolution which your images will be resized to for training. Defaults to 512.
train_batch_size: Batch size (per device) for training. Defaults to 4.
num_train_epochs: Number of epochs to loop through your training dataset. Defaults to 4000.
max_train_steps: Number of individual training steps. Takes precedence over num_train_epochs. Defaults to 1000.
is_lora: Boolean indicating whether to use LoRA training. If set to False, full fine-tuning is used. Defaults to True.
unet_learning_rate: Learning rate for the U-Net as a float. We recommend a value somewhere between 1e-6 and 1e-5. Defaults to 1e-6.
ti_lr: Scaling of learning rate for training textual inversion embeddings. Don't alter unless you know what you're doing. Defaults to 3e-4.
lora_lr: Scaling of learning rate for training LoRA embeddings. Don't alter unless you know what you're doing. Defaults to 1e-4.
lr_scheduler: Learning rate scheduler to use for training. Allowable values are constant or linear. Defaults to constant.
lr_warmup_steps: Number of warmup steps for learning rate schedulers with warmups. Defaults to 100.
token_string: A unique string that will be trained to refer to the concept in the input images. Can be anything, but TOK works well. Defaults to TOK.
caption_prefix: Text used as a prefix during automatic captioning. Must contain the token_string. For example, if the caption prefix is "a photo of TOK", automatic captioning will expand it to "a photo of TOK under a bridge", "a photo of TOK holding a cup", and so on. Defaults to a photo of TOK.
mask_target_prompts: Prompt describing the part of the image that is important to you. For example, if you are fine-tuning on your pet, "photo of a dog" would be a good prompt. Prompt-based masking is used to focus the fine-tuning process on the important/salient parts of the image. Defaults to None.
crop_based_on_salience: If you want to crop the image to target_size based on the important parts of the image, set this to True. If you want to crop the image based on face detection, set this to False. Defaults to True.
use_face_detection_instead: Whether to use face detection instead of CLIPSeg for masking. For face applications, we recommend using this option. Defaults to False.
clipseg_temperature: How blurry you want the CLIPSeg mask to be. We recommend a value between 0.5 and 1.0. If you want a sharper mask (at the cost of more errors), decrease this value. Defaults to 1.0.
verbose: Verbose output. Defaults to True.
checkpointing_steps: Number of steps between saving checkpoints. Set to a very high number to disable checkpointing if you don't need intermediate checkpoints. Defaults to 200.
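As a concrete illustration of how these inputs fit together, here is one possible input dictionary for fine-tuning on photos of a pet. It is a sketch, not a recommended configuration: the zip URL is a placeholder, and any value that differs from the defaults above is just an example choice. This dictionary would be passed as the input argument of the training call shown earlier.

```python
# Example fine-tuning inputs for a pet; values are illustrative, and the
# zip URL is a placeholder.
training_input = {
    "input_images": "https://example.com/my-dog-photos.zip",  # required
    "token_string": "TOK",
    "caption_prefix": "a photo of TOK",
    "mask_target_prompts": "photo of a dog",
    "use_face_detection_instead": False,
    "clipseg_temperature": 1.0,
    "resolution": 512,
    "train_batch_size": 4,
    "max_train_steps": 1000,
    "is_lora": True,
    "unet_learning_rate": 1e-6,
    "ti_lr": 3e-4,
    "lora_lr": 1e-4,
    "lr_scheduler": "constant",
    "lr_warmup_steps": 100,
    "seed": 42,
    "checkpointing_steps": 200,
}
```

Any parameter you leave out falls back to the default listed above.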