adirik/styletts2
Generates speech from text
Pricing
Trainings for this model run on 8x Nvidia A40 (Large) GPU hardware, which costs $0.0058 per second.
Create a training
Note that before you can create a training, you’ll need to create a model and use its name as the value for the destination field.
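As a minimal sketch of that step, the snippet below assembles a training request and passes it to the Replicate Python client. It assumes the `replicate` package is installed and `REPLICATE_API_TOKEN` is set. The destination name, the dataset URL, and the `<version-id>` placeholder are hypothetical and must be replaced with your own values. The input keys follow the parameter names documented on this page.

```python
def build_training_request(destination, dataset_url, version, **overrides):
    """Assemble keyword arguments for replicate.trainings.create().

    destination: "owner/name" of a model you have already created.
    version:     "owner/name:version-id" of the model to train from.
    overrides:   optional training inputs such as num_train_epochs.
    """
    training_input = {"dataset": dataset_url}
    training_input.update(overrides)
    return {
        "version": version,
        "input": training_input,
        "destination": destination,
    }


def start_training(request):
    """Submit the request. Performs a network call, so it is not run here."""
    import replicate  # pip install replicate

    return replicate.trainings.create(**request)


# Hypothetical values, for illustration only.
request = build_training_request(
    destination="your-username/styletts2-custom",
    dataset_url="https://example.com/dataset.zip",
    version="adirik/styletts2:<version-id>",
    num_train_epochs=50,
)
```

Calling `start_training(request)` would then submit the job. Building the request separately makes it easy to inspect the inputs before anything is billed.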
Fine Tuning with Your Own Data
You can use the train endpoint to fine-tune the model on new speakers, then run inference with the fine-tuned model by providing the URL of the resulting weights.
Input parameters are as follows:
- dataset: URL of a .zip file containing the dataset. It must contain a wavs folder of wav files with a 24 kHz sample rate, a train_data.txt file containing the training data, and a validation_data.txt file containing the validation data. If SLM adversarial training is desired, it must also contain an OOD_data.txt file containing out-of-distribution texts.
The dataset must be a zip file whose structure is as follows:
├── wavs
│ ├── 1.wav
│ ├── 2.wav
│ ├── 3.wav
├── train_data.txt
├── validation_data.txt
├── OOD_data.txt
Each line of train_data.txt and validation_data.txt should have the form wav file name|transcription|speaker id. A sample train_data.txt file would be as follows:
1.wav|ðɪs ɪz ðə fɜːst ˈsɑːmpᵊl.|0
2.wav|ðɪs ɪz ðə ˈsɛkənd ˈsɑːmpᵊl.|0
3.wav|ðɪs ɪz ðə θɜːd ˈsɑːmpᵊl.|1
Each line of OOD_data.txt should have the form transcription|speaker id or wav file name|transcription|speaker id. A sample OOD_data.txt file would be as follows:
fɜːst ˈsɑːmpᵊl.|0
ˈsɛkənd ˈsɑːmpᵊl.|0
θɜːd ˈsɑːmpᵊl.|1
- num_train_epochs: Number of epochs to train.
- style_diff_starting_epoch: Epoch to start style diffusion.
- joint_training_starting_epoch: Epoch at which to start SLM adversarial training. If set to a value larger than num_train_epochs, SLM adversarial training will not be performed.
- batch_size: Batch size.
- min_length_ood: Minimum length of OOD texts used for training. This ensures that the synthesized speech has a minimum length.
- max_len_audio: Maximum audio length during training, in frames. With the default hop size of 300 and a 24 kHz sample rate, one frame is 300 / 24000 = 0.0125 seconds. If you hit an out-of-memory error, try a lower value.
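
Before uploading, it can help to check the archive locally. The sketch below is an unofficial pre-flight check written against the layout described above, not part of the model's tooling. It assumes the files sit at the root of the zip, that the first field of each train/validation line names a file inside wavs, that transcriptions contain no `|` character, and that the wav files are PCM so Python's `wave` module can read them. The helper names are hypothetical. It also includes the frame-to-seconds arithmetic for max_len_audio.

```python
import io
import wave
import zipfile

EXPECTED_SAMPLE_RATE = 24000  # Hz, as required for the wav files


def frames_to_seconds(frames, hop_size=300, sample_rate=EXPECTED_SAMPLE_RATE):
    """Convert a max_len_audio value in frames to seconds."""
    return frames * hop_size / sample_rate


def _check_lines(text, allowed_field_counts, label, wav_names, problems):
    """Validate pipe-separated lines; optionally check the wav file exists."""
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("|")
        well_formed = len(fields) in allowed_field_counts and all(
            field.strip() for field in fields
        )
        if not well_formed:
            problems.append(f"{label}:{number}: unexpected field layout")
        elif wav_names is not None and fields[0] not in wav_names:
            problems.append(f"{label}:{number}: {fields[0]} not found in wavs/")


def check_dataset(zip_path):
    """Return a list of problems found in the archive (empty if none)."""
    problems = []
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
        wav_names = set()
        for name in sorted(names):
            if not (name.startswith("wavs/") and name.endswith(".wav")):
                continue
            wav_names.add(name[len("wavs/"):])
            try:
                with wave.open(io.BytesIO(archive.read(name)), "rb") as wav:
                    rate = wav.getframerate()
            except (wave.Error, EOFError):
                problems.append(f"{name}: not a readable PCM wav file")
                continue
            if rate != EXPECTED_SAMPLE_RATE:
                problems.append(f"{name}: sample rate {rate} Hz, expected 24000 Hz")
        if not wav_names:
            problems.append("no .wav files found under wavs/")
        for label in ("train_data.txt", "validation_data.txt"):
            if label not in names:
                problems.append(f"missing {label}")
            else:
                text = archive.read(label).decode("utf-8")
                _check_lines(text, {3}, label, wav_names, problems)
        if "OOD_data.txt" in names:  # only needed for SLM adversarial training
            text = archive.read("OOD_data.txt").decode("utf-8")
            _check_lines(text, {2, 3}, "OOD_data.txt", None, problems)
    return problems
```

For example, `check_dataset("dataset.zip")` returns an empty list when nothing looks wrong, and `frames_to_seconds(400)` shows that a cap of 400 frames corresponds to 5 seconds of audio.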