cjwbw/voicecraft:1d0faa22

Input schema

The fields you can use to run this model with an API. If you don’t give a value for a field, its default value will be used.

Field Type Default value Description
task
string (enum)
zero-shot text-to-speech

Options:

speech_editing-substitution, speech_editing-insertion, speech_editing-deletion, zero-shot text-to-speech

Choose a task
voicecraft_model
string (enum)
giga330M_TTSEnhanced.pth

Options:

giga830M.pth, giga330M.pth, giga330M_TTSEnhanced.pth

Choose a model
orig_audio
string
Original audio file
orig_transcript
string
Optionally provide the transcript of the input audio. Leave it blank to use the WhisperX model below to generate the transcript. An inaccurate transcription may lead to errors in TTS or speech editing
whisper_model
string (enum)
whisperx-base.en

Options:

whisperx-base.en, whisperx-small.en, whisperx-medium.en

If orig_transcript is not provided above, choose a WhisperX model to generate it. An inaccurate transcription may lead to errors in TTS or speech editing. You can modify the generated transcript and provide it directly to
target_transcript
string
Transcript of the target audio file
cut_off_sec
number
3.01
Only used for the zero-shot text-to-speech task. The first seconds of the original audio that are used as the reference for zero-shot text-to-speech. 3 sec of reference is generally enough for high-quality voice cloning, but longer is generally better; try, e.g., 3 to 6 sec
kvcache
integer (enum)
1

Options:

0, 1

Set to 0 to use less VRAM, but with slower inference
left_margin
number
0.08
Margin to the left of the editing segment
right_margin
number
0.08
Margin to the right of the editing segment
temperature
number
1
Adjusts the randomness of outputs: values greater than 1 are more random, and 0 is deterministic. Changing this value is not recommended
top_p
number
0.9
Default value for TTS is 0.9, and 0.8 for speech editing
stop_repetition
integer
3
Default value for TTS is 3, and -1 for speech editing. -1 means the probability of silence tokens is not adjusted. If there are long silences or unnaturally stretched words, increase sample_batch_size to 2, 3, or even 4
sample_batch_size
integer
4
Default value for TTS is 4, and 1 for speech editing. The higher the number, the faster the output will be. Under the hood, the model will generate this many samples and choose the shortest one
seed
integer
Random seed. Leave blank to randomize the seed
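
As a rough illustration of how the task-dependent defaults above fit together, the following Python sketch assembles an input payload from the fields in this schema. The helper name build_input, the constant names, and the example audio URL are hypothetical and not part of the model's API. The sketch also assumes that the speech-editing values for top_p, stop_repetition, and sample_batch_size should be applied whenever a speech_editing task is chosen, which the field descriptions suggest but the schema itself does not enforce.

```python
# Hypothetical helper (not part of the model's API): builds an input payload
# from the defaults and enum options listed in the input schema.

TASKS = (
    "speech_editing-substitution",
    "speech_editing-insertion",
    "speech_editing-deletion",
    "zero-shot text-to-speech",
)
MODELS = ("giga830M.pth", "giga330M.pth", "giga330M_TTSEnhanced.pth")
WHISPER_MODELS = ("whisperx-base.en", "whisperx-small.en", "whisperx-medium.en")

# Schema defaults; these coincide with the values recommended for TTS.
DEFAULTS = {
    "task": "zero-shot text-to-speech",
    "voicecraft_model": "giga330M_TTSEnhanced.pth",
    "whisper_model": "whisperx-base.en",
    "cut_off_sec": 3.01,
    "kvcache": 1,
    "left_margin": 0.08,
    "right_margin": 0.08,
    "temperature": 1,
    "top_p": 0.9,
    "stop_repetition": 3,
    "sample_batch_size": 4,
}

# Values the field descriptions recommend for speech editing
# (assumption: apply them to all three speech_editing tasks).
EDITING_OVERRIDES = {"top_p": 0.8, "stop_repetition": -1, "sample_batch_size": 1}


def build_input(orig_audio, target_transcript, task=DEFAULTS["task"], **overrides):
    """Return an input dict; explicit overrides win over task-based defaults."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")

    payload = dict(DEFAULTS, task=task)
    if task.startswith("speech_editing"):
        payload.update(EDITING_OVERRIDES)
    payload.update(overrides)

    # Check the enum-typed fields against the options listed in the schema.
    if payload["voicecraft_model"] not in MODELS:
        raise ValueError(f"unknown voicecraft_model: {payload['voicecraft_model']!r}")
    if payload["whisper_model"] not in WHISPER_MODELS:
        raise ValueError(f"unknown whisper_model: {payload['whisper_model']!r}")
    if payload["kvcache"] not in (0, 1):
        raise ValueError("kvcache must be 0 or 1")

    payload["orig_audio"] = orig_audio
    payload["target_transcript"] = target_transcript
    return payload


if __name__ == "__main__":
    # The URL is a placeholder, not a real audio file.
    print(build_input("https://example.com/reference.wav", "Hello there."))
```

Fields left out of the payload, such as orig_transcript and seed, fall back to the behaviour described above: the transcript is generated by the selected WhisperX model and the seed is randomized.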

Output schema

The shape of the response you’ll get when you run this model with an API.

Schema
{'properties': {'generated_audio': {'format': 'uri',
                                    'title': 'Generated Audio',
                                    'type': 'string'},
                'whisper_transcript_orig_audio': {'title': 'Whisper Transcript '
                                                           'Orig Audio',
                                                  'type': 'string'}},
 'required': ['whisper_transcript_orig_audio', 'generated_audio'],
 'title': 'ModelOutput',
 'type': 'object'}
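
A response can be checked against this schema with a few lines of Python. The sketch below is a hand-written minimal check, not a full JSON Schema validator: it only verifies that the required keys are present and that string-typed properties hold strings. The function name validate_output and the sample response values are made up for illustration.

```python
# The output schema as shown above, copied into a Python dict.
OUTPUT_SCHEMA = {
    "properties": {
        "generated_audio": {
            "format": "uri",
            "title": "Generated Audio",
            "type": "string",
        },
        "whisper_transcript_orig_audio": {
            "title": "Whisper Transcript Orig Audio",
            "type": "string",
        },
    },
    "required": ["whisper_transcript_orig_audio", "generated_audio"],
    "title": "ModelOutput",
    "type": "object",
}


def validate_output(response, schema=OUTPUT_SCHEMA):
    """Minimal check (hypothetical helper): required keys and string types only."""
    if not isinstance(response, dict):
        raise TypeError("response must be an object (dict)")

    missing = [key for key in schema["required"] if key not in response]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    for name, spec in schema["properties"].items():
        if name in response and spec["type"] == "string":
            if not isinstance(response[name], str):
                raise TypeError(f"{name} must be a string")
    return response


if __name__ == "__main__":
    # Made-up sample values, shaped like the schema.
    sample = {
        "generated_audio": "https://example.com/output.wav",
        "whisper_transcript_orig_audio": "Hello there.",
    }
    print(validate_output(sample)["generated_audio"])
```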