cjwbw/voicecraft:1d0faa22 – Run with an API on Replicate

cjwbw/voicecraft:1d0faa22

Input schema

The fields you can use to run this model with an API. If you don’t give a value for a field, its default value will be used.

Field	Type	Default value	Description
task	string (enum)	zero-shot text-to-speech Options: speech_editing-substitution, speech_editing-insertion, speech_editing-deletion, zero-shot text-to-speech	Choose a task
voicecraft_model	string (enum)	giga330M_TTSEnhanced.pth Options: giga830M.pth, giga330M.pth, giga330M_TTSEnhanced.pth	Choose a model
orig_audio	string		Original audio file
orig_transcript	string		Optionally provide the transcript of the input audio. Leave it blank to use the WhisperX model below to generate the transcript. An inaccurate transcript may cause errors in TTS or speech editing
whisper_model	string (enum)	whisperx-base.en Options: whisperx-base.en, whisperx-small.en, whisperx-medium.en	If orig_transcript is not provided above, choose a WhisperX model. An inaccurate transcript may cause errors in TTS or speech editing. You can modify the generated transcript and provide it directly to
target_transcript	string		Transcript of the target audio file
cut_off_sec	number	3.01	Used only for the zero-shot text-to-speech task. The number of seconds, from the start of the original audio, used as the reference for zero-shot text-to-speech. About 3 seconds of reference audio is generally enough for high-quality voice cloning, but longer is generally better; try 3 to 6 seconds
kvcache	integer (enum)	1 Options: 0, 1	Set to 0 to use less VRAM, but with slower inference
left_margin	number	0.08	Margin to the left of the editing segment
right_margin	number	0.08	Margin to the right of the editing segment
temperature	number	1	Adjusts the randomness of outputs: values greater than 1 are more random, and 0 is deterministic. Changing this value is not recommended
top_p	number	0.9	Use 0.9 (the default) for TTS and 0.8 for speech editing
stop_repetition	integer	3	Use 3 (the default) for TTS and -1 for speech editing. -1 means the probability of silence tokens is not adjusted. If there are long silences or unnaturally stretched words, increase sample_batch_size to 2, 3, or even 4
sample_batch_size	integer	4	Use 4 (the default) for TTS and 1 for speech editing. The higher the number, the faster the output will be. Under the hood, the model generates this many samples and chooses the shortest one
seed	integer		Random seed. Leave blank to randomize the seed
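
Several fields above have different suggested values depending on the task (top_p, stop_repetition, sample_batch_size), and cut_off_sec applies only to zero-shot text-to-speech. The following is a minimal sketch of how an input payload could be assembled with that in mind. The helper name build_input and the example URL are hypothetical and not part of the model's API; passing orig_audio as a URL string is also an assumption, since the schema only says the field is a string.

```python
# Enum values copied from the input schema table above.
TASKS = {
    "speech_editing-substitution",
    "speech_editing-insertion",
    "speech_editing-deletion",
    "zero-shot text-to-speech",
}
VOICECRAFT_MODELS = {"giga830M.pth", "giga330M.pth", "giga330M_TTSEnhanced.pth"}
WHISPER_MODELS = {"whisperx-base.en", "whisperx-small.en", "whisperx-medium.en"}


def build_input(orig_audio, target_transcript, task="zero-shot text-to-speech",
                orig_transcript=None, seed=None, **overrides):
    """Hypothetical helper: build an input dict for this model version.

    Picks the values the field descriptions suggest for the chosen task,
    then applies any explicit overrides and checks the enum fields.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")

    is_tts = task == "zero-shot text-to-speech"
    payload = {
        "task": task,
        "voicecraft_model": "giga330M_TTSEnhanced.pth",
        "orig_audio": orig_audio,  # assumed to be a URL to the audio file
        "target_transcript": target_transcript,
        "top_p": 0.9 if is_tts else 0.8,
        "stop_repetition": 3 if is_tts else -1,
        "sample_batch_size": 4 if is_tts else 1,
    }
    if is_tts:
        # cut_off_sec is only used by the zero-shot text-to-speech task.
        payload["cut_off_sec"] = 3.01
    if orig_transcript:
        # When omitted, the model transcribes orig_audio with WhisperX.
        payload["orig_transcript"] = orig_transcript
    if seed is not None:
        payload["seed"] = seed

    payload.update(overrides)

    if payload["voicecraft_model"] not in VOICECRAFT_MODELS:
        raise ValueError(f"unknown voicecraft_model: {payload['voicecraft_model']!r}")
    if "whisper_model" in payload and payload["whisper_model"] not in WHISPER_MODELS:
        raise ValueError(f"unknown whisper_model: {payload['whisper_model']!r}")
    return payload


# Example with a placeholder URL (hypothetical, for illustration only).
tts_input = build_input("https://example.com/reference.wav", "Hello there.")
edit_input = build_input(
    "https://example.com/reference.wav",
    "Hello again.",
    task="speech_editing-substitution",
    seed=7,
)
```

The returned dictionary is what would go in the input of a prediction request. Fields left out of it fall back to the defaults listed in the table.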

Output schema

The shape of the response you’ll get when you run this model with an API.

Schema

{'properties': {'generated_audio': {'format': 'uri',
                                    'title': 'Generated Audio',
                                    'type': 'string'},
                'whisper_transcript_orig_audio': {'title': 'Whisper Transcript '
                                                           'Orig Audio',
                                                  'type': 'string'}},
 'required': ['whisper_transcript_orig_audio', 'generated_audio'],
 'title': 'ModelOutput',
 'type': 'object'}
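
The schema above says a response is an object with two required string fields. As a rough illustration, a response could be checked against that shape before use, as in the sketch below. The function name parse_output and the sample response values are hypothetical; only the two field names come from the schema.

```python
REQUIRED_FIELDS = ("whisper_transcript_orig_audio", "generated_audio")


def parse_output(output):
    """Hypothetical helper: check a response dict against the output schema.

    Returns (generated_audio_uri, whisper_transcript) or raises if a
    required field is missing or is not a string.
    """
    missing = [name for name in REQUIRED_FIELDS if name not in output]
    if missing:
        raise KeyError(f"missing required field(s): {', '.join(missing)}")
    for name in REQUIRED_FIELDS:
        if not isinstance(output[name], str):
            raise TypeError(f"{name} must be a string")
    return output["generated_audio"], output["whisper_transcript_orig_audio"]


# Made-up response for illustration; real values come from the API.
sample_output = {
    "generated_audio": "https://example.com/generated.wav",
    "whisper_transcript_orig_audio": "Hello there.",
}
audio_uri, transcript = parse_output(sample_output)
```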