zsxkib/dia:91a8c206

Input schema

The fields you can use to run this model with an API. If you don’t give a value for a field, its default value is used.

| Field | Type | Default value | Range | Description |
|---|---|---|---|---|
| text | string | (none) | | Input text for dialogue generation. Use [S1], [S2] to indicate different speakers and (description) in parentheses for non-verbal cues e.g., (laughs), (whispers). |
| audio_prompt | string | (none) | | Optional audio file (.wav/.mp3/.flac) for voice cloning. The model will attempt to mimic this voice style. |
| max_new_tokens | integer | 3072 | 500 to 4096 | Controls the length of generated audio. Higher values create longer audio. (86 tokens ≈ 1 second of audio). |
| cfg_scale | number | 3 | 1 to 5 | Controls how closely the audio follows your text. Higher values (3-5) follow text more strictly; lower values may sound more natural but deviate more. |
| temperature | number | 1.3 | 0.1 to 2 | Controls randomness in generation. Higher values (1.3-2.0) increase variety; lower values (0.1-0.9) make output more consistent and predictable. |
| top_p | number | 0.95 | 0.1 to 1 | Controls diversity of word choice. Higher values include more unusual options. Most users shouldn't need to adjust this parameter. |
| cfg_filter_top_k | integer | 35 | 10 to 100 | Technical parameter for filtering audio generation tokens. Higher values allow more diverse sounds; lower values create more consistent audio. |
| speed_factor | number | 0.94 | 0.5 to 1.5 | Adjusts playback speed of the generated audio. Values below 1.0 slow down the audio; 1.0 is original speed. |
| seed | integer | (none) | | Random seed for reproducible results. Use the same seed value to get the same output for identical inputs. Leave blank for random results each time. |
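
The defaults and ranges above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the model's API: `build_input` and `tokens_for_seconds` are hypothetical helper names, and the 86 tokens per second figure is the approximation given in the max_new_tokens description. The duration estimate ignores speed_factor.

```python
import math

# (default, min, max) copied from the input schema fields above.
# text, audio_prompt and seed have no listed default or range.
SCHEMA = {
    "max_new_tokens": (3072, 500, 4096),
    "cfg_scale": (3, 1, 5),
    "temperature": (1.3, 0.1, 2),
    "top_p": (0.95, 0.1, 1),
    "cfg_filter_top_k": (35, 10, 100),
    "speed_factor": (0.94, 0.5, 1.5),
}
UNBOUNDED_FIELDS = ("audio_prompt", "seed")
TOKENS_PER_SECOND = 86  # approximate, per the max_new_tokens description


def tokens_for_seconds(seconds):
    """Estimate max_new_tokens for a target duration, clamped to the schema range."""
    _, low, high = SCHEMA["max_new_tokens"]
    return max(low, min(high, math.ceil(seconds * TOKENS_PER_SECOND)))


def build_input(text, **overrides):
    """Hypothetical helper: start from the defaults, apply overrides, range-check them."""
    payload = {"text": text}
    payload.update({name: spec[0] for name, spec in SCHEMA.items()})
    for name, value in overrides.items():
        if name in SCHEMA:
            _, low, high = SCHEMA[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        elif name not in UNBOUNDED_FIELDS:
            raise KeyError(f"unknown field: {name}")
        payload[name] = value
    return payload


# Example: roughly ten seconds of two-speaker dialogue with a fixed seed.
example = build_input(
    "[S1] Hello there. [S2] Hi! (laughs)",
    max_new_tokens=tokens_for_seconds(10),
    seed=42,
)
print(example)
```

The resulting dictionary has the same keys as the fields listed above, so it could be passed as the input object of an API call. Requests shorter than about six seconds are clamped to the 500-token minimum.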

Output schema

The shape of the response you’ll get when you run this model with an API.

Schema
{"type": "string", "title": "Output", "format": "uri"}
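
Since the schema describes the output as a single string in URI format, a caller may want to sanity-check the response before trying to download anything. The sketch below is a loose approximation of that check using only the standard library; `is_valid_output` is a hypothetical helper name and the example URL is a placeholder, not a real output.

```python
from urllib.parse import urlparse


def is_valid_output(value):
    """Return True if value is a string that parses as a URI with a scheme."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    # Loose approximation of "format: uri": a scheme plus a host or a path.
    return bool(parsed.scheme and (parsed.netloc or parsed.path))


print(is_valid_output("https://example.com/output.wav"))  # placeholder URL
print(is_valid_output("output.wav"))
```

A stricter caller could additionally require an `https` scheme, but the schema itself does not state which schemes are returned.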