
zsxkib/voxtral:f5d491cb

Input schema

The fields you can use to run this model with an API. If you don’t give a value for a field, its default value will be used.

| Field | Type | Default value | Description |
| --- | --- | --- | --- |
| audio | string | | Audio file to process. |
| mode | string (enum) | transcription | Options: transcription, understanding. Choose processing mode: 'transcription' converts speech to text, 'understanding' analyzes audio content using prompts. |
| prompt | string | What can you tell me about this audio? | Question or instruction for understanding mode (e.g., 'What is the speaker discussing?', 'Summarize this audio'). Ignored in transcription mode. |
| language | string (enum) | Auto-detect | Options: Auto-detect, English, French, German, Spanish, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Arabic. Audio language. 'Auto-detect' works for most content, or choose a specific language for better accuracy. |
| model_size | string (enum) | mini | Options: mini, small. Model selection: 'mini' (3B) is faster and uses less GPU memory, 'small' (24B) provides higher accuracy for complex audio. |
| max_tokens | integer | 500 | Min: 50. Max: 1000. Maximum response length. Higher values allow longer outputs but increase processing time. |
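The constraints in the input schema can be checked on the client before a request is sent. The following is a minimal sketch, not part of the model's own code: the helper name `build_input` is hypothetical, and it assumes that `audio` is passed as a URL or path string and that omitting `prompt` in transcription mode is acceptable because the field is ignored there.

```python
from typing import Any, Dict, Optional

# Allowed values copied from the input schema above.
MODES = ("transcription", "understanding")
LANGUAGES = (
    "Auto-detect", "English", "French", "German", "Spanish", "Italian",
    "Portuguese", "Dutch", "Russian", "Chinese", "Japanese", "Arabic",
)
MODEL_SIZES = ("mini", "small")
MIN_TOKENS, MAX_TOKENS = 50, 1000


def build_input(
    audio: str,
    mode: str = "transcription",
    prompt: Optional[str] = None,
    language: str = "Auto-detect",
    model_size: str = "mini",
    max_tokens: int = 500,
) -> Dict[str, Any]:
    """Hypothetical helper: validate fields and return an input payload."""
    if not audio:
        raise ValueError("audio is required")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    if language not in LANGUAGES:
        raise ValueError(f"language must be one of {LANGUAGES}")
    if model_size not in MODEL_SIZES:
        raise ValueError(f"model_size must be one of {MODEL_SIZES}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int):
        raise TypeError("max_tokens must be an integer")
    if not MIN_TOKENS <= max_tokens <= MAX_TOKENS:
        raise ValueError(f"max_tokens must be in [{MIN_TOKENS}, {MAX_TOKENS}]")

    payload: Dict[str, Any] = {
        "audio": audio,
        "mode": mode,
        "language": language,
        "model_size": model_size,
        "max_tokens": max_tokens,
    }
    # The prompt is ignored in transcription mode, so only send it when used.
    if mode == "understanding" and prompt is not None:
        payload["prompt"] = prompt
    return payload
```

The returned dictionary is intended to be passed as the `input` of whichever API client you use. Leaving `prompt` out in understanding mode lets the service apply its default question.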

Output schema

The shape of the response you’ll get when you run this model with an API.

Schema
{"title": "Output", "type": "string"}
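Because the output schema is a bare string rather than an object, client code should not index into the response. A minimal sketch of a defensive check follows; the function name `normalize_output` is hypothetical and the whitespace trimming is a convenience assumption, not documented model behavior.

```python
def normalize_output(raw: object) -> str:
    """Hypothetical helper: confirm the response matches the string schema."""
    if not isinstance(raw, str):
        raise TypeError(f"expected a string output, got {type(raw).__name__}")
    # Trim surrounding whitespace so downstream comparisons are stable.
    return raw.strip()
```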