zsxkib / voxtral

Voxtral Mini (3B) + Small (24B) 🎙️ Speech transcription and audio understanding in 8 languages 🧠

  • Public
  • 10 runs
  • GitHub
  • Weights
  • Paper
  • License

Run zsxkib/voxtral with an API

Use one of our client libraries to get started quickly. Clicking on a library will take you to the Playground tab where you can tweak different inputs, see the results, and copy the corresponding code to use in your own project.
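For example, here is a minimal sketch using the Replicate Python client (assumed installed with "pip install replicate" and authenticated through the REPLICATE_API_TOKEN environment variable); the audio filename is a placeholder, and the field names come from the input schema below.

import replicate

# Transcription request against zsxkib/voxtral.
# "interview.mp3" is a placeholder for your own audio file.
output = replicate.run(
    "zsxkib/voxtral",
    input={
        "audio": open("interview.mp3", "rb"),  # local file handle; a public URL string also works
        "mode": "transcription",               # speech-to-text
        "language": "Auto-detect",
        "model_size": "mini",                  # 3B variant: faster, less GPU memory
        "max_tokens": 500,
    },
)
print(output)  # the model returns a single string (see the output schema)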

Input schema

The fields you can use to run this model with an API. If you don't give a value for a field, its default value will be used. A usage sketch follows the field list.

audio
  Type: string
  Description: Audio file to process.

mode
  Type: string (enum)
  Default: transcription
  Options: transcription, understanding
  Description: Choose processing mode: 'transcription' converts speech to text, 'understanding' analyzes audio content using prompts.

prompt
  Type: string
  Default: What can you tell me about this audio?
  Description: Question or instruction for understanding mode (e.g., 'What is the speaker discussing?', 'Summarize this audio'). Ignored in transcription mode.

language
  Type: string (enum)
  Default: Auto-detect
  Options: Auto-detect, English, French, German, Spanish, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Arabic
  Description: Audio language. 'Auto-detect' works for most content, or choose a specific language for better accuracy.

model_size
  Type: string (enum)
  Default: mini
  Options: mini, small
  Description: Model selection: 'mini' (3B) is faster and uses less GPU memory; 'small' (24B) provides higher accuracy for complex audio.

max_tokens
  Type: integer
  Default: 500
  Min: 50
  Max: 1000
  Description: Maximum response length. Higher values allow longer outputs but increase processing time.
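In 'understanding' mode the prompt field steers the analysis. A hedged sketch, assuming the same Replicate Python client as above; the audio URL is a placeholder.

import replicate

# Understanding request: the prompt is only used in this mode, and the 24B
# "small" variant trades speed and GPU memory for accuracy on complex audio.
summary = replicate.run(
    "zsxkib/voxtral",
    input={
        "audio": "https://example.com/earnings-call.mp3",  # placeholder URL
        "mode": "understanding",
        "prompt": "Summarize this audio in three sentences.",
        "model_size": "small",
        "max_tokens": 1000,  # allowed range is 50-1000
    },
)
print(summary)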

Output schema

The shape of the response you’ll get when you run this model with an API.

Schema
{
  "type": "string",
  "title": "Output"
}
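Because the output is a plain string, the result can be used directly. A short sketch with the Python client (file paths are illustrative):

import replicate

# The response is a single string, so it can be written straight to disk.
transcript = replicate.run(
    "zsxkib/voxtral",
    input={"audio": open("meeting.wav", "rb"), "mode": "transcription"},
)
with open("meeting_transcript.txt", "w", encoding="utf-8") as f:
    f.write(transcript)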