Voxtral Mini 1.0 (3B) - 2507

Voxtral Mini is an enhancement of Ministral 3B, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

Learn more about Voxtral in our blog post here.

Key Features

Voxtral builds upon Ministral-3B with powerful audio understanding capabilities. - Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly - Long-form context: With a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding - Built-in Q&A and summarization: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models - Natively multilingual: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian) - Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents - Highly capable at text: Retains the text understanding capabilities of its language model backbone, Ministral-3B

Benchmark Results

Audio

Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:

image/png

Text

image/png

Usage

The model can be used with the following frameworks; - vllm (recommended): See here - Transformers 🤗: See here

Transcription

Voxtral-Mini-3B-2507 has powerful transcription capabilities!

Make sure that your client has mistral-common with audio installed:

Transformers 🤗

Voxtral is supported in Transformers natively!