# Voxtral: Speech Understanding That Actually Gets It 🎙️🧠
## Overview
`mistralai/Voxtral-Mini-3B-2507` + `mistralai/Voxtral-Small-24B-2507`
Voxtral is Mistral AI’s language model that learned how to hear. Unlike basic speech-to-text tools that just write down words, Voxtral actually understands what’s happening in audio. You can ask it questions, get summaries, or get accurate transcriptions in 8 languages.
This tool is built upon the amazing work of Mistral AI and their Voxtral research. We’ve wrapped their model to work on Replicate, making world-class audio understanding accessible to everyone through a simple web interface!
Support Mistral AI and learn more about their work:
- Voxtral Research Blog
- Small Hugging Face Model
- Mini Hugging Face Model
## What Voxtral does ✨
Most speech models either transcribe OR understand. Voxtral does both really well.
**Transcription mode** gives you accurate speech-to-text that automatically figures out what language people are speaking. It works with English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian. You can throw up to 30 minutes of audio at it and it’ll write down what everyone said.
**Understanding mode** is where things get interesting. Ask it questions about a podcast and it can tell you what the host was discussing. Play it a meeting recording and it can summarize the key points or tell you who was speaking. This mode can handle up to 40 minutes of audio and understands context across the entire conversation.
You can also use it to trigger functions in your code just by talking to it. No need to build separate speech recognition systems.
There are two sizes: the mini version (3 billion parameters) runs faster and uses less memory, while the small version (24 billion parameters) gives you better accuracy for challenging audio with background noise or multiple speakers.
## How to get the best results
**Basic approach:**
- Upload any audio file: meetings, podcasts, interviews, anything
- Choose “transcription” for speech-to-text or “understanding” for analysis
- For understanding mode, ask clear questions about what you want to know
**Advanced control:**
- Use the mini model for faster processing of clear audio
- Use the small model for challenging audio with background noise or multiple speakers
- Set language manually if auto-detection isn’t working well
- Adjust max_tokens for longer or shorter responses
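If you’d rather call the model from code than the web UI, here’s a minimal sketch of a transcription request using Replicate’s Python client. The model slug `zsxkib/voxtral` is a hypothetical placeholder (copy the real slug from the model page); the input names follow the parameter reference further down this README.

```python
# Minimal sketch of a transcription call via the Replicate Python client.
# Assumptions: "zsxkib/voxtral" is a hypothetical slug, and the input names
# mirror the parameter reference in this README.
import replicate

transcript = replicate.run(
    "zsxkib/voxtral",  # hypothetical slug; use the one shown on the model page
    input={
        "audio": open("meeting.mp3", "rb"),
        "mode": "transcription",  # plain speech-to-text
        "model_size": "mini",     # "mini" for speed, "small" for tough audio
    },
)
print(transcript)
```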
**Example questions for understanding mode:**

For business meetings:
- “What were the main decisions made in this meeting?”
- “Who spoke the most and what were their key points?”
- “What action items were discussed?”
For podcasts/interviews:
- “What is the main topic being discussed?”
- “What are the host’s key arguments?”
- “Summarize the guest’s background and expertise”
For content analysis:
- “What’s the emotional tone of this conversation?”
- “Are there any controversial points raised?”
- “What questions do the speakers ask each other?”
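To ask any of these questions programmatically, switch to understanding mode and pass the question as the prompt. Same hedged sketch, same assumed slug as above:

```python
# Understanding mode: ask a question about the audio instead of transcribing it.
import replicate

summary = replicate.run(
    "zsxkib/voxtral",  # hypothetical slug, as above
    input={
        "audio": open("meeting.mp3", "rb"),
        "mode": "understanding",
        "prompt": "What were the main decisions made in this meeting?",
        "max_tokens": 500,  # leave room for a detailed answer
    },
)
print(summary)
```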
## Perfect for 🎯
- Content creators who need to transcribe and analyze podcasts, interviews, and videos
- Businesses processing meeting recordings and customer service calls
- Researchers working with multilingual audio datasets
- Developers building voice-controlled applications that need to understand intent, not just transcribe words
It’s a good fit whenever you need both accurate transcription AND intelligent audio understanding in the same tool.
## What makes Voxtral special
Traditional speech models are limited to basic transcription. Voxtral changes this:
- **True understanding**: Goes beyond writing down words to grasp meaning and context
- **Multilingual mastery**: Automatically detects and processes 8 languages without switching models
- **Long-form context**: Handles up to 40 minutes of audio while maintaining context throughout
- **Dual capability**: Accurate transcription and intelligent analysis in one model
- **Function calling**: Can trigger code directly from voice commands
## Parameter controls 🎛️
- `mode` (transcription/understanding): Transcription converts speech to text; understanding analyzes audio content and answers questions.
- `language` (auto-detect or specific): Auto-detect works for most content, or choose a specific language for better accuracy.
- `model_size` (mini/small): Mini (3 billion parameters) for speed, small (24 billion parameters) for accuracy with challenging audio.
- `max_tokens` (50-1000): Controls response length. Shorter for quick answers, longer for detailed analysis.
- `prompt` (understanding mode only): Your question or instruction about the audio content.
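Put together, a call that uses every knob above might look like the sketch below. The language value format (`"fr"` vs `"French"`) is an assumption; check the model’s input schema for the accepted values.

```python
# Full-parameter sketch: small model, manual language override, capped length.
import replicate

result = replicate.run(
    "zsxkib/voxtral",  # hypothetical slug
    input={
        "audio": open("noisy_interview.mp3", "rb"),
        "mode": "transcription",
        "language": "fr",       # assumed format; the schema may expect "French"
        "model_size": "small",  # 24B model for background noise / many speakers
        "max_tokens": 1000,
    },
)
print(result)
```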
## Research background
Voxtral represents a significant advance in audio understanding, built on Mistral AI’s research in multimodal language models. It combines speech recognition with deep language understanding in a single model.
The model builds on the Mistral Small series, introducing:
- Native audio input processing that handles multiple languages
- Multilingual speech understanding with automatic language detection
- Long-form audio context handling (32,000 tokens worth of audio content)
- Function calling capabilities triggered directly from voice commands
Original research: *Voxtral: Advanced Speech Understanding*
## Important licensing note
This model follows Mistral’s Apache 2.0 license: the weights are distributed under Mistral’s licensing terms, which allow both commercial and non-commercial use.

Built on Mistral AI’s Voxtral technology - all credit goes to their amazing research team.
## Disclaimer ‼️
I am not liable for any direct, indirect, consequential, incidental, or special damages arising out of or in any way connected with the use/misuse or inability to use this software.
⭐ Star the repo on GitHub!

🐦 Follow @zsakib_ on X

💻 Check out more projects by @zsxkib on GitHub