nvidia / canary-qwen-2.5b

🎤 The best open-source speech-to-text model as of July 2025, transcribing audio with a record 5.63% WER and enabling AI tasks like summarization directly from speech ✨


Canary-Qwen-2.5B: Speech Recognition + AI Analysis 🎙️🧠

Most speech AI either just writes down what you said or misses the context entirely. Canary-Qwen-2.5B handles both. Upload a 2-hour meeting recording and get an accurate transcription with timestamps. Then ask it “What were the main decisions made?” and it analyzes the content the way a person would: understanding context, extracting insights, and reasoning through complex discussions step by step.

Note:
The model weights are for commercial and non-commercial use under NVIDIA’s CC-BY-4.0 license.
Attribution is required when using the model.

What Canary-Qwen-2.5B does ✨

Canary-Qwen-2.5B transforms how you analyze audio by:

- Accurate transcription: Goes beyond basic speech-to-text to handle punctuation, capitalization, and conversation flow
- Smart reasoning: Shows its thinking process when analyzing complex discussions or presentations
- Dual capability: Handles pure transcription and intelligent analysis equally well
- Long-form processing: Analyzes up to 2 hours of audio content in context
- Question answering: Answers any question about your audio content
- Professional quality: Provides detailed analysis suitable for business and research

Audio analysis capabilities 🎵

Canary-Qwen-2.5B uses advanced reasoning to understand not just what words are spoken, but their relationships, context, and meaning. It can analyze everything from casual conversations to formal presentations.

Key features:

🧠 Contextual understanding for detailed conversation analysis
🎙️ Speaker awareness with context and discussion flow recognition
🎼 Content structure including topics, decisions, and action items
🔊 Audio quality handling for real-world recordings
📝 Flexible questioning - ask it anything about your audio
⏱️ Extended processing up to 2 hours of continuous audio
🎯 Timestamp precision for focusing on specific segments

How to get the best results 🌟

Basic approach:

- Canary-Qwen accepts common audio formats, including MP3, WAV, M4A, FLAC, OGG, and AAC
- Ask clear questions about what you want to know
- Get detailed insights and analysis

Advanced control:

- Use specific prompts to customize the analysis format
- Enable reasoning mode for step-by-step thinking
- Include timestamps for detailed segment analysis
- Use targeted questions for focused insights

Example questions:

For business meetings:

- “What were the main action items and who is responsible for each?”
- “Summarize the key decisions made and any concerns raised”
- “Who were the speakers and what were their main points?”

For interviews and podcasts:

- “What are the most interesting insights shared by the guest?”
- “Create bullet points of the main topics discussed”
- “What questions did the host ask and how did the guest respond?”

For lectures and presentations:

- “Break down the main concepts explained in this lecture”
- “What examples were used to illustrate key points?”
- “Create a structured summary with headings and key takeaways”

Parameter controls 🎛️

include_timestamps (true/false): Adds time markers to transcription. When enabled, shows exactly when each segment was spoken.

show_confidence (true/false): Reveals the model’s reasoning process. Shows how it arrived at its analysis and conclusions.

llm_prompt: Your question or instruction for analysis. Be specific about what insights you want.
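Putting the three parameters together, a run can be sketched with the Replicate Python client. The input field names below (in particular `audio`) and the bare model identifier are assumptions based on this page; check the model's API tab on Replicate for the exact version string and schema.

```python
# Sketch of assembling an input payload for a canary-qwen-2.5b run.
# Field names other than the three documented parameters are assumptions.

def build_input(audio_path: str, question: str,
                include_timestamps: bool = True,
                show_confidence: bool = False) -> dict:
    """Assemble the input payload for a transcription-plus-analysis run."""
    return {
        "audio": audio_path,                     # assumed name of the audio field
        "llm_prompt": question,                  # your question or instruction
        "include_timestamps": include_timestamps,  # add time markers
        "show_confidence": show_confidence,        # reveal reasoning process
    }

payload = build_input("meeting.mp3",
                      "What were the main action items and who owns each?")
print(payload["llm_prompt"])

# To actually run it (requires the `replicate` package and an API token):
# import replicate
# output = replicate.run("nvidia/canary-qwen-2.5b", input=payload)
```

The call itself is left commented out so the sketch runs without credentials; swap in a real file handle or URL for `audio` when invoking the API.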

What makes Canary-Qwen-2.5B special 🚀

Traditional speech models are limited to basic transcription or simple commands. Canary-Qwen-2.5B changes this by:

  • True dual capability: Combines NVIDIA’s state-of-the-art speech recognition with Qwen’s language understanding in one seamless model
  • Advanced reasoning: Doesn’t just transcribe—analyzes content, understands context, and answers complex questions
  • Unified approach: No need to use separate tools for transcription and analysis
  • Professional interface: Works seamlessly through Replicate’s web interface
  • Enterprise quality: 5.63% word error rate puts it among the best speech recognition systems available
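For context on the 5.63% figure: word error rate is the number of word-level substitutions, insertions, and deletions divided by the number of reference words. The snippet below is a generic illustration of the metric using a standard Levenshtein distance over word tokens, not NVIDIA's evaluation code.

```python
# Word error rate (WER) = (substitutions + insertions + deletions) / reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in four reference words -> 25% WER
print(wer("the cat sat down", "the cat sat up"))  # 0.25
```

A 5.63% WER means roughly one word edit per eighteen reference words on the benchmark transcripts.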

Best use cases 🎯

Canary-Qwen-2.5B excels at:

- Meeting analysis: Extract action items, decisions, and key discussions from recordings
- Content creation: Transcribe interviews, podcasts, and videos for editing and repurposing
- Research processing: Analyze recorded interviews, focus groups, and qualitative data
- Educational support: Convert lectures and presentations into structured notes and summaries
- Business intelligence: Process customer calls and feedback for insights and trends
- Accessibility services: Create accurate transcripts with intelligent summaries for hearing-impaired users

Limitations to consider ⚠️

  • Works best with clear English speech
  • Background noise may affect transcription accuracy
  • Complex audio scenarios benefit from specific questioning
  • Processing time increases with longer audio files
  • Analysis quality depends on audio clarity and content complexity

Research background 📚

Canary-Qwen-2.5B represents a significant advance in speech-language models, built on NVIDIA’s research in speech recognition and Alibaba’s work in language understanding. It combines FastConformer speech encoding with Transformer-based language modeling.

The model builds on the Canary series, introducing:

- Hybrid speech-language architecture
- 2.5 billion parameter optimization
- Multi-modal reasoning capabilities
- Extended context processing

Original research: Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

Important licensing note 📝

This model is for commercial and non-commercial use under NVIDIA’s CC-BY-4.0 license.
Attribution is required when using the model.

The model is built on NVIDIA Canary-1B and Qwen3-1.7B, each of which carries its own open-source license requirements.

Built by NVIDIA. Packaged for Replicate by @zsxkib.

Disclaimer ‼️

I am not liable for any direct, indirect, consequential, incidental, or special damages arising out of or in any way connected with the use/misuse or inability to use this software.


⭐ Star the repo on GitHub!
🐦 Follow @zsxkib on X
💻 Check out more projects @zsxkib on GitHub