Input schema

The fields you can use to run this model with an API. If you don’t give a value for a field, its default value will be used.
| Field | Type | Default value | Description |
|---|---|---|---|
| audio | string | | Audio file to process. |
| mode | None | transcription | Choose processing mode: 'transcription' converts speech to text, 'understanding' analyzes audio content using prompts. |
| prompt | string | What can you tell me about this audio? | Question or instruction for understanding mode (e.g., 'What is the speaker discussing?', 'Summarize this audio'). Ignored in transcription mode. |
| language | None | Auto-detect | Audio language. 'Auto-detect' works for most content, or choose a specific language for better accuracy. |
| model_size | None | mini | Model selection: 'mini' (3B) is faster and uses less GPU memory, 'small' (24B) provides higher accuracy for complex audio. |
| max_tokens | integer | 500 (Min: 50, Max: 1000) | Maximum response length. Higher values allow longer outputs but increase processing time. |
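
To make the defaults and limits in the table concrete, here is a minimal Python sketch that assembles an input payload and checks it before it is sent. The helper name `build_input` and the example URL are hypothetical, and three points are assumptions rather than facts from the schema: that `audio` is required (the table lists no default for it), that it can be given as a URL string, and that the `max_tokens` bounds are inclusive.

```python
from typing import Any, Dict

# Defaults and allowed values copied from the input schema table.
DEFAULTS: Dict[str, Any] = {
    "mode": "transcription",
    "prompt": "What can you tell me about this audio?",
    "language": "Auto-detect",
    "model_size": "mini",
    "max_tokens": 500,
}
MODES = {"transcription", "understanding"}
MODEL_SIZES = {"mini", "small"}
MIN_TOKENS, MAX_TOKENS = 50, 1000  # assumed to be inclusive bounds


def build_input(audio: str, **overrides: Any) -> Dict[str, Any]:
    """Return a full input dict, filling omitted fields with schema defaults."""
    if not audio:
        # Assumption: audio is required because the schema gives it no default.
        raise ValueError("audio must be a non-empty string")

    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    payload = {"audio": audio, **DEFAULTS, **overrides}

    if payload["mode"] not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}")
    if payload["model_size"] not in MODEL_SIZES:
        raise ValueError(f"model_size must be one of {sorted(MODEL_SIZES)}")
    tokens = payload["max_tokens"]
    if not isinstance(tokens, int) or not MIN_TOKENS <= tokens <= MAX_TOKENS:
        raise ValueError(f"max_tokens must be an integer in [{MIN_TOKENS}, {MAX_TOKENS}]")
    return payload


# Example: ask a question about the audio instead of transcribing it.
payload = build_input(
    "https://example.com/meeting.wav",  # placeholder URL, not a real file
    mode="understanding",
    prompt="Summarize this audio",
    max_tokens=800,
)
print(payload)
```

The resulting dict is what you would pass as the input to whichever API client you use. Because `prompt` is ignored in transcription mode, sending its default value alongside `mode="transcription"` should be harmless.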
      
            
              
                
              
            
Output schema

        The shape of the response you’ll get when you run this model with an API.
Schema

{"title": "Output", "type": "string"}
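
Since the output schema is a bare string, client code only needs a light check before using the result. The sketch below shows one way to do that. The helper name `read_output` is hypothetical, and trimming surrounding whitespace is a convenience choice, not something the schema requires.

```python
def read_output(output: object) -> str:
    """Check that a response matches the output schema (a single string)."""
    if not isinstance(output, str):
        raise TypeError(f"expected a string, got {type(output).__name__}")
    # Design choice: trim leading/trailing whitespace for display.
    return output.strip()
```

Failing early on a non-string value makes a schema mismatch visible at the call site instead of further downstream.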