Input schema
The fields you can use to run this model with an API. If you don't provide a value for a field, its default value will be used.
| Field | Type | Default value | Description |
|---|---|---|---|
| reference_image | string | | Path to the reference image that will be used as the base for the generated video. |
| driving_audio | string | | Path to the audio file that will be used to drive the motion in the generated video. |
| driving_video | string | | Path to the video file that will be used to extract the head motion. If not provided, the generated video will use the motion based on the selected motion_mode. |
| motion_mode | None | fast | Mode for generating the head motion in the output video. |
| reference_attention_weight | number | 0.95 | Amount of attention to pay to the reference image vs. the driving motion. Higher values will make the generated video adhere more closely to the reference image. Range: 0.0 to 1.0 |
| audio_attention_weight | number | 3 | Amount of attention to pay to the driving audio vs. the reference image. Higher values will make the generated video's motion more closely match the driving audio. Range: 0.0 to 10.0 |
| num_inference_steps | integer | 25 | Number of diffusion steps to perform during generation. More steps will generally produce better quality results but will take longer to run. Range: 1 to 100 |
| image_width | integer | 512 | Width of the generated video frames. Range: 64 to 2048 |
| image_height | integer | 512 | Height of the generated video frames. Range: 64 to 2048 |
| frames_per_second | number | 30 | Frame rate of the generated video. Range: 1 to 60 |
| guidance_scale | number | 3.5 | Guidance scale for the diffusion model. Higher values will result in the generated video following the driving motion and audio more closely. Range: 1 to 20 |
| num_context_frames | integer | 12 | Number of context frames to use for motion estimation. Range: 1 to 24 |
| context_stride | integer | 1 | Stride of the context frames. Range: 1 to 10 |
| context_overlap | integer | 4 | Number of overlapping frames between context windows. Max: 24 |
| num_audio_padding_frames | integer | 2 | Number of audio frames to pad on each side of the driving audio. Max: 10 |
| seed | integer | | Random seed. Leave blank to randomize the seed. |
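As a sketch of how these fields fit together, the snippet below builds an input payload from the table's defaults and range-checks the numeric fields before submission. The helper name `build_input` is illustrative, and the lower bounds assumed for `context_overlap` and `num_audio_padding_frames` (the schema states only their maxima) are assumptions, not part of the model's API.

```python
# Defaults taken from the input schema table above.
DEFAULTS = {
    "motion_mode": "fast",
    "reference_attention_weight": 0.95,
    "audio_attention_weight": 3,
    "num_inference_steps": 25,
    "image_width": 512,
    "image_height": 512,
    "frames_per_second": 30,
    "guidance_scale": 3.5,
    "num_context_frames": 12,
    "context_stride": 1,
    "context_overlap": 4,
    "num_audio_padding_frames": 2,
}

# Numeric bounds as (min, max). The minima of 0 for context_overlap and
# num_audio_padding_frames are assumptions; the schema lists only maxima.
BOUNDS = {
    "reference_attention_weight": (0.0, 1.0),
    "audio_attention_weight": (0.0, 10.0),
    "num_inference_steps": (1, 100),
    "image_width": (64, 2048),
    "image_height": (64, 2048),
    "frames_per_second": (1, 60),
    "guidance_scale": (1, 20),
    "num_context_frames": (1, 24),
    "context_stride": (1, 10),
    "context_overlap": (0, 24),
    "num_audio_padding_frames": (0, 10),
}

def build_input(reference_image, driving_audio, **overrides):
    """Merge user overrides onto the defaults and range-check them."""
    payload = {
        **DEFAULTS,
        "reference_image": reference_image,
        "driving_audio": driving_audio,
        **overrides,
    }
    for key, (lo, hi) in BOUNDS.items():
        if key in payload and not (lo <= payload[key] <= hi):
            raise ValueError(f"{key}={payload[key]} is outside [{lo}, {hi}]")
    return payload
```

For example, `build_input("face.png", "speech.wav", num_inference_steps=40)` returns the full payload with the override applied, while an out-of-range value such as `guidance_scale=50` raises a `ValueError` before any API call is made.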
Output schema
        The shape of the response you’ll get when you run this model with an API.
{"format": "uri", "title": "Output", "type": "string"}
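Per the schema above, the response is a single string whose format is a URI (typically a link to the generated video file). A minimal client-side sanity check using only the standard library; the function name `looks_like_uri` is illustrative and not part of the model's API:

```python
from urllib.parse import urlparse

def looks_like_uri(output):
    """Loosely check that `output` matches the output schema.

    The schema only guarantees a string with format "uri", so this
    accepts any string that parses with a URI scheme.
    """
    if not isinstance(output, str):
        return False
    return bool(urlparse(output).scheme)
```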