Input schema
The fields you can use to run this model with an API. If you don't give a value for a field, its default value will be used.
Field | Type | Default value | Description
---|---|---|---
reference_image | string | | Path to the reference image that will be used as the base for the generated video.
driving_audio | string | | Path to the audio file that will be used to drive the motion in the generated video.
driving_video | string | | Path to the video file that will be used to extract the head motion. If not provided, the generated video will use motion based on the selected motion_mode.
motion_mode | string (enum) | fast | Mode for generating the head motion in the output video. Options: standard, gentle, normal, fast.
reference_attention_weight | number | 0.95 | Amount of attention to pay to the reference image vs. the driving motion. Higher values will make the generated video adhere more closely to the reference image. Range: 0.0 to 1.0.
audio_attention_weight | number | 3 | Amount of attention to pay to the driving audio vs. the reference image. Higher values will make the generated video's motion match the driving audio more closely. Range: 0.0 to 10.0.
num_inference_steps | integer | 25 | Number of diffusion steps to perform during generation. More steps will generally produce better quality results but will take longer to run. Range: 1 to 100.
image_width | integer | 512 | Width of the generated video frames, in pixels. Range: 64 to 2048.
image_height | integer | 512 | Height of the generated video frames, in pixels. Range: 64 to 2048.
frames_per_second | number | 30 | Frame rate of the generated video. Range: 1 to 60.
guidance_scale | number | 3.5 | Guidance scale for the diffusion model. Higher values will result in the generated video following the driving motion and audio more closely. Range: 1 to 20.
num_context_frames | integer | 12 | Number of context frames to use for motion estimation. Range: 1 to 24.
context_stride | integer | 1 | Stride of the context frames. Range: 1 to 10.
context_overlap | integer | 4 | Number of overlapping frames between context windows. Max: 24.
num_audio_padding_frames | integer | 2 | Number of audio frames to pad on each side of the driving audio. Max: 10.
seed | integer | | Random seed. Leave blank to randomize the seed.
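The schema above can be exercised locally before making any API call. A minimal sketch, assuming the fields and ranges listed above; the helper name `build_input` and the file names are illustrative, not part of the model's API:

```python
# Defaults taken from the input schema above.
DEFAULTS = {
    "motion_mode": "fast",
    "reference_attention_weight": 0.95,
    "audio_attention_weight": 3,
    "num_inference_steps": 25,
    "image_width": 512,
    "image_height": 512,
    "frames_per_second": 30,
    "guidance_scale": 3.5,
    "num_context_frames": 12,
    "context_stride": 1,
    "context_overlap": 4,
    "num_audio_padding_frames": 2,
}

# (min, max) bounds for the numeric fields, as documented above.
RANGES = {
    "reference_attention_weight": (0.0, 1.0),
    "audio_attention_weight": (0.0, 10.0),
    "num_inference_steps": (1, 100),
    "image_width": (64, 2048),
    "image_height": (64, 2048),
    "frames_per_second": (1, 60),
    "guidance_scale": (1, 20),
    "num_context_frames": (1, 24),
    "context_stride": (1, 10),
    "context_overlap": (0, 24),
    "num_audio_padding_frames": (0, 10),
}

MOTION_MODES = {"standard", "gentle", "normal", "fast"}


def build_input(reference_image, driving_audio, **overrides):
    """Merge overrides onto the schema defaults and validate them."""
    payload = {
        **DEFAULTS,
        "reference_image": reference_image,
        "driving_audio": driving_audio,
        **overrides,
    }
    if payload["motion_mode"] not in MOTION_MODES:
        raise ValueError(f"invalid motion_mode: {payload['motion_mode']!r}")
    for key, (lo, hi) in RANGES.items():
        if not lo <= payload[key] <= hi:
            raise ValueError(f"{key}={payload[key]} is outside [{lo}, {hi}]")
    return payload
```

The resulting dict can then be passed as the `input` of whichever client library you use to run the model; fields omitted from `overrides` keep their documented defaults.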
Output schema
The shape of the response you’ll get when you run this model with an API.
{"format": "uri", "title": "Output", "type": "string"}
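Per that schema, the response is a single string formatted as a URI (typically a link to the generated video file). A minimal sketch of checking a response against the declared schema before using it; the example URL in the usage below is a placeholder, not a real output:

```python
from urllib.parse import urlparse

# The output schema as documented above.
OUTPUT_SCHEMA = {"format": "uri", "title": "Output", "type": "string"}


def check_output(value):
    """Return True if `value` matches the declared output schema."""
    if OUTPUT_SCHEMA["type"] == "string" and not isinstance(value, str):
        return False
    if OUTPUT_SCHEMA["format"] == "uri":
        parsed = urlparse(value)
        # A usable URI needs at least a scheme and a host.
        return bool(parsed.scheme and parsed.netloc)
    return True
```

For example, `check_output("https://example.com/out.mp4")` passes, while a bare string or a non-string value does not.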