zsxkib / multitalk

Audio-driven multi-person conversational video generation - Upload audio files and a reference image to create realistic conversations between multiple people

🎭 MeiGen’s MultiTalk: Let Them Talk 🎭

This is MeiGen’s MultiTalk, an audio-driven conversational video generation system that does something pretty remarkable: it can make people in images have realistic conversations with each other. Upload an image with one or two people, provide audio files, and watch as they come to life with perfectly synchronized lip movements and natural interactions.

Original Project: MeiGen-AI/MultiTalk
Research Paper: Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Model Weights: MeiGen-AI/MeiGen-MultiTalk
Project Website: meigen-ai.github.io/multi-talk

About This Model

Most talking head generators can only animate single speakers with basic lip movements. MultiTalk breaks this pattern by creating realistic multi-person conversations where people actually interact with each other naturally.

The key innovation is that MultiTalk understands conversational dynamics—it doesn’t just make mouths move, it generates natural interactions between people that follow the flow and emotional content of the conversation.

Key capabilities:

  • Multi-person conversations: Generate realistic conversations between multiple people, not just single talking heads
  • Perfect lip sync: Audio-driven generation with accurate lip synchronization
  • Interactive control: Direct virtual humans through natural language prompts
  • Versatile characters: Works with real people, cartoon characters, and even singing performances
  • High quality output: 480p and 720p generation at arbitrary aspect ratios
  • Long-form content: Generate videos up to 15 seconds with consistent quality

How It Works

MultiTalk uses a 14 billion parameter diffusion transformer combined with specialized audio processing:

  1. Audio processing: Extracts features from your audio files and handles both single and multi-person scenarios
  2. Feature extraction: Converts speech into data that captures timing and emotional content
  3. Multi-person coordination: For conversations, combines multiple audio streams while keeping them aligned
  4. Video generation: Generates frames based on both the reference image and audio data
  5. Post-processing: Combines generated video with original audio for perfect synchronization
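
As a rough illustration of the final step, the generated (silent) video can be remuxed with the original audio using a standard ffmpeg call. This is only a sketch under assumed file names, not necessarily how the packaged pipeline does it:

# Illustrative post-processing (step 5): remux the silent generated video
# with the original speech audio. File names are placeholders.
import subprocess

def mux_audio(video_path: str, audio_path: str, output_path: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,   # generated (silent) video
            "-i", audio_path,   # original speech audio
            "-c:v", "copy",     # keep video frames untouched
            "-c:a", "aac",      # encode audio for MP4 compatibility
            "-shortest",        # stop at the shorter of the two streams
            output_path,
        ],
        check=True,
    )

mux_audio("generated.mp4", "conversation.wav", "final.mp4")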

Using the Model

Single person talking:

  • Provide a reference image and one audio file
  • The person in the image will speak with synchronized lip movements

Multi-person conversation:

  • Provide a reference image with two people and two audio files
  • Each person will take turns speaking according to their respective audio
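
A minimal sketch of calling the model from the Replicate Python client is shown below. The input field names (image, first_audio, second_audio, prompt) are assumptions for illustration; check this model’s API schema for the actual names:

# Hypothetical call through the Replicate Python client; the input field
# names below are assumptions, not the confirmed schema.
import replicate

output = replicate.run(
    "zsxkib/multitalk",
    input={
        "image": open("two_people.png", "rb"),        # reference image
        "first_audio": open("speaker_a.wav", "rb"),   # first speaker's audio
        "second_audio": open("speaker_b.wav", "rb"),  # omit for a single speaker
        "prompt": "Two people having a friendly conversation in a cafe",
    },
)
print(output)  # URL of the generated video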

The model automatically handles:

  • Audio extraction from video files
  • Loudness normalization
  • Frame count optimization (automatically adjusts to valid values)
  • GPU memory management for optimal performance
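
For example, the frame count adjustment might look roughly like the sketch below, assuming the underlying video model only accepts frame counts of the form 4n + 1 (common for Wan-style video diffusion models); the actual rule and limits used here may differ:

# Illustrative frame count adjustment. Assumes valid counts are of the
# form 4n + 1; the real constraint and bounds may differ.
def adjust_num_frames(requested: int, minimum: int = 25, maximum: int = 201) -> int:
    clamped = max(minimum, min(requested, maximum))
    # Snap down to the nearest value satisfying (n - 1) % 4 == 0
    return ((clamped - 1) // 4) * 4 + 1

print(adjust_num_frames(100))  # -> 97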

Performance Optimizations

This implementation includes several optimizations:

  • Automatic memory detection: Optimizes settings based on your GPU’s memory capacity
  • Turbo mode: Faster generation with optimized sampling parameters
  • Acceleration: Speeds up inference by 2-3x with minimal quality loss
  • Smart frame adjustment: Automatically corrects frame counts to valid values
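
The automatic memory detection can be pictured roughly as follows; the thresholds and option names are illustrative, not this implementation’s actual logic:

# Sketch of GPU memory detection; thresholds and option names are
# illustrative placeholders.
import torch

def pick_generation_settings() -> dict:
    if not torch.cuda.is_available():
        raise RuntimeError("An NVIDIA GPU is required")

    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 80:    # e.g. A100/H100 80 GB
        return {"resolution": "720p", "offload_model": False}
    if total_gb >= 40:    # e.g. A100 40 GB
        return {"resolution": "720p", "offload_model": True}
    return {"resolution": "480p", "offload_model": True}    # e.g. 24 GB cards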

Technical Details

  • Architecture: 14 billion parameter diffusion transformer with specialized audio conditioning
  • Audio Processing: Wav2Vec2-based feature extraction with loudness normalization
  • Video Generation: Frame-by-frame synthesis conditioned on reference image and audio embeddings
  • Hardware Requirements: NVIDIA GPU with 24GB+ memory recommended (A100, H100, or RTX 4090+)
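
To make the audio conditioning concrete, here is a minimal sketch of Wav2Vec2 feature extraction with the Hugging Face transformers library. The checkpoint name is a placeholder, and the real pipeline additionally applies loudness normalization and its own conditioning layers:

# Minimal Wav2Vec2 feature extraction sketch; the checkpoint is a
# placeholder, not necessarily the one MultiTalk uses.
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

def extract_audio_embeddings(audio_path: str) -> torch.Tensor:
    # Wav2Vec2 expects 16 kHz mono audio
    waveform, _ = librosa.load(audio_path, sr=16000, mono=True)

    processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Frame-level embeddings that condition the video diffusion transformer
    return outputs.last_hidden_state   # shape: (1, time_steps, hidden_dim)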

Limitations

Like all AI models, MultiTalk isn’t perfect. Complex multi-person interactions might not always be interpreted as intended, and the model works best with clear audio and well-lit reference images. The 15-second limit means longer conversations need to be processed in segments.

Citation

@article{kong2025let,
  title={Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation},
  author={Kong, Zhe and Gao, Feng and Zhang, Yong and Kang, Zhuoliang and Wei, Xiaoming and Cai, Xunliang and Chen, Guanying and Luo, Wenhan},
  journal={arXiv preprint arXiv:2505.22647},
  year={2025}
}

License

MultiTalk is licensed under Apache 2.0. The original research and model are from MeiGen-AI.


Cog implementation by zsxkib • Follow on Twitter/X