# Dia: Text-to-Dialogue Model 🗣️ (Cog Implementation)
This Replicate model runs Dia, a state-of-the-art 1.6B parameter model for generating realistic dialogue audio from text, developed by Nari Labs.
- Original Project: nari-labs/Dia-1.6B on Hugging Face
- Original GitHub: github.com/nari-labs/dia
## About the Dia Model
Dia is a 1.6B parameter text-to-speech model that directly generates highly realistic dialogue from a transcript. Unlike traditional TTS focusing on single utterances, Dia excels at capturing the flow of conversation between multiple speakers. It can also produce non-verbal communications like laughter, coughing, and other sounds, making the output more natural and engaging. You can condition the output on an audio prompt, enabling control over voice style, emotion, and tone.
## Key Features & Capabilities ✨
- Realistic Dialogue Generation 💬: Creates natural-sounding conversations using `[S1]` and `[S2]` speaker tags.
- Non-Verbal Sounds 😄: Generates sounds like `(laughs)`, `(coughs)`, and `(whispers)` when specified in the input text using parentheses.
- Voice Cloning 🎤: Mimics the voice style from an optional input audio prompt (`.wav`, `.mp3`, `.flac`).
- Fine-grained Control 🎛️: Offers parameters to adjust audio length, speed, randomness (temperature), and adherence to the text (CFG scale).
- Reproducibility 🌱: Supports setting a random seed for consistent outputs across identical inputs.
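For example, a call through the Replicate Python client might look like the minimal sketch below. The `zsxkib/dia` slug is a placeholder (check the actual model page for the published identifier), and the input values are illustrative:

```python
import replicate

# NOTE: "zsxkib/dia" is a placeholder slug -- use the identifier
# shown on the actual Replicate model page.
output = replicate.run(
    "zsxkib/dia",
    input={
        "text": (
            "[S1] Dia generates whole conversations from a transcript. "
            "[S2] Really? (laughs) That's impressive. "
            "[S1] (coughs) It even handles non-verbal sounds."
        ),
        "seed": 42,  # fix the seed for reproducible output
    },
)
print(output)  # URL of the generated .wav file
```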
## Replicate Implementation Details ⚙️
This Cog container packages the Dia model and its dependencies for easy use on Replicate.
- Core Model: Utilizes the pre-trained `nari-labs/Dia-1.6B` weights from Hugging Face.
- Dependencies: Runs on PyTorch and leverages libraries like `soundfile`, `numpy`, and `descript-audio-codec`. Installs the `dia` library directly from its GitHub repository. Requires the `libsndfile1` and `ffmpeg` system packages.
- Weight Handling: Model weights (packaged as a `.tar` archive containing the Hugging Face directory structure) are downloaded efficiently with `pget` during container setup from a Replicate cache (`https://weights.replicate.delivery/default/dia/model_cache/`) and extracted into the local `model_cache` directory. Environment variables (`HF_HOME`, `TORCH_HOME`, etc.) are set so that Hugging Face libraries use this cache.
- Workflow (`predict.py`), sketched in code after this list:
  - Sets up environment variables and creates the `model_cache` directory during initialization (the `setup` method).
  - Downloads and extracts model weights using `pget` if they are not already present in the cache.
  - Loads the Dia model (`Dia.from_pretrained`) into GPU memory.
  - Receives the `text` input and an optional `audio_prompt` (`Path`), along with generation parameters.
  - Validates the text input.
  - Handles the `audio_prompt` by copying it to a temporary file if provided.
  - Sets the random `seed` for reproducibility.
  - Calls `self.model.generate()` with the text, the temporary prompt path (if used), and the user-configured parameters (`max_new_tokens`, `cfg_scale`, `temperature`, `top_p`, `cfg_filter_top_k`).
  - Adjusts the speed of the generated audio numpy array according to `speed_factor` using numpy interpolation.
  - Saves the final audio array to a temporary `.wav` file using `soundfile`.
  - Returns the `Path` to the resulting `.wav` file.
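A condensed sketch of this flow is shown below. It is illustrative, not the verbatim `predict.py`: the exact tarball name, torch seeding, temp-file handling, and the parameter defaults are assumptions, and keyword names for `Dia.generate` may differ across `dia` versions.

```python
import os
import subprocess

import numpy as np
import soundfile as sf
from cog import BasePredictor, Input, Path
from dia.model import Dia

MODEL_CACHE = "model_cache"
# Base URL from the weight-handling notes above; the exact .tar name is simplified here.
WEIGHTS_URL = "https://weights.replicate.delivery/default/dia/model_cache/weights.tar"


class Predictor(BasePredictor):
    def setup(self):
        # Route Hugging Face / Torch caches into the local directory.
        os.environ["HF_HOME"] = MODEL_CACHE
        os.environ["TORCH_HOME"] = MODEL_CACHE
        os.makedirs(MODEL_CACHE, exist_ok=True)
        if not os.listdir(MODEL_CACHE):
            # pget -x downloads and extracts a tar archive in one step.
            subprocess.check_call(["pget", "-x", WEIGHTS_URL, MODEL_CACHE])
        self.model = Dia.from_pretrained("nari-labs/Dia-1.6B")

    def predict(
        self,
        text: str = Input(description="Dialogue transcript with [S1]/[S2] tags"),
        audio_prompt: Path = Input(description="Optional voice reference", default=None),
        max_new_tokens: int = Input(default=3072),
        cfg_scale: float = Input(default=3.0),
        temperature: float = Input(default=1.3),
        top_p: float = Input(default=0.95),
        cfg_filter_top_k: int = Input(default=35),
        speed_factor: float = Input(default=1.0),
        seed: int = Input(default=None),
    ) -> Path:
        if not text or not text.strip():
            raise ValueError("Text input must not be empty.")
        if seed is not None:
            np.random.seed(seed)  # the real code also seeds torch for full determinism
        audio = self.model.generate(
            text,
            audio_prompt_path=str(audio_prompt) if audio_prompt else None,
            max_tokens=max_new_tokens,
            cfg_scale=cfg_scale,
            temperature=temperature,
            top_p=top_p,
            cfg_filter_top_k=cfg_filter_top_k,
        )
        if speed_factor != 1.0:
            # Speed adjustment via linear interpolation: resample the waveform
            # onto a shorter (faster) or longer (slower) time grid.
            n_out = int(len(audio) / speed_factor)
            audio = np.interp(
                np.linspace(0, len(audio) - 1, n_out),  # new sample positions
                np.arange(len(audio)),                  # original sample positions
                audio,
            )
        out_path = "/tmp/output.wav"
        sf.write(out_path, audio, 44100)  # Dia's codec operates at 44.1 kHz
        return Path(out_path)
```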
## Underlying Technologies & Concepts 🔬
Dia builds upon several key concepts in audio generation:
* Efficient Audio Codecs: Utilizes technologies like the Descript Audio Codec for high-quality audio representation and synthesis.
* Transformer Architectures: Employs transformer models, common in modern sequence modeling, for processing text input and generating audio tokens.
* Parallel Decoding: Likely inspired by models like SoundStorm, potentially enabling faster audio generation compared to purely autoregressive methods.
* Conditional Generation: Uses text input and optional audio prompts to guide the audio synthesis process, allowing control over content and voice style.
## Use Cases 💡
- Generating dialogue for audiobooks, podcasts, animations, or video games.
- Creating voiceovers for presentations, e-learning materials, or marketing videos.
- Prototyping character voices and conversational interactions.
- Developing accessibility tools (e.g., reading out online conversations or articles).
- Generating synthetic dialogue data for training other speech or NLP models.
## Limitations ⚠️
- English Only: The current model primarily supports English-language generation.
- Hardware Requirements: Requires a GPU with sufficient VRAM (approx. 10 GB recommended for the 1.6B model). CPU performance is not optimized.
- Voice Consistency: Without an `audio_prompt` or a fixed `seed`, speaker voices may vary between runs, as the base model was not fine-tuned on specific speaker identities (see the example after this list).
- Voice Cloning Quality: The effectiveness of voice cloning depends significantly on the quality, clarity, and length of the provided `audio_prompt`.
- Nuance Capture: While capable of generating expressive speech and non-verbal sounds, the model may not capture every subtle emotional nuance from text alone without careful prompting or fine-tuning.
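To keep voices stable across runs, a call can pin both an `audio_prompt` and a `seed`, as in this sketch (reusing the placeholder slug from the earlier example):

```python
import replicate

with open("reference_voice.wav", "rb") as prompt:
    output = replicate.run(
        "zsxkib/dia",  # placeholder slug; see the actual model page
        input={
            "text": "[S1] This run should match the reference voice. "
                    "[S2] (laughs) And stay consistent across reruns.",
            "audio_prompt": prompt,  # clone the voice style from this file
            "seed": 1234,            # fix randomness for repeatability
        },
    )
print(output)
```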
## License & Disclaimer 📜
The original Dia model and code are licensed under the Apache License 2.0. See the LICENSE file in the original repository.
Disclaimer (from Nari Labs): This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are strictly forbidden:
* Identity Misuse: Do not produce audio resembling real individuals without permission.
* Deceptive Content: Do not use this model to generate misleading content (e.g., fake news).
* Illegal or Malicious Use: Do not use this model for activities that are illegal or intended to cause harm.
By using this model, you agree to uphold relevant legal standards and ethical responsibilities. Nari Labs is not responsible for any misuse and firmly opposes any unethical usage of this technology.
This Replicate endpoint is provided for experimentation based on the original work. Users must adhere to the original license and disclaimer.
## Citation 📚
If you use this model or implementation in your work, please cite the original Nari Labs Dia repository:
https://github.com/nari-labs/dia
Cog implementation managed by zsxkib.
Star the Cog repo on GitHub! ⭐
Follow me on Twitter/X