kjjk10 / llasa-3b-long

SoTA Zero Shot Voice Cloning and TTS model

Run time and cost

This model costs approximately $0.0046 per run on Replicate, or about 217 runs per $1, but this varies depending on your inputs. It is also open source, and you can run it on your own computer with Docker.
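As a quick check on the arithmetic above, the following sketch derives the runs-per-dollar figure from the quoted per-run cost and estimates the cost of a batch. The function names are hypothetical, and the per-run cost is only the approximate figure quoted here; actual cost depends on your inputs.

```python
import math

# Approximate per-run cost quoted above (USD); real cost varies with inputs.
COST_PER_RUN_USD = 0.0046


def runs_per_dollar(cost_per_run: float = COST_PER_RUN_USD) -> int:
    """Whole number of runs that one dollar buys at the given per-run cost."""
    return math.floor(1.0 / cost_per_run)


def estimated_cost(runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Rough total cost in USD for a number of runs."""
    return runs * cost_per_run


if __name__ == "__main__":
    print(runs_per_dollar())     # whole runs per $1
    print(estimated_cost(1000))  # rough cost of 1,000 runs
```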

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 5 seconds.

Readme

Troubleshooting

The checkpoints support English and Chinese.

If you’re having issues, try converting your reference audio to WAV or MP3 and clipping it to 15 seconds.
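One way to do the conversion and clipping is with the `ffmpeg` command-line tool, driven from Python. This is a minimal sketch, assuming `ffmpeg` is installed and on your PATH; the helper names and the `_clipped.wav` output naming are hypothetical choices for illustration, not part of this model's code.

```python
import subprocess
from pathlib import Path

MAX_SECONDS = 15  # clip length suggested in the troubleshooting note


def default_output_path(src: str) -> Path:
    """Place the clipped WAV next to the source, e.g. voice.m4a -> voice_clipped.wav."""
    p = Path(src)
    return p.with_name(p.stem + "_clipped.wav")


def build_clip_command(src: str, dst: str, max_seconds: int = MAX_SECONDS) -> list:
    """Build an ffmpeg command that converts src to dst and keeps the first max_seconds."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(src),          # input file (any format ffmpeg can decode)
        "-t", str(max_seconds),  # stop writing after this many seconds
        str(dst),                # output format is inferred from the extension
    ]


def convert_and_clip(src: str, max_seconds: int = MAX_SECONDS) -> Path:
    """Run ffmpeg and return the path of the clipped WAV file."""
    dst = default_output_path(src)
    subprocess.run(build_clip_command(src, str(dst), max_seconds), check=True)
    return dst


# Example usage (requires ffmpeg and an existing input file):
# clipped = convert_and_clip("reference.m4a")
```

The resulting WAV file can then be uploaded as the reference audio.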

Credits

Batching code adapted from:

  • https://github.com/nivibilla/local-llasa-tts

Model card:

  • https://huggingface.co/HKUSTAudio/Llasa-3B

Model Information

Our model, Llasa, is a text-to-speech (TTS) system that extends the text-based LLaMA language models (1B, 3B, and 8B) by incorporating speech tokens from the XCodec2 codebook, which contains 65,536 tokens. We trained Llasa on a dataset comprising 250,000 hours of Chinese-English speech data. The model can generate speech either from input text alone or by using a given speech prompt.
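To illustrate what incorporating speech tokens into a text vocabulary can look like, here is a minimal sketch that maps codec indices to special token strings and back. The `<|s_N|>` token format and the helper names are assumptions made for illustration; only the codebook size of 65,536 comes from the description above. Consult the model card for the actual tokenization.

```python
import re

CODEBOOK_SIZE = 65_536  # XCodec2 codebook size stated above (2**16)

# Assumed token format for illustration: one special token per codec index.
_TOKEN_RE = re.compile(r"<\|s_(\d+)\|>")


def ids_to_speech_tokens(ids):
    """Turn codec indices into token strings such as '<|s_42|>'."""
    tokens = []
    for i in ids:
        if not 0 <= i < CODEBOOK_SIZE:
            raise ValueError(f"codec index {i} is outside the codebook")
        tokens.append(f"<|s_{i}|>")
    return tokens


def speech_tokens_to_ids(tokens):
    """Recover codec indices from token strings; reject anything malformed."""
    ids = []
    for tok in tokens:
        match = _TOKEN_RE.fullmatch(tok)
        if match is None:
            raise ValueError(f"not a speech token: {tok!r}")
        ids.append(int(match.group(1)))
    return ids


if __name__ == "__main__":
    toks = ids_to_speech_tokens([0, 42, 65535])
    print(toks)
    print(speech_tokens_to_ids(toks))
```

With a scheme like this, the language model can emit speech tokens the same way it emits text tokens, and a codec decoder turns the recovered indices back into audio.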

Disclaimer

This model is licensed under the CC BY-NC-ND 4.0 License, which prohibits commercial use; detected violations will result in legal consequences.

Use of this codebase for any illegal purpose in any country or region is strictly prohibited. Please refer to your local laws regarding the DMCA and other related laws.