kjjk10 / llasa-3b-long

SoTA zero-shot voice cloning and TTS model


Troubleshooting

The checkpoints support English and Chinese.

If you’re having issues, try converting your reference audio to WAV or MP3 and clipping it to 15 seconds or less.
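As a rough illustration of the clipping step, the sketch below trims a WAV file to its first 15 seconds using only Python's standard `wave` module. The function name `clip_wav` and the file paths are hypothetical, and the sketch assumes the reference audio is already an uncompressed PCM WAV file; it does not handle MP3 or other compressed formats.

```python
import wave


def clip_wav(src_path, dst_path, max_seconds=15.0):
    """Copy at most `max_seconds` of audio from src_path to dst_path.

    Hypothetical helper: assumes src_path is an uncompressed PCM WAV file.
    Returns the duration of the written clip in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Keep the whole file if it is already shorter than the limit.
        frames_to_keep = min(src.getnframes(), int(max_seconds * rate))
        frames = src.readframes(frames_to_keep)

    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width, and sample rate from the source.
        # The frame count in the header is corrected when the data is written.
        dst.setparams(params)
        dst.writeframes(frames)

    return frames_to_keep / rate
```

For example, `clip_wav("reference.wav", "reference_15s.wav")` would write a clip no longer than 15 seconds. Audio in other formats would first need converting with an external tool such as ffmpeg.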

Credits

Batching code adapted from: https://github.com/nivibilla/local-llasa-tts

Model card: https://huggingface.co/HKUSTAudio/Llasa-3B

Model Information

Our model, Llasa, is a text-to-speech (TTS) system that extends the text-based LLaMA (1B, 3B, and 8B) language models by incorporating speech tokens from the XCodec2 codebook, which contains 65,536 tokens. We trained Llasa on a dataset comprising 250,000 hours of Chinese-English speech data. The model can generate speech either from input text alone or by using a given speech prompt.
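To make the idea of speech tokens in a text vocabulary concrete, here is a minimal sketch of how codebook indices might be mapped to vocabulary strings and back. The `<|s_N|>` token naming is an assumption modeled on the example code in the upstream model card, and the function names are hypothetical; only the codebook size of 65,536 comes from the description above.

```python
CODEBOOK_SIZE = 65_536  # XCodec2 codebook size stated above


def ids_to_speech_tokens(speech_ids):
    """Map codebook indices to token strings (assumed `<|s_N|>` format)."""
    tokens = []
    for speech_id in speech_ids:
        if not 0 <= speech_id < CODEBOOK_SIZE:
            raise ValueError(f"speech id out of range: {speech_id}")
        tokens.append(f"<|s_{speech_id}|>")
    return tokens


def speech_tokens_to_ids(tokens):
    """Recover codebook indices, skipping anything that is not a speech token."""
    ids = []
    for token in tokens:
        if token.startswith("<|s_") and token.endswith("|>"):
            # Strip the 4-character prefix and the 2-character suffix.
            ids.append(int(token[4:-2]))
    return ids
```

Under this scheme the language model would emit speech tokens like any other vocabulary entry, and the recovered indices would then be passed to the codec decoder to produce a waveform.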

Disclaimer

This model is licensed under the CC BY-NC-ND 4.0 License, which prohibits commercial use; detected violations will result in legal consequences.

Using this codebase for any illegal purpose in any country or region is strictly prohibited. Please check the DMCA and other applicable laws in your jurisdiction.