fofr / yue

Generate music with YuE-s1-7B (English, chain of thought model)

Run time and cost

This model costs approximately $0.88 per run on Replicate (about 1 run per $1), but the cost varies with your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 16 minutes. The predict time for this model varies significantly based on the inputs.

Readme

Prompt Engineering Guide

The prompt consists of three parts: genre tags, lyrics, and reference audio.

Genre Tagging Prompt

  1. A stable tagging prompt usually consists of five components: genre, instrument, mood, gender, and timbre. Include all five if possible, separated by single spaces.

  2. Although our tags have an open vocabulary, we have provided the top 200 most commonly used tags. It is recommended to select tags from this list for more stable results.

  3. The order of the tags is flexible. For example, a stable genre tagging prompt might look like: “inspiring female uplifting pop airy vocal electronic bright vocal vocal”.

  4. Additionally, we have introduced the “Mandarin” and “Cantonese” tags to distinguish between Mandarin and Cantonese, as their lyrics often share similarities.
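
The tagging guidance above can be sketched as a small helper. This is a minimal illustration and not part of YuE: the function name `build_genre_prompt` and its parameters are hypothetical, and the assignment of the example tags to the five components is a guess made only for demonstration. The one rule it encodes comes from the list above: provide the five components and join them with single spaces.

```python
def build_genre_prompt(genre, instrument, mood, gender, timbre):
    """Join the five recommended tag components with single spaces.

    Hypothetical helper for illustration only; the model just needs the
    final space-delimited string.
    """
    components = [genre, instrument, mood, gender, timbre]
    # Skip empty components and collapse stray whitespace inside each one.
    cleaned = [" ".join(c.split()) for c in components if c and c.strip()]
    return " ".join(cleaned)


# Example tags taken from the sample prompt above; which component each
# tag belongs to is an assumption.
prompt = build_genre_prompt(
    genre="pop",
    instrument="electronic",
    mood="inspiring uplifting",
    gender="female",
    timbre="airy vocal bright vocal",
)
print(prompt)
```

Because tag order is flexible, the fixed component order used here is only a convenience for keeping prompts consistent between runs.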

Lyrics Prompt

  1. We support multiple languages, including but not limited to English, Mandarin Chinese, Cantonese, Japanese, and Korean. The default top language distribution during the annealing phase is revealed in issue 12. A language ID on a specific annealing checkpoint indicates that we have adjusted the mixing ratio to enhance support for that language.

  2. The lyrics prompt should be divided into sessions, each with a structure label (e.g., [verse], [chorus], [bridge], [outro]) prepended. Separate sessions with two newline characters (“\n\n”).

  3. DO NOT put too many words in a single session, since each session covers only around 30 seconds of audio (--max_new_tokens 3000 by default).

  4. We find that the [intro] label is less stable, so we recommend starting with [verse] or [chorus].

  5. For generating music with no vocal, see issue 18.
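
The formatting rules for the lyrics prompt can be sketched as follows. This is a minimal example under stated assumptions, not YuE's own tooling: `build_lyrics_prompt` and `starts_with_stable_label` are hypothetical names, the lyric lines are placeholders, and putting each label on its own line above the lyrics is an assumption (the guide only says the label is prepended).

```python
def build_lyrics_prompt(sessions):
    """Format (label, text) pairs as a single lyrics prompt string.

    Each session is rendered as its bracketed label followed by its lyric
    lines, and sessions are separated by one blank line (two newlines).
    Hypothetical helper for illustration only.
    """
    blocks = []
    for label, text in sessions:
        # Drop blank lines inside a session so that a double newline
        # only ever marks a boundary between sessions.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        blocks.append(f"[{label}]\n" + "\n".join(lines))
    return "\n\n".join(blocks)


def starts_with_stable_label(sessions):
    """Return True if the first session is a verse or chorus, as recommended."""
    return bool(sessions) and sessions[0][0].lower() in {"verse", "chorus"}


# Placeholder lyrics, not taken from any real song.
sessions = [
    ("verse", "placeholder line one\nplaceholder line two"),
    ("chorus", "placeholder hook line"),
]
lyrics = build_lyrics_prompt(sessions)
print(lyrics)
print(starts_with_stable_label(sessions))
```

The sketch deliberately does not enforce a word limit per session, because the guide gives no exact number; keep each session short enough to be sung in roughly 30 seconds.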