MusicGen with Fine-tuner

MusicGen is a simple and controllable model for music generation. This repository adds a fine-tuner, so users can fine-tune MusicGen on their own datasets.
- AudioCraft 1.2.0 implemented! (Stereo models added.)

MusicGen fine-tuning instruction blog post

Fine-tune MusicGen to generate music in any style

Model Architecture and Development

MusicGen is a single-stage auto-regressive Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods such as MusicLM, MusicGen doesn’t require a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, the authors show they can predict them in parallel, which leaves only 50 auto-regressive steps per second of audio. They used 20K hours of licensed music to train MusicGen. Specifically, they relied on an internal dataset of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data.
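The delay between codebooks can be illustrated with a toy example. The sketch below is not AudioCraft's implementation: it assumes a one-step delay per codebook, and apply_delay_pattern is a hypothetical helper that uses None in place of padding tokens.

```python
from typing import List, Optional


def apply_delay_pattern(
    codes: List[List[int]], pad: Optional[int] = None
) -> List[List[Optional[int]]]:
    """Shift codebook k to the right by k steps (assumed one-step delay)."""
    num_codebooks = len(codes)
    num_steps = len(codes[0])
    # All rows share one length: original steps plus the largest delay.
    total = num_steps + num_codebooks - 1
    delayed = []
    for k, stream in enumerate(codes):
        row = [pad] * k + list(stream) + [pad] * (total - num_steps - k)
        delayed.append(row)
    return delayed


# Toy input: 4 codebooks, 3 time steps each.
codes = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
```

Each column of the delayed grid is one auto-regressive step that emits a token for every codebook at once. Under this assumption, T frames with 4 codebooks need T + 3 steps, so 30 seconds at 50 Hz takes 1,503 steps instead of the 6,000 needed if the four codebooks were flattened into a single sequence.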

Prediction

Default Model

The default prediction model is configured as the melody model.
After completing the fine-tuning process from this repository, the trained model weights will be loaded into your own model repository.

Infinite Generation

You can set duration to a value longer than 30 seconds.
Because MusicGen can generate at most 30 seconds of audio in one iteration, a duration above 30 seconds makes the model generate multiple sequences. Each generation step after the first uses the latter portion of the previous step’s output as its audio prompt (the same method as continuation).
Infinite generation works with 1) input_audio=None, 2) input_audio with continuation=True, and 3) input_audio longer than duration, used as melody condition audio, which means continuation=False.
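The windowing described above can be sketched as a simple planning loop. This is an illustration only, not the repository's code: plan_generation_steps is a hypothetical helper, and the 10-second prompt length is an assumed value, since the amount of audio reused as the prompt is not specified here.

```python
from typing import List, Tuple


def plan_generation_steps(
    duration: float,
    max_window: float = 30.0,   # per-iteration limit stated above
    prompt_len: float = 10.0,   # ASSUMPTION: seconds of previous output reused
) -> List[Tuple[float, float]]:
    """Return (prompt_seconds, new_seconds) for each generation step."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    if not 0 < prompt_len < max_window:
        raise ValueError("prompt_len must be between 0 and max_window")

    # The first step has no audio prompt and fills up to one full window.
    first = min(duration, max_window)
    steps = [(0.0, first)]
    generated = first

    # Later steps spend part of the window on the prompt, so each one
    # contributes at most (max_window - prompt_len) seconds of new audio.
    while generated < duration:
        new = min(max_window - prompt_len, duration - generated)
        steps.append((prompt_len, new))
        generated += new
    return steps


print(plan_generation_steps(70))
```

With these assumed values, a 70-second request is planned as one 30-second step followed by two steps that each add 20 seconds of new audio.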

Fine-tuning MusicGen

For instructions on fine-tuning MusicGen, see the blog post: Fine-tune MusicGen to generate music in any style

Dataset

Audio

Datasets can be uploaded as compressed files in formats such as .zip, .tar, .gz, and .tgz.
Single audio files with .mp3, .wav, and .flac formats can also be uploaded.
Audio files within the dataset must exceed 30 seconds in duration.
Audio Chunking: Files longer than 30 seconds will be divided into multiple 30-second chunks.
Vocal Removal: If drop_vocals is set to True, the vocal tracks in the audio files will be isolated and removed. (Default: drop_vocals = True)
- For datasets containing audio without vocals, setting drop_vocals = False reduces data preprocessing time and maintains audio file quality.
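The chunking rule can be sketched as a computation of chunk boundaries. This is a minimal illustration rather than the repository's preprocessing code: chunk_bounds is a hypothetical helper, and dropping the final partial chunk by default is an assumption, since the handling of the remainder is not described here.

```python
from typing import List, Tuple


def chunk_bounds(
    total_seconds: float,
    chunk_seconds: float = 30.0,
    keep_remainder: bool = False,  # ASSUMPTION: the short tail is dropped by default
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for each chunk of one audio file."""
    bounds = []
    start = 0.0
    # Emit full-length chunks while one still fits in the file.
    while start + chunk_seconds <= total_seconds:
        bounds.append((start, start + chunk_seconds))
        start += chunk_seconds
    # Optionally keep whatever is left over as a shorter final chunk.
    if keep_remainder and start < total_seconds:
        bounds.append((start, total_seconds))
    return bounds


print(chunk_bounds(95))
```

For a 95-second file this yields three full chunks covering 0 to 90 seconds; the last 5 seconds are kept only when keep_remainder=True.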

Text Description

If each audio file requires a distinct description, create a .txt file with a single-line description for each .mp3 or .wav file, using the same base name (e.g., 01_A_Man_Without_Love.mp3 and 01_A_Man_Without_Love.txt).
For a uniform description across all audio files, set the one_same_description argument to your desired description (str). In this case, there’s no need for individual .txt files.
Auto Labeling: When auto_labeling is set to True, labels such as ‘genre’, ‘mood’, ‘theme’, ‘instrumentation’, ‘key’, and ‘bpm’ will be generated and added to each audio file in the dataset. (Default: auto_labeling = True)
- Available Tags of Auto-Labeling
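Before uploading a dataset that relies on per-file descriptions, it can help to check that every audio file has its matching .txt file. The sketch below is a hypothetical local check, not part of the fine-tuner; find_missing_descriptions and the inclusion of .flac in the extension set are assumptions.

```python
from pathlib import Path
from typing import List

# ASSUMPTION: .flac files follow the same pairing rule as .mp3 and .wav.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac"}


def find_missing_descriptions(dataset_dir: Path) -> List[str]:
    """List audio files in dataset_dir that have no same-named .txt file."""
    missing = []
    for audio in sorted(dataset_dir.iterdir()):
        is_audio = audio.suffix.lower() in AUDIO_EXTENSIONS
        # 01_A_Man_Without_Love.mp3 is paired with 01_A_Man_Without_Love.txt
        if is_audio and not audio.with_suffix(".txt").is_file():
            missing.append(audio.name)
    return missing
```

Calling find_missing_descriptions(Path("my_dataset")) on a local folder (the folder name is a placeholder) returns the audio files that still need a description; an empty list means every file is paired.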

Train Parameters

Train Inputs

dataset_path: Path = Input(description="Path to dataset directory")
one_same_description: str = Input(description="A description for all of audio data", default=None)
auto_labeling: bool = Input(description="Creating label data like genre, mood, theme, instrumentation, key, bpm for each track. Using essentia-tensorflow for music information retrieval.", default=True)
drop_vocals: bool = Input(description="Dropping the vocal tracks from the audio files in dataset, by separating sources with Demucs.", default=True)
model_version: str = Input(description="Model version to train.", default="stereo-melody", choices=["melody", "small", "medium", "stereo-melody", "stereo-small", "stereo-medium"])
lr: float = Input(description="Learning rate", default=1)
epochs: int = Input(description="Number of epochs to train for", default=3)
updates_per_epoch: int = Input(description="Number of iterations for one epoch", default=100)
- If updates_per_epoch is None, the number of iterations per epoch is set according to the dataset size and batch size. Otherwise, the given value is used as the number of iterations per epoch.
batch_size: int = Input(description="Batch size", default=16)

Default Parameters

With epochs=3, updates_per_epoch=100 and lr=1, it takes around 15 minutes to fine-tune the model.
For 8-GPU multiprocessing, batch_size must be a multiple of 8. If it is not, batch_size will be automatically rounded down to the nearest multiple of 8.
For the medium model, the maximum batch_size is 8 on a machine with 8 x Nvidia A40 GPUs.
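The batch-size rule can be written as a one-line computation. The sketch below is illustrative and not taken from the repository: effective_batch_size is a hypothetical helper, and raising an error for values below the GPU count is an assumption, since that case is not described here.

```python
def effective_batch_size(batch_size: int, num_gpus: int = 8) -> int:
    """Round batch_size down to the nearest multiple of num_gpus."""
    if batch_size < num_gpus:
        # ASSUMPTION: fewer samples than GPUs is treated as invalid here.
        raise ValueError(f"batch_size must be at least {num_gpus}")
    return (batch_size // num_gpus) * num_gpus


for requested in (8, 16, 20):
    print(requested, "->", effective_batch_size(requested))
```

A requested batch_size of 20 therefore trains with 16, while 8 and 16 are used unchanged.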

Example Code

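import replicate

# Start a fine-tuning job on Replicate.
# The client reads the API token from the REPLICATE_API_TOKEN environment variable.
training = replicate.trainings.create(
    version="sakemin/musicgen-fine-tuner:b1ec6490e57013463006e928abc7acd8d623fe3e8321d3092e1231bf006898b1",
    input={
        "dataset_path": "https://your/data/path.zip",
        "one_same_description": "description for your dataset music",
        "epochs": 3,
        "updates_per_epoch": 100,
        "model_version": "medium",
    },
    # Your own model repository, which receives the trained weights.
    destination="my-name/my-model",
)

# Print the training object to inspect the submitted job.
print(training)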

References

Auto-labeling and audio chunking features are based on lyramakesmusic’s Finetune-MusicGen Jupyter notebook.
The auto-labeling feature utilizes effnet-discogs from MTG’s essentia.
‘key’ and ‘bpm’ values are obtained using librosa.
Vocal dropping is implemented using Meta’s demucs.

Licenses

All code in this repository is licensed under the Apache License 2.0.
The code in the Audiocraft repository is released under the MIT license as found in the LICENSE file.
The weights in the Audiocraft repository are released under the CC-BY-NC 4.0 license as found in the LICENSE_weights file.
