Training machine learning models for instrumental music generation

The most practical path to building an instrumental music generation system today is fine-tuning Meta’s MusicGen on 500–2,000 curated tracks using AudioCraft’s open-source pipeline — a process costing $100–300 in compute and yielding genre-specific results within days. For those aiming higher, the field has converged on two dominant paradigms: autoregressive transformers operating on neural audio codec tokens (MusicGen, MusicLM) and latent diffusion models (Stable Audio, AudioLDM2), with hybrid architectures combining both emerging as the frontier in 2025–2026. This report provides a complete technical blueprint covering architectures, audio representations, datasets, training pipelines, compute costs, open-source tools, Suno’s likely approach, and actionable recommendations for building your own system.


Two paradigms dominate: autoregressive transformers and latent diffusion

Music generation architectures have consolidated around two core approaches, each with distinct strengths for instrumental music.

Autoregressive transformers on codec tokens

The most successful open-source model for instrumental music, MusicGen (Meta, 2023), exemplifies this approach. It uses a single-stage autoregressive transformer decoder operating on EnCodec tokens — discrete audio representations produced by a neural audio codec. Audio at 32kHz is compressed via Residual Vector Quantization (RVQ) into 4 codebooks at 50Hz, yielding 200 tokens per second. MusicGen’s key innovation is the delay pattern: instead of predicting all 4 codebooks sequentially (which would require 200 autoregressive steps per second), each successive codebook is offset by one timestep, allowing near-parallel prediction in just ~50 steps per second. The model comes in three sizes — 300M, 1.5B, and 3.3B parameters — with text conditioning via a frozen T5 encoder through cross-attention. Critically for instrumental music, MusicGen was trained on ~20,000 hours of licensed music with vocals removed using HT-Demucs source separation.

MusicLM (Google, 2023) takes a more complex hierarchical approach with three cascaded stages: a semantic modeling stage using w2v-BERT tokens (25/sec) for long-term musical structure, a coarse acoustic stage using SoundStream tokens for timbral content, and a fine acoustic stage for maximum fidelity. Text conditioning works through MuLan, a contrastive audio-text embedding model trained on 50M music-text pairs. MusicLM produces higher-quality output than early MusicGen but remains proprietary and requires multiple large pretrained components.

Jukebox (OpenAI, 2020) pioneered the field with a three-level VQ-VAE compressing 44.1kHz audio at 128×/32×/8× ratios, followed by sparse transformer priors at each level. Though it generated impressive multi-minute songs with vocals, it required ~9 hours per minute of audio on V100 GPUs and has been archived.

Newer autoregressive models include YuE (HKUST/M-A-P, 2025), a 7B+1B parameter LLaMA2-based model generating full songs up to 5 minutes with a novel track-decoupled next-token prediction separating vocals and accompaniment. MAGNeT (Meta, 2024) uses non-autoregressive masked generation on EnCodec tokens for faster inference.

Latent diffusion and flow matching

Stable Audio (Stability AI) represents the diffusion paradigm. Version 1.0 used a 907M-parameter 1D U-Net operating on VAE-compressed latents of 44.1kHz stereo audio, with CLAP text conditioning. Version 2.0 replaced the U-Net with a Diffusion Transformer (DiT), enabling generation of coherent 3-minute tracks with intro/development/outro structure. Stable Audio Open (~1.21B total parameters) provides the community version: a 156M VAE autoencoder, 109M T5-base text encoder, and 1,057M DiT, trained exclusively on ~486K Creative Commons recordings (~7,300 hours). It generates up to 47 seconds of stereo at 44.1kHz.

AudioLDM2 (CVSSP/University of Surrey) introduces a unified framework using AudioMAE self-supervised representations as an intermediate “Language of Audio.” A GPT-2 model first generates AudioMAE embedding vectors from text, then a latent diffusion U-Net generates audio conditioned on these embeddings plus Flan-T5 cross-attention. The 16kHz output limits its fidelity for music, but it achieves state-of-the-art on text-to-audio benchmarks.

Flow matching is rapidly emerging as the preferred non-autoregressive approach. Models like TangoFlux, DiffRhythm, and JAM use rectified flow on VAE latent spaces, offering deterministic vector fields, straighter sampling trajectories, and compatibility with one-step distillation for near-instant inference.

How the two paradigms compare

A controlled June 2025 study (arXiv:2506.08570) directly compared autoregressive and flow matching approaches using identical data and ~400M parameter transformers. Autoregressive models produced slightly higher perceptual quality and better temporal control adherence, while diffusion/flow models offered greater editing flexibility (bidirectional attention enables inpainting), higher diversity, and tunable inference speed. Autoregressive training proved more robust across hyperparameter configurations. For instrumental music specifically, autoregressive models’ stronger rhythm alignment and note stability make them preferable, while diffusion models excel at high sample rate output (44.1–48kHz stereo) critical for professional audio quality.

ModelYearTypeParamsSample RateMax DurationOpen Source
MusicGen2023AR Transformer300M–3.3B32kHz~30sYes (CC-BY-NC)
MusicLM2023Hierarchical ARUndisclosed24kHz30s+No
Jukebox2020VQ-VAE + AR1B/5B44.1kHzMinutesYes (archived)
Stable Audio Open2024Latent Diffusion DiT~1.21B44.1kHz stereo47sYes (Community)
AudioLDM22023GPT-2 + Latent Diffusion~1.1B16kHzVariableYes (Research)
ACE-Step 1.52026Diffusion + LM hybridUp to 4B44.1kHz10 minYes (Apache 2.0)
YuE2025AR LLaMA27B+1B44.1kHz5 minYes (Apache 2.0)

Audio representations: from raw waveforms to codec tokens

The choice of audio representation fundamentally determines a model’s quality ceiling, training efficiency, and generation speed.

Neural audio codecs transformed the field

The breakthrough enabling modern music generation was neural audio codecs — models that compress continuous audio into discrete token sequences, making audio generation a language modeling problem. EnCodec (Meta, 2022) uses a convolutional encoder-decoder with Residual Vector Quantization, compressing 32kHz audio into 4 codebooks of 2,048 entries each at 50Hz. The first codebook captures the most perceptually important information; each subsequent codebook encodes the residual error from the previous one. For MusicGen, this means 1 second of audio becomes 4 × 50 = 200 discrete tokens — a dramatic reduction from the 32,000 raw samples that would otherwise be needed.

Descript Audio Codec (DAC) improves on EnCodec with Snake activations (providing periodic inductive bias for audio), projected codebook learning to prevent codebook collapse, and quantizer dropout for variable-bitrate operation. DAC operates at 44.1kHz with 9 codebooks at 86Hz, achieving better objective metrics (PESQ 3.77 vs EnCodec’s 3.12). However, MusicGen’s authors found slightly worse subjective results with DAC, likely because their EnCodec was specifically optimized for music.

SoundStream (Google, 2021) uses up to 12 RVQ codebooks at 50Hz for 24kHz audio, producing 600 acoustic tokens per second. It serves as the audio backbone for MusicLM and AudioLM.

Mel spectrograms and raw waveforms still have roles

Mel spectrograms — time-frequency representations on a perceptual scale — provide ~100× compression over raw audio and align with human perception. AudioLDM operates on mel spectrogram latents, and Riffusion treats spectrograms as images for Stable Diffusion. The key limitation is phase information loss: reconstructing audio requires a vocoder (HiFi-GAN, Griffin-Lim), which introduces artifacts that become a quality bottleneck.

Raw waveform generation (WaveNet, SampleRNN) preserves maximum fidelity but is impractically slow — each of 16,000+ samples per second requires a sequential forward pass. VAE latent representations (used by Stable Audio) offer a continuous alternative to discrete codecs, enabling diffusion models to operate in a heavily compressed space while the VAE decoder handles waveform reconstruction.

MIDI remains relevant for symbolic music generation (MuseNet, Music Transformer) where note-level control and editability matter, but cannot represent timbre, recording characteristics, or non-pitched sounds.

RepresentationQuality CeilingTraining EfficiencyGeneration SpeedBest For
Raw waveformHighest (lossless)Very lowVery slowResearch only
Mel spectrogramHigh (vocoder-limited)ModerateModerateDiffusion models
Codec tokens (EnCodec/DAC)High (bitrate-dependent)HighFastAutoregressive models
VAE latentHigh (decoder-limited)HighFastLatent diffusion
MIDIN/A (requires synth)Very highVery fastSymbolic generation

Datasets: what exists and what you need

The open data landscape

The largest freely available resource is the Free Music Archive (FMA): 106,574 tracks totaling ~8,232 hours under Creative Commons licenses, with genre hierarchy across 161 genres. It skews toward experimental, electronic, and rock. MTG-Jamendo provides 55,525 tracks (~3,777 hours) with 195 tags covering genres, instruments, and moods. Combined with Freesound (used by Stability AI for Stable Audio Open), these form the backbone of any CC-licensed training effort, yielding roughly 12,000+ hours of usable audio.

MusicCaps (Google) contains only 5,521 ten-second clips but includes expert-written captions by professional musicians, making it the standard evaluation benchmark rather than a training dataset. LP-MusicCaps extends this with LLM-generated pseudo-captions: the MSD variant provides 500K clips with 2.2M captions. JamendoMaxCaps (February 2025) offers 362,000+ instrumental tracks from Jamendo with AI-generated captions — a significant new resource. MAESTRO provides ~200 hours of piano with aligned MIDI, and Slakh2100 offers 145 hours of multi-track music synthesized from MIDI using virtual instruments.

Commercial training data and licensing reality

Meta trained MusicGen on 20,000 hours of licensed music from internal collections, Shutterstock, and Pond5. Stability AI trained Stable Audio commercially on 800,000+ files (19,500 hours) from AudioSparx. For the open version, they used only CC-licensed data and ran Audible Magic detection to filter copyrighted content.

The legal landscape is stark. Suno admitted in court filings that its training data includes “essentially all music files of reasonable quality accessible on the open Internet.” This triggered RIAA lawsuits seeking $150,000 per infringed work, resulting in Warner Music settling for a reported $500 million in November 2025, with Suno agreeing to retire all models trained on unlicensed music. The US Copyright Office concluded in May 2025 that fair use likely does not apply to training on expressive works used to generate substitution products. For any commercial application, using only CC-licensed or properly licensed data is the only defensible path.

Preprocessing pipeline

Effective data preparation requires several stages. Loudness normalization to -14 dB LUFS ensures consistent training signal. Sample rate conversion matches the model’s target (32kHz for MusicGen, 44.1kHz for Stable Audio). Vocal removal via Demucs (Meta’s HT-Demucs) is essential for instrumental models — MusicGen’s training used this extensively. Audio is chunked into 30-second segments for transformer training, with random cropping providing augmentation.

Metadata extraction using Essentia (effnet-discogs model) automatically detects genre, mood, instrumentation, key, and BPM. Caption generation follows the LP-MusicCaps approach: extract tags, feed to an LLM, generate natural language descriptions. Example output: “A mellow jazz track with walking bass and brushed drums, 120 BPM, key of Bb major, relaxed mood.” Quality filtering should remove low-bitrate recordings, tracks shorter than 10 seconds, and content that doesn’t match its metadata.


The training pipeline from data to generated music

Tokenization and model input preparation

For autoregressive models like MusicGen, the pipeline begins with pre-trained EnCodec encoding all audio into discrete tokens. Each 30-second training clip becomes a sequence of codebook indices: 4 codebooks × 50 frames/second × 30 seconds = 6,000 tokens arranged in the delay pattern. For diffusion models like Stable Audio, audio passes through a pre-trained VAE encoder producing continuous latent representations at heavily reduced temporal resolution. In both cases, the codec/VAE is pre-trained separately and frozen during the main model’s training.

Text conditioning uses frozen pre-trained encoders: MusicGen uses T5, Stable Audio uses T5-base, AudioLDM2 uses both CLAP and Flan-T5. Text descriptions are encoded into hidden state sequences and injected via cross-attention layers. During training, conditioning is randomly dropped with probability 0.1–0.2 to enable classifier-free guidance (CFG) at inference — the mechanism that dramatically improves prompt adherence by contrasting conditional and unconditional predictions.

Training objectives differ by architecture

Autoregressive models (MusicGen, MusicLM) use standard cross-entropy loss over the vocabulary of each codebook. At each delay pattern step, the model predicts the next token positions across all codebooks simultaneously, with separate output heads per codebook. Loss is computed only on non-masked positions.

Diffusion models (Stable Audio, AudioLDM) add Gaussian noise to latent representations at a random timestep t, then train the model to predict the added noise (or the clean signal, or the velocity). The loss is mean squared error between predicted and actual noise. Stable Audio uses the EDM DPM-Solver multistep scheduler.

Masked token prediction (VampNet, SoundStorm) randomly masks a subset of tokens and trains the model to reconstruct them given unmasked context. SoundStorm uses progressive masking across RVQ levels, requiring only 27 forward passes for 30 seconds versus 18,000 for flat autoregressive decoding.

Inference and evaluation

At inference, MusicGen uses top-k sampling (k=250), temperature=1.0, and CFG coefficient=3.0 by default. The guided prediction follows: ε̃ = (1+w)·ε(z,c) − w·ε(z), where higher guidance strength w improves prompt adherence at the cost of diversity. Stable Audio typically uses guidance scale 7.0 with 8–20 diffusion steps.

Evaluation relies on Fréchet Audio Distance (FAD) — the audio analog of FID, computing distributional distance between generated and reference audio embeddings (lower is better). KL divergence measures label distribution divergence using a PaSST classifier. CLAP score (cosine similarity between CLAP audio and text embeddings) measures prompt adherence. MusicGen reports FAD scores of 4.88–5.48 and CLAP scores of ~0.27–0.28 on MusicCaps. Human evaluation remains essential, typically scoring overall quality (OVL) and text relevance (REL) on 1–100 scales.


Compute requirements at every scale

Fine-tuning is accessible on consumer hardware

MusicGen-small (300M) fits on a single RTX 4090 (24GB) with mixed-precision training. MusicGen-medium (1.5B) is tight but feasible on an RTX 4090 with autocast. The large model (3.3B) requires FSDP across multiple GPUs. A proven community benchmark: fine-tuning MusicGen-stereo-melody-large on ~1,800 samples (~7 hours of audio) using 8× A100 40GB on Lambda Labs took ~6 hours and cost approximately $150. For minimal experiments, Replicate’s hosted fine-tuner processes 9–10 tracks in ~15 minutes on 8× A40 for under $10.

Stable Audio Open fine-tuning works on A6000 GPUs (48GB VRAM), with inference possible on 16GB. LoRA fine-tuning of MusicGen requires only 10–16GB GPU memory and as little as 15 minutes of audio data.

Training from scratch demands serious investment

Training MusicGen-small from scratch requires an estimated 2,000–5,000 A100-hours (~$2,000–10,000). The medium model needs 10,000–20,000 A100-hours ($15,000–40,000), and the large model 30,000–50,000 A100-hours ($50,000–100,000+). These estimates assume ~5,000–20,000 hours of training data.

Current cloud GPU pricing offers increasingly competitive rates: A100 80GB instances run $1.29–2.50/hour (JarvisLabs to Lambda Labs), while H100 SXM runs $1.38–3.29/hour (Thunder Compute to Lambda). An 8× H100 node costs approximately $16–24/hour.

Building a production-quality system approaching Suno’s capabilities requires $500K–$5M+ in compute alone, 20,000–100,000+ hours of licensed music, and a team of 5–15+ ML engineers and audio specialists. This is fundamentally a venture-scale endeavor.


The open-source ecosystem has matured dramatically

AudioCraft remains the most complete framework

Meta’s AudioCraft (github.com/facebookresearch/audiocraft, MIT license for code, CC-BY-NC for weights) provides the most mature end-to-end pipeline. It includes MusicGen, AudioGen, EnCodec, Multi Band Diffusion, MAGNeT, AudioSeal (watermarking), MusicGen Style, and JASCO (chord/melody/drum conditioning). Ten-plus pretrained checkpoints are available on HuggingFace, from facebook/musicgen-small through stereo melody variants. Full training code uses the Dora grid system with Hydra configs, supporting both from-scratch training and fine-tuning via continue_from.

Community fine-tuning tools have proliferated: ylacombe/musicgen-dreamboothing provides LoRA fine-tuning via HuggingFace PEFT with DreamBooth-style trigger words, requiring as little as 27 minutes of audio. chavinlo/musicgen_trainer offers a simpler interface with .wav + .txt pairs. sakemin/cog-musicgen-fine-tuner on Replicate provides turnkey fine-tuning with automatic preprocessing.

Stable Audio Open and the diffusion ecosystem

Stable Audio Open (github.com/Stability-AI/stable-audio-tools) provides full training and inference code with JSON-based configuration. It supports training both the VAE-GAN autoencoder and DiT from scratch, plus fine-tuning pretrained models. A community fork (yukara-ikemiya/friendly-stable-audio-tools) offers improved documentation. The May 2025 “Small” variant (497M params) runs on mobile devices with Int8 quantization.

AudioLDM2 (github.com/haoheliu/AudioLDM2) integrates with HuggingFace Diffusers, offering multiple checkpoints including a music-focused audioldm2-music-665k. Training code remains planned but not yet released; fine-tuning is possible through the Diffusers library.

The 2025–2026 wave of Apache 2.0 models

A transformative shift occurred in 2025–2026 with Chinese labs releasing powerful models under Apache 2.0 (commercially permissive) licenses. ACE-Step 1.5 (ACE Studio + StepFun) uses a hybrid diffusion + language model architecture generating 4 minutes of music in 20 seconds on an A100, running locally with <4GB VRAM. It supports built-in LoRA fine-tuning from just a few songs. The April 2026 XL variant uses a 4B-parameter DiT. DiffRhythm (ASLP Lab/Xiaomi) is the first latent diffusion model for full-length song generation (4m45s in ~10 seconds). InspireMusic (Alibaba) combines an autoregressive transformer with flow-matching super-resolution, generating 5+ minutes at 48kHz stereo with full training code under MIT license. SongGeneration v2 (Tencent, 4B params) uses a hybrid LLM-Diffusion architecture with automated aesthetic evaluation.

These models collectively enable vocal + accompaniment generation, multi-track separation during generation, and long-form musical structure — capabilities that were proprietary-only a year ago.


What Suno actually does under the hood

Confirmed architectural details

Suno CEO Mikey Shulman has directly confirmed several key facts in multiple interviews (January 2025). The system uses autoregressive transformers as its core: “There’s something special about Autoregression… it figures it out bit by bit, which tends to make for more interesting music. The cartoon version is that Autoregression might make really beautiful music that sounds poorly recorded, while Diffusion models make great sounding elevator music that’s a little boring.” The architecture is “surprisingly similar to text LLMs — we also use transformers and our edge is figuring out how to tokenize audio correctly.”

Audio tokenization is Suno’s stated competitive edge. They compress audio from ~50,000 samples/second to approximately 50 tokens/second, creating discrete representations analogous to MP3-like compression but optimized for generation quality. Suno uses multiple models working together — at minimum Bark (vocals/speech) and Chirp (instrumentation) — with likely an LLM component for lyrics. Forensic analysis by authio.io found Suno operates at a native 32kHz sample rate, upsampling to 44.1kHz for output, creating a hard spectral cutoff at 16kHz.

The Bark model reveals the architectural philosophy

Bark, Suno’s open-source model (MIT license, 300M parameters), uses a three-stage GPT-style pipeline: text → semantic tokens, semantic → coarse EnCodec tokens (first 2 codebooks), coarse → fine EnCodec tokens (all 8 codebooks). The production system has evolved enormously beyond Bark but likely retains the same fundamental approach — autoregressive transformer on audio codec tokens — with custom in-house codecs supporting stereo, much larger model sizes (likely billions of parameters), longer context windows, and RLHF-style preference optimization from user feedback.

Suno admitted in August 2024 court filings that its training data includes “essentially all music files of reasonable quality accessible on the open Internet.” This admission, made only under legal pressure, triggered massive legal consequences: the RIAA filed suit seeking $150,000 per infringed work, Warner Music settled for a reported $500 million in November 2025 (requiring Suno to retire all models trained on unlicensed music), while UMG and Sony litigation continues. International suits from GEMA (Germany) and Koda (Denmark) are pending. Suno has raised $375M total (Series C of $250M in November 2025) and reports ~$150M ARR with 100M+ users.


The practical path forward for April 2026

Path 1: Fine-tune MusicGen (days, $100–300)

Start with MusicGen-medium (1.5B) or MusicGen-stereo-melody-large (3.3B). MusicGen remains the strongest option for instrumental generation — it was explicitly trained on vocal-free data with proven quality metrics.

Prepare 500–2,000 high-quality instrumental tracks: collect WAV files, run Demucs for vocal removal, chunk into 30-second segments, auto-label with Essentia (genre, mood, instruments, key, BPM), and generate text captions. Use LoRA fine-tuning (ylacombe/musicgen-dreamboothing) for quick experiments on a single RTX 4090, or full fine-tuning via AudioCraft’s Dora framework on 4–8× A100s for best results. Pre-compute EnCodec tokens to avoid redundant computation each epoch. Expect style-specific generation (lo-fi hip-hop, cinematic orchestral, ambient electronic) within 6–15 hours of training.

Path 2: Train from scratch at research scale (weeks, $5K–50K)

Choose between the autoregressive path (EnCodec tokenizer + transformer decoder via AudioCraft) or the diffusion path (VAE + DiT via stable-audio-tools). Assemble 1,000–5,000 hours from FMA + MTG-Jamendo + Freesound, supplemented with MIDI-rendered synthetic data using high-quality virtual instruments. Generate captions using the LP-MusicCaps approach. Train a 300M–1.5B parameter model on 4–8× A100 80GB for 1–4 weeks. The EnCodec tokenizer can be reused pre-trained; focus compute on the generative model.

A strong emerging alternative is ACE-Step 1.5 (Apache 2.0), which provides full training code, runs on <4GB VRAM for inference, and supports LoRA fine-tuning from just a few songs. For a diffusion-based approach, InspireMusic (MIT license) offers complete training scripts for its autoregressive + flow-matching architecture at 48kHz stereo.

Path 3: Production scale (months, $500K+)

Approaching Suno quality requires 20,000–100,000+ hours of properly licensed music (budget $100K–$1M+ for stock music licensing from AudioSparx, Pond5, or Shutterstock), 64–256+ H100 GPUs running for weeks to months, and a team of 5–15+ engineers. The realistic total investment is $5–20M+ including team, data, compute, and infrastructure. Consider the hybrid LLM + Diffusion architecture emerging in 2025–2026 models (ACE-Step, SongGeneration v2) as the most promising direction for new production systems.

Critical pitfalls to avoid

Training with vocals in data produces garbled outputs — always run Demucs. Small datasets (10–50 tracks) overfit rapidly; community trainers warn they “only work for overfitting.” Poor text descriptions yield weak conditioning; invest in automated labeling with Essentia and LLM-based caption generation. AudioCraft’s nested Hydra/Dora configuration is notoriously complex — start with Replicate or community trainers before diving into the raw framework. For any commercial deployment, use exclusively CC-licensed or properly licensed data; the Suno lawsuits demonstrate severe legal consequences for unauthorized training on copyrighted music.


Conclusion

The music generation field has undergone a remarkable acceleration from 2023 to 2026. The fundamental technical insight — that neural audio codecs convert continuous audio into discrete tokens amenable to language modeling — has made autoregressive transformers the dominant approach for high-quality music generation, while latent diffusion offers advantages in sample rate, editing flexibility, and inference speed tunability.

Three developments define the current moment. First, the Apache 2.0 wave from Chinese labs (ACE-Step, DiffRhythm, InspireMusic, YuE) has democratized full-song generation with commercially permissive licensing, running on consumer hardware. Second, hybrid architectures combining autoregressive planning with diffusion rendering represent the architectural frontier. Third, the legal reckoning over training data — crystallized by Suno’s $500M Warner settlement — has established that licensed data is not optional for commercial systems.

For practitioners starting today, the ecosystem offers a clear gradient from accessible to ambitious: LoRA fine-tuning MusicGen on a single GPU for style specialization, AudioCraft or ACE-Step for research-scale training, and the full open-source stack (codecs, transformers, diffusion, evaluation metrics) for production-grade systems. The gap between open-source and commercial capabilities has narrowed dramatically, and an individual or small team can now build a competent instrumental music generation system — the question is no longer whether the tools exist, but how much quality, control, and legal defensibility you need.