Separate-and-Detect: Unified Drum Transcription and Stem Generation via Latent Diffusion
Wei-Han Hsu, Chih-Cheng Chang, Bo-Yu Chen, Li Su, Yi-Hsuan Yang

Abstract

Traditional automatic drum transcription (ADT) transcribes directly from full‑mixture music. This is challenging because a model must both detect the presence of drums and distinguish between different drum pieces. This study leverages advances in music source separation to propose a separation‑then‑transcription pipeline: a 5‑stem multitrack drum separator based on latent diffusion generates individual drum stems, after which per‑stem onset detection yields class‑wise pianorolls. The separator denoises in a compact VAE latent space and renders audio with a vocoder. Additional onset and timbre auxiliaries guide the separator during training to encourage percussion‑aware representations. On the MDB and ENST datasets, this pipeline is competitive with a strong U‑Net baseline (LarsNet), showing class‑specific gains while uniquely providing both accurate transcription and editable audio stems. Analysis also reveals that lower separation reconstruction error does not always yield higher transcription accuracy, motivating transcription‑centric objectives in separation models. This work demonstrates that latent diffusion‑based separation offers a viable alternative to direct transcription, achieving competitive accuracy while enabling downstream audio editing applications.
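To make the second stage of the pipeline concrete, the sketch below turns five already-separated stems into a class-wise pianoroll using a simple energy-difference onset detector. It is only an illustration under stated assumptions: the abstract does not specify the onset detector, so the RMS-novelty method, the function names `onset_frames` and `stems_to_pianoroll`, the stem keys, the 100 fps frame rate, and the threshold values are all hypothetical choices, and the stems are assumed to be mono arrays scaled to [-1, 1].

```python
import numpy as np

# Hypothetical class order; the names mirror the five stems shown in the examples.
STEMS = ["kick", "snare", "toms", "hihats", "cymbals"]

def onset_frames(stem, sr, fps=100, threshold=0.05, min_gap=3):
    """Return frame indices (at `fps`) where a mono stem's RMS energy jumps.

    This is a generic energy-novelty detector, not the paper's detector.
    """
    hop = sr // fps                      # samples per pianoroll frame
    n_frames = len(stem) // hop
    frames = np.asarray(stem[: n_frames * hop], dtype=float).reshape(n_frames, hop)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Half-wave rectified difference: only energy increases count as novelty.
    novelty = np.maximum(0.0, np.diff(rms, prepend=0.0))
    picked = []
    for i in range(n_frames):
        left = novelty[i - 1] if i > 0 else 0.0
        right = novelty[i + 1] if i + 1 < n_frames else 0.0
        is_peak = novelty[i] >= threshold and novelty[i] > left and novelty[i] >= right
        # Enforce a minimum spacing (in frames) between consecutive onsets.
        if is_peak and (not picked or i - picked[-1] >= min_gap):
            picked.append(i)
    return picked

def stems_to_pianoroll(stems, sr, fps=100):
    """Stack per-stem onsets into a binary (classes x frames) pianoroll."""
    hop = sr // fps
    n_frames = min(len(stems[name]) for name in STEMS) // hop
    roll = np.zeros((len(STEMS), n_frames), dtype=np.uint8)
    for row, name in enumerate(STEMS):
        for i in onset_frames(stems[name], sr, fps):
            if i < n_frames:
                roll[row, i] = 1
    return roll

# Synthetic usage: a decaying 200 Hz burst stands in for two kick hits.
sr = 16000
n = np.arange(800)
burst = 0.8 * np.exp(-n / 400) * np.sin(2 * np.pi * 200 * n / sr)
demo_stems = {name: np.zeros(sr) for name in STEMS}
demo_stems["kick"][4000:4800] = burst
demo_stems["kick"][12000:12800] = burst
demo_roll = stems_to_pianoroll(demo_stems, sr)
```

Because each stem is processed independently, the class label of an onset comes entirely from which stem it was found in, so separation errors such as bleed between stems would show up directly as transcription errors. An absolute threshold is used here so that silent stems produce no onsets; a real system would likely need per-class tuning.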

Example 1

Example 1 pianoroll visualization.

Audio players (Mixture, Kick, Snare, Toms, Hi‑Hats, Cymbals) for each of:

Target
Prediction: ADTOF
Prediction: LarsNet
Prediction: MSG-LD (+OB & TB)

Example 2

Example 2 pianoroll visualization.

Audio players (Mixture, Kick, Snare, Toms, Hi‑Hats, Cymbals) for each of:

Target
Prediction: ADTOF
Prediction: LarsNet
Prediction: MSG-LD (+OB & TB)