Traditional automatic drum transcription (ADT) transcribes drums directly from full-mixture music. This is a challenging task: a model must both detect the presence of drums and distinguish between individual drum pieces. This study leverages advances in music source separation to propose a separation-then-transcription pipeline: a 5-stem multitrack drum separator based on latent diffusion generates individual drum stems, after which per-stem onset detection yields class-wise pianorolls. The latent diffusion separator denoises in a compact VAE latent space and renders audio with a vocoder. Additional onset and timbre auxiliaries guide the separator during training to encourage percussive-aware representations. On the MDB and ENST datasets, this pipeline is competitive with a strong U-Net baseline (LarsNet), showing class-specific gains while uniquely providing both accurate transcription and editable audio stems. Analysis also reveals that lower separation reconstruction error does not always result in higher transcription accuracy, motivating transcription-centric objectives in separation models. This work demonstrates that latent diffusion-based separation offers a viable alternative to direct transcription, achieving competitive accuracy while enabling downstream audio editing applications.
Example 1 pianoroll visualization.
Example 2 pianoroll visualization.
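The second stage of the pipeline, per-stem onset detection producing a class-wise pianoroll, can be illustrated with a minimal sketch. This is not the onset detector used in this work: it swaps in a simple frame-energy rise detector, and the stem names, the `hop` size, the `threshold` value, and the function names `onset_frames` and `stems_to_pianoroll` are all assumptions made for illustration. It only shows how separated stems map to one binary pianoroll row per drum class.

```python
import numpy as np

# Assumed names for the 5 stems; the actual class labels may differ.
STEMS = ("kick", "snare", "toms", "hihat", "cymbals")


def onset_frames(audio, hop=512, threshold=0.1):
    """Toy onset detector: frames where the RMS energy rises sharply.

    A stand-in for a real onset detector, chosen only for brevity.
    """
    n_frames = len(audio) // hop
    frames = audio[: n_frames * hop].reshape(n_frames, hop)
    energy = np.sqrt((frames ** 2).mean(axis=1))            # RMS per frame
    flux = np.maximum(np.diff(energy, prepend=0.0), 0.0)    # positive energy change
    padded = np.pad(flux, 1)                                # zero-pad both ends
    # A frame is an onset if its flux exceeds the threshold and is a local peak.
    is_peak = (flux > threshold) & (flux >= padded[:-2]) & (flux > padded[2:])
    return np.flatnonzero(is_peak)


def stems_to_pianoroll(stems, hop=512, threshold=0.1):
    """Build a (n_classes, n_frames) binary pianoroll from a dict of stem signals."""
    n_frames = min(len(a) for a in stems.values()) // hop
    roll = np.zeros((len(STEMS), n_frames), dtype=np.uint8)
    for row, name in enumerate(STEMS):
        hits = onset_frames(stems[name], hop, threshold)
        roll[row, hits[hits < n_frames]] = 1
    return roll
```

In this sketch each row of the returned array corresponds to one stem and each column to one analysis frame, so errors made by the separator (for example, snare energy leaking into the kick stem) show up directly as spurious onsets in the wrong row. That is one way to see why reconstruction error and transcription accuracy need not move together.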