1: \begin{abstract}
2: 
3: Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels,
4: given an unconstrained audio sample.
5: %
6: Taking either the split-and-classify (\ie, frame-level) strategy or 
7: the more principled event-level modeling approach, 
8: all existing methods consider the SED problem
9: from the discriminative learning perspective.
10: %
11: In this work, we reformulate the SED problem by taking a generative learning perspective.
12: %
13: Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample.
14: %
15: During training, our model learns to reverse the noising process by converting noisy latent queries into their ground-truth versions
16: within a Transformer decoder framework.
17: %
18: Doing so enables the model to generate accurate event boundaries even from noisy queries during inference.
19: %%
20: % {\bf\em \modelname} is a novel approach to the Sound Event Detection (SED) problem, which differs from traditional methods by taking a generative learning perspective instead of a discriminative one. Unlike the latter, which focus on identifying discrete audio events through a single pass of the decoder, \modelname{} generates sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a test audio sample. During training, the model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions. The model generates accurate event boundaries during inference by starting from noisy event queries. 
21: %
22: Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives,
23: % for both SED and tagging tasks, 
24: with over 40\% faster convergence in training.
25: \end{abstract}
26: