abstract:118d919e56d49537.tex

1: \begin{abstract}

2:

3: Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels,

4: given an unconstrained audio sample.

5: %

6: Taking either the split-and-classify (\ie, frame-level) strategy or

7: the more principled event-level modeling approach,

8: all existing methods consider the SED problem

9: from the discriminative learning perspective.

10: %

11: In this work, we reformulate the SED problem by taking a generative learning perspective.

12: %

13: Specifically, we aim to generate the temporal boundaries of sound events from noisy proposals in a denoising diffusion process, conditioned on a target audio sample.

14: %

15: During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions

16: within a Transformer decoder framework.

17: %

18: Doing so enables the model to generate accurate event boundaries even from noisy queries during inference.

19: %%

20: % {\bf\em \modelname} is a novel approach to the Sound Event Detection (SED) problem, which differs from traditional methods by taking a generative learning perspective instead of a discriminative one. Unlike the latter, which focus on identifying discrete audio events through a single pass of the decoder, \modelname{} generates sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a test audio sample. During training, the model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions. The model generates accurate event boundaries during inference by starting from noisy event queries.

21: %

22: Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives,

23: % for both SED and tagging tasks,

24: with over 40\% faster convergence in training.

25: \end{abstract}

26: