\begin{abstract}
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings.
The association of these constituent sound events with their mixture and with each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur.
With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically linked views for use in self-supervised contrastive learning.
We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone.
Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations.
Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views.
The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
\end{abstract}