abstract:10998650c84de994.tex

1: \begin{abstract}

2: Sound matching algorithms seek to approximate a target waveform by parametric audio synthesis.

3: Deep neural networks have achieved promising results in matching sustained harmonic tones.

4: However, the task is more challenging when targets are nonstationary and inharmonic, e.g., percussion.

5: We attribute this problem to the inadequacy of the loss function.

6: On one hand, mean square error in the parametric domain, known as ``P-loss'', is simple and fast but fails to accommodate the differing perceptual significance of each parameter.

7: On the other hand, mean square error in the spectrotemporal domain, known as ``spectral loss'', is perceptually motivated and is used in differentiable digital signal processing (DDSP).

8: Yet, spectral loss is a poor predictor of pitch intervals and its gradient may be computationally expensive, which leads to slow convergence.

9: To resolve this conundrum, we present the Perceptual-Neural-Physical (PNP) loss.

10: PNP is the optimal quadratic approximation of spectral loss while being as fast as P-loss during training.

11: We instantiate PNP with physical modeling synthesis as the decoder and the joint time--frequency scattering transform (JTFS) as the spectral representation.

12: We demonstrate its potential on matching synthetic drum sounds in comparison with other loss functions.

13: \end{abstract}
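
% Illustrative sketch only. The notation below is assumed, not taken from the abstract:
% g is the synthesizer (decoder), \Phi the spectral representation (e.g., JTFS),
% \theta the ground-truth parameters, and \tilde{\theta} the predicted parameters.
The three losses contrasted in the abstract can be sketched as follows, under assumed notation: $g$ denotes the synthesizer, $\Phi$ the spectral representation, $\theta$ the ground-truth parameter vector, and $\tilde{\theta}$ its estimate. Normalization constants (the ``mean'' in mean square error) are omitted. P-loss and spectral loss would then read
\[
\mathcal{L}^{\mathrm{P}}_{\theta}(\tilde{\theta}) = \big\| \tilde{\theta} - \theta \big\|_2^2 ,
\qquad
\mathcal{L}^{\Phi}_{\theta}(\tilde{\theta}) = \big\| (\Phi \circ g)(\tilde{\theta}) - (\Phi \circ g)(\theta) \big\|_2^2 .
\]
One plausible reading of ``quadratic approximation'', assumed here, is a first-order Taylor expansion of $\Phi \circ g$ around $\theta$, with $J(\theta)$ denoting the Jacobian of $\Phi \circ g$ at $\theta$:
\[
(\Phi \circ g)(\tilde{\theta}) \approx (\Phi \circ g)(\theta) + J(\theta)\,(\tilde{\theta} - \theta) .
\]
Substituting this expansion into the spectral loss yields a quadratic form in the parameter error:
\[
\mathcal{L}^{\Phi}_{\theta}(\tilde{\theta}) \approx (\tilde{\theta} - \theta)^{\top} M(\theta)\,(\tilde{\theta} - \theta),
\qquad
M(\theta) = J(\theta)^{\top} J(\theta) .
\]
Under this reading, $M(\theta)$ depends only on the ground-truth parameters and could be computed once before training, so that each training step would cost about as much as P-loss. Setting $M(\theta)$ to the identity matrix recovers P-loss.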

14: