\begin{abstract}
Sound matching algorithms seek to approximate a target waveform by parametric audio synthesis.
Deep neural networks have achieved promising results in matching sustained harmonic tones.
However, the task is more challenging when targets are nonstationary and inharmonic, e.g., percussion.
We attribute this problem to the inadequacy of the loss function.
On one hand, mean square error in the parametric domain, known as ``P-loss'', is simple and fast but fails to accommodate the differing perceptual significance of each parameter.
On the other hand, mean square error in the spectrotemporal domain, known as ``spectral loss'', is perceptually motivated and is used in differentiable digital signal processing (DDSP).
Yet, spectral loss is a poor predictor of pitch intervals, and its gradient may be computationally expensive, resulting in slow convergence.
To resolve this conundrum, we present the Perceptual-Neural-Physical loss (PNP).
PNP is the optimal quadratic approximation of spectral loss while being as fast as P-loss during training.
We instantiate PNP with physical modeling synthesis as the decoder and the joint time--frequency scattering transform (JTFS) as the spectral representation.
We demonstrate its potential by matching synthetic drum sounds and comparing it with other loss functions.
\end{abstract}
