\begin{abstract}
Generative AI (GenAI) models have recently achieved remarkable empirical performance in various applications; however, their evaluations still lack uncertainty quantification.
In this paper, we propose a method to compare two generative models based on an unbiased estimator of their relative performance gap.
Statistically, our estimator achieves the parametric convergence rate and is asymptotically normal, which enables valid inference.
Computationally, our method is efficient and can be accelerated by parallel computing and by pre-storing intermediate results.
On simulated datasets with known ground truth, we show that our approach effectively controls the type I error and achieves power comparable to that of commonly used metrics.
Furthermore, we demonstrate our method by evaluating diffusion models on real image datasets with statistical confidence.
\end{abstract}