abstract:8b827dd4193f88ad.tex

\begin{abstract}
How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured \emph{on average} over a distribution of inputs, giving no guarantee for any fixed input.
This paper proposes a theoretically founded solution to this problem: to train \emph{Self-Proving models} that prove the correctness of their output to a verification algorithm $V$ via an Interactive Proof.

A Self-Proving model satisfies the following property: with high probability over a random input, the model generates a correct output \emph{and} successfully proves its correctness to $V\!$. The \emph{soundness} property of $V$ guarantees that, for \emph{every} input, no model can convince $V$ of the correctness of an incorrect output. Thus, a Self-Proving model proves the correctness of most of its outputs, while \emph{all} incorrect outputs (of any model) are detected by $V$. We devise a generic method for learning Self-Proving models, and we prove convergence bounds under certain assumptions.

The theoretical framework and results are complemented by experiments on an arithmetic capability: computing the greatest common divisor (GCD) of two integers. Our learning method is used to train a Self-Proving transformer that computes the GCD \emph{and} proves the correctness of its answer.
\end{abstract}