cs0703026/rnc.tex
1: % Last modified on Wed May 14 14:49:46 GMT     2008 by marc
2: 
3: \documentclass[11pt,twoside]{article}
4: \usepackage{latex8}
5: %\usepackage{times}
6: \usepackage[dvips]{epsfig}
7: \usepackage[dvips]{graphicx}
8: 
9: \usepackage[latin9]{inputenc} \usepackage[T1]{fontenc}
10: \usepackage[english]{babel}
11: 
12: \usepackage{amsmath}
13: \usepackage{amsfonts}
14: \usepackage{amssymb,amsthm}
15: \usepackage{fancybox}
16: \usepackage{url}
17: %\usepackage{styles}
18: \usepackage{graphicx}
19: \usepackage{xypic}
20: \usepackage{multicol}
21: \usepackage{rotating}
22: %\RRIno{\usepackage{cite}}
23: \usepackage{listings}
24: \usepackage{algorithm,color}
25: 
26: \pagestyle{empty}
27: 
28: \newcommand{\TODO}[1]{\textcolor{red}{#1}}
29: \newcommand{\old}[1]{\mathrm{\backslash{}old}\left(#1\right)}
30: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
31: 
32: \begin{document}
33: 
34: \title{\bf Formal proof for delayed finite field arithmetic \\ using
35:   floating point operators\thanks{This work has been partially founded
36:     by PICS 2533 of the CNRS, project EVA-Flo of the ANR, project
37:     CerPAN of the ANR and PPF Suréna.}}
38: 
39: 
40: \author{%
41:   Sylvie Boldo  \\ {\sc inria}  (Saclay - Île-de-France)               \\ sylvie.boldo@inria.fr    \and
42:   Marc Daumas   \\ {\sc eliaus} ({\sc ea 3679 upvd})                   \\ marc.daumas@ens-lyon.org \and
43:   Pascal Giorgi \\ {\sc lirmm}  ({\sc umr} 5506 {\sc cnrs}--{\sc um}2) \\ pascal.giorgi@lirmm.fr
44: }
45: 
46: \maketitle \thispagestyle{empty}
47:   
48: \begin{abstract}
49:   Formal proof checkers such as Coq are capable of validating proofs
50:   of correction of algorithms for finite field arithmetics but they
51:   require extensive training from potential users.  The delayed
52:   solution of a triangular system over a finite field mixes operations
53:   on integers and operations on floating point numbers. We focus in
54:   this report on verifying proof obligations that state that no round
55:   off error occurred on any of the floating point operations. We use a
56:   tool named Gappa that can be learned in a matter of minutes to
57:   generate proofs related to floating point arithmetic and hide
58:   technicalities of formal proof checkers. We found that three
59:   facilities are missing from existing tools.  The first one is the
60:   ability to use in Gappa new lemmas that cannot be easily expressed
61:   as rewriting rules. We coined the second one ``variable
62:   interchange'' as it would be required to validate loop interchanges.
63:   The third facility handles massive loop unrolling and argument
64:   instantiation by generating traces of execution for a large number
65:   of cases. We hope that these facilities may sometime in the future
66:   be integrated into mainstream code validation.
67: \end{abstract}
68: 
69: \newcommand{\R}{\mathbb{R}}
70: \newcommand{\Z}{\mathbb{Z}}
71: \newcommand{\IF}{\mathbb{IF}}
72: \newcommand{\Zp}{{\Z/p\Z}}
73: 
74: \lstset{basicstyle=\tt\small,xleftmargin=0cm,numberstyle=\tiny}
75: 
76: \renewcommand{\thelstlisting}{\arabic{lstlisting}}
77: 
78: \section{Introduction}
79: 
80: Introducing a new algorithm is a difficult task. Authors have to
81: persuade readers that their algorithm is correct and efficient. Such 
82: goals  are usually attained  by providing pen-and-paper proofs of
83: correction more or less interlaced with  the description of the algorithm.
84: Authors may also provide results of tests to guarantee correction
85: and efficiency on random  cases and on known or new hard
86: cases.  Alas, this process is known to fail on mundane  as well as notorious
87: occurrences~\cite{RusHen91}.
88: 
89: Developing a proof of correction in a formal proof checker using
90: higher order logic such as Coq~\cite{BerCas04} would be a nice
91: alternative but such a task usually represents a large amount of work
92: outside the fields of expertise of most authors.
93: 
94: The delayed solver studied here works on a $N \times N$ unitary
95: triangular matrix on $\Z / p \Z$ finite field. The key improvement of
96: this algorithm compared to state of the art lies in the fact that
97: delayed algorithms use floating point units to perform operations with
98: no rounding error and delay computations of remainders as much as
99: possible. Operations on floating point numbers are limited to three
100: functions. The other functions use combinatorial logic.
101: 
102: The first function (\verb+DGEMM_NEG+) performs a naive matrix
103: multiplication and Gappa may soon be able to handle the proof
104: obligation generated by a tool such as the Why
105: platform~\cite{FilMar07} and the corresponding floating-point
106: annotations~\cite{BolFil07}.  The second function (\verb+DTRSM+) is
107: invoked only under the \verb+delay+ predicate.  This is enforced by
108: the condition on the induction of the invoking function
109: (\verb+LZ_TRSM+) for $N$ between 2 and 54 with $p$ prime varying
110: between $2$ and $94,906,266$. Variable interchange in the \verb+delay+
111: predicate allows to limit the proof to the 53 different values of $N$
112: where a naive user would consider the $94,906,265$ different values of
113: $p$.
114: 
115: Proof obligations are usually derived from a static analysis of the
116: source code considered. Our work showed that generating proof
117: obligations from traces of execution after most parameters have been
118: instantiated may also be useful. We have set up a C++ class to provide
119: such proof obligations but we hope that such capability will be
120: provided by Why and similar tools in the future.
121: 
122: We present some background information in the remaining of the
123: introduction. We continue with a new lemma that might be used by Gappa
124: for inductive linear bounds in Section~\ref{sec/had} and with our
125: prototyping variable interchange developments in
126: Sections~\ref{sec/xch}. We conclude this work in
127: Section~\ref{sec/conc}.
128: 
129: \newcommand{\lCeil}{\left\lceil}
130: \newcommand{\rCeil}{\right\rceil}
131: \def\so{{O {\;\!\tilde{}}\,}}
132: \newcommand{\linbox}{{\sc LinBox}}
133: \newcommand{\lsp}{{{\tt LSP}}}
134: \newcommand{\lup}{{{\tt LUP}}}
135: \newcommand{\lud}{{{\tt LUdivine}}}
136: \newcommand{\lqup}{{{\tt LQUP}}}
137: \newcommand{\turbo}{{{\tt TURBO}}}
138: \newcommand{\fgemm}{{{\tt Fgemm}}}
139: \newcommand{\tu}{{{\tt TURBO}}}
140: \newcommand{\trsm}{{{\tt Trsm}}}
141: \newcommand{\ltrsm}{{{\tt ULeft-Trsm}}}
142: \newcommand{\ultrsm}{{{\tt ULeft-Trsm}}}
143: \newcommand{\lltrsm}{{{\tt LLeft-Trsm}}}
144: \newcommand{\urtrsm}{{{\tt URight-Trsm}}}
145: \newcommand{\lrtrsm}{{{\tt LRight-Trsm}}}
146: \newcommand{\dbl}{\texttt{double} }
147: 
148: \newcommand{\di}{\displaystyle}
149: \newcommand{\GF}[1]{\ensuremath{\mathtt {GF}(#1)}}
150: \newcommand{\GO}{{\cal O}}
151: \newcommand{\pF}[1]{\leavevmode
152:         \kern.1em\raise.0ex \hbox{\Z}\kern-.1em /\kern-.15em\lower.3ex
153:          \hbox{#1}\mbox{\Z}}
154: \newcommand{\til}{\lower 2pt\hbox{\small${}^\sim$}}
155: 
156: \newtheorem{thm}{Theorem}[section]
157: \newtheorem{cnj}[thm]{Conjecture}
158: \newtheorem{lem}[thm]{Lemma}
159: \newtheorem*{cor}{Corollary}
160: \newtheorem{prop}[thm]{Proposition}
161: \newtheorem{defi}[thm]{Definition}
162: \newtheorem{rem}[thm]{Remark}
163: \newtheorem{pty}[thm]{Property}
164: 
165: % \input{02-finite-field.tex}
166: 
167: \subsection{Finite field arithmetic and application to linear algebra}
168: 
169: % \label{sec/pg}
170: 
171: Finite field arithmetic plays a crucial role in nowadays applications.
172: One of the most extensively studied application of finite fields is
173: cryptography.  Another key application of finite field arithmetic
174: arises with exact linear algebra computation where modular techniques
175: (e.g. CRT or P-adic lifting) allow some control on expression swell
176: with high performances (see~\cite{EGGSV06-1} and references herein).
177: While cryptographic applications need finite fields of large
178: cardinality for security purpose, most exact linear algebra restrains
179: to machine word size prime field (e.g. 32 or 64 bits) in order to
180: benefit from machine arithmetic units.
181: 
182: A classical way to perform one arithmetic operation in a prime field, here we refer to integers modulo a prime number,
183: is to first perform the operation on integers and second reduce the result to the destination field.
184: Let  $x,y\in \Zp$ and $* \in \{ +, \times \}$. One may compute $z=x*y\in\Zp$ by computing $t=x*y \in \Z$ and a modular reduction $z = t \mod p$.
185: 
186: %With fixed precision arithmetic, this approach may lead to overflows and erroneous results.
187: 
188: % In the rest of this section, we present  basic facts on word size prime field arithmetics and how to efficiently integrate them 
189: % into an exact linear algebra application: solutions of triangular systems.
190: 
191: % \subsection{Word-size prime field arithmetic}
192: 
193: When one deals with fixed precision prime field arithmetic, two majors
194: issues arise:\, performances and cardinality limitation. The latter
195: issue can have a non-negligible impact on the former one. As was just
196: said, the classical way to perform arithmetic operations over a prime
197: field is to perform operations on integers and reduce intermediate
198: results.  Therefore, all integers between $0$ and $(p-1)^2$ must be
199: representable to correctly perform multiplications over $\Zp$. This
200: limitation slightly increase to perform an \texttt{AXPY} operation (a
201: multiplication followed by an addition) with only one reduction step.
202: This implies that all integers between $0$ and $p\times(p-1)$ must be
203: representable.
204: 
205: 
206: % These bounds are reduced  by one bit with signed types such as \texttt{long}.
207: 
208: % Note that even on 32 bit architectures, one can obtain  64 bit integers by using \texttt{long long} data.
209: % Full  use of \texttt{long long} may reduce performances drastically as  the reduction phase involves a 64 bit division. 
210: % Integer multiplications have no impact on performances since most of 32 bit processors
211: % provide a  multiplier $32 \text{bits} \times 32 \text{bits} \rightarrow 64 \text{bits}$
212: % (see the \texttt{imul} instruction on Intel architectures \cite{Int99c}).
213: 
214: Using word-size machine integers and classic arithmetic we obtain the
215: following cardinality limitation: $p<2^{16}$ on 32 bit architectures
216: and $p<2^{32}$ on 64 bit architectures with {\tt unsigned} types.  An
217: alternative to increase cardinality of word-size prime fields is to
218: use floating point numbers.  According to the IEEE~754
219: standard~\cite{Gol91}, mantissas of double precision floating point
220: numbers can store 53 bit integers (including the implicit bit).
221: Therefore, we can perform prime field arithmetic with cardinality up
222: to $2^{26}$ using \texttt{double}. Note that, the reduction is easily
223: obtained by the \texttt{fmod} function available in standard
224: libraries.  This approach is quite interesting in practice since
225: floating point multiplications and divisions may be faster than their
226: integer counterparts.
227: % on most architecture  
228: % (Compaq Alpha, AMD, Pentium IV, Itanium, Sun Solaris, etc.) with the exception of the
229: % Pentium III  \cite[Table 3.1]{Defour:2003:these}.
230: 
231: % One can get even larger prime fields by mixing integers and floating
232: % point numbers.  The idea is then to use an approximation of the
233: % quotient with floating point and Barret's method to compute the
234: % remainder~\cite[chapter 14]{HandbookCrypto} as demonstrated in NTL
235: % library~\cite{Shoup:NTL}.  It leads to prime fields with cardinality
236: % up to $2^{29}$ on 32 bit architectures and up to $2^{52}$ on 64 bit
237: % architectures.
238: 
239: % Prime field implementations above benefit from the arithmetic units of
240: % processors but performances depend on the target architecture.  
241: 
242: On selected classes of algorithms, delayed prime field arithmetic
243: sustains better performances.  The idea is to perform several integer
244: operations before reduction into the field.  It has been very fruitful
245: for exact linear algebra~\cite{jgd:2004:issac}.  Delayed exact linear
246: algebra computations also benefit from optimized numerical BLAS (e.g.
247: ATLAS~\cite{Whaley:2001:AEO}, GOTO~\cite{2002:gotoblas}) libraries for
248: exact computations and they often reach peak FPU throughput for
249: operations over a finite field.
250: 
251: Beside basics linear algebra operations such as matrix-vector products
252: and matrix multiplications, delayed arithmetic over a prime field is
253: valuable when expressions swell largely such as solving systems of
254: linear equations.  This approach works perfectly for unitary
255: triangular system (only ones along the diagonal) despite the
256: exponential growth of the intermediate variables.
257: 
258: % In the following we show how this has been made possible and we provide an automatic proof of the concept
259: %  with a tool named Gappa.
260: 
261: \subsection{Formal proof checking and Gappa}
262: 
263: Gappa~\cite{DauMel04} has been created to generate formal certificates
264: of correction for programs that use floating point
265: arithmetic~\cite{DinLauMel06,MicTisVey06,MelPio07} and is related to
266: other developments~\cite{Har2Ka,DauMelMun05}. It is available from
267: \begin{center}
268:   \url{http://lipforge.ens-lyon.fr/www/gappa/}.
269: \end{center}
270: It will in the future be able to interact seamlessly with
271: Why~\cite{BolFil07}, a tool to certify programs written in a generic
272: language. C and Java can be converted to this language.
273: 
274: Gappa manipulates arithmetic expressions on real and rational numbers
275: and their evaluations on computers. Exact and rounded expressions are
276: bounded using interval arithmetic~\cite{JauKieDidWal01}, forward error
277: analysis and properties of dyadic fractions.  To the authors' best
278: knowledge, Gappa is the first tool that can convert some of the simple
279: tasks performed here into formal proofs validated by an automatic
280: proof checker.  Such goal has previously been quoted as {\em invisible
281:   formal methods}~\cite{TiwShaRus03} in the sense that Gappa delivers
282: formal certificates to users that are not expected to write any piece
283: of proof in any formal proof system.
284: 
285: % \subsection{Insight into the Coq generated script}
286: 
287: % \begin{lstlisting}[float,caption=Reordered excerpts of the 3548 line Coq script
288: %   generated for a small example ($4 \times 4$ matrices on
289: %   $\Z/101\Z$),label=lst/coq]
290: % Require Import Gappa_library.
291: % Section Generated_by_Gappa.
292: % Definition f1 := Float2 (25) (2).
293: % Definition f2 := Float2 (0) (0).
294: % Definition i3 := makepairF f2 f2.
295: % ...
296: % Notation _DG31_exact := ((_DG29 - _DG30)%R).
297: % Notation _DG31 := 
298: %   ((rounding_float roundNE (53) (1074)) (_DG31_exact)).
299: % Notation r70 := ((_DG31 - _DG31_exact)%R).
300: % Notation p134 := (BND r70 i3).
301: %   (* BND(DG31 - DG31_exact, [0, 0]) *)
302: % Notation p135 := (BND r70 i13).
303: %   (* BND(DG31 - DG31_exact, [-7.45058e-09, 7.45058e-09]) *)
304: % ...
305: % Lemma t696 : p135 -> p141 -> p134.
306: %  intros h0 h1.
307: %  apply bnd_of_bnd_fix with (1 := h0) (2 := h1) ; finalize.
308: % Qed.
309: % Lemma l134 : ... -> p92 -> p43 -> p18 -> p3 -> p134. 
310: %   (* BND(DG31 - DG31_exact, [-0, 0]) *)
311: %  intros h0 h1 h2 h3 h4 h5 h6 h7 h8.
312: %  assert (h9 : p135). apply l130. exact h0. exact h1...
313: %  assert (h10 : p141). apply l133.
314: %  apply t696. exact h9. exact h10.
315: % Qed.
316: % End Generated_by_Gappa.
317: % \end{lstlisting}
318: 
319: Gappa produces a Coq file for a given input script.  Users do not need
320: to be able to write the Coq file but they can check the work of Gappa
321: by reading it or parsing it automatically.  It contains {\tt
322:   Variable}s, {\tt Defini\-tion}s, {\tt Notation}s, {\tt Lemma}s and
323: comments are between {\tt (*} and {\tt *)} signs.
324: % The lines between the statement of a lemma and the {\tt Qed.}
325: % \TODO{Sens de la phrase suivante ?}  sign are proof scripts intended
326: % for Coq only.
327: Although enclosure is the only predicate available to users, Gappa
328: internally relies on more predicates to describe properties on
329: expressions.  All the properties of the input script are defined in
330: the Coq file.  Validity of proofs can automatically be checked by Coq.
331: More insights to Gappa are presented in~\cite{DauMel07}.
332: 
333: \section{A new lemma that might be used by Gappa for inductive linear
334:   bounds}
335: \label{sec/had}
336: 
337: \begin{figure}
338: \hrule
339: \smallskip
340: {\bf Input: } $A \in \Zp^{N \times N}$, $B \in \Zp^{N \times K}$.\\
341: {\bf Output: } $X \in \Zp^{N \times K}$ such that $AX=B$.
342: \smallskip
343: \hrule
344: \begin{quote}
345: {\bf if} N=1 {\bf then}\\
346: \hspace*{1cm} $ X:= A_{1,1}^{-1} \times B$.\\
347: {\bf else}
348: {\it (splitting matrices into $\lfloor \frac{N}{2} \rfloor$ and $\lceil \frac{N}{2} \rceil$ blocks) }\\[-.2cm]
349: \[
350: \begin{array}{cccc}
351: A & X & & B \\
352: \overbrace {\left[ \begin{array}{cc} A_1 & A_2 \\ & A_3 \end{array} \right] }&
353: \overbrace{\left[ \begin{array}{ccc} & X_1 & \\ & X_2 & \end{array} \right] }&
354: = &
355: \overbrace{\left[ \begin{array}{ccc} & B_1 & \\ & B_2 & \end{array} \right] }
356: \end{array}
357: \]
358: \hspace*{1cm}$X_2:=${\tt LZ\_TRSM($A_3,B_2$)}. \\
359: \hspace*{1cm}$B_1:= B_1 - A_2X_2$. \\
360: \hspace*{1cm}$X_1:=${\tt LZ\_TRSM($A_1,B_1$)}. \\
361: {\bf return } X.
362: \end{quote}
363: \hrule
364: \caption{First algorithm for {\tt LZ\_TRSM($A,B$)}}
365: \label{trsm}
366: \end{figure}
367: 
368: A key application in exact linear algebra is the resolution of
369: triangular systems over finite fields presented in Figure~\ref{trsm}.
370: A delayed prime field arithmetic version of this algorithm can be
371: constructed by simply doing a delayed matrix multiplication on the
372: operation \texttt{$B_1:= B_1 - A_2X_2$}. Listing~\ref{lst/mm} performs
373: such matrix multiplication \texttt{DGEMM\_NEG} with no reduction. We
374: used naming conventions of BLAS and LAPack for the function and the
375: parameter names. For the sake of simplicity some parameters have be
376: omitted and some function names were slightly modified.
377: 
378: The \texttt{DREMM} function computes the remainder modulo $p$ of all
379: the components of a matrix. We are overspecifying it for the purpose
380: of this presentation as our proof never use the exact value of these
381: remainders but the sole property that they are between $0$ and $p-1$.
382: In the definition of \texttt{is\_exact\_int\_mat}, we use the
383: predicate \texttt{exists}. The property described is that $X[i \times
384: LDX+j]$ is not any float, but it is an integer. To describe this, we
385: require (or prove) that there exists an integer equal to the
386: floating-point value of $X[i \times LDX+j]$. We also use \verb+\old+
387: in the annotations: the value \verb+\old(X)+ represents the value of
388: the variable {\tt X} before the function execution. It allows us to
389: specify the outputs depending on the inputs, even if the pointed
390: values are modified. Some ghost variables are introduced in our
391: example to ease the proofs. A mechanism not presented here prevents
392: such ghost variables to remain in the code once compiled.
393: 
394: \begin{lstlisting}[float,caption=Matrix-matrix multiplication Y <- Y - AX and component-wise remainder,label=lst/mm]
395: #include <math.h>
396: typedef double fp;
397: #define FP_ESPILON DBL_EPSILON
398: 
399: /*@ logic real epsilon() { 2^^(-53) } @*/
400: /*@ logic real max_int() { 2 / epsilon() } @*/
401: 
402: /*@ predicate is_exact_int_mat (fp *X, int LDX, int N, int M) {
403:   @   \valid_range(X,0,LDX*N) && M <= LDX &&
404:   @   \forall int i; \forall int j; 0 <= i < N && 0 <= j < M =>
405:   @      \round_error(X[i*LDX+j])==0 && \exists int v;  X[i*LDX+j]==v
406:   @ } @*/
407: 
408: /*@ predicate is_exact_int_mat_bounded_by
409:   @ (fp *X, int LDX, int N, int M, int min, int max) {
410:   @   is_exact_int_mat(X,LDX,N,M) && \forall int i; \forall int j;
411:   @     0 <= i < N && 0 <= j < M => min <= X[i*LDX+j] <= max
412:   @ } @*/
413: 
414: /*@ requires (p-1)*(p-1)*M <= max_int() && 
415:   @   is_exact_int_mat_bounded_by(Y,LDY,N,K,0,p-1) &&
416:   @   is_exact_int_mat_bounded_by(A,LDA,N,M,0,p-1) &&
417:   @   is_exact_int_mat_bounded_by(X,LDX,M,K,0,p-1)
418:   @ assigns Y[..]
419:   @ ensures 
420:   @   is_exact_int_mat_bounded_by(Y,LDY,N,K,(1-p)*(p-1)*M,p-1) @*/
421: void DGEMM_NEG (int N, int M, int K, int p,
422:                 fp *A, int LDA, fp *X, int LDX, fp *Y, int LDY) {
423:   int i, j, k; fp oYiK;
424:   for (i = 0; i < N; i++)
425:     for (j = 0 ; j < M; j++)
426:       for (k = 0; k < K; k++) {
427:         oYik       = Y[i*LDY+k];
428: 	Y[i*LDY+k] = Y[i*LDY+k] - A[i*LDA+j] * X[j*LDX+k];
429:         /*@ assert -1 <= Y[i * LDY + k] / ((p-1)*(p-1)*(j+1)) &&
430:                          Y[i * LDY + k] <= p-1                @*/
431:       }
432: }
433: 
434: /*@ requires is_exact_int_mat(X,LDX,N,K)
435:   @ assigns X[..]
436:   @ ensures is_exact_int_mat_bounded_by(X,LDX,N,K,0,p-1) &&
437:   @   \forall int i; \forall int j; 0 <= i < N && 0 <= j < K =>
438:   @     \exists int d; X[i*LDX+j] == \old(X[i*LDX+j]) + d*p @*/
439: void DREMM (int N, int K, int p, fp *X, int LDX) {
440:   int i, k;
441:   for (i = 0; i < N; i++) for (k = 0; k < K; k++) {
442:     X[i*LDX+k] = fmod (X[i*LDX+k], p);
443:     if (X[i*LDX+k] < 0) X[i*LDX+k] += p; 
444:   }
445: }
446: \end{lstlisting}
447: 
448: Let $M$ be the width of matrix $A_2$ or the height of vector $X_2$ and
449: {\tt FP\_EPSILON} be the machine $\epsilon$ for {\tt fp} in {\tt
450:   float}, {\tt double} or {\tt long double} and {\tt FP} in {\tt FLT},
451: {\tt DBL} or {\tt LDBL} respectively. The \texttt{DREMM} reduction can
452: be delayed until the end of \texttt{DGEMM\_NEG} provided
453: $$(p-1)^2\times M \le 2/\mathtt{FP\_EPSILON},$$
454: as all the number between 0 and $2/\mathtt{FP\_EPSILON}$ can be
455: represented exactly with type {\tt fp}.
456: 
457: Assertions on the \verb+DGEMM_NEG+ function generate proof obligation
458: $$
459: -1 \le \frac{ Y[i \times LDY + k]                                                 }{(p - 1)^2 \times (j + 1)}
460:      = \frac{oY i              k  - A[i \times LDA + j] \times X[j \times LDX + k]}{(p - 1)^2 \times (j + 1)}
461: $$
462: for all iterations defined by $i$, $j$ and $k$. It can be proved by
463: induction on $j$ with the following lemma that we proved in Coq and
464: similar ones for the other relations.
465: $$
466: \forall ~~ a, b, c, d, e \in \R ~~~ ; ~~~ e \le \frac{a}{b} ~~ \wedge ~~
467:                                           e \le \frac{c}{d} ~~ \wedge ~~
468:                                           0 <   b \times d
469:                 ~~~ \Longrightarrow  ~~~  e \le \frac{a+c}{b+d}
470: $$
471: 
472: Gappa does not handle arrays so we have to rename $A[i \times LDA +
473: j]$ to {\tt Aik}, $X[j \times LDX + k]$ to {\tt Xjk} and $Y[i \times
474: LDY+k]$ to {\tt Yik}.  The following Gappa text should be sufficient
475: to prove the generic case of the induction if our lemma is added to
476: Gappa.
477: 
478: \begin{lstlisting}
479: {
480:   p in [2,1b53]                                ->
481:   j in [1,1b53]                                ->
482:   oYik in [-1b53,1b53]                         ->
483:   Aij/(p-1) in [0,1]                           ->
484:   Xjk/(p-1) in [0,1]                           ->
485:   oYik / ((p-1)*(p-1)*j) >= -1                 ->
486:   ((p-1)*(p-1)*j) * ((p-1)*(p-1)) >= 0         /\
487:   -Aij*Xjk / ((p-1)*(p-1)) >= -1               /\
488:   (oYik - Aij*Xjk) / ((p-1)*(p-1)*(j+1)) >= -1 
489: }
490: -Aij*Xjk / ((p-1)*(p-1)) ->
491:   - (Aij/(p-1)) * (Xjk/(p-1)) { (p-1) <> 0 };
492: (oYik - Aij*Xjk) / ((p-1)*(p-1)*(j+1)) ->
493:   (oYik + (- Aij*Xjk)) / (((p-1)*(p-1)*j) + ((p-1)*(p-1)))
494:   { ((p-1)*(p-1)*(j+1)) <> 0 , ((p-1)*(p-1)*j) * ((p-1)*(p-1)) <> 0 };
495: \end{lstlisting}
496: 
497: A Gappa file usually starts with aliases.  Gappa uses them for its
498: outputs and in the formal proof instead of machine generated names.
499: We do not need any alias here.  Identifiers are assumed to be
500: universally quantified over the set of real numbers the first time
501: Gappa encounters them.
502: 
503: The main assertion is written between brackets~({\tt \{ \}}). The
504: hypotheses end with the last line finished by an implication sign
505: ({\tt ->}).  Each states that a variable or an expression is within an
506: interval or bounded. Note that $p1 \rightarrow p2 \rightarrow p3$ is
507: logically equivalent to $p1 \wedge p2 \rightarrow p3$.  The goal is
508: next.  It is a conjunction (\verb+/\+).  Statements about intermediate
509: variables are given as goals to force Gappa to establish them first.
510: 
511: The statement after the brackets are hints that propose replacements.
512: Gappa replaces the left side of the {\tt ->} sign by the right side as
513: soon as it encounters the former. It also tries to prove that the
514: replacements are valid. They help Gappa identify the proper theorems
515: syntactically. Conditions on the validity of the hints are expressed
516: between brackets.
517: 
518: \section{Variable interchange in a predicate}
519: \label{sec/xch}
520: 
521: \begin{lstlisting}[float,caption=Delayed solution of a unitary triangular system over a finite field,label=lst/lz]
522: /*@ predicate delay(int N, int p) @*/
523: /*@ logic int l_pmax(int n) @*/
524: 
525: // Floating point exact solution to a small unitary triangular system
526: /*@ requires N <= 54 &&  
527:   @   is_exact_int_mat_bounded_by(X,LDX,N,K,0,l_pmax(N)-1) &&
528:   @   is_exact_int_mat_bounded_by(A,LDA,N,N,0,l_pmax(N)-1)
529:   @ assigns X[..]
530:   @ ensures
531:   @   is_exact_int_mat_bounded_by(X,LDX,N,K,-max_int(),max_int()) @*/
532: void DTRSM (int N, int K, fp *A, int LDA, fp *X, int LDX) {
533:   int i, j, k;
534:   for (i = N-2; i >= 0; i--)
535:     for (j = i+1; j < N; j++)
536:       for (k = 0 ; k < K; k++)
537:         X[i*LDX+k] = X[i*LDX+k] - A[i*LDA+j] * X[j*LDX+k];
538: }
539: 
540: /*@ requires (p-1)*(p-1)*N <= max_int() &&
541:   @   is_exact_int_mat(A,LDA,N,N) && is_exact_int_mat(B,LDB,N,K)
542:   @ assigns B[..]
543:   @ ensures is_exact_int_mat(B,LDB,N,K) @*/
544: void LZ_TRSM (int N, int K, int Nmax, int p,
545: 	      fp *A, int LDA, fp *B, int LDB) {
546:   if (N <= Nmax) {
547:     /*@ assert N <= 54 && delay(N,p) @*/
548:     DTRSM (N, K, A, LDA, B, LDB); DREMM (N-1, K, p, B, LDB);
549:   } else {
550:     int P = N/2, G = N - P;
551:     LZ_TRSM (G, K, Nmax, p, A+P*(LDA+1), LDA, B+P*LDB, LDB);
552:     DGEMM_NEG (P, G, K, p, A+P, LDA, B+P*LDB, LDB, B, LDB);
553:     DREMM (P, K, p, B, LDB);
554:     LZ_TRSM (P, K, Nmax, p, A, LDA, B, LDB);
555:   }
556: }
557: 
558: 
559: /*@ ensures \forall int N; N <= \result => delay (N, p) @*/
560: int Nmax (int p) {
561:   fp pp = 1, p2 = 1; int N;
562:   for (N = 0; ((p-1)*(pp+p2))/2 < 2 / 2^53; N++)
563:     {pp *= p; p2 *= p-2;};
564:   return N;
565: }
566: 
567: /*@ ensures \result==l_pmax(N) &&
568:   @   \forall int p; p <= \result => delay (N, p) @*/
569: int pmax (int N) {
570:   int p; for (p = 1; N <= Nmax(p); p++);
571:   return p-1;
572: }
573: \end{lstlisting}
574: 
575: % Most questions raised before that point should not be new to automated
576: % reasoning. 
577: 
578: For the sake of completeness we recall an optimal bound on integer
579: coefficients growth during backward substitution.
580: 
581: \begin{cor}{\cite[corollary 3.3]{jgd:2004:issac}}
582:   Let $A\in\Z^{N \times N}$ be a unit diagonal upper triangular
583:   matrix, and $b\in\Z^N$, with $|A|,|B|\leq p-1$. Then $x\in\Z^N$ the
584:   solution of the system $Ax=B$ is such that
585:   $$ |x| \leq \frac{p-1}{2}\left[ p^{N-1} + (p-2)^{N-1}\right],$$%
586:   and this bound is optimal.
587: \end{cor}
588: 
589: As this formula also bounds all the intermediate values, enough bits
590: are available to combine a few recursion steps of \texttt{LZ\_TRSM}
591: without reduction and guarantee that all numerical results are exact.
592: We present in Listing~\ref{lst/lz} an improved version for the delayed
593: prime field arithmetic by replacing the last levels of the recursion
594: by calls to \texttt{DTRSM} numerical solver according to the above
595: corollary.  Recursion is stopped for the maximal integer $N_{\max}$
596: such that
597: \begin{equation}
598: \frac{p-1}{2}\left[ p^{N_{max}-1} + (p-2)^{N_{max}-1}\right] \le 2/\mathtt{FP\_EPSILON}.
599: \end{equation}
600: 
601: % In the rest of the paper, we will provide a way to automatically and formally proof this equation.
602: % Without loss of genericity, we will then focus on the case where the right hand side of the system is a vector (i.e. $K=1$ in \texttt{LZ\_TRSM}).
603: % As a matter of consistency with numerical BLAS convention, we will then refer to our algorithm with \texttt{LZ\_TRSV}.
604: 
605: Function {\tt Nmax} of Listing~\ref{lst/lz} uses a strict inequality
606: in equation (1) to control the loop because it computes the bound with
607: the same floating point format than the one used by \texttt{LZ\_TRSM}.
608: We focus now on proving that the \texttt{DTRSM} function invoked by
609: \texttt{LZ\_TRSM} never produce any round-off error.
610: 
611: One could port the proof of Corollary 3.3~\cite{jgd:2004:issac} to Coq
612: to finish the proof of correction. We decided to use Gappa and simple
613: techniques that could be made automatic. Syntactically, the
614: \verb+DTRSM+ function is invoked by \verb+LZ_TRSM+ only if the
615: condition $delay(N,p)$ is fulfilled. We may define it by:
616: 
617: $$delay(N,p) = N \le N_{\max}(p).$$
618: 
619: Gappa does not handle loops and branches. We perform a case analysis
620: but $p$ may vary from $2$ to $94,906,266$. On the other hand, {\tt N}
621: varies only between $2$ and $54$. Table~\ref{tab/pmax} presents the
622: value computed by the \verb+pmax+ function.  Variable interchange
623: should allow to prove that \verb+DTRSM+ is invoked only on the
624: condition
625: $$delay(N,p) \Longleftrightarrow p \le p_{\max}(N).$$
626: 
627: 
628: \begin{table}[t]
629:   \caption{Maximum value {\tt pmax} of parameter {\tt p} for each value of parameter {\tt N} allowed}
630:   \label{tab/pmax}
631:   %\vspace{-5mm}
632:   $$
633:   \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|} \hline
634:     N         & 2        & 3      & 4    & 5    & 6   & 7   & 8      & 9      & ~ 10 ~ & ~ 11 ~ & ~ 12 ~ \\ \hline
635:     p_{\max}  & 94906266 & 208064 & 9739 & 1553 & 457 & 191 & ~ 97 ~ & ~ 59 ~ & 39     & 29     & 19     \\ \hline
636:   \end{array}
637:   $$
638:   $$
639:   \begin{array}{|c|c|c|c|c|c|c|c|} \hline
640:     N        & ~ 13 ~ & ~ 14 ~ & ~ 15 ~ & ~ 16\cdots19 ~ & ~ 20\cdots23 ~ & ~ 24\cdots34 ~ & ~ 35\cdots54 ~ \\ \hline
641:     p_{\max} & 17     & 13     & 11     & 7              & 5              & 3              & 2              \\ \hline
642:   \end{array}
643:   $$
644: \end{table}
645: 
646: % \section{Massive loop unrolling and argument instantiation}
647: % \label{sec/loop}
648: 
649: % \begin{lstlisting}[float,caption=Solutions to a small unitary triangular system,label=lst/dtrsm]
650: % // Solutions to a small unitary triangular system
651: % /*@ requires N <= 54 && K <= LDX && 
652: %   @  \valid_range(X,0,LDX*N) &&
653: %   @  \valid_range(A,0,LDA*N) &&
654: %   @   is_integer_matrix_bounded_by(X,LDX,N,K,pmax(N)) &&
655: %   @   is_integer_matrix_bounded_by(A,LDA,N,N,pmax(N))
656: %   @ assigns X[..]
657: %   @ ensures
658: %   @   is_integer_matrix(X,LDX,N,K)
659: %   @*/
660: % void DTRSM (int N, int K, fp *A, int LDA, fp *X, int LDX) {
661: %   int i, j, k;
662: %   for (i = N-2; i >= 0; i--)
663: %     for (j = i+1; j < N; j++)
664: %       for (k = 0 ; k < K; k++)
665: %         X[i*LDX+k] = X[i*LDX+k] - A[i*LDA+j] * X[j*LDX+k];
666: % }
667: % \end{lstlisting}
668: 
669: 
670: % \subsection{A C++ trace class to generate input scripts to Gappa}
671: 
672: % \begin{lstlisting}[float,caption={A C++ trace class to instantiate some
673: %   parameters, implement single-assignment behavior and ask Gappa
674: %   to prove that no operation introduced any round-off
675: %   error},label=lst/object]
676: % class trace;
677: % ostream &operator<<(ostream &os, trace const &a);
678: 
679: % class trace {
680: % public :
681: %   static int lastid;
682: %   static ostringstream bout, mout, eout;
683: %   int id;
684: %   trace() {}
685: %   void init(int p) {
686: %     id = lastid++;
687: %     bout << *this 
688: % 	 << " = int<ne>(" << *this << "_dum);" << endl;
689: %     mout << *this
690: % 	 << " in [0, " << p - 1 << "] ->"      << endl;
691: %   }
692: %   trace(char * oper, trace const &a, trace const &b) {
693: %     id = lastid++; 
694: %     bout << *this << "    rnd= " 
695: %          << a     << oper  << b << ";"         << endl;
696: %     bout << *this << "_exact = " 
697: %          << a     << oper  << b << ";"         << endl;
698: %     eout << *this << " - " 
699: % 	 << *this << "_exact in [0, 0] /\\"    << endl;
700: %     eout << *this << " in ? /\\"               << endl;
701: %   }
702: %   void dump() {
703: %     cout << bout.str() << "{"                  << endl 
704: % 	 << mout.str()
705: % 	 << eout.str() 
706: % 	 << *this      << " in ?"              << endl
707: % 	               << "}"                  << endl;
708: %   }
709: %   void bmes (char * mes, int n = 0)
710: %   {bout << mes ; if (n) bout << n; bout << endl;}
711: % };
712: 
713: % ostream &operator<<(ostream &os,trace const &a)
714: %   {return os << "DG" << a.id;}
715: 
716: % trace operator+(trace const &a, trace const &b)
717: %   {return trace(" + ", a, b);}
718: % trace operator-(trace const &a, trace const &b)
719: %   {return trace(" - ", a, b);}
720: % trace operator*(trace const &a, trace const &b)
721: %   {return trace(" * ", a, b);}
722: 
723: % int trace::lastid = 0;
724: % ostringstream trace::bout, trace::mout, trace::eout;
725: % \end{lstlisting}
726: 
727: % \begin{lstlisting}[float,caption={Reformatted Gappa input script for a small example $(N = 4, p = 101)$},label=lst/gappa]
728: % @rnd = float< ieee_64, ne>;
729: % # Matrix
730: % DG0  = int<ne>(DG0_dum);  ... DG15 = int<ne>(DG15_dum);
731: % # Solution
732: % DG16 = int<ne>(DG16_dum); ... DG19 = int<ne>(DG19_dum);
733: % # DTRSM 4
734: % DG20 rnd= DG5 * DG19;     ... DG31 rnd= DG29 - DG30;
735: % DG20_ex = DG5 * DG19;     ... DG31_ex = DG29 - DG30;
736: % # DREMM 3
737: % DG32 = int<ne>(DG32_dum); ... DG34 = int<ne>(DG34_dum);
738: % {
739: % DG0 in [0, 100] ->        ... DG19 in [0, 100] ->
740: % DG32 in [0, 100] ->       ... DG34 in [0, 100] ->
741: % DG20-DG20_ex in [0, 0] /\ ... DG31-DG31_ex in [0, 0] /\
742: % DG0 in ? /\ DG20 in ? /\  ... DG31 in ?
743: % }
744: % \end{lstlisting}
745: 
746: % The first section defines aliases.  Gappa uses these aliases for its
747: % outputs and in the formal proof instead of machine generated names.
748: % It starts with the definitions of {\tt rnd} rounding operator.  It is
749: % a real function yielding rounded values according to the target data
750: % format ({\tt ieee\_64} - double precision in this case) and a
751: % predefined rounding mode ({\tt ne} - nearest with even tie breaking as
752: % specified in IEEE 754 standard in this case).
753: 
754: % Statements \verb+DGxxx_exact = expr;+ build the list of {\bf
755: %   intended} results and statements on \verb+DGxxx+ the list of results {\bf
756: %   actually} computed.  When a rounding operator appears left of the
757: % equal symbol (as used in \verb+DGxxx rnd= expr;+), all the arithmetic
758: % operations on the right side are rounded individually. We introduced
759: % dummy identifiers in aliases \verb+DGxxx = int<ne>(DGxxx_dum);+ to
760: % express that {\tt DGxxx} are integer numbers.
761: 
762: % The second and third sections are written together between
763: % brackets~({\tt \{ \}}) and in this work, it is a large implication
764: % ({\tt ->}) of a large conjunction (\verb+/\+) of interval enclosures
765: % of mathematical expressions.
766: 
767: % The second section contains a set of hypotheses each stating that a
768: % variable or an expression is within an interval. Note that $p1
769: % \rightarrow p2 \rightarrow p3$ is logically equivalent to $p1 \wedge p2
770: % \rightarrow p3$ and this section is built from the lines finished by an
771: % implication sign ({\tt ->}). Identifiers without definition, such as
772: % the {\tt DGxxx\_dum} variables used in the first section are assumed to
773: % be universally quantified over the set of real numbers the first time
774: % Gappa encounters them.
775: 
776: % The third section contains goals. It is a large conjunction
777: % (\verb+/\+). Statements \verb+DGxxx - DGxxx_exact in [0, 0]+ mean that
778: % we ask Gappa to prove that {\tt DGxxx} and {\tt DGxxx\_exact} are
779: % identical. Statements {\tt DGxxx in ?} are inserted for diagnostics as
780: % Gappa proposes enclosing intervals.
781: 
782: %\subsection{Application to delayed finite field arithmetic}
783: 
784: % The \verb+TEST+ function in Listing~\ref{lst/test} presents memory
785: % allocation and parts of the Coq script related to initial aliases 
786: % and some comments.
787: 
788: % \begin{lstlisting}[float,caption=Test,label=lst/test]
789: % void TEST (int n, int nmax, int p, int k) {
790: %   int i;
791: %   trace *A = new trace[n*n], *B = new trace[k*n];
792: %   A->bmes("@rnd = float< ieee_64, ne>;");
793: %   A->bmes("# Matrix");
794: %   for (i = 0; i < n*n; i++) {A[i].init(p);}
795: %   A->bmes("# Solution");
796: %   for (i = 0; i < k*n; i++) {B[i].init(p);}
797: %   LZ_TRSM (n, k, nmax, p, A, k, B, k);
798: %   A->dump();
799: %   delete []A; delete []B;
800: % }
801: % \end{lstlisting}
802: 
803: \begin{table}[t]
804:   \caption{Time to establish that no round-off occurred and generate a Coq proof script}
805:   \label{tab/time}
806:   %\vspace{-5mm}
807:   $$
808:   \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|} \hline
809:     \text{Script} & N               & 2     & 3     & 4     & 5     & 6     & 7     & 8     & 9     & 10    & 11    \\ \hline
810:     \text{no}     & \text{time} (s) & 00.02 & 00.06 & 00.10 & 00.19 & 00.29 & 00.43 & 00.57 & 00.74 & 00.93 & 01.15 \\ \cline{1-1}\cline{3-12}
811:     \text{Coq}    &                 & 00.02 & 00.06 & 00.13 & 00.22 & 00.35 & 00.50 & 00.87 & 01.35 & 01.84 & 02.43 \\ \hline
812:   \end{array}
813:   $$
814:   $$
815:   \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|} \hline
816:     \text{Script} & N               & 12    & 13    & 14    & 15    & 17    & 19    & 23    & 34     & 54    \\ \hline
817:     \text{no}     & \text{time} (s) & 01.38 & 01.63 & 01.91 & 02.90 & 02.90 & 03.73 & 05.93 & 13.71  & 38.32 \\ \cline{1-1}\cline{3-11}
818:     \text{Coq}    &                 & 03.22 & 03.77 & 05.26 & 07.13 & 12.22 & 19.50 & 38.13 & 421.4  & 5767  \\ \hline
819:   \end{array}
820:   $$
821: \end{table}
822: 
823: A C++ class produces a trace for {\tt N} between 2 and 54 where all
824: branches and control statements have been removed. Each trace contains
825: only floating point operations.  Table~\ref{tab/time} gives the time
826: needed by Gappa on a 1.86~GHz Quad-Core Intel Xeon Processor with
827: 2x4MB L2 cache and 2x1GB of memory, to produce each of the proofs that
828: no round-off error occurred for all the values of $N$ between $2$ and
829: $54$. That ends the proof of correction of the algorithm.
830: 
831: \section{Perspectives and concluding remarks}
832: \label{sec/conc}
833: 
834: This report presents a new use of Gappa based on proof obligations
835: generated from a trace of execution~\cite{MelPio07}.  It would enable
836: us to prove in Coq in the future that expression swell within the
837: studied algorithm of delayed finite field arithmetic does not
838: introduce round off errors \cite{jgd:2004:issac}. This full
839: certification will not be obtained by porting the initial proof to Coq
840: but by a case analysis on the 53 possible values of one argument $N$.
841: Pieces of the formal proof have been generated by Gappa for each
842: individual value of $N$.
843: 
844: % The other questions raised in this report are not related to floating
845: % arithmetic. Answering these questions involve techniques common to
846: % program analysis that are or should be available in tools that perform
847: % automatic validation of programs.
848: 
849: Our approach can be easily reproduced to other exact linear
850: applications over finite fields.  More precisely, the FFLAS-FFPACK
851: project has been successful on using delayed prime field arithmetic
852: for linear algebra applications.
853: \begin{center}
854:   \url{http://ljk.imag.fr/membres/Jean-Guillaume.Dumas/FFLAS/}
855: \end{center}
856: % We may start by the validation of symmetric representations of the
857: % prime field.
858: 
859: % Future work of FFLAS-FFPACK may also consider Bareiss algorithm.
860: 
861: % Gappa is also capable of validating bounds used for Winograd-Strassen
862: % matrix multiplication \cite{dumas-2006} by rewriting expressions. This
863: % feature not demonstrated here is useful in proving that
864: % Winograd-Strassen algorithm is correct.
865: % Gappa transfers the validation of additional formulas to a
866: % strategy of Coq that is able to validate it component by component in this case.
867: 
868: \bibliographystyle{abbrv}
869: \bibliography{daumas_ref,alternate}
870: %\bibliography{alias,perso,groupe,saao,these,livre,arith}
871: 
872: \end{document}
873: