0703:cs0703026/rnc.tex

1: % Last modified on Wed May 14 14:49:46 GMT     2008 by marc

2:

3: \documentclass[11pt,twoside]{article}

4: \usepackage{latex8}

5: %\usepackage{times}

6: \usepackage[dvips]{epsfig}

7: \usepackage[dvips]{graphicx}

8:

9: \usepackage[latin9]{inputenc} \usepackage[T1]{fontenc}

10: \usepackage[english]{babel}

11:

12: \usepackage{amsmath}

13: \usepackage{amsfonts}

14: \usepackage{amssymb,amsthm}

15: \usepackage{fancybox}

16: \usepackage{url}

17: %\usepackage{styles}

18: \usepackage{graphicx}

19: \usepackage{xypic}

20: \usepackage{multicol}

21: \usepackage{rotating}

22: %\RRIno{\usepackage{cite}}

23: \usepackage{listings}

24: \usepackage{algorithm,color}

25:

26: \pagestyle{empty}

27:

28: \newcommand{\TODO}[1]{\textcolor{red}{#1}}

29: \newcommand{\old}[1]{\mathrm{\backslash{}old}\left(#1\right)}

30: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

31:

32: \begin{document}

33:

34: \title{\bf Formal proof for delayed finite field arithmetic \\ using

35:   floating point operators\thanks{This work has been partially founded

36:     by PICS 2533 of the CNRS, project EVA-Flo of the ANR, project

37:     CerPAN of the ANR and PPF Sur�na.}}

38:

39:

40: \author{%

41:   Sylvie Boldo  \\ {\sc inria}  (Saclay - �le-de-France)               \\ sylvie.boldo@inria.fr    \and

42:   Marc Daumas   \\ {\sc eliaus} ({\sc ea 3679 upvd})                   \\ marc.daumas@ens-lyon.org \and

43:   Pascal Giorgi \\ {\sc lirmm}  ({\sc umr} 5506 {\sc cnrs}--{\sc um}2) \\ pascal.giorgi@lirmm.fr

44: }

45:

46: \maketitle \thispagestyle{empty}

47:

48: \begin{abstract}

49:   Formal proof checkers such as Coq are capable of validating proofs

50:   of correction of algorithms for finite field arithmetics but they

51:   require extensive training from potential users.  The delayed

52:   solution of a triangular system over a finite field mixes operations

53:   on integers and operations on floating point numbers. We focus in

54:   this report on verifying proof obligations that state that no round

55:   off error occurred on any of the floating point operations. We use a

56:   tool named Gappa that can be learned in a matter of minutes to

57:   generate proofs related to floating point arithmetic and hide

58:   technicalities of formal proof checkers. We found that three

59:   facilities are missing from existing tools.  The first one is the

60:   ability to use in Gappa new lemmas that cannot be easily expressed

61:   as rewriting rules. We coined the second one ``variable

62:   interchange'' as it would be required to validate loop interchanges.

63:   The third facility handles massive loop unrolling and argument

64:   instantiation by generating traces of execution for a large number

65:   of cases. We hope that these facilities may sometime in the future

66:   be integrated into mainstream code validation.

67: \end{abstract}

68:

69: \newcommand{\R}{\mathbb{R}}

70: \newcommand{\Z}{\mathbb{Z}}

71: \newcommand{\IF}{\mathbb{IF}}

72: \newcommand{\Zp}{{\Z/p\Z}}

73:

74: \lstset{basicstyle=\tt\small,xleftmargin=0cm,numberstyle=\tiny}

75:

76: \renewcommand{\thelstlisting}{\arabic{lstlisting}}

77:

78: \section{Introduction}

79:

80: Introducing a new algorithm is a difficult task. Authors have to

81: persuade readers that their algorithm is correct and efficient. Such

82: goals  are usually attained  by providing pen-and-paper proofs of

83: correction more or less interlaced with  the description of the algorithm.

84: Authors may also provide results of tests to guarantee correction

85: and efficiency on random  cases and on known or new hard

86: cases.  Alas, this process is known to fail on mundane  as well as notorious

87: occurrences~\cite{RusHen91}.

88:

89: Developing a proof of correction in a formal proof checker using

90: higher order logic such as Coq~\cite{BerCas04} would be a nice

91: alternative but such a task usually represents a large amount of work

92: outside the fields of expertise of most authors.

93:

94: The delayed solver studied here works on a $N \times N$ unitary

95: triangular matrix on $\Z / p \Z$ finite field. The key improvement of

96: this algorithm compared to state of the art lies in the fact that

97: delayed algorithms use floating point units to perform operations with

98: no rounding error and delay computations of remainders as much as

99: possible. Operations on floating point numbers are limited to three

100: functions. The other functions use combinatorial logic.

101:

102: The first function (\verb+DGEMM_NEG+) performs a naive matrix

103: multiplication and Gappa may soon be able to handle the proof

104: obligation generated by a tool such as the Why

105: platform~\cite{FilMar07} and the corresponding floating-point

106: annotations~\cite{BolFil07}.  The second function (\verb+DTRSM+) is

107: invoked only under the \verb+delay+ predicate.  This is enforced by

108: the condition on the induction of the invoking function

109: (\verb+LZ_TRSM+) for $N$ between 2 and 54 with $p$ prime varying

110: between $2$ and $94,906,266$. Variable interchange in the \verb+delay+

111: predicate allows to limit the proof to the 53 different values of $N$

112: where a naive user would consider the $94,906,265$ different values of

113: $p$.

114:

115: Proof obligations are usually derived from a static analysis of the

116: source code considered. Our work showed that generating proof

117: obligations from traces of execution after most parameters have been

118: instantiated may also be useful. We have set up a C++ class to provide

119: such proof obligations but we hope that such capability will be

120: provided by Why and similar tools in the future.

121:

122: We present some background information in the remaining of the

123: introduction. We continue with a new lemma that might be used by Gappa

124: for inductive linear bounds in Section~\ref{sec/had} and with our

125: prototyping variable interchange developments in

126: Sections~\ref{sec/xch}. We conclude this work in

127: Section~\ref{sec/conc}.

128:

129: \newcommand{\lCeil}{\left\lceil}

130: \newcommand{\rCeil}{\right\rceil}

131: \def\so{{O {\;\!\tilde{}}\,}}

132: \newcommand{\linbox}{{\sc LinBox}}

133: \newcommand{\lsp}{{{\tt LSP}}}

134: \newcommand{\lup}{{{\tt LUP}}}

135: \newcommand{\lud}{{{\tt LUdivine}}}

136: \newcommand{\lqup}{{{\tt LQUP}}}

137: \newcommand{\turbo}{{{\tt TURBO}}}

138: \newcommand{\fgemm}{{{\tt Fgemm}}}

139: \newcommand{\tu}{{{\tt TURBO}}}

140: \newcommand{\trsm}{{{\tt Trsm}}}

141: \newcommand{\ltrsm}{{{\tt ULeft-Trsm}}}

142: \newcommand{\ultrsm}{{{\tt ULeft-Trsm}}}

143: \newcommand{\lltrsm}{{{\tt LLeft-Trsm}}}

144: \newcommand{\urtrsm}{{{\tt URight-Trsm}}}

145: \newcommand{\lrtrsm}{{{\tt LRight-Trsm}}}

146: \newcommand{\dbl}{\texttt{double} }

147:

148: \newcommand{\di}{\displaystyle}

149: \newcommand{\GF}[1]{\ensuremath{\mathtt {GF}(#1)}}

150: \newcommand{\GO}{{\cal O}}

151: \newcommand{\pF}[1]{\leavevmode

152:         \kern.1em\raise.0ex \hbox{\Z}\kern-.1em /\kern-.15em\lower.3ex

153:          \hbox{#1}\mbox{\Z}}

154: \newcommand{\til}{\lower 2pt\hbox{\small${}^\sim$}}

155:

156: \newtheorem{thm}{Theorem}[section]

157: \newtheorem{cnj}[thm]{Conjecture}

158: \newtheorem{lem}[thm]{Lemma}

159: \newtheorem*{cor}{Corollary}

160: \newtheorem{prop}[thm]{Proposition}

161: \newtheorem{defi}[thm]{Definition}

162: \newtheorem{rem}[thm]{Remark}

163: \newtheorem{pty}[thm]{Property}

164:

165: % \input{02-finite-field.tex}

166:

167: \subsection{Finite field arithmetic and application to linear algebra}

168:

169: % \label{sec/pg}

170:

171: Finite field arithmetic plays a crucial role in nowadays applications.

172: One of the most extensively studied application of finite fields is

173: cryptography.  Another key application of finite field arithmetic

174: arises with exact linear algebra computation where modular techniques

175: (e.g. CRT or P-adic lifting) allow some control on expression swell

176: with high performances (see~\cite{EGGSV06-1} and references herein).

177: While cryptographic applications need finite fields of large

178: cardinality for security purpose, most exact linear algebra restrains

179: to machine word size prime field (e.g. 32 or 64 bits) in order to

180: benefit from machine arithmetic units.

181:

182: A classical way to perform one arithmetic operation in a prime field, here we refer to integers modulo a prime number,

183: is to first perform the operation on integers and second reduce the result to the destination field.

184: Let  $x,y\in \Zp$ and $* \in \{ +, \times \}$. One may compute $z=x*y\in\Zp$ by computing $t=x*y \in \Z$ and a modular reduction $z = t \mod p$.

185:

186: %With fixed precision arithmetic, this approach may lead to overflows and erroneous results.

187:

188: % In the rest of this section, we present  basic facts on word size prime field arithmetics and how to efficiently integrate them

189: % into an exact linear algebra application: solutions of triangular systems.

190:

191: % \subsection{Word-size prime field arithmetic}

192:

193: When one deals with fixed precision prime field arithmetic, two majors

194: issues arise:\, performances and cardinality limitation. The latter

195: issue can have a non-negligible impact on the former one. As was just

196: said, the classical way to perform arithmetic operations over a prime

197: field is to perform operations on integers and reduce intermediate

198: results.  Therefore, all integers between $0$ and $(p-1)^2$ must be

199: representable to correctly perform multiplications over $\Zp$. This

200: limitation slightly increase to perform an \texttt{AXPY} operation (a

201: multiplication followed by an addition) with only one reduction step.

202: This implies that all integers between $0$ and $p\times(p-1)$ must be

203: representable.

204:

205:

206: % These bounds are reduced  by one bit with signed types such as \texttt{long}.

207:

208: % Note that even on 32 bit architectures, one can obtain  64 bit integers by using \texttt{long long} data.

209: % Full  use of \texttt{long long} may reduce performances drastically as  the reduction phase involves a 64 bit division.

210: % Integer multiplications have no impact on performances since most of 32 bit processors

211: % provide a  multiplier $32 \text{bits} \times 32 \text{bits} \rightarrow 64 \text{bits}$

212: % (see the \texttt{imul} instruction on Intel architectures \cite{Int99c}).

213:

214: Using word-size machine integers and classic arithmetic we obtain the

215: following cardinality limitation: $p<2^{16}$ on 32 bit architectures

216: and $p<2^{32}$ on 64 bit architectures with {\tt unsigned} types.  An

217: alternative to increase cardinality of word-size prime fields is to

218: use floating point numbers.  According to the IEEE~754

219: standard~\cite{Gol91}, mantissas of double precision floating point

220: numbers can store 53 bit integers (including the implicit bit).

221: Therefore, we can perform prime field arithmetic with cardinality up

222: to $2^{26}$ using \texttt{double}. Note that, the reduction is easily

223: obtained by the \texttt{fmod} function available in standard

224: libraries.  This approach is quite interesting in practice since

225: floating point multiplications and divisions may be faster than their

226: integer counterparts.

227: % on most architecture

228: % (Compaq Alpha, AMD, Pentium IV, Itanium, Sun Solaris, etc.) with the exception of the

229: % Pentium III  \cite[Table 3.1]{Defour:2003:these}.

230:

231: % One can get even larger prime fields by mixing integers and floating

232: % point numbers.  The idea is then to use an approximation of the

233: % quotient with floating point and Barret's method to compute the

234: % remainder~\cite[chapter 14]{HandbookCrypto} as demonstrated in NTL

235: % library~\cite{Shoup:NTL}.  It leads to prime fields with cardinality

236: % up to $2^{29}$ on 32 bit architectures and up to $2^{52}$ on 64 bit

237: % architectures.

238:

239: % Prime field implementations above benefit from the arithmetic units of

240: % processors but performances depend on the target architecture.

241:

242: On selected classes of algorithms, delayed prime field arithmetic

243: sustains better performances.  The idea is to perform several integer

244: operations before reduction into the field.  It has been very fruitful

245: for exact linear algebra~\cite{jgd:2004:issac}.  Delayed exact linear

246: algebra computations also benefit from optimized numerical BLAS (e.g.

247: ATLAS~\cite{Whaley:2001:AEO}, GOTO~\cite{2002:gotoblas}) libraries for

248: exact computations and they often reach peak FPU throughput for

249: operations over a finite field.

250:

251: Beside basics linear algebra operations such as matrix-vector products

252: and matrix multiplications, delayed arithmetic over a prime field is

253: valuable when expressions swell largely such as solving systems of

254: linear equations.  This approach works perfectly for unitary

255: triangular system (only ones along the diagonal) despite the

256: exponential growth of the intermediate variables.

257:

258: % In the following we show how this has been made possible and we provide an automatic proof of the concept

259: %  with a tool named Gappa.

260:

261: \subsection{Formal proof checking and Gappa}

262:

263: Gappa~\cite{DauMel04} has been created to generate formal certificates

264: of correction for programs that use floating point

265: arithmetic~\cite{DinLauMel06,MicTisVey06,MelPio07} and is related to

266: other developments~\cite{Har2Ka,DauMelMun05}. It is available from

267: \begin{center}

268:   \url{http://lipforge.ens-lyon.fr/www/gappa/}.

269: \end{center}

270: It will in the future be able to interact seamlessly with

271: Why~\cite{BolFil07}, a tool to certify programs written in a generic

272: language. C and Java can be converted to this language.

273:

274: Gappa manipulates arithmetic expressions on real and rational numbers

275: and their evaluations on computers. Exact and rounded expressions are

276: bounded using interval arithmetic~\cite{JauKieDidWal01}, forward error

277: analysis and properties of dyadic fractions.  To the authors' best

278: knowledge, Gappa is the first tool that can convert some of the simple

279: tasks performed here into formal proofs validated by an automatic

280: proof checker.  Such goal has previously been quoted as {\em invisible

281:   formal methods}~\cite{TiwShaRus03} in the sense that Gappa delivers

282: formal certificates to users that are not expected to write any piece

283: of proof in any formal proof system.

284:

285: % \subsection{Insight into the Coq generated script}

286:

287: % \begin{lstlisting}[float,caption=Reordered excerpts of the 3548 line Coq script

288: %   generated for a small example ($4 \times 4$ matrices on

289: %   $\Z/101\Z$),label=lst/coq]

290: % Require Import Gappa_library.

291: % Section Generated_by_Gappa.

292: % Definition f1 := Float2 (25) (2).

293: % Definition f2 := Float2 (0) (0).

294: % Definition i3 := makepairF f2 f2.

295: % ...

296: % Notation _DG31_exact := ((_DG29 - _DG30)%R).

297: % Notation _DG31 :=

298: %   ((rounding_float roundNE (53) (1074)) (_DG31_exact)).

299: % Notation r70 := ((_DG31 - _DG31_exact)%R).

300: % Notation p134 := (BND r70 i3).

301: %   (* BND(DG31 - DG31_exact, [0, 0]) *)

302: % Notation p135 := (BND r70 i13).

303: %   (* BND(DG31 - DG31_exact, [-7.45058e-09, 7.45058e-09]) *)

304: % ...

305: % Lemma t696 : p135 -> p141 -> p134.

306: %  intros h0 h1.

307: %  apply bnd_of_bnd_fix with (1 := h0) (2 := h1) ; finalize.

308: % Qed.

309: % Lemma l134 : ... -> p92 -> p43 -> p18 -> p3 -> p134.

310: %   (* BND(DG31 - DG31_exact, [-0, 0]) *)

311: %  intros h0 h1 h2 h3 h4 h5 h6 h7 h8.

312: %  assert (h9 : p135). apply l130. exact h0. exact h1...

313: %  assert (h10 : p141). apply l133.

314: %  apply t696. exact h9. exact h10.

315: % Qed.

316: % End Generated_by_Gappa.

317: % \end{lstlisting}

318:

319: Gappa produces a Coq file for a given input script.  Users do not need

320: to be able to write the Coq file but they can check the work of Gappa

321: by reading it or parsing it automatically.  It contains {\tt

322:   Variable}s, {\tt Defini\-tion}s, {\tt Notation}s, {\tt Lemma}s and

323: comments are between {\tt (*} and {\tt *)} signs.

324: % The lines between the statement of a lemma and the {\tt Qed.}

325: % \TODO{Sens de la phrase suivante ?}  sign are proof scripts intended

326: % for Coq only.

327: Although enclosure is the only predicate available to users, Gappa

328: internally relies on more predicates to describe properties on

329: expressions.  All the properties of the input script are defined in

330: the Coq file.  Validity of proofs can automatically be checked by Coq.

331: More insights to Gappa are presented in~\cite{DauMel07}.

332:

333: \section{A new lemma that might be used by Gappa for inductive linear

334:   bounds}

335: \label{sec/had}

336:

337: \begin{figure}

338: \hrule

339: \smallskip

340: {\bf Input: } $A \in \Zp^{N \times N}$, $B \in \Zp^{N \times K}$.\\

341: {\bf Output: } $X \in \Zp^{N \times K}$ such that $AX=B$.

342: \smallskip

343: \hrule

344: \begin{quote}

345: {\bf if} N=1 {\bf then}\\

346: \hspace*{1cm} $ X:= A_{1,1}^{-1} \times B$.\\

347: {\bf else}

348: {\it (splitting matrices into $\lfloor \frac{N}{2} \rfloor$ and $\lceil \frac{N}{2} \rceil$ blocks) }\\[-.2cm]

349: \[

350: \begin{array}{cccc}

351: A & X & & B \\

352: \overbrace {\left[ \begin{array}{cc} A_1 & A_2 \\ & A_3 \end{array} \right] }&

353: \overbrace{\left[ \begin{array}{ccc} & X_1 & \\ & X_2 & \end{array} \right] }&

354: = &

355: \overbrace{\left[ \begin{array}{ccc} & B_1 & \\ & B_2 & \end{array} \right] }

356: \end{array}

357: \]

358: \hspace*{1cm}$X_2:=${\tt LZ\_TRSM($A_3,B_2$)}. \\

359: \hspace*{1cm}$B_1:= B_1 - A_2X_2$. \\

360: \hspace*{1cm}$X_1:=${\tt LZ\_TRSM($A_1,B_1$)}. \\

361: {\bf return } X.

362: \end{quote}

363: \hrule

364: \caption{First algorithm for {\tt LZ\_TRSM($A,B$)}}

365: \label{trsm}

366: \end{figure}

367:

368: A key application in exact linear algebra is the resolution of

369: triangular systems over finite fields presented in Figure~\ref{trsm}.

370: A delayed prime field arithmetic version of this algorithm can be

371: constructed by simply doing a delayed matrix multiplication on the

372: operation \texttt{$B_1:= B_1 - A_2X_2$}. Listing~\ref{lst/mm} performs

373: such matrix multiplication \texttt{DGEMM\_NEG} with no reduction. We

374: used naming conventions of BLAS and LAPack for the function and the

375: parameter names. For the sake of simplicity some parameters have be

376: omitted and some function names were slightly modified.

377:

378: The \texttt{DREMM} function computes the remainder modulo $p$ of all

379: the components of a matrix. We are overspecifying it for the purpose

380: of this presentation as our proof never use the exact value of these

381: remainders but the sole property that they are between $0$ and $p-1$.

382: In the definition of \texttt{is\_exact\_int\_mat}, we use the

383: predicate \texttt{exists}. The property described is that $X[i \times

384: LDX+j]$ is not any float, but it is an integer. To describe this, we

385: require (or prove) that there exists an integer equal to the

386: floating-point value of $X[i \times LDX+j]$. We also use \verb+\old+

387: in the annotations: the value \verb+\old(X)+ represents the value of

388: the variable {\tt X} before the function execution. It allows us to

389: specify the outputs depending on the inputs, even if the pointed

390: values are modified. Some ghost variables are introduced in our

391: example to ease the proofs. A mechanism not presented here prevents

392: such ghost variables to remain in the code once compiled.

393:

394: \begin{lstlisting}[float,caption=Matrix-matrix multiplication Y <- Y - AX and component-wise remainder,label=lst/mm]

395: #include <math.h>

396: typedef double fp;

397: #define FP_ESPILON DBL_EPSILON

398:

399: /*@ logic real epsilon() { 2^^(-53) } @*/

400: /*@ logic real max_int() { 2 / epsilon() } @*/

401:

402: /*@ predicate is_exact_int_mat (fp *X, int LDX, int N, int M) {

403:   @   \valid_range(X,0,LDX*N) && M <= LDX &&

404:   @   \forall int i; \forall int j; 0 <= i < N && 0 <= j < M =>

405:   @      \round_error(X[i*LDX+j])==0 && \exists int v;  X[i*LDX+j]==v

406:   @ } @*/

407:

408: /*@ predicate is_exact_int_mat_bounded_by

409:   @ (fp *X, int LDX, int N, int M, int min, int max) {

410:   @   is_exact_int_mat(X,LDX,N,M) && \forall int i; \forall int j;

411:   @     0 <= i < N && 0 <= j < M => min <= X[i*LDX+j] <= max

412:   @ } @*/

413:

414: /*@ requires (p-1)*(p-1)*M <= max_int() &&

415:   @   is_exact_int_mat_bounded_by(Y,LDY,N,K,0,p-1) &&

416:   @   is_exact_int_mat_bounded_by(A,LDA,N,M,0,p-1) &&

417:   @   is_exact_int_mat_bounded_by(X,LDX,M,K,0,p-1)

418:   @ assigns Y[..]

419:   @ ensures

420:   @   is_exact_int_mat_bounded_by(Y,LDY,N,K,(1-p)*(p-1)*M,p-1) @*/

421: void DGEMM_NEG (int N, int M, int K, int p,

422:                 fp *A, int LDA, fp *X, int LDX, fp *Y, int LDY) {

423:   int i, j, k; fp oYiK;

424:   for (i = 0; i < N; i++)

425:     for (j = 0 ; j < M; j++)

426:       for (k = 0; k < K; k++) {

427:         oYik       = Y[i*LDY+k];

428: 	Y[i*LDY+k] = Y[i*LDY+k] - A[i*LDA+j] * X[j*LDX+k];

429:         /*@ assert -1 <= Y[i * LDY + k] / ((p-1)*(p-1)*(j+1)) &&

430:                          Y[i * LDY + k] <= p-1                @*/

431:       }

432: }

433:

434: /*@ requires is_exact_int_mat(X,LDX,N,K)

435:   @ assigns X[..]

436:   @ ensures is_exact_int_mat_bounded_by(X,LDX,N,K,0,p-1) &&

437:   @   \forall int i; \forall int j; 0 <= i < N && 0 <= j < K =>

438:   @     \exists int d; X[i*LDX+j] == \old(X[i*LDX+j]) + d*p @*/

439: void DREMM (int N, int K, int p, fp *X, int LDX) {

440:   int i, k;

441:   for (i = 0; i < N; i++) for (k = 0; k < K; k++) {

442:     X[i*LDX+k] = fmod (X[i*LDX+k], p);

443:     if (X[i*LDX+k] < 0) X[i*LDX+k] += p;

444:   }

445: }

446: \end{lstlisting}

447:

448: Let $M$ be the width of matrix $A_2$ or the height of vector $X_2$ and

449: {\tt FP\_EPSILON} be the machine $\epsilon$ for {\tt fp} in {\tt

450:   float}, {\tt double} or {\tt long double} and {\tt FP} in {\tt FLT},

451: {\tt DBL} or {\tt LDBL} respectively. The \texttt{DREMM} reduction can

452: be delayed until the end of \texttt{DGEMM\_NEG} provided

453: $$(p-1)^2\times M \le 2/\mathtt{FP\_EPSILON},$$

454: as all the number between 0 and $2/\mathtt{FP\_EPSILON}$ can be

455: represented exactly with type {\tt fp}.

456:

457: Assertions on the \verb+DGEMM_NEG+ function generate proof obligation

458: $$

459: -1 \le \frac{ Y[i \times LDY + k]                                                 }{(p - 1)^2 \times (j + 1)}

460:      = \frac{oY i              k  - A[i \times LDA + j] \times X[j \times LDX + k]}{(p - 1)^2 \times (j + 1)}

461: $$

462: for all iterations defined by $i$, $j$ and $k$. It can be proved by

463: induction on $j$ with the following lemma that we proved in Coq and

464: similar ones for the other relations.

465: $$

466: \forall ~~ a, b, c, d, e \in \R ~~~ ; ~~~ e \le \frac{a}{b} ~~ \wedge ~~

467:                                           e \le \frac{c}{d} ~~ \wedge ~~

468:                                           0 <   b \times d

469:                 ~~~ \Longrightarrow  ~~~  e \le \frac{a+c}{b+d}

470: $$

471:

472: Gappa does not handle arrays so we have to rename $A[i \times LDA +

473: j]$ to {\tt Aik}, $X[j \times LDX + k]$ to {\tt Xjk} and $Y[i \times

474: LDY+k]$ to {\tt Yik}.  The following Gappa text should be sufficient

475: to prove the generic case of the induction if our lemma is added to

476: Gappa.

477:

478: \begin{lstlisting}

479: {

480:   p in [2,1b53]                                ->

481:   j in [1,1b53]                                ->

482:   oYik in [-1b53,1b53]                         ->

483:   Aij/(p-1) in [0,1]                           ->

484:   Xjk/(p-1) in [0,1]                           ->

485:   oYik / ((p-1)*(p-1)*j) >= -1                 ->

486:   ((p-1)*(p-1)*j) * ((p-1)*(p-1)) >= 0         /\

487:   -Aij*Xjk / ((p-1)*(p-1)) >= -1               /\

488:   (oYik - Aij*Xjk) / ((p-1)*(p-1)*(j+1)) >= -1

489: }

490: -Aij*Xjk / ((p-1)*(p-1)) ->

491:   - (Aij/(p-1)) * (Xjk/(p-1)) { (p-1) <> 0 };

492: (oYik - Aij*Xjk) / ((p-1)*(p-1)*(j+1)) ->

493:   (oYik + (- Aij*Xjk)) / (((p-1)*(p-1)*j) + ((p-1)*(p-1)))

494:   { ((p-1)*(p-1)*(j+1)) <> 0 , ((p-1)*(p-1)*j) * ((p-1)*(p-1)) <> 0 };

495: \end{lstlisting}

496:

497: A Gappa file usually starts with aliases.  Gappa uses them for its

498: outputs and in the formal proof instead of machine generated names.

499: We do not need any alias here.  Identifiers are assumed to be

500: universally quantified over the set of real numbers the first time

501: Gappa encounters them.

502:

503: The main assertion is written between brackets~({\tt \{ \}}). The

504: hypotheses end with the last line finished by an implication sign

505: ({\tt ->}).  Each states that a variable or an expression is within an

506: interval or bounded. Note that $p1 \rightarrow p2 \rightarrow p3$ is

507: logically equivalent to $p1 \wedge p2 \rightarrow p3$.  The goal is

508: next.  It is a conjunction (\verb+/\+).  Statements about intermediate

509: variables are given as goals to force Gappa to establish them first.

510:

511: The statement after the brackets are hints that propose replacements.

512: Gappa replaces the left side of the {\tt ->} sign by the right side as

513: soon as it encounters the former. It also tries to prove that the

514: replacements are valid. They help Gappa identify the proper theorems

515: syntactically. Conditions on the validity of the hints are expressed

516: between brackets.

517:

518: \section{Variable interchange in a predicate}

519: \label{sec/xch}

520:

521: \begin{lstlisting}[float,caption=Delayed solution of a unitary triangular system over a finite field,label=lst/lz]

522: /*@ predicate delay(int N, int p) @*/

523: /*@ logic int l_pmax(int n) @*/

524:

525: // Floating point exact solution to a small unitary triangular system

526: /*@ requires N <= 54 &&

527:   @   is_exact_int_mat_bounded_by(X,LDX,N,K,0,l_pmax(N)-1) &&

528:   @   is_exact_int_mat_bounded_by(A,LDA,N,N,0,l_pmax(N)-1)

529:   @ assigns X[..]

530:   @ ensures

531:   @   is_exact_int_mat_bounded_by(X,LDX,N,K,-max_int(),max_int()) @*/

532: void DTRSM (int N, int K, fp *A, int LDA, fp *X, int LDX) {

533:   int i, j, k;

534:   for (i = N-2; i >= 0; i--)

535:     for (j = i+1; j < N; j++)

536:       for (k = 0 ; k < K; k++)

537:         X[i*LDX+k] = X[i*LDX+k] - A[i*LDA+j] * X[j*LDX+k];

538: }

539:

540: /*@ requires (p-1)*(p-1)*N <= max_int() &&

541:   @   is_exact_int_mat(A,LDA,N,N) && is_exact_int_mat(B,LDB,N,K)

542:   @ assigns B[..]

543:   @ ensures is_exact_int_mat(B,LDB,N,K) @*/

544: void LZ_TRSM (int N, int K, int Nmax, int p,

545: 	      fp *A, int LDA, fp *B, int LDB) {

546:   if (N <= Nmax) {

547:     /*@ assert N <= 54 && delay(N,p) @*/

548:     DTRSM (N, K, A, LDA, B, LDB); DREMM (N-1, K, p, B, LDB);

549:   } else {

550:     int P = N/2, G = N - P;

551:     LZ_TRSM (G, K, Nmax, p, A+P*(LDA+1), LDA, B+P*LDB, LDB);

552:     DGEMM_NEG (P, G, K, p, A+P, LDA, B+P*LDB, LDB, B, LDB);

553:     DREMM (P, K, p, B, LDB);

554:     LZ_TRSM (P, K, Nmax, p, A, LDA, B, LDB);

555:   }

556: }

557:

558:

559: /*@ ensures \forall int N; N <= \result => delay (N, p) @*/

560: int Nmax (int p) {

561:   fp pp = 1, p2 = 1; int N;

562:   for (N = 0; ((p-1)*(pp+p2))/2 < 2 / 2^53; N++)

563:     {pp *= p; p2 *= p-2;};

564:   return N;

565: }

566:

567: /*@ ensures \result==l_pmax(N) &&

568:   @   \forall int p; p <= \result => delay (N, p) @*/

569: int pmax (int N) {

570:   int p; for (p = 1; N <= Nmax(p); p++);

571:   return p-1;

572: }

573: \end{lstlisting}

574:

575: % Most questions raised before that point should not be new to automated

576: % reasoning.

577:

578: For the sake of completeness we recall an optimal bound on integer

579: coefficients growth during backward substitution.

580:

581: \begin{cor}{\cite[corollary 3.3]{jgd:2004:issac}}

582:   Let $A\in\Z^{N \times N}$ be a unit diagonal upper triangular

583:   matrix, and $b\in\Z^N$, with $|A|,|B|\leq p-1$. Then $x\in\Z^N$ the

584:   solution of the system $Ax=B$ is such that

585:   $$ |x| \leq \frac{p-1}{2}\left[ p^{N-1} + (p-2)^{N-1}\right],$$%

586:   and this bound is optimal.

587: \end{cor}

588:

589: As this formula also bounds all the intermediate values, enough bits

590: are available to combine a few recursion steps of \texttt{LZ\_TRSM}

591: without reduction and guarantee that all numerical results are exact.

592: We present in Listing~\ref{lst/lz} an improved version for the delayed

593: prime field arithmetic by replacing the last levels of the recursion

594: by calls to \texttt{DTRSM} numerical solver according to the above

595: corollary.  Recursion is stopped for the maximal integer $N_{\max}$

596: such that

597: \begin{equation}

598: \frac{p-1}{2}\left[ p^{N_{max}-1} + (p-2)^{N_{max}-1}\right] \le 2/\mathtt{FP\_EPSILON}.

599: \end{equation}

600:

601: % In the rest of the paper, we will provide a way to automatically and formally proof this equation.

602: % Without loss of genericity, we will then focus on the case where the right hand side of the system is a vector (i.e. $K=1$ in \texttt{LZ\_TRSM}).

603: % As a matter of consistency with numerical BLAS convention, we will then refer to our algorithm with \texttt{LZ\_TRSV}.

604:

605: Function {\tt Nmax} of Listing~\ref{lst/lz} uses a strict inequality

606: in equation (1) to control the loop because it computes the bound with

607: the same floating point format than the one used by \texttt{LZ\_TRSM}.

608: We focus now on proving that the \texttt{DTRSM} function invoked by

609: \texttt{LZ\_TRSM} never produce any round-off error.

610:

611: One could port the proof of Corollary 3.3~\cite{jgd:2004:issac} to Coq

612: to finish the proof of correction. We decided to use Gappa and simple

613: techniques that could be made automatic. Syntactically, the

614: \verb+DTRSM+ function is invoked by \verb+LZ_TRSM+ only if the

615: condition $delay(N,p)$ is fulfilled. We may define it by:

616:

617: $$delay(N,p) = N \le N_{\max}(p).$$

618:

619: Gappa does not handle loops and branches. We perform a case analysis

620: but $p$ may vary from $2$ to $94,906,266$. On the other hand, {\tt N}

621: varies only between $2$ and $54$. Table~\ref{tab/pmax} presents the

622: value computed by the \verb+pmax+ function.  Variable interchange

623: should allow to prove that \verb+DTRSM+ is invoked only on the

624: condition

625: $$delay(N,p) \Longleftrightarrow p \le p_{\max}(N).$$

626:

627:

628: \begin{table}[t]

629:   \caption{Maximum value {\tt pmax} of parameter {\tt p} for each value of parameter {\tt N} allowed}

630:   \label{tab/pmax}

631:   %\vspace{-5mm}

632:   $$

633:   \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|} \hline

634:     N         & 2        & 3      & 4    & 5    & 6   & 7   & 8      & 9      & ~ 10 ~ & ~ 11 ~ & ~ 12 ~ \\ \hline

635:     p_{\max}  & 94906266 & 208064 & 9739 & 1553 & 457 & 191 & ~ 97 ~ & ~ 59 ~ & 39     & 29     & 19     \\ \hline

636:   \end{array}

637:   $$

638:   $$

639:   \begin{array}{|c|c|c|c|c|c|c|c|} \hline

640:     N        & ~ 13 ~ & ~ 14 ~ & ~ 15 ~ & ~ 16\cdots19 ~ & ~ 20\cdots23 ~ & ~ 24\cdots34 ~ & ~ 35\cdots54 ~ \\ \hline

641:     p_{\max} & 17     & 13     & 11     & 7              & 5              & 3              & 2              \\ \hline

642:   \end{array}

643:   $$

644: \end{table}

645:

646: % \section{Massive loop unrolling and argument instantiation}

647: % \label{sec/loop}

648:

649: % \begin{lstlisting}[float,caption=Solutions to a small unitary triangular system,label=lst/dtrsm]

650: % // Solutions to a small unitary triangular system

651: % /*@ requires N <= 54 && K <= LDX &&

652: %   @  \valid_range(X,0,LDX*N) &&

653: %   @  \valid_range(A,0,LDA*N) &&

654: %   @   is_integer_matrix_bounded_by(X,LDX,N,K,pmax(N)) &&

655: %   @   is_integer_matrix_bounded_by(A,LDA,N,N,pmax(N))

656: %   @ assigns X[..]

657: %   @ ensures

658: %   @   is_integer_matrix(X,LDX,N,K)

659: %   @*/

660: % void DTRSM (int N, int K, fp *A, int LDA, fp *X, int LDX) {

661: %   int i, j, k;

662: %   for (i = N-2; i >= 0; i--)

663: %     for (j = i+1; j < N; j++)

664: %       for (k = 0 ; k < K; k++)

665: %         X[i*LDX+k] = X[i*LDX+k] - A[i*LDA+j] * X[j*LDX+k];

666: % }

667: % \end{lstlisting}

668:

669:

670: % \subsection{A C++ trace class to generate input scripts to Gappa}

671:

672: % \begin{lstlisting}[float,caption={A C++ trace class to instantiate some

673: %   parameters, implement single-assignment behavior and ask Gappa

674: %   to prove that no operation introduced any round-off

675: %   error},label=lst/object]

676: % class trace;

677: % ostream &operator<<(ostream &os, trace const &a);

678:

679: % class trace {

680: % public :

681: %   static int lastid;

682: %   static ostringstream bout, mout, eout;

683: %   int id;

684: %   trace() {}

685: %   void init(int p) {

686: %     id = lastid++;

687: %     bout << *this

688: % 	 << " = int<ne>(" << *this << "_dum);" << endl;

689: %     mout << *this

690: % 	 << " in [0, " << p - 1 << "] ->"      << endl;

691: %   }

692: %   trace(char * oper, trace const &a, trace const &b) {

693: %     id = lastid++;

694: %     bout << *this << "    rnd= "

695: %          << a     << oper  << b << ";"         << endl;

696: %     bout << *this << "_exact = "

697: %          << a     << oper  << b << ";"         << endl;

698: %     eout << *this << " - "

699: % 	 << *this << "_exact in [0, 0] /\\"    << endl;

700: %     eout << *this << " in ? /\\"               << endl;

701: %   }

702: %   void dump() {

703: %     cout << bout.str() << "{"                  << endl

704: % 	 << mout.str()

705: % 	 << eout.str()

706: % 	 << *this      << " in ?"              << endl

707: % 	               << "}"                  << endl;

708: %   }

709: %   void bmes (char * mes, int n = 0)

710: %   {bout << mes ; if (n) bout << n; bout << endl;}

711: % };

712:

713: % ostream &operator<<(ostream &os,trace const &a)

714: %   {return os << "DG" << a.id;}

715:

716: % trace operator+(trace const &a, trace const &b)

717: %   {return trace(" + ", a, b);}

718: % trace operator-(trace const &a, trace const &b)

719: %   {return trace(" - ", a, b);}

720: % trace operator*(trace const &a, trace const &b)

721: %   {return trace(" * ", a, b);}

722:

723: % int trace::lastid = 0;

724: % ostringstream trace::bout, trace::mout, trace::eout;

725: % \end{lstlisting}

726:

727: % \begin{lstlisting}[float,caption={Reformatted Gappa input script for a small example $(N = 4, p = 101)$},label=lst/gappa]

728: % @rnd = float< ieee_64, ne>;

729: % # Matrix

730: % DG0  = int<ne>(DG0_dum);  ... DG15 = int<ne>(DG15_dum);

731: % # Solution

732: % DG16 = int<ne>(DG16_dum); ... DG19 = int<ne>(DG19_dum);

733: % # DTRSM 4

734: % DG20 rnd= DG5 * DG19;     ... DG31 rnd= DG29 - DG30;

735: % DG20_ex = DG5 * DG19;     ... DG31_ex = DG29 - DG30;

736: % # DREMM 3

737: % DG32 = int<ne>(DG32_dum); ... DG34 = int<ne>(DG34_dum);

738: % {

739: % DG0 in [0, 100] ->        ... DG19 in [0, 100] ->

740: % DG32 in [0, 100] ->       ... DG34 in [0, 100] ->

741: % DG20-DG20_ex in [0, 0] /\ ... DG31-DG31_ex in [0, 0] /\

742: % DG0 in ? /\ DG20 in ? /\  ... DG31 in ?

743: % }

744: % \end{lstlisting}

745:

746: % The first section defines aliases.  Gappa uses these aliases for its

747: % outputs and in the formal proof instead of machine generated names.

748: % It starts with the definitions of {\tt rnd} rounding operator.  It is

749: % a real function yielding rounded values according to the target data

750: % format ({\tt ieee\_64} - double precision in this case) and a

751: % predefined rounding mode ({\tt ne} - nearest with even tie breaking as

752: % specified in IEEE 754 standard in this case).

753:

754: % Statements \verb+DGxxx_exact = expr;+ build the list of {\bf

755: %   intended} results and statements on \verb+DGxxx+ the list of results {\bf

756: %   actually} computed.  When a rounding operator appears left of the

757: % equal symbol (as used in \verb+DGxxx rnd= expr;+), all the arithmetic

758: % operations on the right side are rounded individually. We introduced

759: % dummy identifiers in aliases \verb+DGxxx = int<ne>(DGxxx_dum);+ to

760: % express that {\tt DGxxx} are integer numbers.

761:

762: % The second and third sections are written together between

763: % brackets~({\tt \{ \}}) and in this work, it is a large implication

764: % ({\tt ->}) of a large conjunction (\verb+/\+) of interval enclosures

765: % of mathematical expressions.

766:

767: % The second section contains a set of hypotheses each stating that a

768: % variable or an expression is within an interval. Note that $p1

769: % \rightarrow p2 \rightarrow p3$ is logically equivalent to $p1 \wedge p2

770: % \rightarrow p3$ and this section is built from the lines finished by an

771: % implication sign ({\tt ->}). Identifiers without definition, such as

772: % the {\tt DGxxx\_dum} variables used in the first section are assumed to

773: % be universally quantified over the set of real numbers the first time

774: % Gappa encounters them.

775:

776: % The third section contains goals. It is a large conjunction

777: % (\verb+/\+). Statements \verb+DGxxx - DGxxx_exact in [0, 0]+ mean that

778: % we ask Gappa to prove that {\tt DGxxx} and {\tt DGxxx\_exact} are

779: % identical. Statements {\tt DGxxx in ?} are inserted for diagnostics as

780: % Gappa proposes enclosing intervals.

781:

782: %\subsection{Application to delayed finite field arithmetic}

783:

784: % The \verb+TEST+ function in Listing~\ref{lst/test} presents memory

785: % allocation and parts of the Coq script related to initial aliases

786: % and some comments.

787:

788: % \begin{lstlisting}[float,caption=Test,label=lst/test]

789: % void TEST (int n, int nmax, int p, int k) {

790: %   int i;

791: %   trace *A = new trace[n*n], *B = new trace[k*n];

792: %   A->bmes("@rnd = float< ieee_64, ne>;");

793: %   A->bmes("# Matrix");

794: %   for (i = 0; i < n*n; i++) {A[i].init(p);}

795: %   A->bmes("# Solution");

796: %   for (i = 0; i < k*n; i++) {B[i].init(p);}

797: %   LZ_TRSM (n, k, nmax, p, A, k, B, k);

798: %   A->dump();

799: %   delete []A; delete []B;

800: % }

801: % \end{lstlisting}

802:

803: \begin{table}[t]

804:   \caption{Time to establish that no round-off occurred and generate a Coq proof script}

805:   \label{tab/time}

806:   %\vspace{-5mm}

807:   $$

808:   \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|} \hline

809:     \text{Script} & N               & 2     & 3     & 4     & 5     & 6     & 7     & 8     & 9     & 10    & 11    \\ \hline

810:     \text{no}     & \text{time} (s) & 00.02 & 00.06 & 00.10 & 00.19 & 00.29 & 00.43 & 00.57 & 00.74 & 00.93 & 01.15 \\ \cline{1-1}\cline{3-12}

811:     \text{Coq}    &                 & 00.02 & 00.06 & 00.13 & 00.22 & 00.35 & 00.50 & 00.87 & 01.35 & 01.84 & 02.43 \\ \hline

812:   \end{array}

813:   $$

814:   $$

815:   \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|} \hline

816:     \text{Script} & N               & 12    & 13    & 14    & 15    & 17    & 19    & 23    & 34     & 54    \\ \hline

817:     \text{no}     & \text{time} (s) & 01.38 & 01.63 & 01.91 & 02.90 & 02.90 & 03.73 & 05.93 & 13.71  & 38.32 \\ \cline{1-1}\cline{3-11}

818:     \text{Coq}    &                 & 03.22 & 03.77 & 05.26 & 07.13 & 12.22 & 19.50 & 38.13 & 421.4  & 5767  \\ \hline

819:   \end{array}

820:   $$

821: \end{table}

822:

823: A C++ class produces a trace for {\tt N} between 2 and 54 where all

824: branches and control statements have been removed. Each trace contains

825: only floating point operations.  Table~\ref{tab/time} gives the time

826: needed by Gappa on a 1.86~GHz Quad-Core Intel Xeon Processor with

827: 2x4MB L2 cache and 2x1GB of memory, to produce each of the proofs that

828: no round-off error occurred for all the values of $N$ between $2$ and

829: $54$. That ends the proof of correction of the algorithm.

830:

831: \section{Perspectives and concluding remarks}

832: \label{sec/conc}

833:

834: This report presents a new use of Gappa based on proof obligations

835: generated from a trace of execution~\cite{MelPio07}.  It would enable

836: us to prove in Coq in the future that expression swell within the

837: studied algorithm of delayed finite field arithmetic does not

838: introduce round off errors \cite{jgd:2004:issac}. This full

839: certification will not be obtained by porting the initial proof to Coq

840: but by a case analysis on the 53 possible values of one argument $N$.

841: Pieces of the formal proof have been generated by Gappa for each

842: individual value of $N$.

843:

844: % The other questions raised in this report are not related to floating

845: % arithmetic. Answering these questions involve techniques common to

846: % program analysis that are or should be available in tools that perform

847: % automatic validation of programs.

848:

849: Our approach can be easily reproduced to other exact linear

850: applications over finite fields.  More precisely, the FFLAS-FFPACK

851: project has been successful on using delayed prime field arithmetic

852: for linear algebra applications.

853: \begin{center}

854:   \url{http://ljk.imag.fr/membres/Jean-Guillaume.Dumas/FFLAS/}

855: \end{center}

856: % We may start by the validation of symmetric representations of the

857: % prime field.

858:

859: % Future work of FFLAS-FFPACK may also consider Bareiss algorithm.

860:

861: % Gappa is also capable of validating bounds used for Winograd-Strassen

862: % matrix multiplication \cite{dumas-2006} by rewriting expressions. This

863: % feature not demonstrated here is useful in proving that

864: % Winograd-Strassen algorithm is correct.

865: % Gappa transfers the validation of additional formulas to a

866: % strategy of Coq that is able to validate it component by component in this case.

867:

868: \bibliographystyle{abbrv}

869: \bibliography{daumas_ref,alternate}

870: %\bibliography{alias,perso,groupe,saao,these,livre,arith}

871:

872: \end{document}

873: