0708.3722/hal.tex
1: 
2: %
3: \documentclass[9.5pt,journal,final,finalsubmission,twocolumn]{IEEEtran}
4: 
5: \usepackage{listings}
6: \lstset{basicstyle=\tt\small,xleftmargin=0cm,numberstyle=\tiny}
7: 
8: %
9: \newtheorem{example}{Example}[section]
10: \newtheorem{algorithm}{Algorithm}[section]
11: 
12: \def\mach{\mbox{\scriptsize\sf m}}
13: \def\fma{\mbox{\scriptsize \sf fma}}
14: \def\ulp{\mbox{ulp}}
15: 
16: \def\H{\mbox{\tiny \sf H}}
17: \def\L{\mbox{\tiny \sf L}}
18: \def\roundtoRZ{\mbox{\sf round\_to\_zero}}
19: \def\roundtoRI{\mbox{\sf round\_to\_inf}}
20: \def\roundtoRN{\mbox{\sf round\_to\_nearest}}
21: \def\roundUP{\mbox{\sf round\_up}}
22: \def\roundDn{\mbox{\sf round\_down}}
23: 
24: \def\fractheo{0.44}
25: 
26: \def\qed{\vrule height7pt width5pt depth0pt}
27: 
28: \newtheorem{theorem}{Theorem}%
29: \newtheorem{definition}{Definition}[section]
30: \newtheorem{proposition}{Proposition}[section]
31: \newtheorem{lemma}{Lemma}[section]
32: \newtheorem{remark}{Remark}[section]
33: 
34: \renewcommand{\theequation}{\thesection.\arabic{equation}}
35: %
36: %
37: 
38: \usepackage{color,url,cite,amstext,amssymb,amstext}
39: \DeclareFontFamily{U}{manual}{}
40: \DeclareFontShape{U}{manual}{m}{n}{ <->  manfnt }{}
41: \newcommand{\manfntsymbol}[1]{%
42:     {\fontencoding{U}\fontfamily{manual}\selectfont\symbol{#1}}}
43: \newcommand{\dbend}{\manfntsymbol{127}}%
44: 
45: 
46: %
47: %
48: %
49: %
50: 
51: \def\Rx{\mathop{\Box}\nolimits}
52: 
53: \def\RO{\mathop{\bowtie}\nolimits}
54: \def\RU{\mathop{\triangle}\nolimits}
55: \def\RN{\mathop{\circ}\nolimits}
56: \def\RD{\mathop{\nabla}\nolimits}
57: 
58: \pagestyle{empty}
59: 
60: %
61: 
62: %
63: 
64: \def\sq{\mbox{\scriptsize \sf sq}}
65: 
66: \title{%
67:   Formally Verified Argument Reduction \\
68:   with a Fused-Multiply-Add
69: }%
70: \author{%
71:   Sylvie Boldo, Marc Daumas and Ren-Cang Li%
72: \thanks{%
73:   S.    Boldo ({\tt  sylvie.boldo@inria.fr}) is with the INRIA Futurs.}
74: \thanks{%
75:   M.    Daumas ({\tt marc.daumas@lirmm.fr}) is with the LIRMM, CNRS, UM2 and ELIAUS, UPVD. Supported in part
76:   by the PICS 2533 of the CNRS.}
77: \thanks{%
78:   R.-C. Li ({\tt rcli@uta.edu})  is  with the  Department
79:   of Mathematics, the University of Texas at Arlington,
80:         P.O. Box 19408,
81:         Arlington, TX 76019-0408. Supported in part
82:   by the National Science Foundation under Grant
83:         No. DMS-0510664 and DMS-0702335 and by the Region Languedoc Roussillon of France.}
84: }
85: 
86: \date{%
87:       %
88:  %
89:  %
90:  %
91:  %
92:        February 2007}
93: 
94: \newcommand{\incb}[2]{\usepackage[#2]{#1}}
95: 
96:  \message{Using PS files}
97:  \incb{graphicx}{dvips}
98:  \incb{epsfig}{dvips}
99:  \def\dvipsselector#1#2{#1}
100:  \def\dvipsoption{dvips}
101: 
102: \newenvironment{pfigure}{\begin{figure*}}{\end{figure*}}
103: \def\figdir{.} \def\figepsdir{\figdir}
104: \def\figext{\dvipsselector{pstex_t}{pdftex_t}}
105: \def\epsdir{../EPS}
106: 
107: \newcommand{\figtex}[2]{%
108:   \begin{pfigure}
109:     \begin{center}
110:       \input \figdir/#1.\figext
111:     \end{center}
112:     \caption{#2}
113:     \label{fig/#1}
114:     \label{fig:#1}
115:   \end{pfigure}
116: }
117: 
118: \begin{document}
119: 
120: \maketitle
121: \thispagestyle{empty}
122: 
123: %
124: 
125: 
126: \begin{abstract}
127:   Cody \& Waite argument reduction technique works perfectly for
128:   reasonably large arguments but as the input grows there are no bit
129:   left to approximate the constant with enough accuracy. Under mild
130:   assumptions, we show that the result computed with a
131:   fused-multiply-add provides a fully accurate result for many
132:   possible values of the input with a constant almost accurate to the
133:   full working precision. We also present an algorithm for a fully
134:   accurate second  reduction step to reach double full accuracy (all the significand bits of two numbers are significant) even in the worst cases of argument reduction. Our work recalls the common
135:   algorithms and presents proofs of correctness.  All the proofs are
136:   formally verified using the Coq automatic proof checker.
137: \end{abstract}
138: 
139: \begin{keywords}
140: Argument reduction, {\tt fma}, formal proof, Coq.
141: \end{keywords}
142: 
143: \section{Introduction}\label{sec:intro}
144: 
145: 
146: Methods that compute elementary functions on a large domain rely on
147: efficient argument reduction techniques.  The idea is to reduce an
148: argument $x$ to $u$ that falls into a small interval to allow
149: efficient approximations \cite{Mar2K,BDKMR05,Mul06,li:04}. A commonly used
150: argument reduction
151: technique~\cite{CodWai80,Mar90,StoTan99,Mar2K} begins with
152: one positive FPN (floating point number) $C_1$ to approximate a number $C>0$ (usually
153: irrational but not necessarily). Examples include $C=\pi/2$ or $\pi$
154: or $2\pi$ for trigonometric functions $\sin x$ and $\cos x$, and
155: $C=\ln 2$ for exponential function $e^x$.
156: 
157: Let $x$ be a given argument, a FPN.  The argument reduction starts by
158: extracting $\chi$ as defined by
159: \begin{eqnarray*}
160: x/C_1   & =       & \framebox{\parbox{2cm}{\centerline{$\chi$\vphantom{$k2^{-N}b$}}}}\,
161:                     \framebox{\parbox{1.5cm}{\centerline{$\varsigma$\vphantom{$k2^{-N}b$}}}}\,.
162: %
163: %
164: \end{eqnarray*}
165: Then it computes a reduced argument $x - \chi C_1$. The result is exactly a FPN as it is defined by an IEEE-754
166: standard remainder operation.
167: But division is a costly operation that is
168: avoided as much as possible. Some authors, see for example \cite{StoTan99,Mar2K,Mul06}, and
169: \begin{center}
170:   {\small \url{http://www.intel.com/software/products/opensource/libraries/num.htm}},
171: \end{center}
172: introduce another FPN $R$ that approximates $1/C$ and the argument
173: reduction replaces the division by a multiplication so that
174: \begin{eqnarray}
175: x\cdot\frac 1C   & \approx & xR   \nonumber \\
176:         & =       & z+s \nonumber \\
177:         & =       & \framebox{\parbox{2cm}{\centerline{$k2^{-N}$\vphantom{$k2^{-N}b$}}}}\,
178:                     \framebox{\parbox{1.5cm}{\centerline{$s$\vphantom{$k2^{-N}b$}}}}\,,
179: \label{eq:z-extract}
180: \end{eqnarray}
181: where $k$ is an integer used to reference a table of size $2^N$.
182: %
183: %
184: %
185: %
186: %
187: This replacement is computational efficient  if $u$
188: \begin{equation}\label{eq:u-1}
189: u =  x - z C_1
190: \end{equation}
191: is a FPN \cite{LiBolDau03}.
192: %
193: %
194: %
195: 
196: Sometimes the computed value of $u$ is not sufficiently accurate, for example if $u$ is near
197: a multiple of $C$, the loss of accuracy due to the approximation
198: $C_1 \approx C$ may  prevail. A better approximation to $C$
199: is necessary to obtain  a fully accurate reduced
200: argument. If this is the case we use  $C_2$, another FPN, roughly containing
201: the next many bits in the significand of $C$ so that the unevaluated
202: $C_1 + C_2 \approx C$ much better than $C_1$ alone.  When equation (\ref{eq:u-1}) does not introduce any rounding error,
203: the new reduced argument is not $u$ but $v$ computed by
204: \begin{eqnarray}\label{eq:u-updated}
205: v & \approx & u -z C_2.
206: \end{eqnarray}
207: To increase once again the accuracy, the error of (\ref{eq:u-updated})
208: need to be computed (see Section~\ref{sec:sndstep}), too to obtain
209: $v_1$ and $v_2$ exactly satisfying
210: \begin{eqnarray}\label{eq:u-updated}
211: v_1 + v_2 & = & u -z C_2.
212: \end{eqnarray}
213: The last step creates a combined reduced argument stored in the
214: unevaluated sum $v_1 + w$ with $2p$ significant bits
215: \begin{eqnarray}\label{eq:u-updated}
216: w & \approx & v_2 -z C_3
217: \end{eqnarray}
218: Whether $v_1$ (or $v_1$ and $w$) is accurate enough for computing the
219: elementary function in question is subject to further error analysis
220: on a function-by-function basis \cite{Kah83}. But this is out of the
221: scope of this paper.
222: 
223: The Cody \& Waite technique \cite{CodWai80} is presented in
224: Figures~\ref{fig/cody} and
225: \ref{fig/cody2}, where $\RN(a)$ denotes the FPN obtained from
226: rounding $a$ in the round-to-nearest mode. Those are examples when no {\tt fma} is used. The
227: sizes of the rectangles represent the precision (length of the
228: significand) of each FPN and their positions indicate magnitude, except for $z$ and $C_1$
229: whose respective layouts are only for showing lengths of significands. The light gray
230: represents the cancellations: the zero bits due to the fact that
231: $|x-\circ(z \times C_1)| \ll |x|$. The dark grey represents the
232: round-off error: the bits that may be wrong due to previous
233: rounding(s).
234: 
235: 
236: Figure~\ref{fig/cody} presents the ideal behavior.
237: Figure~\ref{fig/cody2} presents the behavior when the significand of
238: $z$ is longer. Then, fewer bits are available to store the significand
239: of $C_1$ in order for $z C_1$ to be  stored exactly. The consequence is a
240: tremendous loss of precision in the final result: as $C_1$ must be stored in
241: fewer bits, the cancellation in the computation of $x -z C_1$ is
242: smaller and the final result may be inaccurate.
243: 
244: 
245: \begin{figure*}[t]
246: \begin{minipage}{0.48\linewidth}
247: \centerline{\scalebox{.7}{\input{\figdir/cody.pstex_t}}}
248: \caption{Reduction technique works for $z$ sufficiently small.}
249: \label{fig/cody}
250: \end{minipage}
251: \hfill
252: \begin{minipage}{0.48\linewidth}
253: \centerline{\scalebox{.7}{\input{\figdir/cody2.pstex_t}}}
254: \caption{Cody \ Waite technique fails as $z$ grows, i.e. $u$ is not
255:   accurate enough.}
256: \label{fig/cody2}
257: \end{minipage}
258: \end{figure*}
259: 
260: 
261: %
262: %
263: 
264: 
265: \begin{figure*}
266: \centerline{\scalebox{.7}{\input{\figdir/enh.pstex_t}}}
267: \caption{Argument reduction with exact cancellation in a fused-multiply-add.}
268: \label{fig/enh}
269: \end{figure*}
270: 
271: 
272: 
273: We want to take advantage of the fused-multiply-add ({\tt fma})
274: instructions.  Some machines have hardware support for it, such as
275: machines with HP/Intel$^\text{\textregistered}$
276: Itanium$^\text{\textregistered}$ Microprocessors~\cite{Mar2K} and IBM
277: PowerPC Microprocessors, and this instruction will also be added to
278: the revision of the IEEE-754 standard.  The current draft can be found
279: at
280: \begin{center}
281:    {\small \url{http://www.validlab.com/754R/}}.
282: \end{center}
283: It is obvious that some bits of $x$ and $z C_1$ will cancel each other
284: as $z$ is computed such that $x \approx z C_1$, but it is not clear
285: how many of them will and under what condition(s).  Consequently if
286: accuracy calls for $x - z C_1$ to be calculated exactly (or to more
287: than $p$ bits in the significand), how do we get these bits
288: efficiently?  This question is especially critical if the working
289: precision is the highest available on the underlying computing
290: platform.
291: 
292: 
293: In this paper, we will devise easily met conditions so that $x - z C_1$
294: can be represented exactly by a FPN, and thus it can be computed by one
295: instruction of the {\tt fma} type without error. This technique is
296: presented in Figure~\ref{fig/enh}. The understanding is the same as in
297: Figures \ref{fig/cody} and \ref{fig/cody2}. The cancellation is
298: greater as $C_1$ can be more precise.  The idea is that  the rounding
299: in $z C_1$ is avoided thanks to the {\tt fma}: $z C_1$, a
300: $2p$-bit FPN, is virtually computed with full precision and then
301: subtracted from $x$. This subtraction is proved to be exact as $x
302: \approx z C_1$. The fact of $x - z C_1$ being a FPN is used by the
303: library of \cite{StoTan99,Mar2K} with no formal justification
304: until \cite{LiBolDau03}.
305: 
306: 
307: The motivations of this work are similar to those presented in
308: \cite{LiBolDau03} and Section~\ref{sec:exact-sub} recalls briefly some
309: useful prior-art from the authors \cite{DauRidThe01,LiBolDau03}.
310: However, the rest of the paper presents entirely new results.  The
311: theorems and their proofs are different from the ones presented in
312: \cite{LiBolDau03}.  The changes are necessary to facilitate
313: verification with an automatic proof checker. Moreover, the results
314: have been improved, and are simpler to grasp and new results have been
315: added thanks to this simplification and to a better understanding of
316: the FPNs relationships due to the formal proof.
317: %
318: 
319: 
320: 
321: In a floating-point pen-and-paper proof, it is difficult to be
322: absolutely sure that no special case is forgotten, no inequality
323: is erroneous, and no implicit hypothesis is assumed, etc. All the
324: proofs presented in this paper are verified using our specification of
325: generic floating point arithmetic \cite{DauRidThe01} and Coq proof
326: assistant \cite{BC04}. This approach has already been proven successful in
327: hardware or software applications
328: \cite{CarMin95,Rus98,Har97a,Har2K}. The drawback is a long and
329: tiresome argumentation versus the proof checker that will ascertain
330: each step of the demonstration. The corresponding scripts of proofs
331: are available online at
332: \begin{center}
333:    {\small \url{http://www.netlib.org/fp/fp2.tgz}}.
334: \end{center}
335: We indicate for each theorem its Coq name.  The developments presented
336: here are located in the {\tt FArgReduct[2,3,4].v} files.
337: 
338: 
339: The rest of this paper is organized as follows.
340: Section~\ref{sec:exact-sub} recalls theorems on the number of
341: cancelled bits of two close FPNs (extensions of Sterbenz's
342: theorem~\cite{Ste74}). In Section~\ref{sec:alg}, we present the Coq
343: verified theorem about the correctness of the algorithm that produces
344: $z$ in (\ref{eq:z-extract}) and that satisfies the conditions of
345: %
346: %
347: %
348: %
349: %
350: %
351: %
352: %
353: the following theorems. The demonstration of the main result, 
354: {\it i.e.} the correctness of the first reduction step, is then 
355: described in Section~\ref{sec:proof}. In Section~\ref{sec:sndstep}, we give new
356: algorithms and results about a very accurate second step for the
357: argument reduction. Section~\ref{sec:concl} concludes the work of this paper.
358: 
359: 
360: \medskip\noindent
361: {\bf Notation.}  Throughout, $\ominus$ denotes the floating point
362: subtraction.  $\{{\cal X}\}_{\fma}$ denotes the result by an
363: instruction of the fused-multiply-add type, i.e., the exact $\pm a \pm
364: b \times c$ after only one rounding. FPNs use $p$ digits, hidden bit (if any) counted, in the
365: significand or otherwise explicitly stated.  We denote $\RN(a)$ the FPN obtained from
366: rounding $a$ in the round-to-nearest mode with $p$ digits and $\RN_m(a)$ if we round to $m$ digits instead of $p$. 
367: We denote by $\ulp(\cdot)$
368: the unit in the last place of a $p$-digit FPN and $\ulp^{\circ
369:   2}(\cdot) = \ulp(\ulp(\cdot))$.  
370: The smallest (subnormal)
371: positive FPN is denoted by $\lambda$.
372: 
373: %
374: %
375: 
376: 
377: 
378: 
379: \section{Exact Subtraction Theorems}\label{sec:exact-sub}
380: \setcounter{equation}{0}
381: 
382: 
383: 
384: These theorems will be used in Section~\ref{sec:proof} to guarantee
385: that there will be enough cancellation in $x-z  C_1$ so that it can be computed exactly
386: by one {\tt fma} type instruction, or equivalently, to assure
387: $x-zC_1$ fits into one FPN.
388: 
389: A well-known property \cite{Ste74,Gol91} of the floating point
390: subtraction is the following.
391: 
392: \medskip
393: 
394: \fbox{
395: \begin{minipage}{\fractheo\textwidth}
396: \begin{theorem}[{{\tt Sterbenz {\it in} Fprop.v}}]\label{thm:exact-sub-classical}
397:   Let $x$ and $y$ be FPNs. If  $y/2 \le x \le 2\ y,$
398:   then $x-y$ is a FPN.
399:   This is valid with any integer radix $\beta \ge 2$ and any
400:   precision $p \ge 2$.
401: 
402: \end{theorem}
403: \end{minipage}
404: }
405: 
406: \medskip
407: 
408: We extend Sterbenz's theorem to fit the use of a fused-multiply-add
409: that may create a higher precision virtual number whose leading digits are canceled to the
410: working precision as explained in Figure~\ref{fig/Sterbenz}: when $x$
411: and $y$ are sufficiently near one another, cancellation makes the result exactly fit a smaller precision.
412: 
413: 
414: \medskip
415: 
416: \fbox{
417: \begin{minipage}{\fractheo\textwidth}
418: \begin{theorem}[{{\tt SterbenzApprox2}}]\label{thm:exact-sub}
419:   Let $x$ and $y$ be $p_1$-digit FPNs. If
420:   $$
421:   \frac{y}{1+\beta^{p_2-p_1}} \le x \le \left(1+\beta^{p_2-p_1}\right)\ y,
422:   $$
423:   then $x-y$ is a $p_2$-digit FPN.
424:   This is valid with any different significand sizes $p_1, p_2 \ge 2$, and
425:   any integer  radix $\beta \ge 2$.
426: \end{theorem}
427: \end{minipage}
428: }
429: 
430: \medskip
431: 
432: The proofs are omitted as they appeared in other publications
433: \cite{LiBolDau03,BolDau04a}. It is worth mentioning that
434: Theorem~\ref{thm:exact-sub} do not require $p_1 \ge p_2$ or $p_2 \ge
435: p_1$.
436: 
437: 
438: \begin{figure}[hb]
439: \centerline{\scalebox{.7}{\input{\figdir/Sterbenz.pstex_t}}}
440: \caption{Extension of Sterbenz's theorem.}
441: \label{fig/Sterbenz}
442: \end{figure}
443: 
444: 
445: 
446: From now on, all FPNs are binary.
447: The underlying machine hardware
448: conforms to IEEE-754 floating point standards~\cite{Ste.87,CodKar.84}. 
449: This implies that rounding does not introduce a rounding error when the
450: exact result is a FPN. Unless explicitly stated, the default rounding
451: mode is {\em round-to-nearest} with ties broken to the even
452: significand.
453: 
454: \section{About the Algorithm for $z$}\label{sec:alg}
455: 
456: The computation of $z$ can be done efficiently as
457: \begin{eqnarray}
458: z & = & \{xR+\sigma\}_{\fma}-\sigma,
459: \end{eqnarray}
460: where $\sigma$ is a pre-chosen constant. The technique is adapted from
461: \cite[Chap.  10]{Mar2K} who used an idea attributed by the author to
462: C.~Roothaan in his work for HP's vector math library for Itanium.
463: The explanation is in Figure~\ref{fig/kzh}: here we choose $\sigma=3
464: \cdot 2^{p-N-2}$ for a $z$ having its last bit at exponent $-N$.
465: 
466: 
467: In realizing (\ref{eq:z-extract}), the wanted results are that $z 2^N$ is
468: an integer, and that $|xR-z| \le 2^{-N-1}$. We may also need that
469: the precision needed for $z$ is smaller or equal to $p - 2$. Here is
470: the theorem, verified by Coq.
471: 
472: \medskip
473: 
474: \fbox{
475: \begin{minipage}{\fractheo\textwidth}
476: \begin{theorem}[{{\tt arg\_reduct\_exists\_k\_zH}}]\label{theo:zh}
477:   Assume
478:   \begin{itemize}
479:   \item $p > 3$,
480:   \item $x$ is a $p$-bit FPN,
481:   \item $R$ is a positive normal $p$-bit FPN,
482:     \medskip
483:   \item $z=\left\{3 \cdot 2^{p-N-2}+ x R \right\}_{\fma} \ominus 3  \cdot 2^{p-N-2}$,
484:     \medskip
485:   \item $|z| \ge 2^{1-N}$,
486:   \item $|x R| \le 2^{p-N-2}-2^{-N}$,
487:   \item $2^{-N}$ is a FPN.
488:   \end{itemize}
489:   Then there exists an integer $\ell$ satisfying $2 \le \ell \le p -
490:   2$ such that
491:   \begin{itemize}
492:     \item $|z 2^N|$ is an $\ell$-bit integer greater than $2^{\ell - 1}$, and
493:     \item $|xR -z| \le 2^{-N-1}$.
494:   \end{itemize}
495: %
496: %
497: %
498: %
499: \end{theorem}
500: \end{minipage}
501: }
502: 
503: \medskip
504: 
505: In short, if $z$ is computed as explained and $x$ is not too big, then
506: $z$ is  a correct answer, meaning it  fulfills all the requirements
507: that will be needed in Theorem~\ref{theo:correct1} in the next section.
508: 
509: \begin{figure}[hb]
510: \centerline{\scalebox{.7}{\input{\figdir/kzh.pstex_t}}}
511: \caption{Algorithm for computing $z$.}
512: \label{fig/kzh}
513: \end{figure}
514: 
515: 
516: 
517: 
518: For Intel's double extended precision, this technique is perfectly
519: adapted for range reduction with argument between $-2^{63}$ and
520: $2^{63}$ when $R$ is in the order of $O(1)$.
521: %
522: This argument range coincides with what is in Intel's manual \cite{Int96b}  for {\tt FSIN},
523: {\tt FCOS}, {\tt FPTAN} and {\tt FSINCOS}. A quick justification is for
524: $C = 2 \pi$ and modest $N$, say $N = 0$ for an example, $|x R | \lesssim 2^{63} / (2\pi)$
525: gives
526: $|x  R | < 2^{62} - 1.$
527: 
528: For the exponential function, any argument larger than $11356$
529: overflows in the double extended and quad precisions, and $\ell \le p - 2$
530: is easily satisfied.
531: 
532: 
533: 
534: %
535: \section{Main Results}\label{sec:proof}
536: \setcounter{equation}{0}
537: 
538: We now present the conditions under which $x - z C_1$ can be
539: represented exactly by a FPN, and thus it can be computed by $\{x - z
540: C_1\}_{\fma}$ without error. As in Section~\ref{sec:intro}, $R\approx
541: 1/C$ and $C_1 \approx C$. We suppose that $C > 0$ and $C \neq 2^j$ for
542: any $j$.
543: 
544: 
545: The idea is the use of a fused-multiply-add that may create a higher
546: precision virtual number that cancels to the working precision.
547: Figure~\ref{fig/fmac} explains the idea: if $z$ is an $\ell$-bit
548: integer and the significand of $C_1$ uses $p - q$ bits, it takes up to
549: $p - q + \ell$ bits to store the significand of $z C_1$.  And as $z
550: C_1$ and $x$ are near enough, the final result fits into $p$ bits. The
551: notation $m_X$ stands for the significand of $X$ and $e_X$ its exponent.
552: 
553: \begin{figure}[h]
554: \centerline{\scalebox{.7}{\input{\figdir/fmac.pstex_t}}}
555: \caption{Fused-multiply-add used to create and cancel a higher
556:   precision virtual number.}
557: \label{fig/fmac}
558: \end{figure}
559: 
560: %
561: 
562: We want to give enough hypotheses on the inputs to guarantee that $x -
563: z C_1$ will be computed without error.
564: 
565: We define the exponent $e_{R}$ of $R$ as the only integer such that
566: $2^{e_{R}} < R < 2^{e_{R} + 1}$.
567: We want to set the $q$ least significant bits of  $C_1$ to zero.  Since
568: $C_1$ should be  as accurate as   possible, we set $C_1  \approx
569: 1/R$ to the nearest FPN with $p -  q$ significant bits. From this, we
570: deduce that $2^{-e_{R}-1} \le C_1 \le 2^{-e_{R}}$ and that the
571: distance between $C_1$ and $1/R$ is less than half an ulp (in $p-q$
572: precision) therefore
573: $$\left| \frac 1{R} - C_1 \right| \le 2^{- e_{R} - 1 - (p - q)}.$$
574: 
575: 
576: We now define $\delta = R C_1-1$, and we deduce a bound on its magnitude from the previous
577: inequalities
578: \begin{eqnarray*}
579: |\delta|  & \le  & 2^{q - p}.
580: \end{eqnarray*}
581: 
582: Let $z$ be as defined by (\ref{eq:z-extract}) with the conditions on
583: $z$ and $s$ given there.  We assume for the moment that $z \ne 0$.
584: Theorem~\ref{thm:exact-sub} can be used if we bound $x/(z C_1)$ and
585: its reciprocal by $1 + 2^{q - \ell}$. We have the following
586: equalities:
587: \begin{eqnarray*}
588: \frac x{z C_1} & = & \frac {x R}{zR C_1} \\
589:                  & = & \frac {z+s}{zR C_1} \\
590:                  & = & \left(1 + \frac sz\right)\frac
591:                  1{1+\delta}.
592: \end{eqnarray*}
593: We recall that $z=k 2^{-N}$ and that $k$ is an integer using $\ell$
594: bits, and we deduce on the other hand
595: $$2^{-N + \ell - 1} \le |z| < 2^{-N + \ell}$$
596: to bound
597: \begin{eqnarray*}
598: \left| \frac sz \right| \le 2^{- \ell}.
599: \end{eqnarray*}
600: 
601: 
602: Rewriting the condition of Theorem~\ref{thm:exact-sub} and taking
603: advantage of preceding results, we arrive at the point
604: to prove both
605: \begin{eqnarray}
606: \label{eq:AB1}\frac{1 + 2^{-\ell}}{1 +|\delta|} & \le & 1 + 2^{q - \ell}, \\[6pt]
607: \label{eq:AB2}\frac{1 + |\delta|}{1 - 2^{-\ell}} & \le & 1 + 2^{q - \ell}.
608: \end{eqnarray}
609: 
610: Conditions (\ref{eq:AB1}) and (\ref{eq:AB2}) are checked using functional
611: analysis on polynomials and homographic functions for any permitted
612: value of $A = 2^{-\ell}$.  Since $z$ is both a machine number and a
613: non-zero $\ell$-bit FPN ($1 \le \ell \le p$). From
614: Section~\ref{sec:alg}, the algorithm used to produce $z$ implies $\ell
615: \le p - 2$.  We will use a more generic condition:
616: $$2^{1-p} \le A = 2^{-\ell} \le \frac 12.$$ We will now explain what
617: are the successive requirements to guarantee that both (\ref{eq:AB1})
618: and (\ref{eq:AB2}) are fulfilled.
619: 
620: \paragraph{Condition (\ref{eq:AB1})}
621: 
622: We want to guarantee that
623: $\displaystyle \frac{1 + 2^{-\ell}}{1 + 2^{q - \ell}} \le 1 + |\delta|$.
624: The homographic function
625: $$\frac{1 + 2^{-\ell}}{1 + 2^{q - \ell}} = \frac{1 + A}{1 + A 2^q}$$%
626: we want to bound is maximized at $A = 2^{1-p}$ and it is sufficient to
627: check if $(1 + 2^{1-p})/(1 + 2^{1-p} 2^q) \le 1 + |\delta|.$ We use the
628: bound on $|\delta|$ and we introduce $B = 2^q$. We have left to prove
629: that
630: $$(1 + 2^{1-p})/(1 + 2^{1-p} B) \le 1 - B 2^{-p}.$$%
631: This is equivalent to check if the second order polynomial
632: $2^{1-p} B^2 - B +2 \le 0.$
633: The inequality is satisfied for $B$ between the two roots
634: $\displaystyle 2^{p-2} \left(1 \pm \sqrt{1- 2^{4 - p}} \right)$.
635: Thus it is sufficient  to have $B \ge 4$ for all precisions.
636: 
637: 
638: \paragraph{Condition (\ref{eq:AB2})}
639: 
640: We want to guarantee that $1 + |\delta|  \le  (1 + 2^{q - \ell}) (1 -
641: 2^{-\ell})$. We introduce $A$ and $B$ as before, so we have left to prove
642: $$1 + |\delta|  \le  (1 + AB) (1 - A).$$
643: 
644: 
645: We assume   that  $B \ge  4$ from the preceding paragraph. The polynomial
646: $$(1 + AB)(1 - A) = (1 + 2^{q - \ell})(1 - 2^{-\ell})$$ is minimized
647: at $A = 2^{1-p}$ and it is sufficient to check if $(1 + |\delta|) \le
648: (1 + 2^{1-p} B)(1 - 2^{1-p}).$ From the bound on $|\delta|$, we now
649: have to check if
650: $$(1 + B 2^{-p}) \le (1 + 2^{1-p} B)(1 - 2^{1-p})$$
651: which is true for any precision.
652: 
653: This proof is rather long and complex. We therefore verified it in
654: Coq to be sure there is no mistake. It also gives us more precise
655: and sharp hypothesis than if we would do that only by pen-and-paper. All
656: hypotheses have to be clearly written so that the proof can be
657: checked. There is no easy way to say ``we assume there is no
658: Underflow'' or ``that the precision is big enough''. This leads to
659: long theorems (at least longer than what we are used to), but precise
660: and correct ones:
661: 
662: \medskip
663: 
664: \fbox{
665: \begin{minipage}{\fractheo\textwidth}
666: \begin{theorem}[{{\tt Fmac\_arg\_reduct\_correct1}}]\label{theo:correct1}
667:   Assume
668:   \begin{itemize}
669:   \item $p > 3$,
670:   \item $x$ is a $p$-bit FPN,
671:   \item $R$ is a positive normal $p$-bit FPN,
672:   \item $2 \le q < p-1$,
673: %
674: %
675:   \item $C_1$ is the $(p-q)$-bit FPN obtained by rounding $1/R$ to
676:     $p-q$ bits using round-to-nearest mode,
677:   \item $C_1$ is not exactly a power of 2,
678:   \item $C_1 \ge 2^{p-q+\max(1,N-1)} \lambda$,
679:   \item $2 \le \ell \le p-1$,
680:   \item $|z 2^N|$ is an $\ell$-bit integer greater than $2^{\ell - 1}$,
681:   \item $|xR -z| \le 2^{-N-1}$,
682:   \item $q \le \ell$.
683:   \end{itemize}
684:   Then $x - z C_1$ is a $p$-bit FPN.
685: \end{theorem}
686: \end{minipage}
687: }
688: 
689: \medskip
690: 
691: 
692: In short, if $C_1$ is rounded to the nearest from $1/R$ with $p-q$
693: bits and $q \ge 2$ and $z$ is not too small, then the {\tt fma} does not
694: make any round-off error.
695: 
696: %
697: %
698: %
699: %
700: %
701: %
702: 
703: 
704: %
705: %
706: 
707: Automatic proof checking also prompted us that the exact behavior may be
708: difficult to obtain for $z=2^{-N}$ and $x$ close to $2^{- N - 1} R$.
709: This case was excluded in Theorem~\ref{theo:correct1} under the
710: hypothesis that $2 \le \ell$, but it will be included in the next theorem
711: which  focuses on $q = 2$ as this situation leads to $C_1$ as close as
712: possible from $C$ and thus has more practical value.
713: For completeness and theoretical interest a theorem similar to Theorem~\ref{theo:correct1} but valid for all $2 \le 2 \le p - 1$ is presented in the appendix.
714: 
715: Assume $q=2$ in the rest of this section. When
716: $z=2^{-N}$, then $x \le 2 C_1 \times 2^{-N}$ as $xR$ is
717: approximated by $z=2^{-N}$.  We can also deduce that
718: $$
719: \frac{C_1 \times  2^{-N}}{1+2^{2-p}}   \le x.
720: $$
721: When $C_1 \times 2^{-N}/2 \le x$, Sterbenz's theorem
722: (Theorem~\ref{thm:exact-sub-classical}) can be applied and $x-C_1
723: \times 2^{-N}$ is representable.  If not, then
724: $$
725: \frac{C_1 \times 2^{-N}}{1+2^{2-p}} \le x < \frac{C_1 \times 2^{-N}}{2}.
726: $$
727: %
728: %
729: %
730: %
731: %
732: %
733: Since $C_1$ is a $(p-2)$-bit FPN and not exactly a power of 2 as a $p$-bit FPN, then $C_1$ is at least
734: 4 ulps away from a power of 2. This is because as a $p$-bit FPN,
735: $C_1$ is worth $2^e\times 1.{\tt bb\cdots b}00$, where at least one of the ${\tt b}$'s must be 1; therefore the $C_1$
736: that comes closest to a power of $2$ is either  $2^e\times 1.0\cdots 0{\tt 1}00$ or
737: $2^e\times 1.{\tt 11\cdots 1}00$. Both are 4 ulps away from a power of 2.
738: %
739: %
740: %
741: %
742: %
743: %
744: %
745: %
746: %
747: %
748: %
749: %
750: %
751: %
752: %
753: %
754: This distance and the preceding inequality are enough to guarantee
755: that the exponent of $x$ is the exponent of $C_1$ minus $N + 1$.
756: After a few computations, we finish with $x-C_1 \times 2^{-N}$ being a
757: FPN, regardless of $x$.
758: 
759: 
760: A few peculiar cases have been omitted in the sketch of this proof.
761: Automatic proof checking allows us to trustfully guarantee that these
762: cases have been all checked in our publicly available proof scripts.
763: The only surprising condition is presented in this section.  The
764: other cases are easily generalized from Theorems~\ref{theo:zh} and
765: \ref{theo:correct1}.  So just by wrapping these two results together,
766: we can state the following theorem in its full length, verified with
767: Coq.
768: 
769: \medskip
770: 
771: \fbox{
772: \begin{minipage}{\fractheo\textwidth}
773: \begin{theorem}[{{\tt Fmac\_arg\_reduct\_correct3}}]\label{theo:red3}
774:   Assume
775:   \begin{itemize}
776:   \item $p > 3$,
777:   \item $x$ is a $p$-bit FPN,
778:   \item $R$ is a positive normal $p$-bit FPN,
779:   \item $C_1$ is the $(p-2)$-bit FPN obtained by rounding $1/R$ to $p-2$
780:     bits using round-to-nearest mode,
781:   \item $C_1$ is not exactly a power of 2,
782:   \item $C_1 \ge 2^{p+\max(-1,N)} \lambda$,
783:     \medskip
784:   \item $z=\left\{3 \cdot 2^{p-N-2}+ x R \right\}_{\fma} \ominus 3  \cdot 2^{p-N-2}$,
785:     \medskip
786:   \item $|x R| \le 2^{p-N-2}-2^{-N}$,
787:   \item $2^{-N}$ is a FPN.
788:   \end{itemize}
789:   Then $x - z C_1$ is a $p$-bit FPN.
790: \end{theorem}
791: \end{minipage}
792: }
793: 
794: \medskip
795: 
796: In short, if  $C_1$ is rounded  to  the nearest from  $1/R$ with $p-2$
797: bits and $z$  is computed as usual, then the  {\tt fma}  does not make  any
798: round-off error. In  Tables~\ref{table:pi2} and
799: \ref{table:ln2} we present constants $R$ and $C_1$ for $\pi$ and $\ln(2)$.
800: These constants are for the exponential and the fast reduction phase of the trigonometric functions \cite{Kah83,Ng92,Mar2K,Mul06}.
801: 
802: %
803: %
804: %
805: %
806: %
807: 
808: The hypotheses may seem numerous and restrictive but they are not. As
809: $R$ and $C_1$ are pre-computed, the corresponding requirements can be
810: checked beforehand. Moreover, those requirements are weak: for example
811: with $0 <= N <= 10$ in double precision, we need $C_1 \ge 2^{-1011} \approx 4.5
812: 10^{-305}$. There is no known elementary function for which $C_1$
813: ever comes near a power of 2.
814: %
815: %
816: %
817: The only nontrivial requirement left is the bound on $|xR|$.
818: 
819: \begin{table*}
820:   \caption{Example of value for $R = \RN(1/C)$, $C_1$ rounded to $p -
821:     2$ bits, $C_2$ obtained from Algorithm~\ref{algo:C2}, and $C_3$,  for $C = \pi$,
822:     \newline
823:     easily leading to $C = 2\pi$ or $C = \pi/2$}
824: \label{table:pi2}
825: $$
826: \begin{array}{l c c c c}
827: \text{Precision}  & \text{Single}
828:                   & \text{Double}
829:                   & \text{Double extended}
830:                   & \text{Quad} \\[3pt]
831: R                 & {\scriptstyle 10680707 \cdot 2^{-25}}
832:                   & {\scriptstyle 5734161139222659 \cdot 2^{-54}}
833:                   & {\scriptstyle 11743562013128004906 \cdot 2^{-65}}
834:                   & {\scriptstyle 6611037688290699343682997282138730 \cdot 2^{-114}} \\[3pt]
835: C_1               & {\scriptstyle 13176796 \cdot 2^{-22}}
836:                   & {\scriptstyle 7074237752028440 \cdot 2^{-51}}
837:                   & {\scriptstyle 14488038916154245684 \cdot 2^{-62}}
838:                   & {\scriptstyle 8156040833015188200833743081374136 \cdot 2^{-111}} \\[3pt]
839: C_2               & {\scriptstyle -11464520 \cdot 2^{-45}}
840:                   & {\scriptstyle 4967757600021504 \cdot 2^{-105}}
841:                   & {\scriptstyle 14179128828124470480 \cdot 2^{-126}}
842:                   & {\scriptstyle 9351661544631751449372323967920768 \cdot 2^{-226}} \\[3pt]
843: C_3               & {\scriptstyle -15186280 \cdot 2^{-67}}
844:                   & {\scriptstyle 7744522442262976 \cdot 2^{-155}}
845:                   & {\scriptstyle 10700877088903390780 \cdot 2^{-189}}
846:                   & {\scriptstyle -9186378203702558149401308890796140 \cdot 2^{-334}}
847: \end{array}
848: $$
849: \end{table*}
850: 
851: 
852: \begin{table*}
853: \caption{Example of value for $R = \RN(1/C)$, $C_1$ rounded to $p -
854:     2$ bits, $C_2$ obtained from Algorithm~\ref{algo:C2}, and $C_3$, for $C = \ln(2)$}
855: \label{table:ln2}
856: $$
857: \begin{array}{l c c c c}
858: \text{Precision}  & \text{Single}
859:                   & \text{Double}
860:                   & \text{Double extended}
861:                   & \text{Quad} \\[3pt]
862: R                 & {\scriptstyle 12102203 \cdot 2^{-23}}
863:                   & {\scriptstyle 6497320848556798 \cdot 2^{-52}}
864:                   & {\scriptstyle 13306513097844322492 \cdot 2^{-63}}
865:                   & {\scriptstyle 7490900928631539394323262730195514 \cdot 2^{-112}} \\[3pt]
866: C_1               & {\scriptstyle 11629080 \cdot 2^{-24}}
867:                   & {\scriptstyle 6243314768165360 \cdot 2^{-53}}
868:                   & {\scriptstyle 12786308645202655660 \cdot 2^{-64}}
869:                   & {\scriptstyle 7198051856247353947080814903691240 \cdot 2^{-113}} \\[3pt]
870: C_2               & {\scriptstyle -8577792 \cdot 2^{-52}}
871:                   & {\scriptstyle -7125764960002032 \cdot 2^{-106}}
872:                   & {\scriptstyle -15596301547560248640 \cdot 2^{-130}}
873:                   & {\scriptstyle -5381235925004637553074520129202340 \cdot 2^{-224}} \\[3pt]
874: C_3               & {\scriptstyle -8803384 \cdot 2^{-72}}
875:                   & {\scriptstyle -7338834209110452 \cdot 2^{-161}}
876:                   & {\scriptstyle -13766585803531045332 \cdot 2^{-192}}
877:                   & {\scriptstyle -9437982846677142208552339635087788 \cdot 2^{-338}}
878: \end{array}
879: $$
880: \end{table*}
881: 
882: 
883: %
884: \section{Getting More Accurate Reduced Arguments}\label{sec:sndstep}
885: As we pointed out in the introduction in Section~\ref{sec:intro},
886: sometimes the reduced argument $u=x-z C_1$ is not accurate enough due to the limited precision
887: in $C_1$ as an approximation to $C$. When this happen another FPN $C_2$ containing the lower bits of the constant $C$ has to be made available and the new reduced argument is now $x-z C_1-z C_2$.
888: Assume that the conditions of 
889: Theorem~\ref{theo:red3} hold. In particular $C_1$ has $p-2$ bits in its significand.
890: 
891: %
892: %
893: %
894: %
895: %
896: %
897: 
898: The number $x-zC_1 -z C_2$ can be computed exactly \cite{BolMul05} as the sum of two
899: floats. But here because we know certain conditions on $z$, $C_1$, and $C_2$ as FPNs,
900: we can do it faster. Inspired by \cite{BolMul05}, we propose the following
901: Algorithm~\ref{algo:sndstep} to accomplish the task.
902: %
903: %
904: %
905: %
906: %
907: %
908: %
909: %
910: %
911: %
912: %
913: %
914: %
915: %
916: %
917: %
918: It is built upon two known algorithms:
919: \begin{itemize}
920: \item Fast2Mult($x$,$y$) that computes the rounded product of $x$ and
921: $y$ and its error (2 flops) \cite{KarpMarkstein1997}.
922: \item Fast2Sum($x$,$y$) that computes the rounded sum of $x$ and $y$
923: and its error (3 flops), under the assumption that  either $x=0$, or
924: $y=0$, or $|x| \ge |y|$, or  there exist integers $n_x, e_x,
925: n_y, e_y$ such that $x=n_x 2^{e_x}$ and $y=n_y 2^{e_y}$ and $e_x \ge
926: e_y$ \cite{DauRidThe01}.
927: \end{itemize}
928: 
929: 
930: %
931: %
932: %
933: 
934: \medskip
935: 
936: %
937: \begin{algorithm}[Super accurate argument reduction]\label{algo:sndstep}
938:  \begin{tabular}{ll}
939:  \dbend & The correctness of this algorithm is only guaranteed \\
940:  & under the conditions of Theorem~\ref{theo:sndstep}. It does not \\
941:  & work with any $C_1, C_2$!
942:  \end{tabular}
943: 
944: 
945: 
946: \medskip
947: 
948: $$
949: \begin{array}{rcl}
950: u   & = & \circ(x - z C_1),\\
951: v_1 & = & \circ(u - z C_2),\\
952: (p_1,p_2) & = & \mbox{Fast2Mult}(z, C_2),\\
953: (t_1,t_2) & = & \mbox{Fast2Sum}(u, - p_1),\\
954: v_2 & = & \circ(\circ(\circ(t_1-v_1)+t_2)-p_2).
955: \end{array}
956: $$
957: \end{algorithm}
958: 
959: \medskip
960: 
961: \fbox{
962: \begin{minipage}{\fractheo\textwidth}
963: \begin{theorem}[{{\tt FArgReduct4.v file}}]\label{theo:sndstep}
964:   Assume
965:   \begin{itemize}
966:   \item $p > 4$,
967:   \item $x$ is a $p$-bit FPN,
968:   \item $R$ is a positive normal $p$-bit FPN,
969:   \item $C_1$ is the $(p-2)$-bit FPN obtained by rounding $1/R$ to $p-2$
970:     bits using round-to-nearest mode,
971:   \item $C_1$ it is not exactly a power of 2,
972: %
973: %
974:     \medskip
975:   \item $z=\left\{3 \cdot 2^{p-N-2}+ x R \right\}_{\fma} \ominus 3  \cdot 2^{p-N-2}$,
976:     \medskip
977:   \item $|x R| \le 2^{p-N-2}-2^{-N}$,
978:   \item $2^{-N}$ is a normal $p$-bit FPN,
979:   \item $C_1 \ge 2^{p+\max(-1,p+N-2)} \lambda$,
980:   \item $C_2$ is a FPN and an integer multiple of $8\ulp^{\circ 2}
981:     (C_1)$,
982:   \item $|C_2| \le 4 \ulp (C_1)$,
983:   \item $v_1$ and $v_2$ are computed using Algorithm~\ref{algo:sndstep}.
984:   \end{itemize}
985: Then Fast2Sum works correctly and we have the mathematical
986:   equality $v_1+v_2=x - z C_1- z C_2$ (all the computations of the
987:   last line indeed commit no rounding errors).
988: \end{theorem}
989: \end{minipage}
990: }
991: 
992: \medskip
993: 
994: The first requirements are very similar to the previous ones. The ``no
995: underflow'' bound on $C_1$ has been raised, but is still easily
996: achieved by real constants. For a typical $N$ between 0 and 10 used by the existing elementary
997: math libraries in IEEE double precision, it suffices that $C \ge 10^{-288}$.
998: %
999: %
1000: %
1001: %
1002: 
1003: The most important add-ons are the requirements on $C_2$: it must be
1004: much smaller than $C_1$ (it is near the difference between the
1005: constant $C$ and $C_1$). And $C_2$ must not be ``too precise''.
1006: %
1007: In fact, $C_1+C_2$
1008: will have $2p-4$ bits as shown in Figure~\ref{fig/C1C2}. If by
1009: chance, there are a lot of zeroes just after $C_1$, we cannot take
1010: advantage of that to get a more precise $C_2$. This is a real
1011: drawback, but it does not happen very often that many zeroes are just
1012: at the inconvenient place.
1013: 
1014: 
1015: \begin{figure}[h]
1016: \centerline{\scalebox{.7}{\input{\figdir/C1C2.pstex_t}}}
1017: \caption{Respective layouts of our $C_1$ and $C_2$ compared to optimal
1018: values.}
1019: \label{fig/C1C2}
1020: \end{figure}
1021: 
1022: 
1023: This algorithm may seem simple but it is a very powerful tool. It is
1024: {\it exact\/} and it
1025: %
1026: %
1027: %
1028: %
1029: %
1030: %
1031: %
1032: %
1033: is very {\it fast}: the generic algorithm \cite{BolMul05} costs 20 flops while this one costs only
1034: 9 flops! 
1035: More, the result is more usable than expected as it fits in only one
1036: float instead of two in the general case.
1037: 
1038: 
1039: 
1040: %
1041: %
1042: %
1043: 
1044: 
1045: As for the computation of $C_2$, the requirements are rather low:
1046: there are several $C_2$ fulfilling them. It may be useful to choose
1047: one of them in order to have the bigger or the smaller $C_2$
1048: possible. Algorithm~\ref{algo:C2} gives one way to compute a
1049: convenient $C_2$.
1050: 
1051: 
1052: The idea of the proof for Theorem~\ref{theo:sndstep} is a careful
1053: study of the possible exponents for the involved FPNs. We first prove that
1054: $x$ is an integer multiple of  $2^{-N} \ulp(C_1)$. 
1055: %
1056: %
1057: This is done for
1058: whether $z$ is $2^{-N}$ or not to guarantee the correctness of
1059: Fast2Sum.
1060: 
1061: We then prove that $t_1-v_1$ fits in a FPN. This proof is obtained by
1062: noticing that $t_1$ and $v_1$ are integer multiples of $2^{-N-1}
1063: \ulp^{\circ 2} (C_1)$ and that that $|t_1-v_1| < 2^{p-N-1} \ulp^{\circ
1064: 2} (C_1)$.
1065: 
1066: 
1067: The next step is about $t_1-v_1+t_2=u-p_1-v_1$ being a FPN. We do it
1068: similarly as all these quantities are also integer multiples of
1069: $2^{-N-1} \ulp^{\circ 2} (C_1)$ and as we easily have that
1070: $|t_1-v_1+t_2| < 2^{p-N-1} \ulp^{\circ 2} (C_1)$.
1071: 
1072: We finally prove that $t_1-v_1+t_2-p_2=u-z C_2 - v_1$ fits in a FPN.
1073: Its least significant non-zero bit is at most shifted $N$ times down
1074: compared to the least significant non-zero bit of $C_2$. For this
1075: reason, we require that $C_2$ is an integer multiple of $8 \ulp^{\circ
1076:   2} (C_1)$.
1077: 
1078: This proof needs a careful study of the relationships between the
1079: various floats and their exponent values. The formal proof and its
1080: genericity allowed us a better understanding of the respective layouts
1081: of the FPNs, that is the key of the correctness of
1082: Algorithm~\ref{algo:sndstep}.
1083: 
1084: \medskip
1085: 
1086: \begin{algorithm}[Computation of $C_2$]\label{algo:C2}
1087: Let $C$ be the exact constant (for example, $\pi$ or $\ln 2$).
1088: $$
1089: \begin{array}{rcl}
1090: R   & = & \circ_p(1/C),\\
1091: C_1 & = & \circ_{p-2}(1/R), %
1092: %
1093: %
1094: \end{array}
1095: $$
1096: and take $C_2$ to be the first many significand bits of $C-C_1$ so that
1097: its least non-zero bit must be greater than or equal to 
1098: $\log_2(\ulp(C_1))-p+4=\log_2\left(8\ulp^{\circ 2} (C_1)\right)$, e.g.,
1099: $$
1100: C_2 =\displaystyle \left\lceil \frac{(C - C_1)}{8\ulp^{\circ 2} (C_1)} \right\rfloor 8\ulp^{\circ 2} (C_1),
1101: $$
1102: where $\lceil\cdot\rfloor$ is one of the round-to-integer operations.
1103: %
1104: %
1105: %
1106: \end{algorithm}
1107: 
1108: \medskip
1109: 
1110: This $C_2$ has all the expected properties except that we do not know
1111: for sure if $|C_2| \le 4 \ulp (C_1)$. Note that $C_1$ is not gotten
1112: by directly rounding $C$ but rather
1113: $C_1=\circ\left(1/\circ\left(\frac{1}{C}\right)\right)$.
1114: 
1115: \medskip
1116: 
1117: \fbox{
1118: \begin{minipage}{\fractheo\textwidth}
1119: \begin{theorem}[{{\tt gamma2\_le}}]\label{theo:C2}
1120:   Assume
1121:   \begin{itemize}
1122:   \item $p > 3$,
1123:   \item $C$ is a real positive constant,
1124:   \item $R$ is the $p$-bit FPN obtained by rounding $1/C$ to $p$
1125:     bits using round-to-nearest mode,
1126:   \item $R$ is a positive normal $p$-bit FPN,
1127:   \item $C_1$ is the $(p-2)$-bit FPN obtained by rounding $1/R$ to $p-2$
1128:     bits using round-to-nearest mode,
1129: %
1130:   \item $C_1$ is not exactly a power of 2,
1131:   \item $C_1 \ge 2^{p-1} \lambda$.
1132:   \end{itemize}
1133: Then $|C - C_1| \le 4 \ulp (C_1)$.
1134: \end{theorem}
1135: \end{minipage}
1136: }
1137: 
1138: \medskip
1139: 
1140: As $C$ is not too far from $C_1$, we have that $C \le
1141:  2^{p+1} \ulp(C_1) $. We now bound $C-C_1$: \\
1142: $|C - C_1| \le |C- 1/R| + |1/R - C_1| \le
1143: \frac{C}{R} |R-1/C| +|1/R - C_1|$, so 
1144: $|C - C_1| \le
1145: \frac{C}{R} \ulp(R) / 2 + 4 \ulp(C_1) / 2 \le
1146: C 2^{-p-1} + 2 \ulp (C_1)$, hence the result.
1147: 
1148: \medskip
1149: 
1150: This means that the formula for $C_2$ given above yields a FPN
1151: fulfilling the requirements of Theorem~\ref{theo:sndstep}.
1152: 
1153: %
1154: %
1155: %
1156: %
1157: 
1158: 
1159: 
1160: %
1161: %
1162: %
1163: %
1164: %
1165: %
1166: %
1167: %
1168: 
1169: 
1170: \section{Conclusions}\label{sec:concl}
1171: 
1172: We have presented Coq verified theorems that prove the correctness
1173: and effectiveness of a much faster technique based on the commonly
1174: used argument reduction in elementary function computations, on
1175: machines that have hardware support for fused-multiply-add
1176: instructions. The conditions of these theorems are easily met as our
1177: analysis indicates. While we have showed it is not always possible to use
1178: the most accurate parameters under all circumstances, an almost best
1179: possible selection can be used at all times: to zero out the last 2 bits.
1180: 
1181: %
1182: 
1183: %
1184: %
1185: %
1186: %
1187: %
1188: %
1189: %
1190: %
1191: %
1192: %
1193: %
1194: %
1195: %
1196: %
1197: %
1198: 
1199: %
1200: 
1201: %
1202: %
1203: %
1204: %
1205: %
1206: %
1207: %
1208: %
1209: %
1210: %
1211: %
1212: %
1213: %
1214: %
1215: %
1216: %
1217: %
1218: %
1219: %
1220: %
1221: %
1222: %
1223: %
1224: %
1225: %
1226: %
1227: %
1228: %
1229: %
1230: %
1231: %
1232: %
1233: %
1234: %
1235: %
1236: %
1237: %
1238: %
1239: %
1240: %
1241: %
1242: %
1243: 
1244: We have presented also a very accurate second step argument reduction. We
1245: provide a way to compute $C_2$ which is not the most precise possible,
1246: but is usually 2 bits away from it (and can be rounded as needed by the
1247: programmer). The most interesting part is the possibility to compute
1248: with FPNs the exact error of the second step of the argument reduction
1249: and the fact that this error is exactly representable by only one FPN. It makes the third
1250: step unexpectedly easy as we have a mathematical equality between the
1251: computed FPNs and a very good approximation of $x-z C$ (with a known
1252: error).
1253: 
1254: Except for the computation of $C_2$, all the rounding used should be
1255: rounding to nearest, ties to even. But our proofs are generic enough
1256: to show that our results still hold when using rounding to nearest,
1257: where cases of ties can be decided in any coherent way
1258: \cite{ReiKnu75}. This includes rounding to nearest, ties away from
1259: zero that is found in the revision of the IEEE-754 standard.
1260: 
1261: %
1262: %
1263: %
1264: %
1265: %
1266: 
1267: %
1268: %
1269: %
1270: %
1271: %
1272: %
1273: %
1274: %
1275: 
1276: The formal verification forces us to provide many tedious details in
1277: the proofs but gives us a guarantee on our results.  The
1278: proposed theorems are sufficient in the sense that effective
1279: parameters for efficient argument reductions can be obtained without
1280: any difficulty.
1281: 
1282: Our theorems provides us with sufficient conditions for $x-z C_1$ to be a FPN. This means that $x-z C_1$ could be a FPN  even when one or more of the
1283: conditions fails for some specific values of $C$, $C_1$ and $R$ as
1284: published in the past \cite{StoTan99,Mar2K}.  We may work on
1285: this in the future even though there is only a limited space for
1286: improvement as only the last two bits of $C_1$ can be changed to make
1287: the constant more accurate.
1288: 
1289: The algorithms proved can be applied to any
1290: floating-point format (IEEE single, double or extended for
1291: example). Intuitively, the correctness of these algorithms should come
1292: as natural. Nevertheless, rigorous proofs are not trivial due to a few special
1293: cases that could  have been easily dismissed by hand-waving proofs.
1294: 
1295: 
1296: 
1297: 
1298: 
1299: \bibliographystyle{IEEEtran}
1300: \bibliography{daumas_ref,toto}
1301: 
1302: \appendix
1303: 
1304: %
1305: 
1306: Theorem~\ref{theo:correct1} can be used for any value of $2 \le q \le
1307: p-1$. In most case, users are interested for the smallest possible
1308: value of $q$ because that will give a more accurate $C_1$ and consequently a more accurate reduced argument. For this reason, we proved Theorem~\ref{theo:red3} for $q=2$.
1309: The following theorem is under the hypothesis that
1310: $$
1311: R C_1 \le 1,
1312: $$
1313: while $2 \le q \le p-1$ still. This add-on is enough to guarantee cases that are left over by
1314: Theorem~\ref{theo:correct1}.
1315: 
1316: 
1317: \medskip
1318: 
1319: \fbox{
1320: \begin{minipage}{\fractheo\textwidth}
1321: \begin{theorem}[{{\tt Fmac\_arg\_reduct\_correct2}}]
1322:   Assume
1323:   \begin{itemize}
1324:   \item $p > 3$,
1325:   \item $2 \le q < p-1$,
1326:   \item $x$ is $p$-bit FPN,
1327:   \item $R$ is a positive normal $p$-bit FPN,
1328:   \item $C_1$ is  the $(p-q)$-bit  FPN obtained by  rounding
1329:     $1/R$ to $p-q$ bits using round-to-nearest mode,
1330:   \item $C_1$ is not exactly a power of 2,
1331:   \item $C_1 \ge 2^{p-q+\max(1,N-1)} \lambda$,
1332:     \medskip
1333:   \item $z=\left\{3 \cdot 2^{p-N-2}+ x R \right\}_{\fma} \ominus 3  \cdot 2^{p-N-2}$,
1334:     \medskip
1335:   \item $2^{-N}$ is a FPN,
1336:   \item $|x R| \le 2^{p-N-2}-2^{-N}$,
1337:   \item $R C_1 \le 1$.
1338:   \end{itemize}
1339:   Then $x - z C_1$ is a $p$-bit FPN.
1340: \end{theorem}
1341: \end{minipage}
1342: }
1343: 
1344: \medskip
1345: 
1346: We essentially need to consider how to make $R$ and $C_1$ satisfy
1347: this new constraint.  Since there is no strict connection between
1348: $R$, $C_1$ on one hand and $C$ on the other hand, we can either use
1349: $R$ to be the correctly rounded FPN nearest $1/C$ or we may
1350: alternatively add or subtract one or a few ulps so that
1351: the additional inequality is met.
1352: 
1353: %
1354: %
1355: %
1356: %
1357: %
1358: %
1359: %
1360: %
1361: %
1362: %
1363: 
1364: %
1365: %
1366: %
1367: %
1368: %
1369: %
1370: %
1371: %
1372: %
1373: %
1374: %
1375: %
1376: %
1377: %
1378: %
1379: %
1380: %
1381: %
1382: %
1383: %
1384: %
1385: %
1386: %
1387: %
1388: %
1389: %
1390: %
1391: %
1392: 
1393: %
1394: %
1395: %
1396: %
1397: %
1398: %
1399: %
1400: %
1401: %
1402: %
1403: %
1404: %
1405: %
1406: %
1407: %
1408: %
1409: %
1410: %
1411: %
1412: %
1413: %
1414: %
1415: %
1416: %
1417: %
1418: %
1419: %
1420: %
1421: 
1422: \end{document}
1423: