1: \documentclass{rnc6}
2: \usepackage[latin1]{inputenc}
3: \usepackage{aeguill}
4: \usepackage{comment}
5: \usepackage{xspace}
6:
7: \makeatletter
8:
9: %\iftrue % Use RCSID with temporary versions.
10: \iffalse % For the final version.
11:
12: % RCSID
13: \def\RCSID{%
14: \begingroup
15: \catcode`\_=\active
16: \@rcsid
17: }
18: \def\@rcsid $#1${%
19: \let\protect\noexpand
20: \xdef\@oddfoot{%
21: \reset@font
22: \protect\framebox[\textwidth]{#1-- page \protect\thepage}%
23: }%
24: \endgroup
25: \global\let\@evenfoot\@oddfoot
26: }
27: \RCSID $Id: rnc6.tex 4422 2004-08-30 12:13:17Z lefevre $
28:
29: \def\ps@copyright{}
30:
31: \fi
32:
33: \def\url{%
34: \begingroup
35: \catcode`\~=\active
36: \catcode`:=12
37: \def~{\raisebox{-.8ex}{\textasciitilde}}%
38: \@url}
39: \def\@url#1{\leavevmode
40: \hskip 0pt plus 50pt \penalty 1000 \hskip 0pt plus -50pt
41: \mbox#1\endgroup}
42: % Thanks to Donald Arseneau for the linebreak trick.
43: % See thread <20010409124436$5a00@vinc17.org>.
44:
45: \def\footurl{%
46: \begingroup
47: \catcode`\~=\active
48: \catcode`:=12
49: \def~{\raisebox{-.8ex}{\textasciitilde}}%
50: \@footurl}
51: \def\@footurl#1{\footnote{#1}\endgroup}
52:
53: \makeatother
54:
55: \newcommand{\rnd}{\ensuremath{\diamond}\xspace}
56: \newcommand{\err}{\ensuremath{\varepsilon}\xspace}
57:
58: \begin{document}
59: \sloppy
60: \binoppenalty=10000
61: \relpenalty=10000
62: \doublehyphendemerits 40000
63:
64: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
65:
66: \begin{frontmatter}
67:
68: \title{The Generic Multiple-Precision Floating-Point Addition
69: With Exact Rounding (as in the MPFR Library)}
70:
71: \author{Vincent Lefèvre}
72:
73: \ead{Vincent.Lefevre@loria.fr}
74: \ead[url]{http://www.vinc17.org/research/}
75:
76: \address{INRIA Lorraine, 615 rue du Jardin Botanique,
77: 54602 Villers-lès-Nancy Cedex, France}
78:
79: \begin{abstract}
80: We study the multiple-precision addition of two positive floating-point
81: numbers in base $2$, with exact rounding, as specified in the MPFR
82: library, i.e. where each number has its own precision. We show how
83: the best possible complexity (up to a constant factor that depends
84: on the implementation) can be obtain.
85: \end{abstract}
86:
87: \begin{keyword}
88: multiple precision \sep floating point \sep addition \sep exact rounding
89: \end{keyword}
90:
91: \end{frontmatter}
92:
93: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
94:
95: \section{Introduction}
96: \label{intro}
97:
98: In this paper, we consider the multiple-precision floating-point
99: addition with exact rounding, as specified in the MPFR
100: library\footurl{http://www.mpfr.org/}: the inputs are two (binary)
101: floating-point numbers $x$ and $y$ of precision $m \geq 2$ and
102: $n \geq 2$, a target precision $p \geq 2$ and a rounding mode \rnd,
103: and the output is $\rnd(x + y)$, i.e. the exact value $x + y$ rounded
104: to the target precision in the given rounding mode, and a ternary value
105: giving the sign of $\rnd(x + y) - (x + y)$.
106:
107: By ``addition'', we mean here the ``true'' addition, that is $x + y$
108: where $x$ and $y$ have the same sign, and $x - y$ where $x$ and $y$
109: have opposite signs. For the sake of simplicity, we restrict to the
110: addition of positive values for $x$ and $y$ in the following of the
111: paper. In fact, this is how it is implemented in MPFR: indeed, the
112: addition and subtraction functions call an auxiliary function, ignoring
113: the signs of the input numbers (they are regarded as positive).
114:
115: The addition seems to be a very simple function to implement, as
116: being a basic function, easy to understand. This is unfortunately
117: not true, in particular under the MPFR specifications (where the
118: inputs and the output may have different precisions and the output
119: must be the exactly rounded result), because many different cases
120: need to be considered, and it is easy to forget one, both in the
121: implementation of the addition and in the tests. For a long time,
122: the MPFR addition had been buggy (more precisely, some rare special
123: cases were not handled correctly) and inefficient (also in some
124: rare special cases, for which the time complexity was exponential).
125: I completely rewrote the addition in October 2001. The presentation
126: given in this paper is more or less based on the same ideas as the
127: ones used in the new MPFR implementation, with some non-theoretical
128: differences, mainly due to some MPFR internals.
129:
130: First, the floating-point system is introduced in Section~\ref{fp}.
131: The main computation steps (from which the algorithm can be
132: deduced) and the complexity are presented in Section~\ref{comp}.
133: Section~\ref{impl} deals with the MPFR implementation. We finally
134: conclude in Section~\ref{concl}.
135:
136: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
137:
138: \section{The Floating-Point System}
139: \label{fp}
140:
141: \subsection{The Floating-Point Representation}
142: \label{fp-repr}
143:
144: We consider a floating-point system in base $2$. The results presented
145: in this paper can naturally be extended to other fixed even bases, but
146: we choose the base $2$ (this is the base of the floating-point system
147: in MPFR) to make the notations easier to understand.
148:
149: In our system, an object may contain a special value, like NaN (not
150: a number) or an infinity, zero (possibly signed, as in the IEEE-754
151: standard\cite{IEEE754} and in MPFR), or an non-zero real number that
152: can be written:
153: \[s \times 0.b_1 b_2 b_3 \ldots b_p \times 2^e,\]
154: where $s = \pm 1$ is the \emph{sign}, the $b_i$'s are binary digits
155: ($0$ or $1$) forming the \emph{mantissa}, $e$ is the \emph{exponent}
156: (a bounded integer\footnote{In MPFR, the exponent is between $1 - 2^{30}$
157: and $2^{30} - 1$.}), and $p$ is an integer greater or equal to $1$,
158: called the \emph{precision}. In MPFR, the precision $p$ is not fixed;
159: it is attached to each object. Non-zero real values are normalized,
160: i.e.\ $b_1 \ne 0$, that is in base $2$, $b_1 = 1$. The system does not
161: have subnormals (i.e. numbers with $b_1 = 0$) as they are not really
162: useful with a huge exponent range (like in MPFR) and would make the
163: algorithms and the code much more complex.
164:
165: In the following of the paper, we consider only positive input numbers
166: (as said in the introduction), i.e. numbers that can be written:
167: $0.1 b_2 b_3 \ldots b_p \times 2^e$.
168:
169: \subsection{Rounding}
170: \label{fp-round}
171:
172: When adding two positive numbers $x$ and $y$ of respective precisions
173: $m$ and $n$, the result is not necessarily representable in the target
174: precision $p$; it must be rounded, according to one of the rounding
175: modes chosen by the user, similar to the IEEE-754 rounding modes:
176:
177: \begin{itemize}
178:
179: \item rounding to minus infinity (downwards): we return the largest
180: floating-point number in precision $p$ that is less or equal to $x + y$;
181:
182: \item rounding to plus infinity (upwards): we return the smallest
183: floating-point number in precision $p$ that is greater or equal to $x + y$;
184:
185: \item rounding towards zero: we round downwards, since $x + y > 0$;
186:
187: \item rounding to the nearest: we return the floating-point number
188: in precision $p$ that is the closest to $x + y$. Halfway cases are
189: specified by the implementation; if $p \geq 2$ (this is required
190: by MPFR), then we can choose the round-to-even rule, like in the
191: IEEE-754 standard and in MPFR: we return the only number that has
192: an even mantissa, i.e. with $b_p = 0$.
193:
194: \end{itemize}
195:
196: Note that the returned result must be the rounding of the exact result;
197: this requirement is called \emph{correct} or \emph{exact rounding}.
198:
199: In addition to the rounded result, a ternary value is returned, giving
200: the sign of $\rnd(x + y) - (x + y)$, where \rnd denotes the chosen
201: rounding mode: a positive number means that the rounded result is
202: greater or equal to the exact result, a negative number means that
203: the rounded result is less or equal to the exact result, and $0$
204: means that the returned result is the exact result.
205:
206: Moreover, it is possible that the exponent of the rounded result is
207: not in the exponent range, in which case an overflow is generated.
208: This case does lot lead to any practical or theoretical difficulty
209: and is beyond the scope of this paper.
210:
211: How the exact result of a canonical infinite mantissa $0.1 b_2 b_3 \ldots$
212: (where the number of zero bits is infinite) is rounded can be expressed
213: as a function of the bit $r = b_{p+1}$ following the truncated $p$-bit
214: mantissa, called the \emph{rounding bit}, and $s = b_{p+2} \vee b_{p+3}
215: \vee \ldots$, called the \emph{sticky bit}, as summarized in
216: Table~\ref{fig:rs} (we recall that the result is positive).
217:
218: \begin{table}[htbp]
219: \begin{center}
220: \begin{tabular}{|c|c|c|c|}
221: \hline
222: $r$ / $s$ & downwards & upwards & to the nearest \\
223: \hline
224: $0$ / $0$ & exact & exact & exact \\
225: \hline
226: $0$ / $1$ & $-$ & $+$ & $-$ \\
227: \hline
228: $1$ / $0$ & $-$ & $+$ & $-$ / $+$ \\
229: \hline
230: $1$ / $1$ & $-$ & $+$ & $+$ \\
231: \hline
232: \end{tabular}
233: \end{center}
234: \caption{A $-$ in the table means that the mantissa of the exactly
235: rounded result is $0.1 b_2 b_3 \ldots b_p$, i.e. the truncated
236: exact mantissa. A $+$ in the table means that one needs to add
237: $2^{-p}$ to the truncated mantissa (leading to an exponent change
238: if all the $b_i$'s up to $b_p$ are $1$). The $-$~/~$+$ corresponds
239: to the halfway cases, and the round-to-even rule is applied, that
240: is: $-$ if $b_p = 0$, $+$ if $b_p = 1$.}
241: \label{fig:rs}
242: \end{table}
243:
244: Note: We did not mention the ternary value, as it can easily be deduced
245: from Table~\ref{fig:rs} (telling how the mantissa is rounded). Also,
246: like the rounding modes towards $-\infty$ and towards $+\infty$, the
247: returned ternary value needs to be negated if the result is negative
248: (not considered in this paper).
249:
250: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
251:
252: \section{The Main Computation Steps and the Complexity}
253: \label{comp}
254:
255: We still denote the precisions of the input numbers $x$ and $y$ and
256: the result by $m$, $n$ and $p$ respectively.
257:
258: The addition of two positive floating-point numbers $x$ and $y$ of
259: respective exponents $e_x$ and $e_y$ consists in:
260: \begin{enumerate}
261: \item ordering $x$ and $y$ so that $e_x \geq e_y$,
262: \item computing the exponent difference $d = e_x - e_y$,
263: \item shifting the mantissa of $y$ by $d$ positions to the right,
264: \item adding the mantissa of $x$ and the shifted mantissa of $y$
265: and rounding the result (shifting it by $1$ position to the right
266: if there is a carry),
267: \item computing the exponent of the result: $e_x$ or $e_x + 1$ if
268: there is a carry.
269: \end{enumerate}
270:
271: This method is very inefficient if many trailing bits of $x$ or $y$
272: (possibly all the bits of $y$) do not have any influence on the
273: result, for instance:
274: \[0.101010000010010001 + 0.10001 \times 2^{-9}\]
275: rounded to $4$ bits. The exactly rounded result and the ternary value
276: can be deduced from only the first $6$ bits $101010$ of $x$ (and none
277: for $y$), knowing the fact that its first mantissa bit is always $1$.
278:
279: So, we are interested in taking into account as few input bits as
280: possible (the possible \emph{hole} between the least significant bit
281: of $x$ and the most significant bit of $y$ must also be detected). We
282: do not have any particular knowledge about the input numbers $x$ and
283: $y$ (and the result); we assume that the mantissa bits are $0$ and $1$
284: with equal probabilities after some given position and that $x$ and
285: $y$ are independent numbers. Of course, this is not necessarily a good
286: assumption, but this will be discussed when it has an importance.
287:
288: The addition can be written $x + y = t + \err$, where $t$ is the
289: \emph{main term}, computed with the first $p+2$ bits of $x$ and
290: the corresponding $\max(p+2-d,0)$ bits of $y$, and \err is the
291: \emph{error term}, satisfying $0 \leq \err < 2^{e_x - p - 1}$.
292: This can graphically be represented by:
293: \[
294: \begin{array}{r@{}l}
295: \fbox{\begin{minipage}{20mm}\centerline{$t$}\end{minipage}} & \\[2mm]
296: \fbox{\begin{minipage}{20mm}\centerline{$x'$}\end{minipage}} &
297: \fbox{\begin{minipage}{12mm}\centerline{$x''$}\end{minipage}} \\[1mm]
298: \fbox{\begin{minipage}{16mm}\centerline{$y'$}\end{minipage}} &
299: \fbox{\begin{minipage}{24mm}\centerline{$y''$}\end{minipage}} \\
300: \end{array}
301: \]
302: where $x''$ may be empty and either $y'$ or $y''$ may be empty.
303:
304: The \emph{main term} $t$ is computed and written in time $\Theta(p)$;
305: indeed, an $\Omega(p)$ time is necessary to fill the $p+2$ bits, and
306: a linear
307: time is obviously sufficient. There are many ways to deal with all the
308: different cases (the mantissas of $x$ and $y$ may completely overlap,
309: partially overlap in numerous ways, or even not overlap at all, and
310: some parts of the result may need to be filled with zeros); a carry
311: detection can also be performed by looking at the most significant
312: bits of $x$ and $y$ first. More will be said in Section~\ref{impl},
313: about the implementation in MPFR. However this is not an important
314: point here, as long as the complexity is in $\Theta(p)$.
315:
316: The \emph{error term} allows to obtain the truncated mantissa, the
317: rounding bit and the sticky bit (Section~\ref{fp-round}). First, if
318: the computation of the main term has lead to a carry, then $p+3$ bits
319: of the result have really been computed. This case can be regarded
320: as if there were no carry and the first iteration of the processing
321: described below were already performed (then, this is only a matter
322: of implementation). So, for the sake of simplicity, let us consider
323: that $p+2$ bits of the result have been computed, let $u$ denote the
324: \emph{weight}\footnote{This is the corresponding power of $2$; for
325: instance, the weight of the bit $1$ in $0.001$ is $2^{-3}$.} of the
326: bit $p+2$ (so, $0 \leq \err < 2u$), and let $f$ denote
327: its value ($0$ or $1$)\footnote{If a carry was generated, consider
328: only the first $p+2$ bits of the result in $t$, and the bit $p+3$ is
329: taken into account in \err.}, that we call the \emph{following} bit.
330: Table~\ref{fig:fe} gives the rounding bit $r$ and the sticky bit $s$
331: as a function of the following bit $f$ and the error $\err$.
332:
333: \begin{table}[htbp]
334: \begin{center}
335: \begin{tabular}{|c|c||c|c|}
336: \hline
337: $f$ & $\err$ & $r$ & $s$ \\
338: \hline
339: $0$ & $\err = 0$ & $=$ & $0$ \\
340: \hline
341: $0$ & $\err > 0$ & $=$ & $1$ \\
342: \hline
343: $1$ & $\err < u$ & $=$ & $1$ \\
344: \hline
345: $1$ & $\err = u$ & $+$ & $0$ \\
346: \hline
347: $1$ & $\err > u$ & $+$ & $1$ \\
348: \hline
349: \end{tabular}
350: \end{center}
351: \caption{For $r$, an $=$ means that the rounding bit is the bit $p+1$
352: of the temporary result $t$, and a $+$ means that $1$ must be added to
353: the bit $p+1$ of $t$ (and the carry must propagate).}
354: \label{fig:fe}
355: \end{table}
356:
357: Combining Tables \ref{fig:rs} and \ref{fig:fe}, we get Table~\ref{fig:rfe}.
358: Now we may need to determine the sign of $\err - fu$ (depending on the
359: cases given by Table~\ref{fig:rfe}). This is done with an iteration over
360: the remaining bits of $x$ and $y$.
361:
362: \begin{table}[htbp]
363: \begin{center}
364: \begin{tabular}{|c|c|c||c|c||c|c|c|}
365: \hline
366: $r_t$ & $f$ & $\err$ & $r$ & $s$ & downwards & upwards & to the nearest \\
367: \hline
368: $0$ & $0$ & $\err = 0$ & $0$ & $0$ & exact & exact & exact \\
369: \hline
370: $0$ & $0$ & $\err > 0$ & $0$ & $1$ & $-$ & $+$ & $-$ \\
371: \hline
372: $0$ & $1$ & $\err < u$ & $0$ & $1$ & $-$ & $+$ & $-$ \\
373: \hline
374: $0$ & $1$ & $\err = u$ & $1$ & $0$ & $-$ & $+$ & $-$ / $+$ \\
375: \hline
376: $0$ & $1$ & $\err > u$ & $1$ & $1$ & $-$ & $+$ & $+$ \\
377: \hline
378: $1$ & $0$ & $\err = 0$ & $1$ & $0$ & $-$ & $+$ & $-$ / $+$ \\
379: \hline
380: $1$ & $0$ & $\err > 0$ & $1$ & $1$ & $-$ & $+$ & $+$ \\
381: \hline
382: $1$ & $1$ & $\err < u$ & $1$ & $1$ & $-$ & $+$ & $+$ \\
383: \hline
384: $1$ & $1$ & $\err = u$ & $0$ & $0$ & exact & exact & exact \\
385: \hline
386: $1$ & $1$ & $\err > u$ & $0$ & $1$ & $-$ & $+$ & $-$ \\
387: \hline
388: \end{tabular}
389: \end{center}
390: \caption{The first three columns give all the possible cases for the
391: rounding bit of the main term, the following bit $f$ and the error $\err$.
392: The next two columns give the corresponding values of the rounding bit $r$
393: and the sticky bit $s$ (once the error has been taken into account).
394: The last three columns give information for the rounded result and the
395: ternary value; in the last two cases (lines), a carry is added to the
396: mantissa before the rounding (and this may lead to an exponent change,
397: but has no effect on how the rounding is performed --- implementations
398: must just take care that the ulp is different when rounding upwards).}
399: \label{fig:rfe}
400: \end{table}
401:
402: \begin{itemize}
403:
404: \item If $f = 0$, we need to distinguish the cases $\err = 0$ and $\err > 0$.
405: We have: $\err > 0$ if and only if at least a trailing bit (of $x$ or $y$)
406: is $1$. In particular, if $y < u$, then the most significant bit of $y$
407: (always $1$) is a trailing bit (said otherwise, $\err \geq y$); so, in this
408: case, $\err > 0$. Otherwise, one needs to test the trailing bits, the one
409: after the other until a $1$ is found, and in the worst case ($\err = 0$),
410: all the trailing bits need to be tested. As a consequence, the worst-case
411: complexity in the case $f = 0$ is in $\Theta(m+n+p)$.
412:
413: Is there a best order to test the trailing bits? Under the condition
414: that we do not have any particular knowledge on the input numbers,
415: there is no best order\footnote{One can argue that the real numbers
416: naturally are logarithmically distributed, so that the probability
417: to have a $0$ at position $i$ is higher than the one to have a $1$,
418: and the difference decreases as a function of $i$ \cite{FelGoo1976a}.
419: Therefore, in a very theoretical point of view, if the time of each
420: test is seen as a constant, it would be better to start by the least
421: significant bits! Of course, since the probabilities get very close
422: to $1/2$ very quickly, one would not see any difference in practice.}.
423: One can choose one of the following two possibilities, for instance:
424: \begin{itemize}
425: \item Test the trailing bits of $x$, then the trailing bits of $y$ (or
426: the other way round) until a non-zero bit is found.
427: \item Test trailing bits of both $x$ and $y$ at the same time. This may
428: be an interesting choice as some numbers tend to have an exact mantissa
429: with few non-zero digits (like small integers), thus many trailing zeros.
430: Testing trailing bits of $x$ and $y$ concurrently may allow to avoid
431: such difficult cases.
432: \end{itemize}
433:
434: \item If $f = 1$, we need to distinguish the cases $\err < u$, $\err = u$
435: and $\err > u$. Let $d$ denote the exponent difference so that the bits
436: $x_i$ and $y_{i-d}$ have the same weight ($d$ is the shift count to align
437: the mantissas). Let $q$ be the first integer such that the trailing bits
438: $x_q$ and $y_{q-d}$ are equal (when a bit is not represented, it is $0$).
439: If these bits are $0$'s, then $\err < u$. Otherwise (i.e. if these bits
440: are $1$'s), $\err \geq u$. The equality $\err = u$ can be decided as in
441: the case $f = 0$ ($\err > u$ if and only if at least one the untested
442: bits is $1$).
443:
444: The best way is to start with bit $p+3$ and loop over the
445: increasing positions, until $q$ is found (and more if one has to decide
446: if $\err = u$). If the $x$ mantissa or the $y$ mantissa has entirely
447: been read and $q$ has not been found yet, then it is not necessary to
448: go further, as we can deduce that $\err < u$; in other words, there
449: is no possible carry with bits from only one mantissa.
450:
451: Up to the position $q-1$, we have $x_i + y_{i-d} = 1$, as one of the bits
452: is $0$ and the other is $1$. When $x$ and $y$ overlap, it is necessary to
453: test at least one bit (as one of these two bits has an importance in the
454: sign of $\err - u$). Concerning the other bits, the result can be deduced
455: from only one test (like in the case $f = 0$), but an oracle would be
456: needed, and it is not possible to do better without any particular
457: knowledge. Here is an example: let us considered $x = 0.101111100101$
458: ($12$-bit precision), $y = 0.11010 \times 2^{-7}$ ($5$-bit precision),
459: and a $2$-bit target precision. The mantissas are aligned in the following
460: way:
461: \[
462: \begin{array}{cr}
463: & 0.101111100101 \\
464: + & 0.11010 \\
465: \end{array}
466: \]
467: Though on this particular example, testing the last five bits $0$ is
468: sufficient to deduce that the exact result is less than $0.11$, all the
469: bits are tested from the left to the right. Moreover, if either $x$ or $y$
470: (but not both) had more bits, e.g.\ $y = 0.11010111001 \times 2^{-7}$,
471: then testing these bits would not be necessary as they cannot generate
472: a carry to reach $0.11$; however, if $y = 0.110110000 \times 2^{-7}$,
473: testing the following four bits would be necessary to deduce that the
474: result is $0.11$ exactly.
475:
476: \end{itemize}
477:
478: In the case $f = 1$, the loop is performed over the increasing positions
479: from the bit $p+3$, grabbing the bit of $x$ and the bit of $y$ having the
480: same weight. The same loop can be performed in the case $f = 0$, though
481: this is not the only solution as said above. Of course, special cases
482: must be taken into account: the $x$ mantissa does not necessarily overlap
483: with the aligned $y$ mantissa (as the most significant bit of $y$ may
484: come after some trailing bits of $x$, some trailing bits of $x$ may come
485: after the least significant bit of $y$, some trailing bits of $y$ may
486: come after the least significant bit of $x$), and the most significant
487: bit of $y$ may come after the least significant bit of $x$ (\emph{hole}
488: between the $x$ mantissa and the $y$ mantissa), any \emph{missing} bit
489: being regarded as $0$. At each iteration, the mantissa of the temporary
490: result has the form: $0.1 z_2 z_3 \ldots z_p r f f f \ldots f f f$ with
491: an error in the interval $[0,2)$ ulp\footnote{Unit in the Last Place:
492: here, the weight of the last bit $f$ of the temporary result.}. One
493: iterates as long as the bits after the (temporary) rounding bit are
494: identical. This basically corresponds to the Table Maker's Dilemma,
495: that occurs to exactly round any function (see \cite{LefMulTis1998b},
496: for instance).
497:
498: The time complexity is in $\Omega(p)$ and in $O(m+n+p)$. In the worst
499: case, it is in $\Theta(m+n+p)$. In average (if the bits are $0$ and
500: $1$ with equal probabilities and input numbers are independent), the
501: complexity is in $\Theta(p)$, as the probability to need to test $k$
502: trailing bits decreases exponentially (in $2^{-k}$).
503:
504: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
505:
506: \section{The MPFR Implementation}
507: \label{impl}
508:
509: In this section, we present the implementation of the addition in MPFR.
510: To keep the paper from being too technical, we do not give many details
511: (and the reader can read the source code, as MPFR is distributed under
512: the GNU Lesser General Public License). A complete proof would also be
513: very hard to read and check (unless it could be mechanically checked);
514: so, such a proof is not provided.
515:
516: Let us start with the representation of MPFR numbers. Non-special MPFR
517: numbers have a sign (accessed with C macros), a mantissa, an exponent
518: (some C integer) and a precision (also some C integer). The mantissa is
519: represented by an array of \emph{limbs}; a limb is an unsigned integer
520: (having 32 bits or 64 bits, depending on the C implementation), as defined
521: in the GMP library\footurl{http://www.swox.com/gmp/}, on which MPFR is
522: based. All the bits of a limb are used to represent bits of the mantissa
523: in the conventional binary representation\footnote{GMP now allows to use
524: some bits for the carries, called \emph{nail bits}. They are not supported
525: yet in MPFR. One should note that contrary to integer operations, redundancy
526: provided by nail bits would probably not be very interesting here due to
527: the discontinuity of the rounding function.}.
528: The mantissa is normalized, i.e. its most significant bit is always $1$.
529: Since the precision is not necessarily a multiple of the limb size, some
530: bits of the lowest mantissa limb are not significant and are always $0$
531: (except in temporary values).
532:
533: The computation steps presented in Section~\ref{comp} were bit based
534: (as this is more regular and easier to understand for a theoretical
535: analysis). But working on single bits in a software implementation
536: would not be very efficient. Base operations must be performed by
537: blocks; some limbs may still need to be split into two parts as the
538: $y$ mantissa must be aligned with the $x$ mantissa.
539:
540: In addition to the particular cases that arise in the bit-based case,
541: we need to distinguish the case where the exponent difference is a
542: multiple of the limb size and the other case, needing the $y$ mantissa
543: to be shifted (this is usually done on the fly). We also need to take
544: into account all the cases related to the block boundaries (for
545: instance, where the rounding bit lies in a limb).
546:
547: First, the main term is computed, but there are differences with the
548: bit-based version. As the array holding the target mantissa does not
549: necessarily have the room for $p+2$ bits and we want to avoid an
550: inefficient memory allocation for $p+2$ bits and copy, the temporary
551: rounding bit and the following bit are stored in C integer variables
552: \texttt{rb} and \texttt{fb} (determined on the fly, as soon as they
553: are known); in this way, then can also be handled more efficiently.
554: The second difference is all the bits of the target array
555: are used for this computation (in fact, this is more or less necessary,
556: as the low-level GMP functions do not perform any masking).
557:
558: For the main term, we want to add the most significant parts $x'$ and
559: $y'$. If $y$ does not overlap with the main term ($y'$ is empty), we
560: just copy $x'$ to the target array and zero the least significant limbs
561: of the target if the target has a greater precision than $x$. Now, let
562: us assume that $y$ overlaps with the main term. With GMP, we cannot
563: shift and add with a single operation; therefore these operations have
564: to be performed separately. First, with a GMP function, we copy the
565: most significant part of $y$, shifted if need be, to the target array
566: and we zero the limbs of the target that have not been touched: the
567: most and/or least significant limbs, if the exponent and the precision
568: of $y$ are small enough. Then, with another GMP function, we add the
569: most significant part of $x$ to the target. If a carry is generated,
570: we increment the exponent (unless we already had the maximum exponent,
571: in which case we generate an overflow) and shift the result to the right;
572: a bit is
573: lost due to the shift but it is either the rounding bit or a following
574: bit, and if necessary, the following bits are tested and the rounding
575: can be performed if they are not all equal.
576:
577: Then, the non-significant bits of the target are taken into account;
578: this occurs only if:
579: \begin{itemize}
580: \item the rounding bit is still unknown (otherwise these bits have
581: already been taken into account before the shift due to the carry,
582: as said above), and
583: \item there are non-significant bits.
584: \end{itemize}
585:
586: At this time, the rounding bit and the following bit may still be
587: unknown; in this case, they will be determined as soon as possible
588: from the trailing parts $x''$ and $y''$. The loops are performed
589: on the increasing positions (by blocks), as mentioned at the end
590: of Section~\ref{comp}; moreover the cases $f = 0$ and $f = 1$ are
591: considered together (and not separated as in Section~\ref{comp}).
592:
593: The iterations depend on the current status concerning $x$ and $y$.
594: Here are the different cases that may arise during the iterations:
595: \begin{itemize}
596: \item $x''$ has not entirely been read and $y''$ has not been read yet.
597: \item $x''$ and $y''$ overlap.
598: \item $x''$ has not entirely been read and $y''$ has entirely been read.
599: \item $x''$ has entirely been read and $y''$ has not been read yet.
600: \item $x''$ has entirely been read and $y''$ has not entirely been read.
601: \item $x''$ and $y''$ have entirely been read.
602: \end{itemize}
603:
604: In the overlapping case, at each iteration, a limb of $x$ and the
605: corresponding limb of $y$ (built from two different limbs if $y$ must
606: be shifted) are added. The possible carry is taken into account, and
607: the loop ends as soon as the result is $0$ (all its bits are $0$) for
608: $f = 0$ or the maximum limb value \verb|MP_LIMB_T_MAX| (all its bits
609: are $1$) for $f = 1$.
610:
611: We have focused on the differences coming from the computations by
612: blocks. The whole details may be found in the MPFR code.
613:
614: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
615:
616: \section{Conclusion}
617: \label{concl}
618:
619: We have presented the generic multiple-precision floating-point addition
620: with exact rounding, as specified in the MPFR library, first in a rather
621: theoretical point of view, then considering the current implementation
622: in MPFR. The theoretical analysis could give a more regular description
623: of the implementation, by ignoring the fact that bits are grouped
624: into words in a computer memory. It could help to improve the current
625: implementation (the fact that the cases $f = 0$ and $f = 1$ are
626: considered together is probably not a very good idea, though it
627: reduces the risk of forgetting particular cases).
628:
629: The subtraction could be dealt with in a similar manner, in future
630: work. This is a bit more complicated due to a possible cancellation
631: (when subtracting very close numbers).
632:
633: Full mechanically-checked proofs could also be considered, using the
634: theoretical analysis to define the main notions.
635:
636: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
637:
638: \begin{thebibliography}{1}
639:
640: \bibitem{IEEE754}
641: {IEEE} standard for binary floating-point arithmetic, Tech. Rep. {ANSI}-{IEEE}
642: Standard 754-1985, New York, approved March 21, 1985: {IEEE} Standards Board,
643: approved July 26, 1985: American National Standards Institute, 18 pages
644: (1985).
645:
646: \bibitem{FelGoo1976a}
647: A.~Feldstein, R.~Goodman, Convergence estimates for the distribution of
648: trailing digits, Journal of the ACM 23~(2) (1976) 287--297.
649: \url{http://portal.acm.org/citation.cfm?id=321948}
650:
651: \bibitem{LefMulTis1998b}
652: V.~Lefèvre, J.-M. Muller, A.~Tisserand, Towards correctly rounded
653: transcendentals, IEEE Transactions on Computers 47~(11) (1998) 1235--1243.
654: \url{http://perso.ens-lyon.fr/jean-michel.muller/Nov98.pdf}
655:
656: \end{thebibliography}
657:
658: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
659:
660: \end{document}
661: