0505:cs0505027/rnc6.tex

1: \documentclass{rnc6}

2: \usepackage[latin1]{inputenc}

3: \usepackage{aeguill}

4: \usepackage{comment}

5: \usepackage{xspace}

6:

7: \makeatletter

8:

9: %\iftrue  % Use RCSID with temporary versions.

10: \iffalse  % For the final version.

11:

12: % RCSID

13: \def\RCSID{%

14:   \begingroup

15:     \catcode`\_=\active

16:     \@rcsid

17: }

18: \def\@rcsid $#1${%

19:     \let\protect\noexpand

20:     \xdef\@oddfoot{%

21:       \reset@font

22:       \protect\framebox[\textwidth]{#1-- page \protect\thepage}%

23:     }%

24:   \endgroup

25:   \global\let\@evenfoot\@oddfoot

26: }

27: \RCSID $Id: rnc6.tex 4422 2004-08-30 12:13:17Z lefevre $

28:

29: \def\ps@copyright{}

30:

31: \fi

32:

33: \def\url{%

34:   \begingroup

35:   \catcode`\~=\active

36:   \catcode`:=12

37:   \def~{\raisebox{-.8ex}{\textasciitilde}}%

38:   \@url}

39: \def\@url#1{\leavevmode

40:   \hskip 0pt plus 50pt \penalty 1000 \hskip 0pt plus -50pt

41:   \mbox#1\endgroup}

42: % Thanks to Donald Arseneau for the linebreak trick.

43: % See thread <20010409124436$5a00@vinc17.org>.

44:

45: \def\footurl{%

46:   \begingroup

47:   \catcode`\~=\active

48:   \catcode`:=12

49:   \def~{\raisebox{-.8ex}{\textasciitilde}}%

50:   \@footurl}

51: \def\@footurl#1{\footnote{#1}\endgroup}

52:

53: \makeatother

54:

55: \newcommand{\rnd}{\ensuremath{\diamond}\xspace}

56: \newcommand{\err}{\ensuremath{\varepsilon}\xspace}

57:

58: \begin{document}

59: \sloppy

60: \binoppenalty=10000

61: \relpenalty=10000

62: \doublehyphendemerits 40000

63:

64: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

65:

66: \begin{frontmatter}

67:

68: \title{The Generic Multiple-Precision Floating-Point Addition

69: With Exact Rounding (as in the MPFR Library)}

70:

71: \author{Vincent Lef�vre}

72:

73: \ead{Vincent.Lefevre@loria.fr}

74: \ead[url]{http://www.vinc17.org/research/}

75:

76: \address{INRIA Lorraine, 615 rue du Jardin Botanique,

77: 54602 Villers-l�s-Nancy Cedex, France}

78:

79: \begin{abstract}

80: We study the multiple-precision addition of two positive floating-point

81: numbers in base $2$, with exact rounding, as specified in the MPFR

82: library, i.e. where each number has its own precision. We show how

83: the best possible complexity (up to a constant factor that depends

84: on the implementation) can be obtain.

85: \end{abstract}

86:

87: \begin{keyword}

88: multiple precision \sep floating point \sep addition \sep exact rounding

89: \end{keyword}

90:

91: \end{frontmatter}

92:

93: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

94:

95: \section{Introduction}

96: \label{intro}

97:

98: In this paper, we consider the multiple-precision floating-point

99: addition with exact rounding, as specified in the MPFR

100: library\footurl{http://www.mpfr.org/}: the inputs are two (binary)

101: floating-point numbers $x$ and $y$ of precision $m \geq 2$ and

102: $n \geq 2$, a target precision $p \geq 2$ and a rounding mode \rnd,

103: and the output is $\rnd(x + y)$, i.e. the exact value $x + y$ rounded

104: to the target precision in the given rounding mode, and a ternary value

105: giving the sign of $\rnd(x + y) - (x + y)$.

106:

107: By ``addition'', we mean here the ``true'' addition, that is $x + y$

108: where $x$ and $y$ have the same sign, and $x - y$ where $x$ and $y$

109: have opposite signs. For the sake of simplicity, we restrict to the

110: addition of positive values for $x$ and $y$ in the following of the

111: paper. In fact, this is how it is implemented in MPFR: indeed, the

112: addition and subtraction functions call an auxiliary function, ignoring

113: the signs of the input numbers (they are regarded as positive).

114:

115: The addition seems to be a very simple function to implement, as

116: being a basic function, easy to understand. This is unfortunately

117: not true, in particular under the MPFR specifications (where the

118: inputs and the output may have different precisions and the output

119: must be the exactly rounded result), because many different cases

120: need to be considered, and it is easy to forget one, both in the

121: implementation of the addition and in the tests. For a long time,

122: the MPFR addition had been buggy (more precisely, some rare special

123: cases were not handled correctly) and inefficient (also in some

124: rare special cases, for which the time complexity was exponential).

125: I completely rewrote the addition in October 2001. The presentation

126: given in this paper is more or less based on the same ideas as the

127: ones used in the new MPFR implementation, with some non-theoretical

128: differences, mainly due to some MPFR internals.

129:

130: First, the floating-point system is introduced in Section~\ref{fp}.

131: The main computation steps (from which the algorithm can be

132: deduced) and the complexity are presented in Section~\ref{comp}.

133: Section~\ref{impl} deals with the MPFR implementation. We finally

134: conclude in Section~\ref{concl}.

135:

136: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

137:

138: \section{The Floating-Point System}

139: \label{fp}

140:

141: \subsection{The Floating-Point Representation}

142: \label{fp-repr}

143:

144: We consider a floating-point system in base $2$. The results presented

145: in this paper can naturally be extended to other fixed even bases, but

146: we choose the base $2$ (this is the base of the floating-point system

147: in MPFR) to make the notations easier to understand.

148:

149: In our system, an object may contain a special value, like NaN (not

150: a number) or an infinity, zero (possibly signed, as in the IEEE-754

151: standard\cite{IEEE754} and in MPFR), or an non-zero real number that

152: can be written:

153: \[s \times 0.b_1 b_2 b_3 \ldots b_p \times 2^e,\]

154: where $s = \pm 1$ is the \emph{sign}, the $b_i$'s are binary digits

155: ($0$ or $1$) forming the \emph{mantissa}, $e$ is the \emph{exponent}

156: (a bounded integer\footnote{In MPFR, the exponent is between $1 - 2^{30}$

157: and $2^{30} - 1$.}), and $p$ is an integer greater or equal to $1$,

158: called the \emph{precision}. In MPFR, the precision $p$ is not fixed;

159: it is attached to each object. Non-zero real values are normalized,

160: i.e.\ $b_1 \ne 0$, that is in base $2$, $b_1 = 1$. The system does not

161: have subnormals (i.e. numbers with $b_1 = 0$) as they are not really

162: useful with a huge exponent range (like in MPFR) and would make the

163: algorithms and the code much more complex.

164:

165: In the following of the paper, we consider only positive input numbers

166: (as said in the introduction), i.e. numbers that can be written:

167: $0.1 b_2 b_3 \ldots b_p \times 2^e$.

168:

169: \subsection{Rounding}

170: \label{fp-round}

171:

172: When adding two positive numbers $x$ and $y$ of respective precisions

173: $m$ and $n$, the result is not necessarily representable in the target

174: precision $p$; it must be rounded, according to one of the rounding

175: modes chosen by the user, similar to the IEEE-754 rounding modes:

176:

177: \begin{itemize}

178:

179: \item rounding to minus infinity (downwards): we return the largest

180: floating-point number in precision $p$ that is less or equal to $x + y$;

181:

182: \item rounding to plus infinity (upwards): we return the smallest

183: floating-point number in precision $p$ that is greater or equal to $x + y$;

184:

185: \item rounding towards zero: we round downwards, since $x + y > 0$;

186:

187: \item rounding to the nearest: we return the floating-point number

188: in precision $p$ that is the closest to $x + y$. Halfway cases are

189: specified by the implementation; if $p \geq 2$ (this is required

190: by MPFR), then we can choose the round-to-even rule, like in the

191: IEEE-754 standard and in MPFR: we return the only number that has

192: an even mantissa, i.e. with $b_p = 0$.

193:

194: \end{itemize}

195:

196: Note that the returned result must be the rounding of the exact result;

197: this requirement is called \emph{correct} or \emph{exact rounding}.

198:

199: In addition to the rounded result, a ternary value is returned, giving

200: the sign of $\rnd(x + y) - (x + y)$, where \rnd denotes the chosen

201: rounding mode: a positive number means that the rounded result is

202: greater or equal to the exact result, a negative number means that

203: the rounded result is less or equal to the exact result, and $0$

204: means that the returned result is the exact result.

205:

206: Moreover, it is possible that the exponent of the rounded result is

207: not in the exponent range, in which case an overflow is generated.

208: This case does lot lead to any practical or theoretical difficulty

209: and is beyond the scope of this paper.

210:

211: How the exact result of a canonical infinite mantissa $0.1 b_2 b_3 \ldots$

212: (where the number of zero bits is infinite) is rounded can be expressed

213: as a function of the bit $r = b_{p+1}$ following the truncated $p$-bit

214: mantissa, called the \emph{rounding bit}, and $s = b_{p+2} \vee b_{p+3}

215: \vee \ldots$, called the \emph{sticky bit}, as summarized in

216: Table~\ref{fig:rs} (we recall that the result is positive).

217:

218: \begin{table}[htbp]

219: \begin{center}

220: \begin{tabular}{|c|c|c|c|}

221: \hline

222: $r$ / $s$ & downwards & upwards & to the nearest \\

223: \hline

224: $0$ / $0$ &   exact   &  exact  &     exact      \\

225: \hline

226: $0$ / $1$ &    $-$    &   $+$   &      $-$       \\

227: \hline

228: $1$ / $0$ &    $-$    &   $+$   &   $-$ / $+$    \\

229: \hline

230: $1$ / $1$ &    $-$    &   $+$   &      $+$       \\

231: \hline

232: \end{tabular}

233: \end{center}

234: \caption{A $-$ in the table means that the mantissa of the exactly

235: rounded result is $0.1 b_2 b_3 \ldots b_p$, i.e. the truncated

236: exact mantissa. A $+$ in the table means that one needs to add

237: $2^{-p}$ to the truncated mantissa (leading to an exponent change

238: if all the $b_i$'s up to $b_p$ are $1$). The $-$~/~$+$ corresponds

239: to the halfway cases, and the round-to-even rule is applied, that

240: is: $-$ if $b_p = 0$, $+$ if $b_p = 1$.}

241: \label{fig:rs}

242: \end{table}

243:

244: Note: We did not mention the ternary value, as it can easily be deduced

245: from Table~\ref{fig:rs} (telling how the mantissa is rounded). Also,

246: like the rounding modes towards $-\infty$ and towards $+\infty$, the

247: returned ternary value needs to be negated if the result is negative

248: (not considered in this paper).

249:

250: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

251:

252: \section{The Main Computation Steps and the Complexity}

253: \label{comp}

254:

255: We still denote the precisions of the input numbers $x$ and $y$ and

256: the result by $m$, $n$ and $p$ respectively.

257:

258: The addition of two positive floating-point numbers $x$ and $y$ of

259: respective exponents $e_x$ and $e_y$ consists in:

260: \begin{enumerate}

261: \item ordering $x$ and $y$ so that $e_x \geq e_y$,

262: \item computing the exponent difference $d = e_x - e_y$,

263: \item shifting the mantissa of $y$ by $d$ positions to the right,

264: \item adding the mantissa of $x$ and the shifted mantissa of $y$

265: and rounding the result (shifting it by $1$ position to the right

266: if there is a carry),

267: \item computing the exponent of the result: $e_x$ or $e_x + 1$ if

268: there is a carry.

269: \end{enumerate}

270:

271: This method is very inefficient if many trailing bits of $x$ or $y$

272: (possibly all the bits of $y$) do not have any influence on the

273: result, for instance:

274: \[0.101010000010010001 + 0.10001 \times 2^{-9}\]

275: rounded to $4$ bits. The exactly rounded result and the ternary value

276: can be deduced from only the first $6$ bits $101010$ of $x$ (and none

277: for $y$), knowing the fact that its first mantissa bit is always $1$.

278:

279: So, we are interested in taking into account as few input bits as

280: possible (the possible \emph{hole} between the least significant bit

281: of $x$ and the most significant bit of $y$ must also be detected). We

282: do not have any particular knowledge about the input numbers $x$ and

283: $y$ (and the result); we assume that the mantissa bits are $0$ and $1$

284: with equal probabilities after some given position and that $x$ and

285: $y$ are independent numbers. Of course, this is not necessarily a good

286: assumption, but this will be discussed when it has an importance.

287:

288: The addition can be written $x + y = t + \err$, where $t$ is the

289: \emph{main term}, computed with the first $p+2$ bits of $x$ and

290: the corresponding $\max(p+2-d,0)$ bits of $y$, and \err is the

291: \emph{error term}, satisfying $0 \leq \err < 2^{e_x - p - 1}$.

292: This can graphically be represented by:

293: \[

294: \begin{array}{r@{}l}

295: \fbox{\begin{minipage}{20mm}\centerline{$t$}\end{minipage}} & \\[2mm]

296: \fbox{\begin{minipage}{20mm}\centerline{$x'$}\end{minipage}} &

297: \fbox{\begin{minipage}{12mm}\centerline{$x''$}\end{minipage}} \\[1mm]

298: \fbox{\begin{minipage}{16mm}\centerline{$y'$}\end{minipage}} &

299: \fbox{\begin{minipage}{24mm}\centerline{$y''$}\end{minipage}} \\

300: \end{array}

301: \]

302: where $x''$ may be empty and either $y'$ or $y''$ may be empty.

303:

304: The \emph{main term} $t$ is computed and written in time $\Theta(p)$;

305: indeed, an $\Omega(p)$ time is necessary to fill the $p+2$ bits, and

306: a linear

307: time is obviously sufficient. There are many ways to deal with all the

308: different cases (the mantissas of $x$ and $y$ may completely overlap,

309: partially overlap in numerous ways, or even not overlap at all, and

310: some parts of the result may need to be filled with zeros); a carry

311: detection can also be performed by looking at the most significant

312: bits of $x$ and $y$ first. More will be said in Section~\ref{impl},

313: about the implementation in MPFR. However this is not an important

314: point here, as long as the complexity is in $\Theta(p)$.

315:

316: The \emph{error term} allows to obtain the truncated mantissa, the

317: rounding bit and the sticky bit (Section~\ref{fp-round}). First, if

318: the computation of the main term has lead to a carry, then $p+3$ bits

319: of the result have really been computed. This case can be regarded

320: as if there were no carry and the first iteration of the processing

321: described below were already performed (then, this is only a matter

322: of implementation). So, for the sake of simplicity, let us consider

323: that $p+2$ bits of the result have been computed, let $u$ denote the

324: \emph{weight}\footnote{This is the corresponding power of $2$; for

325: instance, the weight of the bit $1$ in $0.001$ is $2^{-3}$.} of the

326: bit $p+2$ (so, $0 \leq \err < 2u$), and let $f$ denote

327: its value ($0$ or $1$)\footnote{If a carry was generated, consider

328: only the first $p+2$ bits of the result in $t$, and the bit $p+3$ is

329: taken into account in \err.}, that we call the \emph{following} bit.

330: Table~\ref{fig:fe} gives the rounding bit $r$ and the sticky bit $s$

331: as a function of the following bit $f$ and the error $\err$.

332:

333: \begin{table}[htbp]

334: \begin{center}

335: \begin{tabular}{|c|c||c|c|}

336: \hline

337: $f$ &   $\err$   & $r$ & $s$ \\

338: \hline

339: $0$ & $\err = 0$ & $=$ & $0$ \\

340: \hline

341: $0$ & $\err > 0$ & $=$ & $1$ \\

342: \hline

343: $1$ & $\err < u$ & $=$ & $1$ \\

344: \hline

345: $1$ & $\err = u$ & $+$ & $0$ \\

346: \hline

347: $1$ & $\err > u$ & $+$ & $1$ \\

348: \hline

349: \end{tabular}

350: \end{center}

351: \caption{For $r$, an $=$ means that the rounding bit is the bit $p+1$

352: of the temporary result $t$, and a $+$ means that $1$ must be added to

353: the bit $p+1$ of $t$ (and the carry must propagate).}

354: \label{fig:fe}

355: \end{table}

356:

357: Combining Tables \ref{fig:rs} and \ref{fig:fe}, we get Table~\ref{fig:rfe}.

358: Now we may need to determine the sign of $\err - fu$ (depending on the

359: cases given by Table~\ref{fig:rfe}). This is done with an iteration over

360: the remaining bits of $x$ and $y$.

361:

362: \begin{table}[htbp]

363: \begin{center}

364: \begin{tabular}{|c|c|c||c|c||c|c|c|}

365: \hline

366: $r_t$ & $f$ &   $\err$   & $r$ & $s$ & downwards & upwards & to the nearest \\

367: \hline

368:  $0$  & $0$ & $\err = 0$ & $0$ & $0$ &   exact   &  exact  &     exact      \\

369: \hline

370:  $0$  & $0$ & $\err > 0$ & $0$ & $1$ &    $-$    &   $+$   &      $-$       \\

371: \hline

372:  $0$  & $1$ & $\err < u$ & $0$ & $1$ &    $-$    &   $+$   &      $-$       \\

373: \hline

374:  $0$  & $1$ & $\err = u$ & $1$ & $0$ &    $-$    &   $+$   &   $-$ / $+$    \\

375: \hline

376:  $0$  & $1$ & $\err > u$ & $1$ & $1$ &    $-$    &   $+$   &      $+$       \\

377: \hline

378:  $1$  & $0$ & $\err = 0$ & $1$ & $0$ &    $-$    &   $+$   &   $-$ / $+$    \\

379: \hline

380:  $1$  & $0$ & $\err > 0$ & $1$ & $1$ &    $-$    &   $+$   &      $+$       \\

381: \hline

382:  $1$  & $1$ & $\err < u$ & $1$ & $1$ &    $-$    &   $+$   &      $+$       \\

383: \hline

384:  $1$  & $1$ & $\err = u$ & $0$ & $0$ &   exact   &  exact  &     exact      \\

385: \hline

386:  $1$  & $1$ & $\err > u$ & $0$ & $1$ &    $-$    &   $+$   &      $-$       \\

387: \hline

388: \end{tabular}

389: \end{center}

390: \caption{The first three columns give all the possible cases for the

391: rounding bit of the main term, the following bit $f$ and the error $\err$.

392: The next two columns give the corresponding values of the rounding bit $r$

393: and the sticky bit $s$ (once the error has been taken into account).

394: The last three columns give information for the rounded result and the

395: ternary value; in the last two cases (lines), a carry is added to the

396: mantissa before the rounding (and this may lead to an exponent change,

397: but has no effect on how the rounding is performed --- implementations

398: must just take care that the ulp is different when rounding upwards).}

399: \label{fig:rfe}

400: \end{table}

401:

402: \begin{itemize}

403:

404: \item If $f = 0$, we need to distinguish the cases $\err = 0$ and $\err > 0$.

405: We have: $\err > 0$ if and only if at least a trailing bit (of $x$ or $y$)

406: is $1$. In particular, if $y < u$, then the most significant bit of $y$

407: (always $1$) is a trailing bit (said otherwise, $\err \geq y$); so, in this

408: case, $\err > 0$. Otherwise, one needs to test the trailing bits, the one

409: after the other until a $1$ is found, and in the worst case ($\err = 0$),

410: all the trailing bits need to be tested. As a consequence, the worst-case

411: complexity in the case $f = 0$ is in $\Theta(m+n+p)$.

412:

413: Is there a best order to test the trailing bits? Under the condition

414: that we do not have any particular knowledge on the input numbers,

415: there is no best order\footnote{One can argue that the real numbers

416: naturally are logarithmically distributed, so that the probability

417: to have a $0$ at position $i$ is higher than the one to have a $1$,

418: and the difference decreases as a function of $i$ \cite{FelGoo1976a}.

419: Therefore, in a very theoretical point of view, if the time of each

420: test is seen as a constant, it would be better to start by the least

421: significant bits! Of course, since the probabilities get very close

422: to $1/2$ very quickly, one would not see any difference in practice.}.

423: One can choose one of the following two possibilities, for instance:

424: \begin{itemize}

425: \item Test the trailing bits of $x$, then the trailing bits of $y$ (or

426: the other way round) until a non-zero bit is found.

427: \item Test trailing bits of both $x$ and $y$ at the same time. This may

428: be an interesting choice as some numbers tend to have an exact mantissa

429: with few non-zero digits (like small integers), thus many trailing zeros.

430: Testing trailing bits of $x$ and $y$ concurrently may allow to avoid

431: such difficult cases.

432: \end{itemize}

433:

434: \item If $f = 1$, we need to distinguish the cases $\err < u$, $\err = u$

435: and $\err > u$. Let $d$ denote the exponent difference so that the bits

436: $x_i$ and $y_{i-d}$ have the same weight ($d$ is the shift count to align

437: the mantissas). Let $q$ be the first integer such that the trailing bits

438: $x_q$ and $y_{q-d}$ are equal (when a bit is not represented, it is $0$).

439: If these bits are $0$'s, then $\err < u$. Otherwise (i.e. if these bits

440: are $1$'s), $\err \geq u$. The equality $\err = u$ can be decided as in

441: the case $f = 0$ ($\err > u$ if and only if at least one the untested

442: bits is $1$).

443:

444: The best way is to start with bit $p+3$ and loop over the

445: increasing positions, until $q$ is found (and more if one has to decide

446: if $\err = u$). If the $x$ mantissa or the $y$ mantissa has entirely

447: been read and $q$ has not been found yet, then it is not necessary to

448: go further, as we can deduce that $\err < u$; in other words, there

449: is no possible carry with bits from only one mantissa.

450:

451: Up to the position $q-1$, we have $x_i + y_{i-d} = 1$, as one of the bits

452: is $0$ and the other is $1$. When $x$ and $y$ overlap, it is necessary to

453: test at least one bit (as one of these two bits has an importance in the

454: sign of $\err - u$). Concerning the other bits, the result can be deduced

455: from only one test (like in the case $f = 0$), but an oracle would be

456: needed, and it is not possible to do better without any particular

457: knowledge. Here is an example: let us considered $x = 0.101111100101$

458: ($12$-bit precision), $y = 0.11010 \times 2^{-7}$ ($5$-bit precision),

459: and a $2$-bit target precision. The mantissas are aligned in the following

460: way:

461: \[

462: \begin{array}{cr}

463:   & 0.101111100101 \\

464: + &        0.11010 \\

465: \end{array}

466: \]

467: Though on this particular example, testing the last five bits $0$ is

468: sufficient to deduce that the exact result is less than $0.11$, all the

469: bits are tested from the left to the right. Moreover, if either $x$ or $y$

470: (but not both) had more bits, e.g.\ $y = 0.11010111001 \times 2^{-7}$,

471: then testing these bits would not be necessary as they cannot generate

472: a carry to reach $0.11$; however, if $y = 0.110110000 \times 2^{-7}$,

473: testing the following four bits would be necessary to deduce that the

474: result is $0.11$ exactly.

475:

476: \end{itemize}

477:

478: In the case $f = 1$, the loop is performed over the increasing positions

479: from the bit $p+3$, grabbing the bit of $x$ and the bit of $y$ having the

480: same weight. The same loop can be performed in the case $f = 0$, though

481: this is not the only solution as said above. Of course, special cases

482: must be taken into account: the $x$ mantissa does not necessarily overlap

483: with the aligned $y$ mantissa (as the most significant bit of $y$ may

484: come after some trailing bits of $x$, some trailing bits of $x$ may come

485: after the least significant bit of $y$, some trailing bits of $y$ may

486: come after the least significant bit of $x$), and the most significant

487: bit of $y$ may come after the least significant bit of $x$ (\emph{hole}

488: between the $x$ mantissa and the $y$ mantissa), any \emph{missing} bit

489: being regarded as $0$. At each iteration, the mantissa of the temporary

490: result has the form: $0.1 z_2 z_3 \ldots z_p r f f f \ldots f f f$ with

491: an error in the interval $[0,2)$ ulp\footnote{Unit in the Last Place:

492: here, the weight of the last bit $f$ of the temporary result.}. One

493: iterates as long as the bits after the (temporary) rounding bit are

494: identical. This basically corresponds to the Table Maker's Dilemma,

495: that occurs to exactly round any function (see \cite{LefMulTis1998b},

496: for instance).

497:

498: The time complexity is in $\Omega(p)$ and in $O(m+n+p)$. In the worst

499: case, it is in $\Theta(m+n+p)$. In average (if the bits are $0$ and

500: $1$ with equal probabilities and input numbers are independent), the

501: complexity is in $\Theta(p)$, as the probability to need to test $k$

502: trailing bits decreases exponentially (in $2^{-k}$).

503:

504: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

505:

506: \section{The MPFR Implementation}

507: \label{impl}

508:

509: In this section, we present the implementation of the addition in MPFR.

510: To keep the paper from being too technical, we do not give many details

511: (and the reader can read the source code, as MPFR is distributed under

512: the GNU Lesser General Public License). A complete proof would also be

513: very hard to read and check (unless it could be mechanically checked);

514: so, such a proof is not provided.

515:

516: Let us start with the representation of MPFR numbers. Non-special MPFR

517: numbers have a sign (accessed with C macros), a mantissa, an exponent

518: (some C integer) and a precision (also some C integer). The mantissa is

519: represented by an array of \emph{limbs}; a limb is an unsigned integer

520: (having 32 bits or 64 bits, depending on the C implementation), as defined

521: in the GMP library\footurl{http://www.swox.com/gmp/}, on which MPFR is

522: based. All the bits of a limb are used to represent bits of the mantissa

523: in the conventional binary representation\footnote{GMP now allows to use

524: some bits for the carries, called \emph{nail bits}. They are not supported

525: yet in MPFR. One should note that contrary to integer operations, redundancy

526: provided by nail bits would probably not be very interesting here due to

527: the discontinuity of the rounding function.}.

528: The mantissa is normalized, i.e. its most significant bit is always $1$.

529: Since the precision is not necessarily a multiple of the limb size, some

530: bits of the lowest mantissa limb are not significant and are always $0$

531: (except in temporary values).

532:

533: The computation steps presented in Section~\ref{comp} were bit based

534: (as this is more regular and easier to understand for a theoretical

535: analysis). But working on single bits in a software implementation

536: would not be very efficient. Base operations must be performed by

537: blocks; some limbs may still need to be split into two parts as the

538: $y$ mantissa must be aligned with the $x$ mantissa.

539:

540: In addition to the particular cases that arise in the bit-based case,

541: we need to distinguish the case where the exponent difference is a

542: multiple of the limb size and the other case, needing the $y$ mantissa

543: to be shifted (this is usually done on the fly). We also need to take

544: into account all the cases related to the block boundaries (for

545: instance, where the rounding bit lies in a limb).

546:

547: First, the main term is computed, but there are differences with the

548: bit-based version. As the array holding the target mantissa does not

549: necessarily have the room for $p+2$ bits and we want to avoid an

550: inefficient memory allocation for $p+2$ bits and copy, the temporary

551: rounding bit and the following bit are stored in C integer variables

552: \texttt{rb} and \texttt{fb} (determined on the fly, as soon as they

553: are known); in this way, then can also be handled more efficiently.

554: The second difference is all the bits of the target array

555: are used for this computation (in fact, this is more or less necessary,

556: as the low-level GMP functions do not perform any masking).

557:

558: For the main term, we want to add the most significant parts $x'$ and

559: $y'$. If $y$ does not overlap with the main term ($y'$ is empty), we

560: just copy $x'$ to the target array and zero the least significant limbs

561: of the target if the target has a greater precision than $x$. Now, let

562: us assume that $y$ overlaps with the main term. With GMP, we cannot

563: shift and add with a single operation; therefore these operations have

564: to be performed separately. First, with a GMP function, we copy the

565: most significant part of $y$, shifted if need be, to the target array

566: and we zero the limbs of the target that have not been touched: the

567: most and/or least significant limbs, if the exponent and the precision

568: of $y$ are small enough. Then, with another GMP function, we add the

569: most significant part of $x$ to the target. If a carry is generated,

570: we increment the exponent (unless we already had the maximum exponent,

571: in which case we generate an overflow) and shift the result to the right;

572: a bit is

573: lost due to the shift but it is either the rounding bit or a following

574: bit, and if necessary, the following bits are tested and the rounding

575: can be performed if they are not all equal.

576:

577: Then, the non-significant bits of the target are taken into account;

578: this occurs only if:

579: \begin{itemize}

580: \item the rounding bit is still unknown (otherwise these bits have

581: already been taken into account before the shift due to the carry,

582: as said above), and

583: \item there are non-significant bits.

584: \end{itemize}

585:

586: At this time, the rounding bit and the following bit may still be

587: unknown; in this case, they will be determined as soon as possible

588: from the trailing parts $x''$ and $y''$. The loops are performed

589: on the increasing positions (by blocks), as mentioned at the end

590: of Section~\ref{comp}; moreover the cases $f = 0$ and $f = 1$ are

591: considered together (and not separated as in Section~\ref{comp}).

592:

593: The iterations depend on the current status concerning $x$ and $y$.

594: Here are the different cases that may arise during the iterations:

595: \begin{itemize}

596: \item $x''$ has not entirely been read and $y''$ has not been read yet.

597: \item $x''$ and $y''$ overlap.

598: \item $x''$ has not entirely been read and $y''$ has entirely been read.

599: \item $x''$ has entirely been read and $y''$ has not been read yet.

600: \item $x''$ has entirely been read and $y''$ has not entirely been read.

601: \item $x''$ and $y''$ have entirely been read.

602: \end{itemize}

603:

604: In the overlapping case, at each iteration, a limb of $x$ and the

605: corresponding limb of $y$ (built from two different limbs if $y$ must

606: be shifted) are added. The possible carry is taken into account, and

607: the loop ends as soon as the result is $0$ (all its bits are $0$) for

608: $f = 0$ or the maximum limb value \verb|MP_LIMB_T_MAX| (all its bits

609: are $1$) for $f = 1$.

610:

611: We have focused on the differences coming from the computations by

612: blocks. The whole details may be found in the MPFR code.

613:

614: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

615:

616: \section{Conclusion}

617: \label{concl}

618:

619: We have presented the generic multiple-precision floating-point addition

620: with exact rounding, as specified in the MPFR library, first in a rather

621: theoretical point of view, then considering the current implementation

622: in MPFR. The theoretical analysis could give a more regular description

623: of the implementation, by ignoring the fact that bits are grouped

624: into words in a computer memory. It could help to improve the current

625: implementation (the fact that the cases $f = 0$ and $f = 1$ are

626: considered together is probably not a very good idea, though it

627: reduces the risk of forgetting particular cases).

628:

629: The subtraction could be dealt with in a similar manner, in future

630: work. This is a bit more complicated due to a possible cancellation

631: (when subtracting very close numbers).

632:

633: Full mechanically-checked proofs could also be considered, using the

634: theoretical analysis to define the main notions.

635:

636: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

637:

638: \begin{thebibliography}{1}

639:

640: \bibitem{IEEE754}

641: {IEEE} standard for binary floating-point arithmetic, Tech. Rep. {ANSI}-{IEEE}

642:   Standard 754-1985, New York, approved March 21, 1985: {IEEE} Standards Board,

643:   approved July 26, 1985: American National Standards Institute, 18 pages

644:   (1985).

645:

646: \bibitem{FelGoo1976a}

647: A.~Feldstein, R.~Goodman, Convergence estimates for the distribution of

648:   trailing digits, Journal of the ACM 23~(2) (1976) 287--297.

649:   \url{http://portal.acm.org/citation.cfm?id=321948}

650:

651: \bibitem{LefMulTis1998b}

652: V.~Lef�vre, J.-M. Muller, A.~Tisserand, Towards correctly rounded

653:   transcendentals, IEEE Transactions on Computers 47~(11) (1998) 1235--1243.

654:   \url{http://perso.ens-lyon.fr/jean-michel.muller/Nov98.pdf}

655:

656: \end{thebibliography}

657:

658: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

659:

660: \end{document}

661: