0609:quant-ph0609001/nn.tex

1: \documentclass{article} % \submitted{10/7/05} \whohasit{}

2: \usepackage{fullpage}

3: \usepackage{latexsym,amsmath,amssymb,color,rotating,xspace,epic,eepic,latexsym}

4: \usepackage{pstricks,pst-coil}

5: \usepackage{graphics,graphicx}

6: \input{qmac.tex}

7: \definecolor{brown}{rgb}{0.6,0.4,0.2}

8: \definecolor{purple}{rgb}{0.8,0.0,1.0}

9: \definecolor{gray}{rgb}{0.5,0.5,0.5}

10:

11: \title{Shor's Algorithm on a Nearest-Neighbor Machine}

12: \author{Samuel A. Kutin\thanks{Center for Communications Research, 805 Bunn Drive,

13: Princeton, NJ 08540. {\tt kutin@idaccr.org}}}

14: \date{}               % Month Year

15:

16: \newcommand{\caps}[1]{{\sc #1}}

17: \def\SAWUNEH{\caps{sawuneh}}

18: \newcommand{\floor}[1]{\left\lfloor #1 \right\rfloor}

19: \newcommand{\ceil}[1]{\left\lceil #1 \right\rceil}

20: \newcommand{\xor}{\mathbin{\oplus}}

21: \newcommand{\xoreq}{\mathbin{\oplus\!=}}

22: \newcommand{\maj}{\mathop{\rm MAJ}\nolimits}

23: \newcommand{\raiseintable}[1]{\raisebox{1.8ex}[0cm][0cm]{#1}}

24: \newcommand{\logup}[1]{\ceil{\log_2 {#1}}}

25: \newcommand{\flog}[1]{\floor{\lg {#1}}}

26: \newcommand{\igate}[2]{{#1} \xoreq {1 \over 2} {#2}}

27: \newcommand{\jgate}[2]{{#1} \xoreq - {1 \over 2} {#2}}

28: \newcommand{\iorjgate}[2]{{#1} \xoreq \pm {1 \over 2} {#2}}

29: \newcommand{\igatealign}[2]{{#1} & \xoreq {1 \over 2} {#2}}

30: \newcommand{\jgatealign}[2]{{#1} & \xoreq - {1 \over 2} {#2}}

31: \newcommand{\qu}[1]{{\left| {#1} \right\rangle}}

32: \newcommand{\phihat}{\smash[t]{\hat{\phi}}}

33:

34: \newcommand{\DKRS}{cla}

35: \newcommand{\CDKM}{ripple}

36:

37: \newcommand{\QFT}{\caps{QFT}\xspace}

38: \begin{document}

39:

40: \maketitle

41: \begin{abstract}

42: We give a new ``nested adds'' circuit for implementing Shor's

43: algorithm in linear width and quadratic depth on a nearest-neighbor

44: machine.  Our circuit combines Draper's transform adder with

45: approximation ideas of Zalka.  The transform adder requires small

46: controlled rotations.  We also give another version, with slightly

47: larger depth, using only reversible classical gates.  We do not know

48: which version will ultimately be cheaper to implement.

49: \end{abstract}

50:

51: \section{Introduction}

52: \label{intro-sec}

53:

54: We describe a new quantum exponentiation circuit that obeys a

55: ``nearest-neighbor'' constraint:  we imagine that qubits are arranged

56: in a line, and we are only allowed to perform interactions between

57: adjacent qubits.  Previous $n$-bit nearest-neighbor exponentiation

58: circuits~\cite{FDH,V}

59: required either depth $O(n^3)$ or superlinear width, but our construction

60: has width $O(n)$ and depth $O(n^2)$.  This new exponentiation circuit,

61: together with a nearest-neighbor quantum Fourier transform (QFT)~\cite{FDH},

62: gives a new circuit

63: for Shor's factorization algorithm~\cite{Shor}.

64:

65: A number of people have constructed exponentiation circuits for general

66: architectures (i.e., without the nearest-neighbor restriction).

67: See, for example,~\cite{VMI,VMIL,V} for recent summaries.

68: Many of the techniques used to reduce circuit depth

69: do not appear to apply to a nearest-neighbor architecture.

70:

71: Beauregard~\cite{Beau} has given a simple exponentiation

72: circuit using Draper's transform adder~\cite{Drap}.  The adder requires

73: two QFTs together with some controlled rotations.  Beauregard's circuit

74: uses only $2n + O(1)$ qubits, but has cubic depth---the dominant cost is

75: $\Theta(n^2)$ applications of the transform adder.

76: Fowler, Devitt, and Hollenberg~\cite{FDH} modify Beauregard's circuit for use on a

77: nearest-neighbor machine, and they show that these modifications do not

78: affect the dominant terms in the expression for size or depth.

79:

80: Our contribution is a new approximate controlled modular multiplier with

81: linear width and linear depth.  We use

82: an idea of Zalka~\cite{Zalka} for building approximate multipliers.

83: While we still multiply by performing $O(n)$ additions, we only

84: perform a constant number of large QFTs for each multiply.

85: When we insert our multiplier into the framework of Fowler et al.,

86: we obtain a nearest-neighbor exponentiation circuit with linear

87: width and quadratic depth.\footnote{Zalka~\cite{Z2} has recently pointed

88: out this same idea of performing multiple additions framed by a single

89: QFT, but he does not work out any details or discuss the application

90: to nearest-neighbor circuits.}

91:

92: We first set some notation and review prior work in Section~\ref{prelim-sec}.

93: We describe our multiplier and the resulting exponentiator

94: in Section~\ref{main-sec}, and we discuss a version for general

95: architectures in Section~\ref{general-sec}.

96:

97: Following Fowler et al., we assume that any interaction between two

98: adjacent qubits has unit cost.  In practice, some gates may be easier

99: to implement than others.  Our circuit requires small controlled rotations

100: that may prove expensive.  Van Meter~\cite{V} discusses the error correction

101: requirements for various adders and suggests that the transform adder may not

102: be useful in practice.

103: In Section~\ref{classical-sec}

104: we describe a version of the circuit that is essentially classical

105: and that does not require these small rotations.  However, the

106: depth increases to $O(n^2 \log n)$.  This is the same asymptotic

107: depth achieved by Van Meter~\cite{V}, but we require only linear width.

108:

109: \section{Preliminaries}

110: \label{prelim-sec}

111:

112: Our goal is to compute $w = g^e \bmod m$.  Here $g$ and $m$

113: are $n$-bit constants, known to the classical compiler that builds

114: our circuit.  The $2n$-bit exponent $e$ is in quantum

115: memory.\footnote{More generally, $e$ has length $\alpha n$, and the

116: error rate of the algorithm depends on $\alpha$.  For simplicity

117: we take $\alpha = 2$.}  Using

118: a standard trick (see, for example,~\cite{Beau}),

119: we can assume that only one bit of

120: $e$ at a time is stored in our quantum computer.

121:

122: Writing $e = \sum 2^i e_i$, we have

123: $$

124: w = \left(\prod_i (g^{2^i} \bmod m)^{e_i}\right) \bmod m.

125: $$

126: That is, we can decompose our exponentiation into $2n$ controlled

127: multiplications.  In each case we multiply by $1$ if the controlling

128: bit $e_i$ is $0$, and by the constant $g^{2^i} \bmod m$ if $e_i$ is $1$.

129:
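As a small illustration (the numbers are ours, chosen only for concreteness),
take $m = 15$, $g = 7$, and $e = 11 = 2^3 + 2^1 + 2^0$.  The compiler
precomputes
$$
g^{2^0} \bmod m = 7, \quad g^{2^1} \bmod m = 4, \quad
g^{2^2} \bmod m = 1, \quad g^{2^3} \bmod m = 1,
$$
and the controlled multiplications selected by $e_0 = e_1 = e_3 = 1$ give
$$
w = (7 \cdot 4 \cdot 1) \bmod 15 = 13,
$$
which agrees with $7^{11} \bmod 15 = 13$.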

130: In Section~\ref{prelim-mod-mult-sec}, we describe how we reduce

131: controlled modular multiplication to (roughly) $n$ controlled

132: additions.  In Section~\ref{prelim-transform-adder-sec}, we describe

133: the addition routine we will use.

134:

135: We refer the reader to Fowler et al.~\cite{FDH} for

136: useful building blocks for nearest-neighbor circuits.  We will use

137: their ``mesh'' circuit for interleaving two registers.  We will

138: not use their controlled swap; instead, in Section~\ref{prelim-pseudo-sec}

139: we describe a simpler controlled swap for the case when one register is

140: known to be $0$.

141:

142: \subsection{Approximate Modular Multiplication}

143: \label{prelim-mod-mult-sec}

144:

145: We now present a scheme of Zalka~\cite{Zalka} for performing

146: controlled modular multiplication.  We wish to compute

147: $$

148: r = abc \bmod m,

149: $$

150: where $a$ and $m$ are $n$-bit constants, $b = \sum_i 2^i b_i$ is

151: in $n$ bits of quantum memory, and $c$ is a control bit.

152: We can write

153: $$

154: r \equiv abc \equiv \sum_i 2^i a b_i c \equiv \sum_i (b_i c) \left(2^i a \bmod m\right) \pmod m.

155: $$

156: We can view this as repeated controlled modular addition; the

157: numbers $x_i = 2^i a \bmod m$ are known at compile-time, and

158: we have $n$ control bits $y_i = b_i c$.

159:

160: We define the partial sum

161: $$

162: s = \sum_i y_i x_i = r + qm.

163: $$

164: The sum $s$ is congruent to the answer $r \pmod m$.  Also, since

165: $s < nm$, the quotient $q$ is at most $n$.  In particular, we can

166: write down $q$ using only $\log_2 n$ bits.

167:
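For example (with toy parameters of our own choosing), let $n = 4$,
$m = 13$, $a = 5$, $b = 11$, and $c = 1$.  Then
$$
x_0 = 5, \quad x_1 = 10, \quad x_2 = 7, \quad x_3 = 1, \qquad
(y_0, y_1, y_2, y_3) = (1, 1, 0, 1),
$$
so $s = 5 + 10 + 1 = 16$.  The desired answer is
$r = 55 \bmod 13 = 3$, and indeed $s$ and $r$ differ by $qm$ with $q = 1$.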

168: Zalka's key idea is to approximate the desired answer $r$ in two

169: parallel steps.  First, we compute $s$ by repeated controlled addition into

170: an $n$-bit accumulator.  Second,

171: we approximate $q$:   We choose some $\ell_0 = O(\log n)$, and we

172: compute $\hat{q}$ using only the $\ell_0$ high

173: bits of each $x_i$.  More precisely, let $\hat{x}_i = 2^{n-\ell_0}

174: \floor{x_i/2^{n-\ell_0}}$.  Then $\hat{q} = \floor{(\sum y_i \hat{x}_i) / m}$.

175: We can easily compute $\hat{q}$ in depth $O(\log^2 n)$.  With

176: high probability, $\hat{q} = q$.

177:
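To see how the truncation can succeed or fail, consider a toy instance of
our own choosing: $n = 4$, $m = 13$, and selected addends $5$, $10$, and $1$,
so that $s = 16$ and $q = 1$.  With $\ell_0 = 3$ we keep
$\hat{x}_i = 2 \floor{x_i / 2}$, giving
$$
4 + 10 + 0 = 14, \qquad \hat{q} = \floor{14 / 13} = 1 = q.
$$
With $\ell_0 = 2$ we keep $\hat{x}_i = 4 \floor{x_i / 4}$, giving
$$
4 + 8 + 0 = 12, \qquad \hat{q} = \floor{12 / 13} = 0 \ne q.
$$
In general each truncation discards less than $2^{n - \ell_0}$, so
$$
s - n 2^{n - \ell_0} < \sum_i y_i \hat{x}_i \le s,
$$
and $\hat{q}$ can differ from $q$ only when $s$ lies within
$n 2^{n - \ell_0}$ above a multiple of $m$.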

178: Once we have $\hat{q} = \sum_i 2^i \hat{q}_i$, subtracting $\hat{q}m$

179: from $s$ can be done with $\log_2 n$ additional controlled adds into

180: our accumulator (we subtract $2^i m$ controlled by $\hat{q}_i$).

181: Next, we must erase $\hat{q}$; again, this takes only $O(\log^2 n)$

182: depth.  So, aside from a lower-order term, the cost of controlled

183: modular multiplication is about $n$ controlled additions,

184: or, equivalently, one controlled integer multiplication.

185:

186: There are other schemes that give modular multiplication circuits

187: at a cost of three times the cost of integer multiplication (see,

188: for example,~\cite{Dhem}).  So it might seem that Zalka's idea would

189: save only a constant factor.  However, Zalka's idea is conceptually

190: simpler; without it, we might not have found the linear-depth

191: multiplier of Section~\ref{main-sec}.

192:

193: \subsection{The Transform Adder}

194: \label{prelim-transform-adder-sec}

195:

196: Most quantum arithmetic circuits are essentially classical in nature.

197: Draper~\cite{Drap} has given an addition circuit that is inherently

198: quantum.  We briefly describe this circuit, and then discuss how to

199: adapt it to the nearest-neighbor setting.

200:

201: Suppose we have an $n$-bit number register containing

202: $u = \sum_{j=0}^{n-1} u_j 2^j$.  Then the {\QFT} maps $\qu{u}$ to

203: $$

204: \qu{\phi(u)} =

205: \frac{1}{2^{n/2}}\sum_{k=0}^{2^n - 1} e^{2 \pi i u k / 2^n} \qu{k}

206: = \bigotimes_{j=0}^{n-1} \qu{\phi_j(u)},

207: $$

208: where

209: $$

210: \qu{\phi_j(u)} = {1 \over \sqrt{2}} \left(\qu{0} + e^{2 \pi i u / 2^{j+1}}\qu{1}

211: \right).

212: $$

213: Note that $\qu{\phi(u)}$ is an unentangled state.

214:

215: Suppose we want to add $v$ to $u$.  We can

216: replace each bit $\phi_j(u)$ by $\phi_j(u + v)$; this is simply a

217: $Z$-rotation by an angle of $2 \pi v / 2^{j+1}$, so we can rotate each

218: bit independently.  To perform controlled addition, each of these

219: rotations is controlled by a bit $c$.  We can then perform an inverse

220: {\QFT} to change $\qu{\phi(u+v)}$ to $\qu{u+v}$.

221:
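As a concrete check (with values of our own choosing), take $n = 3$,
$u = 3$, and $v = 2$, and write $\theta_j(u) = 2 \pi u / 2^{j+1}$ for the
phase carried by bit $j$, reduced modulo $2\pi$:
\begin{center}
\begin{tabular}{c|c|c|c}
$j$ & $\theta_j(3)$ & rotation $2 \pi \cdot 2 / 2^{j+1}$ & $\theta_j(5)$ \\ \hline
$0$ & $\pi$ & $2\pi \equiv 0$ & $\pi$ \\
$1$ & $3\pi/2$ & $\pi$ & $\pi/2$ \\
$2$ & $3\pi/4$ & $\pi/2$ & $5\pi/4$
\end{tabular}
\end{center}
In each row the first two entries sum to the third modulo $2\pi$, so the
rotated register is exactly $\qu{\phi(5)}$.  Note also that the rotation
applied to bit $j$ depends only on $v \bmod 2^{j+1}$, since higher bits of
$v$ contribute multiples of $2\pi$.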

222: One way to view the {\QFT} is that we have moved the information about

223: $u$ into the phase of the qubits.  To do a modular reduction and test

224: the high bit of $u$, we first need to perform an inverse {\QFT}.

225: So, for a naively designed modular exponentiation circuit, we perform

226: $\Theta(n^2)$ {\QFT}s and inverse {\QFT}s.

227: Our main result is

228: a circuit design with only $O(n)$ {\QFT}s.

229:

230: \begin{figure}[h]

231: \begin{center}

232: \input qft.tex

233: \end{center}

234: \caption{Quantum Fourier transform of a 4-bit register on a

235: nearest-neighbor machine. \textcircled{\scriptsize$j$}

236: %{\Large$\bigcirc$}\hspace{-11pt}$j$\hspace{5.5pt}

237: denotes a $Z$-rotation by

238: $2 \pi / 2^j$.}

239: \label{qft-fig}

240: \end{figure}

241:

242: Fowler et al.~\cite{FDH} give a nearest-neighbor

243: form of the {\QFT}.  A 4-bit version is depicted

244: in Figure~\ref{qft-fig}.  After each controlled rotation, we swap the

245: two bits involved, so every pair of bits can interact.  (If we leave out

246: the swaps, we obtain the linear-depth {\QFT} of Moore and Nilsson~\cite{MN}.)

247: Note that we

248: assign unit cost to the controlled rotation together with the

249: accompanying swap.

250:

251: The size of this {\QFT} circuit is $n^2/2 + O(n)$.  We may be able to

252: approximate the {\QFT} and skip some of the small rotations.  On a

253: general machine, this reduces the size to $O(n \log n)$, but on a

254: nearest-neighbor machine we still have to perform $n \choose 2$

255: swaps.

256:

257: \subsection{Pseudo-Toffolis and Controlled Swaps}

258: \label{prelim-pseudo-sec}

259:

260: \begin{figure}[h]

261: \begin{center}

262: \input pseudo.tex

263: \end{center}

264: \caption{Pseudo-Toffoli gate $v \xoreq uw$.  We also change

265: the phase when $\qu{uvw} = \qu{011}$.}

266: \label{pseudo-fig}

267: \end{figure}

268:

269: A frequently useful building block for our circuit is a {\em Toffoli\/}

270: gate, or doubly-controlled not: $v \xoreq uw$.  A cascade of

271: Toffoli gates through a $k$-bit register has depth $2k$.  However,

272: if we use the ``pseudo-Toffoli'' gate of Figure~\ref{pseudo-fig},

273: the depth of the cascade can be reduced to $k$.

274: See~\cite{BBCDMSSSW} for an equivalent pseudo-Toffoli gate.

275:

276: The idea of Figure~\ref{pseudo-fig} is that we correctly set $v$ to

277: $v \xor uw$, but we change the phase when $\qu{uvw} = \qu{011}$.

278: Normally this would be an unacceptable side effect, but there are

279: two cases where we are okay:  First, we may plan to undo this

280: computation and fix the phase later.  Second, we may know that the

281: problem input is forbidden for some reason.

282:

283: \begin{figure}

284: \begin{center}

285: \input pseudo-cascade.tex

286: \end{center}

287: \caption{Swap of 4-bit registers $X$ and $Y$ controlled by $c$

288: in depth $10$.  We assume that $Y$ is initialized to $0$.}

289: \label{pseudo-cascade-fig}

290: \end{figure}

291:

292: For example, suppose we want to swap two $n$-bit registers

293: $X$ and $Y$ controlled by a bit $c$.  Suppose further that $Y$ is

294: initialized to $0$.  Then we can build a pseudo-Toffoli cascade

295: as in Figure~\ref{pseudo-cascade-fig}.  Since each Toffoli target is

296: known to be $0$, there will be no phase shift.  The depth is $2n + 2$.

297:

298: \section{Nested Adds}

299: \label{main-sec}

300:

301: We now describe our main result, the ``nested adds'' multiplier.

302: We begin by describing a controlled multiplier with linear width

303: and depth; we then explain how to modify it to be a modular multiplier.

304: We conclude with an exponentiation circuit with linear width and

305: quadratic depth.

306:

307: \subsection{Nested Controlled Addition}

308: \label{main-add-sec}

309:

310: As noted in Section~\ref{prelim-mod-mult-sec}, we can view

311: controlled multiplication as repeated controlled addition.

312: In this section, we build a repeated controlled adder.

313: We have an $n$-bit

314: register $Z$, initialized to some value $z$, and an $n$-bit

315: register $Y$ of control bits $y_i$.  When the circuit concludes,

316: we want $Z$ to contain $$\left(z + \sum_i x_i y_i\right) \bmod 2^n,$$ where

317: the values $x_i$ are $n$-bit constants.  In the next section, we

318: will convert this circuit to a modular multiplier.

319:

320: It is clear that $n$-bit addition controlled by a single bit $y_i$

321: requires linear depth on a nearest-neighbor machine; the control

322: bit can affect all $n$ bits of $Z$, so we need linear time to

323: move (or pseudocopy) it from one end to the other.  One might at

324: first think that performing $n$ controlled additions would require

325: quadratic depth.  However, if we use the transform adder, we can

326: nest the additions.

327:

328: \begin{figure}[h!]

329: \begin{center}

330: \input nested.pst

331: \end{center}

332: \caption{Schematic for the ``nested adds'' repeated controlled adder.}

333: \label{nested-fig}

334: \end{figure}

335:

336: The basic structure of the circuit is depicted in Figure~\ref{nested-fig}.

337: We begin by performing the {\QFT} on $Z$, in depth $2n-3$.

338: Next, we take each bit of $Y$ successively and swap it with each

339: bit of $Z$.  As we swap $Y_i$ with $Z_j$, we also rotate $Z_j$

340: controlled by $Y_i$; the rotation amount depends on $x_i$.  The idea

341: is that we are adding in $x_i$ by rotating each bit of $Z$ by the

342: proper amount; all of these rotations commute, so the order is

343: unimportant.  This portion has depth $2n - 1$; when it concludes,

344: we have effectively swapped the $Z$ and $Y$ registers.

345:
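In symbols, and following the transform adder convention of
Section~\ref{prelim-transform-adder-sec}: bit $Z_j$ begins in the state
$\qu{\phi_j(z)}$ and, as each $Y_i$ passes, picks up a $Z$-rotation by
$2 \pi x_i y_i / 2^{j+1}$.  The accumulated phase on $\qu{1}$ is
$$
e^{2 \pi i z / 2^{j+1}} \prod_i e^{2 \pi i x_i y_i / 2^{j+1}}
= e^{2 \pi i \left(z + \sum_i x_i y_i\right) / 2^{j+1}},
$$
so $Z_j$ ends in the state $\qu{\phi_j(z + \sum_i x_i y_i)}$ regardless
of the order in which the $Y_i$ arrive.  Since $2^{j+1}$ divides $2^n$
for every $j < n$, these phases depend on the sum only modulo $2^n$, which
is why the inverse {\QFT} returns $(z + \sum_i x_i y_i) \bmod 2^n$.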

346: Next, we perform the inverse \QFT on $Z$.  This again has depth $2n-3$.

347: Finally, we move $Y$ back to where it started in depth $2n - 1$.

348:

349: As described, the total depth would be $8n - 8$.  However, as shown

350: in Figure~\ref{nested-fig}, the inverse \QFT nests nicely with the

351: swaps with $Y$.  We can start the inverse \QFT at time $3n - 5$, and

352: we can start the final swaps at time $4n-2$.  The total depth is only

353: $6n - 4$.

354:
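For instance, with $n = 4$ (a value chosen only for illustration) the four
stages have depths $5$, $7$, $5$, and $7$, so the sequential circuit has depth
$$
(2n - 3) + (2n - 1) + (2n - 3) + (2n - 1) = 8n - 8 = 24,
$$
while the nested circuit has depth $6n - 4 = 20$.  In general the nesting
saves $(8n - 8) - (6n - 4) = 2n - 4$ time-slices.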

355: If we can assume $z$ is a constant, then we can replace the initial {\QFT}

356: with a single time-slice of $n$ unitary transformations\footnote{For

357: example, when $z = 0$, we apply a Hadamard to each qubit of $Z$.} on $Z$.

358: The depth is reduced to $4n - 1$.  See Section~\ref{main-error-sec} for

359: the reasons

360: why we might want to allow nonzero $z$.  For the remainder of this paper,

361: we will assume that $z$ is a constant, and that we can skip the initial {\QFT}.

362:

363: \subsection{Nested Controlled Modular Addition}

364: \label{main-mod-mult-sec}

365:

366: To turn the above circuit into a modular multiplier, we follow the

367: procedure described in Section~\ref{prelim-mod-mult-sec}.  We

368: compute the sum $s = \sum_i y_i x_i$ congruent to the desired

369: answer $r$ modulo $m$.  (Since we know our final answer has $n$ bits, we

370: need only compute the low $n$ bits of $s$.)

371: Simultaneously, we compute the approximate

372: quotient $\hat{q}$.  We then subtract $\hat{q}m$ from our main register.

373: Finally, we erase $\hat{q}$.

374:

375: We compute $\hat{q}$ in an $\ell$-bit register $Q$, which we

376: locate between $Y$ and $Z$.  We take $\ell = \ell_0 + \log_2 n$, so

377: we have room to write the $(n + \log_2 n)$-bit sum $\sum_i y_i \hat{x}_i$

378: (which has $0$ in the low-order $n - \ell_0$ bits).

379:

380: We need to initialize the low $\ell_0$ bits of $Q$.  If we have

381: nonconstant data in $Z$, we could pseudocopy

382: $\ell_0$ bits of it to $Q$; this is

383: not expensive, but it might be costly to erase $Z$ when we are done.

384: In our case, we will initialize $Z$ to a constant $z$, and $Q$

385: to the high-order $\ell_0$ bits of $z$.

386:

387: We pass the bits of $Y$ past $Q$ and

388: then $Z$.  We compute the high bits of $z + \sum_i y_i \hat{x}_i$ in $Q$,

389: and we compute $z + \sum y_i x_i \bmod 2^n$ in $Z$.

390:

391: As soon as the last $y_i$ bit has passed through $Q$, we compute $\hat{q}$.

392: For $k = \log_2 n$ down to $1$, we first subtract $2^{k-1} m$ from

393: $Q$ by doing a unary rotation on each bit.  Next, we do an inverse \QFT in

394: depth at most $2\ell-1$;

395: the top bit of $Q$ is now a control bit indicating whether

396: we should have subtracted $2^{k-1} m$ or not.  We label that bit $\hat{q}_k$

397: and think of it as no longer part of $Q$.  We now do a \QFT on the

398: remaining bits of $Q$, and then move $\hat{q}_k$ through $Q$; this adds

399: $2^{k-1}m$ back if necessary, and also positions $\hat{q}_k$ to go through

400: $Z$.

401:

402: At step $k$, we perform an inverse \QFT on $\ell_0 + k$ bits and

403: a \QFT on $\ell_0 + k - 1$ bits, and then we move $\hat{q}_k$ through $Q$.

404: The depth is $4(\ell_0 + k) - 3$.  The total depth, summing from

405: $k = 1$ to $\log_2 n$, is

406: \begin{equation}

407: \label{q-time}

408: 2\ell^2 - 2\ell_0^2 + O(\log n) = 2 (2\ell - \log_2 n) \log_2 n + O(\log n).

409: \end{equation}

410:
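The sum can be checked directly.  Writing $L = \log_2 n$, so that
$\ell = \ell_0 + L$, we have
\begin{align*}
\sum_{k=1}^{L} \bigl(4(\ell_0 + k) - 3\bigr)
&= 4 \ell_0 L + 2L(L+1) - 3L = 4 \ell_0 L + 2L^2 - L, \\
2\ell^2 - 2\ell_0^2
&= 2(\ell - \ell_0)(\ell + \ell_0) = 2L(2\ell_0 + L) = 4 \ell_0 L + 2L^2, \\
2(2\ell - L)L
&= 2(2\ell_0 + L)L = 4 \ell_0 L + 2L^2.
\end{align*}
So the two closed forms agree exactly, and the sum differs from them only
by the lower-order term $-L = -\log_2 n$.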

411: We use the $\hat{q}_k$ bits as control bits, subtracting $2^{k-1} m$ as

412: needed from $s$.  When we are done, the answer $r$ is in $Z$.  When we

413: pass the $\hat{q}_k$ bits back up, we again take time given

414: by~\eqref{q-time} to uncompute $\hat{q}$.  (Alternatively, we could

415: move all of $Q$ past $Z$ and then uncompute $\hat{q}$.)

416:

417: We subtract $z$ from $Z$ after computing $r$.  See

418: Section~\ref{main-error-sec} for details.

419:

420: The total circuit depth for repeated controlled addition is

421: $$

422: 4n + 4 (2\ell - \log_2 n) \log_2 n + O(\log n).

423: $$

424: The width is $2n + \ell + O(1)$.

425:

426: \subsection{Controlled Modular Multiplication}

427: \label{main-control-sec}

428:

429: So far, we have assumed that the $n$ control bits are present at the

430: start of the computation.  To complete our modular multiplier, we need

431: to explain how to start from the multiplicand $b$ and

432: the overall control bit $c$ and produce the control bits $y_i = b_i c$.

433: Also, since we want an in-place multiplier, we need to explain how to

434: erase $b$ when we are done (if $c=1$).

435:

436: \begin{sidewaysfigure}

437: \begin{center}

438: \input mult.pst

439: \end{center}

440: \caption{Schematic for the ``nested adds'' controlled in-place

441: modular multiplier.}

442: \label{nested-mod-mult-fig}

443: \end{sidewaysfigure}

444:

445: It is easy to perform the desired steps in linear depth, given

446: the linear-depth out-of-place modular multiplication

447: circuit described above.  The challenging part is to keep the

448: depth as low as possible.  Our solution has depth

449: $$

450: 11n + 6 (2\ell - \log_2 n) \log_2 n + O(\log n),

451: $$

452: width

453: $$

454: 3n + 2\ell + 1,

455: $$

456: and size

457: $$

458: 5n^2 + O(n \log n),

459: $$

460: and is depicted in Figure~\ref{nested-mod-mult-fig}.  We briefly

461: describe the basic features of the circuit.

462:

463: We have three $n$-bit registers (labeled $B$, $Y$, and $Z$),

464: two $\ell$-bit registers (labeled $Q_Y$ and $Q_Z$), and one

465: control bit $c$.  Initially $B$ contains $b$ and the other

466: four registers contain $0$.  When the circuit concludes,

467: $B$ contains $b$ (when $c=0$) or $ab$ (when $c=1$) and the

468: other four registers contain $0$.

469:

470: To start, we have $Q_Y$, then $B$ and $Y$ interleaved (i.e.,

471: we have $B_0$, $Y_0$, $B_1$, $Y_1$, \dots, $B_{n-1}$, $Y_{n-1}$),

472: and then $c$, $Q_Z$, and $Z$.  When the circuit completes,

473: we have $Y$, then $Q_Y$, then $B$ interleaved with $Z$, then

474: $c$, and finally $Q_Z$.  So, except for the location of $c$,

475: the bits have been flipped upside-down.  (See

476: Section~\ref{main-exp-sec} for the reason we end with $c$ in a

477: different place.)

478:

479: We first move $c$ through the interleaved $B$ and $Y$,

480: performing controlled swaps.  If the contents of $B$ and $Y$ were

481: wholly general, this process would have depth $4n$, but because

482: we know $Y$ contains $0$ we can use pseudo-Toffolis (see

483: Section~\ref{prelim-pseudo-sec}), and the depth is only $2n+2$.

484: After the controlled swaps, we unmesh $B$ and $Y$.

485:

486: Next, we multiply $Y$ by $a$ and write the result to $Z$.

487: These gates are depicted in blue in Figure~\ref{nested-mod-mult-fig}.

488: We use $Q_Z$ as a scratch register for computing $\hat{q}$.  We

489: load a constant $z$ into $Z$ (and its high bits into $Q_Z$), then

490: we perform the circuit described in the previous section, and

491: finally we erase $Q_Z$ and unload the constant $z$.  When this

492: portion concludes, if $c = 0$, then $B$ contains $b$ and $Y$ and

493: $Z$ contain $0$.  If $c = 1$, then $B$ contains $0$, $Y$ contains

494: $b$, and $Z$ contains $ab$.

495:

496: We now perform the gates depicted in red in

497: Figure~\ref{nested-mod-mult-fig}.  We undo a multiplication

498: of $Z$ by $a^{-1}$, writing the result into $Y$.  The red circuit

499: is a backwards, upside-down version of the blue circuit.  When

500: we are done, $Y$ contains $0$.  If $c = 0$, then $B$

501: contains $b$ and $Z$ contains $0$; if $c = 1$, then $B$

502: contains $0$ and $Z$ contains $ab$.

503:

504: Finally, we mesh $B$ and $Z$ and perform the controlled swap

505: in reverse.  (Again, we can use pseudo-Toffolis to reduce the

506: depth to $2n+2$.)  We write $b$ or $ab$ to $B$, and we write $0$

507: to $Z$, as desired.

508:
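The register contents described above can be summarized as follows.  This
table is our own tabulation of the preceding paragraphs; $ab$ abbreviates
$ab \bmod m$, and the scratch registers $Q_Y$ and $Q_Z$ are omitted.
\begin{center}
\begin{tabular}{l|ccc|ccc}
 & \multicolumn{3}{c|}{$c = 0$} & \multicolumn{3}{c}{$c = 1$} \\
Stage & $B$ & $Y$ & $Z$ & $B$ & $Y$ & $Z$ \\ \hline
Initial state & $b$ & $0$ & $0$ & $b$ & $0$ & $0$ \\
After controlled swap of $B$ and $Y$ & $b$ & $0$ & $0$ & $0$ & $b$ & $0$ \\
After blue circuit (multiply by $a$) & $b$ & $0$ & $0$ & $0$ & $b$ & $ab$ \\
After red circuit (uncompute $Y$) & $b$ & $0$ & $0$ & $0$ & $0$ & $ab$ \\
After controlled swap of $B$ and $Z$ & $b$ & $0$ & $0$ & $ab$ & $0$ & $0$
\end{tabular}
\end{center}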

509: Note that part of the red circuit overlaps part of the blue

510: circuit.  In particular, we uncompute the first $\hat{q}$

511: while computing the second.  This is why the second-order

512: term in the depth is $6 (2\ell -\log_2 n)\log_2 n$ rather than

513: $8 (2\ell -\log_2 n)\log_2 n$.

514:

515: We must swap $B$ and $Y$ before we can interleave

516: $B$ and $Z$.  If our bits were arranged in a ring, we could

517: bring $B$ around from the other side; this would reduce

518: the depth by about $n$ and the size by about $n^2$.  One

519: could construct a more symmetric version of

520: Figure~\ref{nested-mod-mult-fig} by moving $B$ down to the bottom

521: between the blue and red portions, but this increases the

522: size by about $n^2$ without changing the depth.

523:

524: \subsection{Exponentiation}

525: \label{main-exp-sec}

526:

527: We recall from Section~\ref{prelim-sec} that our goal is to

528: perform $2n$ controlled in-place modular multiplications.  We

529: will repeatedly apply the circuit of Section~\ref{main-control-sec}.

530: Since that circuit leaves the machine ``upside-down,'' we alternate

531: between applying the circuit right-side-up and upside-down.

532:

533: Let $e_i$ denote the control bit in the $i$th round.  We add one

534: additional bit to the circuit of Section~\ref{main-control-sec}.

535: Just before we start the swap of $B$ and $Z$ controlled by $e_i$, we

536: create our next control bit $e_{i+1}$.  Then, as soon as we have

537: swapped two bits of the interleaved $B$ and $Z$ controlled by

538: $e_i$, we swap them again controlled by $e_{i+1}$ (viewing them

539: as $B$ and $Y$ for the next round).  We can thus overlap these

540: two controlled swaps; we reduce the depth of each round to only

541: $9n + O(\log^2 n)$.

542:

543: There may be a technicality here because of the order in which we

544: perform measurements.  After we are done using $e_i$, we measure

545: it, and we may need to rotate $e_{i+1}$ based on the observed

546: value of $e_i$.  We will assume that this is not a problem in

547: practice.  If necessary, we could generate $\Theta(\sqrt{n})$

548: control bits at a time and use them; we would still have a

549: depth of roughly $9n$ and a width of roughly $3n$.

550:

551: Our circuit has depth

552: \begin{gather*}

553: 18 n^2 + 12 n (2\ell -\log_2 n) \log_2 n + O(n \log n),

554: \intertext{width}

555: 3n + 2\ell + 2,

556: \intertext{and size}

557: 10 n^3 + O(n^2 \log n).

558: \end{gather*}

559: Here $\ell = O(\log n)$ is chosen to control

560: the error rate of our computation of $\hat{q}$.  See the

561: next section for details.

562:
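As a consistency check, these totals are what we get by multiplying the
per-round costs by the $2n$ rounds.  Reading the per-round depth as
$9n + 6(2\ell - \log_2 n)\log_2 n + O(\log n)$ and the per-round size as
$5n^2 + O(n \log n)$, we have
\begin{align*}
2n \cdot 9n &= 18 n^2, \\
2n \cdot 6 (2\ell - \log_2 n) \log_2 n &= 12 n (2\ell - \log_2 n) \log_2 n, \\
2n \cdot 5n^2 &= 10 n^3.
\end{align*}
The width exceeds that of a single multiplier by one bit, the extra
control bit $e_{i+1}$.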

563: \subsection{Error Analysis}

564: \label{main-error-sec}

565:

566: In this section we address two questions.  First, how should

567: we choose $\ell$?  Second, how does filling $Z$

568: with a random value $z$ improve our error analysis?

569:

570: We perform $4n$ modular multiplications.  For each of these, we

571: add $n$ quantities to compute $\hat{q}$.  There are thus $4n^2$

572: additions where we might make a mistake.  Given random addends,

573: the probability of an error propagating across a window of length

574: $\ell_0$ is $2^{-\ell_0}$.  Our probability of making an error

575: is therefore at most

576: $$

577: 4n^2 2^{-\ell_0} = 2^{2 \log_2 n + 2 - \ell_0}.

578: $$

579: To reduce our error probability to a constant, we should take

580: $\ell_0 = 2 \log_2 n + O(1)$, or

581: $$\ell = \ell_0 + \log_2 n = 3 \log_2 n + O(1).$$

582:
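To give a sense of scale (the target error here is a hypothetical choice of
ours), let $n = 1024$, so $\log_2 n = 10$ and the bound above is
$2^{22 - \ell_0}$.  An error probability of at most $1/16$ requires
$$
\ell_0 = 26, \qquad \ell = \ell_0 + \log_2 n = 36.
$$
Each further halving of the error bound costs one more bit in $\ell_0$.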

583: What does an error rate of $\epsilon$ mean in the quantum setting?

584: Instead of attaining the desired state $\qu{\phi}$, we attain a

585: state $\qu{\phihat} = \alpha \qu{\phi} + \eta \qu{\psi}$, where

586: the error state $\qu{\psi}$ is orthogonal to

587: $\qu{\phi}$ and $|\eta|^2 \le \epsilon$.

588: A standard calculation yields that the distance between the probability

589: distributions on measurements for

590: $\qu{\phi}$ and $\qu{\phihat}$ is at most $\epsilon$.

591: Note that an error may mean that we fail to erase scratch space

592: correctly, invalidating future rounds, but this is irrelevant to

593: the analysis.

594:

595: The assumption above of ``random addends'' may not be reasonable.

596: Zalka~\cite{Zalka} discusses this problem: citing a ``private

597: objection'' by Manny Knill, Zalka writes that ``mathematically (and

598: therefore very cautiously) inclined people have questioned the

599: validity of this assumption.''  Our solution is to fill our

600: register with a random constant $z$.  (We can use the same $z$ each

601: time, or we can choose a different one for each multiplication.)

602: The expected probability of an error in computing $\hat{q}$ over

603: all our choices of $z$ is the desired $\epsilon$.

604:

605: However, the constant $z$ introduces another place where errors can

606: occur.  When we subtract $z$ at the end, we do not perform a modular

607: subtraction.  If we ensure $z < m/2^t$, the probability of an error

608: at some point is $4n 2^{-t}$.  We therefore take $t = \log_2 n +

609: O(1)$.  Note that this increases $\ell_0$ to $3 \log_2 n + O(1)$

610: and $\ell$ to $4 \log_2 n + O(1)$.

611:
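For the same illustrative size $n = 1024$ (again our own choice), the
bound $4n 2^{-t}$ equals $2^{12 - t}$, so $t = 16$ keeps the probability of
an error from the non-modular subtraction below $1/16$.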

612: \section{A Classical Version}

613: \label{classical-sec}

614:

615: The circuit of this paper requires numerous small controlled rotations.

616: We now show that a variant of these ideas gives a reversible classical

617: approximate exponentiation circuit with depth $O(n^2 \log n)$ and

618: size $O(n^3)$.

619:

620: We still organize exponentiation as repeated multiplication and

621: multiplication as repeated addition.  On a general architecture, we

622: can attain depth $O(n^2 \log n)$ using a logarithmic-depth

623: adder~\cite{\DKRS}.  On a nearest-neighbor machine, we cannot

624: perform controlled addition in sublinear depth.  As in our main

625: construction, we nest different controlled additions to obtain an

626: amortized depth of $O(\log n)$ per addition.

627:

628: We return to the setting of Section~\ref{main-add-sec}.  We

629: have an $n$-bit register $Z$ (initialized to some value $z$) and

630: an $n$-bit register $Y$.  We wish to write to $Z$ the quantity

631: $z + \sum_i x_i y_i \bmod 2^n$; here the $y_i$s are bits of $y$ and

632: the $x_i$s are $n$-bit constants.

633:

634: We follow the general structure of Figure~\ref{nested-fig}.  Since

635: we wish to build a classical circuit, we no longer perform any

636: {\QFT}s.  Instead, we choose some $t = O(\log n)$, and we write

637: $k = \ceil{n/t}$.  We divide $Z$ into $k$ blocks of size $t$; each

638: ``wire'' of $Z$ in Figure~\ref{nested-fig} represents a single block

639: $Z_j$.  (Each wire of $Y$ is still a single bit $y_i$.)  We also

640: divide each $x_i$ into blocks $X_i^j$ of length $t$.

641:

642: We divide this portion of the circuit into $n+k-1$ rounds.  In

643: round $r$, $y_{r-j}$ crosses $Z_j$ for all $j$ (as long as $0 \le j < k$

644: and $0 \le r-j < n$).  At this time, we add the number

645: $$

646: A_r = \sum_j y_{r-j} X_{r-j}^j 2^{tj}

647: $$

648: into $Z$.  Note that

649: $$\sum_{r=0}^{n+k-2} A_r = \sum_{i=0}^{n-1} x_i y_i$$

650: as desired.  Also note that, in round $r$, the control bit

651: $y_{r-j}$ controlling the $j$th block of $A_r$ is next to $Z_j$ in

652: memory.

653:
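To see why the rounds sum correctly, one can reindex by $i = r - j$.
This is only a sketch: it assumes the blocks are indexed from $j = 0$,
so that $x_i = \sum_{j=0}^{k-1} X_i^j 2^{tj}$, and it treats any term
with $r - j$ outside $0 \le r-j < n$ as absent.
\begin{align*}
\sum_{r} A_r
  &= \sum_{r} \sum_{j=0}^{k-1} y_{r-j} X_{r-j}^j 2^{tj} \\
  &= \sum_{j=0}^{k-1} \sum_{i=0}^{n-1} y_i X_i^j 2^{tj} \\
  &= \sum_{i=0}^{n-1} y_i \sum_{j=0}^{k-1} X_i^j 2^{tj}
   = \sum_{i=0}^{n-1} x_i y_i .
\end{align*}
Each pair $(i,j)$ is handled in exactly one round, namely $r = i + j$,
so the nonempty rounds are $r = 0, \ldots, n+k-2$.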

654: To add $A_r$ into $Z$, we first do $k$ parallel controlled adds, one

655: for each block.  We erase our work, but we write down the high bit

656: $h_j$ for each block.  We hope that we correctly compute each $h_j$;

657: this requires that no carry propagate through an entire block.

658:

659: Next, we again do $k$ parallel controlled adds, but this time, for

660: the $j$th block, we use $h_{j-1}$ as an incoming carry bit.  If

661: the $h_j$ bits are all correct, we correctly add $A_r$ into $Z$.

662:
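As a small illustration (the numbers are chosen for exposition and are
not part of the construction), take $t = 4$ and $k = 2$, and read $h_j$
as the carry out of block $j$.  Suppose $Z = 0110\,1011_2 = 107$ and
$A_r = 0001\,0111_2 = 23$.
\begin{align*}
\text{first pass:} \quad & 1011_2 + 0111_2 = 1\,0010_2, \quad h_0 = 1;
  \qquad 0110_2 + 0001_2 = 0\,0111_2, \quad h_1 = 0, \\
\text{second pass:} \quad & Z_0 = 0010_2;
  \qquad Z_1 = 0110_2 + 0001_2 + h_0 = 1000_2 .
\end{align*}
The result is $1000\,0010_2 = 130 = 107 + 23$.  By contrast, if the high
block of $Z$ were $1111_2$ and the high block of $A_r$ were $0000_2$, the
first pass would record $h_1 = 0$, yet the incoming carry $h_0 = 1$ would
ripple through the entire block and produce a carry out of $1$.  Any
block above it would then receive the wrong incoming carry; this is the
bad case.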

663: Finally, we erase the $h_j$ bits.  We compare $Z_j$ with

664: $y_{r-j} X_{r-j}^j$ to determine if an overflow occurred; if so,

665: $h_j$ must have been $1$.  We then exchange each $y_{r-j}$ bit with

666: $Z_j$ to move the control bits into position for the next round.

667:

668: Each of these steps can be performed with a ripple-carry

669: adder~\cite{\CDKM}; the depth is $Ct$ for a small constant $C$.  We need $2k$

670: extra bits:\ the high bits $h_j$ and one scratch bit for each

671: ripple.\footnote{We cannot use the ripple-carry adder of Takahashi

672: and Kunihiro~\cite{TK}.  Their adder eliminates the scratch bit,

673: but it does not work on a nearest-neighbor machine.}

674:

675: To do modular multiplication, we use the same scheme as in our

676: main construction: we estimate $\hat{q}$ on the side.  The error

677: analysis is the same.  Note that we also perform $O(n^3)$

678: controlled additions of size $t$;

679: the probability that some $h_j$ bit is wrong at

680: some point is thus $O(n^3 2^{-t})$.  We choose $t = O(\log n)$ to

681: reduce this probability to a small constant.

682:
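The choice of $t$ can be made explicit with a union bound.  As a sketch,
suppose that the number of size-$t$ controlled additions is at most
$c\,n^3$ for some constant $c$, and that each one miscomputes its $h_j$
with probability at most $2^{-t}$, in line with the estimate above.
Here $c$ and the target failure probability $\epsilon$ are placeholders,
not quantities fixed by the text.
\begin{gather*}
\Pr[\text{some } h_j \text{ is wrong}] \le c\, n^3 2^{-t}, \\
c\, n^3 2^{-t} \le \epsilon
  \iff t \ge 3 \log_2 n + \log_2 (c/\epsilon).
\end{gather*}
Under these assumptions, $t = 3 \log_2 n + O(1)$ suffices.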

683: We can use the pseudo-Toffoli

684: gates described in Section~\ref{prelim-pseudo-sec} to reduce the

685: depth.  It is interesting to note that, for the ripple-carry adder,

686: we do not perform exactly the same gates when we undo the computation,

687: but the ``bad'' case for the pseudo-Toffoli happens on the forward

688: ripple if and only if it happens on the reverse ripple, so we fix

689: our phase errors correctly.

690:

691: The circuit depth is $O(n^2 \log n)$.  The exact constant depends

692: on the choice of $\ell$ and $t$ and on precisely how we

693: perform the ripple-carry additions.

694:
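The depth and size bounds can be tallied as follows.  This is a rough
accounting under stated assumptions: each round uses a constant number
$p$ of ripple passes of depth $Ct$ (where $p$ is a placeholder),
$t = \Theta(\log n)$, $k = \ceil{n/t}$, and exponentiation uses $O(n)$
modular multiplications.
\begin{align*}
\text{depth per round} &= pCt = O(\log n), \\
\text{depth per multiplication} &= (n+k-1) \cdot O(\log n) = O(n \log n), \\
\text{depth of exponentiation} &= O(n) \cdot O(n \log n) = O(n^2 \log n), \\
\text{size per round} &= k \cdot O(t) = O(n), \\
\text{size of exponentiation} &= O(n) \cdot (n+k-1) \cdot O(n) = O(n^3).
\end{align*}
Since one pass through the rounds folds in $n$ controlled additions (one
per bit $y_i$), the amortized depth is $O(\log n)$ per addition.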

695: \section{General Architectures}

696: \label{general-sec}

697:

698: The ``nested adds'' multiplier of Section~\ref{main-sec} can be

699: simplified in several ways if implemented on a machine without

700: a nearest-neighbor restriction:

701: \begin{itemize}

702: \item The controlled swaps at the start and end of the multiplier can

703: be performed in logarithmic depth.  We fan the control bit $c$ out into

704: an empty $n$-bit register, perform $n$ parallel swaps, and fan $c$

705: back in.  Note that we always have an empty $n$-bit register available.

706: \item The mesh and unmesh operations and any register swaps (all in

707: black in Figure~\ref{nested-mod-mult-fig}) are unnecessary.  This

708: reduces the depth by about $n$ and the size by about $2n^2$.

709: \item The {\QFT} and inverse {\QFT} can be approximated.  This does

710: not improve the depth, but the size of each decreases from

711: about $n^2 / 2$ to $O(n \log n)$.

712: \end{itemize}

713:

714: With these changes, the modular multiplier has depth $6n +

715: 6 (2\ell -\log_2 n)\log_2 n

716: + O(\log n)$, width $3n + 2\ell + 1$, and size $2n^2 + O(n \log n)$.

717: Taking $\ell = 3 \log_2 n + O(1)$ as in Section~\ref{main-error-sec},

718: we get an exponentiation circuit with depth

719: \begin{gather*}

720: 12n^2 + 60 n \log_2^2 n + O(n \log n),

721: \intertext{width}

722: 3n + 6 \log_2 n + O(1),

723: \intertext{and size}

724: 4n^3 + O(n^2 \log n).

725: \end{gather*}

726:
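The leading constants can be checked by substitution.  The stated totals
are consistent with the exponentiation applying the multiplier $2n$
times; that factor is inferred here from the figures rather than restated
from the text.  With $\ell = 3 \log_2 n + O(1)$:
\begin{align*}
6 (2\ell - \log_2 n) \log_2 n
  &= 6 \left( 5 \log_2 n + O(1) \right) \log_2 n
   = 30 \log_2^2 n + O(\log n), \\
2n \left( 6n + 30 \log_2^2 n + O(\log n) \right)
  &= 12 n^2 + 60 n \log_2^2 n + O(n \log n), \\
3n + 2\ell + 1 &= 3n + 6 \log_2 n + O(1), \\
2n \left( 2n^2 + O(n \log n) \right) &= 4 n^3 + O(n^2 \log n).
\end{align*}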

727: We could further reduce the depth by using a parallel version of

728: the {\QFT}~\cite{CW}, but each multiply would still have depth at least

729: $5n + O(\log^2 n)$.

730: We could also consolidate the registers $Q_Y$ and $Q_Z$; we would

731: get a slight increase in depth and a slight decrease in width.

732:

733: \section*{Acknowledgements}

734: The author thanks Bob Beals, Tom Draper, and David Moulton for

735: numerous discussions.

736:

737: \bibliography{nn}

738: \bibliographystyle{alpha}

739:

740: \end{document}
