0809:0809.2266/ms.tex

1: \documentclass[preprint]{aastex}

2: \shorttitle{Scalable Correlator Architecture}

3: \shortauthors{Parsons et al.}

4:

5: \usepackage{amsmath}

6: \usepackage{graphicx}

7: \usepackage{natbib}

8: \citestyle{aa}

9:

10: \begin{document}

11: \title{A Scalable Correlator Architecture Based on

12:     Modular FPGA Hardware, Reuseable Gateware, and Data Packetization}

13:

14: \author{Aaron Parsons, Donald Backer, and Andrew Siemion}

15: \affil{Astronomy Department,

16:     University of California, Berkeley, CA}

17: \email{aparsons@astron.berkeley.edu}

18: \author{Henry Chen and Dan Werthimer}

19: \affil{Space Science Laboratory,

20:     University of California, Berkeley, CA}

21: \author{Pierre Droz, Terry Filiba, Jason Manley\altaffilmark{1},

22:     Peter McMahon\altaffilmark{1}, and Arash Parsa}

23: \affil{Berkeley Wireless Research Center,

24:     University of California, Berkeley, CA}

25: \author{David MacMahon, Melvyn Wright}

26: \affil{Radio Astronomy Laboratory,

27:     University of California, Berkeley, CA}

28:

29: \altaffiltext{1}{Affiliated with Karoo Array Telescope,

30:     Cape Town, South Africa}

31:

32: \begin{abstract}

33: A new generation of radio telescopes is achieving unprecedented levels of

34: sensitivity and resolution, as well as increased agility and field-of-view, by

35: employing high-performance digital signal processing hardware to phase and

36: correlate large numbers of antennas.  The computational demands of these

37: imaging systems scale in proportion to $BMN^2$, where $B$ is the signal

38: bandwidth, $M$ is the number of independent beams, and $N$ is the number of

39: antennas.  The specifications of many new arrays lead to demands in excess of

40: tens of PetaOps per second.

41:

42: To meet this challenge, we have developed a general purpose correlator

43: architecture using standard 10-Gbit Ethernet switches to pass data

44: between flexible hardware modules containing Field Programmable Gate Array

45: (FPGA) chips.  These chips are programmed using open-source signal processing

46: libraries we have developed to be flexible, scalable, and chip-independent.

47: This work reduces the time and cost of implementing a wide range of signal

48: processing systems, with correlators foremost among them, and facilitates

49: upgrading to new generations of processing technology. We present several

50: correlator deployments, including a 16-antenna, 200-MHz bandwidth, 4-bit, full

51: Stokes parameter application deployed on the Precision Array for Probing the

52: Epoch of Reionization.

53: \end{abstract}

54:

55: \keywords{Astronomical Instrumentation}

56:

57:

58: % --------------------------------------------------------------------------

59: % Section 1

60: % --------------------------------------------------------------------------

61: \section{Introduction}

62: \label{sec:intro}

63:

64: Radio interferometers, which operate by correlating the signals from two or

65: more antennas, have many advantages over traditional single-dish telescopes,

66: including greater scalability, independent control of aperture size and

67: collecting area, and self-calibration.  Since the first digital correlator

68: built by Weinreb \citep{weinreb_1961}, the processing power of

69: these systems has been tracking the Moore's Law growth of digital electronics.

70: The decreasing cost per performance of these systems has influenced the design

71: of many new radio antenna array telescopes.  Some

72: next-generation array telescopes at meter, centimeter and millimeter

73: wavelengths are:

74: the LOw Frequency ARray (LOFAR),

75: the Precision Array for Probing the Epoch of Reionization (PAPER),

76: the Murchison Widefield Array (MWA),

77: the Long Wavelength Array (LWA),

78: the Expanded Very Large Array (EVLA),

79: the Allen Telescope Array (ATA),

80: the Karoo Array Telescope (MeerKAT),

81: the Australian Square Kilometre Array Pathfinder (ASKAP),

82: the Atacama Large Millimeter Array (ALMA),

83: and the Combined Array for Research in Millimeter-wave Astronomy (CARMA).

84: This paper presents a novel approach to the intense digital signal

85: processing requirements of these instruments that has many other applications

86: to astronomy signal processing.

87:

88: While each generation of electronics has brought new commodity data processing

89: solutions, the need for high-bandwidth communication between processing nodes

90: has historically led to specialized system designs.  This communication

91: problem is particularly germane for correlators, where the number of

92: connections between nodes scales with the square of the number of antennas.

93: Solutions to date have typically consisted of specialized processing boards

94: communicating over custom backplanes using non-standard protocols.  However,

95: such solutions have the disadvantage that each new generation of digital

96: electronics requires an expensive and time-consuming engineering effort

97: to re-solve the same connectivity problem.  Redesign is driven by the same

98: Moore's Law that makes digital interferometry attractive, and is not unique to

99: the interconnect problem; processors such as Application-Specific Integrated

100: Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs) also require

101: redesign, as do the boards bearing them, and the signal processing algorithms

102: targeting their architectures.

103:

104: Our research is aimed at reducing the time and cost of correlator design and

105: implementation.  We do this, firstly, by developing a packetized communication

106: architecture relying on industry-standard Ethernet switches and protocols to

107: avoid redesigning backplanes, connectors, and communication protocols.

108: Secondly, we develop flexible processing modules that allow identical boards to

109: be used for a multitude of different processing tasks.  These boards are

110: applicable to general signal processing problems that go beyond

111: correlators and even radio science to include, e.g., ASIC design and

112: simulation, genomics, and research into parallel processor architectures.

113: General

114: purpose hardware reduces the number of boards that have to be redesigned and

115: tested with each new generation of electronics.  Thirdly, we create

116: parametrized signal processing libraries that can easily be recompiled and

117: scaled for each generation of processor.  This allows signal processing systems

118: to quickly take advantage of the capabilities of new hardware.  Finally, we

119: employ an extension of a Linux kernel to interface between CPUs and FPGAs for

120: the purposes of testing and control, presenting a standard file interface

121: for interacting with FPGA hardware.

122:

123: This paper begins with a presentation of the new correlator

124: design architecture in \S\ref{sec:architecture}. The hardware to

125: implement this architecture follows in \S\ref{sec:hardware}, and

126: the FPGA gateware used in the hardware is summarized in \S\ref{sec:gateware}.

127: Issues concerning system integration are given in \S\ref{sec:integration},

128: and the performance characterization of subsystems is given in

129: \S\ref{sec:characterization}. Results from our first deployments of

130: the packetized correlator are displayed in \S\ref{sec:deployments}.

131: Our final section summarizes our progress and points to a number

132: of directions we are pursuing for the next generation of scalable

133: correlators based on modular hardware, reuseable gateware and

134: data packetization. An appendix gives a glossary of technical

135: acronyms since this paper makes heavy use of abbreviated terms.

136:

137: % --------------------------------------------------------------------------

138: % Section 2

139: % --------------------------------------------------------------------------

140: \section{A Scalable, Asynchronous, Packetized FX Correlator Architecture}

141: \label{sec:architecture}

142:

143: Correlators integrate the pairwise correlation between complex voltage samples

144: from polarization channels of array antenna receivers at a set of

145: frequencies.

146: Once instrumental effects have been calibrated and removed, the resultant

147: correlations (called visibilities) represent the self-convolved electric field

148: across an aperture sampled at locations

149: corresponding to separations between antennas.  These visibilities can be

150: used to reconstruct an image of the sky by inverting the interferometric

151: measurement equation:

152: \begin{equation}

153: V_{\nu}(u,v)=\int\!\!\!\!\int{G_{i,\nu}G_{j,\nu}^*I_\nu(\ell,m)}

154: {e^{-2\pi i(u\ell+vm+w(\sqrt{1-\ell^2-m^2}-1))}d\ell dm}

155: \label{eq:vis}

156: \end{equation}

157: $I_\nu$ represents the sky brightness in angular coordinates $(\ell,m)$, and

158: $(u,v,w)$ correspond to the separation in wavelengths of an antenna pair

159: relative to a pointing direction.

160: For antennas with separate polarization feeds, cross-correlation

161: of polarizations yields components of the four Stokes parameters that

162: characterize polarized radiation, here defined in terms of linear

163: polarizations ($\|,\perp$) for all pairs of antennas $A$ and $B$

164: \citep{rybicki_lightman1979}:

165: \begin{equation}

166: \begin{array}{ll}

167: \displaystyle I=A_\| B_\|^*+A_\perp B_\perp^* &\ \ \

168: Q=A_\| B_\|^*-A_\perp B_\perp^* \nonumber \\

169: \displaystyle U=A_\| B_\perp^*+A_\perp B_\|^* &\ \ \

170: V=A_\| B_\perp^*-A_\perp B_\|^*

171: \label{eq:pol}

172: \end{array}

173: \end{equation}

174: $I$ measures total intensity, $V$ measures the degree of circular polarization,

175: and $Q$ and $U$ measure the amplitude and orientation of linear polarization.
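As an illustration of what inverting Equation \ref{eq:vis} involves, consider
a simplified case, assumed here only for the sake of the sketch: the field of
view is small enough that $\sqrt{1-\ell^2-m^2}\approx1$, and the gain terms
have been calibrated to unity.  The $w$ term then vanishes and the measurement
equation reduces to a two-dimensional Fourier transform,
\begin{displaymath}
V_{\nu}(u,v)\approx\int\!\!\!\!\int I_\nu(\ell,m)\,
e^{-2\pi i(u\ell+vm)}\,d\ell\,dm ,
\end{displaymath}
so that the sky brightness can be estimated with the inverse transform,
\begin{displaymath}
I_\nu(\ell,m)\approx\int\!\!\!\!\int V_{\nu}(u,v)\,
e^{+2\pi i(u\ell+vm)}\,du\,dv .
\end{displaymath}
Because an array measures $V_\nu$ only at the $(u,v)$ points provided by its
antenna pairs, the estimate obtained this way is the true sky convolved with
the response of that sampling pattern.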

176:

177: The problem of computing pairwise correlation as a function of frequency can be

178: decomposed in two mathematically equivalent but architecturally distinct ways.

179: The first architecture is known as ``XF'' correlation because it first

180: cross-correlates antennas (the ``X'' operation) using a time-domain ``lag''

181: convolution, and then computes the spectrum (the ``F'' operation) for each resulting

182: baseline using a Discrete Fourier Transform (DFT).  An alternate architecture

183: takes advantage of the fact that convolution is equivalent to multiplication in

184: the Fourier domain.  This second architecture, called ``FX'' correlation, first

185: computes the spectrum for each individual antenna (the F operation), and then

186: multiplies pairwise all antennas for each spectral channel (the X operation).

187: An FX correlator has an advantage over XF

188: correlators in that the operation that scales as $O(N^2)$ with the

189: number of antennas, $N$, is a complex multiplication as opposed to a full

190: convolution in an XF correlator \citep{daddario2001,yen1974}.
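The equivalence of the two decompositions can be written out explicitly.  The
following is a sketch in notation introduced here for illustration: $x_i[n]$
is the sampled voltage of antenna $i$, $X_i[k]$ is its $K$-point DFT, and the
lag correlation is treated as circular over a window of $K$ samples (indices
taken modulo $K$).  The XF form first computes the lag function
\begin{displaymath}
r_{ij}[\tau]=\sum_{n=0}^{K-1}x_i[n]\,x_j^*[n-\tau],
\end{displaymath}
and then transforms it over $\tau$.  Substituting $m=n-\tau$ separates the
double sum:
\begin{displaymath}
\sum_{\tau=0}^{K-1}r_{ij}[\tau]\,e^{-2\pi ik\tau/K}
=\left(\sum_{n=0}^{K-1}x_i[n]\,e^{-2\pi ikn/K}\right)
 \left(\sum_{m=0}^{K-1}x_j[m]\,e^{-2\pi ikm/K}\right)^{\!*}
=X_i[k]\,X_j^*[k],
\end{displaymath}
which is exactly the per-channel product formed by the FX form.  Counting
multiplications only, and to leading order, a lag correlator that forms $K$
lags performs roughly $K$ multiplications per input sample for each baseline,
whereas the FX form performs one complex multiplication per sample for each
baseline, plus a per-antenna transform whose cost grows as $\log K$ per
sample when computed with a Fast Fourier Transform.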

191:

192: Though there are mitigating factors (such as bit-growth for representing the

193: higher dynamic range of frequency-domain data) that favor XF correlators for

194: small numbers of antennas \citep{thompson_et_al2001}, FX correlators are more

195: efficient for larger arrays.  Since scalability to large numbers of antennas is

196: one of the primary motivations of our correlator architecture, we have chosen

197: to develop FX architectures exclusively.

198:

199: \subsection{Scalability With Number of Antennas and Bandwidth}

200: \label{sec:scalability}

201:

202: The challenge of creating a scalable FX correlator is in designing a

203: scalable architecture for factoring the total computation into manageable

204: pieces and efficiently bringing together data in each piece for computation.

205: Traditionally, the spectral decomposition (in F engines) has been scaled to

206: arbitrary bandwidths by using analog mixers and filters to divide the operating

207: band of each antenna into the widest subbands that can be processed digitally

208: using existing technology.  Within correlation of a given subband,

209: the complexities of computation and of data distribution both scale

210: linearly with bandwidth and quadratically with the number of antennas. It is

211: imperative that the arrangement of cross-multiplication engines (hereafter

212: referred to as X engines) minimize data replication/retransmission, even as X

213: engines expand to encompass many boards.  Fortunately, each frequency channel

214: of an FX correlator is computationally independent, providing a natural

215: boundary for dividing computation among processing nodes.

216:

217: \placefigure{fig:corr_arch1}

218:

219: Figure \ref{fig:corr_arch1} illustrates a simplistic architecture for an FX

220: correlator that takes advantage of the computational independence of channels

221: to avoid unnecessary data transmission;

222: the total X computation has been factored into X engines that cross-multiply

223: all antenna pairs for a single frequency channel.

224: This architecture is overly

225: simplistic, since an X engine's performance can be equated to an aggregate

226: input bandwidth that it can handle.  For the sake of efficiency, an X engine

227: processor

228: should receive as many channels as it has capacity to process.  In this case,

229: the number of X engines is given by:

230: \begin{equation}

231: \#\ {\rm X\ Engines} = \frac{({\rm Antenna\ Bandwidth})\times

232:   (\#\ {\rm Antennas})}{ {\rm X\  Engine\ Processing\ Bandwidth}}

233: \end{equation}

234: Multiplexing channels into X engines makes cross-multiplication

235: complexity independent of the number of channels.  There are three

236: potential bottlenecks for scaling this architecture: the complexity of

237: interconnecting F engines and X engines, the bandwidth into individual X

238: engines, and the amount of computation in an X engine relative to the size of a

239: processing chip/board/system.  Each of these bottlenecks warrants further

240: discussion.
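As a worked illustration of the expression above, with hypothetical numbers
chosen only for arithmetic convenience, consider $N=32$ antennas of 100 MHz
bandwidth each, and X engines that can each process an aggregate input
bandwidth of 200 MHz:
\begin{displaymath}
\#\ {\rm X\ Engines} = \frac{(100\ {\rm MHz})\times 32}{200\ {\rm MHz}} = 16 .
\end{displaymath}
Doubling the number of antennas to 64 doubles the count to 32, while the
number of frequency channels into which each band is divided does not enter
the result.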

241:

242: The potential bottleneck of connecting $N$ antenna-based F engines to $M$

243: channel-based X engines is highlighted by the criss-crossed lines in Figure

244: \ref{fig:corr_arch1}.  Historically, this bottleneck has been addressed with

245: custom backplanes and transmission protocols.  However, our group has taken the

246: novel approach of using high-performance, commercially available,

247: 10-Gbit/s Ethernet (10GbE) switches to solve this problem.

248: As will be discussed, these switches currently have the bandwidth and switching

249: capacity to handle large correlators, and represent a negligible fraction of

250: the total cost of correlator hardware.  Furthermore, switching technology is

251: driven by commercial applications and by Moore's Law, making it likely that

252: future switches will continue increasing in number of ports and bandwidth per

253: port.

254:

255: A second potential bottleneck concerns how data rates and

256: numbers of X engines scale with antenna bandwidth.  It is important that

257: we consider various bandwidth cases, owing to the variety of science

258: applications driving large, next-generation systems.  For example, correlators

259: for large arrays of low-bandwidth antennas will need to multiplex data into

260: higher bandwidth processors, while arrays with larger bandwidths will face the

261: opposite problem. In our architecture, we make the reasonable assumption that

262: the number of frequency channels always exceeds the number of antennas.

263: This assumption

264: ensures that the per-port bandwidth into an X engine never exceeds what is

265: transmitted per antenna.  Multiple channels may then be mapped into an X engine

266: up to its computational capacity (allowing efficient resource utilization for

267: low-bandwidth arrays), and additional X engines may be added for high-bandwidth

268: applications.  Antenna bandwidths requiring transmission above 10 Gbits/s can

269: be accommodated by connecting F engines to multiple 10GbE ports.

270: Frequency channels are then assigned to each port, which connect separate

271: switches and sub-networks of X engines.  In this way, bandwidths may be scaled

272: up to the transmission capability of an F processor by increasing the number of

273: subnets, and not switch complexity.
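The number of 10GbE ports, and hence subnets, needed per F engine follows
from simple arithmetic.  As a sketch, assume (hypothetically) that an F engine
emits complex samples at a rate equal to the antenna bandwidth $B$, with $b$
bits for each of the real and imaginary parts and $P$ polarizations, and
ignore packet header overhead, which would raise the rate slightly.  The
output data rate $R$ and the number of ports are then
\begin{displaymath}
R = 2\,b\,P\,B, \qquad
\#\ {\rm ports} = \left\lceil \frac{R}{10\ {\rm Gbit/s}} \right\rceil .
\end{displaymath}
For $b=4$, $P=2$, and $B=200$ MHz this gives $R=3.2$ Gbit/s, which fits on a
single port.  For the same sample format and $B=1$ GHz, $R=16$ Gbit/s, so the
F engine needs two ports, each carrying half of the frequency channels to its
own subnet of X engines.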

274:

275: The third and final potential bottleneck concerns how the size of individual X

276: engines scales with the number of antennas.  Both large and small numbers of

277: antennas pose scaling problems.  The size of an X engine responsible for

278: computing all baseline cross-multiples with a fixed input data rate

279: scales as $O(N)$, while

280: the number of X engines required to accommodate the expanding data bandwidth

281: with increasing numbers of antennas also scales as $O(N)$,

282: accounting for the $O(N^2)$ scaling of computing in a correlator.  For

283: sufficiently large $N$, the size of an X engine can exceed the size of any

284: processing chip or board.  Our solution has been to develop an X engine whose

285: pipelined architecture allows it to be split across multiple processors with

286: simple point-to-point connectivity.  This allows many processors to be chained

287: together from a switch port to meet the computational demands of an X engine.

288: Scaling to small $N$ is equally challenging, because the aggregate correlator

289: bandwidth decreases as $O(N)$, while computational complexity scales down as

290: $O(N^2)$.  As a result, the X engines that fit onto a chip/board can have

291: more processing capacity than the rate at which data can be received.  The

292: threshold where this problem is encountered can be changed by designing

293: processors with greater connectivity, but once hardware is fixed, there is no

294: other recourse but to accept a certain inefficiency for low numbers of

295: antennas.  While this is a fundamental limitation of our architecture,

296: the cost of small correlators is typically dominated by development

297: (not hardware), so a certain architectural inefficiency can be accommodated for

298: the savings it affords in development time.
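The scaling argument above can be made explicit.  In this sketch the symbols
$B$ (bandwidth per antenna) and $B_X$ (aggregate input bandwidth of one X
engine, summed over antennas) are notation introduced here for illustration,
and the count assumes one complex multiplication per antenna pair per
frequency sample for a single polarization product, including
autocorrelations.  An X engine receives samples from all $N$ antennas, so it
covers a bandwidth $B_X/N$ and must form $N(N+1)/2$ products for each sample
in that bandwidth:
\begin{displaymath}
{\rm multiplications\ per\ second\ per\ X\ engine}
= \frac{B_X}{N}\times\frac{N(N+1)}{2} = \frac{B_X\,(N+1)}{2},
\end{displaymath}
which is $O(N)$ for fixed $B_X$.  The number of X engines is $NB/B_X$, also
$O(N)$, so the total is
\begin{displaymath}
\frac{NB}{B_X}\times\frac{B_X\,(N+1)}{2} = \frac{B\,N(N+1)}{2},
\end{displaymath}
which recovers the $O(N^2)$ growth with the number of antennas and the linear
growth with bandwidth.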

299:

300: \subsection{Globally Asynchronous Locally Synchronous Systems}

301: \label{sec:gals}

302:

303: Packetized data transmission simplifies the cross-connect problem inherent to

304: correlators, but this comes at the price of global synchronicity.  Packetized

305: communication is fundamentally asynchronous: data can arrive scrambled,

306: delayed, or not at all.  Locally-synchronous X engine processing must therefore

307: transition from being timing-driven (with throughput tied to an FPGA clock, for

308: example) to being asynchronously data-driven.  Though data buffers and control

309: signals complicate development, Globally Asynchronous Locally Synchronous

310: (GALS) design facilitates system integration and leads to robust design

311: \citep{chapiro1984,luis_et_al2007}.  Processors run at clock rates above the

312: data rate, using local oscillators that can drift with temperature.  By

313: allowing for non-transmission of data, individual components can fail without

314: causing global failure--an important feature for large systems where

315: components may fail regularly during operation.  GALS design also insulates

316: processing architectures from decisions regarding sample rates and antenna

317: bandwidths, allowing for greater operational flexibility.  Finally, individual

318: processing elements may be redesigned and upgraded in a GALS system without

319: affecting the overall architecture, facilitating early adoption of new

320: technology.

321:

322: Data-driven processing on locally synchronous processors like FPGAs requires

323: controlling propagation through the processing pipeline.  However, routing

324: control signals to every multiplier, accumulator, and logic element in a

325: pipeline can lead to excessive routing and gating demands.  To avoid this, we

326: have implemented a window-based processing architecture for algorithms where

327: the results derived from one set of data samples are computationally

328: independent from the next.  In this architecture, processing elements are

329: allowed to run freely at their native rate without being enabled/disabled, but

330: are only provided data when an entire window of data has been buffered.  These

331: windows of data are provided synchronously with the inherent window boundaries

332: of the processing element, and an entire output window is flagged as valid.

333: Internally, a processor processes both valid and invalid data--it is only the

334: external buffering system that keeps track of data validity.  This technique is

335: applicable to many common operators such as cross-multipliers, DFTs, and

336: accumulators.  Finite Impulse Response (FIR) filtering is an

337: operation notable for not being window-based.

338:

339: \subsection{Example Applications}

340: \label{sec:example}

341:

342: \placefigure{fig:ex_app1}

343:

344: Perhaps the best method for demonstrating the flexibility and scalability of

345: our correlator architecture is through example applications.  To illustrate

346: techniques for using hardware and ports efficiently, we will map processing

347: into fictitious hardware that corresponds roughly in capability to the

348: CASPER (Center for Astronomy Signal Processing

349: and Engineering Research)\footnote{http://casper.berkeley.edu}

350: hardware discussed in Section \ref{sec:hardware}.

351:

352: Our first example (Fig. \ref{fig:ex_app1}) illustrates an antenna signal

353: bandwidth sufficiently low so that data from 2 polarization channels of 2

354: antennas can be transmitted over one 10GbE connection.  Assuming that the

355: number of antennas evenly divides the number of frequency channels, and that

356: the processing bandwidth of an X engine matches the data bandwidth of one

357: antenna, there will be the same number of X engines as F engines, and each X

358: engine will receive $1/N$ of the total bandwidth, where $N$ is

359: the number of antennas.  F engine

360: transmission and X engine reception are combined on a single port to make use

361: of the bi-directionality of 10GbE.  This optimization halves the size of the

362: switch needed.  Multiple X processors can be chained together from a single

363: 10GbE port using point-to-point connections.  For cases where the number of

364: antennas does not evenly divide the number of frequency channels, one can adjust

365: packet transmission to drop remainder channels so that the band may be equally

366: divided among X engines.
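The division of channels among X engines can be illustrated with hypothetical
numbers.  Writing $K$ for the number of frequency channels and $N_X$ for the
number of X engines (symbols introduced here for the sketch), each X engine
receives $\lfloor K/N_X\rfloor$ channels and $K \bmod N_X$ channels are
dropped.  With $K=2048$ and $N_X=16$ the band divides evenly:
\begin{displaymath}
\frac{2048}{16} = 128\ {\rm channels\ per\ X\ engine}.
\end{displaymath}
With $K=2048$ and $N_X=12$ it does not:
\begin{displaymath}
\left\lfloor\frac{2048}{12}\right\rfloor = 170, \qquad
2048 - 12\times170 = 8\ {\rm channels\ dropped},
\end{displaymath}
a loss of about 0.4\% of the band in exchange for an equal load on every X
engine.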

367:

368: \placefigure{fig:ex_app2}

369:

370: A second example (Fig. \ref{fig:ex_app2}) illustrates a case where the

371: bandwidth from a single F engine exceeds the transmission capacity of a 10GbE

372: link.  Here, data can be split by frequency channel across two

373: ports.  Since different channels are never cross-multiplied, each of these

374: links goes to a separate subnet of switched X engines.  Thus,

375: two smaller (and often less expensive per port) switches may be

376: substituted for one large

377: one.  Each X engine still receives the same bandwidth as in the previous

378: example, although this now represents a smaller fraction of the total

379: bandwidth.  Note that the same X processor used in the first example functions

380: here without modification.  Only the number of X engines and the transmission

381: pattern have changed.

382:

383: \placefigure{fig:ex_app3}

384:

385: A final example (Fig. \ref{fig:ex_app3}) explores the case where the capacity

386: of an X processor and a 10GbE link both exceed the data bandwidth.  In this

387: case, multiple F engines can (but do not have to) be chained together to

388: minimize the number of switched ports.  As should be the case, only half as

389: many X engines (as compared to Fig. \ref{fig:ex_app1}) are necessary for a

390: given number of antennas.  X processors operate in the same configuration as

391: before, oblivious to changes in F engines.

392:

393: These examples highlight the flexibility of the hardware and gateware for

394: targeting a number of applications. One shortcoming they also illustrate is

395: how the cabling between components differs for different bandwidths.

396: Therefore the different bandwidth operations are not as easily reconfigured as

397: might be desired for varying science goals on a given telescope. Research is

398: ongoing to improve the rapid reconfigurability that is an essential

399: specification for the most general radio interferometer array applications.

400:

401: % --------------------------------------------------------------------------

402: % Section 3

403: % --------------------------------------------------------------------------

404: \section{Modular, FPGA-based Processing Hardware}

405: \label{sec:hardware}

406:

407: A flexible and scalable correlator architecture is of limited use without

408: equally dynamic processing hardware that can support a variety of

409: configurations.  FPGAs provide a unique combination of flexibility and

410: performance that make them well-suited for moderate-scale signal processing

411: applications such as correlators and spectrometers \citep{parsons_et_al2006}.

412: A primary goal of the CASPER group has been development of

413: multipurpose processing modules that can be of general use to the astronomy

414: signal processing community, and beyond.  We seek to

415: minimize the effort of redesigning and upgrading hardware by modularizing

416: processing hardware, by minimizing the number of different modules

417: in a system, and by employing industry-standard interconnection protocols.

418:

419: Hardware modularity is the idea that boards should have consistent interfaces

420: in order to be connectible with an arbitrary number of heterogeneous components

421: to meet the computing needs of an application (``computing by the yard''), and

422: that upgrading/revising a component does not change the way in which components

423: are combined in the system.

424: Minimization of hardware reproduction costs is often used to motivate the

425: design of specialized hardware for large-scale correlators.  However,

426: the longer development times inherent to such solutions, and

427: the necessity of targeting specific components from the outset,

428: suggest that a modular solution, initiated nearer to the deployment date,

429: will employ newer technology that costs less and uses less

430: power per operation.  The predicted economy of mass-producing

431: specially-designed hardware must be tempered by its expected devaluation

432: by Moore's Law over the course of correlator development.  This devaluation

433: supports the argument that hardware modularity can reduce the overall system

434: cost, even for large-scale systems, by reducing development time.

435:

436: In current correlator systems, we rely on two

437: CASPER FPGA-based processing boards; Internet Break-Out Boards (IBOBs) are

438: generally used for implementing per-antenna F engine processing, and

439: second-generation Berkeley Emulation Engines (BEE2s) implement X engine

440: processing.  Work is progressing on a new board, the Reconfigurable Open

441: Architecture for Computing Hardware (ROACH), that will provide a single-board

442: solution to both F and X processing.

443:

444: \placefigure{fig:ibobadcbee2}

445:

446: IBOBs (Fig. \ref{fig:ibobadcbee2}) can interface to two

447: Analog-to-Digital Converter (ADC) boards, each capable of digitizing two

448: streams at 1 Gsamples/sec or a single stream at 2 Gsamples/sec using an Atmel

449: AT84AD001B dual 8-bit ADC chip.  This data is processed by a Xilinx XC2VP50

450: FPGA containing 232 18$\times$18-bit multipliers, two PowerPC CPU cores, and

451: over 53,000 logic cells.  Two ZBT SRAM chips provide 36 Mbits of extra

452: buffering, and two 10GbE-compatible CX4 connectors provide a standard interface

453: for connecting to other boards, switches, and computers.  A detailed discussion

454: of ADC signal fidelity is presented in Section \ref{sec:characterization}.

455: We are developing a second ADC board that samples four signals at

456: 200 Msamples/sec.

457:

458: The BEE2 board \citep{chang_et_al2005} (Fig. \ref{fig:ibobadcbee2}) was

459: originally designed for high-end reconfigurable computing applications such as

460: ASIC design, but has been conscripted for astronomy applications in a

461: collaboration between the BWRC\footnote{Berkeley Wireless Research Center

462: http://bwrc.eecs.berkeley.edu},

463: the UC Berkeley Radio Astronomy Laboratory, and the UC Berkeley SETI group.

464: The 500

465: Gops/sec of computational power in the BEE2

466: is provided by 5 Xilinx XC2VP70 Virtex-II Pro

467: FPGAs, each containing 328 multipliers, two PowerPC CPU cores capable of

468: running Linux, and over 74,000 configurable logic cells.  Each FPGA connects to

469: 4 GB of DDR2-SDRAM, and four 10GbE-compatible CX4 connectors, and all FPGAs

470: share a 100-Mbps Ethernet port.  The size and connectivity of the

471: BEE2 board make it suitable for implementing X engine processing in our

472: correlator architecture.

473:

474: The ROACH board is being developed in collaboration with MeerKAT and

475: NRAO,\footnote{The National Radio Astronomy

476: Observatory (NRAO) is owned and operated by Associated Universities, Inc. with

477: funding from the National Science Foundation}

478: and is scheduled for release in the third quarter of 2008.  It is intended as a

479: replacement for both IBOB and BEE2 boards.  A single Xilinx Virtex-5 XC5VSX95T

480: FPGA containing 94,000 logic cells and 640 multiplier/accumulators provides 400

481: Gops/sec of processing power and is connected to a separate PowerPC 440EPx

482: processor with a 1 GbE network connection.  The board contains 4 GB of DDR2

483: DRAM and two 36Mbit QDR SRAMs, four 10GbE-compatible CX4 connectors, and two

484: interfaces that allow the use of the current ADC boards, or a new 3

485: Gsamples/sec (6 Gsamples/sec dual-board interleaved) ADC.  The scale, economy,

486: and peripheral interfaces of this board will make it appropriate for both F and

487: X engine processing, and will enable a single-board correlator architecture.

488:

489: \placetable{tab:hardware_price}

490:

491: % --------------------------------------------------------------------------

492: % Section 4

493: % --------------------------------------------------------------------------

494: \section{Gateware}

495: \label{sec:gateware}

496:

497: Efficient, customizable signal processing libraries are another important

498: component of a flexible and scalable correlator architecture.  Towards this

499: goal, our group has designed a set of open-source libraries\footnote{Available

500: at http://casper.berkeley.edu} for the Simulink/Xilinx System Generator FPGA

501: programming language.  These libraries abstract chip-specific components to

502: provide high-level interfaces targeting a wide variety of devices.  Signal

503: processing blocks in these libraries are parametrized to scale up and down to

504: arbitrary sizes, and to have selectable bit widths, latencies, and scaling.

505: Though the design principles of parametrization and scalability have added

506: complexity to the initial design of these libraries, they dramatically enhance

507: the libraries' applicability and potential for longevity as hardware evolves.  They also

508: decrease testing time by allowing developers to debug scale models of systems

509: that derive from the same parametrization code and are behaviorally similar to

510: larger systems.  In this section, we present several components of our

511: libraries vital to the design of flexible correlators.

512:

513: \subsection{A Digital Down-Converter}

514: \label{sec:downconverter}

515:

516: The rising speed of ADCs has enabled digitization to occur increasingly early

517: in the antenna receiver chain.  We are thus replacing analog electronics

518: commonly known as the intermediate frequency processor (gain, band definition)

519: and the baseband mixer (conversion to zero frequency and filtering).

520: There are numerous advantages to doing this.

521: Digital mixing allows dynamically selecting an operating frequency within the

522: digitized band while ensuring perfect sine-cosine phasing in the local

523: oscillator (LO) mixing frequency.

524: Digitizing a wider bandwidth than will be ultimately processed makes analog

525: filtering less critical; inexpensive filters with slow roll-offs can be

526: used, and passband rippling can be corrected.  Finally, digital filtering

527: allows flexibility and control in selecting passband shapes and adjusting fine

528: delays.  One can even split out several bands from the same signal.

529: The issue of quantization levels and other digital artifacts needs to be

530: carefully addressed.

531:

532: Our library provides a digital down-conversion core with a runtime-selectable

533: mixing frequency.  Using a discretely sampled sine wave in an addressable

534: lookup table, we can approximate nearly any mixing frequency by rounding a wide

535: accumulation register (incremented every clock) to the nearest address in the

536: lookup table.  Digital sine waves have an accuracy dictated by the number of

537: bits used to represent a value; a lookup table need only have enough samples to

538: achieve comparable accuracy.  The fact that the derivative of $\sin(x)$ reaches

539: a maximum magnitude of 1 allows the sampling interval of a sine wave to be

540: simply equated to the accuracy of a coefficient over that time interval.

541: As a result, a lookup table only need be addressed with the same

542: bit-width as the sample width to implement an arbitrary mixing frequency.
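
The lookup-table mixer described above can be illustrated with a minimal software sketch.  It is not the library's gateware: the function names, the 32-bit accumulator, and the 8-bit table address are assumptions chosen for illustration.  A wide phase accumulator is incremented every clock and rounded to the nearest table address, so the phase error is at most half a table step, and because the derivative of $\sin(x)$ never exceeds 1 in magnitude the amplitude error is bounded by the same quantity.

```python
import math


def make_sine_lut(addr_bits):
    """One full sine period sampled at 2**addr_bits evenly spaced phases."""
    n = 1 << addr_bits
    return [math.sin(2.0 * math.pi * k / n) for k in range(n)]


def nco(freq_word, n_samples, acc_bits=32, addr_bits=8):
    """Generate LO samples from a phase accumulator and a sine lookup table.

    freq_word / 2**acc_bits is the mixing frequency in cycles per clock.
    Hypothetical parameters for illustration; assumes acc_bits > addr_bits.
    """
    lut = make_sine_lut(addr_bits)
    shift = acc_bits - addr_bits
    half_step = 1 << (shift - 1)          # added before truncation to round
    acc_mask = (1 << acc_bits) - 1
    addr_mask = (1 << addr_bits) - 1
    acc = 0
    samples = []
    for _ in range(n_samples):
        # Round the wide accumulator to the nearest table address.
        addr = ((acc + half_step) >> shift) & addr_mask
        samples.append(lut[addr])
        # Advance the phase once per clock, wrapping at one full cycle.
        acc = (acc + freq_word) & acc_mask
    return samples
```

For example, \texttt{nco(1 << 29, 8)} produces a tone with a period of 8 clocks, while an arbitrary \texttt{freq\_word} gives samples whose error relative to an ideal sine stays within $\pi/2^{8}$ for the 8-bit table assumed here.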

543:

544: \placefigure{fig:ddc_passband}

545:

546: Our library also contains a decimating FIR filter.  Digital filters have

547: advantages over analog filters by being reprogrammable and by providing exact,

548: calculable passbands.  This filter is often used for suppressing harmonics of

549: the mixing frequency and for steepening the rolloff of cheaper analog filters,

550: but it has also been relied upon for implementing IF sub-band selection

551: digitally.  In practice, one must weigh the need for performance and

552: flexibility against the cost of FPGA resources compared to analog filters.  As

553: an example, the response of the FIR filter used in various correlator designs

554: is shown in Figure \ref{fig:ddc_passband}.  Since the exact shape

555: of this filter can be calculated, it is possible to remove passband

556: ripple post-channelization because of the large dynamic range available in

557: the output of our FFT core.

558:

559: \subsection{A Polyphase Filter Bank Front-End}

560: \label{sec:pfb}

561:

562: The Polyphase Filter Bank (PFB) \citep{crochiere+rabiner1983, vaidyanathan1990}

563: is an efficient implementation of a bank of evenly spaced, decimating FIR

564: filters.  The PFB algorithm decomposes these filters into a single polyphase

565: convolution followed by a DFT.  Since DFTs have been highly optimized

566: algorithmically, this results in an extremely efficient implementation.

567: Equivalently, the PFB may be regarded as an improvement on the Fast Fourier

568: Transform (FFT) that uses a front-end polyphase FIR filter to improve the

569: frequency response of each spectral channel (Fig. \ref{fig:pfb_bin_resp}).

570: This improvement comes at the cost of buffering an additional window of samples

571: and adding a complex cross-multiplication for each additional tap in the

572: polyphase FIR.  This PFB implementation has seen widespread use in the astronomy

573: community in 21 cm hydrogen surveys \citep{heiles_et_al2004}, pulsar surveys

574: \citep{demorest_et_al2004}, antenna arrays \citep{bradley_et_al2005}, Very Long

575: Baseline Interferometry, and other applications.

576:

577: \placefigure{fig:pfb_bin_resp}

578:

579: Our core is parametrized to use selectable windowing functions, allowing

580: adjustment of the out-of-band rejection and passband ripple/rolloff.  Blackman

581: and Tukey \citep{blackman_tukey1958} provide a summary of the characteristics

582: and trade-offs of various windows.  Each polyphase FIR tap, at the cost of

583: increased buffering and additional multipliers, increases filter steepness by

584: adding samples (in increments of the number of channels) to the time window

585: used in the PFB.  For fixed-point implementations, a practical upper limit to

586: the number of PFB taps is set by the number of bits used to represent filter

587: coefficients; the sinc function's 1/x tapering ceases to be representable when

588: $\pi T > \pi + 2^{B+1}$ where $T$ is the number of taps, and $B$ is the

589: coefficient bit width.  Finally, the width of a PFB channel is tunable by

590: adjusting the period of the sinc function, forcing adjacent bandpass filters to

591: overlap at a point other than the -3 dB point.  Note that this causes

592: power to no longer be conserved in the Fourier transform operation.

593:

594: \subsection{A Bandwidth-Agile Fast Fourier Transform}

595: \label{sec:fft}

596:

597: The computational core of our FFT library is an implementation of a radix-2

598: biplex pipelined FFT \citep{rabiner_gold1975} capable of analyzing two

599: independent, complex data streams using a fraction of the FPGA resources of

600: commercial designs \citep{dick2000}.  This architecture takes advantage of the

601: streaming nature of ADC samples by multiplexing the butterfly computations of

602: each FFT stage into a single physical butterfly core.  When used to analyze two

603: independent streams, every butterfly in this biplex core outputs valid data

604: every clock for 100\% utilization efficiency.

605:

606: The need to analyze bandwidths higher than the native clock rate of an FPGA led

607: us to create a second core that combines multiple biplex cores with additional

608: butterfly cores to create an FFT that is parametrized to handle $2^P$ samples

609: in parallel \citep{parsons2008}.  This FFT architecture uses only 25\% more

610: buffering than the theoretical minimum, and still achieves 100\% butterfly

611: utilization efficiency.  This feat is achieved by decomposing a $2^N$

612: channel FFT into $2^P$ parallel biplex FFTs of length $2^{N-P}$, followed by a

613: $2^P$ channel parallel FFT core using time-multiplexed twiddle-factor

614: coefficients.

615:

616: Finally, we have written modules for performing two real FFTs with each half of

617: a biplex FFT using Hermitian conjugation.  Mirroring and

618: conjugating the output spectra to reconstitute the negative frequencies, this

619: module effects a 4-in-1 real biplex FFT that can then be substituted for the

620: equivalent number of biplex cores in a high-bandwidth FFT.  Thus, our real FFT

621: module has the same bandwidth flexibility as our standard complex FFT.
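
The standard identity behind computing two real transforms with one complex transform can be sketched in software as follows.  This is an illustration of the general technique, not the gateware module itself; the function names are hypothetical and a naive DFT stands in for the pipelined FFT.  Two real streams $a$ and $b$ are packed as $a+ib$, transformed once, and separated using the Hermitian symmetry of real-input spectra.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, standing in for the FFT core in this sketch."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def two_real_ffts(a, b):
    """Spectra of two real sequences from a single complex transform."""
    n = len(a)
    z = dft([complex(ar, br) for ar, br in zip(a, b)])
    spec_a, spec_b = [], []
    for k in range(n):
        zk = z[k]
        zc = z[(-k) % n].conjugate()      # mirrored, conjugated bin
        spec_a.append((zk + zc) / 2)      # part Hermitian about k = 0
        spec_b.append((zk - zc) / 2j)     # part anti-Hermitian about k = 0
    return spec_a, spec_b
```

Applying this to each of the two complex streams of a biplex core gives four real transforms, which is the sense of ``4-in-1'' used above.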

622:

623: Dynamic range inside fixed-point FFTs requires careful consideration.  Tones

624: are folded into half as many samples through each FFT stage, causing magnitudes

625: to grow by a factor of 2 for narrow-band signals, and $\sqrt{2}$ for random

626: noise.  To

627: avoid overflow and spectrum corruption, our cores contain optional downshifts

628: at each stage.  In an interference-heavy environment, one must balance loss of

629: SNR from downshifting signal levels against loss of integration time due to

630: overflows.  A good practice is to place time-domain input into the

631: most-significant bits of the FFT and downshift as often as possible to

632: avoid overflow and minimize rounding error in each butterfly stage.  However,

633: it is also best to avoid using the top 2 bits on input since the first

634: 2 butterfly

635: stages can be implemented using negation instead of complex multiplies, but the

636: asymmetric range of 2's complement arithmetic can allow this negation to

637: overflow.
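
The negation hazard mentioned above can be seen in a few lines of software.  The sketch below models $B$-bit 2's complement wrap-around with hypothetical helper names; the 4-bit width is chosen only to keep the numbers small.

```python
def wrap_twos_complement(value, bits):
    """Interpret the low `bits` bits of value as a signed 2's complement number."""
    mask = (1 << bits) - 1
    value &= mask
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value


def negate(value, bits):
    """Negate a signed value and wrap the result to `bits` bits."""
    return wrap_twos_complement(-value, bits)
```

With 4 bits the representable range is $-8$ to $+7$, so \texttt{negate(7, 4)} gives $-7$ as expected, but \texttt{negate(-8, 4)} wraps back to $-8$ because $+8$ is not representable.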

638:

639: \subsection{A Cross-Multiplication/Accumulation (X) Engine}

640: \label{sec:x_engine_arch}

641:

642: \placefigure{fig:x_engine_schem}

643:

644: Our FX correlator architecture employs

645: X engines to compute all antenna cross-multiples within a frequency

646: channel, and multiple frequencies are multiplexed into the core as dictated by

647: processor bandwidth; the complex visibility $V_{ij}$ (Eq. \ref{eq:vis})

648: is the average of the product of complex voltage samples from antenna $i$ and

649: antenna $j$ with the convention that the voltage $j>i$ is conjugated prior to

650: forming the product.

651: In collaboration with Lynn Urry of UC Berkeley's Radio

652: Astronomy Lab we have implemented a parametrized module (Fig.

653: \ref{fig:x_engine_schem}) for computing and accumulating all visibilities for a

654: specified number of antennas.  An X engine operates by receiving $N_{ant}$ data

655: blocks in series, each containing $T_{acc}$ data samples from one frequency

656: channel of one antenna.  The first samples of all blocks are

657: cross-multiplied, and the $N_{ant}(N_{ant}+1)/2$ results are added to the

658: results from the second samples, and so on, until all $T_{acc}$ samples have

659: been exhausted.  Accumulation prevents the data rate out of a

660: cross-multiplier from exceeding the input data rate.  An X engine is divided

661: into stages, each responsible for pairing two different data blocks

662: together: the zeroth stage pairs adjacent blocks, the first stage pairs blocks

663: separated by one, and so on.  As the final accumulated results become available,

664: they are loaded onto a shift register and output from the X engine.

665:

666: However, as a new window of $N_{ant}\times T_{acc}$ samples arrives, some

667: stages, behaving as described above, would compute invalid results using

668: data from two different windows.  To avoid this, each stage switches between

669: cross-multiplying separations of $S$ to separations of $N_{ant}-S$, which

670: happen to be valid precisely when separations of $S$ would be invalid.  As a

671: result, there need be only $\lfloor N_{ant}/2+1\rfloor$ stages in an X engine.  Every

672: $T_{acc}$ samples, each stage outputs a valid result, yielding $N_{ant}\times

673: \lfloor N_{ant}/2+1\rfloor$ total accumulations; for even values of $N_{ant}$,

674: $N_{ant}/2$ of the results from the last stage are redundant.

675: All other multiplier/accumulators are 100\% utilized.  Each stage

676: also computes all polarization cross-multiples (Eq. \ref{eq:pol})

677: using parallel multipliers.
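
The stage and accumulation counts above can be checked with a short enumeration.  The sketch assumes, for illustration only, that stage $S$ handles all pairs at circular separation $S$ (with separations of $N_{ant}-S$ being the same set of pairs) and that the zero-separation stage produces the auto-correlations; the function name is hypothetical and the code models only the bookkeeping, not the pipelined hardware.

```python
def x_engine_pairs(n_ant):
    """Enumerate the antenna pairs produced by each stage of the sketch."""
    n_stages = n_ant // 2 + 1
    pairs = []
    for stage in range(n_stages):
        for i in range(n_ant):
            j = (i + stage) % n_ant           # partner at circular separation
            pairs.append((min(i, j), max(i, j)))
    return n_stages, pairs


def redundancy(n_ant):
    """Return (total accumulations, unique pairs, redundant results)."""
    _, pairs = x_engine_pairs(n_ant)
    unique = len(set(pairs))
    return len(pairs), unique, len(pairs) - unique
```

For $N_{ant}=8$ this gives 5 stages and 40 accumulations, of which 36 are unique and 4 are redundant; for $N_{ant}=7$ all 28 accumulations are unique.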

678:

679: When one X engine no longer fits on a single FPGA, it may be divided across

680: chips at any stage boundary at the cost of a moderate amount of bidirectional

681: interconnect.  The output shift register need not be carried between chips;

682: each FPGA can accumulate and store the results computed locally.  In order for

683: the output shift register's $\lfloor N_{ant}/2+1\rfloor$ stages to clear before the

684: next accumulation is ready, an X engine requires a minimum integration length

685: of: $T_{acc}>\lfloor N_{ant}/2+1\rfloor$.  In current hardware, a practical upper

686: limit on $T_{acc}$ is set by the 2$\times$4 Mbit of SRAM storage available on

687: the IBOB.  For 2048 channels with 4-bit samples, and double buffering for 2

688: antennas, 2 polarizations, this limit is $T_{acc}\le 128$.  Longer integration

689: requires an accumulator capable of buffering an entire vector of visibility

690: data, and typically occurs in off-chip DRAM.  The maximum theoretical

691: accumulation length in a correlator is determined by the fringe rate of sources

692: moving across the sky, and is a function of observing frequency, maximum

693: antenna separation, and (for correlators with internal fringe rotation)

694: field-of-view across the primary beam.

695:

696: Cross-multiplication comes to dominate the total correlator processing budget

697: for large numbers of antennas.  As a result, care must be taken both to reduce

698: the footprint of a complex multiplier/accumulator and to make full and

699: efficient use of the resources on an FPGA processor.  The number of bits used

700: to carry a signal should be minimized while retaining sufficient dynamic range

701: to distinguish signal from noise.  We have chosen to focus on 4-bit multipliers

702: in current applications, and the subjects of dynamic equalization and Van Vleck

703: correction generalized to 4 bits are explored in Section

704: \ref{sec:characterization} for optimizing signal-to-noise ratios (SNR) in our

705: correlators.  To make full use of FPGA resources, we construct

706: 4-bit complex multipliers using distributed logic, dedicated multiplier cores,

707: and look-up tables implemented in Block RAMs.

708:

709: It is possible to perform the bulk of an $N$-bit complex multiply in an $M$-bit

710: multiplier core by sign-extending numbers to $2N$ bits and combining them into

711: two $M$-bit, unsigned numbers.  Multiplying $(a+bi)(c+di)$, these

712: representations are $(2^{M-2N}a_s+b_s)$ and $(2^{M-2N}c_s+d_s)$, where

713: $n_s=2^{2N}+n$.  The bits corresponding to $ac, ad+bc, bd$ may be selected from

714: the product, provided that the

715: sign-extension to $2N$ bits shifts $a+d$ beyond the bits occupied by $ad$.

716: This yields the constraint:

717: \begin{equation} 6N-1 < M \end{equation}

718: The 18-bit multipliers in current Xilinx

719: FPGAs can efficiently perform 3-bit complex

720: multiplies, but fall short of 4 bits.
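
As a quick check of this constraint for $M=18$ (simple arithmetic on the inequality, not an additional result):
\begin{displaymath}
N=3:\quad 6N-1=17<18, \qquad\qquad N=4:\quad 6N-1=23>18 ,
\end{displaymath}
so $N=3$ is the largest sample width that satisfies the inequality for an 18-bit multiplier, and a 4-bit complex multiply by this method would require $M\ge24$.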

721:

722: % --------------------------------------------------------------------------

723: % Section 5

724: % --------------------------------------------------------------------------

725: \section{System Integration}

726: \label{sec:integration}

727:

728: \subsection{F Engine Synchronization}

729: \label{sec:F_synch}

730:

731: \placefigure{fig:corr_vs_dly}

732:

733: Though we have touted GALS design principles for X engine processing,

734: digitization and spectral processing within F engines must be synchronized to a

735: time interval much smaller than a spectral window to avoid severe degradation

736: of correlation response (Fig. \ref{fig:corr_vs_dly}).  This attenuation effect,

737: resulting from the changing degree of overlap of correlated signals within a

738: spectral window, can be caused by systematic signal delay between antennas, as

739: well as by source-dependent geometric delay; FX correlators with insufficient

740: channel resolution experience a narrowing of the field of view related to

741: channel bandwidth.  This effect has been well explored for FX correlators

742: employing DFTs (see Chapter 8 of \citet{thompson_et_al2001}), but Polyphase

743: Filter Banks show a different response owing to a weighting function that

744: extends well beyond the number of samples used in a DFT.

745: Given a standard form for PFB sample weighting of

746: ${\rm sinc}\left(\frac{\pi t}{N\tau_s}\right)

747: W\left(\frac{t}{2TN\tau_s}\right)$,

748: where $N$ is the number of output channels,

749: $T$ is the number of PFB taps, $\tau_s$ is the delay between time-domain

750: samples, and $W$ is an arbitrary windowing function that tapers to 0 at

751: $\pm1$, the gain versus delay $G(\tau)$ of a PFB-based FX correlator is

752: given by:

753: \begin{displaymath}

754: G(\tau)=\int_{-\infty}^{\infty}{

755: \left[{\rm sinc}\left(\frac{\pi t}{N\tau_s}\right)

756: W\left(\frac{t}{2TN\tau_s}\right)\right] \times

757: \left[{\rm sinc}\left(\frac{\pi (t-\tau)}{N\tau_s}\right)

758: W\left(\frac{t-\tau}{2TN\tau_s}\right)\right]\ dt

759: }

760: \end{displaymath}

761:

762: For the purpose of F Engine synchronization, we

763: rely on a one-pulse-per-second (1PPS) signal with a fast edge-rate provided

764: synchronously to a bank of F processors running off identical system clocks.

765: This signal is sampled by the system clock on each processor, and provided

766: alongside ADC data.  A slower, asynchronous ``arm'' signal is sent from

767: a central node to each F engine at the half second phase

768: to indicate that the next 1PPS signal should be

769: used to generate the reset event that synchronizes spectral windows and packet

770: counters.  This ensures that samples from different antennas entering X engines

771: together were acquired within one or two system clocks of one another.  The

772: degree of synchronization is determined by the difference in path lengths of

773: 1PPS and the system clock from their generators to each F engine.  This path

774: length can be determined from celestial source observations

775: using self-calibration, and barring temperature

776: effects, will be constant for a correlator configuration following power-up.

777:

778: \subsection{Asynchronous, Packetized ``Corner Turner''}

779: \label{sec:packetization}

780:

781: The choice of the accumulation length $T_{acc}$ in X engines

782: determines the natural size of UDP packets in our

783: packet-switched correlator architecture.  For current CASPER hardware where

784: channel-ordering occurs in IBOB SRAM, $T_{acc}$ is constrained by the available

785: memory to an upper limit of 128 samples for 2048-channel dual-polarization,

786: 4-bit,

787: complex data, yielding a packet payload of 256 bytes.  A header containing

788: 2 bytes of antenna index and 6 bytes of frequency/time index is added to each

789: packet to enable packet unscrambling on the receive side.  The frequency/time

790: index (hereafter referred to as the master counter, or MCNT) is a counter that

791: is incremented every packet transmission.  The lower bits count frequencies

792: within a spectrum, and the rest count time.  Combined with the antenna

793: index, MCNT completely determines the time, frequency, source, and destination

794: of each packet; MCNT maps uniquely to a destination IP address.
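
The packet layout described above can be sketched in software as follows.  The field sizes (2-byte antenna index, 6-byte MCNT, 256-byte payload) come from the text; the byte order, the ordering of the two header fields, and the use of the lowest 11 bits of MCNT as the channel index for 2048 channels are assumptions made for illustration, as are all function names.

```python
import struct

HEADER_BYTES = 8  # 2 bytes of antenna index + 6 bytes of MCNT


def pack_header(antenna, mcnt):
    """Build an 8-byte header (big-endian, antenna first: both assumed)."""
    if not 0 <= antenna < 1 << 16:
        raise ValueError("antenna index must fit in 2 bytes")
    if not 0 <= mcnt < 1 << 48:
        raise ValueError("MCNT must fit in 6 bytes")
    return struct.pack(">H", antenna) + mcnt.to_bytes(6, "big")


def unpack_header(header):
    """Recover (antenna, mcnt) from an 8-byte header."""
    (antenna,) = struct.unpack(">H", header[:2])
    mcnt = int.from_bytes(header[2:8], "big")
    return antenna, mcnt


def split_mcnt(mcnt, chan_bits=11):
    """Split MCNT into (time index, channel index); chan_bits is assumed."""
    return mcnt >> chan_bits, mcnt & ((1 << chan_bits) - 1)


def payload_bytes(t_acc, n_pol=2, bits_per_complex_sample=8):
    """Payload size for t_acc samples of n_pol 4-bit complex values."""
    return t_acc * n_pol * bits_per_complex_sample // 8
```

With $T_{acc}=128$, two polarizations, and 4-bit real and imaginary parts, \texttt{payload\_bytes(128)} reproduces the 256-byte payload quoted above.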

795:

796: \placefigure{fig:packet_rx}

797:

798: Packet reception (Fig. \ref{fig:packet_rx}) is complicated by the realities of

799: packet scrambling, loss, and interference.  A circular buffer holding $N_{win}$

800: windows worth of X engine data stores packet data as they arrive.  The lower

801: bits of MCNT act as an address for placing payloads into the correct

802: window, and the antenna index addresses the position within that window.  When

803: data arrives $N_{win}/2$ windows ahead of a buffered window, that window is

804: flagged for readout, and is processed contiguously on the next window boundary

805: of the free-running X engine.  Using packet arrival to determine when a window

806: is processed allows a data-rate dependent time interval for all packets to

807: arrive, but pushes data through the buffer in the event of packet loss.  On

808: readout, the buffer is zeroed to ensure that packet loss results in loss of

809: signal, rather than the introduction of noise.  F engines can be intentionally

810: disconnected from transmission without compromising the correlation of

811: those remaining.

812:

813: Packet interference occurs when a well-formed packet contains an invalid MCNT

814: as a result of switch latency, unsynchronized F engines, or system

815: misconfiguration.  Such packets must be prevented from entering the receive

816: buffer, since they can lead to data corruption; one would prefer that a

817: misconfigured F engine antenna result in data loss for that antenna, rather

818: than data loss for the entire system.  To ensure this behavior, incoming

819: packets face a sliding filter based on currently active MCNTs.  Packets are

820: only accepted if their MCNT falls within the range of what can currently be

821: held in the circular buffer.  As higher MCNTs are received and accepted, old

822: windows are flagged for read out, freeing up buffer space for still

823: higher MCNTs.  This system forces MCNTs to advance by small increments and

824: prevents the large discontinuities indicative of packet

825: interference.  In the eventuality that a receive buffer accidentally locks onto

826: an invalid MCNT from the outset, a time-out clause causes the currently active

827: MCNT to be abandoned for a new one if no new data is accepted into the receive

828: buffer.
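
The acceptance and readout rules described above can be summarized in a small software model.  It is a sketch under stated assumptions rather than the gateware logic: the class and method names are hypothetical, each packet is reduced to the window index taken from the time portion of its MCNT, the first packet seen is trusted to set the active range, and the time-out clause is omitted.

```python
class McntFilter:
    """Sliding acceptance filter over window indices (illustrative model)."""

    def __init__(self, n_win):
        self.n_win = n_win      # number of windows the circular buffer holds
        self.oldest = None      # lowest window index still buffered

    def accept(self, window):
        """Return (accepted, list of windows flagged for readout)."""
        if self.oldest is None:
            self.oldest = window            # lock onto the first packet seen
            return True, []
        if not (self.oldest <= window < self.oldest + self.n_win):
            return False, []                # outside the buffer: reject
        flagged = []
        # Data arriving n_win/2 windows ahead pushes old windows out.
        while window - self.oldest >= self.n_win // 2:
            flagged.append(self.oldest)
            self.oldest += 1
        return True, flagged
```

In this model a packet far outside the active range is simply dropped, so a misbehaving source cannot move the buffer, while in-range packets advance the range by small increments.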

829:

830: A final complication comes when implementing a bidirectional 10GbE transmission

831: architecture such as the one outlined in Figure \ref{fig:ex_app1}.

832: Commercial switches do not support

833: self-addressed packet transmission; they assume that the transmitter

834: (usually a CPU) intercepts these packets and transfers them to the receive

835: buffer.  On FPGAs, this requires an extra buffer for holding ``loopback'', and

836: a multiplexer for inserting these packets into the processing stream.  A simple

837: method for this insertion would be to always insert loopback packets, if

838: available, and otherwise to insert packets from the 10GbE

839: interface.  However, there is a maximum interval over which packets with

840: identical MCNTs can be scrambled before the receive system rejects

841: packets for being outside of its buffer.  This simple method has the

842: undesirable effect of including switch latency in the time interval over which

843: packets are scrambled, causing unnecessary packet loss.  Our solution is to

844: pull loopback packets only after packets with the same MCNT

845: arrive through the switch.

846:

847: \subsection{Monitor, Control, and Data Acquisition}

848: \label{sec:data_aq}

849:

850: The toolflow we have developed for CASPER hardware provides convenient

851: abstractions for interfacing to hardware components such as ADCs, DRAM, and 10

852: GbE transceivers, and allows specified registers and BRAMs to be automatically

853: connected to CPU-accessible buses.  On top of this framework, we run BORPH--an

854: extension of the Linux operating system that provides kernel support for FPGA

855: resources \citep{so_broderson2006,so2007}.  This system allows FPGA

856: configurations to be run in the same fashion as software processes, and creates

857: a virtual file system representing the memories and registers defined on the

858: FPGA.  Every design compiled with this toolflow comes equipped with this

859: real-time interface for low- to moderate-bandwidth data I/O.  By emulating

860: standard file-I/O interfaces, BORPH allows programmers to use standard

861: languages for writing control software.  The majority of the monitor, control,

862: and data acquisition routines in our correlators are written in C

863: and Python.  For 8--16 antenna correlators, the bandwidth through BORPH on a

864: BEE2 board is sufficient to support the output of visibility data with 5--10~s

865: integrations.

866:

867: For correlators with more antennas or shorter integration times, the bandwidth

868: of the CPU/FPGA interface is incapable of maintaining the full correlator

869: output.  This limitation is being overcome by transmitting the final correlator

870: output using a small amount of the extra bandwidth on the 10GbE ports already

871: attached to each X engine.  After accumulation in DRAM, correlator output is

872: multiplexed onto the 10GbE interface and transmitted to one or more Data

873: Acquisition (DA) systems attached to the central 10GbE switch.  These systems

874: collect and store the final correlator output.  With a capable DA system, the

875: added bandwidth through this output pathway can be used to attain millisecond

876: integration times, opening up opportunities for exploring transient events and

877: increasing time resolution for removing interference-dominated data.

878:

879: The capabilities of correlators made possible by our research are placing

880: new demands on DA systems \citep{wright2005}.  There is a severe (factor of

881: 100) mismatch between the data rates in the on-line correlator hardware and

882: those supported by the off-line processing.  Members of our team are currently

883: pursuing research on how this can be resolved both for correlators and for

884: generic signal processing systems using commercially available compute

885: clusters.  For correlators, our group is currently exploring how to implement

886: calibration and imaging in real-time to reduce the burden of

887: expert data reduction on the end user, and to make best use of both telescope

888: and human resources.

889:

890:

891: % --------------------------------------------------------------------------

892: % Section 6

893: % --------------------------------------------------------------------------

894: \section{Characterization}

895: \label{sec:characterization}

896:

897: \subsection{ADC Crosstalk}

898: \label{sec:crosstalk}

899:

900: \placefigure{fig:crosstalk}

901:

902: Crosstalk is an undesirable but prevalent characteristic of analog systems

903: wherein a signal is coupled at a low level into other pathways.  This can pose

904: a major threat to sensitivity in systems that integrate noise-dominated data to

905: reveal low-level correlation.  For CASPER hardware, we have examined crosstalk

906: levels between signal inputs sharing an ADC chip, and between different ADC

907: boards on the same IBOB.  Figure \ref{fig:crosstalk} illustrates a one-hour

908: integration of uncorrelated noise of various bandwidths input to the ``Pocket

909: Correlator'' system (see Section \ref{sec:deployments}).  Between inputs

910: of the same ADC board, a coupling coefficient of $\sim0.0016$ indicates

911: crosstalk at approximately $-28$ dB.  This coupling is a factor of $5$ higher

912: than the $-35$ dB isolation advertised by the Atmel ADC chip, and is most

913: likely the result of board geometry and shared power supplies.  Crosstalk

914: between inputs on different ADCs also peaks at the $-28$ dB level, but shows

915: more frequency-dependent structure.
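
The figures quoted above are mutually consistent if the coupling coefficient is read as a power ratio (an interpretation assumed here from the quoted numbers):
\begin{displaymath}
10\log_{10}(0.0016)\approx-28.0~{\rm dB}, \qquad
10^{[-28-(-35)]/10}=10^{0.7}\approx5.0 .
\end{displaymath}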

916:

917: \placefigure{fig:crosstalk_stability}

918:

919: Crosstalk may be characterized and removed, provided that its timescale for

920: variation is much longer than the calibration interval.  Figure

921: \ref{fig:crosstalk_stability} demonstrates that, in a lab test with integration

922: intervals ranging from 7.15 seconds to approximately 1 day (the limit of our

923: testing), crosstalk amplitudes and phases vary around stable values that, when

924: subtracted, yield

925: noise that integrates down with time.  Even

926: though crosstalk is encountered at the $-28$ dB level, its stability allows

927: suppression to at least $-62$ dB.  This stability has allowed crosstalk

928: to be removed post-correlation, and we have until recently deferred

929: adding phase switching.  Developments along this line are proceeding by

930: introducing an invertible mixer (controlled via a Walsh counter on an IBOB)

931: early in the analog signal path, and removing this inversion after

932: digitization.  Phase switching must be coupled with data blanking near

933: boundaries when the

934: inversion state is uncertain.  Blanking will be most easily implemented by

935: intentionally dropping packets of data from F engine transmission, and by

936: providing a count of results accumulated in each integration for normalization

937: purposes.

938:

939: \subsection{XAUI Fidelity and Switch Throughput}

940: \label{sec:10gbe_sw}

941:

942: CASPER boards are currently configured to transmit XAUI protocol over CX4 ports

943: as a point-to-point communication protocol and as the physical layer of 10GbE

944: transmission.  Because the Virtex-II FPGAs used in current CASPER hardware do

945: not fully support XAUI transmission standards \citep{xilinx_ug024,xilinx_ds083},

946: current devices can have

947: sub-optimal performance for certain cable lengths.  We expect the new ROACH

948: board, which employs Virtex-5 FPGAs, to have better

949: performance in this regard.  For cable lengths supported in current hardware,

950: we tested XAUI transmission fidelity using matched Linear Feedback Shift

951: Registers (LFSRs) on transmit and receive.  Error detection was verified using

952: programmable bit-flips following transmitting LFSRs.  Over a period of 16

953: hours, 573 Tb of data were transmitted and received on each of 8 XAUI

954: links.  During this time, no errors were detected, implying an estimated

955: bit-error rate below $2.2\cdot 10^{-16}$.  We also tested the capability of two

956: Fujitsu switches (the XG700 and the XG2000) for performing the full

957: cross-connect packet switching required in our FX correlator architecture.  By

958: tuning the sample rate inside F engines of an 8-antenna (4-IBOB) packetized

959: correlator, we controlled the transmission rate per switch port over a range of

960: 5.96 to 8.94 Gb/s.  In 10-minute tests, packet loss was zero for both

961: switches in all but the 8.94 Gb/s case.  Packet loss in this final case was

962: traced to intermittent XAUI failure as a result of imperfect compliance with

963: the XAUI standard, as described previously. Overheating of FPGA chips in the

964: field has also been reported as a source of intermittent operation.
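
The error-rate figure quoted above for the XAUI test follows from the reciprocal of the total number of bits transferred without error, assuming 1~Tb $=10^{12}$ bits:
\begin{displaymath}
N_{\rm bits}=8\times573\times10^{12}\approx4.6\times10^{15}, \qquad
1/N_{\rm bits}\approx2.2\times10^{-16} .
\end{displaymath}
The implied per-link rate, $573\times10^{12}/(16\times3600~{\rm s})\approx9.9$ Gb/s, is close to the nominal 10 Gb/s of a XAUI link.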

965:

966: \subsection{Equalization and 4-Bit Requantization}

967: \label{sec:equalization}

968:

969: \placefigure{fig:4_bit_quant}

970:

971: Correlator processing resources can be reduced by limiting the bit width of

972: frequency-domain antenna data before cross-multiplication.  However, digital

973: quantization requires careful setting of signal levels for optimum

974: SNR and subsequent calibration to a linear power scale

975: \citep{thompson_et_al2001,jenet_anderson1998}.  Correlators using 4 bits

976: represent

977: an improvement over their 1 and 2 bit predecessors, but there are still

978: quantization issues to consider.  The total power of a 4-bit quantizer has a

979: non-linear response with respect to input level as shown in Figure

980: \ref{fig:4_bit_quant}.  In currently deployed correlators, we perform

981: equalization (per channel scaling) to control the RMS channel values before

982: requantizing from 18 bits to 4 bits.  This operation saturates RFI and flattens

983: the passband to reduce dynamic range and to hold the passband in

984: the linear regime of the 4-bit quantization power curve.  Equalization is

985: implemented as a scalar multiplication on the output of each PFB using 18-bit

986: coefficients from a dynamically updateable memory.  These coefficients allow

987: for automatic gain control to maintain quantization fidelity through changing

988: system temperatures.
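
The equalize-then-requantize step can be illustrated with a minimal software sketch.  It treats one real-valued component at a time and assumes round-to-nearest with a symmetric saturation at $\pm7$; the rounding mode, the clip level, and the function names are illustrative assumptions rather than a description of the deployed gateware.

```python
import math


def requantize_4bit(sample, coeff, max_level=7):
    """Scale a wide sample by an equalization coefficient, then round and clip."""
    scaled = sample * coeff
    rounded = int(math.floor(scaled + 0.5))        # round half up (assumed)
    return max(-max_level, min(max_level, rounded))  # saturate strong signals


def equalize_channel(samples, coeff):
    """Apply one per-channel coefficient to every sample in that channel."""
    return [requantize_4bit(s, coeff) for s in samples]
```

For example, with a coefficient of 0.003 an input of 1000 maps to level 3, while an interferer at $10^{6}$ saturates at 7 instead of wrapping.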

989:

990: % --------------------------------------------------------------------------

991: % Section 7

992: % --------------------------------------------------------------------------

993: \section{Deployments and Results}

994: \label{sec:deployments}

995:

996: \subsection{A Pocket Correlator}

997: \label{sec:pocket_corr}

998:

999: \placefigure{fig:f_engine}

1000:

1001: The ``Pocket Correlator'' (Fig. \ref{fig:f_engine}) is a single IBOB system

1002: that includes F and X engines on a single board for correlating and

1003: accumulating 4 input signals.  Each input is sampled at 4 times the FPGA clock

1004: rate (which runs up to 250 MHz), and a down-converter extracts half of the

1005: digitized band.  This subband is decomposed into 2048 channels by an 8-tap PFB,

1006: equalized, and requantized to 4 bits.  With all input signals on one chip, X

1007: processing can be implemented directly as multipliers and vector accumulators,

1008: rather than as X engines.  Limited buffer space on the IBOB permits only 1024

1009: channels (selectable from within the 2048) to be accumulated.  Output occurs

1010: either via serial connection (with a minimum integration time of 5

1011: seconds) or via 100-Mbit UDP transmission (with a minimum integration time in

1012: the millisecond range).  This system can act as a 2-antenna, full Stokes

1013: correlator, or as a 4-antenna single polarization correlator.
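
As a worked example of the numbers above, assume real-valued sampling (so that the digitized band is half the sample rate) and the maximum FPGA clock rate $f_{clk} = 250$ MHz.  Under those assumptions the channel width $\Delta\nu$ follows as
\[
f_s = 4 f_{clk} = 1~{\rm GS/s}, \quad
B_{dig} = f_s/2 = 500~{\rm MHz}, \quad
B_{sub} = B_{dig}/2 = 250~{\rm MHz}, \quad
\Delta\nu = B_{sub}/2048 \approx 122~{\rm kHz}.
\]
Lower clock rates scale all of these values proportionally.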

1014:

1015: \placefigure{fig:skymap}

1016:

1017: The Pocket Correlator is valuable as a simple, stand-alone instrument, and for

1018: board verification in larger packetized systems.  It is being applied as a

1019: stand-alone instrument in PAPER, the ATA, and the UNC PARI observatory. A

1020: 4-antenna, single polarization deployment of the PAPER experiment in Western

1021: Australia in 2007 used the Pocket Correlator to collect the data used to

1022: produce the 150-MHz all-sky map shown in Figure \ref{fig:skymap}.  In

1023: addition to demonstrating the feasibility of post-correlation crosstalk

1024: removal, this map (specifically, the imperfectly removed sidelobes of sources)

1025: illustrates a problem that will require real-time imaging to solve for large

1026: numbers of antennas.

1027:

1028: \subsection{An 8-Antenna, 2-Stokes, Synchronous Correlator}

1029: \label{sec:8_ant_corr}

1030:

1031: This first generation multi-board correlator demonstrated the functionality

1032: of signal processing algorithms and CASPER hardware, but predated the

1033: current packetized architecture; it operated synchronously.  This version of

1034: the correlator was most heavily limited by X engine resources, all of which

1035: were implemented on a single FPGA to simplify interconnection. The

1036: total number of complex multipliers in the X engines of an $N_{ant}$ antenna

1037: array is: $N_{cmac} = \mathrm{floor}(N_{ant}/2+1)\times N_{ant}\times N_{pol}$; the

1038: limited number of multipliers on a BEE2 FPGA only allowed for supporting half

1039: the polarization cross-multiples.  This system was an

1040: important demonstration of the basic capabilities of our hardware and software,

1041: and provided a starting point for evolving a more sophisticated system.

1042: Deployments of this

1043: system at the NRAO site in Green Bank as part of the PAPER

1044: experiment, and briefly

1045: at the Hat Creek Radio Observatory for the Allen Telescope Array,

1046: are being superseded by the packetized correlator presented in the next

1047: section.
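
The multiplier count above is easy to evaluate numerically.  The short Python sketch below transcribes the formula directly; the function name is hypothetical, and the two values of $N_{pol}$ are chosen only to illustrate how halving the polarization cross-multiples halves the multiplier requirement.

```python
def n_cmac(n_ant, n_pol):
    """Complex multipliers in the X engines, following
    N_cmac = floor(N_ant/2 + 1) * N_ant * N_pol from the text."""
    # For integer n_ant, floor(n_ant/2 + 1) equals n_ant // 2 + 1.
    return (n_ant // 2 + 1) * n_ant * n_pol

# Illustrative values only: an 8-antenna array with 4 and with 2
# polarization cross-multiples.
full = n_cmac(8, 4)
half = n_cmac(8, 2)
```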

1048:

1049: \subsection{A 16-Antenna, Full-Stokes, Packetized Correlator}

1050: \label{sec:packet_deploy}

1051:

1052: This packetized FX correlator is a realization of the architecture outlined in

1053: Figure \ref{fig:ex_app1}, with F processing for 2 antennas implemented on each

1054: IBOB, and matching X processors implemented on each corner FPGA of two BEE2s.

1055: Each F processor is identical to a Pocket Correlator (Fig. \ref{fig:f_engine}),

1056: but branches data from the equalization module to a matrix transposer in IBOB

1057: SRAM to form frequency-based packets.  Packet data for each antenna are

1058: multiplexed through a point-to-point XAUI connection to a BEE2-based X

1059: processor, and then relayed in 10GbE format to the switch.  The number of

1060: channels in this system is limited to 2048 by memory in IBOB SRAM for

1061: transposing the 128 spectra needed to meet bandwidth restrictions between X

1062: engines and DRAM-based vector accumulators.
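
The reordering performed by the matrix transposer can be illustrated with a small software analogue.  This Python sketch assumes a window of 128 spectra of 2048 channels held as a (time, channel) array; the function name and the payload layout (one row of 128 time samples per channel) are assumptions for illustration, not a description of the SRAM gateware.

```python
import numpy as np

N_SPECTRA = 128   # spectra buffered per transpose window (from the text)
N_CHAN = 2048     # channels per spectrum (from the text)

def transpose_window(spectra):
    """Reorder a (time, channel) window into (channel, time) rows, so that
    each row holds consecutive time samples of a single channel."""
    assert spectra.shape == (N_SPECTRA, N_CHAN)
    return np.ascontiguousarray(spectra.T)

# Label each sample by its arrival order to make the reordering visible.
window = np.arange(N_SPECTRA * N_CHAN).reshape(N_SPECTRA, N_CHAN)
payloads = transpose_window(window)
```

Each row of `payloads` then contains only data for one frequency channel, which is the property that allows a packet to be addressed to a single X engine.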

1063:

1064: \placefigure{fig:x_processor}

1065:

1066: The X processor in this packetized correlator implements the transmit and

1067: receive architecture illustrated in Figure \ref{fig:x_processor}

1068: for two X engines sharing the same 10GbE link.

1069: Each X engine's data processing rate is

1070: determined by packets arriving in its own receive buffer, and results are

1071: accumulated in separate DRAM DIMMs.  The accumulated output of each X processor

1072: is read out of DRAM at a low bandwidth and transmitted via 10GbE packets to

1073: a CPU-based server where

1074: all visibility data is collected and

1075: written to disk in MIRIAD format

1076: \citep{sault_et_al1995} using interfaces from the Astronomical Interferometry

1077: in PYthon (AIPY) package\footnote{http://pypi.python.org/pypi/aipy}.

1078:

1079: The clocks for the BEE2 FPGAs are asynchronous 200-MHz oscillators, and IBOBs

1080: run synchronously at any rate lower than this.  Packet transmission is

1081: statically addressed so that each X engine processes every 16th channel.

1082: We use 8 ports of a Fujitsu XG700 switch to route data.  This system is

1083: scalable to 32 antennas before two X engines no longer fit on a single FPGA.

1084: For larger systems, the number of BEE2s will scale as the square of the number

1085: of antennas, and the number of IBOBs will scale linearly.  A 32-antenna,

1086: 200-MHz correlator on 16 IBOBs and 4 BEE2s is now working in the lab, and a

1087: 16-antenna version using 8 IBOBs and 2 BEE2s has been deployed to the NRAO site

1088: in Green Bank with the PAPER experiment.
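
The static addressing scheme can be sketched as a fixed mapping from channel number to X engine.  The text states only that each X engine processes every 16th channel; the modulo rule and the choice of offset equal to the X engine index in this Python fragment are assumptions, as are the function names.

```python
N_CHAN = 2048   # channels in this deployment (from the text)
N_XENG = 16     # 8 X processors, each hosting two X engines

def x_engine_for_channel(chan):
    """Hypothetical static address: channel number modulo the engine count."""
    return chan % N_XENG

def channels_for_x_engine(xeng):
    """All channels handled by one X engine (every 16th channel)."""
    return range(xeng, N_CHAN, N_XENG)

# Every channel is assigned to exactly one X engine.
assigned = sorted(c for x in range(N_XENG) for c in channels_for_x_engine(x))
```

Under this mapping each X engine handles 128 of the 2048 channels.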

1089:

1090: % --------------------------------------------------------------------------

1091: % Section 8

1092: % --------------------------------------------------------------------------

1093: \section{Conclusion}

1094: \label{sec:conclusion}

1095:

1096: By decreasing the time and engineering costs of building and upgrading

1097: correlators, we aim to reduce the total cost of correlators for a wide range of

1098: scales.  Small- and medium-scale correlators with total cost dominated by

1099: development clearly stand to benefit from our research.  It is less clear if

1100: the cost of large-scale correlators can be reduced by the general-purpose

1101: hardware used in our architecture.  Though minimization of replication cost

1102: favors the development of specialized parts, there are two factors

1103: that can make a generic, modular solution cost less.

1104:

1105: The first factor to consider is time to deployment.  Even if the monetary cost

1106: of development is negligible in the budget of a large correlator, the cost of

1107: development time can be significant.  If a custom solution takes several years

1108: to go from design to implementation, the hardware that is deployed will be out

1109: of date.  Moore's Law suggests that when a custom solution taking 3 years to

1110: develop is deployed, there will exist processors 4 times more powerful, or 4

1111: times less expensive for the equivalent system.  The cost of a generic, modular

1112: system has to be tempered by the expected savings of committing to hardware

1113: closer to the ultimate deployment date.
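
The factor of 4 quoted above follows if capability per unit cost is taken to double every 18 months, one common statement of Moore's Law; the doubling period $T_2 = 1.5$ years is an assumption here rather than a figure from the text.  For a development time $T_{dev} = 3$ years,
\[
2^{T_{dev}/T_2} = 2^{3/1.5} = 4 .
\]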

1114:

1115: The second factor is the cost of upgrade.  Many facilities (including the ATA)

1116: are beginning to appreciate the advantages of designing arrays with wider

1117: bandwidths and larger numbers of antennas than can be handled by current

1118: technology.  Correlators may then be implemented inexpensively on scales

1119: suited to current processors, and upgraded as more powerful processors

1120: become available.  Modular solutions facilitate this methodology.

1121:

1122: % --------------------------------------------------------------------------

1123: % --------------------------------------------------------------------------

1124: % --------------------------------------------------------------------------

1125:

1126: \acknowledgments

1127:

1128: This and other CASPER research are supported by the National Science Foundation

1129: Grant No. 0619596 for Low Cost, Rapid Development Instrumentation for Radio

1130: Telescopes.  We would like to acknowledge the students, faculty and sponsors of

1131: the Berkeley Wireless Research Center, and the National Science Foundation

1132: Infrastructure Grant No.  0403427.  Correlator development for the PAPER

1133: project is supported by NSF grant AST-0505354, and for the ATA project by NSF

1134: grant AST-0321309 as well as the Paul G. Allen Foundation.  Chips and software

1135: were generously provided by Xilinx, Inc.  JM and PM gratefully acknowledge

1136: financial support from the MeerKAT project and South Africa's National Research

1137: Foundation.

1138:

1139: \appendix

1140: Glossary of Technical Terms

1141: \begin{itemize}

1142: \item ADC - Analog to Digital Converter

1143: \item ASIC - Application-Specific Integrated Circuit processor

1144: \item BEE2 - Berkeley Emulation Engine, rev. 2

1145: \item BORPH - Berkeley Operating system for Re-Programmable Hardware

1146: \item BRAM - Block RAM: Random Access Memory inside an FPGA

1147: \item CX4 - 10GbE-compatible industry standard connector

1148: \item CPU - Central Processing Unit

1149: \item DDR2 - Double-Data-Rate 2 type of off-FPGA Synchronous DRAM

1150: \item DIMM - Dual Inline Memory Module

1151: \item DFT - Discrete Fourier Transform

1152: \item DRAM - Dynamic Random Access Memory

1153: \item FFT - Fast Fourier Transform algorithm

1154: \item FIR - Finite Impulse Response digital filter

1155: \item FPGA - Field Programmable Gate Array processor

1156: \item FX - Correlator architecture implemented as frequency channelization, then cross-multiplication

1157: \item GALS - Globally Asynchronous, Locally Synchronous system architecture

1158: \item GB - GigaByte

1159: \item IBOB - Internet Break-Out Board

1160: \item LFSR - Linear Feedback Shift Register

1161: \item LO - Local Oscillator

1162: \item MCNT - Master Counter

1163: \item PFB - Polyphase Filter Bank

1164: \item PowerPC - a specific CPU architecture

1165: \item QDR - Quad-Data-Rate type of off-FPGA SRAM

1166: \item ROACH - Reconfigurable, Open Architecture for Computing Hardware

1167: \item SNR - Signal-to-Noise Ratio

1168: \item SRAM - Static Random Access Memory

1169: \item UDP - User Datagram Protocol Ethernet packetization

1170: \item XAUI - X (ten) Attachment Unit Interface point-to-point transmission protocol

1171: \item XF - Correlator architecture implemented as cross-multiplication, then frequency channelization

1172: \item 1PPS - 1 Pulse Per Second clock signal

1173: \item 10GbE - 10 Gigabit per second Ethernet communication standard

1174: \end{itemize}

1175:

1176: \begin{thebibliography}{25}

1177: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi

1178:

1179: \bibitem[{xil(2004)}]{xilinx_ug024}

1180:  2004, {RocketIO Transceiver User Guide (UG024 V2.5)}, Xilinx user guide,

1181:   http://www.xilinx.com

1182:

1183: \bibitem[{xil(2005)}]{xilinx_ds083}

1184:  2005, {Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Functional

1185:   Description (DS083-2 V4.5)}, Xilinx data sheet, http://www.xilinx.com

1186:

1187: \bibitem[{Blackman \& Tukey(1958)}]{blackman_tukey1958}

1188: Blackman, R. \& Tukey, J. 1958, The measurement of power spectra (Dover

1189:   Publications Inc.)

1190:

1191: \bibitem[{{Bradley} {et~al.}(2005){Bradley}, {Backer}, {Parsons}, {Parashare},

1192:   \& {Gugliucci}}]{bradley_et_al2005}

1193: {Bradley}, R., {Backer}, D., {Parsons}, A., {Parashare}, C., \& {Gugliucci},

1194:   N.~E. 2005, in Bulletin of the American Astronomical Society, 1216--+

1195:

1196: \bibitem[{{Chang} {et~al.}(2005){Chang}, {Wawrzynek}, \&

1197:   {Brodersen}}]{chang_et_al2005}

1198: {Chang}, C., {Wawrzynek}, J., \& {Brodersen}, R.~W. 2005, IEEE Design and Test

1199:   of Computers, 22, 114

1200:

1201: \bibitem[{{Chapiro}(1984)}]{chapiro1984}

1202: {Chapiro}, D.~M. 1984, PhD thesis, Stanford Univ., CA.

1203:

1204: \bibitem[{{Crochiere} \& {Rabiner}(1983)}]{crochiere+rabiner1983}

1205: {Crochiere}, R. \& {Rabiner}, L.~R. 1983, {Multirate Digital Signal Processing}

1206:   (Englewood Cliffs, N.J., Prentice-Hall, Inc., 1983.~336 p.)

1207:

1208: \bibitem[{{D'Addario}(2001)}]{daddario2001}

1209: {D'Addario}, L. 2001, ATA Memo

1210:

1211: \bibitem[{{Demorest} {et~al.}(2004){Demorest}, {Ramachandran}, {Backer},

1212:   {Ferdman}, {Stairs}, \& {Nice}}]{demorest_et_al2004}

1213: {Demorest}, P., {Ramachandran}, R., {Backer}, D., {Ferdman}, R., {Stairs}, I.,

1214:   \& {Nice}, D. 2004, in Bulletin of the American Astronomical Society, 1598--+

1215:

1216: \bibitem[{{Dick}(2000)}]{dick2000}

1217: {Dick}, C. 2000, Xilinx Application Note

1218:

1219: \bibitem[{{Heiles} {et~al.}(2004){Heiles}, {Goldston}, {Mock}, {Parsons},

1220:   {Stanimirovic}, \& {Werthimer}}]{heiles_et_al2004}

1221: {Heiles}, C., {Goldston}, J., {Mock}, J., {Parsons}, A., {Stanimirovic}, S., \&

1222:   {Werthimer}, D. 2004, in Bulletin of the American Astronomical Society,

1223:   1476--+

1224:

1225: \bibitem[{{Jenet} \& {Anderson}(1998)}]{jenet_anderson1998}

1226: {Jenet}, F.~A. \& {Anderson}, S.~B. 1998, PASP, 110, 1467

1227:

1228: \bibitem[{{Parsons}(2008)}]{parsons2008}

1229: {Parsons}, A. 2008, IEEE Signal Processing Letters, submitted

1230:

1231: \bibitem[{{Parsons} {et~al.}(2006){Parsons}, {Backer}, {Chang}, {Chapman},

1232:   {Chen}, {Crescini}, {de Jesus}, {Dick}, {Droz}, {MacMahon}, {Meder}, {Mock},

1233:   {Nagpal}, {Nikolic}, {Parsa}, {Richards}, {Siemion}, {Wawrzynek},

1234:   {Werthimer}, \& {Wright}}]{parsons_et_al2006}

1235: {Parsons}, A., {Backer}, D., {Chang}, C., {Chapman}, D., {Chen}, H.,

1236:   {Crescini}, P., {de Jesus}, C., {Dick}, C., {Droz}, P., {MacMahon}, D.,

1237:   {Meder}, K., {Mock}, J., {Nagpal}, V., {Nikolic}, B., {Parsa}, A.,

1238:   {Richards}, B., {Siemion}, A., {Wawrzynek}, J., {Werthimer}, D., \& {Wright},

1239:   M. 2006, in Asilomar Conference on Signals and Systems, Pacific Grove, CA,

1240:   2031--2035

1241:

1242: \bibitem[{{Plana} {et~al.}(2007){Plana}, {Furber}, {Temple}, {Khan}, {Shi},

1243:   {Wu}, \& {Yang}}]{luis_et_al2007}

1244: {Plana}, L.~A., {Furber}, S.~B., {Temple}, S., {Khan}, M., {Shi}, Y., {Wu}, J.,

1245:   \& {Yang}, S. 2007, IEEE Des. Test, 24, 454

1246:

1247: \bibitem[{{Rabiner} \& {Gold}(1975)}]{rabiner_gold1975}

1248: {Rabiner}, L.~R. \& {Gold}, B. 1975, {Theory and application of digital signal

1249:   processing} (Englewood Cliffs, N.J., Prentice-Hall, Inc., 1975.~777 p.)

1250:

1251: \bibitem[{{Rybicki} \& {Lightman}(1979)}]{rybicki_lightman1979}

1252: {Rybicki}, G.~B. \& {Lightman}, A.~P. 1979, {Radiative processes in

1253:   astrophysics} (New York, Wiley-Interscience, 1979.~393 p.)

1254:

1255: \bibitem[{{Sault} {et~al.}(1995){Sault}, {Teuben}, \&

1256:   {Wright}}]{sault_et_al1995}

1257: {Sault}, R.~J., {Teuben}, P.~J., \& {Wright}, M.~C.~H. 1995, in Astronomical

1258:   Society of the Pacific Conference Series, Vol.~77, Astronomical Data Analysis

1259:   Software and Systems IV, ed. R.~A. {Shaw}, H.~E. {Payne}, \& J.~J.~E.

1260:   {Hayes}, 433--+

1261:

1262: \bibitem[{{So}(2007)}]{so2007}

1263: {So}, K.~H. 2007, PhD thesis, Berkeley Wireless Research Center, UC Berkeley,

1264:   CA.

1265:

1266: \bibitem[{{So} \& {Brodersen}(2006)}]{so_broderson2006}

1267: {So}, K.~H. \& {Brodersen}, R.~W. 2006, in 16th International Conference on

1268:   Field Programmable Logic and Applications, 349--354

1269:

1270: \bibitem[{{Thompson} {et~al.}(2001){Thompson}, {Moran}, \&

1271:   {Swenson}}]{thompson_et_al2001}

1272: {Thompson}, A.~R., {Moran}, J.~M., \& {Swenson}, Jr., G.~W. 2001,

1273:   {Interferometry and Synthesis in Radio Astronomy, 2nd Edition} (New York,

1274:   Wiley-Interscience, 2001.~692 p.)

1275:

1276: \bibitem[{{Vaidyanathan}(1990)}]{vaidyanathan1990}

1277: {Vaidyanathan}, P.~P. 1990, in IEEE, Vol.~78, 56--93

1278:

1279: \bibitem[{{Weinreb}(1961)}]{weinreb_1961}

1280: {Weinreb}, S. 1961, Proc. IEEE, 49, 1099

1281:

1282: \bibitem[{{Wright}(2005)}]{wright2005}

1283: {Wright}, M. 2005, SKA Memo

1284:

1285: \bibitem[{{Yen}(1974)}]{yen1974}

1286: {Yen}, J.~L. 1974, A\&AS, 15, 483

1287:

1288: \end{thebibliography}

1289:

1290:

1291: % --------------------------------------------------------------------------

1292: % TABLES

1293: % --------------------------------------------------------------------------

1294: \clearpage

1295:

1296: %\input tab1.tex

1297: \begin{table}[t]

1298: \label{tab:hardware_price}

1299: \begin{center}

1300: \title{Price and Power Consumption of CASPER Hardware}

1301: \begin{tabular}{lrrrrr}

1302: \hline\hline

1303: \vspace{3pt}

1304: Board & Board & Cost with & Gops    & Power \\

1305:       & Cost  & FPGAs     & per Sec & (W)\\

1306: \hline

1307: IBOB& \$400 & \$2700 & 70 & 30 \\

1308: BEE2& \$5000 & \$23500 & 500 & 150 \\

1309: ROACH\tablenotemark{*}& \$1000 & \$3200 & 400 & 50 \\

1310: ADC (1Gs/s$\times2$)& \$200 & \$200 & N/A & 2 \\

1311: ADC (3Gs/s)\tablenotemark{*}& \$1000 & \$1000 & N/A & 5 \\

1312: \hline\hline

1313: \vspace{-5pt}

1314: \end{tabular}

1315: \\

1316: \vspace{-10pt}

1317: \tablenotetext{*}{Estimated from prototype versions.}

1318: \end{center}

1319: \end{table}

1320:

1321:

1322: % --------------------------------------------------------------------------

1323: % FIGURES

1324: % --------------------------------------------------------------------------

1325: %\clearpage

1326:

1327: \begin{figure}

1328: \begin{center}

1329: \includegraphics[scale=.4]{raw_arch.png}

1330: \caption{In a simplistic FX correlator,

1331: the signals from N antennas are first decomposed into M frequency channels

1332: (F operation) and then cross-multiplied (X operation).  Different channels are

1333: never cross-multiplied, making them natural units for X engine processing.

1334: Thus, each X engine handles all baselines for one frequency channel.

1335: \label{fig:corr_arch1}}

1336: \end{center}

1337: \end{figure}

1338:

1339: \begin{figure}

1340: \begin{center}

1341: \includegraphics[scale=.25]{ex_app1.png}

1342: \caption{Data bandwidth per antenna is equal to the processing bandwidth of

1343: an X processor in this example application.  Transmitted data is routed

1344: through an X processor to take advantage of bidirectionality of 10GbE ports,

1345: thereby halving the number of ports on the switch.

1346: \label{fig:ex_app1}}

1347: \end{center}

1348: \end{figure}

1349:

1350: \begin{figure}

1351: \begin{center}

1352: \includegraphics[scale=.25]{ex_app2.png}

1353: \caption{Data bandwidth per antenna can exceed

1354: what can be carried over 10GbE.  Here, the frequency band has been spread

1355: across ports by channel, so that each half of transmission occurs on an

1356: isolated subnet.  This is possible because different channels are never

1357: cross-multiplied in an FX correlator.

1358: \label{fig:ex_app2}}

1359: \end{center}

1360: \end{figure}

1361:

1362: \begin{figure}

1363: \begin{center}

1364: \includegraphics[scale=.25]{ex_app3.png}

1365: \caption{When the processing bandwidth of an X engine exceeds the antenna

1366: bandwidth by at least a factor of 2, half as many X processors are needed for

1367: a given number of antennas.  X processors operate independently of data

1368: bandwidth; the same design handles this and the previous two cases

1369: (Figs. \ref{fig:ex_app1} and \ref{fig:ex_app2}).  Only the number of X

1370: processors and the data transmission pattern have changed.

1371: \label{fig:ex_app3}}

1372: \end{center}

1373: \end{figure}

1374:

1375: \begin{figure}

1376: \begin{center}

1377: \includegraphics[scale=.25]{ibob_bee2.jpg}

1378: \caption{%

1379: Our correlator architecture relies on modular FPGA-based processing hardware

1380: developed by our group to

1381: combine flexibility, upgradeability, and performance.  Illustrated above are:

1382: (top) IBOB and ADC FPGA/digitizer modules, and

1383: (bottom) the Berkeley Emulation Engine (BEE2) FPGA board.

1384: \label{fig:ibobadcbee2}}

1385: \end{center}

1386: \end{figure}

1387:

1388: \begin{figure}

1389: \begin{center}

1390: \includegraphics[scale=.45]{ddc_response_scaled.png}

1391: \caption{%

1392: This example response of the FIR filter in a digital down-converter

1393: illustrates the 16-tap low-pass design used in the correlator deployments

1394: presented later.

1395: \label{fig:ddc_passband}}

1396: \end{center}

1397: \end{figure}

1398:

1399: \begin{figure}

1400: \begin{center}

1401: \includegraphics[scale=.52]{pfb_vs_fft_bin_resp.png}

1402: \caption{%

1403: The response of a frequency channel in an 8-tap Polyphase Filter Bank (solid)

1404: using a Hamming window is compared to an equivalently sized Discrete Fourier

1405: Transform (dashed).  This particular PFB, implemented for 2048 channels, is

1406: used in the correlator deployments presented in Section \ref{sec:deployments}.

1407: \label{fig:pfb_bin_resp}}

1408: \end{center}

1409: \end{figure}

1410:

1411: \begin{figure}

1412: \begin{center}

1413: \includegraphics[scale=.45]{x_engine.png}

1414: \caption{%

1415: This X engine schematic illustrates the pipelined flow of data

1416: that allows it to be split across multiple FPGAs and boards.

1417: With continuous data input, all multipliers (with the possible exception of

1418: the final stage for even values of $N_{ant}$) are used with 100\% efficiency.

1419: \label{fig:x_engine_schem}}

1420: \end{center}

1421: \end{figure}

1422:

1423: \begin{figure}

1424: \begin{center}

1425: \includegraphics[scale=.45]{corr_vs_dly_128_scaled.png}

1426: \caption{%

1427: Cross-correlation of noise decreases as a function of signal delay between

1428: antenna inputs.  PFBs operate on a wider window of data compared to DFTs, and

1429: use non-flat sample weightings, yielding a

1430: different correlation response versus signal delay compared to the standard

1431: result presented in \citet{thompson_et_al2001}.  Graphed

1432: are the responses of PFBs with 8 taps (solid), 4 taps (dashed), 2 taps (dot

1433: dashed), and the response of a DFT (dotted).

1434: \label{fig:corr_vs_dly}}

1435: \end{center}

1436: \end{figure}

1437:

1438: \begin{figure}

1439: \begin{center}

1440: \includegraphics[scale=.5]{packet_rx.png}

1441: \caption{Before transmission, each F engine packet is

1442: tagged with an antenna number and master counter (MCNT) encoding

1443: time and frequency.  Received packets are filtered to

1444: a narrow range of MCNTs, and the maximum MCNT slides smoothly up as packets

1445: are received.  A free-running X engine

1446: processes available windows when it is ready.  This architecture

1447: allows data to be processed at a lower data rate than the FPGA clock rate

1448: without requiring every element in the pipeline to have an enable signal.

1449: \label{fig:packet_rx}}

1450: \end{center}

1451: \end{figure}

1452:

1453: \begin{figure}

1454: \begin{center}

1455: \includegraphics[scale=.35]{crosstalk_v2_scaled.png}

1456: \caption{%

1457: Uncorrelated noise sources with similar bandpass shapes were

1458: input to two channels of one ADC board (solid black) and a third noise source

1459: with a narrower passband was input to a second ADC board

1460: (dashed black) in the ``Pocket Correlator'' system.

1461: Crosstalk levels between signal inputs on the same ADC board (light gray) and

1462: between ADC boards sharing an IBOB (dark gray) peak at $-28$ dB.

1463: \label{fig:crosstalk}}

1464: \end{center}

1465: \end{figure}

1466:

1467: \begin{figure}

1468: \begin{center}

1469: \includegraphics[scale=.5]{crosstalk_stability_scaled.png}

1470: \caption{%

1471: Measurements of the standard deviation versus integration time of the

1472: correlation between independent noise sources into the same ADC board show

1473: that crosstalk exhibits

1474: stability over a period of 1 day for all frequency channels.

1475: Although phase switching

1476: may still be desirable, this stability allows

1477: crosstalk to be calibrated and removed after correlation.

1478: \label{fig:crosstalk_stability}}

1479: \end{center}

1480: \end{figure}

1481:

1482: \begin{figure}

1483: \begin{center}

1484: \includegraphics[scale=.52]{4_bit_quant_rev2.png}

1485: \caption{%

1486: Illustrated above is the relative gain through a 4-bit, 15-level quantizer as a

1487: function of input signal level (log base 2).  Plotted are gain curves for

1488: the cross-correlation of two

1489: gaussian noise sources with correlation levels of 100\% (solid),

1490: 80\% (dot-dashed), 40\% (dotted), and 20\% (dashed).

1491: \label{fig:4_bit_quant}}

1492: \end{center}

1493: \end{figure}

1494:

1495: \begin{figure}

1496: \begin{center}

1497: \includegraphics[scale=.5]{f_processor.png}

1498: \caption{%

1499: This IBOB design serves a dual purpose as a stand-alone ``Pocket Correlator''

1500: and an F processor in a 16 antenna packetized correlator deployment.  Note the

1501: parallel output pathways for each function.

1502: \label{fig:f_engine}}

1503: \end{center}

1504: \end{figure}

1505:

1506: \begin{figure}

1507: \begin{center}

1508: \includegraphics[scale=.25]{allsky_moll_trim_bw.png}

1509: \caption{%

1510: This all-sky image, made using a 75-MHz band centered at 150 MHz with the

1511: ``Pocket Correlator'' as part of the PAPER experiment in Western

1512: Australia, achieves an impressive 10,000:1 signal-to-noise ratio using

1513: 1 day of data.

1514: \label{fig:skymap}}

1515: \end{center}

1516: \end{figure}

1517:

1518: \begin{figure}

1519: \begin{center}

1520: \includegraphics[scale=.5]{x_processor.png}

1521: \caption{%

1522: A BEE2-based X processor in a packetized correlator transmits data

1523: from an F engine

1524: over 10GbE and stores self-addressed packets in a ``loopback'' buffer.

1525: These streams are merged on the receive side, and packets are

1526: distributed to two X engines.  Accumulation occurs

1527: in DRAM buffers, and the results are packetized and output

1528: over the same 10GbE link.  A data acquisition system connects to the

1529: same switch as the X engines.

1530: \label{fig:x_processor}}

1531: \end{center}

1532: \end{figure}

1533:

1534: \end{document}

1535:

1536: