1: \documentclass[preprint]{aastex}
2: \shorttitle{Scalable Correlator Architecture}
3: \shortauthors{Parsons et al.}
4: 
5: \usepackage{amsmath}
6: \usepackage{graphicx}
7: \usepackage{natbib}
8: \citestyle{aa}
9: 
10: \begin{document}
11: \title{A Scalable Correlator Architecture Based on 
12:     Modular FPGA Hardware, Reuseable Gateware, and Data Packetization}
13: 
14: \author{Aaron Parsons, Donald Backer, and Andrew Siemion}
15: \affil{Astronomy Department, 
16:     University of California, Berkeley, CA}
17: \email{aparsons@astron.berkeley.edu}
18: \author{Henry Chen and Dan Werthimer}
19: \affil{Space Science Laboratory,
20:     University of California, Berkeley, CA}
21: \author{Pierre Droz, Terry Filiba, Jason Manley\altaffilmark{1}, 
22:     Peter McMahon\altaffilmark{1}, and Arash Parsa}
23: \affil{Berkeley Wireless Research Center,
24:     University of California, Berkeley, CA}
25: \author{David MacMahon, Melvyn Wright}
26: \affil{Radio Astronomy Laboratory,
27:     University of California, Berkeley, CA}
28: 
29: \altaffiltext{1}{Affiliated with Karoo Array Telescope,
30:     Cape Town, South Africa}
31: 
32: \begin{abstract}
33: A new generation of radio telescopes is achieving unprecedented levels of
34: sensitivity and resolution, as well as increased agility and field-of-view, by
35: employing high-performance digital signal processing hardware to phase and
36: correlate large numbers of antennas.  The computational demands of these
37: imaging systems scale in proportion to $BMN^2$, where $B$ is the signal
38: bandwidth, $M$ is the number of independent beams, and $N$ is the number of
39: antennas.  The specifications of many new arrays lead to demands in excess of
40: tens of PetaOps per second.
41: 
42: To meet this challenge, we have developed a general purpose correlator
43: architecture using standard 10-Gbit Ethernet switches to pass data
44: between flexible hardware modules containing Field Programmable Gate Array
45: (FPGA) chips.  These chips are programmed using open-source signal processing
46: libraries we have developed to be flexible, scalable, and chip-independent.
47: This work reduces the time and cost of implementing a wide range of signal
48: processing systems, with correlators foremost among them, and facilitates
49: upgrading to new generations of processing technology. We present several
50: correlator deployments, including a 16-antenna, 200-MHz bandwidth, 4-bit, full
51: Stokes parameter application deployed on the Precision Array for Probing the
52: Epoch of Reionization.
53: \end{abstract}
54: 
55: \keywords{Astronomical Instrumentation}
56: 
57: 
58: % --------------------------------------------------------------------------
59: % Section 1
60: % --------------------------------------------------------------------------
61: \section{Introduction}
62: \label{sec:intro}
63: 
64: Radio interferometers, which operate by correlating the signals from two or
65: more antennas, have many advantages over traditional single-dish telescopes,
66: including greater scalability, independent control of aperture size and
67: collecting area, and self-calibration.  Since the first digital correlator
68: built by Weinreb \citep{weinreb_1961}, the processing power of
69: these systems has been tracking the Moore's Law growth of digital electronics.
70: The decreasing cost per performance of these systems has influenced the design
71: of many new radio antenna array telescopes.  Some
72: next-generation array telescopes at meter, centimeter and millimeter
73: wavelengths are: 
74: the LOw Frequency ARray (LOFAR), 
75: the Precision Array for Probing the Epoch of Reionization (PAPER), 
76: the Murchison Widefield Array (MWA), 
77: the Long Wavelength Array (LWA),
78: the Expanded Very Large Array (EVLA), 
79: the Allen Telescope Array (ATA), 
80: the Karoo Array Telescope (MeerKAT), 
81: the Australian Square Kilometre Array Pathfinder (ASKAP),
82: the Atacama Large Millimeter Array (ALMA),
83: and the Combined Array for Research in Millimeter-wave Astronomy (CARMA). 
84: This paper presents a novel approach to the intense digital signal 
85: processing requirements of these instruments that has many other applications
86: to astronomy signal processing.
87: 
88: While each generation of electronics has brought new commodity data processing
89: solutions, the need for high-bandwidth communication between processing nodes
90: has historically led to specialized system designs.  This communication
91: problem is particularly germane for correlators, where the number of
92: connections between nodes scales with the square of the number of antennas.
93: Solutions to date have typically consisted of specialized processing boards
94: communicating over custom backplanes using non-standard protocols.  However,
95: such solutions have the disadvantage that each new generation of digital
96: electronics requires expensive and time-consuming investments of engineering
97: time to re-solve the same connectivity problem.  Redesign is driven by the same
98: Moore's Law that makes digital interferometry attractive, and is not unique to
99: the interconnect problem; processors such as Application-Specific Integrated
100: Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs) also require
101: redesign, as do the boards bearing them, and the signal processing algorithms
102: targeting their architectures.
103: 
104: Our research is aimed at reducing the time and cost of correlator design and
105: implementation.  We do this, firstly, by developing a packetized communication
106: architecture relying on industry-standard Ethernet switches and protocols to
107: avoid redesigning backplanes, connectors, and communication protocols.
108: Secondly, we develop flexible processing modules that allow identical boards to
109: be used for a multitude of different processing tasks.  These boards are
110: applicable to general signal processing problems that go beyond 
111: correlators and even radio science to include, e.g., ASIC design and
112: simulation, genomics, and research into parallel processor architectures.  
113: General
114: purpose hardware reduces the number of boards that have to be redesigned and
115: tested with each new generation of electronics.  Thirdly, we create
116: parametrized signal processing libraries that can easily be recompiled and
117: scaled for each generation of processor.  This allows signal processing systems
118: to quickly take advantage of the capabilities of new hardware.  Finally, we
119: employ an extension of a Linux kernel to interface between CPUs and FPGAs for
120: the purposes of testing and control, presenting a standard file interface
121: for interacting with FPGA hardware.
122: 
123: This paper begins with a presentation of the new correlator
124: design architecture in \S\ref{sec:architecture}. The hardware to 
125: implement this architecture follows in \S\ref{sec:hardware}, and
126: the FPGA gateware used in the hardware is summarized in \S\ref{sec:gateware}.
127: Issues concerning system integration are given in \S\ref{sec:integration},
128: and performance characterization of subsystems is given in 
129: \S\ref{sec:characterization}. Results from our first deployments of
130: the packetized correlator are displayed in \S\ref{sec:deployments}.
131: Our final section summarizes our progress and points to a number
132: of directions we are pursuing for the next generation of scalable
133: correlators based on modular hardware, reuseable gateware and
134: data packetization. An appendix gives a glossary of technical
135: acronyms since this paper makes heavy use of abbreviated terms.
136: 
137: % --------------------------------------------------------------------------
138: % Section 2
139: % --------------------------------------------------------------------------
140: \section{A Scalable, Asynchronous, Packetized FX Correlator Architecture}
141: \label{sec:architecture}
142: 
143: Correlators integrate the pairwise correlation between complex voltage samples 
144: from polarization channels of array antenna receivers at a set of
145: frequencies.
146: Once instrumental effects have been calibrated and removed, the resultant 
147: correlations (called visibilities) represent the self-convolved electric field
148: across an aperture sampled at locations 
149: corresponding to separations between antennas.  These visibilities can be 
150: used to reconstruct an image of the sky by inverting the interferometric 
151: measurement equation:
152: \begin{equation}
153: V_{\nu}(u,v)=\int\!\!\!\!\int{G_{i,\nu}G_{j,\nu}^*I_\nu(\ell,m)}
154: {e^{-2\pi i(u\ell+vm+w(\sqrt{1-\ell^2-m^2}-1))}d\ell dm}
155: \label{eq:vis}
156: \end{equation}
157: $I_\nu$ represents the sky brightness in angular coordinates $(\ell,m)$, and 
158: $(u,v,w)$ correspond to the separation in wavelengths of an antenna pair 
159: relative to a pointing direction.
160: For antennas with separate polarization feeds, cross-correlation
161: of polarizations yields components of the four Stokes parameters that
162: characterize polarized radiation, here defined in terms of linear
163: polarizations ($\|,\perp$) for all pairs of antennas $A$ and $B$ 
164: \citep{rybicki_lightman1979}:
165: \begin{equation}
166: \begin{array}{ll}
167: \displaystyle I=A_\| B_\|^*+A_\perp B_\perp^* &\ \ \ 
168: Q=A_\| B_\|^*-A_\perp B_\perp^* \nonumber \\
169: \displaystyle U=A_\| B_\perp^*+A_\perp B_\|^* &\ \ \ 
170: V=A_\| B_\perp^*-A_\perp B_\|^* 
171: \label{eq:pol}
172: \end{array}
173: \end{equation}
174: $I$ measures total intensity, $V$ measures the degree of circular polarization,
175: and $Q$ and $U$ measure the amplitude and orientation of linear polarization.
176: 
177: The problem of computing pairwise correlation as a function of frequency can be
178: decomposed in two mathematically equivalent but architecturally distinct ways.
179: The first architecture is known as ``XF'' correlation because it first
180: cross-correlates antennas (the ``X'' operation) using a time-domain ``lag''
181: convolution, and then computes the spectrum (the ``F'' operation) for each resulting
182: baseline using a Discrete Fourier Transform (DFT).  An alternate architecture
183: takes advantage of the fact that convolution is equivalent to multiplication in
184: the Fourier domain.  This second architecture, called ``FX'' correlation, first
185: computes the spectrum for each individual antenna (the F operation), and then
186: multiplies pairwise all antennas for each spectral channel (the X operation).  
187: An FX correlator has an advantage over XF
188: correlators in that the operation that scales as $O(N^2)$ with the
189: number of antennas, $N$, is a complex multiplication as opposed to a full 
190: convolution in an XF correlator \citep{daddario2001,yen1974}.
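
As a rough illustration of this scaling (a sketch that counts operations only,
ignoring constant factors, bit widths, and any filtering ahead of the DFT;
the symbol $L$ for the number of lags or spectral channels and the example
values are assumptions made here for illustration), the operations per sample
period are of order
\[
R_{\rm XF} \sim \frac{N^2}{2}\,L, \qquad
R_{\rm FX} \sim N\log_2 L + \frac{N^2}{2}.
\]
For $N=16$ and $L=1024$, this gives $R_{\rm XF}\sim 1.3\times10^5$ and
$R_{\rm FX}\sim 290$, so the advantage of FX correlation grows with both $N$
and $L$.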
191: 
192: Though there are mitigating factors (such as bit-growth for representing the
193: higher dynamic range of frequency-domain data) that favor XF correlators for
194: small numbers of antennas \citep{thompson_et_al2001}, FX correlators are more
195: efficient for larger arrays.  Since scalability to large numbers of antennas is
196: one of the primary motivations of our correlator architecture, we have chosen
197: to develop FX architectures exclusively.
198: 
199: \subsection{Scalability With Number of Antennas and Bandwidth}
200: \label{sec:scalability}
201: 
202: The challenge of creating a scalable FX correlator is in designing a
203: scalable architecture for factoring the total computation into manageable 
204: pieces and efficiently bringing together data in each piece for computation.
205: Traditionally, the spectral decomposition (in F engines) has been scaled to
206: arbitrary bandwidths by using analog mixers and filters to divide the operating
207: band of each antenna into the widest subbands that can be processed digitally
208: using existing technology.  Within the correlation of a given subband,
209: the complexities of computation and of data distribution both scale
210: linearly with bandwidth and quadratically with the number of antennas. It is
211: imperative that the arrangement of cross-multiplication engines (hereafter
212: referred to as X engines) minimize data replication/retransmission, even as X
213: engines expand to encompass many boards.  Fortunately, each frequency channel
214: of an FX correlator is computationally independent, providing a natural
215: boundary for dividing computation among processing nodes.
216: 
217: \placefigure{fig:corr_arch1}
218: 
219: Figure \ref{fig:corr_arch1} illustrates a simplistic architecture for an FX
220: correlator that takes advantage of the computational independence of channels
221: to avoid unnecessary data transmission;
222: the total X computation has been factored into X engines that cross-multiply 
223: all antenna pairs for a single frequency channel.
224: This architecture is overly
225: simplistic, since an X engine's performance can be equated to an aggregate 
226: input bandwidth that it can handle.  For the sake of efficiency, an X engine 
227: processor 
228: should receive as many channels as it has capacity to process.  In this case, 
229: the number of X engines is given by:
230: \begin{equation}
231: \#\ {\rm X\ Engines} = \frac{({\rm Antenna\ Bandwidth})\times 
232:   (\#\ {\rm Antennas})}{ {\rm X\  Engine\ Processing\ Bandwidth}}
233: \end{equation}
234: Multiplexing channels into X engines makes cross-multiplication
235: complexity independent of the number of channels.  There are three
236: potential bottlenecks for scaling this architecture: the complexity of
237: interconnecting F engines and X engines, the bandwidth into individual X
238: engines, and the amount of computation in an X engine relative to the size of a
239: processing chip/board/system.  Each of these bottlenecks warrants further
240: discussion.
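
As a numerical illustration of the expression for the number of X engines
(the X engine processing bandwidths are assumed values, chosen only to make
the arithmetic concrete), consider $N=16$ antennas with $B=200$ MHz of
bandwidth each.  Writing $b_{\rm X}$ for the processing bandwidth of one X
engine,
\[
n_{\rm X} = \frac{B\,N}{b_{\rm X}} =
\frac{200\ {\rm MHz}\times 16}{200\ {\rm MHz}} = 16,
\]
while an X engine with $b_{\rm X}=400$ MHz would halve this to $n_{\rm X}=8$.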
241: 
242: The potential bottleneck of connecting $N$ antenna-based F engines to $M$
243: channel-based X engines is highlighted by the criss-crossed lines in Figure
244: \ref{fig:corr_arch1}.  Historically, this bottleneck has been addressed with
245: custom backplanes and transmission protocols.  However, our group has taken the
246: novel approach of using high-performance, commercially available, 
247: 10-Gbit/s Ethernet (10GbE) switches to solve this problem.  
248: As will be discussed, these switches currently have the bandwidth and switching
249: capacity to handle large correlators, and represent a negligible fraction of
250: the total cost of correlator hardware.  Furthermore, switching technology is
251: driven by commercial applications and by Moore's Law, making it likely that
252: future switches will continue increasing in number of ports and bandwidth per
253: port.
254: 
255: A second potential bottleneck concerns how data rates and 
256: numbers of X engines scale with antenna bandwidth.  It is important that
257: we consider various bandwidth cases, owing to the variety of science
258: applications driving large, next-generation systems.  For example, correlators
259: for large arrays of low-bandwidth antennas will need to multiplex data into
260: higher bandwidth processors, while arrays with larger bandwidths will face the
261: opposite problem. In our architecture, we make the reasonable assumption that
262: the number of frequency channels always exceeds the number of antennas.  
263: This assumption
264: ensures that the per-port bandwidth into an X engine never exceeds what is
265: transmitted per antenna.  Multiple channels may then be mapped into an X engine
266: up to its computational capacity (allowing efficient resource utilization for
267: low-bandwidth arrays), and additional X engines may be added for high-bandwidth
268: applications.  Antenna bandwidths requiring transmission above 10 Gbits/s can
269: be accommodated by connecting F engines to multiple 10GbE ports.
270: Frequency channels are then divided among these ports, each of which connects
271: to a separate switch and sub-network of X engines.  In this way, bandwidths may be scaled
272: up to the transmission capability of an F processor by increasing the number of
273: subnets, and not switch complexity.
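
To make the 10-Gbit/s threshold concrete (a sketch assuming two polarizations,
complex channel samples with $n_b$ bits each for the real and imaginary
parts, a critically sampled band of width $B$, and no packet overhead), the
data rate leaving one antenna's F engine is approximately
\[
R \approx 2 \times 2 \times n_b \times B = 4\,n_b B .
\]
For $n_b=4$ and $B=200$ MHz this gives $R\approx3.2$ Gbits/s, so two such
antennas fit on a single 10GbE link; under the same assumptions a single port
is exceeded once $B>625$ MHz.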
274: 
275: The third and final potential bottleneck concerns how the size of an individual X
276: engine scales with the number of antennas.  Both large and small numbers of
277: antennas pose scaling problems.  The size of an X engine responsible for
278: computing all baseline cross-multiples with a fixed input data rate
279: scales as $O(N)$, while
280: the number of X engines required to accommodate the expanding data bandwidth
281: with increasing numbers of antennas also scales as $O(N)$,
282: accounting for the $O(N^2)$ scaling of computing in a correlator.  For
283: sufficiently large $N$, the size of an X engine can exceed the size of any
284: processing chip or board.  Our solution has been to develop an X engine whose
285: pipelined architecture allows it to be split across multiple processors with
286: simple point-to-point connectivity.  This allows many processors to be chained
287: together from a switch port to meet the computational demands of an X engine.
288: Scaling to small $N$ is equally challenging, because the aggregate correlator
289: bandwidth decreases as $O(N)$, while computational complexity scales down as
290: $O(N^2)$.  As a result, the computational capacity of the X engines that
291: fit onto a chip/board can exceed the rate at which data can be received.  The
292: threshold where this problem is encountered can be changed by designing
293: processors with greater connectivity, but once hardware is fixed, there is no
294: other recourse but to accept a certain inefficiency for low numbers of
295: antennas.  While this is a fundamental limitation of our architecture, 
296: the cost of small correlators is typically dominated by development
297: (not hardware), so a certain architectural inefficiency can be accommodated for
298: the savings it affords in development time.
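
The scalings quoted above can be checked with a short calculation (a sketch;
$B$ denotes the antenna bandwidth and $b_{\rm X}$ the fixed input bandwidth of
one X engine, symbols introduced here for illustration).  The number of X
engines is $n_{\rm X}=BN/b_{\rm X}$, and each engine handles a slice of the
band of width $b_{\rm X}/N$ for all $N(N+1)/2$ antenna pairs (including
auto-correlations), so its multiply rate is
\[
C_{\rm X} \propto \frac{b_{\rm X}}{N}\,\frac{N(N+1)}{2}
= \frac{b_{\rm X}(N+1)}{2} ,
\qquad
n_{\rm X}\,C_{\rm X} \propto \frac{B\,N(N+1)}{2} .
\]
Each factor is $O(N)$, and their product recovers the $O(N^2)$ total.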
299: 
300: \subsection{Globally Asynchronous Locally Synchronous Systems}
301: \label{sec:gals}
302: 
303: Packetized data transmission simplifies the cross-connect problem inherent to
304: correlators, but this comes at the price of global synchronicity.  Packetized
305: communication is fundamentally asynchronous: data can arrive scrambled,
306: delayed, or not at all.  Locally-synchronous X engine processing must therefore
307: transition from being timing-driven (with throughput tied to an FPGA clock, for
308: example) to being asynchronously data-driven.  Though data buffers and control
309: signals complicate development, Globally Asynchronous Locally Synchronous
310: (GALS) design facilitates system integration and leads to robust design
311: \citep{chapiro1984,luis_et_al2007}.  Processors run at clock rates above the
312: data rate, using local oscillators that can drift with temperature.  By
313: allowing for non-transmission of data, individual components can fail without
314: causing global failure, an important feature for large systems where
315: components may fail regularly during operation.  GALS design also insulates
316: processing architectures from decisions regarding sample rates and antenna
317: bandwidths, allowing for greater operational flexibility.  Finally, individual
318: processing elements may be redesigned and upgraded in a GALS system without
319: affecting the overall architecture, facilitating early adoption of new
320: technology.
321: 
322: Data-driven processing on locally synchronous processors like FPGAs requires
323: controlling propagation through the processing pipeline.  However, routing
324: control signals to every multiplier, accumulator, and logic element in a
325: pipeline can lead to excessive routing and gating demands.  To avoid this, we
326: have implemented a window-based processing architecture for algorithms where
327: the results derived from one set of data samples are computationally
328: independent from the next.  In this architecture, processing elements are
329: allowed to run freely at their native rate without being enabled/disabled, but
330: are only provided data when an entire window of data has been buffered.  These
331: windows of data are provided synchronously with the inherent window boundaries
332: of the processing element, and an entire output window is flagged as valid.
333: Internally, a processor processes both valid and invalid data; it is only the
334: external buffering system that keeps track of data validity.  This technique is
335: applicable to many common operators such as cross-multipliers, DFTs, and
336: accumulators.  Finite Impulse Response (FIR) filtering is an
337: operation notable for not being window-based.
338: 
339: \subsection{Example Applications}
340: \label{sec:example}
341: 
342: \placefigure{fig:ex_app1}
343: 
344: Perhaps the best method for demonstrating the flexibility and scalability of
345: our correlator architecture is through example applications.  To illustrate
346: techniques for using hardware and ports efficiently, we will map processing
347: into fictitious hardware that corresponds roughly in capability to the
348: CASPER (Center for Astronomy Signal Processing 
349: and Engineering Research)\footnote{http://casper.berkeley.edu}
350: hardware discussed in Section \ref{sec:hardware}.
351: 
352: Our first example (Fig. \ref{fig:ex_app1}) illustrates an antenna signal
353: bandwidth sufficiently low so that data from 2 polarization channels of 2
354: antennas can be transmitted over one 10GbE connection.  Assuming that the
355: number of antennas evenly divides the number of frequency channels, and that
356: the processing bandwidth of an X engine matches the data bandwidth of one
357: antenna, there will be the same number of X engines as F engines, and each X
358: engine will receive $1/N$ of the total bandwidth, where $N$ is
359: the number of antennas.  F engine
360: transmission and X engine reception are combined on a single port to make use
361: of the bi-directionality of 10GbE.  This optimization halves the size of the
362: switch needed.  Multiple X processors can be chained together from a single
363: 10GbE port using point-to-point connections.  For cases where the number of
364: antennas does not evenly divide the number of frequency channels, one can adjust
365: packet transmission to drop remainder channels so that the band may be equally
366: divided among X engines.
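
As an example of this channel assignment (the channel and antenna counts are
assumed values for illustration), let $K$ be the number of frequency channels
and $N$ the number of antennas.  Each X engine receives
$\lfloor K/N \rfloor$ channels and $K \bmod N$ channels are dropped.  For
$K=1024$ and $N=12$,
\[
\lfloor 1024/12 \rfloor = 85, \qquad 1024 - 12\times85 = 4 ,
\]
so $4$ of $1024$ channels (about $0.4\%$ of the band) go unprocessed.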
367: 
368: \placefigure{fig:ex_app2}
369: 
370: A second example (Fig. \ref{fig:ex_app2}) illustrates a case where the
371: bandwidth from a single F engine exceeds the transmission capacity of a 10GbE
372: link.  Here, data can be split by frequency channel across two
373: ports.  Since different channels are never cross-multiplied, each of these
374: links goes to a separate subnet of switched X engines.  Thus,
375: two smaller (and often less expensive per port) switches may be 
376: substituted for one large
377: one.  Each X engine still receives the same bandwidth as in the previous
378: example, although this now represents a smaller fraction of the total
379: bandwidth.  Note that the same X processor used in the first example functions
380: here without modification.  Only the number of X engines and the transmission
381: pattern have changed.
382: 
383: \placefigure{fig:ex_app3}
384: 
385: A final example (Fig. \ref{fig:ex_app3}) explores the case where the capacity
386: of an X processor and a 10GbE link both exceed the data bandwidth.  In this
387: case, multiple F engines can (but do not have to) be chained together to
388: minimize the number of switched ports.  As should be the case, only half as
389: many X engines (as compared to Fig. \ref{fig:ex_app1}) are necessary for a
390: given number of antennas.  X processors operate in the same configuration as
391: before, oblivious to changes in F engines.
392: 
393: These examples highlight the flexibility of the hardware and gateware for
394: targeting a number of applications. One shortcoming they also illustrate is
395: how the cabling between components differs for different bandwidths.
396: Therefore the different bandwidth operations are not as easily reconfigured as
397: might be desired for varying science goals on a given telescope. Research is
398: ongoing to improve the rapid reconfigurability that is an essential
399: specification for the most general radio interferometer array applications.
400: 
401: % --------------------------------------------------------------------------
402: % Section 3
403: % --------------------------------------------------------------------------
404: \section{Modular, FPGA-based Processing Hardware}
405: \label{sec:hardware}
406: 
407: A flexible and scalable correlator architecture is of limited use without
408: equally dynamic processing hardware that can support a variety of
409: configurations.  FPGAs provide a unique combination of flexibility and
410: performance that makes them well-suited for moderate-scale signal processing
411: applications such as correlators and spectrometers \citep{parsons_et_al2006}.
412: A primary goal of the CASPER group has been development of
413: multipurpose processing modules that can be of general use to the astronomy
414: signal processing community, and beyond.  We seek to
415: minimize the effort of redesigning and upgrading hardware by modularizing
416: processing hardware, by minimizing the number of different modules 
417: in a system, and by employing industry-standard interconnection protocols.
418: 
419: Hardware modularity is the idea that boards should have consistent interfaces
420: in order to be connectible with an arbitrary number of heterogeneous components
421: to meet the computing needs of an application (``computing by the yard''), and
422: that upgrading/revising a component does not change the way in which components
423: are combined in the system.
424: Minimization of hardware reproduction costs is often used to motivate the
425: design of specialized hardware for large-scale correlators.  However, 
426: the longer development times inherent to such solutions, and
427: the necessity of targeting specific components from the outset,
428: suggest that a modular solution, initiated nearer to the deployment date,
429: will employ newer technology that costs less and uses less
430: power per operation.  The predicted economy of mass-producing
431: specially-designed hardware must be tempered by its expected devaluation
432: by Moore's Law over the course of correlator development.  This devaluation
433: supports the argument that hardware modularity can reduce the overall system
434: cost, even for large-scale systems, by reducing development time.
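
This argument can be made semi-quantitative (a sketch; the halving time
$\tau$ and the delay $\Delta t$ are assumed values, not measurements).  If the
cost per operation of digital hardware falls as $c(t)=c_0\,2^{-t/\tau}$, then
hardware whose components are fixed a time $\Delta t$ before deployment costs
\[
\frac{c(t_0)}{c(t_0+\Delta t)} = 2^{\Delta t/\tau}
\]
times more per operation than hardware selected at deployment.  For
$\tau=2$ yr and $\Delta t=4$ yr this factor is $4$, which a specialized
design must recover through lower reproduction costs to break even.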
435: 
436: In current correlator systems, we rely on two
437: CASPER FPGA-based processing boards: Internet Break-Out Boards (IBOBs) are
438: generally used for implementing per-antenna F engine processing, and
439: second-generation Berkeley Emulation Engines (BEE2s) implement X engine
440: processing.  Work is progressing on a new board, the Reconfigurable Open
441: Architecture for Computing Hardware (ROACH), that will provide a single-board
442: solution to both F and X processing.
443: 
444: \placefigure{fig:ibobadcbee2}
445: 
446: IBOBs (Fig. \ref{fig:ibobadcbee2}) can interface to two
447: Analog-to-Digital Converter (ADC) boards, each capable of digitizing two
448: streams at 1 Gsamples/sec or a single stream at 2 Gsamples/sec using an Atmel
449: AT84AD001B dual 8-bit ADC chip.  This data is processed by a Xilinx XC2VP50
450: FPGA containing 232 18$\times$18-bit multipliers, two PowerPC CPU cores, and
451: over 53,000 logic cells.  Two ZBT SRAM chips provide 36 Mbits of extra
452: buffering, and two 10GbE-compatible CX4 connectors provide a standard interface
453: for connecting to other boards, switches, and computers.  A detailed discussion
454: of ADC signal fidelity is presented in Section \ref{sec:characterization}.
455: We are developing a second ADC board that samples four signals at
456: 200 Msamples/sec. 
457: 
458: The BEE2 board \citep{chang_et_al2005} (Fig. \ref{fig:ibobadcbee2}) was
459: originally designed for high-end reconfigurable computing applications such as
460: ASIC design, but has been conscripted for astronomy applications in a
461: collaboration between the BWRC\footnote{Berkeley Wireless Research Center
462: http://bwrc.eecs.berkeley.edu}, 
463: the UC Berkeley Radio Astronomy Laboratory, and the UC Berkeley SETI group. 
464: The 500
465: Gops/sec of computational power in the BEE2
466: is provided by 5 Xilinx XC2VP70 Virtex-II Pro
467: FPGAs, each containing 328 multipliers, two PowerPC CPU cores capable of
468: running Linux, and over 74,000 configurable logic cells.  Each FPGA connects to
469: 4 GB of DDR2-SDRAM and four 10GbE-compatible CX4 connectors, and all FPGAs
470: share a 100-Mbps Ethernet port.  The size and connectivity of the
471: BEE2 board make it suitable for implementing X engine processing in our
472: correlator architecture.
473: 
474: The ROACH board is being developed in collaboration with MeerKAT and 
475: NRAO,\footnote{The National Radio Astronomy
476: Observatory (NRAO) is owned and operated by Associated Universities, Inc. with
477: funding from the National Science Foundation} 
478: and is scheduled for release in the third quarter of 2008.  It is intended as a
479: replacement for both IBOB and BEE2 boards.  A single Xilinx Virtex-5 XC5VSX95T
480: FPGA containing 94,000 logic cells and 640 multiplier/accumulators provides 400
481: Gops/sec of processing power and is connected to a separate PowerPC 440EPx
482: processor with a 1 GbE network connection.  The board contains 4 GB of DDR2
483: DRAM and two 36Mbit QDR SRAMs, four 10GbE-compatible CX4 connectors, and two
484: interfaces that allow the use of the current ADC boards, or a new 3
485: Gsamples/sec (6 Gsamples/sec dual-board interleaved) ADC.  The scale, economy,
486: and peripheral interfaces of this board will make it appropriate for both F and
487: X engine processing, and will enable a single-board correlator architecture.
488: 
489: \placetable{tab:hardware_price}
490: 
491: % --------------------------------------------------------------------------
492: % Section 4
493: % --------------------------------------------------------------------------
494: \section{Gateware}
495: \label{sec:gateware}
496: 
497: Efficient, customizable signal processing libraries are another important
498: component of a flexible and scalable correlator architecture.  Towards this
499: goal, our group has designed a set of open-source libraries\footnote{Available
500: at http://casper.berkeley.edu} for the Simulink/Xilinx System Generator FPGA
501: programming language.  These libraries abstract chip-specific components to
502: provide high-level interfaces targeting a wide variety of devices.  Signal
503: processing blocks in these libraries are parametrized to scale up and down to
504: arbitrary sizes, and to have selectable bit widths, latencies, and scaling.
505: Though the design principles of parametrization and scalability have added
506: complexity to the initial design of these libraries, they dramatically enhance
507: the libraries' applicability and potential for longevity as hardware evolves.  They also
508: decrease testing time by allowing developers to debug scale models of systems
509: that derive from the same parametrization code and are behaviorally similar to
510: larger systems.  In this section, we present several components of our
511: libraries vital to the design of flexible correlators.
512: 
513: \subsection{A Digital Down-Converter}
514: \label{sec:downconverter}
515: 
516: The rising speed of ADCs has enabled digitization to occur increasingly early
517: in the antenna receiver chain.  We are thus replacing the analog electronics
518: commonly known as the intermediate frequency processor (gain, band definition)
519: and the baseband mixer (conversion to zero frequency and filtering) with digital processing.
520: There are numerous advantages to doing this.
521: Digital mixing allows dynamically selecting an operating frequency within the
522: digitized band while ensuring perfect sine-cosine phasing in the local
523: oscillator (LO) mixing frequency. 
524: Digitizing a wider bandwidth than will be ultimately processed makes analog
525: filtering less critical; inexpensive filters with slow roll-offs can be
526: used, and passband rippling can be corrected.  Finally, digital filtering
527: allows flexibility and control in selecting passband shapes and adjusting fine
528: delays.  One can even split out several bands from the same signal.
529: The issue of quantization levels and other digital artifacts needs to be
530: carefully addressed.
531: 
532: Our library provides a digital down-conversion core with a runtime-selectable
533: mixing frequency.  Using a discretely sampled sine wave in an addressable
534: lookup table, we can approximate nearly any mixing frequency by rounding a wide
535: accumulation register (incremented every clock) to the nearest address in the
536: lookup table.  Digital sine waves have an accuracy dictated by the number of
537: bits used to represent a value; a lookup table need only have enough samples to
538: achieve comparable accuracy.  The fact that the derivative of $\sin(x)$ reaches 
539: a maximum magnitude of 1 allows the sampling interval of a sine wave to be
540: simply equated to the accuracy of a coefficient over that time interval.
541: As a result, a lookup table only need be addressed with the same
542: bit-width as the sample width to implement an arbitrary mixing frequency.
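
As a hedged illustration of this argument (the symbols $A$, $L$, $K$, and
$f_{clk}$ are introduced here for the example and are not taken from the
library), consider an $A$-bit phase accumulator incremented by an integer $K$
on every clock of frequency $f_{clk}$, whose value is rounded to an $L$-bit
lookup-table address.  The synthesized mixing frequency and its tuning
resolution are
\begin{displaymath}
f_{LO} = \frac{K}{2^A}f_{clk}, \qquad \Delta f_{LO} = \frac{f_{clk}}{2^A},
\end{displaymath}
while rounding the phase $\phi$ to the nearest table entry $\phi_q$ gives
$|\phi-\phi_q|\le\pi/2^L$, so that, because $|d\sin\phi/d\phi|\le1$,
\begin{displaymath}
|\sin\phi-\sin\phi_q| \le |\phi-\phi_q| \le \frac{\pi}{2^L}.
\end{displaymath}
For example, assuming $f_{clk}=250$ MHz and $A=32$, the tuning resolution is
$\Delta f_{LO}\approx0.058$ Hz, and an assumed $L=8$ bounds the coefficient
error at $\pi/256\approx0.012$ of full scale, which is within a factor of order
unity of the $2^{-7}\approx0.008$ least-significant bit of a signed 8-bit
coefficient.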
543: 
544: \placefigure{fig:ddc_passband}
545: 
546: Our library also contains a decimating FIR filter.  Digital filters have
547: advantages over analog filters by being reprogrammable and by providing exact,
548: calculable passbands.  This filter is often used for suppressing harmonics of
549: the mixing frequency and for steepening the rolloff of cheaper analog filters,
550: but it has also been relied upon for implementing IF sub-band selection
551: digitally.  In practice, one must weigh the need for performance and
552: flexibility against the cost of FPGA resources compared to analog filters.  As
553: an example, the response of the FIR filter used in various correlator designs
554: is shown in Figure \ref{fig:ddc_passband}.  Since the exact shape 
555: of this filter can be calculated, it is possible to remove passband
ripple post-channelization because of the large dynamic range available in
the output of our FFT core.
558: 
559: \subsection{A Polyphase Filter Bank Front-End}
560: \label{sec:pfb}
561: 
562: The Polyphase Filter Bank (PFB) \citep{crochiere+rabiner1983, vaidyanathan1990}
563: is an efficient implementation of a bank of evenly spaced, decimating FIR
564: filters.  The PFB algorithm decomposes these filters into a single polyphase
565: convolution followed by a DFT.  Since DFTs have been highly optimized
566: algorithmically, this results in an extremely efficient implementation.
567: Equivalently, the PFB may be regarded as an improvement on the Fast Fourier
568: Transform (FFT) that uses a front-end polyphase FIR filter to improve the
569: frequency response of each spectral channel (Fig. \ref{fig:pfb_bin_resp}).
570: This improvement comes at the cost of buffering an additional window of samples
571: and adding a complex cross-multiplication for each additional tap in the
572: polyphase FIR.  This PFB implementation has seen widespread use in the astronomy
573: community in 21 cm hydrogen surveys \citep{heiles_et_al2004}, pulsar surveys
574: \citep{demorest_et_al2004}, antenna arrays \citep{bradley_et_al2005}, Very Long
575: Baseline Interferometry, and other applications.
576: 
577: \placefigure{fig:pfb_bin_resp}
578: 
579: Our core is parametrized to use selectable windowing functions, allowing
adjustment of the out-of-band rejection and passband ripple/rolloff.
\citet{blackman_tukey1958} provide a summary of the characteristics
582: and trade-offs of various windows.  Each polyphase FIR tap, at the cost of
583: increased buffering and additional multipliers, increases filter steepness by
584: adding samples (in increments of the number of channels) to the time window
585: used in the PFB.  For fixed-point implementations, a practical upper limit to
586: the number of PFB taps is set by the number of bits used to represent filter
587: coefficients; the sinc function's 1/x tapering ceases to be representable when
588: $\pi T > \pi + 2^{B+1}$ where $T$ is the number of taps, and $B$ is the
589: coefficient bit width.  Finally, the width of a PFB channel is tunable by
590: adjusting the period of the sinc function, forcing adjacent bandpass filters to
overlap at a point other than the $-3$ dB point.  Note that this causes
592: power to no longer be conserved in the Fourier transform operation.
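
Taking the inequality above at face value, the limit can be rewritten as a
bound on the number of taps and evaluated for an illustrative (assumed)
coefficient width of $B=8$ bits:
\begin{displaymath}
T > 1+\frac{2^{B+1}}{\pi} \quad\Rightarrow\quad
T > 1+\frac{512}{\pi}\approx 164 \quad (B=8).
\end{displaymath}
This suggests that designs with a handful of taps sit far below the point at
which coefficient precision, rather than buffering and multiplier cost, limits
the filter.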
593: 
594: \subsection{A Bandwidth-Agile Fast Fourier Transform}
595: \label{sec:fft}
596: 
597: The computational core of our FFT library is an implementation of a radix-2
598: biplex pipelined FFT \citep{rabiner_gold1975} capable of analyzing two
599: independent, complex data streams using a fraction of the FPGA resources of
600: commercial designs \citep{dick2000}.  This architecture takes advantage of the
601: streaming nature of ADC samples by multiplexing the butterfly computations of
602: each FFT stage into a single physical butterfly core.  When used to analyze two
603: independent streams, every butterfly in this biplex core outputs valid data
604: every clock for 100\% utilization efficiency.
605: 
606: The need to analyze bandwidths higher than the native clock rate of an FPGA led
607: us to create a second core that combines multiple biplex cores with additional
608: butterfly cores to create an FFT that is parametrized to handle $2^P$ samples
609: in parallel \citep{parsons2008}.  This FFT architecture uses only 25\% more
610: buffering than the theoretical minimum, and still achieves 100\% butterfly
611: utilization efficiency.  This feat is achieved by decomposing a $2^N$
612: channel FFT into $2^P$ parallel biplex FFTs of length $2^{N-P}$, followed by a
613: $2^P$ channel parallel FFT core using time-multiplexed twiddle-factor
614: coefficients.
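
One way to see why such a decomposition is possible is the standard
Cooley-Tukey index mapping.  The sketch below assumes, for illustration only,
that sample $n=2^Pn_2+n_1$ arrives on parallel input $n_1$ at clock $n_2$ and
that output channels are indexed as $k=k_1+2^{N-P}k_2$, with
$0\le k_1<2^{N-P}$ and $0\le k_2<2^P$; the gateware may order its operations
differently.  Writing $W_M=e^{-2\pi i/M}$,
\begin{displaymath}
X[k_1+2^{N-P}k_2]=\sum_{n_1=0}^{2^P-1} W_{2^P}^{n_1k_2}\,W_{2^N}^{n_1k_1}
\left[\sum_{n_2=0}^{2^{N-P}-1} x[2^Pn_2+n_1]\,W_{2^{N-P}}^{n_2k_1}\right].
\end{displaymath}
The bracketed sums are $2^P$ independent FFTs of length $2^{N-P}$, one per
parallel input stream; the factors $W_{2^N}^{n_1k_1}$ are twiddle coefficients
that change with $k_1$ as results stream out; and the outer sum is a
$2^P$-point FFT across the parallel streams.  For $N=11$ and $P=2$, for
instance, a 2048-channel transform becomes 4 streaming FFTs of length 512
followed by a 4-point parallel FFT.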
615: 
616: Finally, we have written modules for performing two real FFTs with each half of
617: a biplex FFT using Hermitian conjugation.  Mirroring and
618: conjugating the output spectra to reconstitute the negative frequencies, this
619: module effects a 4-in-1 real biplex FFT that can then be substituted for the
620: equivalent number of biplex cores in a high-bandwidth FFT.  Thus, our real FFT
621: module has the same bandwidth flexibility as our standard complex FFT.
622: 
623: Dynamic range inside fixed-point FFTs requires careful consideration.  Tones
624: are folded into half as many samples through each FFT stage, causing magnitudes
625: to grow by a factor of 2 for narrow-band signals, and $\sqrt{2}$ for random 
626: noise.  To
627: avoid overflow and spectrum corruption, our cores contain optional downshifts
628: at each stage.  In an interference-heavy environment, one must balance loss of
629: SNR from downshifting signal levels against loss of integration time due to
630: overflows.  A good practice is to place time-domain input into the
631: most-significant bits of the FFT and downshift as often as possible to
632: avoid overflow and minimize rounding error in each butterfly stage.  However,
633: it is also best to avoid using the top 2 bits on input since the first 
634: 2 butterfly
635: stages can be implemented using negation instead of complex multiplies, but the
636: asymmetric range of 2's complement arithmetic can allow this negation to
637: overflow.
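
As a worked example of the growth rates quoted above (the symbols $G_{tone}$
and $G_{noise}$ are introduced here), a radix-2 FFT with $2^N$ channels and no
downshifting has $N$ butterfly stages, so magnitudes grow by
\begin{displaymath}
G_{tone}=2^N, \qquad G_{noise}=2^{N/2}.
\end{displaymath}
For 2048 channels ($N=11$), this gives $G_{tone}=2048$ (11 bits of growth) and
$G_{noise}=2^{5.5}\approx45$ (5.5 bits of growth).  Downshifting by one bit at
every stage therefore holds a tone at constant magnitude while lowering the
noise level by a factor of $2^{N/2}$ relative to the word size, which is the
trade-off between SNR and overflow described above.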
638: 
639: \subsection{A Cross-Multiplication/Accumulation (X) Engine}
640: \label{sec:x_engine_arch}
641: 
642: \placefigure{fig:x_engine_schem}
643: 
644: Our FX correlator architecture employs
645: X engines to compute all antenna cross-multiples within a frequency
646: channel, and multiple frequencies are multiplexed into the core as dictated by
647: processor bandwidth; the complex visibility $V_{ij}$ (Eq. \ref{eq:vis})
648: is the average of the product of complex voltage samples from antenna $i$ and
antenna $j$, with the convention that the voltage from antenna $j>i$ is
conjugated prior to forming the product.
651: In collaboration with Lynn Urry of UC Berkeley's Radio
Astronomy Lab, we have implemented a parametrized module (Fig.
653: \ref{fig:x_engine_schem}) for computing and accumulating all visibilities for a
654: specified number of antennas.  An X engine operates by receiving $N_{ant}$ data
655: blocks in series, each containing $T_{acc}$ data samples from one frequency
656: channel of one antenna.  The first samples of all blocks are
657: cross-multiplied, and the $N_{ant}(N_{ant}+1)/2$ results are added to the
658: results from the second samples, and so on, until all $T_{acc}$ samples have
659: been exhausted.  Accumulation prevents the data rate out of a
660: cross-multiplier from exceeding the input data rate.  An X engine is divided
661: into stages, each responsible for pairing two different data blocks
662: together: the zeroth stage pairs adjacent blocks, the first stage pairs blocks
663: separated by one, and so on.  As the final accumulated results become available,
664: they are loaded onto a shift register and output from the X engine.
665: 
666: However, as a new window of $N_{ant}\times T_{acc}$ samples arrives, some
667: stages, behaving as described above, would compute invalid results using
668: data from two different windows.  To avoid this, each stage switches between
cross-multiplying separations of $S$ and separations of $N_{ant}-S$, which
happen to be valid precisely when separations of $S$ would be invalid.  As a
result, there need be only ${\rm floor}(N_{ant}/2+1)$ stages in an X engine.  Every
$T_{acc}$ samples, each stage outputs a valid result, yielding $N_{ant}\times
{\rm floor}(N_{ant}/2+1)$ total accumulations; for even values of $N_{ant}$,
674: $N_{ant}/2$ of the results from the last stage are redundant.
675: All other multiplier/accumulators are 100\% utilized.  Each stage
676: also computes all polarization cross-multiples (Eq. \ref{eq:pol})
677: using parallel multipliers.
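
The stage count can be checked against the number of unique antenna pairs,
$N_{ant}(N_{ant}+1)/2$ including autocorrelations.  As a worked example (the
antenna counts are chosen purely for illustration):
\begin{align*}
N_{ant}=8:&\quad 8\times{\rm floor}(8/2+1)=8\times5=40
  =\frac{8\times9}{2}+\frac{8}{2}=36+4,\\
N_{ant}=7:&\quad 7\times{\rm floor}(7/2+1)=7\times4=28=\frac{7\times8}{2}.
\end{align*}
In the even case, the 4 surplus accumulations are the redundant results of the
last stage; in the odd case there is no redundancy.  More generally, for even
$N_{ant}$, $N_{ant}(N_{ant}/2+1)-N_{ant}/2=N_{ant}(N_{ant}+1)/2$.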
678: 
679: When one X engine no longer fits on a single FPGA, it may be divided across
680: chips at any stage boundary at the cost of a moderate amount of bidirectional
681: interconnect.  The output shift register need not be carried between chips;
682: each FPGA can accumulate and store the results computed locally.  In order for
the output shift register's ${\rm floor}(N_{ant}/2+1)$ stages to clear before the
next accumulation is ready, an X engine requires a minimum integration length
of $T_{acc}>{\rm floor}(N_{ant}/2+1)$.  In current hardware, a practical upper
686: limit on $T_{acc}$ is set by the 2$\times$4 Mbit of SRAM storage available on
687: the IBOB.  For 2048 channels with 4-bit samples, and double buffering for 2
688: antennas, 2 polarizations, this limit is $T_{acc}\le 128$.  Longer integration
689: requires an accumulator capable of buffering an entire vector of visibility
690: data, and typically occurs in off-chip DRAM.  The maximum theoretical
accumulation length in a correlator is determined by the fringe rate of sources
692: moving across the sky, and is a function of observing frequency, maximum
693: antenna separation, and (for correlators with internal fringe rotation)
694: field-of-view across the primary beam.
695: 
696: Cross-multiplication comes to dominate the total correlator processing budget
697: for large numbers of antennas.  As a result, care must be taken both to reduce
698: the footprint of a complex multiplier/accumulator and to make full and
699: efficient use of the resources on an FPGA processor.  The number of bits used
700: to carry a signal should be minimized while retaining sufficient dynamic range
701: to distinguish signal from noise.  We have chosen to focus on 4-bit multipliers
702: in current applications, and the subjects of dynamic equalization and Van Vleck
703: correction generalized to 4 bits are explored in Section
704: \ref{sec:characterization} for optimizing signal-to-noise ratios (SNR) in our
705: correlators.  To make full use of FPGA resources, we construct
706: 4-bit complex multipliers using distributed logic, dedicated multiplier cores, 
707: and look-up tables implemented in Block RAMs.  
708: 
709: It is possible to perform the bulk of an $N$-bit complex multiply in an $M$-bit
710: multiplier core by sign-extending numbers to $2N$ bits and combining them into
711: two $M$-bit, unsigned numbers.  Multiplying $(a+bi)(c+di)$, these
712: representations are $(2^{M-2N}a_s+b_s)$ and $(2^{M-2N}c_s+d_s)$, where
713: $n_s=2^{2N}+n$.  The bits corresponding to $ac, ad+bc, bd$ may be selected from
714: the product, provided that the
715: sign-extension to $2N$ bits shifts $a+d$ beyond the bits occupied by $ad$.
716: This yields the constraint: 
717: \begin{equation} 6N-1 < M \end{equation} 
718: The 18-bit multipliers in current Xilinx 
719: FPGAs can efficiently perform 3-bit complex
720: multiplies, but fall short of 4 bits.
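
Evaluating this constraint for the two cases mentioned, with $M=18$, is a
simple arithmetic check:
\begin{displaymath}
N=3:\ 6N-1=17<18; \qquad N=4:\ 6N-1=23\not<18.
\end{displaymath}
By this criterion, a 4-bit complex multiply would need a multiplier core of at
least $M=24$ bits.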
721: 
722: % --------------------------------------------------------------------------
723: % Section 5
724: % --------------------------------------------------------------------------
725: \section{System Integration}
726: \label{sec:integration}
727: 
728: \subsection{F Engine Synchronization}
729: \label{sec:F_synch}
730: 
731: \placefigure{fig:corr_vs_dly}
732: 
733: Though we have touted GALS design principles for X engine processing,
734: digitization and spectral processing within F engines must be synchronized to a
735: time interval much smaller than a spectral window to avoid severe degradation
736: of correlation response (Fig. \ref{fig:corr_vs_dly}).  This attenuation effect,
737: resulting from the changing degree of overlap of correlated signals within a
738: spectral window, can be caused by systematic signal delay between antennas, as
739: well as by source-dependent geometric delay; FX correlators with insufficient
740: channel resolution experience a narrowing of the field of view related to
741: channel bandwidth.  This effect has been well explored for FX correlators
742: employing DFTs (see Chapter 8 of \citet{thompson_et_al2001}), but Polyphase
743: Filter Banks show a different response owing to a weighting function that
744: extends well beyond the number of samples used in a DFT. 
745: Given a standard form for PFB sample weighting of
746: ${\rm sinc}\left(\frac{\pi t}{N\tau_s}\right)
747: W\left(\frac{t}{2TN\tau_s}\right)$, 
748: where $N$ is the number of output channels,
749: $T$ is the number of PFB taps, $\tau_s$ is the delay between time-domain
750: samples, and $W$ is an arbitrary windowing function that tapers to 0 at
751: $\pm1$, the gain versus delay $G(\tau)$ of a PFB-based FX correlator is
752: given by:
753: \begin{displaymath}
754: G(\tau)=\int_{-\infty}^{\infty}{
755: \left[{\rm sinc}\left(\frac{\pi t}{N\tau_s}\right)
756: W\left(\frac{t}{2TN\tau_s}\right)\right] \times
757: \left[{\rm sinc}\left(\frac{\pi (t-\tau)}{N\tau_s}\right)
758: W\left(\frac{t-\tau}{2TN\tau_s}\right)\right]\ dt
759: }
760: \end{displaymath}
761: 
762: For the purpose of F Engine synchronization, we
763: rely on a one-pulse-per-second (1PPS) signal with a fast edge-rate provided
764: synchronously to a bank of F processors running off identical system clocks.
765: This signal is sampled by the system clock on each processor, and provided
766: alongside ADC data.  A slower, asynchronous ``arm'' signal is sent from
767: a central node to each F engine at the half second phase 
768: to indicate that the next 1PPS signal should be
769: used to generate the reset event that synchronizes spectral windows and packet
770: counters.  This ensures that samples from different antennas entering X engines
771: together were acquired within one or two system clocks of one another.  The
772: degree of synchronization is determined by the difference in path lengths of
773: 1PPS and the system clock from their generators to each F engine.  This path
774: length can be determined from celestial source observations
775: using self-calibration, and barring temperature
776: effects, will be constant for a correlator configuration following power-up.
777: 
778: \subsection{Asynchronous, Packetized ``Corner Turner''}
779: \label{sec:packetization}
780: 
781: The choice of the accumulation length $T_{acc}$ in X engines 
782: determines the natural size of UDP packets in our
783: packet-switched correlator architecture.  For current CASPER hardware where
784: channel-ordering occurs in IBOB SRAM, $T_{acc}$ is constrained by the available
785: memory to an upper limit of 128 samples for 2048-channel dual-polarization, 
786: 4-bit,
787: complex data, yielding a packet payload of 256 bytes.  A header containing
788: 2 bytes of antenna index and 6 bytes of frequency/time index is added to each
789: packet to enable packet unscrambling on the receive side.  The frequency/time
790: index (hereafter referred to as the master counter, or MCNT) is a counter that
791: is incremented every packet transmission.  The lower bits count frequencies
792: within a spectrum, and the rest count time.  Combined with the antenna
793: index, MCNT completely determines the time, frequency, source, and destination
794: of each packet; MCNT maps uniquely to a destination IP address.
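
The payload size quoted above follows from simple bookkeeping, assuming (as the
256-byte figure implies) that each 4-bit complex sample occupies $4+4=8$ bits:
\begin{displaymath}
128\ {\rm samples}\times2\ {\rm polarizations}\times8\ {\rm bits}
=2048\ {\rm bits}=256\ {\rm bytes}.
\end{displaymath}
With the $2+6=8$ byte header, each packet carries 264 bytes of application
data, so the header accounts for $8/264\approx3\%$ of the transmitted bytes
before UDP, IP, and Ethernet framing are included.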
795: 
796: \placefigure{fig:packet_rx}
797: 
798: Packet reception (Fig. \ref{fig:packet_rx}) is complicated by the realities of
799: packet scrambling, loss, and interference.  A circular buffer holding $N_{win}$
800: windows worth of X engine data stores packet data as they arrive.  The lower
bits of MCNT act as an address for placing payloads into the correct
802: window, and the antenna index addresses the position within that window.  When
803: data arrives $N_{win}/2$ windows ahead of a buffered window, that window is
804: flagged for readout, and is processed contiguously on the next window boundary
805: of the free-running X engine.  Using packet arrival to determine when a window
806: is processed allows a data-rate dependent time interval for all packets to
807: arrive, but pushes data through the buffer in the event of packet loss.  On
808: readout, the buffer is zeroed to ensure that packet loss results in loss of
809: signal, rather than the introduction of noise.  F engines can be intentionally
810: disconnected from transmission without compromising the correlation of
811: those remaining.
812: 
813: Packet interference occurs when a well-formed packet contains an invalid MCNT
814: as a result of switch latency, unsynchronized F engines, or system
815: misconfiguration.  Such packets must be prevented from entering the receive
816: buffer, since they can lead to data corruption; one would prefer that a
817: misconfigured F engine antenna result in data loss for that antenna, rather
818: than data loss for the entire system.  To ensure this behavior, incoming
819: packets face a sliding filter based on currently active MCNTs.  Packets are
820: only accepted if their MCNT falls within the range of what can currently be
821: held in the circular buffer.  As higher MCNTs are received and accepted, old
822: windows are flagged for read out, freeing up buffer space for still
823: higher MCNTs.  This system forces MCNTs to advance by small increments and
824: prevents the large discontinuities indicative of packet
825: interference.  In the eventuality that a receive buffer accidentally locks onto
826: an invalid MCNT from the outset, a time-out clause causes the currently active
827: MCNT to be abandoned for a new one if no new data is accepted into the receive
828: buffer.
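
The buffering and filtering rules above can be summarized compactly.  In this
sketch, the window index $w$ derived from a packet's MCNT, the oldest window
$w_0$ currently held, and the exact placement of the inequalities are our
notation and assumptions rather than a specification of the gateware:
\begin{align*}
{\rm accept\ packet} &\iff w_0 \le w < w_0+N_{win},\\
{\rm slot}(w) &= w \bmod N_{win},\\
{\rm flag\ window}\ w_b\ {\rm for\ readout} &\iff
  {\rm an\ accepted\ packet\ has}\ w \ge w_b+N_{win}/2.
\end{align*}
For an assumed $N_{win}=4$ with $w_0=10$, packets for windows 10 through 13 are
accepted; a packet for window 12 flags window 10 for readout; and a packet for
window 50 is rejected as interference.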
829: 
830: A final complication comes when implementing a bidirectional 10GbE transmission
831: architecture such as the one outlined in Figure \ref{fig:ex_app1}.
832: Commercial switches do not support
833: self-addressed packet transmission; they assume that the transmitter
834: (usually a CPU) intercepts these packets and transfers them to the receive
buffer.  On FPGAs, this requires an extra buffer for holding ``loopback'' packets, and
836: a multiplexer for inserting these packets into the processing stream.  A simple
837: method for this insertion would be to always insert loopback packets, if
838: available, and otherwise to insert packets from the 10GbE
839: interface.  However, there is a maximum interval over which packets with
840: identical MCNTs can be scrambled before the receive system rejects
841: packets for being outside of its buffer.  This simple method has the
842: undesirable effect of including switch latency in the time interval over which
843: packets are scrambled, causing unnecessary packet loss.  Our solution is to
844: pull loopback packets only after packets with the same MCNT 
845: arrive through the switch.
846: 
847: \subsection{Monitor, Control, and Data Acquisition}
848: \label{sec:data_aq}
849: 
850: The toolflow we have developed for CASPER hardware provides convenient
abstractions for interfacing to hardware components such as ADCs, DRAM, and
10GbE transceivers, and allows specified registers and BRAMs to be automatically
connected to CPU-accessible buses.  On top of this framework, we run BORPH, an
854: extension of the Linux operating system that provides kernel support for FPGA
855: resources \citep{so_broderson2006,so2007}.  This system allows FPGA
856: configurations to be run in the same fashion as software processes, and creates
857: a virtual file system representing the memories and registers defined on the
858: FPGA.  Every design compiled with this toolflow comes equipped with this
859: real-time interface for low- to moderate-bandwidth data I/O.  By emulating
860: standard file-I/O interfaces, BORPH allows programmers to use standard
861: languages for writing control software.  The majority of the monitor, control,
862: and data acquisition routines in our correlators are written in C
and Python.  For 8--16 antenna correlators, the bandwidth through BORPH on a
BEE2 board is sufficient to support the output of visibility data with 5--10 s
865: integrations.
866: 
867: For correlators with more antennas or shorter integration times, the bandwidth
868: of the CPU/FPGA interface is incapable of maintaining the full correlator
869: output.  This limitation is being overcome by transmitting the final correlator
870: output using a small amount of the extra bandwidth on the 10GbE ports already
871: attached to each X engine.  After accumulation in DRAM, correlator output is
872: multiplexed onto the 10GbE interface and transmitted to one or more Data
873: Acquisition (DA) systems attached to the central 10GbE switch.  These systems
874: collect and store the final correlator output.  With a capable DA system, the
875: added bandwidth through this output pathway can be used to attain millisecond
876: integration times, opening up opportunities for exploring transient events and
877: increasing time resolution for removing interference-dominated data. 
878: 
879: The capabilities of correlators made possible by our research are placing
880: new challenges on DA systems \citep{wright2005}.  There is a severe (factor of
881: 100) mismatch between the data rates in the on-line correlator hardware and
882: those supported by the off-line processing.  Members of our team are currently
883: pursuing research on how this can be resolved both for correlators and for
884: generic signal processing systems using commercially available compute
885: clusters.  For correlators, our group is currently exploring how to implement
886: calibration and imaging in real-time to reduce the burden of
887: expert data reduction on the end user, and to make best use of both telescope
888: and human resources.
889: 
890: 
891: % --------------------------------------------------------------------------
892: % Section 6
893: % --------------------------------------------------------------------------
894: \section{Characterization}
895: \label{sec:characterization}
896: 
897: \subsection{ADC Crosstalk}
898: \label{sec:crosstalk}
899: 
900: \placefigure{fig:crosstalk}
901: 
902: Crosstalk is an undesirable but prevalent characteristic of analog systems
903: wherein a signal is coupled at a low level into other pathways.  This can pose
904: a major threat to sensitivity in systems that integrate noise-dominated data to
905: reveal low-level correlation.  For CASPER hardware, we have examined crosstalk
906: levels between signal inputs sharing an ADC chip, and between different ADC
907: boards on the same IBOB.  Figure \ref{fig:crosstalk} illustrates a one-hour
908: integration of uncorrelated noise of various bandwidths input to the ``Pocket
909: Correlator'' system (see Section \ref{sec:deployments}).  Between inputs 
910: of the same ADC board, a coupling coefficient of $\sim0.0016$ indicates
911: crosstalk at approximately $-28$ dB.  This coupling is a factor of $5$ higher
912: than the $-35$ dB isolation advertised by the Atmel ADC chip, and is most
913: likely the result of board geometry and shared power supplies.  Crosstalk
914: between inputs on different ADCs also peaks at the $-28$ dB level, but shows
915: more frequency-dependent structure.
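
For reference, the decibel figures above are related as follows, assuming the
coupling coefficient is expressed as a power ratio:
\begin{displaymath}
10\log_{10}(0.0016)\approx-28\ {\rm dB}, \qquad
10^{[-28-(-35)]/10}=10^{0.7}\approx5.
\end{displaymath}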
916: 
917: \placefigure{fig:crosstalk_stability}
918: 
919: Crosstalk may be characterized and removed, provided that its timescale for
920: variation is much longer than the calibration interval.  Figure
921: \ref{fig:crosstalk_stability} demonstrates that for integration intervals
922: ranging from 7.15 seconds to approximately 1 day (the limit of our testing),
crosstalk amplitudes and phases in a lab test vary around stable values that,
when subtracted, yield noise that integrates down with time.  Even
926: though crosstalk is encountered at the $-28$ dB level, its stability allows
927: suppression to at least $-62$ dB.  This stability has allowed crosstalk
928: to be removed post-correlation, and we have until recently deferred
929: adding phase switching.  Developments along this line are proceeding by
930: introducing an invertible mixer (controlled via a Walsh counter on an IBOB)
931: early in the analog signal path, and removing this inversion after
932: digitization.  Phase switching must be coupled with data blanking near 
933: boundaries when the
934: inversion state is uncertain.  Blanking will be most easily implemented by
935: intentionally dropping packets of data from F engine transmission, and by
936: providing a count of results accumulated in each integration for normalization
937: purposes.
938: 
939: \subsection{XAUI Fidelity and Switch Throughput}
940: \label{sec:10gbe_sw}
941: 
942: CASPER boards are currently configured to transmit XAUI protocol over CX4 ports
943: as a point-to-point communication protocol and as the physical layer of 10GbE
944: transmission.  Because the Virtex-II FPGAs used in current CASPER hardware do
not fully support XAUI transmission standards \citep{xilinx_ug024,xilinx_ds083},
946: current devices can have
947: sub-optimal performance for certain cable lengths.  We expect the new ROACH
948: board, which employs Virtex-5 FPGAs, to have better
949: performance in this regard.  For cable lengths supported in current hardware,
950: we tested XAUI transmission fidelity using matched Linear Feedback Shift
951: Registers (LFSRs) on transmit and receive.  Error detection was verified using
952: programmable bit-flips following transmitting LFSRs.  Over a period of 16
953: hours, 573 Tb of data were transmitted and received on each of 8 XAUI
links.  During this time, no errors were detected, implying an estimated upper
limit on the bit-error rate of $2.2\cdot 10^{-16}$.  We also tested the capability of two
956: Fujitsu switches (the XG700 and the XG2000) for performing the full
957: cross-connect packet switching required in our FX correlator architecture.  By
958: tuning the sample rate inside F engines of an 8-antenna (4-IBOB) packetized
959: correlator, we controlled the transmission rate per switch port over a range of
960: 5.96 to 8.94 Gb/s.  In 10-minute tests, packet loss was zero for both
961: switches in all but the 8.94 Gb/s case.  Packet loss in this final case was
962: traced to intermittent XAUI failure as a result of imperfect compliance with
963: the XAUI standard, as described previously. Overheating of FPGA chips in the
964: field has also been reported as a source of intermittent operation.
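
The quoted bit-error rate corresponds to the reciprocal of the total number of
bits transferred without error (the symbol $N_{bits}$ is introduced here):
\begin{displaymath}
N_{bits}=8\times573\times10^{12}\approx4.6\times10^{15}, \qquad
\frac{1}{N_{bits}}\approx2.2\times10^{-16}.
\end{displaymath}
The implied per-link rate, $573\times10^{12}\ {\rm bits}/(16\times3600\ {\rm
s})\approx9.9$ Gb/s, is consistent with a fully loaded 10-Gbit link.  Because
no errors were observed, this figure is best read as an upper limit; a
standard 95\% confidence bound for zero observed events,
$-\ln(0.05)/N_{bits}\approx3/N_{bits}$, would be about $6.5\times10^{-16}$.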
965: 
966: \subsection{Equalization and 4-Bit Requantization}
967: \label{sec:equalization}
968: 
969: \placefigure{fig:4_bit_quant}
970: 
971: Correlator processing resources can be reduced by limiting the bit width of
972: frequency-domain antenna data before cross-multiplication.  However, digital
973: quantization requires careful setting of signal levels for optimum
974: SNR and subsequent calibration to a linear power scale 
975: \citep{thompson_et_al2001,jenet_anderson1998}.  Correlators using 4 bits 
976: represent
an improvement over their 1- and 2-bit predecessors, but there are still
978: quantization issues to consider.  The total power of a 4-bit quantizer has a
979: non-linear response with respect to input level as shown in Figure
980: \ref{fig:4_bit_quant}.  In currently deployed correlators, we perform
981: equalization (per channel scaling) to control the RMS channel values before
982: requantizing from 18 bits to 4 bits.  This operation saturates RFI and flattens
983: the passband to reduce dynamic range and to hold the passband in
984: the linear regime of the 4-bit quantization power curve.  Equalization is
985: implemented as a scalar multiplication on the output of each PFB using 18-bit
986: coefficients from a dynamically updateable memory.  These coefficients allow
987: for automatic gain control to maintain quantization fidelity through changing
988: system temperatures.
989: 
990: % --------------------------------------------------------------------------
991: % Section 7
992: % --------------------------------------------------------------------------
993: \section{Deployments and Results}
994: \label{sec:deployments}
995: 
996: \subsection{A Pocket Correlator}
997: \label{sec:pocket_corr}
998: 
999: \placefigure{fig:f_engine}
1000: 
1001: The ``Pocket Correlator'' (Fig. \ref{fig:f_engine}) is a single IBOB system
1002: that includes F and X engines on a single board for correlating and
1003: accumulating 4 input signals.  Each input is sampled at 4 times the FPGA clock
1004: rate (which runs up to 250 MHz), and a down-converter extracts half of the
1005: digitized band.  This subband is decomposed into 2048 channels by an 8-tap PFB,
1006: equalized, and requantized to 4 bits.  With all input signals on one chip, X
1007: processing can be implemented directly as multipliers and vector accumulators,
1008: rather than as X engines.  Limited buffer space on the IBOB permits only 1024
1009: channels (selectable from within the 2048) to be accumulated.  Output occurs
1010: either via serial connection (with a minimum integration time of 5
1011: seconds) or via 100-Mbit UDP transmission (with a minimum integration time in
1012: the millisecond range).  This system can act as a 2-antenna, full Stokes
1013: correlator, or as a 4-antenna single polarization correlator.
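
As an illustrative calculation at the maximum quoted clock rate (the Nyquist
relation for real sampling is assumed here), an FPGA clock of 250 MHz gives a
sample rate of $4\times250$ MHz $=1$ Gsample/sec and a digitized band of 500
MHz, of which the down-converter retains 250 MHz.  The resulting channel width
is
\begin{displaymath}
\Delta\nu=\frac{250\ {\rm MHz}}{2048}\approx122\ {\rm kHz}.
\end{displaymath}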
1014: 
1015: \placefigure{fig:skymap}
1016: 
1017: The Pocket Correlator is valuable as a simple, stand-alone instrument, and for
1018: board verification in larger packetized systems.  It is being applied as a
1019: stand-alone instrument in PAPER, the ATA, and the UNC PARI observatory. A
1020: 4-antenna, single polarization deployment of the PAPER experiment in Western
1021: Australia in 2007 used the Pocket Correlator to collect the data used to
1022: produce the 150-MHz all-sky map shown in Figure \ref{fig:skymap}.  In
1023: addition to demonstrating the feasibility of post-correlation crosstalk
1024: removal, this map (specifically, the imperfectly removed sidelobes of sources)
1025: illustrates a problem that will require real-time imaging to solve for large
1026: numbers of antennas.
1027: 
1028: \subsection{An 8-Antenna, 2-Stokes, Synchronous Correlator}
1029: \label{sec:8_ant_corr}
1030: 
1031: This first generation multi-board correlator demonstrated the functionality
1032: of signal processing algorithms and CASPER hardware, but predated the
1033: current packetized architecture and operated synchronously.  This version of
1034: the correlator was most heavily limited by X engine resources, all of which
1035: were implemented on a single FPGA to simplify interconnection. The
1036: total number of complex multipliers in the X engines of an $N_{ant}$-antenna
1037: array is $N_{cmac} = \lfloor N_{ant}/2+1\rfloor\times N_{ant}\times N_{pol}$; the
1038: limited number of multipliers on a BEE2 FPGA only allowed for supporting half
1039: the polarization cross-multiples.  This system was an
1040: important demonstration of the basic capabilities of our hardware and software,
1041: and provided a starting point for evolving a more sophisticated system.  
1042: Deployments of this
1043: system at the NRAO site in Green Bank as part of the PAPER
1044: experiment, and briefly
1045: at the Hat Creek Radio Observatory for the Allen Telescope Array,
1046: are being superseded by the packetized correlator presented in the next
1047: section.
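
To illustrate the multiplier count for this 8-antenna system, suppose that $N_{pol}$ counts the polarization products formed per baseline (4 for full Stokes, 2 for the 2-Stokes system described here; this reading of $N_{pol}$ is an assumption of this sketch).  Then
\begin{align*}
\lfloor N_{ant}/2+1 \rfloor \times N_{ant} &= 5\times 8 = 40, \\
N_{cmac}\,(N_{pol}=4) &= 160, \\
N_{cmac}\,(N_{pol}=2) &= 80.
\end{align*}
For comparison, 8 antennas form $N_{ant}(N_{ant}+1)/2 = 36$ baselines including autocorrelations, so 4 of the 40 multiplier slots per polarization product do not correspond to unique baselines.  This appears consistent with the reduced efficiency of the final X engine stage for even $N_{ant}$ noted in Figure \ref{fig:x_engine_schem}.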
1048: 
1049: \subsection{A 16-Antenna, Full-Stokes, Packetized Correlator}
1050: \label{sec:packet_deploy}
1051: 
1052: This packetized FX correlator is a realization of the architecture outlined in
1053: Figure \ref{fig:ex_app1}, with F processing for 2 antennas implemented on each
1054: IBOB, and matching X processors implemented on each corner FPGA of two BEE2s.
1055: Each F processor is identical to a Pocket Correlator (Fig. \ref{fig:f_engine}),
1056: but branches data from the equalization module to a matrix transposer in IBOB
1057: SRAM to form frequency-based packets.  Packet data for each antenna are
1058: multiplexed through a point-to-point XAUI connection to a BEE2-based X
1059: processor, and then relayed in 10GbE format to the switch.  The number of
1060: channels in this system is limited to 2048 by memory in IBOB SRAM for
1061: transposing the 128 spectra needed to meet bandwidth restrictions between X
1062: engines and DRAM-based vector accumulators.
1063: 
1064: \placefigure{fig:x_processor}
1065: 
1066: The X processor in this packetized correlator implements the transmit and
1067: receive architecture illustrated in Figure \ref{fig:x_processor} 
1068: for two X engines sharing the same 10GbE link.
1069: Each X engine's data processing rate is
1070: determined by packets arriving in its own receive buffer, and results are
1071: accumulated in separate DRAM DIMMs.  The accumulated output of each X processor
1072: is read out of DRAM at a low bandwidth and transmitted via 10GbE packets to
1073: a CPU-based server where
1074: all visibility data is collected and
1075: written to disk in MIRIAD format
1076: \citep{sault_et_al1995} using interfaces from the Astronomical Interferometry
1077: in PYthon (AIPY) package\footnote{http://pypi.python.org/pypi/aipy}.
1078: 
1079: The clocks for the BEE2 FPGAs are asynchronous 200-MHz oscillators, and IBOBs
1080: run synchronously at any rate lower than this.  Packet transmission is
1081: statically addressed so that each X engine processes every 16th channel.
1082: We use 8 ports of a Fujitsu XG700 switch to route data.  This system is
1083: scalable to 32 antennas before two X engines no longer fit on a single FPGA.
1084: For larger systems, the number of BEE2s will scale as the square of the number
1085: of antennas, and the number of IBOBs will scale linearly.  A 32-antenna,
1086: 200-MHz correlator on 16 IBOBs and 4 BEE2s is now working in the lab, and a
1087: 16-antenna version using 8 IBOBs and 2 BEE2s has been deployed to the NRAO site
1088: in Green Bank with the PAPER experiment.  
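
The resource counts for the 16-antenna deployment can be tallied as follows, assuming four corner FPGAs per BEE2 (an assumption of this sketch, chosen because it matches the 8 switch ports quoted above):
\begin{align*}
N_{IBOB} &= 16~\text{antennas} / 2 = 8, \\
N_{Xproc} &= 2~\text{BEE2s} \times 4 = 8, \\
N_{Xeng} &= 2 \times N_{Xproc} = 16, \\
N_{chan}/N_{Xeng} &= 2048/16 = 128.
\end{align*}
The symbols $N_{IBOB}$, $N_{Xproc}$, $N_{Xeng}$, and $N_{chan}$ are labels introduced here for the numbers of IBOBs, X processors, X engines, and frequency channels.  Under these assumptions, an X engine that handles every 16th channel processes 128 of the 2048 channels.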
1089: 
1090: % --------------------------------------------------------------------------
1091: % Section 8
1092: % --------------------------------------------------------------------------
1093: \section{Conclusion}
1094: \label{sec:conclusion}
1095: 
1096: By decreasing the time and engineering costs of building and upgrading
1097: correlators, we aim to reduce the total cost of correlators for a wide range of
1098: scales.  Small- and medium-scale correlators with total cost dominated by
1099: development clearly stand to benefit from our research.  It is less clear if
1100: the cost of large-scale correlators can be reduced by the general-purpose
1101: hardware used in our architecture.  Though minimization of replication cost
1102: favors the development of specialized parts, there are two factors
1103: that can make a generic, modular solution cost less.
1104: 
1105: The first factor to consider is time to deployment.  Even if the monetary cost
1106: of development is negligible in the budget of a large correlator, the cost of
1107: development time can be significant.  If a custom solution takes several years
1108: to go from design to implementation, the hardware that is deployed will be out
1109: of date.  Moore's Law suggests that when a custom solution taking 3 years to
1110: develop is deployed, there will exist processors 4 times more powerful, or 4
1111: times less expensive for the equivalent system.  The cost of a generic, modular
1112: system has to be tempered by the expected savings of committing to hardware
1113: closer to the ultimate deployment date.
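
The factor of 4 quoted above follows if processing power per unit cost doubles every 18 months, a commonly quoted form of Moore's Law that is assumed here for illustration:
\begin{equation*}
2^{\,t/T_{2}} = 2^{\,3~\mathrm{yr}/1.5~\mathrm{yr}} = 2^{2} = 4,
\end{equation*}
where $t$ is the development time and $T_{2}$ is the assumed doubling time.  A 2-year doubling time would instead give $2^{1.5}\approx 2.8$.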
1114: 
1115: The second factor is the cost of upgrade.  Many facilities (including the ATA)
1116: are beginning to appreciate the advantages of designing arrays with wider
1117: bandwidths and larger numbers of antennas than can be handled by current
1118: technology.  Correlators may then be implemented inexpensively on scales
1119: suited to current processors, and upgraded as more powerful processors
1120: become available.  Modular solutions facilitate this methodology.
1121: 
1122: % --------------------------------------------------------------------------
1123: % --------------------------------------------------------------------------
1124: % --------------------------------------------------------------------------
1125: 
1126: \acknowledgments
1127: 
1128: This and other CASPER research are supported by the National Science Foundation
1129: Grant No. 0619596 for Low Cost, Rapid Development Instrumentation for Radio
1130: Telescopes.  We would like to acknowledge the students, faculty and sponsors of
1131: the Berkeley Wireless Research Center, and the National Science Foundation
1132: Infrastructure Grant No.  0403427.  Correlator development for the PAPER
1133: project is supported by NSF grant AST-0505354, and for the ATA project by NSF
1134: grant AST-0321309 as well as the Paul G. Allen Foundation.  Chips and software
1135: were generously provided by Xilinx, Inc.  JM and PM gratefully acknowledge
1136: financial support from the MeerKAT project and South Africa's National Research
1137: Foundation.
1138: 
1139: \appendix
1140: \section{Glossary of Technical Terms}
1141: \begin{itemize}
1142: \item ADC - Analog to Digital Converter
1143: \item ASIC - Application-Specific Integrated Circuit processor
1144: \item BEE2 - Berkeley Emulation Engine, rev. 2
1145: \item BORPH - Berkeley Operating system for Re-Programmable Hardware
1146: \item BRAM - Block RAM: Random Access Memory inside an FPGA
1147: \item CX4 - 10GbE-compatible industry standard connector
1148: \item CPU - Central Processing Unit
1149: \item DDR2 - Double-Data-Rate 2 type of off-FPGA Synchronous DRAM 
1150: \item DIMM - Dual Inline Memory Module
1151: \item DFT - Discrete Fourier Transform
1152: \item DRAM - Dynamic Random Access Memory
1153: \item FFT - Fast Fourier Transform algorithm
1154: \item FIR - Finite Impulse Response digital filter
1155: \item FPGA - Field Programmable Gate Array processor
1156: \item FX - Correlator architecture implemented as frequency channelization, then cross-multiplication
1157: \item GALS - Globally Asynchronous, Locally Synchronous system architecture
1158: \item GB - GigaByte
1159: \item IBOB - Internet Break-Out Board
1160: \item LFSR - Linear Feedback Shift Register
1161: \item LO - Local Oscillator
1162: \item MCNT - Master Counter
1163: \item PFB - Polyphase Filter Bank
1164: \item PowerPC - a specific CPU architecture
1165: \item QDR - Quad-Data-Rate type of off-FPGA SRAM
1166: \item ROACH - Reconfigurable, Open Architecture for Computing Hardware
1167: \item SNR - Signal-to-Noise Ratio
1168: \item SRAM - Static Random Access Memory
1169: \item UDP - User Datagram Protocol Ethernet packetization
1170: \item XAUI - X (ten) Attachment Unit Interface point-to-point transmission protocol
1171: \item XF - Correlator architecture implemented as cross-multiplication, then frequency channelization
1172: \item 1PPS - 1 Pulse Per Second clock signal
1173: \item 10GbE - 10 Gigabit per second Ethernet communication standard
1174: \end{itemize}
1175: 
1176: \begin{thebibliography}{25}
1177: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
1178: 
1179: \bibitem[{xil(2004)}]{xilinx_ug024}
1180:  2004, {RocketIO Transceiver User Guide (UG024 V2.5)}, Xilinx user guide,
1181:   http://www.xilinx.com
1182: 
1183: \bibitem[{xil(2005)}]{xilinx_ds083}
1184:  2005, {Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Functional
1185:   Description (DS083-2 V4.5)}, Xilinx data sheet, http://www.xilinx.com
1186: 
1187: \bibitem[{Blackman \& Tukey(1958)}]{blackman_tukey1958}
1188: Blackman, R. \& Tukey, J. 1958, The measurement of power spectra (Dover
1189:   Publications Inc.)
1190: 
1191: \bibitem[{{Bradley} {et~al.}(2005){Bradley}, {Backer}, {Parsons}, {Parashare},
1192:   \& {Gugliucci}}]{bradley_et_al2005}
1193: {Bradley}, R., {Backer}, D., {Parsons}, A., {Parashare}, C., \& {Gugliucci},
1194:   N.~E. 2005, in Bulletin of the American Astronomical Society, 1216--+
1195: 
1196: \bibitem[{{Chang} {et~al.}(2005){Chang}, {Wawrzynek}, \&
1197:   {Brodersen}}]{chang_et_al2005}
1198: {Chang}, C., {Wawrzynek}, J., \& {Brodersen}, R.~W. 2005, IEEE Design and Test
1199:   of Computers, 22, 114
1200: 
1201: \bibitem[{{Chapiro}(1984)}]{chapiro1984}
1202: {Chapiro}, D.~M. 1984, PhD thesis, Stanford Univ., CA.
1203: 
1204: \bibitem[{{Crochiere} \& {Rabiner}(1983)}]{crochiere+rabiner1983}
1205: {Crochiere}, R. \& {Rabiner}, L.~R. 1983, {Multirate Digital Signal Processing}
1206:   (Englewood Cliffs, N.J., Prentice-Hall, Inc., 1983.~336 p.)
1207: 
1208: \bibitem[{{D'Addario}(2001)}]{daddario2001}
1209: {D'Addario}, L. 2001, ATA Memo
1210: 
1211: \bibitem[{{Demorest} {et~al.}(2004){Demorest}, {Ramachandran}, {Backer},
1212:   {Ferdman}, {Stairs}, \& {Nice}}]{demorest_et_al2004}
1213: {Demorest}, P., {Ramachandran}, R., {Backer}, D., {Ferdman}, R., {Stairs}, I.,
1214:   \& {Nice}, D. 2004, in Bulletin of the American Astronomical Society, 1598--+
1215: 
1216: \bibitem[{{Dick}(2000)}]{dick2000}
1217: {Dick}, C. 2000, Xilinx Application Note
1218: 
1219: \bibitem[{{Heiles} {et~al.}(2004){Heiles}, {Goldston}, {Mock}, {Parsons},
1220:   {Stanimirovic}, \& {Werthimer}}]{heiles_et_al2004}
1221: {Heiles}, C., {Goldston}, J., {Mock}, J., {Parsons}, A., {Stanimirovic}, S., \&
1222:   {Werthimer}, D. 2004, in Bulletin of the American Astronomical Society,
1223:   1476--+
1224: 
1225: \bibitem[{{Jenet} \& {Anderson}(1998)}]{jenet_anderson1998}
1226: {Jenet}, F.~A. \& {Anderson}, S.~B. 1998, PASP, 110, 1467
1227: 
1228: \bibitem[{{Parsons}(2008)}]{parsons2008}
1229: {Parsons}, A. 2008, IEEE Signal Processing Letters, submitted
1230: 
1231: \bibitem[{{Parsons} {et~al.}(2006){Parsons}, {Backer}, {Chang}, {Chapman},
1232:   {Chen}, {Crescini}, {de Jesus}, {Dick}, {Droz}, {MacMahon}, {Meder}, {Mock},
1233:   {Nagpal}, {Nikolic}, {Parsa}, {Richards}, {Siemion}, {Wawrzynek},
1234:   {Werthimer}, \& {Wright}}]{parsons_et_al2006}
1235: {Parsons}, A., {Backer}, D., {Chang}, C., {Chapman}, D., {Chen}, H.,
1236:   {Crescini}, P., {de Jesus}, C., {Dick}, C., {Droz}, P., {MacMahon}, D.,
1237:   {Meder}, K., {Mock}, J., {Nagpal}, V., {Nikolic}, B., {Parsa}, A.,
1238:   {Richards}, B., {Siemion}, A., {Wawrzynek}, J., {Werthimer}, D., \& {Wright},
1239:   M. 2006, in Asilomar Conference on Signals and Systems, Pacific Grove, CA,
1240:   2031--2035
1241: 
1242: \bibitem[{{Plana} {et~al.}(2007){Plana}, {Furber}, {Temple}, {Khan}, {Shi},
1243:   {Wu}, \& {Yang}}]{luis_et_al2007}
1244: {Plana}, L.~A., {Furber}, S.~B., {Temple}, S., {Khan}, M., {Shi}, Y., {Wu}, J.,
1245:   \& {Yang}, S. 2007, IEEE Des. Test, 24, 454
1246: 
1247: \bibitem[{{Rabiner} \& {Gold}(1975)}]{rabiner_gold1975}
1248: {Rabiner}, L.~R. \& {Gold}, B. 1975, {Theory and application of digital signal
1249:   processing} (Englewood Cliffs, N.J., Prentice-Hall, Inc., 1975.~777 p.)
1250: 
1251: \bibitem[{{Rybicki} \& {Lightman}(1979)}]{rybicki_lightman1979}
1252: {Rybicki}, G.~B. \& {Lightman}, A.~P. 1979, {Radiative processes in
1253:   astrophysics} (New York, Wiley-Interscience, 1979.~393 p.)
1254: 
1255: \bibitem[{{Sault} {et~al.}(1995){Sault}, {Teuben}, \&
1256:   {Wright}}]{sault_et_al1995}
1257: {Sault}, R.~J., {Teuben}, P.~J., \& {Wright}, M.~C.~H. 1995, in Astronomical
1258:   Society of the Pacific Conference Series, Vol.~77, Astronomical Data Analysis
1259:   Software and Systems IV, ed. R.~A. {Shaw}, H.~E. {Payne}, \& J.~J.~E.
1260:   {Hayes}, 433--+
1261: 
1262: \bibitem[{{So}(2007)}]{so2007}
1263: {So}, K.~H. 2007, PhD thesis, Berkeley Wireless Research Center, UC Berkeley,
1264:   CA.
1265: 
1266: \bibitem[{{So} \& {Brodersen}(2006)}]{so_broderson2006}
1267: {So}, K.~H. \& {Brodersen}, R.~W. 2006, in 16th International Conference on
1268:   Field Programmable Logic and Applications, 349--354
1269: 
1270: \bibitem[{{Thompson} {et~al.}(2001){Thompson}, {Moran}, \&
1271:   {Swenson}}]{thompson_et_al2001}
1272: {Thompson}, A.~R., {Moran}, J.~M., \& {Swenson}, Jr., G.~W. 2001,
1273:   {Interferometry and Synthesis in Radio Astronomy, 2nd Edition} (New York,
1274:   Wiley-Interscience, 2001.~692 p.)
1275: 
1276: \bibitem[{{Vaidyanathan}(1990)}]{vaidyanathan1990}
1277: {Vaidyanathan}, P.~P. 1990, in IEEE, Vol.~78, 56--93
1278: 
1279: \bibitem[{{Weinreb}(1961)}]{weinreb_1961}
1280: {Weinreb}, S. 1961, Proc. IEEE, 49, 1099
1281: 
1282: \bibitem[{{Wright}(2005)}]{wright2005}
1283: {Wright}, M. 2005, SKA Memo
1284: 
1285: \bibitem[{{Yen}(1974)}]{yen1974}
1286: {Yen}, J.~L. 1974, A\&AS, 15, 483
1287: 
1288: \end{thebibliography}
1289: 
1290: 
1291: % --------------------------------------------------------------------------
1292: % TABLES
1293: % --------------------------------------------------------------------------
1294: \clearpage
1295: 
1296: %\input tab1.tex
1297: \begin{table}[t]
1298: \begin{center}
1299: \caption{Price and Power Consumption of CASPER Hardware}
1300: \label{tab:hardware_price}
1301: \begin{tabular}{lrrrr}
1302: \hline\hline
1303: \vspace{3pt}
1304: Board & Board & Cost with & Gops    & Power \\
1305:       & Cost  & FPGAs     & per Sec & (W)\\
1306: \hline
1307: IBOB& \$400 & \$2700 & 70 & 30 \\
1308: BEE2& \$5000 & \$23500 & 500 & 150 \\
1309: ROACH\tablenotemark{*}& \$1000 & \$3200 & 400 & 50 \\
1310: ADC (1Gs/s$\times2$)& \$200 & \$200 & N/A & 2 \\
1311: ADC (3Gs/s)\tablenotemark{*}& \$1000 & \$1000 & N/A & 5 \\
1312: \hline\hline
1313: \vspace{-5pt}
1314: \end{tabular}
1315: \\
1316: \vspace{-10pt}
1317: \tablenotetext{*}{Estimated from prototype versions.}
1318: \end{center}
1319: \end{table}
1320: 
1321: 
1322: % --------------------------------------------------------------------------
1323: % FIGURES
1324: % --------------------------------------------------------------------------
1325: %\clearpage
1326: 
1327: \begin{figure}
1328: \begin{center}
1329: \includegraphics[scale=.4]{raw_arch.png}
1330: \caption{In a simplistic FX correlator,
1331: the signals from N antennas are first decomposed into M frequency channels 
1332: (F operation) and then cross-multiplied (X operation).  Different channels are
1333: never cross-multiplied, making them natural units for X engine processing.
1334: Thus, each X engine handles all baselines for one frequency channel.
1335: \label{fig:corr_arch1}}
1336: \end{center}
1337: \end{figure}
1338: 
1339: \begin{figure}
1340: \begin{center}
1341: \includegraphics[scale=.25]{ex_app1.png}
1342: \caption{Data bandwidth per antenna is equal to the processing bandwidth of 
1343: an X processor in this example application.  Transmitted data is routed 
1344: through an X processor to take advantage of bidirectionality of 10GbE ports, 
1345: thereby halving the number of ports on the switch.
1346: \label{fig:ex_app1}}
1347: \end{center}
1348: \end{figure}
1349: 
1350: \begin{figure}
1351: \begin{center}
1352: \includegraphics[scale=.25]{ex_app2.png}
1353: \caption{Data bandwidth per antenna can exceed
1354: what can be carried over 10GbE.  Here, the frequency band has been spread 
1355: across ports by channel, so that each half of transmission occurs on an 
1356: isolated subnet.  This is possible because different channels are never 
1357: cross-multiplied in an FX correlator.
1358: \label{fig:ex_app2}}
1359: \end{center}
1360: \end{figure}
1361: 
1362: \begin{figure}
1363: \begin{center}
1364: \includegraphics[scale=.25]{ex_app3.png}
1365: \caption{When the processing bandwidth of an X engine exceeds the antenna 
1366: bandwidth by at least a factor of 2, half as many X processors are needed for 
1367: a given number of antennas.  X processors operate independently of data
1368: bandwidth; the same design handles this and the previous two cases 
1369: (Figs. \ref{fig:ex_app1} and \ref{fig:ex_app2}).  Only the number of X 
1370: processors and the data transmission pattern have changed.
1371: \label{fig:ex_app3}}
1372: \end{center}
1373: \end{figure}
1374: 
1375: \begin{figure}
1376: \begin{center}
1377: \includegraphics[scale=.25]{ibob_bee2.jpg}
1378: \caption{%
1379: Our correlator architecture relies on modular FPGA-based processing hardware 
1380: developed by our group to
1381: combine flexibility, upgradeability, and performance.  Illustrated above are
1382: (top) IBOB and ADC FPGA/digitizer modules, and
1383: (bottom) the Berkeley Emulation Engine (BEE2) FPGA board.
1384: \label{fig:ibobadcbee2}}
1385: \end{center}
1386: \end{figure}
1387: 
1388: \begin{figure}
1389: \begin{center}
1390: \includegraphics[scale=.45]{ddc_response_scaled.png}
1391: \caption{%
1392: This example response of the FIR filter in a digital down-converter
1393: illustrates the 16-tap low-pass design used in the correlator deployments
1394: presented later.
1395: \label{fig:ddc_passband}}
1396: \end{center}
1397: \end{figure} 
1398: 
1399: \begin{figure}
1400: \begin{center}
1401: \includegraphics[scale=.52]{pfb_vs_fft_bin_resp.png}
1402: \caption{%
1403: The response of a frequency channel in an 8-tap Polyphase Filter Bank (solid)
1404: using a Hamming window is compared to an equivalently sized Discrete Fourier 
1405: Transform (dashed).  This particular PFB, implemented for 2048 channels, is 
1406: used in the correlator deployments presented in Section \ref{sec:deployments}.
1407: \label{fig:pfb_bin_resp}}
1408: \end{center}
1409: \end{figure}
1410: 
1411: \begin{figure}
1412: \begin{center}
1413: \includegraphics[scale=.45]{x_engine.png}
1414: \caption{%
1415: This X engine schematic illustrates the pipelined flow of data
1416: that allows it to be split across multiple FPGAs and boards.
1417: With continuous data input, all multipliers (with the possible exception of
1418: the final stage for even values of $N_{ant}$) are used with 100\% efficiency.
1419: \label{fig:x_engine_schem}}
1420: \end{center}
1421: \end{figure}
1422: 
1423: \begin{figure}
1424: \begin{center}
1425: \includegraphics[scale=.45]{corr_vs_dly_128_scaled.png}
1426: \caption{%
1427: Cross-correlation of noise decreases as a function of signal delay between 
1428: antenna inputs.  PFBs operate on a wider window of data compared to DFTs, and 
1429: use non-flat sample weightings, yielding a
1430: different correlation response versus signal delay compared to the standard
1431: result presented in \citet{thompson_et_al2001}.  Graphed
1432: are the responses of PFBs with 8 taps (solid), 4 taps (dashed), 2 taps (dot
1433: dashed), and the response of a DFT (dotted).
1434: \label{fig:corr_vs_dly}}
1435: \end{center}
1436: \end{figure}
1437: 
1438: \begin{figure}
1439: \begin{center}
1440: \includegraphics[scale=.5]{packet_rx.png}
1441: \caption{Before transmission, each F engine packet is 
1442: tagged with an antenna number and master counter (MCNT) encoding
1443: time and frequency.  Received packets are filtered to 
1444: a narrow range of MCNTs, and the maximum MCNT slides smoothly up as packets
1445: are received.  A free-running X engine 
1446: processes available windows when it is ready.  This architecture
1447: allows data to be processed at a lower data rate than the FPGA clock rate 
1448: without requiring every element in the pipeline to have an enable signal.
1449: \label{fig:packet_rx}}
1450: \end{center}
1451: \end{figure}
1452: 
1453: \begin{figure}
1454: \begin{center}
1455: \includegraphics[scale=.35]{crosstalk_v2_scaled.png}
1456: \caption{%
1457: Uncorrelated noise sources with similar bandpass shapes were
1458: input to two channels of one ADC board (solid black) and a third noise source 
1459: with a narrower passband was input to a second ADC board
1460: (dashed black) in the ``Pocket Correlator'' system.
1461: Crosstalk levels between signal inputs on the same ADC board (light gray) and 
1462: between ADC boards sharing an IBOB (dark gray) peak at $-28$ dB.
1463: \label{fig:crosstalk}}
1464: \end{center}
1465: \end{figure}
1466: 
1467: \begin{figure}
1468: \begin{center}
1469: \includegraphics[scale=.5]{crosstalk_stability_scaled.png}
1470: \caption{%
1471: Measurements of the standard deviation versus integration time of the 
1472: correlation between independent noise sources into the same ADC board show 
1473: that crosstalk exhibits 
1474: stability over a period of 1 day for all frequency channels.
1475: Although phase switching
1476: may still be desirable, this stability allows
1477: crosstalk to be calibrated and removed after correlation.
1478: \label{fig:crosstalk_stability}}
1479: \end{center}
1480: \end{figure}
1481: 
1482: \begin{figure}
1483: \begin{center}
1484: \includegraphics[scale=.52]{4_bit_quant_rev2.png}
1485: \caption{%
1486: Illustrated above is the relative gain through a 4-bit, 15-level quantizer as a 
1487: function of input signal level (log base 2).  Plotted are gain curves for 
1488: the cross-correlation of two
1489: Gaussian noise sources with correlation levels of 100\% (solid),
1490: 80\% (dot-dashed), 40\% (dotted), and 20\% (dashed).  
1491: \label{fig:4_bit_quant}}
1492: \end{center}
1493: \end{figure}
1494: 
1495: \begin{figure}
1496: \begin{center}
1497: \includegraphics[scale=.5]{f_processor.png}
1498: \caption{%
1499: This IBOB design serves a dual purpose as a stand-alone ``Pocket Correlator''
1500: and an F processor in a 16-antenna packetized correlator deployment.  Note the
1501: parallel output pathways for each function.
1502: \label{fig:f_engine}}
1503: \end{center}
1504: \end{figure}
1505: 
1506: \begin{figure}
1507: \begin{center}
1508: \includegraphics[scale=.25]{allsky_moll_trim_bw.png}
1509: \caption{%
1510: This all-sky image, made using a 75-MHz band centered at 150 MHz with the
1511: ``Pocket Correlator'' as part of the PAPER experiment in Western 
1512: Australia, achieves an impressive 10,000:1 signal-to-noise ratio using 
1513: 1 day of data. 
1514: \label{fig:skymap}}
1515: \end{center}
1516: \end{figure}
1517: 
1518: \begin{figure}
1519: \begin{center}
1520: \includegraphics[scale=.5]{x_processor.png}
1521: \caption{%
1522: A BEE2-based X processor in a packetized correlator transmits data 
1523: from an F engine
1524: over 10GbE and stores self-addressed packets in a ``loopback'' buffer.
1525: These streams are merged on the receive side, and packets are
1526: distributed to two X engines.  Accumulation occurs
1527: in DRAM buffers, and the results are packetized and output
1528: over the same 10GbE link.  A data acquisition system connects to the
1529: same switch as the X engines.
1530: \label{fig:x_processor}}
1531: \end{center}
1532: \end{figure}
1533: 
1534: \end{document}
1535: 
1536: