1: \documentclass[preprint]{aastex}
2: \shorttitle{Scalable Correlator Architecture}
3: \shortauthors{Parsons et al.}
4:
5: \usepackage{amsmath}
6: \usepackage{graphicx}
7: \usepackage{natbib}
8: \citestyle{aa}
9:
10: \begin{document}
11: \title{A Scalable Correlator Architecture Based on
12: Modular FPGA Hardware, Reuseable Gateware, and Data Packetization}
13:
14: \author{Aaron Parsons, Donald Backer, and Andrew Siemion}
15: \affil{Astronomy Department,
16: University of California, Berkeley, CA}
17: \email{aparsons@astron.berkeley.edu}
18: \author{Henry Chen and Dan Werthimer}
19: \affil{Space Science Laboratory,
20: University of California, Berkeley, CA}
21: \author{Pierre Droz, Terry Filiba, Jason Manley\altaffilmark{1},
22: Peter McMahon\altaffilmark{1}, and Arash Parsa}
23: \affil{Berkeley Wireless Research Center,
24: University of California, Berkeley, CA}
25: \author{David MacMahon, Melvyn Wright}
26: \affil{Radio Astronomy Laboratory,
27: University of California, Berkeley, CA}
28:
29: \altaffiltext{1}{Affiliated with Karoo Array Telescope,
30: Cape Town, South Africa}
31:
32: \begin{abstract}
33: A new generation of radio telescopes is achieving unprecedented levels of
34: sensitivity and resolution, as well as increased agility and field-of-view, by
35: employing high-performance digital signal processing hardware to phase and
36: correlate large numbers of antennas. The computational demands of these
37: imaging systems scale in proportion to $BMN^2$, where $B$ is the signal
38: bandwidth, $M$ is the number of independent beams, and $N$ is the number of
39: antennas. The specifications of many new arrays lead to demands in excess of
40: tens of PetaOps per second.
41:
To meet this challenge, we have developed a general-purpose correlator
43: architecture using standard 10-Gbit Ethernet switches to pass data
44: between flexible hardware modules containing Field Programmable Gate Array
45: (FPGA) chips. These chips are programmed using open-source signal processing
46: libraries we have developed to be flexible, scalable, and chip-independent.
47: This work reduces the time and cost of implementing a wide range of signal
48: processing systems, with correlators foremost among them, and facilitates
49: upgrading to new generations of processing technology. We present several
50: correlator deployments, including a 16-antenna, 200-MHz bandwidth, 4-bit, full
51: Stokes parameter application deployed on the Precision Array for Probing the
52: Epoch of Reionization.
53: \end{abstract}
54:
55: \keywords{Astronomical Instrumentation}
56:
57:
58: % --------------------------------------------------------------------------
59: % Section 1
60: % --------------------------------------------------------------------------
61: \section{Introduction}
62: \label{sec:intro}
63:
64: Radio interferometers, which operate by correlating the signals from two or
65: more antennas, have many advantages over traditional single-dish telescopes,
66: including greater scalability, independent control of aperture size and
67: collecting area, and self-calibration. Since the first digital correlator
68: built by Weinreb \citep{weinreb_1961}, the processing power of
69: these systems has been tracking the Moore's Law growth of digital electronics.
70: The decreasing cost per performance of these systems has influenced the design
71: of many new radio antenna array telescopes. Some
72: next-generation array telescopes at meter, centimeter and millimeter
73: wavelengths are:
74: the LOw Frequency ARray (LOFAR),
75: the Precision Array for Probing the Epoch of Reionization (PAPER),
76: the Murchison Widefield Array (MWA),
77: the Long Wavelength Array (LWA),
78: the Expanded Very Large Array (EVLA),
79: the Allen Telescope Array (ATA),
80: the Karoo Array Telescope (MeerKAT),
81: the Australian Square Kilometer Array Demonstrator (ASKAP),
the Atacama Large Millimeter Array (ALMA),
and the Combined Array for Research in Millimeter-wave Astronomy (CARMA).
84: This paper presents a novel approach to the intense digital signal
85: processing requirements of these instruments that has many other applications
86: to astronomy signal processing.
87:
88: While each generation of electronics has brought new commodity data processing
89: solutions, the need for high-bandwidth communication between processing nodes
has historically led to specialized system designs. This communication
91: problem is particularly germane for correlators, where the number of
92: connections between nodes scales with the square of the number of antennas.
93: Solutions to date have typically consisted of specialized processing boards
94: communicating over custom backplanes using non-standard protocols. However,
95: such solutions have the disadvantage that each new generation of digital
96: electronics requires expensive and time-consuming investments of engineering
97: time to re-solve the same connectivity problem. Redesign is driven by the same
98: Moore's Law that makes digital interferometry attractive, and is not unique to
99: the interconnect problem; processors such as Application-Specific Integrated
100: Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs) also require
101: redesign, as do the boards bearing them, and the signal processing algorithms
102: targeting their architectures.
103:
104: Our research is aimed at reducing the time and cost of correlator design and
105: implementation. We do this, firstly, by developing a packetized communication
106: architecture relying on industry-standard Ethernet switches and protocols to
107: avoid redesigning backplanes, connectors, and communication protocols.
108: Secondly, we develop flexible processing modules that allow identical boards to
109: be used for a multitude of different processing tasks. These boards are
110: applicable to general signal processing problems that go beyond
111: correlators and even radio science to include, e.g., ASIC design and
112: simulation, genomics, and research into parallel processor architectures.
General-purpose hardware reduces the number of boards that have to be redesigned and
115: tested with each new generation of electronics. Thirdly, we create
116: parametrized signal processing libraries that can easily be recompiled and
117: scaled for each generation of processor. This allows signal processing systems
118: to quickly take advantage of the capabilities of new hardware. Finally, we
119: employ an extension of a Linux kernel to interface between CPUs and FPGAs for
120: the purposes of testing and control, presenting a standard file interface
121: for interacting with FPGA hardware.
122:
123: This paper begins with a presentation of the new correlator
124: design architecture in \S\ref{sec:architecture}. The hardware to
125: implement this architecture follows in \S\ref{sec:hardware}, and
126: the FPGA gateware used in the hardware is summarized in \S\ref{sec:gateware}.
Issues concerning system integration are discussed in \S\ref{sec:integration},
and the performance characterization of subsystems is given in
\S\ref{sec:characterization}. Results from our first deployments of
130: the packetized correlator are displayed in \S\ref{sec:deployments}.
131: Our final section summarizes our progress and points to a number
132: of directions we are pursuing for the next generation of scalable
133: correlators based on modular hardware, reuseable gateware and
134: data packetization. An appendix gives a glossary of technical
135: acronyms since this paper makes heavy use of abbreviated terms.
136:
137: % --------------------------------------------------------------------------
138: % Section 2
139: % --------------------------------------------------------------------------
140: \section{A Scalable, Asynchronous, Packetized FX Correlator Architecture}
141: \label{sec:architecture}
142:
143: Correlators integrate the pairwise correlation between complex voltage samples
144: from polarization channels of array antenna receivers at a set of
145: frequencies.
146: Once instrumental effects have been calibrated and removed, the resultant
147: correlations (called visibilities) represent the self-convolved electric field
148: across an aperture sampled at locations
149: corresponding to separations between antennas. These visibilities can be
150: used to reconstruct an image of the sky by inverting the interferometric
151: measurement equation:
152: \begin{equation}
153: V_{\nu}(u,v)=\int\!\!\!\!\int{G_{i,\nu}G_{j,\nu}^*I_\nu(\ell,m)}
154: {e^{-2\pi i(u\ell+vm+w(\sqrt{1-\ell^2-m^2}-1))}d\ell dm}
155: \label{eq:vis}
156: \end{equation}
157: $I_\nu$ represents the sky brightness in angular coordinates $(\ell,m)$, and
158: $(u,v,w)$ correspond to the separation in wavelengths of an antenna pair
159: relative to a pointing direction.
160: For antennas with separate polarization feeds, cross-correlation
161: of polarizations yields components of the four Stokes parameters that
162: characterize polarized radiation, here defined in terms of linear
163: polarizations ($\|,\perp$) for all pairs of antennas $A$ and $B$
164: \citep{rybicki_lightman1979}:
165: \begin{equation}
166: \begin{array}{ll}
167: \displaystyle I=A_\| B_\|^*+A_\perp B_\perp^* &\ \ \
168: Q=A_\| B_\|^*-A_\perp B_\perp^* \nonumber \\
169: \displaystyle U=A_\| B_\perp^*+A_\perp B_\|^* &\ \ \
170: V=A_\| B_\perp^*-A_\perp B_\|^*
171: \label{eq:pol}
172: \end{array}
173: \end{equation}
$I$ measures total intensity, $V$ measures the degree of circular polarization,
and $Q$ and $U$ measure the amplitude and orientation of linear polarization.
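
Two consequences of Eqs. \ref{eq:vis} and \ref{eq:pol} are worth writing out
explicitly; what follows is a sketch under standard assumptions, not a
description of a particular imaging pipeline. For a field of view small enough
that $\pi w(\ell^2+m^2)\ll1$, the $w$ term in Eq. \ref{eq:vis} can be
neglected, and the measurement equation reduces to a two-dimensional Fourier
transform that can be inverted directly:
\begin{equation*}
G_{i,\nu}G_{j,\nu}^*I_\nu(\ell,m)\approx
\int\!\!\!\!\int V_\nu(u,v)\,e^{+2\pi i(u\ell+vm)}\,du\,dv .
\end{equation*}
Similarly, Eq. \ref{eq:pol} can be inverted to recover the four polarization
cross-products from the Stokes parameters,
\begin{equation*}
\begin{array}{ll}
\displaystyle A_\| B_\|^*=(I+Q)/2 &\ \ \
A_\perp B_\perp^*=(I-Q)/2 \\
\displaystyle A_\| B_\perp^*=(U+V)/2 &\ \ \
A_\perp B_\|^*=(U-V)/2
\end{array}
\end{equation*}
so a correlator that computes all four cross-products for each antenna pair
retains the full polarization information.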
176:
177: The problem of computing pairwise correlation as a function of frequency can be
decomposed in two mathematically equivalent but architecturally distinct ways.
179: The first architecture is known as ``XF'' correlation because it first
180: cross-correlates antennas (the ``X'' operation) using a time-domain ``lag''
181: convolution, and then computes the spectrum (the ``F'' operation) for each resulting
182: baseline using a Discrete Fourier Transform (DFT). An alternate architecture
183: takes advantage of the fact that convolution is equivalent to multiplication in
the Fourier domain. This second architecture, called ``FX'' correlation, first
185: computes the spectrum for each individual antenna (the F operation), and then
186: multiplies pairwise all antennas for each spectral channel (the X operation).
187: An FX correlator has an advantage over XF
188: correlators in that the operation that scales as $O(N^2)$ with the
number of antennas, $N$, is a complex multiplication as opposed to a full
190: convolution in an XF correlator \citep{daddario2001,yen1974}.
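
The equivalence invoked above can be made explicit. The notation used here
($v_i$ for the voltage from antenna $i$, $r_{ij}$ for the lag
cross-correlation, and $\hat v_i$ for the voltage spectrum) is introduced only
for this illustration. Writing the lag cross-correlation as
\begin{equation*}
r_{ij}(\tau)=\int v_i(t)\,v_j^*(t-\tau)\,dt,
\end{equation*}
its Fourier transform over lag is
\begin{equation*}
\int r_{ij}(\tau)\,e^{-2\pi i\nu\tau}\,d\tau
=\left[\int v_i(t)\,e^{-2\pi i\nu t}\,dt\right]
\left[\int v_j(s)\,e^{-2\pi i\nu s}\,ds\right]^*
=\hat v_i(\nu)\,\hat v_j^*(\nu),
\end{equation*}
where the substitution $s=t-\tau$ separates the double integral. An XF
correlator evaluates the left-hand side, while an FX correlator evaluates the
right-hand side.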
191:
192: Though there are mitigating factors (such as bit-growth for representing the
193: higher dynamic range of frequency-domain data) that favor XF correlators for
194: small numbers of antennas \citep{thompson_et_al2001}, FX correlators are more
195: efficient for larger arrays. Since scalability to large numbers of antennas is
196: one of the primary motivations of our correlator architecture, we have chosen
197: to develop FX architectures exclusively.
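
As a rough illustration of this scaling, consider the approximate number of
multiplications per input sample period in each architecture. This is an
order-of-magnitude sketch: $K$ (the number of lags or spectral channels) and
the operation counts are simplifying assumptions that ignore bit-width
differences and filter-bank overheads.
\begin{equation*}
C_{\rm XF}\sim K\,\frac{N(N-1)}{2},\qquad
C_{\rm FX}\sim N\log_2 K+\frac{N(N-1)}{2}.
\end{equation*}
For $N=16$ and $K=1024$, these expressions give
$C_{\rm XF}\sim1.2\times10^5$ and $C_{\rm FX}\sim280$, and the ratio between
them grows as either $N$ or $K$ increases.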
198:
199: \subsection{Scalability With Number of Antennas and Bandwidth}
200: \label{sec:scalability}
201:
202: The challenge of creating a scalable FX correlator is in designing a
203: scalable architecture for factoring the total computation into manageable
204: pieces and efficiently bringing together data in each piece for computation.
205: Traditionally, the spectral decomposition (in F engines) has been scaled to
206: arbitrary bandwidths by using analog mixers and filters to divide the operating
207: band of each antenna into the widest subbands that can be processed digitally
using existing technology. Within the correlation of a given subband,
209: the complexities of computation and of data distribution both scale
210: linearly with bandwidth and quadratically with the number of antennas. It is
211: imperative that the arrangement of cross-multiplication engines (hereafter
212: referred to as X engines) minimize data replication/retransmission, even as X
213: engines expand to encompass many boards. Fortunately, each frequency channel
214: of an FX correlator is computationally independent, providing a natural
215: boundary for dividing computation among processing nodes.
216:
217: \placefigure{fig:corr_arch1}
218:
219: Figure \ref{fig:corr_arch1} illustrates a simplistic architecture for an FX
220: correlator that takes advantage of the computational independence of channels
221: to avoid unnecessary data transmission;
222: the total X computation has been factored into X engines that cross-multiply
223: all antenna pairs for a single frequency channel.
224: This architecture is overly
225: simplistic, since an X engine's performance can be equated to an aggregate
226: input bandwidth that it can handle. For the sake of efficiency, an X engine
227: processor
228: should receive as many channels as it has capacity to process. In this case,
229: the number of X engines is given by:
230: \begin{equation}
231: \#\ {\rm X\ Engines} = \frac{({\rm Antenna\ Bandwidth})\times
232: (\#\ {\rm Antennas})}{ {\rm X\ Engine\ Processing\ Bandwidth}}
233: \end{equation}
234: Multiplexing channels into X engines makes cross-multiplication
235: complexity independent of the number of channels. There are three
236: potential bottlenecks for scaling this architecture: the complexity of
237: interconnecting F engines and X engines, the bandwidth into individual X
238: engines, and the amount of computation in an X engine relative to the size of a
239: processing chip/board/system. Each of these bottlenecks warrants further
240: discussion.
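
As a worked illustration of the expression above, with numbers chosen purely
for arithmetic convenience rather than taken from any deployed system,
consider 32 antennas of 100 MHz bandwidth each and an X engine able to process
an aggregate input bandwidth of 200 MHz:
\begin{equation*}
\#\ {\rm X\ Engines}=\frac{(100\ {\rm MHz})\times32}{200\ {\rm MHz}}=16 .
\end{equation*}
Doubling the number of antennas to 64 doubles this count to 32.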
241:
242: The potential bottleneck of connecting $N$ antenna-based F engines to $M$
243: channel-based X engines is highlighted by the criss-crossed lines in Figure
244: \ref{fig:corr_arch1}. Historically, this bottleneck has been addressed with
245: custom backplanes and transmission protocols. However, our group has taken the
246: novel approach of using high-performance, commercially available,
247: 10-Gbit/s Ethernet (10GbE) switches to solve this problem.
248: As will be discussed, these switches currently have the bandwidth and switching
249: capacity to handle large correlators, and represent a negligible fraction of
250: the total cost of correlator hardware. Furthermore, switching technology is
251: driven by commercial applications and by Moore's Law, making it likely that
252: future switches will continue increasing in number of ports and bandwidth per
253: port.
254:
255: A second potential bottleneck concerns how data rates and
256: numbers of X engines scale with antenna bandwidth. It is important that
257: we consider various bandwidth cases, owing to the variety of science
258: applications driving large, next-generation systems. For example, correlators
259: for large arrays of low-bandwidth antennas will need to multiplex data into
260: higher bandwidth processors, while arrays with larger bandwidths will face the
261: opposite problem. In our architecture, we make the reasonable assumption that
262: the number of frequency channels always exceeds the number of antennas.
263: This assumption
264: ensures that the per-port bandwidth into an X engine never exceeds what is
265: transmitted per antenna. Multiple channels may then be mapped into an X engine
266: up to its computational capacity (allowing efficient resource utilization for
267: low-bandwidth arrays), and additional X engines may be added for high-bandwidth
268: applications. Antenna bandwidths requiring transmission above 10 Gbits/s can
269: be accommodated by connecting F engines to multiple 10GbE ports.
270: Frequency channels are then assigned to each port, which connect separate
271: switches and sub-networks of X engines. In this way, bandwidths may be scaled
272: up to the transmission capability of an F processor by increasing the number of
273: subnets, and not switch complexity.
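
A back-of-the-envelope estimate illustrates these cases. The symbols $b$ (bits
per real or imaginary component), $n_{\rm pol}$ (polarizations per antenna),
and $R$ (output data rate of an F engine serving one antenna) are introduced
here for illustration, and packet overhead is neglected. For critically
sampled complex channels spanning a bandwidth $B$,
\begin{equation*}
R=2\,b\,n_{\rm pol}\,B,\qquad
\#\ {\rm 10GbE\ ports\ per\ F\ engine}=
\left\lceil\frac{R}{10\ {\rm Gbit/s}}\right\rceil .
\end{equation*}
With $b=4$, $n_{\rm pol}=2$, and $B=200$ MHz, $R=3.2$ Gbit/s, so a single port
suffices. With $B=1$ GHz, $R=16$ Gbit/s, and the band would be split across
two ports feeding two subnets.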
274:
275: The third and final potential bottleneck concerns how the sizes of individual X
engines scale with the number of antennas. Both large and small numbers of
277: antennas pose scaling problems. The size of an X engine responsible for
278: computing all baseline cross-multiples with a fixed input data rate
279: scales as $O(N)$, while
280: the number of X engines required to accommodate the expanding data bandwidth
281: with increasing numbers of antennas also scales as $O(N)$,
282: accounting for the $O(N^2)$ scaling of computing in a correlator. For
283: sufficiently large $N$, the size of an X engine can exceed the size of any
284: processing chip or board. Our solution has been to develop an X engine whose
285: pipelined architecture allows it to be split across multiple processors with
286: simple point-to-point connectivity. This allows many processors to be chained
287: together from a switch port to meet the computational demands of an X engine.
288: Scaling to small $N$ is equally challenging, because the aggregate correlator
289: bandwidth decreases as $O(N)$, while computational complexity scales down as
$O(N^2)$. As a result, we can find that the X engines that
fit onto a chip/board can process data faster than it can be received. The
292: threshold where this problem is encountered can be changed by designing
293: processors with greater connectivity, but once hardware is fixed, there is no
294: other recourse but to accept a certain inefficiency for low numbers of
295: antennas. While this is a fundamental limitation of our architecture,
296: the cost of small correlators is typically dominated by development
297: (not hardware), so a certain architectural inefficiency can be accommodated for
298: the savings it affords in development time.
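
These scalings follow from a short count. The symbols $S$ (the aggregate
complex sample rate that one X engine can accept, summed over antennas) and
$B$ (the bandwidth per antenna) are introduced here for illustration, and
autocorrelations are included in the baseline count. An X engine receives
$S/N$ spectral samples per second from each of $N$ antennas and forms
$N(N+1)/2$ products for each set, so
\begin{equation*}
\frac{\rm multiplies}{\rm X\ engine}=\frac{S}{N}\cdot\frac{N(N+1)}{2}
=\frac{S(N+1)}{2},\qquad
\#\ {\rm X\ Engines}=\frac{BN}{S},
\end{equation*}
and their product, $BN(N+1)/2$, recovers the $O(N^2)$ total. The input rate
per X engine stays fixed at $S$ while its computation grows as $O(N)$, which
is why, for small $N$, the input bandwidth of a processor rather than its
logic limits how many X engines it can usefully host.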
299:
300: \subsection{Globally Asynchronous Locally Synchronous Systems}
301: \label{sec:gals}
302:
303: Packetized data transmission simplifies the cross-connect problem inherent to
304: correlators, but this comes at the price of global synchronicity. Packetized
305: communication is fundamentally asynchronous: data can arrive scrambled,
306: delayed, or not at all. Locally-synchronous X engine processing must therefore
307: transition from being timing-driven (with throughput tied to an FPGA clock, for
308: example) to being asynchronously data-driven. Though data buffers and control
309: signals complicate development, Globally Asynchronous Locally Synchronous
310: (GALS) design facilitates system integration and leads to robust design
311: \citep{chapiro1984,luis_et_al2007}. Processors run at clock rates above the
312: data rate, using local oscillators that can drift with temperature. By
313: allowing for non-transmission of data, individual components can fail without
causing global failure, an important feature for large systems where
315: components may fail regularly during operation. GALS design also insulates
316: processing architectures from decisions regarding sample rates and antenna
317: bandwidths, allowing for greater operational flexibility. Finally, individual
318: processing elements may be redesigned and upgraded in a GALS system without
319: affecting the overall architecture, facilitating early adoption of new
320: technology.
321:
322: Data-driven processing on locally synchronous processors like FPGAs requires
323: controlling propagation through the processing pipeline. However, routing
324: control signals to every multiplier, accumulator, and logic element in a
325: pipeline can lead to excessive routing and gating demands. To avoid this, we
326: have implemented a window-based processing architecture for algorithms where
327: the results derived from one set of data samples are computationally
328: independent from the next. In this architecture, processing elements are
329: allowed to run freely at their native rate without being enabled/disabled, but
330: are only provided data when an entire window of data has been buffered. These
331: windows of data are provided synchronously with the inherent window boundaries
332: of the processing element, and an entire output window is flagged as valid.
Internally, a processor processes both valid and invalid data; it is only the
334: external buffering system that keeps track of data validity. This technique is
335: applicable to many common operators such as cross-multipliers, DFTs, and
336: accumulators. Finite Impulse Response (FIR) filtering is an
337: operation notable for not being window-based.
338:
339: \subsection{Example Applications}
340: \label{sec:example}
341:
342: \placefigure{fig:ex_app1}
343:
344: Perhaps the best method for demonstrating the flexibility and scalability of
345: our correlator architecture is through example applications. To illustrate
346: techniques for using hardware and ports efficiently, we will map processing
347: into fictitious hardware that corresponds roughly in capability to the
348: CASPER (Center for Astronomy Signal Processing
349: and Engineering Research)\footnote{http://casper.berkeley.edu}
350: hardware discussed in Section \ref{sec:hardware}.
351:
352: Our first example (Fig. \ref{fig:ex_app1}) illustrates an antenna signal
353: bandwidth sufficiently low so that data from 2 polarization channels of 2
354: antennas can be transmitted over one 10GbE connection. Assuming that the
355: number of antennas evenly divides the number of frequency channels, and that
356: the processing bandwidth of an X engine matches the data bandwidth of one
357: antenna, there will be the same number of X engines as F engines, and each X
engine will receive $1/N$ of the total bandwidth, where $N$ is
359: the number of antennas. F engine
360: transmission and X engine reception are combined on a single port to make use
361: of the bi-directionality of 10GbE. This optimization halves the size of the
362: switch needed. Multiple X processors can be chained together from a single
363: 10GbE port using point-to-point connections. For cases where the number of
364: antennas does not evenly divide the number of frequency channels, one can adjust
365: packet transmission to drop remainder channels so that the band may be equally
366: divided among X engines.
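
The channel bookkeeping can be illustrated with a short example; $K$ (the
number of frequency channels) and the specific values below are hypothetical.
Each of the $N$ X engines receives $\lfloor K/N\rfloor$ channels, and the
$K \bmod N$ remainder channels are dropped:
\begin{align*}
K=2048,\ N=16 &: \quad \lfloor K/N\rfloor=128, \quad K \bmod N=0,\\
K=2048,\ N=24 &: \quad \lfloor K/N\rfloor=85, \quad K \bmod N=8.
\end{align*}
In the second case, 8 of 2048 channels (about 0.4\% of the band) are discarded
so that each X engine processes an equal share.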
367:
368: \placefigure{fig:ex_app2}
369:
370: A second example (Fig. \ref{fig:ex_app2}) illustrates a case where the
371: bandwidth from a single F engine exceeds the transmission capacity of a 10GbE
372: link. Here, data can be split by frequency channel across two
373: ports. Since different channels are never cross-multiplied, each of these
374: links goes to a separate subnet of switched X engines. Thus,
375: two smaller (and often less expensive per port) switches may be
376: substituted for one large
377: one. Each X engine still receives the same bandwidth as in the previous
378: example, although this now represents a smaller fraction of the total
379: bandwidth. Note that the same X processor used in the first example functions
here without modification. Only the number of X engines and the transmission
pattern have changed.
382:
383: \placefigure{fig:ex_app3}
384:
385: A final example (Fig. \ref{fig:ex_app3}) explores the case where the capacity
386: of an X processor and a 10GbE link both exceed the data bandwidth. In this
387: case, multiple F engines can (but do not have to) be chained together to
388: minimize the number of switched ports. As should be the case, only half as
389: many X engines (as compared to Fig. \ref{fig:ex_app1}) are necessary for a
390: given number of antennas. X processors operate in the same configuration as
391: before, oblivious to changes in F engines.
392:
393: These examples highlight the flexibility of the hardware and gateware for
394: targeting a number of applications. One shortcoming they also illustrate is
395: how the cabling between components differs for different bandwidths.
396: Therefore the different bandwidth operations are not as easily reconfigured as
397: might be desired for varying science goals on a given telescope. Research is
398: ongoing to improve the rapid reconfigurability that is an essential
399: specification for the most general radio interferometer array applications.
400:
401: % --------------------------------------------------------------------------
402: % Section 3
403: % --------------------------------------------------------------------------
404: \section{Modular, FPGA-based Processing Hardware}
405: \label{sec:hardware}
406:
407: A flexible and scalable correlator architecture is of limited use without
408: equally dynamic processing hardware that can support a variety of
409: configurations. FPGAs provide a unique combination of flexibility and
410: performance that make them well-suited for moderate-scale signal processing
411: applications such as correlators and spectrometers \citep{parsons_et_al2006}.
412: A primary goal of the CASPER group has been development of
413: multipurpose processing modules that can be of general use to the astronomy
414: signal processing community, and beyond. We seek to
415: minimize the effort of redesigning and upgrading hardware by modularizing
416: processing hardware, by minimizing the number of different modules
417: in a system, and by employing industry-standard interconnection protocols.
418:
419: Hardware modularity is the idea that boards should have consistent interfaces
420: in order to be connectible with an arbitrary number of heterogeneous components
421: to meet the computing needs of an application (``computing by the yard''), and
422: that upgrading/revising a component does not change the way in which components
423: are combined in the system.
424: Minimization of hardware reproduction costs is often used to motivate the
425: design of specialized hardware for large-scale correlators. However,
426: the longer development times inherent to such solutions, and
427: the necessity of targeting specific components from the outset,
428: suggest that a modular solution, initiated nearer to the deployment date,
429: will employ newer technology that costs less and uses less
430: power per operation. The predicted economy of mass-producing
431: specially-designed hardware must be tempered by its expected devaluation
432: by Moore's Law over the course of correlator development. This devaluation
433: makes the argument that hardware modularity can reduce the overall system
434: cost, even for large-scale systems, by reducing development time.
435:
436: In current correlator systems, we rely on two
CASPER FPGA-based processing boards: Internet Break-Out Boards (IBOBs) are
438: generally used for implementing per-antenna F engine processing, and
439: second-generation Berkeley Emulation Engines (BEE2s) implement X engine
440: processing. Work is progressing on a new board, the Reconfigurable Open
441: Architecture for Computing Hardware (ROACH), that will provide a single-board
442: solution to both F and X processing.
443:
444: \placefigure{fig:ibobadcbee2}
445:
446: IBOBs (Fig. \ref{fig:ibobadcbee2}) can interface to two
447: Analog-to-Digital Converter (ADC) boards, each capable of digitizing two
448: streams at 1 Gsamples/sec or a single stream at 2 Gsamples/sec using an Atmel
449: AT84AD001B dual 8-bit ADC chip. This data is processed by a Xilinx XC2VP50
450: FPGA containing 232 18$\times$18-bit multipliers, two PowerPC CPU cores, and
451: over 53,000 logic cells. Two ZBT SRAM chips provide 36 Mbits of extra
452: buffering, and two 10GbE-compatible CX4 connectors provide a standard interface
453: for connecting to other boards, switches, and computers. A detailed discussion
454: of ADC signal fidelity is presented in Section \ref{sec:characterization}.
We are developing a second ADC board that samples four signals at
200 Msamples/sec.
457:
The BEE2 board \citep{chang_et_al2005} (Fig. \ref{fig:ibobadcbee2}) was
459: originally designed for high-end reconfigurable computing applications such as
460: ASIC design, but has been conscripted for astronomy applications in a
collaboration between the BWRC\footnote{Berkeley Wireless Research Center,
http://bwrc.eecs.berkeley.edu},
463: the UC Berkeley Radio Astronomy Laboratory, and the UC Berkeley SETI group.
464: The 500
465: Gops/sec of computational power in the BEE2
466: is provided by 5 Xilinx XC2VP70 Virtex-II Pro
467: FPGAs, each containing 328 multipliers, two PowerPC CPU cores capable of
468: running Linux, and over 74,000 configurable logic cells. Each FPGA connects to
4 GB of DDR2-SDRAM and four 10GbE-compatible CX4 connectors, and all FPGAs
470: share a 100-Mbps Ethernet port. The size and connectivity of the
471: BEE2 board make it suitable for implementing X engine processing in our
472: correlator architecture.
473:
474: The ROACH board is being developed in collaboration with MeerKAT and
475: NRAO,\footnote{The National Radio Astronomy
476: Observatory (NRAO) is owned and operated by Associated Universities, Inc. with
477: funding from the National Science Foundation}
478: and is scheduled for release in the third quarter of 2008. It is intended as a
479: replacement for both IBOB and BEE2 boards. A single Xilinx Virtex-5 XC5VSX95T
480: FPGA containing 94,000 logic cells and 640 multiplier/accumulators provides 400
481: Gops/sec of processing power and is connected to a separate PowerPC 440EPx
482: processor with a 1 GbE network connection. The board contains 4 GB of DDR2
483: DRAM and two 36Mbit QDR SRAMs, four 10GbE-compatible CX4 connectors, and two
484: interfaces that allow the use of the current ADC boards, or a new 3
485: Gsamples/sec (6 Gsamples/sec dual-board interleaved) ADC. The scale, economy,
486: and peripheral interfaces of this board will make it appropriate for both F and
487: X engine processing, and will enable a single-board correlator architecture.
488:
489: \placetable{tab:hardware_price}
490:
491: % --------------------------------------------------------------------------
492: % Section 4
493: % --------------------------------------------------------------------------
494: \section{Gateware}
495: \label{sec:gateware}
496:
497: Efficient, customizable signal processing libraries are another important
498: component of a flexible and scalable correlator architecture. Towards this
499: goal, our group has designed a set of open-source libraries\footnote{Available
500: at http://casper.berkeley.edu} for the Simulink/Xilinx System Generator FPGA
501: programming language. These libraries abstract chip-specific components to
502: provide high-level interfaces targeting a wide variety of devices. Signal
503: processing blocks in these libraries are parametrized to scale up and down to
504: arbitrary sizes, and to have selectable bit widths, latencies, and scaling.
Though the design principles of parametrization and scalability have added
complexity to the initial design of these libraries, they dramatically enhance
the libraries' applicability and potential for longevity as hardware evolves. They also
decrease testing time by allowing developers to debug scale models of systems
509: that derive from the same parametrization code and are behaviorally similar to
510: larger systems. In this section, we present several components of our
511: libraries vital to the design of flexible correlators.
512:
513: \subsection{A Digital Down-Converter}
514: \label{sec:downconverter}
515:
516: The rising speed of ADCs has enabled digitization to occur increasingly early
in the antenna receiver chain. We are thus replacing the analog electronics
commonly known as the intermediate-frequency processor (gain, band definition)
and the baseband mixer (conversion to zero frequency and filtering).
There are numerous advantages to doing this.
521: Digital mixing allows dynamically selecting an operating frequency within the
522: digitized band while ensuring perfect sine-cosine phasing in the local
523: oscillator (LO) mixing frequency.
524: Digitizing a wider bandwidth than will be ultimately processed makes analog
525: filtering less critical; inexpensive filters with slow roll-offs can be
526: used, and passband rippling can be corrected. Finally, digital filtering
527: allows flexibility and control in selecting passband shapes and adjusting fine
528: delays. One can even split out several bands from the same signal.
529: The issue of quantization levels and other digital artifacts needs to be
530: carefully addressed.
531:
532: Our library provides a digital down-conversion core with a runtime-selectable
533: mixing frequency. Using a discretely sampled sine wave in an addressable
534: lookup table, we can approximate nearly any mixing frequency by rounding a wide
535: accumulation register (incremented every clock) to the nearest address in the
536: lookup table. Digital sine waves have an accuracy dictated by the number of
537: bits used to represent a value; a lookup table need only have enough samples to
538: achieve comparable accuracy. The fact that the derivative of $\sin(x)$ reaches
539: a maximum magnitude of 1 allows the sampling interval of a sine wave to be
540: simply equated to the accuracy of a coefficient over that time interval.
541: As a result, a lookup table only need be addressed with the same
542: bit-width as the sample width to implement an arbitrary mixing frequency.
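
As a rough illustration of this argument (a sketch under assumed conventions,
not a description of the library's internals), let the lookup table hold $2^L$
samples of one sine period and let coefficients be $b$-bit signed values
spanning $[-1,1)$; the symbols $L$, $b$, $A$, and $k$ are introduced here for
illustration only. Rounding the phase to the nearest table address gives a
phase error $|\delta|\le\pi/2^{L}$, and because $|d\sin(x)/dx|\le 1$,
\begin{displaymath}
|\sin(x+\delta)-\sin(x)| \le |\delta| \le \frac{\pi}{2^{L}},
\end{displaymath}
while rounding a coefficient to $b$ bits contributes an error of up to
$2^{-b}$. Choosing $L=b$ therefore keeps the two error terms within a factor
of $\pi$ of one another. If the phase accumulator is $A$ bits wide and is
incremented by an integer $k$ on each cycle of a clock of frequency $f_{clk}$,
the mixing frequency is
\begin{displaymath}
f_{LO} = \frac{k}{2^{A}}\, f_{clk},
\end{displaymath}
so that, for example, a hypothetical $A=32$ with $f_{clk}=250$ MHz gives a
tuning step of $250\times10^{6}/2^{32}\approx 0.058$ Hz.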
543:
544: \placefigure{fig:ddc_passband}
545:
546: Our library also contains a decimating FIR filter. Digital filters have
547: advantages over analog filters by being reprogrammable and by providing exact,
548: calculable passbands. This filter is often used for suppressing harmonics of
549: the mixing frequency and for steepening the rolloff of cheaper analog filters,
550: but it has also been relied upon for implementing IF sub-band selection
551: digitally. In practice, one must weigh the need for performance and
552: flexibility against the cost of FPGA resources compared to analog filters. As
553: an example, the response of the FIR filter used in various correlator designs
554: is shown in Figure \ref{fig:ddc_passband}. Since the exact shape
555: of this filter can be calculated, it is possible to remove passband
ripple post-channelization because of the large dynamic range available in
the output of our FFT core.
558:
559: \subsection{A Polyphase Filter Bank Front-End}
560: \label{sec:pfb}
561:
562: The Polyphase Filter Bank (PFB) \citep{crochiere+rabiner1983, vaidyanathan1990}
563: is an efficient implementation of a bank of evenly spaced, decimating FIR
564: filters. The PFB algorithm decomposes these filters into a single polyphase
565: convolution followed by a DFT. Since DFTs have been highly optimized
566: algorithmically, this results in an extremely efficient implementation.
567: Equivalently, the PFB may be regarded as an improvement on the Fast Fourier
568: Transform (FFT) that uses a front-end polyphase FIR filter to improve the
569: frequency response of each spectral channel (Fig. \ref{fig:pfb_bin_resp}).
570: This improvement comes at the cost of buffering an additional window of samples
571: and adding a complex cross-multiplication for each additional tap in the
572: polyphase FIR. This PFB implementation has seen widespread use in the astronomy
573: community in 21 cm hydrogen surveys \citep{heiles_et_al2004}, pulsar surveys
574: \citep{demorest_et_al2004}, antenna arrays \citep{bradley_et_al2005}, Very Long
575: Baseline Interferometry, and other applications.
576:
577: \placefigure{fig:pfb_bin_resp}
578:
579: Our core is parametrized to use selectable windowing functions, allowing
adjustment of the out-of-band rejection and passband ripple/rolloff.
\citet{blackman_tukey1958} provide a summary of the characteristics
and trade-offs of various windows. Each polyphase FIR tap, at the cost of
583: increased buffering and additional multipliers, increases filter steepness by
584: adding samples (in increments of the number of channels) to the time window
585: used in the PFB. For fixed-point implementations, a practical upper limit to
586: the number of PFB taps is set by the number of bits used to represent filter
587: coefficients; the sinc function's 1/x tapering ceases to be representable when
588: $\pi T > \pi + 2^{B+1}$ where $T$ is the number of taps, and $B$ is the
589: coefficient bit width. Finally, the width of a PFB channel is tunable by
590: adjusting the period of the sinc function, forcing adjacent bandpass filters to
overlap at a point other than the $-3$ dB point. Note that this causes
592: power to no longer be conserved in the Fourier transform operation.
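
Rearranging the tap-count condition above gives $T > 1 + 2^{B+1}/\pi$. As an
illustrative evaluation (the value of $B$ is chosen for the example and is not
a statement about any deployed design), $B=8$ gives
\begin{displaymath}
T > 1 + \frac{2^{9}}{\pi} \approx 1 + 163.0 = 164.0,
\end{displaymath}
so that only beyond roughly 164 taps would the tail of the sinc function fall
below what 8-bit coefficients can represent. This is far more than the 8 taps
of the PFB described in Section \ref{sec:pocket_corr}.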
593:
594: \subsection{A Bandwidth-Agile Fast Fourier Transform}
595: \label{sec:fft}
596:
597: The computational core of our FFT library is an implementation of a radix-2
598: biplex pipelined FFT \citep{rabiner_gold1975} capable of analyzing two
599: independent, complex data streams using a fraction of the FPGA resources of
600: commercial designs \citep{dick2000}. This architecture takes advantage of the
601: streaming nature of ADC samples by multiplexing the butterfly computations of
602: each FFT stage into a single physical butterfly core. When used to analyze two
603: independent streams, every butterfly in this biplex core outputs valid data
604: every clock for 100\% utilization efficiency.
605:
606: The need to analyze bandwidths higher than the native clock rate of an FPGA led
607: us to create a second core that combines multiple biplex cores with additional
608: butterfly cores to create an FFT that is parametrized to handle $2^P$ samples
609: in parallel \citep{parsons2008}. This FFT architecture uses only 25\% more
610: buffering than the theoretical minimum, and still achieves 100\% butterfly
611: utilization efficiency. This feat is achieved by decomposing a $2^N$
612: channel FFT into $2^P$ parallel biplex FFTs of length $2^{N-P}$, followed by a
613: $2^P$ channel parallel FFT core using time-multiplexed twiddle-factor
614: coefficients.
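
One standard way to see why such a decomposition is possible (a textbook
decimation-in-time identity, offered as a sketch rather than as the exact
dataflow of our core) is to write the input sample index as $n = 2^P m + r$,
where $r$ labels the $2^P$ parallel input streams. With
$\omega_L \equiv e^{-2\pi i/L}$, and with $x[n]$ and $X[k]$ denoting input
samples and output channels (notation introduced here),
\begin{displaymath}
X[k] = \sum_{r=0}^{2^P-1} \omega_{2^N}^{\,rk}
\left[\,\sum_{m=0}^{2^{N-P}-1} x[2^P m + r]\;\omega_{2^{N-P}}^{\,mk}\right].
\end{displaymath}
The bracketed term is a length-$2^{N-P}$ transform of stream $r$ alone, and
the outer sum combines the $2^P$ streams using twiddle factors that change
with $k$. For example, $N=11$ and $P=2$ correspond to 4 streams, each
transformed by a 512-point FFT, followed by a 4-point combining stage.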
615:
616: Finally, we have written modules for performing two real FFTs with each half of
617: a biplex FFT using Hermitian conjugation. Mirroring and
618: conjugating the output spectra to reconstitute the negative frequencies, this
619: module effects a 4-in-1 real biplex FFT that can then be substituted for the
620: equivalent number of biplex cores in a high-bandwidth FFT. Thus, our real FFT
621: module has the same bandwidth flexibility as our standard complex FFT.
622:
623: Dynamic range inside fixed-point FFTs requires careful consideration. Tones
624: are folded into half as many samples through each FFT stage, causing magnitudes
625: to grow by a factor of 2 for narrow-band signals, and $\sqrt{2}$ for random
626: noise. To
627: avoid overflow and spectrum corruption, our cores contain optional downshifts
628: at each stage. In an interference-heavy environment, one must balance loss of
629: SNR from downshifting signal levels against loss of integration time due to
630: overflows. A good practice is to place time-domain input into the
631: most-significant bits of the FFT and downshift as often as possible to
632: avoid overflow and minimize rounding error in each butterfly stage. However,
633: it is also best to avoid using the top 2 bits on input since the first
634: 2 butterfly
635: stages can be implemented using negation instead of complex multiplies, but the
636: asymmetric range of 2's complement arithmetic can allow this negation to
637: overflow.
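
To make these growth rates concrete (an illustrative calculation, with the
symbol $S$ for the number of butterfly stages introduced here), an FFT with
$S$ stages can grow the amplitude of a narrow-band tone by $2^{S}$ and that of
random noise by $2^{S/2}$, corresponding to
\begin{displaymath}
\Delta b_{\rm tone} = \log_2 2^{S} = S \mbox{ bits}, \qquad
\Delta b_{\rm noise} = \log_2 2^{S/2} = S/2 \mbox{ bits}.
\end{displaymath}
For $S=11$, a tone can grow by 11 bits while noise grows by only 5.5 bits.
Downshifting at every stage holds a tone at its input level, but leaves noise
attenuated by a net factor of $2^{-S/2}$ relative to full scale, which is the
SNR cost referred to above.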
638:
639: \subsection{A Cross-Multiplication/Accumulation (X) Engine}
640: \label{sec:x_engine_arch}
641:
642: \placefigure{fig:x_engine_schem}
643:
644: Our FX correlator architecture employs
645: X engines to compute all antenna cross-multiples within a frequency
646: channel, and multiple frequencies are multiplexed into the core as dictated by
647: processor bandwidth; the complex visibility $V_{ij}$ (Eq. \ref{eq:vis})
648: is the average of the product of complex voltage samples from antenna $i$ and
antenna $j$ with the convention that the voltage from antenna $j$ (with $j>i$)
is conjugated prior to forming the product.
In collaboration with Lynn Urry of UC Berkeley's Radio
Astronomy Lab, we have implemented a parametrized module (Fig.
653: \ref{fig:x_engine_schem}) for computing and accumulating all visibilities for a
654: specified number of antennas. An X engine operates by receiving $N_{ant}$ data
655: blocks in series, each containing $T_{acc}$ data samples from one frequency
656: channel of one antenna. The first samples of all blocks are
657: cross-multiplied, and the $N_{ant}(N_{ant}+1)/2$ results are added to the
658: results from the second samples, and so on, until all $T_{acc}$ samples have
659: been exhausted. Accumulation prevents the data rate out of a
660: cross-multiplier from exceeding the input data rate. An X engine is divided
661: into stages, each responsible for pairing two different data blocks
662: together: the zeroth stage pairs adjacent blocks, the first stage pairs blocks
663: separated by one, and so on. As the final accumulated results become available,
664: they are loaded onto a shift register and output from the X engine.
665:
666: However, as a new window of $N_{ant}\times T_{acc}$ samples arrives, some
667: stages, behaving as described above, would compute invalid results using
data from two different windows. To avoid this, each stage switches between
cross-multiplying separations of $S$ and separations of $N_{ant}-S$, which
happen to be valid precisely when separations of $S$ would be invalid. As a
result, there need be only ${\rm floor}(N_{ant}/2+1)$ stages in an X engine. Every
$T_{acc}$ samples, each stage outputs a valid result, yielding $N_{ant}\times
{\rm floor}(N_{ant}/2+1)$ total accumulations; for even values of $N_{ant}$,
674: $N_{ant}/2$ of the results from the last stage are redundant.
675: All other multiplier/accumulators are 100\% utilized. Each stage
676: also computes all polarization cross-multiples (Eq. \ref{eq:pol})
677: using parallel multipliers.
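
As a check on this bookkeeping (a worked example with illustrative antenna
counts), take $N_{ant}=8$:
\begin{displaymath}
{\rm floor}(8/2+1)=5 \mbox{ stages}, \qquad
8\times 5 = 40 \mbox{ accumulations}, \qquad
40 - 8/2 = 36 = \frac{8\,(8+1)}{2}.
\end{displaymath}
For an odd count such as $N_{ant}=7$, there are ${\rm floor}(7/2+1)=4$ stages
and $7\times 4 = 28 = 7\,(7+1)/2$ accumulations, with no redundant results.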
678:
679: When one X engine no longer fits on a single FPGA, it may be divided across
680: chips at any stage boundary at the cost of a moderate amount of bidirectional
681: interconnect. The output shift register need not be carried between chips;
682: each FPGA can accumulate and store the results computed locally. In order for
the output shift register's ${\rm floor}(N_{ant}/2+1)$ stages to clear before the
next accumulation is ready, an X engine requires a minimum integration length
of $T_{acc}>{\rm floor}(N_{ant}/2+1)$. In current hardware, a practical upper
686: limit on $T_{acc}$ is set by the 2$\times$4 Mbit of SRAM storage available on
687: the IBOB. For 2048 channels with 4-bit samples, and double buffering for 2
688: antennas, 2 polarizations, this limit is $T_{acc}\le 128$. Longer integration
689: requires an accumulator capable of buffering an entire vector of visibility
690: data, and typically occurs in off-chip DRAM. The maximum theoretical
accumulation length in a correlator is determined by the fringe rate of sources
692: moving across the sky, and is a function of observing frequency, maximum
693: antenna separation, and (for correlators with internal fringe rotation)
694: field-of-view across the primary beam.
695:
696: Cross-multiplication comes to dominate the total correlator processing budget
697: for large numbers of antennas. As a result, care must be taken both to reduce
698: the footprint of a complex multiplier/accumulator and to make full and
699: efficient use of the resources on an FPGA processor. The number of bits used
700: to carry a signal should be minimized while retaining sufficient dynamic range
701: to distinguish signal from noise. We have chosen to focus on 4-bit multipliers
702: in current applications, and the subjects of dynamic equalization and Van Vleck
703: correction generalized to 4 bits are explored in Section
704: \ref{sec:characterization} for optimizing signal-to-noise ratios (SNR) in our
705: correlators. To make full use of FPGA resources, we construct
706: 4-bit complex multipliers using distributed logic, dedicated multiplier cores,
707: and look-up tables implemented in Block RAMs.
708:
709: It is possible to perform the bulk of an $N$-bit complex multiply in an $M$-bit
710: multiplier core by sign-extending numbers to $2N$ bits and combining them into
711: two $M$-bit, unsigned numbers. Multiplying $(a+bi)(c+di)$, these
712: representations are $(2^{M-2N}a_s+b_s)$ and $(2^{M-2N}c_s+d_s)$, where
713: $n_s=2^{2N}+n$. The bits corresponding to $ac, ad+bc, bd$ may be selected from
714: the product, provided that the
715: sign-extension to $2N$ bits shifts $a+d$ beyond the bits occupied by $ad$.
716: This yields the constraint:
717: \begin{equation} 6N-1 < M \end{equation}
718: The 18-bit multipliers in current Xilinx
719: FPGAs can efficiently perform 3-bit complex
720: multiplies, but fall short of 4 bits.
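
The packing idea rests on a simple identity, shown here in simplified form
that ignores the $2^{2N}$ offsets introduced by sign extension (the shift $k$
is notation introduced for this sketch):
\begin{displaymath}
(2^{k}a+b)(2^{k}c+d) = 2^{2k}\,ac + 2^{k}\,(ad+bc) + bd .
\end{displaymath}
Provided $k$ is large enough, the three terms occupy disjoint bit fields of a
single product; the imaginary part $ad+bc$ is read off directly, and the real
part $ac-bd$ requires one further subtraction. Evaluating the constraint
above for $M=18$ gives $N < (M+1)/6 = 19/6 \approx 3.17$, so that $N=3$
satisfies it ($6\cdot3-1 = 17 < 18$) while $N=4$ does not
($6\cdot4-1 = 23 > 18$).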
721:
722: % --------------------------------------------------------------------------
723: % Section 5
724: % --------------------------------------------------------------------------
725: \section{System Integration}
726: \label{sec:integration}
727:
728: \subsection{F Engine Synchronization}
729: \label{sec:F_synch}
730:
731: \placefigure{fig:corr_vs_dly}
732:
733: Though we have touted GALS design principles for X engine processing,
734: digitization and spectral processing within F engines must be synchronized to a
735: time interval much smaller than a spectral window to avoid severe degradation
736: of correlation response (Fig. \ref{fig:corr_vs_dly}). This attenuation effect,
737: resulting from the changing degree of overlap of correlated signals within a
738: spectral window, can be caused by systematic signal delay between antennas, as
739: well as by source-dependent geometric delay; FX correlators with insufficient
740: channel resolution experience a narrowing of the field of view related to
741: channel bandwidth. This effect has been well explored for FX correlators
742: employing DFTs (see Chapter 8 of \citet{thompson_et_al2001}), but Polyphase
743: Filter Banks show a different response owing to a weighting function that
744: extends well beyond the number of samples used in a DFT.
745: Given a standard form for PFB sample weighting of
746: ${\rm sinc}\left(\frac{\pi t}{N\tau_s}\right)
747: W\left(\frac{t}{2TN\tau_s}\right)$,
748: where $N$ is the number of output channels,
749: $T$ is the number of PFB taps, $\tau_s$ is the delay between time-domain
750: samples, and $W$ is an arbitrary windowing function that tapers to 0 at
751: $\pm1$, the gain versus delay $G(\tau)$ of a PFB-based FX correlator is
752: given by:
753: \begin{displaymath}
754: G(\tau)=\int_{-\infty}^{\infty}{
755: \left[{\rm sinc}\left(\frac{\pi t}{N\tau_s}\right)
756: W\left(\frac{t}{2TN\tau_s}\right)\right] \times
757: \left[{\rm sinc}\left(\frac{\pi (t-\tau)}{N\tau_s}\right)
758: W\left(\frac{t-\tau}{2TN\tau_s}\right)\right]\ dt
759: }
760: \end{displaymath}
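Writing $h(t)$ (a shorthand introduced here) for the bracketed weighting
function, this expression is the autocorrelation of $h$, which makes two
general properties explicit:
\begin{displaymath}
G(\tau)=\int_{-\infty}^{\infty} h(t)\,h(t-\tau)\ dt, \qquad
G(-\tau)=G(\tau), \qquad
|G(\tau)| \le G(0) = \int_{-\infty}^{\infty} h(t)^2\ dt,
\end{displaymath}
where the inequality follows from the Cauchy-Schwarz inequality. The response
is therefore symmetric in delay and is conveniently normalized as
$G(\tau)/G(0)$.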
761:
For the purpose of F engine synchronization, we
rely on a one-pulse-per-second (1PPS) signal with a fast edge rate provided
synchronously to a bank of F processors running off identical system clocks.
This signal is sampled by the system clock on each processor, and provided
alongside ADC data. A slower, asynchronous ``arm'' signal is sent from
a central node to each F engine at the half-second phase
768: to indicate that the next 1PPS signal should be
769: used to generate the reset event that synchronizes spectral windows and packet
770: counters. This ensures that samples from different antennas entering X engines
771: together were acquired within one or two system clocks of one another. The
772: degree of synchronization is determined by the difference in path lengths of
773: 1PPS and the system clock from their generators to each F engine. This path
774: length can be determined from celestial source observations
775: using self-calibration, and barring temperature
776: effects, will be constant for a correlator configuration following power-up.
777:
778: \subsection{Asynchronous, Packetized ``Corner Turner''}
779: \label{sec:packetization}
780:
781: The choice of the accumulation length $T_{acc}$ in X engines
782: determines the natural size of UDP packets in our
783: packet-switched correlator architecture. For current CASPER hardware where
784: channel-ordering occurs in IBOB SRAM, $T_{acc}$ is constrained by the available
785: memory to an upper limit of 128 samples for 2048-channel dual-polarization,
786: 4-bit,
787: complex data, yielding a packet payload of 256 bytes. A header containing
788: 2 bytes of antenna index and 6 bytes of frequency/time index is added to each
789: packet to enable packet unscrambling on the receive side. The frequency/time
790: index (hereafter referred to as the master counter, or MCNT) is a counter that
791: is incremented every packet transmission. The lower bits count frequencies
792: within a spectrum, and the rest count time. Combined with the antenna
793: index, MCNT completely determines the time, frequency, source, and destination
794: of each packet; MCNT maps uniquely to a destination IP address.
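
The packet sizes quoted above follow from simple bookkeeping, sketched here
under the assumption that each packet carries $T_{acc}$ samples of one
frequency channel for both polarizations of one antenna, with 4-bit real and
4-bit imaginary parts:
\begin{displaymath}
128 \mbox{ samples} \times 2 \mbox{ polarizations} \times 8 \mbox{ bits}
= 2048 \mbox{ bits} = 256 \mbox{ bytes}.
\end{displaymath}
Adding the $2+6=8$ byte header gives 264 bytes per packet before network
protocol headers, for an application-level overhead of $8/264\approx3\%$.
If, as a further assumption, one packet is sent per frequency channel, then
the lowest $\log_2 2048 = 11$ bits of the 48-bit MCNT index the channel and
the remaining 37 bits count time.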
795:
796: \placefigure{fig:packet_rx}
797:
798: Packet reception (Fig. \ref{fig:packet_rx}) is complicated by the realities of
799: packet scrambling, loss, and interference. A circular buffer holding $N_{win}$
800: windows worth of X engine data stores packet data as they arrive. The lower
bits of MCNT act as an address for placing payloads into the correct
window, and the antenna index addresses the position within that window. When
data arrive $N_{win}/2$ windows ahead of a buffered window, that window is
804: flagged for readout, and is processed contiguously on the next window boundary
805: of the free-running X engine. Using packet arrival to determine when a window
806: is processed allows a data-rate dependent time interval for all packets to
807: arrive, but pushes data through the buffer in the event of packet loss. On
808: readout, the buffer is zeroed to ensure that packet loss results in loss of
809: signal, rather than the introduction of noise. F engines can be intentionally
810: disconnected from transmission without compromising the correlation of
811: those remaining.
812:
813: Packet interference occurs when a well-formed packet contains an invalid MCNT
814: as a result of switch latency, unsynchronized F engines, or system
815: misconfiguration. Such packets must be prevented from entering the receive
816: buffer, since they can lead to data corruption; one would prefer that a
817: misconfigured F engine antenna result in data loss for that antenna, rather
818: than data loss for the entire system. To ensure this behavior, incoming
819: packets face a sliding filter based on currently active MCNTs. Packets are
820: only accepted if their MCNT falls within the range of what can currently be
821: held in the circular buffer. As higher MCNTs are received and accepted, old
822: windows are flagged for read out, freeing up buffer space for still
823: higher MCNTs. This system forces MCNTs to advance by small increments and
824: prevents the large discontinuities indicative of packet
825: interference. In the eventuality that a receive buffer accidentally locks onto
826: an invalid MCNT from the outset, a time-out clause causes the currently active
MCNT to be abandoned for a new one if no new data are accepted into the receive
828: buffer.
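
The acceptance rule can be summarized compactly. In the following sketch, $w$
denotes the window index carried by a packet's MCNT and $w_0$ the oldest
window currently held in the circular buffer; both symbols are introduced here
for illustration, and the exact comparison used in the gateware may differ:
\begin{displaymath}
\mbox{accept if } w_0 \le w < w_0 + N_{win}, \qquad
\mbox{buffer address} = w \bmod N_{win}.
\end{displaymath}
An accepted packet with $w \ge w_0 + N_{win}/2$ flags window $w - N_{win}/2$
for readout, which in turn advances $w_0$. For a hypothetical $N_{win}=8$ and
$w_0=100$, packets with $100\le w\le107$ are accepted, a packet with $w=104$
flags window 100 for readout, and a stray packet with $w=5000$ is rejected.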
829:
830: A final complication comes when implementing a bidirectional 10GbE transmission
831: architecture such as the one outlined in Figure \ref{fig:ex_app1}.
832: Commercial switches do not support
833: self-addressed packet transmission; they assume that the transmitter
834: (usually a CPU) intercepts these packets and transfers them to the receive
buffer. On FPGAs, this requires an extra buffer for holding ``loopback'' packets, and
836: a multiplexer for inserting these packets into the processing stream. A simple
837: method for this insertion would be to always insert loopback packets, if
838: available, and otherwise to insert packets from the 10GbE
839: interface. However, there is a maximum interval over which packets with
840: identical MCNTs can be scrambled before the receive system rejects
841: packets for being outside of its buffer. This simple method has the
842: undesirable effect of including switch latency in the time interval over which
843: packets are scrambled, causing unnecessary packet loss. Our solution is to
844: pull loopback packets only after packets with the same MCNT
845: arrive through the switch.
846:
847: \subsection{Monitor, Control, and Data Acquisition}
848: \label{sec:data_aq}
849:
850: The toolflow we have developed for CASPER hardware provides convenient
851: abstractions for interfacing to hardware components such as ADCs, DRAM, and 10
852: GbE transceivers, and allows specified registers and BRAMs to be automatically
connected to CPU-accessible buses. On top of this framework, we run BORPH, an
854: extension of the Linux operating system that provides kernel support for FPGA
855: resources \citep{so_broderson2006,so2007}. This system allows FPGA
856: configurations to be run in the same fashion as software processes, and creates
857: a virtual file system representing the memories and registers defined on the
858: FPGA. Every design compiled with this toolflow comes equipped with this
859: real-time interface for low- to moderate-bandwidth data I/O. By emulating
860: standard file-I/O interfaces, BORPH allows programmers to use standard
861: languages for writing control software. The majority of the monitor, control,
862: and data acquisition routines in our correlators are written in C
and Python. For 8--16 antenna correlators, the bandwidth through BORPH on a
BEE2 board is sufficient to support the output of visibility data with 5--10 s
integrations.
866:
867: For correlators with more antennas or shorter integration times, the bandwidth
868: of the CPU/FPGA interface is incapable of maintaining the full correlator
869: output. This limitation is being overcome by transmitting the final correlator
870: output using a small amount of the extra bandwidth on the 10GbE ports already
871: attached to each X engine. After accumulation in DRAM, correlator output is
872: multiplexed onto the 10GbE interface and transmitted to one or more Data
873: Acquisition (DA) systems attached to the central 10GbE switch. These systems
874: collect and store the final correlator output. With a capable DA system, the
875: added bandwidth through this output pathway can be used to attain millisecond
876: integration times, opening up opportunities for exploring transient events and
877: increasing time resolution for removing interference-dominated data.
878:
879: The capabilities of correlators made possible by our research are placing
880: new challenges on DA systems \citep{wright2005}. There is a severe (factor of
881: 100) mismatch between the data rates in the on-line correlator hardware and
882: those supported by the off-line processing. Members of our team are currently
883: pursuing research on how this can be resolved both for correlators and for
884: generic signal processing systems using commercially available compute
885: clusters. For correlators, our group is currently exploring how to implement
886: calibration and imaging in real-time to reduce the burden of
887: expert data reduction on the end user, and to make best use of both telescope
888: and human resources.
889:
890:
891: % --------------------------------------------------------------------------
892: % Section 6
893: % --------------------------------------------------------------------------
894: \section{Characterization}
895: \label{sec:characterization}
896:
897: \subsection{ADC Crosstalk}
898: \label{sec:crosstalk}
899:
900: \placefigure{fig:crosstalk}
901:
902: Crosstalk is an undesirable but prevalent characteristic of analog systems
903: wherein a signal is coupled at a low level into other pathways. This can pose
904: a major threat to sensitivity in systems that integrate noise-dominated data to
905: reveal low-level correlation. For CASPER hardware, we have examined crosstalk
906: levels between signal inputs sharing an ADC chip, and between different ADC
907: boards on the same IBOB. Figure \ref{fig:crosstalk} illustrates a one-hour
908: integration of uncorrelated noise of various bandwidths input to the ``Pocket
909: Correlator'' system (see Section \ref{sec:deployments}). Between inputs
910: of the same ADC board, a coupling coefficient of $\sim0.0016$ indicates
911: crosstalk at approximately $-28$ dB. This coupling is a factor of $5$ higher
912: than the $-35$ dB isolation advertised by the Atmel ADC chip, and is most
913: likely the result of board geometry and shared power supplies. Crosstalk
914: between inputs on different ADCs also peaks at the $-28$ dB level, but shows
915: more frequency-dependent structure.
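
These figures are related by the usual decibel conversion, interpreting the
coupling coefficient as a power ratio (which is what the quoted level
implies):
\begin{displaymath}
10\log_{10}(0.0016) \approx -28.0 \mbox{ dB}, \qquad
10^{(35-28)/10} = 10^{0.7} \approx 5.0 .
\end{displaymath}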
916:
917: \placefigure{fig:crosstalk_stability}
918:
919: Crosstalk may be characterized and removed, provided that its timescale for
920: variation is much longer than the calibration interval. Figure
921: \ref{fig:crosstalk_stability} demonstrates that for integration intervals
922: ranging from 7.15 seconds to approximately 1 day (the limit of our testing),
923: crosstalk amplitudes and phases vary around stable values in a
924: lab test that, when
925: subtracted, yield noise that integrates down with time. Even
926: though crosstalk is encountered at the $-28$ dB level, its stability allows
927: suppression to at least $-62$ dB. This stability has allowed crosstalk
928: to be removed post-correlation, and we have until recently deferred
929: adding phase switching. Developments along this line are proceeding by
930: introducing an invertible mixer (controlled via a Walsh counter on an IBOB)
931: early in the analog signal path, and removing this inversion after
932: digitization. Phase switching must be coupled with data blanking near
933: boundaries when the
934: inversion state is uncertain. Blanking will be most easily implemented by
935: intentionally dropping packets of data from F engine transmission, and by
936: providing a count of results accumulated in each integration for normalization
937: purposes.
938:
939: \subsection{XAUI Fidelity and Switch Throughput}
940: \label{sec:10gbe_sw}
941:
942: CASPER boards are currently configured to transmit XAUI protocol over CX4 ports
943: as a point-to-point communication protocol and as the physical layer of 10GbE
944: transmission. Because the Virtex-II FPGAs used in current CASPER hardware do
not fully support XAUI transmission standards \citep{xilinx_ug024,xilinx_ds083},
946: current devices can have
947: sub-optimal performance for certain cable lengths. We expect the new ROACH
948: board, which employs Virtex-5 FPGAs, to have better
949: performance in this regard. For cable lengths supported in current hardware,
950: we tested XAUI transmission fidelity using matched Linear Feedback Shift
951: Registers (LFSRs) on transmit and receive. Error detection was verified using
952: programmable bit-flips following transmitting LFSRs. Over a period of 16
953: hours, 573 Tb of data were transmitted and received on each of 8 XAUI
links. During this time, no errors were detected, implying a
bit-error rate below $2.2\cdot 10^{-16}$. We also tested the capability of two
956: Fujitsu switches (the XG700 and the XG2000) for performing the full
957: cross-connect packet switching required in our FX correlator architecture. By
958: tuning the sample rate inside F engines of an 8-antenna (4-IBOB) packetized
959: correlator, we controlled the transmission rate per switch port over a range of
960: 5.96 to 8.94 Gb/s. In 10-minute tests, packet loss was zero for both
961: switches in all but the 8.94 Gb/s case. Packet loss in this final case was
962: traced to intermittent XAUI failure as a result of imperfect compliance with
963: the XAUI standard, as described previously. Overheating of FPGA chips in the
964: field has also been reported as a source of intermittent operation.
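
The bit-error rate figure quoted above for the LFSR test can be reproduced as
follows (a back-of-the-envelope estimate that treats bit errors as
independent; the symbols $N_{bits}$ and $p$ are introduced here):
\begin{displaymath}
N_{bits} = 8 \times 573\times10^{12} \approx 4.58\times10^{15}, \qquad
\frac{1}{N_{bits}} \approx 2.2\times10^{-16}.
\end{displaymath}
Because no errors were observed, this is best read as an upper limit. A more
conservative 95\% confidence bound, obtained by requiring
$(1-p)^{N_{bits}}=0.05$, is $p \approx 3/N_{bits} \approx 6.5\times10^{-16}$.
The same numbers imply a sustained rate of
$573\times10^{12}/(16\times3600\mbox{ s}) \approx 9.95$ Gb/s on each link.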
965:
966: \subsection{Equalization and 4-Bit Requantization}
967: \label{sec:equalization}
968:
969: \placefigure{fig:4_bit_quant}
970:
971: Correlator processing resources can be reduced by limiting the bit width of
972: frequency-domain antenna data before cross-multiplication. However, digital
973: quantization requires careful setting of signal levels for optimum
974: SNR and subsequent calibration to a linear power scale
975: \citep{thompson_et_al2001,jenet_anderson1998}. Correlators using 4 bits
represent
an improvement over their 1- and 2-bit predecessors, but there are still
978: quantization issues to consider. The total power of a 4-bit quantizer has a
979: non-linear response with respect to input level as shown in Figure
980: \ref{fig:4_bit_quant}. In currently deployed correlators, we perform
equalization (per-channel scaling) to control the RMS channel values before
982: requantizing from 18 bits to 4 bits. This operation saturates RFI and flattens
983: the passband to reduce dynamic range and to hold the passband in
984: the linear regime of the 4-bit quantization power curve. Equalization is
985: implemented as a scalar multiplication on the output of each PFB using 18-bit
986: coefficients from a dynamically updateable memory. These coefficients allow
987: for automatic gain control to maintain quantization fidelity through changing
988: system temperatures.
989:
990: % --------------------------------------------------------------------------
991: % Section 7
992: % --------------------------------------------------------------------------
993: \section{Deployments and Results}
994: \label{sec:deployments}
995:
996: \subsection{A Pocket Correlator}
997: \label{sec:pocket_corr}
998:
999: \placefigure{fig:f_engine}
1000:
1001: The ``Pocket Correlator'' (Fig. \ref{fig:f_engine}) is a single IBOB system
1002: that includes F and X engines on a single board for correlating and
1003: accumulating 4 input signals. Each input is sampled at 4 times the FPGA clock
1004: rate (which runs up to 250 MHz), and a down-converter extracts half of the
1005: digitized band. This subband is decomposed into 2048 channels by an 8-tap PFB,
1006: equalized, and requantized to 4 bits. With all input signals on one chip, X
1007: processing can be implemented directly as multipliers and vector accumulators,
1008: rather than as X engines. Limited buffer space on the IBOB permits only 1024
1009: channels (selectable from within the 2048) to be accumulated. Output occurs
1010: either via serial connection (with a minimum integration time of 5
1011: seconds) or via 100-Mbit UDP transmission (with a minimum integration time in
1012: the millisecond range). This system can act as a 2-antenna, full Stokes
1013: correlator, or as a 4-antenna single polarization correlator.
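
For orientation, the corresponding bandwidths at the maximum clock rate can be
worked out as follows, assuming real-valued sampling so that the digitized
band is half the sample rate (the symbols $f_s$, $B_{dig}$, $B_{sub}$, and
$\Delta\nu$ are introduced here, and the derived numbers are illustrative
rather than measured specifications):
\begin{displaymath}
f_s = 4\times250 \mbox{ MHz} = 1 \mbox{ GHz}, \qquad
B_{dig} = f_s/2 = 500 \mbox{ MHz}, \qquad
B_{sub} = B_{dig}/2 = 250 \mbox{ MHz},
\end{displaymath}
\begin{displaymath}
\Delta\nu = \frac{250 \mbox{ MHz}}{2048} \approx 122 \mbox{ kHz per channel}.
\end{displaymath}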
1014:
1015: \placefigure{fig:skymap}
1016:
1017: The Pocket Correlator is valuable as a simple, stand-alone instrument, and for
1018: board verification in larger packetized systems. It is being applied as a
1019: stand-alone instrument in PAPER, the ATA, and the UNC PARI observatory. A
1020: 4-antenna, single polarization deployment of the PAPER experiment in Western
1021: Australia in 2007 used the Pocket Correlator to collect the data used to
1022: produce a 150 MHz all-sky map illustrated in Figure \ref{fig:skymap}. In
1023: addition to demonstrating the feasibility of post-correlation crosstalk
1024: removal, this map (specifically, the imperfectly removed sidelobes of sources)
1025: illustrates a problem that will require real-time imaging to solve for large
1026: numbers of antennas.
1027:
1028: \subsection{An 8-Antenna, 2-Stokes, Synchronous Correlator}
1029: \label{sec:8_ant_corr}
1030:
This first-generation multi-board correlator demonstrated the functionality
of signal processing algorithms and CASPER hardware, but predated the
current packetized architecture and operated synchronously. This version of
1034: the correlator was most heavily limited by X engine resources, all of which
1035: were implemented on a single FPGA to simplify interconnection. The
total number of complex multipliers in the X engines of an $N_{ant}$-antenna
array is $N_{cmac} = \lfloor N_{ant}/2+1 \rfloor\times N_{ant}\times N_{pol}$; the
1038: limited number of multipliers on a BEE2 FPGA only allowed for supporting half
1039: the polarization cross-multiples. This system was an
1040: important demonstration of the basic capabilities of our hardware and software,
1041: and provided a starting point for evolving a more sophisticated system.
1042: Deployments of this
1043: system at the NRAO site in Green Bank as part of the PAPER
1044: experiment, and briefly
1045: at the Hat Creek Radio Observatory for the Allen Telescope Array,
are being superseded by the packetized correlator presented in the next
1047: section.
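
As a worked example of the multiplier count above, assuming $N_{pol}$ counts
the polarization products formed per antenna pair (4 for full Stokes, an
interpretation not stated explicitly in the text), an 8-antenna array requires
\[
N_{cmac} = \lfloor 8/2 + 1 \rfloor \times 8 \times N_{pol} = 40\,N_{pol},
\]
or 160 complex multipliers for full Stokes and 80 when only half of the
polarization cross-multiples are supported. For comparison, 8 antennas form
$8(8+1)/2 = 36$ distinct pairs, including auto-correlations; the difference
appears consistent with the note in Figure \ref{fig:x_engine_schem} that the
final stage may not be fully used for even values of $N_{ant}$.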
1048:
1049: \subsection{A 16-Antenna, Full-Stokes, Packetized Correlator}
1050: \label{sec:packet_deploy}
1051:
1052: This packetized FX correlator is a realization of the architecture outlined in
1053: Figure \ref{fig:ex_app1}, with F processing for 2 antennas implemented on each
1054: IBOB, and matching X processors implemented on each corner FPGA of two BEE2s.
1055: Each F processor is identical to a Pocket Correlator (Fig. \ref{fig:f_engine}),
1056: but branches data from the equalization module to a matrix transposer in IBOB
1057: SRAM to form frequency-based packets. Packet data for each antenna are
1058: multiplexed through a point-to-point XAUI connection to a BEE2-based X
1059: processor, and then relayed in 10GbE format to the switch. The number of
1060: channels in this system is limited to 2048 by memory in IBOB SRAM for
1061: transposing the 128 spectra needed to meet bandwidth restrictions between X
1062: engines and DRAM-based vector accumulators.
1063:
1064: \placefigure{fig:x_processor}
1065:
1066: The X processor in this packetized correlator implements the transmit and
1067: receive architecture illustrated in Figure \ref{fig:x_processor}
1068: for two X engines sharing the same 10GbE link.
1069: Each X engine's data processing rate is
1070: determined by packets arriving in its own receive buffer, and results are
1071: accumulated in separate DRAM DIMMs. The accumulated output of each X processor
1072: is read out of DRAM at a low bandwidth and transmitted via 10GbE packets to
1073: a CPU-based server where
all visibility data are collected and
1075: written to disk in MIRIAD format
1076: \citep{sault_et_al1995} using interfaces from the Astronomical Interferometry
1077: in PYthon (AIPY) package\footnote{http://pypi.python.org/pypi/aipy}.
1078:
1079: The clocks for the BEE2 FPGAs are asynchronous 200-MHz oscillators, and IBOBs
1080: run synchronously at any rate lower than this. Packet transmission is
statically addressed so that each X engine processes every 16th channel.
We use 8 ports of a Fujitsu XG700 switch to route data. This system is
1083: scalable to 32 antennas before two X engines no longer fit on a single FPGA.
1084: For larger systems, the number of BEE2s will scale as the square of the number
1085: of antennas, and the number of IBOBs will scale linearly. A 32-antenna,
1086: 200-MHz correlator on 16 IBOBs and 4 BEE2s is now working in the lab, and a
1087: 16-antenna version using 8 IBOBs and 2 BEE2s has been deployed to the NRAO site
1088: in Green Bank with the PAPER experiment.
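
The static channel assignment in the 16-antenna deployment can be checked
with simple bookkeeping, assuming four corner FPGAs per BEE2 (as implied by
the matching of 8 IBOBs to 2 BEE2s) and two X engines per X processor:
\[
N_{Xproc} = 2 \times 4 = 8, \qquad
N_{Xeng} = 2\,N_{Xproc} = 16, \qquad
\frac{2048~\mathrm{channels}}{N_{Xeng}} = 128~\mathrm{channels~per~X~engine}.
\]
The 8 X processors match the 8 switch ports in use, and 16 X engines is
consistent with each X engine processing every 16th channel. The labels
$N_{Xproc}$ and $N_{Xeng}$ are introduced for this example only.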
1089:
1090: % --------------------------------------------------------------------------
1091: % Section 8
1092: % --------------------------------------------------------------------------
1093: \section{Conclusion}
1094: \label{sec:conclusion}
1095:
1096: By decreasing the time and engineering costs of building and upgrading
1097: correlators, we aim to reduce the total cost of correlators for a wide range of
1098: scales. Small- and medium-scale correlators with total cost dominated by
1099: development clearly stand to benefit from our research. It is less clear if
1100: the cost of large-scale correlators can be reduced by the general-purpose
1101: hardware used in our architecture. Though minimization of replication cost
1102: favors the development of specialized parts, there are two factors
1103: that can make a generic, modular solution cost less.
1104:
1105: The first factor to consider is time to deployment. Even if the monetary cost
1106: of development is negligible in the budget of a large correlator, the cost of
1107: development time can be significant. If a custom solution takes several years
1108: to go from design to implementation, the hardware that is deployed will be out
1109: of date. Moore's Law suggests that when a custom solution taking 3 years to
1110: develop is deployed, there will exist processors 4 times more powerful, or 4
1111: times less expensive for the equivalent system. The cost of a generic, modular
1112: system has to be tempered by the expected savings of committing to hardware
1113: closer to the ultimate deployment date.
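
As a rough check of this factor, assuming the commonly quoted form of Moore's
Law in which performance per unit cost doubles every 18 months (a doubling
period not specified above), a development time of $T = 3$ years gives
\[
2^{T/1.5~\mathrm{yr}} = 2^{2} = 4 .
\]
A longer assumed doubling period of 2 years would instead give
$2^{1.5} \approx 2.8$.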
1114:
1115: The second factor is the cost of upgrade. Many facilities (including the ATA)
1116: are beginning to appreciate the advantages of designing arrays with wider
1117: bandwidths and larger numbers of antennas than can be handled by current
1118: technology. Correlators may then be implemented inexpensively on scales
1119: suited to current processors, and upgraded as more powerful processors
1120: become available. Modular solutions facilitate this methodology.
1121:
1122: % --------------------------------------------------------------------------
1123: % --------------------------------------------------------------------------
1124: % --------------------------------------------------------------------------
1125:
1126: \acknowledgments
1127:
1128: This and other CASPER research are supported by the National Science Foundation
1129: Grant No. 0619596 for Low Cost, Rapid Development Instrumentation for Radio
1130: Telescopes. We would like to acknowledge the students, faculty and sponsors of
1131: the Berkeley Wireless Research Center, and the National Science Foundation
1132: Infrastructure Grant No. 0403427. Correlator development for the PAPER
1133: project is supported by NSF grant AST-0505354, and for the ATA project by NSF
1134: grant AST-0321309 as well as the Paul G. Allen Foundation. Chips and software
1135: were generously provided by Xilinx, Inc. JM and PM gratefully acknowledge
1136: financial support from the MeerKAT project and South Africa's National Research
1137: Foundation.
1138:
1139: \appendix
\section{Glossary of Technical Terms}
1141: \begin{itemize}
1142: \item ADC - Analog to Digital Converter
1143: \item ASIC - Application-Specific Integrated Circuit processor
1144: \item BEE2 - Berkeley Emulation Engine, rev. 2
1145: \item BORPH - Berkeley Operating system for Re-Programmable Hardware
1146: \item BRAM - Block RAM: Random Access Memory inside an FPGA
1147: \item CX4 - 10GbE-compatible industry standard connector
1148: \item CPU - Central Processing Unit
1149: \item DDR2 - Double-Data-Rate 2 type of off-FPGA Synchronous DRAM
1150: \item DIMM - Dual Inline Memory Module
1151: \item DFT - Discrete Fourier Transform
1152: \item DRAM - Dynamic Random Access Memory
1153: \item FFT - Fast Fourier Transform algorithm
1154: \item FIR - Finite Impulse Response digital filter
1155: \item FPGA - Field Programmable Gate Array processor
1156: \item FX - Correlator architecture implemented as frequency channelization, then cross-multiplication
1157: \item GALS - Globally Asynchronous, Locally Synchronous system architecture
1158: \item GB - GigaByte
1159: \item IBOB - Internet Break-Out Board
1160: \item LFSR - Linear Feedback Shift Register
1161: \item LO - Local Oscillator
1162: \item MCNT - Master Counter
1163: \item PFB - Polyphase Filter Bank
1164: \item PowerPC - a specific CPU architecture
1165: \item QDR - Quad-Data-Rate type of off-FPGA SRAM
1166: \item ROACH - Reconfigurable, Open Architecture for Computing Hardware
1167: \item SNR - Signal-to-Noise Ratio
1168: \item SRAM - Static Random Access Memory
1169: \item UDP - User Datagram Protocol Ethernet packetization
1170: \item XAUI - X (ten) Attachment Unit Interface point-to-point transmission protocol
1171: \item XF - Correlator architecture implemented as cross-multiplication, then frequency channelization
1172: \item 1PPS - 1 Pulse Per Second clock signal
1173: \item 10GbE - 10 Gigabit per second Ethernet communication standard
1174: \end{itemize}
1175:
1176: \begin{thebibliography}{25}
1177: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
1178:
1179: \bibitem[{xil(2004)}]{xilinx_ug024}
2004, {RocketIO Transceiver User Guide (UG024 V2.5)}, Xilinx user guide,
1181: http://www.xilinx.com
1182:
1183: \bibitem[{xil(2005)}]{xilinx_ds083}
1184: 2005, {Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Functional
1185: Description (DS083-2 V4.5)}, Xilinx data sheet, http://www.xilinx.com
1186:
1187: \bibitem[{Blackman \& Tukey(1958)}]{blackman_tukey1958}
1188: Blackman, R. \& Tukey, J. 1958, The measurement of power spectra (Dover
1189: Publications Inc.)
1190:
1191: \bibitem[{{Bradley} {et~al.}(2005){Bradley}, {Backer}, {Parsons}, {Parashare},
1192: \& {Gugliucci}}]{bradley_et_al2005}
1193: {Bradley}, R., {Backer}, D., {Parsons}, A., {Parashare}, C., \& {Gugliucci},
1194: N.~E. 2005, in Bulletin of the American Astronomical Society, 1216--+
1195:
1196: \bibitem[{{Chang} {et~al.}(2005){Chang}, {Wawrzynek}, \&
1197: {Brodersen}}]{chang_et_al2005}
1198: {Chang}, C., {Wawrzynek}, J., \& {Brodersen}, R.~W. 2005, IEEE Design and Test
1199: of Computers, 22, 114
1200:
1201: \bibitem[{{Chapiro}(1984)}]{chapiro1984}
1202: {Chapiro}, D.~M. 1984, PhD thesis, Stanford Univ., CA.
1203:
1204: \bibitem[{{Crochiere} \& {Rabiner}(1983)}]{crochiere+rabiner1983}
1205: {Crochiere}, R. \& {Rabiner}, L.~R. 1983, {Multirate Digital Signal Processing}
1206: (Englewood Cliffs, N.J., Prentice-Hall, Inc., 1983.~336 p.)
1207:
1208: \bibitem[{{D'Addario}(2001)}]{daddario2001}
1209: {D'Addario}, L. 2001, ATA Memo
1210:
1211: \bibitem[{{Demorest} {et~al.}(2004){Demorest}, {Ramachandran}, {Backer},
1212: {Ferdman}, {Stairs}, \& {Nice}}]{demorest_et_al2004}
1213: {Demorest}, P., {Ramachandran}, R., {Backer}, D., {Ferdman}, R., {Stairs}, I.,
1214: \& {Nice}, D. 2004, in Bulletin of the American Astronomical Society, 1598--+
1215:
1216: \bibitem[{{Dick}(2000)}]{dick2000}
1217: {Dick}, C. 2000, Xilinx Application Note
1218:
1219: \bibitem[{{Heiles} {et~al.}(2004){Heiles}, {Goldston}, {Mock}, {Parsons},
1220: {Stanimirovic}, \& {Werthimer}}]{heiles_et_al2004}
1221: {Heiles}, C., {Goldston}, J., {Mock}, J., {Parsons}, A., {Stanimirovic}, S., \&
1222: {Werthimer}, D. 2004, in Bulletin of the American Astronomical Society,
1223: 1476--+
1224:
1225: \bibitem[{{Jenet} \& {Anderson}(1998)}]{jenet_anderson1998}
1226: {Jenet}, F.~A. \& {Anderson}, S.~B. 1998, PASP, 110, 1467
1227:
1228: \bibitem[{{Parsons}(2008)}]{parsons2008}
1229: {Parsons}, A. 2008, IEEE Signal Processing Letters, submitted
1230:
1231: \bibitem[{{Parsons} {et~al.}(2006){Parsons}, {Backer}, {Chang}, {Chapman},
1232: {Chen}, {Crescini}, {de Jesus}, {Dick}, {Droz}, {MacMahon}, {Meder}, {Mock},
1233: {Nagpal}, {Nikolic}, {Parsa}, {Richards}, {Siemion}, {Wawrzynek},
1234: {Werthimer}, \& {Wright}}]{parsons_et_al2006}
1235: {Parsons}, A., {Backer}, D., {Chang}, C., {Chapman}, D., {Chen}, H.,
1236: {Crescini}, P., {de Jesus}, C., {Dick}, C., {Droz}, P., {MacMahon}, D.,
1237: {Meder}, K., {Mock}, J., {Nagpal}, V., {Nikolic}, B., {Parsa}, A.,
1238: {Richards}, B., {Siemion}, A., {Wawrzynek}, J., {Werthimer}, D., \& {Wright},
1239: M. 2006, in Asilomar Conference on Signals and Systems, Pacific Grove, CA,
1240: 2031--2035
1241:
1242: \bibitem[{{Plana} {et~al.}(2007){Plana}, {Furber}, {Temple}, {Khan}, {Shi},
1243: {Wu}, \& {Yang}}]{luis_et_al2007}
1244: {Plana}, L.~A., {Furber}, S.~B., {Temple}, S., {Khan}, M., {Shi}, Y., {Wu}, J.,
1245: \& {Yang}, S. 2007, IEEE Des. Test, 24, 454
1246:
1247: \bibitem[{{Rabiner} \& {Gold}(1975)}]{rabiner_gold1975}
1248: {Rabiner}, L.~R. \& {Gold}, B. 1975, {Theory and application of digital signal
1249: processing} (Englewood Cliffs, N.J., Prentice-Hall, Inc., 1975.~777 p.)
1250:
1251: \bibitem[{{Rybicki} \& {Lightman}(1979)}]{rybicki_lightman1979}
1252: {Rybicki}, G.~B. \& {Lightman}, A.~P. 1979, {Radiative processes in
1253: astrophysics} (New York, Wiley-Interscience, 1979.~393 p.)
1254:
1255: \bibitem[{{Sault} {et~al.}(1995){Sault}, {Teuben}, \&
1256: {Wright}}]{sault_et_al1995}
1257: {Sault}, R.~J., {Teuben}, P.~J., \& {Wright}, M.~C.~H. 1995, in Astronomical
1258: Society of the Pacific Conference Series, Vol.~77, Astronomical Data Analysis
1259: Software and Systems IV, ed. R.~A. {Shaw}, H.~E. {Payne}, \& J.~J.~E.
1260: {Hayes}, 433--+
1261:
1262: \bibitem[{{So}(2007)}]{so2007}
1263: {So}, K.~H. 2007, PhD thesis, Berkeley Wireless Research Center, UC Berkeley,
1264: CA.
1265:
1266: \bibitem[{{So} \& {Brodersen}(2006)}]{so_broderson2006}
1267: {So}, K.~H. \& {Brodersen}, R.~W. 2006, in 16th International Conference on
1268: Field Programmable Logic and Applications, 349--354
1269:
1270: \bibitem[{{Thompson} {et~al.}(2001){Thompson}, {Moran}, \&
1271: {Swenson}}]{thompson_et_al2001}
1272: {Thompson}, A.~R., {Moran}, J.~M., \& {Swenson}, Jr., G.~W. 2001,
1273: {Interferometry and Synthesis in Radio Astronomy, 2nd Edition} (New York,
1274: Wiley-Interscience, 2001.~692 p.)
1275:
1276: \bibitem[{{Vaidyanathan}(1990)}]{vaidyanathan1990}
1277: {Vaidyanathan}, P.~P. 1990, in IEEE, Vol.~78, 56--93
1278:
1279: \bibitem[{{Weinreb}(1961)}]{weinreb_1961}
1280: {Weinreb}, S. 1961, Proc. IEEE, 49, 1099
1281:
1282: \bibitem[{{Wright}(2005)}]{wright2005}
1283: {Wright}, M. 2005, SKA Memo
1284:
1285: \bibitem[{{Yen}(1974)}]{yen1974}
1286: {Yen}, J.~L. 1974, A\&AS, 15, 483
1287:
1288: \end{thebibliography}
1289:
1290:
1291: % --------------------------------------------------------------------------
1292: % TABLES
1293: % --------------------------------------------------------------------------
1294: \clearpage
1295:
1296: %\input tab1.tex
1297: \begin{table}[t]
\begin{center}
\caption{Price and Power Consumption of CASPER Hardware}
\label{tab:hardware_price}
\begin{tabular}{lrrrr}
\hline\hline
\vspace{3pt}
Board & Board & Cost with & Gops & Power \\
      & Cost & FPGAs & per Sec & (W)\\
\hline
IBOB& \$400 & \$2700 & 70 & 30 \\
BEE2& \$5000 & \$23500 & 500 & 150 \\
ROACH\tablenotemark{*}& \$1000 & \$3200 & 400 & 50 \\
ADC (1Gs/s$\times2$)& \$200 & \$200 & N/A & 2 \\
ADC (3Gs/s)\tablenotemark{*}& \$1000 & \$1000 & N/A & 5 \\
1312: \hline\hline
1313: \vspace{-5pt}
1314: \end{tabular}
1315: \\
1316: \vspace{-10pt}
1317: \tablenotetext{*}{Estimated from prototype versions.}
1318: \end{center}
1319: \end{table}
1320:
1321:
1322: % --------------------------------------------------------------------------
1323: % FIGURES
1324: % --------------------------------------------------------------------------
1325: %\clearpage
1326:
1327: \begin{figure}
1328: \begin{center}
1329: \includegraphics[scale=.4]{raw_arch.png}
1330: \caption{In a simplistic FX correlator,
1331: the signals from N antennas are first decomposed into M frequency channels
1332: (F operation) and then cross-multiplied (X operation). Different channels are
1333: never cross-multiplied, making them natural units for X engine processing.
1334: Thus, each X engine handles all baselines for one frequency channel.
1335: \label{fig:corr_arch1}}
1336: \end{center}
1337: \end{figure}
1338:
1339: \begin{figure}
1340: \begin{center}
1341: \includegraphics[scale=.25]{ex_app1.png}
1342: \caption{Data bandwidth per antenna is equal to the processing bandwidth of
1343: an X processor in this example application. Transmitted data is routed
1344: through an X processor to take advantage of bidirectionality of 10GbE ports,
1345: thereby halving the number of ports on the switch.
1346: \label{fig:ex_app1}}
1347: \end{center}
1348: \end{figure}
1349:
1350: \begin{figure}
1351: \begin{center}
1352: \includegraphics[scale=.25]{ex_app2.png}
1353: \caption{Data bandwidth per antenna can exceed
1354: what can be carried over 10GbE. Here, the frequency band has been spread
1355: across ports by channel, so that each half of transmission occurs on an
1356: isolated subnet. This is possible because different channels are never
1357: cross-multiplied in an FX correlator.
1358: \label{fig:ex_app2}}
1359: \end{center}
1360: \end{figure}
1361:
1362: \begin{figure}
1363: \begin{center}
1364: \includegraphics[scale=.25]{ex_app3.png}
1365: \caption{When the processing bandwidth of an X engine exceeds the antenna
1366: bandwidth by at least a factor of 2, half as many X processors are needed for
1367: a given number of antennas. X processors operate independently of data
1368: bandwidth; the same design handles this and the previous two cases
1369: (Figs. \ref{fig:ex_app1} and \ref{fig:ex_app2}). Only the number of X
1370: processors and the data transmission pattern have changed.
1371: \label{fig:ex_app3}}
1372: \end{center}
1373: \end{figure}
1374:
1375: \begin{figure}
1376: \begin{center}
1377: \includegraphics[scale=.25]{ibob_bee2.jpg}
1378: \caption{%
1379: Our correlator architecture relies on modular FPGA-based processing hardware
1380: developed by our group to
1381: combine flexibility, upgradeability, and performance. Illustrated above are:
(top) IBOB and ADC FPGA/digitizer modules, and
(bottom) the Berkeley Emulation Engine (BEE2) FPGA board.
1384: \label{fig:ibobadcbee2}}
1385: \end{center}
1386: \end{figure}
1387:
1388: \begin{figure}
1389: \begin{center}
1390: \includegraphics[scale=.45]{ddc_response_scaled.png}
1391: \caption{%
This example response of the FIR filter in a digital down-converter
illustrates the 16-tap low-pass design used in the correlator deployments
presented later.
1395: \label{fig:ddc_passband}}
1396: \end{center}
1397: \end{figure}
1398:
1399: \begin{figure}
1400: \begin{center}
1401: \includegraphics[scale=.52]{pfb_vs_fft_bin_resp.png}
1402: \caption{%
1403: The response of a frequency channel in an 8-tap Polyphase Filter Bank (solid)
1404: using a Hamming window is compared to an equivalently sized Discrete Fourier
1405: Transform (dashed). This particular PFB, implemented for 2048 channels, is
1406: used in the correlator deployments presented in Section \ref{sec:deployments}.
1407: \label{fig:pfb_bin_resp}}
1408: \end{center}
1409: \end{figure}
1410:
1411: \begin{figure}
1412: \begin{center}
1413: \includegraphics[scale=.45]{x_engine.png}
1414: \caption{%
1415: This X engine schematic illustrates the pipelined flow of data
1416: that allows it to be split across multiple FPGAs and boards.
1417: With continuous data input, all multipliers (with the possible exception of
1418: the final stage for even values of $N_{ant}$) are used with 100\% efficiency.
1419: \label{fig:x_engine_schem}}
1420: \end{center}
1421: \end{figure}
1422:
1423: \begin{figure}
1424: \begin{center}
1425: \includegraphics[scale=.45]{corr_vs_dly_128_scaled.png}
1426: \caption{%
1427: Cross-correlation of noise decreases as a function of signal delay between
1428: antenna inputs. PFBs operate on a wider window of data compared to DFTs, and
1429: use non-flat sample weightings, yielding a
different correlation response versus signal delay compared to the standard
result presented in \citet{thompson_et_al2001}. Graphed
1432: are the responses of PFBs with 8 taps (solid), 4 taps (dashed), 2 taps (dot
1433: dashed), and the response of a DFT (dotted).
1434: \label{fig:corr_vs_dly}}
1435: \end{center}
1436: \end{figure}
1437:
1438: \begin{figure}
1439: \begin{center}
1440: \includegraphics[scale=.5]{packet_rx.png}
1441: \caption{Before transmission, each F engine packet is
1442: tagged with an antenna number and master counter (MCNT) encoding
time and frequency. Received packets are filtered to
a narrow range of MCNTs, and the maximum MCNT slides smoothly up as packets
are received. A free-running X engine
processes available windows when it is ready. This architecture
allows data to be processed at a lower data rate than the FPGA clock rate
without requiring every element in the pipeline to have an enable signal.
1449: \label{fig:packet_rx}}
1450: \end{center}
1451: \end{figure}
1452:
1453: \begin{figure}
1454: \begin{center}
1455: \includegraphics[scale=.35]{crosstalk_v2_scaled.png}
1456: \caption{%
1457: Uncorrelated noise sources with similar bandpass shapes were
1458: input to two channels of one ADC board (solid black) and a third noise source
with a narrower passband was input to a second ADC board
1460: (dashed black) in the ``Pocket Correlator'' system.
1461: Crosstalk levels between signal inputs on the same ADC board (light gray) and
1462: between ADC boards sharing an IBOB (dark gray) peak at $-28$ dB.
1463: \label{fig:crosstalk}}
1464: \end{center}
1465: \end{figure}
1466:
1467: \begin{figure}
1468: \begin{center}
1469: \includegraphics[scale=.5]{crosstalk_stability_scaled.png}
1470: \caption{%
1471: Measurements of the standard deviation versus integration time of the
1472: correlation between independent noise sources into the same ADC board show
that crosstalk is
stable over a period of 1 day for all frequency channels.
Although phase switching
may still be desirable, this stability allows
crosstalk to be calibrated and removed after correlation.
1478: \label{fig:crosstalk_stability}}
1479: \end{center}
1480: \end{figure}
1481:
1482: \begin{figure}
1483: \begin{center}
1484: \includegraphics[scale=.52]{4_bit_quant_rev2.png}
1485: \caption{%
1486: Illustrated above is the relative gain through a 4-bit, 15-level quantizer as a
1487: function of input signal level (log base 2). Plotted are gain curves for
1488: the cross-correlation of two
1489: gaussian noise sources with correlation levels of 100\% (solid),
1490: 80\% (dot-dashed), 40\% (dotted), and 20\% (dashed).
1491: \label{fig:4_bit_quant}}
1492: \end{center}
1493: \end{figure}
1494:
1495: \begin{figure}
1496: \begin{center}
1497: \includegraphics[scale=.5]{f_processor.png}
1498: \caption{%
1499: This IBOB design serves a dual purpose as a stand-alone ``Pocket Correlator''
and an F processor in a 16-antenna packetized correlator deployment. Note the
1501: parallel output pathways for each function.
1502: \label{fig:f_engine}}
1503: \end{center}
1504: \end{figure}
1505:
1506: \begin{figure}
1507: \begin{center}
1508: \includegraphics[scale=.25]{allsky_moll_trim_bw.png}
1509: \caption{%
1510: This all-sky image, made using a 75-MHz band centered at 150 MHz with the
1511: ``Pocket Correlator'' as part of the PAPER experiment in Western
1512: Australia, achieves an impressive 10,000:1 signal-to-noise ratio using
1513: 1 day of data.
1514: \label{fig:skymap}}
1515: \end{center}
1516: \end{figure}
1517:
1518: \begin{figure}
1519: \begin{center}
1520: \includegraphics[scale=.5]{x_processor.png}
1521: \caption{%
1522: A BEE2-based X processor in a packetized correlator transmits data
1523: from an F engine
1524: over 10GbE and stores self-addressed packets in a ``loopback'' buffer.
1525: These streams are merged on the receive side, and packets are
1526: distributed to two X engines. Accumulation occurs
1527: in DRAM buffers, and the results are packetized and output
over the same 10GbE link. A data acquisition system connects to the
1529: same switch as the X engines.
1530: \label{fig:x_processor}}
1531: \end{center}
1532: \end{figure}
1533:
1534: \end{document}
1535:
1536: