
1: %\documentclass[reviewcopy]{elsart}

2: \documentclass{elsart}

3: \usepackage{graphics}

4: \usepackage{graphicx}

5: \usepackage{epsfig}

6: \usepackage{amssymb}

7:

8: \begin{document}

9:

10: \begin{frontmatter}

11:

12: \title{HERA-B Framework for Online Calibration and Alignment}

13:

14: \author[desy-HH,desy-ifh]{J.M. Hern\'andez\thanksref{ciemat},}

15: \author[desy-HH]{D. Ressing,}

16: \author[desy-HH]{V. Rybnikov,}

17: \author[desy-HH]{F. S\'anchez\thanksref{ifae},}

18: \author[lip]{A. Amorim,}

19: \author[desy-HH]{M. Medinnis,}

20: \author[desy-HH]{P. Kreuzer\thanksref{athens},}

21: \author[desy-ifh]{U. Schwanke\thanksref{hu}}

22: \address[desy-HH]{DESY, D-22603 Hamburg, Germany}

23: \address[desy-ifh]{DESY, D-15738 Zeuthen, Germany}

24: %\address[MPI]{Max-Planck-Institut f\"ur Kernphysik, D-69117 Heidelberg, Germany}

25: \address[lip]{FCUL and LIP, P-1749-016 Lisboa, Portugal}

26: \thanks[ciemat]{Now at CIEMAT, E-28040 Madrid, Spain}

27: \thanks[ifae]{Now at Universitat Aut\`onoma Barcelona/IFAE, E-08193 Bellaterra, Spain}

28: \thanks[athens]{Now at Athens University, Athens, Greece}

29: \thanks[hu]{Now at Humboldt University, Berlin, Germany}

30:

31: \begin{abstract}

32:   This paper describes the architecture and implementation of the

33:   HERA-B framework for online calibration and alignment.  At HERA-B the

34:   performance of all trigger levels, including the online

35:   reconstruction, strongly depends on using the appropriate

36:   calibration and alignment constants, which might change during data

37:   taking. A system to monitor, recompute and distribute those

38:   constants to online processes has been integrated in the data

39:   acquisition and trigger systems.

40:

41: %An online system has been implemented in the HERA-B experiment to monitor, recompute and distribute

42: %on the fly updates of the calibration, alignment and channel status constants without causing significant

43: %deadtime during the data acquisition. This system is necessary

44: %to keep the trigger performance and the online reconstruction stable under variations of the detector conditions.

45: \end{abstract}

46:

47: \begin{keyword}

48: Conditions database \sep calibration \sep alignment \sep online reconstruction \sep PC farms

49: \PACS \\

50: 07.05.-t  Computers in experimental physics \\

51: 07.05.Hd  Data acquisition: hardware and software\\

52: 07.05.Bx  Computer systems: hardware, operating systems, computer languages and utilities \\

53:

54:

55: \end{keyword}

56: \end{frontmatter}

57:

58: \section{Introduction}

59: \label{introduction}

60:

61: It is essential in High Energy Physics experiments that accurate and

62: consistent detector parameter sets are used at all trigger levels and

63: also in the event reconstruction.  Similarly, simulation programs must

64: use detector parameters which are consistent with those used in the

65: trigger and reconstruction to properly simulate the detector and

66: trigger conditions. The detector parameters (such as calibrations,

67: alignments, detector channel maps, resolutions, etc.), globally known

68: as detector conditions\footnote{The detector conditions will be

69:   hereinafter also referred to as CnA ({\underline C}alibration

70:   a{\underline n}d {\underline A}lignment) constants}, are normally

71: calculated offline in a sporadic manner from monitoring information

72: and event data, and updated in the trigger and reconstruction codes.

73: The bookkeeping of the detector conditions then becomes an important

74: issue.

75:

76: The HERA-B experiment \cite{herab} was designed for the measurement of

77: CP violation in the neutral B-meson system. The data acquisition (DAQ)

78: and trigger systems were designed to cope with more than half a

79: million detector channels, a 40 MHz interaction rate and an extremely

80: low signal to background ratio of $10^{-10}$. A networked

81: high-bandwidth data acquisition system \cite{daq} and a highly

82: selective multi-level trigger \cite{trigger,hlt}, with a suppression

83: factor of $10^{6}$, were built.  Unlike most HEP experiments, HERA-B

84: performs full event reconstruction online.

85:

86: % redundant...

87: %Stable performance demands that all trigger levels

88: %as well as the online reconstruction use up-to-date values for the

89: %parameters which describe detector conditions.  When detector

90: %conditions change, for example due to temperature effects, new sets of

91: %constants must be computed and distributed.

92:

93:

94: A novel approach for handling the detector conditions has been

95: followed at HERA-B where a system to monitor, recompute and distribute

96: CnA constants to online clients is integrated into the DAQ and trigger

97: systems. Online updates of the CnA constants help to stabilize trigger

98: performance and online reconstruction as detector conditions vary during

99: data taking.

100: %redundant

101: %This system monitors and recomputes the CnA constants

102: %from detector monitoring information and event data, and is capable of distributing

103: %updated constants online to the trigger and reconstruction processors.

104: The CnA system is also employed offline during event data reprocessing

105: and Monte Carlo reconstruction. It allows offline updates of the

106: CnA constants to be incorporated during the data reprocessing and also

107: ensures that the reconstruction of Monte Carlo simulated events is

108: performed using the same CnA constants employed in the reconstruction

109: of the real data being simulated.  This approach is of potential

110: interest to future HEP experiments that are planning sophisticated

111: trigger systems and online event reconstruction.

112:

113: The architecture, implementation and performance of the CnA system are

114: described in this paper.  The motivation for the CnA system and its

115: requirements are summarized in the next section. The system

116: architecture is described in section~\ref{s:design} and its

117: implementation and performance are discussed in section~\ref{s:desc}.

118: Section~\ref{s:mc} describes the offline usage of the CnA system for

119: data reprocessing and Monte Carlo reconstruction.

120:

121:

122: \section{Motivation and requirements}

123:

124: The design and requirements of the online CnA system are driven by the

125: design of the HERA-B detector and the architecture of the DAQ and

126: trigger systems.  We therefore begin this section with a description

127: of the relevant aspects of the HERA-B detector, DAQ and trigger

128: systems.

129:

130: %Particularly important are

131: %the facts that event reconstruction is run online and that the performance of all %trigger levels strongly depend on

132: %detector conditions.

133: % makes necessary an online system for monitoring, computation and distribution of

134: %CnA constants.

135:

136: The HERA-B detector is depicted in figure~\ref{fig:detector}.  The

137: target wires and the silicon vertex detector (SVD) stations are

138: movable. The target positions change to

139: stabilize the average interaction rate.

140: % by scraping the lateral tails of the HERA proton beam.

141: The SVD stations are also moved towards the beam at the beginning of

142: every fill and retracted at the end to avoid damage during injection.

143: These two subsystems are particularly subject to alignment changes.

144: Although the stepping motors provide sufficient precision (of order

145: 1~micron), alignment corrections are needed relatively often for

146: optimal performance due to thermal and other effects. Similarly, the

147: realignment of the tracking chambers is needed whenever they are moved

148: away from the beam line during accesses to the detector for repairs.

149: Such accesses occur typically at intervals of order one week.  The

150: calibration of all detector subsystems is often updated as well.

151: Examples are pedestal following and energy calibration of the

152: electromagnetic calorimeter, the time calibration of the TDC boards of

153: the tracker system, the drift velocity calibration of the tracker

154: system and the maximal value of the Cherenkov angle in the RICH

155: detector, which varies with atmospheric pressure, temperature and gas

156: composition.  Moreover, the channel status (hot/dead/noisy) maps for

157: all subsystems must be monitored and updated periodically.

158:

159: \begin{figure}[htp]

160: \centering

161: \includegraphics[width=\textwidth]{detector_schematic_yz.eps}

162: \caption{Side view of the HERA-B detector}

163: \label{fig:detector}

164: \end{figure}

165:

166: The HERA-B DAQ system and its relationship with the trigger levels are

167: sketched in figure~\ref{fig:daq}. The trigger rates, latencies and

168: data volumes of each stage are also shown.

169:

170: \begin{figure}[htp]

171: \centering

172: \includegraphics[width=0.8\textwidth]{daq-trigger_scheme.eps}

173: \caption{Scheme of the data acquisition and trigger systems. The data throughput,

174:   trigger rates and latencies for each of the DAQ and trigger stages

175:   are also shown.}

176: \label{fig:daq}

177: \end{figure}

178:

179: The detector data are read out at the HERA bunch-crossing rate of

180: about 10 MHz and stored in 128-deep front-end pipelines during the

181: First Level Trigger (FLT) processing. The large input event rate

182: forces the FLT to be entirely built from specialized hardware. The FLT

183: performs hardware tracking to select events with $J/\psi$ particles

184: decaying into two leptons. The FLT tracking is initiated by lepton

185: candidates in the electromagnetic calorimeter and muon systems. The

186: accepted events are pushed into a distributed system of buffers

187: (Second Level Buffers, SLBs).  The events reside in the SLBs while

188: the Second Level Trigger (SLT) step is being run.  The SLT is

189: implemented as a software trigger running on a PC farm of 240 nodes

190: \cite{dam}.

191:
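
As a rough consistency check on the front-end pipelines mentioned above
(an illustrative estimate, not a figure quoted in this paper), the
pipeline depth and the bunch-crossing rate bound the time for which data
can wait for an FLT decision before being overwritten. Assuming that one
pipeline cell is filled per bunch crossing and taking the rate to be
exactly 10~MHz,
\begin{displaymath}
t_{\mathrm{max}} \approx \frac{N_{\mathrm{cells}}}{f_{\mathrm{BX}}}
  = \frac{128}{10~\mathrm{MHz}} \approx 12.8~\mu\mathrm{s}.
\end{displaymath}
The symbols $t_{\mathrm{max}}$, $N_{\mathrm{cells}}$ and
$f_{\mathrm{BX}}$ are introduced here only for this estimate.
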

192: A switching network provides full connectivity between the SLBs and

193: the SLT nodes.  This high bandwidth and low latency switch is built by

194: interconnecting several hundreds of Digital Signal Processors (DSP)

195: between the SLB system and the SLT nodes.  The total bandwidth of the

196: DSP switch is above 1 GB/sec. The switch message passing software

197: ensures zero packet loss and, in addition, possesses multicasting

198: capabilities which are used for distributing data sets to all SLT

199: nodes in parallel.

200:

201: The SLT operates on regions of the detector defined either by FLT

202: track candidates or pretrigger information (Regions of Interest, RoIs).

203: The SLT refines the FLT tracks and extrapolates them through the

204: spectrometer magnet, tracks them through the SVD and optionally

205: performs a vertex cut. Tracker and SVD data needed by the SLT are

206: fetched from the SLBs via the DSP switch. Events accepted by the SLT

207: are assembled and, optionally, further processed by the Third Level

208: Trigger (TLT) in the same trigger node. Events passing the TLT are

209: sent via a switched Ethernet network to a second 200-processor PC farm

210: to be fully reconstructed online \cite{gellrich}.  Event

211: classification by physics category is performed after the event

212: reconstruction and an additional Fourth Level Trigger (4LT) step can

213: be run at this stage if further reduction of the event rate is

214: required.

215:

216: The HERA-B trigger system relies on track-finding to an unusual

217: degree.  In turn, accurate track-finding and event selection at all

218: trigger levels require relatively precise knowledge of detector

219: calibration, alignment and detailed detector channel status

220: information, all of which can influence both trigger efficiency and

221: trigger rate (and therefore system deadtime).

222: %A proper alignment of the target, SVD and the tracker system is essential for

223: %the extrapolation and reconstruction of the tracks. Channel masks (i.e. maps of dead, %and noisy detector channels) significantly

224: %influence both trigger efficiency and trigger rate, and therefore the system deadtime. %Hot channels particularly affect the performance

225: %of the FLT.

226: %A good calibration of the tracker drift velocity is necessary for reaching the required

227: %suppression at the SLT by using an improved spatial resolution in the tracker.

228: The highly distributed DAQ and trigger systems require a dedicated

229: online system for monitoring and distributing updates of the CnA

230: constants into the trigger processors without incurring significant

231: deadtime.

232:

233: Running full reconstruction online imposes constraints on the quality

234: of the CnA constants but also allows immediate data analysis and

235: therefore detailed information for data quality monitoring. Given the

236: large data volume collected by the experiment and the large event

237: reconstruction time, offline data reprocessing should be minimized.

238: % redundant

239: %On the other hand, the online reconstruction opens the possibility

240: %to monitor and update online the calibration and alignment conditions

241: %using high level information from the fully reconstructed events.

242:

243: \section{Architecture}

244: \label{s:design}

245:

246:

247: The CnA system provides the online infrastructure for collecting data

248: suitable to align and calibrate the detector, for computing CnA

249: constants and for delivering updated constants to all processes

250: involved in the trigger and the online reconstruction, see

251: figure~\ref{fig:cna1}.  The CnA system takes care of tagging the set

252: of CnA constants being used at any moment by the DAQ, providing an

253: exact history of the set of constants used.

254:

255: \begin{figure}[htp]

256: \centering

257: \includegraphics[width=0.9\textwidth]{CnA-gatering+distribution.eps}

258: \caption{Gathering of CnA data and distribution of updated CnA constants.}

259: \label{fig:cna1}

260: \end{figure}

261:

262: \subsection{Data gathering}

263:

264: During the online reconstruction procedure, data for monitoring and

265: calculating detector conditions are collected from the reconstruction

266: processes.  In addition, subsystem specific monitors continuously

267: check the raw data and derive channel status maps.  To make use of the

268: large number of trigger and reconstruction nodes providing such data

269: in parallel, the CnA architecture relies on a gathering system to

270: collect data in a central place. Gathered data can then be used

271: centrally to compute updated CnA constants which are subsequently

272: stored in the online database system \cite{amorim}.

273:

274: \subsection{CnA distribution}

275:

276: The CnA distribution system delivers updated CnA constants to the

277: trigger and reconstruction processes.  This involves distributing

278: large objects to a large number of clients as quickly as possible to

279: minimize deadtime.  Two different approaches were followed according

280: to the different latency of the trigger levels and the bandwidth of

281: the DAQ at those stages (see figure~\ref{fig:cna1}): a push

282: architecture is best suited for distributing the CnA constants to the

283: FLT and SLT/TLT processes while a pull architecture was chosen for the

284: reconstruction farm.

285:

286: For the SLT/TLT, trigger latencies are of the order of milliseconds and

287: a fast distribution is required in order not to cause deadtime. Taking

288: advantage of the high speed, reliability and multicasting capabilities

289: of the DSP switch, the CnA data can be synchronously pushed to the

290: trigger processes.

291:

292: For the online reconstruction farm, several factors favor a pull

293: architecture. The Ethernet switching network of the reconstruction

294: farm has substantially less bandwidth than the DSP switch and would be

295: rapidly saturated if operated under a synchronous push protocol. This

296: would lead to frequent data retransmissions and consequently, to

297: degraded performance. In addition, the Ethernet switches have no

298: support for multicasting so that the same data sets would have to be

299: sequentially pushed into all nodes.  Furthermore, the online

300: reconstruction latency is much larger than that of the SLT/TLT and

301: pausing data taking to wait until all events are fully consumed would

302: cause deadtime on the order of seconds. On the other hand, the

303: reconstruction nodes process events asynchronously and independently

304: from each other, and therefore an asynchronous pull protocol would

305: distribute the requests for data over the average reconstruction time

306: and thus make more efficient use of the available bandwidth.  Finally,

307: with a pull architecture, a distributed system of fast memory database

308: caches can be implemented to replicate the CnA data and allow for

309: faster uploading and reduction of overall bandwidth requirements.

310:

311: Since uploading into the reconstruction nodes is asynchronous, the

312: reconstruction processes need to be individually notified when new

313: constants become available.  The notification is done through the

314: event data.  A database table (the ``key table'') contains identifiers

315: of all sets of CnA tables which are used by all online processes.  The

316: identifier of the current key table (the CnA key) is stamped into

317: events by the SLT process at event assembly time. This index allows an

318: event to be associated with all the calibration and alignment data

319: used in its triggering process and online reconstruction. Whenever

320: updated CnA constants have been distributed to the SLT/TLT nodes or

321: become available for the online reconstruction, a new identifier is

322: stamped in the event data. The reconstruction processes check the CnA

323: identifier and request updated CnA constants when the identifier

324: changes.

325:

326: For the synchronous distribution of the CnA constants to the FLT and

327: SLT/TLT, a manager process is needed for synchronization during the

328: distribution and also as an intermediary between the CnA producers

329: (processes producing online updated constants) and the consumers

330: (trigger and reconstruction processes).  The manager is notified

331: when updated constants are available for distribution.

332: On notification, the manager requests that the FLT/SLT/TLT be paused,

333: supervises the distribution, and requests resumption of data taking

334: when the distribution is completed.

335:

336: The system is quite flexible in that it allows distribution of any kind

337: of information to the trigger processors. This includes FLT and SLT

338: trigger settings as well as geometry and detector calibration data

339: sets.  The same distribution protocols used for the on-the-fly

340: distributions of CnA constants are employed for the initial loading of

341: the constants at DAQ booting time.

342:

343: \subsection{CnA offline usage}

344:

345: The trigger and online reconstruction farms together with the online

346: booting, control, monitoring and online data transmission protocols

347: and processes are used offline for performing data reprocessing and

348: Monte Carlo production during DAQ idle time \cite{jhnim}.  The CnA

349: system was also designed for offline use. During data reprocessing,

350: the CnA system allows any online changes of the CnA conditions to be

351: accurately reproduced. Moreover, it provides for use of recalculated

352: sets of CnA constants in place of the online tables, when appropriate.

353: For Monte Carlo reconstruction, the geometry, calibrations and channel

354: maps of the run period being simulated are identified and loaded as

355: well as additional data sets containing detector resolution and

356: efficiency data.

357:

358: \section{Implementation and performance}

359: \label{s:desc}

360:

361:

362: We describe in this section the implementation of the online

363: calibration and alignment system following the requirements and

364: architecture discussed in the previous sections.  Key elements of the

365: system are the data collection and monitor processes (gatherers), the

366: distributed system of database servers and proxies for storage and

367: replication of the constants, and the procedure for distribution to

368: the trigger and reconstruction processes. The CnA framework also

369: includes the software modules for the trigger and reconstruction codes

370: needed to upload the CnA constants.

371:

372:

373: \subsection{Data gathering and computation}

374:

375: Data needed to monitor detector conditions are collected, by

376: gatherers, from the reconstruction processes and from dedicated nodes

377: in the SLT farm which run subsystem-specific monitors. The dedicated

378: SLT nodes receive unbiased events at rates up to several Hz and

379: continuously check the raw event data, updating channel status maps as

380: needed.

381:

382:

383: %Unlike the event data transmission protocols which must be fully reliable (lossless), %the gathering protocols can tolerate some data loss

384: % is tolerable since it only results in a decrease of the statistics

385: %available for the monitoring and computation of updated CnA constants. This fact makes %easier the implementation

386: %of the data gathering protocols.

387: As sketched in figure~\ref{fig:cna1}, gatherers collect summary data

388: via Ethernet in parallel from all farm nodes. Gatherer processes can

389: work in two distinct modes, either requesting data from the providers

390: or subscribing for the data in the provider nodes which then

391: periodically publish the data to the subscribers.  Gatherers can also

392: serve as data providers to other gatherers which subscribe to the

393: provider gatherer for needed data and use the data to update CnA

394: constants.  In order to limit the amount of CnA data kept locally in

395: the provider nodes, the CnA data are stored either as histograms or as

396: ring buffers with arbitrary format.  A remote histogramming package

397: (RHP\cite{schwanke}) was developed for the data definition and data

398: collection.  RHP implements part of the functionality of the CERN

399: HBOOK package \cite{hbook}.

400:
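
The two gathering modes can be pictured with a minimal Python sketch.
It is not the RHP interface: the names \texttt{Provider},
\texttt{Gatherer}, \texttt{request}, \texttt{subscribe},
\texttt{publish} and \texttt{poll} are hypothetical, and histograms are
modelled as plain lists of bin contents. Under these assumptions, both
modes end in the same bin-by-bin sum over the farm nodes.

\begin{verbatim}
class Provider:
    """Stand-in for a farm node that fills a local histogram."""

    def __init__(self, bins):
        self.bins = list(bins)
        self.subscribers = []

    def request(self):
        # Request mode: the gatherer asks and receives a copy.
        return list(self.bins)

    def subscribe(self, callback):
        # Subscription mode: remember whom to publish to.
        self.subscribers.append(callback)

    def publish(self):
        # Called periodically on the node in subscription mode.
        for callback in self.subscribers:
            callback(list(self.bins))


class Gatherer:
    """Sums histograms received from many providers, bin by bin."""

    def __init__(self, nbins):
        self.total = [0] * nbins

    def add(self, bins):
        self.total = [t + b for t, b in zip(self.total, bins)]

    def poll(self, providers):
        for provider in providers:
            self.add(provider.request())


nodes = [Provider([1, 0, 2]), Provider([0, 3, 1])]

polled = Gatherer(3)
polled.poll(nodes)              # request mode

subscribed = Gatherer(3)
for node in nodes:
    node.subscribe(subscribed.add)
    node.publish()              # subscription mode

print(polled.total, subscribed.total)
\end{verbatim}

Because \texttt{Gatherer.add} accepts any list of bins, a gatherer could
itself be wrapped as a provider for another gatherer, which mirrors the
chaining described above.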

401: By calling subdetector specific functions, CnA gatherers compute new

402: CnA constants and monitor their evolution over time.  If significant

403: changes are produced, the new constants are stored in the distributed

404: database system described in section~\ref{s:db}, triggering the online

405: on-the-fly distribution as described in section~\ref{s:dist}.

406:

407:

408: \subsection{CnA distributed database system}

409: \label{s:db}

410:

411: The storage of updated CnA constants into active database servers

412: triggers online distribution. Upon an update, the CnA database servers

413: propagate update messages that notify the CnA distribution system of

414: the update. Indexed objects (CnA keytables), whose identifiers (CnA key)

415: are stored in the event data, are created automatically by a dedicated

416: CnA database server.  The CnA keytable contains CnA metadata, namely

417: the indices of the sets of CnA constants used online during a

418: particular period. The CnA key allows every event to be associated with

419: all the CnA constants used in its triggering process and online

420: reconstruction. This bookkeeping is of crucial importance for

421: identifying the correct sets of CnA constants in the event simulation

422: and in the offline event data reprocessing as explained in

423: section~\ref{s:mc}.

424:

425: The CnA distributed system of databases consists of active subdetector

426: CnA database servers, a dedicated CnA keytable database server and a

427: distributed system of fast memory database cache servers.  The

428: subdetector CnA database servers store the subdetector specific CnA

429: constants and notify the CnA keytable server of any update. The CnA

430: keytable server holds the keytable CnA metadata.  After an update

431: message, it generates a new keytable incrementally, i.e., it copies

432: the last keytable and updates the indices of the updated sets of CnA

433: constants. It then publishes to the CnA distribution system the CnA

434: key of the new CnA keytable.  The distributed system of memory

435: database caches is used to replicate the CnA constants in order to

436: speed up their distribution to the online reconstruction farm as

437: explained in the next section.

438:
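
The incremental generation of keytables can be illustrated with a short
Python sketch. The data layout is an assumption made for the
illustration: a keytable is modelled as a dictionary that maps the name
of a set of constants to the index of the version in use, and CnA keys
are taken to be consecutive integers. The names
\texttt{KeytableServer} and \texttt{notify} are hypothetical.

\begin{verbatim}
class KeytableServer:
    """Toy model of the keytable server: one keytable per CnA key."""

    def __init__(self, initial):
        self.keytables = {1: dict(initial)}
        self.current_key = 1

    def notify(self, updated):
        """Handle an update message and return the new CnA key.

        `updated` maps the name of each updated set of constants
        to its new index.
        """
        # Copy the last keytable, then change only the updated entries.
        new_table = dict(self.keytables[self.current_key])
        new_table.update(updated)
        self.current_key += 1
        self.keytables[self.current_key] = new_table
        return self.current_key       # key to be published


server = KeytableServer({"rich_calib": 7, "ecal_pedestals": 3})
key = server.notify({"rich_calib": 8})

print(key, server.keytables[key])
print(server.keytables[1])    # the earlier keytable is unchanged
\end{verbatim}

In this sketch every earlier keytable is retained, so an event stamped
with an old key can still be matched to the constants in use when it was
recorded.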

439:

440: \subsection{CnA distribution}

441: \label{s:dist}

442:

443: As stated before, the distribution procedure of CnA constants is

444: initiated by the storage of new constants into active database servers

445: which then propagate the updates to the CnA keytable database server.

446: This server in turn notifies the CnA distribution system of the

447: existence of updated CnA constants.  The distribution procedure is

448: sketched in figure~\ref{fig:cna2}.

449:

450: \begin{figure}[htp]

451: \centering

452: \includegraphics[width=\textwidth]{cna_distribution.eps}

453: \caption{Online distribution of updated CnA constants.}

454: \label{fig:cna2}

455: \end{figure}

456:

457: The CnA manager is the process receiving the notifications from the

458: CnA keytable database server. This process is in charge of the control

459: and synchronization of the distribution of the CnA constants. At DAQ

460: booting time, the CnA consumers (trigger and reconstruction processes)

461: subscribe with the CnA manager for updates of particular sets of CnA

462: constants.

463:

464: The format of the CnA constants as stored in the databases might be

465: different from the format required by the CnA consumers.  To centralize

466: the formatting of the constants, to save CPU processing time, and to

467: simplify code in the CnA consumers, CnA formatters were introduced.

468: The formatters are CnA processes that call subdetector specific

469: functions which use one or more raw database tables to produce the

470: tables which are actually distributed to the consumers.  The

471: subscription messages sent by the CnA consumers to the CnA manager

472: identify the associated CnA formatter of the desired set of constants.

473: The CnA formatters fetch any needed CnA tables from the database, make

474: the required formatting and send the formatted constants to the CnA

475: consumers.

476:

477: The full sequence of the distribution of the CnA constants to the

478: trigger processors is the following: a CnA producer (CnA gatherer)

479: stores updated constants into an active database.  The database server

480: informs the CnA keytable server of the update; the keytable server creates

481: a new keytable with a new CnA key index that is sent to the CnA manager.

482: The manager informs the appropriate formatters of the update; these then

483: fetch the updated CnA constants from the appropriate database and produce

484: formatted tables. When the formatting is finished, the CnA manager

485: requests the DAQ event controller (EVC) to pause the data taking.  The

486: EVC waits until all events in the second level buffers are processed

487: by the SLT nodes before acknowledging the CnA manager's request. This way

488: the events already processed by the FLT are processed by the SLT and

489: are stamped with the correct CnA key. In addition, consuming all

490: events in the buffers guarantees that the DSP switch bandwidth will be

491: fully available for the transmission of the CnA constants.  The CnA

492: manager then requests the CnA formatter to push the constants into the

493: trigger nodes. The new CnA key is also distributed to the nodes and

494: will be written into the event data of the new events. When the

495: distribution is complete, the CnA manager requests the event

496: controller to resume the data taking.

497:
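
The ordering of these steps can be summarized in a minimal Python
sketch. It is a schematic of the control flow only, with hypothetical
class names (\texttt{EventController}, \texttt{Formatter},
\texttt{Manager}) and a log list standing in for the real messages; it
is not the HERA-B code.

\begin{verbatim}
class EventController:
    def __init__(self, log, buffered_events):
        self.log = log
        self.buffered_events = buffered_events

    def pause(self):
        # Acknowledge only once every buffered event is consumed.
        while self.buffered_events > 0:
            self.buffered_events -= 1
        self.log.append("evc: paused, buffers empty")

    def resume(self):
        self.log.append("evc: resumed")


class Formatter:
    def __init__(self, log):
        self.log = log
        self.tables = None

    def prepare(self, key):
        # Fetch from the database and format before the pause.
        self.tables = "formatted tables for key %d" % key
        self.log.append("formatter: prepared")

    def push(self):
        self.log.append("formatter: pushed " + self.tables)


class Manager:
    def __init__(self, evc, formatters, log):
        self.evc, self.formatters, self.log = evc, formatters, log

    def on_new_key(self, key):
        for formatter in self.formatters:
            formatter.prepare(key)
        self.evc.pause()
        for formatter in self.formatters:
            formatter.push()
        self.log.append("manager: distributed key %d" % key)
        self.evc.resume()


log = []
manager = Manager(EventController(log, buffered_events=5),
                  [Formatter(log)], log)
manager.on_new_key(2)
print("\n".join(log))
\end{verbatim}

In this ordering the formatting work is done while data taking
continues, so only the transfer of the constants and of the new key
falls inside the pause.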

498: For distribution to the FLT processors, the FLT CnA formatter sends

499: the constants through Ethernet to a master process running in each of

500: the FLT trigger crates which in turn distributes them in parallel to

501: the trigger boards in the crate. For the distribution into the SLT/TLT

502: farm, the SLT/TLT formatter sends the constants via Ethernet to a

503: dedicated process running in one of the SLT/TLT nodes, the SLT

504: distributor. This process in turn uses the multicasting capabilities

505: of the DSP switch to transmit the constants in parallel to all the 240

506: trigger nodes. The throughput of the multicasting via the DSP switch

507: is about 1 GB/sec.  The constants are synchronously pushed into the

508: trigger nodes under the coordination of the CnA manager.  Thanks to

509: the large bandwidth of the DSP switch, the distribution introduces no

510: significant deadtime.

511:

512: Unlike the distribution to the trigger nodes, where a synchronous

513: distribution is done using a push architecture, the distribution of

514: updated CnA constants to the reconstruction nodes is done

515: asynchronously in each node and using a pull architecture. After any

516: update of the CnA constants, the new CnA key index will end up in the

517: event data. The reconstruction processes upon a change of this index

518: will fetch from the CnA keytable database the new keytable to

519: determine which updated set of constants should be loaded. In order to

520: speed up the loading of the new constants, the reconstruction nodes

521: fetch them via a distributed system of memory database caches.  Given

522: the smaller Ethernet bandwidth compared to the DSP switch bandwidth,

523: an asynchronous retrieval of the constants is more efficient.

524:
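
A minimal Python sketch of this pull logic is given below. The names
(\texttt{Cache}, \texttt{ReconstructionProcess}, \texttt{process}) and
the dictionary layout of the keytables are assumptions made for the
illustration, not the actual HERA-B interfaces. A single shared cache
stands in for the distributed system of memory caches.

\begin{verbatim}
class Cache:
    """Toy cache in front of a database; counts database accesses."""

    def __init__(self, database):
        self.database = database
        self.store = {}
        self.misses = 0

    def get(self, name, index):
        if (name, index) not in self.store:
            self.misses += 1          # fall through to the database
            self.store[(name, index)] = self.database[(name, index)]
        return self.store[(name, index)]


class ReconstructionProcess:
    def __init__(self, keytables, cache):
        self.keytables = keytables    # CnA key -> {set name: index}
        self.cache = cache
        self.key = None
        self.indices = {}
        self.constants = {}

    def process(self, event):
        if event["cna_key"] != self.key:      # key changed
            table = self.keytables[event["cna_key"]]
            for name, index in table.items():
                # Reload only the sets whose index has changed.
                if self.indices.get(name) != index:
                    self.constants[name] = self.cache.get(name, index)
            self.indices = dict(table)
            self.key = event["cna_key"]
        return self.constants         # constants used for this event


database = {("rich", 7): "R7", ("rich", 8): "R8", ("ecal", 3): "E3"}
keytables = {1: {"rich": 7, "ecal": 3}, 2: {"rich": 8, "ecal": 3}}
cache = Cache(database)
nodes = [ReconstructionProcess(keytables, cache) for _ in range(3)]

for node in nodes:
    node.process({"cna_key": 1})
for node in nodes:
    node.process({"cna_key": 2})

print(nodes[0].constants, cache.misses)
\end{verbatim}

Each process reacts to the key independently, when its own next event
arrives, and the cache limits database accesses to one per distinct
table in this toy setup.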

525: Exactly the same distribution procedure is applied at DAQ booting time

526: to upload the CnA constants into the trigger and reconstruction

527: processors. At booting time, all the sets of constants appear to be

528: updated and all of them are distributed. In addition to the

529: detector condition constants, the trigger settings are also

530: distributed into the trigger nodes following the procedure

531: described above.

532:

533: \subsection{Performance}

534:

535: Over 60 sets of CnA constants, amounting to a total volume of 6.5 MB, are

536: used. At DAQ booting time, they are pushed into the SLT/TLT trigger

537: nodes in 1.5 secs, at a rate of 1 GB/sec (6.5 MB $\times$ 240 nodes / 1.5

538: sec). Upon receiving their first events after run startup,

539: reconstruction nodes fetch the constants at an effective rate of 50

540: MB/sec (6.5 MB $\times$ 200 processes / 25 sec).  Note that in this case, 25

541: seconds is the total time required to completely upload the constants

542: in all the reconstruction processes, but not the deadtime caused since

543: the retrieval is asynchronous and independent in every node so that

544: each node starts processing events as soon as all the constants have

545: been read.

546:
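
The quoted rates follow directly from the numbers above. Written out,
with $V$ the total volume of constants, $N$ the number of receiving
nodes or processes and $T$ the total upload time (symbols introduced
here for convenience):
\begin{eqnarray*}
R_{\mathrm{SLT/TLT}} &=& \frac{V N}{T}
  = \frac{6.5~\mathrm{MB} \times 240}{1.5~\mathrm{s}}
  = 1040~\mathrm{MB/s} \approx 1~\mathrm{GB/s}, \\
R_{\mathrm{reco}} &=& \frac{V N}{T}
  = \frac{6.5~\mathrm{MB} \times 200}{25~\mathrm{s}}
  = 52~\mathrm{MB/s} \approx 50~\mathrm{MB/s}.
\end{eqnarray*}
These are aggregate rates, summed over all receivers.
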

547: The deadtime caused in the data taking by the distribution of CnA

548: constants to the SLT/TLT nodes is dominated by the time needed for

549: data transmission and multicasting in the DSP switch. The distribution

550: messages containing individual sets of CnA constants are multicast

551: within the DSP switch. The multicast is based on message copy. The

552: copy of the messages in the first block of the switch dominates the

553: multicast latency.  The contribution from the CnA control and

554: distribution protocol is small, of the order of 100 msecs.

555:

556:

557: The RICH detector calibration and channel status map were updated

558: online at intervals of the order of minutes. Online updates of the ECAL

559: pedestals and tracker channel maps occurred at intervals of the

560: order of hours.  Although the online CnA system was fully functional,

561: not all subdetector groups developed online monitors or updated CnA

562: constants online due largely to lack of manpower.  Calibration and

563: alignment constants for the vertex detector, tracker and muon systems

564: were updated offline by manually updating the index to the relevant

565: tables in the online keytable.  The change in the online keytable

566: caused the new constants to be automatically loaded at the next DAQ

567: startup or triggered their distribution if data taking was in

568: progress.

569:

570: The FLT lookup tables are distributed to the FLT trigger boards at DAQ

571: booting time using the CnA distribution procedure. The large number of

572: these tables and the slow input link to the boards prevent online

573: distribution of updated lookup tables except as part of the startup

574: procedure. Steering parameters for the FLT processes are also

575: distributed via the CnA system.

576:

577:

578: \section{CnA system for offline reprocessing and Monte Carlo reconstruction}

579: \label{s:mc}

580:

581: As knowledge of the detectors improves, as the reconstruction

582: packages are further developed, and as improved calibration and alignment

583: constants become available, offline reprocessing of the event

584: data becomes necessary.  At HERA-B the trigger and online

585: reconstruction farms together with the online booting, control,

586: monitoring and online data transmission protocols and processes are

587: used offline for performing data reprocessing and Monte Carlo

588: reconstruction as described in \cite{jhnim}.

589: Exactly the same CnA distributed database system and CnA data

590: uploading mechanism used for online reconstruction are also

591: employed offline for mass data processing. The only difference

592: between reprocessing and online reconstruction is that the source of

593: the data is not the detector but the recorded raw events archived on

594: tape. This system makes extremely efficient use of the online

595: computing resources during idle time and shutdown periods of the

596: detector.

597:

598: As mentioned earlier, the event record includes a tag which

599: links the event with the CnA keytable in the database containing

600: the indices of all the sets of CnA constants used in the online

601: reconstruction of that event.

602: The automatic bookkeeping of the keytables in the database

603: during data taking allows the detector calibration and

604: alignment conditions to be reproduced for offline data reprocessing. Sets of constants

605: improved offline are incorporated in the reprocessing by producing

606: a revision of the online keytables. The online keytables are first

607: duplicated in the database with a new revision number and then the

608: keytables corresponding to data taking periods for which updated

609: constants are available are modified with the indices of the updated constants.

610: The offline reconstruction

611: processes make use of a given revision number of the keytables

612: when reprocessing the data.

613:

614: Monte Carlo event reconstruction should be performed using the

615: reconstruction conditions of the real data one

616: intends to simulate. This is simply achieved by using the keytable (CnA

617: key and CnA revision number) employed in the reconstruction of the real

618: data. For extended data taking periods with several associated keytables,

619: Monte Carlo samples can be reconstructed separately with each of those keytables, and the

620: events are then reweighted according to the relative luminosities of the periods

621: of validity of the different keytables.

622:
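One possible way to write this reweighting is sketched below. It is an
illustration under assumptions not stated in the text: $L_i$ denotes the
integrated luminosity of the validity period of keytable $i$, $N_i$ the
number of Monte Carlo events reconstructed with that keytable, and the
normalisation is chosen so that the sum of weights equals the total
number of events. Every event reconstructed with keytable $i$ then
receives the weight
\[
w_i = \frac{L_i}{\sum_j L_j} \, \frac{\sum_j N_j}{N_i},
\]
so that the weighted fraction of events from sample $i$ is
$N_i w_i / \sum_j N_j = L_i / \sum_j L_j$, i.e. the luminosity fraction
of the corresponding period.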

623:

624: \section{Summary}

625:

626: In the HERA-B experiment, all trigger levels as well as the online

627: reconstruction critically depend on calibration and alignment

628: constants. In order to keep the trigger performance and the online

629: reconstruction stable under variations of the detector conditions, an

630: online calibration and alignment system was implemented and used. This

631: system monitors the status of the calibration and alignment constants,

632: recomputes them upon significant changes in the calibration or

633: alignment conditions in the detector and if necessary distributes them

634: on the fly to the trigger and reconstruction processors without

635: causing significant deadtime in the data acquisition.  The

636: distribution system exploits the high bandwidth and multicasting

637: capabilities of the DSP switch to synchronously push the constants to

638: the SLT/TLT trigger processes with a throughput of 1 GB/s. On the

639: other hand, given the smaller effective network bandwidth of the

640: reconstruction farm and the higher event processing time, the

641: reconstruction processes asynchronously fetch the updated constants

642: from a distributed and replicated database system.

643:

644: A tag in the event record associates every event with the detector

645: conditions used in the trigger and online reconstruction. This

646: mechanism provides the bookkeeping necessary for offline data

647: reprocessing and Monte Carlo reconstruction. The online CnA

648: distribution system is also used offline for mass data processing.

649:

650: The integration of the CnA system took place during the HERA-B

651: commissioning runs in 2000/2001. The system was fully operational and

652: routinely working during the 2002--2003 data taking period and is still

653: in use for data reprocessing and Monte Carlo reconstruction.

654:

655: The upcoming LHC experiments will incorporate PC farms into their DAQ

656: and trigger systems and might find the HERA-B experience concerning

657: the online calibration and alignment system of interest.

658:

659: \section{Acknowledgments}

660: We are grateful to Andreas Gellrich for fruitful discussions. We thank

661: the DAQ, subdetector and trigger experts for their work on the

662: integration of the online subsystems into the DAQ CnA framework.

663:

664: \bibliographystyle{elsart-num}

665: \bibliography{cna}

666:

667:

668: \end{document}

669:

670: