0306:cs0306007/wp1.tex

1: %% ****** Start of file slactemplate.tex ****** %

2: %%

3: %%

4: %%   This file is part of the APS files in the REVTeX 4 distribution.

5: %%   Version 4.0 of REVTeX, August 2001

6: %%

7: %%

8: %%   Copyright (c) 2001 The American Physical Society.

9: %%

10: %%   See the REVTeX 4 README file for restrictions and more information.

11: %%

12: %

13: % This is a template for producing manuscripts for use with REVTEX 4.0

14: % Copy this file to another name and then work on that file.

15: % That way, you always have this original template file to use.

16: %

17: \documentclass[twocolumn,twoside,slac]{revtex4}

18: \usepackage{graphicx}

19: \usepackage{fancyhdr}

20: \usepackage{pifont}

21: \pagestyle{fancy}

22: \fancyhead{} % clear all fields

23: \fancyhead[C]{\it {CHEP03 San Diego, March 24-28, 2003}} \fancyhead[RO,LE]{\thepage}

24: \fancyfoot{} % clear all fields

25: \fancyfoot[LE,LO]{\bf }

26: \renewcommand{\headrulewidth}{0pt}

27: \renewcommand{\footrulewidth}{0pt}

28: \renewcommand{\sfdefault}{phv}

29:

30: \setlength{\textheight}{235mm}

31: \setlength{\textwidth}{170mm}

32: \setlength{\topmargin}{-20mm}

33:

34:

35: % You should use BibTeX and apsrev.bst for references

36:

37: \bibliographystyle{apsrev}

38:

39: \begin{document}

40:

41: %Title of paper

42: \title{The first deployment of workload management services on the EU

43: DataGrid Testbed: feedback on design and implementation.}

44:

45:

46: % Repeat the \author .. \affiliation  etc. as needed

47: %

48: % \affiliation command applies to all authors since the last

49: % \affiliation command. The \affiliation command should follow the

50: % other information

51:

52: \author{G. Avellino, S. Beco, B. Cantalupo, F. Pacini, A. Terracina, A. Maraschini}

53: \affiliation{DATAMAT S.p.A.}

54:

55: \author{D. Colling}

56: \affiliation{Imperial College London}

57:

58: \author{S. Monforte, M. Pappalardo}

59: \affiliation{INFN, Sezione di Catania}

60:

61: \author{L. Salconi}

62: \affiliation{INFN, Sezione di Pisa}

63:

64: \author{F. Giacomini, E. Ronchieri}

65: \affiliation{INFN, CNAF}

66:

67: \author{D. Kouril, A. Krenek, L. Matyska, M. Mulac, J. Pospisil, M. Ruda, Z. Salvet, J. Sitera, M. Vocu}

68: \affiliation{CESNET}

69:

70: \author{M. Mezzadri, F. Prelz}

71: \affiliation{INFN, Sezione di Milano}

72:

73: \author{A. Gianelle, R. Peluso, M. Sgaravatto}

74: \affiliation{INFN, Sezione di Padova}

75:

76: \author{S. Barale, A. Guarise, A. Werbrouck}

77: \affiliation{INFN, Sezione di Torino}

78:

79: \begin{abstract}

80: Application users have now been experiencing for about a year with

81: the standardized resource brokering services provided by the

82: 'workload management' package of the EU DataGrid project (WP1).

83: Understanding, shaping and pushing the limits of the system has provided

84: valuable feedback on both its design and implementation.

85: A digest of the lessons, and ``better practices", that were learned, and

86: that were applied towards the second major release of the software, is given.

87: \end{abstract}

88:

89: %\maketitle must follow title, authors, abstract

90: \maketitle

91:

92: \thispagestyle{fancy}

93:

94: % body of paper here - Use proper section commands

95: % References should be done using the \cite, \ref, and \label commands

96: % Put \label in argument of \section for cross-referencing

97: %\section{\label{}}

98:

99: \section{Introduction}

100: \begin{figure*}[t]

101: \centering

102: \includegraphics[width=135mm]{queue_network.eps}

103: \caption{The various steps to process a job request can be modeled

104: as passing the request through a network of queues.} \label{fig-netqueue}

105: \end{figure*}

106: The workload management task (Work Package 1, or WP1) \cite{wp1} of

107: the EU DataGrid project \cite{DataGrid} (also known, and referred to

108: in the following text, as EDG) is mandated to define and implement

109: a suitable architecture for distributed scheduling and

110: resource management in the Grid environment.

111: During the first year and a half of the project (2001-2002), and following

112: a technology evaluation process, EDG WP1 defined, implemented and deployed a

113: set of services that integrate existing components, mostly from the

114: Condor \cite{Condor} and Globus \cite{Globus} projects. This was described in

115: more detail at CHEP 2001 \cite{WP1Chep01}.

116: In a nutshell, the core job submission component of CondorG (\cite{CondorG}),

117: talking to computing resources

118: (known in DataGrid as Computing Elements, or CEs) via the Globus GRAM

119: protocol, is fundamentally complemented by:

120: \begin {itemize}

121: \item A job requirement matchmaking engine (called the {\it Resource Broker},

122: or RB), matching job requests to computing resource status coming from the

123: Information System and resolving data requirements against the

124: replicated file management services provided by EDG WP2.

125: \item A job Logging and Book-keeping service (LB), where a job state machine

126: is kept current based on events generated during the job lifetime, and the job

127: status is made available to the submitting user. The LB events

128: are generated with some

129: redundancy to cover various cases of loss.

130: \item A stable user API (command line, C++ and JAVA) for access to the system.

131: \end{itemize}

132: Job descriptions are expressed throughout the system

133: using the Condor Classified Ad language,

134: where appropriate conventions were established to express requirement and

135: ranking conditions on Computing and

136: Storage Element info, and to express data requirements.

137: More details on the structure and evolution of these services and the

138: necessary integration scaffolding can be found in various EDG

139: public deliverable documents.

140: \par

141: This paper focuses on how the experience of

142: the first year of operation of the WP1 services on the EDG testbed was

143: interpreted, digested, and how a few design {\it principles} were learned

144: (possibly the hard way) from the design and implementation shortcomings of

145: the first release of WP1 software.

146: \par

147: These principles were applied to design and implement the second major

148: release of WP1 software, that is described in another CHEP 2003 paper

149: (\cite{WP1Chep03Sga}).

150: \par

151: To illustrate the logical path that leads to at least some of these

152: principles, we start by exploring the available techniques to model the

153: behaviour and throughput of the integrated workload management system,

154: and identify two factors that significantly complicate the system analysis.

155:

156: \section{The Workload Management System as a network of queues}

157:

158: The workload management system provided by EDG-WP1 is designed to rely as

159: much as possible on existing technology. While this has the obvious

160: advantages of limiting effort duplication and facilitating the compatibility

161: among different projects, it also significantly complicates troubleshooting

162: across the various layers of software supplied by different providers, and

163: in general the understanding of the integrated system. Also, where negotiations

164: with external software providers couldn't reach an agreement within the

165: EDG deadlines, some of the interfaces and communication paths in the system

166: had to be adapted to fit the existing external software incarnations.

167: \par

168: To get a useful high-level picture of the integrated Workload Management system,

169: beyond all these practical constraints,

170: we can model it as a queuing system, where job requests traverse

171: a network of queues, and the ``service stations" connected to each queue

172: represent one of the various processing steps in the job life-cycle.

173: A few of these steps are exemplified in Figure \ref{fig-netqueue}.

174: \par

175: Establishing the scale factors for each service in the WP1 system

176: (e.g.: how many users can a single matchmaking/job submission station serve,

177: how many requests per unit time can a top-level access point to the

178: information system

179: serve, what is the sustained job throughput that can be achieved

180: through the workload management chain, etc.) is one of the fundamental

181: premises for the correct design of the system. One could

182: expect to obtain this knowledge

183: either by applying queuing theory to this network model (this requires

184: obtaining a formal representation of all the components, their

185: service time profiles and their interconnections) or by measuring the

186: service times and by identifying where long queues are likely to build up

187: when a ``realistic" request load is injected in the system. This information

188: could in principle also be used to identify the areas of the system where

189: improvement is needed (sometimes collectively called {\it bottlenecks}).

190: \par

191: Experience with the WP1 software integration showed that both of these

192: approaches are impractical for either dimensioning the system or (possibly

193: even more important) for identifying the trouble areas that affect the

194: system throughput. We identified two non-linear factors that definitely

195: work against the predictive power of queuing theory in this case, and require

196: extra care even to apply straightforward reasoning when bottlenecks

197: are to be identified to improve system throughput.

198: These are the consequence

199: of common programming practice (and are therefore easy to be found

200: in the software components that we build or are integrating) and

201: are described in the following Section.

202:

203: \section{Troubleshooting the WMS}

204: \label{sec-fmode}

205: \begin{figure}

206: \includegraphics[width=80mm]{failure_mode.eps}

207: \caption{A possible way to make the system throughput worse

208: by applying the genuine intent to make it better. The names

209: of the various steps are just an example and don't refer to

210: any real experience or software component.} \label{fig-fmode}

211: \end{figure}

212: One of the most common (and most frustrating, both to developers

213: and to end users) experiences in troubleshooting the WP1

214: Workload Management system on the EDG testbed has been the fact that

215: often, perceived {\it improvements} to the system (sometimes even

216: simple bug fixes) result in a {\it decrease} in the system stability,

217: or reliability (fraction of requests that complete successfully).

218: The cause is often closely related to the known fact that removing

219: a bottleneck, in any flow system, can cause an overflow downstream,

220: possibly close to the {\it next} bottleneck. The complicating factor

221: is that there are at least two characteristics that could (and possibly

222: still can) be found in many elements of our integrated workload management

223: queuing network, that can cause problems to appear even very far from the area

224: of the network where an {\it improvement} is being attempted:

225: \begin{itemize}

226: \item {\it Queues of job requested can form where they can impact on the

227:       system load.}\\

228:       Different techniques can be chosen or needed to pass

229:       job requests around. Sometimes a socket connection is needed, sometimes

230:       sequential request processing (one request at a time in the system)

231:       is required for some reason, and multiple processes/threads may be used to

232:       handle individual requests. Having a number of

233:       tasks (processes/threads) wait for a socket queue or a sequential

234:       processing slot is one way to ``queue" requests that definitely

235:       generates much extra work for the process scheduler, and can cause

236:       any other process served by the same scheduler to be allocated

237:       less and less time. Queues that are unnecessarily

238:       scanned while waiting for some other condition to allow the processing

239:       of their element can also impact on the system load, especially

240:       if the queue elements are associated to significant amounts of

241:       allocated dynamic memory.

242: \item {\it Some system components can enforce hard timeouts and cause

243:       anomalies in the job flow.}\\

244:       When handling the access (typically via socket connections)

245:       to various distributed services, provisions typically need to

246:       be made to handle

247:       all possible failure modes. ``Reasonably" long

248:       timeouts are sometimes chosen to handle failures that are perceived to be

249:       very unlikely by developers (failure to establish communication

250:       to a local service, for instance). This kind of failures, however,

251:       can easily materialise when the system resources are exhausted under

252:       a stress test or load peak.

253: \end{itemize}

254: Figure \ref{fig-fmode} illustrates how these two effects can conspire to

255: frustrate a genuine effort to remove what seems the limiting bottleneck

256: in the system (the example in the Figure does nor refer to any real case

257: or component): removing the bottleneck (1) causes a request queue to build

258: up at the next station (2), and this interferes via the system load to

259: cause hard timeouts and job failures elsewhere (3). This example

260: is used to rationalise some of the unexpected reactions

261: that, in many cases, were found while working on the WP1 integrated system.

262: The experience on practical troubleshooting cases similar to this one,

263: while bringing an understanding of the difficulties inherent in building

264: distributed systems,

265: also drove us to formulate some of the principles that are presented

266: in the next section.

267:

268: \section{Principles that were learnt (and applied to improve the design)}

269: The attempts at getting a deeper understanding of the EDG-WP1 Workload

270: Management System and their failures led us to formulate a few design

271: principles and to apply them to the second major software release.

272: Here are the principles that descend from the paradigm example described in

273: Section \ref{sec-fmode}:

274: \begin{enumerate}

275: \renewcommand{\labelenumi}{\ding{228} \theenumi. }

276: \item {\bf Queues of various kinds of requests for processing should be

277: allowed to form where they have a minimal and understood impact on

278: system resources.}\\

279: Queues that get `filled' in the form of multiple threads or processes,

280: or that allocate significant amounts of system memory should be avoided, as they

281: not only adversely impact system performance, but also generate

282: inter-dependencies and complicate troubleshooting.

283: \item {\bf Limits should always be placed on dynamically allocated objects,

284: threads and/or subprocesses.}\\

285: This is a consequence of the previous point: every dynamic resource that

286: gets allocated should have a tunable system-wide limit that gets enforced.

287: \item {\bf Special care needs to be taken around the pipeline areas where

288: serial handling of requests is needed.}\\

289: The impact of any contention for system resources becomes more evident near

290: areas of the queuing system that require the acquisition of system-wide locks.

291: \end{enumerate}

292: \par

293: So far we concentrated on a specific attempt at modeling and understanding

294: the workload management system that led to an increased attention to the

295: usage of shared resources. There were other specific practical issues that

296: emerged during the deployment and troubleshooting of the system

297: and that led to the awareness of some fundamental design or

298: implementation mistake that was made. Here is a short list, where the

299: fundamental principle that should correct the fundamental mistake that was made

300: is listed:

301: \begin{enumerate}

302: \setcounter{enumi}{3}

303: \renewcommand{\labelenumi}{\ding{228} \theenumi. }

304: \item {\bf Communication among services should always be reliable:}

305: \begin{itemize}

306:   \item[-] Always applying double-commit and rollback for network communications.

307:   \item[-] Going through the filesystem for local communications.

308: \end{itemize}

309: In general, forms of communication that don't allow for data or messages to

310: be lost in a broken pipe lead to easier recovery from system or process

311: crashes. Where network communication is necessary, database-like techniques

312: have to be used.

313: \item {\bf Every process, object or entity related to the job lifecycle should

314: have another process, object or entity in charge of its well-being.}\\

315: Automatic fault recovery can only happen if every entity is held accountable

316: and accounted for.

317: \item {\bf Information repositories should be minimized (with a clear

318: identification of authoritative information).}\\

319: Many of the software components that were integrated in the EDG-WP1 solution

320: are stateful and include local repositories for request information, in

321: the form of local queues, state files, database back-ends. Only one

322: site with authoritative information about requests has to be identified

323: and kept.

324: \item {\bf Monolithic, long-lived processes should be avoided.}\\

325: Dynamic memory programming, using languages and techniques that

326: require explicit release of dynamically allocated objects, can lead to leaks

327: of memory, descriptors and other resources. Experimental, R\&D code

328: can take time to leak-proof, so it should possibly not be

329: linked to system components that are long-lived, as it can accelerate

330: system resource starvation. Short-lived, easy-to-recover components are a clean

331: and very practical workaround in this case.

332: \item {\bf More thought should be devoted to efficiently and correctly

333: recovering a service rather than to starting and running it.}\\

334: This is again a consequence of the previous point: the capability

335: to quickly recover from

336: failures or interruption helps in assuring that system components `can' be

337: short-lived, either by design or by accident.

338: \end{enumerate}

339:

340: \section{Conclusions}

341: EDG-WP1 has been distributing jobs over the EDG testbed in a continuous

342: fashion for one and a half years now, with a software solution where

343: existing grid technology was integrated wherever possible.

344: \par

345: The experience of understanding the direct and indirect interplay of the service

346: components could not be reduced to a simple {\it scalability} evaluation.

347: This because understanding and removing {\it bottlenecks} is significantly

348: complicated by non-linear and non-continuous effects in the system.

349: In this process, few principles that apply to the very complex practice

350: of distributed systems operations were learned the hard way (i.e. not by

351: just reading some good book on the subject).

352: EDG-WP1 tried to incorporate these principles

353: in its second major software release that

354: will shortly face deployment in the EDG testbed.

355:

356: % If you have acknowledgments, this puts in the proper section head.

357: \begin{acknowledgments}

358: DataGrid is a project funded by the European Commission under contract

359: IST-2000-25182.

360: \par

361: We also acknowledge the national funding agencies participating to

362: DataGrid for their support of this work.

363: \par

364: We wish to thank the Condor development team for many valuable discussions.

365: \end{acknowledgments}

366:

367: % Create the reference section using BibTeX:

368: %\bibliography{basename of .bib file}

369: \begin{thebibliography}{9}   % Use for  1-9  references

370: %\begin{thebibliography}{99} % Use for 10-99 references

371:

372: \bibitem{wp1}

373: Home page for the Grid Workload Management workpackage of the DataGrid project\\

374: \url{http://www.infn.it/workload-grid}.

375:

376: \bibitem{DataGrid}

377: Home page for the Datagrid project\\

378: \url{http://www.eu-datagrid.org}.

379:

380: \bibitem{Condor}

381: Home page for the Condor project\\

382: \url{http://www.cs.wisc.edu/condor/}

383:

384: \bibitem{Globus}

385: Home page for the Globus project\\

386: \url{http://www.globus.org}

387:

388: \bibitem{CondorG}

389: J. Frey, T. Tannenbaum, I. Foster, M. Livny, S. Tuecke,

390: ``Condor-G: A Computation Management Agent for Multi-Institutional Grids'',

391: {\it Proceedings of the Tenth IEEE Symposium on High Performance Distributed

392: Computing (HPDC10)}, 2001

393:

394: \bibitem{WP1Chep01}

395: DataGrid WP1 members (C. Anglano {\it et al.}), ``{\bf Integrating GRID

396: tools to build a Computing Resource Broker: Activities of DataGrid WP1}"

397: Presented at the CHEP 2001 Conference, Beijing (p. 708 in the proceedings)

398:

399: \bibitem{WP1Chep03Sga}

400: DataGrid WP1 members (G. Avellino {\it et al.}), ``{\bf The EU DataGrid

401: Workload Management System: towards the second major release}"

402: Also presented at the CHEP 2003 Conference, San Diego

403:

404: \end{thebibliography}

405:

406: \end{document}

407: %

408: % ****** End of file template.aps ******

409: