1: %% ****** Start of file slactemplate.tex ****** %
2: %%
3: %%
4: %% This file is part of the APS files in the REVTeX 4 distribution.
5: %% Version 4.0 of REVTeX, August 2001
6: %%
7: %%
8: %% Copyright (c) 2001 The American Physical Society.
9: %%
10: %% See the REVTeX 4 README file for restrictions and more information.
11: %%
12: %
13: % This is a template for producing manuscripts for use with REVTEX 4.0
14: % Copy this file to another name and then work on that file.
15: % That way, you always have this original template file to use.
16: %
17: \documentclass[twocolumn,twoside,slac]{revtex4}
18: \usepackage{graphicx}
19: \usepackage{fancyhdr}
20: \usepackage{pifont}
21: \pagestyle{fancy}
22: \fancyhead{} % clear all fields
23: \fancyhead[C]{\it {CHEP03 San Diego, March 24-28, 2003}} \fancyhead[RO,LE]{\thepage}
24: \fancyfoot{} % clear all fields
25: \fancyfoot[LE,LO]{\bf }
26: \renewcommand{\headrulewidth}{0pt}
27: \renewcommand{\footrulewidth}{0pt}
28: \renewcommand{\sfdefault}{phv}
29:
30: \setlength{\textheight}{235mm}
31: \setlength{\textwidth}{170mm}
32: \setlength{\topmargin}{-20mm}
33:
34:
35: % You should use BibTeX and apsrev.bst for references
36:
37: \bibliographystyle{apsrev}
38:
39: \begin{document}
40:
41: %Title of paper
42: \title{The first deployment of workload management services on the EU
43: DataGrid Testbed: feedback on design and implementation.}
44:
45:
46: % Repeat the \author .. \affiliation etc. as needed
47: %
48: % \affiliation command applies to all authors since the last
49: % \affiliation command. The \affiliation command should follow the
50: % other information
51:
52: \author{G. Avellino, S. Beco, B. Cantalupo, F. Pacini, A. Terracina, A. Maraschini}
53: \affiliation{DATAMAT S.p.A.}
54:
55: \author{D. Colling}
56: \affiliation{Imperial College London}
57:
58: \author{S. Monforte, M. Pappalardo}
59: \affiliation{INFN, Sezione di Catania}
60:
61: \author{L. Salconi}
62: \affiliation{INFN, Sezione di Pisa}
63:
64: \author{F. Giacomini, E. Ronchieri}
65: \affiliation{INFN, CNAF}
66:
67: \author{D. Kouril, A. Krenek, L. Matyska, M. Mulac, J. Pospisil, M. Ruda, Z. Salvet, J. Sitera, M. Vocu}
68: \affiliation{CESNET}
69:
70: \author{M. Mezzadri, F. Prelz}
71: \affiliation{INFN, Sezione di Milano}
72:
73: \author{A. Gianelle, R. Peluso, M. Sgaravatto}
74: \affiliation{INFN, Sezione di Padova}
75:
76: \author{S. Barale, A. Guarise, A. Werbrouck}
77: \affiliation{INFN, Sezione di Torino}
78:
79: \begin{abstract}
80: Application users have now been experiencing for about a year with
81: the standardized resource brokering services provided by the
82: 'workload management' package of the EU DataGrid project (WP1).
83: Understanding, shaping and pushing the limits of the system has provided
84: valuable feedback on both its design and implementation.
85: A digest of the lessons, and ``better practices", that were learned, and
86: that were applied towards the second major release of the software, is given.
87: \end{abstract}
88:
89: %\maketitle must follow title, authors, abstract
90: \maketitle
91:
92: \thispagestyle{fancy}
93:
94: % body of paper here - Use proper section commands
95: % References should be done using the \cite, \ref, and \label commands
96: % Put \label in argument of \section for cross-referencing
97: %\section{\label{}}
98:
99: \section{Introduction}
100: \begin{figure*}[t]
101: \centering
102: \includegraphics[width=135mm]{queue_network.eps}
103: \caption{The various steps to process a job request can be modeled
104: as passing the request through a network of queues.} \label{fig-netqueue}
105: \end{figure*}
106: The workload management task (Work Package 1, or WP1) \cite{wp1} of
107: the EU DataGrid project \cite{DataGrid} (also known, and referred to
108: in the following text, as EDG) is mandated to define and implement
109: a suitable architecture for distributed scheduling and
110: resource management in the Grid environment.
111: During the first year and a half of the project (2001-2002), and following
112: a technology evaluation process, EDG WP1 defined, implemented and deployed a
113: set of services that integrate existing components, mostly from the
114: Condor \cite{Condor} and Globus \cite{Globus} projects. This was described in
115: more detail at CHEP 2001 \cite{WP1Chep01}.
116: In a nutshell, the core job submission component of CondorG (\cite{CondorG}),
117: talking to computing resources
118: (known in DataGrid as Computing Elements, or CEs) via the Globus GRAM
119: protocol, is fundamentally complemented by:
120: \begin {itemize}
121: \item A job requirement matchmaking engine (called the {\it Resource Broker},
122: or RB), matching job requests to computing resource status coming from the
123: Information System and resolving data requirements against the
124: replicated file management services provided by EDG WP2.
125: \item A job Logging and Book-keeping service (LB), where a job state machine
126: is kept current based on events generated during the job lifetime, and the job
127: status is made available to the submitting user. The LB events
128: are generated with some
129: redundancy to cover various cases of loss.
130: \item A stable user API (command line, C++ and JAVA) for access to the system.
131: \end{itemize}
132: Job descriptions are expressed throughout the system
133: using the Condor Classified Ad language,
134: where appropriate conventions were established to express requirement and
135: ranking conditions on Computing and
136: Storage Element info, and to express data requirements.
137: More details on the structure and evolution of these services and the
138: necessary integration scaffolding can be found in various EDG
139: public deliverable documents.
140: \par
141: This paper focuses on how the experience of
142: the first year of operation of the WP1 services on the EDG testbed was
143: interpreted, digested, and how a few design {\it principles} were learned
144: (possibly the hard way) from the design and implementation shortcomings of
145: the first release of WP1 software.
146: \par
147: These principles were applied to design and implement the second major
148: release of WP1 software, that is described in another CHEP 2003 paper
149: (\cite{WP1Chep03Sga}).
150: \par
151: To illustrate the logical path that leads to at least some of these
152: principles, we start by exploring the available techniques to model the
153: behaviour and throughput of the integrated workload management system,
154: and identify two factors that significantly complicate the system analysis.
155:
156: \section{The Workload Management System as a network of queues}
157:
158: The workload management system provided by EDG-WP1 is designed to rely as
159: much as possible on existing technology. While this has the obvious
160: advantages of limiting effort duplication and facilitating the compatibility
161: among different projects, it also significantly complicates troubleshooting
162: across the various layers of software supplied by different providers, and
163: in general the understanding of the integrated system. Also, where negotiations
164: with external software providers couldn't reach an agreement within the
165: EDG deadlines, some of the interfaces and communication paths in the system
166: had to be adapted to fit the existing external software incarnations.
167: \par
168: To get a useful high-level picture of the integrated Workload Management system,
169: beyond all these practical constraints,
170: we can model it as a queuing system, where job requests traverse
171: a network of queues, and the ``service stations" connected to each queue
172: represent one of the various processing steps in the job life-cycle.
173: A few of these steps are exemplified in Figure \ref{fig-netqueue}.
174: \par
175: Establishing the scale factors for each service in the WP1 system
176: (e.g.: how many users can a single matchmaking/job submission station serve,
177: how many requests per unit time can a top-level access point to the
178: information system
179: serve, what is the sustained job throughput that can be achieved
180: through the workload management chain, etc.) is one of the fundamental
181: premises for the correct design of the system. One could
182: expect to obtain this knowledge
183: either by applying queuing theory to this network model (this requires
184: obtaining a formal representation of all the components, their
185: service time profiles and their interconnections) or by measuring the
186: service times and by identifying where long queues are likely to build up
187: when a ``realistic" request load is injected in the system. This information
188: could in principle also be used to identify the areas of the system where
189: improvement is needed (sometimes collectively called {\it bottlenecks}).
190: \par
191: Experience with the WP1 software integration showed that both of these
192: approaches are impractical for either dimensioning the system or (possibly
193: even more important) for identifying the trouble areas that affect the
194: system throughput. We identified two non-linear factors that definitely
195: work against the predictive power of queuing theory in this case, and require
196: extra care even to apply straightforward reasoning when bottlenecks
197: are to be identified to improve system throughput.
198: These are the consequence
199: of common programming practice (and are therefore easy to be found
200: in the software components that we build or are integrating) and
201: are described in the following Section.
202:
203: \section{Troubleshooting the WMS}
204: \label{sec-fmode}
205: \begin{figure}
206: \includegraphics[width=80mm]{failure_mode.eps}
207: \caption{A possible way to make the system throughput worse
208: by applying the genuine intent to make it better. The names
209: of the various steps are just an example and don't refer to
210: any real experience or software component.} \label{fig-fmode}
211: \end{figure}
212: One of the most common (and most frustrating, both to developers
213: and to end users) experiences in troubleshooting the WP1
214: Workload Management system on the EDG testbed has been the fact that
215: often, perceived {\it improvements} to the system (sometimes even
216: simple bug fixes) result in a {\it decrease} in the system stability,
217: or reliability (fraction of requests that complete successfully).
218: The cause is often closely related to the known fact that removing
219: a bottleneck, in any flow system, can cause an overflow downstream,
220: possibly close to the {\it next} bottleneck. The complicating factor
221: is that there are at least two characteristics that could (and possibly
222: still can) be found in many elements of our integrated workload management
223: queuing network, that can cause problems to appear even very far from the area
224: of the network where an {\it improvement} is being attempted:
225: \begin{itemize}
226: \item {\it Queues of job requested can form where they can impact on the
227: system load.}\\
228: Different techniques can be chosen or needed to pass
229: job requests around. Sometimes a socket connection is needed, sometimes
230: sequential request processing (one request at a time in the system)
231: is required for some reason, and multiple processes/threads may be used to
232: handle individual requests. Having a number of
233: tasks (processes/threads) wait for a socket queue or a sequential
234: processing slot is one way to ``queue" requests that definitely
235: generates much extra work for the process scheduler, and can cause
236: any other process served by the same scheduler to be allocated
237: less and less time. Queues that are unnecessarily
238: scanned while waiting for some other condition to allow the processing
239: of their element can also impact on the system load, especially
240: if the queue elements are associated to significant amounts of
241: allocated dynamic memory.
242: \item {\it Some system components can enforce hard timeouts and cause
243: anomalies in the job flow.}\\
244: When handling the access (typically via socket connections)
245: to various distributed services, provisions typically need to
246: be made to handle
247: all possible failure modes. ``Reasonably" long
248: timeouts are sometimes chosen to handle failures that are perceived to be
249: very unlikely by developers (failure to establish communication
250: to a local service, for instance). This kind of failures, however,
251: can easily materialise when the system resources are exhausted under
252: a stress test or load peak.
253: \end{itemize}
254: Figure \ref{fig-fmode} illustrates how these two effects can conspire to
255: frustrate a genuine effort to remove what seems the limiting bottleneck
256: in the system (the example in the Figure does nor refer to any real case
257: or component): removing the bottleneck (1) causes a request queue to build
258: up at the next station (2), and this interferes via the system load to
259: cause hard timeouts and job failures elsewhere (3). This example
260: is used to rationalise some of the unexpected reactions
261: that, in many cases, were found while working on the WP1 integrated system.
262: The experience on practical troubleshooting cases similar to this one,
263: while bringing an understanding of the difficulties inherent in building
264: distributed systems,
265: also drove us to formulate some of the principles that are presented
266: in the next section.
267:
268: \section{Principles that were learnt (and applied to improve the design)}
269: The attempts at getting a deeper understanding of the EDG-WP1 Workload
270: Management System and their failures led us to formulate a few design
271: principles and to apply them to the second major software release.
272: Here are the principles that descend from the paradigm example described in
273: Section \ref{sec-fmode}:
274: \begin{enumerate}
275: \renewcommand{\labelenumi}{\ding{228} \theenumi. }
276: \item {\bf Queues of various kinds of requests for processing should be
277: allowed to form where they have a minimal and understood impact on
278: system resources.}\\
279: Queues that get `filled' in the form of multiple threads or processes,
280: or that allocate significant amounts of system memory should be avoided, as they
281: not only adversely impact system performance, but also generate
282: inter-dependencies and complicate troubleshooting.
283: \item {\bf Limits should always be placed on dynamically allocated objects,
284: threads and/or subprocesses.}\\
285: This is a consequence of the previous point: every dynamic resource that
286: gets allocated should have a tunable system-wide limit that gets enforced.
287: \item {\bf Special care needs to be taken around the pipeline areas where
288: serial handling of requests is needed.}\\
289: The impact of any contention for system resources becomes more evident near
290: areas of the queuing system that require the acquisition of system-wide locks.
291: \end{enumerate}
292: \par
293: So far we concentrated on a specific attempt at modeling and understanding
294: the workload management system that led to an increased attention to the
295: usage of shared resources. There were other specific practical issues that
296: emerged during the deployment and troubleshooting of the system
297: and that led to the awareness of some fundamental design or
298: implementation mistake that was made. Here is a short list, where the
299: fundamental principle that should correct the fundamental mistake that was made
300: is listed:
301: \begin{enumerate}
302: \setcounter{enumi}{3}
303: \renewcommand{\labelenumi}{\ding{228} \theenumi. }
304: \item {\bf Communication among services should always be reliable:}
305: \begin{itemize}
306: \item[-] Always applying double-commit and rollback for network communications.
307: \item[-] Going through the filesystem for local communications.
308: \end{itemize}
309: In general, forms of communication that don't allow for data or messages to
310: be lost in a broken pipe lead to easier recovery from system or process
311: crashes. Where network communication is necessary, database-like techniques
312: have to be used.
313: \item {\bf Every process, object or entity related to the job lifecycle should
314: have another process, object or entity in charge of its well-being.}\\
315: Automatic fault recovery can only happen if every entity is held accountable
316: and accounted for.
317: \item {\bf Information repositories should be minimized (with a clear
318: identification of authoritative information).}\\
319: Many of the software components that were integrated in the EDG-WP1 solution
320: are stateful and include local repositories for request information, in
321: the form of local queues, state files, database back-ends. Only one
322: site with authoritative information about requests has to be identified
323: and kept.
324: \item {\bf Monolithic, long-lived processes should be avoided.}\\
325: Dynamic memory programming, using languages and techniques that
326: require explicit release of dynamically allocated objects, can lead to leaks
327: of memory, descriptors and other resources. Experimental, R\&D code
328: can take time to leak-proof, so it should possibly not be
329: linked to system components that are long-lived, as it can accelerate
330: system resource starvation. Short-lived, easy-to-recover components are a clean
331: and very practical workaround in this case.
332: \item {\bf More thought should be devoted to efficiently and correctly
333: recovering a service rather than to starting and running it.}\\
334: This is again a consequence of the previous point: the capability
335: to quickly recover from
336: failures or interruption helps in assuring that system components `can' be
337: short-lived, either by design or by accident.
338: \end{enumerate}
339:
340: \section{Conclusions}
341: EDG-WP1 has been distributing jobs over the EDG testbed in a continuous
342: fashion for one and a half years now, with a software solution where
343: existing grid technology was integrated wherever possible.
344: \par
345: The experience of understanding the direct and indirect interplay of the service
346: components could not be reduced to a simple {\it scalability} evaluation.
347: This because understanding and removing {\it bottlenecks} is significantly
348: complicated by non-linear and non-continuous effects in the system.
349: In this process, few principles that apply to the very complex practice
350: of distributed systems operations were learned the hard way (i.e. not by
351: just reading some good book on the subject).
352: EDG-WP1 tried to incorporate these principles
353: in its second major software release that
354: will shortly face deployment in the EDG testbed.
355:
356: % If you have acknowledgments, this puts in the proper section head.
357: \begin{acknowledgments}
358: DataGrid is a project funded by the European Commission under contract
359: IST-2000-25182.
360: \par
361: We also acknowledge the national funding agencies participating to
362: DataGrid for their support of this work.
363: \par
364: We wish to thank the Condor development team for many valuable discussions.
365: \end{acknowledgments}
366:
367: % Create the reference section using BibTeX:
368: %\bibliography{basename of .bib file}
369: \begin{thebibliography}{9} % Use for 1-9 references
370: %\begin{thebibliography}{99} % Use for 10-99 references
371:
372: \bibitem{wp1}
373: Home page for the Grid Workload Management workpackage of the DataGrid project\\
374: \url{http://www.infn.it/workload-grid}.
375:
376: \bibitem{DataGrid}
377: Home page for the Datagrid project\\
378: \url{http://www.eu-datagrid.org}.
379:
380: \bibitem{Condor}
381: Home page for the Condor project\\
382: \url{http://www.cs.wisc.edu/condor/}
383:
384: \bibitem{Globus}
385: Home page for the Globus project\\
386: \url{http://www.globus.org}
387:
388: \bibitem{CondorG}
389: J. Frey, T. Tannenbaum, I. Foster, M. Livny, S. Tuecke,
390: ``Condor-G: A Computation Management Agent for Multi-Institutional Grids'',
391: {\it Proceedings of the Tenth IEEE Symposium on High Performance Distributed
392: Computing (HPDC10)}, 2001
393:
394: \bibitem{WP1Chep01}
395: DataGrid WP1 members (C. Anglano {\it et al.}), ``{\bf Integrating GRID
396: tools to build a Computing Resource Broker: Activities of DataGrid WP1}"
397: Presented at the CHEP 2001 Conference, Beijing (p. 708 in the proceedings)
398:
399: \bibitem{WP1Chep03Sga}
400: DataGrid WP1 members (G. Avellino {\it et al.}), ``{\bf The EU DataGrid
401: Workload Management System: towards the second major release}"
402: Also presented at the CHEP 2003 Conference, San Diego
403:
404: \end{thebibliography}
405:
406: \end{document}
407: %
408: % ****** End of file template.aps ******
409: