1: %% bigloo scm/r0.o scm/stat.o scm/parameters.o scm/error.o -o bin/r0.bin
2:
3: \documentclass[a4paper,10pt,twoside,fleqn,twocolumn]{article}
4: \usepackage{epsfig}
5: \usepackage{amssymb, amsmath}
6: \usepackage{graphicx}
7: \usepackage{subfigure}
8: \usepackage{url}
9: \usepackage[english]{babel}
10:
11: %% \mathindent 0.2cm
12: \pagestyle{headings}
13: \pagenumbering{arabic}
14: \numberwithin{equation}{section}
15:
16: %% New commands to align figures/tables
17: \makeatletter
18: \newcommand\figcaption{\def\@captype{figure}\caption}
19: \newcommand\tabcaption{\def\@captype{table}\caption}
20: \makeatother
21:
22: %% Un abstract plus beau
23: \renewenvironment{abstract}{\begin{quote}{\noindent\bf Abstract.}}{\end{quote}}
24:
25: %% Draft version
26: %% \newcommand{\reviewtimetoday}[2]{\special{!userdict begin
27: %% /bop-hook{gsave 20 710 translate 45 rotate 0.8 setgray
28: %% /Times-Roman findfont 16 scalefont setfont 0 0 moveto (#1) show
29: %% 0 -12 moveto (#2) show grestore}def end}}
30: %% % You can turn on or off this option.
31: %% \reviewtimetoday{\today}{Draft Version}
32: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
33: \renewcommand{\thefootnote}{\fnsymbol{footnote}}
34:
35:
36: %% Pour avoir plus de marges en haut et en bas
37: %% \setlength{\oddsidemargin}{0in}
38: %% \setlength{\evensidemargin}{0in}
39: %% \setlength{\textwidth}{6.5in}
40: %% \setlength{\textheight}{9.0in}
41: %% \setlength{\topmargin}{0in}
42:
43: %% \selectlanguage{english}
44: \hyphenation{Ra-dio-the-ra-py}
45: \hyphenation{appli-ca-tions}
46: \hyphenation{bio-me-di-cal}
47: \hyphenation{INSTRUIRE}
48: \hyphenation{Healthgrid}
49: \hyphenation{EGEE}
50: \hyphenation{Emmanuel}
51: \hyphenation{Medernach}
52: \hyphenation{Vincent}
53: \hyphenation{Breton}
54: \hyphenation{Yannick}
55: \hyphenation{Pierre-Louis}
56: \hyphenation{Reichstadt}
57: \hyphenation{la-bo-ra-to-ry}
58: \hyphenation{ne-ce-ssa-ry}
59: \hyphenation{Si-mi-lar-ly}
60: \hyphenation{heavily}
61: \hyphenation{parameters}
62: \hyphenation{probability}
63: \hyphenation{classi-cal}
64:
65: \begin{document}
66:
67: \title{
68: \vspace*{-2cm}
69: {\normalsize Workload characterization and modelling
70: \hfill \today \\[1mm]}
71: \LARGE\sc\hrule height0.2pt \vskip 2mm
72: Workload analysis of a cluster \\
73: in a Grid environment
74: \vskip 2mm \hrule height0.2pt}
75:
76: \author{ Emmanuel Medernach\\[1ex]
77: \small Laboratoire de Physique Corpusculaire, CNRS-IN2P3 \\
78: \small Campus des C\'ezeaux,
79: \small 63177 Aubi\`ere Cedex, France \\
80: \thanks{This work was supported by EGEE.}
81: \normalsize \em e-mail: \tt medernac@clermont.in2p3.fr }
82: \date{}
83: \maketitle
84:
85: \begin{abstract}
86: \em With Grids, we are able to share computing resources and to
87: provide for scientific communities a global transparent access to
88: local facilities. In such an environment the problems of fair
89: resource sharing and best usage arise. In this paper, the analysis
90: of the LPC cluster usage (Clermont-Ferrand, France) in the EGEE Grid
91: environment is done, and from the results a model for job arrival is
92: proposed.
93: \end{abstract}
94:
95:
96: \section{Introduction}
97:
98: Analysis of a cluster workload is essential to understand better user behavior
99: and how resources are used~\cite{feitelson02workload}. We are interested to
100: model and simulate the usage of a Grid cluster node in order to compare
101: different scheduling policies and to find the best suited one for our needs.
102:
103: The Grid gives new ways to share resources between sites, both as computing and
104: storage resources. Grid defines a global architecture for distributed
105: scheduling and resource management~\cite{darincosts} that enable resources
106: scaling. We would like to understand better such a system so that a model can
107: be defined. With such a model, simulation may be done and a quality of service
108: and fairness could then be proposed to the different users and groups.
109:
110: Briefly, we have some groups of users that each submit jobs to a
111: group of clusters. These jobs are placed inside a waiting queue on
112: some clusters before being scheduled and then processed. Each group
113: of users have their own need and their own strategy to job
114: submittal. We wish:
115: \begin{enumerate}
116: \item to have good metrics that describes the group and user usage of the
117: site.
118: \item to model the global behavior (average job waiting time, average waiting
119: queue length, system utilization, etc.) in order to know what is the
120: influence of each parameter and to avoid site saturation.
121: \item to simulate jobs arrivals and characteristics to test and compare
122: different scheduling strategies. The goal is to maximize system
123: utilization and to provide fairness between site users to avoid job
124: starvation.
125: \end{enumerate}
126:
127: As parallel scheduling for $p$ machines is a hard
128: problem~\cite{Garey79:intractability, mertens04}, heuristics are
129: used~\cite{school-parallel, feitelson95d}. Moreover we have no exact value
130: about the duration of jobs, making the problem difficult. We need a good model
131: to be able to compare different scheduling strategies. We believe that being
132: able to characterize users and groups behavior we could better design
133: scheduling strategies that promote fairness and maintain a good throughput.
134: From this paper some metrics are revealed, from the job submittal protocol a
135: detailed arrival model for single user and group is proposed and scheduling
136: problems are discussed. We then suggest a new design based on our observation
137: and show relationship between fairness issue and system utilization as a flow
138: problem.
139:
140: Our cluster usage in the EGEE (Enabling Grids for E-science in
141: Europe) Grid is presented in section~\ref{Environment}, the Grid
142: middleware used is described. Corresponding scheduling scheme is
143: shown in section~\ref{Scheduling}. Then the workload of the LPC
144: (Laboratoire de Physique Corpusculaire) computing resource, is
145: presented (section~\ref{Workload analysis and modelling}) and the logs
146: are analyzed statistically. A model is then proposed in
147: section~\ref{Model} that describes the job arrival rate to this
148: cluster. Simulation and validation are done in
149: section~\ref{Simulation} with comparison with related works in
150: section~\ref{related}. Results are discussed in
151: section~\ref{Discussion}. Section~\ref{Conclusion} concludes this
152: paper.
153:
154: \section{Environment}
155: \label{Environment}
156:
157: %% studied
158:
159: \subsection{Local situation}
160: \label{Local situation}
161:
162: The EGEE node at LPC Clermont-Ferrand is a Linux cluster made of 140 dual 3.0
163: GHz CPUs with 1 GB of RAM and managed by 2 servers with the LCG (LHC Computing
164: Grid Project) middleware. We are currently using MAUI as our cluster
165: scheduler~\cite{jackson01, linux-usenix}. It is shared with the regional Grid
166: INSTRUIRE (\url{http://www.instruire.org}). Our LPC Cluster role in EGEE is to
167: be used mostly by Biomedical users\footnote{Our cluster represented 75\% of all
168: the Biomed Virtual Organization (VO) jobs in 2004.} located in Europe and by
169: High Energy Physics Communities. Biomedical research is one core application
170: of the EGEE project. The approach is to apply the computing methods and tools
171: developed in high energy physics for biomedical applications. Our team has
172: been involved in international research group focused on deploying biomedical
173: applications in a Grid environment.
174:
175: %% Biomedical applications currently used in the Biomed Virtual Organization (VO) are
176: %% listed on table~\ref{table:applications}.
177:
178: One pilot application is GATE which is based on the Monte Carlo
179: GEANT4~\cite{geant4} toolkit developed by the high energy physics community.
180: Radiotherapy and brachytherapy use ionizing radiations to treat cancer. Before
181: each treatment, physicians and physicists plan the treatment using analytical
182: treatment planning systems and medical images data of the tumor. By using the
183: Grid environment provided by the EGEE project, we will be able to reduce the
184: computing time of Monte Carlo simulations in order to provide a reasonable time
185: consuming tool for specific cancer treatment requiring Monte-Carlo accuracy.
186:
187: Another group is Dteam, this group is partly responsible of sending
188: tests and monitoring jobs to our site. Total CPU time used by this
189: group is small relatively to the other one, but the jobs sent are
190: important for the site monitoring. There are also groups using the
191: cluster from the LHC experiments at CERN (\url{http://www.cern.ch}).
192: There are different kind of jobs for a given group. For example,
193: Data Analysis requires a lot of I/O whereas Monte-Carlo Simulation
194: needs few I/O.
195:
196: \subsection{EGEE Grid technology}
197: \label{Grid technology}
198:
199: In Grid world, resources are controlled by their owners. For instance
200: different kind of scheduling policies could be used for each site. A
201: Grid resource center provides to the Grid computing and/or storage
202: resources and also services that allow jobs to be submitted by guests
203: users, security services, monitoring tools, storage facility and
204: software management. The main issue of submitting a job to a remote
205: site is to provide some warranty of security and correct
206: execution. In fact the middleware automatically resubmits job when
207: there is a problem with one site. Security and authentication are
208: also provided as Grid services.
209:
210: The Grid principle is to allow user a worldwide transparent access to
211: computing and storage resources. In the case of EGEE, this access is
212: aimed to be transparent by using LCG middleware built on top of the
213: Globus Toolkit~\cite{foster97globus}. Middleware acts as a layer of
214: software that provides homogeneous access to different Grid resource
215: centers.
216:
217: \subsection{LCG Middleware}
218: \label{LCG Middleware}
219:
220: LCG is organized into Virtual Organizations (VOs): dynamic
221: collections of individuals and institutions sharing resources in a
222: flexible, secure and coordinated manner. Resource sharing is
223: facilitated and controlled by a set of services that allow resources
224: to be discovered, accessed, allocated, monitored and accounted for,
225: regardless of their physical location. Since these services provide a
226: layer between physical resources and applications, they are often
227: referred to as Grid Middleware~\cite{glite-arch}.
228:
229: Bag of task applications are parallel applications composed of
230: independent jobs. No communications are required between running
231: jobs. Since jobs from a same task may execute on different sites
232: communications between jobs are avoided. In this context, users
233: submit their jobs to the Grid one by one through the middleware. Our
234: cluster receives jobs only from the Grid. This means that each job
235: requests for one and only one processor. Users could directly
236: specify the execution site or let a Grid service choose the best
237: destination for them. Users give only a rough estimation of the
238: maximum job running time. In general this estimated time is
239: overestimated and very imprecise~\cite{zotkin99joblength}. Instead
240: of speaking about an estimated time, it could be better to speak
241: about an upper bound for job duration, so this value provided by
242: users is more a precision value. The bigger the value is the more
243: imprecise the value of the actual runtime could be.
244:
245: Figure~\ref{fig:jobflow} shows the scenario of a job submittal. In
246: this figure rounded boxes are grid services and ellipses are the
247: different jobs states. As there is no communications between jobs,
248: jobs could run independently on multiple clusters. Instead of
249: communicating between job execution, jobs write output files to
250: Storage Elements (SE) of the Grid. Small output files could also be
251: sent to the UI. Replica Location Service (RLS) is a Grid service
252: that allow location of replicated data. Other jobs may read and work
253: on the data generated, forming ``pipelines'' of jobs.
254:
255: The users Grid entry point is called an User Interface (UI). This is the
256: gateway to Grid services. From this machine, users are given the capability to
257: submit jobs to a Computing Element and to follow their jobs
258: status~\cite{lcguserguide}. A Computing Element (CE) is composed of Grid batch
259: queues. A Computing Element is built on a homogeneous farm of computing nodes
260: called Worker Nodes (WN) and on a node called a GateKeeper acting as a security
261: front-end to the rest of the Grid.
262:
263: %% Ok
264: \begin{figure*}[ht]
265: \centering
266: \includegraphics[height=12.3cm]{wms1.eps}
267: \caption{Job submittal scenario}
268: \label{fig:jobflow}
269: \end{figure*}
270:
271:
272: Users can query the Information System in order to know both the state of different grid nodes and
273: where their jobs are able to run depending on job requirements. This match-making process has been
274: packaged as a Grid service known as the Resource Broker (RB). Users could either submit their jobs
275: directly to different sites or to a central Resource Broker which then dispatches their jobs to
276: matching sites.
277:
278: The services of the Workload Management System (WMS) are responsible for the
279: acceptance of job submits and the dispatching of these jobs to the appropriate
280: CEs, depending on job requirements and on available resources. The Resource
281: Broker is the machine where the WMS services run, there is at least one RB for
282: each VO. The duty of the RB is to find the best resource matching the
283: requirements of a job (match-making process). (For more details
284: see~\cite{dataGrid-wms})
285:
286:
287: Users are then mapped to a local account on the chosen executing CE.
288: When a CE receives a job, it enqueues it inside an appropriate batch
289: queue, chosen depending on the job requirements, for instance
290: depending on the maximum running time. A scheduler then proceeds all
291: these queues to decide the execution of jobs. Users could question
292: about status of their jobs during all the job lifetime.
293:
294: \section{Scheduling scheme}
295: \label{Scheduling}
296:
297: The goal of the scheduler is first to enable execution of jobs, to
298: maximize job throughput and to maintain a good equilibrium between
299: users in their usage of the cluster~\cite{feitelson96c}. At the same
300: time scheduler has to avoid starvation, that is jobs, users or groups
301: that access scarcely to available cluster resources compared to
302: others.
303:
304: Scheduling is done on-line, i.e the scheduler has no knowledge about
305: all the job input requests but jobs are submitted to the cluster at
306: arbitrary time. No preemption is done, the cluster uses a
307: space-sharing mode for jobs. In a Grid environment long-time running
308: jobs are common. The worst case is when the cluster is full of jobs
309: running for days and at the same time receiving jobs blocked in the
310: waiting queue.
311:
312: Short jobs like monitoring jobs barely delay too much longer jobs.
313: For example, a 1 day job could wait 15 minutes before starting, but
314: it is unwise if a 5 minutes job has to wait the same 15 minutes.
315: This results in production of algorithms classes that encourage the
316: start of short jobs over longer jobs. (Short jobs have higher
317: priority~\cite{chiang02}) Some other solution proposed is to split
318: the cluster in static sub-clusters but this is not compatible with a
319: sharing vision like Grids. Ideal on-line scheduler will maximize
320: cluster usage and fairness between groups and users. Of course a
321: good trade-off has to be found between the two.
322:
323: \subsection{Local situation}
324:
325: We are using two servers to manage our 140 CPUs, on each machine
326: there are 5 queues where each group could send their jobs to. Each
327: queue has its own limit in maximum CPU Time. A job in a given queue
328: is killed if it exceeds its queue time limit. There are in fact two
329: limits, one is the maximum CPU time, the other one is the maximum
330: total time (or Wall time) a job could use. For each queue there is
331: also a limit in the number of jobs than can run at a given time.
332: This is done in order to avoid that the cluster is full with long
333: running jobs and short jobs cannot run before days. Likely there is
334: the same limit in number of running jobs for a given group.
335:
336: \begin{table}[h]
337: \begin{center}
338: \begin{tabular}{|l|c|c|c|}
339: \hline
340: Queue & Max CPU & Max Wall & Max Jobs \\
341: & (H:M) & (H:M) & \\
342: \hline
343: Test & 00:05 & 00:15 & 130 \\
344: Short & 00:20 & 01:30 & 130 \\
345: Long & 08:00 & 24:00 & 130 \\
346: Day & 24:00 & 36:00 & 130 \\
347: Infinite & 48:00 & 72:00 & 130 \\
348: \hline
349: \end{tabular}
350: \end{center}
351: \caption{Queue configuration (maximum CPU time, Wall time and
352: running jobs)}
353: \end{table}
354:
355: Maui Scheduler and the Portable Batch System (PBS) run on multiple
356: hardware and operating systems. MAUI is a scheduling policy engine
357: that is used together with the PBS batch system. PBS manages the job
358: reception in queues and execution on cluster nodes. MAUI is a
359: First-Come-First-Served backfill scheduler with priorities. This
360: means that is checks periodically the running queues, execution of
361: lower priority jobs is allowed if it is determined that their running
362: will not delay jobs higher in the queue~\cite{linux-usenix}. Maui is
363: unfortunately not event driven, it polls regularly the PBS queues to
364: decide which jobs to run. MAUI allows to add a priority property for
365: each queue. Our site configuration is that the shorter the queue
366: allows jobs to run, the more priority is given to that job. Jobs are
367: then selected to run depending on a priority based on the job
368: attributes such as owner, group, queue, waiting time, etc.
369:
370: %% If a job violates a site policy it is placed temporary in a blocked
371: %% state and not considered for scheduling.
372:
373: \section{Workload data analysis}
374: \label{Workload analysis and modelling}
375:
376: %% - Ce qu'on observe -> modele qui approche le comportement
377:
378: Workload analysis allows to obtain a model of the user
379: behavior~\cite{calzarossa93workload}. Such a model is essential for
380: understanding how the different parameters change the resource center usage.
381: Meta-computing workload~\cite{chapin99b} like Grid environments is composed of
382: different site workloads. We are interested in modelling workload of our site
383: which is part of the EGEE computational Grid. Our site receives only jobs
384: coming from the EGEE Grid and each requests for only one CPU.
385:
386: Traces of users activities are obtained from accountings on the server logs. Logs contain
387: information about users, resources used, jobs arrival time and jobs completion time. It
388: is possible to use directly these traces to obtain a static simulation or to use a dynamic
389: model instead. Workload models are more flexible than logs, because they allow to
390: generate traces with different parameters and better understand workload
391: properties~\cite{feitelson02workload}. Workload analysis allows to obtain a model of
392: users activity. Such a model is essential for understanding how the different parameters
393: change the resource center usage. Our workload data has been converted to the Standard
394: Workload Format (\url{http://www.cs.huji.ac.il/labs/parallel/workload/}) and made publicly
395: available for further researches.
396:
397: \begin{figure*}[hp]
398: \begin{center}
399: \subfigure[Number of Biomed jobs received per weeks (from August 2004 to May 2005)]{\label{fig:stats:jobWeekBiomed}\includegraphics[totalheight=5.2cm,clip]{NumberJobsPerWeekBiomed.eps}}
400: \subfigure[Number of Dteam jobs received per weeks (from August 2004 to May 2005)]{\label{fig:stats:jobWeekDteam}\includegraphics[totalheight=5.2cm,clip]{NumberJobsPerWeekDteam.eps}}
401: \caption{Number of jobs received per VO and per week from August
402: 2004 to May 2005}
403: \label{fig:stats}
404: \end{center}
405: \end{figure*}
406: \begin{figure*}[hp]
407: \begin{center}
408: \subfigure[System utilization per weeks (from August 2004 to May 2005)]{\label{fig:stats:utilization}\includegraphics[totalheight=5.2cm,clip]{NumberAllCPUPerWeek.eps}}
409: \subfigure[CPU consumed by Biomed and Dteam jobs per weeks (from August 2004 to May 2005)]{\label{fig:stats:cpuBiomed}\includegraphics[totalheight=5.2cm,clip]{CPU.eps}} \\
410: %% \subfigure[CPU consumed by Dteam jobs per weeks (from August 2004 to May 2005)]{\label{fig:stats:cpuDteam}\includegraphics[totalheight=4.6cm,clip]{NumberDteamCPUPerWeek.eps}}
411: %% \subfigure[CPU consumed by Atlas jobs per weeks (from August 2004 to May 2005)]{\label{fig:stats:cpuAtlas}\includegraphics[totalheight=4.6cm,clip]{NumberAtlasCPUPerWeek.eps}} \\
412: %% \subfigure[CPU consumed by LHCb jobs per weeks (from August 2004 to May 2005)]{\label{fig:stats:cpuLHCb}\includegraphics[totalheight=4.6cm,clip]{NumberLHCbCPUPerWeek.eps}} \\
413: \caption{Cluster utilization as CPU consumed per VO and per week from August 2004 to May 2005}
414: \label{fig:statscpu}
415: \end{center}
416: \end{figure*}
417:
418: Workload is from August 1st 2004 to May 15th 2005. We have a cluster
419: containing 140 CPUs since September 15th. This can be visible in the
420: figure~\ref{fig:stats}, \ref{fig:stats:utilization} and
421: \ref{fig:stats:cpuBiomed}, where we notice that the number of jobs
422: sent increases. Statistics are obtained from the PBS log files. PBS
423: log files are well structured for data analysis. An AWK script is
424: used to extract information from PBS log files. AWK acts on lines
425: matched by regular expressions. We do not have information about
426: users \emph{Login} time because users send jobs to our cluster from an
427: User Interface (UI) of the EGEE Grid and not directly.
428:
429: \subsection{Running time}
430:
431: \begin{figure*}[ht]
432: \begin{center}
433: \subfigure[Dteam job runtime]{\label{figure:dteam:jobduration}\includegraphics[totalheight=5.2cm,clip]{dteam.duration.eps}}
434: \subfigure[Biomed job runtime]{\label{figure:biomed:jobduration}\includegraphics[totalheight=5.2cm,clip]{biomed.duration.eps}}
435: \caption{Dteam and Biomed job runtime distributions (logscale on time axis)}
436: \label{fig:group:runtime}
437: \end{center}
438: \end{figure*}
439:
440: \begin{table}[h]
441: \begin{center}
442: \begin{tabular}{|l|c|c|c|}
443: \hline
444: Group & Mean & Standard & Number \\
445: & & Deviation & of jobs \\
446: \hline
447: Biomed & 5417 & 22942.2 & 108197 \\
448: Dteam & 222 & 3673.6 & 94474 \\
449: LHCb & 2072 & 7783.4 & 9709 \\
450: Atlas & 13071 & 28788.8 & 7979 \\
451: Dzero & 213 & 393.9 & 1332 \\
452: \hline
453: \end{tabular}
454: \end{center}
455: \caption{Group running time in seconds and total number of jobs
456: submitted}
457: \label{table:group:runtime}
458: \end{table}
459:
460: During 280 days, our site received 230474 jobs from which 94474 Dteam jobs and 108197 Biomed jobs
461: (table~\ref{table:group:runtime}). For all these jobs there are 23208 jobs that failed and were
462: dequeued. It appears that jobs are submitted irregularly and by bursts, that is lot of jobs
463: submitted in a short period of time followed by a period of relative inactivity. From the logs,
464: there are not much differences between CPU time and total time, so it means that jobs sent to our
465: cluster are really CPU intensive jobs and not I/O intensive. Dteam jobs are mainly short monitoring
466: jobs but all Dteam jobs are not regularly sent jobs. We have 6784.6 days CPU time consumed by
467: Biomed for 108197 jobs (Mean of one hour and half per jobs, table~\ref{table:group:runtime}).
468: Repartition of cumulative job duration distributions for Biomed VO is shown on
469: figure~\ref{fig:group:runtime}. The duration of about 70\% of Biomed jobs are less than 15 minutes
470: and 50\% under 10 seconds, there are a dominant number of small running jobs but the distribution is
471: very wide as shown by the high standard deviation compared to the mean in
472: table~\ref{table:group:runtime}.
473:
474: \begin{table}[h]
475: \begin{center}
476: \begin{tabular}{|l|c|c|c|}
477: \hline
478: Queue & Mean & Standard & CV \\
479: & & Deviation & \\
480: \hline
481: Test & 31.0 & 373.6 & 12.0 \\
482: Short & 149.5 & 1230.5 & 8.2 \\
483: Long & 2943.2 & 11881.2 & 4.0 \\
484: Day & 6634.8 & 25489.2 & 3.8 \\
485: Infinite & 10062.2 & 30824.5 & 3.0 \\
486: \hline
487: \end{tabular}
488: \end{center}
489: \caption{Queue mean running time in seconds, corresponding
490: Standard Deviation and Coefficient of Variation}
491: \label{table:queue:runtime}
492: \end{table}
493:
494: Users submit their jobs with an estimated run length. For
495: relationships between execution time and requested job duration and
496: its accuracy see~\cite{cirnecompr-model}. To sum up estimated jobs
497: duration are essentially inaccurate. It is in fact an upper bound for
498: job duration which could in reality take any value below it.
499: Table~\ref{table:queue:runtime} shows for each queue the mean running
500: time, its standard deviation and coefficient of variation (CV) which
501: is the ratio between standard deviation and the mean. CV decreases as
502: the queue maximum runtime increase. This means that jobs in shorter
503: queues vary a lot in their duration compared to longer jobs and we can
504: expect that more the upper bound given is high the more confidence in
505: using the queue mean runtime as a an estimation we could have.
506:
507: A commonly used method for modelling duration distribution is to use
508: log-uniform distribution. Figures~\ref{figure:dteam:jobduration}
509: and~\ref{figure:biomed:jobduration} show the fraction of Dteam and
510: Biomed jobs with duration less than some value. Job duration has been
511: modelled with a multi-stage log-uniform model
512: in~\cite{downey99elusive} which is piecewise linear in log space. In
513: this case Dteam and Biomed job duration could be approximated
514: respectively with a 3 and a 6 stages log-uniform distribution.
515:
516: \subsection{Waiting time}
517:
518: \begin{table}[h]
519: \begin{center}
520: \begin{tabular}{|l|c|c|c|c|}
521: \hline
522: Group & Mean & Stretch & Standard & CV \\
523: & & & Deviation & \\
524: \hline
525: Biomed & 781.5 & 0.874 & 16398.8 & 20.9 \\
526: Dteam & 1424.1 & 0.135 & 26104.5 & 18.3 \\
527: LHCb & 217.7 & 0.905 & 2000.7 & 9.1 \\
528: Atlas & 2332.8 & 0.848 & 13052.1 & 5.5 \\
529: Dzero & 90.7 & 0.701 & 546.3 & 6.0 \\
530: \hline
531: \end{tabular}
532: \end{center}
533: \caption{Group mean waiting time in seconds, corresponding Standard
534: Deviation and Coefficient of Variation}
535: \label{table:waiting}
536: \end{table}
537:
538: \begin{table}[h]
539: \begin{center}
540: \begin{tabular}{|l|c|c|c|c|}
541: \hline
542: Queue & Mean & Standard & CV & Number \\
543: & & Deviation & & of jobs \\
544: \hline
545: Test & 33335.9 & 148326.4 & 4.4 & 45760 \\
546: Short & 1249.7 & 27621.8 & 22.1 & 81963 \\
547: Long & 535.1 & 5338.8 & 9.9 & 32879 \\
548: Day & 466.8 & 8170.7 & 17.5 & 19275 \\
549: Infinite & 1753.9 & 24439.8 & 13.9 & 49060 \\
550: \hline
551: \end{tabular}
552: \end{center}
553: \caption{Queue mean waiting time in seconds, corresponding Standard
554: Deviation, Coefficient of Variation and number of jobs}
555: \label{fig:queuemean}
556: \end{table}
557:
558: Table~\ref{table:waiting} shows that jobs coming from the Dteam group
559: are the more unfairly treaten. Dteam group sends short jobs very
560: often, Dteam jobs are then all placed in queue waiting that long jobs
561: from other groups finished. Dzero group sends short jobs more rarely
562: and is also less penalized than Dteam because there are less Dzero
563: jobs that are waiting together in queue before being treated. The
564: best treated group is LHCb with not very long running jobs (average of
565: about 34 minutes) and one job about every 41 minutes. The best
566: behavior to reduce waiting time per jobs seems to send jobs that are
567: not too short compared to the waiting factor, and send not too very
568: often in order to avoid that they all wait together inside a queue.
569: Very long jobs is not a good behavior too as the scheduler delay them
570: to run shorter one if possible.
571:
572: Table~\ref{fig:queuemean} shows the mean waiting time per jobs on a
573: given queue. There is a problem with such a metric, for example:
574: Consider one job arriving on a cluster with only one free CPU, it will
575: run on it during a time $T$ with no waiting time. Consider now that
576: this job is splitted in $N$ shorter jobs (numbered $0 \ldots N-1$)
577: with equal total duration $T$. Then the waiting time for the job
578: number $i$ will be $iT/N$, and the total waiting time $(N-1)T/2$. So
579: the more a job is splitted the more it will wait in total. Another
580: metric that does not depend on the number of jobs is the total waiting
581: time divided by the number of jobs and by the total job duration. Let
582: note $\widehat{WT}$ this normalized waiting time, We obtain:
583: \begin{align}
584: \widehat{WT} & = \frac{TotalWaitingTime}{NJobs * TotalDuration} \nonumber\\
585: \widehat{WT} & = \frac{MeanWaitingTime}{NJobs * MeanDuration}
586: \end{align}
587:
588: \begin{table}[h]
589: \begin{center}
590: \begin{tabular}{|l|c||l|c|}
591: \hline
592: Queue & $\widehat{WT}$ & Group & $\widehat{WT}$ \\
593: \hline
594: Test & 2.35e-2 & Biomed & 1.58e-5 \\
595: Short & 1.02e-4 & Dteam & 6.79e-5 \\
596: Long & 5.53e-6 & LHCb & 1.08e-5 \\
597: Day & 3.65e-6 & Atlas & 2.23e-5 \\
598: Infinite & 3.55e-6 & Dzero & 31.9e-5 \\
599: \hline
600: \end{tabular}
601: \end{center}
602: \caption{Queue and Group normalized waiting time}
603: \end{table}
604: With this metric, the Test queue is still the most unfairly treated and the Infinite
605: queue has the more benefits compared to the other queues. Dteam group is again bad
606: treated because their jobs are mainly sent to the Test queue. The more unfairly treated
607: group is Dzero.
608:
609: %% Todo
610: %% In fact, sending short jobs rarely is the worst behavior for general
611: %% waiting time as it is unlikely that two jobs will wait together in a
612: %% queue.
613:
614: \subsection{Arrival time}
615:
616: \begin{figure}[h]
617: \centering \includegraphics[totalheight=5.2cm]{arrivalperday.eps}
618: \caption{Job arrival daily cycle}
619: \label{fig:arrivalhours}
620: \end{figure}
621:
622: \begin{table}[h]
623: \begin{center}
624: \begin{tabular}{|l|c|c|c|}
625: \hline
626: Group & Mean & Standard & CV\\
627: & (seconds) & Deviation & \\
628: \hline
629: Biomed & 223.6 & 5194.5 & 23.22 \\
630: Dteam & 256.2 & 2385.4 & 9.31 \\
631: LHCb & 2474.6 & 39460.5 & 15.94 \\
632: Atlas & 2824.1 & 60789.4 & 21.52 \\
633: Dzero & 5018.7 & 50996.6 & 10.16 \\
634: \hline
635: \end{tabular}
636: \end{center}
637: \caption{Group interarrival time in seconds, corresponding
638: Standard Deviation and Coefficient of Variation}
639: \label{table:interarrival}
640: \end{table}
641:
642: Job arrival daily cycle is presented in figure~\ref{fig:arrivalhours}.
643: This figure shows the number of arrival depending on job arrival
644: hours, with a 10 minutes sampling. Clearly users prefer to send their
645: jobs at o'clock. In fact we receive regular monitoring jobs from the
646: VO Dteam. The monitoring jobs are submitted every hour from
647: \url{goc.grid-support.ac.uk}. Users are located in all Europe, so the
648: effect of sending at working hours is summed over all users timezones.
649: However the shape is similar compared to other daily cycle, during
650: night (before 8am) less jobs are submitted and there is an activity
651: peak around midday, 2pm and 4pm.
652:
653: Table~\ref{table:interarrival} shows the moments of interarrival time
654: for each group. CV is much higher than $1$, this means that arrivals
655: are not Poisson processes and are very irregularly distributed. For
656: instance we could receive $10$ jobs in $10$ minutes followed by
657: nothing during the $50$ next minutes. In this case we have a mean
658: interarrival time of $6$ minutes but in fact when jobs arrived they
659: arrived every minutes.
660:
661: Figure~\ref{fig:stats:utilization} shows the system utilization of our
662: cluster during each week. There are a maximum of 980 CPU days
663: consumed each week for 140 CPUs. We have a highly varying cluster
664: activity.
665:
666: \subsection{Frequency analysis}
667:
668: \begin{figure*}[ht]
669: \begin{center}
670: \subfigure[Biomed job arrival rate each 5 minutes]{\label{table:stats:rateBiomed}\includegraphics[totalheight=5.2cm,clip]{freqBiomedlog.eps}}
671: \subfigure[Dteam job arrival rate each 5 minutes]{\label{table:stats:rateDteam}\includegraphics[totalheight=5.2cm,clip]{freqDteamlog.eps}}
672: \caption{Arrival frequencies for Biomed and Dteam VOs (Proportion
673: of occurrences of $n$ jobs received during an interval of 5
674: minutes)}
675: \label{fig:stats:more}
676: \end{center}
677: \end{figure*}
678:
679: Job arrival rate is a common measurement for a site usage in queuing theory.
680: Figures~\ref{table:stats:rateBiomed} and~\ref{table:stats:rateDteam} present the job
681: arrival rate distribution. It is the number of time $n$ jobs are submitted during
682: interval length of 5 minutes. They show that most of the time the cluster does not
683: receive jobs but jobs arrived grouped. Users actually submit groups of jobs and not
684: stand-alone jobs. It explains the shape of the arrival rate: it fastly decreases but too
685: slowly compared to a Poisson distribution. Poisson distribution is usually used for
686: modelling the arrival process but evidences are against that fact~\cite{feitelson95e}.
687:
688: Dteam monitoring jobs are short and regular jobs, there is no need of a special
689: arrival model for such jobs. What we observe for other kind of jobs is that
690: the job arrival law is not a Poisson Law (see table~\ref{table:interarrival}
691: where $CV \gg 1$) as for instance a web site traffic~\cite{paxson95wide}. What
692: really happens is that users come using the cluster from an User Interface
693: during some time interval. During this time they send jobs to the cluster.
694: Users log to an User Interface machine in order to send their jobs to a RB that
695: dispatch them to some CEs. Note that one can send jobs to our cluster only
696: from an User Interface, it means for instance that jobs running on a cluster
697: cannot send secondary jobs. On a computing site we do not have this user login
698: information, but only job arrival.
699:
700: First we look at modelling user arrival and submission behavior.
701: Secondly we show that the model proposed shows good results for a
702: group behavior.
703:
704: \section{Model}
705: \label{Model}
706:
707: \subsection{Login model}
708: \label{Login model}
709:
710: In this section we begin to model user \emph{Login}/\emph{Logout} behavior from
711: the Grid job flow (figure~\ref{fig:jobflow}). We neglect the case where an
712: user has multiple login on different UI at the same time. We mean that a user
713: is in the state \emph{Login} if he is in the state of sending jobs from an UI
714: to our cluster, else he is in the state \emph{Logout}.
715:
716: \begin{figure}[h]
717: \centering
718: \includegraphics[height=5.2cm]{modele3.eps}
719: \caption{\emph{Login/Logout} cycle}
720: \label{fig:login}
721: \end{figure}
722:
723: Markov chains are like automatons with for each state a probability of
724: transition. One property of Markov chains is that future states depend only on
725: the current state and not on the past history. So a Markov state must contains
726: all the information needed for future states. We decided to model the
727: \emph{Login/Logout} behavior as a continuous Markov chain. During each $dt$, a
728: \emph{Logout} user has a probability during $dt$ of $\lambda dt$ to login and a
729: \emph{Login} user has a probability during $dt$ of $\delta dt$ to logout (see
730: figure~\ref{fig:login}). $\lambda$ is called the \emph{Login} rate and $\delta$
731: is called the \emph{Logout} rate.
732:
733: All these parameters could vary over time as we see with the variation
734: of the week job arrival (figure~\ref{fig:stats:utilization}) or during
735: day time (figure~\ref{fig:arrivalhours}) The model proposed could be
736: used more accurately with non-constant parameters at the expense of
737: more calculation and more difficult fitting. For example, one could
738: numerically use Fourier series for the \emph{Login} rate or for the
739: submittal rate to model this daily cycle. We use now constant
740: parameters for calculation, looking for general properties.
741:
742: We would like to have the probabilities during time that the user is
743: logged or not logged. Let $\mathcal{P}_{Login}(t)$ and
744: $\mathcal{P}_{Logout}(t)$ be respectively probability that the user is
745: logged or not logged at time t. We have from the modelling:
746: \begin{eqnarray}
747: \lefteqn{\mathcal{P}_{Logout}(t+dt) = (1 - \lambda dt) \mathcal{P}_{Logout}(t)}\nonumber \\
748: & & \mbox{} + \delta dt \mathcal{P}_{Login}(t) \\
749: \lefteqn{\mathcal{P}_{Login}(t+dt) = (1 - \delta dt) \mathcal{P}_{Login}(t)}\nonumber \\
750: & & \mbox{} + \lambda dt \mathcal{P}_{Logout}(t)
751: \end{eqnarray}
752: At equilibrium we have no variation so
753: \begin{eqnarray}
754: \mathcal{P}_{Logout}(t+dt) = \mathcal{P}_{Logout}(t) = \mathcal{P}_{Logout} \\
755: \mathcal{P}_{Login}(t+dt) = \mathcal{P}_{Login}(t) = \mathcal{P}_{Login}
756: \end{eqnarray}
757: We obtain:
758: \begin{align}
759: \mathcal{P}_{Logout} &= \frac{\delta}{\lambda + \delta} \\
760: \mathcal{P}_{Login} &= \frac{\lambda}{\lambda + \delta}
761: \end{align}
762:
763: \subsection{Job submittal model}
764: \label{Job submittal model}
765:
766: During period when users are logged they could submit jobs. We model the job
767: submittal rate for one user as follows: During $dt$ when the user is logged he
768: has a probability of $\mu dt$ to submit a job. With $\delta=0$ we have a
769: delayed Poisson process, with $\mu=0$ no jobs are submitted. The full model is
770: shown at figure~\ref{fig:markov}, it shows all the possible outcomes with
771: corresponding probabilities from one of the possible state to the next after a
772: small period $dt$. Numbers inside circles are the number of jobs submitted from
773: the start. \emph{Login} states are below and \emph{Logout} states are at the
774: top. We have:
775:
776:
777: \begin{figure*}[ht]
778: \centering
779: \includegraphics[height=8.6cm]{modele5.eps}
780: \caption{Markov modelling of jobs submittal}
781: \label{fig:markov}
782: \end{figure*}
783:
784: \begin{itemize}
785: \item $\mathcal{P}_n(t)$ is the probability to be in the state
786: ``\emph{User is not logged at time t and n jobs have been submitted
787: between time 0 and t}.''
788: \item $\mathcal{Q}_n(t)$ is the probability to be in the state
789: ``\emph{User is logged at time t and n jobs have been submitted
790: between time 0 and t}.''
791: \item $\mathcal{R}_n(t)$ is the probability to be in the state
792: ``\emph{n jobs have been submitted between time 0 and t}.'' We have
793: $\mathcal{R}_n = \mathcal{P}_n + \mathcal{Q}_n$.
794: \end{itemize}
795:
796: From the model, we obtain with the same method as before this
797: recursive differential equation:
798: \begin{eqnarray}
799: \mathcal{M} & = &
800: \begin{pmatrix}
801: - \lambda & \delta \\
802: \lambda & -(\mu + \delta)
803: \end{pmatrix} \\
804: \begin{pmatrix}
805: \mathcal{P}_0 \\
806: \mathcal{Q}_0
807: \end{pmatrix} ' & = &
808: \mathcal{M}
809: \begin{pmatrix}
810: \mathcal{P}_0 \\
811: \mathcal{Q}_0
812: \end{pmatrix} \\
813: \begin{pmatrix}
814: \mathcal{P}_n \\
815: \mathcal{Q}_n
816: \end{pmatrix} ' & = &
817: \mathcal{M}
818: \begin{pmatrix}
819: \mathcal{P}_n \\
820: \mathcal{Q}_n
821: \end{pmatrix} +
822: \begin{pmatrix}
823: 0 \\
824: \mu \mathcal{Q}_{n-1}
825: \end{pmatrix}
826: \label{recursive}
827: \end{eqnarray}
828:
829: This results to the following recursive equation (in case the
830: parameters are constants, $\mathcal{M}$ is a constant)
831: \begin{eqnarray}
832: \begin{pmatrix}
833: \mathcal{P}_n \\
834: \mathcal{Q}_n
835: \end{pmatrix} & = &
836: e^{\mathcal{M}t}
837: \int{
838: e^{-\mathcal{M}x}
839: \begin{pmatrix}
840: 0 \\
841: \mu \mathcal{Q}_{n-1}
842: \end{pmatrix}
843: dx}
844: \end{eqnarray}
845:
846: We take a look at the probability of having no job arrival during an interval of
847: time $t$ which is $\mathcal{P}_0$ and $\mathcal{Q}_0$. $\mathcal{R}_0$ is the
848: the probability that no jobs have been submitted between arbitrary time 0 and t.
849: So from the above model, we have:
850: \begin{equation}\label{eqmatrix2}
851: \begin{pmatrix}
852: \mathcal{P}_0 \\
853: \mathcal{Q}_0
854: \end{pmatrix} ' =
855: \begin{pmatrix}
856: - \lambda & \delta \\
857: \lambda & -(\mu + \delta)
858: \end{pmatrix}
859: \begin{pmatrix}
860: \mathcal{P}_0 \\
861: \mathcal{Q}_0
862: \end{pmatrix}
863: \end{equation}
864:
865: At arbitrary time we could be in the state \emph{Login} with
866: probability $\lambda / (\lambda + \delta)$ and in the state
867: \emph{Logout} with the probability $\delta / (\lambda + \delta)$. We
868: have from the results above:
869: \begin{align}
870: \begin{pmatrix}
871: \mathcal{P}_0(0) \\
872: \mathcal{Q}_0(0)
873: \end{pmatrix} &=
874: \begin{pmatrix}
875: \mathcal{P}_{Logout} \\
876: \mathcal{P}_{Login}
877: \end{pmatrix} =
878: \frac{1}{\lambda + \delta}
879: \begin{pmatrix}
880: \delta \\
881: \lambda
882: \end{pmatrix}
883: \\
884: \mathcal{R}_0 &= \mathcal{P}_0 + \mathcal{Q}_0
885: \end{align}
886:
887: Finally we obtain the result.
888: \begin{equation}
889: \begin{split}
890: \mathcal{R}_0(t) &= m_0 \frac{e^{- m_1t} - e^{- m_2t}}{m_1
891: - m_2} + \frac{m_1e^{- m_2t} - m_2e^{- m_1t}}{m_1 - m_2}
892: \end{split}
893: \end{equation}
894:
895: Where \begin{align}
896: m_0 &= \frac{\lambda \mu}{\lambda + \delta} = \mu \mathcal{P}_{Login} \\
897: m_1 + m_2 &= \lambda + \delta + \mu \\
898: m_1 m_2 &= \lambda \mu \\
899: \end{align}
900:
901: With $\lambda=0$ or $\mu=0$, we obtain that no jobs are submitted $(\mathcal{R}_0(t)=1)$.
902: With $\delta=0$, this is a Poisson process and $\mathcal{R}_0(t)=e^{-\mu t}$. Note that
903: during a period of $t$ there are in average $\mu \mathcal{P}_{Login} t$ jobs submitted, we
904: have also for small period $t$,
905: \begin{equation}
906: \mathcal{R}_0(t) \approx 1 - \mu\mathcal{P}_{Login}t
907: \end{equation}
908: We have also
909: \begin{align}
910: \mathcal{R}'_0(0) & = - \mu\mathcal{P}_{Login} \\
911: \mathcal{R}'_0(0) & = - \frac{\emph{Number of jobs submitted}}{\emph{Total duration}}
912: \end{align}
913:
914: $\mathcal{R}_0(t)$ could be estimated by splitting the arrival processes in
915: intervals of duration $t$ and estimating the ratio of intervals with no
916: arrival. The error of this estimation is linear with $t$. Another issue is that
917: the logs precision is not below one second.
918:
919: \subsection{Model characteristics}
920:
921: We have also these interesting properties:
922: \begin{eqnarray}
923: \mathcal{R}'_0(0) & = & - \mu \mathcal{P}_{Login} \\
924: \frac{\mathcal{R}''_0(0)}{\mathcal{R}'_0(0)} & = & - \mu \\
925: \frac{\mathcal{R}'_0(0)^{2}}{\mathcal{R}''_0(0)} & = & \mathcal{P}_{Login} \\
926: \frac{\mathcal{R}'''_0(0)}{\mathcal{R}''_0(0)} & = & - (\mu + \delta)
927: \end{eqnarray}
928:
929: Probability distribution of the duration between two jobs arrival is called an
930: interarrival process. Interarrival process is a common metric in queuing theory. We have
931: $\mathcal{A}(t)=\mathcal{P}_0(t)+\mathcal{Q}_0(t)$ with the initial condition that user
932: just submits a job. This implies that user is logged.
933: \begin{equation}
934: \mathcal{P}_0(0) = 0.0, \quad \mathcal{Q}_0(0) = 1.0 \nonumber
935: \end{equation}
936: \begin{equation}
937: \begin{split}
938: \mathcal{A}(t) &= \mu \frac{e^{- m_1t} - e^{- m_2t}}{m_1 -
939: m_2} + \frac{m_1e^{- m_2t} - m_2e^{- m_1t}}{m_1 - m_2}
940: \end{split}
941: \end{equation}
942: \begin{equation}
943: p = \frac{\mu - m_2}{m_1 - m_2}
944: \end{equation}
945: \begin{equation}
946: \mathcal{A}(t) = p e^{- m_1t} + (1 - p) e^{- m_2t}
947: \end{equation}
948: We have $\mu \in[m1; m2]$ because
949: \begin{align}
950: (\mu - m_1)(\mu - m_2) & = \mu^{2} - (\lambda + \delta + \mu)\mu + \delta \mu \nonumber\\
951: (\mu - m_1)(\mu - m_2) & = -\delta \mu < 0
952: \end{align}
953: So $p \in [0; 1]$, and we have an hyper-exponential interarrival law
954: of order 2 with parameters $p=(\mu - m_2)/(m_1 - m_2), m_1, m_2$.
955: This result is coherent with other experimental fitting
956: results~\cite{david-workload} Moreover any hyper-exponential law of
957: order 2 may be modelled with the Markov chain described in
958: figure~\ref{fig:markov} with parameters $\mu = p m_1 + (1-p) m_2,
959: \lambda = m_1 m_2 / \mu, \delta = m_1 + m_2 - \mu - \lambda$
960:
961: Let calculate the mean interarrival time. Probability to have an interarrival
962: time between $\theta$ and $\theta+d\theta$ is $\mathcal{A}(\theta) -
963: \mathcal{A}(\theta+d\theta) = -\mathcal{A}'(\theta)d\theta$. The mean is
964: \begin{eqnarray}
965: \tilde{\mathcal{A}} &=& \int_{0}^{\infty}-\theta \mathcal{A}'(\theta)d\theta = \int_{0}^{\infty}\mathcal{A}(\theta)d\theta \\
966: \tilde{\mathcal{A}} &=& \frac{1}{\mu \mathcal{P}_{Login}} = \frac{\lambda+\delta}{\lambda\mu}
967: \end{eqnarray}
968:
969: Let compute the variance of interarrival distribution.
970: \begin{eqnarray}
971: \mathit{var} &=& \int_{0}^{\infty}-(\theta - \tilde{\mathcal{A}})^{2}\mathcal{A}'(\theta)d\theta \\
972: \mathit{var} &=& 2 \int_{0}^{\infty}\theta\mathcal{A}(\theta)d\theta - \tilde{\mathcal{A}}^{2} \\
973: \frac{\mathit{var}}{\tilde{\mathcal{A}}^{2}} &=& {CV}^{2} = 1 + 2\frac{\delta\mu}{(\lambda+\delta)^{2}} \\
974: {CV}^{2} &=& 1 + 2~\mathcal{P}_{Logout}^{2} \frac{\mu}{\delta}
975: \end{eqnarray}
976:
977: Another interesting property is the number of jobs submitted by this model during a
978: \emph{Login} period. Let $P_n$ be the probability to receive $n$ jobs during a
979: \emph{Login} period. We have:
980: \begin{eqnarray}
981: P_n & = & \int_{0}^{\infty} \frac{(\mu t)^{n}}{n!}e^{-\mu t} \delta e^{-\delta t} dt \\
982: P_n & = & \frac{\delta}{\mu + \delta}(\frac{\mu}{\mu + \delta})^{n}
983: \end{eqnarray}
984: This is a geometric law. The mean number of jobs submitted by
985: \emph{Login} period is $\mu/\delta$.
986:
987: \subsection{Group model}
988:
989: Groups are composed of users, either regular users sending jobs at regular time
990: or users with a \emph{Login/Logout} like behavior. Metrics defined below as the
991: mean number of jobs sent by \emph{Login} state, the mean submittal rate and
992: probability of \emph{Login} could represent an user behavior.
993:
994: \begin{figure}[h]
995: \centering
996: \includegraphics[height=5.2cm]{job-rate.eps}
997: \caption{Users job submittal rates during their period of activity}
998: \label{fig:jobrate}
999: \end{figure}
1000:
1001: Figure~\ref{fig:jobrate} shows the sorted distribution of users submittal rate
1002: ($\mu \mathcal{P}_{Login}$). Except for the highest values it is quite a
1003: straight line in logspace. This observation could be included in a group model.
1004:
1005: \section{Simulation and validation}
1006: \label{Simulation}
1007:
1008: %% - Evaluation des parametres: r0(0) connu avec precision = µPlogin = nbjobs/tempstotal
1009: %% - µPlogin classé par utilisateur, variance: conjecture d'une loi de groupe
1010: %% - Graphe des R0 de plusieurs utilisateurs
1011: %% - Norme utilisée, Fitting des fréquences
1012:
1013: \begin{figure*}[hp]
1014: \begin{center}
1015: \subfigure[Biomed user 1]{\label{figure:simu:biomed001.clrce02}\includegraphics[totalheight=5.4cm,clip]{biomed001.clrce02.eps}}
1016: \subfigure[Biomed user 2]{\label{figure:simu:biomed001.clrce01}\includegraphics[totalheight=5.4cm,clip]{biomed001.clrce01.eps}} \\
1017: \subfigure[Biomed user 3]{\label{figure:simu:biomed002.clrce01}\includegraphics[totalheight=5.4cm,clip]{biomed002.clrce01.eps}}
1018: \subfigure[Biomed user 4]{\label{figure:simu:biomed005.clrce01}\includegraphics[totalheight=5.4cm,clip]{biomed005.2.eps}} \\
1019: \begin{tabular}{|l | c | c | c | c |}
1020: \hline
1021: Name & $\mu$ & $\delta$ & $\lambda$ & Error \\
1022: \hline
1023: Biomed user 1 & 0.0837 & 0.02079 & 2.1e-4 & 4.929e-3 \\
1024: Biomed user 2 & 0.0620 & 0.01188 & 1.2e-4 & 3.534e-3 \\
1025: Biomed user 3 & 0.0832 & 0.02475 & 2.5e-4 & 1.1078e-2 \\
1026: Biomed user 4 & 0.0365 & 1.4285e-3 & 1.075e-4 & 8.78e-2\\
1027: \hline
1028: \end{tabular}
1029: \end{center}
1030: \caption{Biomed simulation results}
1031: \label{table:simu}
1032: \end{figure*}
1033:
1034: \begin{figure*}[hp]
1035: \begin{center}
1036: \subfigure[LPC cluster Biomed user]{\label{fig:r0Biomed}\includegraphics[totalheight=5.4cm,clip]{AngelR0.eps}}
1037: \subfigure[NASA Ames most active user]{\label{fig:nasa:nasa}\includegraphics[totalheight=5.4cm,clip]{Nasa-3.eps}} \\
1038: \subfigure[DAS2 fs0 cluster most active user]{\label{fig:nasa:das2}\includegraphics[totalheight=5.4cm,clip]{DAS2-75.eps}}
1039: \subfigure[SDSC Blue Horizon most active user]{\label{fig:nasa:sdsc}\includegraphics[totalheight=5.4cm,clip]{SDSC-35.eps}}
1040: \caption{Hyper-Exponential fitting of $\mathcal{R}_0$ for a Biomed
1041: LPC user and for the most active users at NASA Ames, DAS2 and
1042: SDSC Blue Horizon clusters.}
1043: \label{fig:nasa}
1044: \end{center}
1045: \end{figure*}
1046:
1047: We have done a simulation in Scheme~\cite{kelsey98revised} directly using the
1048: Markov model. We began by fitting users behavior from the logs with our model.
1049: Like the frequency obtained from the logs, the model shows a majority of
1050: intervals with no job arrival, possibly followed by a relatively flat area and a
1051: fast decreases. Some fitting results are presented in
1052: figures~\ref{table:simu}. Norm used to fit real data is the maximum difference
1053: between the two cumulative distributions. We fitted the frequency data for each
1054: user.
1055:
1056: During a period of $t$ there are in average $\mu \mathcal{P}_{Login} t$ jobs
1057: submitted. We evaluate the value of $\mu \mathcal{P}_{Login}$ which is the
1058: average number of jobs submitted by seconds. We use that value when doing a set
1059: of simulation in order to fit a known real user probability distribution. We
1060: have two free parameters, so we vary $\mathcal{P}_{Login}$ between 0.0 and 1.0
1061: and $lambda$ which the inverse is the average time an user is
1062: \emph{Logout}. Some results obtained are shown in figures~\ref{table:simu}.
1063:
1064: $\mu$ parameter decides of the frequency length of the curve. Without
1065: the \emph{Login} behavior we would have obtained a classic Poisson
1066: curve of $\mu$ parameter. $1/\mu$ is the mean interarrival time during
1067: \emph{Login} period. An idea to evaluate $\mu$ would be to evaluate
1068: the job arrival rate during \emph{Login} periods, but we lack that
1069: \emph{Login} information.
1070:
1071: $\delta$ and $\lambda$ are the \emph{Logout} and \emph{Login} parameters. What is really
1072: important is the ratio $\lambda/(\lambda + \delta)$ which is $\mathcal{P}_{Login}$. This
1073: is the ratio between time user is active on the cluster and total time. $\delta$ and
1074: $\mathcal{P}_{Login}$ are measures of the deviation from a classic Poisson law. For
1075: instance, the mean number of jobs submitted by \emph{Login} period is $\mu/\delta$ and the
1076: mean job submittal rate is $\mu\mathcal{P}_{Login}$. For a same $\mathcal{P}_{Login}$ we
1077: could have very different scenarios. A user could be active for long time but rarely
1078: logged and another user could be active for short period with frequent login. $1/\delta$
1079: is the mean \emph{Login} time, $1/\lambda$ is the mean \emph{Logout} time.
1080:
1081: The $\mathcal{R}_0$ probability is essential for studying job arrival time. $1 -
1082: \mathcal{R}_0(t)$ is the probability that between time $0$ and $t$ we have received at
1083: least one job. It is easier to fit the $\mathcal{R}_0$ distribution for an user than the
1084: interarrival distribution because we have more points. Figure~\ref{fig:r0Biomed} shows a
1085: typical graph of $\mathcal{R}_0$ for a Biomed user. It shows for instance that for
1086: intervals of 10000 seconds, this Biomed user has a probability of about 0.2 to submits one
1087: or more jobs. We have fitted this probability with hyper-exponential curve, that is a
1088: summation of exponential curves. There was too much noises for high interval time to fit
1089: that curve. In fact errors on $\mathcal{R}_0$ are linear with $t$. So we have smoothed
1090: the curve before fitting by averaging near points. $\mathcal{R}_0$ for this user was
1091: fitted with a sum of two exponentials.
1092:
1093: %%% \begin{eqnarray}
1094: %%% \lefteqn{e^{-0.197-8.12e-06t} + e^{-2.04-4.002e-04t}} \nonumber \\
1095: %%% & & \mbox{} + e^{-3.234-4.308e-03t}
1096: %%% \end{eqnarray}.
1097:
1098: It seems that more than the \emph{Login}/\emph{Logout} behavior there is also a
1099: notion of user activity. For example during the preparation of jobs or analysis
1100: phase of the results an user does not use the Grid and consequently the cluster
1101: at all. More than the \emph{Login} and \emph{Logout} state an \emph{Inactive}
1102: state could be added to the model if needed.
1103:
1104: \subsection{Other workloads comparison}
1105:
1106: %% Particularité des jobs ne se trouve pas dans les autres modeles.
1107:
1108: User number 3 is the most active user from the NASA Ames iPSC/860 workload~\footnote{The
1109: workload log from the NASA Ames iPSC/860 was graciously provided by Bill Nitzberg. The
1110: workload logs from DAS2 were graciously provided by Hui Li, David Groep and Lex
1111: Wolters. The workload log from the SDSC Blue Horizon was graciously provided by Travis
1112: Earheart and Nancy Wilkins-Diehr. All are available at the Parallel Workload
1113: Archives~\url{http://www.cs.huji.ac.il/labs/parallel/workload/}\label{foot:pwa}}.
1114: Figure~\ref{fig:nasa:nasa} shows its $\mathcal{R}_0(t)$ probability, it is clearly
1115: hyper-exponential of order 2, as other users like number 22 and 23. Other users like
1116: number 12 and 15 are more classical Poissonian users.
1117:
1118: DAS2 Clusters (see note \ref{foot:pwa}) used also PBS and MAUI as their batch
1119: system and scheduler. The main difference we have with them is that they use
1120: Globus to co-allocate nodes on different clusters. We only have bag of tasks
1121: applications which interacts together in a pipeline way by files stored on SEs.
1122: Their fs0 144 CPUs cluster is quite similar with
1123: ours. Figure~\ref{fig:nasa:das2} shows the $\mathcal{R}_0(t)$ probability for
1124: their most active user and corresponding hyper-exponential fitting or order 2.
1125:
1126: SDSC Blue Horizon cluster (see note \ref{foot:pwa}) have a total of 144 nodes.
1127: The $\mathcal{R}_0(t)$ distribution probability of their most active user was
1128: fitted with a hyper-exponential of order 2 in figure~\ref{fig:nasa:sdsc}.
1129:
1130: %% test.12.r0 ~ exp = Poisson process
1131: %% test.14.r0 ~ droite => envoi regulier
1132: %% test.24.r0 ~ hyper-exp
1133:
1134: \section{Related works}
1135: \label{related}
1136:
1137: Our Grid environment is very particular and different from common cluster
1138: environment as parallelism involved requires no interaction between processes
1139: and degree of parallelism is one for all jobs.
1140:
1141: To be able to completely simulate the node usage we need not only the jobs
1142: submittal process but also the job duration process. Our runtime model is
1143: similar with the Downey model~\cite{downey99elusive} for runtime which is
1144: composed of linear pieces in logspace. There is a strong correlation between
1145: successive jobs running time but it seems unlikely that a general model for
1146: duration may be made because it depends highly on algorithms and data used by
1147: users.
1148:
1149: Most other models use Poisson distribution for interarrival distribution. But
1150: evidences, like CV be much higher than one, demonstrate that exponential
1151: distributions does not fit well the real data~\cite{feitelson01, jann97}. The
1152: need of a detailed model was expressed in~\cite{talby99b}. With constant
1153: parameters our model exhibits a hyper-exponential distribution for interarrival
1154: rate and justify such a distribution choice. One strong benefit of our model
1155: is that it is general and could be used numerically with non-constant
1156: parameters at the expense of difficult fitting.
1157:
1158: \section{Discussion}
1159: \label{Discussion}
1160:
1161: What could be stated is that job maximum run times provided by users are
1162: essentially inaccurate, some authors are even not using this information for
1163: scheduling~\cite{darincosts}. Maybe a better concept is the relative urgency
1164: of a job. For example on a grid software managers are people responsible for
1165: installing software on cluster nodes by sending installation jobs. Software
1166: manager jobs may be regarded as more urgent than other jobs type. So sending
1167: jobs with an estimated runtime could be replaced by sending jobs with an
1168: urgency parameter. That urgency could be established in part as a site policy.
1169: Each site administrator could define some users classes for different kind of
1170: jobs and software used with different jobs priorities. For instance a site
1171: hosted in some laboratory might wish to promote its scientific domain more than
1172: other domain, or some specific applications might need quality of services like
1173: real time interaction.
1174:
1175: Another idea for scheduling is to have some sort of risk assessment measured during the
1176: scheduler decision. This risk assessment may be based on blocking probabilities obtained
1177: either from the logs or from some user behavior models. For example, it could be wise to
1178: forbid that a group or an user takes all the cluster at a given time but instead to let
1179: some few percents of it open for short jobs or low CPU consuming jobs like monitoring.
1180:
1181: Information System shows for a site the number of job currently running and
1182: waiting. But it is not really the relevant metric in an on-line environment. A
1183: better metric for a cluster is the computing flow rate input and the computing
1184: flow rate capacity. A cluster is able to treat some amount of computation per
1185: unit of time. So a cluster is contributing to the Grid with some computation
1186: flow rate (in GigaFLOPS or TeraFLOPS). As with classical queuing theory if the
1187: input rate is higher than the capacity, the site is overloaded and the global
1188: performances are low due to jobs waiting to be processed. What happens is that
1189: the site receive more jobs that is is able to treat in a given time. So the
1190: queues begin to grow and jobs have to wait more and more before being started,
1191: resulting in performance decay. Similarly when the computation submitted rate
1192: is lower than the site capacity the site is under-used. Job submittal have
1193: also to be fairly distributed according to the site capacity. For example, a
1194: site that is twice bigger than another site have to receive twice more
1195: computing request than the other site. But there is a problem to globally
1196: enforce this submittal scheme on all the Grid. This is why a local site
1197: migration policy may be better than a central migration policy done with the
1198: RB.
1199:
1200: To be more precise there are two different kinds of cluster flow rate metrics,
1201: one is the local flow rate and the other one is the global flow rate. The
1202: local computing flow rate is the flow rate that one job sees when reaching the
1203: site. The global flow rate is the computing flow rate a group of jobs see when
1204: reaching the site. That global flow rate is also the main measurement for
1205: meta-scheduling between sites. These two metrics are different, for instance
1206: we could have a site with a lot of slow machines (low local flow rate and big
1207: global flow rate) and another site with only few supercomputers (big local flow
1208: rate and low global flow rate). But the most interesting metric for one job is
1209: the local flow rate. This means that if each job wants individually to be
1210: processed at the best local flow rate site, this site will saturate and be
1211: globally slow.
1212:
1213: As far as all users and groups computation total flow rate is less than the site global
1214: flow rate or site capacity, there is no real fairness issue because there is no strife to
1215: access the site resources, there is enough for all. The problem comes when the sum of
1216: all computation flow rate is greater than the site capacity, firstly this globally
1217: reduces the site performance, secondly the scheduler must take decision to share fairly
1218: these resources. The Grid is an ideal tool that would allow to balance the load between
1219: sites by migrating jobs~\cite{darincosts}. A site that share their resources and is not
1220: saturated could discharge another heavily loaded site. Some kind of local site flow
1221: control could maintain a bounded input rate even with fluctuating jobs submittal. For
1222: instance fairness between groups and users could be maintained by decreasing the most
1223: demanding input rate and distributing it to other less saturated sites.
1224:
1225: Another benefits is that applications computing flow rates may be partly
1226: expressed by users in their job requirements. Computing flow rate takes into
1227: account both the jobs sizes and their time limits. Fairness between users is
1228: then ensured if whatever may be flow values asked by each user, part granted to
1229: each penalizes no other one. Computing flow rate granted by a site to an
1230: application may depend on the applications degree of parallelism, that is for
1231: the moment the number of jobs. For instance it may be more difficult to serve
1232: an application composed of only one job asking for a lot of computing flow rate
1233: than to serve an application asking the same computing flow rate but composed
1234: of many jobs. Urgency is not totally measured by a computing flow rate. For
1235: example a critical medical application which is a matter of life or death
1236: arriving on a full site has to be treated in priority. Allocating flow rates
1237: between users and groups has to be right and to take under account priority or
1238: urgency issues.
1239:
1240: To use a site wisely users have to bound their computational flow rate and to
1241: negotiate it with site managers. A computing model has to be defined and
1242: published. These remarks are important in the case of on-line computing like
1243: Grids where meta-scheduling strategy have to take a lot of parameters into
1244: account. General on-line load balancing and scheduling
1245: algorithms~\cite{azar97line, azar94line, bar-noy00new, lam02line} may be
1246: applied. The problem of finding the best suited scheduling policy is still an
1247: open problem. A better understanding of job running time is necessary to have
1248: a full model.
1249:
1250: The LCG middleware allows users to send their jobs to different nodes. This is done by
1251: the way of a central element called a Resource Broker, that collects user's requests and
1252: distributes them to computing sites. The main purpose is to match the available
1253: resources and balance the load of job submittal requests. Jobs are better localized near
1254: the data they need to use.
1255:
1256: We would like to advise instead a peer to peer~\cite{andrade-ourgrid} view of the Grid
1257: over a centralized one. In this view computing sites themselves work together with other
1258: computing sites to balance the average workload. Not relying on dependent services
1259: greatly improves the reliability and adaptability of the whole systems. That kind of
1260: meta-scheduling have to be globally distributed as stated by Dmitry Zotkin and Peter J.
1261: Keleher~\cite{zotkin99joblength}:
1262:
1263: %% \begin{quote}
1264: \textit{In a distributed system like Grid, the use of a central Grid
1265: scheduler}\footnote{like the Resource Broker used in LCG middleware}
1266: \textit{may result in a performance bottleneck and lead to a failure
1267: of the whole system. It is therefore appropriate to use a
1268: decentralized scheduler architecture and distributed algorithm.}
1269: %% \end{quote} %% %% \hfill{\small (Dmitry Zotkin and Peter J. Keleher )}
1270:
1271: gLite~\cite{glite-design} is the next generation middleware for Grid computing.
1272: gLite will provide lightweight middleware for Grid computing. The gLite Grid
1273: services follow a Service Oriented Architecture which will facilitate
1274: interoperability among Grid services. Architecture details of gLite could be
1275: viewed in~\cite{glite-arch}. The architecture constituted by this set of
1276: services is not bound to specific implementations of the services and although
1277: the services are expected to work together in a concerted way in order to
1278: achieve the goals of the end-user they can be deployed and used independently,
1279: allowing their exploitation in different contexts. The gLite service
1280: decomposition has been largely influenced by the work performed in the LCG
1281: project. Service implementations need to be inter-operable in such a way that
1282: a client may talk to different independent implementations of the same service.
1283: This can be achieved in developing lightweight services that only require
1284: minimal support from their deployment environment and defining standard and
1285: extensible communication protocols between Grid services.
1286:
1287: \hfill
1288:
1289:
1290: \section{Conclusion}
1291: \label{Conclusion}
1292:
1293: So far we have analyzed the workload of a Grid enabled cluster and
1294: proposed an infinite Markov-based model that describes the process of
1295: jobs arrival. Then a numerical fitting has been done between the
1296: logs and the model. We find a very similar behavior compared to the
1297: logs, even bursts were observed during the simulation.
1298:
1299: \section*{Acknowledgments\markboth{Acknowledgments}{Acknowledgments}}
1300: The cluster at LPC Clermont-Ferrand was funded by Conseil R\'egional
1301: d'Auvergne within the framework of the INSTRUIRE project
1302: (\url{http://www.instruire.org})
1303:
1304: %%%% \appendix
1305: %%%%
1306: %%%% \begin{table*}[hb]
1307: %%%% \begin{center}
1308: %%%% \begin{tabular}{l l}
1309: %%%% \textsc{Pilot applications} & \\
1310: %%%% \hline \\
1311: %%%% \textbf{CDSS} & Clinical decision support system \\
1312: %%%% & \url{http://egee-na4.ct.infn.it/biomed/CDSS.html} \\
1313: %%%% \textbf{GATE} & Application for tomographic emission \\
1314: %%%% & \url{http://egee-na4.ct.infn.it/biomed/gate.html} \\
1315: %%%% \textbf{GPS@} & Grid genomic web portal \\
1316: %%%% & \url{http://egee-na4.ct.infn.it/biomed/GPSA.html}\\
1317: %%%% \\
1318: %%%% \textsc{External applications} & \\
1319: %%%% \hline \\
1320: %%%% \textbf{MammoGrid} & European-wide database of mammograms \\
1321: %%%% & \url{http://mammoGrid.vitamib.com/} \\
1322: %%%% \\
1323: %%%% \textsc{Internal applications} & \\
1324: %%%% \hline \\
1325: %%%% \textbf{Docking platform} & Grid-enabled docking platform for in
1326: %%%% sillico drug discovery for tropical diseases \\
1327: %%%% & \url{http://egee-na4.ct.infn.it/biomed/tropicaldisease.html} \\
1328: %%%% \textbf{gPTM 3D} & Interactive radiological image visualization and processing tool \\
1329: %%%% & \url{http://egee-na4.ct.infn.it/biomed/gPTM3D.html} \\
1330: %%%% \textbf{GridGRAMM} & Molecular docking web \\
1331: %%%% & \url{http://egee-na4.ct.infn.it/biomed/GridGRAMM.html} \\
1332: %%%% \textbf{GROCK} & Mass screenings of molecular interactions web \\
1333: %%%% & \url{http://egee-na4.ct.infn.it/biomed/GROCK.html} \\
1334: %%%% \textbf{SiMRI 3D} & Magnetic resonance image simulator \\
1335: %%%% & \url{http://egee-na4.ct.infn.it/biomed/SiMRI3D.html} \\
1336: %%%% \textbf{Xmipp MLrefine} & Macromolecular 3D structure analysis \\
1337: %%%% & \url{http://egee-na4.ct.infn.it/biomed/xmipp_MLrefine.html} \\
1338: %%%% \textbf{Xmipp multiple CTFs} & Micrographia CTF calculation \\
1339: %%%% & \url{http://egee-na4.ct.infn.it/biomed/Xmipp_assign_multiple_CTFs.html} \\
1340: %%%% \end{tabular}
1341: %%%% \end{center}
1342: %%%% \caption{Biomedical application list used by the Biomed VO}
1343: %%%% \label{table:applications}
1344: %%%% \end{table*}
1345:
1346: %%\clearpage
1347: \begin{small}
1348:
1349: \bibliographystyle{unsrt}
1350: \bibliography{workloadAnalysis.bib}
1351:
1352: \end{small}
1353:
1354: \end{document}
1355:
1356:
1357:
1358: %% R0
1359: %% Ok:
1360: %% atlas001 @ clrce01/2 (~ regulier ?)
1361: %% atlas002 @ clrce01
1362: %% atlas003 @ clrce01 (tres regulier ?)
1363: %% atlas004 @ clrce01 (!!)
1364: %% atlassgm @ clrce01 (!!)
1365: %% atlassgm @ clrce02
1366: %% biomed001 @ clrce01 (!!)
1367: %% biomed001 @ clrglop195
1368: %% biomed002 @ clrce01 (!!)
1369: %% biomed002 @ clrce02 (!!)
1370: %% biomed003 @ clrce01 !
1371: %% biomed003 @ clrce02 !
1372: %% biomed004 @ clrce01 !
1373: %% biomed005 @ clrce01 (!!)
1374: %% biomed005 @ clrglop195 (!!)
1375: %% biomed007 @ clrce01 (!!) (tres regulier ?)
1376: %% biomgrid @ clrce01/2/95 (!!)
1377: %% cms002 @ clrce01 (!!) tres regulier ?
1378: %% dteam001 @ clrce01/95 (!!) regulier ?
1379: %% dteam042 @ clrce02/95 (!!) tres regulier ?
1380: %% dteam002 @ clrce01/2/95 (!!) tres regulier ?
1381: %% dteam003 @ clrce01/2/95 regulier ?
1382: %% dteam004 @ clrce01/2/95 (!!) tres regulier ?
1383: %% dteam005 @ clrce01/2/95 regulier ?
1384: %% dteam006/7 ce02/95 tres regulier ?
1385: %% dteam011/3 (!!) tres regulier
1386: %% lhcb001/2 @ clrce01/2/95 (!!) hyperexp..
1387: %% lhcb003 ce01 (--) hyperexp..
1388: %% lhcb003 @ 95 (!!) tres regulier ou hyperexp ?.
1389: %% lhcbsgm @ 95 (!!) regulier !
1390:
1391:
1392:
1393:
1394:
1395:
1396:
1397:
1398:
1399: %% LEs RB sont utilisés comme moyen de pression.