1: \documentclass[10pt,twocolumn]{article}
2: % Must be 10 point, twocolumn (ICDE)
3:
4: \usepackage{latex8} % ICDE
5: \usepackage{times}
6: \usepackage{url}
7: \usepackage{comment}
8: \usepackage{epsfig}
9: \usepackage[compact]{titlesec}
10:
11: % \usepackage{fullpage}
12: % \usepackage{geometry}
13: % \geometry{lmargin=2cm,rmargin=2cm}
14: \usepackage{fancyvrb}
15: % \usepackage[small,compact]{titlesec}
16: % \usepackage[compact]{titlesec}
17: % \usepackage[light,first,bottomafter,outline]{draftcopy}
18:
19: \DefineShortVerb{\#}
20:
21: \includecomment{comment}
22: \specialcomment{commentbf} {\begingroup\bf} {\endgroup}
23: \specialcomment{commentit} {\begingroup\it\footnotesize} {\endgroup}
24: \specialcomment{commentold} {\begingroup\it\footnotesize{\bf OLD:}} {\endgroup}
25: \excludecomment{comment}
26: \excludecomment{commentbf}
27: \excludecomment{commentit}
28: \excludecomment{commentold}
29:
30: \title{
31: DPG: A Cache-Efficient Accelerator for Sorting and for Join Operators
32: }
33: \author{
34: Gene Cooperman\thanks{~$\,$This work was partially supported by
35: the National Science Foundation under Grant CCR-0204113, and
36: by the
37: Institute for Complex Scientific Software
38: (ICSS, http://www.icss.neu.edu/).},
39: Xiaoqin Ma$^*$ and Viet Ha Nguyen$^*$
40: \\
41: Northeastern University\\
42: Boston, MA 02115, USA\\
43: \{gene,xqma,vietha\}@ccs.neu.edu
44: }
45:
46: \begin{document}
47:
48:
49:
50: \maketitle
51:
52: \begin{abstract}
53: Retrieval of records on disk is well-known to be at the heart of many
54: database problems. We show that the corresponding movement of records in main
55: memory has now become a severe bottleneck for many database
56: operations. This is due to the stagnating latency of main memory,
57: even while CPU speed, main memory bandwidth, and disk speed all
58: continue to improve. As a result, record movement has become the
59: dominant cost in main memory sorting.
60:
61: We present a new algorithm for fast record retrieval, {\em
62: distribute-probe-gather}, or DPG. DPG has important applications
63: both in sorting and in joins. Current main memory sorting algorithms
64: split their work into three phases: extraction of key-pointer pairs;
65: sorting of the key-pointer pairs; and copying of the original records
66: into the destination array according the sorted key-pointer pairs.
67: The copying in the last phase dominates today's sorting time. Hence,
68: the use of DPG in the third phase provides an accelerator for existing
69: sorting algorithms.
70:
71: DPG also provides two new join methods for foreign key joins: DPG-move
72: join and DPG-sort join. The resulting join methods with DPG are
73: faster because DPG join is cache-efficient and at the same time DPG
74: join avoids the need for sorting or for hashing. The ideas
75: presented for foreign key join can also be extended to faster record
76: pair retrieval for spatial and temporal databases.
77:
78:
79: \end{abstract}
80:
81: \section{Introduction}
82: \label{sec:intro}
83:
84: Two important database operations are sorting and joins. These
85: operations have three primary hardware-related costs: disk access, CPU
86: operation and main memory access. The growth of main memory in
87: current computers implies that main memory databases become more
88: popular. For main memory databases, the bottleneck moves from disk to
89: main memory and the cost for disk access is not an issue anymore.
90:
91: Further, the growing CPU-memory gap implies that CPU costs represent
92: an increasingly small portion of the total time. This has been borne
93: out by several studies of
94: DBMs~\cite{Ailamaki99,BGB98,Patterson98,TLZT97}. The impact of the
95: CPU-memory gap was popularized by the paper of Wulf and
96: McKee~\cite{WulfMcKee95} on the {\em memory wall}.
97:
98:
99: The diminishing role of the CPU in the total running time is partially
100: accounted for by increasing CPU speeds and greater on-chip functional
101: parallelism. In part, it is also accounted for because most CPUs
102: today implement non-blocking caches and hardware prefetch in order to
103: overlap CPU execution with memory access~\cite{processor}. Hence,
104: memory access becomes the bottleneck. Therefore, we follow the
105: example of previous
106: researchers~\cite{Ailamaki99,Boncz99,Manegold99,Manegold00,Manegold02,ShatdalEtAl94}
107: in concentrating on main memory as the bottleneck.
108:
109: At the heart of this memory bottleneck lies the {\em record retrieval
110: problem}: the problem of copying records from a source array into a
111: destination array according to a new ordering. In a typical
112: application, one will be given a source data file, a sequence of
113: {\em record ids} (rids) for that data file, and a destination file.
114: The task is to copy the source records into the destination file in
115: the order specified by the sequence of rids. Fast record retrieval
116: is the key to faster sorting and faster joins.
117:
118: A standard approach for data retrieval accesses the records of the
119: source data file directly according to the sequence of the rids. This
120: implies random access to the main memory. This, for example, is what
121: was done in AlphaSort~\cite{AlphaSort} and
122: SuperScalarSort~\cite{SuperScalarSort}, the current record holders for
123: the Datamation sorting challenge~\cite{Datamation}. However, the cost
124: of such random access has now become the dominant cost in main memory
125: sorting, since the CPU-memory gap has widened still further since the
126: original work on sorting. (In fact, AlphaSort and SuperScalarSort
127: sort data on disk, but the size of the data file, 100~MB, is small
128: enough that the disk access consists solely of reading the source data
129: file from disk, and writing to a destination data file.)
130:
131: Random access is harmful not just in disk resident databases, but also in main
132: memory resident databases. Current DRAM technologies, such as DDR RAM and
133: Rambus RAM (RDRAM), extract a large latency penalty for any
134: non-sequential access to RAM. This is because the memory chips are
135: divided into memory pages of several kilobytes, and there is a latency
136: penalty for switching to a new memory page~\cite{LostCircuits,DDR}.
137:
138: Random access to RAM also harms performance in a second manner.
139: Random access incurs a heavy penalty when large cache blocks are used.
140: On a cache miss, the entire cache block is loaded into memory. If the record size
141: is small compared to the cache block size, then there is a large
142: overhead to load the entire cache block. For example, for a cache
143: miss on a Pentium~4 with DDR-266 RAM, approximately 60~ns are spent
144: loading the cache block, and approximately 60~ns are spent waiting on
145: the latency of DDR RAM. The trend is toward larger cache blocks. The
146: 128~byte L2 cache blocks of the Pentium~4 are four times larger than
147: those of the Pentium~III. The IBM Power4 processor goes still further
148: using 512~byte L3 cache blocks.
149:
150: The solution to avoid these latency penalties in main memory is to
151: access main memory sequentially. This is similar in spirit to the
152: way in which traditional databases strongly prefer to access disk
153: sequentially. In analogy with operation on disk, two-pass algorithms
154: are a key for faster main memory performance.
155:
156: The DPG-based sorting algorithms immediately yield faster sorting
157: algorithms. Both AlphaSort and SuperScalarSort sort their data
158: essentially in three phases: extraction of key-pointer pairs;
159: sorting of the key-pointer pairs; and copying of the original records
160: into the destination array according to the sorted key-pointer
161: pairs. The last phase is essentially record retrieval.
162:
163: A re-implementation of AlphaSort and SuperScalarSort on a IBM p690
164: Turbo shows that the record retrieval phase now dominates the running
165: time. With the DPG record retrieval algorithm replacing the standard
166: record retrieval algorithm, we immediately produce a faster sorting
167: algorithm. In the re-implementation of SuperScalarSort, the versions
168: with DPG as an accelerator are 27\% faster.
169:
170:
171: % {\bf The DPG algorithm is XXX\% faster than random access to main
172: % memory ond a YYY.}
173:
174: A direct consequence of faster record retrieval is faster main memory
175: sort-merge joins. For example, in implementing sort-merge join using
176: SuperScalarSort, we find that the use of DPG sort instead of
177: SuperScalarSort results in a 27\% faster join algorithm.. The DPG
178: record retrieval phase in isolation is 48\% faster than traditional
179: record retrieval algorithm.
180:
181:
182:
183: Finally, we apply DPG algorithm in the context of {\em foreign key
184: joins}. In a foreign key join the join key is the same as the foreign
185: key. we assume that one relation has an secondary index on the join
186: key. Foreign key joins have the advantage that one need only
187: rearrange the records of one of the files. This is in distinction to
188: sort-merge join and hash join, which both require to rearrange each of
189: the two files into a new file with join key values in sorted order.
190:
191: Foreign key joins require less record retrieval because it is possible
192: to first extract {\em join triples}, ($k$, rid$_R$, rid$_F$), where
193: $R$ and $F$ are the two relations and $k$ is the join key value.
194: Assume that $R$ references a foreign key of~$F$. To construct a join
195: triple, do file scan of~$R$, for each record of~$R$, its key~$k$ is
196: extracted. The secondary B-tree index of the join key on~$F$ is then
197: used to derive the corresponding record id, rid$_F$, with the key
198: value~$k$. The standard index lookup is very expensive, we propose a
199: cache efficient B-tree batch lookup to generate the join triples.
200:
201: Join triples reduce foreign key join to record pair retrieval. The
202: join triples specify the record pairs, (rid$_R$, rid$_F$), to be
203: retrieved. One can re-order the records of~$F$ to match the ordering
204: of rid$_R$ in the sequence of record pairs. Recall that, we do file
205: scan for~$R$, so rid$_R$s in the join triples are in sorted order.
206: Alternatively, one can sort the join triples according to rid$_F$, and
207: re-order the records of~$R$ to match the ordering of rid$_F$ in the
208: sorted rid pairs. In either situation, there are fewer record
209: retrievals, and so foreign key join is faster than a general join.
210:
211:
212: The ideas of faster record retrieval can also be applied to the
213: general case of record pair retrieval. Many algorithms for spatial
214: join~\cite{ArgeEtAl98,Patel96} and temporal join~\cite{Zhang02}
215: produce record pairs. Unlike equijoin, there is no single search key,
216: and so record retrieval is more difficult. The ideas of this paper
217: are described in terms of foreign key join. However, it is even
218: simpler to translate the ideas into record pair retrieval, since an
219: initial join triple extraction is not required.
220:
221:
222:
223:
224:
225: \section{Distribute-Probe-Gather}
226: \label{sec:DPG}
227:
228: The {\em distribute-probe-gather algorithm} (DPG) is a record retrieval
229: algorithm. Given a data file of records
230: and a sequence of record ids (rids) for the file, the goal is to copy
231: the records into a destination file with the property that
232: the ordering of records in the destination file corresponds to
233: the ordering of the rids in the given sequence.
234:
235: For example, let the source file, #R#, be the array of records with a
236: secondary \rm{B+} tree index on the attribute #A#. We want to
237: place the records of #R# in sorted order according to the values of~#A#. The
238: sequence of record ids, #S_rid#, at leaf nodes of the \rm{B+} tree is a list of record
239: ids in sorted order according to the value of A. Then, the destination file, #D#,
240: below, will contain the corresponding records in sorted order according to the
241: value of~#A#.
242:
243: {\tt
244: \begin{quote}
245: for each i,
246: D[i] = R[S\_rid[i]]
247: \end{quote}
248: }
249:
250: In the case that the sequence of rids is a permutation of the rids for
251: the data file, the sequence of rids acts as a permutation vector. The destination
252: file is then a permutation of the records in the input file.
253:
254: Note, however, that the DPG algorithm is not limited to permutations
255: of data. In the case of join, one record from one relation may match more than one
256: records from another relation. In such cases portion of the original records must
257: be duplicated. The DPG algorithm also works for this case.
258:
259:
260: \subsection{\bf Algorithm}
261:
262: The input for the algorithm is: a list of rids, #RID_LIST#, and a
263: data file of records, #INPUT#. The DPG algorithm partitions the rids,
264: #RID_LIST#, into separate runs. It also partitions the data records,
265: #INPUT#, into separate runs. The algorithm makes two passes over the
266: record ids, #RID_LIST#, and two passes over the records of the data
267: file, #INPUT#.
268:
269: The ideas are presented in the context of main memory databases. We
270: assume that neither the sequence of rids, #RID_LIST#, nor the data
271: file, #INPUT#, fit in cache. The DPG algorithm applies equally well
272: as an external data retrieval algorithm between disk and main memory.
273:
274: The spirit of the DPG algorithm is: try to transform arbitrary memory
275: access patterns into sequential memory access patterns; where
276: arbitrary memory access patterns are unavoidable, we try to divide the
277: data into small partitions that fit into the cache.
278:
279: For a sequential access pattern, it is easy to maintain a buffer in
280: cache. In the DPG algorithm we need to read many streams
281: simultaneously, and maintain a buffer for each stream. Hence, we want
282: to keep the buffer as small as possible while maintaining reasonable
283: efficiency. We define a buffer in cache to consist of two cache
284: blocks. When a cache block is full, it is written back to main memory
285: and a new cache block is loaded from main memory into cache. On
286: Pentium~4, this is done automatically by the hardware prefetch
287: function unit if the access to main memory is sequential. On other
288: architectures without the hardware prefetch function unit, the
289: software instructions, $cflush$ and $prefetchnta$, are needed to
290: maintain the buffers.
291:
292:
293: \noindent
294: \paragraph{\bf Constraints:} The data records and rids are split into
295: runs of length~$L$. The run length~$L$ is chosen based on two
296: constraints. First, the cache must be able to simultaneously hold
297: both one run of data records of length~$L$ and one single buffer for the
298: corresponding run of rids. (This constraint applies in the second
299: phase of DPG.) Second, the cache must simultaneously be able to hold
300: a buffer for each run. (This constraint applies in the first phase and
301: in the third phase of DPG.) These constraints are typical of the
302: constraints for two-pass algorithms, such as external sorting.
303:
304:
305:
306: If the data file has $N$~records, then the data file is partitioned
307: into $N/L$ sets of consecutive records. Assuming an rid consists of a
308: page id and offset on that page, the high order bits of the page
309: number can be used to efficiently identify the particular partition to
310: which the rid belongs. This assumes that the number of pages in a
311: partition is a power of two, which can be satisfied by appropriate
312: choice of~$L$. For the sake of clarity, we assume the rids
313: values are in the rage of $0$ and $N$.
314:
315: \paragraph{\bf Three Phases:} There are three phases in the DPG
316: algorithm. The three phases are also illustrated by pseudo-code in
317: Figure ~\ref{fig:DPGalgo} and by the diagram of Figure~\ref{fig:DPG}.
318:
319: \begin{enumerate}
320: \item{\bf Phase I.} The first phase is the {\em Distribute}
321: phase. One {\em distributes} the rids of #RID_LIST# into appropriate
322: RID runs according to the values of the rids. The first RID run
323: contains the rid values in the range from $0$ to $L$, the second RID
324: run contains the rid values in the range from $L$ to $2L$, and so on.
325: Both the access to every RID run and the access to #RID_LIST# are
326: sequential. Therefore, one only needs to maintains a buffer in
327: cache for each RID run and a single buffer in cache for #RID_LIST#, At
328: the end, we form $N/L$ RID runs and each RID run is a permutation
329: vector. For example, the $i$-th RID run is a permutation vector in the
330: range from $(i-1)*L$ and $i*L$.
331:
332:
333: \noindent
334: \item{\bf Phase II.} The second phase is the {\em Probe} phase. In
335: this phase, we allocate a second, temporary data file, #INTERNAL#, in
336: main memory. The temporary data file has the same size as the
337: original data file, #INPUT#, and is organized into the same number of
338: runs as the original data file. One then proceeds through each of the
339: runs of rids and each of the runs of #INPUT#. The rids from the
340: $i$-th RID run are used to probe the $i$-th #INPUT# run and the
341: corresponding records are copied into the $i$-th #INTERNAL# run.
342:
343: At the end, the $i$-th #INTERNAL# run contains the same records as the
344: $i$-th #INPUT# run, but the order of records in the #INTERNAL# run is
345: organized according to the $i$-th RID run. Both the $i$-th RID run and the $i$-th
346: #INTERNAL# run are accessed sequentially. The $i$th #INPUT# run is
347: accessed randomly, but it can fit in cache. Hence, every time, one loads the
348: $i$-th #INPUT# run entirely into cache and maintains two buffers in cache: one
349: for the $i$th RID run and the other for the $i$-th #INTERNAL# run.
350:
351:
352: \noindent
353: \item{\bf Phase III.} The third phase is the {\em Gather} phase.
354: This is an inverse of the {\em Distribute} phase and is similar to the
355: merge phase of external sorting. In the {\em Distribute} phase, the
356: rids are distributed into runs. As a result of the {\em Probe} phase,
357: the records in a given #INTERNAL# run are now in the same order as the
358: rids in the corresponding RID run created in Phase I. Hence, it
359: suffices to gather (merge) the records from the #INTERNAL# runs in
360: exactly the same order as the order of the rids in #RID_LIST#. More
361: precisely, if the $i$-th rid was distributed to the $j$-th RID run
362: during the {\em Distribute} phase, then at the $i$-th step of the {\em
363: Gather} phase, the next record from the $j$-th #INTERNAL# run is
364: copied to the destination array. One maintains several buffers in
365: cache: one single buffer for #RID_LIST#, one single buffer for the
366: destination array, and a buffer for each #INTERNAL# run.
367: \end{enumerate}
368:
369: \subsection{\bf Example}
370: The DPG algorithm is presented more formally in pseudo-code in
371: Figure~\ref{fig:DPGalgo}. The input of the algorithm is an array,
372: #RID#, of {\em rids} (record ids) and a data file, #INPUT_REC#, of
373: records. It is desired to retrieve the records from #INPUT_REC# in
374: the order corresponding to #RID#. The retrieved records are then
375: written to #OUTPUT_REC#.
376:
377: The three phases of the DPG algorithm are illustrated by an example in
378: Figure~\ref{fig:DPG}. The input in this example is the leaf nodes of
379: a secondary B+-tree index. The input contains a sequence of key-rid
380: pairs sorted according to the key values. The output is a sequence of
381: records sorted according to the key values of the secondary index. The
382: output will either be stored again on disk or else pipelined to the
383: next stage. The ability of the DPG algorithm to take advantage of
384: pipelining is an important feature for sorting and joins.
385:
386:
387: The letters #a#, #b#, #c#, ... are used to indicate the sorted keys on
388: the leaf nodes of the index. So (#a#, 5) indicates that the record with the rid
389: value of~5 has a key value of #a#. The first two rows are for
390: Phase~I, the next three rows for Phase~II, and the following
391: three rows for Phase~III. The horizontal rectangle of the first
392: row represents a sequence of key-rid pairs sorted according to
393: the key values. The second row represents the runs of rids into which
394: the first row is partitioned. Similarly, the third row again
395: represents the runs of rids, but now as part of Phase~II. The
396: fourth row is the partitioned runs of input records, and so
397: on. There are~12 elements and~3 runs in the example and each run contains
398: ~4 elements.
399:
400: During {\bf Phase I}, one has a sequence of key-rid pairs sorted
401: according to the key values and will distribute them into appropriate
402: RID runs. The first RID run will contain rid values from~0 to~3, the
403: second RID run will contain rid values from~4 to~7, and the third RID
404: run will contain rid values from~8 to~11. Upon reading the sequence of
405: pairs from the first row, one places the rids in their proper runs. For
406: example, the rid~5 goes to the second RID run, the rid~7 goes to
407: the second RID run, the rid~3 goes to the first RID run, the rid~8
408: goes to the third RID run, and so on. The third row in the figure
409: presents the end of Phase I. In this
410: phase, we do sequential read on a single stream and sequential writes on multiple streams.
411:
412: During {\bf Phase II}, one copies the original data file, #INPUT#, to
413: a temporary data file, #INTERNAL#. The input to this phase is the
414: third row. In the third row, each RID run has rid values that
415: correspond to one contiguous range of the #INPUT# file, an #INPUT#
416: run. For example, for the first RID
417: run, we will load the first #INPUT# run into the cache. The first
418: #INPUT# run consists of the first 4 contiguous records. The first rid
419: in the first RID run is~3 and $INPUT[3]$ is in the cache, so one finds
420: this record, #INPUT[3]#, in cache and copies it to the buffer
421: maintained in cache for the first #INTERNAL# run, and so on. At the
422: end of Phase~II, the records in the first #INPUT# run with key values in the sequence of i, l, f, c are reordered as
423: records in the first #INTERNAL# run with key values in the sequence of c, f, i, l. The reordering is done according to a permutation specified by the first RID run, 3, 2, 0, 1.
424:
425:
426: During {\bf Phase III}, we will use the original rid sequence in the list of
427: key-rid pairs to {\em gather} (merge) records from all #INTERNAL# run. For
428: example, upon reading~5, we go to the second #INTERNAL# run to {\em gather}
429: the record; upon reading~7, we go to the second #INTERNAL# run to {\em gather}
430: the record; upon reading~3, we go to the first #INTERNAL# run to {\em gather}
431: the record; upon reading~8 we go to the third #INTERNAL# run to {\em gather}
432: the record, and so on. For this phase, we maintain a buffer for the key-rid list, a buffer for #OUTPUT# in cache and
433: a buffer for each #INTERNAL# run. All buffers are in cache.
434:
435:
436:
437:
438: \begin{figure}[htb]
439: \setbox0\vbox
440: % families tt, courier, helvetica available built-in
441: {\small\begin{Verbatim}[commandchars=\\\{\},
442: fontfamily=tt,
443: commentchar=\%,codes={\catcode`$=3\catcode`^=7}]
444: % $ (for sake of emacs)
445: Let $L\gets$ (CACHE_SIZE/2)
446: Let $N\gets$ NUM_RECORDS
447: Let NUM_RUNS $\gets N/L$
448:
449: \rm INPUT: integer \tt RID[NUM_RIDS],
450: \rm record \tt INPUT_REC[NUM_RECORDS];
451: \rm OUTPUT: record \tt OUTPUT_REC[NUM_RIDS];
452: \rm PARAMETERS: \rm integer \tt RID_RUN[NUM_RUNS][]\tt,
453: \rm record \tt RECORD_RUN[NUM_RUNS][];
454:
455: //Phase I: {\it Distribute} RID array into runs
456: // with each run of length $L$
457: % For $i$ = 1, ..., NUM_RUNS do
458: % Initialize write buffer of RID_RUN[$i$]
459: For each rid, $r$, in RID do \{
460: Set {\it run\_num} = $\lceil{}r/L\rceil$;
461: Append $r$ to RID_RUN[{\it run\_num}]
462: \}
463:
464: //Phase II: {\it Probe} partitions of INPUT_REC
465: % (Conceptually partition INPUT_REC into partitions of length $L$ and
466: % for each partition, $i$, of INPUT_REC, probe for each rid in RID_RUN[$i$])
467: For $i$ = 1, ..., NUM_RUNS do \{
468: % Initialize read buffer for RID_RUN[$i$]
469: Read into memory all records from
470: INPUT_REC[$(i-1)\!*\!\!L\! +\! 1$] to INPUT_REC[$i\!*\!\!L$]
471: Allocate memory for $L$ records
472: to be stored in RECORD_RUN[i]
473: For each rid, $r$, in RID_RUN[$i$]
474: Append INPUT_REC[$r$] to RECORD_RUN[$i$]
475: Write out RECORD_RUN[$i$] to disk
476: \}
477:
478: //Phase III: {\it Gather} records from RECORD_RUN[]
479: // into OUTPUT_REC in same order as RID[]
480: % For $i$ = 1, ..., NUM_RUNS do
481: % Initialize read buffer of RECORD_RUN[$i$]
482: For each rid, $r$, in RID do \{
483: Set {\it run\_num} = $\lceil{}r/L\rceil$;
484: Read next record from RECORD_RUN[{\it run\_num}]
485: Append record to OUTPUT_REC
486: \}
487: \end{Verbatim}
488: }
489: \centerline{\fbox{\box0}}
490: \caption{Distribute-Probe-Gather (DPG) Algorithm}\label{fig:DPGalgo}
491: \end{figure}
492:
493:
494: % or , which the thicker horizontal rectangles represent the records of
495: % a data file.
496:
497: \begin{figure*}[!hbt]
498: \begin{center}
499: \epsfig{file=figure1.eps,width=7.0in,clip=}
500: \caption{Distribute-Probe-Gather (DPG)
501: (also, see pseudo-code in Figure~\ref{fig:DPGalgo})}
502: \label{fig:DPG}
503: \end{center}
504: \end{figure*}
505:
506:
507:
508:
509: \subsection{\bf Data Skew}
510: Implicit in the description of the DPG algorithm is that the input
511: sequence of rids is distributed uniformly among the set of all rids of
512: the input data file. This is always the case when DPG is applied to retrieve records
513: after sorting key-pointer pairs. In that situation, the
514: key-pointer pairs act as a permutation vector to permute the
515: records in the input data file.
516:
517: If the input sequence of rids is not uniformly distributed, then some
518: RID runs will be larger than other runs. As a consequence, in
519: Phase~II, when the partition of the temporary data file (the partition
520: of #RECORD_RUN# in Figure~\ref{fig:DPGalgo}) may be larger than the size
521: of the cache. If only a few of the partitions of the temporary data
522: file are larger than cache then the overall running time is not
523: greatly affected.
524:
525: If there is a great deal of data skew and many of the temporary
526: partitions #RECORD_RUN# are larger than cache, then the $N/L$
527: partitions of the input sequence of rids must be chosen on some other
528: basis than the high order bits of the page number. In such cases, one
529: can invoke the data skew handling techniques of DeWitt
530: et~al.~\cite{DeWitt92}. Their solution, reformulated in our context,
531: is to sample the rids from the rid sequence. The sampled set of rids
532: is then sorted, and partitions of the rids are chosen so as to evenly
533: partition the sampled set.
534:
535:
536: \section{Sorting}
537: \label{sec:sorting}
538:
539: As discussed in the introduction, DPG acts as an accelerator for many
540: main memory sorting algorithms. Recall that main memory sorts
541: typically proceed in three phases:
542: \begin{enumerate}
543: \item extraction of key-pointer;
544: \item sorting of the key-pointer pairs; and
545: \item copying of the original records
546: into the destination array according to the sorted key-pointer
547: pairs.
548: \end{enumerate}
549:
550: AlphaSort~\cite{AlphaSort} and SuperScalarSort~\cite{SuperScalarSort}
551: are examples of this three-phase sorting paradigm. Both sorting
552: algorithms can be considered as main memory sorting algorithms.
553:
554: In principle, the sorting algorithms are single-pass disk-based
555: sorting algorithms. Both sorting algorithms were introduced as an
556: answer to the Datamation Sorting Challenge~\cite{Datamation}. The
557: Datamation challenge dictates that one is given one million records of
558: 100~bytes. Each record has a 10~byte key. The keys are uniformly
559: distributed. At the time of the Datamation Challenge, external
560: sorting algorithms were required. On today's computers, the data file
561: of 100~MB easily fits in main memory.
562:
563: Hence, the only disk-related portion of the Datamation Challenge is to
564: overlap disk~I/O with CPU operation. Disk striping has the potential
565: to provide very fast disk~I/O. This occurs because the disks are
566: accessed in parallel. In this situation, main memory data retrieval
567: becomes the bottleneck.
568:
569: The DPG algorithm pushes back this main memory bottleneck. By using
570: DPG for data movement, the largest cost of main memory sorting is
571: reduced. We have reimplemented the main memory portion of
572: SuperScalarSort, both with and without DPG.
573:
574:
575: \section {Join Methods with DPG}
576: \label{sec:joinAlgoDPG}
577:
578: We use the ideas of DPG to present three new join algorithms: 1.c, 2 and~3.
579: Algorithms 1.a, 1.b and~4, will be included in the experimental
580: section~\ref{sec:joinExperiments} for completeness.
581: \begin{enumerate}
582: \item Sort-Merge Join
583: \begin{enumerate}
584: \item {Sort-Merge Join with AlphaSort (sort based on~\cite{AlphaSort})}
585: \item {Sort-Merge Join
586: with SuperScalarSort (sort based on~\cite{SuperScalarSort})}
587: \item {Sort-Merge Join with DPG Sort}
588: (sort based on DPG; see~\ref{sec:sortMergeJoin})
589: \end{enumerate}
590: \item {DPG-Sort Join} (see Section~\ref{sec:foreignKeyJoin})
591: \item {DPG-Move Join} (see Section~\ref{sec:foreignKeyJoin})
592: \item {Radix Join} (from~\cite{Manegold02})
593: \end{enumerate}
594:
595:
596: \subsection{Sort-Merge Join with DPG Sort}
597: \label{sec:sortMergeJoin}
598:
599: The well-known Sort-Merge join was introduced by Blasgen and
600: Eswaran~\cite{Blasgen77}. There are two steps in Sort-Merge joins:
601: sort two relations on the join key and scan the sorted relations to do
602: a merge on the join key.
603:
604: Applying DPG sort at the first step provides a new faster sort-merge
605: join method, Sort-Merge join with DPG Sort. In
606: Sections~\ref{sec:sortMergeJoinExperiments}
607: and~\ref{sec:joinExperiments}, we experimentally compare different
608: versions of sort-merge join, according to the sorting methods used.
609: Specifically, we consider using DPG sort, AlphaSort and
610: SuperScalarSort for the sorting step.
611:
612:
613: \subsection{Foreign Key Join with DPG}
614:
615: We next consider joins in which the join key is a foreign key, and it
616: has an index.
617: We denote by~$R$ a non-indexed relation. We denote
618: by~$F$ an indexed relation. The notation is motivated by the example
619: of a foreign key join. In a foreign key join, the join key is the
620: same as the foreign key. So, the join key is a set of attributes in
621: the relation~$R$ that refers to a foreign key from relation~$F$.
622:
623: A {\em join triple} is a triple ($k$, rid$_R$, rid$_F$), such that $k$
624: is a key value, rid$_R$ is the rid of a record from~$R$ with key~$k$,
625: and rid$_F$ the rid of a record from~$F$ with key~$k$.
626:
627: There are three steps in a foreign key join algorithm with DPG. The
628: first step is to construct join triples. The second step is to use
629: the join triples to copy one of the two relations into a temporary
630: file according to an order derived from the join triples. The third
631: step is to join the temporary file with the remaining relation.
632:
633: We describe the second and third steps initially in
634: Section~\ref{sec:foreignKeyJoin}. We then return to the more
635: technical problem of efficiently constructing join triples in
636: Sections~\ref{sec:joinTriples} and~\ref{sec:indexLookup}.
637:
638: \subsubsection{Two Foreign Key Join Methods with DPG}
639: \label{sec:foreignKeyJoin}
640:
641: This section describes two DPG join algorithms. It assumes that one
642: has already constructed the join triples. Some algorithms for
643: constructing the join triples are described later in
644: Sections~\ref{sec:joinTriples} and~\ref{sec:indexLookup}.
645:
646: Assume that one has generated the join triples ($k$, rid$_R$,
647: rid$_F$). It is now possible to ignore the key~$k$, and deal directly
648: with the rid pairs (rid$_R$, rid$_F$). We wish to satisfy one of two
649: goals:
650: \begin{enumerate}
651: \item\label{item:moveF} {\bf DPG-Move join:} Move the records of~$F$ into
652: a temporary file according to the ordering of the records of~$R$.
653: \item\label{item:moveR} {\bf DPG-Sort join:} Move the records of~$R$ into
654: a temporary file according to the ordering of the records of~$F$.
655: So, the rid pairs (rid$_R$, rid$_F$) will be sorted according to
656: rid$_F$.
657: \end{enumerate}
658:
659: First consider DPG-Move join. We will see how to generate the join
660: triples in the order of rid$_R$. This is done by scanning the records
661: of the relation~$R$ in file order (in order of increasing rid$_R$).
662: Therefore the rid pairs can be used directly as part of a DPG
663: algorithm. The second component of the pair, rid$_F$, is the sequence
664: of rids according to which we want to move the records of~$F$. This
665: algorithm does not require any sorting or hashing. Hence, we call
666: it DPG-Move join.
667:
668: Next consider DPG-Sort join. In this version, we first sort
669: the rid pairs (rid$_R$, rid$_F$) according to the order of rid$_F$.
670: This is done, for example, with SuperScalarSort and DPG. The
671: algorithm then reduces to record movement in which we wish to move the
672: records of~$R$ according to the ordering of~rid$_F$ in the sorted
673: sequence (rid$_R$, rid$_F$).
674:
675: Note that DPG-Move join is preferred when the relation~$F$ is smaller.
676: DPG-Sort join is preferred when the relation~$R$ is smaller.
677:
678:
679: \subsubsection{Construction of Join Triples}
680: \label{sec:joinTriples}
681:
682: The simplest solution for a foreign key join is to
683: do a file scan of~$R$, and for each record of~$R$ to extract the join
684: key and do an index lookup in the index of~$F$. One can then join the
685: record of~$R$ with the corresponding record of~$F$. This involves
686: random access, and is economical only if the join produces
687: very few records. In particular, this will be the case only if the
688: number of records of~$R$ is small.
689:
690: A better solution is to do a file scan of~$R$, and to use the index
691: on~$F$ to create join triples. To construct the join triples, one scans the
692: relation~$R$. For each record of~$R$, one extracts the corresponding
693: rid and associated join key~$k$. One then looks up the key~$k$ in the
694: index of~$F$. The index lookup yields the final element of the
695: triple, rid$_F$.
696:
697: \subsubsection{Batch Lookup in Indexes}
698: \label{sec:indexLookup}
699:
700: Note that the join triple is constructed at the cost of a file scan
701: of~$R$ and an index lookup in the index of~$F$ for each record of~$R$.
702: Usually an index on a data file is much smaller than the full data
703: file. If the index fits entirely in cache, then the index lookup will
704: be significantly cheaper than the file scan.
705:
706: Unfortunately, the indexes in many main memory databases generally do
707: not fit in cache. In such cases, the index lookup in the index of~$F$
708: will dominate the costs. For example, in a B+~tree indexing
709: $N$~records, if an internal node has $m$~children, then $\log_m N$
710: nodes of the B+~tree must be accessed. Each such access will be a
711: random access in main memory. Most of the random accesses imply a
712: cache miss. The cost of so many random accesses makes a naive index
713: lookup uneconomical. Even if the index is a hash index, at least one
714: random access in memory will be required.
715:
716: Luckily, it is possible to execute the index lookup faster than the
717: above analysis would indicate. This is because the construction of
718: the join triples requires many index lookups, with no intervening
719: record accesses. For purposes of join triple construction, {\em batch
720: lookup} of keys in an index suffices. By batch lookup, we assume that
721: an array of join keys is first extracted by scanning the data file
722: of~$R$. The batch lookup then produces an array of rids for~$F$
723: through the use of the index on~$F$. We show that the index lookups
724: can be reorganized into a two-pass algorithm. Two such two-pass
725: algorithms are demonstrated: one for B+ tree indexes, and one for hash
726: indexes.
727:
728: Note that batch lookup of rids in an index can be substantially faster
729: than individual lookup. Rao and Ross had previously discussed cache
730: conscious indexes for main memory~\cite{Rao99}. There they present
731: CSS-trees, which have better cache behavior than either B+~trees or
732: hash indexes. However, they only consider individual lookup of keys
733: in an index, one at a time. In their scenario, a second key is not
734: looked up until the index lookup of the first key has been resolved.
735:
736: \paragraph{\bf Batch Lookup in B+ Tree Indexes}
737:
738: For simplicity, we describe the two-pass lookup for an enhanced
739: B+~tree. The notion of {\em enhanced B+~tree} was introduced by Rao
740: and Ross~\cite{Rao99}. The idea is that all slots of a B+~tree node
741: are used. This is similar to compact B-Trees~\cite{CS83} or to the
742: ISAM method introduced by IBM~\cite{GR93}. In the context of a
743: general B+~tree this can be accomplished by maintaining updates to the
744: B+~tree in a separate index, and then doing a batch update by
745: reorganizing the B+~tree. For brevity, we will sometimes refer to
746: B+~tree, although in all cases, an enhanced B+~tree is intended.
747:
748:
749: Assume that we are executing a file scan of~$R$ with the purpose of
750: constructing the join triples. We collect a sequence of keys from the
751: relation~$R$, and we wish to carry out a batch lookup of rids in the
752: enhanced B+-tree. The rids will be used to retrieve records from the
753: foreign relation~$F$. We assume that the B+~tree does not fit in
754: cache. Our goal is a two-pass algorithm which will efficiently
755: accomplish batch lookup.
756:
757: Assume that the B+~tree has $m$~entries per node. Assume that the
758: B+~tree indexes $N$~records. Then there are $\log_m N$ levels. Assume
759: further that the first $(\log_m N)/2$ levels (the top half of the
760: B+~tree) fit in cache. For $m\gg 2$, the top half of the tree has
761: approximately $m^{(\log_m N)/2}=\sqrt{N}$ slots. For example, if
762: there are $N=10^9$ records, then there are 32,000 slots, which clearly
763: fit inside cache.
764:
765: The strategy is a two-pass strategy. In the first pass, one performs
766: a lookup of each key, but only using the top half of the B+~tree.
767: As shown in Figure~\ref{fig:Btree}, each leaf of the
768: subtree comprising this upper half can be considered as the root of a
769: second subtree comprising nodes in the lower half of the B+~tree.
770: Thus, at the end of this first pass, a key can be associated with a
771: leaf of the upper subtree, which is also a root of one of the lower subtrees.
772:
773: \begin{figure}[!hbt]
774: \begin{center}
775: \epsfig{file=figure2.eps,width=3.0in}
776: \caption{B-Tree (Each smaller triangle represents a subtree that
777: fits in cache.)}
778: \label{fig:Btree}
779: \end{center}
780: \end{figure}
781:
782: After the first pass, one can associate each key with a subtree within
783: the lower half of the B+~tree. So, in the second pass, one loads a
784: subtree from the lower half. One then continues the lookup for all
785: keys associated with the root of the subtree in the lower half.
786: At the end, one then has a sequence of keys and rids.
787:
788: If one wishes to have the keys in the same order as the order of the
789: original rids, then this can also be arranged. In this case, one
790: extends the previous scenario to use the DPG algorithm.
791:
792: The first phase of the DPG algorithm is to {\em distribute} the keys into
793: runs. In the initial
794: lookup of a key in the first half, it was associated with a root of a
795: subtree in the first half. The particular root of a subtree
796: identifies the run into which the key is copied.
797:
798: The second phase of the DPG algorithm is to use the key to {\em probe} the
799: index, in order to find the rid. One completes the lookup of all keys
800: in a run associated with a particular subtree of the B+~tree, before
801: proceeding to the next subtree in the lower half. The resulting rids
802: are stored in a temporary partition or run.
803: This is exactly what was described earlier.
804:
805: Finally, the third phase of the DPG algorithm {\em gathers} the rids
806: into a destination array in the same order as that of the original
807: keys. Hence, we have completed a batch lookup of the keys, and
808: returned an array of rids in the same order as that of the original
809: keys.
810:
811: \paragraph{\bf Batch Lookup in Hash Indexes}
812:
813: Batch lookup of hash indexes also proceeds in two passes. We assume
814: that the hash array of the hash index stores at each hash entry one
815: key-rid pair. A second hash array associated with the index stores
816: pointers to overflow key-rid pairs that would have collided with an
817: occupied slot in the first hash array.
818:
819: The two-pass lookup for hash indexes proceeds in a very simple manner.
820: We extract the sequence of keys from~$R$. As the keys are extracted,
821: the hash values are computed. Those key-hash value pairs are saved in
822: an array in an order corresponding to the rid order of~$R$.
823:
824: It now suffices to apply the DPG algorithm. The hash array is
825: partitioned into sets of $L$~hash slots. In the distribute phase, the
826: hash value acts as an index into the hash array. Hence, this becomes
827: the permutation vector of the DPG algorithm. The key and hash value
828: are then written into separate runs according to the partition of the
829: hash array. There is one run of key-hash values for each partition of
830: the hash array.
831:
832: In the probe phase, a run of hash values is loaded into cache along
833: with the corresponding partition of the hash array. As part of the
834: probe phase, the key and hash values from the run are used to look up
835: the the corresponding rid in the partition of the hash array. The
836: rids are then saved in a temporary partition, in the same order as the
837: order in which the key-hash values of the original partition are stored.
838:
839: Finally, in the gather phase, the ordering of the key-hash
840: value pairs are used to gather the rids from the temporary partition.
841: As in Section~\ref{sec:DPG}, the rids are gathered into a destination
842: array in an order corresponding the ordering of the original key-hash
843: value pairs.
844:
845:
846:
847:
848: \section{Experimental Evaluation}
849:
850: \subsection{Sorting Comparisons}
851:
852: Figure~\ref{fig:sorting} demonstrates the acceleration achieved by
853: SuperScalarSort when DPG is used. The results are demonstrated on the
854: IBM pSeries 690 Turbo. The IBM~p690 has an L3~cache of size 128~MB.
855: Hence, in order to realistically demonstrate DPG, we were forced to
856: increase the size of the database. We chose to implement
857: SuperScalarSort for a data file of size 512~MB. The record size was
858: treated as a variable, to illustrate the influence of record size. As
859: in the original Datamation Challenge, the key is 10~bytes.
860:
861: \begin{figure}[!hbt]
862: \begin{center}
863: \epsfig{file=figure3.eps,width=3.0in}
864: \caption{Sorting Comparison}
865: \label{fig:sorting}
866: \end{center}
867: \end{figure}
868:
869: Note that the additional speed of SuperScalarSort with DPG is most
870: pronounced for smaller record sizes. For record sizes of 256~bytes
871: and higher, DPG provides only a small advantage. This is because the
872: IBM p690 has an L3~cache block of size 512~bytes. So, for record
873: sizes below 256~bytes, a cache miss incurs significant overhead
874: in loading a 512~byte cache block.
875:
876: There was some variability in the results because the data was taken
877: on a time-shared, shared memory machine. On the IBM~p690, the
878: L2~cache is shared among two CPUs, and the L3~cache is shared among
879: eight CPUs. Our experiments were run on a single CPU. Hence another
880: process on a neighboring CPU could consume some of our cache, thereby
881: affecting the timings. The reported experimental results are the
882: averages of three runs each.
883:
884: \subsection{Comparion of Sort-Merge Join using Different Sorting Algorithms}
885: \label{sec:sortMergeJoinExperiments}
886:
887: \begin{figure}[!hbt]
888: \begin{center}
889: \epsfig{file=figure4.eps,width=3.0in}
890: \caption{Sort-Merge Join (IBM Power4 p690 Series)}
891: \label{fig:sortMerge}
892: \end{center}
893: \end{figure}
894:
895: We implement two sort-merge join methods: sort-merge join with DPG sort
896: and sort-merge join with SuperScalarSort. As defined in section~\ref{sec:sortMergeJoin},
897: sort-merge join with DPG sort applies the DPG sort for the sorting
898: phase and sort-merge join with SuperScalarSort applies SuperScalarSort
899: for the sorting phase.
900:
901: As expected, the acceleration in the speed of sort-merge join in
902: Figure~\ref{fig:sortMerge} follows the same pattern as that of
903: Figure~\ref{fig:sorting} for sorting. In this case, we illustrate our
904: results on a database of size 128~MB on the IBM p690. The IBM p690
905: has an L2~cache of size 1.4~MB. So, in this case, the L3~cache is acting
906: as the main memory, while L2~cache is acting as the ``cache''.
907:
908:
909:
910: \subsection{Comparion of Six Different Join Methods}
911: \label{sec:joinExperiments}
912:
913: In Figures~\ref{fig:join_nonuniform_1} through~\ref{fig:join_dup_3},
914: we now experimentally compare the six join methods originally
915: presented at the begining of Section~\ref{sec:joinAlgoDPG}. The three
916: join algorithms labelled Sort-Merge Join with DPG Sort, DPG-Sort Join,
917: and DPG-Move Join are all new. The remaining algorithms, Sort-Merge
918: Join with AlphaSort, Sort-Merge Join with SuperScalarSort, and Radix
919: Join, are all based on sorting or join algorithms from the literature.
920:
921: As explained earlier, we denote by
922: ~$F$ an indexed relation and ~$R$ a relation that has foreign key on
923: the indexed attribute of ~$F$.
924: We consider both uniform and non-uniform distribution of foreign key values.
925:
926: \paragraph{\bf The non-uniform distribution of join key values.} Sort-merge join with
927: AlphaSort and DPG-Sort are the only two of the previously discussed join methods that operate
928: correctly for a non-uniform distribution of join key values.
929:
930: In the following we use count bucket sort for bucket sort. The count bucket sort is as
931: follows:
932: \begin{enumerate}
933: \item count the number of elements destined for each bucket.
934: \item set bucket boundaries according to the statics computed and
935: distribute elements to buckets.
936: \end{enumerate}
937:
938: AlphaSort works for non-uniformly distributed data, because it uses
939: quicksort to sort each run and uses replacement-selection to merge
940: the runs. DPG-Sort join works for non-uniformly distributed data too,
941: because we use count bucket sort to sort RIDs.
942:
943: DPG-Sort join sorts the RIDs according to the lowest~ $\log{N}$ bits
944: is enough, $N$ is the number of records. In the simulation, we assign
945: the RID values in the range $[1, ... , n]$, the lowest $\log{N}$ bits is
946: sufficient for sorting. For example, for $N=2^{20}$ sorting RIDs
947: according to the lowest 20 bits is enough. This could be done in two
948: steps with count bucket sort: first do count bucket sort according to
949: the lower 10 bits, then do count bucket sort according to the higher
950: 10 bits.
951:
952: Using the UNIX $random()$ and $exp()$ functions we generate an
953: exponential distribution of the data as $exp(c*(random()>>10))$ ($c$
954: is a constant and in our experiments we assign c with $-0.0000001$).
955: A comparison of Sort-Merge join with AlphaSort and DPG-Sort join on
956: three different computer architectures is provided in the
957: figures~\ref{fig:join_nonuniform_1}, ~\ref{fig:join_nonuniform_2}, and
958: ~\ref{fig:join_nonuniform_3}. The three different architectures are
959: the IBM Power4 Pseries 690 Turbo, the 3.06 GHz Pentium 4 with Rambus
960: PC-1200 RAM, and the 2.6 GHz Pentium 4 with DDR-266 RAM, From the
961: comparison, we can see that DPG-sort join is much faster than
962: Sort-Merge join with AlphaSort.
963:
964: \begin{figure}[!hbt]
965: \begin{center}
966: \epsfig{file=figure5.eps,width=3.0in}
967: \caption{Comparison of Joins (IBM Power4 pSeries 690 Turbo,
968: exponential distribution of keys)}
969: \label{fig:join_nonuniform_1}
970: \end{center}
971: \end{figure}
972:
973: \begin{figure}[!hbt]
974: \begin{center}
975: \epsfig{file=figure6.eps,width=3.0in}
976: \caption{Comparison of Joins (2.6~GHz Pentium 4 / DDR-266 RAM,
977: exponential distribution of keys)}
978: \label{fig:join_nonuniform_2}
979: \end{center}
980: \end{figure}
981:
982:
983: \begin{figure}[!hbt]
984: \begin{center}
985: \epsfig{file=figure7.eps,width=3.0in}
986: \caption{Comparison of Joins (3.06~GHz Pentium 4 / Rambus PC-1200 RAM,
987: exponential distribution of keys)}
988: \label{fig:join_nonuniform_3}
989: \end{center}
990: \end{figure}
991:
992: \paragraph{\bf The uniform distribution of join key values.} First we will
993: show how all algorithms work for uniformly distributed join key values.
994: SuperScalarSort is a key-prefix-sort explained further in
995: section~\ref{sec:intro}. It assumes that data is uniformly
996: distributed according to the highest ~7 bits of the key. This kind of
997: distribution can be applied to all DPG algorithms, because a unform
998: distribution of the join key values is the only constraint of
999: Sort-Merge join with DPG and DPG-move join.
1000:
1001: We also implement the radix join of Manegold et al.~\cite{Manegold02}.
1002: They describe a variation of hash join for main memory. Radix join
1003: also requires a uniform distribution of the join key values.
1004: Otherwise, some of their partitions will be too large to fit into the
1005: L2 cache and how to set the boundaries is unknown.
1006:
1007: We produce uniformly distributed foreign key values using the UNIX
1008: $random()$ function. A comparison of the different join methods on
1009: three different computer architectures as before is provided in the
1010: figures. Figure~\ref{fig:join_no_dup} reflects data with no duplicate
1011: join key values in the relation~$R$. Figures~\ref{fig:join_dup_1},
1012: ~\ref{fig:join_dup_2}, and ~\ref{fig:join_dup_3}, show the same
1013: information in which duplicate join key values are allowed. From the
1014: comparison, we can see that DPG-move join and radix join are the
1015: fastest. DPG-move is better for large records and radix join is better
1016: for small records.
1017:
1018: \begin{figure}[!hbt]
1019: \begin{center}
1020: \epsfig{file=figure8.eps,width=3.0in}
1021: \caption{Comparison of Joins (IBM Power4 pSeries 690 Turbo, no
1022: duplicate keys)}
1023: \label{fig:join_no_dup}
1024: \end{center}
1025: \end{figure}
1026:
1027: \begin{figure}[!hbt]
1028: \begin{center}
1029: \epsfig{file=figure9.eps,width=3.0in}
1030: \caption{Comparison of Joins (IBM Power4 pSeries 690 Turbo,
1031: duplicate keys)}
1032: \label{fig:join_dup_1}
1033: \end{center}
1034: \end{figure}
1035:
1036: \begin{figure}[!hbt]
1037: \begin{center}
1038: \epsfig{file=figure10.eps,width=3.0in}
1039: \caption{Comparison of Joins (2.6~GHz Pentium 4 / DDR-266 RAM,
1040: duplicate keys)}
1041: \label{fig:join_dup_2}
1042: \end{center}
1043: \end{figure}
1044:
1045:
1046: \begin{figure}[!hbt]
1047: \begin{center}
1048: \epsfig{file=figure11.eps,width=3.0in}
1049: \caption{Comparison of Joins (3.06~GHz Pentium 4 / Rambus PC-1200 RAM,
1050: duplicate keys)}
1051: \label{fig:join_dup_3}
1052: \end{center}
1053: \end{figure}
1054:
1055: \section{Conclusions}
1056: The use of DPG in the sorting provides an accelerator for existing
1057: sorting algorithms. Especially for the smaller record sizes, such as
1058: 32~bytes and 64~bytes, the performance improvements are really
1059: impressive.
1060:
1061: For the more common case of non-uniform distribution of join key values,
1062: DPG-Sort join works better than sort-merge join with AlphaSort across
1063: all the tested platforms. For smaller records sizes, such as, 32, 64,
1064: DPG-Sort join is much better than sort-merge join with AlphaSort. More
1065: impressively, on the newer platform, Rambus PC-1200 RAM, DPG-Sort works
1066: better than on PC with DDR-266 RAM and even works better for larger
1067: record sizes, for example 512~bytes.
1068:
1069: For special case of uniform distribution of join key values, DPG-move join
1070: and radix join are the best. The remaining DPG algorithms are also
1071: competitive with older algorithms although with a smaller improvement.
1072:
1073: \section{Future Work}
1074: The DPG algorithm can be easily generalized to multiple passes. This
1075: can be useful when there is a very small cache in relation to the size
1076: of main memory. However, we do not encounter this scenario in our
1077: current experiments.
1078:
1079: \section{Acknowledgements}
1080: We would like to thank Betty Salzberg and Donghui Zhang for extensive
1081: conversations and insights into additional situations where
1082: distribute-probe-gather can be beneficial. We also gratefully
1083: acknowledge the use of the support for the computations on the IBM
1084: p690 by the Scientific Computing and Visualization (SCV) group at
1085: Boston University.
1086:
1087:
1088: \bibliographystyle{latex}
1089: \begin{thebibliography}{99}
1090:
1091: \bibitem{SuperScalarSort}
1092: Agarwal, R.C. ``A Super Scalar Sort Algorithm for RISC Processors'',
1093: {\sl ACM SIGMOD~'96}, pp.~240--246, June 1996.
1094:
1095:
1096: % \bibitem{BlockOrientedProcessing}
1097: % Sriram Padmanabhan, Timothy Malkemus, Ramesh C. Agarwal, Anant
1098: % Jhingran: Block Oriented Processing of Relational Database Operations
1099: % in Modern Computer Architectures. ICDE 2001: 567-574.
1100: % Abstract: Several recent papers have pointed out that database
1101: % systems are not well tuned to take advantage of modern superscalar
1102: % processor architectures. In particular, the Clocks-Per-Instruction
1103: % (CPI) for rather simple database queries are quite poor compared to
1104: % scientific kernels or SPEC benchmarks. The lack of performance of
1105: % database systems has been attributed to poor utilization of caches
1106: % and processor function units as well as higher branching
1107: % penalties. In this paper, we argue that a block oriented processing
1108: % strategy for database operations can lead to better utilization of
1109: % the processors and caches generating significantly higher
1110: % performance. We have implemented the block oriented processing
1111: % technique for aggregation, expression evaluation, and sorting
1112: % operations as a feature in the DB2 Universal Database system. We
1113: % present results from representative queries on a 30 GB TPC-H database
1114: % to show the value of this technique.
1115:
1116:
1117: % On Pentium 4 w/ NT 4.0
1118: \bibitem{Ailamaki99}
1119: A. Ailamaki, D.J.~DeWitt, M.D.~Hill, D.A.~Wood,
1120: ``DBMSs on a Modern Processor: Where Does Time Go?'',
1121: {\sl VLDB'99, Proceedings of 25th International Conference on Very
1122: Large Data Bases}, 1999, pp.~266-277.
1123:
1124: \bibitem{Datamation}
1125: Anon., Et-Al. (1985). ``A Measure of Transaction Processing Power'',
1126: {\sl Datamation}~{\bf 31}(7): pp.~112--118. Also in {\sl Readings
1127: in Database Systems}, M.J.~Stonebraker, ed., Morgan Kaufmann, San
1128: Mateo, 1989.
1129:
1130: \bibitem{ArgeEtAl98}
1131: L.~Arge, O.~Procopiuc, S.~Ramaswamy, T.~Suel and
1132: J.S.~Vitter, ``Scalable Sweeping-Based Spatial Join'', {\sl
1133: VLDB 1998, Proceedings of 24th International Conference on Very Large
1134: Databases}, 1998, pp.~570--581.
1135:
1136: \bibitem{BGB98}
1137: L.A. Barroso, K.~Gharachorloo and E.D.~Bugnion ``Memory System
1138: Characterization of Commercial Workloads'', {\sl Proc. of the
1139: Int. Symp. on Computer Architecture}, Barcelona, Spain, 1998.
1140:
1141: % sort-merge join
1142: \bibitem{Blasgen77}
1143: M.W.~Blasgen and K.P.~Eswaran,
1144: ``Storage and access in relational databases'', {\sl IBM Systems
1145: Journal}~{\bf 16}(4), 1977.
1146:
1147: \bibitem{Boncz99}
1148: P.~Boncz, S.~Manegold and M.L.~Kersten,
1149: ``Database Architecture Optimized for the New Bottleneck: Memory Access'',
1150: {\sl The {VLDB} Journal}, 1999, pp.~54--65.
1151:
1152: \bibitem{CS83}
1153: F.~Cesarini and G.~Soda, ``An Algorithm to Construct a Compact B-tree in
1154: Case of Ordered Keys'', {\sl Information Processing Letters}~{\bf
1155: 17}(1), pp.~1612--1630, 1983.
1156:
1157: \bibitem{DeWitt84}
1158: D. DeWitt, R. Katz, F.~Olken, L.~Shapiro, M.~Stonebraker and D.~Wood,
1159: ``Implementation Techniques for Main Memory Databases'', {\sl
1160: Proc. ACM SIGMOD Conf. on the Management of Data}, 1984.
1161:
1162: \bibitem{DeWitt92}
1163: D. DeWitt, J.~Naughton, D.~Schneider and S.~Seshadri,
1164: ``Practical Skew Handling in Parallel Joins'',
1165: {\sl VLDB 1992, Proc. Intl. Conf. on Very Large Databases}, 1992.
1166:
1167: \bibitem{Dittrich02}
1168: J.-P. Dittrich, B.~Seeger, D.S.~Taylor and P.~Widmayer,
1169: ``Progressive Merge Join: A Generic and Non-blocking Sort-based Join
1170: Algorithm'', {\sl VLDB 2002, Proceedings of 28th International
1171: Conference on Very}, 2002.
1172:
1173: \bibitem{GR93}
1174: J. Gray and A.~Reuter, {\em Transaction Processing: Concepts and
1175: Techniques}, Morgan Kaufmann, San Francisco, 1993.
1176:
1177: \bibitem{processor}{\sl Intel Pentium 4 and Intel Xeon Processor
1178: Optimization Reference
1179: Manual}. \url{http://www.intel.com/design/Pentium4/manuals/index.htm}
1180: \bibitem{RAM}{\sl Intel PC SDRAM Specification, Revision 1.7, November
1181: 1999}.
1182:
1183: \bibitem{LostCircuits}
1184: {\sl Lost Circuits} web site:
1185: ``Latency vs. Bandwidth, a performance analysis'',
1186: \url{http://www.lostcircuits.com/memory/latency/2.shtml};
1187: ``Inside the EDDR Chip:
1188: Combining DRAM storage and SRAM speed'',
1189: \url{http://www.lostcircuits.com/memory/eddr/}, Nov.~27, 2000;
1190: and ``High Performance DDR DIMMs'',
1191: \url{http://www.lostcircuits.com/memory/ddr2/}, July~17, 2001.
1192:
1193: \bibitem{Patterson98} K.~Keeton, D.A.~Patterson, Y.Q.~He R.C.~Raphael
1194: and W.E.~Baker ``Performance Characterization of a Quad {P}entium
1195: Pro SMP using OLTP Workloads'' {\sl Porc. of the Int. Syp. on
1196: Computer Architecture}, p0p.~15--26, Barcelona, Spain, 1998.
1197:
1198: \bibitem{Knuth98}
1199: D.E. Knuth, {\sl Sorting and Searching, The Art of Computer
1200: Programming}, vol.~3, second edition,
1201: Addison Wesley, Reading, MA, 1998
1202:
1203: \bibitem{Manegold99}
1204: S.~Manegold, P.A.~Boncz and M.L.~Kersten,
1205: ``Database architecture
1206: optimized for the new bottleneck: Memory access'', {\sl VLDB '99,
1207: Proc. of 25th Intl. Conf.
1208: Very Large Databases}, 1999.
1209:
1210: % \bibitem{ManegoldCWI99}
1211: % S. Manegold, P.A.~Boncz and M.L.~Kersten,
1212: % ``Optimizing main-memory join on modern hardware'',
1213: % Technical report INS-R9912, Centruum voor Wiskunde Informatica (CWI),
1214: % Netherlands, 1999.
1215:
1216: \bibitem{Manegold00}
1217: S.~Manegold, P.A.~Boncz and
1218: M.L.~Kersten,
1219: ``What Happens During a Join? Dissecting CPU and Memory Optimization
1220: Effects'',
1221: {\sl VLDB 2000, Proceedings of 26th International Conference on Very
1222: Large Data Bases, September 10-14, 2000, Cairo, Egypt},
1223: {2000}, pp.~339--350.
1224:
1225: \bibitem{Manegold02}
1226: S. Manegold, P.A.~Boncz and M.L.~Kersten,
1227: ``Optimizing main-memory join on modern hardware'',
1228: {\sl IEEE Transactions on Knowledge and Data Engineering}~{\bf 14}(4),
1229: 2002, pp.~709--730.
1230:
1231: % Abstract
1232: % In the past decade, the exponential growth in commodity CPUs speed has
1233: % far outpaced advances in memory latency. A second trend is that CPU
1234: % performance advances are not only brought by increased clock rate, but
1235: % also by increasing parallelism inside the CPU. Current database
1236: % systems have not yet adapted to these trends, and show poor
1237: % utilization of both CPU and memory resources on current hardware. In
1238: % this article, we show how these resources can be optimized for large
1239: % joins and translate these insights into guidelines for future database
1240: % architectures, encompassing data structures, algorithms, cost
1241: % modeling, and implementation. In particular, we discuss how vertically
1242: % fragmented data structures optimize cache performance on sequential
1243: % data access. On the algorithmic side, we refine the partitioned
1244: % hash-join with a new partitioning algorithm called radix-cluster,
1245: % which is specifically designed to optimize memory access. The
1246: % performance of this algorithm is quantified using a detailed
1247: % analytical model that incorporates memory access costs in terms of a
1248: % limited number of parameters, such as cache sizes and miss penalties.
1249: % We also present a calibration tool that extracts such parameters
1250: % automatically from any computer hardware. The accuracy of our models
1251: % is proven by exhaustive experiments conducted with the Monet database
1252: % system on three different hardware platforms. Finally, we investigate
1253: % the effect of implementation techniques that optimize CPU resource
1254: % usage. Our experiments show that large joins can be accelerated almost
1255: % an order of magnitude on modern RISC hardware when both memory and CPU
1256: % resources are optimized.
1257:
1258: \bibitem{DDR}Micron Technology, Inc.,
1259: {\sl 256Mb: x4, x8, x16~DDR~SDRAM}, Revision~C, April, 2001,
1260: \url{http://download.micron.com/pdf/datasheets/dram/256Mx4x8x16DDR_D.pdf}
1261:
1262: \bibitem{Negri85}
1263: M.~Negri and G.~Pallagatti, ``Join During Merge: An Improved Sort
1264: Based Algorithm'', {\sl Information Processing Letters}~{\bf 21}(1),
1265: 1985, pp.~11--16.
1266:
1267: \bibitem{AlphaSort}
1268: Nyberg, C., Barclay, T., Cvetanovic, Z., Gray, J., and Lomet, D.
1269: ``AlphaSort: A Cache-Sensitive Parallel External Sort'', {\sl VLDB
1270: Journal}~{\bf 4}(4), pp.~603--627, 1995.
1271:
1272: \bibitem{Patel96}
1273: J. M. Patel and D. J. DeWitt, ``Partition Based SpatialMerge Join'',
1274: {\sl Proc. of the ACM SIGMOD Conference on Management of Data}, 1996.
1275:
1276: \bibitem{Zhang02}
1277: D. Zhang and V.J. Tsotras and B. Seeger,
1278: ``Efficient Temporal Join Processing using Indices'',
1279: {\sl Proc. of 18th International Conference on Data Engineering (ICDE)},
1280: 2002.
1281:
1282: % Available at
1283: % \url{http://research.microsoft.com/barc/SortBenchmark/AlphaSort.rtf}.
1284:
1285: \bibitem{Rao99}
1286: J.~Rao and K.A.~Ross,
1287: ``Cache Conscious Indexing for Decision-Support in Main Memory'',
1288: {\sl VLDB 1999, 25th Intl. Conf.
1289: Very Large Databases}, 1999, pp.~78--89.
1290:
1291: % hash-join algorithm (variation of GRACE)
1292: \bibitem{Shapiro86}
1293: L.D. Shapiro,
1294: ``Join Processing in Database Systems with Large Main Memories'',
1295: {\sl ACM Transactions on Database Systems}~{\bf 11}(3), 1986,
1296: pp.~239--264.
1297:
1298: \bibitem{ShatdalEtAl94}
1299: A.~Shatdal, C.~Kant and J.F.~Naughton,
1300: ``Cache Conscious Algorithms for Relational Query Processing'',
1301: {\sl VLDB 1994,Proceedings of the 20th VLDB Conference},
1302: 1994, pp.~510--521.
1303:
1304: \bibitem{TLZT97}
1305: P. Trancoso J.L.~Larriba-Pey, Z.~Zhang and J.~Torellas, ``The Memory
1306: Performance of DSS Commercial Workloads in Shared-Memory
1307: Multiprocessors'', {\sl Int. Symp. on High Performance Computer
1308: Architecture}, San Antonio, TX, USA 1997.
1309:
1310: \bibitem{WulfMcKee95}
1311: W.A. Wulf and S.A. McKee.
1312: ``Hitting the memory wall: Implications of the obvious'',
1313: {\sl ACM Computer Architecture News}, 23(1):20--24, 1995.
1314:
1315: % Interesting paper for cache conscious search trees:
1316: % Jun Rao, Kenneth A. Ross: Making B^+-Trees Cache Conscious in
1317: % Main Memory. SIGMOD Conference 2000: 475-486
1318:
1319: \end{thebibliography}
1320:
1321: \end{document}
1322: