cs0308004/dpg.tex
1: \documentclass[10pt,twocolumn]{article}
2: % Must be 10 point, twocolumn (ICDE)
3: 
4: \usepackage{latex8}  % ICDE
5: \usepackage{times}
6: \usepackage{url}
7: \usepackage{comment}
8: \usepackage{epsfig}
9: \usepackage[compact]{titlesec}
10: 
11: % \usepackage{fullpage}
12: % \usepackage{geometry}
13: % \geometry{lmargin=2cm,rmargin=2cm}
14: \usepackage{fancyvrb}
15: % \usepackage[small,compact]{titlesec}
16: % \usepackage[compact]{titlesec}
17: % \usepackage[light,first,bottomafter,outline]{draftcopy}
18: 
19: \DefineShortVerb{\#}
20: 
21: \includecomment{comment}
22: \specialcomment{commentbf} {\begingroup\bf} {\endgroup}
23: \specialcomment{commentit} {\begingroup\it\footnotesize} {\endgroup}
24: \specialcomment{commentold} {\begingroup\it\footnotesize{\bf OLD:}} {\endgroup}
25: \excludecomment{comment}
26: \excludecomment{commentbf}
27: \excludecomment{commentit}
28: \excludecomment{commentold}
29: 
30: \title{
31: DPG:  A Cache-Efficient Accelerator for Sorting and for Join Operators
32: }
33: \author{
34:    Gene Cooperman\thanks{~$\,$This work was partially supported by
35:                 the National Science Foundation under Grant CCR-0204113, and
36:                 by the
37:                 Institute for Complex Scientific Software
38:                 (ICSS, http://www.icss.neu.edu/).},
39:  Xiaoqin Ma$^*$ and Viet Ha Nguyen$^*$
40: \\
41: Northeastern University\\
42: Boston, MA 02115, USA\\
43: \{gene,xqma,vietha\}@ccs.neu.edu
44: }
45: 
46: \begin{document}
47: 
48: 
49: 
50: \maketitle
51: 
52: \begin{abstract}
53: Retrieval of records on disk is well-known to be at the heart of many
54: database problems.  We show that the corresponding movement of records in main
55: memory has now become a severe bottleneck for many database
56: operations.  This is due to the stagnating latency of main memory,
57: even while CPU speed, main memory bandwidth, and disk speed all
58: continue to improve.  As a result, record movement has become the
59: dominant cost in main memory sorting.
60: 
61: We present a new algorithm for fast record retrieval, {\em
62:   distribute-probe-gather}, or DPG.  DPG has important applications
63: both in sorting and in joins.  Current main memory sorting algorithms
64: split their work into three phases: extraction of key-pointer pairs;
65: sorting of the key-pointer pairs; and copying of the original records
66: into the destination array according the sorted key-pointer pairs.
67: The copying in the last phase dominates today's sorting time.  Hence,
68: the use of DPG in the third phase provides an accelerator for existing
69: sorting algorithms.
70: 
71: DPG also provides two new join methods for foreign key joins: DPG-move
72: join and DPG-sort join.  The resulting join methods with DPG are
73: faster because DPG join is cache-efficient and at the same time DPG
74: join avoids the need for sorting or for hashing. The ideas
75: presented for foreign key join can also be extended to faster record
76: pair retrieval for spatial and temporal databases.
77: 
78: 
79: \end{abstract}
80: 
81: \section{Introduction}
82: \label{sec:intro}
83: 
84: Two important database operations are sorting and joins.  These
85: operations have three primary hardware-related costs: disk access, CPU
86: operation and main memory access.  The growth of main memory in
87: current computers implies that main memory databases become more
88: popular. For main memory databases, the bottleneck moves from disk to
89: main memory and the cost for disk access is not an issue anymore.
90: 
91: Further, the growing CPU-memory gap implies that CPU costs represent
92: an increasingly small portion of the total time.  This has been borne
93: out by several studies of
94: DBMs~\cite{Ailamaki99,BGB98,Patterson98,TLZT97}.  The impact of the
95: CPU-memory gap was popularized by the paper of Wulf and
96: McKee~\cite{WulfMcKee95} on the {\em memory wall}.
97: 
98: 
99: The diminishing role of the CPU in the total running time is partially
100: accounted for by increasing CPU speeds and greater on-chip functional
101: parallelism.  In part, it is also accounted for because most CPUs
102: today implement non-blocking caches and hardware prefetch in order to
103: overlap CPU execution with memory access~\cite{processor}.  Hence,
104: memory access becomes the bottleneck.  Therefore, we follow the
105: example of previous
106: researchers~\cite{Ailamaki99,Boncz99,Manegold99,Manegold00,Manegold02,ShatdalEtAl94}
107: in concentrating on main memory as the bottleneck.
108: 
109: At the heart of this memory bottleneck lies the {\em record retrieval
110:   problem}: the problem of copying records from a source array into a
111:   destination array according to a new ordering.  In a typical
112:   application, one will be given a source data file, a sequence of
113:   {\em record ids} (rids) for that data file, and a destination file.
114:   The task is to copy the source records into the destination file in
115:   the order specified by the sequence of rids.  Fast record retrieval
116:   is the key to faster sorting and faster joins.
117: 
118: A standard approach for data retrieval  accesses the records of the
119: source data file directly according to the sequence of the rids.  This
120: implies random access to the main memory.  This, for example, is what
121: was done in AlphaSort~\cite{AlphaSort} and
122: SuperScalarSort~\cite{SuperScalarSort}, the current record holders for
123: the Datamation sorting challenge~\cite{Datamation}.  However, the cost
124: of such random access has now become the dominant cost in main memory
125: sorting, since the CPU-memory gap has widened still further since the
126: original work on sorting.  (In fact, AlphaSort and SuperScalarSort
127: sort data on disk, but the size of the data file, 100~MB, is small
128: enough that the disk access consists solely of reading the source data
129: file from disk, and writing to a destination data file.)
130: 
131: Random access is harmful not just in disk resident databases, but also in main
132: memory resident databases.  Current DRAM technologies, such as DDR RAM and
133: Rambus RAM (RDRAM), extract a large latency penalty for any
134: non-sequential access to RAM.  This is because the memory chips are
135: divided into memory pages of several kilobytes, and there is a latency
136: penalty for switching to a new memory page~\cite{LostCircuits,DDR}.
137: 
138: Random access to RAM also harms performance in a second manner.
139: Random access incurs a heavy penalty when large cache blocks are used.
140: On a cache miss, the entire cache block is loaded into memory.  If the record size
141: is small compared to the cache block size, then there is a large
142: overhead  to load the entire cache block.  For example, for a cache
143: miss on a Pentium~4 with DDR-266 RAM, approximately 60~ns are spent
144: loading the cache block, and approximately 60~ns are spent waiting on
145: the latency of DDR RAM.  The trend is toward larger cache blocks.  The
146: 128~byte L2 cache blocks of the Pentium~4 are four times larger than
147: those of the Pentium~III.  The IBM Power4 processor goes still further 
148: using 512~byte L3 cache blocks.
149: 
150: The solution to avoid these latency penalties in main memory is to
151: access main memory sequentially.  This is similar in spirit to the
152: way in which traditional databases strongly prefer to access disk
153: sequentially.  In analogy with operation on disk, two-pass algorithms
154: are a key for faster main memory performance.
155: 
156: The DPG-based sorting algorithms immediately yield faster sorting
157: algorithms.  Both AlphaSort and SuperScalarSort sort their data
158: essentially in three phases:  extraction of key-pointer pairs;
159: sorting of the key-pointer pairs; and copying of the original records
160: into the destination array according to the sorted key-pointer
161: pairs.  The last phase is essentially record retrieval.
162: 
163: A re-implementation of AlphaSort and SuperScalarSort on a IBM p690
164: Turbo shows that the record retrieval phase now dominates the running
165: time.  With the DPG record retrieval algorithm replacing the standard
166: record retrieval algorithm, we immediately produce a faster sorting
167: algorithm.  In the re-implementation of SuperScalarSort, the versions
168: with DPG as an accelerator are 27\% faster.
169: 
170: 
171: % {\bf The DPG algorithm is XXX\% faster than random access to main
172: %  memory ond a YYY.}
173: 
174: A direct consequence of faster record retrieval is faster main memory
175: sort-merge joins.  For example, in implementing sort-merge join using
176: SuperScalarSort, we find that the use of DPG sort instead of
177: SuperScalarSort results in a 27\% faster join algorithm..  The DPG
178: record retrieval phase in isolation is 48\% faster than traditional
179: record retrieval algorithm.
180: 
181: 
182: 
183: Finally, we apply DPG algorithm in the context of {\em foreign key
184: joins}.  In a foreign key join the join key is the same as the foreign
185: key.  we assume that one relation has an secondary index on the join
186: key.  Foreign key joins have the advantage that one need only
187: rearrange the records of one of the files.  This is in distinction to
188: sort-merge join and hash join, which both require to rearrange each of
189: the two files into a new file with join key values in sorted order.
190: 
191: Foreign key joins require less record retrieval because it is possible
192: to first extract {\em join triples}, ($k$, rid$_R$, rid$_F$), where
193: $R$ and $F$ are the two relations and $k$ is the join key value.
194: Assume that $R$ references a foreign key of~$F$.  To construct a join
195: triple, do file scan of~$R$, for each record of~$R$, its key~$k$ is
196: extracted.  The secondary B-tree index of the join key on~$F$ is then
197: used to derive the corresponding record id, rid$_F$, with the key
198: value~$k$. The standard index lookup is very expensive, we propose a
199: cache efficient B-tree batch lookup to generate the join triples.
200: 
201: Join triples reduce foreign key join to record pair retrieval.  The
202: join triples specify the record pairs, (rid$_R$, rid$_F$), to be
203: retrieved.  One can re-order the records of~$F$ to match the ordering
204: of rid$_R$ in the sequence of record pairs. Recall that, we do file
205: scan for~$R$, so rid$_R$s in the join triples are in sorted order.
206: Alternatively, one can sort the join triples according to rid$_F$, and
207: re-order the records of~$R$ to match the ordering of rid$_F$ in the
208: sorted rid pairs.  In either situation, there are fewer record
209: retrievals, and so foreign key join is faster than a general join.
210: 
211: 
212: The ideas of faster record retrieval can also be applied to the
213: general case of record pair retrieval.  Many algorithms for spatial
214: join~\cite{ArgeEtAl98,Patel96} and temporal join~\cite{Zhang02}
215: produce record pairs.  Unlike equijoin, there is no single search key,
216: and so record retrieval is more difficult.  The ideas of this paper
217: are described in terms of foreign key join.  However, it is even
218: simpler to translate the ideas into record pair retrieval, since an
219: initial join triple extraction is not required.
220: 
221: 
222: 
223: 
224: 
225: \section{Distribute-Probe-Gather}
226: \label{sec:DPG}
227: 
228: The {\em distribute-probe-gather algorithm} (DPG) is a record retrieval
229: algorithm.  Given a data file of records
230: and a sequence of record ids (rids) for the file, the goal is to copy
231: the records into a destination file with the property that
232: the ordering of records in the destination file corresponds to
233: the ordering of the rids in the given sequence.
234: 
235: For example, let the  source file, #R#, be the array of records with a
236: secondary \rm{B+} tree index on the attribute #A#. We want to
237: place the records of #R# in sorted order according to the values of~#A#. The
238: sequence of record ids, #S_rid#, at leaf nodes of the \rm{B+} tree is a list of record
239: ids in sorted order according to the value of A. Then, the destination file, #D#,
240: below, will contain the corresponding records in sorted order according to the
241: value of~#A#.
242:   
243: {\tt
244: \begin{quote}
245: for each i,
246:   D[i] = R[S\_rid[i]]
247: \end{quote}
248: }
249: 
250: In the case that the sequence of rids is a permutation of the rids for
251: the data file, the sequence of rids acts as a permutation vector.  The destination
252: file is then a permutation of the records in the input file.
253: 
254: Note, however, that the DPG algorithm is not limited to permutations
255: of data. In the case of join, one record from one relation may match more than one
256: records from another relation. In such cases portion of the original records must
257: be duplicated. The DPG algorithm also works for this case.
258: 
259: 
260: \subsection{\bf Algorithm}
261: 
262: The input for the algorithm is: a list of rids, #RID_LIST#, and a
263: data file of records, #INPUT#.  The DPG algorithm partitions the rids,
264: #RID_LIST#, into separate runs.  It also partitions the data records,
265: #INPUT#, into separate runs.  The algorithm makes two passes over the
266: record ids, #RID_LIST#, and two passes over the records of the data
267: file, #INPUT#.
268: 
269: The ideas are presented in the context of main memory databases.  We
270: assume that neither the sequence of rids, #RID_LIST#, nor the data
271: file, #INPUT#, fit in cache.  The DPG algorithm applies equally well
272: as an external data retrieval algorithm between disk and main memory.
273: 
274: The spirit of the DPG algorithm is: try to transform arbitrary memory
275: access patterns into sequential memory access patterns; where
276: arbitrary memory access patterns are unavoidable, we try to divide the
277: data into small partitions that fit into the cache.
278: 
279: For a sequential access pattern, it is easy to maintain a buffer in
280: cache.  In the DPG algorithm we need to read many streams
281: simultaneously, and maintain a buffer for each stream. Hence, we want
282: to keep the buffer as small as possible while maintaining reasonable
283: efficiency.  We define a buffer in cache to consist of two cache
284: blocks.  When a cache block is full, it is written back to main memory
285: and a new cache block is loaded from main memory into cache. On
286: Pentium~4, this is done automatically by the hardware prefetch
287: function unit if the access to main memory is sequential. On other
288: architectures without the hardware prefetch function unit, the
289: software instructions, $cflush$ and $prefetchnta$, are needed to
290: maintain the buffers.
291: 
292: 
293: \noindent
294: \paragraph{\bf Constraints:} The data records and rids are split into
295: runs of length~$L$.  The run length~$L$ is chosen based on two
296: constraints.  First, the cache must be able to simultaneously hold
297: both one run of data records of length~$L$ and one single buffer for the
298: corresponding run of rids.  (This constraint applies in the second
299: phase of DPG.)  Second, the cache must simultaneously be able to hold
300: a buffer for each run. (This constraint applies in the first phase and
301: in the third phase of DPG.)  These constraints are typical of the
302: constraints for two-pass algorithms, such as external sorting.
303: 
304: 
305: 
306: If the data file has $N$~records, then the data file is partitioned
307: into $N/L$ sets of consecutive records.  Assuming an rid consists of a
308: page id and offset on that page, the high order bits of the page
309: number can be used to efficiently identify the particular partition to
310: which the rid belongs.  This assumes that the number of pages in a
311: partition is a power of two, which can be satisfied by appropriate
312: choice of~$L$. For the sake of clarity, we assume the rids
313: values are in the rage of $0$ and $N$. 
314: 
315: \paragraph{\bf Three Phases:} There are three phases in the DPG
316:  algorithm. The three phases are also illustrated by pseudo-code in
317:  Figure ~\ref{fig:DPGalgo} and by the diagram of Figure~\ref{fig:DPG}.
318: 
319: \begin{enumerate}
320: \item{\bf Phase I.}  The first phase is the {\em Distribute}
321: phase. One {\em distributes} the rids of #RID_LIST# into appropriate
322: RID runs according to the values of the rids. The first RID run
323: contains the rid values in the range from $0$ to $L$, the second RID
324: run contains the rid values in the range from $L$ to $2L$, and so on.
325: Both the access to every RID run and the access to #RID_LIST# are
326: sequential. Therefore, one only needs to maintains a buffer in
327: cache for each RID run and a single buffer in cache for #RID_LIST#, At
328: the end, we form $N/L$ RID runs and each RID run is a permutation
329: vector. For example, the $i$-th RID run is a permutation vector in the
330: range from $(i-1)*L$ and $i*L$.
331: 
332:  
333: \noindent
334: \item{\bf Phase II.}  The second phase is the {\em Probe} phase.  In
335: this phase, we allocate a second, temporary data file, #INTERNAL#, in
336: main memory.  The temporary data file has the same size as the
337: original data file, #INPUT#, and is organized into the same number of
338: runs as the original data file. One then proceeds through each of the
339: runs of rids and each of the runs of #INPUT#.  The rids from the
340: $i$-th RID run are used to probe the $i$-th #INPUT# run and the
341: corresponding records are copied into the $i$-th #INTERNAL# run.  
342: 
343: At the end, the $i$-th #INTERNAL# run contains the same records as the
344: $i$-th #INPUT# run, but the order of records in the #INTERNAL# run is
345: organized according to the $i$-th RID run. Both the $i$-th RID run and the $i$-th
346: #INTERNAL# run are accessed sequentially. The $i$th #INPUT# run is
347: accessed randomly, but it can fit in cache. Hence, every time, one loads the
348: $i$-th #INPUT# run entirely into cache and maintains two buffers in cache: one
349: for the $i$th RID run and the other for the $i$-th #INTERNAL# run. 
350: 
351: 
352: \noindent
353: \item{\bf Phase III.}  The third phase is the {\em Gather} phase.
354: This is an inverse of the {\em Distribute} phase and is similar to the
355: merge phase of external sorting.  In the {\em Distribute} phase, the
356: rids are distributed into runs.  As a result of the {\em Probe} phase,
357: the records in a given #INTERNAL# run are now in the same order as the
358: rids in the corresponding RID run created in Phase I.  Hence, it
359: suffices to gather (merge) the records from the #INTERNAL# runs in
360: exactly the same order as the order of the rids in #RID_LIST#.  More
361: precisely, if the $i$-th rid was distributed to the $j$-th RID run
362: during the {\em Distribute} phase, then at the $i$-th step of the {\em
363: Gather} phase, the next record from the $j$-th #INTERNAL# run is
364: copied to the destination array. One maintains several buffers in
365: cache: one single buffer for #RID_LIST#, one single buffer for the
366: destination array, and a buffer for each #INTERNAL# run.
367: \end{enumerate}
368: 
369: \subsection{\bf Example}
370: The DPG algorithm is presented more formally in pseudo-code in
371: Figure~\ref{fig:DPGalgo}.  The input of the algorithm is an array,
372: #RID#, of {\em rids} (record ids) and a data file, #INPUT_REC#, of
373: records.  It is desired to retrieve the records from #INPUT_REC# in
374: the order corresponding to #RID#.  The retrieved records are then
375: written to #OUTPUT_REC#.
376: 
377: The three phases of the DPG algorithm are illustrated by an example in
378: Figure~\ref{fig:DPG}.  The input in this example is the leaf nodes of
379: a secondary B+-tree index. The input contains a sequence of key-rid
380: pairs sorted according to the key values.  The output is a sequence of
381: records sorted according to the key values of the secondary index. The
382: output will either be stored again on disk or else pipelined to the
383: next stage. The ability of the DPG algorithm to take advantage of
384: pipelining is an important feature for sorting and joins.
385: 
386: 
387: The letters #a#, #b#, #c#, ... are used to indicate the sorted keys on
388: the leaf nodes of the index. So (#a#, 5) indicates that the record with the rid
389: value of~5 has a key value of #a#.  The first two rows are for
390: Phase~I, the next three rows for Phase~II, and the following
391: three rows for Phase~III.  The horizontal rectangle of the first
392: row represents a sequence of key-rid pairs sorted according to
393: the key values.  The second row represents the runs of rids into which
394: the first row is partitioned.  Similarly, the third row again
395: represents the runs of rids, but now as part of Phase~II.  The
396: fourth row is the partitioned runs of input records, and so
397: on. There are~12 elements and~3 runs in the example and each run contains
398: ~4 elements. 
399: 
400: During {\bf Phase I}, one has a sequence of key-rid pairs sorted
401: according to the key values and will distribute them into appropriate
402: RID runs.  The first RID run will contain rid values from~0 to~3, the
403: second RID run will contain rid values from~4 to~7, and the third RID
404: run will contain rid values from~8 to~11. Upon reading the sequence of
405: pairs from the first row, one places the rids in their proper runs. For
406: example, the rid~5 goes to the second RID run, the rid~7 goes to
407: the second RID run, the rid~3 goes to the first RID run, the rid~8 
408: goes to the third RID run, and so on. The third row in the figure
409: presents the end of Phase I. In this
410: phase, we do sequential read on a single stream and sequential writes on multiple streams.
411: 
412: During {\bf Phase II}, one copies the original data file, #INPUT#, to
413: a temporary data file, #INTERNAL#.  The input to this phase is the
414: third row.  In the third row, each RID run has rid values that
415: correspond to one contiguous range of the #INPUT# file, an #INPUT#
416: run.  For example, for the first RID
417: run, we will load the first #INPUT# run into the cache.  The first
418: #INPUT# run consists of the first 4 contiguous records. The first rid
419: in the first RID run is~3 and $INPUT[3]$ is in the cache, so one finds
420: this record, #INPUT[3]#, in cache and copies it to the buffer
421: maintained in cache for the first #INTERNAL# run, and so on. At the
422: end of Phase~II, the records in the first #INPUT# run with key values in the sequence of i, l, f, c  are reordered as
423: records in the first #INTERNAL# run with key values in the sequence of c, f, i, l. The reordering is done according to a permutation specified by the first RID run, 3, 2, 0, 1. 
424: 
425: 
426: During {\bf Phase III}, we will use the original rid sequence in the list of
427: key-rid pairs to {\em gather} (merge) records from all #INTERNAL# run. For
428: example, upon reading~5, we go to the second #INTERNAL# run to {\em gather}
429: the record; upon reading~7, we go to the second #INTERNAL# run to {\em gather}
430: the record; upon reading~3, we go to the first #INTERNAL# run to {\em gather}
431: the record; upon reading~8 we go to the third #INTERNAL# run to {\em gather}
432: the record, and so on. For this phase, we maintain a buffer for the key-rid list, a buffer for #OUTPUT# in cache and
433: a buffer for each #INTERNAL# run. All buffers are in cache.
434: 
435: 
436: 
437: 
438: \begin{figure}[htb]
439: \setbox0\vbox
440: % families tt, courier, helvetica available built-in
441: {\small\begin{Verbatim}[commandchars=\\\{\},
442:   fontfamily=tt,
443:   commentchar=\%,codes={\catcode`$=3\catcode`^=7}]
444: % $ (for sake of emacs)
445: Let $L\gets$ (CACHE_SIZE/2)
446: Let $N\gets$ NUM_RECORDS
447: Let NUM_RUNS $\gets N/L$
448: 
449: \rm INPUT:  integer \tt RID[NUM_RIDS],
450:     \rm record \tt INPUT_REC[NUM_RECORDS];
451: \rm OUTPUT: record \tt OUTPUT_REC[NUM_RIDS];
452: \rm PARAMETERS: \rm integer \tt RID_RUN[NUM_RUNS][]\tt,
453:         \rm record \tt RECORD_RUN[NUM_RUNS][];
454: 
455: //Phase I:  {\it Distribute} RID array into runs
456: //          with each run of length $L$
457: % For $i$ = 1, ..., NUM_RUNS do
458: %   Initialize write buffer of RID_RUN[$i$]
459: For each rid, $r$, in RID do \{
460:  Set {\it run\_num} = $\lceil{}r/L\rceil$;
461:  Append $r$ to RID_RUN[{\it run\_num}]
462: \}
463: 
464: //Phase II:  {\it Probe} partitions of INPUT_REC
465: % (Conceptually partition INPUT_REC into partitions of length $L$ and
466: %  for each partition, $i$, of INPUT_REC, probe for each rid in RID_RUN[$i$])
467: For $i$ = 1, ..., NUM_RUNS do \{
468: %   Initialize read buffer for RID_RUN[$i$]
469:   Read into memory all records from
470:    INPUT_REC[$(i-1)\!*\!\!L\! +\! 1$] to INPUT_REC[$i\!*\!\!L$]
471:   Allocate memory for $L$ records
472:     to be stored in RECORD_RUN[i]
473:   For each rid, $r$, in RID_RUN[$i$]
474:     Append INPUT_REC[$r$] to RECORD_RUN[$i$]
475:   Write out RECORD_RUN[$i$] to disk
476: \}
477: 
478: //Phase III: {\it Gather} records from RECORD_RUN[]
479: //  into OUTPUT_REC in same order as RID[]
480: % For $i$ = 1, ..., NUM_RUNS do
481: %   Initialize read buffer of RECORD_RUN[$i$]
482: For each rid, $r$, in RID do \{
483:   Set {\it run\_num} = $\lceil{}r/L\rceil$;
484:   Read next record from RECORD_RUN[{\it run\_num}]
485:   Append record to OUTPUT_REC
486: \}
487: \end{Verbatim}
488: }
489: \centerline{\fbox{\box0}}
490: \caption{Distribute-Probe-Gather (DPG) Algorithm}\label{fig:DPGalgo}
491: \end{figure}
492: 
493: 
494: % or , which the thicker horizontal rectangles represent the records of
495: % a data file.
496: 
497: \begin{figure*}[!hbt]
498: \begin{center}
499:   \epsfig{file=figure1.eps,width=7.0in,clip=}
500:   \caption{Distribute-Probe-Gather (DPG)
501:     (also, see pseudo-code in Figure~\ref{fig:DPGalgo})}
502:   \label{fig:DPG}
503: \end{center}
504: \end{figure*}
505: 
506: 
507: 
508: 
509: \subsection{\bf Data Skew}
510: Implicit in the description of the DPG algorithm is that the input
511: sequence of rids is distributed uniformly among the set of all rids of
512: the input data file.  This is always the case when DPG is applied to retrieve records
513: after sorting key-pointer pairs.  In that situation, the
514: key-pointer pairs act as a permutation vector to permute the 
515: records in the input data file.
516: 
517: If the input sequence of rids is not uniformly distributed, then some
518: RID runs will be larger than other runs.  As a consequence, in
519: Phase~II, when the partition of the temporary data file (the partition
520: of #RECORD_RUN# in Figure~\ref{fig:DPGalgo}) may be larger than the size
521: of the cache.  If only a few of the partitions of the temporary data
522: file are larger than cache then the overall running time is not
523: greatly affected.
524: 
525: If there is a great deal of data skew and many of the temporary
526: partitions #RECORD_RUN# are larger than cache, then the $N/L$
527: partitions of the input sequence of rids must be chosen on some other
528: basis than the high order bits of the page number.  In such cases, one
529: can invoke the data skew handling techniques of DeWitt
530: et~al.~\cite{DeWitt92}.  Their solution, reformulated in our context,
531: is to sample the rids from the rid sequence.  The sampled set of rids
532: is then sorted, and partitions of the rids are chosen so as to evenly
533: partition the sampled set.
534: 
535: 
536: \section{Sorting}
537: \label{sec:sorting}
538: 
539: As discussed in the introduction, DPG acts as an accelerator for many
540: main memory sorting algorithms.  Recall that main memory sorts
541: typically proceed in three phases:
542: \begin{enumerate}
543: \item extraction of key-pointer;
544: \item sorting of the key-pointer pairs; and
545: \item copying of the original records
546:   into the destination array according to the sorted key-pointer
547:   pairs.
548: \end{enumerate}
549: 
550: AlphaSort~\cite{AlphaSort} and SuperScalarSort~\cite{SuperScalarSort}
551: are examples of this three-phase sorting paradigm.  Both sorting
552: algorithms can be considered as main memory sorting algorithms.
553: 
554: In principle, the sorting algorithms are single-pass disk-based
555: sorting algorithms.  Both sorting algorithms were introduced as an
556: answer to the Datamation Sorting Challenge~\cite{Datamation}.  The
557: Datamation challenge dictates that one is given one million records of
558: 100~bytes.  Each record has a 10~byte key.  The keys are uniformly
559: distributed.  At the time of the Datamation Challenge, external
560: sorting algorithms were required.  On today's computers, the data file
561: of 100~MB easily fits in main memory.
562: 
563: Hence, the only disk-related portion of the Datamation Challenge is to
564: overlap disk~I/O with CPU operation.  Disk striping has the potential
565: to provide very fast disk~I/O.  This occurs because the disks are
566: accessed in parallel.  In this situation, main memory data retrieval
567: becomes the bottleneck.
568: 
569: The DPG algorithm pushes back this main memory bottleneck.  By using
570: DPG for data movement, the largest cost of main memory sorting is
571: reduced.  We have reimplemented the main memory portion of
572: SuperScalarSort, both with and without DPG.
573: 
574: 
575: \section {Join Methods with DPG}
576: \label{sec:joinAlgoDPG}
577: 
578: We use the ideas of DPG to present three new join algorithms: 1.c, 2 and~3.
579: Algorithms 1.a, 1.b and~4, will be included in the experimental
580: section~\ref{sec:joinExperiments} for completeness.
581: \begin{enumerate}
582: \item Sort-Merge Join
583: \begin{enumerate}
584: \item {Sort-Merge Join with AlphaSort (sort based on~\cite{AlphaSort})}
585: \item {Sort-Merge Join
586:     with SuperScalarSort (sort based on~\cite{SuperScalarSort})}
587: \item {Sort-Merge Join with DPG Sort}
588:   (sort based on DPG; see~\ref{sec:sortMergeJoin})
589: \end{enumerate}
590: \item {DPG-Sort Join} (see Section~\ref{sec:foreignKeyJoin})
591: \item {DPG-Move Join} (see Section~\ref{sec:foreignKeyJoin})
592: \item {Radix Join} (from~\cite{Manegold02})
593: \end{enumerate}
594: 
595: 
596: \subsection{Sort-Merge Join with DPG Sort}
597: \label{sec:sortMergeJoin}
598: 
599: The well-known Sort-Merge join was introduced by Blasgen and
600: Eswaran~\cite{Blasgen77}.  There are two steps in Sort-Merge joins:
601: sort two relations on the join key and scan the sorted relations to do
602: a merge on the join key.
603: 
604: Applying DPG sort at the first step provides a new faster sort-merge
605: join method, Sort-Merge join with DPG Sort. In
606: Sections~\ref{sec:sortMergeJoinExperiments}
607: and~\ref{sec:joinExperiments}, we experimentally compare different
608: versions of sort-merge join, according to the sorting methods used.
609: Specifically, we consider using DPG sort, AlphaSort and
610: SuperScalarSort for the sorting step.
611:  
612: 
613: \subsection{Foreign Key Join with DPG}
614: 
615: We next consider joins in which the join key is a foreign key, and it
616: has an index.
617: We denote by~$R$ a non-indexed relation.  We denote
618: by~$F$ an indexed relation.  The notation is motivated by the example
619: of a foreign key join.  In a foreign key join, the join key is the
620: same as the foreign key.  So, the join key is a set of attributes in
621: the relation~$R$ that refers to a foreign key from relation~$F$.
622: 
623: A {\em join triple} is a triple ($k$, rid$_R$, rid$_F$), such that $k$
624: is a key value, rid$_R$ is the rid of a record from~$R$ with key~$k$,
625: and rid$_F$ the rid of a record from~$F$ with key~$k$.
626: 
627: There are three steps in a foreign key join algorithm with DPG.  The
628: first step is to construct join triples.  The second step is to use
629: the join triples to copy one of the two relations into a temporary
630: file according to an order derived from the join triples.  The third
631: step is to join the temporary file with the remaining relation.
632: 
633: We describe the second and third steps initially in
634: Section~\ref{sec:foreignKeyJoin}.  We then return to the more
635: technical problem of efficiently constructing join triples in
636: Sections~\ref{sec:joinTriples} and~\ref{sec:indexLookup}.
637: 
638: \subsubsection{Two Foreign Key Join Methods with DPG}
639: \label{sec:foreignKeyJoin}
640: 
641: This section describes two DPG join algorithms.  It assumes that one
642: has already constructed the join triples.  Some algorithms for
643: constructing the join triples are described later in
644: Sections~\ref{sec:joinTriples} and~\ref{sec:indexLookup}.
645: 
646: Assume that one has generated the join triples ($k$, rid$_R$,
647: rid$_F$).  It is now possible to ignore the key~$k$, and deal directly
648: with the rid pairs (rid$_R$, rid$_F$).  We wish to satisfy one of two
649: goals:
650: \begin{enumerate}
651: \item\label{item:moveF} {\bf DPG-Move join:}  Move the records of~$F$ into
652:   a temporary file according to the ordering of the records of~$R$.
653: \item\label{item:moveR} {\bf DPG-Sort join:}  Move the records of~$R$ into
654:   a temporary file according to the ordering of the records of~$F$.
655:   So, the rid pairs (rid$_R$, rid$_F$) will be sorted according to
656:   rid$_F$.
657: \end{enumerate}
658: 
659: First consider DPG-Move join.  We will see how to generate the join
660: triples in the order of rid$_R$.  This is done by scanning the records
661: of the relation~$R$ in file order (in order of increasing rid$_R$).
662: Therefore the rid pairs can be used directly as part of a DPG
663: algorithm.  The second component of the pair, rid$_F$, is the sequence
664: of rids according to which we want to move the records of~$F$.  This
665: algorithm does not require any sorting or hashing.  Hence, we call
666: it DPG-Move join.
667: 
668: Next consider DPG-Sort join.  In this version, we first sort
669: the rid pairs (rid$_R$, rid$_F$) according to the order of rid$_F$.
670: This is done, for example, with SuperScalarSort and DPG.  The
671: algorithm then reduces to record movement in which we wish to move the
672: records of~$R$ according to the ordering of~rid$_F$ in the sorted
673: sequence (rid$_R$, rid$_F$).
674: 
675: Note that DPG-Move join is preferred when the relation~$F$ is smaller.
676: DPG-Sort join is preferred when the relation~$R$ is smaller.
677: 
678: 
679: \subsubsection{Construction of Join Triples}
680: \label{sec:joinTriples}
681: 
682: The simplest solution for a foreign key join is to
683: do a file scan of~$R$, and for each record of~$R$ to extract the join
684: key and do an index lookup in the index of~$F$.  One can then join the
685: record of~$R$ with the corresponding record of~$F$.  This involves
686: random access, and is economical only if the join produces
687: very few records.  In particular, this will be the case only if the
688: number of records of~$R$ is small.
689: 
690: A better solution is to do a file scan of~$R$, and to use the index
691: on~$F$ to create join triples.  To construct the join triples, one scans the
692: relation~$R$.  For each record of~$R$, one extracts the corresponding
693: rid and associated join key~$k$.  One then looks up the key~$k$ in the
694: index of~$F$.  The index lookup yields the final element of the
695: triple, rid$_F$.
696: 
697: \subsubsection{Batch Lookup in Indexes}
698: \label{sec:indexLookup}
699: 
700: Note that the join triple is constructed at the cost of a file scan
701: of~$R$ and an index lookup in the index of~$F$ for each record of~$R$.
702: Usually an index on a data file is much smaller than the full data
703: file.  If the index fits entirely in cache, then the index lookup will
704: be significantly cheaper than the file scan.
705: 
706: Unfortunately, the indexes in many main memory databases generally do
707: not fit in cache.  In such cases, the index lookup in the index of~$F$
708: will dominate the costs.  For example, in a B+~tree indexing
709: $N$~records, if an internal node has $m$~children, then $\log_m N$
710: nodes of the B+~tree must be accessed.  Each such access will be a
711: random access in main memory.  Most of the random accesses imply a
712: cache miss.  The cost of so many random accesses makes a naive index
713: lookup uneconomical.  Even if the index is a hash index, at least one
714: random access in memory will be required.
715: 
716: Luckily, it is possible to execute the index lookup faster than the
717: above analysis would indicate.  This is because the construction of
718: the join triples requires many index lookups, with no intervening
719: record accesses.  For purposes of join triple construction, {\em batch
720: lookup} of keys in an index suffices.  By batch lookup, we assume that
721: an array of join keys is first extracted by scanning the data file
722: of~$R$.  The batch lookup then produces an array of rids for~$F$
723: through the use of the index on~$F$.  We show that the index lookups
724: can be reorganized into a two-pass algorithm.  Two such two-pass
725: algorithms are demonstrated: one for B+ tree indexes, and one for hash
726: indexes.
727: 
728: Note that batch lookup of rids in an index can be substantially faster
729: than individual lookup.  Rao and Ross had previously discussed cache
730: conscious indexes for main memory~\cite{Rao99}.  There they present
731: CSS-trees, which have better cache behavior than either B+~trees or
732: hash indexes.  However, they only consider individual lookup of keys
733: in an index, one at a time.  In their scenario, a second key is not
734: looked up until the index lookup of the first key has been resolved.
735: 
736: \paragraph{\bf Batch Lookup in B+ Tree Indexes}
737: 
738: For simplicity, we describe the two-pass lookup for an enhanced
739: B+~tree.  The notion of {\em enhanced B+~tree} was introduced by Rao
740: and Ross~\cite{Rao99}.  The idea is that all slots of a B+~tree node
741: are used.  This is similar to compact B-Trees~\cite{CS83} or to the
742: ISAM method introduced by IBM~\cite{GR93}.  In the context of a
743: general B+~tree this can be accomplished by maintaining updates to the
744: B+~tree in a separate index, and then doing a batch update by
745: reorganizing the B+~tree.  For brevity, we will sometimes refer to
746: B+~tree, although in all cases, an enhanced B+~tree is intended.
747: 
748: 
749: Assume that we are executing a file scan of~$R$ with the purpose of
750: constructing the join triples.  We collect a sequence of keys from the
751: relation~$R$, and we wish to carry out a batch lookup of rids in the
752: enhanced B+-tree.  The rids will be used to retrieve records from the
753: foreign relation~$F$.  We assume that the B+~tree does not fit in
754: cache.  Our goal is a two-pass algorithm which will efficiently
755: accomplish batch lookup.
756: 
757: Assume that the B+~tree has $m$~entries per node.  Assume that the
758: B+~tree indexes $N$~records.  Then there are $\log_m N$ levels.  Assume
759: further that the first $(\log_m N)/2$ levels (the top half of the
760: B+~tree) fit in cache.  For $m\gg 2$, the top half of the tree has
761: approximately $m^{(\log_m N)/2}=\sqrt{N}$ slots.  For example, if
762: there are $N=10^9$ records, then there are 32,000 slots, which clearly
763: fit inside cache.
764: 
765: The strategy is a two-pass strategy.  In the first pass, one performs
766: a lookup of each key, but only using the top half of the B+~tree.
767: As shown in Figure~\ref{fig:Btree}, each leaf of the
768: subtree comprising this upper half can be considered as the root of a
769: second subtree comprising nodes in the lower half of the B+~tree.
770: Thus, at the end of this first pass, a key can be associated with a
771: leaf of the upper subtree, which is also a root of one of the lower subtrees.
772: 
773: \begin{figure}[!hbt]
774: \begin{center}
775:   \epsfig{file=figure2.eps,width=3.0in}
776:   \caption{B-Tree (Each smaller triangle represents a subtree that
777:   fits in cache.)}
778:   \label{fig:Btree}
779: \end{center}
780: \end{figure}
781: 
782: After the first pass, one can associate each key with a subtree within
783: the lower half of the B+~tree.  So, in the second pass, one loads a
784: subtree from the lower half.  One then continues the lookup for all
785: keys associated with the root of the subtree in the lower half.
786: At the end, one then has a sequence of keys and rids.
787: 
788: If one wishes to have the keys in the same order as the order of the
789: original rids, then this can also be arranged.  In this case, one
790: extends the previous scenario to use the DPG algorithm.
791: 
792: The first phase of the DPG algorithm is to {\em distribute} the keys into
793: runs.  In the initial
794: lookup of a key in the first half, it was associated with a root of a
795: subtree in the first half.  The particular root of a subtree
796: identifies the run into which the key is copied.
797: 
798: The second phase of the DPG algorithm is to use the key to {\em probe} the
799: index, in order to find the rid.  One completes the lookup of all keys
800: in a run associated with a particular subtree of the B+~tree, before
801: proceeding to the next subtree in the lower half.  The resulting rids
802: are stored in a temporary partition or run.
803: This is exactly what was described earlier.
804: 
805: Finally, the third phase of the DPG algorithm {\em gathers} the rids
806: into a destination array in the same order as that of the original
807: keys.  Hence, we have completed a batch lookup of the keys, and
808: returned an array of rids in the same order as that of the original
809: keys.
810: 
811: \paragraph{\bf Batch Lookup in Hash Indexes}
812: 
813: Batch lookup of hash indexes also proceeds in two passes.  We assume
814: that the hash array of the hash index stores at each hash entry one
815: key-rid pair.  A second hash array associated with the index stores
816: pointers to overflow key-rid pairs that would have collided with an
817: occupied slot in the first hash array.
818: 
819: The two-pass lookup for hash indexes proceeds in a very simple manner.
820: We extract the sequence of keys from~$R$.  As the keys are extracted,
821: the hash values are computed.  Those key-hash value pairs are saved in
822: an array in an order corresponding to the rid order of~$R$.
823: 
824: It now suffices to apply the DPG algorithm.  The hash array is
825: partitioned into sets of $L$~hash slots.  In the distribute phase, the
826: hash value acts as an index into the hash array.  Hence, this becomes
827: the permutation vector of the DPG algorithm.  The key and hash value
828: are then written into separate runs according to the partition of the
829: hash array.  There is one run of key-hash values for each partition of
830: the hash array.
831: 
832: In the probe phase, a run of hash values is loaded into cache along
833: with the corresponding partition of the hash array.  As part of the
834: probe phase, the key and hash values from the run are used to look up
835: the the corresponding rid in the partition of the hash array.  The
836: rids are then saved in a temporary partition, in the same order as the
837: order in which the key-hash values of the original partition are stored.
838: 
839: Finally, in the gather phase, the ordering of the key-hash
840: value pairs are used to gather the rids from the temporary partition.
841: As in Section~\ref{sec:DPG}, the rids are gathered into a destination
842: array in an order corresponding the ordering of the original key-hash
843: value pairs.
844: 
845: 
846: 
847: 
848: \section{Experimental Evaluation}
849: 
850: \subsection{Sorting Comparisons}
851: 
852: Figure~\ref{fig:sorting} demonstrates the acceleration achieved by
853: SuperScalarSort when DPG is used.  The results are demonstrated on the
854: IBM pSeries 690 Turbo.  The IBM~p690 has an L3~cache of size 128~MB.
855: Hence, in order to realistically demonstrate DPG, we were forced to
856: increase the size of the database.  We chose to implement
857: SuperScalarSort for a data file of size 512~MB.  The record size was
858: treated as a variable, to illustrate the influence of record size.  As
859: in the original Datamation Challenge, the key is 10~bytes.
860: 
861: \begin{figure}[!hbt]
862: \begin{center}
863:   \epsfig{file=figure3.eps,width=3.0in}
864:   \caption{Sorting Comparison}
865:   \label{fig:sorting}
866: \end{center}
867: \end{figure}
868: 
869: Note that the additional speed of SuperScalarSort with DPG is most
870: pronounced for smaller record sizes.  For record sizes of 256~bytes
871: and higher, DPG provides only a small advantage.  This is because the
872: IBM p690 has an L3~cache block of size 512~bytes.  So, for record
873: sizes below 256~bytes, a cache miss incurs significant overhead
874: in loading a 512~byte cache block.
875: 
876: There was some variability in the results because the data was taken
877: on a time-shared, shared memory machine.  On the IBM~p690, the
878: L2~cache is shared among two CPUs, and the L3~cache is shared among
879: eight CPUs.  Our experiments were run on a single CPU.  Hence another
880: process on a neighboring CPU could consume some of our cache, thereby
881: affecting the timings.  The reported experimental results are the
882: averages of three runs each.
883: 
884: \subsection{Comparion of Sort-Merge Join using Different Sorting Algorithms}
885: \label{sec:sortMergeJoinExperiments}
886: 
887: \begin{figure}[!hbt]
888: \begin{center}
889:   \epsfig{file=figure4.eps,width=3.0in}
890:   \caption{Sort-Merge Join (IBM Power4 p690 Series)}
891:   \label{fig:sortMerge}
892: \end{center}
893: \end{figure}
894: 
895: We implement two sort-merge join methods: sort-merge join with DPG sort
896: and sort-merge join with SuperScalarSort.  As defined in section~\ref{sec:sortMergeJoin},
897: sort-merge join with DPG sort applies the DPG sort for the sorting
898: phase and sort-merge join with SuperScalarSort applies SuperScalarSort
899: for the sorting phase.
900: 
901: As expected, the acceleration in the speed of sort-merge join in
902: Figure~\ref{fig:sortMerge} follows the same pattern as that of
903: Figure~\ref{fig:sorting} for sorting. In this case, we illustrate our
904: results on a database of size 128~MB on the IBM p690.  The IBM p690
905: has an L2~cache of size 1.4~MB.  So, in this case, the L3~cache is acting
906: as the main memory, while L2~cache is acting as the ``cache''.
907: 
908: 
909: 
910: \subsection{Comparion of Six Different Join Methods}
911: \label{sec:joinExperiments}
912: 
913: In Figures~\ref{fig:join_nonuniform_1} through~\ref{fig:join_dup_3},
914: we now experimentally compare the six join methods originally
915: presented at the begining of Section~\ref{sec:joinAlgoDPG}.  The three
916: join algorithms labelled Sort-Merge Join with DPG Sort, DPG-Sort Join,
917: and DPG-Move Join are all new.  The remaining algorithms, Sort-Merge
918: Join with AlphaSort, Sort-Merge Join with SuperScalarSort, and Radix
919: Join, are all based on sorting or join algorithms from the literature.
920: 
921: As explained earlier, we denote by
922: ~$F$ an indexed relation and ~$R$ a relation that has foreign key on
923: the indexed attribute of ~$F$.
924: We consider both uniform and non-uniform distribution of foreign key values.
925: 
926: \paragraph{\bf The non-uniform distribution of join key values.}  Sort-merge join with
927: AlphaSort and DPG-Sort are the only two of the previously discussed join methods that operate
928: correctly for a non-uniform distribution of join key values.
929: 
930: In the following we use count bucket sort for bucket sort. The count bucket sort is as
931: follows:
932: \begin{enumerate}
933: \item count the number of elements destined for each bucket.
934: \item set bucket boundaries according to the statics computed and
935: distribute elements to buckets.
936: \end{enumerate}
937: 
938: AlphaSort works for non-uniformly distributed data, because it uses
939: quicksort to sort each run and uses replacement-selection to merge
940: the runs. DPG-Sort join works for non-uniformly distributed data too,
941: because we use count bucket sort to sort RIDs.  
942: 
943: DPG-Sort join sorts the RIDs according to the lowest~ $\log{N}$ bits
944: is enough, $N$ is the number of records. In the simulation, we assign
945: the RID values in the range $[1, ... , n]$, the lowest $\log{N}$ bits is
946: sufficient for sorting. For example, for $N=2^{20}$ sorting RIDs
947: according to the lowest 20 bits is enough. This could be done in two
948: steps with count bucket sort: first do count bucket sort according to
949: the lower 10 bits, then do count bucket sort according to the higher
950: 10 bits.
951: 
952: Using the UNIX $random()$ and $exp()$ functions we generate an
953: exponential distribution of the data as $exp(c*(random()>>10))$ ($c$
954: is a constant and in our experiments we assign c with $-0.0000001$).  
955: A comparison of Sort-Merge join with AlphaSort and DPG-Sort join on
956: three different computer architectures is provided in the
957: figures~\ref{fig:join_nonuniform_1}, ~\ref{fig:join_nonuniform_2}, and
958: ~\ref{fig:join_nonuniform_3}. The three different architectures are
959: the IBM Power4 Pseries 690 Turbo, the 3.06 GHz Pentium 4 with Rambus
960: PC-1200 RAM, and the 2.6 GHz Pentium 4 with DDR-266 RAM, From the
961: comparison, we can see that DPG-sort join is much faster than
962: Sort-Merge join with AlphaSort.
963: 
964: \begin{figure}[!hbt]
965: \begin{center}
966:   \epsfig{file=figure5.eps,width=3.0in}
967:   \caption{Comparison of Joins (IBM Power4 pSeries 690 Turbo,
968:   exponential distribution of keys)}
969:   \label{fig:join_nonuniform_1}
970: \end{center}
971: \end{figure}
972: 
973: \begin{figure}[!hbt]
974: \begin{center}
975:   \epsfig{file=figure6.eps,width=3.0in}
976:   \caption{Comparison of Joins (2.6~GHz Pentium 4 / DDR-266 RAM,
977:   exponential distribution of keys)}
978:   \label{fig:join_nonuniform_2}
979: \end{center}
980: \end{figure}
981: 
982: 
983: \begin{figure}[!hbt]
984: \begin{center}
985:   \epsfig{file=figure7.eps,width=3.0in}
986:   \caption{Comparison of Joins (3.06~GHz Pentium 4 / Rambus PC-1200 RAM,
987:   exponential distribution of keys)}
988:   \label{fig:join_nonuniform_3}
989: \end{center}
990: \end{figure}
991: 
992: \paragraph{\bf The uniform distribution of join key values.}  First we will
993: show how all algorithms work for uniformly distributed join key values.
994: SuperScalarSort is a key-prefix-sort explained further in
995: section~\ref{sec:intro}.  It assumes that data is uniformly
996: distributed according to the highest ~7 bits of the key. This kind of
997: distribution can be applied to all DPG algorithms, because a unform
998: distribution of the join key values is the only constraint of
999: Sort-Merge join with DPG and DPG-move join.
1000: 
1001: We also implement the radix join of Manegold et al.~\cite{Manegold02}.
1002: They describe a variation of hash join for main memory.  Radix join
1003: also requires a uniform distribution of the join key values.
1004: Otherwise, some of their partitions will be too large to fit into the
1005: L2 cache and how to set the boundaries is unknown.
1006: 
1007: We produce uniformly distributed foreign key values using the UNIX
1008: $random()$ function.  A comparison of the different join methods on
1009: three different computer architectures as before is provided in the
1010: figures. Figure~\ref{fig:join_no_dup} reflects data with no duplicate
1011: join key values in the relation~$R$.  Figures~\ref{fig:join_dup_1},
1012: ~\ref{fig:join_dup_2}, and ~\ref{fig:join_dup_3}, show the same
1013: information in which duplicate join key values are allowed.  From the
1014: comparison, we can see that DPG-move join and radix join are the
1015: fastest. DPG-move is better for large records and radix join is better
1016: for small records.
1017: 
1018: \begin{figure}[!hbt]
1019: \begin{center}
1020:   \epsfig{file=figure8.eps,width=3.0in}
1021:   \caption{Comparison of Joins (IBM Power4 pSeries 690 Turbo, no
1022:   duplicate keys)}
1023:   \label{fig:join_no_dup}
1024: \end{center}
1025: \end{figure}
1026: 
1027: \begin{figure}[!hbt]
1028: \begin{center}
1029:   \epsfig{file=figure9.eps,width=3.0in}
1030:   \caption{Comparison of Joins (IBM Power4 pSeries 690 Turbo,
1031:   duplicate keys)}
1032:   \label{fig:join_dup_1}
1033: \end{center}
1034: \end{figure}
1035: 
1036: \begin{figure}[!hbt]
1037: \begin{center}
1038:   \epsfig{file=figure10.eps,width=3.0in}
1039:   \caption{Comparison of Joins (2.6~GHz Pentium 4 / DDR-266 RAM,
1040:   duplicate keys)}
1041:   \label{fig:join_dup_2}
1042: \end{center}
1043: \end{figure}
1044: 
1045: 
1046: \begin{figure}[!hbt]
1047: \begin{center}
1048:   \epsfig{file=figure11.eps,width=3.0in}
1049:   \caption{Comparison of Joins (3.06~GHz Pentium 4 / Rambus PC-1200 RAM,
1050:   duplicate keys)}
1051:   \label{fig:join_dup_3}
1052: \end{center}
1053: \end{figure}
1054: 
1055: \section{Conclusions}
1056: The use of DPG in the sorting provides an accelerator for existing
1057: sorting algorithms. Especially for the smaller record sizes, such as
1058: 32~bytes and 64~bytes, the performance improvements are really
1059: impressive.
1060: 
1061: For the more common case of non-uniform distribution of join key values,
1062: DPG-Sort join works better than sort-merge join with AlphaSort across
1063: all the tested platforms. For smaller records sizes, such as, 32, 64,
1064: DPG-Sort join is much better than sort-merge join with AlphaSort. More
1065: impressively, on the newer platform, Rambus PC-1200 RAM, DPG-Sort works
1066: better than on PC with DDR-266 RAM and even works better for larger
1067: record sizes, for example 512~bytes.
1068: 
1069: For special case of uniform distribution of join key values, DPG-move join
1070: and radix join are the best. The remaining DPG algorithms are also
1071: competitive with older algorithms although with a smaller improvement.
1072: 
1073: \section{Future Work}
1074: The DPG algorithm can be easily generalized to multiple passes.  This
1075: can be useful when there is a very small cache in relation to the size
1076: of main memory.  However, we do not encounter this scenario in our
1077: current experiments.
1078: 
1079: \section{Acknowledgements}
1080: We would like to thank Betty Salzberg and Donghui Zhang for extensive
1081: conversations and insights into additional situations where
1082: distribute-probe-gather can be beneficial.  We also gratefully
1083: acknowledge the use of the support for the computations on the IBM
1084: p690 by the Scientific Computing and Visualization (SCV) group at
1085: Boston University.
1086: 
1087: 
1088: \bibliographystyle{latex}
1089: \begin{thebibliography}{99}
1090: 
1091: \bibitem{SuperScalarSort}
1092:   Agarwal, R.C. ``A Super Scalar Sort Algorithm for RISC Processors'',
1093:   {\sl ACM SIGMOD~'96}, pp.~240--246, June 1996.
1094: 
1095: 
1096: % \bibitem{BlockOrientedProcessing}
1097: % Sriram Padmanabhan, Timothy Malkemus, Ramesh C. Agarwal, Anant
1098: % Jhingran: Block Oriented Processing of Relational Database Operations
1099: % in Modern Computer Architectures. ICDE 2001: 567-574.
1100: %  Abstract: Several recent papers have pointed out that database
1101: %  systems are not well tuned to take advantage of modern superscalar
1102: %  processor architectures. In particular, the Clocks-Per-Instruction
1103: %  (CPI) for rather simple database queries are quite poor compared to
1104: %  scientific kernels or SPEC benchmarks. The lack of performance of
1105: %  database systems has been attributed to poor utilization of caches
1106: %  and processor function units as well as higher branching
1107: %  penalties. In this paper, we argue that a block oriented processing
1108: %  strategy for database operations can lead to better utilization of
1109: %  the processors and caches generating significantly higher
1110: %  performance. We have implemented the block oriented processing
1111: %  technique for aggregation, expression evaluation, and sorting
1112: %  operations as a feature in the DB2 Universal Database system. We
1113: %  present results from representative queries on a 30 GB TPC-H database
1114: %  to show the value of this technique.
1115: 
1116: 
1117: % On Pentium 4 w/ NT 4.0
1118: \bibitem{Ailamaki99}
1119: A. Ailamaki, D.J.~DeWitt, M.D.~Hill, D.A.~Wood,
1120: ``DBMSs on a Modern Processor: Where Does Time Go?'',
1121: {\sl VLDB'99, Proceedings of 25th International Conference on Very
1122:                Large Data Bases}, 1999, pp.~266-277.
1123: 
1124: \bibitem{Datamation}
1125:   Anon., Et-Al. (1985). ``A Measure of Transaction Processing Power'',
1126:   {\sl Datamation}~{\bf 31}(7): pp.~112--118.  Also in {\sl Readings
1127:     in Database Systems}, M.J.~Stonebraker, ed., Morgan Kaufmann, San
1128:   Mateo, 1989.
1129: 
1130: \bibitem{ArgeEtAl98}
1131:   L.~Arge, O.~Procopiuc, S.~Ramaswamy, T.~Suel and
1132:      J.S.~Vitter, ``Scalable Sweeping-Based Spatial Join'', {\sl
1133:   VLDB 1998, Proceedings of 24th International Conference on Very Large
1134:    Databases}, 1998, pp.~570--581.
1135: 
1136: \bibitem{BGB98}
1137: L.A. Barroso, K.~Gharachorloo and E.D.~Bugnion ``Memory System
1138: Characterization of Commercial Workloads'', {\sl Proc. of the
1139:   Int. Symp. on Computer Architecture}, Barcelona, Spain, 1998.
1140: 
1141: % sort-merge join
1142: \bibitem{Blasgen77}
1143:  M.W.~Blasgen and K.P.~Eswaran,
1144:  ``Storage and access in relational databases'', {\sl IBM Systems
1145:  Journal}~{\bf 16}(4), 1977.
1146: 
1147: \bibitem{Boncz99}
1148: P.~Boncz, S.~Manegold and M.L.~Kersten,
1149:     ``Database Architecture Optimized for the New Bottleneck: Memory Access'',
1150:     {\sl The {VLDB} Journal}, 1999, pp.~54--65.
1151: 
1152: \bibitem{CS83}
1153: F.~Cesarini and G.~Soda, ``An Algorithm to Construct a Compact B-tree in
1154: Case of Ordered Keys'', {\sl Information Processing Letters}~{\bf
1155: 17}(1), pp.~1612--1630, 1983.
1156: 
1157: \bibitem{DeWitt84}
1158: D. DeWitt, R. Katz, F.~Olken, L.~Shapiro, M.~Stonebraker and D.~Wood,
1159: ``Implementation Techniques for Main Memory Databases'', {\sl
1160:   Proc. ACM SIGMOD Conf. on the Management of Data}, 1984.
1161: 
1162: \bibitem{DeWitt92}
1163: D. DeWitt, J.~Naughton, D.~Schneider and S.~Seshadri,
1164: ``Practical Skew Handling in Parallel Joins'',
1165: {\sl VLDB 1992, Proc. Intl. Conf. on Very Large Databases}, 1992.
1166: 
1167: \bibitem{Dittrich02}
1168: J.-P. Dittrich, B.~Seeger, D.S.~Taylor and P.~Widmayer,
1169: ``Progressive Merge Join: A Generic and Non-blocking Sort-based Join
1170: Algorithm'', {\sl VLDB 2002, Proceedings of 28th International
1171:   Conference on Very}, 2002.
1172: 
1173: \bibitem{GR93}
1174: J. Gray and A.~Reuter, {\em Transaction Processing:  Concepts and
1175: Techniques}, Morgan Kaufmann, San Francisco, 1993.
1176: 
1177: \bibitem{processor}{\sl Intel Pentium 4 and Intel Xeon Processor
1178: Optimization Reference
1179: Manual}. \url{http://www.intel.com/design/Pentium4/manuals/index.htm}
1180: \bibitem{RAM}{\sl Intel PC SDRAM Specification, Revision 1.7, November
1181: 1999}.
1182: 
1183: \bibitem{LostCircuits}
1184: {\sl Lost Circuits} web site:
1185: ``Latency vs. Bandwidth, a performance analysis'',
1186: \url{http://www.lostcircuits.com/memory/latency/2.shtml};
1187: ``Inside the EDDR Chip:
1188: Combining DRAM storage and SRAM speed'',
1189: \url{http://www.lostcircuits.com/memory/eddr/}, Nov.~27, 2000;
1190: and ``High Performance DDR DIMMs'',
1191: \url{http://www.lostcircuits.com/memory/ddr2/}, July~17, 2001.
1192: 
1193: \bibitem{Patterson98} K.~Keeton, D.A.~Patterson, Y.Q.~He R.C.~Raphael
1194:   and W.E.~Baker ``Performance Characterization of a Quad {P}entium
1195:   Pro SMP using OLTP Workloads'' {\sl Porc. of the Int. Syp. on
1196:     Computer Architecture}, p0p.~15--26, Barcelona, Spain, 1998.
1197: 
1198: \bibitem{Knuth98}
1199: D.E. Knuth, {\sl Sorting and Searching, The Art of Computer
1200:   Programming}, vol.~3, second edition,
1201: Addison Wesley, Reading, MA, 1998
1202: 
1203: \bibitem{Manegold99}
1204: S.~Manegold, P.A.~Boncz and M.L.~Kersten,
1205: ``Database architecture   
1206:    optimized for the new bottleneck: Memory access'', {\sl VLDB '99,
1207:   Proc. of 25th Intl. Conf. 
1208:    Very Large Databases}, 1999.
1209: 
1210: % \bibitem{ManegoldCWI99}
1211: % S. Manegold, P.A.~Boncz and M.L.~Kersten,
1212: % ``Optimizing main-memory join on modern hardware'',
1213: % Technical report INS-R9912, Centruum voor Wiskunde Informatica (CWI),
1214: % Netherlands, 1999.
1215: 
1216: \bibitem{Manegold00}
1217: S.~Manegold, P.A.~Boncz and
1218:                M.L.~Kersten,
1219: ``What Happens During a Join? Dissecting CPU and Memory Optimization
1220:                Effects'',
1221: {\sl VLDB 2000, Proceedings of 26th International Conference on Very
1222:                Large Data Bases, September 10-14, 2000, Cairo, Egypt},
1223: {2000}, pp.~339--350.
1224: 
1225: \bibitem{Manegold02}
1226: S. Manegold, P.A.~Boncz and M.L.~Kersten,
1227: ``Optimizing main-memory join on modern hardware'',
1228: {\sl IEEE Transactions on Knowledge and Data Engineering}~{\bf 14}(4),
1229: 2002, pp.~709--730.
1230: 
1231: %  Abstract
1232: %  In the past decade, the exponential growth in commodity CPUs speed has
1233: %  far outpaced advances in memory latency. A second trend is that CPU
1234: %  performance advances are not only brought by increased clock rate, but
1235: %  also by increasing parallelism inside the CPU. Current database
1236: %  systems have not yet adapted to these trends, and show poor
1237: %  utilization of both CPU and memory resources on current hardware. In
1238: %  this article, we show how these resources can be optimized for large
1239: %  joins and translate these insights into guidelines for future database
1240: %  architectures, encompassing data structures, algorithms, cost
1241: %  modeling, and implementation. In particular, we discuss how vertically
1242: %  fragmented data structures optimize cache performance on sequential
1243: %  data access. On the algorithmic side, we refine the partitioned
1244: %  hash-join with a new partitioning algorithm called radix-cluster,
1245: %  which is specifically designed to optimize memory access. The
1246: %  performance of this algorithm is quantified using a detailed
1247: %  analytical model that incorporates memory access costs in terms of a
1248: %  limited number of parameters, such as cache sizes and miss penalties.
1249: %  We also present a calibration tool that extracts such parameters
1250: %  automatically from any computer hardware. The accuracy of our models
1251: %  is proven by exhaustive experiments conducted with the Monet database
1252: %  system on three different hardware platforms. Finally, we investigate
1253: %  the effect of implementation techniques that optimize CPU resource
1254: %  usage. Our experiments show that large joins can be accelerated almost
1255: %  an order of magnitude on modern RISC hardware when both memory and CPU
1256: %  resources are optimized.
1257: 
1258: \bibitem{DDR}Micron Technology, Inc.,
1259: {\sl 256Mb: x4, x8, x16~DDR~SDRAM}, Revision~C, April, 2001,
1260: \url{http://download.micron.com/pdf/datasheets/dram/256Mx4x8x16DDR_D.pdf}
1261: 
1262: \bibitem{Negri85}
1263:   M.~Negri and G.~Pallagatti, ``Join During Merge:  An Improved Sort
1264:   Based Algorithm'', {\sl Information Processing Letters}~{\bf 21}(1),
1265:   1985, pp.~11--16.
1266: 
1267: \bibitem{AlphaSort}
1268:   Nyberg, C., Barclay, T., Cvetanovic, Z., Gray, J., and Lomet, D.
1269:   ``AlphaSort: A Cache-Sensitive Parallel External Sort'', {\sl VLDB
1270:     Journal}~{\bf 4}(4), pp.~603--627, 1995.
1271: 
1272: \bibitem{Patel96}
1273:   J. M. Patel and D. J. DeWitt, ``Partition Based SpatialMerge Join'',
1274:   {\sl Proc. of the ACM SIGMOD Conference on Management of Data}, 1996.
1275: 
1276: \bibitem{Zhang02}
1277:   D. Zhang and V.J. Tsotras and B. Seeger,
1278:   ``Efficient Temporal Join Processing using Indices'',
1279:   {\sl Proc. of 18th International Conference on Data Engineering (ICDE)},
1280:   2002.
1281: 
1282: % Available at
1283: % \url{http://research.microsoft.com/barc/SortBenchmark/AlphaSort.rtf}.
1284: 
1285: \bibitem{Rao99}
1286: J.~Rao and K.A.~Ross,
1287: ``Cache Conscious Indexing for Decision-Support in Main Memory'',
1288: {\sl VLDB 1999, 25th Intl. Conf. 
1289:    Very Large Databases}, 1999, pp.~78--89.
1290: 
1291: % hash-join algorithm (variation of GRACE)
1292: \bibitem{Shapiro86}
1293: L.D. Shapiro,
1294: ``Join Processing in Database Systems with Large Main Memories'',
1295: {\sl ACM Transactions on Database Systems}~{\bf 11}(3), 1986,
1296: pp.~239--264.
1297: 
1298: \bibitem{ShatdalEtAl94}
1299: A.~Shatdal, C.~Kant and J.F.~Naughton,
1300: ``Cache Conscious Algorithms for Relational Query Processing'',
1301: {\sl VLDB 1994,Proceedings of the 20th VLDB Conference},
1302: 1994, pp.~510--521.
1303: 
1304: \bibitem{TLZT97}
1305: P. Trancoso J.L.~Larriba-Pey, Z.~Zhang and J.~Torellas, ``The Memory
1306: Performance of DSS Commercial Workloads in Shared-Memory
1307: Multiprocessors'', {\sl Int. Symp. on High Performance Computer
1308:   Architecture}, San Antonio, TX, USA 1997.
1309: 
1310: \bibitem{WulfMcKee95}
1311: W.A. Wulf and S.A. McKee.
1312:  ``Hitting the memory wall: Implications of the obvious'',
1313:  {\sl ACM Computer Architecture News}, 23(1):20--24, 1995.
1314: 
1315: % Interesting paper for cache conscious search trees:
1316: % Jun Rao, Kenneth A. Ross: Making B^+-Trees Cache Conscious in  
1317: %    Main Memory. SIGMOD Conference 2000: 475-486
1318: 
1319: \end{thebibliography}
1320: 
1321: \end{document}
1322: