0308:cs0308004/dpg.tex

1: \documentclass[10pt,twocolumn]{article}

2: % Must be 10 point, twocolumn (ICDE)

3:

4: \usepackage{latex8}  % ICDE

5: \usepackage{times}

6: \usepackage{url}

7: \usepackage{comment}

8: \usepackage{epsfig}

9: \usepackage[compact]{titlesec}

10:

11: % \usepackage{fullpage}

12: % \usepackage{geometry}

13: % \geometry{lmargin=2cm,rmargin=2cm}

14: \usepackage{fancyvrb}

15: % \usepackage[small,compact]{titlesec}

16: % \usepackage[compact]{titlesec}

17: % \usepackage[light,first,bottomafter,outline]{draftcopy}

18:

19: \DefineShortVerb{\#}

20:

21: \includecomment{comment}

22: \specialcomment{commentbf} {\begingroup\bf} {\endgroup}

23: \specialcomment{commentit} {\begingroup\it\footnotesize} {\endgroup}

24: \specialcomment{commentold} {\begingroup\it\footnotesize{\bf OLD:}} {\endgroup}

25: \excludecomment{comment}

26: \excludecomment{commentbf}

27: \excludecomment{commentit}

28: \excludecomment{commentold}

29:

30: \title{

31: DPG:  A Cache-Efficient Accelerator for Sorting and for Join Operators

32: }

33: \author{

34:    Gene Cooperman\thanks{~$\,$This work was partially supported by

35:                 the National Science Foundation under Grant CCR-0204113, and

36:                 by the

37:                 Institute for Complex Scientific Software

38:                 (ICSS, http://www.icss.neu.edu/).},

39:  Xiaoqin Ma$^*$ and Viet Ha Nguyen$^*$

40: \\

41: Northeastern University\\

42: Boston, MA 02115, USA\\

43: \{gene,xqma,vietha\}@ccs.neu.edu

44: }

45:

46: \begin{document}

47:

48:

49:

50: \maketitle

51:

52: \begin{abstract}

53: Retrieval of records on disk is well-known to be at the heart of many

54: database problems.  We show that the corresponding movement of records in main

55: memory has now become a severe bottleneck for many database

56: operations.  This is due to the stagnating latency of main memory,

57: even while CPU speed, main memory bandwidth, and disk speed all

58: continue to improve.  As a result, record movement has become the

59: dominant cost in main memory sorting.

60:

61: We present a new algorithm for fast record retrieval, {\em

62:   distribute-probe-gather}, or DPG.  DPG has important applications

63: both in sorting and in joins.  Current main memory sorting algorithms

64: split their work into three phases: extraction of key-pointer pairs;

65: sorting of the key-pointer pairs; and copying of the original records

66: into the destination array according the sorted key-pointer pairs.

67: The copying in the last phase dominates today's sorting time.  Hence,

68: the use of DPG in the third phase provides an accelerator for existing

69: sorting algorithms.

70:

71: DPG also provides two new join methods for foreign key joins: DPG-move

72: join and DPG-sort join.  The resulting join methods with DPG are

73: faster because DPG join is cache-efficient and at the same time DPG

74: join avoids the need for sorting or for hashing. The ideas

75: presented for foreign key join can also be extended to faster record

76: pair retrieval for spatial and temporal databases.

77:

78:

79: \end{abstract}

80:

81: \section{Introduction}

82: \label{sec:intro}

83:

84: Two important database operations are sorting and joins.  These

85: operations have three primary hardware-related costs: disk access, CPU

86: operation and main memory access.  The growth of main memory in

87: current computers implies that main memory databases become more

88: popular. For main memory databases, the bottleneck moves from disk to

89: main memory and the cost for disk access is not an issue anymore.

90:

91: Further, the growing CPU-memory gap implies that CPU costs represent

92: an increasingly small portion of the total time.  This has been borne

93: out by several studies of

94: DBMs~\cite{Ailamaki99,BGB98,Patterson98,TLZT97}.  The impact of the

95: CPU-memory gap was popularized by the paper of Wulf and

96: McKee~\cite{WulfMcKee95} on the {\em memory wall}.

97:

98:

99: The diminishing role of the CPU in the total running time is partially

100: accounted for by increasing CPU speeds and greater on-chip functional

101: parallelism.  In part, it is also accounted for because most CPUs

102: today implement non-blocking caches and hardware prefetch in order to

103: overlap CPU execution with memory access~\cite{processor}.  Hence,

104: memory access becomes the bottleneck.  Therefore, we follow the

105: example of previous

106: researchers~\cite{Ailamaki99,Boncz99,Manegold99,Manegold00,Manegold02,ShatdalEtAl94}

107: in concentrating on main memory as the bottleneck.

108:

109: At the heart of this memory bottleneck lies the {\em record retrieval

110:   problem}: the problem of copying records from a source array into a

111:   destination array according to a new ordering.  In a typical

112:   application, one will be given a source data file, a sequence of

113:   {\em record ids} (rids) for that data file, and a destination file.

114:   The task is to copy the source records into the destination file in

115:   the order specified by the sequence of rids.  Fast record retrieval

116:   is the key to faster sorting and faster joins.

117:

118: A standard approach for data retrieval  accesses the records of the

119: source data file directly according to the sequence of the rids.  This

120: implies random access to the main memory.  This, for example, is what

121: was done in AlphaSort~\cite{AlphaSort} and

122: SuperScalarSort~\cite{SuperScalarSort}, the current record holders for

123: the Datamation sorting challenge~\cite{Datamation}.  However, the cost

124: of such random access has now become the dominant cost in main memory

125: sorting, since the CPU-memory gap has widened still further since the

126: original work on sorting.  (In fact, AlphaSort and SuperScalarSort

127: sort data on disk, but the size of the data file, 100~MB, is small

128: enough that the disk access consists solely of reading the source data

129: file from disk, and writing to a destination data file.)

130:

131: Random access is harmful not just in disk resident databases, but also in main

132: memory resident databases.  Current DRAM technologies, such as DDR RAM and

133: Rambus RAM (RDRAM), extract a large latency penalty for any

134: non-sequential access to RAM.  This is because the memory chips are

135: divided into memory pages of several kilobytes, and there is a latency

136: penalty for switching to a new memory page~\cite{LostCircuits,DDR}.

137:

138: Random access to RAM also harms performance in a second manner.

139: Random access incurs a heavy penalty when large cache blocks are used.

140: On a cache miss, the entire cache block is loaded into memory.  If the record size

141: is small compared to the cache block size, then there is a large

142: overhead  to load the entire cache block.  For example, for a cache

143: miss on a Pentium~4 with DDR-266 RAM, approximately 60~ns are spent

144: loading the cache block, and approximately 60~ns are spent waiting on

145: the latency of DDR RAM.  The trend is toward larger cache blocks.  The

146: 128~byte L2 cache blocks of the Pentium~4 are four times larger than

147: those of the Pentium~III.  The IBM Power4 processor goes still further

148: using 512~byte L3 cache blocks.

149:

150: The solution to avoid these latency penalties in main memory is to

151: access main memory sequentially.  This is similar in spirit to the

152: way in which traditional databases strongly prefer to access disk

153: sequentially.  In analogy with operation on disk, two-pass algorithms

154: are a key for faster main memory performance.

155:

156: The DPG-based sorting algorithms immediately yield faster sorting

157: algorithms.  Both AlphaSort and SuperScalarSort sort their data

158: essentially in three phases:  extraction of key-pointer pairs;

159: sorting of the key-pointer pairs; and copying of the original records

160: into the destination array according to the sorted key-pointer

161: pairs.  The last phase is essentially record retrieval.

162:

163: A re-implementation of AlphaSort and SuperScalarSort on a IBM p690

164: Turbo shows that the record retrieval phase now dominates the running

165: time.  With the DPG record retrieval algorithm replacing the standard

166: record retrieval algorithm, we immediately produce a faster sorting

167: algorithm.  In the re-implementation of SuperScalarSort, the versions

168: with DPG as an accelerator are 27\% faster.

169:

170:

171: % {\bf The DPG algorithm is XXX\% faster than random access to main

172: %  memory ond a YYY.}

173:

174: A direct consequence of faster record retrieval is faster main memory

175: sort-merge joins.  For example, in implementing sort-merge join using

176: SuperScalarSort, we find that the use of DPG sort instead of

177: SuperScalarSort results in a 27\% faster join algorithm..  The DPG

178: record retrieval phase in isolation is 48\% faster than traditional

179: record retrieval algorithm.

180:

181:

182:

183: Finally, we apply DPG algorithm in the context of {\em foreign key

184: joins}.  In a foreign key join the join key is the same as the foreign

185: key.  we assume that one relation has an secondary index on the join

186: key.  Foreign key joins have the advantage that one need only

187: rearrange the records of one of the files.  This is in distinction to

188: sort-merge join and hash join, which both require to rearrange each of

189: the two files into a new file with join key values in sorted order.

190:

191: Foreign key joins require less record retrieval because it is possible

192: to first extract {\em join triples}, ($k$, rid$_R$, rid$_F$), where

193: $R$ and $F$ are the two relations and $k$ is the join key value.

194: Assume that $R$ references a foreign key of~$F$.  To construct a join

195: triple, do file scan of~$R$, for each record of~$R$, its key~$k$ is

196: extracted.  The secondary B-tree index of the join key on~$F$ is then

197: used to derive the corresponding record id, rid$_F$, with the key

198: value~$k$. The standard index lookup is very expensive, we propose a

199: cache efficient B-tree batch lookup to generate the join triples.

200:

201: Join triples reduce foreign key join to record pair retrieval.  The

202: join triples specify the record pairs, (rid$_R$, rid$_F$), to be

203: retrieved.  One can re-order the records of~$F$ to match the ordering

204: of rid$_R$ in the sequence of record pairs. Recall that, we do file

205: scan for~$R$, so rid$_R$s in the join triples are in sorted order.

206: Alternatively, one can sort the join triples according to rid$_F$, and

207: re-order the records of~$R$ to match the ordering of rid$_F$ in the

208: sorted rid pairs.  In either situation, there are fewer record

209: retrievals, and so foreign key join is faster than a general join.

210:

211:

212: The ideas of faster record retrieval can also be applied to the

213: general case of record pair retrieval.  Many algorithms for spatial

214: join~\cite{ArgeEtAl98,Patel96} and temporal join~\cite{Zhang02}

215: produce record pairs.  Unlike equijoin, there is no single search key,

216: and so record retrieval is more difficult.  The ideas of this paper

217: are described in terms of foreign key join.  However, it is even

218: simpler to translate the ideas into record pair retrieval, since an

219: initial join triple extraction is not required.

220:

221:

222:

223:

224:

225: \section{Distribute-Probe-Gather}

226: \label{sec:DPG}

227:

228: The {\em distribute-probe-gather algorithm} (DPG) is a record retrieval

229: algorithm.  Given a data file of records

230: and a sequence of record ids (rids) for the file, the goal is to copy

231: the records into a destination file with the property that

232: the ordering of records in the destination file corresponds to

233: the ordering of the rids in the given sequence.

234:

235: For example, let the  source file, #R#, be the array of records with a

236: secondary \rm{B+} tree index on the attribute #A#. We want to

237: place the records of #R# in sorted order according to the values of~#A#. The

238: sequence of record ids, #S_rid#, at leaf nodes of the \rm{B+} tree is a list of record

239: ids in sorted order according to the value of A. Then, the destination file, #D#,

240: below, will contain the corresponding records in sorted order according to the

241: value of~#A#.

242:

243: {\tt

244: \begin{quote}

245: for each i,

246:   D[i] = R[S\_rid[i]]

247: \end{quote}

248: }

249:

250: In the case that the sequence of rids is a permutation of the rids for

251: the data file, the sequence of rids acts as a permutation vector.  The destination

252: file is then a permutation of the records in the input file.

253:

254: Note, however, that the DPG algorithm is not limited to permutations

255: of data. In the case of join, one record from one relation may match more than one

256: records from another relation. In such cases portion of the original records must

257: be duplicated. The DPG algorithm also works for this case.

258:

259:

260: \subsection{\bf Algorithm}

261:

262: The input for the algorithm is: a list of rids, #RID_LIST#, and a

263: data file of records, #INPUT#.  The DPG algorithm partitions the rids,

264: #RID_LIST#, into separate runs.  It also partitions the data records,

265: #INPUT#, into separate runs.  The algorithm makes two passes over the

266: record ids, #RID_LIST#, and two passes over the records of the data

267: file, #INPUT#.

268:

269: The ideas are presented in the context of main memory databases.  We

270: assume that neither the sequence of rids, #RID_LIST#, nor the data

271: file, #INPUT#, fit in cache.  The DPG algorithm applies equally well

272: as an external data retrieval algorithm between disk and main memory.

273:

274: The spirit of the DPG algorithm is: try to transform arbitrary memory

275: access patterns into sequential memory access patterns; where

276: arbitrary memory access patterns are unavoidable, we try to divide the

277: data into small partitions that fit into the cache.

278:

279: For a sequential access pattern, it is easy to maintain a buffer in

280: cache.  In the DPG algorithm we need to read many streams

281: simultaneously, and maintain a buffer for each stream. Hence, we want

282: to keep the buffer as small as possible while maintaining reasonable

283: efficiency.  We define a buffer in cache to consist of two cache

284: blocks.  When a cache block is full, it is written back to main memory

285: and a new cache block is loaded from main memory into cache. On

286: Pentium~4, this is done automatically by the hardware prefetch

287: function unit if the access to main memory is sequential. On other

288: architectures without the hardware prefetch function unit, the

289: software instructions, $cflush$ and $prefetchnta$, are needed to

290: maintain the buffers.

291:

292:

293: \noindent

294: \paragraph{\bf Constraints:} The data records and rids are split into

295: runs of length~$L$.  The run length~$L$ is chosen based on two

296: constraints.  First, the cache must be able to simultaneously hold

297: both one run of data records of length~$L$ and one single buffer for the

298: corresponding run of rids.  (This constraint applies in the second

299: phase of DPG.)  Second, the cache must simultaneously be able to hold

300: a buffer for each run. (This constraint applies in the first phase and

301: in the third phase of DPG.)  These constraints are typical of the

302: constraints for two-pass algorithms, such as external sorting.

303:

304:

305:

306: If the data file has $N$~records, then the data file is partitioned

307: into $N/L$ sets of consecutive records.  Assuming an rid consists of a

308: page id and offset on that page, the high order bits of the page

309: number can be used to efficiently identify the particular partition to

310: which the rid belongs.  This assumes that the number of pages in a

311: partition is a power of two, which can be satisfied by appropriate

312: choice of~$L$. For the sake of clarity, we assume the rids

313: values are in the rage of $0$ and $N$.

314:

315: \paragraph{\bf Three Phases:} There are three phases in the DPG

316:  algorithm. The three phases are also illustrated by pseudo-code in

317:  Figure ~\ref{fig:DPGalgo} and by the diagram of Figure~\ref{fig:DPG}.

318:

319: \begin{enumerate}

320: \item{\bf Phase I.}  The first phase is the {\em Distribute}

321: phase. One {\em distributes} the rids of #RID_LIST# into appropriate

322: RID runs according to the values of the rids. The first RID run

323: contains the rid values in the range from $0$ to $L$, the second RID

324: run contains the rid values in the range from $L$ to $2L$, and so on.

325: Both the access to every RID run and the access to #RID_LIST# are

326: sequential. Therefore, one only needs to maintains a buffer in

327: cache for each RID run and a single buffer in cache for #RID_LIST#, At

328: the end, we form $N/L$ RID runs and each RID run is a permutation

329: vector. For example, the $i$-th RID run is a permutation vector in the

330: range from $(i-1)*L$ and $i*L$.

331:

332:

333: \noindent

334: \item{\bf Phase II.}  The second phase is the {\em Probe} phase.  In

335: this phase, we allocate a second, temporary data file, #INTERNAL#, in

336: main memory.  The temporary data file has the same size as the

337: original data file, #INPUT#, and is organized into the same number of

338: runs as the original data file. One then proceeds through each of the

339: runs of rids and each of the runs of #INPUT#.  The rids from the

340: $i$-th RID run are used to probe the $i$-th #INPUT# run and the

341: corresponding records are copied into the $i$-th #INTERNAL# run.

342:

343: At the end, the $i$-th #INTERNAL# run contains the same records as the

344: $i$-th #INPUT# run, but the order of records in the #INTERNAL# run is

345: organized according to the $i$-th RID run. Both the $i$-th RID run and the $i$-th

346: #INTERNAL# run are accessed sequentially. The $i$th #INPUT# run is

347: accessed randomly, but it can fit in cache. Hence, every time, one loads the

348: $i$-th #INPUT# run entirely into cache and maintains two buffers in cache: one

349: for the $i$th RID run and the other for the $i$-th #INTERNAL# run.

350:

351:

352: \noindent

353: \item{\bf Phase III.}  The third phase is the {\em Gather} phase.

354: This is an inverse of the {\em Distribute} phase and is similar to the

355: merge phase of external sorting.  In the {\em Distribute} phase, the

356: rids are distributed into runs.  As a result of the {\em Probe} phase,

357: the records in a given #INTERNAL# run are now in the same order as the

358: rids in the corresponding RID run created in Phase I.  Hence, it

359: suffices to gather (merge) the records from the #INTERNAL# runs in

360: exactly the same order as the order of the rids in #RID_LIST#.  More

361: precisely, if the $i$-th rid was distributed to the $j$-th RID run

362: during the {\em Distribute} phase, then at the $i$-th step of the {\em

363: Gather} phase, the next record from the $j$-th #INTERNAL# run is

364: copied to the destination array. One maintains several buffers in

365: cache: one single buffer for #RID_LIST#, one single buffer for the

366: destination array, and a buffer for each #INTERNAL# run.

367: \end{enumerate}

368:

369: \subsection{\bf Example}

370: The DPG algorithm is presented more formally in pseudo-code in

371: Figure~\ref{fig:DPGalgo}.  The input of the algorithm is an array,

372: #RID#, of {\em rids} (record ids) and a data file, #INPUT_REC#, of

373: records.  It is desired to retrieve the records from #INPUT_REC# in

374: the order corresponding to #RID#.  The retrieved records are then

375: written to #OUTPUT_REC#.

376:

377: The three phases of the DPG algorithm are illustrated by an example in

378: Figure~\ref{fig:DPG}.  The input in this example is the leaf nodes of

379: a secondary B+-tree index. The input contains a sequence of key-rid

380: pairs sorted according to the key values.  The output is a sequence of

381: records sorted according to the key values of the secondary index. The

382: output will either be stored again on disk or else pipelined to the

383: next stage. The ability of the DPG algorithm to take advantage of

384: pipelining is an important feature for sorting and joins.

385:

386:

387: The letters #a#, #b#, #c#, ... are used to indicate the sorted keys on

388: the leaf nodes of the index. So (#a#, 5) indicates that the record with the rid

389: value of~5 has a key value of #a#.  The first two rows are for

390: Phase~I, the next three rows for Phase~II, and the following

391: three rows for Phase~III.  The horizontal rectangle of the first

392: row represents a sequence of key-rid pairs sorted according to

393: the key values.  The second row represents the runs of rids into which

394: the first row is partitioned.  Similarly, the third row again

395: represents the runs of rids, but now as part of Phase~II.  The

396: fourth row is the partitioned runs of input records, and so

397: on. There are~12 elements and~3 runs in the example and each run contains

398: ~4 elements.

399:

400: During {\bf Phase I}, one has a sequence of key-rid pairs sorted

401: according to the key values and will distribute them into appropriate

402: RID runs.  The first RID run will contain rid values from~0 to~3, the

403: second RID run will contain rid values from~4 to~7, and the third RID

404: run will contain rid values from~8 to~11. Upon reading the sequence of

405: pairs from the first row, one places the rids in their proper runs. For

406: example, the rid~5 goes to the second RID run, the rid~7 goes to

407: the second RID run, the rid~3 goes to the first RID run, the rid~8

408: goes to the third RID run, and so on. The third row in the figure

409: presents the end of Phase I. In this

410: phase, we do sequential read on a single stream and sequential writes on multiple streams.

411:

412: During {\bf Phase II}, one copies the original data file, #INPUT#, to

413: a temporary data file, #INTERNAL#.  The input to this phase is the

414: third row.  In the third row, each RID run has rid values that

415: correspond to one contiguous range of the #INPUT# file, an #INPUT#

416: run.  For example, for the first RID

417: run, we will load the first #INPUT# run into the cache.  The first

418: #INPUT# run consists of the first 4 contiguous records. The first rid

419: in the first RID run is~3 and $INPUT[3]$ is in the cache, so one finds

420: this record, #INPUT[3]#, in cache and copies it to the buffer

421: maintained in cache for the first #INTERNAL# run, and so on. At the

422: end of Phase~II, the records in the first #INPUT# run with key values in the sequence of i, l, f, c  are reordered as

423: records in the first #INTERNAL# run with key values in the sequence of c, f, i, l. The reordering is done according to a permutation specified by the first RID run, 3, 2, 0, 1.

424:

425:

426: During {\bf Phase III}, we will use the original rid sequence in the list of

427: key-rid pairs to {\em gather} (merge) records from all #INTERNAL# run. For

428: example, upon reading~5, we go to the second #INTERNAL# run to {\em gather}

429: the record; upon reading~7, we go to the second #INTERNAL# run to {\em gather}

430: the record; upon reading~3, we go to the first #INTERNAL# run to {\em gather}

431: the record; upon reading~8 we go to the third #INTERNAL# run to {\em gather}

432: the record, and so on. For this phase, we maintain a buffer for the key-rid list, a buffer for #OUTPUT# in cache and

433: a buffer for each #INTERNAL# run. All buffers are in cache.

434:

435:

436:

437:

438: \begin{figure}[htb]

439: \setbox0\vbox

440: % families tt, courier, helvetica available built-in

441: {\small\begin{Verbatim}[commandchars=\\\{\},

442:   fontfamily=tt,

443:   commentchar=\%,codes={\catcode`$=3\catcode`^=7}]

444: % $ (for sake of emacs)

445: Let $L\gets$ (CACHE_SIZE/2)

446: Let $N\gets$ NUM_RECORDS

447: Let NUM_RUNS $\gets N/L$

448:

449: \rm INPUT:  integer \tt RID[NUM_RIDS],

450:     \rm record \tt INPUT_REC[NUM_RECORDS];

451: \rm OUTPUT: record \tt OUTPUT_REC[NUM_RIDS];

452: \rm PARAMETERS: \rm integer \tt RID_RUN[NUM_RUNS][]\tt,

453:         \rm record \tt RECORD_RUN[NUM_RUNS][];

454:

455: //Phase I:  {\it Distribute} RID array into runs

456: //          with each run of length $L$

457: % For $i$ = 1, ..., NUM_RUNS do

458: %   Initialize write buffer of RID_RUN[$i$]

459: For each rid, $r$, in RID do \{

460:  Set {\it run\_num} = $\lceil{}r/L\rceil$;

461:  Append $r$ to RID_RUN[{\it run\_num}]

462: \}

463:

464: //Phase II:  {\it Probe} partitions of INPUT_REC

465: % (Conceptually partition INPUT_REC into partitions of length $L$ and

466: %  for each partition, $i$, of INPUT_REC, probe for each rid in RID_RUN[$i$])

467: For $i$ = 1, ..., NUM_RUNS do \{

468: %   Initialize read buffer for RID_RUN[$i$]

469:   Read into memory all records from

470:    INPUT_REC[$(i-1)\!*\!\!L\! +\! 1$] to INPUT_REC[$i\!*\!\!L$]

471:   Allocate memory for $L$ records

472:     to be stored in RECORD_RUN[i]

473:   For each rid, $r$, in RID_RUN[$i$]

474:     Append INPUT_REC[$r$] to RECORD_RUN[$i$]

475:   Write out RECORD_RUN[$i$] to disk

476: \}

477:

478: //Phase III: {\it Gather} records from RECORD_RUN[]

479: //  into OUTPUT_REC in same order as RID[]

480: % For $i$ = 1, ..., NUM_RUNS do

481: %   Initialize read buffer of RECORD_RUN[$i$]

482: For each rid, $r$, in RID do \{

483:   Set {\it run\_num} = $\lceil{}r/L\rceil$;

484:   Read next record from RECORD_RUN[{\it run\_num}]

485:   Append record to OUTPUT_REC

486: \}

487: \end{Verbatim}

488: }

489: \centerline{\fbox{\box0}}

490: \caption{Distribute-Probe-Gather (DPG) Algorithm}\label{fig:DPGalgo}

491: \end{figure}

492:

493:

494: % or , which the thicker horizontal rectangles represent the records of

495: % a data file.

496:

497: \begin{figure*}[!hbt]

498: \begin{center}

499:   \epsfig{file=figure1.eps,width=7.0in,clip=}

500:   \caption{Distribute-Probe-Gather (DPG)

501:     (also, see pseudo-code in Figure~\ref{fig:DPGalgo})}

502:   \label{fig:DPG}

503: \end{center}

504: \end{figure*}

505:

506:

507:

508:

509: \subsection{\bf Data Skew}

510: Implicit in the description of the DPG algorithm is that the input

511: sequence of rids is distributed uniformly among the set of all rids of

512: the input data file.  This is always the case when DPG is applied to retrieve records

513: after sorting key-pointer pairs.  In that situation, the

514: key-pointer pairs act as a permutation vector to permute the

515: records in the input data file.

516:

517: If the input sequence of rids is not uniformly distributed, then some

518: RID runs will be larger than other runs.  As a consequence, in

519: Phase~II, when the partition of the temporary data file (the partition

520: of #RECORD_RUN# in Figure~\ref{fig:DPGalgo}) may be larger than the size

521: of the cache.  If only a few of the partitions of the temporary data

522: file are larger than cache then the overall running time is not

523: greatly affected.

524:

525: If there is a great deal of data skew and many of the temporary

526: partitions #RECORD_RUN# are larger than cache, then the $N/L$

527: partitions of the input sequence of rids must be chosen on some other

528: basis than the high order bits of the page number.  In such cases, one

529: can invoke the data skew handling techniques of DeWitt

530: et~al.~\cite{DeWitt92}.  Their solution, reformulated in our context,

531: is to sample the rids from the rid sequence.  The sampled set of rids

532: is then sorted, and partitions of the rids are chosen so as to evenly

533: partition the sampled set.

534:

535:

536: \section{Sorting}

537: \label{sec:sorting}

538:

539: As discussed in the introduction, DPG acts as an accelerator for many

540: main memory sorting algorithms.  Recall that main memory sorts

541: typically proceed in three phases:

542: \begin{enumerate}

543: \item extraction of key-pointer;

544: \item sorting of the key-pointer pairs; and

545: \item copying of the original records

546:   into the destination array according to the sorted key-pointer

547:   pairs.

548: \end{enumerate}

549:

550: AlphaSort~\cite{AlphaSort} and SuperScalarSort~\cite{SuperScalarSort}

551: are examples of this three-phase sorting paradigm.  Both sorting

552: algorithms can be considered as main memory sorting algorithms.

553:

554: In principle, the sorting algorithms are single-pass disk-based

555: sorting algorithms.  Both sorting algorithms were introduced as an

556: answer to the Datamation Sorting Challenge~\cite{Datamation}.  The

557: Datamation challenge dictates that one is given one million records of

558: 100~bytes.  Each record has a 10~byte key.  The keys are uniformly

559: distributed.  At the time of the Datamation Challenge, external

560: sorting algorithms were required.  On today's computers, the data file

561: of 100~MB easily fits in main memory.

562:

563: Hence, the only disk-related portion of the Datamation Challenge is to

564: overlap disk~I/O with CPU operation.  Disk striping has the potential

565: to provide very fast disk~I/O.  This occurs because the disks are

566: accessed in parallel.  In this situation, main memory data retrieval

567: becomes the bottleneck.

568:

569: The DPG algorithm pushes back this main memory bottleneck.  By using

570: DPG for data movement, the largest cost of main memory sorting is

571: reduced.  We have reimplemented the main memory portion of

572: SuperScalarSort, both with and without DPG.

573:

574:

575: \section {Join Methods with DPG}

576: \label{sec:joinAlgoDPG}

577:

578: We use the ideas of DPG to present three new join algorithms: 1.c, 2 and~3.

579: Algorithms 1.a, 1.b and~4, will be included in the experimental

580: section~\ref{sec:joinExperiments} for completeness.

581: \begin{enumerate}

582: \item Sort-Merge Join

583: \begin{enumerate}

584: \item {Sort-Merge Join with AlphaSort (sort based on~\cite{AlphaSort})}

585: \item {Sort-Merge Join

586:     with SuperScalarSort (sort based on~\cite{SuperScalarSort})}

587: \item {Sort-Merge Join with DPG Sort}

588:   (sort based on DPG; see~\ref{sec:sortMergeJoin})

589: \end{enumerate}

590: \item {DPG-Sort Join} (see Section~\ref{sec:foreignKeyJoin})

591: \item {DPG-Move Join} (see Section~\ref{sec:foreignKeyJoin})

592: \item {Radix Join} (from~\cite{Manegold02})

593: \end{enumerate}

594:

595:

596: \subsection{Sort-Merge Join with DPG Sort}

597: \label{sec:sortMergeJoin}

598:

599: The well-known Sort-Merge join was introduced by Blasgen and

600: Eswaran~\cite{Blasgen77}.  There are two steps in Sort-Merge joins:

601: sort two relations on the join key and scan the sorted relations to do

602: a merge on the join key.

603:

604: Applying DPG sort at the first step provides a new faster sort-merge

605: join method, Sort-Merge join with DPG Sort. In

606: Sections~\ref{sec:sortMergeJoinExperiments}

607: and~\ref{sec:joinExperiments}, we experimentally compare different

608: versions of sort-merge join, according to the sorting methods used.

609: Specifically, we consider using DPG sort, AlphaSort and

610: SuperScalarSort for the sorting step.

611:

612:

613: \subsection{Foreign Key Join with DPG}

614:

615: We next consider joins in which the join key is a foreign key, and it

616: has an index.

617: We denote by~$R$ a non-indexed relation.  We denote

618: by~$F$ an indexed relation.  The notation is motivated by the example

619: of a foreign key join.  In a foreign key join, the join key is the

620: same as the foreign key.  So, the join key is a set of attributes in

621: the relation~$R$ that refers to a foreign key from relation~$F$.

622:

623: A {\em join triple} is a triple ($k$, rid$_R$, rid$_F$), such that $k$

624: is a key value, rid$_R$ is the rid of a record from~$R$ with key~$k$,

625: and rid$_F$ the rid of a record from~$F$ with key~$k$.

626:

627: There are three steps in a foreign key join algorithm with DPG.  The

628: first step is to construct join triples.  The second step is to use

629: the join triples to copy one of the two relations into a temporary

630: file according to an order derived from the join triples.  The third

631: step is to join the temporary file with the remaining relation.

632:

633: We describe the second and third steps initially in

634: Section~\ref{sec:foreignKeyJoin}.  We then return to the more

635: technical problem of efficiently constructing join triples in

636: Sections~\ref{sec:joinTriples} and~\ref{sec:indexLookup}.

637:

638: \subsubsection{Two Foreign Key Join Methods with DPG}

639: \label{sec:foreignKeyJoin}

640:

641: This section describes two DPG join algorithms.  It assumes that one

642: has already constructed the join triples.  Some algorithms for

643: constructing the join triples are described later in

644: Sections~\ref{sec:joinTriples} and~\ref{sec:indexLookup}.

645:

646: Assume that one has generated the join triples ($k$, rid$_R$,

647: rid$_F$).  It is now possible to ignore the key~$k$, and deal directly

648: with the rid pairs (rid$_R$, rid$_F$).  We wish to satisfy one of two

649: goals:

650: \begin{enumerate}

651: \item\label{item:moveF} {\bf DPG-Move join:}  Move the records of~$F$ into

652:   a temporary file according to the ordering of the records of~$R$.

653: \item\label{item:moveR} {\bf DPG-Sort join:}  Move the records of~$R$ into

654:   a temporary file according to the ordering of the records of~$F$.

655:   So, the rid pairs (rid$_R$, rid$_F$) will be sorted according to

656:   rid$_F$.

657: \end{enumerate}

658:

659: First consider DPG-Move join.  We will see how to generate the join

660: triples in the order of rid$_R$.  This is done by scanning the records

661: of the relation~$R$ in file order (in order of increasing rid$_R$).

662: Therefore the rid pairs can be used directly as part of a DPG

663: algorithm.  The second component of the pair, rid$_F$, is the sequence

664: of rids according to which we want to move the records of~$F$.  This

665: algorithm does not require any sorting or hashing.  Hence, we call

666: it DPG-Move join.

667:

668: Next consider DPG-Sort join.  In this version, we first sort

669: the rid pairs (rid$_R$, rid$_F$) according to the order of rid$_F$.

670: This is done, for example, with SuperScalarSort and DPG.  The

671: algorithm then reduces to record movement in which we wish to move the

672: records of~$R$ according to the ordering of~rid$_F$ in the sorted

673: sequence (rid$_R$, rid$_F$).

674:

675: Note that DPG-Move join is preferred when the relation~$F$ is smaller.

676: DPG-Sort join is preferred when the relation~$R$ is smaller.

677:

678:

679: \subsubsection{Construction of Join Triples}

680: \label{sec:joinTriples}

681:

682: The simplest solution for a foreign key join is to

683: do a file scan of~$R$, and for each record of~$R$ to extract the join

684: key and do an index lookup in the index of~$F$.  One can then join the

685: record of~$R$ with the corresponding record of~$F$.  This involves

686: random access, and is economical only if the join produces

687: very few records.  In particular, this will be the case only if the

688: number of records of~$R$ is small.

689:

690: A better solution is to do a file scan of~$R$, and to use the index

691: on~$F$ to create join triples.  To construct the join triples, one scans the

692: relation~$R$.  For each record of~$R$, one extracts the corresponding

693: rid and associated join key~$k$.  One then looks up the key~$k$ in the

694: index of~$F$.  The index lookup yields the final element of the

695: triple, rid$_F$.

696:

697: \subsubsection{Batch Lookup in Indexes}

698: \label{sec:indexLookup}

699:

700: Note that the join triple is constructed at the cost of a file scan

701: of~$R$ and an index lookup in the index of~$F$ for each record of~$R$.

702: Usually an index on a data file is much smaller than the full data

703: file.  If the index fits entirely in cache, then the index lookup will

704: be significantly cheaper than the file scan.

705:

706: Unfortunately, the indexes in many main memory databases generally do

707: not fit in cache.  In such cases, the index lookup in the index of~$F$

708: will dominate the costs.  For example, in a B+~tree indexing

709: $N$~records, if an internal node has $m$~children, then $\log_m N$

710: nodes of the B+~tree must be accessed.  Each such access will be a

711: random access in main memory.  Most of the random accesses imply a

712: cache miss.  The cost of so many random accesses makes a naive index

713: lookup uneconomical.  Even if the index is a hash index, at least one

714: random access in memory will be required.

715:

716: Luckily, it is possible to execute the index lookup faster than the

717: above analysis would indicate.  This is because the construction of

718: the join triples requires many index lookups, with no intervening

719: record accesses.  For purposes of join triple construction, {\em batch

720: lookup} of keys in an index suffices.  By batch lookup, we assume that

721: an array of join keys is first extracted by scanning the data file

722: of~$R$.  The batch lookup then produces an array of rids for~$F$

723: through the use of the index on~$F$.  We show that the index lookups

724: can be reorganized into a two-pass algorithm.  Two such two-pass

725: algorithms are demonstrated: one for B+ tree indexes, and one for hash

726: indexes.

727:

728: Note that batch lookup of rids in an index can be substantially faster

729: than individual lookup.  Rao and Ross had previously discussed cache

730: conscious indexes for main memory~\cite{Rao99}.  There they present

731: CSS-trees, which have better cache behavior than either B+~trees or

732: hash indexes.  However, they only consider individual lookup of keys

733: in an index, one at a time.  In their scenario, a second key is not

734: looked up until the index lookup of the first key has been resolved.

735:

736: \paragraph{\bf Batch Lookup in B+ Tree Indexes}

737:

738: For simplicity, we describe the two-pass lookup for an enhanced

739: B+~tree.  The notion of {\em enhanced B+~tree} was introduced by Rao

740: and Ross~\cite{Rao99}.  The idea is that all slots of a B+~tree node

741: are used.  This is similar to compact B-Trees~\cite{CS83} or to the

742: ISAM method introduced by IBM~\cite{GR93}.  In the context of a

743: general B+~tree this can be accomplished by maintaining updates to the

744: B+~tree in a separate index, and then doing a batch update by

745: reorganizing the B+~tree.  For brevity, we will sometimes refer to

746: B+~tree, although in all cases, an enhanced B+~tree is intended.

747:

748:

749: Assume that we are executing a file scan of~$R$ with the purpose of

750: constructing the join triples.  We collect a sequence of keys from the

751: relation~$R$, and we wish to carry out a batch lookup of rids in the

752: enhanced B+-tree.  The rids will be used to retrieve records from the

753: foreign relation~$F$.  We assume that the B+~tree does not fit in

754: cache.  Our goal is a two-pass algorithm which will efficiently

755: accomplish batch lookup.

756:

757: Assume that the B+~tree has $m$~entries per node.  Assume that the

758: B+~tree indexes $N$~records.  Then there are $\log_m N$ levels.  Assume

759: further that the first $(\log_m N)/2$ levels (the top half of the

760: B+~tree) fit in cache.  For $m\gg 2$, the top half of the tree has

761: approximately $m^{(\log_m N)/2}=\sqrt{N}$ slots.  For example, if

762: there are $N=10^9$ records, then there are 32,000 slots, which clearly

763: fit inside cache.

764:

765: The strategy is a two-pass strategy.  In the first pass, one performs

766: a lookup of each key, but only using the top half of the B+~tree.

767: As shown in Figure~\ref{fig:Btree}, each leaf of the

768: subtree comprising this upper half can be considered as the root of a

769: second subtree comprising nodes in the lower half of the B+~tree.

770: Thus, at the end of this first pass, a key can be associated with a

771: leaf of the upper subtree, which is also a root of one of the lower subtrees.

772:

773: \begin{figure}[!hbt]

774: \begin{center}

775:   \epsfig{file=figure2.eps,width=3.0in}

776:   \caption{B-Tree (Each smaller triangle represents a subtree that

777:   fits in cache.)}

778:   \label{fig:Btree}

779: \end{center}

780: \end{figure}

781:

782: After the first pass, one can associate each key with a subtree within

783: the lower half of the B+~tree.  So, in the second pass, one loads a

784: subtree from the lower half.  One then continues the lookup for all

785: keys associated with the root of the subtree in the lower half.

786: At the end, one then has a sequence of keys and rids.

787:

788: If one wishes to have the keys in the same order as the order of the

789: original rids, then this can also be arranged.  In this case, one

790: extends the previous scenario to use the DPG algorithm.

791:

792: The first phase of the DPG algorithm is to {\em distribute} the keys into

793: runs.  In the initial

794: lookup of a key in the first half, it was associated with a root of a

795: subtree in the first half.  The particular root of a subtree

796: identifies the run into which the key is copied.

797:

798: The second phase of the DPG algorithm is to use the key to {\em probe} the

799: index, in order to find the rid.  One completes the lookup of all keys

800: in a run associated with a particular subtree of the B+~tree, before

801: proceeding to the next subtree in the lower half.  The resulting rids

802: are stored in a temporary partition or run.

803: This is exactly what was described earlier.

804:

805: Finally, the third phase of the DPG algorithm {\em gathers} the rids

806: into a destination array in the same order as that of the original

807: keys.  Hence, we have completed a batch lookup of the keys, and

808: returned an array of rids in the same order as that of the original

809: keys.

810:

811: \paragraph{\bf Batch Lookup in Hash Indexes}

812:

813: Batch lookup of hash indexes also proceeds in two passes.  We assume

814: that the hash array of the hash index stores at each hash entry one

815: key-rid pair.  A second hash array associated with the index stores

816: pointers to overflow key-rid pairs that would have collided with an

817: occupied slot in the first hash array.

818:

819: The two-pass lookup for hash indexes proceeds in a very simple manner.

820: We extract the sequence of keys from~$R$.  As the keys are extracted,

821: the hash values are computed.  Those key-hash value pairs are saved in

822: an array in an order corresponding to the rid order of~$R$.

823:

824: It now suffices to apply the DPG algorithm.  The hash array is

825: partitioned into sets of $L$~hash slots.  In the distribute phase, the

826: hash value acts as an index into the hash array.  Hence, this becomes

827: the permutation vector of the DPG algorithm.  The key and hash value

828: are then written into separate runs according to the partition of the

829: hash array.  There is one run of key-hash values for each partition of

830: the hash array.

831:

832: In the probe phase, a run of hash values is loaded into cache along

833: with the corresponding partition of the hash array.  As part of the

834: probe phase, the key and hash values from the run are used to look up

835: the the corresponding rid in the partition of the hash array.  The

836: rids are then saved in a temporary partition, in the same order as the

837: order in which the key-hash values of the original partition are stored.

838:

839: Finally, in the gather phase, the ordering of the key-hash

840: value pairs are used to gather the rids from the temporary partition.

841: As in Section~\ref{sec:DPG}, the rids are gathered into a destination

842: array in an order corresponding the ordering of the original key-hash

843: value pairs.

844:

845:

846:

847:

848: \section{Experimental Evaluation}

849:

850: \subsection{Sorting Comparisons}

851:

852: Figure~\ref{fig:sorting} demonstrates the acceleration achieved by

853: SuperScalarSort when DPG is used.  The results are demonstrated on the

854: IBM pSeries 690 Turbo.  The IBM~p690 has an L3~cache of size 128~MB.

855: Hence, in order to realistically demonstrate DPG, we were forced to

856: increase the size of the database.  We chose to implement

857: SuperScalarSort for a data file of size 512~MB.  The record size was

858: treated as a variable, to illustrate the influence of record size.  As

859: in the original Datamation Challenge, the key is 10~bytes.

860:

861: \begin{figure}[!hbt]

862: \begin{center}

863:   \epsfig{file=figure3.eps,width=3.0in}

864:   \caption{Sorting Comparison}

865:   \label{fig:sorting}

866: \end{center}

867: \end{figure}

868:

869: Note that the additional speed of SuperScalarSort with DPG is most

870: pronounced for smaller record sizes.  For record sizes of 256~bytes

871: and higher, DPG provides only a small advantage.  This is because the

872: IBM p690 has an L3~cache block of size 512~bytes.  So, for record

873: sizes below 256~bytes, a cache miss incurs significant overhead

874: in loading a 512~byte cache block.

875:

876: There was some variability in the results because the data was taken

877: on a time-shared, shared memory machine.  On the IBM~p690, the

878: L2~cache is shared among two CPUs, and the L3~cache is shared among

879: eight CPUs.  Our experiments were run on a single CPU.  Hence another

880: process on a neighboring CPU could consume some of our cache, thereby

881: affecting the timings.  The reported experimental results are the

882: averages of three runs each.

883:

884: \subsection{Comparion of Sort-Merge Join using Different Sorting Algorithms}

885: \label{sec:sortMergeJoinExperiments}

886:

887: \begin{figure}[!hbt]

888: \begin{center}

889:   \epsfig{file=figure4.eps,width=3.0in}

890:   \caption{Sort-Merge Join (IBM Power4 p690 Series)}

891:   \label{fig:sortMerge}

892: \end{center}

893: \end{figure}

894:

895: We implement two sort-merge join methods: sort-merge join with DPG sort

896: and sort-merge join with SuperScalarSort.  As defined in section~\ref{sec:sortMergeJoin},

897: sort-merge join with DPG sort applies the DPG sort for the sorting

898: phase and sort-merge join with SuperScalarSort applies SuperScalarSort

899: for the sorting phase.

900:

901: As expected, the acceleration in the speed of sort-merge join in

902: Figure~\ref{fig:sortMerge} follows the same pattern as that of

903: Figure~\ref{fig:sorting} for sorting. In this case, we illustrate our

904: results on a database of size 128~MB on the IBM p690.  The IBM p690

905: has an L2~cache of size 1.4~MB.  So, in this case, the L3~cache is acting

906: as the main memory, while L2~cache is acting as the ``cache''.

907:

908:

909:

910: \subsection{Comparion of Six Different Join Methods}

911: \label{sec:joinExperiments}

912:

913: In Figures~\ref{fig:join_nonuniform_1} through~\ref{fig:join_dup_3},

914: we now experimentally compare the six join methods originally

915: presented at the begining of Section~\ref{sec:joinAlgoDPG}.  The three

916: join algorithms labelled Sort-Merge Join with DPG Sort, DPG-Sort Join,

917: and DPG-Move Join are all new.  The remaining algorithms, Sort-Merge

918: Join with AlphaSort, Sort-Merge Join with SuperScalarSort, and Radix

919: Join, are all based on sorting or join algorithms from the literature.

920:

921: As explained earlier, we denote by

922: ~$F$ an indexed relation and ~$R$ a relation that has foreign key on

923: the indexed attribute of ~$F$.

924: We consider both uniform and non-uniform distribution of foreign key values.

925:

926: \paragraph{\bf The non-uniform distribution of join key values.}  Sort-merge join with

927: AlphaSort and DPG-Sort are the only two of the previously discussed join methods that operate

928: correctly for a non-uniform distribution of join key values.

929:

930: In the following we use count bucket sort for bucket sort. The count bucket sort is as

931: follows:

932: \begin{enumerate}

933: \item count the number of elements destined for each bucket.

934: \item set bucket boundaries according to the statics computed and

935: distribute elements to buckets.

936: \end{enumerate}

937:

938: AlphaSort works for non-uniformly distributed data, because it uses

939: quicksort to sort each run and uses replacement-selection to merge

940: the runs. DPG-Sort join works for non-uniformly distributed data too,

941: because we use count bucket sort to sort RIDs.

942:

943: DPG-Sort join sorts the RIDs according to the lowest~ $\log{N}$ bits

944: is enough, $N$ is the number of records. In the simulation, we assign

945: the RID values in the range $[1, ... , n]$, the lowest $\log{N}$ bits is

946: sufficient for sorting. For example, for $N=2^{20}$ sorting RIDs

947: according to the lowest 20 bits is enough. This could be done in two

948: steps with count bucket sort: first do count bucket sort according to

949: the lower 10 bits, then do count bucket sort according to the higher

950: 10 bits.

951:

952: Using the UNIX $random()$ and $exp()$ functions we generate an

953: exponential distribution of the data as $exp(c*(random()>>10))$ ($c$

954: is a constant and in our experiments we assign c with $-0.0000001$).

955: A comparison of Sort-Merge join with AlphaSort and DPG-Sort join on

956: three different computer architectures is provided in the

957: figures~\ref{fig:join_nonuniform_1}, ~\ref{fig:join_nonuniform_2}, and

958: ~\ref{fig:join_nonuniform_3}. The three different architectures are

959: the IBM Power4 Pseries 690 Turbo, the 3.06 GHz Pentium 4 with Rambus

960: PC-1200 RAM, and the 2.6 GHz Pentium 4 with DDR-266 RAM, From the

961: comparison, we can see that DPG-sort join is much faster than

962: Sort-Merge join with AlphaSort.

963:

964: \begin{figure}[!hbt]

965: \begin{center}

966:   \epsfig{file=figure5.eps,width=3.0in}

967:   \caption{Comparison of Joins (IBM Power4 pSeries 690 Turbo,

968:   exponential distribution of keys)}

969:   \label{fig:join_nonuniform_1}

970: \end{center}

971: \end{figure}

972:

973: \begin{figure}[!hbt]

974: \begin{center}

975:   \epsfig{file=figure6.eps,width=3.0in}

976:   \caption{Comparison of Joins (2.6~GHz Pentium 4 / DDR-266 RAM,

977:   exponential distribution of keys)}

978:   \label{fig:join_nonuniform_2}

979: \end{center}

980: \end{figure}

981:

982:

983: \begin{figure}[!hbt]

984: \begin{center}

985:   \epsfig{file=figure7.eps,width=3.0in}

986:   \caption{Comparison of Joins (3.06~GHz Pentium 4 / Rambus PC-1200 RAM,

987:   exponential distribution of keys)}

988:   \label{fig:join_nonuniform_3}

989: \end{center}

990: \end{figure}

991:

992: \paragraph{\bf The uniform distribution of join key values.}  First we will

993: show how all algorithms work for uniformly distributed join key values.

994: SuperScalarSort is a key-prefix-sort explained further in

995: section~\ref{sec:intro}.  It assumes that data is uniformly

996: distributed according to the highest ~7 bits of the key. This kind of

997: distribution can be applied to all DPG algorithms, because a unform

998: distribution of the join key values is the only constraint of

999: Sort-Merge join with DPG and DPG-move join.

1000:

1001: We also implement the radix join of Manegold et al.~\cite{Manegold02}.

1002: They describe a variation of hash join for main memory.  Radix join

1003: also requires a uniform distribution of the join key values.

1004: Otherwise, some of their partitions will be too large to fit into the

1005: L2 cache and how to set the boundaries is unknown.

1006:

1007: We produce uniformly distributed foreign key values using the UNIX

1008: $random()$ function.  A comparison of the different join methods on

1009: three different computer architectures as before is provided in the

1010: figures. Figure~\ref{fig:join_no_dup} reflects data with no duplicate

1011: join key values in the relation~$R$.  Figures~\ref{fig:join_dup_1},

1012: ~\ref{fig:join_dup_2}, and ~\ref{fig:join_dup_3}, show the same

1013: information in which duplicate join key values are allowed.  From the

1014: comparison, we can see that DPG-move join and radix join are the

1015: fastest. DPG-move is better for large records and radix join is better

1016: for small records.

1017:

1018: \begin{figure}[!hbt]

1019: \begin{center}

1020:   \epsfig{file=figure8.eps,width=3.0in}

1021:   \caption{Comparison of Joins (IBM Power4 pSeries 690 Turbo, no

1022:   duplicate keys)}

1023:   \label{fig:join_no_dup}

1024: \end{center}

1025: \end{figure}

1026:

1027: \begin{figure}[!hbt]

1028: \begin{center}

1029:   \epsfig{file=figure9.eps,width=3.0in}

1030:   \caption{Comparison of Joins (IBM Power4 pSeries 690 Turbo,

1031:   duplicate keys)}

1032:   \label{fig:join_dup_1}

1033: \end{center}

1034: \end{figure}

1035:

1036: \begin{figure}[!hbt]

1037: \begin{center}

1038:   \epsfig{file=figure10.eps,width=3.0in}

1039:   \caption{Comparison of Joins (2.6~GHz Pentium 4 / DDR-266 RAM,

1040:   duplicate keys)}

1041:   \label{fig:join_dup_2}

1042: \end{center}

1043: \end{figure}

1044:

1045:

1046: \begin{figure}[!hbt]

1047: \begin{center}

1048:   \epsfig{file=figure11.eps,width=3.0in}

1049:   \caption{Comparison of Joins (3.06~GHz Pentium 4 / Rambus PC-1200 RAM,

1050:   duplicate keys)}

1051:   \label{fig:join_dup_3}

1052: \end{center}

1053: \end{figure}

1054:

1055: \section{Conclusions}

1056: The use of DPG in the sorting provides an accelerator for existing

1057: sorting algorithms. Especially for the smaller record sizes, such as

1058: 32~bytes and 64~bytes, the performance improvements are really

1059: impressive.

1060:

1061: For the more common case of non-uniform distribution of join key values,

1062: DPG-Sort join works better than sort-merge join with AlphaSort across

1063: all the tested platforms. For smaller records sizes, such as, 32, 64,

1064: DPG-Sort join is much better than sort-merge join with AlphaSort. More

1065: impressively, on the newer platform, Rambus PC-1200 RAM, DPG-Sort works

1066: better than on PC with DDR-266 RAM and even works better for larger

1067: record sizes, for example 512~bytes.

1068:

1069: For special case of uniform distribution of join key values, DPG-move join

1070: and radix join are the best. The remaining DPG algorithms are also

1071: competitive with older algorithms although with a smaller improvement.

1072:

1073: \section{Future Work}

1074: The DPG algorithm can be easily generalized to multiple passes.  This

1075: can be useful when there is a very small cache in relation to the size

1076: of main memory.  However, we do not encounter this scenario in our

1077: current experiments.

1078:

1079: \section{Acknowledgements}

1080: We would like to thank Betty Salzberg and Donghui Zhang for extensive

1081: conversations and insights into additional situations where

1082: distribute-probe-gather can be beneficial.  We also gratefully

1083: acknowledge the use of the support for the computations on the IBM

1084: p690 by the Scientific Computing and Visualization (SCV) group at

1085: Boston University.

1086:

1087:

1088: \bibliographystyle{latex}

1089: \begin{thebibliography}{99}

1090:

1091: \bibitem{SuperScalarSort}

1092:   Agarwal, R.C. ``A Super Scalar Sort Algorithm for RISC Processors'',

1093:   {\sl ACM SIGMOD~'96}, pp.~240--246, June 1996.

1094:

1095:

1096: % \bibitem{BlockOrientedProcessing}

1097: % Sriram Padmanabhan, Timothy Malkemus, Ramesh C. Agarwal, Anant

1098: % Jhingran: Block Oriented Processing of Relational Database Operations

1099: % in Modern Computer Architectures. ICDE 2001: 567-574.

1100: %  Abstract: Several recent papers have pointed out that database

1101: %  systems are not well tuned to take advantage of modern superscalar

1102: %  processor architectures. In particular, the Clocks-Per-Instruction

1103: %  (CPI) for rather simple database queries are quite poor compared to

1104: %  scientific kernels or SPEC benchmarks. The lack of performance of

1105: %  database systems has been attributed to poor utilization of caches

1106: %  and processor function units as well as higher branching

1107: %  penalties. In this paper, we argue that a block oriented processing

1108: %  strategy for database operations can lead to better utilization of

1109: %  the processors and caches generating significantly higher

1110: %  performance. We have implemented the block oriented processing

1111: %  technique for aggregation, expression evaluation, and sorting

1112: %  operations as a feature in the DB2 Universal Database system. We

1113: %  present results from representative queries on a 30 GB TPC-H database

1114: %  to show the value of this technique.

1115:

1116:

1117: % On Pentium 4 w/ NT 4.0

1118: \bibitem{Ailamaki99}

1119: A. Ailamaki, D.J.~DeWitt, M.D.~Hill, D.A.~Wood,

1120: ``DBMSs on a Modern Processor: Where Does Time Go?'',

1121: {\sl VLDB'99, Proceedings of 25th International Conference on Very

1122:                Large Data Bases}, 1999, pp.~266-277.

1123:

1124: \bibitem{Datamation}

1125:   Anon., Et-Al. (1985). ``A Measure of Transaction Processing Power'',

1126:   {\sl Datamation}~{\bf 31}(7): pp.~112--118.  Also in {\sl Readings

1127:     in Database Systems}, M.J.~Stonebraker, ed., Morgan Kaufmann, San

1128:   Mateo, 1989.

1129:

1130: \bibitem{ArgeEtAl98}

1131:   L.~Arge, O.~Procopiuc, S.~Ramaswamy, T.~Suel and

1132:      J.S.~Vitter, ``Scalable Sweeping-Based Spatial Join'', {\sl

1133:   VLDB 1998, Proceedings of 24th International Conference on Very Large

1134:    Databases}, 1998, pp.~570--581.

1135:

1136: \bibitem{BGB98}

1137: L.A. Barroso, K.~Gharachorloo and E.D.~Bugnion ``Memory System

1138: Characterization of Commercial Workloads'', {\sl Proc. of the

1139:   Int. Symp. on Computer Architecture}, Barcelona, Spain, 1998.

1140:

1141: % sort-merge join

1142: \bibitem{Blasgen77}

1143:  M.W.~Blasgen and K.P.~Eswaran,

1144:  ``Storage and access in relational databases'', {\sl IBM Systems

1145:  Journal}~{\bf 16}(4), 1977.

1146:

1147: \bibitem{Boncz99}

1148: P.~Boncz, S.~Manegold and M.L.~Kersten,

1149:     ``Database Architecture Optimized for the New Bottleneck: Memory Access'',

1150:     {\sl The {VLDB} Journal}, 1999, pp.~54--65.

1151:

1152: \bibitem{CS83}

1153: F.~Cesarini and G.~Soda, ``An Algorithm to Construct a Compact B-tree in

1154: Case of Ordered Keys'', {\sl Information Processing Letters}~{\bf

1155: 17}(1), pp.~1612--1630, 1983.

1156:

1157: \bibitem{DeWitt84}

1158: D. DeWitt, R. Katz, F.~Olken, L.~Shapiro, M.~Stonebraker and D.~Wood,

1159: ``Implementation Techniques for Main Memory Databases'', {\sl

1160:   Proc. ACM SIGMOD Conf. on the Management of Data}, 1984.

1161:

1162: \bibitem{DeWitt92}

1163: D. DeWitt, J.~Naughton, D.~Schneider and S.~Seshadri,

1164: ``Practical Skew Handling in Parallel Joins'',

1165: {\sl VLDB 1992, Proc. Intl. Conf. on Very Large Databases}, 1992.

1166:

1167: \bibitem{Dittrich02}

1168: J.-P. Dittrich, B.~Seeger, D.S.~Taylor and P.~Widmayer,

1169: ``Progressive Merge Join: A Generic and Non-blocking Sort-based Join

1170: Algorithm'', {\sl VLDB 2002, Proceedings of 28th International

1171:   Conference on Very}, 2002.

1172:

1173: \bibitem{GR93}

1174: J. Gray and A.~Reuter, {\em Transaction Processing:  Concepts and

1175: Techniques}, Morgan Kaufmann, San Francisco, 1993.

1176:

1177: \bibitem{processor}{\sl Intel Pentium 4 and Intel Xeon Processor

1178: Optimization Reference

1179: Manual}. \url{http://www.intel.com/design/Pentium4/manuals/index.htm}

1180: \bibitem{RAM}{\sl Intel PC SDRAM Specification, Revision 1.7, November

1181: 1999}.

1182:

1183: \bibitem{LostCircuits}

1184: {\sl Lost Circuits} web site:

1185: ``Latency vs. Bandwidth, a performance analysis'',

1186: \url{http://www.lostcircuits.com/memory/latency/2.shtml};

1187: ``Inside the EDDR Chip:

1188: Combining DRAM storage and SRAM speed'',

1189: \url{http://www.lostcircuits.com/memory/eddr/}, Nov.~27, 2000;

1190: and ``High Performance DDR DIMMs'',

1191: \url{http://www.lostcircuits.com/memory/ddr2/}, July~17, 2001.

1192:

1193: \bibitem{Patterson98} K.~Keeton, D.A.~Patterson, Y.Q.~He R.C.~Raphael

1194:   and W.E.~Baker ``Performance Characterization of a Quad {P}entium

1195:   Pro SMP using OLTP Workloads'' {\sl Porc. of the Int. Syp. on

1196:     Computer Architecture}, p0p.~15--26, Barcelona, Spain, 1998.

1197:

1198: \bibitem{Knuth98}

1199: D.E. Knuth, {\sl Sorting and Searching, The Art of Computer

1200:   Programming}, vol.~3, second edition,

1201: Addison Wesley, Reading, MA, 1998

1202:

1203: \bibitem{Manegold99}

1204: S.~Manegold, P.A.~Boncz and M.L.~Kersten,

1205: ``Database architecture

1206:    optimized for the new bottleneck: Memory access'', {\sl VLDB '99,

1207:   Proc. of 25th Intl. Conf.

1208:    Very Large Databases}, 1999.

1209:

1210: % \bibitem{ManegoldCWI99}

1211: % S. Manegold, P.A.~Boncz and M.L.~Kersten,

1212: % ``Optimizing main-memory join on modern hardware'',

1213: % Technical report INS-R9912, Centruum voor Wiskunde Informatica (CWI),

1214: % Netherlands, 1999.

1215:

1216: \bibitem{Manegold00}

1217: S.~Manegold, P.A.~Boncz and

1218:                M.L.~Kersten,

1219: ``What Happens During a Join? Dissecting CPU and Memory Optimization

1220:                Effects'',

1221: {\sl VLDB 2000, Proceedings of 26th International Conference on Very

1222:                Large Data Bases, September 10-14, 2000, Cairo, Egypt},

1223: {2000}, pp.~339--350.

1224:

1225: \bibitem{Manegold02}

1226: S. Manegold, P.A.~Boncz and M.L.~Kersten,

1227: ``Optimizing main-memory join on modern hardware'',

1228: {\sl IEEE Transactions on Knowledge and Data Engineering}~{\bf 14}(4),

1229: 2002, pp.~709--730.

1230:

1231: %  Abstract

1232: %  In the past decade, the exponential growth in commodity CPUs speed has

1233: %  far outpaced advances in memory latency. A second trend is that CPU

1234: %  performance advances are not only brought by increased clock rate, but

1235: %  also by increasing parallelism inside the CPU. Current database

1236: %  systems have not yet adapted to these trends, and show poor

1237: %  utilization of both CPU and memory resources on current hardware. In

1238: %  this article, we show how these resources can be optimized for large

1239: %  joins and translate these insights into guidelines for future database

1240: %  architectures, encompassing data structures, algorithms, cost

1241: %  modeling, and implementation. In particular, we discuss how vertically

1242: %  fragmented data structures optimize cache performance on sequential

1243: %  data access. On the algorithmic side, we refine the partitioned

1244: %  hash-join with a new partitioning algorithm called radix-cluster,

1245: %  which is specifically designed to optimize memory access. The

1246: %  performance of this algorithm is quantified using a detailed

1247: %  analytical model that incorporates memory access costs in terms of a

1248: %  limited number of parameters, such as cache sizes and miss penalties.

1249: %  We also present a calibration tool that extracts such parameters

1250: %  automatically from any computer hardware. The accuracy of our models

1251: %  is proven by exhaustive experiments conducted with the Monet database

1252: %  system on three different hardware platforms. Finally, we investigate

1253: %  the effect of implementation techniques that optimize CPU resource

1254: %  usage. Our experiments show that large joins can be accelerated almost

1255: %  an order of magnitude on modern RISC hardware when both memory and CPU

1256: %  resources are optimized.

1257:

1258: \bibitem{DDR}Micron Technology, Inc.,

1259: {\sl 256Mb: x4, x8, x16~DDR~SDRAM}, Revision~C, April, 2001,

1260: \url{http://download.micron.com/pdf/datasheets/dram/256Mx4x8x16DDR_D.pdf}

1261:

1262: \bibitem{Negri85}

1263:   M.~Negri and G.~Pallagatti, ``Join During Merge:  An Improved Sort

1264:   Based Algorithm'', {\sl Information Processing Letters}~{\bf 21}(1),

1265:   1985, pp.~11--16.

1266:

1267: \bibitem{AlphaSort}

1268:   Nyberg, C., Barclay, T., Cvetanovic, Z., Gray, J., and Lomet, D.

1269:   ``AlphaSort: A Cache-Sensitive Parallel External Sort'', {\sl VLDB

1270:     Journal}~{\bf 4}(4), pp.~603--627, 1995.

1271:

1272: \bibitem{Patel96}

1273:   J. M. Patel and D. J. DeWitt, ``Partition Based SpatialMerge Join'',

1274:   {\sl Proc. of the ACM SIGMOD Conference on Management of Data}, 1996.

1275:

1276: \bibitem{Zhang02}

1277:   D. Zhang and V.J. Tsotras and B. Seeger,

1278:   ``Efficient Temporal Join Processing using Indices'',

1279:   {\sl Proc. of 18th International Conference on Data Engineering (ICDE)},

1280:   2002.

1281:

1282: % Available at

1283: % \url{http://research.microsoft.com/barc/SortBenchmark/AlphaSort.rtf}.

1284:

1285: \bibitem{Rao99}

1286: J.~Rao and K.A.~Ross,

1287: ``Cache Conscious Indexing for Decision-Support in Main Memory'',

1288: {\sl VLDB 1999, 25th Intl. Conf.

1289:    Very Large Databases}, 1999, pp.~78--89.

1290:

1291: % hash-join algorithm (variation of GRACE)

1292: \bibitem{Shapiro86}

1293: L.D. Shapiro,

1294: ``Join Processing in Database Systems with Large Main Memories'',

1295: {\sl ACM Transactions on Database Systems}~{\bf 11}(3), 1986,

1296: pp.~239--264.

1297:

1298: \bibitem{ShatdalEtAl94}

1299: A.~Shatdal, C.~Kant and J.F.~Naughton,

1300: ``Cache Conscious Algorithms for Relational Query Processing'',

1301: {\sl VLDB 1994,Proceedings of the 20th VLDB Conference},

1302: 1994, pp.~510--521.

1303:

1304: \bibitem{TLZT97}

1305: P. Trancoso J.L.~Larriba-Pey, Z.~Zhang and J.~Torellas, ``The Memory

1306: Performance of DSS Commercial Workloads in Shared-Memory

1307: Multiprocessors'', {\sl Int. Symp. on High Performance Computer

1308:   Architecture}, San Antonio, TX, USA 1997.

1309:

1310: \bibitem{WulfMcKee95}

1311: W.A. Wulf and S.A. McKee.

1312:  ``Hitting the memory wall: Implications of the obvious'',

1313:  {\sl ACM Computer Architecture News}, 23(1):20--24, 1995.

1314:

1315: % Interesting paper for cache conscious search trees:

1316: % Jun Rao, Kenneth A. Ross: Making B^+-Trees Cache Conscious in

1317: %    Main Memory. SIGMOD Conference 2000: 475-486

1318:

1319: \end{thebibliography}

1320:

1321: \end{document}

1322: