0405:cs0405069/acm.tex

1: \documentclass[11pt]{article}

2: \usepackage{epsfig}

3: \usepackage{amsmath}

4: \usepackage{amssymb}

5: \usepackage{geometry}

6: \usepackage{url}

7:

8: \newcommand{\xf}[1]{Figure~\ref{#1}}

9: \newcommand{\xt}[1]{Table~\ref{#1}}

10: \newcommand{\xp}[1]{page~\pageref{#1}}

11: \newcommand{\xs}[1]{Section~\ref{#1}}

12: \newcommand{\xa}[1]{Appendix~\ref{#1}}

13: \newtheorem{theorem}{Theorem}

14: \newtheorem{lemma}{Lemma}

15: \newtheorem{prop}{Proposition}

16: \newtheorem{defn}{Definition}

17:

18:

19: \def\db{\mbox {$\cal D$}}

20: \def\mm{\mbox {$\cal M$}}

21: \def\nn{\mbox {$t$}}

22: \def\tdb{\mbox{$t(\db_{\alpha})$}}

23: \def\tT{\mbox{$t(T_{\alpha})$}}

25: \def\nT{\mbox{$\nu(T_{\alpha})$}}

26: \def\njT{\mbox{$\nu[j](T_{\alpha})$}}

27: \def\mjT{\mbox{$\mu[j](T_{\alpha})$}}

28:

29: \newcommand{\vs}{\vspace{1ex}}		    % Small vertical space

30: \newcommand{\negvs}{\vspace*{-1ex}}		% Small negative vertical space

31:

32:

33:

34:

35: \newlength{\qedlengte}

36: \settowidth{\qedlengte}{$\Box$}

37: \addtolength{\qedlengte}{-0.25\qedlengte}

38: \newcommand{\qedbox}{\rule{\qedlengte}{\qedlengte}}

39: \newcommand{\qed}{\hspace*{1em}\hfill\qedbox}

40:

41:

42:

43: \newenvironment{prog}{\def\-{\hskip 1em}\penalty-1000\vskip\parskip

44: \parskip0pt\leftskip2em\obeylines\tt}{\par}

45:

46: \title{Mining Frequent Itemsets from Secondary Memory}

47:

48: \author{G\"{o}sta Grahne and Jianfei Zhu\\

49: Concordia University\\

50: Montreal, Canada\\

51: \{grahne, j\_zhu\}@cs.concordia.ca\\

52: }

53: \date{March 6, 2004}

54: \begin{document}

55: \maketitle

56:

57: \begin{abstract}

58: Mining frequent itemsets is at the core

59: of mining association rules, and is by now quite well

60: understood algorithmically.

61: However, most algorithms for mining frequent

62: itemsets assume that the main memory is large enough

63: for the data structures used in the mining,

64: and very few efficient algorithms deal with the case when

65: the database is {\em very} large or the minimum support is very low.

66: Mining frequent itemsets from a very large database

67: poses new challenges,

68: as astronomical amounts of raw data

69: are ubiquitously being recorded in commerce, science and government.

70:

71: In this paper, we discuss approaches to mining frequent itemsets when

72: data structures are too large to fit in main memory.

73: Several

74: divide-and-conquer

75: algorithms

76: are given for mining from disks.

77: Many novel techniques are introduced.

78: Experimental results show that

79: the techniques reduce

80: the required disk accesses by orders of magnitude,

81: and enable truly scalable data mining.

82: \end{abstract}

83:

84: \section{Introduction}

85:

86:

87:

88:

89: Mining frequent itemsets is a fundamental problem

90: for mining association rules \cite{AIS93, AS94,MTV94,PBT99, PHM00,WHP03,Zaki02, ZB03}.

91: It also plays an important role in many other data mining tasks

92: such as sequential patterns, episodes, multi-dimensional patterns and so on

93: \cite{AS95, MTV97, KHC97}.

94: In addition, frequent itemsets are one of the key abstractions

95: in data mining.

96:

97:

98: The description of the problem is as follows.

99: Let $I = \{i_1,i_2,\ldots,i_j,\ldots,i_n\}$

100: be a set of {\em items}.

101: Items will sometimes also be denoted

102: $a,b,c,\ldots$.

103: An $I$-{\em transaction} $\tau$ is a subset of $I$.

104: An $I$-transactional {\em database} $\db$ is a finite bag

105: of $I$-transactions.

106: The {\em support} of an itemset $S\subseteq I$

107: is the proportion of transactions in \db~that contain $S$.

108: The task of mining frequent itemsets is to find

109: all $S$ such that the support of $S$ is at least some

110: given {\em minimum support} $\xi$,

111: where $\xi$ either is a fraction in $[0,1]$,

112: or an absolute count.

113:
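The definitions above can be made concrete with a small brute-force sketch. It is not one of the algorithms discussed in this paper: the helper names {\em support\_counts} and {\em frequent\_itemsets} and the toy database are assumptions made for illustration, the threshold is given as an absolute count, and an itemset is reported when its count reaches that threshold.

```python
from itertools import combinations
from collections import Counter


def support_counts(db):
    """Count, for every non-empty itemset, the transactions that contain it.

    Exponential in the transaction length, so only suitable for toy data.
    """
    counts = Counter()
    for transaction in db:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[frozenset(subset)] += 1
    return counts


def frequent_itemsets(db, min_count):
    """Return {itemset: count} for itemsets contained in >= min_count transactions."""
    return {s: c for s, c in support_counts(db).items() if c >= min_count}


# Toy database (a bag of transactions); the data is illustrative only.
db = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"a", "c"}]
result = frequent_itemsets(db, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    # Print each frequent itemset with its support as a proportion.
    print(sorted(itemset), result[itemset] / len(db))
```

Enumerating all subsets of every transaction is exactly what becomes infeasible for large databases, which is why the data structures of the mining algorithm, and their size, matter in the rest of the paper.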

114: Most of the algorithms, such as

115: Apriori \cite{AS94},

116: DepthProject \cite{AAP00},

117: and dEclat \cite{Zaki03},

118: work well when the main memory is big enough to

119: fit the whole database and/or the data structures

120: (candidate sets, FP-trees, etc.).

121: When a database is very large or when the minimum support is very low,

122: either the data structures used by the algorithms may not be accommodated in

123: main memory,

124: or the algorithms spend too much time on

125: multiple passes over the database.

126: In the

127: {\em First IEEE ICDM Workshop on Frequent Itemset

128: Mining Implementations, FIMI~'03} \cite{ZB03},

129: many well known algorithms were implemented

130: and independently tested.

131: The results show that ``{\em none} of the algorithms is able to gracefully

132: scale-up to very large datasets,

133: with millions of transactions''

134: \cite{ZaBa03}.

135:

136:

137:

138: At the same time

139: very large databases do exist in real life.

140: In a medium-sized business or in a company as big as Walmart,

141: it is very easy to collect a few gigabytes of data.

142: Terabytes of raw data

143: are ubiquitously being recorded in commerce, science and government.

144: The question of how to handle these databases is still one of the most

145: difficult problems in data mining.

146:

147:

148: A few researchers have

149: tried to mine frequent itemsets from very large databases.

150: One approach is by {\em sampling}.

151: For instance, the method of \cite{Toiv96}

152: picks a random sample of the database,

153: finds all frequent itemsets from the sample, and then verifies

154: the results with the rest of the database.

155: This approach needs only one pass over the database.

156: However, the results are probabilistic,

157: meaning that

158: some critical frequent itemsets could be missing.

159:

160:

161:

162: {\em Partitioning} \cite{SON95}

163: is another approach for mining very large databases.

164: This approach first partitions the database

165: into many small databases,

166: and mines candidate frequent itemsets from each small database.

167: One more pass

168: over the original database

169: is then done to verify the candidate frequent itemsets.

170: The approach thus needs only two database scans.

171: However, when the data structures used for storing

172: candidate frequent itemsets

173: are too big to fit in main memory,

174: a significant number of disk I/O's is needed

175: for the disk resident data structures.

176:

177: In \cite{HPY00, HPYM04}, Han {\em et al.\ }introduce the {\em FP-growth}

178: method, which

179: uses two database scans for constructing an FP-tree

180: from the database,

181: and then mines all frequent itemsets from the FP-tree.

182: Two approaches are suggested for the case that

183: the FP-tree is too large to fit into main memory.

184:

185: The first approach writes the FP-tree to disk,

186: then mines all frequent sets by reading the

187: frequency information from the FP-tree.

188: However, the size of the FP-tree could be the same as the

189: size of the database, and for each item in the FP-tree,

190: we need at least one FP-tree traversal.

191: Thus the I/O's for writing and reading the

192: disk-resident FP-tree could be

193: prohibitive.

194:

195: The second approach

196: {\em projects} the original database

197: on each frequent item, then mines frequent itemsets from

198: the small projected databases.

199: One advantage of this approach is that any frequent itemset

200: mined from a projected database is a frequent itemset in the original database.

201: To get {\em all} frequent itemsets,

202: we only need to

203: take the union of the frequent itemsets from the small projected databases.

204: This is in contrast to the

205: partitioning approach,

206: where all candidate frequent itemsets have to be stored and later verified

207: by another pass over the database.

208: The biggest problem of the projection approach is that

209: the total size of the projected databases could be too large,

210: and there will be too many disk I/O's for the

211: projected databases.

212:

213: \subsubsection*{Contributions}

214: In this paper we consider the problem of  mining frequent itemsets

215: from {\em very} large databases.

216: We adopt a

217: divide-and-conquer approach.

218: First we give three algorithms:

219: the general divide-and-conquer algorithm,

220: then an algorithm using

221: naive projection, and an algorithm using

222: aggressive projection.

223: We also analyze the

224: number of steps and disk I/O's required by these algorithms.

225:

226: In a detailed divide-and-conquer algorithm,

227: called {\em Diskmine},

228: we use the highly efficient

229: {\em FP-growth*} method \cite{fimi03} to

230: mine frequent itemsets from an FP-tree in main memory.

231: We describe several novel techniques

232: useful in mining frequent itemsets from disks,

233: such as the array technique,

234: the item-grouping technique,

235: and memory management techniques.

236:

237: Finally, we present experimental results that

238: demonstrate that our {\em Diskmine} algorithm

239: outperforms previous algorithms

240: by orders of magnitude,

241: and scales up to terabytes of data.

242:

243:

244: \subsubsection*{Overview}

245: The remainder of this paper is organized as follows.

246: In Section 2

247: we introduce approaches for mining frequent itemsets from disks.

248: Three algorithms are introduced and analyzed.

249: Section 3 gives a detailed divide-and-conquer

250: algorithm {\em Diskmine},

251: in which many novel optimization techniques are used.

252: These techniques are also described in Section 3.

253: Experimental results are given in Section 4.

254: Section 5 concludes,

255: and outlines directions for future research.

256:

257:

258: \section{Mining from disk} \label{diskmine}

259:

260: How should one go about mining

261: frequent itemsets from very large databases

262: residing in secondary storage,

263: such as disks?

264: Here ``very large'' means that

265: the data structures constructed from the database

266: for mining frequent itemsets

267: cannot fit in the available main memory.

268:

269:

270: Basically, there are two strategies

271: for mining frequent itemsets,

272: the datastructures approach,

273: and the

274: divide-and-conquer approach.

275:

276: The {\em datastructures} approach consists of

277: reading

278: the database buffer by buffer,

279: and generating

280: data structures (e.g.\ candidate sets or FP-trees).

281: Since the data structures do not fit into main memory,

282: additional disk I/O's are required.

283: The number of passes and disk I/O's required

284: by the approach

285: depends on the algorithm and its datastructures.

286: For example,

287: if the algorithm is Apriori \cite{AS94}

288: using a hash-tree

289: for candidate itemsets

290: \cite{SON95},

291: disk based hash-trees have to be used.

292: Then the number of passes for the algorithm

293: is the same as the length of the longest

294: frequent itemset,

295: and the number of disk I/O's for the hash-trees

296: depends on the size of the hash-trees

297: on disk.

298:

299: The basic strategy for the

300: {\em divide-and-conquer} approach

301: is shown in \xf{bdaqalgo}.

302: In the approach,

303: $|\db|$ denotes

304: the size of the data structures used

305: by the mining algorithm, and

306: $M$ is the size of available main memory.

307: Function {\em mainmine}

308: is called if

309: candidate frequent itemsets (not necessarily all)

310: can be mined without

311: writing the data structures used by

312: a mining algorithm  to disks.

313: In \xf{bdaqalgo},

314: a very large database is decomposed into a number

315: of smaller databases.

316: If a ``small'' database is still too large,

317: i.e., the data structures are still too big to fit in main memory,

318: the decomposition is recursively continued

319: until

320: the data structures fit in main memory.

321: After all small databases are processed,

322: all candidate frequent itemsets are combined in some way

323: (obviously depending on the way the decomposition was done)

324: to get all frequent itemsets for the original database.

325:

326:

327: \begin{figure}[h]

328: {\bf Procedure} {\em diskmine}($\db,M$)

329:

330: \smallskip

331:

332: {\bf if} $|\db|\leq M$ {\bf then} {\bf return} {\em mainmine($\,\db$)}

333:

334: {\bf else} decompose $\db$ into $\db_1,\ldots,\db_k$.

335:

336: {\hskip 18pt}

337: {\bf return}  {\em combine} {\em diskmine($\,\db_1,M$)},

338:

339: {\hskip 154pt}                           ....              ,

340:

341: {\hskip 90pt}       {\em diskmine($\,\db_k,M$)}.

342:

343: \caption{{\small

344: General divide-and-conquer algorithm for

345: mining frequent itemsets from disk.

346: }}

347: \label{bdaqalgo}

348: \end{figure}
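The control flow of \xf{bdaqalgo} can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the parameters {\em size}, {\em mainmine}, {\em decompose} and {\em combine} are passed in as functions, and the instantiation (counting single items over a partition into halves, combined by adding counts) is a hypothetical stand-in chosen because it is easy to check, not the decomposition used later in the paper.

```python
from collections import Counter


def diskmine(db, memory, size, mainmine, decompose, combine):
    """Generic divide-and-conquer skeleton.

    size(db)      -> estimated size of the structures needed to mine db
    mainmine(db)  -> result computed entirely in main memory
    decompose(db) -> list of smaller databases
    combine(rs)   -> merges the list of partial results
    """
    if size(db) <= memory:
        return mainmine(db)
    parts = decompose(db)
    return combine([diskmine(p, memory, size, mainmine, decompose, combine)
                    for p in parts])


# Illustrative instantiation: counting single items over a partition.
def count_items(db):
    return Counter(item for transaction in db for item in transaction)


def split_in_halves(db):
    middle = len(db) // 2
    return [db[:middle], db[middle:]]


def add_counters(results):
    total = Counter()
    for partial in results:
        total.update(partial)  # Counter.update adds the counts
    return total


db = [{"a", "b", "d"}, {"b", "c", "d"}, {"a", "c"}, {"a", "b"}]
# "memory=1" forces the recursion down to single transactions.
counts = diskmine(db, memory=1, size=len, mainmine=count_items,
                  decompose=split_in_halves, combine=add_counters)
print(sorted(counts.items()))
```

Item counts are additive over the cells of a partition, so the combination step here is trivial. For itemsets of arbitrary length the choice of decomposition determines how expensive the combination becomes, which is the trade-off discussed next.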

349:

350:

351: The efficiency of {\em diskmine}

352: depends on

353: the method used for mining frequent itemsets

354: in main memory and on the number of

355: disk I/O's needed in the decomposition and

356: combination phases.

357: Sometimes the disk I/O is the main factor.

358: Since the decomposition step involves I/O,

359: ideally the number of recursive calls should be

360: kept small. The faster we can obtain small decomposed

361: databases, the fewer recursive calls we will need.

362: On the other hand, if a decomposition cuts

363: down the size of the projected databases drastically,

364: the trade-off might be that the combination

365: step becomes more complicated and might involve heavy

366: disk I/O.

367:

368:

369: In the following we discuss two decomposition

370: strategies, namely

371: decomposition by partition, and

372: decomposition by projection.

373:

374: {\em Partitioning}

375: is an approach in which a large database is decomposed into

376: cells of small non-overlapping databases.

377: The cell-size is chosen so that

378: all frequent itemsets in a cell can be mined without

379: having to store any data structures in  secondary memory.

380: However, since a cell only contains partial frequency

381: information of the original database,

382: all frequent itemsets from the cell are local

383: to that cell of the partition,

384: and could only be {\em candidate} frequent itemsets

385: for the whole database.

386: Thus the candidate frequent itemsets mined from

387: a cell

388: have to be verified

389: later to filter out false hits.

390: Consequently,

391: those candidate sets have to  be written to disk

392: in order to leave

393: space for processing the next cell of the partition.

394: After generating candidate frequent itemsets from

395: all cells,

396: another database scan is needed to

397: filter out all infrequent itemsets.

398: The partition approach therefore needs only two passes

399: over the database,

400: but writing and reading candidate frequent itemsets

401: will involve a significant number of

402: disk I/O's,

403: depending on the size of the set of candidate frequent itemsets.

404:

405: We can conclude that the partition approach

406: to decomposition keeps the recursive levels

407: down to one, but the penalty is that the

408: combination phase becomes expensive.

409:
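The two passes of the partition approach can be sketched as follows. This is a toy illustration and not the implementation of \cite{SON95}: the local miner is a brute-force enumeration, the names {\em local\_frequent} and {\em partition\_mine} are hypothetical, and the minimum support $\xi$ is taken as a fraction that an itemset must reach, both within a cell and in the whole database.

```python
from itertools import combinations
from collections import Counter


def local_frequent(cell, xi):
    """Itemsets whose relative support within one cell is at least xi (brute force)."""
    counts = Counter()
    for transaction in cell:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[frozenset(subset)] += 1
    return {s for s, c in counts.items() if c >= xi * len(cell)}


def partition_mine(db, xi, cell_size):
    """Two-pass partition scheme: local mining, then global verification."""
    # Pass 1: mine each cell separately; the union is the candidate set.
    candidates = set()
    for start in range(0, len(db), cell_size):
        candidates |= local_frequent(db[start:start + cell_size], xi)
    # Pass 2: count every candidate against the whole database.
    counts = Counter()
    for transaction in db:
        for candidate in candidates:
            if candidate <= transaction:
                counts[candidate] += 1
    frequent = {s for s in candidates if counts[s] >= xi * len(db)}
    return candidates, frequent


db = [{"a", "b", "d"}, {"b", "c", "d"}, {"a", "c"}, {"a", "b"}]
candidates, frequent = partition_mine(db, xi=0.5, cell_size=2)
print(len(candidates), "candidates,", len(frequent), "frequent")
```

Even on this tiny database the candidate set is twice as large as the final answer, which hints at why storing and verifying candidates dominates the cost when cells are small relative to the database.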

410:

411: To get an easier combination phase,

412: we adopt another decomposition strategy, which we call

413: {\em projection}.

414: Suppose for simplicity that there are four

415: items, $a,b,c,$ and $d$, and let $\db$ be a

416: database of transactions containing some

417: or all of these items.

418: We could then decompose

419: $\db$ into for instance

420: $\db_{ab}$ and

421: $\db_{cd}$.

422: Typically, we would do this when the descending order

423: of frequency of the items is $a, b, c, d$.

424: In $\db_{cd}$ we put all transactions

425: containing $c$ or $d$ (or both).

426: In $\db_{ab}$ we put transactions containing

427: $a$ or $b$ (or both), and for each transaction we store

428: only the $a,b$-part. Thus we will have shorter

429: transactions in $\db_{ab}$, and both

430: $\db_{ab}$ and

431: $\db_{cd}$ contain fewer transactions than $\db$.

432: We can then recursively mine all frequent itemsets

433: from $\db_{ab}$, and $\db_{cd}$.

434: Since this decomposition is not a partition,

435: the projected databases

436: might not be that much smaller than the

437: original database. The upside, though, is that

438: the set of all frequent itemsets in

439: $\db$ now simply is the union of the frequent

440: itemsets in $\db_{ab}$ and $\db_{cd}$.

441: This means that the combination phase

442: in diskmining is a simple union.

443:

444: To illustrate this decomposition,

445: let $\db$ contain the transactions

446: $\{a, b, d\}, \{b, c, d\}, \{a, c\}$ and $\{a, b\}$.

447: Suppose the minimum support is 50\%.

448: Then $\db_{cd}=\{\{a, b, d\}, \{b, c, d\}, \{a, c\}\}$ and

449: $\db_{ab} =\{ \{a, b\}, \{b\}, \{a\}, \{a, b\}\}$.

450: From $\db_{cd}$, we get all frequent itemsets

451: $\{d\}, \{b,d\}$, and $\{c\}$.

452: Note that although $\{a\}$ and $\{b\}$ are also frequent in $\db_{cd}$,

453: they are not listed since they contain neither $c$ nor $d$.

454: They will be listed in the frequent itemsets of $\db_{ab}$,

455: which are $\{a\}, \{b\}$, and $\{a,b\}$.

456:
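The example can be replayed directly. The following sketch builds $\db_{cd}$ and $\db_{ab}$ from the four transactions and compares counts in the projected databases with counts in $\db$. The helper {\em count} is a hypothetical name, and the 50\% threshold is applied as an absolute count of two transactions of the original database, which is an assumption about how the threshold carries over to projected databases.

```python
def count(db, itemset):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if set(itemset) <= t)


db = [{"a", "b", "d"}, {"b", "c", "d"}, {"a", "c"}, {"a", "b"}]

# D_cd: whole transactions that contain c or d (or both).
db_cd = [t for t in db if t & {"c", "d"}]
# D_ab: transactions that contain a or b, restricted to their {a, b}-part.
db_ab = [t & {"a", "b"} for t in db if t & {"a", "b"}]

min_count = 2  # 50% of the four transactions in db

# Itemsets touching c or d have the same count in D_cd as in D.
for s in ("c", "d", "bd", "cd"):
    print(s, count(db_cd, s), count(db, s), count(db_cd, s) >= min_count)

# Itemsets over {a, b} have the same count in D_ab as in D.
for s in ("a", "b", "ab"):
    print(s, count(db_ab, s), count(db, s), count(db_ab, s) >= min_count)
```

Because the counts agree, nothing has to be verified against $\db$ afterwards, which is what makes the combination step a plain union.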

457: To analyze the recurrence and required disk I/O's of the general

458: divide-and-conquer algorithm

459: when the decomposition strategy is projection,

460: let us suppose that:

461:

462:

463: \begin{small}

464: \begin{list}{-}{}

465:

466: \item

467: The original database size is $D$ bytes.

468:

469: \item

470: The data structure is an FP-tree.

471:

472: \item

473: The FP-tree constructed from original database \db~is $T$,

474: and its size is $|T|$ bytes.

475:

476: \item

477: If a conditional FP-tree $T'$ is constructed from

478: an FP-tree $T$, then $|T'|\leq c\cdot |T|$,

479: for some constant $c<1$.

480:

481: \item

482: The main memory mining method is the {\em FP-growth}

483: method \cite{HPY00, HPYM04}.

484: Two database scans are needed for constructing an FP-tree

485: from a database.

486:

487: \item

488: The block size is $B$ bytes.

489:

490: \item

491: The main memory available for the FP-tree is $M$ bytes

492:

493: \end{list}

494: \end{small}

495:

496:

497: In the first line of the algorithm in \xf{bdaqalgo},

498: if $T$ cannot fit in memory,

499: then projected databases will be generated.

500: We assumed that

501: the size of the FP-tree for a projected database

502: is  $c\cdot|T|$.

503: If $c\cdot |T| \leq M$, function

504: {\em mainmine} can be called for the projected database,

505: otherwise, the decomposition goes on.

506: At pass $m$, the size of the FP-tree constructed from

507: a projected database is $c^m\cdot |T|$.

508: Thus, the number of passes needed by the

509: divide-and-conquer projection algorithm is

510: $1+\lceil\log_c(M/|T|)\rceil$.

511: Based on our experience and the analysis in \cite{HPY00, HPYM04},

512: we can say that for all practical purposes

513: the number of passes will be at most two.

514: For example, let $D = 100$ Giga, $|T| = 10$ Giga,

515: $M = 1$ Giga, and $c = 10\%$.

516: Then the number of passes is

517: $1+\lceil\log_{0.1}(2^{30}/(10\times 2^{30}))\rceil = 2$.

518: In five passes we can handle databases up to 100 Terabytes.

519: Namely, counting one Tera as $10^{3}$ Giga, we get

520: $1+\lceil\log_{0.1}(1/(10\times 10^{3}))\rceil = 5$.

521:
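The pass count can also be obtained by simply shrinking the tree size by the factor $c$ until it fits. The sketch below does this with exact rational arithmetic; the function name {\em passes\_needed} is hypothetical, the shrink factor is assumed to be exactly $c$ at every level, and the second call assumes decimal units ($10^{9}$ and $10^{12}$ bytes). The ceiling in the formula is sensitive to rounding when $M/|T|$ is close to a power of $c$, which is the reason for avoiding floating-point logarithms here.

```python
from fractions import Fraction


def passes_needed(tree_size, memory, c):
    """Return 1 + m, where m is the smallest integer with c**m * tree_size <= memory.

    For tree_size > memory this equals 1 + ceil(log_c(memory / tree_size)).
    Pass c as a Fraction (e.g. Fraction(1, 10)) to keep the arithmetic exact.
    """
    c = Fraction(c)
    if not 0 < c < 1:
        raise ValueError("the shrink factor c must satisfy 0 < c < 1")
    size = Fraction(tree_size)
    m = 0
    while size > memory:
        size *= c  # one more level of projection
        m += 1
    return 1 + m


GIGA = 2**30
# |T| = 10 Giga, M = 1 Giga, c = 10%.
print(passes_needed(10 * GIGA, GIGA, Fraction(1, 10)))
# |T| = 10 Tera, M = 1 Giga, c = 10%, in decimal units (an assumption).
print(passes_needed(10 * 10**12, 10**9, Fraction(1, 10)))
```

With these inputs the two calls return 2 and 5. If the tree already fits in memory the loop body never runs and a single pass is reported.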

522:

523:

524: Assume that there are two passes,

525: and that the sum of the sizes of all projected

526: databases is $D'$.

527: There are two database scans for \db,

528: one for finding all frequent single items,

529: one for decomposition.

530: Two scans need $2\times D/B$ disk I/O's.

531: The projected databases have to be written to the disks first,

532: then later each scanned twice for building the FP-tree.

533: This step needs  $3\times D'/B$ disk I/O's.

534: Thus, the total number of

535: disk I/O's for the general divide-and-conquer

536: projection algorithm

537: is at least

538: \negvs

539: \begin{eqnarray}

540: 2\cdot D/B + 3\cdot D'/B.

541: \end{eqnarray}

542: Obviously,

543: the smaller $D'$, the better the performance.

544:

545:

546: One of the simplest projection strategies

547: is to project the database on each frequent item,

548: which we call

549: {\em naive projection}.

550: First we need some formal definitions.

551:

552: \begin{defn}

553: {\rm

554: Let $I$ be a set of items.

555: By $I^*$ we will denote {\em strings} over $I$,

556: such that each symbol occurs at most once in the string.

557: If $\alpha$, $\beta$ are strings, and $i_j$ an item,

558: then

559: $\alpha.\beta$ denotes the concatenation of the

560: string $\alpha$ with the string $\beta$.

561:

562: For a string $\alpha$, we shall denote

563: by $\{\alpha\}$, the {\em set} of items occurring in it.

564:

565: Let $\db$ be an $I$-database.

566: Then ${\mit freqstring}(\db)$

567: is the string over

568: $I$, such that each frequent item in $\db$ occurs

569: in it exactly once, and the items are in decreasing

570: order of frequency in $\db$.

571: \hspace*{\fill}${\qed}$

572: }

573: \end{defn}

574:

575:

576:

577:

578: As an example, consider the  $\{a,b,c,d\}$-database

579: $\db = \{\{a,b,c\}, \{a,b,c,d\}, \{a,c\}\}$.

580: If the minimum support is 60\%, then

581: ${\mit freqstring}(\db) = acb$.

582: Note that $\{acb\} = \{a,c,b\}$.

583:
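A direct transcription of ${\mit freqstring}$ is short. In the sketch below the threshold is an absolute count, and ties between equally frequent items are broken alphabetically; the definition leaves the order of ties open, so the tie-breaking rule is an assumption of this sketch.

```python
from collections import Counter


def freqstring(db, min_count):
    """Frequent items of db as a string, most frequent first.

    Ties are broken alphabetically (an assumption; the definition does not
    fix the order of equally frequent items).
    """
    counts = Counter(item for transaction in db for item in set(transaction))
    frequent = [item for item, c in counts.items() if c >= min_count]
    return "".join(sorted(frequent, key=lambda item: (-counts[item], item)))


db = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"a", "c"}]
# 60% of three transactions means at least two transactions.
print(freqstring(db, min_count=2))  # prints "acb"
```

Item $d$ occurs in only one transaction and is dropped, while $a$ and $c$ both occur three times and precede $b$.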

584:

585:

586: \begin{defn}

587: {\rm

588: Let $\db$

589: be an $I$-database, and let

590: ${\mit freqstring}(\db)

591: = i_1i_2\cdots i_k$.

592: For $j\in\{1,\ldots,k\}$ we define

593: $\db_{i_j} =

594: \{\tau\cap\{i_1,\ldots,i_j\} : i_j\in\tau,\tau\in\db\}.$

595:

596: Let $\alpha\in I^*$.

597: We define $\db_{\alpha}$ inductively:

598: $\db_{\epsilon} = \db$, and

599: let ${\mit freqstring}(\db_{\alpha})

600: = i_1i_2\cdots i_k$. Then,

601: for $j\in\{1,\ldots,k\}$,

602: $\db_{\alpha.i_j} =

603: \{\tau\cap\{i_1,\ldots,i_j\} : i_j\in\tau,\tau\in\db_{\alpha}\}.$

604: \hspace*{\fill}${\qed}$

605: }

606: \end{defn}

607:

608:

609: Obviously,

610: $\db_{\alpha.i_j}$ is an $\{i_1,\ldots,i_j\}$-database.

611: The decomposition of $\db_{\alpha}$ into

612: $\db_{\alpha.i_1}$, \ldots, $\db_{\alpha.i_k}$

613: is called the {\em naive projection}.

614:

615:

616: \begin{defn}

617: {\rm

618: Let $\alpha\in I^*$, $i_j\in I$, and let

619: $\db_{\alpha.i_j}$ be an $I$-database.

620: Then ${\mit freqsets}(\xi,\db_{\alpha.i_j})$ denotes the subsets

621: of $I$

622: that contain $i_j$ and are frequent in $\db_{\alpha.i_j}$

623: when the  minimum support is $\xi$.

624: Usually, we shall abstract $\xi$ away, and write

625: just ${\mit freqsets}(\db_{\alpha.i_j})$.

626: \hspace*{\fill}${\qed}$

627: }

628: \end{defn}

629:

630:

631: \begin{lemma}

632:

633: Let $\db_{\alpha}$ be an $I$-database, and

634: ${\mit freqstring}(\db_{\alpha}) = i_1i_2\cdots i_k$.

635: Then

636: $${\mit freqsets}(\db_{\alpha}) =

637: \bigcup_{j\in\{1,\ldots,k\}}{\mit freqsets}(\db_{\alpha.i_j})$$

638:

639: \end{lemma}

640:

641: \noindent

642: {\bf Proof}.

643: ($\subseteq$-{\em direction}).

644: Let $S\in {\mit freqsets}(\db_{\alpha})$,

645: and suppose $i_n$ is the item in $S$ that is least frequent in

646: $\db_{\alpha}$.

647: Since $\db_{\alpha.i_n}$ is an $\{i_1,\ldots,i_n\}$-database,

648: and transactions in $\db_{\alpha}$ that contain item $i_n$

649: are all in $\db_{\alpha.i_n}$,

650: if $S$ is frequent in $\db_{\alpha}$,

651: then $S$ must be frequent in $\db_{\alpha.i_n}$.

652:

653: \noindent

654: ($\supseteq$-{\em direction}).

655: For any frequent itemset

656: $S \in {\mit freqsets}(\db_{\alpha.i_j})$,

657: according to the definition,

658: the

659: support of any itemset in $\db_{\alpha.i_j}$ is not greater than

660: the support of it in $\db_{\alpha}$.

661: Therefore, $S$ must be frequent in $\db_{\alpha}$.

662: \hspace*{\fill}${\qed}$

663:

664: \medskip

665:

666:

667:

668: \xf{hansalgo} gives a divide-and-conquer algorithm

669: that uses naive projection.

670: A transaction $\tau$ in $\db_{\alpha}$ will be partly inserted into

671: $\db_{\alpha.i_j}$ if and only if $\tau$ contains $i_j$.

672: The parallel projection algorithm introduced in

673: \cite{HPYM04}

674: is an algorithm of this kind.

675:

676:

677: \begin{figure}[h]

678: {\bf Procedure} {\em naivediskmine}($\db_{\alpha},M$)

679:

680: \smallskip

681:

682: {\bf if} $|\db_{\alpha}|\leq M$ {\bf then}

683: {\bf return} {\em mainmine($\;\db_{\alpha}$)}

684:

685: {\bf else} let ${\mit freqstring}(\db_{\alpha}) = i_1i_2\cdots i_n$

686:

687: {\hskip 18pt} {\bf return}  {\em naivediskmine}$(\db_{\alpha.i_1},M)\;\cup$

688:

689: {\hskip 146pt}  $\ldots\;\cup$

690:

691: {\hskip 56pt}{\em naivediskmine}$(\db_{\alpha.i_n},M)$.

692:

693: \caption{{\small

694: A simple divide-and-conquer algorithm for

695: mining frequent itemsets from disk

696: }}

697: \label{hansalgo}

698: \end{figure}
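A one-level version of \xf{hansalgo} is enough to check the lemma on a small example. The sketch below makes several simplifying assumptions: {\em mainmine} is a brute-force stand-in rather than {\em FP-growth}, the projected databases are kept in Python lists instead of being written to disk, the threshold is an absolute count with respect to the original database, and only a single level of decomposition is performed (the two-pass case analyzed in the text). All function names are hypothetical.

```python
from itertools import combinations
from collections import Counter


def freqstring(db, min_count):
    """Frequent items, most frequent first (alphabetical tie-break assumed)."""
    counts = Counter(item for t in db for item in t)
    items = [i for i, c in counts.items() if c >= min_count]
    return sorted(items, key=lambda i: (-counts[i], i))


def mainmine(db, min_count, required):
    """Brute-force stand-in for an in-memory miner (not FP-growth)."""
    counts = Counter()
    for t in db:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[frozenset(subset)] += 1
    return {s for s, c in counts.items() if c >= min_count and required <= s}


def naive_projection(db, min_count):
    """Map each frequent item i_j to its projected database D_{i_j}."""
    order = freqstring(db, min_count)
    projected = {}
    for j, item in enumerate(order):
        keep = set(order[:j + 1])  # {i_1, ..., i_j}
        projected[item] = [t & keep for t in db if item in t]
    return projected


def naivediskmine_one_level(db, min_count):
    """Union of freqsets(D_{i_j}) over all frequent items i_j."""
    result = set()
    for item, small_db in naive_projection(db, min_count).items():
        result |= mainmine(small_db, min_count, required={item})
    return result


db = [{"a", "b", "d"}, {"b", "c", "d"}, {"a", "c"}, {"a", "b"}]
via_projection = naivediskmine_one_level(db, min_count=2)
direct = mainmine(db, min_count=2, required=set())
print(via_projection == direct, len(via_projection))
```

The projected databases here hold 18 items in total against 10 in the original database, a first indication of the blow-up quantified below.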

699:

700:

701:

702: Let's analyze the disk I/O's of the algorithm

703: in \xf{hansalgo}.

704: As before, we assume that there are two passes,

705: that the data structure is an FP-tree,

706: and that the main memory mining method is

707: {\em FP-growth}.

708: If in $\db_{\epsilon}$, each transaction contains on the average $n$

709: frequent items,

710: each transaction will be written to $n$ projected databases.

711: Thus the total length of the associated transactions in

712: the projected databases is

713: $n+(n-1)+\cdots+1 = n(n+1)/2$,

714: the total size of all projected databases is

715: $(n+1)/2\cdot D\approx n/2\cdot D$.

716:

717: There are two database scans for $\db_{\epsilon}$,

718: one for finding all frequent single items,

719: and one for decomposition.

720: Two scans need $2\cdot D/B$ disk I/O's.

721: The projected databases have to be written to the disks first,

722: then later scanned twice each for building an FP-tree.

723: This step needs at least $3\cdot n/2\cdot D/B$ disk I/O's.

724: Thus, the total number of disk I/O's for the divide-and-conquer

725: algorithm with naive projection

726: is

727: \negvs

728: \begin{eqnarray}

729: 2 \cdot D/B

730: +

731: n \cdot 3/2 \cdot D/B

732: \end{eqnarray}

733:

734: The recurrence structure of algorithm

735: {\em naivediskmine}

736: is shown in \xf{naivetree}.

737: The reader should ignore

738: nodes in

739: the shaded area

740: at this point; they

741: represent processing

742: in main memory.

743:

744: \begin{figure}[h]

745: \centerline{\psfig{figure=figures/append1,height=1.5in}}

746: \caption{\small Recurrence structure of Naive Projection}

747: \label{naivetree}

748: \end{figure}

749:

750:

751:

752:

753: In a typical application, $n$, the average number

754: of frequent items per transaction, could be in the hundreds or thousands.

755: It therefore makes sense to devise a smarter

756: projection strategy.

757: Before we go further, we introduce

758: some definitions and a lemma.

759:

760:

761: \begin{defn}\label{four}

762: {\rm

763: Let $\db_{\alpha}$ be an $I$-database, and let

764: ${\mit freqstring}(\db_{\alpha})

765: = \beta_1.\beta_2. \cdots .\beta_k$,

766: where each $\beta_j$ is a string in $I^*$.

767: We call $\beta_1.\beta_2. \cdots .\beta_k$

768: a {\em grouping} of

769: ${\mit freqstring}(\db_{\alpha})$.

770: For

771: $j\in\{1,\ldots,k\}$,

772: we now define

773: $\db_{\alpha.\beta_j} =

774: \{\tau\cap\{\beta_1,\ldots,\beta_j\} : \tau\in\db_{\alpha},

775: \tau\cap\{\beta_j\}\neq\emptyset

776: \}.$

777:

778: In $\db_{\alpha.\beta_j}$,

779: items in $\{\beta_j\}$ are called {\em master items},

780: items in $\{\beta_1,\ldots,\beta_{j-1}\}$ are called {\em slave items}.

781: \hspace*{\fill}${\qed}$

782: }

783: \end{defn}

784:

785:

786: For example,

787: if ${\mit freqstring}(\db_{\alpha}) = abcde$,

788: $\beta_1 = abc$, $\beta_2 = de$ gives

789: the grouping $abc.de$ of $abcde$.

790:

791:

792:

793:

794:

795:

796:

797:

798:

799: \begin{defn}

800: {\rm

801: Let $\{\alpha,\beta\}\subset I^*$, and let

802: $\db_{\alpha.\beta}$ be an $I$-database.

803: Then ${\mit freqsets}(\db_{\alpha.\beta})$ denotes the subsets

804: of $I$

805: that contain at least one item in $\{\beta\}$

806: and are frequent in $\db_{\alpha.\beta}$.

807: \hspace*{\fill}${\qed}$

808: }

809: \end{defn}

810:

811: \begin{lemma}\label{goodway}

812: Let $\alpha\in I^*$,

813: $\db_{\alpha}$ be an $I$-database, and

814: ${\mit freqstring}(\db_{\alpha}) = \beta_1\beta_2\cdots \beta_k$.

815: Then

816: $${\mit freqsets}(\db_{\alpha}) =

817: \bigcup_{j\in\{1,\ldots,k\}}{\mit freqsets}(\db_{\alpha.\beta_j})$$

818:

819: \end{lemma}

820:

821: \noindent

822: {\bf Proof.}

823: Straightforward from Lemma 1 and the definition

824: of $\db_{\alpha.\beta}$.

825: \hspace*{\fill}${\qed}$

826:

827: \medskip

828:

829: Based on Lemma~\ref{goodway},

830: we can obtain a more aggressive divide-and-conquer algorithm for

831: mining from disks.

832: \xf{ouralgo} shows the algorithm {\em aggressivediskmine}.

833: Here,

834: ${\mit freqstring}(\db_{\alpha})$

835: is decomposed into several substrings $\beta_j$,

836: each of which could have more than one item.

837: Each substring corresponds to a projected database.

838: A~transaction $\tau$ in $\db_{\alpha}$ will be partly inserted into

839: $\db_{\alpha.\beta_j}$ if and only if

840: $\tau$ contains at least one item $a$

841: such that $a\in\{\beta_j\}$.

842: Since there will be fewer projected databases,

843: there will be fewer disk I/O's.

844: Compared with the algorithm in \xf{hansalgo},

845: we can expect that

846: a large amount of disk I/O will be saved by the algorithm

847: in \xf{ouralgo}.

848:

849: \begin{figure}[h]

850: {\bf Procedure} {\em aggressivediskmine}($\db_{\alpha},M$)

851:

852: \smallskip

853:

854: {\bf if} $|\db_{\alpha}|\leq M$ {\bf then}

855:    {\bf return} {\em mainmine($\;\db_{\alpha}$)}

856:

857: {\bf else} let ${\mit freqstring}(\db_{\alpha}) =

858: \beta_1\beta_2\cdots \beta_k$

859:

860: {\hskip 18pt}

861: {\bf return}  {\em aggressivediskmine}$(\db_{\alpha.\beta_1},M)\;\cup$

862:

863: {\hskip 165pt} $\;\ldots\;\cup$

864:

865: {\hskip 57pt}{\em aggressivediskmine}$(\db_{\alpha.\beta_k},M)$.

866:

867:

868: \caption{{\small

869: A more aggressive divide-and-conquer algorithm for

870: mining frequent itemsets from disk

871: }}

872: \label{ouralgo}

873: \end{figure}
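Since naive projection is the special case of a grouping in which every group is a single item, one routine can compare both strategies. The sketch below is a one-level illustration under stated assumptions: the toy database (whose ${\mit freqstring}$ is $abcde$ for an absolute threshold of two), the brute-force miner {\em mine}, and the function names are all hypothetical, and the number of items written to projected databases is used as a rough proxy for disk I/O.

```python
from itertools import combinations
from collections import Counter


def project(db, groups):
    """Return [(masters, slaves, projected_db)] for a grouping beta_1, ..., beta_k."""
    out, seen = [], set()
    for beta in groups:
        masters = set(beta)
        keep = seen | masters  # slave items plus master items
        out.append((masters, set(seen), [t & keep for t in db if t & masters]))
        seen = keep
    return out


def mine(db, min_count, masters):
    """Brute-force stand-in: frequent itemsets containing at least one master item."""
    counts = Counter()
    for t in db:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[frozenset(subset)] += 1
    return {s for s, c in counts.items() if c >= min_count and s & masters}


def diskmine_one_level(db, min_count, groups):
    """Mine every projected database and report how many items were written."""
    result, written = set(), 0
    for masters, _slaves, small_db in project(db, groups):
        written += sum(len(t) for t in small_db)
        result |= mine(small_db, min_count, masters)
    return result, written


db = [set("abcde"), set("abcd"), set("abc"), set("abe"), set("acd"), set("ab")]
naive, naive_items = diskmine_one_level(db, 2, ["a", "b", "c", "d", "e"])
aggressive, aggressive_items = diskmine_one_level(db, 2, ["abc", "de"])
print(naive == aggressive, naive_items, aggressive_items)
```

On this toy database both groupings return the same frequent itemsets, while the singleton grouping writes 46 items and the grouping $abc.de$ writes 30, against 20 items in the original database.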

874:

875:

876:

877: Let's analyze the recurrence and disk I/O's of the aggressive

878: divide-and-conquer algorithm.

879: The number of

880: passes needed by the algorithm is still

881: \mbox{$1+\lceil\log_c(M/|T|)\rceil \approx 2$},

882: since grouping items doesn't change the size of an FP-tree for

883: a projected database.

884: However, for disk I/O,

885: suppose in $\db_{\epsilon}$,

886: each transaction contains on average $n$

887: frequent items,

888: and that we can group them into $k$

889: groups of equal size.

890: Then the $n$ items will be written to the projected databases

891: with total length $n/k+2\cdot n/k+ \ldots +k\cdot n/k = (k+1)/2\cdot n$.

892: The total size of all projected databases is

893: $(k+1)/2\cdot D \approx k/2\cdot D$.

894: The total number of disk I/O's for the aggressive divide-and-conquer

895: algorithm

896: is then

897: \negvs

898: \begin{eqnarray}\label {formula}

899: 2\cdot D/B

900: +

901: k \cdot 3/2 \cdot D/B

902: \end{eqnarray}

903:

904: The recurrence structure of algorithm

905: {\em aggressivediskmine} is shown

906: in \xf{recagg}. Compared to \xf{naivetree},

907: we can see that the part of the tree

908: that corresponds to decomposition

909: (the nonshaded part) is much smaller

910: in \xf{recagg}. Although the example is

911: very small, it exhibits the general structure

912: of the two trees.

913:

914: \begin{figure}[h]

915: \centerline{\psfig{figure=figures/append2,height=1.5in}}

916: \caption{\small Recurrence structure of Aggressive Projection}

917: \label{recagg}

918: \end{figure}

919:

920:

921:

922:

923: If $k\ll n$,

924: we can expect that the aggressive

925: divide-and-conquer algorithm will

926: significantly outperform the naive one.

927:

928: \section {Algorithm Diskmine}

929: In this section

930: we give

931: the details of

932: our divide-and-conquer algorithm for mining frequent itemsets

933: from secondary memory.

934: We call the algorithm {\em Diskmine}.

935: In the algorithm,

936: the FP-tree is used as the data structure, and

937: an extension of the {\em FP-growth} method,

938: {\em FP-growth*} \cite{fimi03},

939: is used as the method for mining frequent itemsets from an FP-tree.

940: Before introducing the algorithm,

941: let's first recall the FP-tree and the {\em FP-growth* } method.

942:

943: \subsection{The FP-tree and {\em FP-growth*} method}

944:

945: The {\em FP-tree (Frequent Pattern tree)}

946: is a data structure used in

947: the {\em FP-growth} method by Han {\em et al.\ } \cite{HPY00}.

948: It is a compact representation

949: of all relevant

950: frequency information

951: in a database.

952: Each node of the FP-tree stores an item name, an item count,

953: and a link.

954: Every branch of the FP-tree represents a frequent itemset,

955: and the nodes along the branches are

956: stored in decreasing order of the frequency

957: of the corresponding items, with leaves representing

958: the least frequent items.

959: Compression is achieved by

960: building the tree in such a way that

961: overlapping itemsets

962: share prefixes of the

963: corresponding branches.

964:

965:

966: The FP-tree has

967: a {\em header table} associated with it.

968: Single items and their counts are stored in

969: the header table in

970: decreasing order of their frequency.

971: The entry for an item also contains the head

972: of a list that links all the

973: nodes of the item

974: in the FP-tree.

975:

976:

977: The FP-growth method needs two database scans

978: when mining all frequent itemsets.

979: The first scan counts the number of occurrences

980: of each item.

981: The second scan constructs the initial FP-tree,

982: which contains all frequency information of the original dataset.

983: Mining the database then becomes mining the FP-tree.

984:

985:

986: The {\em FP-growth} method relies on the following

987: principle: if $X$ and $Y$ are two itemsets,

988: the count of itemset $X\cup Y$ in the database

989: is exactly that of $Y$ in the restriction of the database to

990: those transactions containing $X$.

991: This restriction of the database is

992: called

993: the {\em conditional pattern base} of $X$,

994: and the FP-tree constructed from the conditional pattern base

995: is called $X$'s {\em conditional FP-tree},

996: which we denote by $T_X$.

997: We can view the FP-tree constructed from the initial database

998: as $T_{\emptyset}$,

999: the conditional FP-tree for $\emptyset$.

1000: Note that for

1001: any itemset $Y$ that is frequent

1002: in the conditional pattern base of $X$,

1003: the set

1004: $X\cup Y$ is a frequent itemset for the original database.\footnote{In

1005: keeping with the notation introduced so far, we shall

1006: in the sequel write $T_{\alpha}$ when we mean the

1007: FP-tree $T_{\{\alpha\}}$. Similarly we shall write

1008: $T_{\alpha.i}$ instead of $T_{\{\alpha\}\cup\{i\}}$.}
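To illustrate the principle on a small hypothetical database
(not one used in this paper), take the four transactions
$\{a,b,c\}$, $\{a,b\}$, $\{a,c\}$ and $\{b,c\}$,
with $X=\{a\}$ and $Y=\{b\}$.
The itemset $X\cup Y=\{a,b\}$ occurs in the first two transactions,
so its count is $2$.
Restricting the database to the transactions containing $X$
leaves $\{a,b,c\}$, $\{a,b\}$ and $\{a,c\}$,
and $Y$ occurs in exactly two of these.
Writing ${\mit count}$ for the number of transactions containing an itemset
(a notation used only in this example), we get
\[
{\mit count}(\{a,b\}\mbox{ in }\db) \;=\; 2 \;=\;
{\mit count}(\{b\}\mbox{ in the restriction of }\db\mbox{ to }\{a\}).
\]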

1009:

1010: The recursive structure of {\em FP-growth} can be seen from

1011: the shaded area in \xf{naivetree}.

1012: In the figure, we will enter the main memory phase

1013: for instance for the conditional database $\db_a$.

1014: Then FP-growth first constructs the

1015: FP-tree $T_a$ from $\db_a$.

1016: The tree rooted at $T_a$

1017: shows the recursive structure of FP-growth,

1018: assuming for simplicity that the

1019: relative frequency remains the same in

1020: all conditional pattern bases.

1021:

1022:

1023:

1024:

1025:

1026: In \cite {fimi03}, we extend the FP-growth method into the

1027: {\em FP-growth*} method by using an {\em array technique}

1028: and other optimizations.

1029: The experimental results in the paper

1030: and those done by the FIMI-organizers show

1031: that the FP-growth* method outperforms the {\em FP-growth} method

1032: especially when the database is big or sparse

1033: \cite{fimi03,ZB03}.

1034:

1035:

1036: \subsubsection* {The array technique} \label{arraytech}

1037:

1038: In the original FP-growth method \cite{HPY00},

1039: to construct an FP-tree from a database $\db$,

1040: two database scans are required.

1041: The first scan gets all frequent items,

1042: the second constructs the FP-tree.

1043: And later,

1044: for each item $i$ in

1045: the header of a conditional FP-tree $T_{\alpha}$,

1046: two traversals of $T_{\alpha}$ are needed for constructing

1047: the new conditional FP-tree $T_{\alpha.i}$.

1048: The first traversal finds all frequent items in the

1049: conditional pattern base of $\alpha.i$,

1050: and initializes the FP-tree  $T_{\alpha.i}$

1051: by constructing its header table.

1052: The second traversal constructs the new tree

1053: $T_{\alpha.i}$.

1054:

1055:

1056: In the boosted {\em FP-growth*} method \cite{fimi03},

1057: a simple data structure, an array,

1058: is introduced to omit the first scan of $T_{\alpha}$.

1059: This is achieved

1060: by constructing an array

1061: $A_{\alpha}$ while building $T_{\alpha}$.

1062: More precisely,

1063: in the second scan of the original database we

1064: construct $T_{\epsilon}$, and an array $A_{\epsilon}$.

1065: The array will store the counts of

1066: all 2-itemsets: each cell $[j,k]$

1067: in the array is a counter for the 2-itemset $\{i_j,i_k\}$.

1068: All cells in the array are initialized to 0.

1069: When an itemset is inserted into $T_{\epsilon}$,

1070: the associated cells in $A_{\epsilon}$ are updated.

1071: After the second scan,

1072: the array $A_{\epsilon}$ contains the counts of

1073: all pairs of items frequent in $\db_{\epsilon}$.
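As a small hypothetical example
(the data and the threshold are assumptions made only for illustration),
consider the transactions
$\{a,b,c\}$, $\{a,b,d\}$, $\{a,c\}$, $\{b,c\}$ and $\{a,b\}$,
and call an item or a pair frequent when its count is at least $2$.
The item counts are $4$ for $a$, $4$ for $b$, $3$ for $c$ and $1$ for $d$,
so $d$ is infrequent and the frequent items,
in decreasing order of frequency (ties broken alphabetically),
are $i_1=a$, $i_2=b$, $i_3=c$.
After the second scan the array holds
\[
A_{\epsilon}[2,1]=3,\qquad A_{\epsilon}[3,1]=2,\qquad A_{\epsilon}[3,2]=2 ,
\]
since $\{a,b\}$ occurs in three transactions,
and $\{a,c\}$ and $\{b,c\}$ occur in two each.
The cells for $c$ show directly that both $a$ and $b$
are frequent together with $c$.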

1074:

1075:

1076:

1077: Next, the {\em FP-growth*} method is recursively called

1078: to mine frequent itemsets for each item in header table

1079: of $T_{\epsilon}$.

1080: However, now for each item $i$,

1081: instead of traversing $T_{\epsilon}$ along

1082: the linked list starting at $i$ to get

1083: all frequent items in $i$'s conditional pattern base,

1084: $A_{\epsilon}[i,*]$ gives all frequent items for $i$.

1085: Therefore, for each item $i$ in $T_{\epsilon}$

1086: the array $A_{\epsilon}$ makes

1087: the first traversal of $T_{\epsilon}$ unnecessary,

1088: and $T_{\epsilon.i}$ can be

1089: initialized directly from $A_{\epsilon}$.

1090:

1091:

1092: For the same reason, from a conditional FP-tree $T_{\alpha}$,

1093: when we construct a new conditional

1094: FP-tree for $\alpha.i$, for an item $i$,

1095: a new array $A_{\alpha.i}$ is calculated.

1096: During the construction of the

1097: new FP-tree $T_{\alpha.i}$,

1098: the array $A_{\alpha.i}$

1099: is filled.

1100: The construction of arrays and FP-trees continues

1101: until the {\em FP-growth*} method terminates.

1102:

1103: Note that, for a database,

1104: if we have the array that stores the counts of all pairs of

1105: frequent items,

1106: then only one database scan is needed

1107: to construct an FP-tree from the database.

1108:

1109: \subsection{Divide-and-conquer by aggressive projection}

1110:

1111:

1112:

1113: The algorithm {\em Diskmine} is shown in \xf{appa}. In the algorithm,

1114: $\db_{\alpha}$ is the original database or a projected database,

1115: and $M$ is the maximal size of main memory that can be used by {\em Diskmine}.

1116:

1117: \begin{figure}[h]

1118: {\bf Procedure} {\em Diskmine}$(\db_{\alpha}, M)$

1119:

1120: \smallskip

1121:

1122: scan $\db_{\alpha}$ and compute {\it freqstring}$(\db_{\alpha})$

1123:

1124: call ${\mit trialmainmine(\db_{\alpha}, M)}$

1125:

1126: {\bf if} ${\mit trialmainmine(\db_{\alpha}, M)}$ aborted {\bf then}

1127:

1128: {\hskip 12pt}compute a grouping $\beta_1\beta_2\cdots \beta_k$

1129:    of ${\mit freqstring}(\db_{\alpha})$.

1130:

1131: {\hskip 12pt}Decompose $\db_{\alpha}$ into

1132: $\db_{\alpha.\beta_1},\ldots, \db_{\alpha.\beta_k}$

1133:

1134: {\hskip 12pt}{\bf for} j = 1 {\bf to} k {\bf do begin}

1135:

1136: {\hskip 24pt}{\bf if} $\{\beta_j\}$ is a singleton {\bf then}

1137:

1138: {\hskip 36pt}${\mit Diskmine}(\db_{\alpha.\beta_j},M)$

1139:

1140: {\hskip 24pt}{\bf else}

1141:

1142: {\hskip 36pt}${\mit mainmine}(\db_{\alpha.\beta_j})$

1143:

1144: {\hskip 12pt}{\bf end}

1145:

1146: {\bf else return} {\em freqsets}$(\db_{\alpha})$

1147: \caption{{\small Algorithm Diskmine}}

1148: \label{appa}

1149: \end{figure}

1150:

1151:

1152: {\em Diskmine} uses the FP-tree as

1153: data structure and {\em FP-growth*} \cite{fimi03}

1154: as main memory

1155: mining

1156: algorithm.

1157: Since the FP-tree encodes all frequency information

1158: of the database,

1159: we can shift into main memory mining

1160: as soon as the FP-tree fits

1161: into main memory.

1162:

1163: Since an FP-tree usually is a significant

1164: compression of the database, our {\em Diskmine}

1165: algorithm begins optimistically, by calling {\em trialmainmine},

1166: which starts scanning the database and constructing the FP-tree.

1167: If the tree can be successfully completed and stored in main memory,

1168: we have reached the bottom level of the recursion,

1169: and can obtain

1170: the frequent itemsets of the database

1171: by running

1172: {\em FP-growth*} on the FP-tree in main memory.

1173:

1174: \begin{figure}[h]

1175: {\bf Procedure} {\em trialmainmine}$(\db_{\alpha}, M)$

1176:

1177: start scanning $\db_{\alpha}$ and building the FP-tree

1178:

1179:    {\hskip 12 pt}$T_{\alpha}$ in main memory.

1180:

1181: {\bf if} $|T_{\alpha}|$  exceeds  $M$ {\bf then}

1182:

1183: {\hskip 12pt}{\bf return} the incomplete $T_{\alpha}$

1184:

1185: {\bf else}

1186:

1187: {\hskip 12pt}call {\em FP-growth*}$\,(T_{\alpha})$ and {\bf return}

1188:     {\em freqsets}$(\db_{\alpha})$.

1189:

1190: \caption{{\small Trial main memory mining algorithm}}

1191: \label{trial}

1192: \end{figure}

1193:

1194:

1195:

1196: If, at any time during {\em trialmainmine}

1197: we run out of main memory, we abort and

1198: return the partially constructed FP-tree,

1199: and a pointer to where we stopped scanning the database.

1200: We then resume processing {\em Diskmine}$(\db_{\alpha},M)$

1201: by computing a grouping

1202: $\beta_1,\ldots, \beta_k$ of

1203: {\em freqstring}$(\db_{\alpha})$,

1204: and then decomposing

1205: $\db_{\alpha}$ into

1206: $\db_{\alpha.\beta_1},\ldots,\db_{\alpha.\beta_k}$.

1207: We recursively process

1208: each decomposed database

1209: $\db_{\alpha.\beta_j}$.

1210: During the first level of the recursion,

1211: some groups $\beta_j$ will consist of a single

1212: item only.

1213: If $\{\beta_j\}$ is a singleton,

1214: we call {\em Diskmine}, otherwise

1215: we call {\em mainmine} directly,

1216: since we put several items in a group

1217: only when we estimate that the corresponding

1218: FP-tree will fit into main memory.

1219:

1220: In computing the grouping

1221: $\beta_1,\ldots, \beta_k$

1222: we assume that transactions in a very large database

1223: are evenly distributed, i.e.,

1224: if an FP-tree is constructed from part of a database,

1225: then this FP-tree represents the whole FP-tree for the whole database.

1226: In other words,

1227: if the size of the FP-tree is $n$ for $p\%$ of the database,

1228: then the size of the FP-tree for the whole database is estimated as $n/p \cdot 100$.

1229: Most of the time, this gives an overestimation,

1230: since an FP-tree increases fast only at the beginning stage,

1231: when items are encountered for the first time and inserted

1232: into the tree. In the later stages, the changes to the FP-tree

1233: will be mostly counter updates.
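For instance, with hypothetical numbers,
if the partial FP-tree has $n=50{,}000$ nodes after
$p=20$ percent of the transactions have been scanned,
the estimated size of the complete FP-tree is
\[
n/p\cdot 100 = 50{,}000/20\cdot 100 = 250{,}000
\]
nodes.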

1234:

1235:

1236: \begin{figure}[h]

1237: {\bf Procedure} {\em mainmine}$(\db_{\alpha.\beta})$

1238:

1239: build a modified FP-tree $T_{\alpha.\beta}$ for $\db_{\alpha.\beta}$

1240:

1241: {\bf for each} $i\in\{\beta\}$ {\bf do begin}

1242:

1243: {\hskip 12pt} construct the FP-tree $T_{\alpha.i}$

1244:                 for $\db_{\alpha.i}$ from $T_{\alpha.\beta}$

1245:

1246: {\hskip 12pt} call {\em FP-growth*}$\,(T_{\alpha.i})$

1247:     and {\bf return}

1248:     {\em freqsets}$(\db_{\alpha.i})$.

1249:

1250: {\bf end}

1251:

1252: \caption{{\small Main memory mining algorithm}}

1253: \label{mainmine}

1254: \end{figure}

1255:

1256:

1257:

1258: Since we know that there is only one master item in the database

1259: (for $\db_\epsilon$, no master item at all),

1260: an FP-tree is constructed without the master item.

1261: In \xf{mainmine},

1262: since $\db_{\alpha.\beta}$ is for multiple master items,

1263: the

1264: FP-tree constructed from $\db_{\alpha.\beta}$ has to contain

1265: those master items.

1266: However, the item order is a problem for the FP-tree,

1267: because we only want to mine all frequent itemsets

1268: that contain master items.

1269: To solve this problem,

1270: we simply use the item order in the partial FP-tree

1271: returned by the aborted

1272: {\em trialmainmine}$(\db_{\alpha})$.

1273: This is what we mean by a ``modified FP-tree''

1274: on the first line in the algorithm in \xf{mainmine}.

1275:

1276: The entire recurrence structure of

1277: {\em Diskmine} can be seen in \xf{recagg}.

1278: Compared to the naive projection in \xf{naivetree}

1279: we see that since the aggressive projection

1280: uses main memory more effectively,

1281: the decomposition phase is shorter,

1282: resulting in less I/O.

1283:

1284:

1285:

1286: \begin{theorem}

1287: Diskmine$(\db)$  returns freqsets$(\db)$.

1288: \end{theorem}

1289:

1290: \noindent

1291: {\bf Proof}.

1292: The correctness of {\em Diskmine}

1293: can be derived from the correctness of the

1294: {\em FP-growth*} method in \cite{fimi03}

1295: and Lemma \ref{goodway} in \xs{diskmine}.

1296: In {\em Diskmine},

1297: each item acts as master item in exactly one projected database.

1298: If a projected database is only for one master item $i_j$,

1299: the result of  {\em FP-growth*} method or a recursive call of {\em Diskmine}

1300: will be $freqsets(\db_{i_j})$.

1301: If a projected database is for a set $\{\beta\}$ of master items,

1302: it contains all frequency information associated with the master items.

1303: Since in the {\em FP-growth*} method,

1304: the order of the items in an FP-tree doesn't influence

1305: the correctness of the  {\em FP-growth*} method,

1306: {\em mainmine} indeed returns only frequent itemsets that

1307: contain master item(s),

1308: i.e.\ {\em mainmine} gives the

1309: exact value of $freqsets(\db_{\alpha.\beta})$.

1310: According to Lemma \ref{goodway},

1311: algorithm {\em Diskmine} then

1312: correctly outputs all

1313: frequent itemsets in the original database.

1314: \hspace*{\fill}${\qed}$

1315:

1316:

1317:

1318: \subsection {Memory Management}\label{memory}

1319:

1320: Given a database $\db_{\alpha}$,

1321: to successfully apply the {\em FP-growth*} method,

1322: the basic main memory requirement is that the size of the FP-tree

1323: $T_{\alpha}$

1324: constructed from $\db_{\alpha}$,

1325: is less than the available amount $M$ of main memory.

1326: In addition, we need space

1327: for the  descendant conditional

1328: FP-trees that will be constructed during the recursive calls

1329: of {\em FP-growth*}.

1330:

1331: Suppose the main memory requirement

1332: for $T_{\alpha}$ plus its descendant FP-trees is $m$.

1333: If $M < m$, but the difference $m-M$ is not very big,

1334: the {\em FP-growth*} method

1335: could still be run because the operating

1336: system uses virtual memory.

1337: However, excessive page swapping could then occur,

1338: which takes too much time and makes {\em FP-growth*} very slow.

1339: Therefore, given $M$, for a very large database $\db_{\alpha}$,

1340: we have to stop the construction of the FP-tree $T_{\alpha}$

1341: and the execution of {\em FP-growth*} method before

1342: all physical main memory is used up.

1343:

1344:

1345: Another problem is that we will

1346: construct a large number  of FP-trees.

1347: Since there can be

1348: millions of nodes in those FP-trees,

1349: inserting and deleting nodes is time consuming.

1350:

1351: In the implementation of the algorithm,

1352: we use our own main memory management for

1353: allocating and deallocating nodes,

1354: and calculating the main memory we have already used.

1355: We assume that the main memory needed by an FP-tree is

1356: proportional to the number of nodes in the FP-tree.

1357: We also assume that the workspace needed for calling

1358: {\em FP-growth*(T)} method on an FP-tree is roughly 10\%

1359: of the size of the FP-tree $T$.

1360: Here, 10\% is a liberal assumption according to the

1361: experimental results in \cite{HPY00}.

1362: Later in this section, a more accurate value will be given.

1363: If the size of the FP-tree exceeds $0.9\cdot M$,

1364: we conclude that $M$ is not big enough to store the whole

1365: FP-tree $T_{\alpha}$.

1366:

1367:

1368: Since all memory for nodes in an FP-tree is deallocated after a call

1369: of {\em FP-growth*} ends,

1370: a chunk of memory is allocated for each FP-tree when we create the tree,

1371: and the chunk size is changeable.

1372: After generating all frequent

1373: itemsets from the FP-tree, the chunk is discarded,

1374: and all nodes in the tree are deleted.

1375: Thus we successfully avoid freeing nodes in FP-trees one by one,

1376: which would take too much time.

1377:

1378:

1379: \subsection{Applying the Array Technique}\label{array}

1380:

1381:

1382: In {\em Diskmine},

1383: the array technique is also applied to save FP-tree traversals.

1384: Furthermore, when projected databases are generated,

1385: the array technique can save a great number of disk~I/O's.

1386:

1387: Recall that in {\em trialmainmine},

1388: if an FP-tree cannot be accommodated in main memory,

1389: the construction stops.

1390: Suppose now we decided to stop

1391: scanning the database.

1392: Then later, after generating all projected databases,

1393: for a projected database with only one master item,

1394: two database scans are required to construct an FP-tree for the master item.

1395: The first scan gets all frequent items for the master item,

1396: the second scan constructs the FP-tree.

1397: For a projected database with several master items,

1398: though the FP-tree constructed from the database

1399: uses the modified item order

1400: (the order from the header of the FP-tree  in

1401: the previous level of the recursion),

1402: to construct new FP-trees for the master items,

1403: two FP-tree traversals are needed.

1404: To avoid the extra scan,

1405: in {\em Diskmine} we calculate an array for each FP-tree.

1406: When constructing the FP-tree from $\db_{\alpha}$,

1407: if it is found that the tree cannot fit in main memory,

1408: the construction of the FP-tree $T_{\alpha}$ stops,

1409: but the scan of the database $\db_{\alpha}$

1410: continues so that we finish filling the cells of

1411: the array $A_{\alpha}$.

1412: Here, some extra disk I/O's are spent,

1413: but the payback will be that we

1414: save one database scan for each

1415: projected database.

1416: Furthermore, finishing the scanning

1417: of $\db_{\alpha}$

1418: doesn't require any more main memory,

1419: since the array $A_{\alpha}$

1420: is already there.

1421:

1422: From the array, for each projected database,

1423: the count of each pair of master items and

1424: the count of each pair of master item and slave item

1425: can be known.

1426: As an example,

1427: suppose a projected database is only for one

1428: master item $i_j$

1429: and slave items $i_1, \ldots, i_{j-1}$.

1430: To mine all frequent itemsets,

1431: from the line for $i_j$ in the array,

1432: accurate counts for

1433: $[i_j, i_{j-1}],

1434: [i_j, i_{j-2}],

1435: \ldots,

1436: [i_j, i_1]$

1437: can be easily found.

1438: If there were no array

1439: we would need an extra database scan.

1440:

1441:

1442: With the array, we can also make a projected database

1443: drastically smaller.

1444: In the definition of $\db_{\alpha.\beta_j}$,

1445: we see that

1446: $\db_{\alpha.\beta_j}$ is a $\{\beta_1,\ldots,\beta_j\}$-database.

1447: Actually, by checking the array $A_{\alpha}$,

1448: if a slave item is found not frequently co-occurring

1449: with any master item in $\beta_j$,

1450: it's useless to include the slave item in $\db_{\alpha.\beta_j}$,

1451: because no frequent itemsets mined from $\db_{\alpha.\beta_j}$

1452: will contain that slave item.

1453: For the same reason,

1454: if we also find that a master item $a$ is not frequent with any

1455: other master item or slave item,

1456: it will not be written to $\db_{\alpha.\beta_j}$,

1457: either.

1458: However, the frequent itemset $\alpha.a$ is still output.

1459: Furthermore,

1460: if from the array, we see that a  master item $a$ is

1461: only frequent with one item (master or slave) $b$,

1462: frequent itemsets $\alpha.a$ and $\alpha.a.b$

1463: are output directly,

1464: and item $a$ will not appear in $\db_{\alpha.\beta_j}$.

1465: Therefore, by looking through the array,

1466: we find all slave items,

1467: such that they are not frequent with any master item in $\beta_j$,

1468: and all master items, such that their number of frequent items in

1469: $\{\beta_1,\ldots,\beta_j\}$ is 0 or 1.

1470: When generating $\db_{\alpha.\beta_j}$,

1471: all those items are removed from the

1472: transactions we put in $\db_{\alpha.\beta_j}$.
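As a hypothetical illustration,
let $\beta_j$ contain the master items $i_5$ and $i_6$,
with slave items $i_1,\ldots,i_4$,
and suppose the array $A_{\alpha}$ shows that
$i_5$ is frequent with $i_1$ and $i_2$ only,
and that $i_6$ is frequent with $i_2$ only.
Then $i_3$ and $i_4$ are not frequent with any master item and are dropped.
The itemsets $\alpha.i_6$ and $\alpha.i_6.i_2$ are output directly,
so $i_6$ is dropped as well.
The transactions written to $\db_{\alpha.\beta_j}$
therefore contain only items from
\[
\{i_1,\; i_2,\; i_5\}.
\]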

1473:

1474:

1475: \subsection{Statistics}

1476:

1477: \begin{table*}[ht!]

1478: \centering

1479: \begin{tabular}%{0.75\textwidth}

1480: {|r|l|} \hline

1481: $\tdb$&Number of transactions in $\db_{\alpha}$\\

1482: \hline

1483: $A_{\alpha}[j,k]$&Count of frequent item pair $\{i_j, i_k\}$

1484: in $\db_{\alpha}$\\

1485: \hline

1486: $\tT$&Number of transactions used for constructing  $T_{\alpha}$\\

1487: \hline

1488: $\nT$&Number of nodes in $T_{\alpha}$\\

1489: \hline

1490: $\njT$&Number of nodes in  $T_{\alpha}$ if we retain

1491: only nodes for items $i_1, \ldots, i_j$\\

1492: \hline

1493: $\mjT$&Number of nodes in

1494: $T_{\alpha}$,

1495: where a  node $P$ for item $i_k$ is counted if\\

1496: &it satisfies the following conditions: 1) $P$ is in a branch that contains $i_j$\\

1497: &2) $i_k \in \{i_1, \ldots, i_j\}$ 3) $A_{\alpha}[j,k] > \xi$\\

1498: \hline

1499:

1500:

1501: \end{tabular}

1502: \caption{Statistics Information}

1503: \label{stat}

1504: \end{table*}

1505:

1506: Algorithm

1507: {\em Diskmine} collects some statistics on the

1508: partial FP-tree $T_{\alpha}$

1509: and the rest of database $\db_{\alpha}$,

1510: for the purpose of

1511: grouping items together.

1512: \xt{stat} shows the statistics information.

1513: In the table,

1514: $\db_{\alpha}$ is the original database or the current projected database,

1515: and {\em freqstring}($\db_{\alpha}$)=

1516: $i_1\ldots i_j\ldots i_k \ldots i_n$.

1517: The partial FP-tree is $T_\alpha$

1518: and $\xi$ is the

1519: absolute value of the minimum support.

1520:

1521: In the table,

1522: the array discussed in \xs{array}

1523: is also listed as statistics.

1524: Values for the cells of

1525: the array are accumulated during the construction of

1526: the partial $T_{\alpha}$.

1527: If {\em trialmainmine} is aborted, the rest

1528: of the statistics

1529: is collected by scanning the

1530: remaining part of $\db_{\alpha}$.

1531: Values in

1532: $\njT$

1533: can also be obtained

1534: during the construction of $T_{\alpha}$.

1535: Here

1536: $\njT$

1537: records the size of the FP-tree after

1538: $T_{\alpha}$ is trimmed and only contains items $i_1, \ldots, i_j$.

1539: Notice that

1540: $\nT$

1541: is equal to

1542: $\nu[n](T_{\alpha})$.

1543: This is  also the size of a tree that can fit in main memory.

1544: The value for

1545: $\mjT$

1546: can be obtained

1547: by traversing $T_{\alpha}$ once;

1548: it gives the size of the FP-tree $T_{\alpha.i_j}$.

1549:

1550: It might seem that

1551: collecting all these statistics

1552: is a large overhead.

1553: However,

1554: since all work is done in main memory,

1555: it doesn't take much time.

1556: And the time saved for disk I/O's

1557: is far more than the time spent on gathering statistics.

1558:

1559:

1560: \subsection{Grouping items}

1561:

1562: In \xf{appa},

1563: the fourth line computes a grouping $\beta_1\beta_2\cdots \beta_k$

1564: of ${\mit freqstring}(\db_{\alpha})$.

1565: Each string $\beta$

1566: corresponds to a group and each $\beta$ consists of at least one item.

1567: For each $\beta$,

1568: a new projected database $\db_{\alpha.\beta}$

1569: will be computed from $\db_{\alpha}$,

1570: then written to disk and read from disk later.

1571: Therefore,

1572: the more groups,

1573: the more disk I/O's.

1574: In other words,

1575: there should be as many items in each

1576: $\beta$ as possible.

1577: To group items,

1578: two questions have to be answered.

1579: \begin{enumerate}

1580: \item If $\beta$ currently only has one item $i_j$,

1581: after projection, is the main memory big enough for

1582: accommodating $T_{\alpha.i_j}$ constructed from

1583: $\db_{\alpha.i_j}$

1584: and running the {\em FP-growth*} method on $T_{\alpha.i_j}$?

1585: \item If more items are put in $\beta$,

1586: after projection, is the main memory big enough for

1587: accommodating $T_{\alpha.\beta}$ constructed from $\db_{\alpha.\beta}$

1588: and running {\em FP-growth*} on $T_{\alpha.\beta}$ only

1589: for items in $\beta$?

1590: \end{enumerate}

1591:

1592: Answering the first question is pretty easy,

1593: since for each item $i_j$,

1594: the number

1595: $\mjT$

1596: gives the size of an FP-tree if the tree

1597: is constructed from the partial FP-tree $T_{\alpha}$.

1598: Therefore

1599: $\mjT$

1600: can be used to estimate the

1601: size of FP-tree $T_{\alpha.i_j}$.

1602: By the assumption that

1603: the transactions in $\db_{\alpha}$ are evenly

1604: distributed and that

1605: the partial $T_{\alpha}$

1606: represents

1607: the whole FP-tree for $\db_{\alpha}$,

1608: the estimated size of FP-tree $T_{\alpha.i_j}$

1609: is

1610: $\mjT\cdot \tdb/\tT$.
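For example, with the hypothetical statistics
$\mjT = 12{,}000$, $\tdb = 1{,}000{,}000$ and $\tT = 250{,}000$,
the estimate is
\[
12{,}000\cdot 1{,}000{,}000/250{,}000 = 48{,}000
\]
nodes for $T_{\alpha.i_j}$.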

1611:

1612:

1613: Before answering the second question,

1614: we introduce the {\em cut point}

1615: from which the first group can be easily found.

1616:

1617: \medskip

1618:

1619: \noindent

1620: {\bf Finding the cut point.}

1621: Recall the order that {\em FP-growth*} uses in mining frequent itemsets.

1622: Starting from the least frequent item $i_n$,

1623: all frequent itemsets that contain $i_n$ are mined first.

1624: Then the process is repeated for

1625: $i_{n-1}$, and so on.

1626: Notice that when mining frequent itemsets for $i_k$,

1627: all frequency information about $i_{k+1},\ldots,i_n$ is useless.

1628: Thus, though a complete FP-tree $T_\alpha$ constructed from $\db_\alpha$

1629: could not fit in main memory,

1630: we can find many $k$'s such that the

1631: trimmed FP-tree containing only

1632: nodes for items $i_k, \ldots, i_1$

1633: will fit into main memory.

1634: All frequent itemsets for  $i_k, \ldots, i_1$

1635: can be then mined from one trimmed tree.

1636: We call the biggest of such $k$'s the {\em cut point}.

1637: At this point, main memory is big enough

1638: for storing the FP-tree

1639: containing only $i_k, \ldots, i_1$,

1640: and there is also enough main memory for running

1641: {\em FP-growth*} on the tree.

1642: Obviously, if the cut point $k$ can be found,

1643: items  $i_k, \ldots, i_1$ can be grouped together.

1644: Only one projected database is needed for $i_k, \ldots, i_1$.

1645:

1646: There are two ways to estimate the cut point.

1647: One way is to get the cut point from the values of

1648: $\tdb$

1649: and

1650: $\tT$

1651: in \xt{stat}.

1652: \xf{divi} illustrates the intuition behind the cut point.

1653: In the figure,

1654: since the partial FP-tree for

1655: $\tT$

1656: of

1657: $\tdb$

1658: transactions can be

1659: accommodated in main memory,

1660: we can expect that the FP-tree containing  $i_k, \ldots, i_1$,

1661: where

1662: $k=\lfloor n \cdot \tT /\tdb \rfloor$,

1663: also will fit in main memory.
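For example, with the hypothetical values
$n=100$, $\tT=300{,}000$ and $\tdb=1{,}000{,}000$,
this gives
\[
k=\lfloor 100\cdot 300{,}000/1{,}000{,}000\rfloor = 30,
\]
so the items $i_{30},\ldots,i_1$ would form the first group.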

1664:

1665: \begin{figure}[h]

1666: \centerline{\psfig{figure=figures/division,height=1.25in}}

1667: \caption{Cut Point. Here

1668: $l=\tT$, and $m=\tdb$}

1669: \label{divi}

1670: \end{figure}

1671:

1672: The above method

1673: works well

1674: for many databases,

1675: especially for those databases whose corresponding

1676: FP-trees have plenty of sharing of prefixes for items

1677: from $i_1$ to the cut point.

1678: However,

1679: if the FP-tree constructed from a database

1680: doesn't share prefixes that much,

1681: the estimation could fail,

1682: since now the FP-tree

1683: for items from $i_1$ to the cut point

1684: could be too big.

1685: Thus,

1686: we have to consider another method.

1687: In \xt{stat},

1688: $\njT$

1689: records the size of the FP-tree after

1690: the partial FP-tree $T_\alpha$ is trimmed and only

1691: contains items $i_1, \ldots, i_j$.

1692: Based on

1693: $\njT$

1694: the number of nodes

1695: in the complete FP-tree

1696: for item $i_j$

1697: can be estimated as

1698: $\njT \cdot \tdb/\tT$.

1699: Now, finding the cut point becomes finding the biggest $k$ such that

1700: $\nu[k](T_{\alpha}) \cdot \tdb/\tT \leq \nT$,

1701: and

1702: $\nu[k+1](T_{\alpha}) \cdot

1703: \tdb/\tT > \nT$.
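As an illustration with hypothetical statistics,
let $\tdb/\tT = 4$ and $\nT = 40{,}000$,
and suppose that $\nu[11](T_{\alpha}) = 9{,}600$
and $\nu[12](T_{\alpha}) = 11{,}500$.
Then
\[
9{,}600\cdot 4 = 38{,}400 \leq 40{,}000
\quad\mbox{and}\quad
11{,}500\cdot 4 = 46{,}000 > 40{,}000,
\]
so the cut point is $k=11$.
Since $\nu[k](T_{\alpha})$ cannot decrease as $k$ grows,
no larger $k$ satisfies the first condition.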

1704:

1705:

1706: Sometimes the above estimation only guarantees

1707: that the main memory is big enough for

1708: the FP-tree which contains all items between $i_1$ and the cut point,

1709: while it doesn't guarantee

1710: that the descendant trees from that FP-tree can fit in main memory.

1711: This is because the estimation doesn't consider the

1712: size of descendant trees correctly

1713: (in \xs{memory}, we assumed that the size of a conditional tree is 10\%

1714: of its nearest ancestor tree).

1715: Actually, from

1716: $\mjT$

1717: we can get a more accurate estimation of the size of the

1718: biggest descendant tree.

1719: To find the cut point,

1720: we need to find the biggest $k$,

1721: such that

1722: $(\nu[k](T_{\alpha}) +

1723: \mjT)\cdot

1724: \tdb/\tT \leq \nT$, and

1725: $(\nu[k+1](T_{\alpha}) +

1726: \mu[m](T_{\alpha}))\cdot

1727: \tdb/\tT > \nT$,

1728: where

1729: $j\leq k$,

1730: $\mjT = {\mit max}_{j\in\{1,\ldots,k\}}\mjT$,

1731: and

1732: $m\leq k+1$,

1733: $\mu[m](T_{\alpha}) = {\mit max}_{m\in\{1,\ldots,k+1\}}\mu[m](T_{\alpha})$.

1734:

1735: \medskip

1736:

1737: \noindent

1738: {\bf Grouping the rest of the items.}

1739: Now we answer the second question, how to put more items into a group?

1740: Here we still need

1741: $\mjT$.

1742: Starting with

1743: \mbox{$\mu[{\mit cutpoint}+1](T_{\alpha})$},

1744: we test if

1745: $\mu[{\mit cutpoint}+1](T_{\alpha})\cdot

1746: \tdb/\tT > \nT$.

1747: If not, we put the next item $i_{{\mit cutpoint}+2}$

1748: into the group,

1749: and test if

1750: \mbox{$(\mu[{\mit cutpoint}+1](T_{\alpha}) +

1751: \mu[{\mit cutpoint}+2](T_{\alpha})

1752: )$}

1753: $\cdot \tdb/\tT > \nT$.

1754: We repeatedly put the next item in

1755: ${\mit freqstring}(\db_{\alpha})$ into the group

1756: until we reach an item $i_j$,

1757: such that

1758: $$\displaystyle\sum_{m={\mit cutpoint}+1}^{j}

1759: \mu[m](T_{\alpha})\cdot

1760: \tdb/\tT > \nT.$$

1761: Then starting from $i_j$, we put items into next group,

1762: until all items find its group.

1763:
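\noindent
The grouping loop can be traced on a small example.
The numbers below are hypothetical and chosen only for illustration:
${\mit cutpoint} = 2$, $\tdb/\tT = 4$, $\nT = 120$ (in arbitrary size units), and
$\mu[3](T_{\alpha}), \ldots, \mu[7](T_{\alpha})$ equal to $3, 9, 12, 14, 25$.
Writing $S(a,j) = \sum_{m=a}^{j}\mu[m](T_{\alpha})\cdot\tdb/\tT$,
a shorthand introduced only for this example, we get
\begin{eqnarray*}
S(3,3) = 12, \quad S(3,4) = 48, \quad S(3,5) = 96, & & S(3,6) = 152 > 120, \\
S(6,6) = 56, & & S(6,7) = 156 > 120, \\
S(7,7) = 100 \leq 120 . & &
\end{eqnarray*}
Under these assumptions the first group after the cut point is $\{i_3, i_4, i_5\}$,
the next group, which starts from $i_6$, is $\{i_6\}$,
and the last group is $\{i_7\}$.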

Why can we group items together?
The reason is that,
even if we construct
$T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$
from the projected databases
$\db_{\alpha.\beta_{i_j}}, \ldots, \db_{\alpha.\beta_{i_k}}$
and put all of them into main memory,
the main memory is big enough according to the grouping condition.
At this stage, $T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$
can all be constructed by scanning $\db_\alpha$ once.
Then we mine frequent itemsets from the FP-trees.
However, we can do better.
The trees $T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$ overlap a lot,
so their total size is
greater than the size of $T_{\alpha.\beta}$.
This means that we can put more items into
each $\beta$,
as long as the size of $T_{\alpha.\beta}$
is estimated to fit in main memory.
To estimate the size of $T_{\alpha.\beta}$, part of
$T_{\alpha}$
has to be traversed by following the links for
the master items in $T_{\alpha}$.

1787:

1788:

1789:

1790: \subsection {Database projection}

1791: After all items have found their groups,

1792: the original database will be projected to small databases according to

1793: Definition \ref{four}.

1794: To save disk I/O's, three techniques can be used:

1795: \begin {enumerate}

1796: \item

In a group $\beta$, if the number of master items is greater than
half of the number of frequent items
(this often happens in the group that contains the cut point),
then $\db_{\alpha.\beta}$ need not be
computed.
To mine all frequent itemsets,
$T_{\alpha.\beta}$ can be constructed directly from $\db_{\alpha}$
by reading it once.
This is because $\db_{\alpha.\beta}$
is not much smaller than $\db_{\alpha}$,
while reading $\db_{\alpha}$ once
takes fewer disk I/O's than writing and reading
$\db_{\alpha.\beta}$ once.

1810:

1811: \item

Since
the partial tree $T_{\alpha}$,
now in main memory,
records all frequency
information of those transactions that have
been read so far,
the frequency information of those transactions
can be obtained from $T_{\alpha}$
when computing the projected databases.
Thus
disk I/O's are only spent on reading those transactions
that did not contribute to $T_{\alpha}$.

1824:

1825: \item

1826: As discussed in \xs{array},

1827: by using the array technique,

1828: in group $\beta_j$, we find all slave items,

1829: such that they are not frequent with any master item in $\beta_j$,

1830: and all master items, such that their number of frequent

1831: items in $\{\beta_1,\ldots,\beta_j\}$ is 0 or 1.

1832: When computing $\db_{\alpha.\beta_j}$,

1833: all those items are removed from new transactions in $\db_{\alpha.\beta_j}$.

1834: \end{enumerate}
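
\noindent
The trade-off behind the first technique can be made explicit with a simple cost comparison.
As a sketch, let $D_{\alpha}$ and $D_{\alpha.\beta}$ denote the sizes of
$\db_{\alpha}$ and $\db_{\alpha.\beta}$
(symbols introduced only for this illustration),
and let $B$ be as in the disk I/O counts $D/B$ used in this paper.
Reading $\db_{\alpha}$ once costs $D_{\alpha}/B$ disk I/O's,
while writing $\db_{\alpha.\beta}$ and reading it back costs $2\cdot D_{\alpha.\beta}/B$.
Hence
\begin{eqnarray*}
D_{\alpha}/B < 2\cdot D_{\alpha.\beta}/B
& \Longleftrightarrow &
D_{\alpha.\beta} > D_{\alpha}/2 ,
\end{eqnarray*}
so, under this simplified cost model, constructing $T_{\alpha.\beta}$ directly from $\db_{\alpha}$
is cheaper whenever the projected database would be more than half the size of $\db_{\alpha}$.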

1835:

1836:

1837:

1838: \subsection {The disk I/O's}

Let us recount the disk I/O's used in {\em Diskmine}.
From the first scan we get all frequent items in $\db_{\epsilon}$,
which needs $D/B$ disk I/O's.
In the second scan we construct a partial FP-tree $T_{\epsilon}$,
then continue scanning the rest of the database for statistics,
which needs another $D/B$ disk I/O's.
Suppose then that $k$ projected databases have to be computed.
According to \xs{diskmine},
the total size of the projected databases is
approximately $k/2 \cdot D$.
For computing the projected databases,
the frequency information in $T_{\epsilon}$ is reused,
so only part of $\db_{\epsilon}$ is read.
We assume that on average half of $\db_{\epsilon}$ is read at this stage,
which means $1/2\cdot D/B$ disk I/O's.
Writing and later reading the $k$ projected databases
will take $2\cdot k/2\cdot D/B = k\cdot D/B$ disk I/O's.
Suppose all frequent itemsets can be mined from the projected databases
without going to the third level.
Then the total number of disk I/O's is

1859: \negvs

\begin{eqnarray}
D/B + D/B + 1/2 \cdot D/B + k\cdot D/B
& = &
5/2 \cdot D/B
+
k\cdot D/B
\end{eqnarray}

1865:

Compared with formula (\ref{formula}),

1867: {\em Diskmine} saves at least

1868: $k/2 \cdot D/B$

1869: disk I/O's,

1870: thanks to the various techniques used in the algorithm.

1871:

1872: \section {Experimental Evaluation and Performance Study}

1873:

1874: In this section, we present the results from

1875: a performance comparison of

1876: {\em Diskmine} with the {\em Parallel Projection Algorithm} in

1877: \cite{HPYM04} and the {\em Partitioning Algorithm} introduced

1878: in \cite{SON95}.

1879: The scalability of {\em Diskmine} is also analyzed,

and the accuracy of our memory size
estimations is validated.

1882:

1883:

1884: As mentioned in \xs{diskmine},

1885: the Parallel Projection Algorithm is a naive divide-and-conquer

1886: algorithm,

1887: since for each item a projected database is created.

For performance comparison,
we implemented the Parallel Projection Algorithm,
using {\em FP-growth} as the main memory method,
as introduced in \cite{HPYM04}.
The
Partitioning Algorithm is also a divide-and-conquer algorithm.
We implemented
the Partitioning Algorithm by using the Apriori implementation
\cite{gap}.
We chose this implementation since
it was well written and easy to adapt
for our purposes.

1900:

1901:

1902: We ran the three algorithms on

1903: both synthetic datasets and real datasets.

1904: Some synthetic datasets have millions of transactions,

and the size of the datasets ranges from several megabytes to
several hundred gigabytes.

1907: Without loss of generality,

1908: only the results for some synthetic datasets and a real dataset

1909: are shown here.

1910:

1911:

1912:

All experiments were performed on a 2.0~GHz Pentium 4 with
256 MB of memory under Windows XP.
For {\em Diskmine} and the Parallel Projection Algorithm,
the size of the main memory is given as an input.
For the Partitioning Algorithm,
since it only has two database scans, and each main-memory-sized partition
and all data structures for Apriori
are stored in main memory,
the size of main memory is not controlled,
and only the running time is recorded.

1923:

1924:

We first compared the performance of the three algorithms on a synthetic dataset.
Dataset {\em T100I20D100K} was generated using the
application from \cite{syns}.
The dataset has 100,000 transactions and 1000 items,
and occupies about 40 megabytes of memory.
The average transaction length is 100,
and the average pattern length is 20.
The dataset is very sparse and the FP-tree constructed from the dataset
is bushy.
For Apriori,
a large number of candidate frequent itemsets
will be generated from the dataset.
When running the algorithms, the main memory size
was given as 128 megabytes.
\xf{SynReal} (a) shows the experimental results.

1940: In the figure, ``Naive Algorithm''

1941: represents the Parallel Projection Algorithm,

1942: and

1943: ``Aggressive Algorithm'' represents the {\em Diskmine} algorithm.

1944:

1945:

1946: \begin{figure}[h]

1947:     \begin{minipage}[t]{2in}

1948:        \centerline{\psfig{figure=figures/synthetic,height=1.7in}}

1949:        \center{\small (a)}

1950:     \end{minipage}

1951:     \hfill

1952:     \begin{minipage}[t]{2in}

1953:        \centerline{\psfig{figure=figures/total,height=1.7in}}

1954:        \center{\small (b)}

1955:     \end{minipage}

1956:     \hfill

1957:     \begin{minipage}[t]{2in}

1958:        \centerline{\psfig{figure=figures/realdata,height=1.7in}}

1959:        \center{\small (c)}

1960:     \end{minipage}

1961:   \caption{{\small Experiments on Synthetic Data and Real Data}}

1962:   \label{SynReal}

1963: \end{figure}

1964:

From \xf{SynReal} (a),
we can see that the Partitioning Algorithm is the slowest in the group.
The Naive Algorithm, however, is not slower than the Aggressive Algorithm
if we only compare their CPU time.
In \cite{fimi03},
where we were concerned with main memory mining,
we found that if a dataset is sparse the
boosted {\em FPgrowth*} method has a much better performance than
the original {\em FP-growth}.
The reason why the CPU time of the Aggressive Algorithm is not always
less than that of the Naive Algorithm here is
that the Aggressive Algorithm
has to spend CPU time on calculating statistics.
On the other hand, as expected,
we can see in the figure that
the disk I/O time of the Aggressive Algorithm is
orders of magnitude smaller than that of the Naive Algorithm.
In \xf{SynReal} (b) we compare the total running times.
We can see that the CPU overhead of the Aggressive
Algorithm now becomes insignificant compared to
the savings in disk I/O.

1986:

1987:

1988:

1989:

1990: We then ran the algorithms on a real dataset {\em Kosarak},

1991: which is used as a test dataset in \cite{ZB03}.

1992: The dataset is about 40 megabytes.

1993: Since it is a dense dataset and its FP-tree is pretty small,

1994: we set the main memory size as 16 megabytes for the experiments.

1995: Results are shown in \xf{SynReal} (c).

1996:

In \xf{SynReal} (c),
the Partitioning Algorithm is still the slowest.
This is because it generates too many candidate frequent itemsets.
Together with the data structures,
these candidate sets use up main memory, so
virtual memory had to be used.
We can again notice that the CPU time of the Naive Algorithm
is less than that of the Aggressive Algorithm.
This is because {\em Kosarak} is a dense dataset, so
the array technique does not help much.
In addition, calculating the
statistics takes much time.
The disk I/O's for the Aggressive Algorithm are still
remarkably fewer than the disk I/O's for the Naive Algorithm.

2011:

2012:

To test the effectiveness of the techniques for grouping items,
we ran {\em Diskmine} on
{\em T100I20D100K} to see how
close
the estimated FP-tree size for each group is to its real size.
We again set the main memory size as 128 megabytes,
and the minimum support as 2\%.

2020: When generating the projected databases,

2021: items were grouped into 7 groups

2022: (the total number of frequent items

2023: is 826).

2024: As we can see from \xf{Effect} (a),

2025: in all groups,

the estimated size is always slightly
larger than the real size.

2028: Compared with the Naive Algorithm,

2029: which constructs an FP-tree for each item from its projected database,

2030: the Aggressive Algorithm almost fully

2031: uses the main memory for each group to

2032: construct an FP-tree.

2033:

2034: \begin{figure}[ht!]

2035:     \begin{minipage}[t]{1.5in}

2036:        \centerline{\psfig{figure=figures/versus,height=1.25in}}

2037:        \center{\small (a)}

2038:     \end{minipage}

2039:     \hfill

2040:     \begin{minipage}[t]{1.5in}

2041:        \centerline{\psfig{figure=figures/scalability,height=1.25in}}

2042:        \center{\small (b)}

2043:     \end{minipage}

2044:   \caption{{\small Estimation Effect and Scalability of {\em Diskmine}}}

2045:   \label{Effect}

2046: \end{figure}

As a divide-and-conquer algorithm,
{\em Diskmine} has good scalability as one of its most important
properties.
We ran {\em Diskmine} on a set of synthetic datasets.
In all datasets,
the number of items was set as 10000,
the average transaction length as 100,
and the average pattern length as 20.
The number of transactions in the datasets
varied from 200,000 to 2,000,000.
The dataset sizes range from 100 megabytes to 1 gigabyte.
Minimum support was set as 1.5\%,
and the available main memory was 128 megabytes.
\xf{Effect} (b) shows the results.
In the figure, the CPU and the disk I/O times are
always kept in a small range of acceptable values.

2063: Even for the datasets with 2 million transactions,

2064: the total running time is less than 1000 seconds.

2065: Extrapolating from these figures using formula (4),

2066: we can conclude that a dataset the size of the

2067: Library of Congress collection (25 Terabytes)

2068: could be mined in around 18 hours with current technology.

2069:

2070:

2071: \section{Conclusions}

2072:

We have introduced several divide-and-conquer algorithms
for mining frequent itemsets from secondary memory.
We have analyzed the
recurrences and disk I/O's of all algorithms.

We then gave a detailed divide-and-conquer
algorithm
which almost fully uses the limited main memory
and saves a large number of disk I/O's.
We introduced many novel techniques
used in our algorithm.

Our
experimental results show
that our algorithm
successfully reduces the number of disk accesses,
sometimes by orders of magnitude,
and that our algorithm scales up to
terabytes of data.
The experiments also validate that
the estimation techniques used in
our algorithm are accurate.

2095:

2096:

2097:

For future work,
we notice that
there are very few efficient algorithms
for mining
{\em maximal} frequent itemsets and {\em closed}
frequent itemsets \cite{PBT99,PHM00,WHP03,Zaki02}
from very large databases.
Unlike in {\em Diskmine},
where the frequent itemsets mined from all projected databases
are globally frequent,
a maximal frequent itemset or a
closed frequent itemset mined from a projected database
is only locally maximal or closed.
The challenge is that
a data structure, whose size may also be very big,
must be maintained for recording all already discovered
maximal or closed frequent itemsets.
We also notice that
our implementation of the Partitioning Algorithm is
based on an existing Apriori implementation,
which is not necessarily highly optimized.
As we know,
there are situations
in which there are not
too many candidate itemsets in a database,
but the FP-tree constructed from the database is pretty big.
In this situation the
Partitioning Algorithm only needs two database scans,
and all frequent itemsets can be mined in main memory,
or with very little I/O for keeping the
candidate sets in virtual memory.
In this situation
{\em Diskmine} also needs two database scans,
and it additionally
needs to
decompose the database.
Therefore, exploring whether some clever disk-based data structure
would make the partitioning approach scale
is another interesting direction for further research.

2137:

2138:

2139: \begin{thebibliography}{icdm}

2140:

2141:

2142:

2143: \bibitem{syns}

2144: \newblock {\tt www.almaden.ibm.com/software/quest}

2145:

2146:

2147: %\bibitem{ZB03a}

2148: %\newblock {\tt fimi.cs.helsinki.fi}

2149:

2150: \bibitem{gap}

2151: \newblock{\tt www.cs.helsinki.fi/u/goethals/software}

2152:

2153:

2154: \bibitem{AAP00}

2155: R.\ C.\ Agarwal, C.\ C.\ Aggarwal and V. V. V. Prasad,

2156: \newblock Depth first generation of long patterns,

\newblock In {\em KDDM '00}, pp.\ 108--118

2158:

2159: \bibitem{AIS93}

2160: R.~Agrawal, T.~Imielinski, and A.~Swami.

2161: \newblock Mining association rules between sets of items in large databases.

2162: \newblock In {\em SIGMOD '93},

2163: pp.\ 207--216, 1993.

2164:

2165: \bibitem{AS94}

2166: R.~Agrawal and R.~Srikant.

2167: \newblock Fast algorithms for mining association rules.

2168: \newblock In  {\em VLDB '94}, pp.\   487--499

2169:

2170: \bibitem{AS95}

2171: R.~Agrawal and R.~Srikant.

2172: \newblock Mining sequential patterns.

2173: \newblock In {\em ICDE '95}, pp.\ 3--14

2174:

2175: %\bibitem{BMS97}

2176: %S.~Brin, R.~Motwani, and C.~Silverstein.

2177: %\newblock Beyond market basket: Generalizing association rules to correlations.

2178: %\newblock In {\em Proceeding of Special Interest Group on Management of

2179: %Data}, pages 265--276, Tucson, Arizona, May 1997.

2180:

2181: \bibitem{fimi03}

2182: G.~Grahne, J.~Zhu.

2183: \newblock Efficiently Using Prefix-trees in Mining Frequent Itemsets.

2184: \newblock In

2185: \cite{ZaBa03}

2186: % {\em 1st Workshop on Frequent Itemset Mining Implementations (FIMI'03)}

2187: %\newblock Melbourne, FL, Nov. 2003.

2188:

2189: \bibitem{HPY00}

2190: J.~Han, J.~Pei, and Y.~Yin.

2191: \newblock Mining frequent patterns without candidate generation.

2192: \newblock In {\em SIGMOD '00}, pp.\ 1--12

2193:

2194: \bibitem{HPYM04}

2195: J.~Han, J.~Pei, Y.~Yin and R.~Mao.

2196: \newblock Mining frequent patterns without candidate generation: A Frequent-Pattern Tree Approach.

\newblock In {\em Data Mining and Knowledge Discovery}, Vol.\ 8, pages 53--87, 2004.

2198:

2199: \bibitem{KHC97}

2200: M.~Kamber, J.~Han and J.~Chiang.

2201: \newblock Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes.

2202: \newblock In {\em KDDM '97}, pp.\ 207--210

2203:

2204:

2205: %\bibitem{LLN00}

2206: %V.S. Lakshmanan, C. Leung, and R. Ng.

2207: %\newblock The Segment Support Map: Scalable Mining of Frequent Itemsets.

2208: %\newblock In {\em SIGKDD Explorations Special Issue on Scalable Data Mining},

2209: %\newblock Volume 2, Issue 2, pages 21-27. December 2000.

2210:

2211: %\bibitem{MT97}

2212: %H. Mannila and H. Toivonen.

2213: %\newblock Levelwise search and borders of theories in knowledge discovery.

2214: %\newblock In {\em Data Mining and Knowledge Discovery},

2215: %\newblock Vol. 1, 3(1997), pages 241-258.

2216:

2217: \bibitem{MTV94}

2218: H. Mannila, H. Toivonen, and I. Verkamo.

2219: \newblock Efficient algorithms for discovering association rules.

2220: \newblock In {\em KDDM '94},

2221: pp.\ 181--192.

2222:

2223: \bibitem{MTV97}

2224: H. Mannila, H. Toivonen, and I. Verkamo.

2225: \newblock Discovery of Frequent Episodes in Event Sequences.

2226: \newblock In {\em Data Mining and Knowledge Discovery}.

2227: \newblock Volume 1, 3(1997), pages 259--289.

2228:

2229: \bibitem{PBT99}

2230: N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal.

2231: \newblock Discovering frequent closed itemsets for association rules.

2232: \newblock In {\em ICDT'99}, Jan. 1999.

2233:

2234: \bibitem{PHM00}

2235: J. Pei, J. Han and R. Mao,

2236: \newblock CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets.

\newblock In {\em {ACM} {SIGMOD} Workshop on Research Issues in Data Mining and Knowledge Discovery}, pages 21--30, 2000.

2238:

2239:

2240: %\bibitem{STA98}

2241: %S.~Sarawagi, S.~Thomas, and R.~Agrawal.

2242: %\newblock Integrating association rule mining with relational database systems:

2243: %  Alternatives and implications.

2244: %\newblock In {\em Proceeding of Special Interest Group on Management of Data}, pages 343--354, 1998.

2245:

2246: \bibitem{SON95}

2247: A.~Savasere, E.~Omiecinski, and S.~Navathe.

2248: \newblock An efficient algorithm for mining association rules in large

2249:   databases.

2250: \newblock In {\em VLDB '95}, pp. 432--443

2251:

2252: \bibitem{Toiv96}

2253: H.~Toivonen.

2254: \newblock Sampling large databases for association rules.

2255: \newblock In {\em VLDB '96}, pp.\ 134--145

2256:

2257: \bibitem{WHP03}

2258: J. Wang, J. Han, and J. Pei.

2259: \newblock CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets.

2260: \newblock In {\em Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03)}, Washington, D.C., Aug. 2003.

2261:

2262:

2263: \bibitem{ZB03}

B.~Goethals and M.~J.~Zaki (Eds.).
\newblock {\em Proceedings
of the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations
(FIMI '03)}.
\newblock CEUR Workshop Proceedings, Vol.\ 90,
\verb+http://CEUR-WS.org/Vol-90+

2270:

2271:

2272:

2273: \bibitem{ZaBa03}

2274: Bart Goethals and Mohammed J. Zaki.

2275: \newblock Advances in Frequent Itemset Mining Implementations: Introduction to FIMI03.

2276: \newblock In {\em 1st Workshop on Frequent Itemset Mining Implementations (FIMI'03)}

2277: \newblock Melbourne, FL, Nov. 2003.

2278:

2279:

2280: %\bibitem{Zaki00}

2281: %M. J. Zaki.

2282: %\newblock Scalable algorithms for association mining.

2283: %\newblock In {\em IEEE Transactions on Knowledge and Data Mining},

2284: %\newblock 12(3):372-390, May-June 2000.

2285:

2286: \bibitem{Zaki02}

2287: M. J.~Zaki and C.~Hsiao.

2288: \newblock CHARM: An Efficient Algorithm for Closed Itemset Mining.

2289: \newblock In {\em Proceeding of The 2nd SIAM International Conference on Data Mining},

2290: \newblock Arlington, April 2002.

2291:

2292: \bibitem{Zaki03}

2293: M. J. Zaki and Karam Gouda.

2294: \newblock Fast Vertical Mining Using Diffsets.

2295: \newblock In {\em KDDM '03},

2296: pp.\ 326--335

2297:

2298:

2299:

2300:

2301:

2302: \end{thebibliography}

2303:

2304: \end{document}

2305:

2306:

2307: \newpage

2308:

2309:

2310: \begin{centering}

2311:

2312: \begin{figure}

2313: \begin{minipage}[t]{6.5in}

2314: \centerline{\psfig{figure=figures/append1,height=3.25in}}

2315: \caption{Recurrence structure of Naive Projection Algorithm}

2316: \end{minipage}

2317: \end{figure}

2318:

2319:

2320: \begin{figure}

2321: \begin{minipage}[t]{6.5in}

2322: \centerline{\psfig{figure=figures/append2,height=3.25in}}

2323: \caption{Recurrence structure of Aggressive Projection Algorithm}

2324: \end{minipage}

2325: \end{figure}

2326:

2327: \end{centering}

2328:

2329:

2330:

2331:

2332:

2333:

2334: