1: \documentclass[11pt]{article}
2: \usepackage{epsfig}
3: \usepackage{amsmath}
4: \usepackage{amssymb}
5: \usepackage{geometry}
6: \usepackage{url}
7:
8: \newcommand{\xf}[1]{Figure~\ref{#1}}
9: \newcommand{\xt}[1]{Table~\ref{#1}}
10: \newcommand{\xp}[1]{page~\pageref{#1}}
11: \newcommand{\xs}[1]{Section~\ref{#1}}
12: \newcommand{\xa}[1]{Appendix~\ref{#1}}
13: \newtheorem{theorem}{Theorem}
14: \newtheorem{lemma}{Lemma}
15: \newtheorem{prop}{Proposition}
16: \newtheorem{defn}{Definition}
17:
18:
19: \def\db{\mbox {$\cal D$}}
20: \def\mm{\mbox {$\cal M$}}
21: \def\nn{\mbox {$t$}}
22: \def\tdb{\mbox{$t(\db_{\alpha})$}}
23: \def\tT{\mbox{$t(T_{\alpha})$}}
25: \def\nT{\mbox{$\nu(T_{\alpha})$}}
26: \def\njT{\mbox{$\nu[j](T_{\alpha})$}}
27: \def\mjT{\mbox{$\mu[j](T_{\alpha})$}}
28:
29: \newcommand{\vs}{\vspace{1ex}} % Small vertical space
30: \newcommand{\negvs}{\vspace*{-1ex}} % Small negative vertical space
31:
32:
33:
34:
35: \newlength{\qedlengte}
36: \settowidth{\qedlengte}{$\Box$}
37: \addtolength{\qedlengte}{-0.25\qedlengte}
38: \newcommand{\qedbox}{\rule{\qedlengte}{\qedlengte}}
39: \newcommand{\qed}{\hspace*{1em}\hfill\qedbox}
40:
41:
42:
43: \newenvironment{prog}{\def\-{\hskip 1em}\penalty-1000\vskip\parskip
44: \parskip0pt\leftskip2em\obeylines\tt}{\par}
45:
46: \title{Mining Frequent Itemsets from Secondary Memory}
47:
48: \author{G\"{o}sta Grahne and Jianfei Zhu\\
49: Concordia University\\
50: Montreal, Canada\\
51: \{grahne, j\_zhu\}@cs.concordia.ca\\
52: }
53: \date{March 6, 2004}
54: \begin{document}
55: \maketitle
56:
57: \begin{abstract}
58: Mining frequent itemsets is at the core
59: of mining association rules, and is by now quite well
60: understood algorithmically.
61: However, most algorithms for mining frequent
62: itemsets assume that the main memory is large enough
63: for the data structures used in the mining,
64: and very few efficient algorithms deal with the case when
65: the database is {\em very} large or the minimum support is very low.
Mining frequent itemsets from a very large database
poses new challenges,
as astronomical amounts of raw data
are ubiquitously being recorded in commerce, science and government.
70:
71: In this paper, we discuss approaches to mining frequent itemsets when
72: data structures are too large to fit in main memory.
73: Several
74: divide-and-conquer
75: algorithms
76: are given for mining from disks.
77: Many novel techniques are introduced.
78: Experimental results show that
79: the techniques reduce
80: the required disk accesses by orders of magnitude,
81: and enable truly scalable data mining.
82: \end{abstract}
83:
84: \section{Introduction}
85:
86:
87:
88:
89: Mining frequent itemsets is a fundamental problem
90: for mining association rules \cite{AIS93, AS94,MTV94,PBT99, PHM00,WHP03,Zaki02, ZB03}.
91: It also plays an important role in many other data mining tasks
92: such as sequential patterns, episodes, multi-dimensional patterns and so on
93: \cite{AS95, MTV97, KHC97}.
94: In addition, frequent itemsets are one of the key abstractions
95: in data mining.
96:
97:
98: The description of the problem is as follows.
Let $I = \{i_1,i_2,\ldots,i_j,\ldots,i_n\}$
be a set of {\em items}.
101: Items will sometimes also be denoted
102: $a,b,c,\ldots$.
103: An $I$-{\em transaction} $\tau$ is a subset of $I$.
104: An $I$-transactional {\em database} $\db$ is a finite bag
105: of $I$-transactions.
The {\em support} of an itemset $S\subseteq I$
is the proportion of transactions in \db\ that contain $S$.
The task of mining frequent itemsets is to find
all $S$ such that the support of $S$ is at least some
given {\em minimum support} $\xi$,
where $\xi$ is either a fraction in $[0,1]$
or an absolute count.
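
As a small worked illustration of these definitions
(the database and threshold below are chosen here purely for illustration),
let $\db = \{\{a,b,d\},\{b,c,d\},\{a,c\},\{a,b\}\}$ and $\xi = 40\%$.
Counting, for each itemset, the transactions that contain it gives
\begin{eqnarray*}
\mbox{support}(\{a\}) = \mbox{support}(\{b\}) & = & 3/4, \\
\mbox{support}(\{c\}) = \mbox{support}(\{d\}) & = & 2/4, \\
\mbox{support}(\{a,b\}) = \mbox{support}(\{b,d\}) & = & 2/4, \\
\mbox{support}(\{c,d\}) & = & 1/4.
\end{eqnarray*}
Hence $\{a\}$, $\{b\}$, $\{c\}$, $\{d\}$, $\{a,b\}$ and $\{b,d\}$ are frequent,
while $\{c,d\}$ is not.
Every remaining itemset, such as $\{a,c\}$, occurs in at most one transaction
and is therefore infrequent as well.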
113:
Most of the algorithms, such as
Apriori \cite{AS94},
DepthProject \cite{AAP00},
and dEclat \cite{Zaki03},
work well when the main memory is big enough to
hold the whole database and/or the data structures
(candidate sets, FP-trees, etc.).
121: When a database is very large or when the minimum support is very low,
122: either the data structures used by the algorithms may not be accommodated in
123: main memory,
124: or the algorithms spend too much time on
125: multiple passes over the database.
126: In the
127: {\em First IEEE ICDM Workshop on Frequent Itemset
128: Mining Implementations, FIMI~'03} \cite{ZB03},
129: many well known algorithms were implemented
130: and independently tested.
131: The results show that ``{\em none} of the algorithms is able to gracefully
132: scale-up to very large datasets,
133: with millions of transactions''
134: \cite{ZaBa03}.
135:
136:
137:
At the same time,
very large databases do exist in real life.
In a medium-sized business, or in a company as big as Walmart,
it is very easy to collect a few gigabytes of data.
Terabytes of raw data
are ubiquitously being recorded in commerce, science and government.
The question of how to handle these databases is still one of the most
difficult problems in data mining.
146:
147:
148: A few researchers have
149: tried to mine frequent itemsets from very large databases.
150: One approach is by {\em sampling}.
151: For instance, \cite{Toiv96}
152: picks a random sample of the database,
153: finds all frequent itemsets from the sample, and then verifies
154: the results with the rest of the database.
This approach needs only one pass over the database.
156: However, the results are probabilistic,
157: meaning that
158: some critical frequent itemsets could be missing.
159:
160:
161:
162: {\em Partitioning} \cite{SON95}
163: is another approach for mining very large databases.
164: This approach first partitions the database
165: into many small databases,
166: and mines candidate frequent itemsets from each small database.
167: One more pass
168: over the original database
169: is then done to verify the candidate frequent itemsets.
170: The approach thus needs only two database scans.
However, when the data structures used for storing
candidate frequent itemsets
are too big to fit in main memory,
a significant number of disk I/O's is needed
for the disk-resident data structures.
176:
In \cite{HPY00, HPYM04}, Han {\em et al.}\ introduce the {\em FP-growth}
method, which
uses two database scans for constructing an FP-tree
from the database,
and then mines all frequent itemsets from the FP-tree.
Two approaches are suggested for the case where
the FP-tree is too large to fit into main memory.
184:
185: The first approach writes the FP-tree to disk,
186: then mines all frequent sets by reading the
187: frequency information from the FP-tree.
However, the size of the FP-tree could be the same as the
size of the database, and for each item in the FP-tree,
we need at least one FP-tree traversal.
191: Thus the I/O's for writing and reading the
192: disk-resident FP-tree could be
193: prohibitive.
194:
195: The second approach
196: {\em projects} the original database
197: on each frequent item, then mines frequent itemsets from
198: the small projected databases.
199: One advantage of this approach is that any frequent itemset
200: mined from a projected database is a frequent itemset in the original database.
201: To get {\em all} frequent itemsets,
202: we only need to
203: take the union of the frequent itemsets from the small projected databases.
This is in contrast to the
partitioning approach,
where all candidate frequent itemsets have to be stored and later verified
by another pass over the database.
208: The biggest problem of the projection approach is that
209: the total size of the projected databases could be too large,
210: and there will be too many disk I/O's for the
211: projected databases.
212:
213: \subsubsection*{Contributions}
214: In this paper we consider the problem of mining frequent itemsets
215: from {\em very} large databases.
216: We adopt a
217: divide-and-conquer approach.
First we give three algorithms:
a general divide-and-conquer algorithm,
an algorithm using
naive projection, and an algorithm using
aggressive projection.
223: We also analyze the
224: number of steps and disk I/O's required by these algorithms.
225:
226: In a detailed divide-and-conquer algorithm,
227: called {\em Diskmine},
228: we use the highly efficient
229: {\em FP-growth*} method \cite{fimi03} to
230: mine frequent itemsets from an FP-tree in main memory.
231: We describe several novel techniques
232: useful in mining frequent itemsets from disks,
233: such as the array technique,
234: the item-grouping technique,
235: and memory management techniques.
236:
Finally, we present experimental results that
demonstrate that our {\em Diskmine} algorithm
outperforms previous algorithms
by orders of magnitude,
and scales up to terabytes of data.
242:
243:
244: \subsubsection*{Overview}
245: The remainder of this paper is organized as follows.
246: In Section 2
247: we introduce approaches for mining frequent itemsets from disks.
248: Three algorithms are introduced and analyzed.
249: Section 3 gives a detailed divide-and-conquer
250: algorithm {\em Diskmine},
251: in which many novel optimization techniques are used.
252: These techniques are also described in Section 3.
253: Experimental results are given in Section 4.
254: Section 5 concludes,
255: and outlines directions for future research.
256:
257:
258: \section{Mining from disk} \label{diskmine}
259:
How should one go about mining
frequent itemsets from very large databases
residing in secondary storage,
such as disks?
Here ``very large'' means that
the data structures constructed from the database
for mining frequent itemsets
cannot fit in the available main memory.
268:
269:
Basically, there are two strategies
for mining frequent itemsets:
the datastructures approach
and the
divide-and-conquer approach.

The {\em datastructures} approach consists of
reading
the database buffer by buffer
and generating
data structures (e.g.\ candidate sets or FP-trees).
Since the data structures do not fit into main memory,
additional disk I/O's are required.
283: The number of passes and disk I/O's required
284: by the approach
285: depends on the algorithm and its datastructures.
286: For example,
287: if the algorithm is Apriori \cite{AS94}
288: using a hash-tree
289: for candidate itemsets
290: \cite{SON95},
291: disk based hash-trees have to be used.
Then the number of passes for the algorithm
is the same as the length of the longest
frequent itemset,
and the number of disk I/O's for the hash-trees
depends on the size of the hash-trees
on disk.
298:
299: The basic strategy for the
300: {\em divide-and-conquer} approach
301: is shown in \xf{bdaqalgo}.
In the figure,
$|\db|$ denotes
the size of the data structures used
by the mining algorithm, and
$M$ is the size of the available main memory.
Function {\em mainmine}
is called if
candidate frequent itemsets (not necessarily all)
can be mined without
writing the data structures used by
the mining algorithm to disk.
In \xf{bdaqalgo},
a very large database is decomposed into a number
of smaller databases.
If a ``small'' database is still too large,
i.e., the data structures are still too big to fit in main memory,
the decomposition is recursively continued
until
the data structures fit in main memory.
321: After all small databases are processed,
322: all candidate frequent itemsets are combined in some way
323: (obviously depending on the way the decomposition was done)
324: to get all frequent itemsets for the original database.
325:
326:
327: \begin{figure}[h]
328: {\bf Procedure} {\em diskmine}($\db,M$)
329:
330: \smallskip
331:
332: {\bf if} $|\db|\leq M$ {\bf then} {\bf return} {\em mainmine($\,\db$)}
333:
{\bf else} decompose $\db$ into $\db_1,\ldots,\db_k$.
335:
336: {\hskip 18pt}
337: {\bf return} {\em combine} {\em diskmine($\,\db_1,M$)},
338:
339: {\hskip 154pt} .... ,
340:
341: {\hskip 90pt} {\em diskmine($\,\db_k,M$)}.
342:
343: \caption{{\small
344: General divide-and-conquer algorithm for
345: mining frequent itemsets from disk.
346: }}
347: \label{bdaqalgo}
348: \end{figure}
349:
350:
351: The efficiency of {\em diskmine}
352: depends on
353: the method used for mining frequent itemsets
354: in main memory and on the number of
355: disk I/O's needed in the decomposition and
356: combination phases.
Sometimes disk I/O is the main factor.
Since the decomposition step involves I/O,
ideally the number of recursive calls should be
kept small. The faster we can obtain small decomposed
databases, the fewer recursive calls we will need.
362: On the other hand, if a decomposition cuts
363: down the size of the projected databases drastically,
364: the trade-off might be that the combination
365: step becomes more complicated and might involve heavy
366: disk I/O.
367:
368:
369: In the following we discuss two decomposition
370: strategies, namely
371: decomposition by partition, and
372: decomposition by projection.
373:
374: {\em Partitioning}
375: is an approach in which a large database is decomposed into
376: cells of small non-overlapping databases.
377: The cell-size is chosen so that
378: all frequent itemsets in a cell can be mined without
379: having to store any data structures in secondary memory.
380: However, since a cell only contains partial frequency
381: information of the original database,
382: all frequent itemsets from the cell are local
383: to that cell of the partition,
384: and could only be {\em candidate} frequent itemsets
385: for the whole database.
386: Thus the candidate frequent itemsets mined from
387: a cell
388: have to be verified
389: later to filter out false hits.
390: Consequently,
391: those candidate sets have to be written to disk
392: in order to leave
393: space for processing the next cell of the partition.
394: After generating candidate frequent itemsets from
395: all cells,
396: another database scan is needed to
397: filter out all infrequent itemsets.
398: The partition approach therefore needs only two passes
399: over the database,
400: but writing and reading candidate frequent itemsets
401: will involve a significant number of
402: disk I/O's,
403: depending on the size of the set of candidate frequent itemsets.
404:
405: We can conclude that the partition approach
406: to decomposition keeps the recursive levels
407: down to one, but the penalty is that the
408: combination phase becomes expensive.
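
That the verification scan cannot miss a frequent itemset follows from
a short counting argument, sketched here with notation that is introduced
only for this illustration.
Let the cells contain $n_1,\ldots,n_p$ transactions, so that the whole
database contains $N = n_1+\cdots+n_p$ transactions,
let $\sigma_i(S)$ be the number of transactions in cell $i$ that contain $S$,
and let $\xi$ be a relative minimum support.
If $S$ is infrequent in every cell, i.e.\ $\sigma_i(S) < \xi\cdot n_i$ for all $i$, then
$$\sum_{i=1}^{p}\sigma_i(S) \;<\; \xi\cdot\sum_{i=1}^{p} n_i \;=\; \xi\cdot N,$$
so $S$ is infrequent in the whole database.
By contraposition, every itemset that is frequent in the whole database is
frequent in at least one cell, and therefore appears among the candidates.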
409:
410:
411: To get an easier combination phase,
412: we adopt another decomposition strategy, which we call
413: {\em projection}.
414: Suppose for simplicity that there are four
415: items, $a,b,c,$ and $d$, and let $\db$ be a
416: database of transactions containing some
417: or all of these items.
418: We could then decompose
419: $\db$ into for instance
420: $\db_{ab}$ and
421: $\db_{cd}$.
422: Typically, we would do this when the descending order
423: of frequency of the items is $a, b, c, d$.
In $\db_{cd}$ we put all transactions
containing $c$ or $d$ (or both).
In $\db_{ab}$ we put transactions containing
$a$ or $b$ (or both), and for each transaction we store
only the $a,b$-part. Thus we will have shorter
transactions in $\db_{ab}$, and both
$\db_{ab}$ and
$\db_{cd}$ contain at most as many transactions as $\db$.
We can then recursively mine all frequent itemsets
from $\db_{ab}$ and $\db_{cd}$.
Since this decomposition is not a partition,
the projected databases
might not be that much smaller than the
original database. The upside, though, is that
the set of all frequent itemsets in
$\db$ is now simply the union of the frequent
itemsets in $\db_{ab}$ and $\db_{cd}$.
This means that the combination phase
in diskmining is a simple union.
443:
444: To illustrate this decomposition,
445: let $\db$ contain the transactions
446: $\{a, b, d\}, \{b, c, d\}, \{a, c\}$ and $\{a, b\}$.
If the minimum support is 50\%,
then $\db_{cd}=\{\{a, b, d\}, \{b, c, d\}, \{a, c\}\}$ and
$\db_{ab} =\{ \{a, b\}, \{b\}, \{a\}, \{a, b\}\}$.
From $\db_{cd}$, we get the frequent itemsets
$\{d\}, \{b,d\}$, and $\{c\}$.
Note that although $\{a\}$ and $\{b\}$ are also frequent in $\db_{cd}$,
they are not listed since they contain neither $c$ nor $d$.
They will be listed among the frequent itemsets of $\db_{ab}$,
which are $\{a\}, \{b\}$, and $\{a,b\}$.
456:
457: To analyze the recurrence and required disk I/O's of the general
458: divide-and-conquer algorithm
459: when the decomposition strategy is projection,
460: let us suppose that:
461:
462:
463: \begin{small}
464: \begin{list}{-}{}
465:
466: \item
467: The original database size is $D$ bytes.
468:
469: \item
470: The data structure is an FP-tree.
471:
472: \item
The FP-tree constructed from the original database \db\ is $T$,
and its size is $|T|$ bytes.
475:
476: \item
477: If a conditional FP-tree $T'$ is constructed from
478: an FP-tree $T$, then $|T'|\leq c\cdot |T|$,
479: for some constant $c<1$.
480:
481: \item
482: The main memory mining method is the {\em FP-growth}
483: method \cite{HPY00, HPYM04}.
484: Two database scans are needed for constructing an FP-tree
485: from a database.
486:
487: \item
488: The block size is $B$ bytes.
489:
490: \item
The main memory available for the FP-tree is $M$ bytes.
492:
493: \end{list}
494: \end{small}
495:
496:
In the first line of the algorithm in \xf{bdaqalgo},
if $T$ cannot fit in memory,
then projected databases will be generated.
We assumed that
the size of the FP-tree for a projected database
is at most $c\cdot|T|$.
If $c\cdot |T| \leq M$, function
{\em mainmine} can be called for the projected database;
otherwise, the decomposition goes on.
At pass $m$, the size of the FP-tree constructed from
a projected database is at most $c^m\cdot |T|$.
Thus, the number of passes needed by the
divide-and-conquer projection algorithm is
$1+\lceil\log_c(M/|T|)\rceil$.
Based on our experience and the analysis in \cite{HPY00, HPYM04},
we can say that for all practical purposes
the number of passes will be at most two.
For example, let $D = 100$ gigabytes, $|T| = 10$ gigabytes,
$M = 1$ gigabyte, and $c = 10\%$.
Then $M/|T| = 10^{-1}$ and the number of passes is
$1+\lceil\log_{0.1}10^{-1}\rceil = 2$.
In five passes we can handle databases up to 100 terabytes.
Namely, with $|T| = 10$ terabytes, and counting a terabyte as $10^3$ gigabytes,
we have $M/|T| = 10^{-4}$ and get
$1+\lceil\log_{0.1}10^{-4}\rceil = 5$.
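
The pass count follows directly from the assumption $|T'|\leq c\cdot|T|$.
As a sketch, let $m$ be the number of projection levels
and assume that $|T| > M$, so that at least one level is needed.
The FP-trees at level $m$ fit in main memory as soon as
$$c^m\cdot|T| \leq M
\quad\Longleftrightarrow\quad
c^m \leq M/|T|
\quad\Longleftrightarrow\quad
m \geq \log_c(M/|T|),$$
where the inequality is reversed in the last step because $0<c<1$
makes $\log_c$ a decreasing function.
The smallest such integer is $m = \lceil\log_c(M/|T|)\rceil$,
and adding the initial pass over the original database gives
$1+\lceil\log_c(M/|T|)\rceil$ passes.
For instance, with $c = 0.1$ and $M/|T| = 1/10$ we get $m=1$, i.e.\ two passes.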
521:
522:
523:
524: Assume that there are two passes,
525: and that the sum of the sizes of all projected
526: databases is $D'$.
527: There are two database scans for \db,
528: one for finding all frequent single items,
529: one for decomposition.
530: Two scans need $2\times D/B$ disk I/O's.
531: The projected databases have to be written to the disks first,
532: then later each scanned twice for building the FP-tree.
533: This step needs $3\times D'/B$ disk I/O's.
Thus, the total number of
disk I/O's for the general divide-and-conquer
projection algorithm
is at least
538: \negvs
539: \begin{eqnarray}
540: 2\cdot D/B + 3\cdot D'/B.
541: \end{eqnarray}
542: Obviously,
543: the smaller $D'$, the better the performance.
544:
545:
546: One of the simplest projection strategies
547: is to project the database on each frequent item,
548: which we call
549: {\em naive projection}.
550: First we need some formal definitions.
551:
552: \begin{defn}
553: {\rm
554: Let $I$ be a set of items.
555: By $I^*$ we will denote {\em strings} over $I$,
556: such that each symbol occurs at most once in the string.
557: If $\alpha$, $\beta$ are strings, and $i_j$ an item,
558: then
559: $\alpha.\beta$ denotes the concatenation of the
560: string $\alpha$ with the string $\beta$.
561:
562: For a string $\alpha$, we shall denote
563: by $\{\alpha\}$, the {\em set} of items occurring in it.
564:
565: Let $\db$ be an $I$-database.
566: Then ${\mit freqstring}(\db)$
567: is the string over
568: $I$, such that each frequent item in $\db$ occurs
569: in it exactly once, and the items are in decreasing
570: order of frequency in $\db$.
571: \hspace*{\fill}${\qed}$
572: }
573: \end{defn}
574:
575:
576:
577:
578: As an example, consider the $\{a,b,c,d\}$-database
579: $\db = \{\{a,b,c\}, \{a,b,c,d\}, \{a,c\}\}$.
580: If the minimum support is 60\%, then
581: ${\mit freqstring}(\db) = acb$.
582: Note that $\{acb\} = \{a,c,b\}$.
583:
584:
585:
586: \begin{defn}
587: {\rm
588: Let $\db$
589: be an $I$-database, and let
590: ${\mit freqstring}(\db)
591: = i_1i_2\cdots i_k$.
592: For $j\in\{1,\ldots,k\}$ we define
593: $\db_{i_j} =
594: \{\tau\cap\{i_1,\ldots,i_j\} : i_j\in\tau,\tau\in\db\}.$
595:
596: Let $\alpha\in I^*$.
597: We define $\db_{\alpha}$ inductively:
598: $\db_{\epsilon} = \db$, and
599: let ${\mit freqstring}(\db_{\alpha})
600: = i_1i_2\cdots i_k$. Then,
601: for $j\in\{1,\ldots,k\}$,
602: $\db_{\alpha.i_j} =
603: \{\tau\cap\{i_1,\ldots,i_j\} : i_j\in\tau,\tau\in\db_{\alpha}\}.$
604: \hspace*{\fill}${\qed}$
605: }
606: \end{defn}
607:
608:
609: Obviously,
610: $\db_{\alpha.i_j}$ is an $\{i_1,\ldots,i_j\}$-database.
611: The decomposition of $\db_{\alpha}$ into
612: $\db_{\alpha.i_1}$, \ldots, $\db_{\alpha.i_k}$
613: is called the {\em naive projection}.
614:
615:
616: \begin{defn}
617: {\rm
618: Let $\alpha\in I^*$, $i_j\in I$, and let
619: $\db_{\alpha.i_j}$ be an $I$-database.
620: Then ${\mit freqsets}(\xi,\db_{\alpha.i_j})$ denotes the subsets
621: of $I$
622: that contain $i_j$ and are frequent in $\db_{\alpha.i_j}$
623: when the minimum support is $\xi$.
Usually, we shall abstract $\xi$ away, and write
just ${\mit freqsets}(\db_{\alpha.i_j})$.
626: \hspace*{\fill}${\qed}$
627: }
628: \end{defn}
629:
630:
631: \begin{lemma}
632:
633: Let $\db_{\alpha}$ be an $I$-database, and
634: ${\mit freqstring}(\db_{\alpha}) = i_1i_2\cdots i_k$.
635: Then
636: $${\mit freqsets}(\db_{\alpha}) =
637: \bigcup_{j\in\{1,\ldots,k\}}{\mit freqsets}(\db_{\alpha.i_j})$$
638:
639: \end{lemma}
640:
641: \noindent
642: {\bf Proof}.
($\subseteq$-{\em direction}).
Let $S\in {\mit freqsets}(\db_{\alpha})$,
and suppose $i_j$ is the item in $S$ that is least frequent in
$\db_{\alpha}$, i.e.\ the item of $S$ that occurs last in
${\mit freqstring}(\db_{\alpha})$, so that $S\subseteq\{i_1,\ldots,i_j\}$.
Since $\db_{\alpha.i_j}$ is an $\{i_1,\ldots,i_j\}$-database,
and transactions in $\db_{\alpha}$ that contain item $i_j$
are all in $\db_{\alpha.i_j}$,
if $S$ is frequent in $\db_{\alpha}$,
then $S$ must be frequent in $\db_{\alpha.i_j}$.

\noindent
($\supseteq$-{\em direction}).
For any frequent itemset
$S \in {\mit freqsets}(\db_{\alpha.i_j})$,
according to the definition,
the
support of any itemset in $\db_{\alpha.i_j}$ is not greater than
its support in $\db_{\alpha}$.
Therefore, $S$ must be frequent in $\db_{\alpha}$.
662: \hspace*{\fill}${\qed}$
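
\noindent
As a concrete check of the lemma, consider again the database
$\db = \{\{a,b,c\}, \{a,b,c,d\}, \{a,c\}\}$ with minimum support 60\%,
for which ${\mit freqstring}(\db) = acb$.
We assume here that the threshold is carried over to the projected databases
as an absolute count, namely at least 2 transactions.
Naive projection yields
\begin{eqnarray*}
\db_{a} & = & \{\{a\},\{a\},\{a\}\}, \\
\db_{c} & = & \{\{a,c\},\{a,c\},\{a,c\}\}, \\
\db_{b} & = & \{\{a,c,b\},\{a,c,b\}\},
\end{eqnarray*}
and therefore
\begin{eqnarray*}
{\mit freqsets}(\db_{a}) & = & \{\{a\}\}, \\
{\mit freqsets}(\db_{c}) & = & \{\{c\},\{a,c\}\}, \\
{\mit freqsets}(\db_{b}) & = & \{\{b\},\{a,b\},\{c,b\},\{a,c,b\}\}.
\end{eqnarray*}
The union consists of seven itemsets, which are exactly the itemsets
with support count at least 2 in $\db$.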
663:
664: \medskip
665:
666:
667:
668: \xf{hansalgo} gives a divide-and-conquer algorithm
669: that uses naive projection.
670: A transaction $\tau$ in $\db_{\alpha}$ will be partly inserted into
671: $\db_{\alpha.i_j}$ if and only if $\tau$ contains $i_j$.
672: The parallel projection algorithm introduced in
673: \cite{HPYM04}
674: is an algorithm of this kind.
675:
676:
677: \begin{figure}[h]
678: {\bf Procedure} {\em naivediskmine}($\db_{\alpha},M$)
679:
680: \smallskip
681:
682: {\bf if} $|\db_{\alpha}|\leq M$ {\bf then}
683: {\bf return} {\em mainmine($\;\db_{\alpha}$)}
684:
685: {\bf else} let ${\mit freqstring}(\db_{\alpha}) = i_1i_2\cdots i_n$
686:
687: {\hskip 18pt} {\bf return} {\em naivediskmine}$(\db_{\alpha.i_1},M)\;\cup$
688:
689: {\hskip 146pt} $\ldots\;\cup$
690:
691: {\hskip 56pt}{\em naivediskmine}$(\db_{\alpha.i_n},M)$.
692:
693: \caption{{\small
694: A simple divide-and-conquer algorithm for
695: mining frequent itemsets from disk
696: }}
697: \label{hansalgo}
698: \end{figure}
699:
700:
701:
702: Let's analyze the disk I/O's of the algorithm
703: in \xf{hansalgo}.
704: As before, we assume that there are two passes,
705: that the data structure is an FP-tree,
706: and that the main memory mining method is
707: {\em FP-growth}.
708: If in $\db_{\epsilon}$, each transaction contains on the average $n$
709: frequent items,
710: each transaction will be written to $n$ projected databases.
Thus the total length of the associated transactions in
the projected databases is
$n+(n-1)+\cdots+1 = n(n+1)/2$,
and the total size of all projected databases is
$(n+1)/2\cdot D\approx n/2\cdot D$.
716:
717: There are two database scans for $\db_{\epsilon}$,
718: one for finding all frequent single items,
719: and one for decomposition.
720: Two scans need $2\cdot D/B$ disk I/O's.
The projected databases have to be written to the disks first,
then later scanned twice each for building an FP-tree.
This step needs at least $3\cdot n/2\cdot D/B$ disk I/O's.
Thus, the total number of disk I/O's for the divide-and-conquer
algorithm with naive projection
is
727: \negvs
728: \begin{eqnarray}
729: 2 \cdot D/B
730: +
731: n \cdot 3/2 \cdot D/B
732: \end{eqnarray}
733:
734: The recurrence structure of algorithm
735: {\em naivediskmine}
736: is shown in \xf{naivetree}.
The reader should ignore
the nodes in
the shaded area
at this point; they
represent processing
in main memory.
743:
744: \begin{figure}[h]
745: \centerline{\psfig{figure=figures/append1,height=1.5in}}
746: \caption{\small Recurrence structure of Naive Projection}
747: \label{naivetree}
748: \end{figure}
749:
750:
751:
752:
In a typical application, $n$, the average number
of frequent items per transaction, could be in the hundreds or thousands.
755: It therefore makes sense to devise a smarter
756: projection strategy.
757: Before we go further, we introduce
758: some definitions and a lemma.
759:
760:
761: \begin{defn}\label{four}
762: {\rm
763: Let $\db_{\alpha}$ be an $I$-database, and let
764: ${\mit freqstring}(\db_{\alpha})
765: = \beta_1.\beta_2. \cdots .\beta_k$,
766: where each $\beta_j$ is a string in $I^*$.
767: We call $\beta_1.\beta_2. \cdots .\beta_k$
768: a {\em grouping} of
769: ${\mit freqstring}(\db_{\alpha})$.
For
$j\in\{1,\ldots,k\}$,
we now define
$\db_{\alpha.\beta_j} =
\{\tau\cap\{\beta_1.\cdots.\beta_j\} : \tau\in\db_{\alpha},
\tau\cap\{\beta_j\}\neq\emptyset
\}.$

In $\db_{\alpha.\beta_j}$,
items in $\{\beta_j\}$ are called {\em master items}, and
items in $\{\beta_1.\cdots.\beta_{j-1}\}$ are called {\em slave items}.
781: \hspace*{\fill}${\qed}$
782: }
783: \end{defn}
784:
785:
786: For example,
787: if ${\mit freqstring}(\db_{\alpha}) = abcde$,
788: $\beta_1 = abc$, $\beta_2 = de$ gives
789: the grouping $abc.de$ of $abcde$.
790:
791:
792:
793:
794:
795:
796:
797:
798:
799: \begin{defn}
800: {\rm
801: Let $\{\alpha,\beta\}\subset I^*$, and let
802: $\db_{\alpha.\beta}$ be an $I$-database.
803: Then $freqsets(\db_{\alpha.\beta})$ denotes the subsets
804: of $I$
805: that contain at least one item in $\{\beta\}$
806: and are frequent in $\db_{\alpha.\beta}$.
807: \hspace*{\fill}${\qed}$
808: }
809: \end{defn}
810:
811: \begin{lemma}\label{goodway}
812: Let $\alpha\in I^*$,
813: $\db_{\alpha}$ be an $I$-database, and
814: ${\mit freqstring}(\db_{\alpha}) = \beta_1\beta_2\cdots \beta_k$.
815: Then
816: $$freqsets(\db_{\alpha}) =
817: \bigcup_{j\in\{1,\ldots,k\}}freqsets(\db_{\alpha.\beta_j})$$
818:
819: \end{lemma}
820:
821: \noindent
822: {\bf Proof.}
823: Straightforward from Lemma 1 and the definition
824: of $\db_{\alpha.\beta}$.
825: \hspace*{\fill}${\qed}$
826:
827: \medskip
828:
829: Based on Lemma \ref{goodway},
830: we can obtain a more aggressive divide-and-conquer algorithm for
831: mining from disks.
832: \xf{ouralgo} shows the algorithm {\em aggressivediskmine}.
833: Here,
834: ${\mit freqstring}(\db_{\alpha})$
835: is decomposed into several substrings $\beta_j$,
836: each of which could have more than one item.
837: Each substring corresponds to a projected database.
838: A~transaction $\tau$ in $\db_{\alpha}$ will be partly inserted into
839: $\db_{\alpha.\beta_j}$ if and only if
840: $\tau$ contains at least one item $a$
841: such that $a\in\{\beta_j\}$.
Since there will be fewer projected databases,
there will be fewer disk I/O's.
Compared with the algorithm in \xf{hansalgo},
we can expect that
a large amount of disk I/O will be saved by the algorithm
in \xf{ouralgo}.
848:
849: \begin{figure}[h]
850: {\bf Procedure} {\em aggressivediskmine}($\db_{\alpha},M$)
851:
852: \smallskip
853:
854: {\bf if} $|\db_{\alpha}|\leq M$ {\bf then}
855: {\bf return} {\em mainmine($\;\db_{\alpha}$)}
856:
857: {\bf else} let ${\mit freqstring}(\db_{\alpha}) =
858: \beta_1\beta_2\cdots \beta_k$
859:
860: {\hskip 18pt}
861: {\bf return} {\em aggressivediskmine}$(\db_{\alpha.\beta_1},M)\;\cup$
862:
863: {\hskip 165pt} $\;\ldots\;\cup$
864:
865: {\hskip 57pt}{\em aggressivediskmine}$(\db_{\alpha.\beta_k},M)$.
866:
867:
868: \caption{{\small
869: A more aggressive divide-and-conquer algorithm for
870: mining frequent itemsets from disk
871: }}
872: \label{ouralgo}
873: \end{figure}
874:
875:
876:
877: Let's analyze the recurrence and disk I/O's of the aggressive
878: divide-and-conquer algorithm.
The number of
passes needed by the algorithm is still
\mbox{$1+\lceil\log_c(M/|T|)\rceil \approx 2$},
since grouping items doesn't change the size of an FP-tree for
a projected database.
However, for disk I/O,
suppose that in $\db_{\epsilon}$
each transaction contains on average $n$
frequent items,
and that we can group them into $k$
groups of equal size.
Then the $n$ items will be written to the projected databases
with total length $n/k+2\cdot n/k+ \cdots +k\cdot n/k = (k+1)/2\cdot n$.
The total size of all projected databases is
$(k+1)/2\cdot D \approx k/2\cdot D$.
The total number of disk I/O's for the aggressive divide-and-conquer
algorithm
is then
897: \negvs
898: \begin{eqnarray}\label {formula}
899: 2\cdot D/B
900: +
901: k \cdot 3/2 \cdot D/B
902: \end{eqnarray}
903:
904: The recurrence structure of algorithm
905: {\em aggressivediskmine} is shown
906: in \xf{recagg}. Compared to \xf{naivetree},
907: we can see that the part of the tree
908: that corresponds to decomposition
909: (the nonshaded part) is much smaller
910: in \xf{recagg}. Although the example is
911: very small, it exhibits the general structure
912: of the two trees.
913:
914: \begin{figure}[h]
915: \centerline{\psfig{figure=figures/append2,height=1.5in}}
916: \caption{\small Recurrence structure of Aggressive Projection}
917: \label{recagg}
918: \end{figure}
919:
920:
921:
922:
If $k\ll n$,
we can expect that the aggressive
divide-and-conquer algorithm will
significantly outperform the naive one.
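
To get a feeling for the size of the gap, take the purely hypothetical values
$n = 100$ and $k = 5$ (chosen here for illustration, not taken from any experiment).
Under the approximations used in the two I/O estimates, naive projection needs about
$$2\cdot D/B + 100\cdot 3/2\cdot D/B = 152\cdot D/B$$
disk I/O's, whereas aggressive projection needs about
$$2\cdot D/B + 5\cdot 3/2\cdot D/B = 9.5\cdot D/B,$$
a reduction by a factor of $152/9.5 = 16$.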
927:
928: \section {Algorithm Diskmine}
929: In this section
930: we give
931: the details of
932: our divide-and-conquer algorithm for mining frequent itemsets
933: from secondary memory.
934: We call the algorithm {\em Diskmine}.
In the algorithm,
the FP-tree is used as the data structure, and
an extension of the {\em FP-growth} method,
{\em FP-growth*} \cite{fimi03},
as the method for mining frequent itemsets from an FP-tree.
Before introducing the algorithm,
let's first recall the FP-tree and the {\em FP-growth*} method.
942:
943: \subsection{The FP-tree and {\em FP-growth*} method}
944:
945: The {\em FP-tree (Frequent Pattern tree)}
946: is a data structure used in
947: the {\em FP-growth} method by Han {\em et al.\ } \cite{HPY00}.
948: It is a compact representation
949: of all relevant
950: frequency information
951: in a database.
Each node of the FP-tree stores an item name, an item count,
and a link.
954: Every branch of the FP-tree represents a frequent itemset,
955: and the nodes along the branches are
956: stored in decreasing order of the frequency
957: of the corresponding items, with leaves representing
958: the least frequent items.
959: Compression is achieved by
960: building the tree in such a way that
961: overlapping itemsets
962: share prefixes of the
963: corresponding branches.
964:
965:
966: The FP-tree has
967: a {\em header table} associated with it.
968: Single items and their counts are stored in
969: the header table in
970: decreasing order of their frequency.
971: The entry for an item also contains the head
972: of a list that links all the
973: nodes of the item
974: in the FP-tree.
975:
976:
977: The FP-growth method needs two database scans
978: when mining all frequent itemsets.
979: The first scan counts the number of occurrences
980: of each item.
981: The second scan constructs the initial FP-tree,
982: which contains all frequency information of the original dataset.
983: Mining the database then becomes mining the FP-tree.
984:
985:
986: The {\em FP-growth} method relies on the following
987: principle: if $X$ and $Y$ are two itemsets,
988: the count of itemset $X\cup Y$ in the database
989: is exactly that of $Y$ in the restriction of the database to
990: those transactions containing $X$.
991: This restriction of the database is
992: called
993: the {\em conditional pattern base} of $X$,
994: and the FP-tree constructed from the conditional pattern base
995: is called $X$'s {\em conditional FP-tree},
996: which we denote by $T_X$.
997: We can view the FP-tree constructed from the initial database
998: as $T_{\emptyset}$,
999: the conditional FP-tree for $\emptyset$.
1000: Note that for
1001: any itemset $Y$ that is frequent
1002: in the conditional pattern base of $X$,
1003: the set
1004: $X\cup Y$ is a frequent itemset for the original database.\footnote{In
1005: keeping with the notation introduced so far, we shall
1006: in the sequel write $T_{\alpha}$ when we mean the
1007: FP-tree $T_{\{\alpha\}}$. Similarly we shall write
1008: $T_{\alpha.i}$ instead of $T_{\{\alpha\}\cup\{i\}}$.}
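
The principle can be checked on a small example.
The following Python sketch is only an illustration:
the toy database and the helper names \verb|count| and
\verb|conditional_pattern_base| are hypothetical and not part of
{\em FP-growth}, and the restriction is computed by a plain filter
over the transactions rather than from an FP-tree.

\begin{verbatim}
def count(database, itemset):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in database if itemset <= t)

def conditional_pattern_base(database, x):
    """Restriction of the database to transactions containing x."""
    return [t for t in database if x <= t]

# Hypothetical toy database; each transaction is a set of items.
db = [frozenset("abc"), frozenset("abd"), frozenset("acd"),
      frozenset("bc"), frozenset("abcd")]

x, y = frozenset("a"), frozenset("bc")
restricted = conditional_pattern_base(db, x)

# The count of X u Y in the database equals the count of Y
# in the restriction of the database to X.
assert count(db, x | y) == count(restricted, y)
\end{verbatim}

In this toy database both sides of the comparison are equal to 2.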
1009:
The recursive structure of {\em FP-growth} can be seen from
the shaded area in \xf{naivetree}.
In the figure, we enter the main memory phase,
for instance, for the conditional database $\db_a$.
1014: Then FP-growth first constructs the
1015: FP-tree $T_a$ from $\db_a$.
1016: The tree rooted at $T_a$
1017: shows the recursive structure of FP-growth,
1018: assuming for simplicity that the
1019: relative frequency remains the same in
1020: all conditional pattern bases.
1021:
1022:
1023:
1024:
1025:
In \cite{fimi03}, we extended the {\em FP-growth} method into the
{\em FP-growth*} method by using an {\em array technique}
and other optimizations.
The experimental results in that paper
and those obtained by the FIMI organizers show
that the {\em FP-growth*} method outperforms the {\em FP-growth} method,
especially when the database is big or sparse
\cite{fimi03,ZB03}.
1034:
1035:
1036: \subsubsection* {The array technique} \label{arraytech}
1037:
In the original {\em FP-growth} method \cite{HPY00},
two database scans are required
to construct an FP-tree from a database $\db$.
The first scan gets all frequent items,
and the second constructs the FP-tree.
Later,
for each item $i$ in
the header of a conditional FP-tree $T_{\alpha}$,
two traversals of $T_{\alpha}$ are needed for constructing
the new conditional FP-tree $T_{\alpha.i}$.
1048: The first traversal finds all frequent items in the
1049: conditional pattern base of $\alpha.i$,
1050: and initializes the FP-tree $T_{\alpha.i}$
1051: by constructing its header table.
1052: The second traversal constructs the new tree
1053: $T_{\alpha.i}$.
1054:
1055:
In the boosted {\em FP-growth*} method \cite{fimi03},
a simple data structure, an array,
is introduced to omit the first traversal of $T_{\alpha}$.
This is achieved
by constructing an array
$A_{\alpha}$ while building $T_{\alpha}$.
More precisely,
in the second scan of the original database we
construct $T_{\epsilon}$ and an array $A_{\epsilon}$.
The array stores the counts of
all 2-itemsets of frequent items: each cell $[j,k]$
in the array is a counter of the 2-itemset $\{i_j,i_k\}$.
All cells in the array are initialized to 0.
When an itemset is inserted into $T_{\epsilon}$,
the associated cells in $A_{\epsilon}$ are updated.
After the second scan,
the array $A_{\epsilon}$ contains the counts of
all pairs of frequent items in $\db_{\epsilon}$.
1074:
1075:
1076:
1077: Next, the {\em FP-growth*} method is recursively called
to mine frequent itemsets for each item in the header table
1079: of $T_{\epsilon}$.
1080: However, now for each item $i$,
1081: instead of traversing $T_{\epsilon}$ along
1082: the linked list starting at $i$ to get
1083: all frequent items in $i$'s conditional pattern base,
1084: $A_{\epsilon}[i,*]$ gives all frequent items for $i$.
1085: Therefore, for each item $i$ in $T_{\epsilon}$
1086: the array $A_{\epsilon}$ makes
1087: the first traversal of $T_{\epsilon}$ unnecessary,
1088: and $T_{\epsilon.i}$ can be
1089: initialized directly from $A_{\epsilon}$.
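
As an illustration of the array technique, the following Python sketch
fills the pair counts in one pass over a toy database and then reads
the frequent items for an item off its row.
It is a simplified sketch under stated assumptions, not the implementation
of \cite{fimi03}: the array is stored as a dictionary, the names
\verb|build_pair_counts| and \verb|frequent_partners| are hypothetical,
transactions are scanned directly instead of being inserted into an FP-tree,
and a pair is treated as frequent when its count is at least the
minimum support.

\begin{verbatim}
from itertools import combinations

def build_pair_counts(database, frequent_items):
    """Count all pairs of frequent items in one pass.

    frequent_items is given in decreasing order of frequency,
    as in the header table. The result maps (j, k), j > k,
    to the count of the 2-itemset {i_j, i_k}.
    """
    rank = {item: r for r, item in enumerate(frequent_items)}
    counts = {}
    for transaction in database:
        ranks = sorted(rank[i] for i in set(transaction)
                       if i in rank)
        for k, j in combinations(ranks, 2):  # always k < j
            counts[(j, k)] = counts.get((j, k), 0) + 1
    return counts

def frequent_partners(counts, frequent_items, item, min_support):
    """Frequent items in the conditional pattern base of item,
    read off the row of the array without any tree traversal."""
    j = frequent_items.index(item)
    return [frequent_items[k] for k in range(j)
            if counts.get((j, k), 0) >= min_support]

# Hypothetical toy data; item counts are a:4, b:4, c:4, d:3.
db = ["abc", "abd", "acd", "bc", "abcd"]
items = ["a", "b", "c", "d"]
A = build_pair_counts(db, items)
print(frequent_partners(A, items, "d", 3))  # ['a']
\end{verbatim}

Only the row of an item is consulted, which mirrors how
$A_{\epsilon}[i,*]$ replaces the first traversal of $T_{\epsilon}$.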
1090:
1091:
For the same reason, when we construct a new conditional
FP-tree $T_{\alpha.i}$ for an item $i$
from a conditional FP-tree $T_{\alpha}$,
a new array $A_{\alpha.i}$ is calculated.
The array $A_{\alpha.i}$ is filled
during the construction of the
new FP-tree $T_{\alpha.i}$.
The construction of arrays and FP-trees continues
until the {\em FP-growth*} method terminates.

Note that if, for a database,
we have the array that stores the counts of all pairs of
frequent items,
then only one database scan is needed
to construct an FP-tree from the database.
1108:
1109: \subsection{Divide-and-conquer by aggressive projection}
1110:
1111:
1112:
1113: The algorithm {\em Diskmine} is shown in \xf{appa}. In the algorithm,
1114: $\db_{\alpha}$ is the original database or a projected database,
1115: and $M$ is the maximal size of main memory that can be used by {\em Diskmine}.
1116:
1117: \begin{figure}[h]
1118: {\bf Procedure} {\em Diskmine}$(\db_{\alpha}, M)$
1119:
1120: \smallskip
1121:
1122: scan $\db_{\alpha}$ and compute {\it freqstring}$(\db_{\alpha})$
1123:
1124: call ${\mit trialmainmine(\db_{\alpha}, M)}$
1125:
1126: {\bf if} ${\mit trialmainmine(\db_{\alpha}, M)}$ aborted {\bf then}
1127:
1128: {\hskip 12pt}compute a grouping $\beta_1\beta_2\cdots \beta_k$
1129: of ${\mit freqstring}(\db_{\alpha})$.
1130:
{\hskip 12pt}decompose $\db_{\alpha}$ into
$\db_{\alpha.\beta_1},\ldots, \db_{\alpha.\beta_k}$

{\hskip 12pt}{\bf for} $j = 1$ {\bf to} $k$ {\bf do begin}
1135:
1136: {\hskip 24pt}{\bf if} $\{\beta_j\}$ is a singleton {\bf then}
1137:
1138: {\hskip 36pt}${\mit Diskmine}(\db_{\alpha.\beta_j},M)$
1139:
1140: {\hskip 24pt}{\bf else}
1141:
1142: {\hskip 36pt}${\mit mainmine}(\db_{\alpha.\beta_j})$
1143:
1144: {\hskip 12pt}{\bf end}
1145:
1146: {\bf else return} {\em freqsets}$(\db_{\alpha})$
1147: \caption{{\small Algorithm Diskmine}}
1148: \label{appa}
1149: \end{figure}
1150:
1151:
{\em Diskmine} uses the FP-tree as its
data structure and {\em FP-growth*} \cite{fimi03}
as its main memory
mining
algorithm.
1157: Since the FP-tree encodes all frequency information
1158: of the database,
1159: we can shift into main memory mining
1160: as soon as the FP-tree fits
1161: into main memory.
1162:
1163: Since an FP-tree usually is a significant
1164: compression of the database, our {\em Diskmine}
1165: algorithm begins optimistically, by calling {\em trialmainmine},
1166: which starts scanning the database and constructing the FP-tree.
1167: If the tree can be successfully completed and stored in main memory,
1168: we have reached the bottom level of the recursion,
1169: and can obtain
1170: the frequent itemsets of the database
1171: by running
1172: {\em FP-growth*} on the FP-tree in main memory.
1173:
1174: \begin{figure}[h]
1175: {\bf Procedure} {\em trialmainmine}$(\db_{\alpha}, M)$
1176:
1177: start scanning $\db_{\alpha}$ and building the FP-tree
1178:
1179: {\hskip 12 pt}$T_{\alpha}$ in main memory.
1180:
1181: {\bf if} $|T_{\alpha}|$ exceeds $M$ {\bf then}
1182:
1183: {\hskip 12pt}{\bf return} the incomplete $T_{\alpha}$
1184:
1185: {\bf else}
1186:
1187: {\hskip 12pt}call {\em FP-growth*}$\,(T_{\alpha})$ and {\bf return}
1188: {\em freqsets}$(\db_{\alpha})$.
1189:
1190: \caption{{\small Trial main memory mining algorithm}}
1191: \label{trial}
1192: \end{figure}
1193:
1194:
1195:
If, at any time during {\em trialmainmine},
we run out of main memory, we abort and
return the partially constructed FP-tree
together with a pointer to where we stopped scanning the database.
1200: We then resume processing {\em Diskmine}$(\db_{\alpha},M)$
1201: by computing a grouping
1202: $\beta_1,\ldots, \beta_k$ of
1203: {\em freqstring}$(\db_{\alpha})$,
1204: and then decomposing
1205: $\db_{\alpha}$ into
1206: $\db_{\alpha.\beta_1},\ldots,\db_{\alpha.\beta_k}$.
1207: We recursively process
1208: each decomposed database
1209: $\db_{\alpha.\beta_j}$.
1210: During the first level of the recursion,
1211: some groups $\beta_j$ will consist of a single
1212: item only.
1213: If $\{\beta_j\}$ is a singleton,
1214: we call {\em Diskmine}, otherwise
1215: we call {\em mainmine} directly,
1216: since we put several items in a group
1217: only when we estimate that the corresponding
1218: FP-tree will fit into main memory.
1219:
In computing the grouping
$\beta_1,\ldots, \beta_k$
we assume that transactions in a very large database
are evenly distributed, i.e.,
if an FP-tree is constructed from part of a database,
then this FP-tree is representative of the FP-tree for the whole database.
In other words,
if the size of the FP-tree is $n$ for $p\%$ of the database,
then the size of the FP-tree for the whole database is estimated as $n/p \cdot 100$.
Most of the time, this is an overestimate,
since an FP-tree grows fast only in the beginning stage,
when items are encountered for the first time and inserted
into the tree. In the later stages, the changes to the FP-tree
will mostly be counter updates.
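
For instance (with hypothetical numbers), if the partial FP-tree holds
$n = 2\cdot 10^{5}$ nodes after $p = 25$ percent of the database has been
scanned, the estimate for the whole database is
\[
\frac{n}{p}\cdot 100 = \frac{2\cdot 10^{5}}{25}\cdot 100 = 8\cdot 10^{5}
\]
nodes.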
1234:
1235:
1236: \begin{figure}[h]
1237: {\bf Procedure} {\em mainmine}$(\db_{\alpha.\beta})$
1238:
1239: build a modified FP-tree $T_{\alpha.\beta}$ for $\db_{\alpha.\beta}$
1240:
1241: {\bf for each} $i\in\{\beta\}$ {\bf do begin}
1242:
1243: {\hskip 12pt} construct the FP-tree $T_{\alpha.i}$
1244: for $\db_{\alpha.i}$ from $T_{\alpha.\beta}$
1245:
1246: {\hskip 12pt} call {\em FP-growth*}$\,(T_{\alpha.i})$
1247: and {\bf return}
1248: {\em freqsets}$(\db_{\alpha.i})$.
1249:
1250: {\bf end}
1251:
1252: \caption{{\small Main memory mining algorithm}}
1253: \label{mainmine}
1254: \end{figure}
1255:
1256:
1257:
1258: Since we know that there is only one master item in the database
1259: (for $\db_\epsilon$, no master item at all),
1260: an FP-tree is constructed without the master item.
1261: In \xf{mainmine},
1262: since $\db_{\alpha.\beta}$ is for multiple master items,
1263: the
1264: FP-tree constructed from $\db_{\alpha.\beta}$ has to contain
1265: those master items.
1266: However, the item order is a problem for the FP-tree,
1267: because we only want to mine all frequent itemsets
1268: that contain master items.
1269: To solve this problem,
1270: we simply use the item order in the partial FP-tree
1271: returned by the aborted
1272: {\em trialmainmine}$(\db_{\alpha})$.
1273: This is what we mean by a ``modified FP-tree''
1274: on the first line in the algorithm in \xf{mainmine}.
1275:
1276: The entire recurrence structure of
1277: {\em Diskmine} can be seen in \xf{recagg}.
1278: Compared to the naive projection in \xf{naivetree}
1279: we see that since the aggressive projection
uses main memory more effectively,
1281: the decomposition phase is shorter,
1282: resulting in less I/O.
1283:
1284:
1285:
1286: \begin{theorem}
1287: Diskmine$(\db)$ returns freqsets$(\db)$.
1288: \end{theorem}
1289:
1290: \noindent
1291: {\bf Proof}.
1292: The correctness of {\em Diskmine}
1293: can be derived from the correctness of the
1294: {\em FP-growth*} method in \cite{fimi03}
1295: and Lemma \ref{goodway} in \xs{diskmine}.
In {\em Diskmine},
each item acts as master item in exactly one projected database.
If a projected database is for only one master item $i_j$,
the result of the {\em FP-growth*} method or of a recursive call of {\em Diskmine}
will be {\em freqsets}$(\db_{i_j})$.
If a projected database is for a set $\{\beta\}$ of master items,
it contains all frequency information associated with the master items.
Since the order of the items in an FP-tree does not influence
the correctness of the {\em FP-growth*} method,
{\em mainmine} indeed returns only frequent itemsets that
contain master item(s),
i.e.\ {\em mainmine} gives the
exact value of {\em freqsets}$(\db_{\alpha.\beta})$.
According to Lemma \ref{goodway},
algorithm {\em Diskmine} then
correctly outputs all
frequent itemsets in the original database.
1314: \hspace*{\fill}${\qed}$
1315:
1316:
1317:
1318: \subsection {Memory Management}\label{memory}
1319:
Given a database $\db_{\alpha}$,
the basic main memory requirement for successfully applying
the {\em FP-growth*} method
is that the size of the FP-tree
$T_{\alpha}$
constructed from $\db_{\alpha}$
is less than the available amount $M$ of main memory.
1326: In addition, we need space
1327: for the descendant conditional
1328: FP-trees that will be constructed during the recursive calls
1329: of {\em FP-growth*}.
1330:
1331: Suppose the main memory requirement
1332: for $T_{\alpha}$ plus its descendant FP-trees is $m$.
1333: If $M < m$, but the difference $m-M$ is not very big,
1334: the {\em FP-growth*} method
1335: could still be run because the operating
1336: system uses virtual memory.
However, there could then be so much page swapping
that {\em FP-growth*} becomes very slow.
1339: Therefore, given $M$, for a very large database $\db_{\alpha}$,
1340: we have to stop the construction of the FP-tree $T_{\alpha}$
1341: and the execution of {\em FP-growth*} method before
1342: all physical main memory is used up.
1343:
1344:
1345: Another problem is that we will
1346: construct a large number of FP-trees.
1347: Since there can be
1348: millions of nodes in those FP-trees,
1349: inserting and deleting nodes is time consuming.
1350:
1351: In the implementation of the algorithm,
1352: we use our own main memory management for
1353: allocating and deallocating nodes,
1354: and calculating the main memory we have already used.
We assume that the main memory needed by an FP-tree is
proportional to the number of nodes in the FP-tree.
We also assume that the workspace needed for calling
{\em FP-growth*}$(T)$ on an FP-tree $T$ is roughly 10\%
of the size of $T$.
Here, 10\% is a liberal assumption according to the
experimental results in \cite{HPY00}.
Later in this section, a more accurate value will be given.
If the size of the FP-tree is more than $0.9\cdot M$,
we conclude that $M$ is not big enough to store the whole
FP-tree $T_{\alpha}$.
1366:
1367:
1368: Since all memory for nodes in an FP-tree is deallocated after a call
1369: of {\em FP-growth*} ends,
1370: a chunk of memory is allocated for each FP-tree when we create the tree,
1371: and the chunk size is changeable.
1372: After generating all frequent
1373: itemsets from the FP-tree, the chunk is discarded,
1374: and all nodes in the tree are deleted.
1375: Thus we successfully avoid freeing nodes in FP-trees one by one,
1376: which would take too much time.
1377:
1378:
1379: \subsection{Applying the Array Technique}\label{array}
1380:
1381:
1382: In {\em Diskmine},
the array technique is also applied to save FP-tree traversals.
1384: Furthermore, when projected databases are generated,
1385: the array technique can save a great number of disk~I/O's.
1386:
Recall that in {\em trialmainmine},
if an FP-tree cannot be accommodated in main memory,
the construction stops.
Suppose that we also stopped
scanning the database at this point.
1392: Then later, after generating all projected databases,
1393: for a projected database with only one master item,
1394: two database scans are required to construct an FP-tree for the master item.
1395: The first scan gets all frequent items for the master item,
1396: the second scan constructs the FP-tree.
1397: For a projected database with several master items,
1398: though the FP-tree constructed from the database
1399: uses the modified item order
1400: (the order from the header of the FP-tree in
1401: the previous level of the recursion),
1402: to construct new FP-trees for the master items,
1403: two FP-tree traversals are needed.
1404: To avoid the extra scan,
1405: in {\em Diskmine} we calculate an array for each FP-tree.
1406: When constructing the FP-tree from $\db_{\alpha}$,
if it is found that the tree cannot fit in main memory,
1408: the construction of the FP-tree $T_{\alpha}$ stops,
1409: but the scan of the database $\db_{\alpha}$
1410: continues so that we finish filling the cells of
1411: the array $A_{\alpha}$.
1412: Here, some extra disk I/O's are spent,
1413: but the payback will be that we
1414: save one database scan for each
1415: projected database.
1416: Furthermore, finishing the scanning
1417: of $\db_{\alpha}$
1418: doesn't require any more main memory,
1419: since the array $A_{\alpha}$
1420: is already there.
1421:
1422: From the array, for each projected database,
1423: the count of each pair of master items and
1424: the count of each pair of master item and slave item
1425: can be known.
As an example,
suppose a projected database is for only one
master item $i_j$
and slave items $i_1, \ldots, i_{j-1}$.
To mine all frequent itemsets,
the accurate counts for
$[i_j, i_{j-1}],
[i_j, i_{j-2}],
\ldots,
[i_j, i_1]$
can be read directly from the row for $i_j$ in the array.
Without the array,
we would need an extra database scan.
1440:
1441:
With the array, we can also make a projected database
drastically smaller.
In the definition of $\db_{\alpha.\beta_j}$,
we see that
$\db_{\alpha.\beta_j}$ is a $\{\beta_1,\ldots,\beta_j\}$-database.
However, if checking the array $A_{\alpha}$ shows that
a slave item does not co-occur frequently
with any master item in $\beta_j$,
it is useless to include the slave item in $\db_{\alpha.\beta_j}$,
because no frequent itemset mined from $\db_{\alpha.\beta_j}$
will contain that slave item.
For the same reason,
if we find that a master item $a$ is not frequent with any
other master item or slave item,
it will not be written to $\db_{\alpha.\beta_j}$
either.
However, the frequent itemset $\alpha.a$ is output.
Furthermore,
if we see from the array that a master item $a$ is
frequent with only one item (master or slave) $b$,
the frequent itemsets $\alpha.a$ and $\alpha.a.b$
are output directly,
and item $a$ will not appear in $\db_{\alpha.\beta_j}$.
Therefore, by looking through the array,
we find all slave items
that are not frequent with any master item in $\beta_j$,
and all master items for which the number of items in
$\{\beta_1,\ldots,\beta_j\}$ they are frequent with is 0 or 1.
When generating $\db_{\alpha.\beta_j}$,
all those items are removed from the
transactions we put in $\db_{\alpha.\beta_j}$.
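
The pruning rules can be sketched in a few lines of Python.
This is a minimal sketch that follows the rules as stated in the text,
not the actual implementation:
the function name \verb|prune_with_array|, the representation of the array
as a dictionary keyed by item pairs, and the toy counts are all
assumptions made for illustration, and a pair is treated as frequent
when its count is at least the minimum support.

\begin{verbatim}
def prune_with_array(pair_counts, masters, slaves, min_support):
    """Apply the pruning rules to one group of master items.

    pair_counts maps frozenset({x, y}) to the count of the pair.
    Returns (kept, direct): the items still written to the
    projected database, and the itemsets (to be prefixed with
    alpha) that can be output without further mining.
    """
    def frequent(x, y):
        key = frozenset((x, y))
        return pair_counts.get(key, 0) >= min_support

    kept, direct = set(), set()
    for a in masters:
        partners = [b for b in masters + slaves
                    if b != a and frequent(a, b)]
        if len(partners) <= 1:
            direct.add(frozenset([a]))         # alpha.a
            if partners:                       # alpha.a.b
                direct.add(frozenset([a, partners[0]]))
        else:
            kept.add(a)
    for s in slaves:
        if any(frequent(s, a) for a in masters):
            kept.add(s)
    return kept, direct

# Hypothetical counts: masters c, d and slaves a, b, e.
pairs = {frozenset("ac"): 3, frozenset("bc"): 3,
         frozenset("ad"): 3, frozenset("bd"): 2,
         frozenset("cd"): 2, frozenset("ce"): 1}
kept, direct = prune_with_array(pairs, ["c", "d"],
                                ["a", "b", "e"], 3)
\end{verbatim}

In this example the master item $d$ is frequent only with $a$, so the
itemsets for $d$ and $\{a,d\}$ are reported directly, and the slave item
$e$ is dropped because it is not frequent with any master item.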
1473:
1474:
1475: \subsection{Statistics}
1476:
1477: \begin{table*}[ht!]
1478: \centering
1479: \begin{tabular}%{0.75\textwidth}
1480: {|r|l|} \hline
1481: $\tdb$&Number of transactions in $\db_{\alpha}$\\
1482: \hline
1483: $A_{\alpha}[j,k]$&Count of frequent item pair $\{i_j, i_k\}$
1484: in $\db_{\alpha}$\\
1485: \hline
1486: $\tT$&Number of transactions used for constructing $T_{\alpha}$\\
1487: \hline
1488: $\nT$&Number of nodes in $T_{\alpha}$\\
1489: \hline
1490: $\njT$&Number of nodes in $T_{\alpha}$ if we retain
1491: only nodes for items $i_1, \ldots, i_j$\\
1492: \hline
$\mjT$&Number of nodes in
$T_{\alpha}$,
1495: where a node $P$ for item $i_k$ is counted if\\
1496: &it satisfies the following conditions: 1) $P$ is in a branch that contains $i_j$\\
1497: &2) $i_k \in \{i_1, \ldots, i_j\}$ 3) $A_{\alpha}[j,k] > \xi$\\
1498: \hline
1499:
1500:
1501: \end{tabular}
1502: \caption{Statistics Information}
1503: \label{stat}
1504: \end{table*}
1505:
Algorithm
{\em Diskmine} collects some statistics on the
partial FP-tree $T_{\alpha}$
and the rest of the database $\db_{\alpha}$,
for the purpose of
grouping items together.
\xt{stat} lists these statistics.
1513: In the table,
1514: $\db_{\alpha}$ is the original database or the current projected database,
1515: and {\em freqstring}($\db_{\alpha}$)=
1516: $i_1\ldots i_j\ldots i_k \ldots i_n$.
1517: The partial FP-tree is $T_\alpha$
1518: and $\xi$ is the
1519: absolute value of the minimum support.
1520:
1521: In the table,
1522: the array discussed in \xs{array}
1523: is also listed as statistics.
1524: Values for the cells of
1525: the array are accumulated during the construction of
1526: the partial $T_{\alpha}$.
1527: If {\em trialmainmine} is aborted, the rest
1528: of the statistics
1529: is collected by scanning the
1530: remaining part of $\db_{\alpha}$.
1531: Values in
1532: $\njT$
1533: can also be obtained
1534: during the construction of $T_{\alpha}$.
1535: Here
1536: $\njT$
1537: records the size of the FP-tree after
1538: $T_{\alpha}$ is trimmed and only contains items $i_1, \ldots, i_j$.
1539: Notice that
1540: $\nT$
1541: is equal to
1542: $\nu[n](T_{\alpha})$.
1543: This is also the size of a tree that can fit in main memory.
The value for
$\mjT$
can be obtained
by traversing $T_{\alpha}$ once;
it gives the size of the FP-tree $T_{\alpha.i_j}$.

It might seem that
collecting all these statistics
is a large overhead.
However,
since all the work is done in main memory,
it does not take much time,
and the time saved on disk I/O's
is far more than the time spent on gathering statistics.
1558:
1559:
1560: \subsection{Grouping items}
1561:
1562: In \xf{appa},
1563: the fourth line computes a grouping $\beta_1\beta_2\cdots \beta_k$
1564: of ${\mit freqstring}(\db_{\alpha})$.
1565: Each string $\beta$
1566: corresponds to a group and each $\beta$ consists of at least one item.
1567: For each $\beta$,
1568: a new projected database $\db_{\alpha.\beta}$
1569: will be computed from $\db_{\alpha}$,
1570: then written to disk and read from disk later.
1571: Therefore,
1572: the more groups,
1573: the more disk I/O's.
1574: In other words,
1575: there should be as many items in each
1576: $\beta$ as possible.
1577: To group items,
1578: two questions have to be answered.
1579: \begin{enumerate}
1580: \item If $\beta$ currently only has one item $i_j$,
1581: after projection, is the main memory big enough for
1582: accommodating $T_{\alpha.i_j}$ constructed from
1583: $\db_{\alpha.i_j}$
1584: and running the {\em FP-growth*} method on $T_{\alpha.i_j}$?
1585: \item If more items are put in $\beta$,
1586: after projection, is the main memory big enough for
1587: accommodating $T_{\alpha.\beta}$ constructed from $\db_{\alpha.\beta}$
1588: and running {\em FP-growth*} on $T_{\alpha.\beta}$ only
1589: for items in $\beta$?
1590: \end{enumerate}
1591:
1592: Answering the first question is pretty easy,
1593: since for each item $i_j$,
1594: the number
1595: $\mjT$
1596: gives the size of an FP-tree if the tree
1597: is constructed from the partial FP-tree $T_{\alpha}$.
1598: Therefore
1599: $\mjT$
1600: can be used to estimate the
1601: size of FP-tree $T_{\alpha.i_j}$.
1602: By the assumption that
1603: the transactions in $\db_{\alpha}$ are evenly
1604: distributed and that
1605: the partial $T_{\alpha}$
1606: represents
1607: the whole FP-tree for $\db_{\alpha}$,
1608: the estimated size of FP-tree $T_{\alpha.i_j}$
1609: is
1610: $\mjT\cdot \tdb/\tT$.
1611:
1612:
1613: Before answering the second question,
1614: we introduce the {\em cut point}
1615: from which the first group can be easily found.
1616:
1617: \medskip
1618:
1619: \noindent
1620: {\bf Finding the cut point.}
1621: Recall the order that {\em FP-growth*} uses in mining frequent itemsets.
1622: Starting from the least frequent item $i_n$,
all frequent itemsets that contain $i_n$ are mined first.
1624: Then the process is repeated for
1625: $i_{n-1}$, and so on.
1626: Notice that when mining frequent itemsets for $i_k$,
1627: all frequency information about $i_{k+1},\ldots,i_n$ is useless.
1628: Thus, though a complete FP-tree $T_\alpha$ constructed from $\db_\alpha$
1629: could not fit in main memory,
1630: we can find many $k$'s such that the
1631: trimmed FP-tree containing only
1632: nodes for items $i_k, \ldots, i_1$
1633: will fit into main memory.
1634: All frequent itemsets for $i_k, \ldots, i_1$
can then be mined from one trimmed tree.
1636: We call the biggest of such $k$'s the {\em cut point}.
1637: At this point, main memory is big enough
1638: for storing the FP-tree
1639: containing only $i_k, \ldots, i_1$,
1640: and there is also enough main memory for running
1641: {\em FP-growth*} on the tree.
1642: Obviously, if the cut point $k$ can be found,
1643: items $i_k, \ldots, i_1$ can be grouped together.
1644: Only one projected database is needed for $i_k, \ldots, i_1$.
1645:
There are two ways to estimate the cut point.
One way is to get the cut point from the values of
$\tdb$
and
$\tT$
in \xt{stat}.
\xf{divi} illustrates the intuition behind the cut point.
In the figure,
since the partial FP-tree for
$\tT$
of the
$\tdb$
transactions can be
accommodated in main memory,
we can expect that the FP-tree containing $i_k, \ldots, i_1$,
where
$k=\lfloor n \cdot \tT /\tdb \rfloor$,
will also fit in main memory.
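
For example, with the hypothetical values $n = 100$ frequent items,
$\tT = 30{,}000$ and $\tdb = 120{,}000$, this estimate gives
\[
k = \left\lfloor 100 \cdot \frac{30{,}000}{120{,}000} \right\rfloor = 25 ,
\]
so that the items $i_{25}, \ldots, i_1$ would form the first group.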
1664:
1665: \begin{figure}[h]
1666: \centerline{\psfig{figure=figures/division,height=1.25in}}
1667: \caption{Cut Point. Here
1668: $l=\tT$, and $m=\tdb$}
1669: \label{divi}
1670: \end{figure}
1671:
1672: The above method
1673: works well
1674: for many databases,
1675: especially for those databases whose corresponding
1676: FP-trees have plenty of sharing of prefixes for items
1677: from $i_1$ to the cut point.
1678: However,
1679: if the FP-tree constructed from a database
1680: doesn't share prefixes that much,
1681: the estimation could fail,
1682: since now the FP-tree
1683: for items from $i_1$ to the cut point
1684: could be too big.
1685: Thus,
1686: we have to consider another method.
1687: In \xt{stat},
1688: $\njT$
1689: records the size of the FP-tree after
1690: the partial FP-tree $T_\alpha$ is trimmed and only
1691: contains items $i_1, \ldots, i_j$.
Based on
$\njT$,
the number of nodes
in the complete FP-tree
restricted to items $i_1, \ldots, i_j$
can be estimated as
$\njT \cdot \tdb/\tT$.
1699: Now, finding the cut point becomes finding the biggest $k$ such that
1700: $\nu[k](T_{\alpha}) \cdot \tdb/\tT \leq \nT$,
1701: and
1702: $\nu[k+1](T_{\alpha}) \cdot
1703: \tdb/\tT > \nT$.
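
The second estimate can be sketched in Python as follows.
The function name \verb|find_cut_point|, the list layout with a padding
entry at index 0, and the numbers in the example are assumptions made
for illustration; the comparison is done in integer arithmetic, which
is equivalent to scaling by $\tdb/\tT$.

\begin{verbatim}
def find_cut_point(nu, t_db, t_tree):
    """Largest k with nu[k] * t_db / t_tree <= nu[n].

    nu[j] is the number of nodes of the partial FP-tree when
    only items i_1..i_j are retained; nu[0] = 0 is padding and
    nu[-1] = nu[n] is the size of the whole partial tree.
    Returns 0 if no item qualifies.
    """
    budget = nu[-1]
    k = 0
    for j in range(1, len(nu)):
        # Same test as nu[j] * t_db / t_tree <= budget.
        if nu[j] * t_db <= budget * t_tree:
            k = j
        else:
            break  # nu[j] does not decrease with j
    return k

# Hypothetical statistics: a quarter of the database was read.
nu = [0, 10, 40, 120, 300, 700, 1000]
cut = find_cut_point(nu, t_db=100000, t_tree=25000)
\end{verbatim}

With these numbers the scaled sizes are 40, 160, 480 and 1200 for
$j = 1,\ldots,4$, so the cut point is $k = 3$.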
1704:
1705:
1706: Sometimes the above estimation only guarantees
1707: that the main memory is big enough for
1708: the FP-tree which contains all items between $i_1$ and the cut point,
1709: while it doesn't guarantee
1710: that the descendant trees from that FP-tree can fit in main memory.
1711: This is because the estimation doesn't consider the
1712: size of descendant trees correctly
1713: (in \xs{memory}, we assumed that the size of a conditional tree is 10\%
1714: of its nearest ancestor tree).
1715: Actually, from
1716: $\mjT$
1717: we can get a more accurate estimation of the size of the
1718: biggest descendant tree.
To find the cut point,
we need to find the biggest $k$
such that
$(\nu[k](T_{\alpha}) +
\mjT)\cdot
\tdb/\tT \leq \nT$, and
$(\nu[k+1](T_{\alpha}) +
\mu[m](T_{\alpha}))\cdot
\tdb/\tT > \nT$,
where
$j\leq k$,
$\mjT = {\mit max}_{l\in\{1,\ldots,k\}}\mu[l](T_{\alpha})$,
and
$m\leq k+1$,
$\mu[m](T_{\alpha}) = {\mit max}_{l\in\{1,\ldots,k+1\}}\mu[l](T_{\alpha})$.
1734:
1735: \medskip
1736:
1737: \noindent
1738: {\bf Grouping the rest of the items.}
Now we answer the second question: how do we put more items into a group?
Here we still need
$\mjT$.
Starting with
\mbox{$\mu[{\mit cutpoint}+1](T_{\alpha})$},
we test if
$\mu[{\mit cutpoint}+1](T_{\alpha})\cdot
\tdb/\tT > \nT$.
If not, we put the next item, $i_{{\mit cutpoint}+2}$,
into the group,
and test if
\mbox{$(\mu[{\mit cutpoint}+1](T_{\alpha}) +
\mu[{\mit cutpoint}+2](T_{\alpha})
)$}
$\cdot \tdb/\tT > \nT$.
We repeatedly put the next item in
${\mit freqstring}(\db)$ into the group
until we reach an item $i_j$
such that
$$\displaystyle\sum_{m={\mit cutpoint}+1}^{j}
\mu[m](T_{\alpha})\cdot
\tdb/\tT > \nT.$$
Then, starting from $i_j$, we put items into the next group,
until every item has been assigned to a group.
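
The greedy procedure can be sketched in Python as follows.
The function name \verb|group_remaining|, the list layout with a padding
entry at index 0, and the numbers in the example are assumptions made
for illustration; the sketch covers only the simple sum-based test
described above, not the refinement based on the size of $T_{\alpha.\beta}$.

\begin{verbatim}
def group_remaining(mu, cut_point, t_db, t_tree, budget):
    """Greedily group the items after the cut point.

    mu[m] is the statistic mu[m](T_alpha); mu[0] is padding.
    budget plays the role of nu(T_alpha). An item that makes
    the scaled sum exceed the budget starts the next group.
    Returns a list of groups of item indices.
    """
    groups, current, total = [], [], 0
    for m in range(cut_point + 1, len(mu)):
        # Same test as (total + mu[m]) * t_db / t_tree > budget.
        over = (total + mu[m]) * t_db > budget * t_tree
        if current and over:
            groups.append(current)
            current, total = [], 0
        current.append(m)
        total += mu[m]
    if current:
        groups.append(current)
    return groups

# Hypothetical statistics with cut point 2.
mu = [0, 5, 8, 100, 150, 200, 300, 600, 50, 40]
groups = group_remaining(mu, 2, t_db=100000, t_tree=25000,
                         budget=2000)
\end{verbatim}

Here the result is the grouping $\{3,4,5\}, \{6\}, \{7\}, \{8,9\}$
of item indices.
Item 7 exceeds the budget on its own and therefore ends up as a
singleton group, which {\em Diskmine} would process recursively.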
1763:
1764: Why can we group items together?
1765: This is because
1766: even if we construct
1767: $T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$
1768: from the projected databases
1769: $\db_{\alpha.\beta_{i_j}}, \ldots, \db_{\alpha.\beta_{i_k}}$
1770: and put all of them into main memory,
1771: the main memory is big enough according to the grouping condition.
1772: At this stage, $T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$
1773: all can be constructed by scanning $\db_\alpha$ once.
1774: Then we mine frequent itemsets from the FP-trees.
1775: However, we can do better.
1776: Obviously $T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$ overlap a lot,
1777: and the total size of the trees is
1778: definitely greater than the size of $T_{\alpha.\beta}$.
It also means that we can put more items into
each $\beta$,
as long as the size of $T_{\alpha.\beta}$
is estimated to fit in main memory.
1783: To estimate the size of $T_{\alpha.\beta}$, part of
1784: $T_{\alpha}$
1785: has to be traversed by following the links for
1786: the master items in $T_{\alpha}$.
1787:
1788:
1789:
1790: \subsection {Database projection}
1791: After all items have found their groups,
1792: the original database will be projected to small databases according to
1793: Definition \ref{four}.
1794: To save disk I/O's, three techniques can be used:
1795: \begin {enumerate}
1796: \item
In a group $\beta$, if the number of master items is greater than
half of the number of frequent items
(this often happens in the group that contains the cut point),
then $\db_{\alpha.\beta}$ need not be
computed.
To mine all frequent itemsets,
$T_{\alpha.\beta}$ can be directly constructed from $\db_{\alpha}$
by reading it once.
This is because $\db_{\alpha.\beta}$
is not much smaller than $\db_{\alpha}$,
while the disk I/O's for reading $\db_{\alpha}$ once
are fewer than the disk I/O's for writing and reading
$\db_{\alpha.\beta}$ once.
1810:
1811: \item
Since
the partial tree $T_{\alpha}$,
now in main memory,
records all frequency
information of the transactions that have
been read so far,
the frequency information of those transactions
can be obtained from $T_{\alpha}$
when computing the projected databases.
Thus
disk I/O's are spent only on reading the transactions
that did not contribute to $T_{\alpha}$.
1824:
1825: \item
As discussed in \xs{array},
by using the array technique
in group $\beta_j$, we find all slave items
that are not frequent together with any master item in $\beta_j$,
and all master items whose number of frequent
items in $\{\beta_1,\ldots,\beta_j\}$ is 0 or 1.
When computing $\db_{\alpha.\beta_j}$,
all those items are removed from the transactions written to $\db_{\alpha.\beta_j}$.
1834: \end{enumerate}
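
The trade-off behind the first technique can be made explicit with a rough calculation.
As notation assumed only for this sketch,
write $|\db_{\alpha}|$ and $|\db_{\alpha.\beta}|$ for the sizes of the two databases
and $B$ for the block size,
and count only the disk I/O's named in the first item.
Constructing $T_{\alpha.\beta}$ directly costs one read of $\db_{\alpha}$,
while materializing $\db_{\alpha.\beta}$ costs one write and one read of it:
$$\frac{|\db_{\alpha}|}{B} \;<\; 2\cdot\frac{|\db_{\alpha.\beta}|}{B}
\quad\Longleftrightarrow\quad
|\db_{\alpha.\beta}| \;>\; \frac{1}{2}\,|\db_{\alpha}|.$$
Under this simplified accounting,
direct construction is cheaper whenever the projected database
would be more than half the size of $\db_{\alpha}$.
For instance, for a hypothetical $\db_{\alpha}$ of 400 megabytes
whose projection would occupy 300 megabytes,
the choice is between reading 400 megabytes
and writing plus reading 600 megabytes.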
1835:
1836:
1837:
1838: \subsection {The disk I/O's}
Let us recount the disk I/O's used in {\em Diskmine}.
The first scan finds all frequent items in $\db_{\epsilon}$,
which needs $D/B$ disk I/O's.
In the second scan we construct a partial FP-tree $T_{\epsilon}$,
then continue scanning the rest of the database for statistics,
which needs another $D/B$ disk I/O's.
1845: Suppose then that $k$ projected databases have to be computed.
1846: According to \xs{diskmine},
1847: the total size of the projected databases is
1848: approximately $k/2 \cdot D$.
1849: For computing the projected databases,
1850: the frequency information in $T_{\epsilon}$ is reused,
1851: so only part of $\db_{\epsilon}$ is read.
1852: We assume on average half of $\db_{\epsilon}$ is read at this stage,
1853: which means $1/2\cdot D/B$ disk I/O's.
1854: Writing and later reading $k$ projected databases
1855: will take $2\cdot k/2\cdot D/B = k\cdot D/B$ disk I/O's.
1856: Suppose all frequent itemsets can be mined from the projected databases
1857: without going to the third level.
Then the total number of disk I/O's is
1859: \negvs
1860: \begin{eqnarray}
1861: 3/2 \cdot D/B
1862: +
1863: k\cdot D/B
1864: \end{eqnarray}
1865:
1866: Compared with formula \ref{formula},
1867: {\em Diskmine} saves at least
1868: $k/2 \cdot D/B$
1869: disk I/O's,
1870: thanks to the various techniques used in the algorithm.
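
To give a feel for the magnitudes of the projection terms,
consider a hypothetical configuration
(the numbers are illustrative assumptions, not measurements):
$D = 2^{30}$ bytes (1 gigabyte), $B = 2^{12}$ bytes (4 kilobytes), and $k = 7$.
Then
$$\frac{D}{B} = 2^{18} = 262{,}144, \qquad
2\cdot\frac{k}{2}\cdot\frac{D}{B} = k\cdot\frac{D}{B} = 1{,}835{,}008, \qquad
\frac{k}{2}\cdot\frac{D}{B} = 917{,}504.$$
In this setting the projected databases total about 3.5 gigabytes,
and writing them and reading them back costs about 1.8 million block I/O's.
The lower bound on the saving, $k/2\cdot D/B$,
equals the cost of one full pass over the projected databases.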
1871:
1872: \section {Experimental Evaluation and Performance Study}
1873:
1874: In this section, we present the results from
1875: a performance comparison of
1876: {\em Diskmine} with the {\em Parallel Projection Algorithm} in
1877: \cite{HPYM04} and the {\em Partitioning Algorithm} introduced
1878: in \cite{SON95}.
The scalability of {\em Diskmine} is also analyzed,
and the accuracy of our memory size
estimations is validated.
1882:
1883:
1884: As mentioned in \xs{diskmine},
1885: the Parallel Projection Algorithm is a naive divide-and-conquer
1886: algorithm,
1887: since for each item a projected database is created.
For performance comparison,
we implemented the Parallel Projection Algorithm,
using {\em FP-growth} as the main-memory method,
as introduced in \cite{HPYM04}.
1892: The
1893: Partitioning Algorithm is also a divide-and-conquer algorithm.
1894: We implemented
1895: the partitioning algorithm by using the Apriori implementation
1896: \cite{gap}.
1897: We chose this implementation, since
1898: it was well written and easy to adapt
1899: for our purposes.
1900:
1901:
1902: We ran the three algorithms on
1903: both synthetic datasets and real datasets.
Some synthetic datasets have millions of transactions,
and the size of the datasets ranges from several megabytes to
several hundred gigabytes.
For brevity,
only the results for some synthetic datasets and one real dataset
are shown here.
1910:
1911:
1912:
All experiments were performed on a 2.0 GHz Pentium 4 with
1914: 256 MB of memory under Windows XP.
1915: For {\em Diskmine} and the Parallel Projection Algorithm,
1916: the size of the main memory is given as an input.
For the Partitioning Algorithm,
since it makes only two database scans, and each main-memory-sized partition
and all data structures for Apriori
are kept in main memory,
the size of main memory is not controlled,
and only the running time is recorded.
1923:
1924:
We first compared the performance of the three algorithms on a synthetic dataset.
1926: Dataset {\em T100I20D100K} was generated from the
1927: application of \cite{syns}.
1928: The dataset has 100,000 transactions and 1000 items,
1929: and occupies about 40 megabytes of memory.
1930: The average transaction length is 100,
1931: and the average pattern length is 20.
The dataset is very sparse, and the FP-tree constructed from the dataset
is bushy.
1934: For Apriori,
1935: a large number of candidate frequent itemsets
1936: will be generated from the dataset.
1937: When running the algorithms, the main memory size
1938: was given as 128 megabytes.
1939: \xf{SynReal}(a) shows the experimental result.
1940: In the figure, ``Naive Algorithm''
1941: represents the Parallel Projection Algorithm,
1942: and
1943: ``Aggressive Algorithm'' represents the {\em Diskmine} algorithm.
1944:
1945:
1946: \begin{figure}[h]
1947: \begin{minipage}[t]{2in}
1948: \centerline{\psfig{figure=figures/synthetic,height=1.7in}}
1949: \center{\small (a)}
1950: \end{minipage}
1951: \hfill
1952: \begin{minipage}[t]{2in}
1953: \centerline{\psfig{figure=figures/total,height=1.7in}}
1954: \center{\small (b)}
1955: \end{minipage}
1956: \hfill
1957: \begin{minipage}[t]{2in}
1958: \centerline{\psfig{figure=figures/realdata,height=1.7in}}
1959: \center{\small (c)}
1960: \end{minipage}
1961: \caption{{\small Experiments on Synthetic Data and Real Data}}
1962: \label{SynReal}
1963: \end{figure}
1964:
From \xf{SynReal} (a),
we can see that the Partitioning Algorithm is the slowest in the group.
The Naive Algorithm, however, is not slower than the Aggressive Algorithm
if we only compare their CPU time.
In \cite{fimi03},
where we were concerned with main memory mining,
we found that if a dataset is sparse the
boosted {\em FPgrowth*} method has a much better performance than
the original {\em FPgrowth}.
The reason that the CPU time of the Aggressive Algorithm is not always
less than that of the Naive Algorithm here is
that the Aggressive Algorithm
has to spend CPU time on calculating statistics.
1978: On the other hand, as expected,
1979: we can see in the figure that
1980: the disk I/O time of the Aggressive Algorithm is
1981: orders of magnitude smaller than that of the Naive Algorithm.
In \xf{SynReal} (b) we compare the total running times.
We can see that the CPU overhead of the Aggressive
Algorithm now becomes insignificant compared to
the savings in disk I/O.
1986:
1987:
1988:
1989:
1990: We then ran the algorithms on a real dataset {\em Kosarak},
1991: which is used as a test dataset in \cite{ZB03}.
1992: The dataset is about 40 megabytes.
Since it is a dense dataset and its FP-tree is quite small,
we set the main memory size to 16 megabytes for the experiments.
1995: Results are shown in \xf{SynReal} (c).
1996:
In \xf{SynReal} (c),
1998: the Partitioning Algorithm is still the slowest.
1999: This is because it generates too many candidate frequent itemsets.
Together with the data structures,
these candidate sets exhaust main memory, so
virtual memory had to be used.
We can again notice that the CPU time of the Naive Algorithm
is less than that of the Aggressive Algorithm.
This is because {\em Kosarak} is a dense dataset, so
the array technique does not help much.
In addition, calculating the
statistics takes considerable time.
2009: The disk I/O's for the Aggressive Algorithm are still
2010: remarkably fewer than the disk I/O's for the Naive Algorithm.
2011:
2012:
To test the effectiveness of the techniques for grouping items,
we ran {\em Diskmine} on
{\em T100I20D100K} to see how
close
the estimated FP-tree size for each group is to its real size.
We again set the main memory size to 128 megabytes
and the minimum support to 2\%.
2020: When generating the projected databases,
2021: items were grouped into 7 groups
2022: (the total number of frequent items
2023: is 826).
As we can see from \xf{Effect} (a),
in all groups,
the estimated size is always slightly
larger than the real size.
2028: Compared with the Naive Algorithm,
2029: which constructs an FP-tree for each item from its projected database,
2030: the Aggressive Algorithm almost fully
2031: uses the main memory for each group to
2032: construct an FP-tree.
2033:
2034: \begin{figure}[ht!]
2035: \begin{minipage}[t]{1.5in}
2036: \centerline{\psfig{figure=figures/versus,height=1.25in}}
2037: \center{\small (a)}
2038: \end{minipage}
2039: \hfill
2040: \begin{minipage}[t]{1.5in}
2041: \centerline{\psfig{figure=figures/scalability,height=1.25in}}
2042: \center{\small (b)}
2043: \end{minipage}
2044: \caption{{\small Estimation Effect and Scalability of {\em Diskmine}}}
2045: \label{Effect}
2046: \end{figure}
Since {\em Diskmine} is a divide-and-conquer algorithm,
one of its most important
properties is good scalability.
We ran {\em Diskmine} on a set of synthetic datasets.
In all datasets,
the number of items was set to 10,000,
the average transaction length to 100,
and the average pattern length to 20.
The number of transactions in the datasets
varied from 200,000 to 2,000,000.
Dataset sizes range from 100 megabytes to 1 gigabyte.
Minimum support was set to 1.5\%,
and the available main memory was 128 megabytes.
\xf{Effect} (b) shows the results.
In the figure, the CPU time and the disk I/O time
always stay within a small range of acceptable values.
2063: Even for the datasets with 2 million transactions,
2064: the total running time is less than 1000 seconds.
2065: Extrapolating from these figures using formula (4),
2066: we can conclude that a dataset the size of the
2067: Library of Congress collection (25 Terabytes)
2068: could be mined in around 18 hours with current technology.
2069:
2070:
2071: \section{Conclusions}
2072:
We have introduced several divide-and-conquer algorithms
for mining frequent itemsets from secondary memory.
We have analyzed the
recurrences and disk I/O's of all algorithms.

We then gave a detailed divide-and-conquer
algorithm
which almost fully uses the limited main memory
and saves a large number of disk I/O's.
We introduced many novel techniques
used in our algorithm.
2084:
Our
experimental results show
that our algorithm
successfully reduces the number of disk accesses,
sometimes by orders of magnitude,
and that our algorithm scales up to
terabytes of data.
The experiments also validate that
the estimation techniques used in
our algorithm are accurate.
2095:
2096:
2097:
2098: For future work,
2099: we notice that
there are very few efficient algorithms
2101: for mining
2102: {\em maximal} frequent itemsets and {\em closed}
2103: frequent itemsets \cite{PBT99, PHM00,WHP03,Zaki02}
2104: from very large databases.
2105: Unlike in {\em Diskmine},
2106: where the frequent itemsets mined from all projected databases
2107: are globally frequent,
2108: a maximal frequent itemset or a
2109: closed frequent itemset mined from a projected database
2110: is only locally maximal or closed.
The challenge is that
a data structure, whose size may also be very large,
must be maintained to record all already discovered
maximal or closed frequent itemsets.
2115: We also notice that
2116: our implementation of the partitioning algorithm is
2117: based on an existing Apriori implementation,
which is not necessarily highly optimized.
2119: As we know,
2120: there are situations
2121: when there are not
2122: too many candidate itemsets in a database,
2123: but the FP-tree constructed from the database is pretty big.
2124: In this situation the
2125: Partitioning Algorithm only needs two database scans
and all frequent itemsets can be mined in main memory,
2127: or with very little I/O for keeping the
2128: candidate sets in virtual memory.
2129: In this situation
2130: {\em Diskmine} also needs two database scans,
2131: and it additionally
2132: needs to
2133: decompose the database.
Therefore, exploring whether some clever disk-based data structure
would make the partitioning approach scale
is another interesting direction for further research.
2137:
2138:
2139: \begin{thebibliography}{icdm}
2140:
2141:
2142:
2143: \bibitem{syns}
2144: \newblock {\tt www.almaden.ibm.com/software/quest}
2145:
2146:
2147: %\bibitem{ZB03a}
2148: %\newblock {\tt fimi.cs.helsinki.fi}
2149:
2150: \bibitem{gap}
2151: \newblock{\tt www.cs.helsinki.fi/u/goethals/software}
2152:
2153:
2154: \bibitem{AAP00}
2155: R.\ C.\ Agarwal, C.\ C.\ Aggarwal and V. V. V. Prasad,
2156: \newblock Depth first generation of long patterns,
2157: \newblock In {\em KDDM '00}, pp.\ 108-118
2158:
2159: \bibitem{AIS93}
2160: R.~Agrawal, T.~Imielinski, and A.~Swami.
2161: \newblock Mining association rules between sets of items in large databases.
2162: \newblock In {\em SIGMOD '93},
2163: pp.\ 207--216, 1993.
2164:
2165: \bibitem{AS94}
2166: R.~Agrawal and R.~Srikant.
2167: \newblock Fast algorithms for mining association rules.
2168: \newblock In {\em VLDB '94}, pp.\ 487--499
2169:
2170: \bibitem{AS95}
2171: R.~Agrawal and R.~Srikant.
2172: \newblock Mining sequential patterns.
2173: \newblock In {\em ICDE '95}, pp.\ 3--14
2174:
2175: %\bibitem{BMS97}
2176: %S.~Brin, R.~Motwani, and C.~Silverstein.
2177: %\newblock Beyond market basket: Generalizing association rules to correlations.
2178: %\newblock In {\em Proceeding of Special Interest Group on Management of
2179: %Data}, pages 265--276, Tucson, Arizona, May 1997.
2180:
2181: \bibitem{fimi03}
2182: G.~Grahne, J.~Zhu.
2183: \newblock Efficiently Using Prefix-trees in Mining Frequent Itemsets.
2184: \newblock In
2185: \cite{ZaBa03}
2186: % {\em 1st Workshop on Frequent Itemset Mining Implementations (FIMI'03)}
2187: %\newblock Melbourne, FL, Nov. 2003.
2188:
2189: \bibitem{HPY00}
2190: J.~Han, J.~Pei, and Y.~Yin.
2191: \newblock Mining frequent patterns without candidate generation.
2192: \newblock In {\em SIGMOD '00}, pp.\ 1--12
2193:
2194: \bibitem{HPYM04}
2195: J.~Han, J.~Pei, Y.~Yin and R.~Mao.
2196: \newblock Mining frequent patterns without candidate generation: A Frequent-Pattern Tree Approach.
2197: \newblock In {\em Data Mining and Knowledge Discovery}, Vol. 8, pages 53-87, 2004.
2198:
2199: \bibitem{KHC97}
2200: M.~Kamber, J.~Han and J.~Chiang.
2201: \newblock Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes.
2202: \newblock In {\em KDDM '97}, pp.\ 207--210
2203:
2204:
2205: %\bibitem{LLN00}
2206: %V.S. Lakshmanan, C. Leung, and R. Ng.
2207: %\newblock The Segment Support Map: Scalable Mining of Frequent Itemsets.
2208: %\newblock In {\em SIGKDD Explorations Special Issue on Scalable Data Mining},
2209: %\newblock Volume 2, Issue 2, pages 21-27. December 2000.
2210:
2211: %\bibitem{MT97}
2212: %H. Mannila and H. Toivonen.
2213: %\newblock Levelwise search and borders of theories in knowledge discovery.
2214: %\newblock In {\em Data Mining and Knowledge Discovery},
2215: %\newblock Vol. 1, 3(1997), pages 241-258.
2216:
2217: \bibitem{MTV94}
2218: H. Mannila, H. Toivonen, and I. Verkamo.
2219: \newblock Efficient algorithms for discovering association rules.
2220: \newblock In {\em KDDM '94},
2221: pp.\ 181--192.
2222:
2223: \bibitem{MTV97}
2224: H. Mannila, H. Toivonen, and I. Verkamo.
2225: \newblock Discovery of Frequent Episodes in Event Sequences.
2226: \newblock In {\em Data Mining and Knowledge Discovery}.
2227: \newblock Volume 1, 3(1997), pages 259--289.
2228:
2229: \bibitem{PBT99}
2230: N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal.
2231: \newblock Discovering frequent closed itemsets for association rules.
2232: \newblock In {\em ICDT'99}, Jan. 1999.
2233:
2234: \bibitem{PHM00}
2235: J. Pei, J. Han and R. Mao,
2236: \newblock CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets.
2237: \newblock In {\em {ACM} {SIGMOD} Workshop on Research Issues in Data Mining and Knowledge Discovery}, pages 21-30, 2000.
2238:
2239:
2240: %\bibitem{STA98}
2241: %S.~Sarawagi, S.~Thomas, and R.~Agrawal.
2242: %\newblock Integrating association rule mining with relational database systems:
2243: % Alternatives and implications.
2244: %\newblock In {\em Proceeding of Special Interest Group on Management of Data}, pages 343--354, 1998.
2245:
2246: \bibitem{SON95}
2247: A.~Savasere, E.~Omiecinski, and S.~Navathe.
2248: \newblock An efficient algorithm for mining association rules in large
2249: databases.
2250: \newblock In {\em VLDB '95}, pp. 432--443
2251:
2252: \bibitem{Toiv96}
2253: H.~Toivonen.
2254: \newblock Sampling large databases for association rules.
2255: \newblock In {\em VLDB '96}, pp.\ 134--145
2256:
2257: \bibitem{WHP03}
2258: J. Wang, J. Han, and J. Pei.
2259: \newblock CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets.
2260: \newblock In {\em Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03)}, Washington, D.C., Aug. 2003.
2261:
2262:
2263: \bibitem{ZB03}
B.~Goethals and M.~J.~Zaki (Eds.).
\newblock {\em Proceedings
of the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations
(FIMI '03)}.
\newblock CEUR Workshop Proceedings, Vol.\ 90,
\verb+http://CEUR-WS.org/Vol-90+
2270:
2271:
2272:
2273: \bibitem{ZaBa03}
2274: Bart Goethals and Mohammed J. Zaki.
2275: \newblock Advances in Frequent Itemset Mining Implementations: Introduction to FIMI03.
2276: \newblock In {\em 1st Workshop on Frequent Itemset Mining Implementations (FIMI'03)}
2277: \newblock Melbourne, FL, Nov. 2003.
2278:
2279:
2280: %\bibitem{Zaki00}
2281: %M. J. Zaki.
2282: %\newblock Scalable algorithms for association mining.
2283: %\newblock In {\em IEEE Transactions on Knowledge and Data Mining},
2284: %\newblock 12(3):372-390, May-June 2000.
2285:
2286: \bibitem{Zaki02}
2287: M. J.~Zaki and C.~Hsiao.
2288: \newblock CHARM: An Efficient Algorithm for Closed Itemset Mining.
2289: \newblock In {\em Proceeding of The 2nd SIAM International Conference on Data Mining},
2290: \newblock Arlington, April 2002.
2291:
2292: \bibitem{Zaki03}
2293: M. J. Zaki and Karam Gouda.
2294: \newblock Fast Vertical Mining Using Diffsets.
2295: \newblock In {\em KDDM '03},
2296: pp.\ 326--335
2297:
2298:
2299:
2300:
2301:
2302: \end{thebibliography}
2303:
2304: \end{document}
2305:
2306:
2307: \newpage
2308:
2309:
2310: \begin{centering}
2311:
2312: \begin{figure}
2313: \begin{minipage}[t]{6.5in}
2314: \centerline{\psfig{figure=figures/append1,height=3.25in}}
2315: \caption{Recurrence structure of Naive Projection Algorithm}
2316: \end{minipage}
2317: \end{figure}
2318:
2319:
2320: \begin{figure}
2321: \begin{minipage}[t]{6.5in}
2322: \centerline{\psfig{figure=figures/append2,height=3.25in}}
2323: \caption{Recurrence structure of Aggressive Projection Algorithm}
2324: \end{minipage}
2325: \end{figure}
2326:
2327: \end{centering}
2328:
2329:
2330:
2331:
2332:
2333:
2334: