1: \documentclass[11pt]{article}
2: \usepackage{epsfig}
3: \usepackage{amsmath}
4: \usepackage{amssymb}
5: \usepackage{geometry}
6: \usepackage{url}
7: 
8: \newcommand{\xf}[1]{Figure~\ref{#1}}
9: \newcommand{\xt}[1]{Table~\ref{#1}}
10: \newcommand{\xp}[1]{page~\pageref{#1}}
11: \newcommand{\xs}[1]{Section~\ref{#1}}
12: \newcommand{\xa}[1]{Appendix~\ref{#1}}
13: \newtheorem{theorem}{Theorem}
14: \newtheorem{lemma}{Lemma}
15: \newtheorem{prop}{Proposition}
16: \newtheorem{defn}{Definition}
17: 
18: 
19: \def\db{\mbox {$\cal D$}}
20: \def\mm{\mbox {$\cal M$}}
21: \def\nn{\mbox {$t$}}
22: \def\tdb{\mbox{$t(\db_{\alpha})$}}
23: \def\tT{\mbox{$t(T_{\alpha})$}}
25: \def\nT{\mbox{$\nu(T_{\alpha})$}}
26: \def\njT{\mbox{$\nu[j](T_{\alpha})$}}
27: \def\mjT{\mbox{$\mu[j](T_{\alpha})$}}
28: 
29: \newcommand{\vs}{\vspace{1ex}}		    % Small vertical space
30: \newcommand{\negvs}{\vspace*{-1ex}}		% Small negative vertical space
31: 
32: 
33: 
34: 
35: \newlength{\qedlengte}
36: \settowidth{\qedlengte}{$\Box$}
37: \addtolength{\qedlengte}{-0.25\qedlengte}
38: \newcommand{\qedbox}{\rule{\qedlengte}{\qedlengte}}
39: \newcommand{\qed}{\hspace*{1em}\hfill\qedbox}
40: 
41: 
42: 
43: \newenvironment{prog}{\def\-{\hskip 1em}\penalty-1000\vskip\parskip
44: \parskip0pt\leftskip2em\obeylines\tt}{\par}
45: 
46: \title{Mining Frequent Itemsets from Secondary Memory}
47: 
48: \author{G\"{o}sta Grahne and Jianfei Zhu\\
49: Concordia University\\
50: Montreal, Canada\\
51: \{grahne, j\_zhu\}@cs.concordia.ca\\
52: }
53: \date{March 6, 2004}
54: \begin{document}
55: \maketitle
56: 
57: \begin{abstract}
58: Mining frequent itemsets is at the core
59: of mining association rules, and is by now quite well
60: understood algorithmically.
61: However, most algorithms for mining frequent
62: itemsets assume that the main memory is large enough
63: for the data structures used in the mining,
64: and very few efficient algorithms deal with the case when
65: the database is {\em very} large or the minimum support is very low.
Mining frequent itemsets from a very large database
poses new challenges,
as astronomical amounts of raw data
are ubiquitously being recorded in commerce, science and government.
70: 
71: In this paper, we discuss approaches to mining frequent itemsets when
72: data structures are too large to fit in main memory.
73: Several
74: divide-and-conquer
75: algorithms
76: are given for mining from disks.
Several novel techniques are introduced, including an array technique,
item grouping, and memory management techniques.
78: Experimental results show that
79: the techniques reduce
80: the required disk accesses by orders of magnitude,
81: and enable truly scalable data mining.
82: \end{abstract}
83: 
84: \section{Introduction}
85: 
86: 
87: 
88: 
89: Mining frequent itemsets is a fundamental problem 
90: for mining association rules \cite{AIS93, AS94,MTV94,PBT99, PHM00,WHP03,Zaki02, ZB03}.
91: It also plays an important role in many other data mining tasks
92: such as sequential patterns, episodes, multi-dimensional patterns and so on
93: \cite{AS95, MTV97, KHC97}.
94: In addition, frequent itemsets are one of the key abstractions
95: in data mining.
96: 
97: 
The problem can be described as follows.
Let $I = \{i_1,i_2,\ldots,i_j,\ldots,i_n\}$
be a set of {\em items}.
101: Items will sometimes also be denoted 
102: $a,b,c,\ldots$.
103: An $I$-{\em transaction} $\tau$ is a subset of $I$.
104: An $I$-transactional {\em database} $\db$ is a finite bag
105: of $I$-transactions.
106: The {\em support} of an itemset $S\subseteq I$
107: is the proportion of transactions in \db~ that contain $S$.
The task of mining frequent itemsets is to find
all $S$ such that the support of $S$ is at least some
given {\em minimum support} $\xi$,
where $\xi$ is either a fraction in $[0,1]$
or an absolute count.
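
As a small illustration of these definitions, consider the following
hypothetical database (our own example, not one of the data sets used in this paper;
the notation $\mathrm{supp}$ is introduced here only for the illustration).
Let $I=\{a,b,c,d\}$ and
$\db = \{\{a,b,c\}, \{a,b\}, \{a,c\}, \{b,d\}, \{a,b,d\}\}$.
Counting the transactions that contain each itemset gives
\begin{align*}
\mathrm{supp}(\{a\}) &= 4/5, & \mathrm{supp}(\{b\}) &= 4/5, & \mathrm{supp}(\{a,b\}) &= 3/5,\\
\mathrm{supp}(\{c\}) &= 2/5, & \mathrm{supp}(\{d\}) &= 2/5, & \mathrm{supp}(\{a,c\}) &= \mathrm{supp}(\{b,d\}) = 2/5,
\end{align*}
and every other nonempty itemset has support at most $1/5$.
With $\xi = 50\%$ the frequent itemsets are therefore
$\{a\}$, $\{b\}$, and $\{a,b\}$.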
113: 
Most algorithms, such as
Apriori \cite{AS94},
DepthProject \cite{AAP00},
and dEclat \cite{Zaki03},
work well when the main memory is big enough to
hold the whole database and/or the data structures
(candidate sets, FP-trees, etc.).
121: When a database is very large or when the minimum support is very low,
122: either the data structures used by the algorithms may not be accommodated in 
123: main memory, 
124: or the algorithms spend too much time on 
125: multiple passes over the database.
126: In the 
127: {\em First IEEE ICDM Workshop on Frequent Itemset 
128: Mining Implementations, FIMI~'03} \cite{ZB03},
129: many well known algorithms were implemented
130: and independently tested.
131: The results show that ``{\em none} of the algorithms is able to gracefully
132: scale-up to very large datasets, 
133: with millions of transactions''
134: \cite{ZaBa03}.
135: 
136: 
137: 
At the same time,
very large databases do exist in real life.
In a medium-sized business, or in a company as big as Walmart,
it is very easy to collect a few gigabytes of data.
Terabytes of raw data
are ubiquitously being recorded in commerce, science and government.
144: The question of how to handle these databases is still one of the most 
145: difficult problems in data mining.
146: 
147: 
148: A few researchers have
149: tried to mine frequent itemsets from very large databases.
150: One approach is by {\em sampling}.
151: For instance, \cite{Toiv96}
152: picks a random sample of the database,
153: finds all frequent itemsets from the sample, and then verifies 
154: the results with the rest of the database. 
155: This approach needs only one pass of the database.
156: However, the results are probabilistic, 
157: meaning that
158: some critical frequent itemsets could be missing.
159: 
160: 
161: 
162: {\em Partitioning} \cite{SON95} 
163: is another approach for mining very large databases.
164: This approach first partitions the database 
165: into many small databases,
166: and mines candidate frequent itemsets from each small database.
167: One more pass 
168: over the original database 
169: is then done to verify the candidate frequent itemsets.
170: The approach thus needs only two database scans. 
However, when the data structures used for storing
candidate frequent itemsets
are too big to fit in main memory,
a significant number of disk I/O's are needed
for the disk-resident data structures.
176: 
In \cite{HPY00, HPYM04}, Han {\em et al.}\ introduce the {\em FP-growth}
178: method, which 
179: uses two database scans for constructing an FP-tree
180: from the database, 
181: and then mines all frequent itemsets from the FP-tree.
182: Two approaches are suggested for the case that 
183: the FP-tree is too large to fit into main memory.
184: 
185: The first approach writes the FP-tree to disk,
186: then mines all frequent sets by reading the
187: frequency information from the FP-tree.
However, the size of the FP-tree could be the same as the
size of the database, and for each item in the FP-tree
we need at least one FP-tree traversal.
191: Thus the I/O's for writing and reading the 
192: disk-resident FP-tree could be
193: prohibitive.
194: 
195: The second approach
196: {\em projects} the original database
197: on each frequent item, then mines frequent itemsets from 
198: the small projected databases. 
199: One advantage of this approach is that any frequent itemset 
200: mined from a projected database is a frequent itemset in the original database.
201: To get {\em all} frequent itemsets, 
202: we only need to 
203: take the union of the frequent itemsets from the small projected databases.
204: This is in contrast to the
205: partitioning approach, 
where all candidate frequent itemsets have to be stored and later verified
by another pass over the database.
208: The biggest problem of the projection approach is that
209: the total size of the projected databases could be too large, 
210: and there will be too many disk I/O's for the 
211: projected databases.
212: 
213: \subsubsection*{Contributions} 
214: In this paper we consider the problem of  mining frequent itemsets 
215: from {\em very} large databases. 
216: We adopt a 
217: divide-and-conquer approach.
First we give three algorithms:
the general divide-and-conquer algorithm,
an algorithm using
naive projection, and an algorithm using
aggressive projection.
223: We also analyze the 
224: number of steps and disk I/O's required by these algorithms. 
225: 
226: In a detailed divide-and-conquer algorithm,
227: called {\em Diskmine}, 
228: we use the highly efficient 
229: {\em FP-growth*} method \cite{fimi03} to
230: mine frequent itemsets from an FP-tree in main memory.
231: We describe several novel techniques
232: useful in mining frequent itemsets from disks, 
233: such as the array technique,
234: the item-grouping technique, 
235: and memory management techniques.
236: 
Finally, we present experimental results that
demonstrate that our {\em Diskmine} algorithm
outperforms previous algorithms
by orders of magnitude,
and scales up to terabytes of data.
242: 
243: 
244: \subsubsection*{Overview} 
245: The remainder of this paper is organized as follows. 
246: In Section 2
247: we introduce approaches for mining frequent itemsets from disks.
248: Three algorithms are introduced and analyzed. 
249: Section 3 gives a detailed divide-and-conquer 
250: algorithm {\em Diskmine}, 
251: in which many novel optimization techniques are used. 
252: These techniques are also described in Section 3.
253: Experimental results are given in Section 4.
254: Section 5 concludes, 
255: and outlines directions for future research.
256: 
257: 
258: \section{Mining from disk} \label{diskmine}
259: 
How should one go about mining
frequent itemsets from very large databases
residing in secondary storage,
such as disks?
Here ``very large'' means that
the data structures constructed from the database
for mining frequent itemsets
cannot fit in the available main memory.
268: 
269: 
Basically, there are two strategies
for mining frequent itemsets:
the datastructures approach
and the
divide-and-conquer approach.

The {\em datastructures} approach consists of
reading
the database buffer by buffer
and generating
data structures (e.g.\ candidate sets or FP-trees).
Since the data structures do not fit into main memory,
additional disk I/O's are required.
283: The number of passes and disk I/O's required
284: by the approach
285: depends on the algorithm and its datastructures.
286: For example,
287: if the algorithm is Apriori \cite{AS94}
288: using a hash-tree 
289: for candidate itemsets
290: \cite{SON95},
291: disk based hash-trees have to be used.
Then the number of passes for the algorithm
is the same as the length of the longest
frequent itemset,
and the number of disk I/O's for the hash-trees
depends on the size of the hash-trees
on disk.
298: 
299: The basic strategy for the  
300: {\em divide-and-conquer} approach
301: is shown in \xf{bdaqalgo}.
In the figure,
$|\db|$ denotes
the size of the data structures used
by the mining algorithm, and
$M$ is the size of available main memory.
Function {\em mainmine}
is called if
candidate frequent itemsets (not necessarily all)
can be mined without
writing the data structures used by
the mining algorithm to disk.
313: In \xf{bdaqalgo},
314: a very large database is decomposed into a number 
315: of smaller databases.
316: If a ``small'' database is still too large,
i.e., the data structures are still too big to fit in main memory,
318: the decomposition is recursively continued
319: until
320: the data structures fit in main memory. 
321: After all small databases are processed, 
322: all candidate frequent itemsets are combined in some way
323: (obviously depending on the way the decomposition was done)
324: to get all frequent itemsets for the original database.
325: 
326: 
327: \begin{figure}[h]
328: {\bf Procedure} {\em diskmine}($\db,M$)
329: 
330: \smallskip
331: 
332: {\bf if} $|\db|\leq M$ {\bf then} {\bf return} {\em mainmine($\,\db$)} 
333: 
{\bf else} decompose $\db$ into $\db_1,\ldots,\db_k$.
335: 
336: {\hskip 18pt}
337: {\bf return}  {\em combine} {\em diskmine($\,\db_1,M$)},
338: 
339: {\hskip 154pt}                           ....              ,
340: 
341: {\hskip 90pt}       {\em diskmine($\,\db_k,M$)}.
342: 
343: \caption{{\small 
344: General divide-and-conquer algorithm for 
345: mining frequent itemsets from disk.
346: }}
347: \label{bdaqalgo}
348: \end{figure}
349: 
350: 
351: The efficiency of {\em diskmine} 
352: depends on 
353: the method used for mining frequent itemsets
354: in main memory and on the number of 
355: disk I/O's needed in the decomposition and
356: combination phases. 
357: Sometimes the disk I/O is the main factor.
358: Since the decomposition step involves I/O,
359: ideally the number of recursive calls should be
kept small. The faster we can obtain small decomposed
databases, the fewer recursive calls we will need.
362: On the other hand, if a decomposition cuts
363: down the size of the projected databases drastically, 
364: the trade-off might be that the combination
365: step becomes more complicated and might involve heavy 
366: disk I/O.
367: 
368: 
369: In the following we discuss two decomposition
370: strategies, namely
371: decomposition by partition, and
372: decomposition by projection.
373: 
374: {\em Partitioning} 
375: is an approach in which a large database is decomposed into 
376: cells of small non-overlapping databases. 
377: The cell-size is chosen so that
378: all frequent itemsets in a cell can be mined without 
379: having to store any data structures in  secondary memory.
380: However, since a cell only contains partial frequency 
381: information of the original database,
382: all frequent itemsets from the cell are local
383: to that cell of the partition,
384: and could only be {\em candidate} frequent itemsets
385: for the whole database.
386: Thus the candidate frequent itemsets mined from
387: a cell
388: have to be verified
389: later to filter out false hits.
390: Consequently, 
391: those candidate sets have to  be written to disk
392: in order to leave
393: space for processing the next cell of the partition.
394: After generating candidate frequent itemsets from 
395: all cells,
396: another database scan is needed to 
397: filter out all infrequent itemsets.
398: The partition approach therefore needs only two passes
399: over the database,
400: but writing and reading candidate frequent itemsets
401: will involve a significant number of
402: disk I/O's,
403: depending on the size of the set of candidate frequent itemsets.
404: 
405: We can conclude that the partition approach 
406: to decomposition keeps the recursive levels
407: down to one, but the penalty is that the 
408: combination phase becomes expensive.
409: 
410: 
411: To get an easier combination phase,
412: we adopt another decomposition strategy, which we call
413: {\em projection}.
414: Suppose for simplicity that there are four
415: items, $a,b,c,$ and $d$, and let $\db$ be a
416: database of transactions containing some 
417: or all of these items.
418: We could then decompose
419: $\db$ into for instance
420: $\db_{ab}$ and
421: $\db_{cd}$.
422: Typically, we would do this when the descending order
423: of frequency of the items is $a, b, c, d$.
In $\db_{cd}$ we put all transactions
containing $c$ or $d$ (or both).
426: In $\db_{ab}$ we put transactions containing
427: $a$ or $b$ (or both), and for each transaction we store
428: only the $a,b$-part. Thus we will have shorter
429: transactions in $\db_{ab}$, and both
430: $\db_{ab}$ and
431: $\db_{cd}$ contain fewer transactions than $\db$.
432: We can then recursively mine all frequent itemsets
433: from $\db_{ab}$, and $\db_{cd}$.
434: Since this decomposition is not a partition,
435: the projected databases
might not be that much smaller than the
original database. The upside, though, is that
438: the set of all frequent itemsets in
439: $\db$ now simply is the union of the frequent
440: itemsets in $\db_{ab}$ and $\db_{cd}$.
441: This means that the combination phase
442: in diskmining is a simple union.
443: 
444: To illustrate this decomposition,
445: let $\db$ contain the transactions
446: $\{a, b, d\}, \{b, c, d\}, \{a, c\}$ and $\{a, b\}$.
447: Suppose the minimum support is 50\%, 
448: then $\db_{cd}=\{\{a, b, d\}, \{b, c, d\}, \{a, c\}\}$,
449: $\db_{ab} =\{ \{a, b\}, \{b\}, \{a\}, \{a, b\}\}$.
450: From $\db_{cd}$, we get all frequent itemsets 
451: $\{d\}, \{b,d\}$, and $\{c\}$.
Note that although $\{a\}$ and $\{b\}$ are also frequent in $\db_{cd}$,
they are not listed since they contain neither $c$ nor $d$.
454: They will be listed in the frequent itemsets of $\db_{ab}$,
455: which are $\{a\}, \{b\}$, and $\{a,b\}$. 
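
As a sanity check of this example (our own verification, reading the
50\% threshold as a count of at least 2 of the 4 transactions, as the
example does), the occurrence counts of the itemsets in $\db$ are
\[
a\!:\!3,\quad b\!:\!3,\quad c\!:\!2,\quad d\!:\!2,\quad
ab\!:\!2,\quad bd\!:\!2,\quad
ac\!:\!1,\quad ad\!:\!1,\quad bc\!:\!1,\quad cd\!:\!1,
\]
and no itemset with three or more items occurs twice.
Mining $\db$ directly thus yields
$\{a\}, \{b\}, \{c\}, \{d\}, \{a,b\}$, and $\{b,d\}$,
which is exactly the union of the three itemsets obtained from $\db_{cd}$
and the three itemsets obtained from $\db_{ab}$.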
456: 
457: To analyze the recurrence and required disk I/O's of the general 
458: divide-and-conquer algorithm
459: when the decomposition strategy is projection,
460: let us suppose that:
461: 
462: 
463: \begin{small}
464: \begin{list}{-}{}
465: 
466: \item
467: The original database size is $D$ bytes.
468: 
469: \item
470: The data structure is an FP-tree.
471: 
472: \item
473: The FP-tree constructed from original database \db~is $T$, 
474: and its size is $|T|$ bytes.
475: 
476: \item
477: If a conditional FP-tree $T'$ is constructed from 
478: an FP-tree $T$, then $|T'|\leq c\cdot |T|$,
479: for some constant $c<1$.
480: 
481: \item
482: The main memory mining method is the {\em FP-growth} 
483: method \cite{HPY00, HPYM04}.
484: Two database scans are needed for constructing an FP-tree
485: from a database.
486: 
487: \item
488: The block size is $B$ bytes.
489: 
490: \item 
491: The main memory available for the FP-tree is $M$ bytes
492: 
493: \end{list}
494: \end{small}
495: 
496: 
497: In the first line of the algorithm in \xf{bdaqalgo},
if $T$ cannot fit in memory,
499: then projected databases will be generated.
500: We assumed that
501: the size of the FP-tree for a projected database
502: is  $c\cdot|T|$. 
503: If $c\cdot |T| \leq M$, function 
504: {\em mainmine} can be called for the projected database,
505: otherwise, the decomposition goes on.
At pass $m$, the size of the FP-tree constructed from
a projected database is $c^m\cdot |T|$.
Thus, the number of passes needed by the
divide-and-conquer projection algorithm is
$1+\lceil\log_c(M/|T|)\rceil$.
Based on our experience and the analysis in \cite{HPY00, HPYM04},
we can say that for all practical purposes
the number of passes will be at most two.
For example, let $D = 100$ Gigabytes, $|T| = 10$ Gigabytes,
$M = 1$ Gigabyte, and $c = 10\%$,
taking a Gigabyte as $10^{9}$ bytes.
Then the number of passes is
$1+\lceil\log_{0.1}(10^{9}/(10\times 10^{9}))\rceil = 2$.
In five passes we can handle databases up to 100 Terabytes.
Namely, with $|T| = 10$ Terabytes we get
$1+\lceil\log_{0.1}(10^{9}/(10\times 10^{12}))\rceil = 5$.
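
The pass count follows in one line from the assumption
$|T'|\leq c\cdot|T|$ with $c<1$.
The derivation below is only a sketch under the assumptions listed above;
it takes $m$ to be the number of decomposition levels after the initial pass.
\[
c^m\cdot|T| \leq M
\;\Longleftrightarrow\;
c^m \leq \frac{M}{|T|}
\;\Longleftrightarrow\;
m \geq \log_c\frac{M}{|T|},
\]
where the last inequality changes direction because $\log_c$ is decreasing for $c<1$.
The smallest such integer is $m = \lceil\log_c(M/|T|)\rceil$,
and adding the initial pass over $\db$ gives
$1+\lceil\log_c(M/|T|)\rceil$.
For instance, if $c = 0.1$ and $|T| = 10\cdot M$, then
$\log_{0.1}(1/10) = 1$ and two passes suffice.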
521: 
522: 
523: 
524: Assume that there are two passes,
525: and that the sum of the sizes of all projected
526: databases is $D'$.
527: There are two database scans for \db, 
528: one for finding all frequent single items,
529: one for decomposition.
530: Two scans need $2\times D/B$ disk I/O's. 
531: The projected databases have to be written to the disks first,
532: then later each scanned twice for building the FP-tree.
533: This step needs  $3\times D'/B$ disk I/O's.
Thus, the total number of
disk I/O's for the general divide-and-conquer
projection algorithm
is at least
538: \negvs
539: \begin{eqnarray}
540: 2\cdot D/B + 3\cdot D'/B.
541: \end{eqnarray}
542: Obviously, 
543: the smaller $D'$, the better the performance. 
544: 
545: 
546: One of the simplest projection strategies
547: is to project the database on each frequent item,
548: which we call
549: {\em naive projection}.
550: First we need some formal definitions.
551: 
552: \begin{defn}
553: {\rm
554: Let $I$ be a set of items.
555: By $I^*$ we will denote {\em strings} over $I$,
556: such that each symbol occurs at most once in the string.
If $\alpha$ and $\beta$ are strings,
then
$\alpha.\beta$ denotes the concatenation of the
string $\alpha$ with the string $\beta$.
An item $i_j$ is also regarded as a string of length one,
so that $\alpha.i_j$ is defined as well.
561: 
562: For a string $\alpha$, we shall denote
563: by $\{\alpha\}$, the {\em set} of items occurring in it.
564: 
565: Let $\db$ be an $I$-database.
566: Then ${\mit freqstring}(\db)$
567: is the string over
568: $I$, such that each frequent item in $\db$ occurs
569: in it exactly once, and the items are in decreasing
570: order of frequency in $\db$.
571: \hspace*{\fill}${\qed}$
572: }
573: \end{defn}
574: 
575: 
576: 
577: 
578: As an example, consider the  $\{a,b,c,d\}$-database
579: $\db = \{\{a,b,c\}, \{a,b,c,d\}, \{a,c\}\}$.
580: If the minimum support is 60\%, then
581: ${\mit freqstring}(\db) = acb$.
582: Note that $\{acb\} = \{a,c,b\}$. 
583: 
584: 
585: 
586: \begin{defn}
587: {\rm 
588: Let $\db$ 
589: be an $I$-database, and let 
590: ${\mit freqstring}(\db) 
591: = i_1i_2\cdots i_k$.
592: For $j\in\{1,\ldots,k\}$ we define
593: $\db_{i_j} =
594: \{\tau\cap\{i_1,\ldots,i_j\} : i_j\in\tau,\tau\in\db\}.$
595: 
596: Let $\alpha\in I^*$.
597: We define $\db_{\alpha}$ inductively:
598: $\db_{\epsilon} = \db$, and
599: let ${\mit freqstring}(\db_{\alpha}) 
600: = i_1i_2\cdots i_k$. Then,
601: for $j\in\{1,\ldots,k\}$, 
602: $\db_{\alpha.i_j} =
603: \{\tau\cap\{i_1,\ldots,i_j\} : i_j\in\tau,\tau\in\db_{\alpha}\}.$
604: \hspace*{\fill}${\qed}$
605: }
606: \end{defn}
607: 
608: 
609: Obviously,  
610: $\db_{\alpha.i_j}$ is an $\{i_1,\ldots,i_j\}$-database.
611: The decomposition of $\db_{\alpha}$ into
612: $\db_{\alpha.i_1}$, \ldots, $\db_{\alpha.i_k}$
613: is called the {\em naive projection}.
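
To make the definition concrete, here is a worked instance on the example
database given earlier (our own illustration, with $\alpha=\epsilon$).
For $\db = \{\{a,b,c\}, \{a,b,c,d\}, \{a,c\}\}$ and
${\mit freqstring}(\db) = acb$ we have $i_1=a$, $i_2=c$, $i_3=b$, and
\begin{align*}
\db_{a} &= \{\{a\},\{a\},\{a\}\},\\
\db_{c} &= \{\{a,c\},\{a,c\},\{a,c\}\},\\
\db_{b} &= \{\{a,c,b\},\{a,c,b\}\}.
\end{align*}
The infrequent item $d$ appears in none of the projected databases,
and each $\db_{i_j}$ keeps only $i_j$ together with the items that precede it in
${\mit freqstring}(\db)$.
Since databases are bags, duplicate transactions are kept.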
614: 
615: 
616: \begin{defn}
617: {\rm 
618: Let $\alpha\in I^*$, $i_j\in I$, and let
619: $\db_{\alpha.i_j}$ be an $I$-database.
620: Then ${\mit freqsets}(\xi,\db_{\alpha.i_j})$ denotes the subsets
621: of $I$ 
622: that contain $i_j$ and are frequent in $\db_{\alpha.i_j}$ 
623: when the  minimum support is $\xi$.
Usually, we shall abstract $\xi$ away, and write
just  ${\mit freqsets}(\db_{\alpha.i_j})$.
626: \hspace*{\fill}${\qed}$
627: }
628: \end{defn}
629: 
630: 
631: \begin{lemma}
632: 
633: Let $\db_{\alpha}$ be an $I$-database, and
634: ${\mit freqstring}(\db_{\alpha}) = i_1i_2\cdots i_k$.
635: Then
636: $${\mit freqsets}(\db_{\alpha}) =
637: \bigcup_{j\in\{1,\ldots,k\}}{\mit freqsets}(\db_{\alpha.i_j})$$
638: 
639: \end{lemma}
640: 
641: \noindent
642: {\bf Proof}. 
643: ($\subseteq$-{\em direction}). 
Let $S\in {\mit freqsets}(\db_{\alpha})$,
and suppose $i_j$ is the item in $S$ that is least frequent in
$\db_{\alpha}$, so that $S\subseteq\{i_1,\ldots,i_j\}$.
Since $\db_{\alpha.i_j}$ is an $\{i_1,\ldots,i_j\}$-database,
and the transactions in $\db_{\alpha}$ that contain item $i_j$
are all in $\db_{\alpha.i_j}$ (restricted to $\{i_1,\ldots,i_j\}$),
every transaction that contains $S$ is retained.
Hence, if $S$ is frequent in $\db_{\alpha}$,
then $S$ must be frequent in $\db_{\alpha.i_j}$.
652: 
653: \noindent
654: ($\supseteq$-{\em direction}). 
For any frequent itemset
$S \in {\mit freqsets}(\db_{\alpha.i_j})$,
according to the definition,
the
support of any itemset in $\db_{\alpha.i_j}$ is not greater than
its support in $\db_{\alpha}$.
Therefore, $S$ must be frequent in $\db_{\alpha}$.
662: \hspace*{\fill}${\qed}$
663: 
664: \medskip
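
The lemma can be checked by hand on the small database
$\db = \{\{a,b,c\}, \{a,b,c,d\}, \{a,c\}\}$ with
${\mit freqstring}(\db) = acb$.
This is our own illustration, and it assumes that the 60\% threshold is
applied as an absolute count of at least 2 transactions in every
projected database, which the definitions above do not state explicitly.
Naive projection gives
$\db_{a} = \{\{a\},\{a\},\{a\}\}$,
$\db_{c} = \{\{a,c\},\{a,c\},\{a,c\}\}$, and
$\db_{b} = \{\{a,c,b\},\{a,c,b\}\}$, so that
\begin{align*}
{\mit freqsets}(\db_{a}) &= \{\{a\}\},\\
{\mit freqsets}(\db_{c}) &= \{\{c\},\{a,c\}\},\\
{\mit freqsets}(\db_{b}) &= \{\{b\},\{a,b\},\{c,b\},\{a,c,b\}\}.
\end{align*}
The union contains seven itemsets.
These are exactly the itemsets that occur at least twice in $\db$, with counts
$a\!:\!3$, $c\!:\!3$, $b\!:\!2$, $ac\!:\!3$, $ab\!:\!2$, $cb\!:\!2$, and $acb\!:\!2$.
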
665: 
666: 
667: 
668: \xf{hansalgo} gives a divide-and-conquer algorithm
669: that uses naive projection.
670: A transaction $\tau$ in $\db_{\alpha}$ will be partly inserted into 
671: $\db_{\alpha.i_j}$ if and only if $\tau$ contains $i_j$.
672: The parallel projection algorithm introduced in
673: \cite{HPYM04} 
674: is an algorithm of this kind.
675: 
676: 
677: \begin{figure}[h]
678: {\bf Procedure} {\em naivediskmine}($\db_{\alpha},M$)
679: 
680: \smallskip
681: 
682: {\bf if} $|\db_{\alpha}|\leq M$ {\bf then} 
683: {\bf return} {\em mainmine($\;\db_{\alpha}$)} 
684: 
685: {\bf else} let ${\mit freqstring}(\db_{\alpha}) = i_1i_2\cdots i_n$
686: 
687: {\hskip 18pt} {\bf return}  {\em naivediskmine}$(\db_{\alpha.i_1},M)\;\cup$ 
688: 
689: {\hskip 146pt}  $\ldots\;\cup$ 
690: 
691: {\hskip 56pt}{\em naivediskmine}$(\db_{\alpha.i_n},M)$.
692: 
693: \caption{{\small 
694: A simple divide-and-conquer algorithm for 
695: mining frequent itemsets from disk 
696: }}
697: \label{hansalgo}
698: \end{figure}
699: 
700: 
701: 
702: Let's analyze the disk I/O's of the algorithm
703: in \xf{hansalgo}.
704: As before, we assume that there are two passes,
705: that the data structure is an FP-tree,
706: and that the main memory mining method is
707: {\em FP-growth}. 
708: If in $\db_{\epsilon}$, each transaction contains on the average $n$
709: frequent items, 
710: each transaction will be written to $n$ projected databases.
711: Thus the total length of the associated transactions in
712: the projected databases is
713: $n+(n-1)+\cdots+1 = n(n+1)/2$,
714: the total size of all projected databases is
715: $(n+1)/2\cdot D\approx n/2\cdot D$.
716: 
717: There are two database scans for $\db_{\epsilon}$,
718: one for finding all frequent single items,
719: and one for decomposition.
720: Two scans need $2\cdot D/B$ disk I/O's.
721: The projected databases have to be written to the disks first,
722: then later scanned twice each for building an FP-tree.
This step needs at least $3\cdot n/2\times D/B$ disk I/O's.
Thus, the total number of disk I/O's for the divide-and-conquer
algorithm with naive projection
is
727: \negvs
728: \begin{eqnarray}
729: 2 \cdot D/B 
730: + 
731: n \cdot 3/2 \cdot D/B
732: \end{eqnarray}
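
The factor $(n+1)/2$ can be derived as follows.
This is a sketch under the simplifying assumption that every transaction
contains exactly $n$ frequent items; the symbols $a_1,\ldots,a_n$ and the
numeric values are ours and purely illustrative.
If the frequent items of a transaction $\tau$, listed in the order of
${\mit freqstring}(\db_{\epsilon})$, are $a_1,\ldots,a_n$,
then the copy of $\tau$ written to the projected database of $a_j$
has length $j$. Hence
\[
\sum_{j=1}^{n} j = \frac{n(n+1)}{2},
\qquad
\frac{n(n+1)/2}{n} = \frac{n+1}{2},
\]
that is, each transaction grows by a factor of $(n+1)/2$ when projected.
For example, with $n = 20$ the projected databases occupy about
$10\cdot D$ bytes, and the formula above gives
$2\cdot D/B + 30\cdot D/B = 32\cdot D/B$ disk I/O's,
sixteen times the cost of the two scans of the original database.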
733: 
734: The recurrence structure of algorithm
735: {\em naivediskmine} 
736: is shown in \xf{naivetree}.
The reader should ignore the
nodes in
the shaded area
for now; they
represent processing
in main memory.
743: 
744: \begin{figure}[h]
745: \centerline{\psfig{figure=figures/append1,height=1.5in}}
746: \caption{\small Recurrence structure of Naive Projection}
747: \label{naivetree}
748: \end{figure}
749: 
750: 
751: 
752: 
In a typical application, $n$, the average number
of frequent items per transaction, could be in the hundreds or thousands.
755: It therefore makes sense to devise a smarter
756: projection strategy.
757: Before we go further, we introduce
758: some definitions and a lemma.
759: 
760: 
761: \begin{defn}\label{four}
762: {\rm
763: Let $\db_{\alpha}$ be an $I$-database, and let
764: ${\mit freqstring}(\db_{\alpha}) 
765: = \beta_1.\beta_2. \cdots .\beta_k$,
766: where each $\beta_j$ is a string in $I^*$.
767: We call $\beta_1.\beta_2. \cdots .\beta_k$
768: a {\em grouping} of 
769: ${\mit freqstring}(\db_{\alpha})$. 
For
$j\in\{1,\ldots,k\}$,
we now define
$\db_{\alpha.\beta_j} =
\{\tau\cap\{\beta_1.\cdots.\beta_j\} : \tau\in\db_{\alpha},
\tau\cap\{\beta_j\}\neq\emptyset
\}.$

In $\db_{\alpha.\beta_j}$,
items in $\{\beta_j\}$ are called {\em master items},
items in $\{\beta_1.\cdots.\beta_{j-1}\}$ are called {\em slave items}.
781: \hspace*{\fill}${\qed}$
782: }
783: \end{defn}
784: 
785: 
786: For example,
787: if ${\mit freqstring}(\db_{\alpha}) = abcde$,
788: $\beta_1 = abc$, $\beta_2 = de$ gives
789: the grouping $abc.de$ of $abcde$.
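
Continuing this example with two hypothetical transactions
(our own illustration), suppose $\db_{\alpha}$ contains
$\tau_1 = \{a,c,d\}$ and $\tau_2 = \{a,b\}$. Then
\begin{align*}
\tau_1\cap\{a,b,c\} &= \{a,c\} \in \db_{\alpha.abc}, &
\tau_1\cap\{a,b,c,d,e\} &= \{a,c,d\} \in \db_{\alpha.de},\\
\tau_2\cap\{a,b,c\} &= \{a,b\} \in \db_{\alpha.abc}, &
\tau_2\cap\{d,e\} &= \emptyset,
\end{align*}
so $\tau_2$ contributes nothing to $\db_{\alpha.de}$.
In $\db_{\alpha.de}$ the master items are $d$ and $e$ and the slave items
are $a$, $b$, and $c$; in $\db_{\alpha.abc}$ all three items are master
items and there are no slave items.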
790: 
791: 
792: 
793: 
794: 
795: 
796: 
797: 
798: 
799: \begin{defn}
800: {\rm
801: Let $\{\alpha,\beta\}\subset I^*$, and let
802: $\db_{\alpha.\beta}$ be an $I$-database.
Then ${\mit freqsets}(\db_{\alpha.\beta})$ denotes the subsets
804: of $I$ 
805: that contain at least one item in $\{\beta\}$
806: and are frequent in $\db_{\alpha.\beta}$.
807: \hspace*{\fill}${\qed}$
808: }
809: \end{defn}
810: 
811: \begin{lemma}\label{goodway}
812: Let $\alpha\in I^*$, 
813: $\db_{\alpha}$ be an $I$-database, and
${\mit freqstring}(\db_{\alpha}) = \beta_1.\beta_2.\cdots.\beta_k$.
Then
$${\mit freqsets}(\db_{\alpha}) =
\bigcup_{j\in\{1,\ldots,k\}}{\mit freqsets}(\db_{\alpha.\beta_j})$$
818: 
819: \end{lemma}
820: 
821: \noindent
822: {\bf Proof.}
823: Straightforward from Lemma 1 and the definition 
824: of $\db_{\alpha.\beta}$.
825: \hspace*{\fill}${\qed}$
826: 
827: \medskip
828: 
829: Based on Lemma \ref{goodway},
830: we can obtain a more aggressive divide-and-conquer algorithm for 
831: mining from disks.
832: \xf{ouralgo} shows the algorithm {\em aggressivediskmine}.
833: Here,
834: ${\mit freqstring}(\db_{\alpha})$
835: is decomposed into several substrings $\beta_j$,
836: each of which could have more than one item.
837: Each substring corresponds to a projected database.
838: A~transaction $\tau$ in $\db_{\alpha}$ will be partly inserted into
839: $\db_{\alpha.\beta_j}$ if and only if 
840: $\tau$ contains at least one item $a$
841: such that $a\in\{\beta_j\}$.
Since there will be fewer projected databases,
there will be fewer disk I/O's.
844: Compared with the algorithm in \xf{hansalgo},
845: we can expect that
846: a large amount of disk I/O will be saved by the algorithm
847: in \xf{ouralgo}.
848: 
849: \begin{figure}[h]
850: {\bf Procedure} {\em aggressivediskmine}($\db_{\alpha},M$)
851: 
852: \smallskip
853: 
854: {\bf if} $|\db_{\alpha}|\leq M$ {\bf then} 
855:    {\bf return} {\em mainmine($\;\db_{\alpha}$)} 
856: 
857: {\bf else} let ${\mit freqstring}(\db_{\alpha}) = 
858: \beta_1\beta_2\cdots \beta_k$
859: 
860: {\hskip 18pt}
861: {\bf return}  {\em aggressivediskmine}$(\db_{\alpha.\beta_1},M)\;\cup$ 
862: 
863: {\hskip 165pt} $\;\ldots\;\cup$ 
864: 
865: {\hskip 57pt}{\em aggressivediskmine}$(\db_{\alpha.\beta_k},M)$.
866: 
867: 
868: \caption{{\small 
869: A more aggressive divide-and-conquer algorithm for 
870: mining frequent itemsets from disk 
871: }}
872: \label{ouralgo}
873: \end{figure}
874: 
875: 
876: 
877: Let's analyze the recurrence and disk I/O's of the aggressive 
878: divide-and-conquer algorithm.
879: The number of
880: passes needed by the algorithm is still 
\mbox{$1+\lceil\log_c(M/|T|)\rceil \approx 2$},
882: since grouping items doesn't change the size of an FP-tree for
883: a projected database.
884: However, for disk I/O,
885: suppose in $\db_{\epsilon}$, 
886: each transaction contains on average $n$
887: frequent items,
888: and that we can group them into $k$
889: groups of equal size.
890: Then the $n$ items will be written to the projected databases
891: with total length $n/k+2\cdot n/k+ \ldots +k\cdot n/k = (k+1)/2\cdot n$.
The total size of all projected databases is
$(k+1)/2\cdot D \approx k/2\cdot D$.
The total number of disk I/O's for the aggressive divide-and-conquer
algorithm
is then
897: \negvs
898: \begin{eqnarray}\label {formula}
899: 2\cdot D/B 
900: +
901: k \cdot 3/2 \cdot D/B
902: \end{eqnarray}
903: 
904: The recurrence structure of algorithm
905: {\em aggressivediskmine} is shown
906: in \xf{recagg}. Compared to \xf{naivetree},
907: we can see that the part of the tree
908: that corresponds to decomposition
909: (the nonshaded part) is much smaller
910: in \xf{recagg}. Although the example is
911: very small, it exhibits the general structure
912: of the two trees.
913: 
914: \begin{figure}[h]
915: \centerline{\psfig{figure=figures/append2,height=1.5in}}
916: \caption{\small Recurrence structure of Aggressive Projection}
917: \label{recagg}
918: \end{figure}
919: 
920: 
921: 
922: 
If $k\ll n$,
we can expect that the aggressive
divide-and-conquer algorithm will
significantly outperform the naive one.
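
To get a feeling for the size of the gain, the two cost formulas can be
compared directly. The values of $n$ and $k$ below are hypothetical and
serve only to illustrate the arithmetic.
\[
\frac{2\cdot D/B + n\cdot 3/2\cdot D/B}{2\cdot D/B + k\cdot 3/2\cdot D/B}
= \frac{4+3n}{4+3k}.
\]
With $n = 100$ and $k = 4$, naive projection needs about $152\cdot D/B$
disk I/O's while aggressive projection needs about $8\cdot D/B$,
a ratio of $304/16 = 19$.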
927: 
928: \section {Algorithm Diskmine}
929: In this section
930: we give
931: the details of
932: our divide-and-conquer algorithm for mining frequent itemsets
933: from secondary memory.
934: We call the algorithm {\em Diskmine}.
In the algorithm,
the FP-tree is used as the data structure, and
an extension of the {\em FP-growth} method,
{\em FP-growth*} \cite{fimi03},
is used as the method for mining frequent itemsets from an FP-tree.
940: Before introducing the algorithm, 
let's first recall the FP-tree and the {\em FP-growth*} method.
942: 
943: \subsection{The FP-tree and {\em FP-growth*} method}
944: 
945: The {\em FP-tree (Frequent Pattern tree)}
946: is a data structure used in 
947: the {\em FP-growth} method by Han {\em et al.\ } \cite{HPY00}.
948: It is a compact representation
949: of all relevant
950: frequency information
951: in a database.
Each node of the FP-tree stores an item name, an item count,
and a link.
954: Every branch of the FP-tree represents a frequent itemset,
955: and the nodes along the branches are
956: stored in decreasing order of the frequency
957: of the corresponding items, with leaves representing
958: the least frequent items.
959: Compression is achieved by
960: building the tree in such a way that
961: overlapping itemsets
962: share prefixes of the
963: corresponding branches.
964: 
965: 
966: The FP-tree has
967: a {\em header table} associated with it.
968: Single items and their counts are stored in
969: the header table in
970: decreasing order of their frequency.
971: The entry for an item also contains the head
972: of a list that links all the
973: nodes of the item
974: in the FP-tree.
975: 
976: 
977: The FP-growth method needs two database scans
978: when mining all frequent itemsets.
979: The first scan counts the number of occurrences
980: of each item.
981: The second scan constructs the initial FP-tree,
982: which contains all frequency information of the original dataset.
983: Mining the database then becomes mining the FP-tree.
984: 
985: 
986: The {\em FP-growth} method relies on the following
987: principle: if $X$ and $Y$ are two itemsets,
988: the count of itemset $X\cup Y$ in the database
989: is exactly that of $Y$ in the restriction of the database to
990: those transactions containing $X$.
991: This restriction of the database is
992: called
993: the {\em conditional pattern base} of $X$,
994: and the FP-tree constructed from the conditional pattern base
995: is called $X$'s {\em conditional FP-tree},
996: which we denote by $T_X$.
997: We can view the FP-tree constructed from the initial database
998: as $T_{\emptyset}$,
999: the conditional FP-tree for $\emptyset$.
1000: Note that for
1001: any itemset $Y$ that is frequent
1002: in the conditional pattern base of $X$,
1003: the set
1004: $X\cup Y$ is a frequent itemset for the original database.\footnote{In
1005: keeping with the notation introduced so far, we shall
1006: in the sequel write $T_{\alpha}$ when we mean the 
1007: FP-tree $T_{\{\alpha\}}$. Similarly we shall write
1008: $T_{\alpha.i}$ instead of $T_{\{\alpha\}\cup\{i\}}$.}
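
To illustrate the principle on a small instance
(the four transactions are invented for this example),
let the database consist of $\{a,b,c\}$, $\{a,b\}$, $\{a,c\}$ and $\{b,c\}$,
and take $X=\{a\}$ and $Y=\{b\}$.
The conditional pattern base of $X$ consists of the three transactions
that contain $a$, and
\begin{eqnarray*}
\mbox{count of } X\cup Y=\{a,b\} \mbox{ in the database} &=& 2,\\
\mbox{count of } Y=\{b\} \mbox{ in the conditional pattern base of } X &=& 2.
\end{eqnarray*}
The two counts agree because the transaction $\{b,c\}$, which does not contain $a$,
cannot contribute to the count of $\{a,b\}$.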
1009: 
The recursive structure of {\em FP-growth} can be seen from
the shaded area in \xf{naivetree}.
In the figure, we enter the main memory phase,
for instance, for the conditional database $\db_a$.
1014: Then FP-growth first constructs the
1015: FP-tree $T_a$ from $\db_a$.
1016: The tree rooted at $T_a$
1017: shows the recursive structure of FP-growth,
1018: assuming for simplicity that the
1019: relative frequency remains the same in
1020: all conditional pattern bases.
1021:  
1022: 
1023: 
1024: 
1025: 
1026: In \cite {fimi03}, we extend the FP-growth method into the
1027: {\em FP-growth*} method by using an {\em array technique} 
1028: and other optimizations. 
1029: The experimental results in the paper
1030: and those done by the FIMI-organizers show
1031: that the FP-growth* method outperforms the {\em FP-growth} method
1032: especially when the database is big or sparse
1033: \cite{fimi03,ZB03}.
1034: 
1035: 
1036: \subsubsection* {The array technique} \label{arraytech}
1037: 
1038: In the original FP-growth method \cite{HPY00}, 
1039: to construct an FP-tree from a database $\db$, 
two database scans are required.
The first scan gets all frequent items,
the second constructs the FP-tree.
Later,
for each item $i$ in
the header of a conditional FP-tree $T_{\alpha}$,
1046: two traversals of $T_{\alpha}$ are needed for constructing
1047: the new conditional FP-tree $T_{\alpha.i}$.
1048: The first traversal finds all frequent items in the
1049: conditional pattern base of $\alpha.i$,
1050: and initializes the FP-tree  $T_{\alpha.i}$
1051: by constructing its header table.
1052: The second traversal constructs the new tree
1053: $T_{\alpha.i}$.
1054: 
1055: 
1056: In the boosted {\em FP-growth*} method \cite{fimi03},
1057: a simple data structure, an array,
1058: is introduced to omit the first scan of $T_{\alpha}$.
1059: This is achieved 
1060: by constructing an array
1061: $A_{\alpha}$ while building $T_{\alpha}$.
1062: More precisely,
1063: in the second scan of the original database we 
1064: construct $T_{\epsilon}$, and an array $A_{\epsilon}$.
The array stores the counts of
all 2-itemsets; each cell $[j,k]$
in the array is a counter of the 2-itemset $\{i_j,i_k\}$.
1068: All cells in the array are initialized to 0.
1069: When an itemset is inserted into $T_{\epsilon}$,
1070: the associated cells in $A_{\epsilon}$ are updated.
1071: After the second scan,
1072: the array $A_{\epsilon}$ contains the counts of
1073: all pairs of items frequent in $\db_{\epsilon}$.
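
As a small illustration (the three transactions are made up for this sketch),
suppose the frequent items are $i_1, i_2, i_3$ and the second scan inserts
$\{i_1,i_2,i_3\}$, $\{i_1,i_2\}$ and $\{i_1,i_3\}$ into $T_{\epsilon}$.
Assuming that only cells $[j,k]$ with $j>k$ are kept,
a layout chosen here for concreteness,
the array ends up as
\[
A_{\epsilon}[2,1]=2,\qquad A_{\epsilon}[3,1]=2,\qquad A_{\epsilon}[3,2]=1.
\]
If a pair must occur at least twice to count as frequent
(again an assumption of the example),
the row for $i_3$ shows that $i_1$ is frequent together with $i_3$
while $i_2$ is not.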
1074: 
1075: 
1076: 
Next, the {\em FP-growth*} method is recursively called
to mine frequent itemsets for each item in the header table
1079: of $T_{\epsilon}$.
1080: However, now for each item $i$,
1081: instead of traversing $T_{\epsilon}$ along
1082: the linked list starting at $i$ to get
1083: all frequent items in $i$'s conditional pattern base,
1084: $A_{\epsilon}[i,*]$ gives all frequent items for $i$.
1085: Therefore, for each item $i$ in $T_{\epsilon}$
1086: the array $A_{\epsilon}$ makes
1087: the first traversal of $T_{\epsilon}$ unnecessary,
1088: and $T_{\epsilon.i}$ can be
1089: initialized directly from $A_{\epsilon}$.
1090: 
1091: 
1092: For the same reason, from a conditional FP-tree $T_{\alpha}$,
1093: when we construct a new conditional
1094: FP-tree for $\alpha.i$, for an item $i$,
1095: a new array $A_{\alpha.i}$ is calculated.
1096: During the construction of the
1097: new FP-tree $T_{\alpha.i}$,
1098: the array $A_{\alpha.i}$
1099: is filled.
1100: The construction of arrays and FP-trees continues
1101: until the {\em FP-growth} method terminates.
1102: 
Note that if, for a database,
we already have the array that stores the counts of all pairs of
frequent items,
then only one database scan is needed
to construct an FP-tree from the database.
1108: 
1109: \subsection{Divide-and-conquer by aggressive projection}
1110: 
1111: 
1112: 
1113: The algorithm {\em Diskmine} is shown in \xf{appa}. In the algorithm,
1114: $\db_{\alpha}$ is the original database or a projected database,
1115: and $M$ is the maximal size of main memory that can be used by {\em Diskmine}. 
1116: 
1117: \begin{figure}[h]
1118: {\bf Procedure} {\em Diskmine}$(\db_{\alpha}, M)$
1119: 
1120: \smallskip
1121: 
1122: scan $\db_{\alpha}$ and compute {\it freqstring}$(\db_{\alpha})$
1123: 
1124: call ${\mit trialmainmine(\db_{\alpha}, M)}$ 
1125: 
1126: {\bf if} ${\mit trialmainmine(\db_{\alpha}, M)}$ aborted {\bf then}
1127: 
1128: {\hskip 12pt}compute a grouping $\beta_1\beta_2\cdots \beta_k$
1129:    of ${\mit freqstring}(\db_{\alpha})$.
1130:  
1131: {\hskip 12pt}Decompose $\db_{\alpha}$ into
1132: $\db_{\alpha.\beta_1},\ldots, \db_{\alpha.\beta_k}$
1133: 
1134: {\hskip 12pt}{\bf for} j = 1 {\bf to} k {\bf do begin} 
1135: 
1136: {\hskip 24pt}{\bf if} $\{\beta_j\}$ is a singleton {\bf then}  
1137: 
1138: {\hskip 36pt}${\mit Diskmine}(\db_{\alpha.\beta_j},M)$
1139: 
1140: {\hskip 24pt}{\bf else}  
1141: 
1142: {\hskip 36pt}${\mit mainmine}(\db_{\alpha.\beta_j})$
1143: 
1144: {\hskip 12pt}{\bf end}
1145: 
1146: {\bf else return} {\em freqsets}$(\db_{\alpha})$
1147: \caption{{\small Algorithm Diskmine}}
1148: \label{appa}
1149: \end{figure}
1150: 
1151: 
{\em Diskmine} uses the FP-tree as its
data structure and {\em FP-growth*} \cite{fimi03}
as its main memory
mining
algorithm.
1157: Since the FP-tree encodes all frequency information
1158: of the database, 
1159: we can shift into main memory mining
1160: as soon as the FP-tree fits
1161: into main memory.
1162: 
1163: Since an FP-tree usually is a significant
1164: compression of the database, our {\em Diskmine}
1165: algorithm begins optimistically, by calling {\em trialmainmine},
1166: which starts scanning the database and constructing the FP-tree.
1167: If the tree can be successfully completed and stored in main memory,
1168: we have reached the bottom level of the recursion,
1169: and can obtain 
1170: the frequent itemsets of the database
1171: by running
1172: {\em FP-growth*} on the FP-tree in main memory.
1173: 
1174: \begin{figure}[h]
1175: {\bf Procedure} {\em trialmainmine}$(\db_{\alpha}, M)$
1176: 
1177: start scanning $\db_{\alpha}$ and building the FP-tree 
1178:    
1179:    {\hskip 12 pt}$T_{\alpha}$ in main memory.
1180: 
1181: {\bf if} $|T_{\alpha}|$  exceeds  $M$ {\bf then}
1182: 
1183: {\hskip 12pt}{\bf return} the incomplete $T_{\alpha}$ 
1184: 
1185: {\bf else} 
1186: 
1187: {\hskip 12pt}call {\em FP-growth*}$\,(T_{\alpha})$ and {\bf return}
1188:     {\em freqsets}$(\db_{\alpha})$.
1189: 
1190: \caption{{\small Trial main memory mining algorithm}}
1191: \label{trial}
1192: \end{figure}
1193: 
1194: 
1195: 
1196: If, at any time during {\em trialmainmine}
1197: we run out of main memory, we abort and
1198: return the partially constructed FP-tree,
1199: and a pointer to where we stopped scanning the database.
1200: We then resume processing {\em Diskmine}$(\db_{\alpha},M)$
1201: by computing a grouping 
1202: $\beta_1,\ldots, \beta_k$ of 
1203: {\em freqstring}$(\db_{\alpha})$, 
1204: and then decomposing
1205: $\db_{\alpha}$ into
1206: $\db_{\alpha.\beta_1},\ldots,\db_{\alpha.\beta_k}$.
1207: We recursively process 
1208: each decomposed database
1209: $\db_{\alpha.\beta_j}$.
1210: During the first level of the recursion,
1211: some groups $\beta_j$ will consist of a single
1212: item only. 
1213: If $\{\beta_j\}$ is a singleton,
1214: we call {\em Diskmine}, otherwise
1215: we call {\em mainmine} directly,
1216: since we put several items in a group
1217: only when we estimate that the corresponding
1218: FP-tree will fit into main memory.
1219: 
1220: In computing the grouping
1221: $\beta_1,\ldots, \beta_k$ 
1222: we assume that transactions in a very large database
1223: are evenly distributed, i.e., 
if an FP-tree is constructed from part of a database,
then this FP-tree is representative of the FP-tree for the whole database.
In other words,
if the size of the FP-tree is $n$ for $p\%$ of the database,
then the size of the FP-tree for the whole database is estimated as $n/p \cdot 100$.
1229: Most of the time, this gives an overestimation,
1230: since an FP-tree increases fast only at the beginning stage,
1231: when items are encountered for the first time and inserted
1232: into the tree. In the later stages, the changes to the FP-tree
1233: will be mostly counter updates.
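
For example, with numbers invented for illustration:
if the partial FP-tree has $n = 200{,}000$ nodes after $p = 25$ percent
of the database has been scanned,
the estimated size of the complete FP-tree is
\[
n/p \cdot 100 = 200{,}000/25 \cdot 100 = 800{,}000 \mbox{ nodes}.
\]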
1234: 
1235: 
1236: \begin{figure}[h]
1237: {\bf Procedure} {\em mainmine}$(\db_{\alpha.\beta})$
1238: 
1239: build a modified FP-tree $T_{\alpha.\beta}$ for $\db_{\alpha.\beta}$
1240: 
1241: {\bf for each} $i\in\{\beta\}$ {\bf do begin}
1242: 
1243: {\hskip 12pt} construct the FP-tree $T_{\alpha.i}$
1244:                 for $\db_{\alpha.i}$ from $T_{\alpha.\beta}$
1245:    
1246: {\hskip 12pt} call {\em FP-growth*}$\,(T_{\alpha.i})$ 
1247:     and {\bf return}
1248:     {\em freqsets}$(\db_{\alpha.i})$.
1249: 
1250: {\bf end}
1251: 
1252: \caption{{\small Main memory mining algorithm}}
1253: \label{mainmine}
1254: \end{figure}
1255: 
1256: 
1257: 
Since a projected database for a singleton group has only one master item
(for $\db_\epsilon$, no master item at all),
its FP-tree is constructed without the master item.
1261: In \xf{mainmine},
1262: since $\db_{\alpha.\beta}$ is for multiple master items,
1263: the 
1264: FP-tree constructed from $\db_{\alpha.\beta}$ has to contain
1265: those master items.
1266: However, the item order is a problem for the FP-tree,
1267: because we only want to mine all frequent itemsets
1268: that contain master items.
1269: To solve this problem,
1270: we simply use the item order in the partial FP-tree
1271: returned by the aborted 
1272: {\em trialmainmine}$(\db_{\alpha})$.
1273: This is what we mean by a ``modified FP-tree''
1274: on the first line in the algorithm in \xf{mainmine}.
1275: 
1276: The entire recurrence structure of
1277: {\em Diskmine} can be seen in \xf{recagg}.
Compared to the naive projection in \xf{naivetree},
we see that since the aggressive projection
uses main memory more effectively,
the decomposition phase is shorter,
resulting in less I/O.
1283: 
1284: 
1285: 
1286: \begin{theorem}
1287: Diskmine$(\db)$  returns freqsets$(\db)$.
1288: \end{theorem}
1289: 
1290: \noindent
1291: {\bf Proof}. 
1292: The correctness of {\em Diskmine}
1293: can be derived from the correctness of the 
1294: {\em FP-growth*} method in \cite{fimi03}
1295: and Lemma \ref{goodway} in \xs{diskmine}.
1296: In {\em Diskmine},
1297: each item acts as master item in exactly one projected database.
1298: If a projected database is only for one master item $i_j$,
the result of the {\em FP-growth*} method or of a recursive call of {\em Diskmine}
will be {\em freqsets}$(\db_{i_j})$.
1301: If a projected database is for a set $\{\beta\}$ of master items,
1302: it contains all frequency information associated with the master items.
1303: Since in the {\em FP-growth*} method,
1304: the order of the items in an FP-tree doesn't influence 
1305: the correctness of the  {\em FP-growth*} method,
1306: {\em mainmine} indeed returns only frequent itemsets that 
1307: contain master item(s),
1308: i.e.\ {\em mainmine} gives the 
exact value of {\em freqsets}$(\db_{\alpha.\beta})$.
1310: According to Lemma \ref{goodway},
algorithm {\em Diskmine} then
correctly outputs all
itemsets that are frequent in the original database.
1314: \hspace*{\fill}${\qed}$
1315: 
1316: 
1317: 
1318: \subsection {Memory Management}\label{memory}
1319: 
1320: Given a database $\db_{\alpha}$, 
1321: to successfully apply the {\em FP-growth*} method, 
1322: the basic main memory requirement is that the size of the FP-tree
1323: $T_{\alpha}$
1324: constructed from $\db_{\alpha}$,
1325: is less than the available amount $M$ of main memory.
1326: In addition, we need space
1327: for the  descendant conditional 
1328: FP-trees that will be constructed during the recursive calls
1329: of {\em FP-growth*}.
1330: 
1331: Suppose the main memory requirement 
1332: for $T_{\alpha}$ plus its descendant FP-trees is $m$.
1333: If $M < m$, but the difference $m-M$ is not very big, 
1334: the {\em FP-growth*} method
1335: could still be run because the operating 
1336: system uses virtual memory.
However, this could cause so much page swapping
that {\em FP-growth*} becomes very slow.
1339: Therefore, given $M$, for a very large database $\db_{\alpha}$,
1340: we have to stop the construction of the FP-tree $T_{\alpha}$
1341: and the execution of {\em FP-growth*} method before
1342: all physical main memory is used up.
1343: 
1344: 
1345: Another problem is that we will 
1346: construct a large number  of FP-trees.
1347: Since there can be 
1348: millions of nodes in those FP-trees,
1349: inserting and deleting nodes is time consuming.
1350:  
1351: In the implementation of the algorithm,
1352: we use our own main memory management for 
1353: allocating and deallocating nodes,
1354: and calculating the main memory we have already used.
We assume that the main memory needed by an FP-tree is
proportional to the number of nodes in the FP-tree.
We also assume that the workspace needed for calling
{\em FP-growth*} on an FP-tree $T$ is roughly 10\%
of the size of $T$.
Here, 10\% is a liberal assumption according to the
experimental results in \cite{HPY00}.
Later in this section, a more accurate value will be given.
If the size of the FP-tree is more than $0.9\cdot M$,
we conclude that $M$ is not big enough to store the whole
1365: FP-tree $T_{\alpha}$.
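
The $0.9\cdot M$ threshold is consistent with the 10\% workspace assumption,
as the following rough check suggests
(a back-of-the-envelope sketch, where $s$ is a symbol introduced here
for the size of $T_{\alpha}$).
The tree and its workspace fit in main memory when
\[
s + 0.1\cdot s \leq M, \qquad \mbox{i.e.,} \qquad s \leq M/1.1 \approx 0.91\cdot M,
\]
which is close to the $0.9\cdot M$ bound used above.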
1366: 
1367: 
1368: Since all memory for nodes in an FP-tree is deallocated after a call 
1369: of {\em FP-growth*} ends,
1370: a chunk of memory is allocated for each FP-tree when we create the tree,
1371: and the chunk size is changeable. 
1372: After generating all frequent 
1373: itemsets from the FP-tree, the chunk is discarded, 
1374: and all nodes in the tree are deleted.
1375: Thus we successfully avoid freeing nodes in FP-trees one by one,
1376: which would take too much time.
1377: 
1378: 
1379: \subsection{Applying the Array Technique}\label{array}
1380: 
1381: 
1382: In {\em Diskmine}, 
the array technique is also applied to save FP-tree traversals.
1384: Furthermore, when projected databases are generated,
1385: the array technique can save a great number of disk~I/O's.
1386: 
1387: Recall that in {\em trialmainmine},
1388: if an FP-tree can not be accommodated in main memory,
1389: the construction stops. 
1390: Suppose now we decided to stop
1391: scanning the database.
1392: Then later, after generating all projected databases,
1393: for a projected database with only one master item,
1394: two database scans are required to construct an FP-tree for the master item.
1395: The first scan gets all frequent items for the master item,
1396: the second scan constructs the FP-tree.
1397: For a projected database with several master items,
1398: though the FP-tree constructed from the database
1399: uses the modified item order
1400: (the order from the header of the FP-tree  in
1401: the previous level of the recursion),
1402: to construct new FP-trees for the master items,
1403: two FP-tree traversals are needed.
1404: To avoid the extra scan,
1405: in {\em Diskmine} we calculate an array for each FP-tree.
1406: When constructing the FP-tree from $\db_{\alpha}$, 
1407: if it is found that the tree can not fit in main memory,
1408: the construction of the FP-tree $T_{\alpha}$ stops,
1409: but the scan of the database $\db_{\alpha}$
1410: continues so that we finish filling the cells of 
1411: the array $A_{\alpha}$.
1412: Here, some extra disk I/O's are spent,
1413: but the payback will be that we 
1414: save one database scan for each
1415: projected database.
1416: Furthermore, finishing the scanning
1417: of $\db_{\alpha}$ 
1418: doesn't require any more main memory, 
1419: since the array $A_{\alpha}$
1420: is already there.
1421: 
1422: From the array, for each projected database,
1423: the count of each pair of master items and 
1424: the count of each pair of master item and slave item
1425: can be known.
1426: As an example,
suppose a projected database is only for one
1428: master item $i_j$
1429: and slave items $i_1, \ldots, i_{j-1}$.
1430: To mine all frequent itemsets,
1431: from the line for $i_j$ in the array,
1432: accurate counts for 
1433: $[i_j, i_{j-1}],
1434: [i_j, i_{j-2}],
1435: \ldots,
1436: [i_j, i_1]$
1437: can be easily found.
1438: If there were no array
1439: we would need an extra database scan.
1440: 
1441: 
1442: With the array, we can also make a projected database 
1443: drastically smaller.
1444: In the definition of $\db_{\alpha.\beta_j}$,
1445: we see that 
$\db_{\alpha.\beta_j}$ is a $\{\beta_1,\ldots,\beta_j\}$-database.
1447: Actually, by checking the array $A_{\alpha}$,
1448: if a slave item is found not frequently co-occurring 
1449: with any master item in $\beta_j$,
1450: it's useless to include the slave item in $\db_{\alpha.\beta_j}$,
1451: because no frequent itemsets mined from $\db_{\alpha.\beta_j}$
1452: will contain that slave item.
For the same reason,
if we also find that a master item $a$ is not frequent with any
other master item or slave item,
it will not be written to $\db_{\alpha.\beta_j}$
either.
However, the frequent itemset $\alpha.a$ is output.
Furthermore,
if from the array we see that a master item $a$ is
only frequent with one item (master or slave) $b$,
the frequent itemsets $\alpha.a$ and $\alpha.a.b$
are output directly,
1464: and item $a$ will not appear in $\db_{\alpha.\beta_j}$.  
1465: Therefore, by looking through the array, 
1466: we find all slave items, 
1467: such that they are not frequent with any master item in $\beta_j$,
1468: and all master items, such that their number of frequent items in 
1469: $\{\beta_1,\ldots,\beta_j\}$ is 0 or 1.
1470: When generating $\db_{\alpha.\beta_j}$,
1471: all those items are removed from the 
1472: transactions we put in $\db_{\alpha.\beta_j}$.
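
The following made-up instance sketches these pruning rules.
Assume that group $\beta_j$ has the master items $a$ and $c$,
that the slave items are $b$, $d$ and $e$,
that $\xi = 2$,
and that a pair counts as frequent when its count exceeds $\xi$,
following the condition $A_{\alpha}[j,k] > \xi$ in \xt{stat}.
Suppose the relevant cells of $A_{\alpha}$ are
\[
\begin{array}{c|ccc}
  & b & d & e\\ \hline
a & 5 & 0 & 2\\
c & 4 & 1 & 3
\end{array}
\qquad \mbox{and} \qquad A_{\alpha}[a,c]=1.
\]
Slave item $d$ is not frequent with either master item, so it is removed.
Master item $a$ is frequent only with $b$, so $\alpha.a$ and $\alpha.a.b$
are output directly and $a$ is removed as well.
Master item $c$ is frequent with both $b$ and $e$, so under these assumptions
only $b$, $c$ and $e$ remain in the transactions
written to $\db_{\alpha.\beta_j}$.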
1473: 
1474: 
1475: \subsection{Statistics}
1476: 
1477: \begin{table*}[ht!]
1478: \centering
1479: \begin{tabular}%{0.75\textwidth}
1480: {|r|l|} \hline
1481: $\tdb$&Number of transactions in $\db_{\alpha}$\\
1482: \hline
1483: $A_{\alpha}[j,k]$&Count of frequent item pair $\{i_j, i_k\}$ 
1484: in $\db_{\alpha}$\\
1485: \hline
1486: $\tT$&Number of transactions used for constructing  $T_{\alpha}$\\
1487: \hline
1488: $\nT$&Number of nodes in $T_{\alpha}$\\
1489: \hline
1490: $\njT$&Number of nodes in  $T_{\alpha}$ if we retain 
1491: only nodes for items $i_1, \ldots, i_j$\\
1492: \hline
$\mjT$&Number of nodes in
$T_{\alpha}$,
where a node $P$ for item $i_k$ is counted if\\
1496: &it satisfies the following conditions: 1) $P$ is in a branch that contains $i_j$\\
1497: &2) $i_k \in \{i_1, \ldots, i_j\}$ 3) $A_{\alpha}[j,k] > \xi$\\
1498: \hline
1499: 
1500: 
1501: \end{tabular}
1502: \caption{Statistics Information}
1503: \label{stat}
1504: \end{table*}
1505: 
1506: Algorithm
1507: {\em Diskmine} collects some statistics on the
1508: partial FP-tree $T_{\alpha}$ 
1509: and the rest of database $\db_{\alpha}$,
1510: for the purpose of
1511: grouping items together.
1512: \xt{stat} shows the statistics information.
1513: In the table, 
1514: $\db_{\alpha}$ is the original database or the current projected database,
1515: and {\em freqstring}($\db_{\alpha}$)=
1516: $i_1\ldots i_j\ldots i_k \ldots i_n$.
1517: The partial FP-tree is $T_\alpha$ 
1518: and $\xi$ is the 
1519: absolute value of the minimum support.
1520: 
1521: In the table, 
1522: the array discussed in \xs{array}
1523: is also listed as statistics.
1524: Values for the cells of 
1525: the array are accumulated during the construction of
1526: the partial $T_{\alpha}$.
1527: If {\em trialmainmine} is aborted, the rest
1528: of the statistics 
1529: is collected by scanning the
1530: remaining part of $\db_{\alpha}$.
1531: Values in  
1532: $\njT$ 
1533: can also be obtained
1534: during the construction of $T_{\alpha}$.
1535: Here
1536: $\njT$ 
1537: records the size of the FP-tree after
1538: $T_{\alpha}$ is trimmed and only contains items $i_1, \ldots, i_j$.
1539: Notice that
1540: $\nT$ 
1541: is equal to 
1542: $\nu[n](T_{\alpha})$.
1543: This is  also the size of a tree that can fit in main memory.
The value for
$\mjT$
can be obtained
by traversing $T_{\alpha}$ once;
it gives the size of the FP-tree $T_{\alpha.i_j}$
as constructed from the partial $T_{\alpha}$.
1549: 
It might seem that
collecting all these statistics
is a large overhead.
However,
since all work is done in main memory,
it doesn't take much time,
and the time saved on disk I/O's
is far more than the time spent on gathering statistics.
1558: 
1559: 
1560: \subsection{Grouping items}
1561: 
1562: In \xf{appa},
1563: the fourth line computes a grouping $\beta_1\beta_2\cdots \beta_k$
1564: of ${\mit freqstring}(\db_{\alpha})$.
1565: Each string $\beta$ 
1566: corresponds to a group and each $\beta$ consists of at least one item.
1567: For each $\beta$, 
1568: a new projected database $\db_{\alpha.\beta}$
1569: will be computed from $\db_{\alpha}$, 
1570: then written to disk and read from disk later.
1571: Therefore,
1572: the more groups, 
1573: the more disk I/O's. 
1574: In other words,
1575: there should be as many items in each
1576: $\beta$ as possible. 
1577: To group items,
1578: two questions have to be answered.
1579: \begin{enumerate}
1580: \item If $\beta$ currently only has one item $i_j$, 
1581: after projection, is the main memory big enough for
1582: accommodating $T_{\alpha.i_j}$ constructed from 
1583: $\db_{\alpha.i_j}$
1584: and running the {\em FP-growth*} method on $T_{\alpha.i_j}$?
1585: \item If more items are put in $\beta$,
1586: after projection, is the main memory big enough for
1587: accommodating $T_{\alpha.\beta}$ constructed from $\db_{\alpha.\beta}$
1588: and running {\em FP-growth*} on $T_{\alpha.\beta}$ only
1589: for items in $\beta$?
1590: \end{enumerate}
1591: 
1592: Answering the first question is pretty easy,
1593: since for each item $i_j$, 
1594: the number
1595: $\mjT$
1596: gives the size of an FP-tree if the tree
1597: is constructed from the partial FP-tree $T_{\alpha}$.
1598: Therefore 
1599: $\mjT$
1600: can be used to estimate the 
1601: size of FP-tree $T_{\alpha.i_j}$.
1602: By the assumption that
1603: the transactions in $\db_{\alpha}$ are evenly
1604: distributed and that
1605: the partial $T_{\alpha}$ 
1606: represents
1607: the whole FP-tree for $\db_{\alpha}$,
1608: the estimated size of FP-tree $T_{\alpha.i_j}$
1609: is 
1610: $\mjT\cdot \tdb/\tT$.
1611: 
1612: 
1613: Before answering the second question,
1614: we introduce the {\em cut point}
1615: from which the first group can be easily found.
1616: 
1617: \medskip
1618: 
1619: \noindent
1620: {\bf Finding the cut point.} 
1621: Recall the order that {\em FP-growth*} uses in mining frequent itemsets.
1622: Starting from the least frequent item $i_n$,
all frequent itemsets that contain $i_n$ are mined first.
1624: Then the process is repeated for
1625: $i_{n-1}$, and so on.
1626: Notice that when mining frequent itemsets for $i_k$,
1627: all frequency information about $i_{k+1},\ldots,i_n$ is useless.
1628: Thus, though a complete FP-tree $T_\alpha$ constructed from $\db_\alpha$
1629: could not fit in main memory,
1630: we can find many $k$'s such that the 
1631: trimmed FP-tree containing only 
1632: nodes for items $i_k, \ldots, i_1$
1633: will fit into main memory.
1634: All frequent itemsets for  $i_k, \ldots, i_1$
1635: can be then mined from one trimmed tree.
1636: We call the biggest of such $k$'s the {\em cut point}.
1637: At this point, main memory is big enough 
1638: for storing the FP-tree
1639: containing only $i_k, \ldots, i_1$, 
1640: and there is also enough main memory for running
1641: {\em FP-growth*} on the tree.
1642: Obviously, if the cut point $k$ can be found, 
1643: items  $i_k, \ldots, i_1$ can be grouped together. 
1644: Only one projected database is needed for $i_k, \ldots, i_1$.
1645: 
There are two ways to estimate the cut point.
One way is to get the cut point from the values of
1648: $\tdb$
1649: and 
1650: $\tT$
1651: in \xt{stat}.
1652: \xf{divi} illustrates the intuition behind the cut point.
1653: In the figure,
1654: since the partial FP-tree for 
1655: $\tT$
1656: of 
1657: $\tdb$
transactions can be
accommodated in main memory,
1660: we can expect that the FP-tree containing  $i_k, \ldots, i_1$,
1661: where 
1662: $k=\lfloor n \cdot \tT /\tdb \rfloor$,
1663: also will fit in main memory.
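
For instance, under the illustrative assumptions $n=100$,
$\tT = 250{,}000$ and $\tdb = 1{,}000{,}000$
(numbers chosen only for this example), the estimate gives
\[
k=\lfloor n \cdot \tT /\tdb \rfloor
 = \lfloor 100\cdot 250{,}000/1{,}000{,}000 \rfloor = 25.
\]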
1664: 
1665: \begin{figure}[h]
1666: \centerline{\psfig{figure=figures/division,height=1.25in}}
1667: \caption{Cut Point. Here
1668: $l=\tT$, and $m=\tdb$}
1669: \label{divi}
1670: \end{figure}
1671: 
1672: The above method 
1673: works well
1674: for many databases, 
1675: especially for those databases whose corresponding 
1676: FP-trees have plenty of sharing of prefixes for items
1677: from $i_1$ to the cut point.
1678: However, 
1679: if the FP-tree constructed from a database
1680: doesn't share prefixes that much,
1681: the estimation could fail, 
1682: since now the FP-tree 
1683: for items from $i_1$ to the cut point
1684: could be too big.
1685: Thus, 
1686: we have to consider another method.
1687: In \xt{stat},
1688: $\njT$
1689: records the size of the FP-tree after
1690: the partial FP-tree $T_\alpha$ is trimmed and only 
1691: contains items $i_1, \ldots, i_j$.
1692: Based on 
1693: $\njT$
1694: the number of nodes 
1695: in the complete FP-tree
1696: for item $i_j$ 
1697: can be estimated as  
1698: $\njT \cdot \tdb/\tT$.
1699: Now, finding the cut point becomes finding the biggest $k$ such that
1700: $\nu[k](T_{\alpha}) \cdot \tdb/\tT \leq \nT$, 
1701: and
1702: $\nu[k+1](T_{\alpha}) \cdot 
1703: \tdb/\tT > \nT$.
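
For a concrete picture of this test, assume the made-up values
$\tdb/\tT = 4$ and $\nT = 100{,}000$, with
$\nu[10](T_{\alpha}) = 24{,}000$ and $\nu[11](T_{\alpha}) = 27{,}000$
(all four numbers are assumptions for illustration). Then
\begin{eqnarray*}
\nu[10](T_{\alpha}) \cdot \tdb/\tT &=& 96{,}000 \;\leq\; 100{,}000,\\
\nu[11](T_{\alpha}) \cdot \tdb/\tT &=& 108{,}000 \;>\; 100{,}000,
\end{eqnarray*}
so the cut point is $k=10$.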
1704: 
1705: 
1706: Sometimes the above estimation only guarantees 
1707: that the main memory is big enough for 
1708: the FP-tree which contains all items between $i_1$ and the cut point, 
1709: while it doesn't guarantee 
1710: that the descendant trees from that FP-tree can fit in main memory. 
1711: This is because the estimation doesn't consider the 
1712: size of descendant trees correctly 
1713: (in \xs{memory}, we assumed that the size of a conditional tree is 10\%
1714: of its nearest ancestor tree).
1715: Actually, from 
1716: $\mjT$
1717: we can get a more accurate estimation of the size of the 
1718: biggest descendant tree.
To find the cut point,
we need to find the biggest $k$
such that
$(\nu[k](T_{\alpha}) +
\mjT)\cdot
\tdb/\tT \leq \nT$, and
$(\nu[k+1](T_{\alpha}) +
\mu[m](T_{\alpha}))\cdot
\tdb/\tT > \nT$,
where
$j\leq k$,
$\mjT = {\mit max}_{l\in\{1,\ldots,k\}}\mu[l](T_{\alpha})$,
and
$m\leq k+1$,
$\mu[m](T_{\alpha}) = {\mit max}_{l\in\{1,\ldots,k+1\}}\mu[l](T_{\alpha})$.
1734: 
1735: \medskip
1736: 
1737: \noindent
1738: {\bf Grouping the rest of the items.}
1739: Now we answer the second question, how to put more items into a group?
1740: Here we still need 
1741: $\mjT$.
1742: Starting with  
1743: \mbox{$\mu[{\mit cutpoint}+1](T_{\alpha})$},
1744: we test if 
1745: $\mu[{\mit cutpoint}+1](T_{\alpha})\cdot
1746: \tdb/\tT > \nT$.
If not, we put the next item $i_{{\mit cutpoint}+2}$
into the group,
1749: and test if 
1750: \mbox{$(\mu[{\mit cutpoint}+1](T_{\alpha}) +
1751: \mu[{\mit cutpoint}+2](T_{\alpha}) 
1752: )$}
1753: $\cdot \tdb/\tT > \nT$.
We repeatedly put the next item in
1755: ${\mit freqstring}(\db)$ into the group
1756: until we reach an item $i_j$,
1757: such that
1758: $$\displaystyle\sum_{m={\mit cutpoint}+1}^{j}
1759: \mu[m](T_{\alpha})\cdot 
1760: \tdb/\tT > \nT.$$
Then, starting from $i_j$, we put items into the next group,
until every item has been assigned to a group.
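
As an illustration with made-up values, assume ${\mit cutpoint}=10$,
$\tdb/\tT = 4$, $\nT = 100{,}000$, and
$\mu[11](T_{\alpha}) = 9{,}000$, $\mu[12](T_{\alpha}) = 8{,}000$,
$\mu[13](T_{\alpha}) = 10{,}000$. The successive tests give
\begin{eqnarray*}
9{,}000 \cdot 4 &=& 36{,}000 \;\leq\; 100{,}000,\\
(9{,}000+8{,}000) \cdot 4 &=& 68{,}000 \;\leq\; 100{,}000,\\
(9{,}000+8{,}000+10{,}000) \cdot 4 &=& 108{,}000 \;>\; 100{,}000.
\end{eqnarray*}
The bound is first exceeded at $i_{13}$, so under these assumptions the group
consists of $i_{11}$ and $i_{12}$, and the next group starts from $i_{13}$.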
1763: 
1764: Why can we group items together?
1765: This is because
1766: even if we construct 
1767: $T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$
1768: from the projected databases 
1769: $\db_{\alpha.\beta_{i_j}}, \ldots, \db_{\alpha.\beta_{i_k}}$
1770: and put all of them into main memory, 
1771: the main memory is big enough according to the grouping condition.
1772: At this stage, $T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$
1773: all can be constructed by scanning $\db_\alpha$ once.
1774: Then we mine frequent itemsets from the FP-trees.
1775: However, we can do better.
1776: Obviously $T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$ overlap a lot,
1777: and the total size of the trees is
1778: definitely greater than the size of $T_{\alpha.\beta}$.
It also means that we can put more items into
each $\beta$,
provided that the size of $T_{\alpha.\beta}$
is estimated to fit in main memory.
1783: To estimate the size of $T_{\alpha.\beta}$, part of 
1784: $T_{\alpha}$ 
1785: has to be traversed by following the links for 
1786: the master items in $T_{\alpha}$.
1787: 
1788: 
1789: 
1790: \subsection {Database projection}
After all items have been assigned to groups, 
the original database is projected into smaller databases according to  
Definition \ref{four}.
To save disk I/O's, three techniques can be used:
1795: \begin {enumerate}
1796: \item
In a group $\beta$, if the number of master items is greater than 
half of the number of frequent items 
(this often happens in the group that contains the cut point),
then $\db_{\alpha.\beta}$ need not be 
computed. 
To mine all frequent itemsets,
$T_{\alpha.\beta}$ can be constructed directly from $\db_{\alpha}$ 
by reading it once. 
This is because $\db_{\alpha.\beta}$ 
is not much smaller than $\db_{\alpha}$,
while reading $\db_{\alpha}$ once 
takes fewer disk I/O's than writing and then reading 
$\db_{\alpha.\beta}$ once.  
1810: 
1811: \item
Since 
the partial tree $T_{\alpha}$, 
now in main memory, 
records all frequency 
information of the transactions that have
been read so far, 
the frequency information of those transactions 
can be obtained from $T_{\alpha}$
when computing the projected databases.
Thus
disk I/O's are spent only on reading those transactions
that did not contribute to $T_{\alpha}$.
1824: 
1825: \item
As discussed in \xs{array},
the array technique lets us find,
in group $\beta_j$, all slave items
that are not frequent together with any master item in $\beta_j$,
and all master items whose number of frequent 
items in $\{\beta_1,\ldots,\beta_j\}$ is 0 or 1.
When computing $\db_{\alpha.\beta_j}$,
all those items are removed from the new transactions in $\db_{\alpha.\beta_j}$.
1834: \end{enumerate}
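
\medskip

\noindent
The trade-off behind the first technique can be made explicit with a 
short calculation. 
As an assumption made only for this illustration, 
let $D_{\alpha}$ denote the size of $\db_{\alpha}$, 
let $c \cdot D_{\alpha}$ with $0 < c \leq 1$ be the size of $\db_{\alpha.\beta}$, 
and let reading and writing a block cost one disk I/O each.
Writing $\db_{\alpha.\beta}$ and later reading it back then takes 
\[
2 \cdot c \cdot D_{\alpha}/B
\]
disk I/O's, whereas constructing $T_{\alpha.\beta}$ directly 
from $\db_{\alpha}$ takes one read of $D_{\alpha}/B$ disk I/O's.
Direct construction is therefore cheaper whenever
\[
2 \cdot c \cdot D_{\alpha}/B > D_{\alpha}/B,
\quad\mbox{that is, whenever}\quad c > 1/2 .
\]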
1835: 
1836: 
1837: 
1838: \subsection {The disk I/O's}
Let us recount the disk I/O's used in {\em Diskmine}.
The first scan finds all frequent items in $\db_{\epsilon}$,
which needs $D/B$ disk I/O's.
In the second scan we construct a partial FP-tree $T_{\epsilon}$, 
then continue scanning the rest of the database for statistics,
which needs another $D/B$ disk I/O's.
Suppose then that $k$ projected databases have to be computed.
According to \xs{diskmine},
the total size of the projected databases is 
approximately $k/2 \cdot D$.
For computing the projected databases,
the frequency information in $T_{\epsilon}$ is reused,
so only part of $\db_{\epsilon}$ is read. 
We assume that on average half of $\db_{\epsilon}$ is read at this stage, 
which means $1/2\cdot D/B$ disk I/O's.
Writing and later reading the $k$ projected databases
takes $2\cdot k/2\cdot D/B = k\cdot D/B$ disk I/O's.
Suppose all frequent itemsets can be mined from the projected databases
without going to the third level.
Then the total number of disk I/O's is
1859: \negvs
1860: \begin{eqnarray}
1861: 3/2 \cdot D/B 
1862: +
1863: k\cdot D/B
1864: \end{eqnarray}
1865: 
1866: Compared with formula \ref{formula}, 
1867: {\em Diskmine} saves at least 
1868: $k/2 \cdot D/B$
1869: disk I/O's,
1870: thanks to the various techniques used in the algorithm.
1871: 
1872: \section {Experimental Evaluation and Performance Study}
1873: 
In this section, we present the results of
a performance comparison of 
{\em Diskmine} with the {\em Parallel Projection Algorithm} of 
\cite{HPYM04} and the {\em Partitioning Algorithm} introduced 
in \cite{SON95}. 
We also analyze the scalability of {\em Diskmine}
and validate the accuracy of our memory size
estimates.
1882: 
1883:  
As mentioned in \xs{diskmine},
the Parallel Projection Algorithm is a naive divide-and-conquer
algorithm, 
since a projected database is created for each item.
For the performance comparison, 
we implemented the Parallel Projection Algorithm 
using {\em FP-growth} as the main memory method,
as introduced in \cite{HPYM04}.
The
Partitioning Algorithm is also a divide-and-conquer algorithm.
We implemented 
it on top of the Apriori implementation
\cite{gap}.
We chose this implementation because
it is well written and easy to adapt
for our purposes.
1900: 
1901: 
We ran the three algorithms on 
both synthetic and real datasets.
Some synthetic datasets have millions of transactions,
and the size of the datasets ranges from several megabytes to 
several hundred gigabytes. 
For brevity,
only the results for some synthetic datasets and one real dataset
are shown here.
1910: 
1911: 
1912: 
All experiments were performed on a 2.0~GHz Pentium 4 with
256 MB of memory under Windows XP.
For {\em Diskmine} and the Parallel Projection Algorithm,
the size of the main memory is given as an input.
The Partitioning Algorithm 
makes only two database scans, and each main-memory-sized partition,
together with all data structures for Apriori, 
is kept in main memory.
For this algorithm the size of main memory is therefore not controlled,
and only the running time is recorded. 
1923: 
1924: 
We first compared the performance of the three algorithms on a synthetic dataset.
Dataset {\em T100I20D100K} was generated with the
application from \cite{syns}.
The dataset has 100,000 transactions and 1000 items,
and occupies about 40 megabytes.
The average transaction length is 100,
and the average pattern length is 20.
The dataset is very sparse, and the FP-tree constructed from it
is bushy. 
For Apriori, 
a large number of candidate frequent itemsets 
is generated from the dataset. 
When running the algorithms, the main memory size
was set to 128 megabytes.
\xf{SynReal}(a) shows the experimental results.
In the figure, ``Naive Algorithm'' 
denotes the Parallel Projection Algorithm,
and 
``Aggressive Algorithm'' denotes the {\em Diskmine} algorithm.
1944: 
1945: 
1946: \begin{figure}[h]
1947:     \begin{minipage}[t]{2in}
1948:        \centerline{\psfig{figure=figures/synthetic,height=1.7in}}
1949:        \center{\small (a)}
1950:     \end{minipage}
1951:     \hfill
1952:     \begin{minipage}[t]{2in}
1953:        \centerline{\psfig{figure=figures/total,height=1.7in}}
1954:        \center{\small (b)}
1955:     \end{minipage}
1956:     \hfill
1957:     \begin{minipage}[t]{2in}
1958:        \centerline{\psfig{figure=figures/realdata,height=1.7in}}
1959:        \center{\small (c)}
1960:     \end{minipage}
1961:   \caption{{\small Experiments on Synthetic Data and Real Data}}
1962:   \label{SynReal}
1963: \end{figure}
1964: 
From \xf{SynReal}(a) 
we can see that the Partitioning Algorithm is the slowest in the group. 
The Naive Algorithm, however, is not slower than the Aggressive Algorithm 
if we only compare their CPU time. 
In \cite{fimi03}, 
where we were concerned with main memory mining,
we found that if a dataset is sparse, the
boosted {\em FPgrowth*} method performs much better than
the original {\em FP-growth}. 
The reason why the CPU time of the Aggressive Algorithm is not always
less than that of the Naive Algorithm here is
that the Aggressive Algorithm
has to spend CPU time on calculating statistics.
On the other hand, as expected,
we can see in the figure that
the disk I/O time of the Aggressive Algorithm is 
orders of magnitude smaller than that of the Naive Algorithm. 
In \xf{SynReal}(b) we compare the total running times.
We can see that the CPU overhead incurred by the Aggressive
Algorithm becomes insignificant compared to
the savings in disk I/O. 
1986: 
1987: 
1988: 
1989: 
We then ran the algorithms on the real dataset {\em Kosarak},
which is used as a test dataset in \cite{ZB03}.
The dataset is about 40 megabytes. 
Since it is a dense dataset and its FP-tree is fairly small,
we set the main memory size to 16 megabytes for the experiments.
The results are shown in \xf{SynReal}(c).
1996: 
In \xf{SynReal}(c),
the Partitioning Algorithm is still the slowest.
This is because it generates too many candidate frequent itemsets.
Together with the data structures, 
these candidate sets exhaust main memory, so
virtual memory had to be used.
We can again notice that the CPU time of the Naive Algorithm
is less than that of the Aggressive Algorithm.
This is because {\em Kosarak} is a dense dataset, so
the array technique does not help much.
In addition, calculating the 
statistics takes considerable time.
The disk I/O's for the Aggressive Algorithm are still 
remarkably fewer than those for the Naive Algorithm.
2011: 
2012: 
To test the effectiveness of the techniques for grouping items,
we ran {\em Diskmine} on 
{\em T100I20D100K} to see how 
close
the estimated FP-tree size for each group is to its real size.
We again set the main memory size to 128 megabytes, 
and the minimum support to 2\%. 
When generating the projected databases, 
the items were grouped into 7 groups 
(the total number of frequent items
is 826).
As we can see from \xf{Effect}(a),
in all groups 
the estimated size is always slightly larger
than the real size. 
2028: Compared with the Naive Algorithm, 
2029: which constructs an FP-tree for each item from its projected database,
2030: the Aggressive Algorithm almost fully 
2031: uses the main memory for each group to
2032: construct an FP-tree.   
2033: 
2034: \begin{figure}[ht!]
2035:     \begin{minipage}[t]{1.5in}
2036:        \centerline{\psfig{figure=figures/versus,height=1.25in}}
2037:        \center{\small (a)}
2038:     \end{minipage}
2039:     \hfill
2040:     \begin{minipage}[t]{1.5in}
2041:        \centerline{\psfig{figure=figures/scalability,height=1.25in}}
2042:        \center{\small (b)}
2043:     \end{minipage}
2044:   \caption{{\small Estimation Effect and Scalability of {\em Diskmine}}}
2045:   \label{Effect}
2046: \end{figure}
One of the most important 
properties of {\em Diskmine}, as a divide-and-conquer algorithm, 
is its good scalability.
We ran {\em Diskmine} on a set of synthetic datasets.
In all datasets, 
the number of items was set to 10,000,
the average transaction length to 100,
and the average pattern length to 20.
The number of transactions in the datasets 
varied from 200,000 to 2,000,000,
and the dataset size ranged from 100 megabytes to 1 gigabyte.
The minimum support was set to 1.5\%, 
and the available main memory was 128 megabytes.
\xf{Effect}(b) shows the results.
In the figure, the CPU time and the disk I/O time 
always stay within a small range of acceptable values.
Even for the dataset with 2 million transactions,
the total running time is less than 1000 seconds.
2065: Extrapolating from these figures using formula (4),
2066: we can conclude that a dataset the size of the 
2067: Library of Congress collection (25 Terabytes)
2068: could be mined in around 18 hours with current technology. 
2069: 
2070: 
2071: \section{Conclusions}
2072: 
We have introduced several divide-and-conquer algorithms 
for mining frequent itemsets from secondary memory.
We have analyzed the 
recurrences and disk I/O's of all algorithms.

We then gave a detailed divide-and-conquer
algorithm 
that almost fully uses the limited main memory
and saves a large number of disk I/O's.
We introduced many novel techniques 
used in our algorithm.

Our
experimental results show
that our algorithm
successfully reduces the number of disk accesses,
sometimes by orders of magnitude,
and that our algorithm scales up to
terabytes of data.
The experiments also confirm that
the estimation techniques used in 
our algorithm are accurate.
2095: 
2096: 
2097: 
For future work,
we note that 
there are very few efficient algorithms
for mining 
{\em maximal} frequent itemsets and {\em closed} 
frequent itemsets \cite{PBT99,PHM00,WHP03,Zaki02}
from very large databases.
Unlike in {\em Diskmine},
where the frequent itemsets mined from all projected databases
are globally frequent,
a maximal frequent itemset or a
closed frequent itemset mined from a projected database
is only locally maximal or closed.
The challenge is that
a data structure, whose size may itself be very large,
must be maintained for recording all maximal or closed frequent itemsets
discovered so far.
We also note that 
our implementation of the Partitioning Algorithm is
based on an existing Apriori implementation,
which is not necessarily highly optimized.
There are situations
in which a database does not yield
too many candidate itemsets,
but the FP-tree constructed from the database is quite big.
In such a situation the
Partitioning Algorithm needs only two database scans, 
and all frequent itemsets can be mined in main memory,
or with very little I/O for keeping the
candidate sets in virtual memory.
{\em Diskmine} also needs two database scans in this situation,
and it additionally
needs to 
decompose the database. 
Therefore, exploring whether some clever disk-based data structure
would make the partitioning approach scale
is another interesting direction for further research.
2137: 
2138: 
2139: \begin{thebibliography}{icdm}
2140: 
2141: 
2142: 
2143: \bibitem{syns}
2144: \newblock {\tt www.almaden.ibm.com/software/quest}
2145: 
2146: 
2147: %\bibitem{ZB03a}
2148: %\newblock {\tt fimi.cs.helsinki.fi}
2149: 
2150: \bibitem{gap}
2151: \newblock{\tt www.cs.helsinki.fi/u/goethals/software} 
2152: 
2153: 
2154: \bibitem{AAP00}
2155: R.\ C.\ Agarwal, C.\ C.\ Aggarwal and V. V. V. Prasad,
2156: \newblock Depth first generation of long patterns,
2157: \newblock In {\em KDDM '00}, pp.\ 108-118
2158: 
2159: \bibitem{AIS93}
2160: R.~Agrawal, T.~Imielinski, and A.~Swami.
2161: \newblock Mining association rules between sets of items in large databases. 
2162: \newblock In {\em SIGMOD '93},
2163: pp.\ 207--216, 1993.
2164: 
2165: \bibitem{AS94}
2166: R.~Agrawal and R.~Srikant.
2167: \newblock Fast algorithms for mining association rules.
2168: \newblock In  {\em VLDB '94}, pp.\   487--499
2169: 
2170: \bibitem{AS95}
2171: R.~Agrawal and R.~Srikant.
2172: \newblock Mining sequential patterns.
2173: \newblock In {\em ICDE '95}, pp.\ 3--14
2174: 
2175: %\bibitem{BMS97}
2176: %S.~Brin, R.~Motwani, and C.~Silverstein.
2177: %\newblock Beyond market basket: Generalizing association rules to correlations.
2178: %\newblock In {\em Proceeding of Special Interest Group on Management of
2179: %Data}, pages 265--276, Tucson, Arizona, May 1997.
2180: 
2181: \bibitem{fimi03}
2182: G.~Grahne, J.~Zhu.
2183: \newblock Efficiently Using Prefix-trees in Mining Frequent Itemsets.
2184: \newblock In
2185: \cite{ZaBa03}
2186: % {\em 1st Workshop on Frequent Itemset Mining Implementations (FIMI'03)}
2187: %\newblock Melbourne, FL, Nov. 2003. 
2188: 
2189: \bibitem{HPY00}
2190: J.~Han, J.~Pei, and Y.~Yin.
2191: \newblock Mining frequent patterns without candidate generation.
2192: \newblock In {\em SIGMOD '00}, pp.\ 1--12
2193: 
2194: \bibitem{HPYM04}
2195: J.~Han, J.~Pei, Y.~Yin and R.~Mao.
2196: \newblock Mining frequent patterns without candidate generation: A Frequent-Pattern Tree Approach.
2197: \newblock In {\em Data Mining and Knowledge Discovery}, Vol. 8, pages 53-87, 2004.
2198: 
2199: \bibitem{KHC97} 
2200: M.~Kamber, J.~Han and J.~Chiang.
2201: \newblock Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes.
2202: \newblock In {\em KDDM '97}, pp.\ 207--210
2203:   
2204: 
2205: %\bibitem{LLN00}
2206: %V.S. Lakshmanan, C. Leung, and R. Ng. 
2207: %\newblock The Segment Support Map: Scalable Mining of Frequent Itemsets. 
2208: %\newblock In {\em SIGKDD Explorations Special Issue on Scalable Data Mining}, 
2209: %\newblock Volume 2, Issue 2, pages 21-27. December 2000.
2210: 
2211: %\bibitem{MT97}
2212: %H. Mannila and H. Toivonen. 
2213: %\newblock Levelwise search and borders of theories in knowledge discovery. 
2214: %\newblock In {\em Data Mining and Knowledge Discovery}, 
2215: %\newblock Vol. 1, 3(1997), pages 241-258.
2216: 
2217: \bibitem{MTV94}
2218: H. Mannila, H. Toivonen, and I. Verkamo.
2219: \newblock Efficient algorithms for discovering association rules. 
2220: \newblock In {\em KDDM '94},
2221: pp.\ 181--192. 
2222: 
2223: \bibitem{MTV97}
2224: H. Mannila, H. Toivonen, and I. Verkamo.
2225: \newblock Discovery of Frequent Episodes in Event Sequences.
2226: \newblock In {\em Data Mining and Knowledge Discovery}.
2227: \newblock Volume 1, 3(1997), pages 259--289.
2228: 
2229: \bibitem{PBT99}
2230: N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal.
2231: \newblock Discovering frequent closed itemsets for association rules.
2232: \newblock In {\em ICDT'99}, Jan. 1999.
2233: 
2234: \bibitem{PHM00}
2235: J. Pei, J. Han and R. Mao,
2236: \newblock CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets.
2237: \newblock In {\em {ACM} {SIGMOD} Workshop on Research Issues in Data Mining and Knowledge Discovery}, pages 21-30, 2000.
2238: 
2239: 
2240: %\bibitem{STA98}
2241: %S.~Sarawagi, S.~Thomas, and R.~Agrawal.
2242: %\newblock Integrating association rule mining with relational database systems:
2243: %  Alternatives and implications. 
2244: %\newblock In {\em Proceeding of Special Interest Group on Management of Data}, pages 343--354, 1998.
2245: 
2246: \bibitem{SON95}
2247: A.~Savasere, E.~Omiecinski, and S.~Navathe.
2248: \newblock An efficient algorithm for mining association rules in large
2249:   databases. 
2250: \newblock In {\em VLDB '95}, pp. 432--443
2251: 
2252: \bibitem{Toiv96}
2253: H.~Toivonen.
2254: \newblock Sampling large databases for association rules. 
2255: \newblock In {\em VLDB '96}, pp.\ 134--145
2256: 
2257: \bibitem{WHP03}
2258: J. Wang, J. Han, and J. Pei.
2259: \newblock CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets.
2260: \newblock In {\em Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03)}, Washington, D.C., Aug. 2003.
2261: 
2262: 
2263: \bibitem{ZB03}
B.~Goethals and M.~J.~Zaki (Eds.)
{\em Proceedings 
of the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations
(FIMI '03)}.
CEUR Workshop Proceedings, Vol.\ 90,
\verb+http://CEUR-WS.org/Vol-90+
2270: 
2271: 
2272: 
2273: \bibitem{ZaBa03}
2274: Bart Goethals and Mohammed J. Zaki.
2275: \newblock Advances in Frequent Itemset Mining Implementations: Introduction to FIMI03.
2276: \newblock In {\em 1st Workshop on Frequent Itemset Mining Implementations (FIMI'03)}
2277: \newblock Melbourne, FL, Nov. 2003.
2278: 
2279: 
2280: %\bibitem{Zaki00}
2281: %M. J. Zaki.
2282: %\newblock Scalable algorithms for association mining. 
2283: %\newblock In {\em IEEE Transactions on Knowledge and Data Mining},
2284: %\newblock 12(3):372-390, May-June 2000.
2285: 
2286: \bibitem{Zaki02}
2287: M. J.~Zaki and C.~Hsiao.
2288: \newblock CHARM: An Efficient Algorithm for Closed Itemset Mining.
2289: \newblock In {\em Proceeding of The 2nd SIAM International Conference on Data Mining},
2290: \newblock Arlington, April 2002.
2291: 
2292: \bibitem{Zaki03}
2293: M. J. Zaki and Karam Gouda.
2294: \newblock Fast Vertical Mining Using Diffsets.
2295: \newblock In {\em KDDM '03},
2296: pp.\ 326--335 
2297: 
2298: 
2299: 
2300: 
2301: 
2302: \end{thebibliography} 
2303: 
2304: \end{document}
2305: 
2306: 
2307: \newpage
2308: 
2309: 
2310: \begin{centering}
2311: 
2312: \begin{figure}
2313: \begin{minipage}[t]{6.5in}
2314: \centerline{\psfig{figure=figures/append1,height=3.25in}}
2315: \caption{Recurrence structure of Naive Projection Algorithm}
2316: \end{minipage}
2317: \end{figure}
2318: 
2319: 
2320: \begin{figure}
2321: \begin{minipage}[t]{6.5in}
2322: \centerline{\psfig{figure=figures/append2,height=3.25in}}
2323: \caption{Recurrence structure of Aggressive Projection Algorithm}
2324: \end{minipage}
2325: \end{figure}
2326: 
2327: \end{centering}
2328: 
2329: 
2330: 
2331: 
2332: 
2333: 
2334: