cs0111009/intro.tex
1: \section{Introduction}
2: 
3: The enormous growth of information available in database systems
4: has driven significant development of techniques for knowledge
5: discovery in databases. At the heart of the knowledge discovery
6: process is the application of data mining algorithms, which
7: extract hidden relationships holding among
8: pieces of information stored in a given database \cite{KDDBook}.
9: The most widely used data mining algorithms include classification
10: techniques, clustering analysis and association rule induction
11: \cite{agraw93}. In this paper, we focus on the last of these
12: techniques. Informally speaking, an association rule states that a
13: conjunction of conditions implies a consequence. For instance, the
14: rule {\em hamburger, fries} $\Rightarrow$ {\em soft-drink},
15: induced from a purchase database, states that a customer purchasing
16: hamburgers and fries also purchases a soft-drink.
17: 
18: An association rule induced from a database is interesting if it describes a
19: relationship that is, in a sense, \quo{valid} as far as the information stored
20: in the database is concerned. To assess such validity, {\em indices} are used:
21: these are functions, usually with values in $[0,1]$, that tell to what extent an
22: extracted association rule describes knowledge valid in the database at hand.
23: For instance, a {\em confidence} value of 0.7 associated with the rule above means
24: that 70 percent of purchases including hamburgers and fries also include a
25: soft-drink. In the literature, several index definitions have been provided
26: (see, e.g., \cite{BayAgr99}, where many interestingness criteria are proposed).
27: Clearly, information patterns expressed in the form of association rules
28: and associated indices denote knowledge that can be useful in several
29: application contexts, e.g., market basket analysis.
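
As a small worked illustration (the counts are hypothetical, and the formulas
are the usual ones from \cite{agraw93}, which may differ in normalization from
the formal definitions given later), let $T$ be a purchase database with
$|T| = 100$ transactions, of which $10$ include both hamburgers and fries, and
suppose that $7$ of these $10$ also include a soft-drink. Then
\[
\mathit{confidence} = \frac{7}{10} = 0.7,
\qquad
\mathit{support} = \frac{7}{100} = 0.07 ,
\]
that is, the confidence relates the rule to the transactions satisfying its
premise, whereas the support relates it to the whole database.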
30: 
31: In some application contexts, however, {\em Boolean} association
32: rules like the one above are not expressive enough for the
33: purposes of the given knowledge discovery task.
34: %, since, in such rules, conditions are simply attributes that can evaluate
35: %either true or false.
36: %only true.
37: In order to obtain more expressive association rules, one can
38: allow more general forms of conditions to occur therein. {\em
39: Quantitative association rules} \cite{DBLP:conf/sigmod/SrikantA96}
40: are ones where both the premise and the consequent use conditions
41: of one of the following forms: $(i)$ $A = u$; $(ii)$ $A \neq u$;
42: $(iii)$ $A' \in [l',u']$; $(iv)$ $A' \notin [l',u']$, where $A$ is
43: a {\em categorical} attribute, i.e., an attribute
44: associated with a discrete, unordered domain, and $u$ is a value in this
45: domain, whereas $A'$ is a {\em numeric} attribute, that is, one
46: associated with an ordered domain of numbers, and $l'$ and $u'$
47: ($l' \le u'$) are two, not necessarily distinct, values. For
48: instance, the quantitative rule
49: \\
50: $~~~~(hamburger \in [2,4]), (ice$-$cream$-$taste = chocolate)
51: \Rightarrow (soft$-$drink \in [1,3])$
52: \\
53: \noindent induced from a purchase database, states that a customer
54: purchasing from 2 to 4 hamburgers and a chocolate ice-cream also
55: purchases from 1 to 3 soft-drinks.
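
To illustrate how such conditions are evaluated (the three transactions below
are hypothetical and serve only as a sketch), consider the rule above on the
following data:
\begin{center}
\begin{tabular}{c|c|c|c}
 & hamburger & ice-cream-taste & soft-drink \\ \hline
$t_1$ & 3 & chocolate & 2 \\
$t_2$ & 5 & chocolate & 1 \\
$t_3$ & 2 & vanilla & 3
\end{tabular}
\end{center}
Transaction $t_1$ satisfies both the premise and the consequent, since
$3 \in [2,4]$, the taste is chocolate and $2 \in [1,3]$. Transaction $t_2$
violates the premise because $5 \notin [2,4]$, and $t_3$ violates it because
its taste is not chocolate; hence neither of them affects the confidence of
the rule.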
56: 
57: In either form, association rule induction is a widely used
58: data mining technique: several systems have been developed based on it
59: \cite{DBLP:conf/vldb/AgrawalS94,DBLP:conf/icde/BayardoAG99}, and several
60: successful applications in various contexts have been described
61: \cite{DBLP:conf/kdd/BrijsSVW99}. Despite the widespread use of
62: association rule induction in practical applications, a thorough analysis of
63: the complexity of the associated computational tasks has not been developed.
64: However, such an analysis appears to be important since, as in other contexts,
65: an appropriate understanding of the computational characteristics of the
66: problem at hand makes it possible to single out tractable cases of generally
67: intractable problems, isolate the sources of hardness and, overall, devise
68: more effective approaches to algorithm development.
69: 
70: As far as we know, some computational complexity analyses pertaining to
71: association rules are carried out in
72: \cite{DBLP:conf/icdt/GunopoulosMS97,DBLP:conf/dis/Morishita98,DBLP:conf/pods/MorishitaS00,WijMee98,
73: zaki98theoretical}. In \cite{DBLP:conf/dis/Morishita98} and
74: \cite{DBLP:conf/pods/MorishitaS00}, an NP-hardness result is stated regarding
75: the induction of association rules (or, in general, of {\em conditions}) having
76: an optimal {\em entropy} (resp., {\em chi-square}); in \cite{WijMee98}, under
77: some restrictive assumptions, the NP-completeness of inducing quantitative
78: association rules with a {\em confidence} and a {\em support}\footnote{Entropy,
79: confidence and support are indices (see below).} greater than two given
80: thresholds is proved, along with a polynomial bound on the
81: complexity of mining quantitative rules over databases where the number of
82: possible items is constant. In \cite{DBLP:conf/icdt/GunopoulosMS97}, the
83: $\#P$-hardness of counting the number of mined association rules
84: (under the support measure) is stated, together with a specialization of the result stated in
85: Theorem \ref{th:large} below regarding Boolean association rules. Furthermore,
86: \cite{zaki98theoretical} gives some results about the computational complexity
87: of mining frequent itemsets under combined constraints on the number of items
88: and on the frequency threshold.
89: 
90: In this paper we define a generalized form of association rules embracing
91: the quantitative, the categorical and the Boolean types, in which null
92: values (in the following indicated by $\NULL$), denoting the absence of
93: information, are used.
94: 
95: Nulls are often useful in practice. As an example, consider a
96: market database in which attributes correspond to available
97: products and values represent quantities sold.
98: %stored on three tables \verb+transactions+, \verb+products+ and \verb+products_sold+,
99: %where each row of \verb+products_sold+, is a tuple in which
100: %attributes correspond to the transactions code, the product code,
101: %and the sold quantity, respectively.
102: Null values can be used to denote the absence of a product in a particular
103: transaction (this is quite different from specifying the value $0$ instead). As
104: a further example, consider unavailable values in medical records representing
105: clinical cases in the analysis of patient data. We call a database allowing null
106: values a {\em database with nulls}.
107: 
108: When we induce association rules from databases with nulls, we require that
109: conditions on attributes assuming the null value are always unsatisfied, i.e.,
110: that it is not possible to specify conditions on null values. A Boolean
111: association rule can thus be regarded as a special case of a quantitative or
112: categorical association rule mined on a database with nulls.
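
Concretely (writing $t[A]$ for the value of attribute $A$ in a tuple $t$, a
notation assumed here only for illustration), if $t[A] = \NULL$ for a
categorical attribute $A$ and $t[A'] = \NULL$ for a numeric attribute $A'$,
then under the requirement above none of
\[
A = u, \qquad A \neq u, \qquad A' \in [l',u'], \qquad A' \notin [l',u']
\]
is satisfied by $t$, whatever the values $u$, $l'$ and $u'$ are. In
particular, a condition and its complement can both fail on the same tuple.
One possible way to read the special case (an illustrative encoding, not
necessarily the one used in the formal development) is to store a fixed
non-null value, say $1$, for every item occurring in a transaction and $\NULL$
for every absent item; the Boolean condition ``item $A$ is present'' then
corresponds to $A = 1$.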
113: 
114: In this paper, we analyze the computational complexity implied by
115: inducing
116: %quantitative
117: association rules using four of the most widely used rule quality indices, namely,
118: confidence, support, $\theta$-gain and $h$-laplace \cite{BayAgr99,agraw93}. In
119: particular, we shall show that, in the standard case, and depending on the
120: chosen index of reference, the problem is either in P or
121: NP-complete. When databases with nulls are considered, the rule induction task
122: is NP-complete independently of the reference index.
123: 
124: Despite these negative results, there are many cases where the problem turns
125: out to be very easy to solve: whenever the instance database is sparse (i.e.,
126: each transaction/tuple is very small with respect to the set of possible
127: attributes), or when the attribute set at hand has constant size, for any
128: index, we are able to show that the rule
129: induction problem is in L; furthermore, introducing some constraints on the input
130: instance leads to problems with very low complexity such as $T\!C^0$ or
131: $A\!C^0_2$. Problems with this kind of complexity are very efficiently
132: parallelizable (recall that $A\!C^0_2 \subseteq T\!C^0 \subseteq N\!C^1$,
133: whereas $L \subseteq N\!C^2$).
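For orientation, and assuming suitably uniform circuit families, these
inclusions fit into the standard chain
\[
A\!C^0_2 \subseteq T\!C^0 \subseteq N\!C^1 \subseteq L \subseteq N\!C^2 \subseteq P ,
\]
so the sparse, constant-size and constrained cases mentioned above all lie
within $N\!C^2$.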
134: \\
135: The plan of the paper is as follows. In the following section we give
136: preliminary definitions. In Section \ref{complexity} we state general
137: complexity results about inducing association rules. Sparse databases and
138: the fixed-schema complexity of rule induction are dealt with in Sections
139: \ref{sparse} and \ref{fixed-schema}, respectively. Finally, Section
140: \ref{further} collects a set of special tractable cases.
141: