
1: \section{Introduction}

2:

3: The enormous growth of information available in database systems

4: has pushed a significant development of techniques for knowledge

5: discovery in databases. At the heart of the knowledge discovery

6: process there is the application of data mining algorithms that

7: are in charge of extracting hidden relationships holding among

8: pieces of information stored in a given database \cite{KDDBook}.

9: Widely used data mining algorithms include classification

10: techniques, clustering analysis and association rule induction

11: \cite{agraw93}. In this paper, we focus on the last of these data mining

12: techniques. Informally speaking, an association rule states that a

13: conjunction of conditions implies a consequence. For instance, the

14: rule {\em hamburger, fries} $\Rightarrow$ {\em soft$-$drink},

15: induced from a purchase database, states that a customer purchasing

16: hamburgers and fries also purchases a soft-drink.

17:

18: An association rule induced from a database is interesting if it describes a

19: relationship that is, in a sense, \quo{valid} as far as the information stored

20: in the database is concerned. To assess such validity, {\em indices} are used,

21: which are functions, usually with values in $[0,1]$, that tell to what extent an

22: extracted association rule describes knowledge valid in the database at hand.

23: For instance, a {\em confidence} value of 0.7 associated with the rule above tells

24: that 70 percent of purchases including hamburgers and fries also include a

25: soft-drink. In the literature, several index definitions have been provided

26: (see e.g. \cite{BayAgr99}, where many interestingness criteria are proposed).

27: Clearly, information patterns expressed in the form of association rules

28: and associated indices indeed denote knowledge that can be useful in several

29: application contexts, e.g., market basket analysis.
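
As a hedged illustration of how such an index is read, the confidence of the rule above can be taken as the fraction of transactions containing the premise that also contain the consequence. The formula below is the usual reading of confidence and is only a sketch pending the formal definitions of the indices; the notation $n(X)$, for the number of transactions that include all items of $X$, and the counts $200$ and $140$ are hypothetical values chosen for illustration, not data from an actual database:
\[
\mathit{confidence}
  = \frac{n(\{\mathit{hamburger}, \mathit{fries}, \mathit{soft\mbox{-}drink}\})}
         {n(\{\mathit{hamburger}, \mathit{fries}\})}
  = \frac{140}{200} = 0.7 .
\]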

30:

31: In some application contexts, however, {\em Boolean} association

32: rules, like the one above, are not expressive enough for the

33: purposes of the given knowledge discovery task.

34: %, since, in such rules, conditions are simply attributes that can evaluate

35: %either true or false.

36: %only true.

37: In order to obtain more expressive association rules, one can

38: allow more general forms of conditions to occur therein. {\em

39: Quantitative association rules} \cite{DBLP:conf/sigmod/SrikantA96}

40: are ones where both the premise and the consequent use conditions

41: of one of the following forms: $(i)$ $A = u$; $(ii)$ $A \neq u$;

42: $(iii)$ $A' \in [l',u']$; $(iv)$ $A' \notin [l',u']$, where $A$ is

43: a {\em categorical} attribute, i.e., an attribute that is

44: associated with a discrete, unordered domain, and $u$ is a value in this

45: domain, and $A'$ is a {\em numeric} attribute, that is, one

46: associated with an ordered domain of numbers, and $l'$ and $u'$

47: ($l' \le u'$) are two, not necessarily distinct, values. For

48: instance, the quantitative rule

49: \\

50: $~~~~(hamburger \in [2,4]), (ice$-$cream$-$taste = chocolate)

51: \Rightarrow (soft$-$drink \in [1,3])$

52: \\

53: \noindent induced from a purchase database, tells that a customer

54: purchasing from 2 to 4 hamburgers and a chocolate ice-cream also

55: purchases from 1 to 3 soft-drinks.
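
To make the reading of such a rule concrete, consider a hypothetical transaction $t$ (an illustrative assumption, not taken from an actual database) in which the number of hamburgers is $3$, the ice-cream taste is {\em chocolate} and the number of soft-drinks is $2$. Checking each condition of the rule separately gives
\[
3 \in [2,4], \qquad \mathit{chocolate} = \mathit{chocolate}, \qquad 2 \in [1,3],
\]
so $t$ satisfies both the premise and the consequence of the rule. A transaction that agrees with $t$ except that the number of soft-drinks is $5$ would satisfy the premise but not the consequence, since $5 \notin [1,3]$.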

56:

57: In either form, association rule induction is a widely used

58: data mining technique: several systems based on it have been developed

59: \cite{DBLP:conf/vldb/AgrawalS94,DBLP:conf/icde/BayardoAG99}, and several

60: successful applications in various contexts have been described

61: \cite{DBLP:conf/kdd/BrijsSVW99}. Despite the widespread use of

62: association rule induction in practical applications, a thorough analysis of

63: the complexity of the associated computational tasks has not been developed.

64: However, such an analysis appears to be important since, as in other contexts,

65: an appropriate understanding of the computational characteristics of the

66: problem at hand makes it possible to single out tractable cases of generally

67: intractable problems, to isolate the sources of hardness and, overall, to devise

68: more effective approaches to algorithm development.

69:

70: As far as we know, some computational complexity analyses pertaining to

71: association rules are carried out in

72: \cite{DBLP:conf/icdt/GunopoulosMS97,DBLP:conf/dis/Morishita98,DBLP:conf/pods/MorishitaS00,WijMee98,

73: zaki98theoretical}. In \cite{DBLP:conf/dis/Morishita98} and

74: \cite{DBLP:conf/pods/MorishitaS00}, an NP-hardness result is stated regarding

75: the induction of association rules (or, in general, of {\em conditions}) having

76: an optimal {\em entropy} (resp. {\em chi-square}); in \cite{WijMee98}, under

77: some restrictive assumptions, the NP-completeness of inducing quantitative

78: association rules with a {\em confidence} and a {\em support}\footnote{Entropy,

79: confidence and support are indices (see below).} greater than two given

80: thresholds is proved along with a result stating a polynomial bound on the

81: complexity of mining quantitative rules over databases where the number of

82: possible items is constant. In \cite{DBLP:conf/icdt/GunopoulosMS97}, the

83: $\#P$-hardness of counting the number of mined association rules

84: (under the support measure) is stated, together with a specialization of the result stated in

85: Theorem \ref{th:large} below regarding boolean association rules. Furthermore,

86: \cite{zaki98theoretical} gives some results about the computational complexity

87: of mining frequent itemsets under combined constraints on the number of items

88: and on the frequency threshold.

89:

90: In this paper we define a generalized form of association rules embracing

91: the quantitative, categorical and boolean types, in which null

92: values (indicated by $\NULL$ in the following), denoting the absence of

93: information, are used.

94:

95: Nulls are often useful in practice. As an example, consider a

96: market database in which attributes correspond to available

97: products and values represent quantities sold.

98: %stored on three tables \verb+transactions+, \verb+products+ and \verb+products_sold+,

99: %where each row of \verb+products_sold+, is a tuple in which

100: %attributes correspond to the transactions code, the product code,

101: %and the sold quantity, respectively.

102: Null values can be used to denote the absence of a product in a particular

103: transaction (this is quite different from specifying the value $0$ instead). As

104: a further example, consider unavailable values in medical records representing

105: clinical cases in the analysis of patient data. We call a database allowing null

106: values a {\em database with nulls}.

107:

108: When we induce association rules from databases with nulls, we require that

109: conditions on attributes assuming the null value are always unsatisfied, i.e.,

110: that it is not possible to specify conditions on null values. A boolean

111: association rule can thus be regarded as a special case of a quantitative or

112: categorical association rule mined on a database with nulls.
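
As a small sketch of this convention, let $t$ be a tuple that assumes the value $\NULL$ on a categorical attribute $A$ and on a numeric attribute $A'$. The tuple $t$ and the notation $t \not\models c$, read as ``$t$ does not satisfy condition $c$'', are introduced here only for illustration. Then none of the four kinds of conditions holds on $t$, for any choice of $u$, $l'$ and $u'$:
\[
t \not\models (A = u), \qquad
t \not\models (A \neq u), \qquad
t \not\models (A' \in [l',u']), \qquad
t \not\models (A' \notin [l',u']).
\]
In particular, a condition and its complement can both fail on the same tuple. One possible encoding of the boolean case, stated here as an assumption rather than as the construction adopted in the paper, stores the value $1$ for each item present in a transaction and $\NULL$ for each absent item, so that the boolean rule {\em hamburger, fries} $\Rightarrow$ {\em soft$-$drink} corresponds to $(hamburger = 1), (fries = 1) \Rightarrow (soft$-$drink = 1)$.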

113:

114: In this paper, we analyze the computational complexity implied by

115: inducing

116: %quantitative

117: association rules using four of the most widely used rule quality indices, namely,

118: confidence, support, $\theta$-gain and $h$-laplace \cite{BayAgr99,agraw93}. In

119: particular, we shall show that, in the standard case, and depending on the

120: chosen index of reference, the problem is either in P or

121: NP-complete. When databases with nulls are considered, independently of the

122: reference index, the rule induction task is NP-complete.

123:

124: Despite these negative results, there are many cases where the problem turns

125: out to be very easy to solve: whenever the instance database is sparse (i.e.,

126: each transaction/tuple is very small with respect to the set of possible

127: attributes), or when the attribute set at hand has constant size, for any

128: index, we are able to show that the rule

129: induction problem is in L; furthermore, introducing some constraints on the input

130: instance leads to problems with very low complexity such as $T\!C^0$ or

131: $A\!C^0_2$. Problems with this kind of complexity are very efficiently

132: parallelizable (recall that $A\!C^0_2 \subseteq T\!C^0 \subseteq N\!C^1$,

133: whereas $L \subseteq N\!C^2$).

134: \\

135: The plan of the paper is as follows. In the following section we give

136: preliminary definitions. In Section \ref{complexity} we state general

137: complexity results about inducing association rules. Sparse databases and

138: the fixed-schema complexity of rule induction are dealt with in Sections

139: \ref{sparse} and \ref{fixed-schema}, respectively. Finally, Section

140: \ref{further} collects a set of interesting special tractable cases.

141: