1 Introduction
There are numerous applications of data mining which fit into this framework. The canonical example from which the
problem gets its name is a supermarket. The items are products and the baskets are customer purchases at the checkout.
Determining what products customers are likely to buy together can be very useful for planning and marketing. However,
there are many other applications with varied data characteristics: for example, student enrollment in classes, word
occurrences in text documents, users' visits to web pages, and many more. We applied market-basket analysis to census data
(see Section 5).
In this paper, we address both performance and functionality issues of market-basket analysis. We improve performance
over past methods by introducing a new algorithm for finding large itemsets (an important subproblem). We enhance
functionality by introducing implication rules as an alternative to association rules (see below).
One very common formalization of this problem is finding association rules which are based on support and confidence.
The support of an itemset (a set of items) is the fraction of transactions (baskets) that contain all of the items in the set. An association rule A ⇒ B holds if the support of the itemset {A, B} exceeds a given support threshold and the confidence, the fraction of transactions containing A that also contain B, exceeds a given confidence threshold.
The existing methods for deriving the rules consist of two steps:
1. Find the large itemsets, that is, all itemsets whose support exceeds the given support threshold.
2. Generate rules from the large itemsets and keep those whose confidence exceeds the given confidence threshold.
In this paper we address both of these tasks, step 1 from a performance perspective by devising a new algorithm, and step 2 from a semantic perspective by developing conviction, an alternative to confidence.
1.1 Algorithms for Finding Large Itemsets
Much research has focused on deriving efficient algorithms for finding large itemsets (step 1). The most well-known algorithm is Apriori [AIS93b, AS94] which, like all algorithms for finding large itemsets, relies on the property that an itemset can be large only if all of its subsets are large. It proceeds level-wise. First it counts all the 1-itemsets [2] and finds the counts which exceed the threshold: the large 1-itemsets. Then it combines those to form candidate (potentially large) 2-itemsets, counts them, and determines which are the large 2-itemsets. It continues by combining the large 2-itemsets to form candidate 3-itemsets, counting them and determining which are the large 3-itemsets, and so forth.
[2] A k-itemset is an itemset with k items.
Let Lk be the set of large k-itemsets. For example, L3 might contain {{A, B, C}, {A, B, D}, {A, D, F}, . . .}. Let Ck be the set of candidate k-itemsets; this is always a superset of Lk. Here is the algorithm:

    Result := ∅; k := 1; C1 := set of all 1-itemsets;
    while Ck ≠ ∅ do
        create a counter for each itemset in Ck;
        forall transactions in the database do
            increment the counters of the itemsets in Ck which occur in the transaction;
        Lk := all candidates in Ck which exceed the support threshold;
        Result := Result ∪ Lk;
        Ck+1 := all (k+1)-itemsets which have all of their k-item subsets in Lk;
        k := k + 1;
    end
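To make the level-wise loop concrete, the following C++ sketch is an illustrative rendering of the pseudocode above; the names (apriori, contains, minSupport) are ours, support is expressed as an absolute count, and plain standard containers stand in for the hash tree discussed later. It is a sketch of the technique, not the tuned implementation.

    // Illustrative sketch of the level-wise (Apriori) loop. Itemsets and
    // transactions are sorted vectors of item ids.
    #include <algorithm>
    #include <map>
    #include <set>
    #include <vector>

    using Itemset = std::vector<int>;
    using Transaction = std::vector<int>;

    // Does (sorted) transaction t contain (sorted) itemset s?
    bool contains(const Transaction& t, const Itemset& s) {
        return std::includes(t.begin(), t.end(), s.begin(), s.end());
    }

    std::vector<Itemset> apriori(const std::vector<Transaction>& db, int minSupport) {
        std::vector<Itemset> result;
        // C1 = all 1-itemsets occurring in the data.
        std::set<Itemset> candidates;
        for (const auto& t : db)
            for (int item : t) candidates.insert({item});

        while (!candidates.empty()) {
            // One pass over the data: count every candidate.
            std::map<Itemset, int> count;
            for (const auto& t : db)
                for (const auto& c : candidates)
                    if (contains(t, c)) ++count[c];

            // Lk = candidates whose count reaches the threshold.
            std::vector<Itemset> large;
            for (const auto& c : candidates)
                if (count[c] >= minSupport) large.push_back(c);
            result.insert(result.end(), large.begin(), large.end());

            // Ck+1 = (k+1)-itemsets all of whose k-item subsets are in Lk
            // (the usual prefix join followed by a subset check).
            std::set<Itemset> lk(large.begin(), large.end());
            std::set<Itemset> next;
            for (const auto& a : large)
                for (const auto& b : large) {
                    if (!std::equal(a.begin(), a.end() - 1, b.begin()) || a.back() >= b.back())
                        continue;
                    Itemset c = a;
                    c.push_back(b.back());
                    bool allSubsetsLarge = true;
                    for (size_t i = 0; i < c.size() && allSubsetsLarge; ++i) {
                        Itemset sub = c;
                        sub.erase(sub.begin() + i);
                        allSubsetsLarge = lk.count(sub) > 0;
                    }
                    if (allSubsetsLarge) next.insert(c);
                }
            candidates = next;
        }
        return result;
    }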
Thus, the algorithm performs as many passes over the data as the maximum number of elements in a candidate itemset, checking at pass k the support for each of the candidates in Ck. The two important factors which govern performance are the number of passes made over the data and the efficiency of those passes.
To address both of these issues we introduce Dynamic Itemset Counting (DIC), an algorithm which reduces the number of passes made over the data while keeping the number of itemsets counted in any pass relatively low compared to methods based on sampling [Toi96]. The intuition behind DIC is that it works like a train running over the data with stops at intervals M transactions apart. (M is a parameter; in our experiments we tried values ranging from 100 to 10,000.) When the train reaches the end of the transaction file, it has made one pass over the data and it starts over at the beginning for the next pass. The ``passengers'' on the train are itemsets. When an itemset is on the train, we count its occurrences in the transactions that are read.
If we consider Apriori in this metaphor, all itemsets must get on at the start of a pass and get off at the end. The 1-itemsets take the first pass, the 2-itemsets take the second pass, and so on (see Figure 1). In DIC, we have the added flexibility of allowing itemsets to get on at any stop, as long as they get off at the same stop the next time the train goes around; by then, the itemset has ``seen'' all the transactions in the file. This means that we can start counting an itemset as soon as we suspect it may be necessary to count it, instead of waiting until the end of the previous pass.
Figure 1: Apriori and DIC

For example, if we are mining 40,000 transactions and M = 10,000, we will count all the 1-itemsets over the first 40,000 transactions we read. However, we will begin counting 2-itemsets after the first 10,000 transactions have been read, and we will begin counting 3-itemsets after 20,000 transactions. For now, we assume there are no 4-itemsets we need to count. Once we get to the end of the file, we will stop counting the 1-itemsets and go back to the start of the file to count the 2- and 3-itemsets. After the first 10,000 transactions of the second pass, we will finish counting the 2-itemsets, and after 20,000 transactions, we will finish counting the 3-itemsets. Since the second pass only has to cover the first 20,000 of the 40,000 transactions, in total we have made 1.5 passes over the data instead of the 3 passes a level-wise algorithm would make [3].
DIC addresses the high-level issue of when to count which itemsets and is a substantial speedup over Apriori, particularly when Apriori requires many passes. We deal with the low-level issue of how to increment the appropriate counters for each transaction in Section 3 by considering the sort order of items in our data structure.
[3] Of course, in this example we assumed the best of all possible circumstances, where we estimated correctly exactly which 2- and 3-itemsets we would have to count. More realistically, some itemsets would be added a little later. Nonetheless there would still be considerable savings.
1.2 Implication Rules
Our contribution to functionality in market basket analysis is implication rules based on conviction, which we believe is a more useful and intuitive measure than confidence and interest (see the discussion in Section 4). Unlike confidence, conviction is normalized based on both the antecedent and the consequent of the rule, like the statistical notion of correlation. Furthermore, unlike interest, it is directional and measures actual implication as opposed to co-occurrence. Because of these two features, implication rules can produce useful and intuitive results on a wide variety of data (for example, rules over census attributes such as past active duty in the military).
In Section 5, we present the results of generating implication rules for U.S. census data from the 1990 census. Census
data is considerably more difficult to mine than supermarket data and the performance advantages of DIC for finding large
itemsets are particularly useful.
To show that an itemset is large, we can count it. In fact, we must, since we generally want to know the counts.
However, it is infeasible to count all of the small itemsets. Fortunately, it is sufficient to count just the minimal ones (the
small itemsets that do not include any other small itemset), since if an itemset is small, all of its supersets are small too. The
minimal small itemsets are denoted by circles in our example, and the large itemsets by boxes.
An algorithm which counts all the large itemsets must find and count all of the large itemsets and the minimal small
itemsets (that is, all of the boxes and circles). The DIC algorithm, described here, marks itemsets in four different possible
ways:
Solid box - confirmed large itemset - an itemset we have finished counting that exceeds the support threshold.
Solid circle - confirmed small itemset - an itemset we have finished counting that is below the support threshold.
Dashed box - suspected large itemset - an itemset we are still counting that exceeds the support threshold.
Dashed circle - suspected small itemset - an itemset we are still counting that is below the support threshold.
The DIC algorithm works as follows:
1. The empty itemset is marked with a solid box. All the 1-itemsets are marked with dashed circles. All other itemsets are unmarked. (See Figure 3.)
2. Read M transactions. For each transaction, increment the counters of the dashed itemsets that occur in it.
3. If a dashed circle has a count that exceeds the support threshold, turn it into a dashed box. If any immediate superset of it now has all of its subsets marked as solid or dashed boxes, add a new counter for that superset and mark it with a dashed circle. (See Figures 4 and 5.)
4. If a dashed itemset has been counted through all the transactions, make it solid (a solid circle or solid box) and stop counting it.
5. If we are at the end of the transaction file, rewind to the beginning. (See Figure 6.)
6. If any dashed itemsets remain, go to step 2.
This way DIC starts counting just the 1-itemsets and then quickly adds counters for the 2-, 3-, 4-, ..., k-itemsets. After just a few
passes over the data (usually fewer than two for small values of M), DIC finishes counting all of the itemsets it needs. Implementing this involves several tasks, including the following:
2. Maintain a counter for every itemset. When transactions are read, increment the counters of those active itemsets which
occur in the transaction. This must be very fast, as it is the bottleneck of the whole process. We attempt to optimize
this operation in Section 3.
3. Maintain itemset states by managing transitions from active to counted (dashed to solid) and from small to large (circle
to box), and detect when these transitions should occur (a sketch of this state management appears after this list).
4. When itemsets do become large, determine what new itemsets should be added as dashed circles since they could now
potentially be large.
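The following C++ sketch is a simplified illustration of steps 1-6 and of the state transitions listed above. It is our own illustrative code, not the implementation described in this paper: all bookkeeping lives in a std::map rather than the trie of Section 3, support is an absolute count, and for simplicity it assumes the file length is a multiple of M so that an itemset gets off exactly where it got on.

    // Simplified sketch of DIC's state management (illustrative names).
    #include <algorithm>
    #include <map>
    #include <vector>

    using Itemset = std::vector<int>;
    using Transaction = std::vector<int>;

    enum class Mark { DashedCircle, DashedBox, SolidCircle, SolidBox };

    struct State {
        Mark mark;
        long count = 0;   // occurrences counted so far
        long start = 0;   // file position at which counting started
    };

    static bool occursIn(const Itemset& s, const Transaction& t) {
        return std::includes(t.begin(), t.end(), s.begin(), s.end());
    }
    static bool isBox(Mark m) { return m == Mark::DashedBox || m == Mark::SolidBox; }
    static bool isDashed(Mark m) { return m == Mark::DashedCircle || m == Mark::DashedBox; }

    // When an itemset becomes a box, add as new dashed circles all immediate
    // supersets whose k-item subsets are all boxes.
    static void addCandidates(const std::vector<int>& items,
                              std::map<Itemset, State>& states, long pos) {
        std::vector<Itemset> boxes;
        for (const auto& entry : states)
            if (isBox(entry.second.mark)) boxes.push_back(entry.first);
        for (const auto& s : boxes)
            for (int item : items) {
                if (std::binary_search(s.begin(), s.end(), item)) continue;
                Itemset c = s;
                c.insert(std::lower_bound(c.begin(), c.end(), item), item);
                if (states.count(c)) continue;
                bool ok = true;
                for (size_t i = 0; i < c.size() && ok; ++i) {
                    Itemset sub = c; sub.erase(sub.begin() + i);
                    auto it = states.find(sub);
                    ok = (it != states.end()) && isBox(it->second.mark);
                }
                if (ok) states[c] = State{Mark::DashedCircle, 0, pos};
            }
    }

    std::map<Itemset, State> dic(const std::vector<Transaction>& db,
                                 const std::vector<int>& items,   // all item ids
                                 long minSupport, long M) {
        std::map<Itemset, State> states;
        // Step 1: the (implicit) empty itemset is large; 1-itemsets start dashed.
        for (int item : items) states[{item}] = State{Mark::DashedCircle, 0, 0};

        long pos = 0;
        auto anyDashed = [&] {
            for (const auto& e : states) if (isDashed(e.second.mark)) return true;
            return false;
        };
        while (anyDashed()) {                                  // step 6
            for (long i = 0; i < M; ++i, ++pos) {              // step 2: read M transactions
                const Transaction& t = db[pos % db.size()];    // step 5: rewind at end of file
                for (auto& e : states)
                    if (isDashed(e.second.mark) && occursIn(e.first, t)) ++e.second.count;
            }
            for (auto& e : states) {                           // transitions at the stop
                State& st = e.second;
                if (st.mark == Mark::DashedCircle && st.count >= minSupport) {
                    st.mark = Mark::DashedBox;                 // step 3
                    addCandidates(items, states, pos);
                }
                if (isDashed(st.mark) && pos - st.start >= (long)db.size())
                    st.mark = (st.count >= minSupport) ? Mark::SolidBox
                                                       : Mark::SolidCircle;  // step 4
            }
        }
        return states;
    }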
Figure 3: Start of DIC algorithm.
Figure 4: After M transactions.
Figure 5: After 2M transactions.
Figure 6: After one pass.
The data structure used for this is exactly like the hash tree used in Apriori with a little extra information stored at each
node. It is a trie with the following properties. Each itemset is sorted by its items (the sort order is discussed in Section 3
). Every itemset we are counting or have counted has a node associated with it, as do all of its prefixes. The empty itemset
is the root node. All the 1-itemsets are attached to the root node, and their branches are labeled by the item they represent.
All other itemsets are attached to their prefix containing all but their last item. They are labeled by that last item. Figure
7 shows a sample trie of this form. The dotted path represents the traversal made through the trie when the
transaction ABC is encountered: A, AB, ABC, AC, B, BC, and C must be incremented, and they are, in that order. The
exact algorithm for this is described in Section 3.
Every node stores the last item in the itemset it represents, a counter, a marker as to where in the file we started counting
it, its state, and its branches if it is an interior node.
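Such a node might be declared along the following lines; this is a sketch with field and type names of our own choosing, and the four-state enum mirrors the marks described earlier.

    // Sketch of a trie node holding the per-itemset bookkeeping described above.
    #include <map>
    #include <memory>

    enum class NodeState { DashedCircle, DashedBox, SolidCircle, SolidBox };

    struct TrieNode {
        int lastItem;         // last item of the itemset this node represents
        long count = 0;       // support counter
        long startPos = 0;    // position in the file where we started counting
        NodeState state = NodeState::DashedCircle;
        std::map<int, std::unique_ptr<TrieNode>> branches;  // children, keyed by item
    };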
Some important relevant work was done by Toivonen using sampling [Toi96]. His technique is to sample the data using
a reduced threshold for safety, and then count the necessary itemsets over the whole data in just one pass. However, it
pays the added penalty of having to count more itemsets because of the reduced threshold. This can be quite costly, particularly
for datasets like the census data (see Section 5). Instead of being conservative, our algorithm bravely marches on, on the
assumption that it will later come back to anything it missed with little penalty.
However, randomizing the order of the transactions to make the data more homogeneous may be impractical. For example, it may be expensive, the data may be stored on tape, or there
might be insufficient space to store the randomized version. We considered several ways of addressing this problem:
One approach is to slacken the support threshold: first, start with a support threshold considerably lower than the given one, then
gradually increase the threshold to the desired level. This way, the algorithm begins fairly conservatively and then
becomes more confident as more data is read. We experimented with this technique somewhat, but with little
success. However, perhaps more careful control of the slack or a different dataset would make this a useful technique.
The DIC algorithm addresses the high-level strategy of which itemsets to count when. There are also lower-level performance
issues concerning how to increment the appropriate counters for a particular transaction. We address these in Section 3.
To determine how to optimize the order of the items, it is important to understand how the counter-incrementing process
works. We are given a transaction T and a trie node S (initially the root); the procedure Increment(T, S) descends from S along every branch whose item occurs in T, recursing on the remaining suffix of the transaction and incrementing the counter of every node it reaches.
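The original statement of Increment is not reproduced here; the following C++ sketch is our reconstruction of a traversal consistent with the description above and with the visiting order of the Figure 7 example (the node layout is stripped down, and the names are ours).

    // Sketch of the recursive counter-increment traversal.
    #include <map>
    #include <memory>
    #include <vector>

    struct Node {   // minimal node: other fields from the earlier sketch are omitted
        long count = 0;
        std::map<int, std::unique_ptr<Node>> branches;  // keyed by the branch's item
    };

    // T is the (sorted) transaction, i the position of the next item to try,
    // node the trie node for the current prefix. Every node reached represents
    // an itemset contained in T, so its counter is incremented.
    void increment(const std::vector<int>& T, size_t i, Node* node) {
        ++node->count;
        for (size_t j = i; j < T.size(); ++j) {
            auto it = node->branches.find(T[j]);
            if (it != node->branches.end())
                increment(T, j + 1, it->second.get());  // descend, then continue with suffix
        }
    }
    // Starting at the root with transaction {A, B, C}, the nodes are visited in
    // the order (root), A, AB, ABC, AC, B, BC, C, matching the Figure 7 example.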
For example, consider the structure in Figure 7. Suppose there are also items E, F, and G, and we add their respective
1-itemsets to the data structure. There will now be three singletons hanging off the tree. If we insert ABCDEFG, the cost of
the insert is 31 (see Table 3). However, if we change the order of the items to EFGABCD (note the tree structure remains
the same since A, B, C, and D did not change relative order), the cost becomes 16 (see Table 3). (In Table 3, ABC, AC, AD, BCD, BD, CD, D, E, F, and G cost 0 since they are leaves.)
This is considerably cheaper. Therefore, what we want is to order the items by the inverse of their popularity in the counted
non-leaf itemsets. A reasonable approximation is to order them by the inverse of their popularity (frequency of occurrence) in the first M transactions.
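A sketch of this heuristic (illustrative code; the function name and the use of plain frequency counts are ours): count how often each item occurs in the first M transactions and order the items by increasing frequency, so that rarely occurring items come first in the sort order.

    // Order items by increasing frequency in the first M transactions.
    #include <algorithm>
    #include <map>
    #include <vector>

    using Transaction = std::vector<int>;

    std::vector<int> itemOrder(const std::vector<Transaction>& db, size_t M) {
        std::map<int, long> freq;
        for (size_t i = 0; i < M && i < db.size(); ++i)
            for (int item : db[i]) ++freq[item];

        std::vector<int> items;
        for (const auto& kv : freq) items.push_back(kv.first);
        std::sort(items.begin(), items.end(),
                  [&](int a, int b) { return freq[a] < freq[b]; });
        return items;  // items[0], the least popular item, comes first in the sort order
    }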
Some traditional measures of ``interestingness'' have been support combined with either confidence or interest. Consider these measures from a probabilistic model.
Let {A, B} be an itemset. Then the support is P({A, B}), which we write P(A, B). Support is used to make sure that the items the rule applies to actually occur frequently enough for someone to care. It also makes the task computationally feasible by limiting the size of the result set, and it is usually used in conjunction with other measures.
The confidence of A ⇒ B is P(B | A), the conditional probability of B given A, which is equal to P(A, B)/P(A). It has the flaw that it ignores P(B). For example, P(A, B)/P(A) could equal P(B) (i.e., the occurrence of B is unrelated to A) and could still be high enough to make the rule hold. For instance, if people buy milk 80% of the time in a supermarket and the purchase of milk is completely unrelated to the purchase of smoked salmon, then the confidence of salmon ⇒ milk is still 80%. This confidence is quite high, and therefore would generate a rule. This is a key weakness of confidence, and is particularly evident in census data, where many items are very likely to occur with or without other items.
The interest of A, B is defined as P(A, B)/(P(A)P(B)) and factors in both P(A) and P(B); essentially it is a measure of departure from independence. However, it only measures co-occurrence, not implication, in that it is completely symmetric.
To fill the gap, we define the conviction of A ⇒ B as P(A)P(¬B)/P(A, ¬B). The intuition as to why this is useful is: logically, A ⇒ B can be rewritten as ¬(A ∧ ¬B), so we measure how far A ∧ ¬B deviates from independence, and invert the ratio to take care of the outside negation [5]. We believe this concept is useful for a number of reasons:
Unlike confidence, conviction factors in both P(A) and P(B) and always has a value of 1 when the relevant items are completely unrelated, as in the salmon and milk example above.
Unlike interest, rules which hold 100% of the time, like Vietnam veteran ⇒ more than five years old, have the highest possible conviction value of infinity. Confidence also has this property, in that these rules have a confidence of 1. However, interest does not have this useful property. For example, if 5% of people are Vietnam veterans and 90% are more than five years old, we get interest = 0.05/(0.05 × 0.9) ≈ 1.11, which is only slightly above 1 (the interest for completely independent items).
In short, conviction is truly a measure of implication because it is directional, it is maximal for perfect implications, and it properly takes into account both P(A) and P(B).
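To summarize the three measures side by side, the following sketch (illustrative code, not the paper's implementation) computes confidence, interest, and conviction from the joint and marginal probabilities; the salmon example assumes a value for P(salmon), say 0.1, which is not specified in the text.

    // Confidence, interest, and conviction of a rule A => B, computed from
    // the joint and marginal probabilities (illustrative sketch).
    #include <limits>

    struct RuleMeasures {
        double confidence;  // P(B | A) = P(A,B) / P(A)
        double interest;    // P(A,B) / (P(A) * P(B))
        double conviction;  // P(A) * P(not B) / P(A, not B); infinity for perfect rules
    };

    RuleMeasures measures(double pA, double pB, double pAB) {
        double pANotB = pA - pAB;  // P(A, not B)
        RuleMeasures m;
        m.confidence = pAB / pA;
        m.interest   = pAB / (pA * pB);
        m.conviction = (pANotB > 0.0) ? pA * (1.0 - pB) / pANotB
                                      : std::numeric_limits<double>::infinity();
        return m;
    }

    // salmon => milk with assumed P(salmon)=0.1, P(milk)=0.8, independent purchases:
    //   measures(0.1, 0.8, 0.08) gives confidence 0.8, interest 1, conviction 1.
    // Vietnam veteran => more than five years old, P(A)=0.05, P(B)=0.9, P(A,B)=0.05:
    //   confidence 1, interest about 1.11, conviction = infinity.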
5 Results
We tested the DIC algorithm along with reordering on two different types of data: synthetic data generated by the IBM test data generator [6] and U.S. census data. We also tested implication rules on both sets of data, but the results of these tests are only interesting on the census data. This is because the synthetic data is designed to test association rules based on support and confidence. Also, the rules generated are only interesting and can be evaluated for utility if the items involved have some actual meaning. Overall, the results of tests on both sets of data justified DIC, and tests on the census data justified implication rules.
5.1 Test Data
The synthetic test data generator is well documented in [AS94] and was very convenient to use. We used 100,000 transactions with an average size of 20 items, chosen from 1,000 items, and the average large itemset size was 4.
The census data was a bit more challenging. We chose to look at PUMS files, which are Public Use Microdata Samples. They contain actual census entries which constitute a five percent sample of the state the file represents. In our tests we used the PUMS file for Washington D.C., which consists of roughly 30,000 entries. Each entry has 127 attributes, each of which is represented as a decimal number. For example, the SEX field uses one digit: 0 for male and 1 for female. The INCOME field uses six digits: the actual dollar value of that person's income for the year. We selected 73 of these attributes to study and took all possible (field, value) pairs. For the numerical attributes, like INCOME, we took the logarithm of the value and rounded it to the nearest integer to reduce the number of possible answers [7]. In total this yielded 2166 different items.
This data has several important differences from the synthetic data. First, it is considerably wider: 73 versus 20 items per transaction. Second, many of the items are extremely popular, such as worked in 1989. Third, the entire itemset structure is far more complex than that of the synthetic data because correlations can be directional (for example, given birth to 2 children ⇒ female, but the reverse is not true), many rules are true 100% of the time (same example), and almost all attributes are correlated to some degree. These factors make mining census data considerably more difficult than supermarket-style data.
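As a rough illustration of this itemization (the field handling and names below are simplified assumptions, not the exact preprocessing code), each record's (field, value) pairs become items, with numerical attributes bucketed by the rounded logarithm of their value:

    // Turn a census record's (field, value) pairs into items, bucketing
    // numerical attributes by the rounded logarithm of their value.
    #include <cmath>
    #include <string>
    #include <vector>

    struct Attribute {
        std::string name;
        long value;
        bool numeric;   // e.g. INCOME is numeric, SEX is categorical
    };

    std::vector<std::string> makeItems(const std::vector<Attribute>& record) {
        std::vector<std::string> items;
        for (const auto& a : record) {
            long v = a.value;
            if (a.numeric && v > 0)
                v = std::lround(std::log(static_cast<double>(v)));  // e.g. 35000 -> 10
            items.push_back(a.name + "=" + std::to_string(v));      // e.g. "SEX=1", "INCOME=10"
        }
        return items;
    }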
5.2 Test Implementations
We implemented DIC and Apriori in C++ on several different Unix platforms. The implementations were mostly the same, since Apriori can be thought of as a special case of DIC: the case where the interval size is the entire data file rather than M. Note that much work has been done on Apriori in recent years to optimize it, and it was impossible for us to perform all of those optimizations. However, almost all of these optimizations are equally applicable to DIC.
[5] In practice we do not invert the ratio; instead we search for low values of the uninverted ratio. This way we do not have to deal with infinities.
[6] http://www.almaden.ibm.com/cs/quest/syndata.html
[7] There has been much work done on bucketizing numerical parameters. However, this is not the focus of our research, so we took a simple approach.
Figure 8: Performance of Apriori and DIC on Synthetic Data (support thresholds from 0.006 to 0.02)
Both DIC and Apriori were run on the synthetic data and the census data. Running both algorithms on the synthetic data was fairly straightforward. We tried a range of support values and produced large itemsets relatively easily (Figure 8). Apriori beat DIC by about 30% on the high-support end of the graph, but DIC outperformed Apriori in most tests, running 30% faster at a support threshold of 0.005. However, running both DIC and Apriori on the census data was tricky. This is because a number of items in the census data appeared over 95% of the time, and therefore there were a huge number of large itemsets. To address this problem, items with over 80% support were dropped. There were still a fair number of large itemsets, but the number was manageable at very high support thresholds. The reader will notice that the tests were run at support levels between 36% and 50%, which are more than an order of magnitude higher than support levels used for supermarket analysis; otherwise, far too many large itemsets would be generated. Even at the 36% support threshold, mining proved time consuming, taking nearly half an hour on just 30,000 records.
5.3.1 Performance on the Census Data
The performance graphs show three curves (Figure 9): one for Apriori, one for DIC, and one for DIC when we shuffled
the order of the transactions beforehand (this has no effect on Apriori). For both tests with DIC,
5.4 Varying the Interval Size
One experiment we tried was to find the optimal value of M, the interval size (Figure 10). We tried values of 100,
300, 1,000, and 10,000 for M. The values in the middle of the range, 300 and 1,000, worked the best, coming in second and
first respectively. An interval size of 100 proved the worst choice because of the overhead it incurred. A value of 10,000 was
somewhat slow because it took more passes over the data than the lower interval values did.
We also tried varying
Item reordering was not nearly as successful as we had hoped. It made a small difference in some tests but overall played a negligible role in performance. In tests on census data it made less than a 10% difference, sometimes in the wrong direction (Figure 11). This was something of a disappointment, but perhaps a better analysis of what the optimal order is, together with on-the-fly modification, would yield better results.
Some of the implication rules found in the census data, with their conviction values:

Conviction   Rule
50      people who are not in the military and are not looking for work and had work this year (1990, the year of the census) ⇒ currently have civilian employment
10      people who are not in the military and who worked last week ⇒ are not limited in their work by a disability
2.94    heads of household ⇒ do not have personal care limitations
1.5     people not in school and without personal care limitations ⇒ have worked this year
1.4     African-American women ⇒ are not in the military
1.28    African-Americans ⇒ reside in the same state they were born
1.28    unmarried people ⇒ have moved in the past five years
5.6 Tests of Implication Rules
By comparison, tests with confidence produced some misleading results. For example, the confidence of women who do
not state whether they are looking for a job ⇒ do not have personal care limitations was 73%, which is at the high end of the
scale. However, it turned out that this was simply because 76% of all respondents do not have personal care limitations.
Interest also produced less useful results. For example, the interest of male and never given birth is 1.83, which is
considerably lower than that of many itemsets we would consider less related; the rule appears 40% of the way down
the list of rules with interest greater than 1.25.
There are a number of possible extensions to DIC. Because of its dynamic nature, it is very flexible and can be adapted
to parallel and incremental mining.
6.1.3 Census Data
The most interesting rules were found in the middle of the range: not so high as to be obvious (anything over 5),
but not so low as to be insignificant (around 1.01). We believe that this is generally true of any data mining application. The
extremely correlated effects are generally well known and obvious. The truly interesting ones are far less correlated.
A big problem was the number of rules that were generated - over 20,000. It is both impossible and unnecessary to deal
with so many rules. We have considered several techniques for pruning them.
Overall, conviction has proven to be a useful new measure, having the benefits of being directional, maximal for perfect implications, and properly normalized by both the antecedent and the consequent of a rule.
From our experiments we have learned that not all data which fits into the market-basket framework behaves nearly as
well as supermarket data. We do not believe that this is because it is a wrong choice of framework. It is simply because doing
this kind of analysis on data like census data is difficult. There are very many correlations and redundancies in census data.
If we are aware beforehand of all of its idiosyncrasies, we can probably simplify the problem considerably (for example, by
collapsing all the redundant attributes) and find a specialized solution for it. However, we want to build a general system,
capable of detecting and reporting the interesting aspects of any data we may throw at it. Toward this goal, we developed
DIC to make the task less painful to those of us who are impatient, and we developed conviction so that the results of mining
market-basket data are more usable.
[ALSS95] R. Agrawal, K. Lin, S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling and translation in time-series databases. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), 1995.
[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int'l Conf. on Very Large Data Bases (VLDB), Santiago, Chile, 1994.
[AS95] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of the 11th Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, 1995.
[MAR96] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. of the Int'l Conference on Extending Database Technology (EDBT), March 1996.
[SA95] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of the 21st Int'l Conf. on Very Large Data Bases (VLDB), 1995.
[Toi96] H. Toivonen. Sampling large databases for association rules. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), 1996.