7.8 Exercises

1. Consider the traffic accident data set shown in Table 7.10.
Table 7.10. Traffic accident data set.

Weather    Driver's          Traffic                 Seat  Crash
Condition  Condition         Violation               belt  Severity
Good       Alcohol-impaired  Exceed speed limit      No    Major
Bad        Sober             None                    Yes   Minor
Good       Sober             Disobey stop sign       Yes   Minor
Good       Sober             Exceed speed limit      Yes   Major
Bad        Sober             Disobey traffic signal  No    Major
Good       Alcohol-impaired  Disobey stop sign       Yes   Minor
Bad        Alcohol-impaired  None                    Yes   Major
Good       Sober             Disobey traffic signal  Yes   Major
Good       Alcohol-impaired  None                    No    Major
Bad        Sober             Disobey traffic signal  No    Major
Good       Alcohol-impaired  Exceed speed limit      Yes   Major
Bad        Sober             Disobey stop sign       Yes   Minor
(a) Show a binarized version of the data set.
(b) What is the maximum width of each transaction in the binarized data?
(c) Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?
(d) Create a data set that contains only the following asymmetric binary attributes: (Weather = Bad, Driver's condition = Alcohol-impaired, Traffic violation = Yes, Seat Belt = No, Crash Severity = Major). For Traffic violation, only None has a value of 0. The rest of the attribute values are assigned to 1. Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?
(e) Compare the number of candidate and frequent itemsets generated in parts (c) and (d).
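Binarizing a categorical data set amounts to turning every attribute-value pair into its own binary item. The following is a minimal Python sketch of that step, using only the first three records of Table 7.10 for illustration. The item naming scheme (`Attribute=value`), the shortened attribute names, and the helpers `binarize` and `to_matrix` are assumptions made for this example, not notation from the text.

```python
# Shortened attribute names (an assumption of this sketch).
ATTRIBUTES = ["Weather", "Driver", "Violation", "SeatBelt", "Severity"]

# First three records of Table 7.10, used purely as an illustration.
records = [
    ("Good", "Alcohol-impaired", "Exceed speed limit", "No", "Major"),
    ("Bad", "Sober", "None", "Yes", "Minor"),
    ("Good", "Sober", "Disobey stop sign", "Yes", "Minor"),
]


def binarize(rows, attributes):
    """Map each record to the set of 'Attribute=value' items it contains."""
    return [
        {f"{name}={value}" for name, value in zip(attributes, row)}
        for row in rows
    ]


def to_matrix(transactions):
    """Return (sorted item names, list of 0/1 rows) for a list of item sets."""
    items = sorted(set().union(*transactions))
    matrix = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, matrix


transactions = binarize(records, ATTRIBUTES)
items, matrix = to_matrix(transactions)

for row in matrix:
    print(row)
```

Because a record takes exactly one value per original attribute, every row of `matrix` contains one 1 for each attribute and 0 everywhere else.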
2. (a) Consider the data set shown in Table 7.11. Suppose we apply the following discretization strategies to the continuous attributes of the data set.
D1: Partition the range of each continuous attribute into 3 equal-sized bins.
D2: Partition the range of each continuous attribute into 3 bins, where each bin contains an equal number of transactions.
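The two strategies correspond to what is commonly called equal-width and equal-frequency binning. The sketch below contrasts them in plain Python on a hypothetical list of readings (not the values of Table 7.11). The function names and the handling of the maximum value and of ties are choices made for this illustration.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k intervals of equal width.

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum lies on the upper edge of the range, so clamp it into
    # the last bin instead of letting it open a bin of its own.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bin indices so that each bin holds (nearly) the same count.

    Ties are split by their order in the input, which is one of several
    possible conventions.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Hypothetical readings, chosen so that the two strategies disagree.
readings = [2, 4, 5, 9, 10, 11, 20, 29, 30]
width_bins = equal_width_bins(readings, 3)
freq_bins = equal_frequency_bins(readings, 3)
print(width_bins)  # [0, 0, 0, 0, 0, 0, 1, 2, 2]
print(freq_bins)   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With skewed data, equal-width binning can leave most values in one bin, whereas equal-frequency binning balances the counts at the cost of uneven interval widths.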
474 Chapter 7 Association Analysis: Advanced Concepts
Table 7.11. Data set for Exercise 2.

TID  Temperature  Pressure  Alarm 1  Alarm 2  Alarm 3
1    95           1105      0        0        1
2    85           1040      1        1        0
3    103          1090      1        1        1
4    97           1084      1        0        0
5    80           1038      0        1        1
6    100          1080      1        1        0
7    83           1025      1        0        1
8    86           1030      1        0        0
9    101          1100      1        1        1
For each strategy, answer the following questions:
i. Construct a binarized version of the data set.
ii. Derive all the frequent itemsets having support > 30%.
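For a binarized data set with only a handful of items, frequent itemsets can be checked by brute-force enumeration. The sketch below is not the Apriori algorithm itself; it simply counts every candidate of each size and stops at the first size that yields no frequent itemset. The transactions are hypothetical, and the strict comparison follows the wording "support > 30%" used in the exercise; replace it with `>=` if the threshold is meant to be inclusive.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets with support > min_support.

    Brute-force enumeration: fine for a few items, exponential in general.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n > min_support:
                result[candidate] = count / n
                found_any = True
        if not found_any:
            # Anti-monotone property of support: if no itemset of this
            # size is frequent, no larger itemset can be frequent either.
            break
    return result


# Hypothetical transactions over items A, B, C.
transactions = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B"}]
result = frequent_itemsets(transactions, min_support=0.3)
for itemset, support in result.items():
    print(itemset, support)
```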
(b) The continuous attribute can also be discretized using a clustering approach.
i. Plot a graph of temperature versus pressure for the data points shown in Table 7.11.
ii. How many natural clusters do you observe from the graph? Assign a label (C1, C2, etc.) to each cluster in the graph.
iii. What type of clustering algorithm do you think can be used to identify the clusters? State your reasons clearly.
iv. Replace the temperature and pressure attributes in Table 7.11 with asymmetric binary attributes C1, C2, etc. Construct a transaction matrix using the new attributes (along with attributes Alarm1, Alarm2, and Alarm3).
v. Derive all the frequent itemsets having support > 30% from the binarized data.
3. Consider the data set shown in Table 7.12. The first attribute is continuous, while the remaining two attributes are asymmetric binary. A rule is considered to be strong if its support exceeds 15% and its confidence exceeds 60%. The data given in Table 7.12 supports the following two strong rules:
(i) {(1 < A < 2), B = 1} → {C = 1}
(ii) {(5 < A < 8), B = 1} → {C = 1}
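As a reminder of the two measures involved, the support of a rule X → Y is the fraction of all records that satisfy both X and Y, and its confidence is the fraction of records satisfying X that also satisfy Y. The sketch below computes both for a rule of the same shape as rule (i). The records are hypothetical and are NOT the contents of Table 7.12. The helper `rule_metrics` and its predicate-based interface are assumptions of this example, and the strict inequalities mirror the rule as printed above.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    `antecedent` and `consequent` are predicates taking one record (a dict).
    """
    n = len(records)
    n_antecedent = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / n
    # Confidence is undefined when no record matches the antecedent;
    # this sketch reports 0.0 in that case.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical records, not taken from Table 7.12.
records = [
    {"A": 1.5, "B": 1, "C": 1},
    {"A": 1.2, "B": 1, "C": 0},
    {"A": 6.0, "B": 1, "C": 1},
    {"A": 7.5, "B": 0, "C": 1},
    {"A": 9.0, "B": 1, "C": 0},
]

support, confidence = rule_metrics(
    records,
    antecedent=lambda r: 1 < r["A"] < 2 and r["B"] == 1,
    consequent=lambda r: r["C"] == 1,
)
print(support, confidence)  # 0.2 0.5
```

On these made-up records the rule has support above 15% but confidence below 60%, so it would not count as strong under the thresholds stated in the exercise.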
(a) Compute the support and confidence for both rules.
(b) To find the rules using the traditional Apriori algorithm, we need to discretize the continuous attribute A. Suppose we apply the equal width