6 1999 two crows corporation link analysis link

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: fication is a way to segment data by assigning it to groups that are already defined. 6 © 1999 Two Crows Corporation Link analysis Link analysis is a descriptive approach to exploring data that can help identify relationships among values in a database. The two most common approaches to link analysis are association discovery and sequence discovery. Association discovery finds rules about items that appear together in an event such as a purchase transaction. Market-basket analysis is a well-known example of association discovery. Sequence discovery is very similar, in that a sequence is an association related over time. Þ Associations are written as A B, where A is called the antecedent or left-hand side (LHS), and B is called the consequent or right-hand side (RHS). For example, in the association rule “If people buy a hammer then they buy nails,” the antecedent is “buy a hammer” and the consequent is “buy nails.” It’s easy to determine the proportion of transactions that contain a particular item or item set: simply count them. The frequency with which a particular association (e.g., the item set “hammers and nails”) appears in the database is called its support or prevalence. If, say, 15 transactions out of 1,000 consist of “hammer and nails,” the support for this association would be 1.5%. A low level of support (say, one transaction out of a million) may indicate that the particular association isn’t very important — or it may indicated the presence of bad data (e.g., “male and pregnant”). To discover meaningful rules, however, we must also look at the relative frequency of occurrence of the items and their combinations. Given the occurrence of item A (the antecedent), how often does item B (the consequent) occur? That is, what is the conditional predictability of B, given A? Using the above example, this would mean asking “When people buy a hammer, how often do they also buy nails?” Another term for this conditional predictability is confidence. Confidence is calculated as a ratio: (frequency of A and B)/(frequency of A). Let’s specify our hypothetical database in more detail to illustrate these concepts: Total hardware-store transactions: 1,000 Number which include “hammer”: 50 Number which include “nails”: 80 Number which include “lumber”: 20 Number which include “hammer” and “nails”: 15 Number which include “nails” and “lumber”: 10 Number which include “hammer” and “lumber”: 10 Number which include “hammer,” “nails” and “lumber”: 5 We can now calculate: Support for “hammer and nails” = 1.5% (15/1,000) Support for “hammer, nails and lumber” = 0.5% (5/1,000) Confidence of “hammer nails” = 30% (15/50) Confidence of “nails hammer” = 19% (15/80) Confidence of “hammer and nails lumber” = 33% (5/15) Confidence of “lumber hammer and nails” = 25% (5/20) Þ Þ Þ Þ Thus we can see that the likelihood that a hammer buyer will also purchase nails (30%) is greater than the likelihood that someone buying nails will also purchase a hammer (19%). The prevalence of this hammer-and-nails association (i.e., the support is 1.5%) is high enough to suggest a meaningful rule. © 1999 Two Crows Corporation 7 Lift is another measure of the power of an association. The greater the lift, the greater the influence that the occurrence of A has on the likelihood that B will occur. Lift is calculated as the ratio (confidence of A B)/ (frequency of B). From our example: Þ Þ Þ Lift of “hammer nails”: 3.75 (30%/8%) Lift of “hammer and nails lumber”: 16.5 (33%/2%) Association algorithms find these rules by doing the equivalent of sorting the data while counting occurrences so that they can calculate confidence and support. The efficiency with which they can do this is one of the differentiators among algorithms. This is especially important because of the combinatorial explosion that results in enormous numbers of rules, even for market baskets in the express lane. Some algorithms will create a database of rules, confidence factors, and support that can be queried (for example, “Show me all associations in which ice cream is the consequent, that have a confidence factor of over 80% and a support of 2% or more”). Another common attribute of association rule generators is the ability to specify an item hierarchy. In our example we have looked at all nails and hammers, not individual types. It is important to choose a proper level of aggregation or you’ll be unlikely to find associations of interest. An item hierarchy allows you to control the level of aggregation and experiment with different levels. Remember that association or sequence rules are not really rules, but rather descriptions of relationships in a particular database. There is no formal testing of models on other data to increase the predictive power of these rules. Rather there is an implicit assumption that the past behavior will continue in the future. It is often difficult to decide what to do with association rules you’ve discov...
View Full Document

Ask a homework question - tutors are online