Association rule analysis beyond transaction data

Note : I originally prepared this post as an article for Predictive Analytics Times, geared towards a broad audience with a business/practitioner leaning. I ended up writing this article on knitr instead. In a subsequent post, I will discuss the shiny app I created to interactively explore and visualize association rules.

Most analysts think of association rules when dealing with transactional or market basket data. However, association analysis can also provide value with traditional tabular data, especially when little is known about the structure of the data and its domain.

What is association analysis?

Association analysis is an unsupervised learning data mining technique, which at heart is very simple and transparent. The goal is to discover previously unknown patterns (easy) that are also interesting (more challenging) from data. Results are easily interpreted by non-technical audiences and can be expressed in plain English. For example, “When X occurs, Y also occurs 85% of the time.” Many data mining techniques require upfront work in the form of parameter specification and statistical assumption satisfaction before yielding meaningful results. Association analysis is by contrast rather “hands off” upfront. Generating rules is a straightforward procedure that requires little to no knowledge of the underlying data. However, post-processing large rule sets often requires a “hands-on” approach to identify the interesting and valid rules.

Popularized in the early 1990s with this paper from Agrawal et al, association rule analysis has since declined in popularity. I’m still trying to figure out how correlated this decline is with it’s relative usefulness. My conclusion so far is that in most cases it isn’t strong enough to provide much value alone, but can provide value for knowledge discovery in highly exploratory settings. Particularly knowledge discovery of data quirks and structure; knowledge discovery of novel patterns takes more thought.

How it works

The inputs are transactions. In the canonical use-case, these transactions are market basket transactions from a grocery store. For example, 3 transactions could be: {Bread, Milk}, {Bread, Diapers, Beer, Eggs}, {Milk, Diapers, Beer, Cola}. The outputs are association rules that explain the relationship (co-occurrence) between the items. A classic rule example is {Beer} ==> {Diapers} which states that transactions including {Beer} tend to include {Diapers}. A thorough technical description following this example can be found in Kumar’s Introduction to Data Mining textbook.

How to measure rule strength

Association analysis often produces many association rules, more than can be inspected by humans one-by-one. Support and confidence are the most common measures used to initially restrict the number of association rules to just the highest quality. Support measures the relative frequency of the items in a transaction relative to all other transactions. Confidence measures the relative strength of the rule: how often the right hand side (RHS) item occurs in transactions containing the left-hand-side (LHS) item set.

How to use association analysis with tabular data

These same principles can be applied to traditional tabular data (spreadsheets or tables from a relational database). Each row is treated as a transaction. All transactions have the same number of items (one per column). Items are represented by the variable name and the value of that variable for the given row. For example, one row from a five column dataset of students could be represented as: {finalMathGrade=B, eyeColor=brown, state=Minnesota, playSports=YES, inBand=NO}. Traditional association rules run on categorical data. If a table contains continuous data of interest, it can be discretized into bins for analysis.

When to use association rules with tabular data?

I find association analysis to be most handy as a first-step approach when presented with a new dataset. One can think of lists of association rules as lists of facts (with regards to past training data). In market basket analysis problems, these facts can be useful and drive business decisions by themselves. With tabular data, I find association rules more useful as a means to develop a deeper understanding of the data and inform more sophisticated analyses.

1. Learn data structure

This is especially important if you don’t know how your data was generated. Many datasets are products of highly human processes that can appear unintuitive to an unfamiliar analyst. Rules with a perfect confidence of 1 will explain ground truths in your data that are usually indicative of business rules, which would be difficult to uncover without documentation or a domain expert. For example when one column V1 equals ‘A’ or ‘B’, column V5 always equals ‘C’. These rules of confidence 1 often include items where variables are equal to “missing.” Since association rules handles only categorical variables, imputation or exclusion of missing values to accommodate numerical computation is not necessary. Missing values are commonly coded as just another possible category rather than discarded. Rules that include “missing” can always be pruned ex post if they are uninteresting.

2. Learn domain

Association rules are often (and rightly) criticized for generating gobs of trivial and redundant rules. However, rules that are trivial to a seasoned domain expert might provide new and useful information to a data scientist jumping into a new domain (where I often find myself). While often not the final answer, association rules can be used to start the conversation and knowledge sharing process between analysts, domain experts and stakeholders. I often find myself in chicken-egg scenarios where domain experts ask us (the consulting data scientists), the data scientists, what we need to know. I often know so little about the domain and data context, I don’t even know how to start asking intelligent questions. Association rules can be that starting point.

3. Hypothesis generation

This is a trickier endeavor. Discovering obvious patterns using association analysis as a newcomer to a dataset is useful, but relatively easy. Discovering the hidden insights that more directly lead to business decisions is difficult. On a recent project I was asked to discover “unknown unknowns” with little domain context and no specific problem to solve. We employed association rules with heavy amounts of rule post-processing using both human and statistical measures (discussed below) to home in on interesting rules which generated hypotheses to investigate further. Although most of the rules exposed database quality issues rather than “big picture” issues, it was a productive first step.

What makes a rule interesting?

Rules can be objectively interesting (to a computer) and/or subjectively interesting (to a human). Objectively interesting rules are quantified statistically from data by interest measures such as Lift (Interest), Conviction and Leverage. Subjectively interesting rules provide insight that is unexpected and/or actionable to a person. The end goal is obviously to develop rules that are interesting (subjectively) to the user or analyst. Human (especially expert human) time is more expensive than computer time, so objective interest measures are used to first sort and prune massive rule sets. The process of identifying interesting rules is usually iterative, requiring input from domain experts and post-processing adjustments from analysts using quantitative interest measures at each iteration. Visualization helps analysts understand the structure of association rules when there are too many to inspect manually. Michael Hahsler introduces a handy toolkit for visualization in the R arulesViz package, which builds on the arules package for mining rules.

Validating rules

It’s tempting to over-interpret association rules which present seemingly unbiased patterns very matter-of-factly. I experienced this on a recent project. We were searching for previously unknown relationships between software assets by monitoring time intervals where the same software assets consistently experienced issues in production. Initial feedback was stronger than expected, so we increased the weight of association rule relationships in our network model. However, once we began validating our results with software SMEs more thoroughly, we realized many of our rules were spurious. Association rules suffer from the Vast Search Effect: the tendency to discover spurious patterns when considering many possibilities. Researchers have proposed techniques to mitigate the issue, such as Swap Randomization and post-processing pruning. However, adoption in popular statistical software has been slow to catch on. For this reason, association analysis can be a dangerous approach to produce “final answers.” Instead, consider association analysis as a tool for exploratory data analysis to get familiar with a dataset’s structure, its domain and generate interesting questions to pursue further with more sophisticated analysis.