Market Basket Analysis

Published

June 17, 2023

Frequent Pattern Mining: Market Basket Analysis

This project aims to discover links between items frequently purchased together through analysing transaction data.

Link to reference article

1: Data Preparation

1.1: R Packages

pacman::p_load(readxl, tidyverse, plyr, dplyr,
               ggplot2, lubridate,
               RcolorBrewer,
               arules, arulesViz)

1.2: Data Loading and Transformation

This Project uses Online Retail dataset from UCI Machine Learning Repository

The dataset contains the following attributes:

  • InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
  • StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
  • Description: Product (item) name.
  • Quantity: The quantities of each product (item) per transaction.
  • InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
  • UnitPrice: Unit price. Numeric, Product price per unit in sterling.
  • CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
  • Country: Country name. Nominal, the name of the country where each customer resides.
order_data <- read_excel("data/Online_Retail.xlsx")

complete.cases(data) returns a logical vector indicating which rows have no missing values. data[,] filters out the rows with missing values:

order_data <- order_data[complete.cases(order_data), ]
summary(order_data)
  InvoiceNo          StockCode         Description           Quantity        
 Length:406829      Length:406829      Length:406829      Min.   :-80995.00  
 Class :character   Class :character   Class :character   1st Qu.:     2.00  
 Mode  :character   Mode  :character   Mode  :character   Median :     5.00  
                                                          Mean   :    12.06  
                                                          3rd Qu.:    12.00  
                                                          Max.   : 80995.00  
  InvoiceDate                       UnitPrice          CustomerID   
 Min.   :2010-12-01 08:26:00.00   Min.   :    0.00   Min.   :12346  
 1st Qu.:2011-04-06 15:02:00.00   1st Qu.:    1.25   1st Qu.:13953  
 Median :2011-07-31 11:48:00.00   Median :    1.95   Median :15152  
 Mean   :2011-07-10 16:30:57.88   Mean   :    3.46   Mean   :15288  
 3rd Qu.:2011-10-20 13:06:00.00   3rd Qu.:    3.75   3rd Qu.:16791  
 Max.   :2011-12-09 12:50:00.00   Max.   :38970.00   Max.   :18287  
   Country         
 Length:406829     
 Class :character  
 Mode  :character  
                   
                   
                   
  • mutate() character data types to factor using as.factor()
  • Convert InvoiceNo to Numeric
  • Store InvoiceDate as date in new variable, ‘TransDate’
  • Extract time from InvoiceDate and store in another variable, ‘TransTime’
order_data %>%
  mutate(across(where(is.character), as.factor()))
# A tibble: 406,829 × 8
   InvoiceNo StockCode Description        Quantity InvoiceDate         UnitPrice
   <chr>     <chr>     <chr>                 <dbl> <dttm>                  <dbl>
 1 536365    85123A    WHITE HANGING HEA…        6 2010-12-01 08:26:00      2.55
 2 536365    71053     WHITE METAL LANTE…        6 2010-12-01 08:26:00      3.39
 3 536365    84406B    CREAM CUPID HEART…        8 2010-12-01 08:26:00      2.75
 4 536365    84029G    KNITTED UNION FLA…        6 2010-12-01 08:26:00      3.39
 5 536365    84029E    RED WOOLLY HOTTIE…        6 2010-12-01 08:26:00      3.39
 6 536365    22752     SET 7 BABUSHKA NE…        2 2010-12-01 08:26:00      7.65
 7 536365    21730     GLASS STAR FROSTE…        6 2010-12-01 08:26:00      4.25
 8 536366    22633     HAND WARMER UNION…        6 2010-12-01 08:28:00      1.85
 9 536366    22632     HAND WARMER RED P…        6 2010-12-01 08:28:00      1.85
10 536367    84879     ASSORTED COLOUR B…       32 2010-12-01 08:34:00      1.69
# ℹ 406,819 more rows
# ℹ 2 more variables: CustomerID <dbl>, Country <chr>
order_data$InvoiceNo <- as.numeric(order_data$InvoiceNo)
Warning: NAs introduced by coercion
order_data$TransDate <- as.Date(order_data$InvoiceDate)

order_data$TransTime <- format(order_data$InvoiceDate, format = "%H:%M:%S")

glimpse(order_data)
Rows: 406,829
Columns: 10
$ InvoiceNo   <dbl> 536365, 536365, 536365, 536365, 536365, 536365, 536365, 53…
$ StockCode   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752", …
$ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANTERN…
$ Quantity    <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3, 3, …
$ InvoiceDate <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 08:2…
$ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, 1.69…
$ CustomerID  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17…
$ Country     <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Uni…
$ TransDate   <date> 2010-12-01, 2010-12-01, 2010-12-01, 2010-12-01, 2010-12-0…
$ TransTime   <chr> "08:26:00", "08:26:00", "08:26:00", "08:26:00", "08:26:00"…

1.3: Creating a basket and storing it as transactionData

transactionData <- ddply(order_data, c("InvoiceNo","TransDate"),
                         function(df1)paste(df1$Description,
                                            collapse = ","))
transactionData$InvoiceNo <- NULL
transactionData$Date <- NULL

colnames(transactionData) <- c("items")

write.csv(transactionData,"data/market_basket_transactions.csv", quote = FALSE, row.names = FALSE)

Read transaction data:

trans <- read.transactions('data/market_basket_transactions.csv', 
                           format = 'basket', 
                           quote = "",
                           sep=',')
Warning in asMethod(object): removing duplicated items in transactions
dim(trans)
[1] 18839  4231

dim(data) shows that there are 18,839 transactions (rows) and 4231 items (columns).

summary(trans)
transactions as itemMatrix in sparse format with
 18839 rows (elements/itemsets/transactions) and
 4231 columns (items) and a density of 0.005236714 

most frequent items:
WHITE HANGING HEART T-LIGHT HOLDER           REGENCY CAKESTAND 3 TIER 
                              2011                               1830 
           JUMBO BAG RED RETROSPOT                      PARTY BUNTING 
                              1635                               1397 
     ASSORTED COLOUR BIRD ORNAMENT                            (Other) 
                              1385                             409149 

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
   1 1401  734  644  635  667  611  576  589  621  524  551  502  476  521  548 
  17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
 547  469  440  481  424  414  349  337  312  245  267  251  221  265  232  204 
  33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
 173  170  173  149  125  139  110  129  123  115  110   98   92   87   90   82 
  49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
  79   79   80   68   59   65   68   63   49   55   49   47   37   45   35   33 
  65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
  35   32   40   39   38   23   29   32   14   27   32   25   21   19   17   10 
  81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
  17   18   19   16   20   16   11   15   13    9    8   11   15   12   10    5 
  97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
   8   10   11    3    7   13    4    9    7    2    3    3    6    4    3    3 
 113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128 
   2    5    4    4    7    5    5    5    5    4    8    5    1    4    5    3 
 129  131  133  134  135  136  137  138  139  140  141  142  143  144  145  147 
   4    2    1    4    1    1    3    3    1    1    2    2    2    2    1    5 
 149  150  151  152  153  156  158  159  160  165  166  168  171  172  178  179 
   1    1    1    2    1    2    1    1    1    1    1    1    1    2    3    2 
 181  182  186  188  194  195  197  205  206  209  212  221  231  251  260  264 
   1    2    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
 274  284  340  352  357  367  380  388  423  441  443  530  534  548 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    8.00   16.00   22.16   28.00  548.00 

includes extended item information - examples:
                      labels
1                   1 HANGER
2     10 COLOUR SPACEBOY PEN
3 12 COLOURED PARTY BALLOONS

summary() data reveals that the 2-itemset size is 1401, the largest amongst all k-itemsets.

2: Generating Item Frequency Plot

topN = 20 parameter specifies top 20 items by frequency

itemFrequencyPlot(trans, 
                  topN = 20,
                  type = "absolute",
                  col = "#3A3B60",
                  main = "Item Frequency Plot",
                  xlab = "Absolute Frequency")

3: Generating Association Rules using apriori()

Each basket can be represented as a vector containing Boolean values pertaining to whether an item is in the basket (1) or not (0). Analysing the baskets to find patterns in the most common items purchased together to form association rules can be useful.

association_rules <- apriori(trans, parameter = list(supp=0.001,
                                                     conf=0.8,
                                                     maxlen=10))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.8    0.1    1 none FALSE            TRUE       5   0.001      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 18 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[4231 item(s), 18839 transaction(s)] done [0.14s].
sorting and recoding items ... [2899 item(s)] done [0.01s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10
Warning in apriori(trans, parameter = list(supp = 0.001, conf = 0.8, maxlen =
10)): Mining stopped (maxlen reached). Only patterns up to a length of 10
returned!
 done [0.96s].
writing ... [274262 rule(s)] done [0.10s].
creating S4 object  ... done [0.14s].

Display First 20 association rules by confidence:

association_rules <- sort(association_rules,
                          decreasing = TRUE,
                          by = "confidence")

inspect(association_rules[1:20])
     lhs                          rhs                       support    
[1]  {PINK  SPOTS}             => {SWISS ROLL TOWEL}        0.001008546
[2]  {WOBBLY CHICKEN}          => {METAL}                   0.001486278
[3]  {WOBBLY CHICKEN}          => {DECORATION}              0.001486278
[4]  {DECOUPAGE}               => {GREETING CARD}           0.001220872
[5]  {FLOWER FAIRY}            => {5 SUMMER B'DRAW LINERS}  0.001327034
[6]  {5 SUMMER B'DRAW LINERS}  => {FLOWER FAIRY}            0.001327034
[7]  {TREES}                   => {CHRISTMAS GARLAND STARS} 0.001114709
[8]  {CHRISTMAS GARLAND STARS} => {TREES}                   0.001114709
[9]  {BILLBOARD FONTS DESIGN}  => {WRAP}                    0.001804767
[10] {WOBBLY RABBIT}           => {METAL}                   0.001804767
[11] {WOBBLY RABBIT}           => {DECORATION}              0.001804767
[12] {METAL}                   => {DECORATION}              0.002654069
[13] {DECORATION}              => {METAL}                   0.002654069
[14] {BLACK TEA}               => {SUGAR JARS}              0.002707150
[15] {BLACK TEA}               => {COFFEE}                  0.002707150
[16] {CHOCOLATE  SPOTS}        => {SWISS ROLL TOWEL}        0.002654069
[17] {1 HANGER}                => {HOOK}                    0.003078720
[18] {HOOK}                    => {1 HANGER}                0.003078720
[19] {1 HANGER}                => {MAGIC GARDEN}            0.003078720
[20] {MAGIC GARDEN}            => {1 HANGER}                0.003078720
     confidence coverage    lift      count
[1]  1          0.001008546 330.50877 19   
[2]  1          0.001486278 376.78000 28   
[3]  1          0.001486278 376.78000 28   
[4]  1          0.001220872 313.98333 23   
[5]  1          0.001327034 753.56000 25   
[6]  1          0.001327034 753.56000 25   
[7]  1          0.001114709 897.09524 21   
[8]  1          0.001114709 897.09524 21   
[9]  1          0.001804767 509.16216 34   
[10] 1          0.001804767 376.78000 34   
[11] 1          0.001804767 376.78000 34   
[12] 1          0.002654069 376.78000 50   
[13] 1          0.002654069 376.78000 50   
[14] 1          0.002707150 188.39000 51   
[15] 1          0.002707150  54.76453 51   
[16] 1          0.002654069 330.50877 50   
[17] 1          0.003078720 324.81034 58   
[18] 1          0.003078720 324.81034 58   
[19] 1          0.003078720 324.81034 58   
[20] 1          0.003078720 324.81034 58   

Display association rules by lift value:

Lift

  • Lift value is the measure of importance of an association rule
  • It is the ratio of the confidence of the rule / expected confidence of the rule
  • Lift value < 1: negative dependence or substitution effect
  • Lift value > 1: positive dependence or complementary effect
  • Lift value = 1: implies almost no association between items

\[Lift = \frac{Confidence}{Expected\ Confidence} = \frac{P\left(A \cap B\right)}{P\left(A\right).P\left(B\right)}\]
\[Lift({X} -> {Y}) = \frac{(Transactions\ containing\ both\ X\ and\ Y)/(Transactions\ containing\ X)}{(Transactions\ containing\ Y)}\]

inspect(head(sort(association_rules, by = "lift")))
    lhs                          rhs                       support    
[1] {TREES}                   => {CHRISTMAS GARLAND STARS} 0.001114709
[2] {CHRISTMAS GARLAND STARS} => {TREES}                   0.001114709
[3] {FLOWER FAIRY}            => {5 SUMMER B'DRAW LINERS}  0.001327034
[4] {5 SUMMER B'DRAW LINERS}  => {FLOWER FAIRY}            0.001327034
[5] {PINK KNITTED EGG COSY}   => {BLUE KNITTED EGG COSY}   0.001167790
[6] {BLUE KNITTED EGG COSY}   => {PINK KNITTED EGG COSY}   0.001167790
    confidence coverage    lift     count
[1] 1.00       0.001114709 897.0952 21   
[2] 1.00       0.001114709 897.0952 21   
[3] 1.00       0.001327034 753.5600 25   
[4] 1.00       0.001327034 753.5600 25   
[5] 0.88       0.001327034 663.1328 22   
[6] 0.88       0.001327034 663.1328 22   

4: Filter rules

4.1: Select only association rules with confidence greater than 0.6 or 60%

subRules<-association_rules[quality(association_rules)$confidence > 0.6]

summary(subRules)
set of 274262 rules

rule length distribution (lhs + rhs):sizes
    2     3     4     5     6     7     8     9    10 
  128  8208 24106 57299 85063 66943 27329  4826   360 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   5.000   6.000   6.025   7.000  10.000 

summary of quality measures:
    support           confidence        coverage             lift        
 Min.   :0.001009   Min.   :0.8000   Min.   :0.001009   Min.   :  7.494  
 1st Qu.:0.001062   1st Qu.:0.8333   1st Qu.:0.001168   1st Qu.: 16.174  
 Median :0.001115   Median :0.8750   Median :0.001327   Median : 19.893  
 Mean   :0.001288   Mean   :0.8856   Mean   :0.001461   Mean   : 32.820  
 3rd Qu.:0.001327   3rd Qu.:0.9231   3rd Qu.:0.001539   3rd Qu.: 29.643  
 Max.   :0.025054   Max.   :1.0000   Max.   :0.030893   Max.   :897.095  
     count       
 Min.   : 19.00  
 1st Qu.: 20.00  
 Median : 21.00  
 Mean   : 24.26  
 3rd Qu.: 25.00  
 Max.   :472.00  

mining info:
  data ntransactions support confidence
 trans         18839   0.001        0.8
                                                                           call
 apriori(data = trans, parameter = list(supp = 0.001, conf = 0.8, maxlen = 10))

Interactive Scatterplot:

plot(subRules, 
     method = "scatterplot", 
     measure = c("support", "confidence"),
     shading = "lift",
     max = 100,
     jitter = 0,
     engine = "plotly")
Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
(change control parameter max if needed).

4.2 Generating rules for specific items

Suppose we would like to find out more about rules regarding most frequent item, “WHITE HANGING HEART T-LIGHT HOLDER”:

rules_frequent_item <- apriori (data=trans, 
                                parameter=list (supp=0.001,conf = 0.6), 
                                appearance = list (default="lhs",rhs="WHITE HANGING HEART T-LIGHT HOLDER"), 
                                control = list (verbose=F))

rules_frequent_item <-sort(rules_frequent_item,
                               decreasing = TRUE,
                               by = "confidence")

# remove redundant rules that are subsets of larger rules
subsetRules <- which(colSums(is.subset(rules_frequent_item, rules_frequent_item)) > 1) 
#length(subsetRules) #> 584
rules_frequent_item <- rules_frequent_item[-subsetRules]

inspect(sort(rules_frequent_item[1:20], by = "lift"))
     lhs                                       rhs                                      support confidence    coverage     lift count
[1]  {KNITTED UNION FLAG HOT WATER BOTTLE,                                                                                           
      SET 7 BABUSHKA NESTING BOXES}         => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001220872  1.0000000 0.001220872 9.367976    23
[2]  {GLASS STAR FROSTED T-LIGHT HOLDER,                                                                                             
      SET 7 BABUSHKA NESTING BOXES}         => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001220872  0.9583333 0.001273953 8.977644    23
[3]  {GLASS STAR FROSTED T-LIGHT HOLDER,                                                                                             
      WHITE METAL LANTERN}                  => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001114709  0.9545455 0.001167790 8.942159    21
[4]  {GLASS STAR FROSTED T-LIGHT HOLDER,                                                                                             
      KNITTED UNION FLAG HOT WATER BOTTLE}  => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001114709  0.9545455 0.001167790 8.942159    21
[5]  {GLASS STAR FROSTED T-LIGHT HOLDER,                                                                                             
      RED WOOLLY HOTTIE WHITE HEART.}       => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627  0.9090909 0.001167790 8.516342    20
[6]  {SET 7 BABUSHKA NESTING BOXES,                                                                                                  
      WHITE METAL LANTERN}                  => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627  0.9090909 0.001167790 8.516342    20
[7]  {KNITTED UNION FLAG HOT WATER BOTTLE,                                                                                           
      WHITE METAL LANTERN}                  => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546  0.9047619 0.001114709 8.475788    19
[8]  {GLASS STAR FROSTED T-LIGHT HOLDER,                                                                                             
      WOOD 2 DRAWER CABINET WHITE FINISH}   => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627  0.8333333 0.001273953 7.806647    20
[9]  {CREAM CUPID HEARTS COAT HANGER,                                                                                                
      KNITTED UNION FLAG HOT WATER BOTTLE}  => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546  0.8260870 0.001220872 7.738763    19
[10] {PACK OF 60 DINOSAUR CAKE CASES,                                                                                                
      POPCORN HOLDER,                                                                                                                
      SET OF 4 PANTRY JELLY MOULDS}         => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546  0.8260870 0.001220872 7.738763    19
[11] {CREAM CUPID HEARTS COAT HANGER,                                                                                                
      SET 7 BABUSHKA NESTING BOXES}         => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627  0.8000000 0.001327034 7.494381    20
[12] {SET 7 BABUSHKA NESTING BOXES,                                                                                                  
      WOOD 2 DRAWER CABINET WHITE FINISH}   => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627  0.8000000 0.001327034 7.494381    20
[13] {HEART OF WICKER LARGE,                                                                                                         
      HOME BUILDING BLOCK WORD,                                                                                                      
      NATURAL SLATE HEART CHALKBOARD}       => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627  0.8000000 0.001327034 7.494381    20
[14] {VINTAGE BILLBOARD LOVE/HATE MUG,                                                                                               
      WOOD 2 DRAWER CABINET WHITE FINISH}   => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546  0.7916667 0.001273953 7.416314    19
[15] {RED WOOLLY HOTTIE WHITE HEART.,                                                                                                
      WHITE METAL LANTERN}                  => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546  0.7916667 0.001273953 7.416314    19
[16] {SAVE THE PLANET MUG,                                                                                                           
      SET 7 BABUSHKA NESTING BOXES}         => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546  0.7916667 0.001273953 7.416314    19
[17] {CREAM CUPID HEARTS COAT HANGER,                                                                                                
      WHITE METAL LANTERN}                  => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001167790  0.7857143 0.001486278 7.360553    22
[18] {SET 7 BABUSHKA NESTING BOXES,                                                                                                  
      WOODEN PICTURE FRAME WHITE FINISH}    => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001167790  0.7857143 0.001486278 7.360553    22
[19] {RED WOOLLY HOTTIE WHITE HEART.,                                                                                                
      SET 7 BABUSHKA NESTING BOXES}         => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001327034  0.7812500 0.001698604 7.318731    25
[20] {GLASS STAR FROSTED T-LIGHT HOLDER,                                                                                             
      WOODEN FRAME ANTIQUE WHITE}           => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627  0.7692308 0.001380116 7.206135    20

Plot top 10 rules for item by confidence:

rules_frequent_item <-sort(rules_frequent_item,
                          decreasing = TRUE,
                          by = "confidence")

top10_rules <- rules_frequent_item[1:10]

plot(top10_rules,
     method = "graph",
     engine = "htmlwidget")