pacman::p_load(readxl, tidyverse, plyr, dplyr,
ggplot2, lubridate,
RcolorBrewer,
arules, arulesViz)Market Basket Analysis
Frequent Pattern Mining: Market Basket Analysis
This project aims to discover links between items frequently purchased together through analysing transaction data.
1: Data Preparation
1.1: R Packages
1.2: Data Loading and Transformation
This Project uses Online Retail dataset from UCI Machine Learning Repository
The dataset contains the following attributes:
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name.
- Quantity: The quantities of each product (item) per transaction.
- InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
order_data <- read_excel("data/Online_Retail.xlsx")complete.cases(data) returns a logical vector indicating which rows have no missing values. data[,] filters out the rows with missing values:
order_data <- order_data[complete.cases(order_data), ]
summary(order_data) InvoiceNo StockCode Description Quantity
Length:406829 Length:406829 Length:406829 Min. :-80995.00
Class :character Class :character Class :character 1st Qu.: 2.00
Mode :character Mode :character Mode :character Median : 5.00
Mean : 12.06
3rd Qu.: 12.00
Max. : 80995.00
InvoiceDate UnitPrice CustomerID
Min. :2010-12-01 08:26:00.00 Min. : 0.00 Min. :12346
1st Qu.:2011-04-06 15:02:00.00 1st Qu.: 1.25 1st Qu.:13953
Median :2011-07-31 11:48:00.00 Median : 1.95 Median :15152
Mean :2011-07-10 16:30:57.88 Mean : 3.46 Mean :15288
3rd Qu.:2011-10-20 13:06:00.00 3rd Qu.: 3.75 3rd Qu.:16791
Max. :2011-12-09 12:50:00.00 Max. :38970.00 Max. :18287
Country
Length:406829
Class :character
Mode :character
mutate()character data types to factor usingas.factor()- Convert InvoiceNo to Numeric
- Store InvoiceDate as date in new variable, ‘TransDate’
- Extract time from InvoiceDate and store in another variable, ‘TransTime’
order_data %>%
mutate(across(where(is.character), as.factor()))# A tibble: 406,829 × 8
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
<chr> <chr> <chr> <dbl> <dttm> <dbl>
1 536365 85123A WHITE HANGING HEA… 6 2010-12-01 08:26:00 2.55
2 536365 71053 WHITE METAL LANTE… 6 2010-12-01 08:26:00 3.39
3 536365 84406B CREAM CUPID HEART… 8 2010-12-01 08:26:00 2.75
4 536365 84029G KNITTED UNION FLA… 6 2010-12-01 08:26:00 3.39
5 536365 84029E RED WOOLLY HOTTIE… 6 2010-12-01 08:26:00 3.39
6 536365 22752 SET 7 BABUSHKA NE… 2 2010-12-01 08:26:00 7.65
7 536365 21730 GLASS STAR FROSTE… 6 2010-12-01 08:26:00 4.25
8 536366 22633 HAND WARMER UNION… 6 2010-12-01 08:28:00 1.85
9 536366 22632 HAND WARMER RED P… 6 2010-12-01 08:28:00 1.85
10 536367 84879 ASSORTED COLOUR B… 32 2010-12-01 08:34:00 1.69
# ℹ 406,819 more rows
# ℹ 2 more variables: CustomerID <dbl>, Country <chr>
order_data$InvoiceNo <- as.numeric(order_data$InvoiceNo)Warning: NAs introduced by coercion
order_data$TransDate <- as.Date(order_data$InvoiceDate)
order_data$TransTime <- format(order_data$InvoiceDate, format = "%H:%M:%S")
glimpse(order_data)Rows: 406,829
Columns: 10
$ InvoiceNo <dbl> 536365, 536365, 536365, 536365, 536365, 536365, 536365, 53…
$ StockCode <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752", …
$ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANTERN…
$ Quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3, 3, …
$ InvoiceDate <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12-01 08:2…
$ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, 1.69…
$ CustomerID <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17…
$ Country <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Uni…
$ TransDate <date> 2010-12-01, 2010-12-01, 2010-12-01, 2010-12-01, 2010-12-0…
$ TransTime <chr> "08:26:00", "08:26:00", "08:26:00", "08:26:00", "08:26:00"…
1.3: Creating a basket and storing it as transactionData
transactionData <- ddply(order_data, c("InvoiceNo","TransDate"),
function(df1)paste(df1$Description,
collapse = ","))
transactionData$InvoiceNo <- NULL
transactionData$Date <- NULL
colnames(transactionData) <- c("items")
write.csv(transactionData,"data/market_basket_transactions.csv", quote = FALSE, row.names = FALSE)Read transaction data:
trans <- read.transactions('data/market_basket_transactions.csv',
format = 'basket',
quote = "",
sep=',')Warning in asMethod(object): removing duplicated items in transactions
dim(trans)[1] 18839 4231
dim(data) shows that there are 18,839 transactions (rows) and 4231 items (columns).
summary(trans)transactions as itemMatrix in sparse format with
18839 rows (elements/itemsets/transactions) and
4231 columns (items) and a density of 0.005236714
most frequent items:
WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER
2011 1830
JUMBO BAG RED RETROSPOT PARTY BUNTING
1635 1397
ASSORTED COLOUR BIRD ORNAMENT (Other)
1385 409149
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 1401 734 644 635 667 611 576 589 621 524 551 502 476 521 548
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
547 469 440 481 424 414 349 337 312 245 267 251 221 265 232 204
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
173 170 173 149 125 139 110 129 123 115 110 98 92 87 90 82
49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
79 79 80 68 59 65 68 63 49 55 49 47 37 45 35 33
65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
35 32 40 39 38 23 29 32 14 27 32 25 21 19 17 10
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
17 18 19 16 20 16 11 15 13 9 8 11 15 12 10 5
97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
8 10 11 3 7 13 4 9 7 2 3 3 6 4 3 3
113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
2 5 4 4 7 5 5 5 5 4 8 5 1 4 5 3
129 131 133 134 135 136 137 138 139 140 141 142 143 144 145 147
4 2 1 4 1 1 3 3 1 1 2 2 2 2 1 5
149 150 151 152 153 156 158 159 160 165 166 168 171 172 178 179
1 1 1 2 1 2 1 1 1 1 1 1 1 2 3 2
181 182 186 188 194 195 197 205 206 209 212 221 231 251 260 264
1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
274 284 340 352 357 367 380 388 423 441 443 530 534 548
1 1 1 1 1 1 1 1 1 1 1 1 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 8.00 16.00 22.16 28.00 548.00
includes extended item information - examples:
labels
1 1 HANGER
2 10 COLOUR SPACEBOY PEN
3 12 COLOURED PARTY BALLOONS
summary() data reveals that the 2-itemset size is 1401, the largest amongst all k-itemsets.
2: Generating Item Frequency Plot
topN = 20 parameter specifies top 20 items by frequency
itemFrequencyPlot(trans,
topN = 20,
type = "absolute",
col = "#3A3B60",
main = "Item Frequency Plot",
xlab = "Absolute Frequency")
3: Generating Association Rules using apriori()
Each basket can be represented as a vector containing Boolean values pertaining to whether an item is in the basket (1) or not (0). Analysing the baskets to find patterns in the most common items purchased together to form association rules can be useful.
association_rules <- apriori(trans, parameter = list(supp=0.001,
conf=0.8,
maxlen=10))Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.8 0.1 1 none FALSE TRUE 5 0.001 1
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 18
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[4231 item(s), 18839 transaction(s)] done [0.14s].
sorting and recoding items ... [2899 item(s)] done [0.01s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10
Warning in apriori(trans, parameter = list(supp = 0.001, conf = 0.8, maxlen =
10)): Mining stopped (maxlen reached). Only patterns up to a length of 10
returned!
done [0.96s].
writing ... [274262 rule(s)] done [0.10s].
creating S4 object ... done [0.14s].
Display First 20 association rules by confidence:
association_rules <- sort(association_rules,
decreasing = TRUE,
by = "confidence")
inspect(association_rules[1:20]) lhs rhs support
[1] {PINK SPOTS} => {SWISS ROLL TOWEL} 0.001008546
[2] {WOBBLY CHICKEN} => {METAL} 0.001486278
[3] {WOBBLY CHICKEN} => {DECORATION} 0.001486278
[4] {DECOUPAGE} => {GREETING CARD} 0.001220872
[5] {FLOWER FAIRY} => {5 SUMMER B'DRAW LINERS} 0.001327034
[6] {5 SUMMER B'DRAW LINERS} => {FLOWER FAIRY} 0.001327034
[7] {TREES} => {CHRISTMAS GARLAND STARS} 0.001114709
[8] {CHRISTMAS GARLAND STARS} => {TREES} 0.001114709
[9] {BILLBOARD FONTS DESIGN} => {WRAP} 0.001804767
[10] {WOBBLY RABBIT} => {METAL} 0.001804767
[11] {WOBBLY RABBIT} => {DECORATION} 0.001804767
[12] {METAL} => {DECORATION} 0.002654069
[13] {DECORATION} => {METAL} 0.002654069
[14] {BLACK TEA} => {SUGAR JARS} 0.002707150
[15] {BLACK TEA} => {COFFEE} 0.002707150
[16] {CHOCOLATE SPOTS} => {SWISS ROLL TOWEL} 0.002654069
[17] {1 HANGER} => {HOOK} 0.003078720
[18] {HOOK} => {1 HANGER} 0.003078720
[19] {1 HANGER} => {MAGIC GARDEN} 0.003078720
[20] {MAGIC GARDEN} => {1 HANGER} 0.003078720
confidence coverage lift count
[1] 1 0.001008546 330.50877 19
[2] 1 0.001486278 376.78000 28
[3] 1 0.001486278 376.78000 28
[4] 1 0.001220872 313.98333 23
[5] 1 0.001327034 753.56000 25
[6] 1 0.001327034 753.56000 25
[7] 1 0.001114709 897.09524 21
[8] 1 0.001114709 897.09524 21
[9] 1 0.001804767 509.16216 34
[10] 1 0.001804767 376.78000 34
[11] 1 0.001804767 376.78000 34
[12] 1 0.002654069 376.78000 50
[13] 1 0.002654069 376.78000 50
[14] 1 0.002707150 188.39000 51
[15] 1 0.002707150 54.76453 51
[16] 1 0.002654069 330.50877 50
[17] 1 0.003078720 324.81034 58
[18] 1 0.003078720 324.81034 58
[19] 1 0.003078720 324.81034 58
[20] 1 0.003078720 324.81034 58
Display association rules by lift value:
Lift
- Lift value is the measure of importance of an association rule
- It is the ratio of the confidence of the rule / expected confidence of the rule
- Lift value < 1: negative dependence or substitution effect
- Lift value > 1: positive dependence or complementary effect
- Lift value = 1: implies almost no association between items
\[Lift = \frac{Confidence}{Expected\ Confidence} = \frac{P\left(A \cap B\right)}{P\left(A\right).P\left(B\right)}\]
\[Lift({X} -> {Y}) = \frac{(Transactions\ containing\ both\ X\ and\ Y)/(Transactions\ containing\ X)}{(Transactions\ containing\ Y)}\]
inspect(head(sort(association_rules, by = "lift"))) lhs rhs support
[1] {TREES} => {CHRISTMAS GARLAND STARS} 0.001114709
[2] {CHRISTMAS GARLAND STARS} => {TREES} 0.001114709
[3] {FLOWER FAIRY} => {5 SUMMER B'DRAW LINERS} 0.001327034
[4] {5 SUMMER B'DRAW LINERS} => {FLOWER FAIRY} 0.001327034
[5] {PINK KNITTED EGG COSY} => {BLUE KNITTED EGG COSY} 0.001167790
[6] {BLUE KNITTED EGG COSY} => {PINK KNITTED EGG COSY} 0.001167790
confidence coverage lift count
[1] 1.00 0.001114709 897.0952 21
[2] 1.00 0.001114709 897.0952 21
[3] 1.00 0.001327034 753.5600 25
[4] 1.00 0.001327034 753.5600 25
[5] 0.88 0.001327034 663.1328 22
[6] 0.88 0.001327034 663.1328 22
4: Filter rules
4.1: Select only association rules with confidence greater than 0.6 or 60%
subRules<-association_rules[quality(association_rules)$confidence > 0.6]
summary(subRules)set of 274262 rules
rule length distribution (lhs + rhs):sizes
2 3 4 5 6 7 8 9 10
128 8208 24106 57299 85063 66943 27329 4826 360
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 5.000 6.000 6.025 7.000 10.000
summary of quality measures:
support confidence coverage lift
Min. :0.001009 Min. :0.8000 Min. :0.001009 Min. : 7.494
1st Qu.:0.001062 1st Qu.:0.8333 1st Qu.:0.001168 1st Qu.: 16.174
Median :0.001115 Median :0.8750 Median :0.001327 Median : 19.893
Mean :0.001288 Mean :0.8856 Mean :0.001461 Mean : 32.820
3rd Qu.:0.001327 3rd Qu.:0.9231 3rd Qu.:0.001539 3rd Qu.: 29.643
Max. :0.025054 Max. :1.0000 Max. :0.030893 Max. :897.095
count
Min. : 19.00
1st Qu.: 20.00
Median : 21.00
Mean : 24.26
3rd Qu.: 25.00
Max. :472.00
mining info:
data ntransactions support confidence
trans 18839 0.001 0.8
call
apriori(data = trans, parameter = list(supp = 0.001, conf = 0.8, maxlen = 10))
Interactive Scatterplot:
plot(subRules,
method = "scatterplot",
measure = c("support", "confidence"),
shading = "lift",
max = 100,
jitter = 0,
engine = "plotly")Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
(change control parameter max if needed).
4.2 Generating rules for specific items
Suppose we would like to find out more about rules regarding most frequent item, “WHITE HANGING HEART T-LIGHT HOLDER”:
rules_frequent_item <- apriori (data=trans,
parameter=list (supp=0.001,conf = 0.6),
appearance = list (default="lhs",rhs="WHITE HANGING HEART T-LIGHT HOLDER"),
control = list (verbose=F))
rules_frequent_item <-sort(rules_frequent_item,
decreasing = TRUE,
by = "confidence")
# remove redundant rules that are subsets of larger rules
subsetRules <- which(colSums(is.subset(rules_frequent_item, rules_frequent_item)) > 1)
#length(subsetRules) #> 584
rules_frequent_item <- rules_frequent_item[-subsetRules]
inspect(sort(rules_frequent_item[1:20], by = "lift")) lhs rhs support confidence coverage lift count
[1] {KNITTED UNION FLAG HOT WATER BOTTLE,
SET 7 BABUSHKA NESTING BOXES} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001220872 1.0000000 0.001220872 9.367976 23
[2] {GLASS STAR FROSTED T-LIGHT HOLDER,
SET 7 BABUSHKA NESTING BOXES} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001220872 0.9583333 0.001273953 8.977644 23
[3] {GLASS STAR FROSTED T-LIGHT HOLDER,
WHITE METAL LANTERN} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001114709 0.9545455 0.001167790 8.942159 21
[4] {GLASS STAR FROSTED T-LIGHT HOLDER,
KNITTED UNION FLAG HOT WATER BOTTLE} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001114709 0.9545455 0.001167790 8.942159 21
[5] {GLASS STAR FROSTED T-LIGHT HOLDER,
RED WOOLLY HOTTIE WHITE HEART.} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627 0.9090909 0.001167790 8.516342 20
[6] {SET 7 BABUSHKA NESTING BOXES,
WHITE METAL LANTERN} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627 0.9090909 0.001167790 8.516342 20
[7] {KNITTED UNION FLAG HOT WATER BOTTLE,
WHITE METAL LANTERN} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546 0.9047619 0.001114709 8.475788 19
[8] {GLASS STAR FROSTED T-LIGHT HOLDER,
WOOD 2 DRAWER CABINET WHITE FINISH} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627 0.8333333 0.001273953 7.806647 20
[9] {CREAM CUPID HEARTS COAT HANGER,
KNITTED UNION FLAG HOT WATER BOTTLE} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546 0.8260870 0.001220872 7.738763 19
[10] {PACK OF 60 DINOSAUR CAKE CASES,
POPCORN HOLDER,
SET OF 4 PANTRY JELLY MOULDS} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546 0.8260870 0.001220872 7.738763 19
[11] {CREAM CUPID HEARTS COAT HANGER,
SET 7 BABUSHKA NESTING BOXES} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627 0.8000000 0.001327034 7.494381 20
[12] {SET 7 BABUSHKA NESTING BOXES,
WOOD 2 DRAWER CABINET WHITE FINISH} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627 0.8000000 0.001327034 7.494381 20
[13] {HEART OF WICKER LARGE,
HOME BUILDING BLOCK WORD,
NATURAL SLATE HEART CHALKBOARD} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627 0.8000000 0.001327034 7.494381 20
[14] {VINTAGE BILLBOARD LOVE/HATE MUG,
WOOD 2 DRAWER CABINET WHITE FINISH} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546 0.7916667 0.001273953 7.416314 19
[15] {RED WOOLLY HOTTIE WHITE HEART.,
WHITE METAL LANTERN} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546 0.7916667 0.001273953 7.416314 19
[16] {SAVE THE PLANET MUG,
SET 7 BABUSHKA NESTING BOXES} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001008546 0.7916667 0.001273953 7.416314 19
[17] {CREAM CUPID HEARTS COAT HANGER,
WHITE METAL LANTERN} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001167790 0.7857143 0.001486278 7.360553 22
[18] {SET 7 BABUSHKA NESTING BOXES,
WOODEN PICTURE FRAME WHITE FINISH} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001167790 0.7857143 0.001486278 7.360553 22
[19] {RED WOOLLY HOTTIE WHITE HEART.,
SET 7 BABUSHKA NESTING BOXES} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001327034 0.7812500 0.001698604 7.318731 25
[20] {GLASS STAR FROSTED T-LIGHT HOLDER,
WOODEN FRAME ANTIQUE WHITE} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001061627 0.7692308 0.001380116 7.206135 20
Plot top 10 rules for item by confidence:
rules_frequent_item <-sort(rules_frequent_item,
decreasing = TRUE,
by = "confidence")
top10_rules <- rules_frequent_item[1:10]
plot(top10_rules,
method = "graph",
engine = "htmlwidget")