Also called affinity analysis or market basket analysis.
Aimed at discovering which groups of products tend to be purchased together.
Example 1: Purchases of Phone Faceplates
Transaction  Items purchased
1            red, white, green
2            white, orange
3            white, blue
4            red, white, orange
5            red, blue
6            white, blue
7            red, blue
8            red, white, blue, green
9            red, white, blue
10           yellow
Rule 1: IF {red} THEN {white}
antecedent (IF): {red}
consequent (THEN): {white}
Rule 2: IF {red, green} THEN {white}
antecedent (IF): {red, green}
consequent (THEN): {white}
Main idea: Generate candidate rules
Step 1: Generate all the rules that would be candidates for indicating associations between items.
Selecting strong rules
From many possibilities, the goal is to find only the rules that indicate a strong dependence between the antecedent and consequent itemsets.
We use three measures of the strength of the association implied by a rule:
Support
Confidence
Lift ratio
Support indicates the popularity of the itemset X.
\[Support(X) = \frac{\text{Number of transactions in which X appears}}{\text{Total number of transactions}}\]
Example:
\[Support(\{red, white\}) = \frac{4}{10} \times 100\% = 40\%\]
For a rule, support is computed over the antecedent and consequent together:
\[Support = \hat{P}(\text{antecedent AND consequent})\]
Exercise: Compute support for all item sets.
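One way to sketch this computation in R, using a 0/1 incidence matrix of the ten transactions (the `support` helper below is illustrative, not part of the original code):

```r
# 0/1 incidence matrix: rows = transactions, columns = items
mat <- cbind(
  Red    = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 0),
  White  = c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0),
  Blue   = c(0, 0, 1, 0, 1, 1, 1, 1, 1, 0),
  Orange = c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0),
  Green  = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0),
  Yellow = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
)

# Support of an itemset: fraction of transactions containing all its items
support <- function(mat, items) {
  mean(rowSums(mat[, items, drop = FALSE]) == length(items))
}

support(mat, c("Red", "White"))  # 0.4, i.e. 40%
```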
Confidence expresses the degree of uncertainty about an IF-THEN rule.
\[Confidence (X \rightarrow Y) = \frac{\text{Number of transactions with both antecedent and consequent itemsets}}{\text{Number of transactions with antecedent itemset}}\]
\[Confidence = \hat{P}(consequent|antecedent)\]
Example:
\[\{red, white\} \rightarrow \{green\}\]
\[Confidence = \frac{\text{support of } \{red, white, green\}}{\text{support of } \{red, white\}} = \frac{2}{4} \times 100\% = 50\% \]
Exercise: Compute the confidence for the following rules:
\[\{green\} \rightarrow \{red\}\]
\[\{white, green\} \rightarrow \{red\}\]
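A self-contained sketch of the same computation in R (the `confidence` helper is illustrative, not from the notes):

```r
# 0/1 incidence matrix for the relevant items: rows = transactions
mat <- cbind(
  Red   = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 0),
  White = c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0),
  Green = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0)
)

# Confidence of lhs -> rhs: transactions containing both itemsets,
# divided by transactions containing the antecedent
confidence <- function(mat, lhs, rhs) {
  both <- rowSums(mat[, c(lhs, rhs), drop = FALSE]) == length(c(lhs, rhs))
  ante <- rowSums(mat[, lhs, drop = FALSE]) == length(lhs)
  sum(both) / sum(ante)
}

confidence(mat, "Green", "Red")              # 1: every green basket has red
confidence(mat, c("White", "Green"), "Red")  # 1
```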
The benchmark confidence is the confidence we would expect if the antecedent and consequent itemsets were independent:
\[\text{Benchmark confidence} = \frac{\text{No. of transactions with consequent itemset}}{\text{No. of transactions in database}}\]
\[\text{Lift ratio} = \frac{\text{confidence}}{\text{benchmark confidence}}\]
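For example, for the rule {red, white} → {green} in the faceplate data: confidence is 50%, while the benchmark confidence is the support of {green} = 2/10 = 20%, so
\[\text{Lift ratio} = \frac{0.5}{0.2} = 2.5\]
A lift ratio greater than 1 suggests that the antecedent provides useful information about the consequent.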
Image credit: https://wiki.smu.edu.sg/1718t3isss608/Group17_Proposal
library(arules)

# Binary incidence matrix: one row per transaction, one column per item
Red <- c(1, 0, 0, 1, 1, 0, 1, 1, 1, 0)
White <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)
Blue <- c(0, 0, 1, 0, 1, 1, 1, 1, 1, 0)
Orange <- c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0)
Green <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0)
Yellow <- c(rep(0, 9), 1)
mat <- matrix(c(Red, White, Blue,
                Orange, Green, Yellow), nrow = 10)
mat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 1 0 0 1 0
## [2,] 0 1 0 1 0 0
## [3,] 0 1 1 0 0 0
## [4,] 1 1 0 1 0 0
## [5,] 1 0 1 0 0 0
## [6,] 0 1 1 0 0 0
## [7,] 1 0 1 0 0 0
## [8,] 1 1 1 0 1 0
## [9,] 1 1 1 0 0 0
## [10,] 0 0 0 0 0 1
colnames(mat) <- c("Red", "White", "Blue", "Orange", "Green", "Yellow")
fp.trans <- as(mat, "transactions")
inspect(fp.trans)
## items
## [1] {Red, White, Green}
## [2] {White, Orange}
## [3] {White, Blue}
## [4] {Red, White, Orange}
## [5] {Red, Blue}
## [6] {White, Blue}
## [7] {Red, Blue}
## [8] {Red, White, Blue, Green}
## [9] {Red, White, Blue}
## [10] {Yellow}
rules <- apriori(fp.trans, parameter = list(supp=0.2, conf=0.5, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.2 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 2
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [18 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# inspect the six rules with the highest lift
inspect(head(sort(rules, by="lift"), n=6))
## lhs rhs support confidence coverage lift count
## [1] {Red, White} => {Green} 0.2 0.5 0.4 2.500000 2
## [2] {Green} => {Red} 0.2 1.0 0.2 1.666667 2
## [3] {White, Green} => {Red} 0.2 1.0 0.2 1.666667 2
## [4] {Orange} => {White} 0.2 1.0 0.2 1.428571 2
## [5] {Green} => {White} 0.2 1.0 0.2 1.428571 2
## [6] {Red, Green} => {White} 0.2 1.0 0.2 1.428571 2
Interpretation: the top rule {Red, White} => {Green} has confidence 50% against a benchmark confidence of support({Green}) = 20%, giving lift 0.5/0.2 = 2.5; baskets containing red and white are 2.5 times as likely to also contain green as a randomly chosen basket.
Your turn:
Dataset
library(arules)
data(Groceries)
class(Groceries)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
head(Groceries)
## transactions in sparse format with
## 6 transactions (rows) and
## 169 items (columns)
dim(Groceries)
## [1] 9835 169
To view the first six transactions:
inspect(head(Groceries))
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups}
## [2] {tropical fruit,
## yogurt,
## coffee}
## [3] {whole milk}
## [4] {pip fruit,
## yogurt,
## cream cheese ,
## meat spreads}
## [5] {other vegetables,
## whole milk,
## condensed milk,
## long life bakery product}
## [6] {whole milk,
## butter,
## yogurt,
## rice,
## abrasive cleaner}
str(Groceries)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
## .. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
## .. .. ..@ Dim : int [1:2] 169 9835
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 169 obs. of 3 variables:
## .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
## .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
## .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
## ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
Item components
Groceries@itemInfo$labels[1:20]
## [1] "frankfurter" "sausage" "liver loaf"
## [4] "ham" "meat" "finished products"
## [7] "organic sausage" "chicken" "turkey"
## [10] "pork" "beef" "hamburger meat"
## [13] "fish" "citrus fruit" "tropical fruit"
## [16] "pip fruit" "grapes" "berries"
## [19] "nuts/prunes" "root vegetables"
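A possible sketch for mining rules from Groceries; the support and confidence thresholds below are illustrative choices, not from the notes:

```r
library(arules)
data(Groceries)

# Illustrative thresholds: itemsets in at least 1% of baskets,
# rules at least 50% confident
groc.rules <- apriori(Groceries,
                      parameter = list(supp = 0.01, conf = 0.5,
                                       target = "rules"))

# Show the strongest rules by lift
inspect(head(sort(groc.rules, by = "lift"), n = 3))
```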
Collaborative filtering: identifying relevant items for a specific user from a very large collection of items (“filtering”) by considering the preferences of many users (“collaboration”).
User-based collaborative filtering: “People like you”
Item-based collaborative filtering: find similar items rather than similar users (which items, among all the options, are most similar to those the person has already purchased).
Image credit and further reading: https://medium.com/@cfpinela/recommender-systems-user-based-and-item-based-collaborative-filtering-5d5f375a127f
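As a minimal sketch of the user-based idea on a reduced version of the ratings table below (the `user_sim` helper is hypothetical, not part of the notes):

```r
# Reduced ratings table: the three movies rated by both users 1 and 7
ID  <- c(1, 7)
M1  <- c(4, 3)
M28 <- c(3, 2)
M30 <- c(4, 4)
df2 <- data.frame(ID, M1, M28, M30)

# Similarity of two users: Pearson correlation over co-rated movies
user_sim <- function(ratings, i, j) {
  cor(unlist(ratings[ratings$ID == i, -1]),
      unlist(ratings[ratings$ID == j, -1]),
      use = "pairwise.complete.obs")
}

user_sim(df2, 1, 7)  # ~0.866: users 1 and 7 have similar tastes
```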
ID <- 1:10
M1 <- c(4, 4, 5, 3, 4, 3, 3, 4, 4, 3)
M5 <- c(1, NA, NA, NA, 5, NA, NA, NA, NA, NA)
M8 <- c(rep(NA, 3), 1, rep(NA, 6))
M17 <- c(rep(NA, 3), 4, rep(NA, 6))
M18 <- c(3, rep(NA, 9))
M28 <- c(3, NA, NA, 4, NA, NA, 2, NA, NA, NA)
M30 <- c(4, NA, NA, 5, NA, 4, 4, NA, 3, NA)
M44 <- c(5, rep(NA, 4), 4, rep(NA, 4))
M48 <- c(rep(NA, 6), 3, rep(NA, 3))
df <- data.frame(ID, M1, M5, M8, M17, M18, M28, M30, M44, M48)
df
## ID M1 M5 M8 M17 M18 M28 M30 M44 M48
## 1 1 4 1 NA NA 3 3 4 5 NA
## 2 2 4 NA NA NA NA NA NA NA NA
## 3 3 5 NA NA NA NA NA NA NA NA
## 4 4 3 NA 1 4 NA 4 5 NA NA
## 5 5 4 5 NA NA NA NA NA NA NA
## 6 6 3 NA NA NA NA NA 4 4 NA
## 7 7 3 NA NA NA NA 2 4 NA 3
## 8 8 4 NA NA NA NA NA NA NA NA
## 9 9 4 NA NA NA NA NA 3 NA NA
## 10 10 3 NA NA NA NA NA NA NA NA
summary(df)
## ID M1 M5 M8 M17 M18
## Min. : 1.00 Min. :3.0 Min. :1 Min. :1 Min. :4 Min. :3
## 1st Qu.: 3.25 1st Qu.:3.0 1st Qu.:2 1st Qu.:1 1st Qu.:4 1st Qu.:3
## Median : 5.50 Median :4.0 Median :3 Median :1 Median :4 Median :3
## Mean : 5.50 Mean :3.7 Mean :3 Mean :1 Mean :4 Mean :3
## 3rd Qu.: 7.75 3rd Qu.:4.0 3rd Qu.:4 3rd Qu.:1 3rd Qu.:4 3rd Qu.:3
## Max. :10.00 Max. :5.0 Max. :5 Max. :1 Max. :4 Max. :3
## NA's :8 NA's :9 NA's :9 NA's :9
## M28 M30 M44 M48
## Min. :2.0 Min. :3 Min. :4.00 Min. :3
## 1st Qu.:2.5 1st Qu.:4 1st Qu.:4.25 1st Qu.:3
## Median :3.0 Median :4 Median :4.50 Median :3
## Mean :3.0 Mean :4 Mean :4.50 Mean :3
## 3rd Qu.:3.5 3rd Qu.:4 3rd Qu.:4.75 3rd Qu.:3
## Max. :4.0 Max. :5 Max. :5.00 Max. :3
## NA's :7 NA's :5 NA's :8 NA's :9
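An analogous sketch for the item-based view: correlate movie columns rather than user rows (illustrative, using only movies M1 and M30 from the table above):

```r
# Ratings of M1 and M30 by the five users who rated both (users 1, 4, 6, 7, 9)
M1  <- c(4, 3, 3, 3, 4)
M30 <- c(4, 5, 4, 4, 3)

# Item-based similarity: Pearson correlation of the two rating columns
cor(M1, M30)  # ~-0.65: these two movies appeal to different raters
```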
The content is directly based on:
Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl, K. C., Jr. Data Mining for Business Analytics: Concepts, Techniques, and Applications in R.