1 Application: In-class demo

https://www.amazon.com/

2 Association rule

2.1 Data

Example 1: Purchases of Phone Plates

   Transaction     V1     V2     V3    V4
1            1    red  white  green  <NA>
2            2  white orange   <NA>  <NA>
3            3  white   blue   <NA>  <NA>
4            4    red  white orange  <NA>
5            5    red   blue   <NA>  <NA>
6            6  white   blue   <NA>  <NA>
7            7    red   blue   <NA>  <NA>
8            8    red  white   blue green
9            9    red  white   blue  <NA>
10          10 yellow   <NA>   <NA>  <NA>

Rule 1: IF {red} THEN {white}

  • antecedent/ IF

    eg: {red}

  • consequent/ THEN

    eg: {white}

Rule 2: IF {red, green} THEN {white}

  • antecedent/ IF

    eg: {red, green}

  • consequent/ THEN

    eg: {white}

2.2 Apriori algorithm

Main idea: Generate candidate rules

Step 1: Generate all the rules that would be candidates for indicating associations between items.
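The level-wise search behind this step can be sketched in base R as follows. This is a simplified illustration and not the implementation used by the arules package: it only extends itemsets that already meet a minimum support count and skips further pruning. The names `transactions`, `count_with`, `min_count`, and `all_frequent`, and the threshold of 2 transactions, are assumptions made for this example.

```r
# Transactions from Example 1, one character vector per purchase
transactions <- list(
  c("red", "white", "green"), c("white", "orange"), c("white", "blue"),
  c("red", "white", "orange"), c("red", "blue"), c("white", "blue"),
  c("red", "blue"), c("red", "white", "blue", "green"),
  c("red", "white", "blue"), c("yellow")
)

# Hypothetical helper: number of transactions containing every item in `itemset`
count_with <- function(itemset, trans) {
  sum(vapply(trans, function(tr) all(itemset %in% tr), logical(1)))
}

min_count <- 2  # assumed threshold: an itemset must appear in at least 2 transactions

# Level 1: frequent single items
items <- sort(unique(unlist(transactions)))
frequent <- Filter(function(s) count_with(s, transactions) >= min_count,
                   as.list(items))
all_frequent <- frequent

# Level k + 1: extend each frequent itemset by one more item, then filter again
while (length(frequent) > 0) {
  frequent_items <- sort(unique(unlist(frequent)))
  candidates <- list()
  for (s in frequent) {
    for (i in setdiff(frequent_items, s)) {
      candidates[[length(candidates) + 1]] <- sort(c(s, i))
    }
  }
  candidates <- unique(candidates)  # drop duplicates such as {a, b} and {b, a}
  frequent <- Filter(function(s) count_with(s, transactions) >= min_count,
                     candidates)
  all_frequent <- c(all_frequent, frequent)
}

length(all_frequent)  # number of frequent itemsets found
```

Candidate rules are then formed by splitting each frequent itemset into an antecedent and a consequent.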

Selecting strong rules

From the many possible rules, the goal is to find only those that indicate a strong dependence between the antecedent and consequent itemsets.

We use 3 measures to assess the strength of association implied by a rule:

  1. Support

  2. Confidence

  3. Lift ratio

2.3 Support of itemset X

Support indicates the popularity of the itemset X.

\[Support(X) = \frac{\text{Number of transactions in which X appears}}{\text{Total number of transactions}}\]

Example:

\[Support(\{red, white\}) = \frac{4}{10} \times 100 \% = 40\%\]

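For a rule, the support is the support of the itemset that contains both sides of the rule:

\[Support(X \rightarrow Y) = \hat{P}(\text{antecedent AND consequent})\]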
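The support formula can be checked in base R with a small sketch like the one below. The helper name `support` and the list `transactions` are assumptions made for illustration; they are not part of the arules package.

```r
# Transactions from Example 1, one character vector per purchase
transactions <- list(
  c("red", "white", "green"), c("white", "orange"), c("white", "blue"),
  c("red", "white", "orange"), c("red", "blue"), c("white", "blue"),
  c("red", "blue"), c("red", "white", "blue", "green"),
  c("red", "white", "blue"), c("yellow")
)

# Hypothetical helper: share of transactions that contain every item in `itemset`
support <- function(itemset, trans) {
  contains <- vapply(trans, function(tr) all(itemset %in% tr), logical(1))
  mean(contains)
}

support(c("red", "white"), transactions)  # 4 of the 10 transactions
support("yellow", transactions)           # 1 of the 10 transactions
```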

Exercise: Compute support for all item sets.

2.4 Confidence

Confidence expresses the degree of uncertainty about the IF-THEN rule.

\[Confidence (X \rightarrow Y) = \frac{\text{Number of transactions with both antecedent and consequent itemsets}}{\text{Number of transactions with antecedent itemset}}\]

\[Confidence = \hat{P}(consequent|antecedent)\]

Example:

\[\{red, white\} \rightarrow \{green\}\]

\[\frac{\text{support of } \{red, white, green\}}{\text{support of } \{red, white\}} = \frac{2}{4} \times 100 \% = 50\% \]
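The same calculation can be sketched in base R. The helper names `count_with` and `confidence` are assumptions made for this example; note that the result is undefined (`NaN`) if the antecedent never occurs.

```r
# Transactions from Example 1, one character vector per purchase
transactions <- list(
  c("red", "white", "green"), c("white", "orange"), c("white", "blue"),
  c("red", "white", "orange"), c("red", "blue"), c("white", "blue"),
  c("red", "blue"), c("red", "white", "blue", "green"),
  c("red", "white", "blue"), c("yellow")
)

# Hypothetical helper: number of transactions containing every item in `itemset`
count_with <- function(itemset, trans) {
  sum(vapply(trans, function(tr) all(itemset %in% tr), logical(1)))
}

# Confidence of the rule antecedent -> consequent:
# transactions with both sides divided by transactions with the antecedent
confidence <- function(antecedent, consequent, trans) {
  count_with(c(antecedent, consequent), trans) / count_with(antecedent, trans)
}

confidence(c("red", "white"), "green", transactions)  # 2 / 4
```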

Exercise: Compute confidence for the following rules.

  1. \[\{green\} \rightarrow \{red\}\]

  2. \[\{white, green\} \rightarrow \{red\}\]

2.5 Lift ratio

\[\text{Benchmark confidence} = \frac{\text{No. transactions with consequent itemset}}{\text{No. of transactions in database}}\]

\[\text{Lift ratio} = \frac{\text{confidence}}{\text{benchmark confidence}}\]
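As a worked example under the definitions above, take the rule {red, white} → {green} from Example 1. Green appears in 2 of the 10 transactions, and the confidence of the rule was computed as 50% in the previous section, so:

\[\text{Benchmark confidence} = \frac{2}{10} = 0.2\]

\[\text{Lift ratio} = \frac{0.5}{0.2} = 2.5\]

A lift ratio above 1 suggests that the antecedent and consequent occur together more often than they would if they were independent. Here, green is 2.5 times as likely among transactions containing red and white as in the database overall.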

3 Summary

Image credit: https://wiki.smu.edu.sg/1718t3isss608/Group17_Proposal

4 R code

  1. Dataset and package
##    Transaction     V1     V2     V3    V4
## 1            1    red  white  green  <NA>
## 2            2  white orange   <NA>  <NA>
## 3            3  white   blue   <NA>  <NA>
## 4            4    red  white orange  <NA>
## 5            5    red   blue   <NA>  <NA>
## 6            6  white   blue   <NA>  <NA>
## 7            7    red   blue   <NA>  <NA>
## 8            8    red  white   blue green
## 9            9    red  white   blue  <NA>
## 10          10 yellow   <NA>   <NA>  <NA>
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
  2. Binary matrix
Red <- c(1, 0, 0, 1, 1, 0, 1, 1, 1, 0)
White <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)
Blue <- c(0, 0, 1, 0, 1, 1, 1, 1, 1, 0)
Orange <- c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0)
Green <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0)
Yellow <- c(rep(0, 9), 1)
mat <- matrix(c(Red, White, Blue,
                Orange, Green, Yellow), nrow=10)
mat
##       [,1] [,2] [,3] [,4] [,5] [,6]
##  [1,]    1    1    0    0    1    0
##  [2,]    0    1    0    1    0    0
##  [3,]    0    1    1    0    0    0
##  [4,]    1    1    0    1    0    0
##  [5,]    1    0    1    0    0    0
##  [6,]    0    1    1    0    0    0
##  [7,]    1    0    1    0    0    0
##  [8,]    1    1    1    0    1    0
##  [9,]    1    1    1    0    0    0
## [10,]    0    0    0    0    0    1
# label the columns with the item names
colnames(mat) <- c("Red", "White", "Blue", "Orange", "Green", "Yellow")
  3. Convert binary index matrix into a transaction database
fp.trans <- as(mat, "transactions")
inspect(fp.trans)
##      items                    
## [1]  {Red, White, Green}      
## [2]  {White, Orange}          
## [3]  {White, Blue}            
## [4]  {Red, White, Orange}     
## [5]  {Red, Blue}              
## [6]  {White, Blue}            
## [7]  {Red, Blue}              
## [8]  {Red, White, Blue, Green}
## [9]  {Red, White, Blue}       
## [10] {Yellow}
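To see what this conversion does without the arules package, the binary matrix can be turned into a plain list of item names in base R. This is only a sketch for checking the result by hand; `item_list` is a name assumed for this example, and the output is an ordinary list, not a `transactions` object.

```r
# Rebuild the binary matrix; cbind() takes the column names from the vector names
Red    <- c(1, 0, 0, 1, 1, 0, 1, 1, 1, 0)
White  <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)
Blue   <- c(0, 0, 1, 0, 1, 1, 1, 1, 1, 0)
Orange <- c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0)
Green  <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0)
Yellow <- c(rep(0, 9), 1)
mat <- cbind(Red, White, Blue, Orange, Green, Yellow)

# For each row (transaction), keep the names of the columns whose entry is 1
item_list <- lapply(seq_len(nrow(mat)),
                    function(i) colnames(mat)[mat[i, ] == 1])

item_list[[8]]   # items in transaction 8
item_list[[10]]  # items in transaction 10
```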
  4. Get rules
rules <- apriori(fp.trans, parameter = list(supp=0.2, conf=0.5, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5     0.2      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 2 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [18 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# inspect the six rules with the highest lift
inspect(head(sort(rules, by="lift"), n=6))
##     lhs               rhs     support confidence coverage lift     count
## [1] {Red, White}   => {Green} 0.2     0.5        0.4      2.500000 2    
## [2] {Green}        => {Red}   0.2     1.0        0.2      1.666667 2    
## [3] {White, Green} => {Red}   0.2     1.0        0.2      1.666667 2    
## [4] {Orange}       => {White} 0.2     1.0        0.2      1.428571 2    
## [5] {Green}        => {White} 0.2     1.0        0.2      1.428571 2    
## [6] {Red, Green}   => {White} 0.2     1.0        0.2      1.428571 2
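The columns reported for rule [1] can be reproduced by hand from the binary vectors. The sketch below uses base R only; the variable names `lhs`, `rhs`, and `rule_metrics` are assumptions made for this example. Coverage is the support of the antecedent.

```r
# Binary indicator vectors for the three items in rule [1]
Red   <- c(1, 0, 0, 1, 1, 0, 1, 1, 1, 0)
White <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)
Green <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0)

lhs <- Red == 1 & White == 1  # transactions containing the antecedent {Red, White}
rhs <- Green == 1             # transactions containing the consequent {Green}

rule_metrics <- c(
  support    = mean(lhs & rhs),                       # both sides / all transactions
  confidence = sum(lhs & rhs) / sum(lhs),             # both sides / antecedent
  coverage   = mean(lhs),                             # antecedent / all transactions
  lift       = (sum(lhs & rhs) / sum(lhs)) / mean(rhs),  # confidence / benchmark
  count      = sum(lhs & rhs)                         # transactions with both sides
)
rule_metrics
```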

Interpretation:

Your turn:

Dataset

library(arules)
data(Groceries)
class(Groceries)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
head(Groceries)
## transactions in sparse format with
##  6 transactions (rows) and
##  169 items (columns)
dim(Groceries)
## [1] 9835  169

To view the first six transactions

inspect(head(Groceries))
##     items                     
## [1] {citrus fruit,            
##      semi-finished bread,     
##      margarine,               
##      ready soups}             
## [2] {tropical fruit,          
##      yogurt,                  
##      coffee}                  
## [3] {whole milk}              
## [4] {pip fruit,               
##      yogurt,                  
##      cream cheese ,           
##      meat spreads}            
## [5] {other vegetables,        
##      whole milk,              
##      condensed milk,          
##      long life bakery product}
## [6] {whole milk,              
##      butter,                  
##      yogurt,                  
##      rice,                    
##      abrasive cleaner}
str(Groceries)
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
##   .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
##   .. .. ..@ Dim     : int [1:2] 169 9835
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  169 obs. of  3 variables:
##   .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
##   .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
##   .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
##   ..@ itemsetInfo:'data.frame':  0 obs. of  0 variables

Item components

Groceries@itemInfo$labels[1:20]
##  [1] "frankfurter"       "sausage"           "liver loaf"       
##  [4] "ham"               "meat"              "finished products"
##  [7] "organic sausage"   "chicken"           "turkey"           
## [10] "pork"              "beef"              "hamburger meat"   
## [13] "fish"              "citrus fruit"      "tropical fruit"   
## [16] "pip fruit"         "grapes"            "berries"          
## [19] "nuts/prunes"       "root vegetables"

5 Collaborative filtering

Image credit and further reading: https://medium.com/@cfpinela/recommender-systems-user-based-and-item-based-collaborative-filtering-5d5f375a127f

5.1 Dataset: Movie ratings

ID <- 1:10
M1 <- c(4, 4, 5, 3, 4, 3, 3, 4, 4, 3)
M5 <- c(1, NA, NA, NA, 5, NA, NA, NA, NA, NA)
M8 <- c(rep(NA, 3), 1, rep(NA, 6))
M17 <- c(rep(NA, 3), 4, rep(NA, 6))
M18 <- c(3, rep(NA, 9))
M28 <- c(3, NA, NA, 4, NA, NA, 2, NA, NA, NA)
M30 <- c(4, NA, NA, 5, NA, 4, 4, NA, 3, NA)
M44 <- c(5, rep(NA, 4), 4, rep(NA, 4))
M48 <- c(rep(NA, 6), 3, rep(NA, 3))
df <- data.frame(ID, M1, M5, M8, M17, M18, M28, M30, M44, M48)
df
##    ID M1 M5 M8 M17 M18 M28 M30 M44 M48
## 1   1  4  1 NA  NA   3   3   4   5  NA
## 2   2  4 NA NA  NA  NA  NA  NA  NA  NA
## 3   3  5 NA NA  NA  NA  NA  NA  NA  NA
## 4   4  3 NA  1   4  NA   4   5  NA  NA
## 5   5  4  5 NA  NA  NA  NA  NA  NA  NA
## 6   6  3 NA NA  NA  NA  NA   4   4  NA
## 7   7  3 NA NA  NA  NA   2   4  NA   3
## 8   8  4 NA NA  NA  NA  NA  NA  NA  NA
## 9   9  4 NA NA  NA  NA  NA   3  NA  NA
## 10 10  3 NA NA  NA  NA  NA  NA  NA  NA
# summary() reports NA counts per column; no na.rm argument is needed
summary(df)
##        ID              M1            M5          M8         M17         M18   
##  Min.   : 1.00   Min.   :3.0   Min.   :1   Min.   :1   Min.   :4   Min.   :3  
##  1st Qu.: 3.25   1st Qu.:3.0   1st Qu.:2   1st Qu.:1   1st Qu.:4   1st Qu.:3  
##  Median : 5.50   Median :4.0   Median :3   Median :1   Median :4   Median :3  
##  Mean   : 5.50   Mean   :3.7   Mean   :3   Mean   :1   Mean   :4   Mean   :3  
##  3rd Qu.: 7.75   3rd Qu.:4.0   3rd Qu.:4   3rd Qu.:1   3rd Qu.:4   3rd Qu.:3  
##  Max.   :10.00   Max.   :5.0   Max.   :5   Max.   :1   Max.   :4   Max.   :3  
##                                NA's   :8   NA's   :9   NA's   :9   NA's   :9  
##       M28           M30         M44            M48   
##  Min.   :2.0   Min.   :3   Min.   :4.00   Min.   :3  
##  1st Qu.:2.5   1st Qu.:4   1st Qu.:4.25   1st Qu.:3  
##  Median :3.0   Median :4   Median :4.50   Median :3  
##  Mean   :3.0   Mean   :4   Mean   :4.50   Mean   :3  
##  3rd Qu.:3.5   3rd Qu.:4   3rd Qu.:4.75   3rd Qu.:3  
##  Max.   :4.0   Max.   :5   Max.   :5.00   Max.   :3  
##  NA's   :7     NA's   :5   NA's   :8      NA's   :9

6 Acknowledgement

The content is directly based on

Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, by Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl Jr.