Cluster Analysis

Measuring Distance Between Two Records

Properties of distance measures

  1. Non-negative: \(d_{ij} \geq 0\)

  2. Self-proximity: \(d_{ii} = 0\) (distance to itself is zero)

  3. Symmetry: \(d_{ij} = d_{ji}\)

  4. Triangle inequality: \(d_{ij} \leq d_{ik} + d_{kj}\)
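These four properties can be checked numerically, e.g. for Euclidean distance; a quick base-R check with three made-up records:

```r
# Quick numeric check of the four distance properties for Euclidean
# distance, using three made-up 2-D records
d <- function(a, b) sqrt(sum((a - b)^2))
x <- c(1, 2); y <- c(4, 6); z <- c(0, 5)

stopifnot(d(x, y) >= 0,                    # non-negativity
          d(x, x) == 0,                    # self-proximity
          d(x, y) == d(y, x),              # symmetry
          d(x, z) <= d(x, y) + d(y, z))    # triangle inequality
```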

Notation

\(d_{ij}\) - distance metric / dissimilarity measure between records \(i\) and \(j\)

For record \(i\) we have \(p\) measurements \((x_{i1}, x_{i2}, ..., x_{ip})\), and for record \(j\) we have \(p\) measurements \((x_{j1}, x_{j2}, ..., x_{jp})\)

Measures for numerical data

  1. Euclidean Distance

    • Scale dependent

    • Changing the units of one variable can have a huge influence on the results

\[d_{ij} = \sqrt{ (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + ...+ (x_{ip} - x_{jp})^2}\]

  2. Correlation-based similarity

\[r_{ij} = \frac{\sum_{m=1}^p (x_{im} - \bar{x}_i)(x_{jm} - \bar{x}_j)}{\sqrt{\sum_{m=1}^p (x_{im} - \bar{x}_i)^2 \sum_{m=1}^p (x_{jm} - \bar{x}_j)^2}}\]

\[d_{ij} = 1-r^2_{ij}\]
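dist() does not offer a correlation-based method, but the measure is easy to compute directly; a sketch with two made-up records:

```r
# Correlation-based distance between two records (values are illustrative)
x_i <- c(12, 14, 4, 6)
x_j <- c(5, 4, 6, 3)

r_ij <- cor(x_i, x_j)   # Pearson correlation across the p measurements
d_ij <- 1 - r_ij^2      # convert the similarity into a distance
d_ij
```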

  3. Manhattan distance (city block)

\[d_{ij} = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}|+...+|x_{ip}-x_{jp}|\]

  4. Maximum coordinate distance: based on the measurement on which records \(i\) and \(j\) deviate most.

\[d_{ij} = \max_{m=1, 2, ..., p}|x_{im} - x_{jm}|\]

  5. Mahalanobis distance

\[d_{ij} = \sqrt{(\mathbf{x_i}-\mathbf{x_j})'S^{-1}(\mathbf{x_i}-\mathbf{x_j})}\]

\(\mathbf{x_i}\) and \(\mathbf{x_j}\) are \(p\)-dimensional vectors of the measurement values for records \(i\) and \(j\), respectively; \(S\) is the covariance matrix for these vectors.
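A sketch of the computation, using USArrests (the data set used later in these notes) as illustrative data; note that base R's mahalanobis() returns the squared distance from points to a single center, so the pairwise version is written out:

```r
# Mahalanobis distance between two records; S is estimated from the whole
# data set (USArrests is used purely as illustrative data)
data(USArrests)
S   <- cov(USArrests)                      # p x p sample covariance matrix
x_i <- as.numeric(USArrests["Alabama", ])
x_j <- as.numeric(USArrests["Alaska", ])
d_ij <- sqrt(as.numeric(t(x_i - x_j) %*% solve(S) %*% (x_i - x_j)))
d_ij
```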

R Code

Example data

#define four vectors
a <- c(12, 14, 4, 6)
b <- c(5, 4, 6, 3)
c <- c(9, 6, 9, 7)
d <- c(10, 12, 3, 13)
mat <- rbind(a, b, c, d)
mat
##   [,1] [,2] [,3] [,4]
## a   12   14    4    6
## b    5    4    6    3
## c    9    6    9    7
## d   10   12    3   13
dist(mat, method="euclidean")
##           a         b         c
## b 12.727922                    
## c  9.949874  6.708204          
## d  7.615773 14.071247 10.440307
dist(mat, method="manhattan")
##    a  b  c
## b 22      
## c 17 13   
## d 12 26 19
dist(mat, method="maximum")
##    a  b  c
## b 10      
## c  8  4   
## d  7 10  6

Distance measures for categorical data

For two binary records \(i\) and \(j\), cross-tabulate their values over the \(n\) variables:

                   record j = 0   record j = 1   Total
    record i = 0        a              b          a+b
    record i = 1        c              d          c+d
    Total              a+c            b+d          n

Here \(a\) counts variables on which both records are 0, \(d\) counts variables on which both are 1, and \(b\) and \(c\) count the disagreements.
  1. Matching coefficient

\[\frac{a+d}{n}\]

  2. Jaccard’s coefficient

\[\frac{d}{(b+c+d)}\]
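Both coefficients are straightforward in base R; note that dist(..., method = "binary") returns 1 minus Jaccard’s coefficient as a dissimilarity. The two binary records below are made up:

```r
# Two made-up binary records measured on n = 6 variables
x <- c(1, 0, 1, 1, 0, 0)
y <- c(1, 1, 0, 1, 0, 1)

matching <- mean(x == y)              # (a + d) / n
jaccard  <- sum(x & y) / sum(x | y)   # d / (b + c + d)

# dist(..., method = "binary") returns 1 - Jaccard as a dissimilarity
dist(rbind(x, y), method = "binary")
```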

Distance measures for mixed data

  • Gower’s similarity measure
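cluster::daisy() implements Gower’s measure; a minimal sketch on a made-up mixed data frame:

```r
library(cluster)

# Made-up mixed data: one numeric and one categorical variable
mixed <- data.frame(age    = c(25, 40, 33),
                    region = factor(c("North", "South", "North")))

# daisy() with metric = "gower" handles mixed variable types; numeric
# contributions are |x - y| / range, categorical contributions are 0/1
daisy(mixed, metric = "gower")
```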

Measuring Distance Between Two Clusters

Consider cluster \(A\), which includes the \(m\) records \(A_1, A_2, ..., A_m\), and cluster \(B\), which includes the \(n\) records \(B_1, B_2, ..., B_n\).

  1. Minimum distance

\[\min(distance(A_i, B_j)), \text{ } i= 1, 2, ...m; \text{ } j=1, 2, ...n\]

  2. Maximum distance

\[\max(distance(A_i, B_j)), \text{ } i= 1, 2, ...m; \text{ } j=1, 2, ...n\]

  3. Average distance

\[\text{average}(distance(A_i, B_j)), \text{ } i= 1, 2, ...m; \text{ } j=1, 2, ...n\]

  4. Centroid distance

\[distance(\bar{x}_A, \bar{x}_B)\]
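A base-R sketch of the four between-cluster distances on two made-up 2-D clusters:

```r
# Two made-up clusters of 2-D records
A <- rbind(c(1, 2), c(2, 3), c(3, 3))
B <- rbind(c(8, 8), c(9, 10))

# All pairwise Euclidean distances between records of A and records of B
pairwise <- outer(seq_len(nrow(A)), seq_len(nrow(B)),
                  Vectorize(function(i, j) sqrt(sum((A[i, ] - B[j, ])^2))))

min(pairwise)                              # minimum (single-linkage) distance
max(pairwise)                              # maximum (complete-linkage) distance
mean(pairwise)                             # average distance
sqrt(sum((colMeans(A) - colMeans(B))^2))   # centroid distance
```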

Hierarchical (Agglomerative) Clustering

  1. Start with each cluster comprising exactly one record (number of observations = number of clusters)

  2. At every step, the two clusters with the smallest distance are merged.

The algorithm repeatedly merges pairs of clusters until every record is included in the hierarchy (until only one cluster remains, or a specified termination condition is satisfied).

Linkage criteria for merging clusters

  1. Single linkage: minimum distance

  2. Complete linkage: maximum distance

  3. Average linkage: average distance (between all pairs of records)

  4. Centroid linkage: centroid distance (clusters are represented by their mean values for each variable, which forms a vector of means)

  5. Ward’s method: considers the “loss of information” (the increase in the within-cluster sum of squares) that occurs when records are clustered together.

Further reading and source: https://livebook.manning.com/book/machine-learning-for-mortals-mere-and-otherwise/chapter-17/v-7/65

Source: https://stats.stackexchange.com/questions/426760/should-we-most-of-the-time-use-wards-method-for-hierarchical-clustering

Dendrogram

Tree-like diagram that summarizes the process of clustering.

Example

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6.9000     ✔ purrr   0.3.4     
## ✔ tibble  3.1.7          ✔ dplyr   1.0.9     
## ✔ tidyr   1.2.0          ✔ stringr 1.4.0     
## ✔ readr   2.1.2          ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
data('USArrests') 
USArrests
##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Alaska           10.0     263       48 44.5
## Arizona           8.1     294       80 31.0
## Arkansas          8.8     190       50 19.5
## California        9.0     276       91 40.6
## Colorado          7.9     204       78 38.7
## Connecticut       3.3     110       77 11.1
## Delaware          5.9     238       72 15.8
## Florida          15.4     335       80 31.9
## Georgia          17.4     211       60 25.8
## Hawaii            5.3      46       83 20.2
## Idaho             2.6     120       54 14.2
## Illinois         10.4     249       83 24.0
## Indiana           7.2     113       65 21.0
## Iowa              2.2      56       57 11.3
## Kansas            6.0     115       66 18.0
## Kentucky          9.7     109       52 16.3
## Louisiana        15.4     249       66 22.2
## Maine             2.1      83       51  7.8
## Maryland         11.3     300       67 27.8
## Massachusetts     4.4     149       85 16.3
## Michigan         12.1     255       74 35.1
## Minnesota         2.7      72       66 14.9
## Mississippi      16.1     259       44 17.1
## Missouri          9.0     178       70 28.2
## Montana           6.0     109       53 16.4
## Nebraska          4.3     102       62 16.5
## Nevada           12.2     252       81 46.0
## New Hampshire     2.1      57       56  9.5
## New Jersey        7.4     159       89 18.8
## New Mexico       11.4     285       70 32.1
## New York         11.1     254       86 26.1
## North Carolina   13.0     337       45 16.1
## North Dakota      0.8      45       44  7.3
## Ohio              7.3     120       75 21.4
## Oklahoma          6.6     151       68 20.0
## Oregon            4.9     159       67 29.3
## Pennsylvania      6.3     106       72 14.9
## Rhode Island      3.4     174       87  8.3
## South Carolina   14.4     279       48 22.5
## South Dakota      3.8      86       45 12.8
## Tennessee        13.2     188       59 26.9
## Texas            12.7     201       80 25.5
## Utah              3.2     120       80 22.9
## Vermont           2.2      48       32 11.2
## Virginia          8.5     156       63 20.7
## Washington        4.0     145       73 26.2
## West Virginia     5.7      81       39  9.3
## Wisconsin         2.6      53       66 10.8
## Wyoming           6.8     161       60 15.6
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
# Min-max normalize each variable to [0, 1]:
df <- USArrests %>% mutate_all(function(x) {(x - min(x)) / (max(x) - min(x))})

# Set rowname: 
row.names(df) <- row.names(USArrests)

# Compute distances: 
dd <- dist(df, method = "euclidean")

# Visualize the dissimilarity: 
fviz_dist(dd, lab_size = 7)

# Perform hierarchical clustering: 
hc1 <- hclust(dd, method = "single")
hc1
## 
## Call:
## hclust(d = dd, method = "single")
## 
## Cluster method   : single 
## Distance         : euclidean 
## Number of objects: 50
plot(hc1, hang=-1, ann=FALSE)

# Create a function of dendrogram:

dend_func <- function(x) {
  fviz_dend(x, 
            k = 4,   
            cex = 0.5, 
            rect = TRUE, 
            rect_fill = TRUE, 
            horiz = FALSE, 
            palette = "jco", 
            rect_border = "jco", 
            color_labels_by_k = TRUE)
}
dend_func(hc1) -> basic_plot
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
basic_plot + theme_gray() + 
         theme(plot.margin = unit(rep(0.7, 4), "cm")) +  
  labs(title = "Hierarchical Clustering with Single Linkage Method")

clust_func<-(function(x){fviz_cluster(list(data = df, cluster = paste0("Group", x)), 
                         alpha = 1, 
                         colors = x, 
                         labelsize = 9, 
                         ellipse.type = "norm")})
sgroup<- cutree(hc1, k = 4)
USArrests$sgroup <- factor(sgroup)
head(USArrests)
##            Murder Assault UrbanPop Rape sgroup
## Alabama      13.2     236       58 21.2      1
## Alaska       10.0     263       48 44.5      2
## Arizona       8.1     294       80 31.0      1
## Arkansas      8.8     190       50 19.5      1
## California    9.0     276       91 40.6      1
## Colorado      7.9     204       78 38.7      1
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggplot(USArrests, aes(x=Murder, y=Assault, color = sgroup)) + geom_point()

ggplot(USArrests, aes(x=Murder, y=UrbanPop, color = sgroup)) + geom_point()

hc2 <- hclust(dd, method = "average")
plot(hc2, hang=-1, ann=FALSE)

dend_func(hc2) -> basic_plot2
basic_plot2 + theme_gray() + 
         theme(plot.margin = unit(rep(0.7, 4), "cm")) +  
  labs(title = "Hierarchical Clustering with Average Linkage Method")

hc3 <- hclust(dd, method = "ward.D2")
plot(hc3, hang=-1, ann=FALSE)

dend_func(hc3) -> basic_plot3
basic_plot3 + theme_gray() + 
         theme(plot.margin = unit(rep(0.7, 4), "cm")) +  
  labs(title = "Hierarchical Clustering with Ward Linkage Method")

Validation of clusters

  1. Cluster interpretability
sgroup3<- cutree(hc3, k = 4)
USArrests$sgroup3 <- factor(sgroup3)
head(USArrests)
##            Murder Assault UrbanPop Rape sgroup sgroup3
## Alabama      13.2     236       58 21.2      1       1
## Alaska       10.0     263       48 44.5      2       2
## Arizona       8.1     294       80 31.0      1       2
## Arkansas      8.8     190       50 19.5      1       3
## California    9.0     276       91 40.6      1       2
## Colorado      7.9     204       78 38.7      1       2
library(GGally)
ggpairs(USArrests, aes(col=sgroup3))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  2. Cluster stability

  3. Cluster separation

  4. Number of clusters

k-means Algorithm:

Select \(k\), the number of clusters.

  1. Initialize the cluster centers with \(k\) arbitrarily selected records.

  2. Repeat until there is no change in the cluster centers:

     a. Partition by assigning or reassigning every record to its closest cluster center.

     b. Recompute each cluster center as the mean value of the records in that cluster.
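The loop can be sketched in a few lines of base R (illustrative only; in practice use stats::kmeans()). The two simulated groups below are made up:

```r
# Minimal k-means loop in base R (illustrative; stats::kmeans is the
# production implementation). Two well-separated 2-D groups are simulated.
set.seed(1)
X <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),   # records 1-10 near (0, 0)
           matrix(rnorm(20, mean = 5), ncol = 2))   # records 11-20 near (5, 5)
k <- 2

# Step 1: initialize centers with k arbitrarily chosen records
centers <- X[sample(nrow(X), k), , drop = FALSE]

repeat {
  # Step 2a: assign every record to its closest center
  cluster_id <- apply(X, 1, function(x)
    which.min(colSums((t(centers) - x)^2)))
  # Step 2b: recompute each center as the mean of its assigned records
  new_centers <- t(sapply(seq_len(k), function(g)
    colMeans(X[cluster_id == g, , drop = FALSE])))
  # Stop when the centers no longer change
  if (all(abs(new_centers - centers) < 1e-12)) break
  centers <- new_centers
}
table(cluster_id)
```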

Example

data(USArrests)
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(USArrests), method = "number",
         type = "lower")

USArrests <- scale(USArrests)
dim(USArrests)
## [1] 50  4
head(USArrests)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
dist.eucl <- dist(USArrests, method = "euclidean")
head(dist.eucl)
## [1] 2.703754 2.293520 1.289810 3.263110 2.651067 3.215297
fviz_dist(dist.eucl)

km.res <- kmeans(USArrests, 4, nstart = 20)
km.res
## K-means clustering with 4 clusters of sizes 16, 13, 13, 8
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 2  0.6950701  1.0394414  0.7226370  1.27693964
## 3 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 4  1.4118898  0.8743346 -0.8145211  0.01927104
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              4              2              2              4              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              1              1              2              4 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              1              3              2              1              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              1              3              4              3              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              1              2              3              4              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              2              3              1 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              4              3              1 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              1              1              1              1              4 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              4              2              1              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              1              1              3              3              1 
## 
## Within cluster sum of squares by cluster:
## [1] 16.212213 19.922437 11.952463  8.316061
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
df_member <- cbind(USArrests, cluster = km.res$cluster)
head(df_member)
##                Murder   Assault   UrbanPop         Rape cluster
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473       4
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941       2
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388       2
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602       4
## California 0.27826823 1.2628144  1.7589234  2.067820292       2
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207       2
fviz_cluster(km.res, data = USArrests,
             palette=c("red", "blue", "black", "darkgreen"),
             ellipse.type = "euclid",
             star.plot = T,
             repel = T,
             ggtheme = theme())

dist(km.res$centers)
##          1        2        3
## 2 2.411241                  
## 3 1.874055 3.887882         
## 4 2.684572 2.117941 3.246984

Demo: https://www.youtube.com/watch?v=5I3Ei69I40s

Methods to determine optimum number of clusters

  1. Elbow method

  2. Silhouette method

  3. Gap statistic
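For example, the average silhouette width can be computed over a range of \(k\) values with cluster::silhouette(), again using USArrests as illustrative data; the \(k\) with the largest value is preferred:

```r
library(cluster)   # silhouette(); ships with standard R installations

set.seed(123)
df <- scale(USArrests)
d  <- dist(df)

# Average silhouette width for k = 2, ..., 6; larger is better
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(df, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil
```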

Exercise:

Cluster districts according to dengue cases

library(mozzie)