Posted on Categories data science, Practical Data Science, Statistics, TutorialsTags

Bootstrap Evaluation of Clusters


NewImage
Illustration from Project Gutenberg

The goal of cluster analysis is to group the observations in the data into clusters such that every datum in a cluster is more similar to other datums in the same cluster than it is to datums in other clusters. This is an analysis method of choice when annotated training data is not readily available. In this article, based on chapter 8 of Practical Data Science with R, the authors discuss one approach to evaluating the clusters that are discovered by a chosen clustering method.

An important question when evaluating clusters is whether a given cluster is “real”-does the cluster represent actual structure in the data, or is it an artifact of the clustering algorithm? This is especially important with clustering algorithms like k-means, where the user has to specify the number of clusters a priori. It’s been our experience that clustering algorithms will often produce several clusters that represent actual structure or relationships in the data, and then one or two clusters that are buckets that represent “other” or “miscellaneous.” Clusters of “other” tend to be made up of data points that have no real relationship to each other; they just don’t fit anywhere else.

One way to assess whether a cluster represents true structure is to see if the cluster holds up under plausible variations in the dataset. The fpc package has a function called clusterboot()that uses bootstrap resampling to evaluate how stable a given cluster is. (For a full description of the algorithm, see Christian Henning, “Cluster-wise assessment of cluster stability,” Research Report 271, Dept. of Statistical Science, University College London, December 2006). clusterboot() is an integrated function that both performs the clustering and evaluates the final produced clusters. It has interfaces to a number of R clustering algorithms, including both hclust and kmeans.

clusterboot‘s algorithm uses the Jaccard coefficient, a similarity measure between sets. The Jaccard similarity between two sets A and B is the ratio of the number of elements in the intersection of A and B over the number of elements in the union of A and B. The basic general strategy is as follows:

  1. Cluster the data as usual.
  2. Draw a new dataset (of the same size as the original) by resampling the original dataset with replacement (meaning that some of the data points may show up more than once, and others not at all). Cluster the new dataset.
  3. For every cluster in the original clustering, find the most similar cluster in the new clustering (the one that gives the maximum Jaccard coefficient) and record that value. If this maximum Jaccard coefficient is less than 0.5, the original cluster is considered to be dissolved-it didn’t show up in the new clustering. A cluster that’s dissolved too often is probably not a “real” cluster.
  4. Repeat steps 2-3 several times.

The cluster stability of each cluster in the original clustering is the mean value of its Jaccard coefficient over all the bootstrap iterations. As a rule of thumb, clusters with a stability value less than 0.6 should be considered unstable. Values between 0.6 and 0.75 indicate that the cluster is measuring a pattern in the data, but there isn’t high certainty about which points should be clustered together. Clusters with stability values above about 0.85 can be considered highly stable (they’re likely to be real clusters).

Different clustering algorithms can give different stability values, even when the algorithms produce highly similar clusterings, so clusterboot() is also measuring how stable the clustering algorithm is.

The protein dataset

To demonstrate clusterboot(), we’ll use a small dataset from 1973 on protein consumption from nine different food groups in 25 countries in Europe. The original dataset can be found here. A tab-separated text file with the data can be found in this directory. The data file is called protein.txt; additional information can be found in the file protein_README.txt.

The goal is to group the countries based on patterns in their protein consumption. The dataset is loaded into R as a data frame called protein, as shown in the next listing.

protein <- read.table("protein.txt", sep="\t", header=TRUE)
summary(protein)
#           Country      RedMeat         WhiteMeat           Eggs
# Albania       : 1   Min.   : 4.400   Min.   : 1.400   Min.   :0.500
# Austria       : 1   1st Qu.: 7.800   1st Qu.: 4.900   1st Qu.:2.700
# Belgium       : 1   Median : 9.500   Median : 7.800   Median :2.900
# Bulgaria      : 1   Mean   : 9.828   Mean   : 7.896   Mean   :2.936
# Czechoslovakia: 1   3rd Qu.:10.600   3rd Qu.:10.800   3rd Qu.:3.700
# Denmark       : 1   Max.   :18.000   Max.   :14.000   Max.   :4.700
# (Other)       :19
#      Milk            Fish           Cereals          Starch
# Min.   : 4.90   Min.   : 0.200   Min.   :18.60   Min.   :0.600
# 1st Qu.:11.10   1st Qu.: 2.100   1st Qu.:24.30   1st Qu.:3.100
# Median :17.60   Median : 3.400   Median :28.00   Median :4.700
# Mean   :17.11   Mean   : 4.284   Mean   :32.25   Mean   :4.276
# 3rd Qu.:23.30   3rd Qu.: 5.800   3rd Qu.:40.10   3rd Qu.:5.700
# Max.   :33.70   Max.   :14.200   Max.   :56.70   Max.   :6.500
#
#      Nuts           Fr.Veg
# Min.   :0.700   Min.   :1.400
# 1st Qu.:1.500   1st Qu.:2.900
# Median :2.400   Median :3.800
# Mean   :3.072   Mean   :4.136
# 3rd Qu.:4.700   3rd Qu.:4.900
# Max.   :7.800   Max.   :7.900

#   Use all the columns except the first 
#   (Country). 
vars.to.use <- colnames(protein)[-1]         

# Scale the data columns to be zero mean 
# and unit variance.
# The output of scale() is a matricx.
pmatrix <- scale(protein[,vars.to.use])      

# optionally, store the centers and 
# standard deviations of the original data,
# so you can "unscale" it later.
pcenter <- attr(pmatrix, "scaled:center")  
pscale <- attr(pmatrix, "scaled:scale")

Before running clusterboot() we’ll cluster the data using a hierarchical clustering algorithm (Ward’s method):

#   Create the distance matrix.
d <- dist(pmatrix, method="euclidean") 
    
#   Do the clustering. 
pfit <- hclust(d, method="ward")   

#   Plot the dendrogram.
plot(pfit, labels=protein$Country)     

The dendrogram suggests five clusters, as shown in the figure below. You can draw the rectangles on the dendrogram using the command rect.hclust(pfit, k=5).

Dendrogram annotated

Let’s extract and print the clusters:

#   A convenience function for printing out the 
#   countries in each cluster, along with the values 
#   for red meat, fish, and fruit/vegetable 
#   consumption. 
print_clusters <- function(labels, k) {             
  for(i in 1:k) {
    print(paste("cluster", i))
    print(protein[labels==i,c("Country","RedMeat","Fish","Fr.Veg")])
  }
}

# get the cluster labels
groups <- cutree(pfit, k=5)

# --- results -- 

> print_clusters(groups, 5)
[1] "cluster 1"
      Country RedMeat Fish Fr.Veg
1     Albania    10.1  0.2    1.7
4    Bulgaria     7.8  1.2    4.2
18    Romania     6.2  1.0    2.8
25 Yugoslavia     4.4  0.6    3.2
[1] "cluster 2"
       Country RedMeat Fish Fr.Veg
2      Austria     8.9  2.1    4.3
3      Belgium    13.5  4.5    4.0
9       France    18.0  5.7    6.5
12     Ireland    13.9  2.2    2.9
14 Netherlands     9.5  2.5    3.7
21 Switzerland    13.1  2.3    4.9
22          UK    17.4  4.3    3.3
24   W Germany    11.4  3.4    3.8
[1] "cluster 3"
          Country RedMeat Fish Fr.Veg
5  Czechoslovakia     9.7  2.0    4.0
7       E Germany     8.4  5.4    3.6
11        Hungary     5.3  0.3    4.2
16         Poland     6.9  3.0    6.6
23           USSR     9.3  3.0    2.9
[1] "cluster 4"
   Country RedMeat Fish Fr.Veg
6  Denmark    10.6  9.9    2.4
8  Finland     9.5  5.8    1.4
15  Norway     9.4  9.7    2.7
20  Sweden     9.9  7.5    2.0
[1] "cluster 5"
    Country RedMeat Fish Fr.Veg
10   Greece    10.2  5.9    6.5
13    Italy     9.0  3.4    6.7
17 Portugal     6.2 14.2    7.9
19    Spain     7.1  7.0    7.2

There’s a certain logic to these clusters: the countries in each cluster tend to be in the same geographical region. It makes sense that countries in the same region would have similar dietary habits. You can also see that

  • Cluster 2 is made of countries with higher-than-average red meat consumption.
  • Cluster 4 contains countries with higher-than-average fish consumption but low produce consumption.
  • Cluster 5 contains countries with high fish and produce consumption.

Let’s run clusterboot() on the protein data, using hierarchical clustering with five clusters.

# load the fpc package
library(fpc)   

# set the desired number of clusters                               
kbest.p<-5       
                                                
#   Run clusterboot() with hclust 
#   ('clustermethod=hclustCBI') using Ward's method 
#   ('method="ward"') and kbest.p clusters 
#   ('k=kbest.p'). Return the results in an object 
#   called cboot.hclust.
cboot.hclust <- clusterboot(pmatrix,clustermethod=hclustCBI,
                           method="ward", k=kbest.p)

#   The results of the clustering are in 
#   cboot.hclust$result. The output of the hclust() 
#   function is in cboot.hclust$result$result. 
#
#   cboot.hclust$result$partition returns a 
#   vector of clusterlabels. 
groups<-cboot.hclust$result$partition  

# -- results --

> print_clusters(groups, kbest.p)                           
[1] "cluster 1"
      Country RedMeat Fish Fr.Veg
1     Albania    10.1  0.2    1.7
4    Bulgaria     7.8  1.2    4.2
18    Romania     6.2  1.0    2.8
25 Yugoslavia     4.4  0.6    3.2
[1] "cluster 2"
       Country RedMeat Fish Fr.Veg
2      Austria     8.9  2.1    4.3
3      Belgium    13.5  4.5    4.0
9       France    18.0  5.7    6.5
12     Ireland    13.9  2.2    2.9
14 Netherlands     9.5  2.5    3.7
21 Switzerland    13.1  2.3    4.9
22          UK    17.4  4.3    3.3
24   W Germany    11.4  3.4    3.8
[1] "cluster 3"
          Country RedMeat Fish Fr.Veg
5  Czechoslovakia     9.7  2.0    4.0
7       E Germany     8.4  5.4    3.6
11        Hungary     5.3  0.3    4.2
16         Poland     6.9  3.0    6.6
23           USSR     9.3  3.0    2.9
[1] "cluster 4"
   Country RedMeat Fish Fr.Veg
6  Denmark    10.6  9.9    2.4
8  Finland     9.5  5.8    1.4
15  Norway     9.4  9.7    2.7
20  Sweden     9.9  7.5    2.0
[1] "cluster 5"
    Country RedMeat Fish Fr.Veg
10   Greece    10.2  5.9    6.5
13    Italy     9.0  3.4    6.7
17 Portugal     6.2 14.2    7.9
19    Spain     7.1  7.0    7.2

# The vector of cluster stabilities. 
# Values close to 1 indicate stable clusters
> cboot.hclust$bootmean                                   
[1] 0.7905000 0.7990913 0.6173056 0.9312857 0.7560000

# The count of how many times each cluster was 
# dissolved. By default clusterboot() runs 100 
# bootstrap iterations. 
# Clusters that are dissolved often are unstable. 
> cboot.hclust$bootbrd                                    
[1] 25 11 47  8 35

The above results show that the cluster of countries with high fish consumption (cluster 4) is highly stable (cluster stability = 0.93). Clusters 1 and 2 are also quite stable; cluster 5 (cluster stability 0.76) less so. Cluster 3 (cluster stability 0.62) has the characteristics of what we’ve been calling the “other” cluster. Note that clusterboot() has a random component, so the exact stability values and number of times each cluster is dissolved will vary from run to run.

Based on these results, we can say that the countries in cluster 4 have highly similar eating habits, distinct from those of the other countries (high fish and red meat consumption, with a relatively low amount of fruits and vegetables); we can also say that the countries in clusters 1 and 2 represent distinct eating patterns as well. The countries in cluster 3, on the other hand, show eating patterns that are different from those of the countries in other clusters, but aren’t as strongly similar to each other.

The clusterboot() algorithm assumes that you have the number of clusters, k. Obviously, determining k will be harder for datasets that are larger than our protein example. There are ways of estimating k, but they are beyond the scope of this article. Once you have an idea of the number of clusters, however, clusterboot() is a useful tool for evaluating the strength of the patterns that you have discovered.

For more about clustering, please refer to our free sample chapter 8 of Practical Data Science with R.

7 thoughts on “Bootstrap Evaluation of Clusters”

  1. A very important topic that you touched! Would it make sense to not only bootstrap the samples within one cluster, but to resample by allocating samples into other clusters? This way, would the Jaccard index indicate the stability of the whole tree and not only clusters?

    Cheers,
    Andrej

    1. Thanks for your comment, Andrej. The entire data set is bootstrap resampled, and that resample is reclustered, with the same number of clusters k as the original clustering. Then each of the original clusters is matched to the “most similar” cluster in the new clustering. So while the Jaccard index only measures the stability of a single cluster (from the original clustering), the set of Jaccard indices tells us about the stability of the entire clustering procedure on that data set.

  2. Hi Nina,
    I think you would be interested having a look at the dendextend R package:
    https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html

    Specifically look at the visualization for pvclust:
    https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#pvclust
    Which might be used also for what you have been doing.

    The package also has a short 2 page (open access) paper in bioinformatics:
    http://bioinformatics.oxfordjournals.org/content/early/2015/08/18/bioinformatics.btv428

    With regards,
    Tal

    1. Thanks, Tal! That looks like a really useful package (Looks like it wasn’t around when we wrote Practical Data Science with R). We will definitely check it out.

  3. Very clear presentation, thanks. May I ask is there a way to find which member of a group is the most stable? I mean a member that can represent the whole group? Or ranking of members of a group ?

    1. Thanks for your question!

      For a given cluster O in the original clustering, the Jaccard coefficients tell you for each re-clustering which new cluster is most similar to O. Right now, clusterboot only uses the value of that highest Jaccard coefficient to make the match (call the best new cluster N). But you could also look at how many times each member of O appears in N. This would give you a ranking of the “representativeness” of each member of O — if a certain member of O always appears in N, then you could consider it representative of O.

Comments are closed.