Clustering Methods using K-Means, Hierarchical Clustering and DBSCAN (in Python)

Nuzulul Khairu Nissa
15 min read · Dec 9, 2020


Identifying customers based on their choices is an important strategy in any organization. This identification can help in approaching customers with specific offers. An organization with a large number of customers may find it difficult to identify and keep a record of each one. Extracting insights from the large amount of information collected on customers involves heavy data processing and automated techniques. Clustering methods can help identify customers based on their key characteristics.

Clustering is a set of techniques used to partition data into groups, or clusters. Clusters are loosely defined as groups of data objects that are more similar to other objects in their cluster than to data objects in other clusters. Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. This is an unsupervised problem because we are trying to discover structure (in this case, distinct clusters) on the basis of a dataset.

There are three basic types of clustering algorithms: partitional, hierarchical, and density-based algorithms.

  • Partitional Clustering: divides data objects into non-overlapping groups. In other words, no object can be a member of more than one cluster, and every cluster must have at least one object. Examples: K-Means and K-Medoids.
  • Hierarchical Clustering: determines cluster assignments by building a hierarchy. This is implemented with either a bottom-up or a top-down approach. These methods produce a tree-based hierarchy of points called a dendrogram. Examples: agglomerative clustering and divisive clustering.
  • Density-Based Clustering: determines cluster assignments based on the density of data points in a region. This approach doesn't require the user to specify the number of clusters; instead, there is a distance-based parameter that acts as a tunable threshold. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure).

Since clustering is popular in many fields, there exist a great number of clustering methods. In this article we focus on perhaps the three best-known approaches: K-Means clustering, hierarchical clustering, and DBSCAN.
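
Before looking at each method in detail, here is a minimal sketch (not part of the original walkthrough; the synthetic dataset and parameter values are purely illustrative) showing how all three approaches are called in scikit-learn on the same data:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Synthetic data with four well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

kmeans_labels = KMeans(n_clusters=4, random_state=42).fit_predict(X)   # partitional
agglo_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)    # hierarchical
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)          # density-based

# DBSCAN labels noise points as -1; the other two assign every point to a cluster
print(np.unique(kmeans_labels), np.unique(agglo_labels), np.unique(dbscan_labels))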

K-Means Clustering

K-Means clustering takes a dataset D of n objects and a parameter k as input, and divides D into k groups. This partition depends on the similarity measure, so that the resulting intra-cluster similarity is high while the inter-cluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid.

The quality of the cluster assignments is determined by computing the sum of the squared error (SSE) after the centroids converge, or match the previous iteration’s assignment. The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid. Since this is a measure of error, the objective of k-means is to try to minimize this value.
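
To make the objective concrete, here is a minimal sketch (illustrative synthetic data, not from the original article) showing that the SSE described above is exactly what scikit-learn exposes as the fitted model's inertia_ attribute:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, random_state=0).fit(X)

# Sum of squared Euclidean distances of each point to its assigned centroid
sse = np.sum((X - km.cluster_centers_[km.labels_]) ** 2)
print(sse, km.inertia_)  # the two values agree (up to floating point error)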

Implementing K-Means Clustering using Python

Let’s code!

The first step is importing the required libraries.

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler
from yellowbrick.cluster import KElbowVisualizer
from matplotlib import pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
plt.style.use('fivethirtyeight')

The next step is loading the dataset. The dataset consists of customer data and can be downloaded from here. It contains each customer's annual income and spending score, along with their genre (gender) and age.

dataset = pd.read_csv("Data......csv", sep=",")
dataset.head()
dataset.info()

Then we visualize the dataset and extract the two features we will cluster on:

plt.figure(figsize=(10, 9))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=dataset, alpha=0.9)

# Keep only the two features we will cluster on (annual income and spending score)
data_x = dataset.iloc[:, 3:5]
data_x.head()

x_array = np.array(data_x)
print(x_array)

The process of transforming numerical features to use the same scale is known as feature scaling. In this case, we'll use the MinMaxScaler class, which rescales each numerical feature in the dataset to the [0, 1] range. (An alternative is StandardScaler, which standardizes each feature to a mean of 0 and a standard deviation of 1.)

scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x_array)
x_scaled

After that, we’re going to choose the appropriate number of clusters using these methods :

Elbow Method: there's a sweet spot where the SSE curve starts to bend, known as the elbow point. The x-value of this point is thought to be a reasonable trade-off between error and number of clusters. The elbow point is the point where the rate of decrease of the SSE stops changing significantly as the number of clusters increases.

Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    # Fit k-means for each candidate k and record its SSE (inertia)
    km = KMeans(n_clusters=k)
    km = km.fit(x_scaled)
    Sum_of_squared_distances.append(km.inertia_)

plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('SSE')
plt.title('Elbow Method For Optimal k')
plt.show()

In this example, the elbow is located at x=5.

Silhouette Coefficient : is a measure of cluster cohesion and separation. It quantifies how well a data point fits into its assigned cluster based on two factors: How close the data point is to other points in the cluster and how far away the data point is from points in other clusters. Silhouette coefficient values range between -1 and 1. Larger numbers indicate that samples are closer to their clusters than they are to other clusters.

model = KMeans(random_state=123) 
visualizer = KElbowVisualizer(model, k=(2,8), metric='silhouette', timings=False)

visualizer.fit(x_scaled)
visualizer.poof()

From the result above, the silhouette analysis also suggests that the optimal number of clusters is 5.
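
As a quick cross-check, this minimal sketch (continuing from the scaled data above) computes the average silhouette coefficient directly for k = 5 with the silhouette_score function imported earlier:

km5 = KMeans(n_clusters=5, random_state=123)
labels5 = km5.fit_predict(x_scaled)
# Average silhouette coefficient over all samples; closer to 1 is better
print("Silhouette score for k = 5:", silhouette_score(x_scaled, labels5))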

After finding the optimal number of clusters, fit the K-Means clustering model to the dataset and then predict clusters for each of the data elements.

numerics = dataset[['Annual Income (k$)', 'Spending Score (1-100)']]

# Rescale both features to the [0, 1] range before clustering
scaler = MinMaxScaler()
for i in numerics:
    scaler.fit(dataset[[i]])
    dataset[i] = scaler.transform(dataset[[i]])

km = KMeans(n_clusters=5)
y_predicted = km.fit_predict(dataset[['Annual Income (k$)', 'Spending Score (1-100)']])
y_predicted

After we get the cluster of each data point, we add a new column named “Cluster”.

dataset["Cluster"] = y_predicted
dataset.head(10)
km.cluster_centers_

And the last step is visualizing the result of the clustering. For better representation, we need to give each of the clusters a unique colour and name.

plt.figure(figsize=(12,8))
df1 = dataset[dataset.Cluster==0]
df2 = dataset[dataset.Cluster==1]
df3 = dataset[dataset.Cluster==2]
df4 = dataset[dataset.Cluster==3]
df5 = dataset[dataset.Cluster==4]
plt.scatter(df1['Annual Income (k$)'],df1['Spending Score (1-100)'],color='green', label='Target Group')
plt.scatter(df2['Annual Income (k$)'],df2['Spending Score (1-100)'],color='magenta', label='Sensible')
plt.scatter(df3['Annual Income (k$)'],df3['Spending Score (1-100)'],color='orange', label='Careless')
plt.scatter(df4['Annual Income (k$)'],df4['Spending Score (1-100)'],color='red', label='Careful')
plt.scatter(df5['Annual Income (k$)'],df5['Spending Score (1-100)'],color='blue', label='Standard')
plt.title('Clustering Result', fontweight='bold',fontsize=20)
plt.xlabel('Annual Income (k$)',fontsize=15)
plt.ylabel('Spending Score (1-100)',fontsize=15)
plt.legend(fontsize=15)
plt.grid(True)
plt.show()

From the figure above, it can be seen that 5 clusters of customers have been formed. The clusters are:

  • Target Group
  • Sensible
  • Careless
  • Careful
  • Standard

Hierarchical Clustering

Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. There are two types of hierarchical clustering, Agglomerative and Divisive.

  • Divisive Clustering: the type of hierarchical clustering that uses a top-down approach to make clusters. It starts with all observations in a single cluster and repeatedly splits the least similar clusters until each observation forms its own cluster. Divisive clustering is not commonly used in real life.
  • Agglomerative Clustering: the type of hierarchical clustering that uses a bottom-up approach to make clusters. It starts with each observation in its own cluster and repeatedly merges the two most similar clusters until there is only one cluster. These steps are how agglomerative hierarchical clustering works:

For a set of N observations to be clustered:

  1. Start by assigning each observation to its own single-point cluster, so that if we have N observations, we have N clusters, each containing just one observation.
  2. Find the closest (most similar) pair of clusters and merge them into one cluster; we now have N-1 clusters.
  3. Find the next two closest clusters and merge them into one cluster; we now have N-2 clusters. The distance between clusters is computed with an agglomerative linkage technique.
  4. Repeat steps 2 and 3 until all observations are clustered into one single cluster of size N.

Before performing the clustering, a proximity matrix containing the distance between each pair of points is computed using a distance function. Linkage methods differ in how the distance between two clusters is measured: single linkage uses the minimum pairwise distance, complete linkage the maximum, average linkage the mean pairwise distance, and Ward linkage the increase in within-cluster variance caused by the merge.
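
The sketch below (illustrative toy data, not from the original article) shows how a linkage method is selected with scipy and how the resulting tree can be cut into a fixed number of clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X_demo = np.random.RandomState(0).rand(20, 2)  # hypothetical toy data

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X_demo, method=method)               # (n-1) x 4 merge history
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes (labels start at 1)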

Hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the following steps:

  • Identify the two clusters that are closest together
  • Merge the two most similar clusters

This continues until all the clusters are merged together. How do we decide the number of clusters? To get the number of clusters for hierarchical clustering, we make use of a dendrogram.

Dendrograms

Dendrograms are tree diagrams frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. A subset of similar data is created in a tree-like structure in which the root node corresponds to the entire dataset and branches are created from the root node to form several clusters. The optimal number of clusters is read off by drawing a horizontal line across the dendrogram (typically through the longest vertical distance not crossed by any merge) and counting the number of vertical lines it intersects.

Let’s Code!

The first step is importing the required libraries.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

And then importing the dataset:

df = pd.read_csv("Data......csv", sep=",")
df.head()

After that, checking the distribution of Annual Income, using this code:

plt.figure(figsize=(8,5))
plt.title("Annual income distribution",fontsize=15)
plt.xlabel ("Annual income (k$)",fontsize=13)
plt.grid(True)
plt.hist(df['Annual Income (k$)'],color='blue',edgecolor='k')
plt.show()

And next, checking the distribution of Spending Score, using this code:

plt.figure(figsize=(8,5))
plt.title("Spending Score distribution",fontsize=15)
plt.xlabel ("Spending Score (1-100)",fontsize=14)
plt.grid(True)
plt.hist(df['Spending Score (1-100)'],color='brown',edgecolor='k')
plt.show()

Visualizing Annual Income and Spending Score using Scatter Plot:

plt.figure(figsize=(11,8))
plt.title("Annual Income and Spending Score Correlation",fontsize=18)
plt.xlabel ("Annual Income (k$)",fontsize=14)
plt.ylabel ("Spending Score (1-100)",fontsize=14)
plt.grid(True)
plt.scatter(df['Annual Income (k$)'],df['Spending Score (1-100)'],color='green',edgecolor='k',alpha=0.6, s=100)
plt.show()

We use the ‘ward’ linkage criterion to build the dendrogram:

X = df.iloc[:, [3, 4]].values

import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster import hierarchy
plt.figure(figsize=(17,10))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
#plt.grid(True)
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.show()

We can clearly visualize the steps of hierarchical clustering. The longer the vertical lines in the dendrogram, the greater the distance between the clusters they join.

Then we can set a threshold distance and draw a horizontal line. We try to set the threshold so that it cuts across the tallest vertical lines. From the dendrogram above we can set the threshold at roughly 200, and in the output below the horizontal line crosses 5 vertical lines.

plt.figure(figsize=(15,6))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.hlines(y=190,xmin=0,xmax=2000,lw=3,linestyles='--')
plt.text(x=900,y=220,s='Horizontal line crossing 5 vertical lines',fontsize=20)
#plt.grid(True)
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.show()

from sklearn.cluster import AgglomerativeClustering

# Fit agglomerative clustering with 5 clusters
# (note: in scikit-learn >= 1.2 the 'affinity' parameter is named 'metric')
hc = AgglomerativeClustering(n_clusters = 5, affinity = 'euclidean', linkage = 'ward')
y_hc = hc.fit_predict(X)
plt.figure(figsize=(15,10))
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Careful')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Standard')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Target group')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 100, c = 'orange', label = 'Careless')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 100, c = 'magenta', label = 'Sensible')
plt.title('Clustering Result', fontweight='bold',fontsize=30)
plt.xlabel('Annual Income (k$)',fontsize=20)
plt.ylabel('Spending Score (1-100)',fontsize=20)
plt.legend(fontsize=16)
plt.grid(True)
plt.axhspan(ymin=60,ymax=100,xmin=0.4,xmax=0.96,alpha=0.3,color='yellow')
plt.show()

Finally, we can clearly visualize the 5 clusters. This is how we can implement hierarchical clustering.

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density based clustering algorithm. The main concept of DBSCAN algorithm is to locate regions of high density that are separated from one another by regions of low density.

DBSCAN can identify clusters in a large spatial dataset by looking at the local density of the data points. The advantage of DBSCAN over K-Means is that DBSCAN can determine which data points are noise or outliers: it can identify points that are not part of any cluster, which makes it very useful for outlier detection. It is somewhat slower than agglomerative clustering and k-means, but it still scales to relatively large datasets.

There are two parameters in DBSCAN: eps and minPoints:

  1. eps: specifies how close points should be to each other to be considered part of a cluster. If the distance between two points is lower than or equal to this value (eps), these points are considered neighbors.
  2. minPoints: the minimum number of data points required to form a dense region (cluster). For example, if we set the minPoints parameter to 5, then we need at least 5 points to form a dense region.

Based on these two parameters, points are classified as core points, border points, or noise points (a small sketch after these definitions illustrates the classification).

Core Point

  • A point p is a core point if it has at least minPoints points within an eps radius around it, i.e. |N(p)| ≥ minPoints.
  • A core point always belongs to a dense region.
  • For example, ‘p’ is a core point if ‘p’ has ≥ minPoints points in an eps radius around it.

Border Point

  • A point is a border point if it has fewer than minPoints within eps, but is in the neighborhood of a core point.
  • For example, ‘p’ is a border point if ‘p’ is not a core point (i.e. ‘p’ has fewer than minPoints points within its eps radius), but ‘p’ belongs to the neighborhood of a core point ‘q’.
  • That is, p ∈ neighborhood of q and distance(p, q) ≤ eps.

Noise Point

  • A noise point is any point that is not a core point or a border point.
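
The following sketch (illustrative eps and minPoints values, not from the original article) classifies the points of a toy dataset as core, border, or noise by simply counting neighbours within an eps radius:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=1)
eps, min_points = 0.5, 5

# Indices of all neighbours within an eps radius (each point counts itself)
neighbors = NearestNeighbors(radius=eps).fit(X).radius_neighbors(X, return_distance=False)

is_core = np.array([len(n) >= min_points for n in neighbors])   # |N(p)| >= minPoints
is_border = np.array([(not c) and is_core[n].any() for c, n in zip(is_core, neighbors)])
is_noise = ~is_core & ~is_border

print(is_core.sum(), "core,", is_border.sum(), "border,", is_noise.sum(), "noise")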

Before we go through the DBSCAN algorithm, we should understand density edges and density-connected points:

Density Edge:
If p and q are both core points and distance(p, q) ≤ eps, then we can connect the vertices p and q in a graph and call that edge a density edge.

Density Connected Points:
Two points p and q are said to be density-connected if both p and q are core points and there exists a path formed by density edges connecting point p to point q.

Now that we know the DBSCAN terms and parameters, let's look at how the DBSCAN algorithm works:

Steps of DBSCAN Algorithm:

  1. The algorithm starts with an arbitrary point that has not been visited, and its neighborhood information is retrieved using the eps parameter.
  2. If this point has at least minPoints within its eps neighborhood, cluster formation starts. Otherwise the point is labeled as noise. This point can later be found within the eps neighborhood of a different point and thus be made part of a cluster.
  3. If a point is found to be a core point, then the points within its eps neighborhood are also part of the cluster. So all the points found within the eps neighborhood are added, along with their own eps neighborhoods if they are also core points.
  4. The above process continues until the density-connected cluster is completely found.
  5. The process restarts with a new point, which can become part of a new cluster or be labeled as noise.

And the next question is:

How to choose minPoints?

  • Remove the outliers first, if possible
  • Take minPoints to be greater than or equal to the dimensionality (d) of the dataset
  • A common rule of thumb is to take minPoints as twice the dimensionality of the data
  • If the dataset is noisier, choose a larger value of minPoints
  • Choosing minPoints really depends a lot on domain knowledge

How to determine eps?

  • For example, choose minPoints = 4. For each point p, we compute “d”, where d is the distance from p to the 4th nearest neighbor of p. If d is high, then the chance that p is noise is also high.
  • For each point of the dataset we then have a value of d. We sort all the d values in increasing order and plot them against the sorted point index; the elbow (knee) of this curve is a good candidate for eps, as sketched below.
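
Here is a minimal sketch of that k-distance plot (illustrative synthetic data and minPoints = 4, not from the original article):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
k = 4  # minPoints

# Query k+1 neighbours because each point is returned as its own nearest neighbour
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
kth_distances = np.sort(distances[:, -1])  # distance to the k-th real neighbour, sorted

plt.plot(kth_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel('Distance to %dth nearest neighbour' % k)
plt.title('k-distance plot: the knee suggests a value for eps')
plt.show()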

Let’s Code!

Let’s first import the libraries:

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
%matplotlib inline

Next step is importing the dataset:

df = pd.read_csv('Data_Customer_Mall.csv', sep=',')
df.head()

We can now create a DBSCAN object and fit the data, setting eps = 0.4 and min_samples = 5:

Clus_dataSet = df[['Annual_Income','Spending_Score']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = np.array(Clus_dataSet, dtype=np.float64)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.4, min_samples=5).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
df['Clus_Db']=labels
realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
# A sample of clusters
print(df[['Annual_Income','Spending_Score']].head())
# Number of Labels
print("number of labels: ", set(labels))

Let’s visualize the clusters determined by DBSCAN, using this code:

# Black removed and is used for noise instead
plt.figure(figsize=(15,10))
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    # Core samples of this cluster are drawn with large markers
    xy = Clus_dataSet[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    # Non-core (border/noise) samples are drawn with small markers
    xy = Clus_dataSet[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Clustering of Customers, Estimated Number of Clusters: %d' % realClusterNum, fontweight='bold', fontsize=20)
plt.xlabel('Annual Income', fontsize=20)
plt.ylabel('Spending Score', fontsize=20)
plt.show()

n_noise_ = list(labels).count(-1)
print('number of noise(s): ', n_noise_)

# Summary statistics per cluster (noise points, labeled -1, are skipped)
for clust_number in set(labels):
    clust_set = df[df.Clus_Db == clust_number]
    if clust_number != -1:
        print("Cluster " + str(clust_number) +
              ', Avg Annual Income: ' + str(round(np.mean(clust_set.Annual_Income))) +
              ', Avg Spending Score: ' + str(round(np.mean(clust_set['Spending_Score']))) +
              ', Count: ' + str(len(clust_set)))

From this DBSCAN implementation, we can conclude that DBSCAN performs really well at detecting outliers, which would not be easy with K-Means or hierarchical clustering techniques. If we apply DBSCAN to a dataset with arbitrarily shaped clusters, we can see its strength there as well, as in the sketch below.
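
As a minimal illustration of that last claim (synthetic two-moons data; the parameter values are illustrative), DBSCAN recovers the two crescent-shaped clusters while K-Means, which assumes roughly spherical clusters, splits them incorrectly:

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)        # follows the crescents
km_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)  # cuts across them

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X[:, 0], X[:, 1], c=db_labels)
axes[0].set_title('DBSCAN')
axes[1].scatter(X[:, 0], X[:, 1], c=km_labels)
axes[1].set_title('K-Means')
plt.show()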
