Clustering in Machine Learning: Methods, Distance Metrics, K-Means, Hierarchical & Density-Based Techniques

[Infographic: Clustering in Machine Learning, showing K-Means, Hierarchical, and Density-Based clustering and common distance metrics, with grouped data points, a dendrogram, and scatter plots.]


Clustering in Machine Learning: Techniques, Distance Measures, and Quality Evaluation

Clustering is one of the core tasks in unsupervised machine learning. It involves grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. This powerful technique is widely used in market segmentation, anomaly detection, image compression, social network analysis, and more.


🔍 What is Clustering?

Clustering is a data analysis task that discovers groupings within data without using labeled responses. The goal is to divide the dataset into subsets (clusters), where each data point belongs to only one group. Ideally, data points within the same cluster have high similarity, and data points from different clusters have high dissimilarity.

✅ Applications of Clustering:

  • Customer Segmentation
  • Market Basket Analysis
  • Document or Text Clustering
  • Medical Image Segmentation
  • Recommender Systems

🤍 Distance Measures in Clustering

Clustering largely depends on how we measure similarity or distance between data points. The most common distance metrics are listed below, with a short computation example after the list:

  • Euclidean Distance: Most commonly used for continuous features.
    d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )
  • Manhattan Distance: Sums absolute differences along each axis; often preferred for high-dimensional data.
    d(p, q) = Σᵢ |pᵢ − qᵢ|
  • Cosine Similarity: Used when the angle between vectors matters more than their magnitude.
    cos(θ) = (A · B) / (||A|| ||B||)
  • Hamming Distance: Used for categorical attributes; counts the positions at which two vectors differ.
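
These metrics can be computed directly with SciPy's scipy.spatial.distance module. The snippet below is a minimal sketch on two small example vectors; note that SciPy's cosine function returns the cosine distance (1 − cosine similarity), and its hamming function returns the fraction of differing positions rather than the raw count.

Python Snippet for Distance Metrics:

from scipy.spatial.distance import euclidean, cityblock, cosine, hamming

p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]

print("Euclidean:", euclidean(p, q))            # square root of the sum of squared differences
print("Manhattan:", cityblock(p, q))            # sum of absolute differences
print("Cosine similarity:", 1 - cosine(p, q))   # SciPy returns cosine distance, so subtract from 1
print("Hamming:", hamming([1, 0, 1, 1], [1, 1, 1, 0]))  # fraction of positions that differ (here 0.5)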

📊 Different Clustering Methods

1. Distance-Based Clustering

These methods use distance metrics to define cluster membership.

  • K-Means: Partitions data into k clusters by minimizing intra-cluster variance.
  • K-Medoids: Similar to K-Means but uses actual data points (medoids) instead of centroids.

2. Density-Based Clustering

Clusters are formed based on areas of high density separated by areas of low density.

  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  • OPTICS (Ordering Points To Identify Clustering Structure)

3. Hierarchical Clustering

Builds a hierarchy of clusters either agglomeratively (bottom-up) or divisively (top-down).

  • Can be visualized using a dendrogram.
  • Does not require pre-specifying the number of clusters.

🪣 Iterative Distance-Based Clustering

This includes algorithms that iteratively update centroids/medoids and reassign points:

  • K-Means: Begins with k initial centroids, assigns each point to the closest centroid, recalculates centroids, and repeats until convergence.
  • K-Medoids (PAM Algorithm): More robust to noise and outliers compared to K-Means.

Python Snippet for K-Means:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate synthetic 2-D data with 4 well-separated blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit K-Means; multiple n_init restarts guard against poor random initializations
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)

# Plot points colored by assigned cluster, with centroids marked in red
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red')
plt.show()

🔢 Handling Continuous and Categorical Data in K-Means

K-Means is traditionally suited to numerical (continuous) data. To handle mixed data types (a preprocessing sketch follows this list):

  • Standardization: Normalize continuous values.
  • Encoding: Use One-Hot or Label Encoding for categorical values.
  • Use K-Prototypes: Extension of K-Means to handle mixed data types.
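
As a sketch of the first two points, scikit-learn's ColumnTransformer can standardize the numeric columns and one-hot encode the categorical ones before running K-Means. The column names and values below are hypothetical:

Python Snippet for Mixed-Type Preprocessing:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical customer data with continuous and categorical columns
df = pd.DataFrame({
    "age": [25, 40, 31, 58, 22, 47],
    "income": [30000, 72000, 45000, 90000, 28000, 66000],
    "plan": ["basic", "premium", "basic", "premium", "basic", "premium"],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),               # normalize continuous values
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),  # encode categorical values
])

model = make_pipeline(preprocess, KMeans(n_clusters=2, n_init=10, random_state=0))
labels = model.fit_predict(df)
print(labels)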

🧰 Constructing a Hierarchical Cluster

Hierarchical clustering constructs nested clusters. The common method used is Agglomerative Clustering.

✅ Steps:

  • Start with each data point as a separate cluster.
  • At each step, merge the closest pair of clusters.
  • Repeat until a single cluster remains.

Python Snippet:

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# X is the blob data from the K-Means example above
linked = linkage(X, method='ward')   # Ward linkage merges the pair of clusters with the smallest increase in variance
dendrogram(linked)                   # visualize the merge hierarchy
plt.show()
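
To turn the dendrogram into a fixed number of flat clusters, SciPy's fcluster can be applied to the same linkage matrix. A minimal sketch, continuing from the linked array above:

from scipy.cluster.hierarchy import fcluster

# Cut the tree so that at most 4 flat clusters remain
labels = fcluster(linked, t=4, criterion='maxclust')
print(labels[:10])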

🛋️ K-Medoids, K-Modes, and Density-Based Clustering

K-Medoids:

  • Chooses actual data points as centers (medoids).
  • Minimizes the sum of dissimilarities instead of squared distances (see the sketch below).
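
A minimal K-Medoids sketch, assuming the third-party scikit-learn-extra package is installed (its KMedoids estimator follows the usual scikit-learn fit/predict interface) and reusing the blob data X from the K-Means example:

from sklearn_extra.cluster import KMedoids

kmedoids = KMedoids(n_clusters=4, random_state=0)
kmedoids.fit(X)
print(kmedoids.cluster_centers_)  # actual data points chosen as medoids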

K-Modes:

  • Extension of K-Means for categorical data.
  • Uses modes instead of means as cluster centers (see the sketch below).
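
A short K-Modes sketch, assuming the third-party kmodes package is installed; the categorical data below is hypothetical:

import numpy as np
from kmodes.kmodes import KModes

# Each row is an observation, each column a categorical attribute
data = np.array([
    ['red',  'small', 'yes'],
    ['red',  'small', 'no'],
    ['blue', 'large', 'yes'],
    ['blue', 'large', 'no'],
])

km = KModes(n_clusters=2, init='Huang', n_init=5)
labels = km.fit_predict(data)
print(labels)
print(km.cluster_centroids_)  # the modes chosen as cluster centers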

Density-Based Clustering:

  • Detects arbitrarily shaped clusters and outliers.
  • DBSCAN: Requires two parameters: eps (the neighborhood radius) and min_samples (the minimum number of points needed to form a dense region); see the snippet below.
  • Handles noise and outliers effectively by labeling them as noise instead of forcing them into a cluster.
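
A minimal DBSCAN sketch with scikit-learn, reusing the blob data X from the K-Means example (the eps and min_samples values are illustrative and usually need tuning):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)   # eps: neighborhood radius, min_samples: density threshold
labels = db.fit_predict(X)
print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("Noise points:", list(labels).count(-1))   # DBSCAN labels outliers as -1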

📊 Measuring Clustering Quality

To evaluate clustering results, several internal and external metrics are used:

✅ Internal Evaluation:

  • Silhouette Coefficient: Measures how similar an object is to its own cluster compared to other clusters.
  • Dunn Index: Ratio of minimal inter-cluster distance to maximal intra-cluster distance.
  • Davies-Bouldin Index: Average similarity between each cluster and its most similar neighbor; lower values indicate better clustering.

✅ External Evaluation:

  • Rand Index: Measures agreement between the clustering and ground-truth labels.
  • Adjusted Mutual Information: Measures agreement with ground-truth labels while correcting for chance.

Python Snippet for Silhouette Score:

from sklearn.metrics import silhouette_score
score = silhouette_score(X, kmeans.labels_)   # mean silhouette over all samples; closer to 1 is better
print("Silhouette Score:", score)

📖 Related Topic: Naive Bayes Classifier

If you're also interested in probabilistic classification and feature selection, check out our detailed post on the Naive Bayes Classifier.


🔹 Conclusion

Clustering is a fundamental technique in machine learning for discovering hidden patterns in unlabeled data. With a wide variety of algorithms like K-Means, K-Medoids, DBSCAN, and hierarchical clustering, practitioners can choose the most suitable method based on their data characteristics and business objectives.

From measuring similarity using distance metrics to evaluating clustering quality using silhouette scores and other metrics, each step in the clustering pipeline is crucial to achieving insightful and actionable outcomes.

Understanding these methods equips data scientists, analysts, and developers with essential tools to perform powerful data-driven segmentation and discovery.
