Clustering in Machine Learning: Methods, Distance Metrics, K-Means, Hierarchical & Density-Based Techniques
Clustering is one of the core tasks in unsupervised machine learning. It involves grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. This powerful technique is widely used in market segmentation, anomaly detection, image compression, social network analysis, and more.
🔍 What is Clustering?
Clustering is a data analysis task that discovers groupings within data without using labeled responses. The goal is to divide the dataset into subsets (clusters); in hard clustering, each data point belongs to exactly one cluster. Ideally, data points within the same cluster are highly similar to one another, while data points in different clusters are highly dissimilar.
✅ Applications of Clustering:
- Customer Segmentation
- Market Basket Analysis
- Document or Text Clustering
- Medical Image Segmentation
- Recommender Systems
🤍 Distance Measures in Clustering
Clustering largely depends on how we measure similarity or distance between data points. The most common distance metrics are:
- Euclidean Distance: Most commonly used for continuous features.
  d(p, q) = sqrt(Σ (pᵢ - qᵢ)²)
- Manhattan Distance: Often preferred for high-dimensional data, since it is less sensitive to a large difference in any single dimension.
  d(p, q) = Σ |pᵢ - qᵢ|
- Cosine Similarity: Used when the angle between vectors matters more than their magnitude.
  cos(θ) = (A · B) / (||A|| ||B||)
- Hamming Distance: Used for categorical attributes; counts the positions at which two records differ.
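As a quick illustration, the snippet below computes each of these measures for two small, made-up numeric vectors (and a pair of made-up categorical records for Hamming distance):
import numpy as np
# Two illustrative numeric vectors (purely made-up values)
p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 6.0])
# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))
# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(p - q))
# Cosine similarity: dot product divided by the product of the vector norms
cosine_sim = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
# Hamming distance: number of positions where two categorical records differ
a = ["red", "small", "round"]
b = ["red", "large", "round"]
hamming = sum(x != y for x, y in zip(a, b))
print(euclidean, manhattan, cosine_sim, hamming)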
📊 Different Clustering Methods
1. Distance-Based Clustering
These methods use distance metrics to define cluster membership.
- K-Means: Partitions data into k clusters by minimizing intra-cluster variance.
- K-Medoids: Similar to K-Means but uses actual data points (medoids) instead of centroids.
2. Density-Based Clustering
Clusters are formed based on areas of high density separated by areas of low density.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- OPTICS (Ordering Points To Identify Clustering Structure)
3. Hierarchical Clustering
Builds a hierarchy of clusters either agglomeratively (bottom-up) or divisively (top-down).
- Can be visualized using a dendrogram.
- Does not require pre-specifying the number of clusters.
🪣 Iterative Distance-Based Clustering
This includes algorithms that iteratively update centroids/medoids and reassign points:
- K-Means: Begins with k initial centroids, assigns each point to the closest centroid, recalculates centroids, and repeats until convergence.
- K-Medoids (PAM Algorithm): More robust to noise and outliers compared to K-Means.
Python Snippet for K-Means:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate a toy dataset with 4 well-separated blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Fit K-Means with k=4; random_state makes the run reproducible
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(X)

# Plot the points colored by cluster label, with centroids marked in red
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red')
plt.show()
🔢 Handling Continuous and Categorical Data in K-Means
K-Means is traditionally suited for numerical (continuous) data. To handle mixed data types:
- Standardization: Normalize continuous values.
- Encoding: Use One-Hot or Label Encoding for categorical values.
- Use K-Prototypes: Extension of K-Means to handle mixed data types.
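As a rough sketch of the preprocessing route, the snippet below scales the continuous columns and one-hot encodes the categorical column before running K-Means. The small DataFrame and its column names (age, income, city) are invented purely for illustration; K-Prototypes itself comes from the third-party kmodes package rather than scikit-learn.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Hypothetical mixed-type data: two continuous columns and one categorical column
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "income": [32000, 81000, 45000, 60000],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
})

# Scale continuous columns, one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Chain preprocessing with K-Means (k=2 is chosen only for illustration)
model = make_pipeline(preprocess, KMeans(n_clusters=2, n_init=10, random_state=42))
labels = model.fit_predict(df)
print(labels)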
🧰 Constructing a Hierarchical Cluster
Hierarchical clustering constructs nested clusters. The common method used is Agglomerative Clustering.
✅ Steps:
- Start with each data point as a separate cluster.
- At each step, merge the closest pair of clusters.
- Repeat until a single cluster remains.
Python Snippet:
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Build the linkage matrix with Ward's method, reusing X from the K-Means example above
linked = linkage(X, method='ward')

# Visualize the merge hierarchy as a dendrogram
dendrogram(linked)
plt.show()
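To obtain flat cluster assignments from the same hierarchy, the linkage matrix can be cut with SciPy's fcluster; t=4 below is chosen only because the toy data was generated with four centers.
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram so that at most 4 clusters remain
labels = fcluster(linked, t=4, criterion='maxclust')
print(labels[:10])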
🛋️ K-Medoids, K-Modes and Density-Based Clustering
✅ K-Medoids:
- Chooses actual data points as centers (medoids).
- Minimizes sum of dissimilarities instead of squared distances.
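A minimal K-Medoids sketch, assuming the optional scikit-learn-extra package is installed (it is a separate install from core scikit-learn) and reusing the blob data X from the K-Means example:
# Assumes: pip install scikit-learn-extra
from sklearn_extra.cluster import KMedoids

# Fit K-Medoids with k=4; each cluster center is an actual data point (a medoid)
kmedoids = KMedoids(n_clusters=4, random_state=42)
kmedoids.fit(X)
print(kmedoids.cluster_centers_[:2])  # medoids drawn from X itself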
✅ K-Modes:
- Extension of K-Means for categorical data.
- Uses modes (most frequent values) instead of means as cluster centers.
✅ Density-Based Clustering:
- Detects arbitrary shaped clusters and outliers.
- DBSCAN: Requires two parameters: eps (neighborhood radius) and min_samples (minimum points needed to form a dense region).
- Can handle noise and outliers effectively.
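A minimal DBSCAN sketch on the same blob data; the eps and min_samples values are only illustrative starting points and normally need tuning per dataset:
from sklearn.cluster import DBSCAN

# eps = neighborhood radius, min_samples = points required to form a dense region
db = DBSCAN(eps=0.5, min_samples=5)
db_labels = db.fit_predict(X)

# Points labeled -1 are treated as noise/outliers
print("Clusters found:", len(set(db_labels)) - (1 if -1 in db_labels else 0))
print("Noise points:", list(db_labels).count(-1))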
📊 Measuring Clustering Quality
To evaluate clustering results, several internal and external metrics are used:
✅ Internal Evaluation:
- Silhouette Coefficient: Measures how similar an object is to its own cluster compared to other clusters.
- Dunn Index: Ratio of the minimum inter-cluster distance to the maximum intra-cluster distance; higher values indicate better clustering.
- Davies-Bouldin Index: Lower values indicate better clustering.
✅ External Evaluation:
- Rand Index: Measures similarity between clustering and ground truth.
- Adjusted Mutual Information: Measures agreement with ground-truth labels, corrected for chance.
Python Snippet for Silhouette Score:
from sklearn.metrics import silhouette_score

# Silhouette ranges from -1 to 1; values closer to 1 indicate well-separated clusters
score = silhouette_score(X, kmeans.labels_)
print("Silhouette Score:", score)
📖 Related Topic: Naive Bayes Classifier
If you're also interested in probabilistic classification and feature selection, check out our detailed post on the Naive Bayes Classifier.
🔹 Conclusion
Clustering is a fundamental technique in machine learning for discovering hidden patterns in unlabeled data. With a wide variety of algorithms like K-Means, K-Medoids, DBSCAN, and hierarchical clustering, practitioners can choose the most suitable method based on their data characteristics and business objectives.
From measuring similarity using distance metrics to evaluating clustering quality using silhouette scores and other metrics, each step in the clustering pipeline is crucial to achieving insightful and actionable outcomes.
Understanding these methods equips data scientists, analysts, and developers with essential tools to perform powerful data-driven segmentation and discovery.
