Unsupervised Learning: K-Means Clustering
Although most of the applications of Machine Learning today are based on supervised learning (and as a result, this is where most of the investments go), the vast majority of the available data is unlabeled: we have the input features X, but we do not have the labels y.
Say you want to create a system that will take a few pictures of each item on a manufacturing production line and detect which items are defective. You can fairly easily create a system that will take pictures automatically, and this might give you thousands of pictures every day. You can then build a reasonably large dataset in just a few weeks. But wait, there are no labels! If you want to train a regular binary classifier that will predict whether an item is defective or not, you will need to label every single picture as “defective” or “normal.” This will generally require human experts to sit down and manually go through all the pictures. This is a long, costly, and tedious task, so it will usually only be done on a small subset of the available pictures. As a result, the labeled dataset will be quite small, and the classifier’s performance will be disappointing. Moreover, every time the company makes any change to its products, the whole process will need to be started over from scratch. Wouldn’t it be great if the algorithm could just exploit the unlabeled data without needing humans to label every picture? Enter unsupervised learning.
The most common unsupervised learning tasks in machine learning are:
Dimensionality reduction
This is the process of reducing the number of input features in a dataset while retaining as much relevant information as possible. It helps improve model performance by eliminating noise, reducing overfitting, and speeding up computation. Techniques like Principal Component Analysis (PCA) and t-SNE are commonly used for this purpose.
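As a quick illustration, here is a minimal sketch using scikit-learn's PCA; the Iris dataset and the choice of two components are assumptions made just for the example:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Use only the input features X; no labels are needed.
X = load_iris().data  # shape (150, 4)

# Project the 4 original features down to 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance retained by each component
```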
Clustering
The goal is to group similar instances together into clusters. Clustering is a great tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, and more. Examples include K-Means and DBSCAN.
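To make the difference between these algorithms concrete, here is a hedged sketch of DBSCAN on a two-moons toy dataset (non-spherical clusters that a centroid-based method like K-Means handles poorly); the eps and min_samples values are assumptions chosen for this data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: densely packed, non-spherical clusters.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# DBSCAN grows clusters from densely packed points; eps is the neighborhood radius.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(np.unique(db.labels_))  # cluster ids; -1 marks points treated as noise
```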
Anomaly detection
The objective is to learn what “normal” data looks like, and then use that to detect abnormal instances, such as defective items on a production line or a new trend in a time series.
Density estimation
This is the task of estimating the probability density function (PDF) of the random process that generated the dataset. Density estimation is commonly used for anomaly detection: instances located in very low-density regions are likely to be anomalies. It is also useful for data analysis and visualization.
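As a hedged illustration, here is a minimal sketch using scikit-learn's KernelDensity; the synthetic Gaussian data and the bandwidth value are assumptions for the example, and the far-away point is flagged through its low log-density, which ties density estimation back to anomaly detection:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(42)
# "Normal" data: 1,000 points drawn from a 1-D Gaussian.
X = rng.normal(loc=0.0, scale=1.0, size=(1000, 1))

# Fit a kernel density estimator; bandwidth=0.3 is an assumed value.
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)

# score_samples returns the log of the estimated PDF at each point.
# Very low values indicate low-density regions, i.e., likely anomalies.
candidates = np.array([[0.1], [5.0]])
print(kde.score_samples(candidates))  # the point at 5.0 scores far lower
```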
Clustering
As you enjoy a hike in the mountains, you stumble upon a plant you have never seen before. You look around and you notice a few more. They are not identical, yet they are sufficiently similar for you to know that they most likely belong to the same species (or at least the same genus). You may need a botanist to tell you what species that is, but you certainly don’t need an expert to identify groups of similar-looking objects.
This is called clustering: it is the task of identifying similar instances and assigning them to clusters, or groups of similar instances.
Clustering is used in a wide variety of applications, including these:
For customer segmentation
You can cluster your customers based on their purchases and their activity on your website. This is useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment.
For example, customer segmentation can be useful in recommender systems to suggest content that other users in the same cluster enjoyed.
For data analysis
When you analyze a new dataset, it can be helpful to run a clustering algorithm, and then analyze each cluster separately.
As a dimensionality reduction technique
Once a dataset has been clustered, it is usually possible to measure each instance’s affinity with each cluster (affinity is any measure of how well an instance fits into a cluster). Each instance’s feature vector x can then be replaced with the vector of its cluster affinities. If there are k clusters, then this vector is k-dimensional. This vector is typically much lower-dimensional than the original feature vector, but it can preserve enough information for further processing (see the sketch after this list).
For anomaly detection (also called outlier detection)
Any instance that has a low affinity to all the clusters is likely to be an anomaly. For example, if you have clustered the users of your website based on their behavior, you can detect users with unusual behavior, such as an unusual number of requests per second. Anomaly detection is particularly useful for detecting defects in manufacturing, or for fraud detection.
For semi-supervised learning
If you only have a few labels, you could perform clustering and propagate the labels to all the instances in the same cluster. This technique can greatly increase the number of labels available for a subsequent supervised learning algorithm, and thus improve its performance (the sketch after this list includes a simple version of this idea).
For search engines
Some search engines let you search for images that are similar to a reference image. To build such a system, you would first apply a clustering algorithm to all the images in your database; similar images would end up in the same cluster.
Then when a user provides a reference image, all you need to do is use the trained clustering model to find this image’s cluster, and you can then simply return all the images from this cluster.
To segment an image
By clustering pixels according to their color, then replacing each pixel’s color with the mean color of its cluster, it is possible to considerably reduce the number of different colors in the image. Image segmentation is used in many object detection and tracking systems, as it makes it easier to detect the contour of each object.
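To make the dimensionality-reduction and semi-supervised-learning ideas from this list concrete, here is a minimal sketch using scikit-learn's KMeans on the digits dataset; the dataset and the choice of k = 50 are assumptions made for illustration, not a prescription:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)  # 1,797 images of 64 features each

# 1) Dimensionality reduction via cluster affinities: replace each 64-dim
#    feature vector with its distances to k centroids.
k = 50  # assumed value; tune for your own data
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
X_affinity = kmeans.transform(X)  # shape (1797, 50): distance to each centroid
print(X_affinity.shape)

# 2) Semi-supervised label propagation: label only the instance closest to
#    each centroid, then spread that label to every member of its cluster.
closest = np.argmin(X_affinity, axis=0)  # most central instance per cluster
y_propagated = np.empty(len(X), dtype=y.dtype)
for cluster_id in range(k):
    y_propagated[kmeans.labels_ == cluster_id] = y[closest[cluster_id]]
```

Here the 50 representative instances stand in for the "few labels" scenario: a classifier trained on y_propagated can perform noticeably better than one trained on just 50 randomly chosen labeled instances.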
There is no universal definition of what a cluster is: it really depends on the context, and different algorithms will capture different kinds of clusters. Some algorithms look for instances centered around a particular point, called a centroid. Others look for continuous regions of densely packed instances: these clusters can take on any shape. Some algorithms are hierarchical, looking for clusters of clusters. And the list goes on.
K-Means Clustering
K-Means is a centroid-based clustering algorithm: it partitions the data into K clusters by repeatedly assigning each instance to its nearest centroid and then moving each centroid to the mean of the instances assigned to it, until the assignments stop changing.
✅ Advantages of K-Means Clustering
- Simple and Easy to Implement: the algorithm is intuitive and easy to code.
- Efficient and Fast: especially with small to medium-sized datasets; it scales well with large data if optimized (e.g., using k-means++).
- Works Well with Distinct Clusters: performs effectively when clusters are spherical and well separated.
- Unsupervised Learning: does not require labeled data, making it useful for exploratory data analysis.
- Interpretable Results: cluster centroids are easy to understand and explain.
❌ Disadvantages of K-Means Clustering
- Requires a Predefined Number of Clusters (K): you must specify the number of clusters in advance, which can be difficult without domain knowledge.
- Sensitive to Initial Centroids: different initializations can lead to different results (though k-means++ helps reduce this issue).
- Not Suitable for Non-Spherical Clusters: struggles with clusters of different sizes, densities, or non-globular shapes.
- Affected by Outliers: outliers can significantly distort the cluster centroids.
- Assumes Equal Importance of Features: requires proper feature scaling; otherwise, features with larger ranges dominate the distance calculation.
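Putting this together, here is a minimal usage sketch with scikit-learn; the blob dataset, the choice of K = 5, and the hyperparameter values are assumptions picked so the conditions above hold (spherical, well-separated clusters, scaled features):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Five spherical, well-separated blobs: the setting where K-Means shines.
X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

# Scale the features first, since K-Means is distance-based.
X_scaled = StandardScaler().fit_transform(X)

# init="k-means++" spreads the initial centroids apart, and n_init=10 reruns
# the algorithm from several initializations and keeps the best result.
kmeans = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.inertia_)          # sum of squared distances to closest centroid
```

Since inertia keeps dropping as K grows, one common way to pick K without domain knowledge is to plot the inertia for several values of K and look for the "elbow" where the curve flattens.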