Unsupervised Learning: K-Means Clustering

Although most of the applications of Machine Learning today are based on supervised learning (and as a result, this is where most of the investments go to), the vast majority of the available data is unlabeled: we have the input features X, but we do not have the labels y.

Say you want to create a system that will take a few pictures of each item on a manufacturing production line and detect which items are defective. You can fairly easily create a system that will take pictures automatically, and this might give you thousands of pictures every day. You can then build a reasonably large dataset in just a few weeks. But wait, there are no labels! If you want to train a regular binary classifier that will predict whether an item is defective or not, you will need to label every single picture as “defective” or “normal.” This will generally require human experts to sit down and manually go through all the pictures. This is a long, costly, and tedious task, so it will usually only be done on a small subset of the available pictures. As a result, the labeled dataset will be quite small, and the classifier’s performance will be disappointing. Moreover, every time the company makes any change to its products, the whole process will need to be started over from scratch. Wouldn’t it be great if the algorithm could just exploit the unlabeled data without needing humans to label every picture? Enter unsupervised learning.

The most common unsupervised learning tasks in machine learning are:

Dimensionality reduction 

It is the process of reducing the number of input variables or features in a dataset while retaining as much relevant information as possible. It helps improve model performance by eliminating noise, reducing overfitting, and speeding up computation. Techniques like Principal Component Analysis (PCA) and t-SNE are commonly used for this purpose.
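As a quick illustration, here is a minimal sketch of dimensionality reduction with PCA in scikit-learn, using the Iris dataset purely as example input:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)       # 4 input features; the labels are ignored
pca = PCA(n_components=2)               # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)        # shape changes from (150, 4) to (150, 2)
print(pca.explained_variance_ratio_)    # fraction of variance captured by each component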

Clustering

The goal is to group similar instances together into clusters. Clustering is a great tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, and more. Examples include K-Means and DBSCAN.

Anomaly detection

The objective is to learn what “normal” data looks like, and then use that to detect abnormal instances, such as defective items on a production line or a new trend in a time series.

Density estimation

This is the task of estimating the probability density function (PDF) of the random process that generated the dataset. Density estimation is commonly used for anomaly detection: instances located in very low-density regions are likely to be anomalies. It is also useful for data analysis and visualization.
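As a rough sketch of this idea (assuming scikit-learn's KernelDensity and synthetic Gaussian data purely for illustration), density estimation for anomaly detection might look like this:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(42)
X = rng.normal(loc=0.0, scale=1.0, size=(500, 1))     # synthetic "normal" data

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_density = kde.score_samples([[0.0], [5.0]])       # log PDF at two query points
print(log_density)   # the point at 5.0 falls in a very low-density region, so it is likely an anomaly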

Clustering

As you enjoy a hike in the mountains, you stumble upon a plant you have never seen before. You look around and you notice a few more. They are not identical, yet they are sufficiently similar for you to know that they most likely belong to the same species (or at least the same genus). You may need a botanist to tell you what species that is, but you certainly don’t need an expert to identify groups of similar-looking objects.

This is called clustering: it is the task of identifying similar instances and assigning them to clusters, or groups of similar instances.

Clustering is used in a wide variety of applications, including these:

For customer segmentation

You can cluster your customers based on their purchases and their activity on your website. This is useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment.

For example, customer segmentation can be useful in recommender systems to suggest content that other users in the same cluster enjoyed.

For data analysis

When you analyze a new dataset, it can be helpful to run a clustering algorithm, and then analyze each cluster separately.

As a dimensionality reduction technique

Once a dataset has been clustered, it is usually possible to measure each instance’s affinity with each cluster (affinity is any measure of how well an instance fits into a cluster). Each instance’s feature vector x can then be replaced with the vector of its cluster affinities. If there are k clusters, then this vector is k-dimensional. This vector is typically much lower-dimensional than the original feature vector, but it can preserve enough information for further processing.

For anomaly detection (also called outlier detection)

Any instance that has a low affinity to all the clusters is likely to be an anomaly. For example, if you have clustered the users of your website based on their behavior, you can detect users with unusual behavior, such as an unusual number of requests per second. Anomaly detection is particularly useful in detecting defects in manufacturing, or for fraud detection.

For semi-supervised learning

If you only have a few labels, you could perform clustering and propagate the labels to all the instances in the same cluster. This technique can greatly increase  the number of labels available for a subsequent supervised learning algorithm, and thus improve its performance.

For search engines

Some search engines let you search for images that are similar to a reference image. To build such a system, you would first apply a clustering algorithm to all the images in your database; similar images would end up in the same cluster.

Then when a user provides a reference image, all you need to do is use the trained clustering model to find this image’s cluster, and you can then simply return all the images from this cluster.

To segment an image

By clustering pixels according to their color, then replacing each pixel’s color with the mean color of its cluster, it is possible to considerably reduce the number of different colors in the image. Image segmentation is used in many object detection and tracking systems, as it makes it easier to detect the contour of each object.
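A minimal sketch of this color-based segmentation with scikit-learn might look like the following (the file name ladybug.png is only a placeholder for any RGB image):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

image = plt.imread("ladybug.png")       # placeholder image, assumed shape (height, width, 3)
X = image.reshape(-1, 3)                # one row per pixel, columns = R, G, B

kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(X)
segmented = kmeans.cluster_centers_[kmeans.labels_]    # replace each pixel with its cluster's mean color
segmented_img = segmented.reshape(image.shape)

plt.imshow(segmented_img)
plt.show()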

There is no universal definition of what a cluster is: it really depends on the context, and different algorithms will capture different kinds of clusters. Some algorithms look for instances centered around a particular point, called a centroid. Others look for continuous regions of densely packed instances: these clusters can take on any shape. Some algorithms are hierarchical, looking for clusters of clusters. And the list goes on.

K-Means Clustering

Consider the unlabeled dataset represented in Figure 9-2: you can clearly see five blobs of instances. The K-Means algorithm is a simple algorithm capable of clustering this kind of dataset very quickly and efficiently, often in just a few iterations. It was proposed by Stuart Lloyd at Bell Labs in 1957 as a technique for pulse-code modulation, but it was only published outside of the company in 1982. In 1965, Edward W. Forgy had published virtually the same algorithm, so K-Means is sometimes referred to as Lloyd–Forgy.

Let’s train a K-Means clusterer on this dataset. We will try to find each blob’s center and assign each instance to the closest blob, as in the sketch below. Note that you have to specify the number of clusters k that the algorithm must find. In this example, it is pretty obvious from looking at the data that k should be set to 5, but in general it is not that easy.
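A minimal sketch with scikit-learn, using make_blobs as a synthetic stand-in for the five-blob dataset in Figure 9-2:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in for the five-blob dataset
X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

k = 5                                    # the number of clusters must be specified up front
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)           # cluster index assigned to each instance

print(y_pred[:10])                       # labels of the first ten instances
print(kmeans.cluster_centers_)           # the five centroids found by the algorithm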

Each instance was assigned to one of the five clusters. In the context of clustering, an instance’s label is the index of the cluster that this instance gets assigned to by the algorithm: this is not to be confused with the class labels in classification (remember that clustering is an unsupervised learning task).

If you plot the clusters’ decision boundaries, you get a Voronoi tessellation (see Figure 9-3, where each centroid is represented with an X).


The vast majority of the instances were clearly assigned to the appropriate cluster, but a few instances were probably mislabeled (especially near the boundary between the top-left cluster and the central cluster). Indeed, the K-Means algorithm does not behave very well when the blobs have very different diameters because all it cares about when assigning an instance to a cluster is the distance to the centroid. 
Instead of assigning each instance to a single cluster, which is called hard clustering, it can be useful to give each instance a score per cluster, which is called soft clustering. The score can be the distance between the instance and the centroid; conversely, it can be a similarity score (or affinity), such as the Gaussian Radial Basis Function (RBF).
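A brief sketch of soft clustering scores, again on a synthetic blob dataset; the RBF width gamma below is an arbitrary choice for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

distances = kmeans.transform(X[:3])             # distance of three instances to each of the 5 centroids
gamma = 1.0                                     # assumed RBF width (a tuning parameter)
rbf_affinity = np.exp(-gamma * distances**2)    # Gaussian RBF similarity: closer centroid -> score near 1
print(distances.round(2))
print(rbf_affinity.round(2))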

K-Means Algorithm
The K-Means algorithm works as follows (a minimal NumPy sketch of these steps is given right after the list):
1. Specify the number of clusters K.
2. Initialize the centroids by shuffling the dataset and then randomly selecting K data points (without replacement) as the initial centroids.
3. Keep iterating until the centroids no longer change, i.e. the assignment of data points to clusters is stable:
    Compute the distance between each data point and every centroid.
    Assign each data point to the closest centroid (cluster).
    Recompute each centroid as the mean of all the data points assigned to its cluster.
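The following sketch implements these steps with NumPy; it assumes X is a 2-D array of shape (n_samples, n_features), and empty-cluster handling and other refinements are omitted:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # step 3: compute the distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # assign each point to its closest centroid
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):    # stop when the centroids no longer move
            break
        centroids = new_centroids
    return centroids, labels

# example usage on a small 2-D dataset
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
centroids, labels = kmeans(X, k=2)
print(centroids, labels)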

Example: Consider the one-dimensional data shown in Figure (a) below, consisting of the points 2, 3, 4, 10, 11, 12, 20, 25, and 30. Assume that we want to cluster the data into $k = 2$ groups. Let the initial centroids be $\mu_1 = 2$ and $\mu_2 = 4$. In the first iteration, we first compute the clusters, assigning each point to the closest mean, to obtain
$C_1 = \{2,3\}$ $C_2 = \{4,10,11,12,20,25,30\}$

We next update the means as follows:
$\mu_1 =\frac{2+3}{2}=2.5$
$\mu_2 =\frac{4+10+11+12+20+25+30}{7}=\frac{112}{7}=16$

The new centroids and clusters after the first iteration are shown in Figure (b). For the second step, we repeat the cluster assignment and centroid update steps, as shown in Figure (c), to obtain the new clusters:
$C_1 = \{2,3,4\}$ $C_2 = \{10,11,12,20,25,30\}$ and the new means:

$\mu_1 =\frac{2+3+4}{3}=\frac{9}{3}=3$
$\mu_2 =\frac{10+11+12+20+25+30}{6}=\frac{108}{6}=18$
The complete process until convergence is illustrated in the figure. The final clusters are given as
$C_1 = \{2,3,4,10,11,12\}$ $C_2 = \{20,25,30\}$
with representatives $\mu_1 = 7$ and $\mu_2 = 25$.
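For illustration, the same example can be reproduced with scikit-learn by passing the initial centroids explicitly (this code is a sketch, not part of the original worked example):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float).reshape(-1, 1)
init = np.array([[2.0], [4.0]])                 # the initial centroids mu_1 = 2 and mu_2 = 4

kmeans = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(kmeans.cluster_centers_.ravel())          # approximately [ 7. 25.]
print(kmeans.labels_)                           # cluster index of each of the nine points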



Example: k = 3, with A1, B1, and C1 chosen as the initial centroids. After the first iteration, the new centroids are (2, 10), (6, 6), and (1.5, 3.5).

Elbow Method for optimal value of K in K-Means

A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of K.


We now define the following:

Inertia: It is the sum of squared distances of samples to their closest cluster center.

We iterate over the values of K in a given range (for example, from 1 to 10), fit K-Means for each value, and record the inertia (distortion) obtained for each K.

To determine the optimal number of clusters, we select the value of K at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing in a roughly linear fashion.



Advantages of K-Means Clustering

  1. Simple and Easy to Implement

    • The algorithm is intuitive and easy to code.

  2. Efficient and Fast

    • Especially with small to medium-sized datasets; scales well with large data if optimized (e.g., using k-means++).

  3. Works Well with Distinct Clusters

    • Performs effectively when clusters are spherical and well-separated.

  4. Unsupervised Learning

    • Does not require labeled data, making it useful for exploratory data analysis.

  5. Interpretable Results

    • Cluster centroids are easy to understand and explain.


Disadvantages of K-Means Clustering

  1. Requires Predefined Number of Clusters (K)

    • You must specify the number of clusters in advance, which can be difficult without domain knowledge.

  2. Sensitive to Initial Centroids

    • Different initializations can lead to different results (though k-means++ helps reduce this issue; see the sketch after this list).

  3. Not Suitable for Non-Spherical Clusters

    • Struggles with clusters of different sizes, densities, or non-globular shapes.

  4. Affected by Outliers

    • Outliers can significantly distort the cluster centroids.

  5. Assumes Equal Importance of Features

    • Requires proper feature scaling; otherwise, features with larger ranges dominate the distance calculation.
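A brief sketch of how the initialization issue can be mitigated in scikit-learn (synthetic blob data assumed; k-means++ is scikit-learn’s default initialization, and n_init keeps the best of several runs):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=7)

# init="k-means++" spreads out the initial centroids; n_init=10 runs the algorithm
# ten times with different initializations and keeps the model with the lowest inertia
kmeans = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42).fit(X)
print(kmeans.inertia_)    # sum of squared distances of samples to their closest centroid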



Python code with sample data


import matplotlib.pyplot as plt

x = [4, 5, 10, 4, 3, 11, 14 , 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
#display the scatter plot of sample data
plt.scatter(x, y)
plt.show()

# finding the best value for k

from sklearn.cluster import KMeans

data = list(zip(x, y))
inertias = []

for i in range(1,11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1,11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

#creating clusters and plotting

kmeans = KMeans(n_clusters=2)
kmeans.fit(data)

plt.scatter(x, y, c=kmeans.labels_)
plt.show()




