Principal Component Analysis (PCA)



✅ What is PCA?

Principal Component Analysis (PCA) is a dimensionality reduction technique used to:

  • Reduce the number of variables in a dataset,

  • Retain the most important patterns,

  • Remove redundancy (correlation),

  • Improve efficiency in modeling or visualization.

It transforms the original variables into a new set of uncorrelated variables called principal components, ordered by the amount of variance they explain in the data.
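Because the components are both uncorrelated and ordered by explained variance, you can verify these two properties directly. Below is a minimal sketch on synthetic, correlated data (the dataset and variable names are made up for illustration):

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: three features, two of them strongly correlated (illustrative only)
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = np.hstack([
    z + 0.1 * rng.normal(size=(200, 1)),
    2 * z + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1))
])

pca = PCA()
scores = pca.fit_transform(X)

# Variance explained by each component, in decreasing order
print(pca.explained_variance_ratio_)

# Off-diagonal correlations between the new features are (numerically) zero
print(np.round(np.corrcoef(scores, rowvar=False), 3))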


✅ Why Use PCA?

PCA is used when:

  • Your dataset has many correlated features.

  • You want to visualize high-dimensional data in 2D or 3D.

  • You want to speed up machine learning models by reducing the number of input features.

  • You want to denoise or compress data.


✅ Key Concepts

🔹 1. Variance

  • Measures the spread or information in the data.

  • PCA tries to retain directions with maximum variance.

🔹 2. Principal Components

  • New axes/directions formed by linear combinations of original variables.

  • First principal component (PC1) captures the most variance, followed by PC2, etc.

🔹 3. Orthogonality

  • All principal components are uncorrelated (orthogonal).


✅ How PCA Works – Step-by-Step

Let’s break PCA into intuitive steps.

🔸 Step 1: Standardize the Data

  • Center the data (subtract the mean).

  • Scale the data (if required).

$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$

🔸 Step 2: Compute the Covariance Matrix

$\text{Cov}(X) = \frac{1}{n-1} X^T X$

This captures the pairwise relationships (covariances) between variables; here X is the centered data from Step 1.

🔸 Step 3: Compute Eigenvalues and Eigenvectors

  • Use Eigen Decomposition of the covariance matrix.

  • Eigenvectors = directions (principal components)

  • Eigenvalues = amount of variance in each direction

🔸 Step 4: Select the Top k Components

  • Choose the first k components that capture most of the variance (e.g., 95%).

$\text{Explained Variance Ratio} = \frac{\lambda_k}{\sum_{i=1}^{d} \lambda_i}$

🔸 Step 5: Project the Data

  • Transform the data into the new space:

$X_{\text{reduced}} = X \cdot W_k$

where $W_k$ is the matrix whose columns are the top $k$ eigenvectors.
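The five steps above map directly onto a few lines of NumPy. The following is a minimal from-scratch sketch (function and variable names are illustrative; np.linalg.eigh is used because the covariance matrix is symmetric):

import numpy as np

def pca_from_scratch(X, k):
    # Step 1: standardize each column (center, then scale)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 2: covariance matrix of the standardized data
    cov = np.cov(X_std, rowvar=False)

    # Step 3: eigen decomposition (eigh handles symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Sort directions by decreasing variance
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 4: keep the top k components and their explained variance ratios
    W_k = eigvecs[:, :k]
    explained = eigvals[:k] / eigvals.sum()

    # Step 5: project the data onto the new axes
    X_reduced = X_std @ W_k
    return X_reduced, W_k, explained

Note that the sign of each eigenvector is arbitrary, so results from different implementations can differ by a sign flip of a component.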

✅ Applications of PCA

Application Area | Purpose
Face Recognition | Reduce pixel data to key features
Genomics         | Reduce thousands of gene expressions
Finance          | Reduce correlated financial indicators
Text Mining      | Dimensionality reduction after TF-IDF
Preprocessing    | Before clustering/classification

✅ Limitations of PCA

  • Assumes linear relationships.

  • Sensitive to scaling (standardization is essential).

  • Requires centered data; not ideal when interpretability of the original features is key.

  • Principal components are linear combinations — may lose physical meaning.


✅ Summary

Feature  | Description
Goal     | Reduce dimensions while preserving variance
Method   | Orthogonal transformation to a new feature space
Based On | Eigen decomposition or SVD
Output   | New uncorrelated features (principal components)
Use Case | Preprocessing, noise reduction, visualization
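As the table notes, the components can be obtained either from the eigen decomposition of the covariance matrix or from the singular value decomposition (SVD) of the centered data (scikit-learn's PCA, for example, is implemented via SVD). A short sketch on made-up random data, checking that the two routes agree up to the arbitrary sign of each component:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)

# Route 1: eigen decomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Same directions (up to sign), and eigenvalues equal S**2 / (n - 1)
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))
print(np.allclose(eigvals, S**2 / (len(Xc) - 1)))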

Toy PCA Example with Dummy Data (2D to 1D Reduction)

🎯 Goal:

Reduce 2D data to 1D using PCA and understand how it works manually.


📌 Step 1: Create Dummy Data

We’ll start with a small dataset of 5 points in 2D:

$X = \begin{bmatrix} 2.5 & 2.4 \\ 0.5 & 0.7 \\ 2.2 & 2.9 \\ 1.9 & 2.2 \\ 3.1 & 3.0 \end{bmatrix}$

Each row represents a sample with two features (like Height and Weight).

📌 Step 2: Center the Data

Subtract the mean of each column (the two features are on similar scales, so no further scaling is needed):

$\text{Mean} = \left[\mu_1, \mu_2\right] = \left[2.04,\ 2.24\right]$

$X_{\text{centered}} = \begin{bmatrix} 0.46 & 0.16 \\ -1.54 & -1.54 \\ 0.16 & 0.66 \\ -0.14 & -0.04 \\ 1.06 & 0.76 \end{bmatrix}$

📌 Step 3: Compute the Covariance Matrix

$\text{Cov}(X) = \frac{1}{n-1} X_{\text{centered}}^T X_{\text{centered}} = \begin{bmatrix} 0.9380 & 0.8405 \\ 0.8405 & 0.8530 \end{bmatrix}$

📌 Step 4: Compute Eigenvalues and Eigenvectors

The eigenvalues of the covariance matrix are:

  • λ₁ ≈ 1.737

  • λ₂ ≈ 0.054

The corresponding eigenvectors (principal components):

$v_1 = \begin{bmatrix} 0.7247 \\ 0.6890 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} -0.6890 \\ 0.7247 \end{bmatrix}$
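These numbers can be checked with NumPy (a small sketch; np.linalg.eigh returns the eigenvalues in ascending order, and the sign of each eigenvector is arbitrary):

import numpy as np

X_centered = np.array([
    [ 0.46,  0.16],
    [-1.54, -1.54],
    [ 0.16,  0.66],
    [-0.14, -0.04],
    [ 1.06,  0.76]
])

cov = np.cov(X_centered, rowvar=False)
print(cov)        # approx. [[0.938, 0.8405], [0.8405, 0.853]]

eigvals, eigvecs = np.linalg.eigh(cov)
print(eigvals)    # approx. [0.054, 1.737]
print(eigvecs)    # columns are the eigenvectors (PC1 is the last column here)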

📌 Step 5: Select Top Component(s)

Since λ₁ ≫ λ₂ (PC1 alone explains 1.737 / (1.737 + 0.054) ≈ 97% of the variance), we retain only PC1.


📌 Step 6: Project Data onto PC1

Let’s compute the projection of the first sample:

$[0.46,\ 0.16] \cdot \begin{bmatrix} 0.7247 \\ 0.6890 \end{bmatrix} = 0.46 \times 0.7247 + 0.16 \times 0.6890 = 0.3334 + 0.1102 = \mathbf{0.4436}$

Repeat this for each row to get the 1D representation.
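Doing this for every row at once is a single matrix product. A small sketch using the centered data and the first eigenvector from above:

import numpy as np

X_centered = np.array([
    [ 0.46,  0.16],
    [-1.54, -1.54],
    [ 0.16,  0.66],
    [-0.14, -0.04],
    [ 1.06,  0.76]
])
v1 = np.array([0.7247, 0.6890])

# Project every centered sample onto PC1
print(X_centered @ v1)   # approx. [ 0.4436, -2.1772,  0.5707, -0.1290,  1.2919]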

📉 Summary of Output

Sample | Original (2D) | Projected (1D)
1      | [2.5, 2.4]    |  0.4436
2      | [0.5, 0.7]    | -2.1772
3      | [2.2, 2.9]    |  0.5707
4      | [1.9, 2.2]    | -0.1290
5      | [3.1, 3.0]    |  1.2919

✅ Interpretation

  • The data originally lived in 2D space.

  • PCA finds the best line that captures the spread of the data.

  • We project the data onto this line → get 1D compressed version.

  • Most of the variation is retained (since λ₁ ≫ λ₂).


📌 Python Code

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Step 1: Dummy data
X = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0]
])

# Step 2: Mean-center the data (scikit-learn's PCA also centers internally)
X_centered = X - np.mean(X, axis=0)

# Step 3-5: PCA
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X_centered)

# Step 6: Show original and projected
print("Original Data:\n", X)
print("Projected 1D Data:\n", X_pca)

# Optional: Plot original points and their projections onto the PC1 line
X_approx = pca.inverse_transform(X_pca) + np.mean(X, axis=0)  # back to original coordinates
plt.figure(figsize=(6, 4))
plt.scatter(X[:, 0], X[:, 1], color='blue', label='Original Data')
plt.scatter(X_approx[:, 0], X_approx[:, 1], color='red', marker='x', label='Projection onto PC1')
for i in range(len(X)):
    plt.plot([X[i, 0], X_approx[i, 0]],
             [X[i, 1], X_approx[i, 1]], 'r--', linewidth=1)
plt.title("PCA Projection to 1D")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.axis('equal')
plt.grid(True)
plt.show()

Output
Original Data:
 [[2.5 2.4]
 [0.5 0.7]
 [2.2 2.9]
 [1.9 2.2]
 [3.1 3. ]]
Projected 1D Data:
 [[ 0.44362444]
 [-2.17719404]
 [ 0.57071239]
 [-0.12902465]
 [ 1.29188186]]


📌 Python Code: Visualizing the 1D Projection in 2D Space

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Step 1: Define dummy 2D data
X = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0]
])

# Step 2: Mean center the data
X_meaned = X - np.mean(X, axis=0)

# Step 3: Apply PCA to reduce to 1D
pca = PCA(n_components=1)
X_1D = pca.fit_transform(X_meaned)
X_projected = pca.inverse_transform(X_1D)

# Step 4: Plotting

plt.figure(figsize=(10, 6))

# Plot original data
plt.scatter(X_meaned[:, 0], X_meaned[:, 1], color='blue', label='Original Data')

# Plot projected data (back in 2D space)
plt.scatter(X_projected[:, 0], X_projected[:, 1], color='red',
            label='Projected (1D -> 2D)', marker='x')

# Draw lines connecting original and projected points
for i in range(len(X)):
    plt.plot([X_meaned[i, 0], X_projected[i, 0]],
             [X_meaned[i, 1], X_projected[i, 1]],
             'gray', linestyle='--', linewidth=1)

# Plot first principal component as arrow
pc1 = pca.components_[0]
origin = np.zeros(2)
plt.quiver(*origin, *pc1, scale=3, color='green',
           label='Principal Component 1', width=0.01)

plt.title("PCA: Original Data and 1D Projection in 2D Space")
plt.xlabel("Feature 1 (centered)")
plt.ylabel("Feature 2 (centered)")
plt.axis('equal')
plt.grid(True)
plt.legend()
plt.show()



✅ Python Code Example with Iris Dataset

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load dataset
data = load_iris()
X = data.data
y = data.target
labels = data.target_names

# Step 1: Standardize
X_std = StandardScaler().fit_transform(X)

# Step 2-4: Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# Step 5: Plot
plt.figure(figsize=(8, 6))
for i, label in enumerate(labels):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], label=label)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()
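To see how much information the 2D plot retains, it is also worth printing the explained variance ratio of the two components. A short, self-contained sketch (for the standardized Iris data the first two components together capture roughly 96% of the variance):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_std = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X_std)

# Proportion of variance captured by PC1 and PC2 (roughly 0.73 and 0.23)
print(pca.explained_variance_ratio_)
print("Total retained:", pca.explained_variance_ratio_.sum())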
