PCA & Dimensionality Reduction

Eigendecomposition as the Foundation of PCA

ml-math
intermediate
A mathematical derivation of PCA from variance maximization to the eigenvalue problem, the SVD perspective, proportion of variance explained, reconstruction, kernel PCA, and a sketch of t-SNE/UMAP.

1 Why This Matters

Note: Why This Matters for Your Work

PCA is the eigendecomposition of the covariance matrix. Without linear algebra, PCA is a black box. With linear algebra:

  • PCA becomes geometrically intuitive: we are looking for the directions of maximum variance
  • You understand why the scree plot is useful (eigenvalue = variance explained)
  • You know why loadings and scores have different interpretations
  • You can tell PCA apart from factor analysis (the two are frequently conflated)

For data analysts: PCA is essential for exploratory analysis, denoising, visualization, and preprocessing before modeling.


2 Motivation: The Curse of Dimensionality

When the dimension \(p\) is large:

  1. Sparse data: the volume of the hypercube grows exponentially, so all data points end up “far apart”
  2. Computational cost: many algorithms scale as \(p^2\) or \(p^3\)
  3. Visualization: humans can only visualize 2D or 3D
  4. Overfitting: high-dimensional data overfits easily (too many parameters)

Solution: reduce to a low-dimensional representation that preserves the most important information.
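The “far apart” claim can be checked with a quick simulation. The sketch below (base R; the choices of n = 100 points and dimensions 2 vs 500 are arbitrary) shows that pairwise distances concentrate as \(p\) grows: the nearest and farthest neighbors become almost indistinguishable.

```r
# Distance concentration: the ratio of the smallest to the largest
# pairwise distance moves toward 1 as the dimension p grows
set.seed(1)
concentration <- function(p, n = 100) {
  X <- matrix(rnorm(n * p), n, p)  # n standard-normal points in p dimensions
  d <- dist(X)                     # all n(n-1)/2 pairwise Euclidean distances
  min(d) / max(d)                  # near 0 in low dim, near 1 in high dim
}
cat("p =   2:", round(concentration(2), 3), "\n")
cat("p = 500:", round(concentration(500), 3), "\n")
```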


3 PCA: Derivation from Variance Maximization

3.1 Setup

Data matrix: \(X \in \mathbb{R}^{n \times p}\), assumed to be already centered (\(\bar{X} = 0\)).

Sample covariance matrix: \(\hat{\Sigma} = \frac{1}{n-1}X^TX \in \mathbb{R}^{p \times p}\)

3.2 Optimization Problem

Find the unit vector \(\mathbf{w}_1 \in \mathbb{R}^p\) that maximizes the variance of the projection:

\[\text{Var}(X\mathbf{w}_1) = \mathbf{w}_1^T \hat{\Sigma} \mathbf{w}_1\]

subject to \(\|\mathbf{w}_1\|^2 = \mathbf{w}_1^T\mathbf{w}_1 = 1\).

Lagrangian: \[\mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T\hat{\Sigma}\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)\]

First-order conditions: \[\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\hat{\Sigma}\mathbf{w} - 2\lambda\mathbf{w} = 0\]

\[\hat{\Sigma}\mathbf{w} = \lambda\mathbf{w}\]

This is an eigenvalue problem! \(\mathbf{w}_1\) must be an eigenvector of \(\hat{\Sigma}\).

3.3 Maximum Variance = Largest Eigenvalue

Substituting \(\hat{\Sigma}\mathbf{w}_1 = \lambda_1\mathbf{w}_1\) into the objective:

\[\mathbf{w}_1^T\hat{\Sigma}\mathbf{w}_1 = \mathbf{w}_1^T(\lambda_1\mathbf{w}_1) = \lambda_1\|\mathbf{w}_1\|^2 = \lambda_1\]

The maximized variance equals \(\lambda_1\) (the largest eigenvalue).

So: \(\mathbf{w}_1\) is the eigenvector corresponding to the largest eigenvalue \(\lambda_1\).
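This can be verified numerically. A minimal base R sketch (the 5-dimensional synthetic data and the 2000 random directions are arbitrary choices): no random unit vector beats the top eigenvector's projected variance, exactly as the Rayleigh quotient argument guarantees.

```r
set.seed(7)
X <- scale(matrix(rnorm(200 * 5), 200, 5), center = TRUE, scale = FALSE)
S <- cov(X)
e <- eigen(S)
lambda1 <- e$values[1]                 # variance along the top eigenvector

# Projected variance w^T S w for many random unit vectors w
rand_var <- replicate(2000, {
  w <- rnorm(5)
  w <- w / sqrt(sum(w^2))              # normalize to a unit vector
  drop(t(w) %*% S %*% w)
})

cat("lambda_1:             ", round(lambda1, 4), "\n")
cat("best random direction:", round(max(rand_var), 4), "\n")  # never exceeds lambda_1
```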


4 Subsequent Principal Components

4.1 The Second PC

Find \(\mathbf{w}_2\) that maximizes the variance subject to orthogonality with \(\mathbf{w}_1\):

\[\max \mathbf{w}_2^T\hat{\Sigma}\mathbf{w}_2 \quad \text{s.t.} \quad \|\mathbf{w}_2\| = 1, \quad \mathbf{w}_1^T\mathbf{w}_2 = 0\]

Solution: the eigenvector corresponding to the second-largest eigenvalue \(\lambda_2\).

4.2 The Full Eigendecomposition

\(\hat{\Sigma}\) is real, symmetric, and positive semidefinite, so it is diagonalizable with orthonormal eigenvectors:

\[\hat{\Sigma} = V\Lambda V^T\]

where:

  • \(V = [\mathbf{v}_1, \ldots, \mathbf{v}_p]\): matrix of orthonormal eigenvectors (loadings)
  • \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_p)\) with \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0\)

Principal components (scores): \(Z = XV\), the projections onto the eigenvector directions.


5 PCA via SVD

Important Definition: PCA via SVD (Preferred Approach)

Center data: \(\tilde{X} = X - \bar{X}\) (subtract column means).

SVD: \(\tilde{X} = U\Sigma V^T\), where:

  • \(U \in \mathbb{R}^{n \times p}\): left singular vectors (normalized scores)
  • \(\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_p)\): singular values (\(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\))
  • \(V \in \mathbb{R}^{p \times p}\): right singular vectors (loadings)

Relationship to the eigendecomposition: \[\hat{\Sigma} = \frac{1}{n-1}\tilde{X}^T\tilde{X} = \frac{1}{n-1}V\Sigma^2 V^T\]

So the eigenvectors of \(\hat{\Sigma}\) are \(V\), the right singular vectors of \(\tilde{X}\).

Eigenvalues: \(\lambda_k = \sigma_k^2/(n-1)\)

PC scores: \(Z = \tilde{X}V = U\Sigma\) (matrix of projections)

Column \(k\) of \(Z\) holds the scores for the \(k\)-th PC.

SVD is more numerically stable than directly eigendecomposing \(\hat{\Sigma}\), because forming \(\tilde{X}^T\tilde{X}\) squares the condition number.


6 Proportion of Variance Explained

6.1 Formula

Total variance = \(\text{tr}(\hat{\Sigma}) = \sum_k \lambda_k = \sum_k \sigma_k^2/(n-1)\)

The Proportion of Variance Explained (PVE) by PC \(k\):

\[\text{PVE}_k = \frac{\lambda_k}{\sum_{j=1}^p \lambda_j} = \frac{\sigma_k^2}{\sum_{j=1}^p \sigma_j^2}\]

Cumulative PVE: \[\text{CPVE}_m = \sum_{k=1}^m \text{PVE}_k\]

6.2 Scree Plot

Plot \(\lambda_k\) (or PVE) against \(k\) and look for the “elbow”, the point where the eigenvalues start to flatten out.

Rules of thumb:

  • Elbow rule: keep the components up to the elbow
  • Threshold: choose \(m\) such that CPVE \(\geq\) 80% (or 90%)
  • Kaiser rule: keep PCs with \(\lambda_k > 1\) (more than the average variance, for standardized data)
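Applied to a concrete case, these rules need only the eigenvalues. A sketch on the built-in iris data (standardized, so the Kaiser threshold of 1 equals the average variance; the 80% cutoff follows the text):

```r
# Choosing the number of components m on the iris data
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
lambda <- pca$sdev^2            # eigenvalues of the correlation matrix
pve <- lambda / sum(lambda)     # proportion of variance explained per PC
cpve <- cumsum(pve)             # cumulative PVE

print(round(rbind(lambda, pve, cpve), 3))
cat("Threshold rule (CPVE >= 0.80): m =", which(cpve >= 0.80)[1], "\n")
cat("Kaiser rule (lambda > 1):      m =", sum(lambda > 1), "\n")
```

Note how the rules can disagree: here the threshold rule keeps two components while the Kaiser rule keeps one.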


7 Reconstruction and Compression

7.1 Rank-\(k\) Approximation

Truncated SVD: \(\hat{X}_k = U_k\Sigma_k V_k^T\)

where \(U_k \in \mathbb{R}^{n \times k}\), \(\Sigma_k \in \mathbb{R}^{k \times k}\), \(V_k \in \mathbb{R}^{p \times k}\).

Equivalently: \(\hat{\tilde{X}}_k = \tilde{X}V_kV_k^T\) (project and back-project).

Eckart–Young–Mirsky Theorem: the rank-\(k\) truncated SVD is the best rank-\(k\) approximation in Frobenius norm:

\[\hat{\tilde{X}}_k = \arg\min_{\text{rank}(B) \leq k} \|\tilde{X} - B\|_F^2\]

Reconstruction error:

\[\|\tilde{X} - \hat{\tilde{X}}_k\|_F^2 = \sum_{j=k+1}^p \sigma_j^2\]

The error is the sum of the squared discarded singular values.

7.2 Compression Ratio

Original data: \(np\) numbers. Compressed (rank \(k\)): \(nk + kp + k = k(n+p+1)\) numbers (scores + loadings + singular values).

Compression pays off when \(k \ll \min(n, p)\).
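As a concrete illustration with hypothetical sizes (n = 1000 observations, p = 200 variables, k = 10 components):

```r
# Storage: full data matrix vs rank-k factors (scores + loadings + singular values)
n <- 1000; p <- 200; k <- 10
full       <- n * p            # 200000 numbers
compressed <- k * (n + p + 1)  # 12010 numbers
cat("compression ratio:", round(full / compressed, 2), "x\n")  # 16.65 x
```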


8 Kernel PCA

8.1 Motivation

PCA finds linear subspaces. For nonlinear structure, use the kernel trick.

Kernel PCA: PCA in the feature space \(\phi(x)\), using the kernel trick so that \(\phi\) never has to be computed explicitly.

8.2 Derivation

Standard PCA on the mapped data \(\Phi = [\phi(x_1), \ldots, \phi(x_n)]^T\):

Covariance: \(\hat{\Sigma}_\phi = \frac{1}{n}\Phi^T\Phi\)

Eigenvalue problem: \(\hat{\Sigma}_\phi v = \lambda v\)

By a representer-style argument, each eigenvector lies in the span of the mapped data and can be written as \(v = \frac{1}{\sqrt{n\lambda}}\Phi^T\alpha\); substituting into the eigenvalue problem gives:

\[K\alpha = n\lambda\alpha\]

where \(K_{ij} = k(x_i, x_j)\) is the Gram matrix (which, in practice, must first be centered in feature space).

The kernel PCA score of a data point \(x\) on component \(k\): \[z_k(x) = \sum_{i=1}^n \hat{\alpha}_{ik} k(x_i, x)\]
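A from-scratch sketch of this derivation in base R (the two-ring data and the RBF kernel with gamma = 1 are illustrative choices; note one detail the derivation above glosses over, namely that the Gram matrix must be centered in feature space before solving the eigenproblem):

```r
set.seed(3)
# Two concentric rings: no linear projection separates them
theta <- runif(100, 0, 2 * pi)
r <- rep(c(1, 3), each = 50)
X <- cbind(r * cos(theta), r * sin(theta)) + 0.1 * matrix(rnorm(200), 100, 2)

# RBF Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
gamma <- 1
K <- exp(-gamma * as.matrix(dist(X))^2)

# Center K in feature space: K_c = (I - J) K (I - J) with J = 11^T / n
n <- nrow(K)
J <- matrix(1 / n, n, n)
Kc <- K - J %*% K - K %*% J + J %*% K %*% J

# Solve the eigenproblem; the training-point scores are Kc alpha = eigval * eigvec
e <- eigen(Kc, symmetric = TRUE)
scores <- e$vectors[, 1:2] %*% diag(e$values[1:2])

# With a suitable gamma, the rings typically separate along the leading kernel PCs
cat("class means on kernel PC1:",
    round(tapply(scores[, 1], rep(c("inner", "outer"), each = 50), mean), 2), "\n")
```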


9 t-SNE and UMAP (Brief Sketch)

9.1 t-SNE (t-distributed Stochastic Neighbor Embedding)

Motivation: PCA is linear, so it cannot unroll manifold structure such as the “Swiss roll”.

Idea: preserve local neighborhood structure.

  1. Compute affinities in the high-dimensional space: \(p_{ij} \propto \exp(-\|x_i - x_j\|^2/(2\sigma_i^2))\) (Gaussian)

  2. Compute affinities in the low-dimensional space: \(q_{ij} \propto (1 + \|y_i-y_j\|^2)^{-1}\) (Student t with 1 df)

  3. Minimize the KL divergence: \(\min \sum_{ij} p_{ij}\log\frac{p_{ij}}{q_{ij}}\)

The t-distribution in the low-dimensional space has heavy tails, so moderately distant points get “more space”, which avoids the crowding problem.

Limitations: t-SNE does not preserve global structure, is non-deterministic, and is slow (\(O(n^2)\) in its exact form).

9.2 UMAP (Uniform Manifold Approximation and Projection)

UMAP has a stronger mathematical foundation than t-SNE (Riemannian geometry plus topological data analysis).

Key idea: construct a fuzzy topological representation of the data, then optimize a low-dimensional embedding to have a “similar” fuzzy topology.

UMAP is faster (empirically \(O(n^{1.14})\)), preserves global structure better, and has more interpretable hyperparameters.


10 Worked Example: PCA Step-by-Step

library(tidyverse)

# ==========================================
# 1. PCA dari Scratch (Manual)
# ==========================================

# Generate 4D data with known structure
set.seed(42)
n <- 100
# Two latent factors
F1 <- rnorm(n)
F2 <- rnorm(n)
X <- cbind(
  2*F1 + 0.3*rnorm(n),      # highly influenced by F1
  1.5*F1 + 0.3*rnorm(n),    # highly influenced by F1
  2*F2 + 0.3*rnorm(n),      # highly influenced by F2
  1.5*F2 + 0.3*rnorm(n)     # highly influenced by F2
)
colnames(X) <- c("X1", "X2", "X3", "X4")

# Step 1: Center data
X_centered <- scale(X, center = TRUE, scale = FALSE)

# Step 2: Compute covariance matrix
S <- cov(X_centered)
cat("Sample covariance matrix:\n")
print(round(S, 3))

# Step 3: Eigendecomposition
eigen_result <- eigen(S)
cat("\nEigenvalues:\n", round(eigen_result$values, 4), "\n")
cat("Proportion of variance:\n",
    round(eigen_result$values / sum(eigen_result$values), 4), "\n")
cat("Cumulative PVE:\n",
    round(cumsum(eigen_result$values / sum(eigen_result$values)), 4), "\n")

# Step 4: Loadings (eigenvectors)
cat("\nLoadings (eigenvectors):\n")
print(round(eigen_result$vectors, 4))

# Step 5: Scores (projections)
scores <- X_centered %*% eigen_result$vectors
cat("\nFirst few PC scores:\n")
print(head(scores, 3))

# ==========================================
# 2. Using prcomp()
# ==========================================
pca_result <- prcomp(X, center = TRUE, scale. = FALSE)

cat("\n--- prcomp() results ---\n")
print(summary(pca_result))

# Loadings
cat("\nRotation (Loadings):\n")
print(round(pca_result$rotation, 4))

# Scree plot
plot(pca_result, type = "l", main = "Scree Plot")

# Biplot
biplot(pca_result, main = "PCA Biplot")

# ==========================================
# 3. SVD Relationship
# ==========================================
X_c <- scale(X, center = TRUE, scale = FALSE)
svd_result <- svd(X_c)

# Verify: eigenvalues of S = sigma^2 / (n-1)
cat("\nEigenvalues from cov matrix:", round(eigen_result$values, 4), "\n")
cat("Singular values squared / (n-1):",
    round(svd_result$d^2 / (nrow(X) - 1), 4), "\n")

# ==========================================
# 4. Reconstruction Error
# ==========================================
X_recon_2 <- tcrossprod(pca_result$x[, 1:2], pca_result$rotation[, 1:2])
recon_error_2 <- sum((X_c - X_recon_2)^2)
total_var <- sum(X_c^2)

cat(sprintf("\nReconstruction error (2 PCs): %.2f%%\n",
            100 * recon_error_2 / total_var))
cat(sprintf("Variance NOT explained by 2 PCs: %.2f%%\n",
            100 * (1 - sum(pca_result$sdev[1:2]^2) / sum(pca_result$sdev^2))))

# ==========================================
# 5. Applied: iris dataset
# ==========================================
data(iris)
pca_iris <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
cat("\n--- Iris PCA ---\n")
print(summary(pca_iris))

# Visualize in 2D
df_pca <- data.frame(pca_iris$x[,1:2], species = iris$Species)
ggplot(df_pca, aes(PC1, PC2, color = species)) +
  geom_point(alpha = 0.7) +
  labs(title = "Iris: First 2 Principal Components") +
  theme_minimal()

Key observations:

  1. Eigenvalues = variance along each PC direction
  2. PC1 and PC2 line up with F1 and F2, the latent structure we built in
  3. Two PCs already explain roughly 95%+ of the variance
  4. Large loadings indicate which variables dominate each PC


Caution: Connections to Factor Analysis, ICA, and Autoencoders

PCA has many relatives:

  • Factor Analysis: differs from PCA. FA assumes a latent model \(X = \Lambda F + \varepsilon\) where \(\varepsilon\) collects the unique factors (each with its own variance). FA is more “generative” and interpretable, but makes stronger assumptions.

  • ICA (Independent Component Analysis): finds components that are statistically independent (not merely uncorrelated). Useful for blind source separation (the cocktail party problem); FastICA is the standard algorithm.

  • NMF (Non-negative Matrix Factorization): \(X \approx WH\) with \(W, H \geq 0\). Useful for non-negative data (counts, frequencies).

  • Autoencoders: neural networks that learn a low-dimensional representation by minimizing reconstruction loss; a nonlinear generalization of PCA. The Variational Autoencoder (VAE) is the probabilistic version.

  • Truncated SVD vs PCA: in text analysis (LSA/LSI), SVD is applied to the TF-IDF matrix rather than a covariance matrix; conceptually different but mathematically related.


11 Practice Problems

Problem 1: Show that the PC scores are uncorrelated and have variance equal to the corresponding eigenvalue.

\(\text{Cov}(Z) = V^T\hat{\Sigma}V = V^T(V\Lambda V^T)V = \Lambda\) (diagonal)

So \(\text{Cov}(Z_j, Z_k) = 0\) for \(j \neq k\) (uncorrelated), and \(\text{Var}(Z_k) = \lambda_k\).
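The claim can also be checked numerically with prcomp (a sketch on the built-in iris data): the sample covariance of the scores is diagonal, with the eigenvalues on the diagonal.

```r
# Cov(Z) is diagonal, and its diagonal equals the eigenvalues (sdev^2)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
C <- cov(pca$x)                              # covariance of the PC scores
max_offdiag <- max(abs(C - diag(diag(C))))   # should be ~ machine epsilon
cat("largest off-diagonal of Cov(Z):", signif(max_offdiag, 2), "\n")
cat("diag(Cov(Z)) == eigenvalues?   ",
    isTRUE(all.equal(unname(diag(C)), unname(pca$sdev^2))), "\n")
```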

Problem 2: Prove that the total variance is unchanged by the PCA transformation.

\(\text{tr}(\text{Cov}(Z)) = \text{tr}(\Lambda) = \sum_k \lambda_k = \text{tr}(\hat{\Sigma})\)

PCA is just a rotation; it does not change the total variance.

Problem 3: Why standardize before PCA?

If the variables are in different units (e.g., income in millions, age in years), PCA will be dominated by the variable with the largest variance (income). Standardizing (dividing by the standard deviation) ensures that all variables contribute equally.

To decide whether to standardize, ask whether the differing units are meaningful. If everything is in the same units (pixels, gene expression levels), standardizing is usually unnecessary.
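A quick demonstration of this dominance effect (a sketch with synthetic data; the units for age and income are hypothetical):

```r
set.seed(9)
# age in years, income in currency units: wildly different variances
age    <- rnorm(200, 40, 10)
income <- rnorm(200, 50000, 15000)
X <- cbind(age, income)

# Without scaling, PC1 is essentially the income axis
v_raw <- prcomp(X, scale. = FALSE)$rotation[, 1]
# With scaling, both variables contribute (loadings of about 0.707 each,
# exact for a 2x2 correlation matrix)
v_std <- prcomp(X, scale. = TRUE)$rotation[, 1]
cat("PC1 loadings, raw:         ", round(v_raw, 3), "\n")
cat("PC1 loadings, standardized:", round(v_std, 3), "\n")
```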

Problem 4: What is the difference between loadings and scores?

Loadings (\(V\)): the coefficients that define each PC as a linear combination of the original variables. Interpretation: each variable's contribution to the PC.

Scores (\(Z = XV\)): the coordinates of each observation in PC space. Interpretation: the position of each data point in the reduced space.

Problem 5: Derive the reconstruction error formula.

\(\|\tilde{X} - \hat{\tilde{X}}_k\|_F^2 = \|\tilde{X}(I-V_kV_k^T)\|_F^2 = \text{tr}[(I-V_kV_k^T)\tilde{X}^T\tilde{X}(I-V_kV_k^T)]\)

Using the cyclicity of the trace and the fact that \(V_kV_k^T\) is a projection: \(= \text{tr}[\tilde{X}^T\tilde{X}] - \text{tr}[V_k^T\tilde{X}^T\tilde{X}V_k] = \sum_j \sigma_j^2 - \sum_{j=1}^k \sigma_j^2 = \sum_{j=k+1}^p \sigma_j^2\)


Navigation: ← Information Theory | Neural Network Math →