PCA & Dimensionality Reduction
Eigendecomposition as the Foundation of PCA
1 Why This Matters
PCA is the eigendecomposition of the covariance matrix. Without linear algebra, PCA is a black box. With linear algebra:
- PCA becomes geometrically intuitive: we are looking for the directions of maximum variance
- You understand why the scree plot is useful (eigenvalue = variance explained)
- You know why loadings and scores have different interpretations
- You can tell PCA apart from factor analysis (two things that are often conflated)
For data analysts, PCA is essential for exploratory analysis, denoising, visualization, and preprocessing before modeling.
2 Motivation: The Curse of Dimensionality
When the dimension \(p\) is large:
1. Sparse data: the volume of the hypercube grows exponentially, so all data points end up “far apart”
2. Computational cost: many algorithms scale as \(p^2\) or \(p^3\)
3. Visualization: humans can only visualize 2D or 3D
4. Overfitting: high-dimensional data overfits easily (too many parameters)
Solution: reduce to a low-dimensional representation that preserves the most important information.
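The distance-concentration effect behind point 1 is easy to see numerically. A minimal NumPy sketch (the sample sizes and dimensions are chosen arbitrarily for illustration): as \(p\) grows, the gap between the nearest and farthest pair of random points shrinks relative to the distances themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
contrast = {}
for p in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, p))              # random points in the unit hypercube
    sq = (X ** 2).sum(axis=1)                 # squared norms
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    d = np.sqrt(np.clip(d2, 0, None))[np.triu_indices(n, k=1)]
    contrast[p] = (d.max() - d.min()) / d.min()    # relative spread of distances
    print(f"p={p:5d}  relative contrast = {contrast[p]:.2f}")
```

In low dimensions the contrast is large (some pairs are far closer than others); in high dimensions it collapses, which is why distance-based methods degrade.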
3 PCA: Derivation via Variance Maximization
3.1 Setup
Data matrix: \(X \in \mathbb{R}^{n \times p}\), assumed to be centered (\(\bar{X} = 0\)).
Sample covariance matrix: \(\hat{\Sigma} = \frac{1}{n-1}X^TX \in \mathbb{R}^{p \times p}\)
3.2 Optimization Problem
Find the unit vector \(\mathbf{w}_1 \in \mathbb{R}^p\) that maximizes the variance of the projection:
\[\text{Var}(\mathbf{w}_1^TX) = \mathbf{w}_1^T \hat{\Sigma} \mathbf{w}_1\]
subject to \(\|\mathbf{w}_1\|^2 = \mathbf{w}_1^T\mathbf{w}_1 = 1\).
Lagrangian: \[\mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T\hat{\Sigma}\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)\]
First-order conditions: \[\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\hat{\Sigma}\mathbf{w} - 2\lambda\mathbf{w} = 0\]
\[\hat{\Sigma}\mathbf{w} = \lambda\mathbf{w}\]
This is an eigenvalue problem! \(\mathbf{w}_1\) is an eigenvector of \(\hat{\Sigma}\).
3.3 Maximum Variance = Largest Eigenvalue
Substitute \(\hat{\Sigma}\mathbf{w}_1 = \lambda_1\mathbf{w}_1\) into the objective:
\[\mathbf{w}_1^T\hat{\Sigma}\mathbf{w}_1 = \mathbf{w}_1^T(\lambda_1\mathbf{w}_1) = \lambda_1\|\mathbf{w}_1\|^2 = \lambda_1\]
The maximized variance is \(\lambda_1\) (the largest eigenvalue).
Thus: \(\mathbf{w}_1\) is the eigenvector corresponding to the largest eigenvalue \(\lambda_1\).
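This result is easy to sanity-check numerically. A minimal NumPy sketch (the random data is invented for illustration): the variance along the top eigenvector equals \(\lambda_1\), and no random unit vector beats it on the objective \(\mathbf{w}^T\hat{\Sigma}\mathbf{w}\).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated 4-D data
Xc = X - X.mean(axis=0)                                   # center
S = Xc.T @ Xc / (len(Xc) - 1)                             # sample covariance

eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
w1, lam1 = eigvecs[:, -1], eigvals[-1]        # top eigenpair

# variance along w1 equals lambda_1 ...
assert np.isclose(w1 @ S @ w1, lam1)
# ... and no random unit vector attains larger projected variance
W = rng.normal(size=(1000, 4))
W /= np.linalg.norm(W, axis=1, keepdims=True)
assert np.all(np.einsum('ij,jk,ik->i', W, S, W) <= lam1 + 1e-9)
```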
4 Subsequent Principal Components
4.1 The Second PC
Find \(\mathbf{w}_2\) that maximizes the variance subject to orthogonality with \(\mathbf{w}_1\):
\[\max \mathbf{w}_2^T\hat{\Sigma}\mathbf{w}_2 \quad \text{s.t.} \quad \|\mathbf{w}_2\| = 1, \quad \mathbf{w}_1^T\mathbf{w}_2 = 0\]
Solution: the eigenvector corresponding to the second largest eigenvalue \(\lambda_2\).
4.2 The Full Eigendecomposition
\(\hat{\Sigma}\) is real, symmetric, and positive semidefinite, hence diagonalizable with orthonormal eigenvectors:
\[\hat{\Sigma} = V\Lambda V^T\]
where:
- \(V = [\mathbf{v}_1, \ldots, \mathbf{v}_p]\): matrix of orthonormal eigenvectors (loadings)
- \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_p)\) with \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0\)
Principal components (scores): \(Z = XV\), the projections onto the eigenvector directions.
5 PCA via SVD
Center data: \(\tilde{X} = X - \bar{X}\) (subtract column means).
SVD: \(\tilde{X} = U\Sigma V^T\) where:
- \(U \in \mathbb{R}^{n \times p}\): left singular vectors (scores, normalized)
- \(\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_p)\): singular values (\(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\))
- \(V \in \mathbb{R}^{p \times p}\): right singular vectors (loadings)
Hubungan dengan eigendecomposition: \[\hat{\Sigma} = \frac{1}{n-1}\tilde{X}^T\tilde{X} = \frac{1}{n-1}V\Sigma^2 V^T\]
So the eigenvectors of \(\hat{\Sigma}\) are \(V\) (the right singular vectors of \(\tilde{X}\)).
Eigenvalues: \(\lambda_k = \sigma_k^2/(n-1)\)
PC scores: \(Z = \tilde{X}V = U\Sigma\) (matrix of projections)
Column \(k\) of \(Z\): the scores for the \(k\)-th PC.
The SVD is more numerically stable than eigendecomposing \(\hat{\Sigma}\) directly, especially for ill-conditioned \(\tilde{X}\): forming \(\tilde{X}^T\tilde{X}\) squares the condition number.
6 Proportion of Variance Explained
6.1 Formula
Total variance = \(\text{tr}(\hat{\Sigma}) = \sum_k \lambda_k = \sum_k \sigma_k^2/(n-1)\)
Proportion of Variance Explained (PVE) by PC \(k\):
\[\text{PVE}_k = \frac{\lambda_k}{\sum_{j=1}^p \lambda_j} = \frac{\sigma_k^2}{\sum_{j=1}^p \sigma_j^2}\]
Cumulative PVE: \[\text{CPVE}_m = \sum_{k=1}^m \text{PVE}_k\]
6.2 Scree Plot
Plot \(\lambda_k\) (or PVE) against \(k\). Look for the “elbow”: the point where the eigenvalues start to flatten out.
Rules of thumb:
- Elbow rule: keep the PCs up to the elbow
- Threshold: choose \(m\) so that CPVE \(\geq\) 80% (or 90%)
- Kaiser rule: keep PCs with \(\lambda_k > 1\) (more than the average variance of a standardized variable; applies to standardized data)
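A short NumPy sketch of the threshold and Kaiser rules (the data-generating setup is invented for illustration: two correlated blocks plus a pure-noise column, standardized so that PCA runs on the correlation matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
F = rng.normal(size=(n, 2))                       # two latent factors
X = np.column_stack([
    2.0 * F[:, 0] + 0.3 * rng.normal(size=n),
    1.5 * F[:, 0] + 0.3 * rng.normal(size=n),
    2.0 * F[:, 1] + 0.3 * rng.normal(size=n),
    rng.normal(size=n),                            # pure noise column
])
Z = (X - X.mean(0)) / X.std(0, ddof=1)            # standardize

lam = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]   # descending eigenvalues
pve = lam / lam.sum()
cpve = np.cumsum(pve)
m_threshold = int(np.argmax(cpve >= 0.90)) + 1    # smallest m with CPVE >= 90%
m_kaiser = int(np.sum(lam > 1))                   # Kaiser rule
print("PVE:", pve.round(3), "| m(90%):", m_threshold, "| m(Kaiser):", m_kaiser)
```

Note that the eigenvalues of a correlation matrix sum to \(p\), so the Kaiser cutoff \(\lambda_k > 1\) is exactly “above average”.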
7 Reconstruction dan Compression
7.1 Rank-\(k\) Approximation
Truncated SVD: \(\hat{X}_k = U_k\Sigma_k V_k^T\)
di mana \(U_k \in \mathbb{R}^{n \times k}\), \(\Sigma_k \in \mathbb{R}^{k \times k}\), \(V_k \in \mathbb{R}^{p \times k}\).
Equivalently: \(\hat{\tilde{X}}_k = \tilde{X}V_kV_k^T\) (project and back-project).
Eckart-Young-Mirsky Theorem: the rank-\(k\) SVD is the best rank-\(k\) approximation in Frobenius norm:
\[\hat{\tilde{X}}_k = \arg\min_{\text{rank}(B) \leq k} \|\tilde{X} - B\|_F^2\]
Reconstruction error:
\[\|\tilde{X} - \hat{\tilde{X}}_k\|_F^2 = \sum_{j=k+1}^p \sigma_j^2\]
Error = sum of discarded singular values squared.
7.2 Compression Ratio
Original data: \(np\) numbers. Compressed (rank-\(k\)): \(nk + kp + k = k(n+p+1)\) numbers (scores + loadings + singular values).
Compression pays off when \(k \ll \min(n, p)\); the ratio is \(np / [k(n+p+1)]\).
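Both the Eckart-Young error formula and the storage count can be checked directly. A NumPy sketch (the sizes \(n, p, k\) are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 120, 30, 5
X = rng.normal(size=(n, p))
Xc = X - X.mean(0)                           # center columns

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Xk = U[:, :k] * s[:k] @ Vt[:k]               # rank-k reconstruction U_k S_k V_k^T
err = np.sum((Xc - Xk) ** 2)
assert np.isclose(err, np.sum(s[k:] ** 2))   # error = sum of discarded sigma_j^2

stored = k * (n + p + 1)                     # scores + loadings + singular values
print(f"compression ratio: {n * p / stored:.1f}x")   # 3600 vs 755 numbers: ~4.8x
```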
8 Kernel PCA
8.1 Motivation
PCA finds linear subspaces. For nonlinear structure, use the kernel trick.
Kernel PCA: PCA in the feature space \(\phi(x)\), using the kernel trick so that \(\phi\) never has to be computed explicitly.
8.2 Derivation
Standard PCA on the mapped data \(\Phi = [\phi(x_1), \ldots, \phi(x_n)]^T\):
Covariance (assuming the mapped data are centered in feature space): \(\hat{\Sigma}_\phi = \frac{1}{n}\Phi^T\Phi\)
Eigenvalue problem: \(\hat{\Sigma}_\phi v = \lambda v\)
Since every eigenvector lies in the span of the mapped data (a representer-theorem-style argument), it can be written as \(v = \frac{1}{\sqrt{n\lambda}}\Phi^T\alpha\); substituting into the eigenvalue problem gives:
\[K\alpha = n\lambda\alpha\]
where \(K_{ij} = k(x_i, x_j)\) is the Gram matrix.
Kernel PCA scores for a data point \(x\): \[z_k(x) = \sum_{i=1}^n \hat{\alpha}_{ik} k(x_i, x)\]
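A from-scratch NumPy sketch of the derivation above, on an invented two-ring dataset (the RBF bandwidth `gamma` is an arbitrary choice, and the explicit Gram-matrix centering \(K_c = JKJ\) is a detail the derivation glosses over: it enforces the feature-space centering assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# two concentric rings: structure a linear PCA cannot separate
theta = rng.uniform(0, 2 * np.pi, n)
radius = np.repeat([1.0, 3.0], n // 2)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
X += 0.05 * rng.normal(size=X.shape)

gamma = 1.0                                   # RBF bandwidth (assumed hyperparameter)
sq = (X ** 2).sum(axis=1)
K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))   # RBF Gram matrix

# center the Gram matrix: K_c = (I - 11^T/n) K (I - 11^T/n)
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

lam, alpha = np.linalg.eigh(Kc)               # ascending order
lam, alpha = lam[::-1], alpha[:, ::-1]        # sort descending
# training-point scores on the k-th kernel PC: z_k = Kc @ alpha_k / sqrt(lam_k)
scores = Kc @ alpha[:, :2] / np.sqrt(lam[:2])
print("top kernel eigenvalues:", lam[:2].round(2))
```

On data like these two rings, the leading kernel PCs typically pick up the radial structure that linear PCA misses.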
9 t-SNE dan UMAP (Brief Sketch)
9.1 t-SNE (t-distributed Stochastic Neighbor Embedding)
Motivation: PCA is linear, so it is a poor fit for manifold structure such as the “Swiss roll”.
Idea: preserve local neighborhood structure.
Compute affinities in high dimensions: \(p_{ij} \propto \exp(-\|x_i - x_j\|^2/(2\sigma_i^2))\) (Gaussian)
Compute affinities in low dimensions: \(q_{ij} \propto (1 + \|y_i-y_j\|^2)^{-1}\) (t-distribution with 1 df)
Minimize KL divergence: \(\min \sum_{ij} p_{ij}\log\frac{p_{ij}}{q_{ij}}\)
The t-distribution in the low-dimensional space has heavy tails, so moderately distant points get “more space”, which avoids the crowding problem.
Limitations: t-SNE does not preserve global structure, is non-deterministic, and is slow (\(O(n^2)\) for the exact algorithm).
9.2 UMAP (Uniform Manifold Approximation and Projection)
UMAP has a stronger mathematical foundation than t-SNE (Riemannian geometry plus topological data analysis).
Key idea: construct a fuzzy topological representation of the data, then optimize a low-dimensional embedding to have a “similar” fuzzy topology.
UMAP is faster (empirically about \(O(n^{1.14})\)), preserves global structure better, and has more interpretable hyperparameters.
10 Worked Example: PCA Step-by-Step
library(tidyverse)
# ==========================================
# 1. PCA dari Scratch (Manual)
# ==========================================
# Generate 4D data with known structure
set.seed(42)
n <- 100
# Two latent factors
F1 <- rnorm(n)
F2 <- rnorm(n)
X <- cbind(
2*F1 + 0.3*rnorm(n), # highly influenced by F1
1.5*F1 + 0.3*rnorm(n), # highly influenced by F1
2*F2 + 0.3*rnorm(n), # highly influenced by F2
1.5*F2 + 0.3*rnorm(n) # highly influenced by F2
)
colnames(X) <- c("X1", "X2", "X3", "X4")
# Step 1: Center data
X_centered <- scale(X, center = TRUE, scale = FALSE)
# Step 2: Compute covariance matrix
S <- cov(X_centered)
cat("Sample covariance matrix:\n")
print(round(S, 3))
# Step 3: Eigendecomposition
eigen_result <- eigen(S)
cat("\nEigenvalues:\n", round(eigen_result$values, 4), "\n")
cat("Proportion of variance:\n",
round(eigen_result$values / sum(eigen_result$values), 4), "\n")
cat("Cumulative PVE:\n",
round(cumsum(eigen_result$values / sum(eigen_result$values)), 4), "\n")
# Step 4: Loadings (eigenvectors)
cat("\nLoadings (eigenvectors):\n")
print(round(eigen_result$vectors, 4))
# Step 5: Scores (projections)
scores <- X_centered %*% eigen_result$vectors
cat("\nFirst few PC scores:\n")
print(head(scores, 3))
# ==========================================
# 2. Using prcomp()
# ==========================================
pca_result <- prcomp(X, center = TRUE, scale. = FALSE)
cat("\n--- prcomp() results ---\n")
print(summary(pca_result))
# Loadings
cat("\nRotation (Loadings):\n")
print(round(pca_result$rotation, 4))
# Scree plot
plot(pca_result, type = "l", main = "Scree Plot")
# Biplot
biplot(pca_result, main = "PCA Biplot")
# ==========================================
# 3. SVD Relationship
# ==========================================
X_c <- scale(X, center = TRUE, scale = FALSE)
svd_result <- svd(X_c)
# Verify: eigenvalues of S = sigma^2 / (n-1)
cat("\nEigenvalues from cov matrix:", round(eigen_result$values, 4), "\n")
cat("Singular values squared / (n-1):",
round(svd_result$d^2 / (nrow(X) - 1), 4), "\n")
# ==========================================
# 4. Reconstruction Error
# ==========================================
X_recon_2 <- tcrossprod(pca_result$x[, 1:2], pca_result$rotation[, 1:2])
recon_error_2 <- sum((X_c - X_recon_2)^2)
total_var <- sum(X_c^2)
cat(sprintf("\nReconstruction error (2 PCs): %.2f%%\n",
100 * recon_error_2 / total_var))
cat(sprintf("Variance NOT explained by 2 PCs: %.2f%%\n",
100 * (1 - sum(pca_result$sdev[1:2]^2) / sum(pca_result$sdev^2))))
# ==========================================
# 5. Applied: iris dataset
# ==========================================
data(iris)
pca_iris <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
cat("\n--- Iris PCA ---\n")
print(summary(pca_iris))
# Visualize in 2D
df_pca <- data.frame(pca_iris$x[,1:2], species = iris$Species)
ggplot(df_pca, aes(PC1, PC2, color = species)) +
geom_point(alpha = 0.7) +
labs(title = "Iris: First 2 Principal Components") +
theme_minimal()
Key observations:
1. Eigenvalues = the variance along each PC direction
2. PC1 and PC2 line up with F1 and F2 respectively (the latent structure we built in)
3. Two PCs already explain ~95%+ of the variance
4. High loadings indicate which variables dominate each PC
PCA has many relatives:
Factor Analysis: different from PCA. FA assumes the latent model \(X = \Lambda F + \varepsilon\), where \(\varepsilon\) are unique factors (each with its own variance). FA is more “generative” and interpretable, but rests on stronger assumptions.
ICA (Independent Component Analysis): finds components that are statistically independent (not merely uncorrelated). Useful for blind source separation (the cocktail party problem); see the FastICA algorithm.
NMF (Non-negative Matrix Factorization): \(X \approx WH\) with \(W, H \geq 0\). Useful for non-negative data (counts, frequencies).
Autoencoders: neural networks that learn a low-dimensional representation by minimizing reconstruction loss; a nonlinear generalization of PCA. The Variational Autoencoder (VAE) is the probabilistic version.
Truncated SVD vs PCA: in text analysis (LSA/LSI), the SVD is applied to the TF-IDF matrix rather than a covariance matrix; conceptually different but mathematically related.
11 Practice Problems
Problem 1: Show that the PC scores are uncorrelated and have variance equal to the eigenvalues.
\(\text{Cov}(Z) = V^T\hat{\Sigma}V = V^T(V\Lambda V^T)V = \Lambda\) (diagonal)
So \(\text{Cov}(Z_j, Z_k) = 0\) for \(j \neq k\) (uncorrelated), and \(\text{Var}(Z_k) = \lambda_k\).
Problem 2: Prove that the total variance is unchanged by the PCA transformation.
\(\text{tr}(\text{Cov}(Z)) = \text{tr}(\Lambda) = \sum_k \lambda_k = \text{tr}(\hat{\Sigma})\)
PCA is just a rotation; it does not change the total variance.
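Both identities (the diagonal covariance of the scores and the preserved trace) can be verified numerically. A NumPy sketch on random data invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 3))   # correlated data
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)

lam, V = np.linalg.eigh(S)                  # S = V diag(lam) V^T
Z = Xc @ V                                  # PC scores
C = np.cov(Z, rowvar=False)

assert np.allclose(C, np.diag(lam), atol=1e-8)   # Problem 1: Cov(Z) = Lambda
assert np.isclose(np.trace(C), np.trace(S))      # Problem 2: total variance kept
print("max off-diagonal covariance:", np.abs(C - np.diag(np.diag(C))).max())
```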
Problem 3: Why standardize before PCA?
If the variables are in different units (e.g., income in millions, age in years), PCA will be dominated by the variable with the largest variance (income). Standardizing (dividing by the standard deviation) makes every variable contribute on an equal footing.
To decide whether to standardize, ask whether the unit differences are meaningful. If everything is in the same units (pixels, gene expression levels), standardization may be unnecessary.
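A NumPy sketch of the dominance effect (the income/age numbers are invented; with only two standardized, positively correlated variables, the first loading vector is exactly \((1,1)/\sqrt{2}\)):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
z = rng.normal(size=n)
income = 50_000 + 20_000 * z                            # variance ~ 4e8
age = 40 + 10 * (0.6 * z + 0.8 * rng.normal(size=n))    # variance ~ 100, corr ~ 0.6
X = np.column_stack([income, age])

def pc1_loading(M):
    lam, V = np.linalg.eigh(np.cov(M, rowvar=False))
    return V[:, -1]                          # eigenvector of the largest eigenvalue

print(np.abs(pc1_loading(X)).round(4))       # income dominates: ~[1, 0]
Xs = (X - X.mean(0)) / X.std(0, ddof=1)      # standardize
print(np.abs(pc1_loading(Xs)).round(4))      # balanced: ~[0.7071, 0.7071]
```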
Problem 4: What is the difference between loadings and scores?
Loadings (\(V\)): the coefficients that define each PC as a linear combination of the original variables. Interpretation: each variable's contribution to the PC.
Scores (\(Z = XV\)): the coordinates of each observation in PC space. Interpretation: the position of each data point in the reduced space.
Problem 5: Derive reconstruction error formula.
\(\|\tilde{X} - \hat{\tilde{X}}_k\|_F^2 = \|\tilde{X}(I-V_kV_k^T)\|_F^2 = \text{tr}[(I-V_kV_k^T)\tilde{X}^T\tilde{X}(I-V_kV_k^T)]\)
Use the cyclicity of the trace and the fact that \(V_kV_k^T\) is a projection: \(= \text{tr}[\tilde{X}^T\tilde{X}] - \text{tr}[V_k^T\tilde{X}^T\tilde{X}V_k] = \sum_j \sigma_j^2 - \sum_{j=1}^k \sigma_j^2 = \sum_{j=k+1}^p \sigma_j^2\)