PCA & Dimensionality Reduction

Eigendecomposition as the Foundation of PCA

ml-math
intermediate
A mathematical derivation of PCA from variance maximization to the eigenvalue problem, the SVD perspective, proportion of variance explained, reconstruction, kernel PCA, and a sketch of t-SNE/UMAP.

1 Why This Matters

Note: Why This Matters for Your Work

PCA is the eigendecomposition of the covariance matrix. Without linear algebra, PCA is a black box. With linear algebra:

  • PCA becomes geometrically intuitive: we are looking for the directions of maximum variance
  • You understand why the scree plot is useful (eigenvalue = variance explained)
  • You know why loadings and scores have different interpretations
  • You can tell PCA apart from factor analysis (the two are frequently conflated)

For data analysts: PCA is essential for exploratory analysis, denoising, visualization, and preprocessing before modeling.


2 Motivation: The Curse of Dimensionality

When the dimension \(p\) is large:

  1. Sparse data: the volume of the hypercube grows exponentially, so all data points end up “far apart”
  2. Computational cost: many algorithms scale as \(p^2\) or \(p^3\)
  3. Visualization: humans can only visualize 2D or 3D
  4. Overfitting: high-dimensional data overfits easily (too many parameters)

Solution: reduce to a low-dimensional representation that preserves the most important information.
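The “far apart” claim can be checked with a quick simulation. The sketch below (base R; the choices of n = 100 points and dimensions 2 vs 500 are arbitrary) shows that pairwise distances concentrate as \(p\) grows: the nearest and farthest neighbors become almost indistinguishable.

```r
# Distance concentration: the ratio of the smallest to the largest
# pairwise distance moves toward 1 as the dimension p grows
set.seed(1)
concentration <- function(p, n = 100) {
  X <- matrix(rnorm(n * p), n, p)  # n standard-normal points in p dimensions
  d <- dist(X)                     # all n(n-1)/2 pairwise Euclidean distances
  min(d) / max(d)                  # near 0 in low dim, near 1 in high dim
}
cat("p =   2:", round(concentration(2), 3), "\n")
cat("p = 500:", round(concentration(500), 3), "\n")
```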


3 PCA: Derivation from Variance Maximization

3.1 Setup

Data matrix: \(X \in \mathbb{R}^{n \times p}\), assumed to be already centered (\(\bar{X} = 0\)).

Sample covariance matrix: \(\hat{\Sigma} = \frac{1}{n-1}X^TX \in \mathbb{R}^{p \times p}\)

3.2 Optimization Problem

Find the unit vector \(\mathbf{w}_1 \in \mathbb{R}^p\) that maximizes the variance of the projection:

\[\text{Var}(X\mathbf{w}_1) = \mathbf{w}_1^T \hat{\Sigma} \mathbf{w}_1\]

subject to \(\|\mathbf{w}_1\|^2 = \mathbf{w}_1^T\mathbf{w}_1 = 1\).

Lagrangian: \[\mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T\hat{\Sigma}\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)\]

First-order conditions: \[\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\hat{\Sigma}\mathbf{w} - 2\lambda\mathbf{w} = 0\]

\[\hat{\Sigma}\mathbf{w} = \lambda\mathbf{w}\]

This is an eigenvalue problem! \(\mathbf{w}_1\) must be an eigenvector of \(\hat{\Sigma}\).

3.3 Maximum Variance = Largest Eigenvalue

Substituting \(\hat{\Sigma}\mathbf{w}_1 = \lambda_1\mathbf{w}_1\) into the objective:

\[\mathbf{w}_1^T\hat{\Sigma}\mathbf{w}_1 = \mathbf{w}_1^T(\lambda_1\mathbf{w}_1) = \lambda_1\|\mathbf{w}_1\|^2 = \lambda_1\]

The maximized variance equals \(\lambda_1\) (the largest eigenvalue).

So: \(\mathbf{w}_1\) is the eigenvector corresponding to the largest eigenvalue \(\lambda_1\).
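This can be verified numerically. A minimal base R sketch (the 5-dimensional synthetic data and the 2000 random directions are arbitrary choices): no random unit vector beats the top eigenvector's projected variance, exactly as the Rayleigh quotient argument guarantees.

```r
set.seed(7)
X <- scale(matrix(rnorm(200 * 5), 200, 5), center = TRUE, scale = FALSE)
S <- cov(X)
e <- eigen(S)
lambda1 <- e$values[1]                 # variance along the top eigenvector

# Projected variance w^T S w for many random unit vectors w
rand_var <- replicate(2000, {
  w <- rnorm(5)
  w <- w / sqrt(sum(w^2))              # normalize to a unit vector
  drop(t(w) %*% S %*% w)
})

cat("lambda_1:             ", round(lambda1, 4), "\n")
cat("best random direction:", round(max(rand_var), 4), "\n")  # never exceeds lambda_1
```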


4 Subsequent Principal Components

4.1 The Second PC

Find \(\mathbf{w}_2\) that maximizes the variance subject to orthogonality with \(\mathbf{w}_1\):

\[\max \mathbf{w}_2^T\hat{\Sigma}\mathbf{w}_2 \quad \text{s.t.} \quad \|\mathbf{w}_2\| = 1, \quad \mathbf{w}_1^T\mathbf{w}_2 = 0\]

Solution: the eigenvector corresponding to the second-largest eigenvalue \(\lambda_2\).

4.2 The Full Eigendecomposition

\(\hat{\Sigma}\) is real, symmetric, and positive semidefinite, so it is diagonalizable with orthonormal eigenvectors:

\[\hat{\Sigma} = V\Lambda V^T\]

where:

  • \(V = [\mathbf{v}_1, \ldots, \mathbf{v}_p]\): matrix of orthonormal eigenvectors (loadings)
  • \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_p)\) with \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0\)

Principal components (scores): \(Z = XV\), the projections onto the eigenvector directions.


5 PCA via SVD

Important Definition: PCA via SVD (Preferred Approach)

Center data: \(\tilde{X} = X - \bar{X}\) (subtract column means).

SVD: \(\tilde{X} = U\Sigma V^T\), where:

  • \(U \in \mathbb{R}^{n \times p}\): left singular vectors (normalized scores)
  • \(\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_p)\): singular values (\(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\))
  • \(V \in \mathbb{R}^{p \times p}\): right singular vectors (loadings)

Relationship to the eigendecomposition: \[\hat{\Sigma} = \frac{1}{n-1}\tilde{X}^T\tilde{X} = \frac{1}{n-1}V\Sigma^2 V^T\]

So the eigenvectors of \(\hat{\Sigma}\) are \(V\), the right singular vectors of \(\tilde{X}\).

Eigenvalues: \(\lambda_k = \sigma_k^2/(n-1)\)

PC scores: \(Z = \tilde{X}V = U\Sigma\) (matrix of projections)

Column \(k\) of \(Z\) holds the scores for the \(k\)-th PC.

SVD is more numerically stable than directly eigendecomposing \(\hat{\Sigma}\), because forming \(\tilde{X}^T\tilde{X}\) squares the condition number.


6 Proportion of Variance Explained

6.1 Formula

Total variance = \(\text{tr}(\hat{\Sigma}) = \sum_k \lambda_k = \sum_k \sigma_k^2/(n-1)\)

The Proportion of Variance Explained (PVE) by PC \(k\):

\[\text{PVE}_k = \frac{\lambda_k}{\sum_{j=1}^p \lambda_j} = \frac{\sigma_k^2}{\sum_{j=1}^p \sigma_j^2}\]

Cumulative PVE: \[\text{CPVE}_m = \sum_{k=1}^m \text{PVE}_k\]

6.2 Scree Plot

Plot \(\lambda_k\) (or PVE) against \(k\) and look for the “elbow”, the point where the eigenvalues start to flatten out.

Rules of thumb:

  • Elbow rule: keep the components up to the elbow
  • Threshold: choose \(m\) such that CPVE \(\geq\) 80% (or 90%)
  • Kaiser rule: keep PCs with \(\lambda_k > 1\) (more than the average variance, for standardized data)
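Applied to a concrete case, these rules need only the eigenvalues. A sketch on the built-in iris data (standardized, so the Kaiser threshold of 1 equals the average variance; the 80% cutoff follows the text):

```r
# Choosing the number of components m on the iris data
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
lambda <- pca$sdev^2            # eigenvalues of the correlation matrix
pve <- lambda / sum(lambda)     # proportion of variance explained per PC
cpve <- cumsum(pve)             # cumulative PVE

print(round(rbind(lambda, pve, cpve), 3))
cat("Threshold rule (CPVE >= 0.80): m =", which(cpve >= 0.80)[1], "\n")
cat("Kaiser rule (lambda > 1):      m =", sum(lambda > 1), "\n")
```

Note how the rules can disagree: here the threshold rule keeps two components while the Kaiser rule keeps one.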


7 Reconstruction and Compression

7.1 Rank-\(k\) Approximation

Truncated SVD: \(\hat{X}_k = U_k\Sigma_k V_k^T\)

where \(U_k \in \mathbb{R}^{n \times k}\), \(\Sigma_k \in \mathbb{R}^{k \times k}\), \(V_k \in \mathbb{R}^{p \times k}\).

Equivalently: \(\hat{\tilde{X}}_k = \tilde{X}V_kV_k^T\) (project and back-project).

Eckart–Young–Mirsky Theorem: the rank-\(k\) truncated SVD is the best rank-\(k\) approximation in Frobenius norm:

\[\hat{\tilde{X}}_k = \arg\min_{\text{rank}(B) \leq k} \|\tilde{X} - B\|_F^2\]

Reconstruction error:

\[\|\tilde{X} - \hat{\tilde{X}}_k\|_F^2 = \sum_{j=k+1}^p \sigma_j^2\]

The error is the sum of the squared discarded singular values.

7.2 Compression Ratio

Original data: \(np\) numbers. Compressed (rank \(k\)): \(nk + kp + k = k(n+p+1)\) numbers (scores + loadings + singular values).

Compression pays off when \(k \ll \min(n, p)\).
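As a concrete illustration with hypothetical sizes (n = 1000 observations, p = 200 variables, k = 10 components):

```r
# Storage: full data matrix vs rank-k factors (scores + loadings + singular values)
n <- 1000; p <- 200; k <- 10
full       <- n * p            # 200000 numbers
compressed <- k * (n + p + 1)  # 12010 numbers
cat("compression ratio:", round(full / compressed, 2), "x\n")  # 16.65 x
```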


8 Kernel PCA

8.1 Motivation

PCA finds linear subspaces. For nonlinear structure, use the kernel trick.

Kernel PCA: PCA in the feature space \(\phi(x)\), using the kernel trick so that \(\phi\) never has to be computed explicitly.

8.2 Derivation

Standard PCA on the mapped data \(\Phi = [\phi(x_1), \ldots, \phi(x_n)]^T\):

Covariance: \(\hat{\Sigma}_\phi = \frac{1}{n}\Phi^T\Phi\)

Eigenvalue problem: \(\hat{\Sigma}_\phi v = \lambda v\)

By a representer-style argument, each eigenvector lies in the span of the mapped data and can be written as \(v = \frac{1}{\sqrt{n\lambda}}\Phi^T\alpha\); substituting into the eigenvalue problem gives:

\[K\alpha = n\lambda\alpha\]

where \(K_{ij} = k(x_i, x_j)\) is the Gram matrix (which, in practice, must first be centered in feature space).

The kernel PCA score of a data point \(x\) on component \(k\): \[z_k(x) = \sum_{i=1}^n \hat{\alpha}_{ik} k(x_i, x)\]
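A from-scratch sketch of this derivation in base R (the two-ring data and the RBF kernel with gamma = 1 are illustrative choices; note one detail the derivation above glosses over, namely that the Gram matrix must be centered in feature space before solving the eigenproblem):

```r
set.seed(3)
# Two concentric rings: no linear projection separates them
theta <- runif(100, 0, 2 * pi)
r <- rep(c(1, 3), each = 50)
X <- cbind(r * cos(theta), r * sin(theta)) + 0.1 * matrix(rnorm(200), 100, 2)

# RBF Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
gamma <- 1
K <- exp(-gamma * as.matrix(dist(X))^2)

# Center K in feature space: K_c = (I - J) K (I - J) with J = 11^T / n
n <- nrow(K)
J <- matrix(1 / n, n, n)
Kc <- K - J %*% K - K %*% J + J %*% K %*% J

# Solve the eigenproblem; the training-point scores are Kc alpha = eigval * eigvec
e <- eigen(Kc, symmetric = TRUE)
scores <- e$vectors[, 1:2] %*% diag(e$values[1:2])

# With a suitable gamma, the rings typically separate along the leading kernel PCs
cat("class means on kernel PC1:",
    round(tapply(scores[, 1], rep(c("inner", "outer"), each = 50), mean), 2), "\n")
```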


9 t-SNE and UMAP (Brief Sketch)

9.1 t-SNE (t-distributed Stochastic Neighbor Embedding)

Motivation: PCA is linear, so it cannot unroll manifold structure such as the “Swiss roll”.

Idea: preserve local neighborhood structure.

  1. Compute affinities in the high-dimensional space: \(p_{ij} \propto \exp(-\|x_i - x_j\|^2/(2\sigma_i^2))\) (Gaussian)

  2. Compute affinities in the low-dimensional space: \(q_{ij} \propto (1 + \|y_i-y_j\|^2)^{-1}\) (Student t with 1 df)

  3. Minimize the KL divergence: \(\min \sum_{ij} p_{ij}\log\frac{p_{ij}}{q_{ij}}\)

The t-distribution in the low-dimensional space has heavy tails, so moderately distant points get “more space”, which avoids the crowding problem.

Limitations: t-SNE does not preserve global structure, is non-deterministic, and is slow (\(O(n^2)\) in its exact form).

9.2 UMAP (Uniform Manifold Approximation and Projection)

UMAP has a stronger mathematical foundation than t-SNE (Riemannian geometry plus topological data analysis).

Key idea: construct a fuzzy topological representation of the data, then optimize a low-dimensional embedding to have a “similar” fuzzy topology.

UMAP is faster (empirically \(O(n^{1.14})\)), preserves global structure better, and has more interpretable hyperparameters.


10 Worked Example: PCA Step-by-Step

library(tidyverse)

# ==========================================
# 1. PCA dari Scratch (Manual)
# ==========================================

# Generate 4D data with known structure
set.seed(42)
n <- 100
# Two latent factors
F1 <- rnorm(n)
F2 <- rnorm(n)
X <- cbind(
  2*F1 + 0.3*rnorm(n),      # highly influenced by F1
  1.5*F1 + 0.3*rnorm(n),    # highly influenced by F1
  2*F2 + 0.3*rnorm(n),      # highly influenced by F2
  1.5*F2 + 0.3*rnorm(n)     # highly influenced by F2
)
colnames(X) <- c("X1", "X2", "X3", "X4")

# Step 1: Center data
X_centered <- scale(X, center = TRUE, scale = FALSE)

# Step 2: Compute covariance matrix
S <- cov(X_centered)
cat("Sample covariance matrix:\n")
print(round(S, 3))

# Step 3: Eigendecomposition
eigen_result <- eigen(S)
cat("\nEigenvalues:\n", round(eigen_result$values, 4), "\n")
cat("Proportion of variance:\n",
    round(eigen_result$values / sum(eigen_result$values), 4), "\n")
cat("Cumulative PVE:\n",
    round(cumsum(eigen_result$values / sum(eigen_result$values)), 4), "\n")

# Step 4: Loadings (eigenvectors)
cat("\nLoadings (eigenvectors):\n")
print(round(eigen_result$vectors, 4))

# Step 5: Scores (projections)
scores <- X_centered %*% eigen_result$vectors
cat("\nFirst few PC scores:\n")
print(head(scores, 3))

# ==========================================
# 2. Using prcomp()
# ==========================================
pca_result <- prcomp(X, center = TRUE, scale. = FALSE)

cat("\n--- prcomp() results ---\n")
print(summary(pca_result))

# Loadings
cat("\nRotation (Loadings):\n")
print(round(pca_result$rotation, 4))

# Scree plot
plot(pca_result, type = "l", main = "Scree Plot")

# Biplot
biplot(pca_result, main = "PCA Biplot")

# ==========================================
# 3. SVD Relationship
# ==========================================
X_c <- scale(X, center = TRUE, scale = FALSE)
svd_result <- svd(X_c)

# Verify: eigenvalues of S = sigma^2 / (n-1)
cat("\nEigenvalues from cov matrix:", round(eigen_result$values, 4), "\n")
cat("Singular values squared / (n-1):",
    round(svd_result$d^2 / (nrow(X) - 1), 4), "\n")

# ==========================================
# 4. Reconstruction Error
# ==========================================
X_recon_2 <- tcrossprod(pca_result$x[, 1:2], pca_result$rotation[, 1:2])
recon_error_2 <- sum((X_c - X_recon_2)^2)
total_var <- sum(X_c^2)

cat(sprintf("\nReconstruction error (2 PCs): %.2f%%\n",
            100 * recon_error_2 / total_var))
cat(sprintf("Variance NOT explained by 2 PCs: %.2f%%\n",
            100 * (1 - sum(pca_result$sdev[1:2]^2) / sum(pca_result$sdev^2))))

# ==========================================
# 5. Applied: iris dataset
# ==========================================
data(iris)
pca_iris <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
cat("\n--- Iris PCA ---\n")
print(summary(pca_iris))

# Visualize in 2D
df_pca <- data.frame(pca_iris$x[,1:2], species = iris$Species)
ggplot(df_pca, aes(PC1, PC2, color = species)) +
  geom_point(alpha = 0.7) +
  labs(title = "Iris: First 2 Principal Components") +
  theme_minimal()

Key observations:

  1. Eigenvalues = variance along each PC direction
  2. PC1 and PC2 line up with F1 and F2, the latent structure we built in
  3. Two PCs already explain roughly 95%+ of the variance
  4. Large loadings indicate which variables dominate each PC


Caution: Connections to Factor Analysis, ICA, and Autoencoders

PCA has many relatives:

  • Factor Analysis: differs from PCA. FA assumes a latent model \(X = \Lambda F + \varepsilon\) where \(\varepsilon\) collects the unique factors (each with its own variance). FA is more “generative” and interpretable, but makes stronger assumptions.

  • ICA (Independent Component Analysis): finds components that are statistically independent (not merely uncorrelated). Useful for blind source separation (the cocktail party problem); FastICA is the standard algorithm.

  • NMF (Non-negative Matrix Factorization): \(X \approx WH\) with \(W, H \geq 0\). Useful for non-negative data (counts, frequencies).

  • Autoencoders: neural networks that learn a low-dimensional representation by minimizing reconstruction loss; a nonlinear generalization of PCA. The Variational Autoencoder (VAE) is the probabilistic version.

  • Truncated SVD vs PCA: in text analysis (LSA/LSI), SVD is applied to the TF-IDF matrix rather than a covariance matrix; conceptually different but mathematically related.


11 Practice Problems

Problem 1: Show that the PC scores are uncorrelated and have variance equal to the corresponding eigenvalue.

\(\text{Cov}(Z) = V^T\hat{\Sigma}V = V^T(V\Lambda V^T)V = \Lambda\) (diagonal)

So \(\text{Cov}(Z_j, Z_k) = 0\) for \(j \neq k\) (uncorrelated), and \(\text{Var}(Z_k) = \lambda_k\).
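The claim can also be checked numerically with prcomp (a sketch on the built-in iris data): the sample covariance of the scores is diagonal, with the eigenvalues on the diagonal.

```r
# Cov(Z) is diagonal, and its diagonal equals the eigenvalues (sdev^2)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
C <- cov(pca$x)                              # covariance of the PC scores
max_offdiag <- max(abs(C - diag(diag(C))))   # should be ~ machine epsilon
cat("largest off-diagonal of Cov(Z):", signif(max_offdiag, 2), "\n")
cat("diag(Cov(Z)) == eigenvalues?   ",
    isTRUE(all.equal(unname(diag(C)), unname(pca$sdev^2))), "\n")
```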

Problem 2: Prove that the total variance is unchanged by the PCA transformation.

\(\text{tr}(\text{Cov}(Z)) = \text{tr}(\Lambda) = \sum_k \lambda_k = \text{tr}(\hat{\Sigma})\)

PCA is just a rotation; it does not change the total variance.

Problem 3: Why standardize before PCA?

If the variables are in different units (e.g., income in millions, age in years), PCA will be dominated by the variable with the largest variance (income). Standardizing (dividing by the standard deviation) ensures that all variables contribute equally.

To decide whether to standardize, ask whether the differing units are meaningful. If everything is in the same units (pixels, gene expression levels), standardizing is usually unnecessary.
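A quick demonstration of this dominance effect (a sketch with synthetic data; the units for age and income are hypothetical):

```r
set.seed(9)
# age in years, income in currency units: wildly different variances
age    <- rnorm(200, 40, 10)
income <- rnorm(200, 50000, 15000)
X <- cbind(age, income)

# Without scaling, PC1 is essentially the income axis
v_raw <- prcomp(X, scale. = FALSE)$rotation[, 1]
# With scaling, both variables contribute (loadings of about 0.707 each,
# exact for a 2x2 correlation matrix)
v_std <- prcomp(X, scale. = TRUE)$rotation[, 1]
cat("PC1 loadings, raw:         ", round(v_raw, 3), "\n")
cat("PC1 loadings, standardized:", round(v_std, 3), "\n")
```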

Problem 4: What is the difference between loadings and scores?

Loadings (\(V\)): the coefficients that define each PC as a linear combination of the original variables. Interpretation: each variable's contribution to the PC.

Scores (\(Z = XV\)): the coordinates of each observation in PC space. Interpretation: the position of each data point in the reduced space.

Problem 5: Derive the reconstruction error formula.

\(\|\tilde{X} - \hat{\tilde{X}}_k\|_F^2 = \|\tilde{X}(I-V_kV_k^T)\|_F^2 = \text{tr}[(I-V_kV_k^T)\tilde{X}^T\tilde{X}(I-V_kV_k^T)]\)

Using the cyclicity of the trace and the fact that \(V_kV_k^T\) is a projection: \(= \text{tr}[\tilde{X}^T\tilde{X}] - \text{tr}[V_k^T\tilde{X}^T\tilde{X}V_k] = \sum_j \sigma_j^2 - \sum_{j=1}^k \sigma_j^2 = \sum_{j=k+1}^p \sigma_j^2\)


Navigation: ← Information Theory | Neural Network Math →