Regularization — Ridge, LASSO, Elastic Net
Taming Overfitting with Mathematics
1 Why Does This Matter?
Regularization is not just an implementation technique — it's a mathematical trade-off between bias and variance, with several layers of interpretation:
- Optimization: adds a penalty to the objective function
- Bayesian: MAP estimation with a particular prior distribution
- SVD: shrinkage along principal-component directions
Ridge and LASSO differ fundamentally in the nature of their solutions — Ridge always shrinks but never hits zero, while LASSO can eliminate a variable entirely. Knowing why helps you choose the right one.
For data analysts: regularization is directly relevant when \(p\) is large (many features), with observational data, or under multicollinearity.
2 The Bias-Variance Trade-off
For an estimator \(\hat{\beta}\):
\[\text{MSE}(\hat{\beta}) = E\|\hat{\beta} - \beta\|^2 = \underbrace{\|E[\hat{\beta}] - \beta\|^2}_{\text{Bias}^2} + \underbrace{\text{tr}(\text{Var}(\hat{\beta}))}_{\text{Variance}}\]
OLS: unbiased (\(E[\hat{\beta}_{OLS}] = \beta\)) and minimum variance among unbiased estimators (Gauss-Markov). But when \(p\) is large or predictors are correlated, the variance of OLS can be very large.
Key insight: by sacrificing a little bias, we can drastically reduce variance → net MSE goes down.
\[\text{Bias}^2(\hat{\beta}_{ridge}) > 0 \quad \text{but} \quad \text{Variance}(\hat{\beta}_{ridge}) \ll \text{Variance}(\hat{\beta}_{OLS})\]
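Since the chapter's worked example uses R, here is a small illustrative Python/numpy simulation instead (synthetic fixed design; the choices of \(n\), \(\lambda\), and the near-collinear construction are assumptions for illustration) suggesting how a little bias can buy a large variance reduction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, reps = 50, 5.0, 300
beta = np.array([1.0, 1.0])

# Two nearly collinear predictors (design fixed across replications)
z = rng.standard_normal(n)
X = np.column_stack([z + 0.05 * rng.standard_normal(n),
                     z + 0.05 * rng.standard_normal(n)])
I2 = np.eye(2)

mse_ols = mse_ridge = 0.0
for _ in range(reps):
    y = X @ beta + rng.standard_normal(n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_ridge = np.linalg.solve(X.T @ X + lam * I2, X.T @ y)
    mse_ols += np.sum((b_ols - beta) ** 2) / reps
    mse_ridge += np.sum((b_ridge - beta) ** 2) / reps

# Under near-collinearity, OLS variance explodes along the ill-determined
# direction, while ridge pays a small bias for a large variance reduction
print(mse_ridge, mse_ols)
```

With this setup the ridge MSE comes out far below the OLS MSE; the gap shrinks as the collinearity weakens or \(\lambda\) grows too large.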
3 Ridge Regression (L2 Regularization)
3.1 Objective Function
\[\hat{\beta}_{ridge} = \arg\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|^2\]
This is equivalent to the constrained optimization: \[\min_\beta \|y - X\beta\|^2 \quad \text{s.t.} \quad \|\beta\|^2 \leq t(\lambda)\]
3.2 Closed-Form Solution
Take the gradient and set it to zero:
\[\frac{\partial}{\partial\beta}\left[\|y-X\beta\|^2 + \lambda\|\beta\|^2\right] = -2X^T(y-X\beta) + 2\lambda\beta = 0\]
\[X^TX\hat{\beta} + \lambda\hat{\beta} = X^Ty\]
\[(X^TX + \lambda I)\hat{\beta}_{ridge} = X^Ty\]
\[\boxed{\hat{\beta}_{ridge} = (X^TX + \lambda I)^{-1}X^Ty}\]
Key property: \((X^TX + \lambda I)\) is always invertible for \(\lambda > 0\) — even if \(X\) is not full rank! This solves multicollinearity.
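A quick numpy check of the closed form (a sketch on synthetic data; shapes and \(\lambda\) are arbitrary): at the solution, the gradient from the derivation above should vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Ridge closed form: solve (X^T X + lambda I) beta = X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The gradient -2 X^T (y - X beta) + 2 lambda beta vanishes at the solution
grad = -2 * X.T @ (y - X @ beta_ridge) + 2 * lam * beta_ridge
print(np.allclose(grad, 0))  # True
```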
3.3 Bias of Ridge
\[E[\hat{\beta}_{ridge}] = (X^TX + \lambda I)^{-1}X^TX\beta = (X^TX + \lambda I)^{-1}(X^TX + \lambda I - \lambda I)\beta\] \[= \beta - \lambda(X^TX + \lambda I)^{-1}\beta\]
Bias \(= -\lambda(X^TX + \lambda I)^{-1}\beta \neq 0\) — Ridge is biased.
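The two algebraic forms of \(E[\hat{\beta}_{ridge}]\) in the derivation above can be verified numerically (numpy sketch, arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 30, 4, 1.5
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))

# Two equivalent forms of E[beta_ridge] from the derivation above
form1 = A_inv @ (X.T @ X) @ beta
form2 = beta - lam * A_inv @ beta

print(np.allclose(form1, form2))  # True: the algebra checks out
print(np.allclose(form1, beta))   # False: biased whenever lambda > 0
```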
3.4 Bayesian Interpretation
Ridge regression = MAP estimation with a normal prior:
\[\beta_j \overset{iid}{\sim} N\left(0, \frac{\sigma^2}{\lambda}\right)\]
Log posterior: \[\log P(\beta | y, X) \propto -\frac{1}{2\sigma^2}\|y - X\beta\|^2 - \frac{\lambda}{2\sigma^2}\|\beta\|^2\]
Maximizing this is the same as minimizing \(\|y-X\beta\|^2 + \lambda\|\beta\|^2\) → the Ridge objective.
3.5 SVD Perspective
Use the SVD: \(X = U\Sigma V^T\) where \(U \in \mathbb{R}^{n \times p}\), \(\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_p)\), \(V \in \mathbb{R}^{p \times p}\).
\[\hat{\beta}_{ridge} = (V\Sigma^2 V^T + \lambda VV^T)^{-1}V\Sigma U^Ty = V(\Sigma^2 + \lambda I)^{-1}\Sigma U^Ty\]
\[\hat{\beta}_{ridge} = \sum_{j=1}^p \frac{\sigma_j}{\sigma_j^2 + \lambda} (\mathbf{u}_j^T y) \mathbf{v}_j\]
Compare with OLS: \[\hat{\beta}_{OLS} = \sum_{j=1}^p \frac{1}{\sigma_j}(\mathbf{u}_j^T y)\mathbf{v}_j\]
Ridge shrinks each OLS component by the factor \(\sigma_j^2/(\sigma_j^2 + \lambda)\):
- Components with large \(\sigma_j\) (well-determined directions): \(\sigma_j^2/(\sigma_j^2+\lambda) \approx 1\) → little shrinkage
- Components with small \(\sigma_j\) (ill-determined, multicollinearity): \(\sigma_j^2/(\sigma_j^2+\lambda) \approx 0\) → heavy shrinkage
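The SVD expression can be checked against the closed form numerically (numpy sketch, arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 60, 4, 3.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Thin SVD: X = U diag(s) V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Ridge via SVD: sum_j s_j/(s_j^2 + lam) (u_j^T y) v_j
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

# Ridge via the closed-form normal equations
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta_svd, beta_closed))  # True
```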
3.6 Effective Degrees of Freedom
\[\text{df}(\lambda) = \text{tr}\left(X(X^TX+\lambda I)^{-1}X^T\right) = \sum_{j=1}^p \frac{\sigma_j^2}{\sigma_j^2 + \lambda}\]
- \(\lambda = 0\): \(\text{df} = p\) (OLS, all degrees of freedom)
- \(\lambda \to \infty\): \(\text{df} \to 0\) (intercept only)
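A numpy sketch (synthetic data) confirming that the trace and singular-value forms of \(\text{df}(\lambda)\) agree, and that \(\lambda = 0\) recovers \(\text{df} = p\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 40, 6, 5.0
X = rng.standard_normal((n, p))
s = np.linalg.svd(X, compute_uv=False)

# df via the trace of the ridge hat matrix X (X^T X + lam I)^{-1} X^T ...
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
df_trace = np.trace(H)
# ... and via the singular values
df_svd = np.sum(s**2 / (s**2 + lam))

print(np.allclose(df_trace, df_svd))               # True
print(np.isclose(np.sum(s**2 / (s**2 + 0.0)), p))  # lambda = 0 gives df = p
```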
4 LASSO (L1 Regularization)
4.1 Objective Function
\[\hat{\beta}_{LASSO} = \arg\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|_1\]
where \(\|\beta\|_1 = \sum_{j=1}^p |\beta_j|\).
Constrained form: \(\min \|y-X\beta\|^2\) s.t. \(\|\beta\|_1 \leq t\).
There is no closed-form solution because \(\|\beta\|_1\) is not differentiable at \(\beta_j = 0\).
4.2 Sparsity: Why LASSO Can Set Coefficients to Zero
Consider the constraint sets:
- Ridge: \(\|\beta\|^2 \leq t\) → a sphere in 2D (smooth, no corners)
- LASSO: \(\|\beta\|_1 \leq t\) → a diamond/rhombus in 2D (with corners on the axes!)
The loss \(\|y-X\beta\|^2\) traces ellipses in \(\beta\)-space. The solution is the point where an ellipse first touches the constraint set.
For the L2 ball (sphere): contact usually occurs on the smooth boundary → both \(\beta_j \neq 0\).
For the L1 ball (diamond): contact often occurs at a corner \((\beta_j, 0)\) or \((0, \beta_k)\) → one coefficient = 0!
For large \(p\): the L1 ball has \(2p\) corners along the axes — the solution tends to land on a corner → a sparse solution.
4.3 Bayesian Interpretation
LASSO = MAP with a Laplace prior:
\[\beta_j \overset{iid}{\sim} \text{Laplace}\left(0, \frac{1}{\lambda}\right) \propto \exp(-\lambda|\beta_j|)\]
The Laplace prior has a “spike” at zero and “heavy tails” — which encourages sparsity.
4.4 Soft-Thresholding Operator
For orthonormal \(X\) (i.e., \(X^TX = I\)), LASSO has a closed-form solution:
\[\hat{\beta}_j^{LASSO} = S(\hat{\beta}_j^{OLS}, \lambda/2) = \text{sign}(\hat{\beta}_j^{OLS}) \cdot \max(|\hat{\beta}_j^{OLS}| - \lambda/2, 0)\]
- If \(|\hat{\beta}_j^{OLS}| > \lambda/2\): shrink the magnitude by \(\lambda/2\)
- If \(|\hat{\beta}_j^{OLS}| \leq \lambda/2\): set exactly to zero
This explains sparsity analytically.
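The operator itself is a one-liner; a Python sketch (the `soft_threshold` helper name is ours, not a library function):

```python
import numpy as np

def soft_threshold(b, t):
    """S(b, t) = sign(b) * max(|b| - t, 0), applied elementwise."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

b_ols = np.array([2.0, 0.3, -1.0, -0.1])
# Entries with |b| <= 0.5 are zeroed exactly; the rest are shrunk by 0.5
print(soft_threshold(b_ols, 0.5))
```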
4.5 Coordinate Descent Algorithm
For general \(X\), optimize with coordinate descent: update one coefficient at a time.
Iterate over each \(\beta_j\) with the other coefficients held fixed:
\[r_j = y - X_{-j}\beta_{-j} \quad \text{(partial residual)}\]
\[\hat{\beta}_j \leftarrow S\left(\frac{x_j^T r_j}{x_j^Tx_j}, \frac{\lambda}{2x_j^Tx_j}\right)\]
The iterates converge to a global minimum because the LASSO objective is convex.
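A minimal Python sketch of this update (the `lasso_cd` helper is ours, not the glmnet implementation); for orthonormal columns it should reproduce the soft-thresholding solution from the previous subsection:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]          # partial residual
            z = X[:, j] @ r_j / (X[:, j] @ X[:, j])
            t = lam / (2 * X[:, j] @ X[:, j])
            b[j] = np.sign(z) * max(abs(z) - t, 0.0)  # soft-threshold update
    return b

# For orthonormal columns the result should match S(beta_OLS, lam/2)
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((50, 3)))     # Q^T Q = I
y = rng.standard_normal(50)
b_cd = lasso_cd(Q, y, lam=0.4)
b_ols = Q.T @ y
b_soft = np.sign(b_ols) * np.maximum(np.abs(b_ols) - 0.2, 0.0)
print(np.allclose(b_cd, b_soft))  # True
```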
5 Elastic Net
5.1 Objective
\[\hat{\beta}_{EN} = \arg\min_\beta \|y-X\beta\|^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|^2\]
This is usually reparametrized with \(\lambda = \lambda_1 + \lambda_2\) and \(\alpha = \lambda_1/(\lambda_1+\lambda_2)\):
\[\min_\beta \|y-X\beta\|^2 + \lambda[\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|^2]\]
- \(\alpha = 1\): LASSO
- \(\alpha = 0\): Ridge
- \(\alpha \in (0,1)\): Elastic Net
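The two parametrizations assign identical penalties, since \(\lambda\alpha = \lambda_1\) and \(\lambda(1-\alpha) = \lambda_2\); a quick numpy check with arbitrary values:

```python
import numpy as np

lam1, lam2 = 0.6, 0.9
lam, alpha = lam1 + lam2, lam1 / (lam1 + lam2)

beta = np.array([1.0, -2.0, 0.5])
# Penalty in (lambda_1, lambda_2) form vs. (lambda, alpha) form
pen_original = lam1 * np.abs(beta).sum() + lam2 * (beta**2).sum()
pen_reparam = lam * (alpha * np.abs(beta).sum() + (1 - alpha) * (beta**2).sum())
print(np.isclose(pen_original, pen_reparam))  # True
```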
5.2 When Is Elastic Net Better?
LASSO with correlated features: LASSO tends to pick one variable from a group of correlated variables arbitrarily and drop the rest. For \(p > n\), LASSO selects at most \(n\) variables.
Elastic Net: the L2 term induces a “grouping effect” — correlated variables tend to enter or leave the model together, rather than one being picked arbitrarily.
6 Cross-Validation for Choosing \(\lambda\)
6.1 K-Fold Cross-Validation
- Split the data into \(K\) folds
- For each \(\lambda\) and each fold \(k\):
- Fit the model on the other \(K-1\) folds
- Evaluate the error on fold \(k\)
- CV error: \(\text{CV}(\lambda) = \frac{1}{K}\sum_{k=1}^K \text{Error}_k(\lambda)\)
- Choose the \(\lambda\) that minimizes \(\text{CV}(\lambda)\)
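The steps above can be sketched in Python for ridge (the `cv_ridge` helper is ours; the chapter's R example uses cv.glmnet, which additionally computes standard errors):

```python
import numpy as np

def cv_ridge(X, y, lambdas, K=5, seed=0):
    """K-fold CV curve for ridge; returns the mean test MSE per lambda."""
    n, p = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    errs = np.zeros(len(lambdas))
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[i] for i in range(K) if i != k])
        Xtr, ytr, Xte, yte = X[train], y[train], X[test], y[test]
        for i, lam in enumerate(lambdas):
            b = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ ytr)
            errs[i] += np.mean((yte - Xte @ b) ** 2)
    return errs / K

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 8))
y = X @ np.array([2, 0, 1, 0, 0, 0, -1.5, 0]) + rng.standard_normal(100)
lambdas = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
cv_err = cv_ridge(X, y, lambdas)
best = lambdas[np.argmin(cv_err)]  # lambda_min
```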
6.2 \(\lambda_{min}\) vs \(\lambda_{1se}\)
- \(\lambda_{min}\): minimizes CV error — most predictive
- \(\lambda_{1se}\): one-standard-error rule — pick the largest (most regularized) \(\lambda\) whose CV error is within one SE of the minimum
glmnet default: \(\lambda_{1se}\) — more parsimonious, more interpretable.
7 Worked Example: Ridge vs LASSO
library(glmnet)
library(tidyverse)
set.seed(42)
n <- 100; p <- 10
# Create correlated features: X1 and X2 are highly correlated
Z <- rnorm(n)
X <- matrix(rnorm(n*p), n, p)
X[,1] <- Z + 0.1 * rnorm(n) # X1 highly correlated with Z
X[,2] <- Z + 0.1 * rnorm(n) # X2 highly correlated with Z
# True model: only X1, X3, X7 matter
beta_true <- c(2, 0, 1, 0, 0, 0, -1.5, 0, 0, 0)
y <- X %*% beta_true + rnorm(n, sd = 1)
# Ridge
ridge_cv <- cv.glmnet(X, y, alpha = 0) # alpha = 0 for Ridge
plot(ridge_cv, main = "Ridge CV")
ridge_best <- glmnet(X, y, alpha = 0, lambda = ridge_cv$lambda.min)
cat("Ridge coefficients:\n")
print(coef(ridge_best))
# LASSO
lasso_cv <- cv.glmnet(X, y, alpha = 1) # alpha = 1 for LASSO
plot(lasso_cv, main = "LASSO CV")
lasso_best <- glmnet(X, y, alpha = 1, lambda = lasso_cv$lambda.min)
cat("\nLASSO coefficients:\n")
print(coef(lasso_best))
# Elastic Net
en_cv <- cv.glmnet(X, y, alpha = 0.5)
en_best <- glmnet(X, y, alpha = 0.5, lambda = en_cv$lambda.min)
cat("\nElastic Net coefficients:\n")
print(coef(en_best))
# Key observation:
# Ridge: BOTH X1 and X2 have non-zero coefficients (~equal due to correlation)
# LASSO: picks ONE of X1, X2 (arbitrarily) and zeros the other
# EN: keeps both but smaller
# Regularization Path
par(mfrow = c(1,2))
plot(glmnet(X, y, alpha=0), xvar="lambda", main="Ridge Path")
plot(glmnet(X, y, alpha=1), xvar="lambda", main="LASSO Path")
# LASSO path: coefficients set to exactly zero at some lambda
# Manual SVD demonstration for Ridge
svd_X <- svd(X)
lambda_val <- ridge_cv$lambda.min
shrinkage_factors <- svd_X$d^2 / (svd_X$d^2 + lambda_val)
cat("\nShrinkage factors per principal direction:\n")
cat(round(shrinkage_factors, 3), "\n")
# Small singular values -> lots of shrinkage
# Effective df
eff_df <- sum(svd_X$d^2 / (svd_X$d^2 + lambda_val))
cat(sprintf("\nEffective degrees of freedom (Ridge, lambda=%.2f): %.2f\n",
            lambda_val, eff_df))
Key Insight: Ridge keeps both X1 and X2 with nearly equal coefficients (because they are highly correlated). LASSO picks one of them arbitrarily and drops the other.
Regularization connects to many other areas:
Tikhonov regularization in numerical analysis: exactly Ridge — solving ill-conditioned linear systems by adding \(\lambda I\) to \(X^TX\)
Post-LASSO (double selection): for causal inference, run LASSO to select variables, then OLS on the selected variables plus the variables of interest. Standard errors remain valid (Belloni, Chernozhukov, Hansen)
Adaptive LASSO: a different weight on \(\lambda\) for each \(\beta_j\), based on an initial estimate. Oracle property: consistent variable selection
Neural networks — weight decay: L2 regularization on the weights = Ridge. Batch normalization and dropout are alternative regularizers
Group LASSO: regularize by groups of features — an entire group enters or leaves the model together
Elastic Net in glmnet: uses warm starts and active-set methods — computationally efficient for large \(p\)
8 Practice Problems
Problem 1: Show that the Ridge estimator is biased.
\(E[\hat{\beta}_{ridge}] = (X^TX+\lambda I)^{-1}X^TX\beta\)
This \(\neq \beta\) unless \(\lambda = 0\) or \(\beta = 0\). Bias \(= -\lambda(X^TX+\lambda I)^{-1}\beta\).
Problem 2: Derive the variance of the Ridge estimator.
\(\text{Var}(\hat{\beta}_{ridge}) = \sigma^2(X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1}\)
With the eigendecomposition \(X^TX = V\Lambda V^T\), this variance is smaller than that of OLS in every direction.
Problem 3: Use the SVD to show that Ridge handles multicollinearity.
If \(X\) has nearly identical columns, there is a small singular value \(\sigma_j \approx 0\). OLS: the corresponding component of \(\hat{\beta}^{OLS}\) is proportional to \(1/\sigma_j \to \infty\).
Ridge: the factor \(\sigma_j/(\sigma_j^2+\lambda) \leq 1/(2\sqrt{\lambda})\) — bounded for all \(\sigma_j \geq 0\).
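The bound follows from AM–GM (\(\sigma_j^2 + \lambda \geq 2\sigma_j\sqrt{\lambda}\)), with equality at \(\sigma_j = \sqrt{\lambda}\); a grid check in numpy:

```python
import numpy as np

lam = 2.0
sig = np.linspace(1e-6, 50.0, 10_000)

# Ridge SVD factor sigma/(sigma^2 + lambda) vs. its bound 1/(2 sqrt(lambda))
factor = sig / (sig**2 + lam)
bound = 1 / (2 * np.sqrt(lam))
print(np.all(factor <= bound + 1e-12))             # True
print(np.isclose(factor.max(), bound, rtol=1e-3))  # max near sigma = sqrt(lam)
```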
Problem 4: Derive soft-thresholding as the LASSO solution for orthonormal \(X\).
For \(X^TX = I\): \(\|y - X\beta\|^2 = \|y\|^2 - 2\beta^TX^Ty + \|\beta\|^2 = \text{const} + \sum_j (\beta_j - \hat{\beta}_j^{OLS})^2\)
Separable! Optimize for each \(j\): \(\min_{\beta_j} (\beta_j - c_j)^2 + \lambda|\beta_j|\) with \(c_j = \hat{\beta}_j^{OLS}\)
KKT conditions: \(2(\hat{\beta}_j - c_j) + \lambda \cdot \text{sign}(\hat{\beta}_j) = 0\) if \(\hat{\beta}_j \neq 0\)
Solution: \(\hat{\beta}_j = \text{sign}(c_j)\max(|c_j| - \lambda/2, 0)\) — soft thresholding.
Problem 5: When is Elastic Net better than LASSO? Give a concrete example.
Genetic data: many SNPs within a gene are highly correlated. LASSO picks one SNP per gene arbitrarily. Elastic Net selects the correlated SNPs within a gene together — more interpretable and more reproducible.