Regularization — Ridge, LASSO, Elastic Net

Taming Overfitting with Mathematics

ml-math
intermediate
Mathematical derivations for Ridge (L2), LASSO (L1), and Elastic Net: closed-form solutions, the SVD perspective, sparsity, the Bayesian MAP interpretation, soft-thresholding, and cross-validation.

1 Why Does This Matter?

Note: Why This Matters for Your Work

Regularization is not just an implementation technique: it is a mathematical trade-off between bias and variance, with several layers of interpretation:

  1. Optimization: adds a penalty to the objective function
  2. Bayesian: MAP estimation under a particular prior distribution
  3. SVD: shrinkage along principal-component directions

Ridge and LASSO differ fundamentally in the nature of their solutions: Ridge always shrinks coefficients but never sets them to zero, while LASSO can eliminate variables entirely. Knowing why helps you choose the right one.

For data analysts: regularization is directly relevant whenever \(p\) is large (many features), the data are observational, or multicollinearity is present.


2 The Bias-Variance Trade-off

For an estimator \(\hat{\beta}\):

\[\text{MSE}(\hat{\beta}) = E\|\hat{\beta} - \beta\|^2 = \underbrace{\|E[\hat{\beta}] - \beta\|^2}_{\text{Bias}^2} + \underbrace{\text{tr}(\text{Var}(\hat{\beta}))}_{\text{Variance}}\]

OLS is unbiased (\(E[\hat{\beta}_{OLS}] = \beta\)) and has minimum variance among unbiased estimators (Gauss-Markov). But when \(p\) is large or the predictors are correlated, the variance of OLS can be enormous.

Key insight: by sacrificing a little bias, we can drastically reduce variance → the net MSE goes down.

\[\text{Bias}^2(\hat{\beta}_{ridge}) > 0 \quad \text{tapi} \quad \text{Variance}(\hat{\beta}_{ridge}) \ll \text{Variance}(\hat{\beta}_{OLS})\]
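A quick simulation makes this concrete. Below is a minimal sketch (base R; the collinear design, true coefficients, and \(\lambda = 5\) are arbitrary illustrative choices) that repeatedly redraws the noise on a fixed design and compares the empirical squared bias, variance, and MSE of OLS against Ridge:

# Minimal simulation: bias-variance trade-off of Ridge vs OLS
set.seed(1)
n <- 50; p <- 2
Z <- rnorm(n)
X <- cbind(Z + 0.05 * rnorm(n), Z + 0.05 * rnorm(n))  # highly collinear design
beta <- c(1, 1)
lambda <- 5                                            # illustrative value

fit_both <- function() {
  y <- X %*% beta + rnorm(n)
  ols   <- solve(crossprod(X), crossprod(X, y))
  ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
  cbind(ols, ridge)
}

est <- replicate(2000, fit_both())      # p x 2 x 2000 array of estimates
for (m in 1:2) {
  b     <- est[, m, ]                   # p x 2000 estimates for this estimator
  bias2 <- sum((rowMeans(b) - beta)^2)
  vars  <- sum(apply(b, 1, var))
  cat(c("OLS", "Ridge")[m],
      ": bias^2 =", round(bias2, 3),
      " variance =", round(vars, 3),
      " MSE =", round(bias2 + vars, 3), "\n")
}

On a design this collinear, Ridge's small squared bias is typically dwarfed by the variance it removes, so its total MSE comes out lower.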


3 Ridge Regression (L2 Regularization)

3.1 Objective Function

\[\hat{\beta}_{ridge} = \arg\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|^2\]

This is equivalent to the constrained optimization: \[\min_\beta \|y - X\beta\|^2 \quad \text{s.t.} \quad \|\beta\|^2 \leq t(\lambda)\]

3.2 Closed-Form Solution

Take the gradient and set it to zero:

\[\frac{\partial}{\partial\beta}\left[\|y-X\beta\|^2 + \lambda\|\beta\|^2\right] = -2X^T(y-X\beta) + 2\lambda\beta = 0\]

\[X^TX\hat{\beta} + \lambda\hat{\beta} = X^Ty\]

\[(X^TX + \lambda I)\hat{\beta}_{ridge} = X^Ty\]

\[\boxed{\hat{\beta}_{ridge} = (X^TX + \lambda I)^{-1}X^Ty}\]

Key property: \((X^TX + \lambda I)\) is always invertible for \(\lambda > 0\), even when \(X\) is not full rank! This is what resolves multicollinearity.
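This is easy to check numerically. A minimal sketch (base R; the duplicated-column design is contrived for illustration): with an exactly rank-deficient \(X\), the OLS normal equations fail, while the Ridge system is solvable for any \(\lambda > 0\):

set.seed(7)
n  <- 30
x1 <- rnorm(n)
X  <- cbind(x1, x1, rnorm(n))   # first two columns identical: X is rank-deficient
y  <- x1 + rnorm(n)

# OLS normal equations fail: X'X is exactly singular
try(solve(crossprod(X), crossprod(X, y)))

# Ridge: adding lambda * I makes the system invertible
lambda <- 0.1
beta_ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
print(beta_ridge)               # finite and well-defined for any lambda > 0

Note how Ridge splits the shared signal roughly equally between the two identical columns, a preview of the grouping behavior discussed later.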

3.3 Bias of Ridge

\[E[\hat{\beta}_{ridge}] = (X^TX + \lambda I)^{-1}X^TX\beta = (X^TX + \lambda I)^{-1}(X^TX + \lambda I - \lambda I)\beta\] \[= \beta - \lambda(X^TX + \lambda I)^{-1}\beta\]

Bias \(= -\lambda(X^TX + \lambda I)^{-1}\beta \neq 0\), so Ridge is biased.

3.4 Bayesian Interpretation

Ridge regression = MAP estimation with a normal prior:

\[\beta_j \overset{iid}{\sim} N\left(0, \frac{\sigma^2}{\lambda}\right)\]

Log posterior: \[\log P(\beta | y, X) \propto -\frac{1}{2\sigma^2}\|y - X\beta\|^2 - \frac{\lambda}{2\sigma^2}\|\beta\|^2\]

Maximizing = minimizing \(\|y-X\beta\|^2 + \lambda\|\beta\|^2\) → Ridge objective.

3.5 SVD Perspective

Important: Definition: Ridge via SVD

Use the SVD \(X = U\Sigma V^T\), where \(U \in \mathbb{R}^{n \times p}\), \(\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_p)\), and \(V \in \mathbb{R}^{p \times p}\).

\[\hat{\beta}_{ridge} = (V\Sigma^2 V^T + \lambda VV^T)^{-1}V\Sigma U^Ty = V(\Sigma^2 + \lambda I)^{-1}\Sigma U^Ty\]

\[\hat{\beta}_{ridge} = \sum_{j=1}^p \frac{\sigma_j}{\sigma_j^2 + \lambda} (\mathbf{u}_j^T y) \mathbf{v}_j\]

Compare with OLS: \[\hat{\beta}_{OLS} = \sum_{j=1}^p \frac{1}{\sigma_j}(\mathbf{u}_j^T y)\mathbf{v}_j\]

Ridge shrinks each OLS component by the factor \(\sigma_j^2/(\sigma_j^2 + \lambda)\):

  • Components with large \(\sigma_j\) (well-determined directions): \(\sigma_j^2/(\sigma_j^2+\lambda) \approx 1\) → little shrinkage
  • Components with small \(\sigma_j\) (ill-determined directions, multicollinearity): \(\sigma_j^2/(\sigma_j^2+\lambda) \approx 0\) → heavy shrinkage

3.6 Effective Degrees of Freedom

\[\text{df}(\lambda) = \text{tr}\left(X(X^TX+\lambda I)^{-1}X^T\right) = \sum_{j=1}^p \frac{\sigma_j^2}{\sigma_j^2 + \lambda}\]

  • \(\lambda = 0\): \(\text{df} = p\) (OLS, all degrees of freedom)
  • \(\lambda \to \infty\): \(\text{df} \to 0\) (intercept only)

4 LASSO (L1 Regularization)

4.1 Objective Function

\[\hat{\beta}_{LASSO} = \arg\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|_1\]

where \(\|\beta\|_1 = \sum_{j=1}^p |\beta_j|\).

Constrained form: \(\min \|y-X\beta\|^2\) s.t. \(\|\beta\|_1 \leq t\).

There is no general closed-form solution because \(\|\beta\|_1\) is not differentiable at \(\beta_j = 0\).

4.2 Sparsity: Why LASSO Can Set Coefficients to Zero

Important: Definition: Geometric Intuition, LASSO vs Ridge

Consider the constraint sets:

  • Ridge: \(\|\beta\|^2 \leq t\) is a sphere (smooth, no corners), a disk in 2D
  • LASSO: \(\|\beta\|_1 \leq t\) is a diamond/rhombus in 2D (it has corners on the axes!)

The level sets of the loss \(\|y-X\beta\|^2\) are ellipses in \(\beta\)-space. The solution is the point where the expanding ellipse first touches the constraint set.

For the L2 ball (sphere): contact usually happens on the smooth boundary → both \(\beta_j \neq 0\).

For the L1 ball (diamond): contact often happens at a corner \((\beta_j, 0)\) or \((0, \beta_k)\) → one of the coefficients = 0!

For large \(p\): the L1 ball has \(2p\) corners along the axes, and the solution tends to land on a corner → a sparse solution.

4.3 Bayesian Interpretation

LASSO = MAP estimation with a Laplace prior:

\[\beta_j \overset{iid}{\sim} \text{Laplace}\left(0, \frac{1}{\lambda}\right) \propto \exp(-\lambda|\beta_j|)\]

The Laplace prior has a “spike” at zero and heavy tails, which encourages sparsity.
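To see the spike, here is a minimal plotting sketch (base R; dlaplace is our own helper, and the scale is chosen so both priors have unit variance):

# Laplace density with location 0 and scale b: f(x) = exp(-|x|/b) / (2b)
dlaplace <- function(x, b) exp(-abs(x) / b) / (2 * b)

curve(dnorm(x), from = -4, to = 4, ylab = "density",
      main = "Normal (Ridge) vs Laplace (LASSO) prior")
curve(dlaplace(x, b = sqrt(1/2)), add = TRUE, lty = 2)  # Var = 2b^2 = 1
legend("topright", c("Normal", "Laplace"), lty = c(1, 2))
# The Laplace density is sharply peaked at 0 and has heavier tails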

4.4 Soft-Thresholding Operator

For orthonormal \(X\) (i.e., \(X^TX = I\)), LASSO has a closed-form solution:

\[\hat{\beta}_j^{LASSO} = S(\hat{\beta}_j^{OLS}, \lambda/2) = \text{sign}(\hat{\beta}_j^{OLS}) \cdot \max(|\hat{\beta}_j^{OLS}| - \lambda/2, 0)\]

  • If \(|\hat{\beta}_j^{OLS}| > \lambda/2\): shrink the magnitude by \(\lambda/2\)
  • If \(|\hat{\beta}_j^{OLS}| \leq \lambda/2\): set exactly to zero

This explains sparsity analytically.
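The operator itself is a one-liner in R. A minimal sketch (soft_threshold is our own name for \(S\)) that also plots its characteristic shape: a dead zone of width \(\lambda\) around zero, then a shifted identity:

# Soft-thresholding: S(z, t) = sign(z) * max(|z| - t, 0)
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

z <- seq(-3, 3, length.out = 201)
plot(z, soft_threshold(z, t = 1), type = "l",
     ylab = "S(z, 1)", main = "Soft-thresholding")
abline(0, 1, lty = 3)                  # identity (no shrinkage), for reference
abline(v = c(-1, 1), h = 0, lty = 2)   # the dead zone |z| <= 1 maps to exactly 0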

4.5 Coordinate Descent Algorithm

For general \(X\), optimize with coordinate descent: update one coefficient at a time.

The update for \(\beta_j\) with all other coefficients held fixed:

\[r_j = y - X_{-j}\beta_{-j} \quad \text{(partial residual)}\]

\[\hat{\beta}_j \leftarrow S\left(\frac{x_j^T r_j}{x_j^Tx_j}, \frac{\lambda}{2x_j^Tx_j}\right)\]

Convergence to the global minimum is guaranteed because the LASSO objective is convex.
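A minimal implementation of this loop (our own helper names; no intercept, standardization, or convergence check, unlike a production solver such as glmnet):

# Coordinate descent for ||y - X b||^2 + lambda * ||b||_1
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)  # redefined so this stands alone

lasso_cd <- function(X, y, lambda, n_iter = 200) {
  p <- ncol(X)
  b <- rep(0, p)
  for (it in 1:n_iter) {
    for (j in 1:p) {
      r_j  <- y - X[, -j, drop = FALSE] %*% b[-j]   # partial residual
      xj2  <- sum(X[, j]^2)
      b[j] <- soft_threshold(sum(X[, j] * r_j) / xj2, lambda / (2 * xj2))
    }
  }
  b
}

# Sanity check on simulated data (lambda = 20 is an arbitrary illustrative value)
set.seed(3)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- X %*% c(2, 0, 0, -1, 0) + rnorm(100)
round(lasso_cd(X, y, lambda = 20), 3)   # the truly-zero coefficients should come out 0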


5 Elastic Net

5.1 Objective

\[\hat{\beta}_{EN} = \arg\min_\beta \|y-X\beta\|^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|^2\]

This is usually reparameterized with \(\lambda = \lambda_1 + \lambda_2\) and mixing weight \(\alpha = \lambda_1/(\lambda_1+\lambda_2)\) (a glmnet sketch follows the list below):

\[\min_\beta \|y-X\beta\|^2 + \lambda[\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|^2]\]

  • \(\alpha = 1\): LASSO
  • \(\alpha = 0\): Ridge
  • \(\alpha \in (0,1)\): Elastic Net
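This \(\alpha\) is essentially glmnet's alpha argument (glmnet scales its penalty slightly differently, but the role of \(\alpha\) as the L1/L2 mix is the same). A minimal sketch on simulated data, with an arbitrary fixed \(\lambda\):

library(glmnet)
set.seed(9)
X <- matrix(rnorm(100 * 8), 100, 8)
y <- X[, 1] - X[, 2] + rnorm(100)

# Same data and lambda; only the L1/L2 mix changes
for (a in c(0, 0.5, 1)) {
  fit <- glmnet(X, y, alpha = a, lambda = 0.5)
  cat("alpha =", a, "-> nonzero coefficients:",
      sum(coef(fit)[-1] != 0), "of", ncol(X), "\n")
}
# alpha = 0 (Ridge) keeps all 8; alpha = 1 (LASSO) typically zeroes most of them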

5.2 When Is Elastic Net Better?

LASSO with correlated features: LASSO tends to pick one variable from a group of correlated variables, essentially arbitrarily, and drop the rest. And for \(p > n\), LASSO selects at most \(n\) variables.

Elastic Net: the L2 term provides a “grouping effect”: correlated variables tend to enter or leave the model together, instead of one being chosen arbitrarily.


6 Cross-Validation for Choosing \(\lambda\)

6.1 K-Fold Cross-Validation

  1. Split the data into \(K\) folds
  2. For each \(\lambda\) and each fold \(k\):
    • Fit the model on the other \(K-1\) folds
    • Evaluate the error on fold \(k\)
  3. CV error: \(\text{CV}(\lambda) = \frac{1}{K}\sum_{k=1}^K \text{Error}_k(\lambda)\)
  4. Pick the \(\lambda\) that minimizes \(\text{CV}(\lambda)\) (a manual sketch follows below)
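A minimal hand-rolled version for Ridge (base R; cv.glmnet does this internally, with standardization and a smarter \(\lambda\) grid, so this is only to make the recipe explicit):

# Manual K-fold CV for Ridge lambda selection (illustrative grid and data)
set.seed(5)
n <- 100; p <- 10; K <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, rep(0, p - 1)) + rnorm(n)

lambdas <- 10^seq(-2, 3, length.out = 30)
fold    <- sample(rep(1:K, length.out = n))   # random fold assignment

cv_error <- sapply(lambdas, function(lam) {
  mean(sapply(1:K, function(k) {
    tr <- fold != k                            # training rows for this fold
    b  <- solve(crossprod(X[tr, ]) + lam * diag(p),
                crossprod(X[tr, ], y[tr]))
    mean((y[!tr] - X[!tr, ] %*% b)^2)          # test MSE on the held-out fold
  }))
})

best_lambda <- lambdas[which.min(cv_error)]
cat("lambda minimizing CV error:", signif(best_lambda, 3), "\n")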

6.2 \(\lambda_{min}\) vs \(\lambda_{1se}\)

  • \(\lambda_{min}\): minimizes the CV error; the most predictive choice
  • \(\lambda_{1se}\): the one-standard-error rule; choose the largest (most regularized) \(\lambda\) whose CV error is within one SE of the minimum

glmnet's default (for coef() and predict() on a cv.glmnet fit) is \(\lambda_{1se}\): more parsimonious, more interpretable.
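Both values are stored on the cv.glmnet object; a minimal self-contained example:

library(glmnet)
set.seed(11)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X[, 1] + rnorm(100)

cv_fit <- cv.glmnet(X, y, alpha = 1)
cv_fit$lambda.min                 # lambda with minimum CV error
cv_fit$lambda.1se                 # largest lambda within 1 SE of the minimum
coef(cv_fit, s = "lambda.1se")    # the default s for coef() / predict()
coef(cv_fit, s = "lambda.min")    # usually less sparse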


7 Worked Example: Ridge vs LASSO

library(glmnet)
library(tidyverse)

set.seed(42)
n <- 100; p <- 10

# Create correlated features: X1 and X2 are highly correlated
Z <- rnorm(n)
X <- matrix(rnorm(n*p), n, p)
X[,1] <- Z + 0.1 * rnorm(n)  # X1 highly correlated with Z
X[,2] <- Z + 0.1 * rnorm(n)  # X2 highly correlated with Z

# True model: only X1, X3, X7 matter
beta_true <- c(2, 0, 1, 0, 0, 0, -1.5, 0, 0, 0)
y <- X %*% beta_true + rnorm(n, sd = 1)

# Ridge
ridge_cv <- cv.glmnet(X, y, alpha = 0)  # alpha = 0 for Ridge
plot(ridge_cv, main = "Ridge CV")
ridge_best <- glmnet(X, y, alpha = 0, lambda = ridge_cv$lambda.min)
cat("Ridge coefficients:\n")
print(coef(ridge_best))

# LASSO
lasso_cv <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 for LASSO
plot(lasso_cv, main = "LASSO CV")
lasso_best <- glmnet(X, y, alpha = 1, lambda = lasso_cv$lambda.min)
cat("\nLASSO coefficients:\n")
print(coef(lasso_best))

# Elastic Net
en_cv <- cv.glmnet(X, y, alpha = 0.5)
en_best <- glmnet(X, y, alpha = 0.5, lambda = en_cv$lambda.min)
cat("\nElastic Net coefficients:\n")
print(coef(en_best))

# Key observation:
# Ridge: BOTH X1 and X2 have non-zero coefficients (~equal due to correlation)
# LASSO: picks ONE of X1, X2 (arbitrarily) and zeros the other
# EN: keeps both, with smaller coefficients

# Regularization Path
par(mfrow = c(1,2))
plot(glmnet(X, y, alpha=0), xvar="lambda", main="Ridge Path")
plot(glmnet(X, y, alpha=1), xvar="lambda", main="LASSO Path")
# LASSO path: coefficients set to exactly zero at some lambda

# Manual SVD demonstration for Ridge
svd_X <- svd(X)
lambda_val <- ridge_cv$lambda.min
shrinkage_factors <- svd_X$d^2 / (svd_X$d^2 + lambda_val)
cat("\nShrinkage factors per principal direction:\n")
cat(round(shrinkage_factors, 3), "\n")
# Small singular values -> lots of shrinkage

# Effective df
eff_df <- sum(svd_X$d^2 / (svd_X$d^2 + lambda_val))
cat(sprintf("\nEffective degrees of freedom (Ridge, lambda=%.2f): %.2f\n",
            lambda_val, eff_df))

Key Insight: Ridge keeps both X1 and X2 with nearly equal coefficients (because they are highly correlated). LASSO picks one of them, essentially at random, and drops the other.


Caution: Connection: Tikhonov, Post-LASSO, and Weight Decay

Regularization connects to many other areas:

  • Tikhonov regularization in numerical analysis: exactly the same as Ridge; it solves ill-conditioned linear systems by adding \(\lambda I\) to \(X^TX\)

  • Post-LASSO (double selection): for causal inference, run LASSO to select variables, then run OLS on the selected variables plus the variables of interest. Standard errors remain valid (Belloni, Chernozhukov, Hansen)

  • Adaptive LASSO: a different weight on \(\lambda\) for each \(\beta_j\), based on an initial estimate. Oracle property: consistent variable selection

  • Neural networks (weight decay): L2 regularization on the weights \(=\) Ridge. Batch normalization and dropout are alternative forms of regularization

  • Group LASSO: regularizes by groups of features; an entire group enters or leaves the model together

  • Elastic Net in glmnet: uses warm starts and active-set methods, making it computationally efficient for large \(p\)


8 Practice Problems

Problem 1: Show that the Ridge estimator is biased.

\(E[\hat{\beta}_{ridge}] = (X^TX+\lambda I)^{-1}X^TX\beta\)

This \(\neq \beta\) unless \(\lambda = 0\) or \(\beta = 0\). Bias \(= -\lambda(X^TX+\lambda I)^{-1}\beta\).

Problem 2: Derive the variance of the Ridge estimator.

\(\text{Var}(\hat{\beta}_{ridge}) = (X^TX+\lambda I)^{-1}X^TX\sigma^2 [(X^TX+\lambda I)^{-1}]^T\)

With the eigendecomposition \(X^TX = V\Lambda V^T\), the variance is smaller than that of OLS in every direction.

Problem 3: Use the SVD to show that Ridge fixes multicollinearity.

If \(X\) has nearly identical columns, some singular value is small, \(\sigma_j \approx 0\). OLS: the component of \(\hat{\beta}^{OLS}\) along \(\mathbf{v}_j\) is proportional to \(1/\sigma_j \to \infty\).

Ridge: the factor \(\sigma_j/(\sigma_j^2+\lambda) \leq 1/(2\sqrt{\lambda})\) is bounded for all \(\sigma_j \geq 0\).

Problem 4: Derive soft-thresholding as the LASSO solution for orthonormal \(X\).

For \(X^TX = I\): \(\|y - X\beta\|^2 = \|y\|^2 - 2\beta^TX^Ty + \|\beta\|^2 = \text{const} + \sum_j (\beta_j - \hat{\beta}_j^{OLS})^2\)

Separable! Optimize each \(j\) on its own, with \(c_j = \hat{\beta}_j^{OLS}\): \(\min_{\beta_j} (\beta_j - c_j)^2 + \lambda|\beta_j|\)

KKT conditions: \(2(\hat{\beta}_j - c_j) + \lambda \cdot \text{sign}(\hat{\beta}_j) = 0\) if \(\hat{\beta}_j \neq 0\)

Solution: \(\hat{\beta}_j = \text{sign}(c_j)\max(|c_j| - \lambda/2, 0)\) — soft thresholding.

Problem 5: When is Elastic Net better than LASSO? Give a concrete example.

Genetic data: many SNPs within one gene are highly correlated. LASSO picks one SNP per gene arbitrarily. Elastic Net selects all the correlated SNPs in the gene together, which is more interpretable and more reproducible.


Navigation: ← Gradient Descent | Kernel Methods & SVM →