Kernel Methods & SVM
The Kernel Trick: Infinite-Dimensional Feature Spaces in Finite Computation
1 Why Is This Important?
Kernel methods are one of the most elegant mathematical insights in ML: computing inner products in infinite-dimensional spaces cheaply, using only evaluations of a kernel function in the input space.
Kernel methods underlie:
- SVMs: the state-of-the-art pre-deep-learning classifier
- Gaussian Processes: Bayesian nonparametric regression
- Kernel PCA: nonlinear dimensionality reduction
- Kernel Ridge Regression: the dual view of regularization
Understanding the math gives you the foundation for seeing why attention in Transformers (dot-product attention) is a form of kernel computation.
2 Feature Maps and Nonlinear Modeling
2.1 Motivation
Linear model: \(f(x) = w^Tx + b\). It is limited to linear decision boundaries.
A simple idea: first map the data to a high-dimensional feature space, then fit a linear model there.
\[\phi: \mathcal{X} \to \mathcal{F}, \quad f(x) = w^T\phi(x) + b\]
Linear in \(\mathcal{F}\) = nonlinear in \(\mathcal{X}\).
2.2 The Problem: Dimensionality Can Explode
Example: \(x \in \mathbb{R}^2\), degree-2 polynomial: \[\phi(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2) \in \mathbb{R}^6\]
Degree \(d\) in \(\mathbb{R}^p\): \(\binom{p+d}{d}\) features, which grows combinatorially with \(p\) and \(d\).
Degree \(\infty\) (the RBF kernel): an infinite-dimensional feature space!
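A quick check in base R (using `choose()`) shows how fast the explicit feature count grows; the numbers below are for a hypothetical \(p = 10\):
# Number of monomial features of a degree-d polynomial map on R^p: choose(p + d, d)
p <- 10
for (d in c(2, 3, 5, 10)) {
  cat(sprintf("p = %d, degree = %2d -> %s features\n",
              p, d, format(choose(p + d, d), big.mark = ",")))
}
# Already 184,756 features at p = 10, d = 10; the kernel trick never materializes them.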
3 The Kernel Trick
A function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is called a kernel function if there exists a feature map \(\phi\) such that:
\[k(x, z) = \langle \phi(x), \phi(z) \rangle_\mathcal{F}\]
The kernel trick: if an algorithm only needs the inner products \(\langle\phi(x_i), \phi(x_j)\rangle\), we can replace them with \(k(x_i, x_j)\), computed directly in the input space without ever constructing \(\phi\) explicitly!
Complexity: \(O(n^2)\) kernel evaluations versus \(O(n \cdot |\mathcal{F}|)\) for explicit feature computation (where \(|\mathcal{F}|\) can be infinite).
4 Common Kernels
4.1 Linear Kernel
\[k(x, z) = x^Tz\]
Feature map: \(\phi(x) = x\). There is no nonlinearity; this is just the standard linear model.
4.2 Polynomial Kernel
\[k(x, z) = (x^Tz + c)^d\]
Example: \(d = 2\), \(c = 0\), \(x \in \mathbb{R}^2\):
\(k(x,z) = (x_1z_1 + x_2z_2)^2 = x_1^2z_1^2 + 2x_1x_2z_1z_2 + x_2^2z_2^2\)
Feature map: \(\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)\)
\(\langle\phi(x), \phi(z)\rangle = x_1^2z_1^2 + 2x_1x_2z_1z_2 + x_2^2z_2^2\) ✓
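A quick numeric check (a minimal base R sketch with arbitrarily chosen points) confirms that the kernel value and the explicit feature-space inner product agree:
# Degree-2 polynomial kernel vs explicit feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
poly2_kernel <- function(x, z) (sum(x * z))^2
phi2 <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
x <- c(1, 2); z <- c(3, -1)
poly2_kernel(x, z)         # (1*3 + 2*(-1))^2 = 1
sum(phi2(x) * phi2(z))     # same value via the explicit inner product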
4.3 RBF (Radial Basis Function / Gaussian) Kernel
\[k(x, z) = \exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)\]
This is an infinite-dimensional feature map! Via the Taylor expansion:
\[k(x,z) = e^{-\|x\|^2/(2\sigma^2)}\, e^{-\|z\|^2/(2\sigma^2)}\, e^{x^Tz/\sigma^2} = e^{-\|x\|^2/(2\sigma^2)}\, e^{-\|z\|^2/(2\sigma^2)} \sum_{k=0}^\infty \frac{(x^Tz/\sigma^2)^k}{k!}\]
Each term \((x^Tz)^k\) is a degree-\(k\) polynomial kernel → an infinite feature space.
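The expansion is easy to check numerically: truncating the exponential series after a few terms already reproduces the exact RBF value (a small base R sketch with an arbitrary pair of points):
# Truncated Taylor expansion of the RBF kernel; each extra term adds one polynomial degree
sigma <- 1
x <- c(1, 0.5); z <- c(0.2, -0.3)
exact  <- exp(-sum((x - z)^2) / (2 * sigma^2))
prefac <- exp(-sum(x^2) / (2 * sigma^2)) * exp(-sum(z^2) / (2 * sigma^2))
for (deg in c(1, 3, 5, 10)) {
  approx_val <- prefac * sum((sum(x * z) / sigma^2)^(0:deg) / factorial(0:deg))
  cat(sprintf("terms 0..%2d: %.6f   (exact: %.6f)\n", deg, approx_val, exact))
}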
The parameter \(\sigma\) (bandwidth) controls locality: small \(\sigma\) = very local, large \(\sigma\) = more global.
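The bandwidth effect is visible in a one-line table: kernel value as a function of the distance \(r = \|x - z\|\) for a small and a large \(\sigma\) (base R sketch):
# RBF kernel value vs distance for two bandwidths
r <- seq(0, 5, length.out = 6)
k_small <- exp(-r^2 / (2 * 0.5^2))   # sigma = 0.5: very local, ~0 beyond r = 2
k_large <- exp(-r^2 / (2 * 3^2))     # sigma = 3: still above 0.5 at r = 3
round(rbind(r, k_small, k_large), 3)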
4.4 Mercer’s Theorem
\(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is a valid kernel (i.e., there exists a feature map with \(k(x,z) = \langle\phi(x),\phi(z)\rangle\)) if and only if the Gram matrix \(K\) is positive semi-definite for every dataset:
\[K_{ij} = k(x_i, x_j), \quad \forall \{x_1, \ldots, x_n\}: K \succeq 0\]
Practical implication: to verify that \(k\) is a valid kernel, it suffices to check the PSD property; there is no need to construct \(\phi\) explicitly.
Sums and products of valid kernels are valid kernels:
- \(k_1 + k_2\) is valid
- \(k_1 \cdot k_2\) is valid (the polynomial kernel is a product of linear kernels)
- \(c \cdot k\) is valid for \(c > 0\)
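These closure rules can be sanity-checked numerically: build Gram matrices for \(k_1 + k_2\) and \(k_1 \cdot k_2\) on random points and confirm the eigenvalues stay nonnegative (a minimal base R sketch; the elementwise product corresponds to the product kernel):
# Closure check: if K1 and K2 are PSD Gram matrices, so are K1 + K2 and K1 * K2 (elementwise)
set.seed(1)
X  <- matrix(rnorm(20), nrow = 10)             # 10 random points in R^2
D2 <- as.matrix(dist(X))^2                     # squared pairwise distances
K1 <- X %*% t(X)                               # linear kernel
K2 <- exp(-D2 / 2)                             # RBF kernel, sigma = 1
min(eigen(K1 + K2, symmetric = TRUE)$values)   # >= 0 up to numerical error
min(eigen(K1 * K2, symmetric = TRUE)$values)   # >= 0 up to numerical error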
5 Reproducing Kernel Hilbert Space (RKHS)
5.1 Definition
Every valid kernel \(k\) defines an RKHS \(\mathcal{H}_k\): a function space in which an inner product between functions \(f, g\) is defined.
Reproducing property: for \(k_x = k(\cdot, x) \in \mathcal{H}_k\):
\[f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}_k}\]
The function \(k_x\) is the “representer” of the evaluation functional at \(x\).
5.2 Representer Theorem
For the regularized problem: \[\min_{f \in \mathcal{H}_k} \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_k}^2\]
the optimal solution always takes the form: \[f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)\]
Even though we optimize over the infinite-dimensional function space \(\mathcal{H}_k\), the solution is a finite linear combination of kernel functions evaluated at the training points!
6 SVM (Support Vector Machine)
6.1 Primal Problem (Hard Margin)
For linearly separable data, find the hyperplane \(\mathbf{w}^Tx + b = 0\) with maximum margin.
Margin = \(2/\|\mathbf{w}\|\). Maximizing margin = minimizing \(\|\mathbf{w}\|\).
Primal (Hard Margin SVM): \[\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^Tx_i + b) \geq 1 \quad \forall i\]
6.2 Soft Margin SVM
For data that are not linearly separable, allow misclassification via slack variables \(\xi_i \geq 0\):
\[\min_{\mathbf{w},b,\xi} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^Tx_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0\]
\(C\): regularization parameter (trade-off between margin size and misclassification).
6.3 Dual Problem
From the primal Lagrangian (hard-margin case shown; the soft-margin problem adds the terms \(C\sum_i \xi_i - \sum_i \mu_i \xi_i\), which produce the upper bound \(\alpha_i \leq C\) in the dual below):
\[L(\mathbf{w}, b, \alpha) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i[y_i(\mathbf{w}^Tx_i+b) - 1]\]
KKT conditions (stationarity): \[\mathbf{w} = \sum_i \alpha_i y_i x_i, \quad \sum_i \alpha_i y_i = 0\]
Substituting back into the Lagrangian gives the dual problem:
\[\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_iy_j x_i^Tx_j\]
\[\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0\]
6.4 Support Vectors dan KKT
KKT complementary slackness: \[\alpha_i[y_i(\mathbf{w}^Tx_i+b) - 1 + \xi_i] = 0, \quad \mu_i \xi_i = 0\]
- \(\alpha_i = 0\): the point lies outside or exactly on the margin boundary → it does not affect \(\mathbf{w}\)
- \(0 < \alpha_i < C\): the point lies exactly on the margin (\(\xi_i = 0\)); these are the (free) support vectors
- \(\alpha_i = C\): the point lies inside the margin or is misclassified (\(\xi_i > 0\))
Decision function: \(f(x) = \sum_i \alpha_i y_i x_i^Tx + b\) depends only on the support vectors!
6.5 Kernelized SVM
Replace \(x_i^Tx_j\) with \(k(x_i, x_j)\):
\[\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_iy_j k(x_i, x_j)\]
Decision function: \(f(x) = \sum_i \alpha_i y_i k(x_i, x) + b\)
With the RBF kernel: a nonlinear decision boundary in input space, but a linear one in the infinite-dimensional feature space.
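To make the dual concrete, the box-constrained QP above can be handed to a generic solver. The sketch below assumes the `quadprog` package (not used elsewhere in this chapter) and toy two-cluster data of my own choosing; a tiny ridge is added to the quadratic term because `solve.QP` expects a positive-definite matrix:
library(quadprog)
# Toy data: two Gaussian clusters, labels +1 / -1
set.seed(7)
X <- rbind(matrix(rnorm(40, mean =  2), ncol = 2),   # 20 points of class +1
           matrix(rnorm(40, mean = -2), ncol = 2))   # 20 points of class -1
y <- rep(c(1, -1), each = 20)
C <- 10; sigma <- 1
K <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))      # RBF Gram matrix
# Dual as a QP: min 1/2 a' Q a - 1' a  s.t.  sum(a * y) = 0,  0 <= a <= C
Q    <- outer(y, y) * K                              # Q_ij = y_i y_j k(x_i, x_j)
Dmat <- Q + diag(1e-8, length(y))                    # jitter for numerical positive definiteness
dvec <- rep(1, length(y))
Amat <- cbind(y, diag(length(y)), -diag(length(y)))  # equality, alpha >= 0, alpha <= C
bvec <- c(0, rep(0, length(y)), rep(-C, length(y)))
alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
Ka      <- as.vector(K %*% (alpha * y))              # sum_j alpha_j y_j k(x_j, x_i)
free_sv <- which(alpha > 1e-6 & alpha < C - 1e-6)    # unbounded SVs sit exactly on the margin
b       <- mean(y[free_sv] - Ka[free_sv])            # from y_i f(x_i) = 1 for free SVs
cat("support vectors:", sum(alpha > 1e-6), "of", length(y), "\n")
cat("training accuracy:", mean(sign(Ka + b) == y), "\n")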
7 Kernel Ridge Regression
7.1 Primal Formulation
\[\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|^2\]
Solution: \(\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty\)
Prediction: \(\hat{y} = X\hat{\beta} = X(X^TX+\lambda I)^{-1}X^Ty\)
7.2 Dual Formulation
Use the push-through (Woodbury-type) identity:
\[X(X^TX+\lambda I)^{-1} = (XX^T + \lambda I)^{-1}X\]
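The identity can be verified numerically on a random matrix (base R sketch):
# Numeric check of the push-through identity X (X'X + lambda I)^{-1} = (XX' + lambda I)^{-1} X
set.seed(1)
X <- matrix(rnorm(30), nrow = 10, ncol = 3)
lambda <- 0.5
lhs <- X %*% solve(t(X) %*% X + lambda * diag(3))
rhs <- solve(X %*% t(X) + lambda * diag(10)) %*% X
max(abs(lhs - rhs))   # ~0 up to floating-point error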
Define \(K = XX^T\) (the Gram matrix, \(K_{ij} = x_i^Tx_j\)):
\[\hat{y} = K(K + \lambda I)^{-1}y\]
Dual coefficients: \(\hat{\alpha} = (K + \lambda I)^{-1}y\)
Prediction for \(x_{new}\): \(\hat{f}(x_{new}) = k_{new}^T\hat{\alpha}\) where \(k_{new,i} = k(x_i, x_{new})\)
Now replace \(K_{ij} = x_i^Tx_j\) with a kernel \(k(x_i, x_j)\) → Kernel Ridge Regression!
Primal with an explicit \(\phi\): \(\min_w \|y - \Phi w\|^2 + \lambda\|w\|^2\)
Dual: \(\hat{\alpha} = (K + \lambda I)^{-1}y\) where \(K_{ij} = k(x_i, x_j)\)
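Those two lines translate directly into code. Below is a from-scratch Kernel Ridge Regression in base R (a minimal sketch on synthetic data; the worked example in the next section fits the same kind of model with kernlab's `gausspr`). Note that the fitted function is exactly the finite expansion promised by the Representer Theorem:
# Kernel Ridge Regression with an RBF kernel, implemented from the dual formulas
rbf_gram <- function(A, B, sigma = 1) {
  # K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)) between rows of A and rows of B
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-pmax(d2, 0) / (2 * sigma^2))
}
set.seed(42)
x_tr <- matrix(seq(0, 2 * pi, length.out = 40), ncol = 1)
y_tr <- sin(x_tr[, 1]) + rnorm(40, sd = 0.2)
lambda <- 0.1
K         <- rbf_gram(x_tr, x_tr)
alpha_hat <- solve(K + lambda * diag(nrow(K)), y_tr)    # dual coefficients
x_te  <- matrix(seq(0, 2 * pi, length.out = 200), ncol = 1)
y_hat <- rbf_gram(x_te, x_tr) %*% alpha_hat             # f(x_new) = sum_i alpha_i k(x_i, x_new)
mean((y_hat - sin(x_te[, 1]))^2)                        # small MSE against the true function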
8 Worked Example: RBF Kernel Computation
library(kernlab)
library(tidyverse)
# ==========================================
# 1. Manual RBF Kernel Computation
# ==========================================
rbf_kernel <- function(x, z, sigma = 1) {
  exp(-sum((x - z)^2) / (2 * sigma^2))
}
# Small dataset: 3 points in R^2
X_data <- matrix(c(1, 1,
                   2, 1,
                   1, 2), nrow = 3, byrow = TRUE)
# Compute Gram matrix
n <- nrow(X_data)
sigma <- 1
K <- matrix(0, n, n)
for (i in 1:n) {
  for (j in 1:n) {
    K[i, j] <- rbf_kernel(X_data[i, ], X_data[j, ], sigma)
  }
}
cat("Gram matrix K (RBF, sigma=1):\n")
print(round(K, 4))
# Verify PSD: eigenvalues should be >= 0
cat("Eigenvalues:", eigen(K)$values, "\n") # Should all be >= 0
# ==========================================
# 2. SVM Classification with kernlab
# ==========================================
# Generate non-linearly separable data (two spirals)
set.seed(123)
n_per_class <- 100
theta <- seq(0, 2*pi, length.out = n_per_class)
x1 <- c(theta * cos(theta), (theta + pi) * cos(theta + pi))
x2 <- c(theta * sin(theta), (theta + pi) * sin(theta + pi))
y_label <- factor(rep(c(1, -1), each = n_per_class))
df_spiral <- data.frame(x1 = x1 + rnorm(2*n_per_class, sd = 0.3),
                        x2 = x2 + rnorm(2*n_per_class, sd = 0.3),
                        y = y_label)
# Linear SVM - will struggle with spirals
svm_linear <- ksvm(y ~ x1 + x2, data = df_spiral,
                   kernel = "vanilladot", C = 1)
# RBF SVM
# Note: kernlab's rbfdot is exp(-sigma * ||x - z||^2), so this kpar "sigma" is an
# inverse width, not the bandwidth sigma used in the formulas above.
svm_rbf <- ksvm(y ~ x1 + x2, data = df_spiral,
                kernel = "rbfdot", kpar = list(sigma = 0.5), C = 10)
cat(sprintf("Linear SVM training error: %.3f\n",
            sum(predict(svm_linear, df_spiral) != df_spiral$y) / nrow(df_spiral)))
cat(sprintf("RBF SVM training error: %.3f\n",
            sum(predict(svm_rbf, df_spiral) != df_spiral$y) / nrow(df_spiral)))
cat(sprintf("\nNumber of support vectors (RBF): %d out of %d\n",
            nSV(svm_rbf), nrow(df_spiral)))
# ==========================================
# 3. Kernel Ridge Regression
# ==========================================
set.seed(42)
n <- 50
x_train <- seq(0, 2*pi, length.out = n)
y_train <- sin(x_train) + rnorm(n, sd = 0.2)
# Kernel Ridge Regression using kernlab
krr_model <- gausspr(x_train, y_train,
                     kernel = "rbfdot",
                     kpar = list(sigma = 1),
                     var = 0.1)   # noise variance; plays the role of lambda in KRR
x_test <- seq(0, 2*pi, length.out = 200)
y_pred <- predict(krr_model, x_test)
plot(x_train, y_train, pch = 16, col = "gray",
     main = "Kernel Ridge Regression (RBF kernel)",
     xlab = "x", ylab = "y")
lines(x_test, y_pred, col = "blue", lwd = 2)
lines(x_test, sin(x_test), col = "red", lty = 2)
legend("topright", c("Predictions", "True function"),
       col = c("blue", "red"), lty = 1:2)
Key observation: the RBF SVM separates the spirals; the linear SVM cannot. The kernel trick turns a linear hyperplane in feature space into a nonlinear boundary in input space.
Kernel methods connect to several frontiers:
Gaussian Processes (GP): Kernel Ridge Regression equals the posterior mean of a GP with kernel \(k\). GPs are a Bayesian nonparametric method: a prior over functions defined directly by the kernel.
The attention mechanism in Transformers: \(\text{Attention}(Q,K,V) = \text{softmax}(QK^T/\sqrt{d})V\) is kernel regression with the softmax kernel \(k(q,k) = e^{q^Tk/\sqrt{d}}\) (a short numeric sketch follows below).
Graph Neural Networks: the aggregation step is a kernel-like computation over graph neighbors.
Neural Tangent Kernel (NTK): infinite-width neural networks converge to Gaussian processes with a specific kernel (the NTK). This explains why very wide networks can be analyzed with kernel theory.
MMD (Maximum Mean Discrepancy): a two-sample test using kernel embeddings of distributions, widely used in domain adaptation and generative models.
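To make the attention analogy concrete, here is a tiny base R sketch (toy dimensions and random matrices of my own choosing) of single-query scaled dot-product attention written as normalized kernel regression:
# Single-query scaled dot-product attention as softmax-kernel-weighted regression
set.seed(0)
d <- 4; n_keys <- 6
Kmat <- matrix(rnorm(n_keys * d), n_keys, d)   # keys (one per row)
V    <- matrix(rnorm(n_keys * d), n_keys, d)   # values (one per row)
q    <- rnorm(d)                               # a single query
scores  <- as.vector(Kmat %*% q) / sqrt(d)     # q' k_i / sqrt(d)
weights <- exp(scores) / sum(exp(scores))      # softmax = normalized kernel exp(q' k_i / sqrt(d))
as.vector(t(V) %*% weights)                    # output: kernel-weighted average of the values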
9 Practice Problems
Problem 1: Verify that the polynomial kernel \(k(x,z) = (x^Tz)^2\) is a valid kernel for \(x,z \in \mathbb{R}^2\).
Construct the explicit feature map: \(\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)\)
\(\langle\phi(x), \phi(z)\rangle = x_1^2z_1^2 + 2x_1x_2z_1z_2 + x_2^2z_2^2 = (x_1z_1 + x_2z_2)^2 = (x^Tz)^2\) ✓
Problem 2: Show that the RBF kernel is a valid kernel (hint: use the Taylor expansion).
\(k(x,z) = e^{-\|x-z\|^2/(2\sigma^2)} = e^{-(\|x\|^2+\|z\|^2)/(2\sigma^2)} \cdot e^{x^Tz/\sigma^2}\)
\(= e^{-\|x\|^2/(2\sigma^2)} e^{-\|z\|^2/(2\sigma^2)} \sum_{k=0}^\infty \frac{(x^Tz)^k}{\sigma^{2k} k!}\)
Each term \(c_k \cdot (x^Tz)^k\) is a kernel (a polynomial kernel with a nonnegative coefficient), and a countable sum of kernels with nonnegative coefficients is again a kernel. QED.
Problem 3: Derive the SVM dual problem from the Lagrangian.
Lagrangian: \(L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i[y_i(\mathbf{w}^Tx_i+b)-1]\)
\(\partial L/\partial \mathbf{w} = \mathbf{w} - \sum_i \alpha_i y_i x_i = 0 \Rightarrow \mathbf{w} = \sum_i \alpha_i y_i x_i\)
\(\partial L/\partial b = -\sum_i \alpha_i y_i = 0\)
Substituting \(\mathbf{w}\): \(L = -\frac{1}{2}\sum_{ij}\alpha_i\alpha_j y_iy_j x_i^Tx_j + \sum_i \alpha_i\)
Maximize over \(\alpha \geq 0\) subject to the constraint \(\sum_i \alpha_i y_i = 0\). QED.
Problem 4: Show that Kernel Ridge Regression equals the Gaussian Process posterior mean.
Prior: \(f \sim GP(0, k)\), likelihood: \(y_i | f(x_i) \sim N(f(x_i), \sigma^2)\)
Posterior mean: \(E[f(x_{new})|y] = k_{new}^T(K + \sigma^2 I)^{-1}y\), identical to the KRR prediction with \(\lambda = \sigma^2\).
Navigation: ← Regularization | Information Theory →