Kernel Methods & SVM

The Kernel Trick: Infinite-Dimensional Feature Spaces in Finite Computation

ml-math
advanced
Kernel trick, Mercer’s theorem, RKHS, the SVM primal and dual problems, KKT conditions, support vectors, kernel ridge regression, and the RBF kernel. The mathematical foundations for SVMs and Gaussian processes.

1 Why This Matters

Note: Why This Matters for Your Work

Kernel methods embody one of the most elegant mathematical insights in ML: computing inner products in infinite-dimensional spaces cheaply, using nothing more than kernel function evaluations in the input space.

Kernel methods underlie:

  • SVMs: the state-of-the-art pre-deep-learning classifier
  • Gaussian Processes: Bayesian nonparametric regression
  • Kernel PCA: nonlinear dimensionality reduction
  • Kernel Ridge Regression: the dual view of regularization

Understanding the math gives you the foundation to see why attention in Transformers (dot-product attention) is a form of kernel computation.


2 Feature Maps dan Nonlinear Modeling

2.1 Motivation

A linear model \(f(x) = w^Tx + b\) is limited: it can only produce linear decision boundaries.

The simple idea: first map into a high-dimensional feature space, then fit a linear model there.

\[\phi: \mathcal{X} \to \mathcal{F}, \quad f(x) = w^T\phi(x) + b\]

Linear in \(\mathcal{F}\) = nonlinear in \(\mathcal{X}\).

2.2 The Problem: Dimensions Can Explode

Example: \(x \in \mathbb{R}^2\), polynomial of degree 2: \[\phi(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2) \in \mathbb{R}^6\]

Degree \(d\) in \(\mathbb{R}^p\): \(\binom{p+d}{d}\) features, which grows on the order of \(p^d\), exponentially in \(d\).

Degree \(\infty\) (the RBF kernel): an infinite-dimensional feature space!
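
To make this growth concrete, here is a quick R check of the feature counts (a small illustration):

# Number of monomial features for a degree-d polynomial map on R^p:
# choose(p + d, d) explodes as d grows
p <- c(2, 10, 100)
for (d in c(2, 5, 10)) {
  cat("d =", d, ":", choose(p + d, d), "\n")
}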


3 The Kernel Trick

Important: Definition of a Kernel Function

A function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is called a kernel function if there exists a feature map \(\phi\) such that:

\[k(x, z) = \langle \phi(x), \phi(z) \rangle_\mathcal{F}\]

The kernel trick: if an algorithm only needs the inner products \(\langle\phi(x_i), \phi(x_j)\rangle\), we can replace them with \(k(x_i, x_j)\), computing directly in the input space without ever constructing \(\phi\) explicitly!

Complexity: \(O(n^2)\) kernel evaluations versus \(O(n \cdot |\mathcal{F}|)\) for explicit feature computation (where \(|\mathcal{F}|\) can be infinite).


4 Common Kernels

4.1 Linear Kernel

\[k(x, z) = x^Tz\]

Feature map: \(\phi(x) = x\). No nonlinearity; equivalent to a standard linear model.

4.2 Polynomial Kernel

\[k(x, z) = (x^Tz + c)^d\]

Example: \(d = 2\), \(c = 0\), \(x \in \mathbb{R}^2\):

\(k(x,z) = (x_1z_1 + x_2z_2)^2 = x_1^2z_1^2 + 2x_1x_2z_1z_2 + x_2^2z_2^2\)

Feature map: \(\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)\)

\(\langle\phi(x), \phi(z)\rangle = x_1^2z_1^2 + 2x_1x_2z_1z_2 + x_2^2z_2^2\)
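
A quick numerical sanity check of this identity (a minimal sketch; the helper phi2 is introduced here for illustration):

# Verify <phi(x), phi(z)> equals (x'z)^2 for the explicit degree-2 feature map
phi2 <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
x <- c(1.5, -0.7); z <- c(0.4, 2.0)
sum(phi2(x) * phi2(z))  # inner product in feature space
(sum(x * z))^2          # kernel evaluated in input space: same value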

4.3 RBF (Radial Basis Function / Gaussian) Kernel

\[k(x, z) = \exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)\]

This is an infinite-dimensional feature map! Using the Taylor expansion of the exponential:

\[k(x, z) = e^{-\|x\|^2/(2\sigma^2)}\, e^{-\|z\|^2/(2\sigma^2)}\, e^{x^Tz/\sigma^2} = e^{-\|x\|^2/(2\sigma^2)}\, e^{-\|z\|^2/(2\sigma^2)} \sum_{k=0}^\infty \frac{(x^Tz/\sigma^2)^k}{k!}\]

Each term \((x^Tz)^k\) is a degree-\(k\) polynomial kernel → an infinite feature space.

The bandwidth parameter \(\sigma\) controls “locality”: small \(\sigma\) means very local, large \(\sigma\) means more global.
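
A small illustration of this bandwidth effect (a sketch; the same pair of points, varying sigma):

# RBF kernel value for a fixed pair of points at different bandwidths
rbf <- function(x, z, sigma) exp(-sum((x - z)^2) / (2 * sigma^2))
x <- c(0, 0); z <- c(1, 1)  # distance sqrt(2)
sapply(c(0.1, 0.5, 1, 5), function(s) rbf(x, z, s))
# small sigma: k near 0 (z looks "far"); large sigma: k near 1 (everything looks close)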

4.4 Mercer’s Theorem

Important: Mercer’s Theorem

\(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is a valid kernel (there exists a feature map such that \(k(x,z) = \langle\phi(x),\phi(z)\rangle\)) if and only if the Gram matrix \(K\) is positive semi-definite for every dataset:

\[K_{ij} = k(x_i, x_j), \quad \forall \{x_1, \ldots, x_n\}: K \succeq 0\]

Practical implication: to verify that \(k\) is a valid kernel, it suffices to check the PSD property; there is no need to construct \(\phi\) explicitly.

Sums and products of valid kernels are valid kernels:

  • \(k_1 + k_2\) is valid
  • \(k_1 \cdot k_2\) is valid (the polynomial kernel is a product of linear kernels)
  • \(c \cdot k\) is valid for \(c > 0\)
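
These closure properties are easy to spot-check numerically (a minimal sketch on random data; tiny negative eigenvalues can appear from floating-point rounding):

# Spot-check: Gram matrices of k1 + k2 and k1 * k2 stay PSD
set.seed(1)
X  <- matrix(rnorm(20), nrow = 10)            # 10 points in R^2
K1 <- X %*% t(X)                              # linear kernel Gram matrix
K2 <- exp(-as.matrix(dist(X))^2 / 2)          # RBF Gram matrix, sigma = 1
min(eigen(K1 + K2, symmetric = TRUE)$values)  # >= 0 up to rounding
min(eigen(K1 * K2, symmetric = TRUE)$values)  # elementwise (Schur) product: also >= 0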


5 Reproducing Kernel Hilbert Space (RKHS)

5.1 Definisi

Every valid kernel \(k\) defines an RKHS \(\mathcal{H}_k\): a function space in which the inner product between functions \(f, g\) is defined.

Reproducing property: for \(k_x = k(\cdot, x) \in \mathcal{H}_k\):

\[f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}_k}\]

The function \(k_x\) is the “representer” of the evaluation functional at \(x\).

5.2 Representer Theorem

Important: The Representer Theorem

For the regularized problem: \[\min_{f \in \mathcal{H}_k} \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_k}^2\]

the optimal solution always has the form: \[f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)\]

Even though we optimize over the infinite-dimensional function space \(\mathcal{H}_k\), the solution is a finite linear combination of kernel functions evaluated at the training points!


6 SVM (Support Vector Machine)

6.1 Primal Problem (Hard Margin)

For linearly separable data, find the hyperplane \(\mathbf{w}^Tx + b = 0\) with maximum margin.

The margin, the distance between the two hyperplanes \(\mathbf{w}^Tx + b = \pm 1\), equals \(2/\|\mathbf{w}\|\). Maximizing the margin therefore means minimizing \(\|\mathbf{w}\|\).

Primal (Hard Margin SVM): \[\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^Tx_i + b) \geq 1 \quad \forall i\]

6.2 Soft Margin SVM

For data that is not linearly separable, allow misclassification via slack variables \(\xi_i \geq 0\):

\[\min_{\mathbf{w},b,\xi} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^Tx_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0\]

\(C\): regularization parameter (trades off margin width against misclassification).

6.3 Dual Problem

From the primal (hard-margin) Lagrangian:

\[L(\mathbf{w}, b, \alpha) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i[y_i(\mathbf{w}^Tx_i+b) - 1]\]

KKT conditions (stationarity): \[\mathbf{w} = \sum_i \alpha_i y_i x_i, \quad \sum_i \alpha_i y_i = 0\]

Substituting into the Lagrangian yields the dual problem (the soft-margin Lagrangian adds \(C\sum_i \xi_i - \sum_i \mu_i \xi_i\); these terms cancel at stationarity, leaving only the box constraint \(\alpha_i \leq C\)):

\[\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_iy_j x_i^Tx_j\]

\[\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0\]

6.4 Support Vectors dan KKT

KKT complementary slackness (with \(\mu_i \geq 0\) the multipliers for the constraints \(\xi_i \geq 0\)): \[\alpha_i[y_i(\mathbf{w}^Tx_i+b) - 1 + \xi_i] = 0, \quad \mu_i \xi_i = 0\]

  • \(\alpha_i = 0\): point outside (or exactly on) the margin boundary → does not affect \(\mathbf{w}\)
  • \(0 < \alpha_i < C\): point exactly on the margin (\(\xi_i = 0\)), a free support vector
  • \(\alpha_i = C\): point inside the margin or misclassified (\(\xi_i > 0\)), a bounded support vector

Decision function: \(f(x) = \sum_i \alpha_i y_i x_i^Tx + b\), which depends only on the support vectors (points with \(\alpha_i > 0\))!
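
To make the dual concrete, here is a minimal sketch that solves the soft-margin dual QP directly with the quadprog package (assumptions introduced here: quadprog is installed; the small ridge on the Hessian and the 1e-6 tolerance are numerical conveniences):

library(quadprog)

set.seed(1)
# Toy 2D data: two separable Gaussian clusters
n_half <- 10
X <- rbind(matrix(rnorm(2 * n_half, mean = 2), ncol = 2),
           matrix(rnorm(2 * n_half, mean = -2), ncol = 2))
y <- c(rep(1, n_half), rep(-1, n_half))
n <- length(y); C <- 10

# Dual as a QP: min 1/2 a' Q a - 1' a with Q_ij = y_i y_j x_i' x_j
Q    <- (y %*% t(y)) * (X %*% t(X))
Dmat <- Q + diag(1e-6, n)            # small ridge for numerical stability
dvec <- rep(1, n)
# Constraints: y' a = 0 (equality), a >= 0, a <= C
Amat <- cbind(y, diag(n), -diag(n))
bvec <- c(0, rep(0, n), rep(-C, n))
alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution

# KKT recovery: w = sum_i a_i y_i x_i; b from a free support vector
w  <- colSums(alpha * y * X)
sv <- which(alpha > 1e-6 & alpha < C - 1e-6)
b  <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)
cat("Support vectors:", sum(alpha > 1e-6), "of", n, "\n")
cat("Training accuracy:", mean(sign(X %*% w + b) == y), "\n")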

6.5 Kernelized SVM

Replace \(x_i^Tx_j\) with \(k(x_i, x_j)\):

\[\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_iy_j k(x_i, x_j)\]

Decision function: \(f(x) = \sum_i \alpha_i y_i k(x_i, x) + b\)

With the RBF kernel: a nonlinear decision boundary in the input space that is linear in the infinite-dimensional feature space.


7 Kernel Ridge Regression

7.1 Primal Formulation

\[\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|^2\]

Solution: \(\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty\)

Prediction: \(\hat{y} = X\hat{\beta} = X(X^TX+\lambda I)^{-1}X^Ty\)

7.2 Dual Formulation

Use the push-through identity (related to Woodbury):

\[X(X^TX+\lambda I)^{-1} = (XX^T + \lambda I)^{-1}X\]
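
This identity is easy to verify numerically (a quick sketch):

# Numerical check of the push-through identity
set.seed(3)
X <- matrix(rnorm(50), nrow = 10)  # 10 x 5 design matrix
lambda <- 0.5
lhs <- X %*% solve(t(X) %*% X + lambda * diag(ncol(X)))
rhs <- solve(X %*% t(X) + lambda * diag(nrow(X))) %*% X
max(abs(lhs - rhs))                # ~ 1e-16: equal up to floating point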

Define \(K = XX^T\) (the Gram matrix, \(K_{ij} = x_i^Tx_j\)):

\[\hat{y} = K(K + \lambda I)^{-1}y\]

Dual coefficients: \(\hat{\alpha} = (K + \lambda I)^{-1}y\)

Prediction for \(x_{new}\): \(\hat{f}(x_{new}) = k_{new}^T\hat{\alpha}\) where \(k_{new,i} = k(x_i, x_{new})\)

Now replace \(K_{ij} = x_i^Tx_j\) with a kernel \(k(x_i, x_j)\): this is Kernel Ridge Regression!

Primal with explicit \(\phi\): \(\min_w \|y - \Phi w\|^2 + \lambda\|w\|^2\)

Dual: \(\hat{\alpha} = (K + \lambda I)^{-1}y\) where \(K_{ij} = k(x_i, x_j)\)
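
The whole method fits in a few lines. Here is a from-scratch sketch (the helper rbf_gram is introduced here for illustration):

# Kernel ridge regression from scratch with an RBF kernel
rbf_gram <- function(A, B, sigma = 1) {
  # pairwise squared distances via ||a||^2 + ||b||^2 - 2 a'b
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-pmax(d2, 0) / (2 * sigma^2))
}

set.seed(7)
x_tr <- matrix(seq(0, 2 * pi, length.out = 40), ncol = 1)
y_tr <- sin(x_tr[, 1]) + rnorm(40, sd = 0.2)
lambda <- 0.1

K     <- rbf_gram(x_tr, x_tr)
alpha <- solve(K + lambda * diag(nrow(K)), y_tr)  # dual coefficients
x_new <- matrix(seq(0, 2 * pi, length.out = 200), ncol = 1)
f_hat <- rbf_gram(x_new, x_tr) %*% alpha          # k_new' alpha at each test point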


8 Worked Example: RBF Kernel Computation

library(kernlab)
library(tidyverse)

# ==========================================
# 1. Manual RBF Kernel Computation
# ==========================================
rbf_kernel <- function(x, z, sigma = 1) {
  exp(-sum((x - z)^2) / (2 * sigma^2))
}

# Small dataset: 3 points in R^2
X_data <- matrix(c(1, 1,
                   2, 1,
                   1, 2), nrow = 3, byrow = TRUE)

# Compute Gram matrix
n <- nrow(X_data)
sigma <- 1
K <- matrix(0, n, n)
for (i in 1:n) {
  for (j in 1:n) {
    K[i, j] <- rbf_kernel(X_data[i,], X_data[j,], sigma)
  }
}

cat("Gram matrix K (RBF, sigma=1):\n")
print(round(K, 4))

# Verify PSD: eigenvalues should be >= 0
cat("Eigenvalues:", eigen(K)$values, "\n")  # Should all be >= 0

# ==========================================
# 2. SVM Classification with kernlab
# ==========================================
# Generate non-linearly separable data (two spirals)
set.seed(123)
n_per_class <- 100
theta <- seq(0, 2*pi, length.out = n_per_class)

# Class 2 is class 1 rotated by pi (same radius theta, angle shifted by pi)
x1 <- c(theta * cos(theta), theta * cos(theta + pi))
x2 <- c(theta * sin(theta), theta * sin(theta + pi))
y_label <- factor(rep(c(1, -1), each = n_per_class))

df_spiral <- data.frame(x1 = x1 + rnorm(2*n_per_class, sd = 0.3),
                        x2 = x2 + rnorm(2*n_per_class, sd = 0.3),
                        y = y_label)

# Linear SVM - will struggle with spirals
svm_linear <- ksvm(y ~ x1 + x2, data = df_spiral,
                   kernel = "vanilladot", C = 1)

# RBF SVM
svm_rbf <- ksvm(y ~ x1 + x2, data = df_spiral,
                kernel = "rbfdot", kpar = list(sigma = 0.5), C = 10)

cat(sprintf("Linear SVM training error: %.3f\n",
            sum(predict(svm_linear, df_spiral) != df_spiral$y) / nrow(df_spiral)))
cat(sprintf("RBF SVM training error: %.3f\n",
            sum(predict(svm_rbf, df_spiral) != df_spiral$y) / nrow(df_spiral)))

cat(sprintf("\nNumber of support vectors (RBF): %d out of %d\n",
            nSV(svm_rbf), nrow(df_spiral)))

# ==========================================
# 3. Kernel Ridge Regression
# ==========================================
set.seed(42)
n <- 50
x_train <- seq(0, 2*pi, length.out = n)
y_train <- sin(x_train) + rnorm(n, sd = 0.2)

# Kernel Ridge Regression using kernlab
# gausspr expects matrix input, so wrap the 1-D predictors with as.matrix()
krr_model <- gausspr(as.matrix(x_train), y_train,
                     kernel = "rbfdot",
                     kpar = list(sigma = 1),
                     var = 0.1)  # noise variance, plays the role of lambda

x_test <- seq(0, 2*pi, length.out = 200)
y_pred <- predict(krr_model, as.matrix(x_test))

plot(x_train, y_train, pch = 16, col = "gray",
     main = "Kernel Ridge Regression (RBF kernel)",
     xlab = "x", ylab = "y")
lines(x_test, y_pred, col = "blue", lwd = 2)
lines(x_test, sin(x_test), col = "red", lty = 2)
legend("topright", c("Predictions", "True function"),
       col = c("blue", "red"), lty = 1:2)

Key observation: the RBF SVM separates the spirals; the linear SVM cannot. The kernel trick turns a linear hyperplane in feature space into a nonlinear boundary in input space.


Caution: Connections to Gaussian Processes and Graph Neural Networks

Kernel methods connect to several modern frontiers:

  • Gaussian Processes (GP): kernel ridge regression equals the posterior mean of a GP with kernel \(k\). GPs are a Bayesian nonparametric method: the prior over functions is defined directly by the kernel.

  • Attention mechanism in Transformers: \(\text{Attention}(Q,K,V) = \text{softmax}(QK^T/\sqrt{d})V\). This is kernel regression with the softmax kernel \(k(q,k) = e^{q^Tk/\sqrt{d}}\) (see the sketch after this list).

  • Graph Neural Networks: the aggregation step is a kernel-like computation over graph neighbors.

  • Neural Tangent Kernel (NTK): infinite-width neural networks converge to Gaussian processes with a specific kernel (the NTK). This explains why very wide networks can be analyzed with kernel theory.

  • MMD (Maximum Mean Discrepancy): a two-sample test using kernel embeddings of distributions, widely used in domain adaptation and generative models.
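
A minimal sketch of the attention-as-kernel-regression view (illustrative names only; a single query, no batching or masking):

# Softmax attention = Nadaraya-Watson regression with an exponential kernel
set.seed(1)
d <- 4; n_keys <- 6
q   <- matrix(rnorm(d), nrow = 1)           # one query
Key <- matrix(rnorm(n_keys * d), ncol = d)  # keys
Val <- matrix(rnorm(n_keys * 2), ncol = 2)  # values

scores  <- q %*% t(Key) / sqrt(d)           # kernel evaluations k(q, k_i)
weights <- exp(scores) / sum(exp(scores))   # softmax = normalized kernel weights
weights %*% Val                             # prediction: kernel-weighted average of values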


9 Practice Problems

Problem 1: Verify that the polynomial kernel \(k(x,z) = (x^Tz)^2\) is a valid kernel for \(x,z \in \mathbb{R}^2\).

Construct the explicit feature map: \(\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)\)

\(\langle\phi(x), \phi(z)\rangle = x_1^2z_1^2 + 2x_1x_2z_1z_2 + x_2^2z_2^2 = (x_1z_1 + x_2z_2)^2 = (x^Tz)^2\)

Problem 2: Show that the RBF kernel is a valid kernel (hint: use the Taylor expansion).

\(k(x,z) = e^{-\|x-z\|^2/(2\sigma^2)} = e^{-(\|x\|^2+\|z\|^2)/(2\sigma^2)} \cdot e^{x^Tz/\sigma^2}\)

\(= e^{-\|x\|^2/(2\sigma^2)} e^{-\|z\|^2/(2\sigma^2)} \sum_{k=0}^\infty \frac{(x^Tz)^k}{\sigma^{2k} k!}\)

Each term \(c_k \cdot (x^Tz)^k\) is a kernel (a polynomial kernel with nonnegative coefficient), and a countable sum of kernels with nonnegative coefficients is a kernel. The remaining factor \(e^{-\|x\|^2/(2\sigma^2)} e^{-\|z\|^2/(2\sigma^2)} = f(x)f(z)\) is itself a valid kernel (rank-one feature map \(\phi(x) = f(x)\)), and products of kernels are kernels. QED.
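
A numerical illustration of the expansion (a sketch with sigma = 1; rbf_exact and rbf_taylor are names introduced here):

# Truncated Taylor sums converge to the RBF kernel (sigma = 1)
rbf_exact  <- function(x, z) exp(-sum((x - z)^2) / 2)
rbf_taylor <- function(x, z, m) {
  exp(-sum(x^2) / 2) * exp(-sum(z^2) / 2) *
    sum(sum(x * z)^(0:m) / factorial(0:m))
}
x <- c(0.5, 0.3); z <- c(0.1, 0.8)
c(exact = rbf_exact(x, z),
  m2  = rbf_taylor(x, z, 2),
  m10 = rbf_taylor(x, z, 10))  # m10 matches exact to many digits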

Problem 3: Derive the SVM dual problem from the Lagrangian.

Lagrangian: \(L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i[y_i(\mathbf{w}^Tx_i+b)-1]\)

\(\partial L/\partial \mathbf{w} = \mathbf{w} - \sum_i \alpha_i y_i x_i = 0 \Rightarrow \mathbf{w} = \sum_i \alpha_i y_i x_i\)

\(\partial L/\partial b = -\sum_i \alpha_i y_i = 0\)

Substituting \(\mathbf{w}\): \(L = -\frac{1}{2}\sum_{ij}\alpha_i\alpha_j y_iy_j x_i^Tx_j + \sum_i \alpha_i\)

Maximize over \(\alpha \geq 0\) subject to the constraint \(\sum_i \alpha_i y_i = 0\). QED.

Problem 4: Show that kernel ridge regression equals the Gaussian process posterior mean.

Prior: \(f \sim GP(0, k)\), likelihood: \(y_i | f(x_i) \sim N(f(x_i), \sigma^2)\)

Posterior mean: \(E[f(x_{new})|y] = k_{new}^T(K + \sigma^2 I)^{-1}y\), identical to the KRR prediction with \(\lambda = \sigma^2\).


Navigation: ← Regularization | Information Theory →