Loss Functions & Risk Minimization

What We Minimize, and Why

ml-math

optimization

statistics

Statistical decision theory, common loss functions, empirical risk minimization, the bias-variance tradeoff, and connections to econometrics.

Why This Matters for Your Work

The choice of loss function determines what estimator you get, what predictions it makes, and how sensitive it is to outliers.

This is not a technical detail; it is a fundamental design choice. When you choose a loss function, you choose:

- What is being optimized: the mean? the median? a probability?
- Sensitivity to outliers: squared loss is very sensitive, absolute loss is more robust
- Statistical properties of the resulting estimator: consistency, efficiency, robustness

The connection to econometrics is direct: OLS = ERM with squared loss. Quantile regression = ERM with asymmetric absolute loss (the check function). Logistic regression = ERM with log loss.

1 Statistical Decision Theory Framework

1.1 Formal Setup

We have:
- Input space $\mathcal{X}$ (for example $\mathbb{R}^p$)
- Output space $\mathcal{Y}$ (for example $\mathbb{R}$ for regression, $\{0,1\}$ for classification)
- Action space $\mathcal{A}$: the set of predictions we are allowed to make
- Loss function $L: \mathcal{Y} \times \mathcal{A} \to \mathbb{R}_{\geq 0}$
- Data generating process $(X, Y) \sim P_{XY}$

Definition: Risk

The risk (expected loss) of a predictor $\hat{f}: \mathcal{X} \to \mathcal{A}$ is:

\[R(\hat{f}) = \mathbb{E}_{(X,Y) \sim P_{XY}}[L(Y, \hat{f}(X))]\]

The Bayes risk is the infimum of the risk over all predictors:

\[R^* = \inf_{f} R(f)\]

and a Bayes optimal predictor $f^*$ is one that attains $R^*$ (when the infimum is attained).

Our goal: find an $\hat{f}$ that minimizes the risk $R(\hat{f})$.

The problem: $P_{XY}$ is unknown. We only have data $\{(x_i, y_i)\}_{i=1}^n$.

1.2 Empirical Risk Minimization (ERM)

Because $P_{XY}$ is unknown, we replace it with the empirical distribution of the sample, which gives the empirical risk:

\[R_n(\hat{f}) = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}(x_i))\]

Definition: Empirical Risk Minimizer

ERM picks the predictor that minimizes the empirical risk:

\[\hat{f}_{ERM} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))\]

where $\mathcal{F}$ is the function class, the set of candidate predictors.

The choice of $\mathcal{F}$ determines model complexity:
- $\mathcal{F}$ = linear functions → linear regression
- $\mathcal{F}$ = decision trees → tree models
- $\mathcal{F}$ = neural networks → deep learning

Fundamental tradeoff: if $\mathcal{F}$ is too small, the model underfits (high bias). If $\mathcal{F}$ is too large, it overfits (high variance). This is not just intuition; the bias-variance decomposition in Section 5 makes it precise.

2 Common Loss Functions

2.1 Squared Loss (L2 Loss)

\[L(y, \hat{y}) = (y - \hat{y})^2\]

- Optimal predictor: the conditional mean $f^*(x) = \mathbb{E}[Y | X = x]$
- Why the mean? The derivation is in Section 3
- Outlier sensitivity: high. Errors are squared, so an outlier with a large error is penalized heavily
- Smooth everywhere: easy to optimize with gradient descent

Connection: OLS = ERM with Squared Loss

Ordinary Least Squares is ERM with squared loss and the linear function class:

\[\hat{\beta}_{OLS} = \arg\min_\beta \frac{1}{n}\sum_{i=1}^n (y_i - x_i^T\beta)^2\]

This is exactly ERM with $L(y, \hat{y}) = (y-\hat{y})^2$ and $\mathcal{F} = \{x \mapsto x^T\beta : \beta \in \mathbb{R}^p\}$.
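A quick numerical check of this equivalence, as a minimal sketch on simulated data: a generic optimizer applied to the empirical squared-loss risk should land on the same coefficients as `lm()`. The data, the seed, and the helper name `empirical_risk` are illustrative assumptions, not part of the original text.

```r
# Minimal sketch: OLS as empirical risk minimization with squared loss.
# The simulated data and `empirical_risk` are illustrative assumptions.
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))            # design matrix with an intercept column
beta_true <- c(1, 2, -0.5)
y <- as.vector(X %*% beta_true + rnorm(n))

# Empirical risk R_n(beta) = (1/n) * sum_i (y_i - x_i' beta)^2
empirical_risk <- function(beta) mean((y - X %*% beta)^2)

# Generic numerical minimizer (BFGS) applied to the empirical risk
beta_erm <- optim(par = rep(0, 3), fn = empirical_risk, method = "BFGS")$par

# Least squares via lm() on the same design (the intercept is already in X)
beta_ols <- unname(coef(lm(y ~ X - 1)))

print(round(cbind(beta_erm, beta_ols), 4))
```

The two columns should agree to several decimals, since both procedures minimize the same objective over the same function class.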

2.2 Absolute Loss (L1 Loss)

\[L(y, \hat{y}) = |y - \hat{y}|\]

- Optimal predictor: the conditional median $f^*(x) = \text{Median}(Y | X = x)$
- More robust to outliers than squared loss
- Not smooth at $y = \hat{y}$: optimization needs subgradients

Connection: Quantile Regression

Quantile regression is ERM with an asymmetric absolute loss (the check function):

\[L_\tau(y, \hat{y}) = (y - \hat{y})(\tau - \mathbf{1}[y < \hat{y}])\]

For $\tau = 0.5$ this equals $\frac{1}{2}|y - \hat{y}|$, that is, absolute loss up to a constant factor, so the minimizer is the conditional median. For $\tau = 0.9$ the minimizer is the conditional 90th percentile.
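The link between the check function and quantiles can be checked numerically. The sketch below minimizes the average check loss over a constant prediction and compares the result with `quantile()`. The exponential sample, the seed, and the helper names `check_loss` and `fit_quantile` are assumptions made for illustration.

```r
# Minimal sketch: the constant that minimizes the average check loss
# is a sample quantile. Data and helper names are illustrative assumptions.
set.seed(2)
y <- rexp(1001)   # skewed sample, so the quantiles differ clearly from the mean

# Check (pinball) loss: L_tau(y, yhat) = (y - yhat) * (tau - 1[y < yhat])
check_loss <- function(y, yhat, tau) (y - yhat) * (tau - (y < yhat))

# Empirical risk minimization over a constant prediction
fit_quantile <- function(y, tau) {
  optimize(function(const) mean(check_loss(y, const, tau)),
           interval = range(y))$minimum
}

c_median <- fit_quantile(y, 0.5)
c_q90    <- fit_quantile(y, 0.9)

print(c(erm = c_median, sample = unname(quantile(y, 0.5))))
print(c(erm = c_q90,    sample = unname(quantile(y, 0.9))))
```

Quantile regression does the same thing with $x^T\beta$ in place of the constant.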

2.3 Huber Loss

A hybrid of squared and absolute loss:

\[L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{jika } |y - \hat{y}| \leq \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{jika } |y - \hat{y}| > \delta \end{cases}\]

- Quadratic for small errors (smooth, easy to optimize)
- Linear for large errors (robust to outliers)
- $\delta$ is a hyperparameter that sets the threshold between the two regimes
- Used in Huber regression and offered as a loss option in some gradient boosting implementations
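As a small sketch of the piecewise definition above, the function below evaluates Huber loss and confirms that the two branches meet at $|y - \hat{y}| = \delta$. The function name `huber_loss` and the example residuals are illustrative choices.

```r
# Minimal sketch of Huber loss as defined above; names and values are illustrative.
huber_loss <- function(r, delta = 1) {
  ifelse(abs(r) <= delta,
         0.5 * r^2,                          # quadratic branch for small residuals
         delta * abs(r) - 0.5 * delta^2)     # linear branch for large residuals
}

r <- c(-5, -1, -0.5, 0, 0.5, 1, 5)
print(data.frame(residual = r,
                 half_squared = 0.5 * r^2,
                 absolute = abs(r),
                 huber = huber_loss(r)))

# Both branches give the same value at |r| = delta, so the loss is continuous there
delta <- 1.5
gap <- (0.5 * delta^2) - (delta * delta - 0.5 * delta^2)
print(gap)
```

For the residual of 5, the half-squared loss is 12.5 while the Huber loss with $\delta = 1$ is 4.5, which is where the robustness comes from.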

2.4 Log Loss (Binary Cross-Entropy)

For binary classification with $y \in \{0, 1\}$ and predicted probability $\hat{p} \in (0,1)$:

\[L(y, \hat{p}) = -[y \log \hat{p} + (1-y) \log(1-\hat{p})]\]

- Optimal predictor: the conditional probability $f^*(x) = P(Y=1|X=x)$
- Very heavy penalty for confident-but-wrong predictions (when $\hat{p} \approx 1$ but $y = 0$, the loss tends to $\infty$)
- Connection to MLE: minimizing log loss is the same as maximizing the Bernoulli log-likelihood

Connection: Logistic Regression = ERM with Log Loss

Logistic regression minimizes:

\[\frac{1}{n}\sum_{i=1}^n -[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]\]

with $\hat{p}_i = \sigma(x_i^T\beta) = \frac{1}{1+e^{-x_i^T\beta}}$.

This is exactly ERM with log loss over the class of linear functions passed through the logistic function.
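The same kind of check works here. This sketch minimizes the mean log loss directly with `optim()` and compares the result with `glm(..., family = binomial())`. The simulated data, the seed, and the helper name `mean_log_loss` are assumptions for illustration.

```r
# Minimal sketch: logistic regression as ERM with log loss.
# Simulated data and helper names are illustrative assumptions.
set.seed(3)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))
X <- cbind(1, x)

# Mean log loss in an algebraically equivalent form:
# -[y log p + (1 - y) log(1 - p)] = log(1 + exp(eta)) - y * eta, with eta = x' beta
mean_log_loss <- function(beta) {
  eta <- as.vector(X %*% beta)
  mean(log1p(exp(eta)) - y * eta)
}

beta_erm <- optim(par = c(0, 0), fn = mean_log_loss, method = "BFGS")$par
beta_glm <- unname(coef(glm(y ~ x, family = binomial())))

print(round(cbind(beta_erm, beta_glm), 4))
```

Small differences between the columns come from the optimizers' stopping rules, not from the objective.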

2.5 Hinge Loss (SVM)

For binary classification with $y \in \{-1, +1\}$ and a real-valued score $\hat{f}(x)$:

\[L(y, \hat{f}(x)) = \max(0, 1 - y\hat{f}(x))\]

- Zero loss when the prediction is correct with margin $y\hat{f}(x) \geq 1$
- Linear penalty when the margin is below 1
- Gives rise to the "margin" in Support Vector Machines
- Not smooth at $y\hat{f} = 1$: optimization needs subgradients or quadratic programming
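A few hand-picked cases make the margin behavior concrete. In this sketch the labels and scores are made-up values, and `hinge_loss` is an illustrative helper.

```r
# Minimal sketch of hinge loss on made-up labels and scores.
hinge_loss <- function(y, score) pmax(0, 1 - y * score)

y     <- c(1,   1,    1,   -1,  -1)
score <- c(2.0, 0.3, -1.0, -1.5, 0.4)
margin <- y * score

# Correct with margin >= 1 -> zero loss; correct but inside the margin -> small loss;
# wrong sign -> loss above 1
print(data.frame(y, score, margin, loss = hinge_loss(y, score)))
```

The second example is classified correctly (positive margin) but still pays a penalty because its margin is below 1.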

2.6 KL Divergence as a Loss

The KL divergence from a distribution $Q$ to $P$:

\[D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\]

In the context of losses: if $P$ is the distribution of the true labels and $Q$ is the model distribution, minimizing $D_{KL}(P \| Q)$ over $Q$ is equivalent to minimizing the cross-entropy $H(P, Q)$, because $D_{KL}(P \| Q) = H(P, Q) - H(P)$ and $H(P)$ does not depend on $Q$.

Covered in more detail in Information Theory.
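The identity between KL divergence, cross-entropy, and entropy is easy to verify on a small discrete example. The two distributions below are made-up values, and the helper names are illustrative.

```r
# Minimal sketch: D_KL(P || Q) = H(P, Q) - H(P) on made-up discrete distributions.
p <- c(0.7, 0.2, 0.1)   # "true" label distribution (illustrative values)
q <- c(0.5, 0.3, 0.2)   # model distribution (illustrative values)

entropy       <- function(p) -sum(p * log(p))
cross_entropy <- function(p, q) -sum(p * log(q))
kl_divergence <- function(p, q) sum(p * log(p / q))

print(c(kl = kl_divergence(p, q),
        ce_minus_entropy = cross_entropy(p, q) - entropy(p)))
```

Because `entropy(p)` is fixed once the labels are fixed, any `q` that lowers the cross-entropy lowers the KL divergence by the same amount.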

Worked Example: Comparing Loss Functions Visually

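library(ggplot2)
library(tidyr)

# Grid of x-values. For squared, absolute and Huber loss this is the residual y - yhat.
# For log loss and hinge loss it is the margin y * f(x) of a positive example.
residuals <- seq(-3, 3, length.out = 300)

# Compute loss functions
losses <- data.frame(
  residual = residuals,
  squared  = residuals^2,
  absolute = abs(residuals),
  huber    = ifelse(abs(residuals) <= 1,       # Huber loss with delta = 1
                    0.5 * residuals^2,
                    abs(residuals) - 0.5),
  log_loss_y1 = -log(plogis(residuals)),       # log loss for y = 1 with p_hat = sigmoid(score)
  hinge    = pmax(0, 1 - residuals)            # hinge loss as a function of the margin
)

# Pivot to long format
losses_long <- pivot_longer(losses, -residual,
                             names_to = "loss_fn",
                             values_to = "loss")

ggplot(losses_long, aes(x = residual, y = loss, color = loss_fn)) +
  geom_line(linewidth = 1.1) +                 # `linewidth` replaces `size` for lines in ggplot2 >= 3.4.0
  coord_cartesian(ylim = c(0, 4)) +            # zoom in without dropping points outside the range
  labs(
    title = "Comparison of Loss Functions",
    x = "Residual (y - ŷ) or margin", y = "Loss",
    color = "Loss Function"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")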

Run this and note:
- Squared loss grows quadratically, so outliers are penalized heavily
- Absolute loss grows linearly, so it is more robust
- Huber combines the two
- Hinge loss has a "flat zone" on the right: once the margin exceeds 1 the prediction is already correct and is not penalized

3 Why Squared Loss → Mean?

This is an important derivation that you should be able to do yourself.

Question: If we want to predict $Y$ with a constant $c$ (no features $X$), which value of $c$ minimizes the expected squared loss?

\[\min_c \mathbb{E}[(Y - c)^2]\]

Derivation:

\[\mathbb{E}[(Y - c)^2] = \mathbb{E}[Y^2 - 2Yc + c^2]\]

\[= \mathbb{E}[Y^2] - 2c\mathbb{E}[Y] + c^2\]

Take the derivative with respect to $c$ and set it to zero:

\[\frac{d}{dc}\mathbb{E}[(Y-c)^2] = -2\mathbb{E}[Y] + 2c = 0\]

\[\Rightarrow c^* = \mathbb{E}[Y]\]

Check the second derivative: $\frac{d^2}{dc^2} = 2 > 0$, so this is indeed a minimum.

Interpretation: Squared loss effectively "defines" the mean as the optimal point prediction. If you want to predict something other than the mean, use a different loss.

More generally: the optimal predictor under squared loss is $f^*(x) = \mathbb{E}[Y|X=x]$.

Proof for the conditional case:

\[\mathbb{E}[(Y - f(X))^2] = \mathbb{E}[\mathbb{E}[(Y - f(X))^2 | X]]\]

For each fixed value $x$, $\mathbb{E}[(Y - f(x))^2|X=x]$ is minimized by $f(x) = \mathbb{E}[Y|X=x]$: this is the constant-predictor argument applied to the conditional distribution of $Y$ given $X = x$. Because this pointwise minimization holds for every $x$, $f^*(x) = \mathbb{E}[Y|X=x]$ minimizes the overall risk.
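The constant case is easy to confirm numerically. This sketch minimizes the empirical squared-loss risk over a constant and compares the minimizer with the sample mean and median. The gamma sample and the seed are arbitrary illustrative choices.

```r
# Minimal sketch: the constant minimizing average squared loss is the sample mean.
set.seed(4)
y <- rgamma(500, shape = 2)   # skewed sample, so mean and median differ

risk_sq <- function(const) mean((y - const)^2)
c_star <- optimize(risk_sq, interval = range(y))$minimum

print(c(minimizer = c_star, mean = mean(y), median = median(y)))
```

The minimizer tracks the mean and not the median, which is the empirical version of the derivation above.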

4 Why Absolute Loss → Median?

Question: Which value of $c$ minimizes $\mathbb{E}[|Y - c|]$?

This is slightly trickier because $|y - c|$ is not differentiable at $y = c$.

Approach 1: Subgradient

\[\frac{\partial}{\partial c}\mathbb{E}[|Y - c|] = \mathbb{E}[\text{sign}(c - Y)]\]

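At the minimum: $\mathbb{E}[\text{sign}(c - Y)] = 0$

Since $\mathbb{E}[\text{sign}(c - Y)] = P(Y < c) - P(Y > c)$, this means $P(Y < c) = P(Y > c)$, so $c$ is a median of $Y$.

Approach 2: Differentiating the integral (assuming $Y$ has density $f$ and CDF $F$)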

\[\mathbb{E}[|Y - c|] = \int_{-\infty}^c (c - y) f(y) dy + \int_c^\infty (y - c) f(y) dy\]

Derivative with respect to $c$:

\[\frac{d}{dc}\mathbb{E}[|Y-c|] = \int_{-\infty}^c f(y)dy - \int_c^\infty f(y)dy = F(c) - [1 - F(c)] = 2F(c) - 1\]

Set it to zero: $F(c) = 1/2 \Rightarrow c = F^{-1}(1/2) = \text{Median}(Y)$.

Takeaway: The loss function determines what "optimal prediction" means. Squared loss → mean. Absolute loss → median. Check function → quantile $\tau$.
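The median result can be checked the same way as the mean result. This sketch minimizes the average absolute loss over a constant and evaluates the empirical analogue of the first-order condition $2F(c) - 1 = 0$. The gamma sample, the seed, and the odd sample size (so that the sample median is a single point) are illustrative choices.

```r
# Minimal sketch: the constant minimizing average absolute loss is the sample median.
set.seed(5)
y <- rgamma(501, shape = 2)   # odd sample size, so the sample median is unique

risk_abs <- function(const) mean(abs(y - const))
c_star <- optimize(risk_abs, interval = range(y))$minimum

# Empirical analogue of the first-order condition 2 * F(c) - 1 = 0
F_hat <- ecdf(y)
foc <- 2 * F_hat(c_star) - 1

print(c(minimizer = c_star, median = median(y), mean = mean(y), foc = foc))
```

The first-order condition is only approximately zero here because the empirical CDF moves in steps of $1/n$.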

5 Bias-Variance Tradeoff

This is one of the most important results in statistical learning theory.

5.1 Decomposition for Squared Loss

Fix a point $x$. Let $\hat{f}$ be an estimator trained on a random dataset, and let the target be $Y = f^*(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\text{Var}(\varepsilon) = \sigma^2$, where $\varepsilon$ is independent of the training data. Expectations below are taken over both the training data and $\varepsilon$.

Definition: Bias-Variance Decomposition

\[\mathbb{E}[(Y - \hat{f}(x))^2] = \underbrace{[\mathbb{E}[\hat{f}(x)] - f^*(x)]^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat{f}(x))}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}\]

Derivation:

Let $\bar{f}(x) = \mathbb{E}[\hat{f}(x)]$ (the expected prediction).

\[\mathbb{E}[(Y - \hat{f}(x))^2] = \mathbb{E}[(Y - f^*(x) + f^*(x) - \bar{f}(x) + \bar{f}(x) - \hat{f}(x))^2]\]

Expand the square and group it into three terms:

\[= \mathbb{E}[(Y - f^*(x))^2] + (\bar{f}(x) - f^*(x))^2 + \mathbb{E}[(\bar{f}(x) - \hat{f}(x))^2]\]

(the cross terms vanish because $\varepsilon$ has mean zero and is independent of $\hat{f}$, and because $\mathbb{E}[\bar{f}(x) - \hat{f}(x)] = 0$)

This becomes:

\[= \sigma^2 + \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x))\]
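The cross terms can be written out explicitly. As a sketch, abbreviate $A = Y - f^*(x) = \varepsilon$, $B = f^*(x) - \bar{f}(x)$ (a constant), and $C = \bar{f}(x) - \hat{f}(x)$; these letters are shorthand introduced here, not notation from the text:

\[\mathbb{E}[(A + B + C)^2] = \mathbb{E}[A^2] + B^2 + \mathbb{E}[C^2] + 2B\,\mathbb{E}[A] + 2B\,\mathbb{E}[C] + 2\,\mathbb{E}[AC]\]

Each of the last three terms is zero: $\mathbb{E}[A] = \mathbb{E}[\varepsilon] = 0$, $\mathbb{E}[C] = 0$ by the definition of $\bar{f}$, and $\mathbb{E}[AC] = \mathbb{E}[\varepsilon]\,\mathbb{E}[C] = 0$ because $\varepsilon$ is independent of $\hat{f}$. What remains is $\mathbb{E}[A^2] = \sigma^2$, $B^2 = \text{Bias}^2$, and $\mathbb{E}[C^2] = \text{Var}(\hat{f}(x))$.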

5.2 Interpretation

- Bias measures how wrong we are on average: the systematic error
- Variance measures how much $\hat{f}$ fluctuates across different training datasets
- $\sigma^2$ is the error that cannot be removed (noise in the data generating process)

Tradeoff:
- Simple models (linear): high bias, low variance
- Complex models (deep trees, neural networks): low bias, high variance
- Regularization: increases bias, decreases variance

graph LR
    A[Model Complexity] -->|increases| B[Variance ↑]
    A -->|decreases| C[Bias ↓]
    B --> D[Total Error = Bias² + Var + σ²]
    C --> D

In practice: choose the complexity that minimizes the estimated total error, typically via cross-validation.
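One way this selection might look in code is a plain k-fold cross-validation over polynomial degree. The sketch reuses the $\sin(2\pi x)$ signal and noise level from the worked example in this section; the sample size, the number of folds, the degree grid, and the helper name `cv_mse` are assumptions for illustration.

```r
# Minimal sketch: choose polynomial degree by 5-fold cross-validation.
# Sample size, fold count and degree grid are illustrative assumptions.
set.seed(6)
n <- 60
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))   # random fold assignment

# Average held-out squared error for a given degree
cv_mse <- function(degree) {
  fold_mse <- sapply(seq_len(k), function(j) {
    train <- dat[fold != j, ]
    test  <- dat[fold == j, ]
    fit <- lm(y ~ poly(x, degree), data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  })
  mean(fold_mse)
}

degrees <- 1:8
cv_err <- sapply(degrees, cv_mse)
best_degree <- degrees[which.min(cv_err)]

print(data.frame(degree = degrees, cv_mse = round(cv_err, 4)))
print(best_degree)
```

The held-out error estimates bias² + variance + $\sigma^2$ together, which is why it can stand in for the total error that cannot be computed directly.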

Worked Example: Bias-Variance in Polynomial Regression

library(ggplot2)
set.seed(42)

# True function: f*(x) = sin(2*pi*x)
f_true <- function(x) sin(2 * pi * x)

# Generate many datasets and fit polynomials of different degrees
n_datasets <- 100
n_obs <- 30
x_test <- seq(0, 1, length.out = 100)

# Simulate for degree d
simulate_bv <- function(degree, n_datasets = 100) {
  preds <- matrix(NA, nrow = length(x_test), ncol = n_datasets)

  for (i in seq_len(n_datasets)) {
    x_train <- runif(n_obs)
    y_train <- f_true(x_train) + rnorm(n_obs, sd = 0.3)

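    # Orthogonal polynomials (the poly() default) span the same function class as
    # raw = TRUE but are numerically more stable at high degree
    fit <- lm(y_train ~ poly(x_train, degree))

    # Predict on the x_test grid (fits can be unstable near the edges of [0, 1])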
    x_df <- data.frame(x_train = x_test)
    preds[, i] <- predict(fit, newdata = x_df)
  }

  bias2 <- (rowMeans(preds) - f_true(x_test))^2
  variance <- apply(preds, 1, var)

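  # Averages over the x_test grid. "total" is bias^2 + variance and leaves out
  # the irreducible noise sigma^2 = 0.3^2, which is the same for every degree
  list(
    mean_bias2 = mean(bias2),
    mean_var = mean(variance),
    mean_total = mean(bias2 + variance)
  )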
}

degrees <- 1:10
results <- lapply(degrees, simulate_bv)

bv_df <- data.frame(
  degree   = degrees,
  bias2    = sapply(results, `[[`, "mean_bias2"),
  variance = sapply(results, `[[`, "mean_var"),
  total    = sapply(results, `[[`, "mean_total")
)

# Plot
bv_long <- tidyr::pivot_longer(bv_df, -degree,
                                names_to = "component",
                                values_to = "value")

ggplot(bv_long, aes(x = degree, y = value, color = component)) +
  geom_line(linewidth = 1.1) +  # `linewidth` replaces `size` for lines in ggplot2 >= 3.4.0
  geom_point() +
  labs(title = "Bias-Variance Tradeoff: Polynomial Regression",
       x = "Polynomial Degree", y = "Mean Squared Error",
       color = "Component") +
  theme_minimal()

Note the U-shape of the "total" curve: the minimum total error occurs at an intermediate degree, not at the smallest or the largest.

6 Regularized ERM

Regularization = ERM + penalty:

\[\hat{f}_{reg} = \arg\min_{f \in \mathcal{F}} \left[\frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i)) + \lambda \Omega(f)\right]\]

where $\Omega(f)$ is a complexity penalty and $\lambda \geq 0$ is the regularization strength.

- $\lambda = 0$: pure ERM (can overfit)
- $\lambda \to \infty$: heavily regularized (can underfit)

Examples:
- Ridge: $\Omega(\beta) = \|\beta\|^2_2$
- LASSO: $\Omega(\beta) = \|\beta\|_1$
- L2 weight decay in neural networks

Bias-variance interpretation: regularization increases bias and decreases variance.
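For ridge with squared loss the regularized ERM problem has a closed form, which makes the effect of $\lambda$ easy to see. The sketch below uses simulated data; the seed, the dimensions, the grid of $\lambda$ values, and the helper name `ridge_fit` are illustrative assumptions.

```r
# Minimal sketch: ridge as regularized ERM with squared loss (no intercept).
# Simulated data and the lambda grid are illustrative assumptions.
set.seed(7)
n <- 100
p <- 5
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 0, 0, 1)
y <- as.vector(X %*% beta_true + rnorm(n))

# Minimizer of (1/n) * sum_i (y_i - x_i' beta)^2 + lambda * ||beta||_2^2.
# Setting the gradient to zero gives (X'X + n * lambda * I) beta = X'y.
ridge_fit <- function(lambda) {
  as.vector(solve(crossprod(X) + n * lambda * diag(p), crossprod(X, y)))
}

lambdas <- c(0, 0.1, 1, 10)
norms <- sapply(lambdas, function(l) sqrt(sum(ridge_fit(l)^2)))

print(data.frame(lambda = lambdas, l2_norm = round(norms, 4)))
```

At $\lambda = 0$ this reproduces least squares, and the coefficient norm shrinks toward zero as $\lambda$ grows, which is the added bias traded for lower variance.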

Connection: Bayesian MAP = Regularized ERM

Regularized ERM has an elegant Bayesian interpretation:

\[\hat{\beta}_{MAP} = \arg\max_\beta [P(\beta | data)] = \arg\max_\beta [\log P(data|\beta) + \log P(\beta)]\]

\[= \arg\min_\beta [-\log L(\beta) + \lambda\Omega(\beta)]\]

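where $L(\beta) = P(data|\beta)$ here denotes the likelihood, not the loss, and $\lambda\Omega(\beta)$ equals $-\log P(\beta)$ up to an additive constant.

- Ridge ($\|\beta\|_2^2$ penalty) = MAP with prior $\beta \sim \mathcal{N}(0, \sigma^2/\lambda \cdot I)$, for Gaussian noise with variance $\sigma^2$ and the sum-of-squares objective $\sum_i (y_i - x_i^T\beta)^2 + \lambda\|\beta\|_2^2$ (with the $1/n$-averaged loss the prior variance becomes $\sigma^2/(n\lambda)$)
- LASSO ($\|\beta\|_1$ penalty) = MAP with independent priors $\beta_j \sim \text{Laplace}(0, 1/\lambda)$ (scale $1/\lambda$), when the penalty is added directly to $-\log L(\beta)$

The regularization strength $\lambda$ encodes how strong your prior beliefs are: a larger $\lambda$ means a tighter prior around zero.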
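To see where the ridge correspondence comes from, here is a sketch under an assumed Gaussian linear model $y_i | x_i, \beta \sim \mathcal{N}(x_i^T\beta, \sigma^2)$ with prior $\beta \sim \mathcal{N}(0, \tau^2 I)$; the symbol $\tau^2$ for the prior variance is introduced here for illustration:

\[-\log P(\beta | data) = \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^T\beta)^2 + \frac{1}{2\tau^2}\|\beta\|_2^2 + \text{const}\]

Multiplying by $2\sigma^2$ does not change the minimizer and gives $\sum_{i=1}^n (y_i - x_i^T\beta)^2 + \frac{\sigma^2}{\tau^2}\|\beta\|_2^2$, so $\lambda = \sigma^2/\tau^2$, that is, $\tau^2 = \sigma^2/\lambda$. A tighter prior (smaller $\tau^2$) corresponds to a larger $\lambda$.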

Practice Problems

Problem 1: Deriving the Optimal Predictor

(a) Show that the optimal predictor under squared loss is $f^*(x) = \mathbb{E}[Y|X=x]$.
(b) Show that the optimal constant predictor under the loss $L(y, c) = (y-c)^4$ satisfies $\mathbb{E}[(Y - c)^3] = 0$, and that it equals $\mathbb{E}[Y]$ when the distribution of $Y$ is symmetric about its mean. (Hint: differentiate with respect to $c$ and set the derivative to zero.)
(c) For the loss $L(y, \hat{y}) = e^{|y - \hat{y}|} - 1$, is the optimal predictor still the mean or the median? Explain the intuition.

Problem 2: Log Loss Properties

(a) Prove that $L(y, \hat{p}) = -[y\log\hat{p} + (1-y)\log(1-\hat{p})] \geq 0$ for all $\hat{p} \in (0,1)$ and $y \in \{0,1\}$.
(b) Show that the expected log loss $\mathbb{E}[L(Y, \hat{p}(X))]$ is minimized by $\hat{p}(x) = P(Y=1|X=x)$.
(c) What happens to the log loss when $\hat{p} \to 0$ but $y = 1$? Interpret the result.

Problem 3: Bias-Variance

We observe $Y_i = \mu + \varepsilon_i$ for $i = 1, \dots, n$, with i.i.d. errors, $\mathbb{E}[\varepsilon_i] = 0$ and $\text{Var}(\varepsilon_i) = \sigma^2$.

(a) Estimator: $\hat{\mu}_1 = \bar{Y}$ (the sample mean). Compute its bias, variance, and MSE.
(b) Shrinkage estimator: $\hat{\mu}_2 = c\bar{Y}$ for $c \in (0,1)$. Compute its bias, variance, and MSE. For which values of $c$ is MSE($\hat{\mu}_2$) < MSE($\hat{\mu}_1$)?
(c) Implication: can a biased estimator be better (lower MSE) than an unbiased one?

Answer to Problem 3(b):

$\text{Bias}(\hat{\mu}_2) = c\mu - \mu = (c-1)\mu$, $\text{Bias}^2 = (c-1)^2\mu^2$

$\text{Var}(\hat{\mu}_2) = c^2\text{Var}(\bar{Y}) = c^2\sigma^2/n$

$\text{MSE}(\hat{\mu}_2) = (c-1)^2\mu^2 + c^2\sigma^2/n$

$\text{MSE}(\hat{\mu}_2) < \text{MSE}(\hat{\mu}_1) = \sigma^2/n$ when:

$(c-1)^2\mu^2 + c^2\sigma^2/n < \sigma^2/n$

$(c-1)^2\mu^2 < (1-c^2)\sigma^2/n = (1-c)(1+c)\sigma^2/n$

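$(1-c)\mu^2 < (1+c)\sigma^2/n$ (dividing both sides by $(1-c) > 0$, which is valid since $c < 1$)

Rearranging gives $c > \frac{\mu^2 - \sigma^2/n}{\mu^2 + \sigma^2/n}$. This threshold is always below 1, so some $c \in (0,1)$ always lowers the MSE. The admissible range is wide when $\mu^2$ is small relative to $\sigma^2/n$ and shrinks toward $c = 1$ as $\mu^2$ grows. In practice the threshold depends on the unknown $\mu$ and $\sigma^2$.

Answer to Problem 3(c): Yes, a biased estimator can be better in the MSE sense. This is why regularization (which introduces bias) can improve prediction performance.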
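A short simulation can make the answer concrete. The sketch below compares the simulated MSE of $c\bar{Y}$ with the formula $(c-1)^2\mu^2 + c^2\sigma^2/n$ for a few values of $c$. The values of $\mu$, $\sigma$, $n$, the Gaussian errors, and the seed are illustrative assumptions; the problem itself only fixes the mean and variance of the errors.

```r
# Minimal sketch: simulated vs. theoretical MSE of the shrinkage estimator c * Ybar.
# mu, sigma, n and the Gaussian errors are illustrative assumptions.
set.seed(8)
mu <- 0.5
sigma <- 2
n <- 10
n_sim <- 20000

# Sampling distribution of the sample mean
ybar <- replicate(n_sim, mean(rnorm(n, mean = mu, sd = sigma)))

mse <- function(est) mean((est - mu)^2)

# Threshold obtained by rearranging (1 - c) * mu^2 < (1 + c) * sigma^2 / n for c
c_threshold <- (mu^2 - sigma^2 / n) / (mu^2 + sigma^2 / n)

c_grid <- c(0.5, 0.8, 1.0)   # c = 1 is the unbiased sample mean
sim_mse    <- sapply(c_grid, function(c_val) mse(c_val * ybar))
theory_mse <- (c_grid - 1)^2 * mu^2 + c_grid^2 * sigma^2 / n

print(c_threshold)
print(data.frame(c = c_grid,
                 sim_mse = round(sim_mse, 4),
                 theory_mse = round(theory_mse, 4)))
```

With these values the threshold is negative, so every $c \in (0,1)$ beats the sample mean; a larger $\mu$ or a larger $n$ would push the threshold up toward 1.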

7 Summary

Loss Function	Optimal Predictor	Use When
Squared $(y-\hat{y})^2$	Conditional mean $\mathbb{E}[Y|X=x]$	Default for regression, Gaussian noise
Absolute $|y-\hat{y}|$	Conditional median	Outliers are present and robustness matters
Huber	Between the mean and the median	Mild outliers, when a smooth loss is wanted
Log loss	Conditional probability $P(Y=1|X=x)$	Binary classification
Hinge $\max(0, 1-y\hat{f})$	Margin classifier, $\text{sign}(2P(Y=1|X=x) - 1)$ at the population level	SVM, max-margin classification

Next: Gradient Descent & Optimization →
