Information Theory

Entropy, KL Divergence, and Cross-Entropy Loss

ml-math
intermediate
Foundations of information theory for ML and statistics: Shannon entropy, KL divergence, cross-entropy loss as KL minimization, mutual information, and connections to MLE and model selection.

1 Why Does This Matter?

Note: Why This Matters for Your Work

Information theory is not just abstract theory; it is the precise language for talking about model training and evaluation:

  • Cross-entropy loss in neural networks IS the KL divergence from the data distribution to the model distribution
  • MLE is minimizing the KL divergence from the empirical distribution to the model
  • AIC/BIC have a clear information-theoretic interpretation
  • Mutual information measures nonlinear dependence and is more general than correlation

If you have ever wondered why we minimize cross-entropy (not MSE) for classification, or why entropy is maximized by the uniform distribution, the answers are here.


2 Entropy

2.1 Shannon Entropy (Discrete)

Important: Definition (Shannon Entropy)

For a discrete random variable \(X\) with PMF \(P(X=x)\):

\[H(X) = -\sum_{x \in \mathcal{X}} P(X=x) \log P(X=x) = -E[\log P(X)]\]

Conventions: \(0 \log 0 = 0\) (by the limit), and \(\log\) can be base 2 (bits) or natural (nats).

Interpretation:

  • Entropy measures the average information (surprise) of a single draw of \(X\)
  • Larger entropy = more uncertainty/unpredictability
  • Minimum entropy = 0 (deterministic variable)
  • Maximum entropy = \(\log|\mathcal{X}|\) (uniform distribution)

Proof of the maximum-entropy claim:

For \(\mathcal{X} = \{1, \ldots, n\}\) with the constraint \(\sum_x P(x) = 1\):

Lagrangian: \(\mathcal{L} = -\sum_x P(x)\log P(x) - \lambda(\sum_x P(x) - 1)\)

\(\partial\mathcal{L}/\partial P(x) = -\log P(x) - 1 - \lambda = 0 \Rightarrow P(x) = e^{-1-\lambda}\), which is constant in \(x\); the normalization constraint then forces \(P(x) = 1/n\) (uniform).
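
As a quick numerical sanity check (a small R sketch, not part of the proof), random PMFs on four outcomes never exceed the entropy of the uniform distribution:

# Quick check: entropy of random PMFs on 4 outcomes never exceeds log(4)
entropy_nats <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

set.seed(1)
random_pmfs <- replicate(1000, { w <- runif(4); w / sum(w) })  # 4 x 1000 matrix
max(apply(random_pmfs, 2, entropy_nats))   # strictly below log(4)
entropy_nats(rep(0.25, 4))                 # exactly log(4) ~ 1.386 nats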

2.2 Properties of Entropy

  1. Non-negative: \(H(X) \geq 0\), with equality iff \(X\) is deterministic

  2. Concavity: \(H(\sum_i \lambda_i P_i) \geq \sum_i \lambda_i H(P_i)\): a mixture is more uncertain than its components on average

  3. Chain rule: \(H(X, Y) = H(X) + H(Y|X)\)

  4. Subadditivity: \(H(X, Y) \leq H(X) + H(Y)\), equality iff \(X \perp Y\)

2.3 Conditional Entropy

\[H(Y|X) = -\sum_{x,y} P(x,y)\log P(y|x) = E_{P(X)}[H(Y|X=x)]\]

Intuitively: remaining uncertainty of \(Y\) given knowledge of \(X\).

2.4 Differential Entropy (Continuous)

\[h(X) = -\int f(x)\log f(x)\, dx\]

Caution: differential entropy is not invariant under linear transformations!

Important example: the Gaussian maximizes differential entropy for a fixed variance:

\[h(N(\mu, \sigma^2)) = \frac{1}{2}\log(2\pi e \sigma^2)\]

For fixed \(\sigma^2\), no other distribution has higher differential entropy. This is the maximum entropy principle for continuous distributions.
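
A hedged numerical sketch of this claim: Monte Carlo estimates of \(h(X) = -E[\log f(X)]\) for three unit-variance distributions, where the Gaussian should come out highest and close to \(\frac{1}{2}\log(2\pi e) \approx 1.419\) nats (the choice of distributions is illustrative):

# Monte Carlo differential entropy for three distributions with variance 1
set.seed(42)
n <- 1e6

x_norm <- rnorm(n)                                                  # N(0, 1)
x_lap  <- rexp(n) * sample(c(-1, 1), n, replace = TRUE) / sqrt(2)   # Laplace, variance 1

h_norm <- -mean(dnorm(x_norm, log = TRUE))                # ~ 0.5 * log(2*pi*e)
h_unif <- log(2 * sqrt(3))                                # exact: U(-sqrt(3), sqrt(3)), variance 1
h_lap  <- -mean(log(sqrt(2) / 2) - sqrt(2) * abs(x_lap))  # log-density of Laplace(0, 1/sqrt(2))

c(gaussian = h_norm, uniform = h_unif, laplace = h_lap,
  bound = 0.5 * log(2 * pi * exp(1)))                     # the Gaussian attains the bound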


3 KL Divergence (Relative Entropy)

Important: Definition (Kullback-Leibler Divergence)

For two distributions \(P\) and \(Q\) over the same space:

\[D_{KL}(P \| Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} = E_P\left[\log\frac{P(X)}{Q(X)}\right]\]

For continuous distributions: \[D_{KL}(P \| Q) = \int p(x)\log\frac{p(x)}{q(x)}\, dx\]

Convention: \(0 \log(0/0) = 0\), but \(p(x) > 0\) with \(q(x) = 0\) yields \(+\infty\).

3.1 Properties KL Divergence

Non-negativity (Gibbs’ Inequality): \[D_{KL}(P \| Q) \geq 0, \quad \text{equality iff } P = Q \text{ a.e.}\]

Proof using Jensen's inequality: \(\log\) is concave, so \[-D_{KL}(P\|Q) = E_P[\log(Q/P)] \leq \log E_P[Q/P] = \log 1 = 0\]

Asymmetry: \(D_{KL}(P\|Q) \neq D_{KL}(Q\|P)\) in general.

It is not a distance metric! Rather, KL divergence is an “oriented” measure of how \(P\) differs from \(Q\).

Coding-theory interpretation:

  • \(H(P)\): minimum average code length to encode a draw from \(P\) (Shannon's source coding theorem)
  • \(H(P,Q) = H(P) + D_{KL}(P\|Q)\): average code length if we use the code that is optimal for \(Q\) while the data actually come from \(P\)
  • \(D_{KL}(P\|Q)\): the extra bits wasted by using the wrong distribution \(Q\)

3.2 Forward vs Reverse KL

  • Forward KL \(D_{KL}(P\|Q)\): "mean-seeking", since \(Q\) must cover all of the mass of \(P\) (zero-avoiding)
  • Reverse KL \(D_{KL}(Q\|P)\): "mode-seeking", since \(Q\) can focus on a mode of \(P\) and ignore the tails (zero-forcing)

This matters in variational inference: the ELBO optimizes the reverse KL \(D_{KL}(Q\|P)\). A small numerical sketch of the two behaviors follows.
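
A minimal numerical sketch of the mean-seeking vs mode-seeking behavior, assuming a bimodal target \(P\) discretized on a grid and a single Gaussian \(Q\) (the grid, optimizer, and starting values are illustrative choices):

# Fit a single Gaussian q to a bimodal target p by minimizing forward vs reverse KL
x  <- seq(-10, 10, by = 0.01)
dx <- 0.01
p  <- 0.5 * dnorm(x, -3, 1) + 0.5 * dnorm(x, 3, 1)   # bimodal target
p  <- p / sum(p * dx)

kl_grid <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0)) * dx

fit <- function(direction) {
  obj <- function(par) {
    q <- dnorm(x, mean = par[1], sd = exp(par[2]))
    q <- pmax(q / sum(q * dx), 1e-300)
    if (direction == "forward") kl_grid(p, q) else kl_grid(q, p)
  }
  optim(c(1, 0), obj)$par                            # par = (mean, log sd)
}

fwd <- fit("forward"); rv <- fit("reverse")
cat("Forward KL fit: mean", round(fwd[1], 2), " sd", round(exp(fwd[2]), 2), "\n")
cat("Reverse KL fit: mean", round(rv[1], 2),  " sd", round(exp(rv[2]), 2), "\n")

The forward-KL fit should land near mean 0 with a large standard deviation (covering both modes), while the reverse-KL fit should sit on one of the modes with a standard deviation close to 1.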


4 Cross-Entropy

Important: Definition (Cross-Entropy)

\[H(P, Q) = -\sum_x P(x)\log Q(x) = -E_P[\log Q(X)]\]

Relation to KL: \[H(P, Q) = H(P) + D_{KL}(P\|Q)\]

If \(P\) is the data distribution (fixed), then: \[\arg\min_Q H(P, Q) = \arg\min_Q D_{KL}(P\|Q)\]

Minimizing cross-entropy = minimizing KL divergence!
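
A tiny numerical check of the identity \(H(P,Q) = H(P) + D_{KL}(P\|Q)\) on a small PMF (the values are arbitrary, for illustration):

# Verify H(P, Q) = H(P) + KL(P || Q)
P <- c(0.5, 0.3, 0.2)
Q <- c(0.2, 0.5, 0.3)

H_P   <- -sum(P * log(P))
CE_PQ <- -sum(P * log(Q))
KL_PQ <-  sum(P * log(P / Q))

all.equal(CE_PQ, H_P + KL_PQ)   # TRUE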

4.1 Cross-Entropy Loss dalam Classification

For classification with \(K\) classes:

Multi-class cross-entropy: \[L = -\sum_{k=1}^K y_k \log \hat{p}_k\]

where \(y_k \in \{0,1\}\) (one-hot true label) and \(\hat{p}_k = P(Y=k|x)\) (model prediction).

Binary cross-entropy: \[L = -[y \log \hat{p} + (1-y)\log(1-\hat{p})]\]

This is exactly the negative log-likelihood of a Bernoulli distribution, whose log-likelihood is \(\ell(\hat{p}; y) = y\log\hat{p} + (1-y)\log(1-\hat{p})\), so \(L = -\ell\).

Minimizing cross-entropy loss = maximizing likelihood = minimizing the KL from the empirical distribution to the model.

4.2 Why Cross-Entropy, Not MSE, for Classification?

  1. Natural loss: classification means estimating probabilities, and cross-entropy is directly related to the log-likelihood

  2. Gradient behavior: cross-entropy with sigmoid/softmax gives the gradient \(\hat{y} - y\) (elegant!). MSE with sigmoid gives \((\hat{y}-y)\hat{y}(1-\hat{y})\), which saturates near 0/1; see the numerical check after this list

  3. Information-theoretic: measures divergence between distributions — semantically appropriate
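
A short check of the gradient claim in point 2, assuming a single example with true label \(y = 1\) and the loss applied after a sigmoid (the specific logits are illustrative):

# Gradients w.r.t. the logit z for y = 1: cross-entropy vs MSE after a sigmoid
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-8, -4, 0, 4)                    # confidently wrong ... confidently right
y <- 1
p <- sigmoid(z)

grad_ce  <- p - y                       # d/dz of -[y log p + (1 - y) log(1 - p)]
grad_mse <- 2 * (p - y) * p * (1 - p)   # d/dz of (p - y)^2

round(data.frame(z, p, grad_ce, grad_mse), 5)
# At z = -8 the MSE gradient is ~0 (saturated) while the CE gradient stays near -1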


5 Mutual Information

Important: Definition (Mutual Information)

\[I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\]

Equivalently: \[I(X; Y) = D_{KL}(P_{XY} \| P_X P_Y) = \sum_{x,y} P(x,y)\log\frac{P(x,y)}{P(x)P(y)}\]

Properties:

  • \(I(X;Y) \geq 0\), with equality iff \(X \perp Y\)
  • \(I(X;Y) = I(Y;X)\) (symmetric!)
  • \(I(X;Y) \leq \min(H(X), H(Y))\)

5.1 Why Is MI Better Than Correlation?

  • Pearson correlation: measures only linear dependence. \(\rho(X,Y) = 0\) does not imply independence.

  • Mutual information: measures any dependence, linear or nonlinear. \(I(X;Y) = 0 \iff X \perp Y\).

Example: \(Y = X^2\) with \(X\) symmetric around 0. \(\text{Cor}(X,Y) = 0\) but \(I(X;Y) = H(Y) > 0\); a numerical sketch follows.
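
A quick simulation of this example (the 20-bin discretization is an arbitrary illustrative choice; the plug-in MI estimate is coarse but enough to make the point):

# Y = X^2: correlation ~ 0, mutual information clearly > 0
set.seed(7)
n <- 1e5
x <- rnorm(n)
y <- x^2

cor(x, y)                               # approximately 0

xb <- cut(x, breaks = 20)
yb <- cut(y, breaks = 20)
pxy <- table(xb, yb) / n                # empirical joint distribution
px  <- rowSums(pxy)
py  <- colSums(pxy)
sum(pxy * log(pxy / outer(px, py)), na.rm = TRUE)   # plug-in MI estimate, > 0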

5.2 Data Processing Inequality (DPI)

For a Markov chain \(X \to Y \to Z\): \[I(X; Z) \leq I(X; Y)\]

Processing cannot increase information. This is fundamental for understanding representation learning; a small exact computation is sketched below.
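
A small exact illustration, assuming each step of the chain is a binary symmetric channel with flip probability 0.1 (the numbers are illustrative):

# DPI check: X -> Y -> Z, each arrow a binary symmetric channel with flip prob 0.1
flip <- 0.1
px   <- c(0.5, 0.5)
channel <- matrix(c(1 - flip, flip,
                    flip, 1 - flip), nrow = 2, byrow = TRUE)  # rows: input, cols: output

mi_from_joint <- function(pxy) {
  marg_x <- rowSums(pxy); marg_y <- colSums(pxy)
  sum(ifelse(pxy > 0, pxy * log(pxy / outer(marg_x, marg_y)), 0))
}

p_xy <- diag(px) %*% channel               # joint P(X, Y): one noisy step
p_xz <- diag(px) %*% channel %*% channel   # joint P(X, Z): two noisy steps

cat("I(X;Y) =", round(mi_from_joint(p_xy), 4), "nats\n")
cat("I(X;Z) =", round(mi_from_joint(p_xz), 4), "nats\n")   # smaller, as DPI predicts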

5.3 Normalized Mutual Information (NMI)

\[\text{NMI}(X,Y) = \frac{I(X;Y)}{\sqrt{H(X)H(Y)}} \in [0, 1]\]

Useful as a similarity measure; widely used in clustering evaluation. A minimal implementation is sketched below.
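
A minimal plug-in NMI sketch for two discrete label vectors (the helper name nmi and the toy vectors are hypothetical, for illustration only):

# Plug-in NMI for two discrete label vectors
nmi <- function(x, y) {
  pxy <- table(x, y) / length(x)
  px <- rowSums(pxy); py <- colSums(pxy)
  H  <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
  mi <- sum(ifelse(pxy > 0, pxy * log(pxy / outer(px, py)), 0))
  mi / sqrt(H(px) * H(py))
}

# Example: clustering labels vs true classes (toy data)
nmi(c(1, 1, 2, 2, 3, 3), c("a", "a", "b", "b", "b", "c"))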


6 Log-Likelihood and Information Theory

6.1 MLE as KL Minimization

Empirical distribution: \(\hat{P}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[x_i = x]\)

The MLE for a parametric model \(\{P_\theta\}\):

\[\hat{\theta}_{MLE} = \arg\max_\theta \frac{1}{n}\sum_{i=1}^n \log p_\theta(x_i) = \arg\max_\theta E_{\hat{P}_n}[\log p_\theta(X)]\]

\[= \arg\min_\theta -E_{\hat{P}_n}[\log p_\theta(X)] = \arg\min_\theta H(\hat{P}_n, P_\theta)\]

\[= \arg\min_\theta D_{KL}(\hat{P}_n \| P_\theta) + H(\hat{P}_n)\]

\[= \arg\min_\theta D_{KL}(\hat{P}_n \| P_\theta)\]

(since \(H(\hat{P}_n)\) does not depend on \(\theta\))

MLE = minimizing the KL from the empirical distribution to the model distribution. A small numerical illustration follows.
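
A small numerical illustration, assuming Bernoulli data and a grid search over \(\theta\) (the grid resolution is arbitrary): the \(\theta\) minimizing \(D_{KL}(\hat{P}_n \| \text{Bern}(\theta))\) matches the MLE.

# MLE vs KL minimization for Bernoulli data
set.seed(1)
x <- rbinom(50, size = 1, prob = 0.3)
p_hat <- mean(x)                                 # the MLE

theta <- seq(0.01, 0.99, by = 0.001)
emp   <- c(1 - p_hat, p_hat)                     # empirical distribution over {0, 1}
kl    <- sapply(theta, function(t) {
  q <- c(1 - t, t)
  sum(ifelse(emp > 0, emp * log(emp / q), 0))
})

c(mle = p_hat, argmin_kl = theta[which.min(kl)]) # essentially identical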

6.2 AIC dan Information Criteria

Akaike Information Criterion (AIC):

\[\text{AIC} = -2\ell(\hat{\theta}) + 2k\]

where \(\ell(\hat{\theta})\) is the maximized log-likelihood and \(k\) is the number of parameters.

Derivation: AIC is an estimator of the expected KL divergence from the true distribution to the fitted model (evaluated on new data). Akaike (1974) showed that, up to an additive constant that does not depend on the model,

\[E[D_{KL}(P_{true}\|P_{\hat{\theta}})] \approx -\ell(\hat{\theta}) + k\]

Bayesian Information Criterion (BIC):

\[\text{BIC} = -2\ell(\hat{\theta}) + k\log n\]

The BIC penalty is stronger for large \(n\), so BIC selects more parsimonious models.

BIC derivation: an approximation to the log marginal likelihood (Bayesian model comparison).


7 Information Bottleneck

Important: Definition (Information Bottleneck)

Goal: find a representation \(Z\) of \(X\) that compresses \(X\) as much as possible while preserving information about \(Y\).

Objective (tradeoff): \[\min_{P(Z|X)} I(X; Z) - \beta I(Z; Y)\]

  • \(I(X;Z)\): compression (want it small)
  • \(I(Z;Y)\): prediction quality (want it large)
  • \(\beta\): tradeoff parameter

Connection to deep learning: each layer of a neural network is a transformation of the input. The Information Bottleneck framework suggests that layers should retain \(Y\)-relevant information and compress away \(Y\)-irrelevant information.

7.1 ELBO dan VAE

The Variational Autoencoder (VAE) optimizes the Evidence Lower BOund (ELBO):

\[\text{ELBO} = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))\]

  • First term: reconstruction quality (negative cross-entropy)
  • Second term: regularization via the KL from the approximate posterior to the prior

Maximizing the ELBO = minimizing the variational gap, i.e., the approximation error of the posterior; the identity below makes this precise.
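
Concretely, the gap between the log marginal likelihood and the ELBO is exactly the KL divergence from the approximate posterior to the true posterior:

\[\log p_\theta(x) = \text{ELBO}(\phi, \theta; x) + D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)\]

so for fixed \(\theta\), maximizing the ELBO over \(\phi\) minimizes \(D_{KL}(q_\phi(z|x) \| p_\theta(z|x))\).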


8 Worked Example: Cross-Entropy for 3-Class Classification

library(tidyverse)

# ==========================================
# 1. Entropy Calculations
# ==========================================

# Entropy function
entropy <- function(p) {
  p <- p[p > 0]  # Remove zeros
  -sum(p * log(p))  # nats
}

# Examples
cat("Entropy (uniform 4 classes):", entropy(rep(0.25, 4)), "\n")
cat("Entropy (certain - one class):", entropy(c(1, 0, 0, 0)), "\n")
cat("Entropy (coin flip):", entropy(c(0.5, 0.5)), "\n")
cat("Max entropy (log(4)):", log(4), "\n")

# ==========================================
# 2. KL Divergence
# ==========================================

kl_divergence <- function(p, q) {
  idx <- p > 0
  sum(p[idx] * log(p[idx] / q[idx]))
}

P <- c(0.4, 0.3, 0.3)
Q1 <- c(1/3, 1/3, 1/3)  # uniform
Q2 <- c(0.5, 0.4, 0.1)  # "wrong" model

cat("\nKL(P||Q_uniform):", kl_divergence(P, Q1), "\n")
cat("KL(P||Q_wrong):", kl_divergence(P, Q2), "\n")
cat("KL(Q_wrong||P):", kl_divergence(Q2, P), "\n")  # Asymmetry!
cat("Note: KL is NOT symmetric!\n")

# ==========================================
# 3. Cross-Entropy Loss
# ==========================================

cross_entropy <- function(y_true, y_pred) {
  -sum(y_true * log(y_pred + 1e-15))  # epsilon for numerical stability
}

# True label: class 2 (one-hot)
y_true <- c(0, 1, 0)

# Different predictions
pred_good <- c(0.1, 0.8, 0.1)   # confident, correct
pred_ok   <- c(0.2, 0.5, 0.3)   # uncertain
pred_bad  <- c(0.7, 0.1, 0.2)   # confident, wrong

cat(sprintf("\nCE Loss (good prediction): %.4f\n", cross_entropy(y_true, pred_good)))
cat(sprintf("CE Loss (uncertain):        %.4f\n", cross_entropy(y_true, pred_ok)))
cat(sprintf("CE Loss (wrong prediction): %.4f\n", cross_entropy(y_true, pred_bad)))
# Bad prediction has very high loss!

# ==========================================
# 4. Mutual Information (Discrete)
# ==========================================

# Joint distribution P(X,Y)
P_xy <- matrix(c(0.3, 0.1,
                 0.1, 0.5), nrow = 2, byrow = TRUE)

P_x <- rowSums(P_xy)
P_y <- colSums(P_xy)

# Compute MI
mi <- 0
for (i in 1:2) for (j in 1:2) {
  if (P_xy[i,j] > 0)
    mi <- mi + P_xy[i,j] * log(P_xy[i,j] / (P_x[i] * P_y[j]))
}

H_X <- entropy(P_x)
H_Y <- entropy(P_y)
H_XY <- entropy(as.vector(P_xy))
H_YX <- H_XY - H_X  # H(Y|X)

cat(sprintf("\nH(X) = %.4f\n", H_X))
cat(sprintf("H(Y) = %.4f\n", H_Y))
cat(sprintf("H(X,Y) = %.4f\n", H_XY))
cat(sprintf("I(X;Y) via KL = %.4f\n", mi))
cat(sprintf("I(X;Y) = H(X) - H(X|Y) = %.4f\n", H_X - (H_XY - H_Y)))

# ==========================================
# 5. AIC vs BIC for Model Selection
# ==========================================

# Simulate data from AR(2) process
set.seed(123)
n <- 200
y <- arima.sim(model = list(ar = c(0.6, 0.2)), n = n)

# Fit different AR models
models_aic <- sapply(1:5, function(p) AIC(arima(y, order = c(p, 0, 0))))
models_bic <- sapply(1:5, function(p) BIC(arima(y, order = c(p, 0, 0))))

cat("\n--- Model Selection ---\n")
cat("AIC for AR(1) to AR(5):", round(models_aic, 2), "\n")
cat("BIC for AR(1) to AR(5):", round(models_bic, 2), "\n")
cat("Best by AIC: AR(", which.min(models_aic), ")\n")
cat("Best by BIC: AR(", which.min(models_bic), ")\n")
# True model is AR(2) - do they recover it?

Takeaway: with the wrong distribution, the cross-entropy loss is large. A confident but wrong prediction is penalized very heavily; this is a desirable property of the log score, a strictly proper scoring rule (analogous in spirit to the Brier score).


Caution: Connections to ELBO, MINE, and Feature Selection

Information theory has broad connections:

  • ELBO in VAEs: \(\log p(x) \geq E_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x)\|p(z))\); the ELBO is a lower bound on the log marginal likelihood. Training a VAE = maximizing the ELBO.

  • MINE (Mutual Information Neural Estimation): a neural network estimates MI from samples using the Donsker-Varadhan representation: \(I(X;Y) = \sup_T E_{P_{XY}}[T] - \log E_{P_XP_Y}[e^T]\).

  • Feature selection: filter methods use the MI between each feature and the target variable to rank features.

  • Contrastive learning: the InfoNCE loss (based on Noise-Contrastive Estimation) is a lower bound on mutual information; used in SimCLR and CLIP.

  • Bits-back coding: uses latent variable models for data compression; connects VAEs to actual compression.


9 Practice Problems

Problem 1: Show that \(D_{KL}(P\|Q) \geq 0\) using Jensen's inequality.

\(-D_{KL}(P\|Q) = E_P\left[\log\frac{Q(X)}{P(X)}\right] \leq \log E_P\left[\frac{Q(X)}{P(X)}\right] = \log \sum_x P(x)\frac{Q(x)}{P(x)} = \log 1 = 0\)

The inequality holds because \(\log\) is concave (Jensen's inequality). QED.

Problem 2: Prove that \(H(X,Y) = H(X) + H(Y|X)\).

\(H(X,Y) = -\sum_{x,y} P(x,y)\log P(x,y)\)

\(= -\sum_{x,y} P(x,y)[\log P(x) + \log P(y|x)]\)

\(= -\sum_x P(x)\log P(x) - \sum_{x,y} P(x,y)\log P(y|x)\)

\(= H(X) + H(Y|X)\) QED

Problem 3: Show that MLE = minimizing the KL divergence from the empirical distribution.

See the derivation above. Key: \(E_{\hat{P}_n}[\log p_\theta(X)] = \frac{1}{n}\sum_i \log p_\theta(x_i)\), which is the sample average of the log-likelihood. Maximizing this = minimizing \(-E_{\hat{P}_n}[\log p_\theta] = H(\hat{P}_n, P_\theta)\).

Since \(H(\hat{P}_n)\) is constant in \(\theta\) and \(H(\hat{P}_n, P_\theta) = H(\hat{P}_n) + D_{KL}(\hat{P}_n \| P_\theta)\), minimizing the cross-entropy is equivalent to minimizing \(D_{KL}(\hat{P}_n \| P_\theta)\).

Problem 4: Compute the mutual information between \(X \sim \text{Bernoulli}(0.5)\) and \(Y = X\) (perfectly correlated).

\(P(X=0,Y=0) = 0.5\), \(P(X=1,Y=1) = 0.5\), others = 0.

\(H(X) = H(Y) = \log 2 = 1\) bit.

\(H(Y|X) = 0\) (Y perfectly determined by X).

\(I(X;Y) = H(Y) - H(Y|X) = 1 - 0 = 1\) bit. MI = Entropy of X — maximum possible!

Problem 5: Why can differential entropy be negative?

\(h(U[0,a]) = \log a\), which is negative when \(a < 1\). Differential entropy is not an amount of information in bits (unlike discrete entropy); it is a relative quantity and can be negative. Only differences in differential entropy have an absolute meaning.


Navigation: ← Kernel Methods & SVM | PCA & Dimensionality Reduction →