Information Theory
Entropy, KL Divergence, and Cross-Entropy Loss
1 Why Does This Matter?
Information theory is not just abstract theory: it is the precise language for talking about model training and evaluation:
- Cross-entropy loss in neural networks IS (up to a constant) the KL divergence from the data distribution to the model distribution
- MLE is minimizing the KL divergence from the empirical distribution to the model
- AIC/BIC have clear information-theoretic interpretations
- Mutual information measures nonlinear dependence, more general than correlation
If you have ever wondered why we minimize cross-entropy (not MSE) for classification, or why entropy is maximized by the uniform distribution, the answers are here.
2 Entropy
2.1 Shannon Entropy (Discrete)
For a discrete random variable \(X\) with PMF \(P(X=x)\):
\[H(X) = -\sum_{x \in \mathcal{X}} P(X=x) \log P(X=x) = -E[\log P(X)]\]
Conventions: \(0 \log 0 = 0\) (by taking the limit); \(\log\) may be base 2 (bits) or natural (nats).
Interpretation:
- Entropy measures the average information (surprise) from one draw of \(X\)
- Higher entropy = more uncertainty/unpredictability
- Minimum entropy = 0 (deterministic variable)
- Maximum entropy = \(\log|\mathcal{X}|\) (uniform distribution)
Proof of maximum entropy:
For \(\mathcal{X} = \{1, \ldots, n\}\) with the constraint \(\sum_x P(x) = 1\):
Lagrangian: \(\mathcal{L} = -\sum_x P(x)\log P(x) - \lambda(\sum_x P(x) - 1)\)
\(\partial\mathcal{L}/\partial P(x) = -\log P(x) - 1 - \lambda = 0 \Rightarrow P(x) = e^{-1-\lambda} = 1/n\) (uniform)
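The Lagrangian result can be sanity-checked numerically. A small sketch in base R: draw random PMFs on four outcomes and confirm none exceeds the uniform's entropy \(\log 4\).

```r
# Entropy in nats, dropping zero-probability outcomes (0 log 0 = 0 convention)
entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

set.seed(1)
# Random points on the simplex (not uniform on the simplex, but fine for a check)
random_entropies <- replicate(1000, {
  p <- runif(4); p <- p / sum(p)
  entropy(p)
})

max(random_entropies) <= log(4)  # TRUE: no random PMF beats the uniform
```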
2.2 Properties of Entropy
Non-negativity: \(H(X) \geq 0\), with equality iff \(X\) is deterministic
Concavity: \(H(\sum_i \lambda_i P_i) \geq \sum_i \lambda_i H(P_i)\): a mixture is more uncertain than its components
Chain rule: \(H(X, Y) = H(X) + H(Y|X)\)
Subadditivity: \(H(X, Y) \leq H(X) + H(Y)\), with equality iff \(X \perp Y\)
2.3 Conditional Entropy
\[H(Y|X) = -\sum_{x,y} P(x,y)\log P(y|x) = E_{P(X)}[H(Y|X=x)]\]
Intuitively: remaining uncertainty of \(Y\) given knowledge of \(X\).
2.4 Differential Entropy (Continuous)
\[h(X) = -\int f(x)\log f(x)\, dx\]
Caution: differential entropy is not invariant under linear transformations!
Key example: the Gaussian maximizes differential entropy for fixed variance:
\[h(N(\mu, \sigma^2)) = \frac{1}{2}\log(2\pi e \sigma^2)\]
For fixed \(\sigma^2\), no other distribution has higher differential entropy. This is the maximum entropy principle for continuous distributions.
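A quick numeric check of this claim (base R only): compare the closed form against numerical integration, and against a uniform distribution matched to the same variance.

```r
# Closed form vs numeric integral of h(N(0, sigma^2))
sigma2 <- 2
h_gauss_closed <- 0.5 * log(2 * pi * exp(1) * sigma2)
h_gauss_numeric <- integrate(function(x) {
  f <- dnorm(x, sd = sqrt(sigma2))
  ifelse(f > 0, -f * log(f), 0)  # guard against underflow in the tails
}, -Inf, Inf)$value

# Uniform on [-a, a] has variance a^2/3; match it to sigma2
a <- sqrt(3 * sigma2)
h_unif <- log(2 * a)  # differential entropy of U[-a, a]

abs(h_gauss_closed - h_gauss_numeric) < 1e-4  # TRUE
h_unif < h_gauss_closed                       # TRUE, as the principle predicts
```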
3 KL Divergence (Relative Entropy)
For two distributions \(P\) and \(Q\) over the same space:
\[D_{KL}(P \| Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} = E_P\left[\log\frac{P(X)}{Q(X)}\right]\]
For continuous distributions: \[D_{KL}(P \| Q) = \int p(x)\log\frac{p(x)}{q(x)}\, dx\]
Convention: \(0 \log(0/0) = 0\), but \(p(x) > 0\) with \(q(x) = 0\) yields \(+\infty\).
3.1 Properties KL Divergence
Non-negativity (Gibbs’ Inequality): \[D_{KL}(P \| Q) \geq 0, \quad \text{equality iff } P = Q \text{ a.e.}\]
Proof via Jensen's inequality: \(\log\) is concave, so \[-D_{KL}(P\|Q) = E_P[\log(Q/P)] \leq \log E_P[Q/P] = \log 1 = 0\]
Asymmetry: in general \(D_{KL}(P\|Q) \neq D_{KL}(Q\|P)\).
This is not a distance metric! KL divergence is an "oriented" measure of how \(P\) differs from \(Q\).
Coding-theory interpretation:
- \(H(P)\): the minimum average code length for encoding draws from \(P\) (Shannon's source coding theorem)
- \(H(P, Q) = H(P) + D_{KL}(P\|Q)\): the average code length when we use the optimal code for \(Q\) but the data come from \(P\)
- \(D_{KL}(P\|Q)\): the extra bits wasted by using the wrong distribution \(Q\)
3.2 Forward vs Reverse KL
- Forward KL \(D_{KL}(P\|Q)\): "mean-seeking"; \(Q\) must cover all of \(P\)'s mass (zero-avoiding)
- Reverse KL \(D_{KL}(Q\|P)\): "mode-seeking"; \(Q\) may focus on a mode of \(P\) and ignore the tails (zero-forcing)
This matters in variational inference: the ELBO optimizes the reverse KL \(D_{KL}(Q\|P)\).
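The mean-seeking vs mode-seeking behavior can be seen in a small sketch: fit a single Gaussian \(Q\) to a bimodal target \(P\) on a discretized grid, minimizing each direction of KL with `optim`. The target mixture and starting values are illustrative choices, not from the text.

```r
# Bimodal target P on a grid, normalized to a PMF
x <- seq(-10, 10, by = 0.01)
p <- 0.5 * dnorm(x, -3, 0.7) + 0.5 * dnorm(x, 3, 0.7)
p <- p / sum(p)

q_pmf <- function(mu, sd) { q <- dnorm(x, mu, sd); q / sum(q) }
kl <- function(a, b) sum(ifelse(a > 0, a * log(a / pmax(b, 1e-300)), 0))

fwd <- function(par) kl(p, q_pmf(par[1], exp(par[2])))                 # KL(P || Q)
rev <- function(par) { q <- q_pmf(par[1], exp(par[2])); kl(q, p) }     # KL(Q || P)

mu_fwd <- optim(c(1, 0), fwd)$par[1]  # near 0: Q spreads to cover both modes
mu_rev <- optim(c(1, 0), rev)$par[1]  # near +3: Q locks onto a single mode
```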
4 Cross-Entropy
\[H(P, Q) = -\sum_x P(x)\log Q(x) = -E_P[\log Q(X)]\]
Relationship to KL: \[H(P, Q) = H(P) + D_{KL}(P\|Q)\]
If \(P\) is the (fixed) data distribution, then: \[\arg\min_Q H(P, Q) = \arg\min_Q D_{KL}(P\|Q)\]
Minimizing cross-entropy = minimizing KL divergence!
4.1 Cross-Entropy Loss dalam Classification
For classification with \(K\) classes:
Multi-class cross-entropy: \[L = -\sum_{k=1}^K y_k \log \hat{p}_k\]
where \(y_k \in \{0,1\}\) is the one-hot true label and \(\hat{p}_k = P(Y=k|x)\) the model's predicted probability.
Binary cross-entropy: \[L = -[y \log \hat{p} + (1-y)\log(1-\hat{p})]\]
This is exactly the negative log-likelihood of a Bernoulli distribution: \(L = -\ell(\hat{p}; y)\) with \(\ell(\hat{p}; y) = y\log\hat{p} + (1-y)\log(1-\hat{p})\).
Minimizing cross-entropy loss = maximizing likelihood = minimizing KL from the empirical distribution to the model.
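A one-line check that binary cross-entropy matches the negative Bernoulli log-likelihood (labels and probabilities below are arbitrary examples):

```r
# BCE per observation, as defined above
bce <- function(y, p_hat) -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

y     <- c(1, 0, 1, 1)
p_hat <- c(0.9, 0.2, 0.6, 0.8)

# Negative Bernoulli log-likelihood via dbinom
nll <- -dbinom(y, size = 1, prob = p_hat, log = TRUE)

max(abs(bce(y, p_hat) - nll))  # 0 (up to floating point)
```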
4.2 Why Cross-Entropy, Not MSE, for Classification?
Natural loss: classification is probability estimation, and cross-entropy is directly related to the log-likelihood
Gradient behavior: cross-entropy with sigmoid/softmax gives the gradient \(\hat{y} - y\) (elegant!), while MSE with a sigmoid gives \((\hat{y}-y)\hat{y}(1-\hat{y})\), which saturates near 0/1
Information-theoretic: it measures divergence between distributions, which is semantically the right notion
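The gradient claim is easy to verify numerically. A sketch using the standard derivatives with respect to the logit \(z\) (for MSE we take the loss as \((\hat{p}-y)^2/2\) so the constants match):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# d/dz of cross-entropy loss with a sigmoid output: p_hat - y
grad_ce  <- function(z, y) sigmoid(z) - y
# d/dz of (p_hat - y)^2 / 2 with a sigmoid output: (p_hat - y) p_hat (1 - p_hat)
grad_mse <- function(z, y) { p <- sigmoid(z); (p - y) * p * (1 - p) }

# Confidently wrong prediction: y = 1 but z = -10 (p_hat near 0)
grad_ce(-10, 1)   # ~ -1      : strong corrective signal
grad_mse(-10, 1)  # ~ -4.5e-05: the gradient has vanished
```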
5 Mutual Information
\[I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\]
Equivalently: \[I(X; Y) = D_{KL}(P_{XY} \| P_X P_Y) = \sum_{x,y} P(x,y)\log\frac{P(x,y)}{P(x)P(y)}\]
Properties:
- \(I(X;Y) \geq 0\), with equality iff \(X \perp Y\)
- \(I(X;Y) = I(Y;X)\) (symmetric!)
- \(I(X;Y) \leq \min(H(X), H(Y))\)
5.1 Why Is MI Better than Correlation?
Pearson correlation measures only linear dependence: \(\rho(X,Y) = 0\) does not imply independence.
Mutual information measures any dependence, linear or nonlinear: \(I(X;Y) = 0 \iff X \perp Y\).
Example: \(Y = X^2\) with \(X\) symmetric about 0. Then \(\text{Cor}(X,Y) = 0\) but \(I(X;Y) = H(Y) > 0\).
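The \(Y = X^2\) example can be demonstrated with a simple plug-in MI estimate on binned samples (the binning scheme here is an illustrative choice, and plug-in MI estimates are biased for small samples):

```r
set.seed(42)
x <- runif(1e5, -1, 1)
y <- x^2
cor(x, y)  # near 0: no linear dependence

# Plug-in MI estimate from the empirical joint of binned (x, y)
mi_discrete <- function(x, y, bins = 10) {
  xb <- cut(x, bins); yb <- cut(y, bins)
  pxy <- table(xb, yb) / length(x)
  px <- rowSums(pxy); py <- colSums(pxy)
  sum(pxy * log(pxy / outer(px, py)), na.rm = TRUE)  # na.rm drops empty cells
}
mi_discrete(x, y)  # clearly > 0: strong nonlinear dependence
```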
5.2 Data Processing Inequality (DPI)
For a Markov chain \(X \to Y \to Z\): \[I(X; Z) \leq I(X; Y)\]
Processing cannot increase information. This is fundamental for understanding representation learning.
5.3 Normalized Mutual Information (NMI)
\[\text{NMI}(X,Y) = \frac{I(X;Y)}{\sqrt{H(X)H(Y)}} \in [0, 1]\]
Useful as a similarity measure; it is widely used in clustering evaluation.
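A minimal NMI implementation from a contingency table, as one might use for clustering evaluation (this sketch does not guard against the degenerate case \(H = 0\)):

```r
# NMI(X, Y) = I(X;Y) / sqrt(H(X) H(Y)), computed from two label vectors
nmi <- function(a, b) {
  pxy <- table(a, b) / length(a)
  px <- rowSums(pxy); py <- colSums(pxy)
  H <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
  mi <- sum(pxy * log(pxy / outer(px, py)), na.rm = TRUE)
  mi / sqrt(H(px) * H(py))
}

labels   <- c(1, 1, 1, 2, 2, 2)
clusters <- c(1, 1, 1, 2, 2, 2)  # a perfect clustering of the labels
nmi(labels, clusters)  # 1
```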
6 Log-Likelihood dan Information Theory
6.1 MLE sebagai KL Minimization
Empirical distribution: \(\hat{P}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[x_i = x]\)
MLE for a parametric model \(\{P_\theta\}\):
\[\hat{\theta}_{MLE} = \arg\max_\theta \frac{1}{n}\sum_{i=1}^n \log p_\theta(x_i) = \arg\max_\theta E_{\hat{P}_n}[\log p_\theta(X)]\]
\[= \arg\min_\theta -E_{\hat{P}_n}[\log p_\theta(X)] = \arg\min_\theta H(\hat{P}_n, P_\theta)\]
\[= \arg\min_\theta \left[ D_{KL}(\hat{P}_n \| P_\theta) + H(\hat{P}_n) \right]\]
\[= \arg\min_\theta D_{KL}(\hat{P}_n \| P_\theta)\]
(since \(H(\hat{P}_n)\) does not depend on \(\theta\))
MLE = minimizing KL from empirical distribution to model distribution.
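A grid-search sketch of this equivalence for Bernoulli data: the \(\theta\) minimizing \(D_{KL}(\hat{P}_n \| P_\theta)\) coincides with the MLE, the sample mean.

```r
set.seed(7)
x <- rbinom(500, 1, 0.3)
p_hat <- mean(x)  # MLE for a Bernoulli parameter

# KL from the empirical distribution to each candidate model on a theta grid
theta_grid <- seq(0.01, 0.99, by = 0.001)
emp <- c(1 - p_hat, p_hat)  # empirical distribution over {0, 1}
kl_to_model <- sapply(theta_grid, function(th) {
  q <- c(1 - th, th)
  sum(emp * log(emp / q))
})

theta_grid[which.min(kl_to_model)]  # equals p_hat (up to grid resolution)
```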
6.2 AIC dan Information Criteria
Akaike Information Criterion (AIC):
\[\text{AIC} = -2\ell(\hat{\theta}) + 2k\]
where \(\ell(\hat{\theta})\) is the maximized log-likelihood and \(k\) the number of parameters.
Derivation: AIC estimates the expected KL divergence \(D_{KL}(P_{true}\|P_{\hat{\theta}})\) between the true distribution and the fitted model, evaluated on new data. Akaike (1974) showed that, up to a constant that does not depend on the model,
\[E[D_{KL}(P_{true}\|P_{\hat{\theta}})] \approx -\ell(\hat{\theta}) + k\]
Bayesian Information Criterion (BIC):
\[\text{BIC} = -2\ell(\hat{\theta}) + k\log n\]
The BIC penalty is stronger for large \(n\), so BIC selects more parsimonious models.
BIC derivation: it is an approximation to the log marginal likelihood used in Bayesian model comparison.
7 Information Bottleneck
Goal: find a representation \(Z\) of \(X\) that compresses \(X\) as much as possible while preserving information about \(Y\).
Objective (tradeoff): \[\min_{P(Z|X)} I(X; Z) - \beta I(Z; Y)\]
- \(I(X;Z)\): compression (we want this small)
- \(I(Z;Y)\): prediction quality (we want this large)
- \(\beta\): tradeoff parameter
Connection to deep learning: each neural network layer is a transformation of the input. The Information Bottleneck framework suggests layers should retain \(Y\)-relevant information and compress away \(Y\)-irrelevant information.
7.1 ELBO dan VAE
A Variational Autoencoder (VAE) optimizes the Evidence Lower BOund (ELBO):
\[\text{ELBO} = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))\]
- First term: reconstruction quality (a negative cross-entropy)
- Second term: regularization via the KL from the approximate posterior to the prior
Maximizing the ELBO = minimizing the variational gap \(\log p(x) - \text{ELBO} = D_{KL}(q_\phi(z|x) \| p_\theta(z|x))\), the posterior approximation error.
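The KL term of the ELBO has a closed form for a Gaussian posterior and standard normal prior: \(D_{KL}(N(\mu,\sigma^2)\|N(0,1)) = \tfrac{1}{2}(\sigma^2 + \mu^2 - 1 - \log\sigma^2)\). A sketch checking it against numeric integration (the values of \(\mu\) and \(\sigma^2\) are arbitrary):

```r
mu <- 0.5; sigma2 <- 0.7

# Closed form used as the VAE regularizer (per latent dimension)
kl_closed <- 0.5 * (sigma2 + mu^2 - 1 - log(sigma2))

# Numeric check: integral of q(z) log(q(z)/p(z)) with p = N(0, 1)
kl_numeric <- integrate(function(z) {
  q <- dnorm(z, mu, sqrt(sigma2))
  ifelse(q > 0, q * log(q / dnorm(z)), 0)  # guard against tail underflow
}, -Inf, Inf)$value

abs(kl_closed - kl_numeric) < 1e-4  # TRUE
```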
8 Worked Example: Cross-Entropy for 3-Class Classification
library(tidyverse)
# ==========================================
# 1. Entropy Calculations
# ==========================================
# Entropy function
entropy <- function(p) {
p <- p[p > 0] # Remove zeros
-sum(p * log(p)) # nats
}
# Examples
cat("Entropy (uniform 4 classes):", entropy(rep(0.25, 4)), "\n")
cat("Entropy (certain - one class):", entropy(c(1, 0, 0, 0)), "\n")
cat("Entropy (coin flip):", entropy(c(0.5, 0.5)), "\n")
cat("Max entropy (log(4)):", log(4), "\n")
# ==========================================
# 2. KL Divergence
# ==========================================
kl_divergence <- function(p, q) {
idx <- p > 0
sum(p[idx] * log(p[idx] / q[idx]))
}
P <- c(0.4, 0.3, 0.3)
Q1 <- c(1/3, 1/3, 1/3) # uniform
Q2 <- c(0.5, 0.4, 0.1) # "wrong" model
cat("\nKL(P||Q_uniform):", kl_divergence(P, Q1), "\n")
cat("KL(P||Q_wrong):", kl_divergence(P, Q2), "\n")
cat("KL(Q_wrong||P):", kl_divergence(Q2, P), "\n") # Asymmetry!
cat("Note: KL is NOT symmetric!\n")
# ==========================================
# 3. Cross-Entropy Loss
# ==========================================
cross_entropy <- function(y_true, y_pred) {
-sum(y_true * log(y_pred + 1e-15)) # epsilon for numerical stability
}
# True label: class 2 (one-hot)
y_true <- c(0, 1, 0)
# Different predictions
pred_good <- c(0.1, 0.8, 0.1) # confident, correct
pred_ok <- c(0.2, 0.5, 0.3) # uncertain
pred_bad <- c(0.7, 0.1, 0.2) # confident, wrong
cat(sprintf("\nCE Loss (good prediction): %.4f\n", cross_entropy(y_true, pred_good)))
cat(sprintf("CE Loss (uncertain): %.4f\n", cross_entropy(y_true, pred_ok)))
cat(sprintf("CE Loss (wrong prediction): %.4f\n", cross_entropy(y_true, pred_bad)))
# Bad prediction has very high loss!
# ==========================================
# 4. Mutual Information (Discrete)
# ==========================================
# Joint distribution P(X,Y)
P_xy <- matrix(c(0.3, 0.1,
0.1, 0.5), nrow = 2, byrow = TRUE)
P_x <- rowSums(P_xy)
P_y <- colSums(P_xy)
# Compute MI
mi <- 0
for (i in 1:2) for (j in 1:2) {
if (P_xy[i,j] > 0)
mi <- mi + P_xy[i,j] * log(P_xy[i,j] / (P_x[i] * P_y[j]))
}
H_X <- entropy(P_x)
H_Y <- entropy(P_y)
H_XY <- entropy(as.vector(P_xy))
H_YX <- H_XY - H_X # H(Y|X)
cat(sprintf("\nH(X) = %.4f\n", H_X))
cat(sprintf("H(Y) = %.4f\n", H_Y))
cat(sprintf("H(X,Y) = %.4f\n", H_XY))
cat(sprintf("I(X;Y) via KL = %.4f\n", mi))
cat(sprintf("I(X;Y) = H(X) - H(X|Y) = %.4f\n", H_X - (H_XY - H_Y)))
# ==========================================
# 5. AIC vs BIC for Model Selection
# ==========================================
# Simulate data from AR(2) process
set.seed(123)
n <- 200
y <- arima.sim(model = list(ar = c(0.6, 0.2)), n = n)
# Fit different AR models
models_aic <- sapply(1:5, function(p) AIC(arima(y, order = c(p, 0, 0))))
models_bic <- sapply(1:5, function(p) BIC(arima(y, order = c(p, 0, 0))))
cat("\n--- Model Selection ---\n")
cat("AIC for AR(1) to AR(5):", round(models_aic, 2), "\n")
cat("BIC for AR(1) to AR(5):", round(models_bic, 2), "\n")
cat("Best by AIC: AR(", which.min(models_aic), ")\n")
cat("Best by BIC: AR(", which.min(models_bic), ")\n")
# True model is AR(2) - do they recover it?
Takeaway: with the wrong distribution, cross-entropy loss is large; a confident but wrong prediction is punished hardest. This heavy penalty on confidently wrong predictions is a desirable property of the log score, which, like the Brier score, is a strictly proper scoring rule.
Information theory has far-reaching connections:
ELBO in VAEs: \(\log p(x) \geq E_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x)\|p(z))\); the ELBO is a lower bound on the log marginal likelihood. Training a VAE = maximizing the ELBO.
MINE (Mutual Information Neural Estimation): a neural network estimates MI from samples using the Donsker-Varadhan representation: \(I(X;Y) = \sup_T E_{P_{XY}}[T] - \log E_{P_XP_Y}[e^T]\).
Feature selection: filter methods rank features by the MI between each feature and the target variable.
Contrastive learning: the InfoNCE loss (based on Noise-Contrastive Estimation) is a lower bound on mutual information; it is used in SimCLR and CLIP.
Bits-back coding: uses latent variable models for data compression, connecting VAEs to actual compression.
9 Practice Problems
Problem 1: Show that \(D_{KL}(P\|Q) \geq 0\) using Jensen's inequality.
\(-D_{KL}(P\|Q) = E_P\left[\log\frac{Q(X)}{P(X)}\right] \leq \log E_P\left[\frac{Q(X)}{P(X)}\right] = \log \sum_x P(x)\frac{Q(x)}{P(x)} = \log 1 = 0\)
The inequality holds by Jensen's inequality, since \(\log\) is concave. QED.
Problem 2: Prove that \(H(X,Y) = H(X) + H(Y|X)\).
\(H(X,Y) = -\sum_{x,y} P(x,y)\log P(x,y)\)
\(= -\sum_{x,y} P(x,y)[\log P(x) + \log P(y|x)]\)
\(= -\sum_x P(x)\log P(x) - \sum_{x,y} P(x,y)\log P(y|x)\)
\(= H(X) + H(Y|X)\) QED
Problem 3: Show that MLE = minimizing the KL divergence from the empirical distribution.
See the derivation above. Key step: \(E_{\hat{P}_n}[\log p_\theta(X)] = \frac{1}{n}\sum_i \log p_\theta(x_i)\), the sample average of the log-likelihood. Maximizing it = minimizing \(-E_{\hat{P}_n}[\log p_\theta(X)] = H(\hat{P}_n, P_\theta)\).
Since \(H(\hat{P}_n)\) is constant in \(\theta\) and \(H(\hat{P}_n, P_\theta) = H(\hat{P}_n) + D_{KL}(\hat{P}_n \| P_\theta)\), minimizing the cross-entropy is equivalent to minimizing \(D_{KL}(\hat{P}_n \| P_\theta)\).
Problem 4: Compute the mutual information between \(X \sim \text{Bernoulli}(0.5)\) and \(Y = X\) (perfectly correlated).
\(P(X=0,Y=0) = 0.5\), \(P(X=1,Y=1) = 0.5\), others = 0.
\(H(X) = H(Y) = \log 2\) nats \(= 1\) bit.
\(H(Y|X) = 0\) (\(Y\) is perfectly determined by \(X\)).
\(I(X;Y) = H(Y) - H(Y|X) = 1 - 0 = 1\) bit. MI equals the entropy of \(X\): the maximum possible!
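The answer can be checked numerically in base 2:

```r
# Joint PMF with P(X = Y) = 1: all mass on the diagonal
pxy <- matrix(c(0.5, 0, 0, 0.5), nrow = 2)
px <- rowSums(pxy); py <- colSums(pxy)

# MI in bits, skipping zero cells (0 log 0 convention)
mi_bits <- sum(ifelse(pxy > 0, pxy * log2(pxy / outer(px, py)), 0))
mi_bits  # 1 bit, the entropy of X
```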
Problem 5: Why can differential entropy be negative?
\(h(U[0,a]) = \log a\), which is negative when \(a < 1\). Differential entropy is not an amount of information in bits (unlike discrete entropy); it is a relative quantity and can be negative. Only differences in differential entropy carry absolute meaning.
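Numerically, with base R's `integrate`:

```r
# Differential entropy of U[0, a]: integral of -(1/a) log(1/a) over [0, a] = log(a)
h_uniform <- function(a) {
  integrate(function(x) -(1 / a) * log(1 / a), 0, a)$value
}

h_uniform(0.5)  # log(0.5), a negative differential entropy
```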