Information Theory

Entropy, KL Divergence, and Cross-Entropy Loss

ml-math
intermediate
Foundations of information theory for ML and statistics: Shannon entropy, KL divergence, cross-entropy loss as KL minimization, mutual information, and connections to MLE and model selection.

1 Why Does This Matter?

Note: Why This Matters for Your Work

Information theory is not just abstract theory; it is the precise language for talking about model training and evaluation:

  • Cross-entropy loss in neural networks IS the KL divergence from the data distribution to the model distribution
  • MLE is minimizing the KL divergence from the empirical distribution to the model
  • AIC/BIC have a clear information-theoretic interpretation
  • Mutual information measures nonlinear dependence and is more general than correlation

If you have ever wondered why we minimize cross-entropy (not MSE) for classification, or why entropy is maximized by the uniform distribution, the answers are here.


2 Entropy

2.1 Shannon Entropy (Discrete)

Important: Definition (Shannon Entropy)

For a discrete random variable \(X\) with PMF \(P(X=x)\):

\[H(X) = -\sum_{x \in \mathcal{X}} P(X=x) \log P(X=x) = -E[\log P(X)]\]

Conventions: \(0 \log 0 = 0\) (by the limit), and \(\log\) can be base 2 (bits) or natural (nats).

Interpretation:

  • Entropy measures the average information (surprise) of a single draw of \(X\)
  • Larger entropy = more uncertainty/unpredictability
  • Minimum entropy = 0 (deterministic variable)
  • Maximum entropy = \(\log|\mathcal{X}|\) (uniform distribution)

Proof of the maximum-entropy claim:

For \(\mathcal{X} = \{1, \ldots, n\}\) with the constraint \(\sum_x P(x) = 1\):

Lagrangian: \(\mathcal{L} = -\sum_x P(x)\log P(x) - \lambda(\sum_x P(x) - 1)\)

\(\partial\mathcal{L}/\partial P(x) = -\log P(x) - 1 - \lambda = 0 \Rightarrow P(x) = e^{-1-\lambda}\), which is constant in \(x\); the normalization constraint then forces \(P(x) = 1/n\) (uniform).
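
As a quick numerical sanity check (a small R sketch, not part of the proof), random PMFs on four outcomes never exceed the entropy of the uniform distribution:

# Quick check: entropy of random PMFs on 4 outcomes never exceeds log(4)
entropy_nats <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

set.seed(1)
random_pmfs <- replicate(1000, { w <- runif(4); w / sum(w) })  # 4 x 1000 matrix
max(apply(random_pmfs, 2, entropy_nats))   # strictly below log(4)
entropy_nats(rep(0.25, 4))                 # exactly log(4) ~ 1.386 nats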

2.2 Properties of Entropy

  1. Non-negative: \(H(X) \geq 0\), with equality iff \(X\) is deterministic

  2. Concavity: \(H(\sum_i \lambda_i P_i) \geq \sum_i \lambda_i H(P_i)\): a mixture is more uncertain than its components on average

  3. Chain rule: \(H(X, Y) = H(X) + H(Y|X)\)

  4. Subadditivity: \(H(X, Y) \leq H(X) + H(Y)\), equality iff \(X \perp Y\)

2.3 Conditional Entropy

\[H(Y|X) = -\sum_{x,y} P(x,y)\log P(y|x) = E_{P(X)}[H(Y|X=x)]\]

Intuitively: remaining uncertainty of \(Y\) given knowledge of \(X\).

2.4 Differential Entropy (Continuous)

\[h(X) = -\int f(x)\log f(x)\, dx\]

Caution: differential entropy is not invariant under linear transformations!

Important example: the Gaussian maximizes differential entropy for a fixed variance:

\[h(N(\mu, \sigma^2)) = \frac{1}{2}\log(2\pi e \sigma^2)\]

For fixed \(\sigma^2\), no other distribution has higher differential entropy. This is the maximum entropy principle for continuous distributions.
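
A hedged numerical sketch of this claim: Monte Carlo estimates of \(h(X) = -E[\log f(X)]\) for three unit-variance distributions, where the Gaussian should come out highest and close to \(\frac{1}{2}\log(2\pi e) \approx 1.419\) nats (the choice of distributions is illustrative):

# Monte Carlo differential entropy for three distributions with variance 1
set.seed(42)
n <- 1e6

x_norm <- rnorm(n)                                                  # N(0, 1)
x_lap  <- rexp(n) * sample(c(-1, 1), n, replace = TRUE) / sqrt(2)   # Laplace, variance 1

h_norm <- -mean(dnorm(x_norm, log = TRUE))                # ~ 0.5 * log(2*pi*e)
h_unif <- log(2 * sqrt(3))                                # exact: U(-sqrt(3), sqrt(3)), variance 1
h_lap  <- -mean(log(sqrt(2) / 2) - sqrt(2) * abs(x_lap))  # log-density of Laplace(0, 1/sqrt(2))

c(gaussian = h_norm, uniform = h_unif, laplace = h_lap,
  bound = 0.5 * log(2 * pi * exp(1)))                     # the Gaussian attains the bound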


3 KL Divergence (Relative Entropy)

Important: Definition (Kullback-Leibler Divergence)

For two distributions \(P\) and \(Q\) over the same space:

\[D_{KL}(P \| Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} = E_P\left[\log\frac{P(X)}{Q(X)}\right]\]

For continuous distributions: \[D_{KL}(P \| Q) = \int p(x)\log\frac{p(x)}{q(x)}\, dx\]

Convention: \(0 \log(0/0) = 0\), but \(p(x) > 0\) with \(q(x) = 0\) yields \(+\infty\).

3.1 Properties KL Divergence

Non-negativity (Gibbs’ Inequality): \[D_{KL}(P \| Q) \geq 0, \quad \text{equality iff } P = Q \text{ a.e.}\]

Proof using Jensen's inequality: \(\log\) is concave, so \[-D_{KL}(P\|Q) = E_P[\log(Q/P)] \leq \log E_P[Q/P] = \log 1 = 0\]

Asymmetry: \(D_{KL}(P\|Q) \neq D_{KL}(Q\|P)\) in general.

It is not a distance metric! Rather, KL divergence is an “oriented” measure of how \(P\) differs from \(Q\).

Coding-theory interpretation:

  • \(H(P)\): minimum average code length to encode a draw from \(P\) (Shannon's source coding theorem)
  • \(H(P,Q) = H(P) + D_{KL}(P\|Q)\): average code length if we use the code that is optimal for \(Q\) while the data actually come from \(P\)
  • \(D_{KL}(P\|Q)\): the extra bits wasted by using the wrong distribution \(Q\)

3.2 Forward vs Reverse KL

  • Forward KL \(D_{KL}(P\|Q)\): "mean-seeking", since \(Q\) must cover all of the mass of \(P\) (zero-avoiding)
  • Reverse KL \(D_{KL}(Q\|P)\): "mode-seeking", since \(Q\) can focus on a mode of \(P\) and ignore the tails (zero-forcing)

This matters in variational inference: the ELBO optimizes the reverse KL \(D_{KL}(Q\|P)\). A small numerical sketch of the two behaviors follows.
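
A minimal numerical sketch of the mean-seeking vs mode-seeking behavior, assuming a bimodal target \(P\) discretized on a grid and a single Gaussian \(Q\) (the grid, optimizer, and starting values are illustrative choices):

# Fit a single Gaussian q to a bimodal target p by minimizing forward vs reverse KL
x  <- seq(-10, 10, by = 0.01)
dx <- 0.01
p  <- 0.5 * dnorm(x, -3, 1) + 0.5 * dnorm(x, 3, 1)   # bimodal target
p  <- p / sum(p * dx)

kl_grid <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0)) * dx

fit <- function(direction) {
  obj <- function(par) {
    q <- dnorm(x, mean = par[1], sd = exp(par[2]))
    q <- pmax(q / sum(q * dx), 1e-300)
    if (direction == "forward") kl_grid(p, q) else kl_grid(q, p)
  }
  optim(c(1, 0), obj)$par                            # par = (mean, log sd)
}

fwd <- fit("forward"); rv <- fit("reverse")
cat("Forward KL fit: mean", round(fwd[1], 2), " sd", round(exp(fwd[2]), 2), "\n")
cat("Reverse KL fit: mean", round(rv[1], 2),  " sd", round(exp(rv[2]), 2), "\n")

The forward-KL fit should land near mean 0 with a large standard deviation (covering both modes), while the reverse-KL fit should sit on one of the modes with a standard deviation close to 1.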


4 Cross-Entropy

Important: Definition (Cross-Entropy)

\[H(P, Q) = -\sum_x P(x)\log Q(x) = -E_P[\log Q(X)]\]

Relation to KL: \[H(P, Q) = H(P) + D_{KL}(P\|Q)\]

If \(P\) is the data distribution (fixed), then: \[\arg\min_Q H(P, Q) = \arg\min_Q D_{KL}(P\|Q)\]

Minimizing cross-entropy = minimizing KL divergence!
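
A tiny numerical check of the identity \(H(P,Q) = H(P) + D_{KL}(P\|Q)\) on a small PMF (the values are arbitrary, for illustration):

# Verify H(P, Q) = H(P) + KL(P || Q)
P <- c(0.5, 0.3, 0.2)
Q <- c(0.2, 0.5, 0.3)

H_P   <- -sum(P * log(P))
CE_PQ <- -sum(P * log(Q))
KL_PQ <-  sum(P * log(P / Q))

all.equal(CE_PQ, H_P + KL_PQ)   # TRUE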

4.1 Cross-Entropy Loss dalam Classification

For classification with \(K\) classes:

Multi-class cross-entropy: \[L = -\sum_{k=1}^K y_k \log \hat{p}_k\]

where \(y_k \in \{0,1\}\) (one-hot true label) and \(\hat{p}_k = P(Y=k|x)\) (model prediction).

Binary cross-entropy: \[L = -[y \log \hat{p} + (1-y)\log(1-\hat{p})]\]

This is exactly the negative log-likelihood of a Bernoulli distribution, whose log-likelihood is \(\ell(\hat{p}; y) = y\log\hat{p} + (1-y)\log(1-\hat{p})\), so \(L = -\ell\).

Minimizing cross-entropy loss = maximizing likelihood = minimizing the KL from the empirical distribution to the model.

4.2 Why Cross-Entropy, Not MSE, for Classification?

  1. Natural loss: classification means estimating probabilities, and cross-entropy is directly related to the log-likelihood

  2. Gradient behavior: cross-entropy with sigmoid/softmax gives the gradient \(\hat{y} - y\) (elegant!). MSE with sigmoid gives \((\hat{y}-y)\hat{y}(1-\hat{y})\), which saturates near 0/1; see the numerical check after this list

  3. Information-theoretic: measures divergence between distributions — semantically appropriate
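
A short check of the gradient claim in point 2, assuming a single example with true label \(y = 1\) and the loss applied after a sigmoid (the specific logits are illustrative):

# Gradients w.r.t. the logit z for y = 1: cross-entropy vs MSE after a sigmoid
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-8, -4, 0, 4)                    # confidently wrong ... confidently right
y <- 1
p <- sigmoid(z)

grad_ce  <- p - y                       # d/dz of -[y log p + (1 - y) log(1 - p)]
grad_mse <- 2 * (p - y) * p * (1 - p)   # d/dz of (p - y)^2

round(data.frame(z, p, grad_ce, grad_mse), 5)
# At z = -8 the MSE gradient is ~0 (saturated) while the CE gradient stays near -1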


5 Mutual Information

Important: Definition (Mutual Information)

\[I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\]

Equivalently: \[I(X; Y) = D_{KL}(P_{XY} \| P_X P_Y) = \sum_{x,y} P(x,y)\log\frac{P(x,y)}{P(x)P(y)}\]

Properties:

  • \(I(X;Y) \geq 0\), with equality iff \(X \perp Y\)
  • \(I(X;Y) = I(Y;X)\) (symmetric!)
  • \(I(X;Y) \leq \min(H(X), H(Y))\)

5.1 Why Is MI Better Than Correlation?

  • Pearson correlation: measures only linear dependence. \(\rho(X,Y) = 0\) does not imply independence.

  • Mutual information: measures any dependence, linear or nonlinear. \(I(X;Y) = 0 \iff X \perp Y\).

Example: \(Y = X^2\) with \(X\) symmetric around 0. \(\text{Cor}(X,Y) = 0\) but \(I(X;Y) = H(Y) > 0\); a numerical sketch follows.
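
A quick simulation of this example (the 20-bin discretization is an arbitrary illustrative choice; the plug-in MI estimate is coarse but enough to make the point):

# Y = X^2: correlation ~ 0, mutual information clearly > 0
set.seed(7)
n <- 1e5
x <- rnorm(n)
y <- x^2

cor(x, y)                               # approximately 0

xb <- cut(x, breaks = 20)
yb <- cut(y, breaks = 20)
pxy <- table(xb, yb) / n                # empirical joint distribution
px  <- rowSums(pxy)
py  <- colSums(pxy)
sum(pxy * log(pxy / outer(px, py)), na.rm = TRUE)   # plug-in MI estimate, > 0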

5.2 Data Processing Inequality (DPI)

For a Markov chain \(X \to Y \to Z\): \[I(X; Z) \leq I(X; Y)\]

Processing cannot increase information. This is fundamental for understanding representation learning; a small exact computation is sketched below.
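
A small exact illustration, assuming each step of the chain is a binary symmetric channel with flip probability 0.1 (the numbers are illustrative):

# DPI check: X -> Y -> Z, each arrow a binary symmetric channel with flip prob 0.1
flip <- 0.1
px   <- c(0.5, 0.5)
channel <- matrix(c(1 - flip, flip,
                    flip, 1 - flip), nrow = 2, byrow = TRUE)  # rows: input, cols: output

mi_from_joint <- function(pxy) {
  marg_x <- rowSums(pxy); marg_y <- colSums(pxy)
  sum(ifelse(pxy > 0, pxy * log(pxy / outer(marg_x, marg_y)), 0))
}

p_xy <- diag(px) %*% channel               # joint P(X, Y): one noisy step
p_xz <- diag(px) %*% channel %*% channel   # joint P(X, Z): two noisy steps

cat("I(X;Y) =", round(mi_from_joint(p_xy), 4), "nats\n")
cat("I(X;Z) =", round(mi_from_joint(p_xz), 4), "nats\n")   # smaller, as DPI predicts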

5.3 Normalized Mutual Information (NMI)

\[\text{NMI}(X,Y) = \frac{I(X;Y)}{\sqrt{H(X)H(Y)}} \in [0, 1]\]

Useful as a similarity measure; widely used in clustering evaluation. A minimal implementation is sketched below.
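
A minimal plug-in NMI sketch for two discrete label vectors (the helper name nmi and the toy vectors are hypothetical, for illustration only):

# Plug-in NMI for two discrete label vectors
nmi <- function(x, y) {
  pxy <- table(x, y) / length(x)
  px <- rowSums(pxy); py <- colSums(pxy)
  H  <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
  mi <- sum(ifelse(pxy > 0, pxy * log(pxy / outer(px, py)), 0))
  mi / sqrt(H(px) * H(py))
}

# Example: clustering labels vs true classes (toy data)
nmi(c(1, 1, 2, 2, 3, 3), c("a", "a", "b", "b", "b", "c"))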


6 Log-Likelihood and Information Theory

6.1 MLE as KL Minimization

Empirical distribution: \(\hat{P}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[x_i = x]\)

The MLE for a parametric model \(\{P_\theta\}\):

\[\hat{\theta}_{MLE} = \arg\max_\theta \frac{1}{n}\sum_{i=1}^n \log p_\theta(x_i) = \arg\max_\theta E_{\hat{P}_n}[\log p_\theta(X)]\]

\[= \arg\min_\theta -E_{\hat{P}_n}[\log p_\theta(X)] = \arg\min_\theta H(\hat{P}_n, P_\theta)\]

\[= \arg\min_\theta D_{KL}(\hat{P}_n \| P_\theta) + H(\hat{P}_n)\]

\[= \arg\min_\theta D_{KL}(\hat{P}_n \| P_\theta)\]

(since \(H(\hat{P}_n)\) does not depend on \(\theta\))

MLE = minimizing the KL from the empirical distribution to the model distribution. A small numerical illustration follows.
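
A small numerical illustration, assuming Bernoulli data and a grid search over \(\theta\) (the grid resolution is arbitrary): the \(\theta\) minimizing \(D_{KL}(\hat{P}_n \| \text{Bern}(\theta))\) matches the MLE.

# MLE vs KL minimization for Bernoulli data
set.seed(1)
x <- rbinom(50, size = 1, prob = 0.3)
p_hat <- mean(x)                                 # the MLE

theta <- seq(0.01, 0.99, by = 0.001)
emp   <- c(1 - p_hat, p_hat)                     # empirical distribution over {0, 1}
kl    <- sapply(theta, function(t) {
  q <- c(1 - t, t)
  sum(ifelse(emp > 0, emp * log(emp / q), 0))
})

c(mle = p_hat, argmin_kl = theta[which.min(kl)]) # essentially identical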

6.2 AIC dan Information Criteria

Akaike Information Criterion (AIC):

\[\text{AIC} = -2\ell(\hat{\theta}) + 2k\]

where \(\ell(\hat{\theta})\) is the maximized log-likelihood and \(k\) is the number of parameters.

Derivation: AIC is an estimator of the expected KL divergence from the true distribution to the fitted model (evaluated on new data). Akaike (1974) showed that, up to an additive constant that does not depend on the model,

\[E[D_{KL}(P_{true}\|P_{\hat{\theta}})] \approx -\ell(\hat{\theta}) + k\]

Bayesian Information Criterion (BIC):

\[\text{BIC} = -2\ell(\hat{\theta}) + k\log n\]

The BIC penalty is stronger for large \(n\), so BIC selects more parsimonious models.

BIC derivation: an approximation to the log marginal likelihood (Bayesian model comparison).


7 Information Bottleneck

Important: Definition (Information Bottleneck)

Goal: find a representation \(Z\) of \(X\) that compresses \(X\) as much as possible while preserving information about \(Y\).

Objective (tradeoff): \[\min_{P(Z|X)} I(X; Z) - \beta I(Z; Y)\]

  • \(I(X;Z)\): compression (want it small)
  • \(I(Z;Y)\): prediction quality (want it large)
  • \(\beta\): tradeoff parameter

Connection to deep learning: each layer of a neural network is a transformation of the input. The Information Bottleneck framework suggests that layers should retain \(Y\)-relevant information and compress away \(Y\)-irrelevant information.

7.1 ELBO dan VAE

The Variational Autoencoder (VAE) optimizes the Evidence Lower BOund (ELBO):

\[\text{ELBO} = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))\]

  • First term: reconstruction quality (negative cross-entropy)
  • Second term: regularization via the KL from the approximate posterior to the prior

Maximizing the ELBO = minimizing the variational gap, i.e., the approximation error of the posterior; the identity below makes this precise.
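
Concretely, the gap between the log marginal likelihood and the ELBO is exactly the KL divergence from the approximate posterior to the true posterior:

\[\log p_\theta(x) = \text{ELBO}(\phi, \theta; x) + D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)\]

so for fixed \(\theta\), maximizing the ELBO over \(\phi\) minimizes \(D_{KL}(q_\phi(z|x) \| p_\theta(z|x))\).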


8 Worked Example: Cross-Entropy for 3-Class Classification

library(tidyverse)

# ==========================================
# 1. Entropy Calculations
# ==========================================

# Entropy function
entropy <- function(p) {
  p <- p[p > 0]  # Remove zeros
  -sum(p * log(p))  # nats
}

# Examples
cat("Entropy (uniform 4 classes):", entropy(rep(0.25, 4)), "\n")
cat("Entropy (certain - one class):", entropy(c(1, 0, 0, 0)), "\n")
cat("Entropy (coin flip):", entropy(c(0.5, 0.5)), "\n")
cat("Max entropy (log(4)):", log(4), "\n")

# ==========================================
# 2. KL Divergence
# ==========================================

kl_divergence <- function(p, q) {
  idx <- p > 0
  sum(p[idx] * log(p[idx] / q[idx]))
}

P <- c(0.4, 0.3, 0.3)
Q1 <- c(1/3, 1/3, 1/3)  # uniform
Q2 <- c(0.5, 0.4, 0.1)  # "wrong" model

cat("\nKL(P||Q_uniform):", kl_divergence(P, Q1), "\n")
cat("KL(P||Q_wrong):", kl_divergence(P, Q2), "\n")
cat("KL(Q_wrong||P):", kl_divergence(Q2, P), "\n")  # Asymmetry!
cat("Note: KL is NOT symmetric!\n")

# ==========================================
# 3. Cross-Entropy Loss
# ==========================================

cross_entropy <- function(y_true, y_pred) {
  -sum(y_true * log(y_pred + 1e-15))  # epsilon for numerical stability
}

# True label: class 2 (one-hot)
y_true <- c(0, 1, 0)

# Different predictions
pred_good <- c(0.1, 0.8, 0.1)   # confident, correct
pred_ok   <- c(0.2, 0.5, 0.3)   # uncertain
pred_bad  <- c(0.7, 0.1, 0.2)   # confident, wrong

cat(sprintf("\nCE Loss (good prediction): %.4f\n", cross_entropy(y_true, pred_good)))
cat(sprintf("CE Loss (uncertain):        %.4f\n", cross_entropy(y_true, pred_ok)))
cat(sprintf("CE Loss (wrong prediction): %.4f\n", cross_entropy(y_true, pred_bad)))
# Bad prediction has very high loss!

# ==========================================
# 4. Mutual Information (Discrete)
# ==========================================

# Joint distribution P(X,Y)
P_xy <- matrix(c(0.3, 0.1,
                 0.1, 0.5), nrow = 2, byrow = TRUE)

P_x <- rowSums(P_xy)
P_y <- colSums(P_xy)

# Compute MI
mi <- 0
for (i in 1:2) for (j in 1:2) {
  if (P_xy[i,j] > 0)
    mi <- mi + P_xy[i,j] * log(P_xy[i,j] / (P_x[i] * P_y[j]))
}

H_X <- entropy(P_x)
H_Y <- entropy(P_y)
H_XY <- entropy(as.vector(P_xy))
H_YX <- H_XY - H_X  # H(Y|X)

cat(sprintf("\nH(X) = %.4f\n", H_X))
cat(sprintf("H(Y) = %.4f\n", H_Y))
cat(sprintf("H(X,Y) = %.4f\n", H_XY))
cat(sprintf("I(X;Y) via KL = %.4f\n", mi))
cat(sprintf("I(X;Y) = H(X) - H(X|Y) = %.4f\n", H_X - (H_XY - H_Y)))

# ==========================================
# 5. AIC vs BIC for Model Selection
# ==========================================

# Simulate data from AR(2) process
set.seed(123)
n <- 200
y <- arima.sim(model = list(ar = c(0.6, 0.2)), n = n)

# Fit different AR models
models_aic <- sapply(1:5, function(p) AIC(arima(y, order = c(p, 0, 0))))
models_bic <- sapply(1:5, function(p) BIC(arima(y, order = c(p, 0, 0))))

cat("\n--- Model Selection ---\n")
cat("AIC for AR(1) to AR(5):", round(models_aic, 2), "\n")
cat("BIC for AR(1) to AR(5):", round(models_bic, 2), "\n")
cat("Best by AIC: AR(", which.min(models_aic), ")\n")
cat("Best by BIC: AR(", which.min(models_bic), ")\n")
# True model is AR(2) - do they recover it?

Takeaway: with the wrong distribution, the cross-entropy loss is large. A confident but wrong prediction is penalized very heavily; this is a desirable property of the log score, a strictly proper scoring rule (analogous in spirit to the Brier score).


Caution: Connections to ELBO, MINE, and Feature Selection

Information theory has broad connections:

  • ELBO in VAEs: \(\log p(x) \geq E_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x)\|p(z))\); the ELBO is a lower bound on the log marginal likelihood. Training a VAE = maximizing the ELBO.

  • MINE (Mutual Information Neural Estimation): a neural network estimates MI from samples using the Donsker-Varadhan representation: \(I(X;Y) = \sup_T E_{P_{XY}}[T] - \log E_{P_XP_Y}[e^T]\).

  • Feature selection: filter methods use the MI between each feature and the target variable to rank features.

  • Contrastive learning: the InfoNCE loss (based on Noise-Contrastive Estimation) is a lower bound on mutual information; used in SimCLR and CLIP.

  • Bits-back coding: uses latent variable models for data compression; connects VAEs to actual compression.


9 Practice Problems

Problem 1: Show that \(D_{KL}(P\|Q) \geq 0\) using Jensen's inequality.

\(-D_{KL}(P\|Q) = E_P\left[\log\frac{Q(X)}{P(X)}\right] \leq \log E_P\left[\frac{Q(X)}{P(X)}\right] = \log \sum_x P(x)\frac{Q(x)}{P(x)} = \log 1 = 0\)

The inequality holds because \(\log\) is concave (Jensen's inequality). QED.

Problem 2: Prove that \(H(X,Y) = H(X) + H(Y|X)\).

\(H(X,Y) = -\sum_{x,y} P(x,y)\log P(x,y)\)

\(= -\sum_{x,y} P(x,y)[\log P(x) + \log P(y|x)]\)

\(= -\sum_x P(x)\log P(x) - \sum_{x,y} P(x,y)\log P(y|x)\)

\(= H(X) + H(Y|X)\) QED

Problem 3: Show that MLE = minimizing the KL divergence from the empirical distribution.

See the derivation above. Key: \(E_{\hat{P}_n}[\log p_\theta(X)] = \frac{1}{n}\sum_i \log p_\theta(x_i)\), which is the sample average of the log-likelihood. Maximizing this = minimizing \(-E_{\hat{P}_n}[\log p_\theta] = H(\hat{P}_n, P_\theta)\).

Since \(H(\hat{P}_n)\) is constant in \(\theta\) and \(H(\hat{P}_n, P_\theta) = H(\hat{P}_n) + D_{KL}(\hat{P}_n \| P_\theta)\), minimizing the cross-entropy is equivalent to minimizing \(D_{KL}(\hat{P}_n \| P_\theta)\).

Problem 4: Compute the mutual information between \(X \sim \text{Bernoulli}(0.5)\) and \(Y = X\) (perfectly correlated).

\(P(X=0,Y=0) = 0.5\), \(P(X=1,Y=1) = 0.5\), others = 0.

\(H(X) = H(Y) = \log 2 = 1\) bit.

\(H(Y|X) = 0\) (Y perfectly determined by X).

\(I(X;Y) = H(Y) - H(Y|X) = 1 - 0 = 1\) bit. MI = Entropy of X — maximum possible!

Problem 5: Why can differential entropy be negative?

\(h(U[0,a]) = \log a\), which is negative when \(a < 1\). Differential entropy is not an amount of information in bits (unlike discrete entropy); it is a relative quantity and can be negative. Only differences in differential entropy have an absolute meaning.


Navigation: ← Kernel Methods & SVM | PCA & Dimensionality Reduction →