Bayesian Inference
Prior + Data = Posterior
1 Why Does This Matter?
The Bayesian approach provides a coherent framework for learning from data. Where the frequentist asks "if \(H_0\) is true, how extreme is our data?", the Bayesian asks "given the data, what do we believe about the parameter?"
Why should a researcher know Bayesian statistics in 2025?
MCMC methods are standard in advanced econometrics (Bayesian VAR), epidemiology, and ML. Stan, JAGS, PyMC — all of them implement Bayesian computation.
Regularization = Bayesian: Ridge regression is Bayesian linear regression with a normal prior. LASSO is Bayesian with a Laplace prior. Understanding Bayesian inference means understanding regularization.
Small samples: frequentist asymptotic theory needs large \(n\). Bayesian inference with informative priors can give sensible answers with small \(n\).
Uncertainty quantification: posterior credible intervals have an intuitive interpretation: "given this data, there is 95% probability that \(\theta\) lies in [a, b]" — which is exactly what people want when they misinterpret a frequentist CI!
Decision theory: Bayesian framework naturally connects to decision theory and optimal decisions under uncertainty.
2 Bayes' Theorem for Inference
Let \(y = (y_1,\ldots,y_n)\) be the data and \(\theta\) the parameter (scalar or vector). The posterior distribution is:
\[\underbrace{\pi(\theta|y)}_{\text{posterior}} = \frac{\underbrace{f(y|\theta)}_{\text{likelihood}} \cdot \underbrace{\pi(\theta)}_{\text{prior}}}{\underbrace{f(y)}_{\text{marginal likelihood}}}\]
Or, since \(f(y)\) is a constant (it does not depend on \(\theta\)):
\[\pi(\theta|y) \propto f(y|\theta) \cdot \pi(\theta)\]
“Posterior proportional to likelihood times prior”
Three components:
- Prior \(\pi(\theta)\): beliefs about \(\theta\) before seeing the data
- Likelihood \(f(y|\theta)\): probability of the data given \(\theta\) (the same object as in MLE!)
- Posterior \(\pi(\theta|y)\): updated beliefs after seeing the data
2.1 Learning Process
Bayesian inference is a learning process: prior + data → posterior. The posterior can become the prior for the next observation (sequential updating).
Intuitive example: toss a coin and estimate \(p = P(\text{head})\).
- Prior: \(p \sim \text{Beta}(1,1) = \text{Uniform}(0,1)\) — "I know nothing yet"
- Likelihood: \(f(k \text{ heads} \mid n \text{ tosses}, p) = \binom{n}{k}p^k(1-p)^{n-k}\)
- Posterior: \(p \mid \text{data} \sim \text{Beta}(1+k, 1+n-k)\) — "after seeing \(k\) heads in \(n\) tosses"
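The pseudo-count updating above can be checked directly in R; the toss sequence below is a hypothetical example, not data from the text:

```r
# Sequential Beta-Binomial updating: the posterior after each toss
# becomes the prior for the next toss.
tosses <- c(1, 0, 1, 1, 0)   # 1 = head, 0 = tail (hypothetical data)
a <- 1; b <- 1               # Beta(1,1) = Uniform(0,1) prior
for (x in tosses) {
  a <- a + x; b <- b + (1 - x)
}
# Same result as the batch update Beta(1 + k, 1 + n - k)
k <- sum(tosses); n <- length(tosses)
c(a, b)                       # 4 3
c(1 + k, 1 + n - k)           # 4 3 -- identical
```

Sequential and batch updating agree exactly, which is what makes the "posterior becomes the next prior" view coherent.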
3 Conjugate Priors
A prior \(\pi(\theta)\) is conjugate for the likelihood \(f(y|\theta)\) if the posterior \(\pi(\theta|y)\) belongs to the same distributional family as the prior.
Advantage: the posterior is available in closed form — no numerical integration needed!
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Binomial(\(n\), \(p\)) | Beta(\(\alpha\), \(\beta\)) | Beta(\(\alpha+k\), \(\beta+n-k\)) |
| Normal(\(\mu\), \(\sigma^2\) known) | Normal(\(\mu_0\), \(\tau^2\)) | Normal(\(\mu_n\), \(\tau_n^2\)) |
| Poisson(\(\lambda\)) | Gamma(\(\alpha\), \(\beta\)) | Gamma(\(\alpha+\sum y_i\), \(\beta+n\)) |
| Exponential(\(\lambda\)) | Gamma(\(\alpha\), \(\beta\)) | Gamma(\(\alpha+n\), \(\beta+\sum y_i\)) |
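As a sanity check on the table, the Poisson-Gamma row can be verified numerically by comparing the closed-form posterior against a grid approximation of likelihood × prior (the data and prior values here are illustrative):

```r
set.seed(7)
y <- rpois(20, lambda = 3)            # simulated Poisson data
a0 <- 2; b0 <- 1                      # Gamma(shape, rate) prior (assumed values)
a_post <- a0 + sum(y); b_post <- b0 + length(y)

# Grid approximation of the unnormalized posterior: likelihood x prior
lam <- seq(0.01, 10, length.out = 2000)
unnorm <- sapply(lam, function(l) prod(dpois(y, l)) * dgamma(l, a0, b0))
grid_post <- unnorm / sum(unnorm) / diff(lam)[1]   # normalize to a density

# Should match the closed-form Gamma(a0 + sum(y), b0 + n) density
max(abs(grid_post - dgamma(lam, a_post, b_post)))  # small numerical error
```

The grid version needs no conjugacy at all, which is why it generalizes (at a computational cost) while the conjugate form does not.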
3.1 Beta-Binomial Conjugacy (Derivation)
Prior: \(p \sim \text{Beta}(\alpha, \beta)\), \(\pi(p) \propto p^{\alpha-1}(1-p)^{\beta-1}\)
Likelihood: \(f(k|p) \propto p^k(1-p)^{n-k}\)
Posterior: \[\pi(p|k) \propto p^k(1-p)^{n-k} \cdot p^{\alpha-1}(1-p)^{\beta-1} = p^{(\alpha+k)-1}(1-p)^{(\beta+n-k)-1}\]
This is Beta(\(\alpha+k\), \(\beta+n-k\)).
Interpretation: the prior parameters \(\alpha\) and \(\beta\) act like "pseudo-counts" — \(\alpha\) pseudo-successes and \(\beta\) pseudo-failures before seeing the data.
3.2 Normal-Normal Conjugacy
Prior: \(\mu \sim N(\mu_0, \tau^2)\) — prior mean \(\mu_0\) with precision \(1/\tau^2\)
Likelihood: \(y_i \sim N(\mu, \sigma^2)\) with \(\sigma^2\) known
Posterior mean: \[\mu_n = \frac{\tau^{-2}\mu_0 + n\sigma^{-2}\bar{y}}{\tau^{-2} + n\sigma^{-2}} = \frac{\text{prior precision} \times \text{prior mean} + \text{data precision} \times \text{data mean}}{\text{total precision}}\]
Posterior variance: \[\tau_n^2 = \frac{1}{\tau^{-2} + n\sigma^{-2}}\]
Interpretation: the posterior mean is a weighted average of the prior mean and the data mean, with weights proportional to the precisions.
4 Bayesian Point Estimates and Credible Intervals
From the posterior \(\pi(\theta|y)\):
- Posterior mean: \(\hat{\theta}_{PM} = E[\theta|y] = \int \theta \pi(\theta|y)\,d\theta\) — optimal under squared loss
- MAP (Maximum A Posteriori): \(\hat{\theta}_{MAP} = \arg\max_\theta \pi(\theta|y)\) — optimal under 0-1 loss
- Posterior median: optimal under absolute loss
Posterior (credible) interval: \([\theta_L, \theta_U]\) such that \(P(\theta_L \leq \theta \leq \theta_U | y) = 1-\alpha\)
Credible interval vs Confidence interval:
| | Confidence Interval | Credible Interval |
|---|---|---|
| Framework | Frequentist | Bayesian |
| Randomness | Interval is random, \(\theta\) fixed | \(\theta\) is random, data fixed |
| Interpretation | 95% of such intervals contain \(\theta\) | Given data, 95% prob \(\theta\) in interval |
| Dependence on prior | No | Yes |
| Direct probability statement about \(\theta\) | No (though often misread as one) | Yes |
5 Bayesian Linear Regression
Model: \(y = X\beta + \varepsilon\), \(\varepsilon \sim N(0, \sigma^2 I)\)
Prior on \(\beta\): \(\beta \sim N(\beta_0, \Sigma_0)\)
Posterior (with \(\sigma^2\) known): \[\beta | y \sim N(\beta_n, \Sigma_n)\]
where: \[\Sigma_n = (\Sigma_0^{-1} + \sigma^{-2}X^TX)^{-1}\] \[\beta_n = \Sigma_n(\Sigma_0^{-1}\beta_0 + \sigma^{-2}X^Ty)\]
The posterior mean is a weighted combination of the prior mean and the OLS estimate.
5.1 Connection to Ridge Regression
Special case: Prior \(\beta \sim N(0, \frac{\sigma^2}{\lambda}I)\) (independent, zero-mean normal prior):
Posterior mode (MAP): \[\hat{\beta}_{MAP} = \arg\max \left[-\frac{1}{2\sigma^2}||y-X\beta||^2 - \frac{\lambda}{2\sigma^2}||\beta||^2\right]\] \[= \arg\min ||y-X\beta||^2 + \lambda||\beta||^2\] \[= (X^TX + \lambda I)^{-1}X^Ty = \hat{\beta}_{Ridge}\]
Ridge regression IS MAP estimation under a normal prior! The regularization parameter \(\lambda\) controls the strength of the prior — larger \(\lambda\) means a stronger prior toward zero, hence more shrinkage.
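A quick numerical check of this equivalence, using simulated data: plugging \(\beta_0 = 0\) and \(\Sigma_0 = (\sigma^2/\lambda) I\) into the posterior formulas from the previous subsection reproduces the Ridge closed form exactly:

```r
set.seed(123)
n <- 40; p <- 5; sigma <- 1; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n, sd = sigma)

# Posterior mean with beta0 = 0 and Sigma0 = (sigma^2 / lambda) * I
Sigma0_inv <- (lambda / sigma^2) * diag(p)
Sigma_n <- solve(Sigma0_inv + crossprod(X) / sigma^2)
beta_n <- Sigma_n %*% (crossprod(X, y) / sigma^2)

# Ridge closed form
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

max(abs(beta_n - beta_ridge))   # agrees to machine precision
```

Algebraically, \(\Sigma_n = \sigma^2(X^TX + \lambda I)^{-1}\), so the \(\sigma^2\) factors cancel and the posterior mean is exactly \((X^TX + \lambda I)^{-1}X^Ty\).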
5.2 LASSO as Bayesian
Similarly, LASSO corresponds to Laplace (double-exponential) prior on \(\beta\): \[\pi(\beta) \propto \exp(-\lambda|\beta|)\]
The MAP estimate: \[\hat{\beta}_{LASSO} = \arg\min ||y-X\beta||^2 + \lambda||\beta||_1\]
6 MCMC: Sampling from Posteriors
For complex models without conjugate priors, we need numerical methods to sample from the posterior.
Goal: sample from an intractable posterior \(\pi(\theta|y)\).
Metropolis-Hastings Algorithm:
1. Initialize \(\theta^{(0)}\)
2. For \(t = 1, \ldots, T\):
   a. Propose \(\theta^* \sim q(\theta|\theta^{(t-1)})\) (proposal distribution)
   b. Compute the acceptance ratio: \(r = \frac{\pi(\theta^*|y)\,q(\theta^{(t-1)}|\theta^*)}{\pi(\theta^{(t-1)}|y)\,q(\theta^*|\theta^{(t-1)})}\)
   c. Accept: set \(\theta^{(t)} = \theta^*\) with probability \(\min(1, r)\); otherwise set \(\theta^{(t)} = \theta^{(t-1)}\)
3. After burn-in, \(\{\theta^{(t)}\}\) are (dependent) samples from \(\pi(\theta|y)\)
Gibbs Sampling (special case): Sample each parameter from its full conditional distribution given all others. Requires knowing \(\pi(\theta_j | \theta_{-j}, y)\) for each \(j\).
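A minimal Gibbs sketch for the normal model with both \(\mu\) and \(\sigma^2\) unknown, assuming priors \(\mu \sim N(\mu_0, \tau_0^2)\) and \(\sigma^2 \sim \text{Inv-Gamma}(a_0, b_0)\) (the specific prior and data values are illustrative): both full conditionals are available in closed form, so each update is a direct draw:

```r
set.seed(1)
y <- rnorm(50, mean = 5, sd = 2)        # simulated data
n <- length(y); y_bar <- mean(y)
mu0 <- 0; tau02 <- 100                  # prior: mu ~ N(0, 100)       (assumed)
a0 <- 2; b0 <- 2                        # prior: sigma2 ~ Inv-Gamma(2, 2) (assumed)

n_iter <- 5000; burn <- 1000
mu_s <- numeric(n_iter); s2_s <- numeric(n_iter)
mu <- y_bar; s2 <- var(y)               # initialize at data-based values
for (t in 1:n_iter) {
  # Full conditional of mu: Normal (precision-weighted, as in Normal-Normal)
  prec <- 1/tau02 + n/s2
  mu   <- rnorm(1, (mu0/tau02 + n*y_bar/s2)/prec, sqrt(1/prec))
  # Full conditional of sigma2: Inverse-Gamma(a0 + n/2, b0 + SS/2)
  s2   <- 1/rgamma(1, shape = a0 + n/2, rate = b0 + 0.5*sum((y - mu)^2))
  mu_s[t] <- mu; s2_s[t] <- s2
}
mean(mu_s[-(1:burn)])   # close to the true mean 5
mean(s2_s[-(1:burn)])   # close to the true variance 4
```

No tuning is needed here (unlike Metropolis-Hastings), precisely because each conditional is a known distribution.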
Modern MCMC: Hamiltonian Monte Carlo (HMC), as used in Stan, exploits gradient information for much more efficient sampling in high dimensions.
7 Worked Example: Beta-Binomial Bayesian Inference
Problem: 12 successes in 30 trials. Estimate \(p\) under various priors.
Case 1: Uninformative prior \(p \sim \text{Beta}(1,1) = \text{Uniform}(0,1)\)
Posterior: \(p | \text{data} \sim \text{Beta}(1+12, 1+18) = \text{Beta}(13, 19)\)
Case 2: Informative prior \(p \sim \text{Beta}(5, 10)\) (prior belief: \(p \approx 0.33\))
Posterior: \(p | \text{data} \sim \text{Beta}(5+12, 10+18) = \text{Beta}(17, 28)\)
Case 3: Strong informative prior \(p \sim \text{Beta}(50, 100)\) (very confident: \(p \approx 0.33\))
Posterior: \(p | \text{data} \sim \text{Beta}(50+12, 100+18) = \text{Beta}(62, 118)\)
# Data
n_trials <- 30; k_success <- 12
p_obs <- k_success/n_trials # MLE = 0.4
# Three priors
priors <- list(
uninformative = c(alpha=1, beta=1),
weakly_inform = c(alpha=5, beta=10),
strong_inform = c(alpha=50, beta=100)
)
# Compute posteriors
p_seq <- seq(0, 1, length.out=1000)
par(mfrow=c(1,3))
for(pr_name in names(priors)) {
a_prior <- priors[[pr_name]]["alpha"]
b_prior <- priors[[pr_name]]["beta"]
# Posterior parameters
a_post <- a_prior + k_success
b_post <- b_prior + (n_trials - k_success)
# Point estimates
post_mean <- a_post / (a_post + b_post)
post_mode <- (a_post-1) / (a_post+b_post-2) # MAP
post_median <- qbeta(0.5, a_post, b_post)
# 95% Credible Interval
ci_95 <- qbeta(c(0.025, 0.975), a_post, b_post)
# Plot prior and posterior
plot(p_seq, dbeta(p_seq, a_prior, b_prior), type="l",
col="blue", lwd=2, lty=2,
main=pr_name, xlab="p", ylab="Density",
ylim=c(0, max(dbeta(p_seq, a_post, b_post))*1.1))
lines(p_seq, dbeta(p_seq, a_post, b_post),
col="red", lwd=2)
abline(v=p_obs, col="green", lwd=2, lty=3) # MLE
abline(v=post_mean, col="red", lwd=1, lty=4) # Posterior mean
legend("topright", c("Prior", "Posterior", "MLE", "Post. mean"),
col=c("blue","red","green","red"),
lwd=c(2,2,2,1), lty=c(2,1,3,4), cex=0.7)
cat(sprintf("\n=== %s ===\n", pr_name))
cat(sprintf("Prior: Beta(%.0f, %.0f), E[p] = %.3f\n",
a_prior, b_prior, a_prior/(a_prior+b_prior)))
cat(sprintf("Posterior: Beta(%.0f, %.0f)\n", a_post, b_post))
cat(sprintf("Post mean: %.4f, MAP: %.4f, MLE: %.4f\n",
post_mean, post_mode, p_obs))
cat(sprintf("95%% CI: [%.4f, %.4f]\n", ci_95[1], ci_95[2]))
}

Insight: with an uninformative prior, the posterior mean ≈ MLE. With a strong informative prior, the posterior mean is pulled toward the prior. The effect of the prior diminishes as \(n\) increases — "data overwhelms the prior" asymptotically.
8 Bayesian vs Frequentist: Comparison
| Aspect | Frequentist | Bayesian |
|---|---|---|
| \(\theta\) | Fixed unknown constant | Random variable with distribution |
| Inference | Long-run frequency properties | Posterior probability statements |
| Prior | None | Required (explicit assumptions) |
| CI/CrI | 95% of such intervals contain \(\theta\) | P(\(\theta\) in interval given data) = 95% |
| p-value | P(data at least as extreme given \(H_0\)) | Not used; compute P(\(H_0\) given data) instead |
| Sample size | Needs large \(n\) for asymptotics | Works for any \(n\) given a good prior |
| Computation | Often analytical | Often requires MCMC |
| Strength | Well-understood, widely used | Intuitive, handles complex models |
Practical recommendation: learn both paradigms. In most applied econometrics:
- Default: frequentist (OLS, MLE, standard hypothesis tests)
- Complex models, small samples, or cases where priors matter: consider Bayesian
- ML regularization: the Bayesian perspective illuminates what regularization is doing
9 R Code: Bayesian Analysis
# ============================================================
# MANUAL BAYESIAN: Normal-Normal Conjugate
# ============================================================
# Prior: mu ~ N(mu0, tau0^2)
# Likelihood: yi ~ N(mu, sigma^2) with sigma^2 known
# Posterior: mu | y ~ N(mu_n, tau_n^2)
bayes_normal <- function(data, mu0, tau0, sigma) {
n <- length(data); y_bar <- mean(data)
sigma2 <- sigma^2; tau02 <- tau0^2
# Posterior parameters
tau_n2 <- 1 / (1/tau02 + n/sigma2)
mu_n <- tau_n2 * (mu0/tau02 + n*y_bar/sigma2)
return(list(mu_n=mu_n, tau_n=sqrt(tau_n2)))
}
set.seed(2024)
true_mu <- 5; sigma <- 2; n <- 30
y <- rnorm(n, true_mu, sigma)
# Different priors
scenarios <- list(
uninformative = c(mu0=0, tau0=100),
weakly_inform = c(mu0=4, tau0=2),
strong_inform = c(mu0=4, tau0=0.5)
)
cat("=== NORMAL-NORMAL BAYESIAN ANALYSIS ===\n")
cat(sprintf("True mu = %g, n = %d, y_bar = %.4f\n\n",
true_mu, n, mean(y)))
for(sc in names(scenarios)) {
post <- bayes_normal(y, scenarios[[sc]]["mu0"],
scenarios[[sc]]["tau0"], sigma)
ci_95 <- post$mu_n + c(-1,1) * 1.96 * post$tau_n
cat(sprintf("--- %s ---\n", sc))
cat(sprintf("Prior: N(%.1f, %.1f^2)\n",
scenarios[[sc]]["mu0"], scenarios[[sc]]["tau0"]))
cat(sprintf("Post mean: %.4f, Post SD: %.4f\n", post$mu_n, post$tau_n))
cat(sprintf("95%% CrI: [%.4f, %.4f]\n\n", ci_95[1], ci_95[2]))
}
# ============================================================
# RIDGE AS BAYESIAN: Compare with regular OLS
# ============================================================
set.seed(42)
n <- 50; p <- 10 # Many predictors relative to n
X <- matrix(rnorm(n*p), n, p)
beta_true <- c(1, -0.5, 2, 0, 0, 0, 0, 0, 0, 0) # Sparse
y <- X %*% beta_true + rnorm(n, sd=2)
# OLS (might overfit with n=50, p=10)
ols <- lm(y ~ X - 1) # No intercept for simplicity
# Ridge regression (via glmnet)
if(!require(glmnet)) install.packages("glmnet")
library(glmnet)
ridge <- glmnet(X, y, alpha=0, lambda=0.5) # alpha=0 for Ridge
cat("=== OLS vs Ridge (Bayesian MAP with Normal Prior) ===\n")
cat(sprintf("%-8s %-12s %-12s %-12s\n", "Param", "True", "OLS", "Ridge"))
for(j in 1:p) {
cat(sprintf("%-8d %-12.4f %-12.4f %-12.4f\n",
j, beta_true[j], coef(ols)[j],
as.numeric(coef(ridge)[-1])[j]))
}
# Prediction comparison (in-sample fitted values)
y_pred_ols <- fitted(ols)  # predict.lm has no newx argument; use fitted values
y_pred_ridge <- as.numeric(predict(ridge, newx=X))  # includes glmnet's intercept
cat(sprintf("\nOLS MSE (in-sample): %.4f\n", mean((y-y_pred_ols)^2)))
cat(sprintf("Ridge MSE (in-sample): %.4f\n", mean((y-y_pred_ridge)^2)))
# ============================================================
# SIMPLE MCMC: Metropolis-Hastings for Beta distribution
# ============================================================
metropolis_hastings <- function(n_iter, start, log_posterior, sigma_prop=0.1) {
samples <- numeric(n_iter)
samples[1] <- start
accept_count <- 0
for(i in 2:n_iter) {
# Proposal
prop <- rnorm(1, samples[i-1], sigma_prop)
if(prop <= 0 || prop >= 1) {
samples[i] <- samples[i-1]
next
}
# Log acceptance ratio
log_r <- log_posterior(prop) - log_posterior(samples[i-1])
# Accept/reject
if(log(runif(1)) < log_r) {
samples[i] <- prop
accept_count <- accept_count + 1
} else {
samples[i] <- samples[i-1]
}
}
cat(sprintf("Acceptance rate: %.2f%%\n", 100*accept_count/n_iter))
return(samples)
}
# Target: Beta(13, 19) posterior from Beta-Binomial example
log_post <- function(p) dbeta(p, 13, 19, log=TRUE)
set.seed(42)
n_iter <- 10000; burn_in <- 1000
mcmc_samples <- metropolis_hastings(n_iter, start=0.4, log_posterior=log_post)
mcmc_clean <- mcmc_samples[(burn_in+1):n_iter]
# Compare MCMC samples to exact distribution
hist(mcmc_clean, probability=TRUE, breaks=50,
main="MCMC vs Exact Beta(13,19) Posterior",
xlab="p")
curve(dbeta(x, 13, 19), add=TRUE, col="red", lwd=2)
cat(sprintf("\nMCMC mean: %.4f (exact: %.4f)\n",
mean(mcmc_clean), 13/(13+19)))
cat(sprintf("MCMC 95%% CrI: [%.4f, %.4f]\n",
quantile(mcmc_clean, 0.025),
quantile(mcmc_clean, 0.975)))
cat(sprintf("Exact 95%% CrI: [%.4f, %.4f]\n",
qbeta(0.025, 13, 19), qbeta(0.975, 13, 19))

10 Connections to ML and Econometrics
Ridge = Bayesian + Normal prior: \(\hat{\beta}_{Ridge} = \arg\min ||y-X\beta||^2 + \lambda||\beta||^2\). This is MAP estimation under \(\beta \sim N(0, \sigma^2/\lambda \cdot I)\).
LASSO = Bayesian + Laplace prior: the prior \(\pi(\beta_j) \propto e^{-\lambda|\beta_j|/\sigma}\) gives LASSO. The Laplace prior encourages sparsity because of its sharp peak (non-differentiable point) at zero; its heavier-than-normal tails also shrink large coefficients less.
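The sparsity mechanism is easiest to see in the orthonormal-design case (\(X^TX = I\)), where both estimators have closed forms: Ridge rescales every OLS coefficient, while LASSO soft-thresholds, zeroing any coefficient with \(|\hat{\beta}_{OLS,j}| \leq \lambda/2\). A small sketch with hypothetical coefficients:

```r
# Orthonormal design (X'X = I): closed forms for Ridge and LASSO.
# Ridge shrinks every coefficient proportionally; LASSO sets small ones to 0.
lambda <- 1
b_ols   <- c(2.0, 0.8, 0.3, -0.1)   # hypothetical OLS coefficients
b_ridge <- b_ols / (1 + lambda)                          # proportional shrinkage
b_lasso <- sign(b_ols) * pmax(abs(b_ols) - lambda/2, 0)  # soft threshold
rbind(b_ols, b_ridge, b_lasso)
# LASSO zeroes the coefficients with |b| <= lambda/2 = 0.5; Ridge never does.
```

This is the exact-sparsity property that makes LASSO useful for variable selection, and it comes directly from the Laplace prior's kink at zero.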
Gaussian processes: Nonparametric Bayesian regression. Prior over functions. Used in kriging (spatial interpolation) and machine learning.
Bayesian VAR: a standard tool in macroeconometrics (Minnesota prior). Useful for forecasting with many variables; it shrinks coefficients toward a random walk.
Stan ecosystem: rstanarm gives Bayesian versions of standard regression models with the same syntax as lm()/glm(); brms extends this to complex mixed models.
MCMC diagnostics: the \(\hat{R}\) (R-hat) statistic for chain convergence, effective sample size, and trace plots are all standard practice when reporting Bayesian results.
11 Practice Problems
Problem 1: Beta-Binomial sequential updating.
Start with the prior \(\text{Beta}(2, 2)\). Observations arrive one at a time: H, T, H, H, H, T, H (7 tosses, 5 heads).
- Update the posterior after each observation
- Plot the prior, all intermediate posteriors, and the final posterior
- What is the posterior mean after all 7 tosses?
Answer: final posterior = Beta(2+5, 2+2) = Beta(7, 4). Mean = 7/11 ≈ 0.636.
Problem 2: Ridge as Bayesian MAP.
Show mathematically that the Ridge regression estimate: \[\hat{\beta}_{Ridge} = (X^TX + \lambda I)^{-1}X^Ty\]
is the MAP estimate if \(\beta \sim N(0, \sigma^2/\lambda \cdot I_p)\) and \(y|X,\beta \sim N(X\beta, \sigma^2 I_n)\).
Hint: maximize \(\log\pi(\beta|y) = \log f(y|\beta) + \log\pi(\beta)\) with respect to \(\beta\).
Problem 3: Posterior predictive distribution.
In the Normal-Normal model: \(\mu | y \sim N(\mu_n, \tau_n^2)\) and \(y_{new}|\mu \sim N(\mu, \sigma^2)\).
Derive the posterior predictive distribution \(p(y_{new}|y)\) (marginalizing over \(\mu\)).
Hint: write \(y_{new} = \mu + \varepsilon\) with \(\varepsilon \sim N(0, \sigma^2)\) independent of \(\mu | y\); a sum of independent normals is normal.
Answer: \(y_{new}|y \sim N(\mu_n, \sigma^2 + \tau_n^2)\). Note: more variable than the likelihood alone, because it adds the uncertainty from the posterior of \(\mu\).
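The variance decomposition in this answer can be confirmed by simulation; the posterior and likelihood values below are illustrative, not from the text:

```r
set.seed(99)
mu_n <- 4.8; tau_n2 <- 0.2; sigma2 <- 4   # illustrative posterior / likelihood values
# Simulate the posterior predictive: draw mu | y, then y_new | mu
mu_draws <- rnorm(1e5, mu_n, sqrt(tau_n2))
y_new <- rnorm(1e5, mu_draws, sqrt(sigma2))
c(mean(y_new), var(y_new))   # approx (mu_n, sigma2 + tau_n2) = (4.8, 4.2)
```

The two-stage draw is exactly the marginalization over \(\mu\), so the sample variance recovers \(\sigma^2 + \tau_n^2\) rather than \(\sigma^2\) alone.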
Problem 4: Bayesian hypothesis testing.
Contrast the Bayesian approach with the frequentist one for testing \(H_0: p = 0.5\) vs \(H_1: p \neq 0.5\) on coin flip data (20 heads out of 30).
- Frequentist: compute p-value
- Bayesian: with \(\text{Beta}(1,1)\) prior, compute \(P(p < 0.5 | \text{data})\)
- Compute Bayes Factor (ratio of marginal likelihoods under \(H_0\) vs \(H_1\))
# Frequentist
binom.test(20, 30, p=0.5)$p.value # Two-sided p-value
# Bayesian
# Posterior: Beta(1+20, 1+10) = Beta(21, 11)
# P(p < 0.5 | data) = CDF at 0.5
pbeta(0.5, 21, 11) # P(p < 0.5 | data): about 0.03
1 - pbeta(0.5, 21, 11) # P(p > 0.5 | data): about 0.97 -- strong evidence p > 0.5
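For the Bayes Factor part, the marginal likelihood under \(H_1\) with a Beta(1,1) prior has a closed form via the Beta function (for a uniform prior it reduces to \(1/(n+1)\)); a sketch:

```r
k <- 20; n <- 30
# Marginal likelihood under H0: p fixed at 0.5
m0 <- dbinom(k, n, 0.5)
# Marginal likelihood under H1: binomial integrated over the Beta(1,1) prior
#   = choose(n,k) * B(k+1, n-k+1), which equals 1/(n+1) for a uniform prior
m1 <- choose(n, k) * beta(k + 1, n - k + 1)
BF01 <- m0 / m1
c(m0, m1, BF01)
# BF01 is close to 1: the data barely discriminate H0 from H1 here,
# even though P(p > 0.5 | data) is about 0.97.
```

The contrast between a near-unity Bayes Factor and a one-sided posterior probability of roughly 0.97 is a useful reminder that "evidence against a point null" and "evidence about a direction" are different questions.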