Bayesian Inference
Prior + Data = Posterior
1 Why Does This Matter?
The Bayesian approach provides a coherent framework for learning from data. Where the frequentist asks "if \(H_0\) is true, how extreme is our data?", the Bayesian asks "given the data, what do we believe about the parameter?"
Why should a researcher know Bayesian statistics in 2025?
MCMC methods are standard in advanced econometrics (Bayesian VAR), epidemiology, and ML. Stan, JAGS, PyMC — all of them implement Bayesian computation.
Regularization = Bayesian: Ridge regression is Bayesian linear regression with a normal prior. LASSO is Bayesian with a Laplace prior. Understanding Bayesian inference means understanding regularization.
Small samples: frequentist asymptotic theory needs large \(n\). Bayesian inference with informative priors can give sensible answers with small \(n\).
Uncertainty quantification: posterior credible intervals have an intuitive interpretation: "given this data, there is 95% probability that \(\theta\) lies in [a, b]" — which is exactly what people want when they misinterpret a frequentist CI!
Decision theory: Bayesian framework naturally connects to decision theory and optimal decisions under uncertainty.
2 Bayes' Theorem for Inference
Let \(y = (y_1,\ldots,y_n)\) be the data and \(\theta\) the parameter (scalar or vector). The posterior distribution is:
\[\underbrace{\pi(\theta|y)}_{\text{posterior}} = \frac{\underbrace{f(y|\theta)}_{\text{likelihood}} \cdot \underbrace{\pi(\theta)}_{\text{prior}}}{\underbrace{f(y)}_{\text{marginal likelihood}}}\]
Or, since \(f(y)\) is a constant (it does not depend on \(\theta\)):
\[\pi(\theta|y) \propto f(y|\theta) \cdot \pi(\theta)\]
“Posterior proportional to likelihood times prior”
Three components:
- Prior \(\pi(\theta)\): beliefs about \(\theta\) before seeing the data
- Likelihood \(f(y|\theta)\): probability of the data given \(\theta\) (the same object as in MLE!)
- Posterior \(\pi(\theta|y)\): updated beliefs after seeing the data
2.1 Learning Process
Bayesian inference is a learning process: prior + data → posterior. The posterior can become the prior for the next observation (sequential updating).
Intuitive example: toss a coin and estimate \(p = P(\text{head})\).
- Prior: \(p \sim \text{Beta}(1,1) = \text{Uniform}(0,1)\) — "I know nothing yet"
- Likelihood: \(f(k \text{ heads} \mid n \text{ tosses}, p) = \binom{n}{k}p^k(1-p)^{n-k}\)
- Posterior: \(p \mid \text{data} \sim \text{Beta}(1+k, 1+n-k)\) — "after seeing \(k\) heads in \(n\) tosses"
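The pseudo-count updating above can be checked directly in R; the toss sequence below is a hypothetical example, not data from the text:

```r
# Sequential Beta-Binomial updating: the posterior after each toss
# becomes the prior for the next toss.
tosses <- c(1, 0, 1, 1, 0)   # 1 = head, 0 = tail (hypothetical data)
a <- 1; b <- 1               # Beta(1,1) = Uniform(0,1) prior
for (x in tosses) {
  a <- a + x; b <- b + (1 - x)
}
# Same result as the batch update Beta(1 + k, 1 + n - k)
k <- sum(tosses); n <- length(tosses)
c(a, b)                       # 4 3
c(1 + k, 1 + n - k)           # 4 3 -- identical
```

Sequential and batch updating agree exactly, which is what makes the "posterior becomes the next prior" view coherent.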
3 Conjugate Priors
A prior \(\pi(\theta)\) is conjugate for the likelihood \(f(y|\theta)\) if the posterior \(\pi(\theta|y)\) belongs to the same distributional family as the prior.
Advantage: the posterior is available in closed form — no numerical integration needed!
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Binomial(\(n\), \(p\)) | Beta(\(\alpha\), \(\beta\)) | Beta(\(\alpha+k\), \(\beta+n-k\)) |
| Normal(\(\mu\), \(\sigma^2\) known) | Normal(\(\mu_0\), \(\tau^2\)) | Normal(\(\mu_n\), \(\tau_n^2\)) |
| Poisson(\(\lambda\)) | Gamma(\(\alpha\), \(\beta\)) | Gamma(\(\alpha+\sum y_i\), \(\beta+n\)) |
| Exponential(\(\lambda\)) | Gamma(\(\alpha\), \(\beta\)) | Gamma(\(\alpha+n\), \(\beta+\sum y_i\)) |
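As a sanity check on the table, the Poisson-Gamma row can be verified numerically by comparing the closed-form posterior against a grid approximation of likelihood × prior (the data and prior values here are illustrative):

```r
set.seed(7)
y <- rpois(20, lambda = 3)            # simulated Poisson data
a0 <- 2; b0 <- 1                      # Gamma(shape, rate) prior (assumed values)
a_post <- a0 + sum(y); b_post <- b0 + length(y)

# Grid approximation of the unnormalized posterior: likelihood x prior
lam <- seq(0.01, 10, length.out = 2000)
unnorm <- sapply(lam, function(l) prod(dpois(y, l)) * dgamma(l, a0, b0))
grid_post <- unnorm / sum(unnorm) / diff(lam)[1]   # normalize to a density

# Should match the closed-form Gamma(a0 + sum(y), b0 + n) density
max(abs(grid_post - dgamma(lam, a_post, b_post)))  # small numerical error
```

The grid version needs no conjugacy at all, which is why it generalizes (at a computational cost) while the conjugate form does not.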
3.1 Beta-Binomial Conjugacy (Derivation)
Prior: \(p \sim \text{Beta}(\alpha, \beta)\), \(\pi(p) \propto p^{\alpha-1}(1-p)^{\beta-1}\)
Likelihood: \(f(k|p) \propto p^k(1-p)^{n-k}\)
Posterior: \[\pi(p|k) \propto p^k(1-p)^{n-k} \cdot p^{\alpha-1}(1-p)^{\beta-1} = p^{(\alpha+k)-1}(1-p)^{(\beta+n-k)-1}\]
This is Beta(\(\alpha+k\), \(\beta+n-k\)).
Interpretation: the prior parameters \(\alpha\) and \(\beta\) act like "pseudo-counts" — \(\alpha\) pseudo-successes and \(\beta\) pseudo-failures before seeing the data.
3.2 Normal-Normal Conjugacy
Prior: \(\mu \sim N(\mu_0, \tau^2)\) — prior mean \(\mu_0\) with precision \(1/\tau^2\)
Likelihood: \(y_i \sim N(\mu, \sigma^2)\) with \(\sigma^2\) known
Posterior mean: \[\mu_n = \frac{\tau^{-2}\mu_0 + n\sigma^{-2}\bar{y}}{\tau^{-2} + n\sigma^{-2}} = \frac{\text{prior precision} \times \text{prior mean} + \text{data precision} \times \text{data mean}}{\text{total precision}}\]
Posterior variance: \[\tau_n^2 = \frac{1}{\tau^{-2} + n\sigma^{-2}}\]
Interpretation: the posterior mean is a weighted average of the prior mean and the data mean, with weights proportional to the precisions.
4 Bayesian Point Estimates and Credible Intervals
From the posterior \(\pi(\theta|y)\):
- Posterior mean: \(\hat{\theta}_{PM} = E[\theta|y] = \int \theta \pi(\theta|y)\,d\theta\) — optimal under squared loss
- MAP (Maximum A Posteriori): \(\hat{\theta}_{MAP} = \arg\max_\theta \pi(\theta|y)\) — optimal under 0-1 loss
- Posterior median: optimal under absolute loss
Posterior (credible) interval: \([\theta_L, \theta_U]\) such that \(P(\theta_L \leq \theta \leq \theta_U | y) = 1-\alpha\)
Credible interval vs Confidence interval:
| | Confidence Interval | Credible Interval |
|---|---|---|
| Framework | Frequentist | Bayesian |
| Randomness | Interval is random, \(\theta\) fixed | \(\theta\) is random, data fixed |
| Interpretation | 95% of such intervals contain \(\theta\) | Given data, 95% prob \(\theta\) in interval |
| Dependence on prior | No | Yes |
| Direct probability statement about \(\theta\) | No (though often misread as one) | Yes |
5 Bayesian Linear Regression
Model: \(y = X\beta + \varepsilon\), \(\varepsilon \sim N(0, \sigma^2 I)\)
Prior on \(\beta\): \(\beta \sim N(\beta_0, \Sigma_0)\)
Posterior (with \(\sigma^2\) known): \[\beta | y \sim N(\beta_n, \Sigma_n)\]
where: \[\Sigma_n = (\Sigma_0^{-1} + \sigma^{-2}X^TX)^{-1}\] \[\beta_n = \Sigma_n(\Sigma_0^{-1}\beta_0 + \sigma^{-2}X^Ty)\]
The posterior mean is a weighted combination of the prior mean and the OLS estimate.
5.1 Connection to Ridge Regression
Special case: Prior \(\beta \sim N(0, \frac{\sigma^2}{\lambda}I)\) (independent, zero-mean normal prior):
Posterior mode (MAP): \[\hat{\beta}_{MAP} = \arg\max \left[-\frac{1}{2\sigma^2}||y-X\beta||^2 - \frac{\lambda}{2\sigma^2}||\beta||^2\right]\] \[= \arg\min ||y-X\beta||^2 + \lambda||\beta||^2\] \[= (X^TX + \lambda I)^{-1}X^Ty = \hat{\beta}_{Ridge}\]
Ridge regression IS MAP estimation under a normal prior! The regularization parameter \(\lambda\) controls the strength of the prior — larger \(\lambda\) means a stronger prior toward zero, hence more shrinkage.
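A quick numerical check of this equivalence, using simulated data: plugging \(\beta_0 = 0\) and \(\Sigma_0 = (\sigma^2/\lambda) I\) into the posterior formulas from the previous subsection reproduces the Ridge closed form exactly:

```r
set.seed(123)
n <- 40; p <- 5; sigma <- 1; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n, sd = sigma)

# Posterior mean with beta0 = 0 and Sigma0 = (sigma^2 / lambda) * I
Sigma0_inv <- (lambda / sigma^2) * diag(p)
Sigma_n <- solve(Sigma0_inv + crossprod(X) / sigma^2)
beta_n <- Sigma_n %*% (crossprod(X, y) / sigma^2)

# Ridge closed form
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

max(abs(beta_n - beta_ridge))   # agrees to machine precision
```

Algebraically, \(\Sigma_n = \sigma^2(X^TX + \lambda I)^{-1}\), so the \(\sigma^2\) factors cancel and the posterior mean is exactly \((X^TX + \lambda I)^{-1}X^Ty\).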
5.2 LASSO as Bayesian
Similarly, LASSO corresponds to Laplace (double-exponential) prior on \(\beta\): \[\pi(\beta) \propto \exp(-\lambda|\beta|)\]
The MAP estimate: \[\hat{\beta}_{LASSO} = \arg\min ||y-X\beta||^2 + \lambda||\beta||_1\]
6 MCMC: Sampling from Posteriors
For complex models without conjugate priors, we need numerical methods to sample from the posterior.
Goal: sample from an intractable posterior \(\pi(\theta|y)\).
Metropolis-Hastings Algorithm:
1. Initialize \(\theta^{(0)}\)
2. For \(t = 1, \ldots, T\):
   a. Propose \(\theta^* \sim q(\theta|\theta^{(t-1)})\) (proposal distribution)
   b. Compute the acceptance ratio: \(r = \frac{\pi(\theta^*|y)\,q(\theta^{(t-1)}|\theta^*)}{\pi(\theta^{(t-1)}|y)\,q(\theta^*|\theta^{(t-1)})}\)
   c. Accept: set \(\theta^{(t)} = \theta^*\) with probability \(\min(1, r)\); otherwise set \(\theta^{(t)} = \theta^{(t-1)}\)
3. After burn-in, \(\{\theta^{(t)}\}\) are (dependent) samples from \(\pi(\theta|y)\)
Gibbs Sampling (special case): Sample each parameter from its full conditional distribution given all others. Requires knowing \(\pi(\theta_j | \theta_{-j}, y)\) for each \(j\).
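A minimal Gibbs sketch for the normal model with both \(\mu\) and \(\sigma^2\) unknown, assuming priors \(\mu \sim N(\mu_0, \tau_0^2)\) and \(\sigma^2 \sim \text{Inv-Gamma}(a_0, b_0)\) (the specific prior and data values are illustrative): both full conditionals are available in closed form, so each update is a direct draw:

```r
set.seed(1)
y <- rnorm(50, mean = 5, sd = 2)        # simulated data
n <- length(y); y_bar <- mean(y)
mu0 <- 0; tau02 <- 100                  # prior: mu ~ N(0, 100)       (assumed)
a0 <- 2; b0 <- 2                        # prior: sigma2 ~ Inv-Gamma(2, 2) (assumed)

n_iter <- 5000; burn <- 1000
mu_s <- numeric(n_iter); s2_s <- numeric(n_iter)
mu <- y_bar; s2 <- var(y)               # initialize at data-based values
for (t in 1:n_iter) {
  # Full conditional of mu: Normal (precision-weighted, as in Normal-Normal)
  prec <- 1/tau02 + n/s2
  mu   <- rnorm(1, (mu0/tau02 + n*y_bar/s2)/prec, sqrt(1/prec))
  # Full conditional of sigma2: Inverse-Gamma(a0 + n/2, b0 + SS/2)
  s2   <- 1/rgamma(1, shape = a0 + n/2, rate = b0 + 0.5*sum((y - mu)^2))
  mu_s[t] <- mu; s2_s[t] <- s2
}
mean(mu_s[-(1:burn)])   # close to the true mean 5
mean(s2_s[-(1:burn)])   # close to the true variance 4
```

No tuning is needed here (unlike Metropolis-Hastings), precisely because each conditional is a known distribution.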
Modern MCMC: Hamiltonian Monte Carlo (HMC), as used in Stan, exploits gradient information for much more efficient sampling in high dimensions.
7 Worked Example: Beta-Binomial Bayesian Inference
Problem: 12 successes in 30 trials. Estimate \(p\) under various priors.
Case 1: Uninformative prior \(p \sim \text{Beta}(1,1) = \text{Uniform}(0,1)\)
Posterior: \(p | \text{data} \sim \text{Beta}(1+12, 1+18) = \text{Beta}(13, 19)\)
Case 2: Informative prior \(p \sim \text{Beta}(5, 10)\) (prior belief: \(p \approx 0.33\))
Posterior: \(p | \text{data} \sim \text{Beta}(5+12, 10+18) = \text{Beta}(17, 28)\)
Case 3: Strong informative prior \(p \sim \text{Beta}(50, 100)\) (very confident: \(p \approx 0.33\))
Posterior: \(p | \text{data} \sim \text{Beta}(50+12, 100+18) = \text{Beta}(62, 118)\)
# Data
n_trials <- 30; k_success <- 12
p_obs <- k_success/n_trials # MLE = 0.4
# Three priors
priors <- list(
uninformative = c(alpha=1, beta=1),
weakly_inform = c(alpha=5, beta=10),
strong_inform = c(alpha=50, beta=100)
)
# Compute posteriors
p_seq <- seq(0, 1, length.out=1000)
par(mfrow=c(1,3))
for(pr_name in names(priors)) {
a_prior <- priors[[pr_name]]["alpha"]
b_prior <- priors[[pr_name]]["beta"]
# Posterior parameters
a_post <- a_prior + k_success
b_post <- b_prior + (n_trials - k_success)
# Point estimates
post_mean <- a_post / (a_post + b_post)
post_mode <- (a_post-1) / (a_post+b_post-2) # MAP
post_median <- qbeta(0.5, a_post, b_post)
# 95% Credible Interval
ci_95 <- qbeta(c(0.025, 0.975), a_post, b_post)
# Plot prior and posterior
plot(p_seq, dbeta(p_seq, a_prior, b_prior), type="l",
col="blue", lwd=2, lty=2,
main=pr_name, xlab="p", ylab="Density",
ylim=c(0, max(dbeta(p_seq, a_post, b_post))*1.1))
lines(p_seq, dbeta(p_seq, a_post, b_post),
col="red", lwd=2)
abline(v=p_obs, col="green", lwd=2, lty=3) # MLE
abline(v=post_mean, col="red", lwd=1, lty=4) # Posterior mean
legend("topright", c("Prior", "Posterior", "MLE", "Post. mean"),
col=c("blue","red","green","red"),
lwd=c(2,2,2,1), lty=c(2,1,3,4), cex=0.7)
cat(sprintf("\n=== %s ===\n", pr_name))
cat(sprintf("Prior: Beta(%.0f, %.0f), E[p] = %.3f\n",
a_prior, b_prior, a_prior/(a_prior+b_prior)))
cat(sprintf("Posterior: Beta(%.0f, %.0f)\n", a_post, b_post))
cat(sprintf("Post mean: %.4f, MAP: %.4f, MLE: %.4f\n",
post_mean, post_mode, p_obs))
cat(sprintf("95%% CI: [%.4f, %.4f]\n", ci_95[1], ci_95[2]))
}

Insight: with an uninformative prior, the posterior mean ≈ MLE. With a strong informative prior, the posterior mean is pulled toward the prior. The effect of the prior diminishes as \(n\) increases — "data overwhelms the prior" asymptotically.
8 Bayesian vs Frequentist: Comparison
| Aspect | Frequentist | Bayesian |
|---|---|---|
| \(\theta\) | Fixed unknown constant | Random variable with distribution |
| Inference | Long-run frequency properties | Posterior probability statements |
| Prior | None | Required (explicit assumptions) |
| CI/CrI | 95% of such intervals contain \(\theta\) | P(\(\theta\) in interval given data) = 95% |
| p-value | P(data at least as extreme given \(H_0\)) | Not used; compute P(\(H_0\) given data) instead |
| Sample size | Needs large \(n\) for asymptotics | Works for any \(n\) given a good prior |
| Computation | Often analytical | Often requires MCMC |
| Strength | Well-understood, widely used | Intuitive, handles complex models |
Practical recommendation: learn both paradigms. In most applied econometrics:
- Default: frequentist (OLS, MLE, standard hypothesis tests)
- Complex models, small samples, or cases where priors matter: consider Bayesian
- ML regularization: the Bayesian perspective illuminates what regularization is doing
9 R Code: Bayesian Analysis
# ============================================================
# MANUAL BAYESIAN: Normal-Normal Conjugate
# ============================================================
# Prior: mu ~ N(mu0, tau0^2)
# Likelihood: yi ~ N(mu, sigma^2) with sigma^2 known
# Posterior: mu | y ~ N(mu_n, tau_n^2)
bayes_normal <- function(data, mu0, tau0, sigma) {
n <- length(data); y_bar <- mean(data)
sigma2 <- sigma^2; tau02 <- tau0^2
# Posterior parameters
tau_n2 <- 1 / (1/tau02 + n/sigma2)
mu_n <- tau_n2 * (mu0/tau02 + n*y_bar/sigma2)
return(list(mu_n=mu_n, tau_n=sqrt(tau_n2)))
}
set.seed(2024)
true_mu <- 5; sigma <- 2; n <- 30
y <- rnorm(n, true_mu, sigma)
# Different priors
scenarios <- list(
uninformative = c(mu0=0, tau0=100),
weakly_inform = c(mu0=4, tau0=2),
strong_inform = c(mu0=4, tau0=0.5)
)
cat("=== NORMAL-NORMAL BAYESIAN ANALYSIS ===\n")
cat(sprintf("True mu = %g, n = %d, y_bar = %.4f\n\n",
true_mu, n, mean(y)))
for(sc in names(scenarios)) {
post <- bayes_normal(y, scenarios[[sc]]["mu0"],
scenarios[[sc]]["tau0"], sigma)
ci_95 <- post$mu_n + c(-1,1) * 1.96 * post$tau_n
cat(sprintf("--- %s ---\n", sc))
cat(sprintf("Prior: N(%.1f, %.1f^2)\n",
scenarios[[sc]]["mu0"], scenarios[[sc]]["tau0"]))
cat(sprintf("Post mean: %.4f, Post SD: %.4f\n", post$mu_n, post$tau_n))
cat(sprintf("95%% CrI: [%.4f, %.4f]\n\n", ci_95[1], ci_95[2]))
}
# ============================================================
# RIDGE AS BAYESIAN: Compare with regular OLS
# ============================================================
set.seed(42)
n <- 50; p <- 10 # Many predictors relative to n
X <- matrix(rnorm(n*p), n, p)
beta_true <- c(1, -0.5, 2, 0, 0, 0, 0, 0, 0, 0) # Sparse
y <- X %*% beta_true + rnorm(n, sd=2)
# OLS (might overfit with n=50, p=10)
ols <- lm(y ~ X - 1) # No intercept for simplicity
# Ridge regression (via glmnet)
if(!require(glmnet)) install.packages("glmnet")
library(glmnet)
ridge <- glmnet(X, y, alpha=0, lambda=0.5) # alpha=0 for Ridge
cat("=== OLS vs Ridge (Bayesian MAP with Normal Prior) ===\n")
cat(sprintf("%-8s %-12s %-12s %-12s\n", "Param", "True", "OLS", "Ridge"))
for(j in 1:p) {
cat(sprintf("%-8d %-12.4f %-12.4f %-12.4f\n",
j, beta_true[j], coef(ols)[j],
as.numeric(coef(ridge)[-1])[j]))
}
# Prediction comparison (in-sample fitted values)
y_pred_ols <- fitted(ols)  # predict.lm has no newx argument; use fitted values
y_pred_ridge <- as.numeric(predict(ridge, newx=X))  # includes glmnet's intercept
cat(sprintf("\nOLS MSE (in-sample): %.4f\n", mean((y-y_pred_ols)^2)))
cat(sprintf("Ridge MSE (in-sample): %.4f\n", mean((y-y_pred_ridge)^2)))
# ============================================================
# SIMPLE MCMC: Metropolis-Hastings for Beta distribution
# ============================================================
metropolis_hastings <- function(n_iter, start, log_posterior, sigma_prop=0.1) {
samples <- numeric(n_iter)
samples[1] <- start
accept_count <- 0
for(i in 2:n_iter) {
# Proposal
prop <- rnorm(1, samples[i-1], sigma_prop)
if(prop <= 0 || prop >= 1) {
samples[i] <- samples[i-1]
next
}
# Log acceptance ratio
log_r <- log_posterior(prop) - log_posterior(samples[i-1])
# Accept/reject
if(log(runif(1)) < log_r) {
samples[i] <- prop
accept_count <- accept_count + 1
} else {
samples[i] <- samples[i-1]
}
}
cat(sprintf("Acceptance rate: %.2f%%\n", 100*accept_count/n_iter))
return(samples)
}
# Target: Beta(13, 19) posterior from Beta-Binomial example
log_post <- function(p) dbeta(p, 13, 19, log=TRUE)
set.seed(42)
n_iter <- 10000; burn_in <- 1000
mcmc_samples <- metropolis_hastings(n_iter, start=0.4, log_posterior=log_post)
mcmc_clean <- mcmc_samples[(burn_in+1):n_iter]
# Compare MCMC samples to exact distribution
hist(mcmc_clean, probability=TRUE, breaks=50,
main="MCMC vs Exact Beta(13,19) Posterior",
xlab="p")
curve(dbeta(x, 13, 19), add=TRUE, col="red", lwd=2)
cat(sprintf("\nMCMC mean: %.4f (exact: %.4f)\n",
mean(mcmc_clean), 13/(13+19)))
cat(sprintf("MCMC 95%% CrI: [%.4f, %.4f]\n",
quantile(mcmc_clean, 0.025),
quantile(mcmc_clean, 0.975)))
cat(sprintf("Exact 95%% CrI: [%.4f, %.4f]\n",
qbeta(0.025, 13, 19), qbeta(0.975, 13, 19))

10 Connections to ML and Econometrics
Ridge = Bayesian + Normal prior: \(\hat{\beta}_{Ridge} = \arg\min ||y-X\beta||^2 + \lambda||\beta||^2\). This is MAP estimation under \(\beta \sim N(0, \sigma^2/\lambda \cdot I)\).
LASSO = Bayesian + Laplace prior: the prior \(\pi(\beta_j) \propto e^{-\lambda|\beta_j|/\sigma}\) gives LASSO. The Laplace prior encourages sparsity because of its sharp peak (non-differentiable point) at zero; its heavier-than-normal tails also shrink large coefficients less.
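The sparsity mechanism is easiest to see in the orthonormal-design case (\(X^TX = I\)), where both estimators have closed forms: Ridge rescales every OLS coefficient, while LASSO soft-thresholds, zeroing any coefficient with \(|\hat{\beta}_{OLS,j}| \leq \lambda/2\). A small sketch with hypothetical coefficients:

```r
# Orthonormal design (X'X = I): closed forms for Ridge and LASSO.
# Ridge shrinks every coefficient proportionally; LASSO sets small ones to 0.
lambda <- 1
b_ols   <- c(2.0, 0.8, 0.3, -0.1)   # hypothetical OLS coefficients
b_ridge <- b_ols / (1 + lambda)                          # proportional shrinkage
b_lasso <- sign(b_ols) * pmax(abs(b_ols) - lambda/2, 0)  # soft threshold
rbind(b_ols, b_ridge, b_lasso)
# LASSO zeroes the coefficients with |b| <= lambda/2 = 0.5; Ridge never does.
```

This is the exact-sparsity property that makes LASSO useful for variable selection, and it comes directly from the Laplace prior's kink at zero.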
Gaussian processes: Nonparametric Bayesian regression. Prior over functions. Used in kriging (spatial interpolation) and machine learning.
Bayesian VAR: a standard tool in macroeconometrics (Minnesota prior). Useful for forecasting with many variables; it shrinks coefficients toward a random walk.
Stan ecosystem: rstanarm gives Bayesian versions of standard regression models with the same syntax as lm()/glm(); brms extends this to complex mixed models.
MCMC diagnostics: the \(\hat{R}\) (R-hat) statistic for chain convergence, effective sample size, and trace plots are all standard practice when reporting Bayesian results.
11 Practice Problems
Problem 1: Beta-Binomial sequential updating.
Start with the prior \(\text{Beta}(2, 2)\). Observations arrive one at a time: H, T, H, H, H, T, H (7 tosses, 5 heads).
- Update the posterior after each observation
- Plot the prior, all intermediate posteriors, and the final posterior
- What is the posterior mean after all 7 tosses?
Answer: final posterior = Beta(2+5, 2+2) = Beta(7, 4). Mean = 7/11 ≈ 0.636.
Problem 2: Ridge as Bayesian MAP.
Show mathematically that the Ridge regression estimate: \[\hat{\beta}_{Ridge} = (X^TX + \lambda I)^{-1}X^Ty\]
is the MAP estimate if \(\beta \sim N(0, \sigma^2/\lambda \cdot I_p)\) and \(y|X,\beta \sim N(X\beta, \sigma^2 I_n)\).
Hint: maximize \(\log\pi(\beta|y) = \log f(y|\beta) + \log\pi(\beta)\) with respect to \(\beta\).
Problem 3: Posterior predictive distribution.
In the Normal-Normal model: \(\mu | y \sim N(\mu_n, \tau_n^2)\) and \(y_{new}|\mu \sim N(\mu, \sigma^2)\).
Derive the posterior predictive distribution \(p(y_{new}|y)\) (marginalizing over \(\mu\)).
Hint: write \(y_{new} = \mu + \varepsilon\) with \(\varepsilon \sim N(0, \sigma^2)\) independent of \(\mu | y\); a sum of independent normals is normal.
Answer: \(y_{new}|y \sim N(\mu_n, \sigma^2 + \tau_n^2)\). Note: more variable than the likelihood alone, because it adds the uncertainty from the posterior of \(\mu\).
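The variance decomposition in this answer can be confirmed by simulation; the posterior and likelihood values below are illustrative, not from the text:

```r
set.seed(99)
mu_n <- 4.8; tau_n2 <- 0.2; sigma2 <- 4   # illustrative posterior / likelihood values
# Simulate the posterior predictive: draw mu | y, then y_new | mu
mu_draws <- rnorm(1e5, mu_n, sqrt(tau_n2))
y_new <- rnorm(1e5, mu_draws, sqrt(sigma2))
c(mean(y_new), var(y_new))   # approx (mu_n, sigma2 + tau_n2) = (4.8, 4.2)
```

The two-stage draw is exactly the marginalization over \(\mu\), so the sample variance recovers \(\sigma^2 + \tau_n^2\) rather than \(\sigma^2\) alone.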
Problem 4: Bayesian hypothesis testing.
Contrast the Bayesian approach with the frequentist one for testing \(H_0: p = 0.5\) vs \(H_1: p \neq 0.5\) on coin flip data (20 heads out of 30).
- Frequentist: compute p-value
- Bayesian: with \(\text{Beta}(1,1)\) prior, compute \(P(p < 0.5 | \text{data})\)
- Compute Bayes Factor (ratio of marginal likelihoods under \(H_0\) vs \(H_1\))
# Frequentist
binom.test(20, 30, p=0.5)$p.value # Two-sided p-value
# Bayesian
# Posterior: Beta(1+20, 1+10) = Beta(21, 11)
# P(p < 0.5 | data) = CDF at 0.5
pbeta(0.5, 21, 11) # P(p < 0.5 | data): about 0.03
1 - pbeta(0.5, 21, 11) # P(p > 0.5 | data): about 0.97 -- strong evidence p > 0.5
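For the Bayes Factor part, the marginal likelihood under \(H_1\) with a Beta(1,1) prior has a closed form via the Beta function (for a uniform prior it reduces to \(1/(n+1)\)); a sketch:

```r
k <- 20; n <- 30
# Marginal likelihood under H0: p fixed at 0.5
m0 <- dbinom(k, n, 0.5)
# Marginal likelihood under H1: binomial integrated over the Beta(1,1) prior
#   = choose(n,k) * B(k+1, n-k+1), which equals 1/(n+1) for a uniform prior
m1 <- choose(n, k) * beta(k + 1, n - k + 1)
BF01 <- m0 / m1
c(m0, m1, BF01)
# BF01 is close to 1: the data barely discriminate H0 from H1 here,
# even though P(p > 0.5 | data) is about 0.97.
```

The contrast between a near-unity Bayes Factor and a one-sided posterior probability of roughly 0.97 is a useful reminder that "evidence against a point null" and "evidence about a direction" are different questions.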