MLE for Discrete Choice Models

Probit, Logit, Tobit: The Complete Math

econometrics-math
intermediate
Complete mathematical derivation of MLE for probit, logit, and tobit: the latent variable framework, log-likelihood, score, Hessian, marginal effects, and Newton-Raphson.

1 Why Does This Matter?

Note: Why This Matters for Your Work

Probit, logit, and tobit are all estimated from data by MLE, not OLS. Knowing the math helps you:

  • Interpret marginal effects properly (not just read off the coefficients)
  • Debug convergence issues when optimization fails
  • Understand what a "standard error" means in these nonlinear models
  • Know when model identification can break down

If you have ever wondered why logit and probit coefficients look different yet give similar results, or why margins must be computed separately, the answer is here.


2 The Problem with OLS for Binary Outcomes

Suppose you want to model \(y_i \in \{0, 1\}\) (for example: applying for a job or not, defaulting on a loan or not).

Linear Probability Model (LPM): \(P(y_i = 1 | x_i) = x_i^T\beta\)

It looks simple, but there are two serious problems:

  1. Predictions outside \([0, 1]\): nothing guarantees \(x_i^T\hat{\beta} \in [0,1]\). A negative probability makes no sense.

  2. Structural heteroskedasticity: because \(y_i\) is Bernoulli, \[\text{Var}(y_i | x_i) = P(y_i=1|x_i)(1 - P(y_i=1|x_i)) = x_i^T\beta(1-x_i^T\beta)\] The variance depends on \(x_i\) rather than being constant. OLS remains consistent but is inefficient, and the naive standard errors are wrong.

Solution: use a nonlinear model with a link function that restricts the output to \([0,1]\).
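A small simulation (illustrative; all variable names are ours) makes problem 1 concrete: OLS fitted "probabilities" routinely escape \([0,1]\) once a covariate has wide support.

```r
# Sketch: LPM fitted values can escape [0, 1] (illustrative simulation)
set.seed(1)
n <- 200
x <- rnorm(n, mean = 0, sd = 2)      # covariate with wide support
p <- pnorm(-0.5 + 0.8 * x)           # true probabilities via a probit link
y <- rbinom(n, 1, p)                 # binary outcome

lpm <- lm(y ~ x)                     # Linear Probability Model via OLS
fitted_p <- fitted(lpm)

# How many fitted "probabilities" fall outside the unit interval?
sum(fitted_p < 0 | fitted_p > 1)
range(fitted_p)
```

The nonlinear models below fix this by construction, since \(\Phi\) and \(\Lambda\) map any index into \((0,1)\).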


3 Latent Variable Framework

Important: Definition: Latent Variable Model

A latent variable model assumes there is an unobserved variable \(y_i^*\) that determines the outcome we actually observe.

\[y_i^* = x_i^T\beta + \varepsilon_i \quad \text{(unobserved, continuous)}\]

\[y_i = \mathbf{1}[y_i^* > 0] = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \leq 0 \end{cases}\]

  • \(x_i^T\beta\): the index function (the deterministic part)
  • \(\varepsilon_i\): an error term with a specified distribution
  • The distribution of \(\varepsilon_i\) determines whether we get probit or logit

Probability of the observed outcome: \[P(y_i = 1 | x_i) = P(y_i^* > 0 | x_i) = P(\varepsilon_i > -x_i^T\beta) = 1 - F_\varepsilon(-x_i^T\beta)\]

If the distribution of \(\varepsilon_i\) is symmetric (\(F_\varepsilon(-z) = 1 - F_\varepsilon(z)\)): \[P(y_i = 1 | x_i) = F_\varepsilon(x_i^T\beta)\]


4 Probit Model

4.1 Distributional Assumption

\[\varepsilon_i \sim N(0, 1) \quad \Rightarrow \quad F_\varepsilon = \Phi \text{ (the standard normal CDF)}\]

Hence: \[P(y_i = 1 | x_i) = \Phi(x_i^T\beta)\]

Note: the variance of \(\varepsilon_i\) is normalized to 1 because only the ratio \(x_i^T\beta/\sigma\) is identifiable.

4.2 Log-Likelihood Probit

For a single observation \(i\): \[\ell_i(\beta) = y_i \log \Phi(x_i^T\beta) + (1 - y_i) \log(1 - \Phi(x_i^T\beta))\]

Total log-likelihood: \[\ell(\beta) = \sum_{i=1}^n \left[ y_i \log \Phi(x_i^T\beta) + (1 - y_i) \log \Phi(-x_i^T\beta) \right]\]

(using \(1 - \Phi(z) = \Phi(-z)\) to simplify)
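The formula can be checked numerically against logLik(). Here is a sketch (the simulated data and names are illustrative) that evaluates \(\ell(\beta)\) by hand at the glm() estimates:

```r
# Sketch: evaluate the probit log-likelihood at the fitted coefficients
set.seed(123)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(0.5 + 1.0 * x))

fit <- glm(y ~ x, family = binomial(link = "probit"))

X   <- model.matrix(fit)
eta <- drop(X %*% coef(fit))                      # fitted index x_i' beta
ll_manual <- sum(y * pnorm(eta, log.p = TRUE) +
                 (1 - y) * pnorm(-eta, log.p = TRUE))  # uses 1 - Phi(z) = Phi(-z)

all.equal(ll_manual, as.numeric(logLik(fit)))     # the two should agree
```

Using `pnorm(..., log.p = TRUE)` instead of `log(pnorm(...))` avoids underflow when \(\Phi\) is tiny in the tails.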

4.3 Score Function Probit

The first derivative (the score): \[\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^n \left[ \frac{y_i \phi(x_i^T\beta)}{\Phi(x_i^T\beta)} - \frac{(1-y_i) \phi(x_i^T\beta)}{1 - \Phi(x_i^T\beta)} \right] x_i\]

where \(\phi = \Phi'\) is the standard normal PDF.

Define the Mills ratio: \(\lambda(t) = \phi(t)/\Phi(t)\), so that \(\lambda(-t) = \phi(t)/(1-\Phi(t))\).

The score becomes: \(\frac{\partial \ell}{\partial \beta} = \sum_i \left[ y_i \lambda(x_i^T\beta) - (1-y_i)\lambda(-x_i^T\beta) \right] x_i\)

There is no closed-form solution, so numerical optimization (Newton-Raphson or BFGS) is required.
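A minimal sketch of that numerical optimization: maximize the probit log-likelihood with BFGS via optim(), using glm() only as a benchmark (simulated data; names and true values are illustrative).

```r
# Sketch: probit MLE by direct numerical optimization (BFGS)
set.seed(7)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(-0.3 + 0.9 * x))
X <- cbind(1, x)

# Negative log-likelihood, written with log.p = TRUE for numerical safety
negll <- function(beta) {
  eta <- drop(X %*% beta)
  -sum(y * pnorm(eta, log.p = TRUE) + (1 - y) * pnorm(-eta, log.p = TRUE))
}

opt <- optim(c(0, 0), negll, method = "BFGS")
fit <- glm(y ~ x, family = binomial(link = "probit"))

rbind(optim = opt$par, glm = as.numeric(coef(fit)))  # should be very close
```

In practice you would also pass the analytic gradient (the score above) to optim() for faster, more reliable convergence.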

4.4 Marginal Effects Probit

The coefficient \(\beta\) is not itself a marginal effect! What we actually want is:

\[\frac{\partial P(y=1|x)}{\partial x_j} = \phi(x^T\beta) \cdot \beta_j\]

This depends on \(x\): it differs for every individual.

Average Partial Effect (APE): \[\text{APE}_j = \frac{1}{n}\sum_{i=1}^n \phi(x_i^T\hat{\beta}) \cdot \hat{\beta}_j\]

Marginal Effect at the Mean (MEM): \[\text{MEM}_j = \phi(\bar{x}^T\hat{\beta}) \cdot \hat{\beta}_j\]

The APE is usually preferred in practice.


5 Logit Model

5.1 Distributional Assumption

\[\varepsilon_i \sim \text{Logistic}(0, 1) \quad \Rightarrow \quad F_\varepsilon = \Lambda(z) = \frac{e^z}{1 + e^z}\]

\[P(y_i = 1 | x_i) = \Lambda(x_i^T\beta) = \frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)}\]

5.2 Log-Likelihood Logit

\[\ell(\beta) = \sum_{i=1}^n \left[ y_i x_i^T\beta - \log(1 + e^{x_i^T\beta}) \right]\]

This form is also more numerically stable than probit: there is no ratio like \(\phi/\Phi\) that can approach \(0/0\) in the tails.

5.3 Score dan Hessian Logit

Score (elegant!): \[\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^n \left[ y_i - \Lambda(x_i^T\beta) \right] x_i = \sum_{i=1}^n (y_i - \hat{p}_i) x_i\]

Notice: the score is a sum of residuals times \(x_i\). The intuition is crystal clear.

Hessian: \[\frac{\partial^2 \ell}{\partial \beta \partial \beta^T} = -\sum_{i=1}^n \Lambda(x_i^T\beta)(1 - \Lambda(x_i^T\beta)) x_i x_i^T = -X^T W X\]

where \(W = \text{diag}(\hat{p}_i(1-\hat{p}_i))\).

The Hessian is negative definite (since \(\hat{p}_i(1-\hat{p}_i) > 0\), provided \(X\) has full column rank), so the log-likelihood is globally concave and the maximum is unique.

5.4 Odds Ratio Interpretation

The logit model has a particularly natural interpretation:

\[\log\frac{P(y=1)}{P(y=0)} = \log\frac{\Lambda(x^T\beta)}{1-\Lambda(x^T\beta)} = x^T\beta\]

So \(\exp(\beta_j)\) is an odds ratio: the multiplicative effect on the odds when \(x_j\) increases by one unit.


6 Newton-Raphson dan IRLS

Important: Definition: Newton-Raphson Algorithm

Iterate to find the root of the score equation:

\[\beta^{(t+1)} = \beta^{(t)} - \left[H(\beta^{(t)})\right]^{-1} s(\beta^{(t)})\]

where \(s\) is the score and \(H\) the Hessian.

For logit, this is equivalent to Iteratively Reweighted Least Squares (IRLS):

\[\beta^{(t+1)} = (X^T W^{(t)} X)^{-1} X^T W^{(t)} z^{(t)}\]

where \(z^{(t)} = X\beta^{(t)} + (W^{(t)})^{-1}(y - \hat{p}^{(t)})\) is the "working response" and \(W^{(t)}\) is the diagonal weight matrix.
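The IRLS update can be sketched in a few lines. This is a minimal illustration on simulated data (variable names are ours), benchmarked against glm(), which itself fits by IRLS:

```r
# Sketch: IRLS for logit, compared against glm() (simulated data)
set.seed(99)
n <- 500
x <- rnorm(n)
X <- cbind(1, x)
y <- rbinom(n, 1, plogis(0.4 - 0.7 * x))

beta <- c(0, 0)
for (t in 1:25) {
  p <- plogis(drop(X %*% beta))        # current fitted probabilities
  W <- p * (1 - p)                     # diagonal of the weight matrix
  z <- drop(X %*% beta) + (y - p) / W  # working response z = Xb + W^{-1}(y - p)
  beta_new <- drop(solve(t(X) %*% (W * X), t(X) %*% (W * z)))
  if (max(abs(beta_new - beta)) < 1e-10) { beta <- beta_new; break }
  beta <- beta_new
}

fit <- glm(y ~ x, family = binomial)
rbind(irls = beta, glm = as.numeric(coef(fit)))
```

Each step is just a weighted least squares regression of \(z^{(t)}\) on \(X\), which is why the algorithm is numerically cheap and converges in a handful of iterations.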


7 MLE Standard Errors

Because the MLE is consistent and asymptotically normal:

\[\sqrt{n}(\hat{\beta}_{MLE} - \beta_0) \xrightarrow{d} N(0, I(\beta_0)^{-1})\]

Observed Fisher Information: \[\hat{I}(\hat{\beta}) = -\frac{\partial^2 \ell}{\partial \beta \partial \beta^T}\bigg|_{\hat{\beta}} = X^T \hat{W} X\]

Variance estimator: \[\widehat{\text{Var}}(\hat{\beta}_{MLE}) = \left[-\frac{\partial^2 \ell}{\partial \beta \partial \beta^T}\bigg|_{\hat{\beta}}\right]^{-1} = (X^T \hat{W} X)^{-1}\]

Standard errors are the square roots of the diagonal entries of this matrix.
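As a sanity check, this variance formula can be computed by hand and compared with the standard errors reported by summary() (a sketch on simulated data; names are ours):

```r
# Sketch: MLE standard errors for logit from (X' W X)^{-1}
set.seed(11)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.2 + 0.5 * x))

fit <- glm(y ~ x, family = binomial)
X   <- model.matrix(fit)
p   <- fitted(fit)
W   <- p * (1 - p)                     # diagonal of W at the MLE

V         <- solve(t(X) %*% (W * X))   # (X' W_hat X)^{-1}
se_manual <- sqrt(diag(V))

se_glm <- summary(fit)$coefficients[, "Std. Error"]
rbind(manual = se_manual, glm = se_glm)
```

The two rows should agree to numerical precision, because glm()'s reported standard errors come from exactly this inverse information matrix.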


8 Tobit Model

8.1 Setup: Censored Dependent Variable

The tobit model is used when \(y_i\) is censored from below (usually at 0):

\[y_i^* = x_i^T\beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)\] \[y_i = \max(0, y_i^*) = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \leq 0 \end{cases}\]

Examples: spending on medicines (many zeros), hours worked, investment.

8.2 Log-Likelihood Tobit

The likelihood is a mixture: a discrete part for \(y_i = 0\) and a continuous part for \(y_i > 0\).

\[\ell(\beta, \sigma) = \sum_{i: y_i = 0} \log\left[1 - \Phi\left(\frac{x_i^T\beta}{\sigma}\right)\right] + \sum_{i: y_i > 0} \log\left[\frac{1}{\sigma}\phi\left(\frac{y_i - x_i^T\beta}{\sigma}\right)\right]\]

  • First term: the probability that \(y_i^* \leq 0\)
  • Second term: the normal density for the uncensored observations
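To make the mixture likelihood concrete, here is a minimal sketch that maximizes it numerically with optim() on simulated censored data (all names and the true parameter values are illustrative assumptions):

```r
# Sketch: tobit log-likelihood maximized with optim() (simulated data)
set.seed(21)
n     <- 500
x     <- rnorm(n)
ystar <- 1 + 2 * x + rnorm(n, sd = 1.5)   # latent outcome y*
y     <- pmax(0, ystar)                   # observed outcome, censored at 0

negll <- function(theta) {
  beta  <- theta[1:2]
  sigma <- exp(theta[3])                  # enforce sigma > 0 via log-parameterization
  mu    <- beta[1] + beta[2] * x
  ll0   <- pnorm(-mu / sigma, log.p = TRUE)               # log P(y* <= 0)
  ll1   <- dnorm((y - mu) / sigma, log = TRUE) - log(sigma)  # log density, y > 0
  -sum(ifelse(y == 0, ll0, ll1))
}

opt <- optim(c(0, 0, 0), negll, method = "BFGS")
c(beta = opt$par[1:2], sigma = exp(opt$par[3]))  # should be near (1, 2, 1.5)
```

Parameterizing \(\sigma = e^{\theta_3}\) keeps the optimization unconstrained; dedicated implementations (e.g. survival or AER in R) do the same bookkeeping with more safeguards.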

8.3 Marginal Effects Tobit

The effect on \(E[y_i|x_i]\) (the unconditional mean of the observed \(y\)):

\[\frac{\partial E[y_i|x_i]}{\partial x_j} = \Phi\left(\frac{x_i^T\beta}{\sigma}\right) \beta_j\]

The effect on \(E[y_i^*|x_i]\) (the latent variable): \[\frac{\partial E[y_i^*|x_i]}{\partial x_j} = \beta_j\]

So the coefficients directly measure effects on the latent variable, not on the observed outcome.
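The first result can be verified directly from the McDonald–Moffitt decomposition. Writing \(z_i = x_i^T\beta/\sigma\):

\[E[y_i \mid x_i] = \Phi(z_i)\,x_i^T\beta + \sigma\,\phi(z_i)\]

Differentiating with respect to \(x_j\) and using \(\phi'(z) = -z\,\phi(z)\):

\[\frac{\partial E[y_i \mid x_i]}{\partial x_j} = \phi(z_i)\frac{\beta_j}{\sigma}\,x_i^T\beta + \Phi(z_i)\,\beta_j - \sigma\, z_i\,\phi(z_i)\frac{\beta_j}{\sigma} = \Phi(z_i)\,\beta_j\]

since \(\sigma z_i = x_i^T\beta\), so the first and third terms cancel exactly.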


9 Worked Example: Probit vs Logit

library(tidyverse)
library(margins)

# Generate synthetic data: binary outcome (loan default)
set.seed(42)
n <- 500
x1 <- rnorm(n, mean = 50, sd = 10)   # income
x2 <- rnorm(n, mean = 3, sd = 1)      # debt ratio

# True latent index
latent <- -2 + 0.03 * x1 - 0.5 * x2
prob_true <- pnorm(latent)
y <- rbinom(n, 1, prob_true)

df <- data.frame(y = y, income = x1, debt = x2)

# Fit Probit and Logit
probit_model <- glm(y ~ income + debt, data = df,
                    family = binomial(link = "probit"))
logit_model  <- glm(y ~ income + debt, data = df,
                    family = binomial(link = "logit"))

# Compare coefficients
# Note: logit coefficients are roughly 1.6-1.8x the probit coefficients
# (theoretical scale factor pi/sqrt(3) ~ 1.81; empirically often closer to 1.6)
cat("Probit coefficients:\n")
print(coef(probit_model))

cat("\nLogit coefficients:\n")
print(coef(logit_model))

cat("\nRatio logit/probit (should be ~1.6-1.8):\n")
print(coef(logit_model) / coef(probit_model))

# Average Partial Effects (APE) via margins package
ape_probit <- margins(probit_model)
ape_logit  <- margins(logit_model)

cat("\nAPE Probit:\n")
summary(ape_probit)

cat("\nAPE Logit:\n")
summary(ape_logit)

# Manual APE computation for probit
beta_hat <- coef(probit_model)
X <- model.matrix(probit_model)
phi_vals <- dnorm(X %*% beta_hat)  # phi evaluated at fitted index

# APE for income
ape_income_manual <- mean(phi_vals * beta_hat["income"])
cat(sprintf("\nManual APE for income (probit): %.4f\n", ape_income_manual))

# Log-likelihood comparison
cat("\nLog-likelihood Probit:", logLik(probit_model))
cat("\nLog-likelihood Logit: ", logLik(logit_model))

# Predicted probabilities comparison
df$pred_probit <- predict(probit_model, type = "response")
df$pred_logit  <- predict(logit_model, type = "response")

cor(df$pred_probit, df$pred_logit)  # Should be very high (~0.99+)

Key Takeaway: probit and logit produce almost identical fitted probabilities. The coefficients differ by roughly a factor of \(\pi/\sqrt{3} \approx 1.814\) because the error distributions have different variances (standard normal variance = 1, standard logistic variance = \(\pi^2/3\)).


10 Probit vs Logit: A Comparison

| Aspect | Probit | Logit |
|---|---|---|
| Error distribution | \(N(0,1)\) | Logistic\((0,1)\) |
| Link function | \(\Phi^{-1}\) | \(\log[p/(1-p)]\) |
| Interpretation | Latent normal | Log-odds |
| Numerics | Slightly slower | Faster |
| Log-likelihood | Globally concave (harder to show) | Globally concave (easy to show) |
| Marginal effect | \(\phi(x^T\beta)\beta_j\) | \(\Lambda(1-\Lambda)\beta_j\) |
| Relative coefficients | Baseline | ~1.6-1.8× larger |

Caution: Connection to Extensions and Generalizations

These models have many important generalizations:

  • Ordered Probit/Logit: for ordinal outcomes (\(y \in \{1, 2, 3, 4, 5\}\)), with multiple threshold parameters
  • Multinomial Logit: for multiple categories; the odds ratio between any two alternatives does not depend on the other alternatives (the IIA assumption)
  • Nested Logit: relaxes IIA
  • Heckman Selection Model: extends the tobit idea, with two separate equations for selection and outcome
  • Bivariate Probit: two binary outcomes with correlated errors

All of them share the same structure: a latent variable plus MLE.


11 Practice Problems

Problem 1: Derive marginal effect formula for probit.

Derivation: \(\frac{\partial P(y=1|x)}{\partial x_j} = \frac{\partial \Phi(x^T\beta)}{\partial x_j} = \phi(x^T\beta) \cdot \beta_j\)

This uses the chain rule: \(\frac{d}{dz}\Phi(z) = \phi(z)\) and \(\frac{\partial (x^T\beta)}{\partial x_j} = \beta_j\).

Problem 2: Show logit log-likelihood is globally concave.

Hessian: \(H = -\sum_i \Lambda_i(1-\Lambda_i) x_ix_i^T\)

For an arbitrary vector \(v\): \[v^T H v = -\sum_i \underbrace{\Lambda_i(1-\Lambda_i)}_{>0} \underbrace{(x_i^Tv)^2}_{\geq 0} \leq 0\]

So \(H\) is negative semi-definite, and negative definite if \(X\) has full column rank.

Problem 3: Interpret log-odds ratio.

Model: \(\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1\)

If \(x_1\) increases by one unit, the log-odds increase by \(\beta_1\), meaning the odds change by a factor of \(e^{\beta_1}\).

Example: \(\beta_1 = 0.5\) gives \(e^{0.5} \approx 1.65\), so the odds increase by about 65%.

Problem 4: For the tobit model, show that OLS on the subsample with \(y_i > 0\) gives biased estimates.

Hint: \(E[y_i | y_i > 0, x_i] = x_i^T\beta + \sigma \frac{\phi(x_i^T\beta/\sigma)}{\Phi(x_i^T\beta/\sigma)}\); the extra term is a selection bias term (the Mills ratio).


Navigation: ← IV and GMM | Panel Data Math →