MLE for Discrete Choice Models
Probit, Logit, Tobit: The Full Math
1 Why Does This Matter?
Probit, logit, and tobit all learn from data via MLE, not OLS. Knowing the math helps you:
- Interpret marginal effects properly (not just read coefficients directly)
- Debug convergence issues when optimization fails
- Understand what "standard error" means in these nonlinear models
- Know when model identification can be a problem
If you have ever wondered why logit and probit coefficients look different but give similar results, or why margins must be computed separately, the answers are here.
2 The Problem with OLS for Binary Outcomes
Suppose you want to model \(y_i \in \{0, 1\}\) (for example: applying for a job or not, defaulting on a loan or not).
Linear Probability Model (LPM): \(P(y_i = 1 | x_i) = x_i^T\beta\)
It looks simple, but there are two serious problems:
Predictions outside \([0, 1]\): Nothing guarantees \(x_i^T\hat{\beta} \in [0,1]\). Negative probabilities make no sense.
Structural heteroskedasticity: Because \(y_i\) is Bernoulli, \[\text{Var}(y_i | x_i) = P(y_i=1|x_i)(1 - P(y_i=1|x_i)) = x_i^T\beta(1-x_i^T\beta)\] The variance depends on \(x_i\) rather than being constant. OLS is still consistent but inefficient, and naive standard errors are wrong.
Solution: use a nonlinear model with a link function that constrains the output to \([0,1]\).
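A quick simulation makes the first problem concrete. This is a sketch on made-up data; the sample size and coefficients below are illustrative, not from the text:

```r
# Sketch: LPM fitted "probabilities" can fall outside [0, 1].
set.seed(1)
n <- 200
x <- rnorm(n, mean = 0, sd = 2)
p_true <- pnorm(0.8 * x)                 # true probabilities from a probit DGP
y <- rbinom(n, 1, p_true)
lpm <- lm(y ~ x)                         # linear probability model via OLS
range(fitted(lpm))                       # typically extends below 0 or above 1
mean(fitted(lpm) < 0 | fitted(lpm) > 1)  # share of out-of-range predictions
```

With a wide enough range of \(x\), the fitted line must leave the unit interval, which is exactly the motivation for a nonlinear link.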
3 Latent Variable Framework
A latent variable model assumes there is a hidden variable \(y_i^*\) that determines the outcome we observe.
\[y_i^* = x_i^T\beta + \varepsilon_i \quad \text{(unobserved, continuous)}\]
\[y_i = \mathbf{1}[y_i^* > 0] = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \leq 0 \end{cases}\]
- \(x_i^T\beta\): index function (the deterministic part)
- \(\varepsilon_i\): error term with a specified distribution
- The distribution of \(\varepsilon_i\) determines whether we use probit or logit
Probability of the observed outcome: \[P(y_i = 1 | x_i) = P(y_i^* > 0 | x_i) = P(\varepsilon_i > -x_i^T\beta) = 1 - F_\varepsilon(-x_i^T\beta)\]
If the distribution of \(\varepsilon_i\) is symmetric (\(F_\varepsilon(-z) = 1 - F_\varepsilon(z)\)): \[P(y_i = 1 | x_i) = F_\varepsilon(x_i^T\beta)\]
4 Probit Model
4.1 Distributional Assumption
\[\varepsilon_i \sim N(0, 1) \quad \Rightarrow \quad F_\varepsilon = \Phi \text{ (standard normal CDF)}\]
Then: \[P(y_i = 1 | x_i) = \Phi(x_i^T\beta)\]
Note: the variance of \(\varepsilon_i\) is normalized to 1 because only \(x_i^T\beta/\sigma\) is identifiable.
4.2 Probit Log-Likelihood
For observation \(i\): \[\ell_i(\beta) = y_i \log \Phi(x_i^T\beta) + (1 - y_i) \log(1 - \Phi(x_i^T\beta))\]
Total log-likelihood: \[\ell(\beta) = \sum_{i=1}^n \left[ y_i \log \Phi(x_i^T\beta) + (1 - y_i) \log \Phi(-x_i^T\beta) \right]\]
(using \(1 - \Phi(z) = \Phi(-z)\) to simplify)
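As a sanity check, the log-likelihood above can be coded directly and compared with the value R's `glm()` reports at its own estimate. A sketch on simulated data (the coefficients below are illustrative):

```r
# Probit log-likelihood, using the identity 1 - pnorm(z) = pnorm(-z)
probit_loglik <- function(beta, X, y) {
  eta <- as.vector(X %*% beta)
  sum(y * pnorm(eta, log.p = TRUE) + (1 - y) * pnorm(-eta, log.p = TRUE))
}
set.seed(123)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(-0.3 + 0.9 * x))
fit <- glm(y ~ x, family = binomial(link = "probit"))
ll_manual <- probit_loglik(coef(fit), model.matrix(fit), y)
all.equal(ll_manual, as.numeric(logLik(fit)))  # TRUE: the two values agree
```

Using `log.p = TRUE` evaluates \(\log\Phi\) directly, which is more accurate than `log(pnorm(...))` in the tails.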
4.3 Probit Score Function
First derivative (the score): \[\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^n \left[ \frac{y_i \phi(x_i^T\beta)}{\Phi(x_i^T\beta)} - \frac{(1-y_i) \phi(x_i^T\beta)}{1 - \Phi(x_i^T\beta)} \right] x_i\]
where \(\phi = \Phi'\) is the standard normal PDF.
Define the inverse Mills ratio: \(\lambda(t) = \phi(t)/\Phi(t)\), so that \(\lambda(-t) = \phi(t)/(1-\Phi(t))\).
The score becomes: \(\frac{\partial \ell}{\partial \beta} = \sum_i \left[ y_i \lambda(x_i^T\beta) - (1-y_i)\lambda(-x_i^T\beta) \right] x_i\)
There is no closed-form solution, so numerical optimization is needed (Newton-Raphson or BFGS).
4.4 Probit Marginal Effects
The coefficients \(\beta\) are not marginal effects directly! What we want is:
\[\frac{\partial P(y=1|x)}{\partial x_j} = \phi(x^T\beta) \cdot \beta_j\]
This depends on \(x\): it differs for each individual.
Average Partial Effect (APE): \[\text{APE}_j = \frac{1}{n}\sum_{i=1}^n \phi(x_i^T\hat{\beta}) \cdot \hat{\beta}_j\]
Marginal Effect at the Mean (MEM): \[\text{MEM}_j = \phi(\bar{x}^T\hat{\beta}) \cdot \hat{\beta}_j\]
The APE is preferred in practice.
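The difference between APE and MEM is easy to see on simulated data with a skewed regressor. A sketch (the data-generating values are illustrative):

```r
# APE vs MEM for a probit fit: phi is averaged over the sample,
# vs evaluated once at the sample mean of x.
set.seed(3)
n <- 1000
x <- rexp(n)                            # skewed, so the mean is atypical
y <- rbinom(n, 1, pnorm(-1 + 0.8 * x))
fit <- glm(y ~ x, family = binomial(link = "probit"))
b <- coef(fit)
X <- model.matrix(fit)
bx <- unname(b["x"])
ape <- mean(dnorm(as.vector(X %*% b))) * bx   # average partial effect
mem <- dnorm(sum(colMeans(X) * b)) * bx       # effect at the mean
c(APE = ape, MEM = mem)                       # close, but not equal
```

Because \(\phi\) is nonlinear, averaging \(\phi\) over the sample is not the same as evaluating \(\phi\) at the average \(x\); the gap widens with skewed regressors.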
5 Logit Model
5.1 Distributional Assumption
\[\varepsilon_i \sim \text{Logistic}(0, 1) \quad \Rightarrow \quad F_\varepsilon = \Lambda(z) = \frac{e^z}{1 + e^z}\]
\[P(y_i = 1 | x_i) = \Lambda(x_i^T\beta) = \frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)}\]
5.2 Logit Log-Likelihood
\[\ell(\beta) = \sum_{i=1}^n \left[ y_i x_i^T\beta - \log(1 + e^{x_i^T\beta}) \right]\]
This is more numerically stable: unlike the probit case, the resulting score contains no ratio such as \(\phi/\Phi\) that can approach 0/0.
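Even so, evaluating \(\log(1 + e^z)\) naively overflows for large \(z\); a standard rewrite avoids this. A minimal sketch:

```r
# Stable evaluation of log(1 + exp(z)): exp(z) overflows for large z,
# but log(1 + e^z) = z + log(1 + e^(-z)) does not.
log1pexp <- function(z) ifelse(z > 0, z + log1p(exp(-z)), log1p(exp(z)))
log(1 + exp(800))   # Inf: exp(800) overflows double precision
log1pexp(800)       # 800, correct
# Logit log-likelihood using the stable form
logit_loglik <- function(beta, X, y) {
  eta <- as.vector(X %*% beta)
  sum(y * eta - log1pexp(eta))
}
```

The `log1p` function computes \(\log(1+u)\) accurately for small \(u\), which keeps the likelihood finite even at extreme values of the index.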
5.3 Score dan Hessian Logit
Score (elegant!): \[\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^n \left[ y_i - \Lambda(x_i^T\beta) \right] x_i = \sum_{i=1}^n (y_i - \hat{p}_i) x_i\]
Notice: the score is the sum of residuals times \(x_i\). The intuition is very clear.
Hessian: \[\frac{\partial^2 \ell}{\partial \beta \partial \beta^T} = -\sum_{i=1}^n \Lambda(x_i^T\beta)(1 - \Lambda(x_i^T\beta)) x_i x_i^T = -X^T W X\]
where \(W = \text{diag}(\hat{p}_i(1-\hat{p}_i))\).
The Hessian is negative definite (because \(\hat{p}_i(1-\hat{p}_i) > 0\), provided \(X\) has full column rank), so the log-likelihood is globally concave and has a unique global maximum.
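The Hessian formula can be verified numerically by finite-differencing the score. A sketch on simulated data (the evaluation point \(\beta_0\) below is arbitrary):

```r
# Check the analytic logit Hessian -X^T W X against a central
# finite difference of the score.
set.seed(2)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))
X <- cbind(1, x)
score <- function(beta) as.vector(t(X) %*% (y - plogis(as.vector(X %*% beta))))
beta0 <- c(0.1, -0.3)                   # arbitrary evaluation point
p0 <- plogis(as.vector(X %*% beta0))
H_analytic <- -t(X) %*% (p0 * (1 - p0) * X)
h <- 1e-5
H_numeric <- sapply(1:2, function(j) {
  e <- replace(numeric(2), j, h)
  (score(beta0 + e) - score(beta0 - e)) / (2 * h)
})
max(abs(H_analytic - H_numeric))        # tiny discrepancy
```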
5.4 Odds Ratio Interpretation
The logit model has a very natural interpretation:
\[\log\frac{P(y=1)}{P(y=0)} = \log\frac{\Lambda(x^T\beta)}{1-\Lambda(x^T\beta)} = x^T\beta\]
So \(\exp(\beta_j)\) is the odds ratio: the multiplicative effect on the odds when \(x_j\) increases by 1 unit.
6 Newton-Raphson dan IRLS
Iterate to find the root of the score equation:
\[\beta^{(t+1)} = \beta^{(t)} - \left[H(\beta^{(t)})\right]^{-1} s(\beta^{(t)})\]
where \(s\) is the score and \(H\) is the Hessian.
For logit, this is equivalent to Iteratively Reweighted Least Squares (IRLS):
\[\beta^{(t+1)} = (X^T W^{(t)} X)^{-1} X^T W^{(t)} z^{(t)}\]
where \(z^{(t)} = X\beta^{(t)} + (W^{(t)})^{-1}(y - \hat{p}^{(t)})\) is the "working response" and \(W^{(t)}\) is the diagonal weight matrix.
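The IRLS update can be written in a few lines and checked against `glm()`, which uses the same algorithm internally. A minimal sketch on simulated data:

```r
# Minimal IRLS for logit, following the update above.
irls_logit <- function(X, y, tol = 1e-10, max_iter = 50) {
  beta <- rep(0, ncol(X))
  for (t in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)
    p <- plogis(eta)                    # Lambda(eta)
    w <- p * (1 - p)                    # diagonal of W
    z <- eta + (y - p) / w              # working response
    beta_new <- as.vector(solve(crossprod(X, w * X), crossprod(X, w * z)))
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}
set.seed(7)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
max(abs(irls_logit(cbind(1, x), y) - coef(fit)))  # agrees to high precision
```

Each iteration is just a weighted least squares solve, which is why GLM software is fast and stable for logit.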
7 MLE Standard Errors
Because the MLE is consistent and asymptotically normal:
\[\sqrt{n}(\hat{\beta}_{MLE} - \beta_0) \xrightarrow{d} N(0, I(\beta_0)^{-1})\]
Observed Fisher Information: \[\hat{I}(\hat{\beta}) = -\frac{\partial^2 \ell}{\partial \beta \partial \beta^T}\bigg|_{\hat{\beta}} = X^T \hat{W} X\]
Variance estimator: \[\widehat{\text{Var}}(\hat{\beta}_{MLE}) = \left[-\frac{\partial^2 \ell}{\partial \beta \partial \beta^T}\bigg|_{\hat{\beta}}\right]^{-1} = (X^T \hat{W} X)^{-1}\]
Standard errors are the square roots of the diagonal entries of this matrix.
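These formulas can be checked directly against `glm()` output. A sketch on simulated data:

```r
# Standard errors from (X^T W X)^{-1} for a logit fit.
set.seed(11)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.4 * x))
fit <- glm(y ~ x, family = binomial)
X <- model.matrix(fit)
p_hat <- fitted(fit)
W <- diag(p_hat * (1 - p_hat))
V <- solve(t(X) %*% W %*% X)            # estimated (X^T W X)^{-1}
se_manual <- sqrt(diag(V))
se_glm <- summary(fit)$coefficients[, "Std. Error"]
max(abs(se_manual - se_glm))            # essentially zero
```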
8 Tobit Model
8.1 Setup: Censored Dependent Variable
The Tobit model is used when \(y_i\) is censored from below (usually at 0):
\[y_i^* = x_i^T\beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)\] \[y_i = \max(0, y_i^*) = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \leq 0 \end{cases}\]
Examples: spending on medication (many zeros), hours worked, investment.
8.2 Tobit Log-Likelihood
The likelihood is a mixture: a discrete part for \(y_i = 0\) and a continuous part for \(y_i > 0\).
\[\ell(\beta, \sigma) = \sum_{i: y_i = 0} \log\left[1 - \Phi\left(\frac{x_i^T\beta}{\sigma}\right)\right] + \sum_{i: y_i > 0} \log\left[\frac{1}{\sigma}\phi\left(\frac{y_i - x_i^T\beta}{\sigma}\right)\right]\]
- First term: the probability that \(y_i^* \leq 0\)
- Second term: the normal density for uncensored observations
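This log-likelihood can be maximized directly with `optim()`. A sketch on simulated data, parameterizing \(\log\sigma\) so that \(\sigma\) stays positive (the true parameter values below are illustrative):

```r
# Negative Tobit log-likelihood in theta = (beta, log sigma).
tobit_negll <- function(theta, X, y) {
  k <- ncol(X)
  beta <- theta[1:k]
  sigma <- exp(theta[k + 1])
  eta <- as.vector(X %*% beta)
  cens <- y == 0
  # log P(y* <= 0) = log Phi(-eta/sigma) for censored observations,
  # log density of N(eta, sigma^2) for uncensored ones
  -sum(pnorm(-eta[cens] / sigma, log.p = TRUE)) -
    sum(dnorm(y[!cens], mean = eta[!cens], sd = sigma, log = TRUE))
}
set.seed(5)
n <- 2000
x <- rnorm(n)
y <- pmax(0, 1 + 2 * x + rnorm(n, sd = 1.5))   # true beta = (1, 2), sigma = 1.5
X <- cbind(1, x)
fit <- optim(c(0, 0, 0), tobit_negll, X = X, y = y, method = "BFGS")
fit$par[1:2]       # close to the true (1, 2)
exp(fit$par[3])    # close to the true sigma = 1.5
```

Packages such as AER (`tobit()`) wrap this same likelihood; writing it out shows where the two pieces of the mixture enter.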
8.3 Tobit Marginal Effects
The effect on \(E[y_i|x_i]\) (the unconditional mean of observed \(y\)):
\[\frac{\partial E[y_i|x_i]}{\partial x_j} = \Phi\left(\frac{x_i^T\beta}{\sigma}\right) \beta_j\]
The effect on \(E[y_i^*|x_i]\) (the latent variable): \[\frac{\partial E[y_i^*|x_i]}{\partial x_j} = \beta_j\]
So the coefficients directly measure the effect on the latent variable, not on the observed outcome.
9 Worked Example: Probit vs Logit
library(tidyverse)
library(margins)
# Generate synthetic data: binary outcome (loan default)
set.seed(42)
n <- 500
x1 <- rnorm(n, mean = 50, sd = 10) # income
x2 <- rnorm(n, mean = 3, sd = 1) # debt ratio
# True latent index
latent <- -2 + 0.03 * x1 - 0.5 * x2
prob_true <- pnorm(latent)
y <- rbinom(n, 1, prob_true)
df <- data.frame(y = y, income = x1, debt = x2)
# Fit Probit and Logit
probit_model <- glm(y ~ income + debt, data = df,
family = binomial(link = "probit"))
logit_model <- glm(y ~ income + debt, data = df,
family = binomial(link = "logit"))
# Compare coefficients
# Note: logit coefficients ~1.6-1.8x probit coefficients (theoretical factor pi/sqrt(3) ~ 1.81)
cat("Probit coefficients:\n")
print(coef(probit_model))
cat("\nLogit coefficients:\n")
print(coef(logit_model))
cat("\nRatio logit/probit (should be ~1.6-1.8):\n")
print(coef(logit_model) / coef(probit_model))
# Average Partial Effects (APE) via margins package
ape_probit <- margins(probit_model)
ape_logit <- margins(logit_model)
cat("\nAPE Probit:\n")
summary(ape_probit)
cat("\nAPE Logit:\n")
summary(ape_logit)
# Manual APE computation for probit
beta_hat <- coef(probit_model)
X <- model.matrix(probit_model)
phi_vals <- dnorm(X %*% beta_hat) # phi evaluated at fitted index
# APE for income
ape_income_manual <- mean(phi_vals * beta_hat["income"])
cat(sprintf("\nManual APE for income (probit): %.4f\n", ape_income_manual))
# Log-likelihood comparison
cat("\nLog-likelihood Probit:", logLik(probit_model))
cat("\nLog-likelihood Logit: ", logLik(logit_model))
# Predicted probabilities comparison
df$pred_probit <- predict(probit_model, type = "response")
df$pred_logit <- predict(logit_model, type = "response")
cor(df$pred_probit, df$pred_logit) # Should be very high (~0.99+)
Key Takeaway: Probit and logit produce nearly identical fitted probabilities. The coefficients differ by roughly a factor of \(\pi/\sqrt{3} \approx 1.814\) because the error distributions have different variances (standard normal variance = 1, logistic variance = \(\pi^2/3\)).
10 Probit vs Logit Comparison
| Aspect | Probit | Logit |
|---|---|---|
| Error distribution | \(N(0,1)\) | Logistic\((0,1)\) |
| Link function | \(\Phi^{-1}\) | \(\log[p/(1-p)]\) |
| Interpretation | Latent normal | Log-odds |
| Numerics | Slightly slower | Faster |
| Log-likelihood | Globally concave (harder to show) | Globally concave (direct proof) |
| Marginal effect | \(\phi(x^T\beta)\beta_j\) | \(\Lambda(1-\Lambda)\beta_j\) |
| Relative coefficients | Baseline | ~1.6-1.8× larger |
These models have many important generalizations:
- Ordered Probit/Logit: for ordinal outcomes (\(y \in \{1, 2, 3, 4, 5\}\)) with multiple threshold parameters
- Multinomial Logit: for multiple categories; the ratio of any two choice odds does not depend on the other alternatives (the IIA assumption)
- Nested Logit: relaxes IIA
- Heckman Selection Model: extends the Tobit idea with two separate equations for selection and outcome
- Bivariate Probit: two binary outcomes with correlated errors
All of them share the same structure: latent variable + MLE.
11 Practice Problems
Problem 1: Derive the marginal effect formula for probit.
Derivation: \(\frac{\partial P(y=1|x)}{\partial x_j} = \frac{\partial \Phi(x^T\beta)}{\partial x_j} = \phi(x^T\beta) \cdot \beta_j\)
This uses the chain rule: \(\frac{d}{dz}\Phi(z) = \phi(z)\) and \(\frac{\partial (x^T\beta)}{\partial x_j} = \beta_j\).
Problem 2: Show that the logit log-likelihood is globally concave.
Hessian: \(H = -\sum_i \Lambda_i(1-\Lambda_i) x_ix_i^T\)
For an arbitrary vector \(v\): \[v^T H v = -\sum_i \underbrace{\Lambda_i(1-\Lambda_i)}_{>0} \underbrace{(x_i^Tv)^2}_{\geq 0} \leq 0\]
So \(H\) is negative semi-definite, and negative definite if \(X\) has full column rank.
Problem 3: Interpret the log-odds ratio.
Model: \(\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1\)
If \(x_1\) increases by 1 unit, the log-odds increase by \(\beta_1\), meaning the odds change by a factor of \(e^{\beta_1}\).
Example: \(\beta_1 = 0.5\) gives \(e^{0.5} \approx 1.65\), so the odds increase by about 65%.
Problem 4: For the Tobit model, show that OLS (using only the \(y_i > 0\) observations) gives biased estimates.
Hint: \(E[y_i | y_i > 0, x_i] = x_i^T\beta + \sigma \frac{\phi(x_i^T\beta/\sigma)}{\Phi(x_i^T\beta/\sigma)}\): there is a selection bias term (the inverse Mills ratio).
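The bias is easy to see in a simulation (a sketch; the data-generating values are illustrative): with truncation from below, OLS on the positive observations attenuates the slope, because the omitted Mills-ratio term is decreasing in the index.

```r
# OLS on only the y > 0 observations is biased toward zero for the slope.
set.seed(9)
n <- 20000
x <- rnorm(n)
y <- pmax(0, 0.5 + 1 * x + rnorm(n))      # true slope = 1
ols_trunc <- lm(y ~ x, subset = y > 0)
coef(ols_trunc)["x"]                      # noticeably below 1
```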
Navigation: ← IV and GMM | Panel Data Math →