Hypothesis Testing
p-values, Power, and the Art of Not Being Wrong
1 Why This Matters
Hypothesis testing is everywhere in empirical work. Every time you see a p-value in regression output, you are performing a hypothesis test. Every time you see stars (\(*\), \(**\), \(***\)) in a regression table, that is the result of a hypothesis test.
But most people use it without understanding:
- What does a p-value actually measure? (And what does it NOT measure?)
- Why do researchers with larger samples almost always find "significant" results?
- What is the difference between "statistically significant" and "practically important"?
- Why is power analysis important before collecting data?
Well-understood hypothesis testing is a powerful analytical weapon. Misunderstood, it is the source of many "research findings" that cannot be replicated.
2 1. The Basic Framework
Null hypothesis \(H_0\): the statement being tested, usually "no effect" or "parameter = some specific value".
Alternative hypothesis \(H_1\) (or \(H_a\)): the statement that contradicts \(H_0\).
Test statistic \(T = T(X_1,\ldots,X_n)\): a function of the data that measures the evidence against \(H_0\).
Rejection region \(\mathcal{R}\): the values of \(T\) for which we reject \(H_0\).
Decision rule: reject \(H_0\) if \(T \in \mathcal{R}\).
Types of tests:
- One-sided: \(H_1: \theta > \theta_0\) (upper-tail) or \(\theta < \theta_0\) (lower-tail)
- Two-sided: \(H_1: \theta \neq \theta_0\)
3 2. Two Types of Error
| | \(H_0\) True | \(H_0\) False |
|---|---|---|
| Reject \(H_0\) | Type I Error (False Positive) | Correct (True Positive) |
| Fail to Reject \(H_0\) | Correct (True Negative) | Type II Error (False Negative) |
Type I error rate (significance level): \[\alpha = P(\text{reject } H_0 | H_0 \text{ true})\]
Type II error rate: \[\beta = P(\text{fail to reject } H_0 | H_0 \text{ false})\]
Power: \[\text{Power} = 1 - \beta = P(\text{reject } H_0 | H_0 \text{ false})\]
Tradeoff: lowering \(\alpha\) (being more conservative) increases \(\beta\) (lowers power), and vice versa, for fixed \(n\).
The only ways to increase power without increasing \(\alpha\) are to increase \(n\) or to use a more powerful test.
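A minimal simulation sketch of this tradeoff (the numbers \(n = 25\), true mean 0.4, and \(\sigma = 1\) are illustrative assumptions, not from the text): tightening \(\alpha\) from 0.05 to 0.01 visibly lowers the power.

```r
# Illustrative: alpha-beta tradeoff for a one-sample z-test of H0: mu = 0.
# Data are simulated with true mean 0.4, sigma = 1, n = 25 (all assumed values).
set.seed(123)
n <- 25; mu_true <- 0.4; reps <- 5000
zstats <- replicate(reps, {
  x <- rnorm(n, mean = mu_true, sd = 1)
  sqrt(n) * mean(x)                 # z = xbar / (sigma/sqrt(n)) with sigma = 1
})
power_05 <- mean(abs(zstats) > qnorm(0.975))   # reject at alpha = 0.05
power_01 <- mean(abs(zstats) > qnorm(0.995))   # reject at alpha = 0.01
cat("Power at alpha = 0.05:", power_05, "\n")
cat("Power at alpha = 0.01:", power_01, "\n")
```

Lowering \(\alpha\) raises the critical value, so fewer true effects clear the bar.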
4 3. The p-value: A Precise Definition
The p-value is the probability, under \(H_0\), of obtaining a test statistic at least as extreme as the observed value:
\[p = P(|T| \geq |t_{obs}| \mid H_0) \quad \text{(two-sided)}\] \[p = P(T \geq t_{obs} \mid H_0) \quad \text{(one-sided upper)}\]
Decision rule: reject \(H_0\) if \(p < \alpha\).
4.1 What the p-value is NOT:
- Not the probability that \(H_0\) is true: \(p \neq P(H_0 \text{ true} | \text{data})\)
- Not the probability of a Type I error (that is \(\alpha\), not \(p\))
- Not a measure of practical importance or effect size
- Not the probability of replication: \(p < 0.05\) does not mean there is a 95% chance that a replication will be significant
The correct interpretation: "If \(H_0\) is true, the probability of obtaining data like what we observed (or more extreme) is \(p\)."
Problems with p-values:
- Dependence on \(n\): with a very large \(n\), even trivially small effects become significant
- No information about effect size: \(p=0.001\) can mean a small but precisely estimated effect
- Multiple testing: if you test 20 hypotheses at \(\alpha=0.05\), you expect 1 false positive by chance alone
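As a quick sanity check of the definition above, the two-sided p-value can be computed by hand and compared with `t.test()` (the simulated data are illustrative):

```r
# Two-sided p-value from its definition vs. t.test() (simulated data)
set.seed(1)
x <- rnorm(30, mean = 0.5, sd = 1)
t_obs <- mean(x) / (sd(x) / sqrt(30))                 # observed t-statistic
p_manual <- 2 * pt(abs(t_obs), df = 29, lower.tail = FALSE)
p_builtin <- t.test(x, mu = 0)$p.value
cat("manual:", p_manual, " t.test():", p_builtin, "\n")
```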
5 4. The Neyman-Pearson Lemma
For testing simple vs simple hypotheses (\(H_0: \theta = \theta_0\) vs \(H_1: \theta = \theta_1\)), the most powerful test of size \(\alpha\) is the likelihood ratio test with rejection region:
\[\mathcal{R} = \left\{x : \frac{L(\theta_1; x)}{L(\theta_0; x)} > k\right\}\]
where \(k\) is chosen so that \(P(\mathcal{R}|H_0) = \alpha\).
In words: no other test with the same size \(\alpha\) has higher power for simple vs simple testing.
Implication: the NP lemma justifies the likelihood ratio as the fundamental approach. For composite hypotheses, we extend it to the generalized LR test.
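A hedged numeric sketch of the lemma for \(X_i \sim N(\theta, 1)\), \(H_0: \theta = 0\) vs \(H_1: \theta = 1\): the ratio \(L(1;x)/L(0;x) = \exp(\sum x_i - n/2)\) is increasing in \(\sum x_i\), so the NP test rejects for large \(\bar{x}\), which is exactly the one-sided z-test.

```r
# NP lemma sketch: the LR test for N(0,1) vs N(1,1) reduces to the z-test
set.seed(7)
n <- 20
x <- rnorm(n, mean = 0)                   # data generated under H0
lr <- exp(sum(x) - n / 2)                 # likelihood ratio L(1;x)/L(0;x)
# k chosen so "LR > k" matches "xbar > qnorm(0.95)/sqrt(n)" (alpha = 0.05)
k <- exp(sqrt(n) * qnorm(0.95) - n / 2)
reject_lr <- lr > k
reject_z  <- mean(x) > qnorm(0.95) / sqrt(n)
cat("LR =", lr, "| reject via LR:", reject_lr, "| reject via z:", reject_z, "\n")
```

The two rejection rules agree exactly, as the monotonicity argument predicts.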
6 5. The t-test
6.1 One-Sample t-test
Test \(H_0: \mu = \mu_0\) with \(\sigma^2\) unknown.
Test statistic: \[T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t(n-1) \text{ under } H_0 \text{ (if normal)}\]
Why t, not z? Because \(S\) (the sample SD) is random, unlike a known \(\sigma\). Dividing by \(S/\sqrt{n}\) instead of \(\sigma/\sqrt{n}\) introduces extra uncertainty, captured by the heavier tails of the t-distribution.
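The heavier tails show up directly in the critical values; a quick base-R check (the df values are chosen for illustration):

```r
# 97.5% critical values of t(n-1) vs the normal value
crit <- sapply(c(5, 10, 30, 100, 1000), function(n) qt(0.975, df = n - 1))
print(round(crit, 3))
cat("normal:", round(qnorm(0.975), 3), "\n")
```

The t critical value shrinks toward 1.96 as \(n\) grows, which is why the distinction matters most for small samples.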
6.2 Two-Sample t-test
Test \(H_0: \mu_1 = \mu_2\) (equality of the means of two populations):
Equal variances (pooled): \[T = \frac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{1/n_1 + 1/n_2}} \sim t(n_1+n_2-2)\]
where \(S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}\) (the pooled variance).
Welch’s t-test (unequal variances, preferred in practice): \[T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}\]
with degrees of freedom given by the Welch-Satterthwaite approximation.
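A small sketch comparing the two versions on simulated data with strongly unequal variances (the sample sizes and SDs are illustrative assumptions): the pooled test keeps \(df = n_1+n_2-2\), while Welch's approximate df drops.

```r
# Pooled vs Welch two-sample t-test under unequal variances (simulated)
set.seed(42)
g1 <- rnorm(30, mean = 0, sd = 1)
g2 <- rnorm(100, mean = 0, sd = 5)
pooled <- t.test(g1, g2, var.equal = TRUE)   # assumes equal variances
welch  <- t.test(g1, g2)                     # Welch is R's default
cat("pooled df:", pooled$parameter, "| Welch df:", welch$parameter, "\n")
```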
7 6. The F-test for Joint Hypotheses
For testing multiple restrictions simultaneously:
Model: \(y = X\beta + \varepsilon\), \(n\) obs, \(k\) parameters.
Test \(H_0: R\beta = r\), a set of \(m\) linear restrictions.
F-statistic: \[F = \frac{(R\hat{\beta}-r)^T\left[R(X^TX)^{-1}R^T\right]^{-1}(R\hat{\beta}-r)/m}{\hat{\sigma}^2} \sim F(m, n-k) \text{ under } H_0\]
Alternative formula (Restricted vs Unrestricted): \[F = \frac{(SSE_R - SSE_U)/m}{SSE_U/(n-k)}\]
Examples:
- Test \(\beta_2 = \beta_3 = 0\): \(R = \begin{pmatrix}0&1&0\\0&0&1\end{pmatrix}\), \(r = \begin{pmatrix}0\\0\end{pmatrix}\), \(m=2\)
- Test overall significance: \(H_0: \beta_2 = \cdots = \beta_k = 0\) (all slopes = 0)
- Test \(\beta_2 = \beta_3\): \(R = (0, 1, -1)\), \(r = 0\), \(m=1\)
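The first example (\(\beta_2=\beta_3=0\)) can be checked numerically; here is a sketch on simulated data (coefficients and \(n\) are illustrative) computing \(F\) both from the restricted-vs-unrestricted SSE formula and via `anova()`:

```r
# Joint F-test of two exclusion restrictions, manual vs anova() (simulated)
set.seed(99)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 0.5 * x1 + rnorm(n)                # x2, x3 truly irrelevant
m_u <- lm(y ~ x1 + x2 + x3)                 # unrestricted, k = 4
m_r <- lm(y ~ x1)                           # restricted: beta2 = beta3 = 0, m = 2
sse_u <- sum(resid(m_u)^2); sse_r <- sum(resid(m_r)^2)
F_manual <- ((sse_r - sse_u) / 2) / (sse_u / (n - 4))
F_anova  <- anova(m_r, m_u)$F[2]
cat("manual F:", F_manual, "| anova F:", F_anova, "\n")
```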
8 7. LR, Wald, and Score Tests: The Trinity
For testing \(H_0: \theta = \theta_0\) (or more generally, \(m\) restrictions), there are three asymptotically equivalent tests:
Wald Test: \[W = (R\hat{\theta} - r)^T \left[R\hat{V}R^T\right]^{-1}(R\hat{\theta} - r) \xrightarrow{d} \chi^2(m)\]
where \(\hat{V}\) is the estimated asymptotic variance of \(\hat{\theta}\).
Likelihood Ratio (LR) Test: \[LR = -2[\ell(\hat{\theta}_0) - \ell(\hat{\theta})] \xrightarrow{d} \chi^2(m)\]
where \(\hat{\theta}_0\) is the MLE under \(H_0\) (restricted) and \(\hat{\theta}\) is the unrestricted MLE.
Score (Lagrange Multiplier, LM) Test: \[LM = s(\hat{\theta}_0)^T I(\hat{\theta}_0)^{-1} s(\hat{\theta}_0) \xrightarrow{d} \chi^2(m)\]
where \(s(\hat{\theta}_0)\) is the score function evaluated at the restricted estimator.
All three are asymptotically equivalent under \(H_0\) and local alternatives. But:
| Test | Requires | Advantage |
|---|---|---|
| Wald | Unrestricted estimation only | Simplest |
| LR | Restricted + unrestricted estimation | Most common; natural for model comparison |
| Score/LM | Restricted estimation only | Efficient when the restricted model is easy to estimate |
Practical implications:
- Wald: the t-test and F-test in OLS are Wald tests
- LR: AIC = \(-2\ell + 2k\) is built from the log-likelihood; the LR test is the standard tool for comparing nested models
- LM: the Breusch-Pagan test for heteroskedasticity and the LM test for autocorrelation are all LM (score) tests
9 8. Power Analysis and Sample Size
Power function: \(\pi(\theta) = P(\text{reject } H_0 | \theta)\)
For fixed \(\alpha\):
- \(\pi(\theta_0) = \alpha\) (the size of the test, by definition)
- \(\pi(\theta)\) for \(\theta \neq \theta_0\) is the power, which should ideally be close to 1
Factors affecting power:
1. Effect size (the distance between \(\theta_0\) and the true \(\theta\))
2. Sample size \(n\) (power increases with \(n\))
3. Variance \(\sigma^2\) (power falls as the variance rises)
4. Significance level \(\alpha\) (power rises with a larger \(\alpha\))
Power formula for a one-sample z-test (\(H_0: \mu = \mu_0\) vs \(H_1: \mu = \mu_1 > \mu_0\)): \[\text{Power} = P\left(Z > z_\alpha - \frac{(\mu_1-\mu_0)\sqrt{n}}{\sigma}\right)\]
Sample size formula for desired power \(1-\beta\), two-sided test: \[n \geq \left(\frac{(z_{\alpha/2} + z_\beta)\sigma}{\delta}\right)^2\]
where \(\delta = |\mu_1 - \mu_0|\) is the minimum detectable effect.
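The formula can be checked against base R's `power.t.test()`, which solves the exact t-distribution version and therefore returns a slightly larger \(n\) than the normal approximation (\(\delta\) and \(\sigma\) below are illustrative values):

```r
# Normal-approximation sample size vs exact power.t.test() (one-sample)
alpha <- 0.05; beta <- 0.20
delta <- 0.2; sigma <- 1.5
n_formula <- ((qnorm(1 - alpha / 2) + qnorm(1 - beta)) * sigma / delta)^2
n_exact <- power.t.test(delta = delta, sd = sigma, sig.level = alpha,
                        power = 1 - beta, type = "one.sample")$n
cat("z-formula:", ceiling(n_formula), "| power.t.test:", ceiling(n_exact), "\n")
```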
10 9. Multiple Testing
If you perform \(m\) independent tests, each at level \(\alpha\):
Family-wise error rate (FWER) = probability of at least one false positive: \[P(\text{at least one false positive}) = 1 - (1-\alpha)^m\]
For \(m=20\), \(\alpha=0.05\): FWER = \(1-(0.95)^{20} \approx 0.64\)!
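A quick simulation confirming this number (the 10,000 replications are an arbitrary choice):

```r
# FWER for 20 independent true-null tests at alpha = 0.05
set.seed(1)
fwer_sim <- mean(replicate(10000, any(runif(20) < 0.05)))  # p ~ U(0,1) under H0
fwer_theory <- 1 - 0.95^20
cat("simulated:", fwer_sim, "| theoretical:", round(fwer_theory, 3), "\n")
```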
Bonferroni correction: test each hypothesis at level \(\alpha/m\). This controls the FWER at \(\alpha\).
Benjamini-Hochberg (BH): controls the False Discovery Rate (FDR), the expected fraction of false positives among the rejections. Less conservative than Bonferroni.
In practice: when you test many coefficients in a regression, or when you "search" for significant variables, multiple testing is a serious concern.
11 10. Worked Example: Full t-test and F-test
Scenario: a dataset on the effect of education (\(edu\)) and experience (\(exp\)) on log-wage. \(n = 526\).
Two-sample t-test: is the average wage for women different from that for men?
# Data setup (using wage1 from wooldridge package)
if(!require(wooldridge)) install.packages("wooldridge")
library(wooldridge)
data("wage1")
# Two-sample t-test: wage by gender
t.test(wage ~ female, data=wage1, alternative="two.sided")
Output:
t = 9.24, df = 517.95, p-value < 2.2e-16
95% CI: [1.69, 2.59]
mean in group 0 (male): 7.10, mean in group 1 (female): 4.95
Males earn significantly more (\(t=9.24\), \(p<0.001\)). Wage gap ≈ $2.15/hour.
Multiple regression + F-test:
# Full regression
model <- lm(lwage ~ educ + exper + expersq + female + married,
data=wage1)
summary(model)
# F-test: joint significance of education AND experience
library(car)
linearHypothesis(model, c("educ = 0", "exper = 0", "expersq = 0"))
# Manual F-test: restricted vs unrestricted
model_r <- lm(lwage ~ female + married, data=wage1) # Restricted
model_u <- model # Unrestricted
anova(model_r, model_u)
# F(3, 520) = ?, p = ?
# Power analysis: how many obs needed to detect educ effect?
library(pwr)
# Assume standardized effect size delta/sigma for education
# Using observed: beta_edu ≈ 0.09, residual sigma ≈ 0.45 (rough estimates)
delta <- 0.09 # effect size per year of education
sigma <- 0.45 # residual SD estimate
n_needed <- pwr.t.test(
d = delta/sigma, # standardized effect
sig.level = 0.05,
power = 0.80,
alternative = "two.sided"
)$n
cat("Sample needed for 80% power:", ceiling(n_needed), "\n")
LR Test for nested models:
# Compare logistic regression models
data("mroz", package="wooldridge")
# Model 1: baseline
logit_r <- glm(inlf ~ age + kidslt6, data=mroz, family=binomial)
# Model 2: with education
logit_u <- glm(inlf ~ age + kidslt6 + educ + exper,
data=mroz, family=binomial)
# LR test
lr_stat <- as.numeric(-2 * (logLik(logit_r) - logLik(logit_u)))  # drop logLik attributes
df_diff <- length(coef(logit_u)) - length(coef(logit_r))
p_lr <- pchisq(lr_stat, df=df_diff, lower.tail=FALSE)
cat(sprintf("LR stat = %.2f, df = %d, p = %.4f\n", lr_stat, df_diff, p_lr))
# Or use anova()
anova(logit_r, logit_u, test="LRT")
12 11. Connections to Econometrics
Hausman test (specification test): a test for endogeneity / model misspecification. Under \(H_0\) both OLS and IV are consistent (and OLS is efficient); under \(H_1\) only IV is consistent. Hausman statistic \(= (\hat{\beta}_{IV} - \hat{\beta}_{OLS})^T[\text{Var}(\hat{\beta}_{IV}) - \text{Var}(\hat{\beta}_{OLS})]^{-1}(\hat{\beta}_{IV} - \hat{\beta}_{OLS}) \sim \chi^2(k)\), a Wald-type test.
Heteroskedasticity tests (Breusch-Pagan, White): LM (score) tests. Run an auxiliary regression of the squared residuals on the original regressors and test \(nR^2 \sim \chi^2(k)\).
Autocorrelation test (Breusch-Godfrey): also an LM test, for serial correlation in the residuals.
Chow test for a structural break: an F-test comparing the restricted model (common coefficients) against the unrestricted one (separate coefficients per period).
Overidentification test in IV/GMM: the Sargan-Hansen J-test for overidentifying restrictions is a Wald-type quadratic form in the sample moment conditions.
13 12. R Code: Power and Multiple Testing
library(pwr)
# ============================================================
# POWER ANALYSIS
# ============================================================
# One-sample t-test
# H0: mu = 0 vs H1: mu != 0
# Effect size d = mu/sigma (Cohen's d)
# Power as function of n (for d=0.5, alpha=0.05)
n_range <- seq(10, 200, by=5)
power_vals <- sapply(n_range, function(n) {
pwr.t.test(n=n, d=0.5, sig.level=0.05, alternative="two.sided")$power
})
plot(n_range, power_vals, type="l", lwd=2, col="steelblue",
xlab="Sample Size n", ylab="Power",
main="Power vs Sample Size (d=0.5, alpha=0.05)")
abline(h=0.8, col="red", lty=2) # 80% power convention
abline(h=0.9, col="orange", lty=2) # 90% power
legend("bottomright",
c("Power curve", "80% power", "90% power"),
col=c("steelblue", "red", "orange"),
lwd=c(2,1,1), lty=c(1,2,2))
# Sample size for 80% power
n_for_80 <- ceiling(pwr.t.test(d=0.5, sig.level=0.05,
power=0.80, alternative="two.sided")$n)
cat("n needed for 80% power:", n_for_80, "\n")
# ============================================================
# MULTIPLE TESTING
# ============================================================
set.seed(2024)
# Simulate 100 tests where H0 is TRUE for all
n_tests <- 100
p_values_null <- runif(n_tests) # Under H0, p-values are Uniform(0,1)
# Add 10 true alternatives
n_true <- 10
p_values_alt <- c(p_values_null[1:(n_tests-n_true)],
rbeta(n_true, 0.5, 10)) # Small p-values for true effects
alpha <- 0.05
# Unadjusted
rejected_unadj <- p_values_alt < alpha
# Bonferroni
rejected_bonf <- p_values_alt < alpha/n_tests
# BH procedure
p_sorted <- sort(p_values_alt)
bh_threshold <- (1:n_tests)/n_tests * alpha
passes <- p_sorted <= bh_threshold
bh_cutoff <- if (any(passes)) max(p_sorted[passes]) else 0  # avoid -Inf when nothing passes
rejected_bh <- p_values_alt <= bh_cutoff
cat("\n=== MULTIPLE TESTING COMPARISON ===\n")
cat(sprintf("Unadjusted: %d rejections\n", sum(rejected_unadj)))
cat(sprintf("Bonferroni: %d rejections\n", sum(rejected_bonf)))
cat(sprintf("BH (FDR): %d rejections\n", sum(rejected_bh)))
# Using p.adjust
p_bonf <- p.adjust(p_values_alt, method="bonferroni")
p_bh <- p.adjust(p_values_alt, method="BH")
cat(sprintf("\nBonferroni (p.adjust): %d rejections\n", sum(p_bonf < alpha)))
cat(sprintf("BH (p.adjust): %d rejections\n", sum(p_bh < alpha)))
14 Practice Problems
Problem 1: p-value interpretation.
A study measures the effect of a treatment on an outcome and finds \(p = 0.03\). Which statement is correct?
- (a) There is a 3% probability that \(H_0\) is true
- (b) There is a 3% probability of obtaining data this extreme if \(H_0\) is true
- (c) The treatment has a 97% chance of being effective
- (d) If the study were repeated, 97% of replications would be significant
Answer: only (b) is correct. (a), (c), and (d) are all misinterpretations of the p-value.
Problem 2: Type I vs Type II.
A medical test for a rare disease (\(P(\text{disease}) = 0.001\)) has sensitivity 99% (\(1-\beta = 0.99\)) and specificity 95% (\(1-\alpha = 0.95\), so \(\alpha = 0.05\)).
- Compute \(P(\text{disease} | \text{positive test})\) using Bayes' theorem
- Interpret the result
Answer: \(P(+ | D) = 0.99\), \(P(+ | \bar{D}) = 0.05\). \(P(D|+) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} = \frac{0.00099}{0.00099+0.04995} \approx 0.0194\)
Only about 2% of positive tests are true positives! This happens because the prevalence is very low: false positives from the healthy majority dominate.
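The computation above as a short R check:

```r
# Bayes' theorem for the rare-disease test
prev <- 0.001; sens <- 0.99; spec <- 0.95
p_pos <- sens * prev + (1 - spec) * (1 - prev)   # P(positive)
ppv <- sens * prev / p_pos                        # P(disease | positive)
cat("P(disease | positive) =", round(ppv, 4), "\n")
```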
Problem 3: Wald, LR, Score equivalence.
For large \(n\), the three tests should give very similar results. Verify numerically:
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 1 + 0.3*x + rnorm(n) # True effect = 0.3
# Test H0: beta_x = 0
model <- lm(y ~ x)
b <- coef(model)[2]
se <- sqrt(vcov(model)[2,2])
n_params <- 2
# Wald test
wald <- (b/se)^2 # F-statistic = t^2
# LR test
model_r <- lm(y ~ 1)
lr <- -2*(logLik(model_r) - logLik(model))
# Score (LM) test: regress the restricted residuals on the full set of
# regressors; LM = n * R^2 from this auxiliary regression
u_r <- resid(model_r)
score <- n * summary(lm(u_r ~ x))$r.squared
cat("Wald:", wald, "\n")
cat("LR:", as.numeric(lr), "\n")
cat("Score:", score, "\n")
Problem 4: Power calculation.
You want to detect an effect of \(\delta = 0.2\) units on an outcome with \(\sigma = 1.5\). With \(\alpha = 0.05\) (two-sided), what \(n\) do you need for: (a) 50% power? (b) 80% power? (c) 95% power?
Interpret the results.
Answer: \(d = 0.2/1.5 \approx 0.133\). For a one-sample t-test: (a) \(n \approx 217\), (b) \(n \approx 442\), (c) \(n \approx 736\)
Detecting a small effect requires a very large sample!
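A sketch of the computation with base R's `power.t.test()` (one-sample, two-sided); the exact values differ slightly from the rounded answers above depending on whether \(d\) is rounded before use:

```r
# Required n for 50%, 80%, 95% power (one-sample t-test, two-sided)
ns <- sapply(c(0.50, 0.80, 0.95), function(pw) {
  power.t.test(delta = 0.2, sd = 1.5, sig.level = 0.05,
               power = pw, type = "one.sample")$n
})
cat("n for 50/80/95% power:", ceiling(ns), "\n")
```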