Hypothesis Testing

p-value, Power, and the Art of Not Being Wrong

statistics
hypothesis-testing
inference
A rigorous hypothesis testing framework: Type I/II errors, power, p-values, Neyman-Pearson, the t-test, the F-test, and the LR/Wald/Score trinity.

1 Why Does This Matter?

Note: Why This Matters for Your Work

Hypothesis testing is everywhere in empirical work. Every time you see a p-value in regression output, you are running a hypothesis test. Every time you see stars (\(*\), \(**\), \(***\)) in a regression table, those are the results of hypothesis tests.

But most people use it without understanding:

  • What does the p-value actually measure? (And what does it NOT measure?)
  • Why do researchers with larger samples almost always find "significant" results?
  • What is the difference between "statistically significant" and "practically important"?
  • Why is power analysis important before collecting data?

Understood well, hypothesis testing is a powerful analytical weapon. Misunderstood, it is the source of many "research findings" that cannot be replicated.


2 1. The Basic Framework

Important Definition: Hypothesis Testing Framework

Null hypothesis \(H_0\): the statement being tested, usually "no effect" or "parameter = some specific value".

Alternative hypothesis \(H_1\) (or \(H_a\)): the contradiction of \(H_0\).

Test statistic \(T = T(X_1,\ldots,X_n)\): a function of the data that measures the evidence against \(H_0\).

Rejection region \(\mathcal{R}\): the values of \(T\) for which we reject \(H_0\).

Decision rule: Reject \(H_0\) if \(T \in \mathcal{R}\).

Types of test:

  • One-sided: \(H_1: \theta > \theta_0\) (upper-tail) or \(H_1: \theta < \theta_0\) (lower-tail)
  • Two-sided: \(H_1: \theta \neq \theta_0\)
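The decision rule above can be sketched in a few lines of base R (a minimal, hypothetical example: a one-sample z-test with \(\sigma\) treated as known, on simulated data):

```r
# Hypothetical one-sample z-test of H0: mu = 0 vs two-sided H1,
# with sigma = 1 treated as known (simulated data).
set.seed(1)
x <- rnorm(30, mean = 0.5, sd = 1)        # data drawn under an alternative

t_obs  <- mean(x) / (1 / sqrt(length(x))) # test statistic T
alpha  <- 0.05
crit   <- qnorm(1 - alpha / 2)            # rejection region: |T| > crit
reject <- abs(t_obs) > crit               # decision rule: is T in R?
p_val  <- 2 * pnorm(-abs(t_obs))          # two-sided p-value
```

Note that "reject" and "p_val < alpha" are the same decision expressed two ways, which is the link to the p-value section below.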


3 2. Two Types of Errors

Important Definition: Type I and Type II Errors
| Decision | \(H_0\) True | \(H_0\) False |
|---|---|---|
| Reject \(H_0\) | Type I Error (False Positive) | Correct (True Positive) |
| Fail to Reject \(H_0\) | Correct (True Negative) | Type II Error (False Negative) |

Type I error rate (significance level): \[\alpha = P(\text{reject } H_0 | H_0 \text{ true})\]

Type II error rate: \[\beta = P(\text{fail to reject } H_0 | H_0 \text{ false})\]

Power: \[\text{Power} = 1 - \beta = P(\text{reject } H_0 | H_0 \text{ false})\]

Tradeoff: Lowering \(\alpha\) (being more conservative) increases \(\beta\) (lowers power), and vice versa, for fixed \(n\).

The only ways to increase power without increasing \(\alpha\) are to increase \(n\) or to use a more powerful test.
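A quick Monte Carlo check of this tradeoff (a sketch; the sample size, effect size, and replication count are arbitrary choices):

```r
# Monte Carlo estimate of the size and power of a one-sample t-test,
# at two significance levels, for fixed n (simulated normal data).
set.seed(123)
reps <- 2000; n <- 30; effect <- 0.5

reject_rate <- function(mu, alpha) {
  mean(replicate(reps, t.test(rnorm(n, mean = mu))$p.value < alpha))
}

a05_size  <- reject_rate(0, 0.05)       # close to alpha = 0.05 under H0
a05_power <- reject_rate(effect, 0.05)  # power at alpha = 0.05
a01_size  <- reject_rate(0, 0.01)       # stricter alpha: smaller size ...
a01_power <- reject_rate(effect, 0.01)  # ... but lower power as well
```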


4 3. p-value — A Precise Definition

Important Definition: p-value

The p-value is the probability, under \(H_0\), of obtaining a test statistic at least as extreme as the observed value:

\[p = P(|T| \geq |t_{obs}| \mid H_0) \quad \text{(two-sided)}\] \[p = P(T \geq t_{obs} \mid H_0) \quad \text{(one-sided upper)}\]

Decision rule: Reject \(H_0\) if \(p < \alpha\).

4.1 What the p-value is NOT:

  1. Not the probability that \(H_0\) is true: \(p \neq P(H_0 \text{ true} | \text{data})\)
  2. Not the probability of a Type I error (that is \(\alpha\), not \(p\))
  3. Not a measure of practical importance or effect size
  4. Not the probability of replication: \(p < 0.05\) does not mean a replication has a 95% chance of being significant

The correct interpretation: "If \(H_0\) is true, the probability of obtaining data like what we observed (or more extreme) is \(p\)."

Problems with p-values:

  • They depend on \(n\): with a very large \(n\), even trivially small effects become significant
  • They do not measure effect size: \(p=0.001\) can mean a small but precisely estimated effect
  • Multiple testing: if you test 20 hypotheses at \(\alpha=0.05\), you should expect 1 false positive by chance alone
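The first problem — p-values shrinking mechanically with \(n\) — is easy to demonstrate (a sketch with a simulated, practically trivial true effect of 0.02 SD):

```r
# A trivially small true effect (0.02 SD) versus sample size:
# the same effect that looks like noise at n = 200 becomes
# "highly significant" at n = 200,000.
set.seed(7)
tiny <- 0.02
p_small <- t.test(rnorm(200,    mean = tiny))$p.value  # power is only ~6% here
p_large <- t.test(rnorm(200000, mean = tiny))$p.value  # power is essentially 1 here
```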


5 4. Neyman-Pearson Lemma

Important Definition: Neyman-Pearson Lemma

For testing simple vs simple (\(H_0: \theta = \theta_0\) vs \(H_1: \theta = \theta_1\)), the most powerful test of size \(\alpha\) is the likelihood ratio test with rejection region:

\[\mathcal{R} = \left\{x : \frac{L(\theta_1; x)}{L(\theta_0; x)} > k\right\}\]

where \(k\) is chosen so that \(P(\mathcal{R}|H_0) = \alpha\).

In words: no other test with the same size \(\alpha\) has higher power for simple-vs-simple testing.

Implication: the NP lemma justifies the likelihood ratio as the fundamental approach. For composite hypotheses, we extend it to the generalized LR test.
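A small sketch of the NP construction for the Gaussian case (illustrative choices of \(\theta_0 = 0\), \(\theta_1 = 1\), \(n = 25\), not from the text):

```r
# NP lemma for X_i ~ N(theta, 1), H0: theta = 0 vs H1: theta = 1.
# The log likelihood ratio is increasing in xbar, so "LR > k" is
# equivalent to "xbar > c" -- the MP test is the one-sided z-test.
n <- 25; theta0 <- 0; theta1 <- 1; alpha <- 0.05

log_lr <- function(xbar) {   # log [ L(theta1; x) / L(theta0; x) ]
  n * (xbar * (theta1 - theta0) - (theta1^2 - theta0^2) / 2)
}

c_cut <- qnorm(1 - alpha, mean = theta0, sd = 1 / sqrt(n))  # size-alpha cutoff on xbar
k     <- exp(log_lr(c_cut))  # the NP constant k implied by that cutoff
```

The monotonicity of `log_lr` in `xbar` is exactly why choosing \(k\) and choosing the z-test cutoff are the same thing.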


6 5. t-test

6.1 One-Sample t-test

Test \(H_0: \mu = \mu_0\) with \(\sigma^2\) unknown.

Test statistic: \[T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t(n-1) \text{ under } H_0 \text{ (if normal)}\]

Why t, not z? Because \(S\) (the sample SD) is random, not a known \(\sigma\). Dividing by \(S/\sqrt{n}\) instead of \(\sigma/\sqrt{n}\) introduces extra uncertainty, which is captured by the heavier tails of the t-distribution.
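This can be seen directly in simulation (a sketch; \(n = 5\) is deliberately tiny to exaggerate the effect):

```r
# Under H0 with sigma unknown, compare rejection rates using normal vs
# t critical values: the z cutoff over-rejects badly at small n.
set.seed(99)
n <- 5; reps <- 20000
tstats <- replicate(reps, {
  x <- rnorm(n)                  # H0 true: mu = 0, sigma unknown
  mean(x) / (sd(x) / sqrt(n))
})
rej_z <- mean(abs(tstats) > qnorm(0.975))           # nominal 5%, actually ~12%
rej_t <- mean(abs(tstats) > qt(0.975, df = n - 1))  # close to the nominal 5%
```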

6.2 Two-Sample t-test

Test \(H_0: \mu_1 = \mu_2\) (equality of means across two populations):

Equal variances (pooled): \[T = \frac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{1/n_1 + 1/n_2}} \sim t(n_1+n_2-2)\]

where \(S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}\) is the pooled variance.

Welch’s t-test (unequal variances, preferred in practice): \[T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}\]

with degrees of freedom given by the Welch-Satterthwaite approximation.
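Why Welch is preferred in practice can be shown by simulation (a sketch with deliberately unequal variances and group sizes; \(H_0\) is true throughout):

```r
# Pooled vs Welch t-test when the smaller group has the larger variance:
# the pooled test's actual Type I error rate drifts well above 5%.
set.seed(11)
reps <- 4000
p_pool <- p_welch <- numeric(reps)
for (i in seq_len(reps)) {
  x <- rnorm(10, sd = 3)   # small group, large variance
  y <- rnorm(50, sd = 1)   # large group, small variance (same mean: H0 true)
  p_pool[i]  <- t.test(x, y, var.equal = TRUE)$p.value
  p_welch[i] <- t.test(x, y, var.equal = FALSE)$p.value
}
size_pool  <- mean(p_pool  < 0.05)   # badly inflated
size_welch <- mean(p_welch < 0.05)   # near the nominal 5%
```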


7 6. The F-test for Joint Hypotheses

For testing multiple restrictions simultaneously:

Important Definition: F-test for a Joint Hypothesis

Model: \(y = X\beta + \varepsilon\), with \(n\) observations and \(k\) parameters.

Test \(H_0: R\beta = r\), a set of \(m\) linear restrictions.

F-statistic: \[F = \frac{(R\hat{\beta}-r)^T\left[R(X^TX)^{-1}R^T\right]^{-1}(R\hat{\beta}-r)/m}{\hat{\sigma}^2} \sim F(m, n-k) \text{ under } H_0\]

Alternative formula (Restricted vs Unrestricted): \[F = \frac{(SSE_R - SSE_U)/m}{SSE_U/(n-k)}\]

Examples:

  • Test \(\beta_2 = \beta_3 = 0\): \(R = \begin{pmatrix}0&1&0\\0&0&1\end{pmatrix}\), \(r = \begin{pmatrix}0\\0\end{pmatrix}\), \(m=2\)
  • Test overall significance: \(H_0: \beta_2 = \cdots = \beta_k = 0\) (all slopes = 0)
  • Test \(\beta_2 = \beta_3\): \(R = (0, 1, -1)\), \(r = 0\), \(m=1\)
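The first example can be computed by hand on simulated data and cross-checked against the restricted-vs-unrestricted formula (a sketch; the data-generating process is illustrative, not from the text):

```r
# F-test of H0: beta2 = beta3 = 0 via the R-matrix formula, verified
# against the SSE_R vs SSE_U version computed by anova().
set.seed(5)
n <- 200
x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.4 * x2 + rnorm(n)            # beta3 = 0 in truth

X  <- cbind(1, x2, x3)
bh <- solve(t(X) %*% X, t(X) %*% y)      # OLS estimate
k  <- ncol(X)
s2 <- sum((y - X %*% bh)^2) / (n - k)    # sigma^2 hat = SSE_U / (n - k)

R <- rbind(c(0, 1, 0),
           c(0, 0, 1))
r <- c(0, 0); m <- nrow(R)
d <- R %*% bh - r
F_wald <- drop(t(d) %*% solve(R %*% solve(t(X) %*% X) %*% t(R), d)) / (m * s2)

F_anova <- anova(lm(y ~ 1), lm(y ~ x2 + x3))$F[2]  # restricted vs unrestricted
```

In OLS the two formulas are algebraically identical, so `F_wald` and `F_anova` agree to machine precision.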


8 7. LR, Wald, and Score Tests — The Trinity

Important Definition: The Asymptotic Test Trinity

For testing \(H_0: \theta = \theta_0\) (or, more generally, \(m\) restrictions), there are three asymptotically equivalent tests:

Wald Test: \[W = (R\hat{\theta} - r)^T \left[R\hat{V}R^T\right]^{-1}(R\hat{\theta} - r) \xrightarrow{d} \chi^2(m)\]

where \(\hat{V}\) is the estimated asymptotic variance of \(\hat{\theta}\).


Likelihood Ratio (LR) Test: \[LR = -2[\ell(\hat{\theta}_0) - \ell(\hat{\theta})] \xrightarrow{d} \chi^2(m)\]

where \(\hat{\theta}_0\) is the MLE under \(H_0\) (restricted) and \(\hat{\theta}\) is the unrestricted MLE.


Score (Lagrange Multiplier, LM) Test: \[LM = s(\hat{\theta}_0)^T I(\hat{\theta}_0)^{-1} s(\hat{\theta}_0) \xrightarrow{d} \chi^2(m)\]

where \(s(\hat{\theta}_0)\) is the score function evaluated at the restricted estimator.

All three are asymptotically equivalent under \(H_0\) and local alternatives. But:

| Test | Requires | Advantage |
|---|---|---|
| Wald | Unrestricted estimation only | Simplest |
| LR | Restricted + unrestricted estimation | Most common; natural for model comparison |
| Score/LM | Restricted estimation only | Efficient when the restricted model is easy to fit |

Practical implications:

  • Wald: the t-tests and F-tests in OLS output are Wald tests
  • LR: AIC = \(-2\ell + 2k\) is built from the same log-likelihoods as the LR test; common for model comparison
  • LM: the Breusch-Pagan test for heteroskedasticity and the LM test for autocorrelation are both LM (Score) tests


9 8. Power Analysis and Sample Size

Important Definition: Power Function

Power function: \(\pi(\theta) = P(\text{reject } H_0 | \theta)\)

For fixed \(\alpha\):

  • \(\pi(\theta_0) = \alpha\) (the size of the test, by definition)
  • \(\pi(\theta)\) for \(\theta \neq \theta_0\) is the power — ideally close to 1

Factors affecting power:

  1. Effect size (the distance between \(\theta_0\) and the true \(\theta\))
  2. Sample size \(n\) (power increases with \(n\))
  3. Variance \(\sigma^2\) (power falls as the variance rises)
  4. Significance level \(\alpha\) (power rises with a larger \(\alpha\))

Power formula for the one-sample z-test (\(H_0: \mu = \mu_0\) vs \(H_1: \mu = \mu_1 > \mu_0\)): \[\text{Power} = P\left(Z > z_\alpha - \frac{(\mu_1-\mu_0)\sqrt{n}}{\sigma}\right)\]

Sample size formula for desired power \(1-\beta\) in a two-sided test: \[n \geq \left(\frac{(z_{\alpha/2} + z_\beta)\sigma}{\delta}\right)^2\]

where \(\delta = |\mu_1 - \mu_0|\) is the minimum detectable effect.
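A quick check of this formula against base R's power.t.test, with illustrative values (\(\delta = 0.2\), \(\sigma = 1.5\)):

```r
# z-based sample size formula vs the exact t-based answer from power.t.test.
# The z formula slightly understates n because it ignores the t correction.
alpha <- 0.05; pow <- 0.80
delta <- 0.2;  sigma <- 1.5

n_z <- ((qnorm(1 - alpha / 2) + qnorm(pow)) * sigma / delta)^2
n_t <- power.t.test(delta = delta, sd = sigma, sig.level = alpha,
                    power = pow, type = "one.sample",
                    alternative = "two.sided")$n
```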


10 9. Multiple Testing

Important Definition: The Multiple Testing Problem

If you run \(m\) independent tests, each at level \(\alpha\):

Family-wise error rate (FWER) = probability of at least one false positive: \[P(\text{at least one false positive}) = 1 - (1-\alpha)^m\]

For \(m=20\) and \(\alpha=0.05\): FWER = \(1-(0.95)^{20} \approx 0.64\)!

Bonferroni correction: Test each hypothesis at level \(\alpha/m\). Controls FWER at \(\alpha\).

Benjamini-Hochberg (BH): Controls False Discovery Rate (FDR) — expected fraction of false positives among rejections. Less conservative than Bonferroni.

In practice: whenever you test many coefficients in a regression, or "search" for significant variables, multiple testing is a serious concern.


11 10. Worked Example: Full t-test and F-test

Scenario: a dataset on the effect of education (\(edu\)) and experience (\(exp\)) on log-wage, with \(n = 526\).

Two-sample t-test: is the average wage for women different from that for men?

# Data setup (using wage1 from wooldridge package)
if(!require(wooldridge)) install.packages("wooldridge")
library(wooldridge)
data("wage1")

# Two-sample t-test: wage by gender
t.test(wage ~ female, data=wage1, alternative="two.sided")

Output:

t = 9.24, df = 517.95, p-value < 2.2e-16
95% CI: [1.69, 2.59]
mean in group 0 (male): 7.10, mean in group 1 (female): 4.95

Males earn significantly more (\(t=9.24\), \(p<0.001\)). Wage gap ≈ $2.15/hour.

Multiple regression + F-test:

# Full regression
model <- lm(lwage ~ educ + exper + expersq + female + married,
            data=wage1)
summary(model)

# F-test: joint significance of education AND experience
library(car)
linearHypothesis(model, c("educ = 0", "exper = 0", "expersq = 0"))

# Manual F-test: restricted vs unrestricted
model_r <- lm(lwage ~ female + married, data=wage1)  # Restricted
model_u <- model  # Unrestricted

anova(model_r, model_u)
# F(3, 520) = ?, p = ?

# Power analysis: how many obs needed to detect educ effect?
library(pwr)
# Assume standardized effect size delta/sigma for education
# Using observed: beta_edu ≈ 0.09, sigma ≈ 0.45 (rough estimates)
delta <- 0.09   # effect size per year of education
sigma <- 0.45   # residual SD estimate
n_needed <- pwr.t.test(
  d = delta/sigma,  # standardized effect
  sig.level = 0.05,
  power = 0.80,
  alternative = "two.sided"
)$n
cat("Sample needed for 80% power:", ceiling(n_needed), "\n")

LR Test for nested models:

# Compare logistic regression models
data("mroz", package="wooldridge")

# Model 1: baseline
logit_r <- glm(inlf ~ age + kidslt6, data=mroz, family=binomial)

# Model 2: with education
logit_u <- glm(inlf ~ age + kidslt6 + educ + exper,
               data=mroz, family=binomial)

# LR test
lr_stat <- -2 * (logLik(logit_r) - logLik(logit_u))
df_diff <- length(coef(logit_u)) - length(coef(logit_r))
p_lr <- pchisq(lr_stat, df=df_diff, lower.tail=FALSE)
cat(sprintf("LR stat = %.2f, df = %d, p = %.4f\n", lr_stat, df_diff, p_lr))

# Or use anova()
anova(logit_r, logit_u, test="LRT")

12 11. Connections to Econometrics

Caution (Connection): Hypothesis Testing in Econometrics

Hausman test (specification test): tests for endogeneity / model misspecification by comparing two estimators under \(H_0\) (OLS consistent) vs \(H_1\) (only IV consistent). The Hausman statistic \(= (\hat{\beta}_{IV} - \hat{\beta}_{OLS})^T[\text{Var}(\hat{\beta}_{IV}) - \text{Var}(\hat{\beta}_{OLS})]^{-1}(\hat{\beta}_{IV} - \hat{\beta}_{OLS}) \sim \chi^2(k)\) — this is a Wald-type test.

Heteroskedasticity tests (Breusch-Pagan, White): LM (score) tests. Run an auxiliary regression of the squared residuals on the original regressors, then test \(nR^2 \sim \chi^2(k)\).

Autocorrelation tests (Breusch-Godfrey): also LM tests, for serial correlation in the residuals.

Chow test for structural breaks: an F-test comparing the restricted model (common coefficients) against the unrestricted one (separate coefficients per period).

Overidentification tests in IV/GMM: the Sargan-Hansen J-test for overidentifying restrictions is an asymptotic \(\chi^2\) test built from the minimized GMM criterion, a close cousin of the trinity above.


13 12. R Code: Power and Multiple Testing

library(pwr)

# ============================================================
# POWER ANALYSIS
# ============================================================
# One-sample t-test
# H0: mu = 0 vs H1: mu != 0
# Effect size d = mu/sigma (Cohen's d)

# Power as function of n (for d=0.5, alpha=0.05)
n_range <- seq(10, 200, by=5)
power_vals <- sapply(n_range, function(n) {
  pwr.t.test(n=n, d=0.5, sig.level=0.05, alternative="two.sided")$power
})

plot(n_range, power_vals, type="l", lwd=2, col="steelblue",
     xlab="Sample Size n", ylab="Power",
     main="Power vs Sample Size (d=0.5, alpha=0.05)")
abline(h=0.8, col="red", lty=2)  # 80% power convention
abline(h=0.9, col="orange", lty=2)  # 90% power
legend("bottomright",
       c("Power curve", "80% power", "90% power"),
       col=c("steelblue", "red", "orange"),
       lwd=c(2,1,1), lty=c(1,2,2))

# Sample size for 80% power
n_for_80 <- ceiling(pwr.t.test(d=0.5, sig.level=0.05,
                                 power=0.80, alternative="two.sided")$n)
cat("n needed for 80% power:", n_for_80, "\n")

# ============================================================
# MULTIPLE TESTING
# ============================================================
set.seed(2024)

# Simulate 100 tests where H0 is TRUE for all
n_tests <- 100
p_values_null <- runif(n_tests)  # Under H0, p-values are Uniform(0,1)

# Add 10 true alternatives
n_true <- 10
p_values_alt <- c(p_values_null[1:(n_tests-n_true)],
                   rbeta(n_true, 0.5, 10))  # Small p-values for true effects

alpha <- 0.05

# Unadjusted
rejected_unadj <- p_values_alt < alpha

# Bonferroni
rejected_bonf <- p_values_alt < alpha/n_tests

# BH procedure
p_sorted <- sort(p_values_alt)
bh_threshold <- (1:n_tests)/n_tests * alpha
bh_cutoff <- max(p_sorted[p_sorted <= bh_threshold], na.rm=TRUE)
rejected_bh <- p_values_alt <= bh_cutoff

cat("\n=== MULTIPLE TESTING COMPARISON ===\n")
cat(sprintf("Unadjusted: %d rejections\n", sum(rejected_unadj)))
cat(sprintf("Bonferroni: %d rejections\n", sum(rejected_bonf)))
cat(sprintf("BH (FDR):   %d rejections\n", sum(rejected_bh)))

# Using p.adjust
p_bonf <- p.adjust(p_values_alt, method="bonferroni")
p_bh   <- p.adjust(p_values_alt, method="BH")
cat(sprintf("\nBonferroni (p.adjust): %d rejections\n", sum(p_bonf < alpha)))
cat(sprintf("BH (p.adjust):         %d rejections\n", sum(p_bh < alpha)))

14 Practice Problems

Problem 1: p-value interpretation.

A study measures the effect of a treatment on an outcome and reports \(p = 0.03\). Which statement is correct?

  a. There is a 3% probability that \(H_0\) is true
  b. There is a 3% probability of obtaining data this extreme if \(H_0\) is true
  c. The treatment has a 97% chance of being effective
  d. If the study were repeated, 97% of replications would be significant

Answer: only (b) is correct. (a), (c), and (d) are all misinterpretations of the p-value.

Problem 2: Type I vs Type II.

A medical test for a rare disease (\(P(\text{disease}) = 0.001\)) has sensitivity = 99% (\(1-\beta = 0.99\)) and specificity = 95% (\(1-\alpha = 0.95\), so \(\alpha = 0.05\)).

  • Compute \(P(\text{disease} | \text{positive test})\) using Bayes' theorem
  • Interpret the result

Answer: \(P(+ | D) = 0.99\), \(P(+ | \bar{D}) = 0.05\). \(P(D|+) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} = \frac{0.00099}{0.00099+0.04995} \approx 0.0194\)

Only ~2% of positive tests are true positives! Because prevalence is so low, the sheer number of false positives coming from the large healthy population dominates.
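The same arithmetic in R (all numbers taken from the problem statement):

```r
# Bayes' theorem for P(disease | positive test).
prev <- 0.001               # P(disease)
sens <- 0.99                # P(+ | disease) = 1 - beta
spec <- 0.95                # P(- | no disease), so P(+ | no disease) = 0.05

p_pos   <- sens * prev + (1 - spec) * (1 - prev)  # total probability of a positive
p_d_pos <- sens * prev / p_pos                    # posterior: about 0.0194
```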

Problem 3: Wald, LR, Score equivalence.

For large \(n\), the three tests should give very similar results. Verify this numerically:

set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 1 + 0.3*x + rnorm(n)  # True effect = 0.3

# Test H0: beta_x = 0
model <- lm(y ~ x)
b <- coef(model)[2]
se <- sqrt(vcov(model)[2,2])

# Wald test
wald <- (b/se)^2  # F-statistic = t^2

# LR test
model_r <- lm(y ~ 1)
lr <- -2*(logLik(model_r) - logLik(model))

# Score (LM) test via auxiliary regression: regress the restricted
# residuals on the regressors; LM = n * R^2 ~ chi^2(1) under H0
u_r   <- resid(model_r)
score <- n * summary(lm(u_r ~ x))$r.squared

cat("Wald:", wald, "\n")
cat("LR:", as.numeric(lr), "\n")
cat("Score:", score, "\n")

Problem 4: Power calculation.

You want to detect an effect of \(\delta = 0.2\) units on the outcome, with \(\sigma = 1.5\). At \(\alpha = 0.05\) (two-sided), how large must \(n\) be for: (a) 50% power? (b) 80% power? (c) 95% power?

Interpret the results.

Answer: \(d = 0.2/1.5 \approx 0.133\). Using \(n \approx ((z_{\alpha/2}+z_\beta)/d)^2\): (a) \(n \approx 217\), (b) \(n \approx 442\), (c) \(n \approx 736\)

Detecting small effects requires very large samples!