Hypothesis Testing

p-value, Power, and the Art of Not Being Wrong

statistics
hypothesis-testing
inference
A rigorous hypothesis testing framework: Type I/II errors, power, p-values, Neyman-Pearson, the t-test, the F-test, and the LR/Wald/Score trinity.

1 Why Does This Matter?

Note: Why This Matters for Your Work

Hypothesis testing is everywhere in empirical work. Every time you see a p-value in regression output, you are running a hypothesis test. Every time you see stars (\(*\), \(**\), \(***\)) in a regression table, those are the results of hypothesis tests.

But most people use it without understanding:

  • What does the p-value actually measure? (And what does it NOT measure?)
  • Why do researchers with larger samples almost always find "significant" results?
  • What is the difference between "statistically significant" and "practically important"?
  • Why is power analysis important before collecting data?

Understood well, hypothesis testing is a powerful analytical weapon. Misunderstood, it is the source of many "research findings" that cannot be replicated.


2 1. The Basic Framework

Important Definition: Hypothesis Testing Framework

Null hypothesis \(H_0\): the statement being tested, usually "no effect" or "parameter = some specific value".

Alternative hypothesis \(H_1\) (or \(H_a\)): the contradiction of \(H_0\).

Test statistic \(T = T(X_1,\ldots,X_n)\): a function of the data that measures the evidence against \(H_0\).

Rejection region \(\mathcal{R}\): the values of \(T\) for which we reject \(H_0\).

Decision rule: Reject \(H_0\) if \(T \in \mathcal{R}\).

Types of test:

  • One-sided: \(H_1: \theta > \theta_0\) (upper-tail) or \(H_1: \theta < \theta_0\) (lower-tail)
  • Two-sided: \(H_1: \theta \neq \theta_0\)
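The decision rule above can be sketched in a few lines of base R (a minimal, hypothetical example: a one-sample z-test with \(\sigma\) treated as known, on simulated data):

```r
# Hypothetical one-sample z-test of H0: mu = 0 vs two-sided H1,
# with sigma = 1 treated as known (simulated data).
set.seed(1)
x <- rnorm(30, mean = 0.5, sd = 1)        # data drawn under an alternative

t_obs  <- mean(x) / (1 / sqrt(length(x))) # test statistic T
alpha  <- 0.05
crit   <- qnorm(1 - alpha / 2)            # rejection region: |T| > crit
reject <- abs(t_obs) > crit               # decision rule: is T in R?
p_val  <- 2 * pnorm(-abs(t_obs))          # two-sided p-value
```

Note that "reject" and "p_val < alpha" are the same decision expressed two ways, which is the link to the p-value section below.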


3 2. Two Types of Errors

Important Definition: Type I and Type II Errors
| Decision | \(H_0\) True | \(H_0\) False |
|---|---|---|
| Reject \(H_0\) | Type I Error (False Positive) | Correct (True Positive) |
| Fail to Reject \(H_0\) | Correct (True Negative) | Type II Error (False Negative) |

Type I error rate (significance level): \[\alpha = P(\text{reject } H_0 | H_0 \text{ true})\]

Type II error rate: \[\beta = P(\text{fail to reject } H_0 | H_0 \text{ false})\]

Power: \[\text{Power} = 1 - \beta = P(\text{reject } H_0 | H_0 \text{ false})\]

Tradeoff: Lowering \(\alpha\) (being more conservative) increases \(\beta\) (lowers power), and vice versa, for fixed \(n\).

The only ways to increase power without increasing \(\alpha\) are to increase \(n\) or to use a more powerful test.
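A quick Monte Carlo check of this tradeoff (a sketch; the sample size, effect size, and replication count are arbitrary choices):

```r
# Monte Carlo estimate of the size and power of a one-sample t-test,
# at two significance levels, for fixed n (simulated normal data).
set.seed(123)
reps <- 2000; n <- 30; effect <- 0.5

reject_rate <- function(mu, alpha) {
  mean(replicate(reps, t.test(rnorm(n, mean = mu))$p.value < alpha))
}

a05_size  <- reject_rate(0, 0.05)       # close to alpha = 0.05 under H0
a05_power <- reject_rate(effect, 0.05)  # power at alpha = 0.05
a01_size  <- reject_rate(0, 0.01)       # stricter alpha: smaller size ...
a01_power <- reject_rate(effect, 0.01)  # ... but lower power as well
```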


4 3. p-value — A Precise Definition

Important Definition: p-value

The p-value is the probability, under \(H_0\), of obtaining a test statistic at least as extreme as the observed value:

\[p = P(|T| \geq |t_{obs}| \mid H_0) \quad \text{(two-sided)}\] \[p = P(T \geq t_{obs} \mid H_0) \quad \text{(one-sided upper)}\]

Decision rule: Reject \(H_0\) if \(p < \alpha\).

4.1 What the p-value is NOT:

  1. Not the probability that \(H_0\) is true: \(p \neq P(H_0 \text{ true} | \text{data})\)
  2. Not the probability of a Type I error (that is \(\alpha\), not \(p\))
  3. Not a measure of practical importance or effect size
  4. Not the probability of replication: \(p < 0.05\) does not mean a replication has a 95% chance of being significant

The correct interpretation: "If \(H_0\) is true, the probability of obtaining data like what we observed (or more extreme) is \(p\)."

Problems with p-values:

  • They depend on \(n\): with a very large \(n\), even trivially small effects become significant
  • They do not measure effect size: \(p=0.001\) can mean a small but precisely estimated effect
  • Multiple testing: if you test 20 hypotheses at \(\alpha=0.05\), you should expect 1 false positive by chance alone
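The first problem — p-values shrinking mechanically with \(n\) — is easy to demonstrate (a sketch with a simulated, practically trivial true effect of 0.02 SD):

```r
# A trivially small true effect (0.02 SD) versus sample size:
# the same effect that looks like noise at n = 200 becomes
# "highly significant" at n = 200,000.
set.seed(7)
tiny <- 0.02
p_small <- t.test(rnorm(200,    mean = tiny))$p.value  # power is only ~6% here
p_large <- t.test(rnorm(200000, mean = tiny))$p.value  # power is essentially 1 here
```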


5 4. Neyman-Pearson Lemma

Important Definition: Neyman-Pearson Lemma

For testing simple vs simple (\(H_0: \theta = \theta_0\) vs \(H_1: \theta = \theta_1\)), the most powerful test of size \(\alpha\) is the likelihood ratio test with rejection region:

\[\mathcal{R} = \left\{x : \frac{L(\theta_1; x)}{L(\theta_0; x)} > k\right\}\]

where \(k\) is chosen so that \(P(\mathcal{R}|H_0) = \alpha\).

In words: no other test with the same size \(\alpha\) has higher power for simple-vs-simple testing.

Implication: the NP lemma justifies the likelihood ratio as the fundamental approach. For composite hypotheses, we extend it to the generalized LR test.
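A small sketch of the NP construction for the Gaussian case (illustrative choices of \(\theta_0 = 0\), \(\theta_1 = 1\), \(n = 25\), not from the text):

```r
# NP lemma for X_i ~ N(theta, 1), H0: theta = 0 vs H1: theta = 1.
# The log likelihood ratio is increasing in xbar, so "LR > k" is
# equivalent to "xbar > c" -- the MP test is the one-sided z-test.
n <- 25; theta0 <- 0; theta1 <- 1; alpha <- 0.05

log_lr <- function(xbar) {   # log [ L(theta1; x) / L(theta0; x) ]
  n * (xbar * (theta1 - theta0) - (theta1^2 - theta0^2) / 2)
}

c_cut <- qnorm(1 - alpha, mean = theta0, sd = 1 / sqrt(n))  # size-alpha cutoff on xbar
k     <- exp(log_lr(c_cut))  # the NP constant k implied by that cutoff
```

The monotonicity of `log_lr` in `xbar` is exactly why choosing \(k\) and choosing the z-test cutoff are the same thing.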


6 5. t-test

6.1 One-Sample t-test

Test \(H_0: \mu = \mu_0\) with \(\sigma^2\) unknown.

Test statistic: \[T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t(n-1) \text{ under } H_0 \text{ (if normal)}\]

Why t, not z? Because \(S\) (the sample SD) is random, not a known \(\sigma\). Dividing by \(S/\sqrt{n}\) instead of \(\sigma/\sqrt{n}\) introduces extra uncertainty, which is captured by the heavier tails of the t-distribution.
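This can be seen directly in simulation (a sketch; \(n = 5\) is deliberately tiny to exaggerate the effect):

```r
# Under H0 with sigma unknown, compare rejection rates using normal vs
# t critical values: the z cutoff over-rejects badly at small n.
set.seed(99)
n <- 5; reps <- 20000
tstats <- replicate(reps, {
  x <- rnorm(n)                  # H0 true: mu = 0, sigma unknown
  mean(x) / (sd(x) / sqrt(n))
})
rej_z <- mean(abs(tstats) > qnorm(0.975))           # nominal 5%, actually ~12%
rej_t <- mean(abs(tstats) > qt(0.975, df = n - 1))  # close to the nominal 5%
```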

6.2 Two-Sample t-test

Test \(H_0: \mu_1 = \mu_2\) (equality of means across two populations):

Equal variances (pooled): \[T = \frac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{1/n_1 + 1/n_2}} \sim t(n_1+n_2-2)\]

where \(S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}\) is the pooled variance.

Welch’s t-test (unequal variances, preferred in practice): \[T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}\]

with degrees of freedom given by the Welch-Satterthwaite approximation.
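Why Welch is preferred in practice can be shown by simulation (a sketch with deliberately unequal variances and group sizes; \(H_0\) is true throughout):

```r
# Pooled vs Welch t-test when the smaller group has the larger variance:
# the pooled test's actual Type I error rate drifts well above 5%.
set.seed(11)
reps <- 4000
p_pool <- p_welch <- numeric(reps)
for (i in seq_len(reps)) {
  x <- rnorm(10, sd = 3)   # small group, large variance
  y <- rnorm(50, sd = 1)   # large group, small variance (same mean: H0 true)
  p_pool[i]  <- t.test(x, y, var.equal = TRUE)$p.value
  p_welch[i] <- t.test(x, y, var.equal = FALSE)$p.value
}
size_pool  <- mean(p_pool  < 0.05)   # badly inflated
size_welch <- mean(p_welch < 0.05)   # near the nominal 5%
```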


7 6. The F-test for Joint Hypotheses

For testing multiple restrictions simultaneously:

Important Definition: F-test for a Joint Hypothesis

Model: \(y = X\beta + \varepsilon\), with \(n\) observations and \(k\) parameters.

Test \(H_0: R\beta = r\), a set of \(m\) linear restrictions.

F-statistic: \[F = \frac{(R\hat{\beta}-r)^T\left[R(X^TX)^{-1}R^T\right]^{-1}(R\hat{\beta}-r)/m}{\hat{\sigma}^2} \sim F(m, n-k) \text{ under } H_0\]

Alternative formula (Restricted vs Unrestricted): \[F = \frac{(SSE_R - SSE_U)/m}{SSE_U/(n-k)}\]

Examples:

  • Test \(\beta_2 = \beta_3 = 0\): \(R = \begin{pmatrix}0&1&0\\0&0&1\end{pmatrix}\), \(r = \begin{pmatrix}0\\0\end{pmatrix}\), \(m=2\)
  • Test overall significance: \(H_0: \beta_2 = \cdots = \beta_k = 0\) (all slopes = 0)
  • Test \(\beta_2 = \beta_3\): \(R = (0, 1, -1)\), \(r = 0\), \(m=1\)
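The first example can be computed by hand on simulated data and cross-checked against the restricted-vs-unrestricted formula (a sketch; the data-generating process is illustrative, not from the text):

```r
# F-test of H0: beta2 = beta3 = 0 via the R-matrix formula, verified
# against the SSE_R vs SSE_U version computed by anova().
set.seed(5)
n <- 200
x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.4 * x2 + rnorm(n)            # beta3 = 0 in truth

X  <- cbind(1, x2, x3)
bh <- solve(t(X) %*% X, t(X) %*% y)      # OLS estimate
k  <- ncol(X)
s2 <- sum((y - X %*% bh)^2) / (n - k)    # sigma^2 hat = SSE_U / (n - k)

R <- rbind(c(0, 1, 0),
           c(0, 0, 1))
r <- c(0, 0); m <- nrow(R)
d <- R %*% bh - r
F_wald <- drop(t(d) %*% solve(R %*% solve(t(X) %*% X) %*% t(R), d)) / (m * s2)

F_anova <- anova(lm(y ~ 1), lm(y ~ x2 + x3))$F[2]  # restricted vs unrestricted
```

In OLS the two formulas are algebraically identical, so `F_wald` and `F_anova` agree to machine precision.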


8 7. LR, Wald, and Score Tests — The Trinity

Important Definition: The Asymptotic Test Trinity

For testing \(H_0: \theta = \theta_0\) (or, more generally, \(m\) restrictions), there are three asymptotically equivalent tests:

Wald Test: \[W = (R\hat{\theta} - r)^T \left[R\hat{V}R^T\right]^{-1}(R\hat{\theta} - r) \xrightarrow{d} \chi^2(m)\]

where \(\hat{V}\) is the estimated asymptotic variance of \(\hat{\theta}\).


Likelihood Ratio (LR) Test: \[LR = -2[\ell(\hat{\theta}_0) - \ell(\hat{\theta})] \xrightarrow{d} \chi^2(m)\]

where \(\hat{\theta}_0\) is the MLE under \(H_0\) (restricted) and \(\hat{\theta}\) is the unrestricted MLE.


Score (Lagrange Multiplier, LM) Test: \[LM = s(\hat{\theta}_0)^T I(\hat{\theta}_0)^{-1} s(\hat{\theta}_0) \xrightarrow{d} \chi^2(m)\]

where \(s(\hat{\theta}_0)\) is the score function evaluated at the restricted estimator.

All three are asymptotically equivalent under \(H_0\) and local alternatives. But:

| Test | Requires | Advantage |
|---|---|---|
| Wald | Unrestricted estimation only | Simplest |
| LR | Restricted + unrestricted estimation | Most common; natural for model comparison |
| Score/LM | Restricted estimation only | Efficient when the restricted model is easy to fit |

Practical implications:

  • Wald: the t-tests and F-tests in OLS output are Wald tests
  • LR: AIC = \(-2\ell + 2k\) is built from the same log-likelihoods as the LR test; common for model comparison
  • LM: the Breusch-Pagan test for heteroskedasticity and the LM test for autocorrelation are both LM (Score) tests


9 8. Power Analysis and Sample Size

Important Definition: Power Function

Power function: \(\pi(\theta) = P(\text{reject } H_0 | \theta)\)

For fixed \(\alpha\):

  • \(\pi(\theta_0) = \alpha\) (the size of the test, by definition)
  • \(\pi(\theta)\) for \(\theta \neq \theta_0\) is the power — ideally close to 1

Factors affecting power:

  1. Effect size (the distance between \(\theta_0\) and the true \(\theta\))
  2. Sample size \(n\) (power increases with \(n\))
  3. Variance \(\sigma^2\) (power falls as the variance rises)
  4. Significance level \(\alpha\) (power rises with a larger \(\alpha\))

Power formula for the one-sample z-test (\(H_0: \mu = \mu_0\) vs \(H_1: \mu = \mu_1 > \mu_0\)): \[\text{Power} = P\left(Z > z_\alpha - \frac{(\mu_1-\mu_0)\sqrt{n}}{\sigma}\right)\]

Sample size formula for desired power \(1-\beta\) in a two-sided test: \[n \geq \left(\frac{(z_{\alpha/2} + z_\beta)\sigma}{\delta}\right)^2\]

where \(\delta = |\mu_1 - \mu_0|\) is the minimum detectable effect.
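A quick check of this formula against base R's power.t.test, with illustrative values (\(\delta = 0.2\), \(\sigma = 1.5\)):

```r
# z-based sample size formula vs the exact t-based answer from power.t.test.
# The z formula slightly understates n because it ignores the t correction.
alpha <- 0.05; pow <- 0.80
delta <- 0.2;  sigma <- 1.5

n_z <- ((qnorm(1 - alpha / 2) + qnorm(pow)) * sigma / delta)^2
n_t <- power.t.test(delta = delta, sd = sigma, sig.level = alpha,
                    power = pow, type = "one.sample",
                    alternative = "two.sided")$n
```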


10 9. Multiple Testing

Important Definition: The Multiple Testing Problem

If you run \(m\) independent tests, each at level \(\alpha\):

Family-wise error rate (FWER) = probability of at least one false positive: \[P(\text{at least one false positive}) = 1 - (1-\alpha)^m\]

For \(m=20\) and \(\alpha=0.05\): FWER = \(1-(0.95)^{20} \approx 0.64\)!

Bonferroni correction: Test each hypothesis at level \(\alpha/m\). Controls FWER at \(\alpha\).

Benjamini-Hochberg (BH): Controls False Discovery Rate (FDR) — expected fraction of false positives among rejections. Less conservative than Bonferroni.

In practice: whenever you test many coefficients in a regression, or "search" for significant variables, multiple testing is a serious concern.


11 10. Worked Example: Full t-test and F-test

Scenario: a dataset on the effect of education (\(edu\)) and experience (\(exp\)) on log-wage, with \(n = 526\).

Two-sample t-test: is the average wage for women different from that for men?

# Data setup (using wage1 from wooldridge package)
if(!require(wooldridge)) install.packages("wooldridge")
library(wooldridge)
data("wage1")

# Two-sample t-test: wage by gender
t.test(wage ~ female, data=wage1, alternative="two.sided")

Output:

t = 9.24, df = 517.95, p-value < 2.2e-16
95% CI: [1.69, 2.59]
mean in group 0 (male): 7.10, mean in group 1 (female): 4.95

Males earn significantly more (\(t=9.24\), \(p<0.001\)). Wage gap ≈ $2.15/hour.

Multiple regression + F-test:

# Full regression
model <- lm(lwage ~ educ + exper + expersq + female + married,
            data=wage1)
summary(model)

# F-test: joint significance of education AND experience
library(car)
linearHypothesis(model, c("educ = 0", "exper = 0", "expersq = 0"))

# Manual F-test: restricted vs unrestricted
model_r <- lm(lwage ~ female + married, data=wage1)  # Restricted
model_u <- model  # Unrestricted

anova(model_r, model_u)
# F(3, 520) = ?, p = ?

# Power analysis: how many obs needed to detect educ effect?
library(pwr)
# Assume standardized effect size delta/sigma for education
# Using observed: beta_edu ≈ 0.09, sigma ≈ 0.45 (rough estimates)
delta <- 0.09   # effect size per year of education
sigma <- 0.45   # residual SD estimate
n_needed <- pwr.t.test(
  d = delta/sigma,  # standardized effect
  sig.level = 0.05,
  power = 0.80,
  alternative = "two.sided"
)$n
cat("Sample needed for 80% power:", ceiling(n_needed), "\n")

LR Test for nested models:

# Compare logistic regression models
data("mroz", package="wooldridge")

# Model 1: baseline
logit_r <- glm(inlf ~ age + kidslt6, data=mroz, family=binomial)

# Model 2: with education
logit_u <- glm(inlf ~ age + kidslt6 + educ + exper,
               data=mroz, family=binomial)

# LR test
lr_stat <- -2 * (logLik(logit_r) - logLik(logit_u))
df_diff <- length(coef(logit_u)) - length(coef(logit_r))
p_lr <- pchisq(lr_stat, df=df_diff, lower.tail=FALSE)
cat(sprintf("LR stat = %.2f, df = %d, p = %.4f\n", lr_stat, df_diff, p_lr))

# Or use anova()
anova(logit_r, logit_u, test="LRT")

12 11. Connections to Econometrics

Caution (Connection): Hypothesis Testing in Econometrics

Hausman test (specification test): tests for endogeneity / model misspecification by comparing two estimators under \(H_0\) (OLS consistent) vs \(H_1\) (only IV consistent). The Hausman statistic \(= (\hat{\beta}_{IV} - \hat{\beta}_{OLS})^T[\text{Var}(\hat{\beta}_{IV}) - \text{Var}(\hat{\beta}_{OLS})]^{-1}(\hat{\beta}_{IV} - \hat{\beta}_{OLS}) \sim \chi^2(k)\) — this is a Wald-type test.

Heteroskedasticity tests (Breusch-Pagan, White): LM (score) tests. Run an auxiliary regression of the squared residuals on the original regressors, then test \(nR^2 \sim \chi^2(k)\).

Autocorrelation tests (Breusch-Godfrey): also LM tests, for serial correlation in the residuals.

Chow test for structural breaks: an F-test comparing the restricted model (common coefficients) against the unrestricted one (separate coefficients per period).

Overidentification tests in IV/GMM: the Sargan-Hansen J-test for overidentifying restrictions is an asymptotic \(\chi^2\) test built from the minimized GMM criterion, a close cousin of the trinity above.


13 12. R Code: Power and Multiple Testing

library(pwr)

# ============================================================
# POWER ANALYSIS
# ============================================================
# One-sample t-test
# H0: mu = 0 vs H1: mu != 0
# Effect size d = mu/sigma (Cohen's d)

# Power as function of n (for d=0.5, alpha=0.05)
n_range <- seq(10, 200, by=5)
power_vals <- sapply(n_range, function(n) {
  pwr.t.test(n=n, d=0.5, sig.level=0.05, alternative="two.sided")$power
})

plot(n_range, power_vals, type="l", lwd=2, col="steelblue",
     xlab="Sample Size n", ylab="Power",
     main="Power vs Sample Size (d=0.5, alpha=0.05)")
abline(h=0.8, col="red", lty=2)  # 80% power convention
abline(h=0.9, col="orange", lty=2)  # 90% power
legend("bottomright",
       c("Power curve", "80% power", "90% power"),
       col=c("steelblue", "red", "orange"),
       lwd=c(2,1,1), lty=c(1,2,2))

# Sample size for 80% power
n_for_80 <- ceiling(pwr.t.test(d=0.5, sig.level=0.05,
                                 power=0.80, alternative="two.sided")$n)
cat("n needed for 80% power:", n_for_80, "\n")

# ============================================================
# MULTIPLE TESTING
# ============================================================
set.seed(2024)

# Simulate 100 tests where H0 is TRUE for all
n_tests <- 100
p_values_null <- runif(n_tests)  # Under H0, p-values are Uniform(0,1)

# Add 10 true alternatives
n_true <- 10
p_values_alt <- c(p_values_null[1:(n_tests-n_true)],
                   rbeta(n_true, 0.5, 10))  # Small p-values for true effects

alpha <- 0.05

# Unadjusted
rejected_unadj <- p_values_alt < alpha

# Bonferroni
rejected_bonf <- p_values_alt < alpha/n_tests

# BH procedure
p_sorted <- sort(p_values_alt)
bh_threshold <- (1:n_tests)/n_tests * alpha
bh_cutoff <- max(p_sorted[p_sorted <= bh_threshold], na.rm=TRUE)
rejected_bh <- p_values_alt <= bh_cutoff

cat("\n=== MULTIPLE TESTING COMPARISON ===\n")
cat(sprintf("Unadjusted: %d rejections\n", sum(rejected_unadj)))
cat(sprintf("Bonferroni: %d rejections\n", sum(rejected_bonf)))
cat(sprintf("BH (FDR):   %d rejections\n", sum(rejected_bh)))

# Using p.adjust
p_bonf <- p.adjust(p_values_alt, method="bonferroni")
p_bh   <- p.adjust(p_values_alt, method="BH")
cat(sprintf("\nBonferroni (p.adjust): %d rejections\n", sum(p_bonf < alpha)))
cat(sprintf("BH (p.adjust):         %d rejections\n", sum(p_bh < alpha)))

14 Practice Problems

Problem 1: p-value interpretation.

A study measures the effect of a treatment on an outcome and reports \(p = 0.03\). Which statement is correct?

  a. There is a 3% probability that \(H_0\) is true
  b. There is a 3% probability of obtaining data this extreme if \(H_0\) is true
  c. The treatment has a 97% chance of being effective
  d. If the study were repeated, 97% of replications would be significant

Answer: only (b) is correct. (a), (c), and (d) are all misinterpretations of the p-value.

Problem 2: Type I vs Type II.

A medical test for a rare disease (\(P(\text{disease}) = 0.001\)) has sensitivity = 99% (\(1-\beta = 0.99\)) and specificity = 95% (\(1-\alpha = 0.95\), so \(\alpha = 0.05\)).

  • Compute \(P(\text{disease} | \text{positive test})\) using Bayes' theorem
  • Interpret the result

Answer: \(P(+ | D) = 0.99\), \(P(+ | \bar{D}) = 0.05\). \(P(D|+) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} = \frac{0.00099}{0.00099+0.04995} \approx 0.0194\)

Only ~2% of positive tests are true positives! Because prevalence is so low, the sheer number of false positives coming from the large healthy population dominates.
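The same arithmetic in R (all numbers taken from the problem statement):

```r
# Bayes' theorem for P(disease | positive test).
prev <- 0.001               # P(disease)
sens <- 0.99                # P(+ | disease) = 1 - beta
spec <- 0.95                # P(- | no disease), so P(+ | no disease) = 0.05

p_pos   <- sens * prev + (1 - spec) * (1 - prev)  # total probability of a positive
p_d_pos <- sens * prev / p_pos                    # posterior: about 0.0194
```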

Problem 3: Wald, LR, Score equivalence.

For large \(n\), the three tests should give very similar results. Verify this numerically:

set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 1 + 0.3*x + rnorm(n)  # True effect = 0.3

# Test H0: beta_x = 0
model <- lm(y ~ x)
b <- coef(model)[2]
se <- sqrt(vcov(model)[2,2])

# Wald test
wald <- (b/se)^2  # F-statistic = t^2

# LR test
model_r <- lm(y ~ 1)
lr <- -2*(logLik(model_r) - logLik(model))

# Score (LM) test via auxiliary regression: regress the restricted
# residuals on the regressors; LM = n * R^2 ~ chi^2(1) under H0
u_r   <- resid(model_r)
score <- n * summary(lm(u_r ~ x))$r.squared

cat("Wald:", wald, "\n")
cat("LR:", as.numeric(lr), "\n")
cat("Score:", score, "\n")

Problem 4: Power calculation.

You want to detect an effect of \(\delta = 0.2\) units on the outcome, with \(\sigma = 1.5\). At \(\alpha = 0.05\) (two-sided), how large must \(n\) be for: (a) 50% power? (b) 80% power? (c) 95% power?

Interpret the results.

Answer: \(d = 0.2/1.5 \approx 0.133\). Using \(n \approx ((z_{\alpha/2}+z_\beta)/d)^2\): (a) \(n \approx 217\), (b) \(n \approx 442\), (c) \(n \approx 736\)

Detecting small effects requires very large samples!