Important Probability Distributions
Normal, t, Chi-Squared, F, and Friends
1 Why Does This Matter?
All hypothesis testing depends on knowing the distribution of your test statistic. The t-test assumes a t-distribution, the chi-squared test a chi-squared distribution, the F-test an F-distribution. Without this, a p-value is just a number without meaning: you cannot say whether 0.03 is “small” without knowing the reference distribution.
Every time you type summary(lm(...)) in R and see a p-value, behind the scenes R is computing a tail probability of the t-distribution. Every time you run an F-test for joint significance, R uses the F-distribution. Understanding these distributions is not just theory; it is the basis of every number in your regression table.
2 1. Normal Distribution
The normal distribution is the most fundamental distribution in statistics. This is not because nature is always normal, but because the Central Limit Theorem guarantees that sample means are approximately normal for large samples.
A random variable \(X\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\), written \(X \sim N(\mu, \sigma^2)\), if it has PDF:
\[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}\]
Standard Normal: \(Z \sim N(0,1)\) is the special case with \(\mu=0\), \(\sigma^2=1\).
The CDF of the standard normal is denoted \(\Phi(z) = P(Z \leq z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}e^{-t^2/2}dt\).
Standardization: if \(X \sim N(\mu, \sigma^2)\), then \(Z = \frac{X-\mu}{\sigma} \sim N(0,1)\).
2.1 Properties of the Normal Distribution
1. Symmetry: \(f(x) = f(2\mu - x)\); the distribution is symmetric about the mean \(\mu\).
2. 68-95-99.7 Rule: \[P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.6827\] \[P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.9545\] \[P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 0.9973\]
3. Linear Combinations: If \(X_i \sim N(\mu_i, \sigma_i^2)\) are independent, then: \[\sum_{i=1}^n a_i X_i \sim N\left(\sum_{i=1}^n a_i\mu_i,\ \sum_{i=1}^n a_i^2\sigma_i^2\right)\]
This is essential for OLS: \(\hat{\beta} = (X^TX)^{-1}X^Ty\) is a linear combination of \(y\), so if \(\varepsilon \sim N(0, \sigma^2 I)\), then \(\hat{\beta} \sim N(\beta, \sigma^2(X^TX)^{-1})\).
4. MGF: \(M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)\)
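The 68-95-99.7 coverage probabilities above follow directly from the standard normal CDF and are easy to verify in R:

```r
# Probability of falling within k standard deviations of the mean,
# computed from the standard normal CDF
cover <- function(k) pnorm(k) - pnorm(-k)
round(cover(1), 4)  # 0.6827
round(cover(2), 4)  # 0.9545
round(cover(3), 4)  # 0.9973
```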
2.2 Multivariate Normal
A vector \(\mathbf{X} = (X_1, \ldots, X_k)^T\) has a multivariate normal distribution, written \(\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)\), if it has PDF:
\[f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\]
where \(\boldsymbol{\mu} \in \mathbb{R}^k\) is the mean vector and \(\Sigma\) is a \(k \times k\) positive definite covariance matrix.
Key Properties of MVN:
- Marginals normal: \(X_i \sim N(\mu_i, \Sigma_{ii})\)
- Conditionals normal: \(X_1 | X_2 = x_2 \sim N(\mu_{1|2}, \Sigma_{1|2})\), where: \[\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\] \[\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\]
- Linear combinations normal: \(A\mathbf{X} \sim N(A\boldsymbol{\mu}, A\Sigma A^T)\)
- Uncorrelated \(\Rightarrow\) independent (only within the MVN; not true in general!)
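For a concrete bivariate case (the numbers below are chosen purely for illustration), the conditional-distribution formulas reduce to scalar arithmetic:

```r
# X = (X1, X2) ~ N(mu, Sigma); condition on X2 = 1
mu <- c(0, 0)
Sigma <- matrix(c(1.0, 0.8,
                  0.8, 1.0), nrow = 2)
x2 <- 1
mu_cond  <- mu[1] + Sigma[1, 2] / Sigma[2, 2] * (x2 - mu[2])  # 0.8
var_cond <- Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2]         # 1 - 0.64 = 0.36
c(mean = mu_cond, var = var_cond)
```

Note how conditioning on a correlated component shrinks the variance from 1 to 0.36.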
3 2. Chi-Squared Distribution
The chi-squared distribution arises naturally from squaring standard normals.
If \(Z_1, Z_2, \ldots, Z_k \sim N(0,1)\) iid, then: \[V = \sum_{i=1}^k Z_i^2 \sim \chi^2(k)\]
where \(k\) is the degrees of freedom.
Moments:
- \(E[V] = k\)
- \(\text{Var}(V) = 2k\)
PDF: \(f(v) = \frac{v^{k/2-1}e^{-v/2}}{2^{k/2}\Gamma(k/2)}\) for \(v > 0\).
3.1 Connection to the Sample Variance
This is one of the most important results in inferential statistics:
\[\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \quad \text{if } X_i \sim N(\mu, \sigma^2) \text{ iid}\]
where \(S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\).
Why \(n-1\) degrees of freedom? Because we lose 1 df when estimating \(\mu\) with \(\bar{X}\).
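A quick simulation sketch (the sample size and parameters below are arbitrary) confirms that \((n-1)S^2/\sigma^2\) behaves like \(\chi^2(n-1)\), with mean \(n-1\) and variance \(2(n-1)\):

```r
set.seed(123)
n <- 10; sigma2 <- 4
stat <- replicate(10000, {
  x <- rnorm(n, mean = 5, sd = sqrt(sigma2))
  (n - 1) * var(x) / sigma2        # should be chi-squared with n-1 = 9 df
})
c(mean = mean(stat), var = var(stat))   # near 9 and 18
```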
3.2 Additive Property
If \(V_1 \sim \chi^2(m)\) and \(V_2 \sim \chi^2(n)\) are independent, then: \[V_1 + V_2 \sim \chi^2(m+n)\]
Application: in regression, SST = SSR + SSE, and under normal errors these sums of squares decompose into chi-squared components.
Used in:
- Goodness-of-fit tests (Pearson chi-squared)
- Tests of independence (contingency tables)
- Lagrange Multiplier (LM) tests
- Likelihood Ratio (LR) tests
- Testing restrictions in econometric models
4 3. Student’s t-Distribution
The t-distribution arises when we do not know \(\sigma^2\) and must estimate it.
If \(Z \sim N(0,1)\) and \(V \sim \chi^2(k)\) are independent, then: \[T = \frac{Z}{\sqrt{V/k}} \sim t(k)\]
where \(k\) is the degrees of freedom.
Moments (for \(k > 2\)):
- \(E[T] = 0\)
- \(\text{Var}(T) = \frac{k}{k-2}\)
The tails are heavier than the normal's; the smaller \(k\), the heavier the tails.
4.1 Convergence to the Normal
\[t(k) \xrightarrow{d} N(0,1) \text{ as } k \to \infty\]
In practice: for \(k > 30\), the t-distribution is already very close to the normal; for \(k > 120\), they are nearly identical.
4.2 Connection to OLS
This is the critical connection to regression:
\[\frac{\hat{\beta}_j - \beta_j}{\hat{s.e.}(\hat{\beta}_j)} \sim t(n-k)\]
under the Gauss-Markov assumptions plus normally distributed errors (\(\varepsilon \sim N(0,\sigma^2 I)\)), where:
- \(n\) = number of observations
- \(k\) = number of parameters (including the intercept)
- \(n-k\) = residual degrees of freedom
Used for:
- Individual coefficient significance tests
- Confidence intervals for means with unknown variance
- Comparing two group means (Welch's t-test)
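As an illustration of the second use, a 95% confidence interval for a mean with unknown variance uses the t critical value rather than 1.96 (the data below are made up):

```r
x <- c(5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7)
n <- length(x)
t_crit <- qt(0.975, df = n - 1)    # 2.365 for df = 7, wider than the normal's 1.96
ci <- mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n)
ci
```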
5 4. F-Distribution
The F-distribution is the ratio of two independent chi-squared variables, each divided by its degrees of freedom. It is the distribution of F-statistics in regression.
If \(U \sim \chi^2(m)\) and \(V \sim \chi^2(n)\) are independent, then: \[F = \frac{U/m}{V/n} \sim F(m, n)\]
where \(m\) = numerator df and \(n\) = denominator df.
Mean: \(E[F] = \frac{n}{n-2}\) for \(n > 2\) (slightly above 1).
PDF: Right-skewed, defined for \(F > 0\).
5.1 F-test for Joint Hypotheses
For testing \(H_0: R\beta = r\) (linear restrictions) in the regression \(y = X\beta + \varepsilon\):
\[F = \frac{(R\hat{\beta}-r)^T\left[R(X^TX)^{-1}R^T\right]^{-1}(R\hat{\beta}-r)/m}{\hat{\sigma}^2} \sim F(m, n-k)\]
where:
- \(m\) = number of restrictions
- \(n-k\) = residual degrees of freedom
- \(\hat{\sigma}^2 = SSE/(n-k)\) = estimated error variance
5.2 The Relationship Between F and t
If \(T \sim t(n)\), then \(T^2 \sim F(1, n)\).
This means a two-sided t-test is equivalent to an F-test with 1 restriction.
When you test \(H_0: \beta_j = 0\) with a t-statistic, the F-statistic for the same restriction is exactly the square of the t-statistic, and the p-values are identical!
The "overall significance" F-test in a regression summary is the multi-restriction generalization: it tests all slope coefficients at once.
Used in:
- Joint hypothesis tests (\(\beta_2 = \beta_3 = 0\), etc.)
- ANOVA (comparing means across groups)
- Overall significance of a regression
- Chow test (structural break)
- Testing equality of variances
6 5. Other Important Distributions
6.1 Bernoulli and Binomial
Bernoulli: \(X \in \{0, 1\}\), \(P(X=1) = p\). The foundation of logistic regression. \[E[X] = p, \quad \text{Var}(X) = p(1-p)\]
Binomial: \(X = \sum_{i=1}^n X_i\) with \(X_i \sim \text{Bernoulli}(p)\) iid. \[P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}\] \[E[X] = np, \quad \text{Var}(X) = np(1-p)\]
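In R, dbinom and pbinom give these probabilities directly; n = 10 and p = 0.3 below are arbitrary values for illustration:

```r
dbinom(3, size = 10, prob = 0.3)    # P(X = 3) = choose(10, 3) * 0.3^3 * 0.7^7
pbinom(3, size = 10, prob = 0.3)    # P(X <= 3)
n <- 10; p <- 0.3
c(mean = n * p, var = n * p * (1 - p))   # 3 and 2.1
```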
6.2 Poisson Distribution
For count data: the number of events in a fixed interval.
\[P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots\]
Key property: \(E[X] = \text{Var}(X) = \lambda\) (mean equals variance — useful for testing overdispersion!).
Used in: count regression models (Poisson regression), queueing theory, spatial point processes.
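The mean-variance equality is easy to check by simulation (the choice λ = 4 is arbitrary):

```r
set.seed(7)
x <- rpois(100000, lambda = 4)   # large Poisson sample
c(mean = mean(x), var = var(x))  # both should be near lambda = 4
```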
6.3 Exponential Distribution
Waiting time between Poisson events.
\[f(x) = \lambda e^{-\lambda x}, \quad x > 0\] \[E[X] = 1/\lambda, \quad \text{Var}(X) = 1/\lambda^2\]
Memoryless property: \(P(X > s+t | X > s) = P(X > t)\); the exponential is the only continuous distribution with this property.
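The memoryless property can be checked numerically with pexp; the rate and the values of s and t below are arbitrary:

```r
lambda <- 2; s <- 1; t <- 0.5
lhs <- pexp(s + t, rate = lambda, lower.tail = FALSE) /
       pexp(s,     rate = lambda, lower.tail = FALSE)   # P(X > s+t | X > s)
rhs <- pexp(t, rate = lambda, lower.tail = FALSE)       # P(X > t)
c(lhs, rhs)   # equal
```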
6.4 Beta Distribution
For modeling probabilities and proportions (values between 0 and 1).
\[f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \quad x \in (0,1)\]
Used as the conjugate prior for the binomial probability \(p\) in Bayesian statistics.
- \(\alpha = \beta = 1\): uniform prior
- Large \(\alpha, \beta\): strong prior belief near \(\alpha/(\alpha+\beta)\)
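A sketch of the conjugate update: with a Beta(α, β) prior and k successes in n Bernoulli trials, the posterior is Beta(α + k, β + n − k). The numbers below are illustrative:

```r
a0 <- 2; b0 <- 2          # prior Beta(2, 2)
k <- 7; n <- 10           # data: 7 successes in 10 trials
post_a <- a0 + k          # 9
post_b <- b0 + n - k      # 5
post_mean <- post_a / (post_a + post_b)   # 9/14
post_mean
```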
6.5 Gamma Distribution
A generalization of the exponential: the sum of \(k\) iid exponential variables is Gamma-distributed.
\[f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0\] \[E[X] = \alpha/\beta, \quad \text{Var}(X) = \alpha/\beta^2\]
Note: \(\chi^2(k) = \text{Gamma}(k/2, 1/2)\) in the shape-rate parameterization. Chi-squared is a special case of the Gamma.
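R's dgamma uses exactly this shape-rate parameterization, so the identity can be checked by comparing the two densities on a grid:

```r
x <- seq(0.5, 10, by = 0.5)
k <- 6
all.equal(dchisq(x, df = k),
          dgamma(x, shape = k / 2, rate = 1 / 2))   # TRUE
```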
6.6 Dirichlet Distribution
A generalization of the Beta to probability vectors \((\pi_1, \ldots, \pi_K)\) with \(\sum \pi_k = 1\).
Used as the conjugate prior for multinomial probabilities in Bayesian statistics; the foundation of Latent Dirichlet Allocation (LDA) in topic modeling.
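Base R has no Dirichlet sampler, but one follows from the Gamma construction: normalizing independent Gamma(α_k, 1) draws yields a Dirichlet(α) vector. The helper name rdirichlet1 below is made up for this sketch:

```r
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)                      # components are positive and sum to 1
}
set.seed(99)
pi_draw <- rdirichlet1(c(2, 3, 5))
sum(pi_draw)                      # 1
```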
7 6. Worked Example 1: Full t-Test
Problem: salaries (in millions of rupiah) for 15 new employees:
28.5, 31.2, 25.8, 33.4, 29.1, 27.6, 30.8, 32.1, 26.4, 28.9, 31.5, 27.3, 29.8, 30.2, 28.7
Test \(H_0: \mu = 28\) vs \(H_1: \mu \neq 28\) at \(\alpha = 0.05\).
Step 1: Compute sample statistics. \[\bar{x} = \frac{1}{15}\sum x_i = \frac{441.3}{15} = 29.42\] \[s^2 = \frac{1}{14}\sum(x_i - \bar{x})^2 = \frac{64.94}{14} = 4.64, \quad s = 2.154\] \[SE = \frac{s}{\sqrt{n}} = \frac{2.154}{\sqrt{15}} = 0.556\]
Step 2: Compute the t-statistic. \[t = \frac{\bar{x} - \mu_0}{SE} = \frac{29.42 - 28}{0.556} = \frac{1.42}{0.556} = 2.554\]
Step 3: Compare with the critical value.
With \(df = n-1 = 14\) and \(\alpha/2 = 0.025\): \(t_{14, 0.025} = 2.145\).
\(|t| = 2.554 > 2.145\), so reject \(H_0\).
Step 4: Compute the p-value. \[p = 2 \times P(T_{14} > 2.554) \approx 2 \times 0.012 = 0.024\]
Conclusion: there is sufficient evidence to reject \(H_0\). The mean salary differs significantly from 28 million (\(t(14) = 2.55\), \(p \approx 0.024\)).
# Data
gaji <- c(28.5, 31.2, 25.8, 33.4, 29.1, 27.6, 30.8, 32.1,
26.4, 28.9, 31.5, 27.3, 29.8, 30.2, 28.7)
# Manual calculation
n <- length(gaji)
x_bar <- mean(gaji)
s <- sd(gaji)
SE <- s / sqrt(n)
t_stat <- (x_bar - 28) / SE
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
cat(sprintf("n=%d, mean=%.3f, s=%.3f, SE=%.3f\n", n, x_bar, s, SE))
cat(sprintf("t-stat=%.3f, p-value=%.4f\n", t_stat, p_value))
# Or just use t.test()
t.test(gaji, mu = 28, alternative = "two.sided")
8 7. Worked Example 2: F-Test for Joint Hypotheses
Problem: regression \(y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\), \(n = 50\).
Test \(H_0: \beta_2 = \beta_3 = 0\) (the two variables are jointly insignificant).
Method 1: Restricted vs. Unrestricted SSE
- Unrestricted model (\(y = \beta_1 + \beta_2 x_2 + \beta_3 x_3\)): \(SSE_U = 245.3\), \(df = 47\)
- Restricted model (\(y = \beta_1\)): \(SSE_R = 312.8\), \(df = 49\)
\[F = \frac{(SSE_R - SSE_U)/m}{SSE_U/(n-k)} = \frac{(312.8 - 245.3)/2}{245.3/47} = \frac{67.5/2}{5.219} = \frac{33.75}{5.219} = 6.47\]
Critical value: \(F_{0.05}(2, 47) \approx 3.20\).
\(F = 6.47 > 3.20\), so reject \(H_0\). At least one of \(\beta_2, \beta_3\) is nonzero.
Method 2: Using R
set.seed(42)
n <- 50
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 2 + 0.5*x2 + 0.8*x3 + rnorm(n)
# Unrestricted model
model_u <- lm(y ~ x2 + x3)
# Restricted model
model_r <- lm(y ~ 1)
# F-test via anova()
anova(model_r, model_u)
# Or via linearHypothesis in car package
library(car)
linearHypothesis(model_u, c("x2 = 0", "x3 = 0"))
# Manual F-statistic
SSE_U <- sum(resid(model_u)^2)
SSE_R <- sum(resid(model_r)^2)
m <- 2 # restrictions
k <- 3 # parameters in unrestricted (including intercept)
F_stat <- ((SSE_R - SSE_U)/m) / (SSE_U/(n-k))
p_val <- pf(F_stat, df1=m, df2=n-k, lower.tail=FALSE)
cat(sprintf("F = %.3f, p = %.4f\n", F_stat, p_val))
9 8. R Code: Working with Distributions
# ============================================================
# NORMAL DISTRIBUTION
# ============================================================
pnorm(1.96) # P(Z < 1.96) ≈ 0.975
pnorm(1.96) - pnorm(-1.96) # P(-1.96 < Z < 1.96) ≈ 0.95
qnorm(0.975) # 97.5th percentile ≈ 1.96
dnorm(0, mean=0, sd=1) # PDF at 0 = 1/sqrt(2*pi) ≈ 0.399
# Standardization
x <- 115; mu <- 100; sigma <- 15
z <- (x - mu) / sigma # z = 1.0
pnorm(z) # P(X < 115) = P(Z < 1)
# ============================================================
# t-DISTRIBUTION
# ============================================================
pt(-2, df=10) # P(T < -2) with 10 df
qt(0.975, df=10) # Critical value ≈ 2.228
pt(-2, df=10) * 2 # Two-sided p-value
# Comparing t vs normal critical values
cat("df=5:", qt(0.975, df=5), "\n") # 2.571
cat("df=30:", qt(0.975, df=30), "\n") # 2.042
cat("df=120:", qt(0.975, df=120), "\n")# 1.980
cat("Normal:", qnorm(0.975), "\n") # 1.960
# ============================================================
# CHI-SQUARED DISTRIBUTION
# ============================================================
pchisq(9.49, df=4) # P(X < 9.49) with 4 df ≈ 0.95
qchisq(0.95, df=4) # 95th percentile ≈ 9.49
dchisq(4, df=4) # PDF at x=4
# Simulate chi-squared from normals
n_sim <- 10000
set.seed(2024)
z1 <- rnorm(n_sim); z2 <- rnorm(n_sim); z3 <- rnorm(n_sim)
chi_sq_3 <- z1^2 + z2^2 + z3^2
hist(chi_sq_3, probability=TRUE, breaks=50,
main="Simulated Chi-sq(3)", xlab="x")
curve(dchisq(x, df=3), add=TRUE, col="red", lwd=2)
legend("topright", "Theoretical", col="red", lwd=2)
# ============================================================
# F-DISTRIBUTION
# ============================================================
pf(3.89, df1=2, df2=20) # P(F < 3.89) ≈ 0.963
qf(0.95, df1=2, df2=20) # 95th percentile ≈ 3.49
pf(3.89, df1=2, df2=20, lower.tail=FALSE) # Upper tail
# Verify: t^2 ~ F(1, n)
t_val <- 2.5
n_df <- 20
t_sq <- t_val^2
# P(T > 2.5 in two-sided test) should equal P(F > 6.25)
p_t_twosided <- 2 * pt(-abs(t_val), df=n_df)
p_f <- pf(t_sq, df1=1, df2=n_df, lower.tail=FALSE)
cat("p-value from t:", p_t_twosided, "\n") # Same!
cat("p-value from F:", p_f, "\n") # Same!
# ============================================================
# PLOTTING: Visualize how t approaches Normal
# ============================================================
curve(dnorm(x), from=-4, to=4, lwd=2, col="black",
main="t-distribution vs Normal", ylab="Density")
curve(dt(x, df=1), add=TRUE, col="blue", lwd=2, lty=2)
curve(dt(x, df=5), add=TRUE, col="green", lwd=2, lty=3)
curve(dt(x, df=30), add=TRUE, col="orange", lwd=2, lty=4)
legend("topright",
legend=c("N(0,1)", "t(1)", "t(5)", "t(30)"),
col=c("black","blue","green","orange"),
lwd=2, lty=1:4)
10 9. Connections to Real Applications
In spatial econometrics: Moran's I (a test for spatial autocorrelation) is asymptotically \(N(0,1)\) under the null hypothesis. You use it like an ordinary z-test.
In time series: the Ljung-Box test for autocorrelation in residuals uses \(\chi^2(m)\), where \(m\) is the number of lags tested. \(Q = n(n+2)\sum_{k=1}^m \frac{\hat{\rho}_k^2}{n-k} \xrightarrow{d} \chi^2(m)\).
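In R this is Box.test with type = "Ljung-Box". The sketch below feeds it white-noise "residuals", so the statistic should be unremarkable relative to \(\chi^2(10)\):

```r
set.seed(1)
e <- rnorm(200)                                 # white noise: no autocorrelation
bt <- Box.test(e, lag = 10, type = "Ljung-Box")
bt   # Q statistic is compared against chi-squared(10)
```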
In ML (model comparison via hypothesis testing): the Likelihood Ratio test, \(-2\log\Lambda \xrightarrow{d} \chi^2(m)\), is used to compare nested models (for example, logistic regression with and without a feature).
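For nested logistic regressions this LR test is available via anova with test = "Chisq"; the simulated data below (where x2 is truly irrelevant) are purely illustrative:

```r
set.seed(42)
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1))     # x2 plays no role in the truth
m0 <- glm(y ~ x1,      family = binomial)      # restricted model
m1 <- glm(y ~ x1 + x2, family = binomial)      # unrestricted model
anova(m0, m1, test = "Chisq")   # deviance difference vs chi-squared(1)
```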
Bottom line: knowing the null distribution is knowing your test. Before interpreting any p-value, ask: "which distribution was used to produce this p-value?"
11 10. Practice Problems
Problem 1: Standard normal probabilities.
- Compute \(P(Z > 1.96)\)
- Compute \(P(-2 < Z < 2)\)
- Compute \(P(Z < -1.645)\)
Answers:
- \(P(Z > 1.96) = 1 - \Phi(1.96) = 1 - 0.975 = 0.025\)
- \(P(-2 < Z < 2) = \Phi(2) - \Phi(-2) = 0.9772 - 0.0228 = 0.9545\)
- \(P(Z < -1.645) = 1 - \Phi(1.645) = 0.05\)
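These can be verified directly with pnorm:

```r
round(pnorm(1.96, lower.tail = FALSE), 4)   # 0.025
round(pnorm(2) - pnorm(-2), 4)              # 0.9545
round(pnorm(-1.645), 4)                     # 0.05
```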
Problem 2: Chi-squared moments.
If \(X \sim \chi^2(5)\):
- Compute \(E[X]\) and \(\text{Var}(X)\)
- Compute \(P(X > 11.07)\) (use R or a table)
- If \(Y \sim \chi^2(3)\) is independent of \(X\), what is the distribution of \(X + Y\)?
Answers:
- \(E[X] = k = 5\); \(\text{Var}(X) = 2k = 10\)
- \(P(X > 11.07) = 1 - \text{pchisq}(11.07, df=5) \approx 0.05\)
- \(X + Y \sim \chi^2(5+3) = \chi^2(8)\) (additive property)
Problem 3: Show \(T^2 \sim F(1,n)\).
Hint: \(T = Z/\sqrt{V/n}\) where \(Z \sim N(0,1)\) and \(V \sim \chi^2(n)\) are independent. Then: \[T^2 = \frac{Z^2}{V/n} = \frac{Z^2/1}{V/n}\]
\(Z^2 \sim \chi^2(1)\) (the square of a standard normal). So \(T^2 = \frac{\chi^2(1)/1}{\chi^2(n)/n} \sim F(1,n)\). QED.
Problem 4: Suppose a regression has \(n=100\) observations and \(k=5\) parameters. You want to test \(H_0: \beta_3 = \beta_4 = \beta_5 = 0\). The unrestricted model has \(SSE_U = 180\) and the restricted model has \(SSE_R = 210\).
- Compute the F-statistic
- What is the distribution under \(H_0\)?
- Is the result significant at \(\alpha = 0.05\)? (\(F_{0.05}(3, 95) \approx 2.70\))
Answer: \[F = \frac{(210-180)/3}{180/95} = \frac{10}{1.895} = 5.28 > 2.70 \Rightarrow \text{Reject } H_0\]
# Verify
pf(5.28, df1=3, df2=95, lower.tail=FALSE) # p ≈ 0.002