Important Probability Distributions

Normal, t, Chi-squared, F, and Their Friends

probability
distributions
statistics
A comprehensive guide to the probability distributions that matter for econometrics and ML: Normal, Chi-squared, t, F, and more.

1 Why Does This Matter?

Note: Why This Matters for Your Work

All hypothesis testing depends on knowing the distribution of your test statistic. The t-test assumes a t-distribution, the chi-square test a chi-square distribution, the F-test an F-distribution. Without this, a p-value is just a number without meaning: you cannot say whether 0.03 is "small" without knowing the reference distribution.

Every time you run summary(lm(...)) in R and see a p-value, behind the scenes R is computing a tail probability of the t-distribution. Every time you run an F-test for joint significance, R uses the F-distribution. Understanding these distributions is not mere theory; it is the foundation of every number in your regression table.


2 1. Normal Distribution

The normal distribution is the most fundamental distribution in statistics. Not because nature is always normal, but because the Central Limit Theorem guarantees that sample means are approximately normal in large samples.

Important (Definition): Normal Distribution

A random variable \(X\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\), written \(X \sim N(\mu, \sigma^2)\), if it has the PDF:

\[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}\]

Standard Normal: \(Z \sim N(0,1)\) is the special case with \(\mu=0\), \(\sigma^2=1\).

The CDF of the standard normal is denoted \(\Phi(z) = P(Z \leq z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}e^{-t^2/2}dt\).

Standardization: if \(X \sim N(\mu, \sigma^2)\), then \(Z = \frac{X-\mu}{\sigma} \sim N(0,1)\).

2.1 Properties of the Normal Distribution

1. Symmetry: \(f(x) = f(2\mu - x)\); the distribution is symmetric about the mean \(\mu\).

2. 68-95-99.7 Rule: \[P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.6827\] \[P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.9545\] \[P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 0.9973\]

3. Linear Combinations: If \(X_i \sim N(\mu_i, \sigma_i^2)\) are independent, then: \[\sum_{i=1}^n a_i X_i \sim N\left(\sum_{i=1}^n a_i\mu_i,\ \sum_{i=1}^n a_i^2\sigma_i^2\right)\]

This is crucial for OLS: \(\hat{\beta} = (X^TX)^{-1}X^Ty\) is a linear combination of \(y\), so if \(\varepsilon \sim N(0, \sigma^2 I)\), then \(\hat{\beta} \sim N(\beta, \sigma^2(X^TX)^{-1})\).

4. MGF: \(M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)\)
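Property 3 above is easy to verify by simulation; a minimal R sketch (the means, variances, and weights are illustrative choices):

```r
# Check the linear-combination property: if X1 ~ N(1, 4) and X2 ~ N(2, 9)
# are independent, then 2*X1 - X2 ~ N(2*1 - 2, 4*4 + 1*9) = N(0, 25).
set.seed(123)
x1 <- rnorm(1e5, mean = 1, sd = 2)
x2 <- rnorm(1e5, mean = 2, sd = 3)
w  <- 2 * x1 - x2
c(mean = mean(w), var = var(w))   # should be close to 0 and 25
```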

2.2 Multivariate Normal

Important (Definition): Multivariate Normal Distribution

The vector \(\mathbf{X} = (X_1, \ldots, X_k)^T\) is multivariate normal, written \(\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)\), if it has the PDF:

\[f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\]

where \(\boldsymbol{\mu} \in \mathbb{R}^k\) is the mean vector and \(\Sigma\) is a \(k \times k\) positive definite covariance matrix.

Key Properties of MVN:

  • Marginals are normal: \(X_i \sim N(\mu_i, \Sigma_{ii})\)
  • Conditionals are normal: \(X_1 | X_2 = x_2 \sim N(\mu_{1|2}, \Sigma_{1|2})\), where: \[\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\] \[\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\]
  • Linear combinations are normal: \(A\mathbf{X} \sim N(A\boldsymbol{\mu}, A\Sigma A^T)\)
  • Uncorrelated \(\Rightarrow\) independent (only for the MVN; not true in general!)
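The conditional-mean formula can be checked by simulation; a sketch using MASS::mvrnorm (the covariance 0.6 and the conditioning point are illustrative choices):

```r
# Bivariate normal with mu = (0, 0), unit variances, covariance 0.6.
# Theory: E[X1 | X2 = x2] = mu1 + (Sigma12/Sigma22) * (x2 - mu2) = 0.6 * x2.
library(MASS)                       # for mvrnorm (recommended package)
set.seed(1)
Sigma <- matrix(c(1, 0.6, 0.6, 1), 2, 2)
X <- mvrnorm(2e5, mu = c(0, 0), Sigma = Sigma)
sel <- abs(X[, 2] - 1) < 0.05       # condition on X2 being close to 1
mean(X[sel, 1])                     # should be near 0.6 * 1 = 0.6
```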

3 2. Chi-Squared Distribution

The chi-squared distribution arises naturally from squares of standard normals.

Important (Definition): Chi-Squared Distribution

If \(Z_1, Z_2, \ldots, Z_k \sim N(0,1)\) iid, then: \[V = \sum_{i=1}^k Z_i^2 \sim \chi^2(k)\]

where \(k\) is the degrees of freedom.

Moments:

  • \(E[V] = k\)
  • \(\text{Var}(V) = 2k\)

PDF: \(f(v) = \frac{v^{k/2-1}e^{-v/2}}{2^{k/2}\Gamma(k/2)}\) for \(v > 0\).

3.1 Connection to the Sample Variance

This is one of the most important results in inferential statistics:

\[\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \quad \text{if } X_i \sim N(\mu, \sigma^2) \text{ iid}\]

where \(S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\).

Why \(n-1\) degrees of freedom? Because we lose 1 df when estimating \(\mu\) with \(\bar{X}\).
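This result is easy to see by simulation; a quick sketch (the sample size and \(\sigma\) are illustrative choices): draw many normal samples, form \((n-1)S^2/\sigma^2\), and compare its mean and variance to \(n-1\) and \(2(n-1)\).

```r
# Simulate (n-1)S^2/sigma^2 from normal samples (n = 10, sigma = 3 are
# illustrative) and compare with chi-squared(9): mean = 9, variance = 18.
set.seed(7)
n <- 10; sigma <- 3
stat <- replicate(5e4, (n - 1) * var(rnorm(n, mean = 5, sd = sigma)) / sigma^2)
c(mean = mean(stat), var = var(stat))   # close to 9 and 18
```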

3.2 Additive Property

If \(V_1 \sim \chi^2(m)\) and \(V_2 \sim \chi^2(n)\) are independent, then: \[V_1 + V_2 \sim \chi^2(m+n)\]

Application: In regression, SST = SSR + SSE, and this can be decomposed into chi-squared components.

Used in:

  • Goodness-of-fit tests (Pearson chi-squared)
  • Tests of independence (contingency tables)
  • Lagrange Multiplier (LM) tests
  • Likelihood Ratio (LR) tests
  • Testing restrictions in econometric models


4 3. Student’s t-Distribution

The t-distribution arises when we do not know \(\sigma^2\) and must estimate it.

Important (Definition): Student’s t-Distribution

If \(Z \sim N(0,1)\) and \(V \sim \chi^2(k)\) are independent, then: \[T = \frac{Z}{\sqrt{V/k}} \sim t(k)\]

where \(k\) is the degrees of freedom.

Moments (for \(k > 2\)):

  • \(E[T] = 0\)
  • \(\text{Var}(T) = \frac{k}{k-2}\)

The tails are heavier than the normal’s; the smaller \(k\), the heavier the tails.

4.1 Convergence to the Normal

\[t(k) \xrightarrow{d} N(0,1) \text{ as } k \to \infty\]

In practice: for \(k > 30\) the t-distribution is already very close to the normal; for \(k > 120\) they are nearly identical.

4.2 Connection to OLS

This is the critical connection to regression:

\[\frac{\hat{\beta}_j - \beta_j}{\hat{s.e.}(\hat{\beta}_j)} \sim t(n-k)\]

under the Gauss-Markov assumptions plus normality of the errors (\(\varepsilon \sim N(0,\sigma^2 I)\)), where:

  • \(n\) = number of observations
  • \(k\) = number of parameters (including the intercept)
  • \(n-k\) = residual degrees of freedom

Used for:

  • Individual coefficient significance tests
  • Confidence intervals for means with unknown variance
  • Comparing two group means (Welch’s t-test)
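To see this connection concretely, a small sketch on simulated data (the coefficients 1 and 0.5 are illustrative) reproduces the t-statistic and p-value that summary(lm(...)) reports:

```r
# The t-statistics reported by summary(lm(...)) are (estimate - 0) / std.error,
# with two-sided p-values from t(n - k).
set.seed(1)
n <- 40
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
fit <- summary(lm(y ~ x))
est <- fit$coefficients["x", "Estimate"]
se  <- fit$coefficients["x", "Std. Error"]
t_manual <- est / se
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)   # k = 2 parameters here
c(t_manual, fit$coefficients["x", "t value"])    # identical
c(p_manual, fit$coefficients["x", "Pr(>|t|)"])   # identical
```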


5 4. F-Distribution

The F-distribution is the ratio of two chi-squared distributions. It is the distribution of the F-statistics in regression.

Important (Definition): F-Distribution

If \(U \sim \chi^2(m)\) and \(V \sim \chi^2(n)\) are independent, then: \[F = \frac{U/m}{V/n} \sim F(m, n)\]

where \(m\) is the numerator df and \(n\) the denominator df.

Mean: \(E[F] = \frac{n}{n-2}\) for \(n > 2\) (slightly above 1)

PDF: Right-skewed, defined for \(F > 0\).

5.1 F-test for Joint Hypotheses

To test \(H_0: R\beta = r\) (linear restrictions) in the regression \(y = X\beta + \varepsilon\):

\[F = \frac{(R\hat{\beta}-r)^T\left[R(X^TX)^{-1}R^T\right]^{-1}(R\hat{\beta}-r)/m}{\hat{\sigma}^2} \sim F(m, n-k)\]

where:

  • \(m\) = number of restrictions
  • \(n-k\) = residual degrees of freedom
  • \(\hat{\sigma}^2 = SSE/(n-k)\) = estimated error variance

5.2 The Relationship Between F and t

Caution (Connection): F and t Are One Family

If \(T \sim t(n)\), then \(T^2 \sim F(1, n)\).

This means a two-sided t-test is equivalent to an F-test with 1 restriction.

When you test \(H_0: \beta_j = 0\) with a t-statistic, the F-statistic for that same restriction is exactly the square of the t-statistic, and the p-values are identical!

This also explains why the "overall significance" F-test in a regression summary always tests all slope coefficients at once: it is a single joint restriction, not a collection of separate t-tests.

Used in:

  • Joint hypothesis tests (\(\beta_2 = \beta_3 = 0\), etc.)
  • ANOVA (comparing means across groups)
  • Overall significance of a regression
  • Chow test (structural break)
  • Testing equality of variances


6 5. Other Important Distributions

6.1 Bernoulli and Binomial

Bernoulli: \(X \in \{0, 1\}\), \(P(X=1) = p\). The foundation of logistic regression. \[E[X] = p, \quad \text{Var}(X) = p(1-p)\]

Binomial: \(X = \sum_{i=1}^n X_i\) with \(X_i \sim \text{Bernoulli}(p)\) iid. \[P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}\] \[E[X] = np, \quad \text{Var}(X) = np(1-p)\]
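A quick sanity check of the binomial PMF against R's built-in dbinom (the values of \(n\), \(p\), and \(k\) are illustrative):

```r
# P(X = 3) for n = 10, p = 0.3: once via the formula, once via dbinom.
p_formula <- choose(10, 3) * 0.3^3 * 0.7^7
p_builtin <- dbinom(3, size = 10, prob = 0.3)
c(p_formula, p_builtin)   # both approximately 0.2668
```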

6.2 Poisson Distribution

For count data: the number of events occurring in a fixed interval.

\[P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots\]

Key property: \(E[X] = \text{Var}(X) = \lambda\) (mean equals variance — useful for testing overdispersion!).

Used in: count regression models (Poisson regression), queueing theory, spatial point processes.
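The mean-equals-variance property is easy to check by simulation (\(\lambda = 4\) is an illustrative choice):

```r
# Poisson: the mean equals the variance.
set.seed(42)
x <- rpois(1e5, lambda = 4)
c(mean = mean(x), var = var(x))   # both close to lambda = 4
dpois(2, lambda = 4)              # P(X = 2) = e^-4 * 4^2 / 2! ≈ 0.1465
```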

6.3 Exponential Distribution

Waiting time between Poisson events.

\[f(x) = \lambda e^{-\lambda x}, \quad x > 0\] \[E[X] = 1/\lambda, \quad \text{Var}(X) = 1/\lambda^2\]

Memoryless property: \(P(X > s+t \mid X > s) = P(X > t)\); the only continuous distribution with this property.
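The memoryless property can be verified numerically with pexp (the rate and the values of \(s\) and \(t\) are illustrative):

```r
# Memorylessness: P(X > s+t | X > s) = P(X > t) for the exponential.
lambda <- 2; s <- 1; tt <- 0.5
lhs <- pexp(s + tt, rate = lambda, lower.tail = FALSE) /
       pexp(s, rate = lambda, lower.tail = FALSE)
rhs <- pexp(tt, rate = lambda, lower.tail = FALSE)
c(lhs = lhs, rhs = rhs)   # identical: both equal exp(-lambda * tt)
```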

6.4 Beta Distribution

For modeling probabilities and proportions (values between 0 and 1).

\[f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \quad x \in (0,1)\]

Used as the conjugate prior for the binomial probability \(p\) in Bayesian statistics.

  • \(\alpha = \beta = 1\): uniform prior
  • Large \(\alpha, \beta\): strong prior belief near \(\alpha/(\alpha+\beta)\)
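A minimal sketch of the conjugate update, with hypothetical prior parameters and data: a Beta(\(a, b\)) prior plus \(k\) successes in \(n\) trials gives a Beta(\(a+k,\ b+n-k\)) posterior.

```r
# Beta-binomial conjugacy: prior Beta(a, b), data (k successes in n trials)
# -> posterior Beta(a + k, b + n - k). All values below are hypothetical.
a <- 2; b <- 2                            # prior parameters
k <- 7; n <- 10                           # observed data
post_a <- a + k; post_b <- b + n - k
post_mean <- post_a / (post_a + post_b)   # 9/14, approximately 0.643
qbeta(c(0.025, 0.975), post_a, post_b)    # 95% credible interval for p
```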

6.5 Gamma Distribution

Generalizes the exponential: for integer shape \(\alpha = k\), it is the sum of \(k\) iid exponential variables.

\[f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0\] \[E[X] = \alpha/\beta, \quad \text{Var}(X) = \alpha/\beta^2\]

Note: \(\chi^2(k) = \text{Gamma}(k/2, 1/2)\) in the shape-rate parameterization; chi-squared is a special case of the Gamma.
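This identity can be checked numerically: dchisq and dgamma should agree pointwise.

```r
# Chi-squared as a Gamma special case: chi2(k) = Gamma(shape = k/2, rate = 1/2).
x <- seq(0.5, 20, by = 0.5)
k <- 6
max(abs(dchisq(x, df = k) - dgamma(x, shape = k/2, rate = 1/2)))  # ~ 0
```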

6.6 Dirichlet Distribution

Generalizes the Beta to probability vectors \((\pi_1, \ldots, \pi_K)\) with \(\sum \pi_k = 1\).

Used as the conjugate prior for multinomial probabilities in Bayesian statistics; the foundation of Latent Dirichlet Allocation (LDA) in topic modeling.


7 6. Worked Example 1: Full t-Test

Problem: Salary data (millions of rupiah) for 15 new hires:

28.5, 31.2, 25.8, 33.4, 29.1, 27.6, 30.8, 32.1, 26.4, 28.9, 31.5, 27.3, 29.8, 30.2, 28.7

Test \(H_0: \mu = 28\) vs \(H_1: \mu \neq 28\) at \(\alpha = 0.05\).

Step 1: Compute the sample statistics. \[\bar{x} = \frac{1}{15}\sum x_i = \frac{441.3}{15} = 29.42\] \[s^2 = \frac{1}{14}\sum(x_i - \bar{x})^2 = 4.64, \quad s = 2.154\] \[SE = \frac{s}{\sqrt{n}} = \frac{2.154}{\sqrt{15}} = 0.556\]

Step 2: Compute the t-statistic. \[t = \frac{\bar{x} - \mu_0}{SE} = \frac{29.42 - 28}{0.556} = \frac{1.42}{0.556} = 2.554\]

Step 3: Compare with the critical value.

With \(df = n-1 = 14\) and \(\alpha/2 = 0.025\): \(t_{14, 0.025} = 2.145\).

\(|t| = 2.554 > 2.145\), so reject \(H_0\).

Step 4: Compute the p-value. \[p = 2 \times P(T_{14} > 2.554) \approx 0.023\]

Conclusion: There is enough evidence to reject \(H_0\). The mean salary differs significantly from 28 million (\(t(14) = 2.55\), \(p \approx 0.023\)).

# Data
gaji <- c(28.5, 31.2, 25.8, 33.4, 29.1, 27.6, 30.8, 32.1,
          26.4, 28.9, 31.5, 27.3, 29.8, 30.2, 28.7)

# Manual calculation
n <- length(gaji)
x_bar <- mean(gaji)
s <- sd(gaji)
SE <- s / sqrt(n)
t_stat <- (x_bar - 28) / SE
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

cat(sprintf("n=%d, mean=%.3f, s=%.3f, SE=%.3f\n", n, x_bar, s, SE))
cat(sprintf("t-stat=%.3f, p-value=%.4f\n", t_stat, p_value))

# Or just use t.test()
t.test(gaji, mu = 28, alternative = "two.sided")

8 7. Worked Example 2: F-Test for Joint Hypotheses

Problem: Regression \(y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\), \(n = 50\).

Test \(H_0: \beta_2 = \beta_3 = 0\) (the two variables are jointly insignificant).

Method 1: Restricted vs Unrestricted SSE

  • Unrestricted model (\(y = \beta_1 + \beta_2 x_2 + \beta_3 x_3\)): \(SSE_U = 245.3\), \(df = 47\)
  • Restricted model (\(y = \beta_1\)): \(SSE_R = 312.8\), \(df = 49\)

\[F = \frac{(SSE_R - SSE_U)/m}{SSE_U/(n-k)} = \frac{(312.8 - 245.3)/2}{245.3/47} = \frac{67.5/2}{5.219} = \frac{33.75}{5.219} = 6.47\]

Critical value: \(F_{0.05}(2, 47) \approx 3.20\).

\(F = 6.47 > 3.20\), reject \(H_0\). At least one of \(\beta_2, \beta_3\) is nonzero.

Method 2: Using R

set.seed(42)
n <- 50
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 2 + 0.5*x2 + 0.8*x3 + rnorm(n)

# Unrestricted model
model_u <- lm(y ~ x2 + x3)

# Restricted model
model_r <- lm(y ~ 1)

# F-test via anova()
anova(model_r, model_u)

# Or via linearHypothesis in car package
library(car)
linearHypothesis(model_u, c("x2 = 0", "x3 = 0"))

# Manual F-statistic
SSE_U <- sum(resid(model_u)^2)
SSE_R <- sum(resid(model_r)^2)
m <- 2   # restrictions
k <- 3   # parameters in unrestricted (including intercept)
F_stat <- ((SSE_R - SSE_U)/m) / (SSE_U/(n-k))
p_val <- pf(F_stat, df1=m, df2=n-k, lower.tail=FALSE)
cat(sprintf("F = %.3f, p = %.4f\n", F_stat, p_val))

9 8. R Code: Working with Distributions

# ============================================================
# NORMAL DISTRIBUTION
# ============================================================
pnorm(1.96)              # P(Z < 1.96) ≈ 0.975
pnorm(1.96) - pnorm(-1.96)  # P(-1.96 < Z < 1.96) ≈ 0.95
qnorm(0.975)             # 97.5th percentile ≈ 1.96
dnorm(0, mean=0, sd=1)   # PDF at 0 = 1/sqrt(2*pi) ≈ 0.399

# Standardization
x <- 115; mu <- 100; sigma <- 15
z <- (x - mu) / sigma    # z = 1.0
pnorm(z)                  # P(X < 115) = P(Z < 1)

# ============================================================
# t-DISTRIBUTION
# ============================================================
pt(-2, df=10)            # P(T < -2) with 10 df
qt(0.975, df=10)         # Critical value ≈ 2.228
pt(-2, df=10) * 2        # Two-sided p-value

# Comparing t vs normal critical values
cat("df=5:", qt(0.975, df=5), "\n")    # 2.571
cat("df=30:", qt(0.975, df=30), "\n")  # 2.042
cat("df=120:", qt(0.975, df=120), "\n")# 1.980
cat("Normal:", qnorm(0.975), "\n")     # 1.960

# ============================================================
# CHI-SQUARED DISTRIBUTION
# ============================================================
pchisq(9.49, df=4)       # P(X < 9.49) with 4 df ≈ 0.95
qchisq(0.95, df=4)       # 95th percentile ≈ 9.49
dchisq(4, df=4)          # PDF at x=4

# Simulate chi-squared from normals
n_sim <- 10000
set.seed(2024)
z1 <- rnorm(n_sim); z2 <- rnorm(n_sim); z3 <- rnorm(n_sim)
chi_sq_3 <- z1^2 + z2^2 + z3^2
hist(chi_sq_3, probability=TRUE, breaks=50,
     main="Simulated Chi-sq(3)", xlab="x")
curve(dchisq(x, df=3), add=TRUE, col="red", lwd=2)
legend("topright", "Theoretical", col="red", lwd=2)

# ============================================================
# F-DISTRIBUTION
# ============================================================
pf(3.89, df1=2, df2=20)  # P(F < 3.89) ≈ 0.963
qf(0.95, df1=2, df2=20)  # 95th percentile ≈ 3.49
pf(3.89, df1=2, df2=20, lower.tail=FALSE)  # Upper tail

# Verify: t^2 ~ F(1, n)
t_val <- 2.5
n_df <- 20
t_sq <- t_val^2
# P(T > 2.5 in two-sided test) should equal P(F > 6.25)
p_t_twosided <- 2 * pt(-abs(t_val), df=n_df)
p_f <- pf(t_sq, df1=1, df2=n_df, lower.tail=FALSE)
cat("p-value from t:", p_t_twosided, "\n")  # Same!
cat("p-value from F:", p_f, "\n")            # Same!

# ============================================================
# PLOTTING: Visualize how t approaches Normal
# ============================================================
curve(dnorm(x), from=-4, to=4, lwd=2, col="black",
      main="t-distribution vs Normal", ylab="Density")
curve(dt(x, df=1), add=TRUE, col="blue", lwd=2, lty=2)
curve(dt(x, df=5), add=TRUE, col="green", lwd=2, lty=3)
curve(dt(x, df=30), add=TRUE, col="orange", lwd=2, lty=4)
legend("topright",
       legend=c("N(0,1)", "t(1)", "t(5)", "t(30)"),
       col=c("black","blue","green","orange"),
       lwd=2, lty=1:4)

10 9. Connections to Real Applications

Caution (Connection): Distributions in Econometrics and ML

In spatial econometrics: Moran’s I (the test for spatial autocorrelation) is asymptotically \(N(0,1)\) under the null hypothesis. You use it like an ordinary z-test.

In time series: the Ljung-Box test for autocorrelation in residuals uses \(\chi^2(m)\), where \(m\) is the number of lags tested. \(Q = n(n+2)\sum_{k=1}^m \frac{\hat{\rho}_k^2}{n-k} \xrightarrow{d} \chi^2(m)\).

In ML, for model comparison: the Likelihood Ratio test \(-2\log\Lambda \xrightarrow{d} \chi^2(m)\) is used to compare nested models (e.g., logistic regression with and without a feature).

Bottom line: knowing the null distribution is knowing your test. Before interpreting any p-value, ask: "which distribution was used to produce this p-value?"
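For reference, the Ljung-Box test mentioned above is available in base R as Box.test; a minimal sketch on white noise (series length and lag are illustrative choices):

```r
# Ljung-Box test via Box.test; under H0 (no autocorrelation), Q ~ chi2(lag).
set.seed(99)
e <- rnorm(200)                              # white noise, so H0 holds
bt <- Box.test(e, lag = 10, type = "Ljung-Box")
bt                                           # Q statistic, df = 10, p-value
```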


11 10. Practice Problems

Problem 1: Standard normal probabilities.

  • Compute \(P(Z > 1.96)\)
  • Compute \(P(-2 < Z < 2)\)
  • Compute \(P(Z < -1.645)\)

Answer:

  • \(P(Z > 1.96) = 1 - \Phi(1.96) = 1 - 0.975 = 0.025\)
  • \(P(-2 < Z < 2) = \Phi(2) - \Phi(-2) = 0.9772 - 0.0228 = 0.9545\)
  • \(P(Z < -1.645) = 1 - \Phi(1.645) = 0.05\)

Problem 2: Chi-squared moments.

If \(X \sim \chi^2(5)\):

  • Compute \(E[X]\) and \(\text{Var}(X)\)
  • Compute \(P(X > 11.07)\) (use R or a table)
  • If \(Y \sim \chi^2(3)\) is independent of \(X\), what is the distribution of \(X + Y\)?

Answer:

  • \(E[X] = k = 5\); \(\text{Var}(X) = 2k = 10\)
  • \(P(X > 11.07) = 1 - \text{pchisq}(11.07, df=5) \approx 0.05\)
  • \(X + Y \sim \chi^2(5+3) = \chi^2(8)\) (additive property)

Problem 3: Show \(T^2 \sim F(1,n)\).

Hint: \(T = Z/\sqrt{V/n}\) where \(Z \sim N(0,1)\) and \(V \sim \chi^2(n)\) are independent. Then: \[T^2 = \frac{Z^2}{V/n} = \frac{Z^2/1}{V/n}\]

\(Z^2 \sim \chi^2(1)\) (the square of a standard normal). So \(T^2 = \frac{\chi^2(1)/1}{\chi^2(n)/n} \sim F(1,n)\). QED.

Problem 4: Suppose a regression has \(n=100\) observations and \(k=5\) parameters. You want to test \(H_0: \beta_3 = \beta_4 = \beta_5 = 0\). The unrestricted model has \(SSE_U = 180\) and the restricted model has \(SSE_R = 210\).

  • Compute the F-statistic
  • What is the distribution under \(H_0\)?
  • Is the result significant at \(\alpha = 0.05\)? (\(F_{0.05}(3, 95) \approx 2.70\))

Answer: \[F = \frac{(210-180)/3}{180/95} = \frac{10}{1.895} = 5.28 > 2.70 \Rightarrow \text{Reject } H_0\]

# Verify
pf(5.28, df1=3, df2=95, lower.tail=FALSE)  # p ≈ 0.002