Introduction

Double Machine Learning (DML) combines the flexibility of machine learning with the causal inference framework of econometrics. Developed by Chernozhukov et al. (2018), DML yields statistically valid estimates of treatment effects even when the confounders affect the treatment and the outcome non-linearly.

This tutorial uses the EconML package from Microsoft Research.

Packages Used

import numpy as np
import polars as pl
from econml.dml import LinearDML, CausalForestDML
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
import matplotlib.pyplot as plt

DML Intuition

DML works in two stages:

  1. Stage 1 (nuisance estimation): use ML to predict the treatment \(T\) and the outcome \(Y\) from the confounders \(X\)
  2. Stage 2 (causal estimation): regress the residual of \(Y\) on the residual of \(T\)

\[ \hat{\theta} = \frac{E[\tilde{Y} \cdot \tilde{T}]}{E[\tilde{T}^2]} \]

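where \(\tilde{Y} = Y - \hat{m}(X)\) and \(\tilde{T} = T - \hat{e}(X)\), with \(\hat{m}(X)\) the first-stage estimate of \(E[Y \mid X]\) and \(\hat{e}(X)\) the first-stage estimate of \(E[T \mid X]\).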

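Advantages: the residual-on-residual estimator is Neyman-orthogonal, so small (first-order) errors in either nuisance model have little effect on \(\hat{\theta}\), and cross-fitting keeps first-stage overfitting from invalidating inference. This is weaker than being robust to misspecification of one stage: both nuisance models still need to be estimated reasonably well, just not at the parametric rate.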
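
The two stages can be sketched by hand in a few lines. The example below is a minimal illustration, not the EconML implementation: it uses a continuous treatment with a constant effect of 2, and a least-squares fit on polynomial features stands in for the ML learner so that the block depends only on NumPy. The helper names `poly_features` and `cross_fit_predict` are made up for this sketch.

```python
import numpy as np

def poly_features(X):
    """Intercept, linear and squared terms: a simple stand-in for a flexible learner."""
    return np.column_stack([np.ones(len(X)), X, X ** 2])

def cross_fit_predict(X, target, n_folds=5, seed=0):
    """Out-of-fold predictions: each fold is predicted by a model fit on the other folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    Phi = poly_features(X)
    preds = np.empty(len(X))
    for test_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[test_idx] = False
        coef = np.linalg.lstsq(Phi[train_mask], target[train_mask], rcond=None)[0]
        preds[test_idx] = Phi[test_idx] @ coef
    return preds

# Toy data (assumption: continuous treatment, constant effect theta = 2)
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
T = X[:, 0] + rng.normal(size=n)
Y = 2.0 * T + X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(size=n)

# Stage 1: cross-fitted nuisance predictions, then residuals
Y_res = Y - cross_fit_predict(X, Y)   # Y - m_hat(X)
T_res = T - cross_fit_predict(X, T)   # T - e_hat(X)

# Stage 2: residual-on-residual regression (the formula above)
theta_hat = np.mean(Y_res * T_res) / np.mean(T_res ** 2)

# Plug-in standard error based on the orthogonal score
psi = (Y_res - theta_hat * T_res) * T_res
se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(T_res ** 2)

print(f"theta_hat = {theta_hat:.3f}, se = {se:.3f}")
```

Any learner with a fit/predict interface could replace the polynomial least-squares step; the out-of-fold prediction loop is the part that implements cross-fitting.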

Simulated Data

We simulate data with a heterogeneous treatment effect:

np.random.seed(42)
n = 2000

# Confounders
X = np.random.normal(0, 1, (n, 5))

# Treatment (depends on the confounders)
propensity = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
T = np.random.binomial(1, propensity)

# Heterogeneous treatment effect
true_effect = 2 + 3 * X[:, 0]  # effect varies with X0

# Outcome
Y = (
    true_effect * T
    + X[:, 0] ** 2
    + np.sin(X[:, 1])
    + 0.5 * X[:, 2]
    + np.random.normal(0, 1, n)
)

df = pl.DataFrame({
    "Y": Y,
    "T": T,
    **{f"X{i}": X[:, i] for i in range(5)}
})
df.head()

True ATE = \(E[2 + 3X_0] = 2\) (because \(E[X_0] = 0\))
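
As a quick sanity check on this simulation, the sketch below regenerates the same data and compares the sample average of the true effect with the naive difference in mean outcomes between treated and untreated units. Because units with large \(X_0\) are both more likely to be treated and have larger effects, the naive comparison should overstate the ATE. The variable names `sample_ate` and `naive_diff` are introduced here for illustration only.

```python
import numpy as np

# Same data-generating process and seed as the simulation above
np.random.seed(42)
n = 2000
X = np.random.normal(0, 1, (n, 5))
propensity = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
T = np.random.binomial(1, propensity)
true_effect = 2 + 3 * X[:, 0]
Y = (
    true_effect * T
    + X[:, 0] ** 2
    + np.sin(X[:, 1])
    + 0.5 * X[:, 2]
    + np.random.normal(0, 1, n)
)

# Sample analogue of the true ATE, E[2 + 3*X0]
sample_ate = true_effect.mean()

# Naive estimate: ignores that X0 and X1 drive treatment assignment
naive_diff = Y[T == 1].mean() - Y[T == 0].mean()

print(f"sample ATE: {sample_ate:.3f}")
print(f"naive difference in means: {naive_diff:.3f}")
```

The gap between the two numbers is the confounding bias that the first-stage residualization is meant to remove.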

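Estimation with LinearDML

# DML model with GBMs as first-stage learners
dml = LinearDML(
    model_y=GradientBoostingRegressor(n_estimators=100, max_depth=3),
    model_t=GradientBoostingClassifier(n_estimators=100, max_depth=3),
    discrete_treatment=True,  # T is binary, so model_t is a classifier
    cv=5,                     # number of cross-fitting folds
    random_state=42
)

# All five covariates are passed as X (effect modifiers); no separate controls W
dml.fit(Y, T, X=X, W=None)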

# Average Treatment Effect
ate = dml.ate(X)
ate_ci = dml.ate_interval(X, alpha=0.05)

Results

results = pl.DataFrame({
    "Metric": ["ATE", "95% CI Lower", "95% CI Upper", "True ATE"],
    # Cast NumPy scalars to plain floats so the column has a single dtype
    "Value": [float(ate), float(ate_ci[0]), float(ate_ci[1]), 2.0]
})
results

Heterogeneous Treatment Effects

# Conditional ATE (CATE) for each observation
cate = dml.effect(X)

# Compare with the true effect
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Scatter: estimated vs true
axes[0].scatter(true_effect, cate, alpha=0.3, s=10)
axes[0].plot([true_effect.min(), true_effect.max()],
             [true_effect.min(), true_effect.max()],
             'r--', lw=2, label='Perfect')
axes[0].set_xlabel('True CATE')
axes[0].set_ylabel('Estimated CATE')
axes[0].set_title('Estimated vs True Treatment Effect')
axes[0].legend()

# CATE by X0
axes[1].scatter(X[:, 0], cate, alpha=0.3, s=10, label='Estimated')
sort_idx = np.argsort(X[:, 0])
axes[1].plot(X[sort_idx, 0], true_effect[sort_idx],
             'r-', lw=2, label='True: 2 + 3·X₀')
axes[1].set_xlabel('X₀')
axes[1].set_ylabel('CATE')
axes[1].set_title('Treatment Effect Heterogeneity')
axes[1].legend()

plt.tight_layout()
plt.show()
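
To see why a linear final stage can recover the line \(2 + 3X_0\), the sketch below isolates that stage. It is a simplified illustration under stated assumptions, not EconML's code: the nuisance functions are the true ones (available only because the data are simulated), and the effect is modeled as linear in \(X_0\) alone, whereas the fit above passes all five columns of \(X\). The names `a_hat`, `b_hat` and `cate_hat` are introduced for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))

# Same structure as the tutorial's data-generating process
e = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))    # true propensity E[T|X]
T = rng.binomial(1, e)
tau = 2 + 3 * X[:, 0]                               # true CATE
g = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.5 * X[:, 2]  # baseline outcome
Y = tau * T + g + rng.normal(size=n)

# Oracle nuisances (assumption: known here only because we simulated the data)
m = tau * e + g          # E[Y|X]
Y_res = Y - m
T_res = T - e

# Final stage: least squares of Y_res on T_res * [1, X0],
# i.e. Y_res ~ (a + b * X0) * T_res
D = np.column_stack([T_res, T_res * X[:, 0]])
a_hat, b_hat = np.linalg.lstsq(D, Y_res, rcond=None)[0]

# Implied CATE estimate and its error against the truth
cate_hat = a_hat + b_hat * X[:, 0]
rmse = np.sqrt(np.mean((cate_hat - tau) ** 2))

print(f"intercept = {a_hat:.3f}, slope on X0 = {b_hat:.3f}, RMSE = {rmse:.3f}")
```

With estimated rather than oracle nuisances, the same regression would be run on cross-fitted residuals, and an RMSE like the one computed here gives a one-number summary of the left-hand scatter plot.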

Summary

Aspect        Detail
------------  ------------------------------------------------------------
Method        Double Machine Learning (Chernozhukov et al., 2018)
Package       econml.dml.LinearDML
First stage   GBM (can be replaced by RF, Lasso, etc.)
Output        ATE, CATE, confidence intervals
Advantages    Neyman-orthogonal (robust to small nuisance errors), valid inference, heterogeneous effects

References

  • Chernozhukov, V., et al. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.
  • EconML Documentation

Next tutorial: Meta-Learners (T/S/X-Learner) for Causal Inference.