Double Machine Learning (DML) is a method that combines the flexibility of machine learning with the causal-inference framework of econometrics. Developed by Chernozhukov et al. (2018), DML delivers statistically valid treatment-effect estimates even when the confounding relationships are non-linear.
This tutorial uses the EconML package from Microsoft Research.
Packages Used
import numpy as np
import polars as pl
from econml.dml import LinearDML, CausalForestDML
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
import matplotlib.pyplot as plt
Intuition Behind DML
DML works in two stages:
- Stage 1 (Nuisance estimation): use ML to predict the treatment \(T\) and the outcome \(Y\) from the confounders \(X\)
- Stage 2 (Causal estimation): regress the residuals of \(Y\) on the residuals of \(T\)
\[ \hat{\theta} = \frac{E[\tilde{Y} \cdot \tilde{T}]}{E[\tilde{T}^2]} \]
where \(\tilde{Y} = Y - \hat{m}(X)\) and \(\tilde{T} = T - \hat{e}(X)\).
Advantages: robustness to moderate misspecification in either first-stage model (the orthogonal score is doubly robust in this sense), and valid inference thanks to cross-fitting.
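The two stages above can be sketched by hand on toy data. This is a minimal illustration, not EconML's implementation: `cross_val_predict` stands in for cross-fitting, and the data-generating process and names (`m_hat`, `e_hat`, `theta_hat`) are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # treatment depends on X
Y = 2.0 * T + X[:, 0] ** 2 + rng.normal(size=n)   # true theta = 2

# Stage 1: cross-fitted nuisance predictions m(X) ~ E[Y|X] and e(X) ~ E[T|X];
# each observation is predicted by models trained on the other folds
m_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)
e_hat = cross_val_predict(GradientBoostingClassifier(), X, T, cv=5,
                          method="predict_proba")[:, 1]

# Stage 2: residual-on-residual regression, i.e. the formula above
Y_res, T_res = Y - m_hat, T - e_hat
theta_hat = np.mean(Y_res * T_res) / np.mean(T_res ** 2)
```

The residualization removes the confounding channel from both \(Y\) and \(T\), so the final ratio recovers the causal coefficient even though both first-stage fits are flexible ML models.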
Simulated Data
We simulate data with a heterogeneous treatment effect:
np.random.seed(42)
n = 2000
# Confounders
X = np.random.normal(0, 1, (n, 5))
# Treatment (affected by the confounders)
propensity = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
T = np.random.binomial(1, propensity)
# Heterogeneous treatment effect
true_effect = 2 + 3 * X[:, 0]  # effect varies with X0
# Outcome
Y = (
    true_effect * T
    + X[:, 0] ** 2
    + np.sin(X[:, 1])
    + 0.5 * X[:, 2]
    + np.random.normal(0, 1, n)
)
df = pl.DataFrame({
    "Y": Y,
    "T": T,
    **{f"X{i}": X[:, i] for i in range(5)}
})
df.head()
The true ATE is \(E[2 + 3X_0] = 2\) (since \(E[X_0] = 0\)).
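Before turning to DML, it helps to see why a naive comparison of group means fails here. The snippet below regenerates the same data (same seed and DGP as above; `naive` is a name introduced for this check) and shows the confounded estimate:

```python
import numpy as np

# Regenerate the simulated data (same seed and DGP as in the tutorial)
np.random.seed(42)
n = 2000
X = np.random.normal(0, 1, (n, 5))
propensity = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
T = np.random.binomial(1, propensity)
true_effect = 2 + 3 * X[:, 0]
Y = (
    true_effect * T
    + X[:, 0] ** 2
    + np.sin(X[:, 1])
    + 0.5 * X[:, 2]
    + np.random.normal(0, 1, n)
)

# Naive difference in means is biased upward: treated units have higher X0
# on average, and the treatment effect itself grows with X0
naive = Y[T == 1].mean() - Y[T == 0].mean()
```

The naive estimate lands well above the true ATE of 2, which is exactly the gap DML is designed to close.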
Estimation with LinearDML
# DML model with GBM as the first-stage learners
dml = LinearDML(
    model_y=GradientBoostingRegressor(n_estimators=100, max_depth=3),
    model_t=GradientBoostingClassifier(n_estimators=100, max_depth=3),
    discrete_treatment=True,  # T is binary, so model_t must be a classifier
    cv=5,
    random_state=42
)
dml.fit(Y, T, X=X, W=None)
# Average Treatment Effect
ate = dml.ate(X)
ate_ci = dml.ate_interval(X, alpha=0.05)
Results
results = pl.DataFrame({
    "Metric": ["ATE", "95% CI Lower", "95% CI Upper", "True ATE"],
    "Value": [ate, ate_ci[0], ate_ci[1], 2.0]
})
results
Heterogeneous Treatment Effects
# Conditional ATE (CATE) for each observation
cate = dml.effect(X)
# Compare with the true effect
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Scatter: estimated vs true
axes[0].scatter(true_effect, cate, alpha=0.3, s=10)
axes[0].plot([true_effect.min(), true_effect.max()],
             [true_effect.min(), true_effect.max()],
             'r--', lw=2, label='Perfect')
axes[0].set_xlabel('True CATE')
axes[0].set_ylabel('Estimated CATE')
axes[0].set_title('Estimated vs True Treatment Effect')
axes[0].legend()
# CATE by X0
axes[1].scatter(X[:, 0], cate, alpha=0.3, s=10, label='Estimated')
sort_idx = np.argsort(X[:, 0])
axes[1].plot(X[sort_idx, 0], true_effect[sort_idx],
             'r-', lw=2, label='True: 2 + 3·X₀')
axes[1].set_xlabel('X₀')
axes[1].set_ylabel('CATE')
axes[1].set_title('Treatment Effect Heterogeneity')
axes[1].legend()
plt.tight_layout()
plt.show()
Summary
| Aspect | Detail |
|---|---|
| Method | Double Machine Learning (Chernozhukov et al., 2018) |
| Package | econml.dml.LinearDML |
| First stage | GBM (can be swapped for RF, Lasso, etc.) |
| Output | ATE, CATE, confidence intervals |
| Strengths | Doubly robust, valid inference, heterogeneous effects |
References
- Chernozhukov, V., et al. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal.
- EconML Documentation
Next tutorial: Meta-Learners (T/S/X-Learner) for Causal Inference.