Neural Network Mathematics

Backpropagation Calculus and Universal Approximation

ml-math
advanced
The mathematics of neural networks: the feedforward pass, activation functions, backpropagation as the chain rule, computational graphs, the universal approximation theorem, vanishing/exploding gradients, and their remedies.

1 Why Does This Matter?

Note: Why This Matters for Your Work

Backpropagation is the chain rule applied to computational graphs. Understanding the math:

  • Lets you debug gradient issues (vanishing/exploding) when training does not converge
  • Gives intuition for architecture design: why residual connections (ResNets) help
  • Explains why depth matters more than width for complex representations
  • Connects neural networks to statistical frameworks you already know (GLMs, basis expansions)

If you have been using neural networks as a black box, after this you will know exactly where every gradient flows.


2 Feedforward Networks

2.1 Single Neuron

A single neuron is a simple operation with two stages:

\[z = \mathbf{w}^T\mathbf{x} + b \quad \text{(linear combination)}\] \[a = \sigma(z) \quad \text{(nonlinear activation)}\]

  • \(\mathbf{w} \in \mathbb{R}^{d_{in}}\): weight vector
  • \(b \in \mathbb{R}\): bias
  • \(\sigma\): activation function
  • \(z\): pre-activation
  • \(a\): post-activation (output)

2.2 Layer \(l\) in an \(L\)-layer Network

For layers \(l = 1, \ldots, L\):

\[Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}\] \[A^{[l]} = \sigma(Z^{[l]})\]

  • \(W^{[l]} \in \mathbb{R}^{d_l \times d_{l-1}}\): weight matrix
  • \(b^{[l]} \in \mathbb{R}^{d_l}\): bias vector
  • \(A^{[0]} = X\) (input)
  • \(A^{[L]} = \hat{Y}\) (output, possibly with a different activation)

For a batch of \(n\) samples: \(X \in \mathbb{R}^{d_0 \times n}\), so that \(Z^{[l]} \in \mathbb{R}^{d_l \times n}\).
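
As a quick sanity check on these shapes, here is a minimal R sketch; the dimensions below are illustrative choices, not values from the text.

# Shape check for one layer: Z = W A + b on a batch of n samples.
set.seed(1)
d_prev <- 3; d_l <- 5; n <- 10
A_prev <- matrix(rnorm(d_prev * n), d_prev, n)      # A^[l-1]: d_{l-1} x n
W      <- matrix(rnorm(d_l * d_prev), d_l, d_prev)  # W^[l]:   d_l x d_{l-1}
b      <- matrix(rnorm(d_l), d_l, 1)                # b^[l]:   d_l x 1
Z <- W %*% A_prev + matrix(b, d_l, n)               # broadcast b across columns
dim(Z)                                              # 5 x 10, i.e. d_l x n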


3 Activation Functions

3.1 Sigmoid

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]

Derivative: \[\sigma'(z) = \sigma(z)(1 - \sigma(z))\]

  • Output \(\in (0,1)\) → useful in the output layer for binary classification
  • Problem: saturates near 0 and 1 → very small gradients → vanishing gradients
  • Not zero-centered

3.2 Tanh

\[\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1\]

Derivative: \[\tanh'(z) = 1 - \tanh^2(z) = \text{sech}^2(z)\]

  • Output \(\in (-1, 1)\), zero-centered (better than sigmoid for hidden layers)
  • Still saturates → vanishing gradients remain possible

3.3 ReLU (Rectified Linear Unit)

\[\text{ReLU}(z) = \max(0, z)\]

Derivative: \[\text{ReLU}'(z) = \mathbf{1}[z > 0] = \begin{cases} 1 & z > 0 \\ 0 & z \leq 0 \end{cases}\]

  • Does not saturate for \(z > 0\) → the main fix for vanishing gradients
  • Sparse activations (on average about 50% of neurons are “off”) → computational efficiency
  • Dead neuron problem: if a unit always produces \(z < 0\), its gradient is always 0 → the neuron never learns

Variants: Leaky ReLU (\(\max(\alpha z, z)\) with small \(\alpha\)), ELU, GELU, Swish.
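
To see the saturation behavior that matters for the gradient discussion later, here is a small base R sketch comparing the three derivatives:

# Sketch: activation derivatives. sigmoid' <= 0.25 and tanh' <= 1 saturate
# away from 0, while ReLU' = 1 for all z > 0.
sigmoid <- function(z) 1 / (1 + exp(-z))
z <- seq(-6, 6, length.out = 400)
d_sigmoid <- sigmoid(z) * (1 - sigmoid(z))
d_tanh    <- 1 - tanh(z)^2
d_relu    <- as.numeric(z > 0)

plot(z, d_tanh, type = "l", lwd = 2, ylim = c(0, 1.05), ylab = "derivative")
lines(z, d_sigmoid, lty = 2, lwd = 2)
lines(z, d_relu, lty = 3, lwd = 2)
legend("topright", c("tanh'", "sigmoid'", "ReLU'"), lty = 1:3, lwd = 2)
max(d_sigmoid)   # ~0.25, the bound used in the vanishing-gradient section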

3.4 Softmax (Output Layer)

For multi-class classification with \(K\) classes:

\[\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}\]

  • Output \(\in (0,1)\), \(\sum_k \text{softmax}(z)_k = 1\)
  • Interpretation: a probability distribution over the classes
  • Jacobian: \(\frac{\partial \text{softmax}(z)_k}{\partial z_j} = \text{softmax}(z)_k(\delta_{kj} - \text{softmax}(z)_j)\)

4 Forward Pass

Sequence of matrix operations:

\[X \xrightarrow{W^{[1]}, b^{[1]}} Z^{[1]} \xrightarrow{\sigma} A^{[1]} \xrightarrow{W^{[2]}, b^{[2]}} Z^{[2]} \xrightarrow{\sigma} \cdots \xrightarrow{W^{[L]}, b^{[L]}} Z^{[L]} \xrightarrow{\sigma^{[L]}} \hat{Y}\]

During the forward pass, cache all \(Z^{[l]}\) and \(A^{[l]}\); they are needed for backpropagation.


5 Backpropagation — Chain Rule

5.1 Notation

Loss: \(L = \ell(\hat{Y}, Y)\) (scalar)

Error signal at layer \(l\): \[\delta^{[l]} = \frac{\partial L}{\partial Z^{[l]}} \in \mathbb{R}^{d_l \times n}\]

5.2 Output Layer Error

For the output layer \(L\) with softmax + cross-entropy:

\[\delta^{[L]} = \hat{Y} - Y\]

(Elegant! Gradient = prediction minus label)

For a general loss and activation: \[\delta^{[L]} = \frac{\partial L}{\partial A^{[L]}} \odot \sigma'^{[L]}(Z^{[L]})\]

where \(\odot\) is the elementwise (Hadamard) product.

5.3 Backward Recursion (Hidden Layers)

Important Definition: Backpropagation Equations

For layers \(l = L-1, L-2, \ldots, 1\):

\[\delta^{[l]} = (W^{[l+1]T} \delta^{[l+1]}) \odot \sigma'(Z^{[l]})\]

Gradients for the parameter updates:

\[\frac{\partial L}{\partial W^{[l]}} = \frac{1}{n}\delta^{[l]} A^{[l-1]T}\]

\[\frac{\partial L}{\partial b^{[l]}} = \frac{1}{n}\sum_{\text{batch}}\delta^{[l]} = \frac{1}{n}\delta^{[l]}\mathbf{1}\]

(averaged over the batch; \(\mathbf{1}\) is the vector of ones)

Derivation from the chain rule: for a parameter \(w_{jk}^{[l]}\) (the weight from neuron \(k\) in layer \(l-1\) to neuron \(j\) in layer \(l\)):

\[\frac{\partial L}{\partial w_{jk}^{[l]}} = \sum_{\text{samples}} \frac{\partial L}{\partial z_j^{[l]}} \cdot \frac{\partial z_j^{[l]}}{\partial w_{jk}^{[l]}} = \sum_{\text{samples}} \delta_j^{[l]} \cdot a_k^{[l-1]}\]

In matrix (batch) notation: \(\frac{\partial L}{\partial W^{[l]}} = \delta^{[l]} A^{[l-1]T}\); dividing by \(n\) gives the batch-averaged form above.
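
These formulas are easy to verify numerically. Below is a minimal finite-difference gradient check for a single tanh layer with squared-error loss; all names and sizes are illustrative, not taken from the text.

# Gradient check: analytic dL/dW = (1/n) delta A^T vs a central finite difference.
set.seed(1)
d_in <- 3; d_out <- 2; n <- 5
A_prev <- matrix(rnorm(d_in * n), d_in, n)
Y      <- matrix(rnorm(d_out * n), d_out, n)
W      <- matrix(rnorm(d_out * d_in), d_out, d_in)
b      <- matrix(0, d_out, 1)

loss <- function(W) {
  Z <- W %*% A_prev + matrix(b, d_out, n)
  A <- tanh(Z)
  mean(colSums((A - Y)^2)) / 2              # (1/n) * sum over samples of 1/2 ||a - y||^2
}

# Analytic gradient via the boxed formulas
Z <- W %*% A_prev + matrix(b, d_out, n)
delta <- (tanh(Z) - Y) * (1 - tanh(Z)^2)    # dL/dZ per sample, before averaging
dW_analytic <- (1 / n) * delta %*% t(A_prev)

# Central finite difference for entry (1, 1)
eps <- 1e-6
W_plus <- W;  W_plus[1, 1]  <- W_plus[1, 1] + eps
W_minus <- W; W_minus[1, 1] <- W_minus[1, 1] - eps
dW_numeric <- (loss(W_plus) - loss(W_minus)) / (2 * eps)
c(analytic = dW_analytic[1, 1], numeric = dW_numeric)   # should agree closely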


6 Full Derivation: 2-Layer Network

A concrete example: a 2-layer network with mean squared error.

Architecture:

  • Input: \(x \in \mathbb{R}^{d_0}\)
  • Hidden: \(z^{[1]} = W^{[1]}x + b^{[1]}\), \(a^{[1]} = \tanh(z^{[1]}) \in \mathbb{R}^{d_1}\)
  • Output: \(z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}\), \(\hat{y} = z^{[2]} \in \mathbb{R}^{d_2}\) (linear output)
  • Loss: \(L = \frac{1}{2}\|\hat{y} - y\|^2\)

Forward pass: compute and cache \(z^{[1]}, a^{[1]}, z^{[2]}, \hat{y}\).

Backward pass:

Step 1 — Output layer error: \[\delta^{[2]} = \frac{\partial L}{\partial z^{[2]}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{[2]}} = (\hat{y} - y) \cdot 1 = (\hat{y} - y)\]

Step 2 — Gradients for \(W^{[2]}, b^{[2]}\): \[\frac{\partial L}{\partial W^{[2]}} = \delta^{[2]} a^{[1]T}, \quad \frac{\partial L}{\partial b^{[2]}} = \delta^{[2]}\]

Step 3 — Hidden layer error (chain rule through \(W^{[2]}\) and \(\tanh\)): \[\delta^{[1]} = \frac{\partial L}{\partial z^{[1]}} = \underbrace{W^{[2]T}\delta^{[2]}}_{\text{chain through } W^{[2]}} \odot \underbrace{(1 - \tanh^2(z^{[1]}))}_{\text{chain through } \tanh}\]

Step 4 — Gradients for \(W^{[1]}, b^{[1]}\): \[\frac{\partial L}{\partial W^{[1]}} = \delta^{[1]} x^T, \quad \frac{\partial L}{\partial b^{[1]}} = \delta^{[1]}\]

Update: \(W^{[l]} \leftarrow W^{[l]} - \eta \frac{\partial L}{\partial W^{[l]}}\)


7 Computational Graph Perspective

7.1 Forward Pass: Build and Cache

A computational graph is a directed acyclic graph (DAG) in which:

  • Nodes: values (scalars, tensors)
  • Edges: operations

Example for \(L = (x \cdot w + c)^2\):

x --+
    multiply --> z1 --> add --> z2 --> square --> L
w --+                    ^
                         |
                         c

Forward pass: evaluate all nodes from the leaves to the root, storing the intermediate values.

7.2 Backward Pass: Reverse Accumulation

Automatic differentiation (reverse mode / backprop):

  1. Start at the output \(L\): \(\frac{\partial L}{\partial L} = 1\)
  2. Propagate backwards, applying the chain rule at every node
  3. Each node computes its local gradient and multiplies it by the incoming gradient

The advantage of reverse mode: the gradients of all inputs are computed in a single backward pass, which is efficient when the output is a scalar (the loss) and there are many parameters.
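
A minimal sketch of reverse accumulation on the graph above, with numeric values chosen purely for illustration:

# Reverse-mode sketch for L = (x * w + c)^2.
# Forward pass: evaluate and cache every intermediate node.
x <- 2; w <- 3; c_ <- 1        # 'c' renamed c_ to avoid masking base::c
z1 <- x * w                    # multiply node
z2 <- z1 + c_                  # add node
L  <- z2^2                     # square node

# Backward pass: start from dL/dL = 1 and apply local gradients in reverse.
dL_dL  <- 1
dL_dz2 <- dL_dL * 2 * z2       # local gradient of square: 2 z2
dL_dz1 <- dL_dz2 * 1           # local gradient of add: 1
dL_dc  <- dL_dz2 * 1
dL_dx  <- dL_dz1 * w           # local gradient of multiply w.r.t. x: w
dL_dw  <- dL_dz1 * x
c(dL_dx = dL_dx, dL_dw = dL_dw, dL_dc = dL_dc)   # 42, 28, 14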


8 Universal Approximation Theorem

Important Definition: Universal Approximation Theorem

Theorem (Cybenko, 1989; Hornik, 1991):

For every continuous function \(f: [0,1]^d \to \mathbb{R}\) and every \(\varepsilon > 0\), there exists a single-hidden-layer neural network \(g\) (with a suitable non-polynomial activation) of sufficiently large width \(N\) such that:

\[\sup_{x \in [0,1]^d} |f(x) - g(x)| < \varepsilon\]

Informally: a network with one sufficiently wide hidden layer can approximate any continuous function on a compact set.

Important caveats:

  • The theorem is an existence result, not a constructive one: it does not tell you how to find the network
  • The required width can be very large (exponential in \(d\) in the worst case)
  • “Width” vs “depth” trade-off: for many function classes, depth is exponentially more efficient

8.1 Width vs Depth

Intuition for depth efficiency: a highly oscillatory function such as \(f(x) = \sin(2^L\pi x)\) requires:

  • Shallow network: \(O(2^L)\) neurons
  • Deep network: \(O(L)\) neurons (each layer doubles the frequency via composition)

Depth exponentially reduces required width for hierarchical functions.
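
A minimal illustration of the composition argument (this example is not from the text): the tent map \(T(x) = 1 - |2x - 1|\) is exactly a width-2 ReLU layer on \([0,1]\), and composing it \(L\) times yields a piecewise-linear function with \(2^L\) monotone pieces, which a single hidden layer would need on the order of \(2^L\) units to represent.

# Tent map T(x) = 1 - |2x - 1|, written as a width-2 ReLU layer on [0, 1].
relu <- function(z) pmax(0, z)
tent <- function(x) 2 * relu(x) - 4 * relu(x - 0.5)

x <- seq(0, 1, length.out = 2000)
f <- x
L <- 4
for (l in 1:L) f <- tent(f)     # depth-L composition -> 2^L monotone pieces

plot(x, f, type = "l", main = sprintf("Tent map composed %d times", L))
sum(diff(sign(diff(f))) != 0)   # interior extrema, approximately 2^L - 1 = 15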


9 Vanishing and Exploding Gradients

9.1 Where the Problem Comes From

From the backward recursion, the gradient at layer 1 depends on a product of Jacobians:

\[\frac{\partial L}{\partial A^{[1]}} = \frac{\partial L}{\partial A^{[L]}} \cdot \prod_{l=1}^{L-1} \frac{\partial A^{[l+1]}}{\partial A^{[l]}}\]

Each factor is of the form \(W^{[l+1]T} \cdot \text{diag}(\sigma'(Z^{[l+1]}))\).

If the eigenvalues of these factor matrices have magnitude:

  • \(< 1\): the product \(\to 0\) exponentially → vanishing gradients → early layers stop learning
  • \(> 1\): the product \(\to \infty\) exponentially → exploding gradients → training diverges

9.2 The Problem with Sigmoid/Tanh

\(\sigma'(z) = \sigma(z)(1-\sigma(z)) \leq 0.25\) and \(\tanh'(z) = 1 - \tanh^2(z) \leq 1\)

For sigmoid, the gradient is multiplied by a factor \(\leq 0.25\) at every layer → exponential decay in deep networks.
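
A quick simulation of this decay (depth, width, and weight scales here are arbitrary illustrative choices): push a random error signal backwards through many layers via \(\delta^{[l]} = (W^{[l+1]T}\delta^{[l+1]}) \odot \sigma'(Z^{[l]})\) and track its norm for sigmoid vs ReLU.

# Sketch: norm of the backpropagated error signal through `depth` layers.
set.seed(42)
depth <- 30; width <- 50
sigmoid <- function(z) 1 / (1 + exp(-z))

backprop_norms <- function(act_grad, sd_w) {
  delta <- rnorm(width)                     # error signal at the top
  norms <- numeric(depth)
  for (l in 1:depth) {                      # walk backwards through the layers
    W <- matrix(rnorm(width * width, sd = sd_w), width, width)
    Z <- rnorm(width)                       # stand-in pre-activations
    delta <- (t(W) %*% delta) * act_grad(Z)
    norms[l] <- sqrt(sum(delta^2))
  }
  norms
}

norms_sigmoid <- backprop_norms(function(z) sigmoid(z) * (1 - sigmoid(z)), sqrt(1 / width))
norms_relu    <- backprop_norms(function(z) as.numeric(z > 0),             sqrt(2 / width))

matplot(cbind(norms_sigmoid, norms_relu), type = "l", log = "y", lty = 1:2, lwd = 2,
        xlab = "layers below the output", ylab = "||delta|| (log scale)")
legend("bottomleft", legend = c("sigmoid", "ReLU + He scaling"), lty = 1:2, lwd = 2)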

9.3 Solutions

1. ReLU Activations: \(\text{ReLU}'(z) = 1\) for \(z > 0\); the activation does not saturate, so the gradient is not shrunk for active neurons.

2. Residual Connections (ResNets): \[A^{[l+1]} = \sigma(Z^{[l+1]}) + A^{[l]}\]

The gradient can “shortcut” directly to earlier layers: \[\frac{\partial L}{\partial A^{[l]}} = \frac{\partial L}{\partial A^{[l+1]}} \left(\frac{\partial\, \sigma(Z^{[l+1]})}{\partial A^{[l]}} + I\right)\]

The identity shortcut \(I\) passes the gradient through unattenuated, which prevents it from vanishing.

3. Batch Normalization: normalize the activations in each layer, \(\hat{z} = (z - \mu_B)/\sigma_B\), then scale and shift: \(\gamma\hat{z} + \beta\).

This keeps activations in a “well-conditioned” range → gradients that are neither too large nor too small.

4. Careful Initialization (Xavier/Glorot, He):

  • Xavier: \(W^{[l]} \sim N(0, 2/(d_{l-1} + d_l))\) for sigmoid/tanh
  • He: \(W^{[l]} \sim N(0, 2/d_{l-1})\) for ReLU

This keeps the variance of activations and gradients roughly constant across layers.
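
A small check of the variance argument (layer sizes and depth are arbitrary illustrative choices): propagate random inputs through a deep ReLU stack under He initialization versus a naive small initialization, and track the activation variance.

# Sketch: forward activation variance across a deep ReLU stack.
set.seed(42)
depth <- 20; d <- 100; n <- 500
relu <- function(z) pmax(0, z)

activation_vars <- function(sd_fun) {
  A <- matrix(rnorm(d * n), d, n)
  vars <- numeric(depth)
  for (l in 1:depth) {
    W <- matrix(rnorm(d * d, sd = sd_fun(d)), d, d)
    A <- relu(W %*% A)
    vars[l] <- var(as.vector(A))
  }
  vars
}

vars_he    <- activation_vars(function(dim) sqrt(2 / dim))   # He: sd = sqrt(2/d)
vars_naive <- activation_vars(function(dim) 0.01)            # naive fixed sd
round(rbind(He = vars_he, naive = vars_naive)[, c(1, 5, 10, 20)], 6)
# He keeps the variance roughly constant; the naive init collapses toward 0.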


10 Softmax + Cross-Entropy: Elegant Gradient

For a softmax output with cross-entropy loss (multi-class classification):

\[\hat{y}_k = \frac{e^{z_k}}{\sum_j e^{z_j}}, \quad L = -\sum_k y_k \log \hat{y}_k\]

The gradient with respect to the pre-softmax logits:

\[\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k\]

Derivation: apply the chain rule and the softmax Jacobian.

\[\frac{\partial L}{\partial z_k} = \sum_j \frac{\partial L}{\partial \hat{y}_j}\frac{\partial \hat{y}_j}{\partial z_k}\]

\(\frac{\partial L}{\partial \hat{y}_j} = -y_j/\hat{y}_j\)

\(\frac{\partial \hat{y}_j}{\partial z_k} = \hat{y}_j(\delta_{jk} - \hat{y}_k)\)

Substituting: \(\frac{\partial L}{\partial z_k} = -\sum_j \frac{y_j}{\hat{y}_j}\hat{y}_j(\delta_{jk} - \hat{y}_k) = -y_k + \hat{y}_k\sum_j y_j = \hat{y}_k - y_k\) (since \(\sum_j y_j = 1\))

Result: \(\delta^{[L]} = \hat{Y} - Y\), “predicted minus true”: intuitive and elegant.
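
The identity is easy to confirm numerically with a central finite difference; the logits and label below are arbitrary illustrative values.

# Numerical check that dL/dz_k = yhat_k - y_k for softmax + cross-entropy.
softmax <- function(z) { e <- exp(z - max(z)); e / sum(e) }
cross_entropy <- function(z, y) -sum(y * log(softmax(z)))

set.seed(1)
K <- 4
z <- rnorm(K)
y <- c(0, 0, 1, 0)                       # one-hot label

analytic  <- softmax(z) - y
numerical <- sapply(1:K, function(k) {
  eps <- 1e-6
  zp <- z; zp[k] <- zp[k] + eps
  zm <- z; zm[k] <- zm[k] - eps
  (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)
})
round(cbind(analytic, numerical), 6)     # the two columns should match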


11 Worked Example: Full Forward and Backward Pass

library(tidyverse)

# ==========================================
# Manual Implementation: 2-Layer Network
# ==========================================

# Activation functions
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid_grad <- function(z) sigmoid(z) * (1 - sigmoid(z))
relu <- function(z) pmax(0, z)
relu_grad <- function(z) (z > 0) * 1

# Initialize parameters (He initialization)
set.seed(42)
d0 <- 2   # input dim
d1 <- 4   # hidden dim
d2 <- 1   # output dim (binary)

W1 <- matrix(rnorm(d1 * d0, sd = sqrt(2/d0)), d1, d0)
b1 <- matrix(0, d1, 1)
W2 <- matrix(rnorm(d2 * d1, sd = sqrt(2/d1)), d2, d1)
b2 <- matrix(0, d2, 1)

# Generate some data (XOR problem)
n <- 100
X <- matrix(c(rep(c(0,0), 25), rep(c(1,0), 25),
              rep(c(0,1), 25), rep(c(1,1), 25)), ncol = 2, byrow = TRUE)
X <- X + matrix(rnorm(n*2, sd = 0.1), n, 2)
Y <- matrix(xor(X[,1] > 0.5, X[,2] > 0.5) * 1, 1, n)

X_t <- t(X)  # p x n format

# ==========================================
# Forward Pass
# ==========================================
forward_pass <- function(X, W1, b1, W2, b2) {
  Z1 <- W1 %*% X + matrix(rep(b1, ncol(X)), nrow = nrow(b1))
  A1 <- relu(Z1)
  Z2 <- W2 %*% A1 + matrix(rep(b2, ncol(X)), nrow = nrow(b2))
  A2 <- sigmoid(Z2)  # output probability
  list(Z1 = Z1, A1 = A1, Z2 = Z2, A2 = A2)
}

# ==========================================
# Backward Pass
# ==========================================
backward_pass <- function(X, Y, W1, b1, W2, b2, cache) {
  n <- ncol(X)
  Z1 <- cache$Z1; A1 <- cache$A1
  Z2 <- cache$Z2; A2 <- cache$A2

  # Output layer error (cross-entropy + sigmoid)
  delta2 <- A2 - Y                                    # d2 x n

  # Gradients for W2, b2
  dW2 <- (1/n) * delta2 %*% t(A1)                    # d2 x d1
  db2 <- matrix((1/n) * rowSums(delta2), d2, 1)       # d2 x 1

  # Hidden layer error
  delta1 <- (t(W2) %*% delta2) * relu_grad(Z1)        # d1 x n

  # Gradients for W1, b1
  dW1 <- (1/n) * delta1 %*% t(X)                     # d1 x d0
  db1 <- matrix((1/n) * rowSums(delta1), d1, 1)       # d1 x 1

  list(dW1 = dW1, db1 = db1, dW2 = dW2, db2 = db2)
}

# ==========================================
# Training Loop
# ==========================================
lr <- 0.1
n_epochs <- 1000
losses <- numeric(n_epochs)

for (epoch in 1:n_epochs) {
  # Forward pass
  cache <- forward_pass(X_t, W1, b1, W2, b2)

  # Binary cross-entropy loss
  A2 <- cache$A2
  loss <- -(1/n) * sum(Y * log(A2 + 1e-15) + (1-Y) * log(1 - A2 + 1e-15))
  losses[epoch] <- loss

  # Backward pass
  grads <- backward_pass(X_t, Y, W1, b1, W2, b2, cache)

  # Gradient descent update
  W1 <- W1 - lr * grads$dW1
  b1 <- b1 - lr * grads$db1
  W2 <- W2 - lr * grads$dW2
  b2 <- b2 - lr * grads$db2

  if (epoch %% 100 == 0) {
    cat(sprintf("Epoch %d: Loss = %.4f\n", epoch, loss))
  }
}

# Plot training curve
plot(losses, type = "l", col = "blue", lwd = 2,
     xlab = "Epoch", ylab = "Loss",
     main = "Training Loss (XOR Problem)")

# Final accuracy
cache_final <- forward_pass(X_t, W1, b1, W2, b2)
preds <- (cache_final$A2 > 0.5) * 1
accuracy <- mean(preds == Y)
cat(sprintf("\nFinal accuracy: %.2f%%\n", 100 * accuracy))

Caution: Connection to RNNs, Transformers, and Graph Neural Networks

The neural network math extends to modern architectures:

  • RNNs (Recurrent Neural Networks): share weights across time steps. Backpropagation Through Time (BPTT) is backprop “unrolled” in time; vanishing gradients are especially severe here, and LSTM and GRU (gating mechanisms) are the standard remedies.

  • Transformers (Attention): \(\text{Attention}(Q,K,V) = \text{softmax}(QK^T/\sqrt{d_k})V\), a weighted combination of the values \(V\) based on the similarity between queries \(Q\) and keys \(K\). This is kernel regression with a softmax kernel! Backprop flows through matrix multiplications and the softmax (see the sketch after this list).

  • Graph Neural Networks (GNNs): generalize convolution to graph structure. The aggregation \(h_i = \sigma(\sum_{j \in N(i)} W h_j)\) means each node aggregates its neighbors’ features. Analogous to the spatial lag in spatial econometrics!

  • Batch Normalization: stabilizes the internal activation distributions (internal covariate shift). Math: normalize with the per-batch \(\mu\) and \(\sigma\), and learn \(\gamma, \beta\) as parameters. Dramatically speeds up training.

  • Neural Tangent Kernel: training an infinitely wide neural network converges to kernel regression with a specific kernel (the NTK). This bridges the theory gap between deep learning and kernel methods.
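
A minimal sketch of single-head scaled dot-product attention in R, following the formula above; the matrix sizes are arbitrary illustrative choices.

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, row-wise softmax.
softmax_rows <- function(M) {
  M <- M - apply(M, 1, max)            # subtract the row max for numerical stability
  E <- exp(M)
  E / rowSums(E)
}

attention <- function(Q, K, V) {
  d_k <- ncol(K)
  scores <- Q %*% t(K) / sqrt(d_k)     # n_q x n_k similarity matrix
  softmax_rows(scores) %*% V           # weighted combination of the values
}

set.seed(1)
Q <- matrix(rnorm(4 * 8), 4, 8)        # 4 queries, d_k = 8
K <- matrix(rnorm(6 * 8), 6, 8)        # 6 keys
V <- matrix(rnorm(6 * 3), 6, 3)        # 6 values, d_v = 3
dim(attention(Q, K, V))                # 4 x 3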


12 Practice Problems

Problem 1: Derive the backprop equations for a single hidden layer from first principles.

Network: \(z^{[1]} = W^{[1]}x + b^{[1]}\), \(a^{[1]} = \sigma(z^{[1]})\), \(\hat{y} = W^{[2]}a^{[1]} + b^{[2]}\), \(L = \frac{1}{2}(\hat{y}-y)^2\).

\(\frac{\partial L}{\partial \hat{y}} = \hat{y} - y\)

\(\frac{\partial L}{\partial W^{[2]}} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial W^{[2]}} = (\hat{y}-y)a^{[1]T}\)

\(\frac{\partial L}{\partial a^{[1]}} = W^{[2]T}(\hat{y}-y)\)

\(\frac{\partial L}{\partial z^{[1]}} = \frac{\partial L}{\partial a^{[1]}} \odot \sigma'(z^{[1]}) = W^{[2]T}(\hat{y}-y) \odot \sigma'(z^{[1]})\)

\(\frac{\partial L}{\partial W^{[1]}} = \frac{\partial L}{\partial z^{[1]}} x^T\)

Problem 2: Show that the gradient vanishes for a deep sigmoid network.

For sigmoid, \(\sigma'(z) \leq 0.25\). If \(\|W^{[l]}\| \approx c\), then:

\(\|\delta^{[l]}\| \leq \|W^{[l+1]T}\| \cdot \|\delta^{[l+1]}\| \cdot 0.25 \approx 0.25c \|\delta^{[l+1]}\|\)

For \(L\) layers: \(\|\delta^{[1]}\| \leq (0.25c)^{L-1}\|\delta^{[L]}\| \to 0\) if \(c < 4\).

Problem 3: Show that the softmax + cross-entropy gradient is \(\hat{y} - y\).

Already shown above. Key steps: (1) the chain rule, (2) the softmax Jacobian, (3) simplify using \(\sum_k y_k = 1\).

Problem 4: Why do residual connections help with vanishing gradients?

Without the residual: \(A^{[l+1]} = \sigma(W^{[l+1]}A^{[l]} + b^{[l+1]})\)

\(\frac{\partial A^{[l+1]}}{\partial A^{[l]}} = W^{[l+1]T}\text{diag}(\sigma'(\ldots))\), which can be very small.

With the residual: \(A^{[l+1]} = \sigma(\ldots) + A^{[l]}\)

\(\frac{\partial A^{[l+1]}}{\partial A^{[l]}} = W^{[l+1]T}\text{diag}(\sigma'(\ldots)) + I\); the identity term passes the gradient through unattenuated.

Problem 5: Why is He initialization important for ReLU networks?

Activation variance: \(\text{Var}(z^{[l]}) = d_{l-1} \cdot \text{Var}(W^{[l]}) \cdot \text{Var}(a^{[l-1]})\)

For ReLU, only half of the inputs are active: \(\text{Var}(a^{[l-1]}) = \text{Var}(z^{[l-1]})/2\)

For stable variance: \(d_{l-1} \cdot \text{Var}(W^{[l]}) \cdot 1/2 = 1\) → \(\text{Var}(W^{[l]}) = 2/d_{l-1}\)

He initialization: \(W^{[l]} \sim N(0, 2/d_{l-1})\)


Navigation: ← PCA & Dimensionality Reduction | Back to Index