Deep Learning Models

Multilayer perceptron (MLP) / artificial neural network (ANN) / deep neural network (DNN) / feedforward neural network (FNN)

A neural network is essentially a set of weight matrices. Assume we have $x$ as input, the output is $o$, and there is only 1 hidden layer.

(Figure: a neural network with one hidden layer)

The forward function is:

$$
\begin{aligned}
Z_l &= W^l X_{l-1} + b^l \\
X_l &= f_l(Z_l)
\end{aligned}
$$

where:

  1. $Z_l$ is the input to layer $l$ before the activation function
  2. $X_l$ is the output of the neurons (after activation)
  3. $f_l$ is the activation function.
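
A minimal NumPy sketch of this forward rule (layer sizes and random weights are chosen purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass through an MLP: X_l = f_l(W_l X_{l-1} + b_l)."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b          # pre-activation Z_l
        a = sigmoid(z)         # activation X_l
    return a

# Toy example: 3 inputs -> 2 hidden units -> 2 outputs (random weights)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(2, 2))]
biases = [rng.normal(size=2), rng.normal(size=2)]
print(forward(np.array([1.0, 4.0, 5.0]), weights, biases))
```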

Steps to train the neural network:

  1. Initialize the weights
  2. Forward: calculate $o$ with $w, b$
  3. Back-propagation: calculate the derivative of the loss function with respect to the weights
  4. Update the weights and iterate until the loss falls below a certain threshold
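
The same loop in code, sketched on a deliberately tiny one-parameter model $o = w x$ with squared-error loss; the values of x, y, and the learning rate here are illustrative placeholders, not part of the worked example below:

```python
import numpy as np

# Toy illustration of the four training steps on a single weight w,
# model o = w * x, loss L = 0.5 * (o - y)^2.
x, y = 2.0, 3.0
w = np.random.default_rng(0).normal()    # 1. initialize the weight
lr = 0.1
loss = float("inf")
while loss > 1e-6:                       # 4. iterate until the loss is small enough
    o = w * x                            # 2. forward pass
    loss = 0.5 * (o - y) ** 2
    grad = (o - y) * x                   # 3. back-propagation: dL/dw
    w -= lr * grad                       #    gradient-descent update
print(w, loss)
```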

Assume:
$x_1 = 1, x_2 = 4, x_3 = 5, y_1 = 0.1, y_2 = 0.05$
Activation function: $f_l(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$
Loss function: $L(\hat{y}, y) = \frac{1}{2} \sum_i^m (\hat{y_i} - y_i)^2$

Use the picture above.

Forward:

$$
\begin{aligned}
Z_{h_1} &= w_1 * x_1 + w_3 * x_2 + w_5 * x_3 + b_1 = 4.3 \\
h_1 &= f_l(Z_{h_1}) = 0.9866 \\
Z_{h_2} &= w_2 * x_1 + w_4 * x_2 + w_6 * x_3 + b_1 = 5.3 \\
h_2 &= f_l(Z_{h_2}) = 0.9950 \\
Z_{o_1} &= w_7 * h_1 + w_9 * h_2 + b_2 = 2.0862 \\
o_1 &= f_l(Z_{o_1}) = 0.8896 \\
Z_{o_2} &= w_8 * h_1 + w_{10} * h_2 + b_2 = 1.3888 \\
o_2 &= f_l(Z_{o_2}) = 0.8004
\end{aligned}
$$
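
A short NumPy check of these numbers. The individual weight values are not listed in this note; the ones below are an assumption (consistent with the linked walkthrough), and they reproduce the intermediate values above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed weights: hidden layer w1..w6 with bias b1, output layer w7..w10 with bias b2
w1, w2, w3, w4, w5, w6 = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6
w7, w8, w9, w10 = 0.7, 0.8, 0.9, 0.1
b1, b2 = 0.5, 0.5
x1, x2, x3 = 1.0, 4.0, 5.0

z_h1 = w1*x1 + w3*x2 + w5*x3 + b1        # 4.3
z_h2 = w2*x1 + w4*x2 + w6*x3 + b1        # 5.3
h1, h2 = sigmoid(z_h1), sigmoid(z_h2)    # 0.9866, 0.9950
z_o1 = w7*h1 + w9*h2 + b2                # ~2.0862
z_o2 = w8*h1 + w10*h2 + b2               # ~1.3888
o1, o2 = sigmoid(z_o1), sigmoid(z_o2)    # 0.8896, 0.8004
print(round(o1, 4), round(o2, 4))
```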

Then $o_1, o_2$ are our predictions for $y_1, y_2$ on this first pass. We use back-propagation to optimize:

Back-propagation (only using $w_1$ as an example):

$$
\begin{aligned}
\frac{\partial{L}}{\partial{w_1}} &= \frac{\partial{L}}{\partial{o_1}} * \frac{\partial{o_1}}{\partial{w_1}} + \frac{\partial{L}}{\partial{o_2}} * \frac{\partial{o_2}}{\partial{w_1}} \\
&= \frac{\partial{L}}{\partial{o_1}} * \frac{\partial{o_1}}{\partial{z_{o_1}}} * \frac{\partial{z_{o_1}}}{\partial{w_1}} + \frac{\partial{L}}{\partial{o_2}} * \frac{\partial{o_2}}{\partial{z_{o_2}}} * \frac{\partial{z_{o_2}}}{\partial{w_1}} \\
&= \frac{\partial{L}}{\partial{o_1}} * \frac{\partial{o_1}}{\partial{z_{o_1}}} * \frac{\partial{z_{o_1}}}{\partial{h_1}} * \frac{\partial{h_1}}{\partial{z_{h_1}}} * \frac{\partial{z_{h_1}}}{\partial{w_1}} + \frac{\partial{L}}{\partial{o_2}} * \frac{\partial{o_2}}{\partial{z_{o_2}}} * \frac{\partial{z_{o_2}}}{\partial{h_1}} * \frac{\partial{h_1}}{\partial{z_{h_1}}} * \frac{\partial{z_{h_1}}}{\partial{w_1}}
\end{aligned}
$$

From this formula we can see that if the intermediate derivatives in the chain are less than 1, then after several multiplications the final result becomes very small.
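
A small NumPy check of this chain-rule expression, reusing the assumed weight values from the forward-pass sketch above and comparing the analytic gradient against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w1):
    # Same assumed weights as in the earlier sketch; only w1 is varied.
    w2, w3, w4, w5, w6 = 0.2, 0.3, 0.4, 0.5, 0.6
    w7, w8, w9, w10, b1, b2 = 0.7, 0.8, 0.9, 0.1, 0.5, 0.5
    x1, x2, x3, y1, y2 = 1.0, 4.0, 5.0, 0.1, 0.05
    h1 = sigmoid(w1*x1 + w3*x2 + w5*x3 + b1)
    h2 = sigmoid(w2*x1 + w4*x2 + w6*x3 + b1)
    o1 = sigmoid(w7*h1 + w9*h2 + b2)
    o2 = sigmoid(w8*h1 + w10*h2 + b2)
    return h1, h2, o1, o2, 0.5 * ((o1 - y1)**2 + (o2 - y2)**2)

w1, x1 = 0.1, 1.0
h1, h2, o1, o2, L = forward(w1)
# Chain rule: both output paths go through h1, then z_h1, then w1.
dL_dw1 = ((o1 - 0.1) * o1*(1 - o1) * 0.7 +
          (o2 - 0.05) * o2*(1 - o2) * 0.8) * h1*(1 - h1) * x1
# Finite-difference check on the loss (the 5th returned value)
eps = 1e-6
numeric = (forward(w1 + eps)[4] - forward(w1 - eps)[4]) / (2 * eps)
print(dL_dw1, numeric)   # the two values should agree closely
```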

Ref: https://www.anotsorandomwalk.com/backpropagation-example-with-numbers-step-by-step/

Gradient Exploding & Gradient Vanishing

Why do we have these problems?
Back-propagation computes the gradient by the chain rule. From the back-propagation formula, we can see that if the number of hidden layers is large, the result involves many intermediate gradients. If those intermediate gradients are below 1, their product decreases rapidly, so the gradient vanishes. On the other hand, if the gradients are all above 1, the product increases rapidly, so the final gradient explodes.
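
A quick numerical illustration: the sigmoid's derivative is at most 0.25, so a product of such factors shrinks exponentially with depth, while factors above 1 blow up just as fast.

```python
# The derivative of the sigmoid is at most 0.25, so a chain of such factors shrinks
# rapidly; factors larger than 1 grow just as rapidly. Illustrative numbers only.
sigmoid_max_grad = 0.25
for depth in (5, 10, 20, 50):
    print(depth, sigmoid_max_grad ** depth, 1.5 ** depth)
# e.g. 0.25**20 ~ 9e-13 (vanishing), while 1.5**20 ~ 3e3 (exploding)
```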

How to deal with gradient exploding?
To deal with gradient exploding, we can use gradient clipping. When the norm of the gradient is above a threshold, we rescale it:

$$
g = \frac{\text{Threshold}}{\|g\|} * g
$$
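
A minimal sketch of clip-by-norm in NumPy (deep learning frameworks provide their own built-in versions of this):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale the gradient if its L2 norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])               # ||g|| = 5
print(clip_by_norm(g, threshold=1.0))  # rescaled to norm 1: [0.6, 0.8]
```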

How to deal with gradient vanishing?

  1. Model architecture: use gated architectures such as LSTM or GRU
  2. Residual connections: add skip (residual) connections to the network
  3. Batch Normalization: TODO
  4. Activation function: use ReLU instead of sigmoid

Convolutional Neural Network (CNN)

@TODO

Recurrent Neural Network (RNN)

Sometimes the output of the neural network is related not only to the current input, but also to the current state (the output from the previous time step).

Vanilla RNN

(Figure: vanilla RNN)

Forward:

$$
\begin{cases}
h_t = f_h(W_x x_t + W_h h_{t-1} + b_x) \\
y_t = f_y(W_y h_t + b_y)
\end{cases}
$$

Note: $W_x, W_h, W_y$ do not change according to $t$; this is called weight sharing.
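
A minimal NumPy sketch of this recurrence, assuming tanh for $f_h$, the identity for $f_y$, and arbitrary toy dimensions; note that the same $W_x, W_h, W_y$ are reused at every time step:

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, Wy, bx, by, h0):
    """Run a vanilla RNN over a sequence; the same Wx, Wh, Wy are reused at every step."""
    h, ys = h0, []
    for x in xs:                               # one step per time period
        h = np.tanh(Wx @ x + Wh @ h + bx)      # h_t = f_h(Wx x_t + Wh h_{t-1} + bx)
        ys.append(Wy @ h + by)                 # y_t = f_y(Wy h_t + by), f_y = identity here
    return np.array(ys), h

# Toy dimensions: 3-dim inputs, 4-dim hidden state, 2-dim outputs
rng = np.random.default_rng(0)
Wx, Wh, Wy = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
bx, by, h0 = np.zeros(4), np.zeros(2), np.zeros(4)
xs = rng.normal(size=(5, 3))                   # a sequence of 5 inputs
ys, h_last = rnn_forward(xs, Wx, Wh, Wy, bx, by, h0)
print(ys.shape, h_last.shape)                  # (5, 2) (4,)
```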

Back-propagation:

$$
\frac{\partial{L}}{\partial{W_h}} = \frac{\partial{L}}{\partial{y_1}} * \frac{\partial{y_1}}{\partial{W_h}} + \frac{\partial{L}}{\partial{y_2}} * \frac{\partial{y_2}}{\partial{W_h}} + \dots + \frac{\partial{L}}{\partial{y_t}} * \frac{\partial{y_t}}{\partial{W_h}}
$$

Sometimes, to simplify the calculation, we use a cutoff to shorten the window (truncated back-propagation through time).

Different Architectures for Different Applications

ref: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks

| Type of RNN | Example |
| --- | --- |
| One-to-one, $N_x = N_y = 1$ | Traditional neural network |
| One-to-many, $N_x = 1, N_y > 1$ | Music generation |
| Many-to-one, $N_x > 1, N_y = 1$ | Sentiment classification |
| Many-to-many, $N_x = N_y$ | Named entity recognition |
| Many-to-many, $N_x \neq N_y$ (encoder-decoder / seq2seq) | Machine translation |

RNN & Gradient Vanishing
For RNNs, the gradient vanishing problem is even more severe, because the sequence length is usually large.

Long Short Term Memory/LSTM

In a vanilla RNN, $h_t$ changes all the time and is based only on the previous state, like a short memory. So in LSTM we introduce a cell state $\boldsymbol{c_t}$ to carry longer memories (hence "long short-term memory").

To mitigate the gradient vanishing/exploding problem, LSTM introduces 3 gates: the forget gate $\boldsymbol{f_t}$, the input/update gate $\boldsymbol{i_t}$, and the output gate $\boldsymbol{o_t}$.

Core idea of LSTM:

  1. Output $\boldsymbol{h_t}$ is based on the cell state $\boldsymbol{c_t}$, that is: $\boldsymbol{h_t} = \boldsymbol{o_t} \otimes \tanh(\boldsymbol{c_t})$
  2. Cell state: $\boldsymbol{c_t} = \boldsymbol{i_t} \otimes \tanh(W_c[x_t, h_{t-1}] + b_c) + \boldsymbol{f_t} \otimes \boldsymbol{c_{t-1}}$

Where:

  1. $\otimes$ denotes element-wise multiplication (the Hadamard product)
  2. $\tanh(x)$ is used as the activation function here because its derivative is larger than the sigmoid's; it could be replaced with ReLU.
  3. The gates are also calculated from $x_t, h_{t-1}$:

$$
\begin{aligned}
\boldsymbol{f_t} &= \sigma(W_f [x_t, h_{t-1}] + b_f) \\
\boldsymbol{i_t} &= \sigma(W_i [x_t, h_{t-1}] + b_i) \\
\boldsymbol{o_t} &= \sigma(W_o [x_t, h_{t-1}] + b_o)
\end{aligned}
$$

Here we use $\sigma$ because $\sigma$ takes values in $[0, 1]$, which is what makes it a "gate".
4. The number of trainable parameters depends on the input dimension and the number of hidden units: (input_dim + hidden_unit + 1) * hidden_unit * 4
Proof: trainable parameters come from 4 parts: the 3 gates and the cell candidate network. Since the input of each part is the concatenation of $x_t$ and $h_{t-1}$, the concatenated vector has dimension 1 x (input_dim + hidden_unit). Adding 1 more dimension for the bias $b$, each part has (input_dim + hidden_unit + 1) x hidden_unit parameters; multiplying by 4 gives the final answer (see the sketch below).
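
A minimal NumPy sketch of one LSTM step, stacking the 4 parts into a single weight matrix W (shapes chosen arbitrarily); the printed parameter count matches the formula above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*hidden, input_dim + hidden); b has shape (4*hidden,)."""
    hidden = h_prev.shape[0]
    concat = np.concatenate([x_t, h_prev])        # [x_t, h_{t-1}]
    z = W @ concat + b
    f = sigmoid(z[0*hidden:1*hidden])             # forget gate
    i = sigmoid(z[1*hidden:2*hidden])             # input/update gate
    o = sigmoid(z[2*hidden:3*hidden])             # output gate
    c_tilde = np.tanh(z[3*hidden:4*hidden])       # cell candidate
    c_t = i * c_tilde + f * c_prev
    h_t = o * np.tanh(c_t)
    return h_t, c_t

input_dim, hidden = 8, 16
W = np.zeros((4 * hidden, input_dim + hidden))
b = np.zeros(4 * hidden)
h, c = lstm_step(np.zeros(input_dim), np.zeros(hidden), np.zeros(hidden), W, b)
n_params = W.size + b.size
print(n_params, (input_dim + hidden + 1) * hidden * 4)   # both are 1600
```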

Gated Recurrent Unit/GRU

In LSTM, the gates $\boldsymbol{f_t}$ and $\boldsymbol{i_t}$ are somewhat redundant. So instead of using 3 gates, GRU only uses 2 gates: the reset gate $\boldsymbol{r_t}$ and the update gate $\boldsymbol{u_t}$.

Output $\boldsymbol{h_t}$:

$$
\boldsymbol{h_t} = \boldsymbol{u_t} \otimes h_{t-1} + (1 - \boldsymbol{u_t}) \otimes \tanh(W_c[x_t, \boldsymbol{r_t} \otimes h_{t-1}] + b_c)
$$

Similar to LSTM, $\boldsymbol{u_t}$ and $\boldsymbol{r_t}$ are also calculated from $x_t$ and $h_{t-1}$:

$$
\begin{aligned}
\boldsymbol{u_t} &= \sigma(W_u[x_t, h_{t-1}] + b_u) \\
\boldsymbol{r_t} &= \sigma(W_r[x_t, h_{t-1}] + b_r)
\end{aligned}
$$
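
A minimal NumPy sketch of one GRU step following these equations (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wu, bu, Wr, br, Wc, bc):
    """One GRU step following the equations above (element-wise gating)."""
    concat = np.concatenate([x_t, h_prev])                    # [x_t, h_{t-1}]
    u = sigmoid(Wu @ concat + bu)                             # update gate
    r = sigmoid(Wr @ concat + br)                             # reset gate
    h_tilde = np.tanh(Wc @ np.concatenate([x_t, r * h_prev]) + bc)
    return u * h_prev + (1 - u) * h_tilde                     # h_t

# Toy dimensions: 3-dim input, 4-dim hidden state
rng = np.random.default_rng(0)
Wu, Wr, Wc = (rng.normal(size=(4, 7)) for _ in range(3))
bu = br = bc = np.zeros(4)
h_t = gru_step(rng.normal(size=3), np.zeros(4), Wu, bu, Wr, br, Wc, bc)
print(h_t.shape)   # (4,)
```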