
Hierarchical Clustering

Split (divisive)

Consider all the samples as one cluster, then recursively split into sub-clusters based on distances.

Combine (agglomerative)

Consider each sample as its own cluster, then repeatedly merge the nearest clusters into one.

K-means

NP-hard problem

Steps:

  1. Select k random samples as the initial cluster centers
  2. For each of the remaining n-k samples, calculate the distance between the sample and each center
  3. Assign the sample to the nearest cluster
  4. Re-calculate the centers of the clusters
  5. Repeat 3-4 until the centers no longer change
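A minimal numpy sketch of these steps (illustrative only: it adds a simple guard for empty clusters, and random restarts are omitted):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k random samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 2-3: assign every sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 4: re-calculate each center (keep the old one if a cluster empties)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        # Step 5: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

With two well-separated blobs, the two returned centers land near the blob means.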

How to decide the first centers
The randomly selected initial centers influence the resulting clusters.

Solutions:

  1. Run the algorithm multiple times with different initial centers
  2. Use hierarchical clustering to find k clusters, and then use them as the initial centers.

How to choose k
If not decided by business rules, use the elbow rule.
Plot the diameter of the clusters against the values of k:

Elbow Rule

The x-axis is the value of k, and the y-axis is the diameter of the clusters, e.g., the sum of squared distances between the points in a cluster and the cluster centroid.

From the plot above, the best value of k is 3.

Other Improvements for K-Means

Sensitive to Initial Values
k-means++: choose each new initial center with probability proportional to its squared distance from the nearest center already chosen.

Sensitive to Noise and Outliers

Distance

Jaccard Distance

Cosine Distance

FM/Factorization Machines

Rendle, S. (2010). Factorization machines. 2010 IEEE International Conference on Data Mining (pp. 995–1000).

Summary

  1. Introduced crossed features with low computational complexity, $O(kn)$, where k is the dimension of the latent vector and n is the dimension of the original feature
  2. Suitable for sparse features. When features are sparse, a smaller k can be used to fit.

Details

Assume we have n features in total. If we want to fit the interaction between every 2 features (2-degree), the model is:

$$f(\mathbf{x}) = w_0 + \sum_{i = 1}^n w_i x_i + \sum_{i = 1}^n \sum_{j = i + 1}^n w_{ij} x_i x_j$$

Since the number of $x_i x_j$ terms is $\frac{n(n-1)}{2}$, the computational complexity of this model is $O(n^2)$.

Because $\mathbf{W} = (w_{ij})$ is a symmetric matrix and, assuming it is positive definite, according to the Cholesky decomposition it can be written as the product of a matrix and its transpose, we have:

$$\mathbf{W} = \mathbf{V}^T \mathbf{V}$$

Where each column $\vec{v_i}$ of $\mathbf{V}$ is a vector of dimension k, which can be seen as the latent vector of feature i.

Now for the model, we have:

$$\begin{aligned} f(\mathbf{x}) &= w_0 + \sum_{i = 1}^n w_i x_i + \sum_{i = 1}^n \sum_{j = i + 1}^n w_{ij} x_i x_j \\ &= w_0 + \sum_{i = 1}^n w_i x_i + \sum_{i = 1}^n \sum_{j = i + 1}^n \langle \vec{v_i}, \vec{v_j} \rangle x_i x_j \end{aligned}$$

And the computational complexity is $O(kn^2)$.

For the pairwise terms $x_i x_j$, we can use the following decomposition:

$$\begin{aligned} & (x_1 + x_2 + x_3)^2 = x_1^2 + x_2^2 + x_3^2 + 2x_1x_2 + 2x_1x_3 + 2x_2x_3 \\ \Leftrightarrow \; & x_1x_2 + x_1x_3 + x_2x_3 = \frac{(x_1 + x_2 + x_3)^2 - (x_1^2 + x_2^2 + x_3^2)}{2} \end{aligned}$$

So now the model is:

$$\begin{aligned} f(\mathbf{x}) &= w_0 + \sum_{i = 1}^n w_i x_i + \sum_{i = 1}^n \sum_{j = i + 1}^n \langle \vec{v_i}, \vec{v_j} \rangle x_i x_j \\ &= w_0 + \sum_{i = 1}^n w_i x_i + \frac{1}{2} \sum_{f = 1}^k \left[ \left( \sum_{i = 1}^n v_{i,f} x_i \right)^2 - \sum_{i = 1}^n v_{i,f}^2 x_i^2 \right] \end{aligned}$$

And the computational complexity is $O(kn)$.
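The equivalence between the $O(kn^2)$ double sum and the $O(kn)$ form can be checked numerically (a quick sketch with random values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
V = rng.normal(size=(n, k))   # latent vectors, one row per feature
x = rng.normal(size=n)

# naive O(k n^2): sum over all pairs i < j of <v_i, v_j> x_i x_j
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# O(k n): 1/2 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
fast = 0.5 * (((V * x[:, None]).sum(axis=0) ** 2).sum()
              - ((V ** 2) * (x ** 2)[:, None]).sum())

# naive and fast agree up to floating-point error
```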

Wide & Deep

https://dl.acm.org/doi/10.1145/2988450.2988454

Summary

Use wide part to fit crossed features, and use deep part to fit embeddings for categorical features.

Details

Wide & Deep

wide: LR + 2-degree crossed features, $y_{wide} = w_0 + \sum_i^n w_i x_i + \sum_{i = 1}^n \sum_{j = 1}^n w_{ij} x_i x_j$
deep: DNN, $y_{deep} = \text{DNN}(x)$
final: $y = \sigma(y_{wide} + y_{deep} + b)$

It was built for app recommendation, so a diagram closer to the business setting is below:

Wide & Deep for app recommendation

Here the author only used the user's installed apps and impression apps for the crossed features.

what is ReLU(1024)?

Note that it still requires humans to choose which features to cross, which is difficult when the number of features is large.

DeepFM

https://arxiv.org/abs/1703.04247

Summary

Use FM to replace wide part in wide & deep, and use embedding for both wide and deep parts.

Details

DeepFM

Use latent vector from FM as the embedding, to be the input for both deep and wide parts.

DIN/Deep Interest Network for Click-Through Rate Prediction

Summary

Use attention mechanism where:
Query: target ad
Key: User behavior
Value: User behavior
And use an activation network to replace attention score function.

Details

DIN

TODO

DIEN/Deep Interest Evolution Network for Click-Through Rate Prediction

Summary

Use GRU to fit user historical behaviour.

Details

DIEN

TODO

Activation Function

Common properties:

| Property | Reason |
| --- | --- |
| Continuous, differentiable*, non-linear | Differentiable so we can use back-propagation to learn parameters and optimize the loss function |
| The function and its derivative should be as simple as possible | Speeds up the calculation |
| The derivative should fall in a certain range, neither too large nor too small | Otherwise it would slow down training (TODO: why; ReLU doesn't obey this) |

*Note: Any differentiable function must be continuous at every point in its domain. The converse does not hold: a continuous function need not be differentiable. E.g., $y=|x|$ is continuous, and differentiable everywhere except $x=0$.

sigmoid

Logistic

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Property:

  1. $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
  2. $\sigma(x) = 1 - \sigma(-x)$
Sigmoid and its derivative

For the sigmoid function, when $x \in [-3, 3]$, $y=\sigma(x)$ is nearly linear, and $y$ is close to 0 or 1 for $x$ outside this range. We call $[-3, 3]$ the non-saturation region.

For its first derivative, $\sigma'(x) \approx 0$ when $x \notin [-3, 3]$.

tanh

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

tanh and sigmoid

First derivative:

$$\tanh'(x) = 1 - [\tanh(x)]^2$$

tanh and its derivative

@TODO

ReLU/Rectified linear unit

Vanilla ReLU

$$\text{relu}(x) = \max(0, x) = \begin{cases} x, & x \geq 0 \\ 0, & x < 0 \end{cases}$$

Advantages

  • Sparse activation: For example, in a randomly initialized network, only about 50% of hidden units are activated (have a non-zero output).
  • Better gradient propagation: Fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions.
  • Efficient computation: Only comparison, addition and multiplication.
  • Scale-invariant: $\max(0, ax) = a\max(0, x), \forall a \geq 0$

Disadvantages

  • Non-differentiable at zero: however, it is differentiable anywhere else, and the value of the derivative at zero can be arbitrarily chosen to be 0 or 1.
  • Not zero-centered: the outputs are all non-negative, so neurons in later layers receive inputs with a positive bias.
  • Unbounded.
  • Dying ReLU problem (vanishing gradient problem): when $x<0$, the derivative of ReLU is 0, meaning the parameters of the neuron won't update through back-propagation, and neither will those of the neurons connected to it.
    • Possible reasons
      • The initialization of parameters
      • Learning rate is too large. The parameters change too fast from positive to negative.
    • Solutions
      • Leaky ReLU: would reduce performance (TODO:why?)
      • Use auto-scale lr algorithms.

Leaky ReLU

$$\text{Leaky relu}(x) = \begin{cases} x, & x \geq 0 \\ \lambda x, & x < 0 \end{cases}$$

Parametric ReLU

$$\text{P relu}_i(x) = \begin{cases} x, & x \geq 0 \\ \lambda_i x, & x < 0 \end{cases}$$

Where $\lambda_i$ is a learnable parameter.

ELU

Ref: https://nndl.github.io

Softplus

Swish Function

GELU

Maxout

Comparison between different activation functions

TODO

Activation functions evolved roughly as: sigmoid -> tanh -> ReLU -> ReLU variants -> maxout

| Activation Function | Advantage | Disadvantage |
| --- | --- | --- |
| sigmoid | The range of output is $(0,1)$ | 1. First derivative is nearly 0 in the saturation regions; 2. Not zero-centered, so the inputs for later neurons are all positive; 3. Slow because of the exponential calculation |
| tanh | Zero-centered; the range of output is $(-1,1)$ | 1. First derivative is nearly 0 in the saturation regions; 2. Slow because of the exponential calculation |
| ReLU | 1. Only 1 saturation region; 2. Easy to calculate, no exponential | Dead ReLU problem |
| ReLU variants (leaky ReLU, etc.) | Solve the dead ReLU problem | |
| maxout | | |

Attention Mechanism

Attention works as a weighted sum/average; the attention scores are the weights of the inputs.

Let’s say inputs are X, here’s how self KV-attention works:

KV Attention

The output is:

$$O(X) = \sum_{i=1}^n \alpha_i v_i$$

where $\alpha_i$ is the attention score of $x_i$; in KV attention, it is calculated by:

$$\alpha_i = f(K, Q)$$

Some common ways to calculate attention:

  • Standard Scaled Dot-Product Attention

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

  • Multi-Head Attention

$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \text{head}_2, ...)W$$

where each head is calculated with its own learned projections:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
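A minimal numpy sketch of single-head scaled dot-product attention (no masking, no projections):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_q, n_k) scaled similarity scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights           # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
out, w = attention(Q, K, V)   # out: (4, 8), w: (4, 5)
```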

Positional Embedding

Instead of using LSTM/GRU, use positional embedding to capture the positions, usually appears in transformers.

Suppose we have an input sequence with length L, then the positional embedding will be:

$$\begin{aligned} P(k, 2i) &= \sin\left(\frac{k}{n^{2i/d}}\right) \\ P(k, 2i + 1) &= \cos\left(\frac{k}{n^{2i/d}}\right) \end{aligned}$$

Where:

  1. k is the index of the position
  2. d is the dimension of the positional embedding
  3. i maps to the column indices, with $0 \le i < d/2$
  4. n is a user-defined number, 10,000 in Attention Is All You Need.

Below is an example:

Example for positional embedding

In the transformer, the author added the positional embedding to the word embedding, and then fed the sum into the transformer encoder.

How positional embedding works in transformer

Implementation of positional embedding

```python
import numpy as np
import matplotlib.pyplot as plt

def getPositionEncoding(seq_len, d, n=10000):
    P = np.zeros((seq_len, d))
    for k in range(seq_len):
        for i in np.arange(int(d / 2)):
            denominator = np.power(n, 2 * i / d)
            P[k, 2 * i] = np.sin(k / denominator)
            P[k, 2 * i + 1] = np.cos(k / denominator)
    return P

# P.shape: seq_len * d
P = getPositionEncoding(seq_len=4, d=4, n=100)
```

Normalization

Note: I call all kinds of feature scalings as normalization.

Why feature scaling?

  1. When the model is sensitive to distance. Without scaling, the model will be heavily impacted by outliers, or by features with large ranges.
  2. When the dataset contains features that have different ranges, units of measurement, or orders of magnitude.

Do the normalization on the training set, and apply the training-set statistics to the test set.

Max-Min Scaling/unity-based normalization

Scales features to the range [0, 1].

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

When $X = X_{min}$, $X' = 0$. On the other hand, when $X = X_{max}$, $X' = 1$. So the range is [0, 1].

It can also be generalized to restrict the range of values to arbitrary points $a$ and $b$:

$$X' = a + \frac{X - X_{min}}{X_{max} - X_{min}} (b - a)$$

Log Scaling

Usually used when there are long tail problems.

$$X' = \log(X)$$

Z-score

$$X' = \frac{X - \text{mean}(X)}{\text{std}(X)}$$

After z-score standardization, the data has:

  1. $\mu = 0$
  2. $\sigma = 1$

Note that this only shifts and rescales the data; it does not change the shape of the distribution into a bell curve.

For linear models, which often assume the data is normally distributed, it is better to do normalization.
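Both scalings in a short sketch, computing the statistics on a training split and reusing them on the test split as noted above:

```python
import numpy as np

train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test = np.array([0.0, 6.0])

# Max-min scaling with training-set statistics
lo, hi = train.min(), train.max()
train_mm = (train - lo) / (hi - lo)
test_mm = (test - lo) / (hi - lo)   # may fall outside [0, 1] on unseen data

# Z-score with training-set statistics
mu, sigma = train.mean(), train.std()
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma
```

Note the test values scaled with training statistics can leave [0, 1], which is expected.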

Batch Normalization

Why use batch normalization?
Because in deep neural networks the distribution of activations shifts through the layers, which is called internal covariate shift.

So we use BN between layers to mitigate the consequences.

How do we do batch normalizations?

For every mini-batch, normalize each feature over the samples in the batch: $\mu$ and $\sigma$ are the per-feature mean and standard deviation computed over the batch.

To be less restrictive, batch normalization adds a scaling factor $\gamma$ and an offset $\beta$, so the result is:

$$\begin{aligned} X_{BN} &= \gamma \hat{x} + \beta \\ \hat{x} &= \frac{x - \mu(x)}{\sigma(x)} \end{aligned}$$
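A sketch of the training-time forward computation (the running statistics kept for inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); normalize each feature over the batch dimension
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# each feature of y now has mean ~0 and variance ~1 over the batch
```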

Limitations

  • BN is calculated over the batch, so the batch size needs to be reasonably large.
  • Not suitable for sequence models.

Layer Normalization

Similar to BN, but the normalization is done over the feature dimension: $\mu$ and $\sigma$ are the mean and variance of the features of a single sample, so LN scales across different features rather than across the batch.

LN is more common in NLP, transformer uses layer normalization.

Regularization

Dropout

Dropout is a layer, it can be applied before input layer or hidden layers.

For the input layer, the dropout rate is usually 0.2, which means each input neuron has an 80% chance of being kept.
For hidden layers, the dropout rate is usually 0.5.

In PyTorch, an implementation is as follows:

```python
import torch.nn as nn


class DemoModel(nn.Module):
    def __init__(self):
        super(DemoModel, self).__init__()
        self.dropout_input = nn.Dropout(0.2)
        self.layer1 = nn.Linear(128, 64)
        self.act1 = nn.ReLU()
        self.dropout1 = nn.Dropout(0.5)
        self.layer2 = nn.Linear(64, 32)
        self.act2 = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.dropout_input(x)
        x = self.act1(self.layer1(x))
        x = self.dropout1(x)  # hidden-layer dropout (defined above, must be applied here)
        x = self.act2(self.layer2(x))
        x = self.sigmoid(x)
        return x
```

When we do inference, we don't drop neurons; classically, every weight is instead multiplied by the keep probability. (PyTorch's `nn.Dropout` uses inverted dropout, scaling by $\frac{1}{1-p}$ during training, so no scaling is needed at inference.)

In PyTorch, calling `model.eval()` disables dropout automatically.

Early Stop

Stop running the model when triggered. Some early-stop triggers can be:

  • No change in the metric over a given number of epochs.
  • An absolute change in the metric.
  • A decrease in the performance observed over a given number of epochs.
  • Average change in the metrics over a given number of epochs.
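The first trigger, "no improvement over a given number of epochs" (often called patience), sketched:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training stops, given per-epoch validation losses."""
    best, bad_epochs = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0   # improvement: reset the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch             # no improvement for `patience` epochs
    return len(val_losses) - 1

# losses improve, then plateau: training stops 3 epochs after the last improvement,
# before the later 0.6 is ever reached
stop = early_stop_epoch([1.0, 0.8, 0.7, 0.7, 0.71, 0.72, 0.6])
```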

Initialize

The initialization of the neural network shouldn't be all zeros, nor too small, nor too large.

Initialize to be zeros
All neurons produce the same output, and during back-propagation their weights receive identical updates, so they stay the same.

Initialize too small
Gradient vanishing.

initialize too large
Gradient explosion.

The common way to initialize is making sure $\mu = 0, \sigma = 1$, i.e., like a standard normal distribution.

Also, we want to keep the variance the same across every layer.

Xavier Initialization

Initialize the weights to have μ=0\mu=0, and σ\sigma the same across all layers.

Normal xavier initialization

$$W \sim N\left(0, \sqrt{\frac{2}{n_{input} + n_{output}}}\right)$$

where $n_{input}$ and $n_{output}$ are the number of inputs and the number of outputs of the layer, respectively.

Uniform xavier initialization

$$W \sim U\left(-\sqrt{\frac{6}{n_{input} + n_{output}}}, \sqrt{\frac{6}{n_{input} + n_{output}}}\right)$$

Xavier works better with tanh and sigmoid activation functions.

For ReLU function, usually we use He initialization.
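Both Xavier variants and He initialization in a numpy sketch:

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    # std = sqrt(2 / (n_in + n_out))
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, rng):
    # limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    # He initialization, commonly used with ReLU: std = sqrt(2 / n_in)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
W = xavier_normal(512, 256, rng)
# mean ~ 0, std ~ sqrt(2 / (512 + 256))
```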

Questions

Why use σ(x)\sigma(x) as activation function?

Why ReLU instead of σ(x)\sigma(x)?

  • It does not saturate in both directions.
  • It allows faster and more effective training of deep neural architectures on large and complex datasets.

Multilayer perceptron (MLP)/ artificial neural network (ANN) / deep neural network (DNN) / feedforward neural network (FNN)

A neural network is essentially a stack of weight matrices with non-linearities. Assume we have $x$ as input and $o$ as output, and we only have 1 hidden layer.

Neural Network

The forward function is:

$$\begin{aligned} Z_l &= W^l X_{l-1} + b^l \\ X_l &= f_l(Z_l) \end{aligned}$$

where:

  1. $Z_l$ is the input of layer l before the activation function
  2. $X_l$ is the output of the neurons (after activation)
  3. $f_l$ is the activation function.

Steps to train the neural network:

  1. Initialize the weights
  2. Forward: calculate $o$ with $w, b$
  3. Back-propagation: calculate the derivative of the loss function
  4. Iterate until the loss is below a certain value

Assume:
$x_1 = 1, x_2 = 4, x_3 = 5, y_1 = 0.1, y_2 = 0.05$
Activation function: $f_l(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$
Loss function: $L(\hat{y}, y) = \frac{1}{2} \sum_i^m (\hat{y_i} - y_i)^2$

Use the picture above.

Forward:

$$\begin{aligned} Z_{h_1} &= w_1 x_1 + w_3 x_2 + w_5 x_3 + b_1 = 4.3 \\ h_1 &= f_l(Z_{h_1}) = 0.9866 \\ Z_{h_2} &= w_2 x_1 + w_4 x_2 + w_6 x_3 + b_2 = 5.3 \\ h_2 &= f_l(Z_{h_2}) = 0.9950 \\ Z_{o_1} &= w_7 h_1 + w_9 h_2 + b_2 = 2.0866 \\ o_1 &= f_l(Z_{o_1}) = 0.8896 \\ Z_{o_2} &= w_8 h_1 + w_{10} h_2 + b_2 = 1.3888 \\ o_2 &= f_l(Z_{o_2}) = 0.8004 \end{aligned}$$

Then $o_1, o_2$ are our first predictions for $y_1, y_2$. We use back-propagation to optimize:

Back-propagation (using only $w_1$ as an example):

$$\begin{aligned} \frac{\partial{L}}{\partial{w_1}} &= \frac{\partial{L}}{\partial{o_1}} \frac{\partial{o_1}}{\partial{w_1}} + \frac{\partial{L}}{\partial{o_2}} \frac{\partial{o_2}}{\partial{w_1}} \\ &= \frac{\partial{L}}{\partial{o_1}} \frac{\partial{o_1}}{\partial{z_{o_1}}} \frac{\partial{z_{o_1}}}{\partial{w_1}} + \frac{\partial{L}}{\partial{o_2}} \frac{\partial{o_2}}{\partial{z_{o_2}}} \frac{\partial{z_{o_2}}}{\partial{w_1}} \\ &= \frac{\partial{L}}{\partial{o_1}} \frac{\partial{o_1}}{\partial{z_{o_1}}} \frac{\partial{z_{o_1}}}{\partial{h_1}} \frac{\partial{h_1}}{\partial{z_{h_1}}} \frac{\partial{z_{h_1}}}{\partial{w_1}} + \frac{\partial{L}}{\partial{o_2}} \frac{\partial{o_2}}{\partial{z_{o_2}}} \frac{\partial{z_{o_2}}}{\partial{h_1}} \frac{\partial{h_1}}{\partial{z_{h_1}}} \frac{\partial{z_{h_1}}}{\partial{w_1}} \end{aligned}$$

From the formula we can see that if the intermediate derivatives are less than 1, then after several multiplications the final result becomes very small.

Ref: https://www.anotsorandomwalk.com/backpropagation-example-with-numbers-step-by-step/

Gradient Exploding & Gradient Vanishing

Why do we have these problems?
Back-propagation computes the gradient by the chain rule. From the back-propagation formula, we can see that if the number of hidden layers is large, the result involves many intermediate gradients. If the intermediate gradients are below 1, their product decreases rapidly, so the gradient vanishes. On the other hand, if they are all above 1, the product increases rapidly, so the final gradient explodes.

How to deal with gradient exploding?
To deal with gradient exploding, we can use gradient clipping. When the gradient norm is above the threshold:

$$g \leftarrow \frac{\text{Threshold}}{\|g\|} \cdot g$$
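The clipping rule above as a sketch:

```python
import numpy as np

def clip_by_norm(g, threshold):
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = threshold / norm * g   # rescale so that ||g|| == threshold
    return g

g = np.array([3.0, 4.0])          # ||g|| = 5
clipped = clip_by_norm(g, 1.0)    # rescaled to norm 1
small = clip_by_norm(np.array([0.1, 0.2]), 1.0)   # already small: unchanged
```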

How to deal with gradient vanishing?

  1. Model architecture: use LSTM, GRU.
  2. Residual: Add residual into the network
  3. Batch Normalization: TODO
  4. Activation function: use ReLU instead of sigmoid

Convolutional Neural Network (CNN)

@TODO

Recurrent Neural Network (RNN)

Sometimes the output of the neural network is not only related to the input, but also to the current state (the output from the last time step).

Vanilla RNN

vanilla rnn

Forward:

$$\begin{cases} h_t = f_h(W_x x_t + W_h h_{t-1} + b_x) \\ y_t = f_y(W_y h_t + b_y) \end{cases}$$

Note: $W_x, W_h, W_y$ do not change with $t$; this is called weight sharing.

Back-propagation:

$$\frac{\partial{L}}{\partial{W_h}} = \frac{\partial{L}}{\partial{y_1}} \frac{\partial{y_1}}{\partial{W_h}} + \frac{\partial{L}}{\partial{y_2}} \frac{\partial{y_2}}{\partial{W_h}} + \dots + \frac{\partial{L}}{\partial{y_t}} \frac{\partial{y_t}}{\partial{W_h}}$$

Sometimes, to simplify the calculation, we use a cutoff to shorten the window (truncated back-propagation through time).
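The forward recurrence with shared weights, sketched in numpy (here $f_h = \tanh$ and $f_y$ is the identity, an assumption for illustration):

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, Wy, bx, by):
    # Wx, Wh, Wy are shared across all time steps (weight sharing)
    h = np.zeros(Wh.shape[0])
    ys = []
    for x in xs:                              # one step per time period
        h = np.tanh(Wx @ x + Wh @ h + bx)     # new state from input + previous state
        ys.append(Wy @ h + by)                # output from the current state
    return np.array(ys), h

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 5, 2, 4
xs = rng.normal(size=(T, d_in))
ys, h = rnn_forward(xs,
                    rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
                    rng.normal(size=(d_out, d_h)), np.zeros(d_h), np.zeros(d_out))
```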

Different Architectures for Different Applications

ref: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks

| Type of RNN | Example |
| --- | --- |
| one-to-one, $N_x = N_y = 1$ | Traditional neural network |
| one-to-many, $N_x = 1, N_y > 1$ | Music generation |
| many-to-one, $N_x > 1, N_y = 1$ | Sentiment classification |
| many-to-many, $N_x = N_y$ | Named entity recognition |
| many-to-many, $N_x \neq N_y$ (encoder-decoder, seq2seq) | Machine translation |

RNN & Gradient Vanishing
For RNNs, the gradient vanishing problem is even more severe, because the sequence length is usually large.

Long Short Term Memory/LSTM

In a vanilla RNN, $h_t$ changes all the time and depends only on the previous state, like a short memory. So in LSTM we introduce a cell state $\boldsymbol{c_t}$ to carry longer memories (hence "long short-term memory").

To mitigate the gradient vanishing/exploding problem, LSTM introduces 3 gates: the forget gate $\boldsymbol{f_t}$, the input/update gate $\boldsymbol{i_t}$, and the output gate $\boldsymbol{o_t}$.

Core idea of LSTM:

  1. The output $\boldsymbol{h_t}$ is based on the cell state $\boldsymbol{c_t}$: $\boldsymbol{h_t} = \boldsymbol{o_t} \otimes \tanh(\boldsymbol{c_t})$
  2. Cell state: $\boldsymbol{c_t} = \boldsymbol{i_t} \otimes \tanh(W_c[x_t, h_{t-1}] + b_c) + \boldsymbol{f_t} \otimes \boldsymbol{c_{t-1}}$

Where:

  1. $\otimes$ denotes elementwise (Hadamard) multiplication
  2. $\tanh(x)$ is the activation function here because its derivative is larger; it could be replaced with ReLU
  3. The gates are also calculated from $x_t, h_{t-1}$:

$$\begin{aligned} \boldsymbol{f_t} &= \sigma(W_f [x_t, h_{t-1}] + b_f) \\ \boldsymbol{i_t} &= \sigma(W_i [x_t, h_{t-1}] + b_i) \\ \boldsymbol{o_t} &= \sigma(W_o [x_t, h_{t-1}] + b_o) \end{aligned}$$

Here we use $\sigma$ because its output lies in $(0, 1)$, which acts as a "gate".
4. The number of trainable parameters depends on the input dimension and the hidden size: (input_dim + hidden_unit + 1) * hidden_unit * 4.
Proof: the trainable parameters come from 4 parts: the 3 gates and the cell network. Since the input of each part is the concatenation of $x_t$ and $h_{t-1}$, the concatenated vector has dimension 1 x (input_dim + hidden_unit). Adding 1 dimension for $b$, each part has (input_dim + hidden_unit + 1) x hidden_unit parameters; times 4 gives the final answer.

Gated Recurrent Unit/GRU

In LSTM, the gates $\boldsymbol{f_t}$ and $\boldsymbol{i_t}$ are partly redundant. So instead of 3 gates, GRU uses only 2: the reset gate $\boldsymbol{r_t}$ and the update gate $\boldsymbol{u_t}$.

Output ht\boldsymbol{h_t}:

$$\boldsymbol{h_t} = \boldsymbol{u_t} \otimes h_{t-1} + (1 - \boldsymbol{u_t}) \otimes \tanh(W_c[x_t, \boldsymbol{r_t} \otimes h_{t-1}] + b_c)$$

Similar to LSTM, $\boldsymbol{u_t}$ and $\boldsymbol{r_t}$ are also calculated from $x_t, h_{t-1}$:

$$\begin{aligned} \boldsymbol{u_t} &= \sigma(W_u[x_t, h_{t-1}] + b_u) \\ \boldsymbol{r_t} &= \sigma(W_r[x_t, h_{t-1}] + b_r) \end{aligned}$$

Sampling

Common Down-sampling Methods

Sampling without replacement, in which a subset of the observations are selected randomly, and once an observation is selected it cannot be selected again.
Sampling with replacement, in which a subset of observations are selected randomly, and an observation may be selected more than once.

If we have m samples and we sample m times with replacement (putting each chosen item back after every draw), then the expected number of distinct items is about 63.2% of the original set.

Math details:
The probability of an item not being chosen in one draw is $1 - \frac{1}{m}$, so the probability of never being chosen after m draws is $(1 - \frac{1}{m})^m$. If m is large enough, we have:

$$\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^m = \frac{1}{e} \approx 0.368$$

Selecting a stratified sample, in which a subset of observations are selected randomly from each group of the observations defined by the value of a stratifying variable, and once an observation is selected it cannot be selected again.
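The 63.2% figure from the math details above can be checked by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
# draw m times with replacement from m items
chosen = rng.integers(0, m, size=m)
unique_fraction = len(np.unique(chosen)) / m
# close to 1 - 1/e ~ 0.632
```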

Upsampling/Oversampling

SMOTE (Synthetic Minority Oversampling Technique)

Synthesize samples for the minority class.

Assume we need to oversample by N, then the steps are:

  1. Select 1 minority sample and find its k nearest neighbors.
  2. Randomly choose N nodes from the k neighbors.
  3. Interpolate to create the new synthetic samples.
  4. Repeat 1-3 for all samples in the minority class.

Interpolate as follows:

SMOTE

Node C is synthesized from the difference between node A and node B. A synthetic node always lies on the segment between its base nodes.
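The interpolation step sketched (the k-NN search is assumed already done; `a` is the minority sample and `b` one of its chosen neighbors):

```python
import numpy as np

def smote_interpolate(a, b, rng):
    # the synthetic point lies on the segment between a and its neighbor b
    lam = rng.uniform(0, 1)
    return a + lam * (b - a)

rng = np.random.default_rng(0)
a = np.array([1.0, 1.0])
b = np.array([3.0, 2.0])
c = smote_interpolate(a, b, rng)
# c is between a and b on every coordinate
```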

ANN (approximate nearest neighbor)

Common ann algorithms: trees, hashes, graphs

Tree-based: Kd-tree

Hash-based: LSH

Graph-based: HNSW

Ref: Similarity Search

Graph Search

Intrinsic (greedy) search: start with A, then iterate over all its neighbors to find the one nearest to the query, and move to that neighbor. Repeat this process until no neighbor is closer than the current node. The current node is returned as the result.

Questions:

  1. What if some nodes have no neighbors?
  2. If we want to find the nearest 2 nodes for the query, and these 2 nodes are not connected, we need to search the whole graph twice, which costs lots of resources.
  3. How do we build the graph? Does node G really need all those neighbors?

Answers:

  1. When building graph, make sure that every node has at least 1 neighbor.
  2. When building graph, make sure that if 2 nodes are close enough, they must have an edge.
  3. Reduce the number of neighbors as much as possible.

For how to build such a graph, see the NSW method.

NSW

To build the graph, first shuffle the data points, then insert the nodes one by one.

When inserting a new node, link it with the nearest M nodes.

Long-range edges are likely to occur during the early phase of graph construction.

HNSW

HNSW Search

Build graph: use a parameter to decide whether a node needs to be inserted into a certain layer; the node is then inserted into its insertion layer and every layer below it.

Search graph: start at the top layer, greedily find the nearest node, then descend to the next layer. Repeat until layer 0.

K-Fold Cross Validation

@TODO

Normalization

The goal of normalization is to transform features onto the same scale.

For batch Normalization and layer normalization, they are used in deep learning models, see details here.

KNN/K-Nearest Neighbors

Key idea: find the k nearest neighbors of the current node, and use them to decide its class. So:

  1. What is the definition of "near"?
  2. How do we decide which k to use?
  3. Once we have the k neighbors, what rule decides the current node's class?

Distance Metric

The Minkowski distance or $L_p$ distance is a metric on a normed vector space that can be considered a generalization of other common distance metrics.

Assume variables $x$ and $y$ have dimension n, then:

$$L_p(x, y) = \left[\sum_{i = 1}^n |x_i - y_i|^p\right]^{\frac{1}{p}}$$

Minkowski distance (src)
  • When $p=1$, $L_1(x, y) = \sum_{i = 1}^n |x_i - y_i|$, which is the Manhattan Distance (in 2-d, it's the sum of the legs).
  • When $p=2$, $L_2(x, y) = \sqrt{\sum_{i = 1}^n |x_i - y_i|^2}$, which is the Euclidean Distance (in 2-d, it's the hypotenuse).
  • When $p=\infty$, $L_\infty(x, y) = \max\limits_{i = 1, \dots, n} |x_i - y_i|$, which is the Chebyshev Distance (the greatest difference between the two vectors along any coordinate dimension).
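The three special cases, sketched:

```python
import numpy as np

def minkowski(x, y, p):
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

manhattan = minkowski(x, y, 1)     # |3| + |4| = 7.0
euclidean = minkowski(x, y, 2)     # sqrt(9 + 16) = 5.0
chebyshev = np.abs(x - y).max()    # 4.0, the p -> infinity limit
```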

How to get Chebyshev distance from Minkowski distance?
TODO

Parameter K Selection

  • Small K

    • If the neighbors are noise, the predictions tend to be wrong.
    • The model tends to be complicated and easy to overfit.
  • Large K

    • Underfit

Usually we start with a smaller k and tune the parameter.

Classification Rule

Majority rule/The wisdom of crowds, and use 0-1 loss.

Implementation: k-d tree

TODO

Decision Tree

It can be seen as a set of if-else rules.

Since finding the optimal decision tree is an NP-complete problem, we use heuristic methods to find a sub-optimal tree.

Overall the steps are:

  1. Feature selection: find the optimal feature and split the data by it.
  2. Build the decision tree: repeat step 1 until all nodes are classified or no features remain.
  3. Trim: prune the decision tree, because the tree from step 2 may overfit.

Feature Selection

Here are some common criteria for evaluating the feature importance.

Information Gain (aka MI)

Information gain is based on entropy. Entropy measures the amount of information (uncertainty) the data carries; information gain is the difference between the entropy before and after splitting the data with the current feature.

$$\text{Info Gain}(D, A) = H(D) - H(D | A)$$

Here H(D)H(D) is the entropy of the dataset.

Information Gain Ratio

When using information gain to choose the feature, we naturally favor features with more values, so we introduce the information gain ratio to correct this.

$$g_R(D,A) = \frac{g(D,A)}{H_A(D)}$$

Where $H_A(D)$ is the entropy of dataset $D$ with respect to the values of feature $A$. It is computed like the entropy of the whole dataset, but grouping the samples by the value of feature $A$:

$$H_A(D) = -\sum_{i = 1}^{N_A} \frac{|D_i|}{|D|} \log \frac{|D_i|}{|D|}$$

Where $|D|$ is the number of samples in the dataset and $|D_i|$ is the number of samples taking the i-th value of feature $A$.
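Information gain on a toy binary feature, sketched (the counts are made up for illustration):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# toy dataset: class labels y, and a binary feature A that splits them
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
A = np.array([1, 1, 1, 0, 1, 0, 0, 0])

H_D = entropy(y)   # balanced binary labels: 1.0 bit
# conditional entropy: weighted entropy of each subset defined by A's value
H_D_given_A = sum((A == v).mean() * entropy(y[A == v]) for v in np.unique(A))
info_gain = H_D - H_D_given_A
```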

Build the Tree

ID3 algorithm
Uses information gain to choose the feature when splitting data.

C4.5 algorithm
Uses information gain ratio to choose the feature when splitting data.

Tree Trim

TODO

How to apply tree models to classification/regression problems?

TODO

Ensemble Learning

Ensemble methods use multiple algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

Bagging/Bootstrap aggregating

What is a bootstrap dataset? A bootstrap set is created by sampling from the original training set with replacement, so a given example may appear zero, one, or multiple times.

What is aggregation? Inference is done by voting over the predictions of the ensemble members.

  • For classification problems, use the voting result. If two labels tie, consider adding confidence.
  • For regression problems, use the average, e.g., a simple average, or a weighted average with the base models' weights.

Random Forest/RF: an implementation of bagging and decision tree

Use a decision tree as the base model, and add randomness to feature selection: choose n features from the current m candidates, and select the optimal feature among those n.

  • $n = m$: traditional decision tree
  • $n = 1$: randomly select 1 feature to split the data
  • $n = \log_2 m$: recommended choice

Overall, random forest has 2 sources of randomness.

  • Randomness in sample selection: because of bootstrap sampling, the dataset is sampled randomly.
  • Randomness in feature selection: the optimal feature is not chosen from all the features, instead it’s chosen based on randomly selected feature candidates.

Advantage

Disadvantage

Boosting

Core idea: start from a weak base model, then keep boosting its performance until it reaches a satisfactory standard.

Adaboost/Adaptive Boosting

Steps:

  1. Calculate the error rate $e_m$: the sum of the weights of the samples that the classifier predicts wrongly.
  2. Calculate the classifier weight $\alpha_m$: $\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$. Note that $\alpha_m$ decreases as $e_m$ grows.
  3. Improve the boosted classifier:

ωm+1={wmiZmeαmGm(xi)=yiwmiZmeαmGm(xi)yi\omega_{m+1} = \begin{cases} \frac{w_{mi}}{Z_m} e^{-\alpha_m} & G_m(x_i) = y_i \\ \frac{w_{mi}}{Z_m} e^{\alpha_m} & G_m(x_i) \neq y_i \\ \end{cases}

where $Z_m$ is a normalization factor that makes the new weights sum to 1: $Z_m = \sum_{i=1}^n w_{mi} \, e^{-\alpha_m y_i G_m(x_i)}$
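A pure-Python sketch of one boosting round following the steps above (names are illustrative; assumes $0 < e_m < 1$):

```python
import math

def adaboost_update(weights, correct):
    """One AdaBoost round: weighted error, classifier weight, re-weighted samples.

    weights: current sample weights (sum to 1)
    correct: booleans, True where the weak learner G_m predicted correctly
    """
    e_m = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error
    alpha = 0.5 * math.log((1 - e_m) / e_m)                    # classifier weight
    # Down-weight correct samples, up-weight mistakes, then normalize by Z_m.
    new_w = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    z = sum(new_w)
    return e_m, alpha, [w / z for w in new_w]

w = [0.25, 0.25, 0.25, 0.25]
e, a, w2 = adaboost_update(w, correct=[True, True, True, False])
# e = 0.25; the single misclassified sample's weight grows to 0.5
```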

Boosting Tree

Core: Use a tree to fit residuals.

The new classifier after each iteration is:

fm+1(x)=fm(x)+αmT(x,Θ)f_{m+1}(x) = f_m(x) + \alpha_mT(x, \Theta)

And the loss function is:

L(y,fm+1(x))=L(y,  fm(x)+αmT(x,Θ))L(y, f_{m+1}(x)) = L(y, \;f_m(x) + \alpha_mT(x, \Theta))

If we use MSE as the loss function, here we have:

L(y,  fm(x)+αmT(x,Θ))=12(yfm(x)αmT)2L(y, \;f_m(x) + \alpha_mT(x, \Theta)) = \frac{1}{2} (y - f_m(x) - \alpha_mT)^2

Where yfm(x)y - f_m(x) is a constant that denotes the difference between the true value and predicted value by last round classifier, which is the residual rr, so now we have:

L(y,  fm(x)+αmT(x,Θ))=12(yfm(x)αmT)2=12(rαmT)2\begin{aligned} L(y, \;f_m(x) + \alpha_mT(x, \Theta)) &= \frac{1}{2} (y - f_m(x) - \alpha_mT)^2 \\ &= \frac{1}{2} (r - \alpha_mT)^2 \end{aligned}

Which means every iteration, we train a tree to fit the residual.
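A minimal illustration of that loop, using a hand-rolled depth-1 regression tree (a stump) as the base learner; all names and the toy data are illustrative:

```python
class Stump:
    """A depth-1 regression tree: one threshold, two leaf means."""
    def fit(self, x, r):
        best = None
        for t in sorted(set(x)):
            left = [ri for xi, ri in zip(x, r) if xi <= t]
            right = [ri for xi, ri in zip(x, r) if xi > t]
            if not left or not right:
                continue  # skip splits that leave a side empty
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
            if best is None or sse < best[0]:
                best = (sse, t, ml, mr)
        _, self.t, self.ml, self.mr = best
        return self

    def predict(self, xi):
        return self.ml if xi <= self.t else self.mr

def boost(x, y, rounds=3):
    """Each round fits a stump to the residuals r = y - f_m(x) (MSE case)."""
    f = [0.0] * len(x)
    for _ in range(rounds):
        r = [yi - fi for yi, fi in zip(y, f)]   # residuals of current ensemble
        tree = Stump().fit(x, r)
        f = [fi + tree.predict(xi) for fi, xi in zip(f, x)]
    return f

x = [1, 2, 3, 4]
y = [1.0, 1.0, 3.0, 3.0]
fitted = boost(x, y)   # residuals shrink toward zero each round
```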

But this residual-fitting view of the boosting tree is valid only when using MSE as the loss function, so we have GBDT instead.

GBDT/Gradient Boosting Decision Tree

Use negative gradient to approximate the residual, that is:

rmi=[L(yi,f(xi))f(xi)]f(x)=fm1(x)r_{mi} = - \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = f_{m-1}(x)}

For MSE, this is the standard residual, for other loss function, this is an approximate for residuals.

Traditional Implementation
Iterate over all the candidate features; for each one, split the data on that feature, then calculate the fitted residuals on all the data, $r_i = - \left[ \frac{\partial L(y_i, f(x_i))}{\partial f_{m - 1}(x_i)} \right]$. Then use the loss function to calculate the loss, and choose the feature with the minimum loss to split the data.

So the time complexity of the traditional implementation is proportional to the number of features times the number of samples.

Regularization

Shrinkage/Learning Rate

fm+1(x)=fm(x)+ναmT(x,Θ)f_{m+1}(x) = f_m(x) + \nu \cdot \alpha_mT(x, \Theta)

where 0<ν<10 < \nu < 1

Using small learning rates (such as ν<0.1\nu < 0.1) yields dramatic improvements in models’ generalization ability over gradient boosting without shrinking (ν=1\nu = 1). However, it comes at the price of increasing computational time both during training and querying: lower learning rate requires more iterations.

Stochastic Gradient Boosting Tree/SGBT
The base models should be trained on bootstrapped datasets. Use $f$ to denote the fraction of the training set sampled for each base model.

  • f=1f=1: same as gradient boosting
  • 0.5f0.80.5 \leq f \leq 0.8: recommended interval
  • f=0.5f=0.5: most common choice

Number of Observations in Leaves
Limiting the minimum number of observations in trees’ terminal nodes.
Penalize Complexity of Tree
An $L_2$ penalty on the leaf values.

XGBoost/eXtreme Gradient Boosting

XGBoost is modified from GBDT: it adds regularization and uses a second-order Taylor expansion to obtain a more precise approximation of the loss function.

The Taylor series of $f(x)$ at $x=a$ is:

f(x) = f(a) + \frac{f'(a)}{1!}(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \dots

Assume the size of samples is nn, then the loss function is:

Loss(y,y^)=i=1nL[yi,fm(xi)]+Ω(Tm)=i=1nL[yi,fm1(xi)+Tm(xi)]+Ω(Tm)\begin{aligned} Loss(y, \hat y) &= \sum_{i = 1}^n {L[y_i, f_m(x_i)] + \Omega(T_m)} \\ &= \sum_{i = 1}^n {L[y_i, f_{m-1}(x_i) + T_m(x_i)] + \Omega(T_m)} \\ \end{aligned}

where $T_m$ is the tree added at iteration $m$, and $\Omega(T_m)$ is its complexity penalty.

To simplify the process, consider a single sample; applying a second-order Taylor expansion at $f_{m-1}(x_i)$, we have:

\begin{aligned} L[y_i, f_{m-1}(x_i) + T_m(x_i)] &= L(y_i, f_{m-1}(x_i)) + \frac{\partial L(y_i, f_{m-1}(x_i))}{\partial f_{m-1}(x_i)} * T_m(x_i) + \frac{1}{2} \frac{\partial^2 L(y_i, f_{m-1}(x_i))}{\partial f_{m-1}^2(x_i)} * T_m^2(x_i) \\ &= L(y_i, f_{m-1}(x_i)) + g(f_{m-1}(x_i)) * T_m(x_i) + \frac{1}{2} h(f_{m-1}(x_i)) * T_m^2(x_i) \\ &= L(y_i, f_{m-1}(x_i)) + g_i * T_m(x_i) + \frac{1}{2} h_i * T_m^2(x_i) \end{aligned}

Substituting this result into $Loss(y, \hat y)$, we now have:

Loss(y,y^)=i=1nL[yi,fm1(xi)+Tm(xi)]+Ω(Tm)=i=1n{L[yi,fm1(xi)]+giTm(xi)+12hiTm2(xi)}+Ω(Tm)=i=1n{L[yi,fm1(xi)]+giTm(xi)+12hiTm2(xi)}+γT+12λj=1Twj2\begin{aligned} Loss(y, \hat y) &= \sum_{i = 1}^n {L[y_i, f_{m-1}(x_i) + T_m(x_i)] + \Omega(T_m)} \\ &= \sum_{i = 1}^n { \{ L[y_i, f_{m-1}(x_i)] + g_i * T_m(x_i) + \frac{1}{2} h_i * T_m^2(x_i) \}+ \Omega(T_m)} \\ &= \sum_{i = 1}^n { \{ L[y_i, f_{m-1}(x_i)] + g_i * T_m(x_i) + \frac{1}{2} h_i * T_m^2(x_i) \}+ \gamma |T| + \frac{1}{2} \lambda \sum_{j = 1}^T | w_j |^2} \end{aligned}

Note:

  1. $T$ is the number of leaves at iteration $m$
  2. $w_j$ is the weight of leaf $j$

Since:

  • $L[y_i, f_{m-1}(x_i)]$ is a constant, and for a fixed tree structure so is $\gamma |T|$, so we can ignore both when solving for the leaf weights
  • $T_m(x_i)$ equals the weight $w_j$ of the leaf that $x_i$ falls into, so the sums over samples can be regrouped by leaves:

i=1ngiTm(x)=j=1Twj(iIjgi)i=1n12hiTm2(x)=12j=1Twj2(iIjhi)\begin{aligned} \sum_{i = 1}^n g_i * T_m(x) &= \sum_{j = 1}^T w_j * (\sum_{i \in I_j}g_i)\\ \sum_{i = 1}^n \frac{1}{2} h_i * T_m^2(x) &= \frac{1}{2} \sum_{j = 1}^T w_j^2 * (\sum_{i \in I_j}h_i) \end{aligned}

where IjI_j denotes all the samples that fall in leaf j

Now we have loss function as:

Loss(y,y^)=i=1n{L[yi,fm1(xi)]+giTm(xi)+12hiTm2(xi)}+γT+12λj=1Twj2=i=1n[giTm(xi)+12hiTm2(xi)]+12λj=1Twj2=j=1T[(iIjgi)wj+12(iIjhi+λ)wj2]\begin{aligned} Loss(y, \hat y) &= \sum_{i = 1}^n { \{ L[y_i, f_{m-1}(x_i)] + g_i * T_m(x_i) + \frac{1}{2} h_i * T_m^2(x_i) \}+ \gamma |T| + \frac{1}{2} \lambda \sum_{j = 1}^T | w_j |^2} \\ &= \sum_{i = 1}^n { [g_i * T_m(x_i) + \frac{1}{2} h_i * T_m^2(x_i)] + \frac{1}{2} \lambda \sum_{j = 1}^T {|w_j|^2}} \\ &= \sum_{j = 1}^T [(\sum_{i \in I_j}g_i)w_j + \frac{1}{2}(\sum_{i \in I_j}h_i + \lambda)w_j^2] \end{aligned}

Take the derivative of the loss function with respect to $w$; then we have:

Lw=g+(h+λ)w\begin{aligned} \frac{\partial L}{\partial {w}} &= \sum{g} + (\sum{h} + \lambda) w \end{aligned}

Setting the derivative to 0, we get the optimal $w$:

w=gh+λ\begin{aligned} w^* = -\frac{\sum{g}}{\sum{h} + \lambda} \end{aligned}

Then the minimum loss (with constants) is:

Loss(y,y^)=12j=1T[(g)2h+λ]+γT\begin{aligned} Loss(y, \hat y) &= - \frac{1}{2} \sum_{j = 1}^T {[\frac{(\sum{g})^2}{\sum{h} + \lambda}] + \gamma |T|} \end{aligned}

From the loss formula we see the loss depends on the structure of the tree ($T$ in the equation). So to find the minimum loss we would have to enumerate all differently structured trees, which is too time-consuming.

Consider a heuristic method: start with a single node, then find the next feature to split the data on. The criterion for selecting a feature is that the loss decreases after splitting on it.
Loss reduction:

Loss Reduction=Lno split(Lleft nodes after spliting+Lright nodes after spliting)=12[(gn)2hn+λ(gl)2hl+λ(gr)2hr+λ]+γ(111)=12[(gl)2hl+λ+(gr)2hr+λ(gn)2hn+λ]γ\begin{aligned} \text{Loss Reduction} &= L_{\text{no split}} - (L_{\text{left nodes after spliting}} + L_{\text{right nodes after spliting}}) \\ &= - \frac{1}{2} [\frac{(\sum {g_n})^2}{\sum {h_n} + \lambda} - \frac{(\sum {g_l})^2}{\sum {h_l} + \lambda} - \frac{(\sum {g_r})^2}{\sum {h_r} + \lambda}] + \gamma (1 - 1 - 1) \\ \\ &= \frac{1}{2} [\frac{(\sum {g_l)^2}}{\sum {h_l} + \lambda} + \frac{(\sum {g_r})^2}{\sum {h_r} + \lambda} - \frac{(\sum {g_n})^2}{\sum {h_n} + \lambda} ] - \gamma \end{aligned}
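A small numeric sketch of the optimal-leaf-weight and split-gain formulas above (helper names are illustrative; with squared loss, $g_i = f(x_i) - y_i$ and $h_i = 1$):

```python
def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight: w* = -sum(g) / (sum(h) + lambda)."""
    return -sum(g) / (sum(h) + lam)

def split_gain(g, h, left_idx, lam=1.0, gamma=0.0):
    """Loss reduction of splitting a node into left/right children."""
    def score(gs, hs):
        return sum(gs) ** 2 / (sum(hs) + lam)
    gl = [g[i] for i in left_idx]
    hl = [h[i] for i in left_idx]
    gr = [g[i] for i in range(len(g)) if i not in left_idx]
    hr = [h[i] for i in range(len(h)) if i not in left_idx]
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(g, h)) - gamma

# Squared loss, current prediction 0 for all samples: g_i = -y_i, h_i = 1.
y = [1.0, 1.0, -1.0, -1.0]
g = [-yi for yi in y]
h = [1.0] * 4
gain = split_gain(g, h, left_idx={0, 1})   # split separating the two groups
```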

Tree generation based on pre-sorted features

If the loss reduction is > 0, this feature should split the data; otherwise we should prune. So each time we pick the split with the maximum loss reduction.

To speed up the splitting, we calculate the loss reduction for each feature first:

Loss Reduction=Lno split(Lleft nodes after spliting+Lright nodes after spliting)\text{Loss Reduction} = L_{\text{no split}} - (L_{\text{left nodes after spliting}} + L_{\text{right nodes after spliting}})

and then sort them from largest to smallest.

Pros: fast speed, could run in parallel.
Cons:

  1. Need to iterate all the data to calculate loss reduction
  2. Need to store the sorted results

Note: even though XGBoost can run in parallel, the parallelism is at the feature level, not the data level.

LightGBM

TODO

Entropy

Entropy

In information theory, entropy is the amount of uncertainty. Assume the probability distribution of a single discrete random variable $x$ is $P(x)$; then the entropy of $x$ is:

H(x)=xP(x)logP(x)H(x) = -\sum_x P(x)\log P(x)

Note: entropy depends only on the probability distribution of a variable; it has nothing to do with the variable's values.
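A direct implementation of the formula (using log base 2, so the unit is bits):

```python
import math

def entropy(probs):
    """H(x) = -sum p * log2(p); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries 1 bit of uncertainty; a certain outcome carries none.
fair = entropy([0.5, 0.5])
sure = entropy([1.0])
```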

Joint Entropy

For two discrete random variables $X$ and $Y$, assume the joint probability distribution is $P(x,y)$; then the joint entropy is:

H(X,Y)=x,yP(x,y)logP(x,y)H(X,Y) = -\sum_{x,y}P(x, y) \log P(x, y)

Properties

  1. Non-negativity
  2. Greater than individual entropies: H(x,y)max[H(x),H(y)]H(x,y) \geq \max[H(x), H(y)]
  3. Less than the sum of individual entropies: H(x,y)H(x)+H(y)H(x,y) \leq H(x) + H(y)

Conditional Entropy

The conditional entropy quantifies the amount of information needed to describe the outcome of a random variable YY given that the value of another random variable XX is known.

H(YX)=xP(x)H(YX=x)=xP(x)yP(yx)logP(yx)=x,yP(x,y)logP(yx)\begin{aligned} H(Y|X) &= \sum_xP(x)H(Y|X = x) \\ &= - \sum_xP(x) \sum_yP(y|x) \log P(y|x) \\ &= - \sum_{x,y}P(x,y) \log P(y|x) \end{aligned}

Properties

  1. $H(Y|X)=0$ if and only if the value of $Y$ is completely determined by the value of $X$.
  2. $H(Y|X) = H(Y)$ if and only if $Y$ and $X$ are independent random variables.
  3. Chain rule for joint entropy: $H(x,y) = H(x) + H(y|x) \;\Leftrightarrow\; H(y|x) = H(x, y) - H(x)$.
    Proof:

H(x,y)=x,yP(x,y)logP(x,y)=x,yP(x,y)log(P(yx)P(x))=x,yP(x,y)logP(yx)x,yP(x,y)logP(x)=H(yx)xyP(x,y)logP(x)=H(yx)xlogP(x)yP(x,y)=H(yx)xlogP(x)P(x)=H(yx)+H(x)\begin{aligned} H(x,y) &= -\sum_{x,y}P(x,y)logP(x,y) \\ &= -\sum_{x,y}P(x,y)log(P(y|x)P(x)) \\ &= -\sum_{x,y}P(x,y)logP(y|x) - \sum_{x,y}P(x,y)logP(x) \\ &= H(y|x) - \sum_{x} \sum_{y}P(x, y)logP(x) \\ &= H(y|x) - \sum_{x} logP(x)\sum_{y}P(x, y) \\ &= H(y|x) - \sum_{x} logP(x)P(x) \\ &= H(y|x) + H(x) \end{aligned}

Cross Entropy

Cross entropy measures the difference between two probability distributions $p$ and $q$ over the same underlying set of events.

H(p,q)=xp(x)logq(x)H(p, q) = -\sum_x p(x) \log q(x)

Difference from joint entropy:

  • Joint entropy is 2 random variables over 1 probability distribution
  • Cross entropy is 1 random variable over 2 different probability distributions p, q

Cross entropy is widely used as a loss function in ML because of its relationship with KL divergence.

KL Divergence/Relative Entropy

KL divergence measures how one probability distribution $p$ differs from a second, reference probability distribution $q$; equivalently, the extra loss you incur when you use $q$ to approximate $p$.

KL(pq)=xp(x)logp(x)q(x)=xp(x)logp(x)xp(x)logq(x)=H(p,q)H(p)\begin{aligned} KL(p||q) &= \sum_x p(x)\log\frac{p(x)}{q(x)} \\ &= \sum_xp(x)\log p(x) - \sum_xp(x)\log q(x) \\ &=H(p,q) - H(p) \end{aligned}

Note:

  1. $KL(p||q) \neq KL(q||p)$
  2. Since KL divergence measures the distance between two probability distributions, its range is $[0, +\infty)$. $KL(p||q)=0$ if and only if $p=q$.
  3. $KL(p||q)$ is convex in the pair of probability measures $(p, q)$.
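A direct implementation over discrete distributions (assumes $q(x) > 0$ wherever $p(x) > 0$), illustrating both the asymmetry and the zero-iff-equal property:

```python
import math

def kl(p, q):
    """KL(p||q) = sum p * log(p/q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl(p, q)   # differs from d_qp below: KL is not symmetric
d_qp = kl(q, p)
```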

Mutual Information/MI/Information Gain

Mutual information is a measure of the mutual dependence between two variables. E.g.: $X$ is the result of a dice roll, and $Y$ is whether $X$ is odd or even; from $Y$ we learn something about $X$.

It’s defined as:

I(X;Y)=x,yp(x,y)logp(x,y)p(x)p(y)=x,yp(x,y)logp(xy)p(x)=x,yp(x,y)logp(yx)p(y)\begin{aligned} I(X;Y) &= \sum_{x,y}p(x,y)\log \frac{p(x,y)}{p(x)p(y)} \\ &= \sum_{x,y}p(x,y)\log \frac{p(x|y)}{p(x)} =\sum_{x,y}p(x,y)\log \frac{p(y|x)}{p(y)} \end{aligned}

Or it could be defined as:

I(X;Y)=EPXY[logPXYPXPY]=DKL(PXYPXPY)\begin{aligned} I(X;Y) &= \mathbb{E}_{P_{XY}}[\log \frac{P_{XY}}{P_XP_Y}] \\ &= D_{KL}(P_{XY} || P_XP_Y) \end{aligned}

It’s also known as the information gain, which is defined as:

Info Gain(Y,X)=H(Y)H(YX)\text{Info Gain}(Y,X) = H(Y) - H(Y|X)

Performance Evaluation

Confusion Matrix

| ground truth \ predictions | positive | negative |
| --- | --- | --- |
| positive | TP (true positive) | FN (false negative) |
| negative | FP (false positive) | TN (true negative) |

Precision

Precision is the fraction of relevant instances among the retrieved instances, which is:

Precision=TPTP+FP\text{Precision} = \frac{TP}{TP + FP}

Recall

Recall is the fraction of relevant instances that were retrieved.

Recall=TPTP+FN\text{Recall} = \frac{TP}{TP + FN}

Accuracy

Accuracy is how close a given set of observations are to their true value, it treats every class equally.

Acc=TP+TNTP+TN+FP+FN\text{Acc} = \frac{TP+TN}{TP + TN + FP + FN}

F1

F1 is the harmonic mean of precision and recall.

F1=2PrecisionRecallPrecision+RecallF1 = 2 * \frac{\text{Precision} * \text{Recall}}{\text{Precision} + \text{Recall}}

FβF_\beta

F1 is the special case of $F_\beta$ with $\beta = 1$; the higher $\beta$ is, the more recall is weighted over precision.

Fβ=(1+β2)PrecisionRecallβ2Precision+RecallF_\beta = (1+\beta^2)\frac{\text{Precision} * \text{Recall}}{\beta^2 * \text{Precision} + \text{Recall}}

F-scores are calculated from predicted classes, not predicted probabilities.

Use it for classification problems.
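These formulas in one small helper (the confusion-matrix counts are made up for illustration):

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

p, r, f1 = f_beta(tp=8, fp=2, fn=8)                # precision 0.8, recall 0.5
_, _, f2 = f_beta(tp=8, fp=2, fn=8, beta=2.0)      # recall-weighted score
# Since recall < precision here, the recall-weighted F2 sits below F1.
```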

Support

TODO

ROC/AUC

A ROC space is defined by FPR(false positive rate) and TPR(true positive rate) as x and y axes, respectively, which depicts relative trade-offs between true positive (benefits) and false positive (costs).

FPR: defines how many incorrect positive results occur among all negative samples available during the test; i.e., the fraction of all negative samples that score above the gate.

FPR=FPFP+TN\text{FPR}=\frac{FP}{FP + TN}

TPR: defines how many correct positive results occur among all positive samples available during the test; i.e., the fraction of all positive samples that score above the gate.

TPR=TPTP+FN\text{TPR}=\frac{TP}{TP + FN}

How to draw ROC curve?
Use the probabilities from the model to rank the samples from large to small, then start with the maximum probability as the gate: all samples at or above the gate are predicted positive, the others negative. Lower the gate step by step, calculate the FPR (x) and TPR (y) at each step, then draw the line.

AUC is the probability that the model ranks a random positive sample above a random negative sample.

AUC

Special positions

  1. $(0,0)$ means classifying all the samples as negative, so FPR = 0 and TPR = 0.
  2. $(1,1)$ means classifying all the samples as positive:

x=FPR=FPFP+TN=# of Negative# of Negative+0=1y=TPR=TPTP+FN=# of Positive# of Positive+0=1\begin{aligned} x &= \text{FPR} = \frac{FP}{FP+TN} = \frac{\text{\# of Negative}}{\text{\# of Negative} + 0} = 1 \\ y &= \text{TPR} = \frac{TP}{TP+FN} = \frac{\text{\# of Positive}}{\text{\# of Positive} + 0} = 1 \\ \end{aligned}

Use the sum of many discrete trapezoids' areas to approximate the AUC. Assume the curve has $N$ points; then the AUC is:

AUC=12i=1N1(yi+yi+1)(xi+1xi)AUC = \frac{1}{2} \sum_{i = 1}^{N - 1}(y_i + y_{i + 1})(x_{i + 1} - x_i)

Use it for ranking problems
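A sketch of the whole procedure: sort by score, lower the gate one sample at a time to trace the (FPR, TPR) points, then sum the trapezoids (assumes no tied scores; names are illustrative):

```python
def auc(scores, labels):
    """ROC AUC via the trapezoid approximation described above."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by score descending; each step moves one sample above the gate.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return sum(0.5 * (y1 + y2) * (x2 - x1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# A perfect ranker puts every positive above every negative.
perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
worst = auc([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])
```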

Advantage:

  1. Don’t need to decide a gate value for positive and negative samples.
  2. Not sensitive to positive and negative samples, only care about the ranking.

Disadvantage:

  1. Ignores the model's ability to fit the actual probabilities. For example, if the model gives all positive samples a probability of 0.51 and all negative samples 0.49, the AUC is 1, but the model is underfitted because the cross entropy loss is huge.
  2. Not a good indicator for precision and recall. Sometimes we care only precision or recall.
  3. Doesn't care about the ranking within the positive samples or within the negative samples.

gAUC/group AUC

@TODO

prAUC

NDCG

Loss Function

Suitable for Classifications

Cross Entropy Loss

Cross entropy uses an estimated distribution $q(x)$ to approximate the true data distribution $p(x)$, that is: $H(p,q)=- \sum_x p(x)\log q(x)$

So according to this, the cross entropy loss function is:

L(y,y^)=yTlogy^L(y, \hat{y}) = -y^T \log\hat{y}

Hinge Loss

TODO

Suitable for Regressions

Mean Square Error/MSE/L2

L(y,y^)=1ni=1n(yiyi^)2L(y, \hat{y}) = \frac{1}{n} * \sum_{i = 1}^n{ (y_i - \hat{y_i} ) ^2}

Mean Absolute Error/MAE/L1

L(y,y^)=1ni=1nyiyi^L(y, \hat{y}) = \frac{1}{n} * \sum_{i = 1}^n{|y_i - \hat{y_i}|}

Huber Loss

Huber Loss

Huber loss (green) and squared error loss (blue)

Huber loss is a combination of L1 loss and L2 loss.

L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2} (y - \hat{y})^2, & |y - \hat{y}| \leq \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2, & |y - \hat{y}| > \delta \end{cases}

  • Advantage: avoids the drawbacks of pure L1 and L2 loss. (Near 0 it behaves like L2 loss; away from 0, like L1 loss.)
  • Disadvantage: $\delta$ needs to be tuned.
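A direct implementation of the standard Huber form, where the $-\frac{1}{2}\delta^2$ offset makes the two pieces meet at $|y - \hat{y}| = \delta$:

```python
def huber(y, y_hat, delta=1.0):
    """Quadratic near zero, linear in the tails."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r * r
    return delta * r - 0.5 * delta * delta

small = huber(1.0, 1.2)   # inside delta: behaves like L2
large = huber(1.0, 4.0)   # outside delta: behaves like (shifted) L1
```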

Suitable for Classification + Regression

0-1 Loss

l(y, \hat{y})= \begin{cases} 0, & \text{if} \; y = \hat{y} \\ 1, & \text{if} \; y \neq \hat{y} \\ \end{cases}

  • Advantage: very straightforward
  • Disadvantage: discrete; the derivative is 0 almost everywhere, thus hard to optimize

Optimization Algorithms

Gradient Descent & its twists

Steps:

  1. Determine the learning rate.
  2. Initialize the parameters.
  3. Choose samples for calculating the gradient of the loss functions.
  4. Update the parameters.

Note: GD guarantees the global optimum for convex functions, and only a local optimum for non-convex functions.

Convex Function

Batch Gradient Descent

Calculate the gradient on all the data, then update the parameters.

Stochastic Gradient Descent/SGD

Calculate the gradient based on 1 single sample, then update the parameters.

Advantage:

  1. Fast; could be used for online learning.

Disadvantage:

  1. The loss function might fluctuate heavily from sample to sample.
  2. Global optimum not guaranteed (could be mitigated by decreasing the learning rate at an appropriate rate).

Mini-batch Gradient Descent

Calculate the gradient on all the data from the batch, then update the parameters.

Usually we use $[50, 256]$ as the range for batch size. Since mini-batch is so common, in most cases the SGD in Python libraries is actually mini-batch.

w:=wαJ(θ)=wαninJ(θ)\begin{aligned} w &:= w - \alpha \nabla J(\theta) \\ &= w - \frac{\alpha}{n} \sum_i^n\nabla J(\theta) \end{aligned}

where: α\alpha is the learning rate, and the batch size is nn

Note:

  1. The learning rate $\alpha$ needs to be tuned manually.
  2. Global optimum not guaranteed.
  3. If the data is sparse, we might want to update the weights more for the infrequent features and less for the frequent ones, which plain mini-batch GD cannot do.

Tips:

  1. Shuffle the training data before using this.
  2. Do batch-normalization for each mini-batch.
  3. Increase learning rate when batch size increased.
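The update rule and tips above can be sketched as a tiny pure-Python loop that fits $y = wx$ under MSE (the toy data, learning rate, and function names are all illustrative):

```python
import random

def minibatch_gd(xs, ys, lr=0.02, batch_size=2, epochs=200, seed=0):
    """Fit y = w*x by mini-batch gradient descent on MSE."""
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                       # shuffle each epoch (tip 1)
        for s in range(0, len(idx), batch_size):
            batch = idx[s:s + batch_size]
            # Gradient of (1/n) * sum (w*x - y)^2 over the batch.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # true relation: y = 2x
w = minibatch_gd(xs, ys)
```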

TODO

Logistic Regression

0-1 Binary Distribution

For logistic regression, the logarithm of the odds of an event is linear in the independent variables, that is:

lnp(y=1)p(y=0)=lnp(y=1)1p(y=1)=lnp1p=wx  p=ewx1+ewx=11+ewx=σ(wx)\begin{aligned} & \ln \frac{p(y=1)}{p(y=0)} = \ln \frac{p(y=1)}{1 - p(y=1)} = \ln \frac{p}{1 - p} = wx \\ \Rightarrow \; & p = \frac{e^{wx}}{1 + e^{wx}} = \frac{1}{1 + e^{-wx}} = \sigma(wx) \end{aligned}

Multiple Value Distribution

Similar to the 0-1 distribution, if the value of $Y$ ranges over $0 \sim K$, then:

P(Y = k \mid x) = \frac{e^{w_kx}}{1 + e^{w_1x} + e^{w_2x} + \dots + e^{w_Kx}} = \frac{e^{w_kx}}{1 + \sum_{k'=1}^K e^{w_{k'}x}}

It's essentially the softmax function.

Loss Function

Use log loss here (why? See the Questions section below).

Log loss:

L(θ)=(x,y)Dylog(y)(1y)log(1y)L(\theta) = \sum_{(x,y) \in D} -y \log(y') - (1 - y) \log(1 - y')

where:

  1. $y$ is the true label
  2. $y'$ is the predicted probability (between 0 and 1)
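A direct implementation of the log loss, with clipping to avoid $\log(0)$ (the clipping epsilon is an arbitrary choice):

```python
import math

def log_loss(ys, ps, eps=1e-12):
    """Average cross-entropy between labels y and predicted probabilities p."""
    total = 0.0
    for y, p in zip(ys, ps):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / len(ys)

confident_right = log_loss([1, 0], [0.9, 0.1])   # small loss
confident_wrong = log_loss([1, 0], [0.1, 0.9])   # heavily penalized
```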

Maximum Entropy Model

This section is based on this paper: A Maximum Entropy Approach to Natural Language Processing

The principle of maximum entropy is that the best model has the maximum conditional entropy. (Entropy stands for the amount of uncertainty, the more entropy is, the more information the model contains)

See more about entropy here.

Here's a constraint for the maximum entropy model: the statistical distribution of the samples (namely $\tilde{p}(x,y)$) should be consistent with the distribution of the model (namely $p(y|x)$). That means the expectation under the model should equal the expectation under the samples, that is:

x,yP~(x,y)f(x,y)=x,yP~(x)P(yx)f(x,y)\sum_{x,y} \tilde{P}(x,y)f(x,y) = \sum_{x,y}\tilde{P}(x)P(y|x)f(x,y)

where:

  1. P~(x,y),P~(x)\tilde{P}(x,y), \tilde{P}(x) are the probability for (x,y) and x in sample
  2. p(yx)p(y|x) is the model
  3. f(x,y)f(x,y) is a feature function where

f(x,y)={1,  (x,y) exists in samples0  otherwise f(x,y)= \begin{cases} 1, & \; (x,y) \text{ exists in samples} \\ 0 & \; otherwise \\ \end{cases}

Objective
Maximize the conditional entropy, that is to maximize H(YX)=(x,y)P~(x)P(yx)logP(yx)H(Y|X)=-\sum_{(x,y)}\tilde{P}(x)P(y|x)\log P(y|x)

Parameters Estimation
Optimize the objective under the constrains, we have:

\begin{aligned} \max \;\;\; &H(Y|X) = -\sum_{x,y} \tilde{P}(x)P(y|x)\log P(y|x) \\ s.t. \;\;\; & \sum_{x,y}\tilde{P}(x)P(y|x)f(x,y) = \sum_{x,y}\tilde{P}(x,y)f(x,y) \\ & \sum_{y} P(y|x) = 1 \end{aligned}

Introducing Lagrange multipliers, we convert the constrained optimization into an unconstrained one. The Lagrangian function is:

L(P,ω)=H(P)+ω0[1x,yP(yx)]+i=1nωi[x,yP~(x)P(yx)f(x,y)x,yP~(x,y)f(x,y)]L(P, \omega ) = -H(P) + \omega_0[1-\sum_{x,y} P(y|x)] + \sum_{i=1}^n \omega_i[\sum_{x,y}\tilde{P}(x)P(y|x)f(x,y) - \sum_{x,y}\tilde{P}(x,y)f(x,y)]

Then we solve the resulting min-max problem of the unconstrained optimization.

We now have:

Pω(yx)=1Zωexp(i=1nωifi(x,y))Zω=yexp(i=1nωifi(x,y))\begin{aligned} P_\omega(y|x) &= \frac{1}{Z_\omega} exp \big(\sum_{i=1}^n \omega_if_i(x,y) \big) \\ Z_\omega &= \sum_y exp \big( \sum_{i=1}^n \omega_if_i(x,y)\big) \end{aligned}

It's very similar to LR; the main difference lies in the number of classes: LR has only 2 classes while MEM has multiple classes.

Maximum Likelihood Estimator

TODO

Support Vector Machine/SVM

The idea behind SVM is to find a boundary between the positive samples and the negative samples, choosing the boundary with the largest distance to the samples.

Linear data, hard-margin

When we can find such a boundary, we have a hard margin (the data is linearly separable).

Find a hyperplane such that the minimum distance from any point to the hyperplane is maximized, that is:

max    dmins.t.    dxdmin\begin{aligned} max \;& \; d_{min} \\ s.t. \; & \; d_x \geq d_{min} \end{aligned}

Note: the distance from a point $x$ to the plane $wx + b = 0$ is: $d=\frac{|wx+b|}{|w|}$

Use $x_{min}$ to denote the point with the minimum distance; then we have:

max    wxmin+bws.t.    wxi+bwwxmin+bw,xiX\begin{aligned} max \;& \; \frac{w * x_{min} + b}{|w|} \\ s.t. \; & \; \frac{|wx_i + b|}{|w|} \geq \frac{w * x_{min} + b}{|w|}, \forall x_i \in X \end{aligned}

Since $w x_{min} + b$ is a constant, we can rescale $w$ and $b$ so that it equals $1$. Multiplying both sides of the constraint by $|w|$, we now have:

max    1ws.t.    wxi+b1,xiX\begin{aligned} max \;& \; \frac{1}{|w|} \\ s.t. \; & \; |wx_i + b| \geq 1, \forall x_i \in X \end{aligned}

Use minimize instead of maximize, we have:

max  1wmin  wmin  12w2max \; \frac{1}{|w|} \Leftrightarrow min \; |w| \Leftrightarrow min \; \frac{1}{2} w^2

To get rid of the absolute value in the constraint, we multiply it by the true label $y_i$; now we have:

min    12w2s.t.    yi(wxi+b)1,xiX\begin{aligned} min \;& \; \frac{1}{2} w^2 \\ s.t. \; & \; y_i(wx_i + b) \geq 1, \forall x_i \in X \end{aligned}

This is a convex optimization problem; use Lagrange multipliers. The Lagrangian function is:

L(w,b,α)=12w2iNαi[yi(wxi+b)1]L(w, b, \alpha) = \frac{1}{2} w^2 - \sum_i^N{\alpha_i [y_i(w x_i + b) - 1]}

Then we minimize over $w, b$ and maximize over $\alpha$ (solving the dual problem).

Note: the max-margin hyperplane is completely determined by those xi\mathbf {x} _{i} that lie nearest to it. These xi\mathbf {x} _{i} are called support vectors.

Linear data, soft-margin

When we cannot find a perfectly separating boundary, we use a soft margin (the data is approximately linear).

When there are some outliers in the dataset, and after eliminating them we could still find the boundary, it's a soft-margin SVM.

To achieve this, we modify the constraints by adding some slack: $y_i(wx_i + b) \geq 1 - \zeta_i$. At the same time, the slack variables shouldn't be too large, so the objective function becomes:

min    12w2+CiNζis.t.    yi(wxi+b)1ζi\begin{aligned} min \;& \; \frac{1}{2} w^2 + C \sum_i^N {\zeta_i} \\ s.t. \; & \; y_i(wx_i + b) \geq 1 - \zeta_i \end{aligned}

Viewed another way, this is equivalent to saying that SVM uses hinge loss.

[@TODO]

Non-linear data

When the data is not linear, we can project it into another feature space, using a non-linear kernel function to replace every dot product in the low-dimensional space.

Use $\phi(x): X \to H$ as the projection function that maps $x$ from the low-dimensional space $X$ to the high-dimensional space $H$; the kernel function is defined as:

K(x,z)=ϕ(x)ϕ(z)K(x, z) = \phi(x) \cdot \phi(z)

where xx and zz are two variables in the low dimension space.

Some common kernel functions:

  1. Polynomial: K(x,z)=(xz+1)pK(x,z) = (x \cdot z + 1) ^p
  2. Gaussian radial basis function: K(x,z)=exz22σ2K(x,z) = e^{- \frac{|x - z|^2}{2\sigma^2}}
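These two kernels translate directly into code (the parameter defaults are arbitrary illustrative choices):

```python
import math

def poly_kernel(x, z, p=2):
    """Polynomial kernel: K(x, z) = (x . z + 1)^p."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** p

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel: K(x, z) = exp(-|x - z|^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2 * sigma ** 2))

k1 = poly_kernel([1.0, 2.0], [3.0, 4.0])   # (1*3 + 2*4 + 1)^2
k2 = rbf_kernel([0.0], [0.0])              # identical points give 1.0
```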

Questions

Why use log loss for logistic regression?

When trying to maximize the whole probability of the series, the product of the probabilities of the events would be:

X=i=1npyi(1p)1yiX = \prod_{i = 1}^n p^{y_i}(1-p)^{1 - y_i}

where:

  1. $p$ is the probability of $y=1$
  2. $y_i$ is the label of sample $i$: 1 if the event happened, 0 otherwise

Take the logarithm of the product to make optimization easier; then we have:

logX=i=1nyilogp+(1yi)log(1p)\log X = \sum_{i = 1}^n y_i \log p+(1-y_i)\log (1-p)

Maximizing $\log X$ is equivalent to minimizing the log loss $L(\theta)$.