Linear Models

Logistic Regression

0-1 Binary Distribution

For logistic regression, the logarithm of the odds of an event (the ratio between $p(y=1)$ and $p(y=0)$) is a linear function of the independent variables, that is:

$$
\begin{aligned}
& \ln \frac{p(y=1)}{p(y=0)} = \ln \frac{p(y=1)}{1 - p(y=1)} = \ln \frac{p}{1 - p} = wx \\
\Rightarrow \; & p = \frac{e^{wx}}{1 + e^{wx}} = \frac{1}{1 + e^{-wx}} = \sigma(wx)
\end{aligned}
$$
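As a minimal sketch (not part of the derivation above), this prediction rule could look like the following in NumPy, assuming the bias is folded into $w$ via a leading constant feature and the weights are made up:

```python
import numpy as np

def sigmoid(z):
    # logistic function: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    # p(y=1 | x) = sigma(w . x)
    return sigmoid(np.dot(w, x))

w = np.array([0.5, -1.2, 2.0])   # hypothetical weights (first entry acts as the bias)
x = np.array([1.0, 0.3, 0.8])    # one sample with a leading constant feature of 1
print(predict_proba(w, x))       # probability that y = 1
```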

Multiple Value Distribution

Similar to the 0-1 distribution, if $y$ takes values in $0 \sim K$, then for $k = 1, \dots, K$ (class $0$ is the baseline):

$$
P(y = k \mid x) = \frac{e^{w_k x}}{1 + e^{w_1 x} + e^{w_2 x} + \dots + e^{w_K x}} = \frac{e^{w_k x}}{1 + \sum_{k'=1}^K e^{w_{k'} x}}
$$

This is essentially the softmax function, with class 0 as the baseline whose score is fixed to 0.
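As a rough sketch of the formula above (assuming class 0 is the baseline and using made-up weights), the class probabilities could be computed like this:

```python
import numpy as np

def softmax_with_baseline(W, x):
    """P(y=k|x) for k = 0..K, where class 0 is the baseline with score fixed to 0.

    W has shape (K, d): one weight vector w_k per non-baseline class."""
    scores = np.concatenate(([0.0], W @ x))   # [0, w_1.x, ..., w_K.x]
    scores -= scores.max()                    # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()      # probabilities sum to 1

W = np.array([[0.2, -0.5], [1.0, 0.3]])  # hypothetical weights, K = 2 non-baseline classes
x = np.array([0.4, 1.5])
print(softmax_with_baseline(W, x))        # P(y=0), P(y=1), P(y=2)
```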

Loss Function

Use log loss here (see the Questions section below for why).

Log loss:

$$
L(\theta) = \sum_{(x,y) \in D} -y \log(y') - (1 - y) \log(1 - y')
$$

where:

  1. y is the true label (0 or 1)
  2. y' is the predicted probability of y=1 (between 0 and 1); see the sketch below
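A small sketch of this loss, clipping predictions away from 0 and 1 to avoid log(0); the sample values are made up:

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    """Sum over the dataset of -y*log(y') - (1-y)*log(1-y')."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return np.sum(-y_true * np.log(y_pred) - (1.0 - y_true) * np.log(1.0 - y_pred))

y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.6, 0.99])   # hypothetical predicted probabilities
print(log_loss(y, y_hat))
```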

Maximum Entropy Model

This section is based on this paper: A Maximum Entropy Approach to Natural Language Processing

The principle of maximum entropy is that, among all models consistent with the constraints, the best model is the one with the maximum conditional entropy. (Entropy measures the amount of uncertainty; the higher the entropy, the fewer assumptions the model makes beyond what the data supports.)

See more about entropy here.

Here’s a constraint for the maximum entropy model: the empirical distribution of the sample (namely, $\tilde{p}(x,y)$) should be consistent with the distribution of the model (namely, $p(y|x)$). That means the expectation under the model should equal the expectation under the samples, that is:

$$
\sum_{x,y} \tilde{P}(x,y)f(x,y) = \sum_{x,y}\tilde{P}(x)P(y|x)f(x,y)
$$

where:

  1. $\tilde{P}(x,y)$ and $\tilde{P}(x)$ are the empirical probabilities of $(x,y)$ and $x$ in the sample
  2. $P(y|x)$ is the model
  3. $f(x,y)$ is a feature function where

$$
f(x,y)=
\begin{cases}
1, & (x,y) \text{ exists in the samples} \\
0, & \text{otherwise}
\end{cases}
$$
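Below is a toy sketch (with invented sample values and a uniform stand-in model) of how the two expectations in the constraint could be computed; it is only meant to illustrate the definitions above:

```python
from collections import Counter

# toy sample of (x, y) pairs; the values are purely illustrative
samples = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"), ("sunny", "stay")]
n = len(samples)
observed = set(samples)

def f(x, y):
    # binary feature: 1 if the pair (x, y) appears in the samples
    return 1.0 if (x, y) in observed else 0.0

# empirical distributions P~(x, y) and P~(x)
p_xy = {pair: c / n for pair, c in Counter(samples).items()}
p_x = {x: c / n for x, c in Counter(x for x, _ in samples).items()}

# a stand-in model P(y|x); here simply uniform over the observed labels
labels = {y for _, y in samples}
def p_model(y, x):
    return 1.0 / len(labels)

# left side: expectation of f under the empirical joint distribution P~(x, y)
lhs = sum(p_xy[(x, y)] * f(x, y) for (x, y) in p_xy)
# right side: expectation of f under P~(x) * P(y|x)
rhs = sum(p_x[x] * p_model(y, x) * f(x, y) for x in p_x for y in labels)

print(lhs, rhs)  # the constraint requires a fitted model to make these two equal
```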

Objective
Maximize the conditional entropy, that is, maximize

$$
H(Y|X)=-\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x)
$$

Parameter Estimation
Optimizing the objective under the constraints, we have:

$$
\begin{aligned}
\max \;\;\; & H(Y|X) = -\sum_{x,y} \tilde{P}(x)P(y|x)\log P(y|x) \\
s.t. \;\;\; & \sum_{x,y}\tilde{P}(x)P(y|x)f_i(x,y) = \sum_{x,y}\tilde{P}(x,y)f_i(x,y), \quad i = 1, \dots, n \\
& \sum_{y} P(y|x) = 1
\end{aligned}
$$

Introducing Lagrange multipliers, we convert the constrained optimization into an unconstrained one. The Lagrangian is:

$$
L(P, \omega) = -H(P) + \omega_0 \Big[ 1-\sum_{y} P(y|x) \Big] + \sum_{i=1}^n \omega_i \Big[ \sum_{x,y}\tilde{P}(x)P(y|x)f_i(x,y) - \sum_{x,y}\tilde{P}(x,y)f_i(x,y) \Big]
$$

Then we solve the dual problem: first minimize $L(P, \omega)$ over $P$ (for fixed $\omega$), then maximize the result over $\omega$.

We now have:

$$
\begin{aligned}
P_\omega(y|x) &= \frac{1}{Z_\omega} \exp \Big( \sum_{i=1}^n \omega_i f_i(x,y) \Big) \\
Z_\omega &= \sum_y \exp \Big( \sum_{i=1}^n \omega_i f_i(x,y) \Big)
\end{aligned}
$$

It’s very similar to LR; the main difference lies in the number of classes: LR handles only 2 classes, while the maximum entropy model handles multiple classes.
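A minimal sketch of evaluating $P_\omega(y|x)$ for a fitted model; the feature functions, weights, and labels below are all hypothetical:

```python
import numpy as np

def maxent_proba(omega, feature_fns, x, labels):
    """P_w(y|x) = exp(sum_i w_i f_i(x, y)) / Z_w over the candidate labels."""
    scores = np.array([sum(w * f(x, y) for w, f in zip(omega, feature_fns)) for y in labels])
    exp_scores = np.exp(scores - scores.max())   # shift by max for numerical stability
    return exp_scores / exp_scores.sum()         # dividing by Z_w normalizes to 1

# two hypothetical indicator features
f1 = lambda x, y: 1.0 if ("sunny" in x and y == "play") else 0.0
f2 = lambda x, y: 1.0 if ("rainy" in x and y == "stay") else 0.0

omega = [1.5, 2.0]   # hypothetical learned weights
print(maxent_proba(omega, [f1, f2], "sunny afternoon", ["play", "stay"]))
```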

Maximum Likelihood Estimator

TODO

Support Vector Machine/SVM

The idea behind SVM is to find a boundary between the positive samples and the negative samples, and it tries to find the boundary that has the largest distance to the nearest samples.

Linear data, hard-margin

When we could find the boundary, we have a hard margin (the data is linearly separable).

Find a hyperplane such that the minimum distance from any point to the hyperplane is maximized, that is:

$$
\begin{aligned}
\max \; & \; d_{min} \\
s.t. \; & \; d_x \geq d_{min}
\end{aligned}
$$

Note: the distance from a point $x$ to the hyperplane $wx + b = 0$ is $d=\frac{|wx+b|}{|w|}$
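A quick numeric check of this distance formula with made-up numbers:

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    # d = |w.x + b| / ||w||
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # hypothetical hyperplane w.x + b = 0
b = -5.0
x = np.array([2.0, 1.0])
print(distance_to_hyperplane(w, b, x))   # |3*2 + 4*1 - 5| / 5 = 1.0
```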

Use $x_{min}$ to denote the point that has the minimum distance; then we have:

$$
\begin{aligned}
\max \; & \; \frac{w x_{min} + b}{|w|} \\
s.t. \; & \; \frac{|w x_i + b|}{|w|} \geq \frac{w x_{min} + b}{|w|}, \; \forall x_i \in X
\end{aligned}
$$

Since $w x_{min} + b$ is a constant (we can rescale $w$ and $b$ so that it equals 1), we replace it with $1$. Multiplying both sides of the constraint by $|w|$, we now have:

$$
\begin{aligned}
\max \; & \; \frac{1}{|w|} \\
s.t. \; & \; |w x_i + b| \geq 1, \; \forall x_i \in X
\end{aligned}
$$

Using minimization instead of maximization, we have:

$$
\max \; \frac{1}{|w|} \Leftrightarrow \min \; |w| \Leftrightarrow \min \; \frac{1}{2} |w|^2
$$

To get rid of the absolute value in the constraint, we multiply by the true label $y_i \in \{-1, +1\}$; now we have:

$$
\begin{aligned}
\min \; & \; \frac{1}{2} |w|^2 \\
s.t. \; & \; y_i(w x_i + b) \geq 1, \; \forall x_i \in X
\end{aligned}
$$

This is a convex optimization problem; use Lagrange multipliers. The Lagrangian is:

$$
L(w, b, \alpha) = \frac{1}{2} |w|^2 - \sum_i^N \alpha_i \big[ y_i(w x_i + b) - 1 \big]
$$

Then we solve the dual problem: minimize over $w, b$ first, and then maximize over $\alpha$.

Note: the max-margin hyperplane is completely determined by those $x_i$ that lie nearest to it. These $x_i$ are called support vectors.

Linear data, soft-margin

When we couldn’t find a perfect boundary but the data is still roughly linear, we use a soft margin.

When there are some outliers in the dataset, and eliminating these outliers would make the data separable again, we have a soft-margin SVM.

To achieve this, we modify the constraints and add some slack, that is: $y_i(wx_i + b) \geq 1 - \zeta_i$. At the same time, the slack variables shouldn’t be too large, so the objective function becomes:

$$
\begin{aligned}
\min \; & \; \frac{1}{2} |w|^2 + C \sum_i^N \zeta_i \\
s.t. \; & \; y_i(w x_i + b) \geq 1 - \zeta_i, \; \zeta_i \geq 0
\end{aligned}
$$

In other words, this can be interpreted as SVM using the hinge loss here, as sketched below.
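As a rough sketch of this hinge-loss view (not the standard dual/SMO solver), the soft-margin objective $\frac{1}{2}|w|^2 + C\sum_i \max(0, 1 - y_i(wx_i+b))$ can be minimized by subgradient descent; the toy data and hyperparameters below are made up:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                                  # samples violating the margin
        grad_w = w - C * (y[active].reshape(-1, 1) * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))   # should reproduce y
```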

[@TODO]

Non-linear data

When the data is not linear, we can implicitly project the data into another feature space by using a non-linear kernel function to replace every dot product in the low-dimensional space.

Use $\phi(x): X \to H$ as the projection function that maps $x$ from the low-dimensional space $X$ to the high-dimensional space $H$; the kernel function is defined as below:

$$
K(x, z) = \phi(x) \cdot \phi(z)
$$

where $x$ and $z$ are two points in the low-dimensional space.

Some common kernel functions:

  1. Polynomial: $K(x,z) = (x \cdot z + 1)^p$
  2. Gaussian radial basis function: $K(x,z) = e^{- \frac{|x - z|^2}{2\sigma^2}}$ (see the sketch below)
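A short sketch of these two kernels and a Gram matrix, with made-up points:

```python
import numpy as np

def polynomial_kernel(x, z, p=2):
    # K(x, z) = (x . z + 1)^p
    return (np.dot(x, z) + 1.0) ** p

def rbf_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z))

# kernel (Gram) matrix for a small dataset
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
print(K)
```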

Questions

Why use log loss for logistic regression?

When trying to maximize the probability of the whole dataset, the product of the probabilities of the individual events (the likelihood) is:

$$
X = \prod_{i = 1}^n p_i^{y_i}(1-p_i)^{1 - y_i}
$$

where:

  1. $p_i = \sigma(wx_i)$ is the predicted probability that $y_i = 1$
  2. $y_i$ is the label of the $i$-th sample (1 if the event occurs, 0 otherwise)

Take the logarithm of the product for easier optimization; then we have:

$$
\log X = \sum_{i = 1}^n y_i \log p_i + (1-y_i)\log (1-p_i)
$$

Maximizing $X$ is equivalent to minimizing the log loss $L(\theta)$, since $L(\theta) = -\log X$.
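As a quick numeric sanity check (with made-up labels and predicted probabilities), the negative of this log-likelihood is exactly the log loss defined in the Loss Function section:

```python
import numpy as np

y = np.array([1, 0, 1])
p = np.array([0.8, 0.3, 0.9])   # hypothetical predicted probabilities p_i

log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
log_loss = np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p))
print(log_likelihood, log_loss)   # the two differ only by sign
```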