Logistic Regression
0-1 Binary Distribution
For logistic regression, the logarithm of the odds (the ratio of the probability of an event to the probability of its complement) is a linear function of the independent variables, that is:
$$\ln\frac{p(y=1)}{p(y=0)} = \ln\frac{p(y=1)}{1-p(y=1)} = \ln\frac{p}{1-p} = wx \;\Rightarrow\; p = \frac{e^{wx}}{1+e^{wx}} = \frac{1}{1+e^{-wx}} = \sigma(wx)$$
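A minimal NumPy sketch of this mapping (the weights `w` and input `x` below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    # p(y=1 | x) = sigma(w . x)
    return sigmoid(np.dot(w, x))

w = np.array([0.5, -1.0])   # made-up weights
x = np.array([2.0, 1.0])    # made-up input
print(predict_proba(w, x))  # sigma(0.5*2 - 1.0*1) = sigma(0) = 0.5
```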
Multiple Value Distribution
Similar to the 0-1 distribution, if $Y$ takes values in $\{0, 1, \dots, K\}$, then:
$$P(Y=k \mid x) = \frac{e^{w_k x}}{1 + e^{w_1 x} + e^{w_2 x} + \dots + e^{w_K x}} = \frac{e^{w_k x}}{1 + \sum_{k'=1}^{K} e^{w_{k'} x}}$$
It’s actually the softmax function.
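A quick sketch of softmax (the scores below are made up; class 0 gets an implicit score of 0, matching the 1 in the denominator above):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the result sums to 1.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

# Scores w_k . x for classes 1..K, with an implicit score of 0 for class 0
scores = np.array([0.0, 1.2, -0.3, 0.7])
print(softmax(scores))  # probabilities for classes 0..K, summing to 1
```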
Loss Function
Use the log loss here (see the Questions section at the end for why).
Log loss:
$$L(\theta) = \sum_{(x,y) \in D} -y\log(y') - (1-y)\log(1-y')$$
where:
- $y$ is the label
- $y'$ is the predicted probability (between 0 and 1)
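A minimal sketch of this loss in NumPy (the labels and predictions below are made up; the clipping is only there to avoid $\log 0$):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    # L = sum_i -y_i*log(y'_i) - (1-y_i)*log(1-y'_i)
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return np.sum(-y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1])
y_pred = np.array([0.9, 0.2, 0.6])
print(log_loss(y_true, y_pred))
```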
Maximum Entropy Model
This section is based on this paper: A Maximum Entropy Approach to Natural Language Processing
The principle of maximum entropy is that, among all models satisfying the constraints, the best one has the maximum conditional entropy. (Entropy measures the amount of uncertainty; the higher the entropy, the fewer extra assumptions the model encodes beyond the constraints.)
See more about entropy here.
Here’s the constraint for the maximum entropy model: the statistical distribution of the samples (namely $\tilde{P}(x,y)$) should be consistent with the distribution of the model (namely $P(y \mid x)$). That means the feature expectation under the model should match the feature expectation under the samples, that is:
$$\sum_{x,y}\tilde{P}(x,y)f(x,y) = \sum_{x,y}\tilde{P}(x)P(y \mid x)f(x,y)$$
where:
- $\tilde{P}(x,y)$ and $\tilde{P}(x)$ are the empirical probabilities of $(x,y)$ and $x$ in the samples
- $P(y \mid x)$ is the model
- $f(x,y)$ is a feature function where
$$f(x,y) = \begin{cases} 1, & (x,y)\ \text{appears in the samples} \\ 0, & \text{otherwise} \end{cases}$$
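To make the constraint concrete, here is a toy sketch that computes both sides of it; the samples, the feature, and the model probabilities are all made-up assumptions:

```python
# Hypothetical toy samples of (x, y) pairs
samples = [("sunny", "out"), ("sunny", "out"), ("rain", "in"), ("rain", "out")]

def f(x, y):
    # Indicator feature: 1 if the pair is ("sunny", "out"), else 0
    return 1.0 if (x, y) == ("sunny", "out") else 0.0

# Empirical expectation: sum_{x,y} p~(x,y) f(x,y)
empirical = sum(f(x, y) for x, y in samples) / len(samples)
print(empirical)  # 0.5

# Model expectation: sum_{x,y} p~(x) p(y|x) f(x,y), with a made-up model p(y|x)
p_tilde_x = {"sunny": 0.5, "rain": 0.5}
p_model = {("sunny", "out"): 0.8, ("sunny", "in"): 0.2,
           ("rain", "out"): 0.4, ("rain", "in"): 0.6}
model = sum(p_tilde_x[x] * p_model[(x, y)] * f(x, y)
            for x in p_tilde_x for y in ("out", "in"))
print(model)  # 0.4 here; the constraint requires the two expectations to match
```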
Objective
Maximize the conditional entropy, that is, maximize
$$H(Y \mid X) = -\sum_{x,y}\tilde{P}(x)P(y \mid x)\log P(y \mid x)$$
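A small sketch of this quantity on made-up distributions ($\tilde{P}(x)$ and $P(y \mid x)$ below are assumptions):

```python
import numpy as np

p_tilde_x = {"a": 0.5, "b": 0.5}                # made-up p~(x)
p_y_given_x = {"a": {0: 0.9, 1: 0.1},           # made-up p(y|x)
               "b": {0: 0.5, 1: 0.5}}

# H(Y|X) = - sum_{x,y} p~(x) p(y|x) log p(y|x)
h = -sum(p_tilde_x[x] * p * np.log(p)
         for x, dist in p_y_given_x.items() for p in dist.values())
print(h)  # the uniform conditional at x="b" contributes most of the entropy
```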
Parameters Estimation
Optimizing the objective under the constraints, we have:
$$\begin{aligned}
\max \quad & H(Y \mid X) = -\sum_{x,y}\tilde{P}(x)P(y \mid x)\log P(y \mid x) \\
\text{s.t.} \quad & \sum_{x,y}\tilde{P}(x)P(y \mid x)f(x,y) = \sum_{x,y}\tilde{P}(x,y)f(x,y) \\
& \sum_{y}P(y \mid x) = 1
\end{aligned}$$
Introducing Lagrange multipliers, we convert the constrained optimization into an unconstrained one. The Lagrangian is:
$$L(P,\omega) = -H(P) + \omega_0\left[1 - \sum_{y}P(y \mid x)\right] + \sum_{i=1}^{n}\omega_i\left[\sum_{x,y}\tilde{P}(x)P(y \mid x)f_i(x,y) - \sum_{x,y}\tilde{P}(x,y)f_i(x,y)\right]$$
We then solve the dual problem: first minimize $L(P,\omega)$ over $P$ (with $\omega$ fixed), then maximize the result over $\omega$. Solving the inner minimization gives:
$$\begin{aligned}
P_\omega(y \mid x) &= \frac{1}{Z_\omega}\exp\left(\sum_{i=1}^{n}\omega_i f_i(x,y)\right) \\
Z_\omega &= \sum_{y}\exp\left(\sum_{i=1}^{n}\omega_i f_i(x,y)\right)
\end{aligned}$$
This is very similar to LR; the only difference lies in the number of classes: LR has only 2 classes, while MEM handles multiple classes.
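A sketch of evaluating $P_\omega(y \mid x)$ for one $x$; the weights and per-class feature vectors are made up, and the max-subtraction is only for numerical stability:

```python
import numpy as np

def maxent_proba(weights, feature_vectors):
    # P_w(y|x) = exp(sum_i w_i f_i(x,y)) / Z_w, for each candidate y.
    # feature_vectors[y] holds (f_1(x,y), ..., f_n(x,y)) for a fixed x.
    scores = {y: np.dot(weights, f) for y, f in feature_vectors.items()}
    max_score = max(scores.values())                 # numerical stability
    exps = {y: np.exp(s - max_score) for y, s in scores.items()}
    z = sum(exps.values())                           # Z_w
    return {y: v / z for y, v in exps.items()}

weights = np.array([1.0, -0.5])                      # made-up weights
feature_vectors = {"out": np.array([1.0, 0.0]),      # made-up features
                   "in":  np.array([0.0, 1.0])}
print(maxent_proba(weights, feature_vectors))
```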
Maximum Likelihood Estimator
TODO
Support Vector Machine/SVM
The idea behind SVM is to find a boundary between the positive samples and the negative samples, specifically the boundary with the largest distance to the nearest samples.
Linear data, hard-margin
When we can find such a boundary exactly, we have a hard margin (the data is linearly separable).
Find a hyperplane such that the minimum distance from any point to the hyperplane is maximized, that is:
$$\max \; d_{\min} \quad \text{s.t.} \quad d_x \ge d_{\min}$$
Note: the distance from a point $x$ to the hyperplane $wx + b = 0$ is: $d = \frac{|wx+b|}{\|w\|}$
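A tiny sketch of this distance formula (the point and hyperplane below are made up):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    # d = |w . x + b| / ||w||
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

print(distance_to_hyperplane(np.array([1.0, 1.0]),
                             np.array([3.0, 4.0]), -2.0))  # |3 + 4 - 2| / 5 = 1.0
```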
Use $x_{\min}$ to denote the point that has the minimum distance; then we have:
$$\max \; \frac{w x_{\min} + b}{\|w\|} \quad \text{s.t.} \quad \frac{|w x_i + b|}{\|w\|} \ge \frac{w x_{\min} + b}{\|w\|}, \ \forall x_i \in X$$
Since $w x_{\min} + b$ is a constant, we can rescale $w$ and $b$ so that it equals 1. Multiplying both sides of the constraint by $\|w\|$, we now have:
$$\max \; \frac{1}{\|w\|} \quad \text{s.t.} \quad |w x_i + b| \ge 1, \ \forall x_i \in X$$
Rewriting the maximization as a minimization, we have:
$$\max \frac{1}{\|w\|} \;\Leftrightarrow\; \min \|w\| \;\Leftrightarrow\; \min \frac{1}{2}\|w\|^2$$
To get rid of the absolute value in the constraint, we multiply by the true label $y_i \in \{-1, +1\}$, which gives:
$$\min \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w x_i + b) \ge 1, \ \forall x_i \in X$$
This is a convex optimization problem, so we use Lagrange multipliers. The Lagrangian is:
$$L(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N}\alpha_i\left[y_i(w x_i + b) - 1\right]$$
Then we solve the dual problem: first minimize over $w, b$ (with $\alpha$ fixed), then maximize over $\alpha$.
Note: the max-margin hyperplane is completely determined by those xi that lie nearest to it. These xi are called support vectors.
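As a sanity check, here is a sketch that solves the hard-margin primal directly with a generic solver (scipy's SLSQP) on a made-up, linearly separable 2-D toy set; a real implementation would solve the dual instead:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable data; labels must be in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)  # (1/2) ||w||^2

# One inequality constraint per sample: y_i (w . x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (np.dot(p[:2], X[i]) + p[2]) - 1.0}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)
print(y * (X @ w + b))  # support vectors are the samples whose margin is ~1
```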
Linear data, soft-margin
When we cannot find such a boundary, we use a soft margin (the data is still roughly linear).
If the dataset contains some outliers and, after removing them, we could still find the boundary, then a soft-margin SVM applies.
To achieve this, we modify the constraints by adding some slack, that is: $y_i(w x_i + b) \ge 1 - \zeta_i$ with $\zeta_i \ge 0$. At the same time, the slack variables shouldn't be too large, so the objective function becomes:
$$\min \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\zeta_i \quad \text{s.t.} \quad y_i(w x_i + b) \ge 1 - \zeta_i, \ \zeta_i \ge 0$$
Put another way, this can be interpreted as the SVM using the hinge loss here, as sketched below.
[@TODO]
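A sketch of the equivalent hinge-loss form of the soft-margin objective, with made-up data and weights (each slack $\zeta_i$ becomes $\max(0, 1 - y_i(w x_i + b))$):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    # (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b))
    margins = y * (X @ w + b)
    return 0.5 * np.dot(w, w) + C * np.sum(np.maximum(0.0, 1.0 - margins))

X = np.array([[2.0, 2.0], [-1.0, -1.0]])   # made-up data
y = np.array([1.0, -1.0])
print(soft_margin_objective(np.array([0.5, 0.5]), 0.0, X, y))  # 0.25: no slack needed
```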
Non-linear data
When the data is not linearly separable, we can project it into another feature space by using a non-linear kernel function to replace every dot product in the low-dimensional space.
Let $\phi(x): X \to H$ be the projection function that maps $x$ from the low-dimensional space $X$ to the high-dimensional space $H$. The kernel function is defined as:
$$K(x,z) = \phi(x) \cdot \phi(z)$$
where $x$ and $z$ are two points in the low-dimensional space.
Some common kernel functions:
- Polynomial: $K(x,z) = (x \cdot z + 1)^p$
- Gaussian radial basis function: $K(x,z) = \exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)$
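Sketches of these two kernels (the inputs and the hyperparameters $p$ and $\sigma$ below are arbitrary choices):

```python
import numpy as np

def polynomial_kernel(x, z, p=2):
    # K(x, z) = (x . z + 1)^p
    return (np.dot(x, z) + 1.0) ** p

def rbf_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z))  # 9.0 and exp(-2.5)
```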
Questions
Why use log loss for logistic regression?
When trying to maximize the probability of the whole sequence of observations, the product of the probabilities of the events is:
$$X = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i}$$
where:
- $p$ is the predicted probability that $y = 1$
- $y_i$ is the label of the $i$-th sample (1 or 0)
Taking the logarithm of the product to make it easier to optimize, we have:
$$\log X = \sum_{i=1}^{n} y_i\log p + (1-y_i)\log(1-p)$$
Maximizing $X$ (equivalently, $\log X$) is therefore the same as minimizing the log loss $L(\theta)$.
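A quick numeric check of this equivalence on made-up labels and predictions (the log loss is exactly the negative log-likelihood):

```python
import numpy as np

y = np.array([1, 0, 1, 1])
p = np.array([0.8, 0.3, 0.6, 0.9])   # made-up predicted P(y=1) per sample

log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
log_loss = np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p))
print(log_likelihood, log_loss)      # log_loss == -log_likelihood
```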