Linear Models

Logistic Regression

0-1 Binary Distribution

For logistic regression, the logarithm of the odds of an event (the ratio between $p(y=1)$ and $p(y=0)$) is a linear function of the independent variables, that is:

$$
\begin{aligned}
& \ln \frac{p(y=1)}{p(y=0)} = \ln \frac{p(y=1)}{1 - p(y=1)} = \ln \frac{p}{1 - p} = wx \\
\Rightarrow \; & p = \frac{e^{wx}}{1 + e^{wx}} = \frac{1}{1 + e^{-wx}} = \sigma(wx)
\end{aligned}
$$
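As a minimal sketch (not part of the derivation above), this prediction rule could look like the following in NumPy, assuming the bias is folded into $w$ via a leading constant feature and the weights are made up:

```python
import numpy as np

def sigmoid(z):
    # logistic function: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    # p(y=1 | x) = sigma(w . x)
    return sigmoid(np.dot(w, x))

w = np.array([0.5, -1.2, 2.0])   # hypothetical weights (first entry acts as the bias)
x = np.array([1.0, 0.3, 0.8])    # one sample with a leading constant feature of 1
print(predict_proba(w, x))       # probability that y = 1
```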

Multiple Value Distribution

Similar to the 0-1 distribution, if $y$ takes values in $0 \sim K$, then for $k = 1, \dots, K$ (class $0$ is the baseline):

$$
P(y = k \mid x) = \frac{e^{w_k x}}{1 + e^{w_1 x} + e^{w_2 x} + \dots + e^{w_K x}} = \frac{e^{w_k x}}{1 + \sum_{k'=1}^K e^{w_{k'} x}}
$$

This is essentially the softmax function, with class 0 as the baseline whose score is fixed to 0.
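As a rough sketch of the formula above (assuming class 0 is the baseline and using made-up weights), the class probabilities could be computed like this:

```python
import numpy as np

def softmax_with_baseline(W, x):
    """P(y=k|x) for k = 0..K, where class 0 is the baseline with score fixed to 0.

    W has shape (K, d): one weight vector w_k per non-baseline class."""
    scores = np.concatenate(([0.0], W @ x))   # [0, w_1.x, ..., w_K.x]
    scores -= scores.max()                    # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()      # probabilities sum to 1

W = np.array([[0.2, -0.5], [1.0, 0.3]])  # hypothetical weights, K = 2 non-baseline classes
x = np.array([0.4, 1.5])
print(softmax_with_baseline(W, x))        # P(y=0), P(y=1), P(y=2)
```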

Loss Function

Use log loss here (see the Questions section below for why).

Log loss:

$$
L(\theta) = \sum_{(x,y) \in D} -y \log(y') - (1 - y) \log(1 - y')
$$

where:

  1. y is the true label (0 or 1)
  2. y' is the predicted probability of y=1 (between 0 and 1); see the sketch below
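A small sketch of this loss, clipping predictions away from 0 and 1 to avoid log(0); the sample values are made up:

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    """Sum over the dataset of -y*log(y') - (1-y)*log(1-y')."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return np.sum(-y_true * np.log(y_pred) - (1.0 - y_true) * np.log(1.0 - y_pred))

y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.6, 0.99])   # hypothetical predicted probabilities
print(log_loss(y, y_hat))
```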

Maximum Entropy Model

This section is based on this paper: A Maximum Entropy Approach to Natural Language Processing

The principle of maximum entropy is that, among all models consistent with the constraints, the best model is the one with the maximum conditional entropy. (Entropy measures the amount of uncertainty; the higher the entropy, the fewer assumptions the model makes beyond what the data supports.)

See more about entropy here.

Here’s a constraint for the maximum entropy model: the empirical distribution of the sample (namely, $\tilde{p}(x,y)$) should be consistent with the distribution of the model (namely, $p(y|x)$). That means the expectation under the model should equal the expectation under the samples, that is:

$$
\sum_{x,y} \tilde{P}(x,y)f(x,y) = \sum_{x,y}\tilde{P}(x)P(y|x)f(x,y)
$$

where:

  1. $\tilde{P}(x,y)$ and $\tilde{P}(x)$ are the empirical probabilities of $(x,y)$ and $x$ in the sample
  2. $P(y|x)$ is the model
  3. $f(x,y)$ is a feature function where

$$
f(x,y)=
\begin{cases}
1, & (x,y) \text{ exists in the samples} \\
0, & \text{otherwise}
\end{cases}
$$
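Below is a toy sketch (with invented sample values and a uniform stand-in model) of how the two expectations in the constraint could be computed; it is only meant to illustrate the definitions above:

```python
from collections import Counter

# toy sample of (x, y) pairs; the values are purely illustrative
samples = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"), ("sunny", "stay")]
n = len(samples)
observed = set(samples)

def f(x, y):
    # binary feature: 1 if the pair (x, y) appears in the samples
    return 1.0 if (x, y) in observed else 0.0

# empirical distributions P~(x, y) and P~(x)
p_xy = {pair: c / n for pair, c in Counter(samples).items()}
p_x = {x: c / n for x, c in Counter(x for x, _ in samples).items()}

# a stand-in model P(y|x); here simply uniform over the observed labels
labels = {y for _, y in samples}
def p_model(y, x):
    return 1.0 / len(labels)

# left side: expectation of f under the empirical joint distribution P~(x, y)
lhs = sum(p_xy[(x, y)] * f(x, y) for (x, y) in p_xy)
# right side: expectation of f under P~(x) * P(y|x)
rhs = sum(p_x[x] * p_model(y, x) * f(x, y) for x in p_x for y in labels)

print(lhs, rhs)  # the constraint requires a fitted model to make these two equal
```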

Objective
Maximize the conditional entropy, that is, maximize

$$
H(Y|X)=-\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x)
$$

Parameter Estimation
Optimizing the objective under the constraints, we have:

$$
\begin{aligned}
\max \;\;\; & H(Y|X) = -\sum_{x,y} \tilde{P}(x)P(y|x)\log P(y|x) \\
s.t. \;\;\; & \sum_{x,y}\tilde{P}(x)P(y|x)f_i(x,y) = \sum_{x,y}\tilde{P}(x,y)f_i(x,y), \quad i = 1, \dots, n \\
& \sum_{y} P(y|x) = 1
\end{aligned}
$$

Introducing Lagrange multipliers, we convert the constrained optimization into an unconstrained one. The Lagrangian is:

$$
L(P, \omega) = -H(P) + \omega_0 \Big[ 1-\sum_{y} P(y|x) \Big] + \sum_{i=1}^n \omega_i \Big[ \sum_{x,y}\tilde{P}(x)P(y|x)f_i(x,y) - \sum_{x,y}\tilde{P}(x,y)f_i(x,y) \Big]
$$

Then we solve the dual problem: first minimize $L(P, \omega)$ over $P$ (for fixed $\omega$), then maximize the result over $\omega$.

We now have:

$$
\begin{aligned}
P_\omega(y|x) &= \frac{1}{Z_\omega} \exp \Big( \sum_{i=1}^n \omega_i f_i(x,y) \Big) \\
Z_\omega &= \sum_y \exp \Big( \sum_{i=1}^n \omega_i f_i(x,y) \Big)
\end{aligned}
$$

It’s very similar to LR; the main difference lies in the number of classes: LR handles only 2 classes, while the maximum entropy model handles multiple classes.
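A minimal sketch of evaluating $P_\omega(y|x)$ for a fitted model; the feature functions, weights, and labels below are all hypothetical:

```python
import numpy as np

def maxent_proba(omega, feature_fns, x, labels):
    """P_w(y|x) = exp(sum_i w_i f_i(x, y)) / Z_w over the candidate labels."""
    scores = np.array([sum(w * f(x, y) for w, f in zip(omega, feature_fns)) for y in labels])
    exp_scores = np.exp(scores - scores.max())   # shift by max for numerical stability
    return exp_scores / exp_scores.sum()         # dividing by Z_w normalizes to 1

# two hypothetical indicator features
f1 = lambda x, y: 1.0 if ("sunny" in x and y == "play") else 0.0
f2 = lambda x, y: 1.0 if ("rainy" in x and y == "stay") else 0.0

omega = [1.5, 2.0]   # hypothetical learned weights
print(maxent_proba(omega, [f1, f2], "sunny afternoon", ["play", "stay"]))
```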

Maximum Likelihood Estimator

TODO

Support Vector Machine/SVM

The idea behind SVM is to find a boundary between the positive samples and the negative samples, and it tries to find the boundary that has the largest distance to the nearest samples.

Linear data, hard-margin

When we could find the boundary, we have a hard margin (the data is linearly separable).

Find a hyperplane such that the minimum distance from any point to the hyperplane is maximized, that is:

$$
\begin{aligned}
\max \; & \; d_{min} \\
s.t. \; & \; d_x \geq d_{min}
\end{aligned}
$$

Note: the distance from a point $x$ to the hyperplane $wx + b = 0$ is $d=\frac{|wx+b|}{|w|}$
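A quick numeric check of this distance formula with made-up numbers:

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    # d = |w.x + b| / ||w||
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # hypothetical hyperplane w.x + b = 0
b = -5.0
x = np.array([2.0, 1.0])
print(distance_to_hyperplane(w, b, x))   # |3*2 + 4*1 - 5| / 5 = 1.0
```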

Use $x_{min}$ to denote the point that has the minimum distance; then we have:

$$
\begin{aligned}
\max \; & \; \frac{w x_{min} + b}{|w|} \\
s.t. \; & \; \frac{|w x_i + b|}{|w|} \geq \frac{w x_{min} + b}{|w|}, \; \forall x_i \in X
\end{aligned}
$$

Since $w x_{min} + b$ is a constant (we can rescale $w$ and $b$ so that it equals 1), we replace it with $1$. Multiplying both sides of the constraint by $|w|$, we now have:

$$
\begin{aligned}
\max \; & \; \frac{1}{|w|} \\
s.t. \; & \; |w x_i + b| \geq 1, \; \forall x_i \in X
\end{aligned}
$$

Using minimization instead of maximization, we have:

$$
\max \; \frac{1}{|w|} \Leftrightarrow \min \; |w| \Leftrightarrow \min \; \frac{1}{2} |w|^2
$$

To get rid of the absolute value in the constraint, we multiply by the true label $y_i \in \{-1, +1\}$; now we have:

$$
\begin{aligned}
\min \; & \; \frac{1}{2} |w|^2 \\
s.t. \; & \; y_i(w x_i + b) \geq 1, \; \forall x_i \in X
\end{aligned}
$$

This is a convex optimization problem; use Lagrange multipliers. The Lagrangian is:

$$
L(w, b, \alpha) = \frac{1}{2} |w|^2 - \sum_i^N \alpha_i \big[ y_i(w x_i + b) - 1 \big]
$$

Then we solve the dual problem: minimize over $w, b$ first, and then maximize over $\alpha$.

Note: the max-margin hyperplane is completely determined by those $x_i$ that lie nearest to it. These $x_i$ are called support vectors.

Linear data, soft-margin

When we couldn’t find a perfect boundary but the data is still roughly linear, we use a soft margin.

When there are some outliers in the dataset, and eliminating these outliers would make the data separable again, we have a soft-margin SVM.

To achieve this, we modify the constraints and add some slack, that is: $y_i(wx_i + b) \geq 1 - \zeta_i$. At the same time, the slack variables shouldn’t be too large, so the objective function becomes:

$$
\begin{aligned}
\min \; & \; \frac{1}{2} |w|^2 + C \sum_i^N \zeta_i \\
s.t. \; & \; y_i(w x_i + b) \geq 1 - \zeta_i, \; \zeta_i \geq 0
\end{aligned}
$$

In other words, this can be interpreted as SVM using the hinge loss here, as sketched below.
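As a rough sketch of this hinge-loss view (not the standard dual/SMO solver), the soft-margin objective $\frac{1}{2}|w|^2 + C\sum_i \max(0, 1 - y_i(wx_i+b))$ can be minimized by subgradient descent; the toy data and hyperparameters below are made up:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                                  # samples violating the margin
        grad_w = w - C * (y[active].reshape(-1, 1) * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))   # should reproduce y
```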

[@TODO]

Non-linear data

When the data is not linear, we can implicitly project the data into another feature space by using a non-linear kernel function to replace every dot product in the low-dimensional space.

Use $\phi(x): X \to H$ as the projection function that maps $x$ from the low-dimensional space $X$ to the high-dimensional space $H$; the kernel function is defined as below:

$$
K(x, z) = \phi(x) \cdot \phi(z)
$$

where $x$ and $z$ are two points in the low-dimensional space.

Some common kernel functions:

  1. Polynomial: $K(x,z) = (x \cdot z + 1)^p$
  2. Gaussian radial basis function: $K(x,z) = e^{- \frac{|x - z|^2}{2\sigma^2}}$ (see the sketch below)
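A short sketch of these two kernels and a Gram matrix, with made-up points:

```python
import numpy as np

def polynomial_kernel(x, z, p=2):
    # K(x, z) = (x . z + 1)^p
    return (np.dot(x, z) + 1.0) ** p

def rbf_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z))

# kernel (Gram) matrix for a small dataset
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
print(K)
```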

Questions

Why use log loss for logistic regression?

When trying to maximize the probability of the whole dataset, the product of the probabilities of the individual events (the likelihood) is:

$$
X = \prod_{i = 1}^n p_i^{y_i}(1-p_i)^{1 - y_i}
$$

where:

  1. $p_i = \sigma(wx_i)$ is the predicted probability that $y_i = 1$
  2. $y_i$ is the label of the $i$-th sample (1 if the event occurs, 0 otherwise)

Take the logarithm of the product for easier optimization; then we have:

$$
\log X = \sum_{i = 1}^n y_i \log p_i + (1-y_i)\log (1-p_i)
$$

Maximizing $X$ is equivalent to minimizing the log loss $L(\theta)$, since $L(\theta) = -\log X$.
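As a quick numeric sanity check (with made-up labels and predicted probabilities), the negative of this log-likelihood is exactly the log loss defined in the Loss Function section:

```python
import numpy as np

y = np.array([1, 0, 1])
p = np.array([0.8, 0.3, 0.9])   # hypothetical predicted probabilities p_i

log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
log_loss = np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p))
print(log_likelihood, log_loss)   # the two differ only by sign
```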