Logistic Regression

The goal of Logistic Regression

Suppose we have data with $m$ samples and $n$ features. The data can be represented as an $m\times n$ matrix $X$:

$$ X=\begin{bmatrix} x_1^{(1)} & \cdots & x_n^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix} =\begin{bmatrix} x^{(1)T} \\ \vdots \\ x^{(m)T} \end{bmatrix} =\begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix} $$

where:

  • $x_j^{(i)}$ is a scalar, representing the value of the $j$-th feature for the $i$-th sample (entry view).
  • $x^{(i)T}$ is a row vector, representing the values of all features for the $i$-th sample (row view).
  • $x_j$ is a column vector, representing the values of the $j$-th feature for all samples (column view).
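
As a concrete illustration of these three views, here is a minimal NumPy sketch; the variable names and the small random matrix are assumptions made only for this example:

```python
import numpy as np

m, n = 5, 3                      # 5 samples, 3 features (arbitrary sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))      # the m x n data matrix

i, j = 2, 1                      # 3rd sample, 2nd feature (0-based indices)
entry_view = X[i, j]             # x_j^(i): scalar value of feature j for sample i
row_view = X[i, :]               # x^(i)T : all feature values for sample i, shape (n,)
column_view = X[:, j]            # x_j    : feature j across all samples, shape (m,)

print(entry_view, row_view.shape, column_view.shape)
```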

The goal of logistic regression is to classify the $m$ samples into two classes—positive and negative.

The class label is denoted as:

$$ y^{(i)},\quad i=1,\cdots,m $$

where $y^{(i)}$ is binary, taking the value $1$ or $0$.

  • $y^{(i)}=1$ indicates that the $i$-th sample belongs to the positive class.
  • $y^{(i)}=0$ indicates that the $i$-th sample belongs to the negative class.

Class Probability

The logistic regression model does not predict $y^{(i)}$ directly. Instead, it computes the probability that the $i$-th sample belongs to the positive class, denoted $p^{(i)}$, using the following formula:

$$ p^{(i)}=\frac{1}{1+e^{-z^{(i)}}},\quad z^{(i)}=w_1x_1^{(i)}+\cdots+w_nx_n^{(i)}+b $$

where:

  • $w_j$ and $b$ are parameters of the logistic regression model, optimized during the learning process.
    • $w_j$ represents the weight for the $j$-th feature;
    • $b$ is the model’s bias term.
  • $z^{(i)}$ is the logit for the $i$-th sample, also called the log-odds, and can be expressed as:
    $$ z^{(i)}=\log\frac{p^{(i)}}{1-p^{(i)}} $$
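
The formula above can be sketched in a few lines of NumPy; the helper names `sigmoid` and `predict_proba` and the toy values of $X$, $w$, and $b$ are illustrative assumptions, not part of the model definition:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Positive-class probabilities p^(i) = sigma(z^(i)) with z^(i) = w . x^(i) + b."""
    z = X @ w + b        # logits, shape (m,)
    return sigmoid(z)    # probabilities, shape (m,)

# Toy data: 3 samples, 2 features, hand-picked parameters.
X = np.array([[0.5, 1.2], [-1.0, 0.3], [2.0, -0.5]])
w = np.array([1.0, -2.0])
b = 0.5
print(predict_proba(X, w, b))
```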

Logistic Function

The logistic function $\sigma(z)=\frac{1}{1+e^{-z}}$, also called the sigmoid function, produces an S-shaped curve.

It ensures that the model's output is always between 0 and 1, which aligns with the probabilistic interpretation.

  • If $z^{(i)}\to+\infty$, then $p^{(i)}=\sigma(z^{(i)})\to 1$.
  • If $z^{(i)}\to-\infty$, then $p^{(i)}=\sigma(z^{(i)})\to 0$.
  • If $z^{(i)}=0$, then $p^{(i)}=\sigma(z^{(i)})= 0.5$.

By substituting the logit $z^{(i)}$ into the logistic function, we obtain the positive-class probability $p^{(i)}$, which determines the predicted class for the $i$-th sample.

  • If $p^{(i)}\ge0.5$, then the $i$-th sample is classified as belonging to the positive class ($y^{(i)}=1$).
  • If $p^{(i)}<0.5$, then the $i$-th sample is classified as belonging to the negative class ($y^{(i)}=0$).
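
A minimal sketch of this 0.5-threshold decision rule, assuming the probabilities have already been computed as above (the helper name `predict_class` is made up for illustration):

```python
import numpy as np

def predict_class(p, threshold=0.5):
    """Hard labels from probabilities: 1 if p >= threshold, else 0."""
    return (np.asarray(p) >= threshold).astype(int)

print(predict_class([0.9, 0.5, 0.49, 0.1]))  # -> [1 1 0 0]
```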

Cross-entropy Loss

In the logistic regression model, the cross-entropy loss (also known as the log-loss) is used to optimize the model parameters.

For a single sample $i$, the loss is defined as:

$$ L^{(i)}:=-y^{(i)}\log p^{(i)}-(1-y^{(i)})\log(1-p^{(i)}) $$

For $m$ samples, the total loss is the average of the individual losses:

$$ L=\frac{1}{m}\sum_{i=1}^mL^{(i)} $$

Here:

  • $y^{(i)}$ is the true label ($1$ or $0$).
  • $p^{(i)}=\sigma(z^{(i)})$ is the predicted probability.

The parameters $w_j$ and $b$ are optimized by minimizing the loss $L$.
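
A minimal sketch of the loss computation; the small clipping constant `eps` is an added assumption for numerical stability (to avoid $\log 0$), not part of the definition:

```python
import numpy as np

def cross_entropy_loss(y, p, eps=1e-12):
    """Average log-loss over m samples: -mean(y*log(p) + (1-y)*log(1-p))."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)  # keep log() finite
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

print(cross_entropy_loss([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]))  # approx. 0.299
```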

Why Cross-entropy?

Information Theory Perspective

In information theory, the cross-entropy $H[P\Vert\hat P]$ measures the difference between the true probability distribution $P$ and the predicted distribution $\hat P$.

For a discrete distribution over outcomes $k$, it is defined as:

$$ H[P\Vert\hat P]:=-\sum_{k} P(k)\log \hat P(k) $$

For the $i$-th sample, the outcomes are the two class labels:

  • The true distribution assigns $P(1)=y^{(i)}$ and $P(0)=1-y^{(i)}$, where $y^{(i)}$ is the true label (either $1$ or $0$).
  • The predicted distribution assigns $\hat P(1)=p^{(i)}$ and $\hat P(0)=1-p^{(i)}$, where $p^{(i)}$ is the predicted probability.

Substituting these into the definition recovers exactly the per-sample loss $L^{(i)}=-y^{(i)}\log p^{(i)}-(1-y^{(i)})\log(1-p^{(i)})$. Logistic regression predicts a probability $p^{(i)}\in(0,1)$ for each sample, while the true label $y^{(i)}\in\lbrace0,1\rbrace$ is binary. Minimizing the cross-entropy therefore drives the predicted probabilities toward the true labels.
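
For instance, if $y^{(i)}=1$ and the model predicts $p^{(i)}=0.9$ (an illustrative value), the per-sample cross-entropy is $-\log 0.9\approx 0.105$, whereas a confidently wrong prediction of $p^{(i)}=0.1$ would cost $-\log 0.1\approx 2.303$.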

Statistical Perspective

From a statistical perspective, logistic regression uses maximum likelihood estimation (MLE) to estimate the model parameters $w_j$ and $b$.

The observed labels $y^{(i)}$ are modeled using a Bernoulli distribution:

$$ f(y^{(i)})=(p^{(i)})^{y^{(i)}}(1-p^{(i)})^{1-y^{(i)}},\quad y^{(i)}\in\lbrace0,1\rbrace $$

The log-likelihood for a single observation is:

$$ \begin{split} \log f(y^{(i)}) &=\log[ (p^{(i)})^{y^{(i)}}(1-p^{(i)})^{1-y^{(i)}}] \\&=y^{(i)}\log p^{(i)}+(1-y^{(i)})\log(1-p^{(i)})\\&=-L^{(i)} \end{split} $$

Thus, minimizing the cross-entropy loss is equivalent to maximizing the likelihood of the observed labels.
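
To tie the two perspectives together, here is a minimal gradient-descent sketch that fits $w$ and $b$ by minimizing the average cross-entropy loss (equivalently, maximizing the Bernoulli log-likelihood); the learning rate, iteration count, and synthetic data are arbitrary assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=2000):
    """Gradient descent on the average cross-entropy loss L."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)       # predicted probabilities p^(i)
        grad_w = X.T @ (p - y) / m   # dL/dw_j
        grad_b = np.mean(p - y)      # dL/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data whose true decision boundary is x1 - 0.5*x2 = 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)
w, b = fit_logistic_regression(X, y)
print(w, b)  # learned weights should roughly point in the direction (1, -0.5)
```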