Logistic Regression

The goal of Logistic Regression

Suppose we have data with $m$ samples and $n$ features. The data can be represented as an $m\times n$ matrix $X$:

$$ X=\begin{bmatrix} x_1^{(1)} & \cdots & x_n^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix} =\begin{bmatrix} x^{(1)T} \\ \vdots \\ x^{(m)T} \end{bmatrix} =\begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix} $$

where:

  • $x_j^{(i)}$ is a scalar, representing the value of the $j$-th feature for the $i$-th sample (entry view).
  • $x^{(i)T}$ is a row vector, representing the values of all features for the $i$-th sample (row view).
  • $x_j$ is a column vector, representing the values of the $j$-th feature for all samples (column view).
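
As a concrete illustration of these three views, here is a minimal NumPy sketch; the variable names and the small random matrix are assumptions made only for this example:

```python
import numpy as np

m, n = 5, 3                      # 5 samples, 3 features (arbitrary sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))      # the m x n data matrix

i, j = 2, 1                      # 3rd sample, 2nd feature (0-based indices)
entry_view = X[i, j]             # x_j^(i): scalar value of feature j for sample i
row_view = X[i, :]               # x^(i)T : all feature values for sample i, shape (n,)
column_view = X[:, j]            # x_j    : feature j across all samples, shape (m,)

print(entry_view, row_view.shape, column_view.shape)
```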

The goal of logistic regression is to classify the $m$ samples into two classes—positive and negative.

The class label is denoted as:

$$ y^{(i)},\quad i=1,\cdots,m $$

where $y^{(i)}$ is binary, taking the value $1$ or $0$.

  • $y^{(i)}=1$ indicates that the $i$-th sample belongs to the positive class.
  • $y^{(i)}=0$ indicates that the $i$-th sample belongs to the negative class.

Class Probability

The logistic regression model does not predict $y^{(i)}$ directly. Instead, it computes the probability that the $i$-th sample belongs to the positive class, denoted $p^{(i)}$, using the following formula:

$$ p^{(i)}=\frac{1}{1+e^{-z^{(i)}}},\quad z^{(i)}=w_1x_1^{(i)}+\cdots+w_nx_n^{(i)}+b $$

where:

  • $w_j$ and $b$ are parameters of the logistic regression model, optimized during the learning process.
    • $w_j$ represents the weight for the $j$-th feature;
    • $b$ is the model’s bias term.
  • $z^{(i)}$ is the logit for the $i$-th sample, also called the log-odds, and can be expressed as:
    $$ z^{(i)}=\log\frac{p^{(i)}}{1-p^{(i)}} $$
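
The formula above can be sketched in a few lines of NumPy; the helper names `sigmoid` and `predict_proba` and the toy values of $X$, $w$, and $b$ are illustrative assumptions, not part of the model definition:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Positive-class probabilities p^(i) = sigma(z^(i)) with z^(i) = w . x^(i) + b."""
    z = X @ w + b        # logits, shape (m,)
    return sigmoid(z)    # probabilities, shape (m,)

# Toy data: 3 samples, 2 features, hand-picked parameters.
X = np.array([[0.5, 1.2], [-1.0, 0.3], [2.0, -0.5]])
w = np.array([1.0, -2.0])
b = 0.5
print(predict_proba(X, w, b))
```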

Logistic Function

The logistic function $\sigma(z)=\frac{1}{1+e^{-z}}$, also called the sigmoid function, produces an S-shaped curve.

It ensures that the model's output is always between 0 and 1, which aligns with the probabilistic interpretation.

  • If $z^{(i)}\to+\infty$, then $p^{(i)}=\sigma(z^{(i)})\to 1$.
  • If $z^{(i)}\to-\infty$, then $p^{(i)}=\sigma(z^{(i)})\to 0$.
  • If $z^{(i)}=0$, then $p^{(i)}=\sigma(z^{(i)})= 0.5$.

By substituting the logit $z^{(i)}$ into the logistic function, we obtain the positive-class probability $p^{(i)}$, which determines the predicted class for the $i$-th sample.

  • If $p^{(i)}\ge0.5$, then the $i$-th sample is classified as belonging to the positive class ($y^{(i)}=1$).
  • If $p^{(i)}<0.5$, then the $i$-th sample is classified as belonging to the negative class ($y^{(i)}=0$).
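
A minimal sketch of this 0.5-threshold decision rule, assuming the probabilities have already been computed as above (the helper name `predict_class` is made up for illustration):

```python
import numpy as np

def predict_class(p, threshold=0.5):
    """Hard labels from probabilities: 1 if p >= threshold, else 0."""
    return (np.asarray(p) >= threshold).astype(int)

print(predict_class([0.9, 0.5, 0.49, 0.1]))  # -> [1 1 0 0]
```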

Cross-entropy Loss

In the logistic regression model, the cross-entropy loss (also known as the log-loss) is used to optimize the model parameters.

For a single sample $i$, the loss is defined as:

$$ L^{(i)}:=-y^{(i)}\log p^{(i)}-(1-y^{(i)})\log(1-p^{(i)}) $$

For $m$ samples, the total loss is the average of the individual losses:

$$ L=\frac{1}{m}\sum_{i=1}^mL^{(i)} $$

Here:

  • $y^{(i)}$ is the true label ($1$ or $0$).
  • $p^{(i)}=\sigma(z^{(i)})$ is the predicted probability.

The parameters $w_j$ and $b$ are optimized by minimizing the loss $L$.
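
A minimal sketch of the loss computation; the small clipping constant `eps` is an added assumption for numerical stability (to avoid $\log 0$), not part of the definition:

```python
import numpy as np

def cross_entropy_loss(y, p, eps=1e-12):
    """Average log-loss over m samples: -mean(y*log(p) + (1-y)*log(1-p))."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)  # keep log() finite
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

print(cross_entropy_loss([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]))  # approx. 0.299
```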

Why Cross-entropy?

Information Theory Perspective

In information theory, the cross-entropy $H[P\Vert\hat P]$ measures the difference between the true probability distribution $P$ and the predicted distribution $\hat P$.

For a discrete distribution over outcomes $k$, it is defined as:

$$ H[P\Vert\hat P]:=-\sum_{k} P(k)\log \hat P(k) $$

For the $i$-th sample, the outcomes are the two class labels:

  • The true distribution assigns $P(1)=y^{(i)}$ and $P(0)=1-y^{(i)}$, where $y^{(i)}$ is the true label (either $1$ or $0$).
  • The predicted distribution assigns $\hat P(1)=p^{(i)}$ and $\hat P(0)=1-p^{(i)}$, where $p^{(i)}$ is the predicted probability.

Substituting these into the definition recovers exactly the per-sample loss $L^{(i)}=-y^{(i)}\log p^{(i)}-(1-y^{(i)})\log(1-p^{(i)})$. Logistic regression predicts a probability $p^{(i)}\in(0,1)$ for each sample, while the true label $y^{(i)}\in\lbrace0,1\rbrace$ is binary. Minimizing the cross-entropy therefore drives the predicted probabilities toward the true labels.
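
For instance, if $y^{(i)}=1$ and the model predicts $p^{(i)}=0.9$ (an illustrative value), the per-sample cross-entropy is $-\log 0.9\approx 0.105$, whereas a confidently wrong prediction of $p^{(i)}=0.1$ would cost $-\log 0.1\approx 2.303$.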

Statistical Perspective

From a statistical perspective, logistic regression uses maximum likelihood estimation (MLE) to estimate the model parameters $w_j$ and $b$.

The observed labels $y^{(i)}$ are modeled using a Bernoulli distribution:

$$ f(y^{(i)})=(p^{(i)})^{y^{(i)}}(1-p^{(i)})^{1-y^{(i)}},\quad y^{(i)}\in\lbrace0,1\rbrace $$

The log-likelihood for a single observation is:

$$ \begin{split} \log f(y^{(i)}) &=\log[ (p^{(i)})^{y^{(i)}}(1-p^{(i)})^{1-y^{(i)}}] \\&=y^{(i)}\log p^{(i)}+(1-y^{(i)})\log(1-p^{(i)})\\&=-L^{(i)} \end{split} $$

Thus, minimizing the cross-entropy loss is equivalent to maximizing the likelihood of the observed labels.
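
To tie the two perspectives together, here is a minimal gradient-descent sketch that fits $w$ and $b$ by minimizing the average cross-entropy loss (equivalently, maximizing the Bernoulli log-likelihood); the learning rate, iteration count, and synthetic data are arbitrary assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=2000):
    """Gradient descent on the average cross-entropy loss L."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)       # predicted probabilities p^(i)
        grad_w = X.T @ (p - y) / m   # dL/dw_j
        grad_b = np.mean(p - y)      # dL/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data whose true decision boundary is x1 - 0.5*x2 = 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)
w, b = fit_logistic_regression(X, y)
print(w, b)  # learned weights should roughly point in the direction (1, -0.5)
```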