The goal of Logistic Regression
Suppose we have data with $m$ samples and $n$ features. The data can be represented as an $m\times n$ matrix $X$:
$$ X=\begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{bmatrix} $$
where:
- $x_j^{(i)}$ is a scalar, representing the value of the $j$-th feature for the $i$-th sample (entry view).
- $x^{(i)T}$ is a row vector, representing the values of all features for the $i$-th sample (row view).
- $x_j$ is a column vector, representing the values of the $j$-th feature for all samples (column view).
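To make the three views concrete, here is a small NumPy sketch with made-up numbers; the array `X` and its indexing are purely illustrative.

```python
import numpy as np

# Toy data matrix with m = 3 samples and n = 2 features (made-up values).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # shape (m, n)

entry = X[0, 1]   # x_2^{(1)}: the 2nd feature of the 1st sample (entry view)
row   = X[0, :]   # x^{(1)T}: all features of the 1st sample (row view)
col   = X[:, 1]   # x_2: the 2nd feature across all samples (column view)
```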
The goal of logistic regression is to classify the $m$ samples into two classes—positive and negative.
The class label is denoted as:
$$ y^{(i)},\quad i=1,\cdots,m $$
where $y^{(i)}$ is binary, taking the value $1$ or $0$.
- $y^{(i)}=1$ indicates that the $i$-th sample belongs to the positive class.
- $y^{(i)}=0$ indicates that the $i$-th sample belongs to the negative class.
Class Probability
The logistic regression model does not directly predict $y^{(i)}$. Instead, it calculates the probability of belonging to the positive class, denoted as $p^{(i)}$, using the following formula:
$$ p^{(i)}=\frac{1}{1+e^{-z^{(i)}}},\quad z^{(i)}=w_1x_1^{(i)}+\cdots+w_nx_n^{(i)}+b $$
where:
- $w_j$ and $b$ are parameters of the logistic regression model, optimized during the learning process.
- $w_j$ represents the weight for the $j$-th feature;
- $b$ is the model’s bias term.
- $z^{(i)}$ is the logit for the $i$-th sample, also called the log-odds, and can be expressed as:
$$ z^{(i)}=\log\frac{p^{(i)}}{1-p^{(i)}} $$
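For completeness, this expression follows directly from the class-probability formula: solving $p^{(i)}=\frac{1}{1+e^{-z^{(i)}}}$ for $z^{(i)}$ gives
$$ e^{-z^{(i)}}=\frac{1-p^{(i)}}{p^{(i)}} \quad\Longrightarrow\quad z^{(i)}=\log\frac{p^{(i)}}{1-p^{(i)}} $$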
Logistic Function
The logistic function $\sigma(z)=\frac{1}{1+e^{-z}}$, also called the sigmoid function, produces an S-shaped curve.
It ensures that the model's output is always between 0 and 1, aligning with the probabilistic interpretation.
- If $z^{(i)}\to+\infty$, then $p^{(i)}=\sigma(z^{(i)})\to 1$.
- If $z^{(i)}\to-\infty$, then $p^{(i)}=\sigma(z^{(i)})\to 0$.
- If $z^{(i)}=0$, then $p^{(i)}=\sigma(z^{(i)})= 0.5$.
By substituting the logit $z^{(i)}$ into the logistic function, we obtain the positive-class probability $p^{(i)}$, which determines the predicted class for the $i$-th sample.
- If $p^{(i)}\ge0.5$, then the $i$-th sample is classified as belonging to the positive class ($y^{(i)}=1$).
- If $p^{(i)}<0.5$, then the $i$-th sample is classified as belonging to the negative class ($y^{(i)}=0$).
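A minimal sketch of this decision rule in NumPy; the weights `w` and bias `b` here are hypothetical placeholder values, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for n = 2 features (illustration only).
w = np.array([0.8, -0.4])
b = -0.5

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

z = X @ w + b                    # logits z^{(i)}
p = sigmoid(z)                   # positive-class probabilities p^{(i)}
y_pred = (p >= 0.5).astype(int)  # 1 if p >= 0.5 (positive class), else 0
```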
Cross-entropy Loss
In the logistic regression model, the cross-entropy loss (also known as the log-loss) is used to optimize the model parameters.
For a single sample $i$, the loss is defined as:
$$ L^{(i)}:=-y^{(i)}\log p^{(i)}-(1-y^{(i)})\log(1-p^{(i)}) $$
For $m$ samples, the total loss is the average of the individual losses:
$$ L=\frac{1}{m}\sum_{i=1}^mL^{(i)} $$
Here:
- $y^{(i)}$ is the true label ($1$ or $0$).
- $p^{(i)}=\sigma(z^{(i)})$ is the predicted probability.
The parameters $w_j$ and $b$ are chosen to minimize the loss $L$.
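The averaged loss translates directly into a few lines of NumPy. This is a sketch; the small clipping constant only guards against $\log(0)$ and is not part of the definition.

```python
import numpy as np

def cross_entropy_loss(y, p, eps=1e-12):
    # y: true labels in {0, 1}; p: predicted probabilities in (0, 1).
    p = np.clip(p, eps, 1.0 - eps)                           # avoid log(0)
    per_sample = -y * np.log(p) - (1 - y) * np.log(1 - p)    # L^{(i)}
    return per_sample.mean()                                 # L: average over m samples

# Example with made-up labels and predictions:
y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
loss = cross_entropy_loss(y, p)   # (-log 0.9 - log 0.8 - log 0.6) / 3 ≈ 0.28
```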
Why Cross-entropy?
Information Theory Perspective
In information theory, the cross-entropy $H[P\Vert\hat P]$ measures the difference between the true probability distribution $P$ and the predicted distribution $\hat P$.
For a discrete distribution over the class labels, it is defined as:
$$ H[P\Vert\hat P]:=-\sum_{y} P(y)\log \hat P(y) $$
For the $i$-th sample:
- $P(1)=y^{(i)}$ and $P(0)=1-y^{(i)}$: the true distribution puts all probability on the observed class.
- $\hat P(1)=p^{(i)}$ and $\hat P(0)=1-p^{(i)}$: the predicted distribution.
Substituting these into the definition recovers the per-sample loss $L^{(i)}$. Logistic regression predicts a probability $p^{(i)}\in(0,1)$ for each sample, while the true probability $y^{(i)}\in\lbrace0,1\rbrace$ is binary. Minimizing the cross-entropy ensures that the predicted probabilities align closely with the true labels.
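For example (using the natural logarithm), if $y^{(i)}=1$ and the model predicts $p^{(i)}=0.9$, the per-sample loss is $-\log 0.9\approx 0.105$; if it instead predicts $p^{(i)}=0.1$, the loss is $-\log 0.1\approx 2.303$, so confident but wrong predictions are penalized heavily.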
Statistical Perspective
From a statistical perspective, logistic regression uses maximum likelihood estimation (MLE) to estimate the model parameters $w_j$ and $b$.
The observed labels $y^{(i)}$ are modeled using a Bernoulli distribution:
$$ f(y^{(i)})=(p^{(i)})^{y^{(i)}}(1-p^{(i)})^{1-y^{(i)}},\quad y^{(i)}\in\lbrace0,1\rbrace $$
The log-likelihood for a single observation is:
$$ \begin{split} \log f(y^{(i)}) &=\log[ (p^{(i)})^{y^{(i)}}(1-p^{(i)})^{1-y^{(i)}}] \\&=y^{(i)}\log p^{(i)}+(1-y^{(i)})\log(1-p^{(i)})\\&=-L^{(i)} \end{split} $$
Thus, minimizing the cross-entropy loss is equivalent to maximizing the likelihood of the observed labels.
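To tie the two views together, here is a minimal gradient-descent sketch that fits $w_j$ and $b$ by minimizing the average cross-entropy loss, which is the same as maximizing the Bernoulli log-likelihood. The toy data is made up, and real implementations (e.g., scikit-learn's LogisticRegression) use more sophisticated solvers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Fit w and b by gradient descent on the average cross-entropy loss."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)        # predicted positive-class probabilities
        grad_w = X.T @ (p - y) / m    # gradient of L with respect to w
        grad_b = np.mean(p - y)       # gradient of L with respect to b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data (made up): two features, binary labels.
X = np.array([[0.5, 1.0], [1.5, 2.0], [3.0, 0.5], [4.0, 1.5]])
y = np.array([0, 0, 1, 1])
w, b = fit_logistic_regression(X, y)
p = sigmoid(X @ w + b)   # fitted probabilities for the training samples
```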