

Logistic Regression


The goal of Logistic Regression

Suppose we have data with $m$ samples and $n$ features. The data can be represented as an $m \times n$ matrix $X$:

$$X = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}$$

where:

  • $x_j^{(i)}$ is a scalar, representing the value of the $j$-th feature for the $i$-th sample (entry view).
  • $x^{(i)T}$ is a row vector, representing the values of all features for the $i$-th sample (row view).
  • $x_j$ is a column vector, representing the values of the $j$-th feature for all samples (column view).
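
As a quick sketch of these three views (the matrix values below are made up for illustration), NumPy indexing maps onto them directly:

```python
import numpy as np

# Toy data matrix with m = 4 samples (rows) and n = 3 features (columns).
X = np.array([
    [0.5,  1.2, -0.3],
    [1.1,  0.4,  0.9],
    [-0.7, 2.0,  0.1],
    [0.3, -1.5,  0.8],
])

m, n = X.shape   # m = 4, n = 3

entry = X[0, 2]  # x_3^{(1)}: value of the 3rd feature for the 1st sample (entry view)
row   = X[0, :]  # x^{(1)T}: all features of the 1st sample (row view)
col   = X[:, 2]  # x_3: the 3rd feature across all samples (column view)
```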

The goal of logistic regression is to classify the $m$ samples into two classes: positive and negative.

The class label is denoted as:

$$y^{(i)}, \quad i = 1, \dots, m$$

where $y^{(i)}$ is binary, taking the value 1 or 0.

  • $y^{(i)} = 1$ indicates that the $i$-th sample belongs to the positive class.
  • $y^{(i)} = 0$ indicates that the $i$-th sample belongs to the negative class.

Class Probability

The logistic regression model does not directly predict $y^{(i)}$. Instead, it calculates the probability of belonging to the positive class, denoted as $p^{(i)}$, using the following formula:

$$p^{(i)} = \frac{1}{1 + e^{-z^{(i)}}}, \qquad z^{(i)} = w_1 x_1^{(i)} + \cdots + w_n x_n^{(i)} + b$$

where:

  • $w_j$ and $b$ are parameters of the logistic regression model, optimized during the learning process.
    • $w_j$ represents the weight for the $j$-th feature;
    • $b$ is the model's bias term.
  • $z^{(i)}$ is the logit for the $i$-th sample, also called the log-odds, and can be expressed as:
    $$z^{(i)} = \log \frac{p^{(i)}}{1 - p^{(i)}}$$
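
To make the formula concrete, here is a minimal NumPy sketch; the weights, bias, and sample values are hypothetical, not taken from any dataset:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.5, 1.2])  # hypothetical weights w_1, ..., w_n
b = -0.1                        # hypothetical bias term
x = np.array([0.5, 1.2, -0.3])  # one sample x^{(i)}

z = w @ x + b   # logit z^{(i)} = w_1 x_1^{(i)} + ... + w_n x_n^{(i)} + b
p = sigmoid(z)  # positive-class probability p^{(i)}

# Sanity check: the logit equals the log-odds of p (up to floating-point error).
assert np.isclose(np.log(p / (1 - p)), z)
```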

Logistic Function

The logistic function $\sigma(z) = \frac{1}{1 + e^{-z}}$, also called the sigmoid function, produces an S-shaped curve.

It ensures that the model's output is always between 0 and 1, aligning with the probabilistic interpretation.

  • If $z^{(i)} \to +\infty$, then $p^{(i)} = \sigma(z^{(i)}) \to 1$.
  • If $z^{(i)} \to -\infty$, then $p^{(i)} = \sigma(z^{(i)}) \to 0$.
  • If $z^{(i)} = 0$, then $p^{(i)} = \sigma(z^{(i)}) = 0.5$.

By substituting the logit $z^{(i)}$ into the logistic function, we obtain the positive-class probability $p^{(i)}$, which determines the predicted class for the $i$-th sample.

  • If $p^{(i)} \ge 0.5$, then the $i$-th sample is classified as belonging to the positive class ($y^{(i)} = 1$).
  • If $p^{(i)} < 0.5$, then the $i$-th sample is classified as belonging to the negative class ($y^{(i)} = 0$).
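
As a small illustration of this decision rule (the probabilities below are hypothetical):

```python
import numpy as np

p = np.array([0.92, 0.31, 0.50, 0.07])  # hypothetical predicted probabilities

# Classify as positive (1) when p >= 0.5, negative (0) otherwise.
# Note that p = 0.5 falls on the positive side under this convention.
y_pred = (p >= 0.5).astype(int)         # array([1, 0, 1, 0])
```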

Cross-entropy Loss

In the logistic regression model, the cross-entropy loss (also known as the log-loss) is used to optimize the model parameters.

For a single sample $i$, the loss is defined as:

$$L^{(i)} := -y^{(i)} \log p^{(i)} - (1 - y^{(i)}) \log(1 - p^{(i)})$$

For $m$ samples, the total loss is the average of the individual losses:

$$L = \frac{1}{m} \sum_{i=1}^{m} L^{(i)}$$

Here:

  • $y^{(i)}$ is the true label (1 or 0).
  • $p^{(i)} = \sigma(z^{(i)})$ is the predicted probability.

The parameters $w_j$ and $b$ are optimized by minimizing the loss $L$.
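
A minimal sketch of this loss in NumPy (the labels and probabilities are hypothetical, and the clipping constant is a common numerical safeguard, not part of the definition):

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Average cross-entropy (log-loss) over m samples.

    y: true labels in {0, 1}; p: predicted probabilities in (0, 1).
    p is clipped away from 0 and 1 to avoid log(0).
    """
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1, 0, 1, 0])              # hypothetical true labels
p = np.array([0.92, 0.31, 0.85, 0.07])  # hypothetical predicted probabilities

loss = cross_entropy(y, p)  # small here, since p agrees well with y
```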

Why Cross-entropy?

Information Theory Perspective

In information theory, the cross-entropy $H[P \,\|\, \hat{P}]$ measures the difference between the true probability distribution $P$ and the predicted distribution $\hat{P}$.

For discrete data, it is defined as:

$$H[P \,\|\, \hat{P}] := -\sum_{i=1}^{m} P(y^{(i)}) \log \hat{P}(y^{(i)})$$

Here:

  • $P(y^{(i)}) = y^{(i)}$ (the true label, either 1 or 0).
  • $\hat{P}(y^{(i)}) = p^{(i)}$ (the predicted probability).

Logistic regression predicts the probability $p^{(i)} \in (0, 1)$ for each sample, while the true probability $y^{(i)} \in \{0, 1\}$ is binary. Minimizing the cross-entropy ensures that the predicted probabilities align closely with the true labels.
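
For instance, for a positive sample ($y^{(i)} = 1$) the per-sample loss reduces to $-\log p^{(i)}$, which approaches 0 as $p^{(i)} \to 1$ and grows without bound as $p^{(i)} \to 0$; for a negative sample ($y^{(i)} = 0$) it symmetrically reduces to $-\log(1 - p^{(i)})$. Confident but wrong predictions are therefore penalized heavily.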

Statistical Perspective

From a statistical perspective, logistic regression uses maximum likelihood estimation (MLE) to estimate the model parameters $w_j$ and $b$.

The observed labels $y^{(i)}$ are modeled using a Bernoulli distribution:

$$f(y^{(i)}) = (p^{(i)})^{y^{(i)}} (1 - p^{(i)})^{1 - y^{(i)}}, \qquad y^{(i)} \in \{0, 1\}$$

The log-likelihood for a single observation is:

$$\log f(y^{(i)}) = \log\left[(p^{(i)})^{y^{(i)}} (1 - p^{(i)})^{1 - y^{(i)}}\right] = y^{(i)} \log p^{(i)} + (1 - y^{(i)}) \log(1 - p^{(i)}) = -L^{(i)}$$

Thus, minimizing the cross-entropy loss is equivalent to maximizing the likelihood of the observed labels.
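
To tie the pieces together, here is a minimal batch gradient-descent sketch that minimizes the average cross-entropy loss. The gradient expressions $\partial L / \partial w = \frac{1}{m} X^T (p - y)$ and $\partial L / \partial b = \frac{1}{m} \sum_i (p^{(i)} - y^{(i)})$ follow from differentiating $L$ (the derivation is not shown in this post), and the toy data is hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Fit w and b by batch gradient descent on the cross-entropy loss."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)      # predicted probabilities for all samples
        grad_w = X.T @ (p - y) / m  # dL/dw
        grad_b = np.mean(p - y)     # dL/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: two features, four samples.
X = np.array([[0.5, 1.2], [1.1, 0.4], [-0.7, -2.0], [0.3, -1.5]])
y = np.array([1, 1, 0, 0])

w, b = fit_logistic(X, y)
p = sigmoid(X @ w + b)  # probabilities should move toward the labels y
```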
