Logistic Regression
🎯 Binary Classification with Probability Predictions
Logistic Regression is the go-to algorithm for binary classification problems. Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts the probability that an instance belongs to a class.
What is Logistic Regression?
Logistic Regression models the probability of a binary outcome (0 or 1) using the logistic function. It transforms the linear combination of inputs into a probability value between 0 and 1.
Core Idea: Apply a sigmoid function to the output of a linear model to squash it into the range (0, 1)
The Sigmoid Function
The sigmoid (logistic) function is the heart of logistic regression. It maps any real-valued input to a value between 0 and 1.
Sigmoid Equation
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Properties:
• Output range: (0, 1)
• $\sigma(0) = 0.5$ (decision boundary)
• S-shaped curve (smooth transition)
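The properties above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the branch on the sign of `z` is an implementation choice made here to avoid floating-point overflow, not something the formula itself requires.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), computed without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) so that
    # math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-5.0, 0.0, 5.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

The outputs for `-5.0` and `5.0` sum to 1, which reflects the symmetry $\sigma(-z) = 1 - \sigma(z)$ of the S-shaped curve.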
Odds and Log-Odds
Understanding odds is crucial for interpreting logistic regression coefficients.
Odds
The ratio of the probability that an event happens to the probability that it does not.
$$\text{Odds} = \frac{p}{1 - p}$$
Log-Odds (Logit)
The natural logarithm of the odds. Logistic regression models it as a linear function of the input features.
$$\log\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$
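To make the two definitions concrete, here is a small sketch that converts a probability to odds and to log-odds. The function names `odds` and `logit` are chosen for this illustration only.

```python
import math

def odds(p: float) -> float:
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def logit(p: float) -> float:
    """Log-odds of an event with probability p (requires 0 < p < 1)."""
    return math.log(odds(p))

for p in (0.2, 0.5, 0.8):
    print(f"p = {p:.1f}  odds = {odds(p):.3f}  log-odds = {logit(p):+.3f}")
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0, and probabilities that mirror each other around 0.5 (such as 0.2 and 0.8) have log-odds of equal magnitude and opposite sign.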
How Logistic Regression Works
- Linear Combination: $z = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$
- Apply Sigmoid: $P(y=1\mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}$
- Calculate Cost: Cross-entropy loss measures prediction error
- Gradient Descent: Iteratively update coefficients to minimize cost
- Decision Threshold: Default 0.5 (tunable for different use cases)
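The prediction side of these steps (linear combination, sigmoid, threshold) can be sketched as follows. The coefficient values and the input are hypothetical numbers picked for illustration; in practice they would come from training.

```python
import math

def predict_proba(x, weights, bias):
    """Steps 1 and 2: linear combination followed by the sigmoid."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x, weights, bias, threshold=0.5):
    """Step 5: turn the probability into a class using a decision threshold."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0

# Hypothetical coefficients, for illustration only
weights = [0.8, -0.4]
bias = -0.2
x = [2.0, 1.0]  # z = -0.2 + 0.8*2.0 - 0.4*1.0, approximately 1.0

p = predict_proba(x, weights, bias)
print(f"P(y=1 | x) = {p:.4f}")
print("label at threshold 0.5:", predict_label(x, weights, bias))
print("label at threshold 0.9:", predict_label(x, weights, bias, threshold=0.9))
```

Raising the threshold makes the model more conservative about predicting the positive class, which is the kind of tuning the last step refers to.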
Derivation
From Sigmoid to Log-Odds
Let's say the probability of success is $p$ and it is given by the sigmoid function:
$$p = \frac{1}{1 + e^{-z}}$$
Then the probability of failure is:
$$1 - p = 1 - \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{z}}$$
Now odds are:
$$\text{odds} = \frac{p}{1 - p} = \frac{\frac{1}{1 + e^{-z}}}{\frac{1}{1 + e^{z}}} = e^{z}$$
Taking log of the odds gives:
$$\log(\text{odds}) = \log(e^{z}) = z$$
So log-odds are linear in the input features, where:
$$z = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$
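The derivation can be verified numerically. The sketch below checks that the odds equal $e^{z}$ for a few values of $z$, and then illustrates a consequence for interpreting coefficients: increasing $x_1$ by one unit adds $b_1$ to $z$, so it multiplies the odds by $e^{b_1}$. The values of `b0` and `b1` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Check odds = e^z and log(odds) = z for a few values of z
for z in (-2.0, 0.0, 1.5):
    p = sigmoid(z)
    odds = p / (1.0 - p)
    print(f"z = {z:+.1f}  odds = {odds:.4f}  e^z = {math.exp(z):.4f}")

# Hypothetical one-feature model: z = b0 + b1 * x1
b0, b1 = -1.0, 0.7

def odds_at(x1):
    p = sigmoid(b0 + b1 * x1)
    return p / (1.0 - p)

# Moving x1 from 2 to 3 multiplies the odds by e^{b1}
odds_ratio = odds_at(3.0) / odds_at(2.0)
print(f"odds ratio = {odds_ratio:.4f}  e^b1 = {math.exp(b1):.4f}")
```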
Probabilities and Likelihood
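Let $h(x) = \sigma(w^\top x + b)$ be the model's predicted probability that $y = 1$ given $x$ and weights $w$ (here $w$ collects the coefficients $b_1, \dots, b_n$ and $b$ is the intercept $b_0$):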
$$p(y=1\mid x, w) = h(x)$$
$$p(y=0\mid x, w) = 1 - h(x)$$
Since $y$ follows a Bernoulli distribution, the likelihood for one example is:
$$p(y\mid x, w) = h(x)^y (1 - h(x))^{1-y}$$
For the full dataset:
$$L(w) = \prod_{i=1}^{m} h(x_i)^{y_i} (1 - h(x_i))^{1-y_i}$$
Taking log of the likelihood gives the log-likelihood:
$$\ell(w) = \sum_{i=1}^{m} \left[y_i \log(h(x_i)) + (1 - y_i) \log(1 - h(x_i))\right]$$
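Maximizing $\ell(w)$ is equivalent to minimizing its negative. The averaged negative log-likelihood, $J(w) = -\frac{1}{m}\ell(w)$, is the cross-entropy loss that training minimizes.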
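As a worked illustration, the sketch below evaluates the log-likelihood sum for a tiny dataset and reports its negated mean, the quantity usually called cross-entropy or log loss. The labels and predicted probabilities are made-up numbers, not results from a fitted model.

```python
import math

def log_likelihood(y_true, y_prob):
    """Sum over examples of y*log(h) + (1 - y)*log(1 - h)."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    )

def cross_entropy(y_true, y_prob):
    """Negated mean log-likelihood; lower is better."""
    return -log_likelihood(y_true, y_prob) / len(y_true)

# Made-up labels and two sets of predicted probabilities
y_true = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]   # mostly agrees with the labels
poor = [0.4, 0.6, 0.3, 0.5]        # mostly disagrees with the labels

print(f"confident predictions: {cross_entropy(y_true, confident):.4f}")
print(f"poor predictions:      {cross_entropy(y_true, poor):.4f}")
```

Predictions that put high probability on the correct class give a log-likelihood closer to zero and therefore a smaller loss.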
Gradient Descent
For a single data point $(x_i, y_i)$, the cost is the negative log-likelihood of that point:
$$E = -\left[y_i \log(h(x_i)) + (1 - y_i) \log(1 - h(x_i))\right]$$
Now compute the gradient with respect to the weights:
$$\frac{\partial E}{\partial w} = -\left[y_i \frac{\partial}{\partial w} \log(h(x_i)) + (1 - y_i) \frac{\partial}{\partial w} \log(1 - h(x_i))\right]$$
$$= -\left[y_i \frac{1}{h(x_i)} \frac{\partial h(x_i)}{\partial w} - (1 - y_i) \frac{1}{1 - h(x_i)} \frac{\partial h(x_i)}{\partial w}\right]$$
With $z = w^\top x_i + b$, the chain rule applied to the sigmoid gives:
$$h(x_i) = \frac{1}{1 + e^{-z}}, \qquad \frac{\partial h(x_i)}{\partial w} = h(x_i)(1 - h(x_i)) x_i$$
Substituting this gives:
$$\frac{\partial E}{\partial w} = -\left[y_i (1 - h(x_i)) x_i - (1 - y_i) h(x_i) x_i\right]$$
$$= (h(x_i) - y_i) x_i$$
This is the gradient for a single example. The update rules become:
$$w_{new} = w_{old} - \alpha (h(x_i) - y_i) x_i$$
$$b_{new} = b_{old} - \alpha (h(x_i) - y_i)$$
Here $\alpha$ is the learning rate controlling the step size of gradient descent.
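The sketch below does two things under simple assumptions: it compares the analytic gradient $(h(x_i) - y_i) x_i$ with a finite-difference estimate, and it applies the single-example update rule repeatedly to a toy one-feature dataset. The data, the learning rate of 0.1, and the 200 passes are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def loss(w, b, x, y):
    """Negative log-likelihood of a single example."""
    h = predict(w, b, x)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

def gradient(w, b, x, y):
    """Analytic gradient: (h - y) * x for the weights, (h - y) for the bias."""
    error = predict(w, b, x) - y
    return [error * xj for xj in x], error

def sgd_step(w, b, x, y, alpha):
    grad_w, grad_b = gradient(w, b, x, y)
    return [wj - alpha * gj for wj, gj in zip(w, grad_w)], b - alpha * grad_b

# 1) Gradient check at an arbitrary point using central differences
w_chk, b_chk, x_chk, y_chk = [0.3, -0.2], 0.1, [1.5, 2.0], 1
grad_w, grad_b = gradient(w_chk, b_chk, x_chk, y_chk)
eps = 1e-6
numeric_w0 = (
    loss([w_chk[0] + eps, w_chk[1]], b_chk, x_chk, y_chk)
    - loss([w_chk[0] - eps, w_chk[1]], b_chk, x_chk, y_chk)
) / (2 * eps)
print(f"analytic dE/dw0 = {grad_w[0]:.6f}, numeric = {numeric_w0:.6f}")

# 2) Toy training run: label is 1 when the single feature is large
data = [([0.5], 0), ([1.0], 0), ([1.5], 0), ([3.0], 1), ([3.5], 1), ([4.0], 1)]
w, b = [0.0], 0.0
loss_before = sum(loss(w, b, x, y) for x, y in data)
for _ in range(200):
    for x, y in data:
        w, b = sgd_step(w, b, x, y, alpha=0.1)
loss_after = sum(loss(w, b, x, y) for x, y in data)
print(f"total loss before = {loss_before:.4f}, after = {loss_after:.4f}")
```

Updating after every example, as here, is stochastic gradient descent; averaging the gradient over all examples before each update gives the batch version.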
Real-World Applications
- Email spam detection (spam vs. not spam)
- Medical diagnosis (disease vs. healthy)
- Credit approval (approve vs. reject)
- Customer churn prediction
- Sentiment analysis (positive vs. negative)
NumPy Scratch Implementation
From Scratch with NumPy:
    import numpy as np
    import pandas as pd

    def fit(x, y, learning_rate=0.01, epoch=10):
        # Convert inputs to arrays; y becomes a column vector of shape (m, 1)
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        # Number of samples (m) and features (n)
        m, n = x.shape
        # Initialize parameters: small random weights, zero bias
        w = np.random.randn(n, 1) * 0.01
        b = np.zeros(1)
        result = pd.DataFrame()
        loss = None  # stays None if epoch == 0

        # Helper functions for sigmoid and loss
        def sigmoid(z):
            return 1.0 / (1 + np.exp(-z))

        def loss_fn(y_true, y_pred):
            # Clip predictions so that log(0) never occurs
            eps = 1e-15
            y_pred = np.clip(y_pred, eps, 1 - eps)
            return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

        # Training loop (full-batch gradient descent)
        for ep in range(epoch):
            print("Epoch:", ep, end="")
            # Forward pass
            z = np.dot(x, w) + b
            y_pred = sigmoid(z)
            # Record the labels and this epoch's pre-update predictions
            result["y_true"] = y.flatten()
            result["y_pred"] = y_pred.flatten()
            # Gradients of the mean cross-entropy loss (dw, db)
            dw = (1. / m) * np.dot(x.T, (y_pred - y))
            db = (1. / m) * np.sum(y_pred - y)
            # Update the parameters using the provided learning_rate (alpha)
            w = w - learning_rate * dw
            b = b - learning_rate * db
            # Recompute predictions with the updated parameters and report the loss
            y_pred = sigmoid(np.dot(x, w) + b)
            loss = loss_fn(y, y_pred)
            print(" Loss:", loss)

        print("Final Loss =", loss)
        print("W:{}, b={}".format(w.flatten(), b))
        # Return the trained weights, bias, and the results-tracking DataFrame
        return w, b, result
Using scikit-learn
Quick Implementation:
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # Assumes X_train, X_test, y_train, y_test already exist,
    # e.g. produced by sklearn.model_selection.train_test_split
    # Create and train model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    # Make predictions
    y_pred = model.predict(X_test)              # hard class labels
    y_pred_proba = model.predict_proba(X_test)  # one column per class, ordered as in model.classes_
    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
Advantages & Disadvantages
✓ Advantages:
• Fast and efficient
• Probabilistic predictions
• Highly interpretable
• Works well with small to medium datasets
✗ Disadvantages:
• Assumes linear decision boundary
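• Benefits from feature scaling (gradient descent and regularized fits converge faster on scaled features)
• Binary by default; multiclass problems need an extension such as multinomial (softmax) or one-vs-rest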
Evaluation Metrics
- Accuracy: Overall correctness (use with balanced data)
- Precision: True positives / All predicted positives
- Recall: True positives / All actual positives
- F1-Score: Harmonic mean of precision and recall
- ROC-AUC: Area under the ROC curve; measures how well predicted probabilities rank positives above negatives across all thresholds
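The metrics listed above can be computed by hand from labels and predictions. The following is a minimal sketch with made-up data; the AUC here is computed as the fraction of positive/negative pairs that the scores rank correctly, which is one standard way to define it, and the function names are chosen for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels, hard predictions, and probability scores
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc([1, 0, 1, 0], [0.9, 0.3, 0.4, 0.6])
print(metrics)
print("ROC-AUC:", auc)
```

The pairwise AUC loop is quadratic in the number of examples, so it is only suitable for small illustrations like this one.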