Logistic Regression

🎯 Binary Classification with Probability Predictions

Logistic Regression is the go-to algorithm for binary classification problems. Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts the probability that an instance belongs to a class.

What is Logistic Regression?

Logistic Regression models the probability of a binary outcome (0 or 1) using the logistic function. It transforms the linear combination of inputs into a probability value between 0 and 1.

Core Idea: Apply a sigmoid function to the output of a linear model to squash it into the range (0, 1)

The Sigmoid Function

The sigmoid (logistic) function is the heart of logistic regression. It maps any real-valued input to a value between 0 and 1.

Sigmoid Equation

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Properties:
• Output range: (0, 1)
• $\sigma(0) = 0.5$ (decision boundary)
• S-shaped curve (smooth transition)
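
The properties above can be checked numerically. The following is a minimal sketch using NumPy; the input values in `z` are arbitrary and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative inputs spanning negative and positive values
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
p = sigmoid(z)

print(p)                       # every value lies strictly between 0 and 1
print(sigmoid(0.0))            # 0.5, the midpoint of the curve
print(np.all(np.diff(p) > 0))  # True: outputs increase as z increases
```

Large negative inputs map close to 0 and large positive inputs map close to 1, which is what produces the S shape.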

Odds and Log-Odds

Understanding odds is crucial for interpreting logistic regression coefficients.

Odds

The ratio of the probability that an event happens to the probability that it does not.

$$\text{Odds} = \frac{p}{1 - p}$$

Log-Odds (Logit)

The natural logarithm of the odds. Under the logistic regression model, it is linear in the input features.

$$\log\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$
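
To make the link to coefficient interpretation concrete, here is a small sketch. The probability `0.8` and the coefficients `b0` and `b1` are hypothetical values chosen for illustration. Because the log-odds are linear in the features, raising one feature by one unit (with the others held fixed) adds its coefficient to the log-odds, which multiplies the odds by `exp(b1)`.

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds (the logit)."""
    return math.log(odds(p))

p = 0.8
print(odds(p))      # approximately 4: the event is about four times as likely to happen as not
print(log_odds(p))  # approximately log(4)

# Hypothetical one-feature model: log-odds = b0 + b1 * x
b0, b1 = -1.0, 0.7
x = 2.0
odds_before = math.exp(b0 + b1 * x)
odds_after = math.exp(b0 + b1 * (x + 1.0))

# The ratio of the two odds equals exp(b1), regardless of the starting x
print(odds_after / odds_before, math.exp(b1))
```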

How Logistic Regression Works

  1. Linear Combination: $z = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$
  2. Apply Sigmoid: $P(y=1\mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}$
  3. Calculate Cost: Cross-entropy loss measures prediction error
  4. Gradient Descent: Iteratively update coefficients to minimize cost
  5. Decision Threshold: Default 0.5 (tunable for different use cases)
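
The five steps above can be traced for a single example. This is only a sketch: the coefficients, the feature vector, the label, and the learning rate `alpha` are all hypothetical values picked for illustration.

```python
import numpy as np

# Hypothetical coefficients, features, and label
b0 = -1.0
b = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])
y = 1

def predict_proba(b0, b, x):
    z = b0 + np.dot(b, x)             # 1. linear combination
    return 1.0 / (1.0 + np.exp(-z))   # 2. sigmoid

def cross_entropy(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))  # 3. cost

p = predict_proba(b0, b, x)
loss_before = cross_entropy(y, p)

# 4. one gradient descent step with a hypothetical learning rate
alpha = 0.1
b_new = b - alpha * (p - y) * x
b0_new = b0 - alpha * (p - y)
loss_after = cross_entropy(y, predict_proba(b0_new, b_new, x))

# 5. apply the default decision threshold
label = int(p >= 0.5)

print(p, label)
print(loss_before, loss_after)  # the loss on this example drops after the step
```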

Derivation

From Sigmoid to Log-Odds

Let's say the probability of success is $p$ and it is given by the sigmoid function:

$$p = \frac{1}{1 + e^{-z}}$$

Then the probability of failure is:

$$1 - p = 1 - \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{z}}$$

Now odds are:

$$\text{odds} = \frac{p}{1 - p} = \frac{\frac{1}{1 + e^{-z}}}{\frac{1}{1 + e^{z}}} = e^{z}$$

Taking log of the odds gives:

$$\log(\text{odds}) = \log(e^{z}) = z$$

So log-odds are linear in the input features, where:

$$z = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$
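
Each identity in this derivation can be verified numerically. The sketch below evaluates both sides on an arbitrary grid of `z` values and compares them with `np.allclose`.

```python
import numpy as np

# Arbitrary grid of z values for the check
z = np.linspace(-5.0, 5.0, 11)

p = 1.0 / (1.0 + np.exp(-z))  # probability of success
q = 1.0 - p                   # probability of failure

failure_identity = np.allclose(q, 1.0 / (1.0 + np.exp(z)))  # 1 - p = 1 / (1 + e^z)
odds_identity = np.allclose(p / q, np.exp(z))               # odds = e^z
log_odds_identity = np.allclose(np.log(p / q), z)           # log(odds) = z

print(failure_identity, odds_identity, log_odds_identity)
```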

Probabilities and Likelihood

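Assume the probability of $y = 1$ given $x$ and weights $w$ is the model output $h(x) = \sigma(w^\top x + b)$, where the weights $w$ and bias $b$ play the role of $b_1, \ldots, b_n$ and $b_0$ above: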

$$p(y=1\mid x, w) = h(x)$$

$$p(y=0\mid x, w) = 1 - h(x)$$

Since $y$ follows a Bernoulli distribution, the likelihood for one example is:

$$p(y\mid x, w) = h(x)^y (1 - h(x))^{1-y}$$

For the full dataset:

$$L(w) = \prod_{i=1}^{m} h(x_i)^{y_i} (1 - h(x_i))^{1-y_i}$$

Taking log of the likelihood gives the log-likelihood:

$$\ell(w) = \sum_{i=1}^{m} \left[y_i \log(h(x_i)) + (1 - y_i) \log(1 - h(x_i))\right]$$

Maximizing this log-likelihood is equivalent to minimizing its negative, $-\ell(w)$, which (averaged over the $m$ examples) is the cross-entropy loss.
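
As a small worked example, the sketch below computes the likelihood, the log-likelihood, and the mean negative log-likelihood (the quantity usually reported as binary cross-entropy) for four examples. The labels and predicted probabilities are made up for illustration.

```python
import numpy as np

# Hypothetical labels and predicted probabilities h(x_i)
y = np.array([1, 0, 1, 1])
h = np.array([0.9, 0.2, 0.7, 0.6])

# Product of per-example Bernoulli likelihoods: 0.9 * 0.8 * 0.7 * 0.6
likelihood = np.prod(h**y * (1 - h)**(1 - y))

# Sum of per-example log-likelihood terms
log_likelihood = np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Mean negative log-likelihood
cross_entropy = -log_likelihood / len(y)

print(likelihood)                          # approximately 0.3024
print(np.log(likelihood), log_likelihood)  # the two agree
print(cross_entropy)
```

Working with the sum of logs rather than the raw product avoids numerical underflow when $m$ is large, since a product of many probabilities quickly approaches zero.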

Gradient Descent

For a single data point $(x_i, y_i)$, the cost is the negative log-likelihood:

$$E = -\left[y_i \log(h(x_i)) + (1 - y_i) \log(1 - h(x_i))\right]$$

Now compute the gradient with respect to the weights:

$$\frac{\partial E}{\partial w} = -y_i \frac{\partial}{\partial w} \log(h(x_i)) - (1 - y_i) \frac{\partial}{\partial w} \log(1 - h(x_i))$$

$$= -y_i \frac{1}{h(x_i)} \frac{\partial h(x_i)}{\partial w} + (1 - y_i) \frac{1}{1 - h(x_i)} \frac{\partial h(x_i)}{\partial w}$$

With $z = w^\top x_i + b$, the chain rule applied to the sigmoid gives:

$$h(x_i) = \frac{1}{1 + e^{-z}}, \qquad \frac{\partial h(x_i)}{\partial w} = h(x_i)(1 - h(x_i)) x_i$$

Substituting this gives:

$$\frac{\partial E}{\partial w} = -y_i (1 - h(x_i)) x_i + (1 - y_i) h(x_i) x_i$$

$$= (h(x_i) - y_i) x_i$$

This is the gradient for a single example. The update rules become:

$$w_{new} = w_{old} - \alpha (h(x_i) - y_i) x_i$$

$$b_{new} = b_{old} - \alpha (h(x_i) - y_i)$$

Here $\alpha$ is the learning rate controlling the step size of gradient descent.
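
One way to sanity-check the gradient $(h(x_i) - y_i) x_i$ is to compare it with a finite-difference approximation of the single-example cost, taken here as the negative log-likelihood $-[y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i))]$. The sketch below does this with NumPy; the weights, bias, features, and label are hypothetical, and the helper names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Negative log-likelihood of one example."""
    h = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_grad(w, b, x, y):
    """Closed-form gradient: (h - y) * x for the weights, (h - y) for the bias."""
    h = sigmoid(np.dot(w, x) + b)
    return (h - y) * x, (h - y)

def numeric_grad(w, b, x, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    gw = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        gw[j] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * eps)
    gb = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
    return gw, gb

# Hypothetical parameters and a single training example
w = np.array([0.5, -0.3])
b = 0.1
x = np.array([1.5, 2.0])
y = 1

gw_a, gb_a = analytic_grad(w, b, x, y)
gw_n, gb_n = numeric_grad(w, b, x, y)
print(np.allclose(gw_a, gw_n, atol=1e-6), np.isclose(gb_a, gb_n, atol=1e-6))
```

If the two gradients agree to within numerical tolerance, the update rules can be applied with some confidence that the sign and the scaling are right.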

Real-World Applications

  • Email spam detection (spam vs. not spam)
  • Medical diagnosis (disease vs. healthy)
  • Credit approval (approve vs. reject)
  • Customer churn prediction
  • Sentiment analysis (positive vs. negative)

NumPy Scratch Implementation

From Scratch with NumPy:

import numpy as np
import pandas as pd

def fit(x, y, learning_rate=0.01, epoch=10):
    # Determine the number of samples (m) and features (n)
    m, n = x.shape
    
    # Initialize parameters 
    w = np.random.randn(n, 1) * 0.01
    b = np.zeros(1)
    
    # Ensure y is numpy array of shape (m, 1)
    y = np.array(y).reshape(-1, 1)
    result = pd.DataFrame()
    
    # Helper functions for sigmoid and loss
    def sigmoid(z):
        return 1.0 / (1 + np.exp(-z))
        
    def loss_fn(y_true, y_pred):
        eps = 1e-15
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    # Training Loop
    for ep in range(epoch):
        print("Epoch:", ep, end="")
        
        # Forward pass
        z = np.dot(x, w) + b
        y_pred = sigmoid(z)
        
        # Storing data to DataFrame
        result["y_true"] = y.flatten()
        result["y_pred"] = y_pred.flatten()
        
        # Calculate gradients (dw, db)
        dw = (1. / m) * np.dot(x.T, (y_pred - y))
        db = (1. / m) * np.sum(y_pred - y)
        
        # Update the parameters using the provided learning_rate (alpha)
        w = w - learning_rate * dw
        b = b - learning_rate * db
        
        # Recompute predictions with the updated parameters to report the post-update loss
        y_pred = sigmoid(np.dot(x, w) + b)
        loss = loss_fn(y, y_pred)
        print(" Loss:", loss)
        
    print("Final Loss:", loss)
    print("W:{}, b={}".format(w.flatten(), b))
    
    # Return the trained weights, bias, and the results tracking DataFrame
    return w, b, result

Using scikit-learn

Quick Implementation:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

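# Create and train model
# Assumes X_train, X_test, y_train, y_test already exist (e.g. from train_test_split)
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)              # hard class labels
y_pred_proba = model.predict_proba(X_test)  # class probabilities, one column per class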

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
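
For an end-to-end run, the sketch below builds a synthetic dataset with `make_classification` and splits it with `train_test_split`. The dataset, its sizes, the `random_state` values, and the custom `threshold` of 0.3 are assumptions made only so the example is self-contained; they are not part of the original snippet.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)              # labels using the default 0.5 threshold
y_pred_proba = model.predict_proba(X_test)  # columns: P(y=0), P(y=1)

print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))

# Lowering the threshold (hypothetical value) trades precision for recall
threshold = 0.3
y_pred_custom = (y_pred_proba[:, 1] >= threshold).astype(int)
print("recall at 0.3:", recall_score(y_test, y_pred_custom))
```

A lower threshold can only add predicted positives, so recall never decreases, while precision may fall. This is the kind of tuning the decision-threshold step refers to.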

Advantages & Disadvantages

✓ Advantages:
• Fast and efficient
• Probabilistic predictions
• Highly interpretable
• Works well with small to medium datasets

✗ Disadvantages:
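• Assumes a linear decision boundary in the input features
• Benefits from feature scaling when trained with gradient descent or regularization
• Binary by default; multiclass problems need an extension such as one-vs-rest or softmax (multinomial) regression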

Evaluation Metrics

  • Accuracy: Fraction of correct predictions (can mislead on imbalanced data)
  • Precision: True positives / All predicted positives
  • Recall: True positives / All actual positives
  • F1-Score: Harmonic mean of precision and recall
  • ROC-AUC: Area under the ROC curve, summarizing performance across all thresholds
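
The threshold-dependent metrics above follow directly from the four confusion-matrix counts. Here is a minimal sketch with NumPy; the label arrays are made up for illustration.

```python
import numpy as np

# Hypothetical true labels and predicted labels
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```

In this toy case all four metrics come out to 0.75 because the counts are symmetric (3 true positives, 3 true negatives, 1 false positive, 1 false negative); on imbalanced data they typically diverge.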
