Logistic Regression Guide | Saurav Acharya

Logistic Regression is a standard first choice for binary classification problems. Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts the probability that an instance belongs to a class.

What is Logistic Regression?

Logistic Regression models the probability of a binary outcome (0 or 1) using the logistic function. It transforms the linear combination of inputs into a probability value between 0 and 1.

Core Idea: Apply a sigmoid function to the output of a linear model to squash it into the range between 0 and 1.

The Sigmoid Function

The sigmoid (logistic) function is the heart of logistic regression. It maps any real-valued input to a value between 0 and 1.

Sigmoid Equation

\sigma(z) = \frac{1}{1 + e^{-z}}

As the input $z$ increases, the sigmoid output rises smoothly from near 0 to near 1, crossing 0.5 at $z=0$.

Properties:
• Output range: (0, 1)
• $\sigma(0) = 0.5$ (the decision boundary at the default threshold of 0.5)
• S-shaped curve (smooth transition)
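
The properties above can be checked numerically. The following is a minimal sketch using only Python's standard library. The two-branch form is one common way to avoid overflow in `math.exp` for inputs of large magnitude; it is an implementation choice, not part of the definition.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form e^z / (1 + e^z)
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Outputs rise smoothly from near 0 to near 1 and pass through 0.5 at z = 0
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

The printed values are symmetric around 0.5: `sigmoid(z) + sigmoid(-z)` equals 1 for every `z`.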

Odds and Log-Odds

Understanding odds is crucial for interpreting logistic regression coefficients.

Odds

The ratio of the probability that an event happens to the probability that it does not.

\text{Odds} = \frac{p}{1 - p}

Log-Odds (Logit)

The natural logarithm of the odds. Logistic regression models it as a linear function of the input features.

\log\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n
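
To make these definitions concrete, here is a small sketch that converts a probability to odds and log-odds, and shows why coefficients are read as odds multipliers: raising $x_1$ by one unit adds $b_1$ to the log-odds, which multiplies the odds by $e^{b_1}$. The coefficient values `b0` and `b1` are made-up numbers for illustration only.

```python
import math

def odds(p: float) -> float:
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def logit(p: float) -> float:
    """Log-odds of an event with probability p."""
    return math.log(odds(p))

def prob_from_logit(z: float) -> float:
    """Inverse of logit: the sigmoid function."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
print(odds(p))    # about 4: the event is four times as likely to happen as not
print(logit(p))   # ln(4), about 1.386

# Hypothetical coefficients for a one-feature model (illustrative values only)
b0, b1 = -1.0, 0.7

odds_at_x = odds(prob_from_logit(b0 + b1 * 2.0))
odds_at_x_plus_1 = odds(prob_from_logit(b0 + b1 * 3.0))

# The ratio of the two odds matches e^{b1}, whatever the starting x
print(odds_at_x_plus_1 / odds_at_x, math.exp(b1))
```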

How Logistic Regression Works

Linear Combination: $z = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$
Apply Sigmoid: $P(y=1\mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}$
Calculate Cost: Cross-entropy loss measures prediction error
Gradient Descent: Iteratively update coefficients to minimize cost
Decision Threshold: Default 0.5 (tunable for different use cases)
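
The Linear Combination, Apply Sigmoid and Decision Threshold steps can be sketched for a single example as follows. The coefficients and feature values are invented for illustration; in practice they come from training. The cost and gradient descent steps are not shown here.

```python
import math

def predict_proba(x, coeffs, intercept):
    """Return P(y=1 | x) for one example."""
    # Linear combination: z = b_0 + b_1*x_1 + ... + b_n*x_n
    z = intercept + sum(b * xi for b, xi in zip(coeffs, x))
    # Sigmoid squashes z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x, coeffs, intercept, threshold=0.5):
    """Turn the probability into a class label using a decision threshold."""
    return 1 if predict_proba(x, coeffs, intercept) >= threshold else 0

# Hypothetical trained parameters and one input example
coeffs, intercept = [0.8, -0.4], -0.5
x = [2.0, 1.0]

p = predict_proba(x, coeffs, intercept)   # z = -0.5 + 1.6 - 0.4 = 0.7
print(round(p, 3))                         # 0.668
print(predict_label(x, coeffs, intercept))                 # 1 at the default threshold
print(predict_label(x, coeffs, intercept, threshold=0.8))  # 0 with a stricter threshold
```

Raising the threshold trades recall for precision, which is why it is worth tuning when false positives and false negatives have different costs.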

Derivation

From Sigmoid to Log-Odds

Let's say the probability of success is $p$ and it is given by the sigmoid function:

p = \frac{1}{1 + e^{-z}}

Then the probability of failure is:

1 - p = 1 - \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{z}}

Now odds are:

\text{odds} = \frac{p}{1 - p} = \frac{\frac{1}{1 + e^{-z}}}{\frac{1}{1 + e^{z}}} = e^{z}

Taking log of the odds gives:

\log(\text{odds}) = \log(e^{z}) = z

So log-odds are linear in the input features, where:

z = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n
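
As a quick sanity check of the derivation, the identities $1 - \sigma(z) = \sigma(-z)$, $\text{odds} = e^{z}$ and $\log(\text{odds}) = z$ can be verified numerically. This is only a spot check at a few arbitrary values of $z$, not a proof.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    p = sigmoid(z)
    odds = p / (1.0 - p)
    # 1 - p equals 1 / (1 + e^z), which is sigmoid(-z)
    assert math.isclose(1.0 - p, sigmoid(-z), rel_tol=1e-9)
    # The odds simplify to e^z, so the log-odds recover z
    assert math.isclose(odds, math.exp(z), rel_tol=1e-9)
    assert math.isclose(math.log(odds), z, rel_tol=1e-9, abs_tol=1e-9)
    print(f"z={z:+.1f}  p={p:.4f}  odds={odds:.4f}  log-odds={math.log(odds):+.4f}")
```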

Probabilities and Likelihood

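Let $y$ be the target class and let $h(x) = \sigma(z)$ denote the model's output for input $x$ with parameters $w$. The model then assigns the following probabilities to the two classes.

Probability that the example belongs to class 1:

p(y=1\mid x, w) = h(x)

Probability that the example belongs to class 0:

p(y=0\mid x, w) = 1 - h(x)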

Since $y$ follows a Bernoulli distribution, the likelihood for one example is:

p(y\mid x, w) = h(x)^y (1 - h(x))^{1-y}

For the full dataset of $m$ examples, assumed to be independent, the likelihood is the product over all examples:

L(w) = \prod_{i=1}^{m} h(x_i)^{y_i} (1 - h(x_i))^{1-y_i}

Taking log of the likelihood gives the log-likelihood:

\ell(w) = \sum_{i=1}^{m} \left[y_i \log(h(x_i)) + (1 - y_i) \log(1 - h(x_i))\right]

The error function is the negative of the log-likelihood:

J(w) = -\ell(w) = -\sum_{i=1}^{m} \left[y_i \log(h(x_i)) + (1 - y_i) \log(1 - h(x_i))\right]

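This is the cross-entropy loss function we want to minimize. In practice it is often divided by $m$ so that the loss is an average per example, as the NumPy implementation later in this guide does with `np.mean`.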
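
The relationship between likelihood, log-likelihood and cross-entropy can be illustrated on a tiny dataset. The labels and predicted probabilities below are made up; `h` stands for the model outputs $h(x_i)$.

```python
import math

# Hypothetical labels y_i and model outputs h(x_i) for m = 4 examples
y = [1, 0, 1, 1]
h = [0.9, 0.2, 0.6, 0.3]

# Likelihood: product of h^y * (1 - h)^(1 - y) over all examples
likelihood = math.prod(hi**yi * (1 - hi)**(1 - yi) for yi, hi in zip(y, h))

# Log-likelihood: the product becomes a sum of logs
log_likelihood = sum(
    yi * math.log(hi) + (1 - yi) * math.log(1 - hi) for yi, hi in zip(y, h)
)

# Cross-entropy loss J(w) is the negative log-likelihood
J = -log_likelihood

print(likelihood)                # 0.9 * 0.8 * 0.6 * 0.3, about 0.1296
print(J, -math.log(likelihood))  # the two values agree up to rounding
```

Working with the sum of logs rather than the raw product also avoids numerical underflow, since a product of many probabilities quickly becomes too small to represent.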

Gradient Descent

For a single data point $(x_i, y_i)$, the cost is:

E = - y_i \log(h(x_i)) - (1 - y_i) \log(1 - h(x_i))

Now compute the gradient with respect to the weights:

\frac{\partial E}{\partial w} = - y_i \frac{d}{dw} \log(h(x_i)) - (1 - y_i) \frac{d}{dw} \log(1 - h(x_i))

= -y_i \frac{1}{h(x_i)} \frac{dh(x_i)}{dw} + (1 - y_i) \frac{1}{1 - h(x_i)} \frac{dh(x_i)}{dw}

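By the chain rule, with $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and $\partial z / \partial w = x_i$, the derivative of the sigmoid output with respect to the weights is: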

h(x_i) = \frac{1}{1 + e^{-z}}, \qquad \frac{dh(x_i)}{dw} = h(x_i)(1 - h(x_i)) x_i

Substituting this gives:

\frac{\partial E}{\partial w} = - y_i (1 - h(x_i)) x_i + (1 - y_i) h(x_i) x_i

= (h(x_i) - y_i) x_i

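This is the gradient for a single example. Because $\partial z / \partial b = 1$, the gradient with respect to the bias $b$ (the intercept $b_0$) is simply $h(x_i) - y_i$. The update rules become:

w_{new} = w_{old} - \alpha (h(x_i) - y_i) x_i

b_{new} = b_{old} - \alpha (h(x_i) - y_i)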

Here $\alpha$ is the learning rate controlling the step size of gradient descent.
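
One way to gain confidence in the result $\partial E / \partial w = (h(x_i) - y_i) x_i$ is to compare it against a finite-difference approximation of the loss. The sketch below does this for one example and then applies a single update step. The weights, the example and the helper names `h` and `loss` are made up for illustration.

```python
import math

def h(w, b, x):
    """Model output sigma(w . x + b) for one example."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss E for one example."""
    p = h(w, b, x)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

# Hypothetical parameters and a single training example
w, b = [0.5, -0.3], 0.1
x, y = [1.5, 2.0], 1

# Analytic gradient from the derivation: (h - y) * x for w, (h - y) for b
err = h(w, b, x) - y
grad_w = [err * xj for xj in x]
grad_b = err

# Central finite differences as an independent check of grad_w
eps = 1e-6
for j in range(len(w)):
    w_plus = w.copy()
    w_plus[j] += eps
    w_minus = w.copy()
    w_minus[j] -= eps
    numeric = (loss(w_plus, b, x, y) - loss(w_minus, b, x, y)) / (2 * eps)
    assert abs(numeric - grad_w[j]) < 1e-6

# One gradient descent step with learning rate alpha
alpha = 0.1
w_new = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
b_new = b - alpha * grad_b
print(loss(w, b, x, y), "->", loss(w_new, b_new, x, y))  # the loss decreases
```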

Real-World Applications

Email spam detection (spam vs. not spam)
Medical diagnosis (disease vs. healthy)
Credit approval (approve vs. reject)
Customer churn prediction
Sentiment analysis (positive vs. negative)

NumPy Scratch Implementation

From Scratch with NumPy:

import numpy as np
import pandas as pd

def fit(x, y, learning_rate=0.01, epoch=10):
    # Determine the number of samples (m) and features (n)
    m, n = x.shape
    
    # Initialize parameters 
    w = np.random.randn(n, 1) * 0.01
    b = np.zeros(1)
    
    # Ensure y is numpy array of shape (m, 1)
    y = np.array(y).reshape(-1, 1)
    result = pd.DataFrame()
    
    # Helper functions for sigmoid and loss
    def sigmoid(z):
        return 1.0 / (1 + np.exp(-z))
        
    def loss_fn(y_true, y_pred):
        eps = 1e-15
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    # Training loop (full-batch gradient descent)
    for ep in range(epoch):
        # Forward pass
        z = np.dot(x, w) + b
        y_pred = sigmoid(z)

        # Gradients of the mean cross-entropy loss (dw, db)
        dw = (1. / m) * np.dot(x.T, (y_pred - y))
        db = (1. / m) * np.sum(y_pred - y)

        # Update the parameters using the provided learning_rate (alpha)
        w = w - learning_rate * dw
        b = b - learning_rate * db

        # Recompute predictions with the updated parameters and report the loss
        y_pred = sigmoid(np.dot(x, w) + b)
        loss = loss_fn(y, y_pred)
        print("Epoch:", ep, "Loss:", loss)

    # Final predictions and loss (computed here so they exist even if epoch == 0)
    y_pred = sigmoid(np.dot(x, w) + b)
    loss = loss_fn(y, y_pred)

    # Store the true labels next to the final predicted probabilities
    result["y_true"] = y.flatten()
    result["y_pred"] = y_pred.flatten()

    print("Final Loss =", loss)
    print("W:{}, b={}".format(w.flatten(), b))

    # Return the trained weights, bias, and the results DataFrame
    return w, b, result
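
The `fit` function returns the parameters but does not include a prediction step. A minimal sketch of one is shown below. `predict_proba` and `predict` are hypothetical helper names, not part of the original code, and they assume `w` has shape `(n, 1)` and `b` has shape `(1,)`, as returned by `fit`. The parameter values in the usage lines are invented so the block runs on its own.

```python
import numpy as np

def predict_proba(x, w, b):
    """Return P(y=1 | x) for each row of x as an array of shape (m,)."""
    z = np.dot(np.asarray(x, dtype=float), w) + b
    return (1.0 / (1.0 + np.exp(-z))).ravel()

def predict(x, w, b, threshold=0.5):
    """Return hard 0/1 labels by thresholding the probabilities."""
    return (predict_proba(x, w, b) >= threshold).astype(int)

# Stand-in parameters with the same shapes that fit() returns (illustrative values)
w = np.array([[1.0], [-2.0]])
b = np.array([0.5])
x_new = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 0.5]])

print(predict_proba(x_new, w, b))  # one probability per row
print(predict(x_new, w, b))        # [1 0 1]
```

With real data, `w` and `b` would come from a call such as `w, b, result = fit(x, y)`.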

Using scikit-learn

Quick Implementation:

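from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the snippet runs end to end; replace with your own X and y
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create and train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)              # hard class labels
y_pred_proba = model.predict_proba(X_test)  # shape (n_samples, 2): P(y=0), P(y=1)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(accuracy, precision, recall)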

Advantages & Disadvantages

✓ Advantages:
• Fast and efficient
• Probabilistic predictions
• Highly interpretable
• Works well with small to medium datasets

✗ Disadvantages:
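• The basic model handles only binary classification (multiclass problems need an extension such as one-vs-rest or softmax regression)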

Evaluation Metrics

Accuracy: Overall correctness (use with balanced data)
Precision: True positives / All predicted positives
Recall: True positives / All actual positives
F1-Score: Harmonic mean of precision and recall
ROC-AUC: Area under the ROC curve; summarizes how well the predicted probabilities rank positives above negatives across all thresholds
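
These definitions can be computed directly from the confusion-matrix counts. The sketch below uses made-up labels and only the standard library; ROC-AUC is omitted because it needs predicted scores rather than hard labels.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical ground truth and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)   # 3, 1, 1, 5
accuracy = (tp + tn) / len(y_true)                  # 0.8
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
print(accuracy, precision, recall, f1)
```

This sketch does not guard against division by zero, which occurs when there are no predicted positives (precision) or no actual positives (recall).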
