Linear Regression Guide | Saurav Acharya

Linear Regression is one of the foundational techniques of machine learning and statistics. It models the linear relationship between input features and a continuous output variable.

What is Linear Regression?

Linear Regression fits a straight line through data points to model the relationship between variables. Given input features (X) and target values (y), it finds the best-fit line that minimizes prediction errors.

$$y = mx + b$$ where $m$ = slope, $b$ = intercept, $x$ = input, $y$ = output

Linear Fit Visualization

The model (red line) is chosen to minimize the total squared vertical distance to the data points (blue dots).
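As a minimal sketch of fitting $y = mx + b$, the snippet below uses the standard single-feature least-squares formulas: the slope is the covariance of x and y divided by the variance of x. The function name `fit_line` and the toy data are illustrative assumptions, not part of the original guide.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Toy data generated from y = 2x + 1 with no noise (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)
print(m, b)
```

Because the toy data lies exactly on a line, the recovered slope and intercept should match the values used to generate it.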

The example above uses a single input, but real problems are rarely that simple. Predicting a house's price, for instance, depends on far more than one variable — square footage, number of bedrooms, age of the property, distance to the city center, and so on. Linear regression handles this by giving every feature its own weight and summing their combined effect. Instead of one slope $m$, we now have a vector of weights $\mathbf{w} = [w_1, w_2, \ldots, w_n]$, one for every feature.

For a single house with n features:

$$\mathbf{x} = [x_1, x_2, x_3, \ldots, x_n]$$
$$y = w_1x_1 + w_2x_2 + \cdots + w_nx_n + b$$

A real dataset has many houses, not just one, so the same equation gets repeated once per row in the dataset. Writing them out individually like this makes the pattern obvious, but it quickly becomes unwieldy — which is exactly the problem matrix notation solves.

For $m$ houses (the full dataset):

$$y_1 = w_1x_{11} + w_2x_{12} + \cdots + w_nx_{1n} + b$$
$$y_2 = w_1x_{21} + w_2x_{22} + \cdots + w_nx_{2n} + b$$
$$\vdots$$
$$y_m = w_1x_{m1} + w_2x_{m2} + \cdots + w_nx_{mn} + b$$

Matrix form:

$$\mathbf{y} = \mathbf{X}\mathbf{w} + b$$

This single line packs in every equation above. $\mathbf{X}$ is the feature matrix: every row is one house, every column is one feature. $\mathbf{w}$ is the weight vector the model learns, the scalar $b$ is added to every row, and $\mathbf{y}$ is the column of predicted prices for the entire dataset, computed in one matrix multiplication instead of a loop over rows. This compact form is also what makes the closed-form and gradient-based solutions below practical to implement with NumPy.
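To make the equivalence concrete, here is a small sketch that computes the predictions twice: once with one equation per house, and once with a single matrix product. The feature values, weights, and intercept are made-up numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical dataset: 3 houses (rows) x 2 features (columns),
# e.g. square footage and number of bedrooms.
X = np.array([[1000.0, 2.0],
              [1500.0, 3.0],
              [2000.0, 4.0]])
w = np.array([100.0, 5000.0])  # one illustrative weight per feature
b = 20000.0                    # illustrative intercept

# One equation per house, written out with an explicit loop.
y_loop = np.array([sum(w[j] * X[i, j] for j in range(X.shape[1])) + b
                   for i in range(X.shape[0])])

# The same predictions in matrix form: y = Xw + b.
y_matrix = X @ w + b

print(y_matrix)
```

NumPy broadcasts the scalar `b` across all rows, which is why the intercept does not need its own loop.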

Assumptions of Linear Regression

Linear regression isn't a black box — it's a statistical model, and like any statistical model it only gives trustworthy results when certain conditions about the data roughly hold. Violating these assumptions doesn't always break the model outright, but it does quietly erode the accuracy of its predictions and the validity of its coefficients, which is why checking for them is a standard first step before trusting any regression output.

Linearity: The relationship between X and y is linear
Independence: Observations are independent of each other
Homoscedasticity: Constant variance of residuals
Normality: Residuals are normally distributed
No Multicollinearity: Predictors are not highly correlated
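Some of these assumptions can be checked roughly with a few lines of NumPy. The sketch below is one possible approach on synthetic data, not a complete diagnostic suite: it looks at pairwise feature correlation as a sign of multicollinearity and compares residual spread across small and large fitted values as a crude homoscedasticity check. All variable names and the data-generating choices are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.01 * rng.normal(size=n)   # nearly a copy of x1 -> multicollinearity
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 - 2.0 * x2 + rng.normal(scale=0.5, size=n)

# Multicollinearity: look for feature pairs with |correlation| near 1.
corr = np.corrcoef(X, rowvar=False)
print("corr(x1, x3) =", round(corr[0, 2], 3))

# Residual checks: fit by least squares, then inspect the residuals.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
residuals = y - fitted

# Homoscedasticity (rough check): residual spread should be similar
# for small and large fitted values.
order = np.argsort(fitted)
low, high = residuals[order[: n // 2]], residuals[order[n // 2:]]
print("residual std (low / high fitted):", low.std(), high.std())
```

In practice these numeric checks are usually paired with plots, such as residuals against fitted values, since a single summary number can hide structure.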

How Linear Regression Works

Method of Least Squares

This is the closed-form solution: instead of searching for the best weights step by step, we solve for them directly. The weights below are the exact ones that minimize the sum of squared errors, derived by setting the gradient of the cost function to zero and solving algebraically. Here $\mathbf{X}$ is assumed to carry a leading column of ones, so the intercept $b$ is learned as the first entry of $\mathbf{w}$.

$$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

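import numpy as np

def fit(X, y):
    # Prepend a column of ones so the intercept is learned as betas[0].
    X = np.insert(X, 0, 1, axis=1)
    # Normal equation: betas = (X^T X)^{-1} X^T y
    XT_X = X.T @ X
    XT_X_inv = np.linalg.inv(XT_X)  # raises LinAlgError if X^T X is singular
    XT_y = X.T @ y
    betas = XT_X_inv @ XT_y
    # Return the intercept and the feature weights separately.
    return betas[0], betas[1:]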

It's elegant, and for small datasets it's the fastest path to an exact answer. The catch is the matrix inversion $(\mathbf{X}^T\mathbf{X})^{-1}$, which costs roughly $O(n^3)$ as the number of features $n$ grows, and simply fails when $\mathbf{X}^T\mathbf{X}$ isn't invertible. That's the gap gradient descent fills: instead of solving for the answer in one shot, it approaches it gradually, which scales far better to large datasets and large feature counts.
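Besides gradient descent, one common workaround for a singular $\mathbf{X}^T\mathbf{X}$ is to skip the explicit inverse and call a least-squares solver. The sketch below swaps in `np.linalg.lstsq` as an alternative (it is not the method used in the guide's own code), and the helper name `fit_lstsq` and the toy data are assumptions for illustration. The second feature is an exact copy of the first, which is precisely the case where the inverse does not exist.

```python
import numpy as np

def fit_lstsq(X, y):
    """Least-squares fit that avoids forming an explicit inverse."""
    A = np.column_stack([np.ones(len(X)), X])  # column of ones -> intercept
    # lstsq returns (solution, residuals, rank, singular_values).
    betas, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
    return betas[0], betas[1:], rank

# Illustrative data: the second feature is an exact copy of the first,
# so X^T X is singular and has no inverse.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([x, x])
y = 2.0 * x + 1.0

intercept, weights, rank = fit_lstsq(X, y)
y_hat = intercept + X @ weights
print(rank, intercept, weights)
```

The reported rank is lower than the number of columns, which signals the redundancy. The individual weights are not unique in that situation, but the predictions still reproduce the targets.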

Gradient Descent Method

Imagine standing on a hillside in thick fog, trying to reach the lowest point in the valley. You can't see the bottom, but you can feel which direction the ground slopes beneath your feet — so you take a small step downhill, feel the slope again, and repeat. That's gradient descent. The "hill" is the cost function, the "slope" is its gradient with respect to the weights, and each step nudges the weights a little closer to the values that minimize prediction error.

Cost Function Surface

As the weight $w$ moves toward its optimal value, the cost $E(w)$ drops along the bowl-shaped curve until it settles at the global minimum (green).

Cost Function (half the sum of squared errors, which has the same minimizer as the Mean Squared Error):

$$E = \frac{1}{2}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2$$

To know which way is "downhill," we need the slope of this cost function with respect to each weight — that's the gradient. Here's how it's derived, one step at a time.

Deriving the Gradient:

Start with the cost contributed by a single training example $i$, written $E_i$ (the full gradient is the sum of these terms over all $m$ examples). Here $j$ indexes the weights and $i$ indexes the examples:

$$\frac{\partial E_i}{\partial w_j} = \frac{\partial}{\partial w_j}\left[\frac{1}{2}(\hat{y}_i - y_i)^2\right]$$

Apply the chain rule:

$$\frac{\partial E_i}{\partial w_j} = (\hat{y}_i - y_i) \cdot \frac{\partial \hat{y}_i}{\partial w_j}$$

Since $\hat{y}_i = w_1x_{i1} + \cdots + w_jx_{ij} + \cdots + w_nx_{in} + b$, only the term $w_jx_{ij}$ depends on $w_j$:

$$\frac{\partial E_i}{\partial w_j} = (\hat{y}_i - y_i) \cdot x_{ij}$$

Similarly, for the bias term: $\frac{\partial E_i}{\partial b} = (\hat{y}_i - y_i)$

Weight and Bias Updates:

$$w_j^{new} = w_j^{old} - \alpha(\hat{y}_i - y_i)x_{ij}$$
$$b^{new} = b^{old} - \alpha(\hat{y}_i - y_i)$$
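A quick way to gain confidence in a hand-derived gradient is to compare it against a numerical estimate. The sketch below does this for a single training example, using the per-example cost of one half times the squared error and the result that the gradient is the prediction error times the input. The function names and the sample values are assumptions for illustration.

```python
import numpy as np

def cost(w, b, x, y):
    """Half squared error for a single example: 0.5 * (y_hat - y)^2."""
    return 0.5 * (x @ w + b - y) ** 2

def analytic_gradient(w, b, x, y):
    """Gradient from the derivation: (y_hat - y) * x and (y_hat - y)."""
    error = x @ w + b - y
    return error * x, error

# Illustrative values for one training example with three features.
x = np.array([1.0, -2.0, 0.5])
y = 3.0
w = np.array([0.2, -0.1, 0.4])
b = 0.1

grad_w, grad_b = analytic_gradient(w, b, x, y)

# Central finite differences as an independent check.
eps = 1e-6
numeric_w = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    numeric_w[j] = (cost(w + step, b, x, y) - cost(w - step, b, x, y)) / (2 * eps)
numeric_b = (cost(w, b + eps, x, y) - cost(w, b - eps, x, y)) / (2 * eps)

print(grad_w, numeric_w)
print(grad_b, numeric_b)
```

If the two printed pairs agree to several decimal places, the analytic formula and the code implementing it are very likely consistent.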

The learning rate $\alpha$ controls how big each step is. Set it too small and training crawls toward the minimum, needing far more iterations than necessary. Set it too large and the weights overshoot the minimum entirely, sometimes bouncing back and forth or diverging instead of converging — choosing $\alpha$ well is one of the most practical decisions in training any gradient-based model.
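The effect of the learning rate is easy to see on a toy problem. The sketch below minimizes the one-dimensional cost $E(w) = \frac{1}{2}(w - 3)^2$, a stand-in chosen purely for illustration, with three hypothetical values of $\alpha$: one too small, one reasonable, and one too large.

```python
def run_gradient_descent(alpha, steps=20):
    """Minimize E(w) = 0.5 * (w - 3)^2 starting from w = 0."""
    w = 0.0
    history = []
    for _ in range(steps):
        gradient = w - 3.0          # dE/dw
        w -= alpha * gradient       # step downhill
        history.append(0.5 * (w - 3.0) ** 2)
    return w, history

for alpha in (0.01, 0.5, 2.5):      # too small, reasonable, too large
    w, history = run_gradient_descent(alpha)
    print(f"alpha={alpha}: w={w:.4f}, final cost={history[-1]:.4g}")
```

On this cost each step multiplies the distance to the minimum by $(1 - \alpha)$, so the smallest rate barely moves in 20 steps, while any rate above 2 makes that factor exceed 1 in magnitude and the iterates diverge.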

There's still an open question, though: when we compute that gradient, how much data do we use to compute it? That choice splits gradient descent into three variants.

Batch Gradient Descent: computes the gradient using the entire training set before making a single update. The path it takes toward the minimum is smooth and stable, since every step is based on the full picture — but for large datasets, recomputing the gradient over millions of rows for every single update is slow and memory-hungry.

Mini-Batch Gradient Descent: splits the training data into small batches (commonly 32, 64, or 128 samples) and updates the weights once per batch. This is the workhorse of modern machine learning: it's far faster than batch gradient descent since each update only needs a slice of the data, and it's far more stable than updating on a single point at a time. In non-convex models such as neural networks, the small amount of noise from batch-to-batch variation can also help the optimizer avoid getting stuck in shallow local minima; for linear regression the squared-error cost is convex, so there are no such local minima to escape.

Stochastic Gradient Descent (SGD): takes mini-batching to its extreme — the batch size is just one. The weights update after every single training example, making each individual step fast and well-suited to streaming or online learning where data arrives continuously. The trade-off is a noisier path to the minimum: the cost doesn't decrease smoothly but jitters as it descends, since every update is based on the (sometimes misleading) gradient of just one data point.

In short: batch trades speed for stability, stochastic trades stability for speed, and mini-batch sits in between — which is why it's the default choice in practice. The NumPy implementation below is stochastic gradient descent: notice that the weight update happens inside the inner loop, once per training example, rather than once per full pass over the dataset.

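import numpy as np

def fit(X, y, epochs, learning_rate):
    n_samples, n_features = X.shape
    # Prepend a column of ones so weights[0] acts as the bias term.
    X_aug = np.insert(X, 0, 1, axis=1)
    # Small random initial weights.
    weights = np.random.randn(n_features + 1) * 0.01

    for epoch in range(epochs):
        for i in range(n_samples):
            y_pred = np.dot(X_aug[i], weights)
            error = y_pred - y[i]
            # Per-example gradient: (y_pred - y_i) * x_i
            gradient = error * X_aug[i]
            # Update after every single example (stochastic gradient descent).
            weights -= learning_rate * gradient
    # weights[0] is the bias; weights[1:] are the feature weights.
    return weights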
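For comparison with the stochastic version above, here is a hedged sketch of the mini-batch variant. The function name `fit_minibatch`, its default hyperparameters, and the toy data are assumptions for illustration. Unlike the per-example loop, it reshuffles the data each epoch and averages the gradient over each batch. Setting `batch_size=1` recovers stochastic updates, and setting it to the number of samples gives batch gradient descent.

```python
import numpy as np

def fit_minibatch(X, y, epochs=200, learning_rate=0.05, batch_size=32, seed=0):
    """Mini-batch gradient descent for linear regression (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    X_aug = np.column_stack([np.ones(n_samples), X])  # bias as weights[0]
    weights = np.zeros(n_features + 1)

    for _ in range(epochs):
        order = rng.permutation(n_samples)            # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X_aug[idx] @ weights - y[idx]
            # Average the per-example gradients over the batch.
            gradient = X_aug[idx].T @ error / len(idx)
            weights -= learning_rate * gradient
    return weights

# Noise-free toy data from y = 4 + 2*x1 - 3*x2 (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 2))
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

weights = fit_minibatch(X, y)
print(weights)
```

Averaging over the batch keeps the step size comparable across different batch sizes, which makes the learning rate easier to reuse when experimenting.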

Implementation with scikit-learn

Everything above — least squares, the gradient derivation, batch vs. mini-batch vs. stochastic updates — is what's happening under the hood. In practice, you rarely write that loop yourself. Libraries like scikit-learn handle the optimization internally and expose a clean interface: fit the model on training data, then predict on new data.

from sklearn.linear_model import LinearRegression

# X_train, y_train, and X_test are assumed to be arrays prepared beforehand.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Key Metrics

Once the model produces predictions, we need a way to score how good they actually are. These three metrics are the standard toolkit for evaluating a regression model — each highlights a slightly different aspect of prediction error.

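R² Score: Proportion of variance explained, i.e. how much better the model is than simply predicting the average every time. It is at most 1 and usually falls between 0 and 1, but it can be negative on held-out data when the model does worse than that baseline.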
RMSE: Root Mean Squared Error — penalizes large errors more heavily, in the same units as the target variable.
MAE: Mean Absolute Error — the average size of the error, treating all mistakes equally regardless of size.
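The three metrics can be computed directly from their definitions. The sketch below uses plain Python so that each formula is visible; the function name `regression_metrics` and the sample values are assumptions for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (r2, rmse, mae) for two equal-length sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(e ** 2 for e in errors)                    # model's squared error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)      # mean-only baseline
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    return r2, rmse, mae

# Illustrative values only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
r2, rmse, mae = regression_metrics(y_true, y_pred)
print(r2, rmse, mae)
```

Here RMSE comes out larger than MAE because squaring gives the two nonzero errors more weight than the two exact predictions can offset.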

Advantages & Disadvantages

Linear regression's simplicity is both its biggest strength and its biggest limitation. It's an excellent first model to reach for and a strong baseline to compare more complex models against, but it's not the right tool for every problem.

✓ Advantages: Fast, Efficient, Interpretable

✗ Disadvantages: Sensitive to outliers, Assumes linearity
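The sensitivity to outliers is easy to demonstrate. In the sketch below, which uses made-up data and `np.polyfit` with degree 1 as a convenient least-squares line fit, a single corrupted observation is enough to pull the slope well away from its true value.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0                      # clean data on the line y = 2x + 1

slope_clean, intercept_clean = np.polyfit(x, y, 1)

y_outlier = y.copy()
y_outlier[-1] += 50.0                  # corrupt a single observation

slope_out, intercept_out = np.polyfit(x, y_outlier, 1)
print(slope_clean, slope_out)
```

Squared error penalizes the one large residual so heavily that the whole line tilts toward it, which is why outliers are worth inspecting before fitting.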
