Store Sales - Time Series Forecasting

🎯 Machine Learning Case Study with Real-World Data

Overview

This case study focuses on time-series forecasting to predict store sales using data from Corporación Favorita, a major grocery retailer based in Ecuador. The objective is to develop a model that accurately forecasts unit sales for thousands of items across various Favorita stores.

📋 Project Workflow

  1. Data Loading & Exploration: Import the CSV files into DataFrames and parse dates
  2. Feature Engineering: Merge in holidays and oil prices, encode product families
  3. Model Training & Prediction: Split the data chronologically, train a model, and predict
  4. Evaluation: Score the model with RMSLE
  5. Submission: Generate test-set predictions for the Kaggle competition

The dataset includes dates, store and item details, promotions, and unit sales, providing an excellent opportunity to apply machine learning techniques.

For additional details and competition information, visit the Kaggle Competition Page.

Step 1: Set Up the Necessary Libraries

Let's import all the necessary libraries before exploring the data.

import numpy as np  # linear algebra
import pandas as pd  # data processing
import matplotlib.pyplot as plt  # visualization
import plotly.express as px

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

📂 Available Data Files:

  • oil.csv - Oil price data
  • train.csv - Training dataset (3M+ rows)
  • test.csv - Test dataset for predictions
  • holidays_events.csv - Holiday information
  • stores.csv - Store metadata
  • transactions.csv - Transaction data

Step 2: Data Loading & Exploration

Now let's create a function to load and preprocess the data efficiently.

def wrangle(filepath):
    # Read CSV into DataFrame
    df = pd.read_csv(filepath)

    # Convert the date column to datetime; not every file has one
    # (e.g., stores.csv), so guard the conversion
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"])

    return df
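A minimal usage sketch, assuming the competition files sit in a local data/ directory (the path is an assumption; adjust it to your setup):

df = wrangle("data/train.csv")
print(df.shape)  # (3000888, 6)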

The training dataset contains 3,000,888 rows and six columns: an id, the date, the store number, the product family, the number of items on promotion, and unit sales.

Step 3: Feature Engineering

We enhance our dataset by integrating external factors like holidays and oil prices, which can significantly impact store sales; a sketch of this merging step follows the list below.

🔧 Features Added:

  • Holiday Count: Number of holidays/events on each date
  • Oil Price: Daily crude oil prices (Ecuador's major export)
  • Encoded Categories: One-hot encoding for product families
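Below is a minimal sketch of these three steps, assuming the wrangle helper from Step 2 and the standard competition file layout (the data/ paths and the holiday_count column name are illustrative assumptions; dcoilwtico is the oil-price column in oil.csv):

# Load the auxiliary files (paths are assumptions)
holidays = wrangle("data/holidays_events.csv")
oil = wrangle("data/oil.csv")

# Holiday Count: number of holidays/events on each date
holiday_count = holidays.groupby("date").size().reset_index(name="holiday_count")
df = df.merge(holiday_count, on="date", how="left")
df["holiday_count"] = df["holiday_count"].fillna(0)

# Oil Price: forward-fill gaps (weekends/holidays have no quote), then merge
oil["dcoilwtico"] = oil["dcoilwtico"].ffill()
df = df.merge(oil, on="date", how="left")

# Encoded Categories: one-hot encode the product family
df = pd.get_dummies(df, columns=["family"])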

Step 4: Model Training & Prediction

We use RandomForestRegressor, an ensemble learning method that combines multiple decision trees to improve accuracy and reduce overfitting.

# Split the data chronologically (df sorted by date): first 80% for
# training, last 20% for testing. A random split would leak future
# information in a time-series problem.
# X and y are assumed to come from Step 3, e.g.:
#   y = df["sales"]
#   X = df.drop(columns=["sales", "date"])
cutoff = int(len(df) * 0.8)
X_train, X_test = X[:cutoff], X[cutoff:]
y_train, y_test = y[:cutoff], y[cutoff:]

# Train the model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the held-out split
y_predict = model.predict(X_test)

Step 5: Model Evaluation

We evaluate the model using Root Mean Squared Logarithmic Error (RMSLE), the competition's metric. Because errors are computed on log-transformed values, RMSLE penalizes underestimates more heavily than overestimates of the same size, which matters in sales forecasting, where underestimating demand is typically the costlier mistake.

def rmsle(y_test, y_pred):
    return np.sqrt(mean_squared_error(
        np.log1p(y_test), np.log1p(y_pred)))

rmsle_value = rmsle(y_test, y_predict)
print(f"RMSLE: {rmsle_value:.4f}")

📊 Model Performance:

RMSLE Score: 0.8924

This baseline model provides a solid foundation. Further improvements can come from hyperparameter tuning (a sketch follows) and from the ensemble methods listed under Next Steps.
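A minimal tuning sketch, assuming X_train and y_train from Step 4; the parameter ranges and n_iter value are illustrative assumptions, not tuned choices. TimeSeriesSplit keeps the validation folds in temporal order:

from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Illustrative search space (an assumption, not tuned values)
param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,
    cv=TimeSeriesSplit(n_splits=3),  # folds respect temporal order
    scoring="neg_mean_squared_log_error",
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)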

Step 6: Making Predictions on Test Data

Finally, we use our trained model to generate predictions for the test set, which will be submitted to the Kaggle competition.

# Predict on the competition test set. This must use features built from
# test.csv with the same Step 3 pipeline, not the 20% validation split
# from Step 4; X_test_final is an illustrative name for those features.
y_predict_test = model.predict(X_test_final)

# Create the submission dataframe in the format Kaggle expects (id, sales)
submission = pd.DataFrame({
    "id": test_ids,
    "sales": y_predict_test,
})

# Write the file for upload
submission.to_csv("submission.csv", index=False)

✨ Key Takeaways

  • Successfully loaded and preprocessed 3M+ rows of sales data
  • Engineered features including holidays and oil prices
  • Trained and evaluated a RandomForest model with competitive performance
  • Generated predictions on the test set for competition submission
  • Demonstrated practical ML workflow from data to predictions

🚀 Next Steps for Improvement

  • Explore XGBoost and LightGBM models for potentially better performance
  • Implement hyperparameter tuning using Grid Search or Bayesian Optimization
  • Create additional time-series features such as lags and rolling averages (see the sketch after this list)
  • Analyze feature importance to understand key drivers of sales
  • Implement cross-validation for more robust evaluation
  • Try ensemble methods combining multiple models
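A minimal sketch of the lag and rolling-average features mentioned above, assuming the long-format training frame from Step 2 with one row per date, store_nbr, and family (the new column names are illustrative):

# Sort so that shifts move backward in time within each store/family series
df = df.sort_values(["store_nbr", "family", "date"])
group = df.groupby(["store_nbr", "family"])["sales"]

# Lag features: sales 1 day and 7 days earlier for the same store/family
df["sales_lag_1"] = group.shift(1)
df["sales_lag_7"] = group.shift(7)

# 7-day rolling average, shifted by one day so the current day's sales
# never leak into its own feature
df["sales_roll_7"] = group.transform(lambda s: s.shift(1).rolling(7).mean())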