🎯 Machine Learning Case Study with Real-World Data
This case study focuses on time-series forecasting to predict store sales using data from Corporación Favorita, a major grocery retailer based in Ecuador. The objective is to develop a model that accurately forecasts unit sales for thousands of items across various Favorita stores.
The dataset includes dates, store and item details, promotions, and unit sales, providing an excellent opportunity to apply machine learning techniques.
For additional details and competition information, visit the Kaggle competition page.
Let's import all the necessary libraries before exploring the data.
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # visualization
import plotly.express as px
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
📂 Available Data Files:
Now let's create a function to load and preprocess the data efficiently.
def wrangle(filepath):
    # Read CSV into DataFrame
    df = pd.read_csv(filepath)
    # Convert date column to datetime
    df["date"] = pd.to_datetime(df["date"])
    return df
The training dataset contains 3,000,888 rows and 5 columns including date, store, product family, promotions, and sales data.
We enhance our dataset by integrating external factors like holidays and oil prices, which can significantly impact store sales.
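One way this integration might look is a left join on the date column. The sketch below uses tiny made-up frames. The column names `dcoilwtico` (oil price) and `type` (holiday type) are assumed from the competition's oil and holiday files, and forward-filling missing oil prices is an illustrative choice, not necessarily the approach used in this case study.

```python
import pandas as pd

# Toy stand-ins for the sales, oil, and holiday tables (all values are made up)
sales = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
    "store_nbr": [1, 1, 1],
    "sales": [0.0, 12.5, 9.0],
})
oil = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-02", "2017-01-03"]),
    "dcoilwtico": [52.3, 53.1],
})
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01"]),
    "type": ["Holiday"],
})

# Left joins keep every sales row even when a date has no oil price or holiday
merged = sales.merge(oil, on="date", how="left")
merged = merged.merge(holidays, on="date", how="left")

# Carry the last known oil price forward, then back-fill a gap at the start
merged["dcoilwtico"] = merged["dcoilwtico"].ffill().bfill()

# Turn the holiday type into a simple 0/1 flag
merged["is_holiday"] = merged["type"].notna().astype(int)
merged = merged.drop(columns="type")
```

If the holiday table has more than one row for the same date, a left join duplicates the matching sales rows, so deduplicating holidays by date before merging is worth considering.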
We use RandomForestRegressor, an ensemble learning method that combines multiple decision trees to improve accuracy and reduce overfitting.
# Chronological split: first 80% of rows for training, last 20% for testing
# (assumes the rows are already sorted by date)
cutoff = int(len(df) * 0.8)
X_train, X_test = X[:cutoff], X[cutoff:]
y_train, y_test = y[:cutoff], y[cutoff:]
# Train the model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_predict = model.predict(X_test)
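The snippet above assumes that `X` and `y` already exist and that the rows are in date order. A minimal sketch of one way to get there is shown below, using synthetic data and hypothetical calendar features (`day_of_week`, `month`). The feature set and `n_estimators=50` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy daily sales series (values are synthetic)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=100, freq="D"),
    "onpromotion": rng.integers(0, 5, size=100),
})
df["sales"] = 10 + 2 * df["onpromotion"] + rng.random(100)

# Sort by date so that slicing by position is a chronological split
df = df.sort_values("date").reset_index(drop=True)

# Hypothetical calendar features derived from the date column
df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month

X = df[["onpromotion", "day_of_week", "month"]]
y = df["sales"]

# Same 80/20 positional split as in the case study
cutoff = int(len(df) * 0.8)
X_train, X_test = X[:cutoff], X[cutoff:]
y_train, y_test = y[:cutoff], y[cutoff:]

# A small forest keeps the example fast
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
```

Splitting by position rather than at random keeps every test row later in time than every training row, which mirrors how a forecast is used in practice.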
We evaluate the model using Root Mean Squared Logarithmic Error (RMSLE), a metric that penalizes underestimates more heavily than overestimates, which is a useful property for sales forecasting.
def rmsle(y_test, y_pred):
    # log1p computes log(1 + x), so zero sales are handled without error
    return np.sqrt(mean_squared_error(
        np.log1p(y_test), np.log1p(y_pred)))
rmsle_value = rmsle(y_test, y_predict)
print(f"RMSLE: {rmsle_value:.4f}")
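To make the asymmetry of RMSLE concrete, the sketch below compares a forecast that is 50 units too low with one that is 50 units too high for the same actual value. It re-implements the metric with NumPy only so that it runs on its own. The numbers are made up for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Same formula as above, written with NumPy only
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

actual = [100.0]
under = rmsle(actual, [50.0])   # forecast is 50 units too low
over = rmsle(actual, [150.0])   # forecast is 50 units too high

print(f"under-forecast: {under:.4f}, over-forecast: {over:.4f}")
```

Although both forecasts miss by the same absolute amount, the under-forecast produces the larger error, because the logarithm stretches differences below the actual value more than those above it.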
📊 Model Performance:
RMSLE Score: 0.8924
This baseline model gives a reference point for later experiments. Further improvements may come from hyperparameter tuning and from trying other ensemble methods, such as gradient boosting.
Finally, we use our trained model to generate predictions for the test set, which will be submitted to the Kaggle competition.
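# Predict on the competition test set
# (X_test here must hold features built from the Kaggle test file, not the 20% hold-out split used for evaluation)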
y_predict_test = model.predict(X_test)
# Create submission dataframe
submission = pd.DataFrame({
    "id": test_ids,
    "sales": y_predict_test
})
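A sketch of the remaining step, writing the submission to disk, might look like the following. The file name `submission.csv`, the toy ids and predictions, and the clipping of negative values are assumptions for illustration. A random forest trained on non-negative sales should not predict negative values, so the clip is only a safeguard.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real test ids and model predictions
test_ids = [1, 2, 3]
y_predict_test = np.array([4.2, -0.1, 17.8])

submission = pd.DataFrame({
    "id": test_ids,
    # Sales cannot be negative, and np.log1p is undefined below -1
    "sales": np.clip(y_predict_test, 0, None),
})

# index=False keeps the DataFrame index out of the file
submission.to_csv("submission.csv", index=False)
```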