This case study focuses on time-series forecasting to predict store sales using data from Corporación Favorita, a major grocery retailer based in Ecuador.
The objective is to develop a model that accurately forecasts unit sales for thousands of items across various Favorita stores. The dataset includes information such as dates, store and item details, promotions, and unit sales, providing an excellent opportunity to apply machine learning techniques.
The workflow involves the following steps:
Load Data: Import data from CSV files into a DataFrame.
Data Splitting: Split the dataset into training and testing sets for model development.
Model Training and Prediction: Train a machine learning model and use it to make predictions on the test set.
For additional details, visit the competition page on Kaggle.
The following steps were carried out in the Jupyter notebook provided by Kaggle.
Let's import all the necessary libraries before exploring the data.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # Data visualization
import plotly.express as px # Data visualization
from sklearn.ensemble import RandomForestRegressor # RandomForestRegressor Model
from sklearn.metrics import mean_squared_error, mean_absolute_error # Precision testing
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
output:
/kaggle/input/store-sales-time-series-forecasting/oil.csv
/kaggle/input/store-sales-time-series-forecasting/sample_submission.csv
/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv
/kaggle/input/store-sales-time-series-forecasting/stores.csv
/kaggle/input/store-sales-time-series-forecasting/train.csv
/kaggle/input/store-sales-time-series-forecasting/test.csv
/kaggle/input/store-sales-time-series-forecasting/transactions.csv
Next, we define a small helper function to read data from a file; wrapping the loading logic in a function makes it easy to reuse for the different CSV files.
We will explore all the files mentioned above in the next few cells.
"""
This function load csv file to DataFrame
======================================
Parameter:
filepath-> str : (location of file)
Result:
df-> DataFrame
"""
def wrangle(filepath):
# Read into Data Frame
df = pd.read_csv(filepath)
# Changing date from object to datetime
df["date"] = pd.to_datetime(df["date"])
return df
# Reading the train dataset into a DataFrame using the wrangle function
filepath = "/kaggle/input/store-sales-time-series-forecasting/train.csv"
df = wrangle(filepath).set_index("id")
# Let's see the Size of our data
df.shape
output:
(3000888, 5)
# Let's look at a sample of our data. Precisely, the first 5 rows using head()
df.head()
output:
# Reading the holiday data into a DataFrame
df_holiday_path = "/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv"
df_holiday = pd.read_csv(df_holiday_path)
df_holiday.head()
output:
There are 6 columns in the "holidays_events.csv" file, and some holidays are marked as transferred, which is indicated by the "transferred" column. After reviewing the data, I concluded that selecting dates with "False" in the "transferred" column includes all the required holidays.
First, we will create a mask to filter holidays where the "transferred" column is marked as "False".
Next, we will create a new dataframe named "df_holiday" and add the "date" column to it. This will help us identify the dates on which holidays occur.
Third, we will convert the "date" column to the "datetime" type so pandas treats it as a datetime object. This will be useful for working with time-series data (trends).
Lastly, we will create a new column named "is_holiday" and set its value to 1 for all rows to indicate that the date is a holiday.
# Keeping only rows where the transferred column is False, since those holidays were actually observed on their scheduled date
mask_holi = df_holiday["transferred"] == False
df_holiday = df_holiday[mask_holi]
df_holiday = df_holiday[["date"]]
df_holiday["date"] = pd.to_datetime(df_holiday["date"])
df_holiday["is_holiday"] = 1
df_holiday.head()
output:
Some days have multiple holidays or events. So, let's count the number of holidays or events for each day and then sort the dates by that count.
# Aggregating the number of holidays per day
df_holiday = df_holiday["date"].value_counts().sort_values().to_frame(name="holiday_count").reset_index()
It's time to merge the holiday data with our training data. We will fill in "0" for days with no holiday to simplify calculations in later stages.
Here, I have used the name "df_holi_int" to distinguish it from "df", since we are adding the new "holiday_count" column derived from df_holiday.
# Merging holiday date to train data
df_holi_int = df.merge(df_holiday[["date", "holiday_count"]], on = "date", how = "left" )
df_holi_int["holiday_count"] = df_holi_int["holiday_count"].fillna(0).astype(int)
Are you excited to look at our DataFrame after adding the "holiday_count" column?
df_holi_int.head()
output:
# Reading the oil data into a DataFrame
df_oil_path = "/kaggle/input/store-sales-time-series-forecasting/oil.csv"
df_oil = pd.read_csv(df_oil_path)
df_oil["date"] = pd.to_datetime(df_oil["date"])
df_oil = df_oil.rename(columns = {"dcoilwtico": "oil_price"})
df_oil.head()
output:
To analyze trend data, the first step is to plot it. This visualization helps identify whether the data is complete and reveals its overall trend.
# Plotting oil price
plt.plot(df_oil["date"], df_oil["oil_price"])
plt.xlabel("Day")
plt.ylabel("Oil price")
plt.title("Daily oil price");
output:
This analysis provides insights into the peak and lowest prices and the state of oil prices at the time of data collection. These observations are valuable for evaluating how sales depend on oil prices.
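If we want the exact figures rather than reading them off the plot, a quick check like the one below would work (a minimal sketch of my own, not part of the original notebook):
# Inspect the extremes of the oil series directly (sketch, not in the original notebook)
print("Lowest oil price: ", df_oil["oil_price"].min())
print("Highest oil price:", df_oil["oil_price"].max())
print("Date of lowest price: ", df_oil.loc[df_oil["oil_price"].idxmin(), "date"])
print("Date of highest price:", df_oil.loc[df_oil["oil_price"].idxmax(), "date"])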
Let's add this column to our training data set.
# Adding oil_price column to training data
df = df_holi_int.merge(df_oil[["date", "oil_price"]], on = "date", how = "left" )
df.head()
output:
Finally, we have gathered all the necessary data for our modeling. Let's take a look at the information in the data frame.
df.info()
output:
All the columns here are numeric except the date and family columns. We will remove date and convert the family column to numeric. But first, let's check whether there are any missing values in the columns.
# Look if there's any missing data present at any column of our data frame
df.isnull().sum()
output:
As expected, the plot above has already revealed that there are many missing values in the oil_price data. Before filling these missing values, let's remove the unnecessary column, "date".
We are removing the "date" column because we are using the RandomForestRegressor, which does not inherently utilize date values for modeling.
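As a side note: if we wanted to keep the temporal signal instead, one common alternative (my own sketch, not what this notebook does) is to expand the date into numeric features that a tree model can use before dropping the raw column:
# Hypothetical alternative (not applied here): derive numeric features from the date
df_dates = df[["date"]].copy()
df_dates["day_of_week"] = df_dates["date"].dt.dayofweek  # 0 = Monday
df_dates["month"] = df_dates["date"].dt.month
df_dates["year"] = df_dates["date"].dt.year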
# Dropping Date column
df = df.drop(columns = "date")
df.head()
output:
Now, we will fill the missing values in the oil_price column using a combination of forward fill and backward fill methods.
# Replace all NaN of oil_price with forward and backward fill
df["oil_price"] = df["oil_price"].ffill()
df["oil_price"] = df["oil_price"].bfill()
We are done cleaning training data.
Let's repeat the same process of adding "holiday" and "oil" columns to test data.
First, let's read the data into the Data Frame.
# Let's read the test data into a DataFrame
test_data = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/test.csv")
test_data.head()
output:
# Change date into datetime format
test_data["date"] = pd.to_datetime(test_data["date"])
test_data.info()
output:
# Now let's merge holiday data
test_holi_int = test_data.merge(df_holiday[["date", "holiday_count"]], on="date", how="left")
test_holi_int["holiday_count"] = test_holi_int["holiday_count"].fillna(0).astype(int)
test_holi_int.head()
output:
# Adding oil_price column to test data
test_data = test_holi_int.merge(df_oil[["date", "oil_price"]], on="date", how="left")
test_data.head()
output:
# Replace all NaN of oil_price with forward and backward fill
test_data["oil_price"] = test_data["oil_price"].ffill()
test_data["oil_price"] = test_data["oil_price"].bfill()
# Dropping the date column and setting index to id
test_data = test_data.drop(columns = "date")
test_data = test_data.set_index("id")
test_data.head()
output:
Now it's time to split the data into training and test sets. Since this is a time series dataset, we will use a cutoff point to divide the data, ensuring the temporal order is preserved.
# splitting the data into train and test
target = "sales"
X = df.drop(columns = target)
y = df[target]
cutoff = int(len(df) * 0.8)
X_train, X_test, y_train, y_test = X.iloc[:cutoff], X.iloc[cutoff:], y.iloc[:cutoff], y.iloc[cutoff:]
There is a column "family" with categorical data, so let's convert it into a numeric format using one-hot encoding. This will create binary columns for each category, allowing the model to work with numeric data.
# initiating one-hot encoding to X_train
X_train_encoding = pd.get_dummies(X_train, columns = ["family"])
# initiating one-hot encoding to X_test data
X_test_encoding = pd.get_dummies(X_test, columns = ["family"])
# initiating one-hot encoding to test data
test_encoding = pd.get_dummies(test_data, columns = ["family"])
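One caveat: applying pd.get_dummies separately to each split can yield different column sets if a category is absent from one of them. A defensive alignment step (my addition, not in the original notebook) keeps the columns consistent:
# Align dummy columns with the training set; absent categories become all-zero columns
X_test_encoding = X_test_encoding.reindex(columns=X_train_encoding.columns, fill_value=0)
test_encoding = test_encoding.reindex(columns=X_train_encoding.columns, fill_value=0)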
It's time to celebrate! We've reached the moment of training our first model, RandomForestRegressor. This ensemble model combines the results of multiple decision trees to improve accuracy and reduce overfitting.
# Now it's time to train our RandomForestRegressor model
model = RandomForestRegressor(random_state = 42)
model.fit(X_train_encoding, y_train)
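With roughly 2.4 million training rows, fitting the default forest of 100 fully grown trees can be very slow. If runtime is a concern, a lighter variant such as the following could help (the parameter values are illustrative guesses, not tuned):
# Lighter-weight variant (illustrative, untuned parameters)
model = RandomForestRegressor(
    n_estimators=50,  # fewer trees than the default 100
    max_depth=12,     # cap tree depth to reduce fit time and memory
    n_jobs=-1,        # build trees in parallel on all CPU cores
    random_state=42,
)
model.fit(X_train_encoding, y_train)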
Now that our model is trained, let's predict on X_test.
# Predicting the model
y_predict = model.predict(X_test_encoding)
Now that we have our predictions, let's analyze the model's performance.
Here we will use "Root Mean Squared Logarithmic Error (RMSLE)" to evaluate accuracy, which calculates the square root of the average of the squared differences between the logarithms of the predicted and actual values.
# How good is our model? Test using RMSLE
def rmsle(y_test, y_pred):
    return np.sqrt(mean_squared_error(np.log1p(y_test), np.log1p(y_pred)))

rmsle_value = rmsle(y_test.reset_index(drop=True), y_predict)
print(f"RMSLE value is: {rmsle_value}")
output:
RMSLE value is: 0.8924370427620725
What do you think of the result? The lower the RMSLE, the better the fit.
Since this is our first model, it serves as the baseline. Now, we can explore other models to further reduce this value and improve the fit, right?
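To put the 0.89 in context, a quick sanity check (my addition) is to score a naive baseline that predicts the training mean everywhere; our model should at least beat it:
# Naive baseline: predict the mean of the training target for every row
y_baseline = np.full(len(y_test), y_train.mean())
print("Baseline RMSLE:", rmsle(y_test, y_baseline))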
# Put predicted values in dataframe
y_predict_df = pd.DataFrame(y_predict, index=y_test.index, columns = ["Sales"])
y_predict_df.head()
output:
Our initial model performs well as a baseline, but it is too early to evaluate it using the test data. Instead, we should explore other models to further improve the current model's metrics.
However, as part of the learning process, we will proceed to predict the test data using the current RandomForestRegressor model.
# Predicting the test dataset
y_predict_test = model.predict(test_encoding)
# Put predicted values of test set in the data frame
y_predict_test_df = pd.DataFrame(y_predict_test, index=test_encoding.index, columns=["Sales"])
y_predict_test_df.head()
output:
You now have the skills to clean data, apply the RandomForestRegressor, and make predictions on a test set. You can explore similar datasets on Kaggle and use the same model to make predictions. Additionally, Kaggle offers a great opportunity to deepen your knowledge of machine learning.