
Store Sales - Time Series Forecasting


This case study focuses on time-series forecasting to predict store sales using data from Corporación Favorita, a major grocery retailer based in Ecuador.
The objective is to develop a model that accurately forecasts unit sales for thousands of items across various Favorita stores. The dataset includes information such as dates, store and item details, promotions, and unit sales, providing an excellent opportunity to apply machine learning techniques.
The workflow involves the following steps:

  1. Load Data: Import data from CSV files into a DataFrame.

  2. Data Splitting: Split the dataset into training and testing sets for model development.

  3. Model Training and Prediction: Train a machine learning model and use it to make predictions on the test set.

For additional details, visit Kaggle.

The following steps were performed in the Jupyter notebook provided by Kaggle.


First Step: Set up the necessary libraries

Let's import all the necessary libraries before exploring the data.

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # data visualization
import plotly.express as px  # data visualization

from sklearn.ensemble import RandomForestRegressor  # RandomForestRegressor model
from sklearn.metrics import mean_squared_error, mean_absolute_error  # error metrics


# Input data files are available in the read-only "../input/" directory

# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

output:

/kaggle/input/store-sales-time-series-forecasting/oil.csv
/kaggle/input/store-sales-time-series-forecasting/sample_submission.csv
/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv
/kaggle/input/store-sales-time-series-forecasting/stores.csv
/kaggle/input/store-sales-time-series-forecasting/train.csv
/kaggle/input/store-sales-time-series-forecasting/test.csv
/kaggle/input/store-sales-time-series-forecasting/transactions.csv

Wrangle Function

This helper reads a CSV file into a DataFrame. Wrapping the loading logic in a function like this makes it easy to reuse across the different data files.

We will explore all the files mentioned above in the next few cells.

"""
This function load csv file to DataFrame
======================================
Parameter:
filepath-> str : (location of file)
Result:
df-> DataFrame
"""

def  wrangle(filepath):

# Read into Data Frame

df = pd.read_csv(filepath)

# Changing date from object to datetime

df["date"] = pd.to_datetime(df["date"])

return  df
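Incidentally, pandas can parse dates at read time via read_csv's parse_dates parameter, so an equivalent one-line version of the helper is possible — a minimal sketch:

# Equivalent: let read_csv convert the date column while reading
def wrangle(filepath):
    return pd.read_csv(filepath, parse_dates=["date"])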

# Reading the training data into a DataFrame using the wrangle function
filepath = "/kaggle/input/store-sales-time-series-forecasting/train.csv"
df = wrangle(filepath).set_index("id")

# Let's see the size of our data
df.shape

output:

(3000888, 5)

# Why don't we look into our sample data? Precisely, let's see the first 5 rows using head()
df.head()

output:

head_5_sales_forecasting

Moving to the next step: let's look into the holiday data from "holidays_events.csv".

# Reading the holiday data into a DataFrame
df_holiday_path = "/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv"
df_holiday = pd.read_csv(df_holiday_path)
df_holiday.head()

output:

holiday-head-5

There are 6 columns in the "holidays_events.csv" file, and some holidays are marked as transferred, which is indicated by the "transferred" column. After reviewing the data, I concluded that selecting dates with "False" in the "transferred" column includes all the required holidays.

  1. First, we will create a mask to filter holidays where the "transferred" column is marked as "False".

  2. Next, we will keep only the "date" column in "df_holiday". This will help us identify the dates on which holidays occur.

  3. Third, we will convert the "date" column to the "datetime" type so it is treated as a datetime object, which is useful for working with time-series data (trends).

  4. Lastly, we will create a new column named "is_holiday" and set its value to 1 for all rows to indicate that the date is a holiday.

# Keeping rows where the transferred column is False, since those holidays were actually observed on that date
mask_holi = df_holiday["transferred"] == False
df_holiday = df_holiday[mask_holi]
df_holiday = df_holiday[["date"]]
df_holiday["date"] = pd.to_datetime(df_holiday["date"])
df_holiday["is_holiday"] = 1
df_holiday.head()

output:

df_holiday_head_5

There are multiple holidays or events on a single day. So, let's count the number of holidays or events for each day and then sort the dates based on the count of holidays or events.

# Aggregating the number of holidays in a day
df_holiday = df_holiday["date"].value_counts().sort_values().to_frame(name="holiday_count").reset_index()
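If that chained one-liner feels dense, the same table could be produced from the pre-aggregation frame by summing the is_holiday flags per date with groupby — a minimal sketch (using a hypothetical holiday_counts name; the row order may differ, which the later merge does not care about):

# Alternative aggregation: sum the per-row is_holiday flags for each date
holiday_counts = (
    df_holiday.groupby("date", as_index=False)["is_holiday"]
    .sum()
    .rename(columns={"is_holiday": "holiday_count"})
)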

It's time to merge the holiday data with our training data. We will fill in "0" for days with no holiday to simplify calculations in later stages.
Here, I have used the name "df_holi_int" to distinguish it from "df", as we added the new "holiday_count" column from df_holiday.

# Merging the holiday counts into the training data
df_holi_int = df.merge(df_holiday[["date", "holiday_count"]], on="date", how="left")
df_holi_int["holiday_count"] = df_holi_int["holiday_count"].fillna(0).astype(int)

Are you excited to look at our data frame after adding the "holiday_count" column?

df_holi_int.head()

output:

train_holi_head_5

Adding a new feature, "oil_price", from "oil.csv"

# Reading the oil data into a DataFrame
df_oil_path = "/kaggle/input/store-sales-time-series-forecasting/oil.csv"
df_oil = pd.read_csv(df_oil_path)
df_oil["date"] = pd.to_datetime(df_oil["date"])
df_oil = df_oil.rename(columns={"dcoilwtico": "oil_price"})
df_oil.head()

output:

oil-head-5

To analyze trend data, the first step is to plot it. This visualization helps identify whether the data is complete and reveals its overall trend.

# Plotting oil price

plt.plot(df_oil["date"], df_oil["oil_price"])

plt.xlabel("Day")

plt.ylabel("Oil price")

plt.title("Daily oil price");

output:

daily oil price

This analysis provides insights into the peak and lowest prices and the state of oil prices at the time of data collection. These observations are valuable for evaluating the dependency of oil prices on sales.

Let's add this column to our training data set.

# Adding oil_price column to training data
df = df_holi_int.merge(df_oil[["date", "oil_price"]], on="date", how="left")

df.head()

output:

complete_training_data

Finally, we have gathered all the necessary data for our modeling. Let's take a look at the information in the data frame.

df.info()

output:

info_sales_forecasting

All the data here is numeric except for the date and family columns. We will remove date and convert the family column to numeric. But first, let's check whether there are any missing values in the columns.

# Look if there's any missing data present at any column of our data frame

df.isnull().sum()

output:

missing_values

As expected, the plot above has already revealed that there are many missing values in the oil_price data. Before filling these missing values, let's remove the unnecessary column, "date".

We are removing the "date" column because we are using the RandomForestRegressor, which does not inherently utilize date values for modeling.

# Dropping the date column
df = df.drop(columns="date")
df.head()

output:

removing_date_df
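As an aside (not part of this notebook's pipeline): if you wanted the model to still see temporal patterns, you could derive numeric features from the date before dropping the column. A minimal sketch, assuming it runs while "date" is still present:

# Hypothetical alternative: extract numeric calendar features before dropping date
df["month"] = df["date"].dt.month  # 1-12
df["day_of_week"] = df["date"].dt.dayofweek  # 0 = Monday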

Now, we will fill the missing values in the oil_price column using a combination of forward fill and backward fill methods.

# Replace all NaN values of oil_price with forward fill, then backward fill
df["oil_price"] = df["oil_price"].ffill()
df["oil_price"] = df["oil_price"].bfill()

df.head()

output:

ffill_bfill_df
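A quick sanity check (my addition, not shown in the original notebook) confirms that no gaps remain:

# Verify oil_price no longer contains missing values
print(df["oil_price"].isna().sum())  # expected: 0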

We are done cleaning training data.

Let's repeat the same process of adding "holiday" and "oil" columns to test data.

First, let's read the data into the Data Frame.

# Let's read the test data into the data frame
test_data = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/test.csv")

test_data.head()

output:

test_data_head_5

# Change date into datetime format

test_data["date"] = pd.to_datetime(test_data["date"])

test_data.info()

output:

test_data_info

# Now let's merge holiday data

test_holi_int = test_data.merge(df_holiday[["date", "holiday_count"]], on="date", how="left")

test_holi_int["holiday_count"] = test_holi_int["holiday_count"].fillna(0).astype(int)

test_holi_int.head()

output:

test_holiday_data

# Adding oil_price column to test data

test_data = test_holi_int.merge(df_oil[["date", "oil_price"]], on="date", how="left")

test_data.head()

output:

test_oil_data

# Replace all NaN of oil_price with forward and backward fill

test_data["oil_price"] = test_data["oil_price"].ffill()

test_data["oil_price"] = test_data["oil_price"].bfill()

# Dropping the date column and setting the index to id
test_data = test_data.drop(columns="date")
test_data = test_data.set_index("id")

test_data.head()

output:

complete_test_data

Split the data

Now it's time to split the data into training and test sets. Since this is a time series dataset, we will use a cutoff point to divide the data, ensuring the temporal order is preserved.

# Splitting the data into train and test sets
target = "sales"
X = df.drop(columns=target)
y = df[target]
cutoff = int(len(df) * 0.8)
X_train, X_test, y_train, y_test = X.iloc[:cutoff], X.iloc[cutoff:], y.iloc[:cutoff], y.iloc[cutoff:]
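If you want a more robust evaluation than a single cutoff, scikit-learn's TimeSeriesSplit generates several chronologically ordered train/validation folds. A minimal sketch (optional, not used in the rest of this notebook):

from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on an expanding window of past rows and validates on the block that follows
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]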

There is a column "family" with categorical data, so let's convert it into a numeric format using one-hot encoding. This will create binary columns for each category, allowing the model to work with numeric data.

# Applying one-hot encoding to X_train
X_train_encoding = pd.get_dummies(X_train, columns=["family"])

# Applying one-hot encoding to X_test
X_test_encoding = pd.get_dummies(X_test, columns=["family"])

# Applying one-hot encoding to the test data
test_encoding = pd.get_dummies(test_data, columns=["family"])
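One caveat: pd.get_dummies applied to each frame separately only yields identical columns when every frame contains the same set of family categories. If the categories could differ, the encoded frames can be aligned to the training columns explicitly — a minimal sketch:

# Align dummy columns with the training frame, filling missing categories with 0
X_test_encoding = X_test_encoding.reindex(columns=X_train_encoding.columns, fill_value=0)
test_encoding = test_encoding.reindex(columns=X_train_encoding.columns, fill_value=0)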

Model training

It's time to celebrate! We're at the moment to train our first model, RandomForestRegressor. This ensemble model combines the results of multiple decision trees to improve accuracy and reduce overfitting.

# Now it's time to train our RandomForestRegressor model
model = RandomForestRegressor(random_state=42)
model.fit(X_train_encoding, y_train)
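Training on roughly 2.4 million rows with default settings can take a long time. RandomForestRegressor exposes parameters such as n_estimators and n_jobs that trade accuracy against speed; an optional variant (not what was run here) might look like:

# Optional: fewer trees plus parallel training for a faster, slightly noisier fit
model = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=42)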

Now that our model is trained, let's predict on X_test.

# Making predictions on the test split
y_predict = model.predict(X_test_encoding)

Model Analysis

Now that we have our predictions, let's analyze the model's performance.

Here we will use "Root Mean Squared Logarithmic Error (RMSLE)" to evaluate accuracy, which calculates the square root of the average of the squared differences between the logarithms of the predicted and actual values.

# How good is our model? Evaluate using RMSLE
def rmsle(y_test, y_pred):
    return np.sqrt(mean_squared_error(np.log1p(y_test), np.log1p(y_pred)))

rmsle_value = rmsle(y_test.reset_index(drop=True), y_predict)
print(f"RMSLE value is: {rmsle_value}")

output:

RMSLE value is: 0.8924370427620725
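As a cross-check, scikit-learn also ships this metric as mean_squared_log_error, so the hand-rolled function can be validated against it:

from sklearn.metrics import mean_squared_log_error

# Should match the rmsle() value above (both apply log1p internally)
print(np.sqrt(mean_squared_log_error(y_test, y_predict)))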

What do you think about the result? For the model to be a good fit, this value should be lower; an RMSLE closer to 0 means more accurate predictions.

Since this is our first model, it serves as the baseline. Now, we can explore other models to further reduce this value and improve the fit, right?

# Put predicted values in dataframe

y_predict_df = pd.DataFrame(y_predict, index=y_test.index, columns=["Sales"])

y_predict_df.head()

output:

y_predict_df

Predicting test data

Our initial model performs well as a baseline, but it is too early to evaluate it using the test data. Instead, we should explore other models to further improve the current model's metrics.

However, as part of the learning process, we will proceed to predict the test data using the current RandomForestRegressor model.

# Predicting the test dataset

y_predict_test = model.predict(test_encoding)

# Put predicted values of test set in the data frame

y_predict_test_df = pd.DataFrame(y_predict_test, index=test_encoding.index, columns=["Sales"])

y_predict_test_df.head()

output:

Predict_values_of_test
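If you want to submit these predictions to Kaggle, the expected format is given by sample_submission.csv (listed in the input directory earlier). A minimal sketch, assuming its columns are id and sales:

# Build a submission file from the test-set predictions
submission = pd.DataFrame({"id": test_encoding.index, "sales": y_predict_test})
submission.to_csv("submission.csv", index=False)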

You now have the skills to clean data, apply the RandomForestRegressor, and make predictions on a test set. You can explore similar datasets on Kaggle and use the same model to make predictions. Additionally, Kaggle offers a great opportunity to deepen your knowledge of machine learning.