In a recent article for Tecnologística, I wrote briefly about the state of AI and Machine Learning adoption in demand planning, illustrating the different uses with examples from leading companies. As feedback, I received several questions about how to start an analytics journey in sales forecasting, which still seems far from the reality of many companies and, at times, something sophisticated and complex.
So Jessica Silva, who teaches the ILOS sales forecasting course with me, and I decided to write a few posts to demystify the subject and provide some practical guidance on the use of tools and libraries available in open-source languages, such as Python and R. Sales forecasting is a task that typically involves statistical analysis, machine learning, or deep learning, depending on the complexity of the data set and the objective of the predictive process.
Let's start this series of posts with a very simple example of applying sales forecasting models in Python, which has a variety of libraries to handle this type of task. Some of the most useful Python libraries for sales forecasting include:
- pandas: data manipulation and analysis;
- NumPy: math operations;
- Matplotlib and Seaborn: data visualization;
- scikit-learn: machine learning;
- pmdarima: the auto_arima function;
- statsmodels: statistical models, including linear regression, and statistical tests;
- Prophet: time series forecasting (developed by Facebook);
- Keras and TensorFlow: deep learning models.
In this first text, our objective is to show the use of a more sophisticated model than most companies employ. For this, we chose a SARIMAX model, a variation of the ARIMA model that can handle time series with a growth trend and marked seasonality while also taking exogenous variables into account, such as discounts, holidays, competitors' prices, or various commercial investments.
To make this first foray simpler, we will use the auto_arima function from the pmdarima library, which is equivalent to a SARIMAX model from the statsmodels library. The difference is that auto_arima automatically tries to find the best model parameters (the AR, I, MA and seasonal terms and their orders) using criteria such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). This can be very useful when you don't have a good idea of which parameters to use.
For this example, we will use the Spyder development environment, but there are several Python IDEs available, along with installation and getting-started tutorials online.
Step 1: Import the required libraries
In the first step, we import the libraries that will be used. These are the "packages" that contain the functions we need for modeling and forecasting.
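As a reference, a minimal version of this step might look like the sketch below; the exact set of imports depends on what the rest of the script uses:

```python
import pandas as pd              # data manipulation and analysis
import numpy as np               # numerical operations
import matplotlib.pyplot as plt  # plotting
import pmdarima as pm            # auto_arima (SARIMAX with automatic parameter search)
```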
Step 2: Load your data
The pre-processed data set contains three columns: one with the date, one with sales, and another with the selected exogenous variable (we will discuss the variable selection process in another post).
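A sketch of the loading step, assuming a monthly CSV file; the file name vendas.csv and the column names date, sales and discount are hypothetical, chosen only for illustration:

```python
# Load the pre-processed data set (file and column names are hypothetical)
df = pd.read_csv("vendas.csv", parse_dates=["date"])
df = df.set_index("date").asfreq("MS")  # monthly series, indexed by month start

y = df["sales"]       # target series
X = df[["discount"]]  # exogenous variable(s)
```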
Step 3: Split your data into training and test sets
Making this split allows us to mitigate the risk of overfitting the model. In this case, for simplicity, we separated the test set from the training set by reserving the last year of observations as the test set.
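Continuing the sketch, the split can be done with simple slicing, holding out the last 12 monthly observations:

```python
# Reserve the last 12 months as the test set
n_test = 12
y_train, y_test = y[:-n_test], y[-n_test:]
X_train, X_test = X[:-n_test], X[-n_test:]
```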
Step 4: Define and fit the model
Here the SARIMAX model is effectively applied. It is worth noting that it was not necessary to define any of the model parameters: the function has a built-in optimization algorithm that selects the most appropriate model, in other words, the one that minimizes the selection criterion (by default, the AIC).
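The fitting step might look like the sketch below. In recent versions of pmdarima the exogenous data is passed through the X argument (older versions used exogenous):

```python
# Search for the best (p, d, q)(P, D, Q, m) configuration on the training data
model = pm.auto_arima(
    y_train,
    X=X_train,       # exogenous variable
    seasonal=True,
    m=12,            # monthly seasonality
    stepwise=True,   # faster stepwise search instead of a full grid search
    trace=True,      # print each candidate model and its AIC
)
print(model.summary())
```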
Step 5: Make predictions
At this stage, we specify the number of future observations we want to forecast, which in this case was 12 months. ARIMA and its derivatives allow forecasts to be made for multiple periods ahead. It is also important to supply the projected values of the exogenous variable for this same time horizon.
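A possible version of this step; for illustration, we reuse the held-out test values of the exogenous variable as its "future" values:

```python
# Forecast 12 periods ahead; the exogenous variable must also be provided
# for the forecast horizon (here, the held-out test values)
n_periods = 12
forecast, conf_int = model.predict(
    n_periods=n_periods,
    X=X_test,
    return_conf_int=True,  # also return confidence intervals
)
```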
Step 6: Create a dataframe with the predictions
A dataframe was created to make it easier to generate the graphs in the next step.
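One way to assemble it, indexing the forecasts and their confidence bounds by the test dates:

```python
# Collect forecasts and confidence bounds in a dataframe indexed by the test dates
pred_df = pd.DataFrame(
    {
        "forecast": np.asarray(forecast),
        "lower": conf_int[:, 0],
        "upper": conf_int[:, 1],
    },
    index=y_test.index,
)
```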
Step 7: Plot the original data and the predictions
Here we also calculate the MAPE (Mean Absolute Percentage Error) for both the training and the test samples. In both, there is good adherence between the model and the series. Carrying out this error analysis on both samples helps to detect a potential overfitting scenario: if the training error were very low and the test error very high, it would signal a case of a "data-biased model", that is, a model capable of almost perfectly predicting the series within the training sample but of little use outside it.
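A sketch of the error calculation and the plot. Note that predict_in_sample can be unreliable for the first few observations of a differenced model, so the training MAPE here is only indicative; the MAPE formula also assumes there are no zero sales values:

```python
# In-sample (training) predictions for the error analysis
fitted = model.predict_in_sample(X=X_train)

def mape(actual, predicted):
    """Mean Absolute Percentage Error (assumes no zeros in `actual`)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

print(f"Training MAPE: {mape(y_train, fitted):.1f}%")
print(f"Test MAPE:     {mape(y_test, pred_df['forecast']):.1f}%")

# Plot the full series, the forecasts and the confidence band
plt.plot(y.index, y, label="actual sales")
plt.plot(pred_df.index, pred_df["forecast"], label="forecast")
plt.fill_between(pred_df.index, pred_df["lower"], pred_df["upper"], alpha=0.2)
plt.legend()
plt.show()
```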
There are other forecast evaluation metrics, such as AIC, MPE and RMSE, among others that should be analyzed when evaluating a model. Additionally, it is always good practice to check the model residuals to ensure that there are no remaining patterns left uncaptured by the model.
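pmdarima exposes the standard statsmodels diagnostic plots, which make this residual check straightforward:

```python
# Residual series, histogram with KDE, Q-Q plot and correlogram
model.plot_diagnostics(figsize=(10, 8))
plt.show()
```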
This is a simplified example, showing that with just a few lines of code it is possible to use a very sophisticated model, including an exogenous variable, which is not very common in companies. Don't know how to write the code? A tip: ChatGPT does. You just need to ask, and to understand how to parameterize the model correctly so you can critique it. In the attachment, we ran the code suggested by ChatGPT, with some small adjustments to the parameterization of the test and training samples. The results can be seen there.
Of course, before putting the model to use, it would be useful to do an exploratory data analysis, which may include checking stationarity, detecting outliers, transforming the data (if necessary, for example, when the series is not stationary) and visualizing the data to understand any underlying trends or patterns.
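As an illustration, a quick stationarity check could use the Augmented Dickey-Fuller test from statsmodels:

```python
from statsmodels.tsa.stattools import adfuller

# Null hypothesis: the series has a unit root (is non-stationary)
adf_stat, p_value, *_ = adfuller(y)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
# A p-value above 0.05 suggests differencing may be needed
# (auto_arima also estimates the order of differencing on its own)
```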
In the next posts, we intend to give examples of how to treat and clean the sales baseline in an automated way, how to use ensemble techniques (combinations of methods) to improve accuracy, a practical example of deep neural networks for time series using LSTM models with hyperparameter tuning, and more. All of this to help make the subject tangible and show that, with a little effort, research and study, you can advance in analytics for sales forecasting!
ATTACHMENT – CHATGPT
Results: