A report on sales forecasting

A Kaggle project

Synopsis

Sales forecasting is a common problem that can be addressed with machine learning techniques. In this report, sales records from the Russian software company 1C Company are analysed. Based on past sales, the number of items that will be sold in a given shop is predicted. The forecast is within +/- 5 units of the true sales 96.06% of the time.


Data

1C Company is an independent software developer, distributor and publisher based in Moscow, Russia. It develops, licenses, supports, and sells computer software and video games. The company operates through a wide network of more than 10000 business partners spread across 25 countries. In 2006 the trademark "1C" was officially acknowledged as widely popular by the Russian Federal Service for Intellectual Property [1].

Here, a small record of products sold in Russian shops is analysed. The data consist of almost 3 million observations: each one indicates how many items were sold on a certain day, in a certain shop, at a certain price. An incremental index is assigned to each month: sales were recorded from January 2013 to October 2015. Additional data consist of pair-wise dictionaries with the actual name and the id of items, shops, and categories, respectively. The original data set is provided by the Kaggle platform and can be found here.
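The record layout described above can be illustrated with a minimal sketch that derives the incremental month index and sums daily records into monthly totals. The field order, the day.month.year date format, the choice of 0 as the index for January 2013, and the names `month_index` and `monthly_totals` are assumptions made for this example; the records are made up.

```python
from collections import defaultdict
from datetime import datetime

def month_index(date_text, first_year=2013):
    """Incremental month index, assuming Jan 2013 -> 0 and Oct 2015 -> 33."""
    # Assumption: dates are written as day.month.year.
    day = datetime.strptime(date_text, "%d.%m.%Y")
    return (day.year - first_year) * 12 + (day.month - 1)

def monthly_totals(records):
    """Sum daily units into (month index, shop id, item id) totals."""
    totals = defaultdict(int)
    for date_text, shop_id, item_id, price, units in records:
        totals[(month_index(date_text), shop_id, item_id)] += units
    return dict(totals)

# Made-up daily records: (date, shop id, item id, price, units sold).
daily = [
    ("02.01.2013", 25, 1001, 399.0, 1),
    ("15.01.2013", 25, 1001, 399.0, 2),
    ("03.10.2015", 31, 2002, 999.0, 1),
]
totals = monthly_totals(daily)
```

Summing over days gives one row per month, shop, and item, which is the granularity at which the forecast is made.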

Data summary

The aim of this report is to give a reliable estimate of the monthly sales of 1C stores. Different stores can have different needs, and certainly some items sell more than others. Before running the prediction algorithm, let's take a closer look at the features that could influence the estimate the most. First, the data set is split into two parts:

  • control data set: from Jan 2013 to Dec 2013, the oldest sales, used to test the prediction algorithm
  • training data set: from Jan 2014 to Oct 2015, the most recent sales, used to build the prediction algorithm
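The split above can be sketched as a simple filter on the month index. This is only an illustration: it assumes the index starts at 0 for Jan 2013, so that Jan 2014 is month 12, and the function name, record layout, and sample records are hypothetical.

```python
def split_by_month(records, first_training_month=12):
    """Split records into (control, training) by incremental month index."""
    # Assumption: month 0 is Jan 2013, so months 0-11 cover the year 2013.
    control = [r for r in records if r["month"] < first_training_month]
    training = [r for r in records if r["month"] >= first_training_month]
    return control, training

# Made-up monthly records.
records = [
    {"month": 0, "shop": 25, "item": 1001, "units": 3},
    {"month": 11, "shop": 31, "item": 2002, "units": 1},
    {"month": 12, "shop": 25, "item": 1001, "units": 2},
    {"month": 33, "shop": 28, "item": 3003, "units": 5},
]
control, training = split_by_month(records)
```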

There is quite a wide range of products and stores: 55 shops sold 17054 items divided into 79 categories, which span a price range of 0.5 - 50999.0 Russian Rubles. Note that 4753 items, 5 shops, and 5 categories from the control data set are "missing" from the training data. This is not a problem; indeed, it makes it possible to see how the algorithm behaves on unknown data. Some figures are shown below to summarize important features of the data.
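Counts of "missing" items, shops, or categories of this kind can be obtained with a set difference. The sketch below uses made-up ids and a hypothetical helper name, and is not the code used for the report.

```python
def unseen_ids(control_ids, training_ids):
    """Return ids that occur in the control set but never in the training set."""
    return set(control_ids) - set(training_ids)

# Made-up item ids; repeated ids are counted once.
control_items = [1001, 1002, 1003, 1003]
training_items = [1001, 1004]
new_items = unseen_ids(control_items, training_items)
```

The same helper applies unchanged to shop ids and category ids.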

The upper panel of Figure 1 shows the total number of items sold in each shop (black lines represent the variability over items). The lower panel of Figure 1 shows that the most productive shop is the Moscow shopping center "Semenovsky" (31) with more than 175 thousand items sold, followed by the Moscow mall "Atrium" (25) with more than 135 thousand sales, and the shopping mall "MEGA Tepliy Stan" (28) on the outskirts of Moscow with more than 100 thousand sales. The least productive shops are the "Zhukovsky" shop (11) and the "Novosibirsk" (36) shopping and entertainment center in Middle Russia.

Figure 1

The upper panel of Figure 2 shows the number of items in each category. Category 40 is the largest and includes more than 3500 movies on DVD; categories 37 and 55 contain more than 1500 items each, representing Blu-ray movies and local music CDs, respectively. The lower panel of Figure 2 shows sales trends of the most popular categories. The total number of items sold, for each month and for each category, has been calculated and plotted. Note that all categories peak in December, likely due to the holiday season. The most sold items are DVDs (40), PC games (30), and local music CDs (55). Less popular categories include special-edition games for PC (28) and games for XBOX 360 (23). Also, note that the most popular categories show a steeper downward trend than the unpopular ones, which have more stable sales.
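The monthly totals per category behind a plot of this kind could be computed along the following lines. The record layout, the function names, and the numbers are invented for illustration; month 11 stands for December 2013 under the assumption that the month index starts at 0.

```python
from collections import defaultdict

def category_trends(records):
    """Total units sold for each category and month index."""
    trends = defaultdict(lambda: defaultdict(int))
    for month, category, units in records:
        trends[category][month] += units
    return {category: dict(months) for category, months in trends.items()}

def peak_month(monthly_units):
    """Month index with the highest total for one category."""
    return max(monthly_units, key=monthly_units.get)

# Made-up records: (month index, category id, units sold).
records = [
    (10, 40, 120), (11, 40, 200), (11, 40, 50), (12, 40, 90),
    (11, 30, 80), (12, 30, 60),
]
trends = category_trends(records)
```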

Figure 2

The left panel of Figure 3 shows in blue the price range for each category. For example, the price of a DVD (category 40) ranges from 10 to less than 5000 Rubles, which corresponds to 0.14-70 Euros; a PC game (30) is sold for up to 5000 Rubles (70 Euros). The right panel of Figure 3 shows how many items, on average, have been sold in each price range. The most popular items cost between 100 and 499 Rubles, with 40 thousand units sold, followed by items with a price up to 5000 Rubles. Medium-priced items contribute strongly to the total sales, while cheap or very expensive items are less relevant.
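Grouping sales by price range can be sketched with a sorted list of bin edges. Only the 100-499 range is named in the text; the other edges, the labels, and the sample sales below are hypothetical, and the sketch sums units per range rather than averaging them.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bin edges in Rubles; only the 100-499 range comes from the text.
BIN_EDGES = [100, 500, 1000, 5000]
BIN_LABELS = ["below 100", "100-499", "500-999", "1000-4999", "5000 and above"]

def price_bin(price):
    """Label of the price range that contains the given price."""
    return BIN_LABELS[bisect_right(BIN_EDGES, price)]

def units_per_bin(sales):
    """Total units sold in each price range."""
    totals = defaultdict(int)
    for price, units in sales:
        totals[price_bin(price)] += units
    return dict(totals)

# Made-up sales: (price in Rubles, units sold).
sales = [(0.5, 1), (149.0, 3), (399.0, 2), (999.0, 1), (50999.0, 1)]
totals = units_per_bin(sales)
```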

Figure 3

In addition to price and category, another feature that can help forecast sales is the average past sales. For each month, each shop, and each item, the number of units sold up to that point (starting from Jan 2014) is averaged. The average past sales can also be used as a rough forecast against which to evaluate the result of the machine learning algorithm.
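One possible reading of this feature is a running mean over earlier months, sketched below. The text does not say whether the current month is included or how months without sales are treated; this sketch assumes that only strictly earlier months with a record are averaged and that the first month of a shop and item pair gets 0. The function name and data are hypothetical.

```python
from collections import defaultdict

def average_past_sales(monthly_units):
    """Map (month, shop, item) -> mean units sold in earlier months.

    monthly_units maps (month, shop, item) -> units sold in that month.
    """
    history = defaultdict(list)  # (shop, item) -> units of earlier months
    features = {}
    # Sorting the keys visits months in chronological order.
    for month, shop, item in sorted(monthly_units):
        past = history[(shop, item)]
        # Assumption: no earlier record means an average of 0.
        features[(month, shop, item)] = sum(past) / len(past) if past else 0.0
        past.append(monthly_units[(month, shop, item)])
    return features

# Made-up monthly totals; month 12 is Jan 2014 if the index starts at 0.
monthly = {
    (12, 25, 1001): 4,
    (13, 25, 1001): 2,
    (14, 25, 1001): 6,
    (13, 31, 2002): 1,
}
features = average_past_sales(monthly)
```

Excluding the current month keeps the feature free of the value being predicted, which also makes it usable as a rough forecast on its own.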

Forecast

The Random Forest method has been chosen to forecast the monthly sales for each item and shop of 1C Company. The algorithm has been trained on the most recent sales, from Jan 2014 to Oct 2015. The predicted number of items sold during the training period (Jan 2014 - Oct 2015) exactly matches the true number of sales 72.93% of the time. However, allowing a margin of error of just 1 unit, the accuracy increases to 88.06%, and within +/- 5 units the accuracy reaches 97.27%.
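The accuracy figures quoted here count a forecast as correct when it lies within a given number of units of the true value. A minimal sketch of such a metric follows; rounding the prediction to the nearest integer before comparing is an assumption, as are the function name and the sample numbers. The Random Forest model itself is not reproduced.

```python
def accuracy_within(predicted, actual, tolerance=0):
    """Percentage of forecasts within +/- tolerance units of the true sales."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Assumption: predictions are rounded to whole units before comparison.
    hits = sum(abs(round(p) - a) <= tolerance for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)

# Made-up forecasts and true monthly sales.
predicted = [1.2, 3.0, 0.4, 5.8, 2.0]
actual = [1, 4, 0, 2, 2]

exact = accuracy_within(predicted, actual, tolerance=0)
within_one = accuracy_within(predicted, actual, tolerance=1)
within_five = accuracy_within(predicted, actual, tolerance=5)
```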

To perform a reliable test of the algorithm, the true number of items sold from Jan 2013 to Dec 2013 has been withheld from it. The average past sales have been copied from the training data set. For new items and shops, the average past sales is set to 0. The forecast for the "control" sales is exactly right 69.84% of the time. Allowing an error within +/- 1 and +/- 5 units, the accuracy increases to 87.56% and 96.06%, respectively.
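Filling the average past sales for the control set amounts to a lookup with a default of 0 for unseen shops and items. The sketch below assumes one stored average per shop and item pair taken from the training data; which average is copied is not specified in the text, and all names and values are hypothetical.

```python
def control_features(control_pairs, training_average):
    """Average past sales for each (shop, item) pair of the control set."""
    # Pairs never seen in training (new items or shops) get 0.
    return [training_average.get(pair, 0.0) for pair in control_pairs]

# Made-up averages from the training data: (shop, item) -> average units.
training_average = {(25, 1001): 3.0, (31, 2002): 1.5}

# Control pairs; shop 99 and item 7777 do not occur in the training data.
control_pairs = [(25, 1001), (99, 1001), (31, 7777)]
features = control_features(control_pairs, training_average)
```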

Figure 4

Figure 4 compares the machine learning prediction with the estimate based on "historic" data for a subset of the control data set. The overall accuracy of the historic estimate is 36.85%, 74.95%, and 94.41% within +/-0, +/-1, and +/-5 items, respectively.