Similarity-based sales forecasting using improved ConvLSTM and prophet

Abstract

Sales forecasting is an important part of e-commerce and is critical to smart business decisions. Traditional forecasting methods mainly focus on building a forecasting model, training it on historical data, and then using it to forecast future sales. Such methods are feasible and effective for products with rich historical data, but they perform less well for newly listed products with little or no historical data. In this paper, drawing on the idea of collaborative filtering, a similarity-based sales forecasting (S-SF) method is proposed. The implementation framework of S-SF includes three modules applied in order. The similarity module is responsible for generating the top-k similar products of a given new product. We calculate the similarity based on two data types: time series data of sales and text data such as product attributes. In the learning module, we propose an attention-based ConvLSTM model, which we call AttConvLSTM, and optimize its loss function with the convex function information entropy. AttConvLSTM is then integrated with the Facebook Prophet model to forecast the sales of the top-k similar products based on their historical data. The prediction results of all top-k similar products are fused in the forecasting module through alignment and scaling operations to forecast the target product's sales. The experimental results show that the proposed S-SF method can adapt to the sales forecasting of both mature products and new products, which shows excellent diversity, and that the forecasting idea based on similar products improves the accuracy of sales forecasting.

1. Introduction

Sales forecasting is critical to efficient sales management and business resource allocation. Reducing inventory, coordinating employee schedules, purchasing stocks, and evaluating sales teams all require accurate sales forecasting [1, 2]. Most of the methods used for sales forecasting are based on time series forecasting, which forecast the future based on existing historical sales data [3, 4]. Many early statistical-based models such as ARIMA were used for predictive tasks based on large amounts of historical data and achieved good performance [5]. The Facebook Prophet model supports non-linear fitting of time series models of different granularities, and special impact factors can be added to the model to make predictions more flexible [6]. With the popularity of deep learning, deep learning-based models can forecast the future more accurately and the accuracy of the models is also greatly dependent on big data [7, 8].

Some new products with a short time on the market have very little historical data. Models suitable for big data are not necessarily suitable for small data. Generally, there are two ways to deal with the problem of insufficient historical data. One is to consider more features, such as product attributes, to help forecast [9]. When we build forecasting models, in addition to time series data, there are many non-sequential data to be considered, such as product attributes and retailer attributes. However, it is difficult to build a forecasting model that takes into account both sequential data and non-sequential data. Another way is to use more complex models so as to get good diversity [8]. Inspired by the collaborative filtering approach of recommender systems, we believe that similar products will show similar characteristics in sales trends. Therefore, we can use the sales trends of similar products to forecast the target product [10].

In this paper, we propose a sales forecasting method for both mature and new products, using a product similarity-based strategy and an integrated approach of two learning models, which can help enterprises to scientifically and reasonably forecast long-term and short-term product sales. It is worth mentioning that we consider multimodal data, time series and text data, blend these data, and make predictions based on sales of similar products. The specific innovations are as follows.

1) The similarity-based sales forecasting (S-SF) method proposed in this paper differs from the traditional forecasting mode, which is based on the historical data of the product itself. S-SF utilizes the predicted values of multiple similar products to fit the sales of the target product. This method can help address the cold-start problem in new-product forecasting tasks.
2) We measure product similarity from several different perspectives, taking into account time-series data of sales and text data of products and retailers. These auxiliary data make forecasting more reliable. The top-k similar products pass through the learning module, and the processing results provide information for the final prediction.
3) We extend the convolutional LSTM network (ConvLSTM) by adding an attention mechanism and optimize the loss function with the convex function information entropy. The newly proposed attention-based ConvLSTM model (AttConvLSTM) is then integrated with Prophet to provide better diversity. This also demonstrates the advantages of integrating a statistical learning model with a deep learning model.
4) We make fine-grained predictions by mapping the sales forecasts of the different retailers of the top-k products to the corresponding values of the target product through alignment and scaling operations. Finally, these values are merged to generate an overall prediction for the target product.

The structure of the paper is as follows. We introduce the related work in Section 2. Section 3 presents the framework of the similarity-based sales forecasting method and the similarity calculation strategy. In Section 4 we propose an improved ConvLSTM model and integrate it with the Prophet model to improve the overall diversity. Section 5 shows the experimental results and evaluates our work. We summarize the work in Section 6.
2. Related work

Traditional statistical analysis methods, machine learning methods, and the extensively studied deep learning methods can all be used for sales forecasting. We review the relevant research work from two aspects.

Statistical time series forecasting methods. Sales forecasting is often treated as a time series forecasting problem. Among the models of time series forecasting, the autoregressive integrated moving average (ARIMA) model is one of the prominent univariate time series models, which has statistical characteristics, and the well-known Box-Jenkins method is adopted in the model selection process [5]. The ARIMA model is not only applicable to various exponential smoothing techniques, but also contains multiple time series models: autoregression (AR), moving average (MA), and autoregressive moving average (ARMA). An extension to ARIMA that supports the direct modeling of the seasonal component of the series is called seasonal autoregressive integrated moving average (SARIMA) [11]. Both the ARIMA model and the SARIMA model are suitable for prediction based on a large amount of historical data, which brings high computational costs and strict stationarity requirements. Because of this, they are not suitable for high-dimensional multi-modal time series forecasting. Facebook Prophet is also a model for forecasting time series data. It is based on an additive regression model and can be fitted to yearly, weekly, and daily periodic data by using a nonlinear approach. It also includes a holiday factor, which can successfully respond to sudden drops or sudden rises during special periods such as China's National Day and Spring Festival in actual production scenarios [6].

Machine learning and deep learning methods. Neural networks have also received an increasing amount of attention in time series analysis. A recurrent neural network (RNN) can achieve high precision on certain sequential machine learning tasks [12, 13]. Long short-term memory (LSTM) is a special RNN designed mainly to address the vanishing and exploding gradient problems in long-sequence training, and it performs better than a plain RNN on longer sequences [14, 15]. The gated recurrent unit (GRU) is a variant of LSTM. It combines the forget gate and the input gate into an update gate, while also merging the cell state and the hidden state. While having a similar prediction effect, GRU is faster to train because of its simpler structure [16]. The ConvLSTM model replaces the fully connected layer in the LSTM model with a convolutional layer, making full use of the strong feature extraction capabilities of the convolutional neural network (CNN). It can not only capture time series information, but also extract spatial features [17]. The attention mechanism has received wide attention since it was proposed, and it achieves good results in many fields of application [18, 19]. In this paper, in the context of sales forecasting, we improve the ConvLSTM model, add the attention mechanism to it, and explore the integration of statistical learning models with deep learning models.

3. Similarity-based sales forecasting method

In this section, we first formulate the time series forecasting problem and similarity-based forecasting problem. Then we present the framework of the proposed S-SF method and discuss the details of the framework.

3.1 Problem formulation

The formal description of time series forecasting can be expressed as follows. Given a series of fully observed time series signals { $x_{1}$ , $x_{2}$ , …, $x_{T}$ } where $x_{t}\in\mathbb{R}^{M}(1\leqslant t\leqslant T)$ , $T$ is the current timestamp, and $M$ is the variable dimension, we aim at forecasting a series of future signals such as $y_{T+h}$ where $h$ is the desirable horizon ahead of the current timestamp. In most of the cases, the horizon of the forecasting task is chosen according to the demands of the environmental settings, e.g. for sales forecasting, the horizon of interest ranges from month to week. We hence formulate the input matrix at timestamp $T$ as $X_{T}=\{x_{1},x_{2},\ldots,x_{T}\}\in\mathbb{R}^{M\times T}$ . The basic method is to build a mapping function $f(\cdot)$ .

$\displaystyle y_{T+h}=f(X_{T},\theta)=f(x_{1},x_{2},\ldots,x_{T},\theta)$ (1)

where the parameter vector $\theta$ will be learned in the training process.

Due to the different time-to-market of different products, it is hard to use the above method to forecast sales for some new products. Inspired by the idea of collaborative filtering, we propose a similarity-based sales forecasting method S-SF, i.e., the sales of the current product are fitted according to the sales of similar products. How to measure similarity has become the primary task.

Suppose $N$ is the number of products, a series of fully observed time series signals of product $i$ is $\{x_{1}^{i},x_{2}^{i},\ldots,x_{T}^{i}\}$ , and $y_{T+h}^{i}$ is the forecasting value of product $i$ . Then the input time series matrix of product $i$ is formulated as $X_{T}^{i}=\{x_{1}^{i},x_{2}^{i},\ldots,x_{T}^{i}\}\in\mathbb{R}^{M\times T}$ and all sales information is expressed as $\bar{X}_{T}=\{X_{T}^{1},X_{T}^{2},\ldots,X_{T}^{N}\}\in\mathbb{R}^{M\times T\times N}$ . $A$ is the attribute matrix of all products, the attribute vector of product $i$ is $A^{i}=\{a_{1}^{i},a_{2}^{i},\ldots,a_{U}^{i}\}(1\leqslant i\leqslant N)$ , and $U$ is the variable dimension. Suppose $D$ is the retailer information of all products and $V$ is the number of retailers. Here $d_{v}^{i}=0$ means that product $i$ is not sold by retailer $v$ . In order to get a more accurate forecast, in this paper, the sales value of a product is the sum of the sales data of all the retailers related to the product. For example, $x_{t}^{i}$ can be expressed as $\{x_{t}^{i,1},x_{t}^{i,2},\ldots,x_{t}^{i,V}\}(1\leqslant t\leqslant T,x_{t}^{i}=\sum_{v=1}^{V}x_{t}^{i,v},d_{v}^{i}=\sum_{t=1}^{T}x_{t}^{i,v})$ . We generate a top-k similar product matrix $L$ according to some similarity computation strategies; the list $\{L_{i,1},L_{i,2},\ldots,L_{i,k}\}$ refers to the similar product list of product $i$ .

$\displaystyle L_{i}=\textit{topK}(\bar{X}_{T},A,D,i)$ (2)

To forecast the sales of a product $j(1\leqslant j\leqslant N)$ , we need to first forecast the sales of each retailer $v$ of the similar product $j$ .

$\displaystyle\hat{y}_{T+h}^{j,v}=f_{1}(X_{T}^{j,v},\theta)=f_{1}(x_{1}^{j,v},x_{2}^{j,v},\ldots,x_{T}^{j,v},\theta)$ (3)

Then, we forecast each retailer's sales of product $i$ according to the retailers' sales of similar products.

$\displaystyle\hat{y}_{T+h}^{i,v}=f_{2}(\hat{y}_{T+h}^{L_{i,1},v},\hat{y}_{T+h}^{L_{i,2},v},\ldots,\hat{y}_{T+h}^{L_{i,k},v})$ (4)

The final step is to calculate the forecasted value of product $i$ by merging the result.

$\displaystyle y_{T+h}^{i}=\sum_{v=1}^{V}\hat{y}_{T+h}^{i,v}$ (5)

Figure 1 shows the framework of S-SF method. It is made up of three modules: similarity module, learning module, and forecasting module. The similarity module mainly deals with two types of calculations, time series similarity calculation and text similarity calculation. The calculation results are combined to generate top-k similar products, which will be taken as the input to the second module. In the learning module, an improved ConvLSTM model is built, and Prophet model is integrated into it to perform prediction operations for k similar products. Finally, in the forecasting module, alignment and scaling processing are performed according to the prediction results of similar products.

We consider the similarity between products in three ways. The first concerns the sales trend, which is a comparison of time series data. The second concerns the sales of different retailers, and the third concerns the attributes of products. The latter two are non-sequential data comparisons. After comparing multiple types of similarity, we select the most similar products and then use machine learning to forecast for these similar products. Finally, the forecasting value of the product is determined by these similar products. Method S-SF addresses the problem of sales forecasting for two different product categories. The first category of products has a long sales time and rich historical data. Generally, mature products are those that have more than 1 year of sales data and have still been on sale in recent months. The second category of products has a short sales time and a small amount of sales data. In this paper, new products are those that have been sold in the market for less than 1 year. New products have less sales data and poor market stability, and they are extremely difficult to forecast. Due to the different forecasting backgrounds, an integrated machine learning model is considered in this paper.

Figure 1.

Framework of S-SF. It consists of three modules: similarity module, learning module, and forecasting module. The red box represents the target product to be forecasted.

3.2 Similarity computation of time series

Time series data are ubiquitous in our everyday life. In order to perform similarity comparison of time series data more effectively, the first step is denoising and normalization. After preprocessing, the similarity calculation of time series data will be more accurate. The following step 1) and step 2) will introduce the preprocessing strategy.

1) Denoising. Denoising is one of the key steps in the preprocessing of time series data. To make the similarity comparison more reliable, the time series being compared are usually required to have a mean and variance that do not change significantly within a certain period of time. For approximately normally distributed data, about 99.7% of values lie within 3 standard deviations of the mean. We therefore treat values that deviate from the mean by more than 3 standard deviations as noise, i.e., $p(|x-\mu|>3\sigma)\leqslant 0.003$ where $\mu$ is the mean and $\sigma$ is the standard deviation. An event with probability below 0.003 is a small-probability event, so these sales values are considered outliers and are replaced with $3\sigma+\mu$ . Figure 2 shows an example where there are two exceptions marked with blue lines.

Figure 2.
An example of exception value processing. The x-axis represents time and the y-axis represents sales. There are two blue markers, which are outliers and are replaced.

2) Normalization. The sales trends for each product are different. Sales of some products will continue to grow, sales of some products will suddenly drop, and others will remain stable. For product $i$ , the observed time series signal $x_{t}^{i}(1\leqslant t\leqslant T)$ can be normalized as

$\displaystyle x_{t}^{i}=\frac{x_{t}^{i}-\min(i)}{\max(i)-\min(i)}$ (6)

where $\max(i)$ and $\min(i)$ represent the maximum sales value and the minimum sales value of product $i$ from 1 to $T$ , respectively.
3) Similarity calculation. In natural science, the Pearson correlation coefficient is widely used to measure the degree of correlation between two variables, including time series variables, with values between $-1$ and 1. Since it centers the vectors, it has an advantage in vector similarity calculation compared to the traditional Euclidean distance. We use the Pearson correlation coefficient to calculate the similarity between two products.

For products $i$ and $j$ , the similarity based on Pearson correlation coefficient is expressed as:

$\displaystyle s_{ij}^{1}=\frac{\sum_{t=1}^{T}(x_{t}^{i}-\bar{x}^{i})(x_{t}^{j}-\bar{x}^{j})}{\sqrt{\sum_{t=1}^{T}(x_{t}^{i}-\bar{x}^{i})^{2}}\sqrt{\sum_{t=1}^{T}(x_{t}^{j}-\bar{x}^{j})^{2}}},\quad\bar{x}^{i}=\frac{1}{T}\sum_{t=1}^{T}x_{t}^{i},\quad\bar{x}^{j}=\frac{1}{T}\sum_{t=1}^{T}x_{t}^{j}$ (7)

The larger the value of $s_{ij}^{1}$ , the higher the similarity between product $i$ and product $j$ .
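
The three preprocessing and comparison steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, and clipping low-side outliers to $\mu-3\sigma$ is an assumption, since the text only specifies the replacement value $3\sigma+\mu$ for high outliers.

```python
import math
from statistics import mean, pstdev


def clip_outliers(series):
    """Step 1: replace values beyond 3 standard deviations from the mean."""
    mu, sigma = mean(series), pstdev(series)
    upper = mu + 3 * sigma
    lower = mu - 3 * sigma  # assumption: symmetric treatment of low outliers
    return [min(max(x, lower), upper) for x in series]


def min_max_normalize(series):
    """Step 2: rescale a series to [0, 1] as in Eq. (6)."""
    lo, hi = min(series), max(series)
    if hi == lo:  # a constant series carries no trend information
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]


def pearson(a, b):
    """Step 3: Pearson correlation coefficient as in Eq. (7)."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:  # correlation is undefined for constant series
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def series_similarity(sales_i, sales_j):
    """Full pipeline: denoise, normalize, then compare two sales series."""
    clean_i = min_max_normalize(clip_outliers(sales_i))
    clean_j = min_max_normalize(clip_outliers(sales_j))
    return pearson(clean_i, clean_j)
```

Because the Pearson coefficient is invariant to positive linear rescaling, the normalization step does not change $s_{ij}^{1}$ by itself; it matters mainly when the normalized series are reused elsewhere in the pipeline.
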
3.3 Similarity computation of non-sequential data

If two products are sold by the same retailers, their sales trends are likely to be similar. Here $d_{v}^{i}=0$ means that product $i$ is not sold by retailer $v$ .

For products $i$ and $j$ , the similarity of sales retailer based on Pearson correlation coefficient is expressed as:

$\displaystyle s_{ij}^{2}=\frac{\sum_{v=1}^{V}(d_{v}^{i}-\bar{d}^{i})(d_{v}^{j}-\bar{d}^{j})}{\sqrt{\sum_{v=1}^{V}(d_{v}^{i}-\bar{d}^{i})^{2}}\sqrt{\sum_{v=1}^{V}(d_{v}^{j}-\bar{d}^{j})^{2}}},\quad\bar{d}^{i}=\frac{1}{V}\sum_{v=1}^{V}d_{v}^{i},\quad\bar{d}^{j}=\frac{1}{V}\sum_{v=1}^{V}d_{v}^{j}$ (8)

Attributes can be used to accurately describe the basics of a product and are one of the main indicators for measuring product similarity. This is also a non-sequential similarity calculation. Since product attributes such as brand and label are textual representations, there is no way to normalize them directly, so we convert the text into a one-hot encoding to form a multidimensional vector. The Pearson similarity calculation method is still used here.

$\displaystyle s_{ij}^{3}=\frac{\sum_{u=1}^{U}(a_{u}^{i}-\bar{a}^{i})(a_{u}^{j}-\bar{a}^{j})}{\sqrt{\sum_{u=1}^{U}(a_{u}^{i}-\bar{a}^{i})^{2}}\sqrt{\sum_{u=1}^{U}(a_{u}^{j}-\bar{a}^{j})^{2}}},\quad\bar{a}^{i}=\frac{1}{U}\sum_{u=1}^{U}a_{u}^{i},\quad\bar{a}^{j}=\frac{1}{U}\sum_{u=1}^{U}a_{u}^{j}$ (9)
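
As an illustration of the attribute similarity, the sketch below one-hot encodes textual attributes and applies the Pearson formula of Eq. (9). The attribute fields (`brand`, `label`), the product names, and the helper names are hypothetical; the paper does not specify the exact attribute vocabulary.

```python
import math


def one_hot_vectors(products, fields):
    """Encode each product's textual attributes as a 0/1 vector.

    `products` maps a product name to a dict of attribute values; every
    distinct (field, value) pair becomes one dimension of the vector.
    """
    vocab = sorted({(f, attrs[f]) for attrs in products.values() for f in fields})
    return {
        name: [1.0 if attrs[f] == value else 0.0 for f, value in vocab]
        for name, attrs in products.items()
    }


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Hypothetical catalogue with two textual attributes per product.
catalogue = {
    "A": {"brand": "X", "label": "snack"},
    "B": {"brand": "X", "label": "snack"},
    "C": {"brand": "Y", "label": "drink"},
}
vectors = one_hot_vectors(catalogue, ["brand", "label"])
print(pearson(vectors["A"], vectors["B"]))  # identical attributes
print(pearson(vectors["A"], vectors["C"]))  # no attribute in common
```
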

3.4 Top-k similarity calculation

Through the above three kinds of similarity calculations, we get three similarity values, $s_{ij}^{1}$ , $s_{ij}^{2}$ , and $s_{ij}^{3}$ for products $i$ and $j$ . Then we combine these three values into one comprehensive value $s_{ij}$ .

$\displaystyle s_{ij}=w_{1}s_{ij}^{1}+w_{2}s_{ij}^{2}+w_{3}s_{ij}^{3}$ (10)

where $w_{1}$ , $w_{2}$ , and $w_{3}$ are three weights of three different similarity indicators. Their values can be adjusted according to their importance.
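
A minimal sketch of the weighted combination in Eq. (10) followed by top-k selection is given below. The weight values and the function names are assumptions chosen for illustration; the text states only that the weights are adjusted according to importance.

```python
def combined_similarity(s1, s2, s3, weights=(0.5, 0.25, 0.25)):
    """Eq. (10): weighted sum of the three similarity values.

    The default weights are placeholders, not values from the paper.
    """
    w1, w2, w3 = weights
    return w1 * s1 + w2 * s2 + w3 * s3


def top_k(target, similarities, k, weights=(0.5, 0.25, 0.25)):
    """Return the k candidates most similar to `target`.

    `similarities` maps a candidate product to its (s1, s2, s3) tuple
    computed against the target product.
    """
    scored = [
        (combined_similarity(*scores, weights=weights), name)
        for name, scores in similarities.items()
        if name != target
    ]
    scored.sort(reverse=True)  # highest combined similarity first
    return [name for _, name in scored[:k]]


# Hypothetical similarity values of three candidates against product "A".
candidates = {
    "B": (0.9, 0.8, 0.7),
    "C": (0.2, 0.1, 0.0),
    "D": (0.6, 0.9, 0.9),
}
print(top_k("A", candidates, k=2))
```

With these placeholder weights the combined scores are 0.825 for B, 0.125 for C, and 0.75 for D, so the two most similar products are B and D.
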

4. Machine learning based forecasting

4.1 An improved ConvLSTM model

ConvLSTM is a variant of LSTM. The main advantage is that more information can be extracted by turning the weight calculation in LSTM into a convolution operation. Here we make further adjustments to ConvLSTM, including adding more data to the input, adding attention mechanism, and modifying the loss function. We call the new model AttConvLSTM, attention mechanism based ConvLSTM.

1) Add periodic historical data to the input

A large number of experiments have proved that ConvLSTM performs better than LSTM in time series prediction tasks. If only simple features are used as input to the model, it is also difficult for ConvLSTM to fit the varying time series data. A better way is to use more useful features. Here we extract additional features from the dataset to help the model fit better and achieve better predictions.

Akaike information criterion (AIC) is a measure of the goodness of fit of statistical models, proposed by Japanese statistician Akaike in 1974 [20]. It is built on the concept of entropy and provides a standard for weighing the complexity of the estimated model and the goodness of the fitted data. AIC is defined as

$\displaystyle AIC=2\bar{k}-2\ln(\bar{L})$ (11)

where $\bar{k}$ is the number of model parameters and $\bar{L}$ is the maximized value of the likelihood function.

For time series forecasting, a critical but uncertain question is how much data to use as a cycle. When choosing the best model from a set of alternative models, we usually choose the model with the smallest AIC. Therefore, we calculated AIC values for ConvLSTM with periods from 1 to 50 for the dataset in the experiment. Some of the results are shown in Table 1.

It can be seen from Table 1 that AIC of period 13 is the lowest, i.e., the model reaches the optimal state at period 13. The AIC values are also very low at period 26, 39, and 52. This shows that 13 is a very good cycle value. Actually, value 13 coincides with the length of a quarter, that is, sales data is cyclical every quarter. Based on AIC analysis, we use the data from the previous cycle as the second feature input of ConvLSTM, and these extra features help to better train the model.

Table 1
An instance of AIC statistics. The unit of period is a week, 7 days

Period	1	2	4	9	13	26	39	52
AIC	42.3	43.1	41.3	39.7	21.5	23.2	23.6	22.8
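
The selection step described above amounts to taking the period with the smallest AIC and checking whether its multiples also score low. The sketch below reproduces that reasoning on the values of Table 1; the function names are hypothetical.

```python
# AIC values from Table 1, keyed by period length in weeks.
aic_by_period = {1: 42.3, 2: 43.1, 4: 41.3, 9: 39.7,
                 13: 21.5, 26: 23.2, 39: 23.6, 52: 22.8}


def best_period(aic_table):
    """Return the period whose model has the smallest AIC."""
    return min(aic_table, key=aic_table.get)


def multiples_of(aic_table, base):
    """Collect the AIC values at integer multiples of the base period."""
    return {p: a for p, a in aic_table.items() if p != base and p % base == 0}


best = best_period(aic_by_period)
print(best)                               # period with the lowest AIC
print(multiples_of(aic_by_period, best))  # AIC at 2x, 3x, 4x that period
```
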

2) Add attention mechanism

ConvLSTM has a better memory effect than RNN when dealing with longer time series, but there is still a lot of space for improvement in the effective screening of information. Fortunately, the attention mechanism can help preserve the intermediate results of ConvLSTM. We use self-attention model to learn and associate it with the output of ConvLSTM model for information screening purposes. This can greatly improve the learning effect of the model, so as to achieve the purpose of improving the accuracy of prediction. The ConvLSTM structure after adding attention mechanism is shown in Fig. 3.

Figure 3.
The architecture of AttConvLSTM model where the yellow box represents the convolution operation and the blue box is the standard attention module.

3) Modify the loss function

The loss function of ConvLSTM typically uses the mean squared error (MSE). However, MSE is not a convex function and has higher requirements on the amount of data and the number of iterations, so it does not help the model to fit quickly if the amount of data is not sufficient. Here we propose to use the convex function information entropy to optimize the loss function. Information entropy is defined as:

$\displaystyle H=E[-\log p_{i}]=-\sum_{i=1}^{\omega}p_{i}\log p_{i}$ (12)

where $\omega$ is the length of the window. The loss function is expressed as:

$\displaystyle\textit{loss}=-\sum_{i=1}^{Z}\hat{p}_{i}\log\hat{p}_{i}-\left(-\sum_{i=1}^{\omega}p_{i}\log p_{i}\right)$ (13)

where $\hat{p}=\{\hat{p}_{1},\hat{p}_{2},\ldots,\hat{p}_{\omega}\}$ is the forecasting vector, $p=\{p_{1},p_{2},\ldots,p_{\omega}\}$ is the true observed vector, and $Z$ is the number of samples.
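
To make the entropy-based loss concrete, the sketch below evaluates the difference between the entropy of a forecast window and the entropy of the observed window. How sales values are turned into the probabilities $p_{i}$ is not stated in the text, so normalizing each non-negative window to sum to 1 is an assumption, as are the function names.

```python
import math


def to_distribution(window):
    """Assumption: turn a non-negative sales window into probabilities."""
    total = sum(window)
    if total <= 0:
        raise ValueError("window must contain positive sales")
    return [value / total for value in window]


def entropy(probs):
    """Information entropy H = -sum(p_i * log p_i), with 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_loss(predicted_window, observed_window):
    """Entropy of the forecast minus entropy of the observation, per Eq. (13)."""
    return (entropy(to_distribution(predicted_window))
            - entropy(to_distribution(observed_window)))


observed = [120.0, 80.0, 100.0, 100.0]
print(entropy_loss(observed, observed))              # identical windows
print(entropy_loss([100.0] * 4, [400.0, 0, 0, 0]))   # flat forecast vs. spike
```

Under this reading the loss is zero when the two windows have the same entropy and can be negative when the forecast is more concentrated than the observation; whether an absolute value or a squared difference is applied in practice is not specified in the text.
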

4.2 Integrated prophet model

Research and our implementation show that a single model cannot cope with every situation that arises in time series forecasting. In the forecasting task, the following issues need to be considered. 1) The observations are historical data given at an hourly, daily, weekly, or monthly granularity. 2) The periodic scale is not fixed. For example, a week includes seven days, a month contains about 30 days, and a year contains 12 months. 3) There are some known important holidays, such as various legal holidays or traditional festivals. 4) There may be missing values or outliers in the data. 5) Trends usually change nonlinearly and may reach natural limits.

Facebook Prophet is a model for forecasting time series data. It is based on an additive model in which non-linear trends are fitted together with yearly, weekly, and seasonal effects as well as holiday effects. It is robust to missing values and trend shifts, and it typically handles outliers well. For daily periodic data with at least one year of historical data, Prophet performs very well. Holiday information can also be added to Prophet to help the model eliminate outliers at the time of fitting, resulting in more accurate predictions. In addition to holidays, special dates can also be added, such as China’s Double 11 and 618 Shopping Festivals. The forecasting value at time $t$ is expressed as:

$\displaystyle y(t)=g_{1}(t)+g_{2}(t)+g_{3}(t)+\epsilon_{t}$ (14)

where

$g_{1}(t)$ : Trend item, indicating a non-periodic change including saturated growth and piecewise linear. There are two different forms to be selected.

$g_{2}(t)$ : Period term, representing a periodic change (modeled with a Fourier series), such as hourly, daily, monthly, and yearly cycles.

$g_{3}(t)$ : Holiday item, indicating a change in holidays such as Spring festival or major events such as Double 11.

$\epsilon_{t}$ : Noise term, indicating random fluctuations that cannot be modeled or forecasted, such as those caused by government policy changes. It is assumed to follow a Gaussian distribution.

Prophet uses L-BFGS to optimize and find the maximum a posteriori estimate of each parameter.
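
The additive structure of Eq. (14) can be illustrated without the Prophet library itself. The sketch below evaluates a piecewise-linear trend, a truncated Fourier series, and a holiday bump at a given time step; all parameter values and names are hypothetical, the zero-mean noise term is omitted, and no fitting is performed, so this shows only how the three components add up.

```python
import math


def trend(t, offset, base_rate, changepoints):
    """g1(t): piecewise-linear trend; each (time, delta) pair changes
    the slope by `delta` after `time`."""
    value = offset + base_rate * t
    for change_time, delta in changepoints:
        if t > change_time:
            value += delta * (t - change_time)
    return value


def seasonality(t, period, coeffs):
    """g2(t): truncated Fourier series with coefficients [(a_1, b_1), ...]."""
    total = 0.0
    for n, (a, b) in enumerate(coeffs, start=1):
        angle = 2.0 * math.pi * n * t / period
        total += a * math.cos(angle) + b * math.sin(angle)
    return total


def holiday_effect(t, effects):
    """g3(t): additive bump on listed days, zero elsewhere."""
    return effects.get(t, 0.0)


def additive_forecast(t, params):
    """y(t) = g1(t) + g2(t) + g3(t), with the noise term left out."""
    return (trend(t, params["offset"], params["rate"], params["changepoints"])
            + seasonality(t, params["period"], params["fourier"])
            + holiday_effect(t, params["holidays"]))


# Hypothetical parameters: weekly cycle, one slope change, one promotion day.
params = {
    "offset": 100.0,
    "rate": 2.0,
    "changepoints": [(5, 1.0)],
    "period": 7,
    "fourier": [(3.0, 0.0)],
    "holidays": {14: 50.0},
}
print(additive_forecast(14, params))
```
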

4.3 Model ensemble and forecasting

The prediction results obtained from AttConvLSTM and Prophet models are denoted as $\dot{b}_{v}^{j}$ and $\widetilde{b}_{v}^{j}$ , respectively. We use a linear model for integration as follows:

$\displaystyle b_{v}^{j}=\mu_{1}\dot{b}_{v}^{j}+\mu_{2}\widetilde{b}_{v}^{j}$ (15)

It is possible to get multiple similar products through the first module, so the next step is to align and scale the predictions of these similar products to get the final prediction results. Figure 4 shows the process flow of the forecasting module.

For the alignment operation, there are three different situations to consider here. For each $d_{v}^{i}(1\leqslant v\leqslant V)\neq 0$ , we forecast its sales value $b_{v}^{i}$ .

If the sales retailer of the target product $i$ does not appear in the retailer list of similar products, we directly use the average of the forecasted values of similar products.

If the sales retailer of the target product $i$ appears only in one of the similar products, we use the forecasted value of this similar product.

If the sales retailer of the target product appears in multiple similar products, we use the average forecasted value of these similar products.

Next is the scaling operation. Although the forecasted value is obtained, the overall sales volume of each product is different, so the forecasted value must be adjusted by a weight. We therefore add weights when calculating $b_{v}^{i}$ .

The final sales forecasting value is

$\displaystyle y_{T+h}^{i}=\sum_{v=1}^{V}b_{v}^{i}$ (16)
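
The alignment cases and the final summation of Eq. (16) can be sketched as follows. Several details are assumptions made for illustration: the fallback for an unseen retailer averages all retailer-level forecasts of the similar products, the scaling weight is a single multiplicative factor (the text does not define how it is computed), and the function names are hypothetical.

```python
from statistics import mean


def align(target_retailers, similar_forecasts):
    """Map retailer-level forecasts of similar products onto the target.

    `similar_forecasts` has the form {product: {retailer: forecast}}.
    """
    all_values = [v for per_retailer in similar_forecasts.values()
                  for v in per_retailer.values()]
    aligned = {}
    for retailer in target_retailers:
        matches = [per_retailer[retailer]
                   for per_retailer in similar_forecasts.values()
                   if retailer in per_retailer]
        if matches:
            # One match: use it directly; several matches: use their average.
            aligned[retailer] = mean(matches)
        else:
            # Retailer unseen among similar products: fall back to the
            # average over all available forecasts (assumed interpretation).
            aligned[retailer] = mean(all_values)
    return aligned


def merge(aligned, scale=1.0):
    """Eq. (16): sum the retailer-level values; `scale` is a placeholder
    for the unspecified scaling weight."""
    return scale * sum(aligned.values())


# Hypothetical forecasts for two similar products.
similar = {
    "P1": {"r1": 10.0, "r2": 40.0},
    "P2": {"r1": 30.0},
}
aligned = align(["r1", "r2", "r3"], similar)
print(aligned)
print(merge(aligned))
```

Here retailer `r1` appears in both similar products, `r2` in one, and `r3` in none, so the three alignment cases are each exercised once.
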

Figure 4.

Process flow of forecasting. The main difference from the traditional forecasting method is that the data granularity is finer, and each retailer's sales of the product are forecasted independently.

5. Experiment and evaluation

5.1 Experiment environment

Datasets. The dataset used in this paper is the actual sales data of a well-known company, covering October 2014 to October 2019. There are 1.86 million sales records of 147 kinds of products in the dataset, and each record has several attributes, such as name, brand, retailer, billing date, and sales count. Some products are mature items and others are newly listed items. Since public time series datasets lack product attributes, they are not suitable for the method proposed in this paper. Therefore, we do not use a public dataset in the experiment. In order to evaluate the effect of the S-SF method on the prediction of mature products and new products, we divide the dataset into two parts. DS-M contains mature product data with a sales history of more than 1 year, while DS-N only includes new product data with a sales history of less than 1 year. We randomly split the data into a training set and a test set at a ratio of 8:2.

Methods. We conduct extensive experiments with 12 methods or models (including our new model and new method) on the actual production dataset for time series forecasting tasks. The methods are listed below.

(1) ARIMA [5]: Autoregressive integrated moving average model. In ARIMA(p, d, q), AR stands for “autoregressive” and p is the number of autoregressive terms, MA stands for “moving average” and q is the number of moving-average terms, and d is the number of differencing operations needed to make the sequence stationary.
(2) SARIMA [14]: Seasonal autoregressive integrated moving average model. It is suitable for data with seasonal periodic changes.
(3) STL [21]: Seasonal-trend decomposition procedure based on loess. It is a common algorithm in time series decomposition, where loess is a robust regression algorithm.
(4) RNN [2, 7]: It is a type of neural network for processing sequence data and has shown great power in many tasks.
(5) LSTM [6]: Long short-term memory. It is a recurrent neural network suitable for processing and forecasting important events with relatively long intervals and delays in time series.
(6) ConvLSTM [8]: Convolutional LSTM network. It is a variant of LSTM and replaces the fully connected layer in the LSTM model with a convolutional layer.
(7) Prophet [6]: A model proposed by Facebook for forecasting time series data. It shows good performance for periodic time series data.
(8) MEAN: Average value model. Especially when forecasting new products, we can only use simple models instead of complex models because there is not enough historical data for reference.
(9) MA: Moving average. It is widely used because of its simple construction and easy parameter training.
(10) AttConvLSTM: A new model proposed in our paper. It is based on ConvLSTM, adding an attention mechanism and using a new loss function.
(11) AttConvLSTM+Prophet: An integrated model proposed in our paper. We adopt a simple linear integration approach.
(12) S-SF: A forecasting method proposed in our paper. It uses the integrated model AttConvLSTM+Prophet and also includes the similarity module.

Metrics. We use three conventional evaluation metrics in the experiment defined as

• Root mean square error (RMSE):

$\displaystyle\textit{RMSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_{i}-y_{i})^{2}}$ (17)

• Mean absolute error (MAE):

$\displaystyle\textit{MAE}=\frac{1}{m}\sum_{i=1}^{m}|\hat{y}_{i}-y_{i}|$ (18)

• Anomalous correlation coefficient (ACC):

$\displaystyle\textit{ACC}=1-\frac{1}{m}\sum_{i=1}^{m}\frac{|\hat{y}_{i}-y_{i}|}{y_{i}}$ (19)

where $y_{i}$ is the actual value of product $i$ , $\hat{y}_{i}$ is the forecasting value of product $i$ , and $m$ is the number of test samples. The smaller the values of RMSE and MAE, the better the method being tested. Conversely, a larger value of ACC indicates a better method.
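
The three metrics of Eqs. (17) to (19) translate directly into code. The sketch below uses only the standard library; the function names are chosen for illustration, and ACC as defined here requires every actual value to be non-zero.

```python
import math


def rmse(predicted, actual):
    """Root mean square error, Eq. (17)."""
    m = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / m)


def mae(predicted, actual):
    """Mean absolute error, Eq. (18)."""
    m = len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / m


def acc(predicted, actual):
    """ACC, Eq. (19): one minus the mean absolute relative error."""
    m = len(actual)
    return 1.0 - sum(abs(p - a) / a for p, a in zip(predicted, actual)) / m


# Toy example: two products, each forecast off by 10 units.
predicted = [110.0, 90.0]
actual = [100.0, 100.0]
print(rmse(predicted, actual), mae(predicted, actual), acc(predicted, actual))
```
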

Platform. All experiments are conducted on a Linux server. The configuration is as follows: 8*Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz processor, 110GB memory, and an NVIDIA Tesla V100 16GB GPU. We implement all methods in Python 3.7. All deep learning models used in this paper are implemented in the Keras framework.
5.2 Performance of learning module

We have found through experiments that AttConvLSTM performs best when the sales sequence spans more than 1 year. When the sales sequence spans less than 1 year, Facebook Prophet performs best. After comprehensive consideration, we use 1 year as the criterion for separating mature products from new products. For products with different sales times, different weight settings are used in model integration. The sales forecast of retailer $v$ for product $j$ (suppose $j$ is one of the top-k products) is defined as:

$\displaystyle b_{v}^{j}=\left\{\begin{array}{ll}0.6\dot{b}_{v}^{j}+0.4\widetilde{b}_{v}^{j}&\textit{time}\geqslant 1\textit{ year}\\ 0.4\dot{b}_{v}^{j}+0.6\widetilde{b}_{v}^{j}&\textit{time}<1\textit{ year}\end{array}\right.$ (20)

where the weights 0.6 and 0.4 reflect the relative importance of the two component forecasts, and time refers to the sales time of product $j$.
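The piecewise weighting in Eq. (20) can be sketched as a small Python function. This is a minimal sketch, not the authors' implementation. The function and parameter names are hypothetical. It assumes that $\dot{b}_{v}^{j}$ is the AttConvLSTM forecast and $\widetilde{b}_{v}^{j}$ is the Prophet forecast, which is inferred from the statement that AttConvLSTM performs best on longer sales histories.

```python
def integrate_forecasts(b_dot, b_tilde, sales_time_years, threshold_years=1.0):
    """Linearly blend two forecast sequences following Eq. (20).

    b_dot:   forecasts assumed to come from AttConvLSTM (assumption)
    b_tilde: forecasts assumed to come from Prophet (assumption)
    sales_time_years: length of the product's sales history in years
    """
    if len(b_dot) != len(b_tilde):
        raise ValueError("both forecast sequences must have the same length")

    if sales_time_years >= threshold_years:
        # Mature product: weight the first component more heavily.
        w_dot, w_tilde = 0.6, 0.4
    else:
        # New product: weight the second component more heavily.
        w_dot, w_tilde = 0.4, 0.6

    return [w_dot * d + w_tilde * t for d, t in zip(b_dot, b_tilde)]


# Hypothetical example: one-step forecasts of 100 and 50 units.
mature = integrate_forecasts([100.0], [50.0], sales_time_years=2.0)
new = integrate_forecasts([100.0], [50.0], sales_time_years=0.5)
print(mature, new)
```

The two weights always sum to 1, so the blended forecast stays within the range spanned by the two component forecasts.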

Table 2 shows the experiment results, where 1) each row shows the results of three metrics on two datasets for a particular model, 2) each column shows the results of different models for a specific metric and a specific dataset, 3) each bold-faced value indicates the best result for a specific metric across all models, and 4) each underlined value indicates the best result for a specific metric across all models except AttConvLSTM and AttConvLSTM+Prophet, which are proposed by us.

Table 2

Results summary (in MAE, RMSE, and ACC) of all methods on two types of datasets

Methods	MAE		RMSE		ACC
	DS-N	DS-M	DS-N	DS-M	DS-N	DS-M
ARIMA	100.1	166.8	316.3	527.2	0.4327	0.4098
SARIMA	94.0	164.9	297.2	489.7	0.4592	0.4431
STL	112.1	119.2	354.1	376.6	0.3950	0.5247
RNN	94.7	108.6	299.3	343.3	0.4533	0.5513
LSTM	91.0	97.2	287.6	307.2	0.4758	0.5995
ConvLSTM	87.6	94.5	276.7	298.5	0.4983	0.6177
Prophet	80.9	105.4	255.9	333.2	0.5178	0.5619
AttConvLSTM	83.1	91.2	269.8	282.3	0.5012	0.6209
AttConvLSTM+Prophet	77.7	88.3	245.4	279.1	0.5422	0.6381

From the results in Table 2, we can see that, setting aside our proposed models AttConvLSTM and AttConvLSTM+Prophet, ConvLSTM and Prophet perform best among the existing models. ConvLSTM performs better for long-term time series prediction, and Prophet performs better for short-term time series prediction. The strong performance of these two models also supports our earlier decision to integrate them linearly. Across all three metrics, the proposed AttConvLSTM+Prophet model performs best on both datasets. Compared to the best results of the existing models, AttConvLSTM+Prophet reduces MAE and RMSE by 5.3% and improves ACC by 4%, averaged over the two datasets. Furthermore, adding the attention mechanism further improves the accuracy of ConvLSTM.
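The reported improvements can be checked against Table 2 with a short calculation. The text does not state how the 5.3% and 4% figures were aggregated. The sketch below assumes they are the means of the per-dataset relative changes against the best existing model (Prophet on DS-N, ConvLSTM on DS-M). The variable and function names are hypothetical.

```python
# Values copied from Table 2.
best_existing = {
    "DS-N": {"MAE": 80.9, "RMSE": 255.9, "ACC": 0.5178},  # Prophet row
    "DS-M": {"MAE": 94.5, "RMSE": 298.5, "ACC": 0.6177},  # ConvLSTM row
}
integrated = {
    "DS-N": {"MAE": 77.7, "RMSE": 245.4, "ACC": 0.5422},  # AttConvLSTM+Prophet
    "DS-M": {"MAE": 88.3, "RMSE": 279.1, "ACC": 0.6381},
}


def relative_change(metric, dataset):
    """Relative improvement of the integrated model on one dataset."""
    old = best_existing[dataset][metric]
    new = integrated[dataset][metric]
    if metric == "ACC":
        return (new - old) / old  # higher is better
    return (old - new) / old      # lower is better for MAE and RMSE


def mean_change(metric):
    """Average the relative improvement over both datasets (assumption)."""
    changes = [relative_change(metric, d) for d in best_existing]
    return sum(changes) / len(changes)


for metric in ("MAE", "RMSE", "ACC"):
    print(f"{metric}: {100 * mean_change(metric):.1f}%")
```

Under this assumption the per-dataset gains differ: the improvement on MAE and RMSE is larger on DS-M than on DS-N, while the relative ACC gain is larger on DS-N.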

5.3 Performance of S-SF method

The above experiments mainly verify the validity of the proposed model. The similarity module is not included because the similarity module does not directly affect the forecasting model. Next, the role of the similarity module in S-SF method will be mainly discussed.

Firstly, the choice of k in the top-k selection has a great influence on the results of the method, because we do not know in advance how many similar products should be used to achieve the best prediction. We experimented with different values of k and found that $k=3$ works best. Therefore, in the following experiment, we use the top-3 similar products as the output of the similarity module.

Table 3
Comparison of S-SF method with other methods. Sim refers to similarity module

Methods	MAE	RMSE	ACC
MEAN	100.8	318.5	0.2182
MA	88.1	278.3	0.3459
Prophet	95.3	301.1	0.2693
Sim+ARIMA	59.9	189.2	0.4322
Sim+SARIMA	60.2	190.2	0.4297
Sim+STL	60.9	192.6	0.4341
Sim+RNN	53.7	169.8	0.4913
Sim+LSTM	50.8	160.5	0.5214
Sim+ConvLSTM	49.3	155.9	0.5440
S-SF	45.2	142.8	0.6158

Table 3 shows a comparison of the S-SF method with several other methods, where 1) each row shows the results of three metrics for a particular method, 2) each column shows the results of different methods for a specific metric, 3) each bold-faced value indicates the best result for a specific metric across all methods, and 4) each underlined value indicates the best result for a specific metric across all methods except S-SF, which is proposed by us. We can see that S-SF performs best. Compared to Prophet, it reduces MAE by 52.6%, reduces RMSE by 52.6%, and improves ACC by 128.7%. Moreover, after adding the similarity module, the other models also outperform Prophet. These results indicate that the similarity-based forecasting method proposed in this paper is effective.
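For reference, the three percentages relative to Prophet follow directly from the Prophet and S-SF rows of Table 3:

$\displaystyle\frac{95.3-45.2}{95.3}\approx 52.6\%,\qquad\frac{301.1-142.8}{301.1}\approx 52.6\%,\qquad\frac{0.6158-0.2693}{0.2693}\approx 128.7\%$

The first two ratios are relative reductions in MAE and RMSE, and the third is the relative increase in ACC.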

6. Conclusion

Traditional sales forecasts typically predict a product's future sales based on its historical sales trends and its related attributes. In this paper, we propose to use the predictions for similar products to fit the sales of the target product. The proposed method S-SF is especially suitable for newly listed products with only a small amount of historical data. For the time series forecasting task, the ConvLSTM model is suitable for mature products and the Prophet model is suitable for new products. We therefore build a new model, AttConvLSTM, based on ConvLSTM by adding more periodic data as input, adding attention mechanisms, and modifying the loss function, and then integrate it with Prophet. The experimental results show that the S-SF method has higher accuracy and better diversity. Since the idea of sales forecasting based on similar products shows good results in this paper, we will further study more fine-grained similarity calculation strategies.

Acknowledgments

This research is supported by the National Key R&D program of P. R. China (Grant No. 2017YFC 0907505).

