Abstract
Asymmetric ν-twin support vector regression (Asy-ν-TSVR) is an effective regression model for price prediction. However, solving its dual problem requires a matrix inverse operation, and the matrix involved may not be invertible. A regularized asymmetric ν-TSVR (RAsy-ν-TSVR) is therefore proposed in this paper to avoid this problem. Numerical experiments on eight benchmark datasets are conducted to demonstrate the validity of the proposed RAsy-ν-TSVR, and a statistical test further shows its effectiveness. Before applying it to Chinese soybean price forecasting, we first employ the Lasso to analyze the factors influencing soybean price and select 21 important factors from the original 25. RAsy-ν-TSVR is then used to forecast the Chinese soybean price. It yields the lowest prediction error among the five models compared in both the training and testing phases, and it produces lower prediction error after feature selection than before. The combined Lasso and RAsy-ν-TSVR model is therefore effective for forecasting the Chinese soybean price.
Introduction
Soybean is an important food and oil crop and a key feed source around the world. Soybean price, which plays a crucial role in the livelihood of the population [1], fluctuates frequently due to the combined effects of the international market, politics and economics. This brings new challenges to Chinese food security, soybean production and the soybean industry. Understanding price trends is crucial for policymakers when implementing price control policies in agricultural product markets and consumption subsidy policies [2]. Therefore, in order to handle the impact of frequent fluctuations in soybean price, stabilize the Chinese soybean market and optimize the planting structure, research on the underlying mechanisms of soybean price fluctuations and on reasonable soybean price forecasts is crucial.
In both domestic and international research, current soybean price forecasting can be divided into forecasts for the futures market and forecasts for the spot market. In particular, agricultural commodities were among the earliest futures varieties and triggered the emergence and development of the futures market [3]. The rapid development of the soybean futures market has greatly affected soybean production, and potential price risks may result in economic losses. In order to mitigate these effects, extensive research has been performed on agricultural futures market forecasting [4]. However, the soybean futures and spot markets are inextricably linked, with the spot market forming the foundation of the futures market. The soybean spot market price is a direct response to domestic soybean supply and demand and also smooths out price volatility, and is thus considered more realistic and actionable than the futures price.
Soybean price is characterized by non-linearity, high noise levels and strong volatility, which makes accurate forecasting difficult [5]. Forecasts can be classified by horizon into long-term forecasts of more than one year and short-term forecasts of less than one year [6]. Long-term and short-term forecasts of agricultural market prices differ in their theoretical bases and their focus. In particular, long-term forecasts focus on future trends, while short-term forecasts focus on the actual volatility.
The existing soybean price forecasting models can be divided into linear and nonlinear models. Linear models mainly include traditional statistical and econometric models [7], such as quantile regression [8], the autoregressive integrated moving average (ARIMA) model [9] and the generalized autoregressive conditional heteroscedasticity (GARCH) model. Compared with linear models, nonlinear models offer better prediction accuracy and a greater ability to capture nonlinear relationships among variables. Among them, machine learning models, including support vector regression (SVR) [10], artificial neural networks (ANN) [11] and long short-term memory (LSTM) [12], are playing an increasingly important role in soybean price prediction. However, current studies have the following limitations: i) All studies based on the above models address short-term prediction of soybean price. They forecast future prices only from historical price data, whereas soybean prices are affected by many factors. Although the prediction performance is good over a certain period of time, robustness and generalization ability are insufficient. ii) For long-term price forecasts, the time step is one year, and existing models are unsuitable for such prediction. For example, LSTM, ANN and other deep learning algorithms struggle with insufficient data and complex model hyperparameters, and the prediction accuracy of traditional machine learning methods is unsatisfactory. iii) Data for long-term soybean price forecasting are scarce and difficult to collect, which is one of the key reasons why progress has been difficult.
SVR is a machine learning method based on the principle of structural risk minimization [13] with strong generalization ability, particularly for problems such as small samples, nonlinearity, over-fitting and local minima [14]. Grain price datasets are typically associated with small sample data, as well as uncertainties and nonlinearities among the influencing factors of soybean price. The application of SVR is widely implemented due to its superior forecasting performance across numerous research fields and its suitability for price forecasting [15–18]. Moreover, many successful attempts have been made to overcome the high training costs and complexity typically linked with SVR approaches [19], for example, ν-SVR [20], smooth ɛ-SVR [21], and TWSVR [22].
Numerous studies have investigated the variables and combinatorial approaches of SVR across various prediction fields. We developed a new SVR approach that incorporates a linear-cost function and insensitivity parameter for accurate fitting [23]. [24] designed a two-stage forecasting volatility method by combining SVR and the Generalized ARCH model (GARCH-SVR), demonstrating its improvements in the volatility forecasting ability. However, research on the application of SVR with the pinball loss function for the regression problems is limited. [25] developed a novel approach denoted as Asy-ν-SVR based on ν-SVR with the pinball loss function. [19] proposed an improved regularization approach based on Lagrangian Asy-ν-TSVR using the pinball loss function (LAsy-ν-TSVR), proving to be more effective in the removal of outliers and noise than TSVR. However, comprehensive research on soybean price forecasting is lacking, and the accuracy of current forecasting methods requires improvement. [26] proposed the application of hybrid models to predict agricultural product prices.
In summary, there are three main problems in soybean price forecasting models: i) The majority of existing studies on short-term soybean price forecasting are limited to forecasting from the price data itself, while the factors influencing soybean price are ignored. ii) Directly introducing numerous influencing factors into soybean price prediction may degrade the prediction results, whereas the Lasso method can perform both parameter estimation and variable selection. iii) Although SVR is well suited to small-sample data with nonlinear relationships, it is not suitable for variable selection. Thus, the motivation of this paper is to propose an optimal combined soybean price forecasting method based on the Lasso and RAsy-ν-TSVR methods.
The key contributions of this study are as follows:
(1) A novel asymmetric support vector regression model is proposed by introducing a regularization term. It avoids the non-invertible matrix problem in the matrix inverse operation and implements the structural risk minimization principle.
(2) A combined prediction model based on the Lasso and RAsy-ν-TSVR is constructed for soybean price prediction. It outperforms the other combined prediction models.
(3) To address the multiple factors influencing soybean price in China, 21 important influencing factors are chosen by the Lasso to construct the prediction model.
(4) The model shows promising potential for agricultural product price forecasting.
The rest of this paper is organized as follows: Section 2 analyzes the impact factors of soybean price fluctuation. Section 3 proposes RAsy-ν-TSVR to avoid the non-invertible matrix problem. Experiments on eight benchmark datasets are conducted to verify its validity in Section 4. The constructed RAsy-ν-TSVR is applied to soybean price forecasting in Section 5. Section 6 concludes the paper.
Impact factors of soybean price fluctuations
Soybean price is influenced by the following three groups of factors: production, consumption and comprehensive factors.
Production factor
Equilibrium price theory demonstrates that the price of a product is a function of the relationship between supply and demand. Supply is a component of the price determination of soybean. In addition, total production and planted area are key for the soybean production process. The agricultural production price index can objectively reflect the price level and structural changes in the national production of agricultural products. In terms of economic operating principles, the various costs (e.g., material, service, production, land and labor costs) of inputs in the soybean production process also directly affect the final pricing. Furthermore, in 1995, China shifted from a net exporter to net importer of soybean, and has remained so until present, with a yearly rise in import volume. Therefore, soybean imports can be employed as an indicator of international soybean supply.
Demand factor
In terms of market demand, the soybean industry is highly correlated with the level of consumer consumption and the demand for soybean. Chinese soybean demand can be divided into export demand and domestic consumption demand. Domestic consumer demand generally consists of edible and crushing demand, which together account for more than 90% of total demand. Soybean crushing demand typically arises from soybean oil and soybean meal. Edible soybean and soybean products are a key source of vegetable protein. Urban residents mainly consume soybean products, while rural residents generally consume raw soybean. The development of the livestock industry and the changing food consumption structure of the population are driving the increase in feed demand. Thus, the growth in soybean oil and soybean meal consumption plays an important role in the increase in soybean demand. Soybean meal is a by-product of soybean oil extraction and is a staple protein feed for livestock and poultry, with approximately 85% of soybean meal used for poultry and pig feeding. Soybean meal feed can effectively improve the slaughter rate of pigs and reduce production costs. The National Bureau of Statistics employs the number of pigs at the end of the year as a livestock feeding indicator. Furthermore, the total consumption of soybean has increased steadily with the growth of the total population and the urbanization rate. Therefore, the level of soybean consumption can be considered in terms of the following factors: total population size, urbanization rate, population consumption level, per capita disposable income of urban residents and the consumer price index for soybean. In addition, when soybean price is high, consumers and producers can often choose substitutes such as corn, peanuts and rapeseed [27].

Impact factors analysis of soybean price fluctuation.
Comprehensive factors affecting soybean price include both macro factors (e.g., level of national economic development, investment in rural fixed assets, total power of agricultural machinery) and micro factors (e.g., average temperature, affected area). Note that the level of national economic development is expressed by the gross domestic product. Furthermore, the agricultural disaster area indicates the loss of crop production due to disasters and can thus be used as an indicator that indirectly influences the price of soybean.
Impact factors of soybean price fluctuations
Based on the factors influencing soybean price and on data availability, we identified 25 key factors influencing soybean price. These factors and their interrelationships are shown in Fig. 1.
Proposed regularized asymmetric ν-TSVR
Consider a training set $T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_l, y_l)\}$, where $x_i \in R^d$ and $y_i \in R$. For brevity, we use the matrices $X = (x_1, x_2, \cdots, x_l)^T$ and $Y = (y_1, y_2, \cdots, y_l)^T$. $e$ is a column vector of appropriate dimension with all entries equal to 1. The standard Euclidean norm is written as $\|\cdot\|_2$ and the $l_1$-norm as $\|\cdot\|_1$.
Loss function
In this work, we employ the ɛ-insensitive and asymmetric loss functions.
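The two losses can be illustrated with a short sketch. The exact definitions are not shown in this section, so the asymmetric form below is an assumption: one common parameterization in which residuals above the tube are weighted by 1/(2p) and residuals below it by 1/(2(1-p)). The function names are hypothetical.

```python
def eps_insensitive_loss(u, eps):
    """Standard epsilon-insensitive loss: zero inside the tube |u| <= eps."""
    return max(0.0, abs(u) - eps)


def asymmetric_eps_loss(u, eps, p):
    """Asymmetric epsilon-insensitive loss (assumed parameterization).

    Residuals above the tube are weighted by 1/(2p) and residuals
    below it by 1/(2(1-p)), with 0 < p < 1.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if u > eps:
        return (u - eps) / (2.0 * p)
    if u < -eps:
        return (-u - eps) / (2.0 * (1.0 - p))
    return 0.0


# Compare both losses on a few residuals u = y - f(x).
for u in [-1.0, -0.05, 0.0, 0.3, 1.0]:
    print(u, eps_insensitive_loss(u, 0.1), asymmetric_eps_loss(u, 0.1, 0.2))
```

Under this parameterization, p = 0.5 recovers the ɛ-insensitive loss, while p < 0.5 penalizes under-prediction (positive residuals) more heavily than over-prediction.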
Based on the Asy-ν-SVR [28], we propose a RAsy-ν-SVR. Asy-ν-SVR has been proven successful in the literature [28], outperforming other regression algorithms, including ν-SVR, TSVR [29, 30], and LS-SVR.
In order to avoid the non-invertible matrix case in Asy-ν-TSVR, we introduce the regularization term
To derive the dual formulations of RAsy-ν-TSVR, we first introduce the following Lagrangian function for Equation (3),
Following the determination of the optimal values w1, w2 and b1, b2, the final regression function is expressed as
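The regression function itself is not reproduced in this section. As a rough sketch, TSVR-type models commonly take the final regressor to be the mean of the two bound functions f1(x) = w1·x + b1 and f2(x) = w2·x + b2; the linear form and the function name `predict` below are assumptions made for illustration.

```python
def predict(x, w1, b1, w2, b2):
    """Average the two bound functions f1(x) = w1.x + b1 and f2(x) = w2.x + b2.

    Assumes the linear case; a kernel version would replace the dot
    products with kernel evaluations against the training samples.
    """
    f1 = sum(wi * xi for wi, xi in zip(w1, x)) + b1
    f2 = sum(wi * xi for wi, xi in zip(w2, x)) + b2
    return 0.5 * (f1 + f2)


# Illustrative weights only; these are not fitted values from the paper.
print(predict([1.0, 2.0], [1.0, 0.0], 0.0, [3.0, 0.0], 2.0))
```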
The overall flowchart for constructing and implementing RAsy-ν-TSVR is shown in Fig. 2.

Implementation flowchart of RAsy-ν-TSVR.
The proposed RAsy-ν-TSVR exhibits two clear advantages: i) it implements the structural risk minimization principle by introducing a regularization term into the objective function, thus yielding high prediction accuracy; and ii) it avoids the non-invertible matrix problem when solving the dual QPPs. The matrix inverse $(G^T G)^{-1}$ is required in TSVR and Asy-ν-TSVR, but the matrix $G^T G$ may not be invertible. In RAsy-ν-TSVR, $(G^T G + DI)^{-1}$ is employed instead; since $G^T G$ is positive semidefinite, $G^T G + DI$ is positive definite and hence invertible for any $D > 0$. Note, however, that when applied to real problems, the optimal solution is not guaranteed.
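The effect of the regularization term on invertibility can be checked on a toy example. The sketch below uses a hypothetical 3×2 matrix G with two identical columns, so that G^T G is singular, and a hypothetical value D = 0.01; neither comes from the paper.

```python
def gram(G):
    """Return G^T G for a matrix given as a list of rows."""
    n_cols = len(G[0])
    return [[sum(row[i] * row[j] for row in G) for j in range(n_cols)]
            for i in range(n_cols)]


def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


# Two identical columns make G^T G rank deficient.
G = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
GtG = gram(G)

# Add D * I to the diagonal, as in the regularized formulation.
D = 0.01
GtG_reg = [[GtG[i][j] + (D if i == j else 0.0) for j in range(2)]
           for i in range(2)]

print(det2(GtG))      # determinant without regularization
print(det2(GtG_reg))  # determinant after adding D * I
```

The first determinant is zero, so the inverse does not exist, whereas the second is strictly positive. For a small D the regularized matrix can still be ill-conditioned numerically, which is one reason D is treated as a tuned parameter.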
In addition, RAsy-ν-TSVR solves a pair of small-sized QPPs instead of a large scale QPP as in ν-SVR, and employs the asymmetric ɛ-insensitive loss.
Experiment on eight Benchmark datasets
To verify the efficiency of the proposed RAsy-ν-TSVR, we compare it with ν-SVR, Asy-ν-SVR, TSVR and Asy-ν-TSVR on eight benchmark datasets from the UCI machine learning repository: Bodyfat, Chwirut, Con. S, Diabetes, Machine-Cpu, Triazines, Housing, and Istanbul Stock Exchange. The Gaussian kernel is used in all five algorithms. All experiments are implemented in MATLAB R2020a on Windows 7, running on a PC with an Intel Core i3-6100 CPU (3.7 GHz) and 8 GB of RAM.
Evaluation criteria
The size of the testing set is denoted by m, while $y_i$ denotes the real value of sample point $x_i$, and
We adopt five-fold cross-validation to select the optimal parameters of the five algorithms. Our dataset is a time series, so it is split in time order (not randomly) into five subsets, one of which serves as the testing set. This process is repeated five times [22]. We compare the training errors of the five algorithms in the nonlinear case using the Gaussian kernel function.
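The kernel formula is not reproduced in this section. A minimal sketch of a Gaussian kernel is given below, assuming the common form K(x, z) = exp(-‖x − z‖² / (2σ²)); the paper may use a different scaling of σ, and the function names are hypothetical.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Gaussian (RBF) kernel, assuming K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_matrix(X, Z, sigma):
    """Kernel matrix with entry (i, j) equal to K(X[i], Z[j])."""
    return [[gaussian_kernel(x, z, sigma) for z in Z] for x in X]


# Three illustrative 2-D points.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = kernel_matrix(X, X, sigma=1.0)
for row in K:
    print(row)
```

The diagonal entries equal 1 and the matrix is symmetric; a larger σ makes distant points look more similar.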
The performance of these five algorithms depends heavily on the choice of parameters. Here, we select the optimal parameter values via grid search. In particular, for all five algorithms, the Gaussian kernel parameter σ is selected from the set $\{2^{-4}, 2^{-3}, 2^{-2}, \cdots, 2^{8}\}$. In TSVR, Asy-ν-TSVR and RAsy-ν-TSVR, we set C1 = C2, D1 = D2, ν1 = ν2 and ɛ1 = ɛ2 to reduce the computational cost of parameter selection. Parameters C and D are selected from the set $\{10^{-8}, 2^{-3}, 2^{-2}, \cdots, 2^{8}\}$. The optimal values of ν, ɛ and p are obtained from the sets {0.1, 0.2, ⋯ , 0.9}, {0.1, 0.2, ⋯ , 0.9} and {0.2, 0.4, 0.45, 0.5, 0.55, 0.6, 0.8}, respectively.
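A grid search over such sets can be sketched as follows. The grids for σ, ν and p follow the sets stated above; the `grid_search` helper and the `toy_score` function are hypothetical stand-ins, since in practice the score would be the cross-validated error of the trained model.

```python
import math
from itertools import product

# Parameter grids taken from the sets described in the text.
sigma_grid = [2.0 ** k for k in range(-4, 9)]        # 2^-4, ..., 2^8
nu_grid = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, ..., 0.9
p_grid = [0.2, 0.4, 0.45, 0.5, 0.55, 0.6, 0.8]


def grid_search(score, grids):
    """Evaluate `score` on every parameter combination and keep the lowest error."""
    names = list(grids)
    best_params, best_err = None, float("inf")
    for values in product(*(grids[name] for name in names)):
        params = dict(zip(names, values))
        err = score(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err


def toy_score(params):
    """Hypothetical validation error with its minimum at sigma = 2, p = 0.45."""
    return (math.log2(params["sigma"]) - 1.0) ** 2 + (params["p"] - 0.45) ** 2


best_params, best_err = grid_search(toy_score, {"sigma": sigma_grid, "p": p_grid})
print(best_params, best_err)
```

Tying the paired parameters (C1 = C2, D1 = D2, and so on) keeps the number of combinations manageable, since the grid size grows multiplicatively with each free parameter.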
Result analysis
Performance comparisons of five algorithms with Gaussian kernel on eight benchmark datasets. Bold type shows the best result
In terms of the MAE criterion, Table 1 shows that RAsy-ν-TSVR produces the lowest testing error among the five algorithms in most cases when the Gaussian kernel is employed, followed by Asy-ν-TSVR. Both employ the pinball loss, which implies that the pinball loss is more suitable than the ɛ-insensitive loss for these datasets. In addition, in terms of the RMSE criterion, Asy-ν-TSVR yields testing errors comparable with those of Asy-ν-SVR and TSVR, which further suggests that the pinball loss is suitable for these datasets. Meanwhile, small MAE and RMSE values correspond to small SSE/SST and large SSR/SST in most cases.
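The four criteria can be computed as in the sketch below. The definitions are not reproduced in this section, so the usual ones from the TSVR literature are assumed: SSE is the sum of squared errors, SST the total sum of squares around the mean of the true values, and SSR the sum of squared deviations of the predictions from that mean. The function name is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, SSE/SST and SSR/SST under the assumed definitions."""
    m = len(y_true)
    y_bar = sum(y_true) / m
    sse = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    sst = sum((y - y_bar) ** 2 for y in y_true)
    ssr = sum((f - y_bar) ** 2 for f in y_pred)
    return {
        "MAE": sum(abs(y - f) for y, f in zip(y_true, y_pred)) / m,
        "RMSE": math.sqrt(sse / m),
        "SSE/SST": sse / sst,
        "SSR/SST": ssr / sst,
    }


# Illustrative values only, not results from the paper.
print(regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(regression_metrics([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))
```

A perfect fit gives SSE/SST = 0 and SSR/SST = 1, which matches the observation that low MAE and RMSE tend to accompany small SSE/SST and large SSR/SST.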
In terms of computational time, Table 1 shows that ν-SVR and Asy-ν-SVR require longer running times than TSVR, Asy-ν-TSVR and RAsy-ν-TSVR in most cases. The main reason is that they solve one larger QPP, whereas the others solve a pair of smaller QPPs [33].
Average rank on prediction error of five algorithms with Gaussian kernel on eight benchmark datasets
We compare our proposed combined model Lasso-RAsy-ν-TSVR (L-RAsy-ν-TSVR) with four other combined algorithms, namely Lasso-ν-SVR (L-ν-SVR), Lasso-TSVR (L-TSVR), Lasso-Asy-ν-SVR (L-Asy-ν-SVR) and Lasso-Asy-ν-TSVR (L-Asy-ν-TSVR). We also compare the combined models with the original single models to further demonstrate the effectiveness of the combined models.
Data acquisition and preprocessing
The dataset used for the experiments consists of annual soybean data for China from 1990 to 2018. Based on prior knowledge, we select 25 factors influencing soybean price, as shown in Section 2.4. The data are taken from the China Agricultural Product Price Survey Yearbook, the National Compilation of Agricultural Product Costs and Benefits, the National Bureau of Statistics' China Statistical Yearbook, China Agricultural Statistics, and the literature. The dataset is divided into a training set (86%; 25 years of data) and a testing set (14%; 4 years of data).
To handle missing values, we employ mean imputation. The data from the last five years are observed to gradually stabilize, so we use the average of the values across the last five years for the individual missing values in 2018. Furthermore, we employ min-max normalization to scale the data. Min-max normalization, also known as discrete normalization, is a linear transformation of the original data that maps values into the interval [0, 1]. The transformation is

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

where $x_{\min}$ and $x_{\max}$ denote the minimum and maximum values of the factor.
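The two preprocessing steps can be sketched as follows. The helper names and the toy series are hypothetical, and missing values are assumed to be marked with `None`.

```python
def impute_with_recent_mean(series, window=5):
    """Fill each missing value (None) with the mean of the preceding `window` observations."""
    filled = list(series)
    for i, value in enumerate(filled):
        if value is None:
            recent = [v for v in filled[max(0, i - window):i] if v is not None]
            if not recent:
                raise ValueError("no earlier observations available for imputation")
            filled[i] = sum(recent) / len(recent)
    return filled


def min_max_normalize(values):
    """Linearly map values into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy factor series whose final year is missing.
series = [10.0, 12.0, 14.0, 16.0, 18.0, None]
filled = impute_with_recent_mean(series)
print(filled)
print(min_max_normalize(filled))
```

Each factor would be normalized separately, since the factors are measured in different units.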
The key factors influencing soybean price fluctuations are determined via Lasso regression [37]: variables exhibiting minimal correlation with the dependent variable are removed in order to reduce the dimensionality of the data and noise interference. The Lasso approach simultaneously performs regression and automatic variable selection. We use it to analyze the influencing factors because it is an effective and widely used feature selection method, especially for regression problems.
A total of 16 factors exhibit a positive correlation with soybean price, 5 exhibit a negative correlation, and 4 are unrelated and are deleted. The subsequent analysis is based on the remaining 21 impact factors (Table 3).
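How the Lasso separates positive, negative and dropped factors can be sketched with a small coordinate descent solver. This is an illustrative implementation under stated assumptions, not the solver used in the paper: it minimizes (1/(2n))‖y − Xw‖² + α‖w‖₁ without an intercept, and the toy data and α value are made up.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho, norm_sq = 0.0, 0.0
            for i in range(n):
                # Prediction that leaves out feature j.
                partial = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm_sq += X[i][j] ** 2
            w[j] = 0.0 if norm_sq == 0.0 else soft_threshold(rho / n, alpha) / (norm_sq / n)
    return w


# Toy data: y = 2 * x0 - 1 * x1, while x2 is irrelevant.
X = [[1.0, 1.0, 1.0], [1.0, -1.0, -1.0], [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]
y = [1.0, 3.0, -3.0, -1.0]

w = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print(w)
print(selected)
```

Factors whose coefficient is shrunk exactly to zero are dropped, and the sign of each remaining coefficient indicates the direction of its association with the price. The penalty α controls how many factors survive.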
The 25 influencing factors of soybean price, with 16 positively correlated factors, 5 negatively correlated factors and 4 unrelated factors
The results indicate that material and service costs, labor costs, land costs and production costs are the basic costs of soybean production on the supply side. The cost of a product during production is a monetary representation of a proportion of the commodity value and is positively related to the price [38]. Production costs exert a direct influence on soybean price [39] and exhibit the largest positive impact among the four related cost factors. The simultaneous increase in planted area and favorable production conditions with variety optimization can improve soybean quality, which has a catalytic effect on soybean price. The results also reveal the relationship between Chinese soybean import and export volumes and soybean price. In the face of high domestic demand for soybeans, imported soybeans with high oil yield and low prices occupy the majority of the market, and export trade is smaller than import trade [40]. Moreover, the impact of import volume demonstrates the serious effect exerted by the international market on domestic soybean price. Among the products competing with soybeans, peanut and rapeseed prices are positively correlated with soybean price, while the opposite is true for corn prices. The government has encouraged the increase of soybean planting and supply while simultaneously reducing corn production in non-dominant production areas. In addition, there is an inverse competitive relationship between corn and soybean planting areas, which inevitably links the price fluctuations of the two crops [41]. On the demand side, the per capita disposable income of urban residents, the bean consumer price index and the agricultural production price index all contribute to soybean price. Demographic factors generally affect soybean price by influencing the demand for soybean. Soybean demand increases with population, and thus soybean price inevitably exhibits an upward trend under the laws of the market.
With the development of the economy and the improvement in dietary consumption levels, the demand for refined edible oils and animal protein-based food products has increased rapidly, and thus soybean demand is on the rise. Post-crushing soybean products typically include soybean oil and soybean meal, with an industrial chain that involves planting, processing, edible oil supply, feed farming, etc. This chain is more complicated than those of wheat, corn and rice.
Fig. 3 depicts the flowchart of the proposed combined forecasting model. The combined forecasting process can be divided into the following three steps:
i) A total of 21 key influencing factors of soybean price fluctuations are selected from 25 factors via the Lasso regression.
ii) The optimal RAsy-ν-TSVR model is then trained, where 5-fold cross-validation is employed to select the optimal parameters.
iii) The testing set is tested using the optimal parameters obtained from step ii).

Flowchart of proposed combined forecasting model for soybean price.
Average training error of the five combined algorithms with 21 important factors. Bold type represents the best result
Notes: i) Each entry denotes the mean of five test results plus or minus the standard deviation. ii) Bold values represent the best result.
The training phase employs 25 years of soybean prices (1990-2014) and ordered five-fold cross-validation to select the optimal parameters. Unlike the usual approach, we divide the data into five parts according to time order rather than at random. Table 4 reports the training results of the five combined models based on the 21 impact factors.
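The ordered split can be sketched as follows. The text does not state whether later blocks are ever used to train for earlier ones, so this sketch assumes the simplest reading: the series is cut into five contiguous blocks, and each block serves once as the validation set while the remaining four are used for training. The function name is hypothetical.

```python
def ordered_kfold(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs built from k contiguous, time-ordered blocks."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


# The 25 training years, 1990 to 2014.
years = list(range(1990, 2015))
folds = list(ordered_kfold(len(years)))
for train_idx, val_idx in folds:
    print([years[i] for i in val_idx])
```

With 25 training years, each validation block covers five consecutive years. A stricter forward-chaining scheme, which trains only on years before the validation block, would be an alternative design choice.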
L-RAsy-ν-TSVR is observed to yield the lowest training error, followed by L-TSVR and L-Asy-ν-TSVR, which have comparable training errors. L-ν-SVR and L-Asy-ν-SVR produce the highest training errors, with L-Asy-ν-SVR performing slightly better than L-ν-SVR. The three asymmetric algorithms outperform the corresponding original ɛ-insensitive algorithms, indicating that the asymmetric loss function produces lower training error. This is consistent with the theory, since the ɛ-insensitive loss is a special case of the asymmetric loss.
Table 5 reports the optimal parameters determined by the ordered five-fold cross-validation.
Optimal parameters of five combined algorithms with 21 important factors
Table 6 presents the forecasted soybean price from 2015 to 2018 using the optimal parameters in Table 5 for the 21 impact factors.
Forecasted soybean price of five combined algorithms with 21 impact features
Our proposed combined approach L-RAsy-ν-TSVR produces the lowest testing error among the five combined models (0.0734). The L-ν-SVR and L-Asy-ν-SVR models exhibit the highest testing errors. This demonstrates the poor performance of SVR with either the asymmetric or the ɛ-insensitive loss.
The majority of the combined models are observed to produce lower training and lower testing errors, indicating the validity of combined models. This can be attributed to the reduction in noise interference.
Fig. 8 depicts the forecasted soybean prices for 1990-2018 based on the optimal parameters. The combined models exhibit similar variation trends. An increase in soybean price is associated with a rise in the prediction price. This reveals the feasibility and effectiveness of the combined models.

Prediction performance of five algorithms with 21 impact factors.
Thus far, our proposed combined model has been demonstrated to outperform the other four combined models in both the training and testing phases.
To further reveal the advantages of the combined models over the corresponding single models, we compare the experimental results with the original five models under 25 impact factors (Table 7).
Training average error of the five algorithms with 25 features. Bold type represents the best result
Notes: i) Each entry denotes the mean of five test results plus or minus the standard deviation. ii) Bold values represent the best result.
RAsy-ν-TSVR exhibits the lowest RMSE and MAE, indicating its ability to effectively predict soybean price. ν-SVR is observed to produce the highest training error among the five algorithms.
Table 9 reports the optimal parameters determined via ordered five-fold cross-validation in the training phase, while Table 8 lists the forecasted soybean prices from 2015 to 2018 using the optimal parameters.
The optimal parameters of five algorithms with 25 features
The experimental results reveal the following key conclusions:
• Comparing Tables 4 and 7 reveals that the training error (RMSE) of four combined models is lower than that of the corresponding single models; only Asy-ν-SVR produces a comparable training error. This demonstrates the validity of the combined models, particularly our improved algorithm.
• The algorithms that adopt the asymmetric loss produce lower RMSE values than the original algorithms (Table 9) in both the training and testing phases. More specifically, the asymmetric loss yields lower error than the ɛ-insensitive loss, and hence the asymmetric loss function is suitable for soybean price forecasting.
• When 25 factors are adopted, RAsy-ν-TSVR yields the lowest MAE and RMSE, followed by Asy-ν-TSVR, while ν-SVR produces the worst training error (Table 7). This further demonstrates that the asymmetric loss function is suitable for soybean price forecasting.
• Algorithms with the asymmetric loss perform better than the corresponding original algorithms with the ɛ-insensitive loss.
• Asy-ν-TSVR and TSVR produce low training errors and high testing errors. Both adopt the empirical risk minimization principle, which reduces the training error. However, the high testing errors suggest a tendency to over-fit, which is undesirable, since the goal is a lower testing error rather than a lower training error.
Four years’ prediction prices of the five algorithms with 25 impact features
An effective combined model of Lasso and RAsy-ν-TSVR is proposed to forecast the Chinese soybean price. First, we analyzed the factors influencing soybean price fluctuations and selected 21 key factors from 25 using the Lasso. Second, RAsy-ν-TSVR is proposed to avoid the non-invertible matrix problem in the matrix inverse operation by introducing a regularization term into the objective function. Finally, the constructed RAsy-ν-TSVR is used to predict the soybean price. Compared with the other four combined models, the proposed model obtained both lower training and lower testing errors. Moreover, it yielded lower errors with the 21 key factors than with the 25 original factors. More effective prediction models should nevertheless be investigated to further improve the prediction accuracy of soybean price.
Acknowledgments
The authors gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation. This work was supported in part by the National Natural Science Foundation of China (No. 62176261) and the earmarked fund for Beijing Leafy Vegetables Innovation Team of Modern Agro-industry Technology Research System (BAIC07-2021).
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
