Method based on the support vector machine and information diffusion for prediction intervals of granary airtightness

Abstract

Granaries should have good airtightness to reduce grain loss in storage. Predicting granary airtightness at the design stage helps improve granary design. This paper proposes a method for obtaining the prediction interval (PI) of granary airtightness from small sample data to guide designers. A PI for which the probability of containing the true target is close to or greater than the confidence level can serve as the decision basis for a granary design scheme. This study adopts a support vector machine as the regression model, trained on an airtightness data set of built granaries, and obtains the probability distribution of regression errors through information diffusion. The probability interval of errors is derived using a search algorithm, from which PIs of granary airtightness are then acquired. Assessment indexes of PIs with confidence levels of 0.8 and 0.9 indicate that the proposed method achieves the confidence level and, when only a few samples are available, is superior to the comparative method that uses an artificial neural network and bootstrap. Thus, an innovative and feasible method is proposed for the computer-aided design of granary airtightness.

Keywords

Support vector machine, information diffusion, prediction interval, granary airtightness

1 Introduction

1.1 Background and related studies

Although grain is an important resource, the loss of grain is heavy at the storage stage [22]. Accordingly, technical measures that reduce grain loss in storage should be adopted for global food security [20]. Reasons for grain loss include insect pests and mildew. Temperature, humidity, and other factors also affect grain quality in storage [22]. Available measures against pests include pesticide fumigation [20, 22] and carbon dioxide fumigation [36]. Some methods, such as temperature and humidity control and ozone fumigation, are effective in avoiding grain mildew during storage [10]. Moreover, reducing oxygen content and raising carbon dioxide level in the air also are effective at controlling pests [9]. Thus, strict requirements are imposed for air environment in grain storage, and grain storage facilities should have good airtightness [22, 36]. Pressure decay test (PDT), which is a technical index, has been used to evaluate the airtightness of grain storage facilities [6]. Chinese national standard in airtightness of warehouse also adopts PDT as airtightness index of granaries [12]. The testing processes are given as follows. First, air is compressed into a granary by a fan. Second, the fan is stopped when the pressure difference between the interior and exterior of the granary reaches the prescribed value (i.e., 500 Pa). Lastly, half-life of pressure difference is utilized to evaluate airtightness. The longer the half-life, the better the airtightness of the granary, and each granary grade has corresponding half-life requirements.

Evidently, a closed space is needed to ensure that granaries meet airtightness requirements. Granary airtightness considerably depends on architectural design schemes. The selection of granary type and structural measures to satisfy airtightness requirements in architectural design substantially affect granary airtightness. The prediction of granary airtightness at the design stage can guide designers, and some parts of the granary design will be modified if the prediction of airtightness cannot meet the requirements. Numerous factors influence granary airtightness, and a complex nonlinear relationship exists between influencing factors and airtightness index. Consequently, this study utilizes artificial intelligence methods to reveal the quantitative relationship between granary design scheme and airtightness and to predict granary airtightness at the design stage.

Neural networks, which are artificial intelligence methods, have been widely used in prediction because of their excellent learning ability [5, 32]. Singaravel et al. [34] applied a deep neural network in civil engineering design to evaluate the energy saving of different design schemes. They reported that a deep neural network can predict building energy consumption rapidly. In building design optimization, Liang et al. [24] proposed an architectural design optimization method based on a BP neural network. In particular, they selected spacing coefficient, air outlet area, and height from the bottom of the window sill to the ground as the main design parameters of building ventilation, and took the comprehensive performance of building ventilation design as the main optimization objective. García Kerdan and Morillón Gílvez [11] proposed the integration of exergy analysis to improve building energy efficiency, specifically by utilizing an artificial neural network and exergy-based surrogate modelling. Xu and Yuan [38] proposed a new green design method for energy-saving buildings employing a neural network. Input parameters of the neural network included the energy consumption quota values obtained from statistical data, thermal parameters, and energy system parameters in energy-saving standards. Thereafter, they obtained a design scheme of building shape through neural network technology. Bhamare et al. [4] developed a machine learning and deep learning-based model for the thermal performance prediction of a PCM integrated roof building, and reported the good prediction performance of the artificial neural network.

Neural networks are often used for point prediction. That is, the predicted result is a certain value. However, the point prediction of neural networks is unreliable in some cases owing to the defects of training data and uncertainties of neural network models; moreover, point prediction does not provide information on prediction accuracy [18]. To handle the deficiency of point prediction, prediction interval (PI) becomes another form of neural network prediction [18 , 26]. PI is the prediction of a numerical interval comprising upper and lower bounds that bracket the true target of the predicted variable with a prescribed probability (i.e., confidence level) [18]. PIs can provide uncertain information for decision makers who can make decisions in accordance with the best and worst situations at a certain probability level. Thus, the risk of decision-making is considerably decreased.

Massive data are needed to train the neural network for prediction, and the number of training data affects the neural network’s calculation accuracy [2, 28]. Moreover, the prediction accuracy of neural network will decrease when the number of training data is small [30].

A large amount of data are necessary to predict granary airtightness using neural networks. To obtain data, investigators should consult the design drawings and various technical documents of built granaries. Such a consultation is costly. The costs of obtaining these data include time and money. In general, only a small amount of data can be obtained (i.e., small sample data). Uncertainties of point prediction increase when the number of sample data is small. Therefore, this study uses PIs for granary airtightness. Designers can decide whether or not to modify the airtightness designs of granaries in accordance with PIs at a nominal confidence level.

Support vector machine (SVM) based on statistical learning provides an efficient and novel model to improve generalization performance. Moreover, it can reach a global minimum. SVM can effectively solve problems with small sample data [35, 39]. Moreover, SVM can fulfill the tasks of classification, regression, and distribution estimation [7].

Information on prediction errors is also limited owing to the small sample data of granary airtightness. Information diffusion can maximize the information provided by small sample data to estimate the probability distribution of random variables. Theoretical studies and examples have proven that information diffusion is superior to classical statistical methods in processing small sample data [15, 17].

1.2 Originality and motivation

Owing to the scarcity of airtightness data of built granaries, traditional machine learning methods to predict granary airtightness at the design stage will have large errors, which cannot meet the requirements of optimizing the granary design. This study applies SVM and information diffusion to PI of granary airtightness to overcome the uncertainties of prediction results caused by small sample data, thereby making the proposed method unique.

The framework in this study is as follows. On the basis of the obtained granary airtightness data, the SVM regression model is adopted to establish mapping between the affecting factors and airtightness index. Thereafter, the probability distribution of regression errors is estimated by utilizing information diffusion. Upper and lower bounds of the probability interval of errors are acquired. Lastly, PIs of granary airtightness based on small sample data are obtained.

This paper is organized as follows. Sec. 1 deals with the background, studies related to the application of neural networks in civil engineering design, and small sample data processing methods. Sec. 2 elaborates the theoretical basis of this paper, assessment of PIs, and granary airtightness data set. Sec. 3 describes the calculation process and results of the proposed algorithm. Sec. 4 presents the comparative algorithm using artificial neural network (ANN) and bootstrap. Sec. 5 discusses the advantages and application conditions of the proposed algorithm. Lastly, Sec. 6 concludes this paper.

2 Methodology

2.1 SVM model

Suppose that {(x₁, t₁), …, (x_l, t_l)} are training vectors, where x_i ∈ R^N is a feature vector (i.e., the model has N causal attributes) and t_i ∈ R¹ is the target output (i.e., true target of the result attribute). Under given parameters C > 0 and ɛ > 0, the standard form of support vector regression [7] is as follows:

$\min_{ω, b, ξ, ξ^{*}} \frac{1}{2} ω^{T} ω + C \sum_{i = 1}^{l} ξ_{i} + C \sum_{i = 1}^{l} ξ_{i}^{*}$ (1)

subject to

$ω^{T} φ (x_{i}) + b - t_{i} ⩽ ɛ + ξ_{i},$

$t_{i} - ω^{T} φ (x_{i}) - b ⩽ ɛ + ξ_{i}^{*},$

$ξ_{i}, ξ_{i}^{*} ⩾ 0, i = 1, \dots, l,$

where φ (x_i) maps x_i into a higher-dimensional space and C > 0 is the regularization parameter. The following dual problem is solved:

$\min_{α, α^{*}} \frac{1}{2} (α - α^{*})^{T} Q (α - α^{*}) + ɛ \sum_{i = 1}^{l} (α_{i} + α_{i}^{*}) + \sum_{i = 1}^{l} t_{i} (α_{i} - α_{i}^{*})$ (2)

subject to

$e^{T} (α - α^{*}) = 0,$

$0 ⩽ α_{i}, α_{i}^{*} ⩽ C, i = 1, \dots, l,$

where Q_ij = K (x_i, x_j) ≡ φ (x_i)^T φ (x_j) and K (x_i, x_j) is the kernel function. In this study, K (x_i, x_j) = exp(- γ ∥ x_i - x_j ∥ ²), γ > 0.

After solving problem (2), the approximate function is $f (x) = \sum_{i = 1}^{l} (- α_{i} + α_{i}^{*}) K (x_{i}, x) + b$.
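The approximate function can be evaluated directly once the dual variables are known. The following minimal Python sketch assumes that the dual solution (α, α*, b) has already been produced by some solver; the function names `rbf_kernel` and `svr_predict` and all numerical values are illustrative assumptions rather than part of the original study.

```python
import math


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2), with gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, train_vectors, alpha, alpha_star, b, gamma):
    """Evaluate sum_i (-alpha_i + alpha_i^*) K(x_i, x) + b."""
    return b + sum(
        (a_star - a) * rbf_kernel(x_i, x, gamma)
        for x_i, a, a_star in zip(train_vectors, alpha, alpha_star)
    )


# Hypothetical dual solution for two training vectors (illustrative values only).
# It satisfies the constraint e^T (alpha - alpha^*) = 0.
train_vectors = [(0.0, 0.0), (1.0, 1.0)]
alpha = [0.0, 0.5]
alpha_star = [0.5, 0.0]

prediction = svr_predict((0.0, 0.0), train_vectors, alpha, alpha_star, b=1.0, gamma=0.5)
print(prediction)
```

Only training vectors with α_i ≠ α_i^* contribute to the sum, so in practice the loop runs over the support vectors alone.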

2.2 Probability intervals based on information diffusion

2.2.1 Information diffusion

Definition 1 [16] Let E be a sample with universe U, and let V be a subset of U. A mapping from E × V to [0, 1],

$μ : E \times V \to [0, 1], \quad (e, v) \mapsto μ (e, v), \quad \forall (e, v) \in E \times V,$

is called an information diffusion of E on V if it is decreasing, i.e., ∀e ∈ E, ∀v′, v″ ∈ V, if ∥v′ - e∥ ⩽ ∥v″ - e∥, then μ (e, v′) ⩾ μ (e, v″). μ is called a diffusion function and V is called a monitoring space. When V = U, we say that μ (e, v) is sufficient.

Information provided by the sample can be diffused to monitoring points by using information diffusion function.

Let E = (e₁, e₂, ⋯ , e_n) be a sample and V = (v₁, v₂, ⋯ , v_m) be the monitoring space. The normal information diffusion function is as follows [16]:

$μ (e_{i}, v_{j}) = \frac{1}{h \sqrt{2 π}} \exp [- \frac{{(e_{i} - v_{j})}^{2}}{2 h^{2}}]$ (3)

where h is the diffusion coefficient as listed in Table 1.

Table 1

Values of the diffusion coefficient [14]

Size of sample n	5	6	7	8	9	10	≥11
Calculation formula for h	0.8146 s	0.5690 s	0.4560 s	0.3860 s	0.3362 s	0.2986 s	2.6851 s/(n-1)

Note: $s = \max_{1 ⩽ i ⩽ n} \{e_{i}\} - \min_{1 ⩽ i ⩽ n} \{e_{i}\}$, e_i ∈ R¹, n is the size of sample.

Let

$f_{e_{i}} (v_{j}) = \frac{μ (e_{i}, v_{j})}{\sum_{j = 1}^{m} μ (e_{i}, v_{j})}$ (4)

The probability estimate of monitoring point v_j is as follows [14]:

$p (v_{j}) = \frac{\sum_{i = 1}^{n} f_{e_{i}} (v_{j})}{\sum_{j = 1}^{m} \sum_{i = 1}^{n} f_{e_{i}} (v_{j})}$ (5)

Thereafter, the probability distribution of monitoring space V = (v₁, v₂, ⋯ , v_m) can be acquired.
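Equations (3)–(5) and Table 1 can be illustrated with a short Python sketch. The function names `diffusion_coefficient` and `diffusion_probabilities` and the sample values are hypothetical; the sketch assumes a one-dimensional sample with at least five observations, because Table 1 gives no coefficient for smaller samples.

```python
import math

# Multipliers of s from Table 1 for sample sizes 5 to 10.
_H_FACTORS = {5: 0.8146, 6: 0.5690, 7: 0.4560, 8: 0.3860, 9: 0.3362, 10: 0.2986}


def diffusion_coefficient(sample):
    """Diffusion coefficient h according to Table 1."""
    n = len(sample)
    if n < 5:
        raise ValueError("Table 1 gives no coefficient for fewer than 5 observations")
    s = max(sample) - min(sample)
    return _H_FACTORS[n] * s if n <= 10 else 2.6851 * s / (n - 1)


def diffusion_probabilities(sample, monitoring_points):
    """Probability estimate p(v_j) of every monitoring point, Eqs. (3)-(5)."""
    h = diffusion_coefficient(sample)
    if h <= 0:
        raise ValueError("the sample must contain at least two distinct values")
    totals = [0.0] * len(monitoring_points)
    for e in sample:
        # Eq. (3): normal diffusion of observation e to every monitoring point.
        mu = [
            math.exp(-((e - v) ** 2) / (2 * h * h)) / (h * math.sqrt(2 * math.pi))
            for v in monitoring_points
        ]
        row_sum = sum(mu)
        for j, mu_j in enumerate(mu):
            totals[j] += mu_j / row_sum  # Eq. (4), accumulated over the sample
    grand_total = sum(totals)
    return [t / grand_total for t in totals]  # Eq. (5)


# Illustrative values only: a symmetric sample diffused onto its own points.
sample = [-0.2, -0.1, 0.0, 0.1, 0.2]
p = diffusion_probabilities(sample, sample)
print(p)
```

Because each observation's diffused information is normalized in Eq. (4), every observation contributes the same total weight, and the denominator of Eq. (5) equals the sample size n.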

2.2.2 Upper and lower bounds of the probability interval

Let monitoring point v_L be the lower bound of the probability interval and monitoring point v_U be the upper bound. Ideally, the interval length should be as small as possible; that is, $v_{U} - v_{L} = min_{k > j} (v_{k} - v_{j})$ (6) Meanwhile, the following condition should be satisfied: $\sum_{j = L}^{U} p (v_{j}) ⩾ 1 - α$ (7) where 1 - α is the confidence level.

The number of monitoring points is finite. Thus, v_L and v_U can be found by the search algorithm. A feasible strategy is to take the monitoring point v_h of the maximum probability as the center and search v_L, v_U from v_h to both sides, as shown in Fig. 1.

Fig. 1

Upper and lower bounds of the probability interval.

The algorithm to find v_L and v_U is as follows:

Algorithm 1. Search algorithm for the probability intervals.
Input: monitoring space V = (v₁, v₂, ⋯ , v_m), probability distribution P = (p (v₁), p (v₂), …, p (v_m)), the confidence level 1 - α.
Output: lower bound v_L and upper bound v_U.
Initialization: k₁ = 0, k₂ = 0.
Find h s.t. $p (v_{h}) = \max_{j} (p (v_{j}))$.
Let d₁ = h - 1, d₂ = m - h.
while $\sum_{j = h - k_{1}}^{h + k_{2}} p (v_{j}) < 1 - α$
    if k₁ < d₁ and k₂ < d₂
        Let k₁ = k₁ + 1, k₂ = k₂ + 1.
    else if k₁ ⩾ d₁ and k₂ < d₂
        Let k₁ = d₁, k₂ = k₂ + 1.
    else if k₁ < d₁ and k₂ ⩾ d₂
        Let k₁ = k₁ + 1, k₂ = d₂.
    else if k₁ ⩾ d₁ and k₂ ⩾ d₂
        Let k₁ = d₁, k₂ = d₂.
    end if
end while
The lower bound of the probability interval is $v_{L} = v_{h - k_{1}}$, and the upper bound is $v_{U} = v_{h + k_{2}}$.
end Algorithm 1
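The search in Algorithm 1 can be sketched in Python as follows. The function name `probability_interval` is an assumption, indices are 0-based, and one small deviation is labeled in the code: the loop also stops once the whole monitoring space is covered, which guards against floating-point sums that fall marginally short of 1.

```python
def probability_interval(points, probs, confidence):
    """Return (v_L, v_U) by expanding to both sides of the modal point.

    points     -- monitoring points v_1..v_m in ascending order
    probs      -- p(v_1)..p(v_m), summing to 1
    confidence -- the confidence level 1 - alpha
    """
    m = len(points)
    h = max(range(m), key=lambda j: probs[j])  # index of the maximum probability
    lo, hi = h, h
    covered = probs[h]
    # Deviation from Algorithm 1 (assumption): stop when both ends are reached,
    # so rounding errors in the sum cannot cause an endless loop.
    while covered < confidence and (lo > 0 or hi < m - 1):
        if lo > 0:  # corresponds to k1 < d1
            lo -= 1
            covered += probs[lo]
        if hi < m - 1:  # corresponds to k2 < d2
            hi += 1
            covered += probs[hi]
    return points[lo], points[hi]


# Illustrative distribution (values chosen to be exactly representable).
points = [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3]
probs = [1 / 16, 2 / 16, 3 / 16, 4 / 16, 3 / 16, 2 / 16, 1 / 16]
print(probability_interval(points, probs, 0.8))
```

Each pass of the loop widens the interval by one monitoring point on every side that still has room, which mirrors the four branches of the pseudocode.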

2.3 PIs based on regression errors

Suppose that f (x) is a regression function established by SVM and that the data set contains n instances, {(x₁, t₁), …, (x_n, t_n)}, where x_i ∈ R^N is the feature vector and t_i ∈ R¹ is the target output. The following equation is obtained [33]: $t_{i} = f (x_{i}) + ɛ_{i}$ (8) where ɛ_i is the error of SVM and is assumed to be a random variable. Thereafter, the error ratio is defined as $E_{i} = \frac{ɛ_{i}}{f (x_{i})}$ (9) where E_i is also a random variable.

If the probability distribution of E_i is known, then the following can be acquired: $P (e^{L} ⩽ E_{i} ⩽ e^{U}) = 1 - α$ (10)

If n instances exist, then (e₁, …, e_n) can be obtained using Equation (9), where e_i is the observed value of E_i. Thereafter, the probability distribution of monitoring points can be obtained by utilizing information diffusion, and e^L and e^U can be derived by Algorithm 1.

Thereafter, the following is obtained: $P (f (x_{i}) \cdot e^{L} ⩽ ɛ_{i} ⩽ f (x_{i}) \cdot e^{U}) = 1 - α$ (11) Let I_p be the feature vector of the instance to be predicted, and suppose that the point prediction error of I_p is subject to the same probability distribution as E_i. The following equations are obtained: $L^{U} = f (I_{p}) + f (I_{p}) \cdot e^{U}$ (12) $L^{L} = f (I_{p}) + f (I_{p}) \cdot e^{L}$ (13) where L^U and L^L are the upper and lower bounds, respectively, of PI.

Several instances are input into the trained SVM to obtain the prediction values and errors. Thereafter, probability distribution of errors is obtained through information diffusion. Probability interval of errors at a certain confidence level can be acquired using Algorithm 1. Moreover, PI of the instance to be predicted is determined according to its point prediction and the upper and lower bounds of the probability interval of errors. The schematic of the preceding process is shown as Fig. 2.

Fig. 2

Schematic for calculating PI according to the probability distribution of error ratio.
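The step from error ratios to a PI in Equations (9), (12), and (13) amounts to scaling the point prediction. The sketch below uses hypothetical function names and made-up numbers, and it assumes a positive point prediction (as is the case for a half-life), so that the lower bound does not exceed the upper bound.

```python
def error_ratios(targets, predictions):
    """Eq. (9): e_i = (t_i - f(x_i)) / f(x_i) for every instance."""
    return [(t - y) / y for t, y in zip(targets, predictions)]


def prediction_interval(point_prediction, e_lower, e_upper):
    """Eqs. (12)-(13): L = f(I_p) + f(I_p) * e, for e = e^L and e = e^U.

    Assumes point_prediction > 0; a negative prediction would swap the bounds.
    """
    lower = point_prediction + point_prediction * e_lower
    upper = point_prediction + point_prediction * e_upper
    return lower, upper


# Illustrative numbers only (half-lives in seconds).
ratios = error_ratios(targets=[110.0, 90.0], predictions=[100.0, 100.0])
interval = prediction_interval(200.0, e_lower=-0.25, e_upper=0.5)
print(ratios, interval)
```

Working with the ratio rather than the raw error makes the interval width proportional to the predicted half-life.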

2.4 Assessment of PIs

Reliability and width are involved in assessing PIs [18].

Reliability of PIs, which involves whether or not the probability of the true target falling within the PI can achieve the confidence level, is vital for decision makers. Reliability of PIs can be evaluated using the PI coverage probability (PICP):

$PICP = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} c_{i}$ (14)

where $c_{i} = \begin{cases} 1, & t_{i} \in [L_{i}^{L}, L_{i}^{U}] \\ 0, & t_{i} \notin [L_{i}^{L}, L_{i}^{U}] \end{cases}$ and n_test is the number of instances in the test set.

A satisfactory PICP should be close to or greater than the confidence level for PI [18]. Interval width is closely related to PICP: a wider interval generally yields a higher PICP, and vice versa. However, if the interval width is considerably large, then minimal or no help will be brought to the decision makers. Therefore, interval width should be as narrow as possible on the premise of ensuring that PICP can meet the requirement. Consequently, the following measure to evaluate PI (i.e., mean PI width (MPIW)) is used:

$MPIW = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} (L_{i}^{U} - L_{i}^{L})$ (15)

NMPIW is the normalized MPIW, which can eliminate the influence of dimension:

$NMPIW = \frac{MPIW}{R}$ (16)

where R is the target range in observed samples.

Moreover, a comprehensive index is needed to evaluate PI quality. Coverage width-based criterion (CWC) can combine PICP and NMPIW into a single index [18]:

$CWC = NMPIW (1 + γ (PICP) e^{- η (PICP - μ)})$ (17)

where γ (PICP) is given by

$γ (PICP) = \begin{cases} 0, & PICP ⩾ μ \\ 1, & PICP < μ \end{cases}$ (18)

where η and μ are two hyperparameters controlling the location and magnitude of the jump in CWC. μ corresponds to the confidence level and can be set to 1 - α. In this study, η = 50. A small value of CWC is desirable.
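The assessment indexes of Equations (14)–(18) can be computed with a few lines of Python. In this sketch the function name `pi_metrics` and the example values are hypothetical, and R is taken as the range of the targets passed in unless a value is supplied, which is an assumption about how the target range is determined.

```python
import math


def pi_metrics(targets, lowers, uppers, mu, eta=50.0, target_range=None):
    """Return PICP, MPIW, NMPIW, and CWC for a set of prediction intervals."""
    n = len(targets)
    # Eq. (14): share of targets that fall inside their interval.
    picp = sum(lo <= t <= up for t, lo, up in zip(targets, lowers, uppers)) / n
    # Eq. (15): mean interval width.
    mpiw = sum(up - lo for lo, up in zip(lowers, uppers)) / n
    # Eq. (16): normalization by the target range R (assumed default: observed range).
    r = target_range if target_range is not None else max(targets) - min(targets)
    nmpiw = mpiw / r
    # Eqs. (17)-(18): the penalty applies only when PICP is below mu.
    gamma = 0.0 if picp >= mu else 1.0
    cwc = nmpiw * (1.0 + gamma * math.exp(-eta * (picp - mu)))
    return {"PICP": picp, "MPIW": mpiw, "NMPIW": nmpiw, "CWC": cwc}


# Illustrative values only: three of the four targets are covered.
targets = [10.0, 20.0, 30.0, 40.0]
lowers = [8.0, 18.0, 31.0, 35.0]
uppers = [12.0, 22.0, 35.0, 45.0]
print(pi_metrics(targets, lowers, uppers, mu=0.8))
print(pi_metrics(targets, lowers, uppers, mu=0.7))
```

With η = 50, a PICP only a few percentage points below μ already inflates CWC by an order of magnitude, so CWC equals NMPIW only for intervals that reach the confidence level.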

2.5 Data set

2.5.1 Description of the granary airtightness data set

Sample data of built granaries are derived from several provinces in China. The sample data contain 112 instances, each with 18 causal attributes and one result attribute (i.e., half-life of 500 Pa pressure [12]). The data set is a matrix in which each column corresponds to an attribute and each row is an instance. The data set is called GAS. Causal attributes are selected according to experts in granary design and operation management. Table 2 describes the causal and result attributes.

Table 2
Attribute description of the granary airtightness data set

Attribute name	Unit	Type of attribute
* Half-life of 500 Pa pressure	s	numerical
Granary type	—	categorical
Height of the granary	m	numerical
Floor area	m²	numerical
Floor form of granary	—	categorical
Cubage of the granary	m³	numerical
Airtightness measures for doors and windows	—	categorical
Moisture barrier form of wall	—	categorical
Moisture barrier height of wall	m	numerical
Structural layer thickness of wall	mm	numerical
Amount of mechanical vent in the granary	number	numerical
Amount of fumigation hole in the granary	number	numerical
Amount of fan outlet in the granary	number	numerical
Amount of natural vent	number	numerical
Amount of grain inlet	number	numerical
Type of roof structure	—	categorical
Type of wall structure	—	categorical
Ventilation form	—	categorical
Area of doors and windows	m²	numerical

Note: Attribute with * is the result attribute.

2.5.2 Data preprocessing

Data preprocessing can improve the efficiency, accuracy, and generalization ability of the machine learning model. The two types of attributes are numerical and categorical. Numerical attributes are comparable and ordered, whereas categorical attributes are neither comparable nor ordered. Data preprocessing methods adopted in this study include data normalization of numerical attributes and one-hot encoding of categorical attributes [27].

(1) In the data normalization procedure, data are scaled as follows: $x_{ij}^{'} = \frac{x_{ij} - min_{k} (x_{kj})}{max_{k} (x_{kj}) - min_{k} (x_{kj})}$ (19)

Data can be mapped to [0, 1] by utilizing Equation (19).

(2) One-hot encoding is applied to categorical attributes, which encodes each category as a binary vector. For example, for an attribute with three types, types 1, 2, and 3 are encoded into (1,0,0), (0,1,0), and (0,0,1), respectively.
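Both preprocessing steps can be sketched in a few lines of Python. The function names and sample values are hypothetical, the category order is assumed to be the sorted order of the labels, and mapping a constant column to 0 is an assumption, because Equation (19) is undefined when the maximum equals the minimum.

```python
def min_max_scale(column):
    """Eq. (19): map a numerical attribute column to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information and is set to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def one_hot(column):
    """Encode a categorical attribute column as binary vectors."""
    categories = sorted(set(column))  # assumed ordering of the categories
    return [[1 if value == c else 0 for c in categories] for value in column]


# Illustrative values only.
heights = [6.0, 8.0, 10.0]
granary_types = ["type 1", "type 2", "type 3", "type 1"]
print(min_max_scale(heights))
print(one_hot(granary_types))
```

One-hot encoding avoids imposing an artificial order on categories, which matters for a kernel that depends on distances between feature vectors.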

3 Proposed method using SVM and information diffusion for PIs of granary airtightness

3.1 Algorithm using SVM and information diffusion for PIs

The data set D_l is divided into D_t and D_d, where D_t is used to train SVM and D_d is employed to acquire the probability distribution of SVM’s prediction error. PI of a new instance to be predicted can be worked out thereafter on the basis of the point prediction of the trained SVM and probability distribution of error.

The process of the algorithm using SVM and information diffusion for PIs is shown in Fig. 3.

Fig. 3

Process of the algorithm using SVM and information diffusion for PIs.

The algorithm is summarized as follows:

Algorithm 2. Algorithm using the SVM and information diffusion for PIs.
Input: learning data set D_l, input (i.e., the feature vector) of the instance to be predicted I_p, number of monitoring points m, number of instances n adopted to acquire the probability distribution of monitoring points, confidence level 1 - α.
Output: PI of the true target of I_p.
Step 1: D_l is split randomly into two sets, namely, D_t and D_d. D_t is used to train the SVM model and D_d is the data set for information diffusion. The size of D_t is l and the size of D_d is n.
Step 2: The SVM model is trained by D_t, and the trained SVM model is f (·). Parameters of SVM are optimized using the grid search technique [7].
Step 3: Every instance in D_d is introduced into the trained SVM model. For the ith instance d_i = (x_i, t_i) in D_d, the prediction error is acquired as follows:
ɛ_i = t_i - f (x_i)
where ɛ_i is the prediction error; t_i is the true target; f (x_i) is the output of SVM.
Meanwhile, e_i is obtained using Equation (9), and E = (e₁, e₂, ⋯ , e_n).
Step 4: The step of monitoring points is s_e, which is given by
$s_{e} = \frac{\max_{i} (e_{i}) - \min_{i} (e_{i})}{m - 1}$ (20)
V = (v₁, v₂, ⋯ , v_m) is the monitoring space. In this study, $v_{k} - v_{k - 1} = s_{e}$ for 2 ⩽ k ⩽ m, $v_{1} = \min_{i} (e_{i})$, and $v_{m} = \max_{i} (e_{i})$.
Step 5: By using Equations (3)–(5), the probability distribution of monitoring points can be acquired. The lower and upper bounds (i.e., e^L and e^U, respectively) of the probability interval of errors can be derived using Algorithm 1.
Step 6: By introducing I_p into the trained SVM, the point prediction f (I_p) is gained.
Step 7: Equations (12) and (13) are adopted to obtain the upper and lower bounds (i.e., L^U and L^L, respectively) for the PI of I_p.
end Algorithm 2
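Steps 1 and 4 of Algorithm 2 are simple to express in code. The Python sketch below shows a random split of the learning set and the construction of the equally spaced monitoring space of Equation (20); the function names, the fixed random seed, and the sample values are assumptions made for illustration.

```python
import random


def split_learning_set(instances, n_diffusion, seed=0):
    """Step 1: randomly split D_l into D_t (training) and D_d (diffusion)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seed fixed only for reproducibility
    return shuffled[n_diffusion:], shuffled[:n_diffusion]


def monitoring_space(error_ratios, m):
    """Step 4: m equally spaced points from min(e_i) to max(e_i), Eq. (20)."""
    if m < 2:
        raise ValueError("at least two monitoring points are required")
    lo, hi = min(error_ratios), max(error_ratios)
    step = (hi - lo) / (m - 1)
    # The last point is set to max(e_i) exactly to avoid rounding drift.
    return [lo + k * step for k in range(m - 1)] + [hi]


# Illustrative values only: 111 instance indices, with n = 33 held out.
d_t, d_d = split_learning_set(range(111), n_diffusion=33)
v = monitoring_space([-0.1, 0.3, 0.05], m=5)
print(len(d_t), len(d_d), v)
```

With 111 learning instances and n = 33, the split leaves l = 78 instances for training.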

3.2 Assessment of Algorithm 2

3.2.1 Cross validation

Cross validation is a typical method for verifying the machine learning model [1]. The leave-one-out cross validation is adopted in this study. That is, one instance at a time is taken from the data set, in order, as validation data, and the remainder of the data set is used for training the model. This step is repeated until every instance of the data set has been validated. Thus, in the leave-one-out cross validation, the test set includes every instance in the data set. The process is shown as follows:

Algorithm 3. Leave-one-out cross validation for the PIs.
Input: preprocessed data set D.
Output: PI of the true target of every instance in D, assessment index values of PIs.
Definition: N_D is the number of instances in D.
for k = 1 to N_D
    Let d_k be the kth instance in D, and d_k = (x_k, t_k).
    Let I_p = x_k, and D_l = D ∖ {d_k}. The PI of I_p can be acquired using Algorithm 2.
end for
The assessment index values of PIs are obtained using Equations (14)–(17).
end Algorithm 3
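The leave-one-out loop of Algorithm 3 does not depend on how a PI is produced, so it can be written once and reused with any interval method. In the Python sketch below, `leave_one_out_intervals` is a hypothetical name, and `min_max_interval` is a deliberately trivial stand-in for Algorithm 2 that is used only to make the example runnable.

```python
def leave_one_out_intervals(dataset, predict_interval):
    """Return one PI per instance, each computed without that instance.

    dataset          -- list of (x, t) pairs
    predict_interval -- callable (learning_set, x) -> (lower, upper)
    """
    intervals = []
    for k, (x_k, _) in enumerate(dataset):
        learning_set = dataset[:k] + dataset[k + 1:]  # D_l = D without d_k
        intervals.append(predict_interval(learning_set, x_k))
    return intervals


def min_max_interval(learning_set, x):
    """Stand-in for Algorithm 2 (assumption): range of the remaining targets."""
    targets = [t for _, t in learning_set]
    return min(targets), max(targets)


# Illustrative data only.
dataset = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0), ((3.0,), 10.0)]
intervals = leave_one_out_intervals(dataset, min_max_interval)
covered = [lo <= t <= up for (_, t), (lo, up) in zip(dataset, intervals)]
print(intervals, sum(covered) / len(dataset))
```

Passing the interval method as a callable lets the same loop evaluate both the proposed method and a comparative method on identical splits.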

3.2.2 Results of Algorithm 3 using GAS

Let m = 22, n = 33, l = 78, and the confidence levels be 0.8 and 0.9. A total of 112 instances are available in GAS, and the data set is preprocessed as input of Algorithm 3. PIs of the 112 instances in GAS are acquired as shown in Fig. 4.

Fig. 4

PIs of granary airtightness by Algorithm 3. The left and right parts are PIs with confidence levels of 0.8 and 0.9, respectively.

Assessment index values of PIs are obtained using Equations (14)–(17), as listed in Table 3.

Table 3

Assessment index values of PIs of GAS by Algorithm 2

Confidence level	PICP	NMPIW	CWC
0.8	0.8304	0.0623	0.0623
0.9	0.9018	0.1019	0.1019

Both PICPs reach their nominal confidence levels (0.8304 ⩾ 0.8 and 0.9018 ⩾ 0.9), and NMPIWs are small.

4 Comparative method using ANN and bootstrap for PIs of granary airtightness

4.1 Algorithm using ANN and bootstrap for PIs

As a statistical sampling method, bootstrap uses a data-resampling technique to approximate an unknown distribution on the basis of an empirical distribution [3]. Previous studies have employed combinations of ANN and bootstrap to acquire PIs [18, 31]. For comparison with the PIs acquired using SVM and information diffusion, the following algorithm for PIs of granary airtightness is based on ANN and bootstrap. ANN is trained using the error backpropagation algorithm [1].

Algorithm 4 Algorithm using ANN and bootstrap for PIs.
Input: learning sample data set D_l, number of bootstrap samples B, confidence level 1 - α, input vector of the instance to be predicted I_p.
Output: PI of the true target of I_p.
Step 1: Sample B times with replacement from D_l to acquire the training sets $D_{t}^{i}$, i = 1, 2, …, B. The size of each $D_{t}^{i}$ is the same as that of D_l.
Step 2: ANN is trained using data set $D_{t}^{i}$, and the trained neural network is f_i (·), i = 1, 2, …, B. ANN is also trained using D_l, and f₀ (·) is the trained neural network.
Step 3: Input I_p into f_i (·), and the point prediction is y_i = f_i (I_p), i = 1, 2, …, B. Input I_p into f₀ (·), and the point prediction is y₀ = f₀ (I_p).
Then,
$E_{y} = \frac{1}{B} \sum_{i = 1}^{B} y_{i}$,
$STD_{y} = \sqrt{\frac{1}{B - 1} \sum_{i = 1}^{B} {(y_{i} - E_{y})}^{2}}$,
$BIAS_{y} = \frac{1}{B} \sum_{i = 1}^{B} (y_{i} - y_{0})$.
Step 4: The lower and upper bounds of PI can be obtained as follows [8]:
$L^{L} = y_{0} - BIAS_{y} - STD_{y} \cdot z_{α / 2}$ (21)
$L^{U} = y_{0} - BIAS_{y} + STD_{y} \cdot z_{α / 2}$ (22)
where $z_{α / 2} = Φ^{- 1} (1 - \frac{α}{2})$ is the upper α/2 quantile of the standard normal distribution and Φ (·) is the standard normal distribution function.
end Algorithm 4
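The bootstrap interval of Equations (21) and (22) can be sketched in Python independently of the regression model. In the sketch below, `fit` is any callable that trains a model and returns a prediction function; `fit_mean` is a trivial stand-in for the ANN and is an assumption made purely for illustration. The quantile z is taken as Φ⁻¹(1 - α/2), the positive upper α/2 quantile, so that the lower bound does not exceed the upper bound.

```python
import random
from statistics import NormalDist, mean, stdev


def bootstrap_interval(learning_set, x_new, fit, n_boot, alpha, seed=0):
    """PI from B bootstrap models, following Eqs. (21)-(22)."""
    rng = random.Random(seed)
    y0 = fit(learning_set)(x_new)  # prediction of the model trained on D_l
    ys = []
    for _ in range(n_boot):
        # Resample with replacement; same size as the learning set.
        resample = [rng.choice(learning_set) for _ in learning_set]
        ys.append(fit(resample)(x_new))
    bias = mean(ys) - y0  # equals (1/B) * sum(y_i - y_0)
    std = stdev(ys)       # sample standard deviation with divisor B - 1
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = y0 - bias
    return centre - std * z, centre + std * z


def fit_mean(data):
    """Stand-in model (assumption): always predicts the mean training target."""
    level = mean(t for _, t in data)
    return lambda x: level


# Illustrative data only.
data = [((float(i),), float(10 + i)) for i in range(12)]
pi_80 = bootstrap_interval(data, (3.0,), fit_mean, n_boot=200, alpha=0.2)
pi_90 = bootstrap_interval(data, (3.0,), fit_mean, n_boot=200, alpha=0.1)
print(pi_80, pi_90)
```

Because the interval is built from the spread of the bootstrap models alone, it reflects model variance but not the noise in the targets, which is one reason such intervals can be too narrow to reach the nominal coverage.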

4.2 Assessment of Algorithm 4

The leave-one-out cross validation, which is the same as that in Subsection 3.2, is employed to assess Algorithm 4. The cross validation procedure replaces Algorithm 2 in Algorithm 3 with Algorithm 4. The preprocessed 112 instances of GAS constitute data set D, which is the input of Algorithm 3. Let B = 1000 [29] and confidence level 1 - α be 0.8 and 0.9. The numbers of neurons in the input, hidden, and output layers of ANN are 18, 10, and 1, respectively. PI of every instance in data set D is acquired using Algorithm 4.

The assessment index values of PIs are determined using Equations (14)–(17). The results are listed in Table 4.

Table 4
Assessment index values of the PIs of GAS by Algorithm 4

Confidence level	PICP	NMPIW	CWC
0.8	0.6607	0.2423	1058
0.9	0.7143	0.3089	10783

PIs based on ANN and bootstrap cannot achieve the confidence level: PICPs are 0.6607 and 0.7143 for confidence levels of 0.8 and 0.9, respectively. Moreover, the accuracy of ANN is lower than that of SVM owing to the small size of the sample. Consequently, NMPIWs of PIs based on ANN and bootstrap are considerably greater than those of PIs based on SVM and information diffusion, thereby substantially reducing the practical value of these PIs.

5 Discussion

The airtightness data of built granaries, which are necessary for training a machine learning model, are scarce because investigation and testing of built granaries are costly. Furthermore, causal attributes of the granary airtightness data consist of numerical and categorical attributes. Hence, data preprocessing is essential for improving the generalization capability of the algorithm.

In accordance with the assessment index values of PIs of GAS by Algorithms 2 and 4, the proposed method based on SVM and information diffusion is better than the method based on ANN and bootstrap. The proposed method shows its superiority in processing small sample data, which is due to SVM’s excellent performance in regression [35] and information diffusion’s advantage in estimating probability distribution by using small sample data [15].

Compared with the PI methods based on bootstrap [19, 25] or optimization algorithm [13, 37], the proposed method is simpler and more easily conducted, while good-quality PIs are acquired using small sample data. The calculation cost of the proposed method is also considerably low.

The proposed method is based on SVM and information diffusion. This method utilizes small sample data to obtain the probability distribution of SVM’s point prediction errors. PIs can be obtained by assuming that the point prediction error of the instance to be predicted is subject to the same probability distribution.

For any given data set, good results can be obtained only with an appropriate machine learning method. The results show that the proposed method is suitable for the granary airtightness data set and is practicable.

6 Conclusions

Granary airtightness is important for grain storage safety and closely related to the granary design. Using the data of built granaries to predict the airtightness of granaries under design is significant for improving granary airtightness. If the predicted airtightness does not meet the requirements, then designers can modify the designs to enhance airtightness. The prediction can also help designers avoid waste caused by excessive airtightness redundancy.

The application of machine learning based on massive data to optimize structure design is a rapidly developing field. However, obtaining massive data on the airtightness of built granaries is difficult. Therefore, providing designers with beneficial design guidance utilizing small sample data is a topic that should be considered seriously. PIs can offer ranges of true targets with certain confidence levels. Moreover, PIs are more useful than point predictions that are subject to considerable uncertainty [18]. Lower and upper bounds of PIs can be used as pessimistic and optimistic prediction results, respectively, of granary airtightness, which can be used to design granary airtightness complying with different requirements. To improve the quality of PIs, this study utilizes SVM, which predicts well with small sample data, and acquires the probability distribution of regression errors through information diffusion. Thus, high-quality PIs can be obtained.

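The results show that, at confidence levels of 0.8 and 0.9, the proposed method generates PIs on the granary airtightness data set that reach the confidence level and are superior to those of the comparative method using ANN and bootstrap when only a few samples are available. Therefore, this method provides a new way to obtain PIs of granary airtightness, which is helpful for granary design.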

In future work, several approaches can be adopted to improve the prediction of granary airtightness. First, additional granary airtightness data need to be collected, covering more granary types and different regions. Sufficient, high-quality data can train machine learning models better and yield more accurate predictions. Second, ensemble learning can be employed to reduce regression errors. Lastly, using an attribute selection algorithm to determine the optimal causal attribute set is another important way to improve the performance of the prediction model. Considerable work therefore remains in applying machine learning to granary airtightness design.

Acknowledgments

This work was supported by the Special Scientific Research Fund of Grain Public Welfare Profession of China (No. 201513001-03).

References

1. Alpaydin, Introduction to Machine Learning, MIT Press, Cambridge, Massachusetts, 2014.

2. Alwosheel, van Cranenburgh and Chorus C.G., Is your dataset big enough? Sample size requirements when using artificial neural networks for discrete choice analysis, Journal of Choice Modelling 28 (2018), 167–182.

3. Badin, Daraio and Simar, A bootstrap approach for bandwidth selection in estimating conditional efficiency measures, European Journal of Operational Research 277 (2019), 784–797.

4. Bhamare D.K., Saikia, Rathod M.K., Rakshit and Banerjee, A machine learning and deep learning based approach to predict the thermal performance of phase change material integrated building envelope, Building and Environment 199 (2021), 1–12.

5. Cao W.P., Wang X.Z., Ming and Gao J.Z., A review on neural networks with random weights, Neurocomputing 275 (2018), 278–287.

6. Carpaneto, Bartosik, Cardoso and Manetti, Pest control treatments with phosphine and controlled atmospheres in silo bags with different airtightness conditions, Journal of Stored Products Research 69 (2016), 143–151.

7. Chang C.C. and Lin C.J., LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011), 1–27.

8. Davison A.C. and Hinkley D.V., Bootstrap Methods and their Application, Cambridge University Press, Cambridge, UK, 1997.

9. Dowell F.E. and Dowell C.N., Reducing grain storage losses in developing countries, Quality Assurance and Safety of Crops & Foods 9 (2017), 93–100.

10. Fleurat-Lessard, Integrated management of the risks of stored grain spoilage by seedborne fungi and contamination by storage mould mycotoxins – An update, Journal of Stored Products Research 71 (2017), 22–40.

11. García Kerdan and Morillón Gálvez, ANNEXE: An open-source building energy design optimisation framework using artificial neural networks and genetic algorithms, Journal of Cleaner Production 371 (2022), 1–17.

12. General Administration of Quality Supervision Inspection and Quarantine of P.R.C, Grain and Oils Storage - Requirement of Airtightness of Warehouse, Standards Press of China, Beijing, 2010.

13. Hosen M.A., Khosravi, Nahavandi and Creighton, Improving the quality of prediction intervals through optimal aggregation, IEEE Transactions on Industrial Electronics 62 (2015), 4420–4429.

14. Huang C.F., Assessment of Natural Disaster: Theory and Practice (in Chinese), Science Press, Beijing, 2005.

15. Huang C.F., Geospatial information diffusion based on self-learning discrete regression, Journal of Environmental Informatics 38 (2021), 93–105.

16. Huang C.F. and Moraga, Extracting fuzzy if-then rules by using the information matrix technique, Journal of Computer and System Sciences 70 (2005), 26–52.

17. Huang C.F., Zong and Chen Z.F., Four models to calculate a fuzzy probability distribution with a small sample, International Journal of Information Technology & Decision Making 6 (2007), 611–623.

18. Khosravi, Nahavandi, Creighton and Atiya A.F., Comprehensive review of neural network-based prediction intervals and new advances, IEEE Transactions on Neural Networks 22 (2011), 1341–1356.

19. Khosravi, Nahavandi, Srinivasan and Khosravi, Constructing optimal prediction intervals by using neural networks and bootstrap method, IEEE Transactions on Neural Networks and Learning Systems 26 (2015), 1810–1815.

20. Kostyukovsky, Trostanetsky and E., Novel approaches for integrated grain storage management, Israel Journal of Plant Sciences 63 (2016), 7–16.

21. Kujawa and Niedbala, Artificial neural networks in agriculture, Agriculture-Basel 11 (2021), 1–6.

22. Kumar and Kalita, Reducing postharvest losses during storage of grain crops to strengthen food security in developing countries, Foods 6 (2017), 1–22.

23. Lian, Chen C.L.P., Zeng Z.G., Yao and Tang H.M., Prediction intervals for landslide displacement based on switched neural networks, IEEE Transactions on Reliability 65 (2016), 1483–1495.

24. Liang, Wang P.-H. and Hu, Application of visual recognition based on BP neural network in architectural design optimization, Computational Intelligence and Neuroscience 2022 (2022), 1–9.

25. Lins I.D., Droguett E.L., Moura M.d.C., Zio and Jacinto C.M., Computing confidence and prediction intervals of industrial equipment degradation by bootstrapped support vector regression, Reliability Engineering & System Safety 137 (2015), 120–128.

26. Mancini, Calvo-Pardo and Olmo, Extremely randomized neural networks for constructing prediction intervals, Neural Networks 144 (2021), 113–128.

27. Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Pearson Education Limited, Edinburgh, 2011.

28. Nuchitprasittichai and Cremaschi, An algorithm to determine sample sizes for optimization with artificial neural networks, AIChE Journal 59 (2013), 805–812.

29. Ouyang X.L., Zhuang W.X. and Du, Output elasticities and inter-factor substitution: Empirical evidence from the transportation sector of Shanghai, Journal of Cleaner Production 202 (2018), 969–979.

30. Pereira G.H.D. and Centeno J.A.S., Assessment of training sample size for artificial neural networks in supervised image classification using spectral and laser scanner data, Boletim De Ciencias Geodesicas 23 (2017), 268–283.

31. Ren, Li, Kong, Shen and Du, A hybrid approach for interval prediction of concrete dam displacements under uncertain conditions, Engineering with Computers (2021), 1–19.

32. Shen, Nagai and Gao, Design of building construction safety prediction model based on optimized BP neural network algorithm, Soft Computing 24 (2020), 7839–7850.

33. Shrestha D.L. and Solomatine D.P., Machine learning approaches for estimation of prediction interval for the model output, Neural Networks 19 (2006), 225–235.

34. Singaravel, Suykens and Geyer, Deep-learning neural-network architectures and methods: Using component-based models in building-design energy prediction, Advanced Engineering Informatics 38 (2018), 81–90.

35. Tange R.I., Rasmussen M.A., Taira and Bro, Benchmarking support vector regression against partial least squares regression and artificial neural network: Effect of sample size on model performance, Journal of Near Infrared Spectroscopy 25 (2017), 381–390.

36. Tutuncu and Emekci, Comparative efficacy of modified atmospheres enriched with carbon dioxide against Cadra (=Ephestia) cautella, Journal of the Science of Food and Agriculture 99 (2019), 5962–5968.

37. Wang J.D., Fang K.J., Pang W.J. and Sun J.W., Wind power interval prediction based on improved PSO and BP neural network, Journal of Electrical Engineering & Technology 12 (2017), 989–995.

38. and Yuan, A novel method of BP neural network based green building design - The case of hotel buildings in hot summer and cold winter region of China, Sustainability 14 (2022), 1–22.

39. Zhou, Su W.J., Ding, Luo H.B. and Love P.E.D., Predicting safety risks in deep foundation pits in subway infrastructure projects: Support vector machine approach, Journal of Computing in Civil Engineering 31 (2017), 1–14.

Table 4
Assessment index values of the PIs of GAS by Algorithm 4

Confidence level   PICP     NMPIW    CWC
0.8                0.6607   0.2423   1058
0.9                0.7143   0.3089   10783