Label distribution learning with climate probability for ensemble forecasting

Abstract

In meteorology, ensemble forecasting aims to post-process an ensemble of multiple members’ forecasts and make better weather predictions. While multiple individual forecasts are generated to represent the uncertain weather system, the performance of ensemble forecasting is unsatisfactory. In this paper we conduct data analysis based on the expertise of human forecasters and introduce a machine learning method for ensemble forecasting. The proposed method, Label Distribution Learning with Climate Probability (LDLCP), can improve the accuracy of both deterministic forecasting and probabilistic forecasting. The LDLCP method utilizes the relevant variables of previous forecasts to construct the feature matrix and applies label distribution learning (LDL) to adjust the probability distribution of ensemble forecast. Our proposal is novel in its specialized target function and appropriate conditional probability function for the ensemble forecasting task, which can optimize the forecasts to be consistent with local climate. Experimental testing is performed on both artificial data and the data set for ensemble forecasting of precipitation in East China from August to November, 2017. Experimental results show that, compared with a baseline method and two state-of-the-art machine learning methods, LDLCP shows significantly better performance on measures of RMSE and average continuous ranked probability score.

Keywords

Ensemble forecasting, label distribution learning, post-processing, domain knowledge

1. Introduction

Ensemble forecasting for modeling nonlinear dynamic weather systems has become increasingly critical to meteorology [31]. The ensemble members derived from Numerical Weather Prediction (NWP) are generated with various model physics and perturbed input data [3], representing likely scenarios of the future atmospheric development [18]. How to optimally combine the forecasts of ensemble members remains a challenge [1], for both research and operational applications.

In this paper, we focus on the case of a real-valued meteorological variable, $\mathcal{Y}$ , referred to as the observation (label). The observation here could be temperature, precipitation, humidity, etc. The ensemble members always generate forecasts ahead of time, and the corresponding $\mathcal{Y}$ can only be measured with a time delay. In meteorology, instead of evaluating a specific pair of forecast and observation, evaluating their underlying cumulative distribution functions (CDFs) is often of interest [6]. For a climate period (e.g., seasonal, annual, etc.), the climate probability of $\mathcal{Y}$ , denoted by $F$ , is defined as the associated underlying CDF of $\mathcal{Y}$ :

$\displaystyle F(y)=\mathbb{P}(\mathcal{Y}\leqslant y),$

where $\mathbb{P}$ denotes the probability. For ensemble forecasting, the goal is to post-process an ensemble of multiple members’ forecasts and construct a CDF $G$ that predicts $F$ . An illustration of the ensemble forecasting task is shown in Fig. 1.

Figure 1.

An illustration of the ensemble forecasting task. The multiple individual forecasts are used to generate the ensemble forecast through a post-processing model. The “model” can be either a mathematical model or an experienced human forecaster, e.g., given the initial forecasts, the human forecaster can decide which to trust and how to use them according to domain knowledge. The ensemble forecast is always generated ahead of time, and the corresponding observation is measured with a time delay. In this example, $n+1$ ensemble forecasts have been generated, with associated CDF $G$ , while only $n$ observations have been measured, with underlying CDF $F$ . Due to the time delay, the underlying $F$ is unknown and must be predicted by $G$ .

Ensemble forecasts in meteorology include deterministic forecasts and probabilistic forecasts [17]. For example, forecasting tomorrow’s maximum surface temperature is deterministic, while forecasting the probability of tomorrow’s maximum surface temperature larger than some level (e.g., 283 $K$ ) is probabilistic. Over the last two decades, a number of statistical post-processing methods have been applied to ensemble members for improving deterministic and/or probabilistic forecasts [15, 16, 19].

Among these methods, an intuitive and widely used strategy for combination is weighted average [28], which can reflect the members’ relative contributions to predictive skill. With assumptions regarding the probability distributions of forecasts and observations (e.g., Gaussian, Gamma, etc.), Raftery et al. [26] utilized Bayesian model averaging for combining predictive distributions of surface temperature and sea level pressure. Later, Sloughter et al. [29] generalized this method for precipitation. By contrast, Mallet [23] considered distribution assumption unnecessary and targeted the optimal combination in a nonparametric way, i.e., sequential aggregation [7]. Under the scope of online learning (also called online prediction, [21, 20]), various algorithms [22] and loss functions [30] have been proposed to perform better sequential aggregation for ensemble forecasting. The nature of weighted average is to produce a convex combination, namely linear opinion pool. However, Ranjan and Gneiting [27] theoretically demonstrated that linear opinion pools are suboptimal and suggested nonlinear generalization.

In this study, Label Distribution Learning with Climate Probability (LDLCP), a novel machine learning method without distribution assumptions, is proposed to realize nonlinear combination. LDLCP inherits the core idea of quantile regression (QR, [14]), which is a kind of CDF-based regression for sequential aggregation [5]. Further, LDLCP takes the expertise of human forecasters (domain knowledge) into consideration. Though the basic weather forecasts are made by various numerical models, in real applications the role of human forecasters is important in at least two aspects. First, given the initial forecasts from various NWPs around the world, human forecasters decide which to trust and how to use them according to their experience of how consistent the forecasts are with local climate. This inspires us to optimize the probabilistic relation between ensemble forecasts and observations using the label distribution learning (LDL, [11]) paradigm. Second, conventional statistical models usually post-process the variable to be forecasted directly, while human forecasters prefer to incorporate additional variables when they forecast. Previous work [9] has verified the potential benefit of jointly utilizing the relevant variables for machine learning models. Thus, the proposed LDLCP also considers the relevant variables.

Note that LDLCP is a novel LDL method, because the previous LDL methods cannot be used directly. More specifically, previous LDL methods aim to capture the underlying label distribution only for classification tasks, e.g., facial age estimation [10] and crowd counting [32]. They mostly use the Kullback-Leibler divergence as the similarity measure and regard the maximum entropy model [4] as the parametric model for the conditional probability mass function, which is not suitable for the CDF-based ensemble forecasting task. Instead, we specialize the target function and the corresponding conditional probability function for this special problem.

The main contributions of this paper are summarized as follows:

• We formulate the ensemble forecasting problem in the LDL paradigm, relaxing the limitation of linear opinion pools and distribution assumptions for ensemble forecasting.

• We adapt LDL to effectively handle real-valued regression problems through a specialized algorithm.

• We conduct an evaluation of LDLCP on artificial and real-world ensemble forecasting data sets. The results confirm the advantages of our proposed method and demonstrate the potential of a CDF-based extension for LDL.

The rest of this paper is organized as follows. Section 2 introduces our proposed method in detail and presents its implementation for ensemble forecasting. Experimental study is presented in Section 3. Finally, we conclude the paper and present our future work in Section 4.

2. The proposed method

Generally, the proposed LDLCP consists of a learning stage and a forecasting stage. Figure 2 shows the LDLCP flowchart. The pseudocode of LDLCP is presented in the algorithm listing at the end of Section 2.2.2.

Figure 2.

The flowchart of LDLCP method. LDLCP firstly estimates the climate probability according to the observations in learning stage. Then, LDLCP constructs a new training set with the climate probability and optimizes a CDF-based label distribution model. Finally, in the forecasting stage, the predicted label distribution can be obtained for ensemble forecasting.

2.1 Learning stage (Training)

The training set of $m$ ensemble members with $d$ relevant variables forecasting $n$ times is constructed as $\{(\textbf{X}_{1},y_{1}),(\textbf{X}_{2},y_{2}),\ldots,(\textbf{X}_{n},y_{n})\}$ , where $\textbf{X}_{i}$ is the real-valued feature matrix of the $i$ th forecast and $y_{i}$ is the corresponding observation (label), for $i=1,\ldots,n$ . In detail,

$\displaystyle\textbf{X}_{i}=\left[\begin{array}{c}\textbf{x}^{1}_{i}\\ \textbf{x}^{2}_{i}\\ \vdots\\ \textbf{x}^{m}_{i}\end{array}\right]=\left[\begin{array}{cccc}g_{1}(\textbf{x}^{1}_{i})&g_{2}(\textbf{x}^{1}_{i})&\ldots&g_{d}(\textbf{x}^{1}_{i})\\ g_{1}(\textbf{x}^{2}_{i})&g_{2}(\textbf{x}^{2}_{i})&\ldots&g_{d}(\textbf{x}^{2}_{i})\\ \vdots&\vdots&\ddots&\vdots\\ g_{1}(\textbf{x}^{m}_{i})&g_{2}(\textbf{x}^{m}_{i})&\ldots&g_{d}(\textbf{x}^{m}_{i})\end{array}\right],$

where $\textbf{x}^{j}$ denotes the feature vector of the $j$ th ensemble member for $j=1,\ldots,m$ and $g_{k}(\textbf{x})$ is the $k$ th feature (relevant variable) of $\textbf{x}$ for $k=1,\ldots,d$ .

In the training set, the corresponding climate probability can be empirically estimated with the Heaviside step function $H(\cdot)$ :

$\displaystyle F(y)\simeq F_{e}(y)=\frac{1}{n}\sum_{i=1}^{n}H(y-y_{i}),$ (1)

where $H(\cdot)$ is $1$ if the argument is positive and zero otherwise. Therefore, as shown in Fig. 2, the empirical climate probability, CDF $F_{e}$ , is a multi-step function. By Eq. (1), we can transform the training set into $\{(\textbf{X}_{1},F_{e}(y_{1})),(\textbf{X}_{2},F_{e}(y_{2})),\ldots,(\textbf{X}_{n},F_{e}(y_{n}))\}$ . Given $(\textbf{X},F_{e}(y))$ , the goal of LDL is to learn a conditional probability function $\mathbb{P}(\mathcal{Y}\leqslant y|\textbf{X};\textbf{w})$ similar to $F_{e}$ , where $\textbf{w}$ is the parameter vector. It should be noted that since $F_{e}$ is a CDF rather than a probability mass function, it does not satisfy the normalization constraint $\int_{-\infty}^{\infty}F_{e}(y)\,\text{d}y=1$ . Directly using the Kullback-Leibler divergence as the similarity measure when optimizing the parameter vector $\textbf{w}^{*}$ would lead to a trivial solution. Here, based on the Kullback-Leibler divergence, we use a symmetrized similarity measure and $\textbf{w}^{*}$ is determined as follows:

$\displaystyle\textbf{w}^{*}=\mathop{\arg\min}\limits_{\textbf{w}}\sum_{i=1}^{n}\left(F_{e}(y_{i})\ln\frac{F_{e}(y_{i})}{\mathbb{P}(\mathcal{Y}\leqslant y_{i}|\textbf{X}_{i};\textbf{w})}+\mathbb{P}(\mathcal{Y}\leqslant y_{i}|\textbf{X}_{i};\textbf{w})\ln\frac{\mathbb{P}(\mathcal{Y}\leqslant y_{i}|\textbf{X}_{i};\textbf{w})}{F_{e}(y_{i})}\right).$

Also, since $\mathbb{P}(\mathcal{Y}\leqslant y|\textbf{X};\textbf{w})$ is not a conditional probability mass function, the maximum entropy model often used by previous LDL methods is not suitable. In order to map the real-valued feature space into probability space, we assume that $\mathbb{P}(\mathcal{Y}\leqslant y|\textbf{X};\textbf{w})$ takes the form of a sigmoid function [13]:

$\displaystyle\mathbb{P}(\mathcal{Y}\leqslant y|\textbf{X};\textbf{w})=S(\textbf{w}\odot\textbf{X})=\frac{1}{1+\exp(-\textbf{w}\odot\textbf{X})},$

where $\textbf{w}\odot\textbf{X}=\sum_{j=1}^{m}\sum_{k=1}^{d}w^{j}_{k}g_{k}(\textbf{x}^{j})$ and $w^{j}_{k}$ is an element of $\textbf{w}$ , i.e., $\textbf{w}$ is $md$ -dimensional.

Thus, the final target function of w is:

$\displaystyle T(\textbf{w})=\sum_{i=1}^{n}\big(F_{e}(y_{i})-S(\textbf{w}\odot\textbf{X}_{i})\big)\ln\frac{F_{e}(y_{i})}{S(\textbf{w}\odot\textbf{X}_{i})}.$ (2)
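To make Eqs. (1) and (2) concrete, the following minimal sketch evaluates the empirical climate probability and the target function on a tiny, made-up training set. The function names and toy values are illustrative assumptions, not the released implementation. One further assumption is flagged in the code: the step function is taken with $H(0)=1$, so that $F_{e}(y_{i})\geqslant 1/n$ and the logarithm in Eq. (2) stays finite.

```python
import math


def empirical_cdf(observations):
    """Return F_e as a function: fraction of training observations <= y (Eq. 1).

    Assumption: H(0) = 1, so F_e(y_i) >= 1/n and ln(F_e / S) is finite.
    """
    n = len(observations)

    def f_e(y):
        return sum(1 for obs in observations if obs <= y) / n

    return f_e


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum(w, X):
    """w (.) X: sum over members j and variables k of w[j][k] * X[j][k]."""
    return sum(
        w_jk * x_jk
        for w_row, x_row in zip(w, X)
        for w_jk, x_jk in zip(w_row, x_row)
    )


def target(w, features, cdf_values):
    """T(w) from Eq. (2): sum_i (F_i - S_i) * ln(F_i / S_i)."""
    total = 0.0
    for X, F in zip(features, cdf_values):
        S = sigmoid(weighted_sum(w, X))
        total += (F - S) * math.log(F / S)
    return total


# Hypothetical toy data: n = 4 forecasts, m = 2 members, d = 1 variable.
observations = [1.0, 3.0, 2.0, 4.0]
features = [[[0.9], [1.2]], [[2.8], [3.1]], [[2.1], [1.7]], [[4.2], [3.9]]]
f_e = empirical_cdf(observations)
cdf_values = [f_e(y) for y in observations]
w0 = [[0.5], [0.5]]
print(cdf_values, target(w0, features, cdf_values))
```

Each summand of $T(\textbf{w})$ is non-negative because $F_{e}-S$ and $\ln(F_{e}/S)$ always share the same sign, so $T$ vanishes only when the sigmoid output matches the empirical climate probability at every training point.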

Inspired by [11], the minimization of Eq. (2) follows the idea of an effective quasi-Newton optimization technique named Broyden-Fletcher-Goldfarb-Shanno (BFGS, more details can be found in [25]). We consider iteratively minimizing the second-order Taylor series of $T(\textbf{w})$ at the current estimate $\textbf{w}^{(l)}$ with the search direction $\textbf{p}^{(l)}$ and step length $\alpha^{(l)}$ :

$\displaystyle\textbf{p}^{(l)}=-\textbf{H}^{-1}(\textbf{w}^{(l)})\nabla T(\textbf{w}^{(l)}),$ (3)

$\displaystyle\textbf{w}^{(l+1)}=\textbf{w}^{(l)}+\alpha^{(l)}\textbf{p}^{(l)},$

where $\nabla T(\textbf{w}^{(l)})$ is the gradient of $T$ at $\textbf{w}^{(l)}$ and $\textbf{H}(\textbf{w}^{(l)})$ is the corresponding Hessian matrix.

Here, $\alpha^{(l)}$ needs to satisfy the strong Wolfe conditions [25]:

$\displaystyle T(\textbf{w}^{(l)}+\alpha^{(l)}\textbf{p}^{(l)})\leqslant T(\textbf{w}^{(l)})+c_{1}\alpha^{(l)}\nabla T(\textbf{w}^{(l)})^{\top}\textbf{p}^{(l)},$ (4)

$\displaystyle|\nabla T(\textbf{w}^{(l)}+\alpha^{(l)}\textbf{p}^{(l)})^{\top}\textbf{p}^{(l)}|\leqslant c_{2}|\nabla T(\textbf{w}^{(l)})^{\top}\textbf{p}^{(l)}|,$ (5)

where $0<c_{1}<c_{2}<1$ .

With respect to Eq. (2), $\nabla T(\textbf{w})$ can be obtained through

$\displaystyle\frac{\partial T(\textbf{w})}{\partial w_{k}^{j}}=\sum_{i=1}^{n}g_{k}(\textbf{x}_{i}^{j})S(\textbf{w}\odot\textbf{X}_{i})\left(1-S(\textbf{w}\odot\textbf{X}_{i})\right)\left(1-\ln\frac{F_{e}(y_{i})}{S(\textbf{w}\odot\textbf{X}_{i})}-\frac{F_{e}(y_{i})}{S(\textbf{w}\odot\textbf{X}_{i})}\right).$ (6)
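Because a sign slip is easy to make when differentiating Eq. (2), a finite-difference check is a cheap safeguard. The sketch below applies the chain rule directly, using $\partial T/\partial S=1-\ln(F_{e}/S)-F_{e}/S$ per sample and $\partial S/\partial z=S(1-S)$ for the sigmoid, and compares the result with central differences. The flattened feature rows, the toy numbers and the function names are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def target(w, rows, cdf_values):
    """T(w) = sum_i (F_i - S_i) ln(F_i / S_i); each X_i is flattened to a row."""
    total = 0.0
    for x, F in zip(rows, cdf_values):
        S = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
        total += (F - S) * math.log(F / S)
    return total


def gradient(w, rows, cdf_values):
    """Analytic gradient via the chain rule through the sigmoid."""
    grad = [0.0] * len(w)
    for x, F in zip(rows, cdf_values):
        S = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
        # dT/dS = 1 - ln(F/S) - F/S, and dS/dz = S (1 - S).
        common = S * (1.0 - S) * (1.0 - math.log(F / S) - F / S)
        for k, xk in enumerate(x):
            grad[k] += xk * common
    return grad


def numerical_gradient(w, rows, cdf_values, h=1e-6):
    """Central finite differences, used only as a reference."""
    grad = []
    for k in range(len(w)):
        up, down = list(w), list(w)
        up[k] += h
        down[k] -= h
        grad.append(
            (target(up, rows, cdf_values) - target(down, rows, cdf_values)) / (2 * h)
        )
    return grad


# Hypothetical toy problem with md = 2 parameters and n = 3 samples.
rows = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.8]]
cdf_values = [1 / 3, 1.0, 2 / 3]
w = [0.4, -0.2]
print(gradient(w, rows, cdf_values))
print(numerical_gradient(w, rows, cdf_values))
```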

To avoid computing $\textbf{H}^{-1}(\textbf{w}^{(l)})$ , BFGS approximates it by iteratively updating a matrix B:

$\displaystyle\textbf{B}^{(l+1)}=(\textbf{I}-\rho^{(l)}\textbf{s}^{(l)}(\textbf{u}^{(l)})^{\top})\textbf{B}^{(l)}(\textbf{I}-\rho^{(l)}\textbf{u}^{(l)}(\textbf{s}^{(l)})^{\top})+\rho^{(l)}\textbf{s}^{(l)}(\textbf{s}^{(l)})^{\top},$ (7)

where $\textbf{s}^{(l)}=\textbf{w}^{(l+1)}-\textbf{w}^{(l)}$ , $\textbf{u}^{(l)}=\nabla T(\textbf{w}^{(l+1)})-\nabla T(\textbf{w}^{(l)})$ and the scalar $\rho^{(l)}=\frac{1}{(\textbf{u}^{(l)})^{\top}\textbf{s}^{(l)}}$ .
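A useful sanity check on Eq. (7) is the secant condition $\textbf{B}^{(l+1)}\textbf{u}^{(l)}=\textbf{s}^{(l)}$, which the standard BFGS inverse update satisfies by construction. The sketch below implements the update with plain Python lists for a two-dimensional example and verifies that property. The helper names and the numeric values of $\textbf{s}$ and $\textbf{u}$ are illustrative assumptions (chosen so that $\textbf{u}^{\top}\textbf{s}>0$).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matmul(A, B):
    """Dense matrix product for small list-of-lists matrices."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def bfgs_inverse_update(B, s, u):
    """One step of Eq. (7): update the inverse-Hessian approximation B."""
    n = len(s)
    rho = 1.0 / dot(u, s)  # requires the curvature condition u^T s > 0
    left = [
        [(1.0 if i == j else 0.0) - rho * s[i] * u[j] for j in range(n)]
        for i in range(n)
    ]
    right = [
        [(1.0 if i == j else 0.0) - rho * u[i] * s[j] for j in range(n)]
        for i in range(n)
    ]
    core = matmul(matmul(left, B), right)
    return [[core[i][j] + rho * s[i] * s[j] for j in range(n)] for i in range(n)]


# Hypothetical step and gradient difference.
B0 = [[1.0, 0.0], [0.0, 1.0]]
s = [0.5, -0.2]
u = [0.8, 0.1]
B1 = bfgs_inverse_update(B0, s, u)
secant = [dot(row, u) for row in B1]
print(secant)  # should reproduce s up to rounding
```

The update also preserves symmetry, so a symmetric initial $\textbf{B}^{(0)}$ (the identity is a common choice) stays symmetric throughout the iterations.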

The best parameter vector $\textbf{w}^{*}$ can be obtained when the above optimization procedure converges.

2.2 Forecasting stage (Testing)

In the forecasting stage, we first apply the piecewise cubic Hermite interpolating polynomial (PCHIP) [2] to $F_{e}$ and improve the smoothness and generalization of $F_{e}$ , preserving its monotonicity. In this way, a smooth version of $F_{e}$ , denoted by $F_{s}$ , can be obtained. Given the $n+1$ th feature matrix $\textbf{X}_{n+1}$ , then both deterministic and probabilistic ensemble forecasts can be generated with $\textbf{X}_{n+1}$ , $\textbf{w}^{*}$ and $F_{s}$ .

2.2.1 Deterministic ensemble forecast

As $F_{s}$ is a monotonically increasing function, the inverse function $F_{s}^{-1}$ exists:

$\displaystyle F_{s}^{-1}\left(\mathbb{P}(\mathcal{Y}\leqslant\hat{y}_{n+1}|\textbf{X}_{n+1};\textbf{w}^{*})\right)\approx F_{s}^{-1}\left(F_{s}(\hat{y}_{n+1})\right)=\hat{y}_{n+1},$ (8)

where $\hat{y}_{n+1}$ denotes the deterministic ensemble forecast of $\textbf{X}_{n+1}$ .

Replacing $F_{e}$ with $F_{s}$ and substituting the sigmoid form of $\mathbb{P}(\mathcal{Y}\leqslant y|\textbf{X};\textbf{w})$ from Section 2.1 into Eq. (8) yields:

$\displaystyle\hat{y}_{n+1}=F_{s}^{-1}\left(S(\textbf{w}^{*}\odot\textbf{X}_{n+1})\right).$ (9)
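Eq. (9) amounts to a quantile lookup: the sigmoid output is treated as a probability level and mapped back to the observation scale. The sketch below illustrates the idea with a deliberately simplified stand-in for $F_{s}$: piecewise-linear interpolation through the points $(y_{(k)},k/n)$ instead of PCHIP, which keeps the example dependency-free. That substitution, the function names and the toy values are assumptions, and the observations are assumed to be distinct.

```python
import bisect
import math


def make_inverse_cdf(observations):
    """Piecewise-linear stand-in for F_s^{-1} (the paper uses PCHIP instead).

    The knots are (y_(k), k/n) for the sorted, distinct observations.
    """
    ys = sorted(observations)
    n = len(ys)
    ps = [(k + 1) / n for k in range(n)]

    def inverse(p):
        p = min(max(p, ps[0]), ps[-1])  # clamp to the range covered by the knots
        k = bisect.bisect_left(ps, p)
        if k == 0:
            return ys[0]
        t = (p - ps[k - 1]) / (ps[k] - ps[k - 1])
        return ys[k - 1] + t * (ys[k] - ys[k - 1])

    return inverse


def deterministic_forecast(w, x, inverse_cdf):
    """Eq. (9): y_hat = F_s^{-1}(S(w (.) X)), with X flattened to a vector x."""
    z = sum(wk * xk for wk, xk in zip(w, x))
    return inverse_cdf(1.0 / (1.0 + math.exp(-z)))


inverse_cdf = make_inverse_cdf([1.0, 3.0, 2.0, 4.0])
# With z = 0 the sigmoid returns 0.5, i.e. the median of the toy climate.
print(deterministic_forecast([0.5, 0.5], [0.0, 0.0], inverse_cdf))
```

Because the sigmoid output always lies in $(0,1)$, the forecast is confined to the range of values seen in the training climate, which is one way to read the requirement that forecasts be consistent with local climate.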

2.2.2 Probabilistic ensemble forecast

Different from $G$ , the probabilistic ensemble forecast with respect to $\textbf{X}_{n+1}$ is denoted as $\mathcal{G}_{n+1}$ . Here, $G$ is the joint distribution of $\{y_{i}\}_{i=1}^{n+1}$ , while $\mathcal{G}_{n+1}$ is the predicted marginal distribution for $y_{n+1}$ . According to Eq. (9), $\hat{y}_{n+1}$ can be viewed as a nonlinear combination. Suppose that there is only one ensemble member $\textbf{x}^{j}$ , then the corresponding deterministic forecast $\hat{y}_{n+1}^{j}$ is:

$\displaystyle\hat{y}_{n+1}^{j}=F_{s}^{-1}\left(S\left(\sum_{k=1}^{d}w^{j*}_{k}g_{k}(\textbf{x}_{n+1}^{j})\right)\right).$ (10)

In the forecasting stage, we generate $\mathcal{G}_{n+1}$ as a multi-step function. Hence, comparing with Eq. (1) yields the final form of $\mathcal{G}_{n+1}$ :

$\displaystyle\mathcal{G}_{n+1}(y)=\frac{\hat{y}_{n+1}}{\sum_{j=1}^{m}\hat{y}_{n+1}^{j}}\sum_{j=1}^{m}H(y-\hat{y}_{n+1}^{j}).$ (11)
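As a small illustration of Eq. (11), the sketch below builds the multi-step function from per-member forecasts $\hat{y}_{n+1}^{j}$ and the combined forecast $\hat{y}_{n+1}$. The function name and the numbers are hypothetical, and the step function is evaluated with $H(0)=1$ as an assumption. The code also raises an error when the member forecasts sum to zero, a case Eq. (11) leaves undefined.

```python
def probabilistic_forecast(member_forecasts, combined_forecast):
    """Return G_{n+1} from Eq. (11) as a callable step function."""
    total = sum(member_forecasts)
    if total == 0:
        raise ValueError("Eq. (11) is undefined when the member forecasts sum to zero")
    scale = combined_forecast / total

    def cdf(y):
        # Count the member forecasts at or below y (assumes H(0) = 1).
        return scale * sum(1 for f in member_forecasts if f <= y)

    return cdf


# Hypothetical per-member forecasts (Eq. 10) and combined forecast (Eq. 9).
G = probabilistic_forecast([2.0, 4.0, 6.0], 4.0)
print([G(y) for y in (1.0, 2.0, 5.0, 10.0)])
```

Read literally, the top plateau of this function equals $m\hat{y}_{n+1}/\sum_{j}\hat{y}_{n+1}^{j}$, so it reaches exactly one only when the combined forecast coincides with the mean of the per-member forecasts, as in the toy values above.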

Algorithm: LDLCP

Input: Training set $S=\{(\textbf{X}_{i},y_{i})\}_{i=1}^{n}$ , convergence criterion $\varepsilon$

Output: Deterministic ensemble forecast $\hat{y}_{n+1}$ , probabilistic ensemble forecast $\mathcal{G}_{n+1}(y)$

Transform $S$ to $\{(\textbf{X}_{i},F_{e}(y_{i}))\}_{i=1}^{n}$ by Eq. (1);

Initialize the parameter vector $\textbf{w}^{(0)}$ ;

Initialize the inverse Hessian approximation $\textbf{B}^{(0)}$ ;

Compute $\nabla T(\textbf{w}^{(0)})$ by Eq. (6);

$l=0$ ;

repeat

Compute search direction $\textbf{p}^{(l)}$ by Eq. (3);

Use a line search procedure to compute the step length $\alpha^{(l)}$ satisfying Eq. (4) and Eq. (5);

$\textbf{w}^{(l+1)}=\textbf{w}^{(l)}+\alpha^{(l)}\textbf{p}^{(l)}$ ;

Compute $\nabla T(\textbf{w}^{(l+1)})$ by Eq. (6);

Update $\textbf{B}^{(l+1)}$ by Eq. (7);

$l=l+1$ ;

until $\|\nabla T(\textbf{w}^{(l)})\|<\varepsilon$ ;

Transform $F_{e}$ to $F_{s}$ by PCHIP;

Compute $\hat{y}_{n+1}$ by Eq. (9);

Compute $\mathcal{G}_{n+1}(y)$ by Eq. (10) and Eq. (11);

Return $\hat{y}_{n+1}$ , $\mathcal{G}_{n+1}(y)$ .

3. Experimental study

In this section, we first describe the data sets, competitors, parameters and evaluation methods. Then, the experimental results and analysis are presented. Source code and data are available at: https://github.com/xuebing1991/LDLCP.

3.1 Data sets

There are two data sets used in the experiments including a widely used artificial data set in meteorological data analysis and a real-world data set with respect to precipitation ensemble forecasting. The first data set is generated to show in a direct and visual way how the proposed LDLCP method can capture the potential label distribution and realize nonlinear combination for ensemble members. The second data set includes real precipitation processes in four months, which helps to analyse how well the proposed LDLCP method can capture the tendency of observations in different scenarios.

3.1.1 MLT

The artificial data set described in [6] is used (with some modifications) in the experiment to mimic local temperatures (MLT). The observations and ensemble member values are generated from the following time series:

$\displaystyle a_{i}=(A\sin(\pi\omega_{1}i)+B\sin(\pi\omega_{2}i))^{2},\quad i=1,\ldots,T,$ (12)

$\displaystyle y_{i}\sim a_{i}(1+s_{1}\mathcal{N}(0,1))+s_{2}\mathcal{N}(0,1),$ (13)

$\displaystyle\textbf{x}_{i}^{j}\sim a_{i}(1+s_{1}\mathcal{N}(\mu_{j},1))+s_{2}\mathcal{N}(\mu_{j},1),\quad j=1,\ldots,m,$ (14)

where $T$ denotes the number of forecasting times and $\mathcal{N}(\mu,1)$ denotes independent Gaussian noise with mean $\mu$ and unit variance; different ensemble members are obtained by setting $\mu_{j}$ to different values. The noise terms reflect uncertainties and are sampled independently at each time step. Table 1 summarizes the parameters. A plot of the observations and the corresponding CDFs is shown in Fig. 3. In our experiment, $\mu=$ 0.5 is used for generating half of the ensemble members, while $\mu=$ 1 is used for the other half.

Table 1

Parameters used for generating the artificial data

$s_{1}$	$s_{2}$	$A$	$B$	$\omega_{1}$	$\omega_{2}$	$T$	$m$
0.3	0.3	1.68	0.0336	1/365.25	1/11	730	10
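The generation procedure of Eqs. (12) to (14) with the parameters of Table 1 can be sketched as follows. This is an illustrative reading rather than the authors' script: the function name, the seed handling and the choice to draw the two noise terms of each equation independently are assumptions.

```python
import math
import random


def generate_mlt(T=730, m=10, A=1.68, B=0.0336, omega1=1 / 365.25,
                 omega2=1 / 11, s1=0.3, s2=0.3, seed=0):
    """Generate observations y_i and member forecasts x_i^j (Eqs. 12-14)."""
    rng = random.Random(seed)
    # First half of the members uses mu = 0.5, second half mu = 1.
    mus = [0.5] * (m // 2) + [1.0] * (m - m // 2)
    observations, members = [], []
    for i in range(1, T + 1):
        a = (A * math.sin(math.pi * omega1 * i)
             + B * math.sin(math.pi * omega2 * i)) ** 2
        # Assumption: each N(., 1) term is a separate independent draw.
        y = a * (1 + s1 * rng.gauss(0, 1)) + s2 * rng.gauss(0, 1)
        x = [a * (1 + s1 * rng.gauss(mu, 1)) + s2 * rng.gauss(mu, 1) for mu in mus]
        observations.append(y)
        members.append(x)
    return observations, members


observations, members = generate_mlt()
print(len(observations), len(members[0]))
```

Under this reading, a member with noise mean $\mu$ carries an expected bias of $\mu(s_{1}a_{i}+s_{2})$ relative to the observation, so members generated with $\mu=1$ are on average twice as biased as those generated with $\mu=0.5$.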

Figure 3.

The observations over 730 time steps (a) and the corresponding CDFs (b) of the MLT data set.

3.1.2 EFP

This original data set provided by China Meteorological Administration (CMA) corresponds to ensemble forecasts of 6 h accumulated precipitation (EFP) in East China ([116 ${}^{\circ}$ E, 123 ${}^{\circ}$ E] $\times$ [27 ${}^{\circ}$ N, 35 ${}^{\circ}$ N]) over 4 months (from August to November, 2017). The forecast data are from 51 ensemble members of the European Centre for Medium-Range Weather Forecasts (ECMWF) and 21 ensemble members of the National Centers for Environmental Prediction (NCEP). The observations are from 315 national automatic weather stations (AWSs) (see Fig. 4). ECMWF and NCEP forecasts are both initialized twice a day, at 0000 and 1200 UTC, but with different temporal resolutions and lead times. To maintain consistency, only forecasts at four different times a day (0600, 1200, 1800 and 2400 UTC) from the latest initial time are considered (thus, $T=$ 488 for EFP). The gridded forecasts are interpolated to each AWS location by inverse distance weighted averaging among its four nearest grid points. A total of 22 forecast variables including height, humidity and wind speed at 500, 700, 850 and 925 mb, temperature, total precipitation (6 h accumulation), mean sea-level pressure and vertical velocity are regarded as relevant variables. The data with missing or abnormal values are eliminated from analysis.
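The interpolation step mentioned above can be sketched as follows. The text does not state the distance metric or the exponent of the inverse-distance weights, so the planar distance in degrees and the default `power=2.0` are assumptions, as are the function name and the sample grid values.

```python
import math


def idw_interpolate(station, grid_points, power=2.0, k=4):
    """Inverse distance weighted average over the k nearest grid points.

    station: (lon, lat); grid_points: iterable of (lon, lat, value).
    Assumption: planar distance in degrees and weights 1 / distance**power.
    """
    lon0, lat0 = station
    nearest = sorted(
        ((math.hypot(lon - lon0, lat - lat0), value) for lon, lat, value in grid_points),
        key=lambda pair: pair[0],
    )[:k]
    if nearest[0][0] == 0.0:
        return nearest[0][1]  # the station coincides with a grid point
    weights = [1.0 / dist ** power for dist, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


# Hypothetical 0.5-degree grid cell plus one distant point that is ignored.
grid = [
    (116.0, 27.0, 1.0), (116.5, 27.0, 2.0),
    (116.0, 27.5, 3.0), (116.5, 27.5, 4.0),
    (120.0, 30.0, 100.0),
]
print(idw_interpolate((116.25, 27.25), grid))
```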

Figure 4.

The study area of the EFP data set. The circles show the spatial distribution of the national AWS network within the study area.

3.2 Competitors and parameters

In this study, our proposed LDLCP method is compared with a baseline and two machine learning methods:

• RAW: refers to the raw ensemble. As a baseline, RAW always averages the forecasts of ensemble members. Accordingly, RAW simply generates the probabilistic ensemble forecast as $\mathcal{G}_{n+1}^{\mathrm{RAW}}(y)=\frac{1}{m}\sum_{j=1}^{m}H(y-g(\mathbf{x}^{j}_{n+1}))$ .

• BMA [26, 29]: refers to Bayesian model averaging. This method is a well-known parameterized post-processing method used in operation. BMA assumes the forecast probability mass function $p(y|\Theta)$ to be of a specific form with parameter set $\Theta$ (e.g., $p$ could be a Gaussian distribution for temperature and a Gamma distribution for precipitation). BMA conducts parameter optimization through the EM algorithm [24] and generates $p_{n+1}(y|\Theta)=\sum_{j=1}^{m}w^{j}p(y|\Theta^{j})$ . Then, the output CDF $\mathcal{G}_{n+1}^{\mathrm{BMA}}(y)$ can be obtained by integrating $p_{n+1}(y|\Theta)$ .

• EG [30]: refers to exponentiated gradient. EG is one of the state-of-the-art online learning methods for ensemble forecasting. EG assumes the forecast CDF to be of the form $\mathcal{G}_{n+1}^{\mathrm{EG}}(y)=\sum_{j=1}^{m}w^{j}_{n+1}H(y-g(\mathbf{x}^{j}_{n+1}))$ . EG iteratively updates the weights for ensemble members according to:

$\displaystyle w_{n+1}^{j}=\frac{w_{n}^{j}\exp(-\eta\tilde{\ell}_{n}^{j})}{\sum_{j'=1}^{m}w_{n}^{j'}\exp(-\eta\tilde{\ell}_{n}^{j'})},$

where $\tilde{\ell}$ is the loss gradient [8] of the continuous ranked probability score (CRPS) [30] and $\eta$ is the learning rate (0.05 by default).
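The EG weight update above is a multiplicative rule followed by normalization. A minimal sketch, with a hypothetical function name and made-up loss gradients standing in for the CRPS-based $\tilde{\ell}$:

```python
import math


def eg_update(weights, loss_gradients, eta=0.05):
    """One exponentiated-gradient step: reweight, then renormalize to sum to one."""
    unnormalized = [w * math.exp(-eta * g) for w, g in zip(weights, loss_gradients)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


m = 4
weights = [1.0 / m] * m                 # uniform initial weights
loss_gradients = [0.2, 1.5, -0.3, 0.8]  # hypothetical values of the loss gradient
weights = eg_update(weights, loss_gradients)
print(weights, sum(weights))
```

Members with larger loss gradients lose weight relative to the others, while the normalization keeps the weights non-negative and summing to one, so the EG forecast remains a convex combination of the member step functions.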

We use the Matlab programming language to implement RAW and EG, and use the R statistical package [29] to implement BMA. For a fair comparison, we implement all the methods without additional calibration strategies. For BMA and EG, the initial $m$ -dimensional weight vector is set to $[\frac{1}{m},\ldots,\frac{1}{m}]^{\top}$ . For LDLCP, BFGS with default settings is used for the optimization of $\textbf{w}^{*}$ , the convergence criterion $\varepsilon$ is set to $10^{-6}$ and the initial $md$ -dimensional parameter vector is set to $[\frac{1}{md},\ldots,\frac{1}{md}]^{\top}$ . Moreover, each feature is normalized to zero mean and unit standard deviation for LDLCP.

3.3 Evaluation methods

We start the evaluation from $i_{0}=1+T_{0}$ (thus at least allowing for a learning period of $T_{0}$ ). In the experiment, $T_{0}$ equals half of the forecasting times for each data set, i.e., $T_{0}$ equals 365 and 244 for MLT and EFP, respectively. With respect to deterministic ensemble forecast, we regard root mean squared error (RMSE, Eq. (15)) as the evaluation measure, while we consider average continuous ranked probability score ( $\overline{\text{CRPS}}$ , Eq. (16)) for probabilistic ensemble forecast [12]. The equations are:

$\displaystyle\text{RMSE}=\sqrt{\frac{1}{\sum_{i=i_{0}}^{T}|\mathcal{S}_{i}|}\sum_{i=i_{0}}^{T}\sum_{s\in\mathcal{S}_{i}}(\hat{y}_{i}-y_{i})_{s}^{2}},$ (15)

$\displaystyle\overline{\text{CRPS}}=\frac{1}{\sum_{i=i_{0}}^{T}|\mathcal{S}_{i}|}\sum_{i=i_{0}}^{T}\sum_{s\in\mathcal{S}_{i}}\int(\mathcal{G}_{i}(y)-H(y-y_{i}))_{s}^{2}\,\text{d}y,$ (16)

where for the $i$ th time step, $|\mathcal{S}_{i}|$ denotes the number of available stations (the data with missing or abnormal values are eliminated from analysis) and $(\cdot)_{s}$ measures the loss for the $s$ th station. In particular, $|\mathcal{S}|=1$ for MLT and $|\mathcal{S}|_{\max}=315$ for EFP.
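For step-function forecasts such as those of RAW and EG, the integral in Eq. (16) can be evaluated exactly, because the integrand is constant between consecutive breakpoints. The sketch below computes the CRPS of a single forecast and the RMSE of a list of forecasts. The function names and test values are illustrative, and the weights are assumed to sum to one so that the integrand vanishes outside the range of the breakpoints.

```python
import math


def crps_step(forecasts, weights, observation):
    """Exact integral of (G(y) - H(y - obs))^2 for G(y) = sum_j w_j H(y - f_j).

    Assumption: the weights sum to one, so the integrand is zero beyond the
    largest breakpoint.
    """
    points = sorted(set(forecasts) | {observation})
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both step functions are constant on [left, right).
        g = sum(w for f, w in zip(forecasts, weights) if f <= left)
        h = 1.0 if observation <= left else 0.0
        total += (g - h) ** 2 * (right - left)
    return total


def rmse(predictions, observations):
    """Root mean squared error over paired forecasts and observations (Eq. 15)."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, observations)]
    return math.sqrt(sum(squared) / len(squared))


print(crps_step([3.0], [1.0], 5.0))            # single member: absolute error
print(crps_step([1.0, 3.0], [0.5, 0.5], 2.0))  # two equally weighted members
print(rmse([1.0, 2.0], [1.0, 4.0]))
```

For a single member the score reduces to the absolute error, which is why $\overline{\text{CRPS}}$ is often read as a probabilistic generalization of the mean absolute error.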

For each of the evaluation measures, the lower it is, the better the performance is. Note that EFP records real precipitation processes, while MLT is randomly generated according to Eqs (12), (13) and (14). Due to the randomness of MLT, MLT is generated 100 times and the experiments on MLT are conducted with 100 independent runs for a fair evaluation.

3.4 Results and discussions

Table 2 shows the performance results for MLT and EFP when applying RAW, BMA, EG and LDLCP. In Table 2, we report “mean $\pm$ standard deviation” for the experiments on MLT. Further, in Table 3, we conduct statistical analysis to evaluate whether LDLCP outperforms the competitors significantly. Here, pair-wise t-tests are applied to MLT using LDLCP as the control method, both for RMSE and $\overline{\text{CRPS}}$ . Based on the individual results for each data set, LDLCP performs the best in both RMSE and $\overline{\text{CRPS}}$ among the competitors. Compared with RAW, improvements of 25.4% in RMSE and 16.3% in $\overline{\text{CRPS}}$ are obtained by LDLCP in MLT, and improvements of 43.9% in RMSE and 37.4% in $\overline{\text{CRPS}}$ are obtained by LDLCP in EFP. Thus, LDLCP can improve the performance of both deterministic ensemble forecasting and probabilistic ensemble forecasting. Furthermore, Table 2 clearly shows that the three machine learning methods obtain better results than the baseline, which verifies the advantage of applying appropriate post-processing methods for ensemble forecasting.

Table 2

Evaluation results on MLT and EFP

Method	MLT RMSE	MLT $\overline{\text{CRPS}}$	EFP RMSE	EFP $\overline{\text{CRPS}}$
RAW	0.917 $\pm$ 0.012	0.526 $\pm$ 0.008	6.461	1.627
BMA	0.822 $\pm$ 0.020	0.442 $\pm$ 0.010	4.277	1.027
EG	0.817 $\pm$ 0.016	0.474 $\pm$ 0.011	3.722	1.096
LDLCP	0.684 $\pm$ 0.013	0.440 $\pm$ 0.010	3.624	1.019

The best result in each column is obtained by LDLCP.
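The relative improvements over RAW quoted in the text follow directly from the mean values in Table 2. A short check, with hypothetical variable names and the table values copied by hand:

```python
# Mean scores from Table 2: (RMSE, average CRPS) for each data set.
raw = {"MLT": (0.917, 0.526), "EFP": (6.461, 1.627)}
ldlcp = {"MLT": (0.684, 0.440), "EFP": (3.624, 1.019)}


def improvement(baseline, value):
    """Relative reduction of an error score, in percent."""
    return 100.0 * (baseline - value) / baseline


improvements = {
    name: tuple(round(improvement(b, v), 1) for b, v in zip(raw[name], ldlcp[name]))
    for name in raw
}
print(improvements)
```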

Table 3

$p$ -values of pair-wise t-tests for MLT data set using LDLCP as the control method

	RMSE	$\overline{\text{CRPS}}$
RAW	$<$ 0.001	$<$ 0.001
BMA	$<$ 0.001	0.1191
EG	$<$ 0.001	$<$ 0.001

For RMSE and $\overline{\text{CRPS}}$ , we consider a difference to be significant at $p<$ 0.01.

Recalling Eq. (16), a lower $\overline{\text{CRPS}}$ indicates that the predicted CDF is simultaneously calibrated and sharp (sharpness measures how concentrated the predicted distribution is, refer to [12] for more details), i.e., a well-performing predicted CDF exhibits high confidence only around the real observation value. According to Table 3, LDLCP significantly outperforms RAW and EG in RMSE and $\overline{\text{CRPS}}$ . Though LDLCP significantly outperforms BMA in RMSE, it should be noted that with respect to $\overline{\text{CRPS}}$ , the results of BMA and LDLCP are comparable. Based on Eqs (15) and (16), BMA may outperform LDLCP in sharpness. As a result, though LDLCP outperforms BMA in calibration, LDLCP achieves only slightly better results in $\overline{\text{CRPS}}$ .

3.4.1 Further analysis for ensemble forecasting of temperature

For MLT, we intentionally generate the ensemble members similar to the observations. The ensemble members are composed of two classes of equal size. Empirically, the members with $\mu=$ 0.5 are more skillful than those with $\mu=$ 1.

As there are no additional relevant variables in MLT, MLT can be viewed as a reduced case. In this case, each competitor makes forecasts by assigning weights to the individual forecasts. Note that the weights should be non-negative and sum to one for RAW, BMA and EG, while LDLCP need not follow these restrictions. We track the variation of the cumulated weights of the two classes and test whether the competitors can favor the class of more skillful members (see Fig. 5). With respect to BMA and EG, as expected, the class with $\mu=$ 0.5 sees its weight increase on average and approach 1. These two methods tend to rely on the class with $\mu=$ 0.5, which results in a limited improvement in performance compared to RAW.

Figure 5.

Cumulated weights of ensemble members with $\mu=$ 0.5 (bold lines) and $\mu=$ 1 (thin lines) computed by the competitors over 730 time steps. The dashed lines bound the cumulated weights that follow the restrictions of being non-negative and no greater than one. As RAW always averages the forecasts of ensemble members, the thin black line is covered by the bold black line in this figure.

However, the situation is quite different for LDLCP. Figure 5 shows that LDLCP does not tend to rely only on the class with $\mu=$ 0.5. Instead, LDLCP, as a nonlinear combination method, jointly utilizes the two classes to make ensemble forecasts. In this way, LDLCP is more robust to abnormal forecast values from some ensemble members. As a result, LDLCP can make the forecast distribution more similar to the observation distribution, resulting in a significant improvement in performance compared to the other competitors.

3.4.2 Further analysis for ensemble forecasting of precipitation

Figure 6.

Comparisons between observations and different deterministic forecasts with (b, d and f) and without (a, c and e) post-processing methods. The grey area represents the range of individual forecasts from 72 ensemble members and RAW represents the ensemble mean. Three examples of different scenarios (total precipitation: 13.1 (a and b), 34.6 (c and d) and 94.4 mm (e and f)) from 2400 UTC 5 November to 2400 UTC 8 November 2017 are shown.

For EFP, to further examine the quality of deterministic ensemble forecasting with respect to the whole precipitation processes in different scenarios, three examples are presented to show the comparisons between observations and different deterministic forecasts (see Fig. 6). Note the difference of scales in Fig. 6.

With respect to light precipitation process (Fig. 6a and b) and moderate precipitation process (Fig. 6c and d), LDLCP generally achieves the best performance (i.e., the forecasts applying LDLCP approximately match the observations), while the ensemble members tend to overestimate the precipitation processes (e.g., at 1200 UTC 7 November 2017).

Nonetheless, decreased performance of the machine learning methods can be found for the heavy precipitation process (Fig. 6e and f). In this scenario, though LDLCP and EG capture the tendency of the precipitation process well, they cannot make satisfactory forecasts for extreme precipitation. For example, at 0600 UTC 7 November 2017, obvious underestimations can be found for EG (17.6 versus 56.6 mm) and LDLCP (31.7 versus 56.6 mm). One likely reason is the skewed distribution of the different kinds of precipitation processes: heavy precipitation processes are very few in the training set. In addition, extreme precipitation itself is difficult to forecast due to weather uncertainty. From this perspective, LDLCP needs specialized adjustment for precipitation with relatively low occurrence frequency (e.g., extreme precipitation).

4. Conclusion

In this paper, a machine learning method (LDLCP) for ensemble forecasting is proposed. LDLCP jointly utilizes the relevant variables and applies the paradigm of LDL to optimize the distribution of ensemble forecasts to be consistent with climate probability. LDLCP adopts a specialized target function and conditional probability function for ensemble forecasting and does not need any assumption about the probability distribution. Experiments were conducted on both artificial data (MLT) and the data set for ensemble forecasting of precipitation in East China (EFP). The results confirm that LDLCP can improve the performance of both deterministic forecasting and probabilistic forecasting, showing promising RMSE and $\overline{\text{CRPS}}$ in comparison with other competitors. Compared with the raw ensemble, improvements of 25.4% in RMSE and 16.3% in $\overline{\text{CRPS}}$ are obtained by LDLCP in MLT, and improvements of 43.9% in RMSE and 37.4% in $\overline{\text{CRPS}}$ are obtained by LDLCP in EFP. However, the results of LDLCP may lack sharpness and the performance of LDLCP may be limited for extreme precipitation.

Although LDLCP is designed for the ensemble forecasting task, as a general learning framework it may also be applied to other kinds of time-series prediction problems. Beyond climate data, LDLCP could be useful when the data have the following characteristics: 1) multiple predictors, analogous to ensemble members; 2) a periodic underlying distribution, analogous to local climate.

Future research will include the following: 1) for probabilistic ensemble forecasting, a more effective approach to increase the sharpness of the forecasts obtained by LDLCP; 2) for deterministic ensemble forecasting, a suitable strategy to improve the performance of LDLCP when extreme precipitation occurs.

Footnotes

Acknowledgments

The authors would like to express their gratitude to the CMA Public Meteorological Service Center for providing the meteorological data. The authors are thankful for the financial support from the National Natural Science Foundation of China (U1636220 and 61602482).

References

1. Adhikari, A neural network based linear ensemble framework for time series forecasting, Neurocomputing 157 (2015), 231–242.

2. Aràndiga, Donat and Santágueda, The PCHIP subdivision scheme, Applied Mathematics and Computation 272 (2016), 28–40.

3. Bauer, Thorpe and Brunet, The quiet revolution of numerical weather prediction, Nature 525 (2015), 47–55.

4. Berger A.L., Pietra S.D. and Pietra V.J.D., A maximum entropy approach to natural language processing, Computational Linguistics 22 (1996), 39–71.

5. Biau and Patra, Sequential quantile prediction of time series, IEEE Transactions on Information Theory 57 (2011), 1664–1674.

6. Bröcker, Evaluating raw ensembles with the continuous ranked probability score, Quarterly Journal of the Royal Meteorological Society 138 (2012), 1611–1617.

7. Cesa-Bianchi and Lugosi, Prediction, Learning, and Games, Cambridge University Press, Cambridge, UK, 2006.

8. Devaine, Gaillard, Goude and Stoltz, Forecasting electricity consumption by aggregating specialized experts, Machine Learning 90 (2013), 231–260.

9. Gagne D.J. II, McGovern and Xue, Machine learning enhancement of storm-scale ensemble probabilistic quantitative precipitation forecasts, Weather and Forecasting 29 (2014), 1024–1043.

10. Geng, Yin and Zhou Z.-H., Facial age estimation by learning from label distributions, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2013), 2401–2412.

11. Geng, Label distribution learning, IEEE Transactions on Knowledge and Data Engineering 28 (2016), 1734–1748.

12. Gneiting and Katzfuss, Probabilistic forecasting, Annual Review of Statistics and Its Application 1 (2014), 125–151.

13. Han and Moraga, The influence of the sigmoid function parameters on the speed of backpropagation learning, in: International Workshop on Artificial Neural Networks: From Natural to Artificial Neural Computation (1995), 195–201.

14. Koenker and Bassett, Regression quantiles, Econometrica 46 (1978), 33–50.

15. Krishnamurti T.N., Kishtawal C.M., LaRow T.E., Bachiochi D.R., Zhang, Williford C.E., Gadgil and Surendran, Improved weather and seasonal climate forecasts from multimodel superensemble, Science 285 (1999), 1548–1550.

16. Krishnamurti T.N., Kumar, Simon, Bhardwaj, Ghosh and Ross, A review of multimodel superensemble forecasting for weather, seasonal climate, and hurricanes, Reviews of Geophysics 54 (2016), 336–377.

17. Leutbecher and Palmer T.N., Ensemble forecasting, Journal of Computational Physics 227 (2008), 3515–3539.

18. Lewis J.M., Roots of ensemble forecasting, Monthly Weather Review 133 (2005), 1865–1885.

19. W.-T., Duan Q.-Y., Miao C.-Y., A.-Z., Gong and Di Z.-H., A review on statistical postprocessing methods for hydrometeorological ensemble forecasting, Wiley Interdisciplinary Reviews: Water 4 (2017), e1246.

20. Littlestone, Learning when irrelevant attributes abound: A new linear-threshold algorithm, Machine Learning 2 (1988), 285–318.

21. Littlestone and Warmuth M.K., The weighted majority algorithm, Information and Computation 108 (1994), 212–261.

22. Mallet, Stoltz and Mauricette, Ozone ensemble forecast with machine learning algorithms, Journal of Geophysical Research 114 (2009), D05307.

23. Mallet, Ensemble forecast of analyses: coupling data assimilation and sequential aggregation, Journal of Geophysical Research 115 (2010), D24303.

24. McLachlan G.J. and Krishnan, The EM Algorithm and Extensions, Wiley (1997).

25. Nocedal and Wright, Numerical Optimization (2nd ed.), Springer, New York, NY, USA (2006).

26. Raftery A.E., Gneiting, Balabdaoui and Polakowski, Using Bayesian model averaging to calibrate forecast ensembles, Monthly Weather Review 133 (2005), 1155–1174.

27. Ranjan and Gneiting, Combining probability forecasts, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (2010), 71–91.

28. Raykar V.C., S.-P., Zhao L.H., Valadez G.H., Florin, Bogoni and Moy, Learning from crowds, Journal of Machine Learning Research 11 (2010), 1297–1332.

29. Sloughter J.M., Raftery A.E., Gneiting and Fraley, Probabilistic quantitative precipitation forecasting using Bayesian model averaging, Monthly Weather Review 135 (2007), 3209–3220.

30. Thorey, Mallet and Baudin, Online learning with the continuous ranked probability score for ensemble forecasting, Quarterly Journal of the Royal Meteorological Society 143 (2017), 521–529.

31. J.-P., Tan P.-N., Zhou J.-Y. and Luo L.-F., Online multi-task learning framework for ensemble forecasting, IEEE Transactions on Knowledge and Data Engineering 29 (2017), 1268–1280.

32. Zhang, Wang and Geng, Crowd counting in public video surveillance by label distribution learning, Neurocomputing 166 (2015), 151–163.

1 Source code and data are available at: https://github.com/xuebing1991/LDLCP.

Table 2 Evaluation results on MLT and EFP
