Improved approach to wave potential estimation using bivariate distributions

Abstract

Probabilistic approaches are frequently used to describe irregular activity data to assist the design and development of devices. Unfortunately, useful estimations are not always feasible due to the large noise in the data modeled, as it occurs when estimating the sea waves potential for electricity generation. In this work we propose a simple methodology based on the use of joint probability models that allow discriminating extreme values, collected from measurements as pairs of independent points, while allowing the preservation of the essential statistics of the measurements. The outcome of the proposed methodology is an equivalent data series where large-amplitude fluctuations are suppressed and, therefore, can be used for design purposes. For the evaluation of the proposed method, we used year-long databases of hourly-collected measurements of the wave’s height and period, performed at maritime buoys located in the Gulf of Mexico. These measurements are used to obtain a fluctuations-reduced representation of the energy potential of the waves that can be useful, for instance, for the design of electric generators.

Keywords

Bivariate distributions probability models irregular activity wave potential estimation

1 Introduction

Data analysis is of critical importance in practically all areas of knowledge to support decision-making. When the data are taken by acquisition systems and more than one variable is involved, it is common for the applied analysis to be statistical in nature. The analysis of experimental data collected at multiple locations over time leads to new and unique problems in statistical modeling and inference. Historically, time series methods have been applied to problems in the physical and environmental sciences, where information is integrated or fused in order to unify data from different sources with different contextual and conceptual representations [1]. This fact accounts for the basic engineering flavor permeating the language of time series analysis.

The field of data fusion, defined in a general way as the use of techniques that combine data from multiple sources [2], focuses on the acquisition, processing, and synergistic combination of several sources of information collected by one or more sensors to provide a better understanding of the phenomenon to be considered [3].

Many examples of areas and applications exist where the combination of information from different sources is necessary. We list only some of them for illustration as well as some of the statistical questions that might be asked about such data. In distributed detection problems [4] a multi-sensor system is used that is composed of a set of sensors placed with a certain spatial distribution. These sensors can either register the same variable at different locations or they can monitor different variables [5]. These types of measurements are common in biometric systems where expert systems are frequently used to monitor various sensors [6]. Another typical use is in the processing of radio signals from multiantenna configurations in wireless communication systems [7]; multispectral astronomical and remotely sensed images [8, 9]; in multi-frame blind image deconvolution with non-stationary blurring process [10, 11]; and, speech separation and enhancement [12, 13]. Energy estimations of renewable resources are, of course, not the exception, since one must deal with multiple, noisy meteorological parameters simultaneously [14, 15].

In this work, we propose a simple methodology based on the use of joint probability models that allow suppressing large-amplitude fluctuations of the data, caused by the extreme points, while preserving the essential statistics. More specifically, we focus on a two-dimensional situation where two independent sensors report different variables over time in the form of ordered pairs with irregular activity. The methodology proposed is general and can be extended to other scenarios of two-dimensional data fusion where large-amplitude fluctuations need to be reduced. Our proposal is illustrated for the practical case of estimating the sea wave potential. We make use of year-long databases of hourly-collected measurements of the wave’s height and period, performed at maritime buoys. As a result, we generate a new series with a significantly smaller dimension, which can be used, for example, to obtain a representation of the energy potential of the waves and, eventually, use such estimations in the design of electric generators.

The rest of the paper is organized as follows. In section 2 the detailed approach to the problem is presented, also showing a general description of the data used in this work. Section 3 describes the proposed methodology. In section 4, we show the results of the proposed methodology in an example where data obtained from marine buoys are used to obtain an estimate of the energy potential of sea waves. Finally, in Section 5 we present the conclusions.

2 Problem statement

To determine the energy potential of sea waves, measurements collected at maritime buoys are typically used. In our case, we had access to a database consisting of a historical record of hourly-performed measurements of several parameters including the wave’s significant height (H_m) and period (T). These parameters can be used to estimate the wave’s power by following the standard Salter-Falnes formulation. Specifically, the wave’s power density, J (W/m), can be calculated in terms of H_m and T as [16]: J = (ρg² H_m² T)/32π, where ρ is the density of sea water (1025 kg/m³), and g is the acceleration of gravity (9.81 m/s²). If a wave with power density J impinges on a rigid body with a single degree of freedom, the maximum power absorbed is P_max = (λ/2π) J, where λ is the depth-dependent wavelength (m). In our case, we worked with data collected from buoys at locations where the water is more than 300 m deep, therefore, the wavelength can be estimated as λ= (gT²)/2π.

To illustrate our approach, we use one-year long datasets. As we will discuss later, our methodology can be used with longer and shorter time series, as long as the data are enough to estimate the joint probability distribution of H_m and T. The database contains measurements recorded at the Campeche buoy, which is located at 22.203 N 94.000 W, where the water depth is more than 300 m. Figure 1(a) and 1(b) show the full time series H_m(t) and T(t), respectively, for an entire year of hourly recorded data (2014; 8760 data points) at the above-mentioned Campeche buoy. The results of the Salter-Falnes estimation of P_max are shown in Fig. 1(c) for the one-year long time series shown in panels (a)-(b). Basically, P_max was calculated for each (H_m, T) pair in the entire series recorded at the buoy over one year.

Fig. 1

Time series of (a) H_m(t) and (b) T(t) recorded hourly at the Campeche buoy in 2014 (8760 data points). (c) The corresponding wave’s power estimation using the Salter-Falnes model.

Figure 1(c) clearly illustrates the problem at hand: by using the raw series of (H_m, T) pairs from the database, the wave power estimations end up spanning over several orders of magnitude, thus making it difficult to use these results for design purposes. In general terms, like we discuss in the Introduction for other scenarios, this is because the fluctuations in the time series of each variable, H_m(t) and T(t), are amplified through the calculations due to the nonlinear dependence of the wave power on these parameters, P_max ∝ H_m²T³. This is further verified by calculating some representative values of the wave power data shown in Fig. 1(c). We calculated the mean to be 83.58 kW (±172.10 kW); the median (2nd quartile, Q₂, of the full series), 29.47 kW; and the rms power, 191.32 kW.

One can attempt to reduce the effects of this large variability by several means. For instance, in a first-order approach, one could first reduce each time series H_m(t) and T(t) to a single representative value e.g., mean, median, rms value, or largest-probability value, and then estimate P_max based on a single (H_m, T) pair. Unfortunately, this does not solve the problem. The estimation of P_max based on the mean values of H_m(t) and T(t) gives 39.63 kW; based on the median, 29.31 kW; based on the rms value, 53.41 kW; and, based on the value of largest probability in each marginal distribution, after fitting to a log-normal distribution, it gives 15.50 kW. As it can be noticed, using the raw data, and reducing the time series to a single value results in estimations with large variability which, in turn, would result in large over/under-estimations of the design considerations.

3 Methodology

An improvement over the previous approach is to model the joint probability distribution p(H_m, T) by recognizing that H_m(t) and T(t) are not independent random variables [17, 18]. This is true for the time series H_m(t) and T(t) explored in this work. In fact, the Pearson correlation coefficient between the year-long datasets shown in Fig. 1(a)-(b) is rather high, r = 0.8389, which tells that there exists a strong linear dependence between H_m and T, as verified in Fig. 2(a), where we plotted the entire collection of (H_m, T) pairs recorded over the course of one year. Fig. 2(a) shows basically the same data in Fig. 1(a)-(b) where the time dependence has been removed.

Fig. 2

(a) Distribution of the ordered pairs (H_m, T) recorded over the course of one year (raw data) and (b) the corresponding bivariate probability density function p(H_m, T). The red arrow indicates the contour taken to illustrate the proposed approach.

In Fig. 2(b) we show the corresponding probability distribution p(H_m, T) of the data in panel (a) both as a collection of equiprobability contours and as a continuous surface (inset). The values of p(H_m, T) were calculated using an adaptive kernel density estimation by following the work of Botev’s et al. [19]. This is basically a nonparametric probability density estimation that is useful for the case when the joint probability has an arbitrary shape.

In our approach, we essentially propose using p(H_m, T) of the raw data to generate a new series of random (H_m, T) pairs that are contained within an equiprobability contour of p(H_m, T) with a desired probability coverage. As we will show later, this approach is useful to generate a new time scale-independent series that preserves the probability of the measurements to the degree desired, and where the high variability due to the data fluctuations is suppressed. An overview of the proposed methodology is illustrated in Fig. 3. In the following, the steps involved are explained in detail.

Once p(H_m, T) is calculated, then an equiprobability contour is selected based on the desired coverage of p(H_m, T). For instance, the contour indicated in Fig. 2(b) covers around 80% of p(H_m, T). If a higher-valued contour is chosen, a smaller portion of p(H_m, T) will be covered but, in all cases, the region of largest probability is contained regardless of the contour selected. In the limit where the highest-valued contour is selected, only the (H_m, T) pair of largest probability will be accounted; conversely, in the limit when the lowest-valued contour is selected, the entire distribution will be considered.

In the next step, the contour selected is used as a closed boundary in the (H_m, T) plane for the generation of N random (H_m, T) pairs. Because we estimated p(H_m, T) based on the raw measurements, all the contours have irregular shapes that are different from one another (see Fig. 2(b)). To deal with this, we triangulate the region defined by the contour using a Matlab implementation of the standard Delaunay method [20], as shown in Fig. 4(a).

The algorithm for the generation of new (H_m, T) pairs based on the triangulation of the selected contour is explained as follows. For each triangle, we determined its area and associated a probability based on its relative fraction with respect to the total area. Using this area-based probability, we pick triangles randomly and generated a random number inside of it. In this way, the region of the (H_m, T) plane within the equiprobable contour selected is uniformly sampled. In practice, we did this in a fully vectorized manner to generate N samples in a single step. In Fig. 4(b) we show the result of the generation of N = 8760 samples, a new data series of the same length as the original data.

Fig. 3

Overview of the methodology proposed for the generation of noise-suppressed data series that preserve the statistics of the original data.

Fig. 4

(a) Standard Delaunay triangulation of an equiprobability contour of p(H_m, T) and (b) the corresponding series of (H_m, T) random pairs generated based on the proposed approach.

4 Results

The results of the methodology proposed are shown in Fig. 5, in the condensed form of a box plot representation. We plotted the statistics of the original measurements (Fig. 1) and those of the randomly generated shown in Fig. 4(b). The red markers indicate the outliers based on the standard criterion of the interquartile range (IQR = Q₃-Q₁), 1.5IQR<Outliers<3IQR, where Q₁ and Q₃ are the first and third quartile, respectively.

Fig. 5

Box plots of the raw measurements and the randomly generated data of (a) H_m, (b) T, and (c) P_max.

In Table 1, we show a comparison of the calculations obtained using the raw data and the proposed approach. Basically, we calculated P_max based on the Salter-Falnes formalism described above, when using a single (H_m, T) pair or the entire series, for both the raw data and the generated series. For instance, if one takes the mean of H_m(t) and the mean of T(t), the calculation of P_max gives 39.63 kW and 29.78 kW for the raw data and the generated series, respectively. If one performs the calculations over the entire series P_max(t) and analyze its probability distribution, then the mean value of P_max is 88.11 kW and 48.33 kW for the raw data and the generated series, respectively.

Table 1

Comparison between the power calculations using the raw data and the data generated using the proposed methodology

	Raw [kW]	Generated [kW]
Single (H_m, T) pair
Mean	39.63	29.78
Median (Q2)	29.31	29.68
RMS	53.41	37.14
Largest probability	15.50	11.43
P_max series
Mean	88.11	48.33
Median	29.42	28.74
RMS	212.66	70.19
Largest probability	88.11	48.33

As can be seen in Fig. 5 and Table 1, the outcome of our approach is evident: the large-amplitude fluctuations are suppressed while preserving the main features of the original statistics, as indicated by the three quartiles. A direct consequence is that the characteristics of the data series generated, where the noise due to extreme values has been reduced, can now be used for design purposes.

As mentioned before, the independent series H_m(t) and T(t) can be described by log-normal distributions. Because the extreme values are neglected, the tails of the distribution are effectively suppressed. Therefore, our approach can also be thought as an approximation to a quasi-Gaussian distribution. In other words, the generation of new (H_m, T) pairs within the contour selected preserves the bell shape of the original distribution with shorter tails, as illustrated in Fig. 5.

We should also note that the standard Delaunay triangulation deals with the contour as a convex polygon, as shown in Fig. 4(a). For our purposes, this is irrelevant, since only small portion of the points are generated outside the contour, as can be seen in Fig. 4(b). Nevertheless, in order to fully contain the N random samples, accurate two-dimensional triangulation methods are available for arbitrary, non-convex contours (see, for instance, [21]).

Finally, from the analysis presented, we noted that using the raw values for the wave power calculations tends to overestimate the values obtained. Conversely, if a single (H_m, T) pair is used, the calculations are underestimated. Our approach constitutes an intermediate solution. The new series of (H_m, T) pairs is confined to a region of p(H_m, T) where extreme outliers are excluded and, therefore, the fluctuations are suppressed as compared to the raw measurements, while keeping the main statistical features of the original data.

5 Conclusion

We present a simple methodology based on the use of joint probability distributions that allow suppressing large-amplitude fluctuations of the data, caused by the extreme points, while preserving the essential statistics. To illustrate our approach, we used measurements of sea wave’s significant height (H_m) and period (T). The core of the proposed methodology relies on modeling the joint probability distribution p(H_m, T), and then using an equiprobability contour to generate a new series of random (H_m, T) pairs that are contained within the contour selected. Importantly, the new series of random data is realistic since it was generated from p(H_m, T), which is based on the measurements. Moreover, the new data series is also independent of a time scale. In other words, if p(H_m, T) is assumed to be invariant, one can associated an arbitrary time scale to the randomly generated series. In a way, our methodology can be thought a statistical interpolation and it can make the new series generated useful, for instance, as the input of simulations that require finer time scale in comparison with the sampling time of the measurements. Finally, because large-amplitude fluctuations are suppressed, the randomly generated data can be useful. In the particular example taken in this work (sea wave power), the estimations based on the fluctuations-suppressed data can be useful for the design of sea wave generators.

References

Hall

and Llinas

, Multisensor data fusion: CRC press, 2001.

Blasch

, Kadar

, Salerno

, Kokar

M.M.

, Das

, Powell

G.M.

, et al., Issues and challenges of knowledge representation and reasoningmethods in situation assessment (Level 2 Fusion), in Signal Processing, Sensor Fusion, and Target Recognition XV, (2006), pp. 623510.

Roig

I.B.

, Algoritmos de detección distribuida en sistemas monosensor: Universidad Politecnica de Valencia (Spain), 2005.

Drakopoulos

and Lee

C.C.

, Optimumfusion of correlated local decisions, in Proceedings of the 27th IEEE Conference on Decision and Control (1988), pp. 2489–2494.

Atrey

P.K.

, Hossain

M.A.

, El Saddik

and Kankanhalli

M.S.

, Multimodal fusion for multimedia analysis: a survey, Multimedia Systems16 (2010), 345–379.

Yuksel

S.E.

, Wilson

J.N.

and Gader

P.D.

, Twenty years of mixture of experts, IEEE Transactions on Neural Networks and Learning Systems23 (2012), 1177–1193.

Ristaniemi

and Joutsensalo

, Advanced ICA-based receivers for block fading DS-CDMA channels, Signal Processing82 (2002), 417–431.

, Kopriva

and Szu

H.H.

, Independent-component analysis for hyperspectral remote sensing imagery classification, Optical Engineering45 (2006), 017008.

Nuzillard

and Bijaoui

, Blind source separation and analysis of multispectral astronomical images, Astronomy and Astrophysics Supplement Series147 (2000), 129–138.

10.

Chen

and Wang

, A new approach to near-infrared spectral data analysis using independent component analysis, Journal of Chemical Information and Computer Sciences41 (2001), 992–1001.

11.

Kopriva

, Du

, Szu

and Wasylkiwskyj

, Independent component analysis approach to image sharpening in the presence of atmospheric turbulence, Optics Communications233 (2004), 7–14.

12.

Araki

, Makino

, Aichner

, Nishikawa

and Saruwatari

, Subband-based blind separation for convolutive mixtures of speech, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences88 (2005), 3593–3603.

13.

Lambert

R.H.

, Blind deconvolution of multipath mixtures, Unsupervised Adaptive Filtering1 (2000), 377–436.

14.

Kariniotakis

, Renewable Energy Forecasting: From Models to Applications: Woodhead Publishing, 2017.

15.

Twidell

and Weir

, Renewable energy resources: Routledge, 2015.

16.

Falnes

, A review of wave-energy extraction, Marine structures20 (2007), 185–201.

17.

Mendes

R.P.G.

, Calado

and Mariano

, Wave energy potential in Portugal– Assessment based on probabilistic description of ocean waves parameters, Renewable energy47 (2012), 1–8.

18.

Liu

, Chen

, Jiang

, Fu

, Wang

and Jiang

, Double entropy joint distribution function and its application in calculation of design wave height, Entropy21 (2019), 64.

19.

Botev

Z.I.

, Grotowski

J.F.

and Kroese

D.P.

, Kernel density estimation via diffusion, The annals of Statistics38 (2010), 2916–2957.

20.

Persson

P.-O.

and Strang

, A simple mesh generator in MATLAB, SIAM review46 (2004), 329–345.

21.

Engwirda

, Locally optimal Delaunay-refinement and optimisation-based mesh generation, 2014.