Performance of matrix completion approaches for aquaponics data

Abstract

Technological innovations in the Internet of Things (IoT) have resulted in smart agricultural solutions such as a remotely monitored Aquaponics system and a wireless sensor network (WSN) of such systems (nodes). IoT enables continuous sensing of temperature and pH data at each node of the WSN, which is periodically transmitted to a remote fusion centre. The data matrices acquired at the fusion centre often suffer from missing data, owing to the typical wireless multipath fading environment, sensor malfunctions and node failures. This paper explores the applicability of different matrix completion approaches for missing data reconstruction. Specifically, the performance of the baseline predictor, correlation based approaches such as the baseline predictor with temporal model and k-nearest neighbors (kNN), and low rank based approaches such as Sparsity Regularized Singular Value Decomposition (SRSVD) and Augmented Lagrangian Sparsity Regularized Matrix Factorization (ALSRMF) is explored. Reliable temperature and pH data for 19 independent acquisition hours with 60 samples per hour are acquired at the fusion centre via Ultra High Frequency (UHF) transmission at 470 MHz and suitable pre-processing. Different data integrity scenarios are simulated, and the reconstruction error plots for each of these matrix completion approaches are extracted. A hybrid of kNN and the baseline predictor with temporal model rendered a Mean Absolute Percentage Error (MAPE) of 1.75% for temperature and 0.86% for pH, at 0.5 data integrity. Further, with ALSRMF, which exploits the low rank constraint, the error reduced to 1.25% for temperature and 0.7% for pH, indicating a promising approach for Aquaponics system data reconstruction.

Keywords

Augmented Lagrangian Sparsity Regularized Matrix Factorization; baseline approach; k-nearest neighbors; low rank matrix completion; Sparsity Regularized Singular Value Decomposition

1. Introduction

The burgeoning developments in Internet of Things (IoT), Machine to Machine (M2M) communications and Wireless Sensor Networks (WSNs) have been welcomed in a variety of applications such as smart agricultural solutions [3,8], health care systems [1], intelligent transportation services and many more. A WSN comprises several nodes which continuously monitor a desired geographical region for physical quantities such as carbon dioxide, temperature, radiation etc., that are wirelessly transmitted to a remote server or a fusion centre [11]. Large amounts of such stored historical data play a crucial role in predicting and providing useful inferences, and aid in decision making.

A smart WSN based agricultural solution, which is well suited to a smart city scenario, is a remotely and continuously monitored network of Aquaponics systems or gardens [18], geographically spread over a commercial complex or a residential community. IoT enables each Aquaponics system (node) to continuously sense the system health parameters such as temperature and pH. These data values are periodically transmitted to the fusion centre, which processes and stores the received data. For efficient spectrum utilization, the wireless communication link between the nodes and the fusion centre can be established as a cognitive link on any underutilized Ultra High Frequency (UHF) band such as the TV band [13].

Aquaponics are smart food production systems, where plants and aquatic life form an eco-friendly environment. More specifically, the system comprises a grow bed for plants and a fish tank. Through the circulating waters, the fish waste reaches the grow bed, where beneficial bacteria convert ammonia into nitrites and in turn to nitrates, which are absorbed by the plants [10]. The oxygenated water recirculates to the fish tank, forming a symbiotic environment. These systems reduce water wastage and are very much suited in arid areas with water scarcity. In addition to small scale gardening, these systems can also be deployed in large scale farming [21]. Solutions offering remote health monitoring of such systems save a lot of man hours, and are deemed to be opportune in a smart city lifestyle. Such smart systems encourage more and more families to install such gardens, further ensuring a greener urban environment.

We consider an Aquaponics WSN with one node and a fusion centre, where periodical wireless data transmission happens from the node to the fusion centre. Accompanied by sensor malfunctions, power interruptions and system failures, multipath fading of the wireless link can result in severe data losses [1]. Hence the time series data collected at the fusion centre can have irregular vacancies or missing values. With the emergence and vast growth of compressive sensing theoretical concepts and applications [14], many research works started addressing acquisition or collection of few data points rather than all to reduce the storage problems. It is interesting to note that missing values may also arise if the node adopts a random transmission policy for resource saving, where the node is active or asleep with a pre-selected transmit probability [26].

The focus of this work is to interpret/estimate the missing data values using suitable reconstruction approaches and study their performance. This necessitates the availability/acquisition of actual and error-free data at the fusion centre. Such reliable temperature and pH data are acquired at the fusion centre transmitted from the node via UHF wireless link at 470 MHz, using the system described in [17]. The data is acquired for 19 independent acquisition hours with 60 samples per hour, collected in a span of 8 days distributed over summer and winter seasons. Deliberately, each data value is received multiple times at the fusion centre and then pre-processed suitably to avoid data loss. These reliably acquired temperature and pH data are arranged as two different 19 × 60 matrices, which form the actual data. On these data matrices, random missing data values are simulated corresponding to different data integrity values. For instance, to simulate a 0.4 data integrity, 60% of the values are uniformly removed from the data matrices. Different retrieval approaches are employed to estimate the missing data values, some of which are discussed in the following section. Each retrieval approach is assessed using the reconstruction error versus data integrity plots.
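The simulation of a given data integrity can be sketched as follows in Python with NumPy. This is only an illustrative sketch: the function name `simulate_missing`, the random seed and the synthetic stand-in matrix are assumptions made here and do not reproduce the acquired data. Removed entries are marked by a 0 in an explicit mask, so that a 0.4 data integrity keeps 40% of the 19 × 60 values.

```python
import numpy as np

def simulate_missing(data, integrity, rng):
    """Keep a fraction `integrity` of the entries, chosen uniformly at random.

    Returns (observed, mask), where mask is 1 for kept entries and 0 for
    removed ones, and observed has zeros at the removed positions.
    """
    m, n = data.shape
    total = m * n
    n_keep = int(round(integrity * total))
    keep_idx = rng.choice(total, size=n_keep, replace=False)
    mask = np.zeros(total, dtype=int)
    mask[keep_idx] = 1
    mask = mask.reshape(m, n)
    return data * mask, mask

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 19 x 60 temperature matrix (not real data).
actual = 25.0 + rng.normal(0.0, 0.5, size=(19, 60))
observed, mask = simulate_missing(actual, 0.4, rng)
print(mask.sum() / mask.size)  # fraction of entries kept
```

Keeping the mask separate from the data avoids confusing a removed entry with a genuine measurement that happens to be zero.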

The rest of the paper is organized as follows. Section 2 addresses and discusses the related works from the available literature. Section 3 presents the details of the data matrices collected from the Aquaponics system, and the various matrix completion approaches considered. Section 4 presents the results and corresponding discussions, and the paper subsequently ends in Section 5.

2. Related work

Missing values in the acquired data matrices make them incomplete, thus giving rise to the problem of matrix completion. This problem is well-explored in recommender systems, which are the systems or websites that suggest different things to the users by monitoring a variety of factors [6]. These systems basically predict or interpret the missing values in a matrix. For instance, by reconstructing items versus user preference data, matrix completion approaches can help in predicting user preferences and thus aid in developing item recommendations for each user [16]. Similarly, movies can be suggested to users as in a Netflix recommender system. Different users rate the movies differently and sometimes the users do not rate at all. These missing ratings in a user-movie matrix can be interpreted from available ratings via baseline predictors [12], where prediction of missing values is cast as an overdetermined system of equations. In contrast, another approach known as baseline approximation formulates missing value estimation as a minimization problem with regularization, which is iteratively solved using the alternating least squares method. This approach is applied to reconstruct internet traffic matrices [20].

The k-nearest neighbor (kNN) is a memory-based matrix completion approach which predicts the missing values by identifying k most similar rows or columns, extracted via similarity measures such as Pearson correlation coefficient [2]. kNN is also employed for matrix completion in an indoor localization problem, for estimating unknown received signal strength [23]. Further, it is used for predicting drug associated disease indications [27] and also for traffic data matrix reconstruction [4]. It is interesting to note that kNN is a local interpolation approach and is suitable only when the data integrity is high [9].

A neighborhood based interpolation approach is used to estimate ratings in a user-item rating data matrix, where a multi-kernel function is combined with the similarity measure to provide more accurate weights [5]. A multi-Gaussian model, exploiting maximum a posteriori estimation via spectral clustering, has been employed for traffic matrix reconstruction [28], where the most similar rows are first clustered based on the eigenvalues calculated from the affinity matrix. A principal component analysis (PCA) based matrix completion approach, further extended to a two phase recovery scheme, is developed for the recovery of corrupted and lossy sensor data [25] obtained from 196 weather sensors in Zhu Zhou, China.

Unlike the above, there is another class of matrix completion approaches which assume that the required matrix has a low rank structure and hence proceed by solving a rank minimization or nuclear norm minimization, while also incorporating matrix factorization. In this context, some of the works available in the literature are discussed as follows. Considering the missing data scenario as a compressively sensed data scenario, an orthogonal rank-1 matching pursuit, which is a sparse recovery approach, is devised for matrix completion [19]. This approach, which has rank as its only tunable parameter, is tested on a large scale movie dataset. As most real world data sets may not satisfy the conditions assumed by compressive sensing algorithms [24], approaches based on matrix factorization and SVD, such as the Sparsity Regularized Singular Value Decomposition (SRSVD) approach, are devised incorporating nuclear norm minimization. The SRSVD is applied for internet traffic matrix reconstruction [20], where regularization acts as a tradeoff between exact fit of the measurement data and the low rank nature of the data matrix. An alternating least squares (ALS) method is used to solve the minimization problem, where each of the regularized factored matrices is alternately estimated by fixing the other in each iteration. Note that ALS is only one possible solution method. Hence, instead of ALS, the same minimization problem is solved using matrix inversion and genetic approaches [15]. Further, to render a faster convergence, stochastic gradient can also be used alongside the ALS solution method [7].

The rank minimization of SRSVD does not assume any similarity constraints or special properties of the data matrix during optimization. By accounting for such spatial and temporal properties, another approach named the Sparsity Regularized Matrix Factorization (SRMF) is formulated as a constrained optimization problem [20]. In addition to regularization on the factored matrices as in SRSVD, this approach includes the effect of similarities in the optimization, via spatial and temporal constraint matrices. Similar to the SRSVD case, this constrained optimization problem can be solved using the ALS method [20]. Note that the constraint matrices can be chosen a priori by mining the dependencies among the data elements or any structures present in the data elements due to factors like time or geographical distances. Replacing the ALS solution of the SRMF approach by an augmented Lagrangian based solution method, the Augmented Lagrangian Sparsity Regularized Matrix Factorization (ALSRMF) approach has been developed for road traffic matrix reconstruction [22], which rendered better performance than the SRMF approach.

Owing to the fact that success of a matrix completion approach depends on the application at hand, in this work we propose to investigate the suitability of various matrix completion approaches for Aquaponics data matrix completion. Hybrid approaches, which are a combination of the neighborhood models and the low rank approaches can also be suited for matrix completion. Hence, this paper studies the reconstruction performance of different matrix completion approaches based on neighborhood models and also the low rank optimization approaches for Aquaponics data matrix completion.

3. Data matrices and matrix completion approaches

In the following sections, bold faced lower case letters indicate vectors, bold faced capital letters indicate matrices, and lower case normal letters are used for scalar quantities. $‖ V ‖_{F}^{2}$ is the squared Frobenius norm of a matrix V, $‖ v ‖^{2}$ is the squared Euclidean norm of a vector v and $(V_{1} . V_{2})$ is the Hadamard (element-wise) product of the two matrices $V_{1}$ , $V_{2}$ .

3.1. Data matrices

Consider the Aquaponics WSN with one node and a fusion centre. The node senses the temperature and pH data every minute and transmits to the fusion centre, where they get stored. Data is collected for a total of 19 independent acquisition hours, each comprising 60 samples, resulting in 19 different temperature and pH time series datasets. These data are arranged as two matrices (temperature and pH) of size 19 × 60. A practical missing value scenario is depicted in Fig. 1, where A1 to A4 indicate the acquisition hours and S1 to S5 indicate the time series samples in each acquisition hour.

Fig. 1.

Typical data matrix from aquaponics system.

To be able to investigate the performance of different matrix completion approaches, first the actual data is acquired by sending and receiving each data value multiple times. At the fusion centre, the multiple received entries are preprocessed to obtain a single data value, as in [17]. Further, accounting for data corruption due to practical sensor noise [9], the data values in each acquisition hour are passed through a moving average filter, to suppress the noise via averaging. The resultant data is shown in Fig. 2 for temperature and pH respectively. From Fig. 2, note that the temperature time series of acquisition hours 16 and 18 possess certain similarities, while those of acquisition hours 1, 4 and 7 are also similar to one another. Similarly, the pH time series of acquisition hours 4, 7, 14, 16 and 18 appear to be significantly correlated. Correlations among the days and among the time samples are computed using Pearson correlation coefficients and the corresponding Cumulative Distribution Function (CDF) plots are shown in Fig. 3. Plots related to temperature data show that around 20% of the days have more than 40% correlation and all the time samples exhibit more than 50% correlation. Plots related to pH data show that around 20% of the days have more than 25% correlation, while 70% of the time samples exhibit more than 50% correlation. Thus both the temperature and pH time series data possess significant correlations, providing scope for retrieving missing values via different neighborhood and low rank based matrix completion approaches, which incorporate similarity measures or constraints as discussed in Section 2.
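One way to obtain such a CDF is sketched below in Python with NumPy. How the coefficients are aggregated into the reported percentages is not spelled out in the text, so this sketch simply assumes that the empirical CDF is taken over all distinct pairs of rows (or columns); the function name `correlation_cdf` and the toy matrix are illustrative only.

```python
import numpy as np

def correlation_cdf(data, axis=0):
    """Empirical CDF of pairwise Pearson coefficients.

    axis=0 correlates rows (acquisition hours) with each other,
    axis=1 correlates columns (time samples) with each other.
    Returns the sorted coefficients and their cumulative probabilities.
    """
    variables = data if axis == 0 else data.T
    corr = np.corrcoef(variables)            # each row is one variable
    upper = np.triu_indices_from(corr, k=1)  # distinct pairs only
    values = np.sort(corr[upper])
    cdf = np.arange(1, values.size + 1) / values.size
    return values, cdf

# Toy 3 x 4 matrix: rows 0 and 1 move together, row 2 moves oppositely.
toy = np.array([[1.0, 2.0, 3.0, 4.0],
                [2.0, 4.0, 6.0, 8.0],
                [4.0, 3.0, 2.0, 1.0]])
values, cdf = correlation_cdf(toy, axis=0)
```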

In this regard, the temperature/pH data matrix is represented as $\begin{matrix} (1) & D = \left[ \begin{array}{cccc} d (a_{1}, s_{1}) & d (a_{1}, s_{2}) & \dots & d (a_{1}, s_{N}) \\ d (a_{2}, s_{1}) & d (a_{2}, s_{2}) & \dots & d (a_{2}, s_{N}) \\ d (a_{3}, s_{1}) & d (a_{3}, s_{2}) & \dots & d (a_{3}, s_{N}) \\ ⋮ & ⋮ & ⋮ & ⋮ \\ d (a_{M}, s_{1}) & d (a_{M}, s_{2}) & \dots & d (a_{M}, s_{N}) \end{array} \right] \end{matrix}$ where $\{ a_{i} \}_{i = 1}^{M}$ are the M acquisition hours and $\{ s_{j} \}_{j = 1}^{N}$ are the N samples in each acquisition hour. Let $C = \{ (a_{i}, s_{j}) ∣ d (a_{i}, s_{j}) \neq 0 \}$ indicate the set of all available values and A indicate the total number of available values, so that the data integrity becomes $(A / MN)$ . Two neighborhood models, namely the baseline predictor and kNN, along with the low rank approaches, namely SRSVD and ALSRMF, are investigated for Aquaponics system data reconstruction, and are discussed in the following sub sections.

Fig. 2.

Temperature and pH time series of few acquisition hours.

Fig. 3.

CDF of Pearson correlation coefficients.

3.2. Baseline approach

In this approach, each available data value $d (a_{i}, s_{j})$ is expressed as a sum of three unknown biases, the overall matrix bias ( $b_{m}$ ), row bias ( $b_{a_{i}}$ ) and column bias ( $b_{s_{j}}$ ) as in $\begin{matrix} (2) & d (a_{i}, s_{j}) = b_{m} + b_{a_{i}} + b_{s_{j}} \end{matrix}$

Accordingly, observe that there are a total of ( $M + N + 1$ ) unknown biases and A equations. The value $b_{m}$ is estimated as the average of the A available data values, following which (2) is rewritten as $\begin{matrix} (3) & b_{a_{i}} + b_{s_{j}} = d (a_{i}, s_{j}) - b_{m} \end{matrix}$ for a total of A equations. These equations can be expressed as a system of linear equations, which tends to be overdetermined, as $\begin{matrix} (4) & r = Gb \end{matrix}$

In (4), the known vector r comprises the $d (a_{i}, s_{j}) - b_{m}$ values for all indices in C, G is the known matrix comprising zeroes and ones and b is the vector of all the unknown row and column biases. Solving the least squares problem ${min}_{b} ‖ r - Gb ‖^{2}$ , b is obtained via the Moore-Penrose pseudoinverse as $b = {(G^{T} G)}^{- 1} G^{T} r$ . Correspondingly, all the unknown values are predicted using (2).
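A minimal Python sketch of this baseline predictor is given below, assuming NumPy and an explicit 0/1 availability mask; the function name `baseline_predict` is hypothetical. Since every row of G holds exactly one row-bias indicator and one column-bias indicator, $G^{T} G$ is singular, so the sketch relies on `numpy.linalg.lstsq`, which returns the minimum-norm least squares solution, rather than forming the inverse explicitly. This is a choice made for the sketch; the predicted sums $b_{a_{i}} + b_{s_{j}}$ do not depend on it when the available entries link all rows and columns.

```python
import numpy as np

def baseline_predict(observed, mask):
    """Fit d(a_i, s_j) = b_m + b_ai + b_sj on available entries and
    use the fitted biases to fill the unavailable ones."""
    m, n = observed.shape
    rows, cols = np.nonzero(mask)        # indices in the set C
    values = observed[rows, cols]
    b_m = values.mean()                  # overall matrix bias
    r = values - b_m                     # right-hand side of (3)
    # G has one equation per available value: a 1 for its row bias
    # and a 1 for its column bias.
    G = np.zeros((values.size, m + n))
    eq = np.arange(values.size)
    G[eq, rows] = 1.0
    G[eq, m + cols] = 1.0
    b, *_ = np.linalg.lstsq(G, r, rcond=None)
    estimate = b_m + b[:m, None] + b[None, m:]
    # Keep measured values, fill only the gaps.
    return np.where(mask == 1, observed, estimate)

# Toy check on an exactly additive matrix with three removed entries.
row_bias = np.array([0.0, 1.0, -0.5, 2.0])
col_bias = np.array([0.3, -0.2, 0.0, 0.7, 1.1])
toy = 20.0 + row_bias[:, None] + col_bias[None, :]
toy_mask = np.ones((4, 5), dtype=int)
toy_mask[0, 0] = toy_mask[1, 2] = toy_mask[3, 4] = 0
filled = baseline_predict(toy * toy_mask, toy_mask)
```

On real temperature or pH data the additive model is only an approximation, so the filled values would carry a residual error that the temporal model described next tries to reduce.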

The baseline predictor with temporal model accounts for similarities among the columns or rows or both, in addition to the fixed row and column biases. Accordingly, each available data value $d (a_{i}, s_{j})$ is expressed as $\begin{matrix} (5) & d (a_{i}, s_{j}) = b_{m} + b_{a_{i}} + b_{s_{j}} + b_{s_{j}} (t) + b_{a_{i}} (t) \end{matrix}$ where $b_{s_{j}} (t)$ , $b_{a_{i}} (t)$ are the time varying unknown biases corresponding to column $s_{j}$ and row $a_{i}$ . For all the data values of the indices in C, the known biases obtained from the baseline approach are subtracted from (5) to obtain expressions similar to (3), called innovations: $\begin{matrix} (6) & q (a_{i}, s_{j}) = b_{s_{j}} (t) + b_{a_{i}} (t) \end{matrix}$

Similarities between any two columns, say, $s_{a}$ , $s_{b}$ can be extracted using the corresponding innovation vectors ${\tilde{q}}_{a}$ , ${\tilde{q}}_{b}$ comprising the innovation values available in both the columns. If a particular innovation value is available in only one column, it is not treated as an entry in the innovation vector. The similarity index is then calculated as $\begin{matrix} (7) & d_{s_{a}, s_{b}} = \frac{{\tilde{q}}_{a}^{T} {\tilde{q}}_{b}}{‖ {\tilde{q}}_{a} ‖ ‖ {\tilde{q}}_{b} ‖} \end{matrix}$
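The similarity index in (7) can be sketched in Python as a cosine similarity restricted to the rows where both columns have an innovation value. The function name `column_similarity`, the 0/1 mask representation and the convention of returning 0 when the index is undefined are assumptions of this sketch.

```python
import numpy as np

def column_similarity(innov, mask, a, b):
    """Similarity index (7) between columns a and b, using only the
    rows in which both columns have an available innovation."""
    both = (mask[:, a] == 1) & (mask[:, b] == 1)
    q_a = innov[both, a]
    q_b = innov[both, b]
    denom = np.linalg.norm(q_a) * np.linalg.norm(q_b)
    # Assumed convention: no common rows or a zero vector gives 0.
    return float(q_a @ q_b / denom) if denom > 0 else 0.0

# Toy innovations: the last row of column 1 is unavailable and ignored.
innov = np.array([[1.0, 2.0, -1.0],
                  [2.0, 4.0, -2.0],
                  [3.0, 6.0, -3.0],
                  [100.0, -5.0, 7.0]])
avail = np.ones((4, 3), dtype=int)
avail[3, 1] = 0
sim_01 = column_similarity(innov, avail, 0, 1)
```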

The predictions of innovations are then calculated as $\begin{matrix} (8) & \hat{q} (a_{i}, s_{b}) = \frac{d_{s_{a}, s_{b}} * q (a_{i}, s_{a}) + d_{s_{b}, s_{c}} * q (a_{i}, s_{c})}{| d_{s_{a}, s_{b}} | + | d_{s_{b}, s_{c}} |} \end{matrix}$

Adding back the fixed row, column and matrix biases to all such innovation estimates in (8), the corresponding unknown values of D can be obtained.

3.3. kNN approach

This neighborhood approach first assumes the number of neighbors, k. Consider the unavailable data value $d (a_{i}, s_{j})$ . The k nearest neighbors of $s_{j}$ , are grouped as set N, by computing the similarity coefficient $sim (s_{j}, x)$ , of column $s_{j}$ with every other column x, similar to the Pearson’s correlation coefficient, but by using only available entries in both the columns. Correspondingly, the unknown value is estimated as $\begin{matrix} (9) & \hat{d} (a_{i}, s_{j}) = \frac{\sum_{x \in N} d (a_{i}, x) sim (s_{j}, x)}{\sum_{x \in N} sim (s_{j}, x)} \end{matrix}$

Note that as data integrity reduces, the number of computations increases in a kNN approach. As the data integrity reduces, similarity coefficient $sim (s_{j}, x)$ becomes less accurate as the number of available values decreases. However, if the baseline predictor is used as an initial estimate to compute the similarity coefficients, the accuracy of kNN approach may improve, which is also explored in this study.
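The estimate in (9) can be sketched in Python as follows. Several details are assumptions of this sketch and not statements about the implementation used in this study: the function names, the explicit 0/1 mask, the restriction of candidate neighbors to columns that have a value in row $a_{i}$, and the return of NaN when no usable neighbor exists.

```python
import numpy as np

def pearson_available(x, y, mask_x, mask_y):
    """Pearson coefficient over entries available in both vectors."""
    both = (mask_x == 1) & (mask_y == 1)
    if both.sum() < 2:
        return 0.0
    xc = x[both] - x[both].mean()
    yc = y[both] - y[both].mean()
    denom = np.linalg.norm(xc) * np.linalg.norm(yc)
    return float(xc @ yc / denom) if denom > 0 else 0.0

def knn_estimate(observed, mask, i, j, k=5):
    """Estimate the unavailable d(a_i, s_j) from the k most similar
    columns, weighted by similarity as in (9)."""
    candidates = []
    for x in range(observed.shape[1]):
        if x == j or mask[i, x] == 0:
            continue  # a neighbor must have a value in row i
        sim = pearson_available(observed[:, j], observed[:, x],
                                mask[:, j], mask[:, x])
        candidates.append((sim, x))
    candidates.sort(reverse=True)        # most similar first
    top = candidates[:k]
    weight = sum(sim for sim, _ in top)
    if not top or weight == 0:
        return float("nan")
    return sum(sim * observed[i, x] for sim, x in top) / weight

# Toy matrix: columns 1 and 2 follow column 0, column 3 runs opposite.
toy = np.array([[1.0, 1.0, 1.0, 5.0],
                [2.0, 2.0, 2.0, 4.0],
                [3.0, 3.0, 3.0, 3.0],
                [4.0, 4.0, 4.0, 2.0],
                [5.0, 5.0, 5.0, 1.0]])
toy_mask = np.ones((5, 4), dtype=int)
toy_mask[2, 0] = 0
estimate = knn_estimate(toy * toy_mask, toy_mask, i=2, j=0, k=2)
```

To use the baseline predictor as the initial estimate, as suggested above, the similarities could be computed on the baseline-filled matrix with an all-ones mask, while the weighted sum still uses only measured values.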

3.4. SRSVD approach

Low rank approaches assume D to be a low rank matrix. Owing to the underlying physical processes, the elements in D can be treated as correlated, as seen from Fig. 2, and hence the low rank assumption is justified. Let $D \in R^{M \times N}$ be the incomplete data matrix, where unavailable data entries are represented as 0’s. Let M be the indicator matrix of the same size as D, comprising 1’s wherever the data entry in D is available and 0’s wherever it is not. Let the unknown but complete matrix be X, and note that $M . X = D$ . Low rank minimization proceeds as $\begin{matrix} (10) & min_{X} rank (X) \ s.t. \ M . X = D \end{matrix}$ Following the SVD $X = U Σ V^{T}$ and further expressing $X = L^{T} R$ as in [20], (10) can be expressed as $\begin{matrix} (11) & min_{L, R} rank (L^{T} R) \ s.t. \ M . (L^{T} R) = D \end{matrix}$ Equivalently, following the steps in [20], (11) can be expressed as $\begin{matrix} (12) & min_{L, R} ‖ L ‖_{F}^{2} + ‖ R ‖_{F}^{2} \ s.t. \ M . (L^{T} R) = D \end{matrix}$ and the corresponding Lagrangian becomes $\begin{matrix} (13) & L (λ) = {‖ M . (L^{T} R) - D ‖}_{F}^{2} + λ (‖ L ‖_{F}^{2} + ‖ R ‖_{F}^{2}) \end{matrix}$ where λ is the regularization parameter. Applying the ALS solution method to minimize (13), the steps of the iterative approach, named SRSVD, to obtain L and R are given in Table 1, with the derivations shown in the Appendix. After convergence, the complete matrix is constructed using $\hat{X} = L^{T} R$ .

Table 1
SRSVD using ALS solution method

Steps of SRSVD Approach

1. Input: D, k, λ

2. Initialize: Random matrices $L \in R^{k \times M}$ , $R \in R^{k \times N}$ and compute $X^{(0)} = L^{T} R$

3. for iteration $n = 1, 2, \dots$

a. for each $i \in [1, M]$

i) Let $C_{i} = \{ j ∣ d (i, j) \neq 0 \}$ , $j \in [1, N]$ , where $d (i, j)$ is the $i^{th}$ row, $j^{th}$ column value of D.

ii) Form column vector $d_{i}$ with $i^{th}$ row elements of D corresponding to column indices $C_{i}$

iii) In R, retain columns with indices $C_{i}$ to form $R^{'}$

iv) Calculate $l_{i} = {(R^{'} R^{' T} + λ I_{k})}^{- 1} R^{'} d_{i}$ , where $I_{k}$ is the identity matrix of size k.

b. end

c. Update $L = [l_{1} l_{2} \dots l_{M}]$

d. for each $j \in [1, N]$

i) Let $C_{j} = \{ i ∣ d (i, j) \neq 0 \}$ , $i \in [1, M]$

ii) Form column vector $d_{j}$ with $j^{th}$ column elements of D corresponding to row indices $C_{j}$

iii) In L, retain columns with indices $C_{j}$ to form $L^{'}$

iv) Calculate $r_{j} = {(L^{'} L^{' T} + λ I_{k})}^{- 1} L^{'} d_{j}$

e. end

f. Update $R = [r_{1} r_{2} \dots r_{N}]$

g. Update $X^{(n)} = L^{T} R$

4. end when $‖ X^{(n)} - X^{(n - 1)} ‖ ⩽ ε (= 10^{- 10})$

5. Output: $\hat{X} = X^{(n)}$
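The steps of Table 1 can be sketched in Python with NumPy as follows. This is an illustrative sketch and not the MATLAB implementation used for the reported results: the function name `srsvd`, the explicit 0/1 mask in place of the $d (i, j) \neq 0$ test, the iteration cap `n_iter` and the random seed are assumptions. The linear systems are solved with `numpy.linalg.solve` rather than by forming the inverse.

```python
import numpy as np

def srsvd(D, mask, k, lam, n_iter=200, tol=1e-10, seed=0):
    """ALS iterations of Table 1; returns X_hat = L^T R,
    with L of size k x M and R of size k x N."""
    rng = np.random.default_rng(seed)
    M, N = D.shape
    L = rng.standard_normal((k, M))
    R = rng.standard_normal((k, N))
    X_prev = L.T @ R
    I_k = np.eye(k)
    for _ in range(n_iter):
        # Steps a-c: update each column l_i of L with R fixed.
        for i in range(M):
            C_i = np.nonzero(mask[i, :])[0]
            R_sub = R[:, C_i]
            L[:, i] = np.linalg.solve(R_sub @ R_sub.T + lam * I_k,
                                      R_sub @ D[i, C_i])
        # Steps d-f: update each column r_j of R with L fixed.
        for j in range(N):
            C_j = np.nonzero(mask[:, j])[0]
            L_sub = L[:, C_j]
            R[:, j] = np.linalg.solve(L_sub @ L_sub.T + lam * I_k,
                                      L_sub @ D[C_j, j])
        # Step g and the stopping rule of step 4.
        X = L.T @ R
        if np.linalg.norm(X - X_prev) <= tol:
            break
        X_prev = X
    return L.T @ R

# Toy rank-1 matrix with four removed entries.
u = np.arange(1.0, 7.0)
v = np.arange(1.0, 5.0, 0.5)
toy = np.outer(u, v)
toy_mask = np.ones(toy.shape, dtype=int)
for r_idx, c_idx in [(0, 1), (2, 5), (4, 0), (5, 7)]:
    toy_mask[r_idx, c_idx] = 0
X_hat = srsvd(toy * toy_mask, toy_mask, k=1, lam=1e-6)
```

For λ > 0 each system matrix is positive definite, so the solves remain well defined even for a row or column with no available entries.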


3.5. ALSRMF approach

The optimization in (12) does not include any constraints that reflect the row/column interdependencies among the data elements in D. Modifying (13) to include such latent similarities through a proper choice of constraint matrices, a modified low rank minimization proceeds via $\begin{matrix} (14) & min_{L, R} {‖ M . (L^{T} R) - D ‖}_{F}^{2} + λ (‖ L ‖_{F}^{2} + ‖ R ‖_{F}^{2}) + α {‖ S (L^{T} R) ‖}_{F}^{2} + β {‖ (L^{T} R) T ‖}_{F}^{2} \end{matrix}$ where $S \in R^{M \times M}$ is the row similarity matrix, $T \in R^{N \times (N - 1)}$ captures the column similarities and α, β are the balancing parameters. Note that both S and T can be chosen based on the problem at hand. Along similar lines to the SRSVD, the ALS solution method can be applied to solve (14), which is termed the SRMF approach [20]. Alternatively, instead of solving (14) directly by the ALS solution method, it can be restated by introducing auxiliary variables and proceeding through the Alternating Direction Method of Multipliers (ADMM) and augmented Lagrangian solution method. This procedure to solve (14) is termed the ALSRMF approach [22].
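To make the role of each term in (14) concrete, the objective can be evaluated with a few lines of Python. The sketch below only evaluates the cost for given factors and is not a solver. The specific form of T used here, a first-difference Toeplitz matrix so that the temporal term penalizes differences between adjacent samples, is an assumption of this sketch, as are the function names.

```python
import numpy as np

def temporal_toeplitz(n):
    """N x (N-1) matrix T with (X @ T)[:, j] = X[:, j] - X[:, j + 1].
    Assumed first-difference structure relating adjacent samples."""
    T = np.zeros((n, n - 1))
    idx = np.arange(n - 1)
    T[idx, idx] = 1.0
    T[idx + 1, idx] = -1.0
    return T

def srmf_objective(L, R, D, mask, S, T, lam, alpha, beta):
    """Value of the cost in (14) for factors L (k x M) and R (k x N)."""
    X = L.T @ R
    fit = np.linalg.norm(mask * X - D) ** 2            # data fidelity
    reg = lam * (np.linalg.norm(L) ** 2 + np.linalg.norm(R) ** 2)
    row_term = alpha * np.linalg.norm(S @ X) ** 2      # row similarity
    col_term = beta * np.linalg.norm(X @ T) ** 2       # column similarity
    return fit + reg + row_term + col_term

# Toy case: a constant matrix has no adjacent-sample differences.
L0 = np.ones((1, 2))
R0 = np.ones((1, 3))
D0 = np.ones((2, 3))
cost = srmf_objective(L0, R0, D0, np.ones((2, 3)), np.zeros((2, 2)),
                      temporal_toeplitz(3), lam=0.1, alpha=1.0, beta=10.0)
```

In this toy case only the regularization term is non-zero, which illustrates that α and β add cost only when the estimate violates the assumed row or column structure.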

Consider the auxiliaries $Q = L^{T} R$ , $B = ZT$ , $Z = R$ and $C = L^{T} B$ . Accordingly, (14) can be restated as $\begin{matrix} (15) & \begin{aligned} & min_{L, R, Q, Z, C, B} {‖ M . (Q) - D ‖}_{F}^{2} + λ (‖ L ‖_{F}^{2} + ‖ R ‖_{F}^{2}) + α ‖ SQ ‖_{F}^{2} + β ‖ C ‖_{F}^{2} \\ & s.t. \ Q = L^{T} R, \ B = ZT, \ Z = R, \ C = L^{T} B \end{aligned} \end{matrix}$

Subsequently, using the ADMM via augmented Lagrangian solution method, the unconstrained optimization becomes $\begin{matrix} (16) & \begin{aligned} min_{L, R, Q, Z, C, B} {‖ M . (Q) - D ‖}_{F}^{2} + λ (‖ L ‖_{F}^{2} + ‖ R ‖_{F}^{2}) + α ‖ SQ ‖_{F}^{2} + β ‖ C ‖_{F}^{2} + ⟨ G_{1}, L^{T} R - Q ⟩ + ⟨ G_{2}, L^{T} B - C ⟩ \\ + ⟨ G_{3}, ZT - B ⟩ + ⟨ G_{4}, R - Z ⟩ + \frac{μ}{2} ({‖ L^{T} R - Q ‖}_{F}^{2} + {‖ L^{T} B - C ‖}_{F}^{2} + ‖ R - Z ‖_{F}^{2} \\ + ‖ ZT - B ‖_{F}^{2}) \end{aligned} \end{matrix}$ where, $G_{1}$ , $G_{2}$ , $G_{3}$ , $G_{4}$ are Lagrangian multipliers and μ is a positive scalar. Solving (16), the iterative steps of the ALSRMF approach are illustrated in Table 2, as derived in [22], where $I_{M}$ is the identity matrix of size M.

Table 2
Steps of ALSRMF approach

Steps of ALSRMF Approach

1. Input: D, S, T, k, λ, α, β

2. Initialize: $μ = 0.01$ , $Q = G_{1} \in R^{M \times N}$ , $L \in R^{k \times M}$ , $G_{2} = C \in R^{M \times (N - 1)}$ , $R = G_{4} = Z \in R^{k \times N}$ , $B = G_{3} \in R^{k \times (N - 1)}$

3. for iteration $n = 1, 2, \dots$

Compute

a. $Q^{(n)} = M . {{((2 + μ) I_{M} + 2 α S^{T} S)}^{- 1} (2 M + μ H)} + (1 - M) . {{(μ I_{M} + 2 α S^{T} S)}^{- 1} μ H}$

b. $L^{(n)} = {(2 λ I_{k} + μ {RR}^{T} + μ {BB}^{T})}^{- 1} μ {R {(Q - \frac{G_{1}}{μ})}^{T} + B {(C - \frac{G_{2}}{μ})}^{T}}$

c. $R^{(n)} = {((2 λ + μ) I_{k} + μ {LL}^{T})}^{- 1} μ {L (Q - \frac{G_{1}}{μ}) + (Z - \frac{G_{4}}{μ})}$

d. $C^{(n)} = \frac{μ}{2 β + μ} (L^{T} B + \frac{G_{2}}{μ})$

e. $Z^{(n)} = {(B - \frac{G_{3}}{μ}) T^{T} + (R + \frac{G_{4}}{μ})} {(I_{N} + {TT}^{T})}^{- 1}$

f. $B^{(n)} = {(I_{k} + {LL}^{T})}^{- 1} {L (C - \frac{G_{2}}{μ}) + (ZT + \frac{G_{3}}{μ})}$

g. Update $Q = Q^{(n)}$ , $L = L^{(n)}$ , $R = R^{(n)}$ , $C = C^{(n)}$ , $Z = Z^{(n)}$ , $B = B^{(n)}$ , $X^{(n)} = L^{T} R$

h. Update $G_{1} = G_{1} + μ (L^{T} R - Q)$

i. Update $G_{2} = G_{2} + μ (L^{T} B - C)$

j. Update $G_{3} = G_{3} + μ (ZT - B)$

k. Update $G_{4} = G_{4} + μ (R - Z)$

l. Update $μ = min \{ μ ρ, 10^{10} \}$ , where $ρ = 1.1$

4. end when $‖ X^{(n)} - X^{(n - 1)} ‖ ⩽ ε (= 10^{- 10})$

5. Output: $\hat{X} = X^{(n)} = L^{T} R$


4. Results and discussion

On the acquired temperature and pH data matrices of size 19 × 60, random missing data values are simulated corresponding to data integrity values from 0.2 to 0.9, indicating 20% to 90% of the data intact respectively. The missing values are estimated using the Baseline, Baseline with correlation, kNN, SRSVD and ALSRMF approaches. Further, a hybrid of kNN and the baseline predictor with temporal model, where the reconstructed matrix from the Baseline with temporal approach forms the initial matrix for kNN, is also used to estimate the missing values. For different data integrity values, the reconstruction error of each approach is obtained using Root Mean Square Error (RMSE), Normalized Mean Absolute Error (NMAE) and Mean Absolute Percentage Error (MAPE), defined as $\begin{matrix} (17) & RMSE = \sqrt{\frac{\sum_{i = 1}^{Tot} {(P_{i} - A_{i})}^{2}}{Tot}} \\ (18) & NMAE = \sum_{i = 1}^{Tot} | \frac{A_{i} - P_{i}}{A_{i}} | \\ (19) & MAPE = \frac{\sum_{i = 1}^{Tot} | P_{i} - A_{i} |}{\sum_{i = 1}^{Tot} | A_{i} |} * 100 \% \end{matrix}$ where $Tot$ represents the total number of missing values. In (17) to (19), $A_{i}$ represents the actual data value, while $P_{i}$ indicates the estimated value. The simulations are carried out using MATLAB and the results are discussed as follows.
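For reference, the three error measures can be computed over the missing entries as in the following Python sketch, which follows (17) to (19) exactly as written above. The function name `reconstruction_errors` and the 0/1 mask convention (0 marks a missing entry) are assumptions; the reported results were produced in MATLAB.

```python
import numpy as np

def reconstruction_errors(actual, estimate, mask):
    """RMSE, NMAE and MAPE of (17)-(19), over missing entries only."""
    missing = mask == 0
    a = actual[missing]      # actual values A_i
    p = estimate[missing]    # estimated values P_i
    tot = a.size             # Tot, the number of missing values
    rmse = np.sqrt(np.sum((p - a) ** 2) / tot)
    nmae = np.sum(np.abs((a - p) / a))
    mape = np.sum(np.abs(p - a)) / np.sum(np.abs(a)) * 100.0
    return rmse, nmae, mape

# Toy example with two missing entries (actual 10 and 50).
toy_actual = np.array([[10.0, 20.0], [40.0, 50.0]])
toy_estimate = np.array([[11.0, 20.0], [40.0, 45.0]])
toy_mask = np.array([[0, 1], [1, 0]])
rmse, nmae, mape = reconstruction_errors(toy_actual, toy_estimate, toy_mask)
```

Here the two absolute errors are 1 and 5, so the MAPE of (19) is 6 / 60 × 100 = 10%.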

For the temperature data matrix completion, the plots of RMSE, NMAE and MAPE versus data integrity are shown in Fig. 4, Fig. 5 and Fig. 6 respectively. Observe that, as data integrity increases, the errors (17) to (19) decrease. Note that in these figures Baseline with corr indicates the Baseline approach with temporal correlation considerations and the Hybrid indicates the hybrid of kNN and baseline predictor with temporal model.

Fig. 4.

RMSE versus data integrity for temperature data.

Fig. 5.

NMAE versus data integrity for temperature data.

The kNN approach is used with $k = 5$ and the similarity indices between the columns given in (7) are calculated using the reconstructed matrix obtained from baseline approach. Further, in the hybrid approach, the similarity indices are calculated using the reconstructed matrix obtained from the baseline with correlation approach and the parameter $k = 5$ . For SRSVD approach, the tuned parameters are $λ = 0.299$ , $k = 11$ . Similarly for the ALSRMF approach, the tuned parameters are $λ = 0.1$ , $α = 0.25 * 10^{- 4}$ , $β = 10$ . The column constraint matrix T is chosen as a Toeplitz matrix, whose structure relates adjacent data samples in time [22]. The row constraint matrix S captures the similarities among the rows via the Pearson correlation coefficients. For this calculation, the reconstructed matrix obtained from the baseline approach is used.

From Fig. 4 to Fig. 6, it can be observed that the baseline with correlation approach performs better than the baseline and kNN approaches. Further, the hybrid approach outperforms the baseline with correlation approach, whereas the SRSVD approach does not perform better than the baseline with correlation approach. Also, for a data integrity greater than 0.35, the ALSRMF approach outperforms all the other approaches considered here, while for a data integrity less than 0.35 the hybrid approach turns out to be the better choice. Note that when 50% of the data is lost, both the ALSRMF and the hybrid approach render less than 2% MAPE, with only 1.25% MAPE from the ALSRMF. Further, when 90% of the data is intact, both these approaches render very close values of RMSE, NMAE and MAPE.

Fig. 6. MAPE versus data integrity for temperature data.

Proceeding with similar studies on the pH matrix, the results of RMSE, NMAE and MAPE for different values of data integrity are shown in Fig. 7, Fig. 8 and Fig. 9 respectively. In this case of pH matrix completion, the tuned parameters of the ALSRMF approach are $λ = 0.1$ , $α = 0.45 \times 10^{- 4}$ , $β = 1$ . The kNN and the hybrid approaches use $k = 5$ , while the SRSVD approach uses $λ = 0.299$ , $k = 11$ . The choice of S and T remains the same as that for the temperature data matrix completion.

From Fig. 7 to Fig. 9, it can be observed that the Baseline with correlation approach outperforms the Baseline, kNN and SRSVD approaches. Further, the hybrid approach outperforms all these approaches. Note that for a data integrity greater than 0.25, the ALSRMF approach outperforms all the other approaches considered here. For a data integrity less than 0.3, the hybrid baseline with kNN approach performs better than all the others considered. Note that at a 50% loss of data, both the ALSRMF and the hybrid approach render less than 1% MAPE, with only 0.7% MAPE from the ALSRMF approach.

Fig. 7. RMSE versus data integrity for pH data.

Fig. 8. NMAE versus data integrity for pH data.

Fig. 9. MAPE versus data integrity for pH data.

From the results of temperature and pH data shown in Fig. 4 to Fig. 9, it can be observed that the performance improvements obtained from ALSRMF for temperature data are larger than the improvements obtained for pH data. For instance, when data integrity changes from 0.3 to 0.4, the MAPE changes from 2% to 1.25% for temperature data, but from 1% to only 0.75% in the pH data case. This can be attributed to the fact that temperature data entries are relatively more correlated than the pH data entries, as discussed in Section 3.1. It can be interpreted that the stronger the correlations among the data, the better the performance of the ALSRMF approach, thus making it an efficient matrix completion approach for Aquaponics data.

5. Conclusion

Data collected at the fusion centre of a WSN, transmitted by an Aquaponics sensor node, is considered in this work. Specifically, two 19 × 60 data matrices of temperature and pH, collected for 19 independent acquisition hours with 60 samples per hour, are considered. In practice, these matrices tend to be incomplete, owing to missing data problems experienced by the fusion centre. Hence, data reconstruction using matrix completion is investigated via the baseline predictor with and without temporal model, kNN, SRSVD, a hybrid of kNN and baseline with temporal model, and ALSRMF approaches. The SRSVD and the ALSRMF are iterative approaches based on low rank optimization, and so depend on the correlations among the data entries. Simulating different data integrity scenarios on both temperature and pH data matrices, the reconstruction error is obtained in terms of NMAE, RMSE and MAPE for all these approaches. The plots of reconstruction error versus data integrity demonstrate that the hybrid baseline with kNN approach performs better when data integrity is less than 0.3, and the ALSRMF approach performs better than the hybrid approach when data integrity is greater than 0.3. Thus the hybrid baseline with kNN and the ALSRMF approaches can be effectively employed for matrix completion. Development of the WSN with multiple nodes and a fusion centre, acquiring spatio-temporal data matrices for subsequent investigations on matrix completion approaches that mine the inter correlations among pH and temperature data, forms the future scope of this work. Further, in addition to pH and temperature, data from multiple sensors that monitor the health of the Aquaponics system can be explored for obtaining inter correlations that help in data reconstruction.

Conflict of interest

The authors have no conflict of interest to report.

Appendix

Solving for $l_{i}$ : Differentiating (13) with respect to $l_{i}$ , with D replaced by $M . D$ , and setting the result to zero gives $\begin{matrix} \frac{\partial}{\partial l_{i}} \left\{ {‖ M . (L^{T} R) - M . D ‖}_{F}^{2} + λ (‖ L ‖_{F}^{2} + ‖ R ‖_{F}^{2}) \right\} = 0 \end{matrix}$

Retaining only the terms that involve $l_{i}$ , we obtain $\begin{matrix} \frac{\partial}{\partial l_{i}} \left\{ {‖ (R^{' T} l_{i}) - d_{i} ‖}_{2}^{2} + λ (‖ l_{i} ‖_{2}^{2}) \right\} = 0 \end{matrix}$ where $d_{i}$ and $R^{'}$ are defined in Table 1. Expanding the norms, $\begin{matrix} \frac{\partial}{\partial l_{i}} \left\{ {((R^{' T} l_{i}) - d_{i})}^{T} ((R^{' T} l_{i}) - d_{i}) + λ l_{i}^{T} l_{i} \right\} = 0 . \end{matrix}$

Applying vector differentiation principles, it further reduces to $\begin{matrix} 2 R^{'} R^{' T} l_{i} - 2 R^{'} d_{i} + 2 λ I_{k} l_{i} = 0 . \end{matrix}$

Rearranging and solving for $l_{i}$ , we obtain $\begin{matrix} l_{i} = {(R^{'} R^{' T} + λ I_{k})}^{- 1} R^{'} d_{i} \end{matrix}$

Solving for $r_{j}$ , on similar lines, we get $\begin{matrix} r_{j} = {(L^{'} L^{' T} + λ I_{k})}^{- 1} L^{'} d_{j} \end{matrix}$ where $d_{j}$ and $L^{'}$ are defined in Table 1.
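The closed-form update derived above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in this work: it assumes that $R^{'}$ consists of the columns of R at the positions where row i of the data is observed, and that $d_{i}$ holds the corresponding observed entries (Table 1 gives the exact definitions). The function names `solve_linear` and `update_factor` are hypothetical, and the small Gauss-Jordan solver stands in for a library linear solver.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def update_factor(other, observed_idx, observed_vals, lam):
    """Return l_i = (R' R'^T + lam I_k)^{-1} R' d_i.

    other:         k x n factor matrix (R when updating l_i), as k rows.
    observed_idx:  column indices where row i of the data is observed.
    observed_vals: the observed entries d_i, in the same order.
    lam:           regularization weight (lambda), assumed positive.
    """
    k = len(other)
    # R': keep only the columns of R that match observed entries (assumed).
    r_sub = [[other[f][j] for j in observed_idx] for f in range(k)]
    # Left-hand side R' R'^T + lam I_k.
    gram = [[sum(x * y for x, y in zip(r_sub[p], r_sub[q]))
             + (lam if p == q else 0.0)
             for q in range(k)] for p in range(k)]
    # Right-hand side R' d_i.
    rhs = [sum(x * d for x, d in zip(r_sub[p], observed_vals))
           for p in range(k)]
    return solve_linear(gram, rhs)
```

By symmetry, the same function gives the update for $r_{j}$ when it is passed L together with the observed entries of column j. Alternating the two updates over all rows and columns yields one sweep of an alternating least squares iteration.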

References

1. K.N. Acharya, M.G. Yashwanth Gowda, Vijay, Deepthi, Malathi and Sure, Parametric and non-parametric regression approaches for non-invasive blood glucose monitoring, Biomedical Engineering: Applications, Basis and Communications 32(6) (2020), 2050043.

2. Bell, Koren and Volinsky, Chasing the $1,000,000: How we won the Netflix progress prize, Statistical Computing and Graphics 18(2) (2007).

3. J.M. Cadenas, M.C. Garrido and Martinez-España, Development of an application to make knowledge available to the farmer: Detection of the most suitable crops for a more sustainable agriculture, Journal of Ambient Intelligence and Smart Environments IOS Press 12(5) (2020), 419–432. doi:10.3233/AIS-200575.

4. Chen, Wei, Li, Liang, Cai and Zhang, Ensemble correlation-based low-rank matrix completion with applications to traffic data imputation, Knowledge-Based Systems 132 (2017), 249–262. doi:10.1016/j.knosys.2017.06.010.

5. Chen, Zhao and Wang, Kernel meets recommender systems: A multi-kernel interpolation for matrix completion, Expert Systems with Applications 168 (2021), 114436. doi:10.1016/j.eswa.2020.114436.

6. Chiang, How does Netflix recommend movies? in: Networked Life: 20 Questions and Answers Chapter, Cambridge University Press, 2012, pp. 61–88. doi:10.1017/CBO9781139176200.006.

7. Gemulla, Nijkamp, P.J. Haas and Sismanis, Large-scale matrix factorization with distributed stochastic gradient descent, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 69–77. doi:10.1145/2020408.2020426.

8. Guerrero-Ulloa et al., Internet of Things (IoT)-based indoor plant care system, Journal of Ambient Intelligence and Smart Environments IOS Press 15(1) (2023), 47–62. doi:10.3233/AIS-220483.

9. He, Li, Zhang and Li, Missing and corrupted data recovery in wireless sensor networks based on weighted robust principal component analysis, Sensors 22(5) (2022), 1992.

10. C.-C. Huang, H.-L. Lu, Y.-H. Chang and T.-H. Hsu, Evaluation of the water quality and farming growth benefits of an intelligence aquaponics system, Sustainability 13(8) (2021), 4210. doi:10.3390/su13084210.

11. Kaneko, Cheung, W.-T. Su and C.-W. Lin, Graph-based joint signal/power restoration for energy harvesting wireless sensor networks, in: GLOBECOM IEEE Global Communications Conference, 2017, pp. 1–6. doi:10.1109/GLOCOM.2017.8254798.

12. Koren, Collaborative filtering with temporal dynamics, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 447–456. doi:10.1145/1557019.1557072.

13. Kumar, Rakheja, Sarswat, Varshney, Bhatia, S.R. Goli, V.J. Ribeiro and Sharma, White space detection and spectrum characterization in urban and rural India, in: IEEE 14th International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), 2013, pp. 1–6.

14. Kutyniok, Theory and applications of compressed sensing, GAMM-Mitteilungen 36(1) (2013), 79–101. doi:10.1002/gamm.201310005.

15. Li, Zhu, Zhu and Li, Compressive sensing approach to urban traffic sensing, in: IEEE 31st International Conference on Distributed Computing Systems, 2011, pp. 889–898.

16. M.K. Najafabadi and M.N. Mahrin, A systematic literature review on the state of research and practice of collaborative filtering technique and implicit feedback, Artificial Intelligence Review 45(2) (2016), 167–201. doi:10.1007/s10462-015-9443-9.

17. O.N. Nandesh, Shetty, Alva, Paul and Sure, A USRP based UHF wireless sensor node and fusion centre for aquaponics system monitoring, in: IEEE 3rd International Conference for Emerging Technology (INCET), 2022, pp. 1–7.

18. P.P. Ray, Internet of things for smart agriculture: Technologies, practices and future direction, Journal of Ambient Intelligence and Smart Environments IOS Press 9(4) (2017), 395–420. doi:10.3233/AIS-170440.

19. Recht, Fazel and P.A. Parrilo, Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization, SIAM Review 52(3) (2010), 471–501. doi:10.1137/070697835.

20. Roughan, Zhang, Willinger and Qiu, Spatio-temporal compressive sensing and Internet traffic matrices (extended version), IEEE/ACM Transactions on Networking 20(3) (2011), 662–676. doi:10.1109/TNET.2011.2169424.

21. Sethupathi, Sridhar, Suresh, Sushmitha Dhatchayani and Vaithiyanathan, Aquaponics agriculture for large scale irrigation system, International Research Journal of Engineering and Technology 6 (2019), 2978–2985.

22. Sure, C.P. Srinivasan and C.N. Babu, Spatio-temporal constraint-based low rank matrix completion approaches for road traffic networks, IEEE Transactions on Intelligent Transportation Systems 23(8) (2022), 13452–13462. doi:10.1109/TITS.2021.3124613.

23. Tan, Zhang and Li, An efficient fingerprint database construction approach based on matrix completion for indoor localization, IEEE Access 8 (2020), 130708–130718. doi:10.1109/ACCESS.2020.3009441.

24. Wang, M.J. Lai, Lu, Fan, Davulcu and Ye, Orthogonal rank-one matrix pursuit for low rank matrix completion, SIAM Journal on Scientific Computing 37(1) (2015), 488–514. doi:10.1137/130934271.

25. Xie et al., Recover corrupted data in sensor networks: A matrix completion solution, IEEE Transactions on Mobile Computing 16(5) (2017), 1434–1448. doi:10.1109/TMC.2016.2595569.

26. Yang, V.Y.T. Tan, Keong Ho, Ho Ting and Liang Guan, Wireless compressive sensing for energy harvesting sensor nodes, IEEE Transactions on Signal Processing 61(18) (2013), 4491–4505. doi:10.1109/TSP.2013.2271480.

27. Yang, Luo, Li, F.X. Wu and Wang, Overlap matrix completion for predicting drug-associated indications, PLoS Computational Biology 15(12) (2019), e1007541. doi:10.1371/journal.pcbi.1007541.

28. Zhou, Zhang and Xie, Accurate traffic matrix completion based on multi-Gaussian models, Computer Communications 102 (2017), 165–176. doi:10.1016/j.comcom.2016.11.011.