Abstract
Solid waste management has become a challenge for developing countries, mainly because of surging economic activities, rapid urbanisation and a rise in community living standards. Many researchers have identified its related problems and recommended solutions, while others have established models to forecast the amount of solid waste generated over a period. However, efficient and effective management of solid waste requires adequate categorisation of solid waste generation areas to aid the provision of area-specific or targeted solutions for each categorised area. In this study, we used primary data on some important socio-demographic variables (household size, house type, predominant religion of household, age and educational level of household head, residency type, household waste disposal method, frequency of waste collection, etc.) and the amount of solid waste generated from 2102 households in the Greater Accra Region, Ghana. We assessed the performance of a traditional statistical classifier and some selected machine learning algorithms in classifying the surveyed areas in Greater Accra into low, medium and high solid waste generation areas. The Support Vector Machine with the Cubic Kernel was found to be the best performing classifier, with a Specificity of 86%, Sensitivity, Precision and Accuracy of 73% each, and an Area under the curve (AUC) of 0.90. The Support Vector Machine with the Cubic Kernel is therefore recommended as a suitable algorithm for the categorisation of solid waste generation areas. Stakeholders responsible for solid waste management could leverage the evidence from this study to categorise their waste generation areas and to proffer targeted community-based interventions.
Introduction
Solid waste management includes holistically planning and executing the processes involved in the collection, disposal and treatment of solid waste (Nathanson, 2019).
It is often a challenge for the municipal authorities in developing countries mainly due to the increasing generation of waste, the burden posed on the municipal budget as a result of the high costs associated with its management and the lack of understanding of the varied factors that affect the different stages of waste management (Guerrero et al., 2013).
According to Lebersorger and Beigl (2011) and Gallardo et al. (2018), planning for solid waste management fundamentally requires reliable data on waste generation and on the factors that influence it, together with a reliable forecast of waste quantities. It is evident from the literature cited above that the accurate prediction of solid waste is essential for effective planning for solid waste management. It is therefore not surprising that many authors have attempted to determine the best model for solid waste prediction. Notable among these authors are Dyson and Chang (2005), who presented an approach called “System Dynamics Modeling”. This model was to be used for the prediction of solid waste generation in a fast-growing urban area based on a limited set of samples. Their proposed model outperformed the traditional least squares regression model. Kannangara et al. (2018) developed models for accurate prediction of municipal solid waste (MSW) generation and diversion based on demographic and socio-economic variables, using decision trees and neural networks (two machine learning algorithms). In Ghana, Asante-Darko et al. (2016) proposed a Fourier series model to forecast solid waste generation in Kumasi, in the Ashanti Region. Their approach incorporated some characteristics of the monthly waste data for forecasting solid waste. Their model revealed that the high rate of urbanisation and population growth would cause an increase in solid waste generated.
Chapman-Wardy et al. (2021) used two algorithms, namely Levenberg-Marquardt and Bayesian Regularisation, to estimate the parameters of an Artificial Neural Network (ANN) model fitted to predict the average monthly waste generated and to critically assess the factors that influence solid waste generation in some selected districts of the Greater Accra Region. Their study found the Bayesian regularisation algorithm to be the most suitable estimator of the ANN parameters. However, an accurate prediction of solid waste generated is not sufficient on its own to provide complete solutions in waste management. Segregating the solid waste generation areas into ordered classes allows the municipal authorities to initiate targeted policies in the various areas based on the class to which they belong.
Karadimas and Loumos (2008) and Shoba and Rasappan (2013) used Geographic Information Systems (GIS) to identify the waste generation patterns for Athens, Greece and Coimbatore, India, respectively. Shoba and Rasappan (2013) used GIS to develop thematic maps which aided in the classification of the Coimbatore area based on population density and the solid waste generated per capita. The thematic maps helped officers to identify and monitor the high solid waste generation areas. It was recommended that promoting waste market recycling and creating awareness would reduce the total volume of waste at the landfill sites in these areas. Karadimas and Loumos (2008) used GIS to randomly partition the municipality of Athens, then proceeded to determine the solid waste generated in each zone using multiple linear regression. The variables used were based on the classes of commercial activities in the area. The study found that determining the waste generated per particular area resulted in a 30% reduction in the number of waste bins required.
In Ghana, Seshie et al. (2020) characterised solid waste generated in the Takoradi sub-metro. The area was divided based on income levels. The study found that people in the high-income area generated the highest level of solid waste. It was also found that 81.7% of waste generated was recyclable and only 18.4% warranted disposal. As stated earlier, classifying solid waste generation areas into various categories based on determinant factors allows the planners of waste management to prepare targeted policies for each area based on the applicable determinant factors.
This study seeks to classify the solid waste generation areas into high, medium and low solid waste generation areas, based on the critical factors which exist in the areas. The study compares some machine learning classifiers (Support Vector Machines, K-Nearest Neighbour and Naive Bayes) to a traditional statistical classifier (Logistic Regression) in a quest to identify the most suitable algorithm for classifying some surveyed areas in the Greater Accra Region into low, medium and high waste generation areas, based on the amount of solid waste generated and some socio-demographic characteristics of the surveyed households.
The rest of this paper is arranged as follows. The methods and materials section contains a detailed description of the study data, the adopted traditional and machine learning classifiers and their mathematical underpinnings. The results and discussion section contains an interpretation of the study findings. Finally, the conclusions and recommendation section provides a summary of the findings and a general conclusion of the study, with some directions for future work.
Methods and materials
Description of data
Primary data was collected from some surveyed households in Greater Accra Region using well-structured questionnaires. The data contained information on the independent variables that influence solid waste generated in the districts such as age of head of household, house type, educational level of head of household, predominant religion of household, residency type, household size, employment category of head of household, household waste disposal method, frequency of waste collection and income level of head of household.
The study sample of households was drawn from fifteen of the twenty-six districts in the Greater Accra Region. A sample size of 2102 households was used for the study. This is a representative sample that can be used to make inferences about the population of households in the Greater Accra Metropolis with a 2.5% margin of error.
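The stated margin of error can be sanity-checked with the usual formula for a proportion. This is only a rough sketch: it assumes simple random sampling, a 95% confidence level (z = 1.96) and the worst-case proportion p = 0.5, none of which is stated in the text.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a confidence interval for a proportion (assumed formula)."""
    return z * math.sqrt(p * (1.0 - p) / n)

def required_sample_size(e, p=0.5, z=1.96):
    """Smallest n whose margin of error does not exceed e (same assumptions)."""
    return math.ceil(z ** 2 * p * (1.0 - p) / e ** 2)

moe = margin_of_error(2102)             # about 0.021, i.e. roughly 2.1%
n_needed = required_sample_size(0.025)  # households needed for a 2.5% margin
```

Under these assumptions, 2102 households give a margin of roughly 2.1%, which is within the 2.5% quoted.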
Classification of districts through Geostatistical analysis
Geostatistical classification was performed using the digital addresses obtained from the surveyed households. The premise was that spatially distributed objects are spatially correlated; thus, observations of solid waste generated that were close together were expected to share more influencing variables, and were more likely to fall in one category, than observations that were further apart. The Inverse Distance Weighted (IDW) interpolation approach was used. IDW interpolation uses a linearly weighted combination of a collection of sample points to determine cell values, with each weight determined by the inverse of the distance to the sample point. This approach ensured that the influence of a mapped value of solid waste generated would decrease with distance from the sampled site. It also made it easier to assign a category to areas not included in the study. The categories of the target variable used for training (supervised learning) the study classifiers are low, medium and high waste generation areas.
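A minimal sketch of IDW interpolation is given below to make the weighting concrete. The power parameter of 2 and the sample coordinates and values are assumptions for illustration; the text does not state the settings used in the study.

```python
import math

def idw_interpolate(samples, target, power=2.0):
    """Inverse Distance Weighted estimate at `target`.

    samples: list of ((x, y), value) pairs for the surveyed sites
    target:  (x, y) location of the cell to estimate
    power:   distance exponent (2.0 is an assumed, commonly used default)
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in samples:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # the cell coincides with a sampled site
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical sites and waste values, for illustration only.
sites = [((0.0, 0.0), 2.0), ((2.0, 0.0), 6.0)]
midpoint_estimate = idw_interpolate(sites, (1.0, 0.0))   # equal weights
near_first_site = idw_interpolate(sites, (0.5, 0.0))     # pulled towards 2.0
```

Because the weights fall with distance, an estimate close to a sampled site is dominated by that site's value.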
Figure 1 shows the different classes of solid waste generated in some selected districts of the Greater Accra Region. This represents the prior distribution of solid waste generated.
Figure 1. Map of Greater Accra Region showing the various classes of waste generated.
As indicated earlier, the Support Vector Machine (SVM) is used to classify the waste generated in the Greater Accra Region into high, medium and low categories based on the identified critical factors. SVM is a statistical learning algorithm originally developed for binary classification, but it can be adapted for multi-class classification as well (Schwenker, 2000; Lee & Lee, 2007). Figure 2 represents the grouping of two sets of data points into two categories using SVM.
Figure 2. The Support Vector Machine.
The support vectors
where
Let the equations of hyperplanes be given as
and represent the two dashed lines through the support vectors in Fig. 2, where
Substituting Eq. (4) into Eq. (1) we obtain
Combining Eqs (2) and (3), the following equation is obtained
where
In solving the optimisation problem, SVM seeks to achieve two objectives: (i) to find the hyperplane with the largest margin, and (ii) to ensure the hyperplane correctly classifies all data points.
Due to the presence of constraints, a Lagrange multiplier is introduced to convert the constrained problem into an unconstrained one.
Introducing Lagrangian Multiplier
subject to
Taking the derivative of
Substituting Eqs (8) and (9) into Eq. (7), the primal objective function, we obtain the dual optimisation formula
As
The dual optimisation equation is therefore represented below:
subject to
If we let
The value of
The equation above works for data points which are linearly separable.
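For reference, the hard-margin problem for linearly separable data can be sketched in conventional notation as follows. The symbols (weight vector $\mathbf{w}$, bias $b$, multipliers $\alpha_i$, labels $y_i \in \{-1, +1\}$) are the standard textbook ones and are assumed here; they may not match the symbols or equation numbers of the derivation above.

```latex
\begin{align}
  &\text{Primal:} &&
    \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
    \quad \text{subject to} \quad
    y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1,
    \quad i = 1,\dots,n \\
  &\text{Lagrangian:} &&
    L(\mathbf{w}, b, \boldsymbol{\alpha})
    = \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
    - \sum_{i=1}^{n} \alpha_i \left[ y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) - 1 \right],
    \quad \alpha_i \ge 0 \\
  &\text{Stationarity:} &&
    \frac{\partial L}{\partial \mathbf{w}} = 0 \ \Rightarrow\ \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i,
    \qquad
    \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0 \\
  &\text{Dual:} &&
    \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n} \alpha_i
    - \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^{\top}\mathbf{x}_j
    \quad \text{subject to} \quad
    \alpha_i \ge 0,\ \ \sum_{i=1}^{n} \alpha_i y_i = 0
\end{align}
```

When the data are not linearly separable, the inner product $\mathbf{x}_i^{\top}\mathbf{x}_j$ in the dual is replaced by a kernel function. A cubic kernel is a polynomial kernel of degree 3; one common form is $(\mathbf{x}_i^{\top}\mathbf{x}_j + 1)^{3}$, although the exact form used in the study is not stated and this is an assumption.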
Our quest in this study is to apply SVM (a binary classifier) to data which requires multi-classification. The One-against-one (OAO) multi-classification method will be used in this study.
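The One-against-one scheme can be sketched as follows: one binary classifier is built for every pair of classes, and a new observation is assigned to the class that collects the most votes. The threshold rules below are hypothetical stand-ins for trained binary SVMs, and the single made-up feature is for illustration only.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["low", "medium", "high"]

def make_threshold_classifier(lower, upper, cut):
    """Hypothetical stand-in for a binary SVM trained on one pair of classes."""
    return lambda x: lower if x < cut else upper

# Made-up cut-points on a single feature; a real model would instead
# train one binary SVM per pair of classes.
CUTS = {("low", "medium"): 1.0, ("low", "high"): 1.5, ("medium", "high"): 2.0}

pairwise = {
    pair: make_threshold_classifier(pair[0], pair[1], CUTS[pair])
    for pair in combinations(CLASSES, 2)
}

def one_against_one_predict(classifiers, x):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = Counter(clf(x) for clf in classifiers.values())
    return votes.most_common(1)[0][0]

predictions = [one_against_one_predict(pairwise, x) for x in (0.5, 1.7, 2.5)]
```

With three classes this requires three binary classifiers. A three-way tie (one vote each) is possible; this sketch simply resolves it in favour of the class that received its vote first.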
K-Nearest Neighbour (KNN) is a supervised learning algorithm which can be used for both classification and prediction purposes. In this study, this algorithm is used as a method to classify the selected households in the Greater Accra Region into high, medium and low solid waste generation areas.
In setting up the algorithm, the main considerations include the number of nearest neighbours
Classification with KNN
Suppose we have a number of classes
From the information provided above,
and
The posterior is represented below as:
Substituting Eq. (14) into Eq. (15), we obtain
Therefore, the posterior of an unknown data point
where
A Naive Bayes Classifier is a probabilistic machine learning model that is used to discriminate different objects based on certain features (Gandhi, 2018; Farid et al., 2014). Applying the assumption of class conditional independence,
Since
Given that the prediction made by the NB classifier is based on the class with the highest probability, the probability that a training record will be predicted as being in class
As per the objective of this study, the results from the machine learning classifiers outlined above are now compared to the traditional classifier (Logistic Regression).
Logistic regression is a type of Generalized Linear Model (GLM) used to model dependent categorical variables using numerical and categorical predictors. There are three types of logistic regression models, driven by the nature of the response variable.
The binary logistic regression models how a binary response variable
The second type of logistic regression is the ordinal logistic regression. With this logistic regression model, the response variable has more than two categories, and these categories follow a natural order.
The third logistic regression model is the nominal logistic regression model. This model also has a response variable with more than two categories, but these do not follow any natural sequence. The binary logistic regression model can be generalized to a situation where the dependent variable has more than two outcomes. The most widely used method of generalizing the binary logistic regression to the ordinal logistic regression is the Proportional Odds Model (POM) (Fuks & Salazar, 2008). This model applies the concept of the cumulative logit, which describes the cumulative probability that a particular observation is classified as belonging to an identified class or to any lower class (Williams, 2016; Fuks & Salazar, 2008).
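The cumulative logit idea can be illustrated with a short sketch. It assumes one common parameterisation of the proportional odds model, logit P(Y ≤ j | x) = α_j − β'x, in which the same linear predictor β'x is shared by every cut-point; sign conventions vary between texts, and the threshold and predictor values below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def proportional_odds_probs(thresholds, eta):
    """Class probabilities for ordered classes from cumulative logits.

    thresholds: increasing cut-points alpha_1 < ... < alpha_{J-1}
    eta:        linear predictor beta'x, shared by all cut-points
    """
    # P(Y <= j) for each cut-point, then P(Y <= J) = 1 for the top class.
    cumulative = [sigmoid(alpha - eta) for alpha in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)  # P(Y = j) = P(Y <= j) - P(Y <= j-1)
        previous = c
    return probs

# Hypothetical values, for illustration only.
classes = ["low", "medium", "high"]
probs = proportional_odds_probs(thresholds=[-0.5, 1.0], eta=0.8)
probs_larger_eta = proportional_odds_probs(thresholds=[-0.5, 1.0], eta=2.0)
```

Under this parameterisation, a larger linear predictor shifts probability mass towards the higher categories.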
Parameter estimation
The estimates of the parameters are obtained using the Maximum Likelihood approach. This approach allows the derivation of the maximum value of
where
In order to estimate the parameters, the Log-likelihood is derived below. The Maximum Likelihood estimator is obtained by finding the partial derivative of the log-likelihood function below.
Then the partial derivative of the log-likelihood is determined and equated to zero as follows:
and
The log-likelihood expression in Eq. (21) is maximised using numerical methods such as the Newton-Raphson iteration method to estimate the unknown parameters of the logistic regression model.
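As an illustration of the numerical step, the sketch below applies Newton-Raphson to a binary logistic regression with an intercept and one predictor. It is a simplified stand-in rather than the ordinal model fitted in the study, and the data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(b0, b1, xs, ys):
    """Gradient of the log-likelihood with respect to (b0, b1)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        residual = y - sigmoid(b0 + b1 * x)
        g0 += residual
        g1 += residual * x
    return g0, g1

def fit_logistic_newton(xs, ys, iterations=25):
    """Newton-Raphson updates: beta <- beta + I(beta)^{-1} * score(beta)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1 = score(b0, b1, xs, ys)
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        h00 = h01 = h11 = 0.0
        for x in xs:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical, non-separable data for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
b0_hat, b1_hat = fit_logistic_newton(xs, ys)
final_score = score(b0_hat, b1_hat, xs, ys)
```

At the maximum likelihood estimate the score (the partial derivatives of the log-likelihood) is zero, which is the condition the iteration drives towards.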
Specificity: This refers to the ability of a classification function to correctly classify a household as not belonging to a particular class.
Sensitivity: This refers to the ability of a classification function to correctly classify a household as belonging to a particular class.
Accuracy: This refers to the ability of a classification function to correctly classify an observation (whether negative or positive).
Precision: This refers to the proportion of households assigned to a particular class by a classification function that actually belong to that class.
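These measures can be computed per class from a confusion matrix, using the usual one-class-versus-the-rest counts of true and false positives and negatives. In the sketch below, precision is computed as the share of households assigned to a class that actually belong to it. The matrix is hypothetical, with rows as actual classes and columns as predicted classes.

```python
def class_metrics(cm, k):
    """Sensitivity, specificity and precision for class index k."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # actual k, predicted otherwise
    fp = sum(row[k] for row in cm) - tp   # predicted k, actually otherwise
    tn = total - tp - fn - fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

def accuracy(cm):
    """Share of all observations on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

# Hypothetical confusion matrix (rows: actual, columns: predicted).
cm = [
    [50, 10, 0],
    [5, 30, 5],
    [0, 10, 40],
]
first_class = class_metrics(cm, 0)
overall_accuracy = accuracy(cm)
```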
Results and discussion
The evaluations of the traditional classification method (Logistic Regression) and the machine learning classifiers (Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Naive Bayes (NB)) used in the study are presented below:
Evaluation of the logistic classification functions
The total sample of 2,102 responses from the households was partitioned into 70% (1,472 responses) for training and 30% (630 responses) for testing. The total population of responses (All) was then tested by the logistic classifier. Results obtained from training, testing, and testing of the whole population (All) are presented in Table 1.
Confusion matrix – logistic regression classifier
From Table 1, overall, the logistic classifier correctly classified 591 out of 887 households which were actually in high solid waste generation areas, 20 out of 378 households which were actually in medium solid waste generation areas, and 637 out of 837 households which were actually in low waste generation areas. In total (using all the data), the logistic classifier correctly classified 1,248 out of 2,102 surveyed households.
Table 2 contains the classification metrics (specificity, sensitivity and precision) for the logistic classifier on the surveyed data.
Evaluation of the logistic regression classifier
Specificity measures the ability of the logistic regression function to correctly classify households which did not belong to a particular solid waste generation category. In testing the total population sampled (All), the logistic approach correctly classified 72%, 98% and 62% of the households which did not belong to the high, medium and low solid waste generation areas, respectively. The logistic regression classifier therefore has a high specificity in classifying households from medium solid waste generation areas.
Sensitivity measures the ability of the logistic regression function to correctly classify households which belong to a particular category. Using data from all surveyed households, the logistic regression function correctly classified 67%, 5% and 76% of households which belonged to high, medium and low solid waste generation areas, respectively. The logistic regression classifier therefore has a very low sensitivity in classifying households from medium solid waste generation areas. Precision measures the proportion of households assigned to a particular category that actually belong to it. Using data from all surveyed households, 64%, 33% and 57% of the households assigned by the classifier to high, medium and low solid waste generation areas, respectively, actually belonged to those areas. It is also evident from Table 2 that in classifying households in the medium category, the logistic regression classifier was high in specificity, but low in sensitivity and precision.
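The sensitivity and overall accuracy figures can be checked directly from the counts quoted from Table 1 (correctly classified households and actual class totals). The short calculation below uses only those counts.

```python
# Counts quoted in the text from Table 1 (logistic classifier, all data).
correctly_classified = {"high": 591, "medium": 20, "low": 637}
actual_totals = {"high": 887, "medium": 378, "low": 837}

# Sensitivity per class: correctly classified / actual members of the class.
sensitivity = {
    area: correctly_classified[area] / actual_totals[area]
    for area in actual_totals
}

# Overall accuracy: all correct classifications / all surveyed households.
overall_accuracy = sum(correctly_classified.values()) / sum(actual_totals.values())

sensitivity_percent = {area: round(100 * s) for area, s in sensitivity.items()}
```

The rounded values reproduce the 67%, 5% and 76% sensitivities and the 59% overall accuracy (1,248 out of 2,102 households).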
From Table 3, generally, the logistic classifier was found to be above average in specificity (73%), but was just about average in sensitivity (59%), precision (56%) and accuracy (59%).
Measures of accuracy of the logistic regression classifier
The machine learning algorithms investigated were the Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Naive Bayes (NB). Five-fold cross-validation was applied for testing and validation. The One-vs-One multi-class method of classification was also adopted.
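Five-fold cross-validation can be sketched as follows: the observations are shuffled and split into five folds, and each fold in turn is held out for validation while the other four are used for training. The shuffling seed and the way the folds are assigned are assumptions for illustration.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Return k (training, validation) index splits of range(n)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits

# 2,102 surveyed households, as in the study.
splits = k_fold_indices(2102, k=5)
fold_sizes = [len(validation) for _, validation in splits]
```

Every household appears in exactly one validation fold, so each observation is used for validation once and for training four times.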
Evaluation of SVM with the Cubic Kernel function
The Cubic Kernel function had a prediction speed of 14,000 observations per second, and a training time of 8.8043 seconds. The confusion matrix of the SVM with cubic kernel classifier is presented in Table 4.
Confusion matrix of SVM with Cubic Kernel function
It can be seen from Table 4 that out of the 2,102 households sampled, 701 of them were correctly classified by the SVM with the cubic kernel classifier as being located in high solid waste generation areas, 654 of them were correctly classified as being in the low waste generation areas and 178 households located in the medium waste generation areas were classified correctly. As shown in Table 5, overall, the weighted average measure of accuracy for the SVM with cubic kernel classifier was as follows: specificity (86%) and sensitivity, accuracy and precision (73% each).
Evaluation of SVM with Cubic Kernel classifier
The marked point on the Receiver Operating Characteristic (ROC) curve in Fig. 3 signifies that, at a specified threshold, the false positive rate (the probability of a household being wrongly assigned to a class) is 0.14 and the true positive rate (the probability of a household being correctly assigned) is 0.79. The area under the curve (AUC) is 0.90.
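A point on a ROC curve pairs a false positive rate with a true positive rate at one threshold, and the AUC summarises the curve over all thresholds. The sketch below computes both for one class treated as positive against the rest; the scores and labels are hypothetical.

```python
def roc_point(scores, labels, threshold):
    """(false positive rate, true positive rate) at one score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives

def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0  # ties count as half
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores; label 1 marks the class treated as positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
point = roc_point(scores, labels, threshold=0.5)
area = auc(scores, labels)
```

An AUC close to 1 means the classifier ranks members of the class above non-members for most pairs; 0.5 corresponds to random ranking.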
The Kernel Naive Bayes had a prediction speed of 1,200 observations per second, and a training time of 7.3081 seconds. The results of the Kernel Naive Bayes classifier are shown in Table 6.
Confusion matrix of the Kernel Naive Bayes
Table 6 shows that out of the 2,102 households sampled, 765 of them were correctly classified as being located in high waste generation areas, 535 of them were correctly classified as being in the low waste generation areas and only 2 households located in the medium waste generation area were classified correctly. In total, 1,302 households were classified correctly.
Evaluation of the Kernel Naive Bayes classifier
Figure 3. ROC for the cubic SVM.
From Table 7, the overall performance metrics of the Kernel Naive Bayes classifier were 73% for specificity, 62% for sensitivity, 63% for precision and 62% for accuracy. From Fig. 4, the marked point on the ROC curve signifies that, at some threshold, the false positive rate (the probability of a household being wrongly assigned to a class) is 0.44 and the true positive rate (the probability of a household being correctly assigned) is 0.86. The area under the curve (AUC) is 0.81.
The Weighted KNN uses 10 neighbouring points. The Euclidean distance is applied with a suitable distance weighting. The Weighted KNN had a prediction speed of 20,000 observations per second, and a training time of 1.2928 seconds. The results of the Weighted KNN in classifying the surveyed households are shown in Tables 8 and 9.
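A distance-weighted KNN vote can be sketched as follows. The inverse squared distance weighting is an assumption, since the text only says that a suitable distance weighting was applied, and the training points are hypothetical.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k=10):
    """Classify `query` by a distance-weighted vote of its k nearest points.

    train: list of (feature_tuple, label) pairs
    """
    neighbours = sorted(
        (math.dist(features, query), label) for features, label in train
    )[:k]
    scores = defaultdict(float)
    for distance, label in neighbours:
        # Assumed weighting: inverse squared Euclidean distance.
        # The small constant avoids division by zero for exact matches.
        scores[label] += 1.0 / (distance ** 2 + 1e-12)
    return max(scores, key=scores.get)

# Hypothetical two-feature training points, for illustration only.
train = [
    ((0.0, 0.0), "low"), ((0.0, 1.0), "low"),
    ((5.0, 5.0), "high"), ((5.0, 6.0), "high"), ((6.0, 5.0), "high"),
]
near_low = weighted_knn_predict(train, (0.5, 0.5), k=5)
near_high = weighted_knn_predict(train, (5.0, 5.5), k=5)
```

In the first query an unweighted majority vote over all five points would favour "high" (three votes to two), whereas the weighting lets the two much closer "low" points dominate.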
Confusion matrix for the Weighted KNN classifier
From Table 8, out of the 2,102 households sampled, 696 of them were correctly classified as being located in high waste generation areas, 661 of them were correctly classified as being in the low waste generation areas and 102 households located in the medium waste generation area were classified correctly. In total, 1,459 households were classified correctly by the Weighted KNN classifier.
Evaluation of the Weighted KNN classifier
Figure 4. ROC for Kernel Naive Bayes.
Table 9 shows the performance measures of the Weighted KNN classifier. The Weighted KNN classifier gave a specificity of 82%, sensitivity of 69%, precision of 68% and accuracy of 69%. The indicated point on the ROC curve in Fig. 5 shows that, at a specific threshold, the false positive rate (the probability of a household being wrongly assigned to a class) is 0.20 and the true positive rate (the probability of a household being correctly assigned) is 0.78. The area under the curve (AUC) was found to be 0.88.
Table 10 shows the performance metrics for the adopted machine learning and traditional statistical classifiers.
Evaluation of the adopted classifiers
Figure 5. ROC for Weighted KNN.
From Table 10, the best classifier for categorising the surveyed households into their respective waste generation areas is the SVM with Cubic Kernel (Cubic SVM), with Specificity of 86%, Sensitivity, Precision and Accuracy of 73% each, and AUC of 0.90. The Weighted KNN follows closely, with Specificity of 82%, Sensitivity of 69%, Accuracy of 69% and AUC of 0.88. Overall, the machine learning classifiers outperformed the classical statistical classifier.
The study sought to identify the best model to classify the surveyed households into respective waste generation areas based on the amount of solid waste generated and important socio-demographic characteristics of the households. In line with the stated objective, the study compared a traditional statistical classifier (Logistic Regression Classifier) and some machine learning classifiers (SVM with Cubic Kernel, Kernel Naive Bayes and Weighted KNN). The SVM with Cubic Kernel emerged the best performing classifier with Specificity of 86%, Sensitivity, Precision and Accuracy of 73% and Area under the curve (AUC) of 0.90. It was closely followed by the Weighted KNN with Specificity 82%, Sensitivity and Accuracy 69% and Area under the curve of 0.88.
The worst performing classifier was the logistic regression classifier, with Specificity of 73% and Sensitivity and Accuracy of 59%. Generally, the machine learning classifiers outperformed the classical statistical classifier. The Support Vector Machine with the Cubic Kernel is recommended as a suitable classifier for grouping the surveyed households into specified waste generation areas, in order to proffer targeted or area-specific solutions in the management of solid waste generation in the Greater Accra Region. Future studies will focus on assessing the components of household solid waste to aid in the development of targeted solid waste management policies.
Footnotes
Conflict of interest
The authors declare that there is no conflict of interest.
Funding statement
The study was supported by BANGA-Africa Project with funding from Carnegie Corporation.
