Abstract
Garbage classification management is a general term for the series of activities that sort, store, and transport garbage according to certain regulations or standards so that it can be turned into public resources. Current garbage classification systems have several drawbacks, such as the inability to identify multiple garbage categories at once and a high dependence on the surrounding environment. To address these issues, this paper proposes the Real-Time Multi-Modal Garbage Classification System (RMGCS). It consists of two subsystems: an indoor garbage classification applet (IGCA) and an outdoor garbage classification system (OGCS). IGCA provides users with three methods of garbage classification, while OGCS provides outdoor real-time multi-target garbage classification and can dynamically update its recognition model. Together they make RMGCS real-time, accurate, and multimodal. Experiments with RMGCS show that our approaches are effective and efficient.
Introduction
Background
China is the world’s largest manufacturing country and has the largest population. It produces an average of 300 kg of household garbage per person per year, which amounts to about 400 million tons a year [15] [21]. Garbage classification is therefore of great significance to our country. In 2019, Shanghai took the lead in the comprehensive sorted collection of urban household garbage [4]. Subsequently, 8 cities in China promulgated the Regulations on the Administration of Domestic Garbage Charges and put garbage classification on a legal footing [11].
Garbage is in fact also a recyclable resource [26]. However, most citizens are not familiar with the specific rules and methods of garbage classification [10]. Therefore, a system that uses machines and advanced learning techniques to assist citizens in classifying garbage needs to be developed.
Challenges
Until recent years, China had no legislation or penalties for littering [21]. Garbage classification therefore had to rely on people’s awareness, and there has not been enough publicity from childhood onwards, which makes classifying and disposing of garbage at later stages more difficult. In the age of the mobile Internet, mobile development technology can promote and popularize knowledge of garbage classification effectively [17]. Relevant departments can use mobile apps and WeChat public accounts [5] to promote garbage classification knowledge to the public and to provide real-time garbage classification search services. With mobile devices, users can quickly distinguish garbage types and improve the efficiency of garbage classification and disposal.
Contributions
To help relevant organizations publicize national environmental policies and garbage classification methods, and to improve both the accuracy and the speed of garbage identification, we develop two systems in this paper: 1) the Indoor Garbage Classification Applet (abbreviated as IGCA) and 2) the Outdoor Garbage Classification System (abbreviated as OGCS). The two systems complement each other and are used in indoor and outdoor scenes, respectively.
IGCA is a comprehensive garbage separation system based on the WeChat applet that enables citizens to understand the national environmental policy and collect garbage at home. It has the following advantages.
OGCS is an outdoor garbage classification system based on YOLOv5 framework. It has the following advantages.
Organization
The rest of the paper is organized as follows. Section 2 introduces the work related to target identification and garbage classification. Section 3 introduces the design scheme and related technologies of the system. Section 4 presents some related experiments and comparison results. Section 5 describes how the system is demonstrated.
Related work
Methods of garbage classification
Recently, machine learning methods such as SVM and Random Forest, as well as deep learning methods, have been widely used in the field of garbage classification. Deep learning, based on multi-layer neural networks, is a machine learning approach with strong generalization ability [24]. In 2015, a deep convolutional neural network called ResNet was proposed, achieving accuracies of 93.03% and 95.51% on CIFAR-10 and ImageNet, respectively [6]. Residual and Inception blocks were combined to achieve accuracies of 99.66%, 98.04%, and 95.32% on MNIST, SVHN, and CIFAR-10, respectively [29]. A Fast R-CNN-based garbage detection method was proposed using ResNet and RPN structures, but it is slow and has poor real-time performance. Classification methods that use lightweight models such as MobileNet and SqueezeNet as the backbone are faster in the first stage, but their feature extraction is weak and their accuracy is low. Although deep learning algorithms can reach high classification accuracy, the high cost in training and testing time is a bottleneck [12]. Besides, there are some intelligent systems for garbage classification. A group-control robot system was proposed to provide a simple decentralized control scheme for swarm robots that collect indoor garbage [8], but it depends on the cooperation between robots and takes a long time.
Another category of garbage classification is based on image classification algorithms. Automatic identification has been done using thermographic images, and neural networks whose nonlinear characteristics are strengthened by an improved activation function have shown good results. AlexNet [9] achieved record-breaking image classification accuracy in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). However, the problem of gradient dispersion (vanishing gradients) has become more and more prominent. In order to better learn the distribution characteristics of nonlinear data, improvements to the network have mainly focused on its depth and its activation function. For example, Leaky ReLU, SLU, and PReLU have the characteristics of fast convergence, simple calculation, and strong sparseness [14, 27].
Application pattern of garbage classification
In terms of application pattern of garbage classification, there are two main types of garbage classification systems: APP-based garbage classification auxiliary systems and automatic garbage classification systems based on mechanical equipment.
APP-based garbage classification auxiliary systems for mobile devices are mainly built on apps with large user bases. Some systems provide a machine learning method for garbage classification [28] (e.g., the neural network garbage classification system WasNet [26]), while others offer a content matching method (e.g., a large-scale garbage recycling system [25]). Some systems use AR technology to provide photo identification queries and recyclable reservation services (e.g., the AR garbage classification applet [1] in Alipay).
Automatic garbage classification systems based on mechanical devices are used in various applications. For example, a trash detection and classification system was proposed for a social-education trash bin robot, which is expected to be deployed in public facilities [19]. A smart bin was proposed to optimize the treatment of garbage by taking the capacitance and spectrum of garbage as the classification standard [18]. A garbage separating device automatically places received garbage in different containers using image processing and machine learning [22]. Another mechanical garbage classification device used an optical sensor to detect size, position, color, and shape [7]; it has a mechanical separator that uses compressed air and is controlled by a computer. An automated trash bin used the average frequency response of garbage materials to distinguish plastic bottles from tins [20]; it has a piezoelectric microphone for input signal acquisition and a comparator for noise elimination. However, this method is highly dependent on the surrounding environment, so its classification still has limitations. The bottleneck of current mature garbage classification systems is the accuracy and speed of garbage classification.
In real life, solid garbage and liquid garbage are often mixed in different ways, and existing automatic garbage classification systems cannot effectively differentiate such garbage [10]. Therefore, the application mode of current garbage classification systems still needs to be based on classification assistance and the popularization of classification knowledge. The focus is on improving recognition accuracy, reducing the number of operation steps, popularizing garbage classification knowledge, and helping users sort garbage conveniently and consciously through the Internet [13].
Architecture
Overall architecture
This section introduces the architecture of IGCA and OGCS.

Architecture of Indoor Garbage Classification Applet (IGCA).
The IGCA client calls the camera and microphone through the WeChat API to acquire data. Text and voice data are sent to the garbage type matching program at the back-end. Image data are first passed through a tensorflow.js-based model at the front-end for feature extraction, and the extracted features are then sent to the garbage identification model. In the garbage matching layer, we use natural language processing to identify named entities in the original data. In the garbage identification model, we feed the extracted features into an improved ResNet50 model for prediction and match the prediction result against the database to return one of five garbage types.
OGCS predicts and tags the type and location of garbage in a video stream in real time. The hardware architecture of the system is shown in Fig. 3. The camera acquires video stream data in real time, the host computer stores the garbage classification model and performs offline prediction on the video stream, and the large screen outputs the predicted video stream. The software architecture of the system is shown in Fig. 2: the camera acquires images at a rate of 60 frames per second, the images are transmitted to the host computer for reading, and the prediction results and coordinates are labeled in the video stream.

Architecture of outdoor garbage classification system (OGCS).

Demo system of outdoor garbage classification system (RMGCS).
In the IGCA system, we first address the problem that users are often unable to describe garbage names accurately. By combining image recognition with several natural language processing tasks, we design a multimodal garbage type recognition module that supports three types of input (image, speech, and text) and processes each of them into a keyword set of garbage names. For an input keyword or piece of text, as shown in Fig. 4, the module first performs named entity recognition, extracts the nouns, and generates a keyword set for the subsequent fuzzy matching step. For input speech, the module first transcribes the speech into text and then performs named entity recognition to generate the keyword set. For an input image, as shown in Fig. 5, the module performs target detection, identifies the type of the image, extracts the names of the top 5 garbage candidates ranked by probability, and then performs fuzzy matching against the database.

IGCA UI: query with keyword and voice.

IGCA UI: image recognition.
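The fuzzy matching step between a keyword set and the garbage database can be sketched as follows. This is a minimal illustration only: the paper does not state which matching algorithm IGCA uses, so the sketch assumes string-similarity matching from Python's standard `difflib` module, and the tiny `GARBAGE_DB` dictionary, the `fuzzy_match` helper, and the `cutoff` value are all hypothetical.

```python
import difflib

# Hypothetical miniature database: garbage name -> garbage type.
GARBAGE_DB = {
    "apple core": "wet",
    "banana peel": "wet",
    "mobile phone": "recyclable",
    "battery": "hazardous",
    "paper towel": "dry",
}

def fuzzy_match(keywords, db=GARBAGE_DB, cutoff=0.6):
    """Map each keyword to its closest database entry, if one is similar enough."""
    results = {}
    for keyword in keywords:
        # get_close_matches returns at most n names whose similarity is >= cutoff.
        candidates = difflib.get_close_matches(keyword.lower(), list(db), n=1, cutoff=cutoff)
        if candidates:
            name = candidates[0]
            results[keyword] = (name, db[name])
    return results

# Misspelled or inflected keywords still resolve; unrelated words are dropped.
matches = fuzzy_match(["batery", "banana peels", "spaceship"])
```

A similarity cutoff trades recall for precision: a lower value tolerates noisier speech transcripts but may return unrelated entries.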
To achieve accurate image-based garbage classification, we need an efficient and accurate model. We perform data augmentation using random rotation over the full 360-degree range together with random contrast and luminance changes, which reduces overfitting on the garbage classification dataset and improves the generalization ability of the model.
In practice, some pieces of garbage are large and some are small. This variation in size is difficult to capture with standard convolution. Therefore, in PyConv-Resnet50, we combine pyramidal convolution with the residual network for multi-scale feature fusion by embedding pyramidal convolution into the Bottleneck block, in order to extract as much multi-scale information as possible. The schematic of pyramid convolution is shown in Fig. 6, which shows intuitively that the size of the convolution kernel decreases from top to bottom while the number of channels (kernel depth) increases in the same order along the channel dimension. Finally, the resulting feature maps are concatenated.
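A rough sketch of this kind of augmentation is given below. It is written in plain Python on a small square grayscale image so that it stays self-contained; the `augment` helper, the contrast and brightness ranges, and the nearest-neighbour rotation are assumptions for illustration, not the settings used in the paper.

```python
import math
import random

def augment(image, rng):
    """Randomly rotate a square grayscale image and jitter contrast and brightness.

    `image` is a list of rows of pixel values in [0, 255].
    """
    n = len(image)
    angle = math.radians(rng.uniform(0.0, 360.0))  # random rotation angle
    contrast = rng.uniform(0.8, 1.2)               # assumed contrast range
    brightness = rng.uniform(-20.0, 20.0)          # assumed brightness shift
    centre = (n - 1) / 2.0
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for y in range(n):
        row = []
        for x in range(n):
            # Inverse-rotate the target pixel to find its source (nearest neighbour).
            sx = round(cos_a * (x - centre) + sin_a * (y - centre) + centre)
            sy = round(-sin_a * (x - centre) + cos_a * (y - centre) + centre)
            value = image[sy][sx] if 0 <= sx < n and 0 <= sy < n else 0
            # Scale around mid-grey for contrast, then shift for brightness.
            value = contrast * (value - 128.0) + 128.0 + brightness
            row.append(min(255, max(0, round(value))))
        out.append(row)
    return out

rng = random.Random(0)
image = [[(x * 30 + y * 5) % 256 for x in range(8)] for y in range(8)]
augmented = augment(image, rng)
```

In a real training pipeline the same idea would normally be expressed with the augmentation utilities of the deep learning framework in use.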

Pyramid convolution vs. standard convolution.
For the pyramid convolution to use convolution kernels of different depths at different levels, the input features must be divided into groups and the convolution computed independently within each group, as shown in Fig. 7. This is called group convolution [23].

Group convolution.
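The effect of grouping on kernel depth can be illustrated by counting weights. In a group convolution each kernel only sees `c_in / groups` input channels, so a bias-free layer has `k * k * (c_in / groups) * c_out` weights. The sketch below applies this to a four-level pyramid; the kernel sizes, group counts, and channel widths are illustrative assumptions and are not taken from this paper.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 2-D convolution without bias."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel is connected to only c_in // groups input channels.
    return kernel * kernel * (c_in // groups) * c_out

# Standard 3x3 convolution, 64 -> 64 channels.
standard = conv_params(64, 64, 3)

# Assumed pyramid: (kernel size, groups) per level, each level producing 16 of the 64 outputs.
levels = [(3, 1), (5, 4), (7, 8), (9, 16)]
pyramid = sum(conv_params(64, 16, k, g) for k, g in levels)

print(standard, pyramid)  # 36864 27072
```

Under these assumptions the pyramid covers kernel sizes from 3 to 9 with fewer weights than a single standard 3x3 layer, because the larger kernels are given shallower depth through grouping.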
In the training of convolutional neural networks, the choice of optimizer is a key factor in making the model converge well. The optimizer used here considers the first-order and second-order moment estimates of the gradient together to calculate the update step size.
Equation 1: Calculate the exponential smoothing of the historical gradient, which gives the gradient with momentum (the first-order moment estimate).
Equation 2: Calculate the exponential smoothing of the squared historical gradient, which gives the learning rate weight for each weight parameter (the second-order moment estimate).
Equation 3: Calculate the variable update value, which is proportional to the exponentially smoothed historical gradient and inversely proportional to the square root of the exponentially smoothed squared historical gradient.
Explanation of formula parameters: Beta1 and Beta2 are the exponential decay rates of the first- and second-order moment estimates, respectively; g is the gradient at time step t; epsilon is a very small constant that keeps the computation numerically stable.
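The three steps described above match the widely used Adam optimizer, although the paper does not name it. Assuming that is the optimizer meant, one update for a single scalar parameter can be sketched as below. The bias-correction terms and the default hyperparameter values follow the standard Adam formulation and are assumptions here, since the original equations are not reproduced in the text.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based time step."""
    # Equation 1: exponential smoothing of the gradient (first-order moment).
    m = beta1 * m + (1 - beta1) * grad
    # Equation 2: exponential smoothing of the squared gradient (second-order moment).
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised moments (standard Adam, assumed here).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Equation 3: step proportional to m_hat and inversely proportional to sqrt(v_hat).
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=1.0, grad=1.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected moments both equal the gradient magnitude, so the parameter moves by approximately the learning rate regardless of the gradient scale.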
The prediction set obtained after processing images with PyConv-Resnet50 is matched against the database, and the names and types of the five garbage candidates with the highest confidence are returned.
In practice, IGCA displays the top five garbage names and types, i.e., recognition is counted as successful as long as the actual garbage appears in the top 5 of the recognition results. Under this criterion, the recognition accuracy of IGCA reaches 97.16%.
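The top-5 criterion corresponds to the usual top-k accuracy metric, sketched below in plain Python. The `top_k_accuracy` helper and the toy scores are illustrative only and are unrelated to the paper's evaluation data.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy example: 3 samples, 6 classes.
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.10, 0.40, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.10, 0.30, 0.40],
]
labels = [2, 1, 5]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Because a larger k can only add hits, top-5 accuracy is never lower than top-1 accuracy on the same predictions.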
To improve the efficiency of the system and to keep it usable when network quality is poor, the pyramid convolution is implemented on the front-end with tensorflow.js, so that only the extracted features are sent to the back-end, where further image prediction and garbage recognition are performed. This effectively reduces the size of the data sent from the front-end to the back-end; the reduction in transferred data reported in the experiments is 88.80%.
In text information extraction, in order to effectively improve the recall rate of garbage entities in user input text, the named entity recognition task in this system uses a BERT pre-training model [3]. As shown in Fig. 9, the BERT model uses the Transformer encoder as its main structure, discards the recurrent structure of RNNs, and introduces a bidirectional language modeling task. The Transformer consists of an encoder (left) and a decoder (right). The text is modeled entirely with the attention mechanism, which computes the relationship between each word and all words in the text (including itself) and derives a new representation of each word from these word-to-word weights, so that the representation reflects global information weighted by the importance of the other words. Attention layers and nonlinear network layers are stacked repeatedly to obtain the final text representation. Introducing the BERT model into the named entity recognition task not only takes contextual relationships into account but also makes full use of global information, which yields more accurate recognition results for garbage type matching.

Accuracy of different models.

Transformer Encoder of Indoor Garbage Classification Applet (IGCA).
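The self-attention mechanism is calculated as follows, where Q, K, and V are transformations of the layer's own input and d_k is the dimension of the key vectors:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$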
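As a concrete illustration, the standard scaled dot-product form used by the Transformer encoder, softmax(QK^T / sqrt(d_k)) V, can be sketched in plain Python as below. The helper names are hypothetical, the matrices are plain nested lists, and the learned projections that produce Q, K, and V from the input are omitted.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(x - peak) for x in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(q, k, v):
    """Scaled dot-product attention for row-major matrices given as nested lists."""
    d_k = len(k[0])
    output = []
    for q_row in q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        weights = softmax(scores)
        # New representation: attention-weighted sum of the value vectors.
        output.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                       for j in range(len(v[0]))])
    return output

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
# A zero query scores all keys equally, so the output is the mean of the values.
uniform = self_attention([[0.0, 0.0]], keys, values)
```

Each output row is a convex combination of the value vectors, which is how every word's representation comes to depend on all words in the text.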
The algorithmic part of OGCS uses the YOLOv5 model released by Ultralytics LLC in May 2020. This version has a weight file 1/9 the size of that of YOLOv4 and the fastest inference speed. Its image inference time is as low as 0.007 s, i.e., it can process about 140 frames per second and can therefore handle real-time video streams. The network structure is divided into Input, Backbone, Neck, and Prediction. Input includes Mosaic data augmentation, image size processing, and adaptive anchor box calculation; Backbone includes the Focus structure and the CSP structure; Neck uses the FPN+PAN structure; Prediction includes the bounding box loss function and non-maximum suppression (NMS).
YOLOv5 comes in four variants, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which differ in network depth and width; this system uses YOLOv5x. The structure of the system is shown in Fig. 10. We add a post-processing module after the prediction module, which performs fuzzy matching of garbage types by means of word segmentation and an offline database.
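The non-maximum suppression step in the Prediction stage can be sketched as follows. This is a simplified, class-agnostic greedy version in plain Python for illustration; YOLOv5's own implementation is batched and class-aware, and the `iou` and `nms` helpers and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```

Suppressing overlapping detections is what lets the system report one labeled box per piece of garbage when several items appear in the same frame.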

Computing network of outdoor garbage classification system (OGCS).
The real-time recognition marker results are shown in Fig. 11.

Real-time rendering results of outdoor garbage classification system (OGCS).
Dataset
We save the collected datasets in CSV and SQL formats to facilitate model training and data analysis. A total of 3,993 pieces of garbage were collected, with 4 categories: recyclable, hazardous, wet and dry (including almost all common and uncommon garbage).
The statistics for the different types of garbage are shown in Fig. 12. The data come from the garbage classification inquiry platform of the Shanghai Greening and Amenities Administration. Dry garbage accounts for the largest share, at 36%, and hazardous garbage for the smallest, at 6%. Recyclable and wet garbage account for 32% and 26%, respectively.

Garbage types of the dataset.
In addition to the dataset used in the production environment, we also use the Huawei Cloud garbage classification dataset for model training and evaluation. The dataset has 43 categories with a total of 19,459 images, and each image corresponds to the label: garbage category/garbage name.
IGCA uses a named entity recognition technique based on the BERT pre-training model. Pre-training BERT is very expensive: the model was trained by Google on 4 to 16 Cloud TPUs for 4 days using a large corpus (Wikipedia + BookCorpus).
The parameter size of the BERT pre-training model is approximately:
Word vector parameters (including LayerNorm) + 12 × (multi-head parameters + fully connected layer parameters + LayerNorm parameters)
= (30522 + 512 + 2) × 768 + 768 × 2 + 12 × (768 × 768 / 12 × 3 × 12 + 768 × 768 + 768 × 3072 × 2 + 768 × 2 × 2)
= 108,808,704 ≈ 110M
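The arithmetic above can be checked with a few lines of Python. The function and argument names below are introduced only for this check; like the formula in the text, the count omits bias terms.

```python
def bert_base_params(vocab=30522, positions=512, segments=2,
                     hidden=768, ffn=3072, layers=12):
    """Approximate weight count of BERT-base, following the formula in the text."""
    # Token, position and segment embeddings plus the embedding LayerNorm.
    embeddings = (vocab + positions + segments) * hidden + 2 * hidden
    attention_qkv = 3 * hidden * hidden   # query, key and value projections
    attention_out = hidden * hidden       # attention output projection
    feed_forward = 2 * hidden * ffn       # two fully connected layers
    layer_norms = 2 * 2 * hidden          # two LayerNorms per encoder layer
    per_layer = attention_qkv + attention_out + feed_forward + layer_norms
    return embeddings + layers * per_layer

total = bert_base_params()
print(total)  # 108808704
```

The embedding tables contribute about 23.8 million weights and the twelve encoder layers about 85.0 million, which together give the commonly quoted figure of roughly 110M.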
OGCS uses a video target recognition technology based on YOLOv5, which is trained with the COCO dataset, for fast and accurate recognition. The recognition speed and AP values of the different versions of the YOLOv5 model are shown in Fig. 13. The differences between the four network structures lie mainly in their depth and width parameter settings. GPU Speed measures the end-to-end time per image averaged over 5000 COCO val2017 images using a V100 GPU with batch size 32, and includes image preprocessing, PyTorch FP16 inference, postprocessing, and NMS.

Comparison of different versions of the YOLO model.
On the Huawei Cloud garbage classification dataset, we compared the classification performance of Resnet18, Resnet50, PyConv-Resnet50, and PyConv-Resnet50 top5; the results are shown in Fig. 8. We can see that data augmentation and multi-scale feature fusion effectively improve the recognition rate of garbage. In practical application scenarios, the recognition accuracy of IGCA reaches 97.16%. The tensorflow.js-based model separation architecture also resulted in an 88.80% reduction in the amount of data to be transferred.
In terms of application mode, as shown in Fig. 5, our proposed IGCA supports not only image input queries but also text and speech input queries, which distinguishes it from existing garbage recognition and classification applications. For text and speech input queries, we first perform named entity recognition and extract all the relevant terms in the sentence before garbage classification, which effectively improves user efficiency compared with existing text input query methods. As shown in Fig. 4, given the input "I took the apple and the phone", most applications cannot identify what kind of garbage is meant, while IGCA can identify the apple, the phone, and other related items. The image input query can also identify multiple objects in an image and give the corresponding garbage classifications.
Demonstration
Indoor garbage classification applet
We introduce the functions, usage, and principles of IGCA through a promotional video, and then guide users to follow us on WeChat to receive timely help and information about garbage classification. Finally, we invite users to classify garbage with their smartphones in a real-life scenario using the three identification methods.
Outdoor garbage classification system
We put up posters or distribute flyers to attract customers to drop off their garbage at the centralized garbage disposal point equipped with the "Real-time Garbage Separation Recognition System". The system identifies the garbage in real time and helps the user classify it.
Conclusions
Now that garbage classification has been incorporated into laws and regulations, there is an urgent need for a system that helps citizens classify garbage. In this paper, we surveyed various methods of garbage classification and proposed RMGCS, which consists of two subsystems, IGCA and OGCS. IGCA has three garbage recognition modes. For text and speech recognition, it introduces named entity recognition and word segmentation to extract noun information from user input. For image recognition, we propose PyConv-Resnet50 top5, which introduces pyramidal convolution and a front-end/back-end model separation architecture, thus effectively improving the accuracy of indoor garbage recognition. The algorithmic part of OGCS introduces the YOLOv5 model, which not only converges faster and performs better but can also detect multiple pieces of garbage in real time, effectively solving the problem of classifying multiple pieces of garbage at the same time. Since IGCA is an applet, we can easily and quickly update the program for higher accuracy in the future. In addition, the final OGCS device is connected to a cloud device, which makes it possible to iterate the model.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (61772231), the Shandong Provincial Natural Science Foundation (ZR2017MF025), the Project of Independent Cultivated Innovation Team of Jinan City (2018GXRC002), and the Teaching Research Project of University of Jinan (JZ1807). The corresponding author is Kun Ma.
