Abstract
Air pollution has become an international calamity that threatens human health and the environment, which makes air quality prediction a crucial task. Conventional approaches to assessing air quality fall short when they must extract the complicated non-linear relationships and long-term dependencies embedded in the data. Long short-term memory (LSTM), a member of the recurrent neural network family, has emerged as a potent tool for addressing these issues, and computer-aided prediction has become essential for achieving high accuracy. In this study, we investigated time-series analysis based on an Improved Long Short-Term Memory (ILSTM) model to improve the performance of air quality index (AQI) prediction. The predicted AQI values for 25 days lie within a 97.63% confidence interval, and performance is assessed with widely adopted metrics: R-square, MSE, RMSE, and MAE.
Keywords
Introduction
The Air Quality Index (AQI), one of the most widely used air quality indicators, is an important tool for air quality measurement [1]. The AQI is a consolidated measure of the level of air pollution based on several components, including particulate matter (PM2.5 and PM10), ground-level ozone (O3), nitrogen dioxide (NO2), sulfur dioxide (SO2), and carbon monoxide (CO). The sources of this pollution include motor transport, manufacturing facilities, power plants (including those burning fossil fuels), and natural events such as wildfires and aeolian (wind-blown dust) events [2].
Studying and predicting AQI is vital for several reasons:
Public health: Prolonged exposure to poor air quality can lead to serious health problems, especially for sensitive groups such as children, the elderly, and people with existing respiratory or cardiovascular conditions. Skillful AQI predictions would enable authorities to issue timely warnings and take proactive action to protect public safety [3,4].
Environmental monitoring: AQI data provide information about the amounts of different pollutants moving through the atmosphere, which supports research and helps to assess environmental conditions [5,6,7].
Climate change: Air pollution and climate change are linked, because some pollutants contribute to the greenhouse effect that worsens the global climate. Routine AQI measurement informs future decisions about human activities that influence the climate and underpins emission-reduction policy [8,9,10].
Urban planning: AQI forecasting gives urban planners an effective tool for developing resilient transportation systems, energy efficiency measures, and land use policies. Such policies help build a prosperous and pleasant environment for urban dwellers. Even so, air pollution remains a fundamental global problem that poses some of the greatest dangers to health, nature, and quality of life, among other issues [11,12,13,14].
Long short-term memory (LSTM) models, a type of recurrent neural network (RNN), have proved to be a powerful tool for dealing with the challenges of Air Quality Index (AQI) prediction [15,16]. LSTM networks are well suited to modeling sequential data such as time series, which is important for reproducing the temporal nature of air pollution. Unlike feedforward networks, LSTM networks employ memory cells that enable them to selectively store and forget information from previous time steps. This allows LSTM models to capture long-term dependencies and patterns, so they perform well in AQI forecasting, where historical trends are of great importance [17,18,19]. LSTM models are applied to AQI prediction in several ways, which are discussed below. Data preprocessing: Accurate training of an LSTM model requires proper data preparation, which starts with collecting meteorological variables, geographical information, and historical AQI data. This stage consists of cleaning, normalization, and feature engineering to put the data into a form the model can work with [20,21].
The importance of LSTM models here lies in their ability to selectively remember and forget contextual details from preceding time steps [22,23]. This characteristic helps them learn long-term dependencies in the data, which is very important in AQI prediction, where air pollution levels are influenced by a wide range of factors, including weather conditions, emission sources, and atmospheric processes that persist for a long time.
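The store-and-forget behaviour of the memory cell described above can be illustrated with one forward step of a standard LSTM cell. This is a minimal NumPy sketch of the textbook formulation, not the ILSTM variant used in this study; the weight shapes, gate ordering, and random inputs are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One forward step of a standard LSTM cell.

    Assumed shapes: W is (4*H, D), U is (4*H, H), b is (4*H,).
    Assumed gate order in the stacked weights: forget, input, candidate, output.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:H])            # forget gate: how much old memory to keep
    i = sigmoid(z[H:2 * H])        # input gate: how much new content to write
    g = np.tanh(z[2 * H:3 * H])    # candidate memory content
    o = sigmoid(z[3 * H:4 * H])    # output gate: how much memory to expose
    c_t = f * c_prev + i * g       # updated cell (memory) state
    h_t = o * np.tanh(c_t)         # updated hidden state
    return h_t, c_t

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):   # five time steps of D features each
    h, c = lstm_step(x_t, h, c, W, U, b)
```

The forget gate `f` scales the previous cell state and the input gate `i` scales the new candidate, which is what lets the cell retain or discard information across time steps.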
Research on AQI underpins the protection of human well-being, environmental monitoring, and policy-level decision-making [24,25,26,8]. The main strength of LSTM models in AQI prediction is their ability to track temporal dependencies and to model the intricate patterns in time-series data. Despite great progress, challenges remain, including data quality issues, integration of geographic information, interpretability, and domain adaptation. As air pollution remains a troubling worldwide problem, proponents of machine learning (ML) argue that LSTM models and other ML technologies will be key to maintaining and protecting the environment [27,28,29,30,31].
The outcome of this study can improve our understanding of which areas have poor air quality, and this information could be used to further develop air quality management systems, improving the environment for both people and nature.
The remainder of the paper is organized as follows: Section 2 reviews the literature, Section 3 describes the methodology, Section 4 presents the results and discussion, and Section 5 concludes.
The advent of advanced automation and artificial intelligence technologies has sparked significant interest in the analysis of the air quality index.
Qiang Zhang et al. [32] investigated a PM2.5 concentration prediction model based on multi-task deep learning, analyzing data collected from various stations. This approach enabled the model to consider features of the time series ranging from overall trends up to high-frequency fluctuations and therefore produced more accurate forecasts than models that take raw data as input.
Yibin Chen et al. [33] proposed an LSTM-based model for predicting PM2.5 concentrations in China. Their model combines meteorological data with multi-task learning to forecast PM2.5 concentrations and meteorological observations jointly. The multi-task framework exploits the intrinsic correlations between air pollution and meteorological factors, which led to higher predictive performance on both tasks. They also adopted a transfer learning technique in which the model was initialized with weights learned on data from other regions, which further improved prediction accuracy.
Duy Tran et al. [34] studied AQI forecasting with an LSTM ensemble model. The multi-LSTM ensemble architecture fed different input feature sets to different LSTM models, covering meteorological data, air pollutant concentrations, and auxiliary data sources such as traffic and industrial emissions. The individual LSTM models were first trained in isolation, and their forecasts were then combined by weighted averaging. This ensemble approach exploits the strengths of the individual models and data sources while mitigating their respective weaknesses.
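The weighted averaging step can be sketched as follows. This is an illustrative example only: the forecasts, the validation errors, and the choice of inverse-error weights are assumptions, since the cited work is described here only as using a weighted average.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model forecasts with a normalized weighted average.

    predictions: one list of forecasts per model, all of equal length.
    weights: one non-negative weight per model.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]
    horizon = len(predictions[0])
    return [
        sum(w * model[t] for w, model in zip(norm, predictions))
        for t in range(horizon)
    ]

# Hypothetical AQI forecasts from three separately trained models.
model_forecasts = [
    [100.0, 110.0],
    [120.0, 130.0],
    [80.0, 90.0],
]
# Assumed weighting scheme: inverse of each model's validation error.
validation_errors = [10.0, 20.0, 20.0]
weights = [1.0 / e for e in validation_errors]

combined = weighted_ensemble(model_forecasts, weights)
```

Models with lower validation error receive larger weights, so a single weak model has limited influence on the combined forecast.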
Taoying Li et al. [19] proposed a deep learning method for AQI forecasting based on a convolutional neural network (CNN)-LSTM model with an attention mechanism. The attention mechanism allowed the network to focus on the most meaningful input features at each step, making the model better able to capture long-term relations and to disregard noise and irrelevant information. The inputs were meteorological data, historical air pollutant concentrations, and relevant auxiliary data (such as land use data and traffic patterns). Through the attention mechanism, the model could selectively concentrate on the most distinctive features, which had a significant impact on prediction precision.
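The idea of weighting inputs by relevance can be illustrated with plain dot-product attention over a sequence of hidden states. This is a generic sketch, not the specific attention design of the cited work; the hidden states and query vector are hypothetical values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden_states, query):
    """Dot-product attention: score each time step against the query,
    normalize the scores, and return the weighted sum of hidden states."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context

# Hypothetical hidden states for three time steps (two features each).
hidden_states = [[0.1, 0.0], [0.9, 0.2], [0.3, 0.8]]
query = [1.0, 0.0]   # assumed query vector emphasizing the first feature

attn_weights, context = attend(hidden_states, query)
```

The time step whose hidden state aligns best with the query receives the largest weight and therefore contributes most to the context vector.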
Xu et al. [18] designed a hybrid model composed of LSTM and support vector regression (SVR) for AQI forecasting in Chengdu, China. The LSTM component models temporal dependencies in the input data, and the SVR component captures the non-linear relationships between the input features and the AQI targets. By combining two models with strong forecasting power, the hybrid aims to improve prediction accuracy over either model used separately.
This study looked at several dimensions of automated systems for predicting the air quality index with a high confidence level. Many authors have examined the growing demand for air quality index analysis and prediction. Whereas most authors focus on general machine learning techniques, we exploited the temporal pattern learning capability of long short-term memory networks. Our investigation shows LSTMs to be a robust forecasting tool for air quality, producing accurate outcomes that help in taking preventive measures to improve the environment and people’s health.
Methodology
We implemented an Improved Long Short-Term Memory (ILSTM) neural network model for predicting various air quality parameters, including AQI, PM10, PM2.5, O3, SO2, NO2, and CO, from historical data. The dataset is cloned from a GitHub repository and unzipped to a local directory. It is then loaded into a Pandas DataFrame, and the ‘datetime’ column is split into ‘date’ and ‘time’ columns. The ‘datetime’ column is converted to the datetime format and set as the index of the DataFrame. New columns ‘year’, ‘month’, ‘day’, and ‘hour’ are created by extracting the respective information from the ‘datetime’ index.
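The datetime handling described above might look like the following in Pandas. This is a sketch under stated assumptions: the three sample rows and the AQI values are invented for illustration, and the real dataset's timestamp format may differ.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset is loaded from file instead.
raw = pd.DataFrame({
    "datetime": [
        "2023-01-01 00:00:00",
        "2023-01-01 01:00:00",
        "2023-01-02 13:00:00",
    ],
    "AQI": [152.0, 148.0, 171.0],
})

# Split the timestamp string into separate 'date' and 'time' columns.
parts = raw["datetime"].str.split(" ", n=1, expand=True)
raw["date"] = parts[0]
raw["time"] = parts[1]

# Convert to a real datetime type and use it as the index.
raw["datetime"] = pd.to_datetime(raw["datetime"])
df = raw.set_index("datetime")

# Derive calendar features from the datetime index.
df["year"] = df.index.year
df["month"] = df.index.month
df["day"] = df.index.day
df["hour"] = df.index.hour
```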
Air quality prediction is an important task that helps mitigate the harmful effects of air pollution on human health and the environment. The interactions among weather patterns, air pollution data, and human behaviour are not fully captured by traditional methods, such as statistical models and numerical simulations, which often fall short of representing the complex non-linear relationships present in air quality data. In contrast, Improved Long Short-Term Memory (ILSTM) networks, a specialized type of recurrent neural network (RNN) [35,36,37,38], have been shown to perform well on time series data with complex dependencies across different timeframes, which makes them well suited to air quality forecasting tasks.
The hourly data used in this study were collected between December 2022 to December April 2024. A total of 8952 records were used for the performance analysis of the proposed system. The data were partitioned into 80% for training and 20% for testing the models.
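The 80/20 partition can be sketched as below. The text does not state whether the split is chronological or random; this example assumes a chronological cut, a common choice for time series because it keeps future records out of the training set. The integer series is a stand-in for the real records.

```python
def chronological_split(records, train_fraction=0.8):
    """Split time-ordered records into an earlier training part and a later test part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]

# Stand-in for the 8952 hourly records, in time order.
n_records = 8952
series = list(range(n_records))

train, test = chronological_split(series, train_fraction=0.8)
```

With 8952 records this yields 7161 training records and 1791 test records.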
Figure 1 provides the flow diagram of the algorithm. The proposed system comprises the following major steps: pre-processing of the raw data, replacement of NULL values by data imputation, calculation of evaluation metrics, feature extraction and selection, and classification and prediction of the AQI.

Fig. 1. Steps used in the proposed algorithm.
The methodology involves data preprocessing, splitting the dataset into training and testing sets, scaling the data, creating input-output sequences, building and training the ILSTM model, generating predictions, and visualizing the results. The dataset is filtered to include only the data for the city of ‘Nagpur’, and unnecessary columns are dropped from the DataFrame. The features extracted in this study were AQI, PM10, PM2.5, O3, SO2, NO2, NH3, Benzene, Toluene, Xylene, and CO, but only a few of them are retained in order to reduce the mathematical burden and the complexity of classification and prediction. Generally, for basic analysis and prediction up to a few days ahead, PM2.5, PM10, and CO are sufficient [39,40,41,42].
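The scaling and input-output sequence steps listed above can be sketched in plain Python. The choice of min-max scaling, the window length of 3, and the sample PM2.5 values are assumptions made for illustration; the text does not specify the scaler or the window size.

```python
def min_max_scale(values):
    """Scale values linearly to the range [0, 1]; also return the bounds
    so that predictions can later be mapped back to original units."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        raise ValueError("cannot scale a constant series")
    return [(v - lo) / span for v in values], lo, hi

def make_sequences(values, window):
    """Build (input window, next value) pairs from a time-ordered series."""
    X, y = [], []
    for start in range(len(values) - window):
        X.append(values[start:start + window])
        y.append(values[start + window])
    return X, y

# Hypothetical hourly PM2.5 readings.
pm25 = [40.0, 50.0, 60.0, 80.0, 100.0, 90.0]

scaled, lo, hi = min_max_scale(pm25)
X, y = make_sequences(scaled, window=3)
```

Each row of `X` holds three consecutive scaled readings and the matching entry of `y` is the reading that follows them. In practice the scaling bounds would be taken from the training portion only, so that test data does not leak into preprocessing.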
The network structure should be arranged to capture the temporal dependencies that occur in air quality data. The number of ILSTM layers and units has a great impact on the accuracy of the predictive model. Once the model architecture and the hyperparameters have been defined [43,44], the ILSTM model can be trained on the previously processed data. The data should be partitioned into training, validation, and test sets to assess model performance properly and reduce overfitting [45,46]. During training, the ILSTM model learns to map the input sequences (historical data and features) to air quality values. Evaluation metrics can be used to assess the model’s performance, with
Performance metrics of LSTM model.
Performance metrics of ILSTM model.
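The metrics named in the abstract (R-square, MSE, RMSE, and MAE) can be computed as in the sketch below. The actual and predicted values are invented for illustration and are not results from this study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R-square for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical actual and predicted AQI values.
actual = [100.0, 120.0, 140.0, 160.0]
predicted = [110.0, 120.0, 130.0, 160.0]

metrics = regression_metrics(actual, predicted)
```

For these sample values the function gives MSE = 50, MAE = 5, and R-square = 0.9.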
For analysis, ILSTM models pre-trained on similar temporal sequences, such as those used in weather and climate modeling, could be fine-tuned or adapted for air quality prediction. Domain adaptation is well suited to re-training pre-trained models and tailoring them to the specific task and domain of air quality prediction. Such techniques include fine-tuning the model on the target dataset, domain-adversarial training, and instance-transfer or feature-transfer methods.
Hyperparameter tuning plays a critical, sometimes decisive, role in the model’s performance. Techniques such as grid search, random search, or Bayesian optimization can search large parameter spaces while avoiding poor configurations. Thorough model training, tuning, and validation must be performed so that the selected model avoids overfitting and generalizes well. Ensemble methods, including model averaging, stacking, and boosting, could be introduced as a more advanced way to further improve model precision. In addition, data from external sources, such as satellites, traffic flow, and socio-economic data, may provide more information and increase the predictive power of the model. Transfer learning and domain adaptation approaches can be deployed alongside current tools to better exploit pre-trained models or knowledge from related domains, thereby accelerating the modelling cycle and improving performance.
To validate the proposed system, box plots of the air quality index are shown in Fig. 2 for one week (7 days) and in Fig. 3 for the 24 hours of a day.
The drift and distribution of the AQI values over 24 hours for each station are shown in Fig. 3, while Fig. 2 reflects the trend of the average daily AQI values across all stations. From these we find that AQI values on weekdays are higher than on weekends, because human activity contributes to the increase in AQI values.
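The grouping behind the weekly and 24-hour box plots can be sketched with the standard library. The synthetic readings below are an assumption made for illustration; in practice each group of values would be passed to a plotting routine to draw one box per weekday or per hour.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, median

def group_by(readings, key):
    """Collect AQI values into groups keyed by a function of the timestamp."""
    groups = defaultdict(list)
    for timestamp, aqi in readings:
        groups[key(timestamp)].append(aqi)
    return dict(groups)

# Synthetic hourly readings for one week, starting on a Monday.
start = datetime(2023, 1, 2, 0, 0)
readings = [
    (start + timedelta(hours=h), 100.0 + (h % 24))
    for h in range(24 * 7)
]

by_weekday = group_by(readings, lambda ts: ts.weekday())  # 0 = Monday
by_hour = group_by(readings, lambda ts: ts.hour)

weekday_median = {day: median(values) for day, values in by_weekday.items()}
hourly_mean = {hour: mean(values) for hour, values in by_hour.items()}
```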

Fig. 2. Box plot depicting AQI variations for one week.

Fig. 3. Box plot depicting AQI variations for 24 hours.

Fig. 4. Correlation between AQI’s air pollutants.
The ILSTM model is built using the training data and evaluated on the testing data by comparing actual and predicted air quality values. In our study, AQI was predicted for January 2023 to September 2023 and achieved a 97.63% confidence interval. The 97.63% confidence level indicates that the future AQI value lies within the stated interval, as shown in Fig. 5 for the prediction of PM10 concentration and Fig. 6 for the prediction of PM2.5 concentration. The results of the proposed method are highly promising. The confidence level was also compared with that of state-of-the-art methods proposed by previous researchers, mostly based on LSTM, CNN-LSTM, and SVR, which achieved a maximum confidence level of only 95.3%, whereas the proposed ILSTM-based method gives 97.63%.

Fig. 5. Prediction of PM10 values.
The test performance of the proposed classification and prediction system was evaluated on 8952 randomly selected samples. Figure 4 shows the correlation matrix for the actual (true) and predicted values of the AQI’s air pollutants. In contrast to the LSTM model, the ILSTM model evaluates more samples in less time without compromising accuracy. The analysis using LSTM and ILSTM models shows that the drift and distribution are more linear for ILSTM, which therefore gives better performance than LSTM.

Fig. 6. Prediction of PM2.5 values.
The ILSTM model’s performance could also be enhanced through curriculum learning, a training strategy that progressively increases the complexity of the training data. By beginning with simple cases and gradually adding more complex ones, the model is likely to learn more resilient representations and to generalize better to unseen cases. While the classical ILSTM model used in the present research handles temporal dependencies robustly, it could be extended to include spatial ones. With spatiotemporal ILSTM architectures such as convolutional LSTMs or graph-based LSTMs, the spatial correlations and interactions between different air quality monitoring locations can be modeled. Such models can capture the transmission routes of air pollutants by taking location-related variation and spatial disparities into account, leading to even more accurate predictions. As LSTM models for air quality prediction become more complex and precise, research problems concerning interpretability, explainability, and uncertainty quantification will arise more often. Much more research is needed on interpreting the representations and decision processes of the ILSTM model, so that users of air quality predictions based on these algorithms can better understand the reasons behind them. As an extension, approaches for measuring and communicating the uncertainty associated with LSTM model outputs might be explored. Methods including Bayesian neural networks, ensemble approaches, or conformal prediction could be applied to estimate uncertainty, which is essential in risk assessment and decision-making for air quality management. Furthermore, shared open science resources could promote more impactful solutions in ILSTM-based air quality prediction.
Collaboration among organizations and research groups through the exchange of data, code, and models allows knowledge transfer, ensures replicability, and supports more stable and generalizable solutions. Challenges, benchmarks, and community initiatives could accelerate innovation and make advanced air quality models useful to, and trusted by, many.
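Of the uncertainty estimation methods mentioned above, split conformal prediction is the simplest to sketch. The example below is an illustration under assumptions, not part of this study: the calibration values, the miscoverage level `alpha`, and the point forecast are all hypothetical, and absolute residuals are used as the nonconformity score.

```python
import math

def conformal_quantile(cal_true, cal_pred, alpha=0.1):
    """Half-width of a split conformal interval with target coverage 1 - alpha,
    using absolute residuals on a held-out calibration set."""
    scores = sorted(abs(t - p) for t, p in zip(cal_true, cal_pred))
    n = len(scores)
    # Finite-sample corrected rank; rounding guards against float noise.
    rank = math.ceil(round((n + 1) * (1 - alpha), 9))
    if rank > n:
        return math.inf   # too few calibration points for this alpha
    return scores[rank - 1]

def prediction_interval(point_forecast, half_width):
    return point_forecast - half_width, point_forecast + half_width

# Hypothetical calibration data: actual AQI and the model's predictions.
cal_true = [100.0, 105.0, 98.0, 120.0, 110.0, 95.0, 130.0, 101.0, 99.0]
cal_pred = [102.0, 100.0, 99.0, 112.0, 113.0, 99.0, 124.0, 101.5, 96.0]

q = conformal_quantile(cal_true, cal_pred, alpha=0.2)
lower, upper = prediction_interval(115.0, q)   # interval for a new forecast of 115
```

The interval width depends only on how large the model's errors were on the calibration set, so the same procedure applies to any point forecaster, whether LSTM or ILSTM.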
