Abstract
Airlines operate in a competitive marketplace and must upgrade their services to meet customer expectations of safety and comfort. Post-pandemic, the government and airlines resumed flights with many restrictions, the impact of which is unexplored. An increasing number of customers use social media to leave reviews, and in this age of Machine Learning (ML), a model that automatically classifies the polarity of flyer sentiments can help airlines upgrade their services. In this work, a custom dataset is scraped from Twitter comprising online reviews of five Indian airlines. Multiclass sentiment analysis is implemented using three classifiers, support vector machine, K-nearest neighbor and random forest, with word2vec and TF-IDF word embeddings. AirBERT, a fine-tuned deep learning attention model based on bidirectional encoder representations from transformers (BERT), is proposed. The results show that among the ML classifiers, Random Forest with TF-IDF performs the best, but AirBERT, trained on a graphics processing unit (GPU) with domain corpora, outperforms all the other models with an accuracy of 91%. Indigo airlines and Jet Airways received the maximum percentage of positive and negative reviews respectively. In a performance comparison with three existing models on the USA airlines tweets dataset, the proposed model outperforms others trained on general domain corpora and matches the accuracy of the state-of-the-art TweetBERTv2 model. The model can be deployed by airlines and other service industries to implement a customer relationship management (CRM) system.
Introduction
Social media networks have revolutionized the way in which people communicate. Online reviews are a rich source of consumer information that can be mined. Airline service providers receive a large volume of client feedback regarding their products and services, and therefore need to analyze it. Traditional techniques for gathering consumer feedback in airline service firms involve organizing and collecting surveys, which is tedious and unreliable. Given the number of passengers that fly daily, it requires a great deal of effort to distribute and collect surveys from consumers, as well as to record and file the responses. Many consumers either don’t bother to fill out surveys or take them lightly, and this results in a lot of noise in the data. When it comes to analyzing airline customer feedback, social media and the web are far superior to surveys.
Passenger sentiment areas during post-pandemic flights
But when the volume of this data from different web sources grows, it gets more difficult to monitor how people’s opinions on the brand are changing over time. With hundreds of daily reviews on social media, news websites, and blogs, handling and tracking these references can be rather difficult for large airlines. Sentiment analysis (SA) can help here. It keeps track of and assesses online references to an airline to show how the online world is responding in real time to its services, offerings, etc. The polarity of a text can be determined and categorized using sentiment analysis. SA is a proven tool whose applications encompass areas like e-commerce, healthcare monitoring, election campaigning and, more so, tourism and hospitality. Sentiment analysis has become more necessary as a result of the rise in unstructured data from social networking sites like Twitter and Facebook that must be evaluated and organized [1]. An airline can access both positive and negative comments made about its brand on social media platforms like Twitter.
Therefore, the primary goal of Twitter sentiment analysis is to ascertain whether a tweet is positive, negative, or neutral. The biggest issues with this, though, are that tweets are typically written in informal language, are brief messages with few sentiment indications, and frequently use acronyms or abbreviations.
Worldwide, the airline industry was hit severely by the COVID-19 pandemic. Since air travel is more a necessity than a convenience in a large country like India, the government and airlines had to restore flights, albeit in a limited way and with various strict measures. Table 1 captures some passenger sentiment areas which the airlines were supposed to cater to for a safe and hassle-free experience. With the help of these initiatives, the Indian aviation sector was able to regain over 95.55% of the pre-COVID daily domestic air passenger traffic in just one year. However, there is no study on how each airline fared on these parameters or on the experience of domestic flyers under the strict regulations that were implemented, such as random COVID checks, touchless check-in, wearing PPE kits inside the cabin, leaving middle seats vacant, and cancellation of in-flight services. Understanding the degree to which travelers were made comfortable by various airlines is vital for the future strategy development of airline companies in a competitive marketplace.
In order to ascertain the significance of both positive and negative online evaluations provided by Indian airline passengers, this study uses sentiment analysis. Analyzing whether customers were satisfied or dissatisfied with the services during the time when flights restarted (June–July 2020) can help determine which airlines performed better. The study aims to give the management of trailing airlines critical insights on overall performance for timely decision-making, so that they can take corrective action in future. A customized dataset is curated from Twitter and various exploratory data analysis tasks are carried out. On the pre-processed data, sentiment analysis is carried out using three popular ML algorithms with two word embeddings, viz. word2vec and TF-IDF. The effectiveness of these algorithms is assessed and the results are compared with the implementation of a fine-tuned, state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) model. The Twitter US Airline dataset is used to test the effectiveness of the fine-tuned model against other cutting-edge implementations.
The methodology adopted in tweets classification.
In this paper, the key contributions from the authors are:
AirBERT, a fast, GPU-trained, fine-tuned domain-specific language representation model trained on Twitter corpora for mining airline tweets, is proposed. AirBERT is evaluated against three ML classification models trained with word2vec and TF-IDF embeddings on a custom Indian airlines review dataset. The results show that AirBERT outperforms all other models. In a comparison with other existing models on the US Twitter airline dataset, the fine-tuned AirBERT matches the performance of the state-of-the-art TweetBERTv2 model. Based on AirBERT’s negative tweet analysis and past customer profile, a customer relationship management model is proposed that can be leveraged by airlines and other service industries.
Following is a breakdown of the remaining sections of this paper: Section 2 discusses related work in this domain. Section 3 discusses the methodology adopted in detail with Section 4 outlining and discussing the results. This is followed by conclusion in Section 5 and references in the end.
Unsupervised and supervised techniques are the primary subcategories of Machine Learning (ML) [2]. These techniques are excellent for sentiment analysis because they can be automated and handle enormous amounts of data.
Generally, ML algorithms like Naive Bayes (NB), Random Forest (RF), Decision Tree (DT) and Support vector machine (SVM) have been used for sentiment prediction and optimization to solve the Twitter sentiment classification problem. While SVM and Multinomial NB have been shown to be better in terms of accuracy and optimization, hierarchical ML techniques only produce middling performance in classification problems. Very few researchers have combined the above algorithms in an ensemble model to predict sentiment with good success [3]. In Table 2, a collection of benchmarks for machine learning approaches is provided in terms of classification accuracy. Similarly, it has been demonstrated that using different feature extraction strategies increases classification accuracy. Among the many approaches available for text mining, TF-IDF, word2vec, and GloVe are among the most popular. It has been demonstrated that TF-IDF performs better when combined with its two modifications, linear discriminant analysis (LDA) and latent semantic analysis (LSA). Large datasets have shown an increase in accuracy with the use of TF-IDF. The combination of TF-IDF and LSA is found to be suitable for smaller datasets.
Sentiment analysis approaches using neural network architectures have appeared over the past few years. These models include convolutional neural networks (CNN), deep neural networks (DNN) and recurrent neural networks (RNN), as shown in Table 3. The compositionality of words is difficult to capture using published sentiment prediction algorithms that use deep CNN and RNN.
Benchmark summary of machine learning techniques
Benchmark summary of neural networks used for sentiment analysis
Benchmark summary of proposed models using BERT
Many neural network architectures find it difficult to extract character-level characteristics and embeddings of complex words, whereas CNNs are more successful at extracting sentence- or word-level features such as morphological tags and stems.
Since the introduction of BERT by Google Research in 2018, several papers using the generic BERT model have appeared, achieving state-of-the-art accuracy on NLP tasks as shown in Table 4.
The overall model process is shown in Fig. 1.
Dataset
A custom dataset for Indian domestic airlines is scraped from Twitter for a specific time interval after the pandemic when flights were resumed. The dataset is scraped using snscrape by providing the required hashtags. It contains tweets for five Indian domestic airlines, i.e., Air India, SpiceJet, IndiGo, Jet Airways and Vistara Airways. Statistics of the dataset are given in the results section.
Data preprocessing
Tweet text may contain noise like special symbols, capital letters, hashtags, numbers, punctuation and URLs. This noise must be filtered out before the tweets are fed to a machine learning algorithm for classification. The Python regular expression (RegEx) package is used for Twitter-specific preprocessing to filter this noise. The next step is to apply NLP operations like tokenization, stop-word removal and lemmatization. The steps involved are briefly explained here. For preprocessing, all words in tweets are converted to lowercase since tweets don’t follow any standard capitalization rules. URL links are removed using regex as they don’t contribute to sentiment analysis. Non-letter characters like punctuation, numbers or hashtags are also removed. The preprocessed tweets are tokenized using TweetTokenizer in NLTK. The tokenizer returns a list of strings for each tweet. It finds word boundaries using spacing and punctuation in the tweets, as shown in Table 5.
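The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the authors' pipeline: the regular expressions, the function names and the sample tweet are assumptions, a plain whitespace split stands in for NLTK's TweetTokenizer, and only the `#` symbol is stripped from hashtags (the hashtag text is kept, on the assumption that this matches the hashtag words reported among the frequent terms).

```python
import re

# Assumed patterns: the paper does not list its exact regular expressions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase, strip URLs, then replace every non-letter character."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Whitespace split of the cleaned tweet (stand-in for TweetTokenizer)."""
    return clean_tweet(text).split()


# Hypothetical tweet, for illustration only.
sample = "Thanks @airindiain for the REFUND!! #VandeBharatMission https://t.co/abc123"
tokens = tokenize(sample)
print(tokens)
```

Removing URLs before non-letter characters matters: if punctuation were stripped first, the URL would break into letter fragments that survive as spurious tokens.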
Tokenized random sample
Since stopwords like ‘the’, ‘a’, ‘on’, ‘is’ etc. are not very useful in sentiment analysis, we remove them. The lemmatization step considers the context and converts each word to its meaningful base form. Finally, the pre-processed data needs to be converted into numerical vectors using techniques like BoW, TF-IDF, word2vec, GloVe etc. In this study, under Phase-I we use the word2vec and TF-IDF word embedding techniques with the three ML algorithms, as other methods don’t preserve the semantics of the sentence. In Phase-II, the same pre-processed data is used as input to train the task-specific (airline tweet sentiment analysis) BERT-based model.
TextBlob has the fundamental components of natural language processing that are utilized to determine the polarity and subjectivity of tweets [30]. The numeric value for polarity describes how much a text is positive or negative. And subjectivity describes how much a text is objective or subjective. We use this function to get a quick sense of the level/extent of polarity (float between [
Data visualization
Exploratory data analysis where the dataset is visualized helps understand its attributes. This includes checking distribution of reviews according to airlines, distribution of tweets on each airline grouped by sentiments, distribution of tweets on each airline grouped by tweet count, density analysis with character counts of tweets, density of character counts of tweets according to sentiment, density of word counts of tweets according to sentiments, density of stop-word counts in tweets according to sentiments, top 20 frequent words used, word cloud visualization and the
Word embedding methods
Two popular word embedding techniques are used in this work, word2vec and TF-IDF.
TF-IDF
The TF-IDF method weighs how often a term occurs in a document against how common that term is across the whole collection of documents. A term’s frequency in a document is captured by TF, while its relative rarity within the collection of documents is captured by IDF. These two numbers are multiplied together to arrive at the final TF-IDF value shown in Eq. (1).
here,
Instead of using DNN, the word2vec algorithm takes in a corpus and produces sets of vectors using shallow 2-layer neural networks.
Word2vec creates a vector for each term, although it could take more effort to combine those vectors into one vector or another format. In contrast, TF-IDF is a statistical measure that may be applied to terms in a document and then used to form a vector. Word2vec also takes into account the corpus’s context, but TF-IDF does not.
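The TF-IDF weighting described above can be illustrated with a short sketch. Since the exact form of Eq. (1) is not reproduced here, the code assumes one common variant: TF is the term count divided by the document length, and IDF is the natural logarithm of the number of documents divided by the number of documents containing the term. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy tokenized tweets, for illustration only.
docs = [
    ["flight", "delayed", "refund"],
    ["flight", "on", "time"],
    ["refund", "refund", "pending"],
]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "delayed" occurs in no other document and so receives a higher weight than "flight", which also appears in the second document. A term present in every document gets a weight of zero under this variant.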
Classifiers used for tweet classification
The dataset is divided into training data (60%), validation data (20%) and testing data (20%). To validate the results obtained from TextBlob, under Phase-I of this study we applied popular ML algorithms like SVM, KNN and RF to the training set for model building. Once the models were trained, they were evaluated on the test set. Under Phase-II, we fine-tune the BERT pre-trained language representation model and compare its performance with these ML classifiers.
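A 60/20/20 split of this kind can be sketched as follows. The function name, the fixed seed and the use of a plain shuffle (rather than a stratified split) are assumptions made for illustration; the record count of 8396 is the dataset size reported in the results section.

```python
import random


def split_dataset(rows, train=0.6, val=0.2, seed=42):
    """Shuffle rows and cut them into train, validation and test parts."""
    rows = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    # Whatever remains after train and validation becomes the test set.
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


# Row indices stand in for the 8396 tweets of the dataset.
train_set, val_set, test_set = split_dataset(range(8396))
print(len(train_set), len(val_set), len(test_set))
```

Keeping the test portion untouched until the very end is what allows it to serve as an unbiased estimate of model performance.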
Support vector machine (SVM)
As depicted in Fig. 2, SVM exploits patterns in the data and acts as a non-probabilistic binary linear classifier [31].
Illustration of support vector machine.
To predict the class, SVM applies the function in Eq. (2) as:
In Eq. (2),
In above Eq. (4), the parameters
In Eqs (5) and (6),
Illustration of KNN algorithm.
KNN is one of the most straightforward ML classification techniques [32]. The classification is based on how the neighboring data points are categorized, as seen in Fig. 3.
KNN categorizes new data points based on their similarity to previously stored data points. Here, the classes are the polarity labels, positive or negative.
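The neighbor-based vote described above can be sketched in a few lines. This is an illustrative toy, not the configuration used in the study: cosine similarity, k = 3 and the two-dimensional vectors are all assumptions, and in practice the vectors would be the TF-IDF or word2vec representations of tweets.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query by majority vote among its k most similar neighbors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D tweet vectors with their polarity labels.
train_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
train_labels = ["positive", "positive", "negative", "negative"]
prediction = knn_predict(train_vectors, train_labels, [0.8, 0.3])
print(prediction)
```

An odd k avoids tied votes in the two-class case.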
Illustration of random forest algorithm.
Random forest is a supervised ML algorithm used to solve classification and regression problems. It builds decision trees (DT) on several samples of the data and combines their outputs, using the majority vote for classification and the average for regression, as illustrated in Fig. 4.
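The aggregation step of a random forest can be shown in isolation. The sketch below covers only how tree outputs are combined, not how the trees are grown on bootstrap samples; the three hand-written decision stumps and their feature names are hypothetical stand-ins for trained trees.

```python
from collections import Counter
from statistics import mean


def forest_classify(trees, x):
    """Classification: return the label predicted by most trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


def forest_regress(trees, x):
    """Regression: return the average of the tree outputs."""
    return mean(tree(x) for tree in trees)


# Hypothetical stumps, each testing a single word-count feature.
stumps = [
    lambda x: "negative" if x["refund"] > 0 else "positive",
    lambda x: "positive" if x["thanks"] > 0 else "negative",
    lambda x: "negative" if x["delayed"] > 0 else "positive",
]
label = forest_classify(stumps, {"refund": 1, "thanks": 1, "delayed": 0})
print(label)
```

Here two of the three stumps vote "positive", so the forest outputs "positive" even though one tree disagrees.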
BERT is essentially the encoder stack of the Transformer architecture, whereas the full Transformer design comprises an encoder and a decoder stack, hence the name encoder-decoder architecture. The two variants, BERT-base and BERT-large, differ in architectural complexity. As shown in Fig. 5, the large version has 24 encoder layers, whereas the base model has 12. In this study we use BERT-base, which has 110 million trainable parameters, and implement it in Google Colab under Phase-II. BERT uses the Transformer, an attention mechanism that learns contextual relations between words in a text (here, a tweet).
BERT models.
The Transformer consists of two mechanisms: an encoder, which reads the input and generates a language model, and a decoder, which produces the prediction (here, the sentiment class). One of the key advantages of the Transformer encoder is that it reads the entire sequence of words at once, in contrast to directional models, which read the text input either left-to-right or right-to-left. This allows the encoder to learn the context of a word based on all of its surroundings.
BERT pre-training uses a technique called next sentence prediction. The model receives a pair of sentences as input and learns to predict whether the second sentence is the subsequent sentence in the original corpus. During training, in 50% of the input pairs the second sentence is the actual subsequent sentence, while in the other 50% it is a random sentence from the corpus that is not connected in any way to the first. To help the model distinguish the two sentences, a [CLS] token is added at the beginning of the input and a [SEP] token at the end of each sentence. Sentence embeddings (identifying sentence A or B) and positional embeddings (indicating the position in the sequence) are also added. To predict whether the second sentence is connected to the first, the full input sequence is passed through the Transformer model, the output of the [CLS] token is transformed into a vector by a simple classification layer, and the probability that it is the next sentence is calculated using softmax.
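The final step described above, a simple classification layer followed by softmax on the [CLS] output, can be sketched as below. The sketch assumes a single linear layer; the 4-dimensional [CLS] vector, the weights and the bias are made-up numbers for illustration (BERT-base uses a hidden size of 768), and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias):
    """Linear layer on the [CLS] vector followed by softmax."""
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Made-up values: one weight row per output class.
cls_vector = [0.2, -0.1, 0.4, 0.3]
weights = [[1.0, 0.0, 1.0, 0.5], [-1.0, 0.5, -0.5, 0.0]]
bias = [0.1, 0.0]
probs = classify(cls_vector, weights, bias)
print(probs)
```

The same head shape applies whether the two classes are "is next sentence / is not" during pre-training or "positive / negative" during fine-tuning; only the learned weights differ.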
The architecture of the proposed AirBERT classification model.
Figure 6 shows a high-level architecture of the AirBERT Transformer encoder model. BERT receives a sequence of tokens as input and passes it up through the stack of encoder units. The input is a sequence of airline tweet tokens, which are first embedded into vectors and then processed by the neural network. Self-attention is applied in each encoder layer, and the output of the layer's feed-forward network is fed into the next encoder. For each task, BERT uses a fine-tuning strategy that does not require any task-specific architecture. The output is a sequence of vectors of size H, in which each vector corresponds to the input token with the same index. Cutting-edge performance can be obtained by fine-tuning a pre-trained BERT model with just one additional classification layer on top of the Transformer output for the [CLS] token.
Unsatisfactory outcomes are obtained when language models trained on corpora from general domains, like plain BERT, or from other domains, like BioBERT, are applied to airline tweets. A pre-trained language representation model called AirBERT is therefore proposed to mine airline reviews on Twitter. The BERT architecture is fine-tuned using the training set data.
The data is initially organized in accordance with the required format. As explained below, BERT layers receive three input arrays: input_ids, attention_mask, and token_type_ids.
input_ids – This is a list of integers, each uniquely tied to a specific token.
attention_mask – This is a list of 1s and 0s that corresponds position by position to the IDs in the input_ids array, with 1 marking a real token and 0 marking padding.
token_type_ids – This is used for tasks such as classifying sequence pairs or answering questions, which require two distinct sequences to be stored in the same input_ids array. To separate the sequences, special tokens like classifier ([CLS]) and separator ([SEP]) are used.
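To make the three arrays concrete, the sketch below builds them by hand for a single tweet. It is a toy illustration, not the tokenizer used in the study: the vocabulary, its integer IDs and the function name are hypothetical, and a real BERT tokenizer additionally splits words into WordPiece sub-tokens.

```python
def encode(tokens, vocab, max_length):
    """Build input_ids, attention_mask and token_type_ids for one sequence."""
    # Reserve two positions for the special tokens and truncate the rest.
    pieces = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in pieces]
    attention_mask = [1] * len(input_ids)  # 1 marks a real token
    padding = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding  # 0 marks padding
    # A single-sequence input belongs entirely to segment 0.
    token_type_ids = [0] * max_length
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


# Toy vocabulary; real BERT vocabularies assign different IDs.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "flight": 4, "delayed": 5}
encoded = encode(["flight", "delayed", "again"], vocab, max_length=8)
print(encoded)
```

The word "again" is missing from the toy vocabulary and therefore maps to the [UNK] ID, while the three trailing positions are padding that the attention mask tells the model to ignore.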
Fine-tuned weights and other parameters in AirBERT
The encode_plus function of the tokenizer class tokenizes the raw input, adds the aforementioned special tokens, and pads the vector to the maximum length provided. Our raw data is converted by a helper function into the proper format so that it can be fed into the BERT model. We divide the data into training data (60%) for training the model, validation data (20%) for model evaluation and testing or holdout data (20%) to provide an unbiased estimate of model performance. The model is initialized for sentiment analysis and the number of epochs is set to 2, as more epochs may give rise to overfitting and also increase the training time. Model training takes around 1.5 hours on an NVIDIA K40 GPU.
The AirBERT fine-tuned weights and parameters are shown in Table 6. In brief, the model used is BERT-base-cased, the tokenizer is BERT-base-cased, the optimizer is Adam, the number of layers is 12 and the hidden size is 768. We fine-tune the pre-trained model to improve the overall accuracy.
The performance parameters used are precision, recall, accuracy and F1-score for all the proposed models.
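For reference, the four metrics can be computed from predicted and true labels as sketched below, using their standard definitions for a single positive class. The function name and the five example labels are hypothetical and are not results from this study.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Precision, recall, F1 and accuracy with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": correct / len(pairs),
    }


# Made-up labels, for illustration only.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]
report = binary_metrics(y_true, y_pred)
print(report)
```

Calling the function once with each class as `positive` and averaging the two F1 values gives a class-averaged score of the kind reported when positive and negative results are combined.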
The experimental results and empirical evaluation are discussed in the next section.
Dataset statistics
(a) Initial sample of Indian-Airlines-Data, and (b) distribution of tweets during the chosen period (Jun–Jul’20).
Understanding of the data is carried out using exploratory data analysis, results of which are shared as below.
Indian Airline review dataset
Sentiment distribution in online reviews
The dataset named Indian-Airline-Data provides online customer reviews for five Indian airlines across all domestic routes from June 2020 to July 2020. An initial sample of the dataset is shown in Fig. 7a. It contains 8396 records (rows) and 7 columns: date, user_id, retweet_count, like_tweet, tweet_count, tweet_location and tweet_content.
The total number of reviews received daily during the said period is shown in Fig. 7b. On an average
The dataset statistics are shown in Table 7 and the sentiment distribution in online reviews in Table 8.
The percentage distribution of online reviews according to each airline in the total dataset is shown in Fig. 8a. Air India has the maximum reviews at 65%, followed by SpiceJet at 14%, Indigo and Jet Airways both at 8% and Vistara with 7% of the total reviews.
Figure 8b shows the ‘Top 20’ frequent words used. Apart from airline names, it is observed that ‘hardeeppuri’ (the civil aviation minister), ‘refund’, ‘repartiation’ and ‘vandebharatmission’ (for bringing back civilians stuck abroad during COVID-19 travel restrictions), ‘jetrevival’ and ‘makejetflyagain’ feature prominently.
Figure 8c displays the
Sentiment extraction
Figure 9a shows the output parameters for the dataset obtained using the TextBlob sentiment function, which outputs numeric polarity and subjectivity values. It also makes it possible to plot the histogram of tweet polarity, the histogram of subjectivity and the polarity vs. subjectivity visualization shown in Fig. 9b–d respectively.
(a) Daily distribution of number of reviews during the review month, (b) top 20 most frequent words used in reviews, and (c) 
TextBlob outputs. (a) Polarity and subjectivity in the reviews received, (b) histogram of positive polarity, (c) histogram of negative polarity, (d) polarity vs. subjectivity visualization, (e) class distribution of the three sentiments category.
TextBlob outputs. (f) total number of reviews for each airline by sentiment.
Density analysis using TextBlob. (a) Density of character counts of reviews, (b) density of character counts of reviews according to sentiments. (c) density of word count of reviews according to sentiments, (d) density of stop-word count of reviews according to sentiments.
Figure 9e shows the class distribution of the three sentiments using TextBlob function where the positive class percentage is 31%, negative is 21% and 48% is neutral or mixed class. In terms of number of tweets, the function returns good accuracy for positive class but returns poor accuracy for both negative and neutral tweets. This may be due to it considering subjectivity parameter while returning the results.
Experimental results across all models (best results in bold, values in %)
Cross comparison of classification results with existing state-of-the-art models (best results in bold, values in %)
Word cloud visualization for (a) positive reviews, (b) negative reviews.
To determine which airline has received the most positive feedback, we also analyze the sentiment of each review from all five airlines. Figure 9f shows the total number of reviews for each airline by sentiment using TextBlob. From figure, it is observed that overall Air India received the maximum number of reviews, followed by SpiceJet, Indigo, Jet Airways and Vistara.
Considering the absolute number of reviews received by each airline and calculating the percentage positive (% pos) and percentage negative (% neg) and percentage mixed/neutral (% Mix/Neu) sentiments, it is observed that Indigo airlines received the maximum percentage of positive reviews and Jet Airways received the maximum percentage of negative reviews during this period. From this, we can conclude that Indigo is the most customer friendly airline considering all the domestic routes.
In the density analysis using the TextBlob function, the density of character counts is shown in Fig. 10a. A peak is observed at a count of 175 characters per tweet. The density of character counts of tweets according to sentiment is compared in Fig. 10b, where the green curve indicates positive, red negative and blue neutral. It is seen that the maximum density of positive and negative reviews lies in the range of 175–200 characters. As seen from Fig. 10c, the density of word counts in tweets according to sentiment peaks at a count of 30 words.
The density of stop-word counts for tweets according to sentiments in Fig. 10d shows that density is highest initially with up to 4 stopwords used and the density thereafter decreases.
(a) Comparison of classifier accuracy with word2vec and TF-IDF vs. BERT, (b) AirBERT model training time statistics.
A customer relationship management system based on AirBERT.
A word cloud is often employed to convey the subject and attributes of a piece of writing. By showing the content in relation to its distribution, the word cloud gives a more accurate picture of the text’s properties. Figure 11a displays the word cloud for positive reviews. Apart from ‘air india’, the top 5 positive words are ‘ticket’, ‘time’, ‘help’, ‘vandebharatmission’ and ‘refund’. Figure 11b shows the word cloud for negative reviews. Apart from ‘air india’, the top 5 negative words are ‘repatriation flight’, ‘jetrevival’, ‘long haul’, ‘airbus330 pilots’, and ‘capability’.
From the implementation of Phase-I and Phase-II models on customized dataset, ignoring the mixed/neutral reviews, Table 9 shows the experimental results returned by various classifiers and the AirBERT model for sentiment classification. With the AirBERT-model, we obtained an average F1-score of 91.5%, combining both the positive and negative class followed by RF TF-IDF combination with 84.5% and SVM TF-IDF with 83%. The reason for AirBERT’s superior performance can be attributed to the large training sets (
From the Phase-I implementation, it is observed that RF with TF-IDF performed the best with an F1-score of 84.5%, followed closely by SVM with TF-IDF with an F1-score of 83%.
K-Nearest neighbor was the least performing algorithm with an F1 score of 62% and 64% with word2vec and TF-IDF respectively. Thus, TF-IDF scores over word2vec with AirBERT outperforming all models as shown in Fig. 12a. Also, the time taken to train AirBERT is lower by a factor of 4x by using a CPU
Comparing the accuracy of TextBlob sentiment function and AirBERT, it is observed that TextBlob accuracy for positive tweets is high but very poor for negative class. AirBERT accuracy is a high 91% for both positive and negative class proving its efficacy.
It is worthwhile to note that any algorithm that relies on matrix multiplication and convolution can be accelerated using a GPU. And a consumer grade GPU with good amount of memory is today available at a fraction of a cost. Though the training time cost for AirBERT is significantly lower than a CPU by 4x but still heavy (
Comparison with existing models
We compare the performance of the fine-tuned AirBERT model with four other state-of-the-art models using BERT on the US Airline dataset, as shown in Table 10. The US airline dataset contains 14640 tweets and is more polished than our dataset. It is observed that the proposed fine-tuned, pre-trained AirBERT model matches the performance of the state-of-the-art TweetBERTv2 model with an accuracy of 92.56% vs. 92.99%. It outperforms all other models that are trained on general domain corpora, like BERT.
Customer relationship management (CRM) system
There are many practical applications of the proposed model. Based on AirBERT’s negative tweet analysis and past customer profile, a high-level use-case for its deployment is a CRM system as shown in Fig. 13. This can be leveraged by airlines and other service industries to win back customer loyalty in case of any deficiency in service, so that potential issues are nipped in the bud. It will also help in upselling to happy customers and in reducing churn. It can also help detect changes in the overall opinion towards an airline
However, there are some limitations of airline tweet sentiment analysis in general and of the proposed model in particular. Neutral sentiments have not been considered; these sometimes have a negative tilt and can impact results considerably. Sentiments from data outside the review domain are sometimes classified incorrectly, as the model has been trained in only one particular domain. Finally, the model is trained only on English-language tweets. To be robust for global airlines, what is needed are trained models that can transfer what they have learnt from one language to another.
Conclusion
Twitter is a popular social networking platform with rich data; however, it can be challenging to analyze its information. Applying language models trained on generic corpora, like plain BERT, or on corpora from other fields, like the biomedical BioBERT, to tweets, which are often written in an informal way, frequently delivers poor results. To mine airline reviews, AirBERT, a fine-tuned airline domain-specific language representation model trained on Twitter corpora, is proposed. A custom dataset is curated for five airlines for the period immediately after the resumption of flights after the pandemic to check customer sentiments. The results show that the GPU-trained AirBERT outperforms all other ML models, returning an accuracy of 91%. SVM with the TF-IDF word embedding technique outperforms all other ML classifiers over the word2vec technique with an accuracy of 87%. Indigo airlines emerged as the airline with the most satisfied customers according to reviews during this period. In a comparison with three existing models on the common USA airlines tweets dataset, AirBERT outperforms other models trained on general domain corpora, like BERT, and matches the state-of-the-art TweetBERTv2 model with an accuracy of 92.56%. The model can be deployed by airlines and other service industries to build a CRM system to win back customer loyalty. As part of future research, since reviews for only a month were considered, the model can be made more domain-specific by training it on larger datasets.
Secondly, research trends are increasingly turning towards topic modeling, a technique that uses observable words and hidden meanings to derive latent themes and extract them for a better understanding of the entire pattern, and towards explainable AI (XAI), which can be explored with methods like local interpretable model-agnostic explanations (LIME). Together these can go beyond the concept of positive, negative, or neutral sentiments to comprehend conversations and what they reveal about customers.
