AirBERT: A fine-tuned language representation model for airlines tweet sentiment analysis

Abstract

Airlines operate in a competitive marketplace and must upgrade their services to meet customer expectations of safety and comfort. Post-pandemic, the government and airlines resumed flights with many restrictions, the impact of which is unexplored. An increasing number of customers use social media to leave reviews, and a machine learning (ML) model that automatically classifies the polarity of flyer sentiment can help airlines improve their services. In this work, a custom dataset is scraped from Twitter, comprising online reviews of five Indian airlines. Multiclass sentiment analysis is implemented using three classifiers, support vector machine, K-nearest neighbor and random forest, with word2vec and TF-IDF word embeddings. AirBERT, a fine-tuned deep learning attention model based on Bidirectional Encoder Representations from Transformers (BERT), is proposed. The results show that, among the ML models, random forest with TF-IDF performs best, but AirBERT, trained on a graphics processing unit with domain corpora, outperforms all the other models with an accuracy of 91%. Indigo airlines and Jet Airways received the highest percentage of positive and negative reviews, respectively. In a performance comparison with three existing models on the USA airlines tweets dataset, the proposed model outperforms the others trained on general domain corpora and matches the accuracy of the state-of-the-art TweetBERTv2 model. The model can be deployed by airlines and other service industries to implement a customer relationship management (CRM) system.

Keywords

Twitter, sentiment analysis, machine learning, natural language processing, deep learning, classification

1. Introduction

Social media networks have revolutionized the way in which people communicate. Online reviews are a rich source of consumer information that can be mined. Airline service providers receive a large volume of client feedback regarding their products and services and therefore need to analyze it. Traditional techniques for gathering consumer feedback in airline service firms involve organizing and collecting surveys, which is tedious and unreliable. Given the number of passengers that fly daily, it takes a great deal of effort to distribute and collect surveys and to record and file the responses. Many consumers either do not bother to fill out surveys or take them lightly, which results in a lot of noise in the data. For analyzing airline customer feedback, social media and the web are therefore a far richer source than surveys.

Table 1
Passenger sentiment areas during post-pandemic flights

Flight services	Measurement criteria	Need/necessity
Flights schedule	On-time schedule	Management of time
Security checks	Identity, fever & vaccination status checks	Hassle-free entry
	Screening	Hassle-free entry to boarding gates
In-flight services	Childcare during COVID flight period	Offer a problem free and happy journey
	Crew friendliness	Understanding and meeting client requirements
	Comfortable seat, PPE kits, disposal etc.	For a comfortable journey
	Neat cabin	Offer healthy atmosphere
	Hygienic toilets	Offer hygienic atmosphere
	In-flight catering	Offer nutritious, tasty and timely food
	Providing pillow, blankets	Offer cozy atmosphere
	In-flight AC temp. control	Offer pleasant atmosphere
Luggage services	Luggage handling	Handle luggage with care.
	Luggage delivery	Avoid luggage loss and delay.
	Luggage security	Secure customer’s belongings
Backoffice operations	COVID-related airport and flight safety communication	For Proactive preparation of flyers

But as the volume of this data from different web sources grows, it becomes more difficult to monitor how people's opinions of the brand are changing over time. With hundreds of daily reviews on social media, news websites, and blogs, handling and tracking these references can be difficult for large airlines. Sentiment analysis (SA) can help here. It tracks and assesses online references to an airline to show how the online world is responding in real time to its services and offerings. The polarity of a text can be determined and categorized using sentiment analysis. SA is a proven tool whose applications span areas like e-commerce, healthcare monitoring and election campaigning, and it is especially relevant to tourism and hospitality. Sentiment analysis has become more necessary as a result of the rise in unstructured data from social networking sites like Twitter and Facebook that must be evaluated and organized [1]. An airline can access both positive and negative comments made about its brand on social media platforms like Twitter.

Therefore, the primary goal of Twitter sentiment analysis is to ascertain whether a tweet is positive, negative, or neutral. The biggest difficulty, though, is that tweets are typically written in informal language, are brief messages with few sentiment indications, and frequently use acronyms or abbreviations.

Worldwide, the airline industry was hit severely by the COVID-19 pandemic. Since air travel is more a necessity than a convenience in a large country like India, the government and airlines had to restore flights, albeit in a limited way, by implementing various strict measures. Table 1 captures some passenger sentiment areas that the airlines were expected to address for a safe and hassle-free experience. With the help of these initiatives, the Indian aviation sector was able to regain over 95.55% of the pre-COVID daily domestic air passenger traffic in just one year. However, there is no study of how each airline fared on these parameters, or of the experience of domestic flyers under the strict regulations that were implemented, such as random COVID checks, touchless check-in, wearing PPE kits inside the cabin, leaving middle seats vacant and cancellation of in-flight services. Understanding the degree to which travelers were made comfortable by the various airlines is vital for the future strategy development of airline companies in a competitive marketplace.

This study uses sentiment analysis to ascertain the significance of both positive and negative online evaluations provided by Indian airline passengers. Analyzing whether customers were satisfied or dissatisfied with the services during the period when flights restarted (June–July 2020) can help determine which airlines performed better. The aim is to give the management of trailing airlines critical insights into overall performance for timely decision-making, so that they can take corrective action for the future. A customized dataset is curated from Twitter and various exploratory data analysis tasks are carried out. On the pre-processed data, sentiment analysis is carried out with three popular ML algorithms and two word embeddings, viz. word2vec and TF-IDF. The effectiveness of these algorithms is assessed and the results are compared with those of a fine-tuned, state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) model. The Twitter US Airline dataset is used to test the effectiveness of the fine-tuned model against other cutting-edge implementations.

Figure 1.

The methodology adopted in tweets classification.

In this paper, the key contributions from the authors are:

AirBERT, a GPU-trained fast, fine-tuned domain specific language representation model trained on Twitter corpora for airline tweets mining is proposed.

AirBERT is evaluated against three ML classification models trained with word2vec and TF-IDF embeddings on a custom Indian airlines review dataset. From results, AirBERT outperforms all other models. Comparing it with other existing models on the US Twitter airline dataset, the fine-tuned AirBERT matches the performance of state-of-the-art TweetBERTv2 model.

Based on AirBERT’s negative tweet analysis and past customer profile, a customer relationship management model is proposed that can be leveraged by airlines and other service industry.

Following is a breakdown of the remaining sections of this paper: Section 2 discusses related work in this domain. Section 3 discusses the methodology adopted in detail with Section 4 outlining and discussing the results. This is followed by conclusion in Section 5 and references in the end.

2. Literature review

Unsupervised and supervised techniques are the primary subcategories of Machine Learning (ML) [2]. These techniques are excellent for sentiment analysis because they can be automated and handle enormous amounts of data.

Generally, ML algorithms like Naive Bayes (NB), Random Forest (RF), Decision Tree (DT) and Support Vector Machine (SVM) have been used for sentiment prediction and optimization in Twitter sentiment classification problems. While SVM and Multinomial NB have been shown to be better in terms of accuracy and optimization, hierarchical ML techniques produce only middling performance in classification problems. Very few researchers have combined the above algorithms in an ensemble model to predict sentiment with good success [3]. Table 2 provides a collection of benchmarks for machine learning approaches in terms of classification accuracy. Similarly, it has been demonstrated that using different feature extraction strategies increases classification accuracy. Among the many approaches available for text mining, TF-IDF, word2vec, and GloVe are among the most popular. It has been demonstrated that TF-IDF performs better when combined with its two modifications, linear discriminant analysis (LDA) and latent semantic analysis (LSA). Accuracy has been shown to increase when TF-IDF is used on large datasets, while the combination of TF-IDF and LSA is found to be suitable for smaller datasets.

Sentiment analysis using neural network architectures has appeared over the past few years. These models include the convolutional neural network (CNN), deep neural network (DNN) and recurrent neural network (RNN), as shown in Table 3. The compositionality of words is difficult to capture with published sentiment prediction algorithms that use deep CNNs and RNNs.

Table 2
Benchmark summary of machine learning techniques

Author(/s) (year of pub)	Dataset used	ML models used	Accuracy of classification (% age)
Khan and Urolagin (2018) [4]	Twitter data from airlines in four regions-India, Europe, America and Australia	RF, DT and Logistic Regression (LR)	RF-99.05, DT-98.97 and LR-91.10
Kaur and Malik, 2022 [5]	Twitter US airline data	SVM, NB, RF and Ensemble	SVM-90, Ensemble-98
Tusar and Islam, 2021 [6]	Twitter US airline sentiment	Bag-of-Words and TF-IDF, SVM, LR, Multinomial NB and RF	SVM and LR with Bag-of-Words technique – 77
Veera Kumari and Prajna, 2021 [7]	Twitter US airline dataset	RF, LR, K-Nearest Neighbors (KNN), NB, DT, Extreme Gradient Boost (XGB)	RF-74.93, KNN-69.66, LR-77.27, SVM-65.47, Gaussian NB-41.15, Extreme Gradient Boosting-71.72, Stochastic Gradient Descent-74.86, DT-67.92, Ensemble bagging classifiers outperform non-bagging classifiers in terms of accuracy.
Soni et al., 2020 [8]	Twitter Dataset for US Airline Sentiments	NB classifier with inputs of different sizes	Different size input gives different accuracies. NB-96 for 1 MB data
Kang et al., 2021 [9]	US airline twitter dataset	Multinomial NB based on various n-gram models	1-gram: 70.4, 77.7, 78.2, 77.9 2-gram: 53.4, 64.9, 66.9, 65.6 3-gram: 41.4, 61.3, 65.8, 59.1 (Acc, Prec, Rec, F1-score)
Rustam et al., 2019 [10]	US airline twitter dataset	Voting Classifier based on LR TF-IDF and word2vec used for feature extraction	VC – 78.9 and 79.1 with TF and TF-IDF
Dutta Das et al., 2017 [11]	US airlines Tweets data	NB	United Airlines received 1298 tweets, of which 395 were classified as negative, 187 as neutral, and 716 as positive.
Sreeja et al., 2020 [12]	Indian Airlines data from Twitter using APIs	EDA and data visualization NRC Emotion lexicon for the classification of the tweets into eight emotions	Tweets are categorized into multiple emotions to understand customer satisfaction
Kwon et al., 2021 [13]	More than 14,000 online reviews of 27 airlines collected from Skytrax (airlinequality.com)	Topic modelling using LDA, ‘tidytext’ library for sentiment analysis	16 words were derived to indicate positive and negative which will help for service employee training
Vadivukarassi et al., 2018 [14]	CrowdFlower Twitter US Airline Sentiment dataset containing 14640 tweets available on Kaggle	LR, KNN, SVC, DT, RF, AdaBoost and GaussianNB	LR – 64.51, KNN – 58.91 SVC – 64.51, DT – 75.88, RF – 81.35, AdaBoost Classifier – 78.55, GaussianNB Classifier – 57.24
Adeborna and Siau, 2014 [15]	US Airlines – AirTran, Frontier, SkyWest twitter data	NB Algorithm, Correlated Topics Models (CTM) with Variational Expectation Maximization (VEM) algorithm	Polarity classification accuracy of 86.4%
Wan and Gao, 2015 [16]	Twitter API to extract 107866 tweets – North American airline service brands	10-fold cross validation, Lexicon-based NB, Bayesian Network, SVM, DT, RF, Ensemble model	Lexicon Based – 67.9, NB – 90.0, Bayesian Network 91.4, SVM 84.6, C4.5 86.0, RF 89.8, Ensemble 91.72
Verma and Davis, 2021 [17]	3000 reviews of 16 airlines from trip advisor and airline rating platform	Ensemble learning	ROC-AUC results for boosting algorithms like XGBOOST and ensemble learning methods like Voting Classifier vary from 71 to 94.7 % and 73 to 94.8 %, respectively.

Table 3

Benchmark summary of neural networks used for sentiment analysis

Author(/s) (year of pub)	Dataset used	ML models used	Accuracy of classification (% age)
Kumar and Zymbler, 2019[18]	Tweepy package is used to download tweets from Twitter server.	SVM, ANN and CNN. GloVe dictionary approach to prepare the data set for analysis	SVM-76.5, ANN-79.4 and CNN-92.3
AlBadani et al., 2022 [19]	Twitter US Airlines dataset	Universal language model fine-tuning with SVM	SVM-78.5, BoW + SVM-78.5, Deep learning (DL) model with Dropouts in Keras-77.9, SIS-ULMFiT – 84.1, ULMFiT + SVM – 99.78
Manchikanti and Madhurika, 2020 [20]	Twitter data	Simple RNN, LSTM and Stacked LSTM	Simple RNN-86, LSTM-89.4, Stacked LSTM-91
Bezek and Shams, 2020 [21]	Tweets UCI dataset	Keras, RF, GB, DT, SVM, GNB	Keras-92, RF-80, GB-78, DT-75, SVM-69, GNB-60
Ouyang et al., 2015 [22]	rottentomatoes.com (contains movie review excerpts)	CNN	With a 45.5 percent accuracy, the new model fared better than the earlier models.
Hasib et al., 2021 [23]	CrowdFlower Twitter US Airline Sentiment dataset containing 14640 tweets available on Kaggle	TF-IDF with 4-layer DNN + CNN (Meta Data + TF-IDF + Trainable Embedding)	Accuracy of 91%
Wang et al., 2018 [24]	50000 English movie reviews from IMDB as first dataset, 12000 Chinese movie review comments from Douban as second dataset.	LSTM + Word2Vec word embedding	LSTM with an F-measure of 0.859 on first dataset, 0.754 on second dataset

Table 4

Benchmark summary of proposed models using BERT

Author(/s) (year of pub)	Dataset used	DL models used	Accuracy of classification (% age)
Xie et al., 2021 [25]	Twitter comments on six USA airlines	BERT, Embeddings from Language Model (ELMo)	BERT F1-score$_{\text{micro}}$ ranges from 71.28 to 84.96 and for ELMo from 66.33 to 82.3. BERT F1-score$_{\text{macro}}$ ranges from 67.98 to 78.17 and for ELMo from 59.94 to 72.83. BERT F1-score$_{\text{weighted}}$ ranges from 70.52 to 84.78 and for ELMo from 65.6 to 80.22.
Heidari and Rafatirad, 2020 [26]	Online reviews from six online social platforms – Google flight, Kayak flight, Skyscanner, Twitter flight reviews, Hotels.com, Trip advisor	BERT + CNN	CNN without BERT sentiment scores: F1-score 0.81, MCC 0.76. CNN without Best Flight information: F1-score 0.84, MCC 0.79. BERT + CNN: F1-score 0.92, MCC 0.88
Kang et al., 2021 [9]	Twitter US airlines from Kaggle.	RF, Multinomial NB, Linear SVM, Ensemble Method, Bi-LSTM and BERT	BERT-86 (highest)
Abdul Qudar and Mago, 2020 [27]	All domains, Twitter US Airline dataset	BERT, TweetBERTv1, TweetBERTv2	BERT-85.2, TweetBERTv1-89, TweetBERTv2-92.99
Hasib, 2022 [28]	Bangladesh Airlines reviews used for a multiclass sentiment analysis and compared with US airlines Tweets data	3 ML algorithms – DT, RF, and XGBoost; 3 DL algorithms – CNN, LSTM, BERT	Best accuracy 83% by BERT.
Alqahtani, 2021 [29]	CrowdFlower Twitter US Airline Sentiment dataset containing 14640 tweets available on Kaggle	TF-IDF and N-gram for feature extraction. Glove and word2vec word embedding, CNN, XL-NET, BERT, ALBERT classification models	NB and LR with Spacy & Bigram tokenizer gave highest accuracy of 82.05 and 82.83. On CNN model, obtained an accuracy of 89.11 in BERT

Many neural network architectures find it difficult to extract character-level characteristics and embeddings of complex words, whereas CNNs are more successful at extracting sentence- or word-level features such as morphological tags and stems.

Since BERT was introduced by Google Research in 2018, several papers using the generic BERT model have appeared, achieving state-of-the-art accuracy on NLP tasks, as shown in Table 4.

3. Methodology

The overall model process is shown in Fig. 1.

3.1 Dataset

A custom dataset covering Indian domestic airlines is scraped from Twitter for a specific time interval after the pandemic, when flights were resumed. The dataset is scraped using snscrape by providing the required hashtags. It contains tweets for five Indian domestic airlines, i.e., Air India, SpiceJet, IndiGo, Jet Airways and Vistara Airways. Statistics of the dataset are given in the results section.

3.2 Data preprocessing

Tweet text may contain noise like special symbols, capital letters, hashtags, numbers, punctuation and URLs. This noise must be filtered out before the tweets are fed to a machine learning algorithm for classification. The Python regular expression (RegEx) package is used for Twitter-specific preprocessing to filter this noise. The next step is to apply NLP operations like tokenization, stop-word removal and lemmatization. The steps involved are briefly explained here. All words in tweets are converted to lowercase, since tweets do not follow any standard capitalization rules. URL links are removed using regex as they do not contribute to sentiment analysis. Non-letter characters like punctuation, numbers or hashtags are also removed. The preprocessed tweets are tokenized using TweetTokenizer in NLTK. The tokenizer returns a list of strings for each tweet. It finds word boundaries using spacing and punctuation in the tweets, as shown in Table 5.

Table 5
Tokenized random sample

Text	Tokenized
@jetairways Great flight 555, helpful crew!	[great, flight, helpful, crew]
@indigoairlines, need wheelchair, have a senior citizen	[need, wheelchair, senior, citizen]

Since stopwords like ‘the’, ‘a’, ‘on’, ‘is’ etc. are not very useful in sentiment analysis, we remove them. The lemmatization step considers the context and converts each word to its meaningful base form. Finally, the pre-processed data needs to be converted into numerical vectors using techniques like BoW, TF-IDF, word2vec, GloVe etc. In this study, under Phase-I we use the word2vec and TF-IDF word embedding techniques with the three ML algorithms, as other methods do not preserve the semantics of the sentence. In Phase-II, the same pre-processed data is used as input to train a task-specific (airlines tweet sentiment analysis) BERT-based model.
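The cleaning, tokenization and stop-word steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: a whitespace split stands in for NLTK's TweetTokenizer, the stop-word set is a small hypothetical sample rather than a full list, mention removal is inferred from Table 5, and lemmatization is omitted.

```python
import re

# Hypothetical, deliberately small stop-word sample (assumption, not the full list).
STOPWORDS = {"a", "an", "the", "on", "is", "have", "has", "and", "to", "of", "in", "for"}


def preprocess(tweet):
    """Lowercase a tweet, strip URLs, @mentions and non-letters, then drop stop words."""
    text = tweet.lower()
    # Remove URLs, which carry no sentiment information.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Remove @mentions such as airline handles (inferred from Table 5).
    text = re.sub(r"@\w+", " ", text)
    # Remove every non-letter character: digits, punctuation and the '#' of hashtags.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Whitespace split stands in for TweetTokenizer in this sketch.
    return [token for token in text.split() if token not in STOPWORDS]


print(preprocess("@jetairways Great flight 555, helpful crew!"))
print(preprocess("@indigoairlines, need wheelchair, have a senior citizen"))
```

With this stop-word sample, the two calls reproduce the token lists of the two rows in Table 5.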

3.3 Sentiment polarity and subjectivity

TextBlob provides the fundamental natural language processing components used to determine the polarity and subjectivity of tweets [30]. The numeric polarity value describes how positive or negative a text is, and subjectivity describes how objective or subjective it is. We use this function to get a quick, automatic sense of the extent of polarity (a float in the range $[-1, 1]$) and subjectivity (a float in the range $[0, 1]$) in the considered dataset.
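As an illustration of how a polarity score in $[-1, 1]$ might be turned into the three sentiment classes, consider the sketch below. The thresholding rule and the `neutral_band` parameter are assumptions made for this example; the paper does not state the exact mapping that was used. In practice, the score would come from TextBlob's `sentiment.polarity` attribute.

```python
def polarity_to_label(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a sentiment class.

    neutral_band is a hypothetical tolerance around zero (assumption):
    scores within [-neutral_band, neutral_band] count as neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# Illustrative scores only; not taken from the dataset.
for score in (0.8, 0.0, -0.35):
    print(score, polarity_to_label(score))
```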

3.4 Data visualization

Exploratory data analysis, in which the dataset is visualized, helps in understanding its attributes. This includes checking the distribution of reviews according to airlines, the distribution of tweets on each airline grouped by sentiment, the distribution of tweets on each airline grouped by tweet count, density analysis of the character counts of tweets, the density of character counts of tweets according to sentiment, the density of word counts of tweets according to sentiment, the density of stop-word counts in tweets according to sentiment, the top 20 frequent words used, word cloud visualization and the $n$-gram visualization.

3.5 Word embedding methods

Two popular word embedding techniques are used in this work, word2vec and TF-IDF.

3.5.1 TF-IDF

TF-IDF weighs how often a term occurs in a document against how common the term is across the whole collection of documents. TF gives a term’s frequency in a document, while IDF reflects its relative rarity within the collection. The two are multiplied to give the final TF-IDF value, as shown in Eq. (1).

$\displaystyle\textit{tf-idf}(t,d,D)=\textit{tf}(t,d)\cdot\textit{idf}(t,D)$ (1)

Here, $t$ is the term (word), $d$ is the document and $D$ is the corpus.
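A minimal worked example of Eq. (1) is sketched below. The paper does not state which TF and IDF variants were used, so this sketch assumes relative term frequency and an unsmoothed natural-log IDF; the three-document toy corpus is hypothetical.

```python
import math
from collections import Counter


def tf(term, doc):
    """Relative frequency of a term within one tokenized document (assumed variant)."""
    return Counter(doc)[term] / len(doc)


def idf(term, corpus):
    """Unsmoothed inverse document frequency, log(N / df) (assumed variant)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(term, doc, corpus):
    """Eq. (1): product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)


# Hypothetical toy corpus of already-tokenized tweets.
corpus = [
    ["great", "flight", "helpful", "crew"],
    ["flight", "delayed", "no", "refund"],
    ["need", "wheelchair", "senior", "citizen"],
]

# 'flight' occurs in two of three documents, 'great' in only one,
# so 'great' receives the higher weight in the first tweet.
print(tf_idf("flight", corpus[0], corpus))
print(tf_idf("great", corpus[0], corpus))
```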

3.5.2 Word2vec

Rather than a deep neural network (DNN), the word2vec algorithm uses a shallow two-layer neural network that takes in a corpus and produces a vector for each word.

Because word2vec creates a vector for each term, additional effort is needed to combine those vectors into a single vector or another format for a whole tweet. In contrast, TF-IDF is a statistical measure that is computed for the terms in a document and then used directly to form a vector. Word2vec also takes the context of words in the corpus into account, whereas TF-IDF does not.
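One common way to combine per-word vectors into a single tweet vector is to average them. The sketch below illustrates that idea under stated assumptions: averaging is a swapped-in choice (the paper does not say how the vectors were combined), and the three-dimensional embeddings are made-up values, not trained word2vec output.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known token: fall back to a zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical toy embeddings (assumption); real word2vec vectors
# typically have 100 to 300 dimensions.
toy_embeddings = {
    "great": [0.9, 0.1, 0.0],
    "flight": [0.1, 0.8, 0.1],
    "crew": [0.2, 0.2, 0.6],
}

tweet_vector = average_vector(["great", "flight", "helpful", "crew"], toy_embeddings, 3)
print(tweet_vector)
```

The token 'helpful' is absent from the toy vocabulary and is therefore ignored in the average.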

3.6 Classifiers used for tweet classification

The dataset is divided into training data (60%), validation data (20%) and testing data (20%). To validate the results obtained from TextBlob, under Phase-I of this study we train popular ML algorithms, namely SVM, KNN and RF, on the training set. Once the models are trained, they are evaluated on the test set. Under Phase-II, we fine-tune the pre-trained BERT language representation model and compare its performance with that of these ML classifiers.
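The 60/20/20 split can be sketched as below. The shuffling, the fixed seed and the integer rounding are assumptions of this example; the paper does not describe how the partition was drawn.

```python
import random


def split_60_20_20(items, seed=42):
    """Shuffle and split items into 60% train, 20% validation and 20% test."""
    shuffled = list(items)
    # A fixed seed (assumption) keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 60 // 100
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    holdout = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, holdout


# 8396 is the number of records in the custom dataset.
train, val, holdout = split_60_20_20(range(8396))
print(len(train), len(val), len(holdout))
```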

3.6.1 Support vector machine (SVM)

As depicted in Fig. 2, SVM exploits patterns in the data and acts as a non-probabilistic binary linear classifier [31].

Figure 2.

Illustration of support vector machine.

To predict the class, SVM applies the function in Eq. (2) as:

$\displaystyle Y^{\prime}=w*\phi(x)+b$ (2)

In Eq. (2), $Y^{\prime}$ is obtained by minimizing the regularized risk $R_{\textit{reg}}$:

$\displaystyle R_{\textit{reg}}(Y^{\prime})=C*\sum_{i=0}^{l}\gamma(Y^{\prime}_{i}-Y_{i})+\frac{1}{2}*\|w\|^{2}$ (3)

$\displaystyle\text{here,}\quad w=\sum_{j=1}^{l}(\alpha_{j}-\alpha_{j}^{*})\phi(x_{j})$ (4)

In Eq. (4), the parameters $\alpha$ and $\alpha^{*}$ are relaxation parameters called Lagrange multipliers. The output obtained is,

$\displaystyle Y^{\prime}=\sum_{j=1}^{l}(\alpha_{j}-\alpha_{j}^{*})\phi(x_{j})*\phi(x)+b$ (5)

$\displaystyle Y^{\prime}=\sum_{j=1}^{l}(\alpha_{j}-\alpha_{j}^{*})*K(x_{j},x)+b$ (6)

In Eqs (5) and (6), $K(x_{j},x)$ denotes the kernel function. In this paper, the SVM algorithm is implemented with a linear kernel function.
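For a linear kernel, where $\phi$ is the identity map, Eq. (2) with $w$ from Eq. (4) and the kernel form of Eq. (6) give the same output. The sketch below checks this numerically. The support vectors, the coefficients standing for $(\alpha_{j}-\alpha_{j}^{*})$ and the bias are hypothetical values chosen for illustration, not fitted parameters.

```python
def linear_kernel(u, v):
    """K(u, v) = u . v, i.e. phi is the identity map."""
    return sum(a * b for a, b in zip(u, v))


def decision_dual(x, support_vectors, coeffs, b):
    """Eq. (6): sum_j (alpha_j - alpha_j*) K(x_j, x) + b."""
    return sum(c * linear_kernel(sv, x) for sv, c in zip(support_vectors, coeffs)) + b


def decision_primal(x, support_vectors, coeffs, b):
    """Eq. (2) with w built from Eq. (4): w . phi(x) + b."""
    w = [sum(c * sv[i] for sv, c in zip(support_vectors, coeffs)) for i in range(len(x))]
    return linear_kernel(w, x) + b


# Hypothetical values (assumption): coeffs[j] plays the role of alpha_j - alpha_j*.
support_vectors = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]]
coeffs = [0.5, -0.25, 0.75]
bias = -0.5
x = [1.0, 1.0]

print(decision_dual(x, support_vectors, coeffs, bias))
print(decision_primal(x, support_vectors, coeffs, bias))
```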

3.6.2 K-nearest neighbors (KNN)

Figure 3.

Illustration of KNN algorithm.

KNN is one of the most straightforward ML classification techniques [32]. A data point is classified according to how its neighboring data points are categorized, as seen in Fig. 3.

KNN categorizes new data points based on their similarity to the previously stored data points. Here, tweets are classified by polarity, either positive or negative.
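A minimal sketch of the neighbor vote is given below. It assumes Euclidean distance as the similarity measure and $k=3$, neither of which is specified in the paper, and uses hypothetical two-dimensional feature vectors in place of real tweet embeddings.

```python
import math
from collections import Counter


def knn_predict(query, points, labels, k=3):
    """Label a query point by majority vote among its k nearest stored points."""
    # Rank stored points by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D feature vectors with their polarity labels.
points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.85, 0.15]]
labels = ["positive", "positive", "negative", "negative", "positive"]

print(knn_predict([0.7, 0.3], points, labels))
print(knn_predict([0.15, 0.85], points, labels))
```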

3.6.3 Random forest

Figure 4.

Illustration of random forest algorithm.

Random forest is a supervised ML algorithm used to solve classification and regression problems. It builds decision trees (DT) on several samples of the data and combines them by majority vote for classification and by averaging for regression, as illustrated in Fig. 4.
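The sampling and aggregation steps can be sketched as below. Tree construction itself is left out, and the per-tree predictions are hypothetical stand-ins; the sketch shows only how bootstrap samples are drawn and how individual tree outputs are combined.

```python
import random
from collections import Counter


def bootstrap_sample(rows, seed):
    """Draw len(rows) rows with replacement, one sample per tree."""
    return random.Random(seed).choices(rows, k=len(rows))


def majority_vote(predictions):
    """Classification: the class predicted by most trees wins."""
    return Counter(predictions).most_common(1)[0][0]


def mean_prediction(predictions):
    """Regression: the forest output is the average of the tree outputs."""
    return sum(predictions) / len(predictions)


rows = list(range(10))
print(bootstrap_sample(rows, seed=0))

# Hypothetical outputs of three trees for one tweet.
print(majority_vote(["negative", "positive", "negative"]))
print(mean_prediction([0.2, 0.4, 0.6]))
```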

3.6.4 BERT

BERT is an encoder stack of the Transformer architecture, whereas the full Transformer design comprises both an encoder and a decoder stack, hence the name encoder-decoder architecture. The two variants, BERT-base and BERT-large, differ in architectural complexity. As shown in Fig. 5, the encoder of the large version has 24 layers, whereas that of the base model has 12. In this study, we use BERT-base, which has 110 million trainable parameters, and implement it in Google Colab under Phase-II. BERT uses the Transformer, an attention mechanism that learns contextual relations between words in a text (here, a tweet).

Figure 5.

BERT models.

The Transformer consists of two mechanisms: an encoder, which reads the input and generates a language model, and a decoder, which classifies the sentiment. One of the key advantages of the Transformer encoder is that it reads the entire sequence of words at once, in contrast to directional models, which read the text input sequentially, either left-to-right or right-to-left. This allows the encoder to learn the context of a word based on all of its surroundings.

BERT training uses a technique called next sentence prediction. The model receives a pair of sentences as input and learns to predict whether the second sentence is indeed the subsequent sentence in the original corpus. During training, in 50% of the inputs the second sentence is the true subsequent sentence, and in the other 50% it is a random sentence from the corpus that is not connected in any way to the first sentence. To help the model distinguish between the two sentences, a [CLS] token is added at the beginning of the first sentence and a [SEP] token at the end of each sentence. Sentence embeddings (identifying whether a token belongs to sentence A or B) and positional embeddings (indicating a token's position in the sequence) are also added. To predict whether the second sentence is connected to the first, the full input sequence is passed through the Transformer model, the output of the [CLS] token is transformed into a vector by a simple classification layer, and the probability that the second sentence is the next one is calculated using softmax.
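The input layout described above can be sketched in a few lines. The example builds the token sequence with its [CLS] and [SEP] markers, the sentence (segment) ids and the position indices, and applies softmax to a pair of logits. The logit values are hypothetical; in the real model they come from the classification layer on top of the [CLS] output.

```python
import math


def build_pair(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


tokens, segment_ids, position_ids = build_pair(["flight", "delayed"], ["no", "refund"])
print(tokens)
print(segment_ids)

# Hypothetical [is_next, not_next] logits (assumption).
probs = softmax([2.0, 0.0])
print(probs)
```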

Figure 6.

The architecture of the proposed AirBERT classification model.

Figure 6 shows the high-level architecture of the AirBERT Transformer encoder model. BERT receives a sequence of words as input and passes it up through the stack of encoder units. The input is a sequence of airline tweet tokens, which are first embedded into vectors and then processed in the feed-forward neural network. Self-attention is applied in each encoder layer, and the feed-forward network’s output is subsequently fed into the following encoder. For each task, BERT uses a fine-tuning strategy that does not require any task-specific architecture. The output is a sequence of vectors of size H, in which each vector corresponds to the input token with the same index. Cutting-edge performance can be obtained by fine-tuning a pre-trained BERT model with just one additional classification layer on top of the Transformer output for the [CLS] token.

Unsatisfactory outcomes are obtained when language models trained on corpora from general domains, like plain BERT, or from other domains, like BioBERT, are used. A pre-trained language representation model called AirBERT is therefore proposed to mine airline reviews on Twitter. The BERT architecture is fine-tuned using the training set data.
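The attention step inside an encoder layer can be illustrated with scaled dot-product self-attention. The sketch below is heavily simplified and rests on stated assumptions: queries, keys and values are all set equal to the input vectors, whereas BERT uses learned projections and multiple attention heads, and the token vectors are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into attention weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (simplifying assumption)."""
    d = len(X[0])
    outputs = []
    for q in X:
        # Score every token against the current one, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Each output vector is a weighted mix of all token vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, X)) for i in range(d)])
    return outputs


# Hypothetical 2-D embeddings for a three-token tweet.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(X):
    print(row)
```

Each output row has the same size as its input row, matching the statement that the output is one vector per input token.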

The data is first organized in the required format. As explained below, BERT layers receive three input arrays: input_ids, attention_mask, and token_type_ids.

input_ids – The list of integers, each uniquely tied to a specific word (token).

attention_mask – A list of 1s and 0s, one for each ID in the input_ids array.

token_type_ids – Used when classifying sequence pairs or answering queries. These tasks require two distinct sequences to be stored in the same input_ids array; therefore, special tokens like classifier (CLS) and separator (SEP) are used to separate the sequences.
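The three arrays can be illustrated for a single tweet as below. This is a sketch under stated assumptions: the vocabulary and its integer ids are hypothetical (BERT uses a WordPiece vocabulary with different ids), and the maximum length of 70 is taken from the input shapes listed in Table 6.

```python
MAX_LEN = 70  # sequence length, as in the (None, 70) input shapes of Table 6

# Hypothetical toy vocabulary (assumption); real BERT ids differ.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "great": 4, "flight": 5, "helpful": 6, "crew": 7}


def encode(tokens, vocab, max_len=MAX_LEN):
    """Add special tokens, map tokens to ids, then pad and build the mask."""
    pieces = ["[CLS]"] + tokens[:max_len - 2] + ["[SEP]"]
    ids = [vocab.get(t, vocab["[UNK]"]) for t in pieces]
    pad = max_len - len(ids)
    return {
        "input_ids": ids + [vocab["[PAD]"]] * pad,
        # 1 marks a real token, 0 marks padding to be ignored by attention.
        "attention_mask": [1] * len(ids) + [0] * pad,
        # A single-sentence input uses segment 0 throughout.
        "token_type_ids": [0] * max_len,
    }


encoded = encode(["great", "flight", "helpful", "crew"], toy_vocab)
print(encoded["input_ids"][:8])
print(encoded["attention_mask"][:8])
```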

Table 6

Fine-tuned weights and other parameters in AirBERT

Layer (type)	Output shape	Param #	Connected to
input_ids (InputLayer)	[(None, 70)]	0	[ ]
attention_mask (InputLayer)	[(None, 70)]	0	[ ]
tf_bert_model (TFBertModel)	TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=(None, 70, 768), pooler_output=(None, 768), past_key_values=None, hidden_states=None, attentions=None, cross_attentions=None)	108310272	[‘input_ids[0][0]’, ‘attention_mask[0][0]’]
global_max_pooling1d_7 (GlobalMaxPooling1D)	(None, 768)	0	[‘tf_bert_model[6][0]’]
dense_21 (Dense)	(None, 768)	590592	[‘global_max_pooling1d_7[0][0]’]
dropout_47 (Dropout)	(None, 768)	0	[‘dense_21[0][0]’]
dense_22 (Dense)	(None, 128)	98432	[‘dropout_47[0][0]’]
dropout_48 (Dropout)	(None, 128)	0	[‘dense_22[0][0]’]
dense_23 (Dense)	(None, 32)	4128	[‘dropout_48[0][0]’]
dense_24 (Dense)	(None, 2)	66	[‘dense_23[0][0]’]
Total params: 109,003,490 Trainable params: 109,003,490 Non-trainable params: 0
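The parameter counts in Table 6 can be checked by hand: a dense layer with $n_{in}$ inputs and $n_{out}$ outputs has $n_{in} \times n_{out}$ weights plus $n_{out}$ biases. The short sketch below reproduces the per-layer counts and the total, using the layer sizes from the table; the helper name `dense_params` is introduced only for this illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


bert_params = 108_310_272  # tf_bert_model row of Table 6

head_params = [
    dense_params(768, 768),  # dense_21
    dense_params(768, 128),  # dense_22
    dense_params(128, 32),   # dense_23
    dense_params(32, 2),     # dense_24, two output classes
]

total_params = bert_params + sum(head_params)
print(head_params)
print(total_params)
```

Pooling and dropout layers contribute no parameters, so the total is the BERT encoder plus the four dense layers.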

The tokenizer class’s encode_plus function tokenizes the raw input, adds the aforementioned special tokens, and pads the vector to the maximum length provided. A helper function converts our raw data into the format required by the BERT model. We divide the data into training data (60%) for training the model, validation data (20%) for model evaluation and testing or holdout data (20%) to provide an unbiased estimate of model performance. The model is initialized for sentiment analysis and the number of epochs is set to 2, as more epochs may lead to overfitting and also increase training time. Training takes around 1.5 hours on an Nvidia K40 GPU.

The AirBERT fine-tuned weights and parameters are shown in Table 6. In brief, the model used is BERT-base-cased, the tokenizer is BERT-base-cased, the optimizer is Adam, the number of layers is 12 and the hidden size is 768. We fine-tune the pre-trained model to improve the overall accuracy.

The performance parameters used are precision, recall, accuracy and F1-score for all the proposed models.

$\displaystyle\text{Precision}=\frac{\textit{TP}}{(\textit{TP}+\textit{FP})}$ (7)

$\displaystyle\text{Recall}=\frac{\textit{TP}}{(\textit{TP}+\textit{FN})}$ (8)

$\displaystyle\text{Accuracy}=\frac{(\textit{TN}+\textit{TP})}{(\textit{TN}+\textit{FN}+\textit{FP}+\textit{TP})}$ (9)

$\displaystyle\text{F1-score}=2\times\frac{\textit{Recall}\cdot\textit{Precision}}{(\textit{Recall}+\textit{Precision})}$ (10)
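The four metrics can be computed directly from the confusion-matrix counts, as sketched below. The counts used in the example are hypothetical and serve only to illustrate the arithmetic; they are not results from this study.

```python
def precision(tp, fp):
    """Eq. (7): share of predicted positives that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Eq. (8): share of actual positives that are found."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Eq. (9): share of all predictions that are correct."""
    return (tn + tp) / (tn + fn + fp + tp)


def f1_score(p, r):
    """Eq. (10): harmonic mean of precision and recall."""
    return 2 * (r * p) / (r + p)


# Hypothetical confusion-matrix counts (assumption).
tp, fp, fn, tn = 80, 20, 10, 90

p = precision(tp, fp)
r = recall(tp, fn)
print(p, r, accuracy(tp, tn, fp, fn), f1_score(p, r))
```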

The experimental results and empirical evaluation are discussed in the next section.

Table 7

Dataset statistics

Dataset details	Class	Neg	Neu	Pos	Total
Custom dataset for Indian domestic airlines for the period Jun–Jul’20	3	4198	1695	2503	8396

Figure 7.

(a) Initial sample of Indian-Airlines-Data, and (b) distribution of tweets during the chosen period (Jun–Jul’20).

4. Results and discussion

The data is examined through exploratory data analysis, the results of which are presented below.

4.1 Indian Airline review dataset

Table 8
Sentiment distribution in online reviews

Airline	Pos	Neg	Mix/Neu	% Pos	% Neg	% Mix/Neu
Air India	1600	1100	2775	29	20	51
Indigo	240	150	250	38	23	39
Jet Airways	210	256	260	29	35	36
SpiceJet	410	210	600	34	17	49
Vistara	100	55	180	30	16	54
Total	2560	1771	4065	31	21	48

The dataset, named Indian-Airline-Data, provides online customer reviews for five Indian airlines across all domestic routes from June 2020 to July 2020. An initial sample of the dataset is shown in Fig. 7a. It contains 8396 records (rows) and 7 columns: date, user_id, retweet_count, like_tweet, tweet_count, tweet_location and tweet_content.

The total number of reviews received daily during the said period is shown in Fig. 7b. On an average $\sim$ 275 reviews have been received daily. For experimental results, we do not focus on mixed/neutral reviews.

The dataset statistics are shown in Table 7 and the sentiment distribution in online reviews in Table 8.

4.2 Data extraction

The percentage distribution of online reviews according to each airline in the total dataset is shown in Fig. 8a. Air India has the maximum reviews at 65%, followed by SpiceJet at 14%, Indigo and Jet Airways both at 8% and Vistara with 7% of the total reviews.

Figure 8b shows the ‘Top 20’ most frequent words used. Apart from airline names, it is observed that ‘hardeeppuri’ (the civil aviation minister), ‘refund’, ‘repatriation’, ‘vandebharatmission’ (the mission for bringing back civilians stranded abroad during COVID-19 travel restrictions), ‘jetrevival’ and ‘makejetflyagain’ figure predominantly.

Figure 8c displays the $n$-gram visualization from the dataset. Unigrams, bigrams and trigrams have been captured for exploratory understanding.
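The frequent-word and $n$-gram counts described above can be sketched with the Python standard library as follows. This is only an illustration under stated assumptions: the tokenizer, the function names and the three sample tweets are hypothetical and do not reproduce the preprocessing pipeline used for Fig. 8.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric tokens (assumed, simplified tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts: List[str], n: int, k: int = 20) -> List[Tuple[str, int]]:
    """Count n-grams over all texts and return the k most frequent ones."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(k)


# Hypothetical sample tweets, used only to exercise the functions.
sample = [
    "refund my ticket please",
    "still waiting for refund of my ticket",
    "thanks for the refund",
]
print(top_ngrams(sample, n=1, k=3))  # unigrams
print(top_ngrams(sample, n=2, k=3))  # bigrams
```

Setting `n=3` gives trigrams, and `k=20` corresponds to a ‘Top 20’ list.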

4.3 Sentiment extraction

Figure 9a shows the output parameters for the dataset using the TextBlob sentiment function, which returns numeric polarity and subjectivity values. These values are also used to plot the histogram of tweet polarity, the histogram of subjectivity and the polarity vs. subjectivity visualization, shown in Fig. 9b–d respectively.

Figure 8.

(a) Daily distribution of number of reviews during the review month, (b) top 20 most frequent words used in reviews, and (c) $n$-gram visualization of the dataset.

Figure 9.

TextBlob outputs. (a) Polarity and subjectivity in the reviews received, (b) histogram of positive polarity, (c) histogram of negative polarity, (d) polarity vs. subjectivity visualization, (e) class distribution of the three sentiment categories.

Figure 9.

TextBlob outputs. (f) total number of reviews for each airline by sentiment.

Figure 10.

Density analysis using TextBlob. (a) Density of character counts of reviews, (b) density of character counts of reviews according to sentiments, (c) density of word count of reviews according to sentiments, (d) density of stop-word count of reviews according to sentiments.

Figure 9e shows the class distribution of the three sentiments using the TextBlob function, where the positive class is 31%, the negative class is 21% and the neutral or mixed class is 48%. In terms of number of tweets, the function returns good accuracy for the positive class but poor accuracy for both negative and neutral tweets. This may be because the function also considers the subjectivity parameter when returning results.
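A minimal sketch of how TextBlob polarity values can be mapped to the three sentiment classes is given below. TextBlob reports polarity in the range $[-1, 1]$ and subjectivity in $[0, 1]$; the neutral band of $\pm 0.05$ and all function names are assumptions made for illustration, since the thresholds used in this work are not stated here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def textblob_scores(text: str) -> Tuple[float, float]:
    """Return (polarity, subjectivity) from TextBlob; requires the textblob package."""
    from textblob import TextBlob  # imported lazily so the rest runs without it

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def polarity_to_label(polarity: float, neutral_band: float = 0.05) -> str:
    """Map a polarity score to a class; the neutral band is an assumed threshold."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


def class_distribution(polarities: Iterable[float], neutral_band: float = 0.05) -> Dict[str, float]:
    """Return the percentage of tweets falling in each sentiment class."""
    counts = Counter(polarity_to_label(p, neutral_band) for p in polarities)
    total = sum(counts.values())
    if total == 0:
        return {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    return {label: 100 * counts[label] / total for label in ("positive", "negative", "neutral")}


# Hypothetical polarity scores, used only to exercise the functions.
print(class_distribution([0.5, -0.5, 0.0, 0.0]))
```

In practice, the polarity values would be obtained by applying `textblob_scores` to the tweet_content column of the dataset.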

Table 9

Experimental results across all models (best results in bold, values in %)

Models	Dataset	Accuracy	Positive class Precision	Positive class Recall	Positive class F1-score	Negative class Precision	Negative class Recall	Negative class F1-score	Average F1-score
SVM with word2vec	Indian Airlines Twitter data	75	72	74	73	75	73	74	73.5
SVM with TF-IDF	Indian Airlines Twitter data	87	89	83	86	82	79	80	83
KNN with word2vec	Indian Airlines Twitter data	62	61	60	60	64	62	63	62
KNN with TF-IDF	Indian Airlines Twitter data	64	64	62	63	68	63	65	64
RF with word2vec	Indian Airlines Twitter data	82	85	81	83	81.5	82	82	82
RF with TF-IDF	Indian Airlines Twitter data	85	87	85	86	82	85	83	84.5
AirBERT model	Indian Airlines Twitter data	91	91	89	90	94	93	93	91.5
AirBERT model	US Airlines Twitter data	92.56	91.4	92	92	92	90	91	91

Table 10

Cross comparison of classification results with existing state-of-the-art models (best results in bold, values in %)

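Author, year [reference]	Dataset	Models	Accuracy
Xie et al., 2020 [17]	US Airline Twitter dataset	BERT, Embeddings from Language Model (ELMo)	82.3
Heidari et al., 2020 [18]	US Airline Twitter dataset	BERT $+$ CNN	92
Kang et al., 2021 [11]	US Airline Twitter dataset	Bidirectional Long Short-Term Memory (Bi-LSTM) and BERT	86
Abdul Qudar and Mago, 2020 [19]	US Airline Twitter dataset	BERT	85.2
Abdul Qudar and Mago, 2020 [19]	US Airline Twitter dataset	TweetBERTv1	89
Abdul Qudar and Mago, 2020 [19]	US Airline Twitter dataset	TweetBERTv2	92.99
Proposed model	US Airline Twitter dataset	AirBERT	92.56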

Figure 11.

Word cloud visualization for (a) positive reviews, (b) negative reviews.

To determine which airline has received the most positive feedback, we also analyze the sentiment of each review from all five airlines. Figure 9f shows the total number of reviews for each airline by sentiment using TextBlob. From the figure, it is observed that overall Air India received the maximum number of reviews, followed by SpiceJet, Jet Airways, Indigo and Vistara.

Considering the absolute number of reviews received by each airline and calculating the percentage of positive (% Pos), negative (% Neg) and mixed/neutral (% Mix/Neu) sentiments, it is observed that Indigo airlines received the maximum percentage of positive reviews and Jet Airways received the maximum percentage of negative reviews during this period. From this, we can conclude that Indigo is the most customer-friendly airline considering all the domestic routes.
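The percentage calculation described above can be reproduced from the review counts in Table 8 with the short sketch below. The counts are copied from Table 8; the variable and function names are illustrative assumptions.

```python
from typing import Dict, Tuple

# (positive, negative, mixed/neutral) review counts per airline, from Table 8.
REVIEWS: Dict[str, Tuple[int, int, int]] = {
    "Air India": (1600, 1100, 2775),
    "Indigo": (240, 150, 250),
    "Jet Airways": (210, 256, 260),
    "SpiceJet": (410, 210, 600),
    "Vistara": (100, 55, 180),
}


def sentiment_shares(pos: int, neg: int, neu: int) -> Tuple[float, float, float]:
    """Return the percentage of positive, negative and mixed/neutral reviews."""
    total = pos + neg + neu
    return tuple(100 * count / total for count in (pos, neg, neu))


shares = {airline: sentiment_shares(*counts) for airline, counts in REVIEWS.items()}

# Airline with the largest share of positive and of negative reviews.
most_positive = max(shares, key=lambda airline: shares[airline][0])
most_negative = max(shares, key=lambda airline: shares[airline][1])

for airline, (pos, neg, neu) in shares.items():
    print(f"{airline}: {pos:.0f}% pos, {neg:.0f}% neg, {neu:.0f}% mix/neu")
print(most_positive, most_negative)
```

Comparing shares rather than absolute counts prevents Air India, which accounts for most of the reviews, from dominating the comparison.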

In the density analysis using the TextBlob function, the density of character counts is shown in Fig. 10a. A peak is observed at a character count of 175 in tweets. The density of character counts of tweets according to sentiments is compared in Fig. 10b, where the green curve indicates positive, red negative and blue neutral sentiment. It is seen that the maximum density of positive and negative reviews is in the range of 175–200 characters. The density of word counts in tweets according to sentiments, as seen from Fig. 10c, peaks at around 30 words.

The density of stop-word counts for tweets according to sentiments in Fig. 10d shows that the density is highest initially, with up to 4 stop-words used, and decreases thereafter.
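The per-tweet quantities behind these density plots (character count, word count and stop-word count) can be computed as in the following sketch. The stop-word list is a tiny illustrative subset and the bin width is an assumption; neither is taken from the implementation used for Fig. 10.

```python
from collections import Counter
from typing import Dict, Iterable

# Tiny illustrative stop-word list (assumption); a full list would be used in practice.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "for", "in", "on", "and", "my", "i", "it"}


def length_features(text: str) -> Dict[str, int]:
    """Return character, word and stop-word counts for one tweet."""
    words = [word.strip(".,!?") for word in text.lower().split()]
    return {
        "char_count": len(text),
        "word_count": len(words),
        "stopword_count": sum(word in STOP_WORDS for word in words),
    }


def binned_histogram(values: Iterable[int], bin_width: int) -> Counter:
    """Count values per bin; each key is the lower edge of a bin."""
    return Counter((value // bin_width) * bin_width for value in values)


features = length_features("The flight was on time")
print(features)

# Hypothetical character counts, binned in steps of 25 characters.
hist = binned_histogram([170, 180, 190, 30], bin_width=25)
print(sorted(hist.items()))
```

Grouping the feature values by sentiment label before binning gives the per-sentiment curves of Fig. 10b–d.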

4.4 Word cloud visualization

Figure 12.

(a) Comparison of classifier accuracy with word2vec and TF-IDF vs. BERT, (b) AirBERT model training time statistics.

Figure 13.

A customer relationship management system based on AirBERT.

A word cloud is often employed to convey the subject and attributes of a piece of writing. By showing the content in relation to its distribution, the word cloud gives a more accurate picture of the article’s properties. Figure 11a displays the word cloud for positive reviews. Apart from ‘air india’, the top 5 positive words are ‘ticket’, ‘time’, ‘help’, ‘vandebharatmission’ and ‘refund’. Figure 11b shows the word cloud for negative reviews. Apart from ‘air india’, the top 5 negative words are ‘repatriation flight’, ‘jetrevival’, ‘long haul’, ‘airbus330 pilots’, and ‘capability’.

4.5 Sentiment classification

From the implementation of the Phase-I and Phase-II models on the customized dataset, ignoring the mixed/neutral reviews, Table 9 shows the experimental results returned by the various classifiers and the AirBERT model for sentiment classification. With the AirBERT model, we obtained an average F1-score of 91.5% over the positive and negative classes, followed by the RF with TF-IDF combination with 84.5% and SVM with TF-IDF with 83%. AirBERT’s superior performance can be attributed to the large training sets ($\sim$800 million words) it is trained on and to the fact that it accounts for a word’s context much better. Previous word-embedding methods like GloVe, word2vec and TF-IDF return the same vector for a word no matter how it is used, while AirBERT returns different vectors for the same word depending on the words around it.
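As a small worked check, the average F1-score can be recomputed as the mean of the positive-class and negative-class F1-scores in Table 9, and the models ranked accordingly. The per-class scores below are copied from Table 9 for the Indian Airlines dataset; the names are illustrative. Because Table 9 reports rounded values, a recomputed average may differ from the tabulated one by up to 0.5 (for example, 61.5 versus the reported 62 for KNN with word2vec).

```python
from typing import Dict, Tuple

# (positive-class F1, negative-class F1) in %, from Table 9 (Indian Airlines dataset).
F1_SCORES: Dict[str, Tuple[float, float]] = {
    "SVM with word2vec": (73, 74),
    "SVM with TF-IDF": (86, 80),
    "KNN with word2vec": (60, 63),
    "KNN with TF-IDF": (63, 65),
    "RF with word2vec": (83, 82),
    "RF with TF-IDF": (86, 83),
    "AirBERT": (90, 93),
}


def average_f1(positive_f1: float, negative_f1: float) -> float:
    """Unweighted mean of the two per-class F1-scores."""
    return (positive_f1 + negative_f1) / 2


# Models sorted from best to worst average F1-score.
ranking = sorted(F1_SCORES, key=lambda model: average_f1(*F1_SCORES[model]), reverse=True)

for model in ranking:
    print(f"{model}: {average_f1(*F1_SCORES[model]):.1f}")
```

The first three entries of `ranking` correspond to the ordering discussed above: AirBERT, RF with TF-IDF and SVM with TF-IDF.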

From the Phase-I implementation, it is observed that RF with TF-IDF performed the best with an F1-score of 84.5%, followed closely by SVM with TF-IDF with an F1-score of 83%.

K-nearest neighbor was the worst-performing algorithm, with F1-scores of 62% and 64% with word2vec and TF-IDF respectively. Thus, TF-IDF scores over word2vec, with AirBERT outperforming all models, as shown in Fig. 12a. Also, training AirBERT on a CPU $+$ GPU is about 4x faster than training on a CPU only, as shown in Fig. 12b.

Comparing the accuracy of the TextBlob sentiment function and AirBERT, it is observed that TextBlob accuracy is high for positive tweets but very poor for the negative class. AirBERT achieves a high accuracy of 91% for both the positive and negative classes, demonstrating its efficacy.

It is worth noting that any algorithm that relies on matrix multiplication and convolution can be accelerated using a GPU, and a consumer-grade GPU with a good amount of memory is available today at a fraction of the cost. Although training AirBERT on a CPU $+$ GPU is about 4x faster than on a CPU alone, it is still heavy ($\sim$1.5 hours). However, this is a one-time activity: once trained, the model can be reused any number of times on new test data. The model weights can be saved and, depending on the size of the new dataset and its similarity to the training data, one can either use the same model or fine-tune the pre-trained model and achieve good results. Though ML algorithms can also be accelerated by performing matrix multiplication on a GPU, this makes practical sense only for computationally heavy methods like boosted trees (e.g., GPU-accelerated XGBoost); otherwise, the overhead is significant compared to the performance gains. Also, ML training algorithms like Random Forest and Decision Tree are too non-linear to work well with GPUs. Considering the above facts and the overall performance of AirBERT compared to ML algorithms, this pre-trained, optimized model is the better option for airlines tweet data.

4.6 Comparison with existing models

We compare the performance of the fine-tuned AirBERT model with four other state-of-the-art BERT-based models on the US Airline dataset, as shown in Table 10. The US airlines dataset contains 14640 tweets and is more polished than our dataset. It is observed that the proposed fine-tuned, pre-trained AirBERT model matches the performance of the state-of-the-art TweetBERTv2 model, with an accuracy of 92.56% vs. 92.99%. It outperforms all other models that are trained on general domain corpora, like BERT.

4.7 Customer relationship management (CRM) system

There are many practical applications of the proposed model. Based on AirBERT’s negative tweet analysis and past customer profiles, a high-level use-case for its deployment is a CRM system, as shown in Fig. 13. This can be leveraged by airlines and other service industries to win back customer loyalty in case of any deficiency in service, so that potential issues are nipped in the bud. It will also help in upselling to happy customers and reducing churn. It can also help detect changes in the overall opinion towards an airline.
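One way such a CRM use-case could be organized is sketched below: tweets classified as negative are placed in a follow-up queue, ordered by the number of past complaints in the customer profile. This is a hypothetical illustration and not the system of Fig. 13. The `Tweet` fields mirror the user_id and tweet_content columns of the dataset, while `keyword_stub` is only a stand-in for the AirBERT classifier and `past_complaints` is an assumed form of customer profile.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tweet:
    user_id: str
    tweet_content: str


def triage_negative_tweets(
    tweets: List[Tweet],
    classify: Callable[[str], str],
    past_complaints: Dict[str, int],
) -> List[Tweet]:
    """Return negative tweets, customers with the most past complaints first."""
    negatives = [tweet for tweet in tweets if classify(tweet.tweet_content) == "negative"]
    return sorted(negatives, key=lambda tweet: past_complaints.get(tweet.user_id, 0), reverse=True)


def keyword_stub(text: str) -> str:
    """Placeholder classifier (assumption); a trained sentiment model would be used instead."""
    negative_cues = ("refund", "delay", "cancel")
    return "negative" if any(cue in text.lower() for cue in negative_cues) else "positive"


# Hypothetical tweets and customer profiles.
tweets = [
    Tweet("u1", "Great crew on my flight today"),
    Tweet("u2", "Flight delayed again"),
    Tweet("u3", "Where is my refund"),
]
queue = triage_negative_tweets(tweets, keyword_stub, past_complaints={"u3": 4, "u2": 1})
print([tweet.user_id for tweet in queue])
```

Passing the classifier as a function keeps the queueing logic independent of the model, so the stub can be replaced by any sentiment classifier that returns a class label.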

However, there are some limitations of airlines tweet sentiment analysis in general and of the proposed model in particular. Neutral sentiments have not been considered; these sometimes have a negative tilt and can impact results considerably. Sentiments may also be classified incorrectly for data from outside the review domain, since the model is trained on only one particular domain. Finally, the model is trained only on English-language tweets. For it to be robust for global airlines, models are needed that can transfer what they have learnt from one language to another.

5. Conclusion

Twitter is a popular social networking platform with rich data; however, it can be challenging to analyze its information. Applying language models trained on generic corpora, like plain BERT, or on corpora from other fields, like the biomedical BioBERT, to tweets, which are often written in an informal way, frequently delivers poor results. To mine airline reviews, AirBERT, a fine-tuned airlines domain-specific language representation model trained on Twitter corpora, is proposed. A custom dataset is curated for five airlines for the period immediately after the resumption of flights following the pandemic, to check customer sentiments. From the results, the GPU-trained AirBERT outperforms all other ML models, returning an accuracy of 91%. Among the ML classifiers, SVM with the TF-IDF word embedding technique achieves the highest accuracy of 87%, and TF-IDF outperforms the word2vec technique. Indigo airlines emerged as the airline with the most satisfied customers from reviews during this period. Comparing the performance of AirBERT with three existing models on the common USA airlines tweets dataset, it outperforms other models trained on general domain corpora like BERT and matches the state-of-the-art TweetBERTv2 model with an accuracy of 92.56%. The model can be deployed by airlines and other service industries to build a CRM system to win back customer loyalty. As part of future research, since reviews for only a month were considered, the model can be made more domain-specific by training it on larger datasets. Secondly, research trends are increasingly turning towards topic modeling, a technique that uses observable words and hidden meanings to derive latent themes for a better understanding of the entire pattern, and towards explainable AI (XAI), which can be explored with methods like Local Interpretable Model-agnostic Explanations (LIME). Together, these can go beyond the concept of positive, negative or neutral sentiments to comprehend conversations and what they reveal about customers.

References

1. Haenlein, Kaplan. An Empirical Analysis of Attitudinal and Behavioral Reactions Toward the Abandonment of Unprofitable Customer Relationships. Journal of Relationship Marketing. 2010; 9(4): 200-228. doi: 10.1080/15332667.2010.522474.

2. Kaddoura, Popescu, Hemanth. A systematic review on machine learning models for online learning and examination systems. PeerJ Computer Science. 2022; 8: e986. doi: 10.7717/peerj-cs.986.

3. Yenkikar, Babu, Hemanth. Semantic relational machine learning model for sentiment analysis using cascade feature selection and heterogeneous classifier ensemble. PeerJ Computer Science. 2022; 8: e1100. doi: 10.7717/peerj-cs.1100.

4. Khan, Urolagin. Airline Sentiment Visualization, Consumer Loyalty Measurement and Prediction using Twitter Data. International Journal of Advanced Computer Science and Applications. 2018; 9(6). doi: 10.14569/IJACSA.2018.090652.

5. Kaur, Malik. A Sentiment Analysis of Airline System using Machine Learning Algorithms. International Journal of Advanced Research in Engineering. 2022; 12(1): 731-742. doi: 10.34218/IJARET.12.1.2021.066.

6. Tusar, Islam. A Comparative Study of Sentiment Analysis Using NLP and Different Machine Learning Techniques on US Airline Twitter Data. In: International Conference on Electronics, Communications and Information Technology (ICECIT), 2021, pp. 1-4. doi: 10.48550/arXiv.2110.00859.

7. Veera Kumari, Prajna. Collaborative Classification Approach for Airline Tweets Using Sentiment Analysis. Turkish Journal of Computer and Mathematics Education (TURCOMAT). 2021; 12(3): 3597-3603. doi: 10.17762/turcomat.v12i3.1639.

8. Soni, Mathur, Patsariya. Performance Improvement of Naïve Bayes Classifier for Sentiment Estimation in Ambiguous Tweets of US Airlines. Advances in Intelligent Systems and Computing. 2020; 195-204. doi: 10.1007/978-981-15-1097-7_17.

9. Kang, Chye, Ong, Tan. The Science of Emotion: Malaysian Airlines Sentiment Analysis using BERT Approach. In: International Conference on Digital Transformation and Applications (ICDXA), 2021, pp. 129-136. Available from: https://www.researchgate.net/lab/Huay-Wen-Kang-Lab.

10. Rustam, Ashraf, Mehmood, Ullah, Choi. Tweets Classification on the Base of Sentiments for US Airline Companies. Entropy. 2019; 21(11): 1078. doi: 10.3390/e21111078.

11. Dutta Das, Sharma, Natani, Khare, Singh. Sentimental Analysis for Airline Twitter data. IOP Conference Series: Materials Science and Engineering. 2017; 263: 042067. doi: 10.1088/1757-899X/263/4/042067.

12. Sreeja, Sunny, Jatian. Twitter Sentiment Analysis on Airline Tweets in India Using R Language. Journal of Physics: Conference Series. 2020; 1427(1): 012003. doi: 10.1088/1742-6596/1427/1/012003.

13. Kwon, Ban, Jun, Kim. Topic Modeling and Sentiment Analysis of Online Review for Airlines. Information. 2021; 12(2): 78. doi: 10.3390/info12020078.

14. Vadivukarassi, Puviarasan, Aruna. An Exploration of Airline Sentimental Tweets with Different Classification Model. International Journal for Research in Engineering Application & Management (IJREAM). 2018; 4(2): 72-77. doi: 10.18231/2454-9150.2018.0124.

15. Adeborna, Siau. An Approach to Sentiment Analysis – The Case of Airline Quality Rating. In: Pacific Asia Conference on Information Systems (PACIS 2014), Chengdu, China; 2014.

16. Wan, Gao. An Ensemble Sentiment Classification System of Twitter Data for Airline Services Analysis. In: 2015 IEEE International Conference on Data Mining Workshop (ICDMW), 2015. doi: 10.1109/ICDMW.2015.7.

17. Verma, Davis. Implicit Aspect-Based Opinion Mining and Analysis of Airline Industry Based on User-Generated Reviews. SN Computer Science. 2021; 2(4). doi: 10.1007/s42979-021-00669-7.

18. Kumar, Zymbler. A machine learning approach to analyze customer satisfaction from airline tweets. Journal of Big Data. 2019; 6(1). doi: 10.1186/s40537-019-0224-1.

19. AlBadani, Shi, Dong. A Novel Machine Learning Approach for Sentiment Analysis on Twitter Incorporating the Universal Language Model Fine-Tuning and SVM. Applied System Innovation. 2022; 5(1): 13. doi: 10.3390/asi5010013.

20. Manchikanti, Madhurika. AirLine Tweets Sentiment Analysis using RNN and LSTM Techniques. International Journal of Advanced Trends in Computer Science and Engineering. 2020; 9(5): 8197-8201. doi: 10.30534/ijatcse/2020/184952020.

21. Bezek, Shams. Analysis of Airline Tweets by Using Machine Learning Methods. International Journal of Engineering Research and Applications. 2022; 10(7). doi: 10.9790/9622-1007034245.

22. Ouyang, Zhou, Liu. Sentiment Analysis Using Convolutional Neural Network. In: IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing, 2015, pp. 2359-2364.

23. Hasib, Habib, Towhid, Showrov. A Novel Deep Learning based Sentiment Analysis of Twitter Data for US Airline Service. In: 2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD), 2021, pp. 450-455. doi: 10.1109/ICICT4SD50815.2021.9396879.

24. Manchikanti, Madhurika.

25. Xie, Wen, Yang. Ternary Sentiment Classification of Airline Passengers’ Twitter Text Based on BERT. Journal of Physics: Conference Series. 2021; 1813(1): 012017. doi: 10.1088/1742-6596/1813/1/012017.

26. Heidari, Rafatirad. Using Transfer Learning Approach to Implement Convolutional Neural Network model to Recommend Airline Tickets by Using Online Reviews. In: 2020 15th International Workshop on Semantic and Social Media Adaptation and Personalization (SMA), 2020, pp. 1-6. doi: 10.1109/SMAP49528.2020.9248443.

27. Abdul Qudar, Mago. TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis. 2020. doi: 10.48550/arXiv.2010.11091.

28. Hasib. Sentiment analysis on Bangladesh airlines review data using machine learning. BRAC University Institutional Repository; MSc Thesis report. 2022. Available from: http://hdl.handle.net/10361/16666.

29. Alqahtani. Predict sentiment of airline tweets using ML models. EasyChair Preprint no 5228. 2021. Available from: https://easychair.org/publications/preprint/CNF4.

30. Chandrasekaran, Hemanth. Deep Learning and TextBlob Based Sentiment Analysis for Coronavirus (COVID-19) Using Twitter Data. International Journal on Artificial Intelligence Tools. 2022; 31(1). doi: 10.1142/S0218213022500117.

31. Radhakrishnan, Lakshminarayanan, Chatterjee, Hemanth. Forest data visualization and land mapping using support vector machines and decision trees. Earth Science Informatics. 2020; 13(4): 1119-1137. doi: 10.1007/s12145-020-00492-3.

32. Revathi, Anitha, Hemanth. Training feedforward neural network using genetic algorithm to diagnose left ventricular hypertrophy. TELKOMNIKA (Telecommunication Computing Electronics and Control). 2020; 18(3): 1285-1291. doi: 10.12928/telkomnika.v18i3.15225.
