Abstract
Recent works in Explainable AI mostly address the transparency issue of black-box models or create explanations for any kind of models (i.e., they are model-agnostic), while leaving explanations of interpretable models largely underexplored. In this paper, we fill this gap by focusing on explanations for a specific interpretable model, namely pattern-based logistic regression (PLR) for binary text classification. We do so because, albeit interpretable, PLR is challenging when it comes to explanations. In particular, we found that a standard way to extract explanations from this model does not consider relations among the features, making the explanations hardly plausible to humans. Hence, we propose
Introduction
Humans have been using explanations in AI for many purposes such as to support human decision making [35,39], to increase human trust in the AI [27,65], to verify and improve the AI [10,40], and to learn new knowledge from the AI [34,44]. Explanations may also be required for an AI-assisted system to comply with recent regulations including the General Data Protection Regulation (GDPR) [21]. These various needs for explanation have drawn a great amount of attention to the field of explainable AI (XAI) in recent years [1]. When an AI-assisted system is used for prediction (referred to as a prediction model or simply a model in the literature and in this paper), explanations for the system behavior are often categorized broadly into two types: local explanations and global explanations [1], where the former focus on explaining the predictions for specific inputs while the latter aim to explain the behavior of the model in general, irrespective of any inputs that it may take. If the model is inherently interpretable [58] (e.g., a decision tree), the model itself can be viewed as the global explanation whereas local explanations can be obtained during the prediction process (e.g., the corresponding path in the decision tree for the input, leading to the output/prediction). In this paper, we refer to explanations straightforwardly extracted from inherently interpretable models (e.g., the applicable path in a decision tree) as model-inherent explanations. However, if the model is opaque (e.g., it is a deep learning model), we may need to apply an additional step, by using a so-called post-hoc explanation method (e.g., LIME [54] and SHAP [43]), for extracting the explanations.
A number of properties of explanations have been identified as desirable in the literature, e.g., as in [62]. Amongst them, generally, we call an explanation faithful to the model if it accurately reflects the true reasoning process of the model, whereas an explanation is deemed plausible if it agrees with human judgement (e.g., as discussed in [26]). These two properties of explanations, i.e., faithfulness and plausibility, may be important in different situations. For instance, we want faithful explanations in order to verify the model correctness while we want plausible explanations to satisfy end users. Note that model-inherent explanations can be deemed faithful due to their straightforward and sensible explanation extraction process. However, this does not guarantee other desirable properties of the explanations. For instance, using a decision tree path with depth of 15 as an explanation is not comprehensible to humans and, therefore, not very plausible either. Post-hoc explanations could be more effective for impressing end-users in this case though they are not inherently extracted from (or perfectly faithful to) the underlying interpretable model (i.e., the decision tree).
In this paper, we develop a novel post-hoc local explanation method that aims to generate plausible explanations for a specific class of interpretable prediction models performing binary text classification with natural language data. Binary text classification aims to classify a given text into one of two possible categories. Examples of binary text classification (both studied in this paper) are sentiment analysis (where a piece of text is classified as having positive or negative sentiment) and spam detection (where a message is classified as spam or not). Our interpretable prediction models are built using logistic regression (LR) [29, chapter 5] with textual patterns [60] as features. LR is a traditional machine learning method, leading to interpretable models with linear combinations of features, that can be used, in particular, for text classification [29, chapter 5]. Because text documents are unstructured data, we need to perform feature extraction so as to obtain numerical representations of the documents before training the LR classifier. One standard way to do feature extraction is using frequent n-grams (i.e., frequent n consecutive words in the dataset) as features and applying TF-IDF vectorization to find associated values of the features [71, chapter 2]. Though using n-grams as features is simple and often effective, it makes the model less generalizable to words or n-grams that have never appeared during training. Also, the features are usually too fine-grained for humans to synthesize the overview of what the model has learned even though LR is inherently interpretable. In this paper, as elsewhere (e.g., [9,20,25]), we use patterns as features for prediction, in alternative to n-gram features. Specifically, we exploit the interpretability of the patterns, as in [18], but by using them as features for logistic regression models. We call our models of interest pattern-based logistic regression (PLR) models.1
Note that LR models with n-gram features (e.g., by using CountVectorizer in the scikit-learn library [51]) can be seen as instances of PLR too. Nonetheless, in this paper, we go further by using patterns as features so the resulting explanations provide more high-level knowledge into the learned tasks.
PLR models are inherently interpretable because LR is interpretable and because their features are interpretable and, as we will show, convenient for humans to learn or to extract knowledge from. However, their model-inherent explanations for PLR may not be plausible. This issue is especially critical when interactions among input features underpin the model whereas the model-inherent explanations treat features independently of each other. These feature interactions may result from agreement or disagreement between correlated pattern features. In order to address this problem, our proposed explanation method leverages computational argumentation (CA) to take care of the feature interactions and generate more plausible local explanations than the model-inherent explanations. We call our novel explanation method (applying CA to better explain PLR) Argumentative eXplanations for Pattern-based Logistic Regression (AXPLR, pronounced “ax-plore”).2
Note that we use AXPLR to indicate both our method for generating explanations and the generated explanations themselves; also, when used to refer to explanations, AXPLR has the same form in the singular and the plural.
Generally, local explanations often have an argumentative spirit by nature since they need to argue for or against possible predictions of the model [14]. When there are several arguments involved, these arguments may also have dialectical relationships between each other. Hence, there are several existing works which use computational argumentation to underpin XAI methods and produce argumentative explanations. For example, DEAr [12] considers related training examples as arguments, which argue to classify a test example, and uses a dispute tree [15] as dialectical explanation. DAX [16] extracts local argumentative explanations from a deep neural network by using arguments and their relations to represent the nodes and their connections in the neural network. (For more approaches, see recent survey papers of argumentative XAI [14,67].) In our work, AXPLR uses pattern-based input features of the PLR model as arguments and draws dialectical relations from specificity of the pattern features. All are modeled using modified versions of Quantitative Bipolar Argumentation Frameworks (QBAFs) [7] before being processed and translated into argumentative explanations for human consumption. Specifically, we use two variants of QBAFs. Both include a mapping associating arguments with the classification they advocate, in addition to arguments, attack and support relations and base scores as in standard QBAFs. The two variants differ in the way they use specificity of patterns to define the direction of attacks and supports.
We summarize the contribution of our work as follows.
We show that model-inherent local explanations for pattern-based logistic regression can lead to implausible explanations.
We propose AXPLR, a novel argumentative explanation method, to tackle the above problem by modeling relationships among pattern features using quantitative bipolar argumentation.
We prove that the argumentation framework underpinning AXPLR always predicts the same output as the original PLR model and satisfies several dialectical properties of human debates.
Using three binary text classification datasets, we conduct an empirical evaluation of the extracted argumentation frameworks. Moreover, for the same datasets, we conduct two human experiments to evaluate how plausible and helpful AXPLR is for human consumption compared to other explanation methods.
Note that, although patterns and LR are widely used in machine learning, and thus their combination in PLR cannot be deemed novel, we are the first, to the best of our knowledge, to show that PLR can be fruitfully used for NLP tasks to obtain explainable predictions. Furthermore, plausibility of the explanations could be improved using the CA-based explanation method proposed in this paper.
In the remainder of the paper, we explain the pattern-based logistic regression (PLR) model in Section 2. Then we discuss the weakness of the model-inherent explanations for PLR in Section 3. After that, Section 4 describes the two variants of QBAFs underpinning AXPLR, while Section 5 shows that these QBAFs satisfy many dialectical properties of human debates, leading to leaner derived AXPLR (of which the presentations are described in Section 6). Next, the experimental setup for AXPLR is explained in Section 7, followed by one empirical experiment in Section 8 and two human experiments (to assess the amenability of the argumentation underpinning AXPLR specifically) in Sections 9 and 10. Lastly, we discuss generalization and other possible uses of AXPLR in Section 11, position our work with respect to other related work in Section 12, and summarize the paper in Section 13. Code and datasets of this paper are available at
In this section, we provide necessary background on text classification (see Section 2.1) and PLR, including logistic regression (LR) which is the core machine learning method of PLR (see Section 2.2) and pattern features as well as the pattern extraction algorithm GrASP [60] used for constructing pattern features from training data (see Section 2.3). We conclude with an illustration of the overall process of PLR combining LR with GrASP for text classification (see Section 2.4). To illustrate ideas, we will use sentiment analysis as a running example of text classification throughout this section.
Binary text classification
We focus on the binary text classification task with two classes, using, as conventional,
Logistic regression
For each input text x, let us assume that x can be represented as a feature vector
The LR model is obtained after the training process is completed; it is fully characterized by
The next questions are “How do the d features look like for text?” and “How can we obtain them?”. We use pattern features whereby patterns indicate high-level characteristics of words in input texts, in addition to specifying exact words or lemmas. These high-level characteristics include both syntactic attributes (such as part-of-speech tags) and semantic attributes (such as synonyms and hypernyms). Thereby, we choose
Pattern features and GrASP: GReedy augmented sequential patterns
GrASP is a supervised algorithm which learns expressive patterns able to distinguish two classes of text [60]. An example of GrASP pattern for distinguishing between positive and negative texts in sentiment analysis is [[ “There is “... worse products .
GrASP is applied directly to the training data. In order to use it, we need to prepare two lists of texts containing positive and negative examples that we want to distinguish. Also, we need to specify some hyperparameters such as the desired number of patterns, the number of gaps allowed, the set of linguistic attributes which can appear in the patterns, and the maximum number of attributes per pattern. In the experiments in this paper, we employ the publicly released implementation of GrASP [38] which provides several built-in attributes that are suitable for classification tasks in general, e.g., the token text itself, its lemma, its hypernyms (according to wordnet [45]), part-of-speech tags, and sentiment tags. The resulting GrASP patterns are used as features in the LR model. Note, however, that we do not utilize the associated class GrASP assigns to each pattern to classify the input directly because GrASP does not tell us how to properly deal with the input that matches multiple (and potentially contradicting or relating) patterns. Instead, we use the patterns from GrASP only as features for training a classifier, letting the classifier decide how multiple patterns should play their roles and contribute to the final classification.
Pattern-based logistic regression using GrASP

Overview of the processes for training, using, and explaining PLR for binary text classification.
In this paper, we focus on PLR, i.e., LR with GrASP patterns as features. Figure 1(a) shows how to train such PLR model. First, we feed
Next, Fig. 1(b) shows how to use the trained PLR model for prediction (and how our proposed explanation method AXPLR connects to the prediction process). Given an unseen input text x, we get the prediction by extracting the feature vector
Since logistic regression is inherently interpretable and the GrASP patterns used are also interpretable, we can generate local explanations for inputs to a trained PLR model by reporting parts of the inputs that match the top-k patterns, in the spirit of much work in the XAI literature (e.g., [46, chapter 5] and [10]). Formally, given an input x, let

An illustrative example of using pattern-based logistic regression for sentiment analysis. (Here, FLX stands for flat logistic regression explanation, see Section 3.)
For the example in Fig. 2, the input text x matches four patterns, and the model predicts class positive (i.e.,

One possible QBAF. Arrows with plus and minus signs represent supports and attacks between arguments, respectively. The base score of each argument is displayed with a real number staying close to the argument.
Specifically, we aim to use a form of quantitative bipolar argumentation frameworks (QBAFs) [7] to simulate how the PLR model works on an input text. As background, a QBAF is a quadruple
In this section, we introduce our AXPLR method, whose overall generation process is shown in Fig. 4, alongside the illustrative example from Fig. 2. In Fig. 4, the part above the purple line is the standard prediction process already captured in Fig. 2, starting from extracting the feature vector from the input text and then computing the predicted probability using the model weights (

Overview of the AXPLR generation process. Above the purple line, it shows the standard prediction process of pattern-based logistic regression. Below the purple line, it shows the four main steps to generate AXPLR for the illustrative example from Fig. 2 (using a bottom-up QBAFc (BQBAFc), where edges labelled + indicate support and edges labelled − indicate attack). In step 1,
The second step computes the dialectical strength of each argument, considering its attacker(s) and supporter(s). Here, we propose a new strength function, returning a real number score in
In the remainder of this section, we provide details for the first three steps of the AXPLR generation process (in Sections 4.1, 4.2, and 4.3, respectively). Then, in Section 6, we give details of step 4. Before that, in Section 5, we prove formal properties of (original and post-processed) QBAFcs, providing formal guarantees about their suitability to give rise to explanations.
To begin with, we define how two patterns can be related.
A pattern
For instance, we can say from Fig. 2 that
The relation ≻ is not reflexive and not symmetric, but it is transitive.
See Appendix A.1. □
Next, we extract argumentation frameworks from a trained PLR model and a target input text x. These argumentation frameworks, like QBAFs [7], envisage that arguments can attack or support arguments, and that they are equipped with a base score. However, these frameworks differ from QBAFs in that the arguments therein support3
In this paper, we abuse terminology and use the term ‘support’ with two meanings: an argument may support a class (by means of the function c in Definition 2) or an argument may support another argument in a dialectical sense (relations
Given a trained binary logistic regression model based on feature patterns

The extracted top-down QBAFc for the example in Fig. 2. Here and everywhere in this paper we show QBAFcs as graphs, with nodes representing the arguments and labelled edges representing attack (−) or support (+). The color of the nodes represents the supported class (i.e., green for positive (1) and red for negative (0)). (The meaning of the equalities of the form

The extracted bottom-up QBAFc for the example in Fig. 2. The color represents the supported class (i.e., green for positive (1) and red for negative (0)). (The meaning of the equalities of the form
To explain, both the TQBAFc and the BQBAFc use the same
The extracted TQBAFc and BQBAFc for the example in Fig. 2 are shown in Figs 5 and 6, respectively. Following Definition 2, the differences between the TQBAFc and BQBAFc are the
We proposed both the top-down and the bottom-up arrangements of QBAFcs as they are suitable for different situations. Later, in Section 6, we will show that, in TQBAFcs, we explain to users with general patterns first and provide more specific patterns as details when requested. In BQBAFcs, by contrast, we explain to users with specific patterns first (as they contain more information) and mention general patterns as supporting or opposing reasons.4
Apart from TQBAFcs and BQBAFcs, there might be other possibilities to construct argumentation frameworks from the patterns which will lead to final explanations that are different from AXPLR. However, the investigation of such alternatives is outside the scope of this paper.
After this point, when we mention a QBAFc in this paper, we mean that it could be either a TQBAFc or a BQBAFc, unless otherwise stated. We assume that any generic QBAFc is of the form
Given a QBAFc
Indeed, we can see from
Additionally, thanks to Definition 2 and Lemma 2, the graph structures underlying any TQBAFc and BQBAFc are directed acyclic graphs (DAGs).
The graph structure of any QBAFc is a directed acyclic graph (DAG).
See Appendix A.2. □
After we obtain the QBAFcs, the next step is to calculate the dialectical strength of each argument therein. To make this strength faithful to the underlying PLR model, we propose the logistic regression semantics, given by the strength function σ, defined next.
The logistic regression semantics is defined as the strength function
This semantics can be applied to both TQBAFcs and BQBAFcs. According to Equation (3), the strength of an argument starts from its base score, and it is increased and decreased by the strengths of its supporters and its attackers, respectively. However, the strength of each supporter/attacker must be divided by its out-degree (i.e.,
Because QBAFcs are DAGs (see Theorem 1), we can use topological sorting to define the order to compute the arguments’ strengths. Considering the TQBAFc in Fig. 5, for example,
All the results are displayed in Fig. 5. Similarly, the strengths are computed for the BQBAFc and shown in Fig. 6. We can see that the strength of the default arguments δ of both TQBAFc and BQBAFc is equal to the absolute of the logit
For a given QBAFc, the prediction of the underlying PLR model can be inferred from the strength of the default argument:
The predicted probability for the class
Hence, if
See Appendix A.3. □
In other words, we can read the prediction from the default argument δ. The negative strength of δ implies that the argument can no longer support its originally supported class; therefore, the prediction must be the opposite class. Since
We have shown how to extract QBAFcs, equipped with a suitable notion of dialectical strength to match the workings of PLR so as to serve as a basis for explanation thereof. Nevertheless, when arguments in these QBAFcs have a negative dialectical strength, the human interpretation of any resulting explanations may be difficult. Using Fig. 5 as an example, we can see that argument
Given a QBAFc If If
According to Definition 4, to post-process a QBAFc from the previous step, we change the supported class of arguments with negative strengths (

The extracted top-down QBAFc in Fig. 5 after being post-processed.

The extracted bottom-up QBAFc in Fig. 6 after being post-processed.
Given a QBAFc
If
If
See Appendix A.4. □
Given a QBAFc and the corresponding
Furthermore, Theorem 2 also applies to
For a given
Thus, intuitively, the effect of the post-processing step is to flip all the negative strengths to be positive, so we adjust the QBAFc accordingly, while preserving the interpretations of the arguments. For instance, if the original argument a has
In this section, we analyze the logistic regression semantics σ, when applied to QBAFcs and
Dialectical properties for QBAFs
under semantics
(adapted from [7])
Dialectical properties for QBAFs
Summary of the group properties for gradual semantics [7] satisfied or unsatisfied by the logistic regression semantics σ when applied on QBAFcs and
Table 2 summarizes our results (the proofs are in Appendix A.5). To briefly explain here, GP1 is satisfied by both
In conclusion,
Presenting the whole
This task aims to classify whether a review is genuine or fake.

Example of shallow AXPLR for deceptive review detection. The partial input text and the model prediction are shown in the top-most box. The shallow AXPLR shows evidence for both the deceptive class and the truthful class. The patterns shown correspond to strongest supporters and attackers of δ. The meaning of each pattern/argument is also provided. The color and its intensity represent the supported class and the strengths of the arguments, respectively.
Shallow AXPLR leverage only the attackers and supporters of δ, but ignores the additional information available in the

Example of deep AXPLR for deceptive review detection. The partial input text and the model prediction are shown in the top-most box. The deep AXPLR shows evidence for both the deceptive class and the truthful class. A user can expand some patterns/arguments to see their sub-patterns (i.e., their attackers and/or supporters) such as, on the left, [[
Shallow and deep AXPLR are just two examples of explanations that can be drawn from
To evaluate AXPLR, we conducted both empirical and human evaluations. For the empirical evaluation, we calculated some statistics for the
Our human experiments were approved by the Science Engineering Technology Research Ethics Committee (SETREC) of Imperial College London on 18 August 2021. The SETREC reference is 21IC7119.
In the experiments, we targeted binary text classification using three English datasets as shown in Table 3. The table also shows the classes we consider as positive and negative when running GrASP and AXPLR and the number of examples for each data split (used for training, developing, and testing the models; please see Appendix B for more details about these splits). Specifically, the datasets are:
Datasets used in the experiments
For the LR classifiers of the first two datasets, the GrASP patterns were constructed with lemma, part-of-speech tags (POS), wordnet hypernyms, and sentiment attributes. We used alphabet size of 200, allowed two gaps in the patterns, and generated 100 patterns in total. For the last dataset (Deceptive Hotel Reviews), the settings were the same except that we used text attributes (capturing the whole word) instead of the lemma attributes and we generated 200 patterns in total. The performance of the LR classifiers of the three datasets are reported in Table 4. Accuracy is the percentage of correct predictions on the test set, while F1 is a harmonic mean of Precision and Recall of the model. Positive F1 and Negative F1 are F1s when we consider the positive and the negative classes as the main class. These two F1s are then averaged to be Macro F1. For more details about the evaluation metrics, please see Appendix B.
Performance of the pattern-based LR models on the test sets
We divide the empirical evaluation into two parts. The first part discusses the statistics for
Statistics for
s
Table 5 shows the statistics of the
Statistics (Average ± SD) of
for the SMS Spam Collection, Amazon Clothes, and Deceptive Hotel Reviews dataset.
,
are sets of arguments supporting positive and negative classes, respectively. TP, TN, FP, and FN stand for true positives, true negatives, false positives, and false negatives, respectively. The number of examples for each case as well as the total number of examples are indicated in the last row of the table
Statistics (Average ± SD) of
According to Table 5, the spam dataset had the minimum average number of arguments (∼10 arguments per example as shown by
Unlike the SMS Spam Collection dataset, the base scores of δ for the Amazon Clothes and the Deceptive Hotel Reviews datasets were 0.2597 and 0.6932 supporting the negative class, respectively (not shown in the table). In order to push the prediction to either positive or negative, we needed evidence. Hence, for these two datasets, the average number of arguments were similar for both classes (as shown by
In conclusion, the statistics of
Next, given a
The smallest number of supporting arguments k which are sufficient to make the strength of the default argument δ greater than 0 for 80% and 100% of δ from all the test examples. We consider both
and
and, specifically, when the final supported class of δ is and is not the original supported class of δ before the post-processing step
The smallest number of supporting arguments k which are sufficient to make the strength of the default argument δ greater than 0 for 80% and 100% of δ from all the test examples. We consider both
The smallest number of supporting arguments k which are sufficient to make the strength of intermediate arguments α greater than 0 for 80% and 100% of intermediate arguments from all the test examples. We consider both
Because different arguments could have different values of k that satisfy the sufficiency condition in Equation (4), we consider the values k which are sufficient for 80% and 100% of the arguments for each dataset. Tables 6 and 7 show the results for the default arguments δ and the intermediate arguments
Additionally, for δ of the SMS Spam Collection and the Amazon Clothes datasets,
Considering the sufficiency for intermediate arguments α in Table 7, only one supporting argument was usually sufficient to explain the supported class. Even without any supporters, only the base score was sufficient in most cases if the supported class is not flipped after post-processing (i.e.,
To sum up, the key finding from this sufficiency analysis is threefold. First, the number of supporting arguments required for the default argument varies for each task and predicted class. Second, using
In this section, we aimed to evaluate the plausibility of AXPLR, compared to FLX, to confirm our hypothesis that it is essential to consider relations between features (i.e., patterns) when we generate local explanations. So, we compared the feature scores given by the explanation methods to scores reflecting how humans consider the features. For instance, if a machine indicates pattern
Datasets
We used the SMS Spam Collection (spam filtering) and the Amazon Clothes (sentiment analysis) datasets since humans generally perform well on these two tasks, making the human scores reliable. For each dataset, we needed 500 test examples for evaluation. These examples must have at least one pattern matched (so that the explanation contains at least one pattern to be investigated), and they must have the predicted probability of the output class greater than 0.9 to ensure that the bad quality of the explanation was not due to low model accuracy or text ambiguity. Note that we did not conduct this experiment on the Deceptive Hotel Reviews dataset as lay humans are not adept at identifying deceptive reviews. The human accuracy was only around 55% in [34], so we cannot trust human judgement on machine explanations in this task. We would work on the deceptive review detection task in the next experiment instead.
Machine explanations
As discussed in Section 6, both FLX and AXPLR use
Human scores
We recruited human participants via Amazon Mechanical Turk (MTurk)7
If we have less than five unique matched phrases in the training set, we just show all of them.
For each dataset, since there are only 100 distinct patterns, we needed 100 questions for patterns and another 100 questions for groups of sample phrases of those patterns. However, the numbers of matched phrases are different between the two datasets. The SMS Spam Collection test samples had 2,964 distinct matched phrases in total, while the Amazon Clothes test samples had 3,140 distinct matched phrases. Considering both datasets and all the three types of questions, we had (100 + 100 + 2,964) + (100 + 100 + 3,138) = 6,502 distinct questions. Each distinct question was answered by five participants, and the scores were averaged before comparing with machine explanation scores. In other words, we can read the Pearson’s correlations as the degree of alignment between machine explanations and the average of the human annotations. Concerning the payment for answering questions, we paid the participants $0.30 per 10 pattern questions, $0.20 per 10 group-of-phrases questions, and $0.20 per 20 matched phrase questions.

Examples of questions (from the Amazon Clothes dataset) posted on Amazon Mechanical Turk to elicit human scores.
Pearson’s correlation between explanation scores and human scores for the SMS Spam Collection dataset
Pearson’s correlation between explanation scores and human scores for the Amazon Clothes dataset
Tables 8 and 9 report the Pearson’s correlations between the machine explanation scores and the human scores collected from Amazon Mechanical Turk for both datasets. The bottom part of each table shows inter-rater agreement metrics to reflect the agreement among annotators for each pair of dataset and question type. Because different questions may be answered by different sets of individuals, we chose Fleiss’ kappa [19] as the inter-rater agreement metric in this experiment. Considering five answer options (categories) as explained in Section 9.3, we observe that the agreement metrics for the SMS Spam Collection dataset were very close to zero (especially for the questions for patterns and samples), while the agreement rates for the Amazon Clothes dataset (sentiment analysis task) were noticeably higher. Even though the Fleiss’ kappa metrics of the latter task (with five answer categories) are around 0.210–0.357, we noticed that the different human annotations for the same pattern/phrase usually belong to the same polarity but with different degrees such as a phrase getting three “Definitely negative” and two “Negative” from five annotators. Hence, we calculated the Fleiss’ kappa again but now based on the answer polarity only. Specifically, we considered “Definitely negative” and “Negative” to be a single answer category and considered “Definitely positive” and “Positive” to be another answer category. Together with the “Not sure” option, we had three answer categories, and we then reported the resulting Fleiss’ kappa in the last row of Table 9. With a similar way of grouping answer options, the last row of Table 8 shows the Fleiss’ kappa when considering three answer categories for the SMS spam detection task (i.e., “Spam”, “Not spam”, and “Not sure”). We can see that even considering three answer categories does not help the SMS spam detection task, confirming that their human answers are unreliable. On the contrary, the Fleiss’ kappa scores are significantly stronger for the sentiment analysis task (Amazon Clothes dataset) if we consider only three answer categories, confirming that human answers in this task are reliable. This was likely because evidence from the sentiment analysis task (including patterns, samples, matched phrases) usually conveys clear meanings even without contexts, whereas evidence from the spam detection task often requires contexts for humans to make decisions. For example, upset, worthless, and disappointed were surely for negative reviews. In contrast, mobile, win, and call could appear both in spam and non-spam texts. This caused higher disagreements in human answers though the model used these words certainly as evidence for the spam class. As a result, the human scores for the spam task were less reliable than the scores for the sentiment analysis task. Consequently, the overall correlations in Table 8 were also less than the scores in Table 9.
Hence, we focused on discussing the results in Table 9 with more reliable human scores. For each row, the correlation between the explanations and the (average) human scores for patterns was lower than for samples and matched phrases. Therefore, we should show not only the patterns but also some matched samples of the patterns to generate better plausible explanations. In addition, for
Summary and discussion
In this experiment, we showed that the calculated strengths
Experiment 3: Tutorial and real-time assistance
Among the three datasets, the deceptive review detection task is the most difficult tasks for humans. When a trained model is more effective than humans in a particular task, it would be beneficial for humans to learn insights or tricks from the model, and explanations pave the way towards the learning as they reveal how the model works to humans. In this experiment, therefore, we follow the study [34] to evaluate how effective AXPLR can be used to teach and support humans to perform deceptive review detection.
Setup
We recruited participants via Amazon Mechanical Turk and redirected them to our a survey created using Qualtrics.9
Attention-check questions (4 questions) – The participant needed to answer all the questions in this part correctly to proceed.
Pre-test (10 questions) – For each question, the participant was asked whether a given hotel review was truthful or deceptive.
Tutorial (10 questions) – The format was the same as part 2, but then, we revealed to the participant the correct answer and the AI-generated prediction and explanation for them to learn from.
Post-test (20 questions) – For the first ten questions, the questions and the format were the same as part 2. We additionally showed what the participant had answered during the pre-test as a reference. The next ten questions were the same as the first ten except that we also provided AI explanations (without the predictions) for these questions, as real-time assistance [34]. The format of the explanations was the same as what s/he had seen during the tutorial phase. The corresponding previous answer (from the first ten questions) was also provided when the participant answered each of the last ten questions.
Additional questions (5 questions) – The participant was asked general questions before finishing the survey. These include, for example, how they detected deceptive and truthful reviews and any (free-text) feedback they might want to tell us.
A guaranteed reward ($2.00) was given after the participant completed the whole survey.
A bonus reward – The participant was given an additional bonus reward of $0.10 for each question answered correctly (both in the pre-test and in the post-test). Therefore, the maximum bonus reward each participant could get was
We compared four explanation methods in this experiment including SVM, FLX, shallow AXPLR, and deep AXPLR. We selected linear SVM since there is a study [34] showing that tutorials from simple models such as linear SVM worked better than tutorials from deep models such as BERT [17]. To train the SVM, we used TF-IDF vectorizer and employed exhaustive search to find the best hyperparameter

Example of SVM explanation during the tutorial phase.
FLX, shallow AXPLR, and deep AXPLR were extracted from the same pattern-based LR model, of which the performance was shown in Table 4. Note that the LR model underperformed the SVM model, with the accuracy of 0.853 and 0.891, respectively.10
The lower performance of PLR compared to SVM is probably because the SVM model had 7,663 token-based features in total, resulting from TF-IDF vectorization, whereas the PLR model had only 200 pattern-based features from GrASP. We believe that increasing the number of GrASP patterns for PLR would increase the model accuracy and make it more competitive to the SVM model with regards to model performance. However, we decided not to increase the number of GrASP patterns in the experiment as having too various and scattering pattern features in the explanation during the tutorial phase might be overwhelming for the participants.
In fact, we may also try using
For test questions, we randomly selected 50 questions from the test set of the Deceptive Hotel Reviews dataset. Then we partitioned them into five question sets (10 questions each). One participant was assigned one set of test questions and one explanation method (for tutorial and real-time assistance). The ten test questions were used for both the pre-test part and the post-test part of the survey (see Section 10.1). Each pair of explanation method and question set was assigned to five people. Overall, we had 4 explanation methods × 5 question sets × 5 annotations = 100 surveys in total. So, we recruited exactly 100 participants on MTurk without allowing a participant to do the survey twice.
To generate the tutorial part for each explanation method, we selected ten examples from the development set of the Deceptive Hotel Reviews dataset. The selection was done using submodular pick [54] to ensure that the ten selected examples covered important features of the task. Although submodular pick is a greedy algorithm, it provides a constant-factor approximation guarantee of
Results
The average scores of human participants are displayed in Table 10 together with the “Model” column, which reports the performance of the underlying AI models, i.e., SVM and PLR, on the same set of questions without humans involved. Nonetheless, the model performance is not the main focus in this paper but the quality of explanations is. Specifically, we were wondering how well explanations can teach and support humans to detect deceptive reviews. So, when reading Table 10, we should focus more on the post-test scores. In particular, we aim to check two possible success scenarios in this experiment. First, is it the case that the post-test scores of the participants (with or without real-time assistance) are higher than their pre-test scores? If yes, this shows the effectiveness of tutorial and real-time assistance. Second, is it the case that the post-test scores of the participants (with or without real-time assistance) are higher than the scores of the AI model (SVM or PLR) they learned from? If yes, this shows that humans can combine what they learned from the AI with their background knowledge to outperform the AI.
Checking the first success scenario using pre-test scores as a baseline, we observe that the tutorial phase only did not help the participants perform better as the post-test scores without real-time assistance were not significantly greater than the baseline. However, the real-time assistance after the tutorial indeed helped. By the approximate randomization test with 1,000 iterations and a significance level of 0.05 [22,47], the post-test scores with real-time assistance from the explanations were significantly higher than the pre-test scores and the post-test scores with no assistance of the same explanation methods. Nevertheless, using the same approximate randomization test, we see no significant difference across explanation groups, so we can conclude only that FLX, shallow AXPLR, and deep AXPLR are competitive with SVM for providing explanations to teach and support humans to detect deceptive reviews.
Scores of the human participants (Average ± SD) in the tutorial and real-time assistance experiment using the Deceptive Hotel Reviews dataset. The last column shows the average score of the model that provides real-time assistance. The maximum score is 10
Scores of the human participants (Average ± SD) in the tutorial and real-time assistance experiment using the Deceptive Hotel Reviews dataset. The last column shows the average score of the model that provides real-time assistance. The maximum score is 10
In order to check the second success scenario, we consider the average performance of the underlying AI models (see the “Model” column in Table 10). We found that SVM achieved 9 out of 10 in three question sets and 10 out of 10 in the other two, whereas the LR model (underlying FLX and AXPLR) got 7, 7, 8, 9, and 10 for the five question sets (regardless of the order). The total numbers of people that scored better than or equal to the AI during the pre-test, post-test with no assistance, and post-test with assistance are 6, 8, and 26 out of 100 people, respectively. This again shows the effectiveness of real-time assistance after the participants learned from the tutorial. However, if we consider only the cases where the human strictly outperformed the AI model, these numbers reduce to 3, 5, and 10, respectively.
The number of participants for each explanation method (out of 25) who can be considered a successful case with respect to the two scenarios – Post-test score > Pre-test score and Human score > AI score
To further investigate both success scenarios, Table 11 counts the number of participants for each explanation method (out of 25) who can be considered a successful case with respect to the two success scenarios. It can be seen from the table that the first success scenario is easier to achieve although only around 60% of the participants (14–16 out of 25) can outperform their pre-test scores using real-time assistance. In contrast, the second success scenario is very difficult to achieve, especially for SVM explanations of which the AI got almost a perfect or near-perfect score. Scaling the experiment up (i.e., increasing the number of difficult questions) would provide more room for humans to beat the AI. Nevertheless, our experiment shows that there is still a large room for improvement in this human-AI task.
Some answers from the participants on how they knew that a review was
Finally, we asked the participants in the final part of the survey how they detected deceptive and truthful reviews. We manually selected interesting answers from the participants who got 8 correct answers or more during the post-test with AI assistance. The answers are shown in Tables 12 and 13. As expected, participants learning from SVM explanations rarely mentioned patterns but individual words. Some used the majority of highlighting colors as a heuristic (which was surprisingly effective, probably due to the good performance of SVM). Since FLX was extracted from the LR model with GrASP patterns, we noticed some patterns and generalizations noted by participants who learned from FLX such as “when my was closely followed by 1 and hotel was followed by different words” and “the language used, and symbols and punctuation”. Similarly, we also saw patterns noted by participants who learned from both types of AXPLR such as “It uses pronouns closely together” and “The patterns of specific words close together stood out, like luxury hotel.”, as well as implicit patterns such as “There’s also much less usage of city and hotel names”. On the other hand, they could also cover word-level cues, as we can see from the comments like “I would also assume it was deceptive if the reviewer said “I” a lot.”. However, there was no participant in the deep AXPLR group mentioning the usefulness of sub-patterns (which could be expanded or collapsed). Also, the average scores of both types of AXPLR were not significantly different. It could imply that shallow AXPLR is already sufficient for tutorial and real-time assistance, without the need to go deep. Last but not least, we found two interesting comments from the deep AXPLR group. One contrasted deceptive and truthful reviews – “If the review said “location” as apposed to naming the city, I was more likely to assume it was true, or if it mentioned the elevators or doormen. If it said “we” instead of “I” I was usually more inclined say it was truthful.”. The other theorized the reason behind prominent patterns – “human phrasing that doesn’t have hallmarks of being algorithmically generated or designed with the obvious intent to be picked up by a search engine (repeatedly mentioning the word Chicago was one example of this used).”.
Some answers from the participants on how they knew that a review was
We may conclude from the results of Experiment 3 that AXPLR is competitive with SVM and FLX in terms of assisting humans in detecting deceptive reviews. Also, according to the qualitative analysis, AXPLR helps humans capture non-obvious patterns which are helpful to perform the task to some degree. Still, there is a gap between human performance and model performance as we can notice in the last two columns of Table 10. To narrow down this gap further, there are some interesting directions that could be explored. First, how could we make the tutorial part more effective? We hypothesize that submodular pick might not be the best method to select tutorial questions. In fact, [34] has tried the spaced repetition strategy where humans are presented with important features repeatedly (with some space in-between). However, it cannot be concluded from their experiment that spaced repetition is significantly better than submodular pick when it comes to selecting tutorial examples. It would be interesting to study whether there is a better method to select and arrange tutorial questions for supporting human learning.
Additionally, in our experiment, AXPLR transformed
General considerations on AXPLR
In this section, we discuss other possible applications of AXPLR and the generalization of AXPLR to other pattern extraction systems (beyond GrASP) as well as other machine learning models (beyond logistic regression). Lastly, we describe some challenges of generalizing the current version of AXPLR to multi-class classification.
Other possible applications of AXPLR
According to Experiment 2, AXPLR renders highly plausible explanations compared to FLX, the traditional explanation method of LR. One possible reason for AXPLR not shining in Experiment 3 is that plausibility may not always be necessary for the tutorial and real-time assistance task. Humans might perform the task by remembering and applying useful patterns without a clear understanding why such patterns are for the genuine class or the deceptive class. On the other hand, AXPLR would be more suitable for the task where plausibility is needed. For example, if we use the classifier as a decision support tool, we want the explanation to provide insights about the input text that align well with human reasoning. Even though the prediction is correct, if the explanation does not make sense, it is possible that the humans distrust the model and make a wrong final decision, leading to undesirable consequences.
Another context where AXPLR could be useful is explanation-based human debugging of the model [42]. The individual model weight
Generalization beyond GrASP
Generally, AXPLR aims to model dependency among features of logistic regression, and we used GrASP patterns as features in this paper. There are two pre-conditions for applying AXPLR to other models. First, the model operates by computing a linear combination of binary features and weights and applying a threshold on the result to perform binary classification. Second, we can identify specificity relations between features. As long as we use logistic regression for binary classification, the only missing step to generalize from GrASP to another pattern extraction system is properly defining specificity relations between patterns from the extraction system. As a very simple example, one can train a logistic regression model with frequent n-grams (n = 1, 2, 3) as features. So, among the n-gram features, some could be dependent on others such as {“I”, “I like”, “I like it”, “like it”, “like”, “it”}. We can easily identify specificity relations among these n-grams features, e.g., “I like it” ≻ “I like” ≻ “like”. Similarly, for a logistic regression model using regular expressions as features, the subset relationship between regular expressions can be used to define specificity relations between features in the spirit of Definition 1. Therefore, AXPLR is applicable to these models too.
Generalization beyond logistic regression
Apart from logistic regression, linear support vector machine (linear SVM) is another model that computes a linear combination of features and weights and applies a threshold on the result [73, chapter 21]. Specifically, for linear SVM, if
Furthermore, we argue that the dependency between features could emerge even in deep learning models. According to [40], filters of Convolutional Neural Networks (CNNs) [31], a class of deep learning architectures, behave like fuzzy pattern detectors. From a CNN model for abusive language detection, word clouds in Fig. 13 visualize n-grams that strongly activate some of the filters. We can see that these features are not independent though they are not written in explicit forms, unlike the interpretable GrASP patterns. Still, we may approximate patterns from the word clouds as follows: feature 7 = [[

In conclusion, despite the focus on PLR using GrASP, the core idea of our paper is to model relationships among dependent features using computational argumentation in order to create more plausible explanations for text classification. This is very relevant to the computational argumentation community and has potential to be extended to other models in the future.
Even though logistic regression can be extended to multi-class classification by using a weight matrix (instead of a weight vector
Related work
Local explanations for text classification. Text classification is a fundamental task in natural language processing, so there exist many explanation methods which are applicable to this task. Focusing on local explanation methods (aiming to explain specific predictions), we can see several forms of explanations in literature such as extracted rationales [28,37], attribution scores [6,61], rules [55,63], influential training examples [30,32], and counterfactual examples [57,72]. Since AXPLR forms an explanation using triplets of a pattern
Computational argumentation for explainable AI. As discussed in the introduction, computational argumentation has been used to support some XAI methods and construct argumentative explanations in the literature. According to [14], existing works in this area can be divided into two groups. The first group (i.e., intrinsic) draws explanations from models that are natively using argumentative techniques such as AA-CBR [13] and DeLP [56]. The second group (i.e., post-hoc) extracts argumentative explanations from non-argumentative models such as neural networks [16] and Bayesian networks [66]. Following [14], some post-hoc approaches create a complete mapping between the target model and the argumentation framework from which explanations are derived such as [2,59], while other post-hoc methods create an incomplete mapping between the model and the argumentation framework (so called approximate approaches) such as [16,66]. However, AXPLR is a post-hoc approach (due to the non-argumentative PLR model) that does not fit nicely into this complete-approximate dichotomy. On one hand, AXPLR constructs a complete mapping between the PLR model and the QBAFc since every activated feature in the model (as well as the bias term) has a corresponding argument in the QBAFc. On the other hand, the logistic regression semantics σ of AXPLR approximates the dialectical strength of every argument given that this strength does not actually exist in the PLR model. The approximation of σ is under an assumption that the strength of an argument is distributed equally and accordingly to every argument it attacks or supports, as represented by the fragments in Equation (3). So, we could say that AXPLR is a complete but intentionally approximate post-hoc approach so as to yield plausible explanations.
Argument mining. Our work stays at the intersection of explainable AI, natural language processing, and computational argumentation. Another research area that is similar to ours is argument mining, which involves natural language processing and computational argumentation. Argument mining is the process of automatically detecting and modeling the structure of inference and reasoning given in natural language texts [36]. However, our work is not considered an argument mining work because the arguments in our QBAFc are arguing about the predicted output of a text classifier (PLR), whereas arguments in general argument mining works are arguing about a specific claim or conclusion in text. Therefore, input texts in argument mining works must possess the argumentative spirit inside, while input texts for AXPLR do not need to be argumentative but the classifier instead turns parts of them to be arguments for making classifications.
Conclusion
To generate local explanations for pattern-based logistic regression models, we proposed AXPLR, an explanation method enabled by quantitative bipolar argumentation frameworks we defined (TQBAFc and BQBAFc), capturing interactions among the patterns. We proved that the extracted and post-processed frameworks underpinning AXPLR are faithful to the LR model and satisfy many desirable properties. After that, we proposed two presentations of AXPLR, shallow and deep, specifying whether we present only the top-level arguments or all the arguments in the explanations. We also conducted a number of experiments with AXPLR, amounting to empirical as well as human studies. The former discussed the statistics of the underlying argumentation frameworks for all input texts in the test sets and analyzed sufficiency of the explanations in terms of the number of supporting arguments needed. The latter assessed whether AXPLR is more plausible and helpful for human learning than traditional explanation methods for pattern-based LR models. The results show that taking into account relations between arguments as AXPLR does indeed helps the explanations align better with human judgement, particularly in the sentiment analysis task. Though AXPLR performed competitively with traditional explanation methods in tutoring and supporting humans to detect deceptive hotel reviews, there were many participants learning from AXPLR that could recall well-generalized patterns and important but implicit patterns deemed useful for the task. All in all, positive results in our work raise awareness of a novel way to use argumentation for explainable AI while some negative results shed light on challenges in this area for interested researchers. These pave the way for future experiments along this line in the computational argumentation community.
Footnotes
Acknowledgements
We would like to thank Alessandra Russo and Simone Stumpf for their helpful comments. Piyawat Lertvittayakumjorn wishes to thank the support from Anandamahidol Foundation, Thailand. Francesca Toni was funded in part by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 101020934) and in part by J.P. Morgan and by the Royal Academy of Engineering under the Research Chairs and Senior Research Fellowships scheme. Any views or opinions expressed herein are solely those of the authors listed, and may differ from the views and opinions expressed by J.P. Morgan or its affiliates. This material is not a product of the Research Department of J.P. Morgan Securities LLC. This material should not be construed as an individual recommendation for any particular client and is not intended as a recommendation of particular securities, financial instruments or strategies for a particular client. This material does not constitute a solicitation or offer in any jurisdiction.
Proofs
Machine learning terminology
This section explains meanings of technical terms concerning machine learning and classification that are used in this paper.
Dataset splits. Working on a classification task, we usually divide a dataset we have into three splits. The first split is a training set, which is used to train our classification model. The second split is a test set, which is used to evaluate the final trained model. The third split is a development set (also called a validation set), which is used to evaluate model(s) under development so as to choose the best model architecture or hyperparameters. To ensure that the model can generalize beyond what it sees during training and development, all the three data splits must be mutually exclusive.
Because logistic regression (LR) does not have any hyperparameters or multiple architectures to choose, we did not use the development set while training the LR models in our experiments (Sections 7–10). However, in Section 10, we used the development set of the deceptive hotel review dataset for two purposes – (i) to tune the regularization hyperparameter of the support vector machine (SVM) model and (ii) to generate explanations for tutoring human participants to detect deceptive reviews.
Evaluation metrics. For binary classification, let y and
By default, we consider class 1 to be the positive class and consider class 0 to be the negative class. However, we can also consider the four situations above with respect to a specific class c. Particularly,
To evaluate the performance of a classifier, we apply it to predict examples in a labeled test dataset
However, if the test dataset is class imbalanced, i.e., having examples of one class more than the other, accuracy may not be the best evaluation metric because a model can get a high accuracy only by always answering the majority class. Alternatively, we can report the model performance for each specific class
There are two ways to aggregate the class-specific metrics to be the metrics for the overall performance. First, micro-averaging combines the
Second, macro-averaging averages out precision and recall scores of all the classes so all the classes are weighted equally regardless of their size.
Normally, when we work with datasets that are class imbalanced, we want the model to work well for all the classes, not just for the majority class. Therefore, we often use macro F1 as the main evaluation metric in addition to the classification accuracy.
Additional results of Section 8.1 – statistics for QBAFcs and QBAFc ′ s
Along with the results in Section 8.1, we additionally present the statistics for QBAFcs (before the post-processing step) compared to
Number of relations.
Other remarks. First, the number of arguments
Additional results of Section 8.2 – sufficiency
We also plot the sufficiency results for the three datasets in Figs 16–18. The x-axis of each plot is the number of supporting arguments used (k), whereas the y-axis shows the percentage of arguments (default or intermediate) of which the strength can be greater than 0 by using only k supporting arguments. Left subplots of these three figures are for the default arguments, whereas right subplots are for intermediate arguments. Class flipped means the supported class changes after post-processing (i.e.,
User interface for human participants in Experiment 2 (Section 9 )
Figures 19 and 20 show some parts of the template used for rendering pattern questions in Experiment 2 for MTurk workers. This template is for the sentiment analysis task (i.e., the Amazon Clothes dataset). Additionally, Figs 21 and 22 are templates for rendering the group of sampled phrases and matched phrase questions of the same task, respectively. The user interface structures for the spam classification task were similar to the sentiment analysis task except that the five options of the spam task were Definitely Spam, Spam, Not sure, Non-spam, and Definitely Non-spam. For the full templates and additional details, please visit our GitHub repository –
User interface for human participants in Experiment 3 (Section 10 )
Figure 23 shows an example post-test question with the actual Qualtrics survey user interface when there is no real-time assistance from any explanation method. In contrast, Figs 24–27 show the same post-test question but with real-time assistance from SVM, FLX, shallow AXPLR, and deep AXPLR, respectively. For other parts of the survey, please visit our GitHub repository –
