Is this question going to be closed? Answering question closibility on Stack Exchange

Abstract

Community question answering sites (CQAs) are often flooded with questions that are never answered. To cope with the problem, experienced users of Stack Exchange are now allowed to mark newly posted questions as closed if they are of poor quality. Once closed, a question is no longer eligible to receive answers. However, identifying and closing subpar questions takes time. Therefore, the purpose of this article is to develop a supervised machine learning system that predicts question closibility, the possibility of a newly posted question to be eventually closed. Building on extant research on CQA question quality, the supervised machine learning system uses 17 features that were grouped into four categories, namely, asker features, community features, question content features and textual features. The performance of the developed system was tested on questions posted on Stack Exchange from 11 randomly chosen topics. The classification performance was generally promising and outperformed the baseline. Most of the measures of precision, recall, F1-score and area under the receiver operating characteristic curve (AUC) were above 0.90 irrespective of the topic of questions. By conceptualising question closibility, the article extends previous CQA research on question quality. Unlike previous studies, which were mostly limited to programming-related questions from Stack Overflow, this one empirically tests question closibility on questions from 11 randomly selected topics. The set of features used for classification offers a framework of question closibility that is not only more comprehensive but also more parsimonious compared with prior works.

Keywords

Closed question; community question answering; machine learning; question quality; Stack Exchange; unanswered question

1. Introduction

Over the years, community question answering sites (CQAs) such as Baidu Zhidao, Stack Exchange and Yahoo! Answers have cemented themselves as key avenues to search for information. Whenever individuals with Internet access face an information need, they have an easy option to ask questions on CQAs that can be answered by other online users. If the asker chooses an incoming answer as satisfactory, the question is said to be resolved [1–6].

Despite the undoubted benefit of CQAs, a downside is that these sites have long been flooded with questions that are never answered. For example, by 2010, 42.8% of questions posted on Baidu Zhidao were reported to remain unanswered [7]. By 2012, the volume of unanswered questions on Stack Overflow, a CQA site within Stack Exchange that is dedicated to computer programming, mounted to approximately 300,000 [8]. About one-fifth of unresolved questions on Yahoo! Answers are known to remain completely ignored [9].

To cope with the problem of unanswered questions, experienced users of Stack Exchange are now allowed to mark newly posted questions as closed. Specifically, a question can be closed if it is deemed to be duplicate, off-topic, opinion-seeking, unclear or vague. A closed question cannot be answered but can be edited so that it may be reopened (Stack Exchange, 2018). Clearly, this functionality helps nurture the quality of questions on the platform.

Even with this development, under-cooked questions continue to serve as a thorn in the flesh of Stack Exchange’s question-answering cycle. Identifying and closing them manually takes time, which experienced users would rather spend on more meaningful questioning and answering activities. Furthermore, the queue of inappropriate questions on CQAs seems to be continually growing [2,7,9–12]. A review of questions on various topics such as Golf and Mathematics available on Stack Exchange confirmed that closed questions are indeed ubiquitous (see Table 1).

Table 1.

Statistics of closed questions for the selected topics of Stack Exchange.

Topics	Total questions	Open questions	Closed questions	% of Closed questions
Programmer	38,299	28,480	9819	25.64
Ask Ubuntu	251,769	218,136	33,633	13.36
Golf	7195	6236	959	13.33
Sci-fi	38,027	34,518	3509	9.23
Server Fault	238,765	219,584	19,181	8.03
Dba	53,665	49,929	3736	6.96
Gaming	75,697	70,443	5254	6.94
Super User	343,034	320,168	22,866	6.67
Apple	80,467	76,999	3468	4.31
Mathematics	222,288	215,372	6916	3.11
Code Review	42,451	41,484	967	2.28

A potential remedy is to automatically identify questions that are likely to be closed before they are actually closed in reality. If the possibility for questions to be closed (henceforth referred to as closibility) can be conveyed to askers automatically soon after they write their questions without human intervention, the volume of likely-to-be-closed questions on Stack Exchange will be reduced. This in turn will minimise the time that experienced users, who are valuable information sources in the CQA community, would spend in closing subpar questions. Instead, they could focus on asking and answering, thereby resulting in more efficient use of the CQA platform for all and sundry.

Therefore, the purpose of this article is to develop a supervised machine learning system that predicts the closibility of questions on Stack Exchange. Existing research on question closibility has used a large number of features [13] but achieved an F1-score of only 0.71 [14]. Roy and Singh [15] used deep learning frameworks and achieved very low prediction accuracy. Because deep learning-based models extract features from the input automatically, interpreting the reasons for question closibility was not feasible. The model proposed in this research uses fewer, manually extracted features and outperforms the baseline models. Moreover, the proposed model also helps to identify the reason for question closibility through a feature analysis.

This article particularly builds on prior research that has been shedding light on reasons due to which several questions remain unanswered on CQAs [9,16]. In doing so, this article is significant in two ways. First, it represents one of the earliest works to conceptualise what is referred to as the closibility of questions. Using the body of CQA research on questions’ likelihood to remain unanswered as a stepping stone, this article deepens the scholarly understanding of factors that predict questions’ likelihood to be closed. Second, the developed system to predict closibility is meant to be generic so that it can be applied to questions on a variety of topics. Its performance was evaluated by testing it on data from a range of 11 topics available on Stack Exchange. This extends previous research that has been mostly restricted to programming-related questions drawn from Stack Overflow [9,14]. The classification performance was generally promising.

The rest of this article is organised as follows: The next section reviews the literature. The methodology is described thereafter. This is followed by the results and discussion. The final section closes out this article by highlighting its limitations and future scope.

2. Literature review

Since their inception, CQAs have been garnering and archiving huge volumes of user-generated content from their community of users. Their proliferation has continued to attract sustained scholarly interest over the years [1,12,17,18]. In particular, CQAs have been widely investigated to resolve answer-related issues such as examining the quality of answers [9,19–23] and finding the best answer among a pool of responses [24,25]. Much attention has also been devoted to identifying expert users who are capable of producing high-quality answers [4,26–34].

More recently, research has started to cast the spotlight on question-related issues such as clustering similar questions coupled with identification of hot topics [35–37], identification of questioning motivation [38] and detecting duplicate questions [39–45]. Particularly relevant to this article, scholars have also started to examine the quality of questions posted on CQAs [9,12,16,46,47].

For example, Shah et al. [48] used a dataset of 5000 questions posted on Yahoo! Answers to classify question quality as either good or bad. With the help of support vector machine (SVM) classification, they achieved an accuracy of 93.08%. The excellent performance notwithstanding, a limitation of the work was that some of the features required human intervention. Hence, it does not offer a logistically viable strategy to predict question quality on the fly. Ponzanelli et al. [41] proposed a model to minimise low-quality content on CQAs using textual and non-textual features extracted from a Stack Overflow dataset. They achieved a precision of 41.9% for high-quality questions and 64.91% for low-quality questions. Correa and Sureka [49] found that around 8% of the total questions were subpar on Stack Overflow. They used 47 different textual and non-textual features for classification. The model achieved a modest accuracy of 66%.

Srba and Bielikova [50] claimed that the quality of questions on Stack Overflow had been deteriorating over the years. They reported that the low-quality content of the site was 4.11% in 2011, which increased to 16.84% in 2016. They further showed that a particular group of users was continuously posting duplicate questions or definition-type questions on the site. Ahasanuzzaman et al. [39] proposed a system to find duplicate questions on Stack Overflow. They used cosine similarity, WordNet Similarity, Entity Overlap and Entity type overlap on a dataset of some 1.3 million Stack Overflow questions to classify whether an enquiry was duplicate or not. Their system achieved a recall value of 66.10%. Yang et al. [46] analysed unanswered questions on Yahoo! Answers, and proposed a supervised machine learning model to classify them. With a dataset containing 76,251 questions out of which 10,424 were unanswered, their model achieved an F1-score of 32.5%. Dror et al. [51] extended the work of Yang et al. [46] to predict the number of answers a question might receive. They achieved an F1-score of 40.3%.

Most closely related to the current article is the work of Correa and Sureka [14]. It proposed a model to predict closed questions on Stack Overflow. Several textual and non-textual features were extracted from the dataset to achieve a classification accuracy of 73%. This article extends [14] in at least two ways. First, instead of confining the dataset to only programming-related questions, it draws data from 11 randomly chosen topics. This serves to enhance research generalisability. Second, informed by recent works such as Chua and Banerjee [9], this article incorporates several new features such as up votes, down votes and interrogative words. Moreover, it drops variables such as the number of short words and the numbers of upper case and lower case characters. For one, the conceptualisation of short words was unclear from Correa and Sureka [14]. Besides, there is no reason to assume that the proportions of upper and lower case characters will determine question closibility. Therefore, this article makes a modest attempt to present a more parsimonious set of features for predicting question closibility compared with previous works.

3. Methodology

Figure 1 depicts the complete framework of the proposed model to predict question closibility. The steps are explained below:

Dataset: The dataset used in this article was the data dump of Stack Exchange from January 2009 to March 2017.¹ The data dump is an anonymized dump of all posts, tags, votes, users, history, comments, badges and post history of the CQA in the form of eight different XML files. For the purpose of this article, four XML files related to questions and users were relevant: User.xml, Post.xml, Votes.xml and Badge.xml. These are represented within the dotted lines in Figure 1. The dataset was downloaded from the archive in May 2017.

Combined dataset and preprocessing: From the XML files, all the relevant attributes were captured to form the combined dataset. These include post id, post type id, favourite count, view count, comment count and answer count from Post.xml; user id, user reputation and the number of posts by the user from User.xml; user id and the class of the user from Badge.xml; and user id, up votes and down votes from Votes.xml. All of these attributes from the different XML files were combined for further processing. The dataset covers 11 randomly selected topics. Statistics of the selected dataset are presented in Table 1, in which the topics are arranged in decreasing order of the proportion of closed questions.

Data labelling: The dataset was labelled into two classes (open and closed) based on the attribute of closed date. Questions having some date values in the closed date field indicated that they had been closed. Hence, they were labelled as closed questions. All remaining questions were labelled as open questions.
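The labelling rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: it assumes that each post row carries `PostTypeId` and `ClosedDate` attributes, that a post type id of 1 marks a question, and the inline XML sample and the function name `label_questions` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-row sample in the style of the posts file of the data dump.
SAMPLE_XML = """<posts>
  <row Id="1" PostTypeId="1" Title="How do I mount a drive?" />
  <row Id="2" PostTypeId="1" Title="Best distro?" ClosedDate="2016-03-01T10:15:00.000" />
  <row Id="3" PostTypeId="2" ParentId="1" />
</posts>"""


def label_questions(xml_text):
    """Return {question id: 'closed' or 'open'} based on the closed date field."""
    labels = {}
    for row in ET.fromstring(xml_text).iter("row"):
        if row.get("PostTypeId") != "1":  # assumption: type 1 marks a question
            continue
        # Any value in the closed date field means the question was closed.
        labels[row.get("Id")] = "closed" if row.get("ClosedDate") else "open"
    return labels


labels = label_questions(SAMPLE_XML)
print(labels)
```

For a full data dump, a streaming parser such as `xml.etree.ElementTree.iterparse` would probably be preferable to loading the whole file into memory.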

Feature extraction: A total of 17 features were extracted from the combined dataset. Informed by Correa and Sureka [14], these were grouped into four categories, namely, asker features, community features, question content features and textual features.

Figure 1.

Proposed model to predict question closibility.

Asker features include account age [1] and badge score [2]. These were relevant because questions contributed by long-standing askers with high badge scores are likely to have less closibility than those posted by new and novice individuals [9,52].

Community features include post score [3], reputation [4], favourite count [5], comment count [6], view count [7], answer count [8], up votes [9] and down votes [10]. All of these reflect how well the CQA community accepts a user [16,53]. Obviously, the more whole-heartedly a user is accepted in the community, the less is the likelihood for the community to close a question submitted by the individual.

Question content features include the number of URLs [11] and the number of tags [12]. These have been shown to play a part in determining question closibility [14].

Textual features include question body length [13], question title length [14], the number of special characters [15], the number of punctuation marks [16] and the number of interrogative words [17]. These textual features are known to determine the clarity with which questions are articulated [9,14,16]. Hence, they may also shape question closibility.

All the 17 features are described below, and listed in Table 2.

1. Account age: The age of users’ account is the time from the date of joining the site to their latest post.

2. Badge score: A score based on the badges a user (u) has earned. If the user has earned badges $(b_{1}, b_{2}, \ldots, b_{n})$, then the badge score $B_{s}$ of user u is calculated as

$$B_{s} = \sum_{i=1}^{n} \frac{1}{\text{No. of users having } b_{i}} \tag{1}$$

3. Post score: The score gained by the user (u) from the community users. The score is calculated as follows: let the answers posted by the user u be $(ans_{1}, ans_{2}, \ldots, ans_{n})$ and the questions asked by the user u be $(ques_{1}, ques_{2}, \ldots, ques_{m})$; then the score S(u) of the user u is

$$S(u) = \sum_{i=1}^{n} S(ans_{i}) + \sum_{j=1}^{m} S(ques_{j}) \tag{2}$$

4. Reputation: Users receive reward points for good-quality postings. The reward score reflects the standing of users in the CQA community.

5. Favourite count: Every question or answer of the user (u) can be marked as a favourite by a user (v). Suppose the number of questions of the user (u) marked as a favourite is n and the number of answers marked as a favourite is m, then the favourite count of the user (u) is

$$\text{Favourite Count}(u) = n + m \tag{3}$$

6. Comment count: The number of comments received from peer users. Suppose the number of comments received on a user's questions is $C_{qn}$ and the number of comments on their answers is $C_{an}$; then the comment count of the user (u) is

$$\text{Comment Count}(u) = C_{qn} + C_{an} \tag{4}$$

7. View count: The number of community users who have seen the posts of a user u.

8. Answer count: The number of answers received by the questions asked by a user u.

9. Up votes: The number of posts with positive scores.

10. Down votes: The number of posts with negative scores.

11. Number of URLs: The number of URLs present in the question body.

12. Number of tags: Tags are generally keywords given by users to represent the question domain precisely. A high number of tags could suggest greater question precision.

13. Question body length: The number of words in the body of the question after removing stop words.

14. Question title length: The number of words in the question title.

15. The number of special characters: The number of special characters such as @, >, <, etc., present in the question body.

16. The number of punctuation marks: The number of punctuation marks present in the question body.

17. The number of interrogative words: The number of interrogative words (words starting with ‘wh’, for example, ‘what’, ‘when’, ‘where’, etc.) present in the question body.
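Several of the features listed above are simple counts, and the badge score follows directly from Equation (1). The sketch below shows one way they might be computed in Python. It is illustrative only: the interrogative word list, the special-character set, the punctuation set and the URL pattern are assumptions (the article gives only examples), and the function names `badge_score` and `text_features` are hypothetical. Question body length is omitted because it depends on a stop-word list that the article does not specify.

```python
import re

# Assumed word list; the article only gives 'what', 'when' and 'where' as examples.
INTERROGATIVES = {"what", "when", "where", "which", "who", "whom", "whose", "why"}
# Assumed special-character set; the article names only @, > and <.
SPECIAL_CHARS = set("@<>#$%&*")
# Assumed punctuation set, kept disjoint from the special characters.
PUNCTUATION = set(".,;:!?")
# Assumed URL pattern: anything starting with http:// or https://.
URL_PATTERN = re.compile(r"https?://\S+")


def badge_score(user_badges, holders_per_badge):
    """Equation (1): sum of 1 / (number of users holding each badge)."""
    return sum(1.0 / holders_per_badge[badge] for badge in user_badges)


def text_features(title, body):
    """Count-based question content and textual features for one question."""
    words = re.findall(r"[a-z']+", body.lower())
    return {
        "num_urls": len(URL_PATTERN.findall(body)),
        "title_length": len(title.split()),
        "num_special": sum(ch in SPECIAL_CHARS for ch in body),
        "num_punctuation": sum(ch in PUNCTUATION for ch in body),
        "num_interrogatives": sum(w in INTERROGATIVES for w in words),
    }


# Hypothetical badge names and holder counts, for illustration only.
score = badge_score(["Teacher", "Guru"], {"Teacher": 100, "Guru": 4})
feats = text_features(
    "Why does apt fail?",
    "What is wrong here? See http://example.com/log @admin",
)
print(score, feats)
```

Under Equation (1), a badge held by few users contributes more to the score than a common one, which is why the rarer badge dominates the example score.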

Data split: The complete dataset as a whole could not be used for training and testing purposes. Hence, it was split into training and testing components.

Training samples: We randomly used 67% of the data points from the dataset to train the model.

Test samples: To test the performance of the trained model, we used the remaining 33% of the dataset, that is, the unseen data points that were not passed to the model during training.
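A random 67/33 split of this kind can be sketched with the Python standard library alone. The function name `split_train_test` and the fixed seed are assumptions made for reproducibility of the illustration; the article does not state how the random split was implemented.

```python
import random


def split_train_test(samples, train_fraction=0.67, seed=42):
    """Shuffle a copy of the samples and cut it into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 100 dummy data points stand in for the labelled questions.
train, test = split_train_test(range(100))
print(len(train), len(test))
```

Every data point lands in exactly one of the two parts, so the test samples remain unseen during training.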

Model training: We trained several machine learning models like Gradient Boosting, Logistic Regression, Naive Bayes, Random Forest, XGBoost and others with the extracted features.

Trained model: When the trained models were ready for prediction with new data points, they were supplied with the test samples for predictions.

Predictions: The trained machine learning models predicted the category of the unseen data points. When the predicted category of an unseen data point matched with the actual category, it was deemed to be a correct prediction.

Table 2.

List of selected features.

Type	Features
Asker features	Account age
Asker features	Badge score
Community features	Post score
Community features	Reputation
Community features	Favourite count
Community features	Comment count
Community features	View count
Community features	Answer count
Community features	Up votes
Community features	Down votes
Question content features	Number of URLs
Question content features	Number of tags
Textual features	Question body length
Textual features	Question title length
Textual features	Number of special characters
Textual features	Number of punctuation marks
Textual features	Number of interrogative words

3.1. Model setting and evaluation metrics

To evaluate the performance of the supervised machine learning system with the selected features, several classifiers were experimented with [54–56]. For the sake of brevity, this article reports the results for Gradient Boosting, Logistic Regression, Naive Bayes and Random Forest. XGBoost, for example, was omitted because it yielded very similar results to Gradient Boosting.

In the dataset, the proportions of closed and open questions were never comparable. Table 1 conveys that closed questions were consistently fewer than open questions for all the 11 topics. For example, of 38,299 questions from the topic of Programmer, only 9819 (25.64%) questions were closed.

Therefore, to study the effect of data imbalance, the classification was conducted in two phases: first with the imbalanced dataset, and next with its balanced version created by applying the synthetic minority over-sampling technique (SMOTE) [57]. In SMOTE, the minority class data are oversampled by creating synthetic examples instead of over-sampling with replacement. This is known to result in realistic newly created samples, and is superior to other techniques such as random over-sampling, which increases the sample of minority classes by creating multiple copies of the same data points. With such repeated instances in the dataset, classifiers tend to be over-fitted during the training process. Another over-sampling technique known as ADASYN (ADAptive SYNthetic method) builds on SMOTE. Initial experiments on two datasets, Programmer and Ask Ubuntu, yielded similar results with both SMOTE and ADASYN for the Gradient Boosting and Random Forest classifiers; however, the Naive Bayes and Logistic Regression classifiers performed better with SMOTE than with ADASYN. Hence, we proceeded with SMOTE to balance the datasets. The experiments were executed in Python on a machine with a 16-core Intel Xeon(R) CPU and 32 GB RAM.
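The core idea of SMOTE, interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified, standard-library-only illustration and not the implementation used in the experiments; the function name `smote`, the parameter names and the toy feature vectors are assumptions.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Sort the other minority points by squared Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy two-feature vectors standing in for closed (minority class) questions.
closed = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(closed, n_synthetic=6, k=2)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority points, the new samples are not exact copies, which is the property that distinguishes SMOTE from random over-sampling with replacement.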

The classification performance was evaluated in terms of precision, recall, F1-score and area under the receiver operating characteristic curve (AUC) [58]. These are explained as follows:

Precision: It is defined as the fraction of questions predicted as closed that are actually closed. It is computed as

$$\text{Precision} = \frac{T_{p}}{T_{p} + F_{p}} \tag{5}$$

where $T_{p}$ is the true positive (closed questions classified as closed) and $F_{p}$ is the false positive (open questions classified as closed).

Recall: It is the fraction of all closed questions in the dataset that are correctly predicted as closed. It is computed as

$$\text{Recall} = \frac{T_{p}}{T_{p} + F_{n}} \tag{6}$$

where $T_{p}$ is the true positive (closed question classified as closed) and $F_{n}$ is the false negative (closed question classified as open).

F1-score: It is the harmonic mean of precision and recall. It is computed as

$$\text{F1-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{7}$$
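Equations (5) to (7) translate directly into code. The following sketch treats the closed class as the positive class; the function name `precision_recall_f1` and the toy labels are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive="closed"):
    """Compute precision, recall and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # closed -> closed
    fp = sum(t != positive and p == positive for t, p in pairs)  # open -> closed
    fn = sum(t == positive and p != positive for t, p in pairs)  # closed -> open
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: six closed and four open questions.
y_true = ["closed"] * 6 + ["open"] * 4
y_pred = ["closed"] * 3 + ["open"] * 3 + ["closed"] + ["open"] * 3
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

As a consistency check on Equation (7), a precision of 0.92 and a recall of 0.71 give an F1-score of approximately 0.80.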

AUC: The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate of the classifier at different thresholds:

$$\text{True positive rate} = \frac{T_{p}}{T_{p} + F_{n}} \tag{8}$$

$$\text{False positive rate} = \frac{F_{p}}{F_{p} + T_{n}} \tag{9}$$

where $F_{p}$ is the number of open questions that are classified as closed and $T_{n}$ is the true negative (open questions classified as open). The AUC is the area under this curve; the greater the AUC value, the better the classifier separates closed questions from open ones.
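The AUC can also be computed without drawing the curve, because the area under the ROC curve equals the probability that a randomly chosen positive (closed) example receives a higher score than a randomly chosen negative (open) one, with ties counted as one half. The sketch below uses this pairwise formulation; the function name `roc_auc` and the toy scores are assumptions, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (closed, open) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]  # closed questions
    neg = [s for t, s in zip(y_true, scores) if t == 0]  # open questions
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predicted probabilities of being closed for five questions.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(y_true, scores)
print(auc)
```

Here five of the six closed/open pairs are ordered correctly, so the AUC is 5/6; a value of 0.5 would correspond to random ranking.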

4. Results

This section presents the experimental results obtained using the various machine learning algorithms on both the imbalanced and the balanced datasets. The datasets were divided into two parts. One part contained 67% of the data, and was used for model training. The other part contained the remaining 33% of the data, and was used to test the model performance.

4.1. Results with imbalanced datasets

The experiment commenced with questions from the topics of Programmer and Ask Ubuntu. This was because they contained the highest proportions of closed questions. The results are presented in Table 3.

Table 3.

Results with the imbalanced dataset for the topics of Programmer and Ask Ubuntu.

Classifier	Class	Programmer				Ask Ubuntu
		Precision	Recall	F1-score	AUC	Precision	Recall	F1-score	AUC
Gradient Boosting	Open	0.76	0.96	0.85	0.55	0.91	0.99	0.95	0.71
	Closed	0.59	0.14	0.23		0.92	0.42	0.57
	Average	0.68	0.55	0.54		0.92	0.71	0.71
Naive Bayes	Open	0.76	0.93	0.83	0.54	0.91	0.16	0.27	0.54
	Closed	0.44	0.16	0.23		0.15	0.91	0.26
	Average	0.60	0.55	0.53		0.53	0.54	0.27
Logistic Regression	Open	0.74	0.99	0.85	0.51	0.86	1.00	0.92	0.50
	Closed	0.46	0.03	0.05		0.14	0.01	0.01
	Average	0.60	0.51	0.45		0.50	0.51	0.46
Random Forest	Open	0.76	0.94	0.84	0.56	0.91	0.99	0.95	0.70
	Closed	0.49	0.17	0.26		0.90	0.41	0.57
	Average	0.63	0.56	0.55		0.91	0.70	0.76

AUC: area under the receiver operating characteristic curve.

Two interesting observations were made. First, Naive Bayes yielded contrasting results for questions from the two topics. For the topic of Programmer, recall was higher for open questions (0.93) vis-a-vis closed ones (0.16). In other words, a better recall was achieved for the class having a greater number of data instances. For the topic of Ask Ubuntu, however, a better recall was achieved for closed questions (0.91) vis-a-vis open ones (0.16). Put differently, a better recall was obtained for the class having fewer data instances.

Second, none of the classifiers yielded very promising results consistently in terms of precision, recall, F1-score and AUC across questions from both the topics. The imbalance in the dataset was identified as a possible reason. Therefore, these data were balanced to recheck the classification performance. As indicated earlier, SMOTE was applied for this purpose [57].

4.2. Results with balanced datasets

The performance of the proposed model was now checked using the balanced datasets from the topics of Programmer and Ask Ubuntu. The detailed results are presented in Table 4. There were two notable observations. First, even with balanced datasets, Naive Bayes continued to yield contrasting results for questions from the two topics. For the topic of Programmer, recall was higher for open questions (0.88) vis-a-vis closed ones (0.26). For the topic of Ask Ubuntu, however, a better recall was achieved for closed questions (0.94) vis-a-vis open ones (0.12). Given the poor performance, the data distribution was checked. The features were not normally distributed. As mentioned by Lewis [59], the Naive Bayes classifier performs well with datasets that are normally distributed or categorical in nature. This likely explains the anomaly.

Table 4.

Results with the balanced dataset for the topics of Programmer and Ask Ubuntu.

Classifier	Class	Programmer				Ask Ubuntu
		Precision	Recall	F1-score	AUC	Precision	Recall	F1-score	AUC
Gradient Boosting	Open	0.76	0.93	0.84	0.82	0.90	0.99	0.95	0.94
	Closed	0.92	0.71	0.80		0.99	0.89	0.94
	Average	0.84	0.82	0.82		0.95	0.94	0.95
Naive Bayes	Open	0.54	0.88	0.67	0.57	0.65	0.12	0.20	0.53
	Closed	0.68	0.26	0.38		0.52	0.94	0.66
	Average	0.61	0.57	0.53		0.59	0.53	0.43
Logistic Regression	Open	0.56	0.51	0.54	0.56	0.66	0.40	0.50	0.60
	Closed	0.55	0.60	0.58		0.57	0.80	0.66
	Average	0.56	0.56	0.56		0.62	0.60	0.58
Random Forest	Open	0.74	0.90	0.81	0.74	0.89	0.99	0.94	0.93
	Closed	0.88	0.68	0.76		0.99	0.88	0.93
	Average	0.81	0.79	0.79		0.94	0.94	0.94

AUC: area under the receiver operating characteristic curve.

Second, Gradient Boosting outperformed the other classifiers in predicting question closibility. This was true across both the topics in terms of all the four selected performance indicators: precision, recall, F1-score and AUC [58]. The confusion matrices depicting the distribution of the data instances over the different classes and the ROC curves for the topics of Programmer (AUC = 0.82) and Ask Ubuntu (AUC = 0.94) are shown in Figures 2–5.

Figure 2.

ROC curve for Programmer topic.

Figure 3.

ROC curve for Ask Ubuntu topic.

Figure 4.

Confusion matrix for Programmer topic.

Figure 5.

Confusion matrix for Ask Ubuntu topic.

Given that Gradient Boosting emerged as the best performing classifier, it was therefore employed on the balanced datasets for all the topics to predict question closibility. The results are summarised in Table 8. The classification performance was generally promising.

As indicated earlier, the most pertinent prior work for this article is Correa and Sureka [14]. It proposed a model to predict closed questions from a dataset of programming-related questions drawn from Stack Overflow. Therefore, using questions from the topic of Programmer, this article uses Correa and Sureka [14] as a baseline for comparison. The comparative outcomes are shown in Table 6. The classification performance using Gradient Boosting was traced in terms of precision, recall, F1-score, and AUC. To afford a granular analysis, the classification was performed in four steps that involved including asker features, community features, question content features and textual features one by one. The proposed model mostly outperformed the baseline with the selected feature set as shown in Table 6. The proposed model also outperformed Roy and Singh’s [15] deep learning model as shown in Table 5.

Table 5.

Comparison with deep learning.

Model	Precision	Recall	F1-score	AUC
Roy and Singh [15]	0.47	0.49	0.48	0.81
Proposed	0.92	0.71	0.80	0.82

AUC: area under the receiver operating characteristic curve.

Table 6.

Comparison with baseline.

Asker features	Community features	Question content features	Textual features	Approach	Precision	Recall	F1-score	AUC
✓				Correa and Sureka [14]	0.63	0.63	0.63	0.62
✓				Proposed	0.59	0.55	0.57	0.60
✓	✓			Correa and Sureka [14]	0.70	0.65	0.67	0.68
✓	✓			Proposed	0.91	0.70	0.79	0.82
✓	✓	✓		Correa and Sureka [14]	0.70	0.65	0.67	0.68
✓	✓	✓		Proposed	0.91	0.71	0.80	0.82
✓	✓	✓	✓	Correa and Sureka [14]	0.69	0.65	0.67	0.68
✓	✓	✓	✓	Proposed	0.92	0.71	0.80	0.82

AUC: area under the receiver operating characteristic curve.

4.3. Results with deep learning

Of late, deep learning-based models such as the convolutional neural network (CNN), long short-term memory (LSTM) and the transformer-based Bidirectional Encoder Representations from Transformers (BERT) models are being widely used in natural language processing. These models are often preferred to traditional machine learning as they are capable of finding hidden contextual patterns in the dataset using their complex architecture. Hence, we decided to explore deep learning.

The experiments were conducted on the Programmer topic using the Google Colaboratory platform, where all required libraries are pre-installed. First, we experimented with the CNN model in two variations: (1) without a dropout layer and (2) with a dropout layer. The outcomes are presented in Table 7. The CNN model with a single convolution layer worked well for predicting open questions but performed poorly for the closed category. Even adding a dropout layer did not substantially improve model performance. The F1-score was only 0.33 for closed questions. This shows that most closed questions remained undetected by the CNN model.

Table 7.

Results of Gradient Boosting (traditional machine learning) in comparison with deep learning.

Model	Class	Precision	Recall	F1-score	AUC
CNN	Open	0.79	0.89	0.84	0.59
CNN	Closed	0.47	0.28	0.35	0.59
CNN + Dropout	Open	0.78	0.83	0.80	0.56
CNN + Dropout	Closed	0.37	0.30	0.33	0.56
CNN	Open	0.77	0.97	0.86	0.55
CNN	Closed	0.61	0.13	0.22	0.55
CNN + Dropout	Open	0.80	0.89	0.84	0.55
CNN + Dropout	Closed	0.50	0.33	0.40	0.55
LSTM	Open	0.78	0.97	0.86	0.77
LSTM	Closed	0.69	0.17	0.27	0.77
BERT	Open	0.79	0.95	0.86	0.61
BERT	Closed	0.68	0.28	0.40	0.61
Gradient Boosting	Open	0.76	0.93	0.84	0.82
Gradient Boosting	Closed	0.92	0.71	0.80	0.82

AUC: area under the receiver operating characteristic curve; CNN: convolutional neural network; LSTM: long short-term memory.

Next, we increased the number of convolution layers from one to two and repeated the experiments. The outcomes of the two-layered CNN model improved slightly but were still lower than what was achieved using traditional machine learning. The LSTM and BERT models also exhibited similar performances. The AUC value was 0.77 using the LSTM model and 0.61 using the BERT model, indicating that the LSTM model performed best among all the deep learning and transformer-based models.

4.4. Feature importance

To dig deeper, the relative strength of the features was assessed for the topics of Programmer and Ask Ubuntu. Using Gradient Boosting, the feature importance graphs are shown in Figures 6 and 7, respectively.

Figure 6.

Feature importance graph for the topic of Programmer.

Figure 7.

Feature importance graph for the topic of Ask Ubuntu.

For the topic of Programmer, the top three contributing features were number of interrogative words, comment count and favourite count. For the topic of Ask Ubuntu, the top three features were answer count, number of tags and number of interrogative words. Across the two topics, the top three features were mostly dominated by non-textual features. The only exception was number of interrogative words.

The effect of the selected features can also be seen in Table 6, in which the model performance is shown as a function of the feature subsets. Each subset of features, namely, asker features, community features, question content features and textual features, was useful as the values of the performance measures continued to rise progressively.

5. Discussion

This section discusses the experimental outcomes and the major findings of the research, including theoretical contributions and implications for practice [60]. CQAs are known to be flooded with questions that are never answered [7,9,10]. To cope with the problem, experienced users of Stack Exchange are now allowed to mark newly posted questions as closed if they are duplicate, off-topic, opinion-seeking, unclear or vague. However, identifying and closing subpar questions manually takes time.

To this end, this article developed a supervised machine learning system that predicts question closibility, that is, the possibility that a newly posted question will eventually be closed. It leveraged the body of literature on the reasons why many questions remain unanswered on CQAs [9,16]. The system was tested on questions posted on Stack Exchange from 11 randomly chosen topics. Gradient Boosting emerged as the best-performing classifier. As shown in Table 8, the classification performance was generally promising. Most of the measures of precision, recall, F1-score and AUC were above 0.90, with minimum values of 0.76, 0.71, 0.80 and 0.82, respectively.

Table 8.

Gradient Boosting results on the datasets after balancing.

Topics	Class	Precision	Recall	F1-score	AUC
Programmer	Open	0.76	0.93	0.84	0.82
Programmer	Closed	0.92	0.71	0.80	0.82
Ask Ubuntu	Open	0.90	0.99	0.95	0.94
Ask Ubuntu	Closed	0.99	0.89	0.94	0.94
Golf	Open	0.91	0.95	0.93	0.93
Golf	Closed	0.95	0.90	0.92	0.93
Sci-fi	Open	0.92	0.97	0.94	0.95
Sci-fi	Closed	0.97	0.91	0.94	0.95
Server Fault	Open	0.92	0.99	0.96	0.96
Server Fault	Closed	0.99	0.92	0.95	0.96
Dba	Open	0.93	0.99	0.96	0.96
Dba	Closed	0.99	0.92	0.96	0.96
Gaming	Open	0.93	0.98	0.96	0.96
Gaming	Closed	0.98	0.93	0.96	0.96
Super User	Open	0.93	0.99	0.96	0.96
Super User	Closed	0.99	0.92	0.96	0.96
Apple	Open	0.94	1.00	0.96	0.96
Apple	Closed	1.00	0.93	0.96	0.96
Mathematics	Open	0.95	0.99	0.97	0.97
Mathematics	Closed	0.99	0.95	0.97	0.97
Code Review	Open	0.97	0.99	0.98	0.98
Code Review	Closed	0.99	0.97	0.98	0.98

AUC: area under the receiver operating characteristic curve.
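As a quick consistency check on Table 8, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it for the two Programmer rows. Because the precision and recall values in the table are already rounded to two decimals, recomputed values for other rows may differ from the published F1-score in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for the Programmer rows of Table 8.
programmer = {"Open": (0.76, 0.93), "Closed": (0.92, 0.71)}
f1_by_class = {cls: round(f1_score(p, r), 2)
               for cls, (p, r) in programmer.items()}
print(f1_by_class)
```

Both recomputed values agree with the F1-scores reported for Programmer in Table 8 (0.84 and 0.80).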

The system outperformed the baseline, an earlier model for predicting closed questions proposed by Correa and Sureka [14]. As shown in Table 6, the best F1-score obtained by the baseline was 0.67, whereas our model achieved 0.80 in a similar setting. This article also supplements previous related works. Yang et al. [46], for example, analysed only unanswered questions and achieved an F1-score of 32.5%. Similarly, Ahasanuzzaman et al. [39] achieved a recall of 66.10% in identifying duplicate questions. Factors such as remaining unanswered or being a duplicate can all lead to a question being closed. This research combines all such factors of question closibility and achieves an average F1-score of 0.92, with a minimum of 0.82 for the topic of Programmer and a maximum of 0.97 for the topic of Mathematics. Moreover, unlike works such as Shah et al. [48] that used features requiring human annotation, the system developed in this article employs features that can be readily obtained. Therefore, it offers a viable strategy to predict question closibility on CQAs in real time.

5.1. Theoretical contributions

The theoretical contributions of this article are four-fold. First, it conceptualises what is referred to as the closibility of questions. It also develops a supervised machine learning system that predicts the likelihood that questions on Stack Exchange will be closed. This deepens the scholarly understanding of the factors that predict whether questions are closed on CQAs, and serves as a call for scholars to explore the theme of question closibility more granularly in the future.

Second, this article builds on the literature on question quality [9,16] to identify new features for predicting question closibility. Features such as the use of interrogative words were not used in earlier works to distinguish between closed and open questions [14]. However, the number of interrogative words was among the top three features for the topics of both Programmer and Ask Ubuntu (Figures 6 and 7). It should therefore be incorporated in related future research. In sum, the 17 features identified in this article seem more comprehensive than those used in prior works.

Third, the supervised machine learning system developed to predict question closibility was intended to cater to questions on a variety of topics. For this reason, features such as code snippet, which are commonly used in prior works that focus solely on programming-related questions, were dropped. This enhances parsimony while widening the applicability of the classifier. The classification performance was evaluated by testing the system on questions from 11 topics available on Stack Exchange. As evident from Table 8, the majority of the measures of precision, recall, F1-score and AUC were above 0.90. This extends previous research that was mostly restricted to programming-related questions drawn from Stack Overflow [9,14].

Fourth, this article has implications for the usage of supervised machine learning on CQAs. For one, it demonstrates the utility of SMOTE to predict question closibility [57]. Previous CQA studies focusing on question quality did not mitigate the data imbalance problem before feeding datasets to classifiers. This article confirms that classification performance is inadequate when the dataset is overly imbalanced. It also echoes previous research [59] suggesting that the Naive Bayes classifier does not work well if the features are not normally distributed. Gradient Boosting emerged as the best classifier regardless of question topic.
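The core idea of SMOTE [57] is to create synthetic minority-class examples by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a simplified, standard-library sketch of that idea. It is not the implementation or configuration used in this article, and the function name, the toy points and the choice of k are illustrative assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Indices of the other samples, sorted by distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy two-feature minority class (hypothetical values).
closed = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(closed, n_new=6)
print(len(new_points))
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.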

5.2. Implications for practice

On the practical front, this article demonstrates the possibility of screening new questions while they are being submitted on CQAs [61]. Moderators and administrators of CQAs could use the system developed in this article as a preliminary filter to weed out questions with high closibility. The system may be placed just beneath the user interface to work silently and screen all incoming questions. In addition, it could be run on the archives of previously posted questions on CQAs to evaluate question quality at regular intervals.

Overall, the system can free users from the tedious task of manually closing subpar questions on CQAs. It would reduce not only the time community members spend marking questions as closed but also the time moderators spend on moderation. These in turn would pave the way for more efficient use of the CQA platform as an information-seeking avenue.

6. Conclusion, limitations and future scope

On CQAs, the number of closed questions has been rising. As a first step towards solving the problem, this article presents a machine learning method to predict question closibility, defined as the possibility of a question being closed. A supervised machine learning system was developed and tested on data from Stack Exchange on as many as 11 topics. The feature set was more parsimonious than those used in prior works. Even then, the system fared better than earlier studies. Gradient Boosting emerged as the best-performing classifier regardless of the question topic. Therefore, the key takeaway message from this article is this: using supervised learning, it is possible to automatically identify questions that are likely to be closed before they are actually closed, with reasonably high accuracy. This holds immense promise for the future of CQAs.

Nonetheless, a limitation of the proposed system is that the features were selected manually. It was difficult to find the optimal number of features when the system was tested on 11 different topics. The current system could be enhanced by replacing the manual feature selection method with an automated approach. For this purpose, future research may consider neural network-based frameworks such as CNN, LSTM and BERT. The system could also be tested on data drawn from other CQAs such as Quora and Yahoo! Answers.

Another limitation of this article lies in the use of only English questions. As many CQAs allow question-answering in non-English languages, future research could develop a language-independent model. In addition, questions can contain grammatical errors, making it difficult for the model to identify the exact context. A model that is able to auto-correct grammatical mistakes could be developed in the future to address this issue. Hopefully, such research efforts will further enhance the value of CQAs to Internet users in the long run.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

ORCID iD

Pradeep Kumar Roy

References

1. Ahmad, Feng, et al. A survey on mining Stack Overflow: question and answering (Q&A) community. Data Technol Appl 2018; 52(2): 190–247.

2. Kafle, De Silva, Dou. An overview of utilizing knowledge bases in neural networks for question answering. Inform Syst Front 2020; 22(5): 1095–1111.

3. Chen. Extracting core questions in community question answering based on particle swarm optimization. Data Technol Appl 2019; 53(4): 456–483.

4. Loginova, Varanasi, Neumann. Towards end-to-end multilingual question answering. Inform Syst Front 2021; 23: 227–241.

5. Roy, Ahmad, Singh, et al. Finding and ranking high-quality answers in community question answering sites. Glob J Flex Syst Manag 2018; 19(1): 53–68.

6. Gupta, Kar, Baabdullah, et al. Big data with cognitive computing: a review for the future. Int J Inform Manage 2018; 42: 78–89.

7. King. Routing questions to appropriate answerers in community question answering services. In: Proceedings of the 19th ACM international conference on information and knowledge management, Toronto, ON, Canada, 26–30 October 2010, pp. 1585–1588. New York: ACM.

8. Saha, Perry. Toward understanding the causes of unanswered questions in software information sites: a case study of Stack Overflow. In: Proceedings of the 2013 9th joint meeting on foundations of software engineering, Saint Petersburg, Russia, 18–26 August 2013, pp. 663–666. New York: ACM.

9. Chua AYK, Banerjee. Answers or no answers: studying question answerability in Stack Overflow. J Inf Sci 2015; 41(5): 720–731.

10. Shen, Liu, Wang, et al. SocialQ&A: an online social network based question and answer system. IEEE T Big Data 2017; 3(1): 91–106.

11. Huang, Hsieh, Wang HC. Automatic meeting summarization and topic detection system. Data Technol Appl 2018; 52(3): 351–365.

12. Paredes, Simari, Martinez, et al. NetDER: an architecture for reasoning about malicious behavior. Inform Syst Front 2021; 23: 185–201.

13. Chua AYK, Banerjee. So fast so good: an analysis of answer quality and answer speed in community question-answering sites. J Am Soc Inf Sci Tec 2013; 64(10): 2058–2068.

14. Correa, Sureka. Fit or unfit: analysis and prediction of ‘closed questions’ on Stack Overflow. In: Proceedings of the 1st ACM conference on Online social networks, Boston, MA, 7–8 October 2013, pp. 201–212. New York: ACM.

15. Roy, Singh JP. Predicting closed questions on community question answering sites using convolutional neural network. Neural Comput Appl 2020; 32(14): 10555–10572.

16. Asaduzzaman, Mashiyat, Roy, et al. Answering questions about unanswered questions of Stack Overflow. In: Proceedings of the 2013 10th IEEE working conference on mining software repositories (MSR), San Francisco, CA, 18–19 May 2013, pp. 97–100. New York: IEEE.

17. Zhou. Understanding online knowledge community user continuance. Data Technol Appl 2018; 52(3): 445–458.

18. Choi, Shah. Asking for more than an answer: what do askers expect in online Q&A services? J Inf Sci 2017; 43(3): 424–435.

19. Harper, Raban, Rafaeli, et al. Predictors of answer quality in online Q&A sites. In: Proceedings of the SIGCHI conference on human factors in computing systems, Florence, 5–10 April 2008, pp. 865–874. New York: ACM.

20. Toba, Ming, Adriani, et al. Discovering high quality answers in community question answering archives using a hierarchy of classifiers. Inform Sciences 2014; 261: 101–115.

21. Yao, Tong, Xie, et al. Detecting high-quality posts in community question answering sites. Inform Sciences 2015; 302: 70–82.

22. Liu, Lin, Zheng, et al. Incorporating social information to perform diverse replier recommendation in question and answer communities. J Inf Sci 2016; 42(4): 449–464.

23. Liu, Cao, et al. Understanding and summarizing answers in community-based question answering services. In: Proceedings of the 22nd international conference on computational linguistics (COLING’2008), Manchester, 18–22 August 2008, pp. 497–504. New York: ACM.

24. Liu, Yang. Predicting best answerers for new questions in community question answering. In: Proceedings of the 11th international conference on Web-age information management, Jiuzhaigou, China, 15–17 July 2010, pp. 127–138. Berlin; Heidelberg: Springer.

25. Tian, Zhang. Towards predicting the best answers in community-based question-answering services. In: Proceedings of the international AAAI conference on Web and social media (ICWSM), Cambridge, MA, 8–11 July 2013, pp. 725–728. Menlo Park, CA: AAAI Press.

26. Yoon, Kim HK. Finding more trustworthy answers: various trustworthiness factors in question answering. J Inf Sci 2013; 39(4): 509–522.

27. Zheng, et al. Algorithm for recommending answer providers in community-based question answering. J Inf Sci 2012; 38(1): 3–14.

28. Huna, Srba, Bielikova. Exploiting content quality and question difficulty in CQA reputation systems. In: Proceedings of the 12th international conference and school on advances in network science, Wrocław, 11–13 January 2016, pp. 68–81. Berlin; Heidelberg: Springer.

29. Neshati, Fallahnejad, Beigy. On dynamicity of expert finding in community question answering. Inform Process Manag 2017; 53(5): 1026–1042.

30. Srba, Grznar, Bielikova. Utilizing non-QA data to improve questions routing for users with low QA activity in CQA. In: Proceedings of the 2015 IEEE/ACM international conference on advances in social networks analysis and mining, Paris, 25–28 August 2015, pp. 129–136. New York: ACM.

31. Yan, Zhou. Optimal answerer ranking for new questions in community question answering. Inform Process Manag 2015; 51(1): 163–178.

32. Wang, Jiao, Abrahams, et al. ExpertRank: a topic-aware expert finding algorithm for online knowledge communities. Decis Support Syst 2013; 54(3): 1442–1451.

33. Fichman. A comparative assessment of answer quality on four question answering sites. J Inf Sci 2011; 37(5): 476–486.

34. Liu, Croft, Koll. Finding experts in community-based question-answering services. In: Proceedings of the 14th ACM international conference on information and knowledge management, Bremen, 31 October–5 November 2005, pp. 315–316. New York: ACM.

35. Qiu, Huang. Convolutional neural tensor network architecture for community-based question answering. In: Proceedings of the 24th international conference on artificial intelligence (IJCAI), Buenos Aires, 25–31 July 2015, pp. 1305–1311. Menlo Park, CA: AAAI Press.

36. Zhang. QuestionHolic: hot topic discovery and trend analysis in community question answering systems. Expert Syst Appl 2011; 38(6): 6848–6855.

37. Verma, Sharma, Deb, et al. Artificial intelligence in marketing: systematic review and future research direction. Int J Inf Manag Data Insights 2021; 1: 100002.

38. Espina, Figueroa. Why was this asked? Automatically recognizing multiple motivations behind community question-answering questions. Expert Syst Appl 2017; 80: 126–135.

39. Ahasanuzzaman, Asaduzzaman, Roy, et al. Mining duplicate questions of Stack Overflow. In: Proceedings of the 2016 IEEE/ACM 13th working conference on mining software repositories (MSR), Austin, TX, 14–15 May 2016, pp. 402–412. New York: IEEE.

40. Figueroa. Automatically generating effective search queries directly from community question-answering questions for finding related questions. Expert Syst Appl 2017; 77: 11–19.

41. Ponzanelli, Mocci, Bacchelli, et al. Improving low quality Stack Overflow post detection. In: Proceedings of the 2014 IEEE international conference on software maintenance and evolution (ICSME), Victoria, BC, Canada, 29 September–3 October 2014, pp. 541–544. New York: IEEE.

42. Xia, Correa, et al. It takes two to tango: deleted Stack Overflow question prediction with text and meta features. In: Proceedings of the 2016 IEEE 40th annual computer software and applications conference (COMPSAC), Atlanta, GA, 10–14 June 2016, vol. 1, pp. 73–82. New York: IEEE.

43. Zhang, Xia, et al. Multi-factor duplicate question detection in Stack Overflow. J Comput Sci Technol 2015; 30(5): 981–997.

44. Zhang, Sheng, Lau, et al. Detecting duplicate posts in programming QA communities via latent semantics and association rules. In: Proceedings of the 26th international conference on World Wide Web, Perth, WA, Australia, 3–7 April 2017, pp. 1221–1229. Geneva: International World Wide Web Conferences Steering Committee.

45. Chintalapudi, Battineni, Di Canio, et al. Text mining with sentiment analysis on seafarers’ medical documents. Int J Inf Manag Data Insights 2021; 1(1): 100005.

46. Yang, Bao, Lin, et al. Analyzing and predicting not-answered questions in community-based question answering services. In: Proceedings of the 25th AAAI conference on artificial intelligence, San Francisco, CA, 7–11 August 2011, vol. 11, pp. 1273–1278. Menlo Park, CA: AAAI Press.

47. Grover, Kar AK. Big data analytics: a review on theoretical contributions and tools used in literature. Glob J Flex Syst Manag 2017; 18(3): 203–229.

48. Shah, Kitzie, Choi. Questioning the question – addressing the answerability of questions in community question-answering. In: Proceedings of the 2014 47th Hawaii international conference on system sciences (HICSS), Waikoloa, HI, 6–9 January 2014, pp. 1386–1395. New York: IEEE.

49. Correa, Sureka. Chaff from the wheat: characterization and modeling of deleted questions on Stack Overflow. In: Proceedings of the 23rd international conference on World Wide Web, Seoul, South Korea, 7–11 April 2014, pp. 631–642. New York: ACM.

50. Srba, Bielikova. Why is Stack Overflow failing? Preserving sustainability in community question answering. IEEE Software 2016; 33(4): 80–89.

51. Dror, Maarek, Szpektor. Will my question be answered? Predicting ‘question answerability’ in community question-answering sites. In: Proceedings of the joint European conference on machine learning and principles and practice of knowledge discovery in databases (ECML/PKDD), Prague, 23–27 September 2013, vol. 3, pp. 499–514. Berlin; Heidelberg: Springer.

52. Jeon, Croft, Lee JH. Finding similar questions in large question and answer archives. In: Proceedings of the 14th ACM international conference on information and knowledge management, Bremen, 31 October–5 November 2005, pp. 84–90. New York: ACM.

53. Agichtein, Castillo, Donato, et al. Finding high-quality content in social media. In: Proceedings of the 2008 international conference on web search and data mining, Palo Alto, CA, 11–12 February 2008, pp. 183–194. New York: ACM.

54. Breiman. Random forests. Mach Learn 2001; 45(1): 5–32.

55. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001; 29(5): 1189–1232.

56. Rish. An empirical study of the Naïve Bayes classifier. In: Proceedings of the workshop on empirical methods in artificial intelligence (IJCAI’2001), Seattle, WA, 4–6 August 2001, vol. 3, pp. 41–46. Armonk, NY: IBM.

57. Chawla, Bowyer, Hall, et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002; 16: 321–357.

58. Powers DM. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. J Mach Learn Technol 2011; 2(1): 37–63.

59. Lewis. Naive (Bayes) at forty: the independence assumption in information retrieval. In: Proceedings of the 10th European conference on machine learning, Chemnitz, 21–23 April 1998, pp. 4–15. Berlin; Heidelberg: Springer.

60. Kar, Dwivedi YK. Theory building with big data-driven research – moving away from the ‘what’ towards the ‘why’. Int J Inform Manage 2020; 54: 102205.

61. Kar AK. What affects usage satisfaction in mobile payments? Modelling user generated content to develop the ‘digital service usage satisfaction model’. Inform Syst Front 2021; 23: 1341–1361.