Abstract
The early warning of mental disorders is of great importance for the psychological well-being of college students. Conventional scaling methods for questionnaires generally predict mental disorders with low accuracy, because the questionnaires contain much noise and are processed in a rudimentary way. To address this problem, we propose a novel anomaly detection framework for questionnaires, which represents each questionnaire as a document and applies keyword extraction and machine learning techniques to detect abnormal questionnaires. We also propose a new keyword statistic for the calculation of option significance and three interpretable machine learning models for the calculation of question significance. Experiments demonstrate the effectiveness of our proposed methods.
Introduction
The high stress of college students not only affects their academic performance [25] but also causes serious mental disorders, leading to self-harm and even suicide [7]. Studies show that the prevalence of mental disorders such as depression and self-harm among college students is on the rise [28]. Therefore, detecting potential mental disorders of college students at an early stage is of great importance for maximizing their psychological well-being.
The psychometric questionnaire is currently one of the main instruments for gathering information for the analysis of mental disorders. For example, the Likert scale [30], the most widely used approach to scaling responses in psychometric questionnaires, calculates the score of a questionnaire as the sum of the scores of the picked options, each of which is assigned an integer value predefined by the questionnaire designers. This scaling method has caused considerable controversy: the value assigned to each option is arbitrary, and the method assumes that every question is equally important and that the distance between successive options is equivalent, which has no objective basis in measure theory or numerical experiments [34].
Due to the shortcomings of conventional scaling methods, some researchers apply machine learning techniques to identify mental disorders. One advantage of machine learning is that it can learn appropriate weights for different features. For example, machine learning techniques such as support-vector networks [9] and convolutional neural networks [42] are used for early warning of mental disorders based on clinical data [14], physiological data [35], video and audio data [36], behavioral data [37], and social media data [6]. These data, however, are not easy to obtain in practical settings.
Compared to the above data source, questionnaires are a more economical and convenient data source. Applying machine learning techniques on questionnaires for early warning of mental disorder, however, faces the following challenges:
To address the above problems, we propose an anomaly detection framework based on text analysis techniques, which can be used for early warning of mental disorders in colleges or, more generally, to analyze questionnaires containing closed-ended questions. We refine the framework into a process for early warning of mental disorders. By representing the combination of picked options as a document, we can apply keyword extraction techniques to calculate the significance scores of options and filter out unrelated questions. We also propose several machine learning models for calculating the significance scores of questions. Experiments show that our methods outperform baseline machine learning techniques and provide much higher accuracy in early warning of mental disorders than the conventional scaling method currently used in colleges.
The main contributions of this paper are as follows:
1. An anomaly detection framework based on text analysis techniques is proposed. This framework is applicable to general questionnaire data composed of closed-ended questions. Based on this framework, we present an early warning process for mental disorders, which can eliminate most noise in the questionnaire.
2. A novel statistic for calculating option significance and filtering questions is proposed. Unlike statistics commonly used in keyword extraction applications, this statistic takes into account the highly imbalanced nature of datasets in anomaly detection problems, thus significantly improving the accuracy of identifying abnormal samples.
3. Three interpretable anomaly detection models are proposed. Among them, Naive OF-IDF is highly interpretable, while Attention OF-IDF and Graph OF-IDF achieve high accuracy while preserving acceptable interpretability.
The remainder of this paper is organized as follows. Section 2 introduces work related to this study. In Section 3, we describe the dataset of the application problem. We present our proposed methods in Section 4, including the anomaly detection framework, a novel statistic, and three prediction models, and demonstrate the effectiveness of these methods in Section 5. Finally, we conclude our work in Section 6.
Related work
Mental disorders and college students
College students have to face many challenges, such as adapting to a new environment, establishing new relationships, and coping with academic pressure [29]. In the process of coping with these challenges, college students may suffer from various forms of psychological distress, mental health problems, and even self-harm and suicidality. For example, a strong relationship has been found between suicidal ideation and depression among college students [7]. Natural disasters and pandemics also have a noticeable impact on the mental health of college students. A recent study shows that 24.9% of college students became anxious after the outbreak of novel coronavirus disease 2019 (COVID-19) [39]. In fact, college students have more serious mental health problems than their contemporaries [40], and the prevalence of mental illness among college students is on the rise [19]. If mental disorders of college students can be identified at an early stage, proper intervention or assistance can be provided, thus reducing the risk of depression and suicide and mitigating the deleterious effect on their physical and mental well-being. Many researchers have found that the risk factors for college students' mental disorders include personality characteristics [26], academic performance [28], personal background [18], social ecology [41], loneliness [13], and so on. Researchers have also found that these factors interact with each other. For example, loneliness was found to be an important factor leading to depression in college students [4]. Our study also demonstrates that considering the interplay between various factors helps to improve the detection accuracy of mental disorders.
Detecting mental disorders by using questionnaires
At present, psychometric questionnaires are employed as one of the primary tools to collect data for the early detection of mental disorders in college students [33]. The Likert scale, a representative scaling approach for questionnaires, is commonly used for scoring psychometric questionnaires [30]; each question is a statement related to some measurement topic. The respondent chooses a degree of agreement or disagreement with the statement from the options, which usually comprise five levels corresponding to a set of increasing or decreasing scores. For instance, the Beck Depression Inventory (BDI) [5] is designed to measure depression, the R-UCLA Loneliness Scale [12] to measure loneliness, and the Eysenck Personality Questionnaire (EPQ) to analyze personality characteristics; all are psychometric questionnaires used for the detection of specific mental disorders.
Although questionnaire data is easier to obtain and analyze than other data sources such as clinical data and video data, there is considerable disagreement with the conventional scaling method [34], in which the impact of each question on the questionnaire result is considered independent and equivalent, and each option in a question is assigned an integer score by the designer of the questionnaire. For example, for a five-level Likert scale, the scores of the options generally range from 1 to 5, which has no objective basis in measure theory or numerical experiments [16]. Besides, such a scoring method assumes that the distance between successive options is equivalent, but respondents may not interpret it as such, resulting in inaccurate results [22]. Our experiments also demonstrate that the conventional scoring method leads to a higher rate of misjudgment.
Detecting mental disorders by using machine learning techniques
Some studies leverage machine learning techniques to forecast certain mental disorders in specific populations. For example, a neural network model is employed to predict the possibility of psychological conditions in patients with concussion, based on their past condition and longitudinal clinical trajectory [14]. Deep learning networks are used for automatic analysis of depression by extracting human behavior primitives from videos [36]. Other studies construct machine learning models to predict mental disorders based on data sources including sensor data [37], EMG and ECG signals [35], and social media [6]. The above methods require specialized data sources, such as past medical conditions, videos, or sensor data, that can reflect people's mental states. However, these data are generally difficult to obtain in practical applications. They may be applicable to patients in hospitals but are not suitable for ordinary college students.
Keyword extraction techniques
Keyword extraction is the process of identifying keywords, i.e., important words or phrases representing the main topic of a given document [27]. In this paper, we propose a novel approach to calculate the significance scores of options and to filter questions by leveraging the keyword extraction technique. Keyword extraction is usually based on various text features, such as syntactic features, statistical features, and features calculated from external resources. Among them, statistical features are widely used in large corpora [23]. Common statistical features include TF-IDF [17], domain relevance (DR) [32], co-occurrence features (such as TextRank [31]), and similarity features (such as Jaccard index [10]). TF-IDF is a simple but effective statistic, which has been widely used in information retrieval, position detection [3], topic detection [1], and so on. Because the dataset for anomaly detection is highly imbalanced between normal and abnormal samples [38], TF-IDF is not the appropriate statistic for the identification of abnormal samples. To address this problem, we propose a revised statistic for anomaly detection on the basis of TF-IDF.
Datasets
This study aims to construct an interpretable model to identify abnormal questionnaires, which indicate that the respondents might suffer from mental disorders in college. The dataset comes from the Mental Health Education Center of a university. It consists of 28,583 questionnaires, which were completed by respondents during their first month at the university and subsequently anonymized. Each respondent answered only one questionnaire.
The questionnaire mainly includes three categories of questions: solid-state questions, the Beck Depression Inventory (hereinafter referred to as BDI) [5], and the Eysenck Personality Questionnaire (hereinafter referred to as EPQ). Details about the questionnaire questions are presented in Table 1.
Psychological scale composition
There are a total of 146 questions in the questionnaire. For each question in the questionnaire, a respondent picks one and only one option as the answer. Hence, each answered questionnaire can be represented by a 146-dimension feature vector, where each feature represents the picked option to a question.
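The representation described above can be sketched as follows. This is a minimal illustration only: the token format `q{i}_o{k}` and the helper name `to_document` are assumptions made for this example and do not come from the original questionnaire system.

```python
from typing import List


def to_document(picked: List[int]) -> List[str]:
    """Turn a vector of picked option indices (one entry per question)
    into a 'document' whose words are question/option tokens."""
    # Hypothetical token format: question number (1-based) + picked option index.
    return [f"q{i + 1}_o{k}" for i, k in enumerate(picked)]


# A toy questionnaire with 4 questions instead of 146.
answers = [2, 0, 3, 1]
doc = to_document(answers)
print(doc)  # ['q1_o2', 'q2_o0', 'q3_o3', 'q4_o1']
```

Because every question contributes exactly one token, each token appears at most once per document, which is the property the later redefinition of TF-IDF relies on.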
Statistics show that 90% or more of students study for no more than four years at this university, so we focus on predicting whether a respondent will suffer from some mental disorder in the first four years of college life. There are 265 respondents who were diagnosed with mental disorders within four years after being enrolled. The questionnaires completed by these respondents are marked as positive samples. A total of 17,112 respondents have no records of mental disorders in the first four years after being enrolled. The questionnaires completed by these respondents are marked as negative samples. Altogether, there are 17,377 positive and negative samples in our study.
We first propose a novel framework for anomaly detection, which employs text analysis techniques to score questionnaires containing closed-ended questions. Then we present a process, within this anomaly detection framework, to identify abnormal questionnaires with regard to mental disorders. Some details of this process will be presented at the end of this section.
An anomaly detection framework based on text analysis techniques
In this section, we present an anomaly detection framework based on text analysis, which accepts answers to a set of closed-ended questions and outputs anomaly detection results. Although the framework was originally designed for the identification of abnormal questionnaires in psychometric tests, it can also be applied to questionnaires for other purposes, as long as the questionnaire contains many closed-ended questions.
Figure 1 shows our proposed anomaly detection framework. Before presenting details of the framework, we first introduce some basic notation. We assume that a questionnaire contains multiple closed-ended questions, each question contains multiple options, and there are $m$ questions and $n$ options in total. Let $Q = \{q_1, q_2, \dots, q_m\}$ represent the set of questions, where $q_i$ denotes the $i$-th question, and let $O = \{o_1, o_2, \dots, o_n\}$ represent the set of options, where $o_j$ represents the $j$-th option. The relationship between options and questions is denoted by a function $h : O \to Q$, which maps an option to its corresponding question. Since $h$ is a function, each option belongs to exactly one question; $h$ is a surjection but not an injection, because every question contains two or more options. We describe five steps of the proposed framework as follows.

Anomaly detection framework based on text analysis techniques.
We propose a process for identifying abnormal questionnaires with regard to mental disorders by bringing some machine learning techniques into the framework presented in the previous section (see Figure 2). The proposed process conforms to the anomaly detection framework depicted in Figure 1, but introduces three new components:

Process of detecting abnormal questionnaires for mental disorders.
1. Machine learning algorithms. Some machine learning algorithms are employed to calculate the final score, including a keyword extraction algorithm and a classification algorithm. The former is introduced to construct a model for computing the option significance while the latter is for computing the question significance.
2. Supervised data set. The supervised data set is a component for building a supervised model, which is expected to achieve better performance than unsupervised models. The mental disorder records are introduced to supervise the construction of the two machine learning models in this application.
3. Model interpretation. The model interpretation presents the rationale behind the model about the anomaly detection results.
Most steps shown in Figure 2 are the same as those in Figure 1. The other steps are described as follows:
Conventionally, option significance is designated by the designer of the questionnaire. We can also employ text analysis techniques, such as keyword extraction techniques, for the calculation of option significance after representing the questionnaire as a document. In this section, we propose a statistic for anomaly detection applications based on TF-IDF.
TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical measure intended to evaluate the relevance of a word to a document in a corpus. The calculation of TF-IDF is based on the following observation: if a word occurs frequently in a particular document but seldom in other documents, it may be regarded as a representative term of that document. Hence, we can define TF-IDF as follows:
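For reference, a common textbook form of TF-IDF is given below. The notation is an assumption of this sketch and may differ from the numbered equations this section refers to: $f_{t,d}$ is the number of occurrences of term $t$ in document $d$, $D$ is the corpus, and $N = |D|$.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

A term scores highly when it is frequent within the document (large tf) and rare across the corpus (large idf), which matches the observation stated above.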
In our approach, each picked option is modeled as a word, so we can leverage TF-IDF to calculate the importance of options. Unlike in an ordinary document, however, each option occurs no more than once in a questionnaire. Hence we redefine TF-IDF as follows:
However, Eqs. (5)-(7) do not take into account an essential fact about anomaly detection problems: normal samples and abnormal samples are highly imbalanced, and the number of abnormal samples is usually much lower than that of normal samples. For example, suppose there are 2000 normal questionnaires, only 20 of which contain option r, and 200 abnormal questionnaires, of which again 20 contain option r. Intuitively, r is an important option for detecting abnormal questionnaires, because r is 10 times as likely to occur in an abnormal questionnaire as in a normal one (10% versus 1%). The TF-IDF value of this option, however, is usually not very high due to the small value of TF. Therefore, many important options might be missed in similar cases.
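The arithmetic behind this example can be checked with a few lines of Python. The counts are taken from the text; the variable names and the "pooled rate" quantity are illustrative choices for this sketch, not statistics defined in the paper.

```python
# Counts taken from the example in the text.
normal_total, normal_with_r = 2000, 20
abnormal_total, abnormal_with_r = 200, 20

# Per-class occurrence rates of option r.
rate_normal = normal_with_r / normal_total        # 20 / 2000 = 1%
rate_abnormal = abnormal_with_r / abnormal_total  # 20 / 200  = 10%
likelihood_ratio = rate_abnormal / rate_normal

# Occurrence rate when the two classes are pooled, as a class-agnostic
# statistic would see them: 40 of 2200 questionnaires contain r.
pooled_rate = (normal_with_r + abnormal_with_r) / (normal_total + abnormal_total)

print(f"rate in normal questionnaires:   {rate_normal:.1%}")
print(f"rate in abnormal questionnaires: {rate_abnormal:.1%}")
print(f"likelihood ratio:                {likelihood_ratio:.1f}")
print(f"pooled rate:                     {pooled_rate:.1%}")
```

The pooled rate of roughly 1.8% hides the tenfold contrast between the two classes, which is the kind of signal a class-aware statistic is meant to recover.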
To address this problem, we propose a new statistic called option frequency-inverse document frequency (OF-IDF), considering the characteristics of anomaly detection. It is calculated as follows:
After obtaining the significance of each option, we calculate abnormal scores for all questionnaires as follows:
We propose three approaches to obtain matrix A. The second and third approaches leverage machine learning techniques to compute the elements of A. These methods are described in the following subsections.
A straightforward approach is to calculate the abnormal score of the questionnaire as the sum of OF-IDF of each picked option. We name the model as Naive OF-IDF. Its calculation can be formulated as follows:
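As a minimal sketch of this summation, assume the OF-IDF significance of every option has already been computed. The numeric values, the binary encoding `X`, and the toy sizes below are made up for illustration, and any additional hyperparameters of the actual model are ignored.

```python
import numpy as np

# Hypothetical OF-IDF significance for each of n = 6 options (made-up values).
option_significance = np.array([0.1, 1.2, 0.0, 0.7, 2.5, 0.3])

# Binary questionnaire matrix X: one row per questionnaire, one column per
# option, 1 if the option was picked. Here m = 3 questions with 2 options
# each (columns 0-1, 2-3, 4-5), so every row has exactly three ones.
X = np.array([
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
])

# Abnormal score = sum of the significance of the picked options.
scores = X @ option_significance
print(scores)
```

Each score decomposes exactly into per-option contributions, which is what makes this variant easy to interpret: the options with the largest significance explain the verdict directly.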
Naive OF-IDF does not utilize label data in the training set. Nevertheless, it is possible to learn better question significance by taking advantage of the label information. Hence, we propose Attention OF-IDF, which computes the significance matrix based on the attention mechanism. The detailed process is shown in Figure 3(a).

Flowchart of Attention OF-IDF (a) and Graph OF-IDF (b).
Specifically, we use the self-attention mechanism [2] to calculate the question significance matrix A. The calculation formula is as follows:
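For orientation, the widely used scaled dot-product form of self-attention is sketched below. Whether Attention OF-IDF uses exactly this form is an assumption; the question embeddings `E`, the projection matrices `W_q` and `W_k`, and the size `d_model` are hypothetical and would be learned parameters in a real model rather than random values.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
m, d_model, d_k = 14, 8, 6  # 14 questions; d_model is a hypothetical size

E = rng.normal(size=(m, d_model))      # one embedding per question (assumed)
W_q = rng.normal(size=(d_model, d_k))  # query projection (assumed)
W_k = rng.normal(size=(d_model, d_k))  # key projection (assumed)

Q, K = E @ W_q, E @ W_k
# Row i holds the attention weights of question i over all questions.
A = softmax(Q @ K.T / np.sqrt(d_k), axis=1)
print(A.shape)
```

In this form every row of A is a probability distribution, so each entry can be read as the share of attention one question pays to another.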
The questionnaire score is calculated as follows:
We choose cross-entropy as the loss function to learn the attention coefficient matrix A by employing the backpropagation algorithm [20]. Note that this coefficient matrix is indeed the question significance matrix A that we try to obtain.
Our third approach models the relationships between questions with a graph. Each node of the graph represents a question, and each edge represents the interrelationship between two questions. By controlling the sparsity of the graph, a compromise between model accuracy and model interpretability is achieved. Inspired by a graph neural network model named Graph WaveNet [43], we construct an end-to-end supervised training model and learn an adaptive adjacency matrix from the training set. We call this model Graph OF-IDF; its detailed process is shown in Figure 3(b).
Graph OF-IDF calculates the adjacency matrix A as follows:
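In Graph WaveNet, the model that inspired this approach, the adaptive adjacency matrix is built from two learnable node-embedding matrices as a row-wise softmax of ReLU(E1 E2ᵀ). The sketch below shows that construction; whether Graph OF-IDF uses exactly the same form is an assumption, and the random embeddings stand in for parameters that would be learned by backpropagation.

```python
import numpy as np


def softmax_rows(z):
    """Numerically stable softmax applied to each row."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
m, d = 14, 6  # 14 questions, embedding dimension d = 6

E1 = rng.normal(size=(m, d))  # source node embeddings (learned in practice)
E2 = rng.normal(size=(m, d))  # target node embeddings (learned in practice)

# Graph WaveNet form: ReLU removes negative affinities, softmax normalizes rows.
A = softmax_rows(np.maximum(E1 @ E2.T, 0.0))
print(A.shape)
```

Because the matrix is parameterized by two m × d factors, the number of learned parameters grows with d rather than with the full m × m matrix.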
To compute the questionnaire anomaly score S, we feed the adaptive adjacency matrix A and the questionnaire matrix X into a graph convolutional network. This process can be formulated as follows:
Again, we learn matrix A with a backpropagation algorithm. Besides the cross-entropy loss, we also introduce the L1 regularization loss of matrix A, to ensure its sparsity.
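The training objective described here, cross-entropy plus an L1 penalty on A, can be sketched as follows. The weighting factor `lam` and the function names are hypothetical, and A stands for any real-valued question weight matrix; the actual weighting used in the experiments is not stated in the text.

```python
import numpy as np


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


def regularized_loss(y_true, p_pred, A, lam=1e-3):
    """Cross-entropy plus lam * ||A||_1; the L1 term pushes entries of A to zero."""
    return binary_cross_entropy(y_true, p_pred) + lam * float(np.abs(A).sum())


# Toy values: three questionnaires and a 2 x 2 weight matrix.
y = np.array([1.0, 0.0, 0.0])
p = np.array([0.8, 0.1, 0.3])
A = np.array([[1.0, 0.0], [-0.5, 2.0]])

loss = regularized_loss(y, p, A, lam=0.01)
print(round(loss, 4))
```

A larger `lam` yields a sparser matrix and hence a simpler graph to interpret, at the possible cost of predictive accuracy.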
Evaluation indicator
We use two evaluation indicators to verify the effectiveness of our proposed methods and compare them with some baseline approaches. Both indicators are suitable for scenarios where positive and negative samples are highly imbalanced.
Experimental setup
We randomly divide the original data set into the training set and the test set with the ratio of 7:3. Statistics about these data sets are shown in Table 2.
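A random 7:3 split of this kind can be sketched with NumPy alone. The seed, the rounding rule, and the helper name `split_indices` are assumptions of this example, so the resulting counts need not match those reported in Table 2.

```python
import numpy as np


def split_indices(n_samples, train_ratio=0.7, seed=0):
    """Shuffle sample indices and cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_train = int(round(train_ratio * n_samples))
    return perm[:n_train], perm[n_train:]


# 17,377 labeled questionnaires, as described in the dataset section.
train_idx, test_idx = split_indices(17377)
print(len(train_idx), len(test_idx))
```

With only 265 positive samples, a purely random split can leave the two parts with slightly different positive rates; a stratified split is a common alternative, although the text describes a random division.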
Experimental data statistics
The three classification methods and the four anomaly detection methods used as baselines are all implemented with the machine learning library Scikit-learn [15]; data standardization in the baseline methods uses the StandardScaler class in Scikit-learn. For our two proposed models, we set the batch size to 100 and the number of epochs to 20 in stochastic gradient descent. The Adam optimizer is used to train the models with an initial learning rate of 0.01.
Other hyperparameters are set as follows: $\alpha = 1.6387$ for Naive OF-IDF, $d = 6$ for Graph OF-IDF, and $d_k = 6$ for Attention OF-IDF. The number of predicted positive samples P′ is set to 500, as suggested by the expert at the university that provides the dataset, considering the capacity of its human resources.
It can be seen from Table 3 that most of the machine learning methods significantly outperform the conventional scoring method for scales. Logistic regression (LR) achieves the best performance among the baseline methods, but our three proposed models outperform it in both AUC and TPR-500. Among them, Graph OF-IDF is 5.25% higher than LR on TPR-500. The anomaly detection models generally perform worse than the classification models. This is probably because the supervised label data provide the classification models with more information, which helps them outperform regular anomaly detection models.
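The role of P′ can be illustrated with a small sketch. It assumes that TPR-500 denotes the fraction of all truly abnormal questionnaires that appear among the P′ = 500 highest-scoring ones; that reading, the function name `tpr_at_top_k`, and the toy data are assumptions of this example.

```python
import numpy as np


def tpr_at_top_k(scores, labels, k):
    """True positive rate when the k highest-scoring samples are flagged."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Indices of the k largest scores (stable sort keeps ties in input order).
    top = np.argsort(-scores, kind="stable")[:k]
    return labels[top].sum() / labels.sum()


# Toy data: 6 questionnaires, 3 of them truly abnormal (label 1).
scores = [0.9, 0.1, 0.8, 0.4, 0.7, 0.2]
labels = [1, 0, 0, 1, 1, 0]

tpr = tpr_at_top_k(scores, labels, k=3)
print(tpr)
```

Here the three flagged questionnaires contain two of the three abnormal ones, so the rate is two thirds. Fixing k to the follow-up capacity of the counselling staff ties the metric directly to how the predictions would be used.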
Performance comparison with the baseline methods
We test the effectiveness of OF-IDF by comparing it with the TF-IDF defined in Eq. (7) for the three models. As can be seen from Table 4, OF-IDF improves the performance of all three models compared with their TF-IDF-based counterparts. There is at least a 6% improvement in TPR-500 for our three methods, and AUC also improves to varying degrees.
Comparison of OF-IDF and TF-IDF
We also verify the effectiveness of the question filtering step in our proposed framework. Table 5 lists the results of ablation experiments, where QF+ indicates that the question filtering step is included in the process and QF- indicates that it is excluded. Both the baseline method and our proposed methods perform significantly better with question filtering.
Comparison with and without the question filtering step
Finally, we test the stability of our models when P′ is set to different values. In this application, P′ is proportional to the capacity of human resources in a university to carry out follow-up operations for the questionnaires identified as abnormal. Therefore, the value of P′ may vary across universities. As depicted in Figure 4, our proposed methods consistently achieve the highest accuracy for different values of P′, which demonstrates the stability of our methods.

Comparison of each method under different P′.
The above experimental results have demonstrated the superiority of our proposed models to other baseline methods. Particularly, our methods can provide universities with more reliable support in the detection of abnormal questionnaires compared with the conventional scoring method for scales.
We choose four representative questionnaires as the cases to illustrate the interpretability of OF-IDF. The option significance scores and questionnaire scores of these questionnaires are shown in Table 6.
Analysis of prediction results of case questionnaire
After question filtering, a questionnaire contains only fourteen questions, and the significance scores of the selected options for these questions are listed in the first fourteen rows of the table, where each column corresponds to one of four questionnaires. Options with greater significance scores are in bold type. The third row from the bottom lists the abnormal score of the questionnaire calculated by Naive OF-IDF, which equals the sum of the scores of the first fourteen rows. The last two rows show the predicted results and true labels of the questionnaire.
As shown in the last two rows of Table 6, the first three questionnaires are correctly identified by the model. QN1 contains plenty of options with high significance scores and is therefore identified as an abnormal questionnaire. QN2 and QN3 are identified as normal questionnaires, as they contain few options with high scores. For the same reason, the abnormal questionnaire QN4 is incorrectly identified as a normal one. In fact, the score distribution of QN4 is very similar to that of QN3, which explains why the model is misled. Although the judgment of the model might sometimes be incorrect, it can always present reasonable evidence for its judgment. Therefore, we can say that Naive OF-IDF has high interpretability.
Figure 5 plots the probability density curves [8] of the Naive OF-IDF scores for both normal and abnormal questionnaires. The horizontal axis indicates the Naive OF-IDF score, the vertical axis indicates the probability density, and the black dashed line indicates the corresponding abnormal score threshold when P′ = 500. It can be seen from Figure 5 that the higher the score, the more likely the questionnaire is abnormal. However, the positive and negative samples overlap over a large range. In particular, the scores of some abnormal samples are the same as or even lower than those of normal samples. This is probably because some respondents were in a normal state of mind when filling out the questionnaire, or concealed their real situations, so that their questionnaires did not include options that are significant for anomaly detection.

Naive OF-IDF probability density plot.
To explain the effect of the question weight matrix on questionnaire score, we visualized the weight matrix learned by Graph OF-IDF, as shown in Figure 6.

Heat map of question adjacency matrix learned by Graph OF-IDF.
The horizontal and vertical axes in Figure 6 denote the sequence numbers of fourteen questions. The gray level is proportional to the mutual impact between the two questions. Elements on the diagonal have the largest values, indicating that the maximum impact on a question is from itself. In addition, there are also noticeable influences between some pairs of questions. By filtering out question pairs that have little influence on each other, a sparse directed graph can be constructed, as shown in Figure 7.
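The filtering step described above can be sketched as a simple threshold on the off-diagonal entries of the learned matrix. The threshold value, the toy matrix, and the helper name `sparse_edges` are hypothetical; which index of A is the influencing question and which is the affected one is also left open here, since it depends on how the matrix is defined.

```python
import numpy as np


def sparse_edges(A, threshold):
    """Return (row, col, weight) for off-diagonal entries with weight >= threshold."""
    A = np.asarray(A, dtype=float)
    edges = []
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            if i != j and A[i, j] >= threshold:
                edges.append((i, j, float(A[i, j])))
    return edges


# Toy 3 x 3 weight matrix with a dominant diagonal (made-up values).
A = np.array([
    [0.80, 0.05, 0.00],
    [0.30, 0.60, 0.10],
    [0.02, 0.25, 0.70],
])

edges = sparse_edges(A, threshold=0.2)
print(edges)  # [(1, 0, 0.3), (2, 1, 0.25)]
```

Raising the threshold removes more edges and gives a graph that is easier to read, while lowering it keeps more of the interactions the model actually uses.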

Sparse graph constructed from the question adjacency matrix shown in Figure 6.
In Figure 7, the tail endpoint of each directed edge represents the affected node. For example, question 1 has a significant impact on question 14. Thus, when calculating the score of question 14, we also need to consider the options of question 1. Graph OF-IDF achieves higher accuracy than Naive OF-IDF, which may imply that these questions are not completely independent and that a more accurate model can be constructed by considering their mutual impact. Graph OF-IDF achieves this higher accuracy at the expense of interpretability. Nevertheless, lower interpretability does not guarantee higher accuracy. For example, the accuracy of Attention OF-IDF is not as good as that of Naive OF-IDF. Among the three proposed models, Naive OF-IDF is preferred if interpretability is the main concern; if model performance is valued most, we can choose Graph OF-IDF, which also provides good interpretability.
Early warning of mental disorders for college students helps to improve their psychological well-being in college. We propose a novel anomaly detection framework for questionnaires and refine it into a process for detecting abnormal psychometric questionnaires based on keyword extraction and machine learning techniques. We also propose a new keyword statistic and three models to calculate option significance and question significance. Experiments demonstrate that our proposed methods outperform conventional scaling methods and common machine learning techniques. The proposed framework and process can also be applied to general questionnaires containing closed-ended questions.
Acknowledgments
This work is supported by National Natural Science Foundation Project of CQ CSTC (No.cstc2020jcyj-msxmX).
