Abstract
Deep neural network (DNN) has been widely used in various artificial intelligence applications and is, unsurprisingly, penetrating the field of school psychology. In the school environment, universal screening is used by teachers to identify children’s emotional and behavioral risk (EBR) within a screener. EBR can be used to predict possible emotional and behavioral disorders, which impact children’s educational and social outcomes. Using the BASC-2 Behavioral and Emotional Screening System Teacher Rating Scale (BASC-2 BESS TRS; Reynolds & Kamphaus (2004). Behavior Assessment System for Children (2nd ed.). Circle Pines, MN: American Guidance Service) norm data, we classified children’s EBR status from normal to at-risk using DNN. Data oversampling was used to overcome the imbalanced sample feature (i.e., few cases with emotional and behavioral disorder). Traditional machine learning methods, such as Naïve Bayes and logistic regression, were included for comparison. The results indicated that the DNN with oversampling achieved the highest performance levels with accuracy (ACC) of .957, precision (PPV) of .545, true positive rate (TPR or sensitivity) of 1.000, and true negative rate (TNR or specificity) of .942 compared with the other methods. This novel method is helpful to provide accurate screening results for early identification of children’s EBR. The current study provides a useful guide for researchers to apply the DNN and oversampling to classification in assessment-related research.
Keywords
Introduction
Approximately 20% of children experience mental health issues at some point (General, US, 1999; World Health Organization, 2005) between early childhood and adolescence. Children’s emotional and behavioral problems are related to their cognitive and academic abilities, experiences at schools, academic outcomes, and later life success (Polderman, Boomsma, Bartels, Verhulst, & Huizink, 2010; Roeser, van der Wolf, & Strobel, 2001; Walker, Ramsey, & Gresham, 2004; Webster-Stratton, Jamila Reid, & Stoolmiller, 2008). Thus, such issues are of interest to related stakeholders.
Children spend 180 days per year and 6.64 hours per day at school on average (National Center for Educational Statistics, 2008); they spend most waking hours on school-related activities with their teachers. Therefore, schools provide an optimal place for teachers to investigate students’ emotional and behavioral issues. Instead of relying on traditional methods that wait until children’s emotional and behavioral disorders (EBD) are obvious, early identification of children’s emotional and behavioral risk (EBR) may provide more timely interventions for children in order to avoid difficulties that may become more severe over time (Walker et al., 2004). The multitiered system of supports (MTSS) is a comprehensive framework with applications in assisting children’s EBD issues. It includes different levels of evidence-based assessment in the educational system to “identify children at risk for illnesses and disorders, and early intervention to reduce risk, prevent the onset, or minimize the effects of a disorder” (DiStefano, Liu & Burgess, 2017, p. 494).
Screening instruments are usually used in the initial stage of MTSS to provide universal assessment in a quick, uniform, and efficient manner. The importance of a psychometrically sound screening instrument cannot be overstated in that high-quality instruments correctly identify children with EBR. In addition, screeners are used as an initial assessment to detect the possible need for a more comprehensive assessment (Greer & Liu, 2016) later in MTSS. Thus, screening results can contribute to a collectively stable presentation of EBR; however, they cannot be used to assess the presence of EBD, as screeners are not constructed to include the comprehensive information needed to make an accurate diagnosis, which is of special importance for children with EBR (Glover & Albers, 2007). On the other hand, although the two concepts are different, EBR should function as an accurate predictor of EBD to provide early intervention in the screening process.
In the field of mental health, many researchers focus on comparing the classification accuracy (ACC) of several different screening scales (e.g., Dowling et al., 2018; van Heyningen, Honikman, Tomlinson, Field, & Myer, 2018) with receiver operating characteristics (ROC). There are also studies comparing the ACC of different classification techniques within a single scale. Yovanoff and Squires (2006) compared ROC and Rasch methods of creating cutoff scores on the Ages and Stages Questionnaires: Social–Emotional (ASQ: SE). However, they did not have a prior diagnosis of EBD to be used as a criterion variable to judge the ACC of classification. More recently, DiStefano and Morgan (2011) investigated the ACC of creating cutoff scores for the BASC-2 Behavioral and Emotional Screening System (BASC–2 BESS) using the following three methods: T scores, ROC analysis, and the Rasch rating scale method with a prior diagnosis of EBD as a criterion. Both studies indicate that Rasch and ROC methods perform similarly.
There are several issues with these studies. Identifying children with EBR using a cutoff score does not take into consideration the counterbalanced effect among items within a scale. A child could have severe problems on an issue described in one item, but such problems cannot be identified with an overall cutoff score because the summed score appears acceptable. A single cutoff score may therefore not accurately classify these children. Such methods also do not consider the unique characteristics of children, as they only consider the screening items. In addition, classification ACC is not satisfactory. For instance, the final cutoff score has a sensitivity lower than .7 and an overall ACC rate lower than .8 using the BASC-2 BESS data (DiStefano & Morgan, 2011), while the specificity was low (below .5) in the ASQ: SE (Yovanoff & Squires, 2006). Admittedly, certain scales achieve higher levels of classification ACC. For instance, the Social, Academic, and Emotional Behavior Risk Screener obtains sensitivity of .95 and specificity of .99 with BASC-2 BESS as the criterion (Social Academic & Emotional Behavior Risk Screener [SAEBRS], n.d.). However, utilizing another screening scale as the criterion may be problematic, especially as the classification ACC of BASC-2 BESS itself is not satisfactory.
Traditional machine learning techniques, such as Naïve Bayes, logistic regression, and artificial neural network (ANN), have the capacity to overcome these issues by considering the feature of each item and respondents’ characteristics. Naïve Bayes is based on applying Bayes’ theorem with strong independence assumptions between the features (Zhang, 2004). The logistic regression model, also known as log-linear, is a statistical model which uses a logistic function to measure the relationship between the variables by estimating probabilities (Yu, Huang, & Lin, 2011). ANN, a computational machine learning model, imitates both the structure and functions of biological neural networks. It is considered a nonlinear statistical data modeling tool capable of finding complex patterns or relationships between inputs and outputs (Zurada, 1992).
These machine learning methods have been used in classification-based research in social science with researchers applying machine learning algorithms to identify students who may be at risk of low achievement and/or a retention problem (Gray & Perkins, 2019). They created a predictive machine learning model to identify possible failing students in a semester, and students’ pass/fail was identified at week four of the semester with an ACC as high as 97.2%. Another study, which used logistic regression to classify children into learning problem (LP) and no learning problem (NLP) groups, correctly classified 98% of the LP students and 98.1% of the NLP students with predictors (Del Prette, Prette, De Oliveira, Gresham, & Vance, 2012). ANN was used for sport result prediction (Bunker & Thabtah, 2017), and the results indicated that ANN is an appropriate methodology to predict the match result using a historical dataset to help sport managers model future matching strategies, while they did not report classification ACCs.
Traditional ANN models only include two layers (i.e., input and output layers). With the rapid development of machine learning in the early 2000s, deep neural network (DNN), an ANN with additional hidden layers between the input and output layers (Bengio, 2009), has gradually grown into a new technique for classification. DNN is more successful than traditional ANN as a classifier because it has more hierarchical layers and is more capable of modeling or abstracting features (Yosinski, Clune, Bengio, & Lipson, 2014). DNN models have been used in social science to predict people’s personality type from a given text (Majumder, Poria, Gelbukh, & Cambria, 2017) and student performance from drawing patterns (Smith, Min, Mott, & Lester, 2015).
However, very few studies use DNN in the field of school psychology, especially with screener data to predict children’s emotional and behavioral disorder. One of the main reasons why DNN has not been widely applied is that norm databases tend to be imbalanced between positive and negative cases (e.g., normal or at-risk). In the example of children’s screening for EBR, most children in a population are normal (i.e., without EBD) and only a small percentage of them have EBD (i.e., at-risk) (Forness, Freeman, Paparella, Kauffman, & Walker, 2012; Pastor, Reuben, & Duran, 2012), causing imbalanced data features in the classification process. Classification with an imbalanced sample is indeed a crucial issue in machine learning (e.g., Ali, Shamsuddin, & Ralescu, 2015; Guo et al., 2017; Sun, Wong, & Kamel, 2009). There are multiple difficulties in providing accurate classification results. Most machine learning techniques are generally suited to balanced data (Guo et al., 2017; Nguyen, Bouzerdoum, & Phung, 2009). Even when the overall classification rate is decent, it is biased toward the majority group (i.e., the negative cases without EBD) (Loyola-González, Martínez-Trinidad, Carrasco-Ochoa, & García-Borroto et al., 2016), and the minority group (i.e., positive cases with EBD) is more likely to be misclassified (López, Fernández, García, Palade, & Herrera et al., 2013). In addition, the minority group may lack sufficient data or be mistaken for noise, which hinders accurate classification (Nguyen et al., 2009). Meanwhile, researchers have developed ways to overcome the drawbacks presented in such samples. Oversampling can be used to deal with the imbalanced data feature (e.g., Ali et al., 2015; He & Garcia, 2008; Nguyen et al., 2009) by increasing the sample size of the minority group (i.e., positive cases with EBD) with simulation to achieve a more balanced sample.
Such techniques have been used successfully in autism diagnosis and in predicting student failure at school (El-Sayed et al., 2015; Márquez-Vera, Cano, Romero, & Ventura, 2013). However, data imbalance issues in EBD classification have not yet been fully addressed.
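The bias toward the majority group described above can be made concrete with a short arithmetic sketch. It uses the class counts of the norm sample analyzed in this study (2339 non-EBD and 120 EBD cases); the "label everyone as non-EBD" rule is a hypothetical baseline introduced only for illustration and is not one of the models examined here.

```python
# Class counts taken from the norm sample described in this article.
n_negative = 2339  # children without EBD
n_positive = 120   # children with parent-reported EBD

# Hypothetical baseline: label every child as non-EBD.
true_negative = n_negative  # every non-EBD child is labeled correctly
true_positive = 0           # no EBD child is ever flagged

accuracy = (true_negative + true_positive) / (n_negative + n_positive)
sensitivity = true_positive / n_positive

print(f"accuracy    = {accuracy:.3f}")     # 0.951
print(f"sensitivity = {sensitivity:.3f}")  # 0.000
```

A rule that never flags anyone would thus look accurate overall while missing every child with EBD, which is why sensitivity is tracked separately from ACC in screening work.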
Motivated by the above review, we applied DNN and oversampling techniques to classify children into normal and at-risk groups (i.e., with EBR or not) based on BASC-2 BESS and basic demographic characteristics. Previous research had shown that DNN performed better than other traditional machine learning methods in terms of predicting early hospital readmissions and drug discovery (Futoma, Morris, & Lucas, 2015; Korotcov, Tkachenko, Russo, & Ekins, 2017). However, we were not clear if DNN could predict children’s EBD accurately. Therefore, we compared classification ACC of DNN with that of Naïve Bayes and logistic regression, traditional machine learning methods, using parent-reported EBD status as the criterion. The results are of interest to stakeholders whose goal is to correctly identify children with EBD in order to offer appropriate interventions. The method is also of value to applied researchers who are interested in employing DNN in their substantive areas.
Methods
Data Sources
The data were from the norming sample of the Teacher Rating Scale of BASC-2. The norms represented the general population of US children with regard to sex, race, and special education classification (Reynolds & Kamphaus, 2004). The child form dataset contained 2459 cases, 120 of whom (4.8%) had diagnosed EBD as reported by their parents. The children ranged in age from 4.4 years to 14.9 years, with an average age of 9.0 years and a standard deviation of 1.83. The screener (BASC-2 BESS TRS-C) included 27 items from the database, with nine items from the internalizing problems dimension and six items each from externalizing problems, adaptive skills, and school problems, to assess children’s EBR in different dimensions. A sample item from internalizing problems was “Worries about things that cannot be changed.” Items were rated using a 4-point Likert scale of 0 (never), 1 (sometimes), 2 (often), and 3 (almost always) based on the frequency of behaviors, providing a minimum possible sum score of 0 and a maximum of 81. Higher scores indicate a higher level of EBR. Raw scores were transformed to a T score with a mean of 50 and standard deviation of 10 to classify children with EBR. Scores less than 59 were considered normal and scores of 61 or above indicated at-risk. As mentioned above, this traditional method had relatively low classification ACC. The test manual provides acceptable psychometric information for the screener (Kamphaus & Reynolds, 2007). It indicated high reliability estimates, including split-half reliability (.96 to .97), test–retest reliability (.91), and interrater reliability (.71). The screener also had decent convergent validity (above .70) with other well-validated instruments serving similar purposes, including the Achenbach System of Empirically Based Assessment Teacher’s Report Form (Achenbach & Rescorla, 2001) and the Conners Rating Scale (Conners, 1997).
In addition to the scale items, the following demographic information was included in the tested models: sex, age, race, and socioeconomic status (SES) based on parents’ education levels. Boys (n = 1238, 50.3%) and girls (n = 1221, 49.7%) were evenly distributed in the sample. The following race groups were included: white (n = 1459, 59.3%), Black (n = 388, 15.8%), Hispanic (n = 481, 19.6%), Asian (n = 61, 2.5%), American Indian (n = 43, 1.7%), and other minority groups (n = 27, 1.1%). Parents’ educational levels were reported as follows: grade 11 or less (n = 391, 15.9%), high school graduate or general educational development (n = 882, 35.9%), 1–3 years of college or tech school (n = 689, 28.0%), 4 or more years of college or tech school (n = 489, 19.9%), and not reported (n = 8, .3%). Students were divided into two age groups (age 9 or above: n = 1463, 59%; below age 9: n = 997, 41%) in the modeling process.
Preprocessing
Missing data is a common problem and may have a significant impact on conclusions (Kossinets, 2006). There were 131 missing values in the 27 items of BASC-2 TRS (missing percentage <.001) and no missing values in the demographic variables. We imputed the missing values with the mean of the observed data (Kang, 2013). Next, one-hot encoding was used to convert categorical variables into a form that could be provided to the DNN algorithms. For example, for the sex feature, male is represented as the vector [0,1] and female as the vector [1,0]. After one-hot encoding, all information for each case (i.e., 27 items and 4 demographic variables) was represented as a binary vector with 123 nodes (i.e., 27 items × 4 categories + sex × 2 categories + ethnicity × 6 categories + SES × 5 categories + age × 2 categories).
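The encoding described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the study's script: the helper names (`one_hot`, `encode_case`), the category labels, and their ordering are assumptions, and each item rating is assumed to be one of the four integer categories. Only the category counts (4, 2, 6, 5, 2) and the sex example come from the text.

```python
def one_hot(value, categories):
    """Return a binary list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"{value!r} is not one of {categories}")
    return [1 if value == c else 0 for c in categories]

# Category counts follow the text; labels and ordering are illustrative.
ITEM_LEVELS = [0, 1, 2, 3]   # never, sometimes, often, almost always
SEX = ["female", "male"]     # female -> [1, 0], male -> [0, 1]
RACE = ["white", "Black", "Hispanic", "Asian", "American Indian", "other"]
SES = ["grade 11 or less", "high school or GED", "1-3 years college",
       "4+ years college", "not reported"]
AGE_GROUP = ["below 9", "9 or above"]

def encode_case(items, sex, race, ses, age_group):
    """Concatenate the one-hot codes of 27 item ratings and 4 demographics."""
    if len(items) != 27:
        raise ValueError("expected 27 item ratings")
    vector = []
    for rating in items:
        vector += one_hot(rating, ITEM_LEVELS)
    vector += one_hot(sex, SEX)
    vector += one_hot(race, RACE)
    vector += one_hot(ses, SES)
    vector += one_hot(age_group, AGE_GROUP)
    return vector

# Hypothetical case: all items rated 0 ("never").
example = encode_case([0] * 27, "male", "Hispanic", "high school or GED", "9 or above")
print(one_hot("male", SEX))  # [0, 1]
print(len(example))          # 123
```

Each of the 31 variables contributes exactly one active node, so every encoded case contains 31 ones spread over 123 positions.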
Data Oversampling
The BASC-2 BESS Teacher Rating Scale (TRS) dataset was highly imbalanced, with a ratio of negative to positive cases (i.e., non-EBD to EBD) of approximately 20:1 (2339:120). The Synthetic Minority Oversampling Technique (SMOTE) algorithm was employed to generate synthetic samples from the EBD group (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) instead of directly creating copies. This algorithm was applied to the 120 children with EBD to generate more EBD cases to balance with the number of children in the non-EBD group (i.e., 2339). Depending on the number of extra cases required, neighbors were randomly chosen from the k nearest cases of each sample. The EBD group was oversampled by taking each EBD case and generating synthetic cases along the line segments connecting it to its k nearest minority-group neighbors, as illustrated in Figure 1.
Figure 1. Generated synthetic instance.
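The interpolation step described above can be sketched as follows. In the standard formulation of SMOTE (Chawla et al., 2002), a synthetic case is x_new = x_i + λ(x_nn − x_i), where x_i is a minority case, x_nn is one of its k nearest minority neighbors, and λ is drawn uniformly from [0, 1]. The code below is a simplified, SMOTE-like illustration in plain Python, not the Imblearn implementation used in the study; the function names, the toy data, and the choice to draw base cases at random are assumptions.

```python
import math
import random

def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda o: math.dist(point, o))[:k]

def smote_like(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority cases by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)                       # x_i
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(k_nearest(base, others, k)) # x_nn
        lam = rng.random()                                # lambda in [0, 1)
        # x_new lies on the line segment between base and neighbor.
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority group: four points at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_cases = smote_like(minority, n_synthetic=6, k=2)
print(len(new_cases))  # 6
```

Because every synthetic case is a convex combination of two existing minority cases, the new points stay inside the region already occupied by the minority group rather than duplicating any single case.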
Models
Equations for logistic regression and Naïve Bayes are not introduced, as they are commonly known machine learning methods. The DNN model was constructed with five layers: one input layer, three hidden layers, and one output layer. The hidden layers are a major difference from traditional machine learning models (Figure 2). A nonlinear activation function, the rectified linear unit (ReLU), is responsible for transforming the summed weighted input into output and is defined as $\mathrm{ReLU}(x) = \max(0, x)$.
Figure 2. Overview of deep neural network architecture.
This function is shown visually in Figure 3. Finally, the cases were classified as EBD or non-EBD in the output layer. The sigmoid activation function, commonly used in the output layer of a DNN, is the same logistic function that the logistic regression classifier uses for binary classification and is defined as $\sigma(x) = 1/(1 + e^{-x})$.
Figure 3. Rectified linear unit graph. Note. ReLU(x) is zero when x is less than zero, and ReLU(x) is equal to x when x is greater than or equal to zero.
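The two activation functions named above can be written out directly. The following is a minimal sketch in plain Python using the standard definitions; the function names are illustrative, and the sigmoid is split into two branches only to avoid numerical overflow for large negative inputs.

```python
import math

def relu(x):
    """Rectified linear unit: 0 for negative input, x otherwise."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), computed in a numerically safe way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

print([relu(v) for v in (-2.0, 0.0, 3.5)])  # [0.0, 0.0, 3.5]
print(sigmoid(0.0))                         # 0.5
```

The sigmoid maps any real-valued input to a value between 0 and 1, which is what allows an output node to be read as a classification probability.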
Implementation
The data analysis was performed in Python, a free programming language available for major operating systems. Several packages were utilized to assist the modeling process, including Keras (Chollet et al., 2015), Scikit-learn (Pedregosa et al., 2011), Pandas (McKinney, 2010), and Imblearn (Lemaître, Nogueira, & Aridas, 2017). Keras is an open-source neural network library that provided the DNN model in the study. Scikit-learn features various classification algorithms, including logistic regression and Naïve Bayes. Pandas is a software library written for data manipulation and analysis in Python. Imblearn was used to handle data imbalance. The Python scripts were divided into three parts to help researchers understand the implementation process.
First, we read the data into a data frame with Pandas and extracted all columns except the outcome variable (i.e., parent-reported EBD status). Then, we used the train_test_split function from Scikit-learn to randomly split the original dataset into two parts at a ratio of approximately 2:1; Sample 1 included 1647 cases (78 EBD children) and Sample 2 included 812 cases (42 EBD children). Then, Sample 1 was combined with a randomly selected synthetic sample generated by the SMOTE algorithm from the Imblearn module to form a new training sample (i.e., Sample 3). Specifically, we randomly selected 1491 cases from the synthetic sample so that Sample 3 was balanced between positive and negative cases. This sample contained 1569 children with EBD (78 from Sample 1 and 1491 from the synthetic sample) and 1569 children without EBD (Python scripts available at https://github.com/wjd198605/children_risk_project_code/blob/master/part1.py).
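The sample sizes reported above can be checked with simple bookkeeping. The sketch below only reproduces the arithmetic from the counts given in the text; the variable names are illustrative and it is not part of the published scripts.

```python
# Counts reported in the text.
n_total, n_ebd = 2459, 120
sample1_n, sample1_ebd = 1647, 78   # training portion
sample2_n, sample2_ebd = 812, 42    # held-out testing portion

# The two parts account for every case and every EBD child.
assert sample1_n + sample2_n == n_total
assert sample1_ebd + sample2_ebd == n_ebd

# Non-EBD children in Sample 1, and the synthetic EBD cases needed to match them.
sample1_non_ebd = sample1_n - sample1_ebd
n_synthetic_needed = sample1_non_ebd - sample1_ebd

# Share of the data held out for testing.
test_fraction = sample2_n / n_total

print(sample1_non_ebd)          # 1569
print(n_synthetic_needed)       # 1491
print(round(test_fraction, 2))  # 0.33
```

Sample 3 therefore holds 1569 cases per class, and roughly one third of the original data is reserved for testing.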
The next step was to build and train three models. We imported the Naïve Bayes and logistic regression models from Scikit-learn and the DNN model from Keras. All three models (i.e., Naïve Bayes, logistic regression, and DNN) were trained on the one-hot encoded 27 items and 4 demographic variables to classify children into the normal or at-risk group (EBR or not), once with Sample 1 and once with Sample 3. The length of the binary input vector (i.e., the number of input nodes) is the key parameter of the DNN model and varies across datasets. As shown in Figure 2, the items and demographic variables were converted into 123 nodes that fed the input layer. Then, there were three hidden layers with 32, 16, and 8 nodes. The final output layer generated the classification probabilities with 2 nodes. The model thus decreased the number of nodes layer by layer toward the final goal of a two-category classification (EBD and non-EBD). The scripts can be adapted to other social science datasets (Python scripts available at https://github.com/wjd198605/children_risk_project_code/blob/master/part2.py). Finally, we examined the model performance with Sample 2 (Python scripts available at https://github.com/wjd198605/children_risk_project_code/blob/master/part3.py).
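To give a sense of the size of the network described above, the following sketch counts its trainable parameters. The layer widths (123, 32, 16, 8, 2) come from the text; treating every layer as fully connected with one bias per node is an assumption, since the exact layer configuration is specified in the linked scripts rather than in this section.

```python
# Layer widths from the text: input, three hidden layers, output.
LAYER_SIZES = [123, 32, 16, 8, 2]

def parameters_per_layer(sizes):
    """Weights plus biases for each pair of consecutive fully connected layers."""
    return [n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:])]

per_layer = parameters_per_layer(LAYER_SIZES)
total = sum(per_layer)

print(per_layer)  # [3968, 528, 136, 18]
print(total)      # 4650
```

Under these assumptions, most parameters sit between the input layer and the first hidden layer, which is why the length of the one-hot input vector largely determines the size of the model.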
Classification Index
Parent-reported EBD status was used to examine the performance of the three models. We labeled children with EBD as positive (represented as 1) and children without EBD as negative (represented as 0). Based on the model-predicted status (EBR or not) and the true status (EBD or not), children were divided into four groups. True negative cases were children without EBR and without EBD (N = a). True positive cases were children with EBR and EBD (N = d). False negative cases were children without EBR who later exhibited EBD (N = b). False positive cases were children with EBR but without EBD (N = c).
Performance levels of the three models were evaluated in terms of classification ACC, precision, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUROC). The ACC refers to the percentage of correct predictions ((a + d)/(a + b + c + d)). Precision, also known as positive predictive value (PPV), refers to the fraction of children flagged with EBR who actually have EBD (d/(c + d)). Sensitivity (d/(b + d)) and specificity (a/(a + c)), also called true positive rate (TPR) and true negative rate (TNR), are the fractions of actual positives and actual negatives, respectively, that are correctly identified. For screening purposes, researchers expect to obtain a sensitivity rate and a specificity rate close to .80 (Glascoe, 2005). Sensitivity is more important in the screening process, as researchers expect to correctly identify children with EBD so that these children are not missed for early intervention. We expected to see an overall ACC of .8 or above based on the suggested sensitivity and specificity. There is no suggested value for PPV. In practice, PPV may be low: for every two to three children who are screened with EBR, only one would eventually be diagnosed with EBD. Overall, higher values of all indices suggest higher ACC in classification. AUROC is reported to indicate the overall performance level of a model classification. AUROC is a common summary statistic for the efficiency of a predictor in a binary classification task; it is equal to the probability that a predictor will rank a randomly chosen positive instance higher than a randomly chosen negative one (Pepe, Cai, & Longton, 2006). Values from .9 to 1 are considered excellent (Swets, 1996). In general, AUROC measures how well an approach discriminates between EBD children and non-EBD children.
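The indices defined above can be computed directly from the four cell counts, and AUROC can be computed from its probabilistic definition. The sketch below is illustrative only: the function names are assumptions, and the counts and scores in the example are hypothetical values chosen for easy arithmetic, not results from this study.

```python
def screening_metrics(a, b, c, d):
    """a = true negatives, b = false negatives, c = false positives, d = true positives."""
    return {
        "ACC": (a + d) / (a + b + c + d),
        "PPV": d / (c + d),
        "TPR": d / (b + d),   # sensitivity
        "TNR": a / (a + c),   # specificity
    }

def auroc(scores_pos, scores_neg):
    """Probability that a random positive case is scored above a random negative one.

    Ties are counted as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical confusion counts for 100 screened children.
metrics = screening_metrics(a=85, b=2, c=5, d=8)
for name, value in metrics.items():
    print(f"{name} = {value:.3f}")

# Hypothetical predicted probabilities for EBD and non-EBD children.
area = auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
print(f"AUROC = {area:.3f}")
```

In this hypothetical example, 11 of the 12 positive-negative pairs are ranked correctly, so the AUROC is 11/12. The pairwise loop is quadratic in the sample size and is meant to mirror the definition, not to be efficient.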
With the classification indices described above, we evaluated the performance levels of three models (i.e., DNN, Naïve Bayes, and logistic regression) before and after data oversampling. Finally, confusion matrices, a tool for reporting summary frequency output (i.e., a, b, c, and d) in classifications, were used to show the classification results visually.
Results
Oversampling Evaluation
Table 1. Classification Performance Levels of Three Models Based on Testing Sample 2.
Note. DNN = deep neural network.
Of all tested cases in Sample 2, 42 children were identified with EBD (4.988%). The overall ACC and TNR or specificity were high in the logistic regression and DNN models without data oversampling, while the Naïve Bayes model had relatively low ACC. However, the TPR or sensitivity of logistic regression and DNN was only approximately .070 and .047, respectively, which indicated that, without data oversampling, these models had a low ability to correctly flag children with EBD as having EBR. As shown in Figure 4, only 3 of 42 EBD children were correctly identified as EBR by logistic regression and only 2 of 42 by DNN. Interestingly, Naïve Bayes had a relatively high TPR or sensitivity, with 35 of 42 EBD children correctly identified (.832). PPV was low in all three models, ranging from .101 to .401. Therefore, data oversampling might be necessary to improve the performance levels of all three methods.
Figure 4. Comparison of confusion matrices.
The three models trained with Sample 3 yielded substantially improved classification results. First, the DNN achieved a perfect TPR or sensitivity, indicating that this model was able to identify all 42 EBD children among the 812 children in Sample 2 (Figure 4). As screening places a strong emphasis on sensitivity (i.e., the ability to correctly identify EBD children), the DNN model was ideal for such a purpose. Overall, 8 (769 vs. 777) more cases were correctly identified as true positive and true negative cases with data oversampling. ACC and PPV slightly improved to .957 and .545, respectively. The model marked more children (35) as false positive cases, which constituted a .041 drop in TNR or specificity. Although TNR or specificity dropped slightly after oversampling, it was acceptable, as both values were higher than .90. Therefore, data oversampling was necessary to improve the performance levels of the DNN model. Data oversampling also improved the classification performance levels of Naïve Bayes in terms of all fit indices. Interestingly, logistic regression performed better without oversampling in most fit indices except sensitivity. The PPV rates were still relatively low in all models.
DNN, Naïve Bayes, and Logistic Regression
We compared performance levels of the three models with data oversampling in this section. As shown in Table 1, the DNN model surpassed the other two models in all fit indices. It achieved the highest performance levels with ACC of .957, PPV of .545, TPR or sensitivity of 1.000, and TNR or specificity of .942. Overall (Figure 4), 777 (735 + 42) of 812 children were correctly identified by the model (ACC of .957). All 42 EBD children were correctly found to have EBR (sensitivity of 1), and 735 of 770 non-EBD children (specificity of .942) were correctly found to be without EBR. On the other hand, logistic regression correctly selected 655 children (621 + 34) and Naïve Bayes correctly selected 591 children (555 + 36). Although all three models met the cutoff value for sensitivity, Naïve Bayes and logistic regression had TNR or specificity values slightly lower than .8. In addition, the PPV values in Naïve Bayes and logistic regression were extremely low even after data oversampling. For every 10 children who were referred as EBR, fewer than two actually had EBD. The PPV value was higher in the DNN model (i.e., .545). Finally, the AUROC of the DNN model reached .979, which was considerably higher than that of the other two models (Figure 5) and suggested an excellent fit.
Figure 5. Receiver operating characteristic curve with data oversampling.
Discussion
The study investigated the performance levels of three classification models (Naïve Bayes, logistic regression, and DNN) in classifying children’s EBR status (normal or at-risk) using the BASC-2 BESS TRS norm sample. All three methods overcome the issues of creating a single cutoff score as in the test manual because they consider the feature of each item as well as participants’ characteristics. In addition, these machine learning techniques are capable of exploring many more personal properties that contribute to the classification task. Thus, they can provide more accurate classification results. Specifically, DNNs are modeled with multiple hidden layers and an output layer. The hidden layers transform the input several times to adjust the weights of features (i.e., the useful features have higher weights) to further increase classification accuracy.
Our study also highlights the importance of several techniques that are crucial to the success of a DNN approach. The data oversampling technique was used with all three methods to overcome the imbalanced data feature. This technique seems indispensable for many general-population datasets with a small percentage of positive cases (i.e., samples with few EBD cases). The results indicate that data oversampling needs to be used with DNN to produce satisfactory classification results, especially for improving sensitivity. The same applies to one traditional machine learning method (i.e., Naïve Bayes). One-hot encoding, a strategy to convert categorical data into a numeric form that algorithms can process, is also critical in DNN analysis.
The results indicate that DNN with data oversampling is the optimal method to classify children with or without EBD. This technique processes categorical data gradually through its hidden layer structure. Each node in the hidden layer is given a weight that represents the strength of its relationship with the output. The weights are adjusted as the model develops (Brahma, Wu, & She, 2016), which is an advantage compared to traditional machine learning methods. In other words, all 27 screener items and 4 demographic variables in the BASC-2 BESS TRS dataset contribute in varying degrees to predicting children with or without EBR based on their relationships with the output. This method provides a new way for classification-based research and will eventually benefit a wide range of educational and social studies.
The comparison of classification performance levels shows the superior role of DNN compared to traditional machine learning approaches (i.e., Naïve Bayes and logistic regression) on all classification criteria. In particular, the TPR or sensitivity is 100%, which is ideal for the universal screening purpose. In other words, we can correctly identify all children with EBD based on their EBR status. The advantages are confirmed when we compare classification performance levels between DNN and traditional methods of setting up cutoff scores (DiStefano & Morgan, 2011). Specifically, TPR or sensitivity, TNR or specificity, and ACC are .71, .79, and .78 with the cutoff score method, while the TPR or sensitivity, TNR or specificity, and ACC are 1, .94, and .96 with DNN, respectively. Overall, accurate classification with data in the initial tier of MTSS is necessary for EBD children (i.e., children with emotional and behavioral problems) to receive more comprehensive behavioral assessment and on-time intervention services. DNN is an optimal method for this purpose, especially due to its perfect sensitivity (i.e., ability to correctly identify EBD children). Admittedly, the PPV rates are low in all models, even in DNN with data oversampling. For every 10 children with EBR, only about 5.5 have EBD (PPV = .545). The results indicate the necessity of follow-up comprehensive testing after screening measures to filter out more EBR children without EBD.
Implications to Clinical Practice
Our findings can benefit stakeholders in education (e.g., schools, teachers, and students) directly with further communication with the test developers. To improve the usability of study results, a user-friendly website supported by the DNN algorithm could be developed to identify children’s status of behavioral and emotional problems after teachers or other stakeholders input children’s demographic information and performance levels on the survey items. They will then see the predictive results generated by the DNN algorithm. Meanwhile, all the new student information can be saved to further improve the classification performance levels of DNN. The website can serve as an auxiliary tool to help teachers or parents identify children’s EBD status. Practitioners can access accurate screening results within a few minutes without needing to know the technical details. Most importantly, web-based surveys have become prevalent in areas such as evaluation and research (Greenlaw & Brown-Welty, 2009). Compared with paper-based surveys, a web-based tool will reduce both time and cost. While there are other screening tools available for such purposes, DNN can significantly improve classification performance levels, especially the ability to identify positive EBD cases (sensitivity). All of the aforementioned can be achieved by test developers by revising the test manual and the scoring software.
Limitations and Future Directions
First, data oversampling was used to overcome the issue of imbalanced data. Although simulated data have features similar to the original data, this may be a limitation because real positive cases make up an extremely low percentage of the sample. Second, we used five layers in the DNN model and obtained satisfactory classification results. The number of layers is important for achieving accurate classification results. To help researchers decide the optimal number of layers, a novel fitness function has been introduced that concurrently searches for the most accurate and optimal DNN architecture for a given study (Stathakis, 2009). Therefore, one future direction is to focus on how different architectures (i.e., number of layers) would affect the classification ACC. This may help improve the low precision rate in the current DNN model. In addition, other techniques for handling imbalanced data (e.g., Sun et al., 2009) may function well for accurate EBD classification and can be explored in future studies.
One of the challenges in classification-based EBD research is the lack of an accurate diagnosis of the EBD status. The parent-reported EBD, the most appropriate diagnostic variable in the norm database, was utilized in the current study. Although this is a direct measure of EBD status, as opposed to using another comprehensive scale to create the EBD status, parent report may not be 100% accurate. Future studies can be improved by obtaining the actual EBD status from the clinical setting directly. Admittedly, a screening scale, which is generally used in the initial stage of MTSS to briefly assess EBR levels, cannot function as a general outcome measure or monitor behavioral and emotional changes. Other types of EBD forms, such as full-length behavioral and emotional rating scales and progress monitoring scales, should be examined for classification ACC utilizing DNN modeling in future studies. Finally, the current study provides a useful guide for social science researchers to apply DNN and oversampling to classify respondents into groups. The Python scripts are provided to facilitate the modeling process with different data resources.
Conclusion
We examined the use of DNN to classify children with EBD using BASC-2 BESS TRS norm data as compared to traditional machine learning methods. With proper data preprocessing and data oversampling, we trained multiple models and compared the classification ACCs across different models. The DNN method surpasses the traditional approaches in terms of ACC, PPV, TPR or sensitivity, TNR or specificity, and AUROC. The findings suggest that DNN can accurately identify children’s EBD with the BASC-2 BESS and demographic information. Our overall conclusion is that DNN can be a powerful tool in the initial stage of MTSS to provide useful information for stakeholders who are interested in the early identification of children’s emotional and behavioral problems and in providing treatments for children at risk.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
