Abstract
Objective
To develop and evaluate an AI-powered mobile application, Tebyan, to estimate children’s developmental levels from Draw-A-Person (DAP) test drawings and support the early identification of developmental concerns.
Methods
The system predicts drawing-based developmental age and compares it with the child’s chronological age, where an age gap may indicate the need for further evaluation. Nine deep learning models were created using MobileNet, ResNet, and EfficientNet architectures across binary, four-class, and eight-class configurations.
Results
Across all nine model configurations, performance decreased as class granularity increased. The two-class models achieved strong and balanced results (≈80% accuracy), four-class models showed moderate performance (≈55–65%), and eight-class models performed lowest (≈30–40%). Macro-averaged sensitivity, specificity, precision, and F1-scores were reported with 95% confidence intervals. Based on balance and stability, the four-class MobileNet model was selected for integration into the Tebyan application, supporting a more precise evaluation of developmental progression from the drawings.
Conclusion
Tebyan provides an AI-based approach for estimating developmental levels from children’s drawings by comparing the model’s predicted age group with the child’s actual age. While not a diagnostic tool, the system offers a supportive resource that may help caregivers and educators identify developmental patterns that warrant further attention.
Introduction
Developmental delays affect approximately 10–15% of preschool-aged children worldwide, often manifesting as slower attainment of milestones compared to peers. 1 These delays, classified as mild, moderate, or severe based on the functional-to-chronological age ratio, can have long-term impacts on academic and social development. 2 Timely detection is crucial for enabling early interventions that can mitigate these outcomes and enhance children’s quality of life.
Traditional assessments rely on expert observation, but such evaluations may be inaccessible or subjective. Children’s drawings have long been used in psychological assessments to infer cognitive and emotional development. 3 The Draw-A-Person (DAP) test, in particular, offers a structured approach to evaluating mental maturity through human figure drawings.
In parallel, artificial intelligence (AI) has proven effective in automating a wide range of healthcare and cognitive assessments, enhancing diagnostic accuracy, optimizing patient care workflows, and enabling new modalities of treatment delivery. For instance, umbrella reviews have demonstrated that AI-enabled chatbots facilitate healthy lifestyle promotion, treatment adherence, and mental health screening across diverse populations. 4 Similarly, recent studies emphasize the transformative role of conversational AI systems such as ChatGPT in improving healthcare accessibility and communication, particularly in low- and middle-income countries, where they support remote consultation, patient education, and chronic disease monitoring. 5 Moreover, broader reviews highlight that AI systems, including machine learning, natural language processing, and computer vision, are increasingly utilized in medical imaging, virtual patient care, and administrative optimization, thereby improving both patient and clinician experiences. 6 Extending these capabilities to developmental and cognitive domains, AI has also shown potential in automating assessments such as facial expression and handwriting analysis. Integrating AI into the interpretation of children’s drawings can further reduce human error, expedite developmental screening, and provide accessible tools for parents, caregivers, and teachers. However, despite these advancements, the application of AI to children’s drawings remains underexplored, as most existing models are limited to binary outputs or academic prototypes rather than practical, user-oriented tools. 2
Recent evidence also underscores the growing role of mobile-based applications in the management of various diseases. Mobile health (mHealth) technologies have demonstrated effectiveness in improving self-management, medication adherence, and behavioral monitoring across diverse patient populations. For instance, mobile apps designed for people living with HIV offer functionalities such as personalized reminders, educational content, and motivational feedback to support self-care. 7 Similarly, mobile-based self-care platforms for individuals with type 2 diabetes include features like glucose tracking, dietary guidance, and exercise monitoring to encourage sustainable behavioral change. 8 Moreover, usability evaluations of mobile self-management tools for chronic disease management, including HIV, have shown that intuitive design and real-time feedback significantly enhance user engagement and health outcomes. 9 Collectively, these findings confirm the effectiveness of mobile health solutions and support their adaptation to developmental screening and cognitive assessment in children.
This original research article introduces Tebyan, a mobile application that uses artificial intelligence to estimate children’s developmental levels by predicting age from Draw-A-Person (DAP) drawings and comparing it with chronological age. The originality of this work lies in combining automated developmental screening with mobile deployment to enable accessible early identification of developmental concerns. Additionally, the study evaluates multiple deep learning architectures across binary, four-class, and eight-class classification schemes to capture developmental progression at different levels of granularity, while integrating explainable AI techniques to enhance transparency and interpretability.
Although our model predicts chronological age groups rather than clinical disability categories, this approach aligns with the diagnostic principle of the DAP test, in which developmental level is inferred by comparing drawing-derived age estimates with the child’s actual age. Therefore, discrepancies between predicted and actual age may serve as early indicators of developmental concerns, supporting professional assessment.
The study’s objectives are to: (i) move beyond binary classification by implementing multi-class age group prediction; (ii) compare the performance of CNN models (MobileNet, ResNet, EfficientNet); and (iii) deploy the best-performing model in a user-friendly mobile app for widespread accessibility.
By bridging psychological testing and mobile AI deployment, this work aims to offer an early screening solution that is accessible and clinically relevant.
Literature review
Psychological background (DAP and mental age)
Children’s drawings have long been used in psychology as nonverbal indicators of cognitive development, emotional functioning, and perceptual maturity. As highlighted by Vygotsky, young children often rely more on memory-based schemas than on direct observation when drawing, reflecting underlying cognitive processes related to memory, attention, and conceptual representation. 10 This link between drawing behavior and cognition established the foundation for numerous drawing-based assessments.
Several standardized psychological drawing tests have been developed to evaluate mental and emotional development. These include the Draw-A-Person (DAP) test, the Kinetic Family Drawing (KFD) test, and the House-Tree-Person (HTP) test, each offering insight into emotional expression, personality traits, and developmental maturity. Additional tools, such as the clock drawing test, are widely used to assess cognitive impairments, especially in older populations. 2
Originally introduced by Goodenough in 1926, the Draw-A-Man test was designed to estimate a child’s mental age based on the level of detail, structure, and proportionality in their drawing. Machover’s 1949 expansion of the test, renaming it the Draw-A-Person (DAP) test, broadened its applicability by incorporating both male and female figures. 11 The DAP test remains among the most widely used assessments for estimating psychological maturity in children and has been validated as a reliable indicator of cognitive development and psychometric intelligence. 2
Evaluation is based on the presence, detail, and proportionality of specific figure components, which are compared to normative developmental benchmarks. 12 The test’s expressive and nonthreatening nature encourages natural engagement from children, enhancing its diagnostic reliability. Multiple sources rank the DAP test among the top ten tools used in educational and clinical settings. 13 It is commonly used in early childhood education to monitor developmental progress and in psychological assessment to distinguish between neurotypical individuals and those with developmental or mental health conditions. 14
Computational approaches to drawing analysis
Artificial intelligence and machine learning have become central to the automated analysis of children’s drawings. Early work in this domain relied primarily on convolutional neural networks (CNNs), which learn hierarchical visual representations, such as edges, shapes, and spatial configurations, directly from raw pixel data. CNNs are particularly well suited for drawing analysis because they can capture structural elements (e.g., figure proportions, spatial layout, and object completeness) that correspond to developmental indicators.
Beyond CNNs, ensemble approaches combining multiple pretrained models (e.g., VGG, MobileNet, ResNet) have been used to improve robustness and generalization, especially when datasets are small or heterogeneous.
More recent approaches incorporate multimodal techniques, in which visual features are paired with textual descriptions or semantic embeddings produced by vision–language models. These approaches support higher-level psychological or emotional interpretation, extending beyond traditional age or development classification tasks.
Together, these computational methods form the technical foundation for the AI-based studies reviewed in the subsequent subsections.
Study design and search strategy
This review adopted a structured search strategy to identify prior work related to the analysis of children’s drawings and other drawing-based outputs for developmental or psychological assessment using artificial intelligence and machine learning. Searches were performed using keywords such as “children’s drawings,” “draw-a-person test,” “mental age estimation,” “drawing analysis,” “developmental assessment,” “deep learning,” and “fine motor skill evaluation.”
Data sources
Publications were sourced from diverse scholarly outlets, including journals, conferences, books, theses, and authoritative online platforms, with searches conducted across IEEE Xplore, SpringerLink, ScienceDirect, Scopus, PubMed, and other reputable resources.
Study selection
Studies were included if they (1) involved children’s drawings or drawing-related outputs such as figure sketches or fine motor movement traces, and (2) applied artificial intelligence, machine learning, or deep learning techniques relevant to developmental or psychological assessment. Studies unrelated to visual or drawing-based developmental analysis were excluded.
Data extraction
Information extracted from each study included publication year, dataset size, number of output classes, algorithms used, and accuracy.
Review of related work
This subsection reviews studies that have applied artificial intelligence and machine learning to analyze drawings for developmental, cognitive, or psychological assessment.
Beltzung et al. (2023) provided a comprehensive review of deep learning methods for analyzing drawing behavior, showing how convolutional and generative models can reveal perceptual and cognitive processes associated with human development. 15 Their work established deep learning as a powerful framework for studying mental and artistic growth, while emphasizing the challenge of interpretability.
Building upon this, Khlaif et al. (2025) designed an ensemble model integrating VGG16, VGG19, and MobileNet architectures to enhance developmental assessment. Trained on the Kids’ Hand Movement Dataset (KHMD), their model achieved 99% accuracy, demonstrating the robustness of ensemble techniques. 16
Recent research has also moved toward emotional and psychological interpretation. Shah et al. (2025) proposed a multimodal artificial intelligence framework based on a fine-tuned BLIP model integrated with a large language model. Their system generates descriptive analytical reports that capture artistic features and emotional themes in children’s drawings. 17
Earlier computational approaches to children’s drawing analysis predominantly used convolutional neural networks. Widiyanto et al. (2020) applied a CNN model to Draw-A-Person (DAP) images to estimate mental age and reported an accuracy of 72.08%. 18 Strikas et al. (2022) later introduced the MotoSkillsCNN model, which targeted fine motor development assessment; however, its accuracy declined as the number of developmental classes increased, reflecting the added classification complexity. 19
Research has also explored simplified age-prediction tasks. Polsley et al. (2022) applied neural networks to a binary age classification problem, distinguishing between younger children (ages 3–4) and older children (ages 5–8). 20 In a different domain, Simfukwe et al. (2023) used convolutional neural network regression for brain age estimation in adults, achieving an accuracy of 84%, although their study did not include children. 21
Pretrained deep learning models have also demonstrated strong performance. A 2021 study using ResNet50 on 1,051 children’s drawings achieved 89% accuracy when classifying whether children were above or below age 3. 2 Building on this, Tsimpiris and Varsamis (2023) evaluated EfficientNet-B0/B1, ResNet50, VGG16, and MobileNet on a larger dataset of 1,601 drawings across multi-class settings, observing the expected reduction in accuracy as class granularity increased. 22
Hybrid and feature-engineering approaches have likewise been investigated. Rakhmanov et al. (2020) compared BoVW + K-means, HOG + SVM, artificial neural networks, and CNNs on 1,000 student sketches, reporting accuracies between 32% and 62%. 23 Subsequent work introduced the Counting Key-Points (CKP) algorithm, which improved the classification accuracy to 65%. 24
Color-based methods have also been explored. Tolosana et al. (2022) demonstrated strong performance using SVM, Random Forest, and multilayer perceptron models on color-rich drawings, 25 though accuracy substantially decreased when these models were transferred to sketch-based datasets lacking chromatic information. 26
Summary of related studies focusing on mental and developmental age estimation from children’s drawings, highlighting dataset size, number of classes, applied algorithms, and reported accuracy.
Issues with previous work
A critical review of prior studies highlights several key limitations that present opportunities for further advancement.
Methodology
This study introduces an AI-driven framework designed to predict age-based developmental levels from children’s drawings using the Draw-A-Person (DAP) test, where differences between predicted and actual age may indicate potential developmental concerns. The proposed methodology comprises several stages, including data collection, preprocessing, classification strategy formulation, model training and evaluation, an explainability procedure, and integration into a mobile application. This structured approach enables the transformation of raw drawing inputs into automated age-group predictions that support early developmental screening within an accessible and user-friendly mobile environment.
Data collection
Efforts to collect new drawing samples encountered several challenges: many qualified evaluators required financial compensation, and others declined participation. Therefore, this study adopted a previously published and publicly available dataset that follows the Draw-A-Person (DAP) test framework, even though it exhibits class imbalance and limited diversity. The dataset, developed by Rakhmanov et al. (2020), 23 contains children’s drawings collected from private educational institutions in Nigeria. Participants were children aged 4-11 years enrolled in Nursery 1-2 and Primary 1-4. Ethical procedures were followed during data acquisition, with informed consent obtained from parents or guardians. Each child received a plain white A4 sheet and a pencil and was instructed to draw a human figure within 10–15 minutes without assistance. Drawing sessions were conducted during supervised classroom activities to minimize stress and external influences.
A total of 1,000 sketches were initially collected. After excluding incomplete or unclear samples, 951 images remained for analysis, with each child contributing one drawing. The data were categorized into eight chronological age groups (4–11 years) with the following distribution: 142 (age 4), 131 (age 5), 152 (age 6), 160 (age 7), 125 (age 8), 124 (age 9), 79 (age 10), and 38 (age 11) drawings.
For manual scoring, three trained evaluators, including a school counselor and two Ph.D. students, assessed each drawing using Goodenough’s 51-point rubric. 12 This rubric measures developmental maturity by evaluating the proportionality, presence, and arrangement of features such as the head, limbs, and facial components. The mean score across raters represented each child’s developmental maturity level as reported in the original dataset. In our study, however, these expert scores were not used to assign class labels; the classification categories were derived solely from each child’s documented chronological age. Chronological age was used as the ground truth because it provides an objective and consistently available reference for developmental comparison and aligns with the interpretive principle of the DAP test, in which drawing-derived developmental estimates are compared with a child’s actual age. Although expert-derived mental age scores based on the Goodenough rubric may enhance clinical precision, they require specialist evaluation and were not consistently available for model training. Using chronological age supports scalable model development and enables the creation of accessible screening tools. Sample drawings from the dataset are shown in Figure 1.
Sample images from the dataset.
Data preprocessing
To ensure that all input images were suitable for training deep learning models, a systematic preprocessing pipeline was applied.
Resizing
All drawings were uniformly resized to 256 × 512 pixels to standardize input dimensions and ensure consistent processing across the model architectures. 23
Normalization
Pixel intensity values were normalized to a fixed scale (e.g., [0,1] or [-1,1]) to stabilize the input distribution and improve model convergence during training.
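As an illustration of the two scaling options mentioned above, the following minimal sketch assumes 8-bit grayscale intensities (0–255); the function names and the sample values are hypothetical, and the source does not state which of the two ranges was ultimately used.

```python
def scale_to_unit(pixels):
    """Map 8-bit intensities (0-255) to the range [0, 1]."""
    return [p / 255.0 for p in pixels]


def scale_to_symmetric(pixels):
    """Map 8-bit intensities (0-255) to the range [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


row = [0, 128, 255]  # hypothetical grayscale pixel values
unit = scale_to_unit(row)
symmetric = scale_to_symmetric(row)
```

Both mappings are linear, so they preserve the relative contrast of pencil strokes against the white background.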
To reduce the risk of overfitting associated with training deep models on limited samples, several preventive strategies were applied:
Data augmentation
Extensive data augmentation techniques, including random rotation, flipping, and cropping, were applied to increase sample diversity and improve model generalization. 27 Figure 2 illustrates examples of the augmentation methods used in this study.
Various image augmentation techniques applied.
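To make the idea concrete, the sketch below applies two of the named transformations, flipping and random cropping, to a toy image stored as a list of rows. It is an illustration only: the helper names, the flip probability, and the crop size are assumptions, and random rotation is omitted because it requires pixel interpolation that is best left to an image library.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2D image (list of rows)."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, rng, flip_prob=0.5):
    """Apply an optional horizontal flip followed by a random crop."""
    if rng.random() < flip_prob:
        image = horizontal_flip(image)
    # Crop one pixel off each dimension (an arbitrary illustrative size).
    return random_crop(image, len(image) - 1, len(image[0]) - 1, rng)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 stand-in drawing
augmented = augment(toy, rng)
```

In practice each training epoch would see a differently transformed copy of every drawing, which is how augmentation increases sample diversity without new data.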
Early stopping
Early stopping was implemented to halt training when validation performance plateaued, preventing unnecessary weight updates.
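The stopping rule can be sketched as a patience counter over the validation loss. The patience value, the `min_delta` threshold, and the loss curve below are hypothetical; the source does not report the settings that were used.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1


losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.66, 0.64]  # hypothetical curve
stop_epoch = early_stopping_epoch(losses, patience=3)
```

With this curve, training halts at epoch index 5, three epochs after the best loss of 0.65, so the later improvement to 0.64 is never reached; this is the usual trade-off when choosing a small patience.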
Cross-validation
Ten-fold cross-validation was employed to obtain stable and reliable performance estimates across multiple dataset splits.
Collectively, these strategies reduced model variance, improved generalization, and ensured that the reported performance reflected meaningful learning rather than overfitting.
Classification strategy
To overcome the limited precision of traditional binary age-grouping approaches and enable a more fine-grained developmental analysis, the original dataset was reorganized into multiple classification schemes, including four-class and eight-class age-group configurations.
• Binary Classes:
– Class 1: Ages 4-7 (585 samples)
– Class 2: Ages 8-11 (366 samples)
• Four Classes:
– Class 1: 4-5 years (273 samples)
– Class 2: 6-7 years (312 samples)
– Class 3: 8-9 years (249 samples)
– Class 4: 10-11 years (117 samples)
• Eight Classes (Single-Year Groups):
– Class 1: Age 4 (142 samples)
– Class 2: Age 5 (131 samples)
– Class 3: Age 6 (152 samples)
– Class 4: Age 7 (160 samples)
– Class 5: Age 8 (125 samples)
– Class 6: Age 9 (124 samples)
– Class 7: Age 10 (79 samples)
– Class 8: Age 11 (38 samples)
This restructuring enabled a comparative evaluation across multiple granularity levels.
By merging classes, the dataset was simplified, allowing models to focus on broader age-related drawing patterns, which may improve predictive performance. This classification facilitates a clearer understanding of age-related drawing characteristics, supporting informed decisions about when further developmental evaluation may be warranted.
During the restructuring of the dataset into multiple class settings, we observed a noticeable imbalance among the eight age categories. Although data augmentation was applied to reduce this imbalance, the limited availability of new children’s drawings remained a significant constraint.
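The regrouping described in this section can be reproduced from the per-age counts alone. The sketch below is illustrative (the function names are hypothetical); it maps each chronological age to a class index and confirms the reported class sizes, along with the ratio between the largest and smallest single-year groups.

```python
# Per-age sample counts as reported for the 951 retained drawings.
AGE_COUNTS = {4: 142, 5: 131, 6: 152, 7: 160, 8: 125, 9: 124, 10: 79, 11: 38}


def class_index(age, n_classes):
    """Map a chronological age (4-11) to a class index for 2, 4, or 8 classes."""
    if n_classes not in (2, 4, 8):
        raise ValueError("n_classes must be 2, 4, or 8")
    if age not in AGE_COUNTS:
        raise ValueError("age must be between 4 and 11")
    years_per_class = 8 // n_classes
    return (age - 4) // years_per_class


def class_sizes(n_classes):
    """Sum the per-age counts into the chosen class configuration."""
    sizes = [0] * n_classes
    for age, count in AGE_COUNTS.items():
        sizes[class_index(age, n_classes)] += count
    return sizes


binary_sizes = class_sizes(2)   # [585, 366]
four_sizes = class_sizes(4)     # [273, 312, 249, 117]
imbalance_ratio = max(AGE_COUNTS.values()) / min(AGE_COUNTS.values())
```

The largest single-year group (age 7, 160 drawings) is roughly four times the size of the smallest (age 11, 38 drawings), which quantifies the imbalance noted above.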
Model training
Three convolutional neural network (CNN) architectures (MobileNet, ResNet, and EfficientNet) were selected due to their proven effectiveness in image classification and complementary architectural strengths. MobileNet provides a lightweight design suitable for mobile deployment, ResNet enables stable training and robust feature extraction through residual learning, and EfficientNet employs compound scaling to balance network depth, width, and resolution for improved accuracy. To examine developmental progression at varying levels of granularity, each architecture was trained using binary, four-class, and eight-class age group schemes, resulting in nine model configurations.
Each model was implemented using Keras and TensorFlow and trained with the Adam optimizer, a batch size of 32, a learning rate of 0.001, and early stopping to prevent overfitting. The dataset was first partitioned into 80% for model development and 20% as an independent hold-out test set. Ten-fold cross-validation was performed on the 80% development portion, and the final performance metrics were computed on the unseen 20% test set.
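The partitioning scheme can be sketched with index lists. This is a minimal illustration under stated assumptions: the seed, the rounding of the 20% share, and the use of a plain (non-stratified) shuffle are choices made here, since the source does not say whether the splits were stratified by age group.

```python
import random


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into development and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


dev, test = holdout_split(951)        # 951 drawings in the dataset
splits = list(k_fold(dev, k=10))      # ten (train, validation) pairs
```

Keeping the test indices out of every fold is what allows the final metrics to be reported on drawings the models never saw during cross-validation.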
MobileNet
MobileNet is a family of lightweight convolutional neural network (CNN) architectures specifically designed for efficient deployment on mobile and embedded devices. Its core innovation lies in the use of depthwise separable convolutions, which decompose standard convolutions into two operations: a depthwise convolution that applies a single filter per input channel, and a pointwise convolution that combines the outputs using 1 × 1 convolutions. 28 This significantly reduces computational complexity and model size.
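The saving can be seen by counting weights for a single layer. The sketch below ignores bias terms, and the layer dimensions (3 × 3 kernel, 32 input channels, 64 output channels) are hypothetical values chosen only for illustration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise k x k step plus a 1 x 1 pointwise step."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution mixing channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)      # 18432
separable = separable_conv_params(3, 32, 64)    # 2336
reduction = standard / separable
```

For this example layer the separable form needs roughly one eighth of the weights, which is the kind of reduction that makes on-device inference practical.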
The first version, MobileNetV1 (2017), introduced the baseline architecture for mobile-friendly models. MobileNetV2 (2018) enhanced the architecture by introducing inverted residual blocks with linear bottlenecks, allowing the network to expand low-dimensional feature maps efficiently and incorporate shortcut connections. 29 This design strategically removes nonlinearities in narrow layers to preserve information flow while minimizing memory usage.
MobileNetV3 (2019) further optimized performance through the integration of squeeze-and-excitation modules, which recalibrate channel-wise feature responses. It also introduced novel activation functions such as h-swish to improve accuracy with minimal additional cost. 30 These enhancements make MobileNetV3 highly suitable for tasks requiring a balance between speed, accuracy, and resource efficiency, making it an ideal fit for our mobile application context.
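For reference, h-swish is defined as x · ReLU6(x + 3) / 6, a piecewise-linear approximation of the swish function that avoids computing a sigmoid. A minimal sketch of the scalar form follows; the input values are arbitrary.

```python
def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def h_swish(x):
    """Hard swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


samples = [-4.0, -3.0, 0.0, 1.0, 3.0, 5.0]  # arbitrary inputs
outputs = [h_swish(v) for v in samples]
```

The function is zero for inputs at or below -3, equals the identity for inputs at or above 3, and blends smoothly between the two, using only comparisons, additions, and multiplications.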
ResNet
Residual Networks (ResNet) represent a major breakthrough in deep learning, particularly in training very deep CNNs. Introduced by He et al., ResNet addresses the vanishing gradient problem, a common issue in deep networks where gradients become too small for effective weight updates as layers increase. 2
ResNet’s key innovation is the introduction of residual blocks with skip connections, which allow the network to bypass one or more layers by directly connecting the input of a layer to its output. These shortcuts enable the model to learn residual functions (essentially, the difference between the input and output of a block) rather than attempting to learn the full transformation. 22 This architecture simplifies optimization and enables the training of networks with hundreds or even thousands of layers, resulting in improved generalization and performance across a wide range of visual recognition tasks.
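The skip connection amounts to computing F(x) + x. The toy sketch below operates on plain lists rather than tensors, and both residual branches are hypothetical stand-ins for learned layers.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """A residual branch that has learned nothing: F(x) = 0."""
    return [0.0 for _ in x]


def small_correction(x):
    """A hypothetical learned residual that nudges each feature by 10%."""
    return [0.1 * xi for xi in x]


features = [1.0, -2.0, 3.0]
identity_out = residual_block(features, zero_residual)
corrected_out = residual_block(features, small_correction)
```

When the residual branch outputs zero, the block reduces to the identity, so adding more blocks cannot make the representation worse by construction; the branch only has to learn a correction on top of its input.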
EfficientNet
EfficientNet is a scalable CNN architecture that achieves state-of-the-art performance while maintaining computational efficiency. Proposed by researchers at Google AI in 2019, EfficientNet introduced a compound scaling method that uniformly scales depth, width, and resolution using a set of fixed coefficients. 31 This is in contrast to traditional CNNs, which typically scale only one of these dimensions independently, often resulting in diminishing returns or increased computational cost.
The compound scaling approach ensures that as input image resolution increases, the network simultaneously deepens and widens in a balanced way, improving the receptive field and allowing for more complex feature extraction without incurring excessive computation. EfficientNet’s baseline network is built from the mobile inverted bottleneck (MBConv) blocks introduced with MobileNetV2, augmented with squeeze-and-excitation modules and the Swish activation function, contributing to its high accuracy and low parameter count.
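Compound scaling can be written as depth = α^φ, width = β^φ, and resolution = γ^φ, with the constraint α · β² · γ² ≈ 2 so that each unit increase in φ roughly doubles the computational cost. The sketch below uses the base coefficients reported in the EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15); the function names are hypothetical, and the released model variants apply additional rounding that is not reproduced here.

```python
# Base coefficients reported in the EfficientNet paper.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi):
    """Multipliers for depth, width, and resolution at compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_growth(phi):
    """Approximate FLOPs multiplier: (alpha * beta^2 * gamma^2) ** phi."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


baseline = compound_scale(0)   # all multipliers equal 1
scaled = compound_scale(2)     # a larger variant
cost = flops_growth(1)         # close to 2 by design
```

Width and resolution enter the cost quadratically, which is why their coefficients are squared in the constraint while depth appears linearly.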
EfficientNet’s adaptability makes it suitable for a wide range of resource-constrained applications, although its performance can vary depending on dataset size and task complexity.
Model evaluation
Following the cross-validation training process, model performance was assessed on the 20% hold-out subset using accuracy, precision, recall, F1-score, and specificity.
Evaluation metrics used in this study.
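These metrics can all be derived from a confusion matrix using a one-vs-rest view of each class. The sketch below is a minimal illustration; the 2 × 2 matrix is hypothetical, and sensitivity is used as a synonym for recall.

```python
def per_class_metrics(cm):
    """One-vs-rest metrics from a confusion matrix cm[true][predicted]."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[r][k] for r in range(n)) - tp
        tn = total - tp - fn - fp
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
        results.append({"sensitivity": sens, "specificity": spec,
                        "precision": prec, "f1": f1})
    return results


def macro_average(results, key):
    """Unweighted mean of one metric across classes."""
    return sum(r[key] for r in results) / len(results)


def accuracy(cm):
    """Share of samples on the diagonal of the confusion matrix."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(sum(row) for row in cm)


cm = [[89, 11], [31, 69]]  # hypothetical counts: rows = true, columns = predicted
metrics = per_class_metrics(cm)
macro_sensitivity = macro_average(metrics, "sensitivity")
overall_accuracy = accuracy(cm)
```

Macro-averaging weights every class equally regardless of its size, which matters here because the older age groups contain far fewer drawings than the younger ones.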
Explainability procedure
To identify the regions that most influenced the model’s predictions, we applied the Grad-CAM explainability method. Activation maps were extracted from the final convolutional block of each model to generate heatmaps highlighting the areas that contributed most to each decision. These heatmaps were then normalized and overlaid on the original drawings using a semi-transparent color map to visualize the model’s focus. The method was applied directly to the final trained model, and visualizations were produced for both correctly classified and misclassified samples from the held-out test set. This approach provides qualitative insight into the structural cues guiding model predictions and supports interpretability by verifying reliance on developmentally meaningful drawing features.
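The core Grad-CAM computation can be sketched independently of any deep learning framework: each feature map is weighted by the spatial average of its gradient, the weighted maps are summed, negative values are clipped, and the result is scaled to [0, 1]. In the sketch below, the activations and gradients are small hypothetical arrays; in a real pipeline they would be extracted from the final convolutional block, and the map would then be upsampled to the drawing's resolution before overlay.

```python
def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM map.

    activations[k][i][j]: feature map k of the last convolutional block.
    gradients[k][i][j]:   gradient of the class score w.r.t. that map.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of each gradient map.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum of feature maps, followed by ReLU.
    cam = [[max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
            for j in range(width)] for i in range(height)]
    # Scale to [0, 1] so the map can be overlaid as a heatmap.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Hypothetical 2-channel, 2x2 activations and gradients.
acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
```

The ReLU step keeps only regions that push the class score upward, which is why the resulting heatmap highlights evidence for the predicted age group rather than against it.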
Development of the Tebyan mobile application
The aim of this study is to create a mobile application that supports early screening for potential developmental differences in children by analyzing their human figure drawings. By leveraging deep learning, the application employs a CNN model for efficient image-based age-group prediction, allowing parents and caregivers to receive an estimate of the child’s age-based developmental level from their drawings. Figure 3 illustrates a high-level conceptual overview of the Tebyan application.
High-level conceptual overview of the Tebyan application.
Application Development Platform
We selected Android Studio as the integrated development environment (IDE) for Android app development due to its native platform support and comprehensive feature set. Its tight integration with the Android SDK ensures optimal performance across devices.
Android Studio’s rich design and layout tools enable rapid prototyping and refinement of the app’s user interface and user experience (UI/UX). In addition, it provides seamless integration with machine learning frameworks, facilitating the deployment of trained CNN models within the mobile environment. The availability of extensive documentation and community resources for integrating ML models into Android applications was a crucial factor in our selection. 32
Collaborative features such as built-in version control support further streamline team development. Figure 4 shows the prototype of the developed application.
Overview of the application prototype.
Drawing classification procedure and decision rules
The core functionality of the Tebyan application is the classification of children’s drawings to estimate age-based developmental levels and support early screening. The procedure is designed to be intuitive and user-friendly and follows these main steps.
Age entry
The user begins by entering the child’s actual (chronological) age, which serves as a reference for interpreting the model’s output.
Image submission
The user can either upload an existing drawing or capture a new one using the device’s built-in camera.
Model inference
Upon image submission, the embedded deep learning model is activated. It analyzes the drawing and predicts an age group (e.g., 4–5, 6–7, 8–9, or 10–11 years), representing the developmental level implied by the structural, proportional, and representational features of the human figure.
Result generation
The application compares the predicted age group with the child’s actual age and assigns the outcome to one of three descriptive categories shown to the user.
These categories are presented as heuristic screening indicators rather than validated diagnostic or clinical labels. They are derived from age-group discrepancies between the model’s predicted developmental estimate and the child’s chronological age, following interpretive principles of the Draw-A-Person (DAP) framework. Consequently, these classifications should be interpreted cautiously, as approximate developmental screening cues intended only to support decisions regarding whether further professional assessment may be warranted.
Reset functionality
Users may choose to clear previous inputs and classify a new drawing by selecting the “Refresh” button.
This classification logic is grounded in the developmental reasoning of the Draw-A-Person (DAP) test, a well-established psychological framework used to estimate cognitive maturity based on drawing features. Younger children (e.g., ages 4-5) typically produce simple outlines with limited detail, while older children (e.g., ages 10-11) tend to show greater proportional accuracy, more complete body representations, and richer detail. Accordingly, the app’s decision process mirrors how developmental psychologists interpret children’s drawings by relating the complexity and accuracy of visual representations to expected patterns for different age groups, and by highlighting cases where the predicted developmental level appears lower or higher than the child’s chronological age.
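The comparison step can be sketched as follows, assuming the four-class configuration. The category wording is a placeholder: the labels actually shown in the application are not given in the text, and the function names are hypothetical.

```python
AGE_GROUPS = [(4, 5), (6, 7), (8, 9), (10, 11)]  # four-class configuration


def group_of(age):
    """Return the index of the age group containing a chronological age."""
    for index, (low, high) in enumerate(AGE_GROUPS):
        if low <= age <= high:
            return index
    raise ValueError("age must be between 4 and 11")


def screening_category(predicted_group, actual_age):
    """Compare the predicted group with the group of the actual age.

    The three labels are placeholders, not the wording used in the app.
    """
    actual_group = group_of(actual_age)
    if predicted_group < actual_group:
        return "below expected age group"
    if predicted_group > actual_group:
        return "above expected age group"
    return "within expected age group"


# Hypothetical case: a 9-year-old whose drawing is assigned to the 6-7 group.
result = screening_category(predicted_group=1, actual_age=9)
```

Because the comparison is made at the level of age groups rather than single years, a child near a group boundary can shift categories with a one-year change in age, which is one reason the output is treated as a screening cue rather than a diagnosis.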
Results and discussion
Overall performance comparison
Macro-averaged performance of tested models.
As expected, classification performance decreased as the number of classes increased. Two-class models achieved strong and balanced results (≈80% accuracy), four-class models showed moderate performance (≈55–65%), and eight-class models performed lowest (≈30–40%).
Classification accuracy of MobileNet, ResNet, and EfficientNet across the 2, 4, and 8 class configurations.
Consistent with the observed accuracy trends, a detailed examination of the classification metrics further illustrates the impact of increasing class granularity. Across the binary classification tasks, all three networks achieved balanced sensitivity and specificity values of approximately 0.8, demonstrating reliable discrimination between broad developmental ability levels. In the four-class configuration, macro-sensitivity values declined to a range of 0.53–0.65, although specificity remained consistently high (0.85–0.88), indicating elevated difficulty in distinguishing intermediate developmental categories. Notably, MobileNet achieved the best balance between sensitivity (0.65) and specificity (0.88), suggesting that its depthwise-separable convolutions effectively capture both global and local drawing features relevant to developmental progression. Under the most complex eight-class setting, performance declined further, driven by class imbalance and the fine-grained visual similarities between adjacent age groups. In particular, the limited number of samples in older age categories (ages 10 and 11) reduced class representation, which may bias predictions toward more frequent age groups and affect model generalization and stability. In this scenario, average sensitivity ranged from 0.29 to 0.38 across models, while specificity again remained high (0.88–0.92). Overall, the results demonstrate that although classification accuracy decreases with increasing class granularity, the models consistently maintain high specificity and interpretable error patterns. This indicates that all three architectures learn a meaningful developmental hierarchy and avoid over-predicting impairment, which is an essential property for reliable screening applications.
Per-class performance analysis
Table 4. Per-class performance metrics for the two-class MobileNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).
Table 5. Per-class performance metrics for the four-class MobileNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).
Table 6. Per-class performance metrics for the eight-class MobileNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).
Table 7. Per-class performance metrics for the two-class ResNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).
Table 8. Per-class performance metrics for the four-class ResNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).
Table 9. Per-class performance metrics for the eight-class ResNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).
Table 10. Per-class performance metrics for the two-class EfficientNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).
Table 11. Per-class performance metrics for the four-class EfficientNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).
Table 12. Per-class performance metrics for the eight-class EfficientNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).
Two-class classification
MobileNet (2 classes). Confusion matrices for all architectures and class configurations are provided in Appendix A. The binary MobileNet model achieved 81.30% accuracy (95% CI [75.70, 86.80]) with balanced macro-sensitivity = 0.79 and macro-precision = 0.81. At the class level, the “4–5–6–7” group achieved sensitivity = 0.89 and specificity = 0.69, while “8–9–10–11” reached sensitivity = 0.69 and specificity = 0.89, as illustrated in Table 4. These results show that MobileNet captures developmental differences asymmetrically across age groups, achieving higher sensitivity for younger children (ages 4–7) and higher specificity for older children (ages 8–11), reflecting the distinct visual characteristics present in each group’s drawings.
ResNet (2 classes). The ResNet model attained 81.70% accuracy (95% CI [76.10, 87.30]) with both macro-sensitivity and specificity = 0.81. Errors were limited to the boundary between the two age-based categories. Class-wise, the “4–5–6–7” class yielded sensitivity = 0.86 and specificity = 0.76, while “8–9–10–11” produced sensitivity = 0.76 and specificity = 0.86, confirming balanced detection and generalization as illustrated in Table 7.
EfficientNet (2 classes). EfficientNet achieved 72.80% accuracy (95% CI [67.20, 78.40]) with macro-sensitivity = 0.71 and specificity = 0.80. Per-class results, presented in Table 10, show that “4–5–6–7” had sensitivity = 0.78 and specificity = 0.65, while “8–9–10–11” had sensitivity = 0.65 and specificity = 0.78, highlighting a balanced trade-off between recall and selectivity.
Four-class classification
MobileNet (4 classes). The four-class MobileNet model reached 64.60% accuracy (95% CI [59.00, 70.20]) with macro-sensitivity = 0.65 and specificity = 0.88. Most misclassifications occurred between neighboring groups, particularly “6–7” and “8–9,” as illustrated in Table 5. At the class level, the “4–5” group achieved sensitivity = 0.82 and specificity = 0.86, “10–11” reached 0.71 and 0.93, while “6–7” and “8–9” recorded lower sensitivities (0.62 and 0.46) but maintained high specificities.
ResNet (4 classes). The four-class ResNet model attained an accuracy of 56.80% (95% CI [51.20, 62.40]). The model’s macro-sensitivity and specificity were 0.56 and 0.85, respectively. Confusions were concentrated between contiguous categories such as “6–7” versus “8–9,” indicating the model’s awareness of the ordinal structure of the age-based categories. Class-wise analysis revealed that “4–5” achieved sensitivity = 0.73 and specificity = 0.87, whereas “10–11”, “8–9”, and “6–7” showed lower sensitivities (0.46, 0.45, and 0.41) but specificities above 0.84, as illustrated in Table 8. These patterns suggest a conservative classification approach that prioritizes precision.
EfficientNet (4 classes). EfficientNet attained 56.30% accuracy (95% CI [50.70, 61.90]) with macro-sensitivity = 0.53 and specificity = 0.87. Misclassifications mainly involved adjacent groups (“4–5” and “6–7,” “6–7” and “8–9”), reflecting natural overlap between adjacent age groups. Per-class results presented in Table 11 show that “4–5” had sensitivity = 0.64 and specificity = 0.86, while “6–7” and “8–9” achieved sensitivities of 0.55 and 0.48 with specificities above 0.85. Overall, the model favored high specificity and limited false positives, which suits screening contexts.
Eight-class classification
Eight-class classification posed the greatest challenge due to class imbalance and subtle visual similarities between adjacent categories, with the underrepresentation of older age groups further contributing to classification instability.
MobileNet (8 classes). The eight-class MobileNet model reached 38.40% accuracy (95% CI [32.80, 44.00]) with macro-sensitivity = 0.38 and specificity = 0.92. Errors were confined mostly to adjacent classes (“5”–“6” and “7”–“8”), while distant groups were rarely confused. As illustrated in Table 6, intermediate classes achieved sensitivities around 0.40, with specificities remaining high.
ResNet (8 classes). ResNet produced 32.60% accuracy (95% CI [27.00, 38.20]) with macro-sensitivity = 0.33 and specificity = 0.89. As in other configurations, misclassifications primarily occurred between contiguous categories, indicating consistent learning of ordinal age structure. All age groups showed comparable performance, with sensitivities generally in the 0.30–0.35 range and specificities above 0.85 across all classes, as shown in Table 9.
EfficientNet (8 classes). The EfficientNet model in the eight-class configuration achieved 28.70% accuracy (95% CI [23.10, 34.30]). The macro-sensitivity was 0.29 and specificity 0.88. Most errors appeared between adjacent age ranges, particularly “5” and “6.” All classes achieved modest sensitivities (0.26–0.31) with high specificities.
Architectural confusion patterns
To further interpret model behavior, we analyzed the most frequently confused class pairs across architectures. In the four-class configuration, all models showed the highest confusion between the 6–7 and 8–9 age groups. This pattern reflects the transitional developmental stage in which drawing features evolve gradually rather than abruptly.
In the eight-class configuration, misclassifications were predominantly confined to adjacent age ranges, particularly 5–6, 6–7, and 7–8. Confusions between non-adjacent classes were rare, indicating that the models preserved the ordinal structure of developmental progression.
Architecture-specific trends were modest. MobileNet exhibited slightly fewer confusions between distant classes, suggesting effective capture of localized structural features. ResNet and EfficientNet demonstrated similar boundary confusions between neighboring age groups, reflecting reliance on broader feature representations. However, the overall consistency of confusion patterns across architectures indicates that most errors are driven by intrinsic visual similarity and developmental continuity in children’s drawings rather than architecture-specific limitations.
These findings support the interpretation that classification ambiguity primarily arises from gradual developmental transitions in drawing maturity rather than model deficiencies.
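The ordinal character of these errors can be quantified directly from a confusion matrix. The sketch below is one possible way to do so, assuming classes are ordered by age and rows hold the true class; the helper names and the counts are hypothetical and do not reproduce the matrices in Appendix A.

```python
def adjacent_error_share(cm):
    """Fraction of all misclassifications that fall in a neighbouring class."""
    errors = adjacent = 0
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if i == j:
                continue  # correct predictions are not errors
            errors += count
            if abs(i - j) == 1:
                adjacent += count
    return adjacent / errors if errors else 0.0


def most_confused_pair(cm, labels):
    """Return the unordered class pair with the most mutual confusions."""
    best_pair, best_count = None, -1
    for i in range(len(cm)):
        for j in range(i + 1, len(cm)):
            mutual = cm[i][j] + cm[j][i]
            if mutual > best_count:
                best_pair, best_count = (labels[i], labels[j]), mutual
    return best_pair, best_count


# Illustrative four-class counts (hypothetical, for demonstration only).
labels = ["4-5", "6-7", "8-9", "10-11"]
cm = [
    [40, 8, 2, 0],
    [9, 30, 10, 1],
    [2, 12, 25, 9],
    [0, 2, 8, 34],
]
print(round(adjacent_error_share(cm), 3))
print(most_confused_pair(cm, labels))
```

A share close to 1 indicates that nearly all errors stay within one age group of the true label, which is the pattern described in this section.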
Comparative discussion
A consistent trend across all networks was that misclassifications almost exclusively occurred between adjacent age or ability groups, such as “6–7” versus “8–9.” There were virtually no cross-class confusions across distant categories, demonstrating that the models preserved the ordinal structure of developmental progression. This enhances model interpretability, as errors were semantically meaningful rather than random.
To quantitatively assess whether the performance differences among architectures were statistically significant, McNemar’s test was applied to paired predictions from the same test set (n = 192) for the binary configuration. The discordant counts were b = 8 and c = 10, yielding χ² = 0.055 and p = 0.81, indicating no significant difference (p > 0.05) between MobileNet and ResNet. These two models were selected for statistical comparison because they achieved the highest accuracy among all tested architectures and exhibited closely matched performance, making them the most appropriate candidates for a direct paired evaluation. For the multi-class settings (four and eight classes), the natural extension, Bowker’s test of symmetry, was considered, but small per-class sample sizes limited statistical power. Instead, we report 95% confidence intervals and macro-averaged sensitivity and specificity, which provide comparable evidence of relative consistency.
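The reported statistic is consistent with the continuity-corrected form of McNemar’s test, χ² = (|b − c| − 1)² / (b + c), evaluated against a chi-square distribution with one degree of freedom. The following minimal sketch reproduces the calculation from the discordant counts given above; the function name is illustrative, and the use of the continuity correction is inferred from the reported values rather than stated in the text.

```python
import math


def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar statistic and its p-value (1 d.o.f.).

    b and c are the discordant counts: cases one model classified
    correctly and the other did not, in each direction.
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, nothing to test
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # For one degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = mcnemar_corrected(8, 10)
print(round(chi2, 3), round(p_value, 2))
```

With b = 8 and c = 10 this gives χ² ≈ 0.056 and p ≈ 0.81, matching the reported values up to rounding.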
Architectural bias versus domain-level ambiguity
Although the four-class MobileNet model was selected based on its balanced performance and stability, the remaining misclassifications appear to be influenced by both architectural characteristics and the inherent continuity of developmental drawing progression. EfficientNet, which relies on compound scaling of depth, width, and resolution, may require larger and more balanced datasets to fully exploit its representational capacity; consequently, intermediate age groups with limited samples may be more difficult to distinguish. ResNet’s residual learning framework emphasizes global feature propagation, which can contribute to boundary errors between adjacent developmental stages where global structural similarity is high. In contrast, MobileNet’s depthwise separable convolutions emphasize localized structural features, enabling effective capture of key developmental cues such as limb articulation, proportionality, and facial detail.
However, the consistent pattern of confusion between neighboring age groups across all architectures suggests that classification ambiguity is primarily driven by the gradual and continuous nature of developmental drawing maturity rather than limitations in model capacity. These findings indicate that the observed errors reflect domain-level developmental overlap rather than architecture-specific deficiencies.
Comparison with related work
Compared to previous studies using the same dataset, our models demonstrate a clear performance improvement across all class configurations. In particular, the proposed MobileNet model achieved an accuracy of 64.60% for the four-class task, surpassing the 52% reported by prior work using a conventional CNN architecture. 23 Likewise, our eight-class MobileNet model achieved 38.40%, exceeding the 32% reported in the same study. 23 Although a study using private datasets reported higher results, these discrepancies are likely attributable to differences in dataset composition, preprocessing strategies, and the presence of more homogeneous or higher-quality samples in private collections.
For the publicly available dataset, both our study and the work in reference 23 used the same data source. Notably, our MobileNet-based and ResNet-based models outperformed the CNN baseline in both the four-class and eight-class classification tasks, with MobileNet increasing accuracy from 52% to 64.60% and from 32% to 38.40%, respectively. These gains suggest that MobileNet’s efficient convolutional blocks and balanced regularization provide stronger feature extraction even under class imbalance and fine-grained developmental distinctions.
When compared with studies that rely on private datasets, our results remain competitive despite operating under more challenging data conditions. In our experiments using a public dataset, MobileNet achieved 81.30% accuracy in the two-class setting, 64.60% in the four-class setting, and 38.40% in the eight-class setting, with ResNet and EfficientNet showing similar trends. By contrast, the study in reference 22 reported 69% accuracy for MobileNet, 68% for EfficientNet, and 55% for ResNet in binary classification, but did not explore multi-class extensions. Likewise, reference 2 achieved 89% accuracy for binary classification using ResNet on a private dataset, and reference 18 reported 72% using a CNN-based model. Where these studies report higher scores, the difference likely reflects the advantages of datasets collected under controlled conditions, which often exhibit reduced variability, more uniform drawing styles, and clearer visual features. Such curated datasets, common in studies using privately collected data, can naturally lead to higher model performance compared with publicly available datasets that may involve broader variability in participants, drawing settings, and visual quality.
The results demonstrate that the proposed framework generalizes effectively across varying levels of task complexity, marking a meaningful advancement in AI-assisted drawing analysis for developmental assessment. The improved performance observed in our four- and eight-class settings establishes a stronger baseline for future research and further reinforces the suitability of MobileNet as an efficient and well-regularized architecture for fine-grained developmental classification tasks.
Explainability and model interpretability
To better understand the model’s decision process, Grad-CAM was applied to representative samples from the MobileNet four-class model (Figure 6). This model was ultimately selected for integration into the mobile application, making it the most relevant configuration for interpretability analysis.
Figure 6. Grad-CAM visualizations highlighting the model’s attention to key structural regions in four-class classification.
The heatmaps consistently highlighted structural regions central to developmental scoring, including the head or face, trunk, and major limb segments. These areas correspond to features emphasized in drawing-based developmental assessments, such as facial detailing, proportionality, and limb articulation. The model also attended to global organizational cues, including symmetry and vertical body alignment. These highlighted regions are consistent with criteria used in traditional Draw-A-Person and Goodenough scoring frameworks, which evaluate developmental maturity based on the presence, proportion, and organization of human figure components.
In drawings from the older age group (8–11 years), activation occasionally extended to additional elements such as clothing details or articulated hands and feet, reflecting the increased structural complexity typical of older age groups. Misclassified samples generally involved adjacent developmental categories, and the associated heatmaps often revealed partially developed features resembling those of the neighboring class. This indicates that classification errors followed natural developmental continuity rather than random deviation.
Some visualizations showed mild activation in background regions, likely related to stroke density or page illumination. These were interpreted cautiously. Overall, the Grad-CAM results demonstrate that the model relied on meaningful structural cues, and that its occasional errors aligned with the gradual and continuous nature of children’s developmental progression. While these visual correspondences support interpretability, the system is intended as a screening aid and does not replace expert psychological assessment.
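For readers unfamiliar with how such heatmaps are formed, the core Grad-CAM step weights each feature map by the spatial average of its gradient, sums the weighted maps, and keeps only positive values. The sketch below illustrates that step on plain nested lists; it assumes the activations and gradients of a convolutional layer have already been extracted, and it makes no claim about the framework or layer used in this study. The name `grad_cam_heatmap` and the toy inputs are hypothetical.

```python
def grad_cam_heatmap(activations, gradients):
    """Combine K feature maps (each H x W) into one heatmap scaled to [0, 1].

    activations[k][i][j]: value of feature map k at position (i, j).
    gradients[k][i][j]:   gradient of the target class score with respect
                          to that activation.
    """
    num_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of each map's gradients.
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]
    # Weighted sum over maps, then ReLU so only positive evidence remains.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(num_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example with two 2 x 2 feature maps (hypothetical values).
toy_activations = [[[1, 0], [0, 2]], [[0, 3], [1, 0]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
print(grad_cam_heatmap(toy_activations, toy_gradients))
```

In practice the resulting low-resolution map is upsampled to the input size and overlaid on the drawing, which is what produces visualizations of the kind discussed in this section.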
Mobile application performance
To ensure effective deployment, the most suitable architecture from our experiments was selected for integration into the Tebyan mobile application. Although the two-class models achieved the highest accuracy (approximately 81%), they offered only a broad distinction between developmental levels. For a more informative and practically meaningful interpretation, we integrated the four-class MobileNet model, which achieved an accuracy of 64.60% with balanced macro-sensitivity and macro-specificity. This configuration provides clearer and more clinically useful separation between age groups while avoiding the instability and data scarcity issues observed in the eight-class setup. By offering more nuanced developmental differentiation, the four-class design enables Tebyan to deliver interpretable and actionable feedback for parents, educators, and healthcare professionals, supporting early observation and monitoring of children’s developmental patterns.
Beyond the model selection process, Tebyan introduces an important applied methodological contribution. To the best of our knowledge, it is the first mobile application to leverage CNN-based analysis of Draw-A-Person (DAP) test drawings to support the early observation of developmental patterns in children. While this study does not propose a new CNN architecture, its contribution lies in adapting and integrating established deep learning models into a practical, real-world screening framework. Tebyan bridges psychological assessment principles with real-time AI processing on mobile devices, transforming research-oriented models into an accessible tool for parents, educators, and clinicians. This demonstrates how existing CNN architectures can be effectively repurposed to support developmental evaluation tasks in everyday settings.
The application was evaluated using a set of previously unseen drawings from individuals aged 4 to 11 years. During testing, Tebyan demonstrated satisfactory performance in identifying and classifying different levels of developmental ability. Figure 7 presents sample outputs that demonstrate how the system highlights typical, delayed, and advanced developmental drawing features. These results highlight Tebyan’s potential as a lightweight, accessible, and effective tool for early cognitive screening through AI-assisted drawing analysis (Figure 8).
Figure 7. Example result for a case with age-appropriate developmental indicators.
Figure 8. Example result for a case with below-age developmental indicators.

Clinical and educational relevance
The four-class MobileNet model was selected as the primary screening model because it achieved the most clinically meaningful balance of sensitivity and specificity. Its ability to detect developmental differences while minimizing false alarms makes it particularly suitable for early assessment contexts, forming the foundation of Tebyan’s screening functionality. This performance pattern aligns with the intended role of the Tebyan system as a supportive tool designed to flag potential developmental concerns rather than to provide a definitive clinical diagnosis. Furthermore, the consistent misclassifications between adjacent categories indicate that the model captures natural developmental gradients, producing outputs that are both clinically plausible and educationally informative. Together, these findings support the feasibility of integrating AI-based drawing analysis into early childhood assessment workflows and highlight the potential of such systems to assist educators and clinicians in identifying children who may benefit from further evaluation (Figure 9).
Figure 9. Example result for a case with above-age developmental indicators.
Importantly, the application’s descriptive categories (e.g., Below Expected, Age-Appropriate, Above Expected) represent heuristic developmental screening approximations rather than formal psychometric, diagnostic, or clinically validated classifications.
It is important to emphasize that the Tebyan system does not generate diagnostic labels such as “mild,” “moderate,” or “severe” intellectual disability. Instead, the model estimates an age-based developmental level inferred from the visual structure of a child’s drawing. When the predicted developmental level differs substantially from the child’s chronological age, this discrepancy may serve as an early indicator of possible developmental delay, consistent with the interpretive principles of the Draw-A-Person (DAP) test. Because the available dataset does not include formal clinical diagnoses, the system is intended strictly as a preliminary screening instrument. Future work will explore the use of expert-annotated datasets to enable the mapping of drawing-based developmental estimates to clinically validated diagnostic categories.
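The comparison between predicted developmental level and chronological age can be illustrated with a short sketch. It assumes the four age bins used by the selected model and the descriptive labels mentioned above; the function names and the `tolerance` parameter are hypothetical, since the exact decision rule used in the application is not specified here.

```python
# Four-class age bins (years), ordered from youngest to oldest.
AGE_GROUPS = [(4, 5), (6, 7), (8, 9), (10, 11)]


def group_index(age):
    """Return the index of the age bin that contains the given age."""
    for index, (low, high) in enumerate(AGE_GROUPS):
        if low <= age <= high:
            return index
    raise ValueError("age is outside the 4-11 range covered by the model")


def screening_category(chronological_age, predicted_group, tolerance=0):
    """Map the gap between predicted and expected group to a descriptive label.

    predicted_group is the class index output by the model. tolerance is a
    hypothetical setting: the number of groups of disagreement that is
    still treated as age-appropriate.
    """
    gap = predicted_group - group_index(chronological_age)
    if gap < -tolerance:
        return "Below Expected"
    if gap > tolerance:
        return "Above Expected"
    return "Age-Appropriate"


print(screening_category(9, 1))               # drawing resembles the 6-7 group
print(screening_category(9, 1, tolerance=1))  # one-group gap tolerated
```

Because most model errors fall between adjacent age groups, a rule of this kind is sensitive to how large a gap is treated as substantial; the tolerance shown here is only one way to express that choice.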
Accordingly, Tebyan should not be used as a standalone diagnostic system or as a substitute for formal psychological, developmental, or medical assessment. Rather, it functions as a supportive AI-based screening aid intended to help caregivers, educators, and clinicians identify children who may benefit from comprehensive professional evaluation.
User interface and experience
In addition to its promising classification performance, the Tebyan application incorporates a high level of usability through deliberate design choices. The interface was created with clarity and simplicity in mind, offering intuitive navigation tailored for children and caregivers. A calming purple-and-white color scheme was adopted to promote a visually appealing and child-friendly environment.
The Android-based application delivers a smooth and structured user journey across its core functionalities.
Splash and welcome pages
Upon launching the app, users are greeted with a splash screen (Figure 10), followed by a welcome page (Figure 11). The welcome screen includes a simple call-to-action (“Click Here”) that leads users to the home page, streamlining the entry process.
Figure 10. Application splash screen.
Figure 11. Application welcome screen.

Drawing analysis
The core functionality of the application centers on drawing analysis (Figure 12). Users are prompted to input a child’s drawing along with the child’s age. The embedded four-class MobileNet model then processes the image to estimate the child’s developmental level based on DAP drawing features.
Figure 12. Application home screen with drawing analysis interface.
Mental games module
To support developmental growth, the app includes a dedicated section for mental games. These interactive exercises were selected to stimulate attention, memory, and basic problem-solving skills, providing supplementary developmental support for children.
In summary, Tebyan provides a fast, accessible, and engaging solution for early cognitive screening. Its clear instructions, visually guided flow, and integration of both screening and supportive developmental tools make it a practical resource for parents, teachers, and professionals involved in child assessment.
Conclusion
This study presents Tebyan, a mobile application designed to support the early observation of developmental patterns in children through AI-driven analysis of their drawings. Leveraging the Draw-A-Person (DAP) test and CNN-based classification models, we evaluated MobileNet, ResNet, and EfficientNet across binary, four-class, and eight-class configurations. The four-class MobileNet model (64.60% accuracy) was selected for deployment in the app because it offered the best compromise between performance and meaningful age-group separation.
Tebyan provides an accessible tool that compares the model’s predicted age group with the child’s actual age to highlight potential developmental patterns. While not intended as a clinical diagnostic system, Tebyan functions strictly as an AI-supported developmental screening aid designed to assist caregivers, educators, and healthcare professionals in identifying developmental patterns that may warrant further expert evaluation. Its outputs should not replace formal psychological or medical assessment.
Limitations and future work
While this study demonstrates the potential of artificial intelligence in estimating developmental levels from children’s drawings, several limitations remain. The dataset used in this work was obtained from a previously published study that collected DAP-based drawings from a single private educational institution in Nigeria. Although we initially pursued the collection of new data, this effort could not be completed because expert participation was not secured. As a result, this study relied on publicly available data, which ensured methodological consistency but restricted demographic diversity and reduced generalizability.
A second limitation concerns the lack of detailed inter-rater information in the released dataset. Although the original study involved scoring by three trained experts using Goodenough’s 51-point rubric, the publicly available dataset reports only the aggregated score for each drawing. This prevented us from computing inter-rater reliability metrics such as the Intraclass Correlation Coefficient (ICC) or weighted Cohen’s κ, which would have provided insight into rater agreement and potential label variability. Future data collection efforts will ensure that raw expert scoring is preserved to enable these important psychometric analyses. Future work will also explore the integration of expert-scored and clinically validated developmental assessments to further align AI-based predictions with standardized psychological evaluation frameworks.
In addition, because the current dataset does not include formal clinical diagnostic labels or direct benchmarking against licensed developmental specialists, the present system cannot be considered clinically validated for diagnostic use. Future research should therefore prioritize clinically annotated datasets, inter-rater agreement preservation, and prospective validation studies involving psychologists, pediatric specialists, or developmental clinicians to establish stronger psychometric reliability and real-world clinical applicability.
The dataset also exhibits a noticeable class imbalance across age categories, particularly in the eight-class configuration. While data augmentation helped mitigate this issue, limited sample availability remains a constraint. Future studies will incorporate larger and more balanced datasets obtained through collaboration with multiple educational and clinical institutions.
In addition, future enhancements to the Tebyan application will include adaptive feedback mechanisms, improved visualization of developmental progress, and longitudinal monitoring features. These additions aim to support continuous developmental assessment and provide personalized recommendations for cognitive growth. Ultimately, expanding dataset diversity and integrating more advanced modeling and interpretability techniques will further strengthen Tebyan’s utility as a scalable, clinically aligned screening tool.
Footnotes
Ethical considerations
This study uses a publicly available dataset that was ethically approved in the original work by Rakhmanov et al. (2020). No new human data were collected by the authors. The dataset is fully anonymized, and all procedures adhered to standard ethical research guidelines.
Author contributions
Wedad M. Alawad: Study conceptualization, supervision, manuscript writing and revision, and model development oversight.
Areen Alquayid: Data preprocessing, model implementation, and result generation.
Sara Alghofaily: Literature review, application design, and manuscript drafting.
Shada Alharbi: Experimental setup, model evaluation, and preparation of figures and tables.
Deem Alorainy: Mobile application development and testing.
Funding
The researchers would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University (www.qu.edu.sa) for financial support (QU-APC-2026).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
