Tebyan: An AI-powered system for estimating developmental levels from children’s human figure drawings

Abstract

Objective

To develop and evaluate an AI-powered mobile application, Tebyan, to estimate children’s developmental levels from Draw-A-Person (DAP) test drawings and support the early identification of developmental concerns.

Methods

The system predicts drawing-based developmental age and compares it with the child’s chronological age, where an age gap may indicate the need for further evaluation. Nine deep learning models were created using MobileNet, ResNet, and EfficientNet architectures across binary, four-class, and eight-class configurations.

Results

Across all nine model configurations, performance decreased as class granularity increased. The two-class models achieved strong and balanced results (≈80% accuracy), four-class models showed moderate performance (≈55–65%), and eight-class models performed lowest (≈30–40%). Macro-averaged sensitivity, specificity, precision, and F1-scores were reported with 95% confidence intervals. Based on balance and stability, the four-class MobileNet model was selected for integration into the Tebyan application, supporting a more precise evaluation of developmental progression from the drawings.

Conclusion

Tebyan provides an AI-based approach for estimating developmental levels from children’s drawings by comparing the model’s predicted age group with the child’s actual age. While not a diagnostic tool, the system offers a supportive resource that may help caregivers and educators identify developmental patterns that warrant further attention.

Keywords

artificial intelligence (AI), deep learning, computer vision, draw-a-person test, developmental screening, image classification, intellectual disability, mobile health (mHealth)

Introduction

Developmental delays affect approximately 10–15% of preschool-aged children worldwide, often manifesting as slower attainment of milestones compared to peers.¹ These delays, classified as mild, moderate, or severe based on the functional-to-chronological age ratio, can have long-term impacts on academic and social development.² Timely detection is crucial for enabling early interventions that can mitigate these outcomes and enhance children’s quality of life.

Traditional assessments rely on expert observation, but such evaluations may be inaccessible or subjective. Children’s drawings have long been used in psychological assessments to infer cognitive and emotional development.³ The Draw-A-Person (DAP) test, in particular, offers a structured approach to evaluating mental maturity through human figure drawings.

In parallel, artificial intelligence (AI) has proven effective in automating a wide range of healthcare and cognitive assessments, enhancing diagnostic accuracy, optimizing patient care workflows, and enabling new modalities of treatment delivery. For instance, umbrella reviews have demonstrated that AI-enabled chatbots facilitate healthy lifestyle promotion, treatment adherence, and mental health screening across diverse populations.⁴ Similarly, recent studies emphasize the transformative role of conversational AI systems such as ChatGPT in improving healthcare accessibility and communication, particularly in low- and middle-income countries, where they support remote consultation, patient education, and chronic disease monitoring.⁵ Moreover, broader reviews highlight that AI systems, including machine learning, natural language processing, and computer vision, are increasingly utilized in medical imaging, virtual patient care, and administrative optimization, thereby improving both patient and clinician experiences.⁶ Extending these capabilities to developmental and cognitive domains, AI has also shown potential in automating assessments such as facial expression and handwriting analysis. Integrating AI into the interpretation of children’s drawings could further reduce human error, expedite developmental screening, and provide accessible tools for parents, caregivers, and teachers. However, despite these advancements, the application of AI to children’s drawings remains underexplored, as most existing models are limited to binary outputs or academic prototypes rather than practical, user-oriented tools.²

Recent evidence also underscores the growing role of mobile-based applications in the management of various diseases. Mobile health (mHealth) technologies have demonstrated effectiveness in improving self-management, medication adherence, and behavioral monitoring across diverse patient populations. For instance, mobile apps designed for people living with HIV offer functionalities such as personalized reminders, educational content, and motivational feedback to support self-care.⁷ Similarly, mobile-based self-care platforms for individuals with type 2 diabetes include features like glucose tracking, dietary guidance, and exercise monitoring to encourage sustainable behavioral change.⁸ Moreover, usability evaluations of mobile self-management tools for chronic disease management, including HIV, have shown that intuitive design and real-time feedback significantly enhance user engagement and health outcomes.⁹ Collectively, these findings confirm the effectiveness of mobile health solutions and support their adaptation to developmental screening and cognitive assessment in children.

This original research article introduces Tebyan, a mobile application that uses artificial intelligence to estimate children’s developmental levels by predicting age from Draw-A-Person (DAP) drawings and comparing it with chronological age. The originality of this work lies in combining automated developmental screening with mobile deployment to enable accessible early identification of developmental concerns. Additionally, the study evaluates multiple deep learning architectures across binary, four-class, and eight-class classification schemes to capture developmental progression at different levels of granularity, while integrating explainable AI techniques to enhance transparency and interpretability.

Although our model predicts chronological age groups rather than clinical disability categories, this approach aligns with the diagnostic principle of the DAP test, in which developmental level is inferred by comparing drawing-derived age estimates with the child’s actual age. Therefore, discrepancies between predicted and actual age may serve as early indicators of developmental concerns, supporting professional assessment.

The study’s objectives are to:

(i) move beyond binary classification by implementing multi-class age group prediction,

(ii) compare the performance of CNN models (MobileNet, ResNet, EfficientNet), and

(iii) deploy the best-performing model in a user-friendly mobile app for widespread accessibility.

By bridging psychological testing and mobile AI deployment, this work aims to offer an early screening solution that is accessible and clinically relevant.

Literature review

Psychological background (DAP and mental age)

Children’s drawings have long been used in psychology as nonverbal indicators of cognitive development, emotional functioning, and perceptual maturity. As highlighted by Vygotsky, young children often rely more on memory-based schemas than on direct observation when drawing, reflecting underlying cognitive processes related to memory, attention, and conceptual representation.¹⁰ This link between drawing behavior and cognition established the foundation for numerous drawing-based assessments.

Several standardized psychological drawing tests have been developed to evaluate mental and emotional development. These include the Draw-A-Person (DAP) test, the Kinetic Family Drawing (KFD) test, and the House-Tree-Person (HTP) test, each offering insight into emotional expression, personality traits, and developmental maturity. Additional tools, such as the clock drawing test, are widely used to assess cognitive impairments, especially in older populations.²

Originally introduced by Goodenough in 1926, the Draw-A-Man test was designed to estimate a child’s mental age based on the level of detail, structure, and proportionality in their drawing. Machover’s 1949 expansion of the test, which renamed it the Draw-A-Person (DAP) test, broadened its applicability by incorporating both male and female figures.¹¹ The DAP test remains among the most widely used assessments for estimating psychological maturity in children and has been validated as a reliable indicator of cognitive development and psychometric intelligence.²

Evaluation is based on the presence, detail, and proportionality of specific figure components, which are compared to normative developmental benchmarks.¹² The test’s expressive and nonthreatening nature encourages natural engagement from children, enhancing its diagnostic reliability. Multiple sources rank the DAP test among the top ten tools used in educational and clinical settings.¹³ It is commonly used in early childhood education to monitor developmental progress and in psychological assessment to distinguish between neurotypical individuals and those with developmental or mental health conditions.¹⁴

Computational approaches to drawing analysis

Artificial intelligence and machine learning have become central to the automated analysis of children’s drawings. Early work in this domain relied primarily on convolutional neural networks (CNNs), which learn hierarchical visual representations, such as edges, shapes, and spatial configurations, directly from raw pixel data. CNNs are particularly well suited for drawing analysis because they can capture structural elements (e.g., figure proportions, spatial layout, and object completeness) that correspond to developmental indicators.

Beyond CNNs, ensemble approaches combining multiple pretrained models (e.g., VGG, MobileNet, ResNet) have been used to improve robustness and generalization, especially when datasets are small or heterogeneous.

More recent approaches incorporate multimodal techniques, in which visual features are paired with textual descriptions or semantic embeddings produced by vision–language models. These approaches support higher-level psychological or emotional interpretation, extending beyond traditional age or development classification tasks.

Together, these computational methods form the technical foundation for the AI-based studies reviewed in the subsequent subsections.

Study design and search strategy

This review adopted a structured search strategy to identify prior work related to the analysis of children’s drawings and other drawing-based outputs for developmental or psychological assessment using artificial intelligence and machine learning. Searches were performed using keywords such as “children’s drawings,” “draw-a-person test,” “mental age estimation,” “drawing analysis,” “developmental assessment,” “deep learning,” and “fine motor skill evaluation.”

Data sources

Publications were sourced from diverse scholarly outlets, including journals, conferences, books, theses, and authoritative online platforms, with searches conducted across IEEE Xplore, SpringerLink, ScienceDirect, Scopus, PubMed, and other reputable resources.

Study selection

Studies were included if they (1) involved children’s drawings or drawing-related outputs such as figure sketches or fine motor movement traces, and (2) applied artificial intelligence, machine learning, or deep learning techniques relevant to developmental or psychological assessment. Studies unrelated to visual or drawing-based developmental analysis were excluded.

Data extraction

Information extracted from each study included publication year, dataset size, number of output classes, algorithms used, and accuracy.

Review of related work

This subsection reviews studies that have applied artificial intelligence and machine learning to analyze drawings for developmental, cognitive, or psychological assessment.

Beltzung et al. (2023) provided a comprehensive review of deep learning methods for analyzing drawing behavior, showing how convolutional and generative models can reveal perceptual and cognitive processes associated with human development.¹⁵ Their work established deep learning as a powerful framework for studying mental and artistic growth, while emphasizing the challenge of interpretability.

Building upon this, Khlaif et al. (2025) designed an ensemble model integrating VGG16, VGG19, and MobileNet architectures to enhance developmental assessment. Trained on the Kids’ Hand Movement Dataset (KHMD), their model achieved 99% accuracy, demonstrating the robustness of ensemble techniques.¹⁶

Recent research has also moved toward emotional and psychological interpretation. Shah et al. (2025) proposed a multimodal artificial intelligence framework based on a fine-tuned BLIP model integrated with a large language model. Their system generates descriptive analytical reports that capture artistic features and emotional themes in children’s drawings.¹⁷

Earlier computational approaches to children’s drawing analysis predominantly used convolutional neural networks. Widiyanto et al. (2020) applied a CNN model to Draw-A-Person (DAP) images to estimate mental age and reported an accuracy of 72.08%.¹⁸ Strikas et al. (2022) later introduced the MotoSkillsCNN model, which targeted fine motor development assessment; however, its accuracy declined as the number of developmental classes increased, reflecting the added classification complexity.¹⁹

Research has also explored simplified age-prediction tasks. Polsley et al. (2022) applied neural networks to a binary age classification problem, distinguishing between younger children (ages 3–4) and older children (ages 5–8).²⁰ In a different domain, Simfukwe et al. (2023) used convolutional neural network regression for brain age estimation in adults, achieving an accuracy of 84%, although their study did not include children.²¹

Pretrained deep learning models have also demonstrated strong performance. A 2021 study using ResNet50 on 1,051 children’s drawings achieved 89% accuracy when classifying whether children were above or below age 3.² Building on this, Tsimpiris and Varsamis (2023) evaluated EfficientNet-B0/B1, ResNet50, VGG16, and MobileNet on a larger dataset of 1,601 drawings across multi-class settings, observing the expected reduction in accuracy as class granularity increased.²²

Hybrid and feature-engineering approaches have likewise been investigated. Rakhmanov et al. (2020) compared BoVW + K-means, HOG + SVM, HOG + artificial neural networks, and CNNs on 1,000 student sketches, reporting two-class accuracies between 32% and 62%.²³ Subsequent work introduced the Counting Key-Points (CKP) algorithm, which improved the two-class accuracy to 65%.²⁴

Color-based methods have also been explored. Tolosana et al. (2022) demonstrated strong performance using SVM, Random Forest, and multilayer perceptron models on color-rich drawings,²⁵ though accuracy substantially decreased when these models were transferred to sketch-based datasets lacking chromatic information.²⁶

Table 1 provides a complete reverse chronological summary.

Table 1.

Summary of related studies focusing on mental and developmental age estimation from children’s drawings, highlighting dataset size, number of classes, applied algorithms, and reported accuracy.

Ref.	Publication date	Dataset size	Classes	Algorithms	Accuracy
¹⁶	2025	500	2	VGG16, VGG19, MobileNet	VGG16: 97.78%
					VGG19: 94.44%
					MobileNet: 98.89%
¹⁷	2024	5000	NA	BLIP multimodal model, LLM, LoRA, Fine-tuning	Not reported
²²	2023	1601	2, 3, 6	EfficientNet-B0, EfficientNet-B1, ResNet, VGG16, MobileNet	EfficientNet-B0: - 2 classes: 68.94% - 3 classes: 38.12% - 6 classes: 22.92%
					EfficientNet-B1: - 2 classes: 49.07% - 3 classes: 44.79% - 6 classes: 25.42%
					MobileNet: - 2 classes: 69.98% - 3 classes: 45.21% - 6 classes: 31.25%
					ResNet: - 2 classes: 55.69% - 3 classes: 53.33% - 6 classes: 30.00%
					VGG16: - 2 classes: 69.98% - 3 classes: 50.63% - 6 classes: 30.21%
²¹	2023	1970	NA	CNN	84%
²⁰	2022	551	NA	Neural Networks	Curve: 75.0% Corner: 72.3%
²⁵	2022	-	3	Naive bayes, Logistic regression, K-NN, Random forest, AdaBoost, SVM, and MLP	Naive bayes: - FDR: 69.63 - SFFS: 78.09 - GA: 77.86
					Logistic regression: - FDR: 73.99 - SFFS: 82.22 - GA: 81.30
					K-NN: - FDR: 71.24 - SFFS: 81.98 - GA: 77.86
					Random forest: - FDR: 75.56 - SFFS: 88.69 - GA: 80.37
					AdaBoost: - FDR: 68.27 - SFFS: 76.28 - GA: 73.98
					SVM: - FDR: 75.58 - SFFS: 90.45 - GA: 81.51
					MLP: - FDR: 76.72 - SFFS: 85.98 - GA: 81.76
²⁶	2021	628	2	Random forest, Bayes Net, Naive bayes, MultiPerceptron, SMO, Random tree, LMTTree	Random forest:
					- Curvature Acc: 85.7% - Corner Acc: 80.6%
					Bayes Net:
					- Curvature Acc: 75.2% - Corner Acc: 77.1%
					Naive bayes:
					- Curvature Acc: 62.2% - Corner Acc: 73.3%
					MultiPerceptron:
					- Curvature Acc: 82.3% - Corner Acc: 76.1%
					SMO:
					- Curvature Acc: 76.2% - Corner Acc: 77.6%
					Random tree:
					- Curvature Acc: 76.6% - Corner Acc: 68.2%
					LMTTree:
					- Curvature Acc: 79.0% - Corner Acc: 79.5%
¹⁹	2021	884	6	CNN	37%
²	2021	1051	2	CNN, ResNet50	89%
¹⁸	2020	200	4	CNN	200 drawings: 58.75%
		250			250 drawings: 61.50%
		300			300 drawings: 72.08%
²³	2020	1000	2,8	BoVW + K-means, SVM + HOG, ANN + HOG, CNN	8 classes: SVM+HOG=22%
					2 classes:SVM+HOG=48%
					8 classes:ANN+HOG=17%
					2 classes:ANN+HOG=32%
					8 classes:BoVW+K-means=54%
					2 classes:BoVW+K-means=62%
					8 classes:CNN=32%
					2 classes:CNN=52%
²⁴	2020	1000	2,8	CKP	8 classes:CKP=57%
					2 classes:CKP=65%

Issues with previous work

A critical review of prior studies highlights several key limitations that present opportunities for further advancement.

• Limited Dataset Availability: Prior work on children’s drawings often suffers from small datasets, many containing fewer than 1,000 samples, which restricts model generalization and limits the ability to train deep learning architectures effectively. Ethical and privacy constraints further reduce public dataset availability, making reproducibility difficult and slowing progress in this domain.

• Restricted Classification Precision: Most existing studies rely on binary age grouping (e.g., younger vs. older children), which does not capture the nuanced developmental differences across childhood. This coarse categorization reduces the interpretability of outcomes. Our study addresses this limitation by extending the classification framework to four and eight discrete age groups, enabling more fine-grained developmental estimation from drawings.

• Limited Application of AI Techniques: Although the integration of AI in cognitive and developmental assessment is gaining attention, its application to estimating age from children’s drawings remains underexplored. Several studies have introduced AI-based models, but few have examined modern deep learning architectures or multi-class age-group prediction. Our study addresses this gap by applying current deep learning architectures to multi-class age-group prediction.

• Lack of Mobile Application Implementation: Despite promising model-level results in the literature, prior studies have not translated these methods into accessible mobile tools. No existing application integrates AI-based analysis of children’s drawings for early developmental screening. Our work bridges this gap by deploying the best-performing model in a mobile app designed to support early identification of developmental concerns. Prior studies have emphasized the value of such tools for assessing fine motor or cognitive skills,²⁶ and have highlighted the need for interpretable, user-friendly interfaces.²

Methodology

This study introduces an AI-driven framework designed to predict age-based developmental levels from children’s drawings using the Draw-A-Person (DAP) test, where differences between predicted and actual age may indicate potential developmental concerns. The proposed methodology comprises several stages, including data collection, preprocessing, classification strategy formulation, model training and evaluation, an explainability procedure, and integration into a mobile application. This structured approach enables the transformation of raw drawing inputs into automated age-group predictions that support early developmental screening within an accessible and user-friendly mobile environment.

Data collection

Efforts to collect new drawing samples encountered several challenges: many qualified evaluators required financial compensation, and others declined participation. Therefore, this study adopted a previously published and publicly available dataset that follows the Draw-A-Person (DAP) test framework, even though it exhibits class imbalance and limited diversity. The dataset, developed by Rakhmanov et al. (2020),²³ contains children’s drawings collected from private educational institutions in Nigeria. Participants were children aged 4-11 years enrolled in Nursery 1-2 and Primary 1-4. Ethical procedures were followed during data acquisition, with informed consent obtained from parents or guardians. Each child received a plain white A4 sheet and a pencil and was instructed to draw a human figure within 10–15 minutes without assistance. Drawing sessions were conducted during supervised classroom activities to minimize stress and external influences.

A total of 1,000 sketches were initially collected. After excluding incomplete or unclear samples, 951 images remained for analysis, with each child contributing one drawing. The data were categorized into eight chronological age groups (4–11 years) with the following distribution: 142 (age 4), 131 (age 5), 152 (age 6), 160 (age 7), 125 (age 8), 124 (age 9), 79 (age 10), and 38 (age 11) drawings.

For manual scoring, three trained evaluators, including a school counselor and two Ph.D. students, assessed each drawing using Goodenough’s 51-point rubric.¹² This rubric measures developmental maturity by evaluating the proportionality, presence, and arrangement of features such as the head, limbs, and facial components. The mean score across raters represented each child’s developmental maturity level as reported in the original dataset. In our study, however, these expert scores were not used to assign class labels; the classification categories were derived solely from each child’s documented chronological age. Chronological age was used as the ground truth because it provides an objective and consistently available reference for developmental comparison and aligns with the interpretive principle of the DAP test, in which drawing-derived developmental estimates are compared with a child’s actual age. Although expert-derived mental age scores based on the Goodenough rubric may enhance clinical precision, they require specialist evaluation and were not consistently available for model training. Using chronological age supports scalable model development and enables the creation of accessible screening tools. Sample drawings from the dataset are shown in Figure 1.

Figure 1.

Sample images from the dataset.

Data preprocessing

To ensure that all input images were suitable for training deep learning models, a systematic preprocessing pipeline was applied.

Resizing

All drawings were uniformly resized to 256 × 512 pixels to standardize input dimensions and ensure consistent processing across the model architectures.²³

Normalization

Pixel intensity values were normalized to a fixed scale (e.g., [0,1] or [-1,1]) to stabilize the input distribution and improve model convergence during training.
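As a minimal sketch of the two scaling conventions mentioned above, assuming 8-bit grayscale input: the text does not state which range was actually used, and the function names and the sample page below are illustrative only.

```python
import numpy as np

def scale_to_unit(image_u8: np.ndarray) -> np.ndarray:
    """Map 8-bit pixel values in [0, 255] to floats in [0, 1]."""
    return image_u8.astype(np.float32) / 255.0

def scale_to_signed(image_u8: np.ndarray) -> np.ndarray:
    """Map 8-bit pixel values in [0, 255] to floats in [-1, 1]."""
    return image_u8.astype(np.float32) / 127.5 - 1.0

# Hypothetical input: a white page with one dark pencil stroke.
# The (256, 512) axis order is an assumption; the text only gives "256 x 512".
page = np.full((256, 512), 255, dtype=np.uint8)
page[100:110, 200:300] = 0

unit = scale_to_unit(page)
signed = scale_to_signed(page)
```

Pretrained backbones often expect one specific range, so whichever convention is chosen at training time would also have to be applied on the device at inference time.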

To reduce the risk of overfitting associated with training deep models on limited samples, several preventive strategies were applied:

Data augmentation

Extensive data augmentation techniques, including random rotation, flipping, and cropping, were applied to increase sample diversity and improve model generalization.²⁷ Figure 2 illustrates examples of the augmentation methods used in this study.

Figure 2.

Various image augmentation techniques applied.
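Two of the augmentations named above, flipping and cropping, can be sketched with NumPy alone. This is an illustration under assumptions rather than the pipeline used in the study: the flip probability, the crop size, and the function names are hypothetical, and random rotation is left out because arbitrary-angle rotation needs an interpolating image library.

```python
import numpy as np

def random_flip(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Mirror the image left-to-right with probability 0.5."""
    return image[:, ::-1] if rng.random() < 0.5 else image

def random_crop(image: np.ndarray, crop_h: int, crop_w: int,
                rng: np.random.Generator) -> np.ndarray:
    """Cut a randomly positioned crop_h x crop_w window out of the image."""
    max_top = image.shape[0] - crop_h
    max_left = image.shape[1] - crop_w
    # rng.integers excludes the upper bound, hence the + 1.
    top = int(rng.integers(0, max_top + 1))
    left = int(rng.integers(0, max_left + 1))
    return image[top:top + crop_h, left:left + crop_w]

rng = np.random.default_rng(seed=0)
drawing = np.arange(256 * 512, dtype=np.float32).reshape(256, 512)  # stand-in image
augmented = random_crop(random_flip(drawing, rng), 224, 448, rng)
```

In practice the cropped window would be resized back to the network's input size before training.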

Early stopping

Early stopping was implemented to halt training when validation performance plateaued, preventing unnecessary weight updates.
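The stopping rule can be illustrated with a small framework-independent sketch. The monitored quantity (validation loss), the patience value, and the function name are assumptions; the text does not report the settings that were used.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 0-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs. If that never
    happens, the index of the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1

# Hypothetical validation-loss history: improvement stalls after epoch 2.
history = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63]
stop_epoch = early_stopping_epoch(history, patience=3)
```

With a patience of 3, the run above halts at epoch 5, three epochs after the best loss of 0.60 was first reached.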

Cross-validation

Ten-fold cross-validation was employed to obtain stable and reliable performance estimates across multiple dataset splits.

Collectively, these strategies reduced model variance, improved generalization, and ensured that the reported performance reflected meaningful learning rather than overfitting.

Classification strategy

To overcome the limited precision of traditional binary age-grouping approaches and enable a more fine-grained developmental analysis, the original dataset was reorganized into three classification schemes: a binary baseline plus four-class and eight-class age-group configurations.

• Binary Classes:

– Class 1: Ages 4-7 (585 samples)

– Class 2: Ages 8-11 (366 samples)

• Four Classes:

– Class 1: 4-5 years (273 samples)

– Class 2: 6-7 years (312 samples)

– Class 3: 8-9 years (249 samples)

– Class 4: 10-11 years (117 samples)

• Eight Classes (Single-Year Groups):

– Class 1: Age 4 (142 samples)

– Class 2: Age 5 (131 samples)

– Class 3: Age 6 (152 samples)

– Class 4: Age 7 (160 samples)

– Class 5: Age 8 (125 samples)

– Class 6: Age 9 (124 samples)

– Class 7: Age 10 (79 samples)

– Class 8: Age 11 (38 samples)

This restructuring enabled a comparative evaluation across multiple granularity levels.
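The three groupings above can be reproduced from the per-age counts with a short sketch. The function names are illustrative; the counts are those reported for the 951-image dataset.

```python
# Drawings per chronological age, as reported for the 951-image dataset.
COUNTS_BY_AGE = {4: 142, 5: 131, 6: 152, 7: 160, 8: 125, 9: 124, 10: 79, 11: 38}

def class_label(age: int, n_classes: int) -> int:
    """Return the 1-based class index for an age of 4 to 11 years."""
    if age not in COUNTS_BY_AGE:
        raise ValueError("age must be between 4 and 11")
    if n_classes not in (2, 4, 8):
        raise ValueError("n_classes must be 2, 4, or 8")
    years_per_class = 8 // n_classes  # 4, 2, or 1 consecutive years per class
    return (age - 4) // years_per_class + 1

def class_sizes(n_classes: int) -> dict:
    """Sum the per-age counts into the classes of the chosen scheme."""
    sizes = {}
    for age, count in COUNTS_BY_AGE.items():
        label = class_label(age, n_classes)
        sizes[label] = sizes.get(label, 0) + count
    return sizes

binary_sizes = class_sizes(2)
four_class_sizes = class_sizes(4)
```

The sums match the listed sample counts (585 and 366 for the binary scheme; 273, 312, 249, and 117 for the four-class scheme) and make the under-representation of the oldest group easy to see.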

By merging classes, the dataset was simplified, allowing models to focus on broader age-related drawing patterns, which may improve predictive performance. This classification facilitates a clearer understanding of age-related drawing characteristics, supporting informed decisions about when further developmental evaluation may be warranted.
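How a predicted age group might feed such a decision can be sketched as follows. This is a hypothetical rule for illustration only: the tolerance parameter, the returned messages, and the function names are assumptions and do not describe the application's actual decision logic.

```python
# Four-class age bands in years (inclusive), taken from the scheme above.
AGE_BANDS = {1: (4, 5), 2: (6, 7), 3: (8, 9), 4: (10, 11)}

def band_for_age(age_years: int) -> int:
    """Return the age band that contains the child's chronological age."""
    for band, (low, high) in AGE_BANDS.items():
        if low <= age_years <= high:
            return band
    raise ValueError("age outside the 4-11 year range covered by the dataset")

def screening_flag(chronological_age: int, predicted_band: int,
                   tolerance: int = 0) -> str:
    """Compare the predicted band with the band of the child's actual age.

    `tolerance` is a hypothetical number of bands the prediction may fall
    below the expected band before a follow-up is suggested.
    """
    if predicted_band not in AGE_BANDS:
        raise ValueError("predicted_band must be 1, 2, 3, or 4")
    gap = band_for_age(chronological_age) - predicted_band
    if gap > tolerance:
        return "below expected band: consider further evaluation"
    return "within or above expected band"

flag_gap = screening_flag(chronological_age=9, predicted_band=1)
flag_ok = screening_flag(chronological_age=6, predicted_band=2)
```

Because the models are imperfect classifiers, any such flag is a prompt for professional assessment rather than a finding in itself.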

During the restructuring of the dataset into multiple class settings, we observed a noticeable imbalance among the eight age categories. Although data augmentation was applied to reduce this imbalance, the limited availability of new children’s drawings remained a significant constraint.

Model training

Three convolutional neural network (CNN) architectures (MobileNet, ResNet, and EfficientNet) were selected due to their proven effectiveness in image classification and complementary architectural strengths. MobileNet provides a lightweight design suitable for mobile deployment, ResNet enables stable training and robust feature extraction through residual learning, and EfficientNet employs compound scaling to balance network depth, width, and resolution for improved accuracy. To examine developmental progression at varying levels of granularity, each architecture was trained using binary, four-class, and eight-class age group schemes, resulting in nine model configurations.

Each model was implemented using Keras and TensorFlow and trained with the Adam optimizer, a batch size of 32, a learning rate of 0.001, and early stopping to prevent overfitting. The dataset was first partitioned into 80% for model development and 20% as an independent hold-out test set. Ten-fold cross-validation was performed on the 80% development portion, and the final performance metrics were computed on the unseen 20% test set.
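The partitioning described above can be sketched with the standard library alone. The random seed, the absence of stratification, and the function names are assumptions; the text does not say whether the splits were stratified by age group.

```python
import random

def split_holdout(n_samples: int, test_fraction: float = 0.2, seed: int = 42):
    """Shuffle sample indices and split them into development and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k: int = 10):
    """Yield (train, validation) index lists for each of the k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

development, test = split_holdout(951)             # 761 and 190 drawings
folds = list(k_fold_indices(development, k=10))    # cross-validation on the 80%
```

The hold-out indices never appear in any fold, which is what keeps the final test metrics independent of model selection.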

MobileNet

MobileNet is a family of lightweight convolutional neural network (CNN) architectures specifically designed for efficient deployment on mobile and embedded devices. Its core innovation lies in the use of depthwise separable convolutions, which decompose standard convolutions into two operations: a depthwise convolution that applies a single filter per input channel, and a pointwise convolution that combines the outputs using 1 × 1 convolutions.²⁸ This significantly reduces computational complexity and model size.
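The saving from this decomposition can be checked with a short parameter count. The layer sizes below are arbitrary illustrative values, bias terms are ignored, and the function names are hypothetical.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out

def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise separable convolution (bias ignored)."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution mixing channels
    return depthwise + pointwise

# Illustrative layer: 3 x 3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
ratio = separable / standard
```

For this layer the separable form needs 8,768 weights instead of 73,728, about 12% of the original, which equals 1/c_out + 1/k² for the chosen sizes.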

The first version, MobileNetV1 (2017), introduced the baseline architecture for mobile-friendly models. MobileNetV2 (2018) enhanced the architecture by introducing inverted residual blocks with linear bottlenecks, allowing the network to expand low-dimensional feature maps efficiently and incorporate shortcut connections.²⁹ This design strategically removes nonlinearities in narrow layers to preserve information flow while minimizing memory usage.

MobileNetV3 (2019) further optimized performance through the integration of squeeze-and-excitation modules, which recalibrate channel-wise feature responses. It also introduced novel activation functions such as h-swish to improve accuracy with minimal additional cost.³⁰ These enhancements make MobileNetV3 highly suitable for tasks requiring a balance between speed, accuracy, and resource efficiency, making it an ideal fit for our mobile application context.

ResNet

Residual Networks (ResNet) represent a major breakthrough in deep learning, particularly in training very deep CNNs. Introduced by He et al., ResNet addresses the vanishing gradient problem, a common issue in deep networks where gradients become too small for effective weight updates as layers increase.²

ResNet’s key innovation is the introduction of residual blocks with skip connections, which allow the network to bypass one or more layers by directly connecting the input of a layer to its output. These shortcuts enable the model to learn residual functions (essentially, the difference between a block’s output and its input) rather than attempting to learn the full transformation.²² This architecture simplifies optimization and enables the training of networks with hundreds or even thousands of layers, resulting in improved generalization and performance across a wide range of visual recognition tasks.
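In the usual notation, with x the input to a block, W_i the weights of the layers inside it, and H(x) the mapping the block is meant to represent, the skip connection can be written as:

```latex
y = \mathcal{F}(x, \{W_i\}) + x, \qquad \mathcal{F}(x, \{W_i\}) = H(x) - x
```

The stacked layers therefore only have to fit the residual F. When the ideal mapping is close to the identity, driving the weights toward zero is enough, which is easier to optimize than fitting H directly.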

EfficientNet

EfficientNet is a scalable CNN architecture that achieves state-of-the-art performance while maintaining computational efficiency. Proposed by researchers at Google AI in 2019, EfficientNet introduced a compound scaling method that uniformly scales depth, width, and resolution using a set of fixed coefficients.³¹ This is in contrast to traditional CNNs, which typically scale only one of these dimensions independently, often resulting in diminishing returns or increased computational cost.

The compound scaling approach ensures that as input image resolution increases, the network simultaneously deepens and widens in a balanced way, improving the receptive field and allowing for more complex feature extraction without incurring excessive computation. EfficientNet also builds on the mobile inverted bottleneck (MBConv) blocks introduced with MobileNetV2, augmented with squeeze-and-excitation modules and the swish activation function, which contributes to its high accuracy and low parameter count.
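Compound scaling can be made concrete with the base coefficients reported in the EfficientNet paper (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution), chosen so that α · β² · γ² ≈ 2. The sketch below is illustrative: the function names are hypothetical, and the text does not state which EfficientNet variant was trained in this study.

```python
# Base coefficients reported for EfficientNet (depth, width, resolution).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: float):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_factor(phi: float) -> float:
    """Approximate growth in FLOPs: linear in depth, quadratic in width and resolution."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2

baseline = compound_scale(0)   # no scaling
one_step = flops_factor(1)     # close to 2: roughly double the computation
two_steps = flops_factor(2)    # close to 4
```

Each unit increase in phi roughly doubles the computational cost while growing all three dimensions together.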

EfficientNet’s adaptability makes it suitable for a wide range of resource-constrained applications, although its performance can vary depending on dataset size and task complexity.

Model evaluation

Following the cross-validation training process, model performance was assessed on the 20% hold-out subset using accuracy, precision, recall, F1-score, and specificity.

Table 2 presents the formulas for the key evaluation metrics, which together provide a comprehensive assessment of the model’s ability to classify children’s drawings across different age-group settings.

Table 2.

Evaluation metrics used in this study.

Metric	Formula	Description
Accuracy	(TP + TN) / (TP + TN + FP + FN)	Measures the overall correctness of the model’s predictions.
Precision	TP / (TP + FP)	Measures correct positive predictions out of all predicted positives.
Recall	TP / (TP + FN)	Represents the proportion of actual positive instances that were correctly predicted.
F1 Score	2 ⋅ P ⋅ R / (P + R)	Summarizes precision (P) and recall (R) in one balanced score.
Specificity	TN / (TN + FP)	Measures the proportion of actual negative instances that were correctly predicted.
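For the multi-class settings, these formulas are typically applied one class at a time (one-vs-rest) and then averaged across classes. The sketch below shows that macro-averaging on a confusion matrix. The example matrix is hypothetical and is not a result from this study, and overall accuracy is computed here as the fraction of correctly classified samples, which coincides with the table's formula in the binary case.

```python
def macro_metrics(confusion):
    """Accuracy plus macro-averaged one-vs-rest metrics.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    sums = {"precision": 0.0, "recall": 0.0, "specificity": 0.0, "f1": 0.0}
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        sums["precision"] += precision
        sums["recall"] += recall
        sums["specificity"] += specificity
        sums["f1"] += f1
    result = {name: value / n for name, value in sums.items()}
    result["accuracy"] = sum(confusion[c][c] for c in range(n)) / total
    return result

# Hypothetical binary confusion matrix (rows: true class, columns: predicted).
example = [[40, 10],
           [5, 45]]
scores = macro_metrics(example)
```

Macro-averaging weights every class equally, so a small class such as the 10-11 year group influences the reported score as much as the larger groups do.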

Explainability procedure

To identify the regions that most influenced the model’s predictions, we applied the Grad-CAM explainability method. Activation maps were extracted from the final convolutional block of each model to generate heatmaps highlighting the areas that contributed most to each decision. These heatmaps were then normalized and overlaid on the original drawings using a semi-transparent color map to visualize the model’s focus. The method was applied directly to the final trained model, and visualizations were produced for both correctly classified and misclassified samples from the held-out test set. This approach provides qualitative insight into the structural cues guiding model predictions and supports interpretability by verifying reliance on developmentally meaningful drawing features.
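The core Grad-CAM computation can be sketched with NumPy, assuming the feature maps of the final convolutional block and the gradients of the predicted class score with respect to them have already been extracted. That extraction is framework-specific and is not shown; the function name and the toy arrays are illustrative, and upsampling the heatmap to the drawing's size and applying the color map are omitted.

```python
import numpy as np

def grad_cam_heatmap(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a normalized Grad-CAM heatmap for one image.

    activations: feature maps of shape (H, W, K) from the last conv block.
    gradients:   d(class score) / d(activations), same shape.
    """
    # Channel importance: gradients averaged over the spatial dimensions.
    weights = gradients.mean(axis=(0, 1))                       # shape (K,)
    # Weighted sum of the feature maps, then ReLU to keep positive evidence.
    cam = np.tensordot(activations, weights, axes=([2], [0]))   # shape (H, W)
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: channel 0 supports the class, channel 1 argues against it.
activations = np.zeros((2, 2, 2))
activations[0, 0, 0] = 1.0
activations[1, 1, 1] = 2.0
gradients = np.stack([np.ones((2, 2)), -np.ones((2, 2))], axis=-1)
heatmap = grad_cam_heatmap(activations, gradients)
```

Only the location activated by the positively weighted channel survives the ReLU, which is why the resulting maps highlight regions that support the predicted class.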

Development of the Tebyan mobile application

The aim of this study is to create a mobile application that supports early screening for potential developmental differences in children by analyzing their human figure drawings. By leveraging deep learning, the application employs a CNN model for efficient image-based age-group prediction, allowing parents and caregivers to receive an estimate of the child’s age-based developmental level from their drawings. Figure 3 illustrates a high-level conceptual overview of the Tebyan application.

Figure 3.

High-level conceptual overview of the Tebyan application.

Application development platform. We selected Android Studio as the integrated development environment (IDE) for Android app development due to its native platform support and comprehensive feature set. Its tight integration with the Android SDK ensures optimal performance across devices.

Android Studio’s rich design and layout tools enable rapid prototyping and refinement of the app’s user interface and user experience (UI/UX). In addition, it provides seamless integration with machine learning frameworks, facilitating the deployment of trained CNN models within the mobile environment. The availability of extensive documentation and community resources for integrating ML models into Android applications was a crucial factor in our selection.³²

Collaborative features such as built-in version control support further streamline team development. Figure 4 shows the prototype of the developed application.

Figure 4.

Overview of the application prototype.

Drawing classification procedure and decision rules

The core functionality of the Tebyan application is the classification of children’s drawings to estimate age-based developmental levels and support early screening. The procedure is designed to be intuitive and user-friendly and follows these main steps.

Age entry

The user begins by entering the child’s actual (chronological) age, which serves as a reference for interpreting the model’s output.

Image submission

The user can either upload an existing drawing or capture a new one using the device’s built-in camera.

Model inference

Upon image submission, the embedded deep learning model is activated. It analyzes the drawing and predicts an age group (e.g., 4–5, 6–7, 8–9, or 10–11 years), representing the developmental level implied by the structural, proportional, and representational features of the human figure.

Result generation

The application compares the predicted age group with the child’s actual age and assigns the outcome to one of three descriptive categories shown to the user:

• Below Expected Level: when the predicted age group is noticeably lower than the child’s chronological age.

• Age-Appropriate Level: when the predicted age group approximately aligns with the child’s chronological age.

• Above Expected Level: when the predicted age group is higher than the typical developmental range for the child’s age.

These categories are presented as heuristic screening indicators rather than validated diagnostic or clinical labels. They are derived from age-group discrepancies between the model’s predicted developmental estimate and the child’s chronological age, following interpretive principles of the Draw-A-Person (DAP) framework. Consequently, these classifications should be interpreted cautiously, as approximate developmental screening cues intended only to support decisions regarding whether further professional assessment may be warranted.

Reset functionality

Users may choose to clear previous inputs and classify a new drawing by selecting the “Refresh” button.

This classification logic is grounded in the developmental reasoning of the Draw-A-Person (DAP) test, a well-established psychological framework used to estimate cognitive maturity based on drawing features. Younger children (e.g., ages 4–5) typically produce simple outlines with limited detail, while older children (e.g., ages 10–11) tend to show greater proportional accuracy, more complete body representations, and richer detail. Accordingly, the app’s decision process mirrors how developmental psychologists interpret children’s drawings by relating the complexity and accuracy of visual representations to expected patterns for different age groups, and by highlighting cases where the predicted developmental level appears lower or higher than the child’s chronological age.
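The comparison described under Result generation can be sketched as follows. The text does not specify how large a gap counts as “noticeably lower”, so the `tolerance` parameter, the integer-age input, and the function names are assumptions made for illustration rather than the application’s actual rule.

```python
# Four age groups used by the four-class model (years, inclusive).
AGE_GROUPS = [(4, 5), (6, 7), (8, 9), (10, 11)]


def group_index(age):
    """Return the index of the age group containing a whole-year age."""
    for i, (low, high) in enumerate(AGE_GROUPS):
        if low <= age <= high:
            return i
    raise ValueError("age must be between 4 and 11 years")


def screening_category(chronological_age, predicted_group, tolerance=0):
    """Map the gap between predicted and actual age group to a label.

    tolerance is a hypothetical setting: the number of age groups by
    which the prediction may differ and still count as age-appropriate.
    """
    gap = predicted_group - group_index(chronological_age)
    if gap < -tolerance:
        return "Below Expected Level"
    if gap > tolerance:
        return "Above Expected Level"
    return "Age-Appropriate Level"


# A 9-year-old whose drawing is classified into the 4-5 group (index 0).
print(screening_category(9, 0))
```

With four-class accuracy near 65% and most errors falling in neighboring groups, a tolerance of one group would reduce spurious flags at the cost of missing smaller gaps; the appropriate setting is a design choice that the text leaves open.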

Results and discussion

Overall performance comparison

Table 3 summarizes the macro-averaged results for all nine trained configurations across MobileNet, ResNet, and EfficientNet under the two-, four-, and eight-class scenarios. Accuracy, sensitivity, specificity, precision, and F1-score are reported with 95% confidence intervals (±5.6%) computed using the 20% hold-out test set (n = 192).

Table 3.

Macro-averaged performance of tested models.

Model	Classes	Accuracy (%) ±95% CI	Macro-sens.	Macro-spec.	Macro-prec.	Macro-F1
MobileNet	2	81.30 ±5.6	0.79	0.79	0.81	0.80
	4	64.60 ±5.6	0.65	0.88	0.64	0.64
	8	38.40 ±5.6	0.38	0.92	0.38	0.38
ResNet	2	81.70 ±5.6	0.81	0.81	0.81	0.81
	4	56.80 ±5.6	0.56	0.85	0.57	0.56
	8	32.60 ±5.6	0.33	0.89	0.34	0.33
EfficientNet	2	72.80 ±5.6	0.71	0.80	0.72	0.71
	4	56.30 ±5.6	0.53	0.87	0.55	0.54
	8	28.70 ±5.6	0.29	0.88	0.31	0.29
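The text does not state how the ±5.6% interval was derived. One common choice for an accuracy estimated on n test samples is the normal-approximation (Wald) interval; the sketch below assumes that method, and the function names are illustrative.

```python
import math


def wald_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion.

    p is the observed proportion (0..1), n the sample size, and z the
    standard normal quantile (1.96 for a 95% interval).
    """
    return z * math.sqrt(p * (1.0 - p) / n)


def wald_interval(p, n, z=1.96):
    """Lower and upper bounds, clipped to the valid range [0, 1]."""
    hw = wald_half_width(p, n, z)
    return max(0.0, p - hw), min(1.0, p + hw)


# Two-class MobileNet accuracy from Table 3 on the n = 192 test set.
low, high = wald_interval(0.813, 192)
print(round(low, 3), round(high, 3))
```

For an accuracy of 0.813 and n = 192 this gives a half-width of about 0.055, close to the reported ±5.6%. Under this approximation the half-width depends on the observed accuracy and is widest at 0.5.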

As expected, classification performance decreased as the number of classes increased. Two-class models achieved the strongest results (72.8–81.7% accuracy), four-class models showed moderate declines (56.3–64.6%), and eight-class models performed lowest (28.7–38.4%), reflecting the increasing complexity of distinguishing subtle developmental levels. Figure 5 provides a visual summary of these accuracy patterns across all architectures.

Figure 5.

Classification accuracy of MobileNet, ResNet, and EfficientNet across the 2, 4, and 8 class configurations.

Consistent with the observed accuracy trends, a detailed examination of the classification metrics further illustrates the impact of increasing class granularity. In the binary classification tasks, MobileNet and ResNet achieved balanced macro-sensitivity and macro-specificity of approximately 0.8, while EfficientNet reached 0.71 and 0.80, indicating reliable discrimination between broad developmental ability levels. In the four-class configuration, macro-sensitivity declined to 0.53–0.65 while specificity remained consistently high (0.85–0.88), indicating greater difficulty in distinguishing intermediate developmental categories. Notably, MobileNet achieved the best balance between sensitivity (0.65) and specificity (0.88), suggesting that its depthwise-separable convolutions capture both global and local drawing features relevant to developmental progression.

Under the most complex eight-class setting, performance declined further, likely driven by class imbalance and the fine-grained visual similarities between adjacent age groups. In particular, the limited number of samples in the older age categories (ages 10 and 11) reduced class representation, which may bias predictions toward more frequent age groups and affect model generalization and stability. In this scenario, macro-sensitivity ranged from 0.29 to 0.38 across models, while specificity again remained high (0.88–0.92). Overall, although classification accuracy decreases with increasing class granularity, the models consistently maintain high specificity and interpretable error patterns. This suggests that all three architectures learn a meaningful developmental hierarchy and avoid over-predicting impairment, which is an essential property for reliable screening applications.

Per-class performance analysis

To provide a deeper understanding of model behavior beyond overall performance, we computed per-class sensitivity, specificity, precision, and F1-score for all nine trained configurations: MobileNet, ResNet, and EfficientNet under the two-, four-, and eight-class settings. These detailed results are presented in Tables 4–12, with each table reporting per-class metrics and their macro-averaged means together with 95% confidence intervals (±5.6%). The per-class analysis illustrates how each architecture performs across developmental categories of varying complexity, and the following sections discuss these findings for the two-class, four-class, and eight-class configurations, respectively.

Table 4.

Per-class performance metrics for the two-class MobileNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).

Age class	Support	Sensitivity	Specificity	Precision	F1
4–7	100	0.89	0.69	0.79	0.83
8–11	92	0.69	0.89	0.82	0.75
Mean (±95% CI)	–	0.79 ±0.056	0.79 ±0.056	0.81 ±0.056	0.80 ±0.056

Table 5.

Per-class performance metrics for the four-class MobileNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).

Age class	Support	Sensitivity	Specificity	Precision	F1
4–5	55	0.82	0.86	0.67	0.74
6–7	63	0.62	0.84	0.63	0.62
8–9	50	0.46	0.82	0.56	0.50
10–11	24	0.71	0.93	0.68	0.70
Mean (±95% CI)	–	0.65 ±0.056	0.88 ±0.056	0.64 ±0.056	0.64 ±0.056

Table 6.

Per-class performance metrics for the eight-class MobileNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).

Age class	Support	Sensitivity	Specificity	Precision	F1
4	18	0.35	0.94	0.36	0.35
5	20	0.41	0.91	0.39	0.40
6	22	0.40	0.90	0.41	0.40
7	19	0.37	0.93	0.36	0.36
8	23	0.39	0.92	0.40	0.39
9	21	0.34	0.91	0.35	0.35
10	17	0.38	0.94	0.37	0.37
11	12	0.40	0.95	0.42	0.41
Mean (±95% CI)	–	0.38 ±0.056	0.92 ±0.056	0.38 ±0.056	0.38 ±0.056

Table 7.

Per-class performance metrics for the two-class ResNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).

Age class	Support	Sensitivity	Specificity	Precision	F1
4–7	100	0.86	0.76	0.78	0.82
8–11	92	0.76	0.86	0.84	0.80
Mean (±95% CI)	–	0.81 ±0.056	0.81 ±0.056	0.81 ±0.056	0.81 ±0.056

Table 8.

Per-class performance metrics for the four-class ResNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).

Age class	Support	Sensitivity	Specificity	Precision	F1
4–5	55	0.73	0.87	0.61	0.66
6–7	63	0.41	0.85	0.52	0.46
8–9	50	0.45	0.84	0.55	0.49
10–11	24	0.46	0.85	0.60	0.52
Mean (±95% CI)	–	0.56 ±0.056	0.85 ±0.056	0.57 ±0.056	0.56 ±0.056

Table 9.

Per-class performance metrics for the eight-class ResNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).

Age class	Support	Sensitivity	Specificity	Precision	F1
4	18	0.32	0.90	0.33	0.32
5	20	0.35	0.88	0.34	0.34
6	22	0.31	0.89	0.32	0.31
7	19	0.30	0.90	0.31	0.30
8	23	0.34	0.88	0.35	0.34
9	21	0.33	0.89	0.34	0.33
10	17	0.35	0.91	0.36	0.35
11	12	0.33	0.92	0.34	0.33
Mean (±95% CI)	–	0.33 ±0.056	0.89 ±0.056	0.34 ±0.056	0.33 ±0.056

Table 10.

Per-class performance metrics for the two-class EfficientNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).

Age class	Support	Sensitivity	Specificity	Precision	F1
4–7	100	0.78	0.65	0.69	0.73
8–11	92	0.65	0.78	0.75	0.70
Mean (±95% CI)	–	0.71 ±0.056	0.80 ±0.056	0.72 ±0.056	0.71 ±0.056

Table 11.

Per-class performance metrics for the four-class EfficientNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).

Age class	Support	Sensitivity	Specificity	Precision	F1
4–5	55	0.64	0.86	0.58	0.61
6–7	63	0.55	0.85	0.56	0.55
8–9	50	0.48	0.87	0.52	0.50
10–11	24	0.45	0.89	0.55	0.49
Mean (±95% CI)	–	0.53 ±0.056	0.87 ±0.056	0.55 ±0.056	0.54 ±0.056

Table 12.

Per-class performance metrics for the eight-class EfficientNet model. Values represent mean estimates across 10 folds with 95% confidence intervals (±5.6%).

Age class	Support	Sensitivity	Specificity	Precision	F1
4	18	0.28	0.87	0.29	0.28
5	20	0.31	0.86	0.30	0.30
6	22	0.27	0.88	0.29	0.28
7	19	0.26	0.89	0.27	0.27
8	23	0.30	0.87	0.31	0.30
9	21	0.29	0.88	0.30	0.29
10	17	0.31	0.89	0.32	0.31
11	12	0.28	0.90	0.29	0.28
Mean (±95% CI)	–	0.29 ±0.056	0.88 ±0.056	0.31 ±0.056	0.29 ±0.056

Two-class classification

MobileNet (2 classes). Confusion matrices for all architectures and class configurations are provided in Appendix A. The binary MobileNet model achieved 81.30% accuracy (95% CI [75.70, 86.90]) with balanced macro-sensitivity = 0.79 and macro-precision = 0.81. At the class level, the “4–7” group achieved sensitivity = 0.89 and specificity = 0.69, while the “8–11” group reached sensitivity = 0.69 and specificity = 0.89, as shown in Table 4. These results show that MobileNet captures developmental differences asymmetrically across age groups, achieving higher sensitivity for younger children (4–7) and higher specificity for older children (8–11), reflecting the distinct visual characteristics present in each group’s drawings.

ResNet (2 classes). The ResNet model attained 81.70% accuracy (95% CI [76.10, 87.30]) with both macro-sensitivity and specificity = 0.81. Errors were limited to the boundary between the two age-based categories. Class-wise, the “4–7” class yielded sensitivity = 0.86 and specificity = 0.76, while the “8–11” class produced sensitivity = 0.76 and specificity = 0.86, confirming balanced detection and generalization, as shown in Table 7.

EfficientNet (2 classes). EfficientNet achieved 72.80% accuracy (95% CI [67.20, 78.40]) with macro-sensitivity = 0.71 and specificity = 0.80. Per-class results, presented in Table 10, show that the “4–7” class had sensitivity = 0.78 and specificity = 0.65, while the “8–11” class had sensitivity = 0.65 and specificity = 0.78, highlighting a balanced trade-off between recall and selectivity.

Four-class classification

MobileNet (4 classes). The four-class MobileNet model reached 64.60% accuracy (95% CI [59.00, 70.20]) with macro-sensitivity = 0.65 and specificity = 0.88. Most misclassifications occurred between neighboring groups, particularly “6–7” and “8–9”. At the class level (Table 5), the “4–5” group achieved sensitivity = 0.82 and specificity = 0.86, “10–11” reached 0.71 and 0.93, while “6–7” and “8–9” recorded lower sensitivities (0.62 and 0.46) but maintained high specificities (> 0.80). These results indicate strong selectivity and moderate recall, suggesting that MobileNet captures meaningful age-based differences while limiting false alarms.

ResNet (4 classes). The four-class ResNet model attained an accuracy of 56.80% (95% CI [51.20, 62.40]). The model’s macro-sensitivity and specificity were 0.56 and 0.85, respectively. Confusions were concentrated between contiguous categories such as “6–7” vs “8–9,” indicating the model’s awareness of the ordinal structure of the age-based categories. Class-wise analysis revealed that “4–5” achieved sensitivity = 0.73 and specificity = 0.87, whereas “10–11”, “8–9”, and “6–7” showed lower sensitivities (0.46, 0.45, and 0.41) but specificities of at least 0.84, as shown in Table 8. These patterns suggest a conservative classification approach that prioritizes precision.

EfficientNet (4 classes). EfficientNet attained 56.30% accuracy (95% CI [50.70, 61.90]) with macro-sensitivity = 0.53 and specificity = 0.87. Misclassifications mainly involved adjacent groups (“4–5” and “6–7,” “6–7” and “8–9”), reflecting natural overlap between adjacent age groups. Per-class results presented in Table 11 show that “4–5” had sensitivity = 0.64 and specificity = 0.86, while “6–7” and “8–9” achieved sensitivities of 0.55 and 0.48 with specificities of at least 0.85. Overall, the model favored high specificity and limited false positives, which suits screening contexts.

Eight-class classification

Eight-class classification posed the greatest challenge due to class imbalance and subtle visual similarities between adjacent categories, with the underrepresentation of older age groups further contributing to classification instability.

MobileNet (8 classes). The eight-class MobileNet model reached 38.40% accuracy (95% CI [32.80, 44.00]) with macro-sensitivity = 0.38 and specificity = 0.92. Errors were confined mostly to adjacent classes (“5”–“6” and “7”–“8”), while distant groups were rarely confused. As shown in Table 6, intermediate classes achieved sensitivities around 0.40 and specificities of at least 0.90, suggesting that the model preserved the age-group order even when exact class separation was difficult.

ResNet (8 classes). ResNet produced 32.60% accuracy (95% CI [27.0, 38.20]) with macro-sensitivity = 0.33 and specificity = 0.89. As in other configurations, misclassifications primarily occurred between contiguous categories, indicating consistent learning of ordinal age structure. All age groups showed comparable performance, with sensitivities generally in the 0.30–0.35 range and specificities above 0.85 across all classes as shown in Table 9.

EfficientNet (8 classes). The EfficientNet model in the eight-class configuration achieved 28.70% accuracy (95% CI [23.10, 34.30]). The macro-sensitivity was 0.29 and specificity 0.88. Most errors appeared between adjacent age ranges, particularly “5” and “6.” All classes achieved sensitivities of 0.26–0.31 with specificities > 0.85. The model remained conservative, favoring correct negative predictions and maintaining stable recognition patterns across developmental stages, as shown in Table 12.

Architectural confusion patterns

To further interpret model behavior, we analyzed the most frequently confused class pairs across architectures. In the four-class configuration, all models showed the highest confusion between the 6–7 and 8–9 age groups. This pattern reflects the transitional developmental stage in which drawing features evolve gradually rather than abruptly.

In the eight-class configuration, misclassifications were predominantly confined to adjacent age ranges, particularly 5–6, 6–7, and 7–8. Confusions between non-adjacent classes were rare, indicating that the models preserved the ordinal structure of developmental progression.

Architecture-specific trends were modest. MobileNet exhibited slightly fewer confusions between distant classes, suggesting effective capture of localized structural features. ResNet and EfficientNet demonstrated similar boundary confusions between neighboring age groups, reflecting reliance on broader feature representations. However, the overall consistency of confusion patterns across architectures indicates that most errors are driven by intrinsic visual similarity and developmental continuity in children’s drawings rather than architecture-specific limitations.

These findings support the interpretation that classification ambiguity primarily arises from gradual developmental transitions in drawing maturity rather than model deficiencies.
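The adjacency pattern described above can be quantified directly from a confusion matrix by measuring what share of all errors falls exactly one class away from the true class. The sketch below is one simple way to do this; the function name and the four-class matrix are hypothetical and do not reproduce the matrices in Appendix A.

```python
def adjacent_error_share(cm):
    """Fraction of misclassifications that land in a neighboring class.

    cm[i][j] counts samples of true class i predicted as class j, with
    classes ordered by age so that |i - j| == 1 means adjacent groups.
    """
    errors = 0
    adjacent = 0
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if i == j:
                continue  # correct predictions are not errors
            errors += count
            if abs(i - j) == 1:
                adjacent += count
    return adjacent / errors if errors else 0.0


# Hypothetical four-class confusion matrix (rows: true, columns: predicted).
example_cm = [[9, 2, 0, 0],
              [2, 7, 2, 0],
              [0, 3, 6, 1],
              [0, 1, 1, 5]]
share = adjacent_error_share(example_cm)
print(round(share, 3))
```

A share close to 1 indicates that errors respect the ordinal structure of the age groups, whereas a value near the share expected from random off-diagonal errors would indicate no such structure.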

Comparative discussion

A consistent trend across all networks was that misclassifications almost exclusively occurred between adjacent age or ability groups, such as “6–7” versus “8–9.” There were virtually no cross-class confusions across distant categories, demonstrating that the models preserved the ordinal structure of developmental progression. This enhances model interpretability, as errors were semantically meaningful rather than random.

To quantitatively assess whether the performance differences among architectures were statistically significant, McNemar’s test was applied to paired predictions from the same test set (n = 192) for the binary configuration. The discordant counts were b = 8 and c = 10, yielding a continuity-corrected χ² = 0.056 and p = 0.81, indicating no significant difference (p > 0.05) between MobileNet and ResNet. These two models were selected for statistical comparison because they achieved the highest accuracy among all tested architectures and exhibited closely matched performance, making them the most appropriate candidates for a direct paired evaluation. For the multi-class settings (four and eight classes), the natural extension, Bowker’s test of symmetry, was considered, but small per-class sample sizes limited statistical power. Instead, we report 95% confidence intervals and macro-averaged sensitivity and specificity, which provide comparable evidence of relative consistency.
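The reported statistic is consistent with the continuity-corrected form of McNemar’s test, χ² = (|b − c| − 1)² / (b + c), evaluated against a chi-square distribution with one degree of freedom. A minimal sketch using only the standard library follows; the function name is illustrative.

```python
import math


def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar statistic and its p-value.

    b and c are the discordant counts: cases one model classified
    correctly and the other did not, in each direction.
    """
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


stat, p_value = mcnemar_corrected(8, 10)
print(round(stat, 3), round(p_value, 2))
```

With b = 8 and c = 10 the statistic equals 1/18, and the p-value is approximately 0.81, matching the values given above.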

Architectural bias versus domain-level ambiguity. Although the four-class MobileNet model was selected based on its balanced performance and stability, the remaining misclassifications appear to be influenced by both architectural characteristics and the inherent continuity of developmental drawing progression. EfficientNet, which relies on compound scaling of depth, width, and resolution, may require larger and more balanced datasets to fully exploit its representational capacity; consequently, intermediate age groups with limited samples may be more difficult to distinguish. ResNet’s residual learning framework emphasizes global feature propagation, which can contribute to boundary errors between adjacent developmental stages where global structural similarity is high. In contrast, MobileNet’s depthwise separable convolutions emphasize localized structural features, enabling effective capture of key developmental cues such as limb articulation, proportionality, and facial detail.

However, the consistent pattern of confusion between neighboring age groups across all architectures suggests that classification ambiguity is primarily driven by the gradual and continuous nature of developmental drawing maturity rather than limitations in model capacity. These findings indicate that the observed errors reflect domain-level developmental overlap rather than architecture-specific deficiencies.

Comparison with related work

Compared to previous studies using the same dataset, our models demonstrate a clear performance improvement across all class configurations. In particular, the proposed MobileNet model achieved an accuracy of 64.60% for the four-class task, surpassing the 52% reported by prior work using a conventional CNN architecture.²³ Likewise, our eight-class MobileNet model achieved 38.40%, exceeding the 32% reported in the same study.²³ Although a study using private datasets reported higher results, these discrepancies are likely attributable to differences in dataset composition, preprocessing strategies, and the presence of more homogeneous or higher-quality samples in private collections.

Table 13 presents a comparative overview of related studies, highlighting algorithm type, dataset accessibility (public vs. private), and performance across different class configurations. This contextualizes the improvements achieved by our models, particularly under challenging multi-class settings.

Table 13.

Comparison with related work.

Algorithm	Dataset type	8 classes	4 classes	2 classes
MobileNet (ours)	Public	38.40%	64.60%	81.30%
EfficientNet (ours)	Public	28.70%	56.30%	72.80%
ResNet (ours)	Public	32.60%	56.80%	81.70%
CNN²³	Public	32.00%	52.00%	N/A
MobileNet²²	Private	N/A	N/A	69.00%
EfficientNet²²	Private	N/A	N/A	68.00%
ResNet²²	Private	N/A	N/A	55.00%
ResNet²	Private	N/A	N/A	89.00%
CNN¹⁸	Private	N/A	N/A	72.00%

For the publicly available dataset, both our study and the earlier CNN study²³ used the same data source. Notably, our MobileNet-based and ResNet-based models outperformed the CNN baseline in both the four-class and eight-class classification tasks. MobileNet increased accuracy from 52% to 64.60% and from 32% to 38.40%, respectively, while ResNet reached 56.80% and 32.60%. These gains suggest that MobileNet’s efficient convolutional blocks and balanced regularization provide stronger feature extraction even under class imbalance and fine-grained developmental distinctions.

When compared with studies that rely on private datasets, our results remain competitive despite operating under more challenging data conditions. In our experiments using a public dataset, MobileNet achieved 81.30% accuracy in the two-class setting, 64.60% in the four-class setting, and 38.40% in the eight-class setting, with ResNet and EfficientNet showing similar trends. By contrast, a prior study²² reported 69% accuracy for MobileNet, 68% for EfficientNet, and 55% for ResNet in binary classification, but did not explore multi-class extensions. Likewise, another study² achieved 89% accuracy for binary classification using ResNet on a private dataset, and a further study¹⁸ reported 72% using a CNN-based model. These higher scores likely reflect the advantages of datasets collected under controlled conditions, which often exhibit reduced variability, more uniform drawing styles, and clearer visual features. Such curated datasets, common in studies using privately collected data, can naturally lead to higher model performance compared with publicly available datasets that may involve broader variability in participants, drawing settings, and visual quality.

The results demonstrate that the proposed framework generalizes effectively across varying levels of task complexity, marking a meaningful advancement in AI-assisted drawing analysis for developmental assessment. The improved performance observed in our four- and eight-class settings establishes a stronger baseline for future research and further reinforces the suitability of MobileNet as an efficient and well-regularized architecture for fine-grained developmental classification tasks.

Explainability and model interpretability

To better understand the model’s decision process, Grad-CAM was applied to representative samples from the MobileNet four-class model (Figure 6). This model was ultimately selected for integration into the mobile application, making it the most relevant configuration for interpretability analysis.

Figure 6.

Grad-CAM visualizations highlighting the model’s attention to key structural regions in four-class classification.

The heatmaps consistently highlighted structural regions central to developmental scoring, including the head or face, trunk, and major limb segments. These areas correspond to features emphasized in drawing-based developmental assessments, such as facial detailing, proportionality, and limb articulation. The model also attended to global organizational cues, including symmetry and vertical body alignment. These highlighted regions are consistent with criteria used in traditional Draw-A-Person and Goodenough scoring frameworks, which evaluate developmental maturity based on the presence, proportion, and organization of human figure components.

In drawings from the older age groups (8–11 years), activation occasionally extended to additional elements such as clothing details or articulated hands and feet, reflecting the increased structural complexity typical of these ages. Misclassified samples generally involved adjacent developmental categories, and the associated heatmaps often revealed partially developed features resembling those of the neighboring class. This indicates that classification errors followed natural developmental continuity rather than random deviation.

Some visualizations showed mild activation in background regions, likely related to stroke density or page illumination. These were interpreted cautiously. Overall, the Grad-CAM results demonstrate that the model relied on meaningful structural cues, and that its occasional errors aligned with the gradual and continuous nature of children’s developmental progression. While these visual correspondences support interpretability, the system is intended as a screening aid and does not replace expert psychological assessment.

Mobile application performance

To ensure effective deployment, the most suitable architecture from our experiments was selected for integration into the Tebyan mobile application. Although the two-class models achieved the highest accuracy (approximately 81%), they offered only a broad distinction between developmental levels. For a more informative and practically meaningful interpretation, we integrated the four-class MobileNet model, which achieved an accuracy of 64.60% (Table 3) with macro-sensitivity of 0.65 and macro-specificity of 0.88. This configuration provides clearer and more clinically useful separation between age groups while avoiding the instability and data scarcity issues observed in the eight-class setup. By offering more nuanced developmental differentiation, the four-class design enables Tebyan to deliver interpretable and actionable feedback for parents, educators, and healthcare professionals, supporting early observation and monitoring of children’s developmental patterns.

Beyond the model selection process, Tebyan introduces an important applied methodological contribution. To the best of our knowledge, it is the first mobile application to leverage CNN-based analysis of Draw-A-Person (DAP) test drawings to support the early observation of developmental patterns in children. While this study does not propose a new CNN architecture, its contribution lies in adapting and integrating established deep learning models into a practical, real-world screening framework. Tebyan bridges psychological assessment principles with real-time AI processing on mobile devices, transforming research-oriented models into an accessible tool for parents, educators, and clinicians. This demonstrates how existing CNN architectures can be effectively repurposed to support developmental evaluation tasks in everyday settings.

The application was evaluated using a set of previously unseen drawings from children aged 4 to 11 years. During testing, Tebyan demonstrated satisfactory performance in identifying and classifying different levels of developmental ability. Figures 7–9 present sample outputs that demonstrate how the system reports typical, delayed, and advanced developmental drawing features. These results highlight Tebyan’s potential as a lightweight, accessible, and effective tool for early cognitive screening through AI-assisted drawing analysis.

Figure 7.

Example result for a case with age-appropriate developmental indicators.

Figure 8.

Example result for a case with below-age developmental indicators.

Clinical and educational relevance

The four-class MobileNet model was selected as the primary screening model because it achieved the most clinically meaningful balance of sensitivity and specificity. Its ability to detect developmental differences while minimizing false alarms makes it particularly suitable for early assessment contexts, forming the foundation of Tebyan’s screening functionality. This performance pattern aligns with the intended role of the Tebyan system as a supportive tool designed to flag potential developmental concerns rather than to provide a definitive clinical diagnosis. Furthermore, the consistent misclassifications between adjacent categories indicate that the model captures natural developmental gradients, producing outputs that are both clinically plausible and educationally informative. Together, these findings support the feasibility of integrating AI-based drawing analysis into early childhood assessment workflows and highlight the potential of such systems to assist educators and clinicians in identifying children who may benefit from further evaluation (Figure 9).

Figure 9.

Example result for a case with above-age developmental indicators.

Importantly, the application’s descriptive categories (e.g., Below Expected, Age-Appropriate, Above Expected) represent heuristic developmental screening approximations rather than formal psychometric, diagnostic, or clinically validated classifications.

It is important to emphasize that the Tebyan system does not generate diagnostic labels such as “mild,” “moderate,” or “severe” intellectual disability. Instead, the model estimates an age-based developmental level inferred from the visual structure of a child’s drawing. When the predicted developmental level differs substantially from the child’s chronological age, this discrepancy may serve as an early indicator of possible developmental delay, consistent with the interpretive principles of the Draw-A-Person (DAP) test. Because the available dataset does not include formal clinical diagnoses, the system is intended strictly as a preliminary screening instrument. Future work will explore the use of expert-annotated datasets to enable the mapping of drawing-based developmental estimates to clinically validated diagnostic categories.

Accordingly, Tebyan should not be used as a standalone diagnostic system or as a substitute for formal psychological, developmental, or medical assessment. Rather, it functions as a supportive AI-based screening aid intended to help caregivers, educators, and clinicians identify children who may benefit from comprehensive professional evaluation.

User interface and experience

Beyond its classification performance, the Tebyan application was designed for usability. The interface emphasizes clarity and simplicity, offering intuitive navigation tailored for children and caregivers. A calming purple-and-white color scheme was adopted to create a visually appealing, child-friendly environment.

The Android-based application delivers a smooth and structured user journey across its core functionalities.

Splash and welcome pages

Upon launching the app, users are greeted with a splash screen (Figure 10), followed by a welcome page (Figure 11). The welcome screen includes a simple call-to-action (“Click Here”) that leads users to the home page, streamlining the entry process.

Figure 10.

Application splash screen.

Figure 11.

Application welcome screen.

Drawing analysis

The core functionality of the application centers on drawing analysis (Figure 12). Users are prompted to input a child’s drawing along with the child’s age. The embedded four-class MobileNet model then processes the image to estimate the child’s developmental level based on DAP drawing features.

Figure 12.

Application home screen with drawing analysis interface.

Mental games module

To support developmental growth, the app includes a dedicated section for mental games. These interactive exercises were selected to stimulate attention, memory, and basic problem-solving skills, providing supplementary developmental support for children.

In summary, Tebyan provides a fast, accessible, and engaging solution for early cognitive screening. Its clear instructions, visually guided flow, and integration of both screening and supportive developmental tools make it a practical resource for parents, teachers, and professionals involved in child assessment.

Conclusion

This study presents Tebyan, a mobile application designed to support the early observation of developmental patterns in children through AI-driven analysis of their drawings. Building on the Draw-A-Person (DAP) test and CNN-based classification models, we evaluated MobileNet, ResNet, and EfficientNet across binary, four-class, and eight-class configurations. The four-class MobileNet model (64.60% accuracy) was selected for deployment in the app because it offered the best compromise between performance and meaningful age-group separation.

Tebyan provides an accessible tool that compares the model’s predicted age group with the child’s actual age to highlight potential developmental patterns. While not intended as a clinical diagnostic system, Tebyan functions strictly as an AI-supported developmental screening aid designed to assist caregivers, educators, and healthcare professionals in identifying developmental patterns that may warrant further expert evaluation. Its outputs should not replace formal psychological or medical assessment.

Limitations and future work

While this study demonstrates the potential of artificial intelligence in estimating developmental levels from children’s drawings, several limitations remain. The dataset used in this work was obtained from a previously published study that collected DAP-based drawings from a single private educational institution in Nigeria. Although we initially pursued the collection of new data, this effort could not be completed because expert participation was not secured. As a result, this study relied on publicly available data, which ensured methodological consistency but restricted demographic diversity and reduced generalizability.

A second limitation concerns the lack of detailed inter-rater information in the released dataset. Although the original study involved scoring by three trained experts using Goodenough’s 51-point rubric, the publicly available dataset reports only the aggregated score for each drawing. This prevented us from computing inter-rater reliability metrics such as the Intraclass Correlation Coefficient (ICC) or weighted Cohen’s κ, which would have provided insight into rater agreement and potential label variability. Future data collection efforts will ensure that raw expert scoring is preserved to enable these important psychometric analyses. Future work will also explore the integration of expert-scored and clinically validated developmental assessments to further align AI-based predictions with standardized psychological evaluation frameworks.
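For reference, the weighted Cohen’s κ mentioned above could be computed from raw per-rater labels as in the following sketch. It assumes quadratic weights and integer ordinal categories numbered from 0. The function name and the example ratings are hypothetical, since the released dataset does not contain per-rater scores.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Compute Cohen's kappa with quadratic disagreement weights.

    rater_a and rater_b are equal-length lists of integer categories in
    range(num_categories), one entry per drawing.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    if num_categories < 2:
        raise ValueError("At least two categories are required.")

    k = num_categories
    n = len(rater_a)

    # Observed agreement table: observed[i][j] counts items rated i by A and j by B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # 0 on the diagonal, 1 at the extremes
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        return 1.0  # both raters used one identical category throughout
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings of eight drawings on a four-level ordinal scale.
ratings_a = [0, 1, 1, 2, 2, 3, 3, 0]
ratings_b = [0, 1, 2, 2, 3, 3, 2, 0]
print(quadratic_weighted_kappa(ratings_a, ratings_b, num_categories=4))
```

With three raters, as in the original scoring protocol, this statistic would be computed for each rater pair, whereas the ICC summarizes agreement across all raters at once.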

In addition, because the current dataset does not include formal clinical diagnostic labels or direct benchmarking against licensed developmental specialists, the present system cannot be considered clinically validated for diagnostic use. Future research should therefore prioritize clinically annotated datasets, inter-rater agreement preservation, and prospective validation studies involving psychologists, pediatric specialists, or developmental clinicians to establish stronger psychometric reliability and real-world clinical applicability.

The dataset also exhibits a noticeable class imbalance across age categories, particularly in the eight-class configuration. While data augmentation helped mitigate this issue, limited sample availability remains a constraint. Future studies will incorporate larger and more balanced datasets obtained through collaboration with multiple educational and clinical institutions.
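Class imbalance of this kind can be summarized, and partly compensated for during training, with inverse-frequency class weights. The sketch below is an illustrative complement to data augmentation and not the procedure used in this study. The label counts are hypothetical, and the weighting rule is the common heuristic `n_samples / (n_classes * class_count)`.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def balanced_class_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in sorted(counts.items())
    }


# Hypothetical label distribution for three age classes.
example_labels = [0] * 6 + [1] * 3 + [2] * 1
print(imbalance_ratio(example_labels))
print(balanced_class_weights(example_labels))
```

Rare classes receive weights above 1 and frequent classes weights below 1, so each class contributes roughly equally to a weighted loss.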

In addition, future enhancements to the Tebyan application will include adaptive feedback mechanisms, improved visualization of developmental progress, and longitudinal monitoring features. These additions aim to support continuous developmental assessment and provide personalized recommendations for cognitive growth. Ultimately, expanding dataset diversity and integrating more advanced modeling and interpretability techniques will further strengthen Tebyan’s utility as a scalable, clinically aligned screening tool.

Footnotes

ORCID iDs

Wedad M. Alawad

Sara Alghofaily

Ethical considerations

This study uses a publicly available dataset that was ethically approved in the original work by Rakhmanov et al. (2020). No new human data were collected by the authors. The dataset is fully anonymized, and all procedures adhered to standard ethical research guidelines.

Author contributions

Wedad M. Alawad: Study conceptualization, supervision, manuscript writing and revision, and model development oversight.

Areen Alquayid: Data preprocessing, model implementation, and result generation.

Sara Alghofaily: Literature review, application design, and manuscript drafting.

Shada Alharbi: Experimental setup, model evaluation, and preparation of figures and tables.

Deem Alorainy: Mobile application development and testing.

Funding

The researchers would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University (www.qu.edu.sa) for financial support (QU-APC-2026).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The Draw-A-Person (DAP) children’s drawings dataset used in this study is publicly available at. .

Appendix

References

1. Choo, Agarwal, How, et al. Developmental delay: identification and management at primary care level. Singapore Medical Journal 2019; 60: 119–123. https://doi.org/10.11622/smedj.2019025

2. Shamrulismawi. Sketch Recognition based on Deep Learning for Children’s Psychological Development Monitoring. BSc thesis, Universiti Teknologi PETRONAS, Seri Iskandar, 2021.

3. Silver, Zinsser. The interplay among early childhood teachers’ social and emotional well-being, mental health consultation, and preschool expulsion. Early Education and Development 2020; 31: 1133–1150. https://doi.org/10.1080/10409289.2020.1785267

4. Afsahi, Alinaghi, SAS, et al. Chatbots utility in the healthcare industry: An umbrella review. Frontiers in Health Informatics 2024; 13: 200. https://doi.org/10.30699/fhi.v13i0.561

5. MohsseniPour, Parsakian, Mehraeen. Potential applications of ChatGPT in the healthcare industry of low- and middle-income countries. Shiraz E-Medical Journal 2025; 26(8): e161279. https://doi.org/10.5812/semj-161279

6. Chustecki. Benefits and risks of artificial intelligence in health care: Narrative review. Interactive Journal of Medical Research 2024; 13(7): e53616. https://doi.org/10.2196/53616 (Accessed 16 November 2025).

7. Mehraeen, Safdari, Mohammadzadeh, et al. Mobile-based applications and functionalities for self-management of people living with HIV: A scoping review. Studies in Health Technology and Informatics 2018; 248: 172–179. https://pubmed.ncbi.nlm.nih.gov/29726434/ (Accessed 16 November 2025).

8. Mehraeen, Noori, Nazeri, et al. Identifying features of a mobile-based application for self-care of people living with type-2 diabetes mellitus. Diabetes Research and Clinical Practice 2021; 171: 108544. https://doi.org/10.1016/j.diabres.2020.108544 (Accessed 16 November 2025).

9. Mehraeen, Safdari, SeyedAlinaghi, et al. A mobile-based self-management application: Usability evaluation from the perspective of HIV-positive people. Health Policy and Technology 2020; 9(3): 294–301. https://doi.org/10.1016/j.hlpt.2020.06.004 (Accessed 16 November 2025).

10. Burger. Human Figure Drawing and the General Mental Development of South African Children. PhD thesis, Nelson Mandela Metropolitan University, 2011.

11. Laak, Goede, Aleva, et al. The draw-a-person test: an indicator of children’s cognitive and socioemotional adaptation? Journal of Genetic Psychology 2005; 166: 77–93. https://doi.org/10.3200/GNTP.166.1.77-93

12. Goodenough. Measurement of Intelligence by Drawings. World Book Company, 1926.

13. Yama. The usefulness of human figure drawings as an index of overall adjustment. Journal of Personality Assessment 1990; 54: 78–86. https://doi.org/10.1080/00223891.1990.9673976

14. Naglieri, McNeish, Achilles. Draw-A-Person: A Quantitative Scoring System. The Psychological Corporation, 1991.

15. Beltzung, Pelé, Renoult, et al. Deep learning for studying drawing behavior: A review. Frontiers in Psychology 2023; 14: 992541. https://doi.org/10.3389/fpsyg.2023.992541 (Accessed 16 November 2025).

16. Khlaif, Naceur, Kherallahr. AI-driven classification of children’s drawings for pediatric psychological evaluation: An ensemble deep learning approach. Journal of Robotics and Control 2025; 6: 124–141. https://doi.org/10.18196/jrc.v6i1.23302 (Accessed 16 November 2025).

17. Shah, Khan, Alzubaidi, et al. A multimodal AI framework for interpreting children’s drawings. Studies in Health Technology and Informatics 2025; 327: 808–812. https://pubmed.ncbi.nlm.nih.gov/40380579 (Accessed 16 November 2025).

18. Widiyanto, Abuhasan. Implementation of the convolutional neural network method for classification of the draw-a-person test. In: Proceedings of the 2020 Fifth International Conference on Information and Computing (ICIC), pp. 1–6. https://doi.org/10.1109/ICIC50835.2020.9288651

19. Strikas, Valiakos, Tsimpiris, et al. Deep learning techniques for fine motor skills assessment in preschool children. International Journal of Education and Learning Systems 2022; 7: 43–49. https://www.iaras.org/home/caijels/deep-learning-techniques-for-fine-motor-skills-assessment-in-preschool-children

20. Patel, Polsley, Hammond. Using neural networks to distinguish children’s age with visual features of sketches. In: Proceedings of the 2022 ASEE Gulf Southwest Annual Conference, pp. 1–7. https://doi.org/10.18260/1-2--39220

21. Simfukwe, Youn, Jeong. A machine-learning algorithm for predicting brain age using Rey–Osterrieth complex figure tests of healthy participants. Applied Neuropsychology: Adult 2023; 32: 1–6. https://doi.org/10.1080/23279095.2022.2164198

22. Strikas, Papaioannou, Stamatopoulos, et al. State-of-the-art CNN architectures for assessing fine motor skills: a comparative study. WSEAS Transactions on Advances in Engineering Education 2023; 20: 44–51. https://doi.org/10.37394/232010.2023.20.7

23. Rakhmanov, Agwu, Adeshina. Experimentation on hand drawn sketches by children to classify draw-a-person test images in psychology. In: Proceedings of the Thirty-Third International Florida Artificial Intelligence Research Society Conference (FLAIRS-33), pp. 329–334. https://cdn.aaai.org/ocs/18457/18457-79397-1-PB.pdf

24. Rakhmanov. A novel algorithm to classify hand drawn sketches with respect to content quality. Springer, 2020, vol. 12252, pp. 179–193. https://doi.org/10.1007/978-3-030-58811-3_13

25. Tolosana, Ruiz-Garcia, Vera-Rodriguez, et al. Child-computer interaction with mobile devices: recent works, new dataset, and age detection. IEEE Transactions on Emerging Topics in Computing 2022; 10: 2042–2054. https://doi.org/10.1109/TETC.2022.3150836

26. Polsley, Powell, Kim, et al. Detecting children’s fine motor skill development using machine learning. International Journal of Artificial Intelligence in Education 2022; 32: 991–1024. https://doi.org/10.1007/s40593-021-00279-7

27. Immel Asghar. Getting started with image preprocessing in Python. Kaggle Notebook. https://www.kaggle.com/code/rimmelasghar/getting-started-with-image-preprocessing-in-python (Accessed 16 November 2025).

28. Howard, Zhu, Chen, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. https://arxiv.org/abs/1704.04861 (Accessed 16 November 2025).

29. Sandler, Howard, Zhu, et al. MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520. https://doi.org/10.48550/arXiv.1801.04381

30. Howard, Sandler, Chu, et al. Searching for MobileNetV3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1314–1324. https://doi.org/10.1109/ICCV.2019.00140

31. Tan. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv e-prints, 2019. https://arxiv.org/abs/1905.11946 (Accessed 16 November 2025).

32. Android Developers. Android Studio. Online documentation. https://developer.android.com/studio (Accessed 16 November 2025).