Abstract
The advances in social robotics have extended the possibilities of their use in different applications and have also broadened the groups of users who can benefit from those applications. An attractive population of users is children. Recently, there has been a trend towards research in the design of interactive systems for children, as well as in modeling the interaction between children and this type of system. In this work, we present a study carried out with the objective of analyzing the affective response of children when interacting with a robot using speech-based communication. We collected data through an experiment using a Wizard of Oz scenario where we induced different affective reactions in the participants. Two types of data were collected and analyzed: 1) a set of evaluators manually created annotations of emotions and attitudes to determine the distribution of emotions during the experiments and to evaluate how difficult it is to train automatic classifiers to discriminate different affective states from the acoustic properties of the children’s voices; 2) we used the children’s responses to a self-evaluation questionnaire about their perceptions of and preferences towards the robots, modeled with different personalities, to assess whether there are relevant differences across age ranges. We obtained a large children’s speech database that should be a valuable resource for the study of paralinguistic and interaction aspects. Despite the imbalance of the database, we were able to obtain good results for the classification of emotions and attitudes. We also found some relevant differences in how younger and older children notice the differences in the behaviors of the robots according to the modeled personality. Differences based on children’s age were also found in the preferences towards the two different robots.
Introduction
One of the population groups that benefits from the development of social robots is children [4]. The advantages and capabilities of social robots have been used to study child development [20]; child education [14]; rehabilitation [23]; the diagnosis and treatment of autism [16, 26]; and as mediators for interviews with children [36], to name a few. A fundamental aspect of long-term interactions between the robot and a child is the creation of a social bond that facilitates the acceptance of the robot in daily life activities and makes children feel more comfortable with it [35].
For developers and designers of social robots, an important source of information for building and adapting the robot’s capabilities that facilitate the creation of such social bonds is knowing how children perceive and react to the robot’s behavior and what their preferences are. Two types of data can be used to assess the reactions and preferences of children towards a social robot: objective data (such as gestures, body postures, linguistic and paralinguistic features of children’s speech collected from video and audio recordings), and subjective data (children’s self-reported information collected from interviews or questionnaires).
When using speech-based communication between children and social robots, researchers need to collect corpora with a large number of subjects and samples to develop algorithms able to recognize the user’s state from paralinguistic information. The automatic classification of the user’s condition will make it easier to design robots that adapt their actions and dialogues to the child’s detected state.
Currently, there are several large collections of children’s speech oriented to the speech recognition problem [6, 30] or the study of acoustic properties [19]. However, corpora with genuine interactions that comprise annotations of paralinguistic phenomena and self-reported assessments are still scarce. That is one of the significant limitations in the study of automatic recognition of paralinguistic information in the context of the child–robot interaction. To date, most of the research on paralinguistic information is primarily targeted to adults though children are one of the potential beneficiaries of computers with speech-based interfaces (e.g., for educational and entertainment purposes) [38].
Motivated by these limitations, we created a corpus that will allow the study of objective and subjective affective-related information in the context of the child–robot interaction. The purpose of this corpus is to capture different emotions and attitudes that can assist in inferring relevant aspects of the user’s state. With the aim of inducing emotions in children, we designed a setting to perform an interactive session between two Lego Mindstorms robots and a child. During the sessions, there were some simulated problems with the robots that were designed to produce different reactions in the participants. We recorded 174 children between 6 and 11 years old employing Wizard of Oz (WoZ) techniques in the production of the material (see Fig. 1).

From left to right: Child interacting with the robot, the technician operating the Text to Speech system, the technician controlling the robot’s movements, the facilitator, and a friend of the child.
Complementarily, we also collected subjective data related to the perceptions and preferences of the children towards the behavior of the robot modeled with two styles of interaction: a robot with an agreeable personality and its opposite, a disagreeable robot. Both personalities were modeled through the dialogues and actions that the robots used as responses to the child’s commands. The preferences of the children towards the two styles of interaction were collected using a multiple-choice questionnaire. We present the initial analysis and results obtained from the two types of data, paralinguistic and self-reported.
The rest of the paper is organized as follows: Section 2 presents related work. Section 3 describes the experiment used to collect the data. Section 4 presents the analysis and results of the paralinguistic data while Section 5 describes the analysis and results of the self-reported data. Finally, Section 6 presents some conclusions.
Different studies have sought to understand the reactions that interacting with a social robot causes in children. One of the most studied topics is the identification and classification of the different emotional responses elicited during child-robot interactive sessions. The interactive sessions can be recorded to extract and analyze the data generated during the interaction. One relevant source of user information for identifying the emotions elicited during child-robot interaction is the analysis of the characteristics of children’s speech.
In recent years, several works have reported the use of pattern recognition techniques to analyze particular speech features associated with different emotions. In particular, the collection and processing of a corpus of dialogues produced by 51 children interacting with the Sony robot Aibo [31] made the INTERSPEECH 2009 emotion challenge possible. In this challenge, the objective was to develop algorithms that classify emotions into a set of five classes (anger, emphatic, neutral, positive, remainder) or two classes (negative, idle). The winners of both sub-challenges employed a method based on acoustic features and Gaussian Mixture Models. The winner of the automatic classification on the 5-class task obtained an accuracy of 41.7% [15], and the winner of the 2-class task obtained an accuracy of 70.29% [7].
More recently, the same dialogue corpus of the INTERSPEECH 2009 emotion challenge has been used to propose new methods for emotion recognition. For example, the combination of Hidden Markov Models (HMM) and Deep Belief Networks (DBN) [17]; HMM and Artificial Neural Networks (ANN) [18]; or the use of transfer learning algorithms applying importance weights (IWs) within a support vector machine classifier to reduce the effects of the environment and speaker differences between the training and test data [11].
Equally important to the analysis of children’s emotional reactions is the assessment of how children perceive the different actions taken by the robot and what their preferences are among the various styles of interaction adopted by the robot. Numerous studies have reported evaluations of different robot characteristics according to children’s perception. The work presented in [37] focused on how children evaluate the physical appearance of robots in relation to distinct personality and emotional traits. The experiment consisted of displaying five robot images (e.g. human-like, animal-like or machine-like) to children, who completed a questionnaire for each image to collect their perceptions of different robot attributes. The reported findings include that children rated robots with a machine-like appearance as the most aggressive and angry. Those with a simple animal-like appearance were rated as the happiest robots. Also, animal-machine and human-machine robots were rated by children as the most friendly.
Tielman et al. [33] recently presented a study where they analyzed the preferences and opinions of children towards a robot that continuously adapts its expressive behavior to the detected emotional state (through a WoZ scenario) in the child. A group of 18 children played a quiz with a robot that adapts and displays its emotions through voice, body movement, body pose, and gesture. After that, the children interacted with another robot that showed small randomized body movements not related to emotion. Once every child finished the interaction with the two robots, they answered some questionnaires to provide their subjective opinions regarding their preferences to the robots. Obtained results show that children react more expressively and more positively to the robot which adaptively expresses itself than to the robot which does not. The authors conclude that enjoyment in children is higher with a robot which adaptively expresses itself through emotion and gesture than with a robot which does not do this.
The work presented in [34] investigated children’s attitudes towards robots with different anthropomorphic appearances and behaviors. A total of 578 children evaluated images and videos of various robots and answered a Likert-scale questionnaire to measure the social and physical attraction elicited in the children. The reported results describe that children prefer robots created with a moderate level of human likeness over those that have a highly human-like appearance but remain distinguishable from humans. Also, the children perceived the robots as more socially and physically attractive when the robots exhibited social cues in their behavior.
The analysis of both objective and subjective data from the interactive sessions would allow a better understanding of which features of artificial systems, such as social robots, affect the target users. In this sense, the work presented in the following sections is similar to that reported in [25]. In that study, video analysis (objective data) and self-reported measures based on questionnaires (subjective data) were used to evaluate the levels of interest, engagement, and involvement in children during the interaction with a robot responsible for instructing dance sequences. In contrast, as objective data we use paralinguistic information extracted from a dialogue corpus (in the Spanish language) obtained through interactive gaming sessions between 174 children and a Lego robot. We complemented this analysis with subjective, self-reported information collected through a questionnaire to assess how the children perceive the robot’s behavior and what their level of adherence is to each of the two different styles of interaction modeled in the robot.
Method
Scenario
We constructed a floor mat using illustration boards and a polystyrene base (see Fig. 1). We designed a mission in which each participant must guide the robot using spoken commands to go from the start point to the finish point of the track. Moreover, all along the way, the child must guide the robot to enter the stations and pick up some candies. The ultimate goal is to collect as many candies as possible during the mission, with no time limit. If the child makes the robot enter a station, then the child obtains the number of candies indicated by the color of the station. The child must also guide the robot to avoid the obstacles positioned along the track. If the robot knocks over an obstacle, the penalty is to return (from the accumulated candies) the number of candies associated with the color of the obstacle. Each child begins the game with four candies and has only one chance to guide the robot through the entire route. The child directs the robot through spoken instructions such as “go forward,” “stop,” and “turn left.” Before the test begins, the facilitator explains the rules to the child and tells him/her that the robot cannot leave the track and that it cannot return if it misses a station.
We placed a video camera in front of the children and the technicians. We used two laptops; one to control the Text to Speech (TTS) engine and the other to record the children’s speech. We controlled the movements via Bluetooth through a smartphone using the app provided by Lego. The “wizards” were seated behind the children during the session.
Before the beginning of each interactive session, the facilitator showed two identical Lego Mindstorms EV3 robots to the participants and told them that the robots understand human speech and that they behave and speak autonomously. To motivate the children to talk freely, the facilitator emphasized that the robots can understand and react to anything the child says. In reality, two technicians controlled the speech and behavior of the robots. One was responsible for controlling the robot’s movements using the Robot Commander mobile app. The other technician was in charge of generating the robot’s speech using a TTS engine running on a laptop. The laptop sends the synthetic voice via Bluetooth to a Logitech X100 speaker that hangs from the neck of each robot. The children wore a wireless Logitech H600 headset to capture their voices during the interaction. The robots used a repertory of 162 sentences divided into five groups: 1) greeting and introduction, 2) positive reinforcements, 3) negative reinforcements, 4) advice and instructions, and 5) requests for instructions. The two technicians were seated behind the participant and pretended to be merely spectators of the experiment. The third assistant is the facilitator, who explains the mission to the child, answers any questions, and helps the child in case of any difficulty.
We asked the children to interact with two robots, which are physically identical but have different personalities. One of the robots (called “Ever”) acts in a disagreeable, non-collaborative manner, ignoring some of the commands given by the child and playing some prompts blaming the child for errors (e.g., “You should practice more in giving instructions; we have just begun, and you have already lost some candies”). This robot also shows selfish behavior, taking all the credit when it enters stations (e.g., “Yes! Finally, I’ve succeeded. Thanks to me, we finally have some candy”). The other, agreeable robot (called “Paulina”) has a collaborative behavior and a happy mood. The utterances spoken by this robot encourage the child to get more candies. This robot is obedient and gives the child credit for his or her achievements (e.g., “Well done!”, “You are doing this very well”).
We designed this experiment based on the idea that entertaining and enjoyable games evoke a heightened level of emotional experience during play [12]. We expected that the child would engage in the activity and react emotionally to positive and negative events such as winning or losing candies, avoiding or knocking over obstacles, and being congratulated or reproached by the robot. Some authors suggest that the combination of positive and negative emotions during the challenge is the key to generating successful gaming experiences [13, 24]. We expected that children would show negative emotions during the interaction with the disagreeable robot and positive attitudes and emotions towards the agreeable robot. Moreover, we also expected that most of the children would perceive the differences between the two robots, and that their preferences and level of adherence would be greater towards the agreeable robot.
During each session, the child interacts with both robots. Halfway through, the facilitator interchanges one robot with the other. During each recording session, there are two children in the room so as to have them feel supported by each other. Both take turns interacting with the robot. The robot that starts the mission is chosen randomly for the first child; sometimes the first one is the robot with the collaborative attitude and the second one is the robot with the non-collaborative attitude, and sometimes vice versa. For the second child, the robot that starts the mission is always the one that ended the mission with the first child.
Participants
The children involved in this experiment, of both sexes, ranged in age from 6 to 11 years old (mean 8.62, standard deviation 1.73). All the participants attended primary school. According to their teachers, they showed cognitive, emotional, and social development appropriate for their age. Given that the interaction with the robot was speech-based, an essential requirement was that all subjects had the ability to speak fluently. The native language of all the participants was Mexican Spanish.
The recruitment process was carried out in two ways. The first one was to contact parents and ask the children if they wanted to participate in a game. If they were willing to participate, we asked the parents’ permission for the child to take part in the experiment. We recorded these subjects in our lab facilities at Tepic, Nayarit. In the second recruitment task, a primary school in the town of Ruiz, Nayarit, allowed us to perform the experiment within its facilities and record the sessions with their students during school hours. Out of the total number of participants, 12% lived in an urban area (Tepic), and 88% lived in a rural area (Ruiz).
Data collection
The first group of children, those that participated in our lab, was formed by 21 children. The second group, participating at the primary school, involved 153 children making a total of 174 participants. Their ages ranged from 6 to 11 years old. The recordings of the interactions were made using a video camera (Sony HDR-CX405, 9.2 megapixels) that recorded a frontal view of the child’s actions during the session. The recording of the child’s voice was made using a wireless headset (Logitech H600). This headset uses a USB receiver antenna with a range of up to 10 m. We connected the antenna to a Dell computer with the Windows 8.1 operating system, which recorded the voice using the Audacity v2.1.1 software. The facilitator told the child that the robot can hear what he/she says into the headset microphone. Before the mission starts, the facilitator introduces the robots as a new development of a research center that needs to be tested and tuned. Then, she explains the rules to the children.
Most children completed the mission without presenting any difficulties that prevented the execution of the interactive session. Six-year-old participants were a little more nervous and shy than older children. It was necessary, in some cases, to motivate and encourage them to initiate the mission. In these cases, the explanation of the instructions and rules was much longer, and it was necessary to repeat some instructions more than once.
At the end of the session, the children were given a multiple-choice questionnaire to report whether they noted any differences in the behavior of the two robots, the reason they thought the robot sometimes did not follow their instructions, and which of the two robots they would invite to play in a future game session. The facilitator provided the participants with the questionnaire and gave them the instructions to answer it.
After completing the interactive sessions with all the participants, we obtained the set of audio recordings. Our database consists of 2,093 min of audio recordings. On average, the participants took 12 min to accomplish the mission. As shown in Table 1, younger users took more time to guide the robot from the start point to the finish point. We included in the study 80 girls and 94 boys.
Minimum (Min.), maximum (Max.), average (Avg. Dur.), and standard deviation (Std. Dev.) of minutes taken to accomplish the mission according to age of the participant
We used the labels obtained from the annotation process to train models for the recognition of two paralinguistic aspects: the classification of emotions and the classification of attitudes. We tested three acoustic feature sets to characterize the speech recordings and four machine learning algorithms in order to evaluate the difficulty of both classification tasks.
Acoustic feature extraction
We acoustically characterized the audio samples of this database using the software openSMILE [8], which allowed us to extract a large number of Low-Level Descriptors (LLDs). openSMILE takes as input a configuration file where the user specifies the signal processing procedures to be applied as well as the LLDs to be extracted. We used three configuration files previously proposed to model paralinguistic aspects in speech. We decided to use these large feature sets to explore a broad range of different acoustic features and identify the most useful for the modeling of paralinguistic phenomena.
Set of acoustic features (IS-2009)
Set of acoustic features (IS-2010)
Set of acoustic features (IS-2011)
We used the software Weka [10] to apply a feature selection procedure. We used the combination of Subset Evaluation and Best First as evaluator and search methods, respectively. We applied these methods to the three feature sets IS-09, IS-10 and IS-11. Table 5 shows the number of selected features from each feature set.
Number of selected attributes
Using the selected features, we carried out the training of models for automatic classification. We applied the machine learning algorithms SVM (polynomial kernel), Naive Bayes, Random Forest, and Bagging (REPTree) as implemented in Weka [10]. We evaluated the trained models using 10-fold cross-validation.
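As an illustration of the evaluation protocol, the following minimal sketch shows how 10-fold cross-validation partitions a labeled dataset so that every sample is used for testing exactly once. It is not the Weka implementation used in this work; the function names `stratified_folds` and `train_test_splits` are hypothetical, and the choice to stratify the folds by class is an assumption of the sketch.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign every sample index to one of k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin, continuing where the
        # previous class stopped so that fold sizes stay balanced.
        for i, idx in enumerate(indices):
            folds[(offset + i) % k].append(idx)
        offset += len(indices)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per held-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test
```

In each of the k rounds a model would be trained on the `train` indices and scored on the `test` indices, and the k scores would then be combined into a single estimate.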
Emotion classification
In the case of emotion classification, we adopted the six basic emotions suggested by Ekman, plus the label neutral, used when there is no noticeable emotion, and the label none, used when there is not enough evidence to determine what emotion is being displayed. Table 6 shows the number of samples for each emotion.
Number of samples per emotion category
Table 7 shows the classification results using the F-measure as the quality metric. The experiments reported in this table were obtained using the eight classes of the corpus. We can see that it is a difficult task, mainly because we have a very unbalanced corpus. We have many samples from some classes (happiness, neutral) and very few for others (disgust, none, sadness).
Emotion classification 8 classes (F-measure)
The best result was obtained using the classifier Random Forest and the acoustic features from the IS-10 set. We can also observe that the classifier Bagging got good results.
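For readers less familiar with the quality metric, the sketch below computes a per-class F-measure (the harmonic mean of precision and recall) and a support-weighted average over classes. Whether the reported figures are weighted averages is an assumption of this sketch, and the function names are hypothetical.

```python
from collections import Counter


def f_measures(y_true, y_pred):
    """Return a dict mapping each class to its F-measure."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        predicted = tp[cls] + fp[cls]
        actual = tp[cls] + fn[cls]
        precision = tp[cls] / predicted if predicted else 0.0
        recall = tp[cls] / actual if actual else 0.0
        total = precision + recall
        scores[cls] = 2 * precision * recall / total if total else 0.0
    return scores


def weighted_f_measure(y_true, y_pred):
    """Average the per-class F-measures, weighting each class by its true frequency."""
    scores = f_measures(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[cls] * n for cls, n in support.items()) / len(y_true)
```

With a toy set of nine neutral samples and one sadness sample, a classifier that always answers neutral scores zero on sadness while its weighted average stays high, which is why the per-class view matters on an unbalanced corpus.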
With the objective of assessing the improvement in the classification performance achieved by reducing the number of classes, we eliminated the three categories with fewer samples. Table 8 shows the classification results. We can see that the F-measure increased in all the cases. The best result was obtained using the Bagging classifier and the acoustic features from the set IS-11.
Emotion classification 5 classes (F-measure)
Class imbalance is known to make classification accuracy misleading. To study the effect of the unbalanced classes on emotion classification performance, we rebalanced the dataset using the Resample algorithm. This procedure produces a random subsample of a data set using sampling either with or without replacement [10].
Table 9 shows the classification results after applying this procedure with a bias towards a uniform class distribution, sampling with replacement, and keeping 30% of the samples from the complete dataset (8 classes).
Emotion classification 8 classes Resample 30% (F-measure), 655 samples per class
Table 10 shows the classification results after applying this procedure with a bias towards a uniform class distribution, sampling with replacement, and keeping 50% of the samples from the reduced data set (5 classes).
Emotion classification 5 classes Resample 50% (F-measure), 1,685 samples per class
After resampling the datasets, we observed a significant improvement in the classification results of only two classifiers, Random Forest and Bagging. In the case of 8 classes, the best F-measure increased from 55.5 to 74.9. In the case of 5 classes, the best F-measure increased from 58.7 to 79.8.
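The resampling step can be sketched as follows. This is a simplified stand-in for the Weka filter, not its actual implementation: the function name `resample_uniform` is hypothetical, and the sketch assumes that the target size is split equally among the classes, which is consistent with the fixed "samples per class" counts given in the table captions.

```python
import random
from collections import defaultdict


def resample_uniform(samples, labels, percent, seed=0):
    """Draw a class-balanced subsample, with replacement, of `percent` % of the data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target_total = int(len(samples) * percent / 100)
    per_class = target_total // len(by_class)  # equal share for every class

    out_samples, out_labels = [], []
    for label in sorted(by_class):
        # Sampling with replacement duplicates items of rare classes
        # and thins out the frequent ones.
        picks = rng.choices(by_class[label], k=per_class)
        out_samples.extend(picks)
        out_labels.extend([label] * per_class)
    return out_samples, out_labels
```

Because rare classes are oversampled by duplication, copies of the same utterance can end up in both the training and the test folds during cross-validation, which should be kept in mind when reading the scores obtained after resampling.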
In the case of attitude classification, we have the four attitudes plus the class none, which the labelers used when they did not identify any of the attitudes presented in the drop-down list. The number of samples for each attitude is shown in Table 11.
Number of samples per attitude category
Table 12 shows the classification results for attitudes. We obtained better results for attitudes than for emotions.
Attitude classification 5 classes (F-measure)
Table 13 shows the results after applying the resampling procedure with a bias towards a uniform class distribution, sampling with replacement, and keeping 30% of the samples.
Attitude classification 5 classes Resample 30% (F-measure), 988 samples per class
In addition to the automatic analysis of the emotional reactions elicited during the interactive session, a second analysis from this experiment was to assess whether the different ages of the participants lead to significant differences in the self-reported perceptions and preferences towards the two robots with different personalities. As mentioned in Section 3.3, a multiple-choice questionnaire was designed to collect the subjective feedback from every participant after the interactive session with the robot. The collection and analysis of these subjective data are useful to assess whether the children clearly note the different behaviors of the robot associated with the different modeled personalities.
Moreover, the questionnaire included a section to collect children’s feedback regarding their preference for continuing to interact with either of the two robots (i.e. level of adherence). Two questions were formulated about the characteristics of the perceived robot’s behavior and one question about children’s preferences to interact again with the robots (see Table 14).
Asked questions after the interactive session with the robots
Of the 174 recruited participants, 1 child decided not to conclude the session and 9 did not answer all the questions. The data of these 10 participants were excluded from the analysis, leaving a final number of 164 records to assess. Because the sample of participants at each specific age was small, the data were grouped into three age ranges as presented in Table 15. To analyse the independence of age from the perceived robot’s behaviour and the level of adherence towards the two robots we conducted a chi-square analysis (p < 0.05). Given that the expected frequencies in the contingency table were less than 5 in more than 20% of the cells, we used Fisher’s exact test for the two sections of the questionnaire.
Number of participants grouped into three age ranges
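The check on expected frequencies that motivates the use of an exact test can be sketched as follows. The code computes the expected counts of a contingency table under independence, the chi-square statistic with its degrees of freedom, and the share of cells whose expected count falls below 5. The function names are hypothetical, the sketch assumes that every row and column total is non-zero, and the tables used to exercise it are made-up numbers, not the questionnaire data.

```python
def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Return (statistic, degrees_of_freedom) for a rows x columns table."""
    expected = expected_counts(table)
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def share_of_small_expected(table, threshold=5.0):
    """Fraction of cells whose expected count is below the threshold."""
    cells = [exp for row in expected_counts(table) for exp in row]
    return sum(1 for exp in cells if exp < threshold) / len(cells)
```

A table with three age groups and two answer options has (3 - 1)(2 - 1) = 2 degrees of freedom, and one with three answer options has 4. When `share_of_small_expected` exceeds 0.2, the chi-square approximation is usually considered unreliable and an exact test is preferred.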
Perception of robot’s behaviour
The obtained results regarding children’s perception of the robot’s behaviour were mixed. The responses to question 1 indicated a statistically significant relationship between age and the perception of different behaviours in the robots with different personalities: the proportion of children aged 6–7 who reported that they did not note the differences in the behaviour of the two robots was greater than among the children aged 8–9 and 10–11 (χ2(2) = 8.5; p = 0.012). On the other hand, the responses to question 2 indicated no statistically significant relationship between age and the reasons given for why the robot sometimes did not follow the provided commands (χ2(2) = 0.68; p = 0.736). The plots in Fig. 2 show the results for the two questions related to the perceived robot’s behaviour.

Graphical results of the perceived robot’s behaviour.
The relationship between age and the reported preferences about which of the two robots the participants wished to continue interacting with in future sessions was statistically significant: χ2(4) = 18.44; p = 0.001. Although most of the children in all age ranges reported a preference for Paulina, a relevant percentage of children aged 6–7 also selected Ever. The percentage of children who selected both robots was also greater among children aged 6–7, and the portion of the participants aged 8–9 who selected this option was greater than that of those aged 10–11 (see Fig. 3).

Graphical results of the reported preferences about continuing to interact with the robots.
The classification of emotions with this database proved to be a difficult task, mainly because of the strong class imbalance. We observed an improvement in the classification when we eliminated the classes with fewer samples, and especially when we applied a balancing procedure; in that case, the classification results improved considerably. This characteristic of the database could be considered a drawback; however, real-world scenarios show a similar distribution, with many neutral samples and fewer samples of other emotions [1, 32].
The three feature sets tested in the experiments showed similar performance in the classification task. To discover which one was the best, we averaged the results of each feature set across all the experiments carried out with the four classification algorithms. For emotion classification, we obtained the best results with the set IS-10 (56.33 F-measure). For attitude classification, we obtained the best results with the set IS-11 (65.55 F-measure).
We also compared the classification results obtained with the different machine learning techniques. We averaged the results of each algorithm across all the classification tasks with the three different feature sets. We found that Random Forest obtained the best results. This algorithm was the most appropriate to model paralinguistic information in children’s speech. The Bagging algorithm also showed good results.
Regarding speech corpora, we have contributed a large corpus of children’s speech recorded during interaction with a robot. Compared with similar works on the creation of resources for the study of paralinguistic phenomena in children [3, 32], we have a greater number of subjects.
This new resource has exciting features for the researchers working in the area of human-computer interaction, language technology, and, in particular, computational paralinguistics. The emotions and attitudes captured in the recordings were generated spontaneously. Spontaneity is a valued characteristic of affective corpora because acted-out emotions are not suited to the implementation of applications in realistic conditions.
The annotations made by human evaluators allow the training of supervised models for the classification of emotions and attitudes. Furthermore, these same data can be used to model other elements of paralinguistic information such as children’s gender or sex. The database is available to the research community for further analysis of the collected data and to conduct further experiments.
Regarding the analysis of the children’s self-reported data, we can argue that taking the age of the children into consideration is relevant when designing social robots intended to interact with children. The clearest differences in the preferences towards the different styles of interaction were between children aged 6–7 and children aged 8–11. A chi-square analysis was also conducted to assess the dependency between the gender of the participants and the provided responses, but no statistically significant relationship was found for any of the questions (p > 0.05).
The youngest children were the group with the greatest percentage of participants that did not note the differences in the styles of interaction (personalities) modeled in the two robots. If an important objective of a robot is to clearly convey one type of personality while interacting, it seems that additional channels of communication should be used when interacting with children aged 6–7. If the robot has the capability to include the modeling of facial expressions, there is the possibility to use and exaggerate some expressions to get better recognition accuracy [2] of the intended style of interaction.
Regarding the level of adherence, most of the children preferred future interactions with the agreeable robot. As with the answers about the identification of different styles of interaction, a notable proportion of the youngest children selected the disagreeable robot to interact with in future sessions. The percentage of participants who selected both robots decreased from the youngest to the oldest participants. An interesting finding is that none of the participants reported that they would not interact again with either of the robots. This reveals the interest of children of all ages in the playing session and their willingness to maintain social interactions with a robot. Nevertheless, the differences found across age ranges should be taken into account to maximize the positive effects of this technology.
Acknowledgments
This research work has been carried out in the context of the “Cátedras CONACyT” program funded by the Mexican National Research Council (CONACyT). This work was partially funded by CONACYT under the Thematic Networks program (Language Technologies Thematic Network project 260178, 271622, 281795).
