Abstract
BACKGROUND:
With alternative and augmentative communication (AAC), people with complex communication needs (CCN) become more independent and express themselves to the fullest extent possible. In finding the best AAC solution, mobile technology and ICT (information and communications technology) provide new opportunities every day. Although a wide range of assistive technologies (AT) is available, matching person and technology (MPT) and setting the optimal parameters individually are essential. For an AAC solution to be optimal for letter-based communication it has to be easy to use, comfortable, and fast.
OBJECTIVES:
For people with severe speech and physical impairments (SSPI), one method to interact with a computer is using a head-movement-driven mouse. There are different on-screen devices available for typing via head movements, and much work has been done to compare them in terms of the time required for typing. Dasher is one of the fastest software tools with a setting option for zooming speed. An optimistic initial model (OIM) based on a Markov decision process (MDP) has already been shown to optimize this zooming speed for increasing the typing efficiency of persons without SSPI. Since this reinforcement learning component has so far been tested on neurotypical users only (e.g., research assistants), in the present case study we involved a user with SSPI. Our question was whether the algorithm can optimize its own parameters in these circumstances.
METHODS:
To document all relevant aspects of the human-computer interaction, log files, screen videos, and webcam videos were collected. These input data were later analyzed with mathematical methods based on the feedback of the OIM reward system. In addition, manual interpretation using semi-supervised machine video annotation was carried out for analyzing screen events and user behaviors.
RESULTS:
The human annotations of the recorded video data indicated that the participant had at least two different typing strategies. In contrast with the data from a previous study, in our study the artificial intelligence (AI) component was unable to find optimal settings similar to those attained when only one typing strategy was used by subjects without SSPI.
CONCLUSIONS:
To maximize communication efficiency, a more complex assistive tool may be more appropriate. Closer cooperation between different areas of expertise is suggested in order to achieve solutions employing various methods.
Introduction
People express their psychological states, including intentions, via verbal and non-verbal communication. Communication can take many forms: speech, gestures, facial expressions, and a remarkable diversity of multimodal signs. According to the mission statement of the American Speech-Language-Hearing Association (ASHA) from 1991, communication is the essence of human life, a guiding principle, and all people have the right to communicate to the fullest extent possible [1]. As Light and Gulens put it, “…people cannot act as the primary causal agents in their lives without being able to communicate effectively with others…” [2]. Part of the human population is unable to use speech as the primary communication channel. Nevertheless, the essential goal regarding their communicative competence is to be on a par with typical speakers. This creates the need for augmentative and alternative communication (AAC) methods and assistive technology [3]. The term assistive technology device is defined as “any item, piece of equipment, or product system, whether acquired commercially off the shelf, modified, or customized, that is used to increase, maintain, or improve functional capabilities of a child with a disability” [4]. Assistive technologies (AT) have a promising future in the training and treatment of people with complex communication needs (CCN) [5] as they offer alternative routes to successful information exchange. Discovering methods to increase user performance is currently an active area of research [6, 7, 8].
Due to the variability of causes leading to CCN, the case study is a preferred method to approach questions of assistive technology [9]. Finding the most suitable setting through systematic trial and error or intelligent guesses, and thereby matching person and technology, can greatly improve the communication skills, quality of life, and nature of social interactions of the subjects affected [10, 11].
Speech rates in normal conversation are around 150–200 words per minute (wpm); skilled typists can achieve a rate of 30–40 wpm [12] and in chat rooms a speed of roughly 40 wpm is enough for enjoyable conversation [13]. In comparison, AAC can be more than 10 times slower than speech [14, 15]. Different methods can be used in AAC to transfer the most information in a given amount of time [16]. A significant amount of research has been conducted in the last ten years in order to increase this rate using technology-assisted AAC [2, 7, 17, 18, 13]. In trying to achieve an appropriate match between person and technology, one faces new challenges in each individual case. An additional difficulty in using assistive technologies is the fast rate of change of the relevant technologies: an estimated 30 percent of acquired AT devices are discarded within a year [10].
In this paper we present an AT-supported AAC solution. The purpose of the case study presented here was to determine whether a Reinforcement Learning (RL) component is able to optimize the settings of a typing tool in order to minimize the time needed to compose messages. We employed a letter-based predictive typing algorithm which was operated using a head-driven mouse. The participant of the study was a single subject with severe speech and physical impairments (SSPI). To examine the personal typing pattern of our participant with CCN, head movement and mouse position data were recorded and analyzed.
Experimental background and pilot study
The present case study is based on the work of Lőrincz and Takács [17], where an RL architecture was used to assist users during a typing task. The task was administered by a probabilistic word selection tool controlled by a webcam-based mouse [17, 19].
The first component of this complex system was a letter-based predictive on-screen keyboard software called Dasher; it was used for the typing task [17, 18, 13]. As Wills and MacKay characterize it [18], Dasher is a text input tool based on inverse arithmetic coding which is faster and more accurate than on-screen keyboards operated via gaze control [20]. Dasher can be operated without clicking. Search among letters is executed by vertical mouse movements; adding and deleting characters is achieved by resting, and altering typing speed is accomplished by horizontal mouse movements [19]. In using Dasher the most important adjustable parameter is relative zooming speed. This value can be read off from the screen as the distance of the cursor from Dasher’s crosshair. This value can be changed by the user while the typing process is paused. More information about setting up and operating Dasher is available in the online documentation [18, 20, 21].
In Lőrincz and Takács’s study [17] the original open source Dasher software was supplemented by an algorithm responsible for adjusting the zooming speed to optimize the efficiency of the typing process. This method was based on a Markov decision process (MDP) [19] combined with an optimistic initial model (OIM) from Szita et al.’s earlier research [22, 23]. Szita et al.’s algorithm was developed for exploration and reinforcement learning in MDPs, and it integrates concepts from other advanced exploration methods. The key component of this algorithm is an OIM that either explores new information that helps to make the model more accurate, or follows a near-optimal path [17]. The algorithm adjusted zooming speed in a time frame of five seconds. This adjustment was based on recording and estimating user actions in every time step. User actions in turn were defined by the actual state of the screen, or one of the neighboring states. The RL algorithm was rewarded according to the following rule: the number of typed letters minus the number of deleted letters during each time step [18]. Given how the OIM worked, this setup presumed one individual strategy for each user to handle Dasher.
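The reward rule described above can be illustrated with a minimal sketch. It assumes that each five-second time step is summarized by two counts; the names `TimeStep`, `step_reward`, and `cumulative_reward` are hypothetical and are not taken from the Dasher or OIM source code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TimeStep:
    """Hypothetical summary of one five-second adjustment window."""
    typed: int    # letters added during the window
    deleted: int  # letters removed during the window


def step_reward(step: TimeStep) -> int:
    """Reward for one window: typed letters minus deleted letters."""
    return step.typed - step.deleted


def cumulative_reward(steps: Iterable[TimeStep]) -> int:
    """Sum of the per-window rewards over a sequence of windows."""
    return sum(step_reward(s) for s in steps)


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    session = [TimeStep(typed=3, deleted=0), TimeStep(typed=1, deleted=2)]
    print([step_reward(s) for s in session], cumulative_reward(session))
```

Under this rule a window in which more letters are deleted than typed yields a negative reward, so a zooming speed that causes frequent corrections is penalized.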
The second component of the complex system was a software tool called MouSense, which can replace mouse interactions [17]. It combines head detection built on Haar wavelets with a tracking solution based on optic flow [19]. While MouSense is running, it tracks facial key points with large changes in the streamed webcam images, thus allowing cursor control by head movements.
Finally, data collection and analysis were enhanced in order to obtain more information on user behavior. Enhancement was achieved by human and machine analysis based on face images. For this purpose, webcam videos of the participants’ heads were recorded.
There exist different tools for machine analysis of face videos, gathering information on head position, movement, and estimation of affective states [24, 25]. In addition, multimedia annotation tools assist labeling by human analysts; these software tools are generally designed for user-friendly crowdsourced use [26, 27], and they are very helpful in annotating large-scale picture and video databases. Crowdsourced annotations usually serve to pre-select subsets of a single database for expert annotators; they are also used as teaching databases for machine learning in computer vision applications [28, 25, 29]. Annotation can be done with different goals in mind; one is identifying moving humans or vehicles in video frames, and tracking their motion [30]; another is labeling frame series based on one or more video sources [17]. An example of the latter is registering the viewer’s facial expressions while watching a cartoon animation.
This is exactly what we needed in our study; therefore we used an annotation tool called LabelMovie. LabelMovie can display two different video streams at the same time, and it also has a function for tagging frame segments [17]. It has already been used in experiments where two videos had to be annotated simultaneously [31, 24]. In this case two annotators worked independently, and as a reliability check the two labelings were compared. Accepted annotations had to reach a significant inter-rater agreement: Cohen’s kappa coefficient
Screenshot of the LabelMovie tool.
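The inter-rater reliability check mentioned above relies on Cohen’s kappa, which compares the observed agreement p_o of two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that standard formula, assuming each annotator supplies one label per frame; the function name and the sample labels are illustrative and are not part of LabelMovie.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty frame set")
    n = len(labels_a)
    # Observed agreement: share of frames with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Illustrative labels only, not annotations from the study.
    rater_1 = ["write", "write", "delete", "think"]
    rater_2 = ["write", "delete", "delete", "think"]
    print(round(cohens_kappa(rater_1, rater_2), 3))
```

A value of 1 indicates perfect agreement, and a value near 0 indicates agreement no better than chance.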
In the original study by Lőrincz and Takács [17], the above framework was used by two volunteers for 5 sessions. Each session took about 100 minutes of copying approximately 3000 characters of printed text [17].
We reproduced and extended this study using a single subject with severe speech and physical impairments (SSPI). Our participant was a 32-year-old man with tetraparesis spastica and CCN. His normal physical development slowed down or almost stopped in the fourth month after birth, due to an infection. He is only capable of minimal hand and head movements. He can use letter-based communication but declines to use prestored messages.
Before he started using MouSense he had been accustomed to using a computer which he controlled by a head-mounted stick. In this setup both pressing a key and moving the keyboard on the table to reach the needed key took considerable time. The head-mounted stick was used for communication as well. A paper-based board was held by the partner while the subject pointed out letters one by one. Composing messages this way took an extremely long time and was exhausting. Consequently, only one-to-two-word messages were composed most of the time.
Dasher screen discretization: the screen is split horizontally into 3 sections, and the middle horizontal section is then partitioned into 3 vertical subsections.
In October 2011, he was introduced to MouSense with a QWERTY on-screen keyboard program called Free Virtual Keyboard [33]. He learned to use the new device within one week and it changed his communication radically. Replacing his low-tech devices by high-tech ones plus adequate support made his communication faster, more efficient and more independent [10]. He started to formulate longer sentences and his messages acquired more appropriate content.
This change in his communication revealed that he had spelling and grammar problems. This finding was consistent with studies that indicated difficulties in achieving literacy skills for people with SSPI [34].
Volunteers in the original study were already familiar with both Dasher and MouSense before the experiment. Our subject, however, was familiar only with MouSense prior to the study, so we had to ensure that he became a capable user of Dasher. Even though Dasher is freely accessible [18, 33], the Hungarian version is not currently used as an assistive tool: clients of the Hungarian Bliss Foundation, the leading institute for helping people with CCN in Hungary, have not accepted it despite recommendations by the Foundation. Instead, the overwhelming majority of the clients in the last ten years accepted the headmouse for controlling PCs; this is evidenced by more than 50 case studies related to the topic (unpublished documentation of the Hungarian Bliss Foundation).
The pilot study of our experiment consisted of six practice sessions of different durations. These sessions took place from June 2012 to August 2013. They were designed to help the participant understand the operation of the program and to develop a useful typing technique without the effect of the OIM algorithm. An assistant helped to set up the test conditions: starting the applications and placing the sample text under the display. To make reading as easy as possible, different formats, font sizes, and text lengths were tried out during this period. The participant was asked to provide feedback on readability and clarity. Display characteristics of the sample text were adjusted according to the participant’s feedback. As a result of this procedure, the most suitable setting for easy readability turned out to be a maximum of 2 lines at a time, written in Arial with a font size of 28–32 pt.
Non-convergence of the OIM algorithm.
The pilot study revealed that the participant was able to work out an efficient strategy to use Dasher; however, he still found Free Virtual Keyboard more effective. In the first of the six sessions, his average character typing time was 16.0 seconds, which improved to 4.2 seconds by the last practice session. The pilot study yielded a remarkable improvement; still, typing the 3000-character standard used in [17] would have taken at least 210 minutes at the improved rate, instead of the 100 minutes found in the Lőrincz and Takács study. Hence the experiments were split up into several shorter sessions, with the aim of reaching the necessary typing speed.
In this section we present the characteristics of our experimental environment and the method of data analysis used. Based on the pilot study, the experimental settings used in [17] had to be altered. An additional assistant was required to set the printed text and to start and stop data collection. To limit the length of experimental sessions and thus prevent a high mental and physical burden, the 20-minute time phases were kept, but the number of periods per session was reduced to 2 (from 5 used in the study of Lőrincz and Takács). To compensate for this reduction, the number of sessions was increased to 16 (from 5 in the Lőrincz and Takács study). The sessions began in September 2013.
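The 210-minute estimate above follows from multiplying the text length by the average time per character. A minimal sketch of this arithmetic is given below; the helper name `minutes_to_type` is hypothetical, and the figures are the ones quoted in the text.

```python
def minutes_to_type(num_chars: int, seconds_per_char: float) -> float:
    """Estimated typing time in minutes at a constant per-character rate."""
    return num_chars * seconds_per_char / 60.0


if __name__ == "__main__":
    text_length = 3000  # characters per session in the original study [17]
    first_session = minutes_to_type(text_length, 16.0)  # first practice session
    last_session = minutes_to_type(text_length, 4.2)    # last practice session
    print(f"first practice rate: about {first_session:.0f} minutes")
    print(f"last practice rate:  about {last_session:.0f} minutes")
```

At the initial rate of 16.0 seconds per character the same text would have required roughly 800 minutes, which illustrates the size of the improvement as well as the remaining gap to the 100 minutes of the original study.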
Annotation of the typing process is on the left, the used label set is in the middle, zoomed in examples of the effective and ineffective writing periods are on the right.
In the earlier study [17] the OIM took 2000 time steps during the typing of approximately 15,000 characters, and the convergence was evident after 1200 time steps. To minimize the typing necessary for convergence, the state of the OIM was verified after every fourth experimental session in order to finish sessions as soon as convergence occurred.
Different types of data were collected during the research: log files from MouSense (including cursor positions), those from Dasher (containing typing and OIM states), and webcam videos.
For further investigation of the subject’s typing activity, Dasher screen videos were logged in the last four sessions in addition to the webcam images of the participant’s face. The face and screen events were coupled and synchronized frame by frame, and the screen videos were then analyzed by human annotation.
Discretization of cursor positions for labeling screen videos was different from that used in [17]. Instead of splitting the screen into a five-by-three matrix, which was related to the OIM algorithm’s reward system [17], the division of the screen we used was based on the cursor positions of the participant’s typing actions. This led to a classification of the screen into three vertical sections and the middle into three horizontal ones (illustrated by Fig. 2). For labeling the screen videos as single inputs the annotators applied the following rule: no frame was allowed to be marked with multiple labels, and not all frames needed to be labelled. The label set was: ‘write’, ‘delete’, ‘up search’, ‘down search’, ‘think’.
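The labeling rule above (at most one label per frame, unlabeled frames allowed) can be checked mechanically. The sketch below assumes that annotations are stored as (start frame, end frame, label) segments with inclusive frame indices; this storage format and the function name are assumptions made for illustration and do not describe LabelMovie’s actual file format.

```python
from typing import List, Tuple

# Label set as listed in the text.
LABELS = {"write", "delete", "up search", "down search", "think"}

Segment = Tuple[int, int, str]  # (start_frame, end_frame_inclusive, label)


def validate_segments(segments: List[Segment]) -> bool:
    """Return True if the segments obey the single-label-per-frame rule.

    Raises ValueError for an unknown label, a reversed segment, or an
    overlap. Gaps between segments are allowed, because not every frame
    has to be labeled.
    """
    ordered = sorted(segments)
    for start, end, label in ordered:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        if start > end:
            raise ValueError(f"segment ends before it starts: {(start, end)}")
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end:
            raise ValueError(f"frame {next_start} carries more than one label")
    return True


if __name__ == "__main__":
    # Illustrative segments only; frames 41-59 are deliberately left unlabeled.
    print(validate_segments([(0, 40, "write"), (60, 75, "up search")]))
```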
Comparison of different analyses: Head pose coordinates and annotations are visualized on the same timeline with two different labels.
The event values of the OIM did not show strong patterns either after 500 time steps or at the end.
Pairs of screen and webcam face videos were annotated as a second evaluation of the typing process. Here a different set of labels was used: ‘cursor on the right and head up’, ‘cursor on the left and head down’. After labeling, temporal patterns among the momentary labels were sought, such as the cursor being moved in a circle.
In our study the OIM took nearly 4000 steps by the end of the data collection, during which the participant typed only half as many characters as was planned. In addition, he spent twice as much time on typing as the subjects of the original study.
As mentioned above, in the original experiment the OIM took about 2000 steps, and convergence was evident after 1200 steps [17]. In our study, however, the OIM state was checked after every fourth session, and after 2000 steps there was no obvious sign of convergence. This did not change after another 8 sessions (see Fig. 3). In 16 sessions altogether the participant typed nearly 7000 characters, and it seemed that the AGI architecture was unable to optimize his performance.
To understand the reasons for the lack of convergence, human and machine-based interpretations of the typing process were compared, based on the webcam and screen videos. A number of face tracking methods (Haar filters, hidden Markov models, dynamic time warping, global alignment kernel, facial action units) and facial expression recognizer software [25, 35] failed to work on the participant’s face due to his left-tilted head position and his mouth being open most of the time. In order to find personal behavioral patterns a two-stage manual annotation process was introduced: in the first stage, labeling was based on the screen videos; subsequently it was repeated using both face and screen videos.
After labeling the screen videos, different patterns of typing behavior were sought in the time series. Two types of typing behavior were identified and named the effective and ineffective writing stages. The ‘effective phase’ label was assigned when the cursor’s basic position (presented in Fig. 4) was in the section labelled ‘write’. Identification of the ineffective stage (presented in Fig. 4) was based on the cursor position frequently switching between all of the sections.
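In the study these phases were identified by human annotators. As a rough illustration of the two criteria only, the following sketch classifies a window of per-frame labels by how often the label changes and by the share of ‘write’ frames. The function name and both thresholds are assumptions chosen for illustration; they were not used or validated in the study.

```python
from typing import Sequence


def classify_window(
    labels: Sequence[str],
    max_switch_rate: float = 0.3,   # assumed threshold, not from the study
    min_write_share: float = 0.6,   # assumed threshold, not from the study
) -> str:
    """Classify a window of frame labels as a typing phase.

    'ineffective' : the label changes between consecutive frames more often
                    than max_switch_rate (frequent switching of sections)
    'effective'   : switching is rare and 'write' dominates the window
    'unclassified': neither criterion applies
    """
    if len(labels) < 2:
        raise ValueError("a window needs at least two frames")
    switches = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    switch_rate = switches / (len(labels) - 1)
    write_share = sum(1 for label in labels if label == "write") / len(labels)
    if switch_rate > max_switch_rate:
        return "ineffective"
    if write_share >= min_write_share:
        return "effective"
    return "unclassified"


if __name__ == "__main__":
    # Illustrative windows only, not annotations from the study.
    steady = ["write"] * 9 + ["delete"]
    erratic = ["write", "delete", "up search", "down search", "write", "think"]
    print(classify_window(steady), classify_window(erratic))
```

A fixed-threshold rule of this kind is only a starting point; any automated detector would have to be compared against the manual annotations before use.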
While exploring the causes of the ineffective states, annotators marked the stages where finding the next character took a remarkably long time. As an extreme, one case was found in the last session where typing a single character lasted for 16 seconds.
Typical causes of such failures were investigated by analyzing the ineffective phases further (the general pattern of such phases is shown in the lower part of Fig. 4). Three different patterns were identified in these phases. The first type was when a word started with the same letter that the preceding one ended with (for example: ‘big gap’). This typically resulted in omitting the space and the initial letter of the second word (that is, typing ‘bigap’). The second type arose when the letter following the next one to be typed had very high probability. For example, just after typing the ‘t’ in ‘truck’ the next letter is ‘r’, but the one after the next, namely ‘u’, is much bigger on Dasher due to the probability-based sizing of the letters. As a result, ‘u’ captures the subject’s attention, and makes it difficult to find the ‘r’. Finally, in some cases the cursor was moved around elliptically (from top right to bottom left, and backwards); we called this pattern the ‘correction circle’. The explanation of this pattern was less obvious; it may have to do with the specific neurological condition of our subject.
In the annotation of the coupled face and screen videos, correlations between the typing process and the head movements were sought. Since the OIM changed Dasher’s preset speed value, different vertical head positions were needed to reach the same convenient zooming speed at different times. Because the subject held his head tilted to the left, during typing the cursor typically moved between the upper right (‘write’) and bottom left (‘delete’) parts of the screen (see Fig. 6). We found that typing actions exhibited stronger correlation with vertical than with horizontal head movements (see Fig. 5).
Discussion
The results presented above suggest that the participant used more than one strategy, like the effective and ineffective writing strategies shown in Fig. 4. The comparison of the annotations revealed an alternation between effective and ineffective typing phases. We explain the non-convergence of the OIM by the presence of different behavioral patterns, since it is known from the earlier study [17] that the AGI architecture is able to find the optimal speed for typists using only one strategy.
Even though our participant became an advanced Dasher user during the study, his writing continued to include effective and different types of ineffective phases. Why the ineffective phases occurred is not addressed in this paper. Further investigation of this phenomenon is a psychological issue; it is worth noting that examining the ineffective stages indicated a connection between typing time and text complexity.
As we saw in the previous section, an extremely long time was needed to write one character in cases where the last letter of a word and the first letter of the next one were the same, or when the character following the one to be typed had a high associated probability.
In order to minimize our subject’s message-composing time, some solution has to be found to overcome the ineffective phases. A multi-strategy system might be capable of detecting the different typing phases and adapting an individual OIM for each strategy. The problem is that the OIM can optimize speed within one strategy only; when only effective phases were present (as with a neurotypical subject) this procedure worked optimally. However, when effective and ineffective phases alternate, we first need to identify them, and only then can we apply a suitable OIM for each of these phases to optimize speed and hence minimize writing time. In such a case Dasher’s zooming speed would be optimized separately for these phases. In the ineffective periods letter prediction should be turned off and the alphabet should be displayed without probability-based sizing. Thus the AGI-supported letter prediction should work only in the effective stages.
As a general suggestion for further research, this case study also hints at the importance of direct cooperation between experts in different areas of research, including psychologists, IT specialists, conductors and helping professionals, to find the best match of person and technology in each individual case. Acceptance and acquisition of AT is very important for the client as well as for the close social environment. In the area of communicative competence, a number of low-comfort situations create a need to find new IT solutions. As AAC is much slower than speech or writing, making it faster typically comes at the price of making its content simpler or more stereotypical. Appropriate settings of the assisting tools are essential because they can bring the speed-content tradeoff closer to an optimum: subjects can communicate faster, yet retain a reasonable sophistication of their messages. This optimization, however, needs to take into account the characteristics of particular subjects. To find such settings, not only do we need pre-tests and pilot studies; practice and learning on the part of the user of the AT solution are also required. In this process parameters need to be adjusted so as to lead to the acceptance of the technology, and make it favorable for long-term use. We think the present case study is one positive example of this approach. During the experiment our subject began to use Dasher without OIM in everyday communication. He was also the first of the clients of the Hungarian Bliss Foundation who found Hungarian Dasher an effective AT solution for head-driven, letter-based communication.
Acknowledgments
We would like to thank our participant for his persistent work. We are also grateful to the Hungarian Bliss Foundation for its long-term contribution and assistance. We give special thanks to Lőrincz and Takács for their kind support during the entire study.
Conflict of interest
None to report.
