An App Evaluation System for All Stakeholders: A Pilot Study

Abstract

Mobile technologies, including apps, have become increasingly popular, and are being used to support daily activities among a variety of individuals. While the use of mobile technologies will not eliminate barriers often faced by individuals with disabilities, these systems have the potential to help minimize some of these barriers. As the popularity of apps is increasing, the purpose of this study was to evaluate the reliability, internal consistency, and social validity among novice raters on two app evaluation rating scales. A total of 17 adults, with and without identified disabilities, evaluated apps using two team-designed app rating scales. Overall, findings indicated that the ratings completed during the pilot phase by the research team were more reliable than those completed by novice raters during the testing phase; that the dimension of individualization was the most reliable among team raters and novice participants without disabilities; and that the highest level of inconsistency in the reliability was among novice participants with disabilities. Practical implications, limitations, and future research directions are discussed.

Keywords

apps, app evaluation, developmental disabilities, mobile technology

Advances in technology over the past two decades have led to increased use of mobile technologies, such as tablets, in educational settings (Kumar & Goundar, 2019). These practices have expanded to special education teachers who use educational applications (apps) to engage learners in academic content, track classroom behaviors, and communicate with caregivers (Bouck et al., 2016). Mobile technologies have given individuals with disabilities increased access to academic opportunities by providing practitioners the ability to personalize the complexity of the content (Anderson & Putman, 2020). In turn, this has led to more accessible and inclusive practices (Cihak et al., 2015; Xie et al., 2018) by making the environment more equitable for a wider range of individuals with disabilities (Ciampa, 2017). Specifically, mobile technologies have been used to help decrease the number of prompts needed to complete a task (Laarhoven et al., 2018), increase independence in daily living activities (Cakmak & Cakmak, 2015; Laarhoven et al., 2018), promote social interactions (Vaala et al., 2015), provide differentiation (Anderson & Putman, 2020), and enhance motivation and independence (Anderson & Putman, 2020). The support that may come from the use of mobile technologies can also contribute to self-directed learning (Ayres et al., 2013; Lee & Kim, 2015), pacing (Thomas et al., 2019), and goal setting (Thomas et al., 2019).

Due to the wide range of possibilities mobile technology may support, an increased focus on the identification of effective and meaningful apps is warranted. The process of identifying apps requires knowledge of how the user may integrate the system into real-world opportunities (Courduff & Szapkiw, 2015) and of the functions of the app (Weng & Taber-Doughty, 2015). The opportunity for users to select which app they prefer will, in turn, increase their motivation to use the selected app (Ciampa, 2017) and enhance the user’s engagement, independence, and self-determination skills (Frielink et al., 2018). As such, the process of evaluating apps is an essential step in the effective selection of an app. Unfortunately, it has been suggested that a major challenge with apps is that many of them are developed by individuals with limited child development or educational expertise (Vaala et al., 2015). The lack of expertise of app developers, combined with the absence of comprehensive information about the quality of an app, is troublesome and can lead users to purchase apps based on star ratings, reviewer comments, the number of downloads (Bouck et al., 2016; Papadakis & Kalogiannakis, 2017), or the app stores’ algorithm, rather than through a systematic evaluation of an app (Papadakis & Kalogiannakis, 2017). Although the evaluation of an app may be a daunting task, without a comprehensive evaluation, it is possible to choose useless and ineffective apps (Lee & Kim, 2015; MacSuga-Gage et al., 2015) that are not compatible with the user (DeCarlo et al., 2019).

Attempts have been made to support practitioners, individuals with disabilities, and their families in the evaluation process by developing user-friendly app evaluation tools. Often these tools are in the form of a rating scale, rubric, or checklist. A key to successfully evaluating apps is to better understand the rationale for selecting a specific evaluation instrument. For example, a rating scale uses a scoring system that is hierarchical and based on assigning a value to specific categories, components, or questions. The purpose of using a rating scale is to represent a perceived quality for each dimension being evaluated. A rubric, on the other hand, uses a scoring guide that evaluates the perceived quality of specific levels of categories or components, and it includes a pre-determined description of each level; and lastly, a checklist follows a dichotomous list that includes multiple categories, components, or dimensions, that are used to determine whether a characteristic is present or absent (Brookhart, 2013).

Based on a systematic literature review conducted by Boesch et al. (in review), from 2008 to date, most of the existing app evaluation tools are non-empirical (see Authors for details), and only two of the tools were empirically evaluated: the Rubric for the Evaluation of Education Apps for preschool Children (REVEAC), by Papadakis et al. (2017), and the App Evaluation Rubric by Weng and Taber-Doughty (2015). Papadakis et al. (2017) developed a rubric, grounded in a literature review conducted by the authors, to rate preschool educational apps. Following a suggested set of standards for scale-classified criteria, Papadakis et al. considered: (a) the total number of dimensions to be evaluated; (b) operational definitions for each dimension according to the performance level; and (c) the total number of performance levels. The first draft of the rubric was evaluated by key stakeholders (preschool teachers), and their feedback was applied to the rubric. The final rubric included the evaluation of four dimensions: (a) content, (b) design, (c) functionality, and (d) technical quality of the app. Weng and Taber-Doughty (2015) designed a rubric for practitioners to evaluate educational apps for students with disabilities. Although they describe their tool as a rubric, based on the definitions from Brookhart (2013), Weng and Taber-Doughty’s tool should be described as a checklist and a rating scale. The checklist portion gathers general information and design features of the app, and the rating scale evaluates the quality of the following dimensions: (a) design features, (b) individualization, (c) support, and (d) overall impression and comments (open-ended).

Although app evaluation tools exist, most of these tools have not been empirically validated. Beyond this, the available tools have been designed to be used by practitioners, not by individuals with disabilities. As a result, the purpose of this study was to design, evaluate, and create two empirically grounded app evaluation tools (rating scales) that can be used across app categories and by a variety of users. To develop these app evaluation tools, the systematic review conducted by Boesch et al. (in review) was used to identify key elements and dimensions needed for the evaluation of apps. The research questions to be answered were: (1) is the app rating scale a reliable tool among future practitioners (global)? (2) is the adapted app rating scale a reliable tool among individuals with disabilities (global)? (3) were the dimensions of content, individualization, usability, and quality of an app reliable among future practitioners (local)? (4) were the dimensions of content, individualization, usability, and quality of an app reliable among individuals with disabilities (local)? and (5) what were the perspectives of participants, both future practitioners and adults with disabilities, on the app rating scales?

Methods and Results

Research Design

Usability testing was implemented to evaluate a real-life activity, such as the evaluation of apps (Barnum, 2011). The aim was to determine the reliability of the two app rating scales across phases (pilot and testing) and participants (users with no identified and with identified disabilities). Both qualitative and quantitative data were collected from target stakeholders. The independent variable for this study was the background and characteristics of the participants, and the dependent variables were the reliability and validity of the app rating scales.

Procedures

The App Rating Scales

Two app rating scales, the app rating scale and the adapted app rating scale, were created based on the available literature (see Boesch et al., in review). The app rating scale was intended to be completed by practitioners or family members of individuals with complex disabilities. The adapted app rating scale, on the other hand, was designed for users with complex disabilities, including those with autism, multiple disabilities, and intellectual disabilities. The two rating scales differed in (a) the simplicity of the language on the adapted app rating scale, which was intended to make it accessible to individuals with disabilities; and (b) the elimination of five questions on the adapted app rating scale so that it included only those relevant to the target user (e.g., “I would recommend this app to other professionals” was not included).

Both rating scales included questions relevant to the four dimensions suggested in the literature (see Boesch et al., in review, for a complete review): (a) content, which related to information and images displayed on the apps; (b) individualization, which considered the user’s ability to alter the app according to an individual’s specific needs; (c) usability, which related to the app’s ease of navigation; and (d) quality, which reflected the rater’s overall impressions of the app. Each dimension was evaluated using a 5-point rating system: (a) not at all, the app did not align with any part of the statement; (b) slightly, the app aligned marginally with the statement; (c) somewhat, the app aligned with some part of the statement; (d) very much, the app aligned with most of the statement; and (e) extensively, the app aligned with all of the statement. A 5-point scale was selected because it yields the highest mean reliability (Revilla et al., 2014) and because such scales outperform scales with fewer options in terms of psychometric properties (Adelson & McCoach, 2010). The 5-point scale was modified on the adapted app rating scale by providing a text statement along with a visual representation: (a) strongly disagree, two thumbs down; (b) disagree, one thumb down; (c) it’s okay, manual sign for okay; (d) agree, one thumb up; and (e) strongly agree, two thumbs up. The visual modifications were based on suggestions and examples of other evaluation tools provided by the collaborating state chapter of a national organization (see Table 1).

Table 1.

Number and Example of Questions on The App Rating Scales within Each Dimension.

Dimensions	# of Questions		Example Questions
Dimensions	ARS (n = 21)	AARS (n = 16)	App Rating Scale	Adapted App Rating Scale
Content	4	3	The content is all relevant to the target area.	This app has useful information.
			The content is appropriate for the developmental level of the target audience.	This app looks good.
			The visual information and language are appropriate for the target audience.	This app uses easy words.
			The app design is appropriate for the target audience.
Individualization	7	6	This app has the capability to be adjusted to meet the accessibility needs of an individual.	This app is easy to change to the way I need it.
			The size of pictures can be adjusted.	This app makes it easy to change the size of the pictures.
			The size of text can be adjusted.	This app makes it easy to change the size of the words.
			The speed of speech can be adjusted.	This app makes it easy to change how fast the voice is.
			There are multiple voices to choose from for speech output.	This app makes it easy to pick which voice I want.
			The app is available in multiple languages.	This app lets me make things the way I want them.
			The content can be customized.
Usability	6	5	Individuals can use the app independently after set up.	This app makes it easy to use by myself.
			The app is free of all distracting features.	This app is easy to use.
			This app is easy to navigate.	This app has voices that are clear and easy to understand.
			The speech and voices generated by the app are clear and easy to understand.	This app is set up in a way that is easy to use.
			The layout of the app is simple and clear.	This app has ways to get help if I need it.
			The app provides support or assistance with the app.
Quality	4	2	I would recommend this app to other professionals.	I would use this app.
			I would recommend this app to parents whose child needs assistance in this area.	I would want to tell others about this app.
			I would recommend this app to an individual who needs assistance in this area.
			If this app was relevant to an individual, they would use it on a regular basis.

Note. AARS = adapted app rating scale; ARS = app rating scale.

Application Selection

Two state agencies (the state chapter of a national organization for individuals with disabilities and their families, and the state office of developmental and intellectual disabilities) contacted the first author because they were interested in creating a system for individuals with disabilities to evaluate apps. As a result, the two state agencies selected the apps used during the testing phase (n = 7). A key criterion for their selection was that all apps were accessible to all individuals and had the potential to be used as a means to increase the independence of individuals with disabilities. The apps represented three domains: communication (n = 2), daily routines (n = 3), and time management (n = 2). The intent of the collaborating state agencies was to consider apps that represented functional skills, as they are essential skills to increase the independence of individuals with disabilities (Ayres et al., 2013). Beyond the apps identified by the collaborating state agencies, the research team selected an additional 18 apps across the same domains to be used during the pilot phase. Altogether there were: (a) 11 apps for daily routines, which could be used to support the completion of everyday activities or independent living skills; (b) 7 apps for communication, which helped develop or aid an individual’s communication skills or interactions; and (c) 7 apps related to time management, which supported the individual in self-monitoring and scheduling. The apps selected represented a wide range of Apple Store® ratings from 1 to 5 (M = 3.15), including 13 apps with a rating ≤ 3.5 and 11 apps with a rating ≥ 3.6. All apps were downloaded onto iPads® with screen sizes of 9.4 inches by 6.6 inches.

Data Analysis

Formative and summative data were collected and analyzed during the pilot and testing phases (Barnum, 2011). For the formative data, internal consistency and reliability were calculated for both app rating scales. Summative data were collected for each app rating scale globally, by determining the reliability of the app rating scale as a whole (one construct), and locally, by evaluating the reliability of each dimension in the app rating scale (content, individualization, usability, and quality). Reliability was used to describe whether there were differences among and across raters and, if there were, how far apart the results were when using the same tools (Roberts & Priest, 2006). The goal was to determine the trustworthiness of both app evaluation rating scales.

Reliability of both app rating scales was calculated during the pilot and testing phases, across novice participant groups, and across dimensions using SPSS (version 27.0). Multiple measures were used to calculate reliability across participants: (a) Cronbach’s alpha was calculated to determine the internal consistency of the app rating scales, with a set criterion of “good” (0.8 ≤ α ≤ 0.89) to “excellent” (α ≥ 0.9; Gliem & Gliem, 2003). By default, questions that correlated negatively with the overall scale were reverse coded, and any questions with no variance, where ratings across participants were the same, were eliminated (n = 2; 1 for the content and 1 for the usability dimension); (b) Pearson’s correlation coefficient was used to determine the correlation between participant responses, with a set criterion of “strong to perfect” (.70 to 1.0; Hinkle et al., 2003); (c) weighted Cohen’s kappa was used to determine inter-rater reliability, with a set criterion of “substantial” (.61 to .80) to “almost perfect agreement” (.81 to 1.0; Landis & Koch, 1977); and (d) inter-rater adjacent agreement was used to evaluate the agreement between novice participants by calculating the percentage of ratings within plus or minus (±) 1 point of each other. The criterion for this measure was set as “acceptable” (75% to 89%) to “high” (≥90%; Shweta et al., 2015).
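As an illustration of the four measures listed above, the following Python sketch computes each of them for two raters scoring the same questions on the 5-point scale. It is not the SPSS procedure used in the study: the function names and the sample ratings are hypothetical, and linear weights are assumed for the weighted kappa because the weighting scheme is not stated in the text.

```python
from statistics import mean, variance


def cronbach_alpha(items):
    """Internal consistency. `items` holds one list of scores per question,
    each list covering the same set of rated cases."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


def pearson_r(x, y):
    """Pearson's correlation between two raters' scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def weighted_kappa(x, y, categories=(1, 2, 3, 4, 5)):
    """Cohen's kappa with linear disagreement weights (an assumption)."""
    n, k = len(x), len(categories)
    index = {c: i for i, c in enumerate(categories)}
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(x, y):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    cells = [(i, j) for i in range(k) for j in range(k)]
    # Weighted observed and chance-expected disagreement.
    seen = sum(abs(i - j) / (k - 1) * observed[i][j] for i, j in cells)
    chance = sum(abs(i - j) / (k - 1) * row[i] * col[j] for i, j in cells)
    return 1 - seen / chance


def adjacent_agreement(x, y, tolerance=1):
    """Percentage of ratings within +/- `tolerance` points of each other."""
    hits = sum(abs(a - b) <= tolerance for a, b in zip(x, y))
    return 100 * hits / len(x)


# Hypothetical ratings from two raters on eight questions (not study data).
rater_a = [1, 2, 3, 4, 5, 3, 4, 2]
rater_b = [1, 3, 3, 5, 5, 1, 4, 2]

print(round(pearson_r(rater_a, rater_b), 3))
print(round(weighted_kappa(rater_a, rater_b), 3))
print(adjacent_agreement(rater_a, rater_b))
```

Adjacent agreement is the most lenient of the four measures, since a one-point difference counts as agreement, whereas the weighted kappa corrects for chance and penalizes larger differences more heavily. This is one reason the two measures can diverge for the same pair of raters.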

Pilot Phase

In the pilot phase, the app rating scales were evaluated by the team raters. Team raters were members of the research team and had previous exposure to the evaluation tools, as they were part of the development, design, and evaluation process of both app rating scales. During the design process of both app rating scales, 18 apps were evaluated by two team raters. Both team raters were White females with a bachelor’s as their highest earned degree, who were enrolled in a master’s degree special education low incidence program and had 1 to 2 years of teaching experience. To safeguard the reliability of the adapted app rating scale, two consultants from one of the collaborating state agencies were asked to examine the adapted app rating scale for clarity of each question and provide feedback. The two consultants were White males, ages 43 and 49, and had identified disabilities of traumatic brain injury and intellectual disability, respectively.

Global and local formative reliability data were collected to determine the degree to which the team raters did or did not agree on questions within the app rating scales. If disagreements occurred on a question by ±1 point, the research team evaluated the language used and determined the need for the question to be reworded or deleted. Once the team raters reached criteria for high reliability for each app rating scale, these were considered finalized. Using the final app rating scales, team raters evaluated seven additional apps, which were identified to be used in the testing phase by both novice groups.

Global findings for the app rating scale indicated “good” internal consistency, “strong” correlation, and “substantial” inter-rater reliability. The inter-rater adjacent agreement was “high” across team raters (92.86%). For local findings, Cronbach’s alpha was “good to excellent” within the content and usability dimensions but fell below criteria for individualization and quality. Results suggested the strongest correlation within the individualization dimension and the weakest in the quality dimension. Cohen’s kappa results were also lowest in the content and quality dimensions, indicating “none to slight” inter-rater reliability, while individualization and usability had the highest Cohen’s kappa. The percentage of questions answered within ±1 point of each other was high across all four dimensions: content, 95.8%; individualization, 97.6%; usability, 88.8%; and quality, 87.5%. Similarly, global findings for the adapted app rating scale indicated “acceptable” internal consistency, “very high” correlation, “near perfect” inter-rater reliability, and a “high” (93.75%) inter-rater adjacent agreement across team raters. Although results did not suggest internal consistency across the dimensions for local findings, individualization had the strongest correlation and “near to perfect agreement” for Cohen’s kappa, and the content dimension was the lowest, with no correlation and “none to slight” inter-rater reliability. “Strong to perfect” correlation was reached for both the usability and quality dimensions. The percentage of questions answered within ±1 point of each other (inter-rater adjacent agreement) was “high” across all four dimensions (see Table 2 for a complete summary).

Table 2.

Pilot Phase Reliability and Internal Consistency.

Dimensions	Pilot Phase
	App Rating Scale				Adapted App Rating Scale
	α	± 1, %	r	κ	α	± 1, %	r	κ
Global	.830	92.86	.823	.699	.730	93.75	.930	.843
Local
Content	.863	95.8	.575	.368	.421	83.3	−.165	−.091
Individualization	.043	97.6	.919	.429	.667	95.8	.947	.869
Usability	.880	88.8	.623	.521	.237	95.0	.943	.856
Quality	.794	87.5	.382	.186	.205	100	.898	.800

Note. Significance levels for Pearson’s r and Cohen’s kappa at .01, and for Cronbach’s alpha at .05; criteria for Pearson’s r set at .70, Cohen’s kappa at .61, and Cronbach’s alpha at .80.
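The criteria in the note can be applied mechanically to the global values reported in Table 2. The Python sketch below does so; the dictionary and function names are hypothetical, the thresholds are the lower bounds stated in the text (including the 75% lower bound for adjacent agreement given under Data Analysis), and the comparison ignores significance levels.

```python
# Lower-bound criteria as stated in the text.
CRITERIA = {"alpha": 0.80, "adjacent_pct": 75.0, "r": 0.70, "kappa": 0.61}

# Global pilot-phase values copied from Table 2.
PILOT_GLOBAL = {
    "ARS": {"alpha": 0.830, "adjacent_pct": 92.86, "r": 0.823, "kappa": 0.699},
    "AARS": {"alpha": 0.730, "adjacent_pct": 93.75, "r": 0.930, "kappa": 0.843},
}


def unmet_criteria(results, criteria=CRITERIA):
    """Return the measures whose value falls below its criterion."""
    return [name for name, cutoff in criteria.items() if results[name] < cutoff]


for scale, results in PILOT_GLOBAL.items():
    print(scale, unmet_criteria(results))
```

Under these thresholds, the app rating scale meets all four global criteria, while the adapted app rating scale meets every criterion except Cronbach’s alpha (.730 against a cutoff of .80).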

Recruitment

Institutional Review Board (IRB) approval was obtained prior to recruiting and distributing information about the study. Convenience sampling was used to recruit participants (Etikan et al., 2016). Novice participants with no identified disabilities (novice-ND) were recruited by the research team through an email list from the Department of Special Education at the Southern University. The email list consisted of undergraduate and graduate students from across program areas within the department. The email contained information on the purpose of the study and the time commitment, and it was sent once a week for a month. The inclusion criteria for novice-ND participants were: (a) pursuing a master’s degree in low incidence special education; and (b) seeking teaching licensure or endorsement in a low incidence program.

Novice participants with an identified disability (novice-WD) were recruited by an advertisement post on the website, Facebook™ page, and Twitter® account of a state chapter of a national organization that focuses on supporting individuals with disabilities. The post contained information provided by the research team on the purpose of the study and time commitment. An initial post was uploaded on the social media platforms and was re-posted weekly for a month. The inclusion criteria for novice-WD included: (a) a medical diagnosis of a developmental or acquired disability; (b) over the age of 18; (c) previous exposure to mobile technology, which was defined as the individual owning or having used one or more mobile technology systems, such as an iPhone®, Android™ phone, or tablets; and (d) previous experience with navigating apps, which was referred to as when an individual indicated knowing how to use apps on their mobile technology systems. After potential participants from both novice groups indicated they were interested, a member of the research team scheduled an in-person meeting with each individual to verbally explain the purpose and time commitment for the study.

Participants

Participants were divided into two groups, novice-ND and novice-WD. Participants in both groups had no prior experience with or exposure to the project or materials and had not used an app rating scale, rubric, or evaluation checklist to evaluate apps. The novice-ND participant group consisted of 9 female participants (8 White and 1 Asian), with ages ranging from 22 to 27 (M = 23.44). All participants were enrolled in a master’s program, and all but one (who had 4 years) had no prior teaching experience. The novice-WD group included 8 participants, 7 of whom were male and 1 female, with ages ranging from 20 to 57 (M = 37.75). Of these participants, 62.50% were White, 25% were Black, and 12.5% were Hispanic. The highest degree earned by participants in novice-WD included 25% with a high school certificate, 25% with a high school diploma, 25% with a college or university certificate, and 25% with a bachelor’s degree. Regarding medical diagnoses, 37.50% of participants in novice-WD reported autism, 25% indicated a diagnosis of multiple disabilities, 25% specified other health impairment, and 37.50% reported an intellectual disability (see Table 3).

Table 3.

Participant Profiles.

Novice-ND (n = 9)					Novice-WD (n = 8)
Age	Ethnicity	Gender	Level of Education	Teaching Experience	Age	Ethnicity	Gender	Level of Education	Primary Disability
22	W	F	B.S.	—	57	B	M	H.D.	ID
25	W	F	B.S.	—	53	W	F	B.S.	MD
23	W	F	B.S.	—	51	W	M	H.C.	OHI, TBI
22	W	F	B.S.	—	28	W	M	H.D.	Autism
23	A	F	B.S.	—	30	W	M	C.C.	ID
23	W	F	B.S.	—	20	H/L	M	C.C.	Autism
23	W	F	B.S.	—	39	W	M	B.S.	ID
23	W	F	B.S.	—	24	B	M	H.C	Autism
27	W	F	B.S.	4 years

Note. A = Asian; B = Black or African American; B.S. = bachelor’s degree; C.C. = college or university certificate; F = female; H.C. = high school certificate; H.D. = high school diploma; H/L = Hispanic or Latino; ID = intellectual disability; Level of education = the highest level degree earned by participants; M = male; MD = multiple disabilities; Novice-ND = novice without disability; Novice-WD = novice with disability; OHI = other health impairment; TBI = traumatic brain injury; W = white.

Testing Phase

During the testing phase, both app rating scales were evaluated by novice participants. Novice participants were adults who had no previous exposure to the app rating scales. The app rating scale was evaluated by novice participants with no identified disabilities (novice-ND), and the adapted app rating scale was evaluated by novice participants with an identified disability (novice-WD). Data were analyzed for each tool, globally and locally, with the purpose of determining the reliability among raters. Social validity among the novice rater groups was analyzed for both app rating scales.

To compare ratings, all novice participants (novice-ND and novice-WD) were provided a list of the same apps, an iPad®, and paper copies of their respective app rating scale. The app list included the same apps that were evaluated by team raters once the rating scales were finalized. For participants in the novice-ND group, no additional training or instructions were provided beyond the instructions included within the app rating scale. The rationale for withholding additional training was to evaluate whether the instructions provided on the rating scale were sufficient to complete the app rating scale effectively and accurately. Novice-ND participants completed the evaluation of each app independently in office cubicles at the university they attended. The schedule was set according to the availability of participants and the research team, and no time constraints were given.

For novice-WD, a two-hour training was created by the research team on how to use the adapted app rating scale (task analysis), with visuals on how to navigate similar apps (pre-evaluation). The training was created using PowerPoint™ and occurred one week prior to the evaluation of the apps. The purpose of the training was two-fold: (a) to provide the collaborating consultants from one of the state agencies a systematic script to train, guide, and support participants in novice-WD (following the same guidelines used with novice-ND); and (b) to provide participants a task analysis on how to evaluate apps. Novice-WD participants independently completed the app evaluation at the national organization office during pre-determined dates and times. Participants were scheduled to evaluate two apps per session with no time constraints and were given the opportunity to request guidance or support as needed. If assistance was requested, a member of the research team recorded the participant number and the type and frequency of the request.

Procedural Fidelity

To determine if the procedures for each group were applied as intended (Gresham et al., 2000), two forms were created to collect procedural fidelity during interactions with novice participants. For novice-ND, a 10-step checklist was created to evaluate the interactions between a research team member and novice-ND group. The checklist was completed by two members of the research team. The research team followed the script with the directions for novice-ND in 98.2% of the interactions.

Based on the agreements with the two collaborating state agencies, the research team created a training protocol and materials, and trained two consultants employed by one of the collaborating state agencies on how to interact with and present materials to participants in the novice-WD group. A 24-step checklist was created by the research team to evaluate the consultants on their provision of the training, interactions, and directions provided to novice-WD participants. The checklist consisted of six overarching headings: (a) agenda, (b) rating scale, (c) parts of the rating scale, (d) rating, (e) conducted training, and (f) time for clarifying questions. For each heading, an outline of the task to be completed (protocol) was included to determine if all steps were implemented. The checklist was completed by two members of the research team during the training and app evaluation sessions. Overall, the collaborating consultants followed the training protocol in 91.7% of the experiences. Anecdotal notes indicated that the research team deviated from the initial protocol by taking the lead during the app evaluation sessions to support the collaborating consultants and to provide novice-WD participants with the information needed to complete the rating scales to evaluate the apps.

Anecdotal notes from both novice groups were also collected on questions asked by participants, participants’ ability to navigate each app, barriers that were observed during the sessions (if any), and the assistance each participant was provided (if applicable). Two members of the research team were present during the app evaluation to collect procedural fidelity data, assist the collaborating consultants (with the novice-WD group), and support participants (if needed). Anecdotal data indicated that no questions arose from novice-ND. In contrast, novice-WD participants asked questions related to (a) how to use the iPad (e.g., getting locked out); (b) task procedures (e.g., if all questions on the adapted app rating scale needed to be completed); and (c) the app itself (e.g., if the sound settings could be changed).

App Rating Scale

Global findings indicated that the app rating scale had “excellent” internal consistency, “moderate” correlation, and “moderate” inter-rater reliability. The percentage of questions within ±1 point of each other met criteria for novice-ND in the acceptable range (75.9%). For local findings, internal consistency was reached for content, individualization, and quality; however, in the usability dimension, results fell below criteria. Among participants in novice-ND, findings specific to the four dimensions indicated that individualization had the strongest correlation (“moderate”) and quality had the weakest (“no correlation”). Cohen’s kappa results were lowest in the quality and content dimensions, indicating “none to slight” inter-rater reliability, while individualization remained the highest scoring dimension with “moderate” inter-rater reliability. The inter-rater adjacent agreement was “acceptable” across content (81.1%), individualization (76.6%), and usability (78.1%), but fell below criteria for the quality dimension (65.7%; see Table 4).

Table 4.

Testing Phase Reliability and Internal Consistency.

Dimensions	Testing Phase
	App Rating Scale (Novice-ND)				Adapted App Rating Scale (Novice-WD)
	α	± 1, %	r	κ	α	± 1, %	r	κ
Global	.910	75.9	.602	.437	.910	69.1	.354	.201
Local
Content	.932	81.1	.265	.142	.816	79.5	.032	.029
Individualization	.818	76.6	.637	.477	.890	59.3	.225	.115
Usability	.781	78.1	.471	.298	.760	73.8	.324	.189
Quality	.925	65.7	.093	.063	.904	65.5	.093	.047
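
The ± 1 and r columns in Table 4 can be illustrated with a short sketch. It assumes two raters scoring the same items on a 5-point scale; the function names and the ratings are hypothetical, and the study's exact computation across more than two raters is not specified here.

```python
def adjacent_agreement(rater_a, rater_b, tolerance=1):
    """Percentage of items on which two raters differ by at most `tolerance` points."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    within = sum(abs(a - b) <= tolerance for a, b in zip(rater_a, rater_b))
    return 100.0 * within / len(rater_a)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


# Hypothetical 5-point ratings from two raters on the same eight items.
rater_a = [5, 4, 3, 5, 2, 4, 1, 3]
rater_b = [4, 4, 5, 5, 3, 2, 1, 3]

print(f"Adjacent agreement: {adjacent_agreement(rater_a, rater_b):.1f}%")
print(f"Pearson r: {pearson_r(rater_a, rater_b):.3f}")
```

In this made-up example, six of the eight rating pairs fall within one point of each other, so the adjacent agreement is 75.0%.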

Adapted App Rating Scale

As shown in Table 4, global findings indicated that internal consistency was “excellent,” yet the correlation was “weak,” and the inter-rater reliability was “slight.” For novice-WD, the percentage of questions within ±1 point of each other was “unacceptable” (69.1%), as it did not meet the set criteria. As with novice-ND, internal consistency was reached for the content (“good”), individualization (“good”), and quality (“excellent”) dimensions; however, for usability, results fell below criteria. Yet usability was the most highly correlated dimension among novice-WD, even though correlation and inter-rater reliability were weak overall across all dimensions. The inter-rater adjacent agreement was considered acceptable only for the content dimension (79.5%), and fell below criteria for the individualization (59.3%), usability (73.8%), and quality (65.5%) dimensions.
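
Assuming the α values in Table 4 are Cronbach's alpha, the internal consistency calculation can be sketched as follows. The data layout (one row per completed rating, one column per scale item) and the sample ratings are hypothetical and are not drawn from the study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows, each row a list of item scores."""
    k = len(responses[0])  # number of items on the scale
    if k < 2 or len(responses) < 2:
        raise ValueError("Need at least two items and two rows of ratings.")
    # Sample variance of each item (column) across rows.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the total score of each row.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: five completed ratings of four items on a 5-point scale.
ratings = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
]
print(f"alpha = {cronbach_alpha(ratings):.3f}")
```

Alpha approaches 1 when items rise and fall together across ratings, which is why a dimension can show high internal consistency even when agreement between raters is low.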

Social Validity

Triangulation analysis was used to assess multiple perspectives of the data collected in relation to social validity (Barnum, 2011). The goal was to provide participants the opportunity to share their overall experience, highlights, and challenges when using the rating scales. The perspectives of participants were collected in the form of observations (for novice-WD), social validity survey (for novice-ND), and participant comments (all participants).

App Rating Scale

A social validity survey was created by the research team for participants in the novice-ND group. The survey consisted of 15 questions related to participants’ (a) perspectives on the app rating scale (n = 9 closed-ended questions); (b) experience using the app rating scale (n = 3 closed-ended questions); and (c) recommendations or suggestions on potential changes to enhance the app rating scale (n = 3 open-ended questions). A 5-point Likert scale was used, including: (a) not at all, defined as disagreeing with the statement; (b) slightly, defined as vaguely agreeing with the statement; (c) somewhat, defined as partially agreeing with the statement; (d) a fair amount, defined as mostly agreeing with the statement; and (e) very much, defined as strongly agreeing with the statement. Participants were asked to respond to each question. The survey was disseminated using REDCap™, a platform to build and manage online surveys (Harris et al., 2009). To determine the validity of the survey prior to dissemination, four external reviewers were asked to provide feedback on the questions, language, format, and type of potential responses. Reviewers included two master’s level graduate students in special education (not included as participants in this study), one without a teaching background and one with a special education teaching background, a special education teacher with 7+ years of teaching experience, and a university professor with survey expertise. The survey was revised for clarity of the questions, and the final version was piloted by the external reviewers.

The social validity survey was used to calculate descriptive statistics (means and standard deviations) across novice-ND’s demographic characteristics, including educational background and years of teaching experience. The total mean scores for participants’ perspectives and experiences were calculated by summing the item ratings within each section of the social validity survey and dividing by the total number of items. One-way ANOVAs were used to determine statistical differences between the novice-ND’s demographic information and their responses. Repeated-measures analyses were used to compare questions with each other and to determine whether responses differed significantly between questions.
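
As a rough sketch of these two calculations, the snippet below computes a section mean and the F statistic of a one-way ANOVA from its definition. The function names, the grouping by years of teaching experience, and all scores are hypothetical; the p-value is omitted because it requires the F distribution, which a statistics package such as SciPy (`scipy.stats.f_oneway`) would provide.

```python
def section_mean(item_scores):
    """Mean section score: sum of the item ratings divided by the number of items."""
    return sum(item_scores) / len(item_scores)


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA across groups."""
    all_scores = [s for g in groups for s in g]
    grand_mean = sum(all_scores) / len(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((s - sum(g) / len(g)) ** 2 for g in groups for s in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical: one participant's ratings on a four-item survey section.
print(section_mean([4, 5, 3, 4]))

# Hypothetical section means grouped by years of teaching experience.
by_experience = [
    [3.8, 4.1, 4.4],  # 0-2 years
    [4.0, 3.6, 4.2],  # 3-6 years
    [4.3, 3.9, 4.5],  # 7+ years
]
f_stat, df_b, df_w = one_way_anova_f(by_experience)
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```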

After participants completed the app evaluation, the survey was disseminated to the novice-ND group. Results of the social validity survey were divided into two sections: perspectives and experiences. The response rate for the social validity survey was 88.9%. The mean score was M = 4.0 (SD = .971) on the participants’ perspectives section, indicating that participants reported that “a fair amount” of the rating scale was accessible and easy to use. The mean of the experience section was 3.9 (SD = .839), indicating that participants agreed “a fair amount” that they would recommend the scale to a fellow practitioner. Although the differences were not significant, the highest ranked items under the perspectives section were the rating scale’s ability to be completed easily (M = 4.3; SD = .951), to be completed in a feasible amount of time (M = 4.3; SD = .951), and to be practical to complete (M = 4.3; SD = .951); F(8, 48) = 2.151, p = .49. Under the experience section, results indicated similar rankings among participants, with a range of .14 between the highest and lowest ranked items. Participants’ educational background and number of years of teaching experience were examined in relation to the extent of satisfaction across sections, and no significant differences were found.

Thematic analyses were conducted on open-ended questions completed by participants in the novice-ND group. Each question was coded independently by two coders (research team) to determine coding reliability, with a Cohen’s kappa of 0.968 (almost perfect agreement). A total of 4 participants responded to the question on the “possible future use of the app evaluation system,” indicating that they would use the app rating scale to meet the needs of their students (50% of responses), and to help examine an app prior to purchasing (50% of responses). A total of 5 participants responded to the question on whether the “app rating scale was helpful,” and responses indicated that the app rating scale was helpful in meeting the needs of their current students (80%), examining apps before purchasing (20%), and evaluating features of the app using this tool (20%). Lastly, a total of 7 participants responded to “suggestion on changes to the app rating scale”; 71% indicated that no changes should be made, and the remaining participants (29%) suggested adding an overall score for the app evaluated and making the language more concise.

Adapted App Rating Scale

To obtain social validity from participants in the novice-WD group, anecdotal notes were collected on their perspectives on the adapted app rating scale. After the app evaluation was completed, a member of the research team asked participants from the novice-WD group to share their overall experience when using the adapted app rating scale. An additional team member collected responses from participants. Overall, 75% of the participants in the novice-WD group indicated that they “agreed” or “strongly agreed” that the app rating scale was easy to use, and that they could use it independently. The remaining participants (25%) indicated that they “disagreed” or “strongly disagreed” with this statement, suggesting that the adapted app rating scale was not easy to use, or that they could not use it independently.

Two additional members of the research team collected anecdotal notes while participants in the novice-WD group were using the adapted app rating scale to evaluate the apps. The purpose was to gather notes on the challenges faced by novice-WD through the observations. The notes described each challenge faced and the evaluation phase in which it occurred. Overall, data suggested that approximately 60% of the participants in the novice-WD group had the skills needed to independently navigate, evaluate, and complete the adapted app rating scale. Yet, the remaining participants encountered barriers during the app evaluation process, including in the domains of motor skills (writing), literacy (reading abilities), and the ability to follow directions (receptive language), all of which occurred while navigating the app or while completing the evaluation on the adapted app rating scale.

Discussion

The growing prevalence of technology across settings has led to greater equity and accessibility for all learners (Anderson & Putman, 2020). Yet, exposure to these resources is not the only component to consider, as technological competency may be a key element to the successful use and selection of an app (Foulger et al., 2017). For individuals with disabilities, training the individual to use the technology is essential in providing the opportunity to increase self-management skills (Ayres et al., 2013), increase motivation (Maich et al., 2019), and decrease system abandonment (McNaughton & Light, 2013; Weng & Taber-Doughty, 2015).

As practitioners continue to identify ways to integrate mobile technology into daily activities, they will face challenges in meeting the user’s needs effectively and efficiently (Baran et al., 2017). Current practices suggest that apps are being treated as consumables, requiring no long-term commitment (Douglas et al., 2012; McNaughton & Light, 2013), which in turn has led to the selection of apps that do not meet the specific needs of the target user (MacSuga-Gage et al., 2015). To increase the effective identification of apps, it is essential that the individual’s abilities are matched to the app (Anderson & Putman, 2020; Bouck et al., 2016). By implementing a systematic evaluation process, practitioners may be able to thoughtfully identify apps (MacSuga-Gage et al., 2015), and provide the user with the supports needed.

Currently, several non-empirical app evaluation tools exist and can be found online. Yet, only a few authors, Papadakis et al. (2017) and Weng and Taber-Doughty (2015), have created app evaluation tools that have been empirically evaluated (see Boesch et al., in review, for a complete review). Both tools evaluate apps based on the dimensions of content, design, individualization capabilities, functionality, and usability, and both sets of authors described the outcomes of evaluating apps with the tool. Although these two evidence-based tools offer a systematic means to evaluate apps, more research is needed to understand how to comprehensively evaluate apps, to identify the dimensions needed to match the individual’s ability to an appropriate app, and to determine the possibility of identifying a valid and reliable tool to evaluate apps. As a result, two user- and time-friendly app rating scales were created and evaluated to determine their reliability, internal consistency, and social validity.

Overall, findings indicated that team raters were more reliable than novice participants; that the individualization dimension was most reliable across both team raters and novice-ND participants; and that novice-ND participants found the app rating scale to be easy and feasible to complete. Yet, findings should be viewed with caution as reliability did not reach the set criteria for certain dimensions among team raters and novice participants. However, given that team raters consistently reached higher reliability, it seems that knowledge and skills may have been gained through their involvement in the development, design, and evaluation of the app rating scales. These findings may be the result of “experts” having the background knowledge to make more accurate judgements (Jones & Alcock, 2014), implying that practice using the app rating scales, and having the opportunity to discuss the process, may be a key element to increase the reliability among raters (Grainger & Adie, 2014).

Findings also suggest unique similarities and differences among novice-ND and novice-WD. Both groups reached the criteria for internal consistency across three of the four evaluation dimensions (content, individualization, and quality), and both groups reached internal consistency for the rating scales as a whole. The difference was that novice-ND were overall more reliable in their app evaluations than the novice-WD. Findings could be attributed to the notion that individuals with disabilities, particularly those with intellectual disabilities, often respond to Likert scales with a response bias (Hartley & MacLean, 2006; Stancliffe et al., 2015). In fact, it has been suggested that individuals with intellectual disabilities tend to select responses from either end of a given scale (Hartley & MacLean, 2006; Stancliffe et al., 2015). Findings support previous research as anecdotal notes suggest that novice-WD evaluated the apps with a personalized mindset, which in turn, may have led to the evaluation of an app based on preference, rather than the overall quality of the app. The personal approach to the evaluation of apps may have led to lower reliability among novice-WD, as it has been suggested that higher reliability often occurs when participants belong to the same subgroup (Barnum, 2011). Although novice-WD were all given the same training and apps, novice-WD all had different characteristics, which may help explain their different ratings. A potential consideration is to ensure that participants gain knowledge and skills through training, where they are provided guidance and feedback, while at the same time, become comfortable and familiar with the response format (Hartley & MacLean, 2006).

Although reliability was not reached across novice participants, the notion of applying a user-centered approach is essential in effectively identifying apps (McNaughton & Light, 2013; Ok et al., 2015). By doing so, practitioners can consider the strengths and areas of need of the user and select an app that will appropriately support the individual (Ok et al., 2015). This process can only be applied if practitioners develop the technology competencies needed to comfortably utilize technological tools (Anderson & Putman, 2020). Ultimately, the goal should be for those evaluating apps to have the knowledge and skills needed to effectively identify apps that can help decrease the mismatch between a system (app) and the users.

Limitations and Future Directions

Although findings may be promising, there were a few limitations to this study. First, the apps used in this study were all free apps, for both the pilot and testing phase. The lack of purchased apps may have impacted and skewed the results. For example, Cohen’s kappa can be misleadingly low if the majority of ratings are at the highest or lowest level (Shweta et al., 2015); therefore, future research should consider the use of both free and purchased apps with low and high ratings, to increase the variance in ratings and possibly increase the inter-rater reliability across participants.
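
The effect described by Shweta et al. (2015) can be seen in a small worked example. The sketch below assumes unweighted Cohen's kappa for two raters; the function name and both sets of ratings are hypothetical and are not the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement expected from each rater's marginal distribution.
    # (Undefined when both raters use a single identical category throughout.)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Skewed ratings: almost every item receives the top score of 5.
skewed_a = [5] * 16 + [5, 5, 4, 4]
skewed_b = [5] * 16 + [4, 4, 5, 5]

# Balanced ratings: scores are spread evenly over four scale points.
balanced_a = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [1, 2, 3, 4]
balanced_b = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [2, 1, 4, 3]

print(f"skewed kappa:   {cohens_kappa(skewed_a, skewed_b):.2f}")
print(f"balanced kappa: {cohens_kappa(balanced_a, balanced_b):.2f}")
```

In both hypothetical pairs the raters agree exactly on 16 of 20 items (80%), yet kappa is about -0.11 for the skewed ratings and about 0.73 for the balanced ratings, because chance agreement is already 82% when nearly all ratings sit at one end of the scale.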

Second, the apps evaluated by all raters were selected by the funding agency in collaboration with the state chapter of a national organization. As a result, the number of apps evaluated differed across app categories, which in turn eliminated the possibility of determining reliability differences across categories. Future research should consider evaluating the reliability within apps from the same category and ensuring that the same number of apps is selected across categories. Potential results may shed light on whether a general app rating scale can be used across apps or whether there is a need for a specific evaluation tool for certain app categories (e.g., communication apps for AAC users).

Third, due to the request from the two collaborating state agencies, the research team was not directly involved in the recruitment and selection of novice-WD. Although an inclusion criterion was set by the research team, it was not consistently applied by the two collaborating state agencies. Therefore, future research should clearly outline a set skill level criterion to ensure that all participants are able to complete the task independently. One consideration is to use the skill set suggested by Powell (2014) as a guideline for the effective use of mobile technology, including fine-motor skills, ability to follow multi-step directions, reading level, self-management abilities, and reinforcement needs.

Lastly, replications are also warranted to continue to evaluate the procedures, reliability, and dimensions of the app rating scales. Future research could also evaluate the training in more depth, by implementing a training session for all participants and assessing the length and type of training needed across participants to effectively use an app evaluation tool. Future research could also consider increasing the sample size and giving in-depth attention to the participants’ primary disability, age, and level of education as potential influencing factors on the use of the app rating scale.

Conclusions

While mobile technology and apps have become an important component of daily life, evaluation of these systems has not been a common practice. Due to the high demand for mobile technologies, future research is needed to determine the impact of an app evaluation tool on the user’s motivation, learnability, and independence. Furthermore, there is still a need to identify key dimensions for the evaluation of apps, and to determine whether training is needed to use app evaluation tools effectively. Only by furthering the research in this area will apps be identified by considering the individual’s abilities and needs, leading to increases in the user’s skill level and independence and, at the same time, a decrease in system abandonment. Without a systematic, valid, and reliable app evaluation tool, the capabilities of an app may be minimized; and instead of the app being used to support and enhance the user’s abilities, opportunities, and experiences, the app being used may hinder the individual’s independence.

Footnotes

Acknowledgments

We would like to thank Robert Hodapp and Richard Urbano for their guidance with the data collection and analysis, Haley P. Neil for help during early phases of this project, Theresa Szydlik and Ava Lehavi for their help in components of this project, and Katie Shaw, Hanneh Shiheiber, Gillian Neff, and Stephanie Camacho for their support with final editing of this project. We are grateful for all your support.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was partially funded by Tennessee’s Department of Intellectual and Developmental Disabilities in collaboration with The Arc of Tennessee.

ORCID iD

M. Alexandra Da Fonte

Author Biographies

M. Alexandra Da Fonte is an Associate Professor of the Practice in the Department of Special Education at Vanderbilt University and a Member of the Vanderbilt Kennedy Center. Her research interests focus on AAC, specifically teacher training, bridging research to practice, and the evaluation of feature matching to identify effective communication systems. She is the co-author of Implementing Effective Augmentative and Alternative Communication Practices for Students with Complex Communication Needs: A Handbook for School-Based Practitioners and Teaching Students with Severe Disabilities.

Nicole P. Wolfe is a special education teacher, currently teaching reading and social studies to students with intensive support needs who require a modified curriculum. She graduated from Vanderbilt University with her master’s in special education. Her interests and thesis focus on teachers’ perspectives on the implementation of augmentative and alternative communication assessments in the classroom setting.

Emily R. DeLuca is a board-certified behavior analyst currently providing applied behavior services to children with autism spectrum disorder. She graduated from Vanderbilt University with a master’s degree in special education and applied behavior analysis. Her thesis focused on effective practices while conducting augmentative and alternative communication assessments.

Melissa J. Cavagnini is a special education teacher, currently teaching in a charter school. She graduated from Vanderbilt University with her master’s in special education. Her thesis focused on teachers’ perspectives on the collaboration needs for the effective implementation of augmentative and alternative communication assessments.

Krista L. Nardi is a Teacher for the Visually Impaired at The Vision Institute of South Carolina, a non-profit organization. She provides services locally to school districts through outreach programs. Krista serves students with low vision or blindness to promote achievement, accessibility, and independence. She graduated from Vanderbilt University with her master’s in special education, with a concentration in visual disabilities.

References

Adelson, & McCoach, D. B. (2010). Measuring the mathematical attitudes of elementary students: The effects of a 4-point or 5-point Likert-type scale. Educational and Psychological Measurement, 70(5), 796–807. https://doi.org/10.1177/0013164410366694

Anderson, S. A., & Putman, R. S. (2020). Special education teachers’ experience, confidence, beliefs, and knowledge about integrating technology. Journal of Special Education Technology, 35(1), 37–50. https://doi.org/10.1177/0162643419836409

Ayres, K. M., Mechling, & Sansosti, F. J. (2013). The use of mobile technologies to assist with life skills/independence of students with moderate/severe intellectual disability and/or autism spectrum disorders: Considerations for the future of school psychology. Psychology in the Schools, 50(3), 259–271. https://doi.org/10.1002/pits.21673

Baran, Uygun, & Altan (2017). Examining preservice teachers’ criteria for evaluating educational mobile apps. Journal of Educational Computing Research, 54(8), 1117–1141. https://doi.org/10.1177/0735633116649376

Barnum, C. M. (2011). Usability testing essentials: Ready, set...test! Burlington, MA: Morgan Kaufmann.

Boesch, M. B., Da Fonte, M. A., Holmes, E. E., & Nardi, K. L. (in review). Searching for app evaluation tools: A systematic literature review.

Bouck, E. C., Satsangi, & Flanagan (2016). Focus on inclusive education: Evaluating apps for students with disabilities, supporting academic access and success. Childhood Education, 92(4), 324–328. https://doi.org/10.1080/00094056.2016.1208014

Brookhart, S. M. (2013). How to create and use rubrics for formative assessment and grading. Alexandria, VA: ASCD.

Cakmak, & Cakmak (2015). Teaching to intellectual disability individuals the shopping skill through iPad. European Journal of Educational Research, 4(4), 177–183. https://doi.org/10.12973/eu-jer.4.4.177

Ciampa (2017). Building bridges between technology and content literacy in special education: Lessons learned from special educators’ use of integrated technology and perceived benefits for students. Literacy Research and Instruction, 56(2), 85–113. https://doi.org/10.1080/19388071.2017.1280863

Cihak, D. F., Wright, Smith, C. C., McMahon, & Kraiss (2015). Incorporating functional digital literacy skills as part of the curriculum for high school students with intellectual disability. Education and Training in Autism and Developmental Disabilities, 50(2), 155–171.

Courduff, & Szapkiw (2015). Using a community of practice to support technology integration in speech-language pathologist instruction. Journal of Special Education Technology, 30(2), 89–100. https://doi.org/10.1177/0162643415617373

DeCarlo, Bean, Lyle, & Cargill, L. P. (2019). The relationship between operational competency, buy-in, and augmentative and alternative communication use in school-age children with autism. American Journal of Speech-Language Pathology, 28(2), 469–484. https://doi.org/10.1044/2018_AJSLP-17-0175

Douglas, Wojcik, B. W., & Thompson, J. R. (2012). Is there an app for that? Journal of Special Education Technology, 27(2), 59–70. https://doi.org/10.1177/016264341202700206

Etikan, Musa, S. A., & Alkassim, S. R. (2016). Comparison of convenience sampling and purposive sampling. American Journal of Theoretical and Applied Statistics, 5(1), 1–4. https://doi.org/10.11648/j.ajats.20160501.11

Foulger, T. S., Graziano, K. J., Schmidt-Crawford, D. A., & Slykhuis, D. A. (2017). Teacher educator technology competencies. Journal of Technology and Teacher Education, 25(4), 413–448.

Frielink, Schungel, & Embregts, P. J. (2018). Autonomy support, need satisfaction, and motivation for support among adults with intellectual disabilities: Testing a self-determination theory model. American Journal of Intellectual and Developmental Disabilities, 123(1), 33–39. https://doi.org/10.1352/1944-7558-123.1.33

Gliem, J. A., & Gliem, R. R. (2003). Calculating, interpreting, and reporting Cronbach’s alpha reliability coefficient for Likert-type scales. Presented at the Midwest Research to Practice Conference in Adult, Continuing, and Community Education. Indiana: Indiana University. https://scholarworks.iupui.edu/bitstream/handle/1805/344/gliem+&+gliem.pdf?sequence=1

Grainger, P. R., & Adie (2014). How do preservice teacher education students move from novice to expert assessors? Australian Journal of Teacher Education, 39(7), 89–105. http://ro.ecu.edu.au/ajte/vol39/iss7/6

Gresham, F. M., MacMillan, D. L., Beebe-Frankenberger, M. E., & Bocian, K. M. (2000). Treatment integrity in learning disabilities intervention research: Do we really know how treatments are implemented? Learning Disabilities Research and Practice, 15(4), 198–205. https://doi.org/10.1207/SLDRP1504_4

Harris, P. A., Taylor, Thielke, Payne, Gonzalez, & Conde, J. G. (2009). Research electronic data capture (REDCap): A metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics, 42(2), 377–381. https://doi.org/10.1016/j.jbi.2008.08.010

Hartley, S. L., & MacLean, W. E. (2006). A review of the reliability and validity of Likert-type scales for people with intellectual disability. Journal of Intellectual Disability Research, 50(11), 813–827. https://doi.org/10.1111/j.1365-2788.2006.00844.x

Hinkle, D. E., Wiersma, & Jurs, S. G. (2003). Applied statistics for the behavioral sciences. Boston, MA: Houghton Mifflin.

Jones, & Alcock (2014). Peer assessment without assessment criteria. Studies in Higher Education, 39(10), 1774–1787. https://doi.org/10.1080/03075079.2013.821974

Kumar, B. A., & Goundar, M. S. (2019). Usability heuristics for mobile learning applications. Education and Information Technologies, 24(2), 1819–1833. https://doi.org/10.1007/s10639-019-09860-z

Laarhoven, T. V., Carreon, Bonneau, & Lagerhausen (2018). Comparing mobile technologies for teaching vocational skills to individuals with autism spectrum disorders and/or intellectual disabilities using universally designed prompting systems. Journal of Autism and Developmental Disorders, 48(7), 2516–2529. https://doi.org/10.1007/s10803-018-3512-12

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310

Lee, J. S., & Kim, S. W. (2015). Validation of a tool evaluating educational apps for smart education. Journal of Educational Computing Research, 52(3), 435–450. https://doi.org/10.1177/0735633115571923

MacSuga-Gage, A. S., Schmidt, McNiff, Gage, N. A., & Schmidt (2015). Is there an app for that? A model to help school-based professionals identify, implement, and evaluate technology for problem behaviors. Beyond Behavior, 24(1), 24–30. https://doi.org/10.1177/107429561502400105

Maich, Rutherford, & Bishop (2019). Phones, watches, and apps: Engaging everyday mobile assistive technology for adults with intellectual and/or developmental disabilities. Exceptionality Education International, 29(1), 116–135.

McNaughton, & Light (2013). The iPad and mobile technology revolution: Benefits and challenges for individuals who require augmentative and alternative communication. Augmentative and Alternative Communication, 29(2), 107–116. https://doi.org/10.3109/07434618.2013.784930

Ok, M. W., Kim, M. K., Kang, E. Y., & Bryant, B. R. (2016). How to find good apps. Intervention in School and Clinic, 51(4), 244–252. https://doi.org/10.1177/1053451215589179

Papadakis, & Kalogiannakis (2017). Mobile educational applications for children: What educators and parents need to know. International Journal of Mobile Learning and Organisation, 11(2), 256–277. https://doi.org/10.1504/IJMLO.2017.085338

Papadakis, Kalogiannakis, & Zaranis (2017). Designing and creating an educational app rubric for preschool teachers. Education and Information Technologies, 22(6), 3147–3165. https://doi.org/10.1007/s10639-017-9579-0

Powell (2014). Choosing iPad apps with a purpose: Aligning skills and standards. TEACHING Exceptional Children, 47(1), 20–26. https://doi.org/10.1177/0040059914542765

Revilla, Saris, W. E., & Krosnick, J. A. (2014). Choosing the number of categories in agree disagree scales. Sociological Methods and Research, 43(1), 73–97. https://doi.org/10.1177/0049124113509605

Roberts, & Priest (2006). Reliability and validity in research. Nursing Standard, 20(44), 41–45. https://doi.org/10.7748/ns2006.07.20.44.41.c6560

Shweta, Chaturvedi, H. K., & Bajpai (2015). Evaluation of inter-rater agreement and inter-rater reliability for observational data: An overview of concepts and methods. Journal of the Indian Academy of Applied Psychology, 41(3), 20–27.

Stancliffe, R. J., Tichá, Larson, S. A., Hewitt, A. S., & Nord (2015). Responsiveness to self-report interview questions by adults with intellectual and developmental disability. Intellectual and Developmental Disabilities, 53(3), 163–181. https://doi.org/10.1352/1934-9556-53.3.163

Thomas, C. N., Peeples, K. N., Kennedy, M. J., & Decker (2019). Riding the special education technology wave: Policy, obstacles, recommendations, actionable ideas, and resources. Intervention in School and Clinic, 54(5), 295–303. https://doi.org/10.1177/1053451218819201

Vaala, & Levine, M. H. (2015). Getting a read on the app stores: A market scan and analysis of children’s literacy apps. New York, NY: The Joan Ganz Cooney Center at Sesame Workshop. https://www.joanganzcooneycenter.org/content/uploads/2015/12/jgcc_gettingaread.pdf

Weng, P.-L., & Taber-Doughty (2015). Developing an app evaluation rubric for practitioners in special education. Journal of Special Education Technology, 30(1), 43–58. https://doi.org/10.1177/016264341503000104

Xie, Basham, J. D., Marino, J. T., & Rice, M. F. (2018). Reviewing research on mobile learning in K-12 educational settings: Implications for students with disabilities. Journal of Special Education Technology, 33(1), 27–39. https://doi.org/10.1177/0162643417732292