Abstract
As smartphones’ computing power continues to grow and as mobile applications (apps) continue to dominate digital engagement, apps have become a new frontier for advancing field experiment methodology. Using apps may help researchers scale up reach, precisely control randomization and experiment materials, collect a variety of objective and self-reported data over time, and more conveniently replicate and adapt an experiment. We performed a systematic review of field experiments involving apps published between 2007 and 2017. Seven databases were searched using a predefined search strategy. The database search retrieved 4,810 citations; 101 articles met the inclusion criteria. Our review suggests that scholars have only started to employ apps in field experiments in the last 4 years. Most studies used apps only as an experiment treatment rather than as an experiment platform; therefore, researchers have yet to fully leverage these advantages. Almost all studies were from the health research domain, and 77.2% used a randomized controlled trial design. Only 7 studies utilized smartphone sensors for collecting data. Only one study reported cost and ethical concerns regarding using apps for the experiment. Given these findings, we report a case study that targeted a minority racial group and leveraged the advantages of apps as an experiment platform and as a data collection tool to illustrate practical challenges and lessons learned regarding time, financial cost, and technical support. In conclusion, we suggest that apps provide new ways to study causal mechanisms with experimental big data. Limitations of generalizability, retention, and design quality are discussed as well.
Introduction
Smartphones are powerful computing and sensing platforms integrated into people’s daily life. About 77% of the U.S. adult population own a smartphone (Pew Research Center, 2017), and 3 out of every 4 minutes spent on smartphones were spent on apps (Comscore, 2017). In an average month, an average person spent 37 hours and 28 minutes on apps (Nielsen, 2015). Unsurprisingly, the most popular apps are for online social networking (e.g., Facebook and Facebook Messenger), informational purposes (e.g., Google search and Gmail), and gaming (e.g., Trivia Crack and Pokémon Go). Many of the fastest growing apps are the ones that provide offline services or improve behaviors, such as rideshare services, exercise coaching, and date planning (Comscore, 2017). As smartphones’ computing power continues to grow and as people spend more time using apps to connect with others, apps to some extent have replaced traditional communication technology and have become an indispensable part of life. When billions of people enjoy using their apps, they are also constantly influenced by tailored advertisements, opinions, and behaviors of their families and friends.
Field experiments are social experiments implemented in naturally occurring environments. In contrast to lab experiments that often use abstract manipulations and are conducted in restricted times and highly controlled settings, and in contrast to natural experiments that are conducted upon unplanned naturally occurring events, field experiments introduce theoretically or pragmatically based interventions into people’s daily lives and observe their responses over extended periods of time (Gerber & Green, 2012). Despite the strengths of random assignment, unobtrusive measurement, and large-scale setting, field experiments have been underutilized in many research domains due to financial, logistical, and ethical concerns. Similar to websites being used for setting up experiments in the past decade, apps nowadays have become an unignorable new frontier for advancing experiment methodology, functioning as a convenient medium that can deliver research materials and requests to a large pool of participants in real time and on-the-go, seamlessly integrating experimentation into people’s daily activities.
Different research domains adopt apps for field experiments highlighting different methodological features. Social scientists demonstrate that apps can reach a larger and more diverse pool of participants. For instance, The Great Brain Experiment (Rutledge, Skandali, Dayan, & Dolan, 2014), a free app run by neuroscientists, used gamified experiments involving more than 60,000 people to investigate memory, impulsivity, risk-taking, and happiness on a scale that would not be possible using traditional lab experiment approaches. Public health researchers test apps as behavior changers and facilitators, leveraging apps as a powerful persuasive technology. For instance, one study designed three versions of an app based on motivational frames drawn from behavior change theories in promoting physical activity among sedentary adults (King et al., 2013). One version customized the app for assisting self-monitoring, one version focused on social influence, and one version utilized emotional reinforcement. Finally, computer and information scientists focus on designing apps for optimal user experiences and for building sensors for data collection. For instance, the “Friends and Family Study” at MIT (Aharony, Pan, Ip, Khayal, & Pentland, 2011) included continuous collection of over 25 phone-based signals such as location, accelerometer, Bluetooth-based device proximity, and communication activities.
The purpose of this paper is to discuss advantages and challenges in using apps for field experiments. We first discuss how apps can advance field experiments. Then we provide a systematic review of current practices in using apps for field experiments. By summarizing characteristics of previous studies, we aim to answer the following questions: What domains of research are using apps for field experiments? To what extent have previous studies leveraged the advantages of apps? And what challenges do they face in using apps for field experiments? In light of the current trend in app-based field experiments, we discuss a case study of a mobile app-based physical activity intervention among young African American women. The detailed descriptions of the app development and the project implementation illustrate sample advantages and challenges, and provide practical guidance for researchers considering using apps for field experiments.
Advantages of using mobile apps for field experiments
An app is a software application developed specifically for use on small, wireless computing devices, rather than desktop or laptop computers. Technically, there are three types of apps: web-based, native, and hybrid. A web-based app is hosted on the web and accessed from a browser on the mobile device, whereas a native app is built for a specific platform (e.g., iPhone or Android) using that platform’s code libraries and accessing hardware features of the mobile device (e.g., camera and accelerometer). A hybrid app combines the best functions of these two; it can run across platforms and can access hardware features. In brief, apps can provide direct access to an existing website, can function as independent software, and can collect data from device hardware (Joorabchi, Mesbah, & Kruchten, 2013).
Field experiments are experiments conducted in the natural world. In comparison with lab experiments that are confined to highly controlled lab settings, field experiments extend the scale of the experiment and test the effects of manipulations in real-world settings. In comparison with natural experiments that are dependent upon uncontrollable naturally occurring events, field experiments retain the ability to control the design and randomize theoretically or pragmatically relevant manipulations (Gerber & Green, 2012). When designed and implemented rigorously, field experiments can establish good internal and external validity. However, in practice, field experiments can encounter many difficulties, such as a lack of resources and of collaboration with implementing sites, that can compromise both internal and external validity (Banerjee & Duflo, 2017). For instance, while persuasive messages may show strong effects in changing people’s attitudes in a lab setting, such effects may not be detected in actual field campaign evaluations due to practical issues such as a lack of message exposure (i.e., failure in reach) and contamination across experiment conditions (i.e., failure in control of randomization; Hornik, 2002).
Advances in field experiment methodology are based on technological innovations that can solve practical difficulties faced by field experiments. To provide an overview of the methodological advances brought by apps for field experiments, we discuss the following advantages in light of common practical difficulties: scale, control, measurement, and replication and adaption.
Scale
Scale is an essential feature of conducting field experiments to understand the dynamics of attitude and behavior change in social settings. Field experiments often face a practical difficulty in scaling up and need to invest a huge amount of human resources to reach a large pool of participants. For instance, to examine the effectiveness of different network-targeting strategies for behavior diffusion, researchers randomized 32 villages in rural Honduras and surveyed a population of 5,773 face-to-face for over 2 years (Kim et al., 2015). Because of the high frequency of smartphone and app usage (Comscore, 2017), apps can serve such field experiments by making each participant’s smartphone an intervention facilitator and a data collector to track the large-scale dynamics of social influence among thousands, even tens of thousands of people interacting on the platform. In addition to having difficulty in reaching a large pool of participants, field experiments can sometimes face difficulties in reaching vulnerable populations (e.g., homeless people and injecting drug users), who may not have a stable residence. There is evidence showing homeless people may rely on smartphones for vital support and to combat social exclusion (Asgary et al., 2015; Post et al., 2013). Therefore, apps may provide an important channel to connect to such populations in targeted field experiments.
Control
Field experiments can face practical difficulties in controlling randomization and the quality of delivery of experiment materials. Especially in field experiments conducted in collaboration with community organizations, human error may compromise the random assignment process and may cause experiment materials to be delivered inconsistently with the assignment. For instance, in studies that tested the effects of HIV risk reduction interventions in comparison to health promotion interventions, facilitators delivering the intervention materials to small groups had to receive intensive training to make sure their teaching and facilitation adhered to the protocol and were identical across small groups (Jemmott et al., 2014; Zhang et al., 2016; Zhang, Brackbill, Yang, & Centola, 2015; Zhang, Jemmott, & Jemmott, 2015; Zhang et al., 2017). Apps can avoid such human errors by precisely programming random assignments and controlling the delivery of experiment materials. This means that the full sequence of study enrollment, random assignment, and delivery of experiment materials can be automated and recorded. More importantly, the automation ensures strict double-blind experiments, where neither study participants nor researchers know the assignment, thus limiting experimenter biases and the Hawthorne effect (McCarney et al., 2007).
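To make the idea of automated and recorded assignment concrete, the following is a minimal Python sketch, not a description of any reviewed study's implementation. The condition labels, participant identifiers, and the `assign_condition` helper are hypothetical, and simple (unblocked) randomization is assumed.

```python
import random
from datetime import datetime, timezone

CONDITIONS = ("treatment", "control")  # hypothetical condition labels


def assign_condition(participant_id, rng, log):
    """Assign one participant to a condition and record the assignment."""
    condition = rng.choice(CONDITIONS)
    log.append({
        "participant_id": participant_id,
        "condition": condition,
        "assigned_at": datetime.now(timezone.utc).isoformat(),
    })
    return condition


# A fixed seed makes the assignment sequence reproducible for later auditing.
rng = random.Random(2017)
assignment_log = []
for pid in ["p001", "p002", "p003", "p004"]:  # hypothetical participant IDs
    assign_condition(pid, rng, assignment_log)
```

Because the assignment is made in code and written to a log at enrollment, no facilitator needs to see or act on it, which is what allows both participants and researchers to remain blind to condition.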
Measurement
Smartphones are powerful devices for collecting information from people, their activities, and their environment (Helbing, Bishop, Conte, Lukowicz, & McCarthy, 2012). The most commonly used sensors include the proximity sensor, accelerometer, gyroscope, barometer, ambient light sensor, thermometer, pedometer, and heart rate monitor (Chaudhri et al., 2012). First, smartphone usage data can be used to infer individuals’ sociodemographic backgrounds which then can be used to predict population behaviors on a large scale. For instance, information on smartphone version and data mode may be used to infer age and income level. Second, data collected from location and motion sensors can be used to infer individuals’ behaviors including communication behaviors (e.g., making phone calls and watching videos), social interactions (e.g., standing next to each other), and physical activities (e.g., taking steps and running). Third, information on levels of light, noise, temperature, and humidity collected from the sensors can be used to infer individuals’ living environment quality. Researchers can gather these objective data on an hourly interval or a second interval to calculate individual behaviors with a high level of precision. Beyond these built-in sensors, apps can also be used to gather self-reported data. With push notifications, researchers can send survey questions any time, and participants can conveniently report their thoughts and behaviors. With the increased usage of smartphones, apps may become a new gold standard for accurate measures of real-time behavior changes and outperform the currently fashionable big-data analytics approach (Helbing & Pournaras, 2015).
Replication and adaption
Unlike lab experiments that can be relatively easily replicated, large-scale field experiments have rarely been replicated. This is because most field experiments cannot be reproduced under identical structural circumstances with identical delivery of experiment materials and measurement. Building an experiment into an app provides a potential solution to these obstacles. With a controlled experiment design, populations of participants from multiple sites can be automatically randomized and can receive identical experiment materials and measurement. For instance, multisite field experiments using apps can avoid potential problems as a result of site idiosyncrasies. Furthermore, once an app is built, the additional cost for adding features is minimal. This allows researchers to adapt the app for testing extensions of the experiment and for exploring variations of theoretical models.
In sum, mobile apps can advance field experiments on several levels. Using an app as an experiment platform may help researchers to broaden the scale, precisely control randomization and experiment materials, collect a variety of objectively sensed and self-reported data over time, and more conveniently replicate and adapt an experiment. Given these advantages, it is useful to examine to what extent previous experiment studies have leveraged the advantages of apps.
Systematic review of field experiments using mobile apps
Methods
To get a comprehensive understanding of the current practice of using apps for field experiments, we conducted a systematic review of the literature. In February 2017, a systematic search covering the last 10 years was performed on the following databases: PubMed, Embase, CINAHL, Medline, Scopus, Web of Science, and EBSCOhost – Communication Abstracts. The search targeted field experiments using mobile apps, but also used related terminology such as “intervention” and “gaming” to capture all of the relevant literature. The search query included the following keywords: “mobile application” or “mobile app” or “smartphone” or “mobile technology,” in conjunction with “field experiment” or “experiment” or “intervention” or “game” or “gaming,” in the title or abstract, with publication dates from January 1, 2007, to February 15, 2017. Articles were included if the study conducted an experiment, used human subjects, involved mobile apps (either as the sole intervention or as a component of the intervention), and had an outcome as a result of the experiment intervention. Studies were excluded if they did not utilize an app, did not perform an experiment, used a web-based or tablet intervention, used the app strictly for recruitment or data collection, or if the study was not fully completed. Additionally, any duplicates or studies in languages other than English were excluded. The initial screening was conducted through a search of titles and abstracts by a team of four trained research members. Each member was assigned 1–2 database(s) to screen. From these titles and abstracts, the team screened out articles that did not fit the inclusion criteria. The initial screening was broad to ensure that important articles were included. If the inclusion of an article was questioned, the team either collectively made a decision or forwarded the article to the principal investigator. After the initial screening, the team performed a full-paper screening. 
Team members were assigned different sections of articles to screen. This was the strictest screening for inclusion and exclusion criteria. In addition, team members also checked references of the included articles and followed the same procedures to identify additional eligible articles. The final screening was then fully reviewed by the principal investigator. The principal investigator also consulted with two other experts from the field to check if additional studies should be added. The following study characteristics were analyzed from the final set of articles: publication year, domain of research, experiment design, sample size, demographics (age and racial composition), study duration, measurement type, retention rate, and the inclusion of an ethics discussion.
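The keyword combination described above can be written as a single Boolean query string. The sketch below assumes a generic AND/OR syntax with quoted phrases; each database uses its own field tags and date filters, which are not reproduced here, and the helper names `or_group` and `build_query` are hypothetical.

```python
APP_TERMS = ["mobile application", "mobile app", "smartphone", "mobile technology"]
STUDY_TERMS = ["field experiment", "experiment", "intervention", "game", "gaming"]


def or_group(terms):
    """Join quoted terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(app_terms, study_terms):
    """Require at least one app term AND at least one study-design term."""
    return f"{or_group(app_terms)} AND {or_group(study_terms)}"


query = build_query(APP_TERMS, STUDY_TERMS)
print(query)
```

Restricting the match to titles and abstracts and to the January 1, 2007 to February 15, 2017 window would be applied through each database's own filters.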
Results
Figure 1 shows the numbers of studies retrieved from each screening step. A total of 4,810 articles were retrieved from the databases. Among the 4,810, 170 duplicates were first excluded, then 4,311 articles were excluded based on reading of the title and abstract. Reports of process evaluation, study protocol, nonexperimental feasibility study, and description of app development were excluded. The full texts of the remaining 329 articles were retrieved and examined. Among the 329, 233 articles were excluded. These included studies that were incomplete, reported only baseline data, did not use mobile apps, did not conduct experiments, or conducted experiments in laboratory settings. In addition, we examined references of the remaining 96 articles and consulted two experts for additional studies. A final sample of 101 articles was included in the systematic review. Table 1 shows the characteristics of studies included in the review. The articles were published from 2011 to 2017, with 89.1% published after 2013. Almost all articles were from the health research domain. Five were from education, three were from advertising, two were from psychology, and only one was from communication research. The majority of studies (77.2%) used randomized controlled trial (RCT) design, followed by within-subject design, and quasi-experiment design. About 14.8% of the studies designed apps as an experiment platform and tested effects of different versions of the app. The sample sizes ranged from 4 to 44,000, with a median of 95.0 and a mean of 798.8 (SD = 4713.1). The target populations ranged from adolescents to older adults, with a mean age of 38.1 (SD = 12.8) across all studies. Only 41 (40.6%) studies reported racial information about their participants and only six primarily targeted non-White populations. One study targeted people living in lower socioeconomic status (SES) communities. 
The lengths of the studies ranged from 30 minutes to 36 months, with a median of 3.0 and a mean of 5.1 months (SD = 5.9). Only seven studies utilized smartphone sensors for collecting data. The retention rates ranged from 30% to 100%, with a mean of 85% (SD = 14.9%). Only one study mentioned cost and ethical concerns regarding using apps for the experiments.
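The screening counts reported above can be checked with simple arithmetic. The Python sketch below uses only figures stated in the text; the number of articles added through reference checks and expert consultation is not stated directly and is derived here as the difference between the final sample and the 96 articles remaining after full-text screening.

```python
retrieved = 4810
duplicates = 170
excluded_title_abstract = 4311
excluded_full_text = 233
final_sample = 101

# Articles whose full texts were retrieved and examined.
full_text_screened = retrieved - duplicates - excluded_title_abstract

# Articles remaining after full-text screening.
remaining = full_text_screened - excluded_full_text

# Derived, not reported: articles added from references and expert input.
added = final_sample - remaining

# Share of studies reporting racial information (41 of 101).
pct_reporting_race = round(41 / final_sample * 100, 1)

print(full_text_screened, remaining, added, pct_reporting_race)
```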

Figure 1. Flow diagram of included studies.
Table 1. Characteristics of the 101 studies included in the systematic review.
Although apps are becoming the dominant form of digital engagement, only in the last 4 years have more scholars started to employ apps in field experiments. Researchers focusing on health including those from the medical, public health, and nursing schools are the leaders in employing apps in field experiments. In light of the four advantages discussed before, it is clear that the majority of the reviewed studies have not leveraged them.
Regarding scale, most studies had sample sizes in the hundreds, similar to traditional field experiments. This is because the majority of the reviewed studies had non-app-based experiment conditions such as face-to-face trainings that could only reach a small number of participants. Only two studies truly leveraged the global reach of the app store and ran the experiments entirely through the app. One study recruited 18,420 participants and examined momentary subjective well-being as a result of different game designs (Rutledge et al., 2014) and the other delivered experiment manipulations as in-app banner advertisements and examined ethnic preferences in voting among 44,000 unique users (Nisser & Weidmann, 2016). Considering diversity of participants, only six studies targeted non-White populations and only one study targeted lower SES communities. We found no study that targeted vulnerable or hard-to-reach populations.
Regarding control, most studies tested the effectiveness of a specific app-based treatment in comparison to some forms of non-app-based control conditions (e.g., face-to-face, paper-based, web-based, or no-treatment control). Although many studies were RCTs, they only used apps as an intervention treatment and not as an experiment platform. The purpose of such studies was to test whether app-based interventions worked and whether they worked better than traditional intervention approaches. There were a few studies that compared effects of different apps. For instance, one study compared the effects of Nike + Running, a performance-monitoring app with Zombies Run!, an exercise gaming app (Gillman & Bryan, 2016). In total, 16 studies utilized apps as an experiment platform, delivering experiment materials through different versions of the app. For instance, one study designed two versions of an app leveraging different theoretical concepts: a group dynamics-based app for establishing group exercise norms and an individual support app for receiving standard social support (Irwin, Kurz, Chalin, & Thompson, 2016). The results of this study not only show the effects of different apps under precise experiment control, but can also inform theory development.
Regarding measurement, most studies did not utilize smartphone sensors for collecting objective behavior data and still relied on self-reported data (e.g., attitude, perceived norm, or behavior intention) collected from either offline or online surveys or researcher-assessed behavior data (e.g., weight, muscle strength, or skills) on a few assessment time points. Only seven studies utilized smartphone sensors. One study aiming to promote gratitude among college students innovatively used smartphone sensor data to calculate the optimal time to deliver gratitude-inspiring content, and used objective log data from the app to assess participant engagement. Specifically, the study utilized the Global Positioning System (GPS) to track location, Bluetooth to infer social proximity, and inertial sensor to estimate physical activity (Ghandeharioun, Azaria, Taylor, & Picard, 2016).
Lastly, regarding replication and adaption, we found that only two studies employed the same app. The same research team that developed the Pounds Off Digitally (POD) mobile app (Turner-McGrievy & Tate, 2011) later adapted the app to the Social POD app to include some social features such as informing participants of other participants’ behaviors (Hales et al., 2016).
None of the reviewed studies discussed practical difficulties or challenges encountered in the research. The cost effectiveness of app-based field experiments remains unclear at this point. Only one study reported the cost of the study and the ethical concern from the institutional review board (IRB). Nisser and Weidmann (2016) reported a budget of $5,000 and reached 44,000 unique users through in-app advertisements. The IRB granted a waiver of ethical approval because the display of banner ads within apps and the collection of location data are standard practices in mobile advertising. This study suggests an innovative approach to conducting app-based experiments, especially for testing different message strategies at a population level.
Findings from this systematic review suggest that more and more researchers are starting to utilize apps in field experiments. However, this methodological approach is still young and has not fully leveraged the unique advantages brought by apps. The most prominent issue is that previous studies compared apps with other treatments, such as web-based or non-technology-based treatments. This may be because early research needs to establish the efficacy of apps in changing behaviors and outcomes in comparison with older technologies. Under such experiment designs, apps appear in only one of the experiment conditions and thus cannot be used for controlling experiment materials or collecting data across conditions. Only a few innovative studies have started to use apps as the experiment platform and as the data collection tool. Only seven of the 101 studies leveraged smartphone sensors in collecting behavior data.
Given the heterogeneity of the reviewed studies, it is hard to compare different research designs and approaches to extract practical recommendations for using apps in field experiments. Most reported field experiments also did not include extensive details of their app development. In the following section, we provide a case study of our own research and discuss several practical challenges and lessons learned from the project. Different from the majority of the reviewed app-based field experiments, the case study targeted a minority racial group and leveraged the advantages of apps as an experiment platform and as a data collection tool using sensors. Detailed descriptions of the app development and the project implementation provide practical guidance for researchers considering using apps for field experiments.
Case study: An app-based physical activity intervention among young African American women
The PennFit study is an app-based physical activity intervention for young African American women (Zhang, 2017), a population at high risk for the deleterious consequences of physical inactivity (American Cancer Society, 2013; Centers for Disease Control and Prevention, 2011; Go et al., 2013; Murphy, Xu, & Kochanek, 2013). The study aimed to assist young African American women in establishing physical activity routines in their daily life. Because there is no previous work on using apps for physical activity intervention for this population, we first conducted formative research to assess the needs and user preferences. We did one-on-one face-to-face interviews with 30 young African American women to elicit attitudes and opinions for development of the app. We found that none of the women had previously used any fitness-tracking device and most had not used fitness apps. When asked about what makes it easy for them to engage in regular physical activity, the majority of the women mentioned that having daily reminders and having some form of social support would be helpful. Specifically, they mentioned they did not have close friends who exercised regularly and liked the idea of having new exercise buddies.
Based on the formative research findings, we learned that these women were willing to use apps, and especially to use apps to connect with other women. Thus, we designed an experiment to test whether using the app with social connections to other women (the social app) would be more effective than using the app by oneself (the solo app) in increasing daily physical activity. We hired two developers as independent contractors to build an Android app for this project. The reasons for only building an Android app are that Android phones are more popular among African Americans, and we did not have enough budget for building an iPhone app. One front-end developer built the interfaces of the app and one back-end developer built the communication and data infrastructure. The basic functions of the PennFit app were to send push notifications to provide daily exercise reminders, to allow women to track their daily steps, and to manually enter their daily exercises. Although we knew smartphone sensors could track steps, we decided to give Fitbit Zip to the participants to track steps because these women were interested in using fitness trackers and Fitbit had been shown to be a valid tracking device (Ferguson, Rowlands, Olds, & Maher, 2015; Vooijs et al., 2014). The app therefore was connected to Fitbit’s application program interface (API) to gather data from each tracking device every 2 minutes. Once the basic functions were established, developers then built two versions of the app: the social app and the solo app. The social app provided online connections that allowed women to see three other women’s steps and exercises in real time and to exchange information through an online chatting tool. The solo app only allowed women to see their own steps and exercises without any social component. The interfaces of the two versions are shown in Figure 2. We have obscured the profile pictures in this figure to protect participants’ privacy. 
The profile pictures were not obscured in the PennFit app. The PennFit app collected the following data: app login, GPS location information, and Fitbit data including the number of steps; light, moderate, and vigorous activities; and active calories. All objectively collected data and user input data were stored on a cloud computing service, Amazon Web Services (AWS).
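The periodic data gathering described above can be illustrated with a simple polling loop. This is a hedged Python sketch only: the PennFit app was an Android app whose code is not shown in the text, `fetch_steps` is a hypothetical stand-in rather than a real Fitbit API call, and the device identifiers and step counts are made up for illustration.

```python
import time

POLL_INTERVAL_SECONDS = 120  # the text reports gathering data every 2 minutes


def poll_devices(device_ids, fetch_steps, store, cycles, sleep=time.sleep):
    """Fetch each device's step count once per cycle and store the readings."""
    for cycle in range(cycles):
        for device_id in device_ids:
            try:
                store.append((cycle, device_id, fetch_steps(device_id)))
            except ConnectionError:
                # Skip this device for now; it is retried on the next cycle.
                continue
        if cycle < cycles - 1:
            sleep(POLL_INTERVAL_SECONDS)


def fake_fetch(device_id):
    """Hypothetical stand-in for a real tracker API request."""
    return {"zip-01": 5230, "zip-02": 8120}[device_id]


readings = []
poll_devices(["zip-01", "zip-02"], fake_fetch, readings, cycles=1)
print(readings)
```

Passing the fetch function and the sleep function as parameters keeps the loop testable without network access or real waiting.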

Figure 2. Sample interfaces of the PennFit app for the solo condition and the social condition.
Before we ran the experiment, we conducted a 1-month pilot test with five of the formative research participants to ensure usability. We then recruited 91 African American women 18 to 35 years of age through Facebook and conducted a 3-month RCT. Upon enrollment, participants completed a baseline online survey on their phones that collected sociodemographic and baseline exercise activity information. We gave all participants a Fitbit Zip to wear daily during the 3-month study period. Each participant installed the PennFit app on her phone and created her profile in the app, including a username, picture, age, favorite exercise, and body mass index (BMI). Upon logging into the app, participants were randomized to the social condition or to the solo condition. Participants assigned to the social condition were then randomly assigned to four-woman networks. Each participant in the solo condition could only see her own profile and physical activity logs. Participants in the social condition could see both their own information and the profiles and activity logs of the three other women assigned to their network. At the end of the 3-month study, all participants completed a final online survey on their phones.
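The two-stage assignment described above (condition first, then networks of four within the social condition) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the PennFit implementation: the text does not report the arm sizes or the allocation procedure, so the sketch assumes a batch shuffle with a roughly even split, rounds the social arm down to a multiple of four so that every network is full, and uses hypothetical participant IDs.

```python
import random


def randomize(participant_ids, network_size=4, seed=None):
    """Split participants into social networks and a solo arm at random."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)

    half = len(shuffled) // 2
    # Assumption: keep the social arm a multiple of network_size.
    n_social = half - (half % network_size)
    social, solo = shuffled[:n_social], shuffled[n_social:]

    # Consecutive slices of the shuffled social arm form random networks.
    networks = [social[i:i + network_size]
                for i in range(0, n_social, network_size)]
    return networks, solo


ids = [f"p{i:02d}" for i in range(1, 17)]  # 16 hypothetical participants
networks, solo = randomize(ids, seed=42)
print(len(networks), "networks;", len(solo), "solo participants")
```

In the study itself, randomization occurred when each participant logged into the app, so a production version would assign participants sequentially rather than in one batch.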
In general, the study was feasible because we could recruit eligible women willing to wear the Fitbit, install the app on their smartphones, and participate for 3 months. We reached a retention rate of 100% because no participant dropped out of the study. The study also received good evaluations. On a scale from 0 to 10, participants’ mean ratings on liking and recommending PennFit to friends and colleagues were 8.5 and 8.3, respectively. More importantly, we were able to use the app as the platform to run the RCT and collect objective data from both the smartphone sensors and the Fitbit. Even though we only enrolled 91 participants due to resource limitations, we had 90 days of the 91 individuals’ objectively logged app login data and physical activity data, which greatly boosted the power of the RCT to detect small experimental effects. By comparing the social condition with the solo condition, we found the online networks increased engagement with the Fitbit device and the app. Specifically, participants in the social condition logged into the app 1.6 more times per day than participants in the solo condition (p = .015). Social participants were also 1.5 times more likely to meet the daily exercise goal objectively assessed by Fitbit than the solo participants (p = .046).
We encountered three main challenges in developing an app for this field experiment. The first challenge was time. Unlike using an existing generic experiment platform, developing an app for an experiment involves multiple phases and iterations of design, testing, and revision. Because the study targeted a racial minority group, the app had to be culturally and functionally congruent with participants’ needs to ensure usability and adherence. The app design and implementation for this study took a total of 12 months. The formative research for understanding the group’s needs and preferences took 1 month, the actual app development including the front-end design and back-end architecting took 7 months, the pilot test with a small sample from the target population took 1 month, and the main RCT involving the full sample took 3 months. The developers were involved for a total of 11 months (development, pilot test, and RCT) because they had to continuously monitor the database and solve any technical problems during the experiment.
The second challenge was financial cost. We spent about $4,000 on app development and about $600 on the cloud computing service. With a limited budget, we could develop only an Android app, which further limited the potential pool of participants. We hired two developers as independent contractors through personal contacts and paid each a fixed amount of $2,000 upon project completion. Had we hired app developers from the market at standard hourly rates, the cost of the project would have been much higher.
The third challenge was technical assistance. Although installing and running the app on smartphones was easy, one important lesson we learned from the study is that during a long study period participants might change their phones and data plans. When the system settings of the smartphones were changed, the participants had to reinstall the app to make sure it could synchronize with the Fitbit. Some participants encountered technical problems in reloading and reconfiguring their apps. Some participants also wanted to learn how their apps worked technically when they encountered technical problems. In addition, some participants complained that the app did not synchronize with the Fitbit in a timely manner when they did not have a reliable Wi-Fi or data connection. A few also complained that the GPS location tracking consumed a lot of battery. The research team had to ensure timely troubleshooting to keep these participants in the study. Specifically, one researcher had to provide technical assistance and call these participants to provide explanations and technical instructions.
Notwithstanding the challenges, developing the PennFit app was highly valuable for our research. The initial development of a research app may take a lot of time because of the formative research and pilot testing. However, once an app is developed, it can be reused and adapted in multiple studies at minimal additional cost. Indeed, we are currently adapting the PennFit app to run another experiment among college students to test theoretical hypotheses about social comparison. Finally, we did not encounter any problems with the IRB. In the IRB applications, we thoroughly discussed all potential scenarios and built protective mechanisms (e.g., data encryption) into the system accordingly. After all, participants have ultimate control over app installation, deletion, and system settings to block push notifications or data crawling. In comparison with traditional field experiments, where participants may not have control over experiment materials delivered in their environments, using apps to run field experiments actually gives participants more control over their participation and data sharing.
Discussion
The purpose of this article has been to invite researchers to think about using mobile apps in the context of field experiments. Our review suggests that although apps have become the dominant digital channel, researchers have just started exploring the feasibility and effects of apps in field experiments. Equipped with the powerful computing resources and multiple sensors of the smartphone, apps bring at least four advantages for conducting field experiments: scale, control, measurement, and replication and adaptation. The two central advantages of apps are control and measurement. To leverage these two advantages, researchers have to use apps as the experiment platform. Enabling precise control over randomization and the delivery of experiment materials to individual participants at their convenience enhances an experiment’s external validity. The ease of using apps, combined with programmed push notifications, can also increase participant engagement with the experiment, eventually contributing to stronger experimental effects. Equally important, apps can leverage a variety of smartphone sensors to collect objective data over a long period of time. Researchers could use these digital footprints to better understand population behavior patterns under different experiment conditions. Beyond physical measures such as physical activity, data from several sensors can also be used to infer communication and social interaction behaviors.
Our systematic review reveals that field experiments have just started to explore the feasibility and initial efficacy of apps. The majority of studies did not leverage the four advantages of using apps. Instead, they treated apps as just another type of experiment treatment, not as the experiment platform. The few studies that ran experiments through the app platform and collected data through sensors were able to broaden the scope of the research and generate significant new insights about human behaviors.
By reviewing these studies, we were not able to gain insights into the challenges of using apps and the cost effectiveness of using apps in field experiments. Studies simply did not report them. In the discussion of the case study, we point out three challenges in our research with regard to time, financial cost, and technical assistance. Developing a new research app to test theoretical hypotheses among a targeted population requires extensive formative research, pilot testing, and revisions. In collaborating with two independent app developers, with a limited budget of less than $5,000, we were only able to develop an Android app with two experimental versions. Our experiences with IRB applications suggest that running field experiments on apps actually gives participants more control over their participation and data sharing. As long as protective mechanisms are integrated into the app design and technical assistance is provided, the risks of using apps for field experiments are minimal.
Despite the potential advantages that apps bring to field experiments, researchers should also keep in mind some inherent limitations of this approach. First and foremost, although the smartphone penetration rate has reached 77% among U.S. adults and continues to rise (Pew Research Center, 2017), the patterns of smartphone and app use vary significantly across different groups. The digital divide still exists. Recent studies pointed out that although poor and marginalized populations now have access to smartphones, they often have trouble maintaining the devices and the data plans (Gonzales, 2016; Gonzales, Ems, & Suri, 2016). In addition, older and lower-income populations were shown to have less competence and skill in using mobile technologies (Lee, Park, & Hwang, 2015). Given these facts, field experiments using apps will underrepresent these populations, and the findings should be discussed in light of this limitation on generalizability. To address this limitation, for studies that target marginalized or vulnerable populations, researchers should first consider conducting formative research to understand participants’ needs and preferences in using technology, then provide all necessary technological assistance, including purchasing devices and data plans and offering training sessions to the participants (Ben-Zeev et al., 2014; Naslund, Aschbrenner, Barre, & Bartels, 2015).
Second, unlike experiments conducted in well-established and controlled environments, field experiments relying on apps face more uncertainties with regard to participant retention. Depending on the quality of the study implementation, retention rates in our reviewed articles ranged from 30% to 100%, with 86 (87.8%) articles reporting retention rates above 70%.
These numbers are similar to retention rates reported in a previous review (Payne, Lister, West, & Bernhardt, 2015). Using an app as the field experiment platform may lead to lower attrition rates because people can participate in the study anywhere and anytime. However, researchers should not blindly assume that simply using an app will lead to good retention rates. If the app itself is not as engaging as popular commercial apps, participants may stop using it after the first try. Poor or inappropriate design of the app will discourage participants from downloading and using it. To address this issue, researchers should conduct formative research to elicit target participants’ preferences and needs and incorporate their input into the design. In addition, given limited development resources, researchers should not assume participants will be highly motivated to use the research app. Providing compensation to encourage consistent use of the app throughout the study can be a strategy to minimize attrition. Moreover, as in non-app-based field experiments, having research staff keep in contact with participants and send reminders through a variety of channels can greatly enhance retention. Finally, it may be useful to provide all participants with some background information about the app system in anticipation of follow-up technical questions.
Third, the quality of the field experiments may be constrained by the quality of app developers and the collaborations between researchers and developers. Choosing app developers and ensuring good collaborations can be challenging. To our knowledge, having in-house app developers for research is uncommon in universities, especially in social science departments. Hiring independent app developers as contractors can be effective because independent developers usually have flexible working time schedules and can be paid based on hourly rates or fixed rates upon project completion. However, app development for research can involve many rounds of revisions and upgrades as research ideas and protocols evolve over time. Sometimes with unforeseeable changes, the design requirements may change quite a lot. Researchers should not assume that developers have a good understanding of the nature of scientific research and experiments and should discuss the research purposes and protocols with developers to minimize misunderstandings and unnecessary overcharges. When making a hiring decision, researchers should make sure to write all details of work requirements and payment conditions in the contract. In addition, researchers can ask for legal assistance from the university to examine the contract to avoid potential disputes.
Conclusion
App-based experiments are the future as apps continue to dominate digital engagement. Apps not only can display web-based content but also can integrate smartphone sensors. Unlike laptops or tablets, smartphones are often carried by their owners all day long. Using apps could be an effective and efficient way to reach, engage with, and track participants. Equipped with smartphone sensors, apps provide new ways to study causal mechanisms with experiment big data. With the advantages brought by apps, social scientists may answer new questions regarding the complex systems of communication and human interactions.
The systematic review indicates that although more and more scholars have started incorporating apps in field experiments, the majority of the studies have not fully leveraged apps’ potential advantages. Practical difficulties such as limited time and limited financial and human resources may prevent researchers from using apps as the field experiment platform. The details of the app development reported in the case study provide some practical guidance to researchers with no experience in using apps. While acknowledging the promise of apps for field experiments, researchers should also keep in mind the inherent constraints of technology-based research regarding generalizability, retention, and design quality, and consider implementing multiple strategies to address these limitations.
Footnotes
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
