Abstract
We conducted three experiments with participants recruited on Amazon’s Mechanical Turk to examine the influence on app-installation decisions of summary risk information derived from the app permissions. This information can be framed negatively as amount of risk or positively as amount of safety, which was varied in all the experiments. In Experiments 1 and 2, the participants performed tasks in which they selected two Android apps from a list of six; in Experiment 3, the tasks were to reject two apps from the list. This summary information influenced the participants to choose less risky alternatives, particularly when it was framed in terms of safety and the app had high user ratings. Participants in the safety condition reported that they attended more to the summary score than did those in the risk condition. They also showed better comprehension of what the score was conveying, regardless of whether the task was to select or reject. The results imply that development of a valid risk/safety index for apps has the potential to improve users’ app-installation decisions, especially if that information is framed as amount of safety.
Introduction
Assessing risk when making decisions is vital in many areas, including gambling at a casino (Frings, 2012), making healthcare decisions (Schwartz, 2011), and deciding about business strategies and tactics that may affect a business and its customers (Hung & Tangpong, 2010). Because the risks associated with specific actions are often not fully known by the decision maker and may be difficult to comprehend, how best to communicate those risks is a concern (e.g., Brust-Renck, Royer, & Reyna, 2013; McLaughlin & Mayhorn, 2014).
One domain with great risks that is relevant to most people is that of smart mobile devices (Toch, Wang, & Cranor, 2012). In early 2014, for the first time more internet traffic was due to smartphone and tablet devices than to personal computers (O’Toole, 2014). The number of smartphone users worldwide is estimated to be 1.76 billion by the end of 2014 (eMarketer, 2014). Although convenient and pervasive, mobile devices also introduce new dimensions of risk, such as leakage of personal files and physical location, as well as monetary loss (Shabtai et al., 2010), and raise new privacy and security concerns. However, users often do not have an accurate understanding of the risks associated with installing apps on their devices (Felt et al., 2012). This lack of understanding leads to the potential for an app to collect data from a user without the user’s explicit and intentional consent, which may result in malicious functionality such as intercepting bank authentication messages or sending texts to premium-rate phone numbers.
In our research, we have focused on security of the Android operating system because of its openness and popularity (Mansfield-Devine, 2012) and the fact that 99% of the mobile malware in 2013 targeted Android devices (Cisco Systems, Inc., 2014). In the current Android device, when a user chooses to install an app, a list of the permissions that the app requests is displayed. The defense against malware relies on users comprehending those permissions and making informed decisions about whether to install this app or select another app that provides similar functionality. However, several studies have provided evidence that users tend to ignore the permissions or fail to accurately comprehend their meanings (Chin, Felt, Sekar, & Wagner, 2012; Felt, Greenwood, & Wagner, 2011; Felt et al., 2012; Kelley et al., 2012).
Several researchers have proposed ways to communicate better the risks associated with installing apps to improve users’ app-selection decisions. Felt et al. (2012) suggested changing the wording of permissions, modifying and renaming the permission categories to be more informative, reducing the number of permissions, specifying more clearly the risks associated with a permission, and presenting users only those permissions that are of high risk. Lin et al. (2012) recommended using crowd-sourcing through Amazon Mechanical Turk (MTurk) to discover users’ expectations about the permissions an app would need and to signal to users when an app’s requested permissions deviate from those expectations. Kelley, Cranor, and Sadeh (2013) proposed that privacy-related information to which an app has access (contacts, personal information, location, etc.) be shown on the app’s main description page, visible when the app-selection decision is being made, rather than late in the process when the user has chosen and wants to install an app, as is currently the case. We have also proposed providing risk information early in the decision process, but in the form of summary risk scores that allow easy comparison between apps (Gates, Chen, Li, & Proctor, 2014; Gates, Li et al., 2014). The efficacy of such risk scores may depend on the way in which the information is presented, or framed.
Positive and Negative Framing in Decision Making
It is well-known that people’s preferences in risky contexts are influenced by the way in which a problem is framed (the framing effect; Tversky & Kahneman, 1981, 1986). For example, people are risk-averse when the outcomes are presented in terms of potential gains (i.e., in a positive frame) but risk-seeking when they are presented in terms of potential loss (i.e., in a negative frame). This effect has been found to be stable and replicable across time and different age groups (Mayhorn, Fisk, & Whittle, 2002) and in many scenarios (e.g., Barnes, McDermott, Hutchins, & Rothrock, 2011; Gambara & Piñon, 2005; Garcia-Retamero & Dhami, 2013).
In principle, because the alternative options in both frames are logically equivalent, people should choose the same option regardless of the problem framing. This framing effect is of special interest in human decision making because it is counterintuitive and inconsistent with the tenets of rational decision making (e.g., the principle of invariance; Tversky & Kahneman, 1986). Similar to the positive and negative framing of outcomes, the risk information associated with a particular app on mobile devices can be framed positively in terms of the amount of safety or negatively in terms of the amount of risk. Thus, framing the risk information in these two forms may lead to different judgments or preferences by users. We focused on the safety and risk frames in the current study because safety is a customary antonym of risk (see merriam-webster.com/), and both words are short enough to keep the app interface succinct.
Task Compatibility
The principle of compatibility is that input information is weighted based on its compatibility with the output (response): Inputs that are more compatible with outputs are weighted more and draw more attention than those that are less compatible (Huber, Huber, & Bär, 2014; Rubaltelli, Dickert, & Slovic, 2012; Slovic, Griffin, & Tversky, 1990; Tversky, Sattath, & Slovic, 1988). Compatibility between the nature of the task and the valence of the alternatives has been shown to influence decision making and judgments. Shafir (1993) provided evidence that positive dimensions are weighted more heavily when the task is one of selecting among alternatives, whereas negative dimensions are weighted more heavily when the task is to reject alternatives. Similarly, Nagpal and Krishnamurthy (2008) found that the combination of task and the valence of the alternatives had an influence on decision difficulty and decision time. When the task and the valence of the alternatives were compatible (e.g., choosing between two attractive alternatives, or rejecting one of two unattractive alternatives), the decisions were easier than when they were incompatible (e.g., choosing between two unattractive alternatives, or rejecting one of two attractive alternatives). Several other studies have shown that a selection task prompts the decision maker to focus more on the positive features of the alternatives, whereas a rejection task prompts a focus on the negative features (e.g., Chernev, 2009; Lai & Hui, 2006).
Installing an app on a mobile device is essentially a selection task. That is, the user selects which app to download among several apps that provide similar functionality. Positive information associated with an app may be weighted more than negative information in this selection task. With regard to an overall risk score, the compatibility principle suggests that when this score is presented as amount of “safety,” this information will be weighted more in the app-selection decision than when it is presented as amount of “risk.”
Current Study
We previously provided evidence that a summary risk score is beneficial in conveying risks for Android apps (Gates, Chen et al., 2014). The summary score supports easy comparison for risks associated with apps of similar functionality. In that study, we reported experiments that examined the effects of displaying a summary risk score in text or symbol format. The results showed that participants took this summary score into account, and it had a positive effect on their app-selection decisions. Furthermore, in a laboratory experiment in which participants were to decide as fast as possible whether to install an app or not, performance was better when the symbol designated amount of safety rather than risk. However, in an MTurk experiment in which participants selected one app from two presented side-by-side, there was little difference in their decisions as a function of whether the summary score conveyed risk as opposed to safety.
As noted, the latter online experiment of Gates, Chen et al. (2014) involved selection between only two alternative apps, which were presented side-by-side. This differs from the environment in the app stores for mobile devices, where the user is typically confronted with a list of multiple apps with similar functionality. For this reason, in the present study the number of alternative apps to be considered for each decision was increased to six, and they were presented in a list format. Our hypothesis was that in this more realistic scenario with a selection task, framing the summary risk/safety information in terms of safety would lead to less risky decisions than framing the information in terms of risk. Experiments 1 and 2 were designed to test this hypothesis; in Experiment 3, the decision was changed to one of rejecting the same number of apps from the list, to test whether compatibility with the task was a crucial factor.
Experiment 1
To emulate an app-selection context, we presented lists of alternative apps to users of MTurk, who would have experience installing apps on mobile devices. The display was designed to mimic what one would see when selecting an app from the Google Play store. Of most interest was the inclusion of a summary score, framed as amount of risk for some participants and safety for others, in the form of number of filled circles out of five. The expectancy was that users’ choices would be influenced by this score, more so when conveyed as safety than risk.
Method
Participants
In total, 295 participants were recruited through MTurk. The experiment took about 10 minutes to complete, and participants were paid $0.50 each. This study, and Experiments 2 and 3, received approval from Purdue University’s Institutional Review Board.
Materials and procedure
Participants were randomly assigned to a safety or risk condition. At the beginning of the experiment, an introductory page was displayed that included the purpose and a demonstration of the elements of each app that would be shown in the tasks (see Figure 1). Before performing the app-selection tasks, participants completed a pretask questionnaire that collected demographic information, their history of use of mobile devices, and their app installation activities. Each app-selection task was on one page with a heading, “Which 2 of these Android Apps would you choose?” (see Figure 2). The six apps presented for each task were taken from the results of a Google Play Store search for a specific functionality, conducted by one of the experimenters. We skipped the Top 10 apps and chose the 11th to 16th apps to avoid a possible influence of participants’ familiarity with the apps.

Figure 1. The introductory page demonstrating the elements of each app in the risk condition.

Figure 2. An example of the task: Once an app is selected, its background turns to blue, the “SELECT” button changes to “UNSELECT,” and a list of reasons for app selection is shown below the selected app.
For each app, the displayed information included icon, app name, developer, user rating (filled stars out of five), user rating count, permission safety/risk (filled circles out of five; more circles indicated increasing safety in the safety condition and increasing risk in the risk condition), and two lines of a brief description that ended with “…” if it was not complete. The user ratings were generated randomly for each participant and each app. Each user rating was drawn from a distribution based on approximately 300,000 apps collected from the actual app store. From five stars to one star, the sampling likelihoods were [.25, .35, .20, .10, .10], and the actual proportions in the store sample were [.257, .352, .197, .099, .095]. The permission safety/risk scores were drawn uniformly at random for each app. They were displayed as filled red circles for risk and filled green circles for safety, to take advantage of the colors’ strong associations with “stop” and “go,” respectively (Bergum & Bergum, 1981). Thus, for an app in a specific location in a task, the display was controlled to be identical for all participants except for the user rating and the safety/risk score.
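The stimulus randomization described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' experiment code: the function names and the seed are hypothetical, and the circle scores are assumed to range from 1 to 5, matching the safety rankings reported in the results.

```python
import random

# Rounded sampling likelihoods reported in the text, from five stars down to one.
STAR_VALUES = [5, 4, 3, 2, 1]
STAR_WEIGHTS = [0.25, 0.35, 0.20, 0.10, 0.10]


def generate_app_display(rng):
    """Return a (user_rating, permission_score) pair for one app slot."""
    # Weighted draw for the star rating.
    user_rating = rng.choices(STAR_VALUES, weights=STAR_WEIGHTS, k=1)[0]
    # Uniform draw for the filled circles (assumed range: 1 to 5).
    permission_score = rng.randint(1, 5)
    return user_rating, permission_score


def generate_task(rng, n_apps=6):
    """Return the randomized display values for the six apps in one task."""
    return [generate_app_display(rng) for _ in range(n_apps)]


rng = random.Random(2014)  # hypothetical seed, used only for reproducibility
task = generate_task(rng)
print(task)
```

Because the two values are drawn independently for each participant and each app, every other display element (icon, name, description, position) stays fixed across participants, which is what allows each app slot to be analyzed as its own unit.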
Participants were urged to select two apps out of six by clicking “select” buttons that were positioned under each app. Selection of two apps rather than one was required because users typically do not make their final installation decision based only on the summary information available in the list display. The purpose was to examine what factors influence users’ decisions in narrowing their options to a smaller subset of the apps. Upon clicking the button for a particular app, the question, “Why did you pick this app?” was asked, with the following listed options: User Rating Score, User Rating Count, Permission Safety (or Risk), Icon Look and Feel, Description, Familiarity with app or developer, and Other. Participants could indicate as many of these options as they wanted, and they did not have to select any reason before clicking the submit button to continue to the next task. After all six tasks, a posttask questionnaire was conducted regarding the app selection tasks and the participant’s security expertise and concern.
Results and Discussion
Demographics
The demographic data are shown in Table 1. Almost two-thirds of the participants were male, with most between 18 and 40 years of age. More than 90% indicated that they used Android devices, with approximately 90% installing apps on at least a monthly basis. Less than 2% responded that they were security experts. Table 1 also includes the demographic information for participants in Experiments 2 and 3, which we will not discuss further since their characteristics were similar, except for having somewhat higher percentages of males.
Table 1. Percentage of Participants in Each Demographic Category for All Three Experiments
App-selection analysis
The independent variables were the user rating, permission safety/risk score of each app, and the safety/risk condition, and the dependent variable was whether a specific app was selected by the participant. Again, “each app” refers to the app in a specific location in a task. We first conducted correlational analyses between app selection and user rating, and between app selection and safety/risk score, to find out the relation between these variables. These analyses were conducted for each app, which has a fixed icon, position, description, etc. A repeated-measures ANOVA was conducted for the percentage of app selection, by comparing each app’s selection percentage under each User Rating × Safety/Risk Score combination. This analysis was performed as if an app was a participant, with user rating, safety/risk score, and safety/risk condition as within-app factors. A binary logistic regression analysis was also performed to examine the effects of user rating, safety/risk score, safety/risk condition, and their interactions. The results were similar to those of the ANOVA but not as detailed, so we report only the ANOVA results in this and the following experiments.
In the safety condition, there was a positive correlation between user rating and app selection for every app (Pearson’s rs ≥ .188, N = 143, ps ≤ .025), and between safety score and app selection for most apps (34 out of 36 apps; rs ≥ .204, N = 143, ps ≤ .015). In the risk condition, there was also a positive correlation between user rating and app selection for every app (rs ≥ .220, N = 152, ps ≤ .007), but not between risk score and app selection for most apps (29 out of 36 apps; |r|s ≤ .139, N = 152, ps ≥ .091). These correlational results indicate that apps with higher user ratings were selected more often than those with lower ratings in both the safety and risk conditions. More important, apps with higher safety scores were selected more often than those with lower safety scores in the safety condition, but the risk score did not have a similar impact on app selection in the risk condition.
The ANOVA showed several findings. First, the percentages of app selection did not differ in the safety and risk conditions on average (Ms = 26.4% vs. 25.5%), F(1, 35) = 1.03, p = .317, ηp2 = .03, but safety and risk conditions showed distinct patterns across safety rankings (i.e., safety/risk scores; see Figure 3, top left panel), F(4, 140) = 22.39, p < .001, ηp2 = .39. Note that the frequencies with which each app was associated with specific safety/risk score and user rating combinations were not controlled to be strictly equal, and thus, the percentage of app selection was computed as an average weighted by the frequency. As a result, the overall computed percentage can differ from the actual overall selection percentage (33.3%). Figures plotted from the raw data show the same patterns as those from the weighted data. The mean app selection percentage varied with the increased safety rankings, being 13.4%, 16.4%, 24.9%, 33.1%, 44.0% in the safety condition, and 21.8%, 25.2%, 22.6%, 27.3%, 30.6% in the risk condition. Post hoc pairwise comparisons with Bonferroni adjustment (used for all subsequent pairwise comparisons) showed that the differences between the safety/risk conditions were significant for most of the safety rankings, ps ≤ .007, except for the middle value, p = .182. Compared to the risk condition, participants in the safety condition selected apps with lower safety ranking less often and apps with higher safety ranking more often. In other words, better app selection decisions (safer apps selected more often and riskier apps selected less often) were made in the safety condition than in the risk condition.
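The effect sizes reported with these tests can be cross-checked against the F ratios using the standard identity ηp2 = (F × df_effect) / (F × df_effect + df_error). The sketch below applies that identity to the two tests in the preceding paragraph; the function name is ours and is not part of the original analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of safety/risk condition: F(1, 35) = 1.03
condition_effect = partial_eta_squared(1.03, 1, 35)
# Condition x safety ranking interaction: F(4, 140) = 22.39
interaction_effect = partial_eta_squared(22.39, 4, 140)

print(round(condition_effect, 2), round(interaction_effect, 2))  # 0.03 0.39
```

Both values agree with the reported ηp2 of .03 and .39.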

Figure 3. App selection percentage (Experiments 1 and 2) and rejection percentage (Experiment 3) as a function of safety ranking (= safety score or 6 – risk score) in safety and risk conditions.
Second, overall, the percentage of app selection was higher with increased safety (i.e., increased safety score or decreased risk score) (Ms = 17.6%, 20.8%, 23.7%, 30.2%, and 37.3% for safety rankings 1 through 5), F(4, 140) = 68.32, p < .001, ηp2 = .66, and the percentage was also higher with increased user rating (Ms = 6.7%, 12.1%, 17.5%, 40.4%, and 53.0% for user ratings 1 through 5), F(4, 140) = 317.78, p < .001, ηp2 = .90. Moreover, safety ranking interacted with user rating, F(16, 560) = 8.61, p < .001, ηp2 = .20. Post hoc pairwise comparisons for this interaction showed that for apps with lower user ratings (1, 2, and 3) the safety ranking did not influence app selection much, but for apps with higher user ratings (4 and 5) increased safety rankings led to more app selection (see Table 2). These results show that apps with higher user ratings and higher safety rankings were selected more often than other options and that the percentage of app selection increased as the safety increased or the risk decreased, much more for the apps with high user rating than for those with low user rating (see Figure 4, top row).
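One way to see why marginal means computed over equally weighted levels can sit below the actual overall selection rate of 33.3% is to compare an unweighted and a frequency-weighted average of the user-rating means just reported. The sketch below is illustrative only: it assumes that the rounded sampling likelihoods from the Materials section approximate the realized rating frequencies, it ignores safety score and condition, and it is not the weighting procedure used in the ANOVA.

```python
# Mean selection percentages for user ratings 1 through 5 (from the text).
selection_pct = {1: 6.7, 2: 12.1, 3: 17.5, 4: 40.4, 5: 53.0}

# Rounded sampling likelihoods for ratings 1 through 5 (from Materials);
# assumed here to approximate how often each rating was actually shown.
rating_freq = {1: 0.10, 2: 0.10, 3: 0.20, 4: 0.35, 5: 0.25}

# Every rating level counts equally.
unweighted_mean = sum(selection_pct.values()) / len(selection_pct)

# Each rating level counts in proportion to how often it appeared.
weighted_mean = sum(selection_pct[r] * rating_freq[r] for r in selection_pct)

print(round(unweighted_mean, 2), round(weighted_mean, 2))
```

Under these assumptions the unweighted mean is about 26%, close to the condition means reported earlier, whereas the frequency-weighted mean is about 33%, close to the rate implied by selecting two of six apps. High ratings were both shown more often and selected more often, so equal weighting of levels pulls the average down.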
Table 2. Percentage Difference in App Selection/Rejection Between Different Safety Rankings as a Function of User Rating
*The mean difference was significant at the .05 level. **The mean difference was significant at the .01 level. ***The mean difference was significant at the .001 level.

Figure 4. App selection percentage (Experiments 1 and 2) and rejection percentage (Experiment 3) as a function of user ratings and safety or risk score.
Finally, there was a three-way interaction among user rating, safety level, and safety/risk condition, F(16, 560) = 2.10, p = .007, ηp2 = .06. Further analyses showed that this interaction was mainly due to a significant two-way interaction between safety level and safety/risk condition when user rating equaled 2, 3, 4, and 5 (ps ≤ .007) but not when user rating equaled 1 (p = .124). An ANOVA excluding the user rating = 1 condition showed no three-way interaction, F < 1. There was also an interaction between safety/risk condition and user rating, F(4, 140) = 4.22, p = .003, ηp2 = .11. This interaction did not show up much in the mean data (7.5%, 11.7%, 20.5%, 40.5%, 51.6% for the safety condition, and 5.9%, 12.4%, 14.5%, 40.2%, 54.5% for the risk condition, with increased user rating). It reflected mainly a larger percentage for the safety condition than for the risk condition when paired with an intermediate user rating. Consistent with the mean data, post hoc pairwise comparisons showed that the safety and risk conditions differed when user rating = 3, p < .001, but not when it had other values, ps ≥ .159. There is no rationale for this pattern, and it was not significant in Experiments 2 or 3, so we suspect it is a Type 1 error.
We also considered the placement of the app in the list of six in each task. A common pattern for all six tasks was that the first app was selected more often than the second app and, on average, more often than any of the other apps, for which the selection percentage did not differ much (see Figure 5, top left panel). This result is consistent with the take-the-first heuristic in decision making (Johnson & Raab, 2003), and it could also be due to the first app in the list being (perceived as) more popular among the users.

Figure 5. App selection percentage (Experiments 1 and 2) and rejection percentage (Experiment 3) as a function of the placement of the apps.
Subjective reasons for app selection
For each app that was selected, participants were to mark the reason(s) why they selected it. A significant positive correlation was found between the safety score of the app and the reason “permission safety” for most of the apps (33 out of 36 apps; Pearson’s rs ≥ .350, Ns = 25~79, ps ≤ .042), and a significant negative correlation between the risk score and the reason “permission risk” for most of the apps (34 out of 36 apps; |r|s ≥ .298, Ns = 25~84, ps ≤ .026). Table 3 shows percentage of the reasons marked by the participants in safety and risk conditions. The difference between the percentages of selecting “permission safety” and “permission risk” was significant, Χ2(1, N = 3,540) = 68.55, p < .001, as well as the difference for user rating count, Χ2(1, N = 3,540) = 13.39, p < .001, and description, Χ2(1, N = 3,540) = 6.72, p = .010. No other differences were significant, ps ≥ .215.
Table 3. Percentage of Reasons Chosen as a Function of Safety or Risk Condition When Selecting/Rejecting Apps During the Tasks and in the Posttask Questionnaire
Due to a technical issue, some of the participants in the risk condition saw “Permission Safety” for this question. This number is the sum of the percentages of participants who chose Permission Safety and those who chose Permission Risk (17.1 + 43.4).
After the participants finished all six tasks, they were asked what factors they considered when selecting apps during the tasks. Table 3 shows the percentage with which each factor was considered. The difference between the percentages of selecting “permission safety” and “permission risk” was significant, Χ2(1, N = 295) = 7.59, p = .006. None of the differences for other reasons in the two conditions was significant, ps ≥ .084.
Security/computer expertise
To examine whether expertise mediated the influence of safety/risk scores on app selections, the participants were divided into two groups based on their responses to the security expertise question (see Table 1): One group was experts, participants who indicated highly skilled or security expert, and the other group was nonexperts, those who indicated regular user or computer novice. An ANOVA with expertise, safety/risk score, and safety/risk condition as within-subject (app) factors was conducted on the app-selection percentage. Expertise entered into an interaction with safety/risk score, F(4, 140) = 5.35, p < .001, ηp2 = .13, but this was qualified by a three-way interaction of those two variables with safety/risk condition, F(4, 140) = 3.50, p = .009, ηp2 = .09. The experts were more sensitive than the nonexperts to the safety level but only in the risk condition (see Table 4), indicating that the experts were less susceptible to the framing of the information as risk or safety. Although the experts’ selections were influenced more by the permission safety/risk information, this was not reflected in their marked reasons for app selection: Correlational analyses between the participants’ security expertise and the reason “permission safety/risk” showed no significant correlation for 33 of the 36 apps, ps > .05.
Table 4. Percentage of App Selection (Experiments 1 and 2) and Rejection (Experiment 3) for Experts and Nonexperts in the Safety and Risk Conditions
Other questionnaire analyses
When asked whether they found the overall permission safety/risk information to be useful, most participants indicated that it was useful. On a 7-point scale, with 1 denoting not useful and 7 extremely useful, 64.3% of participants gave a rating of 5 and above in the safety condition, whereas 59.9% of participants did in the risk condition.
To determine whether the participants understood the permission safety/risk symbols, we showed them four full circles and asked what that symbol stood for. In the safety condition, 86.0% of the participants gave a correct answer, 6.3% gave an opposite answer, and 7.7% indicated that they did not know what it meant; in the risk condition, 63.2% of the participants gave a correct answer, 23.0% gave an opposite answer, and 13.8% marked that they did not know what it meant. Due to a technical issue, some participants saw the opposite color of what they saw during the task. This issue may have led to confusion in answering the question, as suggested by some participants’ comments (e.g., “For the last question #4, it is unclear because all the apps had green circles indicating a safe rating but in #4 they are red. So it is confusing.”).
Experiment 2
In the more realistic scenario of Experiment 1, in which the choice was among six apps rather than between two apps as in our previous study (Gates, Chen et al., 2014), there was a substantial benefit of the summary score being presented as safety rather than risk. Given the current emphasis on the need for researchers to “integrate replications into their scholarly habits” (Brandt et al., 2014, p. 217), one goal of Experiment 2 was to confirm the reliability of this safety benefit. We intentionally introduced a confound between the safety versus risk variable and symbol color (green vs. red) in Experiment 1, so we designed Experiment 2 to eliminate the color difference, using a neutral color, blue, for both safety and risk. Also, we employed a more tightly controlled design in which sets of risk and safety conditions had equivalent user rating and risk/safety information, except for whether the latter was specified as risk or safety, to allow for direct comparison between the two conditions. Finally, because the subjective responses showed some confusion about the risk/safety scores, we made several minor methodological changes to the interface to improve clarity (see the Method section for details).
Method
Participants
A total of 494 participants were recruited through MTurk. On average, the experiment again took 10 minutes to complete, and participants were paid $0.75 each.
Materials and procedure
The materials and procedure were similar to those of Experiment 1, except as follows: (1) On the introductory page, the last sentence describing the permission safety/risk (i.e., “The higher the permission risk, the more permissions the app is requesting relative to other apps.”) was deleted, due to its potential to confuse the participants about the actual meaning of permission safety/risk. (2) The color of the symbols representing permission safety/risk was controlled to be the same (dark blue). (3) To ensure comparability of the risk and safety conditions, 10 sets of randomly generated numbers of user ratings and safety rankings were used in both the safety and risk versions of the tasks. (4) For one of the final questions—“What does a rating of [four full circles] stand for?”—the four circles were changed to five circles to match the display in the tasks. Because the circles in both conditions were blue, the issue of presenting a mismatching color in Experiment 1 was also remedied. (5) The issue of some participants in the risk condition seeing “permission safety” was also eliminated.
Results and Discussion
App-selection analysis
The correlation between app selection, user rating, and safety/risk score for each app was not able to reflect the real relation between them because for each set of tasks the user rating and safety/risk score were fixed. Thus, data from all 36 apps were combined for the correlational analyses. Overall, the results were similar to those in Experiment 1. In the safety condition, there was a positive correlation between user rating and app selection, r = .367, N = 8,886, p < .001, and a positive correlation between safety score and app selection, r = .157, N = 8,886, p < .001. In the risk condition, there was a positive correlation between user rating and app selection, r = .410, N = 8,892, p < .001, and a negative correlation between risk score and app selection, r = –.059, N = 8,892, p < .001. Note that this last negative correlation was not significant in Experiment 1, which had fewer participants, even though the value was numerically larger. There is likely a very weak correlation between risk score and app selection that was significant in this experiment due to the larger sample size and power.
Each app was again treated as one “participant” in the ANOVA. A univariate approach was used wherein the user rating and safety/risk score were treated as between-subjects (apps) factors, because each app only underwent some of the combinations of user rating and safety/risk score due to the use of the 10 sets of them. The ANOVA with safety/risk condition (safety vs. risk), user rating (1 through 5), and safety ranking (1 through 5; safety ranking = safety score in the safety condition; safety ranking = 6 – risk score in the risk condition) as between-subjects (apps) factors was conducted for the percentage of app selection.
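The common safety-ranking scale defined above can be written as a small helper. This is a minimal sketch with a hypothetical function name and condition labels; it only restates the mapping given in the text (ranking equals the safety score in the safety condition and 6 minus the risk score in the risk condition).

```python
def safety_ranking(score, condition):
    """Map a displayed 1-5 circle score onto the common safety-ranking scale."""
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    if condition == "safety":
        return score          # more circles already means safer
    if condition == "risk":
        return 6 - score      # more circles means riskier, so reverse the scale
    raise ValueError("condition must be 'safety' or 'risk'")


# Five filled circles is the safest display in one frame and the riskiest in the other.
print(safety_ranking(5, "safety"), safety_ranking(5, "risk"))  # 5 1
```

Placing both conditions on this single scale is what allows the safety and risk framings to be compared level by level in the ANOVA.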
The same result patterns were found as in Experiment 1. First, the percentage of app selection did not differ between the safety and risk conditions (Ms = 27.3% vs. 26.4%), F < 1.0, but the two conditions showed different trends across different safety rankings (see Figure 3, top right panel), F(4, 540) = 2.37, p = .052, ηp2 = .02. Post hoc pairwise comparisons showed that the safety and risk conditions only differed when the safety ranking = 5, p = .023, but not when it had other values, ps ≥ .155, although the data pattern was similar to that in Experiment 1. The result patterns conform to the proposition that users in the safety condition made safer (less risky) decisions than those in the risk condition.
Second, overall, the percentage of app selection was higher with increased safety ranking (Ms = 20.6%, 22.1%, 25.9%, 32.3%, and 33.4% for safety rankings 1 through 5), F(4, 540) = 13.83, p < .001, ηp2 = .09, and it was also higher with increased user rating (Ms = 7.1%, 9.9%, 17.6%, 38.6%, and 61.0% for user ratings 1 through 5), F(4, 540) = 217.47, p < .001, ηp2 = .62. Again, safety ranking interacted with user rating, F(16, 540) = 3.47, p < .001, ηp2 = .09, and this interaction did not differ across the safety and risk conditions, F < 1.0. Post hoc pairwise comparisons for the interaction between safety ranking and user rating (see Table 2) showed that safety rankings did not influence app selection for apps with lower user ratings (1, 2, and 3), but increased safety ranking led to greater selection percentage for apps with higher user ratings (4 and 5). These results are similar to those in Experiment 1 and indicate that the apps with higher user ratings and higher safety or lower risk scores were selected more than other options (see Figure 4, center row).
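For readers who wish to relate the effect sizes to the F ratios, partial eta squared can be recovered from an F value and its degrees of freedom through the identity ηp2 = (F × df effect) / (F × df effect + df error). The sketch below applies that identity to the three effects reported in the preceding paragraph; the function name is hypothetical, and the computation is a consistency check rather than a reanalysis of the data.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Effects reported above for Experiment 2: (label, F, df_effect, df_error)
reported = [
    ("safety ranking", 13.83, 4, 540),
    ("user rating", 217.47, 4, 540),
    ("ranking x rating", 3.47, 16, 540),
]
for label, f_value, df1, df2 in reported:
    print(f"{label}: {partial_eta_squared(f_value, df1, df2):.2f}")
```

After rounding to two decimals, the computed values agree with the reported effect sizes of .09, .62, and .09.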
The interaction between safety/risk condition and user rating in Experiment 1 did not show up in Experiment 2, F < 1.0. Thus, that interaction does not appear to be reliable. When considering the placement of the six apps in each task, the same pattern showed as in Experiment 1: The first app was selected more often than the second one in each task, and on average, the first app was also selected more often than other apps (see Figure 5, top right panel).
Subjective reasons for app selection
For the reasons marked while selecting an app, a significant positive correlation was found between the safety score of the app and the reason “permission safety” for most of the apps (33 out of 36 apps; Pearson’s rs ≥ .227, Ns = 45~125, ps ≤ .030) in the safety condition, and a significant negative correlation between the risk score and the reason “permission risk” for most of the apps (33 out of 36 apps; |r|s ≥ .240, Ns = 40~137, ps ≤ .012) in the risk condition. Table 3 shows the percentages of the reasons marked by the participants in the safety and risk conditions. The difference between the percentages of selecting “permission safety” and “permission risk” was significant, Χ2(1, N = 5,922) = 4.19, p = .041, as were the differences for user rating score, Χ2(1, N = 5,899) = 9.20, p = .002, user rating count, Χ2(1, N = 5,904) = 6.12, p = .013, and icon look and feel, Χ2(1, N = 5,912) = 23.33, p < .001.
Regarding the reasons selected in the posttask questionnaire (see Table 3), none of the differences between the percentages of selecting other reasons in the two conditions was significant, ps ≥ .319, except user rating score, Χ2(1, N = 494) = 3.79, p = .052, and description, Χ2(1, N = 494) = 6.91, p = .009.
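The p values attached to these chi-squared statistics can be checked directly, because a chi-squared variable with 1 degree of freedom is the square of a standard normal deviate. The sketch below uses only that property and the statistics reported above; the function name is hypothetical, and no raw counts are assumed.

```python
import math

def chi2_p_value_df1(chi2):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom.

    With 1 df, chi2 is the square of a standard normal deviate, so the
    tail area equals erfc(sqrt(chi2 / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2.0))

# Statistics reported above for Experiment 2: (label, chi-square, reported p)
reported = [
    ("permission safety vs. risk", 4.19, 0.041),
    ("user rating score", 9.20, 0.002),
    ("user rating count", 6.12, 0.013),
    ("description (posttask)", 6.91, 0.009),
]
for label, chi2, p_reported in reported:
    p = chi2_p_value_df1(chi2)
    print(f"{label}: computed p = {p:.3f}, reported p = {p_reported:.3f}")
```

The computed values agree with the reported ones to within rounding of the test statistic.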
Security/computer expertise
An analysis of selections with expertise as a variable similar to that of Experiment 1 was conducted. None of the terms that included expertise even approached being significant, Fs < 1. However, the tendencies of the mean values were consistent with the result of Experiment 1 that the experts tended to be more sensitive than the nonexperts to the safety level (see Table 4). The mean tendencies did not show any sign that the experts were less affected than the nonexperts by the safety/risk framing. As in Experiment 1, there was no correlation between security concern and expertise for 33 out of 36 apps, ps > .05. The main methodological difference from Experiment 1 was use of blue symbols to convey risk and safety, rather than red symbols and green symbols, respectively.
Other questionnaire analyses
When asked whether they found the overall permission safety/risk information to be useful, more than half the participants indicated that it was useful. On a 7-point scale, with 1 denoting not useful and 7 extremely useful, 56.4% of participants gave a rating of 5 or above in the safety condition, and 52.9% of participants did in the risk condition.
With the technical issues of Experiment 1 corrected, 89.5% of the participants in the safety condition gave a correct answer as to what that display stood for, 3.2% gave an opposite answer, and 7.3% indicated that they did not know what it meant. Participants in the risk condition continued to evidence more confusion, with 71.3% giving the correct answer, 19.0% the opposite answer, and 9.7% marking that they did not know what the symbols meant. Chi-squared analysis showed that participants in the safety condition understood the symbols better than those in the risk condition, Χ2(1, N = 494) = 25.98, p < .001. Thus, without the technical issues of Experiment 1, participants still showed better understanding of the symbols in the safety condition than in the risk condition.
Experiment 3
In Experiments 1 and 2, we found that a summary score promoted better app-selection decisions when framed as safety rather than risk. This advantage of safety framing could be due to several factors, including that the safety symbols obey the rule “the more the better,” the safety score is more compatible with the user ratings (for which more filled stars indicates better), and the safety score is more compatible than the risk score with the task of choosing apps. We examined this latter possibility in Experiment 3 by changing the task to one of rejecting two of the apps from the list. If the framing of the score as safety rather than risk was better in Experiment 2 due to the compatibility of safety with the task of choosing apps, this benefit should not be evident in Experiment 3 for which the task is one of rejecting apps.
Method
A total of 398 participants were recruited. Experiment 3 was conducted similarly to Experiment 2, except that any wording of “select” or “choose” was changed to “reject.”
Results and Discussion
App-rejection analysis
For the same reason as in Experiment 2, data from all 36 apps were combined for the correlational analyses. Overall, the correlational results were similar to those in Experiment 2. In the safety condition, there was a negative correlation between user rating and app rejection, r = –.443, N = 7,014, p < .001, and a negative correlation between safety score and app rejection, r = –.253, N = 7,014, p < .001. In the risk condition, there was a negative correlation between user rating and app rejection, r = –.461, N = 7,308, p < .001, and a positive correlation between risk score and app rejection, r = .114, N = 7,308, p < .001.
ANOVAs with user rating (1 through 5), safety ranking (1 through 5; safety ranking = safety score in the safety condition; safety ranking = 6 – risk score in the risk condition), and safety condition (safety vs. risk) as between-subjects (apps) factors were conducted for the percentage of app rejection. The result patterns were similar to those of Experiments 1 and 2. First, the percentage of app rejection did not differ in the safety and risk conditions on average (Ms = 42.4%), F < 1.0, but safety and risk conditions showed distinct patterns across different safety rankings (see Figure 3, bottom), F(4, 540) = 4.98, p = .001, ηp2 = .04. Post hoc pairwise comparisons showed that the differences between the safety/risk conditions were significant for the safety rankings 5, 4, and 1, ps = .042, .030, and .001, but not for safety rankings 3 and 2, ps = .861 and .404. The result pattern conforms to the proposition that users in the safety condition made better decisions than those in the risk condition.
Second, overall, the percentage of app rejection was lower with increased safety ranking (Ms = 61.9%, 43.1%, 38.3%, 37.6%, and 31.1% for safety rankings 1 through 5), F(4, 540) = 41.39, p < .001, ηp2 = .24, and it was also lower with increased user rating (Ms = 74.2%, 63.9%, 40.0%, 20.2%, and 13.7% for user ratings 1 through 5), F(4, 540) = 192.83, p < .001, ηp2 = .59. Again, the safety ranking interacted with user rating, F(16, 540) = 2.92, p < .001, ηp2 = .08, and this interaction did not differ for the safety and risk conditions, F < 1. Post hoc pairwise comparisons for the two-way interaction (see Table 2) showed that decreased safety ranking led to more app rejections, but this trend was more pronounced for apps with higher user ratings (3, 4, and 5) than for apps with lower user ratings (1 and 2). These results are similar to those in Experiments 1 and 2 and indicate that the apps with higher user ratings and higher safety or lower risk scores were rejected less than other options (see Figure 4, bottom row).
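As an arithmetic check on the means reported for Experiment 3, the marginal means for safety ranking and for user rating should each average to the overall rejection percentage of 42.4%. The sketch below assumes equal cell counts across the five levels of each factor, an assumption made here for illustration and not stated in the design description.

```python
from statistics import mean

# Marginal rejection percentages reported above for Experiment 3
by_safety_ranking = [61.9, 43.1, 38.3, 37.6, 31.1]
by_user_rating = [74.2, 63.9, 40.0, 20.2, 13.7]

# With equal cell counts (an assumption), each set of marginal means
# should average to the same grand mean.
grand_from_ranking = mean(by_safety_ranking)
grand_from_rating = mean(by_user_rating)
print(round(grand_from_ranking, 1), round(grand_from_rating, 1))
```

Both sets of marginal means average to the same value, which matches the overall rejection percentage reported for the two conditions.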
The interaction between safety/risk condition and user rating, which was significant in Experiment 1 but not Experiment 2, did not show up in Experiment 3, F < 1.0. Thus, that interaction, which was not very evident in the mean data, does not appear to be reliable.
When considering the placement of the six apps in each task, the app-rejection tasks did not show a consistent pattern across all six tasks. On average, the phenomenon that the first app was selected more often than other apps in the selection task was not evident in the rejection task. Rather, the first four apps had a similar rejection rate, which was higher than that of the last two apps (see Figure 5, bottom panel). This different pattern could be due to two possible reasons: (1) Different processes are involved in the selection and rejection tasks, and (2) the bias of the take-the-first heuristic to reject the first app in the list was canceled out in the present experiment by its being (perceived as) more popular among the users.
Subjective reasons for app rejection
For the reasons marked while rejecting an app, a significant negative correlation was found between the safety score of the app and the reason “permission safety” for most of the apps (30 out of 36 apps; Pearson’s |r|s ≥ .258, Ns = 23~97, ps ≤ .049) in the safety condition, and a significant positive correlation between the risk score and the reason “permission risk” for most of the apps (30 out of 36 apps; rs ≥ .227, Ns = 33~86, ps ≤ .037) in the risk condition. Table 3 shows percentages of the reasons marked in the safety and risk conditions. None of the differences between the percentages of selecting the reasons was significant, ps ≥ .074, except that for icon look and feel, Χ2(1, N = 4,772) = 5.95, p = .016.
Regarding the reasons selected in the posttask questionnaire (see Table 3), none of the differences between the percentages of selecting other reasons in the two conditions was significant, ps ≥ .346.
Security/computer expertise
Similar to Experiment 2, an ANOVA of rejection percentage with expertise as a factor showed no significant terms that included expertise, Fs < 1. The mean values showed no sign of the experts being more sensitive than the nonexperts to the safety/risk information (see Table 4), but there was a tendency for their rejections to be affected less by the safety/risk framing. Again, there was no correlation between security concern and expertise for 32 of the 36 apps, ps > .05.
Other questionnaire analyses
More than half of the participants indicated they found the overall permission safety/risk information to be useful. On a 7-point scale, with 1 denoting not useful and 7 extremely useful, 68.2% of participants gave a rating of 5 or above in the safety condition, and 71.9% of participants did in the risk condition.
Regarding the question of whether the participants understood the safety/risk symbols, in the safety condition, 89.7% of the participants gave a correct answer, 3.1% gave an opposite answer, and 7.2% indicated that they were not sure what it meant; in the risk condition, 75.9% of the participants gave a correct answer, 19.7% gave an opposite answer, and 4.4% marked that they were not sure. Chi-squared analysis showed that participants in the safety condition understood the symbols better than those in the risk condition, Χ2(1, N = 398) = 13.37, p < .001.
General Discussion
Progress is being made toward development of summary risk scores for improving the security of mobile applications. Methods have been developed to generate risk scores based on machine learning techniques that can identify certain apps as risky and others as less risky (Gates, Li et al., 2014). To be effective, though, this summary risk information must be presented to users in a way that they can comprehend and that will cause their app-installation decisions to be less risky. The present study demonstrates, in a relatively lifelike scenario, that people’s app-installation decisions can be affected by summary risk/safety scores. In all three experiments, less risky apps tended to be chosen over more risky ones, more so when the score was framed as amount of safety rather than amount of risk. Although the risk/safety score influenced app installation, it did so less than the user ratings. When the user rating was high, the risk/safety score exerted the most effect. But when the user rating for an app was low, having a low risk or high safety score did not typically lead to selection of that app. Consistent with the decision data, the risk/safety score was reported as a reason for selecting or rejecting an app by participants less often than user ratings, but more often than other app elements such as icon and description.
There are two general types of reasons why framing the decision as one of safety was more effective than framing it as one of risk. The first type is one of the safety score being more compatible than the risk score with some aspect of the decision context (Shafir, 1993). The tasks of Experiments 1 and 2 required selections to be made, which is a positive decision in that two apps were chosen as being the most desirable of the six alternatives. The safety framing of the information, for which “more” means “better,” is more compatible with the task goal of determining the two best apps in a selection task. However, the benefit for the safety score in Experiment 3, which required rejection of two apps, suggests that task compatibility was not a significant factor in the present context. Another possibility is that the safety frame is more compatible with the population stereotype for scores, for which a higher number most often indicates better. Furthermore, the safety frame is more compatible with the user ratings than is the risk frame, for which “more” equals “worse.” These last two compatibility relations could contribute to the benefit for safety framing in Experiment 3 as well as in Experiments 1 and 2.
The second type of explanation is that safety differs from risk in more than valence, although dictionaries and thesauruses identify them as antonyms. Safety seems to be a holistic concept in that people rarely talk about dimensions of safety. In contrast, risk seems to be more multidimensional in that it is customary to decompose overall risk into distinct risks. This difference is illustrated by article titles using “safety” singular but “risks” plural (e.g., see Livingstone, Haddon, Görzig, & Ólafsson, 2011). Thus, people may tend to think of overall safety but components of risk.
In agreement with the app choice data, participants showed more confusion about the risk score than the safety score. When asked whether a symbol with all circles filled indicated high safety or high risk, more participants answered incorrectly in the risk condition than in the safety condition. This confusion about the meaning of the risk symbol was not a result of the task requiring selection of apps, as in Experiments 1 and 2, because the confusion was also evident in Experiment 3 when the task was to reject apps. In addition, in Experiment 1, compared to the safety condition, participants in the risk condition stated after their individual choices and in the final questionnaire that they did not rely on that information as much. However, this result was not replicated in Experiments 2 and 3. Regardless of the reliability of these subjective judgments, the more objective symbol-identification data show the greater confusion for the risk symbols that would be expected from risk being a multidimensional concept.
To examine whether the greater confusion for the risk symbols accounted for their disadvantage in app choices, we conducted follow-up analyses of the app-choice data for each experiment comparing the original data based on all participants to data based only on those participants who identified the symbols correctly. Of interest was whether the Safety/Risk Condition × Safety Level interaction (indicative of the advantage for safety symbols) differed across the two data sets. For Experiment 1, that interaction was smaller for the participants who correctly identified the symbols than for all participants, F(4, 280) = 2.69, p = .031, ηp2 = .04, but it was still significant, F(4, 140) = 10.17, p < .001, ηp2 = .23. For Experiment 2, there was no significant difference in the Safety/Risk Condition × Safety Level interaction between the two data sets, F < 1, although that interaction only approached significance for the correct-identification data set, F(4, 140) = 2.30, p = .061, ηp2 = .06. For Experiment 3, the Safety/Risk Condition × Safety Level interaction tended to be smaller for the correct-identification data set than for all participants, F(4, 280) = 2.05, p = .088, ηp2 = .03, and was still significant for the former data set alone, F(4, 140) = 5.50, p < .001, ηp2 = .14. In summary, when participants who did not identify the symbols correctly were omitted, the safety score still led to more secure (less risky) app-installation decisions than did the risk score, although the difference between the two framings tended to decrease. Thus, the disadvantage of the risk score for app-installation decisions cannot be attributed solely to the greater confusion regarding the meaning of the risk symbols.
We previously reported an MTurk experiment in which participants had to select which one of two apps to install (Gates, Chen et al., 2014). The results showed only a slight tendency for a safety framing to be better than a risk framing. In the present study, for which participants were required to select two out of six apps, safety scores influenced performance considerably more than risk scores. Because the conditions of the prior experiment and the present ones differed in several ways, including the amount of information presented, we are not able to determine definitively the basis for the difference in results. It likely is due in part to the greater information-processing demands of comparing multiple alternatives than of making binary comparisons (Luce, 1986). Because consideration of multiple alternatives is part of most app decisions, and the format used to present the alternative apps in the present study is more similar to what would be seen when actually installing an app, the present results can be regarded as more ecologically valid to the mobile computing environment. Thus, an applied implication of our findings is that when risk/safety summary information about apps is provided, this information should be in the form of a safety score, though further research is needed to develop the methods of generating proper safety scores (e.g., Gates, Li et al., 2014; Peng et al., 2012).
Although the general patterns of results in Experiments 1 and 2 were similar, the influence of the safety/risk scores was larger in Experiment 1. A between-experiment ANOVA showed that this difference was statistically significant, yielding an Experiment × Safety/Risk Score interaction, F(4, 280) = 3.17, p = .014, ηp2 = .04. Additionally, those two variables interacted with safety/risk condition, F(4, 280) = 5.08, p = .001, ηp2 = .07, reflecting that the benefit of the safety frame was larger in Experiment 1 than in Experiment 2. The larger effect of the safety/risk scores in Experiment 1 suggests that red and green colors were more effective at signaling risk and safety, respectively, than was the dark blue color. Whether the difference in effectiveness is replicable and due to the stereotypic mapping of colors to risk and safety in Experiment 1, or to red and green being more salient than dark blue, remains to be determined.
The performance data for rejecting apps in Experiment 3 were similar to those for selecting apps in Experiments 1 and 2 in that the safety score had more influence on performance than the risk score. This finding suggests the possibility that participants converted the rejection task to one of selecting which apps not to reject (e.g., Meloy & Russo, 2004). However, Shafir (1993) found that such conversion was used to reduce the number of decisions that had to be made, whereas in Experiment 3, altering the task to one of selection would increase the number of decisions from two to four. Moreover, other aspects of the data differed from those of Experiments 1 and 2. Those experiments showed a primacy effect, that is, a bias to choose the app in the first position (the take-the-first heuristic), but Experiment 3 did not. Also, participants in Experiment 3 indicated that they were placing greater reliance on the permission risk/safety than did those in Experiments 1 and 2. Thus, the rejection decisions of Experiment 3 apparently involved somewhat different strategies than the selection decisions of Experiments 1 and 2.
That presenting a summary risk/safety score facilitates users’ secure app-installation decisions has theoretical and practical implications. Theoretically, it is consistent with fuzzy trace theory (Reyna, 2008; Reyna & Brainerd, 1995), according to which people have two types of mental representations, gist and verbatim. For reasoning and decision-making, most people rely on the gist representations to make decisions (Brainerd & Reyna, 2002). The summary score serves as a basis for users to perform the gist processing of the overall risk associated with an app, and our results suggest that framing the summary score as one of safety may be especially effective. Also, in practice, this idea of presenting a summary risk/safety score fits with a general design principle that an effective interface should be direct and not overburden users with too much cognitive processing (e.g., Krug, 2000). We also found that the experts were influenced at least as much as the nonexperts by the summary scores in all three experiments. Overall, our results fit with Brust-Renck et al.’s (2013, p. 244) conclusion that in a variety of contexts “risk communication should convey the bottom-line (gist) message of risk rather than only the facts to help people make informed decisions.”
Acknowledgements
A preliminary version of this work including only Experiment 1, analyzed differently, was presented at the 2014 annual meeting of the Human Factors and Ergonomics Society and is included in the proceedings. This work was supported by Army Research Office Award 2008-0845-04 through North Carolina State University and by the National Science Foundation under Grant No. 1314688.
Jing Chen received her BS and MEd degrees in cognitive psychology from Zhejiang University in China, in 2007 and 2010, respectively. She is currently working toward her PhD degree in cognitive psychology and her MS degree in industrial engineering at Purdue University.
Christopher S. Gates received his BS degree in computer science as well as in mathematics and his MS degree in computer science, both from Rutgers University in 2002 and 2005, respectively. After this, he worked in industry for several years until 2009, when he returned to academia. He received his PhD degree in computer science from Purdue University in 2014.
Ninghui Li received a BEng degree in computer science from the University of Science and Technology of China in 1993 and MSc and PhD degrees in computer science from New York University, in 1998 and 2000, respectively. He is currently a professor in computer science at Purdue University. His research interests include security and privacy in information systems. He is a senior member of the IEEE and an ACM distinguished scientist.
Robert W. Proctor received an MA degree and PhD degree in experimental psychology from the University of Texas at Arlington, in 1972 and 1975, respectively. He is a distinguished professor in the Department of Psychological Sciences and a fellow of the Center for Education and Research in Information Assurance and Security at Purdue University. His research interests include basic and applied aspects of human performance in a variety of tasks and settings.
