Abstract
Eyewitness-identification tests often culminate in witnesses not picking the culprit or identifying innocent suspects. We tested a radical alternative to the traditional lineup procedure used in such tests. Rather than making a positive identification, witnesses made confidence judgments under a short deadline about whether each lineup member was the culprit. We compared this deadline procedure with the traditional sequential-lineup procedure in three experiments with retention intervals ranging from 5 min to 1 week. A classification algorithm that identified confidence criteria that optimally discriminated accurate from inaccurate decisions revealed that decision accuracy was 24% to 66% higher under the deadline procedure than under the traditional procedure. Confidence profiles across lineup stimuli were more informative than were identification decisions about the likelihood that an individual witness recognized the culprit or correctly recognized that the culprit was not present. Large differences between the maximum and the next-highest confidence value signaled very high accuracy. Future support for this procedure across varied conditions would highlight a viable alternative to the problematic lineup procedures that have traditionally been used by law enforcement.
Eyewitness-identification evidence often proves to be very persuasive in the courtroom. Indeed, eyewitness testimony is sometimes the only evidence available, and convictions are often determined by such evidence (Semmler, Brewer, & Douglass, 2011; Wells et al., 1998). Yet, the fallibility of eyewitness-identification evidence—manifested in two striking response patterns—has been consistently highlighted in laboratory and field studies. First, witnesses sometimes identify an innocent suspect as the culprit, which often leads to erroneous convictions. The potency of mistaken identifications is dramatically highlighted by the fact that, in the United States, mistaken identifications were key factors in the wrongful conviction of more than 70% of the 289 DNA exoneration cases documented to date (Innocence Project, 2012).
Second, witnesses can state that the culprit is not present in a police lineup when this is not the case, which could possibly lead to the culprit going free. Field and laboratory studies indicate that rejections of lineups in which the culprit is present are commonplace (Pike, Brace, & Kynan, 2002; N. Steblay, Dysart, Fulero, & Lindsay, 2001). Although various procedural manipulations have been shown to reduce the likelihood of false identifications (Brewer & Palmer, 2010; Wells, Memon, & Penrod, 2006), the frequency with which these two problematic responses occur highlights the need for an alternative to the traditional eyewitness-identification test that has survived largely intact since the 19th century.
Wells et al. (2006) asked what would have happened if the traditional lineup procedure used by police “had never existed and the legal system [had] turned to psychology to determine how information could be extracted from eyewitnesses’ memories …. Operating from scratch, it seems likely that modern psychology would have developed radically different ideas” (pp. 68–69). Wells et al. suggested that measures of brain activity and eye movements, rapid displays of faces, comparisons of reaction times, “and other methods for studying memory might have been developed instead of the traditional lineup [procedure] …. It is possible to imagine a future science of eyewitness evidence that is radically different from the methods used today” (p. 69). In this article, we report three experiments demonstrating the effectiveness of a radical alternative to the traditional lineup.
A weakness of the traditional identification test or lineup is that it requires a witness to make a single decision about a lineup. The witness examines the lineup (comprising the suspect and a number of foils who match the suspect’s description, look similar to the suspect, or both) and either makes a choice from the lineup (i.e., picks the suspect or a foil who is known to be innocent) or rejects the lineup (i.e., says the culprit is not present or cannot decide). Perhaps if every witness were guaranteed to have a high-quality memory representation of the offender, the expectations placed on their memory might be reasonable. But factors such as a limited exposure to the culprit and a delay in the identification test (Brewer & Wells, 2011) often militate against the accuracy of their memory, so the witness has to weigh whether a positive or negative decision is more appropriate on the basis of the limited evidence available. The witness’s decision criterion is also vulnerable to influence from an array of social cues (Brewer & Palmer, 2010; Wells et al., 2006) that bias him or her toward a positive identification. Further, metacognitive cues (e.g., “I saw him for a long time”) may persuade witnesses that they should have a strong memory and thus should set a strict decision criterion demanding strong evidence for making an identification because the target stimulus should be familiar (Morrell, Gaitan, & Wixted, 2002).
Finally, the perceived significance of the identification decision may encourage the activation of heuristics that bias criterion setting. Consider, for example, a victim of an assault who is viewing a lineup and has plenty of time to reflect on the decision. Considerations such as “If I pick this person, he’ll probably get a long jail sentence, so I’d better be right” or “If I don’t pick this person, a dangerous man may well go free, so I’d better be certain before I say he’s not the one” would weigh heavily on his or her decision. Such thoughts are also likely to promote reflection on factors such as the quality of the viewing conditions and attentional constraints. In sum, various social, metacognitive, and heuristic influences can produce overly lax or unduly stringent response criteria (Clark, 2005), which contribute to false identifications or lineup rejections.
Using a variety of encoding and test stimuli, we tested a novel procedure for accessing eyewitness memory while minimizing the influence of strategic processes that often affect witnesses’ decision criteria in a traditional identification test. There are two key components to our procedure. The first involves having the witness rate the degree of match between the culprit and each lineup member (presented sequentially). The second involves requiring the witness to perform this matching process under severe constraints on processing time. Rather than requiring the witness to make a yes/no identification decision, the first component of our procedure requires confidence judgments about the likelihood that each face in a set of faces (presented sequentially)—including a target face (i.e., the culprit or an innocent suspect) and a number of foils—is that of the person who committed the crime. Thus, we use confidence judgments as an index of memory strength.
Although retrospective confidence judgments are malleable and, hence, can be poor indicators of eyewitness-identification accuracy (Brewer, 2006; Douglass & Steblay, 2006), this problem can be ameliorated by collecting confidence measures in a nonbiased way prior to any postidentification social interaction with other witnesses or lineup administrators and under conditions that limit opportunities for deliberation (Brewer & Weber, 2008; Brewer & Wells, 2006). This approach is consistent with theory and research in recognition memory, which indicates that confidence judgments can index memory strength, discriminating previously seen from unseen stimuli (Bernbach, 1971; Cleary & Greene, 2000; Wickelgren & Norman, 1966), although these theories were designed to account for decision-making contexts much less complex than the present one. Similarly, as we have previously argued (Sauer, Brewer, & Weber, 2008), this approach aligns with metamemory research indicating that people have at least partial access to retrieval information that supports memory-based decisions even in the absence of a positive decision (cf. Koriat, 1993). However, a significant departure in the research reported here is that, rather than witnesses classifying each stimulus as old or new and assigning a confidence value to their judgment, we asked them to simply provide a confidence rating for each lineup stimulus.
Requiring witnesses to make a series of confidence assessments about individual stimuli—without any accompanying yes/no identification decision—should negate the influences of some of the biases that disturb criterion setting. Moreover, even if some of these social and metacognitive influences modulate confidence assessments, they should exert a uniform influence across the set of stimuli, thereby still allowing for confidence differences that would allow us to track variations in the match between each probe stimulus and the witness’s memory. In previous research (Sauer et al., 2008), we showed that a confidence criterion could be identified at the group level that discriminated whether an offender’s face had been recognized better than it had when the traditional identification procedure was used.
Here, to minimize the contribution of any strategic influences on such confidence judgments, we asked participants to make confidence ratings under a response-signal deadline (Brewer & Smith, 1990; Wickelgren, 1977). Specifically, witnesses made the confidence judgment about each face within 3 s of its appearance. Strong memory traces are accessed more rapidly than weak memory traces are (Murdock, 1985), and accurate eyewitness identifications occur significantly faster than inaccurate ones do (Brewer, Caon, Todd, & Weber, 2006; Weber, Brewer, Wells, Semmler, & Keast, 2004). Thus, confidence judgments obtained under constrained processing time should further minimize the contribution of factors that disturb criterion setting for identification decisions.
We speculated that the response deadline would have little effect on rapid memory-based responses to the target but would truncate the more time-consuming processes that contribute to false identifications. Thus, a strong memory for the target should still result in a high level of confidence, whereas confidence for unfamiliar faces should be low. Consistent with this argument, findings in the recognition-memory literature show that familiarity judgments (i.e., judgments that make a major contribution to confidence estimates; Wixted, 2007), including judgments for face stimuli, are usually made extremely rapidly (Bruce & Young, 1998; Johnson, Kounios, & Reeder, 1994; Yonelinas & Jacoby, 1994). Moreover, short deadlines minimize the likelihood that metacognitive cues or heuristics will distort the confidence values reported (especially in an item-specific manner). Thus, under short deadlines, the discrepancy between confidence values for previously seen and for unfamiliar faces should be large when the witness’s memory is strong (and hence likely accurate). Put another way, large confidence discrepancies between one lineup face and all the others would suggest a strong memory match and a high likelihood that the face attracting the standout, high-confidence value is the culprit. Small discrepancies would suggest either more than one strong memory match (i.e., more than one face attracts high but slightly different levels of confidence) or the absence of a strong memory match (i.e., when the highest level of confidence is relatively low).
In the three experiments reported here (each using multiple encoding and test conditions), we compared the deadline procedure with the traditional sequential-lineup procedure (i.e., control condition), in which witnesses viewed one face at a time without knowing how many faces would be shown overall (cf. Lindsay & Wells, 1985; N. K. Steblay, Dysart, & Wells, 2011). In the first two experiments, the retention interval between encoding and test was 5 min; in the third experiment, it was 1 week. To compare the effectiveness of the two procedures, we first used a classification algorithm that identified confidence criteria that optimally discriminated accurate from inaccurate decisions, providing the crucial comparison of overall classification accuracy produced by the two lineup conditions. Although this algorithm provides an index of the efficacy of the procedures at the group level, it does not inform the key forensic judgment about the likelihood that an individual witness is accurate. Consequently, the second approach used a profile analysis that indicated the likelihood that any individual’s pattern of confidence judgments reflected an accurate judgment about the target face.
Method
Participants and design
There were 494 participants (219 male, 275 female; age = 16 to 54 years, M = 20.1, SD = 6.7) in Experiment 1, 294 participants (117 male, 177 female; age = 16 to 53 years, M = 22.1, SD = 6.4) in Experiment 2, and 117 participants (48 male, 69 female; age = 16 to 60 years, M = 26.3, SD = 9.0) in Experiment 3. In all experiments, participants were undergraduates or members of the Flinders University community. Participants were randomly assigned to either the deadline or no-deadline (control) condition; lineup condition was varied between subjects.
Materials and procedure
In Experiment 1, participants were shown two different movies depicting simulated nonviolent crimes. In Experiments 2 and 3, participants were shown four different movies; each movie showed different stimulus events (involving different people). None of the movies shown in Experiments 2 and 3 involved a visible crime but participants were told after they viewed the movies that the people they saw were now suspects in crimes that had occurred in the vicinity. Participants in each experiment viewed the same movies. The movies were selected from an array used in our laboratory because our archival data showed that these particular movies produce varied patterns of identification-test choices and accuracy; this allowed us to test the procedures across multiple stimuli of varying difficulty, which mirrored the variety found in real-world scenarios.
In the deadline and control conditions, participants sat alone at a computer in a cubicle and received the following instruction: “You are going to be shown a short film. Pay close attention to it because you will be asked some questions afterwards.” Participants then clicked a “Next” button on-screen when they were ready to begin. A countdown appeared in the center of the screen. When it ended, there was a short delay, and then the film started. After participants watched the first movie, another screen appeared: “Before completing the questions about the film you have just seen, we would like you to watch another film. Once again, please pay close attention.” After watching two movies (Experiment 1) or four movies (Experiment 2), participants watched a distractor film (a TV program) for 5 min. Participants in Experiment 3 watched four movies and then returned to the laboratory approximately 1 week later (range = 3–13 days, M = 6.7, SD = 1.2). Presentation order of the movies was counterbalanced.
For the lineup phase, the screen displayed instructions for viewing the lineup and making responses. The 12 photos in each lineup each had an on-screen size of 8 cm × 6 cm and provided a frontal view of the suspect from the chest up. The foils (and the target’s replacement for each target-absent lineup) were selected using a match-description strategy: That is, all foils matched a modal free-recall description of the target provided by independent observers of each stimulus event (Wells, 1993). Thus, if the target was described as a male, 20 to 30 years old, with short darkish hair and a pale complexion, all lineup members matched that description.
In both conditions, participants viewed the 12 lineup photos in sequential order and were not informed about how many photos were in the lineup. The first four faces were the same for all participants in both conditions and, though matching the culprit’s description, were deliberately chosen because they were not very plausible selections. Their purpose was to familiarize the witness with the deadline procedure (and to ensure the same number of faces was seen in the deadline and control conditions), although witnesses were not informed that these were only warmup trials. The order of the next eight faces (the actual trials) was random. In both conditions, participants saw target-present and target-absent lineups; the presence or absence of the culprit was varied within subjects.
In the deadline condition, each face remained on the screen for 3 s. After 2 s, a buzzer indicated that the confidence judgment had to be completed within the next second. This involved clicking 1 of 11 on-screen buttons spanning “100% confident that this is the culprit” to “0% absolutely certain this is not the culprit.” If the confidence judgment was not made before the deadline (around 2.5% of all trials), the face disappeared, and the next face appeared. Note that pilot work with shorter deadlines produced many failures to respond within the allotted time.
Participants in the control condition viewed a standard sequential lineup of 12 faces, and participants had as much time as they needed to view each photo. A yes/no decision about whether the face was that of the culprit was required for each face. If the witness chose a particular face, the lineup ended. If the witness rejected a face, the next face appeared. This continued until a choice was made or the lineup ended. In the original sequential-lineup study (Lindsay & Wells, 1985), the lineup continued to the end even if the witness made a selection; in many other studies, the lineup ended when witnesses made a selection. Allowing witnesses only one pass through a sequential lineup has been a deliberate strategy designed to try to “force” witnesses to make absolute rather than relative judgments and hence overcome the tendency for witnesses to make a choice even when no strong match to memory is found. In the present study, when the lineup ended, a new screen appeared, and participants were asked to express how confident they were in their final identification decision using 1 of 11 on-screen buttons, as in the deadline condition. The same screen was presented to participants who rejected all lineup members.
Results
Two approaches were used to explore the efficacy of the deadline procedure: First, we compared overall (i.e., group-level) classification accuracy based on optimally discriminating confidence criteria, and, second, we conducted a profile analysis on confidence judgments across stimuli at the individual-witness level to highlight patterns associated with high versus low accuracy.
Accuracy in the deadline and control conditions
We used Sauer et al.’s (2008) hierarchical algorithm to infer a decision from the confidence ratings for the deadline condition (i.e., in which the witness did not make a yes/no decision) and thereby obtain a measure of accuracy. First, when a participant provided a single maximum confidence value—one value for a lineup member that was higher than all of the others—this was treated initially as a positive decision (i.e., a possible selection of that lineup member). The absence of a single maximum value was treated as a negative decision (or no selection) and, therefore, indicative of the suspect’s innocence. Cases with a single maximum value were then separated according to whether the maximum value referred to the suspect or to a foil. Maximum values for suspects and foils were then compared with one of two separate criteria derived from all participants’ data. Both the suspect and foil criteria were calculated as the maximum value (i.e., maximum confidence value) that optimally discriminated accurate from inaccurate decisions across participants. For example, if a maximum confidence value of 70% for the suspect produced the highest proportion of correct decisions across all participants, then 70% became the confidence criterion for the suspect. The confidence criterion algorithm maximized the number of accurate decisions without any attention to classification as hits or correct rejections.
Maximum values for suspects equaling or exceeding the confidence criterion for the suspect were classed as positive identifications of the suspect. For foils, maximum values equaling or exceeding the confidence criterion for foils were counted as lineup rejections; we followed this procedure because, in a forensic context, using the recommended single-suspect lineup means that lineup foils are known in advance to be innocent. Thus, identification of a foil as the best match provides evidence that the suspect is innocent (Clark & Wells, 2008; Wells & Olson, 2002). Maximum values for suspects and foils falling below their confidence criteria were classed as inconclusive responses. This approach resulted in one of three classifications: (a) the suspect was identified as the culprit, (b) the suspect was innocent, or (c) the results were inconclusive (such classifications are included in the denominator when calculating classification accuracy).
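The three-way classification described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the outcome labels, and the criterion values in the usage note are hypothetical, and it assumes every lineup member received a rating (trials that missed the deadline are not handled). The two criteria are taken as given rather than derived from data.

```python
from typing import Sequence


def classify_lineup(
    ratings: Sequence[int],
    suspect_index: int,
    suspect_criterion: int,
    foil_criterion: int,
) -> str:
    """Infer a decision from one witness's confidence ratings (0-100).

    Returns one of three hypothetical labels:
    "suspect_identified", "suspect_innocent", or "inconclusive".
    """
    values = list(ratings)
    top = max(values)

    # No single maximum: treated as a negative decision,
    # i.e., indicative of the suspect's innocence.
    if values.count(top) != 1:
        return "suspect_innocent"

    if values.index(top) == suspect_index:
        # The single maximum fell on the suspect.
        if top >= suspect_criterion:
            return "suspect_identified"
        return "inconclusive"

    # The single maximum fell on a foil, who is known to be innocent,
    # so a sufficiently confident foil match counts as a lineup rejection.
    if top >= foil_criterion:
        return "suspect_innocent"
    return "inconclusive"
```

For example, with a suspect criterion of 70 (the value used as an illustration in the text) and an arbitrary foil criterion of 60, `classify_lineup([10, 90, 20, 0], suspect_index=1, suspect_criterion=70, foil_criterion=60)` returns `"suspect_identified"`, whereas tied top ratings such as `[50, 50, 0, 0]` return `"suspect_innocent"`.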
Table 1 shows the decision accuracy rates (and inferential statistics) for the control condition and those produced by the application of the hierarchical classification algorithm to the deadline condition for each lineup in Experiments 1, 2, and 3. Overall decision accuracy was significantly higher in the deadline condition than in the control condition in all three experiments (by 27%, 24%, and 66% in Experiments 1, 2, and 3, respectively). The measures of Cohen’s w indicate that effect sizes were meaningful for 7 of the 10 stimulus sets and for the overall contrasts for each experiment. (The proportions of correct identifications and rejections are reported in Table S1 in the Supplemental Material available online.)
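As a reminder of how Cohen's w is defined, the sketch below computes it as the square root of the Pearson chi-square statistic divided by the total sample size. The text does not state which test produced the reported values, so the chi-square test of independence on a 2 × 2 table of condition by decision accuracy is an assumption here, and the counts in the usage note are hypothetical, not data from the experiments.

```python
import math
from typing import Sequence


def cohens_w(table: Sequence[Sequence[float]]) -> float:
    """Cohen's w = sqrt(chi2 / N) for a contingency table of counts.

    Uses the Pearson chi-square statistic without continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    return math.sqrt(chi2 / n)
```

With hypothetical counts of 60 correct and 40 incorrect decisions in one condition and 40 correct and 60 incorrect in the other, `cohens_w([[60, 40], [40, 60]])` gives 0.2. By Cohen's conventions, values of about 0.1, 0.3, and 0.5 are described as small, medium, and large effects.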
Table 1. Proportion of Correct Decisions in the Three Experiments, With Inferential Comparisons
Profiles of individual confidence judgments indicative of accuracy
Knowing that, at the group level, the confidence procedure produces higher decision accuracy than the conventional lineup procedure does across a wide array of encoding and test stimuli is extremely valuable in evaluating the efficacy of the procedure. However, in an actual criminal investigation, the focus is on the likely accuracy of an individual witness’s decision. Accordingly, we conducted a profile analysis that highlighted when an individual set or pattern of confidence judgments was likely versus not likely to indicate an accurate discrimination of a previously seen face. We examined accuracy as a function of the discrepancy between the maximum and the next-highest confidence value. (Various other permutations, such as the discrepancy between the maximum and the average of all other confidence values, produced no better outcomes.) The resulting values indicate accuracy-classification rates when the maximum confidence value was 100% and the next-highest value was 0% (i.e., discrepancy = 100%); when the maximum was 100% and the next-highest value was 10%, 20%, 30%, etc. (discrepancy = 90%, 80%, and 70%, etc., respectively); when the maximum was 90% and the next-highest value was 0%, 10%, 20%, etc.; and so on. Thus, using this procedure, it was possible to compare classification accuracy, given particular profiles of confidence judgments, with decision accuracy for the control condition. Note that for target-present lineups, maximum values assigned to foils were excluded as, in appropriately conducted single-suspect lineups, foils are known in advance to be innocent. Also, unlike the hierarchical algorithm described earlier, which requires that an innocent suspect be designated in a target-absent lineup, the profile analysis classifies the presence of any single maximum value given for a target-absent lineup as an inaccurate decision, no matter how low the maximum is (i.e., it applies a conservative test).
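The discrepancy-based profile analysis can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function names and the data layout (a list of ratings plus the target's position, or `None` for a target-absent lineup) are hypothetical. The text specifies that foil maxima in target-present lineups are excluded and that any single maximum in a target-absent lineup counts as inaccurate; how tied maxima (discrepancy of 0) were scored is not stated, so the sketch assumes they count as accurate for target-absent lineups and inaccurate for target-present lineups.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def discrepancy(ratings: List[int]) -> int:
    """Difference between the maximum and next-highest confidence value."""
    ordered = sorted(ratings, reverse=True)
    return ordered[0] - ordered[1]


def accuracy_by_discrepancy(
    lineups: Iterable[Tuple[List[int], Optional[int]]],
) -> Dict[int, float]:
    """Proportion of accurate classifications at each discrepancy level."""
    tally = defaultdict(lambda: [0, 0])  # discrepancy -> [correct, total]
    for ratings, target_index in lineups:
        gap = discrepancy(ratings)
        single_max = gap > 0
        if target_index is None:
            # Target absent: any single maximum is scored as inaccurate.
            correct = not single_max
        else:
            top_index = ratings.index(max(ratings))
            if single_max and top_index != target_index:
                # Foil maxima in target-present lineups are excluded.
                continue
            correct = single_max
        tally[gap][0] += int(correct)
        tally[gap][1] += 1
    return {gap: c / total for gap, (c, total) in sorted(tally.items())}
```

For instance, ratings of `[100, 0, 10]` yield a discrepancy of 90. If one such profile comes from a target-present lineup with the maximum on the target and another from a target-absent lineup, the accuracy rate at a discrepancy of 90 is 0.5.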
As Table 2 shows, when the discrepancy between the maximum and the next-highest confidence value was large, the profile analysis produced classification-accuracy rates that were markedly higher than accuracy rates in the control condition (.51, .59, and .38 for Experiments 1, 2, and 3, respectively). Accuracy was extremely high when the discrepancies were near the top of the range. Only when the discrepancy fell to around 30% to 50% did accuracy approach the levels of the control procedure. Moreover, when the discrepancy was very small, accuracy was very low.
Table 2. Proportion of Correct Decisions and Number of Decisions for Each Category of Discrepancy Between the Maximum and the Next-Highest Confidence Value in the Deadline Condition
Note: Confidence was rated on a scale from 0% to 100%.
Discussion
In two experiments including six different sets of encoding and test stimuli, we found that the deadline procedure produced significantly higher classification accuracy than did a control condition that used the traditional sequential yes/no decision procedure. In a third experiment, this finding was replicated when the retention interval was extended from 5 min (as in the first two experiments) to 1 week. Moreover, we conducted a profile analysis that identified patterns of confidence judgments across lineup stimuli that were more informative than a witness’s lineup decision about the likelihood that the witness had recognized the culprit or that the culprit was not in the lineup.
Large differences between the maximum and next-highest confidence rating (i.e., 70%–100%) denoted very high accuracy; in contrast, small differences (≤ 20%) signaled low accuracy. In other words, an individual’s confidence profile offers probative information that is not provided by the witness’s decision on the standard identification test. As noted earlier, these results are compatible with extrapolations from theory underpinning basic research on recognition memory and metamemory, which demonstrates the heuristic value of such theoretical perspectives for investigating decision making in everyday contexts of far greater complexity than those the theories were originally developed to accommodate. Ultimately, a formal model relating accuracy, confidence, and response time in this context would be desirable, with existing recognition-memory models (e.g., Ratcliff & Starns, 2009; Van Zandt, 2000) providing potential starting points for its development.
There remain, of course, some obvious significant challenges for future research using this procedure. One is to obtain converging evidence for its efficacy under a variety of forensically relevant conditions (e.g., varied encoding conditions, variations in target-foil discriminability). Another is to demonstrate the efficacy of the procedure under conditions in which familiarity-based confidence judgments might be considered problematic. For example, if—prior to viewing the lineup—a witness had been shown mug shots that included one of the lineup foils, it might be expected that familiarity for that foil would be higher, and confidence would be inflated. This could undermine the effectiveness of the confidence procedure when witnesses were exposed to a culprit-absent lineup containing the foil. It could also undermine the procedure by reducing the discrepancy between the maximum confidence value and the next-highest confidence value when witnesses were exposed to a lineup containing the target and the previously seen foil. That is, it should provide a tough test of the deadline procedure.
Nevertheless, we suggest that should a single memory-strength judgment—incorporating familiarity and contextual information—underpin confidence judgments made under deadline pressure (cf. Wixted, 2007), the efficacy of the procedure should be ensured. In contrast, if witnesses had opportunity for reflection (i.e., in the absence of a deadline), previous exposure to one of the lineup foils might produce a sense of familiarity that, on reflection, could be attributed to the face being that of the culprit. This last issue suggests another important question to be resolved, namely, is the deadline critical for curtailing metacognitive activity that distorts confidence assessments, or is it unimportant?
Clearly, adoption of a procedure similar to the deadline procedure used in the present study would be perceived as a radical change, one that would not be easily sold to the criminal-justice community (cf. Brewer & Wells, 2011). Thus, further research confirming its efficacy and resolving the optimal deadline and confidence-scale formats under a variety of conditions, such as those outlined in this article, will be crucial for demonstrating that finally there may exist a superior procedure to the age-old and problematic identification test.
Footnotes
Acknowledgements
We thank Nicole Reid and Rachel Hiller for their assistance with data collection.
Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.
Funding
This research was supported by Australian Research Council Grant DP1093210 and a Flinders Research Grant.
