Abstract
Matching two different images of an unfamiliar face is difficult, although we rely on this process every day when proving our identity. Although previous work with laboratory photosets has shown that performance is error-prone, few studies have focussed on how accurately people carry out this matching task using photographs taken from official forms of identification. In Experiment 1, participants matched high-resolution, colour face photos with current UK driving licence photos of the same group of people in a sorting task. Participants averaged 19 mistaken pairings out of 30, showing that this task was both difficult and error-prone. In Experiment 2, high-resolution photographs were paired with either driving licence or passport photographs in a typical pairwise matching paradigm. We found no difference in performance levels for the two types of ID image, with both producing unacceptable levels of accuracy (around 75%–79% correct). The current work benefits from increased ecological validity and provides a clear demonstration that these forms of official identification are ineffective and alternatives should be considered.
Introduction
People often complain that the photographs on their official forms of identification are a poor likeness. Although we now know that judging the likeness of an image is dependent on how familiar you are with the person (Ritchie, Kramer, & Burton, 2018; White, Burton, & Kemp, 2016), surprisingly little consideration has been given to how well people can actually be matched to their official ID photographs.
Evidence in recent years has firmly established that matching two different images of an unfamiliar face (or an image to a ‘live’ person) is both difficult and error-prone (e.g., Burton, White, & McNeill, 2010; Kemp, Towell, & Pike, 1997). However, studies typically utilise newly taken photographs or student ID images (Burton et al., 2010; Fysh & Bindemann, 2018). This body of research has shown that performance levels are lower than might be expected, with accuracy falling within the range of 70% to 90% (Bindemann, Avetisyan, & Blackwell, 2010; Bobak, Dowsett, & Bate, 2016; Burton et al., 2010; Kramer, Mulgrew, & Reynolds, 2018; Megreya & Bindemann, 2009; Megreya, Bindemann, & Havard, 2011; Megreya & Burton, 2006; Wirth & Carbon, 2017). For example, when simulating a border control context by asking participants to compare a large face image with one that had been resized and placed within a passport frame, performance was around 85% (Wirth & Carbon, 2017).
Interestingly, few studies have included comparisons involving official passport photographs (Meissner, Susa, & Ross, 2013; White, Dunn, Schmid, & Kemp, 2015), and to our knowledge, performance levels when matching with official driving licence photographs have yet to be determined. This is surprising, given that real-world identity verification often involves one or both of these forms of ID, and therefore, research utilising these images would benefit from improved ecological validity.
There are important reasons to predict that official ID images might lead to poor matching performance. First, these documents are valid for substantial periods of time (e.g., ten years for UK driving licences and adult passports, and often longer in other countries; Germany, for instance, has historically even had an “unlimited” policy). Research has shown that increasing the amount of time since a comparison photograph was taken results in a decrease in matching performance (e.g., Fysh & Bindemann, 2018; Megreya, Sandford, & Burton, 2013). However, most face matching tasks do not incorporate such time intervals into their stimuli (e.g., Burton et al., 2010; Fysh & Bindemann, 2018). Second, official ID images are often smaller or lower quality than images used in research. For example, UK driving licence photographs are 1.8 cm × 2.3 cm in size, greyscale, low resolution, and obscured by security watermarks. The clear prediction is that some or all of these features will decrease accuracy levels (e.g., Bindemann, Attard, Leach, & Johnston, 2013).
The current work addresses this gap in the literature by investigating face matching with both driving licence and passport photographs. Given that these official ID images are frequently used in real-world identification, it is important to establish levels of performance with these types of photographs. Based on a growing body of evidence that unfamiliar face matching is prone to numerous errors in the best of circumstances (e.g., Burton et al., 2010), we predict that these images will prove ineffective for use in this task.
Experiment 1
In our first experiment, we investigated face matching with official driving licence photographs. Given that these types of images have yet to be considered with respect to matching, and because we predicted that performance would be low due to the characteristics of the images themselves (small, low quality, and greyscale), we decided to employ a novel sorting task. By providing a set of high-resolution photographs and their accompanying driving licence images, we hypothesised that simply forming image pairs would provide an upper estimate of participants’ abilities. All matches were present in the set, and so additional information could be used to narrow down options through a process of elimination.
Methods
Participants
Fifty volunteers (Mage = 31.1 years, SDage = 13.1 years; 29 women; all self-reported as White) gave informed written consent before participating in the experiment and were verbally debriefed upon completion. Recruitment took the form of approaching students and staff on campus and asking if they would be willing to participate. Both experiments presented here were approved by the University of Lincoln’s School of Psychology ethics committee (PSY1718469) and were carried out in accordance with the provisions of the World Medical Association Declaration of Helsinki.
The data from three additional participants were excluded before analyses due to familiarity with at least one of the models featured in the stimuli.
Stimuli
We created a database of 50 models who varied in age, sex and ethnicity. For each model, we took a high-resolution, colour, passport-style photograph (front on, neutral expression) using a GoPro HERO5 Session camera. In addition, for the 48 models who supplied their current UK driving licence, we collected a scan of their photograph (which appears in greyscale on the original card).
From this database, we selected 30 White women who were not wearing glasses (Mage = 19.9 years, SDage = 0.6 years). No additional attempt was made to choose identities based upon their facial appearance (e.g., their similarity to each other). The time interval between taking our high-resolution photographs and the date of issue of the driving licences ranged from 46 days to 3.5 years for these women. The high-resolution photographs were cropped and resized to 3.7 cm × 5.5 cm, while the driving licence images remained unaltered at their original size of 1.8 cm × 2.3 cm (see Figure 1). The former were presented at a larger size in order to simulate higher resolution images that are typically submitted as part of official document applications (and are therefore compared by issuing officers) and to acknowledge that comparisons in ‘live’ face matching contexts benefit from a better quality representation.

Figure 1. Example stimuli used in Experiments 1 and 2. From left to right: a high-resolution photograph, passport photograph, and driving licence photograph of the same model, spanning more than 6.5 years. This model has given permission for her images to be reproduced here.
Procedure
Each participant was presented with 60 facial images (30 models × 2 types of photograph), which had been printed and laminated, and asked to match each driving licence picture with its corresponding high-resolution photograph. The task was self-paced, and participants were told that every image had a correct pairing, that is, there were no mismatches or other deceptions involved. All models were unfamiliar to the participants.
Results
Accuracy on the task is shown in Figure 2, along with an illustration of the expected distribution obtained by random guessing (estimated using a Monte Carlo simulation over 10,000 iterations). Observed performance (number of correct pairs: M = 10.82, SD = 5.17) was substantially better than guessing but remained error-prone, with participants making an average of 19 mistakes with a group of only 30 identities.

Figure 2. Accuracy when pairing the images. Both observed performance and simulated random guessing are shown for comparison.
It is worth noting that chance performance on this task (around 0%–7%) is much lower than for a standard pairwise matching task (50%), suggesting that participants’ accuracies can still be considered reasonable.
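The chance distribution for the sorting task can be approximated with a short simulation. The following is a minimal sketch in Python, assuming that random guessing corresponds to a uniformly random one-to-one assignment of the 30 driving licence images to the 30 high-resolution photographs. The function name, the seed, and this way of modelling guessing are illustrative assumptions and not a description of the original analysis code.

```python
import random
from statistics import mean


def simulate_random_sorting(n_pairs=30, n_iterations=10_000, seed=0):
    """Return the number of correct pairs obtained in each simulated guess.

    Assumption: a guessing participant assigns every licence image to
    exactly one high-resolution photograph, uniformly at random.
    """
    rng = random.Random(seed)
    licence_order = list(range(n_pairs))
    correct_counts = []
    for _ in range(n_iterations):
        rng.shuffle(licence_order)
        # A pair is correct when licence image i lands on photograph i.
        correct = sum(1 for i, j in enumerate(licence_order) if i == j)
        correct_counts.append(correct)
    return correct_counts


counts = simulate_random_sorting()
print("Mean correct pairs by chance:", mean(counts))
print("Proportion with 0-2 correct:", sum(c <= 2 for c in counts) / len(counts))
```

Under this assumption, the expected number of correct pairs by chance is one regardless of set size, and most simulated guessers achieve between zero and two correct pairs (0%–7% of 30).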
Previous research has shown that increasing the amount of time between photographic sittings resulted in a decrease in matching performance (e.g., Fysh & Bindemann, 2018; Megreya et al., 2013). Indeed, viewers may require personal familiarity with targets in order to avoid this decline (Carbon, 2008). Here, we found no correlation between the time period and model-level accuracy (averaged across participants), r(28) = −.06, p = .737. This may be due to the relatively small variability in the length of time since the documents were issued in our sample (maximum of 3.5 years old) in comparison with the 10 years for which these documents are legally valid. However, the current work was not designed to address this issue, and so we include this analysis for the interested reader.
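The model-level analysis described above can be sketched as follows: accuracy is first averaged across participants for each model, and the resulting values are then correlated with the time interval for that model, giving n − 2 degrees of freedom (28 for 30 models). This is a minimal sketch in Python; the data layout (one row of 0/1 scores per participant) and the function names are hypothetical.

```python
import math


def model_level_accuracy(correct):
    """Average 0/1 scores across participants for each model.

    correct[p][m] is 1 if participant p paired model m correctly
    (hypothetical layout), otherwise 0.
    """
    n_participants = len(correct)
    n_models = len(correct[0])
    return [sum(row[m] for row in correct) / n_participants
            for m in range(n_models)]


def pearson_r(x, y):
    """Return Pearson's r and its degrees of freedom (n - 2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy), n - 2
```

The first function would be applied to the participants' scores, and its output passed to `pearson_r` together with the per-model time intervals (e.g., in days).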
Experiment 2
The results of the first experiment demonstrated that, in a card sorting task where participants were required to form image pairs, face matching with driving licence photographs was both difficult and error-prone. We next decided to investigate performance levels using the more typical forced choice, pairwise matching paradigm in order to allow for a better understanding of difficulty within the context of other results in the literature. In addition, this pairwise task is more comparable with real-world scenarios, where match/mismatch decisions for two images (or an image and a ‘live’ face) are made.
In this second experiment, we also directly compared performance with driving licence and passport photographs. Although both forms of identification are frequently used in everyday life, researchers have yet to consider whether accuracy with these two types of ID image may differ.
Methods
Participants
A new sample of 60 volunteers (Mage = 30.1 years, SDage = 11.1 years; 42 women; 82% self-reported as White, with other ethnicities including Black, Asian, Indian, Latino and ‘mixed’) gave informed consent before participating in the experiment and were debriefed upon completion. Both the consent form and debriefing were presented on-screen.
The experiment was carried out online (see below), and participants were recruited through invitation (i.e., friends and family also living in the UK). As such, there was no overlap between this sample and those who participated in Experiment 1.
Stimuli
Of the 50 models in the database described in Experiment 1, 29 supplied their current passports. We collected a scan of their ID photograph and also took several photographs of the document in order to obtain the best quality image for use. Due to the passport’s construction, the face image was obstructed by both strong reflections and security watermarks when digitally scanned.
From our database, we selected the 21 White women (Mage = 20.1 years, SDage = 1.0 years) who had supplied both their passports and driving licences. We excluded male models and those of other ethnicities because (a) there were insufficient numbers to be able to create suitable mismatch trials; and (b) we had no performance data for these models based on Experiment 1. The length of time between taking our high-resolution photographs and the date of issue of the driving licences ranged from 195 days to 6.6 years, and this range was 273 days to 5.1 years for the date of issue of the passports.
It is worth noting that for 12 of the models used, the same original photograph was submitted for both driving licence and passport applications. As such, the two resulting ID images depicted the same photograph, although they still differed in size, colour, quality, and so on.
The high-resolution photographs were cropped and resized to 3.7 cm × 5.5 cm. The driving licence (1.8 cm × 2.3 cm) and passport images (3.3 cm × 4 cm) remained unaltered at their original sizes (see Figure 1).
Procedure
We created two separate face matching tasks, pairing our high-resolution images with either the driving licence or passport photographs. Both tasks comprised 21 match trials (two different images of the same model) and 21 mismatch trials (images of two different models). The former involved presenting both the high-resolution image and the ID image of the model, while the latter were created by pairing the high-resolution image of one model with the ID image of a different model.
Given that 15 of the models used here also appeared in the first experiment, we were able to base our mismatch pairings on the data collected in Experiment 1 as follows. For these 15 models, each mismatch driving licence image was chosen as the one most frequently paired incorrectly with the high-resolution image of the identity in question (limited to this sample of 15 models). Simply, if participants most often paired Model A’s high-resolution image with Model B’s driving licence image in the card sorting task (considering only incorrect pairings), we used this pairing to form Model A’s mismatch trial here. For the remaining six models (where no sorting data were available), two of the authors discussed and agreed upon mismatch pairings based on visual similarity of the identities’ images. As with the Glasgow Face Matching Test (Burton et al., 2010), we made no attempt to prevent identities/images appearing more than once (e.g., a particular identity may resemble several others, resulting in their appearance in multiple mismatch trials). After forming the driving licence identity mismatch pairings, these were simply duplicated for the passport mismatch pairings. Therefore, all 42 trials were identical for the two tasks (i.e., the same identities appeared on-screen for any given trial), with the only difference being that each high-resolution image was presented with either the accompanying identity’s driving licence photograph or passport photograph.
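The data-driven selection of mismatch pairings described above can be sketched as follows. This is a minimal illustration in Python, assuming that the sorting data are available as (high-resolution model, driving licence model) tuples pooled across participants. The function name, the data layout, and the tie-breaking rule (the first foil encountered wins) are hypothetical and are not taken from the original procedure.

```python
from collections import Counter


def choose_mismatch_foils(pairings, eligible=None):
    """Pick, for each model, the licence image most often paired with it in error.

    pairings: iterable of (high_res_model, licence_model) tuples, one per
              pairing made in the sorting task (hypothetical layout).
    eligible: optional set of models to restrict the selection to
              (e.g., the models that appear in both experiments).
    """
    errors = {}
    for high_res, licence in pairings:
        if high_res == licence:
            continue  # correct pairings are ignored
        if eligible is not None and (high_res not in eligible
                                     or licence not in eligible):
            continue
        errors.setdefault(high_res, Counter())[licence] += 1
    # most_common(1) returns the most frequent foil; ties keep insertion order.
    return {model: foils.most_common(1)[0][0]
            for model, foils in errors.items()}


example = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "A"), ("B", "C")]
print(choose_mismatch_foils(example))
```

Models that were never paired incorrectly with an eligible foil do not appear in the output and would need pairings chosen by other means, as was done for the six models without sorting data.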
The experiment was presented using the Qualtrics online platform (http://www.qualtrics.com). On each trial, two images appeared on-screen. The task was to judge whether these images were of the same person or two different people. The high-resolution image always appeared on the left of the screen, with the ID image appearing on the right. Participants responded by selecting either ‘same person’ or ‘different people’ using a mouse or touchscreen response (depending on the device upon which they took part). Trials were presented in a random order and were self-paced, viewing distance was not fixed, and no feedback was given at any point during the experiment. Participants were randomly assigned to complete either the driving licence (n = 30) or passport task (n = 30). All models were unfamiliar to the participants.
Although we had no control over the screen size that participants used (and hence the exact presentation sizes of the images seen), we took care to maintain the ratio of the image sizes, that is, the heights of the three image types were always proportionately 1 (high-resolution) to 0.42 (driving licence) to 0.73 (passport).
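The proportions above follow directly from the printed image heights (5.5 cm, 2.3 cm, and 4 cm). The following Python sketch reproduces the arithmetic and shows how on-screen heights could be derived for an arbitrary reference size; the function names and the 550-pixel reference height are hypothetical and used only for illustration.

```python
# Printed image heights in cm, as reported in the Stimuli section.
HEIGHTS_CM = {
    "high_resolution": 5.5,
    "driving_licence": 2.3,
    "passport": 4.0,
}


def relative_heights(heights_cm, reference="high_resolution"):
    """Express each image height as a proportion of the reference height."""
    ref = heights_cm[reference]
    return {name: round(h / ref, 2) for name, h in heights_cm.items()}


def displayed_heights(reference_px, heights_cm, reference="high_resolution"):
    """Scale all images so the reference image is reference_px pixels tall."""
    ref = heights_cm[reference]
    return {name: round(reference_px * h / ref) for name, h in heights_cm.items()}


print(relative_heights(HEIGHTS_CM))
print(displayed_heights(550, HEIGHTS_CM))  # 550 px is an arbitrary example
```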
Results
For each participant, we calculated their overall percentage correct. In addition, following other research in this field (Kramer & Ritchie, 2016), we investigated signal detection measures. We calculated sensitivity indices (d′) and criterion values (c) for each participant using the following: Hit – both images are of the same identity and participants responded ‘same’; and False Alarm – the two images are of different people and participants responded ‘same’.
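Given these definitions, the two measures are conventionally computed as d′ = z(H) − z(FA) and c = −0.5 × [z(H) + z(FA)], where H and FA are the hit and false alarm rates and z is the inverse of the standard normal cumulative distribution. The Python sketch below illustrates this calculation. The log-linear correction for rates of 0 or 1 is an assumption made here so that the example runs for any input; the text does not state how extreme rates were handled.

```python
from statistics import NormalDist


def dprime_and_criterion(hits, misses, false_alarms, correct_rejections):
    """Return (d', c) for one participant from raw trial counts.

    Assumption: a log-linear correction (add 0.5 to each count, 1 to each
    total) keeps the rates away from 0 and 1, where z is undefined.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion


# Example: 21 match and 21 mismatch trials, as in each task here.
print(dprime_and_criterion(hits=16, misses=5, false_alarms=5, correct_rejections=16))
```

With this sign convention, positive values of c indicate a bias towards responding ‘different people’ and negative values a bias towards responding ‘same person’.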
Our results are summarised in Table 1, along with a subset of previous findings where the relevant information was readily available for comparison. We found no difference between participants’ performances on the two tasks: overall percentage correct, t(58) = 1.52, p = .134, Cohen’s d = .39; sensitivity d′, t(58) = 1.54, p = .130, Cohen’s d = .40; criterion c, t(58) = 0.27, p = .786, Cohen’s d = .07.
Table 1. Summary of the Results for the Two Tasks, Along With a Subset of Previous Findings for Comparison.
Note. GFMT = Glasgow Face Matching Test.
Values presented are M (SD).
Previous research has demonstrated worse performance when carrying out matching with other-race faces (Megreya, White, & Burton, 2011). Given that 11 of our participants self-reported their ethnicity as something other than White, we also analysed performance measures with the inclusion of participants’ ethnicities (White vs. other) as a factor. For overall percentage correct, we found no difference between the two tasks, F(1, 56) = 0.51, p = .479, ηp² = .009, but a significant difference between ethnicities, F(1, 56) = 7.01, p = .010, ηp² = .111. Surprisingly, White participants (M = 76.26%) performed worse than ‘other ethnicity’ participants (M = 84.39%). Importantly, we found no significant interaction, F(1, 56) = 2.89, p = .095, ηp² = .049.
We also found this same pattern for sensitivity d′, with no difference between tasks, F(1, 56) = 0.70, p = .407, ηp² = .012, but a significant difference between ethnicities, F(1, 56) = 7.76, p = .007, ηp² = .122. Again, White participants (M = 1.54) performed worse than ‘other ethnicity’ participants (M = 2.16). As with overall percentage correct, we found no significant interaction, F(1, 56) = 3.51, p = .066, ηp² = .059.
Although this pattern of results contradicts previous work, we note that the number of non-White participants who took part in the driving licence (n = 2) and passport tasks (n = 9) was very small, and so no conclusions should be drawn from this evidence alone.
As with Experiment 1, we investigated whether the amount of time between photographic sittings was associated with model-level accuracy (averaged across participants for each task separately). For the driving licence task, we found no significant correlation between this time period and match trial accuracy, r(19) = −.21, p = .358. For the passport task, we also found no significant correlation between this time period and match trial accuracy, r(19) = −.31, p = .179. As mentioned earlier, the current experiment was not designed to investigate this question, and we only include these analyses for completeness.
General Discussion
Our results address two important issues. First, we investigated whether driving licence photographs were effective for use in facial identification. The findings of Experiment 1 strongly suggest that these types of images result in unacceptable levels of accuracy in the context of a card sorting task, in which additional information was available through a process of elimination, although the task’s requirements may nonetheless have been cognitively demanding. Second, we directly compared face matching with driving licence and passport photographs. We found no significant difference in performance with these two types of ID, with both producing accuracy levels that argue against their suitability as forms of identification.
Performance levels with driving licence (76%) and passport photographs (79%) found here are well within the range of accuracies that are typically reported by studies utilising photographs from databases and standardised tests (around 70%–90%; see Table 1). Given that official ID images produce levels of performance comparable with more easily obtained laboratory photographs, our findings support the external validity of these previous studies.
Although the novel sorting task employed in Experiment 1 revealed significant difficulties with pairing high-resolution and driving licence photographs, we suggest that this task likely overestimated accuracy levels because (a) identities were not chosen based on facial similarity (beyond sex and ethnicity); (b) participants were not required to make decisions about more difficult, other-race faces (Megreya et al., 2011); and (c) all matches were present, and so additional information could be used to narrow down options through a process of elimination. In addition, real-world facial images are accompanied by other biographical information, which has been shown to further complicate decision-making (McCaffery & Burton, 2016). However, we acknowledge that participants may have used a form of relative matching (i.e., pairing faces that look most similar), which differs from the typical process required for identification.
Face matching performance with driving licence and passport photographs in Experiment 2 (around 75%–80%) appears to be worse than with pairs of images taken on the same day (around 80%–90%; Burton et al., 2010). This may be due to the lower quality images that appear on official forms of identification as well as the larger amount of time between photographic sittings (Fysh & Bindemann, 2018; Megreya et al., 2013). However, we found no evidence in the current work that the amount of time between photographic sittings was associated with matching accuracy. This may be due to the relatively small variability in the length of time since the documents were issued in our sample. Moreover, these experiments were not designed to specifically test this idea, and so we encourage further investigation on this topic.
The current set of experiments included only female models as stimuli. This was simply due to the limitations imposed by the demographics of the initial model database. Rather than including a small number of men, which would result in easier mismatches due to the limited number of possible pairings, we chose to exclude these models. The same reasoning was applied to our non-White models. Although we have no a priori reason to predict different levels of performance for female versus male stimuli, we acknowledge this limitation in the present work.
A recent test of face matching, using images taken 9 months apart (on average), resulted in lower levels of accuracy than were found here (around 66%; Fysh & Bindemann, 2018). Although the present stimuli incorporated a longer gap between photographic sittings, we note that our participants were not required to match images that differed in facial expression (unlike in Fysh & Bindemann, 2018). Previous work has demonstrated that matching a neutral image with a smile is significantly more difficult than when both images portray neutral expressions (Bruce et al., 1999). Indeed, any change in conditions across images (e.g., the addition of glasses – Kramer & Ritchie, 2016; a change in viewing angle – Estudillo & Bindemann, 2014) typically results in a decrease in accuracy. Whether expression differences have a larger effect on performance than time differences is an important question for future research.
In conclusion, the present set of experiments demonstrates that face matching with driving licence photographs is error-prone. Although performance was no worse than with passport photographs, we find that both types of ID result in unacceptable levels of accuracy (comparable with those found using typical face databases). Therefore, our recommendation is that alternative methods be used in combination with, or instead of, facial images when proof of identity is required. Multimodal biometric recognition, where information from multiple sources (e.g., the face, iris, and fingerprint – Shekhar, Patel, Nasrabadi, & Chellappa, 2014; for a review of biometric research, see Jain, Nandakumar, & Ross, 2016) is utilised to improve accuracy, represents one promising avenue for progress in this field.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has received funding from an Experimental Psychology Society’s Small Grant awarded to R. S. S. K.
