Abstract
BACKGROUND:
Hesitation to employ females in physically demanding jobs is often attributed to sex-related differences in physical abilities. A physical employment standard (PES) identifies individuals who are physically capable of performing the work.
OBJECTIVE:
A database containing more than 300 sources of physical fitness test (PFT) data will inform the assessment of potential sex bias in PES development.
METHODS:
Weighted means and probability density curves illustrate the percentage overlap between male and female performance on PFT data from the armed forces of 11 countries and the open literature. Where female training data were available, the change in percentage overlap illustrates the potential for reduction in sex-related differences.
RESULTS:
PFTs demonstrating the extremes of sex disparity were bench press (11 sources) and sit-ups (14 sources), with 9% and 93% overlap in performance, respectively. Training for bench press, pull-ups, VO2max, and upright pull improved female performance by 12%, 22%, 35%, and 23%, respectively. This translated into narrowing the gap between male and female mean performance by 1%, 4%, 5%, and 10%, respectively.
CONCLUSIONS:
The ability of PFT to predict performance is essential; however, PFTs with more overlap will facilitate development of PES with reduced sex bias. PFTs with the greatest potential for improvement in females are identified here.
Introduction
In the past decade, many North Atlantic Treaty Organization (NATO) military forces have removed barriers to females serving in combat occupations [1]. There may be hesitation to include females in jobs or roles which require heavy physical demands [2], such as the combat arms, due to their higher level of attrition and musculoskeletal injuries [3] as well as a perceived inability of females to perform tasks with heavy physical demands [4]. Sex differences in physical abilities have been identified as larger than any other difference relevant to personnel selection [4, 5]. These differences are a function of modifiable (e.g., lean body mass) or non-modifiable (e.g., height, hormone profile) characteristics which are linked to, but not always a function of sex [6–8]. Figure 1 displays the sex differences in physical and physiological determinants of physical performance derived from Roberts et al. [2].

Sex differences in physical and physiological determinants of physical performance as reported in the literature and reviewed by Roberts et al. [2].
A physical employment standard (PES) assessment can be used to identify individuals with the physical ability to perform these demanding jobs; however, a PES can be designed in a way that biases females, falsely identifying women as incapable of acceptable job performance [9]. For example, pull-ups and the flexed-arm hang are both physical fitness tests (PFTs) which measure upper-body muscular strength and endurance. Pull-ups can often result in a score of zero for women because they do not have the strength to complete one pull-up, while the flexed-arm hang test is less likely to result in a zero score, and will provide a continuous measure (seconds). The selection of PFTs can have significant effects on a woman’s ability to pass the PES, while not fairly reflecting her ability to perform the job tasks.
A PES must be designed to predict performance on the job [10], and one must make every attempt to minimize adverse impact (AI) to a minority group, as evidenced by a discrepancy in performance of two groups within the workforce [11]. When embarking on the development of a PES one selects the PFT to best assess the components of fitness required by the job. Manual material handling and its reliance on upper body strength and endurance can be measured by a multitude of PFT such as grip strength, bicep curl, and push-ups. By design, an AI assessment typically occurs after the PFTs have been selected. If a PES is designed with the goal to minimize differences in performance between sexes, the application, development time and adoption of the PES would be improved and legal challenge on the grounds of discrimination would be less likely [12]. As Hydren et al. [13] indicate, useful PFTs maintain correlation strength for within-sex samples not just combined-sex samples. Less useful PFTs derive predictive strength from sex differences in anthropometrics and physical performance [13]. The PFT database presented here will help researchers select tests with the least inherent bias that assess a potential applicant’s or incumbent’s physical ability necessary for job performance.
Courtright et al. [4] suggest physical training can be used as a means to reduce sex bias in a PES. A number of studies have shown that combined resistance and aerobic training programs can be used to improve the performance of women [14, 15] and reduce sex differences on physically demanding tasks such as heavy lifting, repetitive lifting and load carriage [16–18]. Inclusion of task-specific training along with resistance training produced even greater reductions in sex differences in physically demanding task performance [18]. In a randomized controlled training study, Gumeiniak et al. [19] demonstrated the effects of both test practice and physical training on reducing the sex bias and AI in a PES assessment. Using the Canadian Wildland Firefighter Fitness Test circuit as the criterion measure, only 11% of women were able to pass the test on the first attempt, as compared to 73% of men. Volunteers were divided into three cohorts: a physical training cohort, a Firefighter Fitness Test circuit training cohort and a control group. At the end of 5 weeks, 80% of the physical training cohort females passed, 72% of the circuit females passed, but only 26% of the control group passed. These researchers concluded physical training and adequate test familiarization significantly reduced the sex bias in the test [19]. Physical training is effective in improving women’s performance on PFTs and physically demanding job tasks and will improve the sex overlap if only the women are trained.
In 2015, with the formation of the NATO research task group (RTG) “Combat Integration: implications for PES”, one of the goals was the creation of a database reporting male and female performance on individual PFTs, which measure the various components of fitness. As illustrated in Fig. 1, sex differences in the components of fitness are typically presented in a generalized manner (e.g., upper body strength) and do not necessarily reflect the sex differences between individual PFTs (e.g., bench press strength vs handgrip strength). There can be large differences in the female-to-male performance ratio between different PFTs that assess the same fitness component. Therefore, a database describing the sex differences for various PFTs within a component of fitness might help the PES developer reduce the inherent sex bias by selecting a test with less bias. In addition, this paper will report the effects of physical training of females on PFT performance, where data were available.
The purpose of this review was twofold: 1) to show the distribution of male and female performance on PFTs from civilian and military populations, including unpublished data, and 2) to demonstrate the effects of physical training on narrowing the sex gap in performance of PFTs, where data were available.
Methods
Military researchers from NATO allied nations contributed data from male and female military personnel performing PFTs. The database was constructed primarily from the performance of male and female military personnel, including Army, Navy and Air Force active service members, recruits and incumbents. Data available in the literature from civilian emergency services personnel (e.g., emergency medical technicians, Federal Bureau of Investigation, Royal Canadian Mounted Police, firefighters and police) and from some other physically demanding occupations (e.g., steelworkers) were also included. Existing performance data were gathered from peer-reviewed and technical publications from the following countries: Australia, Canada, Denmark, France, Germany, Israel, Netherlands, New Zealand, Norway, the United Kingdom, and the United States.
A database of maximal performance on PFTs was developed; the PFTs are listed in Table 1. For inclusion in the database, mean and standard deviation data from both sexes had to have been collected by the same research team. Where available, pre- and post-physical training data for males and females on these PFTs were also included.
Sex-specific physical fitness test data from 326 sources of data
*Denotes training data is reported. **Physical Ability Requirement Evaluation for the Royal Canadian Mounted Police, consisting of: 1. Obstacle Course 2. Push/Pull Station 3. Torso Bag Carry.
Probability density curves were created for the male and female groups based on aggregated data, with the goal of creating a queryable database that graphically presents the performance of males and females on the listed PFTs (Table 1). For each PFT, weighted averages and standard deviations (SD) of performance data, weighted by sample size, were calculated for the male, female, and female training (when applicable) samples. These weighted averages and SD values were used to create probability density curves that demonstrate the percentage of overlap between male and female performance. The percentage of overlap was calculated as the area of intersection between the male and female curves; its specific aim is to illustrate the potential to combine male and female data in a statistical analysis to reduce sex bias. For male and trained female samples with available training data, the percentage of overlap pre- and post-training was calculated to highlight the effects of training on reducing sex-related performance differences. For select timed PFTs, shorter durations represent better performance, which is why some female probability density curves lie to the right of the male curves.
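The aggregation and overlap calculation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each probability density curve is a normal distribution, that SDs from individual sources are combined as the square root of the sample-size-weighted mean variance, and that the area of intersection is obtained by numerical integration. The function names and the example values are hypothetical and are not taken from Table 1.

```python
import math


def weighted_mean_sd(sources):
    """Combine (n, mean, sd) tuples, one per data source, weighting by n."""
    total_n = sum(n for n, _, _ in sources)
    mean = sum(n * m for n, m, _ in sources) / total_n
    # Assumption: SDs are combined via the n-weighted mean of the variances.
    sd = math.sqrt(sum(n * s ** 2 for n, _, s in sources) / total_n)
    return mean, sd


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (assumed shape of each curve)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def overlap_percent(mean_a, sd_a, mean_b, sd_b, steps=20000):
    """Area under min(pdf_a, pdf_b), in percent, by the trapezoidal rule."""
    lo = min(mean_a - 8 * sd_a, mean_b - 8 * sd_b)
    hi = max(mean_a + 8 * sd_a, mean_b + 8 * sd_b)
    dx = (hi - lo) / steps
    prev = min(normal_pdf(lo, mean_a, sd_a), normal_pdf(lo, mean_b, sd_b))
    area = 0.0
    for i in range(1, steps + 1):
        x = lo + i * dx
        cur = min(normal_pdf(x, mean_a, sd_a), normal_pdf(x, mean_b, sd_b))
        area += 0.5 * (prev + cur) * dx
        prev = cur
    return 100.0 * area


if __name__ == "__main__":
    # Hypothetical (n, mean, sd) values for one PFT from two sources per sex.
    male_mean, male_sd = weighted_mean_sd([(120, 50.0, 10.0), (80, 46.0, 9.0)])
    female_mean, female_sd = weighted_mean_sd([(60, 31.0, 8.0), (40, 28.0, 7.0)])
    print(overlap_percent(male_mean, male_sd, female_mean, female_sd))
```

Because the overlap depends on both SDs as well as the distance between the means, two PFTs with the same difference in means can show different percentages of overlap.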
As an additional measure of sex differences in performance, Cohen’s d was calculated from the weighted means and the pooled SD (SDpooled) of the male, female, and trained female samples for each PFT. Cohen’s d is an effect size that expresses the standardized difference between the male and female means. A larger d value represents a larger difference, such that a value of 1 indicates that the means of the two samples are separated by 1 SD.
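As a hedged illustration of this effect size, the sketch below computes Cohen's d from two group means, SDs, and sample sizes. The text does not state which pooling formula was used; the conventional degrees-of-freedom-weighted pooled SD is assumed here, and the function names and example values are hypothetical.

```python
import math


def pooled_sd(n1, sd1, n2, sd2):
    """Pooled SD of two groups (assumed conventional formula)."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized difference between two group means."""
    return (mean1 - mean2) / pooled_sd(n1, sd1, n2, sd2)


if __name__ == "__main__":
    # Hypothetical male and female weighted means and SDs for one PFT.
    print(cohens_d(48.4, 9.8, 200, 29.8, 7.8, 100))
```

With equal SDs in both groups, the pooled SD equals that common SD, so a difference in means of one SD gives d = 1, matching the interpretation given above.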
Results
Table 1 includes data for 32 PFTs taken from a total of 326 sources of data (N = 158542 male data points, 43460 female data points). The list of PFTs, the unit of measure, the mean and SD for males and females, the female to male ratio, and Cohen’s d are also provided. Thirteen of the PFTs included pre-post training data for females (12 distinct references). Training data included a total of 2380 female data points.
Some example PFT sex differences are presented here and the characteristics of each are described.
Figures 2a and 2b display male and female pull-up data from non-training and training sources, respectively, from four countries. The data reveal a relatively small amount of overlap in performance (27%), and the male-female performance overlap does not increase substantially with physical training of females (overlap increased by approximately 4 percentage units to 31.1%, N = 641 females). The Cohen’s d for pre-training is approximately 2.0. These training data are from research with Federal Bureau of Investigation trainees in the United States [20].

Male and female data on pull-ups.
The 38 cm isometric upright pull was employed in testing the Army and Navy populations in the U.K. and the U.S. These data (Fig. 3a and 3b) demonstrate that without training there is a large sex difference in performance with a 32% overlap and a Cohen’s d of 1.5. Following training, there was an improvement of 10 percentage units on the maximum weight (force) of the isometric upright pull, achieving 42% overlap. These data were obtained from 592 females before and after 8 weeks of U.S. Army Basic Combat Training [20, 21].

Male and female data on Upright Pull (described as an isometric upward pull at 38 cm from the ground).
Bench press performance was determined by 1 repetition maximum (1RM) from military, police and firefighters in American, Norwegian and Canadian populations. Figures 4a and 4b show that the overlap in male and female performance is minimal (8.8%), with a Cohen’s d of 2.9. Training data show an improvement to 10% overlap, an increase of 1.2 percentage units, but the sample size for females is very small (n = 19).

Male and female data on bench press
Maximal aerobic capacity or VO2max can be assessed directly or indirectly (i.e., predicted) when selecting PFTs for a PES assessment. These data include multiple military and civilian sources from six countries, including indirect assessments (e.g., a shuttle run, Cooper’s run) and direct assessments of VO2max. As expected, Fig. 5a and 5b show considerable overlap in performance (60%) and a Cohen’s d of 0.9. With training of females, the overlap improved by 5 percentage units to 65%, which reduced the difference between sexes from a pre-training difference of 9.8 mL·kg⁻¹·min⁻¹ to a post-training difference of 5.8 mL·kg⁻¹·min⁻¹.

Male and female data on VO2max expressed relative to body mass (mL·kg⁻¹·min⁻¹).
Male and female sit-up performance was determined as maximum repetitions within one minute (males = 2280, females = 4014), demonstrating 93.4% overlap in performance (see Fig. 6) and a Cohen’s d of 0.1. The vertical lines on the graph indicate the scores required to pass the military PES in Norway, the Netherlands, and Australia for males, females, or both.

Male and female data on sit-up performance.
Male and female 2-minute push-up performance is presented in Fig. 7a and 7b. Push-up performance is determined as repetitions until exhaustion, given a total time of 2 minutes. In addition, the vertical lines represent various PES which are separate for males and females in the military. These results demonstrated that the overlap between male and female performance is increased from 13% (Cohen’s d of 1.7) to 18% with training, with a sample of N = 1,457 trained females. These data include 38% of the sample derived from civilian populations.

Male and female data on push-ups in 2 minutes. 7c. push-ups in 1 minute (non-training).
Figure 7c includes the data on a 1 minute push-up test, used by the U.S. Navy, U.S. Army and U.S. Special Agents, for a total sample size of N = 817 (314 females). These populations present with much higher performance from females as the overlap in performance is 49% (Cohen’s d of 0.8) with males achieving on average 49 push-ups and females 29 push-ups.
Discussion
Although many organizations employ task-based or simulation style PES, PFTs will likely always exist due to their simple procedure, efficiency for mass testing, effectiveness (criterion validity), lower cost, standardised protocols, and links to normative data [4]. For this reason, the compendium of male and female PFT performance provided here will help PES researchers identify which PFT might be selected to assess a prospective employee’s physical ability with respect to the components of fitness required for the job. As Hydren et al. [13] indicated, a systematic review of PFTs for a military lifting task (soldier specific) which examines sex bias did not previously exist in the literature. This database created for the NATO RTG should expand and complement the work of Hydren et al. [13] and assist in the reduction of AI. Obviously, the relationship to the task will dictate the PFTs selected for the PES; however, at times these tests may be selected because they have historical or cultural context, or because other militaries or occupational organisations administer them [13].
The comparisons of upper-body muscular strength and endurance demonstrate that the flexed-arm hang has more overlap in male-female performance (65%) than pull-ups (27%) [22]. In addition, PES researchers might consider the trainability of a PES. For example, the male-female difference in push-up ability narrows to 12 repetitions with training (Fig. 7a and 7b). However, these training data should be interpreted with caution, as no consideration of the quality or length of training was made in the current analysis. Specifically with regard to the bench press, the training data in this database come from only one study with small improvements in female performance (Fig. 4a and 4b); however, other researchers have demonstrated increases of more than 23% in 1RM bench press [14, 24]. Of interest is that for this database the females achieved on average a 1RM bench press of 44 kg, which is relatively large compared to the other training studies.
Previous research, similar in nature [4], was limited in scope to peer reviewed data (113 studies, 140 samples, 59% non-military, 88 693 males and 18 279 females). Some PFTs are highlighted by Courtright et al. [4], such as a softball throw and pull-ups, where sex differences were as high as Cohen’s d = 4.12 and d = 3.27 (summarized by Wilk and Sackett [25]). Whilst the work of Courtright et al. [4] is a thorough systematic review, the limitation of only presenting published data excludes large populations (e.g., military) for which there are multiple sources of PFT data for females and unpublished training interventions. The data presented here are the extension and expression of these military sources, which one would not have access to without a group such as a NATO RTG.
Figure 5a and 5b would indicate that indirect and direct assessments of VO2max demonstrate low sex bias when presented relative to body mass, and this is reflected in Table 1. However, in the military aerobic activities such as marching often are performed wearing a load, and as Roberts et al. [2] indicated, absolute loads should be incorporated into PES design where possible.
In future studies, if the goal is to predict performance on loaded activities (personal protective equipment and a rucksack) or on tasks requiring the movement of an external load (lifting), relative VO2max is not often useful [26]. For these types of activities, the assessment of loaded VO2max or absolute VO2max is recommended [26].
Siddall et al. [27] examined PFT data and physical demographics including sex, and their relationship with a firefighter simulation test. Regression analysis identified an inverse relationship with relative VO2max, but sex was not selected as a predictor by the model. Therefore, male and female data could be combined in the regression. This is similar to research on the Canadian Armed Forces which identified that loaded relative VO2max was selected in a predictive step-wise regression model of urban operations performance, but not sex [28]. Beck et al. [29] examined relationships between fitness components, anthropometrics, demographics, and carrying tasks and determined that neither age nor sex were significantly predictive of carrying performance when controlling for modifiable factors such as leg lean mass. When these modifiable factors such as specific PFTs and lean mass demonstrate overlap in male and female data, the ability to combine male and female data in a regression improves, as sex is not identified as a significant factor [9, 29].
Figure 6 includes pass requirements for various PES on sit-ups, and these data demonstrate that not only do males and females perform similarly on sit-ups, but that the same pass score established by military organizations is attainable by both sexes. The similar performance of males and females on sit-ups is not surprising and has been demonstrated elsewhere [30, 31]; however, sit-ups have limited ability to predict performance on job related tasks [32]. Fig. 7a, 7b and 7c also include pass scores for military PES on push-ups and the post-training figure shows how variable the effect of training was on females; yet, some females would have achieved the male standards for push-ups after training.
Hauschild et al. [33] reviewed published research to assess which PFTs are predictive of occupational tasks, and sex differences in performance on work simulations. The authors generated pooled r values from 27 studies of a potential 273 (13 were in military populations) to determine which PFT(s) best predicted task performance for 12 categories of tasks (e.g., lift/lower, casualty drag, etc.). The analysis was limited to correlations reported in the peer reviewed literature and included few samples from military volunteers. The data were too limited to assess differences between males and females for all tasks, and the comparison was made only for crawl and stretcher carry. Hydren et al. [13] conducted a meta-analysis that examined predictors of 1RM box lifting capacity for predictive accuracy and sex bias. They reported lean body mass and dynamic measures of strength (i.e., shoulder press, machine arm curl, latissimus dorsi pull-down, machine bench press, incremental dynamic lift, and leg press) to be the best predictors. They identified seven PFTs with moderate to good correlations (r² = 0.50–0.75) in combined-sex samples that still had at least fair correlations (r² = 0.25–0.50) with 1RM box lifting capacity when examined for single-sex samples. In addition, the reduction in correlation from combined-sex to single-sex samples was similar for men and women (r² difference <0.10).
The current data show that physical training is an effective means to reduce the sex differences in performance, and therefore to decrease AI [4, 34]. In addition to training to improve PFT performance, generalized physical fitness training has been shown by many to improve the performance of women on military-relevant physically demanding tasks [16, 36]. It is important to acknowledge that even though physical training can assist women in gaining access to the job, it must be continued throughout their employment to maintain an acceptable level of performance on physically demanding job tasks [2].
Future research should include a similar investigation for common soldiering tasks to identify which tasks have the greatest sex bias and which PFTs are most predictive of these common soldiering tasks with the least sex bias.
Conclusion
The data presented herein can be used in planning and conducting PES research, especially when used in collaboration with the work of Courtright et al. [4], Hauschild et al. [33] and Hydren et al. [13]. Although there are many important considerations when developing a PES (e.g., cost, space and equipment required, personnel training, etc.), the ability of the PES to predict job performance is of primary importance. Given multiple assessment tools with similar predictive capacity, consideration of the sex bias associated with those tools may be aided with the use of the database presented here. The reduction of sex bias allows a more fair and inclusive PES, and allows the combination of male and female data in the data analysis for tests such as multiple linear regression [27–29]. So often researchers are discouraged from combining male and female performance data as there are underlying assumptions that their performance is markedly different. The ability of these PFTs to improve with physical training can help guide programs and facilitate the success of females both on the PES assessment and on the job. More research is encouraged to examine the efficacy and adherence of females to these training programs to maintain long-term performance gains.
Conflict of interest
None to report.
Acknowledgments
The authors would like to acknowledge the contributors from the NATO HFM RTG 269.
