Abstract
BACKGROUND:
When one thinks of jobs with physical employment standards, the first thoughts typically center around firefighting, law enforcement, and military jobs. However, there are hundreds of arduous jobs in the public and private sectors that range from moderately demanding to strenuous. The Bureau of Labor Statistics reported that 28% of the workforce in the United States performs physically demanding jobs that involve construction, machinery installation and repair, public safety, and other professions.
OBJECTIVE:
This paper provides a historical perspective of physical employment standards for hiring workers into these arduous jobs, how we arrived at our current knowledge base, and the challenges faced today when determining and implementing physical employment standards.
METHOD:
This narrative review draws on evidence from 62 published sources.
RESULTS:
This paper focuses on the need for a multidisciplinary approach to identifying job requirements, the professions (e.g., medical, psychology, physiology) that underpin the methodologies, and the knowledge used by current researchers. Descriptions of test and cut score development, legal issues, and challenges for the future also are highlighted.
Introduction
When one thinks of jobs with physical employment standards, the first thoughts typically center around firefighting, law enforcement, and military jobs. However, there are hundreds of arduous jobs in the public and private sectors that range from moderately demanding to strenuous. Many jobs with physical demand have become more complex in that workers need computer and high-level technical skills to install, troubleshoot, calibrate, operate, and repair all types of equipment and monitoring devices. Although automation has made some jobs less arduous, physical demand remains present in many other jobs. For instance, jobs in the electric and telecommunications industries require climbing to heights over 30.5 meters (m) above the ground, digging holes in the ground, crawling in attics, and lifting heavy objects, along with installing equipment to ensure transmission of electrical signals to residential and commercial properties.
The Bureau of Labor Statistics reported that 28% of the workforce in the United States performs physically demanding jobs that involve construction, machinery installation and repair, public safety, and other professions. In many instances these are the highest paying jobs in a geographic location. Figure 1 shows the percent of jobs with medium and heavy physical demand across industries [1]. Of the jobs in the construction, installation/maintenance, and transportation industries, 32.3% to 45.5% have heavy physical demand. Over 50% of the jobs in the food preparation/serving, building and grounds maintenance, and production occupations have medium physical demand. Thus, physical work is still present in the United States.

Percentage of civilian jobs requiring different strength levels in selected United States occupations in 2016. Bureau of Labor Statistics, U.S. Department of Labor. (2017), The Economics Daily: Physical strength required for jobs in different occupations in 2016. [online], Available: http://www.bls.gov/opub/ted/2017/physical-strength-required-for-jobs-in-different-occupations-in-2016.htm.
The purpose of this paper is to provide a historical perspective related to physical employment standards, how we arrived at our current knowledge levels, and the challenges faced today when determining and implementing physical employment standards. Although many organizations use the term fitness standards, a more appropriate terminology is physical employment standards (PES) or physical performance standards. Use of the word fitness is related to general physical fitness and is not as accurate in terms of setting job-related employment standards.
Assessment of work and physical performance has a historical base in the fields of industrial/mechanical engineering, industrial-organizational psychology, medicine, applied/exercise physiology, and biomechanics/ergonomics. During the 1800s, many countries supported general physical fitness to engage in military war activities. The fitness centered on a gymnastics approach with well-known historical figures such as Frederick Jahn in Germany with the Turnvereins (German gymnastics), Franz Nachtegall in Denmark with the Institute of Military Gymnastics, and Pehr Henrik Ling in Sweden at the Royal Military School [2]. However, these assessments were general in nature and not specific to job tasks.
Some of the first workplace assessments were completed in the early 1900s by Frank and Lillian Gilbreth, who were both mechanical engineers. Frank Gilbreth initially worked as a bricklaying helper and observed differences in task performance across workers [3]. After attaining his engineering degree, he started his own consulting firm with his wife, which led to observational and time and motion studies targeted at improving work performance from an ergonomic perspective. The Gilbreths evaluated work in manufacturing and clerical settings and developed work aids such as vertical scaffolding that allowed bricklayers immediate access to the bricks. Frank Gilbreth developed techniques used by armies around the world to quickly disassemble and reassemble weapons. He and Lillian also addressed fatigue factors in the workplace due to inefficient movement patterns [4]. Their studies improved work productivity by defining best practices for performing work tasks, redesigning the workplace, and developing work aids.
In the early 1900s, railroads in the United States sought to increase worker efficiency. Frederick Taylor, a contemporary of the Gilbreths, developed the scientific management approach, which included time and motion studies [4]. His studies found a productivity relationship between time spent under load, such as lifting and carrying objects, and time spent at rest. He found that workers could lift and/or carry pieces of pig iron weighing 41.7 kg for 43% of the day before they had to revert to lighter pieces. However, if the pig iron weight was reduced, the worker could lift 20.9 kg for 58% of the day. Taylor strove for accurate workplace measurement that continues today. For example, grocery and product distribution centers worldwide are engineered to provide the greatest efficiency in picking and transporting products to a truck for delivery to a store. Workers in the distribution centers must achieve a specific percentage of the center’s engineered standard for productivity per shift [5].
Assessment instrumentation
Several pioneers were responsible for developing methods to assess physical performance. Dudley Sargent, a physician, developed the vertical jump test that is still used today in many contexts [6]. Further, he contended that there needed to be a means to equitably compare people’s performance and laid out test criteria. The tests encompassed measures of strength, speed, and “endurance” that included elbows to knees (straight leg sit-ups), modified pull-ups, push-ups, squats, and other tests. To provide an overall assessment of an individual’s fitness level, Sargent converted the test scores to joules and summed the joules across tests to classify the minimum, average, and maximum percent of work completed during the testing.
Physical measurement continued to evolve with development of instruments such as the universal dynamometer. In the late 1800s, Kellogg [7] used the dynamometer to measure strength deficits in his patients after orthopedic surgery and for assessment of infantile paralysis. E. G. Martin [7] expanded dynamometer usage to testing muscles of the feet, hips, knees, shoulders, forearms, wrists, fingers, and thumb, along with identifying the best order for testing. Up until this time, medical doctors were the primary researchers and inventors of instruments to measure physical performance.
Static strength testing was taken to another level by H. Harrison Clarke [8] who used cable tensiometers to measure strength in a more precise manner. The cable tensiometer was an adaptation of an instrument that measured the tension of aircraft control cables. Using the cable tensiometer, he developed procedures to measure strength in 38 muscle groups impacted by orthopedic disabilities in hospitals and Veterans Administration centers. Clarke expanded his work to compare the cable tensiometer to other measurement devices such as a strain gauge or spring scale. Thus, these developments contributed to the types of instruments we use today for measuring force. Specifically, most force platforms use strain gauge technology and dynamometers are now interfaced with software that records the data instantaneously.
The historical measurement tool for gathering expired gases to determine aerobic capacity and other parameters was the Douglas Bag [9]. Gordon Douglas, a British physiologist and physician, developed the bag to collect and measure respiratory gas exchange for medical purposes. Robert Bruce, Bruno Balke, and others expanded the use of the Douglas Bag to sport and work settings and standardized protocols to assess cardiac function and maximal oxygen consumption [10]. Applied or exercise physiologists have continued work in the areas of strength and aerobic assessment and have made great strides in measurement precision.
In the 1950s and 1960s physiologists and psychologists identified dimensions of physical performance related to work and sport performance. Psychologists’ interest in physical performance waned until the early 1960s when Edwin Fleishman identified a taxonomy of physical factors that contributed to job task performance such as static and dynamic strength [11]. At the same time, exercise physiologists such as A. Jackson [12], J. W. Borchart [13], and T. Baumgartner and M. Zuidema [14] conducted similar work identifying the physical abilities that were later targeted in the work setting. Similarly, industrial engineers such as Stover Snook and others assessed the strength and aerobic demands of the workplace [15]. Per-Olof Åstrand, Irma Åstrand, and Karl Rodahl were some of the first applied physiologists to gather data related to job task performance in the fishing, steel, and other industries [16, 17].
In summary, PES research emerged from five different professions: industrial/mechanical engineering, industrial-organizational psychology, medicine, applied physiology, and biomechanics/ergonomics. Researchers in these professions provided the foundation for the current multidisciplinary approaches used in PES research today.
Job analysis – The foundation of physical employment standards
Defining job requirements was critical to early researchers and laid the foundation for gathering, organizing, analyzing, and documenting information about the workplace. The framework for job analysis was initially conceptualized by psychologists Lillian and Frank Gilbreth and Frederick Taylor who observed work and wanted to improve efficiency and productivity in the early 1900s [4]. This approach was expanded over the years by many industrial-organizational (I-O) psychologists who developed the job analysis methods used today to identify physical job requirements in terms of essential/critical job tasks, worker requirements, physical abilities, and ergonomic parameters.
John Flanagan [18] developed the critical incident technique, which involves a set of procedures for observing and gathering information about a specific human activity that occurs for a purpose and has consequences related to a worker’s action or inaction. Flanagan’s initial research was used to select aviators during World War II and the Korean War and was expanded to addressing pilot selection and classification in relation to aircraft requirements. We use the critical incident technique today as part of the job analysis to collect information about job tasks and the consequences if they are not performed properly. For instance, workers will explain the physical demand of driving railroad spikes and repairing track. However, it is the interviewer’s responsibility to elicit information about the consequences of not performing these tasks successfully, such as train derailment.
Sidney Fine [19] created a structure for task statements by defining a task as an action or action sequence designed to contribute to a specified result within a time period. He described job analysis in terms of data, people, and things in a hierarchical manner ranging from simple to complex actions. The task or task sequence may be primarily physical such as carrying objects or mental such as analyzing data. First, the action the worker is performing should be defined such as lift a carton. Second, one should include the result of the worker action such as lift and load cartons onto a truck for delivery. In other words, use action verbs and define the “to do what” purpose. An example from the shipbuilding industry contains Fine’s task structure: Use open/closed end wrenches and socket sets to tighten and loosen bolts on machinery and equipment (e.g., pumps) [20].
Edwin Fleishman [11] took another approach and created a job analysis taxonomy that was ability oriented. His book, Structure and Measurement of Physical Fitness [11], was cited worldwide and formed the basis of a larger taxonomy that included physical abilities such as static strength, along with psychomotor (e.g., reaction time), cognitive, and fine motor abilities. Unlike other researchers who defined physical abilities during the same time period, Fleishman generated 7-point Likert scales for each ability that allow workers to identify a level of ability demand that corresponds to everyday tasks [21]. For instance, a moderate level of static strength (4 on a 7-point scale) equates to lifting 18.1 kg. These scale ratings form the basis for classifying all jobs across the physical abilities and other abilities (e.g., cognitive, psychomotor) in the U.S. Department of Labor O*NET system. In 2013 the European Centre for the Development of Vocational Training (Cedefop) used the O*NET and European, German, Italian, and Czech skills and social surveys to generate the European skills, competencies and occupations taxonomy (ESCO) [22].
Use of O*NET and Cedefop taxonomies provides an avenue for developing physical employment standards for multiple jobs within an organization by grouping jobs with similar demands. Organizations that institute PES for multiple jobs typically want the same assessments for these jobs. Use of individual tests for each job would be inefficient and costly. Thus, they opt for an abilities approach and use the same cut scores for jobs with moderate physical demand and different cut scores for jobs with higher demand. For example, a study in the shipbuilding industry for over 30 jobs found that workers lift and carry 9.1–22.7 kg, drag heavy welding and air lines, and climb ladders and scaffolding multiple times daily [23]. Creating a master task list across all jobs as illustrated in Figure 2 allowed for use of job-specific tasks, while equating tasks with comparable physical demand. For instance, in Figure 2, if a job (e.g., Rust Machine Operator-16) has “none” recorded for a task, the task is not critical for that job. If a color is recorded, the task is critical to the job. This approach facilitated use of a physical ability taxonomy to identify the levels of the abilities for each job and classify 30+ jobs by physical demand (e.g., strength, aerobic capacity).

Example of tasks with the same movement pattern and physical demand across jobs but written to reflect specific criteria for a job. Gebhardt, DL, Baker, TA, Volpe, EK, St. Ville, KA. Job analysis of Huntington Ingalls shipbuilding jobs. Volume 1: Job analysis. Beltsville, MD: Human Performance Systems, Inc.; 2015.
In summary, job analysis provides the foundation for physical employment standards by defining the purpose and outcomes of a job, along with the worker functions, performance techniques, and equipment used to perform job tasks. Information may be gathered from incumbents, supervisors, job standard operating procedures and policies, and training materials. Although the approach for gathering job analysis information may vary by profession (e.g., I-O psychology, ergonomics), multiple standardized methods (e.g., interviews, surveys) should be used to ensure legal defensibility of the physical employment standards. The published literature outlines a variety of job analysis techniques, including job observations, questionnaire design, data analysis, and critical task identification. The goal is to identify job requirements that are critical to successful job performance and define levels of physical performance, where feasible.
The second segment of job analysis involves the identification of physical work demands by exercise physiologists, biomechanists, and ergonomists. Much of the continuing work to determine the physical demand of tasks in a variety of jobs occurred in Germany, the United Kingdom (UK), and the United States (U.S.) after World War II. This was partially due to the war effort and women working in a variety of male-dominated workplaces (e.g., munitions plants).
In the 1950s Turner [24] determined the energy costs of selected light and heavy industrial jobs that involved working with plastic and hard rubber molds. The energy expenditures for heavy jobs ranged from 6.0 kilocalories per minute (kcal·min⁻¹) for loading chemicals into a mixer to 4.6 and 3.6 kcal·min⁻¹ for straightening lead contact bars and working with hard rubber molds, respectively. During the same time frame, British researchers found that the energy expenditures of Scottish coal miners ranged from 3.8 to 7.1 kcal·min⁻¹ for tasks involving use of picks and shovels [25].
Others such as Per-Olof and Irma Åstrand contributed to this early research. Irma Åstrand et al. [16] determined the energy output for fishermen using oxygen uptake and heart rate. The demand of the tasks ranged from 2.5–5.0 kcal·min⁻¹ for handling lines, baiting lines, and steering to 10.5–14.4 kcal·min⁻¹ for pulling in nets. This research showed the average energy expenditure during work on board the ship was approximately 39% of the fishermen’s VO2max, with some activities reaching 80% of maximum oxygen uptake.
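The relationship between a task's energy cost and a worker's aerobic capacity can be illustrated with a short calculation. The following is a minimal Python sketch, assuming a rounded caloric equivalent of about 5 kcal per litre of oxygen (the true value varies with substrate use). The function names, the 80 kg body mass, and the 45 ml·kg⁻¹·min⁻¹ VO2max are hypothetical values chosen for illustration and are not data from the studies cited above.

```python
# Approximate caloric equivalent of oxygen (kcal per litre of O2).
# Assumption: a rounded textbook value, not a figure taken from the cited studies.
KCAL_PER_LITRE_O2 = 5.0


def task_vo2_l_min(kcal_per_min: float) -> float:
    """Convert an energy expenditure (kcal/min) to oxygen uptake (L/min)."""
    return kcal_per_min / KCAL_PER_LITRE_O2


def percent_of_vo2max(kcal_per_min: float,
                      body_mass_kg: float,
                      vo2max_ml_kg_min: float) -> float:
    """Express a task's oxygen cost as a percentage of a worker's VO2max."""
    # Convert L/min to ml/kg/min so it is on the same scale as VO2max.
    task_ml_kg_min = task_vo2_l_min(kcal_per_min) * 1000.0 / body_mass_kg
    return 100.0 * task_ml_kg_min / vo2max_ml_kg_min


# Hypothetical worker: 80 kg body mass, VO2max of 45 ml/kg/min,
# performing a task that costs 10.5 kcal/min.
relative_load = percent_of_vo2max(10.5, body_mass_kg=80.0, vo2max_ml_kg_min=45.0)
print(f"Relative workload: {relative_load:.1f}% of VO2max")
```

Expressing task demand relative to individual capacity in this way is what allows the same absolute workload to be described as moderate for one worker and heavy for another.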
Researchers who conducted early aerobic demand studies laid the groundwork for identifying physiological work demands. In addition, their investigations focused on assessing different quantities and rates of work in relation to an individual’s ability to safely perform a job without undue fatigue. This led to classifications of industrial work demand (Table 1) in the late 1970s [17]. Thus, physiological measures such as oxygen uptake can be combined with other job analysis information to accurately classify jobs by physical demand across a job family or a total organization.
Classification of workloads
Adapted from Åstrand PO, Rodahl K. Textbook of work physiology: Physiological basis of exercise. New York, NY: McGraw-Hill; 1977. p. 462.
On-the-job injuries triggered much of the initial work that identified physical workplace requirements that were costly to both the employer and employee. Manual materials handling jobs accounted for a high percentage of low back and other musculoskeletal injuries. As worker injuries increased, physical abilities such as strength and coordination, prevalent in most arduous jobs, became the focus of studies by industrial engineers, biomechanists, and ergonomists. In the 1970s, researchers quantified the forces required to move 4-wheel carts, lift objects, and perform other push/pull tasks using dynamometers, load cells, and force platforms to identify the strength requirements and limitations. They quantified manual materials handling factors and the impact on the musculoskeletal system. Snook and Ciriello [26] developed tables that indicated the maximum acceptable lifting weight for percentages of several male and female populations. Table 2 contains values from the Snook tables that show 75% of women in industrial jobs can lift objects weighing 10 kg at 2-minute intervals throughout the work day compared to 75% of industrial men who can lift 19 kg objects at 2-minute intervals. However, when the object weight is 25 kg, only 50% of males can lift it every two minutes from floor level to knuckle height, while 50% of women can lift 12 kg objects at the same rate and height. Although these tables are very helpful in changing the workplace requirements to fit the worker, they cannot be imposed as employment standards with different requirements for males and females due to employment statutes and the workplace requirements.
Example of maximum acceptable weight lifted by various percentages of male and female industrial workers
ᵃWidth = distance from body in cm. ᵇDistance = vertical distance of lift in cm. ᶜPercent = industrial population percentage (e.g., males) who can lift specific weight at a given frequency. ᵈNumber = weight in kg. Adapted from: Snook SH, Ciriello VM. Maximum weights and workload acceptable to female workers. Journal of Occupational Medicine. 1974;16(8):527-34. Snook SH, Irvine C, Bass SF. Maximum weights and workloads acceptable to male industrial workers. American Industrial Hygiene Association Journal. 1970;31:579-86.
Chaffin and associates investigated the impact of manual materials handling on low back pain and determined the magnitude of the compressive force on the L5/S1 discs (e.g., 650 kg) that was hazardous when lifting objects [27]. They used this information in a biomechanical model to identify variations in load and the locations in relation to the center of mass that resulted in lower L5/S1 compressive forces [28]. Figure 3 illustrates the decrease in acceptable lift weight in relation to vertical and horizontal distances from the selected body markers (e.g., ankle) and height off the floor. Chaffin and associates’ work resulted in a battery of static strength tests for industrial workers that the United States National Institute for Occupational Safety and Health (NIOSH) published for use by industry in evaluating workers’ strength capacities [29]. This document also included lifting guidelines for males and females based on the height and frequency of a lift and horizontal distance of the object from center of mass for different lift distances such as floor to knuckle, knuckle to shoulder, and shoulder to overhead reach. Ayoub, Garg, and associates expanded the NIOSH studies by developing dynamic lifting models that addressed time, force, and torque, and strength norms for men and women [30, 31].

Changes in lifting capacity related to vertical and horizontal distance from selected body markers. Note. Adapted from Work practice guide for manual lifting, by National Institute for Occupational Safety and Health (NIOSH), 1981, p. 75, Copyright 1981 by the U.S. Department of Health and Human Services.
This body of research resulted in the NIOSH Revised Lifting Equation that used variables such as object weight, hand position, vertical distance from the ankle, angle of movement, lift frequency, duration of lifts, and object coupling to evaluate whether asymmetrical lifting tasks were within acceptable ranges [32]. The equation incorporated biomechanical, physiological, and psychophysical criteria to determine whether the lift or movement is within safe parameters. Although use of this equation assists organizations in limiting weights lifted by workers and redesign of the workplace, it may not be viable for use in physical employment standards because employers cannot always modify the workplace. They can stipulate that an object weighing over 22.7 kg requires two people, but this is not always feasible.
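To illustrate how the variables listed above combine, the following is a minimal Python sketch, assuming the commonly published metric form of the Revised Lifting Equation, in which a 23 kg load constant is scaled down by horizontal, vertical, distance, asymmetry, frequency, and coupling multipliers. The function and parameter names are hypothetical. The frequency and coupling multipliers are supplied by the caller because they come from published lookup tables that are not reproduced here, and the sketch should be checked against the NIOSH documentation before any practical use.

```python
LOAD_CONSTANT_KG = 23.0  # load constant in the metric form of the equation


def recommended_weight_limit(h_cm: float, v_cm: float, d_cm: float,
                             a_deg: float, fm: float, cm: float) -> float:
    """Recommended weight limit (kg) for a single lifting task.

    h_cm:  horizontal distance of the hands from the midpoint between the ankles
    v_cm:  vertical height of the hands above the floor at the lift origin
    d_cm:  vertical travel distance of the lift
    a_deg: asymmetry (twisting) angle in degrees
    fm, cm: frequency and coupling multipliers (0 to 1) taken from lookup tables
    """
    # Horizontal multiplier: no penalty at 25 cm or closer, zero beyond 63 cm.
    hm = 0.0 if h_cm > 63 else 25.0 / max(h_cm, 25.0)
    # Vertical multiplier: largest at 75 cm hand height, zero outside 0-175 cm.
    vm = 0.0 if (v_cm < 0 or v_cm > 175) else 1.0 - 0.003 * abs(v_cm - 75.0)
    # Distance multiplier: no penalty at 25 cm travel or less, zero beyond 175 cm.
    dm = 0.0 if d_cm > 175 else 0.82 + 4.5 / max(d_cm, 25.0)
    # Asymmetry multiplier: decreases with twisting angle, zero beyond 135 degrees.
    am = 0.0 if a_deg > 135 else 1.0 - 0.0032 * a_deg
    return LOAD_CONSTANT_KG * hm * vm * dm * am * fm * cm


def lifting_index(load_kg: float, rwl_kg: float) -> float:
    """Ratio of the actual load to the recommended weight limit."""
    return float("inf") if rwl_kg == 0 else load_kg / rwl_kg


# Hypothetical task: a 22.7 kg object held 50 cm in front of the ankles.
rwl = recommended_weight_limit(h_cm=50, v_cm=75, d_cm=25, a_deg=0, fm=1.0, cm=1.0)
print(f"RWL = {rwl:.1f} kg, lifting index = {lifting_index(22.7, rwl):.2f}")
```

Under ideal geometry every multiplier equals 1 and the limit equals the load constant; a lifting index above 1.0 indicates a load heavier than the recommended limit for that task.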
The European Union legislated a Council Directive in 1990 (90/269/EEC of 29 May 1990) to reduce the risk of back injuries (fourth individual Directive within the meaning of Article 16 (1) of Directive 89/391/EEC) and provide minimum health and safety requirements for manual materials handling [33]. The directive stipulated that employers shall use mechanical equipment when at all possible to avoid the need for manual materials handling by workers. The European approach was more directive to employers than the U.S. approach. Although these government bodies provided guidelines that would hopefully reduce injuries, the work setting does not always allow for changes in worker dynamics. For example, in the shipbuilding industry large 500 to 3,000 metric ton gantry and tower cranes lift large sections of a ship into place. However, riggers lift and move shackles, chain falls, come-a-longs, slings, and chains weighing 13.6 kg to 73.0 kg when rigging a ship section to a crane [20]. Although the technology has eliminated some of the arduous tasks, heavy lifting, pushing, and pulling tasks remain present in the workplace. This fact is seen in recent research in the Netherlands that addressed the sequence of bricklaying and how to implement ergonomic measures for effective task performance [34], which was similar to Gilbreth’s bricklaying research [3].
In summary, the physiological, biomechanical, and ergonomic parameters provide detailed information related to critical job tasks and overall work demands and have been used to increase productivity and reduce some of the physical demand in the workplace. As instrumentation to measure these parameters advanced, reanalysis of job tasks has expanded our knowledge of work demands. Although these studies add to our knowledge base, generating employment tests and physical employment standards that evaluate an individual’s aerobic and strength capabilities posed more challenges.
In the mid to late 1970s physical performance testing became more prevalent in employment selection and resulted in new employment laws, guidelines, and litigation. The military, fire service, and law enforcement agencies in Australia, Canada, the U.S., and several European countries were the predominant organizations using physical assessments to determine whether an individual was qualified for arduous jobs. Employers used two types of tests to assess applicants’ physical capabilities in relation to job demands. These were basic ability and job simulation assessments, which remain in use today. Basic ability tests evaluate a single physical ability or construct associated with performance of job tasks such as muscular strength, muscular endurance, aerobic capacity, anaerobic power, flexibility, equilibrium, and coordination [35]. This type of test has three advantages: (a) it assesses individual abilities, (b) it can be used for multiple jobs, and (c) it is practical when space is limited or the test must be transported to multiple locations. Alternatively, job simulations include essential components of the job and can include tools and objects used by workers but cannot include actions or actual tasks that would be learned during training or on the job (e.g., handcuffing). The advantages of job simulations include a resemblance to the job and the ability to develop the test directly from job analysis data.
When developing or selecting basic ability or job simulation assessments, there are several parameters one should consider. The first parameter addresses statistical properties and includes reliability, validity, and adverse impact. The reliability of basic ability tests such as arm lift, dynamic lift, 300-meter run, and beep test ranged from 0.40 to 0.95 [35, 36], while the reliability of job simulations (e.g., pursuit run, carton lift) ranged from 0.50 to 0.91 [35, 37]. In studies that compared both basic ability and job simulation physical tests to a job performance measure (criterion measure) such as picking products for an order in a warehouse or pursuing and handcuffing a perpetrator, basic ability tests had predominantly higher validities (0.02–0.81) than job simulations (0.37–0.63) [35]. However, some of the low validities occurred for basic ability tests associated with measures of flexibility and equilibrium.
Adverse impact in physical assessment typically occurs in relation to male-female differences. Some test developers state that job simulations have less adverse impact than basic ability tests. However, recent research demonstrated that basic ability tests and job simulations have comparable levels of adverse impact. A meta-analysis coded physical tests based on Gebhardt and Baker’s [35, 37] classification approach to investigate sex differences and adverse impact [38]. This study found basic ability tests involving muscular strength (e.g., grip strength, push-ups, shuttle run) had slightly less adverse impact on women (δ = 1.60) than job simulations (e.g., hose drag, casualty transportation) (δ = 1.94), where δ is the weighted effect size for the sample size corrected for measurement and sampling errors in the criterion [38]. Conversely, the level of adverse impact for cardiovascular assessments was similar for basic ability tests (e.g., treadmill, step test) (δ = 1.87) and job simulations (e.g., emergency response circuit, shoveling) (δ = 1.93). A large-scale male study (n = 50,000+) that investigated ethnic differences between basic ability tests and job simulations found that White males tended to perform better than African Americans on both basic ability and job simulation tests that involved continuous movement. However, African Americans’ and Whites’ performance was similar for basic ability tests and job simulations involving muscular strength [39]. These studies demonstrated that basic ability tests involving muscular strength and muscular endurance may have less adverse impact on women than job simulations, while basic ability tests and job simulations of cardiovascular endurance had a similar level of adverse impact. Further, there are racial differences for tests involving continuous movement.
The fourth parameter centers on practical issues such as cost, logistics, test administration, and scoring paradigms. Considerations for basic ability tests focus on the cost of test equipment and who will administer the tests. For job simulations the issues involve obtaining a test location that is not cost prohibitive, ensuring that task simulations do not include trainable skills, constructing the test to allow for set-up in multiple locations, and generating scoring procedures that reflect minimum job performance and individual differences.
In addition, environmental parameters such as temperature, protective clothing, and work location can affect physical test composition and cut scores. Heat and cold stress occur in outdoor and indoor physical jobs and can be exacerbated by protective equipment worn by workers. High temperatures and humidity in the workplace result in longer times to complete tasks, heat stress and illness, and mortality [40]. Further, the use of protective clothing increases metabolic rate in relation to its thickness and number of layers, which can escalate heat stress [41, 42]. Cold environments affect the respiratory system and can lead to pain in the extremities and musculoskeletal and tissue injuries, along with decreased mobility and manual dexterity [41]. Thus, to ensure that the assessment accurately reflects the job demands, clothing and equipment worn on the job may be incorporated into the physical test.
Workplace location can affect the demand of a job. For example, Fig. 4 shows that airport security personnel at larger airports (i.e., Cat X, Cat 1) handle a greater quantity of heavy baggage than at smaller airports (i.e., Cat 2, Cat 3) [43]. Thus, some locations may require different employment standards.

Effect of airport size in relation to size and quantity of baggage handled. Cat X and Cat 1 = large airports; Cat 2 and Cat 3 = smaller airports. Whetzel DL, Gebhardt DL, Baker TA, Erk RT, Fleisher MS, Volpe EK, St. Ville KA, Oliver JT, Geimer JL, Chang T. Job analysis of transportation security officer job series. Alexandria, VA: Human Resources Research Organization; 2012.
After the passage of the Civil Rights Act of 1964 in the U.S., two landmark cases related to employment discrimination (Griggs v. Duke Power, 1970; Albemarle Paper Co. v. Moody, 1975) led the United States’ Equal Employment Opportunity Commission (EEOC) to publish the Uniform Guidelines on Employee Selection Procedures, which established standards for applicant selection procedures, addressed adverse impact, and prohibited employment discrimination based on race, color, religion, sex, or national origin [44]. This document had a profound effect on the drafting of employment requirements in other countries such as Canada, the UK, and Australia from 1988 to 2010 [45, 46]. For example, the Canadian Psychological Association adopted the Uniform Guidelines in the development of its principles and policies for employment practices, which influenced the Canadian human rights codes and commissions in multiple provinces, such as the Ontario Human Rights Commission [46, 47]. Likewise, countries such as South Africa (South African Employment Act 1988) adopted the Uniform Guidelines validity section, while the UK, in concert with European Discrimination law (1990), used the Uniform Guidelines premises and expanded them to include age, gender reassignment, disability, marriage, pregnancy, and sexual orientation [33] and later combined separate employment statutes to form the British Equality Act of 2010 [48]. However, Australia enacted separate statutes to address employment discrimination (Racial Discrimination Act 1975, Sex Discrimination Act 1984, Disability Discrimination Act 1992, Age Discrimination Act 2004) [46]. The employment statutes and guidelines across these countries applied to all assessments (e.g., physical tests, interviews, job evaluations) and employment decisions including selection, promotion, retention, and training.
Besides addressing development of assessment procedures (e.g., job analysis, validation approaches), these guidelines and statutes focused on methods for assessing adverse impact and the obligation of the employer and/or test developer to reduce adverse impact. The most common method to assess whether adverse impact exists against a protected group (e.g., race, sex, ethnic group) is the 4/5ths or 80% rule, which indicates adverse impact is present if the minority (protected) group’s passing rate is less than 80% of the majority group’s passing rate [44]. In physical testing the minority group of concern is typically women. Other methods should be used to confirm the 4/5ths rule result, which can be affected by sample size. One method is the Standard Deviation or Z test, which investigates whether differences in passing rates are due to chance at a probability value of 0.05 [49]. A difference of 2 standard deviations indicates adverse impact when comparing the expected number of passes to the actual number. A second method is Fisher’s Exact Test, which calculates all 2 x 2 combinations to determine whether differences in passing rates are due to chance [49]. Adverse impact is present if the probability value is significant at the 0.05 level.
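The two confirmatory tests can be illustrated with a minimal sketch. It assumes a pooled two-proportion form of the Z test and a two-sided Fisher’s Exact Test computed from the hypergeometric distribution; the cited sources may use a different formulation (e.g., a binomial standard deviation analysis). The function names and the applicant counts are hypothetical and serve only as an illustration.

```python
from math import comb, erf, sqrt


def two_proportion_z(min_pass, min_total, maj_pass, maj_total):
    """Pooled two-proportion Z statistic and two-sided p-value (assumed form)."""
    p_min = min_pass / min_total
    p_maj = maj_pass / maj_total
    pooled = (min_pass + maj_pass) / (min_total + maj_total)
    se = sqrt(pooled * (1 - pooled) * (1 / min_total + 1 / maj_total))
    z = (p_min - p_maj) / se
    # Standard normal CDF via the error function.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


def fisher_exact_two_sided(min_pass, min_total, maj_pass, maj_total):
    """Two-sided Fisher's Exact Test p-value for a 2 x 2 pass/fail table."""
    total_pass = min_pass + maj_pass
    n = min_total + maj_total

    def prob(k):
        # Hypergeometric probability that k of the passes fall in the minority group.
        return comb(min_total, k) * comb(maj_total, total_pass - k) / comb(n, total_pass)

    observed = prob(min_pass)
    lowest = max(0, total_pass - maj_total)
    highest = min(min_total, total_pass)
    # Sum all tables that are as likely as, or less likely than, the observed table.
    return sum(p for p in (prob(k) for k in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# Hypothetical applicant pool: 70 of 100 women and 182 of 200 men pass.
z, p_z = two_proportion_z(70, 100, 182, 200)
p_fisher = fisher_exact_two_sided(70, 100, 182, 200)
print(f"Z = {z:.2f}, p = {p_z:.4f}; Fisher p = {p_fisher:.4f}")
```

In this sketch a Z value beyond roughly 2 standard deviations in absolute terms, or a Fisher probability below 0.05, would be read as confirming adverse impact.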
Impact of case law
Although most organizations and test developers follow the criteria for developing a valid assessment, litigation related to physical employment standards abounds, and the U.S. has more such litigation than other countries. The UK and Australia have much lower rates of assessment-related litigation, while Canada has had seminal legal cases that shaped its human rights laws [46].
In Berkman v. City of New York [50], Brenda Berkman failed a physical test in 1978 for a firefighter position. Using the 4/5ths rule, the court found the test discriminated against women and ordered that a new test be developed. Berkman and other women passed the new test and were hired by the city. This case set a precedent for all physical employment standards in the U.S., emphasizing that the test must reflect job standards. In Canada, the Meiorin case shaped segments of future employment decisions [51]. Tawney Meiorin was employed as a contract firefighter in British Columbia in 1989 and hired by the British Columbia government in 1992. In 1994 she failed the physical employment test and was terminated from her job. The Labour Arbitration Board ruled in favor of Meiorin, but the Court of Appeal overturned this decision. Subsequently, the Canadian Supreme Court overturned the appeals court decision, reinstated Meiorin, and issued a three-part test to evaluate whether a discriminatory standard is an occupational requirement. The three-part test stated that “(a) the standard was adopted for a purpose that is rationally connected to job performance; (b) the particular standard was adopted in an honest and good faith belief that it was necessary to the fulfilment of that legitimate work-related purpose; and (c) the standard is reasonably necessary to the accomplishment of that legitimate work-related purpose” (impossible to accommodate without undue employer hardship) [51]. These two cases set legal precedent in relation to physical employment standards. More detailed examinations of past legal cases in physical testing are in reviews by Gebhardt and Baker [35] and Hogan and Quigley [52].
Recent litigation in the U.S. addressed one of the current issues in physical employment standards, which is the use of a single cut score or gender-normed scores. In Bauer v. Holder [53] a trainee in the Federal Bureau of Investigation (FBI) academy challenged the use of gender-normed physical standards as a graduation requirement from the academy (30 push-ups for men, 14 push-ups for women). The district court found for the plaintiff and stated that gender-normed tests were discriminatory because female law enforcement personnel perform the same physical tasks as their male counterparts. During the appeals process, the FBI stressed that the assessments were fitness tests, and the Court of Appeals (4th Circuit) upheld gender-normed standards as a novel issue but did not address whether the level of physical performance was a bona fide occupational qualification [54]. In a third review the District Court upheld the gender-normed standard from the Court of Appeals [55]. Although this is only one of a limited number of gender-normed cases, it is important to note that other litigation upheld single cut scores and that this ruling stated that the male and female physical employment standards must be equal. To date, gender-normed standards have been relevant only to law enforcement positions using basic ability tests. Private sector industries and fire service organizations use physical assessments with single cut scores.
Identification of cut scores
For the past 50 years, cut scores or performance standards have been used to identify individuals who can perform or be trained to perform the essential job tasks or a segment of a job. A variety of methods have been used to determine cut scores that are reasonable, useful, and consistent with acceptable job performance. The methods used depend upon the type of validity data available (e.g., content validity, criterion-related validity)¹ and range from expert judgment to comparison of test and job performance data. In 1939 Taylor and Russell developed a method to estimate the percentage of new employees who would perform a job successfully and identify a cut score based on a validity coefficient, the number of applicants needed to fill vacant positions, and the percentage of current employees who perform the job successfully [56]. This method used data from a criterion-related validity study and employer hiring needs. In the 1990s a judgmental cut score approach, bookmarking, was introduced in which subject matter experts identify the test score at which an individual has a specified likelihood of being successful (e.g., probability of 0.67). However, relying on judgment and history of past performance did not account for changes in applicant populations or job demands. Empirical methods such as expectancy tables, contingency tables, ergonomic data, pass/fail tables, and Pareto analysis use validity (e.g., test and job performance measures), job analysis, and adverse impact data to identify cut scores [56].
Expectancy tables show the percentage of individuals meeting or exceeding a specific score point, their expected level of job performance, and differences in job performance across test scores. For example, for a test score of 72 the table would show that 90% of test takers met or exceeded this score and that these test takers had a mean job performance score of 24. For a score of 78, 80% of the test takers met or exceeded this score and had a mean job performance score of 29. This 5-point jump in job performance suggests that an applicant with a score of 78 or higher would have better job performance than one with a score of 72. Test scores with larger increases in job performance point to potential cut scores.
Contingency tables have been used to determine cut score accuracy by determining the percentage of correct (true passes, true failures) and incorrect (false passes, false failures) decisions [56]. Combined with an expectancy table, one can ascertain whether an increase in job performance with a specific test score coincides with an acceptable level of correct decisions as shown in the formula below for a sample of 165 individuals with 143 true passes, 8 true failures, 7 false passes, and 7 false failures.
Correct decisions (%) = (true passes + true failures) / total sample × 100 = (143 + 8) / 165 × 100 = 91.5%
Thus, if the expectancy table showed a marked increase in job performance at a specific score (e.g., 78 from above example) with a high level of correct decisions (e.g., 91.5%), then that score would be selected as a cut score.
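The contingency-table check described above can be sketched in a few lines. This is a minimal illustration, not the procedure from the cited source: the function names, the record layout of (test score, job performance) pairs, and the minimum acceptable performance level are all assumptions introduced here.

```python
def classify(records, cut_score, min_performance):
    """Count correct and incorrect decisions for a candidate cut score.

    records: iterable of (test_score, job_performance) pairs (hypothetical layout).
    """
    counts = {"true_pass": 0, "true_fail": 0, "false_pass": 0, "false_fail": 0}
    for test_score, job_performance in records:
        passed = test_score >= cut_score
        performs = job_performance >= min_performance
        if passed and performs:
            counts["true_pass"] += 1
        elif not passed and not performs:
            counts["true_fail"] += 1
        elif passed and not performs:
            counts["false_pass"] += 1
        else:
            # Failed the test but performs the job acceptably.
            counts["false_fail"] += 1
    return counts


def decision_accuracy(true_pass, true_fail, false_pass, false_fail):
    """Proportion of correct decisions: (true passes + true failures) / total."""
    total = true_pass + true_fail + false_pass + false_fail
    return (true_pass + true_fail) / total


# Counts from the example in the text: 143, 8, 7, and 7 in a sample of 165.
accuracy = decision_accuracy(143, 8, 7, 7)
print(f"{accuracy:.1%}")
```

Running `classify` over several candidate cut scores and comparing the resulting accuracies is one way to pair this check with an expectancy table.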
With the increased level of scrutiny in physical testing as seen in the litigation starting in the 1970s, additional cut score assessments have evolved. One common approach is to generate pass/fail tables that evaluate the impact of potential cut scores on protected groups by showing the passing rate of a minority group (e.g., women) in relation to the majority group (e.g., men). This approach hinges on the 4/5ths rule for evaluating adverse impact. For example, if 91% of the majority group and 74% of the minority group achieve a selected cut score (e.g., 78 from the above example), there is no adverse impact (74%/91% = 0.81) since this value (0.81) is greater than 4/5 or 0.80. However, if the minority group pass rate is 70%, adverse impact is present (70%/91% = 0.77) since this value is below 0.80.
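A pass/fail table of this kind can be generated with a short sketch such as the following. The score lists and function names are hypothetical and chosen only to show the mechanics; the 0.80 threshold is the 4/5ths rule described above.

```python
def impact_ratio(minority_rate, majority_rate):
    """Minority pass rate divided by majority pass rate."""
    return minority_rate / majority_rate


def pass_rate(scores, cut_score):
    """Proportion of scores at or above the cut score."""
    return sum(s >= cut_score for s in scores) / len(scores)


def pass_fail_table(minority_scores, majority_scores, cut_scores, threshold=0.80):
    """One row per candidate cut score with pass rates and the 4/5ths flag."""
    rows = []
    for cut in cut_scores:
        min_rate = pass_rate(minority_scores, cut)
        maj_rate = pass_rate(majority_scores, cut)
        # If no majority member passes, the ratio is undefined (reported as NaN,
        # which is never flagged as adverse impact in this sketch).
        ratio = impact_ratio(min_rate, maj_rate) if maj_rate > 0 else float("nan")
        rows.append({
            "cut_score": cut,
            "minority_rate": min_rate,
            "majority_rate": maj_rate,
            "impact_ratio": ratio,
            "adverse_impact": ratio < threshold,
        })
    return rows


# Hypothetical test scores for two groups of ten applicants each.
women = [70, 72, 75, 78, 79, 80, 82, 85, 88, 90]
men = [74, 78, 79, 80, 81, 83, 85, 86, 90, 92]
for row in pass_fail_table(women, men, cut_scores=[72, 78, 80]):
    print(row)
```

With samples this small the ratio is unstable, which is why the text recommends confirming the 4/5ths result with a statistical test.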
As far back as the early 1900s researchers such as Sargent advocated use of multiple physical tests to evaluate individuals [6]. Today physical test batteries typically consist of three or more assessments and can be scored individually (multiple hurdle approach) or as a composite (compensatory approach). When using a compensatory approach, the tests can be unit weighted, with each test contributing equally to the overall score, or weighted based on statistical analysis (e.g., regression). If adverse impact is present, Pareto analysis provides a method to investigate the changes in job performance and diversity using optimal weighting factors [57]. The Pareto weighting approach optimizes two variables such as job performance and diversity at the same time to locate the optimal weighting factors for the test battery components that yield the greatest job performance and increase in diversity. This analysis requires test and job performance data from a minimum sample size of 100 and was shown to provide better predicted job performance with a decrease in adverse impact [57]. The Pareto-optimal weighting occurs at the point where one variable (e.g., job performance) cannot be improved without a worse outcome for the second variable (e.g., adverse impact). This statistical analysis has potential to reduce adverse impact in physical testing. Greater detail on Pareto analysis and other methods to set cut scores is found in articles by Song et al. [57] and Gebhardt [56].
In summary, no single best approach exists to set cut scores and human judgment is involved in all methods. The soundest approach for setting legally defensible cut scores involves integrating multiple methods and sources of information that lead to a preponderance of evidence that a cut score is useful (e.g., predicts job performance of new hires) and fair.
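The notion of a Pareto-optimal weighting can be illustrated with a simple non-dominance filter. This sketch is not the optimization procedure of Song et al. [57]; it only assumes that several candidate weightings have already been evaluated on two outcomes, both to be maximized. The candidate labels, weights, and outcome values below are hypothetical.

```python
def pareto_front(candidates):
    """Return candidates that no other candidate dominates.

    A candidate is dominated when another is at least as good on both
    outcomes (predicted performance, impact ratio) and strictly better on one.
    """
    front = []
    for c in candidates:
        dominated = any(
            o["performance"] >= c["performance"]
            and o["impact_ratio"] >= c["impact_ratio"]
            and (o["performance"] > c["performance"]
                 or o["impact_ratio"] > c["impact_ratio"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front


# Hypothetical weightings of a three-test battery with illustrative outcomes.
candidates = [
    {"name": "A", "weights": (1.0, 0.0, 0.0), "performance": 0.45, "impact_ratio": 0.62},
    {"name": "B", "weights": (0.6, 0.2, 0.2), "performance": 0.43, "impact_ratio": 0.74},
    {"name": "C", "weights": (0.4, 0.3, 0.3), "performance": 0.40, "impact_ratio": 0.81},
    {"name": "D", "weights": (0.33, 0.33, 0.34), "performance": 0.38, "impact_ratio": 0.79},
    {"name": "E", "weights": (0.2, 0.4, 0.4), "performance": 0.33, "impact_ratio": 0.86},
    {"name": "F", "weights": (0.5, 0.5, 0.0), "performance": 0.41, "impact_ratio": 0.70},
]
print([c["name"] for c in pareto_front(candidates)])
```

Each weighting on the resulting front represents a different trade-off; choosing among them remains a matter of judgment, consistent with the point that no method removes human judgment from cut score setting.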
Benefits of physical employment standards
Physical tests have existed for a long time, but it was only in the late 1970s that a greater focus was placed on the validity of the tests in employment settings. Past research identified the demands of job tasks, and organizations implemented physical assessments to select workers who could safely and effectively perform arduous job tasks with a minimal risk of injury. Due to the proprietary nature of personnel selection research, many of the studies were not published. However, there are published studies related to use of pre-employment physical tests in personnel selection and their efficacy in terms of injury reduction, decreased workers’ compensation costs, improved productivity, and increased profit margins.
Arnold, Rauschenberger, Soubel, and Guion [58] developed and validated a strength test battery that exhibited high correlations between muscular endurance tests and a simulation of steel worker job tasks. They implemented the test to select steel workers and after a 6-month period found that work productivity doubled for workers hired using the physical test, which equated to increased productivity for an individual worker of $5,000 in 1982 dollars or $13,113 in 2018 dollars.
Baker and Gebhardt [59] validated a test battery that included muscular strength and muscular endurance tests for selection of railroad train service workers. After implementing the physical test to select train service workers, the railroad acquired another railroad that serviced the same geographical areas. Thus, injury data were available for one railroad that used a physical test for selection and one that did not. To determine the effectiveness of the test battery, they conducted prospective utility analyses that included days lost from work, restricted duty days, gross settlement costs, legal expenses, and administrative costs [60]. Data for the utility analysis were obtained for new hires in the original and acquired railroads for a 5-year period. Table 3 shows that 648 of the original railroad’s new hires (test group; n = 12,714) sustained injuries, while 3,898 of the acquired railroad’s new hires (no test group) were injured during the same 5-year period. Controlling statistically for age, job tenure, and year injured (ANCOVA), these results showed injury costs and days lost from work were significantly lower (p < .01) for workers tested prior to entry into the job than for workers hired without a pre-employment physical test. The increased cost to replace a single worker in the acquired railroad (no test group) for the additional lost days (142.1 - 79.1 = 63 days) compared to the original railroad would be $17,438 in 2018 dollars at a wage of $34.60 per hour ($10,574 in 1995 dollars). Thus, substantial savings were achieved by screening applicants for the train service job.
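The replacement cost figure can be reproduced with simple arithmetic. The sketch below assumes an 8-hour workday, which is not stated in the text but yields the reported 2018 value; the function name is hypothetical.

```python
def added_lost_day_cost(days_no_test, days_test, hourly_wage, hours_per_day=8):
    """Wage cost of the additional lost days per worker.

    hours_per_day=8 is an assumption; the source does not state the workday length.
    """
    extra_days = days_no_test - days_test
    return extra_days * hours_per_day * hourly_wage


# Lost days and hourly wage as reported in the text.
cost_2018 = added_lost_day_cost(142.1, 79.1, 34.60)
print(round(cost_2018))
```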
Cost and injury reductions in railroad industry with use of physical employment testing
a: Sample estimated from total workers due to lack of accurate hiring data. b: p < 0.01.
Anderson and Briggs [60] showed that workers in manual materials handling jobs who passed a physical selection test had a 47% lower injury rate and a 21% higher retention rate. Legge [61] implemented a functional capacity assessment for security personnel prior to entry into annual defensive tactics training. This testing and remedial training reduced annual injury costs from $187,000 to almost zero over a 2-year period. Knapik et al. [62] analyzed injuries in a law enforcement academy over a 6-year period and found higher injury rates for recruits with lower scores on a physical test battery, with most injuries associated with defensive tactics and fitness training. As is evident, use of physical employment standards has the following benefits: (a) decreased injury risk, (b) decreased cost to the employer, (c) improved productivity, and (d) increased profit margin.
The history of physical assessment and employment standards demonstrates that arduous jobs remain in the workplace today. Approximately 28% of the workforce performs jobs with moderate to heavy physical demand [1]. Thus, individuals with the capabilities to perform arduous jobs in an effective and safe manner are needed to ensure productivity and injury reduction in industrial, law enforcement, fire and rescue, and military settings. PES meet the need to hire workers who can perform job tasks effectively and safely. These standards and the accompanying physical assessments provide valid predictions related to performance of arduous job tasks [23, 38]. Further, use of physical assessments in the selection setting resulted in employer benefits ranging from reduction in lost work time, injuries, and turnover to increases in productivity [35, 58–61].
The challenges for the future remain like those of the past. PES must be job related and cut scores must be reasonable in relation to the demands of the job. Ensuring the validity of the physical tests and cut scores will help avoid litigation, as will further research into methods to optimize test utility while decreasing adverse impact (e.g., Pareto-optimization). As more women enter physical jobs, we can increase our knowledge base related to their job and test performance. Continued efforts to demonstrate the utility of physical assessments and return on investment in terms of increased productivity and decreased costs related to injuries, lost time from work, and turnover will entice employers to adopt PES in their organizations.
Conflict of interest
None to report.
Footnotes
Content validity shows the assessment is a representative sample of significant parts of the job as obtained in the job analysis. Construct validity involves identifying an ability or trait that underpins successful job performance; the selection procedure measures the candidate’s level of a characteristic/ability that is important to job success. Criterion-related validity demonstrates a statistical relationship between test scores and measures of job performance.
