Abstract

In this Statistical Sidebar, I will explore the concepts of validity and reliability. In the article “Validating the O&M VISSIT: Determining appropriate service intensity,” Pogrund, Darst, and Munro examine one particular tool that can be used to quantify the level of service an individual student with a visual impairment should receive from a vision professional. When trying to show that an assessment, data collection, or service delivery tool is useful, researchers generally make that argument by appealing to validity and reliability. A valid tool is one that does what it is supposed to do or measures what it is supposed to measure. A reliable tool is one that does so consistently: when administered repeatedly under the same circumstances, it gives the same result.
These two characteristics are analogous to accuracy and precision. When repeatedly throwing darts at a dartboard to get a bullseye, you can be accurate, precise, both, or neither. “Accurate and precise” means that every dart hits the bullseye. “Accurate but not precise” means the darts are scattered but, on average, group around the bullseye. “Precise but not accurate” means the darts all hit the same mark, but that mark is not the bullseye. “Neither accurate nor precise” means the darts land all over the board in no discernible pattern, with some missing the board entirely. In this analogy, accuracy is like validity and precision is like reliability.
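The dartboard analogy can be made concrete with a small sketch. The dart coordinates below are hypothetical, and the two summary numbers are only one reasonable way to quantify the idea: "bias" is the distance from the average dart position to the bullseye (accuracy), and "spread" is the root-mean-square distance of the darts from their own average position (precision).

```python
import math

def bias_and_spread(darts, target=(0.0, 0.0)):
    """Return (bias, spread) for a list of (x, y) dart positions.

    bias   : distance from the darts' centroid to the target (low = accurate)
    spread : RMS distance of the darts from their own centroid (low = precise)
    """
    n = len(darts)
    cx = sum(x for x, _ in darts) / n
    cy = sum(y for _, y in darts) / n
    bias = math.hypot(cx - target[0], cy - target[1])
    spread = math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in darts) / n)
    return bias, spread

# Hypothetical throws, with the bullseye at (0, 0).
accurate_not_precise = [(3, 0), (-3, 0), (0, 3), (0, -3)]
precise_not_accurate = [(5.0, 5.1), (5.1, 5.0), (4.9, 5.0), (5.0, 4.9)]

for label, darts in [("accurate, not precise", accurate_not_precise),
                     ("precise, not accurate", precise_not_accurate)]:
    bias, spread = bias_and_spread(darts)
    print(f"{label}: bias={bias:.2f}, spread={spread:.2f}")
```

In this sketch the first set of throws has essentially no bias but a large spread, while the second has a small spread but a large bias, mirroring a tool that is valid but unreliable versus one that is reliable but invalid.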
Both validity and reliability can be evaluated or measured in different ways. Construct validity concerns whether practical tests derived from a theory actually measure the construct that the theory defines. It is important when you are trying to measure something that cannot be directly observed, like intelligence. In that case, you must construct a theory of how the target variable would interact with other variables and what you could measure that would indicate the target variable. Construct validity can be evaluated using a multitrait-multimethod matrix, factor analysis, and structural equation modeling, among other statistical approaches. Content validity is generally evaluated not with a statistical test but with a judgment of whether the tool or test covers a representative sample of the target domain it is intended to measure. This is often accomplished by having subject matter experts evaluate test items against testing goals. Face validity is essentially a judgment of whether the test or tool appears to do what it is supposed to do. Criterion validity compares the results of one tool or test with the results from a different measure that is already known to be valid. There are other nuances to defining validity, as well as a range of types of validity that concern how well research studies are designed, but those are beyond the scope of this column.
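As an illustration of criterion validity, one common approach (assumed here, not taken from the article under discussion) is to correlate scores from the new tool with scores from an established measure for the same individuals. The sketch below computes a Pearson correlation by hand; the score lists are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for the same five individuals on two measures.
new_tool_scores = [10, 12, 15, 18, 20]
established_scores = [11, 13, 14, 19, 21]

r = pearson_r(new_tool_scores, established_scores)
print(f"criterion correlation r = {r:.3f}")
```

A correlation near 1 would suggest the new tool ranks individuals much as the established measure does. How high is "high enough" depends on the field and the purpose of the tool.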
In the article by Pogrund, Darst, and Munro, the authors examined several forms of validity. A national recruiting effort led to 52 O&M specialists submitting a completed O&M VISSIT, along with a completed survey about their experience with the tool. Because the tool being evaluated was designed to determine service delivery levels, any practicing O&M specialist was qualified to evaluate how well it performed. Relying on the evaluation of these experts allowed for a determination of social validity (how acceptable the tool was) and consequential validity (how closely the tool operated to its intended use).
The authors also assessed the O&M VISSIT's validity using more statistical approaches. For content validity, they calculated the item content validity index, in which each item in a tool or test is evaluated against the underlying construct. Construct validity was assessed using exploratory factor analysis (EFA). This analysis showed that all of the items in the O&M VISSIT linked to seven of the eight skill areas covered in the tool and further linked to one underlying construct, which was student need for services.
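For readers unfamiliar with the item content validity index (I-CVI), a minimal sketch follows. It assumes the commonly used formulation in which experts rate each item's relevance on a 4-point scale and the I-CVI is the proportion of experts giving a rating of 3 or 4. The article's exact procedure is not described here, and the item names and ratings below are hypothetical.

```python
def item_cvi(ratings, relevant_threshold=3):
    """Proportion of expert ratings at or above the relevance threshold."""
    if not ratings:
        raise ValueError("need at least one expert rating")
    relevant = sum(1 for r in ratings if r >= relevant_threshold)
    return relevant / len(ratings)

# Hypothetical relevance ratings (1 = not relevant, 4 = highly relevant)
# from five experts for two illustrative items.
ratings_by_item = {
    "item_1": [4, 4, 3, 4, 3],
    "item_2": [4, 2, 3, 4, 1],
}

cvi_by_item = {name: item_cvi(r) for name, r in ratings_by_item.items()}
for name, value in cvi_by_item.items():
    print(f"{name}: I-CVI = {value:.2f}")
```

Items with a low index are candidates for revision or removal, since few experts judged them relevant to the construct.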
Reliability can also be evaluated in several ways, since there are different kinds of reliability. The reliability of a tool or test is generally evaluated using test-retest reliability, inter-method reliability, or internal consistency reliability. Test-retest reliability evaluates how closely scores align for two administrations of the test or tool at different times but under the same conditions; inter-method reliability evaluates how consistently two versions of a given test or tool give similar results; and internal consistency reliability looks at how consistent the different items in a test are with one another. In the article under discussion from this issue, the authors used Cronbach's alpha as a measure of how consistently each item measured the single construct identified in the EFA. The calculated Cronbach's alpha was 0.901, indicating that across the items in the O&M VISSIT, the construct underlying the tool was being measured reliably.
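Cronbach's alpha can be computed from item-level scores with the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. The sketch below applies that formula using sample variances. The scores are hypothetical and are not data from the O&M VISSIT study.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    if len(scores) < 2:
        raise ValueError("need at least two respondents")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("each respondent needs the same number (>= 2) of items")
    # Transpose so each tuple holds all respondents' scores for one item.
    items = list(zip(*scores))
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

# Hypothetical scores: five respondents (rows) on three items (columns).
example_scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 5],
    [3, 3, 4],
]

alpha = cronbach_alpha(example_scores)
print(f"Cronbach's alpha = {alpha:.3f}")
```

Values closer to 1 indicate that the items vary together, which is what one would expect if they all reflect a single underlying construct.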
It is instructive that the authors of this article conducted different analyses to look at both the validity and the reliability of the O&M VISSIT. As in the dartboard analogy, just because something is reliable does not mean it is valid, since a test could consistently give the wrong answer. Similarly, just because a test validly measures what it purports to measure does not mean that it will always give the same result. Evaluating both validity and reliability makes for a stronger assessment of a tool or test, and doing so in different ways makes for a more robust evaluation.
