Abstract

In this issue, the article entitled “Psychometric properties of WISC-IV verbal scales: A study of students in China who are blind” offers a number of interesting topics ripe for discussion. I would like to touch on several of them that are somewhat connected, the discussion of which might aid readers in understanding this and other articles that seek to quantify the human condition. First off, these authors use the Wechsler Intelligence Scale for Children, Fourth Edition (WISC-IV) to assess mental characteristics of children. This approach is nothing new, since it is the purpose the WISC-IV was created to serve. The WISC-IV is an example of a well-tested, well-used, standardized assessment of certain aspects of cognitive function or performance in children. The WISC has been in use since 1949, with successive editions being updated and re-normed to ensure that the assessment remains relevant and applicable as time goes on. The fifth edition was released in 2014.
The WISC is an example of a test that is meant to be applicable to any child who fits the criteria of the test. As such, it is important that such tests be “standardized” against a given target population. To accomplish this task, a large number of individuals (children, in this case) are given the test and a bunch of fancy statistics are run on the results to ensure that children are performing on the test in the way that is expected. For example, the range of scores should be neither too large nor too small, and all of the test items should contribute as intended to the different subscores within the larger test. Once a large enough sample of children has been tested, often numbering in the thousands, it provides a picture of how the results will pan out across the larger target population (e.g., all children aged 6–16 years). Thus, when any child of a given age is tested, that individual child’s score can be compared to what would be expected of children of the same age (the “standard”). These standardized scores are useful in that they allow an individual’s test score to be placed in context. However, this comparison is only valid as long as the child being tested fits into the group of children that was used when standardizing the test. The authors of this article are trying to see whether the results of WISC-IV testing with children in China who are blind fit with what would be predicted from the standardized scores of the WISC-IV.
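For readers who like to see the arithmetic, here is a minimal sketch of how a raw score can be placed in the context of a norming sample. It is a simplification and not the WISC-IV's actual scoring procedure, which relies on published age-based norm tables. The function name `standard_score`, the tiny sample, and the raw scores are all hypothetical; the scale with a mean of 100 and a standard deviation of 15 is assumed here because it is commonly used for composite scores.

```python
from statistics import mean, stdev


def standard_score(raw, norm_sample, scale_mean=100.0, scale_sd=15.0):
    """Express one raw score relative to a norming sample.

    The raw score is first converted to a z-score (how many standard
    deviations it lies from the sample mean) and then rescaled.
    """
    mu = mean(norm_sample)
    sigma = stdev(norm_sample)
    z = (raw - mu) / sigma
    return scale_mean + scale_sd * z


# Hypothetical raw scores from same-age children in a norming sample.
# A real norming sample would contain far more children than this.
norm_sample = [38, 42, 45, 47, 50, 50, 53, 55, 58, 62]

# A child who scores exactly at the sample mean lands on the scale mean,
# and a child who scores above the sample mean lands above it.
at_mean = standard_score(50, norm_sample)
above_mean = standard_score(56, norm_sample)
print(at_mean, above_mean)
```

The sketch also shows why the comparison depends on the norming group: swap in a different `norm_sample` and the same raw score maps to a different standardized score.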
In order to make this determination, the authors look at several typical measures used in psychometrics (the science of measuring human mental performance and processes). Readers might recall the terms validity and reliability from some past course or readings. Validity is the degree to which a test measures what it claims to measure. Reliability is a measure of how consistently the test does so. One can imagine that a standardized test, meant to be used to compare a child’s results to those of all children of a certain age, must have very high levels of validity and reliability. Generally, a researcher would like to see levels of each of these measures above .80 (they can range from 0 to 1.0), and different editions of the WISC typically have reliability and validity scores well above .80. The question is, how are these things measured?
Reliability is often measured by comparing the results of one half of a test to the results from the other half. If both halves give a similar result, the idea is that the entire test is consistent within itself, which is called “split-half reliability.” The simple raw scores from one half could be compared to the scores from the other half, but that is not generally how the comparison is made. Because one half might have more questions of a certain type or more questions that weigh heavily on one topic or another, some sort of statistical measure, like the correlation between the scores or Cronbach’s alpha, is usually computed to compare the two halves. Of course, a statistician can get deeper into controlling for the relative influence of each test item and use more sophisticated measures, but I will not get into that here. Suffice it to say that reliability can be measured by comparing one part of a test to another part or by comparing two different administrations of the test, separated in time (test-retest reliability).
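The two measures named above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used in the article: the item scores are invented, the halves are formed by taking odd-numbered and even-numbered items (one common way to split a test), and the function names `pearson_r`, `split_half_correlation`, and `cronbach_alpha` are hypothetical.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def split_half_correlation(scores):
    """Correlate each child's total on odd items with the total on even items.

    scores holds one list of item scores per child.
    """
    odd_half = [sum(row[0::2]) for row in scores]
    even_half = [sum(row[1::2]) for row in scores]
    return pearson_r(odd_half, even_half)


def cronbach_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: six children, four test items each.
scores = [
    [2, 3, 2, 3],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [2, 2, 3, 2],
]

print(split_half_correlation(scores))
print(cronbach_alpha(scores))
```

With these made-up scores, both values come out above the .80 benchmark mentioned earlier, because children who do well on one item tend to do well on the others.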
Validity is a bit more difficult to measure. How do you know whether a test is measuring what you want it to measure? On one hand, you could ask a bunch of experts to look it over and give you their expert opinion. But if you are really concerned about the validity of a test and want to use some numbers to back up your assertions about what a test is measuring, you would need to either correlate the results of your test against other tests that purport to measure the same thing or do some complex statistics to determine how each test item or question is feeding into the test results. Analyzing the contributions of the individual test items is often done with some sort of factor analysis, in which the influence of each test item is plotted mathematically so that you can see whether the overall test is measuring the intended conceptual constructs or whether some test items are off in left field, measuring something else altogether.
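A full factor analysis is beyond a short example, so the sketch below swaps in a much simpler stand-in: the correlation between each item and the sum of the remaining items (often called an item-rest correlation). It is not factor analysis and not what the authors did, but it conveys the same intuition of spotting an item that is off in left field. The data are invented, with the fourth item deliberately built to run against the other three, and the function names are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def item_rest_correlations(scores):
    """For each item, correlate it with the total of all the other items.

    scores holds one list of item scores per child.
    """
    k = len(scores[0])
    correlations = []
    for i in range(k):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]
        correlations.append(pearson_r(item, rest))
    return correlations


# Hypothetical data: six children, four items. The last item is
# constructed to behave differently from the first three.
scores = [
    [2, 3, 2, 5],
    [4, 4, 5, 1],
    [1, 2, 1, 4],
    [5, 5, 4, 2],
    [3, 3, 3, 3],
    [2, 2, 3, 5],
]

for number, r in enumerate(item_rest_correlations(scores), start=1):
    print(f"item {number}: r = {r:.2f}")
```

In this toy data set the first three items correlate positively with the rest of the test, while the fourth correlates negatively, which is the kind of signal that would prompt a closer look at what that item is really measuring.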
The latter is what the authors of this article did with the WISC-IV, in order to see whether using it with children in China who are blind gave results that were in line with other uses of the test. Sometimes, when standardized tests are adapted for use with children who are blind, some test items have to be left out because they are too visually oriented. However, making changes of this nature can disrupt the established reliability and validity of the test, which is why the kind of work reported in this article is necessary. There is value in showing that a test can be used well with children who are blind, or that an adapted version gives results that are as reliable and valid as the typical version.
