Abstract
Self-report surveys that are online, lengthy, and contain sensitive material greatly increase the probability of invalid responding (IR) on the instrument. Most research informing the identification of invalid responders has not been able to test its methodologies where all these conditions are present. This study systematically adopted 10 IR indicators based on direct, archival, and statistical strategies to identify IR on a lengthy survey collecting campus climate/violence information that college students (N = 6,995) accessed online. Exploratory factor analysis indicated two internal factors (i.e., careless and extreme responding) underlying these IR indicators. Latent class analysis identified 4.8% of the sample as being invalid responders. Compared with honest responders, invalid responders were significantly more likely to report forms of victimization and a greater negative impact from physical abuse or sexual assault. Of importance, mean scores on victimization scales were significantly higher for invalid responders, illustrating the potential for IR data to skew prevalence rates. IR indicators differentially identified honest and invalid responders. The findings of this study contribute to the systematic investigation of IR with college students completing online and lengthy surveys that address sensitive material.
Self-administered questionnaires or self-report survey data are important modes for measuring participants’ experiences regarding sensitive topics, such as disability status; victimization; and lesbian, gay, bisexual, transgender, and queer or questioning (LGBTQ) identity (e.g., Saewyc et al., 2004; Schwarz, 1999; Tourangeau & Yan, 2007; Turner et al., 1998). However, because self-administered surveys are typically confidential and/or anonymous to encourage truthful responses, and researchers cannot monitor individual responses, especially for online surveys, it is likely that some responses may be invalid (Cornell et al., 2012; Fan et al., 2006; Huang et al., 2015; Robinson & Espelage, 2011; Savin-Williams & Joyner, 2014). To add to the knowledge regarding invalid responding (IR) to online surveys, the purpose of this project was to investigate IR in a lengthy data set which collected sensitive campus climate/violence information online.
Survey Characteristics That Increase the Probability of Invalid Responses
Survey literature has demonstrated that not all surveys are at equal risk for invalid responses, but rather surveys that are online, lengthy, and contain sensitive information are more likely to produce such responses (Meade & Craig, 2012). Online surveys, even so, continue to be utilized because they offer advantages of easy dispersal, automated data collection and coding, and responder anonymity. Invalid responses may occur with online surveys because there is no social accountability, that is, the survey taker has no knowledge of or contact with the survey administrator, so they feel no social obligation of truthfulness or carefulness in that anonymous setting. In addition, because online survey takers cannot be monitored as to the conditions under which a survey is completed, there is no accounting for possible environmental distractions (Douglas & McGarty, 2001; Lee, 2006; Meade & Craig, 2012) that may be present during the administration (e.g., a student listening to music while answering survey items). A lengthy survey may be vulnerable to IR because it has the potential to produce fatigue that may result in careless or random responses, particularly toward the middle or end of it (Baer et al., 1997; Berry et al., 1992). And surveys that ask participants to reveal potentially sensitive information, even when instructions inform respondents that their answers are anonymous and/or confidential, have a greater potential to collect inaccurate information (e.g., Turner et al., 1998). For example, the Centers for Disease Control and Prevention (1997) conducted a more nationally representative survey with college students, the National College Health Risk Behavior Survey in 1997, which included sensitive victimization questions, but there was no attempt to identify invalid responders in this data set.
Characteristics of Research Investigating IR Methodology
Investigations seeking the best methodology for detecting IR in surveys often could not or did not conduct their research to maximize the conditions discussed above that are more likely to produce IR (i.e., lengthy, online, and collecting sensitive information). Studies investigating invalid responses among college students or adults (usually recruited volunteers) typically provided incentives for participation and/or provided survey content regarding personality or attitude variables (thus removing the need for participants to avoid responding to sensitive content). For instance, Meade and Craig (2012) utilized 438 college students enrolled in an introductory psychology course who were required to participate in at least 3 hr of research studies. Although they found that different IR indicators could identify different types of IR, the students essentially “volunteered” for this particular study, because they chose it from among listed experiments and they had an incentive for participation, albeit in the form of course credit. Other studies that included college students have also not employed research designs that tested most of the conditions known to increase IR. Huang et al. (2012) recruited a sample of 725 undergraduates from a large Midwestern university who received credit for participation in completing Goldberg’s (1999) 300-item personality inventory. Although they found that psychometric antonyms (i.e., answering items opposite in content dissimilarly) and self-report of individual reliability were effective at detecting insufficient effort responses, the only characteristic of their survey likely to increase IR was that it was lengthy. More recently, Huang et al. (2015), seeking to test the efficacy of an infrequency detection scale with items that were counterfactual or impossible, used another undergraduate psychology pool from a large Midwestern university and incentivized their participation with extra course credit.
The researchers concluded that using an infrequency approach may be a useful tool for identification of invalid responders for low-stakes surveys (Huang et al., 2015), but their sample was still incentivized and engaged in a project that was neither lengthy nor contained sensitive content. The degree to which college students view the task of completing a survey seriously and/or responsibly is not clear; thus, further study is needed to determine whether a subset of college respondents respond invalidly. Therefore, providing data on IR in a college population completing a lengthy online survey related to the sensitive topic of victimization is expected to advance our knowledge regarding indicators and patterns predictive of IR in college students.
Methods for Detecting Invalid Responses in Surveys
To enhance confidence in collected data, scholarship has emerged dedicated to detecting invalid responses and/or responders. One model for classifying indicators for identifying invalid responses, proposed by DeSimone et al. (2015), categorizes these strategies as direct methods, archival methods, and statistical methods.
Direct methods involve preplanning by the survey team to include items, ranging in intent and function, that allow for evaluating response validity. These include the following: self-report of careful or accurate test taking, items that give instructions to the test-taker, and bogus items stating impossibilities that respondents should not endorse. Self-report of carefulness or accuracy has been used as a screening mechanism to remove surveys from a database when a respondent directly reports not paying attention while responding, not being careful completing the survey, and even not telling the truth (Curran, 2016; DeSimone et al., 2015; Maniaci & Rogge, 2014). Instructed items have also served as screening devices because respondents not following the specific instructions posed by these items are determined to be careless and/or inaccurate (DeSimone et al., 2015; Maniaci & Rogge, 2014). Bogus items asserting impossible or highly unlikely realities (Curran, 2016; DeSimone et al., 2015; Robinson-Cimpian, 2014) can identify respondents who intentionally or unintentionally (i.e., carelessly) endorse these impossibilities.
Archival methods utilize survey takers’ responses over the course of the survey to identify invalid responders based on response time and invariant responding. The length of time a respondent takes to complete a survey, easily recorded because many online survey platforms electronically record the time a survey is open, can signal potential invalidity if test takers finish below a minimum amount of time for responding reasonably carefully on a survey. Huang et al. (2012) pointed out that there is no standard or existing cutoff time to use because surveys vary widely in question length and complexity; thus, it is improbable that an appropriate universal cutoff time will ever be established. However, an example of how to systematically calculate the minimum amount of time necessary to consider a survey valid was Huang et al.’s (2012) cutoff based on an “educated guess” while erring on the conservative side, which was less than 52 seconds per page, or less than 2 seconds per item (for surveys with personality/attitude items).
Another common archival method identifies invariability in a test-taker’s responses, signaled by the endorsement of the same response option to consecutive survey items. Invariant responding may be especially noticeable when surveys mix positively and negatively valenced items (i.e., some reverse-worded items in a scale), such that a person having particular attitudes would not respond similarly to the reverse-worded items (Curran, 2016; DeSimone et al., 2015). In addition, where repeated items measure frequency of behaviors or events, it is highly unlikely that all stated items would have occurred at the exact same frequency.
Statistical screening methods constitute the third category for identifying invalid responders. The first type, and probably the simplest statistical approach, is the identification of possible outliers using descriptive statistics (DeSimone et al., 2015). For example, a formula has been derived using the interquartile range (IQR) of participant responses to identify outliers (e.g., Rousseeuw & Croux, 1992; Upton & Cook, 1996; Zwillinger & Kokoska, 2000). The IQR between the first (Q1) and third quartile (Q3) is used in the calculation to identify respondents whose scores fall substantially below the first quartile or substantially above the third quartile, that is, observations below Q1 − 1.5 IQR or above Q3 + 1.5 IQR. Another way of detecting outliers is by computing Mahalanobis D, wherein a respondent is flagged when individual scores are compared with the sample mean scores across an entire survey to produce an estimate of the multivariate distance between an individual’s score and sample mean scores (DeSimone et al., 2015). Any score with a p value less than .05 indicates IR.
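The interquartile-range rule described above can be sketched in a few lines. This is a minimal illustration rather than code from any cited study: the function name `iqr_outlier_flags`, the "inclusive" quartile method, and the example scores are assumptions chosen for demonstration.

```python
import statistics

def iqr_outlier_flags(scores, k=1.5):
    """Return 1 for scores below Q1 - k*IQR or above Q3 + k*IQR, else 0."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [int(s < lower or s > upper) for s in scores]

# Hypothetical scale means for ten respondents (not study data).
example_scores = [0.0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 1.0, 3.0]
flags = iqr_outlier_flags(example_scores)
print(flags)  # only the last respondent (3.0) lies beyond the upper fence
```

Different quartile conventions shift the fences slightly, so the exact set of flagged respondents can depend on the software used.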
The second type of statistical screening methods utilizes psychometric synonyms and psychometric antonyms. These two methods assume respondents’ attitudes stay the same across a survey. In particular, participants are assumed to respond consistently with similar items and respond differently with dissimilar items. By examining the inter-item correlation, item pairs with high positive correlations are identified as psychometric synonyms (e.g., Meade and Craig [2012] used a criterion of .60 as the benchmark of “high” correlation), and item pairs with high negative correlations are identified as psychometric antonyms. However, whether psychometric synonyms and antonyms will be usable for a given data set is not always predictable. On one hand, researchers cannot predefine the item pairs before screening for IR, so this limitation adds to the statistical objectivity of the strategy. On the other hand, the existence of such item pairs is largely dependent on the data, so it is possible that psychometric synonyms or psychometric antonyms will not be found through inter-item correlation and thus would not be available for screening IR (DeSimone et al., 2015).
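As a rough illustration of the pair-selection step, the sketch below scans all item pairs and keeps those whose inter-item correlation reaches the .60 benchmark in either direction. The item names, the tiny response matrix, and the helper functions are hypothetical; a real application would use the full sample and the study's own items.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def find_item_pairs(items, threshold=0.60):
    """Split item pairs into psychometric synonyms and antonyms."""
    synonyms, antonyms = [], []
    for a, b in combinations(items, 2):
        r = pearson(items[a], items[b])
        if r >= threshold:
            synonyms.append((a, b))
        elif r <= -threshold:
            antonyms.append((a, b))
    return synonyms, antonyms

# Hypothetical responses of six people to four items (not study data).
items = {
    "safe_day": [1, 2, 3, 4, 5, 4],
    "safe_night": [1, 2, 3, 5, 5, 4],
    "feel_unsafe": [5, 4, 3, 1, 1, 2],
    "unrelated": [3, 1, 3, 1, 3, 1],
}
synonyms, antonyms = find_item_pairs(items)
```

If no pair reaches the threshold, both lists come back empty, which is the situation in which this screening strategy is unavailable for a data set.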
Current Research Questions
The detection of IR in this study was motivated by possession of a survey data set with the three conditions described above that increase the likelihood of IR, thus raising concerns about the quality of responses. First, the survey was conducted online. The survey was also lengthy due to the intent to capture victimization across a range of interpersonal behaviors as well as to assess perceptions of campus climate. In addition, the survey assessed potentially sensitive topics such as sexual assault, sexual harassment, and other forms of victimization, which respondents may be hesitant to report. Thus, the three factors stated above render this survey data more prone to IR. A major goal, therefore, was to determine whether invalid responders affect reporting of prevalence data for different forms of victimization.
The following questions were posed for this study:
By answering these questions, we could disentangle the relationships among the IR indicators (RQ1 and RQ2), determine whether latent classes exist and, if yes, the proportion of each class (RQ3), detect the influence of the invalid responders (RQ4 and RQ5), and finally decide which IR indicators were the most valuable for detecting invalid responders (RQ6).
Method
Participants
The data for these analyses came from a larger survey project conducted in 2017 which examined campus climate and interpersonal violence/harassment victimization. Of the original data set, 6,995 attendees at a large Southern university consented for their data to be used in research projects which they understood would protect their anonymity and confidentiality and would only report on aggregate data to professional audiences. Of the students eligible to participate, 31 students responded “choose not to answer” (CNTA) for every question and, thus, were removed due to providing no data. The final data set consisted of 6,964 participants.
Demographic characteristics of the data set included the following: 62% female, 37% male, and 1% who preferred not to report their gender; 25% first-year students, 49% combined sophomores, juniors, and seniors, and 23% graduate/professional students; 78% White students, 7% African American students, 4% East Asian students, 3% Hispanic students, and 8% Other students; and 94% of the students were domestic and 6% were international students. Thirteen percent of students were first-generation college students.
Survey Information
The original campus climate survey consisted of two main sections: (a) assessment of students’ attitudes, perceptions, and behaviors about safety, university response to victimization, and social risk factors, and (b) assessment of adverse experiences/victimization. The only sections from the “climate” part of the survey that were used for this project were students’ perceptions about their safety at the university and their attitudes toward affirmative elements of consent for sexual activity, which were the source for identifying items that were similarly answered (i.e., Psychometric Synonyms and Psychometric Antonyms [see below]). The victimization section of the survey consisted of seven forms of violence/harassment that students may experience while being in college. Bullying was assessed by measuring direct verbal bullying, bullying via social media, and physical bullying with follow-up questions regarding the perceived reason the bullying occurred (Hamburger et al., 2011). Sexual harassment (six items; Campbell, 2014) and stalking (five items; Dye & Davis, 2003) used general items to cover the major concepts of behaviors within those forms of victimization, such that specific behaviors (e.g., leaving unwanted flowers on your doorstep) were subsumed under general concepts (e.g., unwanted contact) to cover a wide range of behaviors with streamlined items. Three categories of intimate partner victimization were reported by students in relationships during the prior year: reproductive coercion (two items; Miller & McCauley, 2013), serious forms of psychological abuse (five items; Follingstad, 2011), and physical violence (nine items; adapted from Banyard et al., 2013).
The most-specific follow-up data were collected from students reporting sexual assault resulting from one of five situations: physical force, threats of harm, incapacitation due to involuntary substance use, incapacitation due to voluntary substance use, or escape from an attempted sexual assault (adapted from Banyard et al., 2013; Campbell, 2014; Krebs et al., 2007). Finally, the survey included demographic information (e.g., gender), and items hypothesized to be indicators of IR (e.g., bogus items, self-report of carelessness).
Measures of Invalid Responding
Direct screening variables
Two types of direct screening variables were constructed. First, to gauge truthfulness (Not Truthful) and/or carelessness (Careless) of respondents, two self-report items asked respondents regarding lying and low effort on the survey (e.g., Cornell et al., 2012; DeSimone et al., 2015; Maniaci & Rogge, 2014). Respondents selected True (0), False (1), or CNTA (missing) in response to the statement: “I have told the truth on this survey.” Respondents reported on their carefulness by selecting True (0), False (1), or CNTA (missing) in response to the statement: “I didn’t pay attention to how I answered this survey.”
Bogus items were the second direct strategy for detecting whether respondents paid attention or deliberately answered dishonestly (DeSimone et al., 2015). The first bogus indicator was derived from questions assessing the presence of extreme physical disabilities (i.e., blindness or impaired vision that cannot be corrected; deafness or severe hearing loss; paralysis in the form of paraplegia or quadriplegia) for which endorsement of more than one would be extremely unlikely (Robinson-Cimpian, 2014). Consultation with the institution’s Disability Resource Center determined that no student was registered as having two or more of the specified physical disabilities. Thus, this bogus indicator (2+ Disabilities) was coded 1 = two or more physical disabilities endorsed and 0 = either one or no physical disability was endorsed.
The second bogus indicator assessed an impossibility regarding student life—“I attend Student Government meetings 20 or more times a month”—that was impossible to achieve because student government meetings never occur that often. This indicator (StuGov) was coded yes = 1, no = 0, or CNTA = missing.
Archival screening variables
The first archival approach for this study used the amount of time for respondents to complete the survey because the survey platform automatically collected response times. Based on Huang et al.’s (2012) strategies when there is no established or standard cutoff, we first measured response time per item because the survey pages were variable rather than the same format. Because many of the survey items were somewhat involved, the survey format varied, and instructions were included throughout the survey, we adopted a criterion of 3 s/item, viewing Huang et al.’s (2012) criterion of 2 s/item to be too conservative for flagging possible invalid responses. Because students who had not experienced any victimization would answer a minimum number of 94 core survey questions, the 3-s criterion to flag respondents who were unlikely to have reasonably considered the items was calculated at 282 s. This indicator (Time) was coded 1 = response time ≤282 s and 0 = response time >282 s.
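The arithmetic behind the Time indicator is simple enough to write out. The sketch below only restates the rule given in the text (3 s per item, 94 core items, flag at or below the cutoff); the constant and function names are illustrative.

```python
SECONDS_PER_ITEM = 3   # criterion adopted in the text
MIN_CORE_ITEMS = 94    # minimum number of core questions answered
TIME_CUTOFF_S = SECONDS_PER_ITEM * MIN_CORE_ITEMS  # 3 * 94 = 282 seconds

def time_flag(response_time_s):
    """Return 1 if the completion time is at or below the cutoff, else 0."""
    return int(response_time_s <= TIME_CUTOFF_S)

print(TIME_CUTOFF_S)   # 282
print(time_flag(240))  # 1: finished faster than 3 s per core item
print(time_flag(900))  # 0
```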
The second archival indicator of IR used invariant or longstring responding. Respondents’ answers on two victimization scales were flagged if a respondent chose the same frequency option (except for 0) for each of the items within that scale. One of the scales assessed sexual harassment for the prior year with six questions: (a) “Said sexual things to you that you did not want to hear”; (b) “Sent sexual messages or pictures that you did not want (including porn)”; (c) “Asked or pressured you for a date, hook up, or sexual favors even though you had already said no”; (d) “Made unwanted sexual gestures or imitated sexual motions when you did not want them to”; (e) “Touched you sexually (breasts, buttocks or genitals) when you did not want them to”; and (f) “Exposed themselves to you (breasts, buttocks, or genitals) when you did not want them to.” Response options were 0 = Never (0 times), 1 = Once (one time), 2 = Sometimes (two to five times), 3 = Often (six+ times), 4 = CNTA (missing). Respondents were coded as invariant = 1 on this scale if they selected all 1s, 2s, or 3s. They were coded as not invariant = 0 if they selected any combination of responses (thereby indicating variability) or reported all items as never happening because many college students have not experienced these behaviors, thus accurately reporting no incidents. Data from respondents choosing CNTA for more than two items were coded as missing data.
The second scale used to assess invariant responding was the stalking scale, which comprises five items and has the same response options and scoring as the sexual harassment scale. Items on the stalking scale included the following: (a) “Followed or spied on you in ways that made you afraid,” (b) “Repeatedly tried to communicate with you in ways that made you afraid,” (c) “Repeatedly showed up where you didn’t want them to in ways that made you afraid,” (d) “Invaded your privacy in ways that made you afraid,” and (e) “Did other threatening or damaging things in ways that made you afraid.” Data from respondents who chose CNTA for more than one item on the stalking scale were coded as missing data. The two invariant indicators derived in this way were labeled SexHarInv and StalkInv.
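The coding rule for the two invariant-responding indicators can be sketched as follows. This is a minimal illustration, not the authors' code. Two details are assumptions because the text does not specify them: missing is represented as `None`, and when a respondent has an allowable number of CNTA answers, only the answered items are compared.

```python
CNTA = 4  # response code for "choose not to answer"

def invariant_flag(responses, max_cnta):
    """Return 1 (invariant), 0 (not invariant), or None (missing).

    max_cnta is the largest number of CNTA answers tolerated:
    2 for the six-item sexual harassment scale, 1 for the five-item
    stalking scale.
    """
    if sum(r == CNTA for r in responses) > max_cnta:
        return None
    # Assumption: CNTA answers are ignored when checking for invariance.
    answered = [r for r in responses if r != CNTA]
    first = answered[0]
    all_same = all(r == first for r in answered)
    # All-zero ("Never") patterns are treated as valid, not invariant.
    return int(all_same and first != 0)

print(invariant_flag([2, 2, 2, 2, 2, 2], max_cnta=2))  # 1
print(invariant_flag([0, 0, 0, 0, 0, 0], max_cnta=2))  # 0
print(invariant_flag([4, 4, 1, 1, 1], max_cnta=1))     # None
```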
It should be noted that while there were other victimization scales which could yield invariant responding, they were not used because either not all students received the scale or the scale contained fewer items than the suggested responses in a row recommended for screening (e.g., Huang et al., 2012). Specifically, scales assessing physical violence, psychological abuse, or reproductive coercion in an intimate relationship were only administered to students reporting an intimate relationship in the past year. The three-item bullying scale was not considered because it did not have enough items to appropriately assess invariant responding.
Statistical screening variables
Two IR indicators were derived from determining extreme outliers on the sexual harassment scale (SexHar99%) and the stalking scale (Stalk99%). The most conservative option (i.e., 99th percentile) was chosen to indicate outlier status when calculating mean scores of respondents for these victimization scales. The more common method in the literature for determining outlier status, which calculates outliers based on their deviation below the first and above the third quartiles, was not workable for this data set due to the strong skew in the data from individuals who experienced no victimization of a particular type (score = 0). Because of this skew, the typical method for identifying outliers placed almost every person experiencing any victimization into the outlier category. Thus, we established a threshold requiring respondents’ mean scores to fall into the 99th percentile before they were assigned as an “outlier” for this IR indicator (1 = mean score at 99%; 0 = mean score < 99%). We did not employ Mahalanobis D as a screening method for the same reason that our victimization data do not follow a normal distribution (see Meade & Craig, 2012). Requiring scores to reach the threshold of the 99th percentile before indicating IR was expected to result in fewer misclassifications of individuals whose high reported rates may be valid and truthful.
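A small synthetic example shows why the quartile-based rule breaks down on zero-inflated scores and how a 99th-percentile cutoff behaves instead. The data are invented (200 respondents, 90% scoring 0), and the percentile method and the tie rule (flag scores at or above the cutoff) are assumptions rather than details taken from the study.

```python
import statistics

# Hypothetical scale means: 180 respondents report no victimization.
scores = [0.0] * 180 + [0.5] * 15 + [1.0] * 3 + [2.5, 3.0]

# Quartile rule: Q1 and Q3 are both 0, so the upper fence collapses to 0.
q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
upper_fence = q3 + 1.5 * (q3 - q1)
iqr_flagged = sum(s > upper_fence for s in scores)

# 99th-percentile rule: the last of the 99 cut points returned for n=100.
p99 = statistics.quantiles(scores, n=100, method="inclusive")[98]
p99_flags = [int(s >= p99) for s in scores]

print(iqr_flagged)     # 20: every respondent with any victimization
print(sum(p99_flags))  # 2: only the two most extreme scores
```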
To develop a statistical IR indicator based on response consistency (PsySyn), Meade and Craig’s (2012) and DeSimone et al.’s (2015) Psychometric Synonyms index was used as a model. The minimum correlation coefficient defining psychometric synonyms was predefined as .60 for similar concept item pairs identified from two of the survey’s climate scales—Perception of Safety (PS) scale and Attitudes toward Affirmative Elements of Consensual Sex (AECS) scale. Correlations of all item pairs between the two scales revealed three pairs of items correlating above .60. Correlations across these item pairs were conducted for each student to form a binary Psychometric Synonyms index. Any student whose within-person correlation was negative was coded as PsySyn = 1 (IR) for this variable, whereas PsySyn = 0 indicated valid responding.
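The within-person step can be sketched as a correlation computed across the three synonym pairs for one respondent, with a negative value coded as IR. The function names and example responses are hypothetical. The handling of respondents who give identical answers to all three items on one side (where the correlation is undefined) is not described in the text; returning `None` here is an assumption.

```python
from math import sqrt

def within_person_r(first_items, second_items):
    """Correlation across item pairs for a single respondent.

    first_items[i] and second_items[i] are the two responses in pair i.
    Returns None when either side has no variance (r is undefined).
    """
    n = len(first_items)
    mx, my = sum(first_items) / n, sum(second_items) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(first_items, second_items))
    sxx = sum((a - mx) ** 2 for a in first_items)
    syy = sum((b - my) ** 2 for b in second_items)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def psysyn_flag(first_items, second_items):
    """Return 1 if the within-person correlation is negative, else 0."""
    r = within_person_r(first_items, second_items)
    if r is None:
        return None  # assumption: undefined correlations are left uncoded
    return int(r < 0)

print(psysyn_flag([4, 2, 5], [5, 2, 4]))  # 0: similar items answered alike
print(psysyn_flag([1, 5, 2], [5, 1, 4]))  # 1: similar items answered oppositely
```

With only three pairs, the within-person correlation is a coarse signal, which is consistent with the text coding only its sign.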
Data Analysis Plan
Because all the IR indicators were binary, tetrachoric correlations were used to examine the relationships among the IR indicators. Exploratory factor analysis (EFA) with weighted least squares mean- and variance-adjusted (WLSMV) estimation was used to examine the construct validity of the indicators and the relationship of the indicators with other variables in the study. Eigenvalues greater than 1 for the factor analysis were examined (Guttman, 1954; Kaiser, 1960) and scree plots were used to determine where the discontinuity of the eigenvalues occurred (Gorsuch, 1983; Tabachnick & Fidell, 2013). Factor loadings were then evaluated and reported (Pett et al., 2003). To answer whether classes of participants can be identified based on the IR indicators, latent class analysis (LCA) was used to determine if any latent class of IR exists and, if yes, the proportion of each class. Logistic regression analyses examined whether the responses from the invalid responder group(s) differed from responses of the honest group regarding victimization scales. Chi-square tests and independent t-tests were used to test the difference between invalid respondents and the rest of the sample for demographics and victimization. Poisson log-linear regressions were conducted to investigate the strongest IR predictor of each class when conducting this survey. Chi-square tests, independent t-tests, logistic regressions, and Poisson log-linear regressions were conducted in SPSS (IBM Corp, 2016). Tetrachoric correlations, EFA, and LCA were conducted using Mplus 7.4 (Muthén & Muthén, 1998–2017).
Results
Correlational Assessment
Nine of 10 IR indicators had a low to moderate positive correlation with each other, r ranging from .21 to .77 (see Table 1). However, psychometric synonyms (PsySyn) was uncorrelated with all other IR indicators except for response time (r = .228), indicating PsySyn measured a different type of IR, compared with other indicators.
Table 1. Tetrachoric Correlations Among the Ten Invalid Response Indicators (N = 6,964).
Note. PsySyn = psychometric synonyms. *p < .05. **p < .01. ***p < .001.
Factorial Evidence
Given that PsySyn was not associated with most of the IR indicators, it was excluded from EFA analyses through which we intended to find common factors of those IR indicators. Results from the EFA suggested two factors to extract as evidenced by two eigenvalues greater than 1 and the scree plot indicating two factors. Table 2 shows the oblimin rotated standardized factor pattern loadings. A further examination of factor loadings showed the two factors can be interpreted as a carelessness factor (including Not Truthful, Careless, and Time) and an extreme responding factor (including SexHarInv, StalkInv, SexHar99%, and Stalk99%) with substantive loadings (>.30). The correlation between the factors was moderate, r = .51.
Table 2. Standardized Factor Loading Results of the Exploratory Factor Analysis Among the 10 Invalid Response Indicators (N = 6,964).
Note. Factor 1 (Carelessness) correlated with Factor 2 (Extremeness) at .51, p < .001. Standardized factor loadings that are greater than .30 and do not cross-load on two factors are in bold.
Latent Group Results
LCA results
To answer whether participants can be identified based on different styles of responding, underlying latent classes using the 10 identified IR indicators were examined through LCA. We included the PsySyn variable in our LCA analysis given that internal correlations among the latent class membership indicators were not assumed for LCA models. Three latent classes were specified, mimicking strategies of Meade and Craig (2012). LCA analyses suggested three latent classes were significantly better than two latent classes, p(Vuong–Lo–Mendell–Rubin test) < .001 and parametric bootstrapped likelihood ratio test = 185.589, p < .001. Table 3 summarizes the probability of class membership by each indicator. Students in Class 1 (n = 6,632; 95.2%) have a close to zero probability of being an invalid responder based on the indicators (p ranges from .002 to .031). Students in Class 2 (n = 269; 3.9%) have a high probability of being classified into that group on the basis of indicators of Time (p = .91), Not Truthful (p = 1.0), and Student Government Attendance (p = .354). Students in Class 3 (n = 63; 0.9%) have high probabilities of being classified into that group based on their use of invariant responding to sexual harassment and/or stalking scales (i.e., SexHarInv and StalkInv), and/or their extremely high mean scores on the sexual harassment and/or stalking scales (i.e., SexHar99% and Stalk99%; p ranges from .388 to .678). Students in Class 3 who selected extreme responses are similar to Class 2 in that they have very short completion times (Time) and report they were not truthful (Not Truthful; p ranges from .378 to .384). The final count indicates that out of the 6,964 respondents, approximately 4.8% responded in an invalid manner. Thus, labels for the final determination of classes are as follows: Class 1 = Honest Class; Class 2 and Class 3 = Invalid Class.
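The class percentages and the combined 4.8% figure follow directly from the reported class sizes, as the short check below shows. The variable names are illustrative; the counts are those reported in the text.

```python
# Class sizes reported for the three-class solution.
class_counts = {"Class 1": 6632, "Class 2": 269, "Class 3": 63}
n_total = sum(class_counts.values())  # 6,964 respondents

# Percentage of the sample in each class, rounded to one decimal place.
shares = {name: round(100 * n / n_total, 1) for name, n in class_counts.items()}

# Classes 2 and 3 together form the Invalid Class.
invalid_n = class_counts["Class 2"] + class_counts["Class 3"]
invalid_share = round(100 * invalid_n / n_total, 1)

print(shares)         # {'Class 1': 95.2, 'Class 2': 3.9, 'Class 3': 0.9}
print(invalid_share)  # 4.8
```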
Table 3. Probability of Categorizing the Invalid Responder Based on the Latent Class Analysis (N = 6,964).
Note. Entropy for the latent class analysis with three classes was .954. Classes 2 and 3 were combined into the Invalid Group given that both classes included students who are not truthful and careless. PsySyn = psychometric synonyms. *p < .05. **p < .01. ***p < .001.
Logistic regression analyses
Logistic regression analysis showed that, in the whole sample, class membership is a significant predictor of reporting seven types of victimization, including sexual assault, bullying, sexual harassment, stalking, intimate partner physical abuse, intimate partner psychological abuse, and reproductive coercion (Table 4). The odds of reporting these forms of victimization were much higher in the Invalid Class than in the Honest Class, with odds ratios (ORs) ranging from 2.84 to 9.98. Also, class membership as an invalid responder was a significant predictor of reporting one’s gender as transgender or gender nonconforming (OR = 8.29) and reporting greater negative impact from physical abuse (OR = 4.32) or from sexual assault (OR = 6.31). In general, students in the Invalid Class are more likely to report victimization, negative impacts from victimization, and a nontraditional gender identification.
Table 4. Binary Logistic Regression Results for the Class Membership Predicting Victimization Variables (N = 6,964).
Note. All dependent variables were binary. All p values were less than .001. SE = standard error of the coefficients.
Differences between the Honest Class and the Invalid Class
Chi-square tests showed that the distributions of academic classification, ethnicity, first-generation student status, and number of organizations to which students belong were similar between the Honest and Invalid Classes (see Table 5). Although the distribution of gender differed between the two classes, χ2(2) = 61.30, p < .001, the effect size was small, Cramér’s V = .10. Thus, differences between the Honest and Invalid Classes do not appear to be a function of demographic data.
Descriptive Statistics and the Chi-Square Test Results of the Demographic Variables (N = 6,964).
Note. Sensitivity analysis for the student organization variable excluded the cells that had less than five observations.
Table 6 summarizes mean scores of the Honest and Invalid Classes for the victimization scales (i.e., bullying, sexual harassment, stalking, physical abuse, impact of physical abuse, psychological abuse, and reproductive coercion). The Invalid Class had more reports of victimization compared with the Honest Class, t values ranging from −4.58 to −8.89. Further examination of the effect sizes showed that victimization differences between the Honest Class and the Invalid Class were large, with the majority of the Cohen’s ds above .50 (Cohen, 1988), suggesting the Invalid Class could have a significant impact on reported rates of victimization.
Means, Standard Deviations, and the Independent T-Test Results (Honest vs. Invalid) of the Victimization Scales.
All p values were less than .001.
Poisson log-linear regressions
Tables 7 and 8 summarize how strongly each IR indicator predicted class membership. Time was the strongest indicator of honest responders, β = −2.65, OR = 0.07, indicating that the odds of being an honest responder were 14.29 times higher among those who did not finish the survey quickly than among those whose completion times were very short (Table 7). Not Truthful was the strongest indicator distinguishing whether a person was an invalid responder, β = 4.71, OR = 110.67 (Table 8). Thus, the odds of being an invalid responder were 110.67 times higher among respondents who reported they were not truthful than among those who reported that they answered honestly. The second strongest indicator of the Invalid Class was Careless, β = 4.25, OR = 70.14, followed by StalkInv, Stalk99%, Time, SexHar99%, SexHarInv, Student Government, and 2+ Disabilities (β = 2.07–3.80, OR = 7.93–44.87). The only nonsignificant IR predictor was PsySyn (p = .952 for the Honest Class; p = .745 for the Invalid Class).
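The relationship between the reported coefficients and odds ratios can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming OR = exp(β) and that the "14.29 times higher" figure is the reciprocal of the rounded OR of 0.07; the function name is ours. Small discrepancies from the tabled ORs (e.g., for β = 4.71) are expected because the coefficients are reported to two decimals.

```python
import math

def odds_ratio(beta: float) -> float:
    """Convert a regression coefficient on the log scale to an odds ratio."""
    return math.exp(beta)

# Time predicting the Honest Class (Table 7): beta = -2.65
or_time = round(odds_ratio(-2.65), 2)     # rounds to the reported OR of 0.07
inverse_or_time = round(1 / or_time, 2)   # reciprocal, the "times higher" figure

# Not Truthful and Careless predicting the Invalid Class (Table 8)
or_not_truthful = odds_ratio(4.71)        # close to the reported 110.67
or_careless = odds_ratio(4.25)            # close to the reported 70.14

print(or_time, inverse_or_time)
print(round(or_not_truthful, 1), round(or_careless, 1))
```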
Single Poisson Log-Linear Regression Results of Each Invalid Respondent Indicator Predicting the Honest Class (N = 6,964).
Note. Degrees of freedom for the Wald chi-square tests = 1. All p values for the Wald χ2 test of model effects except for the PsySyn variable were less than .001. PsySyn = psychometric synonyms.
Test of model effect was not significant, p = .952.
Single Poisson Log-linear Regression Results of Each Invalid Respondent Indicator Predicting the Invalid Class (N = 6,964).
Note. Degrees of freedom for the Wald chi-square tests = 1. All p values for the Wald χ2 test of model effects except for the PsySyn variable were less than .001. PsySyn = psychometric synonyms.
Test of model effect was not significant, p = .745.
Discussion
The contribution of this study lies in the systematic identification of invalid responders completing a campus climate/victimization survey with characteristics previously demonstrated to increase the likelihood of IR. Specifically, the large data set was generated from a survey that was online (i.e., students were provided with their own survey platform link), lengthy (i.e., minimum of 94 items with the potential of 164 items), and assessed sensitive topics (i.e., seven forms of victimization). We are unaware of a project like ours investigating IR while using a survey with all of these characteristics; thus, we expect the findings to be an important contribution to this literature.
The percentage of invalid responders identified in this project was 4.8%. This finding is in line with other studies investigating problematic response sets among college students, a national sample, or high school students completing surveys. For a national sample, Maniaci and Rogge (2014) cited rates of 3%–9% across their studies, whereas Johnson (2005) identified 3.5% of respondents as invalid based on an index of invariant responding in a sample that appeared to represent a broad base of respondents. Beach (1989), utilizing a very small sample of undergraduates, reported that 4% of respondents to a paper–pencil questionnaire self-reported inattention, compared with 10% of respondents completing the same questionnaire online. Finally, Cornell et al. (2012) reported 11.8% of a sample of 7,801 high school students completing a national student health survey to be invalid responders. It is possible that individuals beyond high school age are less likely to respond with invalid answers, but further investigation comparing age groups under similar conditions is required before any definitive statement can be made.
Because we consulted the IR literature when developing our survey, this project included a range of strategies defined in the literature as direct, archival, and statistical (DeSimone et al., 2015). This inclusive approach to identifying invalid responders through multiple indicators allowed us to determine the relationship of these indicators to each other. In addition, the results indicated that different indicators differentially predicted membership in two classes of invalid responders distinct from an “honest” group. This suggests that utilizing just one strategy for invalid responder identification may not capture the range of survey-takers misreporting data, and that indicators representing different strategies for identification may be necessary. Even though extreme responders may be easier to identify (i.e., choosing highest scores, responding invariantly), individuals intentionally providing random responses (i.e., careless) may require other indicators to identify them. Supporting this, the factor analysis of the IR indicators bolstered the identification of two classes of invalid responders by demonstrating that the indicators formed two factors with themes similar to the two classes formed by the LCA (i.e., careless and extreme).
Analyses were conducted to identify the best predictors of invalid responding and thereby indicate whether future investigators need to apply the full range of IR detection strategies used in this project. Although many of the indicators performed well with these data, the indicator of self-reported lack of truthfulness best predicted “invalid” group membership, followed by the self-reported indicator of carelessness and both the invariant and 99th percentile indicators of stalking. Indicating that you have not told the truth on a survey may best predict IR because, rather than making a judgment about whether one was or was not careless (i.e., “I didn’t pay attention to how I answered this survey”), willingness to admit intentionally misleading suggests a stronger and clearer sense that one did not take the survey as it was meant to be completed. Scoring at the 99th percentile on the stalking scale may have been one of the strongest predictors of IR because stalking behaviors are quite clearly defined actions that are unlikely to require interpretation, compared with some other forms of victimization (e.g., sexual harassment, psychological abuse). Thus, reporting that all five categories of stalking behavior (which include extreme actions) occurred at the highest frequency seems to clearly indicate aberrant responding. In addition, claiming that all five categories of another person’s stalking actions toward you occurred at exactly the same frequency also suggests impossible happenings and therefore a choice to deliberately use a mode of responding (i.e., invariant) to speed one’s progress through a survey. In contrast, the recorded time for completion of an online survey appears to best classify honest responders, possibly because thoughtful answers, or even taking the time to fully read items, likely require more time than the minimum number of seconds calculated for finishing a survey with little or random effort.
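The three archival and statistical indicators discussed above (invariant responding, scoring at the 99th percentile, and very short completion time) can be sketched as simple flags. The example is hypothetical throughout: the item scoring (0 to 4), the respondents, the 300-second threshold, and the decision not to count an all-"never" pattern as invariant are assumptions made for illustration, not the coding rules used in this study.

```python
from statistics import quantiles

# Hypothetical respondents: five stalking items scored 0 (never) to 4 (most frequent),
# plus survey completion time in seconds. None of these values come from the study.
respondents = {
    "a": {"stalking": [0, 0, 1, 0, 0], "seconds": 1500},
    "b": {"stalking": [4, 4, 4, 4, 4], "seconds": 240},
    "c": {"stalking": [2, 2, 2, 2, 2], "seconds": 1300},
    "d": {"stalking": [0, 0, 0, 0, 0], "seconds": 1100},
}

MIN_SECONDS = 300  # assumed minimum plausible completion time

totals = [sum(r["stalking"]) for r in respondents.values()]
# quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
cutoff_99 = quantiles(totals, n=100, method="inclusive")[98]

def flag(record):
    """Return the three illustrative IR flags for one respondent."""
    items = record["stalking"]
    return {
        # Same nonzero answer on every item (all-"never" is assumed to be plausible).
        "invariant": len(set(items)) == 1 and items[0] > 0,
        # Scale total at or above the sample's 99th percentile.
        "top_1_percent": sum(items) >= cutoff_99,
        # Finished faster than the assumed minimum time.
        "too_fast": record["seconds"] < MIN_SECONDS,
    }

flags = {rid: flag(rec) for rid, rec in respondents.items()}
for rid, f in flags.items():
    print(rid, f)
```

In this toy sample, respondent "b" trips all three flags, whereas "c" is flagged only for invariant responding, which illustrates why a single indicator may miss some patterns.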
What may be the most significant finding from this series of analyses was that the data on reported victimization were significantly different between the honest and invalid responders. Of concern is that higher reported rates of victimization by invalid responders, whether produced intentionally through mischievous responding or indirectly by not reading items or answering invariantly, could affect overall prevalence rates of victimization, in this case on college campuses. Response rates, victimization status, and mean scale scores on victimization scales were all significantly different between the honest and invalid responders, with very high odds ratios and effect sizes. When the two classes of responders were compared on their claims of victimization, the invalid group recorded higher rates for every type of interpersonal violence/harassment. In addition, there were several other reported data points for which invalid responders claimed much greater representation, namely, claiming transgender or non-gender conforming as their gender identity, claiming negative impacts from sexual assault, and claiming negative impacts from intimate partner physical violence. These findings suggest that, even with large data sets, identifying invalid responders through various indicators of invalidity to determine whether the data are significantly affected by careless and extreme responders seems warranted. Because invalid responders were not identified by demographic information, removing them from the sample would not change the distribution of demographic information. However, if researchers are interested in reported information regarding interpersonal victimization, then the rates are likely to be significantly lower if invalid responders are removed from the sample, potentially providing more accurate prevalence information.
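A short worked example shows how a small invalid group can shift a prevalence estimate. The 4.8% share of invalid responders and the odds ratio of 9.98 (the upper end of the reported range) come from this study; the 5% victimization rate among honest responders is a hypothetical value chosen only for illustration.

```python
def rate_from_odds_ratio(base_rate: float, odds_ratio: float) -> float:
    """Rate in the comparison group implied by a base rate and an odds ratio."""
    base_odds = base_rate / (1 - base_rate)
    odds = base_odds * odds_ratio
    return odds / (1 + odds)

invalid_share = 0.048    # reported proportion of invalid responders
honest_rate = 0.05       # hypothetical rate among honest responders
or_invalid = 9.98        # largest reported odds ratio

invalid_rate = rate_from_odds_ratio(honest_rate, or_invalid)
# The full-sample rate is the share-weighted mix of the two groups' rates.
overall_rate = (1 - invalid_share) * honest_rate + invalid_share * invalid_rate

print(f"invalid group rate: {invalid_rate:.3f}")
print(f"full sample rate:   {overall_rate:.3f}")
print(f"honest only rate:   {honest_rate:.3f}")
```

Under these assumptions the full-sample estimate is roughly 6.4% rather than 5.0%, a relative inflation of more than a quarter produced by fewer than 1 in 20 respondents.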
We retained the full sample if the purpose was to display the distribution of demographic information or report descriptive statistics of the items. However, when the report established associations between certain variables and interpersonal victimization, data from the invalid responders were excluded. Our finding that invalid respondents altered data in meaningful ways supports the growing literature that identifying invalid responders is important to reduce problematic data generated in experimental studies or for correlational analyses due to the increased error variance (Johnson, 2005; Maniaci & Rogge, 2014; Meade & Craig, 2012).
Limitations
The data for this study were obtained from one university’s campus climate/victimization survey, and thus, the findings would benefit from replication with universities representing a variety of regions in the United States. In addition, the sample of respondents was primarily Caucasian, although the proportion was somewhat lower than race/ethnicity statistics for the state in which the survey was conducted (88%; U.S. Census Bureau, 2016). However, the indicators that identified invalid responders in this study are in line with indicators found with other college samples and a national sample. Generalizability of the findings may also be limited due to sampling techniques. Further investigations need to determine whether particular demographic characteristics are associated with IR.
Conclusion
These findings contribute to the investigation of IR in college samples for online surveys, especially lengthy and sensitive ones. Our analyses found that invalid responders can be detected using indicators assessing both careless (random) approaches to survey-taking and extreme responding. The invalid responders produced very high odds ratios for reporting interpersonal victimization and significantly higher means than honest responders. Different validity indicators were better at predicting classification as an honest respondent than as an invalid respondent. This exploratory study emphasizes the need for continued investigation of IR when collecting sensitive material as well as for online surveys using frequency data in addition to Likert-type responses.
Footnotes
Acknowledgments
The authors wish to thank the Office of the President at the University of Kentucky for financial support of the project as well as for institutional support.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
