Abstract
Many large-scale cross-national studies rely on a single-item measurement when comparing prevalence rates of traditional bullying, traditional victimization, cyberbullying, and cyber-victimization between countries. However, the reliability and validity of single-item measurement approaches are highly problematic, and results based on them might be biased. Data from three countries were used as an example case to compare the single- and multiple-item approaches from a substantial and a statistical point of view. The sample comprised 671 Austrian (46.3% girls), 691 Cypriot (45.9% girls), and 604 Romanian (46.7% girls) 12-year-old students. Data were collected via self-assessments with single and multiple items. Because scalar measurement invariance could be established for the multiple-item measurement approaches, latent means between the three countries were compared. Substantial results of the single- and multiple-item approach did not differ for traditional bullying and traditional victimization, but differed for cyberbullying and cyber-victimization. As a consequence, we suggest using carefully validated multiple-item scales for cross-national comparisons.
Large-scale representative studies like the HBSC study (Currie et al., 2012) or the EU Kids Online study (Livingstone, Haddon, Görzig, & Ólafsson, 2011) demonstrate that there are cross-national differences regarding the prevalence rates of bullying and cyberbullying among children and adolescents. It is tempting to post-hoc attribute such cross-national differences to broad country-level cultural characteristics like, e.g., individualism and collectivism (Bergmüller, 2013; Hofstede, 2001). However, as long as the challenge of adequate measurement has not been solved satisfactorily, such ‘culture’ based interpretations are misleading. Large-scale representative cross-national studies often use a single-item approach to measure aggressive behavior, bullying, or cyberbullying (e.g., HBSC survey: Currie et al., 2012; EU Kids Online study: Livingstone et al., 2011; TIMSS study: Bergmüller, 2013), although a single-item measurement approach has several disadvantages compared with a multi-item approach.
Thus, the main goals of the present paper are to demonstrate that (1) single versus multiple-item measurement approaches show different levels of cross-national construct validity and (2) that they produce different substantial results.
Challenges of Cross-National Measurement
The most often used method to measure traditional bullying and cyberbullying is via self-assessments (e.g., Berne et al., 2013; Solberg & Olweus, 2003). Self-assessments reveal information on subjective experiences which are privately felt and which need not necessarily be verified by other informants (Graham, Bellmore, & Juvonen, 2003). Within self-assessments, single-item and multiple-item measurement approaches can be distinguished. While the single-item approach directly addresses the involvement in bullying or cyberbullying during a certain period of time (e.g., during the past couple of months), the multiple-item approach addresses the involvement in several concrete behaviors (e.g., hitting, teasing, etc.) considered major forms of the bullying or cyberbullying construct. A research-based definition often precedes both the single- and the multiple-item approaches (e.g., Kärnä et al., 2011; Olweus, 1991; Roland, 1989; Smith & Sharp, 1994). Conceptually, the single-item approach assumes that the broad concept of bullying, including hostile intent, repetition, power imbalance, and its various forms, can be understood easily by all respondents. Research, however, showed that this might not be the case for children of all age groups (Monks & Smith, 2006; Vaillancourt et al., 2008) and countries (Smith, Cowie, Olafsson, & Liefooghe, 2002; Strohmeier, Yanagida, & Toda, 2016). For instance, eight-year-old children consider fewer negative behavior options to be bullying compared with 14-year-old adolescents, who have a broader understanding of the concept of bullying (Smith et al., 2002). Furthermore, a concept which means exactly the same as the English term “bullying” does not exist in all languages (Strohmeier et al., 2016).
Consequently, results of the single-item approach are also dependent on the concrete terms used in the concrete languages (e.g., Smith et al., 2008). Smith et al. (2008) showed that terms used in different countries differed remarkably regarding their meanings; some terms captured verbal aggression, while others connoted physically aggressive acts or social exclusion. Moreover, when using a single-item approach it is not possible to investigate the equivalency of the constructs between countries, which is a crucial precondition for any statistically valid comparison between them. From a statistical point of view, the multiple-item approach offers the possibility to rigorously test the cross-national construct validity and reliability of the measure, because with this approach it is possible to consider the level of measurement (metric vs. ordered-categorical), weighting (equally weighted vs. weighted according to factor loadings), and measurement error (without vs. with accounting for measurement error) appropriately. Using a multiple-item approach and testing the cross-national construct validity is especially important for cyberbullying, which is a relatively new and rare behavior compared to traditional bullying. Research already showed that prevalence rates of cyberbullying, but not of traditional bullying, are underestimated with a single-item approach compared to a multiple-item approach (Gradinger, Strohmeier, & Spiel, 2010). Also, establishing measurement invariance between countries can be difficult for cyberbullying but not for traditional bullying, even when using multiple-item approaches (Strohmeier, Aoyama, Gradinger, & Toda, 2013).
To summarize, researchers have to be careful about the meanings of terms when conducting a cross-national study. This is important because concepts equivalent to “bullying” and “cyberbullying” do not exist in all languages, and therefore items containing these terms are likely to measure different concepts in different countries. Instead of using such ambiguous and difficult-to-translate terms in questionnaires, it might be preferable to use very specific behavioral descriptions. From a statistical point of view it is also advisable not to rely solely on a single-item approach, but to use multiple behavior-based items to be able to rigorously test the construct validity between countries. This might especially be true for cyberbullying, which is a new and rarer phenomenon than traditional bullying. Finally, the multi-item approach might also help improve the conceptual understanding of bullying and cyberbullying, because definition and measurement are intertwined issues (Menesini & Nocentini, 2009).
The Present Study
The overall goal of the present study was to investigate whether single-item versus multiple-item measurement approaches yield different cross-national results regarding prevalence rates of traditional bullying, traditional victimization, cyberbullying, and cyber-victimization. Data were collected in Austria, Cyprus, and Romania. We used these three countries as an example case to demonstrate that measurements should be equivalent when making cross-national comparisons. In several large-scale cross-national studies (like the HBSC study) data from a large number of countries are compared without rigorously checking whether the measurements used in the countries are statistically equivalent. Thus, it was our main goal to use three countries as an example case to compare the single- and multiple-item approaches from a substantial and a statistical point of view. This is important given that the same Social Competence Program (Strohmeier, Hoffmann, Schiller, Stefanek, & Spiel, 2012) has been implemented in the three countries and that it is important to know that different program effects found in the three countries are not due to biased measurements (Gradinger, Yanagida, Strohmeier, & Spiel, 2016; Solomontos-Kountouri, Gradinger, Yanagida, & Strohmeier, 2015; Trip et al., 2015; Yanagida, Strohmeier, & Spiel, submitted). Based on single-item approaches, Austria and Romania have rather high levels of traditional bullying. For instance, according to the HBSC study published in 2012, both Austria and Romania were among the top ten countries with the highest bullying rates, while in Cyprus HBSC data were not collected (Currie et al., 2012). According to the EU Kids Online study, Romania is also high in cyber-victimization, while Austria and Cyprus have lower rates (Livingstone et al., 2011).
We acknowledge that “bullying” is a difficult term to translate as a concept which means exactly the same as the English term “bullying” does not exist in all languages. Therefore we suggest using very clear behavioral descriptions to avoid misleading translations. Thus, we think there is no convincing rationale to expect “cultural” differences between Austria, Romania and Cyprus. Instead, we want to examine whether the bullying/cyberbullying scales are equally valid in the three countries. As cyberbullying is a new and rarer concept than traditional bullying, the testing of measurement equivalence between countries is especially important and might be most problematic (Strohmeier et al., 2013). In addition, it is expected that the substantial results of cyberbullying and cyber-victimization might be more distorted through single-item measurements than with multiple-item measurements (Gradinger et al., 2010).
In a first step, measurement invariance for the multiple-item measurement approaches of traditional bullying, traditional victimization, cyberbullying, and cyber-victimization was examined for the three countries. In the literature, several levels of measurement invariance are discussed (e.g., Chen, 2008), which were tested here in a step-wise fashion.
In a second step, latent means of the multiple-item measurement approaches of traditional bullying, traditional victimization, cyberbullying, and cyber-victimization were compared between Austria, Cyprus, and Romania based on scalar measurement invariance.
In a third step, prevalence rates of traditional bullying, traditional victimization, cyberbullying, and cyber-victimization based on single-items were compared between Austria, Cyprus, and Romania.
In a fourth step, prevalence rates of traditional bullying, traditional victimization, cyberbullying, and cyber-victimization based on manifest single-items and latent means of multiple-items were compared between the three countries.
Method
Participants
For the present study, a sub-sample of 12-year-old students was selected from a larger cross-national data set. The final sample comprised 1,966 students enrolled in 35 schools and 138 classes in three countries. There were 671 Austrian (46.3% girls), 691 Greek Cypriot (45.9% girls), and 604 Romanian (46.7% girls) students who participated in the pretest of an evaluation study for the ViSC Social Competence Program (Strohmeier et al., 2012a) in Austria (Yanagida et al., submitted), Cyprus (Solomontos-Kountouri et al., 2015), and Romania (Trip et al., 2015).
Procedure
Data collection was completed during one regular school hour in the schools under the supervision of two trained research assistants. Participation in the data collection was based on active parental and child consent. Prior to data collection students were assured that their participation was voluntary and that their answers would be kept confidential.
Missing Data
In total, 1.0% of the data were missing, stemming from 63 incomplete records. Listwise deletion was used for the analyses based on the single-item approach. This method is justifiable given the small number of cases lost due to missing data, i.e., 11 cases (0.6%) for traditional bullying, 7 cases (0.4%) for traditional victimization, 16 cases (0.8%) for cyberbullying, and 4 cases (0.2%) for cyber-victimization (Graham, 2009). As for the multiple-item approach, pairwise deletion was used, which is the default setting of Mplus (Muthén & Muthén, 1998–2012) when using the robust weighted least squares estimator (WLSMV).
Measures
Traditional bullying, traditional victimization, cyberbullying and cyber-victimization were measured with four scales. Each scale consists of several items which cover different forms of the constructs. Items of all four scales are shown in the Appendix. Answers to all questions were given on a five-point response scale: 0 (not at all), 1 (once or twice), 2 (two or three times a month), 3 (once a week), and 4 (nearly every day). The items covered a time span of two months. Due to low cell frequencies, the response categories 2 (two or three times a month), 3 (once a week), and 4 (nearly every day) were collapsed into a single category labeled at least two or three times per month. The items were translated and back-translated from German to Greek and Romanian by two bilingual speakers of each language.
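The collapsing of the upper response categories described above can be illustrated with a minimal Python sketch. The function name `collapse_response` and the example responses are hypothetical and are not taken from the original analyses.

```python
# Illustrative sketch only: recode the five-point frequency scale (0-4)
# into three ordered categories, as described in the Measures section.
# The helper name and the example data are assumptions for illustration.

COLLAPSED_LABELS = {
    0: "not at all",
    1: "once or twice",
    2: "at least two or three times per month",
}


def collapse_response(raw: int) -> int:
    """Map a raw 0-4 response to the collapsed 0-2 coding."""
    if raw not in (0, 1, 2, 3, 4):
        raise ValueError(f"response out of range: {raw!r}")
    # Categories 2, 3, and 4 all become the top category 2.
    return min(raw, 2)


raw_responses = [0, 1, 2, 3, 4, 1, 0]  # made-up example data
collapsed = [collapse_response(r) for r in raw_responses]
print(collapsed)
print([COLLAPSED_LABELS[c] for c in collapsed])
```

After recoding, each item has three ordered categories, so two thresholds per item are estimated in the ordered-categorical factor model.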
Bullying Perpetration and Bullying Victimization. The self-reported scales consist of one global item and three specific items covering different forms (physical, relational, and verbal) of bullying and victimization. The term “bullying” was not used; instead, very specific behavioral descriptions were used, which are provided in the Appendix (Strohmeier, Gradinger, Schabmann, & Spiel, 2012). Cronbach’s α coefficients were 0.77/0.70/0.77 (Austrian/Cypriot/Romanian) for the bullying perpetration scale and 0.81/0.75/0.81 (Austrian/Cypriot/Romanian) for the bullying victimization scale.
Cyberbullying and Cyber-Victimization. Self-reported cyberbullying and cyber-victimization were each measured with a global item and seven specific items related to different electronic means based on Smith et al. (2008). Again, the term “bullying” was not used, but very specific behavioral descriptions were utilized. Cronbach’s α coefficients were 0.87/0.76/0.67 (Austrian/Cypriot/Romanian) for the cyberbullying scale and 0.85/0.86/0.69 (Austrian/Cypriot/Romanian) for the cyber-victimization scale.
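For readers less familiar with the reliability coefficient reported above, the following minimal Python sketch computes Cronbach’s α from a respondents-by-items matrix using the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of the sum score). The function name and the toy data are assumptions for illustration; the sketch does not reproduce the coefficients reported for the present scales.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the unweighted sum score.
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("sum score has zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up responses of six students to four items (coded 0-2).
toy_data = [
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [2, 1, 2, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 2],
    [2, 2, 2, 2],
]
alpha = cronbach_alpha(toy_data)
print(round(alpha, 3))
```

Note that α weights all items equally and treats them as metric, which is one reason why the latent variable approach with ordered-categorical indicators is used for the main analyses.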
Results
Step 1: Multiple-Item Approach - Factorial Invariance of Measurement Models between Countries
In cross-cultural research, measurement invariance is a critical issue and a precondition to make valid comparisons across groups. In the literature, four levels of measurement invariance are discussed: (1) configural or factor-form invariance, (2) metric or factor loading invariance, (3) scalar or intercept invariance, and (4) strict or residual invariance (Chen, 2008). At least scalar invariance is needed to make meaningful comparisons of latent means across groups. According to Little (2013), however, testing for strict invariance has dubious theoretical grounds because it is not reasonable to assume that the amount of random error present in each indicator across groups is the same. Thus, we test for configural, metric, and scalar invariance, but do not test for strict invariance.
As shown in Figure 1 (Panel A), the measurement models for traditional bullying and traditional victimization consist of four indicators for each factor, whereas the measurement models for cyberbullying and cyber-victimization (Panel B) consist of eight indicators for each factor.
Items used as indicators were not continuous, but ordered-categorical. In addition, the highly positively skewed nature of the item response distribution makes a statistical approach based on normal theory inappropriate (Muthén & Kaplan, 1985). Thus, our subsequent analyses are based on a common factor model with ordered-categorical indicators (see Bovaird & Koziol, 2012). Model specification and identification were based on Millsap & Yun-Tein (2004) using theta parameterization and a robust weighted least squares estimator (WLSMV). In the case of ordered-categorical indicators, c – 1 thresholds per indicator are estimated in place of the intercept, where c denotes the number of categories (Edwards, Wirth, Houts, & Xi, 2012). Threshold parameters denote the location of the cut points on the latent trait, where respondents transition from a lower response category to the next higher response category (Bovaird & Koziol, 2012).
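To make the notion of thresholds more concrete, the following Python sketch computes the c – 1 sample thresholds of a single item from its observed category proportions, assuming a standard normal latent response variable. This is a simplified single-group, marginal computation with made-up data; the thresholds in the present study were estimated in Mplus within a multiple-group model, so the function name and all values here are illustrative assumptions only.

```python
from statistics import NormalDist


def sample_thresholds(responses, n_categories):
    """Return the c - 1 cut points on a standard normal latent response."""
    n = len(responses)
    counts = [0] * n_categories
    for r in responses:
        counts[r] += 1
    thresholds = []
    cumulative = 0
    for k in range(n_categories - 1):
        cumulative += counts[k]
        p = cumulative / n  # proportion responding in category k or lower
        if not 0 < p < 1:
            raise ValueError("empty category: threshold is not finite")
        thresholds.append(NormalDist().inv_cdf(p))
    return thresholds


# Made-up, positively skewed item with three categories (0, 1, 2).
item = [0] * 70 + [1] * 20 + [2] * 10
taus = sample_thresholds(item, n_categories=3)
print([round(t, 3) for t in taus])
```

With a skewed item like this one, both thresholds lie above zero, which reflects that most respondents fall into the lowest category.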
In order to test for measurement invariance, a series of confirmatory factor analyses (CFAs) was conducted in Mplus version 7.3 (Muthén & Muthén, 1998–2012) to compare a hierarchical series of models. Configural invariance was estimated by a multiple group model in which factor loadings, thresholds, and residuals were freely estimated and allowed to differ between Austria, Cyprus, and Romania. Metric invariance was estimated by a multiple group model in which the factor loadings were freely estimated but constrained to be equal between the three groups, while the thresholds and residuals were allowed to differ between Austria, Cyprus, and Romania. Scalar invariance was estimated by a multiple group model in which the factor loadings and thresholds were freely estimated but constrained to be equal between the three groups, while the residuals were allowed to differ between Austria, Cyprus, and Romania.
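The three nested models can be summarized in a generic notation for a factor model with ordered-categorical indicators. The symbols below are assumptions introduced for illustration (student i, item j, country g, category k); identification constraints, which follow Millsap & Yun-Tein (2004) in the present analyses, are omitted from this sketch.

```latex
% Latent response model for item j of student i in country g
y^{*}_{ijg} = \lambda_{jg}\,\eta_{ig} + \varepsilon_{ijg},
\qquad \operatorname{Var}(\varepsilon_{ijg}) = \theta_{jg}

% Observed category k = 0, \dots, c-1 via c-1 thresholds
% (with \tau_{jg,0} = -\infty and \tau_{jg,c} = +\infty)
y_{ijg} = k \iff \tau_{jg,k} < y^{*}_{ijg} \le \tau_{jg,k+1}

% Configural: same factor structure; \lambda_{jg}, \tau_{jg,k}, \theta_{jg} differ by country
% Metric:     \lambda_{jg} = \lambda_{j} \quad \text{for all } g
% Scalar:     \lambda_{jg} = \lambda_{j} \ \text{and} \ \tau_{jg,k} = \tau_{j,k} \quad \text{for all } g
```

Each level adds equality constraints to the previous one, which is why the models form a nested series whose fit can be compared step by step.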
In Table 1, fit indices for the hierarchical series of models are presented. The path diagram for the unstandardized solution of the measurement model assuming scalar invariance is depicted in Fig. 1, Panel A for traditional bullying and traditional victimization and Panel B for cyberbullying and cyber-victimization.
In order to evaluate whether the assumption of invariance is tenable, differences in CFI and RMSEA were considered. It has been suggested that a difference in CFI of more than 0.01 (Cheung & Rensvold, 2002) and a difference in RMSEA of more than 0.01 (Chen, 2007) indicate a meaningful decrease in model fit, making the invariance assumption not reasonable. Note that we consider neither the chi-square test statistic nor the chi-square difference test statistic for model evaluation and comparison since they are known for being too sensitive to large sample sizes (see Meade, Johnson, & Braddy, 2008).
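The decision rule described above can be written as a short Python sketch. The function name and the fit values in the example are hypothetical and are not the values from Table 1. As a further assumption, the sketch treats the invariance assumption as untenable if either the CFI or the RMSEA criterion is exceeded.

```python
def invariance_tenable(cfi_less, cfi_more, rmsea_less, rmsea_more, cutoff=0.01):
    """Compare a less constrained model with the next, more constrained one.

    Returns True if neither the drop in CFI nor the rise in RMSEA
    exceeds the cutoff (0.01 by default).
    """
    cfi_drop = cfi_less - cfi_more        # positive value = worse fit
    rmsea_rise = rmsea_more - rmsea_less  # positive value = worse fit
    return cfi_drop <= cutoff and rmsea_rise <= cutoff


# Hypothetical fit indices for a metric vs. a scalar model.
small_change = invariance_tenable(0.970, 0.965, 0.034, 0.035)
large_change = invariance_tenable(0.980, 0.950, 0.030, 0.035)
print(small_change, large_change)
```

Differences that fall almost exactly on the cutoff should be inspected by hand, because rounding of the reported fit indices can tip the comparison either way.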
As shown in Table 1, no meaningful decrease in model fit between the hierarchically nested models was detected, either for the measurement model of traditional bullying and traditional victimization or for the measurement model of cyberbullying and cyber-victimization. Moreover, the final model assuming scalar measurement invariance showed good model fit for traditional bullying and traditional victimization (χ2(81) = 295.873, p < 0.001, CFI = 0.973 and RMSEA = 0.064) and cyberbullying and cyber-victimization (χ2(367) = 661.830, p < 0.001, CFI = 0.965 and RMSEA = 0.035).
It can be concluded that factor loadings and thresholds of the measured variables are invariant between Austrian, Cypriot, and Romanian students and therefore the precondition for cross-national comparisons of (latent) means is met (Little, 2013).
Step 2: Multiple-Item Approach - Comparisons based on Latent Means
A latent variable approach based on the measurement model with ordered-categorical indicators under scalar measurement invariance was used to compare latent means of traditional bullying, traditional victimization, cyberbullying, and cyber-victimization between Austria, Cyprus, and Romania. The latent mean in the Austrian sample was constrained to 0 to compare Austria vs. Cyprus and Austria vs. Romania. For the comparison of Cyprus vs. Romania, the latent mean in the Cypriot sample was constrained to 0. Analyses were conducted in Mplus version 7.3 (Muthén & Muthén, 1998–2012) using the robust weighted least squares estimator (WLSMV).
Step 3: Single-Item Approach - Comparisons of Percentages
A non-parametric approach was chosen because items were not continuous, but ordered-categorical. In addition, this approach requires no distributional assumptions. Therefore, a series of Kruskal-Wallis tests was applied to investigate cross-national differences in the distribution of prevalence rates. Subsequently, pairwise comparisons were conducted using two-sample Wilcoxon tests with continuity correction and Bonferroni-Holm (Holm, 1979) correction for multiple comparisons. Analyses were conducted in R version 3.2.0 (R Core Team, 2015).
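The Bonferroni-Holm step-down correction applied to the three pairwise country comparisons can be sketched in a few lines of Python. The function name and the p-values are made up for illustration. In R, `p.adjust` with `method = "holm"` performs the same adjustment.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, etc.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


comparisons = ["Austria vs. Cyprus", "Austria vs. Romania", "Cyprus vs. Romania"]
raw_p = [0.01, 0.04, 0.03]  # hypothetical unadjusted p-values
adjusted_p = holm_adjust(raw_p)
for label, p in zip(comparisons, adjusted_p):
    print(f"{label}: adjusted p = {p:.3f}")
```

In this made-up example only the first comparison remains below 0.05 after adjustment, although all three unadjusted p-values were below that level.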
The results of the Kruskal-Wallis tests revealed that prevalence rates (see Table 2) differ between the three countries for traditional bullying (χ2(2) = 62.461, p < 0.001), traditional victimization (χ2(2) = 19.209, p < 0.001), cyberbullying (χ2(2) = 6.611, p < 0.05) and cyber-victimization (χ2(2) = 11.610, p < 0.01).
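The χ2(2) statistics reported above are Kruskal-Wallis H values, which are referred to a chi-square distribution with k − 1 = 2 degrees of freedom for k = 3 countries. The following Python sketch shows how H is computed with mid-ranks and a tie correction, and how the p-value follows for the special case of two degrees of freedom, where the chi-square survival function reduces to exp(−H/2). Function names and the toy data are assumptions for illustration; the original analyses were run in R.

```python
import math


def midranks(values):
    """Return average (mid) ranks and the sizes of all tie groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for i in range(pos, end + 1):
            ranks[order[i]] = average_rank
        tie_sizes.append(end - pos + 1)
        pos = end + 1
    return ranks, tie_sizes


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of groups of scores."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks, tie_sizes = midranks(pooled)
    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * weighted_rank_sums - 3 * (n + 1)
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def p_value_df2(h):
    """Chi-square upper-tail probability, valid only for 2 degrees of freedom."""
    return math.exp(-h / 2)


h_toy = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # made-up data
p_cyberbullying = p_value_df2(6.611)   # H reported for cyberbullying
p_cybervictim = p_value_df2(11.610)    # H reported for cyber-victimization
print(round(h_toy, 3), round(p_cyberbullying, 4), round(p_cybervictim, 4))
```

Applied to the reported statistics, this closed form is consistent with the significance levels given in the text (p < 0.05 for cyberbullying and p < 0.01 for cyber-victimization).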
As shown in Table 3, the substantial results of the single- and multiple-item approach did not differ for traditional bullying and traditional victimization, but they did differ for cyberbullying and cyber-victimization.
Independent of measurement approach, the prevalence of traditional bullying and victimization in Austria was higher compared with both Cyprus and Romania. Moreover, prevalence rates of traditional bullying were higher in Cyprus compared with Romania, while no differences regarding traditional victimization between Cyprus and Romania were found.
Cyberbullying in Austria was higher compared with both Cyprus and Romania which did not differ from each other according to the multiple-item measurement approach. However, when applying a single-item measurement approach the difference between Austria and Romania was not statistically significant anymore.
All results changed depending on the measurement approach for cyber-victimization (see Table 3).
In order to investigate the differences in results between single-item and multiple-item approaches, we conducted two-sample Wilcoxon tests comparing Cyprus and Romania for all items of the cyber-victimization scale. Results showed that cyber-victimization was higher in Cyprus in the items call (M Rank = 969.3, SD Rank = 305.8, W = 215420, p < 0.05) compared to Romania (M Rank = 932.6, SD Rank = 254.6) and chat (M Rank = 984.3, SD Rank = 237.5, W = 202590, p < 0.05) compared to Romania (M Rank = 961.6, SD Rank = 193.1).
Discussion
The present study focused on the cross-national comparability of measurements of traditional bullying, traditional victimization, cyberbullying, and cyber-victimization and compared single- and multiple-item measurement approaches from a substantial and a statistical point of view. The present study utilized a large data set collected in three countries: Austria, Cyprus, and Romania.
Although single-item measurement approaches are frequently used in large-scale cross-country studies like HBSC, EU Kids Online, or TIMSS, they are highly problematic. To begin with, substantial results are highly dependent on the concrete terms used in the concrete languages (e.g., Smith et al., 2002). Moreover, it is not possible to investigate the equivalency of the constructs between countries, which is a precondition for any statistically valid comparison between them (Chen, 2008; Little, 2013). From a statistical point of view, it is difficult to appropriately consider the level of measurement (metric vs. ordered-categorical), weighting (equally weighted vs. weighted according to factor loadings) and measurement error (without vs. with accounting for measurement error) in the analyses using a single-item approach.
In the present study, the equivalence of measurement between the three countries Austria, Cyprus, and Romania could be established for all four scales: traditional bullying, traditional victimization, cyberbullying, and cyber-victimization. This is important as prior research showed problems in establishing measurement equivalence for cyberbullying but not for traditional bullying between eastern and western countries like Japan and Austria (Strohmeier et al., 2013). In all three countries the multiple-item scales were equivalent, allowing for cross-national comparisons of mean level differences. Using multiple items and ensuring measurement equivalence is especially important for cyberbullying, as substantial results like prevalence rates depend more strongly on the number of items for cyberbullying than for traditional bullying (Gradinger et al., 2010). In the present study, a cross-nationally valid measurement of traditional bullying, traditional victimization, cyberbullying, and cyber-victimization was established, based on ordered-categorical items, with weighted factor loadings and accounting for measurement error.
In line with our expectations, the substantial results of the single- and multiple-item approach did not differ for traditional bullying and traditional victimization, but they did differ for cyberbullying and cyber-victimization. This pattern was already found in prior research examining single (global) and multiple (specific) items as well as different cut-off scores for prevalence rates of traditional bullying and cyberbullying (Gradinger et al., 2010; Nocentini, Menesini, & Calussi, 2009). The substantial results of the multiple-item measurements are more trustworthy, as they were established to be cross-nationally valid, accounted for measurement error, and treated items based on their weighted importance (factor loadings). Our results demonstrate that for a new and rather rare phenomenon like cyberbullying, it is necessary to use multiple specific items. Therefore, results of single-item measurements, even if they come from large-scale comparison studies, should be interpreted with caution.
These inconsistent results between measurement methods raise the question of whether it is wise to post-hoc attribute level differences of aggressive behavior, bullying, or cyberbullying between countries, based on single-item measurements, to ‘cultural’ characteristics of countries. Such ‘culture’ based interpretations might be highly misleading, because countries differ on many characteristics, such as educational policies. For instance, there is quite a large variability regarding the levels of traditional bullying and traditional victimization among mainly ‘individualistic’ countries according to the HBSC study (Currie et al., 2012, p. 194). For many years, Austria has consistently been identified as one of the countries with comparatively high prevalence rates of traditional bullying and physical fighting. Thus, to attribute the higher levels of traditional bullying, traditional victimization, and cyber-victimization in Austria compared with Cyprus and Romania to its more ‘individualistic’ culture is probably not warranted (Hofstede, 2001).
Given the present evidence, it is difficult to fully explain these inconsistent results. One explanation is that multiple items tend to yield more reliable measurement, which results in larger effect sizes (Kowalski, Giumetti, Schroeder, & Lattanner, 2014). Looking at the concrete items, Cyprus had higher levels of cyber-victimization than Romania based on the specific items measuring calls and chat contributions. Participants might not think of these forms of cyber-victimization while answering the single global item of cyber-victimization. This might be the reason why the results are actually reversed when comparing the single- and multiple-item measurement for cyber-victimization. In sum, the present data indicate that the decision between single- and multiple-item measurement approaches for investigating cross-national differences regarding traditional bullying, traditional victimization, cyberbullying, and cyber-victimization is important. In future studies, we recommend using carefully validated multiple-item scales for cross-national comparisons.
Limitations
To begin with, our study relied on self-assessments only. Although natural observations or peer nominations are also important (Pellegrini & Bartini, 2000), most studies on traditional bullying and cyberbullying use self-assessments, which are also recommended for reporting prevalence rates for bullying (Solberg & Olweus, 2003). A second limitation is that our data were not representative, which limits the national generalizability of our findings. However, the data sets collected in the three countries were highly comparable, because all schools volunteered to take part in a one-year social competence program. Moreover, to avoid age bias, same-age sub-samples of 12-year-olds were selected in each country. A third limitation is that our results are limited to the set of countries investigated in the present study. Hence, present findings should be replicated using a larger sample of countries including similar and dissimilar rates of bullying and victimization to enhance generalizability.
Funding
The implementation and evaluation of the ViSC program in Austria was funded by the Austrian Federal Ministry for Education, Arts and Cultural Affairs (PI: Christiane Spiel) between 2008 and 2011. The data analyses and writing of the present study were funded by the Platform for Intercultural Competences, University of Applied Sciences Upper Austria (PI: Dagmar Strohmeier) between 2012 and 2016. In Romania, the work on this project was possible thanks to a bilateral travel grant (Romania-Austria) during 2012–2013. In Austria, the grant was funded by the OEAD (RO15/2012) within the WTZ Programm Österreich-Rumänien.
Appendix
Acknowledgments
We are very grateful to the whole ViSC project team in Austria consisting of Eva-Maria Schiller, Elisabeth Stefanek, Petra Gradinger, Christoph Burger, Bianca Pollhammer, Katharina Derndarsky, Marie Therese Schultes and Christine Hoffmann for their invaluable work during the intervention study. We also want to thank the ViSC coaches and teachers who implemented the program in the schools and classes. We thank all schools and students who participated in this study. We also wish to thank Elisabeth Stefanek for taking part in the WTZ Program and Cami Steiner for her invaluable help with translating the ViSC manual from German into Romanian.
