Abstract
Histological evaluation of the repair tissue is a main pillar in the advancing field of experimental articular cartilage repair. Despite their widespread use, the major histological scoring systems for cartilage repair have seldom been validated. We tested the hypotheses (1) that elementary scores have a better reproducibility compared with more complex systems and (2) that the data from these different histological scores correlate with the DNA and proteoglycan contents of the repair tissue. A total of 1,165 observations of cartilage repair based on histological sections (n=233) from an experimental investigation on the repair of standardized osteochondral defects in vivo were made by three investigators with different levels of experience in cartilage research to determine the inter- and intra-observer reproducibility of elementary (Pineda and Wakitani scores) and complex (O'Driscoll, Sellers, and Fortier scores) histological grading systems. DNA and proteoglycan contents of the repair tissues from simultaneously created defects were determined and correlated with histological (a) overall score values, (b) matrix staining, and (c) cellular characteristics of the five scores. Finally, applying the proteoglycan content as the validating test, the sensitivity and specificity of the grading systems were assessed. All histological scores provided high intra- (Pearson r=0.92–0.99) and inter-observer reliability (intra-class correlation=0.94–0.99), low numerical intra- and inter-observer differences, and high internal correlations (Spearman's ρ=0.63–0.91). No disparity in reliability and reproducibility was detected between elementary and complex scores or between investigators with different levels of experience (all p>0.05). Individual histological overall score values did not correlate with proteoglycan contents but did correlate with DNA contents of the repair tissue (O'Driscoll, Wakitani, and Sellers scores). In all systems, proteoglycan contents did not correlate with matrix staining (all p>0.05), but histological cellular characteristics correlated with total cell numbers (p<0.001). These data indicate that both elementary and comprehensive histological scores are suited to quantify cartilage repair. Histological and biochemical evaluations may serve as complementary tools to assess articular cartilage repair in vivo.
Introduction

Schematic illustration of an osteochondral defect and the various parameters assessed by five major histological scoring systems for articular cartilage repair in vivo. The scores according to Pineda and Wakitani are elementary systems, reflecting not more than five parameters; the O'Driscoll, Sellers, and Fortier scores are more complex, including up to nine different parameters. Parameters “defect filling”, “matrix staining”, and “cellular characteristics” are included in all scores; “surface integrity” and “integration with adjacent cartilage” are excluded only in the Pineda score. Histological attributes such as reconstruction of the subchondral bone or the osteochondral junction as well as degeneration of the adjacent cartilage are uncommon parameters applied by only one grading system, respectively.
Since the first development of a scoring system for experimental cartilage repair by O'Driscoll and colleagues, 21 a number of other methods have been proposed, such as those established by Pineda et al., 22 Wakitani and co-workers, 23 Sellers et al., 24 or Fortier and colleagues. 25 While the Pineda and Wakitani scores are elementary systems, reflecting not more than five parameters, the O'Driscoll, Sellers, and Fortier scores are more complex, including up to nine different parameters (Figure 1). In theory, simple systems may result in a better inter- and intra-observer reproducibility due to the smaller number of less complex parameters. Conversely, the more comprehensive systems may yield a higher power of discrimination between different degrees of cartilage repair, resulting in enhanced sensitivity and specificity. However, despite their widespread use, only the systems according to O'Driscoll and Pineda have been validated so far. 26 In contrast, the reproducibility of the more complex methods, such as the Sellers and Fortier scores, has not, to our knowledge, been assessed to date.
In addition to histological scoring, the cell number and proteoglycan content may serve as additional indicators for the quality of the repair tissue.18,27,28 However, validations with biochemical parameters were not included in the original descriptions of the grading systems. Moreover, it remains unclear whether the data obtained from histological scoring systems show a relationship with quantitative biochemical characteristics of the repair tissue.
In this study, we tested the hypothesis that elementary histological scores have a superior inter- and intra-observer reproducibility compared with the more comprehensive systems. Using samples from an experimental investigation representing the different grades of cartilage repair in vivo, we determined the inter- and intra-observer reproducibility of five of the most commonly used scoring systems for the repair of cartilage defects. We further hypothesized that the data from the different histological grading systems correlate with the DNA and proteoglycan contents of the repair tissue.
Materials and Methods
Study design
One thousand one-hundred and sixty-five observations of experimental articular cartilage repair were made based on histological sections (n=233) from an experimental investigation on the effects of gene-based treatments on the repair of standardized (3.2 mm in diameter) osteochondral defects (n=48) in vivo that served to represent the different grades of articular cartilage repair. 29 First, the five histological systems described below were evaluated for intra- and inter-observer reliability, and internal correlations between the histological scores were determined. Second, the correlation between the histological evaluation and biochemical properties of the repair tissue (cell number and proteoglycan content) was evaluated. In particular, correlations between (a) staining of the extracellular matrix using the metachromatic stain safranin-O and the proteoglycan content of the repair tissue and (b) histological cellular characteristics (density, morphology, and organization) and the cell number of the repair tissue were examined. Third, we attempted to determine sensitivity and specificity of the five histological grading systems, applying the proteoglycan content as a validating test. 18
Animal experiments
All animal procedures were approved by the Saarland Governmental Animal Care Committee and performed as previously described. 29 Briefly, a combined strategy was used to improve the repair of articular cartilage defects by co-transfection of the human insulin-like growth factor I (IGF-I) and fibroblast growth factor 2 (FGF-2) gene in a xenogenic transplantation model. 29 Alginate spheres containing NIH 3T3 cells were transfected with expression plasmid vectors containing a cDNA for the Escherichia coli lacZ gene (treatment 1), the human IGF-I gene (treatment 2), or both the human IGF-I and FGF-2 genes (treatment 3) and were transplanted into standardized cylindrical osteochondral defects (3.2 mm in diameter; n=2 defects per knee) in the trochlear groove of twelve female Chinchilla bastard rabbits (Charles River; mean weight 2.9±0.3 kg) in their late juvenile stages (mean age 14 weeks). 29 Three weeks after implantation, the repair tissue from the proximal defects was retrieved and subjected to biochemical evaluation. For histological evaluations, distal femurs were fixed in 4% phosphate-buffered formalin, trimmed, and decalcified. Paraffin-embedded frontal sections (5 μm; n=9–11 per defect) were taken within approximately 1.0 mm from the center of the defects at standardized 200 μm intervals 29 to avoid biased section sampling. Staining with safranin-O/fast green was performed according to routine histological protocols. 30
Evaluation of histological sections
Serial sections of the defects were analyzed under direct light microscopy using the scoring systems described by O'Driscoll and co-workers, 21 Pineda et al., 22 Wakitani and colleagues, 23 Sellers and co-workers, 24 and Fortier et al. 25 (Tables 1–5). All sections (n=233) were independently scored twice by two individuals (A, B) and once by one investigator (C) without knowledge of the treatment groups. Observers A and B were orthopaedic surgeons, and C was a student assistant. Investigators held different levels of experience in cartilage research. Between observations, there was an interval of at least eight weeks.
Total average point values were determined for each treatment (treatments 1, 2, and 3) (Table 6) and for each joint to allow for correlation with data from biochemical evaluations. Next, average values for the histological parameters (a) metachromatic matrix staining and (b) cellular appearance of the repair tissue (cellularity, cellular morphology, and cellular organization) were determined for each joint to allow for comparison with (a) proteoglycan content and (b) cell number of the repair tissues.
Histological sections were evaluated twice by two investigators (A, B) and once by one investigator (C) with different levels of experience in cartilage research. Mean average total point values from 12 animals (n=24 defects, n=233 histological sections) receiving either treatment 1, 2, or 3 are shown. Results are based on repeated-measures mixed-model analysis of variance (ANOVA) using all five investigations. Data are given as mean±standard deviation (SD) of average total point values.
p<0.05 versus treatment 1.
p<0.05, treatment 2 versus treatment 3.
In the scoring system according to O'Driscoll, high point values indicate improved cartilage repair; in the inverse scales according to Pineda, Wakitani, Sellers, and Fortier, low total point values indicate enhanced repair. The grading system according to Sellers was the only one to determine a significant improvement in histological cartilage repair between treatments 2 and 3. However, treatment 3 always resulted in improved histological grading of the repair tissue compared with treatment 1, although statistical significance of this improvement was observed only for the Pineda, Sellers, and Fortier scores.
Biochemical evaluation of the repair tissue
DNA contents of the repair tissue (an indicator of cell number) 31 were determined as previously described. 29 Proteoglycan contents were measured by binding to dimethylmethylene blue dye.29,32–35 Measurements were performed using a GENios spectrophotometer/fluorometer (Tecan, Crailsheim, Germany).
Statistical analysis
Analysis of variance (ANOVA) was used to assess mean differences in biochemical parameters in vivo including DNA content, total cells, and proteoglycan content (Table 7) between the three treatments (1, 2, and 3) with the F-test and post-hoc least significant difference comparisons. For each investigation and overall, the five histological scoring systems were analyzed by the repeated-measures mixed-model ANOVA to assess whether mean differences in total scores were observed between the three treatments. Here, the mixed-model strategy was applied to account for treatments randomized to multiple knee joints from the same animal and to properly incorporate data from different investigators. 36 Histological scoring systems were compared by Spearman's rho (ρ) to evaluate correlation between the cartilage grading systems based on average points across all investigations. Inter-observer agreement for each scoring system was measured using the intra-class correlation coefficient (ICC) for multiple raters. 37 Intra-observer reliability (reproducibility) was evaluated by the Pearson product-moment correlation coefficient (r). Each histological scoring system was correlated with DNA content (i.e., cell number) as well as proteoglycan content based on average total score values and for sub-scores (metachromatic matrix staining and histological cellular characteristics) using Pearson r-values. Receiver operating characteristic (ROC) curve analysis was used to assess the ability of each histological scoring system in differentiating cartilage repair based on a cut-off value of >3.5 μg/mg dry weight (DW) for proteoglycan content with area under the curve (AUC) and 95% confidence intervals (CI) for judging whether the average total score for each scoring system can accurately predict the results above and below cut-off of the biochemical reference standard.18,38 Statistical analysis was performed using the SPSS software package (version 18.0, SPSS Inc./IBM, Chicago, IL). 
Two-tailed values of p<0.05 were considered statistically significant.
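As an illustration of the inter-observer agreement statistic named above, the following minimal Python sketch computes a two-way random-effects, absolute-agreement, single-rater ICC, often written ICC(2,1). The choice of this ICC form is an assumption, because the text does not state which variant was used; the function name icc_2_1 and the example ratings are hypothetical and are not data from this study.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per section and one column per rater.

    Assumption: two-way random effects, absolute agreement, single rater.
    """
    n = len(ratings)        # number of rated subjects (sections)
    k = len(ratings[0])     # number of raters
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares of the two-way layout without replication.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-subject mean square
    ms_cols = ss_cols / (k - 1)                # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical total scores: rows are sections, columns are observers A, B, C.
example_ratings = [
    [12, 13, 12],
    [18, 18, 19],
    [7, 8, 7],
    [21, 20, 21],
]
example_icc = icc_2_1(example_ratings)
```

Because this form penalizes systematic offsets between raters, it is stricter than a correlation coefficient that only measures linear association.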
Data are given as mean±standard deviation. Moderate correlation was assessed between proteoglycan content and cell number (r=0.671; p<0.001). Statistically significant:
p<0.05 versus treatment 1.
p<0.05, treatment 2 versus treatment 3.
SF, synovial fluid; DW, dry weight.
Results
Comparative evaluation of histological sections from the three treatment groups using the O'Driscoll, Pineda, Wakitani, Sellers, and Fortier score
Histological sections were scored twice by two investigators (A and B) and once by one observer (C), always applying the O'Driscoll, Pineda, Wakitani, Sellers, and Fortier scores (Tables 1–5). When cartilage repair was compared between the three treatment groups, none of the five scores determined a significant difference in the average total point values for each system, investigator, or time point (investigation) (Table 6). Based on the repeated-measures mixed-model ANOVA, all scoring systems yielded similar results for the different treatment groups (all p>0.05). In contrast, improved histological grading was obtained for defects receiving treatment 3 compared with treatment 1 for all time points by all observers and for all scoring systems. Nevertheless, this improvement reached statistical significance only for the histological grading systems according to Pineda (p=0.034), Sellers (p=0.001), and Fortier (p=0.032) (Table 6).
Comparative analysis of the DNA and proteoglycan contents of the repair tissue from the three treatment groups
The entire repair tissue from the proximal defects of each joint was analyzed for DNA and proteoglycan contents (Table 7). When comparing cell number for the three different treatments, ANOVA indicated an overall difference between the groups (F=6.84; p=0.016) with post-hoc comparisons revealing a significantly higher DNA content (i.e., cell number) with treatment 3 versus treatment 1 (p=0.005) and versus treatment 2 (p=0.036) but no difference between treatment groups 1 and 2 (p=0.439) (Table 7).
Analysis of the proteoglycan contents of the repair tissues revealed 2.85±1.31 μg/mg DW for treatment 1, 3.41±1.36 μg/mg DW for treatment 2, and 4.88±1.78 μg/mg DW for treatment 3 (Table 7). ANOVA indicated an overall difference between the three groups (F=4.18; p=0.037) with post-hoc comparisons revealing significantly higher proteoglycan content with treatment 3 versus treatment 1 (p=0.012) but no differences between treatments 2 and 3 (p=0.108) or treatments 1 and 2 (p=0.397) (Table 7).
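The F statistics reported above come from a one-way comparison of three treatment groups. The sketch below shows how such an F value is formed as the ratio of the between-group to the within-group mean square. It is an illustration only: the function name one_way_anova_f and the numbers are hypothetical, it does not compute the p-value (which requires the F distribution), and it does not reproduce the post-hoc least significant difference comparisons.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical measurements for three treatments (not data from this study).
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
toy_f, toy_df_between, toy_df_within = one_way_anova_f(toy_groups)
```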
Internal correlation between the five histological scoring systems
The internal correlations between the average total point values of the five different scoring systems from five investigations were determined using Spearman's rank correlation coefficients (Spearman's ρ). Between all systems, correlation coefficients were generally high (Table 8). The lowest correlation coefficient was found between the systems according to Sellers and O'Driscoll (ρ=−0.628; p<0.01); the highest value was detected between the systems according to Wakitani and Fortier (ρ=0.905; p<0.001).
Correlation of histological average mean point values was calculated based on Spearman's rank correlation coefficients (Spearman's ρ). Statistically significant correlation was observed between all histological grading systems (all p<0.01). Since all scores except for the system according to O'Driscoll represent inverse grading systems, correlation coefficients between O'Driscoll and other systems take negative values.
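The sign convention described here can be made concrete with a small rank-correlation sketch. The code below computes Spearman's ρ as the Pearson correlation of average ranks and shows that a scale on which high values mean better repair and an inverse scale yield a negative coefficient even when they rank the samples identically. The helper names and the score values are hypothetical and are not taken from Table 8.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical totals for five defects on a direct and an inverse scale.
direct_scale = [5, 9, 14, 18, 22]    # higher value = better repair
inverse_scale = [12, 9, 7, 4, 1]     # lower value = better repair
rho_direct_vs_inverse = spearman_rho(direct_scale, inverse_scale)
```

When comparing the strength of agreement between systems of opposite orientation, it is therefore the absolute value of ρ that matters.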
Intra- and inter-observer reliability of the five histological scoring systems
Average differences in mean total point values between the two time points were low for all scoring systems and ranged between 0.03 and 0.86 for investigator A and between 0.00 and 0.36 for investigator B (Table 9). In good agreement, average differences in mean total point values between the three investigators (A, B, and C) were also low and ranged between 0.21±0.11 and 0.78±0.60 (Table 9). For mean intra- and inter-observer differences, no significant differences were detected between the five grading systems (all p>0.05).
Intra- and inter-observer correlations indicated by Pearson correlation coefficients (r). For all histological grading systems, correlation between different time points and different investigators was high. Mean differences in total average point values of the five histological scoring systems are indicated: the mean intra-observer difference was calculated from data yielded by two investigators (A, B) at two time points, mean inter-observer difference from data provided by three investigators (A, B, and C). Data are given as mean±standard deviation (SD).
Statistically significant (with p<0.01).
Intra- and inter-observer reliability of the five histological scoring systems was assessed by determination of Pearson correlation coefficients (Pearson r). Intra-observer reliability was calculated for two investigators at different time points and was high for both (A and B). The Pineda score possessed the highest (r=0.997), and the Fortier score the lowest (r=0.919) intra-observer correlation (Table 9). We then averaged both time points for observers A and B and assessed inter-observer reliability among the three investigators (A, B, and C) for the five systems. The inter-observer reliability was very high overall, with the Sellers score reaching the highest values (ICC=0.993; p<0.001) and the Wakitani score having the lowest inter-observer reliability (ICC=0.936; p<0.001) (Table 9). No significant differences were observed for intra- and inter-observer correlation between the five histological grading systems (all p>0.05). In particular, no differences were found between elementary (Pineda and Wakitani) and complex (O'Driscoll, Sellers and Fortier) histological grading systems.
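One reason to report both a correlation coefficient and a mean numerical difference, as done here, is that the two quantities capture different things. The following sketch, using hypothetical scores from one observer at two time points, illustrates that Pearson r can equal 1 even when every second reading is shifted by one point, whereas the mean absolute difference reveals the shift. The function names are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_abs_difference(x, y):
    """Average absolute difference between paired readings."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical total scores of the same four sections, read twice.
first_reading = [10.0, 12.0, 14.0, 16.0]
second_reading = [11.0, 13.0, 15.0, 17.0]   # every score one point higher

intra_r = pearson_r(first_reading, second_reading)
intra_diff = mean_abs_difference(first_reading, second_reading)
```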
Correlation between histological and biochemical evaluations of the cartilaginous repair tissue
The internal correlation between both biochemical parameters of the repair tissue (DNA and proteoglycan content) was significant, and the internal correlation and reliability between the histological grading systems was also high. Nevertheless, the Pearson correlation coefficients between each of the five scoring systems and the biochemical reference standard (proteoglycan content) were low and not statistically significant (Table 10). In contrast, the correlation between the histological evaluation and the respective cell numbers of the repair tissues was better, with the systems according to O'Driscoll, Wakitani, and Sellers reaching statistical significance (Table 10). No tendency toward a higher correlation of a biochemical parameter with either simple (Pineda, Wakitani) or complex (O'Driscoll, Sellers, and Fortier) histological scoring systems was observed.
Correlation of histological total average point values with biochemical properties of the repair tissue (proteoglycan content: DMMB assay; cell number: Hoechst 33258), correlation of metachromatic matrix staining (safranin-O), and proteoglycan content and correlation of histological cellular appearance of the repair tissue (cellularity, cellular morphology, and organization) with cell number were determined by Pearson correlation coefficients (r). Correlation between proteoglycan content of the repair tissue and the histological (a) average total point values and (b) point values for the parameter matrix staining was not significant. However, correlation between cell number and the histological cellular appearance of the repair tissue was significant (low to moderate) for all systems. Significance in correlation with the cell number was also observed for the total histological average point values of the systems according to O'Driscoll, Wakitani, and Sellers.
Statistically significant (p<0.05).
n.d., not determined.
We next investigated the correlation between the histological grading for the parameter metachromatic matrix staining (staining of proteoglycans by safranin-O) and the proteoglycan content of the repair tissue (Table 10). The respective correlations were not significant for any of the histological grading systems applied.
Correlation between the histological cellular appearance (density, morphology, and organization) and the effective cell number as determined by biochemical analysis was then analyzed. Interestingly, the cell number correlated significantly with the histological cellular appearance for all systems (Table 10), with the scores according to O'Driscoll and Sellers showing the best correlations in terms of magnitude. Besides, the average total point values of the scoring systems according to O'Driscoll, Wakitani, and Sellers demonstrated a significant correlation with the cell number of the repair tissues.
Sensitivity and specificity of histological scoring systems
Finally, the proteoglycan contents of the repair tissue were used as the validating test for grading of the repair tissue (cut-off value: 3.5 μg proteoglycan/mg DW), against which the results of the histological analysis were compared. Accordingly and based on an ROC analysis for the five scoring systems, AUC values were calculated (Table 11). Only the complex score by O'Driscoll yielded an AUC sufficient to allow for a valid correlation with the biochemical proteoglycan content (p=0.03). The scores according to Wakitani and Sellers provided evidence of having some ability to differentiate cartilage repair quality and were on the borderline of significance. Due to low AUC values and correspondingly wide CI (Table 11), none of the other systems allowed for discrimination between “good” and “poor” quality of the repair tissue (p>0.05) when referenced to the proteoglycan content. Based on the reported AUC values, sensitivity and specificity of the histological grading systems included in this study could not be conclusively calculated. Although the comprehensive O'Driscoll scoring system provided the highest AUC values, comprehensive scores were not generally superior to simple grading systems with regard to AUC values.
Calculation of area under the curve (AUC) values was performed based on a receiver operating characteristic (ROC) analysis for five scoring systems. For validation of the histological grading systems, the proteoglycan content of the repair tissues was applied as reference standard. 18 Based on the data given in Table 7, a cut-off value of 3.5 μg proteoglycan per mg DW was set as threshold to grade the quality of the cartilage repair tissue. Only the O'Driscoll score yielded an AUC sufficient to assess the correlation with the biochemical proteoglycan content. Due to low AUC values and wide confidence intervals (CIs), none of the other histological scores allowed for correlation with the biochemical quality of the repair tissue, thus rendering calculation of sensitivity and specificity of the scoring systems unfeasible.
Statistically significant (p<0.05).
Consequently, complex scoring systems did not yield a better accuracy in discrimination compared with elementary systems.
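The ROC approach used in this section can be sketched as follows: the proteoglycan content is dichotomized at the 3.5 μg/mg DW cut-off stated in the text, and the AUC is the probability that a randomly chosen sample above the cut-off receives a better histological score than a randomly chosen sample below it. The code is a minimal illustration under these assumptions; the function name, the score values, and the proteoglycan values are hypothetical, and no confidence interval is computed. For the inverse scales, scores are negated so that higher always means better.

```python
def auc_from_scores(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative sample count as half a win.
    """
    positives = [s for s, flag in zip(scores, is_positive) if flag]
    negatives = [s for s, flag in zip(scores, is_positive) if not flag]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute an AUC")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


CUTOFF_UG_PER_MG_DW = 3.5  # reference threshold stated in the text

# Hypothetical per-joint values (not data from this study).
proteoglycan = [2.1, 3.0, 3.9, 5.2]
direct_score = [6, 13, 11, 20]     # higher = better repair
inverse_score = [12, 7, 9, 2]      # lower = better repair

good_repair = [pg > CUTOFF_UG_PER_MG_DW for pg in proteoglycan]

auc_direct = auc_from_scores(direct_score, good_repair)
# Negate inverse scores so that larger values again indicate better repair.
auc_inverse = auc_from_scores([-s for s in inverse_score], good_repair)
```

An AUC near 0.5 indicates that the score does not separate the two biochemical classes better than chance, which is the situation in which sensitivity and specificity cannot be meaningfully reported.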
Discussion
The data of the present study show that all of the five histological scores provided high intra- and inter-observer reliability and low numerical intra- and inter-observer differences with high internal correlations among all scores. Moreover, reproducibility remained constant between investigators with different levels of experience. Importantly, the individual histological overall score values did not correlate with the proteoglycan contents. These data indicate that elementary and more comprehensive histological scoring systems are equally suited to accurately evaluate articular cartilage repair. The findings, therefore, suggest that histological and biochemical evaluations are complementary tools to assess articular cartilage repair in vivo.
Although several grading systems for in vitro engineered cartilaginous tissue39–41 as well as osteoarthritic cartilage have been described20,42–45 and validated,46–49 validation of scoring systems for experimental in vivo cartilage repair strategies has not been forthcoming. 26 Only the histological/histochemical grading system (HHGS) for osteoarthritic cartilage 42 and a histological grading system for in vitro tissue-engineered neocartilage 39 have been correlated with the respective proteoglycan contents to date. In 1971, Henry Mankin et al. assessed the relation between histological results (HHGS) and both DNA and polysaccharide content. 42 While there was an internal correlation between both biochemical parameters, the histological data correlated only with polysaccharide content. 42 In accordance, Grogan et al. recently reported that a visual evaluation system for in vitro tissue engineered cartilage correlated with computer-based histomorphometry and glycosaminoglycan content. 39 In the present study, histological results from all five scores did not correlate significantly with the biochemical cartilage matrix composition. Whether these differences result from the underlying experimental setup (articular cartilage repair in a lapine model of an osteochondral defect studied here versus human osteoarthritic cartilage or tissue engineered cartilage) remains to be determined.
In contrast, the DNA contents (i.e., cell numbers) of the repair tissues correlated well with the average total point values (O'Driscoll, Wakitani, and Sellers scores) and all histologically assessed cellular characteristics. However, the quality of cartilage repair tissues in osteochondral defects may not be sufficiently reflected by their DNA content, 42 because the nature of the cells is not reflected in the DNA content: besides articular chondrocytes, progenitor cells, 50 synoviocytes, 51 and inflammatory cell infiltrates52,53 may also be present in the repair tissue. Hence, the proteoglycan content was chosen as a validating test for the calculation of sensitivity and specificity of the scoring systems, 18 although no significant correlation emerged. It remains to be seen whether additional objective analyses such as biomechanical tests, 54 as well as macroscopic assessments, 55 non-destructive imaging,56,57 or computerized histomorphometry,40,58,59 may further enhance the value of validation.
Interestingly, our hypothesis that the elementary scoring systems22,23 have a better reproducibility than complex systems21,24,25 was proved to be incorrect: Despite the different levels of experience in cartilage research, all investigators obtained similar results for each grading system. Specifically, neither training level, the complexity of the system, nor the presence of cartilage-specific histomorphometric parameters (e.g., chondrocyte clustering or tidemark formation) influenced the reproducibility, in good agreement with previous findings.26,48 Thus, the capability to perform a precise histological analysis of articular cartilage repair does not depend on the degree of experience of the investigator, whichever scoring system is used.
The two histological scoring systems developed by the International Cartilage Repair Society60,61 were not included here, because they are designed for the evaluation of biopsies of human articular cartilage repair tissue. Since such core biopsies ideally are taken from the centre of the defect, 60 they do not permit judgment of parameters such as integration or an overall assessment of the entire repair tissue. Moreover, parameters are either reported separately without yielding an overall point value 60 or scored using a visual analogue scale, 61 rendering a comparison to categorical numerical grading systems difficult.
Complex scoring systems provide more descriptive information about the nature of the repair tissue, especially about structural and cellular characteristics or the presence of degenerative changes.19,21 Interestingly, the differences between the three treatments investigated here were best reflected by the Sellers score, in good agreement with previous studies,29,35,62,63 while the other scoring systems did not illustrate the extent of such differences. Therefore, the Sellers score may be of particular value to detect minute differences between treatment groups. Intriguingly, complex systems do not correlate better with biochemical data than basic systems, and the high internal correlation coefficients between all scores (absolute values of Spearman's ρ 0.628–0.905) confirm the validity of basic systems compared with more detailed scores. These findings support the continuing value of such grading systems.
Conclusions
In summary, all of the five histological grading systems evaluated here provide high reliability and reproducibility in the evaluation of articular cartilage repair. Both elementary and comprehensive histological scoring systems are suited to quantify articular cartilage repair. Histological and biochemical evaluations are complementary tools to assess experimental articular cartilage repair in vivo.
Footnotes
Acknowledgment
The authors thank Yasmin Lindner for performing one evaluation of the histological sections. They also thank E. Kabiljagic for expert technical assistance during animal experiments.
Disclosure Statement
No competing financial interests exist for all authors.
