Rater Stability in a High-Stakes Performance Assessment: A Longitudinal Investigation
McNaughton, Tara M
The certification of medical practitioners frequently includes a performance assessment to ensure competence. Although such assessments offer richer evaluations of examinee performance than other exam types, the reliance on expert judgement in evaluating examinees raises concerns. The subjective nature of the rating task may allow factors unrelated to examinee performance to influence ratings, and raters may hold idiosyncratic perceptions of performance levels. To assess inter- and intra-rater differences, I used the Many-Facet Rasch Measurement (MFRM) model to quantify rater severity and rating scale category use. Applying a partial credit model to the rater facet, I used each rater's category thresholds to calculate a category breadth measure that identifies central tendency and extremism. This method compares favorably with other indices used to detect these rater effects: it flags a slightly larger proportion of raters as exhibiting effects while providing more precise feedback to raters and rater trainers. Using hierarchical linear models, I then assessed the longitudinal stability of rater severity and consistency measures. Most raters demonstrated stable severity; however, a sizable minority did not. Caution is therefore warranted when using rater severity in common element equating designs. Conversely, nearly all raters demonstrated stable consistency measures, suggesting that rater consistency does not improve with experience. More intensive training for new raters, or the use of practice ratings as a screening tool for rater selection, may be necessary to improve rating quality.
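As a sketch of the measurement model the abstract describes, the MFRM with a partial credit parameterization on the rater facet can be written as follows; the exact facet structure used in the study may differ, so this is only the standard form, with rater-specific thresholds replacing a common rating scale structure:

$$\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \lambda_j - \tau_{jk}$$

where $\theta_n$ is examinee $n$'s ability, $\delta_i$ is the difficulty of task $i$, $\lambda_j$ is rater $j$'s severity, and $\tau_{jk}$ is rater $j$'s threshold between categories $k-1$ and $k$. Because the $\tau_{jk}$ are estimated separately for each rater, the spacing of each rater's thresholds can be examined directly.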
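The category breadth calculation itself is not specified in detail here, but one plausible reading is that the breadth of an interior rating category is the distance between the adjacent rater-specific thresholds: unusually wide middle categories suggest central tendency (middle categories overused), unusually narrow ones suggest extremism (extreme categories overused). A minimal sketch under that assumption, with purely illustrative threshold values and cutoffs:

```python
# Hypothetical sketch of a category breadth calculation from
# rater-specific Rasch-Andrich thresholds (partial credit model
# on the rater facet). The cutoff values are illustrative, not
# those used in the study.

def category_breadths(thresholds):
    """Breadth of each interior category: the distance (in logits)
    between adjacent thresholds. A K-category scale has K-1
    thresholds and K-2 interior breadths."""
    return [hi - lo for lo, hi in zip(thresholds, thresholds[1:])]

def classify_rater(thresholds, wide=3.0, narrow=0.5):
    """Flag possible central tendency (very wide middle category)
    or extremism (very narrow middle category)."""
    breadths = category_breadths(thresholds)
    mid = breadths[len(breadths) // 2]  # breadth of the middle category
    if mid >= wide:
        return "central tendency"
    if mid <= narrow:
        return "extremism"
    return "typical"

# Example: a 5-category scale has 4 thresholds.
print(classify_rater([-4.0, -1.5, 1.5, 4.0]))  # wide middle category
print(classify_rater([-0.5, -0.2, 0.2, 0.5]))  # narrow middle category
```

In practice the flagging cutoffs would be set empirically (for example, relative to the distribution of breadths across the rater pool) rather than fixed as in this sketch.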