Arthritis Care & Research, Vol. 66, No. 1, January 2014, pp 2–6. DOI 10.1002/acr.22105. © 2014, American College of Rheumatology.

SPECIAL THEME ARTICLE: CLINICAL IMAGING AND THE RHEUMATIC DISEASES

Musculoskeletal Ultrasound Objective Structured Clinical Examination: An Assessment of the Test

EUGENE Y. KISSIN,1 PETER C. GRAYSON,2 AMY C. CANNELLA,3 PAUL J. DEMARCO,4 AMY EVANGELISTO,5 JANAK GOYAL,6 RANY AL HAJ,7 JAY HIGGS,8 DANIEL G. MALONE,9 MIDORI J. NISHIO,10 DARREN TABECHIAN,11 AND GURJIT S. KAELEY12

Objective. To determine the reliability and validity of an objective structured clinical examination (OSCE) for musculoskeletal ultrasound (MSUS).

Methods. A 9-station OSCE was administered to 35 rheumatology fellows trained in MSUS and to 3 expert faculty (controls). Participants were unaware of joint health (5 diseased/4 healthy). Faculty assessors (n = 9) graded image quality with predefined checklists and a 0–5 global rating, blinded to who performed the study. Interrater reliability, correlation between a written multiple choice question examination (MCQ) and OSCE performance, and comparison of fellow OSCE results with those of the faculty were measured to determine OSCE reliability, concurrent validity, and construct validity.

Results. Assessors' interrater reliability was good (intraclass correlation coefficient [ICC] 0.7). Score reliability was good in the normal wrist and ankle stations (ICC 0.7) and moderate in the abnormal wrist and ankle stations (ICC 0.4). MCQ grades significantly correlated with OSCE grades (r = 0.52, P < 0.01). The fellows in the bottom quartile of the MCQ scored 3.07 on the OSCE, significantly worse than the top quartile fellows (3.32) and the faculty (3.29; P < 0.01). Scores also significantly discriminated bottom quartile fellows from faculty in the normal wrist and ankle stations (3.38 versus 3.78; P < 0.01), but not in the abnormal stations (3.37 versus 3.49; P = 0.08).

Conclusion. MSUS OSCE is a reliable and valid method for evaluation of MSUS skill. Normal joint assessment stations are more reliable than abnormal joint assessment stations and better discriminate poorly performing fellows from faculty. Therefore, MSUS OSCE with normal joints can be used for the assessment of MSUS skill competency.

INTRODUCTION

Utilization of musculoskeletal ultrasound (MSUS) has expanded greatly since its first use in 1958 (1). In addition to radiology, many specialties now employ MSUS for point-of-care imaging, including rheumatology, physiatry, podiatry, emergency medicine, general internal medicine, and family practice, and this has led to a 316% increase in MSUS volume from 2000 to 2009 (2). The proliferation of MSUS has elicited questions about the qualifications of physicians performing MSUS examination. In response to these questions, certification examinations have been created by the American Registry for Diagnostic Medical Sonography and by the American College of Rheumatology.

The view(s) expressed herein are those of the author(s) and do not reflect the official policy or position of Brooke Army Medical Center, the U.S. Army Medical Department, the U.S. Army Office of the Surgeon General, the Department of the Army, the U.S. Air Force, the Department of Defense, or the U.S. Government.

Supported by the Clinician Scholar Educator Award from the Rheumatology Research Foundation.

1Eugene Y. Kissin, MD: Boston University School of Medicine, Boston, Massachusetts; 2Peter C. Grayson, MD, MSc: Boston Medical Center, Boston, Massachusetts; 3Amy C. Cannella, MD: University of Nebraska Medical Center, Omaha; 4Paul J. DeMarco, MD, FACR: Arthritis and Rheumatism Associates PC, Wheaton, Maryland; 5Amy Evangelisto, MD: Hospital of the University of Pennsylvania, Philadelphia; 6Janak Goyal, MD: Raritan Bay Medical Center, Perth Amboy, New Jersey; 7Rany al Haj, MD: Shore Arthritis & Rheumatism Associates, Ocean, New Jersey; 8Jay Higgs, MD: Brooke Army Medical Center, Fort Sam Houston, Texas; 9Daniel G. Malone, MD, RMSK: Excel Orthopedics, Beaver Dam, Wisconsin; 10Midori J. Nishio, MD: John Muir Health, Walnut Creek, California; 11Darren Tabechian, MD: University of Rochester, Rochester, New York; 12Gurjit S. Kaeley, MRCP: University of Florida, Jacksonville.

Dr. Kissin receives royalties from Gulfcoast Ultrasound for the wrist ultrasound video. Dr. DeMarco has received consulting fees, speaking fees, and/or honoraria (less than $10,000 each) from Amgen and Auxilium and (more than $10,000) from Abbott/AbbVie. Dr. Malone has received honoraria (less than $10,000) from SonoSite. Dr. Nishio has received speaking fees (less than $10,000) from Abbott.

Address correspondence to Eugene Y. Kissin, MD, Arthritis Center, Boston University School of Medicine, 72 East Concord Street, Evans-506, Boston, MA 02118. E-mail: [email protected].

Submitted for publication March 13, 2013; accepted in revised form July 31, 2013.




Significance & Innovations

● This is the first study to assess the reliability and validity of an objective structured clinical examination (OSCE) assessment for musculoskeletal ultrasound.

● This study showed higher reliability and discriminant validity for OSCE stations using normal joints compared to diseased joints.

● This study showed that remote, blinded assessment of OSCE performance is reliable and valid, potentially decreasing the costs associated with organizing an OSCE examination for students of musculoskeletal ultrasound.

The purpose of certification should be to ensure a minimal level of competence, to stimulate professional growth, and to protect the public by encouraging quality care (3). An examination of MSUS competence must be able to evaluate 2 components: skill in US image acquisition and knowledge required for US image interpretation. Although knowledge of anatomy and pathology as well as the ability to interpret US images can be tested by a multiple choice question examination (MCQ), a practical examination of scanning ability is the most direct method of assessing the skill of obtaining US images. Unfortunately, practical examination is time consuming and expensive, and the reliability and validity of this approach in MSUS have not been established, resulting in debate about whether an objective structured clinical examination (OSCE) should be part of MSUS competency testing.

Our research group has developed a training program for rheumatology fellows that includes online educational resources, remote online image review by rheumatology faculty with expertise in MSUS, and an educational workshop that includes 21 hours of didactic lectures and hands-on scanning of patients and cadaveric joints. Over the course of 8 months, fellows are encouraged to submit 50 comprehensive US studies for faculty review and feedback. Upon completion, fellows travel to a final examination, including an MCQ and an OSCE (4).

The purpose of this study was to determine the reliability and validity of an OSCE for MSUS. This was done by assessing interrater reliability for practical examination grading, concurrent validity by comparing OSCE performance with performance on a written examination in MSUS, and construct validity by comparing trainee/fellow OSCE scores with faculty OSCE scores.

MATERIALS AND METHODS

Setting and participants. Thirty-five rheumatology fellows who participated in an 8-month training program in MSUS underwent an examination consisting of a 76-question MCQ and a 9-station OSCE. The MCQ was developed by the program faculty, many of whom are experienced test question writers.

All faculty members submitted questions based on prespecified examination content areas. The faculty, as a group, reviewed each question and either retained or eliminated it based on relevance, clarity, and difficulty.

Nine rheumatology faculty members with expertise in MSUS (mean of 6 years of experience) served as proctors for the OSCE, as assessors/graders of the practical examination, and as "gold standard" control participants in the OSCE. Faculty were trained in standardized practical proctoring and grading during an hour-long seminar a few weeks prior to the OSCE, and the testing procedures were reviewed during a 30-minute meeting immediately before the OSCE.

Each of 4 healthy volunteers and 5 volunteers with rheumatic joint disease, recruited from a rheumatology outpatient clinic, had 1 joint examined at an OSCE station. The examination participants were unaware of whether the joint to be examined was abnormal (one of each of the wrist, ankle, elbow, finger, and toe) or normal (one of each of the wrist, ankle, knee, and shoulder). In the abnormal stations, the following pathology was represented: gouty arthritis, synovitis, erosive arthritis (n = 2), and enthesitis. The protocol for this study was exempted by the Institutional Review Board at Boston University School of Medicine.

Grading. The participants were expected to perform standardized, comprehensive scans for each joint area tested. The required images, including specific anatomic structures, were the same as those required in the preceding curriculum. At each station, a faculty proctor witnessed and graded the studies being performed using a predefined checklist. Participants were aware of the items on the checklist and the rating system. Each predefined checklist element on the scoring sheet was graded on a 5-point rating scale, where 1 = failing, 2 = borderline pass, 3 = average, 4 = above average, and 5 = "publication quality." The checklist items included proper adjustment of machine settings, transducer orientation and alignment, and artifact-free visualization of the tendons and bone surfaces in each of the required views. Additionally, the graders were asked to provide a global rating score for each participant as an overall assessment of performance at each station on the same 5-point scale.

Separately, each resulting US image was graded by 2 faculty assessors blinded to examinee identification using the same scoring sheet. It is important to note that the image grading was based strictly on image characteristics, not on whether the participant identified pathology when this was present. Each volunteer was also scanned by 3 faculty members, and the resulting images were graded along with the trainee images by the blinded faculty assessors.

The borderline group method was used to set the overall OSCE passing score. This method consisted of identifying participants who scored a 2 (borderline pass) on the global rating scale for a station, calculating a mean composite score from the predefined checklist elements for that station for each participant who scored a 2 globally, and averaging the composite scores for all such participants. This composite mean score then served as a passing score for the station, and a mean of the passing scores for the 9 individual stations served as a passing score for the practical examination as a whole (5,6). The Angoff method was used to determine the MCQ pass score (7).
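The borderline group calculation can be made concrete with a minimal sketch. This is illustrative only, not the program's actual scoring code; the record layout (per-station lists of dictionaries holding each participant's checklist scores and 1–5 global rating) and the example numbers are hypothetical.

```python
# Sketch of the borderline group standard-setting method, assuming a
# hypothetical data layout: one list of records per station, each record
# holding a participant's checklist element scores and global rating.

def station_passing_score(records):
    """Mean composite checklist score of participants rated borderline pass (2)."""
    composites = [
        sum(r["checklist"]) / len(r["checklist"])   # mean of checklist elements
        for r in records
        if r["global_rating"] == 2                  # borderline-pass participants only
    ]
    return sum(composites) / len(composites)

def overall_passing_score(stations):
    """Overall OSCE pass mark: mean of the station-level passing scores."""
    per_station = [station_passing_score(recs) for recs in stations.values()]
    return sum(per_station) / len(per_station)

# Two hypothetical stations (the real examination used 9):
stations = {
    "normal_wrist": [
        {"global_rating": 2, "checklist": [3, 2, 3, 3]},
        {"global_rating": 4, "checklist": [4, 5, 4, 4]},  # ignored: not borderline
        {"global_rating": 2, "checklist": [2, 3, 3, 2]},
    ],
    "abnormal_ankle": [
        {"global_rating": 2, "checklist": [3, 3, 2, 3]},
        {"global_rating": 1, "checklist": [1, 2, 2, 1]},  # ignored: not borderline
    ],
}
print(round(overall_passing_score(stations), 2))  # composite pass mark for this toy data
```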



Statistical analysis. Participant scores from the individual checklist elements were averaged for each station. A composite OSCE score was derived for each participant by averaging the participant's mean scores across each station. Interrater reliability for assessors and proctors was estimated using the intraclass correlation coefficient (ICC). The mean of the composite OSCE scores was compared between proctors and assessors using a paired t-test. To assess for potential differences in how normal versus abnormal joint stations were graded, the ICC was used to estimate the reliability of assessor scores between the normal and abnormal wrist and ankle stations, respectively. To assess for potential redundancy of multiple stations, interstation correlation was calculated from the participant's mean scores across each station using Pearson's correlation coefficient.

Concurrent validity was established by correlating MCQ scores and composite OSCE scores (proctor scores; averaged assessor scores) using Pearson's correlation coefficient. The distribution of MCQ scores was compared between the participants who failed and passed the OSCE, as determined by the borderline group method, using Wilcoxon's rank sum test. Construct validity was established by dividing MCQ scores into quartiles and comparing the trainee/fellow OSCE composite scores in the lowest MCQ quartile to trainee/fellow OSCE scores in the highest quartile and to faculty OSCE scores (gold standard) using Wilcoxon's rank sum test. All calculations were done using SAS, version 9.3. A P value less than 0.05 was used to define statistical significance.
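The main reliability and validity computations described above can be sketched in a few lines of Python. This is a schematic reconstruction, not the study's analysis (which used SAS 9.3); the file names and column names (participant, station, rater_id, rater_role, composite_score, mcq_percent) are assumed purely for illustration.

```python
# Schematic reconstruction of the reliability/validity analyses, assuming
# hypothetical input files: a long-format table of composite OSCE scores
# (one row per participant x station x rater) and participant-level MCQ scores.
import pandas as pd
from scipy import stats
import pingouin as pg

scores = pd.read_csv("osce_scores.csv")   # hypothetical: participant, station, rater_id, rater_role, composite_score
mcq = pd.read_csv("mcq_scores.csv")       # hypothetical: participant, mcq_percent

# Inter-rater reliability between the blinded assessors (intraclass correlation).
assessor = scores[scores["rater_role"] == "assessor"]
icc = pg.intraclass_corr(data=assessor, targets="participant",
                         raters="rater_id", ratings="composite_score")
print(icc[["Type", "ICC"]])

# Paired t-test: proctor station score vs. mean of the two blinded assessors.
wide = scores.pivot_table(index=["participant", "station"],
                          columns="rater_role", values="composite_score")
t, p_t = stats.ttest_rel(wide["proctor"], wide["assessor"])

# Concurrent validity: Pearson correlation of MCQ score with composite OSCE score.
per_participant = (assessor.groupby("participant")["composite_score"]
                   .mean().rename("osce"))
merged = mcq.set_index("participant").join(per_participant)
r, p_r = stats.pearsonr(merged["mcq_percent"], merged["osce"])

# MCQ score distribution in OSCE failures vs. passes (Wilcoxon rank-sum test),
# using the borderline-group pass mark of 3.0 reported in the Results.
fail = merged.loc[merged["osce"] < 3.0, "mcq_percent"]
passed = merged.loc[merged["osce"] >= 3.0, "mcq_percent"]
w, p_w = stats.ranksums(fail, passed)
```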

RESULTS

Pass scores. Borderline group methodology resulted in an OSCE pass score of 3.0, while the Angoff method resulted in an MCQ pass score of 62. Five fellows received a failing score on the OSCE from both examination assessors. Nine fellows received a failing score on the MCQ portion of the examination.

Reliability. Interrater reliability for OSCE grading was good (ICC 0.7) between the assessors, but was poor (ICC 0.3) between the assessors and the proctors. The proctors consistently gave a higher OSCE station score than the average of the 2 blinded assessors (3.6 versus 3.2; P < 0.0001). Reliability of the assessor scores was good in the normal/healthy wrist and ankle stations (ICC 0.7) and moderate in the abnormal wrist and ankle stations (ICC 0.4). The mean interstation correlation (comparison of station scores within each participant) was low (r = 0.16, range −0.15 to 0.57).

Validity. MCQ scores correlated significantly with composite OSCE scores from both the blinded assessors (r = 0.52, P < 0.01) (Figure 1) and the proctors (r = 0.58, P < 0.01). The mean MCQ score for the 5 fellows who failed the OSCE was less than that for the 30 who passed (60% versus 71%; P = 0.04). The fellows in the bottom quartile of the MCQ scored 3.07 on the OSCE, which is significantly lower than the top quartile fellows (3.32) and the faculty (3.29; P < 0.01 for both groups). Composite OSCE scores also significantly discriminated bottom quartile fellows from faculty in the normal wrist and ankle stations (3.38 versus 3.78; P < 0.01), but not in the abnormal stations (3.37 versus 3.49; P = 0.08) (Figure 2). Top MCQ quartile fellows outperformed the bottom quartile MCQ fellows in the abnormal stations (3.78 versus 3.37; P = 0.01).

Figure 1. Correlation of the multiple choice question examination score with the average blinded assessor composite objective structured clinical examination (OSCE) score (r = 0.52, P < 0.01).

Figure 2. Composite objective structured clinical examination (OSCE) scores for faculty controls, fellows who scored in the top quartile on the multiple choice question examination (MCQ), and fellows who scored in the bottom quartile on the MCQ. * = bottom MCQ quartile fellows versus top quartile fellows and faculty (3.07 versus 3.32 and 3.29, respectively; P < 0.01); † = bottom MCQ quartile fellows versus faculty (3.38 versus 3.78; P < 0.01); ‡ = bottom MCQ quartile fellows versus faculty (3.37 versus 3.49; P = 0.08) and versus top quartile fellows (3.37 versus 3.78; P = 0.01).



Agreement. No fellows received an OSCE failing score from the proctors. Five fellows (14%) received an OSCE failing score from both blinded assessors. Thirteen fellows (37%) received a failing score from only 1 blinded assessor. All 5 fellows who failed the OSCE by grades from both blinded assessors also scored in the bottom 2 quartiles on the MCQ. Conversely, 13 of the fellows who were in the bottom 2 quartiles on the MCQ passed the OSCE. Three fellows failed both the OSCE and MCQ portions of the examination, whereas 8 fellows failed 1 of the 2 sections (Table 1).
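These counts are mutually consistent; by inclusion–exclusion over the two examination components, using only the figures reported above and in Table 1:

$$
\underbrace{5}_{\text{failed OSCE}} + \underbrace{9}_{\text{failed MCQ}} - \underbrace{3}_{\text{failed both}} = 11 \ \text{failed at least one section}, \qquad 11 - 3 = 8 \ \text{failed exactly one}.
$$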

Table 1. Concurrent validity of the MCQ and OSCE*

                     MCQ quartile 1   MCQ quartile 2   MCQ quartile 3   MCQ quartile 4
                     (49–62%)         (63–70%)         (71–75%)         (76–89%)
OSCE fail (<3.0)           3                2                0                0
OSCE pass (≥3.0)           6                7                9                8

* The mean multiple choice question examination (MCQ) score for the 5 fellows who failed the objective structured clinical examination (OSCE) was less than that for the 30 who passed (60% versus 71%; P = 0.04 by Wilcoxon's rank sum), and these fellows were in the bottom 2 quartiles on the MCQ. Nine fellows failed the MCQ, and 3 of these fellows also failed the OSCE.

DISCUSSION

We found the 9-station MSUS OSCE to be a reliable and valid method for evaluation of MSUS skill. Both concurrent and construct validity were suggested by blinded OSCE image assessment successfully discriminating fellows who performed poorly on the MCQ from well-performing fellows and faculty members. Given the high-stakes nature of a certification examination and the disagreement we found between assessors in giving a failing practical examination grade (37% of fellows failed by one assessor versus 14% failed by both assessors), it seems reasonable to require that at least 2 assessors grade the examination images (8). Remote, blinded grading of the OSCE images, as done in this study, increases result reliability by allowing more than one assessor to grade each station, while increasing feasibility by limiting cost.

Furthermore, our finding that 11 fellows failed either the practical or the written examination while 3 fellows failed both examinations implies that a certification examination that does not include both components may lead to certification of some practitioners who can identify pathology on a test question image, but do not have sufficient skill to obtain the necessary images themselves. Other investigators have found similar discrepancies between pass rates of OSCE and written examinations (6).

Low interstation correlation suggests that each station evaluated relatively independent skills, and that decreasing the number of stations may substantially reduce the reliability of the OSCE (8). Similar findings regarding the minimal number of OSCE stations needed for a high-stakes examination have been reported previously (9).

Although proctors were able to see the "live" performance of the examination as well as the resulting images, this did not increase the validity of their grades in comparison to images graded "blindly." In addition, witnessing the examination in progress led to grade "inflation." The proctors saw the actual US performed, and this alone may have led to grade inflation because they witnessed that the target anatomic structures were imaged, but may not have been captured ("frozen") optimally. Clerkship directors in internal medicine have reported that difficulty delivering negative feedback is the top explanation for grade inflation (10). Therefore, giving a poor grade to a trainee in the same room may be more difficult than doing so to an anonymous trainee one is not facing. In addition to assigning a grade, proctors were tasked with labeling and storing images, making machine adjustments as requested by the examinees, and ensuring all paperwork was in order during the test. This multitasking and fatiguing environment under the time constraints of a "live" examination is much different from the quiet, self-paced study of images by the blinded assessors.

Replacing proctor grading with remote image grading could also decrease the cost of a practical MSUS examination and increase the availability of qualified faculty for practical grading, thus making a practical examination more reliable (11) and more feasible. These factors should be balanced against the additional information available to a proctor but not to a blinded assessor, such as correct patient positioning, attention to patient comfort, and thoroughness of the scan (appropriately scanning all the way through a structure versus only scanning to make one perfect image).

The findings that normal joint assessment stations were more reliable than abnormal joint assessment stations and better discriminated poorly performing fellows from faculty also affect feasibility, since recruiting patients with rheumatic disease for a day-long practical examination is substantially more challenging than recruiting healthy volunteers. The use of patients with rheumatic diseases would also increase yearly test variability.

The reasons for the decreased grading reliability and decreased discriminant utility of OSCE stations with diseased joints are not clear. Efforts to limit variability in OSCE grading included the use of predefined checklists and training the faculty in standardized grading. Anatomic structures in a normal joint are straightforward and easier to quantitatively assess using predefined checklists. However, this distinction is less clear in a diseased joint, where anatomic structures of interest may not be as easily demonstrated in a single image. Assessors may not be certain about the optimal appearance of abnormal joints for grading purposes, and therefore may have more difficulty scoring the resulting images concordantly. In addition, more competent sonographers may find and record the best representation of pathology in a single image rather than the target structures on the checklist, which may result in lower scores based on the grading paradigm. This variability could have been minimized by using consensus between assessors when scoring discrepancies were apparent.



Our comparison of normal and abnormal stations is also limited by the number of stations overall, and the results may be affected by the specific joints and pathology evaluated during the examination. While we tried to present pathology typical of a rheumatology practice, representing a complete array of rheumatologic pathology on a practical examination is not feasible. Therefore, it is more practical to test knowledge of pathology on an MCQ style of examination, and it is even more important to include a broad array of pathology on the MCQ if stations with pathology are not part of the practical examination. In any case, it is possible that our results might have differed had patient volunteers with different pathology been used.

MSUS competency assessment is an important aspect of training, since the achievement of higher levels of competence can result in better-quality care and lower-cost utilization (12–15). Since competence in MSUS depends not only on image interpretation but also on image acquisition, practical examination through an OSCE could be used to supplement the MCQ to ensure that trainees who pass the examination have not only the knowledge necessary to interpret US images, but also the skill to obtain adequate images for interpretation.

Finally, learners are motivated to acquire knowledge and skills to meet the challenge of whatever testing format is anticipated. Our trainees were likely motivated to improve their image acquisition skills in preparation for the OSCE. OSCE testing of a separate group who prepared only for an MCQ format would be required to test this hypothesis.

This study examined the critical factors of reliability and validity required for the use of an OSCE as part of MSUS certification. Our findings suggest that a valid MSUS OSCE can be performed utilizing normal anatomy with remote grading by blinded experts. Additional studies are necessary to confirm our findings of these key test characteristics. In addition, further research on the cost of OSCE implementation in a certification examination is needed.

AUTHOR CONTRIBUTIONS

All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published. Dr. Kissin had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study conception and design. Kissin, DeMarco, Higgs, Nishio, Kaeley.
Acquisition of data. Kissin, Cannella, DeMarco, Evangelisto, Goyal, al Haj, Higgs, Malone, Nishio, Tabechian, Kaeley.
Analysis and interpretation of data. Kissin, Grayson, al Haj, Kaeley.

REFERENCES

1. Kane D, Grassi W, Sturrock R, Balint PV. A brief history of musculoskeletal ultrasound: 'from bats and ships to babies and hips.' Rheumatology (Oxford) 2004;43:931–3.
2. Sharpe RE, Nazarian LN, Parker L, Rao VM, Levin DC. Dramatically increased musculoskeletal ultrasound utilization from 2000 to 2009, especially by podiatrists in private offices. J Am Coll Radiol 2012;9:141–6.
3. Cerqueira MD, Arrighi JA, Geiser EA. Physician certification in cardiovascular imaging: rationale, process, and benefits. JACC Cardiovasc Imaging 2008;1:801–8.
4. Kissin EY, Niu J, Balint P, Bong D, Evangelisto A, Goyal J, et al. Musculoskeletal ultrasound training and competency assessment program for rheumatology fellows. J Ultrasound Med 2013;32:1735–43.
5. Smee SM, Blackmore DE. Setting standards for an objective structured clinical examination: the borderline group method gains ground on Angoff [letter]. Med Educ 2001;35:1009–10.
6. Wilkinson TJ, Newble DI, Frampton CM. Standard setting in an objective structured clinical examination: use of global ratings of borderline performance to determine the passing score. Med Educ 2001;35:1043–9.
7. Angoff WH. The nature-nurture debate, aptitudes, and group differences. Am Psychol 1988;43:713–20.
8. Rushforth HE. Objective structured clinical examination (OSCE): review of literature and implications for nursing education. Nurse Educ Today 2007;27:481–90.
9. Hofer M, Kamper L, Sadlo M, Sievers K, Heussen N. Evaluation of an OSCE assessment tool for abdominal ultrasound courses. Ultraschall Med 2011;32:184–90.
10. Cacamese SM, Elnicki M, Speer AJ. Grade inflation and the internal medicine subinternship: a national survey of clerkship directors. Teach Learn Med 2007;19:343–6.
11. Friedlich M, MacRae H, Oandasan I, Tannenbaum D, Batty H, Reznick R, et al. Structured assessment of minor surgical skills (SAMSS) for family medicine residents. Acad Med 2001;76:1241–6.
12. Stowell SA, Gardner AJ, Alpert JS, Naccarelli GV, Harkins TP, Louder AM, et al. Impact of certified CME in atrial fibrillation on administrative claims. Am J Manag Care 2012;18:253–60.
13. Baigelman W, Weld L, Coldiron JS. Relationship between practice characteristics of primary care internists and unnecessary hospital days. Am J Med Qual 1994;9:122–8.
14. Holmboe ES, Lipner R, Greiner A. Assessing quality of care: knowledge matters. JAMA 2008;299:338–40.
15. Sharp LK, Bashook PG, Lipsky MS, Horowitz SD, Miller SH. Specialty board certification and clinical outcomes: the missing link. Acad Med 2002;77:534–42.