Journal of Biomedical Informatics 42 (2009) 53–58
Improving language models for radiology speech recognition

John M. Paulett a, Curtis P. Langlotz b,*

a School of Engineering and Applied Science, University of Pennsylvania, USA
b Department of Radiology, University of Pennsylvania, Radiology Administration, Penn Tower Lobby Level, 399 South 34th Street, Suite 100, Philadelphia, PA 19104, USA

* Corresponding author. Fax: +1 215 349 5925. E-mail address: [email protected] (C.P. Langlotz).
Article history: Received 12 December 2007; Available online 12 August 2008

Keywords: Radiology; Speech recognition; n-Gram; Trigram model; Word frequency; Radiology reports
Abstract

Speech recognition systems have become increasingly popular as a means to produce radiology reports, for reasons both of efficiency and of cost. However, the suboptimal recognition accuracy of these systems can affect the productivity of the radiologists creating the text reports. We analyzed a database of over two million de-identified radiology reports to determine the strongest determinants of word frequency. Our results showed that body site and imaging modality had a similar influence on the frequency of words and of three-word phrases as did the identity of the speaker. These findings suggest that the accuracy of speech recognition systems could be significantly enhanced by further tailoring their language models to body site and imaging modality, which are readily available at the time of report creation.

© 2008 Elsevier Inc. All rights reserved.
1. Background

As hospitals have sought to tighten their budgets, and referring providers demand rapid turnaround of radiology reports, radiology departments have shifted from transcription services to speech recognition systems. Speech recognition systems replace expensive transcription services and enable much quicker report delivery. However, the lower accuracy of these systems compared to transcription services has affected the productivity of radiologists by causing them to spend a larger portion of their time correcting the inaccuracies of the computer-generated report [1,2].

Over the past several years, speech recognition technology has greatly improved, with some vendors now advertising accuracy rates of up to 99 percent [3]. But the few errors that do occur in radiology reports can have profound effects when clinicians rely on the reports to make life-altering decisions. For example, a speech recognition engine can interpret a radiologist as saying, "There is now evidence of tuberculosis," when the radiologist actually said, "There is no evidence of tuberculosis."

Overall speech recognition error rates can be reduced by several means. For example, most speech recognition engines can be "trained," or tailored to an individual's voice and speech pattern. Such training can dramatically improve the accuracy of the engine. Additionally, many radiologists use macros, which serve as templates for commonly used reports. Typically, only a few blanks in the macro, called fields, need to be dictated. Macros thereby reduce the error rate of dictation by limiting the amount of
text that is generated by the speech recognition engine. Furthermore, radiologists proofread their reports before signing them. However, even a single error can cause an incorrect diagnosis. Therefore, every additional effort should be made to systematically improve the accuracy of speech recognition.

Most modern speech recognition engines rely upon probabilistic measures of word combinations, represented in a language model, to translate audio input into words. Typically, the statistical language models of speech recognition engines use trigrams, or three-word phrases, to determine which words the speaker used (a toy illustration appears at the end of this section). While products such as Dragon NaturallySpeaking Medical (Nuance, Burlington, MA) include a medical dictionary and a radiology language model, no effort has been made to use the unique properties of radiological examinations to further refine the language model [4,5].

The goal of this study was to demonstrate that certain properties of imaging examinations, such as the body site, modality, and subspecialty, which are known prior to dictation, have an effect on word and trigram frequency that is comparable to that of the identity of the speaker. Because these attributes of the report are known in advance, they could be used to dynamically refine the language model used by speech recognition engines, thereby improving accuracy.
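To make the role of the trigram language model concrete, the following toy sketch (ours, not drawn from any speech recognition product) scores the two competing transcriptions from the tuberculosis example above against trigram counts. The three largest counts are taken from Table 9 later in this paper; the counts involving "now" are invented for illustration.

import math
from collections import defaultdict

# Trigram counts from a report corpus; the first three values come from Table 9,
# the "now" trigram counts are invented for this illustration.
trigram_counts = {
    ("there", "is", "no"): 13_120_608,
    ("is", "no", "evidence"): 3_577_140,
    ("no", "evidence", "of"): 7_892_478,
    ("there", "is", "now"): 25_000,     # invented
    ("is", "now", "evidence"): 120,     # invented
    ("now", "evidence", "of"): 90,      # invented
}
bigram_counts = defaultdict(int)
for (w1, w2, w3), count in trigram_counts.items():
    bigram_counts[(w1, w2)] += count

def sentence_log_prob(words, alpha=1.0, vocab_size=50_000):
    """Sum of log P(w3 | w1, w2) over the sentence, with add-alpha smoothing."""
    logp = 0.0
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        numerator = trigram_counts.get((w1, w2, w3), 0) + alpha
        denominator = bigram_counts.get((w1, w2), 0) + alpha * vocab_size
        logp += math.log(numerator / denominator)
    return logp

said = "there is no evidence of".split()
misheard = "there is now evidence of".split()
print(sentence_log_prob(said) > sentence_log_prob(misheard))  # True

Because the dictated phrase is overwhelmingly more frequent in the report corpus, a language model built from that corpus steers the recognizer toward the correct transcription.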
2. Methods

2.1. Setting

The Radiology Department of the Hospital of the University of Pennsylvania has stored reports electronically in a Radiology Information System (GE Healthcare, Waukesha, WI) for about two
decades, resulting in over two million digital radiological reports [6]. An automated speech recognition system with macro capabilities [7] (TalkStation, Agfa-Gevaert, Mortsel, Belgium), which incorporated an earlier version of the Dragon speech engine (Nuance, Burlington, MA), was implemented in late 1999.
Table 1
Table for determining the log-likelihood score [9]

            w         not w      Total
X           a         c          a + c
Y           b         d          b + d
Total       a + b     c + d      a + b + c + d = N
2.2. Research database

The data were loaded into an Oracle 10g Standard Edition database server (Oracle, Redwood Shores, CA) installed on an AMD Athlon desktop running Windows XP SP2. The data set for this study was the same as that used for the study conducted by Lakhani et al. [6]. The study sample consisted of 2,169,967 completed radiology reports from the Hospital of the University of Pennsylvania between January 1, 1998 and November 11, 2005. The authors removed patient identifiers from the database, including name, date of birth, and medical record number. Of the remaining reports, 29,736 contained only patient information with no report text; the authors deleted all 29,736 of these records. There remained 72 reports that contained the patient's name within the report text; the authors manually edited these to remove the patient information.

2.3. Comparing sets of words and trigrams

We opted to develop our own simple tokenizer because, to our knowledge, no existing tokenizer could be executed directly by the database, which was necessary to reduce processing time. Each report was parsed into a set of tokenized words. Our tokenizer split words based upon a defined list of text delimiters, such as spaces, periods, commas, colons, semicolons, and letter-number interfaces. An exception list was created of common delimiter characters that do not act as delimiters but rather add meaning to the word, such as numbers with decimal points, times with colons, and 359 predefined text strings such as "H20", "2ND", "GM/CM", "T9/10", and "K-WIRE".

We compared the effects of modality, body site, and subspecialty to the effect of radiologist. We conducted four experiments comparing word frequency and one experiment comparing trigram frequency. For each comparison, we randomly divided our database into two subsets containing equal numbers of reports. This enabled comparison of word frequency for both concordant and discordant report sets. For the first word comparison experiment, we further subdivided each report subset by modality and generated a comparison matrix between modality pairs. Each comparison pair had an equal number of reports from each of the two database subsets. For example, we compared 10 modalities, which generated a 100-cell comparison matrix in which we compared word frequencies between the two halves of the database. The second experiment subdivided each half of the database of reports by both modality and subspecialty and compared the word frequencies in the subgroups. The third experiment subdivided each half of the database by modality and body site and compared word frequencies between the halves. The fourth experiment compared word frequencies in sets of reports grouped by radiologist in similar fashion. In addition to the experiments comparing word frequency, we also conducted experiments in which reports and their corresponding trigrams were divided and compared by modality as above. To reduce the computational burden, we limited the analysis of trigrams to those occurring more than once in the database.
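The tokenizer described above was implemented to run inside the database. As a rough illustration of the same splitting rules, the following Python sketch (ours; the exception list is abbreviated to a handful of the 359 predefined strings) splits report text on ordinary delimiters and letter-number interfaces while keeping exception strings, decimals, and times intact.

import re

# Abbreviated stand-ins for the study's delimiter rules and 359-entry exception list.
EXCEPTIONS = ["K-WIRE", "T9/10", "GM/CM", "H20", "2ND"]

# Token pattern, in order of priority: an exception string, a number with a decimal
# point, a time with a colon, a run of letters, or a run of digits (so letter-number
# interfaces split, e.g. "T12" -> "T", "12").
TOKEN = re.compile(
    "|".join(re.escape(e) for e in EXCEPTIONS) + r"|\d+\.\d+|\d+:\d+|[A-Za-z]+|\d+"
)

def tokenize(report_text):
    """Split report text into word tokens; punctuation and whitespace act as
    delimiters, while exception strings, decimals, and times stay intact."""
    return TOKEN.findall(report_text.upper())

print(tokenize("Heart size is normal. K-WIRE at T9/10, measuring 2.5 cm at 10:30."))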
2.4. Metrics for comparing word and trigram frequencies

The metric used for comparison of word and trigram frequencies was based on the log-likelihood score (G2). The log-likelihood score assists in finding words or trigrams that are particularly characteristic of a report. This statistic gives an accurate measure of the "surprising" nature of an event and a sense of the "distinctive" nature of a corpus [8]. Kilgarriff [9] defined a framework for using the log-likelihood test to obtain the G2 statistic for a given word, w, given two texts, X and Y. In our setting, X and Y were the two halves of our data set. We tabulated a (the number of occurrences of w in X), c (the number of words in X that were not w), b (the number of occurrences of w in Y), and d (the number of words in Y that were not w). Table 1 lays out these quantities. G2 is then computed from Eq. (1) [9]. This calculation was performed only for words that existed in both halves of the data set.

G^2 = 2[ a log(a) + b log(b) + c log(c) + d log(d)
         - (a + b) log(a + b) - (a + c) log(a + c) - (b + d) log(b + d) - (c + d) log(c + d)
         + (a + b + c + d) log(a + b + c + d) ]                                            (1)

From the G2 values, the significant log-likelihood (SLL), Eq. (2), expresses the percentage of distinct words or trigrams that had significant G2 values (p < 0.05), i.e., those whose G2 exceeded the critical value of 3.84. P-values were calculated from G2 using a reference standard [10].

SLL = |{n-grams with G^2 > 3.84}| / |n-grams_model ∪ n-grams_test|                        (2)

For each comparison of corpora, we used the G2 statistic to compare the frequency of each word or trigram between subsets of reports, and then used the SLL statistic to summarize the proportion of significant differences in word frequencies between corpora.
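As a concrete illustration of Eqs. (1) and (2), the following sketch (ours; the toy corpora and all function names are placeholders, not the database procedures used in the study) computes G2 for each word shared by two corpora and then the SLL ratio over all distinct words.

from math import log
from collections import Counter

def g2(a, b, c, d):
    """Log-likelihood statistic of Eq. (1) for a word occurring a times in corpus X
    and b times in corpus Y, where c and d are the counts of all other words."""
    terms = [a, b, c, d, -(a + b), -(a + c), -(b + d), -(c + d), a + b + c + d]
    # Each signed term k contributes k * log(|k|); zero counts contribute nothing.
    return 2 * sum(k * log(abs(k)) for k in terms if k != 0)

def sll(counts_x, counts_y, critical=3.84):
    """SLL of Eq. (2): fraction of distinct words whose G2 exceeds the p < 0.05
    critical value, computed only for words present in both halves."""
    total_x, total_y = sum(counts_x.values()), sum(counts_y.values())
    shared = set(counts_x) & set(counts_y)
    significant = sum(
        1 for w in shared
        if g2(counts_x[w], counts_y[w], total_x - counts_x[w], total_y - counts_y[w]) > critical
    )
    return significant / len(set(counts_x) | set(counts_y))

# Toy corpora standing in for the two halves of a report subset.
x = Counter(("no evidence of pneumothorax " * 300).split())
y = Counter(("axial images show no abnormal signal " * 200).split())
print(f"SLL = {sll(x, y):.3f}")

In the study itself these computations were executed directly against the report database; the sketch only mirrors the arithmetic.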
3. Results

3.1. Univariate analysis

After de-identifying the data and tokenizing the reports, there were 2,169,967 reports, consisting of 338,435,512 words (tokens). On average, there were 155.9 words per report, with a standard deviation of 119.5. The distribution had a positive skew, with a mode of 61 (Fig. 1). The longest report contained 2620 words. Table 2 shows the distribution of reports among the 12 modality types defined in this report database. Table 3 shows the distribution of reports across body sites. Because we randomly split the data, an equivalent number of reports resided in each subgroup. An analysis of the report length distribution and of the distributions across modality, body site, and subspecialty for the model set and the test set found no significant differences, indicating that the two sets were properly randomized.

3.2. Comparison of word frequencies by modality
To test for distinctive words in each corpus, the log-likelihood test was run for each modality. Table 4 shows an example of the most significant word frequency differences between corpora.
The list on the left of Table 4 shows word frequency differences between two CR corpora; the list on the right compares word frequencies between CR and MR reports. The intra-modality comparison shows significantly smaller frequency differences than the inter-modality comparison.

Table 5 shows the ratio of distinctive words (i.e., words with significant G2 values) in discordant report sets (different modality) and concordant report sets (same modality). Every concordant comparison (in which the modality is held constant), highlighted in darker gray, has the smallest ratio of distinctive words. The concordant entries (in darker gray) have an average ratio of 0.508% with a standard deviation of 0.156%. The other entries have an average ratio of 39.2% with a standard deviation of 5.82%. Therefore, a word was much more likely to be distinctive when comparing reports from different modalities than when comparing reports of the same modality.

We also examined the words that appeared in one half of the reports but not in the other half. As expected, such words were much less prevalent in a concordant comparison, and were likely to be typographical or parsing errors. In a discordant comparison, the words whose frequencies differed between modalities were frequently used terms that were unique to the modalities in question.

Fig. 1. Words per report. The graph is limited to the mean ± two standard deviations.
Table 2
Report frequency by modality

Modality code    Modality                              % Total reports
CR               Computed radiography                  47.9
CT               Computed tomography                   15.9
MR               Magnetic resonance                    10.3
US               Ultrasound                            8.66
NM               Nuclear medicine                      5.12
MG               Mammography                           4.85
XA               X-ray angiography                     3.31
DF               Digital fluoroscopy                   2.68
DEXA             Dual energy X-ray absorptiometry      0.936
PT               Positron emission tomography          0.301
MISC             Miscellaneous                         0.0388
MS               Magnetic resonance spectroscopy       0.0230

Computed radiography accounted for almost half of reports. CT, MRI, and ultrasound accounted for the bulk of the remainder.
Table 3
Report frequency by body site

Body site code   Body site                   % Total reports
Chest            Chest                       36.1
Lwrext           Lower extremity             7.56
Head             Head                        7.30
Abdomen          Abdomen                     7.00
Pelvis           Pelvis                      6.86
Breast           Breast                      6.63
Vertebra         Vertebral column            6.04
Cardio           Cardiovascular              5.33
Uprext           Upper extremity             3.70
GI               Gastrointestinal            2.89
Muscskel         Musculoskeletal             2.77
GU               Genitourinary               2.40
Neck             Neck                        2.40
Misc             Miscellaneous               1.69
Abd-pelvis       Abdomen and pelvis          0.886
Neuro            Neurological                0.407

A large proportion of reports concerned the thorax. The remainder of reports were widely distributed throughout the body.
3.3. Subgroup analysis

To determine whether partitioning the data into smaller subsets based on body site or subspecialty also revealed significant differences, we further divided the MR subsets according to subspecialty and body site and applied the same statistical tests. Table 6 displays the results by subspecialty. When the subspecialty was the same, the average ratio of distinctive words (i.e., words with significant G2 values) was 0.647% with a standard deviation of 0.278%. When the subspecialties were not the same, the ratio was 38.8% (standard deviation = 7.26%).

The same subanalysis by body site was performed on the MR reports; only the body sites with the 10 highest numbers of reports are shown. As seen in Table 7, the average SLL for sets with the same body site was 0.805% (std 0.215%), while for sets with different body sites the average SLL was 36.0% (std 6.91%).

3.4. Comparison of word frequency by radiologist

The same analysis was conducted using the identity of the dictating radiologist. For this comparison, only reports from the 10 radiologists with the highest numbers of reports were used. Table 8 shows the percentage of distinctive words (i.e., words with significant G2 values). When comparing report sets from the same radiologist, on average 0.515% (std 0.147%) of the words were significant (SLL metric), whereas 32.5% (std 8.57%) of the words were significant when comparing report sets from different radiologists.

3.5. Trigram frequency analysis

We extended our word frequency analysis to examine the frequency of three-word phrases (trigrams) by using the same statistical methods. Table 9 shows the 25 most common trigrams in our database. As shown in Table 10, the frequency differences for trigrams were even more striking than the differences for words alone. When the modality was the same for both sets of trigrams, the average ratio was 0.251% (std 0.0538%); when the modality was different, the average was 36.5% (std 5.60%). Table 11 summarizes the results of the four frequency comparisons. In each case, the ratios of distinctive words and trigrams were much greater for discordant reports (having different values of the parameter in question) than for concordant reports.
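As a minimal sketch (ours) of how the trigrams analyzed above might be extracted and counted from tokenized reports, the following code keeps only trigrams that occur more than once, mirroring the filtering step described in the Methods.

from collections import Counter

def trigram_counts(tokenized_reports, min_count=2):
    """Count three-word phrases across reports, discarding those seen only once."""
    counts = Counter()
    for tokens in tokenized_reports:
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    return Counter({t: c for t, c in counts.items() if c >= min_count})

reports = [
    "there is no evidence of pneumothorax".split(),
    "there is no focal consolidation".split(),
]
print(trigram_counts(reports).most_common(3))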
Table 4
Top 15 most distinctive words when comparing concordant reports (same modality, CR vs. CR) and discordant reports (different modalities, CR vs. MR)

Concordant: CR vs. CR    Difference in frequency (CR - CR)    Discordant: CR vs. MR    Difference in frequency (CR - MR)
Compartment              -494                                 Chest                    +513,833
Malleolus                +463                                 Axial                    -272,549
Paratracheal             -247                                 Heart                    +247,858
Asymmetric               +231                                 Lung                     +213,917
Nasal                    -218                                 Images                   -207,375
Facial                   +189                                 Tube                     +203,790
Fractured                +184                                 MRI                      -192,576
Osteoblastic             +164                                 Atelectasis              +185,614
Kinked                   +115                                 Pneumothorax             +178,164
Measurement              -49                                  Views                    +175,501
Enhancement              +47                                  Sagittal                 -144,446
Myoma                    -37                                  Signal                   -140,412
Methacrylate             +35                                  Echo                     -123,929
Disarticulation          -27                                  Weighted                 -120,174
Cochlear                 -27                                  T2                       -109,234
All words listed were classified as distinctive using the log-likelihood test with p < 0.01. Approximately 170 million words were in each subset.
Table 5
Ratio of distinctive words (SLL metric: p < 0.05) by modality

Entries highlighted in darker gray are concordant comparisons, in which the modality is the same for both corpora.
Table 6
Ratio of distinctive words (SLL metric: p < 0.05) by subspecialties of the MR modality
Table 7
Ratio of distinctive words (SLL metric: p < 0.05) by body site of the MR modality
4. Discussion

Our analysis of words and trigrams (three-word phrases) in a large corpus of radiology reports shows that modality, body site, and radiologist can each be used to further refine language models. All of these factors are known at the time of dictation, although only the identity of the dictating radiologist is currently used to tailor language models for speech recognition systems.
Table 8
Ratio of distinctive words (SLL metric: p < 0.05) by reading radiologist
The column and row headers represent the rank of the radiologist by number of reports dictated.
Table 9
Top 25 trigrams used in radiology reports, along with the number of times each trigram appears in the entire database

Trigram                     No. of occurrences
There is no                 13,120,608
There is a                  10,147,824
No evidence of              7,892,478
Of the chest                6,589,080
Of the right                6,305,400
Of the left                 5,692,920
In the right                4,937,232
Seen in the                 4,542,216
In the left                 3,804,672
Is no evidence              3,577,140
There are no                3,273,182
Examination of the          3,151,104
Normal in size              2,908,422
Views of the                2,827,404
Within normal limits        2,805,176
Comparison is made          2,739,506
There is mild               2,434,608
Is normal in                2,361,870
Is made to                  2,326,368
Is seen in                  2,315,192
Of the abdomen              2,300,490
Images of the               2,294,480
The patient is              2,258,064
View of the                 2,185,600
Available for comparison    1,841,708

Table 11
Summary of the ratio of n-grams with significant (p < 0.05) G2 values (SLL metric)

Parameters              Concordant (%)      Discordant (%)
Modality                0.508 ± 0.156       39.2 ± 5.82
Modality, body site     0.805 ± 0.215       36.0 ± 6.91
Radiologist             0.515 ± 0.147       32.5 ± 8.57
Modality (trigrams)     0.251 ± 0.0538      36.5 ± 5.60
Our data from Table 4 show that words are used at dramatically different frequencies depending on imaging modality. These results, together with those shown in Table 5, show that when a dictionary is used for the modality for which it was designed, the word frequencies are relatively constant, whereas using a dictionary for a different modality results in markedly different word frequencies.
Table 10
Ratio of distinctive trigrams (SLL metric: p < 0.05) by modality

Comparisons of reports with discordant modality and body site had rates of distinctive words similar to those of comparisons between reports dictated by different radiologists. Discordant reports have much higher rates of distinctive words than concordant reports.
Tables 6 and 7 show that differences in word choice and frequency exist even when the data are stratified into smaller subsets. The results show that significant differences between subgroups do exist that could be useful to language model designers. From Table 8, it is evident (as expected) that each radiologist has a unique word choice and unique word frequencies. However, our comparison metrics show that this effect is similar in magnitude to the effect of modality and body site. Therefore, the use of modality and body site to tailor language models for speech recognition systems may be much more effective than training or tailoring based on the individual radiologist alone.

We found that the results obtained for word frequency also applied to the frequency of three-word phrases. When looking at these phrases, there were significant differences in phrase choice and frequency across modalities. While the log-likelihood test typically is applied to words rather than phrases, we believe the test is still applicable for quantifying the distinctness of phrases between two sets of reports.

Current speech recognition engines load a language model that is tailored only to the speech pattern of the dictating radiologist. To our knowledge, no available speech recognition engine provides the capability to train or tailor the behavior of the system based
on the modality and body site of the imaging exam being dictated. Our results indicate that it might be useful to study the effects of dynamically modifying the language models of existing speech recognition software. It is possible that additional gains in accuracy could be obtained by employing models that are specific not only to the reporting radiologist, but also to the modality and body site of the imaging study being reported.

4.1. Limitations

The primary limitation of the study is that we could not validate our conclusions by comparing different language models in an actual speech recognition engine, because such language models do not yet exist. We believe that such a study might show that incorporating specialized language models based on modality and body site improves the accuracy of dictation. However, no easy solution is currently available for dynamically changing models in a speech recognition engine.

The text of an unknown number of reports in the database was generated in part using macros. Unfortunately, our radiology information system does not record which reports used macros, so we could not study their effects directly. While the use of macros skews the word and phrase frequency results, the effect of this bias was likely similar across the database. Because radiologists have control over their own set of macros, these macros are more likely to reduce the variability of word frequency within one radiologist's corpus of reports, thereby magnifying the effect of tailoring to the radiologist. Nevertheless, we found that other factors had a similar effect on variability. We believe that macros could improve the accuracy of a speech recognition engine, since the engine will have the first two words of a trigram spelled out and can therefore refer to the dictionary to fill in the third word.

The log-likelihood measure we used had an important limitation: if a word was used frequently in one set of reports but was never used in the comparison set, that word would not appear in the results of the log-likelihood test. Nevertheless, the measure gave a good sense of the frequency with which common words are used in the two corpora. A comparison of word and trigram frequencies that appeared in one subset but not in the other yielded results similar to the log-likelihood comparisons (unpublished data).

We limited the analysis to trigrams that occurred more than once. From manual review of the omitted trigrams, we determined that almost all such unique trigrams were phrases with misspellings or numbers; adding these phrases would provide little benefit to a speech engine. We believe that any bias introduced by omitting unique trigrams would apply equally to all comparisons and therefore would not affect our conclusions.

Table 11 shows that there are significant word and phrase differences between different types of radiology reports, and that those differences are at least as pronounced between modalities and body sites as they are between speakers. Therefore, custom language models based upon the specific modality, body site, and radiologist could be used by speech recognition software to improve recognition accuracy. For example, a speech recognition engine could identify the exam being dictated (e.g., abdominal CT, knee MR, chest X-ray) and load a language model tailored to the modality and body site of that exam. Most likely, the language models would be cached on a server that periodically updated each model, possibly using the method detailed by Siivola and Pellom [5].
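A minimal sketch (ours, with hypothetical file names and cache layout; no vendor API is implied) of how a dictation front end might select the most specific cached language model available for an exam, falling back to a generic radiology model:

from pathlib import Path

# Hypothetical cache layout: one language-model file per combination of exam
# attributes, e.g. models/CT_ABDOMEN.lm, plus a generic fallback model.
MODEL_DIR = Path("models")

def select_language_model(modality, body_site, radiologist_id=None):
    """Return the most specific cached model path available for this exam."""
    candidates = []
    if radiologist_id is not None:
        candidates.append(MODEL_DIR / f"{modality}_{body_site}_{radiologist_id}.lm")
    candidates += [
        MODEL_DIR / f"{modality}_{body_site}.lm",   # exam-specific model
        MODEL_DIR / f"{modality}.lm",               # modality-only model
        MODEL_DIR / "radiology_general.lm",         # generic fallback
    ]
    for path in candidates:
        if path.exists():
            return path
    raise FileNotFoundError("no cached language model found")

# Example: an abdominal CT dictated by radiologist 17.
# print(select_language_model("CT", "ABDOMEN", 17))

How the radiologist-, modality-, and body-site-specific models should be weighted or combined remains an open question, as noted below.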
For the database analyzed in this study, 82 distinct language models would be needed to handle all combinations of modality and body site. However, we were unable to prove definitively that these additional predictors would increase the performance of a speech recognition engine, or to assess how these additional parameters should be weighted.

This study focused on radiological exams because of the availability of a large database of reports. A similar approach might be useful in other medical fields that use dictation. For example, word and trigram frequency in operative reports may be strongly affected by the type of surgery, and progress notes from ambulatory clinics might be highly dependent on the chief complaint.

5. Conclusion

Our results indicate that word and phrase frequencies in dictated radiology reports are influenced by the modality and body site of the radiology study. When comparing subsets of reports, there were significantly lower levels of distinctiveness for concordant reports than for discordant reports. While the statistical significance of the differences between these effects could not be measured, their magnitudes were similar to those found between radiologists, indicating that they likely have practical significance. This finding held true not only for the frequency of single words, but also for the frequency of three-word phrases. These results suggest that the accuracy of speech recognition products could be improved by tailoring the underlying language model based on the modality, body site, and other parameters of the radiological study.

References

[1] Al-Aynati MM, Chorneyko KA. Comparison of voice-automated transcription and human transcription in generating pathology reports. Arch Pathol Lab Med 2003;127(6):721-5.
[2] Liu D, Zuckerman M, Tulloss Jr WB. Six characteristics of effective structured reporting and the inevitable integration with speech recognition. J Digit Imaging 2006;19(1):98-104.
[3] Nuance Dragon NaturallySpeaking 9 Product Page. Available from: .
[4] Langer S. Radiology speech recognition: workflow, integration, and productivity issues. Curr Probl Diagn Radiol 2002;31(3):95-104.
[5] Siivola V, Pellom B. Growing an n-gram language model. In: Proceedings of the 9th European conference on speech communication and technology; 2005.
[6] Lakhani P, Menschik ED, Goldszal AF, Murray JP, Weiner MG, Langlotz CP. Development and validation of queries using structured query language (SQL) to determine the utilization of comparison imaging in radiology reports stored on PACS. J Digit Imaging 2006;19(1):52-68.
[7] Agfa TalkStation TeleDictation Product Page. Available from: ; 2007 [accessed 19.08.07].
[8] Kilgarriff A. Comparing corpora. Int J Corpus Linguist 2001;6:97-133.
[9] Kilgarriff A. Which words are particularly characteristic of a text? A survey of statistical approaches. In: Proceedings of the AISB workshop on language engineering for document analysis and recognition; 1996. p. 33-40.
[10] Rayson P, Berridge D, Francis B. Extending the Cochran rule for the comparison of word frequencies between corpora. In: Proceedings of the seventh international conference on statistical analysis of textual data (JADT 2004). Presses universitaires de Louvain; 2004. p. 926-36.