
Automating annotation of information-giving for analysis of clinical conversation

Elijah Mayfield,1 M Barton Laws,2 Ira B Wilson,2 Carolyn Penstein Rosé1

1Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
2Department of Health Services, Policy & Practice, Brown University, Providence, Rhode Island, USA

Correspondence to Elijah Mayfield, Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA; elijah@cmu.edu

Received 22 April 2013; Revised 25 August 2013; Accepted 27 August 2013

ABSTRACT
Objective Coding of clinical communication for fine-grained features such as speech acts has produced a substantial literature. However, annotation by humans is laborious and expensive, limiting application of these methods. We aimed to show that through machine learning, computers could code certain categories of speech acts with sufficient reliability to make useful distinctions among clinical encounters.
Materials and methods The data were transcripts of 415 routine outpatient visits of HIV patients which had previously been coded for speech acts using the Generalized Medical Interaction Analysis System (GMIAS); 50 had also been coded for larger scale features using the Comprehensive Analysis of the Structure of Encounters System (CASES). We aggregated selected speech acts into information-giving and requesting, then trained the machine to automatically annotate using logistic regression classification. We evaluated reliability by per-speech act accuracy. We used multiple regression to predict patient reports of communication quality from post-visit surveys using the patient and provider information-giving to information-requesting ratio (briefly, information-giving ratio) and patient gender.
Results Automated coding produces moderate reliability with human coding (accuracy 71.2%, κ=0.57), with high correlation between machine and human prediction of the information-giving ratio (r=0.96). The regression significantly predicted four of five patient-reported measures of communication quality (r=0.263–0.344).
Discussion The information-giving ratio is a useful and intuitive measure for predicting patient perception of provider–patient communication quality. These predictions can be made with automated annotation, which is a practical option for studying large collections of clinical encounters with objectivity, consistency, and low cost, providing greater opportunity for training and reflection for care providers.

BACKGROUND AND SIGNIFICANCE


The many extant systems for coding and analyzing physician–patient communication by defining verbal behaviors and counting their frequencies have produced a considerable literature.1 2 Such studies have linked verbal communication behaviors to health outcomes3–7 and established that conversation is relevant to shared decision-making,8 cultural competence,9 and patient-centeredness.10 However, quantitative analysis of clinical conversation has a very high barrier to entry. Coding a single outpatient visit with non-trivial annotations can require hours of work even for a trained expert. Fine-grained annotation of transcripts is prohibitively expensive to scale to datasets large enough to offer statistical power for many applications, and even relatively small-scale studies are expensive. Instead, results often use simple counting of annotation frequency or similar analyses.11 12 More complex conversation analysis, such as studying variation in conversational behaviors throughout a discussion, requires a large amount of data to be annotated; this fine-grained annotation of large collected datasets is not affordable or practical for the majority of interested researchers.

With recent advances in machine learning for text, there is new opportunity for this cost barrier to be dismantled. In other medical contexts, the potential of natural language processing is already being realized. Within the clinical setting, in particular, natural language processing has been used for automated extraction of factual content (for instance, in clinical notes, electronic health records, and narratives13–16). Social behavior, meanwhile, has been studied with machine learning methods in informal settings such as online discussion forums, covering diverse topics in social support, medical information and personal narrative sharing, and medical education and awareness.17–19 Using machine learning to study social behavior in clinical conversations, by contrast, is an essentially unexplored area of research.

This paper demonstrates an application of machine learning for automated annotation of one sociolinguistic aspect of conversation. We study the information-giving ratio, a speaker's balance between giving and requesting information. We begin with a validation of the annotation scheme's qualitative interpretability and quantitative ability to predict outcomes. We then demonstrate that machine learning is approaching the ability to replicate the results that would be found with human coding. Together, these results demonstrate both the utility of information-giving annotation and the process of using machine learning for automated conversation annotation more generally in a clinical context.

Annotating information-giving
Prior work has demonstrated that language from medical interactions confers social meaning extending beyond mere semantic content, making use of constructs from systemic functional linguistics.20 21 Our work draws from that subfield to define a measure of 'information-giving.' Information-giving annotations mark, at the speech act level, when a speaker gives information or requests information, similar to annotations of statements or questions.22 A slight variant of this annotation scheme has previously been used for analysis of direction-giving dialogs,23 24 collaborative learning,25 and online support groups for cancer patients.26 27 The work presented here is the first to transfer the scheme to clinical encounters.


Throughout this paper we use the speech act as the unit of annotation for information-giving. Speech act theory28–30 is a sociolinguistic approach which identifies the social act embodied in an utterance, such as questioning, representing reality, expressing the speaker's inner state, or giving instructions. A speech act is more fine-grained than a turn (an uninterrupted span of speech from one speaker); a turn may consist of multiple speech acts with different intentions. The annotation we use here includes only selected speech acts from coding done using the Generalized Medical Interaction Analysis System (GMIAS).22 GMIAS codes are hierarchical, with high-level categories (eg, Ask Questions, Give Information, and Conversation Management) that are broken down into more narrow and concrete behaviors (such as Open and Closed Questions, self-reports of behavior, or Introduce Topic).

Our study is a secondary analysis of a dataset that had already been collected, segmented into speech acts, and annotated with GMIAS codes. Additionally, a subset of that data was annotated with the Comprehensive Analysis of the Structure of Encounters System (CASES) coding scheme.31 CASES annotates several metadiscursive aspects of a transcript, such as assigning 'ownership' of topics in a dialog and subdividing conversations into distinct segments in the medical interview. We use this latter annotation, known as the process that a speech act is fulfilling, in our assessment of the face validity of our information-giving scheme.

Social ritual (such as 'thank you' and 'hello'), acknowledgments, promises, and various other categories of speech acts neither contribute new information from the speaker nor request information; they are grouped as 'Other' and ignored for our study, although those turns are contentful and have the potential to reveal insights in other analyses. The speech act classifications we do include are aggregated from more detailed categories in the GMIAS. An example of this annotation scheme is presented in figure 1.

From these speech act-level annotations, we define a metric of overall information-giving from a speaker in a conversation: the 'information-giving ratio.' This value can be assigned as an aggregate of a set of speech acts, and is useful as a per-speaker, rather than per-speech act, measurement.
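To make the unit of analysis concrete, the minimal sketch below shows one way a speech act record might be represented; the field names and example turn are invented for illustration, not drawn from the GMIAS or CASES data model. It shows a single provider turn that splits into two speech acts with different information-giving labels.

```python
from dataclasses import dataclass

# Illustrative speech act record; the actual GMIAS/CASES data model is richer.
@dataclass
class SpeechAct:
    speaker: str  # e.g. "provider" or "patient"
    text: str
    label: str    # "giving", "requesting", or "other"

# One provider turn broken into two speech acts with different intentions.
turn = [
    SpeechAct("provider", "Your viral load is still undetectable.", "giving"),
    SpeechAct("provider", "Any side effects from the new regimen?", "requesting"),
]
```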

Machine learning: background
We use machine learning as follows. After a set of initial training transcripts is annotated by hand, machine learning extracts features from those examples that represent the text content of speech acts, then discovers latent patterns across the training set. Those patterns are used to reproduce human coding behavior. No hand-derived rules or expert input is needed.

Computers are incapable of semantic understanding; machine learning instead depends on statistical correspondence between linguistic features extracted from the training input and the target labels. We use several lexical and syntactic features, each of which is simple on its own. In aggregate, using many features gives machine learning the flexibility to identify patterns in language use. A description of the types of features we use is presented in table 1. Using the line of dialog originally highlighted in figure 1, we present some example features in more detail in figure 2. Each feature represents a local unit of text from the transcript, rather than a measurement of the speech act as a whole.

In our task, each speech act in a dialog transcript must be assigned exactly one label from a small set of possibilities. The classifier is thus a mapping function between input features and output label.
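As a rough illustration of this setup, the sketch below trains an L2-regularized logistic regression on unigram and bigram presence features. It uses scikit-learn as a stand-in for LightSIDE and LibLinear, and the three training speech acts and their labels are invented for the example; it covers only the first two feature types in table 1.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: one string per speech act, one label each.
texts = ["how are you feeling today", "my back has been hurting", "okay"]
labels = ["requesting", "giving", "other"]

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),      # unigram + bigram presence features
    LogisticRegression(penalty="l2", solver="liblinear"),   # L2-regularized logistic regression
)
clf.fit(texts, labels)
print(clf.predict(["does it hurt when you walk"]))
```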

OBJECTIVE
The long-term goal of this research is to produce machine learning systems that can reliably annotate information-giving at a level that matches humans. Ensuring that this goal is both useful and feasible in practice requires tests of validity and reliability at many stages of data analysis. In this work, we test this research workflow at four stages, asking whether:
1. Findings from the information-giving annotation scheme correspond to an intuitive understanding of provider–patient interactions.
2. Information-giving correlates meaningfully with indicators of outcomes that matter for patient healthcare quality.
3. An automated system can reproduce speech act annotations of information-giving with near-human reliability.
4. The noisy automated annotation of information-giving can still detect meaningful trends in indicator variables.

Our evaluations are divided into two categories: first, evaluating information-giving as a metric for provider–patient interactions (#1 and #2); second, applying automated annotation to that scheme (#3 and #4). Our higher-level objective is to define a generalized workflow for evaluating automated annotation systems. As machine learning becomes a viable option for some annotation tasks, a methodical way of testing its applicability in new domains will be desirable. Our work provides a roadmap for future researchers weighing the option of automating their data annotation.

MATERIALS AND METHODS
Data collection
This work is a secondary analysis of 415 transcripts of routine outpatient visits by people living with HIV, from four widely separated sites of care in the USA, collected and annotated for the Enhancing Communication and HIV Outcomes (ECHO) Study.34 Eligible providers were physicians, nurse practitioners (NPs), or physician assistants who provided primary care to HIV-infected patients. Eligible patients were HIV-infected; 19 years of age or older; English-speaking; identified in the medical record as non-Hispanic black, Hispanic, or non-Hispanic white; and had at least one prior visit with their provider. Forty-five providers participated. Each patient was represented by a single visit, while providers typically saw 8–10 patients in our sample.

A total of 435 visits were originally recorded, but 18 of the recordings were unusable due to recorder malfunction, and two of the visits turned out to be principally with providers other than the HIV specialist, leaving 415 available for analysis. In addition to the 45 index providers, many encounters also featured a second provider, an NP or fellow, particularly at one site which uses a model in which patients are normally first seen by an NP and then by a physician. We call these 'complex' visits. There were a total of 36 such complex visits, 30 of them at one site (27 featuring an NP and a physician, 3 featuring 2 physicians), with six visits at another site featuring a second physician, presumably a fellow plus the attending physician. Seventeen visits featured only an NP, with no physician participating.

The original study was approved by Institutional Review Boards (IRBs) at all four participating sites. This secondary analysis was declared exempt by the IRB at the lead site, which has custody of the data, as it uses only de-identified data.

A professional transcription service or research assistant transcribed audio recordings of visits, and a research assistant reviewed the resulting transcripts for accuracy.


Figure 1 Annotated transcript excerpt from a provider–patient dialog.

Research assistants then coded the transcripts using the GMIAS. These coded transcripts were used to train the machine learning system, and served as the 'gold standard' for evaluation. A subset (50 encounters) was also annotated with the CASES scheme.31 All of the above steps were performed manually by human annotators.

To acquire 'gold-standard' information-giving annotations from these data (which we use for training our machine learning annotation system), we define a mapping from GMIAS codes to information-giving, sketched in code after the list:
1. All Representative or Expressive Questions (questions soliciting the hearer's opinions, feelings, or other inner states) are mapped to 'Requesting Information.'
2. Representative statements (except Repeat), expressive statements (except Compliment, Agree, Apologize, and Validating), emotion statements (except Laughter, Surprise/Awe, and Mild Satisfaction), and some directive statements (Convince, Give Permission, and Approve/Encourage) are mapped to 'Giving Information.'
3. All other annotations are marked as 'Other.'
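A minimal sketch of this mapping follows. The category and subcode strings are illustrative placeholders rather than the actual GMIAS code identifiers, which are hierarchical and more detailed.

```python
# Sketch of the GMIAS-to-information-giving mapping described above.
# Category/subcode names are placeholders, not real GMIAS identifiers.
def map_gmias(category: str, subcode: str) -> str:
    if category in ("representative question", "expressive question"):
        return "requesting"
    if category == "representative" and subcode != "repeat":
        return "giving"
    if category == "expressive" and subcode not in ("compliment", "agree", "apologize", "validating"):
        return "giving"
    if category == "emotion" and subcode not in ("laughter", "surprise/awe", "mild satisfaction"):
        return "giving"
    if category == "directive" and subcode in ("convince", "give permission", "approve/encourage"):
        return "giving"
    return "other"

print(map_gmias("representative", "self-report"))   # giving
print(map_gmias("expressive question", "open"))     # requesting
print(map_gmias("conversation management", "introduce topic"))  # other
```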

Measurable outcomes
For each transcript, immediately following the medical encounter, research assistants for the original ECHO study administered a survey to patients that assessed demographic, social, and behavioral characteristics, as well as their experience of care and ratings of provider communication.

Before performing our analyses, we selected five indicator variables from these surveys which we felt were likely to relate to our measure of information-giving: communication quality,35 provider decision-making,36 participatory decision-making,8 37 interpersonal style,35 and interpersonal trust.38 Each of these is an aggregate measure from a multi-item scale. This subset of indicator variables was selected prior to our experiments with automated annotation, to avoid overfitting to in-sample data.

Information-giving: aggregating fine-grained annotations
From speech act-level annotation, we assign an Information-Giving Ratio to each speaker. We define this by summing across speech acts and dividing:

Information-Giving Ratio = #Giving / (#Giving + #Requesting)

Intuitively, an Information-Giving Ratio indicates the extent to which a speaker acts as the source of new information in the dialog. A score of 1.0 would indicate a speaker who never requested information from other speakers, while a score of 0.0 would indicate that they never provided new information. We assign an Information-Giving Ratio to each speaker separately, as one speaker's high Information-Giving Ratio does not necessitate that the other speaker gave correspondingly less information.
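The calculation is straightforward to implement. The sketch below computes the ratio per speaker from (speaker, label) pairs, ignoring speech acts labelled 'other'; the example acts are invented.

```python
from collections import Counter

# Compute each speaker's Information-Giving Ratio from labelled speech acts,
# where each act is a (speaker, label) pair with labels "giving",
# "requesting", or "other" (the last is ignored by the ratio).
def information_giving_ratio(acts):
    counts = {}
    for speaker, label in acts:
        counts.setdefault(speaker, Counter())[label] += 1
    return {
        speaker: c["giving"] / (c["giving"] + c["requesting"])
        for speaker, c in counts.items()
        if (c["giving"] + c["requesting"]) > 0
    }

acts = [("provider", "requesting"), ("patient", "giving"),
        ("provider", "giving"), ("patient", "other")]
print(information_giving_ratio(acts))  # {'provider': 0.5, 'patient': 1.0}
```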

Validation and evaluations
We divide our study into two distinct evaluations. The first is a typical annotation study as applied to a collection of clinical transcripts. We then test our ability to automate that annotation, first at a fine-grained level and then for predicting outcomes.

Table 1 Features extracted for our machine learning pipeline, with references demonstrating them as standard published techniques, when applicable

Unigrams ('bag-of-words')32
Description: One feature for each word observed in the entire set of training transcripts. For a single line of dialog, the value of each unigram feature is set to 'true' if that word appears, and 'false' if the word does not appear.
Purpose: Determining vocabulary that corresponds to particular labels, without having to define keywords or metrics ahead of time.

Bigrams32
Description: Extends the unigram representation to adjacent pairs of words.
Purpose: Recognizing portions of phrases, rather than individual terms from the vocabulary.

Part-of-speech bigrams32
Description: Identical to bigrams; however, rather than the surface form of a word, we abstract that word to the level of part-of-speech tags.
Purpose: Capturing grammatical structure on a basic, local level. This abstraction allows simple syntactic patterns to be recognized as indicative of a particular annotation.

Role-specialized N-grams33
Description: All features in the three categories listed above are duplicated, but specialized by the predefined role of the speaker, using a domain adaptation approach described in prior work.
Purpose: Identifying differences in the use of a word across predefined social roles (for instance, using technical medical terms may indicate a different behavior from provider compared to patient).

Adjacent speech – content similarity24
Description: For both the previous and next speech acts from the other speaker in a transcript, we measure the similarity in vocabulary with the current speech act. To measure this, we use cosine similarity with TF-IDF term weighting.
Purpose: Recognizing shared topics and vocabulary between speakers can be a good indicator of the intention of a speaker; term weighting allows us to focus on rare, relevant words rather than function words like pronouns and conjunctions.

Adjacent speech – hypothesis label
Description: In a first pass, this feature is blank. In subsequent passes, this feature is the predicted label of the previous and next speech act from the other speaker in the transcript.
Purpose: Context changes the likelihood of information-giving behaviors; for instance, if the previous speaker was Requesting Information, then this speech act is more likely to be Giving Information in response.
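As an illustration of the 'Adjacent speech – content similarity' feature in table 1, the sketch below computes TF-IDF-weighted cosine similarity between two hypothetical speech acts using scikit-learn. In practice the vectorizer would be fit on the full training corpus rather than a single pair; the two example utterances are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity between TF-IDF vectors of a speech act and the other
# speaker's adjacent speech act (illustrative two-document corpus).
corpus = ["any chest pain lately", "no chest pain just some dizziness"]
vectors = TfidfVectorizer().fit_transform(corpus)
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(round(similarity, 3))
```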


Figure 2 Example features drawn from a single speech act. POS, part-of-speech.

Evaluation 1: validation
We evaluate the face validity of our analysis via comparison with the process annotations from the CASES coding scheme. The process annotation is based on the nature of the task being performed in that portion of the transcript, making it a good foundation on which to judge the interpretability of results. Processes span multiple speech acts, from both speakers, and can reappear multiple times throughout a conversation. We focus on three processes in particular:
1. Presentation: patient-specific circumstances, including the patient's observations and experiences as well as test results or clinical observations.
2. Information: medical facts and information, both abstract and general, non-specific to the patient, biomedical or otherwise.
3. Resolution: solutions, treatment options, referrals, etc, and solving the problem that triggered the medical interaction.

For each transcript, we calculate the Information-Giving Ratios separately for each process in which each speaker participates. This results in three separate values for each speaker, one per process. If information-giving is useful for interpretation, we expect to see differentiation in information-giving patterns among processes.

Next, we model our five outcome indicator variables as separate multiple regressions. Each regression uses aggregate patient and provider Information-Giving Ratios and patient gender as input, and includes second- and third-level interaction terms between these three variables. In preliminary analysis, patient gender demonstrated a significant interaction with the Information-Giving Ratio, but age and race did not; we therefore discarded those factors in further analyses and regressions. To enable the model to assign weight to balanced ratios, rather than merely high or low ratios, we convert the Information-Giving Ratio into a three-valued nominal variable, based on whether a speaker's ratio fell into the lowest quartile, the middle 50%, or the highest quartile among all speakers we observe in our training transcripts. Three transgender patients were excluded from these regressions.
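The sketch below illustrates the general shape of this regression setup on a small invented dataframe, not the study's data or exact model: ratios are binned into lowest quartile, middle 50%, and highest quartile, and an outcome is regressed on patient bin, provider bin, and patient gender with all two- and three-way interactions.

```python
import pandas as pd
import statsmodels.formula.api as smf

def quartile_bin(series):
    # Lowest quartile / middle 50% / highest quartile, as described above.
    q1, q3 = series.quantile([0.25, 0.75])
    return pd.cut(series, [-float("inf"), q1, q3, float("inf")],
                  labels=["low", "mid", "high"])

# Tiny invented dataset for illustration; the study used 415 visits.
df = pd.DataFrame({
    "outcome":        [3.8, 3.2, 4.0, 3.5, 3.9, 3.1, 3.7, 3.6],
    "patient_ratio":  [0.55, 0.70, 0.62, 0.81, 0.58, 0.66, 0.74, 0.60],
    "provider_ratio": [0.88, 0.79, 0.90, 0.72, 0.85, 0.83, 0.77, 0.80],
    "gender":         ["F", "M", "F", "M", "F", "M", "F", "M"],
})
df["patient_bin"] = quartile_bin(df["patient_ratio"])
df["provider_bin"] = quartile_bin(df["provider_ratio"])

# '*' expands to main effects plus all two- and three-way interaction terms.
model = smf.ols("outcome ~ C(patient_bin) * C(provider_bin) * C(gender)", data=df).fit()
print(model.params)
```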

Evaluation 2: automation
All machine learning was performed in LightSIDE, an open source research tool for automated text analysis.32 Our work uses L2-regularized logistic regression as our classification algorithm, as implemented by LibLinear, an efficient, open source implementation available within LightSIDE's user interface.39 Where not otherwise noted, each classifier was trained on 40 conversations. Reported automated performance is the averaged result of five 'runs'; in each run, a distinct 40-conversation subset of our total dataset was used for training, and automated annotation of the remaining 375 conversations was compared to human annotation to calculate accuracy.

To evaluate the reliability of annotation, we report Cohen's κ agreement statistic:

Cohen's κ = (% Accuracy − Chance Accuracy) / (1 − Chance Accuracy)

This statistic is widely used in annotation studies,40 especially where the distinction between two particular annotators is of interest (in our case, human versus automated annotation), rather than treating annotators as interchangeable. We also test automated reproduction of the Information-Giving Ratios per speaker. For this measure, only two data points are collected for each transcript (one per speaker). To test this reproduction, we calculate the correlation (r) between Information-Giving Ratios calculated with manual versus automated annotations. In addition, we study the impact of varying the number of initial training transcripts and the inclusion of hypothesis label features (the final row in table 1), to give an estimate of the manual effort required to achieve usable machine learning reliability.

Finally, we replicate the outcome indicator multiple regressions from manual annotation; however, in this case we calculate each Information-Giving Ratio from automatically annotated transcripts. With this test, we determine whether the same patterns can be identified with automated annotation. This measures the readiness of automated annotation for discovering overall conversational behaviors, rather than speech act-level annotations.
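These two reliability measures can be computed as in the sketch below, using scikit-learn and SciPy on invented labels and per-speaker ratios; the values do not reproduce the paper's results.

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr

# Agreement between human and automated speech act labels:
# kappa = (accuracy - chance accuracy) / (1 - chance accuracy).
human     = ["giving", "giving", "requesting", "other", "giving"]
automated = ["giving", "other",  "requesting", "other", "giving"]
print(cohen_kappa_score(human, automated))

# Correlation between per-speaker ratios from manual vs automated annotation.
manual_ratios    = [0.62, 0.81, 0.58, 0.74]   # hypothetical values
automated_ratios = [0.60, 0.83, 0.55, 0.78]
r, p = pearsonr(manual_ratios, automated_ratios)
print(r)
```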

RESULTS
In total, our data consist of 415 conversations comprising 239 311 speech acts: 118 287 annotated as Giving Information (49.4%), 28 576 as Requesting Information (11.9%), and 92 448 as Other (38.6%). Descriptive statistics for our outcome indicator variables are presented in table 2.


Table 2 Indicators of communication quality used for evaluating the utility of our annotation

Variable | Range | Median | 25th–75th %ile
Communication quality (overall)31 | 1.11–4.00 | 3.75 | 3.47–3.90
Provider decision-making32 | 0.67–4.00 | 3.67 | 3.17–4.00
Participatory decision-making6 33 | 0–4 | 3.75 | 3.25–4.00
Interpersonal style31 | 2–4 | 3.93 | 3.71–4.00
Interpersonal trust34 | 2–5 | 4.36 | 4.00–4.82

Table 3 Information-Giving Ratio and indicator outcome variable correlations

Variable | Correlation (r)
Communication quality | 0.287*
Provider decision-making | 0.264*
Participatory decision-making | 0.252
Interpersonal style | 0.344*
Interpersonal trust | 0.265*

*p