Automatically Generated Consumer Health Metadata Using Semantic Spaces

Guocai Chen1, Jim Warren1,2, Joanne Evans3

1 Department of Computer Science – Tamaki
2 Section for Epidemiology and Biostatistics, School of Population Health
The University of Auckland, New Zealand
3 Caulfield School of IT, Faculty of Information Technology
Monash University, Australia
[email protected]

Abstract

The continual growth of the World Wide Web presents the (also growing) population of health information seekers with the challenge of finding reliable information that is appropriate to their needs. Metadata about consumer health websites can provide a guide for end users and domain-specific search tools. In this paper we present and demonstrate a method for automatically inferring a non-trivial metadata attribute that has been encoded for breast cancer websites: whether the site is ‘medical’ or ‘supportive’ in tone. We induce decision trees to distinguish Medical vs. Supportive sites based on feature vectors of word co-occurrence patterns, founded in a semantic space model called Hyperspace Analog to Language (HAL). We achieve 82% (95% CI: 74% to 91%) classification accuracy. This should already be a useful capability for human metadata coders or to support on-the-fly queries, and it inspires us to further investigate metadata classifiers based on HAL features.

Keywords: breast cancer, consumer health informatics, medical, hyperspace analogue to language, decision tree, entropy.

1 Introduction

Escalating healthcare costs are one of the key drivers of increasing interest in the provision of health information on the web from a health consumer perspective (Eysenbach 2000). Expectations are that informed patients can more actively participate in decisions surrounding treatment choices, better monitor their condition, and have more efficient and effective interactions with medical professionals (Bomba 2005). The potential benefits of greater patient self-management and monitoring have led to the development of a number of health information portals and other initiatives aimed at improving the accessibility of relevant, quality and timely healthcare information on the web. The Australian HealthInsite (http://www.healthinsite.gov.au) and Geneva-based Health on the Net (http://www.hon.ch) are two key examples of such portals.

The Breast Cancer Knowledge Online (BCKOnline, or BCKO) project is another such initiative. A collaboration of Monash University, BreastCare Victoria and the Breast Cancer Action Group1, the project aimed to investigate how issues with the accessibility, relevancy, timeliness, format and quality of material on the web about breast cancer could be addressed. With a user-centred and metadata-driven design approach, the project developed a prototype of a web portal in which user-aware resource descriptions allow for differentiated access to breast cancer information available online (Burstein et al. 2005). At the heart of the portal is a knowledge repository made up of metadata descriptions of relevant resources. In response to the user studies and needs analysis undertaken in the initial stages of the BCKO project, this metadata includes information about the type and style of information presented in the resource, the stage of breast cancer to which it relates, and the categories of users to which it applies (Enterprise Information Research Group 2004). The search interface allows portal users to indicate their information preferences along these lines.

While usability studies have shown a high degree of satisfaction with the resultant portal, questions as to its scalability have been raised. Reliance on manual methods of metadata creation is problematic given the volume of information available online and its volatile, dynamic and complex nature (Hunter 2006). Further development of the portal therefore requires investigation into how the generation of metadata describing relevant resources from a user-centred perspective can be automated2.

In this paper we present and demonstrate a method for automatically inferring a non-trivial metadata attribute that has been encoded on BCKOnline resources: whether the site is ‘medical’ or ‘supportive’ in tone. In section 2, we describe the algorithms and related methods we have applied to this classification problem. Section 3 reviews the classification results. Section 4 discusses the practical utility of the research, as well as limitations and related and future work.

Figure 1: Personalised search interface of BCKOnline (http://www.bckonline.monash.edu.au/search/personalisedSearch.jsp)

1 An Intelligent User Sensitive Portal to Breast Cancer Knowledge Online is a collaborative, multidisciplinary research study funded from 2002-2004 by an ARC Linkage Grant. Chief Investigators Prof. Sue McKemmish, Dr Kirsty Williamson, A/Prof. Frada Burstein, A/Prof. Julie Fisher, Ms June Anderson (Monash University), Ms Sue Lockwood (Breast Cancer Action Group) and Ms Rosetta Manaszewicz (Breast Cancer Action Group).

2 This research is supported in part by: an ARC Discovery Project grant, Smart Information Portals: Meeting the knowledge and decision support needs of health care consumers for quality online information, funded from 2006-2008 with Chief Investigators A/Prof. Frada Burstein, Prof. Sue McKemmish and A/Prof. Julie Fisher, and Partner Investigator Prof. Jim Warren; and a University of Auckland postgraduate scholarship.

Copyright © 2007, Australian Computer Society, Inc. This paper appeared at the /Australasian Workshop on Health Data and Knowledge Management (HDKM 2008)/, Wollongong, NSW, Australia. Conferences in Research and Practice in Information Technology, Vol. 80. Ping Yu, James R. Warren and John Yearwood, Eds. Reproduction for academic, not-for profit purposes permitted provided this text is included.

2 Methodology

2.1 HAL matrix

Hyperspace Analog to Language (HAL) has been developed and illustrated over the last decade as a ‘dimensional’ model of the nature of common ground in a corpus of discourse (Lund and Burgess, 1996; Burgess and Lund, 1997a). More specifically, HAL allows automatic construction of a dimensional model from a corpus of text (Burgess, Livesay and Lund, 1998). In the case of HAL, an N x N matrix is instantiated with an N-length vector for each unique word occurring in a corpus. A ‘window’ several words in width is moved across the corpus; wherever two words occur within the window the value at their intersection in the matrix is incremented. Thus, a corpus is converted to a high-dimensional semantic space, with minimal consideration to grammar. Other significant semantic space models include LSA (Landauer, Foltz and Latham, 1998), which is widely used for document indexing, and the more recent COALS model (Rohde, Gonnerman and Plaut, 2004).

We were particularly influenced to use HAL because it has some track record in representing emotion in text. Burgess and Lund (1997b) examined whether HAL could represent abstract concepts, such as love, hate and joy. They found that, in a comparison with human raters in predicting abstract variables for a set of words, “global co-occurrence information carried in the word vectors can be used to predict a tangible proportion of the human likert scale ratings.”

At an algorithmic level, HAL requires going through the corpus word by word, and for each word assigning a value to other words in its neighbourhood (the ‘window’, the size of which is a significant parameter, typically 10 in practice). All words within the window are considered as co-occurring with each other with strengths inversely proportional to the distance between them. Thus an N by N matrix (the ‘HAL matrix’) will be instantiated, where N is the number of distinct words in the corpus. For instance, with a window size of 3 the HAL matrix for the example text “the cat sits on a mat” is as shown in Figure 2. In keeping with prior studies (as per the citations above), punctuation, sentence and paragraph boundaries, the order of co-occurrence of words (i.e., before or after), and some extremely common words (e.g. “a”, “the”, “is”, “are”, etc., called ‘stop words’) are not considered useful to the inference of the underlying semantic space and are discarded.

Figure 2: HAL matrix for “the cat sits on a mat”

2.2 The data

The BCKOnline Portal aims to provide a gateway to breast cancer information of high value to women afflicted by the disease, along with their families and friends. A key finding of the initial user needs analysis was the need to identify quality resources that dealt not only with medical and scientific issues relating to breast cancer, but also with its psychosocial impacts. The metadata schema developed to describe the resources therefore incorporates an encoding scheme for categorising the type of resource as Medical, Supportive and/or Personal (McKemmish et al. 2004). The BCKOnline Portal’s personalised search interface (Figure 1) allows users to indicate their preference. The results for the search term are then filtered accordingly.

The BCKOnline portal database provides metadata to support personalised search for approximately 1000 consumer health websites. To provide an initial test for HAL-based prediction of the BCKO type, we examine the problem of matching the classification of types ‘supportive’ or ‘medical’ to that given in the portal’s database. To train the classifier, a corpus was extracted from the web pages indexed by the portal. The text is conditioned by removing items outside the main text of the web page, including sidebars, ads, images and web links. BCKOnline indexes 135 sites that the coders have typed as Supportive and 701 of type Medical. The two types are not mutually exclusive; the 127 sites which are classified as both Supportive and Medical are omitted from further consideration in the present analysis. Figure 3 shows an excerpt of a web page classed as Medical in BCKOnline; Figure 4 shows an excerpt of a Supportive page.

Figure 3: Excerpt from webpage of type Medical

2.3 K-Fold Cross Validation

For development of a classifier we take 80 webpages each of type Medical and Supportive from those marked as just one of those types in the BCKOnline portal database. Each group is divided into eight (k=8) slots; thus in each slot, for each type, there are ten web pages. Each slot in a group in turn acts as the test set, with the rest of each group used as the training set. Thus, we perform an 8-fold cross validation assessment of our classifier. Figure 5 visually depicts the allocation of data in a cross-fold validation. HAL matrices (see section 2.1) have been built for each individual web page, as well as one big matrix for all the web pages in the training set (Ht).

2.4 Decision Tree Creation

Since the training matrix is very large (~7000 by 7000) and sparse, use of the entire matrix in the classifier development process is both computationally cumbersome and conceptually out of keeping with the concept of semantic spaces (which assumes the number of relevant semantic dimensions to be less than the total vocabulary size). In a HAL matrix the value of an element is a measure of the association of a word to other words. We picked the 100 vectors with the biggest sum of their vector values for each of the Medical group (Ht[med]) and the Supportive group (Ht[sup]), and the union of these two sets of vectors is the final base of the training set:

\[ H_t' = H_{t[med]} \cup H_{t[sup]} \]

To identify the type of a webpage, cosine has been selected as the measure of association. In keeping with the dimensional theory of a HAL matrix, a high cosine on the vector for a particular word between two HAL matrices indicates that the two corpora use the word in a similar context. Thus, the similarity of the HAL matrix for the ith test website on the jth word (j ∈ the words in Ht’) to the Medical corpus is:

\[ \mathrm{Sim}(H_i(j), H_{t[med]}(j)) = \cos(H_i(j), H_{t[med]}(j)) = \frac{H_i(j) \cdot H_{t[med]}(j)}{|H_i(j)| \times |H_{t[med]}(j)|} \]

The similarity to the Supportive corpus’ use of word j is defined in the same manner with respect to Ht[sup]. Taking word j as a candidate basis for classifying cases as Supportive or Medical, we simply estimate the type of the test case as being that with the highest similarity measure. Ties are consistently (if arbitrarily) resolved in favour of Supportive as the estimated type; this becomes important when the decision word is missing in a specific test case’s corpus.

To create a decision tree, we examine how well each candidate word j splits the training set into estimated membership of Supportive vs. Medical. The decision word for the root node of our decision tree is the one giving the maximum entropy reduction (i.e., information gain) from the parent to the child. This is repeated recursively, choosing a new best word for each node of the tree, until all training cases are correctly classified.
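The feature extraction of Section 2.1 and the single-word test of Section 2.4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names (`hal_matrix`, `cosine`, `classify_on_word`) and the tiny stop-word list are hypothetical, stop words are dropped before the window is applied, and the co-occurrence weight is taken as `window - distance + 1`, a common HAL convention that the text above describes only as decreasing with distance.

```python
import math
from collections import defaultdict

# Illustrative stop-word list only; the study's actual list is not given.
STOP_WORDS = {"a", "the", "is", "are", "on"}

def hal_matrix(tokens, window=3, stop_words=STOP_WORDS):
    """Build a symmetric, sparse HAL matrix as nested dicts: word -> word -> weight."""
    words = [t.lower() for t in tokens if t.lower() not in stop_words]
    matrix = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        # Pair word i with every later word that is at most `window` positions away.
        for j in range(i + 1, min(i + window + 1, len(words))):
            distance = j - i
            weight = window - distance + 1  # assumed weighting: closer words score higher
            # Order of co-occurrence is ignored, so update both directions.
            matrix[w][words[j]] += weight
            matrix[words[j]][w] += weight
    return matrix

def cosine(u, v):
    """Cosine between two sparse vectors (dicts); 0.0 if either vector is empty."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def classify_on_word(page, medical, supportive, word):
    """Estimate the type of `page` from one decision word; ties go to Supportive."""
    sim_med = cosine(page.get(word, {}), medical.get(word, {}))
    sim_sup = cosine(page.get(word, {}), supportive.get(word, {}))
    return "Medical" if sim_med > sim_sup else "Supportive"

if __name__ == "__main__":
    m = hal_matrix("the cat sits on a mat".split(), window=3)
    # After stop-word removal the text is "cat sits mat":
    # cat-sits and sits-mat are adjacent (weight 3), cat-mat is two apart (weight 2).
    print(dict(m["cat"]))
```

A page whose text lacks the decision word has an empty vector for it, so both similarities are zero and the tie rule returns Supportive, which is the behaviour the section describes.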


Figure 5: Illustration of 4-fold cross validation; 8-fold cross validation is used in the present study.

Figure 4: Excerpt from webpage of type Supportive

Table 1: Largest HAL vector sums for text corpora based on the Medical and the Supportive web sites

Figure 6: An induced decision tree for 70 training websites of each of Supportive and Medical; words (e.g., “Breast”) are the decision words for that node in the tree, with the entropy reduction in parentheses; number pairs at nodes indicate number of Medical and Supportive documents, respectively.

3 Results

Table 1 shows the dominant terms in each of the Medical and Supportive types of websites (i.e., those words with the largest vector sums in the HAL matrices based on their text corpora). Figure 6 shows one of the induced decision trees, which turns out to use the word “Breast” in the first decision node (note that case is ignored in processing; the capitalisation shown simply reflects the first instance of the word encountered in the corpus). Starting from the 70 Medical and 70 Supportive training cases, the entropy reduction from the root to the first child node is:

\[ \mathrm{Entropy}(70,70) - \frac{49+1}{70+70}\,\mathrm{Entropy}(49,1) - \frac{21+69}{70+70}\,\mathrm{Entropy}(21,69) \]

where

\[ \mathrm{Entropy}(m,n) = -\frac{m}{m+n}\log_2\frac{m}{m+n} - \frac{n}{m+n}\log_2\frac{n}{m+n} \]
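As a worked check of the expression above, the following sketch evaluates the entropy reduction for the root split quoted in the text (70/70 at the root, 49/1 and 21/69 in the children). The function names are illustrative, and the usual convention that an empty class contributes zero entropy is assumed.

```python
import math

def entropy(m, n):
    """Binary entropy in bits for a node holding m Medical and n Supportive cases."""
    total = m + n
    h = 0.0
    for count in (m, n):
        if count:  # assumed convention: 0 * log2(0) is treated as 0
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = sum(parent)
    gain = entropy(*parent)
    for child in children:
        gain -= sum(child) / total * entropy(*child)
    return gain

if __name__ == "__main__":
    gain = information_gain((70, 70), [(49, 1), (21, 69)])
    print(round(gain, 3))  # prints 0.446
```

Selecting the decision word for a node then amounts to computing this gain for the split produced by each candidate word and keeping the word with the largest value.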

Medical                      Supportive
Word           HAL val       Word         HAL val
cancer         145047.0      cancer       79304.0
breast         127739.0      children     57217.0
women           73062.0      treatment    34776.0
treatment       56426.0      time         34641.0
patients        43493.0      people       31541.0
chemotherapy    34381.0      child        30999.0
risk            31878.0      death        29979.0
therapy         26450.0      feel         29225.0
cells           26324.0      life         28633.0
disease         25479.0      breast       28563.0
studies         23130.0      family       26311.0
effects         22758.0      make         20878.0
estrogen        21283.0      care         19959.0
brca            19175.0      things       19649.0
study           18443.0      dick         19434.0
tumor           18248.0      foley        19340.0

Figure 7 illustrates some examples of the decision tree of Figure 6 in use to classify test cases (those in the eighth of the data held back from training and used to estimate classification accuracy). Figure 7(a) shows a webpage of type Medical (case ‘169 – Medical’) being tested first on its HAL vector for the word “Breast.” In this case Sim(H169(Breast), Ht[med](Breast)) = 0.76 and Sim(H169(Breast), Ht[sup](Breast)) = 0.73, so the decision proceeds to the left-hand node. It is then assessed in terms of the word “Treatment” and again found more similar to the Medical corpus on this word. That node is a leaf (with all 42 training cases on the left of “Treatment” being Medical), so the test webpage is classed into the Medical group, which is correct. Similarly, Figure 7(b) shows a successful classification of a Supportive web page. Figure 7(c) shows an incorrect classification of a Medical site. It is noteworthy that the text corpus of the website lacks any instance of the word “Treatment”, resulting in a zero Sim score and requiring arbitrary resolution of the tie.

Figure 7: (a) Decision flow for a Medical webpage in the decision tree resulting in a correct classification (the page is: http://www.cancerbacup.org.uk/info/goserelin.htm); (b) Correct classification of a Supportive webpage on the same decision tree (see http://my.webmd.com/content/chat_transcripts/1/103833.htm); (c) Incorrect classification of a Medical webpage that is missing the keyword “Treatment” (see http://theoncologist.alphamedpress.org/cgi/content/full/5/5/393?maxtoshow=&HITS=10&hits=10&RESULTFORMAT=&titleabstract=b)

Since our method is aimed at automation of metadata coding, it is relevant to be able to train a classifier using as few manually-coded cases as possible. Figure 8 illustrates our classification accuracy results, using the full 80 cases of each type at the far right (70 of each in training and 10 each for testing). The 95% confidence interval is based on the variance of the means for the 8 experiments of our 8-fold cross validation. Moving from right to left we progressively limit our available data. Notably, the mean classification accuracy is relatively stable; the confidence interval widens because, with 20 cases of each type, we hold back only two or three cases per fold for testing. The observed accuracy with the full 80 cases per type is 82.5% with a 95% confidence interval from 73.8% to 91.2%.
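The evaluation procedure can be sketched as below: each class is split into k slots, each slot serves once as the test set, and a confidence interval is formed from the k per-fold accuracies. This is a hedged sketch rather than the authors' code. The helper names are hypothetical, slots are formed by taking every k-th page rather than contiguous blocks, and the interval uses a Student t critical value for 7 degrees of freedom, which is an assumption since the text says only that the interval is based on the variance of the fold means.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 7 degrees of freedom (k = 8 folds).
T_975_DF7 = 2.365

def stratified_folds(medical, supportive, k=8):
    """Return k (train, test) pairs; fold i tests on slot i of each class."""
    folds = []
    for i in range(k):
        test = medical[i::k] + supportive[i::k]
        train = []
        for j in range(k):
            if j != i:
                train += medical[j::k] + supportive[j::k]
        folds.append((train, test))
    return folds

def mean_ci(accuracies, t_crit=T_975_DF7):
    """Mean accuracy with a t-based confidence interval over the folds."""
    mean = statistics.mean(accuracies)
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, mean - t_crit * sem, mean + t_crit * sem

if __name__ == "__main__":
    # Page identifiers stand in for the 80 Medical and 80 Supportive pages.
    folds = stratified_folds(list(range(80)), list(range(100, 180)), k=8)
    train, test = folds[0]
    print(len(train), len(test))  # prints 140 20

    # Hypothetical per-fold accuracies, for illustration only (not the study's data).
    print(mean_ci([0.85, 0.80, 0.90, 0.75, 0.85, 0.70, 0.95, 0.80]))
```

With 80 pages per class and k=8, each fold trains on 70 pages of each type and tests on 10 of each, matching the setup described above.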

4 Discussion

We have developed a method for automatic estimation of metadata attributes for consumer health websites. In particular, we demonstrate the ability to discern whether the text of a website is ‘medical’ or ‘supportive’ in nature (as per the metadata coding guidelines of McKemmish et al., 2004). To provide suitable features for this type of classification we use a semantic space model called Hyperspace Analog to Language (HAL). HAL provides a very high-dimensional sparse matrix representation of word co-occurrence in a text corpus. We induce a decision tree based on key words in the document and achieve around 80% classification accuracy.

There is recognition that a range of automated metadata creation and extraction techniques is needed for resource description (Paynter 2005). The results indicate that this technique may be a useful addition to an automated metadata generation toolkit provided to the domain experts responsible for resource description, helping to overcome the bottlenecks of manual metadata creation processes. It also has the potential to aid in resource identification and selection processes. Being able to automatically categorise results from a generic search engine along “medical” and “supportive” lines may help to ensure that the coverage of the knowledge repository of BCKOnline represents, and is responsive to, user needs.

Further work will explore the application of the technique to other elements in the BCKO metadata schema. We are also interested in continuing to improve the classification accuracy. For the latter task, it would seem that the weak link in the present method is the inability to recover should a globally-common keyword be missing in a specific website’s text. Considering the mis-classified Medical page from Figure 7(c), we can view part of its text in Figure 9. The tone and overall vocabulary of the page fit the BCKOnline ‘medical’ type well; the page simply lacks the keyword ‘Treatment.’ One solution to this may be to induce multiple independent decision trees (a decision forest). A more conventional solution would be to use another type of classifier, such as a k Nearest Neighbour (kNN) classifier. This would not provide the clear explainability of a decision tree, but it is not clear that there is significant benefit in explaining the classifier decisions for our application.

Figure 8: 95% confidence intervals of classification accuracy by number of cases available of each type (Medical and Supportive)

Figure 9: Excerpt from the Medical website (case ‘183 – Medical’) that lacks the keyword “Treatment” and was mis-classified as per Figure 7(c).

5 Acknowledgements

We thank Vojoslav Kecman, Peter Bruza and Robert McArthur for their advice on methods for this study.

6 References

Bomba, D. (2005), ‘Evaluating the Quality of Health Web Sites: Developing a Validation Method and Rating Instrument’, Proceedings of the 38th Annual Hawaii International Conference on System Sciences, 3-6 January, 2005, IEEE Computer Society, Los Alamitos, California, pp. 139-148.

Burgess, C., Livesay, K., Lund, K. (1998), Explorations in context space: words, sentences, discourse. Discourse Processes, 25: 211-257.

Burgess, C., and Lund, K. (1997a). Modelling parsing constraints with high-dimensional context space. Language and Cognitive Processes, 12(2/3), 177–210.

Burgess, C., Lund, K. (1997b), Representing abstract words and emotional connotation in a high-dimensional memory space. Cognitive Science Proceedings, LEA. pp. 61-66. http://hal.ucr.edu/pdfs/Burgess_Lund (1997b).pdf

Burstein, F., Fisher, J., McKemmish, S., Manaszewicz, R. and Malhotra, P. (2005), ‘User Centric Portal Design for Quality Health Information Provision’, Proceedings of the 38th Annual Hawaii International Conference on System Sciences, 3-6 January, 2005, IEEE Computer Society, Los Alamitos, California.

Enterprise Information Research Group, Monash University (2004), Breast Cancer Knowledge Online Portal, with BreastCare Victoria and the Breast Cancer Action Group, Monash University, http://www.bckonline.monash.edu.au/.

Eysenbach, G. (2000), ‘Consumer Health Informatics’, British Medical Journal, vol. 320, no. 7251, 24 June, pp. 1713-1716.

Hunter, J. (2006), ‘Next Generation Tools and Services: Supporting Dynamic Knowledge Spaces’, in Cushla Kapitzke and Betram C. Bruce (eds), Libr@ries: Changing Information Space and Practice, Lawrence Erlbaum Associates, Mahwah, New Jersey, pp. 91-111.

Landauer, T., Foltz, P.W., Latham, D. (1998), Introduction to latent semantic analysis. Discourse Processes. 25: 259-284.

Lund, K., and Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, and Computers, 28, 203–208.

McKemmish, S., Manaszewicz, R. and Cheah, C. (2004), BCKOnline Metadata Schema Version 1.0, BCKOnline, December, 38 pp, http://www.sims.monash.edu.au/research/eirg/BCKO_MetadataSchema_Version16.doc.

Paynter, G.W. (2005), ‘Developing Practical Automatic Metadata Assignment and Evaluation Tools for Internet Resources’, in Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital libraries 2005, Denver, Colorado, USA, June 7-11, ACM Press, New York, USA, pp. 291-300.

Rohde, D., Gonnerman, L., Plaut, D. (2004), An improved method for deriving word meaning from lexical co-occurrence, http://dlt4.mit.edu/~dr/COALS/Coals.pdf
