Text Mining for Automatic Lexical Analysis of Layman Text of Biomedical Argument

D. Defilippi, S. Pivetti, and M. Giacomini
Department of Communication, Computer and System Sciences, University of Genoa, Genoa, Italy

Abstract— Despite various efforts in recent years to improve the reliability of health care material on the World Wide Web, progress on this issue has been limited. Thus far, a variety of terms have been used to describe it, including quality, trustworthiness, and credibility. In spite of growing concern about the quality and accuracy of online health information, there is mounting evidence that online information affects consumer behavior. This project aims to supply internet users with software able to recognize the amount of information in layman texts published online and to determine the credibility level of the corresponding web sites.

Keywords— Standard vocabularies, Text mining.
I. INTRODUCTION

Consumers usually use general-purpose search engines without a defined search plan. Although users express concern regarding the quality and accuracy of online health information, few users recall where they obtained the information that answered their health-related questions. This project studies a solution to the problem of consumers incidentally encountering wrong or unrequested information during their internet searches. A wide range of articles has been published recommending criteria or indicators of credibility for health care web sites [1]. Some publications suggest, as examples of quality criteria, that the author of the web site content or the date on which the content was written should be disclosed [2]. In [3] a credibility framework for web sites is outlined that distinguishes four types of credibility: presumed, reputed, surface, and earned. These four credibility types are then tailored specifically for application to health care web sites. From [3] one can see how a web site's credibility can increase with the introduction of characteristics such as links to other health care web sites, an appealing graphical layout, or the use of "personalized information" that takes the user's preferences into account. Other factors that can increase web credibility are the speed with which a site answers users' questions, email confirmation of performed operations, and so on. One can notice that all these studies base their conclusions about web site credibility more on the graphical and superficial aspects of the sites than on the relevance of the information contained in them. Some works have considered how often web users encounter unrequested information during an online search. The study in [4] estimates the likelihood that consumers will incidentally encounter information regarding complementary and alternative medicine (CAM) while searching for cancer information on the web, as well as the factors that influence the retrieval of CAM information. To overcome this key problem, some credibility certifications for web sites, based on the information they contain, have been created. However, these rely only on human effort and on guidelines that web sites have to follow. A consistent set of tools to reliably express credibility in health care web sites has yet to be developed [5]. A first step toward solving these issues is to determine common terminology, criteria, means of implementation, and a theoretical framework with which to conduct research in this field.
II. METHODS

MS SQL Server 2005 and MS Visual Studio 2008 were chosen as development tools, mainly because of their security and performance features. The definition of the database was approached at three levels: conceptual planning (creation of the Entity-Relationship diagrams), logical planning (creation of the UML use-case diagrams), and physical planning (implementation of the database tables). The starting material was a sample of three hundred articles taken from Italian health care web sites. A database was created and filled with this sample of articles. A web program, integrated with a Web Service, was created in order to assess the relevance of the articles to the topic they are supposed to cover. The program can be used through a web-based user interface.
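As an illustration of the physical planning step, the sketch below shows a minimal way the article sample and its topic annotations could be stored. It is only a sketch: the actual system uses MS SQL Server 2005 and the richer E-R design of Fig. 1, and the table and column names here are hypothetical.

```python
# Minimal storage sketch (hypothetical names; the real system uses
# MS SQL Server 2005 and the E-R design shown in Fig. 1).
import sqlite3

conn = sqlite3.connect("textminmed_demo.db")
cur = conn.cursor()

# One table for the collected articles, one for their assigned MeSH topic.
cur.execute("""
CREATE TABLE IF NOT EXISTS article (
    id         INTEGER PRIMARY KEY,
    source     TEXT NOT NULL,   -- e.g. 'Repubblica.it', 'Kataweb', 'Sportello Cancro'
    html       TEXT NOT NULL,   -- raw HTML as collected
    plain_text TEXT             -- filled after metadata removal
)""")
cur.execute("""
CREATE TABLE IF NOT EXISTS article_topic (
    article_id INTEGER REFERENCES article(id),
    mesh_code  TEXT NOT NULL,   -- MeSH descriptor of the declared topic
    tree_code  TEXT NOT NULL    -- MeSH tree code of that descriptor
)""")
conn.commit()
conn.close()
```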
III. RESULTS

The purpose of this work was to determine an appropriate implementation and to develop a theoretical framework by which credibility in health care web sites can be assessed in a fully automatic way.
This is achieved through the creation of a program with a web-based user interface, which performs an automatic lexical analysis of the layman texts published on health care web sites and determines the relevance of these texts to the topic they are supposed to cover. The first part of the work concerned the application of data mining concepts, which are typical of structured data samples, to a sample of unstructured data such as health care texts, creating an algorithm able to apply text mining techniques [6]. The second part consisted in collecting the article sample. Three hundred articles concerning health care were collected, in HTML format, from three of the most important Italian health care web sites: Repubblica.it, Kataweb, and Sportello Cancro. The most important problem to solve was to define a method capable of working with the different types of HTML coding used by the three sources of information considered. The sample of articles was loaded into a SQL database created specifically for this purpose (E-R diagram in Fig. 1). The developed program (called TextMinMed) can manage the database and the article sample in a fully automatic way. The program examines the HTML code of each article to isolate the textual content and deletes all the metadata not relevant to this study. After this first processing step, the texts are tokenized to obtain a single list of words (nouns, adverbs, prepositions, etc.).
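The following is a minimal sketch of this cleaning and tokenization step. It is an assumption-laden illustration: the real TextMinMed program handles the specific HTML layouts of the three sources, whereas this generic version simply drops script/style metadata and splits the remaining text into lowercase word tokens.

```python
# Minimal sketch of HTML cleanup and tokenization (generic illustration,
# not the actual TextMinMed parsing of the three sources' layouts).
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/head metadata."""
    SKIP = {"script", "style", "head"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

def tokenize(html: str) -> list[str]:
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    # Keep accented letters, since the corpus is Italian.
    return re.findall(r"[a-zàèéìòù]+", text.lower())

print(tokenize("<html><body><p>Il virus HIV causa l'AIDS.</p></body></html>"))
# ['il', 'virus', 'hiv', 'causa', 'l', 'aids']
```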
Subsequently, this list is filtered by comparing its terms with a database of "not important" (stop) words, to obtain the final list with which the algorithm can estimate the quantity of information within each single article. The list obtained from this processing contains, for each paper, all the terms that are included in the structured medical thesaurus MeSH.
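A minimal sketch of this filtering and MeSH lookup step is shown below. The stop-word set and the Italian-term-to-MeSH map are hypothetical in-memory fragments used only for illustration; in the actual system both live in the project database, and the codes shown are illustrative rather than verified against a specific MeSH release.

```python
# Minimal sketch of stop-word filtering and MeSH term counting.
# STOP_WORDS and MESH_IT are hypothetical fragments for illustration only.
STOP_WORDS = {"il", "la", "di", "e", "che", "un", "una", "per", "con", "l"}

# Hypothetical Italian MeSH term -> (descriptor id, tree code) map;
# codes are illustrative, not taken from an official MeSH release.
MESH_IT = {
    "aids": ("D000163", "C20.673.480"),
    "hiv":  ("D006678", "B04.820.650"),
}

def mesh_term_counts(tokens: list[str]) -> dict[str, int]:
    """Count occurrences of MeSH terms after stop-word removal."""
    counts: dict[str, int] = {}
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        if tok in MESH_IT:
            counts[tok] = counts.get(tok, 0) + 1
    return counts

print(mesh_term_counts(["il", "virus", "hiv", "causa", "l", "aids"]))
# {'hiv': 1, 'aids': 1}
```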
Fig. 2 Comparison results
Fig. 1 DB Structure
The way the program assigns a level of credibility to a web site is based on the quantity of information given to the users and on its relevance to the topic of the user's online search. To do this, the program counts the occurrences of the MeSH terms inside each article. A topic, coupled to the corresponding MeSH term, has been assigned to each article. Each MeSH term is associated with a tree code that represents the category it belongs to and its specificity. To determine the familiarity ranking between each article and the topic it is supposed to treat, the program compares the tree code of the topic with the tree code of each term contained in the article, calculating two parameters: the Depth value and the Distance value. The Depth value represents the position of the hierarchical level common to the two examined tree codes.
Fig. 3 Results of the comparison of a set of papers for a given subject
The Distance value represents how many hierarchical levels the program has to traverse to reach the common one. This parameter ranges from 0 to 11, the highest depth value in the present implementation of MeSH. With this definition, the smaller the Distance value, the higher the familiarity between the term and the topic. Consequently, observing these values for the totality of the terms gives the familiarity ranking between an article and its topic. In order to share data and to facilitate use of the program, a simple web-based user interface has been implemented. Through this interface, results can be retrieved very easily: a user can see the previous analyses performed on the articles already in the database. Moreover, an administrator can also insert new articles to be analyzed. The source of the new data has to be clearly indicated by the administrator, and a classification of the inserted paper (using appropriate MeSH descriptors and related qualifiers) has to be provided as well. In Fig. 3, the results of a search over a group of papers linked to AIDS are shown. The first column shows the MeSH codes of the terms used in the papers, followed by the name of the MeSH descriptor in Italian (all considered papers are in Italian), the number of occurrences in the whole set of papers, the number of papers that contain the descriptor, the tree node that contains the descriptor, and finally the Depth and Distance values, measured as described above.
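As a concrete reading of the Depth and Distance definitions given above, the sketch below computes both values from two dot-separated MeSH tree codes. It is only one plausible interpretation: the exact scoring implemented in TextMinMed may differ, and the example codes are illustrative.

```python
# Sketch of the Depth/Distance computation on MeSH tree codes
# (one possible interpretation of the definitions in the text).
def depth_distance(topic_code: str, term_code: str) -> tuple[int, int]:
    """Return (Depth, Distance) between two MeSH tree codes.

    Depth: position of the deepest hierarchical level the codes share.
    Distance: number of levels the term code must climb to reach that
    common level (0..11 in the current MeSH hierarchy).
    """
    topic_levels = topic_code.split(".")
    term_levels = term_code.split(".")
    depth = 0
    for a, b in zip(topic_levels, term_levels):
        if a != b:
            break
        depth += 1
    distance = len(term_levels) - depth
    return depth, distance

# Example with illustrative codes: a paper on an AIDS-related topic
# compared with a term from a neighboring branch of the same subtree.
print(depth_distance("C20.673.480", "C20.673.483"))  # (2, 1)
```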
IV. DISCUSSION

The TextMinMed program is currently a work in progress. With further updates it will be able to perform web crawling and to apply its algorithms to several web data sources. Nevertheless, these first results seem to indicate that in these papers for laymen we can hardly find concepts strictly correlated, from a scientific point of view, with the main declared subject of the paper. One technical reason for this may be that the MeSH coding system is not the best choice when the
intended target is limited to papers addressed to the lay general public. A more effective use of this system is foreseeable if all the concepts and terms already present in the MeSH system are also considered in the search, but the official Italian translation of these words is still missing, and we cannot translate them ourselves for reasons of standard maintenance. This application has been created with the purpose of becoming solid ground on which to develop a set of tools for maintaining software agents with web crawling capabilities. These agents can be used to assess web credibility and to keep the existing certifications up to date, extending their range of study at lower cost, with greater specificity, and at higher speed.
REFERENCES
1. Laura O'Grady. Future directions for depicting credibility in health care web sites. International Journal of Medical Informatics (2006), Volume 75, Issue 1, pp. 58–65.
2. Yunli Wang, Zhenkai Liu. Automatic detecting indicators for quality of health information on the Web. International Journal of Medical Informatics 76 (2007) 575–582.
3. Fogg, B. Persuasive Technology: Using Computers to Change What We Think and Do. Morgan Kaufmann, 2002.
4. Muhammad Walji, Smitha Sagaram, Funda Meric-Bernstam, Craig W. Johnson, Elmer V. Bernstam. Searching for cancer-related information online: unintended retrieval of complementary and alternative medicine information. International Journal of Medical Informatics (2005) 74, 685–693.
5. Giuseppina Lombardo, Barbara Caci, Maurizio Cardaci. Dalla credibilità offline alla web-credibility: dimensioni psicologiche del costrutto. Psychofenia, vol. X, n. 16, 2007.
6. R. Turra, G. Pedrazzi, F. Falciano. An introduction to text mining techniques and its application to literature data in biology. Abstracts' book, NETTAB, November 2003.
Corresponding Author: Mauro Giacomini
Institute: DIST – University of Genova
Street: Via All'Opera Pia 13
City: 16145 Genova
Country: Italy
Email: [email protected]