Automatic information extraction from the Web

David Sánchez, Antonio Moreno
Department of Computer Science and Mathematics
Universitat Rovira i Virgili (URV)
Avda. Països Catalans, 26. 43007 Tarragona (Spain)
{dsanchez, amoreno}@etse.urv.es

Abstract. The Web is a valuable repository of information. However, its size and lack of structure hinder the search for and extraction of knowledge. In this paper, we propose an automatic and autonomous methodology to retrieve and represent information from the Web in a standard way for a desired domain. It is based on the intensive use of a publicly available search engine and the analysis of a large number of web resources to assess the relevance of information. Polysemy detection and the discovery of aliases and synonyms for the evaluated domain are also considered. The results can ease users' access to Web resources and enable computer processing of the data.

1 Introduction

In recent years, the growth of the Information Society has been very significant, providing fast data access and information exchange all around the world through the World Wide Web. However, processing Web resources is a difficult task due to their unstructured definition (natural language and visual representation), their untrusted origin and their changing nature. Although several search tools have been developed (e.g. search engines like Google), when the searched topic is very general, or when we lack the exhaustive domain knowledge needed to set the most appropriate and restrictive search query, evaluating the huge number of potential resources obtained is slow and tedious.

In this sense, a methodology for representing and accessing the resources of a selected domain in a structured way would be very useful. It is also important that such representations can be obtained automatically, given the dynamic and changing nature of the Web (hand-made directory services like Yahoo are usually incomplete and obsolete, and require human experts).

The basis of our proposal is the repetition of information that characterizes the Web, which allows us to detect important concepts for a domain through a statistical analysis of their appearances. Moreover, the exhaustive use of Web search engines represents an invaluable help for selecting "representative" resources and obtaining global statistics of concepts (Google is our particular "expert" in all domains). So, in this paper we present a methodology to extract information from the Web to automatically build a taxonomy of concepts and Web resources for a given domain (for the English language). Moreover, if particular examples (e.g. proper names) for a

discovered concept are found, they are considered as instances in the defined hierarchy. During the building process, the most representative web sites for each selected concept are retrieved and categorized according to the specific topic covered. Finally, a polysemy detection algorithm is performed in order to discover different senses of the same word and group the most related concepts. The result is a hierarchical and categorized organization of the available resources for the given domain.

With the resulting hierarchy, an algorithm for discovering alternative domain keywords is also performed in order to find other "equivalent" forms of expressing the same concepts (e.g. synonyms, aliases, morphological derivatives, different spellings), which can be useful for widening the search process.

The key point is that all these results are obtained automatically and autonomously from the whole Web without any previous domain knowledge, and they represent the resources available at the moment. These aspects distinguish this method from similar ones [18] that are applied on a representative selected corpus of documents [17], use external sources of semantic information [16] (like WordNet [1] or predefined ontologies), or require the supervision of a human expert. A prototype has been implemented to test our proposal.

In addition to the advantage that a hierarchical representation of web resources provides when searching for information, the obtained taxonomy is also a valuable element for building machine-processable information representations like ontologies [6]. In fact, many ontology creation methodologies, like METHONTOLOGY [20], consider a taxonomy of concepts for the domain as the starting point for ontology creation.
However, the manual creation of these hierarchies of concepts is a difficult task that requires extended knowledge of the domain and obtains, in most cases, incomplete, inaccurate or obsolete results; in our case, the taxonomy is created automatically and represents the Web's state of the art for a domain.

The rest of the paper is organised as follows: section 2 describes the methodology developed to build the taxonomy and classify web sites. Section 3 covers the semantic disambiguation of the results. Section 4 describes the method for discovering equivalent terms. Section 5 explains the way of representing the results and discusses the main issues related to their evaluation. The final section contains some conclusions and proposes lines of future work.

2 Taxonomy building methodology

In this section, the methodology used to discover, select and organise representative concepts, instances and websites for a given domain is described. The algorithm is based on analysing a large number of web sites in order to find important concepts for a domain by studying the neighbourhood of an initial keyword (we assume that words that appear near the specified keyword are closely related to it). Concretely, in the English language, the word immediately before a keyword frequently classifies it, whereas the word immediately after it represents the domain where it is applied [19].
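The neighbourhood analysis can be illustrated with a minimal sketch (the tokenizer and the function name are ours; the actual system uses a trained annotation engine for tokenization and tagging):

```python
import re

def neighbour_words(text, keyword):
    """Collect the word immediately before (anterior) and after (posterior)
    each occurrence of `keyword`. Anterior words tend to classify the
    keyword ("prostate cancer"); posterior words tend to name the domain
    where it is applied ("cancer treatment")."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    anterior, posterior = [], []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            if i > 0:
                anterior.append(tokens[i - 1])
            if i + 1 < len(tokens):
                posterior.append(tokens[i + 1])
    return anterior, posterior

ant, post = neighbour_words("Prostate cancer and breast cancer treatment.", "cancer")
# ant == ['prostate', 'breast'], post == ['and', 'treatment']
```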

2.1 Concept discovery, taxonomy building and web categorization algorithm

In more detail, the algorithm for discovering representative concepts and building a taxonomy of web resources (shown in Fig. 1) has the following phases (see Table 1 and Figs. 2, 3 and 4, with examples obtained for the Sensor and Cancer domains, in order to follow the explanation):

[Fig. 1 shows the pipeline: the input parameters ("keyword" plus search constraints) feed a web search; the retrieved web documents go through web parsing and syntactical analysis, producing candidate concepts (name and attributes); a class/instance selection step, driven by the selection constraints, yields class URL sets and instance URLs; each selected class triggers a new recursive search with "class+keyword"; the final result is stored as an OWL ontology.]

Fig. 1. Taxonomy building methodology

• It starts with a keyword that has to be representative enough of a specific domain (e.g. sensor, cancer) and a set of parameters that constrain the search and the concept selection (described below and, in more detail, in section 2.2).
• Then, it uses a Web search engine (Google) as its particular "expert" in order to obtain the most representative web sites that contain that keyword (only the English-language ones). The specified search constraints are the following:
− Maximum number of pages returned by the search engine (e.g. 200 pages for the Cancer domain and 180 for the Sensor one): this parameter constrains the size of the search. In general, the larger the number of documents evaluated, the better the results obtained, because the quality of the results is based on the repetition of information.









− Filter of similar sites: for a general keyword (e.g. the first level of the search process: cancer), enabling this filter hides pages that belong to the same web domain, obtaining a set of results that covers a wider spectrum. For a concrete keyword (e.g. the fourth level of the search: non melanoma skin cancer), with a smaller number of results, disabling this filter returns the whole set of pages (even sub-pages of a web domain).

For each web site returned, an exhaustive analysis is performed in order to obtain useful information. Concretely:
− Different non-HTML document formats (pdf, ps, doc, ppt, etc.) are processed by obtaining the HTML version from Google's cache.
− For each "Not found" or "Unable to show" page, the parser tries to obtain the web site's old data from Google's cache.
− Frame-based sites are also considered and redirections are followed.
− All the useful information contained in the website is considered: titles, headers, web keywords, text, captions...

The web parser returns the useful text from each site (rejecting tags and visual information). Then, a syntactical analysis based on a previously trained annotation engine for the English language [23] is performed in order to detect sentences, tokens and grammatical categories. Next, it tries to find the initial keyword (e.g. cancer) in the retrieved text. For each match, it analyses the word immediately before it (e.g. prostate cancer). If this word fulfils a set of prerequisites, it is selected as a candidate concept. The words immediately after the keyword (e.g. cancer treatment) are also retrieved and stored. Concretely, the system verifies the following:
− Words must be "relevant words". Prepositions, determinants and very common words ("stop words") are rejected.
− Each word is analysed from its morphological root (e.g. family and familial are considered the same word and their attribute values -described below- are merged). A stemming algorithm for the English language is used.
− Only adjectives and nouns are considered.

For each candidate retrieved, a selection algorithm decides whether the candidate concept will be considered as a subclass of the initial one (a type of...), which could have new subclasses itself, or as a particular instance of the immediate superclass (an example of...). This differentiation is made by detecting whether the candidate concept is a "proper name" (an instance) or not (a subclass). In more detail, the system uses the following heuristic:
− The first letter must be a capital (but the word must not be at the start of a sentence). Moreover, the syntactic analyser must mark it as a "proper noun".
− If the same word is found again during the parsing, it must fulfil the same restrictions in all cases (or at least a high percentage of them) to be finally confirmed as a correct instance (see examples in Fig. 2 for the sensor domain).

Among the candidate concepts considered as subclass candidates, a statistical analysis is performed to select the most representative ones. Apart from the text frequency analysis, the information obtained from the search engine (which is based on the analysis of the whole web) is also considered. Concretely, we consider the following attributes (see examples in Table 1 for the sensor domain):
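A hedged sketch of the candidate filtering and class/instance heuristic described above (the stop list and the naive stemmer below are stand-ins for the real stop-word list and English stemming algorithm used by the prototype):

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "with"}

def stem(word):
    """Toy stemmer so that e.g. 'family' and 'familial' share the root
    'famil'; the prototype uses a real English stemming algorithm."""
    w = word.lower()
    for suffix in ("ial", "ies", "y", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

def classify_candidate(word, at_sentence_start, tagged_proper_noun):
    """Return 'instance' for proper-name candidates, 'class' for normal
    nouns/adjectives, or None for rejected words."""
    if word.lower() in STOP_WORDS:
        return None
    if word[0].isupper() and not at_sentence_start and tagged_proper_noun:
        return "instance"   # e.g. "BMW" before "sensor"
    return "class"          # e.g. "oxygen" before "sensor"

# family / familial merge under the same root:
# stem("family") == stem("familial") == "famil"
```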

− Total number of appearances: represents the concept's relevance for the domain and allows eliminating very specific or unrelated ones (e.g. situ sensor).
− Number of different web sites that contain the concept at least once: gives a measure of the concept's generality for a domain (e.g. wireless sensor is quite common, but heterogeneous sensor isn't).
− Estimated number of results returned by the search engine for the selected concept alone: indicates the concept's global generality and allows avoiding widely-used ones (e.g. level sensor).
− Estimated number of results returned by the search engine joining the selected concept with the initial keyword: represents the association between those two terms (e.g. "oxygen sensor" gives many results but "mode sensor" doesn't).
− Ratio between the last two measures: this is an important measure because it indicates the intensity of the relation between the concept and the keyword and allows detecting highly related concepts (e.g. "temperature sensor" is much more relevant than "mode sensor").

• Only concepts (a small percentage of the candidate list) whose attributes fit a set of specified constraints (a range of values for each parameter) are selected (marked in bold in Table 1). Moreover, a relevance measure (1) of this selection is computed based on how far the concept's attribute values exceed the selection constraints. This measure can be useful if an expert evaluates the taxonomy or if the hierarchy is bigger than expected (perhaps the constraints were too loose) and the less relevant concepts have to be trimmed.

relevance = [ 2·(#Dif_Webs / Min_Dif_Web) + 3·(#Appearances / Min_Appear) + (Google_Ratio / Min_Ratio) ] / 6    (1)
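As we reconstruct formula (1), it is a weighted average of how far each attribute exceeds its selection threshold (the weights 2, 3 and 1 follow our reading of the garbled original and roughly reproduce the relevance column of Table 1):

```python
def relevance(appearances, dif_webs, google_ratio,
              min_appear=9, min_dif_web=4, min_ratio=0.0001):
    """Relevance measure (1); the default thresholds are the
    Sensor-domain minimums shown in Table 1."""
    return (2 * dif_webs / min_dif_web
            + 3 * appearances / min_appear
            + google_ratio / min_ratio) / 6

# The "oxygen" candidate of Table 1 (72 appearances, 4 pages, ratio 0.0316):
print(round(relevance(72, 4, 0.0316), 1))  # 57.0 (Table 1 lists 56.90)
```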

• For each concept extracted from a word preceding the initial one, a new keyword is constructed by joining this concept with the initial keyword (e.g. "breast cancer"), and the algorithm is executed again from the beginning. So, at each iteration, a new and more specific set of relevant documents for the new, more concrete subdomain is retrieved. This process is repeated recursively until a selected depth level is reached (e.g. 4 levels for the cancer example) or no more results are found (e.g. invasive cervical cancer has no subclasses).
• The obtained result is a hierarchy that is stored as an ontology with is-a relations (adding the corresponding instances found for each class). If a concept has different derivative forms, all of them are represented (e.g. family, familial) but identified with an equivalence relation (see two examples in Fig. 3). Moreover, each class and instance stores the concept's attributes described previously and the set of URLs from which it was selected during the analysis of the immediate superclass (e.g. the set of URLs returned by Google when setting the keyword cancer that contain the candidate concept skin cancer).
• Just as the "previous word analysis" returns candidate concepts that can become classes, the word following the initial keyword is also considered. In this case, the selected concepts are used to describe and categorize the set of URLs associated to each class. For example, if we find that for a specific URL associated to the class breast cancer this keyword is followed by the word foundation, the URL is categorized with this word (which represents a domain of application). This information can be useful when the user browses the set of URLs of a class because it gives an idea of the context where the class is applied (see some examples in Fig. 4 for the cancer domain).
• Finally, a refinement process is performed in order to obtain a more compact taxonomy and avoid redundancy. In this process, classes and subclasses that share at least 90% of their associated URLs are merged, because we consider that they are closely related. For example, the hierarchy "cancer->bladder->stage->iii" results in "cancer->bladder->stage_iii" (automatically discovering a multiword term, [16]), because the last two subclasses have the same web sets. Moreover, the list of web sites for each class is processed in order to avoid redundancies: if a URL is stored in one of its subclasses, it is deleted from the superclass set.
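The refinement step can be sketched as follows (a minimal illustration; the function name is ours). Two linked classes whose URL sets overlap by at least 90% collapse into a single multiword class, as in stage -> iii becoming stage_iii:

```python
def merge_if_redundant(parent_name, parent_urls, child_name, child_urls,
                       threshold=0.9):
    """Merge a class and its subclass into a multiword term when they
    share at least `threshold` of their associated URLs."""
    shared = parent_urls & child_urls
    overlap = len(shared) / max(len(parent_urls | child_urls), 1)
    if overlap >= threshold:
        return parent_name + "_" + child_name, parent_urls | child_urls
    return None  # the two classes stay separate

merged = merge_if_redundant("stage", {"u1", "u2", "u3"},
                            "iii", {"u1", "u2", "u3"})
# merged == ("stage_iii", {"u1", "u2", "u3"})
```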

Table 1. Candidate concepts for the sensor ontology. Concepts marked with * are the selected ones (network/networked share the same morphological root, so their attribute values are merged). The others are a reduced list of some of the rejected candidates, which have no relevance value.

Candidate Concept   #Appear   #Pages   #Search Results     #Joined    Ratio          Rele.
                    (min 9)   (min 4)  (max 10,000,000)    (min 50)   (min 0.0001)
field *             10        4        6,840,000           5,140      7.51E-4        2.39
image *             10        5        6,970,000           91,700     0.0131         23.23
individual *        10        5        7,190,000           3,230      4.49E-4        2.05
light *             34        10       7,820,000           27,900     0.0035         9.23
magnetic *          12        5        3,180,000           6,880      0.0021         5.01
motion *            70        9        6,300,000           39,200     0.0062         15.20
network *           7         5        6,270,000           6,300      0.0010         2.84
networked *         40        11       816,000             1,450      0.0017         6.69
oxygen *            72        4        2,430,000           77,000     0.0316         56.90
pressure *          21        6        6,210,000           53,000     0.0085         16.22
smart *             11        7        7,210,000           7,350      0.0010         3.38
temperature *       76        18       6,280,000           117,000    0.0186         37.67
wireless *          209       23       5,390,000           32,800     0.0060         23.92
situ                7         4        885,000             988        0.0011         -
heterogeneous       11        2        538,000             550        0.0010         -
mode                9         5        6,980,000           27         3.86E-6        -
level               12        2        18,420,000          18,000     9.77E-4        -
cost                9         7        9,130,000           51         5.58E-6        -

SENSOR
  Field Sensor
  Image Sensor: LBCast, VisionPak, Xerox
  Individual Sensor
  Light Sensor: LEGO, CMA, PASPORT
  Magnetic Sensor
  Motion Sensor: Decora, Kryptonite
  Network/Networked Sensor: EDragon, Endevco, FEDragon, GEDragon
  Oxygen Sensor: Audi, BMW, Chrysler, Denso, Mercedes
  Pressure Sensor: Entran, EPN, Oceanographic, EPXT, SMT, Motorola
  Smart Sensor: PowerDent, BBN, FEL
  Temperature Sensor: DrDAQ, GlobalSpec.com, PASPORT, CTD, RTD, INEEL
  Wireless Sensor: Hitachi

Fig. 2. Examples of class instances (listed after each class) for the first-level subclasses of the sensor taxonomy.

Fig. 3. Cancer taxonomy visualized in Protégé 2.1: numbers are class identifiers.

CANCER
  Bladder Cancer
    [treatment, diagnosis] http://telescan.nki.nl/bladder2.html
    [treatment, research] http://www.nlm.nih.gov/medlineplus/bladdercancer.html
    [chemotherapy] http://www.cancer-info.com/bladder.htm
  Breast Cancer
    [foundation, treatment] http://www.thebreastcancersite.com/
    [foundation, research] http://www.komen.org/
    [treatment, prevention, research] http://www.aacr.org/
  Cell Cancer
    [center, survivorship, prevention] http://www.mdanderson.org/
    [melanoma, foundation, institute] http://www.maui.net/~southsky/introto.html
  Cervical Cancer
    [phone, quilt, coalition] http://www.nccc-online.org/
    [prevention] http://www.ashastd.org/hpvccrc/
    [treatment, control, mortality] http://www.cdc.gov/cancer/nbccedp/about.htm
    [detection, study, coalition] http://www.mdanderson.org/diseases/cervical/
  Childhood Cancer
    [awareness, news] http://www.cancernews.com/
    [center, survivorship, prevention] http://www.mdanderson.org/
    [research] http://www.ccrg.ox.ac.uk/
  Colon Cancer
    [conference] http://www.ccalliance.org/
    [drug, news, support] http://www.cancerpage.com/
    [spectrum, prevention, survival, risk] http://jncicancerspectrum.oupjournals.org/
    [prevention] http://www.eifoundation.org/national/nccra/splash/
  Colorectal Cancer
    [center, news, endometrial] http://www.cancerlinksusa.com/colorectal/index.asp
    [pathogenesis, mortality, morbidity] http://www.ahrq.gov/clinic/colorsum.htm
    [cell, research, chemopreventive] http://www.nature.com/bjc/
  Esophageal Cancer
    [research, chemopreventive] http://www.nature.com/bjc/
    [treatment] http://www.nci.nih.gov/cancerinfo/pdq/treatment/esophageal/patient/
    [diagnosis, background] http://www.cancerlinksusa.com/esophagus/index.asp
  Eye Cancer
    [center, survivorship, prevention] http://www.mdanderson.org/
    [research] http://www.nlm.nih.gov/medlineplus/eyecancer.html
    [survey] http://www.ehendrick.org/healthy/001533.htm
    [therapy] http://www.cancersupportivecare.com/eye.html
  Lung Cancer
    [treatment, prevention, support, educational, toll] http://www.lungcancer.org/
    [drug, news, support] http://www.cancerpage.com/
    [epidemiologic, study, journal] http://www.cheec.uiowa.edu/misc/radon.html
    [treatment, risk] http://www.mskcc.org/mskcc/html/12463.cfm
  National Cancer
    [association] http://www.judysgro.com/speeches/meagan.html
  Neck Cancer
    [center, survivorship, prevention] http://www.mdanderson.org/
    [radiotherapy] http://www.cancerlinksusa.com/larynx/index.asp
    [foundation, tumor] http://www.headandneckcanada.com/pages/who_we_are.html
  Ovarian Cancer
    [phone, conference, quilt, institute, coalition] http://www.nccc-online.org/
    [research] http://www.lynnecohenfoundation.org/
    [surgery, treatment, diagnosis, overview, syndrome] http://www.oncolink.upenn.edu/types/article.cfm?c=6&s=19&ss=766&id=8589
  Pancreatic Cancer
    [faq, research, national, bibilography, genetic] http://www.path.jhu.edu/pancreas/
    [research, questionnaire] http://www.nlm.nih.gov/medlineplus/pancreaticcancer.html
    [awareness, research, action, symposium] http://www.pancan.org/
    [treatment, research, survival, action] http://deckerfund.tripod.com/links.html
  Skin Cancer
    [care, foundation, awareness] http://www.skincancer.org/
    [drug, news, support] http://www.cancerpage.com/
    [cell, susceptibility, research, chemopreventive] http://www.nature.com/bjc/
  Thyroid Cancer
    [surgery, consultation, clinic] http://www.thyroidcancer.com/
    [treatment, topic] http://www.mskcc.org/mskcc/html/447.cfm
    [radiation, surgery, therapy] http://www.thyroid.ca/Guides/HG12.html

Fig. 4. Examples of categorized URLs for the first-level subclasses of the cancer taxonomy.

2.2 Parameter configuration

As can be deduced from the previous explanation, the input parameters of the system are very important for the quality of the hierarchy. In this section, we describe several heuristics for configuring these values appropriately for the desired domain. All these general guidelines have been inferred empirically from an exhaustive test of the system on different domains.

The search parameters affect the size and the time of the search; however, a priori, the more web sites evaluated, the better the results. In general, a range from 50 pages (for the first level) to 2% of the total set returned by Google gives quite good results. Evaluating beyond 5% of the total results for a general domain (more than 10000 webs) is very time consuming and does not improve the quality of the results significantly. So, these parameters can be set directly from the information returned by Google about the number of sites that match the search query. For each level of the recursion, it is convenient to set a decreasing number of pages (a reduction of 25%) in order to maximize the throughput of the system and the quality of the results. This is because in the first levels (e.g. cancer) the spectrum of candidate concepts is wider than in the last ones (e.g. metastatic breast cancer), where the searched concept is much more restrictive and fewer valid results can be found.

Once the size of the search is fixed (e.g. 180 pages for the initial level of the Sensor domain and 200 for the Cancer one), the selection parameters have to be configured carefully in order to retrieve only the best candidates without missing valid ones. In general, several heuristics can be applied in the configuration process:
• Total number of appearances: this measure depends directly on the size of the search. A setting of 5% with respect to the number of sites evaluated (for each level of the recursion) seems to be a good value.
• Number of different web sites that contain the concept at least once: again, this measure depends on the size of the search and the previous parameter. A good value is half of the total number of appearances for the corresponding level.
• Estimated number of results returned by the search engine for the selected concept alone: a maximum of 10 million avoids very common words.
• Estimated number of results returned by the search engine joining the selected concept with the initial keyword: a minimum of 50 hits is sufficient for avoiding strange combinations or webmasters' spelling mistakes.
• Ratio between the last two measures: this is the most important measure. In fact, in some domains, using this measure alone (ignoring the previous ones) with a predefined value of 0.0001 is sufficient to obtain good results, because it summarizes in some way all the other measures (number of hits and strength of the concepts' relation). If more accurate results are required, at the cost of losing some correct ones, a setting of 0.001 can also be used.

In our prototype, all of these parameters can be set manually at the beginning of the search for a personalised search, but they can also be configured automatically by the system as a function of the generality of the searched domain (number of initial results), applying the described heuristics.
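The auto-configuration heuristics can be sketched as follows. This is a rough illustration under our reading of the guidelines; the function, its caps and its dictionary layout are assumptions, not the prototype's actual code:

```python
def configure(total_hits, level):
    """Derive the search size and the selection thresholds from the number
    of hits the search engine reports and the current recursion level."""
    pages = max(50, min(int(0.02 * total_hits), 10000))  # 50 .. 2%, capped
    pages = max(1, int(pages * (0.75 ** level)))         # -25% per level
    min_appear = max(1, int(0.05 * pages))               # 5% of sites evaluated
    return {
        "pages": pages,
        "min_appear": min_appear,
        "min_dif_web": max(1, min_appear // 2),          # half of min_appear
        "max_alone_results": 10_000_000,
        "min_joined_results": 50,
        "min_ratio": 0.0001,
    }

cfg = configure(total_hits=10_000, level=0)
# cfg["pages"] == 200, cfg["min_appear"] == 10, cfg["min_dif_web"] == 5
```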

3 Polysemy detection and semantic clustering

One of the main problems when analyzing natural language resources is polysemous words. In our case, for example, if the primary keyword has more than one sense (e.g. virus can refer to "malicious computer programs" or to "infectious biological agents"), the resulting taxonomy contains concepts from different domains (e.g. "email virus" and "immunodeficiency virus"). Although these concepts have been selected correctly, it is interesting to group the branches of the resulting taxonomic tree when they pertain to the same domain, corresponding to a concrete sense of the immediate "father" concept. With this representation, the user can consult the part of the hierarchy that belongs to the desired sense of the initial keyword.

Performing this classification without any previous knowledge (for example, a list of meanings, synonyms or a thesaurus like WordNet [1]) is not a trivial process, as mentioned in [16, 17]. However, we can take advantage of the context from which each concept has been extracted, concretely, the documents (URLs) that contain it. We can assume that each website uses the keyword in one concrete sense, so all candidate concepts selected from the analysis of a single document pertain to the domain associated to that meaning of the keyword. Applying this idea over a large

amount of documents, we can find, as shown in Fig. 5 for the virus example, quite consistent semantic relations between the candidate concepts. So, if a word has N meanings (or is used in N different domains with different senses), the resulting taxonomy for this concept will be grouped into a similar number of sets, each one containing the elements that belong to a particular domain. We have designed a methodology to obtain these groups without any previous semantic knowledge. In more detail, the algorithm performs the following actions:
• It starts from the taxonomy generated before. Concretely, all concepts are organised in an is-a ontology and each concept stores the set of webs from which it has been obtained and the whole list of URLs returned by Google.
• For a given concept of the taxonomy (for example the initial keyword: virus) and a concrete depth level (for example the first one), a classification process joins the concepts that belong to each keyword sense. This process is performed by a SAHN (Sequential Agglomerative Hierarchical Non-Overlapping) clustering algorithm that joins the most similar concepts, using as a similarity measure the number of coincidences between their URL sets:
− For each concept of the same depth level (e.g. anti, latest, simplex, linux, computer, influenza, online, immunodeficiency, leukaemia, nile, email) that depends directly on the selected keyword (virus), a set of URLs is constructed by joining the stored websites associated to the class and all the URLs of its descendant classes (without repetitions).
− Each concept is compared to each other one, and a similarity measure is obtained by comparing their URL sets (2). Concretely, the measure represents the maximum proportion of coincidences between the two sets (normalised as a percentage of each total set). So, the higher it is, the more similar the concepts are (because they are frequently used together in the same context).
simil(A, B) = Max( #Coin(URL(A), URL(B)) / #URL(A) , #Coin(URL(A), URL(B)) / #URL(B) )    (2)
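Measure (2) translates directly into code (set names are ours):

```python
def simil(urls_a, urls_b):
    """Similarity (2): shared URLs normalised by the size of each set,
    keeping the larger of the two proportions."""
    if not urls_a or not urls_b:
        return 0.0
    coin = len(urls_a & urls_b)
    return max(coin / len(urls_a), coin / len(urls_b))

print(simil({"u1", "u2", "u3", "u4"}, {"u3", "u4"}))  # 1.0 (B inside A)
```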

− With these measures, a similarity matrix between all concepts is constructed. The most similar concepts (in the example, anti and computer) are selected and joined (they belong to the same keyword sense).
− The joining process is performed by creating a new class with those concepts and removing them individually from the initial taxonomy. Their URL sets are joined but not merged (each concept keeps its URL set independently).
− For this new class, the similarity to the remaining concepts is computed. In this case, the measure of coincidence is the minimum number of coincidences between the URL set of the individual concept and the URL set of each concept of the group that forms the new class (3). With this method we are able to detect the most subtle senses of the initial keyword (or at least different domains where it is used). Other measures, such as taking the maximum number of coincidences or the average, have been tested with worse results (they tend to join all the classes, making the differentiation of senses difficult). Note that this is a local measure for a concrete iteration of the algorithm (not a true "distance" measure) and its values cannot be compared between iterations.

#Coin(Class(A, B), C) = Min( #Coin(A, C), #Coin(B, C) )    (3)
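The whole clustering loop can be sketched as follows. This is a hedged illustration: we approximate the minimum-coincidence rule (3) by keeping only the intersection of the joined URL sets, which is a conservative lower bound on the minimum, and we stop as soon as the best remaining similarity is 0:

```python
def simil(urls_a, urls_b):
    # measure (2): shared URLs, normalised by each set size, take the max
    if not urls_a or not urls_b:
        return 0.0
    coin = len(urls_a & urls_b)
    return max(coin / len(urls_a), coin / len(urls_b))

def sahn_cluster(concepts):
    """concepts: dict mapping concept name -> URL set.
    Returns a partition (list of name sets), one per discovered sense."""
    clusters = [({name}, set(urls)) for name, urls in concepts.items()]
    while len(clusters) > 1:
        best_s, best_pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = simil(clusters[i][1], clusters[j][1])
                if s > best_s:
                    best_s, best_pair = s, (i, j)
        if best_pair is None:       # no coincidences left: stop joining
            break
        i, j = best_pair
        joined = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] & clusters[j][1])  # conservative min rule
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(joined)
    return [names for names, _ in clusters]

senses = sahn_cluster({"anti": {"u1", "u2"}, "computer": {"u1", "u2", "u3"},
                       "influenza": {"u9"}, "nile": {"u8", "u9"}})
# senses == [{"anti", "computer"}, {"influenza", "nile"}] (in some order)
```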

− The similarity matrix is updated with these values and the new most similar concepts/classes are joined (building a dendrogram as shown in Fig. 5). The process is repeated until no more elements remain unjoined or the similarity between every pair is 0 (there are no coincidences between the URL sets).
• The result is a partition of classes (with 2 elements for the virus example; their number is discovered automatically) that groups the concepts belonging to each specific meaning. The dendrogram with the joined classes and similarity values can also be consulted by the user of the system.

anti | latest | simplex | linux | computer | influenza | online | immunodeficiency | leukemia | nile | email

Sense 1: anti | latest | linux | computer | online | email

Sense 2: simplex | influenza | immunodeficiency | leukemia | nile

Fig. 5. Dendrogram representing semantic associations between the classes found for the keyword "virus". Two final clusters are obtained: Sense 1 groups the classes associated to the "computer program" meaning and Sense 2 those associated to the "biological agent" meaning.

4 Discovery of equivalent domain keywords A very common problem of web indexing based on the presence or absence of a specified word is the use of different names to refer to the same entity. In fact, the goal of a web search engine is to retrieve relevant pages for a given topic determined by a keyword, but if a text doesn’t contain this specific word with the same spelling as specified it will be ignored. So, when using a search engine like Google, in some cases a considerable amount of relevant web resources are omitted due to the too strict word matching. In this sense, not only the different morphological forms of a given keyword are important, but also synonyms and aliases. There are several methodologies that try to find “alternative spellings” for a given keyword like Latent Semantic Analysis. LSA is a technique for identifying both semantically similar words and semantically similar documents [4]. It considers that the words that co-occur in documents with the given one with a significant overlap are

semantically related. However, these techniques tend to return domain related words (that is why they co-occur) but, sometimes, not truly “equivalent” ones [22]. We have developed a methodology for discovering equivalent terms using the taxonomy obtained by the construction algorithm for the given keyword and, again, the search engine Google. Our approach is based on considering the longest branches of subclasses (e.g. metastatic stage iv bladder cancer) of our hierarchy and use them as the constraint (search query) for obtaining new documents that contain equivalent words for the main keyword. The assumption of the algorithm is that the longest multiword terms of our taxonomy constrain enough the search process to obtain, in most cases, alternative spellings for the same semantic concept. Concretely, the procedure works as follows: • Select the N longest branches (at least 3 words) of the obtained taxonomy, without considering the initial keyword (e.g. metastatic stage iv bladder). • For each one, make a query in the search engine of the whole multiword sentence and retrieve the P more relevant pages. This can be made in two ways: − Setting only the multiword in the search engine as the query (e.g. “metastatic stage iv bladder”). That will return the webs containing this sentence without caring about the next word. Most of them will contain the original keyword but a little amount of them will use alternative words, ensuring that all considered pages will belong to the desired domain but slowing the search. − Specifying the constraint not to contain the original keyword (e.g. “metastatic stage iv bladder” –cancer). The set of pages (if there is any) will only contain alternative words. This will speedup the search dramatically but perhaps valid resources will be omitted (those that contain both the keyword and the alternative word(s)). 
• Search the text of the retrieved web resources for the multiword set and examine the following word, i.e. the position originally occupied by the initial keyword (e.g. cancer). If the word found is different from the initial keyword, consider it a candidate 'alternative keyword' (e.g. carcinoma).
• Repeat the process for each website and each multiword term, counting the number of appearances of each candidate (in a similar way as with the candidate concepts for the taxonomy). A stemming morphological analysis can also be performed to group different forms of the same word.
• Once the process is finished, the list of alias candidates is ordered using a confidence measure that takes into consideration the number of appearances of each candidate in different web sites and the number of different multiword terms from which the candidate has been obtained (4). In most cases, there is a significant difference between the truly equivalent words and the invalid ones, so it is easy to determine the cut-off point between correct and incorrect results.

Confidence = #Dif_Web_Appear · 10^(#Multiword_terms − 1)    (4)
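The candidate-extraction step of the procedure above can be sketched as follows. The page texts are hand-written snippets standing in for downloaded search results: in the real system they would come from the Google queries described above, which this sketch does not perform.

```python
import re
from collections import defaultdict

def alias_candidates(texts, multiword, keyword):
    """Scan page texts for the word that follows a multiword branch
    (the position originally occupied by the initial keyword) and
    count how often each alternative word appears."""
    counts = defaultdict(int)
    # Match the multiword phrase followed by exactly one word, ignoring case.
    pattern = re.compile(re.escape(multiword) + r"\s+(\w+)", re.IGNORECASE)
    for text in texts:
        for follower in pattern.findall(text):
            if follower.lower() != keyword.lower():
                counts[follower.lower()] += 1
    return dict(counts)

# Toy snippets instead of retrieved pages (an assumption of this sketch).
pages = [
    "Survival rates for metastatic stage iv bladder cancer vary widely.",
    "Treatment of metastatic stage iv bladder carcinoma is discussed.",
    "New data on metastatic stage iv bladder carcinoma were published.",
]
cands = alias_candidates(pages, "metastatic stage iv bladder", "cancer")
# → {'carcinoma': 2}: 'cancer' itself is skipped, 'carcinoma' counted twice.
```

A full implementation would additionally track, for each candidate, the set of distinct web sites and multiword terms it came from, which are the two quantities that feed the confidence measure (4).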

The described methodology has been tested on several domains, obtaining very promising results, as shown in Tables 2 and 3 for the Cancer and Sensor domains respectively.

Table 2. Alternative keyword candidates for the Cancer domain. From the obtained taxonomy, 13 multiword terms have been considered, evaluating 100 webs for each one and also considering the original keyword. Elements in bold represent reliable results. Grey candidates represent doubtful ones. The rest are examples of rejected words (in this case misspelled words).

Concept          Derivatives             #Hits  #Webs  #Multi  Conf.
cancer           cancer, cancers         1041   376    13      3.7E14
carcinoma        carcinoma, carcinomas   417    226    10      2.2E11
tumor            tumor, tumors           13     13     2       130
cancertreatment  cancertreatment         3      3      1       3
cancersmall      cancersmall             2      2      1       2
cancerafter      cancerafter             2      2      1       2
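The Conf. column of Table 2 follows directly from measure (4); for instance, the cancer row gives 376 · 10^12 = 3.76E14. A minimal check:

```python
def confidence(n_web_appearances, n_multiword_terms):
    # Confidence measure (4): appearances in distinct webs, weighted
    # exponentially by the number of multiword terms yielding the candidate.
    return n_web_appearances * 10 ** (n_multiword_terms - 1)

conf_cancer = confidence(376, 13)   # cancer row of Table 2 → 3.76E14
conf_tumor = confidence(13, 2)      # tumor row of Table 2 → 130
```

The exponential weighting is what separates truly equivalent words (found via many different multiword terms) from noise by several orders of magnitude, making the cut-off point easy to find.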

Table 3. Alternative keyword candidates for the Sensor domain. From the obtained taxonomy, 24 multiword terms have been considered, evaluating 100 webs for each one and also considering the initial keyword. Elements in bold represent reliable results. Grey candidates represent doubtful ones. The rest are examples of rejected words.

Concept        Derivatives               #Hits  #Webs  #Multi  Conf.
sensor         sensor, sensors           437    294    24      2.9E25
transducer     transducer, transducers   28     19     3       1900
system         system                    4      4      2       40
probe          probe                     2      2      2       20
transmitter    transmitter               11     11     1       11
measurement    measurement               10     7      1       7
communication  communication             9      7      1       7
network        network                   3      3      1       3

5 Ontology representation and evaluation

The final hierarchy is stored in a standard representation language: OWL [14]. The Web Ontology Language is a semantic markup language for publishing and sharing ontologies on the World Wide Web. It is designed for use by applications that need to process the content of information, and it facilitates greater machine interpretability by providing additional vocabulary along with a formal semantics [21]. OWL is supported by many ontology visualizers and editors, like Protégé 2.1 [15], allowing the user to explore, understand, analyse or even modify the ontology easily.

In order to evaluate the correctness of the results, a set of formal tests has been performed. Concretely, Protégé provides a set of ontological tests for detecting inconsistencies between classes, subclasses and properties of an OWL ontology from a logical point of view. However, the number of different documents evaluated for obtaining the result and, in general, the variety and quantity of resources available on the web make any kind of automated evaluation extremely difficult. So, correctness from a semantic point of view can only be tested by comparing the results with other existing semantic studies (for example, using other well-known methodologies) or through an analysis performed by an expert of the domain.
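As an illustration (not taken from the paper), a taxonomy branch might be encoded in OWL as a fragment like the following, where the class names are invented for the example:

```xml
<!-- Hypothetical fragment: 'bladder cancer' as a subclass of 'cancer' -->
<owl:Class rdf:ID="Cancer"/>
<owl:Class rdf:ID="BladderCancer">
  <rdfs:subClassOf rdf:resource="#Cancer"/>
</owl:Class>
```

Protégé's ontological tests mentioned above operate on exactly this kind of class/subclass structure.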

6 Conclusion and future work

Some researchers have been working on knowledge mining and ontology learning from different kinds of structured information sources (like databases, knowledge bases or dictionaries [7]). However, taking into consideration the amount of resources easily available on the Internet, we believe that knowledge extraction from unstructured documents like webs is an important line of research. In this sense, many authors [2, 5, 8, 10, 11] are putting their effort into processing natural language texts to create or extend structured representations like ontologies.

In this field, taxonomies are the point of departure for many ontology creation methodologies [20]. The discovery of these hierarchies of concepts based on word association has a very important precedent in [19], which proposes a way to interpret the relation between consecutive words (multiword terms). Several authors [11, 16, 17] have applied similar ideas for extracting hierarchies of concepts from a document corpus to create or extend ontologies. However, the classical approach of these methods is the analysis of a relevant corpus of documents for a domain. In some cases, a semantic repository (WordNet [1]) from which word meanings can be extracted and linguistic analysis performed, or an existing representative ontology, is used. On the contrary, our methodology does not start from any kind of predefined knowledge of the domain, and it only uses publicly available web search engines to build relevant taxonomies from scratch.

Moreover, searching the whole web adds some very important problems compared to relevant-corpus analysis. On the one hand, the heterogeneity of the information resources hinders the extraction of important data; in this case, we base our methodology on the high amount of information repetition and a statistical analysis of the relevance of candidate concepts for the domain.
Note also that most of these statistical measures are obtained directly from the web search engine, a fact that greatly speeds up the analysis and gives more representative results (they are based on whole-web statistics, not on the analysed subset) than a classical statistical approach based only on the texts. On the other hand, polysemy becomes a serious problem when the resources retrieved from the search engine are selected only by the keyword's presence (some authors [16, 17] have detected this problem even in corpus analysis); in this case, we propose an automatic approach for polysemy disambiguation based on clustering the selected classes according to the similarities between their information sources (web pages and web domains), without any kind of semantic knowledge. Finally, in order to widen the search (or even to contrast the obtained results), an algorithm for discovering alternative keywords for the same domain is also proposed (very useful for domains with few web resources available).

The final taxonomy obtained with our method represents the state of the art on the WWW for a given keyword, and the hierarchically structured and domain-categorized list of the most representative web sites for each class is a great help for finding and accessing the desired web resources. In this sense, it can be considered an improvement over the representation of the results returned by a standard web search engine and valuable information for the user. Moreover, the taxonomy can be used to guide a search for information or a classification process from a document corpus [3, 12, 13], or it can be the base for finding more complex relations between

concepts and creating ontologies [8]. This last point is especially important due to the necessity of ontologies for achieving interoperability and easing the access to and interpretation of knowledge resources (e.g. the Semantic Web [21]; more information in [10]). As future lines of research, some topics can be proposed:

• Several executions from the same initial keyword at different times can give different taxonomies. A study of the changes can tell us how a domain evolves (e.g. a new kind of sensor appears).
• For each class, an extended analysis of the relevant web sites could be performed to find possible attributes and values that describe important characteristics (e.g. the price of a sensor), or closely related words (like a topic signature [9]).
• More complex relations (like non-taxonomic ones) and new classes could be discovered from an exhaustive analysis of sentences that contain the selected concepts (finding the verb and detecting the predicate or the subject). This could be very useful when dealing with non-easily-categorised domains (where no concatenated adjectives can be found).

Acknowledgements

We would like to thank David Isern and Aida Valls for their feedback. This work has been supported by the "Departament d'Universitats, Recerca i Societat de la Informació" of Catalonia.

References

1. WordNet: a lexical database for English Language. http://www.cogsci.princeton.edu/wn.
2. Ansa O., Hovy E., Aguirre E., Martínez D.: Enriching very large ontologies using the WWW. In: Proceedings of the Workshop on Ontology Construction of the European Conference on AI (ECAI-00), 2000.
3. Alani H., Kim S., Millard D., Weal M., Hall W., Lewis H., Shadbolt N.: Automatic Ontology-Based Knowledge Extraction from Web Documents. IEEE Intelligent Systems, IEEE Computer Society, 14-21, 2003.
4. Deerwester S., Dumais S., Landauer T., Furnas G., Harshman R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6) (1990) 391-407.
5. Alfonseca E., Manandhar S.: An unsupervised method for general named entity recognition and automated concept discovery. In: Proceedings of the 1st International Conference on General WordNet, 2002.
6. Fensel D.: Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce. Springer Verlag, 2001.
7. Manzano-Macho D., Gómez-Pérez A.: A survey of ontology learning methods and techniques. OntoWeb: Ontology-based Information Exchange Management, 2000.
8. Maedche A., Volz R., Kietz J.U.: A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet. EKAW'00 Workshop on Ontologies and Texts, 2000.
9. Lin C.Y., Hovy E.H.: The Automated Acquisition of Topic Signatures for Text Summarization. In: Proceedings of the COLING Conference, 2000.

10. Maedche A.: Ontology Learning for the Semantic Web. Kluwer Academic Publishers, 2001.
11. Velardi P., Navigli R.: Ontology Learning and Its Application to Automated Terminology Translation. IEEE Intelligent Systems, 22-31, 2003.
12. Sheth A.: Ontology-driven information search, integration and analysis. Net Object Days and MATES, 2003.
13. Magnin L., Snoussi H., Nie J.: Toward an Ontology-based Web Extraction. The Fifteenth Canadian Conference on Artificial Intelligence, 2002.
14. OWL: Web Ontology Language. W3C. http://www.w3c.org/TR/owl-features/.
15. Protégé 2.1. http://protege.stanford.edu/.
16. Vossen P.: Extending, trimming and fusing WordNet for technical documents. In: Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources, Pittsburgh, 2001.
17. Sanderson M., Croft B.: Deriving concept hierarchies from text. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, USA, 1999.
18. Hwang C.H.: Incompletely and Imprecisely Speaking: Using Dynamic Ontologies for Representing and Retrieving Information. In: Proceedings of the 6th International Workshop on Knowledge Representation meets Databases, Sweden, 1999.
19. Grefenstette G.: SQLET: Short Query Linguistic Expansion Techniques: Palliating One-Word Queries by Providing Intermediate Structure to Text. In: Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, volume 1299 of LNAI, chapter 6, 97-114. Springer, International Summer School SCIE-97, Italy, 1997.
20. Fernández-López M., Gómez-Pérez A., Juristo N.: METHONTOLOGY: From Ontological Art Towards Ontological Engineering. Spring Symposium on Ontological Engineering of AAAI, Stanford University, USA, 1997.
21. Semantic Web. W3C: http://www.w3.org/2001/sw/.
22. Bhat V., Oates T., Shanbhag V., Nicholas C.: Finding aliases on the web using latent semantic analysis. Data & Knowledge Engineering 49 (2004) 129-143.
23. OpenNLP: http://opennlp.sourceforge.net/.