Browsing Semi-structured Texts on the Web using Formal Concept Analysis

Richard Cole, Peter Eklund and Florence Amardeilh
School of Information Technology and Electrical Engineering
The University of Queensland, St. Lucia, Queensland 4072, Australia
Abstract. Browsing unstructured Web-texts using Formal Concept Analysis (FCA) confronts two problems. Firstly, on-line Web-data is sometimes unstructured, so any FCA-system must include additional mechanisms to discover the structure of its input sources. Secondly, many on-line collections are large and dynamic, so a Web-robot must be used to extract data automatically when it is required. These issues are addressed in this paper, which reports a case study involving the construction of a Web-based FCA system used for browsing classified advertisements for real-estate properties¹. Real-estate advertisements were chosen because they represent a typical semi-structured information source accessible on the Web. Further, the data is only relevant for a short period of time. Moreover, the analysis of real-estate data is a classic example used in introductory courses on FCA. However, unlike the classic FCA real-estate example, whose input is a structured relational database, we mine Web-based texts for their implicit structure. The issues in mining these texts and their subsequent presentation to the FCA-system are examined in this paper. Our method uses a hand-crafted parser to extract structured information from real-estate advertisements, which are then browsed via a Web-based front-end employing rudimentary FCA-system features. The user is able to quickly determine the trade-offs between different attributes of real-estate properties and alter the constraints of the search to locate good candidate properties. Interaction with the system is characterized as a mixed-initiative process in which the user guides the computer in the satisfaction of constraints. These constraints are not specified a priori, but rather drawn from the data-exploration process. Further, the paper shows how the Conceptual Email Manager, a prototype FCA text information retrieval tool, can be adapted to the problem.
1 Information Extraction and the Web — Overview
Since the creation of the DARPA Message Understanding Conferences (MUC) in 1987, Information Extraction (IE) has become an independent field of research at the crossroads of Natural Language Processing (NLP), Text Mining and Knowledge Discovery and Data Mining (KDD). For this reason the methods and techniques of IE are strongly influenced by developments in these related research areas. Moreover, IE can be useful for any collection of documents from which one wants to extract facts, and the World Wide Web is such a collection.

¹ In Formal Concept Analysis the term property has a special meaning, similar to attribute. In this paper, property is used only with the meaning of a real-estate property, e.g. a house or apartment.
1.1 Definitions - Information Extraction
The objective of IE [13] is to locate and identify specific information in a natural language document. The key element of IE systems is the set of extraction rules, or extraction patterns, that identify the target information according to a scenario. Once an extraction pattern is identified, the IE system reduces the extracted information to a more structured form such as a database table. Each record in the table must also have a link back to the original document [26]. As a result, tools for visual representation, fact comparison and automatic pattern analysis play an important role in the presentation of data derived from IE systems. Using the case-study example presented in this paper, “rental accommodation” classifieds, we define the various terms used in the IE field. First, the “rental classified” scenario represents a way to format the target information, e.g. the location, the rental price, the number of bedrooms and the phone number. Second, each scenario is defined by a list of patterns that describe possible ways of talking about one of its facets, such as the pattern “for rent”. Third, each pattern includes a set of extraction rules defining how to retrieve this pattern in the text. Fourth, each rule is composed of several fields, either constants or variables, each representing a particular element of the information to extract. For example, for the pattern “for rent”, we might employ (or learn) a rule anchored on constant fields such as “for rent”, “Bedrm.” and “phone”, with variable fields capturing the corresponding values.
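To make the notions of scenario, pattern, rule and field concrete, the sketch below expresses such a rule as a regular expression over a rental classified (in Python). The field names, the sample advertisement and the exact expression are illustrative assumptions only, not the rules used by the system described later in this paper.

import re

# A hypothetical extraction rule for the "rental classified" scenario.
# Constant fields ("for rent", "Bedrm", "phone") anchor the rule; named
# groups play the role of variable fields holding the extracted values.
RULE = re.compile(
    r"(?P<location>[A-Z][\w\s]+?)\s*for rent.*?"
    r"(?P<bedrooms>\d+)\s*Bedrm.*?"
    r"\$(?P<price>\d+)\s*p/?w.*?"
    r"phone\s*(?P<phone>[\d\s]{8,12})",
    re.IGNORECASE | re.DOTALL,
)

def extract(ad_text):
    """Return the structured record for one advertisement, or None."""
    match = RULE.search(ad_text)
    return match.groupdict() if match else None

if __name__ == "__main__":
    ad = "St Lucia for rent: 2 Bedrm unit, $220 p/w, phone 07 3365 1111"
    print(extract(ad))
    # {'location': 'St Lucia', 'bedrooms': '2',
    #  'price': '220', 'phone': '07 3365 1111'}

A record produced in this way can then be stored as a row of a database table, with a link back to the source advertisement, as described above.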
1.2 Web Documents and Text Diversity
Some approaches to information extraction on the Web treat all Web pages as semi-structured, since they contain HTML tags. Hsu [15], however, provides a finer-grained categorization of Web documents: structured Web pages provide itemized information, and each element can be correctly extracted based on uniform syntactic clues such as delimiters or the order of elements; semi-structured Web pages may contain missing elements, multiple-value elements, permutations and exceptions; unstructured Web pages require linguistic knowledge to correctly extract elements. It seems therefore that, when it comes to extracting information from Web pages, the same problems and features facing information extraction from natural language documents also apply to the Web domain. IE systems for structured text perform well because the information can easily be extracted using format descriptions. IE systems for unstructured text, however, need several additional processing steps in conjunction with constructing extraction rules. These are typically based upon patterns that involve syntactic relations between words or semantic classes of words. They generally use NLP techniques and cannot be compared to the work of a human expert (although they do provide useful results). Likewise, IE systems for semi-structured text cannot limit themselves to the rigid extraction rules suited to structured text, but must be able to switch context and apply NLP techniques to free text. Nevertheless, systems for semi-structured texts do use delimiters, such as HTML tags, to construct extraction rules and patterns. Thus, a profitable approach to IE on semi-structured texts is a hybrid of the two. Moreover, on the Web, information is also highly dynamic. Web-site structure and the presence of hyperlinks are also important facets not present in traditional natural language documents.
It may, for instance, be necessary to follow hyperlinks to obtain all the pertinent information from online databases. Web documents are both stylistically different from natural language texts and may be globally distributed over multiple sites and platforms. Hence, Web IE represents a special challenge for the field because of the nature of the medium.

1.3 Architecture and Components
The first step of the basic IE process is to extract each relevant element of the text through a local analysis, i.e. the system examines each field to decide whether it is a new element to add to the pattern or whether it relates to an existing element. Secondly, the system interlinks those elements and produces larger and/or new elements. Finally, only the elements pertinent to the patterns are translated into the output format, i.e. the scenario. Moreover, the information to extract can be located in any part of the document, which is the situation with many unstructured texts. In these cases the elements are extracted as above and a second process is then necessary to link all the elements dealing with the same scenario.

This IE process is implemented slightly differently depending on whether the system is based on a knowledge-engineering approach combined with natural language processing methods, or on a statistical, automatic training approach. In the first, experts examine sample texts and manually construct extraction rules with an appropriate level of generality to produce the best performance. This “training” is effective but time consuming. In the second approach, once a training corpus has been annotated, a classifier is run so that the system learns how to analyze new texts. This is faster but requires a sufficient volume of training data to achieve reasonable outcomes [2]. Most IE systems compromise by using rules that are manually created and classifier components that are automatically generated.

To elaborate, IE systems use some or all of the following components (a skeleton of such a pipeline is sketched below). Firstly, Segmentation divides the document into segments, e.g. sentences, and other components such as images and tables. Secondly, Lexical Analysis tags parts of speech, disambiguates words and identifies regular expressions such as names, numbers and dates. This gives some information about the words, their position in the text and/or sentence, their type and sometimes their meaning; lexical analysis generally uses dictionaries (and/or ontologies). Thirdly, Syntactic Analysis identifies and tags nominal phrases, verbal phrases and other relevant structures as a partial analysis, or alternatively each of the individual elements, i.e. nouns, verbs, determiners, prepositions, conjunctions, etc., as a complete analysis. Fourthly, Information Extraction applies rules to identify pertinent elements, to retrieve suitable patterns, and to store them according to a predefined format corresponding to the information extraction scenario. This last phase also examines co-reference relations, often implicit, such as the use of pronouns to refer to a person, a company or even an event; it is the only component specific to the domain [1].
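The following sketch illustrates how these components might be composed into a simple pipeline. The class, the function names and the stub implementations are hypothetical; a real system would plug in a proper segmenter, a lexical analyzer backed by dictionaries or ontologies, a partial parser and domain-specific extraction rules.

from dataclasses import dataclass, field

@dataclass
class Document:
    """Carries the text and the annotations added by each pipeline stage."""
    text: str
    segments: list = field(default_factory=list)   # from segmentation
    tokens: list = field(default_factory=list)     # from lexical analysis
    phrases: list = field(default_factory=list)    # from syntactic analysis
    records: list = field(default_factory=list)    # from information extraction

def segmentation(doc):
    # Naive split into sentence-like segments.
    doc.segments = [s.strip() for s in doc.text.split(".") if s.strip()]
    return doc

def lexical_analysis(doc):
    # Tag tokens; a real implementation would consult dictionaries/ontologies.
    doc.tokens = [(tok, "NUM" if tok.isdigit() else "WORD")
                  for seg in doc.segments for tok in seg.split()]
    return doc

def syntactic_analysis(doc):
    # Partial parse: here we merely collect the WORD tokens as "phrases".
    doc.phrases = [tok for tok, tag in doc.tokens if tag == "WORD"]
    return doc

def information_extraction(doc, rules):
    # Apply domain-specific extraction rules to produce scenario records.
    for rule in rules:
        record = rule(doc)
        if record is not None:
            doc.records.append(record)
    return doc

def pipeline(text, rules):
    doc = Document(text)
    for stage in (segmentation, lexical_analysis, syntactic_analysis):
        doc = stage(doc)
    return information_extraction(doc, rules)

Each stage only adds annotations to the document, so individual components can be replaced (for example, swapping the naive segmenter for a proper sentence splitter) without changing the rest of the pipeline.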
Finally, Lawrence and Giles [20] claim that 80% of the Web is stored in the hidden Web, i.e. pages generated on the fly from a database, for instance using XML/XSL to build pages in response to specific user requests. This implies a special need for tools that can extract information from such pages. Thus, Information Extraction from Web sites is often performed using wrappers. Wrapper generation has evolved independently of the traditional IE field, deploying techniques that are less dependent on grammatical sentences than NLP-based techniques. A wrapper, in the Web environment, converts information implicitly stored as an HTML document into information explicitly stored as a data structure for further processing. Wrappers can be constructed manually, by writing the code to extract the information, or automatically, by specifying the Web page structure through a grammar and translating it into code. In either case wrapper creation is somewhat tedious, and as Web pages change or new pages appear, new wrappers must be created. Consequently, Web Information Extraction often involves the study of semi-automatic and automatic techniques for wrapper creation. Wrapper induction [19] is a method for automatic wrapper generation using inductive machine-learning techniques: the task is to compute, from a set of examples, a generalization that explains the observations, framed as an inductive learning problem.
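As an illustration of the delimiter-based wrappers discussed above, the sketch below converts a fictitious, rigidly structured listing page into explicit records. The HTML layout, the tag sequence and the field names are assumptions made for the example; a real wrapper is tailored to the target site's actual markup and breaks when that markup changes.

import re

# A hand-written wrapper for a hypothetical results page in which every
# advertisement is rendered as a table row of the form:
#   <tr><td>SUBURB</td><td>BEDROOMS</td><td>PRICE</td><td>PHONE</td></tr>
ROW = re.compile(
    r"<tr>\s*<td>(?P<suburb>.*?)</td>\s*<td>(?P<bedrooms>\d+)</td>"
    r"\s*<td>\$(?P<price>\d+)</td>\s*<td>(?P<phone>.*?)</td>\s*</tr>",
    re.IGNORECASE | re.DOTALL,
)

def wrap(html):
    """Convert the implicit structure of the page into explicit records."""
    return [m.groupdict() for m in ROW.finditer(html)]

if __name__ == "__main__":
    page = """
    <table>
      <tr><td>St Lucia</td><td>2</td><td>$220</td><td>07 3365 1111</td></tr>
      <tr><td>Toowong</td><td>3</td><td>$280</td><td>07 3371 2222</td></tr>
    </table>
    """
    for record in wrap(page):
        print(record)

Wrapper induction automates the construction of patterns like ROW above by generalizing from labelled example pages rather than having a person write the expression by hand.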
1.4 Related Work
The IE field has developed over the last decade due to two factors: firstly, the exponential growth of digital document collections and, secondly, the organization of the Message Understanding Conferences (MUC), held from 1987 to 1998 and sponsored by DARPA. The MUC conferences coordinated multiple research groups in order to stimulate research by evaluating various IE systems. Each participating site had six months to build an IE system for a pre-determined domain. The IE systems were then evaluated on the same domain and corpus, allowing direct comparison, and the results were scored by an official scoring program using standard information retrieval measures. The MUC conferences demonstrated that fully automatic IE systems can be built with state-of-the-art technology and that, for some selected tasks, their performance is as good as that of human experts [27]. Despite these outcomes, building IE systems still requires a substantial investment in time and expertise and remains somewhat of a craft.

Some of the systems developed during the MUC period were applied, or can be applied, to the Web Information Extraction problem. On the one hand, both FASTUS [3, 14] and HASTEN [17] based their approaches on NLP techniques and implemented the entire architecture described above. They are operational systems but remain time and resource consuming in their scenario set-up. On the other hand, automatic training systems are based either on unsupervised algorithms combined with a bottom-up approach to extracting the rules, such as CRYSTAL [21], or on a supervised algorithm with a top-down approach, such as WHISK [22] and SRV [11]. Interestingly, with respect to this paper, WHISK used real-estate classified ads as its document collection. Finally, another system, PROTEUS [13], used dictionaries along with a set of regular expressions to mine documents in a top-down approach. In parallel with these developments, the wrapper-generation community also developed IE systems using machine-learning algorithms to generate extraction patterns for online information sources. SHOPBOT, WIEN, SOFTMEALY and STALKER belong to a group of systems that generate wrappers for fairly structured Web pages using delimiter-based extraction patterns [9, 19].

To conclude, at the time of writing search engines are not powerful enough for the tasks associated with IE systems: they return a collection of documents, but they cannot extract the relevant information from those documents. Thus, the Web Information Extraction field will continue to be an active area of research. As information systems will need to automate the process as far as possible to cope with the large amount of dynamic data found on the Web, IE systems will keep using machine-learning techniques, rendering them beyond the scope of generalist search indexes. Nevertheless, a combination of different approaches, achieving hybrid and domain-specific search indexes, is believed to be a promising direction for IE [22, 18].

Figure 1: The Homes On-line home page. The site acts as the source of unstructured texts for our experiment.

1.5 The Interaction Paradigm and Learning Context
Mixed initiative [16] is a process from human-computer interaction in which humans and machines share tasks best suited to their individual abilities. The computer performs computationally intensive tasks and prompts its human clients to intervene when either the machine is unable to make a decision or resource limitations demand intervention. Mixed initiative requires that the client determine trade-offs between different attributes and alter search constraints to locate objects that satisfy an information requirement. This process is well suited to data analysis using an unsupervised symbolic machine-learning technique called Formal Concept Analysis (FCA), an approach demonstrated in our previous work [5, 6, 7, 10] and inspired by the work of Carpineto and Romano [4].

This paper reinforces these ideas by re-using the real-estate browsing domain, a tutorial exercise in the introductory FCA literature. The browsing program for real-estate advertisements (RFCA) is more primitive than the Conceptual Email Manager CEM [5, 6, 7, 10], which uses concept lattices to browse email and other text documents. Unlike CEM, RFCA is a Web-based program, which creates a different set of engineering and technical issues in its implementation. However, RFCA is limited, and when the analysis of the rental advertisements requires nested-line diagrams (and other more sophisticated FCA-system features) we re-use CEM to produce nested-line diagrams for the real-estate data imported from the Web. Other related work demonstrates mixed-initiative extensions using concept-lattice animation, notably the algorithms used in Cernato and joint work in the open-source GoDA collaboration².

This article is structured as follows. Section 2 describes practical FCA systems and their coupling to relational database management systems (RDBMS). This highlights the necessity of structured input when using FCA and therefore the nature of the structure-discovery problem. Section 3 describes the Web-robot used to mine structure from real-estate advertisements. This section details the methods required to extract structured data from unstructured Web-collections and measures their success in terms of precision and recall. Section 4 shows the Web-based interface for browsing structured real-estate advertisements. Section 5 demonstrates how the real-estate data can be exported and the CEM program re-used to deploy nested-line diagrams and zooming [23].

² See http://toscanaj.sf.net and the framework project http://tockit.sf.net.

2 Formal Concept Analysis and RDBMSs
FCA [12] has a long history as a technique for data analysis. Two software tools, TOSCANA [24] and ANACONDA, embody a standard methodology for data analysis based on FCA. Following this methodology, data is organized as a table in an RDBMS (see Figure 2) and is modeled mathematically as a many-valued context (G, M, W, I), where G is a set of objects, M is a set of attributes, W is a set of attribute values and I is a relation between G, M and W such that (g, m, w1) ∈ I and (g, m, w2) ∈ I imply w1 = w2. We define the set of values taken by an attribute m ∈ M as Wm = {w ∈ W | ∃g ∈ G : (g, m, w) ∈ I}. An interpretation of this definition is that in the RDBMS table there is one row for each object, one column for each attribute, and each cell contains at most one attribute value.

Organization over the data is achieved via conceptual scales, which map attribute values to new attributes and are represented by a mathematical entity called a formal context. A formal context is a triple (G, M, I) where G is a set of objects, M is a set of attributes and I is a relation between objects and attributes. A conceptual scale is defined for a particular attribute of the many-valued context: if Sm = (Gm, Mm, Im) is a conceptual scale of m ∈ M then we require Wm ⊆ Gm. The conceptual scale can be used to produce a summary of the data in the many-valued context as a derived context. The context derived by Sm = (Gm, Mm, Im) with respect to plain scaling from data stored in the many-valued context (G, M, W, I) is the context (G, Mm, Jm) where, for g ∈ G and n ∈ Mm,

(g, n) ∈ Jm :⇔ ∃w ∈ W : (g, m, w) ∈ I and (w, n) ∈ Im
Scales for two or more attributes can be combined together in a derived context. Consider a set of scales, Sm, where each m ∈ M gives rise to a different scale. The new attributes supplied by each scale can be combined together using a special type of union:

N := ⋃_{m ∈ M} ({m} × Mm)
Then the formal context derived from combining all these scales together is (G, N, J) with

(g, (m, n)) ∈ J :⇔ ∃w ∈ W : (g, m, w) ∈ I and (w, n) ∈ Im
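To make the definitions above concrete, the sketch below builds the combined derived context (G, N, J) from a toy many-valued context using one conceptual scale per attribute (plain scaling). The attribute names, scale definitions and data are invented for illustration and are not the scales used by the system described in this paper.

# Toy many-valued context: one row per object (advertisement),
# one column per many-valued attribute, at most one value per cell.
mv_context = {
    "ad1": {"price": 220, "bedrooms": 2},
    "ad2": {"price": 280, "bedrooms": 3},
    "ad3": {"price": 150, "bedrooms": 1},
}

# One conceptual scale per attribute: each maps an attribute value w
# to the set of scale attributes n with (w, n) in Im.
scales = {
    "price": lambda w: {n for t, n in ((200, "<= 200"), (250, "<= 250")) if w <= t},
    "bedrooms": lambda w: {n for t, n in ((2, ">= 2"), (3, ">= 3")) if w >= t},
}

def derived_context(mv_context, scales):
    """Plain scaling: (g, (m, n)) is in J iff g takes some value w on m
    and the scale of m maps w to the scale attribute n."""
    J = set()
    for g, row in mv_context.items():
        for m, w in row.items():
            for n in scales[m](w):
                J.add((g, (m, n)))
    return J

if __name__ == "__main__":
    for g, attribute in sorted(derived_context(mv_context, scales)):
        print(g, attribute)
    # e.g. ad1 ('bedrooms', '>= 2')
    #      ad1 ('price', '<= 250') ...

The resulting incidence J, together with G and the combined attribute set N, forms the derived formal context from which concepts are then computed.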
Figure 2: Example showing the process of generating a derived concept lattice from a many-valued context and a conceptual scale for the attribute Views.

A concept of a formal context (G, M, I) is a pair (A, B) where A ⊆ G, B ⊆ M, A = {g ∈ G | ∀m ∈ B : (g, m) ∈ I} and B = {m ∈ M | ∀g ∈ A : (g, m) ∈ I}. For a concept (A, B), A is called the extent and is the set of all objects that have all of the attributes in B. Similarly, B is called the intent and is the set of all attributes possessed in common by all the objects in A. As the number of attributes in B increases, the concept becomes more specific, i.e. a specialization ordering is defined over the concepts of a formal context by:

(A1, B1) ≤ (A2, B2) :⇔ B2 ⊆ B1

More specific concepts have larger intents and are considered “less than” (