Automated Coding of Decision Support Variables Massimiliano Albanese, Marat Fayzullin, Jana Shakarian, and V.S. Subrahmanian
1 Introduction

With the enormous amount of textual information now available online, there is an increasing demand – especially in the national security community – for tools capable of automatically extracting certain types of information from massive amounts of raw data. In the last several years, ad-hoc Information Extraction (IE) systems have been developed to help address this need [6]. However, there are applications where the questions that need to be answered are far more complex than those that traditional IE systems can handle, and require integrating information from several sources. For instance, political scientists need to monitor political organizations and conflicts, while defense and security analysts need to monitor terrorist groups. Typically, political scientists and analysts define a long list of variables – referred to as a "codebook" – that they want to monitor over time for a number of groups. Currently, in most such efforts, the task of finding the right value for each variable – denoted as "coding" – is performed manually by human coders and is extremely time consuming. Thus, the need for automation is enormous.

In the effort presented in this paper, we leverage our previous work in IE [1] and define a framework for coding terror-related variables^1 automatically and in real time from massive amounts of data. The major contribution of our work consists
^1 Although in this paper we focus on terrorism-related variables, the proposed framework is general and can be adapted to many other scenarios.
M. Albanese (✉)
George Mason University, Fairfax, VA 22030, USA
e-mail: [email protected]

M. Fayzullin • J. Shakarian • V.S. Subrahmanian
University of Maryland, College Park, MD 20742, USA
e-mail: [email protected]; [email protected]; [email protected]

V.S. Subrahmanian (ed.), Handbook of Computational Approaches to Counterterrorism, DOI 10.1007/978-1-4614-5311-6_4, © Springer Science+Business Media New York 2013
in defining a logic layer to reason about and integrate fine-grained information extracted by an IE module. We implemented a prototype of the proposed framework, and preliminary experiments have shown that our approach is promising, both in terms of accuracy and in terms of significantly reducing human intervention in the process. Additionally, we have shown that separating the reasoning component of the process from the low-level processing enables quick deployment of new variables.

In addition to developing the computational framework, leveraging the interdisciplinary nature of our research group and the experience gained through collaboration with groups of political scientists, we have developed our own codebook for Computational Monitoring of Terror Groups (CMOT). Codebooks designed by political scientists are primarily conceived for human coders, and the definition of many variables leaves room for subjective interpretation of facts. To avoid this issue, we designed our codebook as an objective recording of evidence and facts, amenable to automation within a computational framework. CMOT is an actor-centered approach to collecting basic aspects of asymmetric conflict anywhere in the world. These aspects include the organizational structure, activity profile, and social profile of non-state armed groups (NSAGs), as well as a wide range of environmental variables regarding the social, economic, and political situation of the host country. CMOT attempts to cover detailed yet universal ground, and to provide analysts with a comprehensive picture of the strategic situation a rational actor faces. To accommodate inaccuracies in the open source media we are working with, the codebook has been structured into different levels of detail – allowing us to collect general to highly specific facts.

The remainder of the paper is organized as follows. Section 2 discusses related work.
Section 3 introduces the Automatic Coding Engine and provides details of its components, whereas Sect. 4 reports preliminary experimental results. Finally, concluding remarks are given in Sect. 5.
2 Related Work

The aim of Information Extraction (IE) is the extraction and structuring of data from unstructured and semi-structured electronic documents, such as news articles from online newspapers [4]. Information Extraction involves a variety of issues and tasks, ranging from text segmentation to named entity recognition and anaphora resolution, and from ontology-based representations to data integration. There is a large body of work in the IE community addressing single issues among those mentioned, and a variety of approaches and techniques have been proposed. With respect to named entity recognition, some authors propose knowledge-based approaches [3] while others favor the use of statistical models such as Hidden Markov Models [8]. Amitay et al. [2] introduce Web-a-Where, a system for locating mentions of places and determining the place each name refers to. Several attempts have been made to build a comprehensive IE framework. Unfortunately, most IE tools are domain dependent, as they rely on domain-specific knowledge or
features, such as page layouts, in order to extract information from text, or fail to answer the types of questions that are typical of the scenarios addressed in this paper. Soderland [12] presents an approach for extracting information from web pages based on a pre-processing stage that takes into account a set of domain-dependent and domain-independent layout features. Similarly, [7] focuses on extraction from web tables, but attempts to be domain independent by taking into account the two-dimensional visual model used by web browsers to display information on the screen. Other efforts are aimed at developing IE capabilities for specific knowledge domains, such as molecular biology [9], and thus use domain-specific knowledge to achieve their goals. More general approaches rely on automatic ontology-based annotation [6]. In conclusion, our approach differs significantly from previous approaches because it relies on domain-independent information extraction to perform the extraction task, and adds a logic layer to reason about extracted facts and formulate answers to complex questions. This novel combination of information extraction and logic reasoning has proved to be a key factor in scaling up our approach to coding and enabling quick deployment of new variables.
3 Automatic Coding Engine

In this section, we present the Automatic Coding Engine (ACE), our framework for automated coding of decision support variables. Figure 1 shows the overall architecture of the system, while details of the three major components are provided in the following sections. Documents are either fed into the system by a crawler – which continuously scans online news sources – or imported from any available corpus.^2 The Document Preprocessor, using the library of Linguistic Resources – which includes dictionaries defined by domain experts and a classification of English verbs [10] – identifies documents, paragraphs, or individual sentences that are good candidates for extraction of relevant information. Candidate text fragments are parsed using the Link Grammar Parser, and Low-level Linguistic Sensors extract atomic facts, such as the locations of events, the perpetrators of violent events, and the types of weapons used. Finally, the Rule Engine, using a library of logic rules, combines such atomic facts with a priori knowledge encoded in the Knowledge Base, and derives group-specific values – also referred to as "codes" – for each variable of interest, at the desired level of temporal granularity. Automatic codes generated by the system are stored in a database and made available to human reviewers through an interface that serves the double purpose of providing a tool to review automatic codes and a mechanism to manually code variables that have not been automated yet. In fact, the user interface was designed to ensure a gradual and smooth transition
^2 A corpus of documents from LexisNexis was used in our experiments.
Fig. 1 System architecture (components: Crawler, Document Preprocessor, Parser, Low-level Linguistic Sensors, Rule Engine, Rule Library, Knowledge Base, Linguistic Resources (dictionaries, verb classes, etc.), Database, Manual Coding Interface, User Interface, APIs, Applications; document sources: World Wide Web (news articles, etc.), Existing Document Corpora, Training corpus)
from a fully manual process to a semi-automatic one, where human intervention is limited to validation of automatically generated codes. In Sect. 4.2, we will show that our framework reduces the demand on coders' time by one order of magnitude. Additionally, our framework offers APIs to make data available to other applications.

In the following, we first formalize the coding problem and then describe each major component of the system. Let $\mathcal{V}$ denote the set of variables to be automated, which we will refer to as the codebook. Each variable $V \in \mathcal{V}$ has an associated domain $dom(V)$ of possible values that can be assigned to it. Variables can be boolean (i.e., $dom(V) = \{0, 1\}$), enumerative, or numeric. We now give the fundamental definition of coding function.

Definition 1. Given a variable $V \in \mathcal{V}$, a set of organizations $\mathcal{O}$, and a set of time intervals $\mathcal{I}$, a coding function for $V$ is a function $f_V$ which associates each pair $(I, O) \in \mathcal{I} \times \mathcal{O}$ with a value $v \in dom(V)$:

$$f_V : \mathcal{I} \times \mathcal{O} \to dom(V) \tag{1}$$
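For illustration, Definition 1 can be rendered as a minimal Python sketch; the class, its dict-backed storage, and the default-value behavior are our own illustrative choices, not part of the actual system:

```python
# A sparse, dict-backed rendering of a coding function f_V (Definition 1):
# it maps (time interval, organization) pairs to values in dom(V).
# Pairs never assigned fall back to a default value (0 here).
class CodingFunction:
    def __init__(self, domain, default=0):
        self.domain = domain          # dom(V), e.g. {0, 1} for booleans
        self.default = default
        self.codes = {}               # {(interval, org): value}

    def assign(self, interval, org, value):
        assert value in self.domain, "value must belong to dom(V)"
        self.codes[(interval, org)] = value

    def __call__(self, interval, org):
        return self.codes.get((interval, org), self.default)

# A binary equipment variable coded at yearly granularity.
f_v = CodingFunction(domain={0, 1})
f_v.assign(2001, "Lord's Resistance Army", 1)
print(f_v(2001, "Lord's Resistance Army"))  # 1
print(f_v(2002, "Lord's Resistance Army"))  # 0 (default for unseen pairs)
```

The sparse representation mirrors the observation in Sect. 4.1 that, for most binary variables, codes of 1 are rare.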
In other words, $f_V$ assigns a value to $V$ for each organization and time interval. In the following, we will denote the function corresponding to the manual coding process as $f_V^h$, and the one corresponding to the proposed automated process as $f_V^a$. Our objective is to develop a framework that allows us to design coding functions $f_V^a$ for a possibly large number of variables, with average precision and recall figures – as defined in Sect. 4.1 – around or above 70%. Although we will often refer to
individual coding functions, our system is designed to allow concurrent automation and monitoring of multiple variables, a key aspect for achieving scalability of our approach to automation.
3.1 Preprocessing

The first step of the processing pipeline involves the selection of text fragments that are good candidates for extraction of atomic facts. The rationale for this stage is that the following steps, namely parsing and information extraction, are computationally intensive, while only a small fraction of the available textual data will actually contain evidence useful for coding. Given a variable $V$, a time interval $I$, and an organization $O$, a text fragment $T$ (sentence, paragraph, or entire document) might provide evidence for coding $f_V^a(I, O)$ only if the following necessary conditions are satisfied: (i) $T$ mentions $O$ or any person known to be affiliated with $O$^3; (ii) $T$ reports events that occurred during $I$ or facts valid in $I$; (iii) textual clues for $V$ are found in $T$. A textual clue for $V$ is any expression (noun phrase or verb) which indicates that the text may possibly contain information useful for coding $V$. The three conditions above are necessary but not sufficient to code $f_V^a(I, O)$. The information extraction task will determine whether $T$ enables coding of $f_V^a(I, O)$. In the example of Table 2, the phrase "AK-47 guns" in the text fragment corresponding to 2001 is a strong indicator that the text snippet may support coding of Equ_F_G_Assault_Rifle, a binary variable that codes whether a group reportedly utilizes assault rifles, or its military arsenal allegedly includes assault rifles of some model and caliber, in the period being coded. However, at this stage we cannot yet conclude that $f^a_{Equ\_F\_G\_Assault\_Rifle}(2001, \text{Lord's Resistance Army}) = 1$, as the semantic roles of the relevant entities have not been analyzed. Additionally, intuition also tells us that the closer $O$ and a textual clue for $V$ occur within the text, the higher the likelihood that Linguistic Sensors will extract from $T$ atomic facts useful to code $V$ for organization $O$.
Thus, we pair each occurrence $c_V$ of a textual clue for $V$ with the closest occurrence $o$ of $O$ and compute the score $sc(o, c_V) = e^{-\alpha \cdot d_p(o, c_V)} \cdot e^{-\beta \cdot d_s(o, c_V)}$, where $d_p(o, c_V)$ and $d_s(o, c_V)$ are the distances between $o$ and $c_V$ in terms of number of paragraphs and number of sentences, respectively. We then consider the top-$k$ co-occurrences, find the text fragments that minimally contain them, and eventually expand the text fragments to include contiguous sentences in order to provide more context for extraction. We remind the reader that this stage does not perform any actual extraction, but rather selects candidate text fragments, thus enabling automatic coding to scale to massive amounts of raw text documents.

An extremely sensitive aspect of automatic coding is identifying the correct time interval in which events occurred or facts are valid. To address this issue, we have developed an algorithm to analyze indirect temporal references like "last year" or "three weeks ago", and infer actual dates.
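The co-occurrence score and top-$k$ selection above can be sketched in Python; the decay rates $\alpha$ and $\beta$ and their default values are illustrative assumptions, as the actual parameter settings are not specified here:

```python
import math

def cooccurrence_score(dp, ds, alpha=1.0, beta=0.5):
    """Score an (organization, textual clue) co-occurrence. dp and ds are
    the distances in paragraphs and sentences; the score decays
    exponentially with both, so closer pairs rank higher."""
    return math.exp(-alpha * dp) * math.exp(-beta * ds)

def top_k_cooccurrences(pairs, k=5, alpha=1.0, beta=0.5):
    """pairs: iterable of (org_occurrence, clue_occurrence, dp, ds).
    Returns the k highest-scoring co-occurrences as (score, org, clue)."""
    scored = [(cooccurrence_score(dp, ds, alpha, beta), o, c)
              for (o, c, dp, ds) in pairs]
    return sorted(scored, reverse=True)[:k]

# An organization and a clue in the same sentence of the same paragraph
# (both distances 0) yield the maximum score.
print(cooccurrence_score(0, 0))  # 1.0
```

Only the fragments minimally containing the surviving top-$k$ pairs would then be forwarded to the expensive parsing stage.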
^3 Affiliations can be automatically extracted by Linguistic Sensors.
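The indirect-temporal-reference analysis mentioned in Sect. 3.1 can be sketched as follows; the patterns handled and the coarse resolution of "last year" are simplified assumptions for illustration, not the actual algorithm:

```python
import re
from datetime import date, timedelta

WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def resolve_relative_date(expression, publication_date):
    """Resolve an indirect temporal reference (e.g. 'last year',
    'three weeks ago') against the article's publication date.
    Returns None when the expression does not match a known pattern."""
    expr = expression.lower().strip()
    if expr == "last year":
        # Coarse resolution: map to the start of the previous year.
        return date(publication_date.year - 1, 1, 1)
    m = re.match(r"(\w+) weeks? ago$", expr)
    if m:
        token = m.group(1)
        n = WORD_NUMBERS.get(token, int(token) if token.isdigit() else None)
        if n is not None:
            return publication_date - timedelta(weeks=n)
    return None

pub = date(2001, 11, 20)
print(resolve_relative_date("three weeks ago", pub))  # 2001-10-30
```

The inferred date determines the time interval $I$ against which necessary condition (ii) of Sect. 3.1 is checked.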
3.2 Linguistic Sensors

In this stage, text fragments selected by the Document Preprocessor are first parsed using the Link Grammar Parser [11]. Then, Named Entity Recognition and Pronoun Resolution are performed using GATE/ANNIE [5]. The Link Grammar Parser is a syntactic parser of English based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a constituent representation of the sentence. As in T-REX [1], we then use a library of extraction rules to match constituent trees of previously unseen sentences against a library of templates, and extract pieces of information, in the form of RDF^4 triples, from matching sentences. However, in our modular architecture, the IE component could be replaced by any suitable general-purpose IE system. The novelty of our approach consists in combining information extraction with a logic layer to reason about extracted data.

An extraction rule – referred to as a Linguistic Sensor in this paper – is of the form Head ← Body, where the body represents a set of conditions – constraints on subtrees of the constituent tree and on their relative positions – and the head represents the set of RDF statements that can be inferred from the sentence if the conditions in the body of the rule are satisfied. A tree-matching algorithm is used to match rules against previously unseen sentences, and RDF statements are extracted from matching sentences.

Example 1. Figure 2 shows an example of a Linguistic Sensor aimed at identifying possession of weapons in scenarios where the equipment is surrendered to somebody. Solid edges indicate that there must be a direct link between two nodes, while dotted edges indicate that there must be a path (of any length) between two nodes. The rectangular boxes represent end nodes, i.e., the actual text that will be extracted in case of a match.
The sensor in the figure requires that the subject of the sentence, which can be an arbitrarily nested noun phrase, contains a noun phrase identifiable as a named entity (either a person or an organization). It also requires that the main verb belongs to the class CHANGE_POS_VERBS of verbs denoting change of possession [10]. Finally, it requires that the verb is followed by a noun phrase – arbitrarily nested within another noun or prepositional phrase – denoting a weapon. If a sentence satisfies these constraints, then two RDF triples can be extracted: (tdata:EquipPossX, trexe:owner, Var1) and (tdata:EquipPossX, trexe:equipment, Var3), where X is a unique identifier assigned to each instance of the trexe:EquipmentPossession event. Any other information in the sentence is not relevant to this specific sensor and is ignored.
^4 RDF (Resource Description Framework) is a web standard defined by the World Wide Web Consortium [13], originally created for encoding metadata, but now used for encoding information about, and relationships between, entities in the real world.
Fig. 2 An example of a Linguistic Sensor (constituent-tree template: an S node with NP and VP children; the subject NP contains Var1, constrained by IS_ENTITY; the VP contains the verb Var2, constrained by CHANGE_POS_VERBS, followed by an NP containing Var3, constrained by IS_WEAPON)
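For illustration, the matching behavior of the sensor in Fig. 2 can be sketched over toy constituent trees; the tree encoding, the lexicons, and the flat string tests below are hypothetical stand-ins for the Link Grammar constituent output, the NER results, and the dictionaries used by the actual tree-matching algorithm:

```python
# A constituent tree as nested tuples: (label, children...), with plain
# strings as leaves. The sensor checks the Fig. 2 pattern: an
# entity-bearing NP subject, a change-of-possession verb, and a
# weapon-denoting NP somewhere after the verb.
CHANGE_POS_VERBS = {"surrendered", "handed", "relinquished"}
KNOWN_ENTITIES = {"four rebels of the lra"}   # stand-in for NER output
WEAPON_TERMS = {"ak-47 guns", "rifles"}

def find_phrase(tree, label, test):
    """Depth-first search for a phrase with the given label whose leaf
    text satisfies `test`; returns the original leaf text or None."""
    if isinstance(tree, str):
        return None
    node_label, *children = tree
    text = " ".join(c for c in children if isinstance(c, str))
    if node_label == label and test(text.lower()):
        return text
    for child in children:
        found = find_phrase(child, label, test)
        if found:
            return found
    return None

def equipment_possession_sensor(sentence_tree, event_id):
    """Emit RDF-style triples if the tree matches the surrender pattern."""
    owner = find_phrase(sentence_tree, "NP", lambda t: t in KNOWN_ENTITIES)
    verb = find_phrase(sentence_tree, "V", lambda t: t in CHANGE_POS_VERBS)
    weapon = find_phrase(sentence_tree, "NP", lambda t: t in WEAPON_TERMS)
    if owner and verb and weapon:
        return [(f"tdata:EquipPoss{event_id}", "trexe:owner", owner),
                (f"tdata:EquipPoss{event_id}", "trexe:equipment", weapon)]
    return []

tree = ("S",
        ("NP", "Four rebels of the LRA"),
        ("VP", ("V", "surrendered"),
               ("PP", "with", ("NP", "AK-47 guns"))))
print(equipment_possession_sensor(tree, 1))
```

A non-matching sentence (e.g., one whose verb is not a change-of-possession verb) yields no triples, mirroring the all-or-nothing semantics of the Head ← Body rule.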
3.3 Logic Layer

The last stage in the processing pipeline is the Rule Engine, which takes as input the output of the Linguistic Sensors and any a priori knowledge encoded in the Knowledge Base, and generates group-specific values for each variable of interest, at the desired level of temporal granularity. The Knowledge Base includes any information that may be available ahead of coding, such as the countries in which organizations operate, or the names of major known leaders. This information may be provided by the same domain experts who will use the system, or automatically extracted from text documents. Each RDF statement extracted by the Linguistic Sensors or encoded in the Knowledge Base is converted to an equivalent ground atom and fed into the Rule Engine, a logic engine based on Prolog. The Rule Engine uses rules from the Rule Library, a small fragment of which is shown in Fig. 3. The library of rules includes (i) rules to code individual variables; (ii) rules to code entire classes of variables (e.g., the last two rules in Fig. 3); and (iii) auxiliary rules to derive intermediate relations (e.g., the first two rules in Fig. 3). Once a library $\mathcal{L}$ of re-usable linguistic sensors has been developed for extracting a wide range of atomic facts, automation of additional variables only implies writing new logic rules. Thus, the marginal cost of automating additional variables in our framework is negligible w.r.t. the cost of writing ad-hoc algorithms
org_perpetrator(Event, Organization) :-
    trexe_perpetrator(Event, Organization),
    rdf_type(Organization, 'trexb:Organization').

org_perpetrator(Event, Organization) :-
    trexe_perpetrator(Event, Person),
    rdf_type(Person, 'trexb:Person'),
    trexb_affiliation(Person, Organization),
    rdf_type(Organization, 'trexb:Organization').

group_equipment_1(Organization, Date, Event, VarName) :-
    is_type(Event, 'trexe:ViolentEvent'),
    trexe_date(Event, Date),
    org_perpetrator(Event, Organization),
    trexe_weapon(Event, Weapon),
    triggers(Weapon, VarName).

group_equipment_1(Organization, Date, Event, VarName) :-
    is_type(Event, 'trexe:EquipmentPossession'),
    trexe_date(Event, Date),
    trexe_owner(Event, Organization),
    rdf_type(Organization, 'trexb:Organization'),
    trexe_equipment(Event, Equipment),
    triggers(Equipment, VarName).
Fig. 3 A small excerpt of the Rule Library
to automate the same variables. Formally, the average cost to automate a variable in ACE is

$$C = \frac{\sum_{i=1}^{|\mathcal{L}|} C_l(L_i) + \sum_{i=1}^{|\mathcal{V}|} C_r(V_i)}{|\mathcal{V}|} \tag{2}$$

where $C_l(L_i)$ is the cost of developing sensor $L_i$, and $C_r(V_i)$ is the cost of writing logic rules for variable $V_i$. Instead, the average cost of developing ad-hoc algorithms is

$$C' = \frac{\sum_{i=1}^{|\mathcal{V}|} C_a(V_i)}{|\mathcal{V}|} \tag{3}$$

where $C_a(V_i)$ is the cost of developing an ad-hoc algorithm for variable $V_i$. In ACE, atomic facts extracted by linguistic sensors can be re-used in multiple logic rules (i.e., $|\mathcal{L}| \ll |\mathcal{V}|$), and the cost of writing logic rules for variable $V_i$ is clearly much lower than the cost of developing ad-hoc algorithms for $V_i$ (i.e., $C_r(V_i) \ll C_a(V_i)$). Therefore, if $|\mathcal{V}|$ is very large, $C \ll C'$. In other words, our approach scales very well for very large sets of variables. Table 1 reports the number of variables coded each month, and shows how our effort scaled after an initial setup time.

Additionally, dependencies exist between some of the variables in the CMOT codebook. We identified such dependencies and organized subsets of the codebook in a hierarchical fashion. Similarly, we organized textual clues for such variables
Table 1 Progression of automation effort

Month             Nov. '09   Dec. '09   Jan. '10   Feb. '10   Mar. '10
# of vars coded      10         30         55         60         55
into dictionaries and analogous hierarchies of concepts. We then mapped such hierarchies of concepts to the hierarchies of variables and encoded all these data structures in the Knowledge Base. Finally, we designed generalized logic rules corresponding to the root node of each hierarchy of variables: such rules (see the last two rules in Fig. 3) have the name of the variable to be coded as one of their arguments, thus allowing the logic engine to reason about a whole family of variables. This powerful mechanism allows us to scale up our approach to automated coding even further, as we do not necessarily need to design ad-hoc rules for each individual variable, but rather for a family of variables. Equation 2 then becomes

$$C = \frac{\sum_{i=1}^{|\mathcal{L}|} C_l(L_i) + \sum_{i=1}^{|\mathcal{P}|} C_r(P_i)}{|\mathcal{V}|} \tag{4}$$

where $\mathcal{P}$ is a partition of $\mathcal{V}$ and $C_r(P_i)$ is the cost of writing logic rules to automate the variables in $P_i \subseteq \mathcal{V}$.
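The triggers/2 mechanism that lets one generalized rule serve a whole family of variables can be sketched in Python; the concept-to-variable mapping and the variable name Equ_F_G_Landmine below are hypothetical illustrations of the hierarchies encoded in the Knowledge Base:

```python
# A flattened view of the concept-to-variable hierarchy: extracted
# equipment terms trigger variable names (cf. triggers/2 in Fig. 3).
# Equ_F_G_Landmine is a hypothetical variable name used for illustration.
CONCEPT_TO_VARIABLE = {
    "ak-47": "Equ_F_G_Assault_Rifle",
    "assault rifle": "Equ_F_G_Assault_Rifle",
    "anti-tank mine": "Equ_F_G_Landmine",
}

def code_equipment_facts(facts):
    """facts: (organization, year, equipment term) triples extracted by
    linguistic sensors. Returns a sparse coding {(variable, year, org): 1}
    for every variable triggered by an extracted equipment term."""
    codes = {}
    for org, year, term in facts:
        var = CONCEPT_TO_VARIABLE.get(term.lower())
        if var is not None:
            codes[(var, year, org)] = 1
    return codes

facts = [("Lord's Resistance Army", 2001, "AK-47"),
         ("Lord's Resistance Army", 1999, "anti-tank mine")]
print(code_equipment_facts(facts))
```

Adding a new equipment variable then amounts to adding entries to the mapping, rather than writing a new extraction algorithm, which is what keeps the per-variable cost $C_r(P_i)$ low.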
4 Implementation and Experiments

We implemented a prototype of ACE on top of T-REX [1]. Linguistic resources, logic rules, and additional resources have been developed to enable automation of about 170 variables. We then conducted a number of preliminary experiments to validate our approach. We measured recall and precision with respect to the ground truth provided by human coders (Sect. 4.1). We also tracked the amount of time required by human coders to review and validate the codes generated by the system, compared to the time required to manually code the same variables starting from raw documents (Sect. 4.2). Table 2 shows a sample of ACE output for variable Equ_F_G_Assault_Rifle, coded for the Lord's Resistance Army in Uganda (only cases where the variable was coded as 1 are shown). The first column in the table indicates the time frame to which each code applies (yearly granularity was used in this case). The second column shows fragments of text (possibly from multiple documents) that provided evidence for coding. Finally, the third column is the ground truth generated by human coders.
Table 2 Output for Equ_F_G_Assault_Rifle, coded for the Lord's Resistance Army

Year  Supporting evidence                                                      GT
1999  "Text of report by Ugandan radio on 23rd April; all place names in       1
      northern Uganda. Over 1,000 AK-47 rifles, nine anti-tank guns with
      300 bombs and about 250 anti-tank and anti-personnel mines have been
      recovered from the Lord's Resistance Army LRA by UPDF Uganda People's
      Defence Forces in the border areas of northern Uganda over the last
      1 year"
2000  "He said that the UPDF recently recovered an assortment of weapons       1
      and ammunition from the Lord's Resistance Army LRA rebels, which
      included 31 AK-47 rifles and 18 guns"
2001  "Four rebels of the Lord's Resistance Army (LRA) under commander         1
      Kwo-yelo surrendered to the UPDF Uganda People's Defence Force in
      Gulu northern Uganda on Sunday evening 18 November with two AK-47
      guns with 60 bullets"
2003  "Robert was 16 when guerrillas of the Lord's Resistance Army [LRA]       0
      carrying axes, machetes and assault rifles slipped into his village
      and took him away during a chaotic night attack"
4.1 Precision and Recall

We conducted experiments on 30 CMOT variables and measured the precision and recall of our system. For each variable $V \in \mathcal{V}$, precision and recall can be defined as

$$P_{V,v} = \frac{|RS_{V,v} \cap GT_{V,v}|}{|RS_{V,v}|}, \quad \forall v \in dom(V) \tag{5}$$

$$R_{V,v} = \frac{|RS_{V,v} \cap GT_{V,v}|}{|GT_{V,v}|}, \quad \forall v \in dom(V) \tag{6}$$

where $RS_{V,v} = \{(I, O) \in \mathcal{I} \times \mathcal{O} \mid f_V^a(I, O) = v\}$ and $GT_{V,v} = \{(I, O) \in \mathcal{I} \times \mathcal{O} \mid f_V^h(I, O) = v\}$ are the result set and the ground truth for $V = v$, i.e., the sets of $(I, O)$ pairs that $f_V^a$ and $f_V^h$, respectively, associate with value $v$. The overall accuracy of $f_V^a$, i.e., the fraction of $(I, O)$ pairs that human coders and our system associate with the same value, can be computed as

$$A = \frac{|\{(I, O) \in \mathcal{I} \times \mathcal{O} \mid f_V^a(I, O) = f_V^h(I, O)\}|}{|\mathcal{I} \times \mathcal{O}|} \tag{7}$$
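A minimal sketch of these measures, computed over hypothetical (interval, organization) pairs:

```python
def precision_recall(result_set, ground_truth):
    """Precision and recall for a value v of variable V (Eqs. 5-6):
    result_set = RS_{V,v} and ground_truth = GT_{V,v} are the sets of
    (interval, organization) pairs coded as v by the system and by human
    coders, respectively."""
    hits = len(result_set & ground_truth)
    precision = hits / len(result_set) if result_set else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

def accuracy(f_a, f_h, pairs):
    """Eq. 7: fraction of (interval, organization) pairs on which the
    automated coding f_a and the manual coding f_h agree."""
    return sum(1 for io in pairs if f_a(io) == f_h(io)) / len(pairs)

# Two of the three system codes agree with the ground truth.
rs = {(1999, "LRA"), (2000, "LRA"), (2001, "LRA")}
gt = {(1999, "LRA"), (2000, "LRA"), (2002, "LRA")}
p, r = precision_recall(rs, gt)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```

Note that, per the sparsity observation below, accuracy alone can be misleading for binary variables, which is why the per-value measures $P_{V,1}$ and $R_{V,1}$ matter.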
We ran our experiments on a corpus of 80,000 LexisNexis documents. We first measured the overall accuracy of our system, obtaining an average accuracy of 90%. We then observed that, in the ground truth of most binary variables, the cases in which such variables are coded as 1 are quite sparse; therefore, one could obtain artificially high accuracy by simply setting every variable to 0. We then focused on the most critical measures, $P_{V,1}$ and $R_{V,1}$. On the same set of documents, the system returned 800 $(I, O)$ pairs such that $f_V^a(I, O) = 1$, pertaining to 12 distinct
terrorist organizations, between 1990 and 2009. Average precision and recall were found to be 82% and 77%, respectively. It is worth noting that human coders tend to use any previous knowledge they might have about the subject. This means that the actual recall could have been even higher than what we measured, had coders based their decisions solely on the information contained in the document corpus. Additionally, we observed that in 35% of the missed detections, the system was able to correctly code the variable that is the immediate generalization of the one missed. For example, in several cases the system failed to infer that a group used or possessed AK assault rifles in a certain time frame, but correctly inferred that it was using some sort of assault rifle. In conclusion, our system shows consistently high precision and recall.
4.2 Time

In our experiments, we also evaluated the impact of our framework on the time required by coders to complete the coding task. Since the CMOT coding effort started, we have observed that human coders take between 5 and 8 h to manually code 100 data points, at an average rate of 12–20 codes per hour, and that the most time-consuming task is reading or skimming through hundreds of articles in search of relevant information. The large variability in coders' time is due to a number of factors, including the level of expertise of each coder and the ease of finding reliable information for the assigned coding tasks. By contrast, a human coder can review automatically generated codes at an average rate of 140 codes per hour, one order of magnitude faster than the fully manual process. As we mentioned earlier, for each code the user is provided with the fragments of text that the algorithm used as evidence for coding, along with the list of documents those fragments belong to. If such text fragments do not provide enough context to judge the correctness of a code, reviewers can examine the original documents. We experimentally observed that the text provided as part of the output was sufficient to validate an automatic code in 89% of the cases, thus reducing the number of documents to be examined by human coders by a factor of 10. This latest experimental observation confirms that the most expensive task in the manual coding process is reading articles. Reducing the amount of text to be examined by a certain factor reduces the total time to perform the process by roughly the same factor.
5 Conclusions and Future Work

In this paper, we presented a framework for automatic monitoring of decision support variables, focusing on the case of terror-related variables. The proposed approach leverages our previous work in Information Extraction and complements it with a number of new features, including a logic layer, in order to address the
specific challenges posed by the coding task. Although this is still work in progress, we have shown preliminary results indicating that our approach is promising and can save analysts a substantial amount of time. Additionally, we showed that our approach to automation scales very well for very large sets of variables. For the future, we plan to complete automation of the CMOT codebook by December 2010, conduct large-scale experiments, and tune the system, with the objective of obtaining recall and precision figures around or above 70% for every variable. Future plans also include integrating the system with other systems in our lab to analyze trends and make predictions about the future behavior of monitored groups.
References

1. Albanese M, Subrahmanian VS (2007) T-REX: a system for automated cultural information extraction. In: Proceedings of the first international conference on computational cultural dynamics (ICCCD '07). AAAI, Menlo Park, pp 2–8
2. Amitay E, Har'El N, Sivan R, Soffer A (2004) Web-a-Where: geotagging web content. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 273–280
3. Callan J, Mitamura T (2002) Knowledge-based extraction of named entities. In: Proceedings of the 4th international conference on information and knowledge management. ACM, New York
4. Cowie J, Lehnert W (1996) Information extraction. Commun ACM 39(1):80–91
5. Cunningham H, Maynard D, Bontcheva K, Tablan V (2002) GATE: a framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics
6. Ding Y, Embley DW (2006) Using data-extraction ontologies to foster automating semantic annotation. In: Proceedings of the 22nd international conference on data engineering workshops (ICDEW '06). IEEE Computer Society, Washington, DC, p 138
7. Gatterbauer W, Bohunsky P, Herzog M, Kroepl B, Pollak B (2007) Towards domain-independent information extraction from web tables. In: Proceedings of the 16th international World Wide Web conference. ACM, New York, pp 71–80
8. GuoDong Z, Jian S (2003) Integrating various features in hidden Markov model using constraint relaxation algorithm for recognition of named entities without gazetteers. In: Proceedings of the international conference on natural language processing and knowledge engineering. IEEE Press, pp 465–470
9. Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7(2):119–129
10. Levin B (1993) English verb classes and alternations: a preliminary investigation. University of Chicago Press, Chicago
11. Sleator DD, Temperley D (1993) Parsing English with a link grammar. In: Proceedings of the third international workshop on parsing technologies (IWPT '93). University of Tilburg, The Netherlands
12. Soderland S (1997) Learning to extract text-based information from the World Wide Web. In: Proceedings of the 3rd international conference on knowledge discovery and data mining. AAAI Press, pp 251–254
13. World Wide Web Consortium (W3C) (2004) Resource Description Framework (RDF). http://www.w3.org/RDF/