Ant Colony Optimisation for Automatically Populating Ontologies with Individuals

Mihnea Donciu1, Mădălina Ioniţă1, Mihai Dascălu1, Ştefan Trăuşan-Matu1,2
1 Department of Computer Science, “Politehnica” University of Bucharest, Bucharest, Romania
2 Research Institute for Artificial Intelligence of the Romanian Academy, Bucharest, Romania
{mihnea.donciu, madalina.ionita}@cti.pub.ro, {mihai.dascalu, stefan.trausan}@cs.pub.ro

Abstract—With the rapid spread of the social web and of information retrieval systems, the need to structure information and to make it more accessible for automatic evaluation has also increased, justifying the use of semantic repositories as knowledge bases and enabling the transition to a semantic web. Besides defining an ontology in terms of concepts and relations, the actual process of populating an ontology with individuals has become an increasingly time-consuming task due to the multitude of information sources. Extracting proper individuals is therefore very difficult when dealing with the amount of information freely available on the web and with so many classes within the hierarchy. On the other hand, the classical approach to populating ontologies consists of parsing the input and matching it against certain regular expressions. Due to its intrinsic limitations, we propose a novel approach based on the Ant Colony Optimization (ACO) algorithm, which creates rules by reading the predefined ontologies (mostly generated via expert knowledge) and later applies them to the given text. The ants build individuals iteratively by searching for keywords in the text that best fit as values for the attributes defined within each ontology class. The validation results show that our method provides more specific individuals than classic pattern matching techniques, as the algorithm scans every word in the text and compares it against a list of keywords.

Keywords—Ant Colony Optimization; data mining; pattern matching; ontology
I. INTRODUCTION
Due to different experts’ points of view, building ontologies has proven to be a laborious process which, in the end, does not fully grasp all the underlying and specific aspects of a particular domain. Besides this intrinsic limitation caused by the subjectivity of information, populating ontologies with a considerable number of individuals has proven to be a difficult task after the initial formalization of a domain. The purpose of our project is to build an application which automatically parses text from the Internet, extracts keywords that could serve as individuals for specific, predefined ontologies, and decides which class each word should belong to using the Ant Colony Optimization algorithm. In our experiments we focused on nutrition and sports medicine ontologies, whose formalization is quite straightforward and for which multiple domain experts concurred on the general hierarchy of concepts and on the underlying connections.
Tightly connected to the previously defined domains, our goal is to introduce a new perspective on sports, especially running [9]. We searched and indexed relevant public sources from the Internet in order to help regular people become better runners, all based on a scientific and documented background. In this context, we have implemented a web application which recommends a diet and a workout journal for a professional or an amateur runner by extracting data from the knowledge base, represented as ontologies. As Abd-Elrahman Elsayed stated [1], “The traditional task of the knowledge engineer is to translate the knowledge of the expert into the knowledge base of the expert system”. Additionally, in order to provide accurate and relevant recommendations, our knowledge base must integrate as many individuals as possible for providing personalized feedback. Our approach to populating individuals involves visiting relevant URLs and indexing the corresponding web pages; later on, data mining techniques are applied to the indexed texts in order to extract the individuals.
As described in [8], the process of populating ontologies requires ontology-based information extraction (OBIE) techniques. Given a text, the key terms (named entities or technical terms) must be identified and related to the ontology concepts. The extraction is performed by linguistic preprocessing (tokenization, POS tagging), followed by named entity recognition and automatic term recognition (rule-based grammars, machine learning techniques). Usually, the two methods are combined in order to maximize the benefits. Similarity is crucial in the field of data mining [4, 5]. Three different levels of knowledge acquisition can be performed in the context of a candidate term: syntactic, terminological and semantic. Syntactic knowledge is based on boundary words, which reside immediately before and after a candidate term. Each syntactic category has a weight based on a co-occurrence frequency analysis that determines how likely the candidate term is to be valid (e.g. we consider the heuristic that verbs next to the candidate term are usually better indicators than adjectives). Terminological knowledge addresses the terminological status of contextual words. A term can be a defined class name or a candidate for the individuals of a class. Contextual words are terms within the context which can indicate the presence of other searched terms; they are better indicators of other terms due to the premise that terms tend to co-occur. Semantic knowledge incorporates semantic information about terms in a specific context. Context words with a high degree of similarity to the candidate term are more likely to be relevant, whereas words in the surrounding context tend to be related.
Our paper addresses two solutions specific to the data mining field for populating ontologies: pattern matching and Ant Colony Optimization, highlighting the benefits of the second algorithm. The second section presents related work, whereas the third and the fourth sections detail our approach and the Ant Colony Optimization algorithm. Section V is focused on the implementation of the two alternatives for populating ontologies with individuals. The paper ends with results and conclusions regarding our approach.
II. RELATED WORK
A. Data Mining using Ant Colony Optimization
A lot of scientific research has been performed around data mining algorithms and techniques, including ontology building methods. Sizov [2] discusses methods to extract knowledge from texts as instances of ontology classes by means of pattern matching. Additionally, Parpinelli, Lopes and Freitas [3] propose a data mining algorithm called Ant-Miner that extracts classification rules from data and compare the results with CN2, a classic data mining algorithm; rule pruning is also performed in order to discard false results. In the following sections we show the analogies and differences with respect to this solution.
B. Ontology building with data mining
Continuing the idea of applying data mining processes for extracting relevant information, our focus now shifts towards presenting an existing methodology for automatically building an ontology and populating it with individuals. In the approach of Elsayed et al. [1], raw data was extracted from a local database and an ontology was automatically built in two phases: the data mining phase, including data preparation, selection and extraction of knowledge, and the ontology building phase, which transforms the extracted information into an ontology. The actual task of discovering knowledge is performed through classification, more specifically decision trees, and the discovered information is introduced in textual format into the ontology stored as XML-OWL. The decision nodes and the decision branches were mapped onto OWL classes. Each branch in the classification tree represents a classification rule and was converted into an individual of the class that represents the tree branch. The system was evaluated using two case studies: plant diseases and veterinary diseases. For plant diseases, there were 683 individuals classified into 19 classes; 91% of the individuals were generated correctly, and the results for bovine diseases were very similar [1].
III. APPROACH
Although solutions based on applying data mining techniques for ontology population already exist [2], our aim is to integrate the Ant Colony Optimization (ACO) algorithm in order to improve the results. A starting point for the ACO solution is the Ant-Miner algorithm for data mining, which extracts classification rules from data [3]. Our ants build individuals iteratively by searching for keywords in the text that best fit as values for each of the attributes defined within the ontology classes. Each keyword has an associated pheromone value, corresponding to the natural pheromone left by an ant while walking on a path. The pheromone concentration of a path is the best indicator of its quality. After each step, the ants’ results are compared and, for the best result, the pheromone concentration of the corresponding keywords is increased. The algorithm ends when no additional individuals are found to be added to the knowledge base. In this context, the ACO method is more effective than pattern matching rules because it scans every word in the text and compares it against the keywords; in other words, when using patterns there is no guarantee that all the possible expressions and words are taken into account.
IV. ANT COLONY OPTIMISATION FOR DATA MINING
Our algorithm was built using the Ant Colony System model, an improvement of the standard Ant System algorithm in which the pheromone update is performed locally by every ant after each edge traversal in the graph, in addition to the offline pheromone update at the end of one ant’s path construction process [7]. The algorithm design had to be adapted to the problem specifications. Instead of finding the shortest tour by visiting all the cities defined by the distances between them, we are now dealing with large chunks of text extracted from the Internet. In our case, we have to find the words that best match as individuals for a given class within our ontologies. Additionally, even if the name of the class is actually present in the text, individuals cannot be automatically extracted from it without further processing. Therefore, the shift consists of evaluating the proposed candidate individuals differently. This evaluation is quantified as the similarity between the proposed word and the class name. In the end, the candidates with the best similarities are transformed into individuals of the given class. Thus, our approach differs from the Ant-Miner algorithm [3], because we do not incrementally create rules that identify classes and finally choose the best rule from all the ant-constructed rules. The algorithm can be stated as follows:

ACO(name, text):
  Divide text into phrases;
  for each phrase
    if name exists in phrase
      N = number of words in phrase;
      M = number of ants;
      Calculate distances from the class name to the other words;
      Initialize ants;
      for N iterations
        for each ant
          Calculate probability for each word to be chosen;
          Choose one word randomly according to its probability
            using roulette wheel selection;
          if word found is a good candidate
            Add candidate to candidates list;
            Ant makes offline pheromone update;
          End if
        End for
      End for
      for each word in found candidates list
        if word is a good candidate
          Add word to the individuals of the class;
        End if
      End for
    End if
  End for
End ACO

At the end of the Nth iteration, a second check is performed on the found candidates in order to detect candidates that were added early only because the shortest possible similarity had not yet been identified and that are not part of an enumeration of individuals. This check is similar to the rule pruning step of the Ant-Miner algorithm, whose purpose is to exclude irrelevant terms from the rules discovered by each ant.

Each text given for a certain class name is first divided into sentences in order to make the search easier; afterwards, the class name is searched for in each phrase, taking into consideration the semantic similarity between each word and the class name. The number of words in the phrase (N) is the size of our problem, similarly to the number of cities in the Traveling Salesperson Problem. On the other hand, edges only connect each word to the class name, which simplifies the algorithm. The other parameter of the problem is the number of ants (M), which should be less than N because the algorithm requires N iterations of searching for individuals. Each ant calculates the probability of each word to be chosen according to the ACO probability formula, which depends on the pheromone value τ, the heuristic value η and the parameters α and β:

\[
p_{ij}^{k} =
\begin{cases}
\dfrac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{c_{il} \in N(s^{p})} \tau_{il}^{\alpha}\,\eta_{il}^{\beta}}, & \text{if } c_{ij} \in N(s^{p}) \\
0, & \text{otherwise}
\end{cases}
\tag{1}
\]

The probability for a word to be selected as a candidate during the ACO algorithm depends on the distance from the ontology class name and on the pheromone present on the edge, normalized over the edges towards the not yet visited neighboring words N(s^p). The parameters α and β control the relative importance of the pheromone value versus the inverse of the distance between the words [7]:

\[
\eta_{ij} = \frac{1}{d_{ij}}
\tag{2}
\]

In our algorithm, k is one of the M ants and j is always the class name. The pheromone value is updated locally after each ant chooses its proposed word, where φ is the evaporation rate and τ0 the initial pheromone level:

\[
\tau_{ij} \leftarrow (1 - \varphi)\,\tau_{ij} + \varphi\,\tau_{0}
\tag{3}
\]
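As a minimal sketch of how one ant could apply formulas (1)-(3), the Java fragment below computes the selection weights, performs roulette wheel selection and applies the local update. The class and field names (WordNode, pheromone, distance) and the value of τ0 are our own illustrative assumptions, not the authors’ actual implementation.

import java.util.List;
import java.util.Random;

/** Illustrative sketch of one ant step: probability (1), heuristic (2), local update (3). */
class AntStepSketch {
    static final double ALPHA = 1.0;   // pheromone importance (alpha)
    static final double BETA = 1.0;    // heuristic importance (beta)
    static final double PHI = 0.5;     // local evaporation rate
    static final double TAU0 = 0.1;    // assumed initial pheromone level

    /** Candidate word in the current phrase, with its distance to the class name. */
    static class WordNode {
        String word;
        int distance;          // number of words between this word and the class name (>= 1)
        double pheromone = TAU0;
        WordNode(String word, int distance) { this.word = word; this.distance = distance; }
    }

    /** Chooses one unvisited word by roulette wheel selection over formula (1). */
    static WordNode chooseWord(List<WordNode> unvisited, Random rnd) {
        double[] weights = new double[unvisited.size()];
        double total = 0.0;
        for (int i = 0; i < unvisited.size(); i++) {
            WordNode w = unvisited.get(i);
            double eta = 1.0 / w.distance;                        // formula (2)
            weights[i] = Math.pow(w.pheromone, ALPHA) * Math.pow(eta, BETA);
            total += weights[i];
        }
        double r = rnd.nextDouble() * total;                      // roulette wheel spin
        double cumulative = 0.0;
        for (int i = 0; i < unvisited.size(); i++) {
            cumulative += weights[i];
            if (r <= cumulative) return unvisited.get(i);
        }
        return unvisited.get(unvisited.size() - 1);               // numeric safety fallback
    }

    /** Local pheromone update performed right after the word is chosen, formula (3). */
    static void localUpdate(WordNode chosen) {
        chosen.pheromone = (1.0 - PHI) * chosen.pheromone + PHI * TAU0;
    }
}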
Each word proposed as a candidate is evaluated against the class name and against the other words already found as candidates. The first check verifies whether the word is one of the keywords that cannot be individuals because they are not nouns. If the word has a shorter distance, Lbest (the best match found so far in terms of similarity in the ACO algorithm) is updated with the new distance and the word is added to the candidates list. The ant which found the candidate performs the offline pheromone update for it. Another criterion for being chosen as a good candidate is being part of an enumeration of words near the class name in the phrase; in this case, the distance L is greater than Lbest, but the word is still a good candidate to become an individual.
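The candidate check just described can be summarized with the following sketch; the stopword set, the Lbest bookkeeping and the isPartOfEnumeration helper are hypothetical names introduced only for illustration, under the assumption that an enumeration is signalled by a comma or “and” before the word.

import java.util.List;
import java.util.Set;

/** Illustrative sketch of the candidate filter described above (not the authors' exact code). */
class CandidateCheckSketch {
    private final Set<String> nonIndividualKeywords;  // assumed stopword list loaded from a file
    private int bestDistance = Integer.MAX_VALUE;     // Lbest: best similarity found so far

    CandidateCheckSketch(Set<String> nonIndividualKeywords) {
        this.nonIndividualKeywords = nonIndividualKeywords;
    }

    /** Returns true if the word should be kept as a candidate individual. */
    boolean isGoodCandidate(String word, int distance, List<String> phraseWords, int wordIndex) {
        if (nonIndividualKeywords.contains(word.toLowerCase())) {
            return false;                               // keywords that cannot be individuals
        }
        if (distance < bestDistance) {
            bestDistance = distance;                    // update Lbest with the new shortest distance
            return true;
        }
        // Otherwise the word may still qualify if it belongs to an enumeration near the class name.
        return isPartOfEnumeration(phraseWords, wordIndex);
    }

    /** Hypothetical helper: a word preceded by "," or "and" is treated as part of an enumeration. */
    private boolean isPartOfEnumeration(List<String> phraseWords, int wordIndex) {
        if (wordIndex <= 0) return false;
        String previous = phraseWords.get(wordIndex - 1);
        return previous.equals(",") || previous.equalsIgnoreCase("and");
    }
}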
The complexity of the algorithm for each phrase depends on the actual number of words per text (N). Choosing a word by an ant according to its probability is performed in O(N). In the worst-case scenario, a number of words proportional to N are added to the candidates list and checked for being part of an enumeration, but in the average case only a few words are added, or even none, which makes this check O(1) on average. Nevertheless, this does not affect the overall complexity because the check adds all the terms within an enumeration at once, skipping this step for the subsequent enumeration terms. In the worst-case scenario (the text is mostly an enumeration of candidate terms), this step requires O(N + N) iterations, which is equivalent to O(N). By using M ants, the algorithm’s complexity for each phrase is O(N²×M). The size of the problem N can also be considered as the average number of words in all the texts for the given class name. For our purpose, the other dimension of the problem is the actual number C of classes in our ontologies, leading to an overall complexity of O(N²×M×C). In addition to the previous analysis, a crucial element identified within our experiments turned out to be defining the keywords for each class as accurately as possible, in order to ensure the success of the ACO algorithm in terms of the precision of the final results.
V. IMPLEMENTATION
Our data mining application was implemented in Java, and the data acquisition process consisted of crawling and indexing websites about nutrition and sports medicine. We further present the implementation of each of the algorithms proposed for solving the task at hand.
A. Pattern matching
We began our research on the best suited algorithms for extracting information to populate our ontologies by implementing the classic data mining algorithm. This method has been used by other researchers who applied data mining for different purposes [1]. In the following paragraphs we present in detail the parsing process of the data sources using pattern matching and rule establishment. In order to achieve optimal results regarding the final population of the ontology with individuals, we considered the following rules (a Java sketch of the splitting and filtering steps is given after this subsection):
• Avoid false individuals. A list of stopwords was built with known words that should not be taken as individuals, in order to avoid passing unsuitable words into the ontologies.
• Discard candidates of more than five words. Individuals that resulted in more than five words were discarded and not added to the ontology, presuming we needed single words or short expressions, not phrases.
• Split each sentence taking the class name as separator. At the beginning of the algorithm, the text was split into sentences and each sentence was matched against regular expressions which described the applied rules. One rule consisted of splitting the sentences, taking the class name as reference. E.g., for the class name Muscles: “During this extension of the leg and flexion of the hip, the hamstring and gluteal muscles are required to stretch rapidly”.
• Split the previously generated parts using the keywords “a”, “an”, “and”, “the” and the comma. E.g., the first part of the sentence above is: “During this extension of the leg and flexion of the hip, the hamstring and gluteal”.
Our implementation of the data mining process consists of a single method which was called for each class in our nutrition, sports medicine and runners ontologies. After extracting the individuals, we applied a filtering method on the candidates in order to add the most representative ones to the ontology. The method receives as parameters the class to which the individual should belong and the individual to be processed. If the result (the actual candidate) was a white space or had more than five tokens, it was discarded. The steps above regarding the pattern-matching alternative can be summarized in the following pseudocode:

Pattern_matching_alternative():
  Crawl and index relevant websites using the class descriptions;
  for each class for which texts were found
    for each text
      Split the text into sentences;
      for each sentence
        Split the sentence using the name of the class as separator;
        for each part
          Extract candidates using the keywords “a”, “an”, “the”, “,”;
          Process candidates by filtering them and adding the best
            alternatives as individuals;
        End for
      End for
    End for
  End for
End Pattern_matching_alternative

The actual results of our algorithm are presented later, in the Results section.
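The Java sketch referenced above illustrates the splitting and filtering rules under a few assumptions: whitespace tokenization, a small stand-in stopword set and a five-token limit. The method names and the stopword contents are illustrative only, not the authors’ actual implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative sketch of the pattern matching rules (not the authors' exact implementation). */
class PatternMatchingSketch {
    private static final Set<String> STOPWORDS = Set.of("this", "of", "are", "to", "required");
    private static final int MAX_TOKENS = 5;   // discard candidates longer than five words

    /** Extracts candidate individuals for a class name from a single sentence. */
    static List<String> extractCandidates(String sentence, String className) {
        List<String> candidates = new ArrayList<>();
        // Rule: split the sentence taking the class name as separator.
        for (String part : sentence.split("(?i)\\b" + className + "\\b")) {
            // Rule: split each part using the keywords "a", "an", "and", "the" and the comma.
            for (String candidate : part.split("(?i)\\b(a|an|and|the)\\b|,")) {
                String trimmed = candidate.trim();
                if (trimmed.isEmpty()) continue;                         // rule: skip white space
                if (trimmed.split("\\s+").length > MAX_TOKENS) continue; // rule: at most five tokens
                if (STOPWORDS.contains(trimmed.toLowerCase())) continue; // rule: avoid false individuals
                candidates.add(trimmed);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        String sentence = "During this extension of the leg and flexion of the hip, "
                + "the hamstring and gluteal muscles are required to stretch rapidly";
        System.out.println(extractCandidates(sentence, "muscles"));
    }
}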
B. Ant Colony Optimisation Implementation
The implemented ACO algorithm uses Apache Nutch [10], which crawls the web for relevant pages, indexes a local database and performs text searches according to our ontologies. The ontologies were read from local OWL files and transformed into ontology objects using the Apache Jena semantic web framework [11]. When iterating through all the classes, a NutchBean object is used to find in the local index database the relevant texts for each class. In our experiments, the number of ants M was set to 10 because, after multiple runs, we determined that 10 ants are sufficient for finding all the individuals in each phrase within N iterations. The text was split into phrases and each phrase was split into words using a regular expression of separators. The distance vector retains the number of words that lie between our class name and each word of the text. If the class name appears more than once, the distance is the shortest possible distance to each word. If the class name is not found (in singular or plural form), the phrase is skipped. All the ants are initialized with experimentally adjusted values: α and β are both set to 1 and the evaporation rate φ is set to 0.5. Their tasks are to build new individuals by proposing new candidates and to update the pheromones locally and offline after choosing their candidates. Their proposals are checked by a function that compares the word with all the keywords extracted at the beginning from a text file. If one of the keywords matches, the word is discarded. Otherwise, if L is smaller than Lbest, the word is considered a potential candidate. If not, there is still a chance that the word is part of an enumeration of individuals before the class name; this additional check is also implemented within our system. After a good candidate is selected, it is added to the list of best candidates (if not already present) and the ant which found it performs the offline pheromone update. At the end of the N iterations, candidates which appear to be false, because they do not have the shortest distance and are not part of an enumeration, are discarded. The remaining candidates are added to the individuals hashmap and transformed into individuals of their corresponding class.
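As a rough illustration of the last step, the sketch below shows how accepted candidates could be written back into an OWL file with Apache Jena. The namespace, the file names and the package layout (the org.apache.jena packages of current Jena releases) are assumptions, not a copy of the authors’ code.

import org.apache.jena.ontology.Individual;
import org.apache.jena.ontology.OntClass;
import org.apache.jena.ontology.OntModel;
import org.apache.jena.rdf.model.ModelFactory;

import java.io.FileOutputStream;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: turning the extracted candidates into OWL individuals with Jena. */
class OntologyPopulatorSketch {
    // Hypothetical namespace; the real one depends on how the OWL file was authored.
    private static final String NS = "http://example.org/sports_medicine.owl#";

    static void addIndividuals(Map<String, List<String>> candidatesByClass) throws Exception {
        OntModel model = ModelFactory.createOntologyModel();
        model.read("file:sports_medicine.owl");                    // load the local OWL file

        for (Map.Entry<String, List<String>> entry : candidatesByClass.entrySet()) {
            OntClass ontClass = model.getOntClass(NS + entry.getKey());
            if (ontClass == null) continue;                        // skip unknown class names
            for (String candidate : entry.getValue()) {
                // Create one individual per accepted candidate word.
                Individual individual = ontClass.createIndividual(NS + candidate.replace(' ', '_'));
                individual.addLabel(candidate, "en");
            }
        }
        try (FileOutputStream out = new FileOutputStream("sports_medicine_populated.owl")) {
            model.write(out, "RDF/XML");                           // persist the populated ontology
        }
    }
}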
VI. RESULTS
We built three different ontologies for our recommender system:
• nutrition.owl: contains classes and subclasses for aliments, proteins, carbohydrates etc.;
• sports_medicine.owl: contains classes and subclasses for body parts (like leg, ankle, muscle), injuries, treatments and exercises;
• running.owl: contains classes and subclasses for races, runners and running.
The ontologies were built after consulting a medical expert, more precisely a nutritionist, who helped us define the main concepts for our knowledge base. The doctor explained the most important groups of aliments and the principles that runners need to follow in their diet. We also received indications about the parts of the body and the muscles that are most used during running workouts. Therefore, we were able to build the main classes and the properties that mattered most. Afterwards, the ontology concepts and objects were built using Protégé [6].
The hardware platform on which the tests were performed was a quad-core Intel i7-2670QM processor with hyper-threading at 3.1 GHz and 4 GB of RAM. The first part of the test was the information retrieval process of data from the Internet. We created a script which uses Nutch [10] to crawl the web and create the local index database. Nutch offers the possibility to crawl and index using multiple threads simultaneously, which led to a smaller running time for this process. Another script was then used to populate each class with individuals found in the parsed texts. We first ran the pattern matching algorithm, then our ACO implementation. Table I presents the comparison of results, including the ones reported by Elsayed et al. [1], which used decision trees. The individuals resulting from applying pattern matching and ACO on the same ontology input were counted automatically by our system, but the validity check had to be performed manually. Selected candidates which were an unfortunate combination of words respecting all the considered criteria were recognized as false positives and discarded. Semantic similarity was the most important criterion for qualifying identified individuals as valid for the corresponding class name. Some examples of valid individuals are presented below. As can be seen from the results, our ontologies contain 278 classes. The pattern matching algorithm found fewer individuals than the ACO algorithm and populated fewer classes using, of course, the same input. The percentage of correct individuals (the precision of the extraction method) was greatly improved for ACO because it uses only single words when computing similarities to the class names. Regarding the processing times, no major variations
were observed. Moreover, the obtained precision is comparable to that of the decision trees method, but our approach is clearly more straightforward. Additionally, more results can be extracted by adding more URLs to the crawler searches and by extending our ontologies with new concepts to be populated with individuals. The results were very satisfying, especially considering that the text resulting from the crawling and indexing steps was only partially oriented towards nutrition or sports medicine.
TABLE I. COMPARISON OF RESULTS BETWEEN THE 3 ALGORITHMS

Metric                         | Pattern matching | ACO         | Decision Tree
Total number of classes        | 278              | 278         | unspecified
Number of individuals found    | 224              | 274         | 683
Number of populated classes    | 28               | 40          | 19
Percent of correct individuals | 76%              | 91%         | 91%
Processing time                | 5.5 seconds      | 5.5 seconds | unspecified
To better highlight our results, we present some examples of outputs generated by our data mining ACO algorithm. These are the extracted words before they were added as individuals to their corresponding ontologies:
• Class Muscle: [calf] [broken] [training] [supraspinatus] [knee] [stomach] [elasticity] [passive] [action] [back] [abdominals] [abductor] [quadriceps] [contraction] [gastrocnemius] [soleus] [group]
• Class Carbohydrate: [storage] [fat] [glycogen] [source] [drink]
FIGURE 1. INDIVIDUALS ADDED TO THE MUSCLE CLASS BY THE ACO ALGORITHM
The output results presented above show that words which reside in the context of the defined class name become valid candidates to be registered as individuals in the corresponding ontology. For example, words such as back, passive or group have a more general meaning, but combined with muscle they can be made individuals of the Muscle class or can be treated as subclasses of Muscle. Other words, such as abductor or quadriceps, are specific to muscles and bear a semantic similarity no matter the context in which they are found. Figure 1 depicts some examples of individuals added to the Muscle class by our ACO algorithm. The application used for creating and visualizing the OWL files was Protégé [6].
VII. CONCLUSIONS AND FUTURE WORK
Our system implements the Ant Colony Optimization algorithm, which runs in polynomial time depending on the size of the texts and on the number of ontology classes. The algorithm model used was the Ant Colony System, with local and offline pheromone updates performed by the ants. Additionally, the classic pattern matching algorithm was also implemented in order to compare its results with ACO; we reached the conclusion that the new approach is more accurate at finding new individuals, while keeping the same total processing time due to the fact that the complexities of the two algorithms are similar. As future work, the three populated ontologies will be improved by crawling more data sources, using more Nutch nodes to split the URLs, and therefore populating the ontologies with hundreds of thousands of candidates, which will significantly improve the accuracy and the diversity of the recommended menus and treatments for runners. As a final remark, our approach proved that Ant Colony Optimization can be adapted to extract knowledge from raw data. Even though the ACO algorithm is not optimal, the accuracy of the obtained results is comparable to that of other similar approaches, proving that the proposed method is feasible for automatically populating ontologies with individuals.
ACKNOWLEDGMENT
The research presented in this paper was partially supported by project No. 264207, ERRIC-Empowering Romanian Research on Intelligent Information Technologies / FP7-REGPOT-2010-1.

REFERENCES
[1] A.-E. Elsayed, S.R. El-Beltagy, M. Rafea, O. Hegazy, “Applying data mining for ontology building”, Faculty of Computers and Information, Computer Science Department, Cairo University, Giza, Egypt, Available at http://www.cnblogs.com/cy163/archive/2010/07/22/1782970.html, 2010.
[2] S. Sizov, “Information Systems and Semantic Web”, Workshop on Text Mining, Ontologies and Natural Language Processing in Biomedicine, Manchester, UK, March 20-21, Available at http://www-tsujii.is.s.utokyo.ac.jp/jw-tmnlpo/Sergei-Sizov.pdf, 2006.
[3] R.S. Parpinelli, H.S. Lopes, A.A. Freitas, “Data Mining with an Ant Colony Optimization Algorithm”, IEEE Transactions on Evolutionary Computation, Available at http://neuro.bstu.by/our/Data-mining/fereitasant.pdf, 2002.
[4] C. Anderson, “Data Mining: What is Data Mining?”, Available at http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm, Retrieved June 1st 2012.
[5] Data Mining, http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/, Retrieved June 1st 2012.
[6] Protégé: http://protege.stanford.edu/overview/protege-owl.html.
[7] M. Dorigo, M. Birattari, Th. Stutzle, “Ant Colony Optimization: Artificial Ants as a Computational Intelligence Technique”, IRIDIA Technical Report Series, Available at http://iridia.ulb.ac.be/IridiaTrSeries/IridiaTr2006-023r001.pdf, 2006.
[8] D. Maynard, Y. Li, W. Peters, “NLP Techniques for Term Extraction and Ontology Population”, Available at http://gate.ac.uk/sale/olpbook/main.pdf.
[9] M. Donciu, M. Ionita, M. Dascalu, S. Trausan-Matu, “The Runner Recommender System of Workout and Nutrition for Runners”, SYNASC 2011, pp. 230-238, 2011.
[10] Apache Nutch framework: http://nutch.apache.org/.
[11] Apache Jena framework: http://incubator.apache.org/jena/.