A Framework of Collective Intelligence for Building Virtual Agriculture Knowledge Repository and Services

Masahiko Nagai*, Naiyana Sahavechaphan**, Vasuthep Khunthong***, Asanee Kawtrakul**,***
* The University of Tokyo
** National Electronics and Computer Technology Center, Thailand
*** Department of Computer Engineering, Kasetsart University, Thailand

Abstract. Providing information and knowledge services by collecting and maintaining weakly structured text sources is a time-consuming activity. This project aims to build specific services and knowledge infrastructures that support decision-making and problem-solving in agriculture domains. It is composed of three essential components: (i) iExtraction, a tool that semantically extracts relevant information from the textual sources embedded in individual and related Web sites and transforms it into a structured, standard format; (ii) iGrid, a framework that supports a virtual knowledge repository and discovers and integrates structured information distributed over different sources; and (iii) iVisualization, a sophisticated tool that visually presents information in a specific semantic network model called PMM (Problem-huMan-Method). Semantic MediaWiki is applied to register and update agricultural information as a semantic network dictionary, and the resulting agricultural information serves as reference information for interoperability between specialists and farmers. Moreover, to invite contributions from the user community and to share both tacit and explicit knowledge without language barriers, more sophisticated tools and systems are needed, such as a reverse dictionary, ontology-based knowledge sharing and machine-aided translation, for the sustainable development of the virtual agricultural knowledge repository and services. This collaborative project is currently implemented using the rice domain as a case study. The generated PMM consists of the identification of rice disease problems, the rice experts (huMan) who can solve those problems, and the Methods for solving them in both corrective and preventive ways.

Keywords. Collective Intelligence, PMM Model, Information Grid, Ontology, Semantic MediaWiki
1. Introduction

Information has been recognized as an essential resource that supports efficient decision-making and problem-solving. Accordingly, people usually follow three essential steps, as shown in Figure 1, when solving a problem. First, they search for information specific to the problem at hand. Next, they search for other information that influences the problem-solving, both horizontally and vertically. Finally, they analyze the different pieces of information together until they reach the desired solution.
Figure 1. The Information Retrieval Process for Problem-Solving

Consider a farmer who wants to be prepared for a problem related to gray scars on rice leaves. He would want answers to questions such as “What is the gray scar on rice leaves?”, “What causes the disease?”, “How can it be prevented or cured?” and “Who is a specialist in preventing or curing it?”. Because the Internet is a huge repository of information, he usually turns to it first to seek these answers. However, such information is typically distributed over and embedded in different Web pages, and it is available only in textual, unstructured form, as shown in Figure 2. He must therefore not only visit different Web pages but also learn various Web interfaces to locate and extract the desired information. Clearly, this process is time-consuming and delays the arrival at a timely solution.
Figure 2. The Retrieval of Information Distributed and Embedded in Different Web Pages

To address this problem, we propose in this paper a collective intelligence framework named AKRS (Agriculture Knowledge Repository and Services). Specifically, AKRS integrates three key components: (i) iExtraction, to transform unstructured information into structured information; (ii) iGrid, to build a virtual repository of structured information whose underlying, actual repositories are distributed and diverse; and (iii) iVisualization, to visualize information along with its semantics. We preliminarily evaluated AKRS by using it to support PMM (Problem-huMan-Method) based information in the rice domain. The result showed the usefulness of AKRS in providing rich information from different Web pages as a one-stop service, enabling users to retrieve PMM-based information efficiently.
2. Collective Intelligence Framework

Figure 3 illustrates our collective intelligence framework, AKRS. It consists of three essential components: iExtraction, iGrid and iVisualization. iExtraction facilitates the development of different repositories that publish their structured information through iGrid, while iGrid eases the development of iVisualization, whose underlying information comes from a set of distributed and diverse repositories.
Figure 3. The Collective Intelligence Framework, AKRS (components shown: iVisualization, iGrid, iExtraction)

2.1 iExtraction

iExtraction is a component that transforms textual (unstructured) information distributed over the Internet into a pre-defined structured format [Asanee et al., 2008] and stores the structured information in a selected relational database. Its architecture, illustrated in Figure 4, consists of three essential layers:

Figure 4. iExtraction Architecture

1. Pre-Processing Layer is responsible for removing HTML tags from a Web page and performing word segmentation on Thai sentences.
2. Extraction Layer is responsible for extracting the desired information from the Web page, once its tags have been removed and its Thai sentences segmented, based on clue words, SRL Rules and the Domain Frame Slot, and for collecting the extracted information in a database.
3. Repository Layer represents the storage of the clue words, SRL Rules and Domain Frame Slot as well as the extracted information.
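The first two layers can be illustrated with a minimal Python sketch. It is not the authors' implementation: the clue words and slot names in CLUE_WORDS are hypothetical, whitespace splitting stands in for Thai word segmentation, and a simple clue-word lookup replaces the SRL Rules and Domain Frame Slot, which are not specified here.

```python
from html.parser import HTMLParser

# Hypothetical clue words mapped to frame slots; the real clue words,
# SRL Rules and Domain Frame Slot are not given in the text.
CLUE_WORDS = {"Cause:": "pathogen", "Symptom:": "symptom", "Region:": "region"}


class _TextOnly(HTMLParser):
    """Collects text nodes and drops every HTML tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def preprocess(html):
    """Pre-Processing Layer: remove tags, then split the text into tokens."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    # Assumption: whitespace splitting stands in for Thai word segmentation.
    return " ".join(parser.chunks).split()


def extract(tokens):
    """Extraction Layer: fill each slot with the tokens after its clue word."""
    frame, slot = {}, None
    for token in tokens:
        if token in CLUE_WORDS:
            slot = CLUE_WORDS[token]
            frame[slot] = []
        elif slot is not None:
            frame[slot].append(token)
    return {name: " ".join(words) for name, words in frame.items()}


page = (
    "<html><body><h1>Rice Blast</h1>"
    "<p>Cause: Pyricularia grisea</p>"
    "<p>Symptom: eye-shaped leaf spots</p></body></html>"
)
record = extract(preprocess(page))
print(record)
```

The resulting dictionary plays the role of one structured record that the Repository Layer would store in the relational database.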
2.2 iGrid

iGrid (or Information Grid) [Naiyana et al., 2008] is a framework that facilitates the development of a virtual repository of information whose underlying, actual repositories are distributed and diverse. Its implementation is based on two types of standards: metadata standards and API standards. The ultimate goal of iGrid is to make it easier for software developers to build applications whose underlying information comes from a set of distributed and diverse repositories. Technically, with iGrid, information from different repositories becomes accessible through a single access point by submitting queries defined on the metadata standards. Its architecture, illustrated in Figure 5, consists of six essential layers:

1. Application Layer represents any application that requests information from iGrid.
2. Standard Layer is responsible for collecting the metadata standards that are defined for describing information and that are applied for defining generic, standard queries.
3. Integration Layer is responsible for integrating standard information from different repositories.
4. Discovery Layer is responsible for finding the desired repositories for a given query.
5. Transformation Layer is responsible for transforming not only a generic query into a specific query for an underlying repository, but also the specific, queried information into generic information described in its corresponding standard format.
6. Repository Layer represents the existing repositories connected to iGrid.
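How the Discovery, Transformation and Integration layers cooperate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the iGrid API: the Repository class, the function names, the local field names and the sample rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Repository:
    """Repository Layer: one source with its own local field names."""
    name: str
    standard: str      # metadata standard this repository can answer
    field_map: dict    # standard field name -> local field name
    rows: list         # local records


def discover(repositories, standard):
    """Discovery Layer: keep repositories that support the queried standard."""
    return [r for r in repositories if r.standard == standard]


def to_specific(repo, query):
    """Transformation Layer: generic query -> repository-specific query."""
    return {repo.field_map[field]: value for field, value in query.items()}


def to_generic(repo, row):
    """Transformation Layer: local record -> record in the standard format."""
    return {field: row[local] for field, local in repo.field_map.items()}


def ask(repositories, standard, query):
    """Integration Layer: merge standard-format answers from all sources."""
    answers = []
    for repo in discover(repositories, standard):
        specific = to_specific(repo, query)
        for row in repo.rows:
            if all(row.get(k) == v for k, v in specific.items()):
                answers.append(to_generic(repo, row))
    return answers


# Hypothetical repositories with different local schemas.
repo_a = Repository("A", "plantDisease",
                    {"name": "eng_name", "pathogen": "cause"},
                    [{"eng_name": "Rice Blast", "cause": "Pyricularia grisea"}])
repo_b = Repository("B", "plantDisease",
                    {"name": "disease", "region": "area"},
                    [{"disease": "Rice Blast", "area": "Northern"}])
repo_c = Repository("C", "Expert",
                    {"name": "expert_name"},
                    [{"expert_name": "Expert A"}])

# Application Layer: a single generic query defined on the metadata standard.
result = ask([repo_a, repo_b, repo_c], "plantDisease", {"name": "Rice Blast"})
print(result)
```

The application issues one query in standard field names and never sees the local schemas, which is the single-point-access property described above.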
Figure 5. iGrid Architecture

2.3 iVisualization

To manage the extracted information, a system was developed based on Semantic MediaWiki. Semantic MediaWiki is a feature-rich wiki implementation that handles hyperlinks and has a simple text syntax for creating new pages and crosslinks between terms (Leuf and Cunningham, 2001). In Semantic MediaWiki, the visual depiction of content is expressed by tags. Because it is not easy to add relations with tags, we developed a table-like editor as a wiki plug-in in this study. Figure 6 shows Semantic MediaWiki and the table editor, with which the user can browse the explanation of a term. Semantic MediaWiki displays not only the definitions of terms but also the relations between them. The table editor is used to modify the relations of terms through a table, without tags. Problem, Method and huMan information can thus be collected and managed across disciplines.

To compare associations among different keywords, a graph representation such as the one shown in Figure 7 is useful. In this example, the landuse classification schemas of Thailand and Indonesia are compared. The term “water body” can be found in both countries. The two landuse classes appear to be the same, but their level in the hierarchy differs slightly between the classification schemas. In the Indonesian landuse schema, “water body” does not include water courses, whereas “water body” in Thailand includes all water-related geographical features. The graph representation therefore makes the distinction between the two terms clear. New information, such as the relation between “water body” in the two countries, can then be created. This kind of information is treated as created ontological information and is added through Semantic MediaWiki. The ontological information can grow autonomously as relations are added, becoming more and more useful.

The constructed ontological information is used for the reverse dictionary. A reverse dictionary describes the concept of a term from the definitions and associations of terms. Our reverse dictionary is developed based on GETA, which was developed by the National Institute of Informatics, Japan. GETA comprises tools for manipulating large, sparse matrices for text retrieval, and it is an engine for calculating associations such as similarity measurements [Nishioka and Imaichi, 2000].
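The idea of a reverse dictionary, returning terms ranked by how closely their definitions match a description, can be sketched as follows. This sketch does not use GETA: it substitutes a plain bag-of-words cosine similarity, and the terms and definitions in DEFINITIONS are invented for illustration.

```python
import math
from collections import Counter

# Invented definitions, for illustration only.
DEFINITIONS = {
    "water body": "lakes ponds rivers and other water features",
    "paddy field": "flooded land used for growing rice",
    "forest": "land covered with trees",
}


def vectorize(text):
    """Turn a text into a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def reverse_lookup(description, definitions):
    """Rank terms by the similarity of their definition to the description."""
    query = vectorize(description)
    scored = [(term, cosine(query, vectorize(text)))
              for term, text in definitions.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = reverse_lookup("land used for growing rice", DEFINITIONS)
for term, score in ranked:
    print(f"{term}: {score:.3f}")
```

A production engine would also weight terms and exploit the associations stored in the semantic network; the sketch only shows the lookup direction, from description to term.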
Figure 6. Semantic MediaWiki and Table Editor

Figure 7. Information Network (panels: Landuse in Thailand; Landuse in Indonesia)
3. Case Study using Rice Domain

To evaluate the usefulness of our framework AKRS, we deployed it to collect and provide PMM-based information in the rice domain. Specifically, PMM (Problems, problem-solving Methods and problem-solver huMans) represents concepts or topics with specific properties as knowledge. Examples of PMM-based information include a Disease (P: Problem), the Preventive or Corrective Solutions (M: Methods) for a given problem, and the Experts (M: huMan or Solver) who have either solutions or policies and strategies for the problem.
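As a rough illustration, PMM-based information can be held as labeled links from a Problem to its Methods and huMan experts. The class and relation names below are hypothetical, the expert name is a placeholder, and labeling one method as preventive and the other as corrective is an assumption made only for the example.

```python
from collections import defaultdict


class PMMNetwork:
    """Semantic network linking a Problem to its Methods and huMan experts."""

    def __init__(self):
        self._links = defaultdict(list)

    def link(self, problem, relation, target):
        """Add one labeled edge from a problem to a method or an expert."""
        self._links[(problem, relation)].append(target)

    def related(self, problem, relation):
        """Return every target reachable from the problem by one relation."""
        return list(self._links.get((problem, relation), []))


network = PMMNetwork()
# Assumption: the preventive/corrective labels are illustrative only.
network.link("Rice Blast", "preventive", "Use resistant varieties")
network.link("Rice Blast", "corrective", "Spray kasugamycin")
network.link("Rice Blast", "human", "Expert A")  # placeholder name

for relation in ("preventive", "corrective", "human"):
    print(relation, network.related("Rice Blast", relation))
```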
3.1 iExtraction

In this case study, we applied iExtraction to extract information from various Web sites, such as “http://nsw-rsc.ricethailand.go.th/LibraRiceSeeds/Disease/LibraryDisease-02.html, http://ubn-rrc.ricethailand.go.th and http://www.brrd.in.th/rkb/”, that contain information about rice diseases, methods for curing rice diseases, and experts who specialize in rice diseases. Three databases were developed as repositories that hold this information in a structured manner. They cover the characteristics of diseases, methods and experts:

Disease database: Thai name, English name, region, pathogen, symptom (plant morphology, symptom type, shape and size).
Method database: disease name, cue and method.
Expert database: expert name, position, office, expertise, address (number, street, subdistrict, district, city, country, zip code), education (degree, major, institute, year).

The following examples show the results of iExtraction before they are stored in the repositories.

Example 1: A sample of the annotated corpus for training the system on the Problem part (P), which yields the disease information for rice.

โรค ไหม้ ( Rice Blast ) โรค ข้าว และ การ ป้องกัน กำจัด โรค ไหม้ ( Rice Blast ) พบ มาก ใน นา น้ำ ฝน ข้าว พันธุ์ พื้นเมือง ไว ต่อ ช่วง แสง พบ ส่วนใหญ่ ใน ภาค เหนือ ภาค ตะวันออก เฉียง เหนือ ภาค ตะวันตก และ ภาค ใต้ สาเหตุ เชื้อรา Pyricularia grisea Sacc . อาการ อาการ ใบ จุด ช้ำ น้ำ และ รูป ตา อาการ ใบ ไหม้ คล้าย น้ำ ร้อน ลวก แปลง กล้า ที่ โรค ไหม้ ระบาด ระยะ กล้า

(Approximate English gloss: Blast disease (Rice Blast). Rice diseases and their prevention and control. Blast disease (Rice Blast) is found mostly in rain-fed paddy fields and in photoperiod-sensitive native rice varieties, mostly in the Northern, Northeastern, Western and Southern regions. Cause: the fungus Pyricularia grisea Sacc. Symptoms: water-soaked, eye-shaped leaf spots; scorched leaves that look scalded by hot water; a seedling plot where blast has spread; seedling stage.)
Example 2: A sample of the annotated corpus for training the system on the Method part (M), which yields the methods for curing the disease.

โรคไหม้ / Rice Blast
การป้องกันกำจัด / Prevention and cure
1. ใช้พันธุ์ต้านทานโรค เช่น สุพรรณบุรี 1 สุพรรณบุรี 2 และ ชัยนาท 1 / 1. Use resistant varieties such as Suphanburi 1, Suphanburi 2 and Chainat 1
2. ฉีดพ่นคาซูกะมัยซิน ตามอัตรา 20 กรัมต่อน้ำ 20 ลิตร / 2. Spray kasugamycin at a rate of 20 g per 20 liters of water
Example 3: A sample of the annotated corpus for training the system on the huMan part (M).

การพัฒนาสายพันธุ์ข้าวต้านทานโรคไหม้ / Development of rice varieties resistant to rice blast
นายพูนศักดิ์ เมฆวัฒนากาญจน์ / Mr. Poonsak Mekwattanakan
โทรศัพท์ : (045) 344103-4 ต่อ 122 / Telephone: (045) 344103-4 ext. 122
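The three repositories described in this subsection could be laid out relationally as in the sketch below, which uses Python's built-in sqlite3 module. The table and column names are assumptions derived from the field lists in the text, address and education are flattened into columns for brevity, and the actual database engine and schema used in the project are not stated.

```python
import sqlite3

# Assumed schema: column names follow the field lists given in the text.
SCHEMA = """
CREATE TABLE disease (
    thai_name TEXT, eng_name TEXT PRIMARY KEY, region TEXT, pathogen TEXT,
    plant_morphology TEXT, symptom_type TEXT, shape TEXT, size TEXT
);
CREATE TABLE method (
    disease_name TEXT REFERENCES disease(eng_name), cue TEXT, method TEXT
);
CREATE TABLE expert (
    expert_name TEXT, position TEXT, office TEXT, expertise TEXT,
    number TEXT, street TEXT, subdistrict TEXT, district TEXT,
    city TEXT, country TEXT, zip_code TEXT,
    degree TEXT, major TEXT, institute TEXT, year INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Values taken from Examples 1 and 2; unspecified fields stay NULL.
conn.execute(
    "INSERT INTO disease (eng_name, pathogen) VALUES (?, ?)",
    ("Rice Blast", "Pyricularia grisea Sacc."),
)
conn.executemany(
    "INSERT INTO method (disease_name, cue, method) VALUES (?, ?, ?)",
    [
        ("Rice Blast", "Prevent and Cure", "Use resistant varieties"),
        ("Rice Blast", "Prevent and Cure",
         "Spray kasugamycin, 20 g per 20 liters of water"),
    ],
)

# Join a problem with its methods, as a PMM-style lookup would.
rows = conn.execute(
    "SELECT d.pathogen, m.method FROM disease d "
    "JOIN method m ON m.disease_name = d.eng_name "
    "WHERE d.eng_name = ? ORDER BY m.rowid",
    ("Rice Blast",),
).fetchall()
for pathogen, method in rows:
    print(pathogen, "|", method)
```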
From the three databases, we define a data standard for transferring information between iExtraction and iGrid by using the AKRS Metadata Standards (see Figure 8).

3.2 iGrid

In this case study, we developed four metadata standards (see Figure 8): plant, plantDisease, plantMethodology and Expert. They describe information about a plant, a disease of a plant, a method or solution to prevent or cure a disease, and a person specializing in such a method or solution, respectively. A Web service was then developed for each individual repository (the output of iExtraction) to provide its information in the corresponding metadata standard. The services are deployed at “http://vivaldi.cpe.ku.ac.th:8001/wsrf/services/igrid/isource/InformationSourceService?wsdl” and connected to iGrid as shown in Figure 9. As a result, iGrid was able to return PMM-based information in the rice domain for a given query generated from the above four metadata standards.
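The sketch below illustrates how a query defined on the metadata standards could be checked and then answered across the four standards. The field names in STANDARDS are hypothetical, since the contents of Figure 8 are not reproduced in the text, and the records and the expert name are placeholders.

```python
# Hypothetical field lists; the actual standards are defined in Figure 8.
STANDARDS = {
    "plant": ("plantName",),
    "plantDisease": ("plantName", "diseaseName", "pathogen"),
    "plantMethodology": ("diseaseName", "method"),
    "Expert": ("expertName", "expertise"),
}

# Placeholder records, each already expressed in its metadata standard.
RECORDS = {
    "plant": [{"plantName": "rice"}],
    "plantDisease": [{"plantName": "rice", "diseaseName": "Rice Blast",
                      "pathogen": "Pyricularia grisea Sacc."}],
    "plantMethodology": [
        {"diseaseName": "Rice Blast", "method": "Use resistant varieties"},
        {"diseaseName": "Rice Blast", "method": "Spray kasugamycin"},
    ],
    "Expert": [{"expertName": "Expert A",
                "expertise": "Rice varieties resistant to rice blast"}],
}


def validate(standard, fields):
    """Reject a query or record that uses fields outside the standard."""
    unknown = set(fields) - set(STANDARDS[standard])
    if unknown:
        raise ValueError(f"{sorted(unknown)} not defined in {standard}")
    return fields


def select(records, standard, **criteria):
    """Answer a generic query expressed in one metadata standard."""
    validate(standard, criteria)
    return [r for r in records[standard]
            if all(r.get(k) == v for k, v in criteria.items())]


def pmm_answer(records, disease_name):
    """Combine the Problem, Method and huMan parts for one disease."""
    experts = [e for e in records["Expert"]
               if disease_name.lower() in e["expertise"].lower()]
    return {
        "problem": select(records, "plantDisease", diseaseName=disease_name),
        "methods": select(records, "plantMethodology", diseaseName=disease_name),
        "humans": experts,
    }


answer = pmm_answer(RECORDS, "Rice Blast")
print(answer)
```

Validating the query against the standard before execution keeps application code independent of how each repository names its columns.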
Figure 8. Metadata Standards that describe PMM-based Information
Figure 9. iGrid with the availability of repositories generated from iExtraction

3.3 iVisualization

As an example of AKRS, suppose a user wants to know about a rice problem, such as “rice disease caused by insect”. The reverse dictionary returns the answer as a list of terms with similarity scores, such as “Orange Leaf Disease”, “Yellow Dwarf Disease”, “Gall Dwarf Disease”, “Grassy Stunt Disease”, and so on. The reverse dictionary relates data by calculating similarity from the definitions of terms. A user without basic agricultural knowledge can thus discover that “Orange Leaf Disease” is caused by an insect. The reverse dictionary also supports the retrieval of huMan-based and Method-based information. By using the semantic network tools, it shows who is an expert on this rice disease, how to contact that expert, how to solve the problem, and what kinds of chemicals are effective against the disease. In addition, the reverse dictionary is linked with an existing translation Web service called LEXTRON, developed by the National Electronics and Computer Technology Center, Thailand. PMM-based information is written in Thai, but it can also be searched with English keywords.
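Searching Thai-language records with an English keyword can be pictured as a translation step followed by an ordinary lookup. In the sketch below a small hard-coded glossary stands in for the translation Web service; the glossary, the function names and the record layout are hypothetical, and no real translation API is called.

```python
# Hypothetical English-to-Thai glossary standing in for a translation service.
GLOSSARY = {"rice blast": "โรคไหม้"}

# Thai-language method records keyed by the Thai problem name (from Example 2).
METHODS_BY_PROBLEM = {
    "โรคไหม้": [
        "ใช้พันธุ์ต้านทานโรค",
        "ฉีดพ่นคาซูกะมัยซิน",
    ],
}


def translate(keyword):
    """Map an English keyword to its Thai term, or None if it is unknown."""
    return GLOSSARY.get(keyword.strip().lower())


def search(keyword):
    """Answer an English keyword query from Thai-language records."""
    thai_term = translate(keyword)
    if thai_term is None:
        return []
    return METHODS_BY_PROBLEM.get(thai_term, [])


print(search("Rice Blast"))
print(search("unknown term"))
```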
Figure 10. Reverse Dictionary
4. Conclusion

We have proposed a collective intelligence framework for building a virtual agriculture knowledge repository and services that facilitates information extraction, retrieval and visualization. Its ultimate goal is to help people not only retrieve information across different Web pages through a one-stop service, but also see unexpected information that may be applicable to their decision-making and problem-solving in agriculture domains. In the rice case study, the underlying information includes the identification of rice disease problems, the rice experts who can solve those problems, and the methods for solving them in both corrective and preventive ways.
References

Asanee Kawtrakul, Therawat Tooumnauy, Vasuthep Khunthong, Aree Thunkijjanukij, Kanok Lohapiyaphan, Yuphayao Pothipaki, Werachai Narkwiboonwong and Anan Pusittigul, 2008. Ontology based Knowledge Map Construction for a Smart Knowledge Service. IAALD AFITA WCCA2008, Japan.

Asanee Kawtrakul, Watchara Sriswasdi, Suparat Wuttilerdcharoenwong, Vasuthep Khunthong, Frederic Andres, Saovakon Laovayanon, Decha Jenkollop, Werachai Narkwiboonwong and Anan Pusittigul, 2008. CyberBrain: Towards the Next Generation Social Intelligence. IAALD AFITA WCCA2008, Japan.

Kristin Klinger, Jamie Snavely, Jeff Ash and Carole Coulson, 2009. “Information Retrieval in Biomedicine: Natural Language Processing for Knowledge Integration.” In Asanee Kawtrakul, Chaveewan Pechsiri, Sachit Rajbhamdari, Frederic Andres (ed.). Chapter XVIII: Problem-Solving Map Extraction with Collective Intelligence Analysis and Language Engineering. USA: IGI Global Publication, pp. 325-344.

Leuf, B. and Cunningham, W., 2001. The Wiki Way: Quick Collaboration on the Web. Addison-Wesley, USA.

Naiyana Sahavechaphan, Kamron Aroonrua, Sornthep Vannrat and Asanee Kawtrakul, 2008. Applying Information Grid for Agriculture Applications. AFITA, Japan.

Nishioka, S. and Imaichi, O., 2000. An Associative Search System based on a Generic Engine for Transposable Association (GETA). IPSJ SIG Notes, Vol. 2000, No. 53 (20000601), pp. 93.