MSF - A Semantically Based Memory for Forest Information Access via Web
E. J. Guerra-García1, Y. M. Fernández-Ordoñez1, R. C. Medina-Ramirez2 and J. Soria-Ruiz3
Abstract: Among the services provided by the Internet, intelligent access to documentary resources is necessary and is on the worldwide research agenda. To contribute to decision-making in many fields, information management systems should recognize and register the contents of resources for storage and processing purposes. In areas where information resources increase exponentially, it is necessary to develop systems that consider the contents of resources, whether they are stored locally or remotely. This paper describes an approach for managing heterogeneous information resource stores for the forestry domain in a memory based on semantic web concepts. We describe a pilot version of a semantic memory, MSF-1, and an already available prototype, SISFOR, which is being used to validate our proposal.
I. INTRODUCTION
THE most widely accepted defining feature of the Semantic Web, a proposed extension of the current web, is its machine-usable content. In its current state the semantic web is still considered a vision and not a reality, since machines need to know what the content of a resource is, i.e. its meaning, and also what to do with it in order to process queries [1]. One approach is to add structured meaning to information that is already published on the web in natural language. Structured meaning is amenable to use by a machine [2], in particular by a web browser, which could then provide meaningful results for queries instead of very long lists of links to irrelevant information resources. An objective, then, is to take advantage of information already held in documentary resources in web pages and to add semantic content to these resources. There is a relation between the semantic web notion and that of a corporate memory (CM), where all knowledge and information relevant to the business of an organization is given an explicit representation [3]. Both notions relate to storing heterogeneous and distributed information and share the same needs for information search. However, CMs are restricted to a specific context, infrastructure and set of constraints, namely those of the organization they serve. In this sense, if we restrict our attention to information resources of a specific knowledge domain that are to be shared by users
interested in that domain, the CM outlook and proposals can be very useful [4]. This paper describes a semantic approach to managing information in the forestry domain in Mexico. As is the case for information available on the web for most domains, the information resources related to forestry cover a wide spectrum, whether forests are seen as part of the natural environment, from a commercial exploitation point of view, or from other perspectives. In particular, forest management and use are specific to each country and are in line with national development policies. The information resources are therefore not only very numerous but also very heterogeneous and dispersed, since they originate from a variety of sources and are compiled or generated by individuals and organizations with diverse interests, even if all of them relate to the forestry sector. In Mexico there has been a long and haphazard history of land property regimes involving forest land. A vast number of documents exist concerning this aspect alone [5].
The initial steps in our approach take a diversity of forestry-related resources, incorporate them in a pilot version of a semantic memory, MSF-1, and then conceptually partition them into two categories: resources that provide information related to timber-yielding species and those related to tree species destined for other uses. Metadata were then created to describe the contents of resources in each of these categories. The pilot version is subjected to analyses and observations by application domain experts. A domain expert's justification for this partition is provided in the paper. In order to manage the contents of the pilot version MSF-1, a web prototype application, SISFOR, has been developed using existing technologies that have been proposed for the semantic web, such as XML, XML Schema (XSD) and the eXtensible Stylesheet Language (XSL), together with the PHP language (PHP: Hypertext Preprocessor). The paper is organized as follows.
Section 2 describes the approach to managing the information resources for the forestry domain. Section 3 reports on the design of the architecture of the information management prototype SISFOR-MSF. Section 4 contains our conclusions of the experience and points to future work.
II. INFORMATION RESOURCES MANAGEMENT
1 Colegio de Postgraduados, Carretera México-Texcoco Km. 36.5, Montecillo, Texcoco 56230, Estado de México. - www.colpos.mx
2 Universidad Autónoma Metropolitana, Unidad Iztapalapa, San Rafael Atlixco 186; 09340, México D.F. - www.uam.mx/
3 INIFAP, Carr. Toluca-Zitácuaro, 52107 Zinacantepec, Estado de México. - www.inifap.gob.mx
The approach to managing the information resources in the SISFOR-MSF system comprises five stages. SISFOR and MSF are acronyms in Spanish for “Sistema Semántico Forestal” and “Memoria Semántica Forestal”, respectively.
Stage 1- Review and selection of domain information providers. There are many information providers for the forestry domain, so a comprehensive review was carried out, mainly of official and academic sources. Several government agencies produce information themselves or annotate, comment on and contribute to the dissemination of information generated by third parties. Several agencies were selected as representative providers of web resources; among those selected are the following: the national forestry commission, CONAFOR (www.conafor.gob.mx); the ministry of agriculture, fisheries, livestock and rural development, SAGARPA (www.sagarpa.gob.mx); the statistics, informatics and geography institute, INEGI (www.inegi.gob.mx); the ministry of environment and natural resources, SEMARNAT (www.semarnat.gob.mx); the national research institute for forestry, agriculture and livestock, INIFAP (www.inifap.gob.mx); and the commission for the use and knowledge of biodiversity, CONABIO (www.conabio.gob.mx). Other sites included belong to state governments, research institutions and universities with activity in the forestry domain.
Stage 2- Identification and classification of information resources. For this project a domain expert was on call throughout to advise on several important decisions. He suggested an initial partition of the identified sites and information resources into those pertaining to timber-yielding tree species and those pertaining to species destined for other uses. The justification is summarized as follows. Forests are commonly classified by their location: tropical or temperate forest. However, many issues of interest to a large web user audience concern the exploitation of forest resources. The production of raw materials and deforestation for urbanization and agriculture are concerns that impact the economic and social welfare of a country, as well as the functioning of the natural environment. The rational exploitation of wooded areas is a widespread concern.
Thus the vast amount of information one can find on the web could be examined from the vantage point of the commercial use of tree species: for timber production and for other non-timber applications. This perspective is interesting for Mexico because, of the total area covered by temperate timberland, 21.6 million hectares have commercial potential, but only 8.6 million hectares are currently managed to this end. On the other hand, wooded areas for non-timber applications but with commercial relevance are found everywhere in the country. For example, 32% of the national woodland production is found in arid zones, from desert species such as yucca, manioc, ament and oregano among others.
Information resources for the MSF-1 were then assigned to one of eight categories, and metadata were defined to describe the contents of resources in these categories. In the classification of resources, common values of certain attributes were used to assign a resource to a category; for example, the type of source, which is an attribute of every category, was used. The categories are:
a. Electronic document types: published article, unpublished article, book, book chapter, bulletin, meeting memory (document in congress or conference proceedings), manual, thesis, technical report, collection and miscellaneous (for documents that are not of any of the above types).
b. Person type: academician, student, researcher, journalist, other.
c. Organization type: government, private, university, other.
d. Multimedia type: video, image, map.
e. Web site type: blog, wiki, discussion forum, gallery, portal.
f. Event type: workshop, conference, congress, short course, forum.
g. Mass media type: newspaper, interview, magazine, news.
h. Software type: system, program, algorithm.
The BibTeX tool, which is commonly used with the LaTeX document preparation system to format reference lists, served as a guide for defining the attributes of the electronic document category. BibTeX uses a text-based, style-independent file format to define lists of bibliographic items such as articles, theses, etc., which were included here in the first category [6].
Stage 3- Characterization of information resources. This task is guided by the contents of the resource (metadata). The information resources are then grouped according to sub-themes suggested by the domain expert. Each sub-theme thus has its associated information resources, and its own attributes (@) and elements are established via a conceptual schema (Figure 1). This allows for a better organization of the information in order to identify and retrieve it.
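The classification of Stage 2 and the attribute/element characterization of Stage 3 can be illustrated with a minimal sketch. It is written in Python rather than the PHP used by SISFOR, and the element name `resource`, the attribute names, the dictionary keys and the sample values are assumptions made for illustration; they are not taken from the actual MSF-1 schema.

```python
import xml.etree.ElementTree as ET

# The eight categories and their types, as listed in Stage 2.
# The dictionary keys are hypothetical identifiers chosen for this sketch.
CATEGORIES = {
    "electronic_document": [
        "published article", "unpublished article", "book", "book chapter",
        "bulletin", "meeting memory", "manual", "thesis", "technical report",
        "collection", "miscellaneous",
    ],
    "person": ["academician", "student", "researcher", "journalist", "other"],
    "organization": ["government", "private", "university", "other"],
    "multimedia": ["video", "image", "map"],
    "web_site": ["blog", "wiki", "discussion forum", "gallery", "portal"],
    "event": ["workshop", "conference", "congress", "short course", "forum"],
    "mass_media": ["newspaper", "interview", "magazine", "news"],
    "software": ["system", "program", "algorithm"],
}


def build_record(category, resource_type, theme, subtheme, fields):
    """Build one XML record: attributes carry the classification,
    child elements carry the descriptive metadata."""
    if resource_type not in CATEGORIES.get(category, []):
        raise ValueError(f"{resource_type!r} is not a type of {category!r}")
    record = ET.Element(
        "resource",
        {"category": category, "type": resource_type,
         "theme": theme, "subtheme": subtheme},
    )
    for name, value in fields.items():
        ET.SubElement(record, name).text = value
    return record


# Hypothetical example values, not taken from the MSF-1.
record = build_record(
    "electronic_document", "thesis", "timber", "Distribution in Mexico",
    {"title": "Example title", "author": "Example author", "year": "2011"},
)
print(ET.tostring(record, encoding="unicode"))
```

Keeping the classification in attributes and the descriptive metadata in child elements is only one possible reading of the conceptual schema; the actual split is the one fixed by the domain expert and the XSD.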
Figure 1. Conceptual schema
Stage 4- XML schema definition (XSD). Once the characterization by themes (timber/non-timber forest resources) and their sub-themes is validated by the expert, the XSD is produced (Figure 2). It describes precisely the structure of the XML documents and the restrictions on their contents, so that documents can be checked for validity against the established schema [7].
Stage 5- SISFOR validation and implementation. A user-friendly visualization tool is implemented so that the user can access the information in the MSF-1. It is used first by the domain expert who provided advice in this project. It is then offered to other experts to refine the contents and to obtain input for adding other functionalities to the system in the next version.
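The role the schema plays in Stage 4 can be sketched as a check for mandatory attributes and elements. The sketch below is a simplified stand-in written in Python; it is not XSD validation, since the Python standard library does not include an XSD validator, and the lists of mandatory names are assumptions rather than the contents of the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mandatory parts of an electronic document record.
MANDATORY_ATTRIBUTES = ("type",)
MANDATORY_ELEMENTS = ("title", "author", "year", "url")


def validation_errors(record):
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for attr in MANDATORY_ATTRIBUTES:
        if not record.get(attr):
            errors.append(f"missing attribute @{attr}")
    for name in MANDATORY_ELEMENTS:
        child = record.find(name)
        if child is None or not (child.text or "").strip():
            errors.append(f"missing element <{name}>")
    return errors


# Hypothetical records: one complete, one incomplete.
good = ET.fromstring(
    '<document type="thesis"><title>T</title><author>A</author>'
    '<year>2011</year><url>http://example.org/t.pdf</url></document>'
)
bad = ET.fromstring('<document><title>T</title></document>')

print(validation_errors(good))
print(validation_errors(bad))
```

A full XSD additionally constrains data types, ordering and cardinality, which this check does not attempt to reproduce.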
Users. SISFOR considers users in three main roles, each with assigned tasks and responsibilities.
Administrator. This role corresponds to the developer or expert in the implementation of the system, who is responsible for checking the quality of the metadata once the system is operational. The administrator has all the privileges needed to manage and maintain the MSF-1, as well as SISFOR itself.
Domain expert. This role corresponds to the forestry expert user, who has access to update the MSF-1 with a new species and to insert new information resources for a sub-theme.
General user. Any person visiting SISFOR looking for information resources of interest. The implementation of the system is transparent to this kind of user.
A. SISFOR Architecture
SISFOR has an interface, used in a web browser, to add items to the MSF-1 and to process queries; it also has a server and the information repository, namely the MSF-1 (Figure 3).
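The three roles introduced under Users can be summarized as a permission table. The following Python sketch is illustrative only: the action names are hypothetical labels for the tasks described in the text, and nothing here reflects how SISFOR actually enforces access.

```python
from enum import Enum


class Role(Enum):
    ADMINISTRATOR = "administrator"
    DOMAIN_EXPERT = "domain expert"
    GENERAL_USER = "general user"


# Hypothetical action names derived from the role descriptions in the text.
PERMISSIONS = {
    Role.ADMINISTRATOR: {
        "query", "insert_species", "insert_resource",
        "check_metadata", "maintain_system",
    },
    Role.DOMAIN_EXPERT: {"query", "insert_species", "insert_resource"},
    Role.GENERAL_USER: {"query"},
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS[role]


print(is_allowed(Role.DOMAIN_EXPERT, "insert_species"))
print(is_allowed(Role.GENERAL_USER, "insert_resource"))
```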
Figure 2. XSD Schema
Figure 3. SISFOR Architecture
III. SISFOR WEB
Implementation. A very important part of this web development is the construction of the MSF. XML documents corresponding to each of the eight categories of the information resource classification were created. XML, although a simple language, is strict and plays an important role in the exchange of a variety of data. It allows data to be read by a variety of applications, and serves as a means to organize and store information. It is an initial technology for a first approximation to the semantic web objectives. Although it resembles HTML, its main function is to describe data, not just to display them as HTML does. XML technology offers a set of modules which provide useful services for the most frequent demands of users [8]. The XML documents are validated against the XSD schema, which describes precisely the structure of the XML documents and the restrictions on their contents. The implementation has produced a working version of the MSF-1. A first version of SISFOR, namely the visualization module, is finished and is being tested, as explained in a previous paragraph.
The technique used to retrieve information from XML documents is based on the application programming interface of the Document Object Model (DOM), proposed by the W3C as a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents [9]. The XSLT processor transforms XML documents into XHTML documents. XSL style sheets are created which produce a representation of the XML document close to the final format, containing the appropriate text in the right order; look and formatting details such as font types, sizes and colors are then added using the Cascading Style Sheets language (CSS). When the user submits a query, the Apache server with PHP accepts the request and the items are looked up in the MSF-1. The server is in charge of transforming the query results from XML into XHTML format via the XSL style sheets. The validation of data stored in the MSF-1 is done via the XSD.
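The retrieval and transformation steps described above can be sketched as follows. This is a minimal illustration in Python using the standard `xml.dom.minidom` module, not the PHP code of SISFOR. The repository excerpt, its element and attribute names, and the helper functions `child_text`, `query` and `to_xhtml` are assumptions made for this sketch; in particular, `to_xhtml` is a hand-written stand-in for the XSL style sheet and does not run an XSLT processor.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape, quoteattr

# Hypothetical excerpt of the repository; all names and values are assumptions.
MSF_SAMPLE = """<resources>
  <resource category="electronic_document" type="thesis" subtheme="Distribution in Mexico">
    <title>Example thesis</title>
    <url>http://example.org/thesis.pdf</url>
  </resource>
  <resource category="multimedia" type="map" subtheme="Distribution in Mexico">
    <title>Example map</title>
    <url>http://example.org/map.png</url>
  </resource>
  <resource category="electronic_document" type="manual" subtheme="Uses">
    <title>Example manual</title>
    <url>http://example.org/manual.pdf</url>
  </resource>
</resources>"""


def child_text(node, tag):
    """Return the text content of the first <tag> element under node."""
    element = node.getElementsByTagName(tag)[0]
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def query(xml_text, subtheme):
    """Use the DOM interface to collect the resources of one sub-theme."""
    dom = minidom.parseString(xml_text)
    hits = []
    for node in dom.getElementsByTagName("resource"):
        if node.getAttribute("subtheme") == subtheme:
            hits.append({
                "category": node.getAttribute("category"),
                "title": child_text(node, "title"),
                "url": child_text(node, "url"),
            })
    return hits


def to_xhtml(hits):
    """Render query results as an XHTML list (stand-in for the XSL style sheet)."""
    items = "\n".join(
        f'  <li>{escape(hit["title"])} ({escape(hit["category"])}): '
        f'<a href={quoteattr(hit["url"])}>IR SITIO</a></li>'
        for hit in hits
    )
    return f"<ul>\n{items}\n</ul>"


results = query(MSF_SAMPLE, "Distribution in Mexico")
print(to_xhtml(results))
```

Separating the query from the rendering mirrors the division described in the text, where the DOM interface retrieves the items and the style sheets decide how they are presented.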
B. SISFOR Interface
SISFOR allows general user interaction through the two main themes, Timber Forest Resources (Recursos Forestales Maderables) and Non Timber Forest Resources (Recursos Forestales No Maderables), each of which has sub-themes and associated species. As an example, if a user needs to know what information is stored in the MSF-1 relating to the Distribution in Mexico of Timber Forest Species, SISFOR will provide a list grouping what is stored by information resource category, and will show the total in each category as well. In this case the result will show the attributes and elements that were designated as mandatory in the XSD for the electronic documents category (Figure 4). Should the user desire more information about a specific document, clicking the go to site link (IR SITIO in Spanish) leads to the corresponding URL, where the full document is shown and a download option opens.
Figure 4. Visualization and list of electronic documents related to the sub-theme Distribution in Mexico.
SISFOR also allows the visualization of the descriptive card of species which belong to the timber resources and non-timber resources. This shows the available information grouped in each of the eight categories, such as images, videos or maps under the multimedia category, or links under the web sites category for navigating to sites related to any of the established sub-themes. Geared to the expert user, the interface also allows the insertion of a new species into the MSF-1 (Figure 5).
Figure 5. Interface to insert a new species.
IV. CONCLUSION
The XML-based scheme implemented to manage the MSF-1 is not rigid; new metadata can be added for any of the categories, which allows for subsequent versions of the MSF-1. The current version can be accessed at http://www.cm.colpos.mx/sisfor/
One of the objectives of this continued development is to contribute to the application of the types of technologies already available which can be used to realize the initial benefits promised by the semantic web notion. Managing vast and heterogeneous information resources through their characterization in terms of contents and format is a requirement. Through a rather simple approach based on XML tags, structuring and exchanging information are not only possible but straightforward. The categorization of information resources and the involvement of domain experts in validating the proposal are steps towards the formalization of knowledge representations and domain ontologies in our research agenda. The contribution of our approach lies in the application domain, which is of relevance in Mexico.
REFERENCES
[1] M. Uschold, “Where are the Semantics on the Semantic Web”, AI Magazine, 24(3), Fall 2003, pp. 25-36.
[2] T. Berners-Lee, J. Hendler, and O. Lassila, “The semantic Web”, Scientific American, May 2001. Available at http://www.sciam.com/article.cfm?id=the-semantic-Web
[3] R. Dieng-Kuntz, “Corporate Semantic Webs”, ERCIM News, No. 51, 2002, pp. 19-21.
[4] A. M. Selvin and S. J. Buckingham, “Rapid Knowledge Construction: A Case Study in Corporate Contingency Planning Using Collaborative Hypermedia”, Knowledge and Process Management, Vol. 9, Issue 2, 2002, pp. 119-128.
[5] N. A. Pérez-Flores, “Análisis de la dinámica funcional del territorio en Santa Catarina del Monte, Texcoco, Edo. de México”, Doctor of Science Thesis, Colegio de Postgraduados, December 2011.
[6] Bibtex. Available at: http://tezcatl.fciencias.unam.mx/texarchive/info/spanish/guia-bibtex/guia-bibtex.pdf
[7] P. Walmsley, “Definitive XML Schema”, Prentice Hall, 2001.
[8] XML. World Wide Web. http://www.w3c.es/divulgacion/guiasbreves/tecnologiasxml
[9] Document Object Model (DOM). Available at: http://www.w3.org/DOM/
[10] Comisión Nacional Forestal, México Forestal, Electronic magazine http://www.mexicoforestal.gob.mx/
[11] H. M. Kim, A. Sengupta, “Extracting Knowledge from XML document repository: a semantic Web-based approach”, in press, Information Technology and Management, Feb 2007.