Knowledge Specification for Versatile Hybrid Intelligent Systems

Daniel NEAGU
Department of Computing, School of Informatics, University of Bradford
Richmond Str., BD7 1DP, United Kingdom
e-mail: [email protected]

Marian Viorel CRACIUN
Department of Computer Science and Engineering, University “Dunarea de Jos” of Galati
Domneasca 47, Galati, Romania
e-mail: [email protected]

Qasim CHAUDHRY
Central Science Laboratory
Sand Hutton, York YO41 1LZ, United Kingdom
e-mail: [email protected]

Abstract: The increasing amount and complexity of data used in predictive data mining call for new and flexible approaches based on hybrid intelligent methods to mine the data. This paper proposes a formal description for integrated data structures of Hybrid Intelligent Systems and the specification language HISML based on the open XML standard, introduced to fill the gap between simple soft computing models and complex models based on explicit and implicit knowledge represented as modular structures. First results of applying the proposed specification language in predictive toxicology using clusters and original modular hybrid intelligent models are also presented.

Keywords: knowledge representation, hybrid intelligent systems, XML schema, predictive toxicology.

1. Introduction

The research effort to model and process Predictive Data Mining (PDM) knowledge requires more flexible standards. Any modeling approach in PDM requires a stepwise integration between data collection and model development: data pre-processing and filtering; feature evaluation and selection; model creation and evaluation; knowledge extraction; and adaptation to new data. Many of these components are regularly reviewed and updated. Suitable data structures are needed for easy access, homogeneous updates, data dissemination and integration of soft computing models.

This paper explores knowledge representation issues in the integration of soft computing structures, in order to propose and validate robust models to process, mine and predict data. The models are based on an original integrated platform for Hybrid Intelligent Systems (HIS), Modular Neural Networks and Multi-Classifier Systems [7]. Experts can therefore add experimental results and their own expertise, and general users can retrieve data and knowledge that provide clues to improve existing predictive models. To this end, experimental data and human expertise are integrated in ensembles of connectionist and fuzzy inference systems, selecting multiple versatile predictive models to produce a single composite model that generally exhibits greater predictive power and generality.

PDM techniques search for strong patterns in data to support accurate further decisions [10]. Many Statistics and Machine Learning algorithms are used in PDM [10]: Linear Regression, Principal Component Analysis, Logistic Regression, Discriminant Analysis, k-Nearest Neighbors, Rule Induction, Decision Trees, Neural Networks, Fuzzy Inference Systems, etc. PDM requires an integrated knowledge representation approach to provide a translational bridge between the various software tools and file formats in current use, and to propose a valid model to manage knowledge using common means of broad access (including distributed web databases).


RASC2004

The structure of the paper is as follows. Section 2 reviews existing XML specifications for standard data representation in AI. Section 3 proposes a formal description of an original HIS that integrates knowledge represented by two main concepts [7], implicit knowledge and explicit knowledge, and introduces HISML, an XML schema developed for Hybrid Intelligent Systems. First results of applying HISML in Predictive Toxicology are reviewed in section 4. The paper ends with conclusions and directions for further research.

2. XML Schemas for Artificial Intelligence

Data and knowledge representation for AI applications is complicated by the variety of software tools and packages used for model development and processing. XML is a framework for the general, standard specification of data contents and structure. It opens possibilities to develop domain dictionaries, robust data processing and validation by metadata definition.

There have been several attempts so far to use XML for AI knowledge, rules and data representation. The input and output, and even the rules, of an AI application can be handled as XML files, considerably reducing the time and effort spent on building conversion procedures [9]. The Universal Rule Markup Language (URML¹) aims to promote standards for rule markup using XML. The Knowledge Query and Manipulation Language (KQML²), an original communication language and protocol for knowledge exchange between intelligent information agents, offers an abstract level for the definition of distributed AI systems. KQML can be used as a language for an application program to interact with an intelligent system, as well as for two or more intelligent systems to share knowledge for cooperative problem solving [3]. The Formal Language for Business Communication (FLBC³), a competitor to KQML, is an XML-based formal language for automated electronic communication. The DARPA Agent Markup Language (DAML⁴), developed as an extension to XML and the Resource Description Framework (RDF), provides a basic infrastructure that allows a machine to make the same kind of simple inferences that human beings do. The Case Based Markup Language (CBML) is an XML application for data represented as cases, which makes knowledge and data markup readily reusable by intelligent agents [4]. Another effort in this direction is the Artificial Intelligence Markup Language (AIML⁵), an XML-based language used in ALICE, a chat-bot. This markup language offers a simple yet specialized open-source representation alternative for conversational agents. The Predictive Model Markup Language (PMML⁶) is a language proposed to describe Statistical, Machine Learning and Data Mining models: the inputs to data mining models, the transformations used to prepare data for data mining, and the model parameters.

3. A Formal Description of Hybrid Intelligent Systems

The last ten years have produced a tremendous amount of research on fuzzy logic and connectionist systems. The two approaches can be used in a complementary way, with HIS combining their features (Figure 1). In such systems, the learner can insert fuzzy rules into neural networks. Training examples are used to refine initial knowledge or additional structures. Finally, the system processes the output for given instances and, using specific methods [1], extracts symbolic information from trained networks.

We define implicit knowledge as the connectionist representation of learning data. An explicit knowledge module has the role of adjusting the performance of implicit knowledge modules by using external information provided by experts, in the form of Fuzzy Rule-based Systems. In our approach, connectionist integration of explicit and implicit knowledge appears to be a natural solution for developing homogeneous intelligent systems [7]. Explicit and implicit rules are represented using MLP (Multi-Layer Perceptron) Crisp Neural Networks (CNN) [8] and neuro-fuzzy or fuzzy (FNN) neural nets [2]. Thus, fuzzy logic provides the inference mechanism under cognitive uncertainty, while neural nets offer the advantages of learning, adaptation, fault-tolerance, parallelism and generalization.

¹ URML: http://home.comcast.net/~stabet/urml.html
² KQML: http://www.cs.umbc.edu/kqml/
³ FLBC: http://www.oasis-open.org/cover/flbc.html
⁴ DAML: http://www.daml.org/
⁵ AIML: http://www.oasis-open.org/cover/aiml-ALICE.html
⁶ PMML: http://www.dmg.org/index.html

Figure 1. The architecture of an integrated implicit and explicit knowledge-based system

3.1. Implicit and Explicit Knowledge-based Intelligent Systems

The HIS considered in this paper is a multi-input single-output neuro-fuzzy system (MISO). Its general goal is to model combinations of data and expert information to the corresponding output:

$$\Phi : D \subseteq \mathbb{R}^n \to \mathbb{R} \tag{1}$$

where $n \in \mathbb{N}$ is the number of inputs for the application domain. This leads to the following steps in a fuzzy neural computational process [6]: (a) development of individual knowledge-based connectionist models, (b) modeling synaptic connections of individual models, to incorporate fuzziness into modules, (c) adjusting the ensemble voting algorithm (Figure 1).

Let us consider a MISO HIS with n inputs. Let $U = \prod_{i=1}^{n+1} D_i$ be the universe of discourse over the application domain, as the Cartesian product of the sets $D_i$, i=1..n+1, for the input variables $X_i \in D_i$, i=1..n, and the output $Y \in D_{n+1}$. A HIS integrated model of the problem Φ, based on implicit (IKM) and explicit knowledge (EKM) modules, is a good approximation of Φ if:

$$HIS = \Big\{ M_j \;\Big/\; \forall \varepsilon > 0,\ \exists X \in \prod_{i=1}^{n} D_i,\ \forall Y = \Phi(X) : \left| M_j(X) - Y \right| < \varepsilon \Big\}_{j=1..m} \tag{2}$$

where the knowledge modules are functional models:

$$M_j : \prod_{i=1}^{n} D_{ij} \to D_{n+1,j} \tag{3}$$

The modules $M_j$ are, in our approach, either implicit or explicit knowledge models: $M_j \in \{M_{IKM\_CNN}, M_{IKM\_FNN}, M_{EKM\_Mamdani}, M_{EKM\_Sugeno}\}$. For any of these $M_j$ models, we can propose, following (1)-(3), a formal parameter-based description of HIS:

$$M_j = \langle \Theta, \Lambda, \Omega \rangle \tag{4}$$

where Θ is the set of topological parameters (i.e. number of layers, number of neurons on each layer, connection matrices, type and number of individual models and gating networks), Λ is the set of learning parameters (learning rate, momentum term, any early stopping attribute for implicit knowledge modules, but NIL for explicit knowledge modules) and Ω is the set of description parameters (type of fuzzy sets, parameters of membership functions associated to linguistic variables). Three distinctive cases to develop integrated HIS models are identified:

Case 1: $D_j = \prod_{i=1}^{n} D_{ij}$ for all j=1..m: a modular architecture [7] of experts on the whole input domain.

Case 2: $\bigcap_{j=1}^{m} \prod_{i=1}^{n} D_{ij} = \emptyset$ and $D_i \cap D_j = \emptyset$, for all i = 1,..,n and j = 1,..,m. The HIS model is a collection of m expert models on disjoint input domains; the system is a top-down integrated decomposition model, dividing the initial problem into separate, less complex sub-problems [7].

Case 3: $\bigcap_{j=1}^{m} \prod_{i=1}^{n} D_{ij} \neq \emptyset$: models built on overlapping sub-domains; further algorithms to refine the problem into case 1 or case 2 are required [7].

So far, several strategies to combine IKM and EKM in a global HIS have been proposed [7]: Fire Each Module (FEM), Unsupervised-trained Gating Network (UGN), Supervised-trained Gating Network (SGN), majority voting, etc. FEM is an adaptation of the Fire Each Rule method [2] to modular networks, in two versions: statistical combination of crisp outputs (FEMS) or fuzzy inference of linguistic outputs (FEMF). UGN proposes competitive aggregation of EKMs and IKMs, while SGN uses a supervised trained layer to recognize component outputs.
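As a rough illustration of the statistical combiners named above (FEMS-style combination of crisp outputs, and majority voting), the following Python sketch treats each knowledge module as a plain function from an input vector to a crisp output. The function names, the toy modules and the choice of the mean as the default statistic are assumptions made for this example only; the gating-network strategies (UGN, SGN) and the fuzzy variant (FEMF) are not covered.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Hashable, Sequence

# Hypothetical stand-in for a knowledge module (IKM or EKM):
# it maps an input vector to one crisp output.
Module = Callable[[Sequence[float]], float]


def combine_statistical(modules: Sequence[Module],
                        x: Sequence[float],
                        statistic: Callable[[Sequence[float]], float] = mean) -> float:
    """Fire every module on x, then reduce the crisp outputs with one statistic."""
    outputs = [module(x) for module in modules]
    return statistic(outputs)


def majority_vote(labels: Sequence[Hashable]) -> Hashable:
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# Toy modules, invented for the example.
toy_modules = [lambda x: sum(x), lambda x: 2.0 * x[0], lambda x: 0.0]

average_output = combine_statistical(toy_modules, [1.0, 2.0])
largest_output = combine_statistical(toy_modules, [1.0, 2.0], statistic=max)
voted_class = majority_vote(["toxic", "non-toxic", "toxic"])
```

Passing the statistic as a parameter keeps the average, max and min combiners in one function; a trained gating network would replace the fixed statistic with a learned weighting.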

3.2. An XML Schema for Hybrid Intelligent Systems

The aim of this section is to propose a standard XML application for HIS representation, data exchange and analysis: the Hybrid Intelligent Systems Markup Language (HISML). To our knowledge, this is the first attempt to propose a standard for integrated soft computing techniques (HIS). The HISML syntax captures the structure and parameters of modeling experiments. The information stored in a HISML document can then be used to analyze and replicate the models.

Figure 2: The Hybrid Intelligent System defined in HISML – the class diagram

The need for a standard format for data storage in HIS development, based on the HISML syntax, is justified by the variety of methods involved and by the aim of automating the design steps. One of the main features the HISML syntax considers is the recursive organization of nested modules. The seed of the common format to manipulate and store the data is captured by the concept of the ‘HISML’ element. Such an element has a required attribute 'version' and contains two sections: a header with authoring information and the section ‘HIS’, which defines recursive intelligent component systems.

A ‘HIS’ element has a ‘name’ and might be (Figure 2) a ‘SimpleHIS’ (the basic atoms IKM and EKM) or a ‘ComplexHIS’ (any other HIS, including simple ones). It also contains performance values. A ‘SimpleHIS’ element has a ‘name’ attribute and might be an IKM or an EKM module. An IKM (defined by its ‘name’) can be a CNN or an FNN [7]. An EKM is similarly defined by its ‘name’ and can be a rule-initialized CNN or a Fuzzy Inference System. A ‘ComplexHIS’ has at least two (simple or complex) HIS elements, and a Gating Module or a Combining Module (either a statistical or a fuzzy inference approach). A Gating Module can be a (supervised or unsupervised trained) CNN or a statistical combining algorithm (FEMS, majority voting, max, min, average, etc.). A Combining Module is a (supervised or unsupervised trained) ANN of any of the ComplexHIS types. The basic types are ANN and FIS. An ‘ANN’ comes with a ‘train’ attribute (has the net been trained?) and topological attributes (number of inputs, list of hidden layers, neuron data: activation functions, weights matrix to the current layer, training algorithm). A ‘FIS’ includes the essential information needed to identify fuzzy inference systems.
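To make the nesting described above more concrete, the following Python sketch builds a small HISML-like document with the standard library. The element names ‘HISML’, ‘HIS’, ‘SimpleHIS’, ‘ComplexHIS’, ‘IKM’, ‘EKM’, ‘CNN’ and ‘FIS’, and the attributes ‘version’, ‘name’ and ‘train’, are taken from the description; everything else (the ‘Header’, ‘Author’ and ‘GatingModule’ tags, the ‘method’ and ‘type’ attributes, all values, and the exact placement of ‘train’) is assumed for illustration, since the schema itself is not reproduced here.

```python
import xml.etree.ElementTree as ET


def simple_his(parent, his_name, kind, model_tag, **model_attrs):
    """Append HIS > SimpleHIS > (IKM or EKM) > model under parent; return the HIS element."""
    his = ET.SubElement(parent, "HIS", {"name": his_name})
    simple = ET.SubElement(his, "SimpleHIS", {"name": his_name + "-simple"})
    module = ET.SubElement(simple, kind, {"name": his_name + "-" + kind.lower()})
    ET.SubElement(module, model_tag, model_attrs)  # attribute names here are assumptions
    return his


def build_hisml_example():
    """Build a toy document: one ComplexHIS holding two SimpleHIS components."""
    root = ET.Element("HISML", {"version": "1.0"})  # 'version' is required; the value is invented
    header = ET.SubElement(root, "Header")          # tag name assumed
    ET.SubElement(header, "Author").text = "example author"  # hypothetical field

    top = ET.SubElement(root, "HIS", {"name": "toy-ensemble"})
    complex_his = ET.SubElement(top, "ComplexHIS")

    # A ComplexHIS needs at least two HIS components plus a gating or combining module.
    simple_his(complex_his, "expert-1", "IKM", "CNN", train="true")
    simple_his(complex_his, "expert-2", "EKM", "FIS", type="Mamdani")
    ET.SubElement(complex_his, "GatingModule", {"method": "FEMS"})  # tag and attribute assumed
    return root


hisml_root = build_hisml_example()
hisml_text = ET.tostring(hisml_root, encoding="unicode")
```

Because a ‘ComplexHIS’ contains ‘HIS’ elements, which may in turn be complex, the same helper can be applied at any depth of the hierarchy.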

4. A Case Study

Predictive Toxicology (PT) is a multi-disciplinary science that requires close collaboration between researchers from Toxicology, Chemistry, Biology, Statistics and Artificial Intelligence. PT aims to describe relationships between chemical structures and biological/toxicological processes: Structure–Activity Relationships (SAR). The main objective of PT is to assess toxicity in a virtual laboratory and to propose models that are as general as possible for large groups of compounds and endpoints. Toxic chemicals are routinely tested on animals as part of the registration process. Current UK pesticide registration requires the generation of LD50 (lethal dose for 50% of the initial population) data [5] for at least one avian species to determine the risks posed to wildlife. This section describes experiments with the HISML schema for PT models of insecticides, as an alternative to the use of birds in LD50 tests.

The dataset of LD50 values for 84 anticholinesterase pesticides (60 OPs and 24 carbamates) against mallard duck used in this study was obtained from the Canadian Wildlife Service. Over 1700 molecular descriptors were initially made available for each optimized structure. The entire dataset was divided into training and independent test sets. Training values were scaled to [-1,1] because of the transfer function of the hidden layer of the feed-forward back-propagation ANN (the hyperbolic tangent transfer function). The test set was scaled to the same interval using the min and max values of the training set descriptors. The experiment considered a HIS implementation on disjoint input domains (Case 2, section 3.1). Consequently, the data was split into 2 clusters before the development of the HIS (Figure 1) model Φ ($D_i \subseteq \mathbb{R}^n$, $D = D_1 \cup \dots \cup D_K$, $D_i \cap D_j = \emptyset$ for all i, j = 1,..,K with i ≠ j, and $f_i = f|_{D_i}$ are the continuous projections of the function f on the sub-domains $D_i$, for i = 1,..,K).
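The scaling step described above can be sketched in Python as follows. This is a minimal illustration, not the code used in the experiments: the function names and the toy values are invented, and mapping a constant descriptor to 0 is an assumption.

```python
def fit_min_max(train_rows):
    """Per-descriptor (column) minimum and maximum, taken from the training set only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale_rows(rows, mins, maxs):
    """Map each descriptor linearly so that training min -> -1 and training max -> +1."""
    scaled = []
    for row in rows:
        scaled.append([
            # A constant descriptor has no range; sending it to 0.0 is an assumption.
            0.0 if hi == lo else 2.0 * (value - lo) / (hi - lo) - 1.0
            for value, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


# Toy data: three training compounds and one test compound, two descriptors each.
train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[2.5, 35.0]]

mins, maxs = fit_min_max(train)
train_scaled = scale_rows(train, mins, maxs)
test_scaled = scale_rows(test, mins, maxs)
```

Because the test set reuses the training minima and maxima, a test value outside the training range (35.0 in the toy data) is mapped outside [-1,1].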
From the original dataset, descriptors with missing values, highly cross-correlated descriptors (R² in excess of 0.9) and non-variant descriptors (constant values for almost all chemical compounds) were discarded. A genetic search-based feature selection algorithm was applied to the training set to reduce the dimensionality of the input space, using a Weka⁷ implementation: about 88 chemical descriptors were selected for the model data set. A competitive network acting as a choosing module ‘CM’ was defined and trained to cluster the training data into 2 disjoint groups in the input space only (the space of molecular descriptors), based on the Euclidean distance as similarity measure. At this stage, information about output values is not used. The trained expert was able to group compounds together according to their chemical classes with 85.5% accuracy on the training set and about 75% accuracy on the external test set. These results support the approach of using mathematical descriptions of chemical compounds to find the chemical class they belong to [6].

⁷ Weka: http://www.cs.waikato.ac.nz/~ml/weka/
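The clustering step performed by the choosing module can be illustrated with a generic winner-take-all competitive learner. The paper does not give the learning rule or its settings, so the update rule, learning rate, number of epochs, initialization and toy data below are all assumptions, shown only to make the idea of Euclidean-distance clustering in the input space concrete.

```python
import math


def nearest(prototypes, x):
    """Index of the prototype closest to x in Euclidean distance (the winner)."""
    return min(range(len(prototypes)), key=lambda k: math.dist(prototypes[k], x))


def train_competitive(data, n_clusters=2, epochs=20, learning_rate=0.5):
    """Winner-take-all learning: only the winning prototype moves toward each sample."""
    # Naive initialization from the first samples; an assumption made for this sketch.
    prototypes = [list(sample) for sample in data[:n_clusters]]
    for _ in range(epochs):
        for x in data:
            winner = nearest(prototypes, x)
            prototypes[winner] = [
                p + learning_rate * (xi - p) for p, xi in zip(prototypes[winner], x)
            ]
    return prototypes


# Toy descriptor vectors forming two well-separated groups (invented values).
toy_data = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 9.0], [1.0, 0.0], [9.0, 10.0]]

toy_prototypes = train_competitive(toy_data)
toy_clusters = [nearest(toy_prototypes, x) for x in toy_data]
```

No output value enters the update, which mirrors the statement that the grouping is obtained in the descriptor space alone.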


Each cluster is modeled by a 3-layered feed-forward network with 2 hidden units (‘CNN’), a linear transfer function for the input and output layers, and a hyperbolic tangent function for the hidden layer. Each network was trained for 200 epochs using back-propagation based on gradient descent with momentum and an adaptive learning rate. Table 1 shows the values of the correlation coefficient (R), mean-squared error (MSE) and mean absolute error (MAE) for the training and test sets. The unsupervised and supervised combination (‘ComplexHIS’) performs well on the training set (R = 91.2%) but poorly on the test set (R = 66.2%). A possible explanation is the relatively small number of chemicals used in the experiment and data homogeneity, given the unequal distribution of chemicals into classes.

Table 1. Performances of disjunctive HIS models

Dataset   R(%)   MSE     MAE
TRAIN     91.2   0.049   0.138
TEST      66.2   0.126   0.289
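For reference, the three measures reported in Table 1 can be computed as in the Python sketch below. It assumes that R is the Pearson correlation coefficient between observed and predicted values, reported as a percentage; the function names and toy values are invented for the example.

```python
import math


def mse(observed, predicted):
    """Mean-squared error."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def pearson_r(observed, predicted):
    """Pearson correlation coefficient (assumed to be the R of Table 1, before scaling to %)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    covariance = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    spread_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return covariance / (spread_o * spread_p)


# Toy values, invented for the example.
toy_observed = [0.0, 1.0, 2.0]
toy_predicted = [0.0, 1.0, 4.0]

toy_mse = mse(toy_observed, toy_predicted)
toy_mae = mae(toy_observed, toy_predicted)
toy_r_percent = 100.0 * pearson_r(toy_observed, toy_predicted)
```

R measures only how well predictions follow the trend of the observations, so it can stay high while MSE and MAE grow; reporting all three, as Table 1 does, gives a fuller picture.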

5. Conclusions

Our approach to modeling predictive data mining knowledge, based on a modular combination of local experts, appears to propose better models. This makes our attempt challenging. One of the advantages of our approach is the flexibility of the developed models, which can be optimized for specific cases. In this paper, a formal description for integrated modules of Hybrid Intelligent Systems and the specification language HISML, based on the open XML standard, have been proposed. HISML can be considered a promising standard because of the expressiveness of its models for existing soft computing techniques, and because of its interconnection capabilities, flexibility and simplicity, which allow general use for HIS development. Future research steps are to test, update and (possibly) improve the overall representation of knowledge in HISML. Although it might be too early to speculate on its outcomes, we propose HISML to give researchers uniform access not only to the development of soft computing techniques, but also to their further integration and (even recursive) HIS combinations.

Acknowledgments: This work is partially funded by the EU FP5 project DEMETRA. The authors are grateful to P. Mineau, A. Baril, B.T. Collins, J. Duffe, G. Joerman, R. Luttik (Canadian Wildlife Service) for providing toxicity data, and to Nick Price and R. Watkins (CSL) for 3D molecular structures. The authors also acknowledge the support of the EPSRC project GR/T02508/01.

References

[1] J.M. Benitez, J.L. Castro, I. Requena (1997). Are ANNs Black Boxes? In IEEE Trans Neural Networks 8/5, pages 1157-1164.
[2] J.J. Buckley, Y. Hayashi (1995). Neural nets for fuzzy systems. In Fuzzy Sets and Sys, vol. 71, pages 265-276.
[3] T. Finin, Y. Labrou, J. Mayfield (1997). KQML as an agent communication language. In Software Agents, MIT Press, Cambridge.
[4] C. Hayes, P. Cunningham (1998). Distributed CBR using XML. In Procs Intell Sys.&Electr. Comm Int’l Workshop, Bremen.
[5] Government Statistical Service, The Stationery Office Limited, Norwich, Great Britain (1999). Home Office Statistics of Scientific Procedures on Living Animals.
[6] D. Neagu, G. Gini (2003). Neuro-Fuzzy Knowledge Integration applied in Toxicity Prediction. In Innovations in Knowledge Engineering, Advanced Knowledge Int’l, pages 311-342.
[7] D. Neagu, V. Palade (2002). Modular neuro-fuzzy networks used in explicit and implicit knowledge integration. In Procs 15th Int'l Conf. Florida AI Soc. - FLAIRS2002, pages 277-281.
[8] D. Rumelhart, J. McClelland (1998). Parallel Distributed Processing, MIT Press.
[9] S. Tabet, P. Bhogaraju, D. Ash (2000). Using XML as a Language Interface for AI Applications. In Procs. Int’l Conf. PRICAI 2000, LNCS 2112, Springer, pages 103-110.
[10] I.H. Witten, E. Frank (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java implementations, Morgan Kaufmann Publishers, San Francisco.
