Case-Based Reasoning: A Knowledge Extraction Tool to Use

Heba Ayeldeen, Olfat Shaker, Osman Hegazy and Aboul Ella Hassanien

Abstract Case-based reasoning (CBR) is a relative newcomer to AI and is commonly described as both an AI and a knowledge management (KM) technology. CBR is better regarded as a methodology than as a technology. Finding similarities between objects and extracting knowledge can be complicated tasks for decision makers and executive managers. Learning from previous failures and successes saves considerable time in understanding problems and visualizing data. As a process, CBR is one of the most widely used methods for knowledge capture and data understanding. In this paper we show mathematically how CBR can be used to cluster documents and to find correlations in medical data, using CBR with database (DB) technology as an application. The results show an improvement over human assessment and over approaches that do not use CBR methods.



Keywords Case-based reasoning · Nearest neighbour · Rough sets algorithm · Problem-solving · Knowledge discovery

1 Introduction

The field of AI is widely applied to aid problem solving and to deal with uncertain information. By the late 1980s and 1990s, AI research had also developed highly successful methods for dealing with uncertain or incomplete information [1].

H. Ayeldeen (✉) · A.E. Hassanien
Scientific Research Group in Egypt (SRGE), Cairo, Egypt
e-mail: [email protected]
A.E. Hassanien
e-mail: [email protected]

H. Ayeldeen · O. Hegazy · A.E. Hassanien
Faculty of Computers and Information, Cairo University, Cairo, Egypt
e-mail: [email protected]

O. Shaker
Department of Molecular Biology, Faculty of Medicine, Cairo University, Cairo, Egypt
e-mail: [email protected]

© Springer India 2015
J.K. Mandal et al. (eds.), Information Systems Design and Intelligent Applications, Advances in Intelligent Systems and Computing 339, DOI 10.1007/978-81-322-2250-7_37


Many subfields of AI have emerged that deal with the construction and study of systems that can learn from data rather than follow only explicitly programmed instructions. With techniques from cognitive science, machine learning, support vector machines and others, the problem of finding similarity between objects becomes much easier to handle as environments and technologies change. Machine learning focuses on prediction based on known properties learned from the training data [2–5]. Compared with other AI approaches, CBR has several advantages; one of its main added values is that the training time for solution selection is reduced. Case-based reasoning (CBR) is a problem-solving methodology that addresses a new problem by first retrieving a similar, already solved past case and then reusing that case to solve the current problem. CBR is one of the most used and successful machine learning methodologies, and it makes the most of a knowledge-rich representation of the different application domains [6–8].

This paper discusses three different CBR applications and shows that CBR describes a methodology for problem solving. Section 2 outlines the knowledge discovery process. Section 3 identifies the main characteristics of the life-cycle of a CBR system. Sections 4 and 5 discuss CBR using nearest neighbour and CBR using database technology. Section 6 presents a case study on the use of DB technology to find the similarity between cases. Section 7 interprets and discusses the results. Section 8 gives the conclusion and future work.

2 Knowledge Discovery

As part of knowledge management, knowledge discovery is the process of searching through large volumes of information and extracting subsets of information relevant to a given task. The knowledge discovery process consists of several steps, as shown in the flowchart in Fig. 1 [9].

1. Understanding the target domain. Knowledge relevant to a domain must be acquired. Knowledge discovery goals that benefit the user of a system should be identified.
2. Selecting a dataset. A variety of datasets may contain useful domain knowledge, and the datasets thought to contain the best information should be selected.
3. Data cleaning. Many datasets include noise. The data cleaning stage is concerned with removing this noise and accounting for missing information.
4. Choosing a data mining algorithm. The knowledge discovery goals from step (1) should be matched with an appropriate data mining strategy.
5. Data mining. In this step the datasets chosen in step (2) are searched for patterns of interest.


Fig. 1 Knowledge discovery process

6. Interpreting results. This may involve visualization of extracted patterns or a return to any of the previous steps if problems are discovered.
7. Acting on discovered knowledge. This may involve reporting knowledge discovered to a user, incorporating the knowledge with a local knowledge base, or checking for inconsistencies with the discovered knowledge and previously acquired knowledge.

This process is interactive and may be repeated any number of times until useful knowledge has been discovered.

3 Case-Based Reasoning

The problem-solving life-cycle in a CBR system consists mainly of the following four parts: Retrieve similar cases from the previous ones; Reuse the old cases to find the solution needed; Revise the retrieved solutions/cases, possibly with some adaptation; and Retain the solution after testing and validation [6] (Fig. 2).

Fig. 2 Problem solving life-cycle in CBR system

Case representation is one of the main tasks within CBR processes [6, 9, 10]. Case indexing refers to assigning indexes to previous cases to ease retrieval and modification of the cases. These indexes should preferably reflect the important features of the cases, such as the main attributes affecting the success or failure of the case, and describe the circumstances in which a case is expected to be retrieved in the future [6, 9]. It is also important to include in the case description a measure of the success or failure of the case, stating the different levels of success or failure. The case then holds the complete knowledge required for it [6].

Reusing cases and learning from previous experience reduces the effort and time taken to solve a new problem. Different case selection techniques can be used to retain the best or optimum solution. Depending on the problem at hand and the circumstances around it, the previous cases closest to what is needed are adapted. So rather than spending a great deal of time deriving the solution, one makes use of the previous solutions held in the case warehouse [6, 11, 12].

Before the best or near-optimum solution(s) are retained, testing and validating them with the user or the decision maker is a very important task. The case is retained according to the outcome of this test [9, 13]. Finally, it can be concluded that case-based reasoning (CBR) is the process of solving new problems based on the solutions of similar past problems [10, 11].
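The four-part life-cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the system built in this paper: the names `Case`, `CaseBase` and `solve`, the attribute-overlap retrieval rule and the `revise` callback (standing in for testing and validation with the decision maker) are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    problem: dict          # attribute name -> value (acts as the case index)
    solution: str
    success: bool = True   # outcome recorded as part of the case description


@dataclass
class CaseBase:
    cases: list = field(default_factory=list)

    def retrieve(self, problem):
        """Return the stored case sharing the most attribute values with `problem`.

        Assumes the case base is not empty.
        """
        def overlap(case):
            return sum(1 for k, v in problem.items() if case.problem.get(k) == v)
        return max(self.cases, key=overlap)

    def retain(self, problem, solution, success):
        """Store the validated case so that it can be retrieved in the future."""
        self.cases.append(Case(dict(problem), solution, success))


def solve(case_base, problem, revise):
    """Run one retrieve / reuse / revise / retain cycle and return the solution."""
    past = case_base.retrieve(problem)              # Retrieve
    proposed = past.solution                        # Reuse
    solution, success = revise(problem, proposed)   # Revise (test and validate)
    case_base.retain(problem, solution, success)    # Retain
    return solution
```

In this sketch `revise` is any function that takes the problem and the proposed solution and returns the (possibly adapted) solution together with a success flag, so the validation step stays under the control of the user.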

4 CBR Using Nearest Neighbour

As mentioned before, various CBR techniques are used to find the similarities and the correlation between patterns. Nearest neighbour techniques are the most widely used in CBR, since they are provided by the majority of CBR tools. One example is the Wayland system, a CBR system that was implemented using a CBR tool called Caspian [1, 6]. In the nearest neighbour algorithm, the similarity of the problem case to a case in the case-library is determined for each case attribute. This measure may be multiplied by a weighting factor. The sum of the similarities of all attributes is then calculated to provide a measure of the similarity of that case in the library to the target case [1, 6]. This can be represented by Eq. 1:

$$ \mathrm{Similarity}(T, S) = \sum_{i=1}^{n} f(T_i, S_i) \times w_i \qquad (1) $$

where T is the target case, S is the source case, n is the number of attributes in each case, i is an individual attribute from 1 to n, f is a similarity function for attribute i in cases T and S, and w_i is the importance weighting of attribute i. This calculation is repeated for every case in the case-library to rank cases by similarity to the target.
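As an illustration of Eq. 1, the following Python sketch computes the weighted similarity and ranks a small case-library against a target. The local similarity functions `exact` and `numeric`, the attribute names and the weights are assumptions chosen for the example; Eq. 1 itself does not prescribe any particular f.

```python
def exact(a, b):
    """Local similarity for symbolic attributes: 1 on a match, 0 otherwise."""
    return 1.0 if a == b else 0.0


def numeric(value_range):
    """Build a local similarity for numeric attributes scaled by an assumed range."""
    def f(a, b):
        return 1.0 - abs(a - b) / value_range
    return f


def similarity(target, source, weights, sim_funcs):
    """Eq. 1: sum over all attributes i of f(T_i, S_i) * w_i."""
    return sum(sim_funcs[attr](target[attr], source[attr]) * w
               for attr, w in weights.items())


def rank_cases(target, library, weights, sim_funcs):
    """Repeat the calculation for every case and sort by decreasing similarity."""
    scored = [(similarity(target, case, weights, sim_funcs), case) for case in library]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical attributes and weights, for illustration only.
weights = {"age": 1.0, "family_history": 2.0}
sim_funcs = {"age": numeric(50), "family_history": exact}
library = [
    {"age": 40, "family_history": "yes"},
    {"age": 50, "family_history": "no"},
]
target = {"age": 50, "family_history": "yes"}
ranking = rank_cases(target, library, weights, sim_funcs)
```

With these assumed weights the first library case ranks highest, because the match on the heavily weighted attribute outweighs the closer age of the second case. Changing the weights changes the ranking, which is why their choice matters in practice.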

5 CBR Using Database Technology

In its simplest form, CBR can be implemented using database technology. Databases are an efficient means of storing and retrieving large volumes of data. Case-based systems make their decisions based on experiences of past situations. They try to acquire relevant knowledge of past cases and previously used solutions to propose solutions for new problems. If a direct match for an open problem is found in the database, the solution from the matching case is returned. Otherwise, the case that most closely matches the open problem is chosen, and the solution applied to that past case is adapted for the open problem. Case-based systems are well suited for learning correlation patterns [1, 14].

The problem with using database technology for CBR is that databases retrieve using exact matches to the queries. This is commonly augmented by using wildcards, such as a pattern for MAN that matches both WOMAN and MANKIND. The use of wildcards, Boolean terms and other operators within queries may make a query more general, and thus more likely to retrieve a suitable case, but it is not a measure of similarity. However, it is possible to use SQL queries to measure similarity [1, 14].
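One way to measure similarity with an SQL query is to compute a weighted score inside the SELECT clause and order by it. The sketch below uses Python's built-in `sqlite3` module so that it is self-contained; the `cases` table, its columns, the weights and the age range of 50 are hypothetical and are not the schema used in the case study.

```python
import sqlite3


def build_demo_library():
    """Create an in-memory case library with two hypothetical cases."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cases (id INTEGER PRIMARY KEY, age INTEGER, "
        "family_history TEXT, solution TEXT)"
    )
    conn.executemany(
        "INSERT INTO cases (age, family_history, solution) VALUES (?, ?, ?)",
        [(40, "yes", "solution A"), (50, "no", "solution B")],
    )
    return conn


def most_similar(conn, age, family_history):
    """Return (id, solution, score) of the case with the highest weighted score."""
    query = """
        SELECT id, solution,
               1.0 * (1.0 - ABS(age - ?) / 50.0)   -- numeric closeness, weight 1
             + 2.0 * (family_history = ?)          -- exact match (0 or 1), weight 2
               AS score
        FROM cases
        ORDER BY score DESC
        LIMIT 1
    """
    return conn.execute(query, (age, family_history)).fetchone()


conn = build_demo_library()
best = most_similar(conn, 50, "yes")
```

Unlike a wildcard query, this returns the closest case even when no stored case matches the open problem exactly, at the cost of scanning and scoring every row.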

6 Applying DB Technology to Find Similarity Between Objects: Case Study

It is not easy to find the similarity between objects and the correlation between them. We are mainly concerned with medical data on breast cancer patients in Egypt. The first step is to use DB technology to build the case library. The structure of the data is as follows: patient number, patient type (control, fibroadenoma or cancer patient), age, family history, diabetes, hypertension, and protein parameters (LAPTM-4B, OPG, RANKL and YKL-40). These proteins can be used to identify whether a patient has breast cancer or is a healthy control.

6.1 Subjects and Methods

Subjects

All the breast cancer patients involved in the study were diagnosed at the Department of Biochemistry and Molecular Biology of Kasr Alainy Hospital of Cairo University. To detect the relationship between LAPTM4B polymorphism and breast cancer vulnerability, one hundred three breast cancer patients and eighty cancer-free healthy controls, recruited from patients undergoing annual physical examination at Kasr Alainy Hospital of Cairo University, were investigated. To analyse the association between LAPTM4B gene polymorphism and OPG gene protein, a long-term clinical follow-up was carried out. For all participants in this study, written informed consent was obtained as delineated by the protocol, which was approved by the Ethical Committee of Cairo University. The studied subjects were divided into three groups as follows:

Group I: (n = 80) healthy females as a control group.
Group II: (n = 40) patients with fibroadenoma.
Group III: (n = 88) patients with breast carcinoma, classified according to the TNM grading system into 11 cases in stage II, 57 cases in stage III and 20 cases in stage IV. This group included 68 non-metastatic breast cancer patients and 20 metastatic subjects.

The inclusion criteria were adult females aged 20–70 years with no previous treatment with chemotherapy or radiotherapy. The exclusion criteria were age below 20 or above 70 years, previous treatment with chemotherapy or radiotherapy, and other malignancy. Written consent forms were signed by all participants in this study, including the controls. This study was also approved by the ethical committee of Kasr Alainy Hospital, Cairo University. All cases were subjected to estimation of the LAPTM4B protein level in serum. The fibroadenoma and carcinoma biopsies were examined histopathologically. From each subject, a blood sample was taken and the serum was separated and used to estimate the LAPTM4B level by the ELISA technique.

Methods

A detailed history was taken, considering the course of illness, age of onset of the disease, mode of presentation and family history of cancer. The protein was extracted from whole blood serum of both patients and the control group with the QIAamp DNA mini kit (Qiagen, Hilden, Germany), following the manufacturer's instructions.

6.2 Tools and Statistical Analysis

All statistical analyses in our study were carried out with WEKA 3.7.9 (WEKA, The University of Waikato). Genotypic frequencies were tested for correlation using the P-value test. PHP 5.5 was used for the presentation layer, and with MySQL we built a case warehouse for breast cancer patients.

7 Interpreting Results

The time spent searching and finding the symmetry of different sets, articles, documents and relations is considered the domain problem for organizations. We take advantage of KM in our case study, where all data is stored and indexed using DB technology within case-based tools. The benefit of this CBR tool is that it helps decision makers and executive managers analyse the data and take the right decision at the right time. Tables 1 and 2 show samples of the data used within the case library. After DB technology was used for storage and indexing, PHP was used to build the interface of the CBR system. The figures show the interface of the CBR tool we built, in which the user can easily move from one step to another.

Table 1 Sample data and protein analysis for cancer-free healthy controls

Index   Age   Menst. H   LAPTM4B (pg/ml)   OPG
CC1     44    Pre        202.3             12.5
CC2     45    Pre        613.2             10.6
CC3     47    Pre        646.7             22.1
CC4     44    Pre        613.2             17.4
CC5     53    Post       710               9.7

Table 2 Sample data and protein analysis for cancer patients

Index   Age   Family. H   LAPTM4B (pg/ml)   OPG
CAN1    52    Yes         518.7             106.5
CAN2    62    Yes         749.8             111.048
CAN3    62    No          829.4             70.068
CAN4    42    Yes         1,033.9           47.976
CAN5    45    Yes         2103              80.25

Using DB technology saves time in case retrieval. For instance, to retrieve all healthy patients:

    SELECT * FROM `dataset` WHERE `PtType` = 0 ORDER BY `dataset`.`PtType` DESC

As discussed in Sect. 5, by using patterns, wildcards and table joins we can make the query more complex to retrieve the targeted case:

    SELECT pttype.PtType, Age, hypertension, Diabetes, FamHis, LAPTM4B, RANKL, OPG, YKL40
    FROM dataset, pttype
    WHERE pttype.ptid = dataset.pttype AND dataset.PtType = 0

Finding correlations between parameters is a good indicator for accuracy. Equation 2 is used to measure the correlation between the proteins used in the data set:

$$ r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{S_x} \right) \left( \frac{y - \bar{y}}{S_y} \right) \qquad (2) $$

where n is the number of cases, \bar{x} and \bar{y} are the sample means of the two parameters, and S_x and S_y are their sample standard deviations.
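Equation 2 is the sample correlation coefficient, and it can be evaluated directly as in the Python sketch below. The function name `pearson_r` is an assumption made for this example, and the sketch does not reproduce the WEKA analysis used in the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Eq. 2: r = 1/(n-1) * sum(((x - mean_x)/S_x) * ((y - mean_y)/S_y))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    # A constant sample gives a zero deviation and raises ZeroDivisionError below.
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)
```

Applied to two columns of the case library, such as the LAPTM4B and OPG values, the function returns a value between -1 and 1, where values near either extreme indicate a strong linear relationship.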

After different query selections and joins, the results show a correlation between the two proteins LAPTM-4B and OPG. When a new case is added, a LAPTM-4B value ≤710 indicates with 98.1 % probability that the case is a cancer patient and, taking the correlation with OPG into account, an OPG value ≥23.96 indicates that the patient is probably healthy.


8 Conclusion and Future Work

No one can deny the importance of databases and database management systems to organizations. Database technologies aid in data definition as well as data administration, where users are monitored and data integrity is maintained. Databases are used to support the internal operations of organizations. Database technology has been used in various applications, where a DB can hold administrative information as well as more specialized data. From the study presented in this paper, it can be concluded that using CBR techniques together with DB technology saves much time in searching for solutions, with a 98.1 % increase in comparison to human assessment and to not using CBR applications. Many other CBR applications may be used in the future to show the advantages of CBR methodologies. Rough sets and nearest neighbour methods are in the future plans, as well as similarity measures to find correlation and asymmetry between objects.

References

1. Watson, I.: Case-based reasoning is a methodology not a technology. J. Knowl. Based Syst. 12, 303–308 (1999)
2. Zawbaa, H.M., El-Bendary, N., Hassanien, A.E., Kim, T.-H.: Machine learning-based soccer video summarization system, Part II. CCIS, vol. 263, pp. 19–28. Springer, Heidelberg (2011)
3. Kolodner, J.L.: An introduction to case-based reasoning. Artif. Intell. Rev. 6, 3–34 (1992)
4. Ghany, K.K.A., Hassanien, A.E., Schaefer, G.: Similarity measures for fingerprint matching. In: International Conference on Image Processing, Computer Vision, and Pattern Recognition, pp. 21–24, USA (2014)
5. Mouhamed, M.R., Zawbaa, H.M., Al-Shammari, E.T., Hassanien, A.E., Snasel, V.: Blind watermark approach for map authentication using support vector machine. In: Advances in Security of Information and Communication Networks, pp. 84–97. Springer, Berlin (2013)
6. Pal, S.K., Shiu, S.C.K.: Foundations of Soft Case-Based Reasoning (2004)
7. Jonassen, D.H., Hernandez-Serrano, J.: Case-based reasoning and instructional design: Using stories to support problem solving. Education Tech. Research Dev. 2(50), 65–77 (2002)
8. Davies, J., Goel, A.K.: Visual case-based reasoning II: transfer and adaptation. In: Proceedings of the 1st Indian International Conference on Artificial Intelligence, pp. 377–382. Springer, Berlin (2003)
9. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: Knowledge discovery and data mining: towards a unifying framework. In: Proceedings of KDD'96, Second International Conference on Knowledge Discovery and Data Mining, pp. 82–88 (1996)
10. Park, M.K., Lee, I., Shon, K.M.: Using case based reasoning for problem solving in a complex production process. Expert Syst. Appl. 15, 69–75 (1998)
11. Grupe, F.H., Urwiler, R., Ramarapu, N.K., Owrang, M.: The application of case-based reasoning to the software development process. Inf. Softw. Technol. 40, 493–499 (1998)
12. Rezvan, M.T., Zeinal Hamadani, A., Shalbafzadeh, A.: Case-based reasoning for classification in the mixed data sets employing the compound distance methods. Eng. Appl. Artif. Intell. 9(26), 2001–2009 (2013)
13. Singh, S.K.: Database Systems: Concepts, Designs and Applications. Pearson Education India, New Delhi (2006)
14. Lopez De Mantaras, R., Mcsherry, D., et al.: Retrieval, reuse, revision and retention in case-based reasoning. Knowl. Eng. Rev. 3(20), 215–240 (2005)
