International Journal of Computer Engineering

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March - April (2013)
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 - 6367 (Print)
ISSN 0976 - 6375 (Online)
Volume 4, Issue 2, March - April (2013), pp. 508-516
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com

MINING LUNG CANCER DATA AND OTHER DISEASES DATA USING DATA MINING TECHNIQUES: A SURVEY

Parag Deoskar1, Dr. Divakar Singh2, Dr. Anju Singh3

1 M.Tech Scholar, CSE Deptt., BUIT, Barkatullah University, Bhopal
2 HOD of CSE Deptt., BUIT, Barkatullah University, Bhopal
3 Asstt. Prof. of CSE Deptt., BUIT, Barkatullah University, Bhopal

ABSTRACT

If you think about the dangerous diseases in the world, you will always list cancer as one. Lung cancer is one of the most dangerous cancer types in the world. The disease spreads through uncontrolled cell growth in tissues of the lung. Early detection can save lives and improve the survivability of patients. In this paper we survey several aspects of data mining that are used for lung cancer prediction. Data mining is useful in lung cancer classification. We also survey the aspects of the ant colony optimization (ACO) technique. Ant colony optimization helps in increasing or decreasing the disease prediction value. This study surveys assorted data mining and ant colony optimization techniques for appropriate rule generation and classification, which lead to accurate cancer classification. In addition, it provides a basic framework for further improvement in medical diagnosis.

Keywords: ACO, data mining, rule pruning, pheromone

1. INTRODUCTION

Lung cancer is a disease caused by uncontrolled cell growth in tissues of the lung. If the cancer is not treated at an early stage, this growth can spread beyond the lung, in a process called metastasis, into nearby tissue and, eventually, into other parts of the body. Most primary lung cancers are carcinomas that derive from epithelial cells. Common causes of lung cancer are tobacco and smoke. Lung cancer is the main cause of cancer death worldwide, and it is difficult to detect in its early stages because symptoms often appear only at advanced stages, sometimes in the last stage. Several research works suggest that the early detection of lung cancer will decrease the mortality rate.

Decision classification is the most important task for mining any data set. The classification problem is mainly concerned with assigning an object to a class on the basis of its parameters [1], [2]. Several decision tasks observed in engineering, medical, and management related science can be considered classification problems. Popular examples are pattern classification, speech recognition, character recognition, medical diagnosis and credit scoring.

But in our case classification alone is insufficient for classifying a lung cancer dataset. If we consider data mining for frequent pattern classification, then it is a better tool for separating relevant data from the raw dataset. The performance of association rules depends directly on frequent pattern mining, so it is the core problem of mining association rules [3]. As research on frequent itemset mining has developed and become more detailed, it has come to be widely used in the field of data mining, for example in mining association rules, correlation analysis, classification, clustering [4], support vector machines [5] and positive association rule classification [6].

The main aim of data mining is to extract important information from a huge amount of raw data. We emphasize mining lung cancer data to discover knowledge that is not only accurate, but also comprehensible for lung cancer detection [7], [8], [9]. Comprehensibility is important whenever discovered knowledge will be used for supporting a human decision. After all, if discovered knowledge is not comprehensible for a user, it will not be possible to interpret and validate the knowledge. So we can say that trust in the discovered rule knowledge is very important; in decision making, knowledge that cannot be interpreted and validated can lead to incorrect decisions.

We provide here an overview of medical data mining techniques. The rest of this paper is arranged as follows: Section 2 introduces medical data mining; Section 3 describes ant colony optimization; Section 4 describes related works; Section 5 discusses the theoretical analysis; Section 6 presents the conclusion.

2. MEDICAL DATA MINING

If we study the definition of the term data mining, we can say that data mining refers to extracting or "mining" knowledge from large amounts of data or databases [10]. The process of finding useful patterns or meaning in raw data has been called knowledge discovery in databases (KDD) [11]. KDD includes cleaning of inconsistent data. Data mining also provides pattern classification, visualization and rule separation. To understand the utility of data mining, it is helpful to categorize data mining techniques by their functionality, as below [12]:
1) Regression is a statistical methodology that is often used for numeric prediction.
2) Association returns affinities of a set of records.
3) Sequential pattern function searches for frequent subsequences in a sequence dataset, where a sequence records an ordering of events.
4) Summarization is to make a compact description for a subset of data.
5) Classification maps a data item into one of the predefined classes.
6) Clustering identifies a finite set of categories to describe the data.
7) Dependency modeling describes significant dependencies between variables.
8) Change and deviation detection is to discover the most significant changes in the data by using previously measured values.
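The association function in the list above (item 2) can be illustrated with a small sketch. The Python example below counts itemset support over a toy set of patient records and derives the confidence of one rule. The records, the attribute names, and the helper functions `support`, `confidence` and `frequent_itemsets` are hypothetical and chosen only for illustration; they are not taken from any dataset or algorithm in the surveyed papers, and the brute-force enumeration is only suitable for very small inputs.

```python
from itertools import combinations

# Hypothetical toy records: each record is the set of findings for one patient.
records = [
    {"smoker", "cough", "lung_cancer"},
    {"smoker", "cough"},
    {"smoker", "lung_cancer"},
    {"cough"},
    {"smoker", "cough", "lung_cancer"},
]

def support(itemset, records):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= r for r in records) / len(records)

def confidence(antecedent, consequent, records):
    """Estimate of P(consequent | antecedent) as support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, records) / support(antecedent, records)

def frequent_itemsets(records, min_support):
    """Brute-force search for all itemsets whose support is at least `min_support`."""
    items = sorted(set().union(*records))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, records)
            if s >= min_support:
                result[candidate] = s
    return result

frequent = frequent_itemsets(records, min_support=0.6)
rule_conf = confidence({"smoker"}, {"lung_cancer"}, records)
```

Here `frequent` maps each frequent itemset to its support, and `rule_conf` is the confidence of the hypothetical rule "smoker implies lung_cancer" on the toy data. Practical frequent pattern miners avoid enumerating every candidate itemset, which this sketch does not attempt.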

Medical diagnosis is very subjective because the clinical research and personal perception of the doctors matter in the diagnosis. A number of studies have shown that the diagnosis of one patient can differ significantly if the patient is examined by different doctors, or even by the same doctors at various times [13]. The idea of medical data mining is to extract hidden knowledge in the medical field using data mining techniques. It is possible to identify patterns even if we have not fully understood the causal mechanisms behind those patterns. Even patterns which are irrelevant can be discovered [14]. Clinical repositories containing large amounts of biological, clinical, and administrative data are increasingly becoming available as health care systems integrate patient information for research and utilization objectives [15]. Data mining techniques applied on these databases discover relationships and patterns which are helpful in studying the progression and the management of disease [16]. A typical clinical data mining research effort follows a cycle: structured data and narrative text, hypotheses, tabulated data and statistics, analysis and interpretation, new knowledge or more questions, outcomes and observations, and back to structured data and narrative text [17]. Prediction or diagnosis of a disease can be regarded as a kind of evaluation. For diseases like skin cancer, breast cancer or lung cancer, early detection is vital because it can help in saving a patient's life [18].

3. ANT COLONY OPTIMIZATION

The Ant Colony Optimization (ACO) algorithm is a meta-heuristic which combines a distributed environment, a positive feedback system, and a systematic greedy approach to find an optimal solution for combinatorial optimization problems.

The Ant Colony Optimization algorithm is mainly inspired by the experiments run by Goss et al. [19] using a group of real ants in a real environment. They studied and observed the behaviour of those real ants and suggested that the real ants were able to select the shortest path between their nest and a food resource in the presence of alternate paths between the two. This search for a food resource is possible through an indirect communication amongst the ants known as stigmergy. When ants are travelling to the food resources, they deposit a chemical substance, called pheromone, on the ground. When they arrive at a decision point, ants make a probabilistic choice, biased by the intensity of the pheromone they smell. This behaviour has an autocatalytic effect, because an ant choosing a path increases the probability that the corresponding path will be chosen again by other ants in the next move. When the ants return after finishing the search, the probability of choosing the same path is higher because of the increased pheromone quantity. The pheromone released on the chosen path thus marks the path for the other ants. In short, we can say that eventually all ants will select the shortest path.

Figure 1 shows the behaviour of ants in a double bridge experiment [20]. If we analyse this case, we observe that the shortest path is chosen because of the same pheromone laying mechanism. The first ants to arrive at the food source are those that took the two shortest branches. When these ants start their return trip, more pheromone is present on the short branch than on the long branch, so the probability of choosing the short branch is higher. This ant behaviour was first formulated and arranged as the Ant System (AS) by Dorigo et al. [21]. Based on the AS algorithm, the Ant Colony Optimization (ACO) algorithm was proposed [22].

In the ACO algorithm, the optimization problem is expressed as a graph G = (C, L), where C is the set of components of the problem, and L is the set of possible connections or transitions among the elements of C. A solution is represented in terms of feasible paths on the graph G, with respect to a given set of constraints. The population of ants, also called agents, collectively solves the problem under consideration using the graph representation. We assume that each ant on its own finds only a (probably poor) solution, while good solutions emerge as a result of collective interaction amongst ants. Pheromone trails encode a long-term memory about the search process.

Figure 1: Double bridge experiment. (a) Ants start exploring the double bridge. (b) Eventually most of the ants choose the shortest path [20].

The algorithm presented by Dorigo et al. [22] is given below:

Algorithm ACO meta heuristic();
    while (termination criterion not satisfied)
        ant generation and activity();
        pheromone evaporation();
        daemon actions(); "optional"
    end while
end Algorithm
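The loop above can be made concrete with a minimal Python sketch on a two-branch bridge. This is an illustration under stated assumptions, not the formulation of Dorigo et al. [22]: the branch lengths, the evaporation rate, the deposit of 1/length per ant, and the fixed iteration count used as the termination criterion are all hypothetical choices, and the function name `run_double_bridge` is invented for this example. The optional daemon actions are omitted.

```python
import random

def run_double_bridge(n_ants=100, n_iterations=50, evaporation=0.1, seed=0):
    """Toy ACO loop on a two-branch bridge; returns the final pheromone levels."""
    rng = random.Random(seed)
    lengths = {"short": 1.0, "long": 2.0}    # hypothetical branch lengths
    pheromone = {"short": 1.0, "long": 1.0}  # both trails start equal

    for _ in range(n_iterations):  # termination criterion: fixed iteration count
        # Ant generation and activity: each ant picks a branch with
        # probability proportional to the pheromone currently on it.
        total = pheromone["short"] + pheromone["long"]
        p_short = pheromone["short"] / total
        choices = ["short" if rng.random() < p_short else "long"
                   for _ in range(n_ants)]

        # Pheromone evaporation: every trail decays by a constant factor.
        for branch in pheromone:
            pheromone[branch] *= 1.0 - evaporation

        # Pheromone deposit (assumed rule): each ant adds 1/length to the
        # branch it used, so the shorter branch is reinforced more strongly.
        for branch in choices:
            pheromone[branch] += 1.0 / lengths[branch]

    return pheromone

final_pheromone = run_double_bridge()
```

With these assumed parameters the short branch accumulates more pheromone than the long one, which mirrors the positive feedback described for the double bridge experiment. In the real experiment the advantage comes from ants on the short branch returning sooner; the 1/length deposit is a common simplification of that effect.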

4. RELATED WORKS

In 2011, Hnin Wint Khaing et al. [23] presented an efficient approach for the prediction of heart attack risk levels from the heart disease database. The database is first clustered using the K-means clustering algorithm, and frequent patterns relevant to heart disease are then mined from the extracted data using the MAFIA (Maximal Frequent Itemset Algorithm) algorithm. The learning algorithm is then trained with the selected significant patterns for the effective prediction of heart attack. They employed the ID3 algorithm as the training algorithm to show the level of heart attack with a decision tree. According to the authors, the results showed that the designed prediction system is capable of predicting heart attack effectively.

In 2011, Zenggui Ou et al. [24] discuss how to use the sequential characteristic in the course of Web data mining to carry out structural transfer of semi-structured data based on the time effect of data, that is, the systematic structuring of Web resource data, and thereby address the problem of retrieval effectiveness.

In 2010, Zakaria Suliman Zubi et al. [25] note that lung cancer is a disease of uncontrolled cell growth in tissues of the lung and one of the most common and deadly diseases in the world. The authors suggest that the detection of lung cancer in its early stage is the key to its cure. In general, measures for early stage lung cancer diagnosis mainly include those utilizing X-ray chest films, CT, MRI, etc. Medical image mining is a promising area of computational intelligence applied to automatically analysing patients' records, aiming at the discovery of new knowledge potentially useful for medical decision making.

In 2011, Yao Liu et al. [26] proposed and implemented a classifier using discrete particle swarm optimization (DPSO) with an additional new rule pruning procedure for detecting lung cancer and breast cancer, which are the most common cancers for men and women as per the authors' observation. According to the authors' experiments, the new pruning method further improves the classification accuracy, and their approach is effective in making cancer predictions.

In 2011, Chandrasekhar U et al. [27] discuss and analyse recent improvements on clustering algorithms like PP (Project Pursuit) based on the ACO algorithm for high dimensional data, recent applications of data clustering with ACO, the application of an ant-based clustering algorithm for object finding by multiple robots in the image processing field, and the hybrid PSO/ACO algorithm for better optimized results. According to the authors, cluster analysis is a popular and widely used data analysis and data mining technique. High quality and fast clustering algorithms play a vital role in helping users navigate, effectively organize and structure the data. They observed that Ant Colony Optimization (ACO), a swarm intelligence technique, integrated with clustering algorithms, has been used by many applications for the past few years.

In 2011, Shyi-Ching Liang et al. [28] suggest that the classification rule is the most common representation of a rule in data mining. It is based on a supervised learning process which generates rules from a training data set. The main goal of classification rule mining is the prediction of the predefined class based on the group. Based on the ACO algorithm, Ant-Miner solved the classification rule problem. According to the authors, Ant-Miner shows good performance on many datasets. In this research paper the authors propose an extension of Ant-Miner that incorporates the concept of parallel processing and grouping. Intercommunication via pheromone among ants is a critical part of ant colony optimization's searching mechanism. The Ant-Miner algorithm design slightly modifies this part in a way that removes the parallel searching capability. Based on Ant-Miner, they propose an extension that modifies the algorithm design to incorporate parallel processing. The pheromone trails deposited by ants during the searching procedure affect each other. With the help of pheromone, ants can make better decisions while searching. They provide a possible direction for research on the classification rule problem.

In 2011, Mete Çelik et al. [29] discuss several classical and heuristic algorithms proposed to mine classification rules out of large datasets. In this research the authors proposed a novel heuristic classification data mining approach based on the artificial bee colony (ABC) algorithm, called ABC-Miner. The proposed approach was compared with the Particle Swarm Optimization (PSO) rule classification algorithm and the C4.5 algorithm using benchmark datasets. The experimental results show good efficiency of the proposed method.

In 2011, G. Sophia Reena et al. [30] suggest that cancer research is an interesting research area in the field of medicine. The authors suggest that classification is highly important for cancer diagnosis and treatment. The precise prediction of different tumor types has immense value in providing better care and minimizing toxicity for patients. The authors suggest that classification of patient samples available as gene expression profiles has become a subject of widespread study in biomedical research in recent years. Formerly, cancer classification depended upon morphological and clinical factors. The recent arrival of microarray technology has permitted the concurrent observation of thousands of genes, which prompted progress in cancer classification using gene expression data. This study focuses on the broadly used assorted data mining and machine learning techniques for appropriate gene selection, which lead to accurate cancer classification.

In 2013, S. Vijiyarani et al. [31] reviewed the field and suggest that data mining is defined as sifting through very large amounts of data for useful information. Some of the most important and popular data mining techniques are association rules, classification, clustering, prediction and sequential patterns. Data mining techniques are used for a variety of applications. In the health care industry, data mining plays an important role in predicting diseases. For detecting a disease, a number of tests are required from the patient, but using data mining techniques the number of tests can be reduced. This reduced test set plays an important role in time and performance. This technique has advantages and disadvantages. They analyse how data mining techniques are used for predicting different types of diseases.

As per our study, several works and algorithms have been presented for efficient cancer detection. The algorithms are based on data mining, fuzzy logic, particle swarm optimization, etc. Several authors work specifically on different types of cancer. After analysing those research papers, we find that several research works are based on lung cancer, heart diseases and breast cancer. Some of the authors present good results in the case of breast cancer and heart diseases but fail to achieve higher accuracy in the case of lung cancer. In 2011, Yao Liu et al. [26] also proposed and implemented a classifier using DPSO with a new rule pruning procedure for detecting lung cancer and breast cancer from the UCI repository, which are the most common cancers for men and women. In the case of lung cancer they achieve an accuracy of 68.33 with discrete particle swarm optimization and 64.44 with particle swarm optimization. In the case of breast cancer they achieve an accuracy of 97.23 with discrete particle swarm optimization and 97.06 with particle swarm optimization. They also provide a comparison with different related techniques like PART, SMO, Naive Bayes, KNN and classification tree. As per our analysis, the result is good in the case of breast cancer, but there is room for improvement in the case of lung cancer, because the prediction accuracy is not so high.

Data mining and ant colony optimization in combination can produce better results by using pheromone trails, which are updated automatically on the basis of iteration and frequent pattern analysis.

5. THEORETICAL ANALYSIS

The theoretical analysis of different diseases with different data mining techniques and their accuracy of detection is shown in Table 1.

Table 1: Theoretical Analysis

Author | Technique Name | Disease Name | Accuracy
------ | -------------- | ------------ | --------
Hnin Wint Khaing et al. [23] | K-Means Based MAFIA | Heart Disease Prediction | 74%
Hnin Wint Khaing et al. [23] | K-mean based MAFIA with ID3 | Heart Disease Prediction | 85%
Shyi-Ching Liang et al. [28] | Ant Colony Optimization and Classification Rule Problem | Breast Cancer | 70.33
Mete Çelik [29] | ABC-Miner | Breast Cancer | Standard Deviation of 0.082
Yao Liu et al. [26] | Mining Cancer data with Discrete Particle Swarm Optimization and Rule Pruning | Lung Cancer | 68.33 (DPSO (new))
Yao Liu et al. [26] | Mining Cancer data with Discrete Particle Swarm Optimization and Rule Pruning | Lung Cancer | 64.44 (PSO (new))

6. CONCLUSION

The use of data mining techniques in lung cancer classification increases the chance of making a correct and early detection, which could prove to be vital in combating the disease. In this paper, we provide a survey on lung cancer detection. We also analyse the utility of data mining, by which we can find an efficient lung cancer detection technique. After this analysis we identify several classification algorithms and their results, from which we can derive future insights.

As the area of lung cancer is very challenging and researchers are continuing their research progress in efficient detection, there is a lot of scope for improving detection. As per our observation there are some future suggestions, which are listed below:
1) We can apply neural network and fuzzy based techniques to train on cancer data sets for finding better classification and accuracy.
2) We can apply an optimization technique like Ant Colony Optimization to optimize the classification [33] for improving the detection.
3) A machine learning environment or support vector machine [32] is also a promising direction for better detection.
4) We can use some homogeneity based algorithm to find overfitting and overgeneralization characteristics. It can be applied with a clustering algorithm like K-Means.

REFERENCES

[1] R. O. Duda, P. E. Hart, and D. G. Stork, "Pattern Classification," New York: Wiley, 2000.
[2] D. J. Hand and S. D. Jacka, "Discrimination and Classification," New York: Wiley, 1981.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. of the International Conference on Very Large Databases, [...] 1994.
[4] [...] "[...] Frequent Itemsets," Pattern [...]
[5] Hetal Bhavsar, Dr. Amit Ganatra, "Variations of Support Vector Machine Classification [...]," International Journal of Advanced Computer Research (IJACR) [...]
[6] Nikhil Jain, Vishal Sharma, Mahesh Malviya, "Reduction of Negative and Positive Association Rule Mining [...] Rule Using Modified Genetic [...]," [...] Advanced Computer Research (IJACR), Volume-2 [...]
[7] M. Dorigo, G. Di Caro, and L. M. Gambardella, "Ant algorithms for discrete optimization," Artif. Life, vol. [...], pp. 137-172, 1999.
[8] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: An overview," in Advances in Knowledge Discovery & Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Cambridge, MA: MIT Press, pp. 1-[...]
[9] Junzo Watada, Keisuke Aoki, Masahiro Kawano, Muhammad Suzuri Hitam, "Dual Scaling [...]," [...] Intelligent Informatics [...]
[10] [...] "Data Mining Concepts and Techniques," San [...]
[11] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, "From Data Mining to Knowledge Discovery: An Overview," in Advances in Knowledge Discovery and Data Mining [...]
[12] [...] "Introduction to Data Mining for Medical Informatics," Clin Lab Med, pp. 9-[...]
[13] [...] "Data mining techniques applied to medical [...]"
[14] [...] Katta, "Medical Data Mining," Data Mining and Knowledge Discovery, [...]
[15] Irene M. Mullins et al., "Data mining and clinical data repositories: Insights from a 667,000 patient data set," Computers in Biology and Medicine, vol. 36, pp. 1351-1377, 2006.
[16] J. C. Prather, D. F. Lobach, L. K. Goodwin, J. W. Hales, M. L. Hage, and W. Edward Hammond, "Medical Data Mining: [...]
