Rule Induction Using Enhanced RIPPER Algorithm for Clinical Decision Support System

Bakhtawar Seerat and Usman Qamar
College of Electrical and Mechanical Engineering, National University of Sciences and Technology, Islamabad, Pakistan
Email: bakhtawarseerat@gmail.com, usmanq@ceme.nust.edu.pk

Sixth International Conference on Intelligent Control and Information Processing, Wuhan, China, November

Abstract: With the large amounts of data made available by the emergence of computers and the internet, data mining is becoming popular for predictive analysis in every field, such as business, health, and disasters. As more and more data becomes available, it becomes difficult to extract useful information from it, and such a tremendous volume of data is then of little use. Data mining helps to extract useful information from the data, which can then be used for decision making. This paper presents a model that helps in the diagnosis of diseases by analyzing patients' data. The patients' attributes are analyzed and association rules are extracted from these attributes. Association rule based classification is used for disease diagnosis and is thus helpful in clinical decision making. A patient is classified as healthy or sick based on his or her attributes. A Disease Mining Model (DMM) based on association rule mining (ARM) is proposed. This model is globally optimized using Weighted Association Rule Mining (WARM), giving the Optimized Disease Mining Model (ODMM), which provides improved accuracy of disease prediction for every disease dataset. Both DMM and ODMM are tested on nine datasets of different diseases, and the diagnosis results are verified against the real diagnoses. WARM improves the accuracy of diagnosis and thus outperforms ARM. In this work, classification using the RIPPER algorithm is therefore much improved through weight optimization.

... database. Later, it came to be used for databases from many different fields. Association rule mining can also be used to mine association rules from patients' data, and these rules can then be used to make predictions about a person's health. The basic concepts are introduced as follows:

• Set of transactions: Let $D = \{t_1, t_2, \ldots, t_n\}$ be the set of transactions, where every member of the set represents a transaction and each transaction consists of a subset of items from $I$.
• Association rule: A rule is defined as an implication of the form $X \Rightarrow Y$, where $X, Y \subseteq I$ and $X \cap Y = \emptyset$.
• Confidence: The confidence of a rule $X \Rightarrow Y$ is $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$.

Neural networks are explored for medical data mining in [], []. Linear genetic programming has also been analyzed for medical data mining []. Olukunle and Ehikioya proposed a method to extract association rules from medical image data []. Bethel et al. proposed a method to learn association rules from cancer data []. Shim and Xu proposed a Bayesian classification method []. Wang et al. applied fuzzy cluster analysis to medical images []. Classification methods for medical data are proposed in [] and []. Cheng et al. used a classification technique for cardiovascular diseases; in that work, feature selection is used to make classification more effective []. Karegowda and Jayaram proposed a model to classify diabetic patients using Correlation-based Feature Selection (CFS) and a Genetic Algorithm (GA) to provide accurate results []. A Bayesian network algorithm is used for mining medical data [] and for diagnosing coronary heart disease []. Abraham et al. proposed improving the classification of medical data using a naïve Bayesian classifier []. SVM in combination with decision trees is used to estimate the survival probability of CHD patients []. Balakrishnan and Narayanaswamy proposed using SVM as a feature selection technique for the classification of diabetes data []. Tang and Tseng proposed a classification method for diabetes and cancer data that makes use of weighted fuzzy k-NN, fuzzy k-NN, and crisp k-NN [].

B. Related Work for Accuracy Improvement of Classifier

There are several algorithms for association rule mining, each with its own pros and cons. Several approaches have been proposed to improve efficiency in terms of time, cost, or accuracy. Sujata Dash et al. suggested the partial least squares (PLS) regression method, as it is a suitable feature selection method, instead of hybrid dimensionality reduction []. Lin et al. combined a particle swarm optimization (PSO) based approach with the commonly used classification technique Linear Discriminant Analysis (LDA) and showed experimentally that, for many public datasets, the combined model (PSOLDA) has a higher classification accuracy rate []. R. Bryll et al. developed a new wrapper method, Attribute Bagging (AB), to improve classification accuracy []. D. W. Abbott compared boosting with an ensemble of models across algorithm families and reported that boosting performs better []. M. R. Smith et al. suggested that outliers and noise should be eliminated from the dataset, as doing so yields better classification accuracy []. As discussed by Zamalloa et al., B. Liu et al., and Michael L. Raymer et al., the Genetic Algorithm (GA) is a popular method under research and has been found quite effective for feature selection and for improving classification accuracy [], [], [].

Supervised learning, especially classification, has been explored in recent years, but it has not been used much in the medical field. Although many approaches to disease diagnosis have been explored, there is still no precise and accurate model for predicting diseases. Despite all this extensive work, there is no global or absolute method that can be used for disease prediction.

III. DISEASE MINING MODEL

In this work, a model for disease data mining and decision making is proposed, as shown in the "Disease mining model" figure. The major steps involved in the mining model are described here.

Fig. Disease mining model



A. Data Understanding

For data, it holds that garbage in means garbage out. A sufficient amount of data must be collected from a repository to train and test the model. The data understanding steps are shown in the "Data understanding" figure.

Fig. Data understanding

1) Source of data: Nine disease datasets from the UCI Machine Learning Repository were selected for this research. These sources were selected because they are free to use, well documented, well maintained, and extensively used to verify research results.

2) Attributes and metadata: The metadata of the disease data used in this research is explained as follows:

a) Attribute type: Attributes of the datasets are of the following types:
• Nominal (categorical): values from an unordered set, e.g., gender
• Ordinal: values from an ordered set, e.g., BP
• Continuous: real numbers, e.g., age

b) Attribute role: Attributes of a dataset play one of these roles:
• input: inputs for modeling
• target: output
• id: auxiliary; kept but not used for modeling
• ignore: not used for modeling
• weight: instance weight

3) Concerns for medical data: Medical data is immense, and the major concerns are as follows:
• Medical data is not in binary format
• Missing values of attributes
• Noisy values of attributes

B. Data Preparation

Data preprocessing is essential for disease data because of noise and high dimensionality. Preprocessing is necessary to get rid of unwanted outliers and to focus on the main attributes, which improves the quality of the data for further processing. The data preparation steps are shown in the "Data preparation" figure.

Fig. Data preparation

1) Data cleaning: To clean the data, the following operations were performed:

a) Data acquisition: Data is collected from the UCI repository along with its metadata.

b) Data reformatting: In this phase, data must be converted to the required format. For RapidMiner, data should be in Comma Separated Values (CSV) format.

c) Filling in missing values: In a dataset, many tuples have missing values for various attributes. Imputation is used to fill the missing values of an attribute.
• Imputation: This method computes the mean of an attribute's values and uses that mean to fill the missing values of that attribute, as illustrated with an example in the "Imputation" figure.

Fig. Imputation

d) Conversion of data type: Values of all attributes should be converted to binary values because the model cannot deal with nominal values. These values cannot be ignored, as they contain important information. The conversion depends on the type of data, as follows:
• Attributes with few but frequent nominal or ordinal values are converted to binary values, as shown in the "Nominal/Ordinal to binary values" figure.
• Attributes with many infrequent continuous or nominal values are also converted to binary values, through discretization, as shown in the "Continuous/Many nominal to binary values" figure.
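The cleaning and conversion steps above can be illustrated with a minimal sketch. It is not the authors' RapidMiner process: the function names `impute_mean` and `to_binary_columns` are hypothetical, missing values are assumed to be encoded as `None`, and the nominal-to-binary conversion is assumed to produce one 0/1 column per distinct value.

```python
from statistics import mean


def impute_mean(values):
    """Fill missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def to_binary_columns(values):
    """Convert one nominal column into one 0/1 column per distinct value."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# One continuous attribute with a missing value, one nominal attribute.
ages = impute_mean([40, None, 60, 50])
gender = to_binary_columns(["male", "female", "female", "male"])

print(ages)    # [40, 50, 60, 50]
print(gender)  # {'female': [0, 1, 1, 0], 'male': [1, 0, 0, 1]}
```

Mean imputation only makes sense for numeric attributes; for nominal attributes a different fill value (for example the most frequent category) would have to be assumed.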

Fig. Nominal/Ordinal to binary values

Fig. Continuous/Many nominal to binary values

e) Data reduction: Mining a high volume of data takes a lot of time. To reduce this processing time, the volume of data must be reduced.

f) Discretization: Discretization is part of data reduction and is important for continuous numerical data. Equal-width binning is used for discretization. It divides the range of possible values into N subranges of equal size. The bin width of each subrange (bin) is calculated by

$$\text{Bin width} = \frac{\text{max value} - \text{min value}}{N}$$

g) Feature reduction: To reduce data volume, a minimum number of features must be selected. Ideally, this subset provides the same representation as the original dataset.
• Feature subset selection: Feature subset selection involves
  - considering all possible feature subsets
  - selecting the feature subset that provides better accuracy
• Genetic algorithm for feature selection: A Genetic Algorithm (GA) is used for feature selection. GA simulates the Darwinian theory of 'survival of the fittest'. Attributes that provide useful information should survive and are selected by assigning them weight 'one'. Attributes that are not useful are left out by assigning them weight 'zero'. GA for feature selection is illustrated in the "Genetic algorithm for feature selection" figure.

Fig. Genetic algorithm for feature selection

C. Data Partitioning

Data is divided into different partitions for training and testing, as illustrated in the "Data partitioning" figure. A suitable partition of the data can provide better classification accuracy.

Fig. Data partitioning

1) Stratified sampling for data reduction: Stratified sampling is used for partitioning. Stratification is the process of rearranging the data so that each fold is a good representative of the whole dataset, as illustrated in the "Stratified Sampling for Data Partitioning" figure.
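As a worked check of the equal-width binning rule from the discretization step of Data Preparation, the sketch below assigns each value to one of N bins. The function name `equal_width_bins` and the sample values are illustrative assumptions; placing the maximum value in the last bin is also an assumption, since the paper does not state how boundary values are handled.

```python
def equal_width_bins(values, n_bins):
    """Return the bin index (0 .. n_bins - 1) of every value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # bin width = (max value - min value) / N
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum would fall just past the last bin, so clamp it back in.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, 35, 50, 65, 80]
bins = equal_width_bins(ages, 3)
print(bins)  # [0, 0, 1, 2, 2], with a bin width of (80 - 20) / 3 = 20
```

Each bin index can then be turned into binary attributes in the same way as a nominal value.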

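Stratified partitioning can be sketched as follows, together with the train/test rotation that n-fold cross-validation relies on. This is a simplified stand-in for RapidMiner's sampling, not its implementation: `stratified_folds` and `cross_validation_splits` are hypothetical names, and rows are dealt to folds round-robin per class without shuffling, which a real run would normally add.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign row indices to folds so each fold keeps similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the rows of each class to the folds in turn.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def cross_validation_splits(folds):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


labels = ["sick", "healthy", "healthy", "sick", "healthy", "healthy"]
folds = stratified_folds(labels, 2)
splits = list(cross_validation_splits(folds))
print(folds)  # [[0, 1, 4], [3, 2, 5]]: one "sick" and two "healthy" rows per fold
```

Averaging the accuracy measured on the test part of each split gives the cross-validated accuracy of a model.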


Fig. Stratified Sampling for Data Partitioning

D. Classifier Building

A classifier model is developed using the classification technique of Association Rule Mining (ARM). Rule induction is done with the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm. This model is used to classify data as normal or patient, and it is trained on the training dataset obtained through data partitioning.

1) Algorithm for classifier building: The RIPPER algorithm is a rule-based learner that builds a set of rules to identify the classes while minimizing the amount of error. Rules are learned directly from the data. The rules are in first-order logic and are easy to understand. In a two-class problem, one class is regarded as 'Positive' (Healthy) and the other as 'Negative' (Unhealthy). The RIPPER algorithm is explained in the "RIPPER algorithm for classification" figure.

E. Model Validation

The robustness of the model is checked using the testing dataset created by data partitioning. Cross-validation is used to validate the model. The data is divided into n subsets of equal size; n-1 subsets are used for training and one is used for testing. The process is repeated n times, and each time a different subset is used for testing the model. The performance of the model is calculated in terms of accuracy over the n iterations, and the average of these values gives the accuracy of the model.

IV. PROS AND CONS OF PROPOSED MODEL

A. Advantages of Proposed Model

Advantages of the proposed system are as follows:

• People's awareness will be enhanced, which will help them recover from their health issues. They can diagnose minor health problems without going to a doctor.
• Moreover, people will have a cross-check of a diagnosis in case they have access to a doctor.
• Doctors can use this as a helper for diagnosis.

B. Problems with RIPPER Algorithm

Problems with the RIPPER algorithm are as follows:
• The main problem is overfitting. The algorithm learns too much detail about the attributes of the customer data, which results in lower accuracy when predicting new customer data.
• Accuracy is further impacted when the quality of new data decreases as a result of increasing missing data. The algorithm showed less resilience in the presence of increasing missing or noisy data. This resulted in poor classification performance compared to other supervised algorithms such as naïve Bayes, k-Nearest Neighbor, support vector machines, and the logical discriminant analysis algorithm [].

V. OPTIMIZED DISEASE MINING MODEL

The model proposed above is used for mining association rules and then making decisions. To overcome the problems of the RIPPER algorithm, an improvement is suggested here.

Fig. RIPPER algorithm for classification

In this work, a Genetic Algorithm (GA) is used for feature selection and weight optimization of attributes to improve accuracy. The steps involved are shown in the "Optimized Disease Mining Model" figure.

Fig. Optimized Disease Mining Model

A. Weight Optimization

Weights are assigned to the attributes of the dataset by GA. The weights assigned to attributes are real values within a bounded range. The weight of an attribute describes its relevance and is used for Weighted Association Rule Mining. Punch et al. proposed a Genetic Algorithm (GA) for this purpose []. GA can be used as a wrapper for attribute weighting, as shown in Fig. (a), and involves these two basic steps:
• Specification of the value of the weights for each of the attributes.
• Using a function to calculate the accuracy of the classification algorithm with these specified weights. This function is called the fitness function.

Fig. (a) Using Genetic Algorithm Wrapper for attribute weight optimization

1) Genetic algorithm for weight optimization: The fitness function calculates the performance of the algorithm based on the number of predictions and the number of correct predictions, and thus chooses the attribute weights for which performance is best. The selection operator selects those individuals in a population that the fitness function regards as better. In crossover, two individuals are chosen from the population using the selection operator, and crossover gives two offspring. Mutation occurs with a very low probability and involves flipping a few bits of the new individuals [].

In attribute weighting, some attributes are assigned weights approaching zero but not equal to zero. A threshold value is defined, and attributes with a weight less than the specified threshold are assigned a weight equal to zero. Through crossover and mutation by the Genetic Algorithm:
• the best individuals are selected
• and their weights are calculated.
The weight values are then normalized and multiplied by the values of the corresponding attribute. GA for attribute weight optimization is explained in Fig. (b).

Fig. (b) Genetic Algorithm for attribute weight optimization
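The GA wrapper described above can be sketched in a few lines. Several parts are assumptions rather than the paper's setup: the fitness function scores a simple weighted-vote classifier instead of RIPPER, weights are taken to lie in [0, 1], mutation replaces a weight with a new random value instead of flipping bits, and the best individual is carried over unchanged (elitism). The names `accuracy` and `optimize_weights`, the parameter defaults, and the toy data are all hypothetical.

```python
import random


def accuracy(weights, rows, labels):
    """Fitness: correct predictions divided by the number of predictions.

    Stand-in classifier (not RIPPER): predict 1 when the weighted sum of the
    binary attributes reaches half of the total weight.
    """
    total = sum(weights) or 1.0
    correct = 0
    for row, label in zip(rows, labels):
        score = sum(w * x for w, x in zip(weights, row))
        predicted = 1 if score >= total / 2 else 0
        correct += int(predicted == label)
    return correct / len(rows)


def optimize_weights(rows, labels, pop_size=20, generations=30,
                     mutation_rate=0.05, seed=0):
    """Search for attribute weights with selection, crossover and mutation."""
    rng = random.Random(seed)
    n = len(rows[0])

    def fitness(weights):
        return accuracy(weights, rows, labels)

    population = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # selection: keep the better half
        children = [ranked[0]]              # elitism: best individual survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)       # one-point crossover, two offspring
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                children.append([rng.random() if rng.random() < mutation_rate
                                 else w for w in child])  # rare mutation
        population = children[:pop_size]
    return max(population, key=fitness)


# Toy data: the label equals the first attribute, the other two are noise.
rows = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
labels = [1, 1, 0, 0, 1, 0]
best = optimize_weights(rows, labels)
print(best, accuracy(best, rows, labels))
```

In the wrapper setting of the paper, `accuracy` would instead train and evaluate the rule learner on the weighted attributes, which is far more expensive per fitness evaluation.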

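The thresholding, normalization, and multiplication steps of attribute weighting can be sketched as below. The paper does not say how the weights are normalized, so dividing by the largest weight is an assumption, as are the function name `apply_weights`, the threshold value 0.1, and the sample weights.

```python
def apply_weights(rows, weights, threshold=0.1):
    """Zero out small weights, normalize the rest, and scale attribute values."""
    pruned = [0.0 if w < threshold else w for w in weights]
    top = max(pruned)
    if top == 0:
        raise ValueError("every weight is below the threshold")
    normalized = [w / top for w in pruned]  # assumed scheme: largest weight = 1
    weighted_rows = [[w * x for w, x in zip(normalized, row)] for row in rows]
    return normalized, weighted_rows


weights = [0.8, 0.05, 0.4]  # e.g. weights produced by the genetic algorithm
rows = [[1, 1, 1], [0, 1, 1]]
normalized, weighted_rows = apply_weights(rows, weights)
print(normalized)      # [1.0, 0.0, 0.5]
print(weighted_rows)   # [[1.0, 0.0, 0.5], [0.0, 0.0, 0.5]]
```

The second attribute falls below the threshold and therefore no longer contributes to any row.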


VI. EXPERIMENTS AND RESULTS

RapidMiner is used for implementation of the model. Nine disease datasets are taken from the University of California, Irvine (UCI) Machine Learning Repository. After preprocessing, both the DMM and ODMM models are tested on these datasets.

In total, nine different datasets were used in training and testing of DMM. All of these are public datasets that have been used in many other data mining studies. The software used to implement DMM was RapidMiner, an industry-standard open-source software for data mining.

The number of instances (cases), the number of attributes for each dataset, and the accuracy of the experimental results for both models are shown in Table I. The results show that accuracy varies depending on the dataset. Simple RIPPER (ARM) and RIPPER with weight optimization (WARM) are compared for accuracy. The improvement in accuracy is shown in graphical form in the "Improved accuracy of results" figure.

Fig. Improved accuracy of results

TABLE I. EXPERIMENT AND RESULT FOR EACH DATASET
Columns: Dataset; Cases; Attributes (Regular, Special); Algorithm; Classification Accuracy for DMM; Classification Accuracy for ODMM; Improvement
Datasets, each using the Rule Induction algorithm: Heart Disease, Cancer, Diabetes, Hypothyroid, Iris, Lymph, Primary tumor, Sick, Audiology

VII. CONCLUSION

Today, researchers are focused on adding years to human life. This involves improving a person's lifestyle and helping them get rid of health-related problems as early as possible, which can only be achieved through early diagnosis. In this research, we therefore proposed a Disease Mining Model (DMM) for the prediction of diseases.

This work contributes to the development of a disease prediction model and to the improvement of the classifier's accuracy through attribute weight optimization, which has not been much explored for this purpose.

Datasets are crucial for association rule mining, as results vary with the number of attributes and the number of instances. For more accurate and improved results, datasets must be much larger. That is why different datasets have been used to verify the results.

Data mining techniques are used for extracting association rules from a disease database. ARM alone is not sufficient to find interesting rules, so improved algorithms are needed. WARM gives improved results compared to ARM in terms of accuracy, as the Optimized Disease Mining Model (ODMM) outperforms DMM.

VIII. FUTURE WORK

Data mining has not been much explored yet. There is a lot to come, and it can be quite helpful, especially in the field of medicine. The following suggestions are made for future work:

• In this research work, a person's health-related attributes have been analyzed and the person is categorized as healthy or diseased. This work can be enhanced so that, for a person categorized as sick, the particular disease can be predicted. Moreover, the current work is based only on training and then testing the algorithm. It can be extended to produce an interactive system in which the user enters the person's data and the system, making use of the algorithm, suggests the particular health problem to the user. The system in terms of input and output is shown in the "Interactive system for deciding patients' health" figure.

Fig. Interactive system for deciding patients' health

• In this research, the accuracy of disease diagnosis is improved using weighting of attributes along with the RIPPER algorithm for classification. Other supervised and unsupervised techniques can be used for this purpose. In addition to accuracy, the suggested model can be improved in terms of time. Moreover, results may vary across implementation tools and languages, which can be investigated further. The very same model can be employed for data from different fields.

REFERENCES

J. Han and M. Kamber, Data Mining: Concepts and Techniques.
U. Abdullah, J. Ahmad, and A. Ahmed, "Analysis of effectiveness of apriori algorithm in medical billing data mining," in Emerging Technologies (ICET), International Conference on. IEEE.
C. R. Ji and Z. H. Deng, "Mining frequent ordered patterns without candidate generation," in Fuzzy Systems and Knowledge Discovery (FSKD), Fourth International Conference on. IEEE.
H. T. He and S. L. Zhang, "A new method for incremental updating frequent patterns mining," in Innovative Computing, Information and Control (ICICIC), Second International Conference on. IEEE.
C. K. S. Leung, C. L. Carmichael, and B. Hao, "Efficient mining of frequent patterns from uncertain data," in Data Mining Workshops (ICDM Workshops), Seventh IEEE International Conference on. IEEE.
S. Bashir and Z. Halim, "Mining fault tolerant frequent patterns using pattern growth approach," in Computer Systems and Applications (AICCSA), IEEE/ACS International Conference on. IEEE.
S. Joshi and R. C. Jain, "A dynamic approach for frequent pattern mining using transposition of database," in Communication Software and Networks (ICCSN), Second International Conference on. IEEE.
T. T. Nguyen, "An improved algorithm for frequent patterns mining problem," in Computer Communication Control and Automation (3CA), International Symposium on. IEEE.
S. Balakrishnan, "SVM ranking with backward search for feature selection in type II diabetes databases," in Systems, Man and Cybernetics (SMC), IEEE International Conference on. IEEE.
S. Dash, B. Patra, and B. K. Tripathy, "A hybrid data mining technique for improving the classification accuracy of microarray data set," I. J. Inf. Eng. Electron. Bus.
S. W. Lin and S. C. Chen, "PSOLDA: A particle swarm optimization approach for enhancing classification accuracy rate of linear discriminant analysis," Appl. Soft Comput.
R. Bryll and R. Gutierrez-Osuna, "Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets," Pattern Recognition.
D. W. Abbott, "Combining models to improve classifier accuracy and robustness," in Proceedings of Second International Conference on Information Fusion (Fusion).
