Fast Parallel Outlier Detection for Categorical Datasets using MapReduce

Anna Koufakou, Jimmy Secretan, John Reeder, Kelvin Cardona, and Michael Georgiopoulos

Abstract: Outlier detection has received considerable attention in many applications, such as detecting network attacks or credit card fraud. The massive datasets currently available for mining in some of these outlier detection applications require large parallel systems, and consequently parallelizable outlier detection methods. Most existing outlier detection methods assume that all of the attributes of a dataset are numerical, usually have a quadratic time complexity with respect to the number of points in the dataset, and quite often require multiple dataset scans. In this paper, we propose a fast parallel outlier detection strategy based on the Attribute Value Frequency (AVF) approach, a high-speed, scalable outlier detection method for categorical data that is inherently easy to parallelize. Our proposed solution, MR-AVF, is based on the MapReduce paradigm for parallel programming, which offers load balancing and fault tolerance. MR-AVF is particularly simple to develop and it is shown to be highly scalable with respect to the number of cluster nodes.
I. INTRODUCTION

Detecting outliers in data is a research field with many applications, such as credit card fraud detection, discovering criminal activities in electronic commerce, and network intrusion detection. Outlier detection approaches concentrate on detecting patterns that occur infrequently in the dataset, in contrast to traditional data mining techniques that attempt to find patterns that occur frequently in the data. Application examples where the discovery of outliers is useful include identifying irregular credit card transactions, indicating potential credit card fraud [], or patients who exhibit abnormal symptoms due to their suffering from a specific disease or ailment [].

Most of the existing research efforts in outlier detection concentrate on datasets with attributes that are either numerical or ordinal (can be directly mapped into numerical values). In the case where data with categorical attributes are present, these techniques map the categorical to numerical values, a task which is not always a straightforward process. Another issue is that many of the above applications for the mining of outliers require the mining of very large datasets (e.g. terabyte-scale data). This leads to the need for large parallel machines and associated parallelizable outlier detection algorithms, which must scale well with the size and dimensionality of the dataset. The size of the datasets today also demands that the methods developed be computationally simple and easily balanced over a number of cluster nodes.

In this paper, we propose a parallel implementation of a fast and simple outlier detection method for categorical datasets, called Attribute Value Frequency (AVF). AVF [] was shown to have a significant performance advantage over a number of other competitive outlier detection strategies that have appeared in the recent literature, and it was also shown to scale linearly as the dataset size increases, both in the number of points and number of dimensions. Also, AVF depends only on one user parameter (the number of desired outliers, k, needed to be identified), an important advantage since it requires minimum user intervention. Furthermore, given the frequencies of each categorical value in the data, AVF performs only one dataset scan, thus lending itself to today's large and possibly geographically distributed data.

AVF is based on assigning a score to each point in the dataset using the frequency of each unique attribute value; thus it is easily parallelizable. In contrast, other techniques for outlier detection in categorical data (see [], [] and []) are much more complicated and cumbersome to parallelize, as they require several scans of the dataset in order to extract frequently encountered (and sometimes lengthy) combinations of attribute values. Moreover, the parallel version of AVF that is proposed here is based on the MapReduce paradigm of parallel programming []. MapReduce provides the necessary simplicity of parallel development, while guaranteeing the necessary load balancing and fault tolerance for the implementation. It is worth noting that MapReduce has already been successfully used in parallelizing a number of machine learning approaches for data mining applications (e.g. see []).

Our contribution is that we introduce MapReduce-AVF (MR-AVF), a parallel outlier detection method for categorical datasets, geared towards identifying outliers in large data mining problems. MR-AVF is based on AVF, an outlier detection method that has been shown to perform favorably compared to other competitive, but more complex, outlier detection strategies. Due to its simplicity, AVF is an ideal method to parallelize, and using the MapReduce approach to parallelize it guarantees ease of development, load balancing, and fault tolerance of the implementation. Our results show that MR-AVF exhibits close to ideal speedup with respect to the number of processing nodes in the cluster.

Manuscript received March …. This work was supported in part by NSF grants … as well as an NSF graduate research fellowship. Anna Koufakou, Jimmy Secretan, John Reeder, and Michael Georgiopoulos are with the School of EECS at the University of Central Florida, Orlando, FL. Kelvin Cardona is with the Department of Computer Engineering at the University of Puerto Rico.
978-1-4244-1821-3/08/$25.00 © 2008 IEEE
The organization of this paper is as follows: In Section II we provide an overview of the previous research related to outlier detection strategies, as well as a synopsis of the earlier work based on the MapReduce paradigm. In Section III we present the MapReduce paradigm, while in Section IV we introduce our proposed algorithm for parallel outlier detection, i.e. MapReduce-AVF. Finally, we present our experimental results in Section V, followed by our conclusions in Section VI.

II. PREVIOUS WORK

The existing outlier detection methods can be grouped into the following categories.

Statistical-model based methods assume that a specific model describes the distribution of the data []. Limitations include obtaining the right model for each dataset and application, and lack of scalability with respect to data dimensionality [].

Distance-based approaches (e.g. []) essentially compute distances among data points, thus becoming quickly impractical for large datasets (e.g. a nearest neighbor method has quadratic complexity with respect to the number of dataset points). Bay and Schwabacher [] propose a distance-based method based on randomization and pruning, and claim its complexity is close to linear in practice.

Clustering techniques can also be employed to cluster the data, and the points that do not belong in the formed clusters are designated as outliers. However, clustering-based methods are focused on optimizing clustering measures of goodness, and not on finding the outliers in the data [].

Density-based methods estimate the density distribution of the data and identify outliers as those lying in low-density regions (e.g. [], []). Although density-based methods detect outliers not discovered by the distance-based methods, they become problematic for sparse high-dimensional data [].

Other outlier detection efforts rely on Support Vector methods (e.g. []), Replicator Neural Networks (RNNs) [], or using a relative degree of density with respect only to a few fixed reference points [].

Most of the aforementioned techniques are geared towards numerical data and thus are more appropriate for numerical datasets or ordinal data that can be easily mapped to numerical values []. Another limitation of previous methods is the lack of scalability with respect to number of points and/or dimensionality of the dataset.

Outlier detection techniques for categorical datasets have recently appeared in the literature (e.g. [], [], []). For instance, Otey et al. in [] presented a distributed outlier detection method for mixed attribute datasets. Their approach is linear with respect to the number of data points; however, their running time is exponential in the number of categorical attributes and quadratic in the number of numerical attributes. Furthermore, Koufakou et al. [] experimented with a number of representative outlier detection approaches for categorical data and proposed AVF (Attribute Value Frequency), a simple, fast, and scalable method for categorical sets.

In this paper, we propose a parallel version of the AVF algorithm proposed in []. This parallel method is developed using MapReduce [], which is a simplified parallel program paradigm for large scale, data intensive parallel computing jobs. MapReduce hides the parallel machine from the programmer by simplifying the parallel programming model to two functions: the map function and the reduce function. Given a list of keys and associated values, the map function produces an intermediate set of keys and values. The reduce function then combines these intermediate values into a final result.

MapReduce has already found its way into several machine learning and data mining applications. Chu et al. [] present many algorithms in MapReduce form, including Locally Weighted Linear Regression, k-means, Logistic Regression, Naive Bayes, Linear Support Vector Machines, Independent Component Analysis, Gaussian Discriminant Analysis, Expectation Maximization, and Backpropagation.

III. MAPREDUCE

MapReduce is a parallel programming paradigm, originally introduced by Google [], whose central focus is to simplify the processing of large datasets on inexpensive cluster computers. These cluster computers often contain hundreds or thousands of nodes that both store and process the datasets in a distributed fashion. Typically, a single master server is used to schedule the data storage and computation on the nodes. The original MapReduce system was built on the Google File System (GFS) [], which is optimized for storing large, infrequently changed datasets across standard disks on the cluster nodes. The MapReduce/GFS combination is built to tolerate regular node failures through replication of the data and speculative execution. This system also automatically provides for load balancing and scheduling associated with the parallel processing of the data.

Users design a MapReduce program by relying almost entirely on the map and reduce functions. As a consequence, the user is not forced to devise a parallelization strategy for the task at hand, but is only required to adapt it to a MapReduce model. The map function takes as input a set of key/value pairs, designated as $k_1$ and $v_1$, provided directly from the user-defined input files. Within the map function, the user specifies what to do with these keys and values. The map function outputs another set of keys and values, designated as $k_2$ and $v_2$. The reduce function sorts the key/value pairs by $k_2$. All of the associated values $v_2$ are reduced and emitted as value $v_3$. The map and reduce functions are as follows:

$\text{map}(k_1, v_1) \rightarrow (k_2, v_2)[\,]$
$\text{reduce}(k_2, v_2[\,]) \rightarrow (k_3, v_3)[\,]$
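The map and reduce signatures above can be illustrated with a minimal single-process sketch. It only mimics the data flow (map, group by intermediate key, sort, reduce) and has none of the distribution, scheduling, or fault tolerance of a real MapReduce runtime. The names `run_mapreduce`, `parity_map`, and `sum_reduce` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one map / group / reduce round in a single process.

    records   -- iterable of (k1, v1) pairs
    map_fn    -- callable (k1, v1) -> iterable of (k2, v2) pairs
    reduce_fn -- callable (k2, [v2, ...]) -> iterable of (k3, v3) pairs
    """
    # Map: each input pair may emit any number of intermediate pairs,
    # which are grouped by their intermediate key k2.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)

    # Reduce: visit the intermediate keys in sorted order and pass each
    # key with all of its grouped values to the reduce function.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def parity_map(k1, v1):
    # Toy map: key each number by its parity.
    yield ("even" if v1 % 2 == 0 else "odd", v1)


def sum_reduce(k2, values):
    # Toy reduce: sum all values that share a key.
    yield (k2, sum(values))


result = run_mapreduce(enumerate([1, 2, 3, 4, 5]), parity_map, sum_reduce)
print(result)
```

Here the input keys are simply record positions, which matches the common case where $k_1$ carries no meaning beyond identifying the record.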
2008 International Joint Conference on Neural Networks (IJCNN 2008)
At the MapReduce runtime level, the map operations are distributed by the master server to the chunk servers. The scheduler makes an effort to schedule computation on the same node where the data is stored. Meanwhile, other chunk servers assigned to the reduce phase begin to take the $(k_2, v_2)$ value pairs and sort them by $k_2$. These sorted arrays of $v_2$ values are passed to the reduce functions on these same assigned nodes. These outputs are finally saved on the GFS. It is quite common for an application to string together many simpler MapReduce operations.

Fault tolerance and load balancing are automatically provided by the software that supports MapReduce and the GFS. Because the GFS stores a user-specified number of copies (usually three) for each chunk of the data on different chunk servers, and because the GFS monitors the cluster to maintain these copies, losing a particular chunk of data should be relatively rare. For fault tolerance of the MapReduce operations, the master server keeps track of all running operations and can restart failed tasks on other chunk servers that have a copy of the data. By the nature of operations that are put into the MapReduce framework (map operations that are independent on each element), they can be recomputed by any chunk server with the proper data.

A diagram of a typical MapReduce/GFS architecture is displayed in Figure 1.

Figure 1. The flow of data in a MapReduce/GFS architecture for file storage and MapReduce operations. Dashed lines indicate control messages and solid lines indicate data transfer.

An often cited MapReduce example is known as WordCount []. Suppose we need to obtain the number of occurrences of each unique word in a large file. In the MapReduce paradigm, this computation can be done easily and efficiently as follows: the map function receives as input a line from the large input file. Then, the map function splits this line into its component words and emits the word as the key and '1' as the associated value.

The reduce function takes these word keys and '1' values as input. Because each '1' value is an occurrence of the same word, they are simply summed together to find the number of occurrences of the word. When the reduction operation is complete, there will be a list of words with their associated occurrence frequency. See Figure 2 for a pictorial illustration of the WordCount example.

Figure 2. A pictorial illustration of the WordCount example.
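The WordCount computation can be sketched in a few lines of Python. This is a single-process illustration of the map and reduce roles only, not Hadoop code; the function names (`wc_map`, `wc_reduce`, `word_count`) and the sample lines are assumptions made for the example.

```python
from collections import defaultdict


def wc_map(line_no, line):
    # Map: split the line into words and emit (word, 1) for each one.
    for word in line.split():
        yield (word, 1)


def wc_reduce(word, ones):
    # Reduce: every 1 is one occurrence of this word, so sum them.
    yield (word, sum(ones))


def word_count(lines):
    # Group the emitted 1s by word, standing in for the runtime's
    # sort-and-group step between the map and reduce phases.
    groups = defaultdict(list)
    for line_no, line in enumerate(lines):
        for word, one in wc_map(line_no, line):
            groups[word].append(one)

    counts = {}
    for word in sorted(groups):
        for key, total in wc_reduce(word, groups[word]):
            counts[key] = total
    return counts


counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(counts)
```

Because each map call looks at one line only and each reduce call at one word only, both phases can be split across nodes without coordination.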
IV. AVF AND MR-AVF ALGORITHMS

A. AVF: Attribute Value Frequency

The Attribute Value Frequency (AVF) algorithm is a simple and fast approach to detect outliers in categorical data, which minimizes the scans over the data, without the need to create or search through different combinations of attribute values or itemsets. Further details are omitted due to space limitations, and the reader is referred to [] for further reading.

It is intuitive that outliers are those points which are infrequent in the dataset, and that the 'ideal' outlier point in a categorical dataset is one whose each and every value is extremely irregular (or infrequent). The infrequency of an attribute value can be measured by computing the number of times this value is assumed by the corresponding attribute in the dataset.

Let us assume that there are $n$ points in the dataset, $x_i$, $i = 1 \ldots n$, and each data point has $m$ attributes. We can write $x_i = [x_{i1}, \ldots, x_{il}, \ldots, x_{im}]$, where $x_{il}$ is the value of the $l$-th attribute of $x_i$. Following the reasoning given above, the AVF score below is a good indicator for deciding whether $x_i$ is an outlier:

$$\text{AVF Score}(x_i) = \frac{1}{m} \sum_{l=1}^{m} f(x_{il}) \qquad (1)$$

where $f(x_{il})$ is the number of times the $l$-th attribute value of $x_i$ appears in the dataset. A lower AVF score means that it is more likely that the point is an outlier. Since (1) is essentially a sum of $m$ positive numbers, the AVF score is minimal when each of the sum's terms is individually minimized. Thus, the 'ideal' outlier as defined above will have the minimum AVF score. The minimum score will be achieved when every value in the data point occurs just once.
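As a small worked illustration, consider a hypothetical dataset (not taken from the paper) with $n = 4$ points and $m = 2$ attributes: $x_1 = (a, p)$, $x_2 = (a, p)$, $x_3 = (a, q)$, $x_4 = (b, r)$. Taking the AVF score to be the average of the attribute-value frequencies of a point, the counts and scores work out as follows.

```latex
\begin{align*}
f(a) &= 3, \quad f(b) = 1, \quad f(p) = 2, \quad f(q) = 1, \quad f(r) = 1 \\
\text{AVF Score}(x_1) &= \text{AVF Score}(x_2) = \tfrac{1}{2}(3 + 2) = 2.5 \\
\text{AVF Score}(x_3) &= \tfrac{1}{2}(3 + 1) = 2 \\
\text{AVF Score}(x_4) &= \tfrac{1}{2}(1 + 1) = 1
\end{align*}
```

With $k = 1$, the point $x_4$ would be reported. Both of its values occur exactly once, so it attains the smallest score any point can have.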
Input: Dataset - D (n points, m attributes)
       Target number of outliers - k
Output: k detected outliers

Label all data points as non-outliers
Calculate frequency of each attribute value, f(x_il)
foreach point x_i (i = 1…n)
    foreach attribute l (l = 1…m)
        AVFScore(x_i) += f(x_il)
    end
    Average_l(AVFScore(x_i))
end
Return top k outliers with minimum AVFScore

Figure 3. AVF Pseudocode
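The serial AVF procedure can be sketched in Python as below. This is an illustrative reading of the pseudocode rather than the authors' implementation: the function names and the toy dataset are assumptions, and frequencies are counted per attribute so that equal labels in different columns are not mixed.

```python
from collections import Counter


def avf_scores(data):
    """Return the AVF score of every point in a list of equal-length tuples."""
    m = len(data[0])
    # One pass over the data: count each value separately per attribute.
    freq = [Counter(row[l] for row in data) for l in range(m)]
    # Score of a point = average frequency of its m attribute values.
    return [sum(freq[l][row[l]] for l in range(m)) / m for row in data]


def avf_outliers(data, k):
    """Return the indices of the k points with the smallest AVF scores."""
    scores = avf_scores(data)
    # sorted() is stable, so tied points keep their original order.
    ranked = sorted(range(len(data)), key=lambda i: scores[i])
    return ranked[:k]


# Hypothetical toy dataset: 4 points, 2 categorical attributes.
data = [("a", "p"), ("a", "p"), ("a", "q"), ("b", "r")]
scores = avf_scores(data)
top2 = avf_outliers(data, 2)
print(scores, top2)
```

Counting takes one pass over the $n \cdot m$ values and scoring takes another, so the work grows linearly in both the number of points and the number of attributes; only the final ranking adds a sort over the $n$ scores.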
Input: Dataset - D (n points, m attributes)
       Target number of outliers - k
Output: k detected outliers

HashTable H
map(k1 = i, v1 = D_i = x_i, i = 1…n)
begin
    foreach l in x_i (l = 1…m)
        collect(x_il, 1)
end
reduce(k2 = x_il, v2)
begin
    H(x_il) = Σ v2
end
map(k1 = i, v1 = D_i = x_i)
begin
    AVF = Σ_{l=1…m} H(x_il)
    collect(k1, AVF)
end
reduce(k2 = AVF_i, v2 = i)

Figure 4. Parallel AVF Pseudocode - MR-AVF
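The two MapReduce phases of the parallel pseudocode can be simulated in a single Python process, as sketched below. This is not the authors' Hadoop/Java code: all function names are hypothetical, the `shuffle` helper stands in for the runtime's group-and-sort step, and the second phase emits the score as the key and the point ID as the value so that sorting by key ranks the points.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key and return (key, values) pairs sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items(), key=lambda item: item[0])


def count_map(point_id, point):
    # Phase 1 map: emit ((dimension, value), 1) for each attribute value.
    # Including the dimension in the key keeps columns separate.
    for dim, value in enumerate(point):
        yield (dim, value), 1


def count_reduce(key, ones):
    # Phase 1 reduce: the frequency of one attribute value.
    return key, sum(ones)


def score_map(point_id, point, H):
    # Phase 2 map: add up the stored frequencies of the point's values
    # and emit (score, point_id) so the sort step orders by score.
    score = sum(H[(dim, value)] for dim, value in enumerate(point))
    yield score, point_id


def mr_avf(data, k):
    records = list(enumerate(data))
    phase1 = [pair for i, x in records for pair in count_map(i, x)]
    H = dict(count_reduce(key, ones) for key, ones in shuffle(phase1))
    phase2 = [pair for i, x in records for pair in score_map(i, x, H)]
    # Phase 2 reduce is just the sorted listing of scores with point IDs.
    ranked = [i for _, ids in shuffle(phase2) for i in ids]
    return ranked[:k]


data = [("a", "p"), ("a", "p"), ("a", "q"), ("b", "r")]
outliers = mr_avf(data, 2)
print(outliers)
```

The score here is the plain sum of frequencies rather than the average; dividing every score by the same $m$ would not change the ranking.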
As shown in the pseudocode of AVF (see Figure 3), once the AVF score is calculated for all the points, the k outliers returned are the k points with the smallest AVF scores. The complexity of AVF is $O(n \cdot m)$, where $n$ is the number of data points and $m$ is the dimensionality of the dataset.

B. Parallel AVF: MR-AVF

The original AVF algorithm calculates the AVF over each input record independently, making it amenable to easy parallelization. If the AVF can be expressed in terms of the MapReduce model, then the parallel algorithm can have the benefits of automatic load balancing and fault tolerance, with no additional effort from the user's perspective.

Using MapReduce, the Map function associates each distinct attribute value to the Map's output key. In the Reduce function, the frequency counts of each attribute value are computed. Finally, the AVF score of each point is calculated during a second Map function. The second reduce is simply a sorting operation of the computed AVF scores.

The pseudocode for MapReduce-AVF, or MR-AVF, is shown in Figure 4. In the first pair of map and reduce functions (first phase), the frequency of each attribute value is extracted from the dataset. If the attribute values across each dimension are unique (or, as in our code, the dimension is concatenated to the attribute value), this pair of functions is similar to the WordCount problem described in Section III.

In the second MapReduce phase, the attribute value frequency table resulting from the first phase is loaded into the hash table H by the map function. The map function then calculates and emits the AVF score for each individual input record by iterating through the dimensions and adding the frequency of every attribute value. In order to sort the data points by their outlier score, the AVF score is emitted as the key, and the input point ID is emitted as the value. At the end of the MapReduce process, the result is a list of AVF scores sorted in ascending order, with the listed point IDs. As a result, the top-k points represent the outliers of the dataset, as they have the k minimum AVF scores.

V. EXPERIMENTS

A. Experimental Setup

Since the original MapReduce/GFS implementation is proprietary, we used an open source MapReduce software called Hadoop []. Hadoop allows easy MapReduce implementation in Java, with support to connect it to other languages like C++ and Python. The software was installed on a …-node cluster, where each of the nodes had dual Opteron processors, … GB of RAM and … GB disks. We used Hadoop version … and Java … to implement the parallel MR-AVF code.

B. Datasets Used

We used the following datasets from the UCI repository []:

- Wisconsin Breast Cancer: This dataset has … points and … attributes. Each record is labeled as either benign or malignant. Following the method in [], we only kept every sixth malignant record, resulting in 39 outliers and … non-outliers.
- Lymphography: This dataset contains … instances and … attributes. Classes … and … comprise …% of the data, so they are considered as the outliers.
- Post-operative: This dataset is used to determine where patients should go to after a postoperative unit (Intensive Care Unit, home, or hospital floor). It contains … instances and … attributes. Classes … and … are the outliers.
- Pageblocks: It contains … instances with … attributes. There are … classes, where one class is about …% of the data, so the rest of the data can be thought of as outliers. We discretized the continuous attributes using an equal-frequency discretization approach, and removed half of the outliers so that we have a more imbalanced dataset.
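Equal-frequency discretization, as mentioned for the Pageblocks attributes, assigns roughly the same number of values to each bin. The sketch below is one simple rank-based way to do this; the function name, the bin count, and the sample values are assumptions for illustration, since the text does not state how many bins were used or how ties were handled.

```python
def equal_frequency_bins(values, n_bins):
    """Map each numeric value to a bin label 0..n_bins-1 by its rank."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        # Consecutive ranks are split into n_bins groups of near-equal size.
        labels[i] = rank * n_bins // len(values)
    return labels


labels = equal_frequency_bins([5.0, 1.0, 9.0, 3.0, 7.0, 2.0], 3)
print(labels)
```

One caveat of this rank-based variant is that equal values may land in different bins when they straddle a bin boundary; a production version would usually cut at quantile values instead.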
TABLE: RESULTS ON THE UCI DATASETS
[Table values lost in extraction. Each panel lists k against the methods AVF, Greedy, FPOF, and Otey's.]
(a) Breast Cancer (39 outliers)
(b) Lymphography (6 outliers)
(c) Post-Operative (26 outliers)
(d) Pageblocks (280 outliers)
TABLE: RUNTIME IN SECONDS FOR THE SIMULATED DATASETS WITH VAR…

REFERENCES

[] Penny K.I., Jolliffe I.T., "A comparison of multivariate outlier detection methods for clinical laboratory safety data", The Statistician, Journal of the Royal Statistical Society, pp. …
[] Koufakou A., Ortiz E., Georgiopoulos M., Anagnostopoulos G., Reynolds K., "A Scalable and Efficient Outlier Detection Strategy for Categorical Data", Int'l Conference on Tools with Artificial Intelligence (ICTAI), October …
[] Agrawal R., Srikant R., "Fast algorithms for mining association rules", Proc. of the Int'l Conference on Very Large Data Bases (VLDB), pp. …
[] He Z., Xu X., Huang J., Deng J., "FP-Outlier: Frequent Pattern Based Outlier Detection", Computer Science and Information System, pp. …
[] Otey M.E., Ghoting A., Parthasarathy A., "Fast Distributed Outlier Detection in Mixed-Attribute Data Sets", Data Mining and Knowledge Discovery, …
[] Dean J. and Ghemawat S., "Mapreduce: Simplified data processing on large clusters", Proceedings of OSDI '…: Symposium on Operating System Design and Implementation, …
[] Chu C.T., Kim S., Lin …
[] Wei L., Qian W., Zhou A., Jin W., "HOT: Hypergraph-based Outlier Test for Categorical Data", Proc. of …th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. …
[] Tax D., Duin R., "Support Vector Data Description", Machine Learning, pp. …
[] Hawkins S., He H., Williams G., Baxter R., "Outlier Detection Using Replicator Neural Networks", Data Warehousing and Knowledge Discovery: …th International Conference (DaWaK), pp. …
[] Pei …
[] He Z., Deng S., Xu X., "A Fast Greedy algorithm for outlier mining", Proceedings of PAKDD, …
[] Ghemawat S., Gobioff H., and Leung S.T., "The google file system", In Proceedings of …th ACM Symposium on Operating Systems Principles, October …
[] Blake C., Merz C., UCI machine learning repository: www.ics.uci.edu/~mlearn/MLRepository.html
[] Cristofor D. and Simovici D., "Finding Median Partitions Using Information-Theoretical Algorithms", Journal of Universal Computer Science, pp. …, software at http://www.cs.umb.edu/~dana/GAClust/index.html
[] Hadoop, "Welcome to hadoop", http://lucene.apache.org/hadoop/