Knowledge-based Bayesian network for the classification of Mycobacterium tuberculosis complex sublineages

Minoo Aminian, Departments of Mathematical Sciences and Computer Science, Rensselaer Polytechnic Institute, Troy, NY-12180
Amina Shabbeer, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY-12180
Kane Hadley, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY-12180
Cagri Ozcaglar, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY-12180
Scott Vandenberg, Department of Computer Science, Siena College, Loudonville, NY-12211
Kristin P. Bennett, Departments of Mathematical Sciences and Computer Science, Rensselaer Polytechnic Institute, Troy, NY-12180

ABSTRACT
We develop a novel knowledge-based Bayesian network (KBBN) that models our knowledge of the Mycobacterium tuberculosis complex (MTBC), obtained from expert-defined rules and large DNA fingerprint databases, to classify strains of MTBC into fifty-one genetic sublineages. The model uses two high-throughput biomarkers, spacer oligonucleotide types (spoligotypes) and mycobacterial interspersed repetitive units (MIRU) types, to represent strains of MTBC, since these are routinely gathered from MTBC isolates of tuberculosis (TB) patients. KBBN provides an elegant and simple way to incorporate existing, widely accepted visual rules for MTBC sublineages into a classifier designed to capture known properties of the MTBC biomarkers. Unlike prior knowledge-based SVM approaches, which require rules expressed as polyhedral sets, KBBN directly incorporates the rules without any modification. Computational results show that KBBN achieves much higher accuracy than methods based purely on rules and than Bayesian networks trained on biomarker data alone.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning; I.5.1 [Pattern Recognition]: Models - statistical; J.3 [Computer Application]: Life and Medical Sciences

General Terms
Design, Experimentation, Verification

Keywords
Tuberculosis, Bayesian Networks, sublineages, spoligotype, multiple interspersed repetitive units

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM-BCB, August, Chicago, IL, USA. Copyright ACM.
1. INTRODUCTION
One-third of the world is infected with tuberculosis (TB). Molecular epidemiology now plays a crucial role in the tracking and control of TB. DNA fingerprinting methods have made it possible to distinguish between cases of recent transmission of TB and reactivations of latent infections. This has enabled the tracking of transmission routes and the timely identification of outbreaks. Thus, knowledge about the genotype of prevailing strains has revolutionized traditional approaches in the epidemiology of TB. Moreover, the predominance of certain strains or groups of strains in certain host populations has clearly been observed [ ]. Studies of the genetic and biogeographic diversity of the Mycobacterium tuberculosis complex (MTBC) have revealed differences in the virulence, immunogenicity, and drug resistance of strains []. This has consequences in the development of control measures for TB. Analysis of the population structure also provides insights into the evolutionary scenario of MTBC.

Phylogeographic lineages and sublineages have been defined based on genetic similarities between strains and observed associations between groups of similar MTBC genotypes with host populations [ ]. A variety of molecular techniques, including the analysis of phylogenetically informative single nucleotide polymorphisms (SNPs) and long sequence polymorphisms (LSPs), are used to genotype MTBC strains. Classification based on SNPs and LSPs is considered to be the gold standard [ ]. However, studies of such variations in DNA sequences of MTBC strains are not performed frequently for public health purposes. Spacer oligonucleotide typing (spoligotyping) and mycobacterial interspersed repetitive units, variable number of tandem repeats (MIRU) typing are two polymerase chain reaction (PCR)-based DNA fingerprinting methods routinely used in the United States for genotyping all identified culture-positive TB cases. Spoligotyping is based on the polymorphisms found in the direct repeat (DR) region of the mycobacterial chromosome, while MIRU typing is based on the number of tandem repeats present at identified loci distributed across the MTBC genome [].

Large databases of spoligotypes have been collected, the most significant being SpolDB4, which comprises strains observed worldwide []. These strains have been assigned sublineage labels using a mixed expert-based and bioinformatical approach derived from visual rules. The visual rules are based on the identification of characteristic deletions of one or more adjacent spacers. Certain inferred mutations (deletions of blocks of adjacent spacers) in progenitor strains are considered to be lineage-defining. These deletions are conserved in all descendent strains, since studies have shown that the mechanism of evolution observed in the DR region involves loss of spacers, and spacers are rarely gained []. Additionally, the existence of these sublineages has been independently verified by clustering based on spoligotype and MIRU types of strains [ ].

Therefore, while it has been established that strains of TB belong to distinct sublineages, the definitions of these sublineages based on spoligotypes are not clear. The visual rules for a sublineage are generalizations of spoligotype patterns that belong to that sublineage. However, directly applying visual rules to spoligotype patterns can lead to multiple assignments of sublineage labels, since spoligotype patterns may match patterns prescribed by more than one rule, and sometimes spoligotype patterns do not exactly match the patterns specified by any rule. This is an inherent limitation of a rule-based system, wherein rules need to be broad enough to capture general patterns but narrow enough to delineate classes. Additionally, spoligotyping is based on polymorphisms in a single locus, the DR region, and therefore has the potential for convergent evolution. Relying on specific subsequences within the spoligotypes for the study of genetic diversity is hence error-prone.

This paper presents a hierarchical probabilistic graphical model, the knowledge-based Bayesian network (KBBN), that encodes the knowledge of MTBC obtained from expert-defined rules and large databases of DNA fingerprint data to classify strains of MTBC into 51 sublineages. Expert knowledge is modeled in the top level of variables representing the rules. The middle level variables represent the class, and the lower level represents various MTBC biomarkers: spoligotypes and MIRU types. These variables model the known properties of spoligotypes and MIRU repeats, as well as their mechanisms of mutation. The structure of the KBBN allows the knowledge base captured in the visual rules to be easily and simply incorporated into the learning method, while overcoming the limitations of using only specific deletions, as specified by visual rules, to decide the sublineage. Moreover, the incorporation of advice provides additional benefits in performance. The reasoning for any decision made by the KBBN is evident to users; the probability gives a quantitative estimate of the confidence.

Other approaches to incorporating advice in the form of rules have been shown to improve discriminative learning models of MTBC major lineages and other problems []. However, those methods are limited to rules expressed in less intuitive polyhedral form. The proposed knowledge-based Bayesian network method allows the existing visual rules to be incorporated with no modification, resulting in improved classification of sublineages over the predictions made with the visual rules or Bayesian Networks alone. Also, unlike visual rules, the flexibility offered by the KBBN enables it to handle sublineages with no known rules.
2. BACKGROUND
2.1 DNA fingerprinting
Two frequently used DNA fingerprinting methods for the genotyping of MTBC strains are spoligotyping and MIRU typing. Because of their portable data format and reproducibility, these two fingerprinting methods have become the standard for individual strain identification for the purpose of TB control and tracking. Isolates from almost every TB patient in the United States are genotyped by these two methods. This has enabled the creation of large reference databases. We describe the two techniques here briefly and mention key properties that were exploited in the modeling of variables for the design of the Bayesian network.

Spoligotyping
Spoligotyping is a PCR-based reverse hybridization technique that exploits polymorphisms in the DR region to distinguish between strains []. The DR region contains direct repeats interspersed with up to 43 non-repetitive sequences called spacers. The spoligotype of a strain is represented as a 43-bit binary string, with a 0 representing absence and a 1 representing presence of a spacer sequence. A key fact about the evolution of spoligotypes is that once a spacer is lost, it is extremely unlikely to be regained. It is hypothesized that spoligotypes evolve by deletion of a single or multiple contiguous direct repeats (DRs), whereas insertion of DRs is very unlikely [].

MIRU: Variable Number of Tandem Repeat (VNTR) Typing
MIRU-VNTR typing is a VNTR analysis bacterial typing scheme that provides a high-throughput, reproducible method for molecular typing of MTBC. MIRU is a DNA sequence dispersed within the intergenic regions of the MTBC genome as tandem repeats. MIRU typing is based on the number of repeats observed at certain identified polymorphic loci []. These loci are dispersed throughout the MTBC genome and are independent. The degree of discrimination between strains depends on the number of loci used. Twelve MIRU loci are used in this study: locus MIRU24 and a set of eleven further loci, henceforth referenced as MIRU1. MIRU typing has higher discriminatory power than spoligotypes; therefore, especially when used in conjunction with spoligotypes, MIRU typing provides a powerful method for identification of strains [].
2.2 Bayesian Networks
A Bayesian network (BN) is created to predict the sublineages. A BN is a graphical representation of a probability distribution. Formally speaking, a BN is a directed acyclic graph $G = (N, E)$ consisting of a set of nodes $X = \{x_i \mid x_i \in N\}$ to represent the variables and a set of directed links to connect pairs of nodes. Each node has a conditional probability distribution that quantifies the probabilistic relation between the node and its parents, such that for a network of $k$ nodes:

$$P(x_1, x_2, \ldots, x_k) = \prod_{i=1}^{k} P(x_i \mid \mathrm{parents}(x_i))$$

Therefore, one can compute the full joint probability distribution from the information in the network. In other words, a well represented Bayesian network can capture the complete nature of the relationship between a set of variables.
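The factorization above can be illustrated with a minimal Python sketch. The three-node network (rule, class, spacer) and every probability in it are hypothetical values chosen for illustration only; they are not taken from the KBBN.

```python
from itertools import product

# Hypothetical binary network: rule -> cls -> spacer.
parents = {"rule": [], "cls": ["rule"], "spacer": ["cls"]}

# Conditional probability tables: cpt[node][parent values][node value].
cpt = {
    "rule": {(): {0: 0.7, 1: 0.3}},
    "cls": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
    "spacer": {(0,): {0: 0.1, 1: 0.9}, (1,): {0: 0.95, 1: 0.05}},
}


def joint(assignment):
    """Return P(x_1, ..., x_k) as the product of P(x_i | parents(x_i))."""
    p = 1.0
    for node, node_parents in parents.items():
        key = tuple(assignment[q] for q in node_parents)
        p *= cpt[node][key][assignment[node]]
    return p


# Summing the joint over all assignments should recover 1.
total = sum(
    joint(dict(zip(parents, values)))
    for values in product([0, 1], repeat=len(parents))
)
```

Because each table is normalized, the joint sums to one over all eight assignments, which is a quick sanity check on any hand-built network of this kind.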
3. PRIOR BAYESIAN NETWORKS
SPOTCLUST was the first generative model used for analysis of MTBC sublineages []. SPOTCLUST uses mixture models based on spoligotypes to identify strain families of MTBC. The SPOTCLUST Bayesian Network models the asymmetric evolution of spacers using a Bayesian Network with "hidden parents" []. The hidden parents of a lineage generate the members of the lineage. They capture evolution of spoligotypes without generating the full phylogeny. A spacer in the hidden parent may be lost with small probability. A spacer that is absent in the parent is almost never gained. The design models the evolution mechanism of the DR region, allowing the Bayesian network to capture the deletions that are known to characterize spoligotype lineages. The hidden parent technique of SPOTCLUST is used for the spoligotype parts of the KBBN model.

The Conformal Bayesian Network (CBN), shown in Figure 1, is another generative model for analysis of both spoligotype and MIRU type data for MTBC strains [ ]. CBN captures domain knowledge about the properties of spoligotypes and MIRU, and uses this information to classify MTBC strain genotyping data into major lineages. The value of locus MIRU24 generates the lineage, which in turn determines the number of repeats in the remaining MIRU loci. Thus, patterns in the occurrences of repeats at each locus for each lineage are captured. The lineage also generates the hidden parents of the lineage, which in turn generate the spoligotype spacers. CBN reflects the known mechanisms of evolution of the spoligotypes and MIRU. With rare exceptions, ancestral strains have two or more repeats at MIRU24. Thus, the top-level variable M indicates whether MIRU24 is less than two, indicating modern lineages with high probability, or at least two, indicating ancestral lineages with high probability.

We tried using CBN to classify MTBC genotyping data into sublineages. But using the single rule MIRU24 ≥ 2 → ancestral, as in the original CBN, was not enough to generate a good model. KBBN grew out of the effort to incorporate all of the visual rules available from SpolDB4 [], the Fourth International Spoligotyping Database.
Figure 1. The Conformal Bayesian network uses a single rule based on the number of repeats at the MIRU24 locus as the first-level of a hierarchical Bayesian network. It uses the 43 spacers as features. The shaded nodes refer to hidden variables that model the fact that spacers are lost but rarely gained. In addition, the number of repeats at MIRU loci may be used. The nodes pointed to with dotted lines are not used for prediction. CBN uses a rule based on the value of locus MIRU24 to predict the major lineage with high accuracy.
4. VISUAL RULES
The SpolDB4 visual rules for MTBC sublineages are based on spoligotype patterns. Forty-eight visual rules are given in []. A sample of these rules is presented in Figure 2. Each line corresponds to a rule. The underlined portions of the spoligotype must match exactly, while the portions not underlined can take any value. Note that in [] the rules are expressed using the octal coding of spoligotypes; here we express them in binary for simplicity. While these rules establish characteristic patterns for sublineages of MTBC, they are not exclusive, and in some cases overlap. In practice, a precedence or order is introduced over the rules using expert knowledge, so that unambiguous sublineage predictions are generated. However, this precedence has not been published for sublineages and is up to the individual user of the rules. The precedence reported here was crafted by the authors by trial and error to achieve good overall results. Another complication is that some SpolDB4 sublineages have no associated rules.

Figure 2. Visual rules for four sublineages LAM2, LAM5, LAM9, and T1 from SpolDB4 knowledge base of rules. The rule column represents characteristic patterns specified by the visual rules as underlined subsequences in the spoligotype patterns. All of these rules fire for the spoligotype 1101111111110111111100001111111100001111111, while three of the rules fire for 1111111111110011111100001111111100001111111.

Visual rules with precedence have been established for six major MTBC lineages []. A prior online knowledge-based support vector machine (SVM) approach combined these visual rules and precedence into a set of rules expressed in polyhedral form []. The method produced a high accuracy SVM using much less data. However, this elegant work has several practical limitations that we sought to overcome in this study. First, expressing rules and precedence as polyhedral rules can be challenging for a large number of rules. Second, the method works best with linear SVMs, and linear SVMs do not capture the underlying complexity of the biomarkers and their mechanism of evolution. This can be overcome by using nonlinear SVMs (degree-3 polynomial kernels work very well), but then incorporating the polyhedral rules becomes even more challenging. Third, the complexity of training increases with the introduction of rules. Thus, the proposed design of the KBBN has the following salient features:
• Incorporates rules easily, without modification and without imposing precedence
• Models known properties of biomarkers and their mutation mechanisms
• Provides an efficient training method for classes with and without rules
• Achieves high prediction accuracy
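The overlap between visual rules discussed in this section can be sketched in Python. The deleted-spacer positions below are illustrative assumptions that reduce each rule to a set of spacers required to be absent; they are not the published SpolDB4 rule definitions, which may also constrain neighboring spacers. The two spoligotypes are the ones quoted in the caption of Figure 2.

```python
# Assumed, simplified rules: 1-based spacer positions that must be absent.
# These position sets are hypothetical stand-ins for the underlined patterns.
RULES = {
    "LAM2": [3, 13, 21, 22, 23, 24, 33, 34, 35, 36],
    "LAM5": [13, 21, 22, 23, 24, 33, 34, 35, 36],
    "LAM9": [21, 22, 23, 24, 33, 34, 35, 36],
    "T1": [33, 34, 35, 36],
}


def fired_rules(spoligotype):
    """Return the names of all rules whose required deletions are present."""
    if len(spoligotype) != 43 or set(spoligotype) - {"0", "1"}:
        raise ValueError("expected a 43-character binary spoligotype")
    return [
        name
        for name, absent in RULES.items()
        if all(spoligotype[pos - 1] == "0" for pos in absent)
    ]


first = "1101111111110111111100001111111100001111111"
second = "1111111111110011111100001111111100001111111"
overlap_first = fired_rules(first)    # every rule fires
overlap_second = fired_rules(second)  # all but the most specific rule fire
```

Under these assumed definitions the rules are nested, so a spoligotype matching the most specific rule also matches all broader ones, which is why a precedence order (or a probabilistic treatment, as in the KBBN) is needed to obtain a single label.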
Figure 1. The KBBN uses multiple rules based on the presence of characteristic deletions at the first-level of a hierarchical Bayesian network. As with the CBN, it uses the 43 spoligotype spacers and/or number of repeats at MIRU loci as features. The shaded nodes refer to hidden variables that model the fact that spacers are lost but rarely gained. The dotted lines indicate that the MIRU1 variables can be dropped to create a spoligotype-only model.
5. KNOWLEDGE-BASED BAYESIAN NETWORK
The Knowledge-Based Bayesian Network (KBBN), represented in the KBBN figure, is a novel hierarchical Bayesian network probability model for sublineage classification of MTBC. KBBN captures domain knowledge about the properties of spoligotype and MIRU, and incorporates additional information provided by SpolDB4 rules to predict the class with high accuracy. The corresponding probability density function for the model shown in the figure is:

$$P(C, M_G, S_W, R_Y) = \left[\prod_{j \in W} \sum_{H_j} P(S_j \mid H_j)\, P(H_j \mid C)\right] \left[\prod_{i \in G} P(M_i \mid C)\right] P(C \mid R_Y)\, P(R_Y)$$

where the random variable $C$ represents the sublineage class; the random variable $S_W = \{S_j \mid j \in W\}$ with $W = \{1, \ldots, 43\}$ represents the spoligotype spacers; the random variable $M_G = \{M_i \mid i \in G\}$, with $G$ the set of MIRU loci, represents the MIRU loci as indexed by their locus number; and finally $R_Y = \{R_k \mid k \in Y\}$ represents the rules. Following the CBN [], each MIRU locus except one is modeled as a multinomial distribution over its possible repeat counts, and the remaining locus can take on some additional values. Since the proportions of different classes are not equal and some copy numbers do not occur, we used Dirichlet smoothing with non-uniform priors. The reader can consult the description of CBN in [] for full details of these portions of the model.

For spoligotypes, we followed the SPOTCLUST model []. It captures the fact that spacers are lost but almost never gained by introducing a variable for the unobserved hidden parent ($H_j$) and for each spacer, both of which follow a binomial distribution. Given a 43-dimensional spoligotype $S$ and its spacer position $j$, $S_j = 1$ if the spacer is present and $S_j = 0$ if the spacer is absent. The probabilities of the spacer given the parent are assumed to be known. As in [], the probability of losing a spacer is fixed to a small constant and the probability of gaining a spacer to a value close to zero.

The KBBN assumes that the MIRU loci and the spoligotype hidden parents are conditionally independent given the sublineage. The MIRU loci are scattered throughout the chromosome of MTBC in locations away from the DR locus used for spoligotyping. Thus, the assumptions of independence between the MIRU loci and between MIRU and spoligotype are well supported biologically. The conditional independence assumption of spacers is a model simplification previously made in the SPOTCLUST BN model []. This conditional independence of the biomarkers in the BN model enables KBBN to conform to the set of available biomarkers without any expensive missing value computations. A spoligotype-only model can be created by simply dropping the MIRU variables. None of the genotyping variables in the BN are treated as unobserved except for the hidden parent spacers, which are always unobserved.

Using Bayes' rule, one can predict the sublineage for new data by determining the sublineage with maximum probability:

$$P(C \mid S_W, R_Y) \propto \left[\prod_{j \in W} \sum_{H_j} P(S_j \mid H_j)\, P(H_j \mid C)\right] P(C \mid R_Y)$$
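The prediction rule of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LOSS` and `GAIN` constants, the toy two-class model, and all function and key names are hypothetical, and the model is shrunk to three spacers, one MIRU locus, and one rule.

```python
import math

LOSS, GAIN = 0.05, 1e-6  # assumed loss/gain probabilities, for illustration


def spacer_likelihood(s, p_parent_present):
    """P(S_j = s | C), summing over the hidden parent H_j in {0, 1}."""
    p_present = p_parent_present * (1 - LOSS) + (1 - p_parent_present) * GAIN
    return p_present if s == 1 else 1 - p_present


def dirichlet_smooth(counts, alpha):
    """Multinomial estimate with (possibly non-uniform) pseudo-counts alpha."""
    total = sum(counts.get(v, 0) + a for v, a in alpha.items())
    return {v: (counts.get(v, 0) + a) / total for v, a in alpha.items()}


def posterior(spacers, miru, rules, model):
    """Normalized P(C | S_W, M_G, R_Y); pass miru=() for spoligotype only."""
    log_scores = {}
    for c in model["classes"]:
        lp = math.log(model["class_given_rules"][rules][c])
        for j, s in enumerate(spacers):
            lp += math.log(spacer_likelihood(s, model["parent_present"][c][j]))
        for i, m in enumerate(miru):
            lp += math.log(model["miru"][c][i][m])
        log_scores[c] = lp
    top = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(v - top) for c, v in log_scores.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}


toy_model = {
    "classes": ["A", "B"],
    # P(H_j = 1 | C) for each of three spacers.
    "parent_present": {"A": [0.99, 0.99, 0.01], "B": [0.99, 0.01, 0.99]},
    # P(M_i = m | C) for a single locus with repeat counts 1 or 2.
    "miru": {"A": [{1: 0.2, 2: 0.8}], "B": [{1: 0.7, 2: 0.3}]},
    # P(C | R_Y) indexed by the tuple of rule indicators.
    "class_given_rules": {
        (0,): {"A": 0.5, "B": 0.5},
        (1,): {"A": 0.9, "B": 0.1},
    },
}

prediction = posterior((1, 1, 0), (2,), (1,), toy_model)
```

Working in log space keeps the product over 43 spacers from underflowing. Dropping the MIRU variables amounts to passing an empty tuple for `miru`, which mirrors the remark that conditional independence lets the model conform to the available biomarkers.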
6. EXPERIMENTAL RESULTS
We compared the performance of several variations of KBBN against five alternative methods. We constructed KBBN using spoligotype and MIRU data, and using spoligotype alone. Since determining the precedence of rules can be challenging, we implemented KBBN with overlapping rules without precedence and with precedence-imposed rules.

The five other methods are precedence-imposed rules, rules without precedence, nonlinear SVM trained on spoligotype and MIRU, nonlinear SVM trained on spoligotype alone, and SPOTCLUST, a spoligotype-only BN with hidden parents. For SVM, the Weka package for multiclass SVM was used to construct the model []. The data was preprocessed by recoding the spoligotypes and normalizing the MIRU. The C parameter was selected by cross-validation. Ten-fold cross validation of the training set parameters was used to select the degree of a polynomial kernel. Third degree polynomial kernels were found to work best over alternative kernels, including Radial Basis Function (RBF) and linear kernels. In prior studies not reported here, we also found that degree-3 polynomials perform best for SVM, so we restrict presentation of results to that kernel. SPOTCLUST has proved to be one of the best of the non-knowledge BNs that we tried, so it was chosen as the baseline BN method.

The predictive accuracy of each model was measured by 10-fold stratified cross validation. The 10-fold training and testing sets were designed to be disjoint with respect to spoligotype and MIRU. This ensures there are at least ten unique genotypes (spoligotype and MIRU pairs) in each sublineage considered. For methods using only spoligotype, the accuracy and F-measures are naturally slightly higher, since the train and test sets may be overlapping.
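One way to build folds that are stratified by sublineage and disjoint with respect to genotype is sketched below. The function name and the round-robin assignment are assumptions made for illustration; the sketch also assumes each (spoligotype, MIRU) genotype carries a single sublineage label.

```python
from collections import defaultdict


def genotype_disjoint_folds(records, k=10):
    """Split record indices into k folds without splitting any genotype.

    records: sequence of (spoligotype, miru, sublineage) tuples.
    Genotypes of each sublineage are dealt round-robin across folds, so
    every fold sees each class roughly in proportion (stratification).
    """
    by_class = defaultdict(dict)  # sublineage -> {genotype: [record indices]}
    for idx, (spol, miru, label) in enumerate(records):
        by_class[label].setdefault((spol, miru), []).append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        for n, genotype in enumerate(sorted(by_class[label])):
            folds[n % k].extend(by_class[label][genotype])
    return folds
```

With this construction a sublineage needs at least k distinct genotypes to appear in every fold, which matches the requirement of ten unique genotypes per sublineage for 10-fold cross validation.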
6.1 Datasets
Two datasets were combined for use in this study. The first (CDC) was the data collected by the Centers for Disease Control, United States (CDC) as part of routine TB surveillance in the United States, consisting of MTBC isolates genotyped by spoligotyping and 12-loci MIRU typing. The second dataset was the SpolDB4 dataset available in the online supplement of Brudey et al. [], consisting of distinct spoligotypes labeled with SpolDB4 sublineages. KBBN was trained on a dataset of records, each corresponding to a (spoligotype, MIRU, sublineage) triplet obtained by joining the SpolDB4 and CDC datasets by spoligotype. In this study, every distinct MIRU and spoligotype pair is considered to be a unique genotype. We dropped the classes for which there were fewer than ten records, and also the sublineages that do not commonly infect human beings, e.g. the two PINI sublineages. To keep small classes, we combined the three MANU sublineages into one class, MANU, and also combined two AFRI sublineages into one AFRI class. Overall, we ended up with 51 classes.

Data was preprocessed by adding an array of binary values, each representing a SpolDB4 rule, applied to each record. The value of the rule was set to 1 if the rule was fired and zero otherwise. If precedence is imposed, only the rule with highest precedence fires; otherwise multiple rules may fire.
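The preprocessing just described can be sketched as a small pipeline. All names here (`build_training_set`, `min_records`, the rule predicates) are hypothetical, and the rules are passed in as an ordered list so that list position stands in for the precedence order.

```python
from collections import Counter


def build_training_set(cdc_records, spoldb_labels, rules, min_records,
                       impose_precedence=False):
    """Join, filter, and annotate records with binary rule indicators.

    cdc_records: list of (spoligotype, miru) pairs.
    spoldb_labels: dict mapping spoligotype -> sublineage.
    rules: list of (name, predicate) pairs, highest precedence first.
    """
    # Join by spoligotype; each distinct (spoligotype, MIRU) pair is one genotype.
    genotypes = sorted(
        {(s, m, spoldb_labels[s]) for s, m in cdc_records if s in spoldb_labels}
    )
    class_sizes = Counter(label for _, _, label in genotypes)

    rows = []
    for spol, miru, label in genotypes:
        if class_sizes[label] < min_records:
            continue  # drop classes with too few records
        fired = [int(predicate(spol)) for _, predicate in rules]
        if impose_precedence and 1 in fired:
            first = fired.index(1)  # keep only the highest-precedence rule
            fired = [int(i == first) for i in range(len(fired))]
        rows.append((spol, miru, label, fired))
    return rows
```

Keeping the precedence switch in the feature construction, rather than in the classifier, means the same downstream model can be trained on either representation and the two variants compared directly.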
Table 1. Comparison of F-values of KBBN, Rules, SVM, and SPOTCLUST based on out-of-sample 10-fold cross validation test results. Two sets of models were created for SVM and KBBN – one trained on spoligotype and MIRU and one trained on only spoligotype (*). The training and test folds are defined based on distinct spoligotype and MIRU types and are identical for all methods. The slightly higher F-values of the (*) models can be explained by the fact that some spoligotypes were repeated in both train and test sets.
6.2 Prediction Results
The average F-value estimated by 10-fold cross validation for the four KBBN variations and five alternative methods is provided in Table 1. KBBN is better than or not significantly different from the alternatives. Neither the purely rule-based methods nor BN without rules are satisfactory by themselves. Rules without precedence perform poorly because no prediction is made when multiple rules fire. Imposing precedence on the rules improves accuracy, but it remains significantly below that of KBBN. The spoligotype-only BN method, SPOTCLUST, also falls short of KBBN. KBBN is very competitive with the best SVM methods but offers additional advantages.

We also studied the out-of-sample prediction probability of each model for each sublineage and provided the prediction probability distribution map of each model per sublineage, as shown in the heatmap figure. Note that the two rule-based methods fail because no rules exist for some sublineages, such as BEIJING-LIKE, and because rules with no precedence overlap.
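For reference, a per-class F-measure averaged over classes can be computed as sketched below. The source does not spell out the exact variant it reports, so the balanced F-measure (harmonic mean of precision and recall) and the unweighted average over classes are assumptions; a prediction of `None` models a rule-based system that makes no prediction.

```python
def f_measure_per_class(y_true, y_pred):
    """Return {class: F} using the harmonic mean of precision and recall."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return scores


def average_f(y_true, y_pred):
    """Unweighted mean of the per-class F-measures."""
    scores = f_measure_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)
```

A missing prediction counts as a false negative for the true class and never as a false positive, which is how a rule system that abstains on overlapping rules ends up with low recall.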
Figure 2. The heatmap represents the average F-value for the 51 lineages as determined by the 5 models: 1) SPOTCLUST (BN using spoligotypes alone) 2) Rule-based system without precedence 3) Rule-based system with precedence 4) Precedence-based KBBN trained on spoligotypes and MIRU 5) KBBN with no precedence trained on spoligotypes and MIRU. The KBBN models have the best performance as observed from the dominance of white squares indicating high precision and recall as captured by average F-measure.

7. DISCUSSION
The KBBN benefits from both expert advice and large collections of DNA fingerprint data. It performs significantly better than the original SPOTCLUST Bayesian network trained using only spoligotypes. It also outperforms the 'Rules only' systems, even after the incorporation of precedence.

Rule-based systems require the exact matching of spoligotypes with specified patterns. These specified patterns correspond to inferred mutation events (deletions of one or more adjacent spacers) that characterize sublineages. However, spoligotypes often match patterns prescribed by more than one rule and are thus assigned multiple sublineage labels. The deletion of contiguous spacers is usually considered to be a single mutation event, rather than the independent loss of spacers over time []. However, since spoligotyping is based on variations within a single locus, the DR region, there is a potential for convergent evolution of spoligotype patterns []. This is often the reason cited as a limitation of spoligotyping for population genetic analyses []. It is necessary to use all available evidence from the data that may indicate a sublineage, not just the presence of a few contiguous deletions such as in the visual rules. The KBBN model is well-suited to this task because it incorporates the probability distributions of all spacers in addition to the presence of characteristic deletions specified by rules. The KBBN model is extensible to other biomarkers such as SNPs and additional MIRU loci, as well as rules for these biomarkers (as observed in []).

In contrast, rule-based systems incorporate precedence to mitigate the effect of multiple labels by checking for more stringent patterns first. This does improve the accuracy over a basic rule-based system without precedence, but completely excludes from consideration any other potential sublineage labels. Some spoligotype patterns do not exactly match the patterns specified by any rule. While probabilistic systems can handle such cases elegantly, these spoligotypes are typically assigned a "catch-all" label by a rule-based system with precedence. Thus, a deterministic rule-based system is prone to some misclassifications.

Other successful strategies to mimic the behavior of rule-based systems while mitigating their disadvantages by allowing for variations in definitive patterns include nearest neighbor approaches, as in []. However, these require the definition of appropriate distance measures, and it is difficult to capture the dependence in the deletion of adjacent spacers in a distance measure. Nearest neighbor methods also require a comparison with every instance in the database in real time, in contrast to Bayesian networks.

These results indicate that high classification accuracy can be achieved using less data by the incorporation of domain knowledge in the form of rules. In this regard, KBBN is consistent with approaches from other methods. In [], visual rules with precedence, converted into polyhedral rules by experts, led to better classification of MTBC major lineages by online SVM. KBBN improves on these approaches by allowing rules to be applied, with or without precedence, with no modification. Previously, in [], it was shown that including a single rule based on the number of repeats at the MIRU24 locus, which distinguishes between 'modern' and 'ancestral' strains, leads to improved accuracy.
8. CONCLUSION AND FUTURE WORK
KBBN is a high accuracy classifier for MTBC sublineages that outperforms methods based on rules or Bayesian networks trained on data alone, and meets or beats the performance of nonlinear SVM models. As a general approach, KBBN has many attractive properties. It allows any type of rules to be incorporated into a Bayesian Network with little increase in the model and training complexity. Prior knowledge-based SVM required manipulation of the rules, models, data, and/or kernel [ ]. In contrast to SVM, KBBN can produce explanations and probabilities of classes based on which rules were used and how they were affected by the rest of the KBBN.

KBBN can be readily extended to other learning tasks. It can perform unsupervised and semi-supervised learning by treating the class as unobserved for some training instances. This is an important property for TB sublineages, since large unlabeled datasets exist and new lineages are being discovered. Incorporating patient characteristics into an unsupervised KBBN model can help it discover interesting host-pathogen groups.

In future work, we plan to evaluate KBBN as a general strategy for incorporating rules into Bayesian Networks on other domains and compare it with other strategies and knowledge-based learning methods. Also, we plan to create a more definitive KBBN model for MTBC sublineages based on the latest SITVIT sublineages, rules, and databases, in conjunction with Institut Pasteur. The goal is to produce a publicly available web-based tool for sublineage classification to support TB control and research efforts.

9. ACKNOWLEDGMENTS
This work was made possible by and with the assistance of Dr. Jeffrey R. Driscoll and Dr. Lauren Cowan of the CDC. We would like to thank Dr. Nalin Rastogi and Institute Pasteur for their assistance. This work is supported by NIH.

10. REFERENCES
Hirsh, A. E., Tsolaki, A. G., DeRiemer, K., Feldman, M. W., and Small, P. M. Stable association between strains of Mycobacterium tuberculosis and their human host populations. P Natl Acad Sci USA, Apr.
Gagneux, S., DeRiemer, K., Van, T., Kato-Maeda, M., de Jong, B. C., Narayanan, S., Nicol, M., Niemann, S., Kremer, K., Gutierrez, M. C., Hilty, M., Hopewell, P. C., and Small, P. M. Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc Natl Acad Sci U S A, Feb.
Gagneux, S. and Small, P. Global phylogeography of Mycobacterium tuberculosis and implications for tuberculosis product development. Lancet Infect Dis.
Kato-Maeda, M., Bifani, P. J., Kreiswirth, B. N., and Small, P. M. The nature and consequence of genetic variability within Mycobacterium tuberculosis. The Journal of Clinical Investigation.
Malik, A. N. J. and Godfrey-Faussett, P. Effects of genetic variability of Mycobacterium tuberculosis strains on the presentation of disease. The Lancet Infectious Diseases.
Brudey, K., Driscoll, J., Rigouts, L., Prodinger, W., Gori, A., Al-Hajoj, S., Allix, C., Aristimuno, L., Arora, J., and Baumanis, V. Mycobacterium tuberculosis complex genetic diversity: mining the fourth international spoligotyping database (SpolDB4) for classification, population genetics and epidemiology. BMC Microbiol.
Filliol, I., Driscoll, J. R., van Soolingen, D., Kreiswirth, B. N., Kremer, K., Valetudie, G., Anh, D. D., Barlow, R., Banerjee, D., Bifani, P. J., Brudey, K., Cataldi, A., Cooksey, R. C., Cousins, D. V., Dale, J. W., Dellagostin, O. A., Drobniewski, F., Engelmann, G., Ferdinand, S., Binzi, D. G., Gordon, M., Gutierrez, M. C., Haas, W. H., Heersma, H., Kallenius, G., Kassa-Kelembho, E., Koivula, T., Ly, H. M., Makristathis, A., Mammina, C., Martin, G., Mostrom, P., Mokrousov, I., Narbonne, V., Narvskaya, O., Nastasi, A., Niobe-Eyangoh, S. N., Pape, J. W., Rasolofo-Razanamparany, V., Ridell, M., Rossetti, M. L., Stauffer, F., Suffys, P. N., Takiff, H., Texier-Maugein, J., Vincent, V., de Waard, J. H., Sola, C., and Rastogi, N. Global distribution of Mycobacterium tuberculosis spoligotypes. Emerg Infect Dis, Nov.
Filliol, I., Driscoll, J. R., van Soolingen, D., Kreiswirth, B. N., Kremer, K., Valetudie, G., Dang, D. A., Barlow, R., Banerjee, D., Bifani, P. J., Brudey, K., Cataldi, A., Cooksey, R. C., Cousins, D. V., Dale, J. W., Dellagostin, O. A., Drobniewski, F., Engelmann, G., Ferdinand, S., Gascoyne-Binzi, D., Gordon, M., Gutierrez, M. C., Haas, W. H., Heersma, H., Kassa-Kelembho, E., Ho, M. L., Makristathis, A., Mammina, C., Martin, G., Mostrom, P., Mokrousov, I., Narbonne, V., Narvskaya, O., Nastasi, A., Niobe-Eyangoh, S. N., Pape, J. W., Rasolofo-Razanamparany, V., Ridell, M., Rossetti, M. L., Stauffer, F., Suffys, P. N., Takiff, H., Texier-Maugein, J., Vincent, V., de Waard, J. H., Sola, C., and Rastogi, N. Snapshot of moving and expanding clones of Mycobacterium tuberculosis and their global distribution assessed by spoligotyping in an international study. J Clin Microbiol, May.
Baker, L., Brown, T., Maiden, M. C., and Drobniewski, F. Silent nucleotide polymorphisms and a phylogeny for Mycobacterium tuberculosis. Emerg Infect Dis, Sep.
Filliol, I., Motiwala, A. S., Cavatore, M., Qi, W., Hazbon, M. H., Bobadilla del Valle, M., Fyfe, J., Garcia-Garcia, L., Rastogi, N., Sola, C., Zozio, T., Guerrero, M. I., Leon, C. I., Crabtree, J., Angiuoli, S., Eisenach, K. D., Durmaz, R., Joloba, M. L., Rendon, A., Sifuentes-Osornio, J., Ponce de Leon, A., Cave, M. D., Fleischmann, R., Whittam, T. S., and Alland, D. Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNA fingerprinting systems, and recommendations for a minimal standard SNP set. J Bacteriol, Jan.
Gutacker, M. M., Smoot, J. C., Migliaccio, C. A., Ricklefs, S. M., Hua, S., Cousins, D. V., Graviss, E. A., Shashkina, E., Kreiswirth, B. N., and Musser, J. M. Genome-wide analysis of synonymous single nucleotide polymorphisms in Mycobacterium tuberculosis complex organisms: resolution of genetic relationships among closely related microbial strains. Genetics, Dec.
Supply, P., Allix, C., Lesjean, S., Cardoso-Oelemann, M., Rusch-Gerdes, S., Willery, E., Savine, E., de Haas, P., van Deutekom, H., and Roring, S. Proposal for standardization of optimized mycobacterial interspersed repetitive unit-variable-number tandem repeat typing of Mycobacterium tuberculosis. J Clin Microbiol.
Warren, R. M., Streicher, E. M., Sampson, S. L., van der Spuy, G. D., Richardson, M., Nguyen, D., Behr, A. A., Victor, T. C., and van Helden, P. D. Microevolution of the direct repeat region of Mycobacterium tuberculosis: Implications for interpretation of spoligotyping data. J Clin Microbiol, Dec.
Ozcaglar, C., Shabbeer, A., Vandenberg, S., …
Vitol, I., Driscoll, J., Kreiswirth, B., Kurepina, N., and Bennett, K. Identifying Mycobacterium tuberculosis complex strain families using spoligotypes. Infect Genet Evol.
Kunapuli, G., Bennett, K., Shabbeer, A., Maclin, R., and Shavlik, J. Online knowledge-based support vector machines. Machine Learning and Knowledge Discovery in Databases.
Kamerbeek, J., Schouls, L., Kolk, A., van Agterveld, M., van Soolingen, D., Kuijper, S., Bunschoten, A., Molhuizen, H., Shaw, R., and Goyal, M. Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J Clin Microbiol.
Driscoll, J. R. Spoligotyping for molecular epidemiology of the Mycobacterium tuberculosis complex. Methods Mol Biol.
Supply, P., Mazars, E., Lesjean, S., Vincent, V., Gicquel, B., and Locht, C. Variable human minisatellite-like regions in the Mycobacterium tuberculosis genome. Mol Microbiol.
Supply, P., Lesjean, S., Savine, E., Kremer, K., van Soolingen, D., and Locht, C. Automated high-throughput genotyping for study of global epidemiology of Mycobacterium tuberculosis based on mycobacterial interspersed repetitive units. J Clin Microbiol, Oct.
Aminian, M., Shabbeer, A., and Bennett, K. A conformal Bayesian network for classification of Mycobacterium tuberculosis complex lineages. BMC Bioinformatics, Suppl.
Aminian, M., Shabbeer, A., and Bennett, K. Determination of Major Lineages of Mycobacterium tuberculosis using Mycobacterial Interspersed Repetitive Units. IEEE International Conference on Bioinformatics & Biomedicine.
Shabbeer, A., Cowan, L., Driscoll, J. R., Ozcaglar, C., Vandenberg, S., …
Holmes, G., Donkin, A., and Witten, I. H. Weka: A machine learning workbench. IEEE.
Hershberg, R., Lipatov, M., Small, P. M., Sheffer, H., Niemann, S., Homolka, S., Roach, J. C., Kremer, K., Petrov, D. A., Feldman, M. W., and Gagneux, S. High Functional Diversity in Mycobacterium tuberculosis Driven by Genetic Drift and Human Demography. PLoS Biol, Dec.
Allix-Beguec, C., Harmsen, D., Weniger, T., Supply, P., and Niemann, S. Evaluation and strategy for use of MIRU-VNTRplus, a multifunctional database for online analysis of genotyping data and phylogenetic identification of Mycobacterium tuberculosis complex isolates. J Clin Microbiol, Aug.
Towell, G. G. and Shavlik, J. W. Knowledge-based artificial neural networks. Artif Intell.
Lauer, F. and Bloch, G. Incorporating prior knowledge in support vector machines for classification: A review. Neurocomputing.