Android App Behaviour Classification Using Topic Modeling Techniques and Outlier Detection Using App Permissions

Mayank Garg, Akshit Monga, Priyank Bhatt, Anuja Arora

Department of CSE & IT, Jaypee Institute of Information Technology, Noida, India
mayankgargjpr@gmail.com, akshitmonga@gmail.com, bhattpriyank@gmail.com, anujaarora@gmail.com

Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC)

Abstract: Android apps are now in everyday use, but users also switch from one app to another frequently. Research points to two prime reasons for this: apps do not provide the functionality promised in their Google Play Store description, and apps access content on the user's phone without asking for permission. The objective of this work is to classify apps effectively and to detect outlier apps through app behavior analysis. An app is treated as an outlier when it does not perform as its Google Play Store description claims, or when it accesses the user's personal content without the user's agreement. The work proceeds in four phases. Data extraction: app content such as the App Title and Description is crawled and extracted from the Google Play Store. Data pre-processing: filtering and stemming are used to reduce missing and high-dimensional data. App classification: clusters are formed from the generated feature vector list of apps of various categories using two topic modeling approaches, the probabilistic Latent Dirichlet Allocation (LDA) and the deterministic Non-negative Matrix Factorization (NMF). Outlier detection: the content of each app's manifest (user permission) file is mapped against the app-specific feature list to find outlier apps.

Keywords: Android App Classification; LDA; NMF; Outlier Detection; Clustering; Malicious

I. INTRODUCTION AND MOTIVATION

Over the years, Android's worldwide market share has increased significantly, and more and more people are shifting towards Android. According to research conducted by the International Data Corporation (IDC), Android had the highest market share, followed by iOS, as shown in the market-share figure below. According to statista.com, the estimated total number of Android devices in consumers' hands is more than a billion. As the number of Android devices has grown, data security [] and user privacy have become major concerns, and the chances of fraud and attacks on Android apps have also increased.

© IEEE

This growth in the Android market, and likewise in the Android app market, has drawn our attention to the privacy of users' personal information. There is always a threat to users that an Android app they have used, or will use, is malicious.

Fig. Worldwide smartphone OS market share

Whenever a user installs an app, the question that arises is: "Does the app behave as advertised or not?" Behavior that is malevolent for one app can be a normal characteristic of another. Many apps access phone content such as contact numbers and images without obtaining the phone owner's permission. This work has been carried out to identify Android apps that do not behave as described on the Google Play Store and that access user content without the user's permission. As shown in the figure below, the feature set of the app 'Hungama Music – Songs & Videos' accesses Photos/Media/Files content and, under other settings, also lists modifying system settings, viewing network connections and full network access. Correspondingly, whenever we install this app on a mobile phone, it adds some updates to WhatsApp, such as:

• Device & App history: information can be retrieved by running apps
• Identity information, such as find accounts on the device, add/remove accounts, read your own contact card
• Contact information, such as read contacts and modify contacts
• Location information: nearby location as per network, and precise location as per GPS and network
• SMS information, such as receive text messages and send SMS messages
• Some more…

Fig. Malicious behaviour of the app 'Hungama Music Songs & Videos'
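The kind of mismatch shown in this example, permissions that the app's description does not account for, can be framed as a simple set comparison. The following is only a minimal sketch under stated assumptions, not the implementation used in this work: the `PERMISSION_FEATURES` mapping, the function names and the sample inputs are hypothetical and chosen purely for illustration.

```python
# Hypothetical mapping from a requested permission to description words
# that would plausibly justify it. This table is an assumption made for
# illustration; it is not taken from the paper.
PERMISSION_FEATURES = {
    "READ_CONTACTS": {"contact", "friend", "call"},
    "ACCESS_FINE_LOCATION": {"location", "gps", "map", "weather", "track"},
    "SEND_SMS": {"sms", "text", "message"},
    "READ_EXTERNAL_STORAGE": {"music", "song", "file", "photo", "ringtone"},
}


def unexplained_permissions(permissions, feature_words):
    """Return, sorted, the permissions that no description feature explains.

    A permission missing from PERMISSION_FEATURES is also reported,
    because nothing in the table can justify it.
    """
    features = set(feature_words)
    return sorted(
        p for p in permissions
        if not (PERMISSION_FEATURES.get(p, set()) & features)
    )


def is_outlier(permissions, feature_words):
    """Flag an app as an outlier if any permission is unexplained."""
    return bool(unexplained_permissions(permissions, feature_words))


# Made-up inputs: a music app whose manifest also asks for SMS and contacts.
manifest_permissions = ["READ_EXTERNAL_STORAGE", "SEND_SMS", "READ_CONTACTS"]
description_features = ["music", "song", "ringtone", "cut"]
flagged = unexplained_permissions(manifest_permissions, description_features)
```

In this toy setup, storage access is explained by the music-related features, while the SMS and contacts permissions are reported for review. A real system would need a far richer mapping between permissions and description features.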

This problem is novel in the current scenario, which is why it has attracted researchers' attention. A few recent studies have already experimented with ways to resolve this issue [].

Our work is mainly motivated by the work of Gorla et al. [], who studied app behavior and focused on checking whether an application does what it claims. To find malware applications, they formed clusters of apps according to app behavior and then identified outliers in each cluster with respect to the sensitive APIs the applications use. We perform a similar operation in this paper, so the question arises of how our work differs from theirs. In our work we try to classify apps appropriately, and to accomplish this we use topic modeling algorithms to make the results effective. Two topic modeling approaches, Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA), are used to classify the chosen Android apps according to their descriptions. Using both techniques lets us analyze app behavior expeditiously and validate that the classification of apps is appropriate.

Outlier app detection, which detects malicious apps, is performed by extracting features from the Android permission manifest file. Avdiienko et al. used sensitive APIs to detect malware behavior in Android apps and, for this purpose, mined apps for abnormal usage of sensitive data [].

A few more works relevant to our proposed framework are as follows.

The authors of [] also contributed considerably to this problem and proposed the DroidADDMiner system to detect, classify and characterize Android malware behavior. DroidADDMiner extracts features by mapping them to sensitive APIs. Walid Maalej et al. [] worked on user reviews, identifying and extracting bugs and ideas for new features; they classified users' reviews into the categories bug report, feature request, user experience and rating. Federica Sarro et al. [] worked on the feature lifecycle; their empirical analysis shows how app features spread, migrate, remain and die in app stores. Balabantaray et al. [] concentrate on clustering documents to gather relevant information in a cluster using K-Means and K-Medoids, and compare the two to find out which algorithm is better for clustering. They further use sentence weights to focus on the key points of a whole document, which makes it easy for people to read only those documents that are relevant to them. Pandita et al. [] concentrate on finding malicious applications, i.e. on whether an application's description gives any indication of why it uses a permission. Another study [] focuses on the use of the WEKA tool for clustering with K-Means, using Euclidean distance between two points to assign them to clusters, and also concentrates on cluster analysis. Blei et al. [] describe Latent Dirichlet Allocation (LDA), a probabilistic model for collections of textual data that finds short descriptions, i.e. topics with probabilities, from a collection of documents. This enables the processing of huge document collections while maintaining statistical relationships, and they report results in document modeling and text classification.

Therefore, the focus of our paper is to classify apps appropriately according to their feature set and to detect outlier (malevolent) apps. The work has been done primarily in two directions:

• App classification: classify Android apps on the basis of their feature vector set (derived from the app description) using the topic modeling techniques NMF and LDA.
• Outlier app detection: outlier apps are apps that access the user's phone content without taking permission and do not perform as described; they are detected with the help of the app permission and manifest file.

The rest of the paper is structured as follows. Section II describes the projected approach and the process flow of app classification and malicious app detection. Section III presents an overview of the app dataset used. Section IV explains the preprocessing done to reduce high-dimensional and unneeded data. Section V covers app classification: it discusses the topic modeling approaches used, NMF and LDA, and their outcome, and finally assigns clusters using the k nearest neighbor algorithm on the basis of the similarity or difference in the probabilities assigned by the topic modeling algorithms. The section after that discusses outlier app detection by mapping the feature vector list against the features mentioned in the app manifest file, and the final section concludes the paper.

II. PROJECTED APPROACH AND PROCESS FLOW OF MALEVOLENT APP DETECTION

Throughout this description, for the purpose of explanation, details of the work are set forth to provide an understanding of the experiments performed. The canonical steps are shown in the process flow diagram of the projected approach (see the process flow figure below). The approach is divided by four boundaries, which correspond to the four broad phases used to classify apps into appropriate feature-based clusters and to detect the outlier apps of every cluster.

The first boundary is the data extraction phase, which crawls and extracts app content, such as the app title and description, from the Google Play Store for the experiments. The second boundary is the data preprocessing phase, which is required to reduce missing and high-dimensional data using filtering and the Porter stemming technique. It converts the data into an analyzable form by removing the unneeded parts of the extracted app content. The third and most influential boundary is the classification boundary, which classifies apps on the basis of the cleaned app description files produced by the preprocessing phase. Topic modeling approaches are used to classify apps appropriately on the basis of their features. The outcome of this boundary is a feature vector list for all apps under consideration, together with clusters of apps formed on the basis of the probability assigned to each app according to its feature vector list. We first apply Non-negative Matrix Factorization to generate the feature vector list from the app descriptions, and then use the probabilistic approach Latent Dirichlet Allocation (LDA), which helps in upgrading and associating the feature vector list and assigns probabilities to the features in it.

Fig. Process flow of Malevolent App detection

In the same phase, clusters are formed according to the feature-based classification using the K nearest neighbor algorithm, which places all classified apps in suitable clusters based on a feature similarity measure. The fourth and last boundary is the outlier detection boundary. It checks the permissions of all apps under consideration using each app's manifest (permission) file and then matches the features in each app's feature list against the features mentioned in its manifest file. If the features shown to users in the description are not similar to those accessed or mentioned in the manifest file, the app is treated as an outlier or malevolent app. In other words, if features that exist in the manifest file are not part of the feature vector list, the app is identified as an outlier.

To make the projected approach clear at every step, we explain all steps with the help of a sample case study, the app 'MP3 Cutter and Ringtone Maker', and use it to clarify the intermediate steps.

III. OVERVIEW OF USED APP DATASET

To conduct th[…] not considered to perform experiments. Each description is crawled into a separate text file named after the application. The description is a detailed account of the features and activities inside an app. We observed that applications of the same category mostly have similar descriptions; that is, the basic root words of apps in the same category are the same. From the collected data, we randomly sampled a subset to show the data format, which is given in Table I. A snapshot of the Google Play Store, showing the content available for an app, is given in the snapshot figure below.



TABLE I

OVERVIEW OF USED APP DATA

600 Apps of 10 Categories (60 Apps of each category)
Platform: Google Play Store Apps
Average description length (words): 404

Sample App title                    Category         Description length (words)
Hotstar TV Movies Live Cricket      Entertainment
Recipe Book                         Health
True Balance                        Lifestyle
Lybrate Consult a Doctor            Medical
Wynk Music MP3 & Hindi songs        Music
Dailyhunt (NewsHunt) News           News
Amazon India Shopping               Shopping
VoxWeb                              Social
Cricbuzz Cricket Scores News        Sports
Transparent Glass Clock Widget      Weather
…                                   …                …

Fig. Snapshot of 'MP3 cutter and Ringtone Maker' showing the content present on the Google Play Store for the chosen app (labelled regions: App Title, App Category, App Description)

IV. DATA PREPROCESSING

To identify a set of topics, Natural Language Processing (NLP) is applied to the descriptions for filtering and stemming. For this purpose we chose English-language content only, because of its predominance; other languages have not been considered here. Filtering is a requisite step, as inconsistent data may lead to false conclusions. Filtering has therefore been applied to remove the stop words, using the gensim Python library. The filtering figure shows the description of the chosen case study app before filtering in (a) and the result after stop word removal in (b).

Fig. Filtering process: (a) before filtering, (b) after applying filtering on the App description

After filtering, we apply stemming to all filtered app description files. Stemming is an NLP technique used to identify the common root of related words: "consign", "consigned", "consigning" and "consignment" all match the common root word "consign", as shown in the stemming figure. It improves the results of later NLP processes because it reduces the number of distinct words in the descriptions. Items other than text, such as HTML tags, were also removed from the descriptions. The result is an app description with reduced dimensionality, which improves further processing.

Fig. A basic example of how stemming works

V. APP CLASSIFICATION

We create a feature vector list to classify apps appropriately. The feature vector list is the list of features that are essential for a particular category of apps. The input for this phase is the stemmed description files of all apps, to which we applied two topic modeling approaches to create the feature vector list: the deterministic algorithm Non-Negative Matrix Factorization (NMF) and the probabilistic method Latent Dirichlet Allocation (LDA). A non-negative description-feature matrix is generated using NMF; thereafter LDA is used to map the features present in each description to the feature vector list, and LDA also assigns a probability to each app according to this mapping.
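The stemmed descriptions that serve as input to this phase come from the filtering and stemming steps of Section IV. Those steps can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code used in this work: the paper relies on the gensim library for stop word removal and on Porter stemming, whereas the sketch substitutes a tiny hand-written stop word list and a crude suffix stripper so that it runs with the Python standard library alone. The sample description is made up.

```python
import re

# Tiny illustrative stop word list (an assumption; the paper uses gensim).
STOP_WORDS = {
    "a", "an", "and", "as", "for", "in", "is", "it", "of", "on",
    "that", "the", "this", "to", "with", "you", "your",
}

# Suffixes tried in this order. This is a crude stand-in for the Porter
# stemmer, just enough to reproduce the "consign" example from the text.
SUFFIXES = ("ment", "ing", "ed", "s")


def strip_html(text):
    """Replace HTML tags with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(description):
    """Remove HTML, drop stop words, then stem the remaining tokens."""
    tokens = tokenize(strip_html(description))
    return [stem(t) for t in tokens if t not in STOP_WORDS]


# Made-up description in the spirit of the ringtone-maker case study.
sample = "<b>Cut</b> the best part of your songs and save it as ringtones"
cleaned = preprocess(sample)
```

The suffix stripper over-stems and under-stems many English words, which is why a proper Porter implementation is preferable in practice; it is used here only to keep the example self-contained.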




A. Non-Negative Matrix Factorization

NMF is a well-known topic modeling algorithm for data exploration. It produces a non-negative description-term matrix to bring out the hidden latent features in the data, and all features in an app description can then be viewed as being built up from these latent features. The objective of using NMF here is to find the latent features, corresponding to topics, that are most prominent across all the app description data. Each app can be associated with multiple topics. Words from the descriptions were tokenized and, after decomposition, words that frequently occur together were grouped; for instance, "weather", "widget" and "forecast" would be grouped together, while "cricket", "match" and "live" would be assigned to another group. NMF generates lists of the words most strongly associated with a specific topic. In doing so, NMF discards the fine details of the word-frequency matrix and reconstructs only the salient ones. In the implementation, stemmed text is fed as input, and the number of topics and the number of words within each topic can be chosen freely. Table II shows the NMF-formed topics for the numbers of topics and words we chose for illustration.

B. Latent Dirichlet Allocation

LDA is then applied to assign probabilities according to the frequency of the NMF-formed topics in an app description; that is, LDA assigns a probability to each app according to the frequency of each NMF topic in that app. We consider one more parameter for probability assignment: LDA also maps some additional words, beyond those already present in the NMF-formed topics, into topics, and this is reflected in the app probability. If a relevant word in an app description does not lie in any NMF-formed topic, LDA maps it to an existing NMF-formed topic on the basis of the similarity of that word to the existing topics. Finally, LDA assigns each app to topics on the basis of the assigned probability.
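As a rough illustration of the factorization in subsection A, whose topics feed the probability assignment described above, the sketch below factorizes a small description-term count matrix V into non-negative factors W (apps by topics) and H (topics by terms) using the standard multiplicative update rules of Lee and Seung. It is a pure-Python sketch under stated assumptions: the paper does not say which NMF implementation was used, and the toy matrix, vocabulary, function names and parameter values here are all made up for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2
        for i in range(len(v))
        for j in range(len(v[0]))
    )


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factorize V (n x m) into W (n x k) and H (k x m), both non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        numer = matmul(wt, v)
        denom = matmul(matmul(wt, w), h)
        h = [[h[a][j] * numer[a][j] / (denom[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        numer = matmul(v, ht)
        denom = matmul(w, matmul(h, ht))
        w = [[w[i][a] * numer[i][a] / (denom[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


def top_terms(h, vocab, n_top=3):
    """For each topic (row of H), list the n_top highest-weighted terms."""
    return [
        [vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n_top]]
        for row in h
    ]


# Toy description-term counts (made up): rows are apps, columns are terms.
vocab = ["weather", "widget", "forecast", "cricket", "match", "live"]
v = [
    [3, 2, 2, 0, 0, 0],
    [2, 1, 3, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 2, 3, 1],
]
w, h = nmf(v, k=2)
topics = top_terms(h, vocab)
```

Each row of `h` plays the role of one topic's word list, as in Table II, and each row of `w` indicates how strongly an app is associated with each topic, which is how an app can belong to more than one topic.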

LDA gives the probability along with the most appropriate words chosen from the app description, and clustering is then applied to assign an app to a particular topic on the basis of this probability. Sample probability assignments of each topic to apps are shown in Table III, the topic-wise probability distribution table for all apps. LDA has been applied using a Python library [].

C. K-Nearest Neighbour Clustering

The next step is to assign an app to a particular topic. Under our approach, however, an app may be assigned to more than one topic; it cannot be confined to a single cluster, and so an app is assigned to multiple categories. To validate our approach, we mapped our assigned categories to the Google Play Store categories. The Google Play Store places each app in exactly one category, which is unsuitable for clustering the apps. Each app is identified by a certain set of words along with their probabilities. After analyzing these, the highest-probability word is assigned to the topic it suits best, and subsequently the next-highest-probability word is assigned to a topic. After this, K-Means is applied using the WEKA tool on the probabilities obtained by LDA to form clusters of the apps, as shown in the cluster figure below. The figure shows, probability-wise, the number of apps lying in each cluster of a particular category.

Fig. App-wise clusters on the basis of LDA-assigned probability
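The clustering step above, K-Means applied to the per-app topic probabilities, can be sketched in a few lines. The paper uses the WEKA tool for this; the plain-Python version below is only a stand-in for illustration, and the probability vectors, function names and parameter values are made up.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means: returns (labels, centroids) for a list of vectors."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up topic probability vectors (one row per app, one column per topic).
points = [
    [0.90, 0.05, 0.05],  # hypothetical weather app
    [0.85, 0.10, 0.05],  # hypothetical weather app
    [0.05, 0.90, 0.05],  # hypothetical music app
    [0.10, 0.85, 0.05],  # hypothetical music app
]
labels, centroids = kmeans(points, k=2)
```

Note that k-means gives each app a single cluster label. Capturing the paper's point that an app may belong to several topics would require keeping the full probability vector alongside the label, or thresholding the probabilities instead of taking one hard assignment.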

TABLE II

NMF FORMED TOPICS

Topic: phone, track, app, location, friend, gps, use, mobile, text, product
Topic: day, widget, clock, forecast, hour, weather, current, screen, time, condition
Topic: news, latest, india, app, read, hindi, world, break, sports, international
Topic: music, song, play, player, songs, sound, hindi, listen, genre, free
Topic: event, wallet, music, games, indian, movie, fun, bollywood, channel, payment
Topic: free, mobile, data, recharge, pack, G, payment, G, earn, download
Topic: shop, product, app, online, offer, fashion, best, price, deal, like
Topic: heart, step, group, payment, rate, pressure, application, free, screen, medicine
Topic: live, cricket, match, IPL, score, match, team, world, video, challenge
Topic: app, use, com, doctor, health, new, time, help, india
Topic: bp, wallet, book, payment, weight, offer, disease, fitness, health, doctor





TABLE III

TOPIC WISE PROBABILITY […]

ACKNOWLEDGMENT

We would like to thank Vishal Bisht, Ms. Indu Chawla and the college authorities of JIIT, Noida for helping us and providing us with a suitable environment for the successful completion of this work.

REFERENCES

[] […] International Conference on Software Engineering. ACM.
[] Avdiienko, Vitalii, Konstantin Kuznetsov, Alessandra Gorla, Andreas Zeller, Steven Arzt, Siegfried Rasthofer, and Eric Bodden. "Mining apps for abnormal usage of sensitive data." In IEEE/ACM IEEE International Conference on Software Engineering. IEEE.
[] Li […]