Proceedings of the International Conference on Pattern Recognition, Informatics and Medical Engineering, March 21-23, 2012
Deceptive Phishing Detection System (From Audio and Text messages in Instant Messengers using Data Mining Approach)
Mohammed Mahmood Ali Department of CSE Muffakamjah College of Engineering & Technology Hyderabad, Andhra Pradesh, INDIA e-mail:
[email protected] Abstract: Deceptive phishing is a major problem in Instant Messengers (IMs): much sensitive and personal information is disclosed through socially engineered text messages, for which a solution has been proposed [2]. Detection of phishing through voice chatting in IMs has not yet been addressed, which is the motivation for this work. We propose a solution to this privacy problem using Association Rule Mining (ARM), a data mining approach, integrated with a speech recognition system. Words are recognized from speech with the help of FFT spectrum analysis and LPC coefficients. Online criminals nowadays use voice chatting together with text messages, or either of them alone, in IMs to draw out personal information, which threatens privacy. To preserve privacy, we developed and experimented with an Anti-Phishing Detection system (APD) for IMs that detects deceptive phishing in text and audio collaboratively. Keywords- Data Mining; Instant Messenger; Deceptive Phishing; Association Rule Mining (ARM); Anti Phishing Detection (APD); Speech Recognition system; Fast Fourier Transform (FFT); Linear Predictive Coding (LPC);
I. INTRODUCTION
Phishing is a fraudulent trick for stealing a victim's personal information by sending spoofed, socially engineered messages, here through Instant Messengers. Over the past decades online identity fraud has grown from small-scale attacks into widespread syndicated crime, as identified in e-mail. Concrete work exists to detect deceptive phishing in Instant Messengers for text messages [2], but it is ineffective for voice chatting, a fast means of communication that e-criminals have now adopted [3]. Data mining techniques emerged to address the problem of understanding ever-growing volumes of structured and unstructured data, for example by finding frequent patterns within large data sets using association rule mining [4]. In an Instant Messenger [2] the phisher tries to find out passwords and security-related information through questions, pretending to be a trustworthy chatmate, through voice chat, text messages, or both at different intervals of time. In IMs, deceptive phishing has to be tackled dynamically, and there are no robust techniques yet developed
978-1-4673-1039-0/12/$31.00 ©2012 IEEE
Dr. Lakshmi Rajamani Department of CSE University College of Engineering, Osmania University Hyderabad, Andhra Pradesh, INDIA e-mail:
[email protected] to do this, as the existing anti-phishing techniques are equipped to deal with static phishing [5][6]. In the static anti-phishing technique, a blacklist of suspected mail-ids is maintained in centralized blacklist servers [5], which disseminate the vetted blacklist to end users for enforcement. These techniques are ineffective for detecting phishing in Instant Messengers. Two categories of deceptive phishing attack are popularly employed in IMs: password phishing scenarios and security-question phishing scenarios. In the second, the phisher tries to trace out personal information by acting as a trustworthy chatmate and thereby gains access to confidential data. There is no robust technique to deal with such attacks in IMs [6]. To our knowledge, this is the first attempt to apply the association rule mining technique to the tables/log files extracted from the transaction database (TDB), using the information retrieval system discussed in [19], when text or audio messages are exchanged between chatmates in an Instant Messenger as shown in Table 1. Our contribution is the integration of an Instant Messaging system with a phishing detection system, using the data mining technique of association rules [6] and an information retrieval technique, to detect dynamic phishing in both voice and text instant messages. In the remainder of the paper the term "messages" means both voice and text messages, and the term "phishing" means deceptive phishing. The proposed system, named the Anti-Phishing Detection system (APD), dynamically traces potential phishing attacks when messages are exchanged between chatmates of an Instant Messaging system. Current Instant Messaging systems do not have any means to deal with phishing.
The remainder of this paper is organized as follows. This section provides an overview of Instant Messaging systems and their deficiencies. Section II explains the problem statement and the work done to date, whereas Section III explains the detailed architecture of the proposed APD-IM system and the steps for integrating speech detection with text messages collaboratively in IMs. Detection of phishing messages in text is possible [2]; detecting phishing words from audio messages, with or without text messages, is explained in this paper. Section III also explains the general process followed in the proposed system. Section IV shows experimental results with patterns generated for threshold support and confidence during a traditional phishing scenario for different transactions. Section V concludes the paper with an outlook on future research: IMs must be enhanced to detect video phishing collaboratively with audio and text messages, integrated with 3G and 4G mobile technologies, efficiently and at high processing speeds.
II. PROBLEM STATEMENT IN INSTANT MESSENGERS AND RELATED WORK
As many as 98,256 phishing attacks were analyzed by the APWG in 2011 [3]; phishers are constantly experimenting and adapting. In a typical phishing scenario through mail, the phisher sets up a fake website, tricks people into logging in to the fake page, and collects confidential and personal information, specifically in the e-banking sector. Instant Messengers have been adopted in most countries as a useful day-to-day tool [8] for quick response; studies of IM text messaging and file transfer frequency discuss worms, their analysis, and countermeasures in IMs [9].
Table 1. Chat between two chatmates. In the original table, words marked in blue indicate audio speech and words in black are normal text messages exchanged; xxxx and yyyy represent place names.

Chatmate-1 | Chatmate-2
Hello do u hav any pets? | s I hav 2
Whats ur fav food | My fav food is pizza
Who was ur fav teacher | I have many
What is ur fav past time | I play number games
What is ur lucky no | My lucky no is 9
(a) First transaction for first day

Chatmate-1 | Chatmate-2
Where do u stay or asl please | I stay at xxxxxxxxx
In which school did u study | I did my schooling at yyyy
What was ur fav subject | Xxxxxxxxxx
What is ur age | 25, and what about urs
25 years 2 months 2 days 24 hours old | Oh interesting
What is your dob | 20-10-1979
You are 5 months elder than me, Can I call you my big brother, if don’t mind | May be not sure Hey its ok.
(b) Second transaction for second day

Chatmate-1 | Chatmate-2
I was tired standing at my bank today | Where do u have account?
I have at xxxx place where do u have? | I hav at xxxx.
Where is the location of ur bank? | near to xxxx place
Do u hav online account? | Yes do u?
I have to create one. Do u have any idea about the username | We can keep ids or names in capital letters
ok, thanks for giving advice | Its all right
(c) Third transaction for third day

Chatmate-1: Just a minute, what passwords do you suggest for my account to be highly secure / Hmm... Its not so secure, as everyone knows it. / Its fine, what are special characters that many / its too complex to remember / oh is it….
Chatmate-2: Keep your Employeeid, / keep your name and use special characters at beginning or end @,~!@#$%, or Shift+number / hey don’t worry its easy to remember / remember the numbers eg DOB:20-10-1979, press shiftkey+number
(d) Fourth transaction for fourth day

Chatmate-1 | Chatmate-2
Is the procedure for creating online bank account same as normal account? | First u need to go to bank and show all ur proofs.
Can I use the same technique of creating pswds? | Yes,u can use special Characters as I told earlier.
Is it safe to use special characters as passwords? | Obviously. Its difficult to trace.
(e) Fifth transaction for fifth day
(f) Sixth transaction for sixth day ……… Nth transaction for Nth day

Popular systems such as AOL Instant Messenger, MSN Messenger, ICQ, Yahoo Messenger, Google Talk, Skype and Internet Relay Chat (IRC) have changed the way we communicate with friends, acquaintances, and business colleagues. Once limited to desktops, popular Instant Messaging systems are finding their way onto handheld devices and cell phones, allowing users to chat from virtually anywhere. The number of corporate instant messaging users is expected to grow to over 500 million by 2012, with an additional 800 million home computer users having IM systems. Unfortunately, while IM systems have the ability to fundamentally change the way we communicate and do business [7], many of today’s implementations pose security challenges. Most IM systems presently in use were designed with scalability rather than security in mind with respect to deceptive phishing attacks. Many freeware IM programs lack encryption capabilities, and most have features that bypass traditional corporate firewalls, making it difficult to control instant messaging usage. Some of these systems have insecure password management and are vulnerable to account spoofing and denial-of-service (DoS) attacks. Even worse, no firewall on the market today can scan instant messages for deceptive phishing. While instant messaging may seem like a new technology, it is actually decades old. The IRC system, developed in 1988 by Jarkko Oikarinen, is still in use. It allows users to form ad-hoc discussion groups, chat peer-to-peer with one another and exchange files, as seen today in many different messengers that provide the same basic service without detecting deceptive phishing messages.

The basic Instant Messaging architecture provides chat, news alerts, and conferences. Instant Messaging resources include a web server and a Lightweight Directory Access Protocol (LDAP) server [10]. In this scenario, first the LDAP server provides user entries for authentication and lookup; second, chatmates download the Instant Messaging resources from the web server or System Application Server; third, chatmates stay connected to the Instant Messaging server through an Instant Messaging multiplexor that supports text, audio and video chatting dynamically.
A comparative study of the AOL, Yahoo and MSN Instant Messengers, with a taxonomy of features and functions, is discussed in [11] along with the protocols used for passing instant messages. The ability of IM to collect and analyze information in an e-learning environment [14] gave users a flexible and easy learning methodology, and coupled with presence and availability management services IM is emerging as a killer application in wireless and wire-line networks [12]. Filtering and spam detection breathed new life into IMs [9]. Integration of IMs in mobile collaborative learning helped mobile users [13], but the ability to detect and filter deceptive phishing remains incomplete for audio and text messages in IMs.
Phishing detection tools [14] and security and identification indicators for browsers against spoofing and phishing attacks [15],[16] are known, but detecting and identifying phishing websites in real time is a difficult task because it depends on many factors such as URL and domain identity and security and encryption [17]. Other work identifies vulnerabilities that allow phishing sites to be created and suggests methods to identify common attacks, helping webmasters and their hosting companies defend their servers [18], and discusses legal risks for phishing researchers [6]. Nowadays people use social phishing in IMs via text and audio messages. Phishing messages in IMs can be detected if text messages alone are sent [2], but if text and audio messages are used collaboratively, or either of them alone, it is difficult to detect phishing attacks.

In this paper we propose the APD-IM system for detecting phishing messages, whether they are text messages, audio messages, or both used collaboratively. Most of the work in this paper concerns finding word parameters from speech and detecting phishing from voice after filtering out unnecessary voices, using FFT word parameters and LPC coefficient parameters [23],[25]. Detection of phishing from text messages was proposed in previous work [2]. This section described significant vulnerabilities present in common Instant Messaging systems and the types of attacks that can exploit users, leading to phishing attacks.

III. PROPOSED SYSTEM ARCHITECTURE OF APD-IM
In this paper we present an association rule mining technique (the Apriori algorithm) [21] to detect deceptive phishing in suspicious messages (audio, text, or either of them) sent using an Instant Messenger between two or more chatmates.

The messages are stored in the Transaction database (TDB). Before the messages are stored in the TDB, unnecessary words are filtered out by searching the Ignore words database (IGWDB) using information retrieval system techniques (stemming, N-gram technique, ignore words) [19]. Frequently recurring words are extracted from the TDB dynamically using the association rule mining technique [20] and stored in the Transaction pattern database (TPDB); Table 4 illustrates a few of the extracted words with the unique ids allocated to them. Rules are then framed dynamically for the words in the TPDB that satisfy the user-defined minimum threshold support and confidence (threshold value) [21]. If the condition is true, the phish words are pushed to the Phishing database (PDB) and an alert message is triggered from the PDB to the chatmates. The system is developed specifically to detect phishing in unusual and deceptive communication in IMs for text and voice messages. The parameters for voice detection are found using spectrum analysis with the help of FFT [23] and LPC coefficient parameters [25], simulated in MATLAB [1]; the parameters are used to differentiate one voice (word) from other voices (words). The proposed method is implemented in Java and integrated with IM. The implementation has six major functional parts:
1. Voice and Text detection Modified Architecture for IM.
2. Integration of Voice and Text messages in TDB.
3. Voice recognition using spectrum analysis (FFT and LPC coefficient methodologies).
4. Differentiate words based on parameters using MATLAB and using Spectraplus.
5. Rules extractions using Association Rule Mining technique.
6. General algorithmic approach for Voice and Text detection in IM’s.

A. Voice and Text detection Modified Architecture for IM
A modified architecture of the voice and text recognition system for IMs is shown in Fig. 1. Chatmates pass audio and text messages in IM together, or either of them alone. Detecting phishing in such cases is a challenging task. Detection of deceptive phishing messages in IMs for text messages is possible [2]; detecting phishing words from audio messages, with or without text messages, is explained in this paper. The text and voice messages need to be filtered by removing unnecessary words; for this, the text messages and voice messages are stored separately in the database. The text and audio messages are later merged dynamically, as explained in Section III.B.

Figure 1. Shows APD-IM Architecture of phishing detection system for Voice and Text messages in IM.
For voice recognition, a long audio track is broken down into smaller clips as shown in Fig. 2; each of these steps is self-explanatory. The audio track may contain breaks in the voice during chat sessions; this is noise that has to be identified and removed using a Hidden Markov Model (HMM) [24]. Training of voices is not discussed in detail in this paper; we considered an ideal situation of sample voices. The working of the voice processor in IM is shown in Fig. 2. The voice processor converts audio clips from *.amr to *.wav format, removes noise [22] from the long audio track and classifies it into independent short clips [1]. Each independent clip is sent to the voice database (VDB) for storage, where unique ids are allocated; these act as input to the TDB after unnecessary words are filtered out using the IGWDB with the help of the information retrieval system technique [19].

Figure 2. Shows Long Audio Track is broken into short clips (voices) via Clip Classifier (Voice Processor) and converted into .WAV format and sent to database for storing, where filtering of unnecessary clips is done and unique ids are allocated for each clip, acts as input for TDB in IM.

The VDB stores the word parameters of each clip. The word parameters discussed in Section III.D are extracted dynamically by the voice processor with the help of FFT and LPC coefficient spectrum analysis using MATLAB [1], as shown in Fig. 2. The word parameters stored in the VDB for every clip are checked against the IGWDB, which holds ignore-word parameters for voice, to filter out unnecessary word parameters, and are then sent to the VWDB as shown in Fig. 1. Unique ids are allocated based on the set of significant word parameters identified, and the words are sent to the TDB for later processing. There the ARM technique is applied to find frequent occurrences of words in the TDB, which are sent to the Transaction pattern database (TPDB). The ARM technique is then reapplied to the TPDB to find phish words that satisfy the user-defined minimum threshold support and confidence. Finally the phish words identified from the TPDB are sent to the PDB, which sends an alert message to the chatmates in IM upon detection of phishing words.

B. Integration of voice and text messages in TDB
The steps involved in integrating voice and text messages in the Transaction database (TDB) are as follows (refer to Fig. 1).
1. If a voice message alone is detected, it has to be handled dynamically by the speech recognition system, which finds the parameters of the voice (peak, frequency, amplitude, THD, etc.) explained in Section III.D. The frequent occurrences of these parameters are captured using the ARM technique [20] and stored in the voice database (VDB). This VDB is immediately compared with the IGWDB, which consists of unnecessary words such as prepositions and articles. Finally the filtered words are chosen [19], allocated unique IDs and sent for storing in the TDB.
2. If a text message alone is detected, it has to be stored in the WDB, and unnecessary words are filtered out by comparing with the IGWDB dynamically using information retrieval techniques. Finally the selected words are sent for storing in the TDB with unique IDs as discussed in [2],[IRS].
3. If voice and text messages are detected collaboratively, the two databases VDB and WDB are merged as one transaction and stored in the Voice Word database (VWDB), which is then compared with the IGWDB to filter out unnecessary words as explained in points 1 and 2 respectively. Finally the selected words are allocated unique IDs and sent for storing in the TDB.
4. Voice may also contain numbers spoken as "2002" or as "two zero zero two", or other such words, which is still a challenging task. We have considered an ideal situation of voice in this paper [1]; this case is out of scope.

C. Voice recognition using Spectrum analysis (FFT and LPC coefficient methodologies).
Speech should first be transformed and compressed in order to simplify subsequent processing. Many signal analysis techniques are available that can extract useful features. Six major spectral analysis algorithms are available, as shown in Fig. 3. Among them, the most popular methods are the Fast Fourier Transform (FFT) and Linear Prediction Coefficients (LPC). Speech signals are converted into a spectrum signal using FFT [23], but FFT requires complex values. Similarly, using the LPC spectrum program we get a different spectrum from the original spectrum, and analysis of that spectrum is done to find other parameters. The structure of a standard speech recognition approach is illustrated in Fig. 4. There are various applications of word recognition, such as mobile communication and on-line and off-line communications. We have used it to detect words from voice in Instant Messengers (IMs) in order to detect phishing words.
Figure 3. Shows Different types spectral analysis algorithms.
Figure 4. Shows General Process of Word detection from Speech signal.
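As an illustration of the spectrum step described in Section III.C, the following is a minimal sketch in Python rather than the MATLAB used in this work. It evaluates the discrete Fourier transform directly from its definition (an O(N²) stand-in for the FFT, which yields the same values) and reports the peak frequency of a clip. The function names, the synthetic 8 Hz tone and the sample rate are assumptions made for illustration and are not taken from the implementation.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Return |X[k]| for k = 0..N-1 using the direct DFT definition.

    This is a slow O(N^2) stand-in for an FFT; both give the same spectrum.
    """
    n = len(samples)
    magnitudes = []
    for k in range(n):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        magnitudes.append(abs(acc))
    return magnitudes


def peak_frequency(samples, sample_rate):
    """Frequency in hertz of the strongest bin below the Nyquist limit."""
    mags = dft_magnitudes(samples)
    half = len(samples) // 2
    k = max(range(1, half), key=lambda i: mags[i])  # skip the DC bin
    return k * sample_rate / len(samples)


# Hypothetical clip: one second of an 8 Hz sine sampled at 64 Hz.
rate = 64
tone = [math.sin(2 * math.pi * 8 * t / rate) for t in range(rate)]
print(peak_frequency(tone, rate))
```

A real clip would be read from a .wav file and is far longer, in which case an FFT routine is the practical choice; the direct form is used here only to keep the sketch self-contained.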
Word recognition from a speech signal using spectrum analysis involves feature extraction, preprocessing, pattern matching and decision making. The word parameters from the spectrum of the speech signal are chosen using statistical methods, which give a range of values for each word as output. These parameters help us to differentiate the words from each other. Every word has a bounded range of values that characterizes it based on these parameters [1].

The word parameters calculated by analyzing the spectrum of the speech signal are mean, median, standard deviation (STD), root mean square (RMS), maximum peak, minimum peak, width of maximum peak, signal-to-noise ratio (SNR), peak frequency, peak amplitude, total power, total harmonic distortion (THD), THD+Noise, and intermodulation distortion (IMD). These parameters can be obtained using MATLAB and SpectraPlus. Each parameter is bounded within some range of values, and based on these bounds we can differentiate one word from another.

D. Differentiate words based on parameters using MATLAB and using Spectraplus.
To recognize a spoken word dynamically, we recorded the word, converted it into .wav format and stored it in the MATLAB dictionary. A digital signal processing technique is also used to convert clip samples into a series of data that we can interpret from the ".wav" file; we retrieved these samples using "wavread" in MATLAB. To represent the signal in the frequency domain we used the Discrete Fourier Transform (DFT), in which f denotes frequency in hertz and N denotes the window duration in samples, computed using the FFT command in MATLAB. This is done because the length of our signal must be a power of two. The real and imaginary components of the FFT of the signal are stored in a vector x, where x is read from the named file; the algorithm is shown in Fig. 5.

Figure 5. Algorithm that accepts .WAV extension and produces Spectrum of Signal from which word parameters are derived.

The spectrum of the signal after the algorithm is applied is shown in Fig. 6, the time vs. frequency graph plotted in MATLAB. The graph obtained for significant and insignificant parameters of a word is plotted; insignificant parameters are neglected and significant parameters are chosen for finding the word. Only the significant parameters, which differentiate the words from each other, are sent from the VDB to the TDB for storage. For example, consider the significant parameters selected by FFT spectrum analysis for 5 different samples of the single word 'MURDER', shown in Table 2. Among these parameters some significant parameters are selected, whereas insignificant parameters are neglected because they may not be efficient for differentiating the word in the TDB.

Table 2. Word parameters selected by FFT spectrum analysis for 5 different samples of word 'MURDER'.

Some of the word parameters are the same for two different words; in such cases Linear Predictive Coding (LPC) coefficients are efficient. The word parameters are recalculated from the spectrum of the speech signal, which helps us to differentiate the words from each other using the LDA technique [24]. For example, the voice words KILL and BILL have the same word parameters; here µ1 and µ2 are the means of the parameters and σ1 and σ2 are the standard deviations for the words KILL and BILL, used for differentiating words that have the same parameters [1]. Similarly, the significant parameters selected by LPC spectrum analysis for 5 different samples of the word 'MURDER' are shown in Table 3. Finally, correct word recognition is done with the help of the word parameters. We used MATLAB for reading .wav files and finding the spectrum of the speech signal; SpectraPlus is sometimes also used for analysis of .wav files, based on the requirement.

Table 3. Significant word parameters selected by LPC spectrum analysis for 5 different samples of word 'MURDER'.

Figure 6. Shows Spectrum of signal from which word parameters are derived.
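The idea of characterizing a word by bounded parameter values can be sketched as follows. This is an illustrative Python example, not the MATLAB or SpectraPlus procedure used in this work: it computes a few of the listed parameters (mean, median, STD, RMS, maximum and minimum peak) from a list of spectrum values and checks them against per-word bounds. The sample values and the bounds are hypothetical and are not taken from Table 2 or Table 3.

```python
import math
import statistics


def word_parameters(values):
    """Compute a subset of the word parameters from spectrum values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "rms": math.sqrt(sum(v * v for v in values) / len(values)),
        "max_peak": max(values),
        "min_peak": min(values),
    }


def within_bounds(params, bounds):
    """True if every bounded parameter lies inside its (low, high) range."""
    return all(low <= params[name] <= high
               for name, (low, high) in bounds.items())


# Hypothetical spectrum values and hypothetical bounds for one word.
sample = [1.0, 2.0, 3.0, 4.0]
params = word_parameters(sample)
hypothetical_bounds = {"mean": (2.0, 3.0), "max_peak": (3.5, 4.5)}
print(within_bounds(params, hypothetical_bounds))
```

Only parameters named in the bounds are checked, which mirrors the selection of significant parameters and the neglect of insignificant ones.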
E. Rules extractions using Association rule mining technique
The significant word parameters that differentiate voice words from each other are stored in the VDB and compared with the IGWDB to filter out unnecessary words using the IRS technique. The remaining words are sent to the TDB, where frequent occurrences of voice words are identified using the ARM technique and sent to the TPDB. The ARM technique is then reapplied to the TPDB, checking the user-defined support and confidence for the voice words, and the phishing words found are sent to the PDB. Finally the system reports to the chatmate in IM, by checking the PDB, on detection of phishing words.

The TDB stores audio and text words as explained earlier. Unnecessary words are filtered out of the TDB using the IRS techniques discussed in Section II for text messages and Section III for audio messages, based on the existing transactions shown in Table 1. It consists of 5 transactions between two chatmates, from which 16 keywords are picked with unique ids ITEM1 to ITEM16, represented as I1...I16, as shown in Table 4.

Table 4. Shows List of few Words Chosen based on frequent occurrences captured using ARM technique from TDB discussed in Section II & III.

Let us assume that the itemsets satisfying support = 2 or 20% out of 5 different transactions are [{I1,I2,I12}, {I1,I2,I14}, {I1,I2,I15}, {I1,I2,I16}]; these are considered frequent occurrences obtained from the TDB. The rules satisfying confidence = 100% are [{I1^I12=>I2, I2^I12=>I1, I12=>I1^I2, I1^I14=>I2, I2^I14=>I1, I14=>I1^I2, I1^I15=>I2, I2^I15=>I1, I15=>I1^I2, I1^I16=>I2, I2^I16=>I1, I16=>I1^I2}]. These ARM rules are framed, and based on them the items are sent to the PDB as phishing words. ARM is then applied again to detect phishing words. Applying the ARM technique twice improves the accuracy of identifying phishing words. The support chosen is very low because in IM private information is exposed in very little time or within a small number of transactions. While messages are being sent, some words may appear to be phishing words even though they are not; this has to be tolerated by chatmates during chatting in IMs.

F. General Algorithmic steps for Text and Speech recognition system in IM
Chat messages in IM, both text and audio, are examined by the Anti Phishing Detection system (APD). If phishing words are found, APD-IM sends an alert message to the chatmate users at one or both ends, depending on where the APD-IM system is installed; its architecture is shown in Fig. 1. Text words are detected and stored in the WDB, whereas audio words are stored in the VDB, as already explained above in Sections III.B to III.E. The overall working steps of the APD-IM system are shown in Fig. 7.

[Figure 7 flowchart, steps numbered 1 to 6. Box labels: Chatmate start messaging; Directory server; Instant Messenger Server; WDB; VDB; Transaction database where transactions are stored (TDB), IRS technique filters out unnecessary words by checking (IGWDB); Apriori algorithm applied on TDB, patterns detected and stored in (TPDB); Again Apriori applied to TPDB, checks for phishing words and stores in (PDB); YES/NO.]

Figure 7. Shows the General flow of APD-IM system works for detecting phishing words.
1. The chatmate, enabled with IM support, establishes a connection with the Instant Messaging server, which checks the authentication of the chatmate through the Directory Server. If the chatmate is authenticated, he can start sending messages.
2. The Instant Messenger server forwards messages, which include audio (VDB), text (WDB), or both, to the transaction database (TDB). The TDB stores the messages exchanged between two or more chatmates after unnecessary words are filtered out by checking the Ignore word database (IGWDB) using the IRS technique.
3. The Apriori algorithm is applied to the TDB; the patterns detected are stored in the Transaction pattern database (TPDB).
4. Apriori is applied again to the TPDB to check for phishing words; if any are detected they are sent to the Phishing database (PDB).
5. If phishing words are detected, a YES is forwarded to the Instant Messenger server, otherwise a NO.
6. If the result is YES, the Instant Messenger sends an alert message to the victim chatmate about the possible phishing attack; if the result is NO, the Instant Messenger server proceeds further.
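The steps above can be sketched end to end in a few lines. The following Python example is a simplified illustration under stated assumptions, not the Java implementation of APD-IM: the ignore-word set stands in for the IGWDB, the list of word sets stands in for the TDB, the frequent itemsets stand in for the TPDB, and the words appearing in rules that meet the confidence threshold stand in for the PDB. The messages, thresholds and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical stand-in for the IGWDB (ignore words).
IGNORE_WORDS = {"the", "is", "a", "of", "my", "what", "your"}


def filter_words(message):
    """IRS-style filtering reduced to lower-casing and ignore-word removal."""
    return {w for w in message.lower().split() if w not in IGNORE_WORDS}


def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise (Apriori-style) search; returns {itemset: support count}."""
    frequent = {}
    candidates = {frozenset([w]) for t in transactions for w in t}
    size = 1
    while candidates and size <= max_size:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join frequent sets, then prune candidates with an infrequent subset.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = {c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, size - 1))}
    return frequent


def derive_rules(frequent, min_confidence):
    """Return (lhs, rhs, confidence) for rules meeting the threshold."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


# Hypothetical transactions (stand-in for the TDB).
tdb = [filter_words(m) for m in (
    "what is your bank password",
    "my bank password is secret",
    "the weather is nice",
)]
patterns = frequent_itemsets(tdb, min_support=2)        # stand-in for TPDB
found_rules = derive_rules(patterns, min_confidence=1.0)
flagged = set().union(*(lhs | rhs for lhs, rhs, _ in found_rules))  # PDB
alert = "YES" if flagged else "NO"
print(sorted(flagged), alert)
```

Whether a flagged word is truly a phishing word still depends on the thresholds chosen, so false alerts remain possible with a low support value.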
International Conference on Pattern Recognition, Informatics and Medical Engineering (PRIME-2012) The working of the APD-IM algorithm is shown in Fig. 8. Input: Instant Messages in Transaction Database (TDB) (day to day)
Output: Alert Phishing message to IM chatmate if detected 1 Do //Apply IRS for filtering (IGWDB) and pick words and push to //(TDB) which include both Text and Audio(WDB and VDB) // merged and stored in VWDB as discussed in section III. 2 { Do //Scan TDB for Relevant patterns //Apply Apriori technique find patterns from TDB // and push to Transaction pattern database (TPDB) {Call Apriori algorithm and Scan TDB/ /generates patterns from TDB //and stores in TPDB 3 Push patterns to TPDB 4 } Until TDB!=NULL //Apply Apriori find min_support and confidence for TPDB //user defined 5 Re-Call Apriori algorithm and Scan TPDB 6 { Derive association rules dynamically for freq_words 7 Calculate confidence//user defined 8 Check the rules satisfying threshold //user-defined 9 If (Confidence satisfies Threshold value) // Pick relevant words // Push Phishing words in PDB permanently 10 { Scan TPDB and Push words to PDB 11 }While TPDB!=NULL //satisfy min threshold support & Conf. 12 if PDB==TDB // Check phish words in TDB if detected 13 Report to Instant Messenger chatmate as Phishing word 14 else 15 return to IM // do nothing 16 } 17 } Until TDB! =NULL
Figure 9. Shows Databases tables (TDB, IGWDB, TPDB, PDB, and Chat backupDB, VDB, WDB, VWDB) which is used by DataProcess program.
in Fig. 10. The detected phish words are updated to PDB database.
Figure 8. Shows Algorithm of APD-IM for storing transactions and reporting to IM chatmate regarding Phishing detection in IM.
IV. IMPLEMENTATION AND EXPERIMENTAL RESULTS APD-IM implemented using Apache TomCat 6.0 for Web Server for creating separate sessions for each chatmate with Browser support (IExplorer 6.5 or higher), SQL Server 2005 for Database and Java 6.0 for Apriori Algorithm for finding frequent patterns, using Information Retrieval system technique from database, odbc/jdbc drivers for connectivity. The software Simulation tools are also used like MATLAB and SPECTRAPLUS for Spectrum analysis from speech signal for calculating word parameters using FFT and LPC coefficients dynamically.
Figure 10. The DataProcess program identifies frequently occurring patterns using the ARM technique (Apriori) with minimum support and minimum confidence.
The DataProcess program checks that the number of lines exchanged between the chatmates is below 25 (a user-defined limit). The APD-IM system was tested on the number of transactions (lines) between the chatmates, with user-defined minimum support and minimum confidence, versus the number of phishing words detected from the transaction patterns database (TPDB); the results are shown as a columnar graph in Fig. 11(a) and on X-Y axes in Fig. 11(b). As the number of transactions between the chatmates increases (up to 145 lines), the counts of transaction patterns and phishing words flatten into a constant straight line, as seen in Fig. 11(b), and the system may not detect phishing words as predicted, so transactions must be deleted frequently.
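One way to keep the number of active lines below the user-defined limit is a sliding window that moves the oldest lines out of the working set. The sketch below assumes this policy; the paper does not state how deletions are performed, and the class name `TransactionWindow` and the idea of moving evicted lines to a backup list (in the spirit of a chat backup table) are illustrative assumptions.

```python
from collections import deque


class TransactionWindow:
    """Keeps fewer than `limit` chat lines active; older lines go to a backup."""

    def __init__(self, limit=25):
        self.limit = limit
        self.lines = deque()   # active transactions (the working set)
        self.backup = []       # evicted lines, kept as chat history

    def add(self, line):
        self.lines.append(line)
        # Evict the oldest lines until the count is strictly below the limit.
        while len(self.lines) >= self.limit:
            self.backup.append(self.lines.popleft())
```

With the default limit of 25, the window never holds more than 24 lines, so pattern mining always runs on a recent, bounded slice of the conversation.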
The sequence of steps is shown in Fig. 2. When messages are sent between the chatmates, several databases are used dynamically: chatData/TDB (stores messages between the current chatmates), chatData_bkp (stores historical chat messages), Ignorewords/IGWDB (stores ignore words, prepositions, etc., which the IRS neglects), phishwords/PDB (stores phishing words detected dynamically), Transpatters/TPDB (stores the frequent patterns detected), voicewords/VDB, Textwords/WDB, and voicetextwords/VWDB; some of them are shown in Fig. 9. The DataProcess program detects phishing words using the TPDB database, which holds the patterns generated from the chatmates' messages in the TDB database. DataProcess combines the Information Retrieval System technique with the Apriori algorithm, and it must always be running in an active state so that it can identify frequent patterns in the messages and detect phishing words.
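The filtering role of the ignore-word table can be sketched as follows. This is an assumption-laden illustration rather than the DataProcess code: the function names, the simple regular-expression tokenizer, and the way text words and recognized voice words are merged into one transaction are hypothetical choices that only mirror the roles of IGWDB, WDB, VDB, and VWDB described above.

```python
import re


def filter_words(text, ignore_words):
    """Lower-case the text, split it into words, and drop ignore words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(w for w in words if w not in ignore_words))


def build_transaction(text_line, voice_words, ignore_words):
    """Merge filtered text words and filtered voice words into one transaction."""
    text_part = filter_words(text_line, ignore_words)
    voice_part = filter_words(" ".join(voice_words), ignore_words)
    return list(dict.fromkeys(text_part + voice_part))
```

Each returned list corresponds to one transaction that a pattern-mining step could then scan.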
Figure 11. (a) Columnar graph of transaction patterns vs. phishing words detected from transactions, for min-skewed-support (2, 3, 4, 5, 6) and min-conf 60%. [Plot series: Transaction Patterns, Phishing words; vertical axis scale 0 to 14.]
Figure 11. (b) Transaction patterns vs. phishing words detected vs. min-skewed support, with min-conf 60%. [Plot axes: x, total number of transactions between users (0 to 150); y, transaction patterns and phishing words detected (0 to 15). Series: Transaction Patterns, Phishing words, support.]
V. CHALLENGES AND FUTURE WORK
The APD-IM is designed to detect deceptive phishing in messages in text and audio format. We have shown experimental results for text messages and for acoustic voice messages (converted into words). Extending the APD-IM to video instant messaging is considerably more complex, because it requires integrating one more sub-component into the Multiplexer, an Image Processing unit that captures images from run-time video; this will be discussed in later work. The other open issues are:
• Short forms (abbreviations) must be stored in the table with unique identifiers.
• When voice input contains numbers, converting between forms such as the numeral ‘0’ and the word ‘Zero’ is still a challenging task; dates and fractional numbers (5/2) in speech similarly require conversion.
• A number may be spoken as “double two” (22), and similar issues arise with Roman numerals; “kg” can mean kilogram or something else.
• Instant Messengers must be enhanced to detect video phishing collaboratively with audio and text messages.
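To make the number-conversion issue concrete, the following sketch handles only the simplest case named above: digit-by-digit speech, including the "double" prefix. It is a hypothetical helper, not part of APD-IM; dates, fractions, Roman numerals, and ambiguous units such as "kg" are deliberately left out, since they remain open problems.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
WORD_TO_DIGIT = {word: str(i) for i, word in enumerate(DIGIT_WORDS)}


def digits_to_words(number_text):
    """Spell out each ASCII digit; any other character is skipped."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in number_text if ch in "0123456789")


def spoken_digits_to_number(tokens):
    """Convert tokens such as ['double', 'two'] into the digit string '22'."""
    digits = []
    repeat = 1
    for token in tokens:
        token = token.lower()
        if token == "double":
            repeat = 2  # the next digit word is emitted twice
        elif token in WORD_TO_DIGIT:
            digits.append(WORD_TO_DIGIT[token] * repeat)
            repeat = 1
        else:
            raise ValueError("not a spoken digit: %r" % token)
    if repeat != 1:
        raise ValueError("'double' must be followed by a digit word")
    return "".join(digits)
```

Normalizing both the text and voice channels to one form (digits or words) before pattern mining would let the same number match regardless of how it was entered.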
The outlook is promising: the APD-IM can be enhanced to meet the requirements of wireless and mobile Instant Messengers for 3G and 4G technologies. The APD-IM can be integrated into Instant Messengers if IM distributors are willing to share the data in order to avoid deceptive phishing attacks; we have tested it on our own Instant Messenger test bed.
REFERENCES
[1] Gurpreet Singh, “Word recognition from speech signal using spectrum analysis and LPC,” thesis submitted at Thapar University, 2011.
[2] M. Mahmood Ali and L. Rajamani, “Phishing Detection in Instant Messengers using Data Mining Approach,” Proceedings of ObCom 2011, Springer-Verlag Berlin Heidelberg, Part I, CCIS 269, pp. 490–502, 2012.
[3] “APWG phishing activity trends till December 23rd, 2011.” [Online] http://www.antiphishing.org/phishReportsArchive.html.
[4] Ahmed Jawad, Asim Karim, and Imadullah Khan, “Online algorithms for complete itemset counts using set-to-string mappings,” IEEE, 2006.
[5] Michael Atighetchi and Partha Pal, “Attribute-based Prevention of Phishing Attacks,” Copyright 2009, BBN Technologies.
[6] Christopher Soghoian, “Legal Risks For Phishing Researchers,” IEEE Xplore, 2009.
[7] HwaMin Lee, Doosoon Park, and Min Hong, “An instant messenger system for learner analysis in e-learning environment,” SIGITE '08: Proceedings of the 9th ACM SIGITE Conference on Information Technology Education, 2008.
[8] Kyung Yong Rhe and Wang-Bae Kim, “The Adoption and Use of the Internet in South Korea,” Journal of Computer-Mediated Communication, 2004.
[9] Zhijun Liu, Weili Lin, and Na Li Lee, “Detecting and filtering instant messaging spam - a global and personalized approach,” Secure Network Protocols (NPSec), 1st IEEE ICNP Workshop, 6 Nov. 2005.
[10] Salim et al., “Data Retrieval and Security Using Lightweight Directory Access Protocol,” Knowledge Discovery and Data Mining, WKDD 2009, Second International Workshop, 2009.
[11] R. B. Jennings et al., “A study of Internet instant messaging and chat protocols,” IEEE Network, vol. 20, issue 4, pp. 16-21, July-Aug. 2006.
[12] Debbabi and M. Rahman, “The war of presence and instant messaging: right protocols and APIs,” Consumer Communications and Networking Conference, CCNC 2004, First IEEE, Jan. 2004.
[13] Fu Kai Fang, “Design and implementation of an instant messaging architecture for mobile collaborative learning,” Computing, Communication, Control, and Management, CCCM 2009, ISECS International Colloquium, Aug. 2009.
[14] Weider D. Yu, Shruti Nargundkar, and Nagapriya Tiruthani, “A Phishing Detection Tool,” 33rd Annual IEEE International Computer Software and Applications Conference, Washington, USA, July 2009.
[15] Amir Herzberg and Ahmad Jbara, “Security and Identification Indicators for Browsers against Spoofing and Phishing Attacks,” ACM Transactions on Internet Technology, vol. 8, no. 4, article 16, September 2008.
[16] Juan Chen and Chuanxiong Guo, “Online Detection and Prevention of Phishing,” Communications and Networking in China, First International Conference, 2006.
[17] Maher Aburrous et al., “Modelling Intelligent Phishing Detection System for e-Banking using Fuzzy Data Mining,” International Conference on CyberWorlds, 2009.
[18] Wardman, B. Shukla, and G. Warner, “Identifying vulnerable websites by analysis of common strings in phishing URLs,” eCrime Researchers Summit, eCRIME '09, Oct. 2009.
[19] Gerald J. Kowalski and Mark T. Maybury, “Information storage and retrieval system theory and implementation,” second edition, Springer, 2006.
[20] R. J. Bayardo, “Efficiently mining long patterns from database,” in Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998, pp. 85-93.
[21] R. Srikant and R. Agrawal, “Mining quantitative association rules in large relational tables,” in Proceedings of the ACM Special Interest Group on Management of Data (ACM SIGMOD), 1996, pp. 1-12.
[22] Lawrence R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” IEEE, Feb. 1989.
[23] Jose Leonardo Plaza Aguilar and David Báez López, “A Voice Recognition System for Speech Impaired People,” IEEE, CONIELECOMP, 2004.
[24] Hamid Sharkhzadeh and Li Deng, “Waveform based speech recognition using Hidden Filter Model parameter selection and sensitivity to power normalization,” IEEE Transactions on Audio and Speech Processing, vol. 2, January 1994.
[25] Ibrahim N. Abu-Isbeih, Khaled Dagrouq, and Wael Ali-Sawalmeh, “Speaker identification wavelet transform based method,” IEEE 5th International Multi-Conference on Systems, Signals and Devices, 2008.