An Investigation of Heuristics of Human Judgment in Detecting Deception and Potential Implications in Countering Social Engineering

Judee K. Burgoon
University of Arizona
Email: tqiniu.arizona.edu

Abstract—Social engineering (as used by the military or law enforcement) is an emerging technique for obtaining classified information by interacting with and deceiving the people who can access that information. Rather than using traditional techniques of attacking technical shields such as firewalls, many sophisticated computer hackers find that social engineering is more effective and harder for humans to detect. Why can people not effectively detect social engineering, or more specifically, the art of deception? What can be done to augment human abilities for the task? The current findings point to several factors that influence the human ability to detect deception, including truth bias, stereotypical thinking, and processing ability. Knowing that human detection ability is limited, we propose a method to automatically detect deception that can potentially assist humans. Results show that a system using discriminant analysis to classify deception performed significantly better than humans in detecting deception. The findings can also be applied to general situations to ensure information authentication in scenarios other than social engineering.

Index Terms—Automatic Deception Detection, Social Engineering

I. INTRODUCTION

Information security is about the protection of information and its critical components, including the software and hardware that process, store, and transmit that information [1]. In the era of the World Wide Web, information security has become one of the most critical issues in business, homeland security, and social life [2]. Furthermore, information security is much more than just a technical issue. According to the famous computer hacker Kevin Mitnick [3], it is just a "security illusion" for managers to think that they will be safe if they always install and update their systems with the latest technical protection software and keep training employees on security issues. An intrusion is just "a matter of sooner or later." Why? Because "the biggest online security gap for most computers lies somewhere between the keyboard and the chair" (Curmudgeon, http://cybercoyote.org/security/deception.shtml). In other words, humans are the weakest link in information security.

Social engineering is the art and science of obtaining information by deception and other techniques based on the fundamentals of human behavior and psychology. In information security, social engineering is a non-technical form of intrusion that relies heavily on human interactions. Rather than breaking through technical shields such as firewalls or intrusion detection systems, many sophisticated hackers prefer to deceive or compromise the human agents who can access classified information, so as to obtain that information [3]. As illustrated by Mitnick's earlier successful computer hacking, social engineering is a simple but very effective way of breaking normal security procedures. For example, phishing is an email fraud sent by perpetrators who appear to be trustworthy; if targets believe the deceptive email, they may follow the request and leak a client's confidential information.

According to estimates by the U.S. Federal Trade Commission (FTC), social engineering-related issues cost individuals and businesses approximately $52.6 billion in 2004, and much of that cost is borne by businesses. Since many incidents of individuals or organizations being victimized by methods of social engineering are not well publicized, it may not be an issue that many people think about very seriously. However, whether we are aware of it or not, the problem is widespread, and according to the FTC, it affects approximately 10 million Americans each year. This number is only expected to rise unless more protections are instituted against these malicious activities. Given the huge negative impact of social engineering, there are strong incentives to search for solutions.

In the following sections, we first address social engineering as an art of deception and introduce the fact that humans are poor at detecting deception. Next, we list several possible factors that influence human judgment and explain why humans, even after training, are inevitably influenced by judgment biases and thus cannot effectively detect deception. We then propose an automatic deception detection system (ADDS) to assist humans in deception detection. The research questions are as follows: Is the human ability to detect deception limited by human nature? And if so, can an automatic system be designed to assist humans? In the method section, we describe a deception-related experiment from which empirical data were obtained to study the human heuristics that influence judgment, and we compare the performance of humans and the ADDS. Last, we summarize the findings and discuss possible implications in areas other than countering social engineering.
In order to test the research questions, we selected a set of verbal and vocal cues that can potentially detect deception and then studied the way that humans use these cues in judgment. To compare the performance of machine and human detection, these cues can also be automated by programs.
II. WHY PEOPLE PERFORM POORLY AT DETECTING DECEPTION

Even though cues to deception are available, the accuracy of human detection is not satisfactory across most studies of detecting deceit. Lie detection rates range from 45% to 60%, when 50% accuracy is expected by chance alone [4]. The review by Vrij showed that people are particularly poor at detecting lies (44% accuracy rate) [5]. We predict that human judgment is influenced by bias and stereotypical thinking. These limitations stem from human nature or cultural norms and are thus inevitable. If our hypotheses are true, a possible solution is to design an appropriate system to mitigate the human shortcomings, because a system is not affected by human nature and can make fair judgments. We consider two major factors that influence human judgments.

A. Truth Bias

The first factor involves a variety of cognitive heuristics, or mental shortcuts, that influence humans' accuracy in detecting deceit. People rely on mental shortcuts to judge incoming information [6]. Bias is usually caused by an incomplete heuristic process, which is a non-analytic way to process information [7]. When not motivated to process information systematically, message receivers operate in a heuristic mode, relying on a subset of cues to form a convenient judgment rather than carefully evaluating all of the available information. Truth bias is one of the most commonly cited heuristics in the deception literature [8]. The bias partly reflects a general expectation that people tell the truth, because honest behavior is actually more frequently observed [9]. A highly sensitive person may be less popular and have less satisfying relationships than less sensitive people; thus it is a norm of polite interaction that leads people to place more attention on content than on intent [10]. Truth bias also reflects a cognitive shortcut and a simple decision rule applied when information is difficult to validate objectively, such as the experiences or values of unknown others [11]. Therefore, the first hypothesis of this study predicted that:

Hypothesis 1: Humans make more truth assessments than deceptive assessments.

B. Stereotypical Cues to Deception

The second factor is reliance on stereotypical cues to deception. According to comprehensive surveys, people believe that liars display specific behaviors [12, 13]. This belief about deception, also known as stereotypical judgment, implies that people often rely on certain cues to make judgments. According to Zuckerman, it is widely believed that liars shrug, shift their posture, move their feet and legs, avert their gaze, and pause in the middle of speaking. Furthermore, studies on global deception showed that this stereotypical judgment is consistent across cultures and countries [13]. Presumably, people judge a suspect's deceptiveness by comparing the suspect's behavior to their stereotype of a liar. However, those believed cues may show no correspondence to the cues that actually distinguish truths from lies [14]. For example, gaze avoidance was shown to have a very weak association with highly motivated lies [15].

We predict that people make stereotypical judgments by parroting the prevailing cultural wisdom rather than by thinking specifically and systematically. Unlike previous studies on stereotype factors, we focus on cues that can potentially be automated. We measure the characteristics of the cues by studying syntactic structure, semantic meaning (and thus the mental states represented in the words), and simple vocalic cues such as talking-time measurements [16, 17]. By studying machine-programmable cues and implementing them in our automatic deception detection system, we can compare the performance of machine and human judgment.

According to Bond's worldwide results on beliefs about deceptive behavior, a majority of the people investigated (62.2%) believe that liars tell longer stories than usual [12, 13]. Pauses in the middle of speaking also reflect general beliefs about deception [12]. Anderson also showed that people reportedly used complexity as an indicator of truthfulness and thus its absence as a sign of deception [14]. Furthermore, receivers reported more suspicion when senders used fewer affect terms, more uncertainty, and less specificity [10]. Therefore, the hypotheses on stereotypical judgment are:

Hypothesis 2: Humans judge deceivers to have 2.1) less talking time, 2.2) shorter messages, 2.3) more speech interruption, 2.4) less complexity, 2.5) less specificity, 2.6) fewer affect terms, and 2.7) more uncertainty.

III. METHOD

In order to study how humans rely on cues to detect deception, we conducted a deception-related experiment. With the experiment data as a test bed, deceptive cues were automatically programmed or coded. Each message was also rated by human judges for truthfulness. We then studied the relationship between human judgment and the cues. To compare human and machine judgment, an automatic deception detection system uses the cues as input to train a discriminant analysis function to classify deception. The output of this classification is referred to as the machine judgment.

A. Experiment

The experiment was designed in a deception-related setting. The scenario of the experiment is an interview in which interviewees were videotaped and alternated between truthful and deceptive answers. There are four blocks, each containing three questions. Truth (T) or deception (D) occurred in alternating blocks of three questions and followed one of two orders: TDTD or DTDT.
Each block included all truthful or all deceptive responses, so that three D questions followed three T questions or vice versa. The current study entailed transcribing all the videotaped interviews and subjecting them to automated analysis of the linguistic and meta-content features.

Participants (N = 122) were (a) community members recruited from the county courthouse in a large southwestern metropolitan area and (b) nontraditional undergraduate students (older than age 25) enrolled in a communication course for business majors at the university. Demographically, 60 of the participants were male and 62 were female; 37% were age 19 to 30, 30% were age 30 to 40, and 33% were older than 40; 90% were Caucasian and 10% were African American, Hispanic, or other; 5% had a high school education or less, 81% had (some) college education, and 14% had graduate-level education. Participants were paid or received extra credit for their participation.

Participants were paired randomly to create 61 dyads (32 cross-gender dyads and 29 same-gender dyads, 15 of which were female-female). Upon arrival at the apartment-like research site, participants signed consent forms, were randomly assigned the role of interviewee (hereafter referred to as senders) or interviewer (hereafter referred to as receivers), and were then separated. Senders reviewed a list of 12 questions (on education, occupation, personal relationships, and political attitudes) that they would be answering during the interview and then received the deception induction: they were told that past research has shown that complete honesty is often not in one's best interests and that the ability to manage information is an important skill, one that senders would be asked to test in the upcoming interview by contradicting or misrepresenting their true response on some questions.

B. Measuring Human Judgments

In order to measure human judgments, messages were rated by humans. Specifically, after each question period, interviewers were asked to rate, on a 0-to-10 scale, how truthful they thought the interviewee was in answering the question. Studies using continuous measurement strategies that obtain judgments for each response have shown that receivers are more sensitive to changes in the credibility of senders and messages during deception than studies using dichotomous choices [10]. The ratings for the three questions in each block were averaged together for a mean truth estimate. The human judgments were later used for comparison with the machine judgments.

C. Measuring the Verbal and Vocal Cues

The vocal cues investigated in this paper are talking time and speech disturbance. Talking time measures the time interviewees spend talking. Besides self-talking time, two other cues also measure the amount of talking: turn-switch time, the silence between speaking turns, and listening time, the talking time of the interviewer (i.e., the listening time of the interviewee). The longer the turn-switch and listening times, the less time the interviewees spend actually talking. Speech disturbance is defined as the pauses during conversation.
According to the form of the pause, speech disturbance can be further specified as vocalized pauses, nonvocalized pauses, and other forms of vocal nonfluency. Vocalized pauses are speech hesitations such as saying 'ah' or 'mm' between words. Nonvocalized pauses are the silent periods between words during the answering process. Other forms of nonfluency include stuttering and repetitions. For each of the six vocal cues in talking time and speech disturbance, three measurements were taken: frequency, latency, and mean. Currently the vocal cues are coded by trained coders; however, they are potentially automatable [17]. In summary, the classes of vocal cues are as follows:

1) Talking time (self-talking time, turn-switch time, listening time/interviewer's talking time).
2) Speech disturbance (vocalized pauses, nonvocalized pauses, other vocal nonfluency).

As demonstrated in previous research, valuable verbal cues to detecting deception can be extracted by machine [16, 18]. Many of the cues are syntax-related or belong semantically to certain kinds of words that reflect users' psychological states [19]. Technically, they can be calculated automatically with a shallow parser, supplemented by a look-up dictionary [16]. In the current investigation, we analyzed the same clusters of indicators but used the General Architecture for Text Engineering (GATE) for parsing [20] and the Whissell dictionary of over 7,000 words with scaled values for the affect-related indicators [21]. In summary, the classes of verbal cues and their respective indicators are as follows:

1) Quantity (number of words, number of verbs, number of sentences).
2) Complexity (syntactic complexity, or average sentence length, ASL; lexical complexity, measured as average word length, AWL; pausality, measured as the amount of punctuation).
3) Diversity (lexical diversity, content word diversity, redundancy).
4) Specificity (temporal immediacy, temporal nonimmediacy, spatial-far details, spatial-close details, ratio of sensory terms to total terms, modifiers, 1st person pronouns/self-reference, 2nd person pronouns, and 3rd person pronouns).
5) Affect (activation, pleasantness, imagery, all scaled in the dictionary; affect, the average of activation, pleasantness, and imagery).
6) Uncertainty (modal verbs).
7) Verbal nonimmediacy (passive voice).
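To make the automation concrete, the following Python sketch computes a handful of the quantity, complexity, and diversity indicators from a plain transcript. It uses simple regular-expression tokenization instead of the GATE parser and Whissell dictionary used in the study, so the function name and the exact formulas (e.g., lexical diversity as a type-token ratio) are illustrative assumptions rather than the authors' implementation.

import re

def verbal_cues(text: str) -> dict:
    """Approximate a few verbal cue indicators from a transcript.
    Covers quantity, complexity (ASL, AWL, pausality), and lexical
    diversity; specificity, affect, and uncertainty would need a
    parser and look-up dictionaries."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = re.findall(r"[,.;:!?]", text)

    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)

    return {
        "num_words": len(words),                                  # quantity
        "num_sentences": len(sentences),                          # quantity
        "asl": len(words) / n_sentences,                          # average sentence length
        "awl": sum(len(w) for w in words) / n_words,              # average word length
        "pausality": len(punctuation) / n_sentences,              # punctuation per sentence
        "lexical_diversity": len({w.lower() for w in words}) / n_words,
    }

if __name__ == "__main__":
    print(verbal_cues("Well, I worked there for two years. It was fine, I guess."))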
Fig. 1. A Simplified Flow Chart of the Automatic Deception Detection System (messages are parsed into cues, which are passed to a classifier).
The Whissell dictionary measures the affect terms (activation, pleasantness, and imagery) on scales from 1 (lowest) to 3 (highest). The high or low value represents the affect level evaluated in the corpus defined by Whissell [21]. However, what counts as a high or low value varies by corpus and context. For example, "happy" could be very pleasant in formal communications between two companies but less so between two good friends. In order to evaluate affect levels more accurately for our specific context (the experiment), we also measured values exceeding or falling below one and two standard deviations. For example, a "low" or "very low" activation term means a value lower than one or two standard deviations, respectively, below the mean of the activation terms in the experiment data.

D. The Automatic Deception Detection System

Automatic deception detection has not been studied until recently [16]. Figure 1 is a simplified flow chart of one such system. The system takes messages (which could be in text, audio, or video modality) and parses them for deceptive indicators. Next, patterns of the messages are formed with certain statistical or machine learning methods. The patterns are compared with the norm pattern, i.e., the pattern in the truthful cases. If the new pattern is significantly different from the norm, the system judges the message as deceptive, and otherwise as truthful. The core mechanism of the system is to identify a set of cues that can potentially be automated by program.

Video cues, for example, must be parsed differently than text cues [17]. The classifier should also consider the fusion of cues: when two cues form opposite judgments, a fusion engine is necessary to handle and combine them. However, since the focus of this paper is to compare human with machine judgments, we leave the discussion of the system architecture to other papers. As shown in the third component of the flow chart, once we have the automated cues, we can apply statistical and machine classification methods to discriminate deception from truth.
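As a rough sketch of this classification step, the snippet below trains a linear discriminant classifier on automated cue features. The paper reports using discriminant analysis; the data file name, column names, and leave-one-out evaluation here are illustrative assumptions rather than the authors' actual pipeline.

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical file: one row per answer block, automated cue columns plus
# a 0/1 label (1 = deceptive). Column names are placeholders.
data = pd.read_csv("cues.csv")
X = data.drop(columns=["deceptive"]).to_numpy()
y = data["deceptive"].to_numpy()

# A linear discriminant classifier stands in for the discriminant function.
clf = LinearDiscriminantAnalysis()

# Leave-one-out accuracy approximates how well the "machine judgment"
# separates deceptive from truthful blocks.
acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print(f"Machine detection accuracy: {acc:.2%}")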
IV. RESULTS

According to the experiment design, there are four blocks of interview questions, each containing three questions. Questions within each block belong to a single condition (either truthful or deceptive). Within each block, the values of the three questions are averaged to reduce random noise. Because the focus is on the interviewers' judgments and the cues they use, and not on dynamic effects in deception detection, we performed the analysis on each block rather than a repeated-measures analysis. In the future, the data will be used to study order effects (i.e., speaking truth or deception first) and within-subject effects.

Hypothesis 1 predicts that the interviewers would make more truthful than deceptive judgments. A similar hypothesis has been tested by Burgoon et al. [8]; unlike that research, which tested within-subject effects, this study tests within blocks. Our results showed that the pattern is consistent across the different blocks of time. Four t-tests were performed, one on each block. Table 1 shows that interviewers judged most of the messages to be truthful (ratings > 5). In each block, the mean ratings were 7.5, 7.6, 8.2, and 7.7 respectively, all with p-values (in the t-test) < 0.0001. The result suggests that even when the interviewers had been notified of the existence of deception, they could not effectively tell the difference between truth and deception.

TABLE 1
T-TEST FOR THE TRUTH BIAS

Block   Mean Rating   Standard Deviation   T-value   P-value
1       7.5           2.18                 8.9       < 0.0001

Hypothesis 2 tests whether human detection is influenced by stereotypical thinking. A set of regression analyses was performed on the interviewers' judgments with all available verbal and nonverbal cues. Table 2 shows the significant cues.
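For readers who want to reproduce this style of analysis, the sketch below runs the two checks reported in this section on hypothetical per-block data: a one-sample t-test of the mean truthfulness ratings against the scale midpoint of 5 (the truth-bias test in Table 1), and an ordinary least-squares regression of the ratings on a few cue values (the stereotypical-cue analysis). The file name and column names are placeholders, not the study's variables.

import pandas as pd
from scipy import stats
import statsmodels.api as sm

# Hypothetical data: one row per interviewer x block, with the mean 0-10
# truthfulness rating and a few automated cue values (columns illustrative).
df = pd.read_csv("ratings_and_cues.csv")

# Truth-bias check: are mean ratings in each block significantly above the
# scale midpoint of 5?
for block, group in df.groupby("block"):
    res = stats.ttest_1samp(group["rating"], popmean=5)
    print(f"Block {block}: mean={group['rating'].mean():.1f}, "
          f"t={res.statistic:.1f}, p={res.pvalue:.4f}")

# Stereotypical-cue check: regress interviewer ratings on cue values.
cues = ["talking_time", "num_words", "asl", "modal_verbs"]  # illustrative subset
model = sm.OLS(df["rating"], sm.add_constant(df[cues])).fit()
print(model.summary())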