Int. J. Business Intelligence and Data Mining, Vol. 5, No. 2, 2010
Practical algorithms for subgroup detection in covert networks

Nasrullah Memon*, Uffe Kock Wiil and Pir Abdul Rasool Qureshi

The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Abstract: In this paper, we present algorithms for subgroup detection and demonstrate them with a real-time case study of the USS Cole bombing terrorist network. The algorithms are demonstrated in a prototype system. The system finds associations between terrorists and terrorist organisations and is capable of determining links between terrorism plots that occurred in the past, their affiliation with terrorist camps, travel records, funds transfers, etc. The findings are represented by a network in the form of an Attributed Relational Graph (ARG). Paths from a node to any other node in the network indicate the relationships between individuals and organisations. The system also provides assistance to law enforcement agencies, indicating when the capture of a specific terrorist is more likely to destabilise the terrorist network. In this paper, we discuss an important application area, the detection of subgroups in a terrorist cell, using filtering graph algorithms. The novelty of the algorithms can be easily found from the results they produce.

Keywords: covert networks; iMiner; IDM; investigative data mining; SNA; social network analysis; subgroup detection algorithms.

Reference to this paper should be made as follows: Memon, N., Wiil, U.K. and Qureshi, P.A.R. (2010) ‘Practical algorithms for subgroup detection in covert networks’, Int. J. Business Intelligence and Data Mining, Vol. 5, No. 2, pp.134–155.

Biographical notes: Nasrullah Memon is an Associate Professor at the Maersk Mc-Kinney Moller Institute, University of Southern Denmark. He received a Master’s Degree in Applied Mathematics from the University of Sindh, Pakistan, and a Master’s Degree in Software Development from the University of Huddersfield, UK. He holds a PhD in Investigative Data Mining from Aalborg University, Denmark.
He is also affiliated with Mehran University of Engineering and Technology, Jamshoro, Sindh, Pakistan, and Hellenic American University, Athens, Greece. He is Editor-in-Chief of International Journal on Advances in Social Network Analysis and Mining, Springer-Verlag. He has published more than 50 research papers in international conferences and journals. His current research is on knowledge management, mathematical methods in counterterrorism, information extraction and investigative data mining.
Copyright © 2010 Inderscience Enterprises Ltd.
Practical algorithms for subgroup detection in covert networks
Uffe Kock Wiil is a Professor of Software Engineering and Technology at the Maersk Mc-Kinney Moller Institute, University of Southern Denmark. He holds an MSc in Computer Engineering (1990) and a PhD in Computer Science (1993), both from Aalborg University, Denmark. His research interests include software technology, knowledge management, hypertext, computer-supported cooperative work and distributed systems. These research interests are currently being applied in three overall areas: tools and techniques for counterterrorism, healthcare and planning. He has published more than 100 research papers. His research papers have been cited more than 1000 times.

Pir Abdul Rasool Qureshi is working as a Research Assistant at the Maersk Mc-Kinney Moller Institute, University of Southern Denmark. He received a Bachelor’s Degree in Software Engineering from Mehran University of Engineering and Technology, Jamshoro, Sindh, Pakistan. He has published more than a handful of research papers and has vast experience in the development of web harvesting and investigative data mining software.
1 Introduction
When intelligence analysts are required to understand a complex, uncertain situation, one of the techniques they often use is simply to draw a diagram of the situation. These diagrams are ARGs (Coffman et al., 2004). In these graphs, nodes represent people, organisations, objects, or events. Edges represent relationships such as interaction, ownership, or trust. Attributes store the details of each node and edge, such as a person’s name or the time at which an interaction occurred. The graphs function as external memory aids, which are fundamental tools for arriving at an unbiased conclusion in the face of uncertain information (Heuer, 2001). For example, suppose we have information that Terrorist T1 is a friend of T2, who is the father of T3, who frequents place P1, as do persons T4, T5 and T6, who work at Organisation O1, where T6 has a colleague T7, who funded T8, who is a friend of T9, who is the brother of T10. This information may be presented as in Figure 1. Visualising associations and relations in this way is important in the investigation of criminal networks. It has been found to aid considerably in understanding what is known so far, which is necessary to guide and direct further lines of enquiry in the most timely and productive way.

The study of terrorist networks falls into the larger category of criminal intelligence analysis (Xu and Chen, 2005), which is often applied to investigations of organised crime (e.g., terrorism, money laundering, fraud, etc.). Unlike other types of crime, which are often committed by a single offender or a few offenders, organised crime is carried out by multiple, collaborating offenders, who may form groups and play different roles.
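The motivating example with T1 to T10, P1 and O1 can be written down as a small attributed relational graph. The following is a minimal sketch, not the representation used in iMiner: the class name `ARG`, its methods and the attribute keys `kind` and `relation` are hypothetical, edges are stored as undirected pairs for simplicity (so the direction of ‘father of’ or ‘funded’ lives only in the attribute), and it assumes the reading in which T4, T5 and T6 all frequent P1 and all work at O1.

```python
from dataclasses import dataclass, field


@dataclass
class ARG:
    """Attributed relational graph: nodes and edges both carry attribute dicts."""
    nodes: dict = field(default_factory=dict)  # node id -> attributes
    edges: dict = field(default_factory=dict)  # frozenset({u, v}) -> attributes

    def add_node(self, node_id, **attrs):
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_edge(self, u, v, **attrs):
        # Create the endpoints on demand so that an edge never dangles.
        self.add_node(u)
        self.add_node(v)
        self.edges.setdefault(frozenset((u, v)), {}).update(attrs)

    def neighbours(self, node_id):
        # Every node that shares an edge with node_id, in sorted order.
        return sorted(other for e in self.edges if node_id in e
                      for other in e if other != node_id)


g = ARG()
for i in range(1, 11):
    g.add_node(f"T{i}", kind="person")
g.add_node("P1", kind="place")
g.add_node("O1", kind="organisation")

g.add_edge("T1", "T2", relation="friend")
g.add_edge("T2", "T3", relation="father of")
for t in ("T3", "T4", "T5", "T6"):
    g.add_edge(t, "P1", relation="frequents")
for t in ("T4", "T5", "T6"):          # assumed reading of the example
    g.add_edge(t, "O1", relation="works at")
g.add_edge("T6", "T7", relation="colleague")
g.add_edge("T7", "T8", relation="funded")
g.add_edge("T8", "T9", relation="friend")
g.add_edge("T9", "T10", relation="brother")

print(g.neighbours("T6"))
```

Keeping attributes on both nodes and edges is what distinguishes an ARG from a plain graph: the same structure can later hold dates, amounts or sources without changing the traversal code.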
In a terrorist network, for example, different groups may be responsible for choosing the venue to attack; some groups may be responsible for finding sources of funding (e.g., handling drug supply, distribution, sales, smuggling and money laundering); some groups may be responsible for recruiting people; some groups may be responsible for making bombs, etc. In each group, there may be a leader who issues commands and provides steering mechanisms to the group, as well as gatekeepers/brokers who ensure that information flows effectively to and from other groups. Criminal intelligence analysis, therefore,
N. Memon et al.
requires the ability to integrate information from multiple crime incidents or even multiple sources and discover regular patterns about the structure, organisation, operation, and information flow in covert networks. Criminal networks are covert networks that operate in secrecy and stealth.

Figure 1 Motivating example (see online version for colours)
To dismantle covert networks, both reliable data and practical algorithms play a vital role. However, intelligence and law enforcement agencies are often faced with the dilemma of having too much data collected from multiple sources: bank accounts and transactions, phone records, vehicle sales and registration records, and surveillance reports, to name a few (McAndrew, 1999; Sparrow, 1991). In effect, this data is of limited value, because the agencies lack practical techniques and software tools to utilise it effectively and efficiently. Today’s criminal intelligence analysis is primarily a manual process (Xu and Chen, 2005) that consumes much human time and effort, and thus has limited applicability to crime investigation. Our objective is to provide an IDM perspective for covert network analysis. We discuss challenges in data processing, then review the existing covert network analysis and visualisation tools, and propose subgroup detection algorithms.

Like data-mining applications in many other domains, mining law enforcement data involves many obstacles (Xu and Chen, 2005). First, incomplete, incorrect, or inconsistent data can create problems. Moreover, these characteristics of criminal networks cause difficulties not common in traditional data-mining applications (Sparrow, 1991).

Incompleteness: Covert networks always operate in secrecy, and the nodes of the networks hide in the crowd. Criminals may minimise interactions to avoid attracting police attention, and their interactions are hidden behind various illicit activities. Thus, data about covert networks is inevitably incomplete, i.e., some existing links or nodes will be unobserved or unrecorded.
Incorrectness: Incorrect data regarding criminals’ identities, physical characteristics and addresses may result either from unintentional data entry errors or from intentional deception by criminals. Many criminals lie about their identity information when caught and investigated.

Inconsistency: Information about a criminal who has multiple police contacts may be entered in law enforcement databases multiple times. These records are not necessarily consistent. Multiple data records could make a single criminal appear to be different individuals. When seemingly different individuals are included in a network under study, misleading information may be the result.

Problems specific to covert network analysis lie in data transformation, fuzzy boundaries and network dynamics:

Data transformation: Network analysis requires that data be presented in a specific format in which network members are represented by nodes, and their associations or interactions are represented by links. However, information about criminal associations is usually not explicit in raw data, and transforming it into the required format can be fairly labour-intensive and time-consuming.

Fuzzy boundaries: The boundaries of covert networks are quite ambiguous. Even organised crime families are often interrelated, and many significant crime figures are significant precisely because they are connected to a number of different criminal organisations. It can be quite difficult for an analyst to decide whom to include in and whom to exclude from a network under study.

Dynamic: Covert networks are, for all practical purposes, not static, but are subject to change over time. Each contact report, telephone call, or financial transaction has a time and date. The relationship between any two individuals is not merely present or absent (binary), nor is it simply weaker or stronger; rather, it has a distribution over time, waxing and waning from one period to another.
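The idea that a relationship has a distribution over time, rather than a single strength, can be illustrated with a short sketch. It assumes, purely for illustration, that each contact is a dated record between two people and that calendar months are a suitable period; the function names and the sample records are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def tie_strength_by_period(contacts):
    """Count contacts per unordered pair and per (year, month) period."""
    strength = defaultdict(Counter)
    for a, b, when in contacts:
        pair = tuple(sorted((a, b)))          # A-B and B-A are the same tie
        strength[pair][(when.year, when.month)] += 1
    return strength


def trend(series, earlier, later):
    """Classify how one tie changed between two periods."""
    before, after = series[earlier], series[later]
    if after == 0 and before > 0:
        return "extinct"
    if after > before:
        return "stronger"
    if after < before:
        return "weaker"
    return "unchanged"


# Hypothetical contact reports, calls or transactions: (person, person, date).
contacts = [
    ("A", "B", date(2000, 1, 3)),
    ("B", "A", date(2000, 1, 17)),
    ("A", "B", date(2000, 2, 9)),
    ("A", "C", date(2000, 1, 5)),
    ("B", "C", date(2000, 1, 20)),
    ("B", "C", date(2000, 2, 2)),
    ("C", "B", date(2000, 2, 11)),
    ("B", "C", date(2000, 2, 25)),
]

strength = tie_strength_by_period(contacts)
jan, feb = (2000, 1), (2000, 2)
for pair, series in sorted(strength.items()):
    print(pair, trend(series, jan, feb))
```

Storing one counter per tie keeps the full time profile available, so the same data can answer both ‘how strong is this tie now?’ and ‘is it waxing or waning?’.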
Many of the most useful network questions depend heavily on this temporal dimension, requiring information about which associations are becoming stronger, weaker, or extinct. New methods of data collection may be required to capture the dynamics of criminal networks. Some techniques have been developed to address these problems. For example, to improve data correctness and consistency, many heuristics are employed in the FinCEN system at the US Department of the Treasury to disambiguate and consolidate financial transactions into uniquely identified individuals in the system (Goldberg and Wong, 1998). Other approaches like IDM (Memon, 2007a) and the concept space method (Chen, 2003) can transform covert event data into a networked format (Xu and Chen, 2005).

IDM (Memon et al., 2007) is a newly emerging research area at the intersection of link analysis, web harvesting or web mining (Memon et al., 2007), graph mining (Cook and Holder, 2000) and Social Network Analysis (SNA) (Scott, 2004). Graph mining and SNA, in particular, have attracted attention from a wide audience in law enforcement and intelligence agencies (Getoor et al., 2003). As a result of this attention, law enforcement and intelligence agencies have realised that knowledge about covert networks and the detection of covert networks are important to crime and terrorism investigation (Senator, 2005). Group detection refers to the discovery of organisational structure that relates selected individuals with each other. In a broader
context, it refers to the discovery of the underlying structure relating instances of any type of entity among themselves (Marcus et al., 2007). Discovery of an organisational structure from terrorists’ data leads the investigators to terrorist cells. Therefore, detection of terrorist cells is important to terrorism investigation. Detecting a terrorist cell, or even a part of a cell (a subgroup), is also important and valuable. A subgroup can be extended with other members with the help of domain experts. An experienced intelligence officer usually knows the members of well-known terrorist cells and can decide which subgroups should be united to constitute the whole group. Another outcome of terrorist group detection is the possibility of a pre-emptive strike or terrorism prevention. For example, suppose a terrorist cell is preparing a truck with explosive material and human bombs are being prepared at terrorist camps. If intelligence agents obtain information about anyone involved in the case, then the plan may be prevented before it happens (with the help of the algorithms discussed in this paper).

Specific software such as Analyst’s Notebook (http://www.i2.co.uk) and Sentient (http://www.www.sentient.nl/) provides visual spatio-temporal representations of terrorist subgroups in graphs, but it lacks automated group detection functionality. We have developed a prototype called iMiner based on novel algorithms, which among other things has the capability to detect subgroups automatically. In this paper, we explain how the iMiner prototype supports subgroup detection in covert networks. The algorithms make the following contributions:
• detect all terrorists who are directly or indirectly connected to a specified terrorist
• detect paths that connect two specified terrorists
• detect connections between groups of terrorists
• uncover connections between the root (a node) and a destination (another node in the terrorist cell).
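The first three contributions in the list above correspond to standard graph searches. The sketch below uses plain breadth-first search on an adjacency list; it is a textbook illustration under that assumption and is not claimed to be the algorithm implemented in iMiner. The function names and the toy network are hypothetical.

```python
from collections import deque


def connected_to(adj, source):
    """All nodes directly or indirectly connected to `source`."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    seen.discard(source)
    return seen


def shortest_path(adj, start, goal):
    """One path with the fewest links from start to goal, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            path = []
            while u is not None:      # walk back through the parents
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def links_between(adj, group_a, group_b):
    """Direct links with one end in group_a and the other in group_b."""
    return sorted((u, v) for u in group_a for v in adj.get(u, ()) if v in group_b)


# Hypothetical undirected network: a chain a-b-c-d and a separate pair x-y.
adj = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"],
    "x": ["y"], "y": ["x"],
}

print(connected_to(adj, "a"))
print(shortest_path(adj, "a", "d"))
print(links_between(adj, {"a", "b"}, {"c", "d"}))
```

Breadth-first search is used here because it returns a path with the fewest intermediaries, which is usually the first thing an analyst wants to inspect; weighted ties would call for a different search.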
The remainder of the paper is organised as follows. Section 2 provides an overview of covert network analysis and visualisation tools. Section 3 provides a detailed introduction to IDM. Section 4 discusses the algorithms for subgroup detection, whereas Section 5 provides an overview of our IDM test bed. Section 6 presents the results of the algorithms for subgroup detection (presented in Section 4) using the USS Cole bombing terrorist network as a case study. Section 7 concludes the paper with a discussion of some limitations of the algorithms.
2 Covert network analysis and visualisation tools
Klerks (2001) categorised existing covert network analysis approaches and tools into three generations.

First generation – manual approach: The Anacapa chart (Harper and Harris, 1975) is representative of the first generation. With this approach, an analyst must first construct an association matrix by identifying criminal associations from raw data. A link chart for visualisation purposes can then be drawn based on the association matrix. For example, to map the terrorist network containing the 19 hijackers in the 9/11 attacks, Krebs (2001) gathered data about the relationships among the hijackers from publicly released
information reported in several major newspapers. He then manually constructed an association matrix to integrate these relations (Krebs, 2001) and drew a network representation to analyse the structural properties of the network. Although such a manual approach to criminal network analysis is helpful in crime investigation, it becomes extremely ineffective and inefficient when data sets are very large.

Second generation – graphic-based approach: These tools can automatically produce graphical representations of criminal networks. Most existing network analysis tools belong to this generation. Among them, Analyst’s Notebook (Klerks, 2001), Netmap (Goldberg and Wong, 1998) and XANALYS Link Explorer, previously called Watson (McAndrew, 1999), are the most popular. For example, Analyst’s Notebook can automatically generate a link chart based on relational data from a spreadsheet or text file. Although second-generation tools are capable of using various methods to visualise criminal networks, their sophistication level remains modest because they produce only graphical representations of criminal networks without much analytical functionality. They still rely on analysts to study the graphs carefully to find structural properties of the network.

Third generation – SNA: This approach is expected to provide more advanced functionality to assist crime investigations (Xu and Chen, 2005). Sophisticated structural analysis tools are needed to go from merely drawing networks to mining large volumes of data to discover useful knowledge about the structure and organisation of covert networks. In our research, we used IDM techniques to analyse covert networks; after connecting the dots, we provided strategies for destabilising terrorist networks (Memon, 2007a).
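The first-generation workflow, building an association matrix from raw data and then deriving a link chart from it, can be sketched in a few lines. The example assumes that the raw data has already been reduced to lists of people named together in the same report; the function names and the sample reports are hypothetical.

```python
from itertools import combinations


def association_matrix(incidents):
    """Symmetric matrix counting how often two people appear in the same report."""
    names = sorted({p for people in incidents for p in people})
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for people in incidents:
        for a, b in combinations(sorted(set(people)), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return names, matrix


def link_chart(names, matrix):
    """Edge list (a, b, weight) for every non-zero cell above the diagonal."""
    return [(names[i], names[j], matrix[i][j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if matrix[i][j]]


# Hypothetical reports, each naming the people mentioned together.
incidents = [["A", "B", "C"], ["B", "C"], ["C", "D"]]

names, matrix = association_matrix(incidents)
for name, row in zip(names, matrix):
    print(name, row)
print(link_chart(names, matrix))
```

The matrix grows quadratically with the number of people, which is one concrete reason why the manual approach stops scaling once data sets become large.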
3 Investigative Data Mining
Defeating terrorist networks requires a more nimble intelligence apparatus that operates more actively and makes use of advanced information technology. Data mining for counterterrorism is a powerful approach for intelligence and law enforcement officials fighting terrorism (DeRosa, 2004). IDM is a combination of data mining and subject-based automated data analysis techniques. Data mining has a relatively narrow meaning: the approach that uses algorithms to discover predictive patterns in data sets. Data mining should not be seen as a complete solution by itself. Rather, data mining is a major step in a complete investigative process. It may be considered a pre-processing step that sheds light on certain circumstances, which are later analysed thoroughly to reach useful discoveries. The problem investigation process may generally be realised in three steps, namely pre-processing, processing and post-processing. Data-mining techniques are largely essential to each of the three steps: to prepare the data, to produce the results, and to transform the results, respectively. Subject-based automated data analysis applies models to data to predict behaviour, assess risk, determine associations, or perform other types of analysis (DeRosa, 2004). The models used for automated data analysis can be applied to patterns discovered by data-mining techniques.
Although IDM techniques are powerful, it is a mistake to view them as a complete solution to security problems. Rather, IDM techniques constitute one step in the right direction; they are necessary but not sufficient for producing effective solutions to security problems. The strength of IDM is to help analysts and investigators in drawing conclusions and making decisions. IDM can automate some functions that analysts would otherwise have to perform manually. As the domain of knowledge gets larger, manual analysis becomes almost impossible and automated tools for IDM become essential. Consequently, more sophisticated and scalable IDM techniques are needed to deal with the large amounts and diverse sources of data. The process can help to prioritise attention and focus an enquiry, and can even do some early analysis and sorting of masses of data.

In the counterterrorism domain, much of the data is classified. If we are truly to get the benefits of the techniques, we need to test with real data. Only a few researchers have the clearances to work on classified data. The challenge is to find unclassified data that is a representative sample of classified data. It is not straightforward to do this, as one has to make sure that all classified information, even by implication, is removed. An alternative is to find data that is as good as possible in an unclassified setting for researchers to work on. However, the researchers have to work not only with counterterrorism experts but also with data-mining specialists. That is, the research carried out in an unclassified setting has to be transferred to a classified setting later to test the applicability of data-mining algorithms. Only then do we get the true benefits of IDM. IDM is known as a data-hungry project for academia (Memon and Larsen, 2006b, 2006c). It can only be used if researchers have good data.
When monitoring central banking data, an interesting pattern may be detected: for example, a person who has been earning less and continuously spending more over the last 12 months. The investigating officer may then send a message to an investigative agency to keep the person under observation. Furthermore, if a person who is under observation by an investigative agency has visited Newark airport more than five times in a week (a result received from video surveillance cameras), then this activity is termed suspicious and investigative agencies may act upon it. It was not possible for this research study to obtain sensitive data of this type. Therefore, in this research, we harvested unclassified data from authenticated web resources about terrorist attacks that occurred or were planned (Memon and Larsen, 2006a, 2006b).

IDM offers the ability to map a covert cell and to measure the specific structural and interactive criteria of such a cell. This framework aims to connect the dots between individuals and to map and measure complex, covert human groups and organisations (Memon and Larsen, 2006b, 2006c). The method focuses on uncovering the patterning of people’s interactions; correctly interpreting these networks assists in predicting behaviour and decision making within the network (Memon and Larsen, 2006b, 2006c). This technique is also known as subject-based link analysis. It uses aggregated public records or other large collections of data to find the links between a subject (a suspect, an address, or a piece of relevant information) and other people, places, or things. This can provide additional clues for analysts and investigators to follow (DeRosa, 2004). IDM also gives the analyst the ability to measure the level of covertness and efficiency of the cell as a whole, the level of activity, the ability to access others, and the level of control over a network each individual possesses.
The measurement of these criteria allows specific counterterrorism applications to be drawn, and assists in the assessment of the most effective methods of disrupting and neutralising a terrorist cell.
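Subject-based link analysis, as described above, starts from one known subject and collects everything that aggregated records tie to it. A minimal sketch follows, assuming each record has already been reduced to a source label and the entities it mentions; the record layout, the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def links_from_subject(records, subject):
    """Map each entity that co-occurs with `subject` to the sources linking them."""
    links = defaultdict(list)
    for record in records:
        if subject in record["entities"]:
            for entity in record["entities"]:
                if entity != subject:
                    links[entity].append(record["source"])
    return dict(links)


# Hypothetical aggregated records; the entity names are placeholders.
records = [
    {"source": "travel record", "entities": ["suspect S", "airport X"]},
    {"source": "funds transfer", "entities": ["suspect S", "person Q"]},
    {"source": "phone record", "entities": ["person Q", "person R"]},
    {"source": "address register", "entities": ["suspect S", "person Q", "address Z"]},
]

links = links_from_subject(records, "suspect S")
for entity, sources in links.items():
    print(entity, sources)
```

Keeping the supporting sources alongside each link lets an analyst judge how well corroborated a connection is; person R is not returned because no record names R together with the subject, which is where a second hop of the same search would begin.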
In short, IDM provides a useful way of structuring knowledge and framing further research. Ideally, it can also enhance an analyst’s predictive capability (Memon and Larsen, 2006b, 2006c). Traditional data mining commonly refers to using techniques rooted in statistics, rule-based logic, or artificial intelligence to comb through large amounts of data to discover previously unknown but statistically significant patterns. However, the application of IDM in the counterterrorism domain is more problematic, because unlike traditional data-mining applications, we must find an extremely wide variety of activities and hidden relationships among individuals (Seifert, 2006). In other words, traditional data-mining techniques have proved effective in discovering patterns hidden in the data. Such patterns are not directly accessible by traditional query processors. Hence, data-mining techniques are characterised by their capability to draw conclusions based on thorough analysis of the investigated data. IDM, on the other hand, aims to incorporate common sense, semantics and domain expert knowledge in the analysis; hence, it is more than outlier mining, though both discover and highlight exceptional cases and narrow down the search process. Table 1 gives a consolidated and concise comparison of traditional data mining and IDM.

Table 1
Traditional data mining vs. Investigative Data Mining

Traditional data mining | Investigative Data Mining
Detect comprehensive models of databases to develop statistically valid patterns | Find connected instances of rare patterns
No starting points | Known starting points or matches with patterns estimated by analysts
Apply models over entire data | Reduce search space; results are starting points for human analysts
Independent instances | Relational instances
No correlation between instances | Significant autocorrelation
Minimal consolidation needed | Consolidation is key
Dense attributes | Sparse attributes
Sampling is used | Sampling destroys connections
Homogeneous data | Heterogeneous data
As mentioned earlier, in this research, we have chosen a very small portion of data mining used for counterterrorism. We have used SNA and graph theory techniques for connecting the dots. Our goal is to propose mathematical models to law enforcement and intelligence agencies for destabilising terrorist networks after linking the dots between them. Traditionally, most of the literature in SNA has focused on networks of individuals. Although SNA is not conventionally considered as a data-mining technique, it is especially suitable for mining a large volume of association data to discover hidden structural patterns in terrorist networks. SNA primarily focuses on applying analytic techniques to the relationships between individuals and groups, and investigating how those relationships can be used to infer additional information about the individuals and groups (Degenne and Forse, 1999). There are a number of mathematical and algorithmic
approaches that can be used in SNA to infer such information, including connectedness and centrality (Wasserman and Faust, 1994). SNA is used in a variety of domains. For example, business consultants use SNA to identify the effective relationships between workers that enable work to get done; these relationships often differ from connections seen in an organisational chart (Ehrlich and Carboni, 2005). Law enforcement personnel have used social networks to analyse terrorist networks (Krebs, 2002; Stewart, 2001) and criminal networks (Sparrow, 1991). The capture of Saddam Hussein was facilitated by SNA: military officials constructed a network containing Hussein’s tribal and family links, allowing them to focus on individuals who had close ties to Hussein (Hougham, 2005).

After the 9/11 attacks, SNA has increasingly been used to study terrorist networks (Shultz and Vogt, 2004). Even though these covert networks share some features with conventional networks, they are harder to identify, because they mask their transactions. The most complicating factor is that terrorist networks are loosely organised networks with no central command structure. They work in small groups, which communicate, coordinate and conduct their campaigns in a network-like manner. There is a pressing need to automatically collect data on terrorist networks, analyse such networks to find hidden relations and groups, prune data sets to locate regions of interest, detect key players, characterise the structure, trace points of vulnerability and measure the efficiency (how fast communication is being made) of the network. Hence, it is desirable to have tools to detect the hidden hierarchical structure of terrorist networks and to evaluate the efficiency of the networks in given data sets. How often do members of a group speak with one another? What conduits do they use? How fast can they jump into action? Who is a peripheral or a key player in the network? How are members of the network knitted together?
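Two of the measures mentioned here, centrality and efficiency, can be made concrete with a small sketch. It assumes degree centrality as the centrality measure and the mean inverse shortest-path length as the efficiency measure; these are common textbook definitions and are not necessarily the ones implemented in iMiner. The function names and the four-node star network are hypothetical.

```python
from collections import deque


def degree_centrality(adj):
    """Share of the other nodes that each node is directly tied to."""
    n = len(adj)
    return {u: len(adj[u]) / (n - 1) for u in adj}


def distances_from(adj, source):
    """Breadth-first distances (number of links) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def efficiency(adj):
    """Mean of 1/d(u, v) over ordered pairs; unreachable pairs contribute 0."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = sum(1 / d for u in adj
                for d in distances_from(adj, u).values() if d > 0)
    return total / (n * (n - 1))


def without(adj, removed):
    """Copy of the network with one node (and its ties) taken out."""
    return {u: [v for v in nbrs if v != removed]
            for u, nbrs in adj.items() if u != removed}


# Hypothetical star network: hub "h" tied to three peripheral members.
adj = {"h": ["a", "b", "c"], "a": ["h"], "b": ["h"], "c": ["h"]}

print(degree_centrality(adj))
print(efficiency(adj))
print(efficiency(without(adj, "h")), efficiency(without(adj, "a")))
```

Comparing the efficiency before and after removing a node gives one simple way to rank members by how much the network depends on them: in this toy network, removing the hub disconnects everyone, whereas removing a peripheral member leaves the remaining members connected.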
Who depends on whom in a network? If one member is captured, how much is the efficiency of the network affected? How can the command structure be found in horizontal covert networks? Such information could be useful to intelligence agencies, as they focus even harder on loosely organised terrorist groups.

To address the above-mentioned questions, complex models have been created that offer insights into theoretical terrorist networks and look at how to model the shape of a terrorist network, when little information is known, through predictive modelling techniques based on inherent network structures. A common problem for the modellers is the issue of data (Ressler, 2006). Any academic work is only as good as the data, no matter how advanced the methods used. Many of the models were created data-free or without complete data, and do not fully consider human and data limitations (Memon, 2007a). On the other hand, data collection is difficult for terrorist network analysis because it is hard to create a complete network. It is especially difficult to gain information on terrorist networks. Terrorist organisations do not provide information on their members, and governments rarely allow researchers to use their intelligence data. A number of academic researchers focus primarily on data collection on terrorist organisations, analysing the information through description and straightforward modelling (Everett and Borgatti, 1999; Germn and Mockus, 2003). Krebs (2002) was one of the first to collect data using public sources with his 2002 paper in Connections. In this work, Krebs creates a pictorial representation of the Al Qaeda network (Corbin, 2002) responsible for 9/11 that shows the many ties between the hijackers of the four airplanes. After the Madrid bombing in 2004, Spanish sociologist Rodriguez completed an analysis similar to Krebs’ by using public sources to map the March 11th terrorist network. In his research,
he found a diffuse network based on weak ties amongst the terrorists (Borgatti, 2002; Ressler, 2006). Another bright spot is the 2004 publication of Understanding Terror Networks by Sageman. Using public sources, Sageman collects biographies of 172 terrorist operatives affiliated with the global Salafi jihad (the violent revivalist Islamic movement led by Al Qaeda). He uses SNA specifically on Al Qaeda operatives since 1998. This analysis yields four large terrorist clusters. The first cluster resides on the Pakistan–Afghan border and consists of the central staff of Al Qaeda and the global Salafist jihad movement. The second cluster is a group of operatives located in core Arab states such as Saudi Arabia, Egypt, Yemen and Kuwait. The third cluster is known as the Maghreb Arabs who, although they come from North African nations, currently reside in France and England. The final cluster is centred in Indonesia and Malaysia and is affiliated with Jemaah Islamiyah (Sageman, 2004).

Despite its strengths, collecting terrorists’ information from open sources has a few key drawbacks. With open sources, if the author does not have information on terrorists, he or she assumes they do not exist. This can be quite problematic, as the data analysis may be misleading. For instance, if one cannot find an Al Qaeda operative in Denmark in publicly available sources, the researcher could assume there is no Al Qaeda network there. However, it is very likely that this is not the case, since terrorists generally try to keep a low profile before committing an attack. The data collectors can also be criticised because their work is more descriptive and lacks complex modelling tools. Fostering relationships with modellers could augment the work being conducted by data collectors, as statistical analysis might be able to take into account some of the limitations of the data and provide an additional analytical framework.
Therefore, data collection is not an easy task; it requires some skills, and the data collector should preferably have comprehensive domain knowledge.

In our research, we developed mathematical models for further analysis of terrorist networks (Memon, 2007a). These models are implemented in a software prototype, iMiner, which is integrated with a database of terrorist events that have occurred in the past. The data are collected by harvesting authenticated websites (Memon et al., 2009a). The prototype iMiner endows an analyst with the ability to measure the level of covertness, the efficiency of the cell as a whole, the level of activity, the ability to access others, and the level of control over a network each individual possesses. The measurement of these criteria allows specific counterterrorism applications to be drawn and assists in the assessment of the most effective methods of disrupting and neutralising a terrorist cell. In short, iMiner provides a useful way of structuring knowledge and framing further research. Ideally, it can also enhance an analyst’s predictive capability. The research embodies descriptive and predictive modelling that also considers links (the relationships between objects), so that more information is made available to the mining process. This enables several new tasks (Memon, 2007a):
1 group detection
2 subgroup detection
3 object classification
4 object dependence
5 understanding the structure of a terrorist network
6 detecting command structure in terrorist networks.
This paper presents practical algorithms for the detection of subgroups by using filtering graph algorithms.
4 Subgroup detection algorithms
A covert network can often be partitioned into cells (subgroups) consisting of individuals who interact closely with each other. Given a network, traditional data-mining techniques such as cluster analysis may be employed to detect underlying groupings that are not otherwise apparent in the data. Hierarchical clustering methods have been proposed to partition a network into subgroups (Memon et al., 2007). Cliques whose members are fully or almost fully connected can also be detected from the clustering results. Subgroup detection aims at clustering the nodes of a graph into subgroups that share common characteristics; to some extent, subgraph discovery does the same job by finding interesting or common patterns in a graph. One of the most common interests of SNA is the substructures that may be present in the network. Subgroups are subsets of actors among whom there are relatively strong, direct, intense, frequent, or positive ties. From the idea of subgroups within a network, we can understand the social structure and the embeddedness of individuals. For example, some persons may act as ‘brokers’ between groups. Some actors may be part of a tightly connected and closed elite, whereas others are completely isolated from the group. Such differences in the ways that actors are embedded in the structure of groups within a network can have profound consequences for the ways the actors see the network, and the behaviours that they are more likely to practise. Memon et al. (2007) used a bottom-up approach for the detection of subgroups. This approach begins with basic groups and examines how far this kind of close relationship can be extended. The notion is to build outward from single ties to construct the network; it emphasises how the macro might emerge out of the micro. Bottom-up approaches include the use of cliques, n-cliques, n-clans and k-plexes. In this paper, we employ IDM subgroup detection techniques for terrorist intelligence analysis.
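To make the simplest of the bottom-up notions concrete, the following minimal sketch enumerates the maximal cliques of a small undirected graph with the basic Bron-Kerbosch procedure (no pivoting). It is an illustration only and is not taken from iMiner; the function names `build_graph` and `maximal_cliques` and the node labels `a` to `d` are assumptions introduced here.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected adjacency map from a list of ties."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def maximal_cliques(graph: Graph) -> Iterator[Set[str]]:
    """Yield every maximal clique (basic Bron-Kerbosch, no pivoting)."""

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> Iterator[Set[str]]:
        if not p and not x:
            # r cannot be extended by any node: it is a maximal clique.
            yield set(r)
            return
        for v in list(p):
            yield from expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}  # v has been fully explored
            x = x | {v}  # exclude v from later cliques on this branch

    yield from expand(set(), set(graph), set())


# Hypothetical toy network: a triangle (a, b, c) with a pendant tie c-d.
graph = build_graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
cliques = sorted(sorted(c) for c in maximal_cliques(graph))
print(cliques)
```

The relaxed notions (n-cliques, n-clans, k-plexes) loosen the requirement that every member be tied to every other member and would need different procedures.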
The goal has been to provide law enforcement and intelligence agencies with IDM techniques that not only produce graphical representations of terrorist networks but also provide structural analysis functionality to facilitate terrorist investigations. Our intention is also to introduce strategies that assist law enforcement and investigative agencies in disrupting terrorist networks. We have designed and developed a prototype software system (iMiner) for analysing, visualising and destabilising terrorist networks. iMiner is an experimental system, which also provides algorithms for the retrieval of information and its presentation in graph form. Two algorithms that enable detection of subgroups (an extended version of Memon et al., 2009a) are presented:
A Algorithm for retrieving all nodes (terrorists) that are directly or indirectly connected to specific nodes (terrorists)
B Algorithm for retrieving all nodes (terrorists) that are directly connected to a group of nodes (terrorists)
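The listings of the two algorithms are not reproduced in this text, so the sketch below is only one plausible reading of their descriptions: Algorithm A is taken to be a breadth-first traversal that collects everything reachable from the selected nodes, and Algorithm B is taken to collect the selected group together with its immediate neighbours. The function names, the adjacency-map representation and the node labels `n1` to `n6` are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Iterable, Set

Graph = Dict[str, Set[str]]


def connected_to(graph: Graph, seeds: Iterable[str]) -> Set[str]:
    """Reading of Algorithm A: the seeds plus every node reachable from them."""
    seen = {s for s in seeds if s in graph}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def directly_connected_to(graph: Graph, group: Iterable[str]) -> Set[str]:
    """Reading of Algorithm B: the group plus its immediate neighbours."""
    members = {m for m in group if m in graph}
    result = set(members)
    for member in members:
        result |= graph[member]
    return result


# Hypothetical network: a chain n1-n2-n3-n4 and a separate pair n5-n6.
graph: Graph = {
    "n1": {"n2"},
    "n2": {"n1", "n3"},
    "n3": {"n2", "n4"},
    "n4": {"n3"},
    "n5": {"n6"},
    "n6": {"n5"},
}
reachable = connected_to(graph, ["n1"])
neighbourhood = directly_connected_to(graph, ["n2", "n5"])
print(sorted(reachable), sorted(neighbourhood))
```

Under this reading, A returns whole connected components around the chosen nodes, while B returns only a one-step neighbourhood, which is why the two can produce different subgraphs for the same selection.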
5 The iMiner testbed
The purpose of the construction of the iMiner database is to test the developed models (Memon, 2007a). The database is integrated with the IDM prototype and the system
architecture of the prototype is depicted in Figure 2. The first stage of network analysis development is to automatically identify the strongest association paths, or geodesics, between two or more network members. In practice, such a task often entails intelligence officials manually exploring links and trying to find association paths that might be useful for generating investigative leads (Memon, 2007a).
Figure 2 The iMiner database system architecture
iMiner allows analysts to uncover the hidden hierarchy of covert networks, which may help law enforcement and intelligence agencies to understand the structure of such networks. In addition, intelligence agencies can easily set priorities for removing important nodes from the network by visualising how much the efficiency of the network is reduced by the capture of a node. In the aftermath of the 9/11 attacks, it was noted that coherent information sources were not available to researchers (Sageman, 2004). Information was either available in fragmentary form, not allowing for comparison studies across incidents, groups or tactics, or made available in written papers, which are not readily suitable for quantitative analysis of terrorist networks. Data collected by law enforcement agencies, while potentially better organised, are largely not available to the research community owing to restrictions on the distribution of sensitive information. To counter this information scarcity, we have developed a database of terrorist attacks that have occurred and of the terrorist organisations involved in those events. This information is collected from open-source media (but authenticated websites), primarily http://www.trackingthethreat.com. The iMiner database (see Figure 3) consists of various types of entities. Here is an incomplete list of the different entity types:
• terrorist organisations such as Al Qaeda
• terrorists such as Osama Bin Laden and Ramzi Yousef
• terrorist facilities such as Darunta Training Camp and Khalden Training Camp
• terrorist events/attacks such as 9/11 and the WTC terrorist attack of 1993.
The data set also contains various types of relations connecting instances of different entity types. Here is a partial list of the various relation types:
• memberOf: Instances of terrorist can be affiliated with various instances of terrorist organisation.
• facilityOwner: Instances of terrorist facility are usually run by instances of terrorist organisation.
• facilityMember: Instances of terrorist are linked to various instances of terrorist facility if the terrorist instance attended/spent some time at the facility.
• claimResponsibility: Instances of terrorist organisation are linked to the instances of terror attack they claim responsibility for.
• participatedIn: Instances of terrorist that may have participated in instances of terror attack.
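The relation types listed above each connect one entity type to another, which can be written down as a small schema and used to check incoming records. The sketch below is an assumption-laden illustration, not the iMiner schema: the type labels, the subject/object direction chosen for each relation, the function name `is_valid_relation` and the placeholder entity names are all hypothetical.

```python
from typing import Dict, Tuple

# Assumed (subject type, object type) for each relation named in the text.
RELATION_SCHEMA: Dict[str, Tuple[str, str]] = {
    "memberOf": ("terrorist", "organisation"),
    "facilityOwner": ("facility", "organisation"),
    "facilityMember": ("terrorist", "facility"),
    "claimResponsibility": ("organisation", "attack"),
    "participatedIn": ("terrorist", "attack"),
}

# Placeholder entities; real records would come from the harvested data.
ENTITY_TYPES: Dict[str, str] = {
    "person_1": "terrorist",
    "org_1": "organisation",
    "camp_1": "facility",
    "attack_1": "attack",
}


def is_valid_relation(subject: str, relation: str, obj: str) -> bool:
    """Check that a relation instance links entities of the expected types."""
    if relation not in RELATION_SCHEMA:
        return False
    expected = RELATION_SCHEMA[relation]
    actual = (ENTITY_TYPES.get(subject), ENTITY_TYPES.get(obj))
    return actual == expected


print(is_valid_relation("person_1", "memberOf", "org_1"))
print(is_valid_relation("org_1", "participatedIn", "attack_1"))
```

A check of this kind is useful when records are harvested automatically, because a relation attached to the wrong entity type is usually a sign of an extraction error.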
We have developed a prototype system to get information from online repositories and save it in our database for analysis. The analysis of relational data is a rapidly growing area within the larger research community interested in machine learning, knowledge discovery and data mining.
Figure 3 A screenshot of iMiner showing the whole data set (see online version for colours)
We stored data in the database in the form of triples of subject, object and relationship (Memon, 2007b; Memon et al., 2007), where the subject and object may be entities of interest and the relationship is the link between the two entities. Such triples are also used in the Resource Description Framework (RDF). The binary-relational view regards the universe as consisting of entities with binary-relationships between them. An entity is anything that is of interest and can be identified. A binary-relationship is an association between two entities. The first entity in
a relationship is called the subject and the second entity is called the object. A relationship is described by identifying the subject, the object, and the type of relationship; for example, ‘Bin Laden is leader of Al Qaeda’ can be written as
Bin Laden. Al Qaeda. leaderOf
N-ary relationships such as ‘Samad got bomb-making training from Zarqawi’ may be reduced to a set of binary-relationships by the explicit naming of the implied entity (Memon, 2007b; Memon et al., 2007):
training #1. Samad. trainee
training #1. Zarqawi. trainer
training #1. bomb making. got training of
There are several advantages and disadvantages of using binary relational storage (Frost, 1982), such as:
• The simple interface between a binary relational storage structure and other modules of a database management system consists of three procedures:
i insert (triple)
ii delete (partial specification of triple)
iii retrieve (partial specification of triple).
• Retrieval requests such as “list of all terrorists that met at Malaysia before 9/11” are met by issuing a simple retrieval call (terrorist. met at Malaysia before 9/11. ?), which delivers only the data that have been requested.
• Multi-attribute retrieval may be inefficient. The use of a binary relational structure for multi-attribute retrieval might not be the most efficient method (irrespective of the way in which the structure is implemented). To illustrate this, consider the example: “retrieve (the names of) all young (25 years old) married terrorists involved in 9/11 attacks who attended flight schools and got terrorist training in Afghanistan”. Using a binary-relational structure, one could issue the retrievals
(? .marital status. Married)
(? .age. 25)
(? .involved in. 9/11 terrorist attacks)
(? .attended. flight schools)
(? .got terrorist training. Afghanistan)
and merge the resultant sets to obtain the required data. Even so, this is probably much faster than a sequential search through the whole database, which would be necessary if the database were held as a file of records (name, age, marital status, etc.) ordered on name.
Relationships are saved in sequence of their involvement in a certain terrorism plot. For example, all relationships that are involved in the 9/11 terrorist attacks are saved together, thus forming domains. Each domain has a table of relations (in which relations are stored in the above-given format) and each domain is registered in the domain’s master table. These domains are analogous to case files in police stations (Memon, 2007b; Memon et al., 2007). Investigating officers might be interested in visualising or analysing the relations connected to a particular case or terrorism plot; if the required relations are saved together, they can be fetched quickly. Moreover, only the minimum number of relations (i.e., only those saved in that domain) is traversed in a search operation for any entity or its links. This strategy also reduces the number of joins per query, yielding performance gains.
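The three-procedure interface (insert, delete, retrieve by partial specification) and the merging of result sets for a multi-attribute query can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the iMiner storage layer: the class `TripleStore`, the helper `subjects_matching`, the (subject, relationship, object) ordering borrowed from the retrieval examples, the use of `None` in place of “?”, and the records `p1` to `p3` are all hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relationship, object)
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None acts as "?"


class TripleStore:
    """Minimal binary-relational store with the three procedures in the text."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def insert(self, triple: Triple) -> None:
        self._triples.add(triple)

    def retrieve(self, pattern: Pattern) -> List[Triple]:
        """Return the stored triples that agree with every specified field."""
        return sorted(
            t for t in self._triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )

    def delete(self, pattern: Pattern) -> int:
        """Remove all triples matching the pattern; return how many were removed."""
        doomed = self.retrieve(pattern)
        self._triples.difference_update(doomed)
        return len(doomed)


def subjects_matching(store: TripleStore,
                      conditions: Iterable[Tuple[str, str]]) -> Set[str]:
    """Issue one retrieval per (relationship, object) and intersect the subjects."""
    result: Optional[Set[str]] = None
    for relationship, obj in conditions:
        subjects = {s for s, _, _ in store.retrieve((None, relationship, obj))}
        result = subjects if result is None else result & subjects
    return result if result is not None else set()


store = TripleStore()
for triple in [
    ("p1", "age", "25"),
    ("p1", "marital status", "Married"),
    ("p2", "age", "25"),
    ("p2", "marital status", "Single"),
    ("p3", "age", "40"),
    ("p3", "marital status", "Married"),
]:
    store.insert(triple)

young_married = subjects_matching(store, [("age", "25"), ("marital status", "Married")])
print(young_married)
```

Each condition costs a separate retrieval followed by a set intersection, which is the overhead the text refers to. Restricting the store to a single domain, as described above, reduces the number of triples each retrieval has to examine.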
6 Case study¹
On 12 October 2000, USS Cole, under the command of Commander Kirk Lippold, put in to Aden harbour for a routine fuel stop. Cole completed mooring at 09:30. Refuelling started at 10:30. Around 11:18 local time (08:18 UTC), a small craft approached the port side of the destroyer, and an explosion occurred, putting a 40-by-60-foot gash in the ship’s port side according to the memorial plate to those who lost their lives. According to former CIA intelligence officer Robert Finke, the blast appeared to be caused by explosives moulded into a shaped charge against the hull of the boat. The blast hit the ship’s galley, where the crew was lining up for lunch. The crew fought flooding in the engineering spaces and had the damage under control by the evening. Divers inspected the hull and determined that the keel was not damaged. Seventeen sailors were killed and 39 others were injured in the blast. The injured sailors were taken to the United States Army’s Landstuhl Regional Medical Center near Ramstein, Germany and, later, back to the USA. The attack was the deadliest against a US Naval vessel since the Iraqi attack on the USS Stark (FFG-31) on 17 May 1987. The asymmetric warfare attack was organised and directed by Osama bin Laden’s al-Qaeda terrorist organisation. We collected data for this case study by harvesting the web and, using the IDM techniques, developed the sociogram of the people, organisations, etc., involved in the tragic plot, which is shown in Figure 3. Using Algorithm A (see Section 4), we detected the indirect and direct connections between two nodes (M.H.a. Ahdal and O.B. Laden) from the USS Cole Bombing Network (as shown in Figure 4); these connections are depicted in Figure 5.
Figure 4 USS Cole Bombing Terrorist Network (see online version for colours)
Figure 5 The result of Algorithm A by selecting two nodes (see online version for colours)
We also tried Algorithm B (see Section 4), selecting the same nodes, M.H.a. Ahdal and O.B. Laden; the detected subgraph is shown in Figure 6.
Figure 6 The result of Algorithm B by selecting the nodes (see online version for colours)
It is interesting to note that we found M.H.a. Ahdal and O.B. Laden to be connected with each other through a mediator, Harithi (as shown in Figure 7).
Figure 7 The shortest distance between M.H.a. Ahdal and O.B. Laden (see online version for colours)
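A shortest path of the kind shown in Figure 7, where two nodes are joined through a single mediator, can be found in an unweighted network with a breadth-first search. The sketch below is illustrative only: the function name `shortest_path`, the adjacency-map representation and the placeholder labels (`source`, `mediator`, `target`, `x`, `y`) are assumptions, and the toy graph is not the USS Cole network.

```python
from collections import deque
from typing import Dict, List, Optional, Set

Graph = Dict[str, Set[str]]


def shortest_path(graph: Graph, source: str, target: str) -> Optional[List[str]]:
    """Return one geodesic from source to target, or None if none exists."""
    if source not in graph or target not in graph:
        return None
    parents: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source to rebuild the path.
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical network: a two-step route via a mediator and a longer detour.
graph: Graph = {
    "source": {"mediator", "x"},
    "mediator": {"source", "target"},
    "x": {"source", "y"},
    "y": {"x", "target"},
    "target": {"mediator", "y"},
}
path = shortest_path(graph, "source", "target")
print(path)
```

Because breadth-first search explores nodes in order of distance, the first time the target is reached the recovered path has the minimum number of ties.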
Note that the metadata of every node are stored in the database, as are all the connections of each node with other nodes. The metadata of M.H.a. Ahdal and his connections, as detected by our software prototype, are shown in Figure 8.
Figure 8 Metadata of one terrorist as detected by iMiner (see online version for colours)
The results obtained from the algorithms are very promising, and we expect that the algorithms will be useful to intelligence agencies tracing terrorists.
7 Conclusion and discussion
After the 9/11 attacks, SNA has increasingly been used to study terrorist networks. These covert networks share some features with conventional networks, but they are harder to identify, because they mask their transactions. The most complicating factor is that the terrorist networks are loosely organised networks having no central command structure; they work in small groups, who communicate, coordinate and conduct their campaign in a network-like manner. There is a pressing need to automatically collect data on terrorist networks, analyse such networks to find hidden relations and groups, prune data sets to locate regions of interest, detect key players, characterise the structure, trace points of vulnerability and calculate the efficiency of the network (how fast communication is being made). Terrorist networks consist of many cells (subgroups) that attend to particular parts of their environment. Subgroups are subsets of actors among whom there are relatively strong, direct, intense, frequent, or positive ties. From the ideas of subgroups within a network, we can understand social structure and the embeddedness of individuals. For example, some persons may act as ‘brokers’ between groups. Some actors may be part of a tightly connected and closed elite, whereas others are completely isolated from the group. Such differences in the ways that actors are embedded in the structure of groups within a network can have profound consequences for the ways the actors see the
network, and the behaviours that they are more likely to practise. Subgroup detection aims at clustering the nodes of a graph into subgroups that share common characteristics; to some extent, subgraph discovery does the same job by finding interesting or common patterns in a graph. In this paper, we proposed algorithms for subgroup detection and demonstrated them with a real-time case study of the USS Cole bombing terrorist network. The algorithms make the following contributions:
• detect all terrorists that are directly or indirectly connected to specified terrorists
• detect paths that connect two specified terrorists
• detect connections between groups of terrorists
• uncover connections between the root (a node) and a destination (another node in the terrorist cell).
The demonstration pinpoints the subset of data that is directly or indirectly connected to a terrorist or group of terrorists. In this paper, we presented our findings from limited exercises exploring the utility of the algorithms for detecting subgroups in terrorist networks. The proposed algorithms could be used by law enforcement agencies. We are also confident that real-time or near real-time information from a multiplicity of databases has the potential to generate early warning signals useful in detecting and deterring terrorist attacks. It is necessary, of course, to have experts in the loop. This research has provided substantive and in-depth analyses of terrorist networks. Although real-world data are hard to come by, the data sets collected by harvesting the web provide an excellent starting point for analysing terrorist networks. This investigation is, of course, not without shortcomings. The data set we have demonstrated was harvested from the web, without any validation of the data records by counterterrorism agencies. So, despite a thorough investigation, elements of this paper cannot help but be speculative, especially where the data set is concerned. There also remains the possibility that other factors, unknown to the authors, might have been at play. Although this paper provides some new perspective, it does not offer a comprehensive or exclusive account. Future research directions include refinement of the algorithms developed; we are also working with counterterrorism experts on scenarios (to develop new algorithms) that can help in the detection of possible subgroups in terrorist networks.
References
Borgatti, S.P. (2002) ‘Stopping terrorist networks: can social network analysis really contribute?’, Proceedings of Sunbelt International Conference on Social Networks, New Orleans, LA, pp.13–17.
Chen, H. (2003) ‘Digital government: technologies and practices’, Decision Support Systems, Vol. 34, No. 3, pp.223–227.
Coffman, T., Greenblatt, S. and Marcus, S. (2004) ‘Graph-based technologies for intelligence analysis’, Communications of the ACM, Vol. 47, No. 3, pp.45–47.
Cook, D.J. and Holder, L.B. (2000) ‘Graph-based data mining’, IEEE Intelligent Systems, Vol. 15, No. 2, pp.32–41.
Corbin, J. (2002) Al-Qaeda: In Search of the Terror Network that Threatens the World, Thunder’s Mouth Press, London, UK.
Degenne, A. and Forse, M. (1999) Introducing Social Networks, Sage Publications, London.
DeRosa, M. (2004) Data Mining and Data Analysis for Counterterrorism, CSIS Report, March, http://www.csis.org/media/csis/pubs/040301_data_mining_report.pdf
Ehrlich, K. and Carboni, I. (2005) Inside Social Network Analysis, IBM Technical Report, http://domino.watson.ibm.com/cambridge/research.nsf/0/3f23b2d424be0da6852570a500709975/$FILE/TR_2005-10.pdf
Everett, M.G. and Borgatti, S.P. (1999) ‘The centrality of groups and classes’, Journal of Mathematical Sociology, Vol. 23, No. 3, pp.181–201.
Frost, R.A. (1982) ‘Binary relational storage structures’, The Computer Journal, Vol. 25, No. 3, pp.358–367.
German, D. and Mockus, A. (2003) ‘Automating the measurement of open source projects’, Proceedings of the 3rd Workshop on Open Source Software Engineering, 25th International Conference on Software Engineering, Portland, available at http://mockus.us/papers/oose03.pdf
Getoor, L. (2003) ‘Link mining: a new data mining challenge’, SIGKDD Explorations, Vol. 5, No. 1, pp.84–89.
Goldberg, H.G. and Wong, R.W.H. (1998) ‘Restructuring transactional data for link analysis in the FinCEN AI System’, Proceedings of 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis, AAAI Press, Menlo Park, CA.
Harper, W.R. and Harris, D.H. (1975) ‘The application of link analysis to police intelligence’, Human Factors, Vol. 17, No. 2, pp.157–164.
Heuer, R.J. (2001) Psychology of Intelligence Analysis, Centre for the Studies of Intelligence, Central Intelligence Agency, available at http://www.au.af.mil/au/awc/awcgate/psych-intel/index.html
Hougham, V. (2005) Sociological Skills Used in the Capture of Saddam Hussein, http://www.asanet.org/footnotes/julyaugust05/fn3.html
Klerks, P. (2001) ‘The network paradigm applied to criminal organizations: theoretical nitpicking or a relevant doctrine for investigators? Recent developments in the Netherlands’, Connections, Vol. 24, No. 3, pp.53–65.
Krebs, V. (2001) Unlocking Terrorist Networks, First Monday, available at http://www.firstmonday.dk/issues/issue7_4/krebs
Krebs, V. (2002) ‘Mapping networks of terrorist cells’, Connections, Vol. 24, pp.45–52.
Marcus, S.M., Moy, M. and Coffman, T. (2007) ‘Social network analysis’, in Cook, D.J. and Holder, L.B. (Eds.): Mining Graph Data, John Wiley & Sons, Chichester.
McAndrew, D. (1999) ‘The structural analysis of criminal networks’, in Canter, D. and Alison, L. (Eds.): The Social Psychology of Crime: Groups, Teams, and Networks, Offender Profiling Series, III, Dartmouth, Aldershot.
Memon, N. (2007a) Investigative Data Mining: Mathematical Models of Analyzing, Visualizing and Destabilizing Terrorist Networks, PhD Dissertation, Aalborg University, Denmark.
Memon, N. (2007b) A First Look on iMiner’s Knowledge base and Detecting Hidden Hierarchy of Riyadh Bombing Terrorist Networks, Lecture Notes in Computer Science and Engineering, Proceedings of IMECS 2007, Hong Kong.
Memon, N. and Larsen, H.L. (2006a) ‘Structural analysis and destabilizing terrorist networks using investigative data mining’, Proceedings of Advanced Data Mining and Applications, Lecture Notes in Artificial Intelligence (LNAI 4093), Xi’An, China, pp.1037–1048.
Memon, N. and Larsen, H.L. (2006b) ‘Practical approaches for analysis, visualization and destabilizing terrorist networks’, Proceedings of the First International Conference on Availability, Reliability and Security (AReS 2006), Vienna, Austria, pp.906–913.
Memon, N. and Larsen, H.L. (2006c) ‘Practical approaches for analysis, visualization and destabilizing terrorist networks’, Proceedings of ARES 2006, Vienna, Austria, pp.906–913.
Memon, N., Hicks, D.L. and Larsen, H.L. (2007) ‘Harvesting terrorists information from web’, 11th International Conference Information Visualization, IV ’07, Electrical Engineering/Electronics, Computer, Communications and Information Technology Association, pp.664–671.
Memon, N., Qureshi, A.R., Wiil, U.K. and Hicks, D. (2009a) ‘Novel algorithms for subgroup detection in terrorist networks’, Proc. International Conference on Availability, Reliability and Security (ARES 2009), Fukuoka, Japan, pp.572–577.
Memon, N., Wiil, U.K., Alhajj, R., Atzenbeck, C. and Harkiolakis, N. (2009b) ‘Harvesting covert networks: the case study of the iMiner database’, Int. J. Networks Virtual Organizations, (to appear).
Ressler, S. (2006) ‘Social network analysis as an approach to combat terrorism: past, present, and future research’, The Journal of Naval Postgraduate School Center for Homeland Defense and Security, Vol. 2, No. 2, available at http://www.hsaj.org/pages/volume2/issue2/pdfs/2.2.8.pdf
Sageman, M. (2004) Understanding Terror Networks, University of Pennsylvania Press, Philadelphia, USA.
Scott, J. (2004) Social Network Analysis: A Handbook, Sage Publications, London, UK.
Seifert, J.W. (2006) Data Mining and Homeland Security: An Overview, Congressional Research Service, updated 27 January (Order Code RL31798), See also K.A.
Senator, T.E. (2005) ‘Link mining applications: progress and challenges’, SIGKDD Explorations, Vol. 7, No. 2, pp.76–83.
Shultz, R.H. and Vogt, A. (2004) ‘The real intelligence failure on 9/11 and the case for a doctrine of striking first’, in Howard, R.D. and Sawyer, R.L. (Eds.): Terrorism and Counterterrorism: Understanding the New Security Environment, McGraw-Hill, Dushkin, pp.405–428.
Sparrow, M. (1991) ‘The application of network analysis to criminal intelligence: an assessment of the prospects’, Social Networks, Vol. 13, pp.251–274.
Stewart, T. (2001) Six Degrees of Mohamed Atta, http://money.cnn.com/magazines/business2
Wasserman, S. and Faust, K. (1994) Social Network Analysis: Methods and Applications, Series, Cambridge University Press, Cambridge.
Xu, J. and Chen, H. (2005) ‘Criminal network analysis and visualization’, Communications of the ACM, Vol. 48, No. 6, pp.100–107.
Bibliography
Brandes, U. and Wagner, D. (2004) ‘visone – Analysis and visualization of social networks’, in Jünger, M. and Mutzel, P. (Eds.): Graph Drawing Software, Springer, Berlin, pp.321–340.
Memon, N. and Larsen, H.L. (2006a) ‘Practical algorithms of destabilizing terrorist networks’, Proceedings of IEEE Intelligence Security Conference, Lecture Notes in Computer Science (LNCS 3976), San Diego, USA, pp.398–411.
Memon, N. and Larsen, H.L. (2006d) ‘Structural analysis and destabilizing terrorist networks’, Proceedings of The 2006 International Conference on Data Mining (DMIN 2006), 25–29 June, Las Vegas, Nevada, USA, pp.296–302.
Memon, N., Kristoffersen, K.C., Hicks, D.L. and Larsen, H.L. (2007b) ‘Detecting critical regions in covert networks: a case study of 9/11 terrorists network’, Proceedings of the 2nd International Conference on Availability, Reliability and Security (ARES ’07), Vienna, Austria, pp.861–870.
Taipale (2003) ‘Data mining and domestic security: connecting the dots to make sense of data’, Columbia Science and Technology Law Review, pp.22–23.
Note
1 http://en.wikipedia.org/wiki/USS_Cole_bombing