3 A Machine Learning-Based Approach to Digital Triage

Fabio Marturana1, Simone Tacconi2 and Giuseppe F. Italiano1

1 Department of Civil Engineering and Computer Science Engineering, University of Tor Vergata, Rome, Italy
2 Computer Forensic Unit, Postal and Communication Police, Rome, Italy

3.1 Introduction

Nowadays the pervasiveness of high-tech multimedia devices and fast mobile networking is unprecedented, and this trend is expected to grow in the future. Terabytes of image, audio and video files are downloaded or exchanged every day over the Internet, contributing to the global diffusion of cyber threats and cybercrimes. As a consequence, the impact of multimedia forensics, the sub-field of digital forensics whose task is to extract evidence from multimedia data, is dramatically gaining momentum due to the central role played by multimedia evidence in modern investigations. Unfortunately, the road to building a general understanding of a digital investigation is often paved with challenges such as time restrictions, huge volumes of data requiring in-depth analysis, legal constraints on excessive search and seizure, and the protection of the suspect's privacy. The situation is worsened by the short innovation cycle of digital devices, which makes it difficult to keep up with innovation and technology (Poisel and Tjoa 2011). Further, under most digital forensics frameworks in use today (Palmer 2001; Kent et al. 2006), the seizure of an entire hard disk drive or a mobile phone's memory, and the consequent creation of a forensically sound image, is the best practice followed by law enforcement responders to preserve evidence integrity. Later on, seized devices and forensic images are sent to the Digital Forensic Laboratory (DFL) for in-depth examination with time-consuming and computationally intensive procedures, which allow investigators to reconstruct the event timeline and extract evidence of the suspect's guilt.

As a consequence of the backlog of a typical DFL, spanning from 9 to 12 months, such a framework may cause unacceptable delays in the prosecution of criminals (Gomez 2012). Further, since incriminatory information normally resides in a minimal part of a digital device's memory, most forensic frameworks currently in use may even violate the suspect's privacy (Hong et al. 2013). Nevertheless, in crimes such as terrorist attacks, murder, kidnapping and human trafficking, just to name a few, the need for timely identification, analysis and interpretation of the evidentiary material found on the crime scene is crucial (Rogers et al. 2006; Pearson and Watson 2010). In such cases, on-site selective pre-examination of available data sources with digital triage tools, for instance, could provide investigators with actionable intelligence to guide the search and seizure process. Further, adopting a digital triage framework to classify digital sources as relevant or not to the case at hand could be a viable solution to lighten the DFL's workload (Gomez 2012). Finally, digital triage could address the issue of selectively identifying and collecting digital evidence on the crime scene in those cases where excessive search and seizure is not allowed by the legal system, as happens in the Republic of Korea, for instance, where some courts have even ruled inadmissible evidence gathered outside the scope of the search warrant (Hong et al. 2013).

Taking into account the preceding considerations, the need arises to look for new forensic frameworks, based on digital triage, that identify relevant information and isolate probative content (i.e. evidence) in a timely manner within the huge amount of data pertaining to a single crime investigation. Digital triage is therefore emerging as a practical solution to cope with time-critical investigations, huge examination workloads and legal restrictions on excessive search and seizure. In this regard, methods for rapid data extraction have been developed for specific purposes, such as the identification of relevant hard drives in fraud cases on the basis of common credit card numbers and email addresses retrieved (Garfinkel 2009). However, despite such data retrieval and matching functions, the classification of digital evidence as relevant or not to a criminal investigation still remains a mostly manual process carried out by experienced individuals. More recently, drawing inspiration from the earliest digital triage models (Rogers et al. 2006; Pearson and Watson 2010), research has focused on the need to automate this process by means of crime templates (Cantrell et al. 2012), similarity digests (Roussev and Quates 2012), automated tools for bulk data extraction and analysis (Garfinkel 2013), and machine learning-based supervised (Grillo et al. 2009; Garfinkel et al. 2010; Gomez 2012) and unsupervised approaches (Drezewski et al. 2012).

Dealing with machine learning-based digital triage, this chapter presents a framework for selective pre-examination and statistical classification of digital data sources that may be deployed both on the crime scene and at DFLs. It is aimed at (i) providing investigators with quick actionable intelligence on the crime scene when time is a critical factor, (ii) lightening the workload at DFLs and (iii) protecting the suspect's privacy where excessive search and seizure of data is not allowed by the legal system. Based on the opinion of surveyed forensic experts and the exploitation of machine learning classification algorithms, the framework presents two main advantages with respect to most examination techniques in use today, as it (i) requires limited manual intervention and (ii) produces measurable and reproducible error rates. The latter aspect makes it compliant, for instance, with the U.S. Supreme Court's Daubert ruling (Heilbronner 2011), which states that a scientific method, in order to be admissible in a trial, must have a 'known or potential' error rate and must be subject to empirical testing.

3.1.1 Chapter Outline

The remainder of the chapter is organized as follows: related work on digital triage is summarized in Section 3.2, whereas a machine learning-based digital triage framework for use in data source supervised classification is illustrated in Section 3.3. Subsequently, a case study on the crime of exchange of child pornography is presented in Section 3.4 as an example of the described framework. Concluding remarks are drawn in Section 3.5, and challenges and future directions for the forensic community are finally explored in Section 3.6.

3.2 Related Work on Digital Triage

This section provides an overview of the history of digital triage and of the state of the art of research on the subject.

3.2.1 Triage in the Medical Field

Derived from the French verb trier, meaning to separate, sift or select, the term triage may have originated during the Napoleonic Wars from the work of Dominique Jean Larrey. The term was used further during World War I by French doctors treating the battlefield wounded at the aid stations behind the front. Those responsible for the removal of the wounded from a battlefield, or for their care afterwards, would divide the victims into three categories:

1. Those who are likely to live, regardless of what care they receive;
2. Those who are likely to die, regardless of what care they receive;
3. Those for whom immediate care might make a positive difference in outcome.

In the medical field, triage is referred to as the process of determining the priority of patients' treatments based on the severity of their condition. This allocates medical resources to patient treatment efficiently when resources are insufficient for all to be treated immediately. Triage may determine the order and priority of emergency treatment, the order and priority of emergency transport, or the transport destination for the patient. Triage may also be applied to patients arriving at the emergency department, or telephoning medical advice systems, among others. As medical technology has advanced, so have modern approaches to triage, which are increasingly based on scientific models. The victim's categorization is frequently the result of algorithm-based triage scores derived from specific physiological assessment findings. As triage concepts have become more sophisticated, triage guidance has evolved into both software and hardware decision support systems for use by caregivers in both hospitals and the battlefield.

3.2.2 Early Digital Triage Models

The term triage has been adopted in the context of digital devices and technology as well, to describe the process of prioritizing digital data sources according to their importance in relation to an investigation (Chipman et al. 1980). Digital triage may be referred to as the prioritization process that attempts both to guide the search and seizure process on the crime scene and to reduce the volume of data that needs to be exhaustively examined at DFLs (Pollitt 2013). Digital triage has experienced an outstanding development in recent years, and it is still evolving at a rapid pace. Research on the subject, initially at least, attempted to provide a solution to the delay introduced by most forensic frameworks currently in use, which proved inappropriate for time-sensitive criminal cases such as kidnapping or danger to a human being. In this regard, Rogers et al. (2006) proposed a pioneering field triage model, called the computer forensics field triage process model (CFFTPM), aimed at gathering actionable intelligence on the crime scene in investigations where time was a critical factor. The authors, motivated by the need to provide investigative leads in time-critical situations, defined a workflow for identification, analysis and interpretation of digital evidence on the crime scene, without the need to acquire a forensic copy of the device or take it back to the lab for in-depth examination. Since on-scene processing entails a real risk of exhibit tampering, the CFFTPM assumes that the crime scene has been properly secured and controlled, and includes integrity protection tasks (e.g. hardware write blockers and forensic software) to assure that any potential evidence found could undergo traditional examination and analysis back at the lab in a controlled environment. After proper prior planning, by which the various possibilities of the crime scene are quantified and the expertise of the various investigators on the investigation team is qualified, the CFFTPM enables the triage phase, a process in which things are ranked in terms of importance or priority. Essentially, those items, pieces of evidence or potential containers of evidence that are the most important or the most volatile are dealt with first. Once a system or storage media has been identified and prioritized during the triage phase, the actual examination and analysis are conducted. Links between the evidence found and a specific, identifiable suspect are established to ascertain which individual or individuals are responsible for, or even had knowledge of, incriminating data found on the storage media. Thorough knowledge of user profiles and artefacts relating to usage is essential to accomplishing this goal. User profiles, home directories, file permissions, MAC times (i.e. Modification, Access, Creation), Internet activity, emails and web browsing are analyzed to place artefacts in context with verifiable real-world events. Doing this as quickly as possible means that a suspect, presented with clear evidence indicating that he or she and no other person is responsible for the evidence recovered, may feel compelled to admit guilt. Drawing its inspiration from the CFFTPM, and considering the challenges posed by the battlefield crime scene, Pearson and Watson (2010) defined the so-called digital triage forensics (DTF) process, a methodology tailored for terrorist attacks, or post-blast investigations, which allows the collection of actionable intelligence and potential evidence in the battlefield.

The DTF model is in line with the CFFTPM, with the exception that triage processing is done at the forward operating base (FOB), before in-depth analysis, rather than on the crime scene. The reason is obvious: the forensic team, which is tasked with collecting evidence in the battlefield, must always operate in a timely manner and under safe conditions. It is noteworthy that both Rogers et al. and Pearson and Watson focused their attention on the crime scene rather than the DFLs since, at the time, the issue of heavy workloads at DFLs was not as prevalent in research. As the issue of DFLs being unable to cope with heavy workloads became more widespread, the need for digital triage methodologies to reduce backlogs arose (Gomez 2012). In conclusion, digital triage methodologies could be used to address the following problems:

- Timely identification of case-relevant items on the crime scene in time-critical situations;
- Examination of large volumes of data and lightening of workloads at DFLs;
- Protection of the suspect's privacy, by reducing the scope of the search from the entire device to only some areas of interest more likely to contain incriminatory data and evidence.

3.2.3 Machine Learning-Based Digital Triage

Dealing with machine learning-based digital triage, it is worth mentioning the pioneering Five Minutes Forensics (5MF) technique described by Grillo et al. (2009). The cited authors defined a framework for fast computer user profiling based on five predefined user categories (i.e. occasional, chat-internet, office worker, experienced and hacker) and a set of features related to each of them. Concerning this feature set, or feature vector, the authors assumed that, for instance, an experienced computer user possesses operating system configuration skills, while a hacker might own particular software packages, cracking tools and so on. Based on these observations, the statistical analysis of such features, extracted from a seized hard disk, makes it possible to label each computer user as belonging to one of the five predefined categories, and thus to define examination priorities at DFLs. In a child pornography exchange investigation where many computers are found on the crime scene, for example, a computer user classified as occasional will be treated with lower priority than a chat-internet user, due to the limited amount of evidentiary material that is likely to be found by investigators.

The 5MF technique (Grillo et al. 2009) has influenced several authors working on machine learning-based digital triage. In particular, Marturana et al. (2011a, 2011b) built their model for mobile handset classification on the basis of the 5MF technique. Marturana et al. (2011a) analyzed a corpus of data extracted from handsets and smartphones in court cases of exchange of child pornography, corporate espionage, murder and human trafficking, and used multi-class categorization to classify objects on the basis of the owner's usage profile (i.e. base, medium and expert). Further, Marturana et al. (2011b) processed the same data corpus with binary categorization to identify test objects allegedly used for exchanging child pornography material. Finally, Marturana and Tacconi (2013) discussed the findings of a case study on court cases of copyright infringement, with a dataset extracted from desktop and laptop computers, as an application of the aforementioned methodology.

Grillo et al. (2009) influenced Cantrell et al.'s (2012) work as well, in their definition of a partially automated and crime-specific digital triage process model under development at Mississippi State University. In that work, the authors used crime templates to identify relevant items in a given crime context. In child pornography, for instance, a template included the search for file names within a dictionary of words commonly used by pornographers, or for mathematical hash values matching a repository of well-known pornographic material. In the latter case, even if the file name was forged, its hash value was likely to remain the same. Such templates are made available to the investigators during on-site search and seizure and to the examiners at DFLs. This is not the first time the concept of a crime-related template has emerged in digital triage, since some authors (Marturana et al. 2011a, 2011b; Gomez 2012) had already identified the need to train machine learning supervised classifiers on the basis of crime-related features, which is very similar to the idea of searching a digital source with a template. The two approaches differ in that, with templates, the pre-examination phase focuses a priori on the search for information pertaining to the crime under investigation (e.g. specific files, file types, strings, logs and executables) whereas, with a machine learning-based approach, it is necessary to define a reasonable (i.e. in terms of classification accuracy) number of likely crime-related features to be extracted during the pre-examination stage and let the algorithm do the rest of the work.

In the latter case, it is noteworthy that classification accuracy is influenced by factors such as the total number of features, their correlation or linear dependence, the dimensionality of the training set, etc.

With the aim of solving the multiuser carved data ascription problem, Garfinkel et al. (2010) followed an approach similar to the one presented in this chapter. The authors used a machine learning-based technique to ascribe files extracted from multiuser personal computers. The approach taken to extract file system, file placement and embedded file metadata relies upon specific metadata extraction tools (i.e. fiwalk's metadata extraction system), according to the following workflow (a minimal sketch follows the list):

- Ascribable (i.e. with known owner) exemplar files are recovered from the disk image.
- For each exemplar, metadata are extracted.
- The data is pre-processed.
- The exemplars are used to train the classifier.
- Classifier accuracy is calculated.
- The carved files (i.e. with unknown owner) are processed and metadata extracted.
- The carved data is processed and classified.
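The shape of this workflow can be illustrated in a few lines of code. The sketch below is not Garfinkel et al.'s actual fiwalk-based tooling: it assumes the file metadata have already been reduced to numeric vectors, and both the feature layout and the owner names are hypothetical.

```python
# Minimal sketch of the exemplar/carved-file ascription workflow.
# Hypothetical metadata features per file: size, directory depth, timestamp offset.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Steps 1-4: ascribable exemplars, i.e. metadata vectors with a known owner.
exemplar_vectors = [[4096, 3, 120], [812, 5, 40], [20480, 2, 300], [64, 6, 15]]
exemplar_owners = ["alice", "bob", "alice", "bob"]
clf = DecisionTreeClassifier(random_state=0).fit(exemplar_vectors, exemplar_owners)

# Step 5: classifier accuracy is estimated before trusting the ascriptions.
print(cross_val_score(clf, exemplar_vectors, exemplar_owners, cv=2))

# Steps 6-7: carved files (unknown owner) are vectorized the same way and classified.
carved_vectors = [[3900, 3, 110], [70, 6, 10]]
print(clf.predict(carved_vectors))
```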

3.2.4 Other Multimedia Source Classification Techniques

Dealing with multimedia forensics in child pornography investigations, a technique similar to digital triage allows for the extraction of crime-relevant information (e.g. nudity) from specific files (e.g. photos and video footage) with custom algorithms (e.g. skin detection), and the consequent classification of the digital data source. In this particular context, two areas of interest arise, concerning image and video media analysis respectively. Regarding image content classification, on the one hand, works on the detection of pornographic material refer, for instance, to the detection of skin regions in an image and their matching with human bodies by applying geometric grouping rules (Fleck et al. 1996), or to the estimation of the 'skin probability' of a pixel based on its colour (Jones and Rehg 2002). Dealing with video footage, on the other hand, Jansohn et al. (2009) proposed a new approach to the automatic detection of pornographic content in video databases, which adds motion analysis and periodicity detection to key frame analysis (i.e. extraction of representative key frames and application of image classification techniques on them).

3.3 A Machine Learning-Based Digital Triage Framework

This section describes a machine learning-based digital triage framework for use in crime scene search and seizure (on-scene digital triage) and DFL examination (off-site digital triage). The framework is laid out to integrate with other forensic frameworks currently in use (e.g. based on acquisition, retrieval and analysis), to preserve digital evidence integrity, to guarantee analysis repeatability, and to selectively identify relevant (e.g. requiring in-depth analysis) digital sources.

3.3.1 Machine Learning Terminology

Since the described framework is based on machine learning, a description of related terms that will be frequently used throughout the chapter is mandatory. Machine learning is concerned with the construction and study of systems that can learn from data. It adopts the same principles that humans use to learn, that is to say, repetition and experience. Among the various types of machine learning schemes reported in the literature, we need to mention supervised and unsupervised learning. Supervised learning is the process of learning from a training set of correctly identified exemplars, or training instances. This is achieved by observing present behaviour and comparing it with past observations to discover structural patterns which can be examined, reasoned about and used to inform decisions (Witten et al. 2011). Supervised machine learning is the classification technique adopted in this work. Typically, the process requires many exemplars to be able to find such structural patterns. This collection is known as the learning data or training set. Defining the training set is crucial to the machine learning process, as it provides the known properties upon which future predictions of unknown instances (i.e. test instances) can be made. In binary supervised machine learning, for instance, during the training phase each instance is manually given a binary classification known as its class (e.g. yes or no, relevant or non-relevant). In this regard, the training is defined as controlled, or supervised. The corresponding unsupervised procedure, out of the scope of the present work, is known as clustering or cluster analysis, and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space).

3.3.1.1 State Vector

Supervised learning requires data to be structured into a state vector. Such a vector encompasses a set of variables related to the classification problem. Once the state vector has been defined, a set of values corresponding to those variables is populated each time a new instance of data is created. In the context of this study, the state vector encompasses a number of features, whereas the instances are populated with the corresponding values.

3.3.1.2 Features

Features and attributes are key elements of the framework, and the two terms will be used interchangeably throughout the chapter. A feature or attribute, in this context, is defined as a quantitative measure of a potentially crime-related item. In other words, a feature x is something potentially related to the investigation that one can extract from a digital source and measure. An example could be the number of .jpg files recovered from a digital source (e.g. number of .jpg files = 24). It is noteworthy that, since the described framework is based on a statistical model, there is no causal relationship between a single feature or set of features and the investigated crime. In other words, a feature x taking a value y does not necessarily mean that the digital source is 100% related to the investigated crime. We can only predict the likelihood that each digital source is related to the crime at hand, and provide the corresponding error rate, based on the occurrence of a specific value for that feature or group of features.

3.3.1.3 Instances and Exemplars

The set of values corresponding to the feature vector extracted from each digital data source is called an instance, which may be represented as a vector of data. Instances represent the content mapping of each digital data source into model data according to the state vector. For example, if a concept can be modelled using two integer variables, x and y, the corresponding state vector will be (x, y), and (3, 7) an instance. Instances derived from training set devices, whose relation with the investigation is known, are called exemplars. The training set is a collection of exemplars given as input to the classification algorithm. This provides the basis upon which a machine learning algorithm may be trained to determine a device's relevance or non-relevance to an investigation. Features and exemplars are important to the overall model, as they are given as input to the processing algorithms.
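The following toy fragment restates this terminology in code; the state vector, values and labels are invented for the example.

```python
# A toy state vector with two integer variables, as in the (x, y) example above.
state_vector = ("x", "y")

# An instance is the set of values a digital source takes on the state vector.
instance = (3, 7)

# Exemplars are instances whose relation to the investigation is already known,
# so each carries a class label; together they form the training set.
exemplars = [
    ((3, 7), "relevant"),
    ((0, 1), "non-relevant"),
    ((5, 9), "relevant"),
]
training_instances = [vec for vec, label in exemplars]
training_classes = [label for vec, label in exemplars]
```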

3.3.2 The Framework in Detail

Now that the machine learning terminology has been defined, we can provide the framework details and describe the process that we used to validate the methodology. Exploiting some concepts derived from other digital triage models (Garfinkel et al. 2010; Marturana and Tacconi 2013), the framework can be generalized as encompassing four phases, namely data extraction, attribute (or feature) extraction, processing and classification, corresponding to the tasks of data collection, processing and presentation, as illustrated in Figure 3.1.

Figure 3.1 Digital triage workflow.

An overview of the phases included in the workflow is as follows:

Collection
- A digital data source (test object) is collected/identified on the crime scene for forensic analysis.
- A forensic image of the test object is optionally created.
- Data is extracted from either the test object or its forensic copy, on-site or at DFLs, using appropriate tools and, where possible, adopting forensically sound procedures.

Processing
- A supervised machine learning algorithm is trained on a knowledge base of already classified training exemplars pertaining to the crime (the training set).
- The data extracted from the test object is turned into features according to the state vector, and an observation (test instance), which is the machine learning equivalent of the test object, is populated accordingly.
- The test instance is provided as input to the algorithm and processed.

Presentation
- As a result of the processing phase, the test instance is given a binary classification and the corresponding test object is categorized as relevant or negligible to the investigation (i.e. the data source's relevance to the case).
- The data source's relevance to the case is presented to the investigator on the crime scene, who can proceed with the search and seizure, or to the examiner, who may decide to perform further analysis at DFLs.

3.3.3 Collection-Data Extraction

Data extraction is tasked with collecting relevant information from the digital data sources in a timely manner. It can be carried out using both open source and commercial tools. Once extracted from the media, each piece of information (e.g. file information or metadata) is ready to be presented as a feature. Concerning guidelines for the extraction of crime-relevant data from computer systems, Kent et al. (2006) suggested to:

- look at the last change date of critical files,
- examine configuration and start-up files,
- search for hacking tools (e.g. password crackers, key loggers etc.),
- examine the password file for unauthorized accounts,
- search for keywords appropriate to the incident,
- search for hidden areas, slack space and cache,
- look for changes to files, critical file deletions and unknown new files,
- collect a list of all e-mail addresses as well as visited and bookmarked URLs.

Regarding mobile phones and smartphones, Jansen and Ayers (2007) identified the following areas of interest:

- Subscriber and equipment identifiers,
- Date/time, language and other settings,
- Phonebook information,
- Calendar information,
- SMS and MMS history,
- Dialed, incoming and missed call logs,
- E-mail history,
- Photos, audio recordings and video footage,
- Instant messaging, App and Web browsing activities,
- E-documents,
- GPS information.
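As a toy illustration of how such areas of interest become features, the sketch below counts occurrences in hypothetical call-log and SMS records; the record layout is invented, and real extraction tools emit their own formats.

```python
# Hypothetical flat export of a phone's call and message logs.
records = [
    {"type": "call", "direction": "received", "duration": 35},
    {"type": "call", "direction": "dialed", "duration": 120},
    {"type": "call", "direction": "missed", "duration": 0},
    {"type": "sms", "direction": "sent"},
    {"type": "sms", "direction": "sent"},
]

calls = [r for r in records if r["type"] == "call"]
received = [r for r in calls if r["direction"] == "received"]

# Each counted quantity becomes one slot of this device's instance.
features = {
    "num_received_calls": len(received),
    "num_dialed_calls": sum(r["direction"] == "dialed" for r in calls),
    "num_missed_calls": sum(r["direction"] == "missed" for r in calls),
    "num_sent_sms": sum(r["direction"] == "sent" for r in records if r["type"] == "sms"),
    "mean_duration_received_calls":
        sum(r["duration"] for r in received) / len(received) if received else 0,
}
print(features)
```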

3.3.4 Processing-Feature Extraction, Dataset Creation and Processing Algorithms

As a result of the data extraction phase, all the features related to a searched digital device are collected within an instance. The collection of all such instances forms the dataset used for classification tasks, which may be represented as a two-dimensional matrix called the input matrix, as illustrated in Table 3.1.

Table 3.1 Input matrix

Feature                                      Phone #1   Phone #2   Phone #3   ...   Phone #M-1   Phone #M
Feature #1 - Number of phonebook contacts    5          25         12         ...   42           23
Feature #2 - Number received calls           148        12         57         ...   25           128
Feature #3 - Number dialed calls             57         85         45         ...   12           78
Feature #4 - Number missed calls             12         47         120        ...   75           49
Feature #5 - Number received SMS             12         78         23         ...   0            0
Feature #6 - Number sent SMS                 42         15         46         ...   8            7
...
Feature #N-1 - Mean duration received calls  120        40         60         ...   15           12
Feature #N - Mean duration dialled calls     125        12         42         ...   12           5
Class (predicted)                            relevant   relevant   non-relevant ... non-relevant non-relevant

Looking at the input matrix by rows, we find the values taken by each instance in the dataset for a given feature, whereas the instances (i.e. one for each digital device) are represented by the columns. Features are discrete variables with values from countable sets. Such sets may be finite, when a test or a multiple selection is made (true/false, yes/no/maybe, etc.), or countably infinite (i.e. integers), when we count the number of occurrences of a variable. A test on a particular attribute (e.g. are there P2P clients installed on the digital device? Yes/No) is an example of a multiple selection, whereas the number of image and video files extracted from a device is an example where the number of feature occurrences is counted. In our model, nominal features assume values from finite countable sets, whereas numeric features assume values from infinite countable sets. Following the dataset creation, the processing phase relies on machine learning algorithms tasked with classifying each instance in the dataset.

The following is a list of processing tasks, illustrated in the flow chart in Figure 3.2, pertaining to the described framework:

- Training set creation. The collection of a consistent set of exemplars, or past observations whose class is known, is the preliminary step of supervised learning. Exemplars are collected on the basis of their relevance to the investigated crime. The training input matrix is populated by merging all the available exemplars and is then given as input to the selected classifiers.
- Training set classification. Training set exemplars are processed with supervised learning classification algorithms. Two classification methods are available for the purpose: multi-class categorization, where each instance is given the class related to the most likely crime in the target set (e.g. exchange of child pornography, copyright infringement, hacking, murder and terrorism, with regard to our data corpus), and binary categorization, where one classifier per target crime is trained. In multi-class categorization, the instance is associated with the most likely crime among the aforementioned targets and the class variable is given the corresponding value. In binary categorization, the class is given a binary value (i.e. relevant or non-relevant), depending on the likelihood that a connection between the digital media and the target crime exists. In the case study on the exchange of child pornography with mobile phones described later on, binary categorization will be tested. It is noteworthy that the training step is crucial for the whole classification process, as training exemplars must be chosen accurately to avoid both overfitting, when training data is limited, and the imbalanced dataset problem, when the number of training samples pertaining to one (or more) of the classes is small with respect to the others (Weiss 2004).
- Learning accuracy evaluation. Once the training phase is over, learning accuracy is evaluated with a validation technique on the basis of a set of performance indicators, such as the confusion matrix, precision, recall and F-measure. In this specific context, learning accuracy evaluation is crucial to building an effective framework for class prediction that can be used in a courtroom.
- Test set creation. Upon creation, classification and validation of the training set, the model is ready to be used in test mode. Features are extracted from those digital data sources requiring crime-related classification; the related test set instances and test input matrix are populated accordingly.
- Test set classification and performance evaluation. The test input matrix is given as input to the selected supervised learning algorithms and the test set instances are classified accordingly. Testing accuracy is evaluated according to the aforementioned technique and performance indicators.
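A compressed sketch of this train-then-classify loop follows. The chapter's experiments use WEKA; here scikit-learn stands in for it, and the tiny hand-made matrices are placeholders for real training input matrices.

```python
from sklearn.neighbors import KNeighborsClassifier

# Training input matrix: one row per exemplar, one column per feature.
X_train = [
    [240, 35, 12],  # exemplar from a device known to be crime-related
    [15, 2, 0],     # exemplar from an unrelated device
    [310, 50, 9],
    [22, 1, 1],
]
y_train = ["relevant", "non-relevant", "relevant", "non-relevant"]

# Training set classification: fit a binary supervised classifier.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Test set creation and classification: features extracted from a new device
# populate a test instance, which the trained model then labels.
test_instance = [[250, 30, 10]]
print(clf.predict(test_instance))  # -> ['relevant']
```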

Figure 3.2 Processing tasks workflow.

In Section 3.4, an example of training set creation, classification and validation related to child pornography exchange is provided; conversely, it was not possible to provide an example of test set creation and classification, due to the limited number of available instances. The reason is simple: instances already used for training purposes could not also be used for testing.

3.3.5 Presentation

Once algorithmic processing has identified structural patterns in the training data, a dependent variable, called the class, is predicted for each instance in the dataset. Each digital source in the dataset is therefore classified as relevant or not to the crime at hand, and its relevance to the investigation estimated. In other words, the classification output is represented by a binary nominal variable assuming one of two values: relevant or non-relevant. The classification output is then presented to law enforcement responders and forensic examiners for further decisions.

3.3.6 Model Validation

Evaluation of any scientific method is essential to validate its effectiveness and performance. For the proposed methodology to be able to accurately classify test objects, evaluation and validation must be carried out. This will also assist in building legal confidence in the methodology by placing a value on its accuracy. Ideally, the best way to evaluate a classifier would be to have huge amounts of training and test data available for evaluation purposes. However, this is not a realistic scenario: obtaining data can be a major issue, as happens for instance with satellite remote sensing, where a mind-numbing and time-consuming manual process is mandatory to collect any usable data. Digital forensics faces an even more difficult problem with acquiring data (e.g. legal constraints and privacy preservation). Nevertheless, evaluating and validating performance with limited data is still necessary, and the most popular method used in such situations is cross-validation (Witten et al. 2011). Cross-validation is a statistical method for evaluating and comparing algorithms by dividing the available data into two sets, namely training and test sets, respectively used to train and validate the model. In typical cross-validation, training and test sets must cross over in successive rounds, such that each data point has a chance of being validated against. Estimating learning performance using one algorithm, and comparing the performance of different algorithms to find the best one for the available data, are the scientific goals of cross-validation (Refaeilzadeh et al. 2009). A commonly used method for evaluating and comparing learning algorithms is N-fold cross-validation, where N represents a fixed number of folds, or data partitions. At each step of the validation process, it is important that the classes in the original dataset are proportionally represented in both training and test sets in order to avoid data overfitting. It is difficult, for instance, for an algorithm to predict future observations in a binary classification problem when the training set consists only of true instances. Therefore, the data should be rearranged (e.g. randomized) so that each fold contains a good representation of the binary classes. This procedure is called stratification, and it is important prior to N-fold cross-validation. In the case of three folds, for example, two-thirds of the data is reserved for training and one-third is held out for testing. The procedure is repeated three times, until every instance has been used for testing (Witten et al. 2011), as illustrated in Figure 3.3.

Figure 3.3 Procedure of three-fold cross-validation.

The standard way to evaluate a learning algorithm given a single, fixed and limited sample of data is to use 10-fold cross-validation. Tenfold validation is the best practice, as extensive tests have shown that ten is about the right number of folds to get the best estimate (Witten et al. 2011). Grillo et al. (2009), Marturana et al. (2011a, 2011b), Gomez (2012) and Marturana and Tacconi (2013) all used 10-fold cross-validation. The results of 10-fold cross-validation can be presented in several formats. The aim of the validation technique is to estimate the classification accuracy of the method on the available dataset; therefore, analysis of correct and incorrect classifications is required. In classification problems, results may be represented in the form of a two-dimensional confusion matrix (Witten et al. 2011), consisting of positive and negative classifications represented by true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). TP and TN represent items correctly labelled as belonging to the positive and negative class respectively. For example, in a binary classification of yes or no, TP = real 'yes', whereas TN = real 'no'. Conversely, FP and FN are incorrect classifications: a FP is an item classified as yes that should have been listed as no, whereas a FN is an item classified as no that should have been listed as yes. The precision rate is calculated as follows:

$$ \text{Precision} = \frac{TP}{TP + FP} \qquad (3.1) $$

The recall rate is calculated as:

$$ \text{Recall} = \frac{TP}{TP + FN} \qquad (3.2) $$

The F-measure, representing a weighted harmonic mean of precision and recall, is calculated as follows:

$$ F\text{-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3.3) $$

The resultant two-dimensional confusion matrix is displayed with a row and a column for each class. In the example below, a and b represent the actual classes (e.g. 'yes' and 'no' respectively), whereas ap and bp are the predicted ones:

         ap   bp
    a    TP   FN
    b    FP   TN

Note that, in binary classification problems, the values on the main diagonal (TP and TN) represent the correct classifications, that is, the positive results (correct 'yes' and 'no' answers), whereas the off-diagonal values (FN and FP) represent the negative results (incorrect 'yes' and 'no' answers).
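To make the procedure concrete, the sketch below runs stratified 10-fold cross-validation on synthetic data and derives the confusion matrix and the three indicators. scikit-learn stands in for WEKA here, and the data is randomly generated rather than drawn from real devices.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for a training input matrix: 30 instances, 5 features,
# with the positive class shifted so the problem is learnable.
y = np.array([1] * 14 + [0] * 16)
X = rng.normal(size=(30, 5)) + y[:, None] * 1.5

# Stratified folds keep the class proportions in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(KNeighborsClassifier(n_neighbors=1), X, y, cv=cv)

# Rows are actual classes, columns predicted ones (TP/FN/FP/TN counts).
print(confusion_matrix(y, pred))
p, r, f, _ = precision_recall_fscore_support(y, pred, average="weighted")
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
```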

3.4.4.1 K-Nearest Neighbour

Compared with distance metrics that use higher exponents (q > 2), which increase the influence of large differences between attributes at the expense of small ones, the Euclidean distance (ED) generally represents a good compromise. The following is the formula for the ED, where x_{ij} denotes the value of the jth attribute of instance i:

$$ d(\mathbf{x}_i, \mathbf{x}_k) = \sqrt{\sum_{j=1}^{N} \left( x_{ij} - x_{kj} \right)^2} \qquad (3.5) $$

Different attributes are often measured on different scales, so if the previous formula were used directly, the effect of some attributes might be negligible with respect to others measured on larger scales. Consequently, it is good practice to normalize all attribute values to lie between 0 and 1 by means of the following formula:

$$ x'_{ij} = \frac{x_{ij} - \min(x_{ij})}{\max(x_{ij}) - \min(x_{ij})} \qquad (3.6) $$

where max(x_{ij}) and min(x_{ij}) are, respectively, the maximum and minimum value of x_{ij} over all instances in the training set. Even though the Euclidean distance becomes less discriminating as the number of attributes increases, we have selected it to calculate the distance between instances in the dataset. The reason is that calculating the Euclidean distance in WEKA is computationally simple and suited to our low-dimensional dataset.
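A short numeric sketch of formulas (3.5) and (3.6), using made-up attribute values:

```python
import numpy as np

# Toy training input matrix: 4 instances (rows) x 3 attributes (columns).
X = np.array([[148.0, 57.0, 12.0],
              [12.0, 85.0, 47.0],
              [57.0, 45.0, 120.0],
              [25.0, 12.0, 75.0]])

# Formula (3.6): min-max normalization, computed per attribute over all instances.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Formula (3.5): Euclidean distance between two (normalized) instances.
def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean(X_norm[0], X_norm[1]))
```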

3.4.4.2 Neural Network

The idea behind a neural network (NN) is that very complex behaviour can arise from relatively simple units, the artificial neurons, acting in concert. Used in domains such as money laundering and credit card fraud detection, a network of artificial neurons is designed to mimic the human ability to learn from experience (Linoff and Berry 2011). Drawing on NN history, which merges biology and machine learning concepts, we can introduce the term single-layer perceptron (SLP), which comes from machine learning terminology, and the term artificial neuron, derived from the biology glossary, to describe the basic component of a NN, which also represents the simplest existing NN model. As illustrated in Figure 3.6, a SLP, or simply perceptron, takes a number of inputs and produces the corresponding output on the basis of the activation function (AF). This function is split into two parts: the combination function (CF) and the transfer function (TF).

Figure 3.6 Single layer perceptron.

The CF is generally represented by a weighted sum, which is the default in data mining problems, where each input is multiplied by its weight and these products are added together, whereas the TF serves to model the perceptron's behaviour. The linear function, for instance, is suitable for representing linear regression models; the step function, which has the value 1 when the weighted sum of the inputs is above some threshold and 0 otherwise, mimics the human brain's 'all-or-nothing' response; the sigmoid functions (i.e. logistic and hyperbolic tangent) produce output values from limited intervals (i.e. [0,1] or [-1,1]) and are appropriate for estimating variables with non-linear behaviour. Since the SLP is able to solve only simple classification problems, with the limitation that classes must be linearly separable (a typical example of SLP limitations is the XOR problem), a more complex structure called the multi-layer perceptron (MLP) has been defined in the literature as the typical NN model. As illustrated in Figure 3.7, a MLP may be represented as a network of interconnected SLPs, with an input layer, where data enters the network; one or more hidden layers, comprised of SLPs, each of which receives multiple inputs from the input layer; and an output layer, which includes a variable number of SLPs, based on the number of predictable classes. In binary prediction, for instance, the output layer comprises a single SLP.

Figure 3.7 Multilayer perceptron.

In this work we will test WEKA's implementation of the NN algorithm, called multilayer perceptron, to classify the dataset.
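The CF/TF split can be captured in a few lines. This toy perceptron is illustrative only and is unrelated to WEKA's implementation; the weights, bias and inputs are arbitrary.

```python
import math

def perceptron(inputs, weights, bias, transfer="step"):
    # Combination function: weighted sum of the inputs.
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Transfer function: models the unit's response to the combined input.
    if transfer == "step":      # 'all-or-nothing' response
        return 1 if s > 0 else 0
    if transfer == "logistic":  # sigmoid output in [0, 1]
        return 1 / (1 + math.exp(-s))
    return s                    # linear, as in regression models

print(perceptron([0.5, 0.2], weights=[1.0, -2.0], bias=0.1))
print(perceptron([0.5, 0.2], weights=[1.0, -2.0], bias=0.1, transfer="logistic"))
```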

3.4.4.3 Learning from an Imbalanced Dataset

This section deals with the so-called imbalanced dataset problem, which occurs during the training stage when significantly fewer exemplars of one class are available with respect to the others. The lack of data related to a class may result in inaccurate construction of the decision boundaries between classes, and in consequent misclassifications of the minority class (i.e. pedo). The problem can occur as well within a majority class formed of several clusters of data, as one of them may contain significantly fewer exemplars than the others, which may lead to misclassifications. Standard machine learning algorithms, indeed, tend to treat exemplars from a minority class, or from a minority cluster within a class, as noise, with consequent performance degradation (Nguyen et al. 2009). An imbalanced dataset is an intrinsic property of investigations related to credit card fraud, money laundering and financial crimes, for instance, where very few illegal transactions are carried out compared to the huge number of normal operations. The problem, however, is not limited to the aforementioned cases, as imbalanced datasets may be present in areas that do not have an inherent imbalance problem as well. Legal, economic and privacy limitations on collecting data, for instance, and the large effort required to obtain a representative dataset, may result in imbalanced datasets, as occurred with the framework described in this chapter, where collecting real data related to the class pedo from seized devices was the main difficulty that we encountered. Data sampling (e.g. under-sampling, over-sampling and advanced sampling), with pre-processing of exemplars to minimize the discrepancy among the classes, is a common approach to solving the class imbalance problem. Under-sampling, for instance, removes exemplars from the majority class, whereas over-sampling adds exemplars to the minority class before processing the training set. In this regard, we have tested SMOTE, an advanced sampling algorithm (Chawla et al. 2002) which introduces new (i.e. synthetic), non-replicated pedo exemplars into the training set.

3.4.4.4 Data Overfitting

Dealing with machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, for example when it has too many parameters relative to the number of training data points, and begins to memorize the training data rather than learning to generalize from it. A model which has been overfit will generally have poor performance on new data, as it can exaggerate minor fluctuations in the training set. For example, if the number of parameters is the same as or greater than the number of available training instances, a learning process can perfectly predict the training data simply by memorizing it entirely, but it will typically fail drastically when making predictions about unseen instances, since it has not learned to generalize at all. One way to avoid overfitting is to use a lot of data, thus forcing the algorithm to generalize and come up with a good model suiting all the points. Using a small training set, we have an intrinsic limitation in generalizing the model. Therefore, we run 10-fold cross-validation with an unsupervised pre-processing filter implemented by WEKA, called Randomize, which randomly shuffles the order of the training set exemplars passed through it.

3.4.4.5 Automatic Feature Selection Techniques

Besides the selected algorithms, the number of training set exemplars and their order in the validation folds, the number of features in the state vector affects classification accuracy as well. If this number is too large, indeed, classification accuracy may be negatively affected, as the inclusion of irrelevant, redundant and noisy attributes in the model can result in poor predictive performance and increased computation. In this regard, among the available techniques (Hall and Holmes 2003), we have tested two algorithms for selecting the most effective features from our state vector, belonging respectively to the families of wrapper and filter methods. The wrapper method creates candidate subsets of features from the state vector and then uses an evaluator to choose the best-performing subset according to the selected classifier. It uses a search technique, such as best first, random search, exhaustive search, etc., to find subsets of features, and then selects the one with which the classification algorithm performs best according to the performance indicators outlined in Section 3.3.6. Conversely, the filter method uses a combination of attribute evaluation and ranker algorithms to assign a rank to all the features in the state vector, regardless of the selected classifier. Once attribute ranking is over, the lower-ranking features are removed one at a time from the state vector and the resulting classification accuracy evaluated.

In comparison, because the wrapper method uses the classification algorithm itself to evaluate the accuracy of each candidate subset of features, rather than an attribute ranker, it usually produces better results than the filter method.
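The two families can be sketched with scikit-learn's selectors; greedy forward selection stands in here for WEKA's subset search, and the synthetic data is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=30, n_features=20, n_informative=4,
                           random_state=0)

# Filter method: rank attributes with a statistic, independently of the classifier.
filtered = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print(sorted(filtered.get_support(indices=True)))

# Wrapper method: search feature subsets, scoring each with the target classifier.
knn = KNeighborsClassifier(n_neighbors=1)
wrapper = SequentialFeatureSelector(knn, n_features_to_select=5, cv=3).fit(X, y)
print(sorted(wrapper.get_support(indices=True)))
```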

3.4.5 Experiment Setup, Results and Discussion

The case study described hereafter is based on the data corpus outlined in Section 3.4.3, encompassing M = 23 training exemplars extracted from just as many cell phones and smartphones from the case repository of the Italian Postal and Communications Police, concerning different crimes such as exchange of child pornography, violation of trade secrets, human trafficking and extortion. As outlined in Section 3.4.3, the training set encompasses seven child porn exemplars (i.e. pedo) and 16 exemplars concerning other crimes (i.e. non-pedo). The state vector, defined in Section 3.4.2, encompasses N = 114 attributes. As a consequence, we have a rectangular N × M training input matrix. The data was extracted using CellDEK 1.11, Paraben's Device Seizure 4.3.4092.33445, XRY 4.5 and 5.10, UFED 1.1.38 and FTK 3.1, and stored in a database. The database was processed using functions and procedures to count feature occurrences and, finally, normalization was carried out to ensure data integrity so that classification could take place. In this case study we will use results from Marturana et al. (2011a, 2011b) and Marturana and Tacconi (2013) as a reference baseline for benchmark purposes. Class predictions obtained with KNN and NN, respectively described in Sections 3.4.4.1 and 3.4.4.2, will be compared with this baseline. Data pre-processing techniques, such as synthetic over-sampling (i.e. SMOTE) and training exemplar randomization, described in Sections 3.4.4.3 and 3.4.4.4, will be explored to address the imbalanced dataset and overfitting issues, and the related results discussed. The feature selection techniques described in Section 3.4.4.5 will be tested as well, and the results discussed. As outlined in Section 3.3.6, the classifiers' learning accuracy will be calculated on the basis of the confusion matrix, precision, recall and F-measure.

3.4.5.1 Baseline Results

This section summarizes the reference baseline from Marturana et al. (2011a, 2011b) and Marturana and Tacconi (2013), which includes the following WEKA classifiers: Bayesian network (BN), decision tree (DT) and locally weighted learning (LWL), namely BayesNet, J48 and LWL in WEKA. Training accuracy was validated by the cited authors using 10-fold cross-validation, and Table 3.5 illustrates the related output, which forms the reference baseline against which this case study will be compared.

Table 3.5 Reference baseline results

Performance parameter       BN      DT      LWL
Weighted_avg_Precision      0.553   0.68    0.644
Weighted_avg_Recall         0.579   0.684   0.632
Weighted_avg_F_Measure      0.56    0.644   0.636

To summarize, the DT algorithm had the greatest accuracy and, on average, all the classifiers were able to correctly classify more than half of the 23 phones. It is noteworthy that the baseline tests did not involve any pre-processing and are used as-is for benchmark purposes.

3.4.5.2 KNN and NN Without Pre-Processing (Complete State Vector)

This section illustrates the findings of the training experiment with KNN and NN (see Sections 3.4.4.1 and 3.4.4.2), using the training input matrix outlined in Section 3.4.5. The basic algorithms were tested, without further pre-processing. Training accuracy was validated using 10-fold cross-validation and the classification results are summarized in Table 3.6.

Table 3.6 KNN and NN without pre-processing (i.e. 23 non-randomized exemplars and 114 features)

Performance parameter       KNN (K = 1)   KNN (K = 3)   KNN (K = 5)   NN
Weighted_avg_Precision      0.581         0.452         0.455         0.525
Weighted_avg_Recall         0.478         0.348         0.435         0.565
Weighted_avg_F_Measure      0.496         0.365         0.445         0.542

KNN (k = 1) had the highest average precision rate, whereas NN had a higher precision rate on the majority class (i.e. non-pedo), resulting in better recall and F-measure. As a consequence, without further processing, NN is more likely than KNN to identify non-relevant devices (i.e. true negatives). On average, all the classifiers were able to correctly classify more than half of the 23 phones, which is not good enough for the purpose of the proposed method.

3.4.5.3 KNN and NN with Training Exemplar Randomization (Complete State Vector)

This section illustrates the findings of the training experiment with KNN and NN (see Sections 3.4.4.1 and 3.4.4.2), obtained with the training input matrix outlined in Section 3.4.5, after randomizing the 23 training exemplars to address the overfitting problem. Training accuracy was validated using 10-fold cross-validation and the classification results are summarized in Table 3.7.

Table 3.7 KNN and NN with Randomize only (i.e. 23 randomized exemplars and 114 features)

Performance parameter       KNN (K = 1)   KNN (K = 3)   KNN (K = 5)   NN
Weighted_avg_Precision      0.632         0.43          0.515         0.665
Weighted_avg_Recall         0.565         0.348         0.478         0.696
Weighted_avg_F_Measure      0.582         0.373         0.493         0.667

The proposed methodology showed better results in terms of classification accuracy, and less data overfitting, as a result of randomizing the exemplars within the training set. On average, all the tested algorithms performed better than without randomization, and NN outperformed KNN for every value of k. Classification accuracy is comparable with the baseline results described in Section 3.4.5.1, and it is still not good enough for the purpose of the proposed method.

3.4.5.4 KNN and NN with Training Exemplar Randomization and Over-Sampling (Complete State Vector)

This section illustrates the findings of the training experiment with KNN and NN (see Sections 3.4.4.1 and 3.4.4.2), obtained with the training input matrix outlined in Section 3.4.5, and:

- over-sampling the minority class (i.e. pedo) from 7 to 14 exemplars, to get a more balanced training set,
- randomizing the training exemplars to avoid data overfitting during validation,

with the aim of further improving accuracy and performance. Training accuracy was validated using 10-fold cross-validation and the classification results are summarized in Table 3.8.

Table 3.8 KNN and NN with SMOTE and Randomize (i.e. 30 over-sampled and randomized exemplars and 114 features)

Performance parameter       KNN (K = 1)   KNN (K = 3)   KNN (K = 5)   NN
Weighted_avg_Precision      0.806         0.775         0.767         0.767
Weighted_avg_Recall         0.667         0.567         0.533         0.767
Weighted_avg_F_Measure      0.635         0.487         0.43          0.766

The combined effect of balancing and randomizing the training exemplars turns into a general improvement in classifier accuracy. In particular, KNN (k = 1) was able to correctly classify all the minority class instances (i.e. pedo), resulting in the highest average precision rate, whereas NN had a more balanced behaviour, resulting in better recall and F-measure. As a consequence, without further processing, KNN (k = 1) is more likely than the other algorithms to identify relevant devices (i.e. true positives), whereas NN is more likely to identify non-relevant devices (i.e. true negatives).

3.4.5.5 KNN and NN with Pre-Processing and Automatic Feature Selection (Reduced Training Input Matrix)

This section illustrates the findings of the training experiment with KNN and NN (see Sections 3.4.4.1 and 3.4.4.2), obtained with the training input matrix outlined in Section 3.4.5, and:

- over-sampling the minority class (i.e. pedo) from 7 to 14 exemplars, to get a more balanced dataset,
- randomizing the training exemplars to avoid data overfitting during validation,
- reducing the number of attributes with automatic feature selection techniques (i.e. the wrapper and filter methods), to improve classification accuracy.

Training accuracy was validated using 10-fold cross-validation. The classification results are summarized in Tables 3.9 and 3.10.

Table 3.9 KNN and NN with SMOTE and Randomize (number of attributes reduced with the wrapper method)

Performance parameter       KNN (K = 1)   KNN (K = 3)   KNN (K = 5)   NN
Weighted_avg_Precision      0.834         0.933         0.767         0.941
Weighted_avg_Recall         0.833         0.933         0.767         0.933
Weighted_avg_F_Measure      0.833         0.933         0.766         0.933

Table 3.10 KNN and NN with SMOTE and Randomize (number of attributes reduced with the filter method)

Performance parameter       KNN (K = 1)   KNN (K = 3)   KNN (K = 5)   NN
Weighted_avg_Precision      0.867         0.781         0.874         0.781
Weighted_avg_Recall         0.867         0.767         0.867         0.767
Weighted_avg_F_Measure      0.867         0.766         0.867         0.766

Since the wrapper method chooses the best-performing subset of attributes according to the selected classifier, a different subset of attributes is selected for each algorithm. In particular, testing the wrapper method with KNN (k = 1) selected a subset of 2 attributes, with KNN (k = 3) a subset of 6 attributes, and with KNN (k = 5) a total of 5 attributes. In this experiment, both KNN (k = 3) and NN correctly classified 93.3% of the training exemplars, a considerable achievement with respect to the initial 47.8% of KNN (k = 1) without pre-processing of the training set. Their slightly different average precision depends on the number of minority class exemplars incorrectly identified by each of them (i.e. 1 exemplar with KNN and 0 exemplars with NN) (Table 3.10). With the filter method, by contrast, the same subset of attributes is selected for use with each classification algorithm, which is a suboptimal choice. In this case, a total of 9 attributes out of 114 were selected, resulting in a general improvement in classification accuracy with respect to the baseline. Further, the wrapper method outperforms the filter method as far as KNN (k = 3) and NN are concerned, whereas the filter method resulted in greater classification accuracy for KNN (k = 1, 5). For this reason, we strongly suggest taking both attribute selection techniques into account.
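For readers who want to replay the shape of this preprocessing chain, the sketch below mirrors it on synthetic data: the matrix dimensions match the case study, but the values are randomly generated, scikit-learn and imbalanced-learn stand in for WEKA, and the resulting scores are not the chapter's results.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.utils import shuffle

rng = np.random.default_rng(0)
# Synthetic stand-in for the 23 x 114 training input matrix:
# 7 "pedo" exemplars (class 1) and 16 "non-pedo" exemplars (class 0).
y = np.array([1] * 7 + [0] * 16)
X = rng.normal(size=(23, 114)) + y[:, None] * 0.8

# SMOTE over-samples the minority class from 7 to 14 exemplars,
# yielding a 30-instance, more balanced training set.
X_bal, y_bal = SMOTE(sampling_strategy={1: 14}, random_state=0).fit_resample(X, y)

# Randomize the exemplar order (the counterpart of WEKA's Randomize filter).
X_bal, y_bal = shuffle(X_bal, y_bal, random_state=0)

# 10-fold cross-validated accuracy of an MLP on the preprocessed set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
print(cross_val_score(mlp, X_bal, y_bal, cv=cv).mean())
```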

3.5 Conclusion

This chapter has dealt with the application of digital forensics triage to the problem of classifying digital data sources as relevant or not to an investigation. In particular, we have described a digital triage framework based on Machine Learning algorithms which makes it possible to infer the relevance of a target object (e.g. a digital device) to the case at hand on the basis of a set of features. Such features, which concern suspect activity likely related to the investigated crime, are collected in a state vector and may be extracted from target objects either on the crime scene or at DFLs. The framework encompasses three phases: data extraction, feature extraction and processing, and presentation of results.

The collection of a consistent set of exemplar data sources (i.e. the training set) is the preliminary step of the learning process. Exemplars are collected on the basis of their relevance to the investigated crime. The training set is given as input to the selected Machine Learning algorithms for classification purposes, and the resulting model, validated using 10-fold cross-validation, is then ready for predictions on unknown data instances. Our goal is to provide investigators and forensic examiners with a tool that can be used both during search and seizure on the crime scene and during examinations at DFLs.

The described solution has been tested in a real-life scenario concerning child pornography exchange with mobile phones, in which two machine learning algorithms, namely k-nearest neighbour (KNN) and neural network (NN), were tested and the related findings compared. In particular, KNN is considered a legally superior data mining classifier whose predictions are easy to argue in a trial, whereas NN can be readily and effectively applied to digital triage classification tasks. Further, three inherent issues of the available data corpus, namely the imbalanced dataset, overfitting and linear dependence among the available features, have been addressed with three corresponding data pre-processing techniques: exemplar over-sampling, randomization and automatic feature selection. As a result of the case study we have shown that, with the right combination of pre-processing techniques and classification algorithms, it was possible to reach a classification accuracy of 93.3%, compared with the initial 47.8% obtained with KNN (k = 1) and no pre-processing.

3.6 Challenges and Future Directions for the Digital Forensics Community

Digital triage is emerging as a forensic discipline, and new models and methodologies are being developed at an outstanding pace. Nevertheless, research on digital triage is still in its infancy: further experiments are underway, and several issues deserve careful consideration, such as the lack of experimental data, a potential limitation that must be evaluated on a case-by-case basis.

Further, for specific types of crime it is hard to infer the relevance of a digital source to the case at hand without first inspecting each file's content. For example, the only evidence of a fraud on a computer may be the presence of a certain document (e.g. a spreadsheet) containing a list of credit card numbers. Such evidence is highly case-specific and not very different from the documents a person creates during normal working hours as part of their daily activities. Since file content analysis is out of the scope of the statistical approach undertaken in this research, whose aim is to keep digital data source pre-examination as simple and time-efficient as possible, the inability to infer crime-related features from file content is another potential limitation. Nevertheless, the digital triage framework presented in this chapter produced very good predictions in cases related to the exchange of child pornography. To generalize the described framework to other crime types, it is necessary to (i) acquire domain knowledge about the investigated crime and (ii) build a valid dataset with which to correctly train the selected Machine Learning algorithms.

It is worth mentioning that the proposed methodology is meant to be integrated with other forensic frameworks currently in use rather than to replace them, speeding up the identification of relations between target objects and the investigated crimes on a commonality basis that, if done manually, would be a mind-numbing and time-consuming activity. Since the described framework is easily extendable to other crimes, we believe it will serve as an inspiration to forensic practitioners and researchers. Interested readers who want to try their own implementations should first identify a set of crime-related features (i.e. the model's independent variables, represented by the feature vector) and then collect a consistent set of training exemplars pertaining to the investigated crime; this is the core activity of the whole process, on which classification accuracy strongly depends. The more exemplar digital data sources (e.g. hard drives, smartphones, handsets, tablets and PDAs) pertaining to the investigated crime are used to train the classification algorithms, the better new data sources (i.e. test instances) will be classified. Further, since a classifier may outperform others on a specific training set, it is important to benchmark different classifiers to find the one(s) that perform(s) best on the available dataset. In the child pornography exchange case study, for instance, the limited amount of available training data made it impossible to determine whether a classifier was better or worse than the others on average, across all possible training sets that can be drawn from the domain.

Moreover, a more effective solution to the problem of classifying a digital data source would be to enable attribute weighting through automated and manual methods. Several techniques to improve classification accuracy, such as attribute manipulation and weighting, can be explored: placing more importance on specific attributes by means of weights could improve classification performance. Automatic methods, which calculate attribute weights without human intervention, would fully automate the classification process. Alternatively, weighting attributes manually would allow forensic investigators to provide invaluable input drawn from their experience or from the context of the investigation; a minimal sketch of manual weighting is given below.
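As a minimal sketch of the manual weighting idea, assuming a Euclidean-distance KNN classifier: multiplying each feature column by the square root of its weight makes the ordinary Euclidean distance coincide with the weighted distance, so an investigator-supplied weight vector can be applied as a simple pre-scaling step. The weight values and the unseen-device vector X_new are purely illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fit_weighted_knn(X, y, weights, k=3):
    """Fit a KNN classifier with per-attribute weights.

    Scaling column i by sqrt(w_i) turns the plain Euclidean distance into
    the weighted distance sum_i w_i * (x_i - z_i)^2, so attributes with
    larger weights count more when searching for nearest neighbours.
    """
    scale = np.sqrt(np.asarray(weights, dtype=float))
    clf = KNeighborsClassifier(n_neighbors=k).fit(X * scale, y)
    return clf, scale

# Illustrative manual weighting: the examiner deems the first attribute
# twice as important as the others for the investigated crime.
weights = np.ones(X_bal.shape[1])
weights[0] = 2.0
clf, scale = fit_weighted_knn(X_bal, y_bal, weights)

# An unseen device (hypothetical feature vector X_new) must be scaled the
# same way before prediction:
# label = clf.predict(X_new * scale)
```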

Acknowledgements

This chapter is based on research conducted in collaboration with the Computer Forensics Unit of the Italian Postal and Communications Police.

References

Cantrell G., Dampier D., Dandass Y. S., Niu N., Bogen C. Research Toward a Partially-Automated, Crime-Specific Digital Triage Process Model. In Computer & Information Science, vol. 5, no. 2, pages 29–38, 2012.

Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. SMOTE: Synthetic Minority Over-sampling Technique. In Journal of Artificial Intelligence Research, vol. 16, pages 321–357, 2002.

Chipman M., Hackley B. E., Spencer T. S. Triage of Mass Casualties: Concepts for Coping With Mixed Battlefield Injuries. In Military Medicine, vol. 145, no. 2, pages 99–100, 1980.

Drezewski R., Sepielak J., Filipkowski W. System Supporting Money Laundering Detection. In Digital Investigation, vol. 9, pages 8–21, Elsevier, 2012.

Durkin K. Misuse of the Internet by Pedophiles: Implications for Law Enforcement and Probation Practice. In Federal Probation, vol. 61, pages 14–18, Sage Publications, 1997.

Fleck M., Forsyth D., Bregler C. Finding Naked People. In Proceedings of the 4th European Conference on Computer Vision, Cambridge, UK. Lecture Notes in Computer Science, vol. 2, pages 593–602, Springer-Verlag, 1996.

Garfinkel S. L. Automating Disk Forensic Processing with SleuthKit, XML and Python. In Proceedings of the 4th International IEEE Workshop on Systematic Approaches to Digital Forensic Engineering, Berkeley, CA, 2009.

Garfinkel S. L., Parker-Wood A., Huynh D., Megletz J. An Automated Solution to the Multiuser Carved Data Ascription Problem. In IEEE Transactions on Information Forensics and Security, vol. 5, no. 4, pages 868–882, 2010.

Garfinkel S. L. Digital Media Triage with Bulk Analysis and Bulk_Extractor. In Computers & Security, vol. 32, pages 56–72, Elsevier, 2013.

Gomez L. Triage in-Lab: Case Backlog Reduction with Forensic Digital Profiling. In Proceedings of Simposio Argentino de Informática y Derecho, La Plata, Argentina, pages 217–225, 2012.

Grillo A., Lentini A., Me G., Ottoni M. Fast User Classifying to Establish Forensic Analysis Priorities. In Proceedings of the 5th IEEE International Conference on IT Security Incident Management and IT Forensics, Stuttgart, Germany, 2009.

Hall M. A., Holmes G. Benchmarking Attribute Selection Techniques for Discrete Class Data Mining. In IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, 2003.

Heilbronner R. L. Daubert vs. Merrell Dow Pharmaceuticals, Inc. (1993). In Encyclopedia of Clinical Neuropsychology, pages 769–770, Springer, New York, 2011.

Hong I., Yu H., Lee S., Lee K. A New Triage Model Conforming to the Needs of Selective Search and Seizure of Electronic Evidence. In Digital Investigation, vol. 10, no. 2, pages 175–192, Elsevier, 2013.

Jansen W., Ayers R. Guidelines on Cell Phone Forensics. Recommendations of the National Institute of Standards and Technology (NIST), NIST Special Publication 800-101, Gaithersburg, MD, 2007.

Janshon C., Ulges A., Breuel T. M. Detecting Pornographic Video Content by Combining Image Features with Motion Information. In Proceedings of the 17th ACM International Conference on Multimedia, Beijing, China, pages 601–604, 2009.

Jones M. J., Rehg J. M. Statistical Color Models with Application to Skin Detection. In International Journal of Computer Vision, vol. 46, no. 1, pages 81–96, 2002.

Kent K., Chevalier S., Grance T., Dang H. Guide to Integrating Forensic Techniques into Incident Response. Recommendations of the National Institute of Standards and Technology (NIST), NIST Special Publication 800-86, Gaithersburg, MD, 2006.

Linoff G. S., Berry M. J. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, 3rd Edition, Wiley, Indianapolis, IN, 2011.

Marturana F., Bertè R., Me G., Tacconi S. Mobile Forensics 'Triaging': New Directions for Methodology. In Proceedings of the VIII Conference of the Italian Chapter of the Association for Information Systems (ITAIS), Springer, Rome, Italy, 2011a.

Marturana F., Bertè R., Me G., Tacconi S. A Quantitative Approach to Triaging in Mobile Forensics. In Proceedings of the International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11 (TRUSTCOM 2011), Changsha, China, pages 582–588, 2011b.

Marturana F., Tacconi S. A Machine Learning-Based Triage Methodology for Automated Categorization of Digital Media. In Digital Investigation, vol. 10, no. 2, pages 193–204, Elsevier, 2013.

Nguyen G. H., Bouzerdoum A., Phung S. L. Learning Pattern Classification Tasks with Imbalanced Data Sets. In P.-Y. Yin (Ed.), Pattern Recognition, pages 193–208, InTech, Rijeka, 2009.

Osanka F. M., Johann S. L. Sourcebook on Pornography. Lexington Books, Lexington, MA/Toronto, 1989.

Palmer G. A Road Map for Digital Forensic Research. Report from the First Digital Forensic Research Workshop (DFRWS), 2001. http://www.dfrws.org/2001/dfrws-rm-final.pdf (accessed 18 February 2015).

Pearson S., Watson R. Digital Triage Forensics: Processing the Digital Crime Scene. Syngress, Boston, 2010.

Poisel R., Tjoa S. Forensics Investigations of Multimedia Data: A Review of the State-of-the-Art. In Proceedings of the 6th IEEE International Conference on IT Security Incident Management and IT Forensics, Stuttgart, Germany, 2011.

Pollitt M. M. Triage: A Practical Solution or Admission of Failure. In Digital Investigation, vol. 10, no. 2, pages 87–88, Elsevier, 2013.

Refaeilzadeh P., Tang L., Liu H. Cross-Validation. In L. Liu and M. Tamer Özsu (Eds), Encyclopedia of Database Systems, pages 532–538, Springer, New York, 2009.

Rogers M. K., Goldman J., Mislan R., Wedge T., Debrota S. Computer Forensics Field Triage Process Model. In Journal of Digital Forensics, Security and Law, vol. 1, no. 2, pages 19–38, Association of Digital Forensics, Security and Law, Maidens, VA, 2006.

Roussev V., Quates C. Content Triage with Similarity Digests: The M57 Case Study. In Digital Investigation, vol. 9, pages S60–S68, Elsevier, 2012.

Weiss G. M. Mining with Rarity: A Unifying Framework. In SIGKDD Explorations Newsletter, vol. 6, pages 7–19, 2004.

Witten I. H., Frank E., Hall M. A. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, Burlington, MA, 2011.

Wu X., Kumar V., Quinlan J. R., Ghosh J., Yang Q., Motoda H., McLachlan G. J., Ng A., Liu B., Yu P. S., Zhou Z.-H., Steinbach M., Hand D. J., Steinberg D. Top 10 Algorithms in Data Mining. In Knowledge and Information Systems, vol. 14, pages 1–37, 2008.

Notes

1 French surgeon in Napoleon's army and an important innovator in battlefield medicine.

2 ACPO. Good Practice Guide for Computer-Based Electronic Evidence. Available at http://www.7safe.com/electronic_evidence/ACPO_guidelines_computer_evid

3 Prior to extracting data from multimedia devices, it is mandatory to define a state vector with relevant crime-related attributes.

4 In time-critical situations, when the preservation of the discovered device is less important than finding quick intelligence to guide the search and seizure process, investigators can opt to extract data directly from the target system without imaging it first.

5 WEKA, from the Machine Learning Group at the University of Waikato, New Zealand, is downloadable at http://www.cs.waikato.ac.nz/ml/weka.