Identification and Localization of Data Types within ...

Robert F. Erbacher and John Mulholland are with the Department of .....
We are primarily concerned with the embedding and appending of data in this work. In essence, what we are attempting to identify is a form of covert channel or steganography [11]. By applying analysis, i.e., steganalysis [4], we can identify not only the presence of the embedded data but also the relative locations of many of the data types.

This prior work on file type identification has ramifications for the identification of data types, as we are essentially working towards the same ends, just from different levels, i.e., different points of view. For instance, our use of window sizes can be considered a derivative of the work on text identification through n-grams [3][9]. The application of statistical analysis for data type identification is itself a direct extension of the work on file type identification by Karresand et al. [5][6] and McDaniel et al. [10].

III. SCENARIO DESCRIPTIONS

While pursuing this research, we had several scenarios of potentially hidden information in mind toward which we geared our solutions. The fundamental premise is that a sophisticated criminal will not leave sensitive data unhidden, and even simple techniques for hiding data can make locating the data infeasible for all but the most important cases. Examples of such scenarios include:

• Appending data to an executable (a sketch of this scenario follows the list below). This is particularly effective on Linux-based systems. Nearly any file type can be appended in this way, and such modified executables will continue to operate correctly. Similarly, many file types will ignore data appended to them.
– A spreadsheet containing drug delivery schedules could easily be appended to a file.
– Child pornography can be appended to files. By appending images to each system file, a large number of images can be hidden.
• Embedding data into a file. This is quite common for innocuous purposes but essentially revolves around incorporating data into a file.
– Formatted text documents, such as Microsoft Word, can easily have images embedded in them. As many systems will have hundreds or thousands of such documents, it becomes extremely time consuming to parse through these files to separate innocuous files from criminally relevant ones.
– Formatted text documents can similarly have spreadsheets or tables embedded. Such spreadsheets could identify criminally relevant transactions, such as drug deliveries or sales of child pornography.
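The appending scenario is simple to reproduce. The following is a minimal sketch, with hypothetical file names: it copies a host file, appends a payload, and records the byte offset at which the payload begins, i.e., the boundary an analyst would need to discover.

```python
# Minimal sketch of the appending scenario, with hypothetical file
# names. The host copy still opens/runs normally in most formats,
# while the payload rides along after the original end of file.
import os
import shutil

def append_payload(host_path: str, payload_path: str, out_path: str) -> int:
    """Copy host_path to out_path, append payload_path, and return
    the byte offset at which the payload begins."""
    shutil.copyfile(host_path, out_path)
    offset = os.path.getsize(out_path)  # payload starts at original EOF
    with open(out_path, "ab") as out, open(payload_path, "rb") as payload:
        out.write(payload.read())
    return offset

# Hypothetical usage: hide a spreadsheet archive behind an executable.
# boundary = append_payload("/bin/ls", "schedules.zip", "./ls_copy")
```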


IV. APPROACH

During the course of this research, we explored a wide range of statistical equations in order to identify which statistics were most effective at differentiating the data type components of a file. Clearly, not all of the statistics presented here would be relevant when designing and implementing a differentiation engine. However, the completeness provided here should prevent the need for other researchers to retread the same ground, since it makes clear which equations provide value and which do not.

A. Explored Statistical Algorithms

Thirteen statistics were chosen to determine the differing characteristics of the file types. These statistics are:
• Average
• Moving Average
• Delta Moving Average
• Standard Deviation
• Delta Standard Deviation
• Delta2 Standard Deviation
• Deviation from the Standard Deviation (std2)
• Kurtosis
• Distribution of Averages
• Distribution of Delta Averages
• Distribution of Standard Deviations
• Distribution of Delta Standard Deviations
• Distribution of Deviations from the Standard Deviation

Of these thirteen statistics, for the file and data types analyzed in this portion of the research, we found that the average, kurtosis, distribution of averages, standard deviation, and distribution of standard deviations were sufficient to effectively differentiate the different types of data; the additional statistics added nothing beneficial to the analysis. Incorporation of additional data types may necessitate incorporation of additional statistics for full differentiation, and other statistics outside of those tested may make the analysis more effective. This understanding should aid other researchers in focusing their efforts. A complete description of the analyzed statistics is presented in Appendix I.

B. Sliding Windows

As many of the above statistics are dependent on the sliding window size, we explored a range of differing window sizes to identify any and all impacts the window size may have on the statistical analysis. More specifically, we examined window sizes covering the powers of 2 from 64 bytes to 16K bytes. We found that below 256 bytes the graphs became too cluttered and the features of the graphs were obscured, while above 1K the characteristic features of the graphs were too greatly smoothed. This is partly a consequence of the file sizes we were using, but given the range of different files we experimented with, it was determined that window sizes above 1K did not provide any added value, and values between 256 bytes and 1K do not provide substantial differentiation. Thus, window sizes of 256 bytes and 1K are presented here; these values provide the most distinctive graphs.

C. Final Statistical Analysis

Clearly, the full set of experimental data cannot be provided here, as it amounted to hundreds of graphs. For the data sets included here, GnuPlot was used to generate the graphs. The graphs have a vertical axis ranging from 0 to 255 when byte values are being represented and 0 to 0.5 when probabilities are being represented. The horizontal axis ranges from 0 to 1.
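As a concrete illustration of Sections IV.A through IV.C, the following sketch computes the per-window average and standard deviation of a file's byte values and prints a table that GnuPlot can plot directly. Non-overlapping windows are an assumption here; the text does not state the window stride.

```python
# Sketch of the windowed statistics, assuming non-overlapping windows
# (the stride is not specified in the text). Byte values are 0-255,
# matching the vertical axis of the graphs described above.
import sys
import numpy as np

def windowed_stats(path: str, window: int = 256):
    """Yield (window_index, average, standard_deviation) per window."""
    data = np.fromfile(path, dtype=np.uint8)
    for start in range(0, len(data) - window + 1, window):
        w = data[start:start + window].astype(np.float64)
        yield start // window, w.mean(), w.std()

if __name__ == "__main__":
    # Usage: python stats.py <file> [window]; output is GnuPlot-ready.
    path = sys.argv[1]
    window = int(sys.argv[2]) if len(sys.argv) > 2 else 256
    for idx, avg, std in windowed_stats(path, window):
        print(idx, avg, std)
```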

Similarly, this consistency in the file results in a significant lack of peakedness. Consequently, the kurtosis maps uniformly to zero, except at the beginning of the file, where the chaotic file header information is located. Given the unique characteristic appearance of each of the three graphs, any of the three statistical techniques could be used to differentiate jpg imagery data.
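A per-window kurtosis trace can be produced along the same lines as the other windowed statistics. The paper defines its kurtosis in Appendix I, which is not fully reproduced in this extraction, so the standard excess-kurtosis formula used below is an assumption; absolute values may not match the figures, and the feature of interest is the flat profile of compressed regions versus the spiky header region.

```python
# Sketch of a per-window kurtosis trace. This uses the standard excess
# kurtosis (fourth standardized moment minus 3) as an assumption; the
# paper's own formulation may differ, so compare shapes, not values.
import numpy as np

def windowed_kurtosis(data: np.ndarray, window: int = 256) -> np.ndarray:
    values = []
    for start in range(0, len(data) - window + 1, window):
        w = data[start:start + window].astype(np.float64)
        s = w.std()
        if s == 0.0:
            values.append(0.0)  # constant window: define kurtosis as 0
        else:
            values.append((((w - w.mean()) / s) ** 4).mean() - 3.0)
    return np.array(values)
```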



B. Archival (Zip) Data

In many ways zip files are similar to jpg files, showing the characteristic deviations at the beginning of the file where the file header information is located, though not as extreme, and containing a narrow band for the average data values. However, the band is not nearly as tight as that of the jpg files, ranging instead from 109 to 142. This deviation in band ranges is one factor usable in differentiating between these two data types. This similarity results from the fact that both data formats use forms of data compression.

Figure 4: Average byte values of zip data. Initial variability can be seen at the beginning of the data stream, with occasional selective peaks. The end of zip data always contains a significant dip as the data content deviates.

Figure 5: Distribution of average values of zip data. This is similar in many respects to jpg data, but the range of values is larger and there are occasional deviations, dependent on the files incorporated into the zip files.

This can be seen in the average statistic, Figure 4, and the average distribution, Figure 5. The zip data statistics do not differ greatly when the files contain different file types, with the possible exception of images_hicontrast.zip, which contained png files; this file produced a small peak around 65% of the way through the data stream. All of the data streams contain a significant dip at the end of the file. This is likely the table of contents for the zip file, and the deviation in information content explains the significant change in average values. Since all files are mapped to the same scale in terms of file length, using percentage through the file, very large zip files will appear to have a shorter dip. The number of files in the zip file, and consequently the size of the table of contents, also impacts the length of this portion of the graph.
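The common percentage scale mentioned above is straightforward to apply when plotting. A minimal sketch, assuming the per-window averages have already been computed:

```python
# Sketch of the percentage-through-file mapping noted above: window
# indices are rescaled to 0-100% so streams of different lengths share
# one horizontal axis, which is why a fixed-size trailing feature (the
# zip table of contents) looks shorter on larger files.
import numpy as np

def percent_axis(n_windows: int) -> np.ndarray:
    """Positions 0..100 (% through file) for n_windows data points."""
    return np.linspace(0.0, 100.0, num=n_windows)

# Example: pair with per-window averages for plotting.
# xs = percent_axis(len(averages)); points = zip(xs, averages)
```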

Zip data also has a very low kurtosis throughout, as shown in Figure 6. This lack of variability in the kurtosis can be used as a distinctive characteristic.

Figure 6: Kurtosis values of zip data. The consistently low values are distinguishable. The values are clearly not fixed at zero but are extremely small.

C. Adobe Acrobat (PDF) Data

Pdf data is distinctive in the chaotic nature of the average values throughout the file, Figures 7a and 7b. Ultimately, the range of values taken on by pdf data is quite large; it is the lack of any type of consistency in values that is unique to pdf data. As with zip data, there is a dip at the end of the file that is consistent across data streams.

Imagery data takes so much of the file to specify, compared to the rest of the file, that the textual data becomes insignificant.


Figure 13: Average values for doc data. These word processing data streams are notable for the ability to identify the presence of different types of data within the file.

G. Microsoft PowerPoint (Ppt) Data

Of the files analyzed to date, Microsoft PowerPoint data was the most difficult to analyze; these files appear particularly chaotic in nature. In order to analyze such data we can begin relying on the results garnered so far, as PowerPoint data intrinsically takes on the data types of its underlying formats. For instance, examining the graph in Figure 14, all of the data streams clearly maintain a similar pattern towards the end of the data stream; this is the page description and the text itself. The deviation at the beginnings of the data streams is indicative of special or unique features of each of the data streams.

Identifying the unique characteristics of each data stream can be aided by analysis of the average distribution in Figure 15. The distribution of ch1.ppt clearly shows evidence of containing imagery data, which can be validated through examination of the contents of the file. The imagery data appears entirely at the beginning of the file; we suspect the PowerPoint format specifies all imagery data at the beginning of the data stream and then simply refers to the correct imagery data within the page description portion of the data stream.

Figure 14: Average values for ppt data. It appears that all imagery and diagram data is specified at the beginning of the data stream and then simply referred to within the page description of the latter portion of the data stream. It is this latter portion of the data stream that exhibits a high range of values as the pages are described, including text and formatting.

Figure 15: Distribution of averages of ppt data. This distribution can be used to confirm the types of data present within a data stream. The presence or absence of data types can be clearly identified through the distribution. For instance, the presence of imagery data is indicated by values within the 130 range.

From the average distribution graph we can see that the graph for "Review of Modeling Cell Division in Cell Automata.ppt" appears anomalous in that there is a large peak at value ~48. The graph for this data stream is obscured in Figure 14. The anomaly results from the fact that this document contains no formal page layout, figures, diagrams, or extensive formatting; essentially, the background of the slides is plain white and the body of the slides is plain text. This results in a very high proportion of the document's data revolving around textual data and its associated binary values. Similar feature distinctions can be made for diagrams, etc. In terms of differentiation, PowerPoint data can be differentiated by the large range of average values exhibited towards the end of the file, after the definition of any imagery, diagram data, etc.

VII. RECOGNITION OF EMBEDDED DATA TYPES

The analysis in the previous section focused on typical sample data and the ability of the statistical techniques to differentiate the data types incorporated within these data streams. This section examines scenarios more typical of forensic analysis, in which an individual is purposefully attempting to hide information. Clearly, we do not examine every possibility, but these are scenarios that allow an individual to rapidly hide information that would take an analyst large amounts of time to find, and the described technique provides the ability for this information to be found much more efficiently and effectively.


1) Obfuscation within doc files

The first set of experiments was based on identification of relevant data embedded into Microsoft Word files. Consider, for instance, the number of such Microsoft Word files that may be present on a typical hard drive, especially for an individual with a professional job. Embedding criminally relevant data into such a file would make it extremely time consuming to locate; each document would need to be opened and browsed through. Such embedded data could potentially include child pornography or spreadsheets of criminal activities, such as drug deliveries or bets made and by whom. Browsing of such documents would need to be done slowly in order to ensure all data and images are loaded.

An example of Microsoft doc files is shown in Figure 16. The goal here is to identify a spreadsheet of drug deliveries, which takes on the form of a table once embedded. This is identified by the large block of high variability in the third file, CyberSecurity-2SprdSht.doc. The table loses the stair step pattern when converted to an embedded table but remains uniquely identifiable. From this unique characteristic, identification and location can be performed rapidly for retrieval. As for the other two files, it can be seen clearly that the first file contains large amounts of imagery data that had been removed from the other two files.

Figure 16: Averages of doc files highlighting the presence of a spreadsheet. More specifically, the spreadsheet takes on the form of a large table within the Microsoft Word document.

For additional differentiation, the average distribution, Figure 17, can be used to rapidly identify what types of data are contained within each of the files. The first file contains the typical characteristics of imagery data, while the third file has a deviating spike above the 150 level, indicative of the table.

Figure 17: Distribution of averages of doc files. The peaks aid identification of data types contained within a file. The table is indicated by the peak above 150.

2) Obfuscation within xls files

This second example looks at raw spreadsheet files; Figure 18 shows the average window values. In this scenario, the spreadsheet could be the target itself, or it could be used to hide alternative data sources, such as child pornography. As with the other Windows formats, imagery data is specified at the beginning of the file and referenced later, within the body of the file. This is exemplified here: the image is actually anchored at the end of the file, but the data appears at the beginning of the file.

Figure 18: Averages of xls files. Imagery data is easily identified. Larger gaps between steps identify the charts. Here, deviation from the expected pattern identifies data of interest.

The stair step pattern of normal data is easily identified. More challenging is identification of the charts within the xls files.
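A first pass at automating this kind of spotting can be sketched as follows: flag windows whose average deviates strongly from the file-wide median profile. This is an illustrative heuristic, not the authors' algorithm, and the threshold is an assumption chosen for illustration.

```python
# Illustrative heuristic (not the authors' algorithm): flag windows
# whose average deviates from the file-wide median by more than a
# threshold, as a first-pass locator for embedded regions such as the
# table or imagery blocks discussed above. Threshold is an assumption.
import numpy as np

def flag_windows(averages: np.ndarray, threshold: float = 25.0) -> np.ndarray:
    """Return indices of windows deviating from the median average."""
    return np.where(np.abs(averages - np.median(averages)) > threshold)[0]

# Contiguous runs of flagged indices approximate an embedded object's
# extent; multiply by the window size for approximate byte offsets.
```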

VIII. DATA EXTRACTION

Once alternative data types (sources) have been identified as being embedded within a specific file, extraction of that data is, in most cases, a straightforward matter of opening the file with the correct application and accessing the appropriate location within the file. Sufficient information can be identified from the analysis process specified above to locate the alternative data relatively quickly, thus speeding the analysis process.

The more difficult scenario is where data is appended to a file. In this case, the file cannot simply be opened. Instead, the information provided by the data distribution, namely the standard deviation distribution, will identify the type of data appended. Additionally, the average values will identify approximately where the change in data types occurs. With these two pieces of information, the appropriate location within the file can be algorithmically searched for known magic numbers associated with the given data types, or for identifiable boundaries for data types without magic numbers, i.e., raw textual data. Given this position, the appended data can be extracted, i.e., cut, from the obfuscating file and displayed with an appropriate application.
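The magic-number search described above might look like the following sketch. The signature table is a small illustrative subset (JPEG, zip, and PDF headers), and the slack radius around the statistically estimated boundary is an assumption.

```python
# Sketch of the magic-number search: scan a slack region around the
# boundary estimated from the average values for known signatures,
# then carve from the first hit to end of file. The signature table
# is a small illustrative subset; the slack size is an assumption.
SIGNATURES = {
    b"\xff\xd8\xff": "jpg",   # JPEG start-of-image marker
    b"PK\x03\x04": "zip",     # zip local file header
    b"%PDF": "pdf",           # PDF header
}

def carve_appended(path: str, approx_offset: int, slack: int = 4096):
    """Return (type, offset, bytes) for appended data near approx_offset,
    or None if no known signature is found."""
    with open(path, "rb") as f:
        blob = f.read()
    start = max(approx_offset - slack, 0)
    best = None
    for magic, kind in SIGNATURES.items():
        hit = blob.find(magic, start)
        if hit != -1 and (best is None or hit < best[1]):
            best = (kind, hit)
    if best is None:
        return None
    kind, hit = best
    return kind, hit, blob[hit:]
```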

IX. PERFORMANCE

As mentioned in the Introduction, the goal of this technique is to improve the efficiency and effectiveness of forensic analysts. Clearly, the technique is effective at aiding the rapid identification of data types within a file, regardless of the file type. Adding this technique into the analysis process does not substantially add time to that process. For instance, on a 3 GHz system with 1 GB of memory, analyzing 35 files (111 MB) with the 256 byte window size and all thirteen statistics required approximately 4 minutes from start to finish (raw data to final graph creation). Clearly, analyzing terabytes of data will require some time; however, not all of the statistics would be needed in a production system, and there is plenty of room for optimization, as this was not a concern in this project. Additionally, faster systems, especially in terms of the disk interface, which was a low-end ATA interface in this case, would greatly improve performance. In an analysis process, the statistics could easily be run overnight, providing the results for rapid analysis by an analyst the next day.

X. CONCLUSION

We have shown that multiple statistics can easily identify individual data components intrinsic to a file; the differentiation between file types and data types is a very important concept in the forensic analysis of computer data. The distributions/probabilities can be used to rapidly identify the possible existence of data types within a file, and the individual statistics mapped to file position can then be used to validate the data type and identify the approximate positioning of the identified data types within the file for rapid analysis. We have identified which statistics allow for the identification of each data type and what characteristics allow for said identification and differentiation from other data types. Clearly, it should be quite feasible to develop algorithmic techniques to identify not only the overall file type but also the data types integral to the file. This will greatly reduce the manpower needed to identify and examine large hard drives for this particular technique for hiding data.

XI. FUTURE WORK

There would be benefit in testing and experimenting with a wider range of file and data types and with additional statistical algorithms. The next major component of the research will be the development of actual algorithms applying what has been identified as differentiating characteristics, and determination of the effectiveness of the algorithmic approach. This is essentially a critical step before deployment by law enforcement and others for actual use. Additionally, we must examine other mechanisms for data hiding and how such techniques can be identified. While many of these even more advanced techniques will not be used for some time, we need to investigate techniques for their identification and analysis. For instance, it has been shown that information can be hidden in the BIOS of PCI expansion cards [8].

REFERENCES

[1] Brian Carrier, File System Forensic Analysis, Addison-Wesley, Upper Saddle River, NJ, 2005.
[2] Eoghan Casey, Handbook of Computer Crime Investigation, Academic Press, 2002.
[3] M. Damashek, "Gauging similarity with n-grams: Language independent categorization of text," Science, vol. 267, Feb. 1995, pp. 843–848.
[4] Neil F. Johnson and Sushil Jajodia, "Steganalysis: The Investigation of Hidden Information," IEEE Information Technology Conference, Syracuse, New York, 1998, pp. 113-116.
[5] M. Karresand and N. Shahmehri, "Oscar – file type identification of binary data in disk clusters and ram pages," in Proceedings of the IFIP International Information Security Conference: Security and Privacy in Dynamic Environments (SEC2006), LNCS, 2006, pp. 413-424.
[6] M. Karresand and N. Shahmehri, "File Type Identification of Data Fragments by Their Binary Structure," in Proceedings of the IEEE Information Assurance Workshop, West Point, NY, June 2006, pp. 140-147.
[7] Warren G. Kruse II and Jay G. Heiser, Computer Forensics: Incident Response Essentials, Addison-Wesley, 2002.
[8] Robert Lemos, "PCI cards the next haven for rootkits?," 2006, http://www.securityfocus.com/brief/360.
[9] W.-J. Li, K. Wang, S. Stolfo, and B. Herzog, "Fileprints: Identifying file types by n-gram analysis," in Proceedings of the Sixth IEEE Systems, Man and Cybernetics Information Assurance Workshop, June 2005, pp. 64–71.
[10] M. McDaniel and M. Heydari, "Content based file type detection algorithms," in Proceedings of the IEEE 36th Annual Hawaii International Conference on System Sciences (HICSS'03), Washington, DC, 2003, p. 332.1.
[11] G.J. Simmons, "The Prisoner's Problem and the Subliminal Channel," in Proceedings of CRYPTO '83, 1984, pp. 51-67.
[12] http://berghel.net/publications/data_hiding/data_hiding.php

APPENDIX I

This appendix describes the thirteen explored statistics in more detail, as follows:

Distribution of Averages: The distribution of averages, (9), is the probability that an average chosen from all of the averages for the file is the value of B:

D_{\bar{X}_B} = \Pr\left( (B + 1) > \bar{X}_j \geq B \right)    (9)

Distribution of Delta Averages: The distribution of delta averages, (10), is the probability that a delta average chosen from all of the delta averages for the file is the value of B:

D_{\Delta \bar{X}_B} = \Pr\left( (B + 1) > \Delta \bar{X}_j \geq B \right)    (10)

Distribution of Standard Deviations: The distribution of standard deviations, (11), is the probability that a standard deviation chosen from all of the standard deviations for the file is the value of B:

D_{S_B} = \Pr\left( (B + 1) > S_j \geq B \right)    (11)

Distribution of Delta Standard Deviations: The distribution of delta standard deviations, (12), is the probability that a delta standard deviation chosen from all of the delta standard deviations for the file is the value of B:

D_{\Delta S_B} = \Pr\left( (B + 1) > \Delta S_j \geq B \right)    (12)

Distribution of Deviations from the Standard Deviation: The distribution of deviations from the standard deviation, (13), is the probability that a deviation from the standard deviation chosen from all of the deviations from the standard deviation for the file is the value of B:

D_{SS_B} = \Pr\left( (B + 1) > SS_j \geq B \right)    (13)
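Equations (9) through (13) share a single form: the probability that a value drawn from the given statistic's series falls in the unit bin [B, B+1). One helper therefore implements all five distributions; a minimal sketch:

```python
# Sketch implementing the common form of equations (9)-(13): the
# probability that a value x_j from a statistic's series satisfies
# (B + 1) > x_j >= B, for each integer bin B.
import numpy as np

def value_distribution(series: np.ndarray, bins: int = 256) -> np.ndarray:
    """Return Pr((B + 1) > x_j >= B) for B = 0..bins-1."""
    counts, _ = np.histogram(series, bins=bins, range=(0, bins))
    return counts / max(len(series), 1)

# E.g., the distribution of standard deviations, eq. (11):
# d_s = value_distribution(std_series)
```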
