Automated Event and Social Network Extraction from Digital Evidence Sources with Ontological Mapping

Dr Benjamin Turnbull
Australian Centre for Cyber Security, University of New South Wales, ACT, Australia
[email protected]

Mr Suneel Randhawa
Automated Analytics and Decision Support, Cyber and Electronic Warfare Division, Defence Science and Technology Organisation, West Avenue, Edinburgh, South Australia, 5111, Australia
[email protected]

Abstract

The sharp rise in consumer computing, electronic and mobile devices and data volumes has resulted in increased workloads for digital forensic investigators and analysts. The number of crimes involving electronic devices is increasing, as is the amount of data for each job. This is becoming unscalable, and alternate methods to reduce the time trained analysts spend on each job are necessary. This work leverages standardised knowledge representation techniques and automated rule-based systems to encapsulate expert knowledge for forensic data. The implementation of this research can provide high-level analysis based on low-level digital artefacts in a way that allows an understanding of which facts support each decision. Analysts can quickly determine which artefacts warrant further investigation and obtain high-level case data without manually constructing it from the low-level artefacts. Extracting users and social networks and translating the state of file systems into sequences of events are the first uses for this work. A major goal of this work is to automatically derive 'events' from the base forensic artefacts. Events may be system events, representing logins, start-ups and shutdowns, or user events, such as web browsing or sending email. The same information fusion and homogenisation techniques are used to reconstruct social networks. There can be numerous social network data sources on a single computer: internet caches can hold Facebook, LinkedIn and Google Plus data; email clients have address books and copies of emails sent and received; instant messengers have friend lists and call histories. Fusing these into a single graph allows a more complete, less fractured view for an investigator. Both event creation and social network creation are expected to assist investigator-led triage and other fast forensic analysis situations.


Keywords

Artificial intelligence, Big Data, Digital forensics, Digital evidence, Event representation, Forensic tool development, Knowledge engineering, Knowledge representation, Ontology, Software engineering, Symbolic artificial intelligence, Triage

Introduction

Ask any digital forensic analyst or manager and they will tell you that one of the greatest organisational issues faced in electronic evidence analysis over the last decade is the growth of workload. The last decade of Moore's and Kryder's Laws (Schaller 1997, Walter 2005) and the proliferation of devices in an approaching 'post-PC' world has seen computing encroach upon every facet of our lives. The range of technologies that qualify as suitable sources for forensic analysis, and the amount of data each may hold, have grown substantially; electronic crime and forensic analysts are straining under the weight of this demand and are trying to find new ways of coping with the increasing influx of computer data for analysis. Given that this growth in workload will not change in the near future, there is a need to augment the methods employed by forensic analysis groups to increase the number of devices that can be analysed without sacrificing the quality of results. We need to work smarter, not harder; or, where possible, employ automated services to perform some of this work. There are two primary aims of this research: to provide a large-scale, consistent knowledge representation, and to build symbolic Artificial Intelligence for developing deep understanding in digital forensic cases based on that data. Specifically, by encoding the data into an ontology and reasoning over the resulting dataset, higher abstractions of data can be derived. Although other AI and machine learning paradigms are of use in this field, our current focus is on systems that provide the ability to audit the decision-making process behind inferred knowledge. Extraction of users and social networks and translating the state of file systems into sequences of events are the first two uses for this work. The first outcome of this work is an ontological representation and data store consistently representing entities and relationships pertaining to:

- the hierarchy of files, directories and file systems;
- user accounts and system information;
- system events;
- people;
- user events.

The second outcome of this work is a rule-based system that automatically extracts data from multiple sources. One of the major goals of this work is to automatically infer 'events' from the base forensic artefacts. Events may be system events (taken from multiple sources and fused appropriately), representing logins, start-ups, shutdowns and updates, or user events, such as web browsing, sending email or other activity. The same information fusion and homogenisation techniques can also be used to reconstruct social networks. There are numerous social network data sources located on a single computer: internet caches can house Facebook, LinkedIn and Google Plus data; email clients have address books and copies of emails sent and received; instant messengers have friend lists and call histories. Fusing these into a single graph allows a more complete, less fractured view for an investigator, providing better insight. This work is implemented as a proof-of-concept forensic tool, ParFor ('Parallax Forensics'). ParFor makes use of Resource Description Framework (RDF) (Klyne, Carroll et al. 2004) ontologies to provide a unified representation of multiple different data sources, and to provide higher-level reasoning capabilities. ParFor was designed primarily as a vehicle for this research, but was scoped in a relatively generic manner. As such, it is expected that this platform is generic enough to serve as a basis for other machine learning and reasoning paradigms, as well as to allow expansion into other forms of digital evidence. This work is conducted as part of the Parallax BattleMind project (Murray, Grove et al. 2013).

Background and research need

Given the increased workload of digital forensic analysts everywhere, investigator-led triage has become a common method for moving work from overloaded forensic analysts to investigators. An investigator with some computer training, rather than a specialist forensic analyst, has more understanding of the wider investigation and is often best suited to perform an initial analysis on a device. This is especially true for cases where the main purpose of the analysis is to extract specific information, or where an investigator is primarily interested in a 'summary' of the device to see if further analysis by analysts is necessary. Expert forensic analysts can therefore be diverted to more technical analysis, making more efficient use of their time and expertise. There are disadvantages to the use of triage as a substitute for analysis (Pollitt 2013), but these are beyond the scope of this work. The reality is that triage is a commonly used technique and this is unlikely to change in the near future. Assistance to less technical investigators can take several forms, such as additional training. This approach has had several success stories (Schmidt, Dampier et al. 2009, Dampier, Blaylock et al. 2012). One alternative is to alter the software to make it friendlier for less technical investigators, or to construct specific tools that encode expert knowledge. Currently, most forensic systems and tools, even triage systems, are designed for technical forensic analysts. While triage-specific tools
exist, in practice investigators often use full forensic analysis software for triage, making use of only a reduced feature set. Reducing the need to locate and interpret low-level computing artefacts is another, less explored method for achieving the same goals. The core concept of this approach is that expert systems can locate and interpret the artefacts and provide higher-level abstractions and conclusions. As long as the reasoning process can be explained, either for court or for verification purposes, such an approach can be used to interpret a piece of digital evidence whilst abstracting away the underlying mechanics of file interpretation. One logical abstraction of a computing device is to define it as a sequence of events that have occurred on the system. For example, an investigator wants to know when a user was accessing the internet and what sites they visited; they care less about the browser- and operating-system-specific encoding of how this information is stored. It is all about the interpretation, not the files themselves: less about what is there, and more about what it means. In this sense, this work expands upon the work of Carrier (Carrier 2003), which classifies individual tools at multiple abstraction levels, providing the abstraction layers physical media, media management, file system analysis, network analysis and memory analysis. This work increases the abstraction layers beyond those of Carrier and into the user space. The concepts, however, remain the same. Each level is unique, and some levels of abstraction may build upon the interpretation of previous ones. As long as the link between abstraction layers is understood, the higher-level abstraction can collate information for greater understanding. This work also expands on Carrier and Spafford (2004), specifically their work on events in digital forensics. One of the stated aims of this work is to create a higher-level event representation; although similar in spirit to the work of Carrier and Spafford, the event representations in this work sit at a higher abstraction level. Other notable work in creating event representations for digital forensics comes from Marrington et al. (2007), who introduced an event model coupled with a data representation. Specifically, this work draws upon their concept of classifying data into a structured form, but uses a richer and more complex ontological structure. Marrington et al. rely on four entity types (content, application, principal, system) and their associated sub-components. By design, our work seeks to expand on the technical concepts and translate them into mechanisms of use to an investigator. One of the primary aims of this work is to model people, how they interact, and their connections with the multiple devices used over the course of an investigation, which is a more holistic perspective. This work also acknowledges previous work by Hargreaves and Patterson (2012) in developing event-driven systems for digital forensics. Our work was developed independently and specifically for interoperability with our network-based system (Murray, Grove et al. 2013). The complexity, the scale, the linking of events through RDF, and the broader and more holistic nature of the information gathered are ways in which this work differs from others in the area.
Although this work has the similar aim of assisting digital forensic analysis through event reconstruction, and uses an event model comparable to that of Batten and Pan (2011), the
approach taken and the subsequent research differ. This work is far more data intensive, but can begin to answer analyst questions in a more holistic way. One of the primary precepts of digital forensics, or indeed any forensic science, is that one must understand the actions taken and assumptions made. If one cannot say how an interpretation was made, its validity is rightly questioned. Unfortunately, some of the most promising machine learning and AI techniques do not allow for this. Techniques such as cluster analysis or Bayesian networks may flag outliers and anomalies relative to others, but cannot provide a concrete understanding of why these are anomalous, except that they are different. It is the authors' experience, and it has been shown elsewhere (Khan, Chatwin et al. 2007, Teh and Stewart 2012), that this is detrimental to investigator acceptance and runs against some fundamentals of digital forensics. One of the primary hurdles is that investigators will not accept an assertion if no information can be given as to why it was made. Computers cannot have 'gut feelings', so the use of machine learning paradigms that cannot accurately explain why assertions are made asks investigators to take an algorithm on faith, something akin to suggesting 'it just is'. At the very least, any inferences made would need to be reconstructed from basic principles to ensure validity, either for court or for other outcomes. Similarly, the use of training datasets presents issues. With multiple operating systems, updates, configuration differences, and different software installed to different file system locations (across multiple file systems), it is difficult to obtain a cohesive and rich training set. There are no public labelled training sets that are relatively new and applicable to this domain, and synthetic data cannot be used for training, as it can only train for features one is already aware of. The authors feel that machine learning and AI techniques requiring training are therefore not applicable at the current time. This is not to minimise the current work in the area (Tse, Chow et al. 2012); only to recognise the current limitations. Although basic, Symbolic AI techniques do not require training and can provide a link between antecedent and consequent. Modus ponens/tollens ('if this, then that') rules allow for an understanding of the processes that lead to an assertion, linking the original artefacts to each event. Specifically, all rule sets can be clearly defined, and known facts justify each decision or inference. Symbolic AI also does not require training or specific datasets. This is not to suggest that this technique must be used in isolation. Once data is abstracted to a higher level, there are several possibilities for the use of more advanced techniques. For the purposes of this work, the authors focused on Symbolic AI techniques with the view of adding other reasoning paradigms at a later date. This is consistent with the approach taken by Hargreaves and Patterson (2012) and that used by FACE: automated digital evidence discovery and correlation (Case, Cristina et al. 2008) and ECF – Event Correlation for Forensics (Chen, Clark et al. 2003).

Benefits of ontological representation

At its core, ontological representation allows for a very natural encoding of hierarchies, graphs, and entities with attributes. Ontological representation fits closely with how people internally represent data: objects, types and properties. Ontologies are easily represented as graphs, with objects as nodes and properties as links.


A common representation allows an ontology to 'know' things: to ingest information from multiple domains of knowledge and present them and their relationships in a consistent manner. Google's Knowledge Graph, one of the current generation of ontological systems, has the apt tagline "Things, not strings" (Google Inc. 2012). Ontological representation allows a more extensible method of storage than the use of custom databases (although triples can be stored in this way), as new properties append to the existing structures. Additionally, ontological representation provides two major benefits: a common language and a basis for reasoning. The common language is one of the primary benefits of ontological representation. The use of a shared ontology across multiple systems creates a consistent language, as the same terms are used with the same meanings. Different, incompatible knowledge representations will potentially use the same term in different contexts, use different terms for the same concept, or represent terms at different granularities. Adherence to a common language allows multiple systems to communicate effectively. Of course, this relies entirely on the ontological structure. The second benefit of ontological representation is its use in symbolic artificial intelligence and machine learning. Knowledge becomes a graph structure which can be easily navigated, compared and analysed by machine. While the semantic meaning of nodes and links is designed for human consumption, the links themselves are easily navigated by machine.

Summary of Existing Forensic Ontologies

There have been several instances of researchers using ontological data stores, or developing ontologies, specifically for the field of digital forensics. Rather than construct a new digital forensic ontology, a survey was made to assess the suitability of existing ones. It was found that many of the existing forensic ontologies are high level, referring to the investigation in general rather than to electronic artefacts. Also missing from many existing ontology papers is an underlying reason for putting the data into such a representation in the first place. Ontology is a clean representation suitable for machine navigation, but papers often discuss the representation without answering 'why'. It also follows that the purpose of an ontology determines its structure; many papers discuss the structure without discussing the underlying requirements that shaped its formation. This made it difficult to assess the suitability of these ontologies for the needs of this research. This is not meant as a criticism of these representations, merely a statement of their unsuitability for this work. The ontology required for this work, whether found in the literature or developed, must represent computers, user accounts, disks, file systems, files, directories, metadata, people and events. All objects must be represented in such a way as to allow later expansion to other devices and subtypes (i.e. specialisation of file types), and to allow annotation of additional knowledge as it is required. Given the breadth of these representations, there is a requirement to either link multiple ontologies (in an orthogonal, consistent way) or develop new ones. This includes information about computers, events, file system information and the users of computers. There is also a need to represent the origin of facts, specifically whether the source is the original data or an inference.


One of the first uses of ontological data stores for digital forensics was by Schatz and Mohay (2004). Although Schatz used ontological systems (specifically Apache Jena (McBride 2002)) for representation, the majority of his work used dynamic ontologies. A dynamic ontology is one whose structure is informed by the data itself. Unlike some other graph knowledge stores (Lenat and Guha 1989, Cycorp Inc 2002), RDF does not require the addition of new terms to an ontology before their use. Dynamic ontologies are created automatically as dictated by the data, and data can be entered by individual systems in the most fitting format for each source. However, by not defining the structure of the data, limits are imposed on its utility. In the same way that a database schema can have as many or as few constraints as the designer intends, an ontology can have varying degrees of structure. Having no constraints on structure makes adding data easier, but makes querying and understanding data more difficult; if one relies on the data to self-organise, it must be added in the same format every time to be orthogonal and consistent (Gomez-Perez, Corcho-Garcia et al. 2004). Tagging and organising the ontology structures creates the constraints of a common language, which allows consistency. This is one of the primary purposes of ontology. For example, if two systems entering documents into an ontological data store treat the concept of a title differently, the titles become effectively different concepts. For the purposes of this work, the primary ontology must be consistent, known in advance and orthogonal. Therefore, unlike the work of Schatz, a dynamic ontology is not suitable for this work. Possibly the first use of a formalised ontology to model the field of digital forensics was the work of Brinson, Robinson et al. (2006) of Purdue. Although robust, this ontology is not suitable for our purposes, as it is relatively high level and unable to capture the detail required. Specifically, it focuses on the investigative side of digital forensics, including the judiciary, the roles of forensic investigators, and pieces of evidence at a broad scale. It does not seek to represent potential evidence sources individually or in sufficient depth for our needs. In the literature, there are several references to DIALOG, an ontology (Kahvedžić and Kechadi 2009, Ćosić, Ćosić et al. 2011). On the surface, DIALOG appears to be a useful ontology, relating case and artefact information in a known structure. However, at the time of writing there was no way to verify this, as the ontology itself was not published. Although not strictly an ontology, Digital Forensics XML (Garfinkel 2009, Garfinkel 2012) represents a similar approach to achieving the same ends. By structuring a common machine-readable format, automation becomes achievable. The standard definition of objects and their properties is similar to one of the purposes of this work, but whilst the purpose is similar, it is not the same. There are also many benefits that RDF provides: the integration of all knowledge across a single source, the reasoning capabilities, the shared infrastructure, and the potential scale that such systems provide being prime examples. This is not to denigrate the use of Digital Forensics XML; it merely highlights the different purpose for which it was created. There are also several public domain ontologies in related fields.
Cyber Observable eXpression (CybOX) (Barnum 2010, Mitre Corporation 2015) is an ambitious project to develop a unified ontology representation applicable to computer and network security developers. Although the
domain of digital forensics is closely related to computer and network security, there are some obvious differences in potential incoming data sources and in the intentions of analysts and practitioners. The benefits of ontological representation discussed within this work are similar to the motivations of the CybOX project, although the concepts modelled, the technology base, the forms of representation and the intended use are different. There are also other cyber security ontologies within the public domain (Symonenko, Liddy et al. 2004, Denker, Kagal et al. 2005, Parkin, van Moorsel et al. 2009, Takahashi, Kadobayashi et al. 2010). As with CybOX, we are indirectly influenced by their existence, but ultimately decided that the possibilities of reuse were small, and that, in the interests of having the best possible representation, we should assume no prior representation. We acknowledge that this approach is being used in related fields; however, our approach is different and ultimately unique.

Summary of Existing Event Ontologies

We found two event ontologies of interest, but decided to use neither directly. Many general event ontologies were shaped by their chief use-case, and ours did not align in the same way. These are discussed as follows. LODE (Shaw, Troncy et al. 2009) provides an interesting perspective on event ontologies but, despite its claims of generality, seems designed for a specific purpose. The authors felt there were inconsistencies in the representation. Specifically, the atPlace predicate (denoting the place where an event occurred) is at odds with the inSpace property, which denotes much the same thing. Similarly, there is no means of determining the verb of an event (what actually happened) or the specific roles of the agents involved. From this it was determined that LODE would require significant reengineering for the use-case outlined in this work. The event ontology (Raimond and Abdallah 2007) is a detailed ontological representation of events, and the result appears to be widely used. It started as a companion to the music ontology (Raimond, Abdallah et al. 2007). Whilst this work is well developed, it contains much specificity unnecessary to the use-case previously outlined. As such, whilst we draw on it for inspiration, we do not implement it directly.

Abstracting Data to Form Events

One of the major focuses of this work is to test the possibility of abstracting event representations into a form more conducive to forensic triage. Events in this context are at a higher abstraction level than those discussed by the other authors mentioned previously in this work. Specifically, we capture two types of event: user events (Bob sent an email) and system events (Computer1 was turned on). There is some overlap between the two types; whereas some events can be attributed to a user account, others occur at a system level. Although the details are beyond the scope of this paper, the authors feel that the use of an ontological structure represents a formalisation of language, and as such should, wherever possible, conform to linguistic best practices. It was decided wherever possible to use an event representation compatible with Neo-Davidsonian event representation (Davidson 1980, Rosen 1999,
Hornstein 2002). Specifically, Neo-Davidsonian events are spatio-temporal occurrences requiring location, time, verb and subject. Neo-Davidsonian sentences are explicit about subjects (called "Actors"), objects ("Themes") and tenses: all verbs are present tense, and time is explicitly recorded, not implicit. The Parsons (1995) representation expands on this, adding the objects used to perform an action ("Instruments"). Neo-Davidsonian representation is the standard, although still evolving, representation of events. There are some arguments against its use (Bayer 1997), but these have been minimised by later works. Although we subscribe to the explicitness required by Neo-Davidsonian representation, for the purposes of our work we have relaxed the explicit location requirement that Davidson felt necessary. The authors feel that the location of all events will be within the computer, and the concept of location is less concrete or integral to the event given the intangibility of a running computer process. We will revisit this if necessary. The broad definition of the term 'event' used for this work is "an action that results in changes to the world state or the creation of new objects". Wherever possible, we wish to link events to user actions, stating that an action results from, or informs, a user. We are also simplifying event representations into a single event type, rather than the four categories Parsons uses (accomplishments, achievements, states and processes). There are no ongoing events in a post-mortem digital forensic examination; by definition, the system is no longer running. Although Parsons distinguishes instantaneous events, we see no need to categorise these separately; within our representation they are the same events with the same start and end times, effectively giving them zero duration. The authors are aware that the use of Davidsonian events in an environment that is not directly observable borders closely on Kimian states (Kim 1969, Maienborn 2007). Davidsonian (and Neo-Davidsonian) events are observable, whereas Kimian states refer to the unobservable. Where possible, one would wish to avoid Kimian states, as it is hard to reason about belief states, such as what people believe at a specific time or their thought processes. Digital forensic analysis is by nature comprised of inferring events that were not directly observed; it is a post-mortem, not a live observation. However, the inferences made are based on data, and their existence can be reasoned about. As such, Kimian states are avoided. It must be noted that the framework does not explicitly exclude the presence of Kimian states, but they should be avoided or minimised by individuals using and expanding the base framework. Neo-Davidsonian event representation is amenable to RDF; we are provided with a rich framework that is easily extended. Where the 'instrument' predicate may seem limiting, RDF allows us to create sub-properties as necessary. We can then search by specific property types unique to a type of event, or on all instruments attached to an event, as necessary. As per Neo-Davidsonian representation, we build event objects in RDF, and then use the following predicates: actor (one or more subjects of the event), verb (the one action involved in the event), theme (zero or more objects of the event), instrument (zero or more other objects involved in the event), startTime and endTime.
From this framework, we can build any number of event types and expand the ontology as required.
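As an illustration, the following sketch builds one such event ("a user account sent an email") using the predicates just listed. It is shown with the rdflib Python library for illustration; the namespace URIs, class names and instance identifiers are illustrative stand-ins, not terms drawn from the published ontology.

from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

# Illustrative namespace URIs; the published ontology defines the real ones.
EVENT = Namespace("http://parfor.example/event#")
INST = Namespace("http://parfor.example/instance#")

g = Graph()
e = INST["Event-sendEmail-0001"]                          # hypothetical event identifier
g.add((e, RDF.type, EVENT.Event))
g.add((e, EVENT.actor, INST.UserAcctJBloggs))             # one or more subjects
g.add((e, EVENT.verb, EVENT.EmailSent))                   # exactly one action
g.add((e, EVENT.theme, INST["Email-0042"]))               # zero or more objects
g.add((e, EVENT.instrument, INST.OutlookProfileJBloggs))  # other objects involved
# An instantaneous event: identical start and end times give it zero duration.
g.add((e, EVENT.startTime, Literal("2013-04-04T20:15:00", datatype=XSD.dateTime)))
g.add((e, EVENT.endTime, Literal("2013-04-04T20:15:00", datatype=XSD.dateTime)))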


Use of layered multiple ontologies over the same dataset

There is no single way to represent data; much of the representation depends on the purpose for which the ontology is being developed and the granularity of the data available. Whilst there is no value in modelling at a level beyond what is required, the alternate risk is to set the abstraction level so high that it cannot capture the information required without rebuilding structures. For this work, we use multiple ontologies that overlap on key elements. Specifically, we choose to model information at multiple abstraction levels; this is, in effect, modelling the same data with multiple ontologies. The same (or different) objects can be linked to the conclusions drawn, and there can be multiple, linked graphs. In this way, the extension of Carrier's discussion of forensic tool abstraction layers is quite appropriate: this tool is a meta-tool, operating across all levels of abstraction that are feasible (and worthwhile) to model. The following information areas are represented: common understanding, ingestion facts, direct interpretation of facts, and inferences (specifically including events). This is shown in Figure 1.

Figure 1: Developed layered ontologies and concepts.

Common understanding contains ontological concepts that are of use at any granularity and across multiple domains. Ingestion facts encompass digital forensic information such as disks, partitions, file systems, files and directories. These are facts as determined by the forensic software, with no analysis or addition. This layer models, for example, the hierarchical nature of disks, directories and files, and has properties for each. The direct interpretation of facts appends additional interpretation to the raw information. Objects such as emails, extractions of compressed files, password hashes, registry information, vulnerability analyses and user enumerations are not found in the raw interpretation of a file system, but are necessary in any forensic analysis. This is the
interpretation of information to derive new information. Part of this stage is the abstraction of file formats and objects to a common representation (emails, users, email addresses, etc.). Inferences pertain to system information that is not directly attributable to the forensic software, but is instead the result of an expert system that has used the lower ontologies to assert a high-level event or activity.
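To make the layering concrete, the following sketch shows a single artefact as it might appear at each layer. The Turtle is loaded through rdflib; every prefix URI and predicate name here is an illustrative stand-in, not a term from the published ontology.

from rdflib import Graph

# One artefact seen through three layers; all terms below are hypothetical.
layered = Graph()
layered.parse(format="turtle", data="""
@prefix fs:       <http://parfor.example/fs#> .
@prefix fsdetail: <http://parfor.example/fsdetail#> .
@prefix person:   <http://parfor.example/person#> .
@prefix instance: <http://parfor.example/instance#> .

# Ingestion fact: the file exists; nothing more is claimed.
instance:File-0099 a fs:File ; fs:name "outlook.pst" .

# Direct interpretation: parsing identifies it as mail storage holding a profile.
instance:File-0099 a fsdetail:EmailStorageFileType .
instance:MailProfile-01 fsdetail:extractedFrom instance:File-0099 ;
    fsdetail:emailAddress "[email protected]" .

# Inference: a reasoner asserts a person and attributes the address to them.
instance:PersonJBloggs a person:Person ;
    person:emailAddress "[email protected]" .
""")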

Separating Fact from Hypothesis

Forensic software is designed to display facts, verifiable from the source data, to the user. As a rule, it is up to the investigator to take these facts and create hypotheses, case narratives and greater understanding. The tool itself merely interprets the data and provides facts. However, commercial forensic tools all add their own interpretation of raw data in some way: grouping files by type (according to header or extension), by visibility, or by other metrics considered of interest to an investigator. While these interpretations are based solely on attributes of the original data, they represent logical reasoning encoded into the system. FTK flags files that have the extension of one data type but identify as another in the header; this is an interpretation. FTK (or a programmer at AccessData) has made the determination that files that do this are worthy of further investigation, and that this is 'bad'. The addition of scripting languages into forensic tools, such as EnCase EnScript, is a more freeform representation of the same concept: that someone can search for data and make logical assumptions based on it. However, the underlying raw data is still available to the user. Using multiple ontologies gives us the flexibility to perform this representation in multiple ways. At the base level, we can annotate the raw facts of the data: a disk containing a file system that has a directory, and files in that directory. The relationships are defined only as they are provided in the raw data. Object attributes are provided based on facts in the system (for example, that a file with extension ".png" has the property "extensionFileType" related to the object PNGFile). Good understanding of existing file types and subclassing makes queries such as 'find all image files over 3kb', or 'find all files that have an image extension but are not identified as images', quite simple; a sketch of both queries is given at the end of this section. The next layer of ontology relies on the first, but is created and populated by reasoners trawling the low ontology. Rather than focusing on low-level data such as disks and files, this ontology has a much higher abstraction, focusing on users and their attributes (names, accounts, email addresses, known connections to others), computing systems in general, and events of computing use. This level of ontology is based entirely on reasoners looking at low-level data in the initial context and then using logic to make inferences from the data. These might be as simple as enumerating computer users and saying that for each user a person exists, and that the Microsoft Outlook profile for that user gives that person's email address. They might also be much more complex, correlating multiple files across multiple devices with temporal aspects. This allows an investigator to look at, confirm, deny or query hypotheses as needed, while retaining access to the raw facts of a case. The hypotheses add value, but the raw data is still accessible. As the origins of different inferences and hypotheses are tagged, a link exists between the low and high ontologies.
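A sketch of both queries follows. The namespace URIs and the fs:fileSize property (assumed to hold a size in bytes) are invented for illustration; fs:fileExtension and fsdetail:ImageFileType appear in the rule example later in this paper.

from rdflib import Graph

g = Graph()
g.parse("low_ontology.ttl", format="turtle")   # hypothetical export of the low ontology

PREFIXES = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX fs: <http://parfor.example/fs#>
PREFIX fsdetail: <http://parfor.example/fsdetail#>
"""

# 'Find all image files over 3kb'; fs:fileSize is an assumed property name.
over_3kb = g.query(PREFIXES + """
SELECT ?file WHERE {
    ?file rdf:type fsdetail:ImageFileType ;
          fs:fileSize ?size .
    FILTER (?size > 3072)
}""")

# 'Image extension, but not identified as an image' (extension/header mismatch).
mismatches = g.query(PREFIXES + """
SELECT ?file WHERE {
    ?file fs:fileExtension ?ext .
    fsdetail:ImageFileType fsdetail:associatedWithExtension ?ext .
    FILTER NOT EXISTS { ?file rdf:type fsdetail:ImageFileType . }
}""")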


Encoding Expert Opinion to Derive Events

Forensic analysis is as much an art as a science, but some processes used by forensic investigators to interpret data can be codified. At least at the beginning of a new investigation, the stages taken by an expert are fairly standardised: find users, locate and analyse files, and profile user activity. Given the results of these, more in-depth processes are performed. In the interests of saving time for analysts, or allowing less skilled analysts to perform basic triage, these processes can in many cases be partially automated. The methods for determining user accounts are well documented across all major operating systems, as are the location and parsing of log data. Analysis of files can locate all email storage, internet browser caches and media files. As such, these processes can be performed automatically and presented to the analyst to better direct their search or augment future triages. The output of automating triage will not be a better triage, but a more efficient one. It is also possible to encode means of automatically searching for inconsistencies in a timeline. For example, if a computer is not on but activity is occurring (files are being altered that are not from an external source), then timing inconsistencies exist. Automating this process is less about finding novel information and more about automatically searching for the anomalous. The more processes we can automate, the less we must do manually, and the more time we can focus on processes that cannot be automated. The most obvious limitation of event creation and automated triage in general is the reliance upon the correctness of the underlying data. There are also timing issues (web browsing is based on cookies and internet cache data, and therefore on loaded pages, not the hour spent reading a page after it loaded). Reasoners are systems that read the low ontology (or the high ontology, or both) and write to the high ontology. Reasoners are, at their most basic level, a codification of expert knowledge: they interpret the presence, or absence, of files or metadata and draw conclusions from the result. For example, a reasoner may examine the low ontology relating to known internet cache files and conclude that between the hours of 8pm and 10pm on the 4th of April, Joe Bloggs (with attributes username jbloggs, email address [email protected]) was surfing the internet and emailing Jane Smith (email address [email protected]); a sketch of such a rule is given at the end of this section. Closer examination of the websites visited in the internet use event may then find that Joe was on eBay (adding another user account to his identity) buying gloves, industrial strength garbage bags, a shovel and duct tape. The system might also know that Joe was in the wilderness that day (inferred from EXIF data in images sourced from his phone). An investigator or analyst would find all this information eventually, but automation will extract it faster and present it sooner. This is not to suggest that reasoners will put forensic analysts and investigators out of a job. Humans are a vital part of the analysis process; there are intricacies and variations in every case, and current approaches cannot account for them. However, automated reasoning across structured data can help.
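A sketch of such a rule, written as a CONSTRUCT query of the kind described in the implementation section below, might look as follows. The cache-entry predicates in the WHERE clause are hypothetical placeholders, not terms from the published ontology.

# Lifts browser-cache facts into WebBrowsing events in the higher ontology.
# All fsdetail cache predicates below are hypothetical placeholders.
WEB_BROWSING_RULE = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX fsdetail: <http://parfor.example/fsdetail#>
PREFIX event: <http://parfor.example/event#>
CONSTRUCT {
    _:e rdf:type event:Event ;          # a fresh event node per cache entry
        event:verb event:WebBrowsing ;
        event:actor ?user ;
        event:theme ?url ;
        event:startTime ?time ;
        event:endTime ?time .           # page-load time only; duration unknown
}
WHERE {
    ?entry rdf:type fsdetail:InternetCacheEntry ;
           fsdetail:cachedUrl ?url ;
           fsdetail:cachedTime ?time ;
           fsdetail:ownedByUser ?user .
}
"""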


Digital Forensic Event Ontologies

The proposed ontology must encompass a broad range of topics, given that it is designed to represent information about people, computers, files, websites and events. As such, whilst this is a single ontology, it makes use of multiple namespaces to separate terms with the same name. For example, namespaces ensure that Apple (company) is a separate object from Apple (fruit). The created ontology is too verbose to provide in full. It is available at https://github.com/benjaminturnbull/ParFor/blob/master/ontology/ParForOntology.ttl and is released under the GPLv3 licence. The ontology namespaces are: person, communications, computers, events, users, file systems, detailed files and a base ontology. The base namespace (prefix base) is commonly referenced across the other ontologies, and is a prerequisite for incoming data. Base can be expanded to incorporate common attributes and metadata as necessary. The file systems namespace (prefix fs) incorporates basic information about files, directories, permissions and basic properties (names, etc.). The detailed files namespace (prefix fsdetail) expands on the basic properties to categorise files and provide additional detail, including visibility and format-specific metadata (such as EXIF metadata). The user namespace (prefix user) encompasses information about users and groups. There are obvious links between this and both the file and person namespaces. The person namespace (prefix person) relates to the people using a computing system, and represents a higher-level ontology. Information related to this ontology would need to originate from another source (the people who use a given computer, and the mappings between them and specific user accounts). The purpose of this namespace is to attribute findings to individuals, if possible. The communications namespace (prefix comms) documents and categorises the types of communications a device has: both the programs (and the types of communications) and the messages themselves. The computer namespace (prefix comp) relates to computers (and related devices). This encompasses information about devices (hostname), operating system and installed software, as well as event-related information such as on and off times. The events namespace (prefix event) represents events associated with each device. Events are a nebulous concept, but can be defined as actions with an associated time (started, completed) in which something altered. We use a Neo-Davidsonian approach to events (as previously discussed); this is the namespace that captures and maintains that rigour. Apart from providing a logical separation of concepts, the use of multiple namespaces allows different systems or implementations to use only a subset of the ontology if appropriate. This also has the benefit that additions or changes to concepts will only affect a subset of the overall ontology, smoothing out upgrade paths as ontology versions change. It is expected that ontologies at the higher abstractions will change over time in reaction to the information that can be inferred, any implementation, and the evolving purpose.


ParFor – Implementation of Research

The following section discusses the implementation of the system, named ParFor. Some details are omitted for brevity, but are available online at https://github.com/benjaminturnbull/ParFor/. The aim of this system is to provide a proof-of-concept and to illustrate the feasibility of the approach. In particular, the following aims are tested:

- use of ontology to represent electronic evidence for forensic analysis;
- multiple ontologies operating in synchronisation;
- use of ontological event derivation and representation.

ParFor is written in the Python programming language. Whilst this potentially introduces performance issues, Python provides a large number of libraries of potential use to digital forensic analysts (such as metadata extraction, steganographic analysis of images and powerful text manipulation), connects well to ontological knowledge stores, and is relatively portable. If performance becomes a major issue, bottleneck Python modules can be rewritten in C/C++. For the purposes of this exploratory work, development time was considered of greater importance. This again highlights that this implementation is a proof-of-concept at this stage.

Ontologies

For the implementation of this work, we used RDF as the ontology basis. There are other formats in use, but RDF is a W3C standard (Manola, Miller et al. 2004, Prud'Hommeaux and Seaborne 2008). There are multiple RDF data storage implementations, as well as multiple RDF reasoners, providing useful inferencing capabilities for free. The most useful of these is transitivity: if directory A contains directory B, which contains directory C, and the property contains is transitive, a query for all items that A contains will return both B and C without the latter needing to be explicitly stated. Additionally, RDF has an active community and is used in multiple fields (Wood 2010). Given the abbreviated ontological concepts below, a simple query returning all instances of fs:File will return instance:File-190204859302-computer1.

fs:File rdf:type rdfs:Class .
fsdetail:FileType rdf:type rdfs:Class ;
    rdfs:subClassOf fs:File .
fsdetail:MediaFileType rdf:type rdfs:Class ;
    rdfs:subClassOf fsdetail:FileType .
fsdetail:MovieFileType rdf:type rdfs:Class ;
    rdfs:subClassOf fsdetail:MediaFileType .
instance:File-190204859302-computer1 rdf:type fsdetail:MovieFileType ;
    fs:name "The Mighty Boosh Episode 1" .
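Where a store's RDFS/OWL reasoner is unavailable, the transitive behaviour described above can be approximated with a SPARQL 1.1 property path. A minimal rdflib sketch follows; the fs:contains property name and the namespace URI are assumed for illustration.

from rdflib import Graph

g = Graph()
g.parse(format="turtle", data="""
@prefix fs: <http://parfor.example/fs#> .
fs:dirA fs:contains fs:dirB .
fs:dirB fs:contains fs:dirC .
""")

# The + property path walks the hierarchy: dirC is returned even though
# the triple 'fs:dirA fs:contains fs:dirC' was never asserted.
rows = g.query("""
PREFIX fs: <http://parfor.example/fs#>
SELECT ?item WHERE { fs:dirA fs:contains+ ?item }""")
print([str(r.item).rsplit("#", 1)[-1] for r in rows])   # dirB and dirC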


There are disadvantages to using RDF, specifically when separating fact from metadata: RDF provides no simple way of asserting facts about facts. For this system, we wish to assert the following data about each fact: whether the fact is inferred or comes directly from the data; if it is inferred, which reasoner inferred it; and when the fact was entered into the system. RDF does allow for fact reification (Manola, Miller et al. 2004). However, reification has disadvantages, specifically verbosity and complexity. RDF reification adds verbosity, as each fact becomes four facts (one to make a statement, and one each for the subject, predicate and object) before adding any additional facts. This was not seen as scalable. Reification adds complexity because, rather than searching for {?s ?p ?o}, one would need to search for {?fact rdf:type rdf:Statement . ?fact rdf:subject ?s . ?fact rdf:predicate ?p . ?fact rdf:object ?o}. This makes querying more difficult and slower. For these reasons, the authors overloaded the fourth 'triple' element provided by RDF: the context. Contexts (also known as the graph or named graph) are used to separate facts into independent 'buckets' within a single repository. By default, queries are run across all graphs, but it is possible to specify the graph if required. For this work, we are not using RDF reification. Instead, we use the context to hold an individual key for each assertion. This individual key is an index into a second repository, with predicates for the required metadata. This second repository is also stored in RDF and hence receives the reasoning benefits of the format. The following represents an example in which two assertions are tagged with a graph ID that is then held in a separate repository containing the metadata. Due to the limitations of the TTL representation, the context is shown as a comment per statement.

Data repository:

instance:File-190204859302-computer1 rdf:type fsdetail:MovieFileType . # graph assertion0001
instance:File-190204859302-computer1 fs:name "The Mighty Boosh Episode 1" . # graph assertion0002

Metadata repository:

instance:Assertion0001 rdf:type metadata:Assertion ;
    metadata:source instance:Reasoner001 ;
    metadata:inputDate "2013-05-01T11:11:11"^^xsd:dateTime .
instance:Assertion0002 rdf:type metadata:Assertion ;
    metadata:source instance:Reasoner001 ;
    metadata:inputDate "2013-05-01T11:11:11"^^xsd:dateTime .
instance:Reasoner001 rdf:type metadata:Reasoner .
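In use, the graph identifier recovered from the data repository doubles as the subject key in the metadata repository. A minimal sketch with rdflib follows (namespace URIs are assumed; an AllegroGraph store would be queried the same way over SPARQL).

from rdflib import ConjunctiveGraph, Graph

data = ConjunctiveGraph()   # fact store: one named graph (context) per assertion
meta = Graph()              # metadata store, populated at assertion time
# ... both repositories would be loaded here ...

# Step 1: find which named graph holds the fact of interest.
fact_graphs = data.query("""
PREFIX fs: <http://parfor.example/fs#>
SELECT ?g WHERE { GRAPH ?g { ?file fs:name "The Mighty Boosh Episode 1" } }""")

# Step 2: the graph ID keys the provenance record in the second repository.
for row in fact_graphs:
    provenance = meta.query("""
    PREFIX metadata: <http://parfor.example/metadata#>
    SELECT ?source ?date WHERE {
        ?assertion metadata:source ?source ;
                   metadata:inputDate ?date .
    }""", initBindings={"assertion": row.g})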

There are alternatives to using a second repository, such as the process outlined in Watkins and Nicole (2006), but separating the metadata improved the readability of the information in each repository. Another alternative considered was to use the assertion ID provided by AllegroGraph. However, this ID is not part of the RDF standard, and relying on it would move us away from the known standard and create a reliance on proprietary technologies.


Base system

The implemented base system comprises a knowledge base, data ingestors, reasoners and a visualiser. These are discussed independently.

Knowledge Base

The application uses a blackboard approach, where all communication passes through the central reasoning knowledge base. Using the RDF/RDFS standard allows great flexibility in choosing a suitable knowledge base store. ParFor uses SuRF and AllegroGraph but, as we have kept to the standards, could use almost any store (such as Sesame or Jena). SuRF is relatively immature, but provides a unified method of interacting with multiple graph databases.

Data Ingestion and Basic Inferencing

The aim of this system is not to store the content of the data itself, but rather to store interesting 'facts' that can be extracted or inferred about the data. The raw data remains available outside the system; the storage of raw bytes is not a good use-case for ontological data stores. Two types of data ingestion were considered for electronic devices: initial ingestion and additional ingestion. Initial data ingestion is currently limited to a Python script that walks loopback-mounted disks, adding files, directories and specific user account information. Additional data ingestion occurs in two forms: adding additional 'facts', and rule inferencing. Additional metadata can be retrieved from data that has already been ingested (e.g. EXIF data or image recognition). As the aim is to add knowledge to individual nodes (files, directories) or infer the existence of new nodes (user accounts), the full contents of a file may need to be analysed to collect the information. Rule inferencing is conducted through a component we wrote called SPARQLer. Written in Python, SPARQLer is a small program that asserts forward-chaining rules into a datastore. Specifically, it reads in SPARQL CONSTRUCT queries and asserts the results to the knowledge base (in our case, AllegroGraph). SPARQL (Haase, Broekstra et al. 2004, Prud'Hommeaux and Seaborne 2008) is a standard for searching RDF triple stores, and there are numerous advantages to its use. Directly running a SPARQL insert query, which would immediately insert the result of the query back into the datastore (similarly to SQL), would have been preferable, but this feature is not uniform across SPARQL knowledge stores. Instead, CONSTRUCT queries produce triples, which can then be immediately added back into the knowledge store. Whilst SPIN rules (Knublauch 2009) were an alternative, the lack of datastore support, the immaturity of the standard and their complexity compared with the more mature SPARQL were reasons to discount their inclusion. An example of a simple CONSTRUCT query is as follows:

# Name: Image File Type Creator (extension matches)
# Description: Makes image files of type ImageFileType if the extension is known.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix base: <...> .
@prefix fs: <...> .
@prefix fsdetail: <...> .

CONSTRUCT { ?file rdf:type fsdetail:ImageFileType . }
WHERE {
    ?file rdf:type fs:File .
    ?file fs:fileExtension ?extension .
    fsdetail:ImageFileType fsdetail:associatedWithExtension ?extension .
}

Alternately, SPARQLer has a plugin system that allows it to perform a query and pass the results into a plugin for additional processing, accepting back a list of assertions to be made to the system. Plugins are written in the Python programming language, but the details of their operation are beyond the scope of this paper. Relying on SPARQL alone limited the effectiveness of the system; allowing plugins to perform additional processing, such as interrogating the source file, calling out to the internet, or using additional data sources, provides a much greater level of flexibility. SPARQLer has use-cases outside digital forensic ontological data stores, mainly on systems with relatively static data. The system iterates the rules until no new facts are asserted; such iteration ensures that the order of rule execution is ultimately unimportant and that all logical deductions are made, even if the output of one rule is a prerequisite for another rule that has previously fired. A sketch of this fixpoint loop is given below. For the purposes of this work, SPARQLer is primarily used to convert assertions from one ontology into another, generally from the base 'fact' ontology into the higher 'reasoning' ontologies.
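The following is a minimal sketch of that loop, using rdflib in place of AllegroGraph. The file layout and the .rq rule extension are assumptions; the real SPARQLer also dispatches results to plugins.

from pathlib import Path
from rdflib import Graph

def run_rules(kb: Graph, rule_dir: str) -> None:
    """Apply every CONSTRUCT rule repeatedly until no new triples appear."""
    rules = [p.read_text() for p in sorted(Path(rule_dir).glob("*.rq"))]
    while True:
        before = len(kb)
        for rule in rules:
            kb += kb.query(rule).graph   # assert the constructed triples
        if len(kb) == before:            # fixpoint: no rule produced new facts
            break

Because the loop runs to a fixpoint, a rule whose antecedent depends on another rule's output simply fires on a later pass, which is why rule ordering does not matter.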

Other Reasoners

Although most reasoners could be built as SPARQLer plugins, we also built a small number of independent reasoners, called EventMappers. We currently have EventMappers for social network mapping (SocialNetworkMapper), for the computer starting up and turning off (ComputerOnMapper) and for locating evidence of clock tampering (TimingWeirdnessMapper). The SocialNetworkMapper locates Windows Messenger contact lists and emails to form social networks. The links between individuals are then weighted by the frequency of communication. We are looking at enriching the source data to include Facebook and other social media data. The ComputerOnMapper creates events for the computer turning on and off. From these, the system then 'knows' when the computer was on and off. In and of itself, this is not particularly interesting, but it is a prerequisite for the TimingWeirdnessMapper. The TimingWeirdnessMapper searches MAC times (excluding copied or moved files) for candidates indicating files were created, accessed or modified when the computer was turned off (implementing research from Chow et al. (2007)). The following examples highlight two events: "The user account BenTurnbull powered off Computer1 at 2012-09-05 09:15:13 local time" and "There is a timing inconsistency in Computer1 involving the files File-001, File-002 and File-003". This example assumes that there is no ComputerPoweredOn event for Computer1 between these two events.

instance:Event-00193847821 rdf:type event:Event ;
    event:actor instance:UserAcctBenTurnbull ;
    event:verb event:ComputerPoweredOff ;
    event:theme instance:Computer1 ;
    event:timeStart "2012-09-05T09:15:13.405+09:30" ;
    event:timeEnd "2012-09-05T09:15:23.909+09:30" .

instance:TimingInconsistency-011223 rdf:type event:Event ;
    event:actor instance:Computer1 ;
    event:verb event:TimingAnomaly ;
    event:theme instance:File-001 ;
    event:theme instance:File-002 ;
    event:theme instance:File-003 ;
    event:timeStart "2012-09-05T09:19:14.112+09:30" ;
    event:timeEnd "2012-09-05T09:28:55.254+09:30" .
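The core check inside the TimingWeirdnessMapper can be sketched in plain Python. The data shapes here are assumptions: in ParFor, the on/off intervals and MAC times would come from SPARQL queries over the knowledge base, and copied or moved files are filtered out beforehand.

from datetime import datetime
from typing import Dict, List, Tuple

Interval = Tuple[datetime, datetime]   # (powered on, powered off)

def timing_anomalies(power_on: List[Interval],
                     mac_times: Dict[str, List[datetime]]) -> Dict[str, List[datetime]]:
    """Return files with MAC timestamps falling outside every powered-on interval."""
    def while_off(t: datetime) -> bool:
        return not any(start <= t <= end for start, end in power_on)

    return {path: bad
            for path, times in mac_times.items()
            if (bad := [t for t in times if while_off(t)])}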

Visualiser

Creating novel information visualisations to represent data is certainly within the scope of the project, but is beyond the scope of this paper. The areas of information security visualisation (Marty 2008) and digital forensics visualisation (Osborne, Turnbull et al. 2012) are both active research fields. One issue noted for digital forensic information visualisation is the need for appropriate data formats and connections, so as to appropriately render information of use to the forensic investigator. As one of the primary purposes of this work is to summarise information into a form that is easily parsed, this framework is a suitable backend for visualisers to build on. The visualiser system currently implemented is hardcoded to the ontology used. From an implementation perspective, it is just another reasoner, albeit one that queries the knowledge base without asserting new knowledge itself. A Python module performs SPARQL queries as necessary and converts the outputs to JSON, which is read by the web-based visualiser, written in D3.js (Bostock, Ogievetsky et al. 2011); a sketch of the conversion is given below. The included visualisation system is merely a barebones implementation. There is much work to be done integrating a complete information visualisation system based on the inferences generated from this system. It is envisaged that there are many research opportunities in this field.
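A sketch of that conversion for the social network view follows. The person:from/person:to/person:weight predicates are invented for illustration, not drawn from the published ontology; the node/link JSON shape matches what D3 force layouts typically consume.

import json
from rdflib import Graph

g = Graph()
g.parse("high_ontology.ttl", format="turtle")   # hypothetical dump of inferences

rows = list(g.query("""
PREFIX person: <http://parfor.example/person#>
SELECT ?a ?b ?weight WHERE {
    ?link person:from ?a ;
          person:to ?b ;
          person:weight ?weight .     # communication frequency
}"""))

# Build the D3-style document: one node per person, one weighted link per pair.
nodes = sorted({str(r.a) for r in rows} | {str(r.b) for r in rows})
index = {n: i for i, n in enumerate(nodes)}
graph_json = json.dumps({
    "nodes": [{"id": n} for n in nodes],
    "links": [{"source": index[str(r.a)], "target": index[str(r.b)],
               "value": float(r.weight)} for r in rows],
})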

Lessons Learned and Next Phases

The current maturity of RDF and associated ontological platforms was constantly tested throughout development. There are some great new technologies coming through, but they do not directly map to our use-case. For example, there is no simple, elegant way to add automatic forward-chaining rules, and we had to build this capability ourselves. Other ontological systems, such as Cycorp's Cyc, have some of these niceties. However, the sheer volume of data, the use of a different ontology (i.e. not the integrated Cyc ontology), Cyc's lack of community support, the non-standard nature of its ontology and performance issues are some of the reasons Cyc was not used for this project (Murray, Grove et al. 2013). The authors do not regret the choice of RDF, but note that the environment is not a complete solution in and of itself. This system has only been subjected to minimal testing and has been presented as a concept demonstrator to a small number of forensic analysts. This has provided anecdotal evidence, but additional research, specifically interviewing forensic analysts and individuals who may use this
software, is needed to confirm these observations. We found that the ability for analysts to understand how inferences were created was a prerequisite for this software's use, and that analysts must have the ability to refute hypotheses and reasoned facts. Analysts strongly required an understanding of which reasoning rules were used to create inferences, and of the origin files that were the source of those inferences, in order to understand, and be able to explain to others, how and why an inference was made. The other requirement analysts had was the ability to refute a hypothesis or reasoning chain. We will need to add mechanisms to easily refute hypotheses, with the effects remaining persistent and cascading immediately. This is not a simple change, and would require some in-depth research into ontological representation in general.

Future Work

The next phases of this work are two-fold: the technical and the research. On the technical side, several areas can be improved to add ease of use and breadth of use. From a research perspective, there are opportunities to use ParFor as a framework on which to build further research.

From a technical perspective, SPARQLer is operational, but there are opportunities to improve its performance and capability, and this will be a focus of future technical work. There are also technical opportunities to extend the capability of different triple stores. Capabilities such as backward-chaining reasoning and dynamically updated rulesets exist in other systems, but not for RDF. The dynamic realisation properties found in OWL (McGuinness and Van Harmelen 2004) may provide some insight here, but would require significant reengineering.

From a research perspective, now that data can be entered in a consistent, orthogonal manner, there are opportunities to experiment with different forms of reasoning. Similarly, multiple forms of visualisation can now be built relatively quickly, and this will be an area of future research. Finally, the development of multiple data ingestion mechanisms, and the fusion techniques they will require, remains a research challenge in this area.

From a practical perspective, we are also looking to align our event representation with that of Hargreaves and Patterson (2012), either internally or through an exporting mechanism. Although the fundamental components align, we require compatibility with our existing systems and must maintain the rigour provided by the Post-Davidsonian representation. Interoperability is important, although we require RDF as a means of providing simple extensibility and interoperability with existing projects. There is also an opportunity to further leverage Digital Forensics XML tools (Garfinkel 2012) in ontology-based systems; translation and context are issues that would need to be resolved. However, a common base of machine-readable, standardised languages may provide a baseline for communication and translation.
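As a sketch of the translation question raised in the final paragraph above (assuming DFXML's fileobject records, and using a hypothetical artefact namespace rather than our published vocabulary), DFXML file objects could be lifted into RDF along these lines:

import xml.etree.ElementTree as ET
import rdflib
from rdflib import Literal, Namespace, RDF

# Hypothetical artefact ontology namespace, for illustration only.
ARTEFACT = Namespace("http://example.org/artefact#")

def dfxml_to_rdf(dfxml_path: str) -> rdflib.Graph:
    """Translate DFXML <fileobject> records into file-artefact triples."""
    g = rdflib.Graph()
    g.bind("artefact", ARTEFACT)
    tree = ET.parse(dfxml_path)
    # DFXML tools differ in XML namespace usage, so match on local names.
    fileobjects = (e for e in tree.iter() if e.tag.endswith("fileobject"))
    for i, fobj in enumerate(fileobjects):
        subject = ARTEFACT[f"File-{i:06d}"]
        g.add((subject, RDF.type, ARTEFACT.File))
        for child in fobj:
            local = child.tag.split("}")[-1]  # strip any XML namespace
            if local in ("filename", "filesize", "mtime", "crtime") and child.text:
                g.add((subject, ARTEFACT[local], Literal(child.text.strip())))
    return g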


Conclusion

This paper outlines the development of multiple orthogonal ontologies for digital forensics, capturing relationships from low-level artefacts through to high-level connections between individuals. The aim was to provide an open, extensible series of ontologies operating at multiple levels of abstraction, which others can extend as necessary. This work also provides the basis for rule-based data fusion from multiple data sources into a consistent knowledge representation, and expands upon this with a forward-chaining, rule-based reasoning system for RDF-based data stores. The reasoning system is generic, although we use it exclusively for building reasoning chains over our digital forensic ontology.

The most advanced research component developed in this work is the event model used to abstract computer-specific paradigms (files, directories, etc.) into a higher abstraction suitable for investigators (events such as 'Bob went on the Internet from 11:20 to 11:40'). There is much more work to be done in abstracting technical notions away from the end user in such a way that the original information is retained where necessary; this work represents a first step in that regard.

This research has a theoretical underpinning, but has been implemented in a way that ensures its potential usefulness to the community. There are multiple ways to expand this work, both academically and through the developed implementation.

References

Barnum, S. (2010). "The Balance of Secure Development and Secure Operations in the Software Security Equation." Crosstalk - The Journal of Defense Software Engineering, September/October 2010.
Batten, L. M. and L. Pan (2011). Using relationship-building in event profiling for digital forensic investigations. Forensics in Telecommunications, Information, and Multimedia, Springer: 40-52.
Bayer, S. L. (1997). "Confessions of a Lapsed Neo-Davidsonian: events and arguments in compositional semantics." Routledge.
Bostock, M., V. Ogievetsky and J. Heer (2011). "D3: Data-Driven Documents." IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis).
Brinson, A., A. Robinson and M. Rogers (2006). "A cyber forensics ontology: Creating a new approach to studying cyber forensics." Digital Investigation 3: 37-43.
Carrier, B. (2003). "Defining digital forensic examination and analysis tools using abstraction layers." International Journal of Digital Evidence 1(4): 1-12.
Carrier, B. and E. H. Spafford (2004). "An event-based digital forensic investigation framework." Digital Forensic Research Workshop.
Case, A., A. Cristina, L. Marziale, G. G. Richard and V. Roussev (2008). "FACE: Automated digital evidence discovery and correlation." Digital Investigation 5: S65-S75.
Chen, K., A. Clark, O. De Vel and G. Mohay (2003). "ECF - event correlation for forensics."
Chow, K. P., F. Y. Law, M. Y. Kwan and P. K. Lai (2007). The rules of time on NTFS file system. Second International Workshop on Systematic Approaches to Digital Forensic Engineering (SADFE), IEEE: 71-85.


Ćosić, J., Z. Ćosić and M. Baća (2011). "An Ontological Approach to Study and Manage Digital Chain of Custody of Digital Evidence." Journal of Information and Organizational Sciences 35(1): 1-13.
Cycorp Inc. (2002). "Foundations of Knowledge Representation in Cyc." Retrieved 8 Feb, 2012, from http://www.cyc.com/doc/tut/DnLoad/CollectionsIndividuals.pdf.
Dampier, D., K. Blaylock and R. McGrew (2012). Digital Forensics Workforce Training for Wounded Warriors. Proceedings of the 2012 ASEE-SE Conference. Mississippi, USA: April 1-3, 2012.
Davidson, D. (1980). "Mental events." Readings in Philosophy of Psychology 1.
Denker, G., L. Kagal and T. Finin (2005). "Security in the Semantic Web using OWL." Information Security Technical Report 10(1): 51-58.
Garfinkel, S. (2012). "Digital forensics XML and the DFXML toolset." Digital Investigation 8(3): 161-174.
Garfinkel, S. L. (2009). Automating disk forensic processing with SleuthKit, XML and Python. Systematic Approaches to Digital Forensic Engineering, 2009. SADFE '09. Fourth International IEEE Workshop on, IEEE.
Gomez-Perez, A., O. Corcho-Garcia and M. Fernandez-Lopez (2004). Ontological Engineering, Springer.
Google Inc. (2012). "Introducing the Knowledge Graph: things, not strings." Retrieved June 12, 2012, from http://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html.
Haase, P., J. Broekstra, A. Eberhart and R. Volz (2004). A comparison of RDF query languages. The Semantic Web - ISWC 2004: 502-517.
Hargreaves, C. and J. Patterson (2012). "An automated timeline reconstruction approach for digital forensic investigations." Digital Investigation 9: S69-S79.
Hornstein, N. (2002). A grammatical argument for a neo-Davidsonian semantics. Logical Form and Language: 345-364.
Kahvedžić, D. and T. Kechadi (2009). "DIALOG: A framework for modeling, analysis and reuse of digital forensic knowledge." Digital Investigation 6: 23-33.
Khan, M., C. R. Chatwin and R. C. Young (2007). "A framework for post-event timeline reconstruction using neural networks." Digital Investigation 4(3): 146-157.
Kim, J. (1969). Events and their descriptions: some considerations. Essays in Honor of Carl G. Hempel. Netherlands, Springer: 198-215.
Klyne, G., J. J. Carroll and B. McBride (2004). Resource Description Framework (RDF): Concepts and abstract syntax. W3C Recommendation 10.
Knublauch, H. (2009). "SPIN - SPARQL Inferencing Notation." Retrieved Feb 8, 2013, from http://spinrdf.org/.
Lenat, D. and R. V. Guha (1989). Building Large Knowledge-Based Systems; Representation and Inference in the Cyc Project. Reading, Massachusetts, Addison-Wesley Publishing.
Maienborn, C. (2007). On Davidsonian and Kimian states. Existence: Semantics and Syntax. Netherlands, Springer: 107-130.
Manola, F., E. Miller and B. McBride (2004). RDF Primer. W3C Recommendation 10, W3C: 1-107.
Marrington, A. D., G. M. Mohay, A. J. Clark and H. L. Morarji (2007). "Event-based computer profiling for the forensic reconstruction of computer activity."
Marty, R. (2008). Applied Security Visualization, Addison Wesley Professional.
McBride, B. (2002). "Jena: A semantic web toolkit." IEEE Internet Computing 6(6): 55-59.
McGuinness, D. L. and F. Van Harmelen (2004). OWL Web Ontology Language Overview. W3C Recommendation, W3C.
Mitre Corporation. (2015). "CybOX - Cyber Observable eXpression, a structured language for cyber observables." From http://cybox.mitre.org/.


Murray, A., D. Grove, D. Gerhardy, B. Turnbull, T. Tobin and C. Moir (2013). An Overview of the Parallax BattleMind v1.5 for Computer Network Defence. Australasian Information Security Conference. Adelaide, South Australia.
Osborne, G., B. Turnbull and J. Slay (2012). Development of InfoVis Software for Digital Forensics. Computer Software and Applications Conference Workshops (COMPSACW), 2012 IEEE 36th Annual, IEEE.
Parkin, S. E., A. van Moorsel and R. Coles (2009). An information security ontology incorporating human-behavioural implications. Proceedings of the 2nd International Conference on Security of Information and Networks, ACM.
Parsons, T. (1995). "Thematic relations and arguments." Linguistic Inquiry: 635-662.
Pollitt, M. M. (2013). "Triage: A practical solution or admission of failure." Digital Investigation.
Prud'Hommeaux, E. and A. Seaborne (2008). SPARQL Query Language for RDF. W3C Recommendation 15.
Raimond, Y. and S. Abdallah (2007). The Event Ontology. Technical report.
Raimond, Y., S. Abdallah, M. Sandler and F. Giasson (2007). The Music Ontology. International Conference on Music Information Retrieval: 417-422.
Rosen, S. T. (1999). "The syntactic representation of linguistic events." Glot International 4(2): 3-11.
Schaller, R. R. (1997). "Moore's law: past, present and future." IEEE Spectrum 34(6): 52-59.
Schatz, B., G. Mohay and A. Clark (2004). Generalising event forensics across multiple domains. School of Computer Networks Information and Forensics Conference. Perth, Western Australia, Edith Cowan University.
Schmidt, M., D. Dampier and D. Guster (2009). A Multi-University Resource Allocation Approach to Provide Computer Forensics Education to Law Enforcement Agents. Proceedings of the 8th Annual Security Conference. Las Vegas, NV, USA.
Shaw, R., R. Troncy and L. Hardman (2009). LODE: Linking open descriptions of events. The Semantic Web. Berlin Heidelberg, Springer: 153-167.
Symonenko, S., E. D. Liddy, O. Yilmazel, R. Del Zoppo, E. Brown and M. Downey (2004). Semantic analysis for monitoring insider threats. Intelligence and Security Informatics, Springer: 492-500.
Takahashi, T., Y. Kadobayashi and H. Fujiwara (2010). Ontological approach toward cybersecurity in cloud computing. Proceedings of the 3rd International Conference on Security of Information and Networks, ACM.
Teh, A. and A. Stewart (2012). Human-Readable Real-Time Classifications of Malicious Executables. Proceedings of the 10th Australian Information Security Management Conference. Perth, Western Australia.
Tse, H., K.-P. Chow and M. Kwan (2012). Reasoning about Evidence using Bayesian Networks. Advances in Digital Forensics VIII, Springer: 99-113.
Walter, C. (2005). "Kryder's law." Scientific American 293(2): 32-33.
Watkins, E. and D. Nicole (2006). Named graphs as a mechanism for reasoning about provenance. Frontiers of WWW Research and Development, APWeb 2006: 943-948.
Wood, D. (2010). Linking Enterprise Data, 1st Edition, Springer.

Relevant Websites
i. https://code.google.com/p/surfrdf/
ii. http://www.franz.com/agraph/allegrograph/