2016 2nd International Workshop on BIG Data Software Engineering

Predicting and Fixing Vulnerabilities Before They Occur: A Big Data Approach

Hong-Mei Chen, Rick Kazman
University of Hawaii
Honolulu, HI, USA
{hmchen,kazman}@hawaii.edu

Ira Monarch
Independent Consultant
Pittsburgh, PA, USA
[email protected]

Ping Wang
University of Maryland
College Park, MD, USA
[email protected]

ABSTRACT

The number and variety of cyber-attacks is rapidly increasing, and the rate of new software vulnerabilities is also rising dramatically. The cybersecurity community typically reacts to attacks after they occur. Being reactive is costly and can be fatal where attacks threaten lives, important data, or mission success. Taking a proactive approach, we are: (I) identifying potential attacks before they come to fruition and, based on this identification, (II) developing preventive counter-measures. We describe a Proactive Cybersecurity System (PCS), a layered, modular service platform that applies big data collection and processing tools to a wide variety of unstructured data sources to identify potential attacks and develop counter-measures. The PCS provides security analysts a holistic, proactive, and systematic approach to cybersecurity. Here we describe our research vision and our progress towards that vision.

Keywords

Software security, architecture analysis, concept clustering

1.   INTRODUCTION

In May 2005, the first edition of Secure Coding in C and C++ cautioned about "referencing freed memory" [16]. In 2007, researchers from Watchfire reported a "Dangling Pointer" vulnerability in Microsoft IIS [1], and a Blackhat conference talk reported one of the first exploits of what became known as Use-After-Free (UAF) [9]. Blogs and tutorials began to appear around 2010. Figure 1 shows the number of common vulnerabilities and exposures (CVEs) reported for UAF, by year. Clearly the offensive hacker community learned about UAFs, and just as clearly it took time from the discovery of this class of vulnerability until it became a true threat to the "white hat" community. This time lag—during which hackers are gaining expertise and planning exploits—represents an opportunity for proactive counter-measures. But such counter-measures can only be applied if the potential threat is identified early enough.

Figure 1. Use-After-Free CVEs, by year (2006-2015)

The number and variety of cyber-attacks is increasing, and the rate of new software vulnerabilities is rising dramatically: "The compound annual growth rate (CAGR) of detected security incidents has increased 66% year-over-year since 2009" [14]. But the software security community typically reacts to attacks after they occur. Being reactive is costly and can be fatal where attacks threaten lives, important data, or mission success. To address this risk, we are developing a Proactive Cybersecurity System, based on a wide variety of unstructured big data sources, to achieve two goals:

•   Goal I: identify potential attacks before they take place and cause harm, and, based on this identification,
•   Goal II: develop preventive counter-measures.

2.   A Proactive Cybersecurity System

To achieve Goal I, we are building a Targeted Vulnerability Prediction (TVP) subsystem to detect, from hackers' ad hoc communities and publicly available security sources, the emerging concepts that are the early warning signs of likely vulnerability targets. Specifically, we are mining publicly available vulnerability, exploit, and attack databases such as CVE (cve.mitre.org), CVE Details (cvedetails.com), and the Web Application Security Consortium's Web Hacking Incidents Database (WHID) Project, hosted by the Open Web Application Security Project (OWASP) (https://www.owasp.org/index.php/OWASP_WASC_Web_Hacking_Incidents_Database_Project), to determine prominent concepts. In addition, we are mining hacker discussion forums, blogs, and Internet Relay Chat (IRC) channels (e.g. freenode.net, AnonOps IRC, Metasploit IRC, Google Project Zero, blackhat.com, GMANE.org, seclists.org) to identify emerging concepts. These are largely unstructured sources of big data, which pose challenges for manipulation and interpretation.
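To make this mining concrete, the following Python sketch tallies, per year, the CVE entries whose descriptions mention a given keyword. It is a minimal illustration rather than the TVP implementation; the file name and column layout are assumptions about a locally downloaded CVE dump.

```python
import csv
import re
from collections import Counter

def yearly_mentions(path, keyword):
    """Count CVE entries per year whose description mentions a keyword.

    Assumes a locally downloaded CVE dump in CSV form with 'Name'
    (e.g. CVE-2014-0160) and 'Description' columns; real dumps vary.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    counts = Counter()
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.DictReader(f):
            if pattern.search(row.get("Description") or ""):
                year = re.match(r"CVE-(\d{4})-", row.get("Name") or "")
                if year:
                    counts[int(year.group(1))] += 1
    return dict(sorted(counts.items()))

# yearly_mentions("cve_dump.csv", "use-after-free") would reproduce
# per-year counts like those plotted in Figure 1.
```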

We are using concept clustering on this unstructured big data to determine changes in prominent concepts, some of which will represent new forms of exploit vectors. In parallel, we are also building and evolving an ontology to track these changes at several levels of sub-categorization. Linguistic, statistical, computational, and hybrid techniques will be employed to manage the extraction and evolution of the ontology.

As not all concepts are equally valuable, a key challenge of this research is to prioritize the large number of concepts identified, so that the attention of an analyst can be appropriately guided. Our focus is to prioritize vulnerabilities that are likely targets of attacks. Ecology theory helps discover patterns resulting from hackers choosing to join certain communities, but not others [10]. Utilizing ecology analysis methods, we will be able to predict the rate of entry to the hacker community associated with each emerging concept, and hence the trajectory or momentum of each concept. Using a text mining technique—sentiment analysis—we will also be able to associate rates of entry with different sentiments, thus enhancing our understanding of concept trajectory and momentum.
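As a rough illustration of how entry rates and sentiment might combine into a single momentum score (the ecology and sentiment models themselves are still under development), consider the following sketch; the record format and the weighting are assumptions:

```python
from collections import defaultdict

def concept_momentum(posts):
    """Score each concept by community entry rate, weighted by sentiment.

    posts: iterable of (period, author, concept, sentiment) tuples, with
    sentiment in [-1, 1]. The schema and the weighting are illustrative
    assumptions, not the PCS models.
    """
    seen = defaultdict(set)                          # concept -> known authors
    entries = defaultdict(lambda: defaultdict(int))  # concept -> period -> new authors
    sentiments = defaultdict(list)
    for period, author, concept, sentiment in sorted(posts):
        if author not in seen[concept]:              # first appearance = "entry"
            seen[concept].add(author)
            entries[concept][period] += 1
        sentiments[concept].append(sentiment)
    momentum = {}
    for concept, per_period in entries.items():
        periods = sorted(per_period)
        if len(periods) < 2:
            continue                                 # no trajectory yet
        growth = per_period[periods[-1]] - per_period[periods[0]]
        mean_sentiment = sum(sentiments[concept]) / len(sentiments[concept])
        momentum[concept] = growth * (1 + mean_sentiment)
    return momentum
```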

To achieve Goal II, we are developing (a) an Architectural Vulnerability Detection (AVD) subsystem and (b) a Risk Analysis and Recommender (RAR) subsystem. AVD will add the further capability of predicting the impact, on a system's architecture, of the attack vectors identified by the TVP subsystem. RAR will analyze the risks associated with identified vulnerabilities, estimate the costs of mitigation actions, and recommend refactoring and assurance strategies. Initial research results [15] have shown the promise of an architecture-centric approach to cybersecurity.

The TVP, AVD, and RAR subsystems constitute the Proactive Cybersecurity System (PCS), as shown in Figure 2: a modular service platform that combines data sources, data collection and processing tools, metrics, and models. The PCS provides security analysts a holistic, proactive, and systematic approach to cybersecurity. PCS is holistic because it draws on huge amounts of data across all the known vulnerability and exploit databases, and mines publicly known hacker communities. This allows us to characterize the limits of any one data source, thus avoiding bias. This data helps us to prioritize and develop counter-measures that take into consideration extrinsic and intrinsic properties regarding the rate at which vulnerabilities are exploited, putting us in a position to gauge the "honeymoon effect" [5]—the time between software being released and the discovery of a vulnerability.

To realize Goal I, we are (1) identifying data sources; (2) collecting and managing data; (3) identifying emerging concepts; (4) tracking concept evolution; and (5) prioritizing vulnerabilities. To achieve Goal II, we are (6) developing counter-measures. We now detail the key challenges for these research activities.

3.   Data Sources

There are many potential information sources available. We have identified two main types of data sources containing information that can help us identify emerging concepts pointing to vulnerabilities that are likely to be targeted. Assessing the data sources is a critical activity, as the subsequent analysis and proactive measures rely on the comprehensiveness and reliability of the data. We have selected public security databases and hacker communities as our main types of information source. We now describe examples of each type, show the kinds of information they provide, and discuss the challenges associated with each.

Both types of sources discuss vulnerabilities, PoC (Proof of Concept) exploits, and attacks. And there are time delays, both between the identification of vulnerabilities and the production of PoC exploits, and between PoC exploits and the corresponding hostile attacks targeting those vulnerabilities [5][16]. There are also important differences between the two types of data sources. By collecting data from multiple sources we can assemble information about different aspects of the same exploits. Such differences can be combined for a better understanding of the conditions responsible for the delays between vulnerabilities and exploits; these delays are the basis for the PCS.

Another important difference is that most of the public security databases do not provide information about who contributed an entry to the database. Hacker sources, in contrast, do typically identify who is making a contribution. In many cases, however, the names provided are fanciful, and an individual may not use the same designation in different chats, lists, and database contributions. We now describe our two types of (big data) sources and the challenges in making use of each.

3.1   Hacker Communities

Hackers form communities. Some of the hackers' blogs, software repositories, IRC channels, etc. can be found on the internet. These are learning communities and innovation communities, no different from those of entrepreneurs, venture capitalists, researchers, and even terrorist organizations. This is why they are successful and why we fear them. But mounting a successful attack requires tremendous resources and patience. Hacker communities, as with all innovation communities, need to share information to be effective; they build on each other's work and discourse, sometimes directly but more often indirectly [17].

By analyzing the topics in hackers' discussions, we will be able to get early indications as to which vulnerabilities are likely to be the focus of upcoming attacks. Early insight can lead to early quality assurance and mitigation strategies. In this way the software security community can be proactive in detecting and eliminating vulnerabilities, rather than simply reacting to attacks as they occur. For example, the Heartbleed bug was discovered almost simultaneously by (defensive) security researchers at Google and at Codenomicon, avoiding the potentially huge losses had hackers found this bug first (in April 2014 more than 2/3 of the world's web servers were vulnerable to Heartbleed).

3.2   Public Security Databases

Publicly accessible databases are maintained by various organizations. MITRE's CVE database collects vulnerability, exploit, and attack information. A related project—CVE Details—identifies vulnerability and corresponding exploit types, over which advanced searching can be done. This data can be tabulated to show frequencies of vulnerability or exploit types on a yearly basis, as well as the frequency of exploits across all types, also on a yearly basis. We have already explored some of this data (for example, Figure 1), and patterns have emerged. In most cases, there are spikes in the number of recognized vulnerabilities. In some cases, the changes from one year to another can be as great as 1000 occurrences.

Similar variations and spikes in frequency are seen in exploit data, both PoC and hostile. This information can be mined from CVE, CVE Details, the exploit databases, and the WHID. As with vulnerabilities, the frequency of occurrence of PoC exploits and attacks changes over time. Determining the root causes of such patterns, particularly the spikes, is one of our research goals.
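A simple year-over-year comparison is enough to surface such spikes automatically. The sketch below shows the idea; the threshold is illustrative:

```python
def find_spikes(yearly_counts, threshold=1000):
    """Return (year, jump) pairs where counts rose sharply year-over-year.

    yearly_counts: {year: count}. The threshold echoes the ~1000-entry
    jumps noted above and is an illustrative cutoff, not a PCS parameter.
    """
    years = sorted(yearly_counts)
    return [(year, yearly_counts[year] - yearly_counts[prev])
            for prev, year in zip(years, years[1:])
            if yearly_counts[year] - yearly_counts[prev] >= threshold]

# find_spikes({2012: 300, 2013: 1450, 2014: 1500}) -> [(2013, 1150)]
```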

Some attacks come prior to PoC exploits and close in time to the discovery of a vulnerability, perhaps even before its discovery. In such cases the initial attacks would not be preventable. Even if we do not find requisite events occurring before all attacks, the events we do find will enable us to predict spikes in attacks that we can mitigate. In addition to keeping track of instances of vulnerabilities, exploits, and attacks, we also have to keep track of which category in the ontology they are instances of. We may find, for example, that while instances are increasing in a high-level category, they are only increasing in certain subcategories and not others. It is these specific increasing subcategories that provide a basis for risk determination and for mitigation strategies.

There are several challenges here. Exploit databases typically have much more extensive coverage of exploits than the CVE Details website. Also, the WHID's collection of attack instances is much smaller than the true number, since organizations are often reluctant to acknowledge that they have been attacked. Because of such discrepancies among the data sources, our analyses will not treat any one source as definitive. We will instead triangulate over several data sources and generate a confidence score for our predictions, depending on the extent to which the trends discovered in multiple data sources are compatible.
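As a placeholder for that scoring model, the sketch below derives a confidence score from the agreement among per-source trend signals; the signal encoding is an assumption:

```python
def trend_confidence(trends):
    """Triangulate trend signals from several sources.

    trends: {source: +1 rising, 0 flat, -1 falling}. Returns the majority
    direction and the fraction of sources agreeing with it. A stand-in
    for the planned, more nuanced scoring.
    """
    if not trends:
        return 0, 0.0
    total = sum(trends.values())
    direction = (total > 0) - (total < 0)       # sign of the consensus
    agreeing = sum(1 for t in trends.values() if t == direction)
    return direction, agreeing / len(trends)

# trend_confidence({"CVE": +1, "CVE Details": +1, "WHID": 0})
# -> (1, 0.666...): rising, with two of three sources agreeing
```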

3.3   Collecting and Managing Data

Collecting and managing this big, unstructured data presents significant challenges. Quantifying instances of vulnerabilities and exploits is currently done through numerous manual searches, laboriously selecting and counting entries. One of our goals is to automate this process as much as possible. We will utilize web spider technology to collect data from hacker forums. Also, we have experience in network evolution visualization and have successfully developed web-scraping and crowdsourcing tools, which will form the core modules for data collection and management. Large volumes and different varieties of data will have to be collected from the two main sources, then ingested, stored, and prepared for analysis. A big data repository is thus planned for storing the raw data, allowing "schema on read" for different types of analysis.
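The sketch below illustrates the schema-on-read idea: raw records are appended verbatim with minimal provenance metadata, and all parsing is deferred to analysis time. The JSON-lines layout is an assumption, not a committed design:

```python
import json
import time

def ingest(raw_text, source, repo_path="raw_repository.jsonl"):
    """Append one raw record, unparsed, with minimal provenance metadata.

    Keeping the payload verbatim and deferring all structure to query
    time is the essence of "schema on read". The field names and file
    layout here are illustrative.
    """
    record = {
        "source": source,        # e.g. "WHID", "seclists.org"
        "fetched_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "raw": raw_text,         # stored untouched; parsed later, per analysis
    }
    with open(repo_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```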

4.   Identifying and Tracking Concepts

Accurately identifying emerging concepts is another major challenge of this research. To address the inherent complexity of the data collected, we are employing concept clustering, sentiment analysis, and text mining techniques to identify: 1) emerging concepts against the background of longer-lasting ones; and 2) emerging hacker communities associated with those emerging concepts. Because of the huge amount of data involved, manual curation will not, in general, be possible, so the PCS needs to aid and guide a human analyst who will make the final interpretation and the decision to develop counter-measures.

For 1), we will apply concept clustering, text mining, and ontological analysis to identify and track concepts. Gathering cluster analysis and text mining results for inclusion in an evolving ontology is, at the moment, typically done manually, but we plan to automate as much of this as possible. For 2), we will: a) elaborate the evolution of hacker communities by analyzing their networks, as sketched below; and b) determine which emerging concepts are not only likely vulnerabilities but also likely targets of attacks, and hence worthy of a human's attention.
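For a), a minimal sketch of such network analysis, using the networkx library over a hypothetical who-replies-to-whom log, might look as follows; comparing snapshots across time windows exposes community growth and shifts in central actors:

```python
import networkx as nx

def community_snapshot(interactions):
    """Summarize one time window of a hacker community's interaction graph.

    interactions: (author, replied_to) pairs from a forum or IRC log (a
    hypothetical input format). Comparing successive snapshots reveals
    growth, densification, and shifts in who is central.
    """
    graph = nx.Graph()
    graph.add_edges_from(interactions)
    centrality = nx.degree_centrality(graph)
    most_central = sorted(centrality, key=centrality.get, reverse=True)[:5]
    return {
        "members": graph.number_of_nodes(),
        "density": nx.density(graph),
        "central_actors": most_central,
    }
```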

4.1   Concept Clustering

For concept clustering we are employing Leximancer (leximancer.com), as we have had good experience with this tool in prior research (e.g. [11]). Leximancer produces concept maps that show relationships among the most significant concepts used in a text collection. It enables rapid analysis of thousands of text entries in records like those collected in Gmane or the CVE List, but also allows modulating the results through researcher intervention and interpretation. Words and phrases are clustered automatically into affinity groups, each represented by a single term called a concept, and the concepts in turn are clustered automatically into higher-level abstractions called themes. The affinity group for a concept includes the terms whose usage in the text collection is more similar to one another than to terms not in the group.

Concepts are named by the term in the affinity group that has the highest total similarity score with the other terms in the group. Concepts belonging to the same theme are more similar in usage than concepts not belonging to that theme. As an example, the concept map from an automated analysis of the entire CVE List, circa August 2015, is shown in Figure 3.

Figure 3: CVE Concept Map

This map displays concepts laid out in a two-dimensional space where proximity represents similarity of usage. Clusters of similar concepts are enclosed in circles; these clusters are prominent themes. For example, DLL Hijacking belongs to a theme called "local" at the lower right of the map. Many of the concepts in its vicinity come from statements repeated in CVE records concerning DLL Hijacking, namely attacks using a Trojan Horse that allow local users to gain privileges.
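Leximancer itself is proprietary, but the underlying idea of grouping terms into affinity groups by usage similarity can be sketched with standard tools. The example below is our simplification, not Leximancer's algorithm: terms are clustered by the similarity of their TF-IDF profiles across documents.

```python
from collections import defaultdict

from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

def term_affinity_groups(documents, n_groups=10):
    """Cluster terms into affinity groups by usage similarity.

    Each term is profiled by its TF-IDF weights across documents; terms
    with similar profiles (cosine distance) land in the same group. This
    mimics the spirit, not the detail, of Leximancer's concept extraction.
    """
    vectorizer = TfidfVectorizer(stop_words="english", max_features=500)
    term_profiles = vectorizer.fit_transform(documents).T.toarray()
    tree = linkage(term_profiles, method="average", metric="cosine")
    labels = fcluster(tree, t=n_groups, criterion="maxclust")
    groups = defaultdict(list)
    for term, label in zip(vectorizer.get_feature_names_out(), labels):
        groups[label].append(term)
    return list(groups.values())
```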

4.2   Tracking Concept Evolution

Tracking the evolution of concepts and sub-communities, and the sentiments associated with them, as discovered in earlier parts of the PCS process, presents an additional challenge. We iteratively perform three interrelated processes to mine concepts and track their changes (a simple sketch of such tracking follows the list below). The concepts to be mined and tracked cover the conditions leading to the identification of vulnerabilities and exploits (both non-hostile and hostile), along with a characterization of the vulnerabilities and exploits themselves and their classification. The characterization differentiates, and the classification relates, the individuals, groups, communities, and organizations; the systems and applications; and the processes, methods, and techniques involved. The processes are:

1) mining security data sources using phrasal parsing, automated terminology construction, statistical analysis, and clustering to determine the most salient concepts [3] in the corpora being analyzed, and tracking their changes through time;

2) mapping the relationships among these concepts and tracking their changes through time, to generate a series of maps of changing networks of the most relevant and important concepts; and

3) building a security ontology ([12][7])—the Emergent Vulnerabilities and Exploits Ontology (EVEO)—based on the results of 1) and 2), which will help guide the construction and tracking of emerging concepts.
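A drastically simplified sketch of process 1), tracking concept salience through time and flagging surges against each concept's own history, is given below; the windowing and threshold are illustrative assumptions:

```python
from collections import Counter

def emerging_concepts(windows, min_growth=3.0):
    """Flag concepts surging against their own history.

    windows: chronological list of Counter({concept: mentions}), one per
    time window. A concept is flagged when its latest frequency exceeds
    min_growth times its historical mean. Windowing and threshold are
    illustrative stand-ins for the statistical analysis described above.
    """
    latest, history = windows[-1], windows[:-1]
    flagged = {}
    for concept, count in latest.items():
        past = [w.get(concept, 0) for w in history]
        baseline = max(sum(past) / len(past), 1.0) if past else 1.0
        ratio = count / baseline
        if ratio >= min_growth:
            flagged[concept] = ratio
    return flagged

# emerging_concepts([Counter({"heap": 5}), Counter({"heap": 6, "uaf": 9})])
# -> {"uaf": 9.0}
```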

5.   Developing Countermeasures

The AVD and RAR subsystems will enable proactive measures for an organization. Currently, few proactive methods exist. It has been shown [13] that a majority of security bugs—nearly two thirds—are "foundational": they have existed for many years in a system's legacy code. Furthermore, it has been shown that there is a "honeymoon" period after the release of a system, before the identification of its first vulnerability [5]. Taken together, this suggests that: a) one cannot simply try to find all of the security bugs in a system, but rather must take a strategic, risk-driven approach to security assurance; and b) there is a short window of opportunity after a product has been released, before the hacker community discovers its vulnerabilities. To capitalize on this window of opportunity, we must be proactive and efficient in our assurance efforts. We are building on our existing work in identifying security bugs [8] to achieve this goal.
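To suggest what a strategic, risk-driven approach could look like computationally, the sketch below ranks components for proactive review by weighting exposure, the fraction of foundational code, and an exponentially closing honeymoon window. The fields and weights are our assumptions, for illustration only:

```python
import math

def assurance_priority(components, honeymoon_days=90):
    """Rank components for proactive security review.

    Each component dict supplies 'exposure' (attack surface, 0-1),
    'legacy_fraction' (share of foundational code, 0-1), and
    'days_since_release'. The exponential term models the closing
    honeymoon window [5][13]; all fields and weights are illustrative
    assumptions, not a validated risk model.
    """
    def score(c):
        window = math.exp(-c["days_since_release"] / honeymoon_days)
        return c["exposure"] * (0.5 + 0.5 * c["legacy_fraction"]) * (1 + window)
    return sorted(components, key=score, reverse=True)
```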

6.   Conclusions and Future Work

In this paper we have described our vision for a Proactive Cybersecurity System, its intellectual foundations, and the early steps that we have taken towards realizing this vision. The PCS rests on a big data infrastructure for extracting information from public data sources; transforming (cleaning) and loading the data; clustering and visualizing it; and curating it for future use.

This research, if successful, will guide quality assurance and risk mitigation activities, helping the security assurance community to be proactive rather than reactive. Security assurance personnel have, of necessity, been doing some of this already: a big part of their job is trying to predict the future, assess emerging risks, and take preventive actions. But they currently do this in an ad hoc fashion, without proper decision support and with limited data. These existing efforts will be significantly enhanced by the PCS, with its holistic, proactive, and systematic approach to cybersecurity.

7.   REFERENCES

[1]   J. Afek and A. Sharabani, "Dangling Pointer: Smashing the Pointer for Fun and Profit", http://www.orkspace.net/secdocs/Conferences/BlackHat/USA/2007/Dangling%20Pointer-paper.pdf, 2007.

[2]   L. Bass, P. Clements, and R. Kazman, Software Architecture in Practice, Addison-Wesley, 2012.

[3]   T. Bhat, J. Collard, E. Subrahmanian, R. Sriram, J. Elliot, U. Kattner, C. Campbell, and I. Monarch, "Generating Domain Ontologies Using Root- and Rule-Based Terms", NIST Information Technology Laboratory, 2015.

[4]   F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, and M. Stal, Pattern-Oriented Software Architecture, Volume 1: A System of Patterns, Wiley, 1996.

[5]   S. Clark, S. Frei, M. Blaze, and J. Smith, "Familiarity Breeds Contempt", in Proceedings of the Annual Computer Security Applications Conference (ACSAC), 2010.

[6]   P. Clements, R. Kazman, and M. Klein, Evaluating Software Architectures: Methods and Case Studies, Addison-Wesley, 2001.

[7]   A. Ekelhart, S. Fenz, M. Klemen, and E. Weippl, "Security Ontologies: Improving Quantitative Risk Analysis", in Proceedings of the 40th Annual Hawaii International Conference on System Sciences (HICSS), IEEE Press, 2007.

[8]   Q. Feng, R. Kazman, Y. Cai, R. Mo, and L. Xiao, "An Architecture-centric Approach to Security Analysis", in Proceedings of the 13th Working IEEE/IFIP Conference on Software Architecture (WICSA), April 2016.

[9]   J. Ferguson, "Understanding the Heap by Breaking It", https://www.blackhat.com/presentations/bh-usa-07/Ferguson/Presentation/bh-usa-07-ferguson.pdf, 2007.

[10]  M. Hannan and J. Freeman, Organizational Ecology, Harvard University Press, 1989.

[11]  R. Kazman, D. Goldenson, I. Monarch, W. Nichols, and G. Valetto, "Evaluating the Effects of Architectural Documentation: A Case Study of a Large Scale Open Source Project", IEEE Transactions on Software Engineering, 2016.

[12]  L. Obrst, P. Chase, and R. Markeloff, "Developing an Ontology of the Cyber Security Domain", in Proceedings of STIDS, MITRE, 49-56, 2012.

[13]  A. Ozment and S. Schechter, "Milk or Wine: Does Software Security Improve with Age?", in Proceedings of the USENIX Security Symposium, 2006.

[14]  PwC, "Managing cyber risks in an interconnected world", http://www.dol.gov/ebsa/pdf/erisaadvisorycouncil2015security3.pdf, 2015.

[15]  J. Ryoo, R. Kazman, and P. Anand, "Architectural Analysis of Security Vulnerabilities", IEEE Security and Privacy, September/October 2015.

[16]  R. Seacord, Secure Coding in C and C++, Addison-Wesley, 2005.

[17]  E. Swanson and N. Ramiller, "The Organizing Vision in Information Systems Innovation", Organization Science (8:5), 458-474, 1997.
